failed to scrape Indeed: 'NoneType' object has no attribute 'contents' #37

Closed
DannyCork opened this issue Jan 5, 2020 · 9 comments · Fixed by #90

Comments

@DannyCork

Ran
$ funnel -s /home/danny/JobFunnel/jobfunnel/config/settings.yaml

and got

jobfunnel initialized at 2020-01-05
jobfunnel indeed to pickle running @ 2020-01-05
Starting new HTTP connection (1): www.indeed.ie
http://www.indeed.ie:80 "GET /jobs?q=security&l=xxxx%2C+None&radius=25&limit=50&filter=0 HTTP/1.1" 301 366
Starting new HTTP connection (1): ie.indeed.com
http://ie.indeed.com:80 "GET /jobs?q=security&l=xxxx%2C+None&radius=25&limit=50&filter=0 HTTP/1.1" 301 None
http://ie.indeed.com:80 "GET /jobs?q=security&l=xxxx,+None&radius=25&limit=50&filter=0 HTTP/1.1" 301 0
Starting new HTTPS connection (1): ie.indeed.com
https://ie.indeed.com:443 "GET /jobs?q=security&l=xxxx,+None&radius=25&limit=50&filter=0 HTTP/1.1" 200 None
failed to scrape Indeed: 'NoneType' object has no attribute 'contents'
....
@PaulMcInnis
Owner

Looks like the indeed scraper needs updating - will get on this asap.

@PaulMcInnis PaulMcInnis self-assigned this Jan 5, 2020
@PaulMcInnis
Owner

OK, I need a bit more information.

Can you show me your settings.yaml?

@DannyCork
Author

DannyCork commented Jan 5, 2020

Thanks,
it's the same settings.yaml file:


# This is the default settings file. Do not edit.

# All paths are relative to this file.

# Paths.
output_path: 'search'

# Providers from which to search (case insensitive)
providers:
  - 'Indeed'
  - 'Monster'
  - 'GlassDoor' # This takes ~10x longer to run than the other providers

# Filters.
search_terms:
  region:
    province: ''
    city:     'xxxx'
    domain:   'ie'
    radius:   25

  keywords:
    - 'security'

# Black-listed company names
black_list:
  - 'yyyyyyyyyy'

# Logging level options are: critical, error, warning, info, debug, notset
log_level: 'debug'

@bradleykohler96
Collaborator

bradleykohler96 commented Jan 7, 2020

I believe this is because of your search_terms.
These terms are inserted into the search URL. I believe we could improve the software by validating the search_terms field before scraping.
However, with these values the failure is expected rather than a fault in the scraper.

I copied your generated URL into a web browser and got the following.
Screenshot from 2020-01-07 00-39-01

It should be possible to verify geographic locations against a list of countries and provinces/states, and to raise an error before scraping if the location is not found. Perhaps this can be added to the to-do list?
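A minimal sketch of that kind of pre-scrape check, assuming a lookup table keyed by the `domain` value from settings.yaml. `KNOWN_PROVINCES` and `validate_region` are hypothetical names, not part of JobFunnel, and the table contents are illustrative only. Note that a check like this cannot catch an unknown city unless a city list is available as well.

```python
# Illustrative data only; a real table would cover every supported country.
# An empty set means the country has no province/state level.
KNOWN_PROVINCES = {
    "ca": {"ON", "QC", "BC"},
    "ie": set(),
}


def validate_region(region):
    """Return a list of problems found in a search_terms region mapping."""
    problems = []
    domain = (region.get("domain") or "").strip().lower()
    province = (region.get("province") or "").strip()
    city = (region.get("city") or "").strip()

    if domain not in KNOWN_PROVINCES:
        problems.append(f"unknown domain {domain!r}")
    else:
        allowed = KNOWN_PROVINCES[domain]
        if allowed and province.upper() not in allowed:
            problems.append(f"unknown province {province!r} for domain {domain!r}")
        if not allowed and province:
            problems.append(f"domain {domain!r} does not use provinces")
    if not city:
        problems.append("city is empty")
    return problems
```

An empty list means the region passed these checks; anything else could be reported to the user before any request is sent.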

@DannyCork
Author

Yes indeed, Brad. So let's pick Dublin as the city:

search_terms:
  region:
    province: ''
    city:     'Dublin'
    domain:   'ie'
    radius:   25

This generates
https://ie.indeed.com/jobs?q=security&l=dublin,+None&radius=25&limit=50&filter=0

Note the +None. I believe this is due to the province being empty ('').

The URL works fine without +None:
https://ie.indeed.com/jobs?q=security&l=dublin&radius=25&limit=50&filter=0

I think logic could be added that leaves a value out of the query string when its setting is empty.
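One way this could look, as a sketch only: join the non-empty region parts before URL-encoding, so an empty province never turns into ", None". `build_location` and `build_search_url` are hypothetical helpers, not JobFunnel functions, and the `{domain}.indeed.com` host pattern is an assumption taken from the ie.indeed.com redirect in the log above.

```python
from urllib.parse import urlencode


def build_location(city, province=None):
    """Join the non-empty region parts, e.g. 'Dublin' or 'Waterloo, ON'."""
    parts = [p.strip() for p in (city, province) if p and p.strip()]
    return ", ".join(parts)


def build_search_url(domain, keywords, city, province=None, radius=25):
    """Assemble an Indeed-style search URL, skipping empty parameters."""
    params = {
        "q": " ".join(keywords),
        "l": build_location(city, province),
        "radius": radius,
        "limit": 50,
        "filter": 0,
    }
    # Drop parameters whose value is empty so they never reach the URL.
    # The integer 0 is kept: it is neither "" nor None.
    params = {k: v for k, v in params.items() if v not in ("", None)}
    return f"https://{domain}.indeed.com/jobs?{urlencode(params)}"
```

Called with the Dublin settings and an empty province, this yields a URL whose location parameter is just `l=Dublin`.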

@DannyCork DannyCork changed the title from "ailed to scrape Indeed: 'NoneType' object has no attribute 'contents'" to "failed to scrape Indeed: 'NoneType' object has no attribute 'contents'" Jan 7, 2020
@PaulMcInnis PaulMcInnis added bug and removed question labels Jan 9, 2020
@PaulMcInnis
Owner

Thanks for the investigation. It looks like we need to handle internationalization for areas without provinces.

@remidubroca

Hello there,
First of all, thanks a lot for this project!

I just hit the same issue. I worked around it (a quick and dirty fix), tested only against Indeed in French.

It seems to me that the spaces on indeed.fr are not plain regular spaces, so using ' ' in the date regular expressions does not work; I replaced it with '\s'. The expressions also need to be in French (hour=heure, day=jour, month=mois, year=année ...).
So in tools.py (lines 21 to 26) the regular expressions for French become

re.compile(r'(\d+)(?:[\s+]{1,3})?(?:heure|hr)'),
re.compile(r'(\d+)(?:[\s+]{1,3})?(?:jour|j)'),
re.compile(r'(\d+)(?:[\s+]{1,3})?mois'),
re.compile(r'(\d+)(?:[\s+]{1,3})?année'),
re.compile(r'[aA]ujourd\'hui|[pP]ubliée à l\'instant'),
re.compile(r'[hH]ier')

Maybe use a bigger date_regex table with an offset depending on the locale? Or internationalize the regexes with more alternatives, like in

re.compile(r'(\d+)(?:[\s+]{1,3})?(?:hour|heure|hr)'),
re.compile(r'(\d+)(?:[\s+]{1,3})?(?:day|d|jour|j)'),
re.compile(r'(\d+)(?:[\s+]{1,3})?(?:month|mois)'),
re.compile(r'(\d+)(?:[\s+]{1,3})?(?:year|année)'),
re.compile(r'[tT]oday|[jJ]ust [pP]osted|[aA]ujourd\'hui|[pP]ubliée à l\'instant'),
re.compile(r'[yY]esterday|[hH]ier')

For now I have only worked around it with a bigger date_regex table and an offset, quick and dirty ...
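To illustrate the multilingual idea, here is a small self-contained sketch. `AGE_PATTERNS` and `parse_age` are hypothetical names and this is not the tools.py implementation. The month and year alternatives are wrapped in a non-capturing group so the digit capture applies to both words. It relies on the fact that, for Python 3 str patterns, `\s` also matches the non-breaking space (U+00A0).

```python
import re

# Hour/day/month/year patterns accepting English and French words.
# In Python 3 str patterns, \s also matches the non-breaking space (U+00A0),
# which a literal ' ' does not.
AGE_PATTERNS = [
    (re.compile(r"(\d+)(?:[\s+]{1,3})?(?:hour|heure|hr)"), "hours"),
    (re.compile(r"(\d+)(?:[\s+]{1,3})?(?:day|d|jour|j)"), "days"),
    (re.compile(r"(\d+)(?:[\s+]{1,3})?(?:month|mois)"), "months"),
    (re.compile(r"(\d+)(?:[\s+]{1,3})?(?:year|année)"), "years"),
]


def parse_age(text):
    """Return (count, unit) for the first pattern that matches, else None."""
    for pattern, unit in AGE_PATTERNS:
        match = pattern.search(text)
        if match:
            return int(match.group(1)), unit
    return None
```

The sample strings used to try it (for example "il y a 3 jours" with a non-breaking space) are assumptions about what the site renders, not captured page content.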

Also, in indeed.py, line 133, the count of jobs is failing. I think this is the root cause of the 'NoneType' error:
num_res = int(re.findall(r'f (\d+) ', num_res.replace(',', ''))[0])
At first I also thought it was the province, but an empty province works.
The issue for French is that the thousands separator is a space, not a comma.
For now my workaround, still quick and dirty and special-cased on the locale, is
num_res = int(re.findall(r'(\d+)', num_res.replace('\s', ''))[1])
(I do not remember why the index is [1] in my workaround.)

Maybe a better solution would be to use re.sub instead of replace?
re.sub(r'\s|,','',num_res)
instead of
num_res.replace(',', '')
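A sketch of the re.sub approach, assuming the result banner reads something like "Page 1 of 1,234 jobs" or "Page 1 de 1 234 emplois" (both strings are assumptions, not captured output). `parse_result_count` is a hypothetical helper. It removes a separator only when it sits between a digit and a group of exactly three digits, so the page number and the total stay separate. One thing worth knowing: `str.replace('\s', '')` looks for a literal backslash followed by "s", so it does not strip whitespace; only `re.sub` interprets `\s`.

```python
import re

# A thousands separator: whitespace or comma between a digit and exactly
# three more digits. \s covers regular and non-breaking spaces.
THOUSANDS_SEP = re.compile(r"(?<=\d)[\s,](?=\d{3}(?!\d))")


def parse_result_count(banner):
    """Return the last number in the banner as an int, or None if absent."""
    cleaned = THOUSANDS_SEP.sub("", banner)
    numbers = re.findall(r"\d+", cleaned)
    return int(numbers[-1]) if numbers else None
```

Returning None when no number is found lets the caller report a clear error instead of failing later with an index or attribute error.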

My 2 cents on this issue
Again, thanks for this project

@tgdn

tgdn commented Mar 2, 2020

Hello all, I just wanted to know the status of this issue. Is there a fix, or is something going to be done about it?

Thank you in advance,
Thomas

@markkvdb
Collaborator

markkvdb commented Mar 2, 2020

Short answer: no.

Long answer: no, because job listing websites such as Glassdoor, Monster, etc. typically have slightly different sites depending on the country. These small differences break JobFunnel, since we scrape the job listings using tags that are language-dependent.

The solution starts with writing an abstract base that developers can inherit from to write the web scraper for a particular country. Ideally this is done in a way that makes it accessible to the many developers whose country is not yet supported. We are working on this, but remember that it is a difficult problem, since it requires us to find common patterns across all countries.
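As a rough illustration of that idea (not the design JobFunnel adopted), a base class could hold the shared parsing logic while each country subclass supplies only its locale-specific data. All class and attribute names here are hypothetical; the hosts are taken from earlier in this thread.

```python
import re
from abc import ABC, abstractmethod


class LocaleScraper(ABC):
    """Shared logic lives here; subclasses supply only locale-specific data."""

    @property
    @abstractmethod
    def domain(self):
        """Host to query, e.g. 'ie.indeed.com'."""

    @property
    @abstractmethod
    def day_words(self):
        """Words meaning 'day' in this locale's listings."""

    def parse_days(self, text):
        """Return the day count in text such as '3 days ago', or None."""
        words = "|".join(re.escape(w) for w in self.day_words)
        match = re.search(rf"(\d+)[\s+]{{0,3}}(?:{words})", text)
        return int(match.group(1)) if match else None


class IndeedIreland(LocaleScraper):
    # Plain class attributes override the abstract properties.
    domain = "ie.indeed.com"
    day_words = ("day",)


class IndeedFrance(LocaleScraper):
    domain = "indeed.fr"
    day_words = ("jour",)
```

With this shape, supporting a new country would mostly mean adding a small subclass, and instantiating the base class directly raises a TypeError because its abstract members are unset.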

6 participants