failed to scrape Indeed: 'NoneType' object has no attribute 'contents' #37

DannyCork · 2020-01-05T15:05:05Z

Ran
$ funnel -s /home/danny/JobFunnel/jobfunnel/config/settings.yaml

and got

jobfunnel initialized at 2020-01-05
jobfunnel indeed to pickle running @ 2020-01-05
Starting new HTTP connection (1): www.indeed.ie
http://www.indeed.ie:80 "GET /jobs?q=security&l=xxxx%2C+None&radius=25&limit=50&filter=0 HTTP/1.1" 301 366
Starting new HTTP connection (1): ie.indeed.com
http://ie.indeed.com:80 "GET /jobs?q=security&l=xxxx%2C+None&radius=25&limit=50&filter=0 HTTP/1.1" 301 None
http://ie.indeed.com:80 "GET /jobs?q=security&l=xxxx,+None&radius=25&limit=50&filter=0 HTTP/1.1" 301 0
Starting new HTTPS connection (1): ie.indeed.com
https://ie.indeed.com:443 "GET /jobs?q=security&l=xxxx,+None&radius=25&limit=50&filter=0 HTTP/1.1" 200 None
failed to scrape Indeed: 'NoneType' object has no attribute 'contents'
....


PaulMcInnis · 2020-01-05T19:48:05Z

Looks like the indeed scraper needs updating - will get on this asap.

PaulMcInnis · 2020-01-05T20:41:48Z

OK, I need a bit more information.

Can you show me your settings.yaml?

DannyCork · 2020-01-05T22:36:17Z

thanks,
same settings.yaml file


# This is the default settings file. Do not edit.

# All paths are relative to this file.

# Paths.
output_path: 'search'

# Providers from which to search (case insensitive)
providers:
  - 'Indeed'
  - 'Monster'
  - 'GlassDoor' # This takes ~10x longer to run than the other providers

# Filters.
search_terms:
  region:
    province: ''
    city:     'xxxx'
    domain:   'ie'
    radius:   25

  keywords:
    - 'security'

# Black-listed company names
black_list:
  - 'yyyyyyyyyy'

# Logging level options are: critical, error, warning, info, debug, notset
log_level: 'debug'

bradleykohler96 · 2020-01-07T05:43:00Z

I believe this is because of your search_terms.
These are the terms that are inserted into the URL. I believe we could improve this software by adding a verification step for the search_terms field.
That said, with these terms the software was bound to fail.

I copied your generated URL into a web browser and got the following.

It should be possible to verify geographic locations against a list of countries and provinces/states, and to raise an error before scraping if a location is not found. Perhaps this can be added to the to-do list?
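A rough sketch of such a pre-scrape check, assuming the region settings arrive as a plain dict shaped like the settings.yaml above. `PROVINCES_BY_DOMAIN` and `validate_region` are hypothetical names, and the table is an illustrative subset rather than real reference data:

```python
from typing import Dict, List, Set

# Illustrative subset only; a real table would come from a maintained dataset.
# An empty set means this sketch assumes the domain needs no province.
PROVINCES_BY_DOMAIN: Dict[str, Set[str]] = {
    "ca": {"AB", "BC", "ON", "QC"},
    "ie": set(),
}


def validate_region(region: dict) -> List[str]:
    """Return a list of problems found in a search_terms.region mapping."""
    problems: List[str] = []
    domain = str(region.get("domain") or "").lower()
    province = str(region.get("province") or "").upper()

    if domain not in PROVINCES_BY_DOMAIN:
        problems.append(f"unknown domain: {domain!r}")
        return problems

    allowed = PROVINCES_BY_DOMAIN[domain]
    # Only check the province when the domain actually lists provinces.
    if allowed and province not in allowed:
        problems.append(f"unknown province {province!r} for domain {domain!r}")

    if not str(region.get("city") or "").strip():
        problems.append("city is empty")
    return problems
```

Running this before any request is made would let the tool stop with a readable message instead of failing inside the scraper. Checking that a city really exists would need a proper place-name dataset, which this sketch does not attempt.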

DannyCork · 2020-01-07T14:53:42Z

Yes indeed, Brad. So let's pick Dublin as the city:

search_terms:
  region:
    province: ''
    city:     'Dublin'
    domain:   'ie'
    radius:   25

This generates
https://ie.indeed.com/jobs?q=security&l=dublin,+None&radius=25&limit=50&filter=0

Note the +None. I believe this is because the province is empty ('').

The url works fine without +None
https://ie.indeed.com/jobs?q=security&l=dublin&radius=25&limit=50&filter=0

I think logic can be added that leaves a setting out of the query string when it is empty.
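A minimal sketch of that idea, assuming the URL is assembled from a dict of parameters. `build_location` and `build_search_url` are hypothetical helpers, not JobFunnel functions; the parameter names and the host pattern are copied from the URLs in this thread:

```python
from typing import Optional, Sequence
from urllib.parse import urlencode


def build_location(city: str, province: Optional[str] = "") -> str:
    """Join only the non-empty region parts, so '' or None never becomes 'None'."""
    parts = [p.strip() for p in (city, province) if p and p.strip()]
    return ", ".join(parts)


def build_search_url(domain: str, keywords: Sequence[str], city: str,
                     province: Optional[str] = "", radius: int = 25) -> str:
    """Build an Indeed-style search URL, leaving out empty parameters."""
    params = {
        "q": " ".join(keywords),
        "l": build_location(city, province),
        "radius": radius,
        "limit": 50,
        "filter": 0,
    }
    # Drop empty values; 0 is kept because it is a meaningful setting.
    params = {k: v for k, v in params.items() if v not in ("", None)}
    return f"https://{domain}.indeed.com/jobs?{urlencode(params)}"
```

With an empty province, `build_search_url("ie", ["security"], "dublin")` gives the same URL as the working one above, with no trailing `+None`.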

PaulMcInnis · 2020-01-09T13:48:09Z

Thanks for the investigation. It looks like we need to handle internationalization for areas without provinces.

remidubroca · 2020-01-16T14:21:02Z

Hello there,
First of all, thanks a lot for this project!

I just hit the same issue. I worked around it (a dirty fix), tested only for Indeed in French.

It seems to me that the spaces on indeed.fr are not regular spaces, so using ' ' in the date regular expressions does not work; replace it with '\s'. The expressions also need to be in French (hour=heure, day=jour, month=mois, year=année ...).
So in tools.py (lines 21 to 26) the regular expressions for French become:

re.compile(r'(\d+)(?:[\s+]{1,3})?(?:heure|hr)'),
re.compile(r'(\d+)(?:[\s+]{1,3})?(?:jour|j)'),
re.compile(r'(\d+)(?:[\s+]{1,3})?mois'),
re.compile(r'(\d+)(?:[\s+]{1,3})?année'),
re.compile(r'[aA]ujourd\'hui|[pP]ubliée à l\'instant'),
re.compile(r'[hH]ier')

Maybe use a bigger date_regex table with an offset depending on the locale? Or internationalize the regexes with more alternatives, like this:

re.compile(r'(\d+)(?:[\s+]{1,3})?(?:hour|heure|hr)'),
re.compile(r'(\d+)(?:[\s+]{1,3})?(?:day|d|jour|j)'),
re.compile(r'(\d+)(?:[\s+]{1,3})?(?:month|mois)'),
re.compile(r'(\d+)(?:[\s+]{1,3})?(?:year|année)'),
re.compile(r'[tT]oday|[jJ]ust [pP]osted|[aA]ujourd\'hui|[pP]ubliée à l\'instant'),
re.compile(r'[yY]esterday|[hH]ier')

For now I have only worked around it with a bigger date_regex table and an offset, quick and dirty ...
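One way the locale-dependent table could look, as a sketch only. `RELATIVE_WORDS`, `DAYS_PER_UNIT` and `days_ago` are hypothetical names, the French vocabulary is taken from the list above, and the 30-day month and 365-day year are simplifying assumptions:

```python
import re
from typing import Optional

# Per-locale vocabulary; adding a language means adding one entry here.
RELATIVE_WORDS = {
    "en": {"hour": r"hours?|hrs?", "day": r"days?", "month": r"months?",
           "year": r"years?", "today": r"today|just posted",
           "yesterday": r"yesterday"},
    "fr": {"hour": r"heures?|hr", "day": r"jours?", "month": r"mois",
           "year": r"années?", "today": r"aujourd'hui|publiée à l'instant",
           "yesterday": r"hier"},
}

# Approximate number of days represented by one unit (assumption).
DAYS_PER_UNIT = {"hour": 0, "day": 1, "month": 30, "year": 365}


def days_ago(text: str, locale: str) -> Optional[int]:
    """Convert a relative posting date such as 'il y a 3 jours' to days."""
    words = RELATIVE_WORDS[locale]
    if re.search(words["today"], text, re.IGNORECASE):
        return 0
    if re.search(words["yesterday"], text, re.IGNORECASE):
        return 1
    for unit, days in DAYS_PER_UNIT.items():
        # [\s+] also covers non-breaking spaces and the '+' in '30+ days'.
        pattern = rf"(\d+)[\s+]{{0,3}}(?:{words[unit]})"
        match = re.search(pattern, text, re.IGNORECASE)
        if match:
            return int(match.group(1)) * days
    return None  # no known pattern matched
```

Keeping the vocabulary separate from the matching logic means the regexes themselves are written once, and every alternation is grouped so the captured number is always group 1.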

Also, in indeed.py, line 133, the count of jobs is failing. I think this is the root cause of the 'NoneType' error:
num_res = int(re.findall(r'f (\d+) ', num_res.replace(',', ''))[0])
At first I also thought it was the province, but an empty province works.
The issue for French is that the thousands separator is a space, not a comma.
For now I have worked around it, still quick and dirty, with a locale-specific special case:
num_res = int(re.findall(r'(\d+)', num_res.replace('\s', ''))[1])
(I do not remember why the [1] is different in my workaround.)

Maybe a better solution would be to use re.sub instead of replace?
re.sub(r'\s|,','',num_res)
instead of
num_res.replace(',', '')

My 2 cents on this issue
Again, thanks for this project

tgdn · 2020-03-02T11:41:33Z

Hello all, I just wanted to know the status of this issue. Is there a fix, or is something going to be done about it?

Thank you in advance,
Thomas

markkvdb · 2020-03-02T11:46:45Z

Short answer: no.

Long answer: no, because job listing websites such as Glassdoor, Monster, etc. typically have slightly different sites depending on the country. These small differences break JobFunnel, since we scrape the job listings using tags that are language-dependent.

The solution starts with writing an abstract formulation that developers can inherit from to write the web scraper for a particular country. Ideally this is done in a way that makes it accessible to the many developers whose country is not yet supported. We are working on this, but remember that it is a difficult problem, since it requires us to find common patterns across all countries.
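To illustrate the shape of such an abstraction (not the design that was eventually adopted), here is a small sketch using Python's abc module. All class names are hypothetical, and the Canadian hostname is an assumption; only ie.indeed.com appears in this thread:

```python
from abc import ABC, abstractmethod
from urllib.parse import urlencode


class BaseIndeedScraper(ABC):
    """Shared flow lives here; country-specific details are abstract."""

    def __init__(self, keywords, city, province=""):
        self.keywords = keywords
        self.city = city
        self.province = province

    @property
    @abstractmethod
    def host(self) -> str:
        """Hostname of the localized site."""

    @abstractmethod
    def location(self) -> str:
        """Format the location the way this country's site expects."""

    def search_url(self) -> str:
        # Common logic calls the hooks that each country overrides.
        query = urlencode({"q": " ".join(self.keywords), "l": self.location()})
        return f"https://{self.host}/jobs?{query}"


class IndeedIreland(BaseIndeedScraper):
    host = "ie.indeed.com"

    def location(self) -> str:
        return self.city  # no province component


class IndeedCanada(BaseIndeedScraper):
    host = "ca.indeed.com"  # assumed hostname

    def location(self) -> str:
        return f"{self.city}, {self.province}"
```

A contributor adding a country would then only subclass and fill in the hooks; instantiating the base class directly raises a TypeError, which makes missing pieces obvious early.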

PaulMcInnis self-assigned this Jan 5, 2020

bradleykohler96 added the question label Jan 7, 2020

bradleykohler96 mentioned this issue Jan 7, 2020

failed to scrape Indeed: invalid literal for int() with base 10: 'Just poste' #35

Closed

DannyCork changed the title from "ailed to scrape Indeed: 'NoneType' object has no attribute 'contents'" to "failed to scrape Indeed: 'NoneType' object has no attribute 'contents'" Jan 7, 2020

PaulMcInnis added bug and removed question labels Jan 9, 2020

PaulMcInnis added help wanted Internationalization and removed bug labels Jan 9, 2020

This was referenced Aug 24, 2020

misuse of abstract base classes + monolithic JobFunnel class + schema validation + localisation #85

Closed

JobFunnel 3.0 with localization, ABC and improved scraping #90

Merged

PaulMcInnis linked a pull request Aug 29, 2020 that will close this issue

JobFunnel 3.0 with localization, ABC and improved scraping #90

Merged

17 tasks

PaulMcInnis closed this as completed in #90 Sep 12, 2020
