
Dynamic web scraping and static web scraping #75

Merged: 6 commits merged into master on Jun 17, 2020

Conversation

@thebigG (Collaborator) commented May 17, 2020

Dynamic web scraping and static web scraping

Hello everyone!
Hope you are all doing fantastic!

Description

After switching to dynamic web scraping, it looks like the way each job's date is displayed on Glassdoor changed. This was breaking the date_filter, but after digging around Glassdoor's HTML source it appears to be working again. I also streamlined part of the dynamic scraping workflow so that it proceeds with the scraping (rather than holding the program with an input() call) when nothing is in the way, such as a CAPTCHA requirement.

I went back and tried the static way of scraping Glassdoor; it works sometimes, other times it doesn't. This led me to think about the dynamic/static scraping issue, and I'm wondering whether you think JobFunnel should support both dynamic and static web scraping. After some thought I imagined a couple of ways of implementing this:

  1. Outsource to a non-scraping entity (such as tools.py) the ability to tell whoever is building a scraper whether the site currently supports static web scraping. Hopefully an example illustrates this better:
def does_static_work_for_this_site(provider: str) -> bool:
    # ...some logic to figure this out
    # ...return True if static scraping works for the site, False if it doesn't

From the perspective of a scraper such as glassdoor.py, the implementation might look like this:

def scrape(self):
    if does_static_work_for_this_site("glassdoor"):
        ...  # some static scraping logic
    else:
        ...  # some dynamic scraping logic
    ...  # some more logic

The nice thing about this approach is that anybody writing a new web scraper can add their own implementation to does_static_work_for_this_site, and the check is decoupled from the actual scraping, whether that scraping is dynamic or static. The downside is that scrapers will be littered with if/else logic like the one above, which might sacrifice some readability. On the other hand, abstracting this is hard, as has been mentioned before with regard to internationalization.

  2. Let the user decide whether to do a dynamic scrape or a static one. With this approach, the dynamic and static implementations could be decoupled entirely from each other. An implementation that comes to mind is to have a separate file for each, such as:
  • glassdoor_static.py
  • glassdoor_dynamic.py

The upside of this is that the dynamic and static approaches are completely separate from each other and not intertwined in one monolithic file littered with if/else logic. The downside is that the codebase becomes somewhat more complex.

In the end, both approaches make the codebase more complex. The reason I propose this is that other sites might behave similarly to Glassdoor; who knows whether Indeed or Monster will start behaving the same way. It also doesn't seem efficient to have somebody jump in whenever the current approach (whether static or dynamic) breaks and rewrite one implementation or the other from scratch. I wonder what you all think about this as a way of being prepared for the future, in case we need to add dynamic support for other sites, or in case new sites behave much like Glassdoor: sometimes dynamic works, other times static is the way to go.

Sorry for making this long!
Just wanted to put thoughts down to see what y'all think!
I also wonder whether there are better ways of implementing this dynamic/static support, or whether it should be implemented at all.

Cheers!

List any additional libraries that will be affected.
List any developers that will be affected or those who you had merge conflicts with.

Context of change

Please add options that are relevant and mark any boxes that apply.

  • Software (software that runs on the PC)
  • Library (library that runs on the PC)
  • Tool (tool that assists coding development)
  • Other

Type of change

Please mark any boxes that apply.

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • This change requires a documentation update

How Has This Been Tested?

Please describe the tests that you ran to verify your changes.
Provide instructions so we can reproduce.
Please also list any relevant details for your test configuration.

  • Tested on an isolated container (Ubuntu 18.04)
  • Tested the entire pytest suite
  • Scraped the demo settings file for Canada
  • Scraped the demo settings file for United States

Checklist:

Please mark any boxes that have been completed.

  • I have performed a self-review of my own code.
  • I have commented my code, particularly in hard-to-understand areas.
  • I have made corresponding changes to the documentation.
  • My changes generate no new warnings.
  • I have added tests that prove my fix is effective or that my feature works.
  • New and existing unit tests pass locally with my changes.
  • Any dependent changes have been merged and published in downstream modules.

thebigG added 4 commits May 12, 2020 18:09
- TODO: Investigate this issue further
- TODO: Implement a way to limit the number of pages being scraped to speed up the testing process
- It looks like the static way of web scraping is working again in glassdoor.py
- TODO: Implement a stateful switch for scrapers that will allow programmers to know whether to use static web scraping or dynamic (Selenium-way) scraping
@codecov-io

Codecov Report

Merging #75 into master will increase coverage by 0.04%.
The diff coverage is 50.00%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master      #75      +/-   ##
==========================================
+ Coverage   46.96%   47.01%   +0.04%     
==========================================
  Files          11       11              
  Lines         990      989       -1     
==========================================
  Hits          465      465              
+ Misses        525      524       -1     
Impacted Files Coverage Δ
jobfunnel/glassdoor.py 12.57% <0.00%> (+0.07%) ⬆️
jobfunnel/tools/tools.py 80.20% <ø> (ø)
jobfunnel/tools/filters.py 93.10% <100.00%> (ø)

Continue to review full report at Codecov.

Legend:
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 1c3bec6...dc8ff18. Read the comment docs.

@markkvdb mentioned this pull request on May 17, 2020
@bradleykohler96 (Collaborator) commented May 18, 2020

If you check the JobFunnel Wiki, I have actually done some work on improving scraping methods. I'm not sure if we should continue with our current method or move to another method entirely. I used cosine similarity to do dynamic web scraping based on keywords. However, Glassdoor has proved to be the most challenging even for dynamic scraping, because they do not label tags according to their contents the way Indeed or Monster do.

Check out my branch if you want to see what I believe dynamic scraping should look like:
https://github.com/PaulMcInnis/JobFunnel/tree/studentbrad/tfidf_scrape

The other option I have started to look at is to not rely on HTML tags, but to use word vectorizers to process raw text in HTML files.
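
A rough sketch of the general idea, purely illustrative and not the actual code from the branch above (the helper name and the scikit-learn usage here are assumptions): strip the HTML down to raw text and rank the text chunks by cosine similarity to the search keywords.

# Illustrative sketch only: rank chunks of raw page text by similarity to the search keywords.
from bs4 import BeautifulSoup
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def rank_text_chunks(html: str, keywords: str, top_n: int = 5):
    # Work with raw text rather than tags, so the parsing does not depend on
    # how a given provider labels its HTML.
    text = BeautifulSoup(html, "html.parser").get_text(separator="\n")
    chunks = [line.strip() for line in text.splitlines() if line.strip()]

    vectorizer = TfidfVectorizer()
    matrix = vectorizer.fit_transform(chunks + [keywords])
    keyword_vec = matrix[len(chunks)]  # the last row is the keyword query
    scores = cosine_similarity(keyword_vec, matrix[:len(chunks)]).ravel()

    # Return the chunks most similar to the keywords, best first.
    ranked = sorted(zip(scores, chunks), key=lambda pair: pair[0], reverse=True)
    return ranked[:top_n]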

@thebigG (Collaborator, Author) commented May 18, 2020

The other option I have started to look at is to not rely on HTML tags, but to use word vectorizers to process raw text in HTML files.

This is really neat! It's a very clever way of doing this, as it's completely decoupled from the scrapers themselves, whether they are Glassdoor or Monster. At the end of the day, we just get a dump of HTML; it doesn't matter if it comes from Selenium or static scraping. If you ask me, this seems to be the way to go. If I understand your approach correctly, this is as abstracted as it gets.

@PaulMcInnis (Owner) commented May 19, 2020

I guess one thing about JobFunnel is that it relies on the community to keep the scrapers updated, but I've always had in mind that we should switch to something with more abstract scrapers.

It would be very cool to use the raw HTML and scrape what we need from it without having to navigate the structure of the data (which keeps changing).

I get the motivation for the static and dynamic scraping. I think it's probably best to allow it to be enabled/disabled; this way we can continue to develop and use it without relying on it entirely.

I think it makes the most sense to implement this in the abstract Scraper class as you mention, with an attribute/method that we can implement; perhaps it can default to False, since the existing scrapers are static.

@PaulMcInnis (Owner) left a comment

If this works as well as static does (for Glassdoor), let's just remove the commented-out static code. I have no problem moving to dynamic scraping here.

jobfunnel/glassdoor.py: review threads (outdated, resolved)
@PaulMcInnis (Owner)

Excited to see the scraping being made more flexible!

@markkvdb (Collaborator)

I just read the Wiki page about the suggested method for scraping websites using cosine similarity. This is definitely an innovative way to find the relevant fields on a website, but I am not sure whether it will classify the right fields correctly all the time. Furthermore, supporting more countries will become even harder (I think).

That said, it could definitely be a cool R&D project to see how well this actually works in practice. We could abstract the actual parser of a job listing and simply define two objects: a naive static attribute parser and the novel dynamic similarity parser.

@PaulMcInnis (Owner)

Yeah, we ideally need something more abstract, but realistically we should keep our hard-earned scraping logic for now, unless we can prove that a bag-of-HTML approach is robust enough.

The discussion of doing scraping this way is more food for thought and less related to this PR, though, which does a more flexible HTML search; definitely an upgrade.

@thebigG (Collaborator, Author) commented May 19, 2020

Furthermore, the added burden of supporting more countries will become even harder (I think).

Indeed. Internationalization could make the vectorizer approach messy. But part of me is hopeful and optimistic that we could make this approach work for different countries/languages. Still, just like @markkvdb said, it might be best to keep it as an in-research strategy if we're not 100% sure of its robustness.

If I understood Paul's view on static/dynamic approach correctly:

  • We would let the user decide whether to fetch data from the site statically (non-Selenium way) or dynamically.
  • When you say the Scraper abstract class, do you mean a completely new class? I ask because, now that I think about it, we could just have a virtual/non-implemented method (or one that returns False as the default behaviour, like Paul mentioned) such as does_static_work_for_this_site on the JobFunnel class, which child classes like GlassDoor and Monster would implement, akin to what is currently done with the scrape method.

@PaulMcInnis (Owner) commented May 19, 2020

Perhaps we can just add an attribute to our base scraper class and have it default to False. I'm not sure it needs to be a method unless it will be based on some logic? If so, it can be a method instead.

Right now our base class is JobFunnel, but in the future it would make more sense to build a BaseScraper, since inheriting from JobFunnel brings in a lot.
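
A minimal sketch of what such an attribute could look like, purely illustrative (BaseScraper, use_dynamic, and the GlassDoor class here are hypothetical names, not existing JobFunnel code):

# Hypothetical sketch: a flag on a future base scraper class, defaulting to static.
from abc import ABC, abstractmethod

class BaseScraper(ABC):
    # Default to static scraping, since that is what the existing scrapers do.
    use_dynamic: bool = False

    @abstractmethod
    def scrape(self):
        ...

class GlassDoor(BaseScraper):
    # Glassdoor would opt in to dynamic (Selenium-based) scraping.
    use_dynamic = True

    def scrape(self):
        if self.use_dynamic:
            pass  # Selenium-based scraping logic
        else:
            pass  # requests/BeautifulSoup-based scraping logic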

@thebigG (Collaborator, Author) commented May 19, 2020

I'm not sure it needs to be a method unless it will be based on some logic?

I think this might be the crux of the issue here. Initially my thinking was to have some logic where the scraper would figure out whether static scraping is available. But after reconsidering, I think that approach might make the codebase unnecessarily messy and complex. It would be much simpler to implement a simple bool attribute like Paul suggested. Having a simple bool (or even a multi-value type like an enum) attribute is a win-win for both programmers and users. Users may choose dynamic scraping over static, or vice versa. Programmers, on the other hand, have the flexibility of implementing each scraper how they see fit.

Hopefully the last question I have about this: given that we implement a simple attribute to know whether scraping should be dynamic or static, would it be a sensible solution, in the case of Glassdoor, to have two files, glassdoor_static.py and glassdoor_dynamic.py, one for each implementation?

I lean towards this because it might be easier to maintain as both implementations are clearly separate from each other. On second thought, it might also lead to a good deal of code duplication.

- Added missing assignment to get_webdriver
@codecov-commenter

Codecov Report

Merging #75 into master will increase coverage by 0.09%.
The diff coverage is 50.00%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master      #75      +/-   ##
==========================================
+ Coverage   46.96%   47.06%   +0.09%     
==========================================
  Files          11       11              
  Lines         990      988       -2     
==========================================
  Hits          465      465              
+ Misses        525      523       -2     
Impacted Files Coverage Δ
jobfunnel/glassdoor.py 12.65% <0.00%> (+0.15%) ⬆️
jobfunnel/tools/tools.py 80.20% <0.00%> (ø)
jobfunnel/tools/filters.py 93.10% <100.00%> (ø)

Continue to review full report at Codecov.

Legend:
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 1c3bec6...31be644. Read the comment docs.

@PaulMcInnis (Owner) commented May 21, 2020

I think it's ok to have them separate as you describe while we develop the feature. They are essentially different scrapers.

The only consideration I think won't be properly addressed if we split them into entirely separate files is that any shared logic will have to be updated in two places in future PRs.

Perhaps we can just make a GlassdoorScraperBase(ABC) with children GlassdoorStaticScraper(GlassdoorScraperBase) and GlassdoorDynamicScraper(GlassdoorScraperBase),

where GlassdoorScraperBase has all the logic excluding def scrape(...), which is to be implemented by the child classes.
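
A rough sketch of that layout, again illustrative only rather than code from this PR:

from abc import ABC, abstractmethod

class GlassdoorScraperBase(ABC):
    # Shared Glassdoor logic (URL building, parsing helpers, etc.) would live here.

    @abstractmethod
    def scrape(self):
        # Implemented by the static and dynamic child classes.
        ...

class GlassdoorStaticScraper(GlassdoorScraperBase):
    def scrape(self):
        pass  # requests/BeautifulSoup-based implementation

class GlassdoorDynamicScraper(GlassdoorScraperBase):
    def scrape(self):
        pass  # Selenium-based implementation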

Looking forward to seeing us take on building more generic/abstract scrapers in the future. It's a challenge for sure, so it makes sense to keep it on the back burner.

@thebigG (Collaborator, Author) commented May 21, 2020

Perhaps we can just make GlassdoorScraperBase(ABC) with children GlassdoorStaticScraper(GlassdoorScraperBase) GlassdoorDynamicScraper(GlassdoorScraperBase)

This makes a lot of sense. I just wanted to get a sense of what the future of the Glassdoor scraper might look like, given how volatile it can be, before writing any tests for it. I think it might actually be better to start implementing the GlassdoorScraperBase classes before moving on with any further testing.

@PaulMcInnis (Owner) commented May 22, 2020 via email

@thebigG (Collaborator, Author) commented May 25, 2020

Just wanted to let you guys know that a good friend of mine has started working on a possible solution to this dynamic/static scraping problem: https://github.com/Arax1/JobFunnel

@PaulMcInnis (Owner) commented Jun 15, 2020

Ok, what's the plan for merging this and #77?
It looks like the other one needs to go in first and this one gets rebased afterwards.

@thebigG (Collaborator, Author) commented Jun 15, 2020

It looks like the other one needs to go in first and this one gets rebased afterwards.

Not a problem! I will rebase it and push as soon as #77 gets merged 👍

@PaulMcInnis (Owner) left a comment

Looks good.
Do you know if this resolves issue #79?

I know it resolves issue #78

@thebigG (Collaborator, Author) commented Jun 17, 2020

Do you know if this resolves issue #79 ?

I haven't been able to reproduce this issue on Windows 10 or Linux. I will be investigating #79 further as I develop more unit tests for the scrapers. I suspect it is some kind of edge case. I have been testing JobFunnel on a clean Windows 10 install and it seems to be working OK.

@thebigG merged commit b30b284 into PaulMcInnis:master on Jun 17, 2020
@PaulMcInnis (Owner) commented Jun 17, 2020 via email
