
Dynamic web scraping and static web scraping #75

Merged: 6 commits merged into master on Jun 17, 2020

Conversation

@thebigG (Collaborator) commented May 17, 2020

Dynamic web scraping and static web scraping

Hello everyone!
Hope you are all doing fantastic!

Description

After switching to dynamic web scraping, it looks like the way each job's date is displayed on Glassdoor changed. This was breaking the date_filter, but after digging around Glassdoor's HTML source it appears to be working again. I also streamlined part of the dynamic scraping workflow so that it proceeds with the scraping (rather than holding the program with an input() call) when nothing is in the way, such as a CAPTCHA requirement.

I went back and tried the static way of scraping Glassdoor; it works sometimes, other times it doesn't. This led me to think about the dynamic/static scraping issue, and I'm wondering whether you think JobFunnel should support both dynamic and static web scraping. After some thought I imagined a couple of ways of implementing this:

  1. Outsource to a non-scraping entity (such as tools.py) the ability to tell whoever is building a scraper whether the site currently supports static web scraping. Hopefully an example illustrates this better:
def does_static_work_for_this_site(provider: str) -> bool:
    # ...some logic to figure this out
    # ...return True if static scraping works for the site, False if it doesn't

From the perspective of a scraper such as glassdoor.py, the implementation might look like this:

def scrape(self):
    if does_static_work_for_this_site("glassdoor"):
        ...  # some static scraping logic
    else:
        ...  # some dynamic scraping logic
    ...  # some more logic

The nice thing about this approach is that anybody writing a new web scraper can add their own implementation to does_static_work_for_this_site, and the check is decoupled from the actual scraping, whether that scraping is dynamic or static. The downside is that scrapers will be littered with if/else logic like the one above, which might sacrifice some readability. On the other hand, abstracting this is hard, as has been mentioned before with regard to internationalization.

  2. Let the user decide whether to do a dynamic scrape or a static one. With this approach, the dynamic and static implementations could be decoupled entirely from each other. An implementation that comes to mind is to have a separate file for each, such as:
  • glassdoor_static.py
  • glassdoor_dynamic.py

The upside of this is that the dynamic and static approaches are completely separate from each other and not intertwined in one monolithic file littered with if/else logic. The downside is that the codebase becomes somewhat more complex.

In the end, both approaches make the codebase more complex. The reason I propose this is that other sites might behave similarly to Glassdoor; who knows whether Indeed or Monster will start behaving the same way. It also doesn't seem efficient to have somebody jump in whenever the current approach (whether static or dynamic) breaks and rewrite one implementation or the other from scratch. I wonder what you all think about this as a way of being prepared for the future, in case we need to add dynamic support for other sites, or in case new sites behave much like Glassdoor: sometimes dynamic works, other times static is the way to go.

Sorry for making this long!
Just wanted to put thoughts down to see what y'all think!
I also wonder whether there are better ways of implementing this dynamic/static support, or whether it should be implemented at all.

Cheers!

List any additional libraries that will be affected.
List any developers that will be affected or those who you had merge conflicts with.

Context of change

Please add options that are relevant and mark any boxes that apply.

  • Software (software that runs on the PC)
  • Library (library that runs on the PC)
  • Tool (tool that assists coding development)
  • Other

Type of change

Please mark any boxes that apply.

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • This change requires a documentation update

How Has This Been Tested?

Please describe the tests that you ran to verify your changes.
Provide instructions so we can reproduce.
Please also list any relevant details for your test configuration.

  • Tested on an isolated container (Ubuntu 18.04)
  • Tested the entire pytest suite
  • Scraped the demo settings file for Canada
  • Scraped the demo settings file for United States

Checklist:

Please mark any boxes that have been completed.

  • I have performed a self-review of my own code.
  • I have commented my code, particularly in hard-to-understand areas.
  • I have made corresponding changes to the documentation.
  • My changes generate no new warnings.
  • I have added tests that prove my fix is effective or that my feature works.
  • New and existing unit tests pass locally with my changes.
  • Any dependent changes have been merged and published in downstream modules.

thebigG added 4 commits May 12, 2020 18:09
- TODO: Investigate this issue further
- TODO: Implement a way to limit the number of pages being scraped to speed up the testing process
- It looks like the static way of web scraping is working again in glassdoor.py
- TODO: Implement a stateful switch for scrapers that will allow programmers to know whether to use static web scraping or dynamic (Selenium-way) scraping
@codecov-io

Codecov Report

Merging #75 into master will increase coverage by 0.04%.
The diff coverage is 50.00%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master      #75      +/-   ##
==========================================
+ Coverage   46.96%   47.01%   +0.04%     
==========================================
  Files          11       11              
  Lines         990      989       -1     
==========================================
  Hits          465      465              
+ Misses        525      524       -1     
Impacted Files Coverage Δ
jobfunnel/glassdoor.py 12.57% <0.00%> (+0.07%) ⬆️
jobfunnel/tools/tools.py 80.20% <ø> (ø)
jobfunnel/tools/filters.py 93.10% <100.00%> (ø)

Continue to review full report at Codecov.

Legend:
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 1c3bec6...dc8ff18. Read the comment docs.

@markkvdb mentioned this pull request on May 17, 2020
@bradleykohler96 (Collaborator) commented May 18, 2020

If you check the JobFunnel Wiki, I have actually done some work on improving scraping methods. I'm not sure if we should continue with our current method or move to another method entirely. I used cosine similarity to do dynamic web scraping based on keywords. However, Glassdoor has proved to be the most challenging even for dynamic scraping, because they do not label tags according to their contents the way Indeed or Monster do.

Check out my branch if you want to see what I believe dynamic scraping should look like:
https://github.com/PaulMcInnis/JobFunnel/tree/studentbrad/tfidf_scrape

The other option I have started to look at is to not rely on HTML tags, but to use word vectorizers to process raw text in HTML files.
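
A rough sketch of the general idea, purely illustrative and not the actual code from the branch above (the helper name and the scikit-learn usage here are assumptions): strip the HTML down to raw text and rank the text chunks by cosine similarity to the search keywords.

# Illustrative sketch only: rank chunks of raw page text by similarity to the search keywords.
from bs4 import BeautifulSoup
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def rank_text_chunks(html: str, keywords: str, top_n: int = 5):
    # Work with raw text rather than tags, so the parsing does not depend on
    # how a given provider labels its HTML.
    text = BeautifulSoup(html, "html.parser").get_text(separator="\n")
    chunks = [line.strip() for line in text.splitlines() if line.strip()]

    vectorizer = TfidfVectorizer()
    matrix = vectorizer.fit_transform(chunks + [keywords])
    keyword_vec = matrix[len(chunks)]  # the last row is the keyword query
    scores = cosine_similarity(keyword_vec, matrix[:len(chunks)]).ravel()

    # Return the chunks most similar to the keywords, best first.
    ranked = sorted(zip(scores, chunks), key=lambda pair: pair[0], reverse=True)
    return ranked[:top_n]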

@thebigG (Collaborator, Author) commented May 18, 2020

The other option I have started to look at is to not rely on HTML tags, but to use word vectorizers to process raw text in HTML files.

This is really neat! It's a very clever way of doing this, as it's completely decoupled from the scrapers themselves, whether they are Glassdoor or Monster. At the end of the day, we just get a dump of HTML; it doesn't matter if it comes from Selenium or static scraping. If you ask me, this seems to be the way to go. If I understand your approach correctly, this is as abstracted as it gets.

@PaulMcInnis (Owner) commented May 19, 2020

I guess one thing about JobFunnel is that it relies on the community to keep the scrapers updated, but I've always had in mind that we should switch to something with more abstract scrapers.

It would be very cool to use the raw HTML and scrape what we need from it without having to navigate the structure of the data (which keeps changing).

I get the motivation for the static and dynamic scraping. I think it's probably best to allow it to be enabled/disabled; this way we can continue to develop and use it without relying on it entirely.

I think it makes the most sense to implement this in the abstract Scraper class as you mention, with an attribute/method that we can implement; perhaps it can default to False, since the existing scrapers are static.

@PaulMcInnis (Owner) left a comment

If this works as well as static does (for Glassdoor), let's just remove the commented-out static code. I have no problem moving to dynamic scraping here.

jobfunnel/glassdoor.py: review threads (outdated, resolved)
@PaulMcInnis (Owner)

Excited to see the scraping being made more flexible!

@markkvdb (Collaborator)

I just read the Wiki page about the suggested method for scraping websites using cosine similarity. This is definitely an innovative way to find the relevant fields on a website, but I am not sure whether it will classify the right fields correctly all the time. Furthermore, supporting more countries will become even harder (I think).

That said, it could definitely be a cool R&D project to see how well this actually works in practice. We could abstract the actual parser of a job listing and simply define two objects: a naive static attribute parser and the novel dynamic similarity parser.

@PaulMcInnis (Owner)

Yeah, we ideally need something more abstract, but realistically we should keep our hard-earned scraping logic for now, unless we can prove that a bag-of-HTML approach is robust enough.

The discussion of doing scraping this way is more food for thought and less related to this PR, though, which does a more flexible HTML search; definitely an upgrade.

@thebigG (Collaborator, Author) commented May 19, 2020

Furthermore, the added burden of supporting more countries will become even harder (I think).

Indeed. Internationalization could make the vectorizer approach messy. But part of me is hopeful and optimistic that we could make this approach work for different countries/languages. Still, just like @markkvdb said, it might be best to keep it as an in-research strategy if we're not 100% sure of its robustness.

If I understood Paul's view on static/dynamic approach correctly:

  • We would let the user decide whether to fetch data from the site statically (non-Selenium way) or dynamically.
  • When you say the Scraper abstract class, do you mean a completely new class? I ask because, now that I think about it, we could just have a virtual/non-implemented method (or one that returns False as the default behaviour, like Paul mentioned) such as does_static_work_for_this_site on the JobFunnel class, which child classes like GlassDoor and Monster would implement, akin to what is currently done with the scrape method.

@PaulMcInnis (Owner) commented May 19, 2020

Perhaps we can just add an attribute to our base scraper class and have it default to False. I'm not sure it needs to be a method unless it will be based on some logic? If so, it can be a method instead.

Right now our base class is JobFunnel, but in the future it would make more sense to build a BaseScraper, since inheriting from JobFunnel brings in a lot.
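
A minimal sketch of what such an attribute could look like, purely illustrative (BaseScraper, use_dynamic, and the GlassDoor class here are hypothetical names, not existing JobFunnel code):

# Hypothetical sketch: a flag on a future base scraper class, defaulting to static.
from abc import ABC, abstractmethod

class BaseScraper(ABC):
    # Default to static scraping, since that is what the existing scrapers do.
    use_dynamic: bool = False

    @abstractmethod
    def scrape(self):
        ...

class GlassDoor(BaseScraper):
    # Glassdoor would opt in to dynamic (Selenium-based) scraping.
    use_dynamic = True

    def scrape(self):
        if self.use_dynamic:
            pass  # Selenium-based scraping logic
        else:
            pass  # requests/BeautifulSoup-based scraping logic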

@thebigG (Collaborator, Author) commented May 19, 2020

I'm not sure it needs to be a method unless it will be based on some logic?

I think this might be the crux of the issue here. Initially my thinking was to have some logic where the scraper would figure out whether static scraping is available. But after reconsidering, I think that approach might make the codebase unnecessarily messy and complex. It would be much simpler to implement a simple bool attribute like Paul suggested. Having a simple bool (or even a multi-value type like an enum) attribute is a win-win for both programmers and users. Users may choose dynamic scraping over static, or vice versa. Programmers, on the other hand, have the flexibility of implementing each scraper how they see fit.

Hopefully the last question I have about this: given that we implement a simple attribute to know whether scraping should be dynamic or static, would it be a sensible solution, in the case of Glassdoor, to have two files, glassdoor_static.py and glassdoor_dynamic.py, one for each implementation?

I lean towards this because it might be easier to maintain as both implementations are clearly separate from each other. On second thought, it might also lead to a good deal of code duplication.

- Added missing assignment to get_webdriver
@codecov-commenter

Codecov Report

Merging #75 into master will increase coverage by 0.09%.
The diff coverage is 50.00%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master      #75      +/-   ##
==========================================
+ Coverage   46.96%   47.06%   +0.09%     
==========================================
  Files          11       11              
  Lines         990      988       -2     
==========================================
  Hits          465      465              
+ Misses        525      523       -2     
Impacted Files Coverage Δ
jobfunnel/glassdoor.py 12.65% <0.00%> (+0.15%) ⬆️
jobfunnel/tools/tools.py 80.20% <0.00%> (ø)
jobfunnel/tools/filters.py 93.10% <100.00%> (ø)

Continue to review full report at Codecov.

Legend:
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 1c3bec6...31be644. Read the comment docs.

@PaulMcInnis (Owner) commented May 21, 2020

I think it's ok to have them separate as you describe while we develop the feature. They are essentially different scrapers.

The only consideration I think won't be properly addressed if we split them into entirely separate files is that any shared logic will have to be updated in two places in future PRs.

Perhaps we can just make a GlassdoorScraperBase(ABC) with children GlassdoorStaticScraper(GlassdoorScraperBase) and GlassdoorDynamicScraper(GlassdoorScraperBase),

where GlassdoorScraperBase has all the logic excluding def scrape(...), which is to be implemented by the child classes.
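
A rough sketch of that layout, again illustrative only rather than code from this PR:

from abc import ABC, abstractmethod

class GlassdoorScraperBase(ABC):
    # Shared Glassdoor logic (URL building, parsing helpers, etc.) would live here.

    @abstractmethod
    def scrape(self):
        # Implemented by the static and dynamic child classes.
        ...

class GlassdoorStaticScraper(GlassdoorScraperBase):
    def scrape(self):
        pass  # requests/BeautifulSoup-based implementation

class GlassdoorDynamicScraper(GlassdoorScraperBase):
    def scrape(self):
        pass  # Selenium-based implementation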

Looking forward to seeing us take on building more generic/abstract scrapers in the future. It's a challenge for sure, so it makes sense to keep it on the back burner.

@thebigG (Collaborator, Author) commented May 21, 2020

Perhaps we can just make GlassdoorScraperBase(ABC) with children GlassdoorStaticScraper(GlassdoorScraperBase) GlassdoorDynamicScraper(GlassdoorScraperBase)

This makes a lot of sense. I just wanted to get a sense of what the future of the Glassdoor scraper might look like, given how volatile it can be, before writing any tests for it. I think it might actually be better to start implementing the GlassdoorScraperBase classes before moving on with any further testing.

@PaulMcInnis (Owner) commented May 22, 2020 via email

@thebigG (Collaborator, Author) commented May 25, 2020

Just wanted to let you guys know that a good friend of mine has started working on a possible solution to this dynamic/static scraping problem: https://github.com/Arax1/JobFunnel

@PaulMcInnis (Owner) commented Jun 15, 2020

Ok, what's the plan for merging this and #77?
It looks like the other one needs to go in first and this one gets rebased afterwards.

@thebigG (Collaborator, Author) commented Jun 15, 2020

It looks like the other one needs to go in first and this one gets rebased afterwards.

Not a problem! I will rebase it and push as soon as #77 gets merged 👍

@PaulMcInnis (Owner) left a comment

Looks good.
Do you know if this resolves issue #79?

I know it resolves issue #78

@thebigG (Collaborator, Author) commented Jun 17, 2020

Do you know if this resolves issue #79 ?

I haven't been able to reproduce this issue on Windows 10 or Linux. I will be investigating #79 further as I develop more unit tests for the scrapers. I suspect it is some kind of edge case. I have been testing JobFunnel on a clean Windows 10 install and it seems to be working OK.

@thebigG merged commit b30b284 into PaulMcInnis:master on Jun 17, 2020
@PaulMcInnis (Owner) commented Jun 17, 2020 via email
