Dynamic web scraping and static web scraping #75
Conversation
- TODO: Investigate this issue further
- TODO: Implement a way to limit the number of pages being scraped to speed up the testing process
- It looks like the static way of web scraping is working again in glassdoor.py
- TODO: Implement a stateful switch for scrapers that will allow programmers to know whether to use static web scraping or dynamic (Selenium-based) scraping
Codecov Report

@@            Coverage Diff             @@
##           master      #75      +/-   ##
==========================================
+ Coverage   46.96%   47.01%   +0.04%
==========================================
  Files          11       11
  Lines         990      989       -1
==========================================
  Hits          465      465
+ Misses        525      524       -1

Continue to review full report at Codecov.
If you check the JobFunnel Wiki, I have actually done some work on improving methods of scraping. I'm not sure if we should continue with our current method or move to another method entirely. I used cosine similarity to do dynamic web scraping based on keywords. However, Glassdoor has proved to be the most challenging even for dynamic scraping, because they do not label any tags properly based on their contents the way Indeed or Monster do. Check out my branch if you want to see what I believe dynamic scraping should look like. The other option I have started to look at is to not rely on HTML tags at all, but to use word vectorizers to process the raw text in HTML files.
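Roughly, the idea looks something like this; a minimal sketch assuming scikit-learn and BeautifulSoup, where the field keywords and the `best_match` helper are illustrative rather than the actual code on my branch:

```python
from bs4 import BeautifulSoup
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Illustrative keyword queries describing the fields we want to locate.
FIELD_KEYWORDS = {
    "title": "job title position role",
    "company": "company employer organization",
    "date": "posted days ago date",
}

def best_match(html: str, field: str) -> str:
    """Return the visible text fragment most similar to the field's keywords."""
    soup = BeautifulSoup(html, "html.parser")
    fragments = list(soup.stripped_strings)
    if not fragments:
        return ""
    # Vectorize the keyword query together with every candidate fragment,
    # then rank fragments by cosine similarity to the query.
    matrix = TfidfVectorizer().fit_transform([FIELD_KEYWORDS[field]] + fragments)
    scores = cosine_similarity(matrix[0:1], matrix[1:]).flatten()
    return fragments[scores.argmax()]
```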
This is really neat! It's a very clever way of doing this, as it's completely decoupled from the scrapers themselves, whether they are Glassdoor or Monster. At the end of the day we just get a dump of HTML; it doesn't matter if it came from Selenium or static scraping. If you ask me, this seems to be the way to go. If I understand your approach correctly, this is as abstracted as it gets.
I guess one thing about JobFunnel is that it relies on the community updating the scrapers for it to keep working, but I've always had in mind that we should switch to something with more abstract scrapers. It would be very cool to use the raw HTML and scrape what we need from it without having to navigate the structure of the data (which keeps changing). I get the motivation for the static and dynamic scraping; I think perhaps it is best to allow it to be enabled/disabled, so we can continue to develop it and use it without relying on it entirely. I think it makes the most sense to implement it in the abstract Scraper class as you mention, with an attribute/method that we can implement, and it can probably default to False since the existing scrapers are static.
If this works as well as static does (for Glassdoor), let's just remove the commented-out static code; I have no problem moving to dynamic scraping here.
Excited to see the scraping being made more flexible!
I just read the Wiki about the suggested method for scraping websites using the cosine similarity method. This is definitely an innovative way to find the relevant fields on a website, but I am not sure whether it will correctly classify the right fields all the time. Furthermore, the added burden of supporting more countries will become even harder (I think). That said, it could definitely be a cool R&D project to see how well this actually works in practice. We could abstract the actual parser of a job listing and simply define two objects: a naive static attribute parser and the novel dynamic similarity parser.
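For instance, something along these lines; just a sketch, with made-up class names and an illustrative selector:

```python
from abc import ABC, abstractmethod
from bs4 import BeautifulSoup

class JobParser(ABC):
    """Abstract parser for a single job listing's HTML."""

    @abstractmethod
    def parse(self, listing_html: str) -> dict:
        """Extract fields (title, company, date, ...) from one job listing."""

class StaticAttributeParser(JobParser):
    """Naive approach: read known tags/attributes directly."""

    def parse(self, listing_html: str) -> dict:
        soup = BeautifulSoup(listing_html, "html.parser")
        title_tag = soup.find(attrs={"data-test": "job-title"})  # illustrative selector
        return {"title": title_tag.get_text(strip=True) if title_tag else ""}

class SimilarityParser(JobParser):
    """Novel approach: rank raw text fragments against field keywords with
    cosine similarity (as in the Wiki idea); body omitted here for brevity."""

    def parse(self, listing_html: str) -> dict:
        raise NotImplementedError
```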
Yeah, we need something more abstract ideally, but realistically we should keep our hard-earned scraping logic for now, unless we can prove that a bag-of-HTML approach is robust enough. The discussion of doing scraping this way is more food for thought and less related to this PR though, which is doing a more flexible HTML search, definitely an upgrade.
Indeed. Internationalization could make the vectorizer approach messy, but part of me is really hopeful and optimistic that we could make this approach work for different countries/languages. Just like @markkvdb said, though, it might be best to keep it as an in-research strategy if we're not 100% sure of its robustness. If I understood Paul's view on the static/dynamic approach correctly, the idea is to make it something we can enable/disable, defaulting to the existing static behaviour.
Perhaps we can just add an attribute to our base scraper class and have it default to False. I'm not sure it needs to be a method unless it will be based on some logic? If so, it can be a method instead. Right now our base class is
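Something like this, as a rough sketch; the `BaseScraper` and `uses_dynamic_scraping` names are placeholders rather than our actual identifiers:

```python
from abc import ABC, abstractmethod

class BaseScraper(ABC):
    """Placeholder for our abstract scraper class."""

    # Default to static scraping, since the existing scrapers are static.
    uses_dynamic_scraping: bool = False

    @abstractmethod
    def scrape(self) -> dict:
        """Scrape job listings and return them keyed by job id."""

class GlassdoorScraper(BaseScraper):
    # Glassdoor often needs a real browser, so opt in to dynamic (Selenium) scraping.
    uses_dynamic_scraping = True

    def scrape(self) -> dict:
        return {}  # actual scraping elided
```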
I think this might be the crux of the issue here. Initially my thinking was to have some logic where the scraper would figure out whether static scraping was available. But after reconsidering, I think that approach might make the codebase unnecessarily messy and complex, and it might be much simpler to just implement a simple bool attribute like Paul suggested. Having a simple bool (or even a multi-value type like an enum) attribute is a win-win for both programmers and users: users may choose dynamic scraping over static, or vice versa, and programmers have the flexibility of implementing each scraper how they see fit. Hopefully the last question I have about this: given that we implement a simple attribute to know whether a scraper should be dynamic or static, in the case of Glassdoor, would it be a sensible solution to have two files, one for each implementation? I lean towards this because it might be easier to maintain, as both implementations are clearly separate from each other. On second thought, it might also lead to a good deal of code duplication.
- Added missing assignment to get_webdriver
Codecov Report

@@            Coverage Diff             @@
##           master      #75      +/-   ##
==========================================
+ Coverage   46.96%   47.06%   +0.09%
==========================================
  Files          11       11
  Lines         990      988       -2
==========================================
  Hits          465      465
+ Misses        525      523       -2

Continue to review full report at Codecov.
I think it's ok to have them separate as you describe while we develop the feature. They are essentially different scrapers. The only consideration I think won't be properly addressed if we split them into entirely separate files is that any shared logic will have to be updated in two places in future PRs. Perhaps we can just make GlassdoorScraperBase(ABC) with children GlassdoorStaticScraper(GlassdoorScraperBase) and GlassdoorDynamicScraper(GlassdoorScraperBase), where any shared logic lives in the base class. Looking forward to seeing us take on building more generic/abstract scrapers in the future. It's a challenge for sure, so it makes sense to keep it on the back-burner.
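Roughly like this; only the class names are the actual proposal, the method bodies and URL parameters are illustrative:

```python
from abc import ABC, abstractmethod

import requests

class GlassdoorScraperBase(ABC):
    """Shared Glassdoor logic (URL building, parsing helpers) lives here once."""

    def build_search_url(self, keyword: str, location: str) -> str:
        # Illustrative helper shared by both children; the query parameters are made up.
        return ("https://www.glassdoor.com/Job/jobs.htm?"
                f"sc.keyword={keyword}&locKeyword={location}")

    @abstractmethod
    def get_page_html(self, url: str) -> str:
        """Children decide how the page HTML is fetched."""

class GlassdoorStaticScraper(GlassdoorScraperBase):
    def get_page_html(self, url: str) -> str:
        # Plain HTTP request, no browser involved.
        return requests.get(url, headers={"User-Agent": "Mozilla/5.0"}).text

class GlassdoorDynamicScraper(GlassdoorScraperBase):
    def get_page_html(self, url: str) -> str:
        # Drive a real browser so JavaScript-rendered content is available.
        from selenium import webdriver
        driver = webdriver.Chrome()
        try:
            driver.get(url)
            return driver.page_source
        finally:
            driver.quit()
```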
This makes a lot of sense. I just wanted to be sure about what the future of Glassdoor might be, given how volatile it can be, before writing any tests for it. I think it might actually be better to start implementing the GlassDoorScraperBase classes before moving on with any further testing.
Another thought: don't worry too much about getting the test coverage up for the dynamic scraper, given that it's more experimental.
Just wanted to let you guys know that a good friend of mine has started working on a possible solution to this problem of dynamic/static scraping: https://github.com/Arax1/JobFunnel
Ok, what's the plan for merging this and #77?
Not a problem! I will rebase it and push as soon as #77 gets merged 👍
I haven't been able to reproduce this issue on Windows 10 or Linux. I will be investigating #79 further from now on as I develop more unit tests for the scrapers. I suspect it is some kind of edge case. I have been testing JobFunnel on a clean Windows 10 install and it seems to be working ok.
Same, I was also unable to reproduce the issue.
Dynamic web scraping and static web scraping
Hello everyone!
Hope you are all doing fantastic!
Description
So after changing to dynamic web scraping, it looks like the way the date of each job is displayed on Glassdoor changed. This was breaking the date_filter, but it looks like it's working now after digging around Glassdoor's HTML source. I also streamlined some of the dynamic scraping workflow to just go on with the scraping (not holding the program with an input() call) if there is nothing wrong, such as a CAPTCHA requirement.

I went back and tried the static way of scraping Glassdoor; it works sometimes, other times it doesn't. This led me to think about this dynamic/static scraping issue, and I'm wondering whether you guys think we should enable JobFunnel to support both dynamic and static web scraping. After some thought I imagined a couple of ways of implementing this:

1. Give scrapers (perhaps through tools.py) the ability to tell whoever is building the scraper whether the site currently supports static web scraping. Hopefully an example illustrates this better: from the perspective of a scraper such as glassdoor.py, the implementation might look roughly like the sketch at the end of this description. The nice thing about this approach is that anybody who writes a new web scraper can add their own implementation of does_static_work_for_this_site, and it is also decoupled from the actual scraping, whether that scraping is dynamic or static. The downside is that scrapers would be littered with if ... else logic like that all over the place, which might sacrifice some readability. On the other hand, abstracting this is hard, as has been mentioned before with regard to internationalization.

2. Split the implementations into separate files, glassdoor_static.py and glassdoor_dynamic.py. The upside is that the dynamic and static approaches are completely separate from each other and not intertwined in one monolithic file littered with if..else logic. The downside is that, in a way, the codebase gets more complex.

In the end, both approaches make the whole codebase more complex. But the reason I propose this is that I wonder whether other sites might behave in a similar way to Glassdoor, or whether Indeed or Monster might start behaving the same way. I also don't think it's a very efficient way of handling things to have somebody jump in whenever the current approach (whether static or dynamic) stops working and have them re-write one or the other implementation again. I wonder what you guys think about this as a way of being prepared for the future, in case we need to add dynamic support for other sites, or in case new sites that are added behave very much like Glassdoor: sometimes dynamic works, other times static is the way to go.
Sorry for making this long!
Just wanted to put thoughts down to see what y'all think!
I also wonder if maybe there are better ways of implementing this dynamic/static support, or should it even be implemented at all?
Cheers!
List any additional libraries that will be affected.
List any developers that will be affected or those who you had merge conflicts with.
Context of change
Please add options that are relevant and mark any boxes that apply.
Type of change
Please mark any boxes that apply.
How Has This Been Tested?
Please describe the tests that you ran to verify your changes.
Provide instructions so we can reproduce.
Please also list any relevant details for your test configuration.
Checklist:
Please mark any boxes that have been completed.