-
Notifications
You must be signed in to change notification settings - Fork 0
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
code for review #1
base: main
Are you sure you want to change the base?
Conversation
please give some more context for the reviewer, since we don't know what you're trying to accomplish here, what types of feedback you want, etc. you may also want to (very briefly) tell a bit of the story of how you got to this point, like where did this idea originate, what have you solved so far, and and what state you think it's in (does it work? is it complete, or just a start, or a component of a larger project?) |
|
||
require 'faraday' | ||
|
||
# Interface for fetching resources |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this type of comment doesn't add anything that the name doesn't already imply. i'd suggest deleting it or writing a more complete description of why and how to use it (like you would want from a library's docs)
class UniversalResource | ||
attr_reader :url, :fetcher, :response | ||
|
||
DOUBLE_SEGMENT_TLDS = [ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is this list all you care about? are you trying to make this into a general-purpose utility for anyone else to use? i suspect the real list is much longer, there's an officially governed "public suffix list" that you can look into and maybe find a way to pull it from an up to date / authoritative source.
you say that sitemap has some stuff for dealing with url filtering. you also have a url class for doing url stuff. are those responsibilities sliced appropriately? are these classes meant to represent general purpose library tools or are they tailored for a specific task/feature/business case? if they're "business" logic then it's probably good to name them in business terms rather than generic/universal terms. |
none of this code is getting used anywhere... what's a use case example? how would you know if it works? automated tests are a good way to demonstrate that, and they provide other types of value such as design feedback, documentation, regression safety, etc. |
puts "Checking for #{url}" | ||
begin | ||
resource = UniversalResource.new(url, fetcher).fetch | ||
return resource.url if resource.document.exists? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this returns the first resource url that exists, but i'm guessing that you want it to continue checking and return all of them using map
rather than each
?
i think some of these classes are named as if they're generic libraries but you've actually built them to perform specific roles within a specific workflow, e.g. robots isn't really about robots.txt, it's a coordinator that gets sitemaps from a starting url, and parsing robots.txt is just a detail of that process. different languages and frameworks have established cultures with varying levels of strongly adopted/enforced rules and conventions, but for your own private programs, and in ruby language in particular, there aren't many hard rules for things like naming, composition patterns, code organization, etc. - so feel free to use verbs as class names, or make stateful modules as if they're singleton classes, or apply functional programming style with pure functions, immutable data, and stateless classes, or no classes at all. naming patterns that make sense in an enterprise app because of established community patterns, influence of dominant frameworks, dev tool conveniences, etc are all good to learn, but those things are not driven by clean principles, so if you really want to focus on SOLID, ignore all of those things and try experiments and sculpt the language of your program into what works best for you. |
Here is a collection of classes used as a proof-of-concept for performing some web scraping tasks. I've opted to use composition over inheritance deliberately but am open to any an all constructive criticism.. I've also tried to adhere to the SOLID principles, but am still not fluent in them.
It's my first foray into Ruby, so I'm after feedback on code design and Ruby best practice, probably not so much code style, as I use Rubocop formatting.. but if something is really off, let me know.
Use Case
The code itself are the key classes used to perform a bunch of scheduled jobs to scrape particular types of html pages using the robots.txt and sitemaps.xml urls as starting points. The overall idea is to begin with a hostname
mydomain.com
and have these classes use that to construct/fetch/parse/filter the robots.txt/sitemap.xml for urls. The html_document class is then used to parse the contents of each URL's html markup to see if it contains particular meta tags/elements.. ideally to build a database of relevant web pages.resource_fetcher.rb
Performs the fetching of HTTP resources. Uses
faraday
gem.robots.rb
Fetches and parses sitemaps URLs from a robots.txt URL
sitemap.rb
Fetches and parses URLs from sitemap url (using SitemapParser gem). It also contains some static methods for url filtering.
html_document.rb
Parses HTML documents to extract common things like meta tags, buttons and elements containing combinations of attributes and/or text.
universal_resource.rb
Methods for interacting with and fetching/parsing/modifying a URL.