code for review #1

chapmandu · 2024-05-10T12:49:14Z

Here is a collection of classes used as a proof-of-concept for performing some web scraping tasks. I've opted to use composition over inheritance deliberately but am open to any an all constructive criticism.. I've also tried to adhere to the SOLID principles, but am still not fluent in them.

It's my first foray into Ruby, so I'm after feedback on code design and Ruby best practice, probably not so much code style, as I use Rubocop formatting.. but if something is really off, let me know.

Use Case

The code itself are the key classes used to perform a bunch of scheduled jobs to scrape particular types of html pages using the robots.txt and sitemaps.xml urls as starting points. The overall idea is to begin with a hostname mydomain.com and have these classes use that to construct/fetch/parse/filter the robots.txt/sitemap.xml for urls. The html_document class is then used to parse the contents of each URL's html markup to see if it contains particular meta tags/elements.. ideally to build a database of relevant web pages.

resource_fetcher.rb

Performs the fetching of HTTP resources. Uses faraday gem.

robots.rb

Fetches and parses sitemaps URLs from a robots.txt URL

sitemap.rb

Fetches and parses URLs from sitemap url (using SitemapParser gem). It also contains some static methods for url filtering.

html_document.rb

Parses HTML documents to extract common things like meta tags, buttons and elements containing combinations of attributes and/or text.

universal_resource.rb

Methods for interacting with and fetching/parsing/modifying a URL.

def robots
    robots = Robots.new('www.blacklabelskates.com')
    robots.fetch
    render plain: robots.body
  end

  def sitemap
    sitemap = Sitemap.new('https://www.blacklabelskates.com/sitemap.xml')
    sitemap.fetch
    render plain: sitemap.urls
  end

brandondees · 2024-05-10T13:38:10Z

please give some more context for the reviewer, since we don't know what you're trying to accomplish here, what types of feedback you want, etc.

you may also want to (very briefly) tell a bit of the story of how you got to this point, like where did this idea originate, what have you solved so far, and and what state you think it's in (does it work? is it complete, or just a start, or a component of a larger project?)

brandondees · 2024-05-10T13:40:15Z

app/lib/resource_fetcher.rb

+
+require 'faraday'
+
+# Interface for fetching resources


this type of comment doesn't add anything that the name doesn't already imply. i'd suggest deleting it or writing a more complete description of why and how to use it (like you would want from a library's docs)

brandondees · 2024-05-10T13:45:56Z

app/lib/universal_resource.rb

+class UniversalResource
+  attr_reader :url, :fetcher, :response
+
+  DOUBLE_SEGMENT_TLDS = [


is this list all you care about? are you trying to make this into a general-purpose utility for anyone else to use? i suspect the real list is much longer, there's an officially governed "public suffix list" that you can look into and maybe find a way to pull it from an up to date / authoritative source.

brandondees · 2024-05-10T13:59:17Z

you say that sitemap has some stuff for dealing with url filtering. you also have a url class for doing url stuff. are those responsibilities sliced appropriately? are these classes meant to represent general purpose library tools or are they tailored for a specific task/feature/business case? if they're "business" logic then it's probably good to name them in business terms rather than generic/universal terms.

brandondees · 2024-05-10T14:01:01Z

none of this code is getting used anywhere... what's a use case example? how would you know if it works? automated tests are a good way to demonstrate that, and they provide other types of value such as design feedback, documentation, regression safety, etc.

brandondees · 2024-05-10T14:04:49Z

app/lib/robots.rb

+      puts "Checking for #{url}"
+      begin
+        resource = UniversalResource.new(url, fetcher).fetch
+        return resource.url if resource.document.exists?


this returns the first resource url that exists, but i'm guessing that you want it to continue checking and return all of them using map rather than each?

brandondees · 2024-05-10T14:12:18Z

i think some of these classes are named as if they're generic libraries but you've actually built them to perform specific roles within a specific workflow, e.g. robots isn't really about robots.txt, it's a coordinator that gets sitemaps from a starting url, and parsing robots.txt is just a detail of that process.

different languages and frameworks have established cultures with varying levels of strongly adopted/enforced rules and conventions, but for your own private programs, and in ruby language in particular, there aren't many hard rules for things like naming, composition patterns, code organization, etc. - so feel free to use verbs as class names, or make stateful modules as if they're singleton classes, or apply functional programming style with pure functions, immutable data, and stateless classes, or no classes at all.

naming patterns that make sense in an enterprise app because of established community patterns, influence of dominant frameworks, dev tool conveniences, etc are all good to learn, but those things are not driven by clean principles, so if you really want to focus on SOLID, ignore all of those things and try experiments and sculpt the language of your program into what works best for you.

first!

514f677

brandondees reviewed May 10, 2024

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

code for review #1

code for review #1

chapmandu commented May 10, 2024 •

edited

Loading

brandondees commented May 10, 2024

brandondees May 10, 2024

brandondees May 10, 2024

brandondees commented May 10, 2024

brandondees commented May 10, 2024

brandondees May 10, 2024

brandondees commented May 10, 2024

code for review #1

Are you sure you want to change the base?

code for review #1

Conversation

chapmandu commented May 10, 2024 • edited Loading

Use Case

resource_fetcher.rb

robots.rb

sitemap.rb

html_document.rb

universal_resource.rb

brandondees commented May 10, 2024

brandondees May 10, 2024

Choose a reason for hiding this comment

brandondees May 10, 2024

Choose a reason for hiding this comment

brandondees commented May 10, 2024

brandondees commented May 10, 2024

brandondees May 10, 2024

Choose a reason for hiding this comment

brandondees commented May 10, 2024

chapmandu commented May 10, 2024 •

edited

Loading