Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

code for review #1

Open
wants to merge 1 commit into
base: main
Choose a base branch
from
Open

code for review #1

wants to merge 1 commit into from

Conversation

chapmandu
Copy link
Owner

@chapmandu chapmandu commented May 10, 2024

Here is a collection of classes used as a proof-of-concept for performing some web scraping tasks. I've opted to use composition over inheritance deliberately but am open to any an all constructive criticism.. I've also tried to adhere to the SOLID principles, but am still not fluent in them.

It's my first foray into Ruby, so I'm after feedback on code design and Ruby best practice, probably not so much code style, as I use Rubocop formatting.. but if something is really off, let me know.

Use Case

The code itself are the key classes used to perform a bunch of scheduled jobs to scrape particular types of html pages using the robots.txt and sitemaps.xml urls as starting points. The overall idea is to begin with a hostname mydomain.com and have these classes use that to construct/fetch/parse/filter the robots.txt/sitemap.xml for urls. The html_document class is then used to parse the contents of each URL's html markup to see if it contains particular meta tags/elements.. ideally to build a database of relevant web pages.

resource_fetcher.rb

Performs the fetching of HTTP resources. Uses faraday gem.

robots.rb

Fetches and parses sitemaps URLs from a robots.txt URL

sitemap.rb

Fetches and parses URLs from sitemap url (using SitemapParser gem). It also contains some static methods for url filtering.

html_document.rb

Parses HTML documents to extract common things like meta tags, buttons and elements containing combinations of attributes and/or text.

universal_resource.rb

Methods for interacting with and fetching/parsing/modifying a URL.

def robots
    robots = Robots.new('www.blacklabelskates.com')
    robots.fetch
    render plain: robots.body
  end

  def sitemap
    sitemap = Sitemap.new('https://www.blacklabelskates.com/sitemap.xml')
    sitemap.fetch
    render plain: sitemap.urls
  end

@brandondees
Copy link

please give some more context for the reviewer, since we don't know what you're trying to accomplish here, what types of feedback you want, etc.

you may also want to (very briefly) tell a bit of the story of how you got to this point, like where did this idea originate, what have you solved so far, and and what state you think it's in (does it work? is it complete, or just a start, or a component of a larger project?)


require 'faraday'

# Interface for fetching resources

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this type of comment doesn't add anything that the name doesn't already imply. i'd suggest deleting it or writing a more complete description of why and how to use it (like you would want from a library's docs)

class UniversalResource
attr_reader :url, :fetcher, :response

DOUBLE_SEGMENT_TLDS = [

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is this list all you care about? are you trying to make this into a general-purpose utility for anyone else to use? i suspect the real list is much longer, there's an officially governed "public suffix list" that you can look into and maybe find a way to pull it from an up to date / authoritative source.

@brandondees
Copy link

you say that sitemap has some stuff for dealing with url filtering. you also have a url class for doing url stuff. are those responsibilities sliced appropriately? are these classes meant to represent general purpose library tools or are they tailored for a specific task/feature/business case? if they're "business" logic then it's probably good to name them in business terms rather than generic/universal terms.

@brandondees
Copy link

none of this code is getting used anywhere... what's a use case example? how would you know if it works? automated tests are a good way to demonstrate that, and they provide other types of value such as design feedback, documentation, regression safety, etc.

puts "Checking for #{url}"
begin
resource = UniversalResource.new(url, fetcher).fetch
return resource.url if resource.document.exists?

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this returns the first resource url that exists, but i'm guessing that you want it to continue checking and return all of them using map rather than each?

@brandondees
Copy link

i think some of these classes are named as if they're generic libraries but you've actually built them to perform specific roles within a specific workflow, e.g. robots isn't really about robots.txt, it's a coordinator that gets sitemaps from a starting url, and parsing robots.txt is just a detail of that process.

different languages and frameworks have established cultures with varying levels of strongly adopted/enforced rules and conventions, but for your own private programs, and in ruby language in particular, there aren't many hard rules for things like naming, composition patterns, code organization, etc. - so feel free to use verbs as class names, or make stateful modules as if they're singleton classes, or apply functional programming style with pure functions, immutable data, and stateless classes, or no classes at all.

naming patterns that make sense in an enterprise app because of established community patterns, influence of dominant frameworks, dev tool conveniences, etc are all good to learn, but those things are not driven by clean principles, so if you really want to focus on SOLID, ignore all of those things and try experiments and sculpt the language of your program into what works best for you.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants