Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improving pgs asset serving performance (brainstorming) #149

Closed
mac-chaffee opened this issue Oct 13, 2024 · 18 comments · Fixed by #159
Closed

Improving pgs asset serving performance (brainstorming) #149

mac-chaffee opened this issue Oct 13, 2024 · 18 comments · Fixed by #159

Comments

@mac-chaffee
Copy link
Contributor

mac-chaffee commented Oct 13, 2024

Hello! This weekend I was interested in learning how pgs worked so I looked through the code and wrote down any possible places that I thought could affect performance. I didn't do any runtime testing, so take this with a grain of salt.

To serve a single HTML file, the following must happen:

  1. DNS lookup for TXT record to determine project name (findRouteConfig -> GetCustomDomain)
  2. Regex match on all routes (CreateServeBasic)
  3. DB Query 1 (WHERE clause on app_users.name) (FindUserForName)
  4. GetBucket() call to minio
  5. DB Query 2 (WHERE clause on projects.user_id && name) (FindProjectByName)
  6. DB Query 3 (WHERE clause on feature_flags.user_id && name) (HasFeatureForUser)
  7. GetObject() call to minio for _redirects, then parsed
    • (note: redirects may be ignored after minio error)
    • (note: redirect file loaded into RAM without limits)
  8. Regex match on every line of _redirects (calcRoutes)
  9. Three possibilities (handle):
    • If it's a redirect to an external site, it's fetched with http.get() and returned as-is
    • If it's a request to an image, it's fetched from imgproxy with http.get()
    • If it's a request to any other file, it's fetch from minio with GetObject()
  10. GetObject() call to minio for _headers, then parsed
  11. Regex match on every line of _headers
  12. DB Query 4 (WHERE clause on feature_flags.user_id && name) (AnalyticsVisitFromRequest)
  13. Regex match on possible crawlers (trackableRequest -> IsCrawler)
  14. (async) DB Query 5 (INSERT INTO analytics_visits) (AnalyticsCollect)

The following are some ideas for improving performance:

For (1), I predict the DNS lookup is the slowest operation in the list since (I think) all the other operations don't leave your single VM. Are you using Oracle Linux? If it's anything like RHEL, then local DNS caching is not enabled by default. If you enable systemd-resolved, it will cache 4096 responses for up to 2 hours and it will respect the TTL. Users should be encouraged to set high TTLs (>1 hour) to improve performance.

For the database queries (3, 5, 6, 13), an easy win would be to fetch the feature_flags in the same query where we fetch the user, but then we'd still be performing 2 queries per request. Possibly caching is a better solution, see below.

For the GetBucket() call (4), that will send a BucketExists() request to Minio. Technically that's not necessary since you can create Bucket objects using just the name of the bucket.

For the GetObject() calls to read _redirects and _headers, I think caching is our only hope. Caching these would also allow us to cache the compiled regexes.

Caching

Since all of this data is small, I think we could use an in-process, in-memory cache like https://github.com/hashicorp/golang-lru (the 2q implementation sounds smarter since it considers frequency in addition to recency) NVM, if we want to set ttls, we have to use the LRU version. The following work would be required:

  1. Limit _redirects and _headers to something like 5KB so our cache size is bounded.
  2. When a site is first requested, we create something like that AssetHandler struct (plus routes parsed from _redirects and _headers) and save it to the cache, keyed on the user-project slug with a default TTL of something reasonable like 1 hour (so any caching bug we happen to introduce resolves itself in 1 hour).
    • We check for the cached version before creating a new one.
    • We don't cache the TXT record; we rely on systemd-resolved caching since that is designed to respect DNS TTLs. That allows us to quickly determine the user-project slug key that we need for reading from our own cache.
  3. When any of the following happens, we delete our cache entry if it exists:
    • User deleted (or just let it expire)
    • Project deleted (or just let it expire)
    • Project disabled (or just let it expire)
    • User flags change like pico+ or analytics
    • New _redirects or _headers file uploaded (or I guess we could clear the cache on any upload as a user-controlled way of clearing their own cache)

If we do all that, then we can serve assets with a single locally-cached DNS lookup, a single hash table lookup, and a single GetObject() minio call! 🚀

Thoughts? I can contribute for some of this. It's a pretty big change, so just wanted to run it by you before diving too deep.

@neurosnap
Copy link
Member

Hi! Wow, thanks for this in-depth feeback!

So I like all of these changes and can commit to getting this across the finish line.

There are a couple of pieces that require us to help:

  • We need to tweak the VM for DNS caching (need to confirm what we have enabled)
  • We need to make a decision on file size limit for _redirects and _headers

5kb seems perfectly reasonable to me so let's go for it.

For caching, what do you think about the ability to wipe the in-memory cache? In particular:

  • Admin ability to wipe cache
  • User ability to wipe their site/sites cache

Do you have any thoughts on supporting those features?

Finally, let me know what parts you'd like to work on and I can commit to whatever else.

Thanks again, this is awesome!

@mac-chaffee
Copy link
Contributor Author

Some good discussion in this PR: #154

@mac-chaffee
Copy link
Contributor Author

antoniomika and I have had a chat and because of the way we record analytics and consider it a first-class feature at pico, we are leaning towards implementing the caching mechanism in our app code, although we can be convinced otherwise if there's a better way to outsource http caching (which is a solved problem that doesn't need re-inventing).

One thing about the analytics that worries me is that they're going into an unindexed postgres table with no expiration mechanism. Checking analytics for my site takes 3+ seconds already, which doesn't leave a lot of room for adding new features on top of analytics. That's in addition to complicating caching by making all requests hit the origin server to be counted.

I'm wondering if we should be rebuilding analytics to work with caching rather than building caching around the current analytics implementation.

The main change would be "pulling" analytics from whatever is doing the caching (either a CDN or a distributed cache storage system) instead of making the caching system "push" view counts to pgs. Could also use the opportunity to store the metrics in a time series database. Queries like "what time of day do I get the most traffic?" are a lot more efficient in a time series database.

The specifics would depend on which method you pick for caching. But you could make that choice free from the burden of working around the current analytics system. If you do decide to build analytics around the cache, then we could just disable caching for .html files as a first step as we build the caching system.

Just thinking out loud, no strong opinions. My main goal is ensuring you're happy with the resulting setup :)

@neurosnap
Copy link
Member

neurosnap commented Oct 29, 2024

One thing about the analytics that worries me is that they're going into an unindexed postgres table with no expiration mechanism. Checking analytics for my site takes 3+ seconds already, which doesn't leave a lot of room for adding new features on top of analytics. That's in addition to complicating caching by making all requests hit the origin server to be counted.

Thanks for the reminder! This got us from 3+ seconds to around 1 second: ac3be17

I'm wondering if we should be rebuilding analytics to work with caching rather than building caching around the current analytics implementation.

I love this thinking. To provide some context, the reason why we built our own analytics system is because we wanted to be in full control of how the data gets aggregated and stored without using or being up-sold on something off-the-shelf. We have positioned ourselves to host privacy-focused services and all the other self-hosted solutions that I found felt very heavy. It didn't pass the BYO test for me.

Queries like "what time of day do I get the most traffic?" are a lot more efficient in a time series database.

I'm definitely open to converting our analytics table to use timescaleDB and I did investigate it when building but decided against it mainly because of the complexity. I do think there are other things we can do with a vanilla db table that we can try before switching over (e.g. partitioned table). However, that doesn't change our push/pull mechanism.

If we went with Souin or another cache system off-the-shelf, how could we record site-usage analytics using pull?

@mac-chaffee
Copy link
Contributor Author

mac-chaffee commented Oct 30, 2024

Wow the analytics are a heck of a lot faster now! Thanks! 🚀

We have positioned ourselves to host privacy-focused services

Thanks for clarifying, that makes sense (and is one of the reasons I moved my site to pgs!). I think that means we can cross major CDN providers off the list (using tuns instead of Cloudflare Tunnels is another reason I signed up).

I also have doubts about http-cache in the context of multi-regional pgs due to the cache invalidation issue.

I can think of four possible options that involve rebuilding analytics around caching:

Option 1: Scrape Prometheus metrics from Souin

If we use Souin, we'd need a new feature added for obtaining page view counts. I'd bet darkweak would want this feature to be generic enough for others to use, and the best idea I can think of is to expand the existing prometheus metrics to include hostname and route labels (disabled by default, enableable via config). Seems you're already collecting some prometheus metrics for pgs.

Pros:

  • If you expand pgs to multiple regions, it's typical to scrape from each region, then summing the page views together is straightforward; e.g. topk(10, sum by (route) (souin_total_request_counter{host="www.example.com"}))

Cons:

  • High/unbounded cardinality (num_sites * num_users * num_files * num_servers * num_other_labels)
  • No simple way to exclude bot requests. Could add the user-agent to the labels, but that makes the cardinality problem way worse and there's no way you're packing over 1000 regexes into your prometheus queries.

Lack of ability to filter out bots may make this option a non-starter.

Option 2: Souin + no-cache + conditional GET requests for HTML files

This option involves allowing Souin to cache all files except HTML files. For HTML files, we return the following headers:

  • Cache-Control: no-cache to prevent Souin from caching the HTML files
  • Etag: <hash> so browsers will perform a conditional GET request (If-None-Match) for the HTML file

Then we add code to pgs that returns a 304 Not Modified response, while also asynchronously inserting the page view into postgres like normal.

Pgs would probably need it's own kind of cache+TTL thing, like a hashmap of Etags so it can quickly return 304s. This hashmap would of course need to be purged somehow from pgs-ssh.

Pros:

Cons:

  • That extra hashmap is a bit complicated/non-standard and makes cache purging complicated
  • Not very friendly to multiple pgs regions as a result

Option 3: JavaScript Beacon-based analytics

Something at the back of my mind in the discussion of analytics is that really the best kind of caching is client-side caching (I set Cache-Control: max-age=1800 for all my site's static assets), but that completely breaks analytics. That's why most analytics providers use a little JavaScript with sendBeacon() to count page views. This can still respect privacy, see https://www.goatcounter.com/

Pros:

  • Can merge Implement caching using Souin #154 as-is
  • Simple, scalable, multi-region-friendly
  • Analytics mostly unchanged, just moved to a dedicated route
  • Can enable client-side caching

Cons:

  • Users have to add the JavaScript to their sites
  • Users with ad blockers and script blockers may not be counted

Option 4: Use Souin and parse Caddy access logs

Apparently this is how Cloudflare's analytics works when you don't use their JavaScript.

You'd need the following changes:

  • Configure Caddy's log format for your needs
  • Set up a log aggregation system like Loki (needed for multi-region)
  • Create a scheduled task to parse the logs, filters out bots, and INSERT page views into postgres

Pros:

Cons:

  • Separate infrastructure required (log aggregator)
  • Requires writing the log parsing code (can be hard to avoid overlaps between runs)

I'm leaning toward option 4, but what do you think? Are there other options?

@mac-chaffee
Copy link
Contributor Author

Btw I updated my branch with some of things discussed: main...mac-chaffee:pico:caddy-caching

  • Disabled caching for HTML files to support Option 2 (just a proof of concept; can be removed for options 1/3/4)
  • Disabled caching for proxied responses
  • Updated Souin to v1.7.4 which includes darkweak's fixes for the surrogate key caching bug

@neurosnap
Copy link
Member

neurosnap commented Nov 15, 2024

Hey @mac-chaffee

Sorry for the long delay, we had some other pico work that we thought we could fold nicely into your caching work. In particular, we have deployed a change to how we collect site usage analytics. We are now using pipe to collect these metrics (we are calling it a metric-drain).

I think you are right, we should go with option (4).

Here's a patchset that I'm prototyping to adapt our metric drain to receive caddy logs: https://pr.pico.sh/prs/35

Once that is complete all we need to do is figure out how to send the logs from caddy to pipe in a resilient way. You can let us think about that.

For the sake of argument, let's say we have a solution for caddy and our metric-drain, what's left for us to start using your contributions?

@mac-chaffee
Copy link
Contributor Author

No worries!

You all are really serious about dog-fooding haha, that sounds like a cool solution.

For my own understanding, have you decided roughly what the multi-region setup would look like? Would this be like hub-and-spoke where all the stateful stuff (database, minio) lives on a single "hub" server with regional "spokes" that run caddy+pgs/pico?

@mac-chaffee
Copy link
Contributor Author

what's left for us to start using your contributions?

To answer the question, honestly I think I'd just have to rebase my branch and try it out! When the metrics-drain+caddy logs solution is ready, you'd just delete the line that disables caching for html files: r.Header.Set("cache-control", "no-cache").

@neurosnap
Copy link
Member

Re-opening because I think we are at the point with Souin that we might need to investigate alternatives solutions before adding another service like redis to our stack just for http caching. Like I mentioned, we are not against adding redis but it does add complexity to our solution design: #174 (comment)

I'm going to spend some more time looking into alternatives that aren't directly tied to Caddy.

I wonder how hard it would be to use or fork something like this for our use case:

@mac-chaffee
Copy link
Contributor Author

In parallel, I can do some looking into the caddy.UsagePool shared-state situation to help us know for sure if this a show-stopper or just an edge-case bug.

One reason I personally shy away from maintaining an HTTP cache is the number of cache poisoning footguns that exist when creating cache keys. Not insurmountable, but worth keeping in mind.

@neurosnap
Copy link
Member

I'm not sure the best place to put this yet so I'm leaving it here: I cannot PURGE using the Surrogate-Key locally -- or in prod. I'm looking into it

@mac-chaffee
Copy link
Contributor Author

In case of emergency, restarting Caddy can be a backup plan.

@neurosnap
Copy link
Member

FYI: darkweak/souin#583

@neurosnap
Copy link
Member

It does look like badger is working as a single cache, btw, so I say we go with that for now. We just gotta figure out this surrogate key issue.

@mac-chaffee
Copy link
Contributor Author

mac-chaffee commented Dec 11, 2024

Badger stores on-disk rather than in-memory. Looking at the Souin repo history, seems like Olric is the second oldest storage option (second to redis, but olric is embeddable) which may mean it's more stable. I can play around with it though.

@neurosnap
Copy link
Member

Check out this idea: #175

@mac-chaffee
Copy link
Contributor Author

Closed by #175

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
2 participants