Path-based file storage #3

Merged: 3 commits, Jul 30, 2025

jagerman (Member) commented Jul 7, 2025

This converts the (large) Session storage server to use filesystem path-based storage for uploaded files instead of storing binary data in the database.

In the early file server days, file contents were stored in the database because total upload sizes were within a couple of TB, and this (in theory) let us replicate all storage between two servers. In practice, however, that replication was too bandwidth heavy to use, and so it has never been properly supported or used.

Additionally, storing tens of TB in a very busy PostgreSQL database has not worked particularly well: we had to implement file rotation of tables and disable vacuuming because vacuuming took absurdly long. (It also potentially bottlenecks at 32TB stored in a single table, though with file rotation we're not close to hitting that.)

This commit converts the storage to store all files on disk as either:

- `NNN/12345678910NNN`
- `Wu/Wu_KOJmFO306JVj08G9e-MuyHU2cEKJ071xXHdTNJz61`

where the former (subdirectory named after the last three digits of the ID) is used with backwards-compat numeric IDs, and the latter (subdirectory named after the first two base64 characters of the ID) is used when backwards-compat IDs are disabled. (Note that generated IDs already deliberately use the so-called "url safe" base64 encoding, with _ and - instead of / and +, so that the ID itself is an acceptable filename.)
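For illustration, the layout described above could be sketched roughly as follows. This is not the code from this PR: `storage_path` is a hypothetical helper, telling the two ID styles apart by their shape (rather than by the server's configuration) is an assumption, and so is zero-padding the directory name for numeric IDs shorter than three digits.

```python
import re
from pathlib import Path

# Hypothetical patterns for the two ID styles described above.
_NUMERIC_ID = re.compile(r"[0-9]+")
_URLSAFE_B64_ID = re.compile(r"[A-Za-z0-9_-]{2,}")


def storage_path(root: Path, file_id: str) -> Path:
    """Map a file ID to its location under `root` (illustrative sketch only)."""
    if _NUMERIC_ID.fullmatch(file_id):
        # Backwards-compat numeric ID: subdirectory is the last three digits.
        # Assumption: shorter IDs are zero-padded for the directory name.
        subdir = file_id[-3:].zfill(3)
    elif _URLSAFE_B64_ID.fullmatch(file_id):
        # Generated ID: subdirectory is the first two base64 characters.
        subdir = file_id[:2]
    else:
        # Anything else (including "/" or ".") is rejected, so an ID can
        # never point outside of `root`.
        raise ValueError(f"unacceptable file id: {file_id!r}")
    return root / subdir / file_id
```

Because the url-safe alphabet contains neither `/` nor `.`, validating the ID against it also rules out path traversal, which is one practical benefit of using the ID directly as the filename.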

This interim commit contains code to load both the existing in-database values as well as on-disk values, but a future commit will come along to remove support for in-database values.
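A minimal sketch of what such a dual-source lookup might look like, assuming the on-disk copy is tried first and the database is only consulted when no file exists. `load_file` and the `db_lookup` callback are hypothetical names, and the lookup order is an assumption rather than something stated in this PR.

```python
from pathlib import Path
from typing import Callable, Optional


def load_file(path: Path, db_lookup: Callable[[], Optional[bytes]]) -> Optional[bytes]:
    """Return file contents from disk, falling back to a legacy database lookup.

    `db_lookup` stands in for whatever query fetches the in-database blob;
    it returns None when the database has no row for the file either.
    """
    try:
        return path.read_bytes()
    except FileNotFoundError:
        # Not on disk: this may be an older upload still stored in the database.
        return db_lookup()
```

Once in-database values are no longer supported, the fallback branch would simply go away and a missing file would mean the upload does not exist.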

Related changes included here:

  • Increased the cleanup timer frequency to every 5s instead of 15s, so that we are doing fewer deletions more often during cleanup.
  • Fix psycopg pools not shutting down gracefully during restart
  • Removed slave database support. That was added as a way to let us transition from one server to another (e.g. by writing everything to both databases until, over time, the second database has everything in it), but that won't work with contents stored on disk.
  • Check expiry during retrieval so that we don't return files that should have been cleaned up (in case cleanup got delayed for some reason).
  • Update license copyright to reflect STF maintenance
  • Flask reformatting
  • Added a TODO about figuring out support for streaming when not making a subrequest.
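The expiry check during retrieval mentioned in the list above could be sketched along these lines. `FileRecord` and `is_retrievable` are hypothetical names, and storing the expiry as a Unix timestamp is an assumption made for the example.

```python
import time
from dataclasses import dataclass
from typing import Optional


@dataclass
class FileRecord:
    """Hypothetical metadata row for an uploaded file."""
    file_id: str
    expiry: float  # Unix timestamp after which the file must not be served


def is_retrievable(record: FileRecord, now: Optional[float] = None) -> bool:
    """Return True only if the file has not yet expired.

    Checking at retrieval time means an expired file is treated as missing
    even if the periodic cleanup has not deleted it yet.
    """
    if now is None:
        now = time.time()
    return record.expiry > now
```

A request handler would call this before reading the file and respond as if the file did not exist when it returns False.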

jagerman added 3 commits July 7, 2025 18:04
@jagerman jagerman merged commit e4ed4db into session-foundation:dev Jul 30, 2025