Path-based file storage #3

jagerman · 2025-07-07T21:11:47Z

This converts the (large) Session storage server to use filesystem path-based storage for uploaded files instead of storing binary data in the database.

In the early file server days, file contents were stored in the database because total upload sizes were within a couple of TB, and this (in theory) let us replicate all storage between two servers. In practice, however, that replication was too bandwidth-heavy to use, and so it has never been properly supported or used.

Additionally, storing tens of TB in a very busy PostgreSQL database has not worked particularly well, requiring us to implement file rotation of tables and disable vacuuming because of the absurdly long times required to vacuum. (It also potentially bottlenecks at 32TB stored in a single table, though with file rotation we're not close to hitting that.)

This commit converts the storage to store all files on disk as either:

- `NNN/12345678910NNN`
- `Wu/Wu_KOJmFO306JVj08G9e-MuyHU2cEKJ071xXHdTNJz61`

where the former (last three digits) is used with backwards-compat numeric IDs, and the latter (first two b64 chars) is used when backwards-compat IDs are disabled. (Note that generated IDs already deliberately use so-called "url safe" base64 encoding, with _ and - instead of / and +, so that the straight ID is an acceptable filename.)
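As a rough illustration of the layout described above, the following sketch maps an ID to its relative path. The helper name `storage_path`, the digit-based detection of numeric IDs, and the character validation are assumptions made for this example, not the code in this PR (which presumably selects the scheme from the backwards-compat setting). How numeric IDs shorter than three digits are handled is not stated here, so the sketch simply uses whatever digits are available.

```python
import re
from pathlib import PurePosixPath

# Assumption: IDs consist only of digits or URL-safe base64 characters.
_VALID_ID = re.compile(r"[A-Za-z0-9_-]+")


def storage_path(file_id: str) -> PurePosixPath:
    """Return the relative on-disk path for a file ID (hypothetical helper)."""
    if not _VALID_ID.fullmatch(file_id):
        # Rejects "/", "..", "+", and anything else unsafe as a filename.
        raise ValueError(f"invalid file id: {file_id!r}")
    if file_id.isdigit():
        # Backwards-compat numeric ID: directory is the last three digits.
        subdir = file_id[-3:]
    else:
        # URL-safe base64 ID: directory is the first two characters.
        subdir = file_id[:2]
    return PurePosixPath(subdir) / file_id
```

For example, `storage_path("12345678910123")` gives `123/12345678910123`, and the base64 ID shown above lands under `Wu/`.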

This interim commit contains code to load both the existing in-database values as well as on-disk values, but a future commit will come along to remove support for in-database values.
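The interim dual-source loading could look roughly like the sketch below. The function name `load_file`, the `db_lookup` callback, and the disk-first lookup order are all assumptions for illustration; the actual order and query used in this PR may differ.

```python
from pathlib import Path
from typing import Callable, Optional


def load_file(
    root: Path,
    rel_path: str,
    db_lookup: Callable[[], Optional[bytes]],
) -> Optional[bytes]:
    """Load file contents from disk, falling back to the legacy database.

    `db_lookup` is a hypothetical callback standing in for the query that
    fetches an old in-database value; it returns None when no row exists.
    """
    try:
        return (root / rel_path).read_bytes()
    except FileNotFoundError:
        # Not on disk: it may be an older upload still stored in the database.
        return db_lookup()
```

Once in-database values are no longer supported, the fallback branch would reduce to returning `None`.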

Related changes included here:

- Increased the cleanup timer frequency to every 5s instead of every 15s, so that we are doing fewer deletions, more often, during cleanup.
- Fixed psycopg pools not shutting down gracefully during restart.
- Removed slave database support. That was added as a way to let us transition from one server to another (e.g. by writing everything to both databases until, over time, the second database has everything in it), but that won't work with contents stored on disk.
- Check expiry during retrieval so that we don't return files that should have been cleaned up (in case cleanup got delayed for some reason).
- Updated license copyright to reflect STF maintenance.
- Flask reformatting.
- Added a TODO about figuring out support for streaming when *not* making a subrequest.
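The expiry check during retrieval amounts to something like the following sketch. The names `FileMeta` and `retrievable`, and the use of a Unix timestamp for the expiry, are assumptions for illustration rather than the schema used by the server.

```python
import time
from dataclasses import dataclass
from typing import Optional


@dataclass
class FileMeta:
    """Hypothetical metadata row for a stored file."""
    file_id: str
    expiry: float  # assumed to be a Unix timestamp, in seconds


def retrievable(meta: FileMeta, now: Optional[float] = None) -> bool:
    """Return True only if the file has not yet expired.

    Treating an expired file as not found means a delayed cleanup pass
    cannot cause stale files to be served.
    """
    if now is None:
        now = time.time()
    return meta.expiry > now
```

A retrieval handler would call `retrievable(meta)` before reading the file and respond as if the file did not exist when it returns `False`.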

jagerman added 3 commits July 7, 2025 18:04

Reduce log spam for not-found files

2d2b1d0

fix removed counter not counting in-db files

e4ed4db

jagerman merged commit e4ed4db into session-foundation:dev Jul 30, 2025
