Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Adds zip downloads for all of a document's attachments #1471

Merged
merged 26 commits into from
Mar 3, 2025

Conversation

Spoffy
Copy link
Contributor

@Spoffy Spoffy commented Feb 20, 2025

Context

When using internal storage for attachments, all attachments are included in the .grist file when a document is downloaded. They can be extracted if needed using standard SQLite tools.

When using external storage, attachments are stored outside of Grist (e.g. MinIO, S3, filesystem) and can't be easily retrieved.

Proposed solution

This PR adds an API endpoint to download a .zip of all attachments in the document.

This is streamed where possible, resulting in no more than a single file being kept in memory at once (due to SQLite restrictions - external storage keeps no copies in memory).

Related issues

#1021

Has this been tested?

  • 👍 yes, I added tests to the test suite
  • 💭 no, because this PR is a draft and still needs work
  • 🙅 no, because this is not relevant here
  • 🙋 no, because I need help

@Spoffy Spoffy marked this pull request as draft February 20, 2025 20:09
@Spoffy Spoffy force-pushed the spoffy/attachment-downloads branch from 6d68c6d to 001004b Compare February 20, 2025 20:11
Copy link
Member

@paulfitz paulfitz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Quick q about access control, not a full review since this is still draft.

@@ -959,6 +960,32 @@ export class ActiveDoc extends EventEmitter {
return data;
}

public async getAttachmentsArchive(): Promise<Archive> {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There's no access control here. Enough ActiveDoc methods take responsibility for access control that I think you might want to add a comment that you're not doing access control (or take an OptDocSession and do access control). Specifically, so far the caller just has a canView, which isn't a very high bar. The easy path would be to use ActiveDoc.canDownload. More advanced would be to check view permissions on every attachment.

@Spoffy Spoffy requested a review from paulfitz February 27, 2025 16:06
@Spoffy Spoffy marked this pull request as ready for review February 27, 2025 16:06
Copy link
Member

@paulfitz paulfitz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Makes sense! Minor comments so far. Ping me when ready for re-review and everything done. Will make time to kick tires and try it out.

const filesSeen = new Set<string>();
for (const attachment of attachments) {
if (filesSeen.has(attachment.fileIdent)) {
continue;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we've agreed to not dedupe attachments that happen to have the same hash (but may have different names)

name: attachment.fileName,
data: file.contentStream,
});
// TODO - Abort this on shutdown by throwing an error
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There's a TODO here.


export interface Archive {
dataStream: stream.Readable,
completed: Promise<void>
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you format the end of lines of your interfaces consistently?


return {
dataStream: archive,
// TODO - Should we add a default 'catch' here that logs errors?
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There's a TODO here, I don't quite understand it.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right now, this promise swallows errors if there isn't a .catch or await used by the calling code. I'm not sure if we should add some default logging here just to be safe.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The comment claims that await stream.promises.finished(archive); will throw if there's an error? With the error caught error? If that's true, that seems enough to me. A docstring saying that the caller owns the Archive promise could be worthwhile just to draw attention? But apart from that, it would be just normal to not leave any promise you are given unhandled.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Perfect - thanks for clarifying!

import * as fse from 'fs-extra';
import * as stream from 'node:stream';
import * as path from 'path';
import {Readable} from 'node:stream';
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

import order

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh man, I'm really disappointed with this one - I did a review of imports before posting the PR for review.

}
}

describe('create_zip_archive', function () {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you wrap everything in a describe Archive? It is helpful heuristically to have a top level test name that matches the filename. Not 100% done in codebase but done enough that it is annoying when not done.

Copy link
Member

@paulfitz paulfitz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A note from testing.

res.status(200)
.type("application/zip")
// Construct a content-disposition header of the form 'attachment; filename="NAME"'
.set('Content-Disposition', contentDisposition(`${activeDoc.docName}.zip`, {type: 'attachment'}))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Gave this a try. Everything worked well. The name was a bit jarring, compared to other downloads associated with a document. What do you think about a more human friendly name, that rhymes with what other /download methods do?

private async _getDownloadFilename(req: Request, tableId?: string, optDoc?: Document): Promise<string> {
let filename = optStringParam(req.query.title, 'title');
if (!filename) {
// Query DB for doc metadata to get the doc data.
const doc = optDoc || await this._dbManager.getDoc(req);
const docTitle = doc.name;
const suffix = tableId ? (tableId === docTitle ? '' : `-${tableId}`) : '';
filename = docTitle + suffix || 'document';
}
return filename;
}

Something a bit weird seems to happen if I download with multiple distinct files that have the same name, but we've talked about changing the name structure so no need to poke at that...

Comment on lines 978 to 981
for (const attachment of attachments) {
if (filesSeen.has(attachment.fileIdent)) {
continue;
if (doc._shuttingDown) {
throw new ApiError("Document is shutting down, archiving aborted", 500);
}
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't really like this approach, because there's not technically an ActiveDoc dependency here - there's a DocStorage dependency, which happens to be shut down when ActiveDoc is. I'd like us to refactor shutdown handling at some point across ActiveDoc and all dependencies, but this feels okay for now?

Copy link
Member

@paulfitz paulfitz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @Spoffy !

!await this._granularAccess.canReadEverything(docSession) &&
!await this.canDownload(docSession)
) {
throw new ApiError("No access to download attachments", 403);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Perhaps "Insufficient access"

{...chimpy, responseType: 'arraybuffer'});
assert.equal(resp.status, 200);
assert.deepEqual(resp.headers['content-type'], 'application/zip');
assert.deepEqual(resp.headers['content-disposition'], `attachment; filename="TestDocExternalAttachments-Attachments.zip"`);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some long line linting around here.

Copy link
Member

@paulfitz paulfitz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @Spoffy !

@Spoffy Spoffy merged commit 5285628 into main Mar 3, 2025
12 checks passed
@Spoffy Spoffy deleted the spoffy/attachment-downloads branch March 3, 2025 19:35
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants