HADOOP-18679. Add API for bulk/paged object deletion #6494
Conversation
Force-pushed 1774c5b to 5afb659.
Add a FileUtils method to assist deletion here, with an explanation of why: it makes reflection binding straightforward, with no new types, just two methods.
Force-pushed 5afb659 to 0823d3f.
Force-pushed 0823d3f to 030ab9f.
A more minimal design that is easier to use and implement. The caller creates a BulkOperation; they get its page size and then submit batches of paths to delete, each no larger than that size. The outcome of each call contains a list of failures. The S3A implementation shows how straightforward it is. Even with a page size of one, it is still more efficient to use this, as it doesn't try to recreate a parent dir or perform any probes to see if the path is a directory: it maps straight to a DELETE call.

Change-Id: Ibe8737e7933fe03d39070e7138a710b50c3d60c2
Add methods in FileUtil to take an FS, cast it to a BulkDeleteSource, then perform the pageSize/bulkDelete operations. This is to make reflection-based access straightforward: no new interfaces or types to work with, just two new methods with type-erased lists.

Change-Id: I2d7b1bf8198422de635253fc087ca551a8bc6609
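A hedged sketch of why the type-erased static methods make reflection binding easy. Only the `FileUtil.bulkDelete(FileSystem, Path, List)` signature quoted later in this review is assumed; the wrapper class and variable names below are illustrative, not part of the PR.

```java
import java.lang.reflect.Method;
import java.util.List;
import java.util.Map;

import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

/** Illustrative reflection shim for code that cannot compile against newer Hadoop. */
public final class ReflectiveBulkDelete {

  private ReflectiveBulkDelete() {
  }

  @SuppressWarnings("unchecked")
  public static List<Map.Entry<Path, String>> bulkDelete(
      FileSystem fs, Path base, List<Path> paths) throws Exception {
    // Bind to the static method by name and parameter types only:
    // no new interfaces or result classes are needed on the caller's classpath.
    Method m = Class.forName("org.apache.hadoop.fs.FileUtil")
        .getMethod("bulkDelete", FileSystem.class, Path.class, List.class);
    return (List<Map.Entry<Path, String>>) m.invoke(null, fs, base, paths);
  }
}
```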
Change-Id: Ib098c07cc1f7747ed1a3131b252656c96c520a75
Using this PR to start with the initial design, implementation and services offered by having lower-level interaction with S3 pushed down into an S3AStore class, with an interface/impl split. The bulk delete callbacks now talk to the store, *not* S3AFS, with some minor changes in behaviour (IllegalArgumentException is raised if the root path / is to be deleted). Mock tests are failing; I expected that: they are always brittle.

What next? Get this in and then move lower-level fs ops over, a method calling the s3client at a time, or in groups, as appropriate. The metrics of success are:

* all callback classes created in the S3A FS can work through the store
* no direct s3client invocation in S3AFS

Change-Id: Ib5bc58991533fd5d13e78426e88b884b6ae5205c
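Purely illustrative sketch of the interface/impl split described in that commit message; the real S3AStore methods are not quoted in this conversation, so every member below is a placeholder rather than the actual API.

```java
import java.io.IOException;
import java.util.Collections;
import java.util.List;
import java.util.Map;

// Placeholder interface: callbacks talk to the store, not to S3AFileSystem.
interface S3AStoreSketch {
  /** Bulk delete of object keys; deleting the root path must be rejected. */
  List<Map.Entry<String, String>> bulkDelete(List<String> keys) throws IOException;
}

class S3AStoreImplSketch implements S3AStoreSketch {
  @Override
  public List<Map.Entry<String, String>> bulkDelete(List<String> keys) throws IOException {
    // Behaviour change noted in the commit message: reject root path deletion.
    if (keys.contains("/")) {
      throw new IllegalArgumentException("Cannot delete the root path");
    }
    // The store, not S3AFS, would own the s3client and issue the DELETE here.
    return Collections.emptyList();
  }
}
```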
Force-pushed 030ab9f to ea19f43.
Changing the results of method calls, using Tuples.pair() to return Map.Entry() instances as immutable tuples.

Change-Id: Ibdd5a5b11fe0a57b293b9cb623272e272c8bab69
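A minimal sketch of what such a pair helper might look like, assuming Tuples.pair() is a thin wrapper over an immutable Map.Entry; the class and method layout here are an assumption, not the PR's actual code.

```java
import java.util.AbstractMap.SimpleImmutableEntry;
import java.util.Map;

/** Illustrative pair helper built on Map.Entry. */
public final class Tuples {

  private Tuples() {
  }

  /** Return an immutable (key, value) tuple as a Map.Entry. */
  public static <K, V> Map.Entry<K, V> pair(K key, V value) {
    return new SimpleImmutableEntry<>(key, value);
  }
}
```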
I like the design.
// multi object delete flag
// this is always true, even if multi object
// delete is disabled - the page size is simply reduced to 1.
case CommonPathCapabilities.BULK_DELETE:
nit: won't this be a bit misleading?
It means the API is present, along with some of the semantics ("parent dir existence not guaranteed"). For that reason, it will always be faster than before: one DELETE; no LIST/HEAD calls etc.
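A small sketch of how a client might probe for this capability before using the API. The BULK_DELETE constant is taken from the diff above; the standard FileSystem.hasPathCapability() probe is assumed, and the wrapper class name is illustrative.

```java
import java.io.IOException;

import org.apache.hadoop.fs.CommonPathCapabilities;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public final class BulkDeleteProbe {

  private BulkDeleteProbe() {
  }

  /**
   * Probe for the bulk delete capability under a path.
   * A true result only says the API is present; if multi-object delete is
   * disabled the page size is simply 1, so callers should still check pageSize().
   */
  public static boolean bulkDeleteAvailable(FileSystem fs, Path path) throws IOException {
    return fs.hasPathCapability(path, CommonPathCapabilities.BULK_DELETE);
  }
}
```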
@Retries.RetryTranslated
private List<Map.Entry<String, String>> deleteSingleObject(final String key) throws IOException {
  try {
    once("bulkDelete", path, () ->
nit: this is a single object delete, not a bulk delete.
...hadoop-aws/src/main/java/org/apache/hadoop/fs/s3a/impl/BulkDeleteOperationCallbacksImpl.java
Looks good, great to have the new S3AStore class and some operations moved over
If multi-object delete is enabled (`fs.s3a.multiobjectdelete.enable` = true), as
it is by default, then the page size is limited to that defined in
`fs.s3a.bulk.delete.page.size`, which MUST be less than or equal to1000.
nit: space after to
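For reference, a hedged sketch of setting the two options quoted in the documentation excerpt above. The keys are taken from that excerpt; the chosen page size value and the helper class are just an example.

```java
import org.apache.hadoop.conf.Configuration;

public final class BulkDeleteConfigExample {

  private BulkDeleteConfigExample() {
  }

  /** Build an example configuration for bulk delete page sizing. */
  public static Configuration example() {
    Configuration conf = new Configuration();
    // Keep multi-object delete on (the default)...
    conf.setBoolean("fs.s3a.multiobjectdelete.enable", true);
    // ...and cap each bulk delete page at 250 keys (must be <= 1000).
    conf.setInt("fs.s3a.bulk.delete.page.size", 250);
    return conf;
  }
}
```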
public static List<Map.Entry<Path, String>> bulkDelete(FileSystem fs, Path base, List<Path> paths)
## S3A Implementation |
Can we move this section to the S3A docs instead, and link to it from here? I feel like S3A docs in hadoop-common are hard to find.
@@ -46,6 +46,9 @@ public final class StoreStatisticNames {
  /** {@value}. */
  public static final String OP_APPEND = "op_append";

  /** {@value}. */
  public static final String OP_BULK_DELETE = "op_bulk-delete";
super nit: change to op_bulk_delete to match how other OPs are named.
FYI I want to pull the rate limiter API of #6596 in here too; we'd have a rate limiter in the S3A store which, if enabled, would limit the number of deletes which can be issued on a bucket. Ideally it'd be at 3000 on S3 Standard, and off for S3 Express and third-party stores, to reduce the load this call can generate.
In #6686 I'm creating a new utils class for reflection access, nothing else, and proposing that all tests of the API use reflection, to be really confident it works and that there are no accidental changes which break reflection.
/**
 * Delete a list of files/objects.
 * <ul>
 * <li>Files must be under the path provided in {@link #basePath()}.</li>
Writing contract tests for this locally; I can't find the implementation of this in S3A.
 * The maximum number of objects/files to delete in a single request.
 * @return a number greater than or equal to zero.
 */
int pageSize();
Shouldn't this be greater than 0? A value equal to 0 doesn't make sense; also, we already have the check in the S3A impl.
HADOOP-18679.

A more minimal design that is easier to use and implement than #5993.

The caller creates a BulkOperation; they get its page size and then submit batches of paths to delete, each no larger than that size. The outcome of each call contains a list of failures.

The S3A implementation shows how straightforward it is. Even with a page size of one, it is still more efficient to use this, as it doesn't try to recreate a parent dir or perform any probes to see if the path is a directory: it maps straight to a DELETE call.
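A hedged sketch of what a caller of this design might look like. The bulkDelete signature is the one quoted earlier in this review; the page-size helper name (bulkDeletePageSize) is an assumption based on the "two new methods" mentioned in the commit message, and the wrapper class is illustrative.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.FileUtil;
import org.apache.hadoop.fs.Path;

public final class BulkDeleteClientSketch {

  private BulkDeleteClientSketch() {
  }

  /**
   * Delete every path in the list, submitting pages no larger than the
   * store's page size, and collect the per-path failures from each call.
   */
  public static List<Map.Entry<Path, String>> deleteAll(
      FileSystem fs, Path base, List<Path> paths) throws IOException {
    // Hypothetical helper name; only the bulkDelete signature is quoted in this PR.
    // Guard against a zero page size.
    int pageSize = Math.max(1, FileUtil.bulkDeletePageSize(fs, base));
    List<Map.Entry<Path, String>> failures = new ArrayList<>();
    for (int i = 0; i < paths.size(); i += pageSize) {
      List<Path> page = paths.subList(i, Math.min(i + pageSize, paths.size()));
      failures.addAll(FileUtil.bulkDelete(fs, base, page));
    }
    return failures;
  }
}
```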
How was this patch tested?
If the design looks good, I'll write some contract tests as well as a filesystem API specification.
For code changes:

* If applicable, have you updated the LICENSE, LICENSE-binary, NOTICE-binary files?