HADOOP-18679. Add API for bulk/paged object deletion #6494

@steveloughran steveloughran commented Jan 24, 2024

HADOOP-18679.

A more minimal design that is easier to use and implement than #5993

Caller creates a BulkOperation; they get the page size of it and then submit batches to delete of less than that size.

The outcome of each call contains a list of failures.

S3A implementation to show how straightforward it is.

Even with the single entry page size, it is still more efficient to use this as it doesn't try to recreate a parent dir or perform any probes to see if it is a directory: it maps straight to a DELETE call.
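The paging pattern described above can be sketched as follows. This is only an illustration under stated assumptions: `BulkOperationSketch` and `PagedDelete` are hypothetical names, paths are plain strings rather than Hadoop `Path` objects, and a page is assumed to hold up to `pageSize()` entries.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

/** Hypothetical stand-in for the bulk operation in this PR; NOT the Hadoop interface. */
interface BulkOperationSketch {
  /** Maximum number of paths accepted in a single call. */
  int pageSize();

  /** Delete one page of paths; returns (path, error) pairs for the failures only. */
  List<Map.Entry<String, String>> bulkDelete(List<String> paths);
}

final class PagedDelete {
  private PagedDelete() {}

  /** Split the paths into consecutive pages of at most pageSize entries. */
  static List<List<String>> pages(List<String> paths, int pageSize) {
    if (pageSize < 1) {
      throw new IllegalArgumentException("page size must be at least 1: " + pageSize);
    }
    List<List<String>> result = new ArrayList<>();
    for (int start = 0; start < paths.size(); start += pageSize) {
      int end = Math.min(start + pageSize, paths.size());
      result.add(new ArrayList<>(paths.subList(start, end)));
    }
    return result;
  }

  /** Submit every page in turn and collect the failures each call reports. */
  static List<Map.Entry<String, String>> deleteAll(BulkOperationSketch op, List<String> paths) {
    List<Map.Entry<String, String>> failures = new ArrayList<>();
    for (List<String> page : pages(paths, op.pageSize())) {
      failures.addAll(op.bulkDelete(page));
    }
    return failures;
  }
}
```

With a page size of 1 the loop degenerates to one call per path, which matches the single-entry case mentioned above.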

How was this patch tested?

If the design looks good, I'll write some contract tests as well as a filesystem api
specification.

For code changes:

  • Does the title of this PR start with the corresponding JIRA issue id (e.g. 'HADOOP-17799. Your PR title ...')?
  • Object storage: have the integration tests been executed and the endpoint declared according to the connector-specific documentation?
  • If adding new dependencies to the code, are these dependencies licensed in a way that is compatible for inclusion under ASF 2.0?
  • If applicable, have you updated the LICENSE, LICENSE-binary, NOTICE-binary files?

@hadoop-yetus

💔 -1 overall

Vote Subsystem Runtime Logfile Comment
+0 🆗 reexec 0m 49s Docker mode activated.
_ Prechecks _
+1 💚 dupname 0m 0s No case conflicting files found.
+0 🆗 codespell 0m 0s codespell was not available.
+0 🆗 detsecrets 0m 0s detect-secrets was not available.
+1 💚 @author 0m 0s The patch does not contain any @author tags.
-1 ❌ test4tests 0m 0s The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch.
_ trunk Compile Tests _
+0 🆗 mvndep 14m 21s Maven dependency ordering for branch
-1 ❌ mvninstall 7m 7s /branch-mvninstall-root.txt root in trunk failed.
-1 ❌ compile 9m 3s /branch-compile-root-jdkUbuntu-11.0.21+9-post-Ubuntu-0ubuntu120.04.txt root in trunk failed with JDK Ubuntu-11.0.21+9-post-Ubuntu-0ubuntu120.04.
-1 ❌ compile 8m 32s /branch-compile-root-jdkPrivateBuild-1.8.0_392-8u392-ga-1~20.04-b08.txt root in trunk failed with JDK Private Build-1.8.0_392-8u392-ga-1~20.04-b08.
+1 💚 checkstyle 4m 38s trunk passed
+1 💚 mvnsite 2m 18s trunk passed
+1 💚 javadoc 1m 23s trunk passed with JDK Ubuntu-11.0.21+9-post-Ubuntu-0ubuntu120.04
+1 💚 javadoc 1m 3s trunk passed with JDK Private Build-1.8.0_392-8u392-ga-1~20.04-b08
+1 💚 spotbugs 4m 14s trunk passed
-1 ❌ shadedclient 11m 29s branch has errors when building and testing our client artifacts.
_ Patch Compile Tests _
+0 🆗 mvndep 0m 28s Maven dependency ordering for patch
+1 💚 mvninstall 1m 47s the patch passed
-1 ❌ compile 12m 33s /patch-compile-root-jdkUbuntu-11.0.21+9-post-Ubuntu-0ubuntu120.04.txt root in the patch failed with JDK Ubuntu-11.0.21+9-post-Ubuntu-0ubuntu120.04.
-1 ❌ javac 12m 33s /patch-compile-root-jdkUbuntu-11.0.21+9-post-Ubuntu-0ubuntu120.04.txt root in the patch failed with JDK Ubuntu-11.0.21+9-post-Ubuntu-0ubuntu120.04.
-1 ❌ compile 12m 23s /patch-compile-root-jdkPrivateBuild-1.8.0_392-8u392-ga-1~20.04-b08.txt root in the patch failed with JDK Private Build-1.8.0_392-8u392-ga-1~20.04-b08.
-1 ❌ javac 12m 23s /patch-compile-root-jdkPrivateBuild-1.8.0_392-8u392-ga-1~20.04-b08.txt root in the patch failed with JDK Private Build-1.8.0_392-8u392-ga-1~20.04-b08.
+1 💚 blanks 0m 0s The patch has no blanks issues.
-0 ⚠️ checkstyle 6m 18s /results-checkstyle-root.txt root: The patch generated 1 new + 3 unchanged - 0 fixed = 4 total (was 3)
+1 💚 mvnsite 2m 56s the patch passed
-1 ❌ javadoc 1m 18s /results-javadoc-javadoc-hadoop-common-project_hadoop-common-jdkUbuntu-11.0.21+9-post-Ubuntu-0ubuntu120.04.txt hadoop-common-project_hadoop-common-jdkUbuntu-11.0.21+9-post-Ubuntu-0ubuntu120.04 with JDK Ubuntu-11.0.21+9-post-Ubuntu-0ubuntu120.04 generated 2 new + 0 unchanged - 0 fixed = 2 total (was 0)
+1 💚 javadoc 1m 9s the patch passed with JDK Private Build-1.8.0_392-8u392-ga-1~20.04-b08
-1 ❌ spotbugs 1m 34s /patch-spotbugs-hadoop-tools_hadoop-aws.txt hadoop-aws in the patch failed.
-1 ❌ shadedclient 3m 51s patch has errors when building and testing our client artifacts.
_ Other Tests _
-1 ❌ unit 17m 58s /patch-unit-hadoop-common-project_hadoop-common.txt hadoop-common in the patch passed.
-1 ❌ unit 1m 2s /patch-unit-hadoop-tools_hadoop-aws.txt hadoop-aws in the patch passed.
-1 ❌ asflicense 0m 41s /results-asflicense.txt The patch generated 1 ASF License warnings.
135m 40s
Reason Tests
Failed junit tests hadoop.ipc.TestRPC
hadoop.util.TestDataChecksum
hadoop.fs.s3a.commit.staging.TestDirectoryCommitterScale
hadoop.fs.s3a.TestS3ADeleteOnExit
Subsystem Report/Notes
Docker ClientAPI=1.44 ServerAPI=1.44 base: https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-6494/1/artifact/out/Dockerfile
GITHUB PR #6494
Optional Tests dupname asflicense compile javac javadoc mvninstall mvnsite unit shadedclient spotbugs checkstyle codespell detsecrets
uname Linux 9de7a95ad1b3 5.15.0-88-generic #98-Ubuntu SMP Mon Oct 2 15:18:56 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux
Build tool maven
Personality dev-support/bin/hadoop.sh
git revision trunk / 1774c5b
Default Java Private Build-1.8.0_392-8u392-ga-1~20.04-b08
Multi-JDK versions /usr/lib/jvm/java-11-openjdk-amd64:Ubuntu-11.0.21+9-post-Ubuntu-0ubuntu120.04 /usr/lib/jvm/java-8-openjdk-amd64:Private Build-1.8.0_392-8u392-ga-1~20.04-b08
Test Results https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-6494/1/testReport/
Max. process+thread count 302 (vs. ulimit of 5500)
modules C: hadoop-common-project/hadoop-common hadoop-tools/hadoop-aws U: .
Console output https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-6494/1/console
versions git=2.25.1 maven=3.6.3 spotbugs=4.2.2
Powered by Apache Yetus 0.14.0 https://yetus.apache.org

This message was automatically generated.

@steveloughran steveloughran force-pushed the s3/HADOOP-18679-bulk-delete-api-lean branch from 1774c5b to 5afb659 Compare January 26, 2024 12:15
@hadoop-yetus

💔 -1 overall

Vote Subsystem Runtime Logfile Comment
+0 🆗 reexec 0m 57s Docker mode activated.
_ Prechecks _
+1 💚 dupname 0m 0s No case conflicting files found.
+0 🆗 codespell 0m 1s codespell was not available.
+0 🆗 detsecrets 0m 1s detect-secrets was not available.
+1 💚 @author 0m 0s The patch does not contain any @author tags.
-1 ❌ test4tests 0m 0s The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch.
_ trunk Compile Tests _
+0 🆗 mvndep 14m 10s Maven dependency ordering for branch
+1 💚 mvninstall 35m 50s trunk passed
+1 💚 compile 18m 13s trunk passed with JDK Ubuntu-11.0.21+9-post-Ubuntu-0ubuntu120.04
+1 💚 compile 16m 31s trunk passed with JDK Private Build-1.8.0_392-8u392-ga-1~20.04-b08
+1 💚 checkstyle 4m 37s trunk passed
+1 💚 mvnsite 2m 30s trunk passed
+1 💚 javadoc 1m 47s trunk passed with JDK Ubuntu-11.0.21+9-post-Ubuntu-0ubuntu120.04
+1 💚 javadoc 1m 33s trunk passed with JDK Private Build-1.8.0_392-8u392-ga-1~20.04-b08
+1 💚 spotbugs 3m 43s trunk passed
+1 💚 shadedclient 39m 48s branch has no errors when building and testing our client artifacts.
_ Patch Compile Tests _
+0 🆗 mvndep 0m 36s Maven dependency ordering for patch
+1 💚 mvninstall 2m 8s the patch passed
+1 💚 compile 18m 41s the patch passed with JDK Ubuntu-11.0.21+9-post-Ubuntu-0ubuntu120.04
+1 💚 javac 18m 41s the patch passed
+1 💚 compile 17m 6s the patch passed with JDK Private Build-1.8.0_392-8u392-ga-1~20.04-b08
+1 💚 javac 17m 6s the patch passed
+1 💚 blanks 0m 0s The patch has no blanks issues.
-0 ⚠️ checkstyle 4m 31s /results-checkstyle-root.txt root: The patch generated 1 new + 3 unchanged - 0 fixed = 4 total (was 3)
+1 💚 mvnsite 2m 28s the patch passed
-1 ❌ javadoc 1m 9s /results-javadoc-javadoc-hadoop-common-project_hadoop-common-jdkUbuntu-11.0.21+9-post-Ubuntu-0ubuntu120.04.txt hadoop-common-project_hadoop-common-jdkUbuntu-11.0.21+9-post-Ubuntu-0ubuntu120.04 with JDK Ubuntu-11.0.21+9-post-Ubuntu-0ubuntu120.04 generated 4 new + 0 unchanged - 0 fixed = 4 total (was 0)
+1 💚 javadoc 1m 33s the patch passed with JDK Private Build-1.8.0_392-8u392-ga-1~20.04-b08
+1 💚 spotbugs 4m 5s the patch passed
+1 💚 shadedclient 38m 38s patch has no errors when building and testing our client artifacts.
_ Other Tests _
+1 💚 unit 19m 5s hadoop-common in the patch passed.
+1 💚 unit 3m 7s hadoop-aws in the patch passed.
+1 💚 asflicense 0m 57s The patch does not generate ASF License warnings.
260m 38s
Subsystem Report/Notes
Docker ClientAPI=1.44 ServerAPI=1.44 base: https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-6494/2/artifact/out/Dockerfile
GITHUB PR #6494
Optional Tests dupname asflicense compile javac javadoc mvninstall mvnsite unit shadedclient spotbugs checkstyle codespell detsecrets
uname Linux f605ff408523 5.15.0-88-generic #98-Ubuntu SMP Mon Oct 2 15:18:56 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux
Build tool maven
Personality dev-support/bin/hadoop.sh
git revision trunk / 5afb659
Default Java Private Build-1.8.0_392-8u392-ga-1~20.04-b08
Multi-JDK versions /usr/lib/jvm/java-11-openjdk-amd64:Ubuntu-11.0.21+9-post-Ubuntu-0ubuntu120.04 /usr/lib/jvm/java-8-openjdk-amd64:Private Build-1.8.0_392-8u392-ga-1~20.04-b08
Test Results https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-6494/2/testReport/
Max. process+thread count 3137 (vs. ulimit of 5500)
modules C: hadoop-common-project/hadoop-common hadoop-tools/hadoop-aws U: .
Console output https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-6494/2/console
versions git=2.25.1 maven=3.6.3 spotbugs=4.2.2
Powered by Apache Yetus 0.14.0 https://yetus.apache.org

This message was automatically generated.

@steveloughran steveloughran marked this pull request as draft February 9, 2024 12:48
@steveloughran
Contributor Author

+add FileUtils methods to assist deletion here: `FileUtils.bulkDeletePageSize(path) -> int` and `FileUtils.bulkDelete(path, List) -> List`; each will create a bulk delete object, execute the operation/probe and then close it.

why so?

Makes reflection binding straightforward: no new types; just two methods.
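A rough sketch of why two static methods with type-erased lists are easy to bind to by reflection. Everything here is hypothetical: `BulkDeleteHelpers` is a stand-in with placeholder behaviour and string paths, not the Hadoop class, and the method names are borrowed from the comment above purely for illustration.

```java
import java.lang.reflect.Method;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical stand-in for the two static helpers; NOT the Hadoop class. */
final class BulkDeleteHelpers {
  private BulkDeleteHelpers() {}

  /** Placeholder page size for the sketch. */
  public static int bulkDeletePageSize(String path) {
    return 1;
  }

  /** Type-erased lists in and out; an empty result means "no failures". */
  @SuppressWarnings("rawtypes")
  public static List bulkDelete(String base, List paths) {
    return new ArrayList();
  }
}

final class ReflectionBinding {
  private ReflectionBinding() {}

  /** Look up and invoke the page-size helper using only JDK types. */
  static int pageSize(String helperClassName, String path) {
    try {
      Method m = Class.forName(helperClassName)
          .getMethod("bulkDeletePageSize", String.class);
      return (Integer) m.invoke(null, path);
    } catch (ReflectiveOperationException e) {
      throw new IllegalStateException("bulk delete helpers not available", e);
    }
  }

  /** Look up and invoke the delete helper; the caller sees only java.util.List. */
  static List<?> bulkDelete(String helperClassName, String base, List<?> paths) {
    try {
      Method m = Class.forName(helperClassName)
          .getMethod("bulkDelete", String.class, List.class);
      return (List<?>) m.invoke(null, base, paths);
    } catch (ReflectiveOperationException e) {
      throw new IllegalStateException("bulk delete helpers not available", e);
    }
  }
}
```

Because the caller only names a class as a string and passes JDK types, it can compile against a library version that lacks the helpers and degrade gracefully at runtime.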

@steveloughran steveloughran force-pushed the s3/HADOOP-18679-bulk-delete-api-lean branch from 5afb659 to 0823d3f Compare February 9, 2024 13:47
@hadoop-yetus

💔 -1 overall

Vote Subsystem Runtime Logfile Comment
+0 🆗 reexec 0m 51s Docker mode activated.
_ Prechecks _
+1 💚 dupname 0m 0s No case conflicting files found.
+0 🆗 codespell 0m 1s codespell was not available.
+0 🆗 detsecrets 0m 1s detect-secrets was not available.
+1 💚 @author 0m 0s The patch does not contain any @author tags.
-1 ❌ test4tests 0m 0s The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch.
_ trunk Compile Tests _
+0 🆗 mvndep 14m 8s Maven dependency ordering for branch
+1 💚 mvninstall 36m 26s trunk passed
+1 💚 compile 20m 5s trunk passed with JDK Ubuntu-11.0.21+9-post-Ubuntu-0ubuntu120.04
+1 💚 compile 16m 37s trunk passed with JDK Private Build-1.8.0_392-8u392-ga-1~20.04-b08
+1 💚 checkstyle 4m 42s trunk passed
+1 💚 mvnsite 2m 31s trunk passed
+1 💚 javadoc 1m 47s trunk passed with JDK Ubuntu-11.0.21+9-post-Ubuntu-0ubuntu120.04
+1 💚 javadoc 1m 33s trunk passed with JDK Private Build-1.8.0_392-8u392-ga-1~20.04-b08
-1 ❌ spotbugs 2m 33s /branch-spotbugs-hadoop-common-project_hadoop-common-warnings.html hadoop-common-project/hadoop-common in trunk has 1 extant spotbugs warnings.
+1 💚 shadedclient 38m 23s branch has no errors when building and testing our client artifacts.
_ Patch Compile Tests _
+0 🆗 mvndep 0m 31s Maven dependency ordering for patch
+1 💚 mvninstall 1m 26s the patch passed
+1 💚 compile 17m 29s the patch passed with JDK Ubuntu-11.0.21+9-post-Ubuntu-0ubuntu120.04
+1 💚 javac 17m 29s the patch passed
+1 💚 compile 16m 29s the patch passed with JDK Private Build-1.8.0_392-8u392-ga-1~20.04-b08
+1 💚 javac 16m 29s the patch passed
+1 💚 blanks 0m 0s The patch has no blanks issues.
-0 ⚠️ checkstyle 4m 32s /results-checkstyle-root.txt root: The patch generated 1 new + 39 unchanged - 0 fixed = 40 total (was 39)
+1 💚 mvnsite 2m 30s the patch passed
-1 ❌ javadoc 1m 7s /results-javadoc-javadoc-hadoop-common-project_hadoop-common-jdkUbuntu-11.0.21+9-post-Ubuntu-0ubuntu120.04.txt hadoop-common-project_hadoop-common-jdkUbuntu-11.0.21+9-post-Ubuntu-0ubuntu120.04 with JDK Ubuntu-11.0.21+9-post-Ubuntu-0ubuntu120.04 generated 4 new + 0 unchanged - 0 fixed = 4 total (was 0)
+1 💚 javadoc 1m 30s the patch passed with JDK Private Build-1.8.0_392-8u392-ga-1~20.04-b08
+1 💚 spotbugs 4m 4s the patch passed
+1 💚 shadedclient 38m 19s patch has no errors when building and testing our client artifacts.
_ Other Tests _
+1 💚 unit 19m 7s hadoop-common in the patch passed.
+1 💚 unit 3m 9s hadoop-aws in the patch passed.
+1 💚 asflicense 0m 57s The patch does not generate ASF License warnings.
259m 23s
Subsystem Report/Notes
Docker ClientAPI=1.44 ServerAPI=1.44 base: https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-6494/3/artifact/out/Dockerfile
GITHUB PR #6494
Optional Tests dupname asflicense compile javac javadoc mvninstall mvnsite unit shadedclient spotbugs checkstyle codespell detsecrets
uname Linux d1aa5776a4a0 5.15.0-88-generic #98-Ubuntu SMP Mon Oct 2 15:18:56 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux
Build tool maven
Personality dev-support/bin/hadoop.sh
git revision trunk / 0823d3f
Default Java Private Build-1.8.0_392-8u392-ga-1~20.04-b08
Multi-JDK versions /usr/lib/jvm/java-11-openjdk-amd64:Ubuntu-11.0.21+9-post-Ubuntu-0ubuntu120.04 /usr/lib/jvm/java-8-openjdk-amd64:Private Build-1.8.0_392-8u392-ga-1~20.04-b08
Test Results https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-6494/3/testReport/
Max. process+thread count 3137 (vs. ulimit of 5500)
modules C: hadoop-common-project/hadoop-common hadoop-tools/hadoop-aws U: .
Console output https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-6494/3/console
versions git=2.25.1 maven=3.6.3 spotbugs=4.2.2
Powered by Apache Yetus 0.14.0 https://yetus.apache.org

This message was automatically generated.

A more minimal design that is easier to use and implement.

Caller creates a BulkOperation; they get the page size of
it and then submit batches to delete of less than that size.

The outcome of each call contains a list of failures.

S3A implementation to show how straightforward it is.

Even with the single entry page size, it is still more
efficient to use this as it doesn't try to recreate a
parent dir or perform any probes to see if it is a directory:
it maps straight to a DELETE call.

Change-Id: Ibe8737e7933fe03d39070e7138a710b50c3d60c2
Add methods in FileUtil to take an FS, cast to a BulkDeleteSource
then perform the pageSize/bulkDelete operations.

This is to make reflection based access straightforward: no new interfaces
or types to work with, just two new methods with type-erased lists.

Change-Id: I2d7b1bf8198422de635253fc087ca551a8bc6609
Change-Id: Ib098c07cc1f7747ed1a3131b252656c96c520a75
Using this PR to start with the initial design, implementation
and services offered by having lower-level interaction with S3
pushed down into an S3AStore class, with interface/impl split.

The bulk delete callbacks now talk to the store, *not* s3afs,
with some minor changes in behaviour (IllegalArgumentException is
raised if root paths / are to be deleted)

Mock tests are failing; I expected that: they are always brittle.

What next? Get this in and then move lower-level fs ops over,
one s3client-calling method at a time or in groups, as appropriate.

The metrics of success are:
* all callback classes created in S3A FS can work through the store
* no s3client direct invocation in S3AFS

Change-Id: Ib5bc58991533fd5d13e78426e88b884b6ae5205c
@steveloughran steveloughran force-pushed the s3/HADOOP-18679-bulk-delete-api-lean branch from 030ab9f to ea19f43 Compare March 13, 2024 18:59
Changing results of method calls, using Tuples.pair() to
return Map.Entry() instances as immutable tuples.

Change-Id: Ibdd5a5b11fe0a57b293b9cb623272e272c8bab69
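For readers unfamiliar with the idiom, returning `Map.Entry` instances as immutable tuples can be done with JDK classes alone. The sketch below is an assumption about the general technique, not the `Tuples` class from this PR; `Pairs` is a hypothetical name.

```java
import java.util.AbstractMap;
import java.util.Map;

/** Hypothetical helper; illustrates the idiom only, not the PR's Tuples class. */
final class Pairs {
  private Pairs() {}

  /** Build an immutable (key, value) tuple typed as Map.Entry. */
  static <K, V> Map.Entry<K, V> pair(K key, V value) {
    return new AbstractMap.SimpleImmutableEntry<>(key, value);
  }
}
```

Calling `setValue()` on such an entry throws `UnsupportedOperationException`, so results can be handed to callers without defensive copies.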
Contributor

@mukund-thakur mukund-thakur left a comment

I like the design.

// multi object delete flag
// this is always true, even if multi object
// delete is disabled -the page size is simply reduced to 1.
case CommonPathCapabilities.BULK_DELETE:
Contributor

nit: won't this be a bit misleading?

Contributor Author

It means the API is present, along with some of its semantics, such as "parent dir existence not guaranteed". For that reason, it will always be faster than before: one DELETE; no LIST/HEAD etc.

@Retries.RetryTranslated
private List<Map.Entry<String, String>> deleteSingleObject(final String key) throws IOException {
try {
once("bulkDelete", path, () ->
Contributor

nit : this is a single object delete,

Contributor

@ahmarsuhail ahmarsuhail left a comment

Looks good, great to have the new S3AStore class and some operations moved over


If multi-object delete is enabled (`fs.s3a.multiobjectdelete.enable` = true), as
it is by default, then the page size is limited to that defined in
`fs.s3a.bulk.delete.page.size`, which MUST be less than or equal to1000.
Contributor

nit: space after to
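The page-size rule quoted in this thread (a configured page size of at most 1000 when multi-object delete is enabled, and a page size of 1 when it is disabled) could be expressed roughly as below. `PageSizeSketch` and its method are hypothetical, and rejecting out-of-range values with an exception is an assumption about how the limit might be enforced.

```java
/** Hypothetical sketch of the page-size rule; not the S3A implementation. */
final class PageSizeSketch {
  /** Upper bound stated in the documentation excerpt under review. */
  static final int MAX_ENTRIES = 1000;

  private PageSizeSketch() {}

  /**
   * @param multiObjectDeleteEnabled value of fs.s3a.multiobjectdelete.enable
   * @param configuredPageSize value of fs.s3a.bulk.delete.page.size
   * @return the number of paths a single bulk delete call may carry
   */
  static int effectivePageSize(boolean multiObjectDeleteEnabled, int configuredPageSize) {
    if (!multiObjectDeleteEnabled) {
      // The API stays available; each call simply carries one path.
      return 1;
    }
    if (configuredPageSize < 1 || configuredPageSize > MAX_ENTRIES) {
      throw new IllegalArgumentException(
          "page size must be in [1, " + MAX_ENTRIES + "]: " + configuredPageSize);
    }
    return configuredPageSize;
  }
}
```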

public static List<Map.Entry<Path, String>> bulkDelete(FileSystem fs, Path base, List<Path> paths)
```

## S3A Implementation
Contributor

Can we move this section to S3A docs instead, and link here? feel like S3A docs in hadoop-common are hard to find

@@ -46,6 +46,9 @@ public final class StoreStatisticNames {
/** {@value}. */
public static final String OP_APPEND = "op_append";

/** {@value}. */
public static final String OP_BULK_DELETE = "op_bulk-delete";
Contributor

super nit: change to op_bulk_delete to match how other OPs are named

@steveloughran
Contributor Author

FYI I want to pull the rate limiter API of #6596 in here too; we'd have a rate limiter in the S3A store which, if enabled, would limit the number of deletes which can be issued on a bucket. Ideally it'd be at 3000 on S3 Standard and off for S3 Express and third-party stores, to reduce the load this call can generate.
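To make the idea concrete, here is a minimal fixed-window limiter. It is not the API from #6596: the class name, the per-second window, and the injected clock are all assumptions made for illustration, and a capacity of zero or less stands in for "off".

```java
import java.util.function.LongSupplier;

/** Hypothetical fixed-window limiter; NOT the rate limiter API of #6596. */
final class DeleteRateLimiterSketch {
  private final int permitsPerSecond;    // zero or less disables limiting
  private final LongSupplier clockMillis; // injected so tests control time
  private long windowStart;
  private int usedInWindow;

  DeleteRateLimiterSketch(int permitsPerSecond, LongSupplier clockMillis) {
    this.permitsPerSecond = permitsPerSecond;
    this.clockMillis = clockMillis;
    this.windowStart = clockMillis.getAsLong();
  }

  /**
   * Try to reserve capacity for a request deleting {@code count} objects.
   * @return true if the request may proceed; false if the caller should wait.
   */
  synchronized boolean tryAcquire(int count) {
    if (permitsPerSecond <= 0) {
      return true; // limiter is off
    }
    long now = clockMillis.getAsLong();
    if (now - windowStart >= 1000) {
      // A new one-second window begins; forget earlier usage.
      windowStart = now;
      usedInWindow = 0;
    }
    if (usedInWindow + count > permitsPerSecond) {
      return false;
    }
    usedInWindow += count;
    return true;
  }
}
```

A bulk delete of a full page would acquire as many permits as it has entries, so one large request counts the same as many single deletes.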

@steveloughran
Contributor Author

In #6686 I'm creating a new utils class for reflection access, nothing else. I'm also proposing that all tests of the API use reflection, to be really confident it works and that there are no accidental changes which break reflection.

@apache apache deleted a comment from hadoop-yetus Apr 5, 2024
@apache apache deleted a comment from hadoop-yetus Apr 5, 2024
@apache apache deleted a comment from hadoop-yetus Apr 5, 2024
/**
* Delete a list of files/objects.
* <ul>
* <li>Files must be under the path provided in {@link #basePath()}.</li>
Contributor

Writing contract tests for this locally; I can't find the implementation of this in S3A.

* The maximum number of objects/files to delete in a single request.
* @return a number greater than or equal to zero.
*/
int pageSize();
Contributor

shouldn't this be greater than 0?
equal to 0 doesn't make sense. also we have the check in S3A impl.
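The preconditions raised in this thread (a page size greater than zero, files under the base path, and the root path rejected with `IllegalArgumentException`) could be checked along these lines. `BulkDeleteChecks` is a hypothetical name, paths are plain strings, and rejecting a batch larger than the page size is an assumption rather than something the PR text states.

```java
import java.util.List;

/** Hypothetical precondition checks; not the Hadoop or S3A implementation. */
final class BulkDeleteChecks {
  private BulkDeleteChecks() {}

  /** Validate one page of paths before it is submitted for deletion. */
  static void validate(String basePath, int pageSize, List<String> paths) {
    if (pageSize < 1) {
      throw new IllegalArgumentException("page size must be greater than 0: " + pageSize);
    }
    if (paths.size() > pageSize) {
      // Assumption: an oversized batch is rejected rather than split.
      throw new IllegalArgumentException(
          "batch of " + paths.size() + " exceeds page size " + pageSize);
    }
    String prefix = basePath.endsWith("/") ? basePath : basePath + "/";
    for (String path : paths) {
      if (path.equals("/")) {
        throw new IllegalArgumentException("the root path cannot be deleted");
      }
      if (!path.startsWith(prefix)) {
        throw new IllegalArgumentException(path + " is not under " + basePath);
      }
    }
  }
}
```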


4 participants