HADOOP-19057. S3A: Landsat bucket used in tests no longer accessible #6515

Merged
@@ -585,7 +585,7 @@ If an operation fails with an `AccessDeniedException`, then the role does not have
the permission for the S3 Operation invoked during the call.

```
> hadoop fs -touch s3a://landsat-pds/a
> hadoop fs -touch s3a://noaa-isd-pds/a

java.nio.file.AccessDeniedException: a: Writing Object on a:
software.amazon.awssdk.services.s3.model.S3Exception: Access Denied
@@ -111,9 +111,9 @@ Specific buckets can have auditing disabled, even when it is enabled globally.

```xml
<property>
<name>fs.s3a.bucket.landsat-pds.audit.enabled</name>
<name>fs.s3a.bucket.noaa-isd-pds.audit.enabled</name>
<value>false</value>
<description>Do not audit landsat bucket operations</description>
<description>Do not audit bucket operations</description>
</property>
```

@@ -342,9 +342,9 @@ either globally or for specific buckets:
</property>

<property>
<name>fs.s3a.bucket.landsat-pds.audit.referrer.enabled</name>
<name>fs.s3a.bucket.noaa-isd-pds.audit.referrer.enabled</name>
<value>false</value>
<description>Do not add the referrer header to landsat operations</description>
<description>Do not add the referrer header to operations</description>
</property>
```

@@ -747,7 +747,7 @@ For example, for any job executed through Hadoop MapReduce, the Job ID can be us
### `Filesystem does not have support for 'magic' committer`

```
org.apache.hadoop.fs.s3a.commit.PathCommitException: `s3a://landsat-pds': Filesystem does not have support for 'magic' committer enabled
org.apache.hadoop.fs.s3a.commit.PathCommitException: `s3a://noaa-isd-pds': Filesystem does not have support for 'magic' committer enabled
in configuration option fs.s3a.committer.magic.enabled
```

Expand All @@ -760,42 +760,15 @@ Remove all global/per-bucket declarations of `fs.s3a.bucket.magic.enabled` or se

```xml
<property>
<name>fs.s3a.bucket.landsat-pds.committer.magic.enabled</name>
<name>fs.s3a.bucket.noaa-isd-pds.committer.magic.enabled</name>
<value>true</value>
</property>
```

Tip: you can verify that a bucket supports the magic committer through the
`hadoop s3guard bucket-info` command:
`hadoop s3guard bucket-info` command.


```
> hadoop s3guard bucket-info -magic s3a://landsat-pds/
Location: us-west-2

S3A Client
Signing Algorithm: fs.s3a.signing-algorithm=(unset)
Endpoint: fs.s3a.endpoint=s3.amazonaws.com
Encryption: fs.s3a.encryption.algorithm=none
Input seek policy: fs.s3a.experimental.input.fadvise=normal
Change Detection Source: fs.s3a.change.detection.source=etag
Change Detection Mode: fs.s3a.change.detection.mode=server

S3A Committers
The "magic" committer is supported in the filesystem
S3A Committer factory class: mapreduce.outputcommitter.factory.scheme.s3a=org.apache.hadoop.fs.s3a.commit.S3ACommitterFactory
S3A Committer name: fs.s3a.committer.name=magic
Store magic committer integration: fs.s3a.committer.magic.enabled=true

Security
Delegation token support is disabled

Directory Markers
The directory marker policy is "keep"
Available Policies: delete, keep, authoritative
Authoritative paths: fs.s3a.authoritative.path=
```

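Applications can also probe for magic committer support through the `hasPathCapability()` API; a minimal sketch, assuming the capability name `fs.s3a.capability.magic.committer` published for the S3A committers is available in the Hadoop release in use and that the example bucket is readable:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ProbeMagicCommitter {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // example bucket; substitute the store under test
    Path bucket = new Path("s3a://noaa-isd-pds/");
    FileSystem fs = FileSystem.get(bucket.toUri(), conf);

    // true when magic committer support is enabled for this store
    boolean magic = fs.hasPathCapability(bucket,
        "fs.s3a.capability.magic.committer");
    System.out.println("magic committer enabled: " + magic);
  }
}
```
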
### Error message: "File being created has a magic path, but the filesystem has magic file support disabled"

A file is being written to a path which is used for "magic" files,
@@ -284,14 +284,13 @@ a bucket.
The up-to-date list of regions is [available online](https://docs.aws.amazon.com/general/latest/gr/s3.html).

This list can be used to specify the endpoint of individual buckets, for example
for buckets in the central and EU/Ireland endpoints.
for buckets in the us-west-2 and EU/Ireland endpoints.


```xml
<property>
<name>fs.s3a.bucket.landsat-pds.endpoint.region</name>
<name>fs.s3a.bucket.us-west-2-dataset.endpoint.region</name>
<value>us-west-2</value>
<description>The region for s3a://landsat-pds URLs</description>
</property>

<property>
@@ -354,9 +353,9 @@ The boolean option `fs.s3a.endpoint.fips` (default `false`) switches the S3A con
For a single bucket:
```xml
<property>
<name>fs.s3a.bucket.landsat-pds.endpoint.fips</name>
<name>fs.s3a.bucket.noaa-isd-pds.endpoint.fips</name>
<value>true</value>
<description>Use the FIPS endpoint for the landsat dataset</description>
<description>Use the FIPS endpoint for the NOAA dataset</description>
</property>
```

@@ -188,7 +188,7 @@ If it was deployed unbonded, the DT Binding is asked to create a new DT.

It is up to the binding what it includes in the token identifier, and how it obtains them.
This new token identifier is included in a token which has a "canonical service name" of
the URI of the filesystem (e.g "s3a://landsat-pds").
the URI of the filesystem (e.g "s3a://noaa-isd-pds").

The issued/reissued token identifier can be marshalled and reused.
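
As a sketch of how this surfaces in the client API (the renewer name below is purely illustrative, and whether a token is actually issued depends on the delegation binding configured for the bucket):

```java
import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.security.token.Token;

public class InspectS3ADelegationToken {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(URI.create("s3a://noaa-isd-pds/"), conf);

    // the canonical service name is the filesystem URI, e.g. "s3a://noaa-isd-pds"
    System.out.println("service = " + fs.getCanonicalServiceName());

    // ask the binding for a token; it decides what goes into the identifier
    Token<?> token = fs.getDelegationToken("yarn");
    if (token != null) {
      System.out.println("token kind    = " + token.getKind());
      System.out.println("token service = " + token.getService());
    }
  }
}
```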

@@ -481,8 +481,8 @@ This will fetch the token and save it to the named file (here, `tokens.bin`),
even if Kerberos is disabled.

```bash
# Fetch a token for the AWS landsat-pds bucket and save it to tokens.bin
$ hdfs fetchdt --webservice s3a://landsat-pds/ tokens.bin
# Fetch a token for the AWS noaa-isd-pds bucket and save it to tokens.bin
$ hdfs fetchdt --webservice s3a://noaa-isd-pds/ tokens.bin
```

If the command fails with `ERROR: Failed to fetch token` it means the
Expand All @@ -498,11 +498,11 @@ host on which it was created.
```bash
$ bin/hdfs fetchdt --print tokens.bin

Token (S3ATokenIdentifier{S3ADelegationToken/Session; uri=s3a://landsat-pds;
Token (S3ATokenIdentifier{S3ADelegationToken/Session; uri=s3a://noaa-isd-pds;
timestamp=1541683947569; encryption=EncryptionSecrets{encryptionMethod=SSE_S3};
Created on vm1.local/192.168.99.1 at time 2018-11-08T13:32:26.381Z.};
Session credentials for user AAABWL expires Thu Nov 08 14:02:27 GMT 2018; (valid))
for s3a://landsat-pds
for s3a://noaa-isd-pds
```
The "(valid)" annotation means that the AWS credentials are considered "valid":
there is both a username and a secret.
Expand All @@ -513,11 +513,11 @@ If delegation support is enabled, it also prints the current
hadoop security level.

```bash
$ hadoop s3guard bucket-info s3a://landsat-pds/
$ hadoop s3guard bucket-info s3a://noaa-isd-pds/

Filesystem s3a://landsat-pds
Filesystem s3a://noaa-isd-pds
Location: us-west-2
Filesystem s3a://landsat-pds is not using S3Guard
Filesystem s3a://noaa-isd-pds is not using S3Guard
The "magic" committer is not supported

S3A Client
@@ -314,9 +314,8 @@ All releases of Hadoop which have been updated to be marker aware will support t
Example: `s3guard bucket-info -markers aware` on a compatible release.

```
> hadoop s3guard bucket-info -markers aware s3a://landsat-pds/
Filesystem s3a://landsat-pds
Location: us-west-2
> hadoop s3guard bucket-info -markers aware s3a://noaa-isd-pds/
Filesystem s3a://noaa-isd-pds

...

Expand All @@ -326,13 +325,14 @@ Directory Markers
Authoritative paths: fs.s3a.authoritative.path=
The S3A connector is compatible with buckets where directory markers are not deleted

...
```

The same command will fail on older releases, because the `-markers` option
is unknown

```
> hadoop s3guard bucket-info -markers aware s3a://landsat-pds/
> hadoop s3guard bucket-info -markers aware s3a://noaa-isd-pds/
Illegal option -markers
Usage: hadoop bucket-info [OPTIONS] s3a://BUCKET
provide/check information about a specific bucket
Expand All @@ -354,9 +354,8 @@ Generic options supported are:
A specific policy check verifies that the connector is configured as desired

```
> hadoop s3guard bucket-info -markers keep s3a://landsat-pds/
Filesystem s3a://landsat-pds
Location: us-west-2
> hadoop s3guard bucket-info -markers keep s3a://noaa-isd-pds/
Filesystem s3a://noaa-isd-pds

...

Expand All @@ -371,9 +370,8 @@ When probing for a specific policy, the error code "46" is returned if the activ
does not match that requested:

```
> hadoop s3guard bucket-info -markers delete s3a://landsat-pds/
Filesystem s3a://landsat-pds
Location: us-west-2
> hadoop s3guard bucket-info -markers delete s3a://noaa-isd-pds/
Filesystem s3a://noaa-isd-pds

S3A Client
Signing Algorithm: fs.s3a.signing-algorithm=(unset)
Expand All @@ -398,7 +396,7 @@ Directory Markers
Authoritative paths: fs.s3a.authoritative.path=

2021-11-22 16:03:59,175 [main] INFO util.ExitUtil (ExitUtil.java:terminate(210))
-Exiting with status 46: 46: Bucket s3a://landsat-pds: required marker polic is
-Exiting with status 46: 46: Bucket s3a://noaa-isd-pds: required marker polic is
"keep" but actual policy is "delete"

```
@@ -450,10 +448,10 @@ Audit the path and fail if any markers were found.


```
> hadoop s3guard markers -limit 8000 -audit s3a://landsat-pds/
> hadoop s3guard markers -limit 8000 -audit s3a://noaa-isd-pds/

The directory marker policy of s3a://landsat-pds is "Keep"
2020-08-05 13:42:56,079 [main] INFO tools.MarkerTool (DurationInfo.java:<init>(77)) - Starting: marker scan s3a://landsat-pds/
The directory marker policy of s3a://noaa-isd-pds is "Keep"
2020-08-05 13:42:56,079 [main] INFO tools.MarkerTool (DurationInfo.java:<init>(77)) - Starting: marker scan s3a://noaa-isd-pds/
Scanned 1,000 objects
Scanned 2,000 objects
Scanned 3,000 objects
Expand All @@ -463,8 +461,8 @@ Scanned 6,000 objects
Scanned 7,000 objects
Scanned 8,000 objects
Limit of scan reached - 8,000 objects
2020-08-05 13:43:01,184 [main] INFO tools.MarkerTool (DurationInfo.java:close(98)) - marker scan s3a://landsat-pds/: duration 0:05.107s
No surplus directory markers were found under s3a://landsat-pds/
2020-08-05 13:43:01,184 [main] INFO tools.MarkerTool (DurationInfo.java:close(98)) - marker scan s3a://noaa-isd-pds/: duration 0:05.107s
No surplus directory markers were found under s3a://noaa-isd-pds/
Listing limit reached before completing the scan
2020-08-05 13:43:01,187 [main] INFO util.ExitUtil (ExitUtil.java:terminate(210)) - Exiting with status 3:
```
@@ -616,15 +616,14 @@ header.x-amz-version-id="KcDOVmznIagWx3gP1HlDqcZvm1mFWZ2a"
A file with no-encryption (on a bucket without versioning but with intelligent tiering):

```
bin/hadoop fs -getfattr -d s3a://landsat-pds/scene_list.gz
bin/hadoop fs -getfattr -d s3a://noaa-cors-pds/raw/2024/001/akse/AKSE001x.24_.gz

# file: s3a://landsat-pds/scene_list.gz
header.Content-Length="45603307"
header.Content-Type="application/octet-stream"
header.ETag="39c34d489777a595b36d0af5726007db"
header.Last-Modified="Wed Aug 29 01:45:15 BST 2018"
header.x-amz-storage-class="INTELLIGENT_TIERING"
header.x-amz-version-id="null"
# file: s3a://noaa-cors-pds/raw/2024/001/akse/AKSE001x.24_.gz
header.Content-Length="524671"
header.Content-Type="binary/octet-stream"
header.ETag=""3e39531220fbd3747d32cf93a79a7a0c""
header.Last-Modified="Tue Jan 02 00:15:13 GMT 2024"
header.x-amz-server-side-encryption="AES256"
```
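
The same headers can also be read programmatically through the Hadoop XAttr API, which the S3A connector maps onto the object's HTTP headers. A minimal sketch using the example object above (credentials, or an anonymous provider, must already be configured for the bucket):

```java
import java.nio.charset.StandardCharsets;
import java.util.Map;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class PrintObjectHeaders {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Path path = new Path("s3a://noaa-cors-pds/raw/2024/001/akse/AKSE001x.24_.gz");
    FileSystem fs = FileSystem.get(path.toUri(), conf);

    // each header is exposed as an attribute named "header.<HTTP header name>"
    Map<String, byte[]> attrs = fs.getXAttrs(path);
    attrs.forEach((name, value) ->
        System.out.println(name + "=\"" + new String(value, StandardCharsets.UTF_8) + "\""));
  }
}
```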

###<a name="changing-encryption"></a> Use `rename()` to encrypt files with new keys
@@ -503,7 +503,7 @@ explicitly opened up for broader access.
```bash
hadoop fs -ls \
-D fs.s3a.aws.credentials.provider=org.apache.hadoop.fs.s3a.AnonymousAWSCredentialsProvider \
s3a://landsat-pds/
s3a://noaa-isd-pds/
```

1. Allowing anonymous access to an S3 bucket compromises
@@ -1630,11 +1630,11 @@ a session key:
</property>
```

Finally, the public `s3a://landsat-pds/` bucket can be accessed anonymously:
Finally, the public `s3a://noaa-isd-pds/` bucket can be accessed anonymously:

```xml
<property>
<name>fs.s3a.bucket.landsat-pds.aws.credentials.provider</name>
<name>fs.s3a.bucket.noaa-isd-pds.aws.credentials.provider</name>
<value>org.apache.hadoop.fs.s3a.AnonymousAWSCredentialsProvider</value>
</property>
```
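
The same per-bucket binding can be made programmatically when constructing a client; a sketch, assuming the bucket really does permit anonymous reads:

```java
import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class AnonymousListing {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // per-bucket override: anonymous credentials for this public bucket only
    conf.set("fs.s3a.bucket.noaa-isd-pds.aws.credentials.provider",
        "org.apache.hadoop.fs.s3a.AnonymousAWSCredentialsProvider");

    FileSystem fs = FileSystem.get(URI.create("s3a://noaa-isd-pds/"), conf);
    for (FileStatus status : fs.listStatus(new Path("/"))) {
      System.out.println(status.getPath());
    }
  }
}
```
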
@@ -447,7 +447,8 @@ An example of this is covered in [HADOOP-13871](https://issues.apache.org/jira/browse/HADOOP-13871)

1. For public data, use `curl`:

curl -O https://landsat-pds.s3.amazonaws.com/scene_list.gz
curl -O https://noaa-cors-pds.s3.amazonaws.com/raw/2023/001/akse/AKSE001a.23_.gz

1. Use `nettop` to monitor a process's connections.


@@ -696,7 +697,7 @@ via `FileSystem.get()` or `Path.getFileSystem()`.
The cache, `FileSystem.CACHE`, will, for each user, cache one instance of a filesystem
for a given URI.
All calls to `FileSystem.get` for a cached FS for a URI such
as `s3a://landsat-pds/` will return that single instance.
as `s3a://noaa-isd-pds/` will return that single instance.

FileSystem instances are created on-demand for the cache,
and will be done in each thread which requests an instance.
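
A sketch of the difference between the cached and uncached lookups (the bucket name is only an example):

```java
import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;

public class FileSystemCacheDemo {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    URI bucket = URI.create("s3a://noaa-isd-pds/");

    // both calls go through FileSystem.CACHE, so the same instance comes back
    FileSystem cached1 = FileSystem.get(bucket, conf);
    FileSystem cached2 = FileSystem.get(bucket, conf);
    System.out.println("same instance: " + (cached1 == cached2));       // true

    // newInstance() bypasses the cache; the caller must close the result
    FileSystem uncached = FileSystem.newInstance(bucket, conf);
    System.out.println("uncached is shared: " + (cached1 == uncached)); // false
    uncached.close();
  }
}
```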
Expand All @@ -720,7 +721,7 @@ can be created simultaneously for different object stores/distributed
filesystems.

For example, a value of four would put an upper limit on the number
of wasted instantiations of a connector for the `s3a://landsat-pds/`
of wasted instantiations of a connector for the `s3a://noaa-isd-pds/`
bucket.

```xml
@@ -260,22 +260,20 @@ define the target region in `auth-keys.xml`.
### <a name="csv"></a> CSV Data Tests

The `TestS3AInputStreamPerformance` tests require read access to a multi-MB
text file. The default file for these tests is one published by amazon,
[s3a://landsat-pds.s3.amazonaws.com/scene_list.gz](http://landsat-pds.s3.amazonaws.com/scene_list.gz).
This is a gzipped CSV index of other files which amazon serves for open use.
text file. The default file for these tests is a public one,
`s3a://noaa-cors-pds/raw/2023/001/akse/AKSE001a.23_.gz`,
from the [NOAA Continuously Operating Reference Stations (CORS) Network (NCN)](https://registry.opendata.aws/noaa-ncn/).

Historically it was required to be a `csv.gz` file to validate S3 Select
support. Now that S3 Select support has been removed, other large files
may be used instead.
However, future versions may want to read a CSV file again, so testers
should still reference one.

The path to this object is set in the option `fs.s3a.scale.test.csvfile`,

```xml
<property>
<name>fs.s3a.scale.test.csvfile</name>
<value>s3a://landsat-pds/scene_list.gz</value>
<value>s3a://noaa-cors-pds/raw/2023/001/akse/AKSE001a.23_.gz</value>
</property>
```

Expand All @@ -285,21 +283,21 @@ is hosted in Amazon's US-east datacenter.
1. If the data cannot be read for any reason then the test will fail.
1. If the property is set to a different path, then that data must be readable
and "sufficiently" large.
1. If a `.gz` file, expect decompression-related test failures.

(the reason the space or newline is needed is to add "an empty entry"; an empty
`<value/>` would be considered undefined and pick up the default)


If using a test file in a different AWS S3 region then
a bucket-specific region must be defined.
For the default test dataset, hosted in the `landsat-pds` bucket, this is:
For the default test dataset, hosted in the `noaa-cors-pds` bucket, this is:

```xml
<property>
<name>fs.s3a.bucket.landsat-pds.endpoint.region</name>
<value>us-west-2</value>
<description>The region for s3a://landsat-pds</description>
</property>
<property>
<name>fs.s3a.bucket.noaa-cors-pds.endpoint.region</name>
<value>us-east-1</value>
</property>
```

### <a name="access"></a> Testing Access Point Integration
@@ -857,7 +855,7 @@ the tests become skipped, rather than fail with a trace which is really a false
The ordered test case mechanism of `AbstractSTestS3AHugeFiles` is probably
the most elegant way of chaining test setup/teardown.

Regarding reusing existing data, we tend to use the landsat archive of
Regarding reusing existing data, we tend to use the noaa-cors-pds archive of
AWS US-East for our testing of input stream operations. This doesn't work
against other regions, or with third party S3 implementations. Thus the
URL can be overridden for testing elsewhere.