HADOOP-18830. Final doc fixup
Review references to landsat and CSV in the docs and core-site.xml;
update as appropriate.

This includes
* A review of use of .endpoint vs .endpoint.region
  and a move to fs.s3a.endpoint.region as much as possible.
* reorg of some sections
* fixup of name= headers.

Change-Id: I5e07cdf153cacf0ce1ee6673d3a094c2d5eaf5a2
steveloughran committed Jan 29, 2024
1 parent 736a3cc commit 034865d
Showing 3 changed files with 66 additions and 79 deletions.
@@ -74,7 +74,8 @@ There are three core settings to connect to an S3 store, endpoint, region and wh
<name>fs.s3a.endpoint</name>
<description>AWS S3 endpoint to connect to. An up-to-date list is
provided in the AWS Documentation: regions and endpoints. Without this
property, the standard region (s3.amazonaws.com) is assumed.
property, the endpoint/hostname of the S3 Store is inferred from
the value of fs.s3a.endpoint.region, fs.s3a.endpoint.fips and more.
</description>
</property>
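
<!--
  Illustrative sketch, not part of the original docs: with fs.s3a.endpoint unset,
  the store's endpoint can be inferred from an explicit region declaration such as
  the following. The region value "eu-west-1" is a placeholder.
-->
<property>
  <name>fs.s3a.endpoint.region</name>
  <value>eu-west-1</value>
</property>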

@@ -230,8 +231,9 @@ S3 endpoint, documented [by Amazon](http://docs.aws.amazon.com/general/latest/gr
use local buckets and local copies of data, wherever possible.
2. With the V4 signing protocol, AWS requires the explicit region endpoint
to be used —hence S3A must be configured to use the specific endpoint. This
is done in the configuration option `fs.s3a.endpoint`.
3. All endpoints other than the default endpoint only support interaction
is done by setting the region in the configuration option `fs.s3a.endpoint.region`,
or by explicitly setting `fs.s3a.endpoint` and `fs.s3a.endpoint.region`.
3. All endpoints other than the default region only support interaction
with buckets local to that S3 instance.
4. Standard S3 buckets support "cross-region" access where use of the original `us-east-1`
endpoint allows access to the data, but newer storage types, particularly S3 Express are
@@ -248,25 +250,12 @@ The up to date list of regions is [Available online](https://docs.aws.amazon.com
This list can be used to specify the endpoint of individual buckets, for example
for buckets in the central and EU/Ireland endpoints.

```xml
<property>
<name>fs.s3a.bucket.landsat-pds.endpoint</name>
<value>s3-us-west-2.amazonaws.com</value>
</property>

<property>
<name>fs.s3a.bucket.eu-dataset.endpoint</name>
<value>s3.eu-west-1.amazonaws.com</value>
</property>
```

Declaring the region for the data is simpler, as it avoids having to look up the full URL and worry about historical quirks of regional endpoint hostnames.

```xml
<property>
<name>fs.s3a.bucket.landsat-pds.endpoint.region</name>
<value>us-west-2</value>
<description>The endpoint for s3a://landsat-pds URLs</description>
<description>The region for s3a://landsat-pds URLs</description>
</property>

<property>
@@ -421,7 +410,6 @@ bucket by bucket basis i.e. `fs.s3a.bucket.{YOUR-BUCKET}.accesspoint.required`.
```

Before using Access Points make sure you're not impacted by the following:
- `ListObjectsV1` is not supported, this is also deprecated on AWS S3 for performance reasons;
- The endpoint for S3 requests will automatically change to use
`s3-accesspoint.REGION.amazonaws.{com | com.cn}` depending on the Access Point ARN. While
considering endpoints, if you have any custom signers that use the host endpoint property make
102 changes: 47 additions & 55 deletions hadoop-tools/hadoop-aws/src/site/markdown/tools/hadoop-aws/testing.md
@@ -22,9 +22,6 @@ connection to S3 to interact with a bucket. Unit test suites follow the naming
convention `Test*.java`. Integration tests follow the naming convention
`ITest*.java`.

Due to eventual consistency, integration tests may fail without reason.
Transient failures, which no longer occur upon rerunning the test, should thus
be ignored.

## <a name="policy"></a> Policy for submitting patches which affect the `hadoop-aws` module.

@@ -56,7 +53,6 @@ make for a slow iterative development.
Please: run the tests. And if you don't, we are sorry for declining your
patch, but we have to.


### What if there's an intermittent failure of a test?

Some of the tests do fail intermittently, especially in parallel runs.
@@ -147,7 +143,7 @@ Example:
</configuration>
```

### <a name="encryption"></a> Configuring S3a Encryption
## <a name="encryption"></a> Configuring S3a Encryption

For S3a encryption tests to run correctly, the
`fs.s3a.encryption.key` must be configured in the s3a contract xml
@@ -175,6 +171,21 @@ on the AWS side. Some S3AFileSystem tests are skipped when default encryption is
enabled due to unpredictability in how [ETags](https://docs.aws.amazon.com/AmazonS3/latest/API/RESTCommonResponseHeaders.html)
are generated.
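
As a minimal sketch, the encryption key setting described above might look like this
in the test configuration (the KMS key ARN below is a placeholder for a key you own):

```xml
<property>
  <name>fs.s3a.encryption.key</name>
  <value>arn:aws:kms:us-west-2:123456789012:key/your-key-id</value>
</property>
```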

### Disabling the encryption tests

If the S3 store/storage class doesn't support server-side-encryption, these will fail. They
can be turned off.

```xml
<property>
<name>test.fs.s3a.encryption.enabled</name>
<value>false</value>
</property>
```

Encryption is only used for those specific test suites with `Encryption` in
their classname.

## <a name="running"></a> Running the Tests

After completing the configuration, execute the test run through Maven.
@@ -241,30 +252,24 @@ define the target region in `auth-keys.xml`.

```xml
<property>
<name>fs.s3a.endpoint</name>
<value>s3.eu-central-1.amazonaws.com</value>
</property>
```

Alternatively you can use endpoints defined in [core-site.xml](../../../../test/resources/core-site.xml).

```xml
<property>
<name>fs.s3a.endpoint</name>
<value>${frankfurt.endpoint}</value>
<name>fs.s3a.endpoint.region</name>
<value>eu-central-1</value>
</property>
```
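
If only one test bucket lives in that region, the same setting can be declared
per-bucket instead; a sketch of this (the bucket name is a placeholder) would be:

```xml
<property>
  <name>fs.s3a.bucket.example-frankfurt-bucket.endpoint.region</name>
  <value>eu-central-1</value>
</property>
```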

This is used for all tests except for scale tests using a public CSV.gz file
(see below).

### <a name="csv"></a> CSV Data Tests

The `TestS3AInputStreamPerformance` tests require read access to a multi-MB
text file. The default file for these tests is one published by amazon,
[s3a://landsat-pds.s3.amazonaws.com/scene_list.gz](http://landsat-pds.s3.amazonaws.com/scene_list.gz).
This is a gzipped CSV index of other files which amazon serves for open use.

Historically it was required to be a `csv.gz` file to validate S3 Select
support. Now that S3 Select support has been removed, other large files
may be used instead.
However, future versions may want to read a CSV file again, so testers
should still reference one.

The path to this object is set in the option `fs.s3a.scale.test.csvfile`,

```xml
@@ -284,19 +289,21 @@ and "sufficiently" large.
(the reason the space or newline is needed is to add "an empty entry"; an empty
`<value/>` would be considered undefined and pick up the default)
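
For example, a sketch which supplies that "empty entry" (a single space) for the
CSV file option, disabling the tests that need it:

```xml
<property>
  <name>fs.s3a.scale.test.csvfile</name>
  <value> </value>
</property>
```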

If using a test file in an S3 region requiring a different endpoint value
set in `fs.s3a.endpoint`, a bucket-specific endpoint must be defined.

If using a test file in a different AWS S3 region then
a bucket-specific region must be defined.
For the default test dataset, hosted in the `landsat-pds` bucket, this is:

```xml
<property>
<name>fs.s3a.bucket.landsat-pds.endpoint</name>
<value>s3.amazonaws.com</value>
<description>The endpoint for s3a://landsat-pds URLs</description>
<name>fs.s3a.bucket.landsat-pds.endpoint.region</name>
<value>us-west-2</value>
<description>The region for s3a://landsat-pds</description>
</property>
```

### <a name="csv"></a> Testing Access Point Integration
### <a name="access"></a> Testing Access Point Integration

S3a supports using Access Point ARNs to access data in S3. If you think your changes affect VPC
integration, request signing, ARN manipulation, or any code path that deals with the actual
sending and retrieving of data to/from S3, make sure you run the entire integration test suite with
@@ -551,9 +558,9 @@ They do not run automatically: they must be explicitly run from the command line

Look in the source for these and read the Javadocs before executing.

## <a name="alternate_s3"></a> Testing against non AWS S3 endpoints.
## <a name="alternate_s3"></a> Testing against non-AWS S3 Stores.

The S3A filesystem is designed to work with storage endpoints which implement
The S3A filesystem is designed to work with S3 stores which implement
the S3 protocols to the extent that the Amazon S3 SDK is capable of talking
to them. We encourage testing against other filesystems and submissions of patches
which address issues. In particular, we encourage testing of Hadoop release
@@ -579,9 +586,11 @@ on third party stores.
<property>
<name>test.fs.s3a.create.create.acl.enabled</name>
<value>false</value>
< /property>
</property>
```

See [Third Party Stores](third_party_stores.html) for more on this topic.

### Public datasets used in tests

Some tests rely on the presence of existing public datasets available on Amazon S3.
@@ -595,20 +604,6 @@ store that supports these tests.
An example of this might be the MarkerTools tests which require a bucket with a large number of
objects or the requester pays tests that require requester pays to be enabled for the bucket.

### Disabling the encryption tests

If the endpoint doesn't support server-side-encryption, these will fail. They
can be turned off.

```xml
<property>
<name>test.fs.s3a.encryption.enabled</name>
<value>false</value>
</property>
```

Encryption is only used for those specific test suites with `Encryption` in
their classname.

### Disabling the storage class tests

@@ -654,7 +649,7 @@ If `ITestS3AContractGetFileStatusV1List` fails with any error about unsupported
```

Note: there's no equivalent for turning off v2 listing API, which all stores are now
expected to support.
required to support.


### Testing Requester Pays
@@ -745,12 +740,8 @@ after setting this rerun the tests
log4j.logger.org.apache.hadoop.fs.s3a=DEBUG
```

There are also some logging options for debug logging of the AWS client
```properties
log4j.logger.com.amazonaws=DEBUG
log4j.logger.com.amazonaws.http.conn.ssl=INFO
log4j.logger.com.amazonaws.internal=INFO
```
There are also some logging options for debug logging of the AWS client;
consult the log4j file for these.

There is also the option of enabling logging on a bucket; this could perhaps
be used to diagnose problems from that end. This isn't something actively
@@ -872,13 +863,13 @@ against other regions, or with third party S3 implementations. Thus the
URL can be overridden for testing elsewhere.


### Works With Other S3 Endpoints
### Works With Other S3 Stores

Don't assume AWS S3 US-East only; do allow for working with external S3 implementations.
Those may be behind the latest S3 API features, not support encryption, session
APIs, etc.

They won't have the same CSV test files as some of the input tests rely on.
They won't have the same CSV/large test files as some of the input tests rely on.
Look at `ITestS3AInputStreamPerformance` to see how tests can be written
to support the declaration of a specific large test file on alternate filesystems.
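
A sketch of declaring an alternative large test file on such a store (the bucket and
object names are placeholders):

```xml
<property>
  <name>fs.s3a.scale.test.csvfile</name>
  <value>s3a://example-bucket/testdata/large-file.gz</value>
</property>
```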

@@ -935,6 +926,8 @@ modifying the config. As an example from `AbstractTestS3AEncryption`:
protected Configuration createConfiguration() {
Configuration conf = super.createConfiguration();
S3ATestUtils.disableFilesystemCaching(conf);
removeBaseAndBucketOverrides(conf,
SERVER_SIDE_ENCRYPTION_ALGORITHM);

conf.set(Constants.SERVER_SIDE_ENCRYPTION_ALGORITHM,
getSSEAlgorithm().getMethod());
return conf;
@@ -991,9 +984,8 @@ than on the maven command line:

### Keeping AWS Costs down

Most of the base S3 tests are designed to use public AWS data
(the landsat-pds bucket) for read IO, so you don't have to pay for bytes
downloaded or long term storage costs. The scale tests do work with more data
Most of the base S3 tests are designed to delete files after test runs,
so you don't have to pay for storage costs. The scale tests do work with more data
so will cost more as well as generally take more time to execute.

You are however billed for
@@ -1102,7 +1094,7 @@ The usual credentials needed to log in to the bucket will be used, but now
the credentials used to interact with S3 will be temporary
role credentials, rather than the full credentials.

## <a name="qualifiying_sdk_updates"></a> Qualifying an AWS SDK Update
## <a name="qualifying_sdk_updates"></a> Qualifying an AWS SDK Update

Updating the AWS SDK is something which does need to be done regularly,
but is rarely without complications, major or minor.
19 changes: 13 additions & 6 deletions hadoop-tools/hadoop-aws/src/test/resources/core-site.xml
@@ -31,6 +31,13 @@
</property>

<!-- Per-bucket configurations: landsat-pds -->
<!--
A CSV file in this bucket was used for testing S3 select.
Although this feature has been removed (HADOOP-18830),
the file is still used in some tests as a large file to read
in a bucket without write permissions.
These tests do not need a CSV file.
-->
<property>
<name>fs.s3a.bucket.landsat-pds.endpoint.region</name>
<value>us-west-2</value>
@@ -56,13 +63,13 @@
<description>Do not add the referrer header to landsat operations</description>
</property>

<property>
<name>fs.s3a.bucket.landsat-pds.endpoint.fips</name>
<value>true</value>
<description>Use the fips endpoint</description>
</property>

<!-- Per-bucket configurations: usgs-landsat -->
<!--
This is a requester-pays bucket (so validates that feature)
and, because it has many files, is used to validate paged file
listing without needing to create thousands of files.
-->

<property>
<name>fs.s3a.bucket.usgs-landsat.endpoint.region</name>
<value>us-west-2</value>
