Every bucket has three data access levels:

- Read only
- Read/write
- Admin
  - this provides read/write access and allows the user to add and remove other users from the bucket's data access group

### Path-specific access

For further details, see the relevant sections for [`Rs3tools`](#rs3tools), [`boto3`](#boto3) and the other packages below.

#### JupyterLab / VSCode

The main options for interacting with files stored in AWS S3 buckets on the Analytical Platform via JupyterLab and VSCode include:

- Reading files:
  - [`polars`](https://docs.pola.rs/)
  - [`pandas`](https://pandas.pydata.org/docs/)
  - [`awswrangler`](https://pypi.org/project/awswrangler/)
  - [`mojap-arrow-pd-parser`](https://github.com/moj-analytical-services/mojap-arrow-pd-parser?tab=readme-ov-file#mojap-arrow-pd-parser)
- Downloading / uploading files:
  - [`awswrangler`](https://pypi.org/project/awswrangler/)
  - [`boto3`](https://boto3.amazonaws.com/v1/documentation/api/latest/index.html)

### Installation and usage

```r
s3browser::file_explorer_s3()
```

You can find out more about how to use `s3browser` on [GitHub](https://github.com/moj-analytical-services/s3browser).

### JupyterLab and VSCode (Python)

You can read/write directly from S3 using a variety of tools: [awswrangler](https://pypi.org/project/awswrangler/#quick-start), [polars](https://docs.pola.rs/), [boto3](https://boto3.amazonaws.com/v1/documentation/api/latest/index.html), [pandas](https://pandas.pydata.org/docs/user_guide/index.html), or [mojap-arrow-pd-parser](https://github.com/moj-analytical-services/mojap-arrow-pd-parser?tab=readme-ov-file#mojap-arrow-pd-parser).

#### `AWS Data Wrangler`

You can use [`awswrangler`](https://pypi.org/project/awswrangler/#quick-start) to work with data stored in Amazon S3.

To install AWS Wrangler, run the following code in a terminal:

```sh
python -m pip install awswrangler
```

To read a CSV file from S3 using AWS Wrangler:

```py
import awswrangler as wr

# Retrieving the data directly from Amazon S3
df = wr.s3.read_csv("s3://bucket/dataset/", dataset=True)
```
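
`awswrangler` can also write a DataFrame back to S3. A minimal sketch, assuming `s3://bucket/dataset/` is a placeholder for a writable path:

```py
import awswrangler as wr
import pandas as pd

# Write a DataFrame to S3 as Parquet; dataset=True stores it as a
# multi-file dataset under the given prefix
df = pd.DataFrame({"id": [1, 2, 3], "value": ["a", "b", "c"]})
wr.s3.to_parquet(df=df, path="s3://bucket/dataset/", dataset=True)
```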

More information can be found in the [product documentation](https://aws-data-wrangler.readthedocs.io/en/stable/tutorials/003%20-%20Amazon%20S3.html).

#### `boto3`

You can also download or read objects using the [`boto3`](https://boto3.amazonaws.com/v1/documentation/api/latest/guide/quickstart.html) package.

You can install `boto3` by running the following code in a terminal:

```bash
python -m pip install boto3
```

To [download a file from Amazon S3](https://boto3.amazonaws.com/v1/documentation/api/latest/guide/s3-example-download-file.html#downloading-files), you should use the following code:

```python
import boto3

# Create an S3 client and download the object at 'key' to 'local_path'
s3_client = boto3.client('s3')
s3_client.download_file('bucket_name', 'key', 'local_path')
```

If you receive an `ImportError`, try restarting your kernel, so that Python recognises your `boto3` installation.

Here, you should substitute `'bucket_name'` with the name of the bucket, `'key'` with the path of the object in Amazon S3, and `'local_path'` with the local path where you would like to save the downloaded file.

To upload a file to Amazon S3, you should use the following code:

```python
# Upload sample contents to S3
import boto3

s3_client = boto3.client('s3')

data = b'This is the content of the file uploaded from python boto3'
file_name = 'your_file_name.txt'

response = s3_client.put_object(Bucket='your_bucket_name', Body=data, Key=file_name)
print(f"AWS response code for uploading file is {response['ResponseMetadata']['HTTPStatusCode']}")
```
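
To read an object straight into memory instead of downloading it to disk, you can use `get_object` (again, `'bucket_name'` and `'key'` are placeholders):

```python
import boto3

s3_client = boto3.client('s3')

# Read the object's contents into memory without writing a local file
response = s3_client.get_object(Bucket='bucket_name', Key='key')
contents = response['Body'].read()
```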

You can find more information in the [package documentation](https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/s3.html#S3.Object.download_file).

The package also has a lot of other functionality including specifying data types when reading (or writing). More details can be found in the package [README](https://github.com/moj-analytical-services/mojap-arrow-pd-parser#mojap-arrow-pd-parser).
#### `polars`

[`polars`](https://docs.pola.rs/) is a fast DataFrame library implemented in Rust and can be used to read/write data from/to Amazon S3. To install polars and s3fs, run the following code in a terminal:

```bash
python -m pip install polars s3fs
```

To read a CSV file from S3 using polars (see [reading from Cloud storage](https://docs.pola.rs/user-guide/io/cloud-storage/) in the docs):

```py
import polars as pl

source = "s3://bucket/*.csv"

df = pl.read_csv(source)
```
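
For larger datasets you may prefer a lazy scan, so that polars only reads the data the query needs. A sketch using the same placeholder path and a hypothetical column `bar`:

```py
import polars as pl

# Lazily scan CSVs on S3; nothing is read until .collect() runs,
# so filters can be pushed down before the data is fetched
lf = pl.scan_csv("s3://bucket/*.csv")
df = lf.filter(pl.col("bar") > 2).collect()
```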

To [write a DataFrame to S3](https://docs.pola.rs/user-guide/io/cloud-storage/#writing-to-cloud-storage):

```py
import polars as pl
import s3fs

df = pl.DataFrame({
    "foo": ["a", "b", "c", "d", "d"],
    "bar": [1, 2, 3, 4, 5],
})

fs = s3fs.S3FileSystem()
destination = "s3://bucket/my_file.parquet"

# write parquet
with fs.open(destination, mode='wb') as f:
    df.write_parquet(f)
```

#### `pandas`

You can use any of the `pandas` read functions (for example, `read_csv` or `read_json`) to download data directly from Amazon S3. This requires that you have installed the `pandas` and `s3fs` packages. To install these, run the following code in a terminal:

```bash
python -m pip install pandas s3fs
```

As an example, to read a CSV, you should run the following code:

```py
import pandas as pd
pd.read_csv('s3://bucket_name/key')
```

Here, you should substitute `bucket_name` with the name of the bucket and `key` with the path of the object in Amazon S3.
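
Writing works the same way: the pandas `to_*` methods accept `s3://` paths when `s3fs` is installed. A minimal sketch with a placeholder path:

```py
import pandas as pd

# Write a DataFrame straight to S3; s3fs resolves the s3:// path
df = pd.DataFrame({"id": [1, 2], "value": ["a", "b"]})
df.to_csv('s3://bucket_name/key', index=False)
```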

#### `mojap-arrow-pd-parser`

`mojap-arrow-pd-parser` provides easy CSV, JSONL and Parquet file readers and writers. To install it, run the following code in a terminal:

```bash
python -m pip install arrow-pd-parser
```

To read or write a CSV file from S3:

```python
from arrow_pd_parser import reader, writer

# Specifying the reader. Both reader statements are equivalent and call
# the same readers under the hood
df1 = reader.read("s3://bucket_name/data/all_types.csv", file_format="csv")
df2 = reader.csv.read("s3://bucket_name/data/all_types.csv")

# You can also pass reader args to the reader as kwargs
df3 = reader.csv.read("s3://bucket_name/data/all_types.csv", nrows=2)

# The writer API has the same functionality
writer.write(df1, "s3://bucket_name/data/all_types.parquet", file_format="parquet")
writer.parquet.write(df1, "s3://bucket_name/data/all_types.parquet")
```

`mojap-arrow-pd-parser` infers the file type from the extension, so for example `reader.read("s3://bucket_name/file.parquet")` would read a parquet file without need for specifying the file type.

The package also has a lot of other functionality including specifying data types when reading (or writing). More details can be found in the package [README](https://github.com/moj-analytical-services/mojap-arrow-pd-parser#mojap-arrow-pd-parser).
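
For example, a minimal sketch of specifying column types when reading. This is hedged: the exact metadata format follows the mojap-metadata conventions described in the README, so treat the shape below as indicative rather than definitive:

```python
from arrow_pd_parser import reader

# Hypothetical metadata dict following mojap-metadata conventions;
# check the package README for the authoritative schema
metadata = {
    "name": "all_types",
    "columns": [
        {"name": "my_int", "type": "int64"},
        {"name": "my_string", "type": "string"},
    ],
}

df = reader.csv.read("s3://bucket_name/data/all_types.csv", metadata=metadata)
```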
