Add terraform deploying infrastructure for benchmarks
## Description
Currently, in order to run performance benchmarks, one needs to create the infrastructure manually. This PR adds Terraform scripts which do that automatically for AWS and GCP.

I tested this patch manually on AWS and GCP.

## Does this PR introduce _any_ user-facing changes?
No.

Closes delta-io#1179

Co-authored-by: Grzegorz Kołakowski <[email protected]>
Signed-off-by: Scott Sandre <[email protected]>
GitOrigin-RevId: 9cb7769afc7889beb743f499f271d8eac1167c1f
2 people authored and allisonport-db committed Aug 1, 2022
1 parent ff6914b commit 8158663
Showing 30 changed files with 926 additions and 1 deletion.
15 changes: 15 additions & 0 deletions .gitignore
@@ -114,3 +114,18 @@ pycodestyle*.py

# For IDE settings
.vscode

# For Terraform
**/.terraform/*
*.tfstate
*.tfstate.*
crash.log
crash.*.log
*.tfvars
*.tfvars.json
override.tf
override.tf.json
*_override.tf
*_override.tf.json
.terraformrc
.terraform.rc
6 changes: 5 additions & 1 deletion benchmarks/README.md
@@ -21,6 +21,8 @@ The next section will provide the detailed steps of how to setup the necessary H
- An S3 bucket which will be used to hold the generated TPC-DS data.
- A machine which has access to the AWS setup and where this repository has been downloaded or cloned.

There are two ways to create the infrastructure required for benchmarks: using the provided [Terraform template](infrastructure/aws/terraform/README.md) or manually (as described below).

#### Create external Hive Metastore using Amazon RDS
Create an external Hive Metastore in a MySQL database using Amazon RDS with the following specifications:
- MySQL 8.x on a `db.m5.large`.
@@ -64,6 +66,8 @@ _________________
or assigned to the [master node](https://cloud.google.com/compute/docs/connect/add-ssh-keys#after-vm-creation) only.
- Ideally, all GCP components used in the benchmark should be in the same location (Storage bucket, Dataproc Metastore service and Dataproc cluster).

There are two ways to create the infrastructure required for benchmarks: using the provided [Terraform template](infrastructure/gcp/terraform/README.md) or manually (as described below).

#### Prepare GCS bucket
Create a new GCS bucket (or use an existing one) which is in the same region as your Dataproc cluster.

@@ -140,7 +144,7 @@ Verify that you have the following information
- <CLOUD_PROVIDER>: Currently either `gcp` or `aws`. For each storage type, different Delta properties might be added.

Then run a simple table write-read test by running the following in your shell:

```sh
./run-benchmark.py \
--cluster-hostname <HOSTNAME> \
22 changes: 22 additions & 0 deletions benchmarks/infrastructure/aws/terraform/.terraform.lock.hcl


84 changes: 84 additions & 0 deletions benchmarks/infrastructure/aws/terraform/README.md
@@ -0,0 +1,84 @@
# Create infrastructure with Terraform

1. Install [Terraform](https://learn.hashicorp.com/tutorials/terraform/install-cli?in=terraform/aws-get-started).
2. Create an IAM user which will be used to create the benchmarks infrastructure. Ensure that your AWS CLI is configured:
   you should either have valid credentials in the shared credentials file (e.g. `~/.aws/credentials`)
```
[default]
aws_access_key_id = anaccesskey
aws_secret_access_key = asecretkey
```
or export keys as environment variables:
```bash
export AWS_ACCESS_KEY_ID="anaccesskey"
export AWS_SECRET_ACCESS_KEY="asecretkey"
```
3. Add permissions for the IAM user. You can either assign the `AdministratorAccess` AWS managed policy (discouraged)
   or assign AWS managed policies in a more granular way:
* `IAMFullAccess`
* `AmazonVPCFullAccess`
* `AmazonEMRFullAccessPolicy_v2`
* `AmazonElasticMapReduceFullAccess`
* `AmazonRDSFullAccess`
* `AmazonS3FullAccess`
    * a custom policy for EC2 key pair management
```json
{
"Version": "2012-10-17",
"Statement": [
{
"Effect": "Allow",
"Action": [
"ec2:ImportKeyPair",
"ec2:CreateKeyPair",
"ec2:DeleteKeyPair"
],
"Resource": "arn:aws:ec2:*:*:key-pair/benchmarks_key_pair"
}
]
}
```
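If you prefer the CLI over the console, managed policies can be attached with `aws iam attach-user-policy` (a sketch; the user name `benchmarks-user` is a placeholder for the IAM user you created):
```bash
# Attach one of the managed policies listed above to the IAM user.
# Repeat for each policy; "benchmarks-user" is a placeholder name.
aws iam attach-user-policy \
  --user-name benchmarks-user \
  --policy-arn arn:aws:iam::aws:policy/AmazonS3FullAccess
```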

4. Create Terraform variable file `benchmarks/infrastructure/aws/terraform/terraform.tfvars` and fill in variable values.
```tf
region = "<REGION>"
availability_zone1 = "<AVAILABILITY_ZONE1>"
availability_zone2 = "<AVAILABILITY_ZONE2>"
benchmarks_bucket_name = "<BUCKET_NAME>"
source_bucket_name = "<SOURCE_BUCKET_NAME>"
mysql_user = "<MYSQL_USER>"
mysql_password = "<MYSQL_PASSWORD>"
emr_public_key_path = "<EMR_PUBLIC_KEY_PATH>"
user_ip_address = "<MY_IP>"
emr_workers = WORKERS_COUNT
tags = {
key1 = "value1"
key2 = "value2"
}
```
Please check `variables.tf` to learn more about each parameter.
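For reference, each entry in `terraform.tfvars` corresponds to a declaration in `variables.tf`; a typical declaration looks roughly like this (a sketch, not the exact file contents):
```tf
variable "emr_workers" {
  type        = number
  description = "Number of EMR worker nodes"
}

variable "tags" {
  type        = map(string)
  description = "Tags applied to all created resources"
  default     = {}
}
```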

5. Run:
```bash
terraform init
terraform validate
terraform apply
```
As a result, a new VPC, an S3 bucket, a MySQL instance (metastore) and an EMR cluster will be created.
The `apply` command returns `master_node_address`, which will be used when running benchmarks.
```
Apply complete! Resources: 16 added, 0 changed, 0 destroyed.
Outputs:
master_node_address = "35.165.163.250"
```
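The address can also be retrieved later without re-running `apply`, as long as the Terraform state is still present locally:
```bash
# Print only the output value, without quotes, suitable for scripting.
terraform output -raw master_node_address
```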

6. Once the benchmarks are finished, destroy the resources.
```bash
terraform destroy
```
If the S3 bucket contains any objects, it will not be destroyed automatically.
You need to empty it manually, which avoids accidental data loss.
```
Error: deleting S3 Bucket (my-bucket): BucketNotEmpty: The bucket you tried to delete is not empty
status code: 409, request id: Q11TYZ5E0B23QGQ2, host id: WdeFY88km5IBhy+bi2hqXzgjBxjrn1+OPtCstsWDjkwGNCyEhXYjq330DZq1jbfNXojBEejH6Wg=
```
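One way to empty the bucket before re-running `terraform destroy` (a sketch using the AWS CLI; `my-bucket` is a placeholder, and the deletion is irreversible, so double-check the bucket name first):
```bash
# Recursively delete all objects in the bucket so terraform destroy can remove it.
aws s3 rm s3://my-bucket --recursive
```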
31 changes: 31 additions & 0 deletions benchmarks/infrastructure/aws/terraform/main.tf
@@ -0,0 +1,31 @@
module "networking" {
source = "./modules/networking"

availability_zone1 = var.availability_zone1
availability_zone2 = var.availability_zone2
}

module "storage" {
source = "./modules/storage"

benchmarks_bucket_name = var.benchmarks_bucket_name
}

module "processing" {
source = "./modules/processing"

vpc_id = module.networking.vpc_id
subnet1_id = module.networking.subnet1_id
subnet2_id = module.networking.subnet2_id

availability_zone1 = var.availability_zone1
benchmarks_bucket_name = var.benchmarks_bucket_name
source_bucket_name = var.source_bucket_name
mysql_user = var.mysql_user
mysql_password = var.mysql_password
emr_public_key_path = var.emr_public_key_path
emr_workers = var.emr_workers
user_ip_address = var.user_ip_address

depends_on = [module.networking, module.storage]
}
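Each value passed to the `processing` module above must have a matching input variable declared inside the module; a sketch of what `modules/processing/variables.tf` would declare for a few of them (names taken from the module call, types assumed):
```tf
variable "vpc_id" {
  type = string
}

variable "subnet1_id" {
  type = string
}

variable "emr_workers" {
  type = number
}
```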
32 changes: 32 additions & 0 deletions benchmarks/infrastructure/aws/terraform/modules/networking/main.tf
@@ -0,0 +1,32 @@
resource "aws_vpc" "this" {
cidr_block = "10.0.0.0/16"
}

resource "aws_subnet" "benchmarks_subnet1" {
vpc_id = aws_vpc.this.id
availability_zone = var.availability_zone1
cidr_block = "10.0.0.0/17"
}

# Two subnets are required to create an RDS DB subnet group; this second one is otherwise unused.
# If the DB subnet group is built using only one AZ, the following error is thrown:
#   The DB subnet group doesn't meet Availability Zone (AZ) coverage requirement.
#   Current AZ coverage: us-west-2a. Add subnets to cover at least 2 AZs.
resource "aws_subnet" "benchmarks_subnet2" {
vpc_id = aws_vpc.this.id
availability_zone = var.availability_zone2
cidr_block = "10.0.128.0/17"
}

resource "aws_internet_gateway" "this" {
vpc_id = aws_vpc.this.id
}

resource "aws_default_route_table" "public" {
default_route_table_id = aws_vpc.this.default_route_table_id

route {
cidr_block = "0.0.0.0/0"
gateway_id = aws_internet_gateway.this.id
}
}
@@ -0,0 +1,11 @@
output "vpc_id" {
value = aws_vpc.this.id
}

output "subnet1_id" {
value = aws_subnet.benchmarks_subnet1.id
}

output "subnet2_id" {
value = aws_subnet.benchmarks_subnet2.id
}
@@ -0,0 +1,7 @@
variable "availability_zone1" {
type = string
}

variable "availability_zone2" {
type = string
}
