Add terraform deploying infrastructure for benchmarks
## Description
Currently, in order to run performance benchmarks, one needs to create the infrastructure manually. This PR adds Terraform scripts which do that automatically for AWS and GCP.

I tested this patch manually on AWS and GCP.

## Does this PR introduce _any_ user-facing changes?
No.

Closes delta-io#1179

Co-authored-by: Grzegorz Kołakowski <[email protected]>
Signed-off-by: Scott Sandre <[email protected]>
GitOrigin-RevId: 9cb7769afc7889beb743f499f271d8eac1167c1f
2 people authored and allisonport-db committed Aug 1, 2022
1 parent ff6914b commit 8158663
Showing 30 changed files with 926 additions and 1 deletion.
15 changes: 15 additions & 0 deletions .gitignore
@@ -114,3 +114,18 @@ pycodestyle*.py

# For IDE settings
.vscode

# For Terraform
**/.terraform/*
*.tfstate
*.tfstate.*
crash.log
crash.*.log
*.tfvars
*.tfvars.json
override.tf
override.tf.json
*_override.tf
*_override.tf.json
.terraformrc
.terraform.rc
6 changes: 5 additions & 1 deletion benchmarks/README.md
@@ -21,6 +21,8 @@ The next section will provide the detailed steps of how to setup the necessary H
- An S3 bucket which will be used to hold the generated TPC-DS data.
- A machine which has access to the AWS setup and where this repository has been downloaded or cloned.

There are two ways to create the infrastructure required for benchmarks: using the provided [Terraform template](infrastructure/aws/terraform/README.md) or manually (as described below).

#### Create external Hive Metastore using Amazon RDS
Create an external Hive Metastore in a MySQL database using Amazon RDS with the following specifications:
- MySQL 8.x on a `db.m5.large`.
@@ -64,6 +66,8 @@ _________________
or assigned to the [master node](https://cloud.google.com/compute/docs/connect/add-ssh-keys#after-vm-creation) only.
- Ideally, all GCP components used in the benchmark should be in the same location (Storage bucket, Dataproc Metastore service and Dataproc cluster).

There are two ways to create the infrastructure required for benchmarks: using the provided [Terraform template](infrastructure/gcp/terraform/README.md) or manually (as described below).

#### Prepare GCS bucket
Create a new GCS bucket (or use an existing one) which is in the same region as your Dataproc cluster.

@@ -140,7 +144,7 @@ Verify that you have the following information
- <CLOUD_PROVIDER>: Currently either `gcp` or `aws`. For each storage type, different Delta properties might be added.

Then run a simple table write-read test by running the following in your shell:

```sh
./run-benchmark.py \
--cluster-hostname <HOSTNAME> \
22 changes: 22 additions & 0 deletions benchmarks/infrastructure/aws/terraform/.terraform.lock.hcl


84 changes: 84 additions & 0 deletions benchmarks/infrastructure/aws/terraform/README.md
@@ -0,0 +1,84 @@
# Create infrastructure with Terraform

1. Install [Terraform](https://learn.hashicorp.com/tutorials/terraform/install-cli?in=terraform/aws-get-started).
2. Create an IAM user which will be used to create the benchmarks infrastructure. Ensure that your AWS CLI is configured:
   you should either have valid credentials in the shared credentials file (e.g. `~/.aws/credentials`)
```
[default]
aws_access_key_id = anaccesskey
aws_secret_access_key = asecretkey
```
or export keys as environment variables:
```bash
export AWS_ACCESS_KEY_ID="anaccesskey"
export AWS_SECRET_ACCESS_KEY="asecretkey"
```
3. Add permissions for the IAM user. You can either assign the `AdministratorAccess` AWS managed policy (discouraged)
   or assign AWS managed policies in a more granular way:
* `IAMFullAccess`
* `AmazonVPCFullAccess`
* `AmazonEMRFullAccessPolicy_v2`
* `AmazonElasticMapReduceFullAccess`
* `AmazonRDSFullAccess`
* `AmazonS3FullAccess`
    * a custom policy for EC2 key pair management
```json
{
"Version": "2012-10-17",
"Statement": [
{
"Effect": "Allow",
"Action": [
"ec2:ImportKeyPair",
"ec2:CreateKeyPair",
"ec2:DeleteKeyPair"
],
"Resource": "arn:aws:ec2:*:*:key-pair/benchmarks_key_pair"
}
]
}
```
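If you prefer the CLI over the console, managed policies can be attached with `aws iam attach-user-policy` (a sketch; the user name `benchmarks-user` is a placeholder for the IAM user you created):
```bash
# Attach one of the managed policies listed above to the IAM user.
# Repeat for each policy; "benchmarks-user" is a placeholder name.
aws iam attach-user-policy \
  --user-name benchmarks-user \
  --policy-arn arn:aws:iam::aws:policy/AmazonS3FullAccess
```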

4. Create Terraform variable file `benchmarks/infrastructure/aws/terraform/terraform.tfvars` and fill in variable values.
```tf
region = "<REGION>"
availability_zone1 = "<AVAILABILITY_ZONE1>"
availability_zone2 = "<AVAILABILITY_ZONE2>"
benchmarks_bucket_name = "<BUCKET_NAME>"
source_bucket_name = "<SOURCE_BUCKET_NAME>"
mysql_user = "<MYSQL_USER>"
mysql_password = "<MYSQL_PASSWORD>"
emr_public_key_path = "<EMR_PUBLIC_KEY_PATH>"
user_ip_address = "<MY_IP>"
emr_workers = WORKERS_COUNT
tags = {
key1 = "value1"
key2 = "value2"
}
```
Please check `variables.tf` to learn more about each parameter.
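For reference, each entry in `terraform.tfvars` corresponds to a declaration in `variables.tf`; a typical declaration looks roughly like this (a sketch, not the exact file contents):
```tf
variable "emr_workers" {
  type        = number
  description = "Number of EMR worker nodes"
}

variable "tags" {
  type        = map(string)
  description = "Tags applied to all created resources"
  default     = {}
}
```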

5. Run:
```bash
terraform init
terraform validate
terraform apply
```
As a result, a new VPC, an S3 bucket, a MySQL instance (metastore) and an EMR cluster will be created.
The `apply` command returns `master_node_address`, which will be used when running benchmarks.
```
Apply complete! Resources: 16 added, 0 changed, 0 destroyed.
Outputs:
master_node_address = "35.165.163.250"
```
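The address can also be retrieved later without re-running `apply`, as long as the Terraform state is still present locally:
```bash
# Print only the output value, without quotes, suitable for scripting.
terraform output -raw master_node_address
```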

6. Once the benchmarks are finished, destroy the resources.
```bash
terraform destroy
```
If the S3 bucket contains any objects, it will not be destroyed automatically.
You need to empty it manually, which avoids accidental data loss.
```
Error: deleting S3 Bucket (my-bucket): BucketNotEmpty: The bucket you tried to delete is not empty
status code: 409, request id: Q11TYZ5E0B23QGQ2, host id: WdeFY88km5IBhy+bi2hqXzgjBxjrn1+OPtCstsWDjkwGNCyEhXYjq330DZq1jbfNXojBEejH6Wg=
```
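One way to empty the bucket before re-running `terraform destroy` (a sketch using the AWS CLI; `my-bucket` is a placeholder, and the deletion is irreversible, so double-check the bucket name first):
```bash
# Recursively delete all objects in the bucket so terraform destroy can remove it.
aws s3 rm s3://my-bucket --recursive
```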
31 changes: 31 additions & 0 deletions benchmarks/infrastructure/aws/terraform/main.tf
@@ -0,0 +1,31 @@
module "networking" {
source = "./modules/networking"

availability_zone1 = var.availability_zone1
availability_zone2 = var.availability_zone2
}

module "storage" {
source = "./modules/storage"

benchmarks_bucket_name = var.benchmarks_bucket_name
}

module "processing" {
source = "./modules/processing"

vpc_id = module.networking.vpc_id
subnet1_id = module.networking.subnet1_id
subnet2_id = module.networking.subnet2_id

availability_zone1 = var.availability_zone1
benchmarks_bucket_name = var.benchmarks_bucket_name
source_bucket_name = var.source_bucket_name
mysql_user = var.mysql_user
mysql_password = var.mysql_password
emr_public_key_path = var.emr_public_key_path
emr_workers = var.emr_workers
user_ip_address = var.user_ip_address

depends_on = [module.networking, module.storage]
}
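Each value passed to the `processing` module above must have a matching input variable declared inside the module; a sketch of what `modules/processing/variables.tf` would declare for a few of them (names taken from the module call, types assumed):
```tf
variable "vpc_id" {
  type = string
}

variable "subnet1_id" {
  type = string
}

variable "emr_workers" {
  type = number
}
```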
32 changes: 32 additions & 0 deletions benchmarks/infrastructure/aws/terraform/modules/networking/main.tf
@@ -0,0 +1,32 @@
resource "aws_vpc" "this" {
cidr_block = "10.0.0.0/16"
}

resource "aws_subnet" "benchmarks_subnet1" {
vpc_id = aws_vpc.this.id
availability_zone = var.availability_zone1
cidr_block = "10.0.0.0/17"
}

# Two subnets are required to create an RDS DB subnet group; this second one is otherwise unused.
# If the DB subnet group is built using only one AZ, the following error is thrown:
#   The DB subnet group doesn't meet Availability Zone (AZ) coverage requirement.
#   Current AZ coverage: us-west-2a. Add subnets to cover at least 2 AZs.
resource "aws_subnet" "benchmarks_subnet2" {
vpc_id = aws_vpc.this.id
availability_zone = var.availability_zone2
cidr_block = "10.0.128.0/17"
}

resource "aws_internet_gateway" "this" {
vpc_id = aws_vpc.this.id
}

resource "aws_default_route_table" "public" {
default_route_table_id = aws_vpc.this.default_route_table_id

route {
cidr_block = "0.0.0.0/0"
gateway_id = aws_internet_gateway.this.id
}
}
@@ -0,0 +1,11 @@
output "vpc_id" {
value = aws_vpc.this.id
}

output "subnet1_id" {
value = aws_subnet.benchmarks_subnet1.id
}

output "subnet2_id" {
value = aws_subnet.benchmarks_subnet2.id
}
@@ -0,0 +1,7 @@
variable "availability_zone1" {
type = string
}

variable "availability_zone2" {
type = string
}
