Home
This is an entry-level tutorial for engineers who wish to build a scalable web application but do not know how to get started. The demo application takes advantage of various services offered by AWS, such as Elastic Compute Cloud (EC2), Application Load Balancer (ALB), Relational Database Service (RDS), ElastiCache, Simple Storage Service (S3), Identity and Access Management (IAM), as well as CloudWatch and AutoScaling. By using AWS we make things easy enough that an engineer with a little Linux experience can finish this tutorial in a couple of hours. If you are building a scalable web application on your own infrastructure or on another public cloud, the theory should be the same but the actual implementation might be somewhat different.
Please note that the code provided here is for demo only and is not intended for production usage.
(1) Executive Summary
I want to build a scalable web application that can serve a large number of users. I really don't know how many users I will get, so the technical specification is simply "as many as possible". As shown in the above design, this is a photo sharing website with the following functions:
(1) User authentication (login / logout).
(2) Authenticated users can upload photos.
(3) The default page displays the latest N uploaded photos.
In this tutorial, we will accomplish this goal through the following levels:
(0) A basic version with all the components deployed on one single server. The web application is developed with PHP, using Apache as the web server, with MySQL as the database to store user upload information.
(1) Based on the basic version we developed in level (0), scale the application to two or more servers.
(2) Offload user uploads to S3.
(3) Dynamically scale the size of your server fleet according to the actual traffic to your web application.
(4) Implement a cache layer between the web server and the database server.
(5) Use CloudFront for content delivery.
(6) Use Kinesis Analytics to perform simple near real-time analysis of your web traffic.
(2) LEVEL 0
In this level, we will build a basic version with all the components deployed on one single server. The web application is developed with PHP, using Apache as the web server, with MySQL as the database to store user upload information. You do not need to write any code, because I have a set of demo code prepared for you. You just need to launch an EC2 instance, carry out some basic configurations, then deploy the demo code.
Log in to your AWS Console and navigate to the EC2 Console. Launch an EC2 instance with an Ubuntu 18.04 / 20.04 / 22.04 AMI. Make sure that you allocate a public IP to your EC2 instance. In your security group settings, open port 80 for HTTP and port 22 for SSH access. After the instance becomes "running" and passes health checks, SSH into your EC2 instance to set up software dependencies and download the demo code from GitHub to the default web server folder:
The default username for Ubuntu AMIs is ubuntu, not ec2-user.
First of all we install the MySQL database server:
$ sudo apt-get update
$ sudo apt-get install mysql-server
$ sudo mysql_secure_installation
If you are running Ubuntu 22.04 you might encounter the following error when running mysql_secure_installation. A solution for this issue is provided here.
Output
... Failed! Error: SET PASSWORD has no significance for user 'root'@'localhost' as the authentication method used doesn't store authentication data in the MySQL server. Please consider using ALTER USER instead if you want to change authentication parameters.
New password:
Then we install Apache and PHP:
$ sudo apt-get install awscli apache2
$ sudo apt-get install php libapache2-mod-php php-mysql php-curl php-xml php-memcached php-mbstring
You might need to restart your Apache web server:
$ sudo service apache2 restart
Then we clone the git repository:
$ sudo apt-get install git
$ cd /var
$ sudo chown -R ubuntu:ubuntu www
$ cd /var/www/html
$ git clone https://github.com/qyjohn/web-demo
$ cd web-demo
Change the ownership of the "uploads" folder to "www-data" so that Apache can write uploaded files to this folder.
$ sudo chown -R www-data:www-data uploads
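For context, below is a minimal sketch of an upload handler (a hypothetical upload_example.php, not the demo's actual code) illustrating why this ownership change matters: Apache executes PHP as the www-data user, so www-data must be able to write into the uploads folder.
<?php
// Hypothetical upload_example.php (a simplified sketch, not the demo's code).
// Apache runs PHP as www-data, so www-data must be able to write to uploads/
// for move_uploaded_file() to succeed.
if (isset($_FILES['photo']) && $_FILES['photo']['error'] === UPLOAD_ERR_OK)
{
    $target = __DIR__ . '/uploads/' . basename($_FILES['photo']['name']);
    if (move_uploaded_file($_FILES['photo']['tmp_name'], $target))
    {
        echo "Upload OK: " . htmlspecialchars($target);
    }
    else
    {
        echo "Upload failed. Check the ownership and permissions of the uploads folder.";
    }
}
?>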
Then we create a MySQL database and a MySQL user for our demo. Here we use "web_demo" as the database name, "user.name" as the MySQL username, and "pass.word" as the password. Do not use these example credentials; please use your own username and password.
$ sudo mysql
mysql> CREATE DATABASE web_demo;
mysql> CREATE USER 'user.name'@'localhost' IDENTIFIED BY 'pass.word';
mysql> GRANT ALL PRIVILEGES ON web_demo.* TO 'user.name'@'localhost';
mysql> quit
If you are on Ubuntu 22.04 and you have used the above-mentioned workaround to deal with the issue with mysql_secure_installation, you will need to provide the root password to the mysql command, otherwise the login will fail:
$ sudo mysql -p
In the code you cloned from GitHub, we have pre-populated some demo data as examples. We use the following command to import the demo data in web_demo.sql into the web_demo database. You need to replace user.name with the username you specified in the previous step. When prompted for a password, use the password you specified in the previous step.
$ cd /var/www/html/web-demo
$ mysql -u user.name -p web_demo < web_demo.sql
Use a text editor to open config.php, then change the username and password in config.php so that they match the username and password you used in the above-mentioned "CREATE USER" statement.
// Database connection parameters
$db_hostname = "localhost";
$db_database = "web_demo";
$db_username = "user.name";
$db_password = "pass.word";
If you are new to Linux, nano is a nice text editor to get started with. The following command opens config.php in the current working directory for editing. You can use CTRL+o to save your edits, and CTRL+x to exit the editor.
$ nano config.php
In your browser, browse to http://public-ip-address-of-your-ec2-instance/web-demo/index.php. You should see that our application is now working. You can log in with your name, then upload some photos for testing. (You might have noticed that this demo application does not ask you for a password. This is because we would like to make things as simple as possible. Handling user passwords is a complicated issue, which is beyond the scope of this entry-level tutorial.)
If you get a blank page, check the Apache2 logs /var/log/apache2/access.log and /var/log/apache2/error.log. If your HTTP request hits Apache2, there will be a record in the access log. If Apache2 was unable to handle the request, there will be something in the error log.
If you are new to Linux, we recommend that you use commands like more, cat, head and tail to view the content in various logs (instead of opening the log file in a text editor like vi). If you are not familiar with these commands, you can refer to Basic Commands for some quick tips.
Now that your website is working, please upload some more pictures for testing. Upload some small pictures and some big pictures (like 20 MB) to see what happens. Fix any issues you may observe in the tests. Then I suggest that you spend 10 minutes reading through the demo code index.php. The demo code has reasonable documentation in the form of comments, so I am not going to explain the code here.
(3) LEVEL 1
In this level, we will expand the basic version we have in LEVEL 0 and deploy it to multiple servers. To make things easy, we use EFS (which is a managed NFS service) as the shared filesystem for the web server nodes.
STEP 1 - Preparing the EFS File System
Go to the EFS Console and create an EFS file system. The EFS file system needs to be in the same VPC and subnet(s) as your EC2 instances.
Terminate the previous EC2 instance because we no longer need it. Launch a new EC2 instance with the Ubuntu 18.04 / 20.04 / 22.04 operating system. SSH into the EC2 instance to install the following software and mount the EFS file system:
$ sudo apt-get update
$ sudo apt-get install nfs-common
$ sudo mkdir /efs
$ sudo mount -t nfs4 -o nfsvers=4.1,rsize=1048576,wsize=1048576,hard,timeo=600,retrans=2 dns-name-of-your-efs-file-system:/ /efs
$ sudo chown -R ubuntu:ubuntu /efs
Then add the following line to /etc/fstab, so that you do not need to manually mount the EFS file system when the operating system is rebooted.
dns-name-of-your-efs-file-system:/ /efs nfs auto,rsize=1048576,wsize=1048576,hard,timeo=600,retrans=2 0 0
You can verify the above-mentioned configuration is working using the following commands (run them several times):
$ df -h
$ mount
$ sudo umount /efs
$ sudo mount /efs
STEP 2 - Install Apache and PHP
Run the following commands to install Apache and PHP. Notice that we are not installing the MySQL server this time.
$ sudo apt-get install apache2 awscli
$ sudo apt-get install php mysql-client libapache2-mod-php php-mysql php-curl php-xml php-mbstring
$ sudo service apache2 restart
Then we use the EFS file system to store our web application.
$ cd /efs
$ git clone https://github.com/qyjohn/web-demo
$ cd web-demo
$ sudo chown -R www-data:www-data uploads
$ cd /var/www/html
$ sudo ln -s /efs/web-demo web-demo
STEP 3 - Launch an RDS Instance
Launch an RDS instance running MySQL. The RDS instance needs to be in the same VPC as your EC2 instances. When launching the RDS instance, create a default database named "web_demo". When the RDS instance becomes available, make sure that the security group being used on the RDS instance allows inbound connections from your EC2 instance. Then, connect to the RDS instance and create a user for your application. This time, when granting privileges, you need to grant external access for the user.
Use the MySQL client to connect to the RDS instance. You specified a password when you created the RDS instance. If you did not specify a username when creating the RDS MySQL instance, the default username is admin. Again, in the CREATE USER step, you need to create your own username and password, instead of directly using what is provided in this tutorial.
$ mysql -h dns-name-of-rds-instance -u admin -p
mysql> CREATE DATABASE web_demo;
mysql> CREATE USER 'user.name'@'%' IDENTIFIED BY 'pass.word';
mysql> GRANT ALL PRIVILEGES ON web_demo.* TO 'user.name'@'%';
mysql> quit
Then, use the following command to import the demo data in web_demo.sql into the web_demo database on the RDS instance:
$ cd /var/www/html/web-demo
$ mysql -h dns-name-of-rds-instance -u username -p web_demo < web_demo.sql
Now, modify config.php with the new database server hostname, username, password, and database name. Test in your browser and make sure that the web application is working. You should at least be able to upload images to the website. Do not proceed to the next step if it does not work. Below are the general troubleshooting steps:
(1) Check access.log to see if the request hits the server.
(2) Check error.log to see if there are any errors or warnings.
(3) Are you able to reach the RDS instance from EC2? How do you confirm that? (One way is sketched after this list.)
(4) If you are not able to reach RDS from EC2, what might be the problem?
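One simple way to confirm (3) is a tiny PHP script run on the EC2 instance. Below is a minimal sketch (a hypothetical db_check.php, not part of the demo), assuming the php-mysql (mysqli) extension installed earlier; replace the placeholders with the values from your config.php.
<?php
// Hypothetical db_check.php (a simplified sketch, not part of the demo).
// Run it on the EC2 instance with: php db_check.php
mysqli_report(MYSQLI_REPORT_OFF); // return false on failure instead of throwing (PHP 8.1+)
$db = mysqli_connect("dns-name-of-rds-instance", "user.name", "pass.word", "web_demo");
if ($db)
{
    echo "Connected. MySQL server version: " . mysqli_get_server_info($db) . "\n";
    mysqli_close($db);
}
else
{
    // A timeout usually points to security group / VPC issues, while
    // "Access denied" points to the username, password, or grants.
    echo "Connection failed: " . mysqli_connect_error() . "\n";
}
?>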
STEP 4 - Create an ElastiCache Cluster
We use the ElastiCache service for session sharing between multiple web servers. Amazon ElastiCache offers both Memcached clusters and Redis clusters. In PHP, sessions can be handled with the php-memcached module for a Memcached cluster, or with the php-redis module for a Redis cluster.
In this step, you only need to practice one of the following two options. (A small test page for verifying that session sharing works is sketched after both options.)
(Option 1) Using Memcached Cluster
In the ElastiCache console, launch an ElastiCache cluster with Memcached (just 1 single micro/small/medium node is enough) and obtain the endpoint information. The ElastiCache cluster needs to be in the same VPC as your EC2 instances. Please make sure that the security group being used on the ElastiCache cluster allows inbound connection from your EC2 instance.
On the web server, install the php-memcached module:
$ sudo apt-get install php-memcached
Edit /etc/php/x.x/apache2/php.ini (here x.x is the version of your PHP, don't just copy and paste) to use "memcached" as the session handler. You need to make the following modifications:
session.save_handler = memcached
session.save_path = "configuration-endpoint-to-the-elasticache-cluster:11211"
Then restart the Apache web server to make the new configuration take effect.
$ sudo service apache2 restart
(Option 2) Using Redis Cluster
In the ElastiCache console, launch an ElastiCache cluster with Redis with the "Cluster Mode enabled" option and obtain the configuration endpoint information. The ElastiCache cluster needs to be in the same VPC as your EC2 instances. Please make sure that the security group being used on the ElastiCache cluster allows inbound connection from your EC2 instance.
If you have not yet installed the php-redis module, you need to install it to make things work.
$ sudo apt-get install php-redis
On the web server, configure php.ini to use Redis for session sharing. Edit /etc/php/x.x/apache2/php.ini. Make the following modifications:
session.save_handler = rediscluster
session.save_path = "seed[]=configuration-endpoint-of-the-elasticache-redis-cluster:6379"
Then restart the Apache web server to make the new configuration take effect.
$ sudo service apache2 restart
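To verify that sessions are really stored in ElastiCache rather than on the local disk of one web server, you can drop a small test page into the shared web-demo folder. Below is a minimal sketch (a hypothetical session_test.php, not part of the demo) that works with either of the two options above. Once the ALB is in place in STEP 6, refreshing this page through the ALB should show the counter increasing and the session ID staying the same, no matter which web server handles the request.
<?php
// Hypothetical session_test.php (a simplified sketch, not part of the demo).
// Uses whatever session handler is configured in php.ini (memcached or rediscluster).
session_start();
if (!isset($_SESSION['counter']))
{
    $_SESSION['counter'] = 0;
}
$_SESSION['counter']++;
echo "Session ID: " . session_id() . "<br>";
echo "Counter: " . $_SESSION['counter'] . "<br>";
echo "Served by: " . $_SERVER['SERVER_ADDR'] . "<br>";
?>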
STEP 5 - Create an AMI
Now, create an AMI from the EC2 instance and launch a new EC2 instance with the AMI. SSH into the new EC2 instance to verify whether the EFS file system is automatically mounted.
$ mount
If EFS is not automatically mounted when the new EC2 instance is launched, Apache will not see your web-demo code. To deal with the situation, you will need the EFS mount helper. After installing and configuring the EFS mount helper, create a new AMI and launch a new EC2 instance.
STEP 6 - Create an ALB
In the EC2 Console, create a target group (protocol: HTTP, port: 80) and register the two EC2 instances to the target group. Since we have Apache running on both web servers, you can use HTTP as the health check protocol and "/" as the health check path. Create an internet-facing Application Load Balancer (ALB) and forward HTTP traffic on port 80 to the target group.
STEP 7 - Testing
In your browser, browse to http://alb-endpoint/web-demo/index.php. You should be able to log in and upload photos. As you refresh the page, you should see the IP addresses of both EC2 instances without being asked to log in again. You should also see that the session ID remains the same regardless of the changes in the IP address. This proves that session sharing is working, and your demo works on multiple servers.
If things do not work as expected, check the following:
(1) Can your browser reach your ALB?
(2) Do you have any healthy nodes in the target group? If not, why?
(3) Can the ALB nodes reach your EC2 instances?
(4) Are there any errors in the error.log (on both EC2 instances)?
(4) LEVEL 2
Using a shared file system is probably OK for web applications with reasonably limited traffic, but it will become problematic when the traffic to your site increases. At that point you can scale out your front end to have as many web servers as you want, but your web application is limited by the capability of the shared file system. In this level, we will resolve this issue by moving the storage from EFS to S3. This way, the web servers only handle your critical business logic, while the images are served by S3.
From one of the EC2 instances, edit config.php and make some minor changes. It does not matter from which EC2 instance you make the changes, because the source files are stored on EFS. The changes will be reflected on both EC2 instances.
Update config.php with the desired configuration for S3.
$storage_option = "s3"; // hd or s3
$s3_region = "us-east-2";
$s3_bucket = "bucket_name";
$s3_prefix = "uploads";
In order for this new setting to work, we need to attach an IAM role to both EC2 instances. In the IAM Console, create an EC2 role which allows full access to S3. In the EC2 Console, attach the newly created IAM role to both EC2 instances, one by one.
In your browser, again browse to http://dns-name-of-alb/web-demo/index.php. At this point you will see missing images, because the previously uploaded images are not available on S3. Newly uploaded images will go to S3 instead of local disk.
If you can't get this to work, check the following:
- Check your S3 bucket to see if the pictures are uploaded.
- Check /var/log/apache2/error.log if the pictures failed to be uploaded.
- Check the URL of the S3 object in the S3 console. Compare it with the image URL (right click on the web page and select "View Page Source").
The reason we use an IAM role in this tutorial is that with an IAM role you do not need to supply your AWS Access Key and Secret Key in your code. Rather, your code will assume the role assigned to the EC2 instance, and access the AWS resources that your EC2 instance is allowed to access. Today many people and organizations host their source code on github.com or some other public repositories. By using IAM roles you no longer hard-code your AWS credentials in your application, thus eliminating the possibility of leaking your AWS credentials to the public.
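To illustrate the point, below is a minimal sketch (assuming the AWS SDK for PHP v3 installed via Composer; the bucket name and file path are placeholders, not the demo's exact code) of uploading an object to S3 without any credentials in the code. When the script runs on an EC2 instance with the IAM role attached, the SDK obtains temporary credentials from the instance metadata service automatically.
<?php
// Simplified sketch (not the demo's exact code), assuming the AWS SDK for PHP v3
// is available via Composer. No access key or secret key appears anywhere:
// the SDK picks up temporary credentials from the IAM role attached to the instance.
require 'vendor/autoload.php';

use Aws\S3\S3Client;

$s3_client = new S3Client([
    'region'  => 'us-east-2',      // use your own $s3_region
    'version' => 'latest'
]);

$s3_client->putObject([
    'Bucket'     => 'bucket_name', // use your own $s3_bucket
    'Key'        => 'uploads/test.jpg',
    'SourceFile' => '/tmp/test.jpg'
]);
?>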
(5) LEVEL 3
Now you have a scalable web application with two web servers, and you know that you can scale the fleet in and out when needed. Is it possible to scale your fleet automatically, according to the workload of your application?
In our demo code, we have a setting to simulate workloads. If you look at config.php, you will find this piece of code:
$latency = 0;
And, in index.php, there is a corresponding statement:
sleep($latency);
That is, when a user requests index.php, PHP will sleep for $latency seconds. As we know, as the workload increases, the CPU utilization (as well as other factors) increases, resulting in an increase in latency (response time). By manually manipulating the $latency setting, we can simulate a heavy workload on your web application, which is reflected in the average latency.
With AWS, you can use AutoScaling to scale your server fleet in a dynamic fashion, according to the workload of the fleet. In this tutorial, we use average latency (response time) as a trigger for scaling actions. You can achieve this by following these steps:
(1) In your EC2 Console, create a launch template using the AMI we created in LEVEL 1 and the IAM role we created in LEVEL 2.
(2) Create an AutoScaling group using the launch template we created in the previous step. Make sure that the AutoScaling group receives traffic from your ALB target group. You don't need to specify any scaling policy at this point.
(3) Create a new CloudWatch alarm that triggers when the average latency (response time) is greater than 1000 ms for at least 1 minute.
(4) Click on your AutoScaling group, and create a new scaling policy, using the CloudWatch Alarm you just created. The auto scaling action can be “add 1 instance and then wait 300 seconds”. This way, if the average latency of your web application exceeds 1 second, AutoScaling will add one more instance to your fleet.
You can do the testing by adjusting the $latency value on your existing web servers. Please note that the source code resides on your EFS file system, so when you make the change from one of your EC2 instances, the change is reflected on all of your EC2 instances.
When you are done with this step, you can play with scaling down by creating another CloudWatch alarm and a corresponding auto scaling policy. The CloudWatch alarm should trigger when the average latency is smaller than 500 ms for at least 1 minute, and the auto scaling action can be "remove 1 instance and then wait 300 seconds".
(6) LEVEL 4
For many web applications, the database can be a serious bottleneck. In our photo sharing demo, usually the number of view image requests is much greater than the number of upload requests. It is very possible that for many view requests, the most recent N images are actually the same. However, we are connecting to the database to fetch records for the most recent N images for each and every view request. It would be reasonable to update the images we show on the index page in an incremental way, for example, every 1 or 2 minutes.
In this level, we will add a cache layer (cache server) between the web servers and the database. When we fetch records for the most recent N images from the database, we cache the result in the cache server. When a new view request comes in, we no longer connect to the database, but return the cached result to the user. When there is a new image upload, we invalidate the cached version by deleting it from the cache server. When the next view request comes in, we fetch records for the most recent N images from the database again, then cache this new version to the cache server. This way the cached version is always up-to-date.
The demo code has support for database caching through ElastiCache, using the same ElastiCache cluster that handles session sharing. This caching behavior is not enabled by default. You can edit config.php on all web servers with details regarding the cache server:
// Cache configuration
$enable_cache = true;
$cache_type = "memcached"; // memcached or redis
$cache_key = "images_html";
if ($enable_cache && ($cache_type == "memcached"))
{
$cache = open_memcache_connection();
}
else if ($enable_cache && ($cache_type == "redis"))
{
$cache = open_redis_connection();
}
The cache servers are defined in config.php, as shown below. Make sure that you modify the following code to use your own cache servers. For Memcached, you need to comment out the extra nodes in the code if you have only one node.
function open_memcache_connection()
{
// Open a connection to the memcache server
$mem = new Memcached();
$mem->addServer('web-demo.xxxxxx.0001.use2.cache.amazonaws.com', 11211); // node 1
$mem->addServer('web-demo.xxxxxx.0002.use2.cache.amazonaws.com', 11211); // node 2
$mem->addServer('web-demo.xxxxxx.0003.use2.cache.amazonaws.com', 11211); // node 3
return $mem;
}
function open_redis_connection()
{
$parameters = [
'tcp://web-demo.xxxxxx.clustercfg.use2.cache.amazonaws.com:6379' // configuration endpoint
];
$options = [
'cluster' => 'redis'
];
$redis = new Predis\Client($parameters, $options);
return $redis;
}
Refresh the demo application in your browser; you will see that the "Fetching N records from the database." message is now gone, indicating that the information you are seeing is obtained from ElastiCache. When you upload a new image, you will see this message again, indicating the cache is being updated.
The following code is responsible for handling this cache logic:
// Get the most recent N images
if ($enable_cache)
{
error_log("Cache enabled, try to obtain the cached version.");
// Attempt to get the cached records for the front page
$images_html = $cache->get($cache_key);
if (!$images_html)
{
// If there is no such cached record, get it from the database
$images = retrieve_recent_uploads($db, 10, $storage_option);
// Convert the records into HTML
$images_html = db_rows_2_html($images, $storage_option, $hd_folder, $s3_client, $s3_bucket, $enable_cf, $cf_baseurl);
// Then put the HTML into cache
$cache->set($cache_key, $images_html);
}
}
else
{
// This statement gets the last 10 records from the database
$images = retrieve_recent_uploads($db, 10, $storage_option);
$images_html = db_rows_2_html($images, $storage_option, $hd_folder, $s3_client, $s3_bucket, $enable_cf, $cf_baseurl);
}
// Display the images
echo $images_html;
Also pay attention to the following code, which runs when doing image uploads. We delete the cached record after the user uploads an image. This way, when the next request comes in, we fetch the latest records from the database and put them into the cache again.
if ($enable_cache)
{
// Delete the cached record, the user will query the database to get an updated version
if ($cache_type == "memcached")
{
$cache->delete($cache_key);
}
else if ($cache_type == "redis")
{
$cache->del($cache_key);
}
}
(7) LEVEL 5
In this level, we create a CloudFront distribution with your S3 bucket as the origin. This way your static content is served to your end users from the nearest edge locations.
When creating the CloudFront distribution, select the S3 bucket containing your uploads as the origin and select "Origin access control settings (recommended)". Here you need to create a control setting with the default signing behavior (sign requests). After creating the distribution, update the S3 bucket policy with the policy statement given to you by CloudFront.
In your code, you only need to make the following tiny changes.
$enable_cf = true;
$cf_baseurl = "http://xxxxxxxxxxxxxx.cloudfront.net/";
Reload the web page in your browser to observe the behavior. Are you able to use CloudFront when the uploaded pictures are stored on disk?
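For context, the idea behind these two settings can be sketched as follows (a simplified illustration, not the demo's exact code; $filename is a placeholder): when $enable_cf is true, the image URL is built from the CloudFront domain instead of pointing directly at S3, so the browser fetches the picture from the nearest edge location.
// Simplified sketch (not the demo's exact code). $filename is a placeholder.
$key = $s3_prefix . "/" . $filename;                          // e.g. "uploads/photo.jpg"
if ($enable_cf)
{
    $image_url = $cf_baseurl . $key;                          // served via CloudFront
}
else
{
    $image_url = $s3_client->getObjectUrl($s3_bucket, $key);  // served directly from S3
}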
(8) LEVEL 6
In this level, we will look into how we can perform near real-time log analysis for your web application. This is achieved using a Kinesis data stream and a Kinesis Analytics application.
First of all, SSH into your EC2 instance and read the Apache access log to understand what information is kept in the logs.
Now we need to create a Kinesis data stream (using the Kinesis web console) in the us-east-1 region. Here we assume that the name of the Kinesis data stream is web-access-log and 1 shard is sufficient for our demo.
Now configure Apache to log in JSON format. This will make it easier for Kinesis Analytics to work with your logs. Edit /etc/apache2/apache2.conf, find the area with LogFormat, and add the following new log format to it. For more information on custom log format for Apache, please refer to Apache Module mod_log_config.
LogFormat "{ \"request_time\":\"%t\", \"client_ip\":\"%a\", \"client_hostname\":\"%V\", \"server_ip\":\"%A\", \"request\":\"%U\", \"http_method\":\"%m\", \"status\":\"%>s\", \"size\":\"%B\", \"userAgent\":\"%{User-agent}i\", \"referer\":\"%{Referer}i\" }" kinesis
Then edit /etc/apache2/sites-available/000-default.conf, change the CustomLog line to use your own log format:
CustomLog ${APACHE_LOG_DIR}/access.log kinesis
Restart Apache to allow the new configuration to take effect:
$ sudo service apache2 restart
Read the Apache access log again to see the changes introduced by the above-mentioned commands.
Then, install and configure the Kinesis Agent:
$ cd ~
$ sudo apt-get install openjdk-8-jdk
$ git clone https://github.com/awslabs/amazon-kinesis-agent
$ cd amazon-kinesis-agent
$ sudo ./setup --install
Create a folder to store the checkpoint for Kinesis Agent (read this to understand why we need this):
$ sudo mkdir -p /opt/aws-kinesis-agent/run
$ sudo chmod ugo+rwx /opt/aws-kinesis-agent/run
After the agent is installed, the configuration file can be found in /etc/aws-kinesis/agent.json. Edit the configuration file to send your Apache access log to the web-access-log stream. (Let's not worry about the error log in this tutorial.)
{
"cloudwatch.emitMetrics": true,
"kinesis.endpoint": "kinesis.us-east-1.amazonaws.com",
"firehose.endpoint": "",
"checkpointFile": "/opt/aws-kinesis-agent/run/checkpoints",
"flows": [
{
"filePattern": "/var/log/apache2/access.log",
"kinesisStream": "web-access-log",
"partitionKeyOption": "RANDOM"
}
]
}
Once you have updated the configuration file, you can start the Kinesis Agent using the following commands:
$ sudo service aws-kinesis-agent stop
$ sudo service aws-kinesis-agent start
Then you can check the status of the Kinesis Agent using the following command:
$ sudo service aws-kinesis-agent status
If the agent is not working as expected, look into the logs (under /var/log/aws-kinesis-agent) to understand what is going on. (If there is no log, what would you do?) It is likely that the user running the Kinesis Agent (aws-kinesis-agent-user) does not have access to the Apache logs (/var/log/apache2/). To resolve this issue, you can add the aws-kinesis-agent-user to the adm group.
$ sudo usermod -a -G adm aws-kinesis-agent-user
$ sudo service aws-kinesis-agent stop
$ sudo service aws-kinesis-agent start
Refresh your web application in the browser, then watch the Kinesis Agent logs to see whether your logs are pushed to the Kinesis streams. When the Kinesis Agent says the logs are successfully sent to destinations, check the "Monitoring" tab in the Kinesis data streams console to confirm this.
Create a new AMI from the above-mentioned EC2 instance, then create a new launch template with the new AMI. Modify your Auto Scaling group to use the new launch template. This way, all of the EC2 instances in your web server fleet are capable of sending logs to your Kinesis stream.
If you are tired of manually refreshing your web browser, you can use the Apache Benchmark tool (ab) to generate the web traffic automatically.
$ ab -n 100000 -c 2 http://<dns-endpoint-of-your-load-balancer>/web-demo/index.php
Now go to the Kinesis Analytics console to create a Kinesis Analytics Application (SQL Applications), with the web-access-log data stream as the source. Click on "Discover schema" to automatically discover the schema in the data. While Kinesis Analytics is discovering your schema, it is important that you have data going into the Kinesis stream (by either manually refreshing your web-demo page, or using ab to generate traffic to your web-demo), otherwise Kinesis Analytics won't be able to see any data. After Kinesis Analytics discovers the schema, save the schema and continue.
In the "Realtime Analytics" tab, click on the "Configure" button which brings you to the SQL Editor. Copy and paste the following sample SQL statements. Then click on the "Save and run SQL" button to start your application.
-- Create a destination stream
CREATE OR REPLACE STREAM "DESTINATION_SQL_STREAM" (client_ip VARCHAR(16), request_count INTEGER);
-- Create a pump which continuously selects from a source stream (SOURCE_SQL_STREAM_001)
CREATE OR REPLACE PUMP "STREAM_PUMP" AS INSERT INTO "DESTINATION_SQL_STREAM"
-- Aggregation functions COUNT|AVG|MAX|MIN|SUM|STDDEV_POP|STDDEV_SAMP|VAR_POP|VAR_SAMP
SELECT STREAM "client_ip", COUNT(*) AS request_count
FROM "SOURCE_SQL_STREAM_001"
-- Uses a 10-second tumbling time window
GROUP BY "client_ip", FLOOR(("SOURCE_SQL_STREAM_001".ROWTIME - TIMESTAMP '1970-01-01 00:00:00') SECOND / 10 TO SECOND);
From multiple EC2 instances, use the Apache Benchmark tool (ab) to generate some more web traffic. Observe and explain the query results in the Kinesis Analytics console.
$ ab -n 100000 -c 2 http://<dns-endpoint-of-your-load-balancer>/web-demo/index.php
Think of this simple Kinesis Analytics application as an entry-level intrusion detection system. There are attackers (your EC2 instances running ab) spamming your website with a lot of requests, and you would like to find out who they are (their IP addresses). In the Output tab of the Kinesis Analytics application (under the SQL code), look at the IP addresses identified by the application. Are they the real IP addresses of the attackers (the IP addresses of the EC2 instances running ab)? If not, what are the addresses reported by the Kinesis Analytics application, and how do you find out the real IP address of the attacker with the Kinesis Analytics application? I have some hints for you here.
(9) Others
In this tutorial, we built a scalable web application using various AWS services including EC2, RDS, S3, ALB, CloudWatch, AutoScaling, IAM, and ElastiCache. It demonstrates how easy it is to build a web application that can scale reasonably well using the various AWS building blocks. It should be noted that the code being used in this tutorial is for demo only, and cannot be used in a production system.
You are encouraged to explore the following topics:
(1) How do you troubleshoot issues when things do not work in this demo? For example, your application is unable to connect to the RDS instance, or to ElastiCache. What is needed to make things work?
(2) How to identify the bottleneck when there is a performance issue? How to enhance the performance of this demo application?
(3) How to make the deployment process easier?
(4) How to make this demo application more secure?
(5) Use Kinesis Firehose to deliver your logs to S3, Elasticsearch, or Splunk for further analysis, search, etc.
At the end of this tutorial, please make sure that you terminate all AWS resources you launched.