Thursday 14 July 2016

Optimizing Disk Usage on Amazon ECS

Failure to monitor disk space utilization can cause problems that prevent Docker containers from working as expected. Amazon EC2 instance disks are used for multiple purposes, such as Docker daemon logs, containers, and images. This post covers techniques to monitor and reclaim disk space on the cluster of EC2 instances used to run your containers.

Amazon ECS is a highly scalable, high performance container management service that supports Docker containers and allows you to run applications easily on a managed cluster of Amazon EC2 instances. You can use ECS to schedule the placement of containers across a cluster of EC2 instances based on your resource needs, isolation policies, and availability requirements.

The ECS-optimized AMI stores images and containers in an EBS volume that uses the devicemapper storage driver in a direct-lvm configuration. As devicemapper stores every image and container in a thin-provisioned virtual device, free space for container storage is not visible through standard Linux utilities such as df. This poses an administrative challenge when it comes to monitoring free space and can also result in increased time troubleshooting task failures, as the cause may not be immediately obvious.
Disk space errors can result in new tasks failing to launch with the following error message:
 Error running deviceCreate (createSnapDevice) dm_task_run failed
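Before an instance reaches that point, you can check the remaining space by hand with docker info, since df will not report it. For example (the output values below are illustrative):

# Show the devicemapper data and metadata space reported by the Docker daemon
docker info | grep "Space Available"

# Example output (values are illustrative):
#  Data Space Available: 20.0 GB
#  Metadata Space Available: 24.0 MB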

NOTE: The scripts and techniques described in this post were tested against the ECS 2016.03.a AMI. You may need to modify these techniques depending on your operating system and environment.

Monitoring

You can use Amazon CloudWatch custom metrics to track EC2 instance disk usage. After a CloudWatch metric is created, you can add a CloudWatch alarm to alert you proactively, before low disk space causes a problem on your cluster.

Step 1: Create an IAM role

The first step is to ensure that the EC2 instance profile used by the EC2 instances in the ECS cluster has a policy allowing the "cloudwatch:PutMetricData" action, as this permission is required to publish metrics to CloudWatch.
In the IAM console, choose Policies, Create Policy. Choose Create Your Own Policy, name it “CloudwatchPutMetricData”, and paste in the following JSON policy document:
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "CloudwatchPutMetricData",
            "Effect": "Allow",
            "Action": [
                "cloudwatch:PutMetricData"
            ],
            "Resource": [
                "*"
            ]
        }
    ]
}
After you have saved the policy, navigate to Roles and select the role attached to the EC2 instances in your ECS cluster. Choose Attach Policy, select the “CloudwatchPutMetricData” policy, and choose Attach Policy.
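If you prefer the AWS CLI to the console, the same policy can be created and attached from any shell with IAM permissions. This is a sketch only: policy.json holds the JSON document above, and the role name ecsInstanceRole and the account ID are placeholders for your own values.

# Create the customer managed policy from the JSON document above
aws iam create-policy \
  --policy-name CloudwatchPutMetricData \
  --policy-document file://policy.json

# Attach it to the instance role used by your ECS container instances
aws iam attach-role-policy \
  --role-name ecsInstanceRole \
  --policy-arn arn:aws:iam::123456789012:policy/CloudwatchPutMetricData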

Step 2: Push metrics to CloudWatch

Open a shell on each EC2 instance in the ECS cluster and, in a text editor, create the following bash script (saved, for example, as /path/to/metricscript.sh):
#!/bin/bash

### Get docker free data and metadata space and push to CloudWatch metrics
### 
### requirements:
###  * must be run from inside an EC2 instance
###  * docker with devicemapper backing storage
###  * aws-cli configured with instance-profile/user with the put-metric-data permissions
###  * local user with rights to run docker cli commands
###
### Created by Jay McConnell

# install aws-cli, bc and jq if required
if [ ! -f /usr/bin/aws ]; then
  yum -qy -d 0 -e 0 install aws-cli
fi
if [ ! -f /usr/bin/bc ]; then
  yum -qy -d 0 -e 0 install bc
fi
if [ ! -f /usr/bin/jq ]; then
  yum -qy -d 0 -e 0 install jq
fi

# Collect region and instanceid from metadata
AWSREGION=`curl -ss http://169.254.169.254/latest/dynamic/instance-identity/document | jq -r .region`
AWSINSTANCEID=`curl -ss http://169.254.169.254/latest/meta-data/instance-id`

function convertUnits {
  # convert units back to bytes as both the docker api and cli only provide friendly units
  if [ "$1" == "b" ] ; then
    echo $2
  elif [ "$1" == "kb" ] ; then 
    echo "$2*1000" | bc | awk '{print $1}' FS="."
  elif [ "$1" == "mb" ] ; then
    echo "$2*1000*1000" | bc | awk '{print $1}' FS="."
  elif [ "$1" == "gb" ] ; then
    echo "$2*1000*1000*1000" | bc | awk '{print $1}' FS="."
  elif [ "$1" == "tb" ] ; then
    echo "$2*1000*1000*1000*1000" | bc | awk '{print $1}' FS="."
  else
    echo "Unknown unit $1"
    exit 1
  fi
}

function getMetric {
  # Get freespace and split unit
  if [ "$1" == "Data" ] || [ "$1" == "Metadata" ] ; then
    echo $(docker info | grep "$1 Space Available" | awk '{print tolower($5), $4}')
  else
    echo "Metric must be either 'Data' or 'Metadata'"
    exit 1
  fi
}

data=$(convertUnits `getMetric Data`)
aws cloudwatch put-metric-data --value $data --namespace ECS/$AWSINSTANCEID --unit Bytes --metric-name FreeDataStorage --region $AWSREGION
data=$(convertUnits `getMetric Metadata`)
aws cloudwatch put-metric-data --value $data --namespace ECS/$AWSINSTANCEID --unit Bytes --metric-name FreeMetadataStorage --region $AWSREGION
Next, set the script to be executable:
chmod +x /path/to/metricscript.sh

Now, schedule the script to run every 5 minutes via cron. To do this, create the file /etc/cron.d/ecsmetrics with the following contents:
*/5 * * * * root /path/to/metricscript.sh

This pulls the free data and metadata space every 5 minutes and pushes the values to CloudWatch under the ECS/<instance ID> namespace.
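With the metric in place, add a CloudWatch alarm so you are notified before an instance runs out of space. The following AWS CLI sketch alarms when free data space drops below 10 GB; the instance ID, SNS topic ARN, threshold, and region are placeholders you should replace with your own values:

# Alarm when FreeDataStorage for an instance drops below 10 GB (10,000,000,000 bytes)
aws cloudwatch put-metric-alarm \
  --alarm-name ecs-low-disk-i-0123456789abcdef0 \
  --namespace ECS/i-0123456789abcdef0 \
  --metric-name FreeDataStorage \
  --statistic Average \
  --period 300 \
  --evaluation-periods 1 \
  --threshold 10000000000 \
  --comparison-operator LessThanThreshold \
  --alarm-actions arn:aws:sns:us-east-1:123456789012:ecs-disk-alerts \
  --region us-east-1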

Disk cleanup

The next step is to clean up the disk, either automatically on a schedule or manually. This post covers cleanup of tasks and images; there is a great blog post, Send ECS Container Logs to CloudWatch Logs for Centralized Monitoring, that covers pushing log files to CloudWatch. Using CloudWatch Logs instead of local log files reduces disk utilization and provides a resilient and centralized place from which to manage logs.

Take a look at what you can do to remove unneeded containers and images from your instances.

Delete containers

Stopped containers should be deleted if they are no longer needed. By default, the ECS agent waits 3 hours after a task stops before deleting its containers. This behavior can be customized by adding the following to /etc/ecs/ecs.config:
ECS_ENGINE_TASK_CLEANUP_WAIT_DURATION=10m

This reduces the wait before stopped tasks are cleaned up to 10 minutes.

For this change to take effect, the ECS agent needs to be restarted, which can be done via ssh:
stop ecs; start ecs
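To confirm the agent came back up, you can check the service status and, optionally, query the agent's local introspection endpoint as a quick sanity check:

# The ECS-optimized AMI runs the agent under upstart
status ecs

# The agent's introspection API reports its version, cluster, and container instance ARN
curl -s http://localhost:51678/v1/metadata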

To set this up for new instances, attach the following EC2 user data (the shebang line is needed so cloud-init runs it as a script):

#!/bin/bash
cat /etc/ecs/ecs.config | grep -v 'ECS_ENGINE_TASK_CLEANUP_WAIT_DURATION' > /tmp/ecs.config
echo "ECS_ENGINE_TASK_CLEANUP_WAIT_DURATION=10m" >> /tmp/ecs.config
mv -f /tmp/ecs.config /etc/ecs/
stop ecs
start ecs

Delete images

By default, Docker caches images indefinitely. Cached images can be useful for reducing the time needed to launch new tasks: if the image is already cached locally, a new container can start without pulling it again. If you have a lot of images that are rarely used, as is common in CI or development environments, then cleaning them out is a good idea. Use the following commands to remove unused images:

List images:

docker images

Delete an image:

docker rmi IMAGE

This could be condensed and saved to a bash script:
#!/bin/bash
# Attempt to remove all images on the host; images still referenced by a container will fail to remove and remain cached
docker images -q | xargs --no-run-if-empty docker rmi

Set the script to be executable:
chmod +x /path/to/cleanupscript.sh

Execute the script daily via cron by creating a file called /etc/cron.d/dockerImageCleanup with the following contents:
00 00 * * * root /path/to/cleanupscript.sh
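The script above tries to remove every image on the host. If you would rather keep tagged images cached and remove only untagged (dangling) layers, a less aggressive variant, sketched below, can be scheduled the same way:

#!/bin/bash
# Remove only dangling (untagged) images; tagged images stay cached for faster task launches
docker images -q --filter "dangling=true" | xargs --no-run-if-empty docker rmi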

Conclusion

The techniques described in this post provide visibility into a critical component of running Docker—the disk space used on the cluster’s EC2 instances—and techniques to clean up unnecessary storage. If you have any questions or suggestions for other best practices, please comment below.
