
Resource Management

Information can be processed most efficiently when the appropriate resources are allocated and used effectively. To help with this, we'll go over some methods for determining your resource needs and some techniques for making the best use of the resources available to you.

To some, this process is a dark art. At its core is capacity planning: determining the amount of CPU, RAM, and disk your information processing needs. To achieve efficient resource utilization, you may need to optimize your workflow and potentially leverage technologies like parallel processing. To ensure you are using your computing capacity effectively, you'll need to monitor your resource utilization at points throughout your work.

System Resources

The CPU, or Central Processing Unit, is the heart of your computer. As the name implies, virtually all data processing is handled within the CPU. In simplest terms, a CPU's performance is measured in two ways: its clock speed, the frequency at which the CPU operates, measured in megahertz or gigahertz; and the number of cores it has. Each core can perform a single operation (or calculation) at a time, so the more cores you have, the more calculations can be performed simultaneously.
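
For example, on a Linux system you can quickly check how many cores you have and their clock speed from the command line:

nproc                                        # number of CPU cores available
lscpu | grep -Ei 'model name|^cpu\(s\)|mhz'  # CPU model, core count, and clock speed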

RAM, or Random Access Memory, is very fast but short-term memory. Its job is to hold the data that the CPU is actively working with. If your needs call for large amounts of RAM, you're in luck: it's relatively easy to upgrade and fairly cheap.

Disks are used to store data for the long term. But not all disks are created equal. There are two main types of disk: Solid State Drives (SSDs) and Hard Disk Drives (HDDs). An SSD is many times faster than a hard disk, but that speed comes with a much heftier price tag for large amounts of storage. Hard disks, on the other hand, have been around for years and are cheap even for storing several terabytes of data.
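
As a quick check on Linux, you can see which of your drives are SSDs and how much space you have:

lsblk -d -o NAME,ROTA,SIZE,MODEL   # ROTA=1 indicates a rotating hard disk, 0 an SSD
df -h                              # used and available space on each filesystem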

There are three basic ways to use disks. The most common is a stand-alone drive, either internal or external to your computer. The next most common is network storage, where the disks live somewhere else on the network; network storage is good for large capacity, but rarely good for high performance. The third is a disk array: several disks working together, usually in a redundant fashion, so data is retained if a disk fails.

Optimization

To optimize your data and workflow, you need to identify what resources you are using, identify bottlenecks, and eliminate them.

Every operating system has tools to track resource usage. On Windows, the Performance Monitor (example shown below) is the most helpful; it gives a breakdown of RAM, CPU, disk, and network utilization by application. On a Mac, the Activity Monitor is the most user-friendly way to track resource usage, and in the newest versions of OS X it uses color coding to help you see when you are hitting the limits. On Linux you have many options, but htop is a solid choice for CPU and RAM monitoring. For disk activity, you'll want to use iostat.

[Figure: Windows 7 Performance Monitor (image: Microsoft)]
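
For instance, on Linux you might keep an eye on things from the command line (htop and iostat may need to be installed first):

htop           # interactive, per-process view of CPU and RAM usage
iostat -x 5    # extended disk activity statistics, refreshed every 5 seconds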

Once you have determined your usage, you can try to identify bottlenecks. A bottleneck can be caused by your available resources or by your software. If you aren't maxing out the CPU, memory, or disk, then the bottleneck is likely within the software itself. However, if you are maxing out a particular resource, then increasing that resource should help. For example, if your system has a single hard disk drive and it's being maxed out, replacing it with a solid state drive should speed things up.

Memory Utilization

Random Access Memory or RAM is a finite resource. The amount you can install into a system is limited by the CPU, motherboard, and sometimes the operating system you are running.

For a system to perform efficiently, you must use your available RAM effectively. This means ensuring that each application or process is allocated enough memory, while at the same time avoiding exceeding the amount available. When you exceed that limit, your computer will begin to swap or thrash, resulting in poor performance and possibly even loss of data.
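
On Linux, a minimal way to watch for swapping as it happens:

free -h      # if swap usage keeps growing, you have run out of RAM
vmstat 5     # nonzero "si" (swap in) and "so" (swap out) columns mean active swapping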

Allocating memory is made a bit more difficult by the behavior of some applications. Software such as R, MATLAB, and Excel does all of its data processing in RAM by default, which means you need enough memory to hold the raw data set plus any changes being made to it.

If you plan on working with large data sets, you have a few options. The first is to purchase enough memory to do all of your work in RAM; however, this can be costly and doesn't scale well. The second is to break the data up into smaller pieces and do your calculations in sections (see the sketch below). The third is to use a plugin or add-on for your software that lets you work from disk, though this option isn't available for all software.
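
As a minimal sketch of the second option, a large CSV file (here hypothetically named bigdata.csv) could be split into chunks that fit comfortably in RAM and processed one piece at a time:

# split everything after the header row into pieces of one million lines each
tail -n +2 bigdata.csv | split -l 1000000 - chunk_

for f in chunk_*; do
    wc -l "$f"    # placeholder: run your real analysis on each chunk here
done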

Case Study: Climate Change Project

To help provide context, we will look at a real research project conducted by the University of Washington.

Let’s start the scenario with a support request made by a research assistant to the departmental IT staff. Here is the actual request.

I'm running a SAS program that involves building a large dataset
from the meteorology files. Each meteorology file is 864 KB, and
I'm merging together about 5,000 of these separate meteorology
files into one giant dataset. So far I've been running this program
since Wednesday, and it's still running. I'm thinking that perhaps
my computer does not have enough processing memory to run this
task efficiently. Also, I'm running this off of my C:/drive, so the
slow processing time is not due to sending information over the
network. I think that if I continue to let my program run over the
weekend, it should definitely be done by Monday. However, I'm
concerned because I will need to run this program again for another
set of about 5,000 files later.

This provides sufficient detail to get an idea that this is a resource management problem. Either more memory needs to be installed, or the (685 line) SAS program needs to be modified to use less memory.

After looking further into this issue, we discovered that there were 5335 pairs of files, with each pair consuming 2284 kilobytes.
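
A minimal sketch of how you might measure this yourself, assuming the files live in the data/ and force/ folders described below:

find data force -type f | wc -l    # count the files (two per pair)
du -sk data force                  # total size of each folder in kilobytes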

Let’s calculate the memory needed to load the entire dataset into RAM:

5335 * 2284 KB = 12185140 KB

12185140 KB is about 12.2 GB of RAM. Plus, as the files were being combined in memory, a second (combined) copy of the data was being collected in a separate table, also stored in memory, before being written out to disk. So, the total memory requirement was over 24 GB of RAM. The computer running the analysis did not have nearly that much RAM installed. So, the main reason this was running slowly was that the hard disk was used as "swap space" when the RAM had been consumed. This "swapping out" of memory to and from disk is extremely slow.

But was it really necessary to load the entire dataset into RAM (twice)?

We ended up writing a short script that used negligible resources and did the job in less than half an hour, even on a modest computer. So, "no", we don't have to load all of the data into RAM. We can just "stream" the data, joining one pair of files at a time and continually appending the result to an ever-growing output file.

Here is the simplest version of the script, written in Bash.

#!/bin/bash

# Write the CSV header line.
echo 'lat,long,precip,tmax,tmin,windsp,year,month,day,b1,b2,rh,b3'

# Loop over every "data" file, extracting the "lat_long" part of its name.
for LATLONG in `find data -type f | sed -n "s/^data\/data_//p"`
do
    DATA=data/data_$LATLONG
    FORCE=force/force_$LATLONG

    # If the matching "force" file exists, join the two files line by line,
    # prefix each line with the coordinates, then collapse every run of
    # spaces, tabs, underscores, or commas into a single comma.
    if [ -e $FORCE ]; then
        paste -d, $DATA $FORCE | sed -e "s/^/$LATLONG,/" -e 's/[ \t_,]\+/,/g'
    fi
done

Here is what it does: for every (space-delimited) file in our data/ folder, there is a corresponding (tab-delimited) file in our force/ folder sharing the same location coordinates. We pair the "data" and "force" files by the latitude and longitude found in their file names (for example, a "data" file name looks like "data_45.59375_-122.46875"), then join each pair sequentially, line by line. As we process files, we continually write the combined output "stream" in CSV format to a text file.

Here are example portions of each type of file:

$ head data/data_45.59375_-122.46875
20.525 9.770 3.040 3.080
1.675 7.480 -0.030 3.000
2.300 6.610 2.320 3.000
2.125 7.050 0.210 2.940
4.025 6.970 1.690 2.960
1.500 7.660 1.190 3.010
9.725 6.670 2.480 3.350
23.625 5.610 2.070 3.250
2.850 7.480 0.580 3.410
2.825 7.530 1.930 3.260

$ head force/force_45.59375_-122.46875
1915	01	01	249.8319	38.4160	82.6060	0.1632
1915	01	02	234.7890	41.7855	76.8244	0.1894
1915	01	03	263.7401	28.2177	86.6891	0.1102
1915	01	04	236.5733	40.3042	79.1392	0.1671
1915	01	05	252.5832	33.9665	83.6098	0.1359
1915	01	06	242.8250	39.4904	80.1493	0.1678
1915	01	07	266.3386	28.6491	85.6965	0.1210
1915	01	08	269.7302	25.0456	87.7869	0.0973
1915	01	09	237.3506	42.4395	79.4206	0.1688
1915	01	10	251.5430	37.2574	83.0666	0.1443

We can run our script using these Bash commands:

$ cd /path/to/met
$ bash metmerge.sh > met.csv

A Python version, running on a decent server, is about twice as fast as this Bash script running on an old laptop. So the total run time can be reduced to about 15 minutes, using a single Python process (one CPU core) and less than 25 MB of RAM (peak).

#!/usr/bin/python

import os

# Write the CSV header line.
hdr = "lat long precip tmax tmin windsp year month day b1 b2 rh b3".split()
print(",".join(hdr))

def process_latlong(latlong):
    # The "lat_long" suffix of the file name becomes the first two CSV fields.
    latlong_out = latlong.replace("_", ",")
    data_file = open("data/data_" + latlong, "r")
    force_file = open("force/force_" + latlong, "r")

    # Read the paired files in step, one line at a time, converting the
    # space-delimited "data" fields and tab-delimited "force" fields to CSV.
    for data_line in data_file:
        data_out = data_line.strip().replace(" ", ",")
        force_line = force_file.readline()
        force_out = force_line.strip().replace(" ", "").replace("\t", ",")
        print(latlong_out + "," + data_out + "," + force_out)

    data_file.close()
    force_file.close()

# Process every file in the data/ folder, removing the "data_" prefix
# from its name to get the latitude/longitude key.
for fname in os.listdir("data/"):
    process_latlong(fname[len("data_"):])
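
Assuming the Python script is saved as metmerge.py (a hypothetical name, chosen to mirror the Bash version), it can be run the same way:

$ cd /path/to/met
$ python metmerge.py > met.csv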

CPU Utilization

With modern CPUs having multiple cores, parallel processing is the only effective way to use all of the CPU power available. To take advantage of it, your data and workflow may need adjustment: the data is divided into pieces, and calculations are performed on several pieces simultaneously.

Thankfully, there are some well-developed tools and techniques to help with this. One of the more common is MapReduce, a framework for processing large volumes of data in parallel, popularized by Apache Hadoop. A lesser-known option is GNU Parallel, which runs and manages command-line programs in parallel.

As an example, climate data can be processed in parallel. The data can be divided up by area, and the computation performed on a per-area basis. Thus, instead of doing calculations for one ZIP code at a time, you could process data for 4, 8, or more areas at once.
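
As a minimal sketch, and assuming a hypothetical helper script merge_one_area.sh that joins a single pair of "data" and "force" files, GNU Parallel could process several areas at once:

# write the CSV header, then merge four areas at a time in parallel
echo 'lat,long,precip,tmax,tmin,windsp,year,month,day,b1,b2,rh,b3' > met.csv
find data -type f | sed 's/^data\/data_//' | \
    parallel --jobs 4 bash merge_one_area.sh {} >> met.csv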

Summary

In summary, there are three main components to resource management:

  • Capacity planning: Identification and allocation of necessary resources.

  • Utilization monitoring: Verifying you are using the resources you’ve allocated.

  • Bottleneck resolution: Identification, and correction of performance bottlenecks.