resmgmt
Information can be processed most efficiently when the appropriate resources are allocated and used effectively. To help with this, we’ll go over some methods for determining your resource needs and techniques for making the best use of the resources available to you.
To some, this process is a dark art. At its core is capacity planning: determining the amount of CPU, RAM, and disk your information processing requires. To achieve efficient resource utilization, you may need to optimize your workflow and potentially leverage techniques like parallel processing. To ensure you are effectively using your computing capacity, you’ll need to monitor your resource utilization at points throughout your work.
The CPU, or Central Processing Unit, is the heart of your computer. As the name implies, virtually all data processing is handled within the CPU. In simplest terms, a CPU’s performance is measured in two ways: its clock speed, the frequency at which the CPU operates, measured in megahertz or gigahertz; and the number of cores it has. Each core can perform a single operation (or calculation) at a time, so the more cores you have, the more calculations can be performed simultaneously.
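On Linux, for instance, you can check how many cores you have and how fast they run from the command line (the exact output fields vary by system):

# Number of logical CPU cores available
nproc

# CPU model, core count, and clock speed details
lscpu | grep -E 'Model name|^CPU\(s\)|MHz'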
RAM, or Random Access Memory, is very fast but short-term memory. Its job is to hold the data that the CPU is actively working with. If your needs call for large amounts of RAM, you’re in luck: it’s relatively easy to upgrade and fairly cheap.
Disks are used to store data for the long term, but not all disks are created equal. There are two main types: Solid State Drives (SSDs) and Hard Disk Drives (HDDs). An SSD is many times faster than a hard disk, but that speed comes with a much heftier price tag for large amounts of storage. Hard disks, on the other hand, have been around for years and are cheap even for storing several terabytes of data.
There are three basic ways to use disks. The most common is a standalone drive, either internal or external to your computer. The next most common is network storage, where the disks live somewhere else on the network; network storage is good for large capacity, but rarely good for high performance. The third is a disk array, where several disks work together, usually in a redundant fashion, so data is retained if a disk fails.
To optimize your data and workflow, you need to identify what resources you are using, identify bottlenecks, and eliminate them.
Every operating system has tools to track resource usage. On Windows, the Performance Monitor is the most helpful; it gives a breakdown of RAM, CPU, disk, and network utilization by application. On a Mac, the Activity Monitor is the most user-friendly way to track resource usage, and the newest versions of OS X use a color-coding scheme to help identify when you are hitting the limits. On Linux, you have many choices, but htop is a solid choice for CPU and RAM monitoring. For disk activity, you’ll want to use iostat.
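For example, on a Linux machine you might take a quick snapshot of CPU, memory, and disk activity like this (iostat comes from the sysstat package, and the 5-second interval is just an illustration):

# Interactive, per-process view of CPU and RAM usage
htop

# Memory and swap usage at a glance
free -h

# Extended disk statistics, refreshed every 5 seconds; a drive sitting
# at or near 100% in the %util column is likely a bottleneck
iostat -x 5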
Once you have determined your usage, you can try to identify bottlenecks. A bottleneck could be caused by your available resources or by your software. If you aren’t maxing out the CPU, memory, or disk, then the bottleneck is likely within the software itself. However, if you are maxing out a particular resource, then increasing the available resources should help. For example, if your system has a single hard disk drive and it’s being maxed out, replacing it with a solid state drive should speed things up.
Random Access Memory, or RAM, is a finite resource. The amount you can install in a system is limited by the CPU, the motherboard, and sometimes the operating system you are running.
For a system to perform efficiently, you must use your available RAM effectively. That means ensuring each application or process is allocated enough memory, while avoiding exceeding the total amount available. When you exceed that limit, your computer will begin to swap, or thrash, with the result being poor performance and possibly even data loss.
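On Linux, a couple of standard commands will tell you whether a system is actively swapping (what counts as "too much" depends on your workload):

# List configured swap devices and how much swap is in use
swapon --show

# Report memory and swap activity once per second; sustained non-zero
# values in the si (swap in) and so (swap out) columns indicate thrashing
vmstat 1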
Allocating memory is made a bit more difficult by the behavior of some applications. Software such as R, MATLAB, and Excel does all of its data processing in RAM by default, which means you need enough memory to fit the raw dataset and any changes being made to it.
If you plan on working with large datasets, you have a few options. The first is to purchase enough memory to do all your work in RAM; however, this can be costly and doesn’t scale well. The second is to break the data into smaller pieces and do your calculations in sections. The third is to use a plugin or add-on for your software that lets you work from disk, though this option isn’t available for all software.
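As a rough sketch of the second option, a large text file can be broken into chunks at the command line and each chunk processed on its own. Here, bigdata.csv and summarize.sh are hypothetical placeholders, not files from this project:

# Split a large file into pieces of one million lines each,
# named chunk_aa, chunk_ab, and so on
split -l 1000000 bigdata.csv chunk_

# Process each chunk separately, appending the results to a single output file
for chunk in chunk_*; do
    bash summarize.sh "$chunk" >> results.csv
done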
To help provide context, we will look at a real research project conducted by the University of Washington.
Let’s start the scenario with a support request made by a research assistant to the departmental IT staff. Here is the actual request.
I'm running a SAS program that involves building a large dataset from the meteorology files. Each meteorology file is 864 KB, and I'm merging together about 5,000 of these separate meteorology files into one giant dataset. So far I've been running this program since Wednesday, and it's still running. I'm thinking that perhaps my computer does not have enough processing memory to run this task efficiently. Also, I'm running this off of my C:/drive, so the slow processing time is not due to sending information over the network. I think that if I continue to let my program run over the weekend, it should definitely be done by Monday. However, I'm concerned because I will need to run this program again for another set of about 5,000 files later.
This provides sufficient detail to get an idea that this is a resource management problem. Either more memory needs to be installed, or the (685-line) SAS program needs to be modified to use less memory.
After looking further into this issue, we discovered that there were 5335 pairs of files, with each pair consuming 2284 kilobytes.
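Counts and sizes like these are easy to gather from a shell. As a sketch, assuming the file pairs live in the data/ and force/ folders introduced below:

# Count the files in each folder (one file per location)
ls data | wc -l
ls force | wc -l

# Total size on disk of each folder, in kilobytes
du -sk data force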
Let’s calculate the memory needed to load the entire dataset into RAM:
5335 * 2284 KB = 12185140 KB
12185140 KB is about 12.2 GB of RAM. Plus, as the files were being combined in memory, a second (combined) copy of the data was being collected in a separate table, also stored in memory, before being written out to disk. So, the total memory requirement may have been over 24 GB of RAM (assuming SAS does not store the data any more efficiently than ASCII text). The computer running the analysis did not have nearly that much RAM installed. So, the main reason this was running slowly was that the hard disk was used as "swap space" once the RAM had been consumed. This "swapping out" of memory to and from disk is extremely slow.
But was it really necessary to load the entire dataset into RAM (twice)?
We ended up writing a short script that used negligible resources and did the job in less than half an hour, even on a modest computer. So, "no", we don’t have to load all of the data into RAM. We can just "stream" the data to disk, joining and streaming one pair of files at a time, continually appending the stream of data to an ever-growing output file.
How did we do it?
Let’s start by looking at the input data files. We have two folders of files. For every (space-delimited) file in our data/ folder, there is a corresponding (tab-delimited) file in our force/ folder, sharing the same location coordinates (stored in the filenames). So, we want to link each pair of files on the location, combine the rows line-by-line, then collect the combined rows into a single output data file.
Here is a listing showing the first 9 lines of each type of input file:
$ head -9 data/data_45.59375_-122.15625
27.950 8.170 0.570 3.100
1.950 6.130 -1.000 3.030
3.925 5.650 0.490 3.020
2.400 5.680 -0.880 2.980
8.200 6.330 0.020 2.990
1.600 5.560 -0.360 3.040
14.650 5.420 0.310 3.410
37.775 5.080 -0.340 3.280
6.575 6.550 -0.500 3.440
$ head -9 force/force_45.59375_-122.15625
1915 1 1 237.83 41.655 79.851 0.1660
1915 1 2 244.82 40.886 77.705 0.1666
1915 1 3 248.43 32.984 84.425 0.1174
1915 1 4 246.30 39.536 79.572 0.1494
1915 1 5 245.51 38.740 81.598 0.1405
1915 1 6 249.16 37.549 81.220 0.1393
1915 1 7 254.88 33.961 84.068 0.1190
1915 1 8 262.68 35.973 82.808 0.1245
1915 1 9 258.02 43.198 79.247 0.1583
Here is an example of the desired output, showing the header and first 9 data records of the output CSV file:
$ head met.csv
lat,lng,precip,tmax,tmin,windsp,year,month,day,b1,b2,rh,b3
45.59375,-122.15625,27.950,8.170,0.570,3.100,1915,1,1,237.83,41.655,79.851,0.1660
45.59375,-122.15625,1.950,6.130,-1.000,3.030,1915,1,2,244.82,40.886,77.705,0.1666
45.59375,-122.15625,3.925,5.650,0.490,3.020,1915,1,3,248.43,32.984,84.425,0.1174
45.59375,-122.15625,2.400,5.680,-0.880,2.980,1915,1,4,246.30,39.536,79.572,0.1494
45.59375,-122.15625,8.200,6.330,0.020,2.990,1915,1,5,245.51,38.740,81.598,0.1405
45.59375,-122.15625,1.600,5.560,-0.360,3.040,1915,1,6,249.16,37.549,81.220,0.1393
45.59375,-122.15625,14.650,5.420,0.310,3.410,1915,1,7,254.88,33.961,84.068,0.1190
45.59375,-122.15625,37.775,5.080,-0.340,3.280,1915,1,8,262.68,35.973,82.808,0.1245
45.59375,-122.15625,6.575,6.550,-0.500,3.440,1915,1,9,258.02,43.198,79.247,0.1583
Here is a simple version of a Bash script which can do the job.[1]
#!/bin/bash

echo 'lat,lng,precip,tmax,tmin,windsp,year,month,day,b1,b2,rh,b3'

for LATLONG in `find data -type f | sed -n "s/^data\/data_//p"`
do
    DATA=data/data_$LATLONG
    FORCE=force/force_$LATLONG
    if [ -e $FORCE ]; then
        paste -d, $DATA $FORCE | sed -e "s/^/$LATLONG,/" -e 's/[ \t_,]\+/,/g'
    fi
done
With only a small fraction of the lines of code of the SAS version, this script can combine all of the files in just a matter of minutes.
At the risk of code obfuscation, we can reduce the lines of code to three:
echo 'lat,lng,precip,tmax,tmin,windsp,year,month,day,b1,b2,rh,b3'
for D in data/data_*; do L=${D/#data\/data_/}; F=force/force_$L
  [ -e $F ] && paste -d, $D $F | sed -e "s/^/$L,/" -e 's/[ \t_,]\+/,/g'; done
We don’t recommend this approach, however, as it’s very hard to read and won’t run any faster!
We can run our script using these Bash commands:
$ cd /path/to/met
$ bash metmerge.sh > met.csv
How does it work? We join matching "data" and "force" files sequentially, line by line, pairing the files on location (latitude and longitude) as found in each file name. (Here is an example of a "data" file name: "data_45.59375_-122.46875".) As we process the files, we continually write the combined output "stream" in CSV format to a text file.
A Python version running on a decent server runs about twice as fast as this Bash script running on an old laptop. So, the total run time can be reduced to about 15 minutes, using a single Python process (CPU core), and less than 25 MB of RAM (peak).
#!/usr/bin/python

import os
import sys

hdr = "lat lng precip tmax tmin windsp year month day b1 b2 rh b3".split()
print ",".join(hdr)

def process_latlong(latlong):
    latlong_out = latlong.replace("_", ",")
    data_file = open("data/" + "data_" + latlong, "r")
    force_file = open("force/" + "force_" + latlong, "r")
    for data_line in data_file.readlines():
        data_out = data_line.strip().replace(" ", ",")
        force_line = force_file.readline()
        force_out = force_line.strip().replace(" ", "").replace("\t", ",")
        print latlong_out + "," + data_out + "," + force_out
    data_file.close()
    force_file.close()

dir_list = os.listdir("data/")
for fname in dir_list:
    process_latlong(fname.strip("data_"))
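If you want to measure run time and peak memory use yourself, one option is to wrap either script in GNU time (installed at /usr/bin/time on most Linux systems; its report goes to standard error, so the output redirection still works):

# Reports elapsed (wall clock) time and "Maximum resident set size" (peak RAM)
/usr/bin/time -v bash metmerge.sh > met.csv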
With modern CPUs having multiple cores, parallel processing is the only effective way to use all of the CPU power available. To take advantage of it, your data and workflow may need adjustment: with parallel processing, your data is divided into pieces, and calculations are performed on several pieces simultaneously.
Thankfully, there are some well-developed tools and techniques to help with this. One of the more common is MapReduce, a framework for processing large volumes of data in parallel, popularized by Apache Hadoop. A lesser-known tool is GNU Parallel, which runs and manages command-line programs in parallel.
As an example, climate data can be processed in parallel. The data can be divided up by area, and the computation performed on a per-area basis. Thus, instead of doing calculations for one ZIP code at a time, you could process data for 4, 8, or more areas at once.
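As a sketch of how this could look for the file-merging job above, GNU Parallel can run one merge per core. This assumes GNU Parallel is installed; the merge_pair function, the parts/ folder, and the job count of 4 are illustrative choices, not part of the original project:

#!/bin/bash
# Merge each data/force file pair in parallel, writing one CSV fragment
# per location into a parts/ folder, then stitch the fragments together.

merge_pair() {
    latlong=$1
    data=data/data_$latlong
    force=force/force_$latlong
    # Combine the pair line-by-line, prefixing each row with its location
    [ -e "$force" ] && paste -d, "$data" "$force" |
        sed -e "s/^/$latlong,/" -e 's/[ \t_,]\+/,/g' > "parts/$latlong.csv"
}
export -f merge_pair

mkdir -p parts
find data -type f | sed -n 's/^data\/data_//p' | parallel --jobs 4 merge_pair {}

# Write the header, then append all fragments to produce met.csv
echo 'lat,lng,precip,tmax,tmin,windsp,year,month,day,b1,b2,rh,b3' > met.csv
cat parts/*.csv >> met.csv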
In summary, there are three main components to resource management:
- Capacity planning: Identification and allocation of necessary resources.
- Utilization monitoring: Verifying you are using the resources you’ve allocated.
- Bottleneck resolution: Identification and correction of performance bottlenecks.
The latest version of this document is online at: https://github.com/brianhigh/research-computing/wiki
Copyright © The Research Computing Team. This information is provided for educational purposes only. See LICENSE for more information. Creative Commons Attribution 4.0 International Public License.