resmgmt
Information can be processed most efficiently when the appropriate resources are allocated and used effectively. To help, we’ll go over some methods of determining your resource needs and techniques for making the best use of the resources available to you.
To some, this process is a dark art. At its core is capacity planning, which involves determining the amount of CPU, RAM, and disk your information processing needs. To achieve efficient resource utilization, you may need to optimize your workflow and potentially leverage technologies like parallel processing. To ensure you are using your computing capacity effectively, you’ll need to monitor your resource utilization at points throughout your work.
The CPU, or Central Processing Unit, is the heart of your computer. As the name implies, virtually all data processing is handled within the CPU. In simplest terms, a CPU’s performance capability is measured in two ways. The first is its clock speed, that is, the frequency the CPU operates at, which is measured in megahertz or gigahertz. The second is the number of cores it has. Each core can perform a single operation (or calculation) at a time, so the more cores you have, the more calculations can be performed simultaneously.
RAM, or Random Access Memory, is very fast but short-term memory. Its job is to hold the data that the CPU is actively working with. If your needs call for large amounts of RAM, you’re in luck: it’s relatively easy to upgrade and fairly cheap.
Disks are used to store data for the long term, but not all disks are created equal. There are two main types of disk: solid state drives (SSDs) and hard disk drives (HDDs). An SSD is many times faster than a hard disk, but that comes with a much heftier price tag for large amounts of storage. Hard disks, on the other hand, have been around for years and are cheap even for storing several terabytes of data.
There are three basic ways to use disks. The most common is a standalone drive, either internal or external to your computer. The next most common is network storage, where the disks live somewhere else on the network. Network storage is good for large capacity, but rarely good for high performance. The third is a disk array, which is several disks working together, usually in a redundant fashion so that data is retained if a disk fails.
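To make this inventory of CPU, RAM, and disk concrete, here is a minimal Python sketch that reports what the machine it runs on has to offer. It uses only the standard library. The function names `total_ram_bytes` and `inventory` are made up for this illustration, and the RAM lookup assumes a Unix-like system that exposes `SC_PHYS_PAGES`; on other systems it simply returns `None`.

```python
import os
import shutil


def total_ram_bytes():
    """Return physical RAM in bytes, or None if the platform does not expose it.

    Assumption: a Unix-like system where os.sysconf knows SC_PHYS_PAGES.
    """
    try:
        return os.sysconf("SC_PHYS_PAGES") * os.sysconf("SC_PAGE_SIZE")
    except (AttributeError, ValueError, OSError):
        # os.sysconf is missing (e.g. Windows) or the name is not supported.
        return None


def inventory(path="."):
    """Collect a rough picture of the CPU, RAM, and disk capacity available."""
    disk = shutil.disk_usage(path)  # named tuple: total, used, free (bytes)
    return {
        "cpu_cores": os.cpu_count(),  # logical cores; may be None if unknown
        "ram_bytes": total_ram_bytes(),
        "disk_total_bytes": disk.total,
        "disk_free_bytes": disk.free,
    }


if __name__ == "__main__":
    for name, value in inventory().items():
        print(f"{name}: {value}")
```

Comparing these numbers with the size of your data is a reasonable first step in capacity planning, though it says nothing about how fast the disk or network actually is.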
To optimize your data and workflow, you need to identify what resources you are using, find the bottlenecks, and eliminate them.
Every operating system has tools to track resource usage. On Windows, the Performance Monitor (example shown below) is the most helpful. It gives a breakdown of RAM, CPU, Disk, and Network utilization by application. On a Mac, the Activity Monitor is the most user friendly method of tracking resource usage. And, in the newest versions of OS X, it has a color coding scheme to help identify if you are hitting the limits. On Linux, you’ve got many choices, but htop is a solid choice for CPU and RAM monitoring. For disk activity, you’ll want to use iostat.
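The system tools above watch the whole machine. If your work is scripted, you can also measure a single step from inside the script. The sketch below is one possible approach in Python, not a replacement for those tools: the helper name `measure` is invented for this example, and `tracemalloc` only sees memory allocated by Python itself, so the peak figure is a lower bound for code that calls into native libraries.

```python
import time
import tracemalloc


def measure(func, *args, **kwargs):
    """Run func once and report wall time, CPU time, and peak Python memory."""
    tracemalloc.start()
    wall_start = time.perf_counter()  # elapsed real time
    cpu_start = time.process_time()   # CPU time used by this process
    try:
        result = func(*args, **kwargs)
        wall = time.perf_counter() - wall_start
        cpu = time.process_time() - cpu_start
        _, peak = tracemalloc.get_traced_memory()  # (current, peak) in bytes
    finally:
        tracemalloc.stop()
    stats = {"wall_seconds": wall, "cpu_seconds": cpu, "peak_bytes": peak}
    return result, stats


if __name__ == "__main__":
    total, stats = measure(sum, range(1_000_000))
    print(total, stats)
```

If the CPU time is close to the wall time, the step is probably limited by the CPU. If the wall time is much larger, the step is likely waiting on something else, such as disk or network.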
Once you have determined your usage, you can try to identify bottlenecks. A bottleneck could be caused by your available resources, or your software. If you aren’t maxing out the CPU, Memory, and Disk, then the bottleneck is likely within the software itself. However, if you are maxing out a particular resource, then increasing the available resources should help. For example, if your system has a single hard disk drive, and it’s being maxed out, replacing it with a solid state drive should speed things up.
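The reasoning in this paragraph can be written down as a small decision rule. The Python sketch below assumes you have already read utilization percentages off your monitoring tool. The function names and the 90% threshold are illustrative choices, not established cutoffs.

```python
def find_bottlenecks(utilization, threshold=90.0):
    """Return the names of resources at or above the threshold, sorted.

    utilization maps a resource name to its observed usage in percent.
    The default threshold of 90 is an assumption made for this example.
    """
    return sorted(
        name for name, percent in utilization.items() if percent >= threshold
    )


def diagnose(utilization, threshold=90.0):
    """Turn the observations into a first guess about where to look."""
    saturated = find_bottlenecks(utilization, threshold)
    if saturated:
        return "Resource-bound: consider adding " + ", ".join(saturated)
    return "No resource is saturated: look at the software itself"


if __name__ == "__main__":
    observed = {"cpu": 35.0, "memory": 60.0, "disk": 98.0}
    print(diagnose(observed))
```

With the sample readings, only the disk is flagged, which matches the single hard disk scenario described above.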
Random Access Memory or RAM is a finite resource. The amount you can install into a system is limited by the CPU, motherboard, and sometimes the operating system you are running.
In order for a system to perform efficiently, you must use your available RAM effectively. That means ensuring that each application or process is allocated enough memory while avoiding exceeding the available amount. When you exceed that limit, your computer will begin to swap or thrash, resulting in poor performance and a possible loss of data.
Allocating memory is made a bit more difficult by the behavior of some applications. Software such as R, MATLAB, and Excel does all of its data processing in RAM by default, which means you need enough memory to fit the raw data set and any changes being made to it.
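A rough estimate of that memory need takes only a few lines of arithmetic. The Python sketch below assumes a purely numeric table stored as 8-byte double-precision values, and it allows for a second full copy of the data while changes are being made. Both the function name and the factor of two are assumptions for illustration. Real usage depends on the software and the data types involved.

```python
def estimate_memory_bytes(rows, columns, bytes_per_value=8, working_copies=2):
    """Estimate the RAM needed to hold a numeric table plus working copies.

    bytes_per_value=8 corresponds to a double-precision number.
    working_copies=2 is an assumed allowance for the raw data and one
    modified copy; adjust it to match how your software behaves.
    """
    return rows * columns * bytes_per_value * working_copies


if __name__ == "__main__":
    needed = estimate_memory_bytes(rows=10_000_000, columns=20)
    print(f"Roughly {needed / 1e9:.1f} GB")
```

For 10 million rows and 20 columns this comes to about 3.2 GB, which is a useful number to compare against the RAM you actually have before loading the data.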
If you plan on working with large data sets, you have a few options. The first option is to purchase enough memory to do all your work in RAM. However, this option can be costly and doesn’t really scale well. Your second option is to break up the data into smaller pieces and do your calculations in sections. The third option is to use a plugin or add-on for your software that lets you work from disk. This option, however, isn’t available for all software.
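The second option, working in sections, can be sketched in Python as follows. The example reads a CSV source a fixed number of rows at a time and keeps only a running total in memory. The function name `chunked_mean`, the column names, and the chunk size are all hypothetical. It works this way only because a mean can be built from partial sums; not every calculation splits so cleanly.

```python
import csv
import io
from itertools import islice


def chunked_mean(lines, column, chunk_size=1000):
    """Compute the mean of one CSV column without loading every row at once.

    lines is any iterable of text lines, such as an open file.
    Returns None if there are no data rows.
    """
    reader = csv.DictReader(lines)
    total = 0.0
    count = 0
    while True:
        # Pull at most chunk_size rows into memory.
        chunk = list(islice(reader, chunk_size))
        if not chunk:
            break
        total += sum(float(row[column]) for row in chunk)
        count += len(chunk)
    return total / count if count else None


if __name__ == "__main__":
    sample = io.StringIO("area,temp\nA,10\nB,20\nC,30\n")
    print(chunked_mean(sample, "temp", chunk_size=2))
```

In real use you would pass an open file instead of the in-memory sample, so that memory use stays bounded by the chunk size rather than the file size.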
With modern CPUs having multiple cores, parallel processing is the only effective way to use all of the available CPU power. To take advantage of it, your data and workflow may need adjustment. With parallel processing, your data is divided into pieces, and calculations are done on several pieces simultaneously.
Thankfully, there are some well-developed tools and techniques to help with this. One of the more common is MapReduce, a framework for processing large volumes of data in parallel, which was popularized by Apache Hadoop. A less commonly seen tool is GNU Parallel, which runs and manages command-line tools in parallel.
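To give a feel for the MapReduce idea, here is a toy version in plain Python. It is not Hadoop and it runs serially on one machine. It only illustrates the three stages: map each record to a key and value, group the values by key, then reduce each group to a result. The function names and the sample records are invented for this sketch.

```python
from collections import defaultdict


def map_phase(record):
    """Map: emit a (key, value) pair for one input record."""
    area, temperature = record
    yield area, temperature


def shuffle(pairs):
    """Shuffle: group all values that share the same key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: collapse one group of values into a single result."""
    return key, sum(values) / len(values)


def map_reduce(records):
    """Run the three stages in order and return a dict of key -> mean."""
    pairs = (pair for record in records for pair in map_phase(record))
    return dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())


if __name__ == "__main__":
    readings = [("98105", 10.0), ("98195", 20.0), ("98105", 14.0)]
    print(map_reduce(readings))
```

In a real framework, the map and reduce steps for different keys run on different cores or machines, which is where the speedup comes from.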
As an example, climate data can be processed in parallel. The data can be divided up by area, such as ZIP code, and then computation performed on a per-area basis. Thus, instead of doing calculations for one ZIP code at a time, you could process data for 4, 8, or more areas at once.
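A per-area calculation like this might look as follows in Python, using the standard library’s `ProcessPoolExecutor` to spread areas across CPU cores. This is a sketch under stated assumptions: the function names are hypothetical, the "calculation" is just a mean of made-up temperature readings, and four workers is an arbitrary default.

```python
from concurrent.futures import ProcessPoolExecutor


def area_mean(item):
    """Compute the mean temperature for one (area, readings) pair."""
    area, temperatures = item
    return area, sum(temperatures) / len(temperatures)


def process_areas(data, workers=4):
    """Process every area in parallel, one task per area.

    data maps an area name (e.g. a ZIP code) to a list of readings.
    workers is the number of processes; 4 is an arbitrary default.
    """
    with ProcessPoolExecutor(max_workers=workers) as pool:
        # pool.map hands each (area, readings) pair to a worker process.
        return dict(pool.map(area_mean, data.items()))
```

To try it, call `process_areas({"98105": [10.0, 14.0], "98195": [20.0]})` from a script, inside an `if __name__ == "__main__":` block so that worker processes can start cleanly on every platform. For a calculation this small, the cost of starting processes outweighs the benefit. Parallelism pays off when each area takes real time to compute.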
In summary, there are three main components to resource management:
- Capacity planning: Identification and allocation of necessary resources.
- Utilization monitoring: Verifying you are using the resources you’ve allocated.
- Bottleneck resolution: Identification and correction of performance bottlenecks.
The latest version of this document is online at: https://github.com/brianhigh/research-computing/wiki Copyright © The Research Computing Team. This information is provided for educational purposes only. See LICENSE for more information. Creative Commons Attribution 4.0 International Public License.