This work was produced at the Lawrence Livermore National Laboratory
(LLNL) under Contract No. DE-AC52-07NA27344 (Contract 44) between
the U.S. Department of Energy (DOE) and Lawrence Livermore National
Security, LLC (LLNS) for the operation of LLNL.

This work was prepared as an account of work sponsored by an agency of
the United States Government. Neither the United States Government nor
Lawrence Livermore National Security, LLC nor any of their employees,
makes any warranty, express or implied, or assumes any liability or
responsibility for the accuracy, completeness, or usefulness of any
information, apparatus, product, or process disclosed, or represents
that its use would not infringe privately-owned rights.

Reference herein to any specific commercial product, process, or
service by trade name, trademark, manufacturer or otherwise does
not necessarily constitute or imply its endorsement, recommendation,
or favoring by the United States Government or Lawrence Livermore
National Security, LLC. The views and opinions of authors expressed
herein do not necessarily state or reflect those of the United States
Government or Lawrence Livermore National Security, LLC, and shall
not be used for advertising or product endorsement purposes.

The precise terms and conditions for copying, distribution, and
modification are specified in the file "COPYING".

Running Hadoop on Clusters w/ Slurm & Lustre

Albert Chu
Updated October 3rd, 2013
[email protected]

What is this project
--------------------

This project contains a number of scripts for running Hadoop jobs in
HPC environments using Slurm and for running those jobs on top of
Lustre.

This project allows you to:

- Run Hadoop interactively or via scripts.
- Run MapReduce 1.0 or 2.0 jobs (i.e. Hadoop 1.0 or 2.0)
- Run against HDFS, HDFS over Lustre, or raw Lustre
- Take advantage of SSDs for local caching if available
- Make decent optimizations of Hadoop for your hardware

Credit
------

First, credit must be given to Kevin Regimbal @ PNNL. Initial
experiments were done using heavily modified versions of scripts Kevin
developed for running Hadoop w/ Slurm & Lustre. A number of the ideas
from Kevin's scripts are still in these scripts.

Basic Idea
----------

The basic idea behind these scripts is to:

1) Allocate nodes on a cluster using Slurm.

2) The scripts will set up configuration files so the Slurm/MPI rank 0
   node is the "master". All compute nodes will have configuration
   files created that point to the node designated as the
   jobtracker/yarn server.

3) Launch Hadoop daemons on all nodes. The Slurm/MPI rank 0 node will
   run the JobTracker/NameNode (or Yarn server in Hadoop 2.0). All
   remaining nodes will run DataNodes/TaskTrackers (or NodeManagers in
   Hadoop 2.0).

Now you have a mini Hadoop cluster to do whatever you want.

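As a rough sketch (this is not the project's actual code, just an
illustration), picking the master out of a Slurm allocation might look
something like this:

  # Hypothetical sketch: treat the first host in the allocation as the
  # master, everything else as a worker.
  MASTER=$(scontrol show hostnames "$SLURM_JOB_NODELIST" | head -n 1)
  if [ "$(hostname)" = "$MASTER" ]; then
      echo "master: would run the JobTracker/NameNode (or Yarn server)"
  else
      echo "worker: would run a DataNode/TaskTracker (or NodeManager)"
  fi
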
Basics of HDFS over Lustre
--------------------------

Instead of using local disk, designate a Lustre directory to "emulate"
local disk for each compute node. For example, let's say you have 4
compute nodes. If we create the following paths in Lustre,

/lustre/myusername/node-0
/lustre/myusername/node-1
/lustre/myusername/node-2
/lustre/myusername/node-3

we can give each of these paths to one of the compute nodes, which
they can treat like a local disk.

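As a rough illustration (a sketch with assumed paths, not the
project's actual code), each node could derive its own "local disk"
directory from its Slurm node rank, so the same rank always maps to
the same path across runs:

  # Hypothetical sketch: SLURM_NODEID is the node's rank within the job.
  BASEDIR=/lustre/myusername            # base path from the example above
  MYDIR="${BASEDIR}/node-${SLURM_NODEID}"
  mkdir -p "$MYDIR"                     # reused on later runs, so HDFS data persists
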
Q: Does that mean I have to constantly rebuild HDFS every time I start
   a job?

A: No. Using Slurm/MPI ranks, "disk-paths" can be consistently
   assigned to nodes so that all your HDFS files from a previous run
   can exist on a later run.

Q: But I'll have to consistently use the same number of cluster nodes?

A: Generally speaking, yes. If you decide to change the number of
   nodes you run on, you may need to rebalance HDFS blocks or fix up
   HDFS. Imagine you had a traditional Hadoop cluster and you were
   increasing or decreasing the number of nodes in the cluster. How
   would you have to handle it?

   Increasing the number of nodes in your job is generally safer than
   decreasing it. HDFS should be able to find your data and rebalance
   it. Be careful if you scale down the number of nodes without
   handling it first; as far as HDFS is concerned, you may have "lost
   data".

Basic Instructions
------------------

1) Download your favorite version of Hadoop off of Apache and install
   it into a location where it's accessible on all cluster nodes.
   Usually this is in an NFS home directory. Adjust HADOOP_VERSION and
   HADOOP_BUILD_HOME appropriately for the install.

2) Open up sbatch.hadoop and set up the Slurm essentials for your job
   (a filled-in sketch follows this list):

   #SBATCH --nodes : Set how many nodes you want in your job

   SBATCH_TIMELIMIT : Set the time for this job to run

   #SBATCH --partition : Set the job partition

   HADOOP_SCRIPTS_HOME : Set where your scripts are.

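For example, the relevant part of sbatch.hadoop might end up looking
roughly like the following (the values are placeholders and the exact
syntax may differ in the real file):

  #SBATCH --nodes=8               # number of nodes for the job
  #SBATCH --partition=pbatch      # hypothetical partition name
  SBATCH_TIMELIMIT=120            # time limit, in whatever format your site expects
  HADOOP_SCRIPTS_HOME="${HOME}/hadoop-scripts"   # assumed location of these scripts
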
3) Now set up the essentials for Hadoop (a sketch with placeholder
   values follows this list):

   HADOOP_MODE : The first time you'll probably want to run w/
   'terasort' mode just to try things out. Later you may want to run
   w/ 'script' or 'interactive' mode, as that is the more likely way
   to run.

   HADOOP_FILESYSTEM_MODE : Most likely you'll want "hdfsoverlustre".

   HADOOP_HDFSOVERLUSTRE_PATH : For HDFS over Lustre, this needs to be
   set.

   HADOOP_SETUP_TYPE : Whether you are running MapReduce version 1 or 2.

   HADOOP_VERSION : Make sure your build matches HADOOP_SETUP_TYPE
   (i.e. don't say you want MapReduce 1 and point to a Hadoop 2.0
   build).

   HADOOP_BUILD_HOME : Where your Hadoop build is. Typically on an NFS
   mount.

   HADOOP_LOCAL_DIR : A small place for conf files and log files local
   to each node. Typically a /tmp directory.

   JAVA_HOME : Because you need it ...

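A sketch of what those settings might look like when filled in (all
values below are placeholders and assumptions, not project defaults;
the accepted values for each option are documented in sbatch.hadoop):

  HADOOP_MODE="terasort"                    # 'script' or 'interactive' later on
  HADOOP_FILESYSTEM_MODE="hdfsoverlustre"
  HADOOP_HDFSOVERLUSTRE_PATH="/lustre/myusername/hdfsoverlustre"  # assumed path
  HADOOP_SETUP_TYPE="MR2"                   # hypothetical value meaning MapReduce 2
  HADOOP_VERSION="2.1.0-beta"               # must match HADOOP_SETUP_TYPE
  HADOOP_BUILD_HOME="${HOME}/hadoop-${HADOOP_VERSION}"            # assumed NFS path
  HADOOP_LOCAL_DIR="/tmp/${USER}/hadoop"    # small node-local scratch space
  JAVA_HOME="/usr/lib/jvm/java"             # assumed path on your nodes
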
4) If you are happy with the configuration files provided by this
   project, you can use them. If not, change them. If you copy them
   to a new directory, adjust HADOOP_CONF_FILES in sbatch.hadoop as
   needed.

5) Run "sbatch -k ./sbatch.hadoop", plus any other options you see
   fit.

6) Look at your Slurm output file to see your output. There will also
   be some notes/instructions/tips in the Slurm output file for
   viewing the status of your job, how to interact with it, etc.

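For example, a session might look roughly like this (the job ID is
made up, and the output file name assumes Slurm's default of
slurm-<jobid>.out):

  sbatch -k ./sbatch.hadoop
  # Submitted batch job 1234567
  squeue -u $USER                 # confirm the job is running
  tail -f slurm-1234567.out       # watch the notes/instructions/tips
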
Advanced
--------

There are many advanced options and other scripting options. Please
see sbatch.hadoop for details.

The scripts make decent guesstimates on what would be best, but it
always depends on your job and your hardware. Many options in
sbatch.hadoop are available to help you adjust your performance.

Exported Environment Variables
------------------------------

The following environment variables are exported by the sbatch
hadoop-run script and may be useful.

HADOOP_CLUSTER_NODERANK : the rank of the node you are on. It's often
convenient to do something like the following to act on only one node
of your allocation:

  if [ "$HADOOP_CLUSTER_NODERANK" = "0" ]
  then
      ....
  fi

HADOOP_CONF_DIR : the directory where configuration files local to the
node are stored.

HADOOP_LOG_DIR : the directory where log files are stored.

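For instance, in 'script' mode a job script could point the Hadoop
command line at the per-node configuration (a hedged example; the
exact invocation depends on your Hadoop version and setup):

  ${HADOOP_BUILD_HOME}/bin/hadoop --config ${HADOOP_CONF_DIR} fs -ls /
  echo "logs for this node are under ${HADOOP_LOG_DIR}"
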
Patching Hadoop
---------------

Generally speaking, no modifications to Hadoop are necessary; however,
tweaks may be needed depending on your environment. In some
environments, passwordless ssh is disabled, requiring a modification
to Hadoop so that you can use non-ssh mechanisms for launching
daemons.

I have submitted a patch for adjusting this at this JIRA:

https://issues.apache.org/jira/browse/HADOOP-9109

For those who use mrsh (https://github.com/chaos/mrsh), applying one
of the appropriate patches in the 'patches' directory will allow you
to specify mrsh for launching remote daemons instead of ssh using the
HADOOP_REMOTE_CMD environment variable.

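Presumably that amounts to something along the lines of the following
in your job environment (shown only as an assumed illustration; check
the patched scripts for the exact variable usage):

  export HADOOP_REMOTE_CMD=mrsh
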
Special Note on Hadoop 1.0
--------------------------

See this JIRA:

https://issues.apache.org/jira/browse/HDFS-1835

Hadoop 1.0 appears to have more trouble on diskless systems, as
diskless systems have less entropy in them. So you may wish to apply
the patch in the above JIRA if things don't seem to be working. I
noticed the following a lot on my cluster:

2013-09-19 10:45:37,620 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: dnRegistration = DatanodeRegistration(apex114.llnl.gov:50010, storageID=, infoPort=50075, ipcPort=50020)

Notice the storageID is blank; that's because the random number
calculation failed. Subsequently daemons aren't started, and so on,
and badness overall.

If you have root privileges, starting up the rngd daemon is another
way to solve this problem without resorting to patching.

Dependency
----------

This project includes a script called 'hadoop-expand-nodes' which is
used for hostrange expansion within the scripts (e.g. expanding a
hostrange such as "node[0-3]" into "node0 node1 node2 node3"). It is
a hack pieced together from other scripts.

The preferred mechanism is to use the hostlist command in the
lua-hostlist project. You can find lua-hostlist here : < FILL IN >

The main hadoop-run script will use 'hadoop-expand-nodes' if it cannot
find 'hostlist' in its path.

Contributions
-------------

Feel free to send me patches for new environment variables, new
adjustments, new optimization possibilities, alternate defaults that
you feel are better, etc.

Any patches you submit to me for fixes will be appreciated. I am by
no means a bash expert ... in fact I'm quite bad at it.