These instructions have only been tested on:
- Mac OS X
- Ubuntu
If you are using Windows, we suggest installing Ubuntu in virtualization software (e.g., VirtualBox) with at least 4 GB of memory allocated to it.
Select the Hadoop distribution of your choice. Supported Hadoop versions are 2.6.0, 2.7.5 and 2.9.0.
Step 1 — Install Hadoop 2.x.x
For example: Hadoop 2.6.0.
Download and extract the hadoop-2.6.0 binary distribution (hadoop-2.6.0.tar.gz) onto your machine.
$ mkdir ~/Hadoop
$ cd ~/Hadoop
$ wget https://archive.apache.org/dist/hadoop/core/hadoop-2.6.0/hadoop-2.6.0.tar.gz
$ tar -xvzf hadoop-2.6.0.tar.gz
Set the environment variables in ~/.bashrc.
$ vim ~/.bashrc
Add the following text to the file, replacing <where Java locates> and <where hadoop locates> with the paths where Java and Hadoop are installed on your system.
export JAVA_HOME="<where Java locates>" # e.g. /usr/lib/jvm/java-1.8.0-openjdk-amd64
export HADOOP_HOME="<where hadoop locates>" # e.g. ~/Hadoop/hadoop-2.6.0
export YARN_HOME=$HADOOP_HOME
export HADOOP_CONF_DIR=$HADOOP_HOME/etc/hadoop
export PATH=$HADOOP_HOME/bin:$JAVA_HOME/bin:$PATH
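If you are unsure where Java is installed, one way to find it on most Linux systems is to resolve the java binary; this is a sketch that assumes which and readlink are available (they are on Ubuntu).
# Print the real location of the java binary; JAVA_HOME is the directory
# above bin/, so drop the trailing /jre/bin/java or /bin/java from the result.
$ readlink -f "$(which java)"
# e.g. /usr/lib/jvm/java-1.8.0-openjdk-amd64/jre/bin/java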
Run the following command to make sure the changes are applied.
$ source ~/.bashrc
Check if environment variables are set correctly by running the following command.
$ hadoop
The results should look similar to the example below.
Usage: hadoop [--config confdir] COMMAND
       where COMMAND is one of:
  fs                   run a generic filesystem user client
  version              print the version
  jar <jar>            run a jar file
  checknative [-a|-h]  check native hadoop and compression libraries availability
  distcp <srcurl> <desturl> copy file or directories recursively
  archive -archiveName NAME -p <parent path> <src>* <dest> create a hadoop archive
  classpath            prints the class path needed to get the
                       Hadoop jar and the required libraries
  credential           interact with credential providers
  daemonlog            get/set the log level for each daemon
  trace                view and modify Hadoop tracing settings
 or
  CLASSNAME            run the class named CLASSNAME

Most commands print help when invoked w/o parameters.
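As an additional check, the version subcommand (listed in the usage above) should report the release you downloaded.
$ hadoop version
# The first line should read: Hadoop 2.6.0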
Follow steps (i)-(iv) to modify the following files in the Apache Hadoop distribution.
(i) $HADOOP_HOME/etc/hadoop/core-site.xml:
$ vim $HADOOP_HOME/etc/hadoop/core-site.xml
Copy the following text into the file and replace ${user.name} with your user name.
<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://localhost:9010</value>
  </property>
  <property>
    <name>hadoop.tmp.dir</name>
    <value>/tmp/hadoop-${user.name}</value>
    <description>A base for other temporary directories.</description>
  </property>
</configuration>
(ii) $HADOOP_HOME/etc/hadoop/hdfs-site.xml:
$ vim $HADOOP_HOME/etc/hadoop/hdfs-site.xml
Copy the following text into the file.
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
</configuration>
(iii) $HADOOP_HOME/etc/hadoop/mapred-site.xml: you will be creating this file; it doesn’t exist in the original package.
$ vim $HADOOP_HOME/etc/hadoop/mapred-site.xml
Copy the following text into the file.
<configuration>
  <property>
    <name>mapreduce.framework.name</name>
    <value>yarn</value>
  </property>
  <property>
    <name>yarn.app.mapreduce.am.resource.mb</name>
    <value>512</value>
  </property>
  <property>
    <name>yarn.app.mapreduce.am.command-opts</name>
    <value>-Xmx256m -Xms256m</value>
  </property>
</configuration>
(iv) $HADOOP_HOME/etc/hadoop/yarn-site.xml:
$ vim $HADOOP_HOME/etc/hadoop/yarn-site.xml
Copy the following text into the file.
<configuration>
  <property>
    <name>yarn.nodemanager.aux-services</name>
    <value>mapreduce_shuffle</value>
  </property>
  <property>
    <name>yarn.nodemanager.resource.memory-mb</name>
    <value>4096</value>
  </property>
  <property>
    <description>Whether virtual memory limits will be enforced for containers.</description>
    <name>yarn.nodemanager.vmem-check-enabled</name>
    <value>false</value>
  </property>
  <property>
    <name>yarn.scheduler.minimum-allocation-mb</name>
    <value>512</value>
  </property>
  <property>
    <name>yarn.scheduler.maximum-allocation-mb</name>
    <value>2048</value>
  </property>
</configuration>
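To catch copy-paste mistakes early, you can optionally verify that all four files are well-formed XML. This sketch assumes xmllint is installed (on Ubuntu it ships with the libxml2-utils package).
# A silent exit (status 0) means every file parses cleanly.
$ xmllint --noout $HADOOP_HOME/etc/hadoop/core-site.xml \
    $HADOOP_HOME/etc/hadoop/hdfs-site.xml \
    $HADOOP_HOME/etc/hadoop/mapred-site.xml \
    $HADOOP_HOME/etc/hadoop/yarn-site.xml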
Format the HDFS file system using the following command.
$ hdfs namenode -format
You should be able to see it exit with status 0, as follows.
... ...
xx/xx/xx xx:xx:xx INFO util.ExitUtil: Exiting with status 0
xx/xx/xx xx:xx:xx INFO namenode.NameNode: SHUTDOWN_MSG:
/************************************************************
SHUTDOWN_MSG: Shutting down NameNode at xxx.xxx.xxx.xxx
************************************************************/
Launch NameNode, SecondaryNameNode and DataNode daemons.
$ $HADOOP_HOME/sbin/start-dfs.sh
Launch ResourceManager and NodeManager daemons.
$ $HADOOP_HOME/sbin/start-yarn.sh
Check if the daemons started successfully by running the following command.
$ jps
The output should look similar to the following text, with xxxxx replaced by the process IDs for NameNode, SecondaryNameNode, etc.
xxxxx NameNode
xxxxx SecondaryNameNode
xxxxx DataNode
xxxxx NodeManager
xxxxx Jps
xxxxx ResourceManager
If any of the processes listed above are missing from your output, recheck your configurations, then execute the following commands and repeat the format and start-up steps above. Replace ${user.name} with the user name you used in core-site.xml (step (i)).
$ $HADOOP_HOME/sbin/stop-dfs.sh
$ $HADOOP_HOME/sbin/stop-yarn.sh
$ rm -r /tmp/hadoop-${user.name}
You can browse the web interface for the NameNode at http://localhost:50070 and for the ResourceManager at http://localhost:8088.
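If you prefer to check from the command line, the following sketch (assuming curl is installed) should print an HTTP 200 code for each interface once the daemons are up.
# Print only the HTTP status code for each web UI.
$ curl -s -o /dev/null -w "%{http_code}\n" http://localhost:50070
$ curl -s -o /dev/null -w "%{http_code}\n" http://localhost:8088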
Step 2 — Install Harp
Clone the Harp repository using the following command. It is available at DSC-SPIDAL/harp.
$ git clone https://github.com/DSC-SPIDAL/harp.git
Set the environment variables in ~/.bashrc.
$ vim ~/.bashrc
Add the following text to the file, replacing <where Harp locates> with the path where Harp is located on your system.
export HARP_ROOT_DIR=<where Harp locates> # e.g. ~/harp
export HARP_HOME=$HARP_ROOT_DIR/core/
Run the following command to make sure the changes are applied.
$ source ~/.bashrc
If Hadoop is still running, stop it first with the following commands.
$ $HADOOP_HOME/sbin/stop-dfs.sh
$ $HADOOP_HOME/sbin/stop-yarn.sh
Enter the Harp home directory using the following command.
$ cd $HARP_ROOT_DIR
Compile Harp
Select the Maven profile matching your Hadoop version (e.g., hadoop-2.6.0) and compile with Maven. Supported Hadoop versions are 2.6.0, 2.7.5, and 2.9.0.
$ mvn clean package -Phadoop-2.6.0
Install the Harp plugin into Hadoop by copying the following jars.
$ cp core/harp-collective/target/harp-collective-0.1.0.jar $HADOOP_HOME/share/hadoop/mapreduce/
$ cp core/harp-hadoop/target/harp-hadoop-0.1.0.jar $HADOOP_HOME/share/hadoop/mapreduce/
$ cp third_party/fastutil-7.0.13.jar $HADOOP_HOME/share/hadoop/mapreduce/
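As a quick sanity check, you can confirm all three jars landed in the MapReduce library directory.
$ ls $HADOOP_HOME/share/hadoop/mapreduce/ | grep -E 'harp|fastutil'
# Expected: fastutil-7.0.13.jar, harp-collective-0.1.0.jar, harp-hadoop-0.1.0.jar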
Edit mapred-site.xml in $HADOOP_HOME/etc/hadoop using the following command.
$ vim $HADOOP_HOME/etc/hadoop/mapred-site.xml
Add Java opts settings for map-collective tasks as follows. Keep mapreduce.map.collective.memory.mb within the container bounds set in yarn-site.xml earlier (512-2048 MB), since YARN cannot grant allocations outside that range. For example:
<property>
  <name>mapreduce.map.collective.memory.mb</name>
  <value>512</value>
</property>
<property>
  <name>mapreduce.map.collective.java.opts</name>
  <value>-Xmx256m -Xms256m</value>
</property>
You have completed the Harp installation.
Note
To develop Harp applications add the following property when configuring the job.
jobConf.set("mapreduce.framework.name", "map-collective");
Step 3 — Run the Harp Kmeans example
Copy the Harp examples to $HADOOP_HOME using the following command.
$ cp $HARP_ROOT_DIR/ml/java/target/harp-java-0.1.0.jar $HADOOP_HOME
Start Hadoop.
$ cd $HADOOP_HOME
$ sbin/start-dfs.sh
$ sbin/start-yarn.sh
Run the Kmeans map-collective job. Make sure you are in the $HADOOP_HOME folder. The usage is:
hadoop jar harp-java-0.1.0.jar edu.iu.kmeans.regroupallgather.KMeansLauncher <num of points> <num of centroids> <vector size> <num of point files per worker> <number of map tasks> <num threads> <number of iterations> <work dir> <local points dir>
<num of points>: the number of data points to generate randomly
<num of centroids>: the number of centroids to cluster the data into
<vector size>: the number of dimensions of each data point
<num of point files per worker>: the number of files containing data points on each worker
<number of map tasks>: the number of map tasks
<num threads>: the number of threads to launch in each worker
<number of iterations>: the number of iterations to run
<work dir>: the root directory for this run in HDFS
<local points dir>: Harp Kmeans first generates the data-point files in a local directory; this argument sets that directory
For example, the following run generates 1000 points of 100 dimensions, groups them into 10 centroids over 10 iterations, with 5 point files per worker, 2 map tasks, and 2 threads per worker, using /kmeans as the HDFS work directory and /tmp/kmeans as the local points directory:
$ hadoop jar harp-java-0.1.0.jar edu.iu.kmeans.regroupallgather.KMeansLauncher 1000 10 100 5 2 2 10 /kmeans /tmp/kmeans
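Once the job finishes, you can list what it wrote in HDFS before fetching anything; this assumes the example's /kmeans work directory.
$ hdfs dfs -ls /kmeans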
To fetch the results, use the following command:
$ hdfs dfs -get <work dir> <local dir> # e.g. hdfs dfs -get /kmeans ~/Documents