System configuration

Python and Virtualenv setup

To set up Python and Virtualenv, run the following commands from a terminal:

cd ~/temp
curl -L -o virtualenv.py https://raw.github.com/pypa/virtualenv/master/virtualenv.py
python virtualenv.py cloud-kepler --no-site-packages
. cloud-kepler/bin/activate
pip install numpy
pip install simplejson
pip install pyfits

Test that the basic python code is working:

cat {DIRECTORY_WITH_CLOUD_KEPLER}/test/test_q1.txt | python {DIRECTORY_WITH_CLOUD_KEPLER}/python/download.py

If it starts downloading and spewing base64 encoded numpy arrays, then you’re good.

Hadoop setup

Install Oracle VM VirtualBox 4.2.14 from VirtualBox-4.2.14-86644-win from https://www.virtualbox.org/wiki/Downloads

Extract cloudera-quickstart-demo-vm-4.3.0-virtualbox.tar.gz from https://ccp.cloudera.com/display/SUPPORT/Cloudera+QuickStart+VM

Enter the created folder and extract cloudera-quickstart-demo-vm-4.3.0-virtualbox.tar, you should end up with cloudera-quickstart-demo-vm.ovf and cloudera-quickstart-demo-vm.vmdk in whatever folder you extracted to

Open up Oracle VM VirtualBox Manager

Select the New icon, the Create Virtual Machine window boots up.

For operating system, select Linux and Ubuntu

For memory size, select 4096 MB

For Hard Drive, select “Use an existing virtual hard drive” and path to cloudera-quickstart-demo-vm.vmdk

Press Create. Virtual machine now selectable in the main window on virtualbox manager.

Press the Settings button, opens the settings window.

Choose the system tab

Change chipset to ICH9, make sure “Enable IO APIC” is checked.

Select it and pressed Start, boot begins, this part takes a little while.

If it gets stuck on any one step for more than 20 minutes, you can assume something is wrong.

Eventually the boot sequence will end and you will see a desktop in your virtual machine. Success!

WordCount Example

Note that this assumes a cloudera vm distribution of hadoop.

Inside your virtual machine, go to the Cloudera Hadoop Tutorial at http://www.cloudera.com/content/cloudera-content/cloudera-docs/HadoopTutorial/CDH4/Hadoop-Tutorial/ht_topic_5_1.html

Copy the source code for WordCount and past it into the gedit text editor. Save as WordCount.java in the cloudera’s home folder.

Per the instructions there, open terminal, cd to the home directory, then run as follows:

mkdir wordcount_classes
javac -cp /usr/lib/hadoop/*:/usr/lib/hadoop/client-0.20/* -d wordcount_classes WordCount.java

Right click on the wordcount_classes folder you made (it will be in the home directory) and select compress. Choose .jar as the file format and wordcount as filename:

echo "Hello World Bye World" > file0
echo "Hello Hadoop Goodbye Hadoop" > file1
hadoop fs -mkdir /user/cloudera /user/cloudera/wordcount /user/cloudera/wordcount/input
hadoop fs -put file* /user/cloudera/wordcount/input
hadoop jar wordcount.jar org.myorg.WordCount /user/cloudera/wordcount/input output

According to the Cloudera Tutoria, this should be all you need to do, but I got an error message here, so everything is not quite right yet.

When you first log onto the virtual machine, it should begin with a firefox window open to some kind of cloudera page. Go to this and click the Cloudera Manager link.

Enter ‘admin’ and ‘admin’ as a username and password to access it.

Now you can see the health of your setup’s various components. mapreduce1 will probably be listed as in poor health. click on it

You should see that the jobtracker is the problem. Return to terminal:

sudo -u hdfs hadoop fs -mkdir /tmp/mapred/system
sudo -u hdfs hadoop fs -chown mapred:hadoop /tmp/mapred/system

Then restart jobtracker by clicking instances the instances tab, clicking on jobtracker, clicking to the processes tab, selecting the actions tab in the corner, and selecting restart:

hadoop jar wordcount.jar org.myorg.WordCount /user/cloudera/wordcount/input output

This time it should work:

hadoop fs -cat output/part-00000

This will open up the output folder for you from the hadoop run.

It should look like this:

Bye 1
Goodbye 1
Hadoop 2
Hello 2
World 2

If it looks like that then you are good.

It is worth noting that Hadoop won’t work unless the directory you set as your output both does not currently exist and is in your hadoop fs home directory.

Lein setup

Note that this assumes a cloudera vm distribution of hadoop.

You can find Lein at https://github.com/technomancy/leiningen

Download the script from https://raw.github.com/technomancy/leiningen/stable/bin/lein and place it wherever you want:

export $HOME=/home
cd
cd ..
cd etc/profile.d
sudo vim lein.sh

On one line of the file write export PATH=$PATH:{wherever your lein file is located} (in my case /home/cloudera/Desktop)

Save the file and exit.

Exit and reenter terminal to get back to you home directory:

chmod 755 {location of lein}

Lein should now be functioning, call ‘lein’ in terminal to test.

LEMUR setup

Note that this assumes a cloudera vm distribution of hadoop.

Lemur can be downloaded from http://download.climate.com/lemur/releases/lemur-1.3.1.tgz. follow that link and the file should appear in your download folder.

Extract it, and then put it wherever you want it to be:

export $HOME=/home
cd
cd ..
cd etc/profile.d
sudo vim lemur.sh

You are now writing a file which will allow your system to recognize lemur.

on the first line of the file write export LEMUR_HOME={wherever you saved your lemur file} (in my case /home/cloudera/Desktop/lemur).

on the second line of the file write export LEMUR_AWS_ACCESS_KEY={your aws access key}

on the third line of the file write export LEMUR_AWS_SECRET_KEY={your aws secret key}

on the fourth line of the file write export PATH=$PATH:$LEMUR_HOME/bin

save the file and exit.

Lemur should now work, call lemur in terminal to test.

References