Running the word-count example on a Hadoop commodity-hardware cluster and on a local Hadoop installation

Last weekend I spent some hours assembling old computer parts to build a commodity-hardware cluster for running Hadoop. I already had a local installation on my notebook, so I thought it would be cool to run the word-count example in both scenarios and compare the results.

But first, let's review the hardware configurations:


Cluster setup

devcluster01 (NameNode)

devcluster02 (DataNode/TaskTracker)

devcluster03 (DataNode/TaskTracker)

devcluster04 (JobTracker)

devcluster05 (DataNode/TaskTracker)

Network


Standalone installation


Notice that there is one dedicated NameNode and one dedicated JobTracker. I'm following the defaults for a cluster installation, but I will try swapping in a less powerful DataNode/TaskTracker machine as the NameNode or the JobTracker (the machines were assigned to these roles at random), and also running DataNode/TaskTracker daemons on those two servers.
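For context, these roles are wired together through Hadoop's configuration files, which every node in the cluster shares. A minimal sketch of the two relevant entries for Hadoop 1.x, assuming the common tutorial ports 9000 and 9001 (the ports are an assumption, not necessarily my exact setup):

core-site.xml:

<configuration>
  <property>
    <!-- HDFS master: devcluster01 runs the NameNode -->
    <name>fs.default.name</name>
    <value>hdfs://devcluster01:9000</value>
  </property>
</configuration>

mapred-site.xml:

<configuration>
  <property>
    <!-- MapReduce master: devcluster04 runs the JobTracker -->
    <name>mapred.job.tracker</name>
    <value>devcluster04:9001</value>
  </property>
</configuration>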

The word-count example that I used can be found at https://github.com/kinow/hadoop-wordcount. The data are free e-books from Project Gutenberg, saved as plain text in UTF-8.
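The job itself is the classic MapReduce word count. The version in that repository may differ in details, but the core looks roughly like the canonical Hadoop 1.x example (this is a sketch, not the repository code verbatim):

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Mapper: emits (word, 1) for every token in each input line.
  public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, one);
      }
    }
  }

  // Reducer: sums the 1s emitted for each distinct word.
  public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    private IntWritable result = new IntWritable();

    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = new Job(new Configuration(), "word count"); // Hadoop 1.x style
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class); // pre-aggregate on each node
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

Using the reducer as a combiner means each node pre-aggregates its counts before the shuffle, so less data crosses the network between the map and reduce phases.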

First I used the default data from this tutorial, which includes only three books. Then I increased it to 8 books. Not happy with the results, I tried 30, and finally 66 books. You can get the data from the same GitHub repository mentioned above.
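Each run then boils down to copying the books into HDFS and submitting the job jar, roughly like this (the HDFS paths and jar name here are illustrative, not the exact ones I used):

hadoop fs -mkdir /user/kinow/input            # create the input directory in HDFS
hadoop fs -put books/*.txt /user/kinow/input  # upload the e-books
hadoop jar hadoop-wordcount.jar WordCount /user/kinow/input /user/kinow/output

Note that the output directory must not exist yet; Hadoop refuses to overwrite it.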

Using the web interface I retrieved the total time to execute each job, and with the following R script I plotted the graph below (for more on plotting graphs with R, check this link).

a1 <- c(73, 75, 132, 248) # time in the cluster, in seconds
a2 <- c(40, 48, 121, 224) # time running locally, in seconds
files <- c(3, 8, 30, 66)  # number of files used
plot(x=files, xlab="Number of files", y=a1, ylab="Time (s)", col="red", type="o") # plot the cluster line
lines(x=files, y=a2, type="o", pch=22, lty=2, col="blue") # add the local line
title(main="Hadoop Execution in seconds", col.main="black", font.main=2)
g_range <- range(0, a1, files) # upper bound used to place the legend
legend(2, g_range[2], c("Cluster", "Local"), cex=0.8, col=c("red", "blue"), pch=21:22, lty=1:2) # legend

The cluster is running slower than the standalone installation. During this week I'll investigate how to get better results from the cluster. Only three computers run tasks in the distributed cluster (the other two host the NameNode and the JobTracker), while my notebook has four cores, which may be influencing the results. There are also network latency and low memory in some nodes to consider, and I still want to try changing which machines host the NameNode and the JobTracker.

All in all, it's been fun to configure the cluster and run the experiments. It's good practice with Hadoop and HDFS, and a good way to get a better idea of how to manage a cluster.


Edit: JobTracker and TaskTracker were mixed up (thanks rretzbach)