Running the word-count example on a Hadoop commodity-hardware cluster and on a local Hadoop installation

Last weekend I spent some hours assembling old computer parts to build a commodity-hardware cluster for running Hadoop. I already had a local installation on my notebook, so I thought it would be cool to run the word-count example in both scenarios and compare the results.

But first, let's review the hardware configurations:


Cluster setup

devcluster01 (NameNode)

devcluster02 (DataNode/TaskTracker)

devcluster03 (DataNode/TaskTracker)

devcluster04 (JobTracker)

devcluster05 (DataNode/TaskTracker)

Network


Standalone installation


Notice that there is one dedicated NameNode and one dedicated JobTracker. I'm following the defaults for a cluster installation, but I will try swapping in a less powerful DataNode/TaskTracker machine as the NameNode or the JobTracker (the machines were assigned to these roles at random), and also running DataNode/TaskTracker daemons on those two servers.
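For context, these roles are wired together through Hadoop's configuration files, which every node in the cluster shares. A minimal sketch of the two relevant entries for Hadoop 1.x, assuming the common tutorial ports 9000 and 9001 (the ports are an assumption, not necessarily my exact setup):

core-site.xml:

<configuration>
  <property>
    <!-- HDFS master: devcluster01 runs the NameNode -->
    <name>fs.default.name</name>
    <value>hdfs://devcluster01:9000</value>
  </property>
</configuration>

mapred-site.xml:

<configuration>
  <property>
    <!-- MapReduce master: devcluster04 runs the JobTracker -->
    <name>mapred.job.tracker</name>
    <value>devcluster04:9001</value>
  </property>
</configuration>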

The word-count example that I used can be found at https://github.com/kinow/hadoop-wordcount. The data are free e-books from Project Gutenberg, saved as plain text in UTF-8.
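The job itself is the classic MapReduce word count. The version in that repository may differ in details, but the core looks roughly like the canonical Hadoop 1.x example (this is a sketch, not the repository code verbatim):

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Mapper: emits (word, 1) for every token in each input line.
  public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, one);
      }
    }
  }

  // Reducer: sums the 1s emitted for each distinct word.
  public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    private IntWritable result = new IntWritable();

    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = new Job(new Configuration(), "word count"); // Hadoop 1.x style
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class); // pre-aggregate on each node
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

Using the reducer as a combiner means each node pre-aggregates its counts before the shuffle, so less data crosses the network between the map and reduce phases.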

First I used the default data from this tutorial, which includes only three books. Then I increased it to 8 books. Not happy with the results, I tried 30, and finally 66 books. You can get the data from the same GitHub repository mentioned above.
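Each run then boils down to copying the books into HDFS and submitting the job jar, roughly like this (the HDFS paths and jar name here are illustrative, not the exact ones I used):

hadoop fs -mkdir /user/kinow/input            # create the input directory in HDFS
hadoop fs -put books/*.txt /user/kinow/input  # upload the e-books
hadoop jar hadoop-wordcount.jar WordCount /user/kinow/input /user/kinow/output

Note that the output directory must not exist yet; Hadoop refuses to overwrite it.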

Using the web interface I retrieved the total time to execute each job, and with the following R script I plotted the graph below (for more on plotting graphs with R, check this link).

a1 <- c(73, 75, 132, 248) # time in the cluster, in seconds
a2 <- c(40, 48, 121, 224) # time running locally, in seconds
files <- c(3, 8, 30, 66)  # number of files used
plot(x=files, xlab="Number of files", y=a1, ylab="Time (s)", col="red", type="o") # plot the cluster line
lines(x=files, y=a2, type="o", pch=22, lty=2, col="blue") # add the local line
title(main="Hadoop Execution in seconds", col.main="black", font.main=2)
g_range <- range(0, a1, files) # upper bound used to place the legend
legend(2, g_range[2], c("Cluster", "Local"), cex=0.8, col=c("red", "blue"), pch=21:22, lty=1:2) # legend

The cluster is running slower than the standalone installation. During this week I'll investigate how to get better results from the cluster. Only three computers run tasks in the distributed cluster (the other two host the NameNode and the JobTracker), while my notebook has four cores, which may be influencing the results. There are also network latency and low memory in some nodes to consider, and I still want to try changing which machines host the NameNode and the JobTracker.

All in all, it's been fun to configure the cluster and run the experiments. It's good practice with Hadoop and HDFS, and a good way to get a better idea of how to manage a cluster.


Edit: JobTracker and TaskTracker were mixed up (thanks rretzbach)