Running the word-count example on a Hadoop commodity-hardware cluster and on a local Hadoop installation
Last weekend I spent some hours assembling old computer parts to build a commodity-hardware cluster for running Hadoop. I already had a local installation on my notebook, so I thought it would be cool to run the word-count example in both scenarios and compare the results.
But first, let's review the hardware configurations:
Cluster setup
devcluster01 (NameNode)
- Intel Core 2 Duo 2.3 GHz
- 4 GB RAM
- 200 GB disk
- 100 Mbps full-duplex network card
devcluster02
- AMD Athlon 1.6 GHz
- 500 MB RAM
- 4 GB disk
- 100 Mbps full-duplex network card
devcluster03
- Celeron 1.3 GHz
- 300 MB RAM
- 10 GB disk
- 100 Mbps full-duplex network card
devcluster04 (JobTracker)
- Celeron 2.26 GHz
- 512 MB RAM
- 80 GB disk
- 100 Mbps full-duplex network card
devcluster05
- AMD Duron 1.1 GHz
- 512 MB RAM
- 40 GB disk
- 100 Mbps full-duplex network card
Network
- Ethernet 10/100 D-Link hub
Standalone installation
- Intel Core i5 2.3 GHz (quad core)
- 6 GB RAM
- 500 GB disk
Notice that the NameNode and the JobTracker each run on a dedicated machine. I'm following the defaults for a cluster installation, but I will try swapping in a DataNode/TaskTracker with less computing power as the NameNode or the JobTracker (the current machines were picked at random), and also running DataNode/TaskTracker daemons on those two servers.
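In Hadoop 1.x the two master roles are really just addresses in the configuration, so swapping machines later only means pointing the relevant properties at different hosts. Below is a minimal sketch of how a job client would point at this cluster, assuming the roles above (devcluster01 as NameNode, devcluster04 as JobTracker); the port numbers are placeholders, and in a real installation these values live in core-site.xml and mapred-site.xml rather than in code:

import org.apache.hadoop.conf.Configuration;

// Sketch only: where the two master roles are declared for a job client.
public class ClusterConf {
    public static Configuration create() {
        Configuration conf = new Configuration();
        // HDFS master (NameNode) runs on devcluster01; port 9000 is a placeholder
        conf.set("fs.default.name", "hdfs://devcluster01:9000");
        // MapReduce master (JobTracker) runs on devcluster04; port 9001 is a placeholder
        conf.set("mapred.job.tracker", "devcluster04:9001");
        return conf;
    }
}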
The word-count example that I used can be found at https://github.com/kinow/hadoop-wordcount, and the data are free e-books from Project Gutenberg, saved as plain-text UTF-8.
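For reference, the job follows the classic Hadoop word-count structure: the mapper emits a (word, 1) pair for every token it reads, and the reducer (also used as a combiner) sums the counts per word. Here is a rough sketch using the standard org.apache.hadoop.mapreduce API; the actual code in the repository may differ in the details:

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Mapper: splits each line of text into tokens and emits (word, 1)
    public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
        private final static IntWritable one = new IntWritable(1);
        private final Text word = new Text();

        @Override
        public void map(Object key, Text value, Context context) throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, one);
            }
        }
    }

    // Reducer (and combiner): sums the counts emitted for each word
    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();

        @Override
        public void reduce(Text key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = new Job(conf, "word count");   // Hadoop 1.x constructor; 2.x uses Job.getInstance(conf)
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));    // input directory in HDFS (the books)
        FileOutputFormat.setOutputPath(job, new Path(args[1]));  // output directory in HDFS
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}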
First I used the default data from this tutorial, which includes only three books. Then I increased it to 8 books. Not happy with the results, I tried 30, and finally 66 books. You can get the data from the same GitHub repository mentioned above.
Using the web interface, I retrieved the total execution time of each job and, with the following R script, plotted the graph below (for more on plotting graphs in R, check this link).
a1 <- c(73, 75, 132, 248)  # job times on the cluster (seconds)
a2 <- c(40, 48, 121, 224)  # job times running locally (seconds)
files <- c(3, 8, 30, 66)   # number of files used
plot(x=files, xlab="Number of files", y=a1, ylab="Time (s)", col="red", type="o") # plot the cluster line
lines(y=a2, x=files, type="o", pch=22, lty=2, col="blue") # add the local line
title(main="Hadoop Execution in seconds", col.main="black", font.main=2) # chart title
g_range <- range(0, a1, files) # value range used to place the legend at the top
legend(2, g_range[2], c("Cluster","Local"), cex=0.8, col=c("red","blue"), pch=21:22, lty=1:2) # legend
The cluster is running slower than the standalone installation. During this week I'll investigate how to get better results out of the cluster. Only three computers run tasks in the distributed cluster (the other two are the NameNode and the JobTracker), while my notebook has four cores, which may be influencing the results. Other factors to look into are the network latency, the low memory on some of the nodes, and trying different machines as the NameNode and the JobTracker.
All in all, it's been fun to configure the cluster and run the experiments. It is good for practicing with Hadoop and HDFS, as well as for getting a better idea of how to manage a cluster.
Edit: JobTracker and TaskTracker were mixed up (thanks rretzbach)
Categories: Blog
Tags: Big Data