In the first article here, I walked through importing netflow data into a single Hadoop instance (pseudonode) and mentioned a progression of the project to add multiple nodes. Distributed storage and distributed processing of data is ultimately the benefit of using Hadoop/HDFS, so let's expand the project and add one or more nodes. I have another article here that may be helpful on how to install a Hadoop pseudonode on Ubuntu in 20 minutes – that procedure works for namenodes (masters) and datanodes (slaves) alike.
Additional nodes (datanodes in Hadoop language – the namenode is the master) can be installed the same way as a master. Not all of the packages/processes are required on a datanode, but it does no harm if they are installed, so a new node can follow the same procedure as the first. The quickest install procedure I've seen is documented here; you can follow it to test and prepare your second, third, fourth, etc. node. I've used it to install five nodes.
Basic architecture of a cluster
The architecture of Hadoop lends itself well to adding nodes. At a high level, this is because slaves join/register with a common master without requiring any configuration changes on the master itself, so adding nodes is easy. There also aren't any slave-specific configurations, so replicating/duplicating a secondary node multiple times from a common image in a virtualized environment makes it very easy to add slaves and scale.
In the first project here and here we added a pseudonode – a single instance that runs all the basic primitives of a multi-node setup, just not across multiple nodes. Because of that approach, the services needed to act as a dedicated namenode master are already running, which makes it easy to add slaves.
The basic architecture for HDFS is as follows:
- There is one namenode – the master that is responsible for understanding and managing the entire file system namespace
- There are multiple datanodes which store some of the data in the cluster and manage storage on that specific node
The basic architecture for MapReduce is as follows:
- There is a JobTracker that runs on the master that acts as the front door for jobs
- There is a TaskTracker that runs on each slave – the JobTracker delegates tasks to run on the slave TaskTrackers
So – the process of adding additional nodes to the cluster is as simple as running more datanodes and tasktrackers.
Configuration changes on slaves and master
The configuration changes on a slave are simple, assuming you previously installed the system as a working pseudonode.
First, I'd recommend stopping all the Hadoop services on the server: for i in /etc/init.d/hadoop*; do $i stop; done
Two files then have to be updated so the slave talks to the master instead of localhost.
In mapred-site.xml, point the JobTracker address at the master:
<value><insert master ip here>:8021</value>
In core-site.xml, point the default filesystem at the master's namenode:
<value>hdfs://<insert master ip here>:8020</value>
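For reference, a sketch of the two properties those values live in, assuming a Hadoop 1.x/CDH-style setup where the files are under /etc/hadoop/conf (the IP 192.168.1.10 is just a placeholder for your master):

```xml
<!-- core-site.xml: point the default filesystem at the master's namenode -->
<property>
  <name>fs.default.name</name>
  <value>hdfs://192.168.1.10:8020</value>
</property>

<!-- mapred-site.xml: point jobs at the master's JobTracker -->
<property>
  <name>mapred.job.tracker</name>
  <value>192.168.1.10:8021</value>
</property>
```

After editing, restart the Hadoop services on the slave so the new addresses take effect.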
Configuration changes on master
Warning – make sure you made the above changes on the master as well!
Since we started from a previously (and hopefully still) working pseudonode, we have to make some changes on the master for it to work within a multi-node cluster.
Repoint hive table HDFS location
Hive will likely have set the table location using 127.0.0.1. That won't work when remote slaves connect to the table, so we have to repoint the table to an accessible listener IP.
From Hive, execute this command: alter table netflow set location "hdfs://<insert server ethernet listener IP here>:8020/user/hive/warehouse/netflow";
You can verify the change by running "describe extended netflow" from Hive and looking for the location setting.
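Put together, a sketch of the repoint and verification in the Hive CLI (the IP is a placeholder for your master's listener address):

```sql
-- repoint the table at the master's namenode instead of 127.0.0.1
ALTER TABLE netflow SET LOCATION 'hdfs://192.168.1.10:8020/user/hive/warehouse/netflow';

-- confirm the new location appears in the table metadata
DESCRIBE EXTENDED netflow;
```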
Change DFS replication on the master
The dfs.replication setting (in hdfs-site.xml) should already exist and be set to '1'. I would suggest changing it to the number of nodes in the cluster, or to 3 if there are more than three nodes. This determines how many datanodes store a copy of each block.
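As a sketch, for a cluster of three or more nodes the replication setting in hdfs-site.xml would look like this:

```xml
<!-- hdfs-site.xml: each block is stored on three different datanodes -->
<property>
  <name>dfs.replication</name>
  <value>3</value>
</property>
```

Note this only applies to newly written files; existing blocks keep their old replication factor unless changed with hadoop fs -setrep.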
You can now start all Hadoop services on the masters and slaves.
The simplest way to test is to open the namenode's node list in its web UI (port 50070 by default). You should see multiple live datanodes, the storage layout, etc.
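If you prefer the command line, the same information is available from dfsadmin – a sketch, assuming a CDH-style install where HDFS services run as the hdfs user:

```shell
# Prints cluster-wide capacity/usage, then one section per datanode
# showing its address, state, and used/remaining space
sudo -u hdfs hdfs dfsadmin -report
```

Each connected slave should appear as a live datanode in the output.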
The benefit of a cluster
If you've been running a single-node cluster for a while, you won't see any obvious performance improvement immediately after adding a second node. The data will still largely reside on the master, and the benefits of Hadoop show most when data exists across many nodes and jobs are delegated to many nodes. On my master, I saw performance times like the below – 3.3 million rows counted in 75 seconds.
select count(*) from netflow;
MapReduce Total cumulative CPU time: 24 seconds 750 msec
Ended Job = job_1385175791065_0001
MapReduce Jobs Launched:
Job 0: Map: 5 Reduce: 1 Cumulative CPU: 24.75 sec HDFS Read: 1188346202 HDFS Write: 8 SUCCESS
Total MapReduce CPU Time Spent: 24 seconds 750 msec
Time taken: 75.507 seconds
I want better performance – so what do you do?
Hadoop has the ability to forcefully distribute data across all the nodes with one simple command – the balancer – run as the hdfs user.
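A sketch of running the balancer as the hdfs user; the threshold flag is optional and sets how far (in percent) a node's utilization may deviate from the cluster average before blocks are moved:

```shell
# Moves blocks between datanodes until every node is within the
# threshold of the cluster's average utilization, then exits
sudo -u hdfs hdfs balancer -threshold 10
```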
You'll see data get moved around with log entries like the following:
13/11/24 10:59:55 INFO balancer.Balancer: Moving block 2638860088088871374 from 10.0.0.9:50010 to 10.0.0.8:50010 through 10.0.0.9:50010 is succeeded.
13/11/24 10:59:55 INFO balancer.Balancer: Moving block -9186205980643229570 from 10.0.0.9:50010 to 10.0.0.4:50010 through 10.0.0.9:50010 is succeeded.
13/11/24 10:59:55 INFO balancer.Balancer: Moving block -8143255398161719535 from 10.0.0.9:50010 to 10.0.0.4:50010 through 10.0.0.9:50010 is succeeded.
13/11/24 10:59:56 INFO balancer.Balancer: Moving block -5917892684609518162 from 10.0.0.9:50010 to 10.0.0.4:50010 through 10.0.0.9:50010 is succeeded.
13/11/24 10:59:56 INFO balancer.Balancer: Moving block -3587740737792534969 from 10.0.0.9:50010 to 10.0.0.4:50010 through 10.0.0.9:50010 is succeeded.
13/11/24 10:59:56 INFO balancer.Balancer: Moving block 1237121992209463697 from 10.0.0.9:50010 to 10.0.0.4:50010 through 10.0.0.9:50010 is succeeded.
13/11/24 10:59:56 INFO balancer.Balancer: Moving block -4541986761876659524 from 10.0.0.9:50010 to 10.0.0.6:50010 through 10.0.0.9:50010 is succeeded.
13/11/24 10:59:56 INFO balancer.Balancer: Moving block 2401718455020941834 from 10.0.0.9:50010 to 10.0.0.4:50010 through 10.0.0.9:50010 is succeeded.
13/11/24 10:59:56 INFO balancer.Balancer: Moving block -7557648849917782450 from 10.0.0.9:50010 to 10.0.0.4:50010 through 10.0.0.9:50010 is succeeded.
13/11/24 10:59:57 INFO balancer.Balancer: Moving block -5456894913981872546 from 10.0.0.9:50010 to 10.0.0.4:50010 through 10.0.0.9:50010 is succeeded.
Check your namenode master and you should see data distributed more evenly. This process took a good bit of time and obviously depends on the size of your dataset; for me it was 17 minutes.
While my table size grew, performance wasn't any better. Maybe I still have some work to do 😉
MapReduce Total cumulative CPU time: 41 seconds 600 msec
Ended Job = job_1385229206293_0003
MapReduce Jobs Launched:
Job 0: Map: 9 Reduce: 1 Cumulative CPU: 41.6 sec HDFS Read: 1408527708 HDFS Write: 8 SUCCESS
Total MapReduce CPU Time Spent: 41 seconds 600 msec
Time taken: 116.403 seconds
An issue I experienced multiple times was this error: Datanode denied communication with namenode: DatanodeRegistration(0.0.0.0. This likely means that the forward and reverse DNS entries for the slave connecting to the master cannot be resolved, so functioning DNS is important.
Some common problems you may run into are initialization issues with HDFS. It is simple to just re-format with the hadoop namenode -format command. The thing to remember is that /var/lib/hadoop-hdfs should be owned by hdfs:hadoop.
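A sketch of that recovery, assuming the CDH-style /var/lib/hadoop-hdfs data directory from the earlier install – note that re-formatting destroys all data in HDFS, so only do this on a broken or empty cluster:

```shell
# Stop HDFS services, then re-initialize the namenode metadata
for i in /etc/init.d/hadoop*; do $i stop; done
sudo -u hdfs hadoop namenode -format

# Make sure the data directories are owned by hdfs:hadoop
chown -R hdfs:hadoop /var/lib/hadoop-hdfs

for i in /etc/init.d/hadoop*; do $i start; done
```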
Other good troubleshooting tips are
- Run the “jps” command to show java processes. You’ll want to see SecondaryNameNode, DataNode, JobHistoryServer, ResourceManager, NodeManager
- There are lots of very detailed logs in /var/log/hadoop*, for example /var/log/hadoop-hdfs. You can just 'tail -f *' while you start and stop services and look for errors. Most errors are self-explanatory, but the nugget of info you need may be buried in a lot of log output
- Simply stopping and starting services fixed things for me:
- for i in /etc/init.d/hadoop*; do $i start; done
- for i in /etc/init.d/hadoop*; do $i stop; done