Sunday, May 8, 2011

Installing Flume in the cluster - A complete step by step tutorial

Flume Cluster Setup :




In the previous post, you used Flume on a single machine with the default configuration settings. With the default settings, nodes automatically search for a Master on localhost on a standard port. In order for the Flume nodes to find the Master in a fully distributed setup, you must specify site-specific static configuration settings.

Before we start:-
Before we start configuring Flume, you need a running Hadoop cluster, which will be the centralized storage for Flume. Please refer to the Installing Hadoop in the cluster - A complete step-by-step tutorial post before continuing.

Installation steps:-

Perform the following steps on the master machine.
1. Download flume-0.9.1.tar.gz from https://github.com/cloudera/flume/downloads and extract it to some path on your computer. From now on, I will refer to the Flume installation root as $FLUME_INSTALL_DIR.

2. Edit the file /etc/hosts on the master machine (and also on the agent and collector machines) and add the following lines. A quick resolution check is shown after the list.

192.168.41.67 flume-master
192.168.41.53 flume-collector
hadoop-namenode-machine-IP hadoop-namenode
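
As a quick sanity check (just a suggestion; ping and getent are standard on Ubuntu), confirm that each hostname resolves to the intended IP and not to 127.0.0.1:

        # run on each machine after editing /etc/hosts
        getent hosts flume-master flume-collector hadoop-namenode
        ping -c 1 flume-master    # should answer from 192.168.41.67, not 127.0.0.1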

3. Open the file $FLUME_INSTALL_DIR/conf/flume-site.xml and edit the following properties.

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
<property>
<name>flume.master.servers</name>
<value>flume-master</value>
</property>
<property>
<name>flume.collector.event.host</name>
<value>flume-collector</value>
<description>This is the host name of the default "remote" collector.
</description>
</property>
<property>
<name>flume.collector.port</name>
<value>35853</value>
<description>The default TCP port that the collector listens on in order to receive the events it is collecting.
</description>
</property>
</configuration>

4. Repeat steps 1 to 3 on the collector and agent machines.
Note: - The agent Flume nodes are co-located on the machines running the service that produces the logs.

Start flume processes:-

1. Start the Flume Master:- The Master can be started manually by executing the following command on the master machine.
        1.1 $FLUME_INSTALL_DIR/bin/flume master
1.2 After the Master is started, you can access it by pointing a web browser to http://flume-master:35871/. This web page displays the status of all Flume nodes that have contacted the Master, and shows each node's currently assigned configuration. When you start this up without Flume nodes running, the status and configuration tables will be empty. (A command-line check of the Master status page is shown below.)
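
If you prefer to check from a terminal, something along these lines works too (assuming curl is installed; the URL is the Master web UI from step 1.2):

        # fetch the first few lines of the Master status page
        curl -s http://flume-master:35871/ | head -n 20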

2. Start the Flume collector:- The collector can be started manually by executing the following command on the collector machine.
         2.1 $FLUME_INSTALL_DIR/bin/flume node -n flume-collector

2.2 To check whether a Flume node (collector) is up, point your browser to the Flume node status page at http://flume-collector:35862/. Each node displays its own data on a single table that includes diagnostics and metrics data about the node, its data flows, and the system metrics about the machine it is running on. If you have multiple instances of the flume node program running on a machine, it will automatically increment the port number, attempt to bind to the next port (35863, 35864, etc.), and log the eventually selected port.

2.3 If the node is up, you should also refresh the Master’s status page (http://flume-master:35871) to make sure that the node has contacted the Master. You brought up one node whose name is flume-collector, so you should have one node listed in the Master’s node status table.

3. Start the Flume agent:- The agent can be started manually by executing the following command on the agent machine (agent Flume nodes are co-located on the machines running the service that produces the logs).
      3.1 $FLUME_INSTALL_DIR/bin/flume node -n flume-agent

      3.2 Perform step 2.3 again.

Note: - Similarly, you can start other Flume agents by executing the following commands:-
Start second agent:- $FLUME_INSTALL_DIR/bin/flume node -n flume-agent1
Start third agent:- $FLUME_INSTALL_DIR/bin/flume node -n flume-agent2

Configuring Flume nodes via master:-

1. Configuration of Flume Collector: - On the Master’s web page click on the config link. Enter the following values into the "Configure a node" form, and then click Submit.
Node name: flume-collector
Source: collectorSource(35853)
Sink: collectorSink("hdfs://hadoop-namenode:9000/user/flume/logs/%H00","%{host}-")
Note: - The collector writes to an HDFS cluster (assuming the HDFS namenode machine is called hadoop-namenode).

2. Configuration of Flume Agent:- On the Master’s web page, click on the config link. Enter the following values into the "Configure a node" form, and then click Submit.
Node name: flume-agent
Source: tail("path/to/logfile")
Ex:- tail("/home/$USER/logAnalytics/dot.log")
Sink: agentSink("flume-collector",35853)

Note: - Use the same configuration for each Flume agent (changing only the path passed to tail if the log file differs). An end-to-end check is sketched below.
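
To verify the whole pipeline end to end, a rough check like the following can be used (the log path and the HDFS directory are the example values from the configuration above, and $HADOOP_INSTALL_DIR is assumed to point at the Hadoop installation from the Hadoop tutorial):

        # on the agent machine: generate a few test events in the tailed file
        echo "test event $(date)" >> /home/$USER/logAnalytics/dot.log

        # on the Hadoop namenode machine: after a short while, the collector
        # should have written files under the configured sink directory
        $HADOOP_INSTALL_DIR/bin/hadoop fs -ls /user/flume/logs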

Friday, May 6, 2011

Installing Flume in the pseudo mode - A complete step by step tutorial




Flume is a distributed, reliable, and available service for efficiently moving large amounts of data soon after the data is produced.

The primary use case for Flume is as a logging system that gathers a set of log files on every machine in a cluster and aggregates them to a centralized persistent store such as the Hadoop Distributed File System (HDFS).

Installation in pseudo-distributed mode:-


In pseudo-distributed mode, several Flume processes run on a single machine. There are two kinds of processes in the system:
1. Flume Master: - The Flume Master is the central management point; it controls the Flume node data flows and monitors the Flume nodes.
2. Flume Node: - The Flume nodes are divided into two categories:-
2.1 Flume agent: - The agent Flume nodes are co-located on machines with the service that is producing logs.
2.2 Flume collector: - The collector listens for data from multiple agents, aggregates the logs, and eventually writes the data to HDFS.

Fig: - Flume processes and their configuration.

Before we start:-

Before we start configuring Flume, you need a running Hadoop cluster, which will be the centralized storage for Flume. Please refer to the Installing Hadoop in the cluster - A complete step by step tutorial post before continuing.


Installation steps:-

1. Download flume-0.9.1.tar.gz from https://github.com/cloudera/flume/downloads and extract it to some path on your computer. From now on, I will refer to the Flume installation root as $FLUME_INSTALL_DIR.



2. The Master can be manually started by executing the following command:


2.1 $FLUME_INSTALL_DIR/bin/flume master


2.2 After the Master is started, you can access it by pointing a web browser to http://localhost:35871/. This web page displays the status of all Flume nodes that have contacted the Master, and shows each node's currently assigned configuration. When you start this up without Flume nodes running, the status and configuration tables will be empty.


3. The flume collector can be manually started by executing the following command in another terminal.


3.1 $FLUME_INSTALL_DIR/bin/flume node -n flume-collector


3.2 To check whether a Flume node is up, point your browser to the Flume node status page at http://localhost:35862/. Each node displays its own data on a single table that includes diagnostics and metrics data about the node, its data flows, and the system metrics about the machine it is running on. If you have multiple instances of the flume node program running on a machine, it will automatically increment the port number, attempt to bind to the next port (35863, 35864, etc.), and log the eventually selected port.


3.3 If the node is up, you should also refresh the Master's status page (http://localhost:35871) to make sure that the node has contacted the Master. You brought up one node whose name is flume-collector, so you should have one node listed in the Master's node status table.


4. Configuring a collector via master:-

4.1 On the Master’s web page click on the config link. Enter the following values into the "Configure a node" form, and then click Submit.

Node name: flume-collector

Source: collectorSource(35853)

Sink: collectorSink("hdfs://hadoop-namenode:9000/user/flume/logs/%H00","%{host}-")

Note: - The collector writes to an HDFS cluster (assuming the HDFS namenode machine is called hadoop-namenode).


5. The flume node can be manually started by executing the following command in another terminal.

5.1 $FLUME_INSTALL_DIR/bin/flume node -n flume-agent

5.2 Perform steps 3.2 and 3.3 again.


6. Configuring an agent via master:-

6.1 On the Master's web page, click on the config link. Enter the following values into the "Configure a node" form, and then click Submit. (A quick way to generate test events is shown after the list.)

Node name: flume-agent

Source: tail("path/to/logfile")
Ex:- tail("/home/impetus/logAnalytics/dot.log")

Sink: agentSink("localhost",35853)
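
Once both nodes show their new configuration on the Master page, you can generate a few test events by appending lines to the tailed file (the path below is just the example used above; adjust it to whatever you passed to tail):

        # append some lines so the tail source has data to ship
        echo "pseudo-mode test event $(date)" >> /home/impetus/logAnalytics/dot.log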


7. To check whether the data has been stored in HDFS, point your browser to http://localhost:50070/ (the NameNode web UI), or list the sink directory from the command line as shown below.
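
A command-line check along these lines can also be used (the directory is the example sink path from step 4.1, and $HADOOP_INSTALL_DIR is assumed to be the Hadoop installation root from the Hadoop tutorial):

        # list the files the collector has written into HDFS
        $HADOOP_INSTALL_DIR/bin/hadoop fs -ls /user/flume/logs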

Saturday, January 22, 2011

Installation of HBase in the cluster - A complete step by step tutorial

HBase cluster setup :

HBase is an open-source, distributed, versioned, column-oriented store modeled after Google's Bigtable.

This tutorial describes how to set up and run an HBase cluster, without going into too much detail about HBase itself. There are a number of articles where HBase is described in detail.

We will build the HBase cluster using three Ubuntu machines in this tutorial.

A distributed HBase depends on a running ZooKeeper cluster. All participating nodes and clients need to be able to reach the running ZooKeeper cluster. By default HBase manages a ZooKeeper cluster for you, or you can manage it on your own and point HBase to it. In our case, we are using the default ZooKeeper cluster, which is managed by HBase.

Following are the capacities in which nodes may act in our cluster:

1. Hbase Master:- The HbaseMaster is responsible for assigning regions to the HbaseRegionservers and for monitoring the health of each HbaseRegionserver.

2. Zookeeper: - For any distributed application, ZooKeeper is a centralized service for maintaining configuration information, naming, providing distributed synchronization, and providing group services.

3. Hbase Regionserver:- The HbaseRegionserver is responsible for handling client read and write requests. It communicates with the Hbasemaster to get a list of regions to serve and to tell the master that it is alive.

In our case, one machine in the cluster is designated as the Hbase master and Zookeeper. The rest of the machines in the cluster act as Regionservers.
Before we start:

Before we start configuring HBase, you need a running Hadoop cluster, which will be the storage for HBase (HBase stores its data in the Hadoop Distributed File System). Please refer to the Installing Hadoop in the cluster - A complete step by step tutorial post before continuing.

INSTALLING AND CONFIGURING HBASE MASTER

1. Download hbase-0.20.6.tar.gz from http://www.apache.org/dyn/closer.cgi/hbase/ and extract it to some path on your computer. From now on, I will refer to the HBase installation root as $HBASE_INSTALL_DIR.


2. Edit the file /etc/hosts on the master machine and add the following lines.
                192.168.41.53 hbase-master       hadoop-namenode 
                #Hbase master and Hadoop namenode are configured on the same machine
                192.168.41.87 hbase-regionserver1
                192.168.41.67 hbase-regionserver2

Note: Run the command "ping hbase-master" to check that the hbase-master hostname resolves to the machine's actual IP and not to the localhost IP.

3. We need to configure passwordless SSH login from the hbase-master machine to all regionserver machines. (A quick way to verify it is shown after the commands.)
                3.1. Execute the following commands on the hbase-master machine.
                $ssh-keygen -t rsa
                $scp .ssh/id_rsa.pub ilab@hbase-regionserver1:~ilab/.ssh/authorized_keys
                $scp .ssh/id_rsa.pub ilab@hbase-regionserver2:~ilab/.ssh/authorized_keys
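
To confirm that the passwordless login works (only a suggested check; the ilab user and hostnames are the ones used above), each command should print the remote hostname without asking for a password:

                $ssh ilab@hbase-regionserver1 hostname
                $ssh ilab@hbase-regionserver2 hostname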

4. Open the file $HBASE_INSTALL_DIR/conf/hbase-env.sh and set $JAVA_HOME.
export JAVA_HOME=/usr/lib/jvm/java-6-sun

Note: If you are using OpenJDK, then give the path to OpenJDK instead. A quick way to locate your Java installation is shown below.
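
If you are not sure where Java is installed, something like the following usually reveals it (assuming which and readlink are available; JAVA_HOME should be the JDK directory that contains bin/java):

                $readlink -f $(which java)
                # e.g. prints /usr/lib/jvm/java-6-openjdk/jre/bin/java,
                # so JAVA_HOME would be /usr/lib/jvm/java-6-openjdk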


5. Open the file $HBASE_INSTALL_DIR/conf/hbase-site.xml and add the following properties.
    <?xml version="1.0"?>
    <?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

    <configuration>
                <property>
                                <name>hbase.master</name>
                                <value>hbase-master:60000</value>
                                <description>The host and port that the HBase master runs at.
                                                     A value of 'local' runs the master and a regionserver
                                                     in a single process.
                                </description>
                </property>

                <property>
                                <name>hbase.rootdir</name>
                                <value>hdfs://hadoop-namenode:9000/hbase</value>
                                <description>The directory shared by region servers.</description>
                </property>

                <property>
                                <name>hbase.cluster.distributed</name>
                                <value>true</value>
                                <description>The mode the cluster will be in. Possible values are
                                false: standalone and pseudo-distributed setups with managed
                                Zookeeper true: fully-distributed with unmanaged Zookeeper
                                Quorum (see hbase-env.sh)
                                </description>
                </property>
                <property>
                                <name>hbase.zookeeper.property.clientPort</name>
                                <value>2222</value>
                                <description>Property from ZooKeeper's config zoo.cfg.
                                The port at which the clients will connect.
                                </description>
                </property>

                <property>
                <name>hbase.zookeeper.quorum</name>
                <value>hbase-master</value>
                <description>Comma separated list of servers in the ZooKeeper Quorum.
                                     For example,
                                     "host1.mydomain.com,host2.mydomain.com".
                                     By default this is set to localhost for local and
                                     pseudo-distributed modes of operation. For a
                                     fully-distributed setup, this should be set to a full
                                     list of ZooKeeper quorum servers. If
                                     HBASE_MANAGES_ZK is set in hbase-env.sh
                                     this is the list of servers which we will start/stop
                                     ZooKeeper on.
                </description>
                </property>
    </configuration>

Note:-
In our case, ZooKeeper and the HBase master are both running on the same machine.

6. Open the file $HBASE_INSTALL_DIR/conf/hbase-env.sh and uncomment the following line:
                export HBASE_MANAGES_ZK=true        

7. Open the file $HBASE_INSTALL_DIR/conf/regionservers and add all the regionserver machine names.

    hbase-regionserver1
    hbase-regionserver2
    hbase-master

Note: Add hbase-master machine name only if you are running a regionserver on hbase-master machine.

INSTALLING AND CONFIGURING HBASE REGIONSERVER

1. Download hbase-0.20.6.tar.gz from http://www.apache.org/dyn/closer.cgi/hbase/ and extract to some path in your computer. Now I am calling hbase installation root as $HBASE_INSTALL_DIR.

2. Edit the file /etc/hosts on the hbase-regionserver machine and add the following lines.
                192.168.41.53 hbase-master       hadoop-namenode

Note: In my case, the HBase master and the Hadoop namenode are running on the same machine.

Note: Run the command "ping hbase-master" to check that the hbase-master hostname resolves to the machine's actual IP and not to the localhost IP.

3. We need to configure passwordless SSH login from each hbase-regionserver machine to the hbase-master machine.
                3.1. Execute the following commands on the hbase-regionserver machine.
                $ssh-keygen -t rsa
                $scp .ssh/id_rsa.pub ilab@hbase-master:~ilab/.ssh/authorized_keys2
               
4. Open the file $HBASE_INSTALL_DIR/conf/hbase-env.sh and set $JAVA_HOME.
export JAVA_HOME=/usr/lib/jvm/java-6-sun

Note: If you are using OpenJDK, then give the path to OpenJDK instead.

5. Open the file $HBASE_INSTALL_DIR/conf/hbase-site.xml and add the following properties.
    <?xml version="1.0"?>
    <?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

    <configuration>
                <property>
                                <name>hbase.master</name>
                                <value>hbase-master:60000</value>
                                <description>The host and port that the HBase master runs at.
                                                     A value of 'local' runs the master and a regionserver
                                                     in a single process.
                                </description>
                </property>

                <property>
                                <name>hbase.rootdir</name>
                                <value>hdfs://hadoop-namenode:9000/hbase</value>
                                <description>The directory shared by region servers.</description>
                </property>

                <property>
                                <name>hbase.cluster.distributed</name>
                                <value>true</value>
                                <description>The mode the cluster will be in. Possible values are
                                false: standalone and pseudo-distributed setups with managed
                                Zookeeper true: fully-distributed with unmanaged Zookeeper
                                Quorum (see hbase-env.sh)
                                </description>
                </property>
                <property>
                                <name>hbase.zookeeper.property.clientPort</name>
                                <value>2222</value>
                <description>Property from ZooKeeper's config zoo.cfg.
                                The port at which the clients will connect.
                                </description>
                </property>

                <property>
                <name>hbase.zookeeper.quorum</name>
                <value>hbase-master</value>
                <description>Comma separated list of servers in the ZooKeeper Quorum.
                                For example, "host1.mydomain.com,host2.mydomain.com".
                                By default this is set to localhost for local and
                                pseudo-distributed modes of operation. For a fully-distributed
                                setup, this should be set to a full list of ZooKeeper quorum
                                servers. If HBASE_MANAGES_ZK is set in hbase-env.sh
                                this is the list of servers which we will start/stop ZooKeeper on.
                 </description>
                 </property>
    </configuration>

6. Open the file $HBASE_INSTALL_DIR/conf/hbase-env.sh and uncomment the following line:
                export HBASE_MANAGES_ZK=true

Note:-
The above steps are required on all the regionserver machines (in our case, the datanodes in the Hadoop cluster).



START AND STOP HBASE CLUSTER

1. Starting the Hbase Cluster:-

We need to start the daemons only on the hbase-master machine; it will start the daemons on all the regionserver machines. Execute the following command to start the HBase cluster. (An HDFS-level check is sketched after the process listings below.)
                $HBASE_INSTALL_DIR/bin/start-hbase.sh
               
Note:-
           At this point, the following Java processes should run on the hbase-master machine.
               ilab@hbase-master:$jps
               14143 Jps
               14007 HQuorumPeer
               14066 HMaster
               
and the following Java processes should run on the hbase-regionserver machines.
                23026 HRegionServer
                23171 Jps
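
Once the daemons are up, HBase should also have created its root directory in HDFS (the hbase.rootdir configured above). A rough way to confirm this, assuming the Hadoop client from the Hadoop tutorial is available:

                $HADOOP_INSTALL_DIR/bin/hadoop fs -ls hdfs://hadoop-namenode:9000/hbase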

2. Starting the hbase shell:-
                $HBASE_INSTALL_DIR/bin/hbase shell
                HBase Shell; enter 'help<RETURN>' for list of supported commands.
                Version: 0.20.6, r965666, Mon Jul 19 16:54:48 PDT 2010
                hbase(main):001:0>
               
                Now, create a table in HBase.
                hbase(main):001:0>create 't1','f1'
                0 row(s) in 1.2910 seconds
                hbase(main):002:0>
               
Note: - If the table is created successfully, then everything is running fine. A further read/write check is sketched below.
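
As an additional smoke test (just a suggestion; 't1' and 'f1' are the table and column family created above), you can write a cell and scan it back non-interactively by piping commands into the HBase shell:

                echo -e "put 't1','r1','f1:c1','hello'\nscan 't1'" | $HBASE_INSTALL_DIR/bin/hbase shell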

3. Stopping the Hbase Cluster:-
    Execute the following command on the hbase-master machine to stop the HBase cluster.
                $HBASE_INSTALL_DIR/bin/stop-hbase.sh




Tuesday, January 4, 2011

Installation of hadoop in the cluster - A complete step by step tutorial




Hadoop Cluster Setup:

Hadoop is a fault-tolerant distributed system for data storage which is highly scalable.
Hadoop has two important parts:-

1. Hadoop Distributed File System(HDFS):-A distributed file system that provides high throughput access to application data.

2. MapReduce:-A software framework for distributed processing of large data sets on compute clusters.

In this tutorial, I will describe how to set up and run a Hadoop cluster. We will build the Hadoop cluster using three Ubuntu machines.

Following are the capacities in which nodes may act in our cluster:-

1. NameNode:-Manages the namespace, file system metadata, and access control. There is exactly one NameNode in each cluster.

2. SecondaryNameNode:-Downloads periodic checkpoints from the nameNode for fault-tolerance. There is exactly one SecondaryNameNode in each cluster.

3. JobTracker: - Hands out tasks to the slave nodes. There is exactly one JobTracker in each cluster.

4. DataNode: -Holds file system data. Each data node manages its own locally-attached storage (i.e., the node's hard disk) and stores a copy of some or all blocks in the file system. There are one or more DataNodes in each cluster.

5. TaskTracker: - Slaves that carry out map and reduce tasks. There are one or more TaskTrackers in each cluster.

In our case, one machine in the cluster is designated as the namenode, secondary namenode and jobtracker. This is the master. The rest of the machines in the cluster act as both datanode and tasktracker. They are the slaves.

The diagram below shows how the Hadoop cluster will look after installation:-

Fig: Hadoop cluster layout after installation.

Installation, configuring and running of hadoop cluster is done in three steps:
1. Installing and configuring hadoop namenode.
2. Installing and configuring hadoop datanodes.
3. Start and stop hadoop cluster.

INSTALLING AND CONFIGURING HADOOP NAMENODE

1. Download hadoop-0.20.2.tar.gz from http://www.apache.org/dyn/closer.cgi/hadoop/core/ and extract it to some path on your computer. From now on, I will refer to the Hadoop installation root as $HADOOP_INSTALL_DIR.

2. Edit the file /etc/hosts on the namenode machine and add the following lines.
            192.168.41.53    hadoop-namenode
            192.168.41.87    hadoop-datanode1
            192.168.41.67    hadoop-datanode2

Note: Run the command "ping hadoop-namenode" to check that the namenode hostname resolves to the machine's actual IP and not to the localhost IP.

3. We need to configure passwordless SSH login from the namenode to all datanode machines. (A quick way to verify it is shown after the commands.)
            3.1. Execute the following commands on the namenode machine.
                        $ssh-keygen -t rsa
                        $scp .ssh/id_rsa.pub ilab@192.168.41.87:~ilab/.ssh/authorized_keys
                        $scp .ssh/id_rsa.pub ilab@192.168.41.67:~ilab/.ssh/authorized_keys
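
To confirm that the passwordless login works (only a suggested check; the ilab user and the datanode IPs are the ones used above), each command should print the remote hostname without asking for a password:

                        $ssh ilab@192.168.41.87 hostname
                        $ssh ilab@192.168.41.67 hostname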

4. Open the file $HADOOP_INSTALL_DIR/conf/hadoop-env.sh and set $JAVA_HOME.
export JAVA_HOME=/path/to/java
eg: export JAVA_HOME=/usr/lib/jvm/java-6-sun
Note: If you are using OpenJDK, then give the path to that OpenJDK instead.

5. Go to $HADOOP_INSTALL_DIR and create a new directory named hadoop-datastore. This directory is created to store HDFS data and metadata.

6. Open the file $HADOOP_INSTALL_DIR/conf/core-site.xml and add the following properties. This file is edited to configure the namenode with information such as the port number and the metadata directories. Add the properties in the format below (substitute the actual absolute path of the hadoop-datastore directory for $HADOOP_INSTALL_DIR, since environment variables are not expanded inside the XML):
            <!-- Defines the namenode and port number -->
            <property>
                              <name>fs.default.name</name>
                              <value>hdfs://hadoop-namenode:9000</value>
                              <description>This is the namenode uri</description>
            </property>
            <property>
      <name>hadoop.tmp.dir</name>
      <value>$HADOOP_INSTALL_DIR/hadoop-0.20.2/hadoop-datastore
      </value>
      <description>A base for other temporary directories.</description>
            </property>

7. Open the file $HADOOP_INSTALL_DIR/conf/hdfs-site.xml and add the following properties. This file is edited to configure the replication factor of the Hadoop setup. Add the properties in the format below:
            <property>
                       <name>dfs.replication</name>
                       <value>2</value>
<description>Default block replication. The actual number of replications can be specified when the file is created. The default is used if replication is not specified at create time.
                       </description>
            </property>

8. Open the file $HADOOP_INSTALL_DIR/conf/mapred-site.xml and add the following properties. This file is edited to configure the host and port of the MapReduce job tracker in the namenode of the Hadoop setup. Add the properties in the format below:
            <property>
                        <name>mapred.job.tracker</name>
                        <value>hadoop-namenode:9001</value>
                        <description>The host and port that the MapReduce job tracker runs
                        at.  If "local", then jobs are run in-process as a single map and reduce 
                        task.
                        </description>
            </property>

9. Open the file $HADOOP_INSTALL_DIR/conf/masters and add the machine name where the secondary namenode will run. This file is edited to configure the Hadoop secondary namenode:
            hadoop-namenode
           
Note: In my case, both the primary namenode and the secondary namenode are running on the same machine, so I have added hadoop-namenode to the $HADOOP_INSTALL_DIR/conf/masters file.

10. Open the file $HADOOP_INSTALL_DIR/conf/slaves and add all the datanode machine names:-
            hadoop-namenode     /* include the namenode here only if you want it to also store data (i.e. behave like a datanode) */
            hadoop-datanode1
            hadoop-datanode2

INSTALLING AND CONFIGURING HADOOP DATANODE


1. Download hadoop-0.20.2.tar.gz from http://www.apache.org/dyn/closer.cgi/hadoop/core/ and extract it to some path on your computer. From now on, I will refer to the Hadoop installation root as $HADOOP_INSTALL_DIR.

2. Edit the file /etc/hosts on the datanode machine and add the following lines.
            192.168.41.53    hadoop-namenode
            192.168.41.87    hadoop-datanode1
            192.168.41.67    hadoop-datanode2

Note: Run the command "ping hadoop-namenode" to check that the namenode hostname resolves to the machine's actual IP and not to the localhost IP.

3. We need to configure passwordless SSH login from all datanode machines to the namenode machine.
            3.1. Execute the following commands on each datanode machine.
                        $ssh-keygen -t rsa
                        $scp .ssh/id_rsa.pub ilab@192.168.41.53:~ilab/.ssh/authorized_keys2

4. Open the file $HADOOP_INSTALL_DIR/conf/hadoop-env.sh and set $JAVA_HOME.
export JAVA_HOME=/path/to/java
eg: export JAVA_HOME=/usr/lib/jvm/java-6-sun

Note: If you are using OpenJDK, then give the path to that OpenJDK instead.

5. Go to $HADOOP_INSTALL_DIR and create a new directory named hadoop-datastore. This directory is created to store HDFS data and metadata.

6. Open the file $HADOOP_INSTALL_DIR/conf/core-site.xml and add the following properties. This file is edited to configure the datanode with the host, port, etc. of the filesystem. Add the properties in the format below (substitute the actual absolute path of the hadoop-datastore directory for $HADOOP_INSTALL_DIR, since environment variables are not expanded inside the XML):
           <!-- The uri's authority is used to determine the host, port, etc. for a filesystem. -->
            <property>
                        <name>fs.default.name</name>
                        <value>hdfs://hadoop-namenode:9000</value>
                        <description>This is the namenode uri</description>
            </property>
            <property>
                       <name>hadoop.tmp.dir</name>
                       <value>$HADOOP_INSTALL_DIR/hadoop-0.20.2/hadoop-datastore
                       </value>
                       <description>A base for other temporary directories.</description>
            </property>

7. Open the file $HADOOP_INSTALL_DIR/conf/hdfs-site.xml and add the following properties. This file is edited to configure the replication factor of the Hadoop setup. Add the properties in the format below:
            <property>
                                    <name>dfs.replication</name>
                                    <value>2</value>
                                    <description>Default block replication.
                                    The actual number of replications can be specified when the file 
                                    is created. The default is used if replication is not specified in
                                    create time.
                                    </description>
            </property>

8. Open the file $HADOOP_INSTALL_DIR/conf/mapred-site.xml and add the following properties. This file is edited to identify the host and port at which the MapReduce job tracker runs in the namenode of the Hadoop setup. Add the properties in the format below:
            <property>
                        <name>mapred.job.tracker</name>
                        <value>hadoop-namenode:9001</value>
                        <description>The host and port that the MapReduce job tracker runs
                         at.  If "local", then jobs are run in-process as a single map and reduce 
                         task.
                        </description>
</property>

Note:- Steps 9 and 10 are not mandatory.

9. Open $HADOOP_INSTALL_DIR/conf/masters and add the machine name where the secondary namenode will run.
            hadoop-namenode

Note: In my case, both the primary namenode and the secondary namenode are running on the same machine, so I have added hadoop-namenode to the $HADOOP_INSTALL_DIR/conf/masters file.

10. Open $HADOOP_INSTALL_DIR/conf/slaves and add all the datanode machine names:
            hadoop-namenode     /* include the namenode here only if you want it to also store data (i.e. behave like a datanode) */
            hadoop-datanode1
            hadoop-datanode2

  
Note:-
The above steps are required on all the datanodes in the Hadoop cluster.

START AND STOP HADOOP CLUSTER

1. Formatting the namenode:-
Before we start our new Hadoop cluster, we have to format the Hadoop Distributed File System (HDFS) via the namenode. We need to do this only the first time we set up our Hadoop cluster. Do not format a running Hadoop namenode; this will cause all the data in the HDFS filesystem to be lost.
Execute the following command on the namenode machine to format the file system.
$HADOOP_INSTALL_DIR/bin/hadoop namenode -format

2. Starting the Hadoop cluster:-
            Starting the cluster is done in two steps.
           
2.1 Start HDFS daemons:-
           
Execute the following command on namenode machine to start HDFS daemons.
            $HADOOP_INSTALL_DIR/bin/start-dfs.sh
            Note:-
            At this point, the following Java processes should run on namenode
            machine. 
                        ilab@hadoop-namenode:$jps // (the process IDs don’t matter of course.)
                        14799 NameNode
                        15314 Jps
                        14977 SecondaryNameNode
                        ilab@hadoop-namenode:$
            and the following Java processes should run on the datanode machines.
                        ilab@hadoop-datanode1:$jps //(the process IDs don’t matter of course.)
                        15183 DataNode
                        15616 Jps
                        ilab@hadoop-datanode1:$

            2.2 Start MapReduce daemons:-
            Execute the following command on the machine you want the jobtracker to run 
            on.
$HADOOP_INSTALL_DIR/bin/start-mapred.sh     
//In our case, we will run bin/start-mapred.sh on the namenode machine. (A quick smoke test for the running cluster is sketched after the process listings below.)
           Note:-
           At this point, the following Java processes should run on namenode machine.       
                        ilab@hadoop-namenode:$jps // (the process IDs don’t matter of course.)
                        14799 NameNode
                        15314 Jps
                        14977 SecondaryNameNode
                        15596 JobTracker                 
                        ilab@hadoop-namenode:$

            and the following Java processes should run on the datanode machines.
                        ilab@hadoop-datanode1:$jps //(the process IDs don’t matter of course.)
                        15183 DataNode
                        15616 Jps
                        15897 TaskTracker               
                        ilab@hadoop-datanode1:$
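
Once both the HDFS and MapReduce daemons are running, a simple smoke test along the following lines can confirm that the cluster works end to end (the HDFS paths are arbitrary examples, and the examples jar name may differ slightly depending on the exact tarball):

            # copy a local file into HDFS and list it
            $HADOOP_INSTALL_DIR/bin/hadoop fs -mkdir /user/ilab/input
            $HADOOP_INSTALL_DIR/bin/hadoop fs -put $HADOOP_INSTALL_DIR/conf/core-site.xml /user/ilab/input/
            $HADOOP_INSTALL_DIR/bin/hadoop fs -ls /user/ilab/input

            # run the bundled wordcount example and inspect the output
            $HADOOP_INSTALL_DIR/bin/hadoop jar $HADOOP_INSTALL_DIR/hadoop-0.20.2-examples.jar wordcount /user/ilab/input /user/ilab/output
            $HADOOP_INSTALL_DIR/bin/hadoop fs -cat /user/ilab/output/part-*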

3. Stopping the Hadoop cluster:-
            Like starting the cluster, stopping it is done in two steps.
3.1 Stop MapReduce daemons:-
Run the command $HADOOP_INSTALL_DIR/bin/stop-mapred.sh on the jobtracker machine. In our case, we will run it on the namenode machine.
            3.2 Stop HDFS daemons:-
                        Run the command $HADOOP_INSTALL_DIR/bin/stop-dfs.sh on the namenode machine.