Setting Up a Hadoop Cluster with Ubuntu


Installation

  1. Install Hadoop: https://ccp.cloudera.com/display/CDHDOC/CDH3+Installation
  2. (optional) Links for installing Hive/Pig/HBase are at the bottom of the page linked in step 1.

Configuration

  1. Hive: by default Hive does not support multiple concurrent users, because it uses a local Derby metastore. If you want multiple users to connect, install MySQL and configure /etc/hive/conf/hive-site.xml to use it as the metastore. See the Hive configuration section of: https://ccp.cloudera.com/display/CDHDOC/Hive+Installation

    Also see:  https://cwiki.apache.org/confluence/display/Hive/AdminManual+MetastoreAdmin
  2. Set JAVA_HOME (for example, in /etc/hadoop/conf/hadoop-env.sh).
  3. There are several files you need to set up in order for HDFS/Hive to work, all located by default in /etc/hadoop/conf:
    1. core-site.xml: set hadoop.tmp.dir and fs.default.name.  For example:
        <property>
          <name>hadoop.tmp.dir</name>
          <value>/home/hadoop/hdfs-tmp/${user.name}</value>
          <description>A base for other temporary directories.</description>
        </property>
        <property>
          <name>fs.default.name</name>
          <value>hdfs://localhost:54310</value>
          <description>The name of the default file system.  A URI whose
          scheme and authority determine the FileSystem implementation.  The
          uri's scheme determines the config property (fs.SCHEME.impl) naming
          the FileSystem implementation class.  The uri's authority is used to
          determine the host, port, etc. for a filesystem.</description>
        </property>
    2. hdfs-site.xml: set dfs.name.dir and dfs.data.dir.  For example:
        <property>
          <name>dfs.name.dir</name>
          <value>/home/hadoop/name</value>
        </property>
        <property>
          <name>dfs.data.dir</name>
          <value>/home/hadoop/data,/home/hadoop/data1</value>
        </property>
    3. mapred-site.xml: set mapred.local.dir, mapred.tmp.dir, and mapred.job.tracker.  For example:
        <property>
          <name>mapred.local.dir</name>
          <value>/home/hadoop/mapred/local</value>
        </property>
        <property>
          <name>mapred.tmp.dir</name>
          <value>/home/hadoop/mapred/temp</value>
        </property>
        <property>
          <name>mapred.job.tracker</name>
          <value>localhost:9001</value>
        </property>
    4. For the directories specified in the XML configuration files above, ensure that the permissions are set correctly:
      1. hadoop.tmp.dir can have permission 1777 (sticky bit) and be owned by the hdfs user.
      2. dfs.name.dir and dfs.data.dir can be owned by the hdfs user and hadoop group, with permission 700.
      3. mapred.local.dir and mapred.tmp.dir can be owned by the mapred user and hadoop group.
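As a sketch, the directory layout and permissions from the steps above can be created like this. BASE is a scratch directory here for illustration; on a real node it would be /home/hadoop, and the commented chown commands (which require root) would assign ownership to the hdfs and mapred users:

```shell
#!/bin/sh
# Sketch of the directory/permission layout described above.
# BASE stands in for /home/hadoop so the commands are safe to try anywhere.
BASE=$(mktemp -d)

# hadoop.tmp.dir: world-writable with the sticky bit (1777)
mkdir -p "$BASE/hdfs-tmp"
chmod 1777 "$BASE/hdfs-tmp"
# sudo chown hdfs:hadoop "$BASE/hdfs-tmp"

# dfs.name.dir and dfs.data.dir: private to the hdfs user (700)
mkdir -p "$BASE/name" "$BASE/data" "$BASE/data1"
chmod 700 "$BASE/name" "$BASE/data" "$BASE/data1"
# sudo chown hdfs:hadoop "$BASE/name" "$BASE/data" "$BASE/data1"

# mapred.local.dir and mapred.tmp.dir: owned by mapred:hadoop
mkdir -p "$BASE/mapred/local" "$BASE/mapred/temp"
# sudo chown -R mapred:hadoop "$BASE/mapred"

ls -ld "$BASE/hdfs-tmp" "$BASE/name" "$BASE/mapred/local"
```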

Test installation

Some quick tests you can run to verify the installation:

Hive:


echo "a b c d" > /tmp/hivetest.txt
hadoop fs -mkdir /tmp/hivetest
hadoop fs -put /tmp/hivetest.txt /tmp/hivetest/
hive
create external table wkstable1 (one string, two string, three string, four string) row format delimited fields terminated by ' ' stored as textfile location '/tmp/hivetest';
show tables;
select two from wkstable1;
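For reference, column two of the test table is the second space-delimited field of the data file. The same extraction outside Hive, purely for comparison:

```shell
# Hive splits the test row on ' '; column `two` is the second field.
# awk performs the equivalent single-machine extraction.
echo "a b c d" | awk '{ print $2 }'   # prints: b
```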


Pig:


mkdir /tmp/pig
cp /etc/passwd /tmp/pig
cd /tmp/pig
pig -x local
A = load 'passwd' using PigStorage(':');
B = foreach A generate $0 as id;
dump B;
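The Pig script above is the distributed analogue of pulling the first colon-delimited field out of passwd. A single-machine check, for comparison only:

```shell
# PigStorage(':') splits each line on ':' and $0 is the first field,
# so B holds the username column of passwd. cut does the same thing locally:
cut -d: -f1 /etc/passwd | head -3
```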