Wednesday, July 27, 2011

Easy Install HBase/Hadoop in Pseudo Distributed Mode

Introduction

This documentation should get you up and running quickly with a full pseudo-distributed Hadoop/HBase installation in an Ubuntu VM. I use Ubuntu because Debian package management (apt) is by far the best way to install software on a machine. These instructions also work on regular hardware.

You will want this because the existing documentation is spread across quite a few different locations. I've already done the work of digging that information out so that you don't have to.

This documentation is intended to be read and used from top to bottom. Before you do an initial install, I suggest you read through it once first.

Reference Manuals

Create the virtual machine

The first thing you will want to do is download a copy of the Ubuntu Server 10.04 64-bit ISO image. This version is the current Long Term Support (LTS) release. These instructions may work with a newer version, but I suggest the LTS because it is what I test with and also what your operations team will most likely want to install into production. Once you have the ISO, create a new virtual machine using your favorite VM manager (I like VMware Fusion on my Mac).

Unix Box Setup

Once you have logged into the box, we need to set up some resources...

echo "deb http://archive.canonical.com/ lucid partner" > /etc/apt/sources.list.d/partner.list
echo "deb http://archive.cloudera.com/debian lucid-cdh3 contrib" >> /etc/apt/sources.list.d/cloudera.list
echo "deb-src http://archive.cloudera.com/debian lucid-cdh3 contrib" >> /etc/apt/sources.list.d/cloudera.list
echo "sun-java6-bin shared/accepted-sun-dlj-v1-1 boolean true" | debconf-set-selections
echo "hdfs  -       nofile  32768" >> /etc/security/limits.conf
echo "hbase  -       nofile  32768" >> /etc/security/limits.conf
echo "hdfs soft/hard nproc 32000" >> /etc/security/limits.conf
echo "hbase soft/hard nproc 32000" >> /etc/security/limits.conf
echo "session required  pam_limits.so" >> /etc/pam.d/common-session

aptitude install curl wget
curl -s http://archive.cloudera.com/debian/archive.key | sudo apt-key add -
aptitude update
aptitude install openssh-server ntp
aptitude install sun-java6-jdk
aptitude safe-upgrade
reboot now

You can now use ifconfig -a to find out the IP address of the virtual machine and log into it via ssh. You will want to execute most of the commands below as root.
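For example (the address and username here are just placeholders; use whatever your VM actually reports and whichever account you created during the install):

ifconfig -a | grep "inet addr"
ssh youruser@192.168.1.123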

LZO Compression

This setup provides LZO compression for your data in HBase, which greatly reduces the amount of data stored on disk. Sadly, LZO is under the GPL license, so it can't be distributed with Apache. Therefore, I'm providing a Debian package that I got ahold of for you to use. On your VM:

wget "https://github.com/lookfirst/fileshare/blob/master/Cloudera-hadoop-lzo_20110510102012.2bd0d5b-1_amd64.deb?raw=true"
dpkg -i Cloudera-hadoop-lzo_20110510102012.2bd0d5b-1_amd64.deb
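Once the Hadoop packages in the next section are installed, it's worth a quick sanity check that the jar and native library referenced later in hbase-env.sh actually ended up in place:

ls /usr/lib/hadoop/lib/cloudera-hadoop-lzo-*.jar
ls /usr/lib/hadoop/lib/native/Linux-amd64-64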

Hadoop / HDFS

Install some packages:
apt-get install hadoop-0.20
apt-get install hadoop-0.20-namenode hadoop-0.20-datanode hadoop-0.20-jobtracker hadoop-0.20-tasktracker
apt-get install hadoop-0.20-conf-pseudo

Edit the following files, adding these properties inside the existing <configuration> element of each:

/etc/hadoop/conf/hdfs-site.xml
<property>
   <name>dfs.datanode.max.xcievers</name>
   <value>4096</value>
</property>
/etc/hadoop/conf/core-site.xml
<property>
   <name>io.compression.codecs</name>
   <value>org.apache.hadoop.io.compress.GzipCodec,org.apache.hadoop.io.compress.DefaultCodec,com.hadoop.compression.lzo.LzoCodec,com.hadoop.compression.lzo.LzopCodec,org.apache.hadoop.io.compress.BZip2Codec</value>
</property>
 
<property>
   <name>io.compression.codec.lzo.class</name>
   <value>com.hadoop.compression.lzo.LzoCodec</value>
</property>
/etc/hadoop/conf/mapred-site.xml
<property>
   <name>mapred.compress.map.output</name>
   <value>true</value>
 </property>
 
 <property>
   <name>mapred.map.output.compression.codec</name>
   <value>com.hadoop.compression.lzo.LzoCodec</value>
 </property>
 
 <property>
   <name>mapred.child.ulimit</name>
   <value>1835008</value>
 </property>
   
 <property>
   <name>mapred.tasktracker.map.tasks.maximum</name>
   <value>2</value>
 </property>

 <property>
   <name>mapred.tasktracker.reduce.tasks.maximum</name>
   <value>2</value>
 </property>
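After editing these files, restart the Hadoop daemons so the new settings take effect (this is the same loop used in the Starting/Stopping section below):

for service in /etc/init.d/hadoop-0.20-*; do sudo $service restart; done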

ZooKeeper
apt-get install hadoop-zookeeper-server
/etc/zookeeper/zoo.cfg
Change localhost to 127.0.0.1
Add: maxClientCnxns=0
service hadoop-zookeeper-server restart
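To verify that ZooKeeper is answering, you can poke it with its four-letter "ruok" command (install netcat if nc isn't already present); it should reply with imok:

echo ruok | nc 127.0.0.1 2181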

HDFS/HBase Setup

Make an /hbase folder in HDFS:
sudo -u hdfs hadoop fs -mkdir /hbase
sudo -u hdfs hadoop fs -chown hbase /hbase
NOTE: If you want to delete an existing hbase folder, first stop hbase!
sudo -u hdfs hadoop fs -rmr -skipTrash /hbase
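You can confirm that the folder exists and is owned by the hbase user with:

sudo -u hdfs hadoop fs -ls /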

HBase Installation
apt-get install hadoop-hbase
apt-get install hadoop-hbase-master

/etc/hbase/conf/hbase-site.xml
<property>
   <name>hbase.cluster.distributed</name>
   <value>true</value>
</property>
<property>
   <name>hbase.rootdir</name>
   <value>hdfs://localhost/hbase</value>
</property>

/etc/hbase/conf/hbase-env.sh
export HBASE_CLASSPATH=`ls /usr/lib/hadoop/lib/cloudera-hadoop-lzo-*.jar`
export HBASE_MANAGES_ZK=false
export HBASE_LIBRARY_PATH=/usr/lib/hadoop/lib/native/Linux-amd64-64

/etc/hadoop/conf/hadoop-env.sh
export HADOOP_CLASSPATH="$HADOOP_CLASSPATH":`hbase classpath`
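As a quick sanity check that the LZO jar from hbase-env.sh is actually being picked up (and, via the line above, also lands on Hadoop's classpath), you can run:

hbase classpath | tr ':' '\n' | grep lzo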

Now, restart the master and start the region server:
service hadoop-hbase-master restart
apt-get install hadoop-hbase-regionserver
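At this point, running jps as root (so it can see the daemons owned by the hdfs, mapred and hbase users) should list something close to: a NameNode, DataNode, JobTracker, TaskTracker, the ZooKeeper QuorumPeerMain, an HMaster and an HRegionServer (plus possibly a SecondaryNameNode).

sudo jps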

Starting/Stopping everything

Start
  • service hadoop-zookeeper-server start
  • for service in /etc/init.d/hadoop-0.20-*; do sudo $service start; done
  • service hadoop-hbase-master start
  • service hadoop-hbase-regionserver start
Stop
  • service hadoop-hbase-regionserver stop
  • service hadoop-hbase-master stop
  • for service in /etc/init.d/hadoop-0.20-*; do sudo $service stop; done
  • service hadoop-zookeeper-server stop
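If you find yourself running those two lists often, a tiny wrapper script saves typing. This is just a convenience sketch (the script name is arbitrary); run it as root:

#!/bin/sh
# cluster.sh start|stop -- bring the whole pseudo-distributed stack up or down in the right order
case "$1" in
  start)
    service hadoop-zookeeper-server start
    for s in /etc/init.d/hadoop-0.20-*; do $s start; done
    service hadoop-hbase-master start
    service hadoop-hbase-regionserver start
    ;;
  stop)
    service hadoop-hbase-regionserver stop
    service hadoop-hbase-master stop
    for s in /etc/init.d/hadoop-0.20-*; do $s stop; done
    service hadoop-zookeeper-server stop
    ;;
  *)
    echo "usage: $0 start|stop" >&2
    exit 1
    ;;
esac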

HBase Shell
su - hbase
hbase shell
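A quick smoke test from inside the shell (the table and column family names are just examples):

status
create 'test', 'cf'
put 'test', 'row1', 'cf:a', 'value1'
scan 'test'
disable 'test'
drop 'test'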

Ports

To ensure that everything is working correctly, visit your VM's IP address in a browser on each of these ports (a command-line version of the same check follows the list).
  • HDFS: 50070
  • JobTracker: 50030
  • TaskTracker: 50060
  • HBase Master: 60010
  • HBase RegionServer: 60030
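If you prefer the command line, this loop (run on the VM itself; swap in the VM's IP if you run it from your host machine) should print a 200 or a redirect code for each daemon:

for port in 50070 50030 50060 60010 60030; do printf "%s: " $port; curl -s -o /dev/null -w "%{http_code}\n" http://localhost:$port/; done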

13 comments:

Ahmed Kamal said...

Thanks for this tutorial. I've discovered an even faster way to run Hadoop: I could set up a multi-node cluster in under one minute using Ubuntu's Ensemble

Check the video at
http://cloud.ubuntu.com/2011/08/ensemble-meets-hadoop-on-the-cloud/

and more technical details at
http://cloud.ubuntu.com/2011/08/hadoop-cluster-with-ubuntu-server-and-ensemble/

Jon Scott Stevens said...

Ensemble looks cool. However, that example fails at a couple of things that my documentation covers.

a) I prefer the Cloudera distribution of Hadoop.
b) That example of Ensemble doesn't edit the configuration files to add things like the lzo compression.

So, while you can get something up and running with that example, it isn't actually what you will need.

Obviously, extending that example to do what I've done would make a great blog posting for someone.

Ray V. said...

I just set this up using a clean Natty install on a VM. One additional snag I ran into is explained here.

The fix amounts to doing this in /etc/hosts:
#127.0.1.1 ubuntu
127.0.0.1 ubuntu

Roy said...

Thanks for a great HOWTO -- I got stuck right after:

$ sudo -u hdfs hadoop fs -mkdir /hbase

Getting the following:

11/12/06 13:06:10 INFO ipc.Client: Retrying connect to server: localhost/127.0.0.1:8020. Already tried 0 time(s).
11/12/06 13:06:11 INFO ipc.Client: Retrying connect to server: localhost/127.0.0.1:8020. Already tried 1 time(s).
11/12/06 13:06:12 INFO ipc.Client: Retrying connect to server: localhost/127.0.0.1:8020. Already tried 2 time(s).
11/12/06 13:06:13 INFO ipc.Client: Retrying connect to server: localhost/127.0.0.1:8020. Already tried 3 time(s).
11/12/06 13:06:14 INFO ipc.Client: Retrying connect to server: localhost/127.0.0.1:8020. Already tried 4 time(s).
11/12/06 13:06:15 INFO ipc.Client: Retrying connect to server: localhost/127.0.0.1:8020. Already tried 5 time(s).
11/12/06 13:06:16 INFO ipc.Client: Retrying connect to server: localhost/127.0.0.1:8020. Already tried 6 time(s).
11/12/06 13:06:17 INFO ipc.Client: Retrying connect to server: localhost/127.0.0.1:8020. Already tried 7 time(s).
11/12/06 13:06:18 INFO ipc.Client: Retrying connect to server: localhost/127.0.0.1:8020. Already tried 8 time(s).
11/12/06 13:06:19 INFO ipc.Client: Retrying connect to server: localhost/127.0.0.1:8020. Already tried 9 time(s).
Bad connection to FS. command aborted. exception: Call to localhost/127.0.0.1:8020 failed on connection exception: java.net.ConnectException: Connection refused

Appreciate any insights

Jon Scott Stevens said...

What should be running on port 8020? Sounds like you aren't using the Cloudera packages. Grep through the various config files in /etc to see what is set to that port and make sure it is running.

Fred said...

Hi Jon,

First thank you for this nice tutorial.
After configuring pseudo-distributed mode, I have my hbase shell running as you mention and I can visit all the web pages, so it seems that the service is up.

Now I am trying to access this hbase through datanucleus, and I think I made a mistake in my configuration to access it, because it generates an exception like MasterNotFoundException + ConnectionRefused.

Do you have any ideas where it could come from?

Thank you very much,
I will really appreciate your help on that.

Best!

Fred

Jon Stevens said...

Sorry Fred, I don't know!

Fred said...

Thanks for answering!
I dug into the problem a bit and it seems that from "outside" my server I can't access port 60000... connection refused. I have an Ubuntu 10.04 LTS distro. Do you have any idea? I want to mention that:
NETSTAT:
netstat -a | grep 60000
tcp 0 0 ip-97-74-115-18.i:60000 *:* LISTEN
tcp 0 0 ip-97-74-115-18.i:34215 ip-97-74-115-18.i:60000 ESTABLISHED
tcp 0 0 ip-97-74-115-18.i:60000 ip-97-74-115-18.i:34215 ESTABLISHED

AND my firewall is deactivated

Jon Stevens said...

Most likely, everything is configured to listen on localhost or 127.0.0.1. You may want to try to have it listen on 0.0.0.0, or set up a firewall rule to forward all requests for 0.0.0.0:60000 to 127.0.0.1:60000

schrilax said...

i am encountering the following error when i try to use the hbase shell.

ERROR: org.apache.hadoop.hbase.NotAllMetaRegionsOnlineException: org.apache.hadoop.hbase.NotAllMetaRegionsOnlineException: Timed out (10000ms)

this is how my zoo.cfg looks like
# The number of milliseconds of each tick
tickTime=2000
# The number of ticks that the initial
# synchronization phase can take
initLimit=10
# The number of ticks that can pass between
# sending a request and getting an acknowledgement
syncLimit=5
# the directory where the snapshot is stored.
dataDir=/var/zookeeper
# the port at which the clients will connect
clientPort=2181
server.0=127.0.0.1:2888:3888
maxClientCnxns=0

this is how my hosts file looks like:

127.0.0.1 localhost
127.0.1.1 skrilax-VirtualBox

# The following lines are desirable for IPv6 capable hosts
::1 ip6-localhost ip6-loopback
fe00::0 ip6-localnet
ff00::0 ip6-mcastprefix
ff02::1 ip6-allnodes
ff02::2 ip6-allrouters

i am pretty sure it is a dns resolution issue but not sure what is the problem.

Shantanu Kumar said...

Since Oracle Java is not available anymore, users can install Java thusly:

sudo add-apt-repository ppa:ferramroberto/java && sudo apt-get update

sudo apt-get install sun-java6-jdk

Swati Verma said...

I am new to HBase, and while trying to install it on an Ubuntu system, I am facing a problem.

Below is the error log from Zookeeper log file

2014-01-18 06:10:51,392 WARN org.apache.zookeeper.server.NIOServerCnxn: caught end of stream exception
EndOfStreamException: Unable to read additional data from client sessionid 0x143a5b052980000, likely client has closed socket
    at org.apache.zookeeper.server.NIOServerCnxn.doIO(NIOServerCnxn.java:220)
    at org.apache.zookeeper.server.NIOServerCnxnFactory.run(NIOServerCnxnFactory.java:208)
    at java.lang.Thread.run(Thread.java:744)
2014-01-18 06:10:51,394 INFO org.apache.zookeeper.server.NIOServerCnxn: Closed socket connection for client /127.0.0.1:56671 which had sessionid 0x143a5b052980000

Below is error log from master log:

2014-01-18 06:10:51,381 INFO org.apache.zookeeper.ZooKeeper: Session: 0x143a5b052980000 closed
2014-01-18 06:10:51,381 INFO org.apache.hadoop.hbase.master.HMaster: HMaster main thread exiting
2014-01-18 06:10:51,381 ERROR org.apache.hadoop.hbase.master.HMasterCommandLine: Failed to start master
java.lang.RuntimeException: HMaster Aborted
    at org.apache.hadoop.hbase.master.HMasterCommandLine.startMaster(HMasterCommandLine.java:160)
    at org.apache.hadoop.hbase.master.HMasterCommandLine.run(HMasterCommandLine.java:104)
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
    at org.apache.hadoop.hbase.util.ServerCommandLine.doMain(ServerCommandLine.java:76)
    at org.apache.hadoop.hbase.master.HMaster.main(HMaster.java:2120)

Please note, I am able to start HBase successfully. I mean after starting HBase, I am able to see HMaster running using the jps command. But as soon as I try to go to the HBase shell, this issue arises, and then when I run jps again, I no longer find HMaster in the list.

Please help me with this issue; I have been trying to solve it by myself for the last few days, but no luck. Please help

Jhon David said...
This comment has been removed by a blog administrator.