Saturday, July 30, 2011

Jobsingle and seeking!

I'm happily jobsingle and I'm looking for the next perfect job.

I'd like to find a stealthy startup that expects huge amounts of traffic and has a clear business model that won't go bust when this bubble bursts. I'm interested in backend technology and in helping solve scaling problems, drawing on my wide experience in that area over the last 5 years.

http://linkedin.com/in/lookfirst

cheers,

jon

Why machine images don’t work

I just read a really good blog posting from RightScale on "Why machine images don’t work."

After investigating the way that AMIs are built and seeing how utterly difficult it was to build one, I've been trying to put into words my feelings about it. This article does a great job of describing it with four simple statements:
  • Images are too monolithic.
  • Images are opaque.
  • Images are too big.
  • Images are too static.
I have one additional thing that I would like to add:
  • Images are difficult to upgrade.
I quickly came to the conclusion that building Debian packages that can be quickly installed on any machine is a far better way to go. Not only are they easy to create, but you can integrate them into your continuous integration system so that every time someone commits code, a new package is built and added to a central repository. Updating the code on your machines in all of your environments is then as simple as 'aptitude update; aptitude safe-upgrade'.
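
For example, here is a minimal sketch of that CI step (the package name, repo path and 'lucid' distribution are placeholders, and reprepro is just one of several ways to manage the apt repository):

# build the binary package from the debian/ directory in the checkout
dpkg-buildpackage -us -uc -b

# publish the resulting .deb into a reprepro-managed apt repository
reprepro -b /srv/apt includedeb lucid ../myproject_1.0_amd64.deb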

I also think that tools like Fabric, Puppet and Chef (FPC) just add another layer of unnecessary complexity, which completely fails the KISS principle. You can do everything that FPC can do in a single Debian package, or break it out into multiple ones depending on how you want to set up the deployment hierarchy. Why install some other complicated piece of software (and all of its dependencies) with its own domain specific language when you can just write relatively simple bash shell scripts?

With my deployments, I like to set things up so that there is a 'base' Debian package, called project-init, which gets installed on every machine. It is responsible for creating users, accepting the JDK license agreement, setting the timezone of the machine and handling any other low-level OS settings that apply to all machines.
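
As a rough sketch of what that looks like (the user name and timezone below are made up), the project-init package's postinst maintainer script might do something like:

#!/bin/sh
set -e

# create an unprivileged user for the application to run as
adduser --system --group --no-create-home myapp

# pre-accept the Sun JDK license so installs stay non-interactive
echo "sun-java6-bin shared/accepted-sun-dlj-v1-1 boolean true" | debconf-set-selections

# pin the machine's timezone
echo "Etc/UTC" > /etc/timezone
dpkg-reconfigure -f noninteractive tzdata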

From there, everything gets layered on top of that base package. Some packages will be optional depending on which environment they go into. For example, you probably don't want or need to install Clarity on your production servers. If you need different package sets for specific environments (dev, staging, prod), you can use Debian virtual packages to create 'aliases' for the set of packages you want installed on each kind of system.
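
One lightweight way to build those aliases (sketched here with made-up package names, using the equivs tool instead of a full debian/ directory) is a metapackage that only declares dependencies:

# describe a 'production' metapackage that pulls in the real packages
cat > project-env-prod.cfg <<EOF
Package: project-env-prod
Depends: project-init, project-api, project-web
Provides: project-env
Description: everything a production box needs
EOF

# build the empty metapackage, then add it to your apt repository
equivs-build project-env-prod.cfg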

In the end, I know this system works really well. I've used it for one of the most complicated systems one can imagine, with 25+ different packages for all of the components that needed to be installed.

Wednesday, July 27, 2011

Easy Install HBase/Hadoop in Pseudo Distributed Mode

Introduction

This documentation should get you up and running quickly with a full pseudo distributed Hadoop/HBase installation in an Ubuntu VM. I use Ubuntu because Debian package management (apt) is by far the best way to install software on a machine. These instructions work just as well on real hardware.

You need this because the existing documentation is spread across quite a few different locations. I've already done the work of digging that information out so that you don't have to.

This documentation is intended to be read and used from top to bottom. Before you do an initial install, I suggest you read through it once first.

Create the virtual machine

The first thing that you will want to do is download a copy of the Ubuntu Server 10.04 64bit ISO image. This version is the current Long Term Support (LTS) version. These instructions may work with a newer version, but I'm suggesting the LTS because that is what I test with and also what your operations team will most likely want to install into production. Once you have the ISO, create a new virtual machine using your favorite VM manager (I like vmware fusion on my Mac).

Unix Box Setup

Once you have logged into the box, we need to set up some resources...

echo "deb http://archive.canonical.com/ lucid partner" > /etc/apt/sources.list.d/partner.list
echo "deb http://archive.cloudera.com/debian lucid-cdh3 contrib" >> /etc/apt/sources.list.d/cloudera.list
echo "deb-src http://archive.cloudera.com/debian lucid-cdh3 contrib" >> /etc/apt/sources.list.d/cloudera.list
echo "sun-java6-bin shared/accepted-sun-dlj-v1-1 boolean true" | debconf-set-selections
echo "hdfs  -       nofile  32768" >> /etc/security/limits.conf
echo "hbase  -       nofile  32768" >> /etc/security/limits.conf
echo "hdfs soft/hard nproc 32000" >> /etc/security/limits.conf
echo "hbase soft/hard nproc 32000" >> /etc/security/limits.conf
echo "session required  pam_limits.so" >> /etc/pam.d/common-session

aptitude install curl wget
curl -s http://archive.cloudera.com/debian/archive.key | sudo apt-key add -
aptitude update
aptitude install openssh-server ntp
aptitude install sun-java6-jdk
aptitude safe-upgrade
reboot now

You can now use ifconfig -a to find out the IP address of the virtual machine and log into it via ssh. You will want to execute most of the commands below as root.

LZO Compression

This setup provides LZO compression for your data in HBase, which greatly reduces the amount of data stored on disk. Sadly, LZO is under the GPL license, so it can't be distributed with Apache. Therefore, I'm providing a Debian package that I got ahold of for you to use. On your VM:

wget "https://github.com/lookfirst/fileshare/blob/master/Cloudera-hadoop-lzo_20110510102012.2bd0d5b-1_amd64.deb?raw=true"
dpkg -i Cloudera-hadoop-lzo_20110510102012.2bd0d5b-1_amd64.deb
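
Before moving on, it's worth peeking inside the package to confirm the jar and native libraries land where the HBase config further down expects them (/usr/lib/hadoop/lib). Roughly:

# list the package contents; you should see a cloudera-hadoop-lzo jar plus native .so files
dpkg -c Cloudera-hadoop-lzo_20110510102012.2bd0d5b-1_amd64.deb | grep -E 'native|\.jar'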

Hadoop / HDFS

Install some packages:
apt-get install hadoop-0.20
apt-get install hadoop-0.20-namenode hadoop-0.20-datanode hadoop-0.20-jobtracker hadoop-0.20-tasktracker
apt-get install hadoop-0.20-conf-pseudo

Edit some files, adding the following properties inside the <configuration> element of each file:

/etc/hadoop/conf/hdfs-site.xml
<property>
   <name>dfs.datanode.max.xcievers</name>
   <value>4096</value>
</property>
/etc/hadoop/conf/core-site.xml
<property>
   <name>io.compression.codecs</name>
   <value>org.apache.hadoop.io.compress.GzipCodec,org.apache.hadoop.io.compress.DefaultCodec,com.hadoop.compression.lzo.LzoCodec,com.hadoop.compression.lzo.LzopCodec,org.apache.hadoop.io.compress.BZip2Codec</value>
</property>
 
<property>
   <name>io.compression.codec.lzo.class</name>
   <value>com.hadoop.compression.lzo.LzoCodec</value>
</property>
/etc/hadoop/conf/mapred-site.xml
<property>
   <name>mapred.compress.map.output</name>
   <value>true</value>
 </property>
 
 <property>
   <name>mapred.map.output.compression.codec</name>
   <value>com.hadoop.compression.lzo.LzoCodec</value>
 </property>
 
 <property>
   <name>mapred.child.ulimit</name>
   <value>1835008</value>
 </property>
   
 <property>
   <name>mapred.tasktracker.map.tasks.maximum</name>
   <value>2</value>
 </property>

 <property>
   <name>mapred.tasktracker.reduce.tasks.maximum</name>
   <value>2</value>
 </property>

ZooKeeper
apt-get install hadoop-zookeeper-server
Edit /etc/zookeeper/zoo.cfg:
  • Change localhost to 127.0.0.1
  • Add: maxClientCnxns=0
service hadoop-zookeeper-server restart
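
A quick sanity check that ZooKeeper is answering (assuming nc/netcat is installed) is the 'ruok' four letter command; it should print 'imok':

echo ruok | nc 127.0.0.1 2181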

HDFS/HBase Setup

Make an /hbase folder in hdfs
sudo -u hdfs hadoop fs -mkdir /hbase
sudo -u hdfs hadoop fs -chown hbase /hbase
NOTE: If you want to delete an existing hbase folder, first stop hbase!
sudo -u hdfs hadoop fs -rmr -skipTrash /hbase

HBase Installation
apt-get install hadoop-hbase
apt-get install hadoop-hbase-master

/etc/hbase/conf/hbase-site.xml
<property>
   <name>hbase.cluster.distributed</name>
   <value>true</value>
</property>
<property>
   <name>hbase.rootdir</name>
   <value>hdfs://localhost/hbase</value>
</property>

/etc/hbase/conf/hbase-env.sh
export HBASE_CLASSPATH=`ls /usr/lib/hadoop/lib/cloudera-hadoop-lzo-*.jar`
export HBASE_MANAGES_ZK=false
export HBASE_LIBRARY_PATH=/usr/lib/hadoop/lib/native/Linux-amd64-64

/etc/hadoop/conf/hadoop-env.sh
export HADOOP_CLASSPATH="$HADOOP_CLASSPATH":`hbase classpath`

Now, restart the master and start the region server:
service hadoop-hbase-master restart
apt-get install hadoop-hbase-regionserver
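
Before going further, it's worth confirming that HBase can actually load the LZO codec. Something along these lines (the path is arbitrary) should finish cleanly rather than throw a native-library error:

# HBase ships a small compression sanity check utility
hbase org.apache.hadoop.hbase.util.CompressionTest file:///tmp/compression-test lzo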

Starting/Stopping everything

Start
  • service hadoop-zookeeper-server start
  • for service in /etc/init.d/hadoop-0.20-*; do sudo $service start; done
  • service hadoop-hbase-master start
  • service hadoop-hbase-regionserver start
Stop
  • service hadoop-hbase-regionserver stop
  • service hadoop-hbase-master stop
  • for service in /etc/init.d/hadoop-0.20-*; do sudo $service stop; done
  • service hadoop-zookeeper-server stop

HBase Shell
su - hbase
hbase shell
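
From the shell prompt, a small smoke test (the table and column family names are arbitrary) confirms the master, region server and LZO codec are all wired together:

create 'smoke_test', {NAME => 'cf', COMPRESSION => 'LZO'}
put 'smoke_test', 'row1', 'cf:greeting', 'hello'
scan 'smoke_test'
disable 'smoke_test'
drop 'smoke_test'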

Ports

To ensure that everything is working correctly, visit your VM's IP address on each of these ports in a browser (or script the check as shown below).
  • HDFS: 50070
  • JobTracker: 50030
  • TaskTracker: 50060
  • HBase Master: 60010
  • HBase RegionServer: 60030
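
If you'd rather check from your host machine's terminal, something like this (substitute your VM's IP address) should show each port answering over HTTP:

# follow redirects so each status page reports a final 200
VM_IP=192.168.1.100
for port in 50070 50030 50060 60010 60030; do
  curl -s -L -o /dev/null -w "$port -> HTTP %{http_code}\n" "http://$VM_IP:$port/"
done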

Sunday, July 24, 2011

Chickens

I added a new wing to the coop today. It's officially a doublewide now.


Friday, July 22, 2011

Lion removes Java?

Yes it does! Java really isn't installed this time. So, just open up Terminal.app and type 'java'. It will automatically install from there.

My guess is that there is some sort of Oracle licensing agreement that prevents Apple from distributing Java with the release. You'd think that these two massive public companies would be able to work something out.

Once you have done that, if you are a developer, you will need to fix where the source code is:
  1. Go to http://connect.apple.com and download: Java for Mac OS X 10.7 Developer Package
  2. Install it.
  3. Open a Terminal.app window
  4. cd /System/Library/Frameworks/JavaVM.framework/Home
  5. sudo ln -s /Library/Java/JavaVirtualMachines/1.6.0_26-b03-383.jdk/Contents/Home/src.jar .
  6. sudo ln -s /Library/Java/JavaVirtualMachines/1.6.0_26-b03-383.jdk/Contents/Home/docs.jar .
  7. sudo ln -s /Library/Java/JavaVirtualMachines/1.6.0_26-b03-383.jdk/Contents/Home/appledocs.jar .
p.s. The version number seems to have gone down (384 to 383) with 10.7 vs. 10.6.8 and there is now an appledocs.jar that I didn't notice before.

p.p.s. This fixes apps like 'Network Connect' which depend on Java being installed.

p.p.p.s. Check out Similarity.com. Developed on a Mac, using Java, running on Google App Engine.

Thursday, July 21, 2011

jmxtrans - speaking engagement

I did a fun short talk tonight for the SF Bay Area Large-Scale Production Engineering group at the Yahoo! campus on my little open source Java monitoring project called jmxtrans.

Skip to 21:45.

Sorry, I was a bit nervous, as it was my first talk in front of so many people in a long time.

Monday, July 18, 2011

HBase MultiTableOutputFormat writing to multiple tables in one Map Reduce Job

Recently, I've been having a lot of fun learning about HBase and Hadoop. One esoteric thing I just learned about is the way that HBase tables are populated.

By default, HBase Map Reduce jobs can only write to a single table, because you set the output handler at the job level with job.setOutputFormatClass(). However, if you are creating an HBase table, chances are you will also want to build an index related to that table so that you can do fast queries against the master table. The optimal way to do this is to write the data to both tables at the same time while you import the data. The alternative is to write another M/R job that builds the index after the fact, but that means reading all of the data twice, which puts a lot of extra load on the system for no real benefit. To write to both tables at the same time, in the same M/R job, you need to take advantage of the MultiTableOutputFormat class. The key point is that when you write to the context, you specify the name of the table you are writing to. Here is some basic example code (with a lot of the meat removed) which demonstrates this.

static class TsvImporter extends Mapper<LongWritable, Text, ImmutableBytesWritable, Put> {
	@Override
	public void map(LongWritable offset, Text value, Context context) throws IOException, InterruptedException {
		// contains the line of tab separated data we are working on (needs to be parsed out).
		byte[] lineBytes = value.getBytes();

		// rowKey is the hbase rowKey generated from lineBytes
		Put put = new Put(rowKey);
		// add your KeyValue object(s) to the Put
		put.add(kv);
		// the ImmutableBytesWritable key tells MultiTableOutputFormat which table to write to
		context.write(new ImmutableBytesWritable(Bytes.toBytes("actions")), put); // write to the actions table

		// rowKey2 is the hbase rowKey for the index row
		Put indexPut = new Put(rowKey2);
		// add your KeyValue object(s) to the index Put
		indexPut.add(kv);
		context.write(new ImmutableBytesWritable(Bytes.toBytes("actions_index")), indexPut); // write to the actions_index table
	}
}

public static Job createSubmittableJob(Configuration conf, String[] args) throws IOException {
	String pathStr = args[0];
	Path inputDir = new Path(pathStr);
	Job job = new Job(conf, "my_custom_job");
	job.setJarByClass(TsvImporter.class);
	FileInputFormat.setInputPaths(job, inputDir);
	job.setInputFormatClass(TextInputFormat.class);
	
	// this is the key to writing to multiple tables in hbase
	job.setOutputFormatClass(MultiTableOutputFormat.class);
	job.setMapperClass(TsvImporter.class);
	job.setNumReduceTasks(0);

	TableMapReduceUtil.addDependencyJars(job);
	TableMapReduceUtil.addDependencyJars(job.getConfiguration());
	return job;
}