I'm happily job-single and looking for the next perfect job.
I'd like to find a stealth startup that expects huge amounts of traffic and has a clear business model that won't go bust when this bubble bursts. I'm interested in backend technology and in helping solve scaling problems, based on my wide experience working in that area over the last 5 years.
http://linkedin.com/in/lookfirst
cheers,
jon
Saturday, July 30, 2011
Why machine images don’t work
I just read a really good blog post from RightScale on "Why machine images don’t work."
After investigating the way that AMIs are built and seeing how utterly difficult it is to build one, I've been trying to put my feelings about it into words. This article does a great job of describing the problem with four simple statements:
- Images are too monolithic.
- Images are opaque.
- Images are too big.
- Images are too static.
I have one additional thing that I would like to add:
- Images are difficult to upgrade.
I quickly came to the conclusion that building Debian packages that can be quickly installed on any machine is a far better way to go. Not only are they easy to create, but you can integrate them into your continuous integration system so that every time someone commits code, a new package is built and added to a central repository. Updating the code on your machines, in all of your environments, is then as simple as 'aptitude update; aptitude safe-upgrade'.
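For concreteness, here is a rough sketch of the kind of CI build step I mean. The package name, jar path and the reprepro-managed apt repository are placeholders for illustration, not details from my actual setup:

#!/bin/bash
# Sketch: build a .deb on every commit and push it into a central apt repo.
# myapp, target/myapp.jar and /srv/apt are hypothetical names.
set -e
VERSION="1.0.${BUILD_NUMBER:-0}"          # tie the version to the CI build number
PKGDIR="build/myapp_${VERSION}"

# Lay out the package contents plus a minimal control file.
mkdir -p "${PKGDIR}/DEBIAN" "${PKGDIR}/opt/myapp"
cp target/myapp.jar "${PKGDIR}/opt/myapp/"
cat > "${PKGDIR}/DEBIAN/control" <<EOF
Package: myapp
Version: ${VERSION}
Architecture: amd64
Maintainer: you@example.com
Description: myapp built from commit ${GIT_COMMIT:-unknown}
EOF

# Build the .deb and drop it into the repository that all the machines point at.
dpkg-deb --build "${PKGDIR}"
reprepro -b /srv/apt includedeb stable "${PKGDIR}.deb"

After that, every machine picks up the new build with the same 'aptitude update; aptitude safe-upgrade' as above.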
I also think that tools like Fabric, Puppet and Chef (FPC) just add another layer of unnecessary complexity, which completely fails the KISS principle. You can do everything that FPC can do in a single Debian package, or break it out into multiple ones depending on how you want to set up the deployment hierarchy. Why install some other complicated piece of software (and all of its dependencies), with its own domain specific language, when you can just write relatively simple bash shell scripts?
With my deployments, I like to set things up so that there is a 'base' Debian package, called project-init, which gets installed everywhere. It is responsible for creating users, accepting the JDK license agreement, setting the machine's timezone, and any other low-level OS settings that apply to all machines.
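As a rough sketch, the postinst script of such a package might look something like this; the user name, timezone and settings here are placeholders rather than my real values:

#!/bin/sh
# postinst sketch for a hypothetical project-init package.
set -e
case "$1" in
    configure)
        # Create an application user if it does not exist yet.
        if ! getent passwd myapp >/dev/null; then
            adduser --system --group --home /opt/myapp myapp
        fi
        # Pre-accept the Sun JDK license so installs stay non-interactive.
        echo "sun-java6-bin shared/accepted-sun-dlj-v1-1 boolean true" | debconf-set-selections
        # Pin the machine to UTC (placeholder timezone).
        echo "Etc/UTC" > /etc/timezone
        dpkg-reconfigure -f noninteractive tzdata
        ;;
esac
exit 0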
From there, everything gets layered on top of that base package. Some packages will be optional depending on which environment they go into; for example, you probably don't want or need to install Clarity on your production servers. If you need packages for specific environments (dev, staging, prod), you can use Debian virtual packages to create 'aliases' for the set of packages you want installed in each one.
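A sketch of what those stanzas could look like in debian/control; every name here is made up for illustration:

Package: project-env-prod
Architecture: all
Depends: project-init, project-webapp
Provides: project-env
Description: everything a production box needs

Package: project-env-dev
Architecture: all
Depends: project-init, project-webapp, project-clarity
Provides: project-env
Description: everything a dev box needs, including Clarity

Installing the right metapackage ('aptitude install project-env-prod') then pulls in exactly the set of packages that environment should have, and anything that only cares that 'some environment package' is present can depend on the virtual project-env.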
In the end, I know this system works really well. I've done it for one of the most complicated systems one can imagine with 25+ different packages for all of the components that needed to be installed.
Wednesday, July 27, 2011
Easy Install HBase/Hadoop in Pseudo Distributed Mode
Introduction
This documentation should get you up and running quickly with a full pseudo-distributed Hadoop/HBase installation in an Ubuntu VM. I use Ubuntu because Debian package management (apt) is by far the best way to install software on a machine. You can also use these instructions on real hardware.
You will want this because the existing documentation is spread across quite a few different locations. I've already done the work of digging that information out so that you don't have to.
This documentation is intended to be read and used from top to bottom. Before you do an initial install, I suggest you read through it once first.
Reference Manuals
- https://ccp.cloudera.com/display/CDHDOC/CDH3+Installation
- https://ccp.cloudera.com/display/CDHDOC/CDH3+Deployment+in+Pseudo-Distributed+Mode
- https://ccp.cloudera.com/display/CDHDOC/ZooKeeper+Installation
- https://ccp.cloudera.com/display/CDHDOC/HBase+Installation
- http://hadoop.apache.org/common/docs/r0.20.2/quickstart.html#PseudoDistributed
- http://hbase.apache.org/book.html
- http://hbase.apache.org/pseudo-distributed.html
Create the virtual machine
The first thing that you will want to do is download a copy of the Ubuntu Server 10.04 64bit ISO image. This version is the current Long Term Support (LTS) release. These instructions may work with a newer version, but I'm suggesting the LTS because that is what I test with and also what your operations team will most likely want to install in production. Once you have the ISO, create a new virtual machine using your favorite VM manager (I like VMware Fusion on my Mac).
Unix Box Setup
Once you have logged into the box, we need to set up some resources...
echo "deb http://archive.canonical.com/ lucid partner" > /etc/apt/sources.list.d/partner.list echo "deb http://archive.cloudera.com/debian lucid-cdh3 contrib" >> /etc/apt/sources.list.d/cloudera.list echo "deb-src http://archive.cloudera.com/debian lucid-cdh3 contrib" >> /etc/apt/sources.list.d/cloudera.list echo "sun-java6-bin shared/accepted-sun-dlj-v1-1 boolean true" | debconf-set-selections echo "hdfs - nofile 32768" >> /etc/security/limits.conf echo "hbase - nofile 32768" >> /etc/security/limits.conf echo "hdfs soft/hard nproc 32000" >> /etc/security/limits.conf echo "hbase soft/hard nproc 32000" >> /etc/security/limits.conf echo "session required pam_limits.so" >> /etc/pam.d/common-session aptitude install curl wget curl -s http://archive.cloudera.com/debian/archive.key | sudo apt-key add - aptitude update aptitude install openssh-server ntp aptitude install sun-java6-jdk aptitude safe-upgrade reboot now
You can now use ifconfig -a to find out the IP address of the virtual machine and log into it via ssh. You will want to execute most of the commands below as root.
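For example (the user name and IP address below are just examples; substitute whatever your VM actually reports):

ifconfig -a                  # note the inet addr of eth0
ssh youruser@192.168.1.50    # your VM's user and IP address will differ
sudo -i                      # most of the commands below are run as root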
LZO Compression
This setup provides LZO compression for your data in HBase, which greatly reduces the amount of data stored on disk. Sadly, LZO is under the GPL license, so it can't be distributed with Apache. Therefore, I'm providing a nice Debian package that I got ahold of for you to use. On your VM:
wget "https://github.com/lookfirst/fileshare/blob/master/Cloudera-hadoop-lzo_20110510102012.2bd0d5b-1_amd64.deb?raw=true" dpkg -i Cloudera-hadoop-lzo_20110510102012.2bd0d5b-1_amd64.deb
Hadoop / HDFS
Install some packages:
apt-get install hadoop-0.20
apt-get install hadoop-0.20-namenode hadoop-0.20-datanode hadoop-0.20-jobtracker hadoop-0.20-tasktracker
apt-get install hadoop-0.20-conf-pseudo
Edit some files:
/etc/hadoop/conf/hdfs-site.xml
<property>
  <name>dfs.datanode.max.xcievers</name>
  <value>4096</value>
</property>
/etc/hadoop/conf/core-site.xml
<property>
  <name>io.compression.codecs</name>
  <value>org.apache.hadoop.io.compress.GzipCodec,org.apache.hadoop.io.compress.DefaultCodec,com.hadoop.compression.lzo.LzoCodec,com.hadoop.compression.lzo.LzopCodec,org.apache.hadoop.io.compress.BZip2Codec</value>
</property>
<property>
  <name>io.compression.codec.lzo.class</name>
  <value>com.hadoop.compression.lzo.LzoCodec</value>
</property>
/etc/hadoop/conf/mapred-site.xml
<property>
  <name>mapred.compress.map.output</name>
  <value>true</value>
</property>
<property>
  <name>mapred.map.output.compression.codec</name>
  <value>com.hadoop.compression.lzo.LzoCodec</value>
</property>
<property>
  <name>mapred.child.ulimit</name>
  <value>1835008</value>
</property>
<property>
  <name>mapred.tasktracker.map.tasks.maximum</name>
  <value>2</value>
</property>
<property>
  <name>mapred.tasktracker.reduce.tasks.maximum</name>
  <value>2</value>
</property>
ZooKeeper
apt-get install hadoop-zookeeper-server
/etc/zookeeper/zoo.cfg
Change localhost to 127.0.0.1 and add: maxClientCnxns=0
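If you would rather script that edit, something along these lines should work; double-check the file afterwards, since the stock contents can differ between CDH3 releases:

sed -i 's/localhost/127.0.0.1/g' /etc/zookeeper/zoo.cfg
echo "maxClientCnxns=0" >> /etc/zookeeper/zoo.cfg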
service hadoop-zookeeper-server restart
HDFS/HBase Setup
Make an /hbase folder in hdfs
sudo -u hdfs hadoop fs -mkdir /hbase
sudo -u hdfs hadoop fs -chown hbase /hbase
NOTE: If you want to delete an existing /hbase folder, first stop HBase! Then:
sudo -u hdfs hadoop fs -rmr -skipTrash /hbase
HBase Installation
apt-get install hadoop-hbase
apt-get install hadoop-hbase-master
/etc/hbase/conf/hbase-site.xml
<property>
  <name>hbase.cluster.distributed</name>
  <value>true</value>
</property>
<property>
  <name>hbase.rootdir</name>
  <value>hdfs://localhost/hbase</value>
</property>
/etc/hbase/conf/hbase-env.sh
export HBASE_CLASSPATH=`ls /usr/lib/hadoop/lib/cloudera-hadoop-lzo-*.jar`
export HBASE_MANAGES_ZK=false
export HBASE_LIBRARY_PATH=/usr/lib/hadoop/lib/native/Linux-amd64-64
/etc/hadoop/conf/hadoop-env.sh
export HADOOP_CLASSPATH="$HADOOP_CLASSPATH":`hbase classpath`
Now, restart the master and start the region server:
service hadoop-hbase-master restart
apt-get install hadoop-hbase-regionserver
Starting/Stopping everything
Start
- service hadoop-zookeeper-server start
- for service in /etc/init.d/hadoop-0.20-*; do sudo $service start; done
- service hadoop-hbase-master start
- service hadoop-hbase-regionserver start
Stop
- service hadoop-hbase-regionserver stop
- service hadoop-hbase-master stop
- for service in /etc/init.d/hadoop-0.20-*; do sudo $service stop; done
- service hadoop-zookeeper-server stop
HBase Shell
su - hbase
hbase shell
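Inside the shell, a quick sanity check looks something like this; the table and column family names are just throwaway examples:

status
create 'test', 'cf'
put 'test', 'row1', 'cf:a', 'value1'
scan 'test'
disable 'test'
drop 'test'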
Ports
To ensure that everything is working correctly, visit your VM's IP address with each of these ports appended to an http URL (or check from the command line, as sketched after the list).
- HDFS: 50070
- JobTracker: 50030
- TaskTracker: 50060
- HBase Master: 60010
- HBase RegionServer: 60030
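You can also poke the web UIs from the command line; the IP address below is just an example, use your VM's:

curl -sI http://192.168.1.50:50070/ | head -1    # NameNode web UI should answer
curl -sI http://192.168.1.50:60010/ | head -1    # HBase master web UI should answer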
Friday, July 22, 2011
Lion removes Java?
Yes it does! Java really isn't installed this time. So, just open up Terminal.app and type 'java'. It will automatically install from there.
My guess is that there is some sort of Oracle licensing agreement that prevents Apple from distributing Java with the release. You'd think that these two massive public companies would be able to work something out.
Once you have done that, if you are a developer, you will need to fix where the source code is:
- Go to http://connect.apple.com and download: Java for Mac OS X 10.7 Developer Package
- Install it.
- Open a Terminal.app window
- cd /System/Library/Frameworks/JavaVM.framework/Home
- sudo ln -s /Library/Java/JavaVirtualMachines/1.6.0_26-b03-383.jdk/Contents/Home/src.jar .
- sudo ln -s /Library/Java/JavaVirtualMachines/1.6.0_26-b03-383.jdk/Contents/Home/docs.jar .
- sudo ln -s /Library/Java/JavaVirtualMachines/1.6.0_26-b03-383.jdk/Contents/Home/appledocs.jar .
p.p.s. This fixes apps like 'Network Connect' which depend on Java being installed.
p.p.p.s. Check out Similarity.com. Developed on a Mac, using Java, running on Google App Engine.
Thursday, July 21, 2011
jmxtrans - speaking engagement
I did a fun short talk tonight for the SF Bay Area Large-Scale Production Engineering group at the Yahoo! campus on my little open source Java monitoring project called jmxtrans.
Skip to 21:45.
Sorry, I was a bit nervous, as this was my first talk in front of so many people in a long time.
Monday, July 18, 2011
HBase MultiTableOutputFormat writing to multiple tables in one Map Reduce Job
Recently, I've been having a lot of fun learning about HBase and Hadoop. One esoteric thing I just learned about is the way that HBase tables are populated.
By default, HBase map reduce jobs can only write to a single table, because you set the output handler at the job level with job.setOutputFormatClass(). However, if you are creating an HBase table, chances are that you will also want to build an index related to that table so that you can do fast queries against the master table. The optimal way to do this is to write the data to both tables at the same time, while you are importing the data. The alternative is to write another M/R job to do this after the fact, but that means reading all of the data twice, which puts a lot of extra load on the system for no real benefit. To write to both tables at the same time, in the same M/R job, you can take advantage of the MultiTableOutputFormat class. The key is that when you write to the context, you specify the name of the table you are writing to. Here is some basic example code (with a lot of the meat removed) which demonstrates this.
static class TsvImporter extends Mapper<LongWritable, Text, ImmutableBytesWritable, Put> {
    @Override
    public void map(LongWritable offset, Text value, Context context)
            throws IOException, InterruptedException {
        // contains the line of tab separated data we are working on (needs to be parsed out).
        byte[] lineBytes = value.getBytes();

        // rowKey is the hbase rowKey generated from lineBytes
        Put put = new Put(rowKey);
        // Create your KeyValue object and add it to the Put
        put.add(kv);
        // write to the actions table: the map output key names the target table
        context.write(new ImmutableBytesWritable(Bytes.toBytes("actions")), put);

        // rowKey2 is the hbase rowKey for the index table
        Put indexPut = new Put(rowKey2);
        // Create your KeyValue object and add it to the Put
        indexPut.add(kv);
        // write to the actions_index table
        context.write(new ImmutableBytesWritable(Bytes.toBytes("actions_index")), indexPut);
    }
}

public static Job createSubmittableJob(Configuration conf, String[] args) throws IOException {
    String pathStr = args[0];
    Path inputDir = new Path(pathStr);

    Job job = new Job(conf, "my_custom_job");
    job.setJarByClass(TsvImporter.class);
    FileInputFormat.setInputPaths(job, inputDir);
    job.setInputFormatClass(TextInputFormat.class);

    // this is the key to writing to multiple tables in hbase
    job.setOutputFormatClass(MultiTableOutputFormat.class);

    job.setMapperClass(TsvImporter.class);
    job.setNumReduceTasks(0);

    TableMapReduceUtil.addDependencyJars(job);
    TableMapReduceUtil.addDependencyJars(job.getConfiguration());
    return job;
}