Sunday, December 18, 2011

Don't use the jQuery .data() method. Use .attr() instead.

I just discovered the hard way that the jQuery .data() method is horribly broken. By design, it attempts to convert whatever you put into it into a native type.

I've got a template where I'm generating a button with a data-key attribute:

<button id="fooButton" data-key="1.4000">Click me to edit</button>

http://jsfiddle.net/KwjvA/

To jQuery, that value looks like a float, and numerically 1.4000 === 1.4, but what I really want here is the string "1.4000". I certainly don't want my input to be modified. One suggestion I found in the issue tracker is to put single quotes around the value (i.e., data-key="'1.4000'"). That seems rather absurd as well.

I'm warning about this here because I've seen a bunch of libraries using .data() to store stuff on elements in the DOM. A method called .data() should give you back exactly what you put into it; having one that silently converts its input is a really bad idea.

The recommended alternative is to use .attr(). The problem is that while it achieves the same effect, it is quite different functionality from .data(): .data() stores information in an internal jQuery cache, while .attr() calls element.setAttribute().

I read through several bug reports on the jQuery website, where people are also confused by this behavior and all of them get closed with a wontfix. I see this as a terrible choice. Yuck.

Update: Here is a bug I just filed, hopefully that explains things better to the people who seem to be having a hard time understanding what I'm talking about: http://bugs.jquery.com/ticket/11060

Thursday, December 8, 2011

Optimizing Web Application JavaScript Delivery

For my new company, I had the following design goals for my heavy use of JavaScript (JS):
  • I'm using CoffeeScript (CS), so I need to have my IDE automatically compile CS to JS when I save. That way the whole Change file, Save, Reload the browser process works cleanly.
  • Be able to split the files up depending on which page is loaded so that only the JS which is needed for the page is sent to the client.
  • Be able to differentiate JS files to be loaded between different states such as logged in, logged out and both. That way, once someone is already logged in, the JS which controls the login and forgot password dialog does not get served again. The flip side is that JS for logged in pages is not served up to anonymous users.
  • In development mode, have everything unminified, but in production mode, automatically minify everything.
  • Run all of the JS through the Closure compiler regardless of dev/prod so that I know that things that work in dev also work in prod.
  • Limit the number of <script> tags to the bare minimum. Ideally, 2-3 for the .js files served from my own site (as opposed to those loaded directly off of a CDN). Fewer loads means less network traffic.
  • Be able to transparently support new code / application versions so that when I upgrade the application, the browsers dump their cached copies of my files.
  • No dependencies on external xml, json, property or other configuration file formats to implement the goals above. Everything should be configured by someone editing the html templates.
In order to accomplish the goals above, I first looked at a bunch of existing solutions, but they all failed in various ways. So, I started on my own and went through various iterations before I came up with a solution which I think is pretty unique and easy.

Let's start off by talking about one of the tools I'm using. LabJS enables me to load JS only when I need it. As part of the 'master' template which contains the skin for all pages, at the very bottom before the </body> element I have something that looks like this:

<script src="/js/LAB.min.js"></script>
<script>
    var country = "${country}";
    var fbAppId = "${fb.APP_ID}";

    var js = '${tool.jsbuilder(
        me != null,
        'json2:both',
        'handlebars.1.0.0.beta.master:both',
        'bootstrap-twipsy:both',
        'bootstrap-popover:both',
        'gen/global/page:both',
        'gen/global/common:both',
        'gen/global/search:both',
        'gen/modal/loginDialog:out', // ! logged in
        'gen/global/loggedInMenu:in', // logged in
        'gen/global/master:both'
        )}';

    var lab = $LAB
        .script("//ajax.googleapis.com/ajax/libs/jquery/1.7/jquery.min.js")
        .wait()
        .script("//ajax.googleapis.com/ajax/libs/jqueryui/1.8.16/jquery-ui.js")
        .script("//connect.facebook.net/en_US/all.js")
        .script("//apis.google.com/js/plusone.js")
        .script("//platform.twitter.com/widgets.js")
        .wait()
        .script(js)
        ;

    // Variable "pagecode" should be a function that takes a LAB and does any page-specific loading
    if (typeof pagecode === 'function')
        pagecode(lab);
</script>

Since I'm using Cambridge Templates with JEXL to process things first, the ${tool.jsbuilder(...)} section runs some Java code which does a lot of the magic during the rendering portion of the page. The first argument is a boolean indicating whether or not I'm logged in: 'me' is an object in the context, and if it is null I'm not logged in. The rest of the arguments are passed as a String[]. The method signature looks like this:

    public String jsbuilder(boolean loggedIn, String[] files);

What happens in that method is that it parses the array of Strings and, based on the suffix of 'in' (logged in), 'out' (logged out) or 'both' (either state), compares each entry against the loggedIn boolean to decide whether that JS file should be included. The selected files are then loaded into memory, in order, and sent through the Closure compiler to minify the code.

The output from Closure is then cached in a global HashMap which is never cleared out. (Note: for languages that don't really persist memory between requests, like PHP, you can store this data in something like memcached).

The key of the Map is generated from an md5 hash of the list of filenames combined together plus the application version. The hash looks something like this: be712950814b2ccc6b92ff5c3. This hash is the String that is returned from the jsbuilder method. Including the application version ensures that a new hash is generated each time the application is upgraded.

In dev mode, the Map isn't used at all. The code is generated for each request, which ensures that my changes get immediately reflected in the browser. In production, the Map is first checked for the key and if it exists, the key is immediately returned from the jsbuilder method.
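
To make that concrete, here is a rough sketch of the shape jsbuilder could take. This is not the actual implementation: readFile(), closureMinify(), isProduction() and APP_VERSION are stand-ins for details that aren't shown in this post.

import java.math.BigInteger;
import java.security.MessageDigest;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch of the jsbuilder template tool. readFile(),
// closureMinify() and isProduction() stand in for the real implementation.
public abstract class JsBuilderTool {

	static final Map<String, String> CACHE = new ConcurrentHashMap<String, String>();
	private static final String APP_VERSION = "1.0.0"; // bumped with every release

	public String jsbuilder(boolean loggedIn, String[] files) {
		List<String> selected = new ArrayList<String>();
		StringBuilder names = new StringBuilder();

		for (String entry : files) {
			int colon = entry.lastIndexOf(':');
			String file = entry.substring(0, colon);
			String state = entry.substring(colon + 1); // "in", "out" or "both"

			if ("both".equals(state)
					|| ("in".equals(state) && loggedIn)
					|| ("out".equals(state) && !loggedIn)) {
				selected.add(file);
				names.append(file).append(',');
			}
		}

		// Key = md5(filenames + application version), so upgrades bust browser caches.
		String key = md5(names.toString() + APP_VERSION);

		// Only production trusts the cache; dev regenerates on every request.
		if (isProduction() && CACHE.containsKey(key)) {
			return "/js2/" + key + ".js";
		}

		// Concatenate the selected files in order, then minify with Closure.
		StringBuilder source = new StringBuilder();
		for (String file : selected) {
			source.append(readFile("/js/" + file + ".js")).append('\n');
		}
		CACHE.put(key, closureMinify(source.toString()));
		return "/js2/" + key + ".js";
	}

	private static String md5(String input) {
		try {
			byte[] digest = MessageDigest.getInstance("MD5").digest(input.getBytes("UTF-8"));
			return new BigInteger(1, digest).toString(16);
		} catch (Exception e) {
			throw new RuntimeException(e);
		}
	}

	protected abstract String readFile(String path);
	protected abstract String closureMinify(String source);
	protected abstract boolean isProduction();
}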

The final rendered page looks something like this to the web browser:

    var js = '/js2/be712950814b2ccc6b92ff5c3.js';

When the LabJS code executes in the browser and loads my script with the line .script(js), there is a Servlet listening for requests to /js2/*.js and it looks up the key from the url in the Map and returns the appropriate JS data. This servlet can also set the correct browser cache headers depending on dev or prod.
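
A sketch of what that servlet could look like follows. Again, these are not the project's actual class names; JsBuilderTool.CACHE is the hypothetical cache map from the sketch above (assumed to live in the same package), and the header value is just an example.

import java.io.IOException;

import javax.servlet.http.HttpServlet;
import javax.servlet.http.HttpServletRequest;
import javax.servlet.http.HttpServletResponse;

// Hypothetical servlet mapped to /js2/*: it pulls the key out of the request
// path, looks up the minified JS in the shared cache map and writes it back.
public class JsServlet extends HttpServlet {

	@Override
	protected void doGet(HttpServletRequest req, HttpServletResponse resp) throws IOException {
		String path = req.getPathInfo(); // e.g. "/be712950814b2ccc6b92ff5c3.js"
		String key = path.substring(1, path.length() - ".js".length());

		String js = JsBuilderTool.CACHE.get(key);
		if (js == null) {
			resp.sendError(HttpServletResponse.SC_NOT_FOUND);
			return;
		}

		resp.setContentType("text/javascript");
		resp.setCharacterEncoding("UTF-8");
		// Far-future caching is safe in production because the key changes with every build.
		resp.setHeader("Cache-Control", "public, max-age=31536000");
		resp.getWriter().write(js);
	}
}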

As you can see, 10+ separate files have been combined and minified into a single file which makes the requests more efficient. All without configuration files or a crazy syntax that only a backend developer can understand.

If I wanted to split the JS into more requests so that the browser can take advantage of concurrent loading, I could do that as well by just creating more calls to jsbuilder. That is effectively what is happening in the pagecode section near the end of the <script> block above. The body template, which is loaded into the master template by Cambridge, optionally defines a JS function called pagecode. When it executes, it calls lab.script() again with similar output from the jsbuilder tool. This allows me to split up my code into global code as well as page-specific code.

Enjoy.

Wednesday, December 7, 2011

Github Pages

I'm a huge fan of Github. Pull requests are the best innovation since the idea behind open source was created.

For one of my projects hosted on Google Code, I was recently asked to move it to Github so that someone could submit pull requests more easily. I happily complied because of how much I love Github.

As part of this move, I decided to finally explore Github Pages in order to publish nicely formatted documentation for my project. I was expecting them to be as great as pull requests, but after an hour of reading the documentation and installing everything, I was terribly disappointed.

The main issue is that it uses a site generation tool called Jekyll. While this tool is generally ok, it has quite a few major failings as a product for Github.
  • It is clearly a product of Not-Invented-Here syndrome. How many static website generators does this world need? Why did you feel the need to invent yet another one? Hell, I don't even want a static website generator, I just want to write some documentation.
  • I don't want 50 different options for generating content. Just give me Markdown. I don't even want 2 different types of Markdown. Just give me the one that works best, as default.
  • By default, it comes with nothing to help you design a site full of great looking documentation. Even just a default template setup would suffice. Some people have tried to create some helper projects, but they all have terrible UI and they all seem somewhat abandoned.
  • It requires me to become an expert in this tool. I have to install a bunch of random software, learn its configuration files and learn a specific file layout. All I want to do is write some documentation that looks nice!
  • It was basically created for publishing blogs. Why is this being promoted as GH Pages? It seems rather absurd for a source code repository to provide a tool for publishing blogs, but not a tool for publishing great documentation.
Back over on Google Code, I just create some wiki pages, link them together with a table of contents (also a wiki page) and point at a specific url. Everything just works and looks great.

Github, please fix this!

Thursday, December 1, 2011

Social buttons

Today, I finally got around to implementing those social Like buttons that you see on all of the websites. Personally, I never click them, but it is clear lots of other people do, so I'm jumping on the bandwagon.

They look something like this:

The top ones are Facebook, then Twitter, then Google+.

Styling the first two rows of buttons with CSS is simple. They have class="fb-like" and class="twitter-share-button". I can move them around on my page and place them exactly where I want them quite easily.

What does Google have, you ask? Nothing. Zip. Zilch. Instead, it has id="___plusone_0", which makes it somewhat useless as a general CSS selector.

I know this is nitpicky on such a small issue, but an oversight like this seems hard to fathom. I've Googled around a bit and it really makes me wonder: how come no other site designers have brought this up?

I was hoping that a workaround for this would be to just write it as a div instead of as the <g:plusone> element. But it turns out that the div loses its class attribute when it is rewritten by their JavaScript.

<div class="g-plusone" data-size="tall" ... ></div>

In the end, I just put my own div around the element, but that seems like such a kludge when the other services seem to do this correctly.

Maybe I'm reading too far into this because it is such a little thing, but it really makes me feel like Google doesn't understand the needs of site developers like the other Social players do.

Thursday, November 17, 2011

Contributing to Open Source

I've been working on various open source projects since around 1993, long before I even really thought of it as open source. It just seemed natural to me to make the fixes I needed and contribute them back. It was always a bit of a challenge to figure out how to get my fixes to the developers. Obviously, they don't know me, so they aren't going to just let me write to their files directly. So, I end up sending patches via email or some other means.

Over the years, the process for contributing to projects has gotten easier. Even more recently, it has grown by leaps and bounds thanks to Github.

Case in point. I've been using the Twitter Bootstrap project for parts of the design of my new company Voost. I like the project a lot. Like millions of other projects, it is hosted on GitHub.

Yesterday, I noticed a small bit of documentation was missing, so I forked the project by clicking a button on the website, created a branch to work on (git checkout -b docadditions), edited the documentation, committed and published my changes and then created a pull request which tells the developers of bootstrap that I have something to contribute:

https://github.com/twitter/bootstrap/pull/647

Mark, one of the developers, who I've never met in my life, was able to take my contributions and combine them with his code by simply clicking a button on the website. Yes, it was that easy.

I also had an enhancement request... so I created an issue...

https://github.com/twitter/bootstrap/issues/646

It was resolved in a few hours with just a small bit of effort. I could then merge his changes into my local fork of the project with a couple of easy commands. We stay in perfect sync.

Bam. That is how collaborative development should work.

As a comparison, in the past, I've done a huge amount of work for the Apache Software Foundation. They have a great open source license, and a huge following. But, they don't use github.

With the ASF, it feels like 1993 again. For each project I want to contribute to, it feels like I'm making a lifetime commitment to that project.

I have to go to the project website and navigate around to figure out how to join a mailing list. This takes several contextual steps in an email client. I need to make sure to set up a mail filter to deal with a potentially insane amount of email that I really don't care about. Then, I email a patch to the list (or put it up on gist / pastebin)... and I hope maybe one of the developers might be watching my carefully crafted subject line. Chances are that nobody would respond or the email would get lost, so I'd have to keep nagging people because everyone is busy...

I don't really contribute to the ASF nearly as much anymore.

Thursday, November 3, 2011

Month 3 - It is official now.

It is official, Jeff and I have our own company now. Voost LLC. You can sign up on the website to be notified when we launch.

The focus of the company is on sporting event registrations. As a road bike racer who started his 'career' sending checks in the mail to promoters, I've often wondered why there isn't a great solution for handling registrations to events. (Especially when you'd get to an event, but your check didn't.)

Many companies have sprung up to do this online, but they suffer from high fees, websites that look like they were designed in the early 1990's by people with no design skills and generally poor execution. Promoters have to jump through hoops to create events and get their money. Participants are missing out on social features like communicating about the event, ride share / hotels and simple things like the ability to see their (and others') results over time. How many times have you had registration open and the site just melted down as 6k people all tried to register at the same time?

So, it turns out that there are 10+ million athletes in the United States who cross a finish line at an endurance event every year. Voost is set up to do it right and disrupt the entire industry with a well designed solution. We've done our homework and we have a clear path of execution. Now, it is just a matter of time and effort to put our designs into code.

Things are still rolling along with long days of coding and adding features. It is now possible to sign up for an event and go through the entire purchase process. It sounds simple, but there was a huge amount of plumbing that needed to happen first. We are currently working on the event editor, which allows promoters to easily set up their event entirely online in no time at all.

We've done a huge amount of work and I'm really proud of it. On a technical note, we've recently switched from Sass to Less. I like the syntax of Less better (especially the mixins) and the switch was pretty simple to do.

We are hoping for a soft launch around December, picking up steam as we head into the new year. Wish us luck.

Friday, October 7, 2011

Two months of work...

I realize that I've lapsed in posting! I've had my head down and full concentration is in effect. The days have been long, with 12+ hour near daily sessions of coding away. We've made an astonishing amount of progress in such a short time. The framework is built and things are starting to come together. Features are appearing all over the place.

A few days ago, we got some major news. One of our direct competitors has gone out of business. On one hand, it worries me that they couldn't survive. On the other hand, I know that a big part of their failure is due to the fact that they needed to contract out their website development to another company. This means that every time they needed to add a feature or fix a bug on the site, they had to pay someone to do this work for them. This is a huge advantage that we have over all of the existing companies. Both Jeff and I are the ones who really know how to write code. Our only expense is our own time spent on this project. This will also allow us to offer our superior service at a much lower price than anyone else.

Thanks to a discussion with a very smart friend of mine, we were able to take advantage of the other company's failure and craft a well-worded email to a large group of people who we look forward to doing business with in the future. We asked them for their ideas on what they would like to see from us. Nothing beats listening to the people you want to do business with and letting them have a positive influence over what you are creating. We got a lot of amazing feedback and, being a small agile company, have already made the appropriate adjustments.

All of the other sites that currently have a business still have a major advantage over us. They have a shipping product and we don't. Of course, this also means that we can study them in minute detail to find their weaknesses and exploit them to our advantage. I feel that we have been very good at this. We've spent a huge amount of time designing around everyone else's failures. Our user experience is unique in its beauty and simplicity. It is hard to beat that.

Interesting times... looking forward to our beta release... now back to coding...

Saturday, September 3, 2011

After a Month of Development

It has been about a month since I came up with my new Stealthy Startup idea and we've gotten so much done already. Never mind the fact that I'm working and coding like a mad man. The integration of the whole infrastructure is complete and I'm coding up application features as quickly as I can. This is a huge project and will take a lot of effort to do it right. That said, it really is a lot of fun to develop on this platform and the site is very web 3.0 dynamic.

We've switched around a few JavaScript libraries. We replaced RequireJS with LabJS. The documentation for RequireJS is pretty, but when you really dig into it, it is more like fluff. We got what we wanted out of LabJS though.

For textareas, we've integrated the Markdown editor from Stack Overflow. That was actually quite fun to do. Thanks for making that available.

I spent way too much time on implementing file upload. I started to write my own, but got overwhelmed with the cross browser issues. I ended up using Plupload, which isn't great, but everything else out there is a pile of junk. The lack of good documentation made it a royal pain to integrate Plupload, but once I dug through the source code and figured out how it works, I was able to make it do what I want.

CoffeeScript has been a godsend. The guy who came up with it should get a Nobel prize for brilliance. If you are writing any JavaScript at all, you should immediately stop what you are doing and switch your entire environment to CS. I'm writing a metric ton of CS and I just can't even begin to explain how much easier my life has been.

There are a lot of libraries for dealing with HTML forms in an MVC way, but in the end, I've found that they require as much code as just writing it yourself. CoffeeScript's scoping and class system makes it easy to contain the logic.

Using GitHub is also great. I've got two private repos set up now and it is so nice to be able to get emails and diffs without having to set up and manage my own server. Plain and simple: Fuck Subversion. I know those are harsh words, but as others have said, setting CVS as your goal to beat was a really bad idea. They are almost done with version 1.7 and basic merging is still a pile of shit. Everything is always going to be a catchup game to git, so you might as well just use git. I'm still in trial mode with Tower and this is definitely an application worth paying for.

I know everyone is complaining about Google App Engine suddenly costing them more than $0. The reality of the situation is that nothing else compares to GAE. You know what is worth an unlimited amount of money? Not having to do IT. I'm an expert sysadmin and I never want to be woken up in the middle of the night ever again because some server is acting badly. I'm more than happy to pay for the privilege of having that be someone else's problem. I really don't care if you can set something up on Linode or AWS for less money. It just isn't worth the headache when one of your $10/month micro instances decides to randomly disappear.

I still can't tell you what I'm working on, but we should have something to show soon.

Tuesday, August 9, 2011

Tools for a new generation

We've been hard at work on the Stealthy Startup. A few days of 12+ hour coding sessions and things are really starting to come together. We have a lot more work to do to implement the application logic, but at least we have settled on what we feel is a great basis for building a next generation web application.

One of the goals of this application is to do as little framework code as possible. While we are more than technically capable, we also don't want to have to hire an IT / operations department or run servers ourselves. There just isn't a need to do that any longer and this will allow us to focus on coding features and running the business. It is simply amazing that the services and frameworks that have been created in the last 6-7 years exist and are as high quality as they are.

While everyone seems to be zigging towards Rubyland, I still feel that staying the course with Java is the right choice. You just can't beat a strongly typed language when it comes to building an application, especially from scratch. Being able to re-factor code on a whim prevents bad decisions and mistakes from propagating long term. On top of that, the JVM is simply a faster product. Scala is another approach, but I'm afraid I'm just not fond of the language syntax. I'm super efficient at writing bug-free Java, so why would I want to slow my development down? Anyway, this isn't intended as a language war posting...

People have been asking me what tools we are using for our startup, so here is the list so far:
Google App Engine and the Datastore enable us to not have to run servers or manage databases. It also means that as our traffic increases, we can scale automatically, without even having to think about it or code up solutions on EC2. This is because we are running on Google's own architecture and we can trust them to deal with these issues in a timely fashion. We also know that we won't have to wear a pager if the system goes down and we won't have to hire a team in India to help us when things go south at 2am.

Cambridge Template Engine is one of the more unique and somewhat 'beta' projects that we are using, so it is a bit of a risk, but it seems to work pretty well as it stands right now. The language it uses enables us to write extremely flexible html code. Take a look at this blog posting for a comparison between various engines.

Lombok is one of the coolest extensions to Java ever invented. It prevents you from having to write all of that silly boilerplate code that the Rubytards make fun of the Javatards for. It also allows you to annotate your classes with a simple @Slf4j annotation to enable a 'log' object to be used. Unfortunately, it only works with Eclipse, so if you are in a mixed IDE environment, you are out of luck. That said, Eclipse is the best free IDE out there, so there isn't much reason to not use it.
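
As a rough illustration of the kind of code Lombok removes (the class and fields below are made up for the example, not from our codebase):

import lombok.Data;
import lombok.extern.slf4j.Slf4j;

// Hypothetical class for illustration. @Data generates the getters, setters,
// equals(), hashCode() and toString(); @Slf4j adds a static SLF4J 'log' field,
// so there is no logger boilerplate either.
@Slf4j
@Data
public class Registration {
	private String firstName;
	private String lastName;
	private String event;

	public void confirm() {
		log.info("confirming {} for {}", firstName, event);
	}
}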

These days, using REST on the server side is pretty much the best way to deal with things. You aren't returning HTML, you are returning data, and you use JavaScript to process that data. Both Resteasy and Htmleasy make this simple to implement.
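
For example, a resource can be as small as the sketch below. The class, path and Account type are invented for illustration; the JAX-RS annotations are what a provider like Resteasy dispatches on.

import javax.ws.rs.GET;
import javax.ws.rs.Path;
import javax.ws.rs.PathParam;
import javax.ws.rs.Produces;
import javax.ws.rs.core.MediaType;

// Hypothetical resource class: the JAX-RS provider maps the request to this
// method and serializes the returned object as data -- no HTML is rendered.
@Path("/account")
public class AccountResource {

	public static class Account {
		public String id;
		public String firstName;
	}

	@GET
	@Path("/{id}")
	@Produces(MediaType.APPLICATION_JSON)
	public Account get(@PathParam("id") String id) {
		Account a = new Account();
		a.id = id; // a real app would look this up in the datastore
		a.firstName = "Jon";
		return a;
	}
}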

For dependency injection, Guice is the way to go. While I like some parts of Spring, Guice doesn't suffer from the jarfest/xmlfest that Spring comes with, and it is just as fast and powerful.
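
A minimal sketch of what that looks like in practice (the interfaces and classes here are invented for the example):

import com.google.inject.AbstractModule;
import com.google.inject.Guice;
import com.google.inject.Inject;
import com.google.inject.Injector;

// Hypothetical classes -- the point is that the bindings live in plain Java
// code instead of a pile of XML.
public class GuiceExample {

	interface MailService {
		void send(String to, String body);
	}

	static class SmtpMailService implements MailService {
		public void send(String to, String body) {
			// talk to an SMTP server here
		}
	}

	static class SignupHandler {
		private final MailService mail;

		@Inject
		SignupHandler(MailService mail) {
			this.mail = mail; // Guice hands us the bound implementation
		}

		void signup(String email) {
			mail.send(email, "welcome aboard");
		}
	}

	static class AppModule extends AbstractModule {
		@Override
		protected void configure() {
			bind(MailService.class).to(SmtpMailService.class);
		}
	}

	public static void main(String[] args) {
		Injector injector = Guice.createInjector(new AppModule());
		injector.getInstance(SignupHandler.class).signup("jon@example.com");
	}
}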

RequireJS is great because it allows us to separate our CoffeeScript/JS into small components that we can include on pages that need it. Effectively giving us an 'import' statement.

You don't really want to use JavaScript anymore either. Once you check out CoffeeScript, you won't go back. People who are against using CS are out of their minds. Take this example of using jQuery to do an Ajax call to flip some toggles(), bring effects to the page and post some data to a server:

$ ->
	$('#nameSave').click ->
		$('#nameEdit img').show()
		$('#nameSave').attr("disabled", "disabled")
		$.ajax
			type: "POST"
			url: "/account/name"
			data:
				firstName: $("input[name='firstName']").text()
				lastName: $("input[name='lastName']").text()
			success: ->
				$('.confirmation').show()
				$('.error').hide()
				$('#nameEdit img').fadeOut(1500)
				$('#nameSave').attr("disabled", "")
			error: ->
				$('.confirmation').hide()
				$('.error').fadeIn().delay(2000).fadeOut()
				$('#nameEdit img').fadeOut(1500)
				$('#nameSave').delay(2000).attr("disabled", "")

That generates this JavaScript:

(function() {
  $(function() {
    return $('#nameSave').click(function() {
      $('#nameEdit img').show();
      $('#nameSave').attr("disabled", "disabled");
      return $.ajax({
        type: "POST",
        url: "/account/name",
        data: {
          firstName: $("input[name='firstName']").text(),
          lastName: $("input[name='lastName']").text()
        },
        success: function() {
          $('.confirmation').show();
          $('.error').hide();
          $('#nameEdit img').fadeOut(1500);
          return $('#nameSave').attr("disabled", "");
        },
        error: function() {
          $('.confirmation').hide();
          $('.error').fadeIn().delay(2000).fadeOut();
          $('#nameEdit img').fadeOut(1500);
          return $('#nameSave').delay(2000).attr("disabled", "");
        }
      });
    });
  });
}).call(this);

I know they are both kind of difficult to read (that is the nature of these languages), but which one do you think is easier to write? I've integrated it with Eclipse with a Builder so that as soon as you save a .coffee file, it gets 'compiled' into JS. That JS is then watched by another Builder which runs it through the RequireJS optimizer, which in turn runs UglifyJS or the Closure Compiler on the output.

Sass (SCSS) is wonderful because it gives us greater control over our CSS. Another Eclipse Builder is used to automatically generate the CSS files as well.

We are using GitHub for all of our source code and wiki documents. At $7/month it is the perfect environment for doing development. Their new native GitHub app for the Mac is pretty useful as well. I just wish you could do a sync without having to also push.

JRebel allows us to write Java code, click save and then reload in the browser to see the changes immediately. No need to restart Jetty/Tomcat/JBoss any longer.

Friday, August 5, 2011

A busy week... A new company...

You'd think that being recently jobsingle would have allowed me to slow down for a few weeks and catch my breath. Well, turns out that is not the case. I just can't sit still. My mind has been buzzing for new possibilities. As a result of this focus, a really great business idea that satisfies all of my requirements for success has presented itself to me. After talking it through with a few people who also like the idea, I know it is a GO.

Thus, my buddy Jeff and I are in the process of creating a stealthy startup that is going to provide a useful service to a whole lot of people, in a market that is sorely lacking in technology and sophistication. Even better is that it starts with an area I know quite well and absolutely love... bicycles. We have an exciting name, a great domain, a solid plan, a clear set of requirements and are busy setting up the infrastructure to execute upon.

I love the idea of working for myself again. Why didn't I think of this sooner? ;-)

Saturday, July 30, 2011

Jobsingle and seeking!

I'm happily jobsingle and I'm looking for the next perfect job.

I'd like to find a stealthy startup that expects huge amounts of traffic and has a clear business model that won't go bust when this bubble bursts. I'm interested in the backend technology and helping solve scaling problems, based on my wide experience working in that area for the last 5 years.

http://linkedin.com/in/lookfirst

cheers,

jon

Why machine images don’t work

I just read a really good blog posting from RightScale on "Why machine images don’t work."

After investigating the way that AMIs are built and seeing how utterly difficult it was to build one, I've been trying to put my feelings about it into words. This article does a great job of describing them with four simple statements:
  • Images are too monolithic.
  • Images are opaque.
  • Images are too big.
  • Images are too static.
I have one additional thing that I would like to add:
  • Images are difficult to upgrade.
I quickly came to the conclusion that building Debian packages that can be quickly installed on any machine is a far better way to go. Not only are they easy to create, but you can integrate them into your continuous integration system so that every time someone commits code, a new package is built and added to a central repository. Updating the code on your machines in all of your environments is as simple as 'aptitude update; aptitude safe-upgrade'.

I also think that using tools like Fabric, Puppet and Chef (FPC) just adds another layer of unnecessary complexity which completely fails the KISS principle. You can do everything that FPC can do in a single Debian package, or break it out into multiple ones depending on how you want to set up the deployment hierarchy. Why install some other complicated piece of software (and all of its dependencies) with its own domain specific language when you can just write relatively simple bash shell scripts?

With my deployments, I like to set things up so that there is a 'base' Debian package, called project-init, which gets installed first. It is responsible for creating users, accepting the JDK license agreement, setting the timezone of the machine and any other low level OS settings that apply to all machines.

From there, everything gets layered on top of that base package. Some packages will be optional depending on which environment they get put into. For example, you probably don't want or need to install Clarity on your production servers. If you need packages for specific environments (dev, staging, prod), you can use Debian virtual packages to create 'aliases' for which packages you want installed on a given system.

In the end, I know this system works really well. I've done it for one of the most complicated systems one can imagine with 25+ different packages for all of the components that needed to be installed.

Wednesday, July 27, 2011

Easy Install HBase/Hadoop in Pseudo Distributed Mode

Introduction

This documentation should get you up and running quickly with a full pseudo distributed Hadoop/HBase installation in an Ubuntu VM. I use Ubuntu because Debian package management (apt) is by far the best way to install software on a machine. It is also possible to use this on regular hardware.

The reason you will need this is that much of the existing documentation is spread across quite a few different locations. I've already done the work of digging this information out so that you don't have to.

This documentation is intended to be read and used from top to bottom. Before you do an initial install, I suggest you read through it once first.

Reference Manuals

Create the virtual machine

The first thing that you will want to do is download a copy of the Ubuntu Server 10.04 64bit ISO image. This version is the current Long Term Support (LTS) version. These instructions may work with a newer version, but I'm suggesting the LTS because that is what I test with and also what your operations team will most likely want to install into production. Once you have the ISO, create a new virtual machine using your favorite VM manager (I like vmware fusion on my Mac).

Unix Box Setup

Once you have logged into the box, we need to setup some resources...

echo "deb http://archive.canonical.com/ lucid partner" > /etc/apt/sources.list.d/partner.list
echo "deb http://archive.cloudera.com/debian lucid-cdh3 contrib" >> /etc/apt/sources.list.d/cloudera.list
echo "deb-src http://archive.cloudera.com/debian lucid-cdh3 contrib" >> /etc/apt/sources.list.d/cloudera.list
echo "sun-java6-bin shared/accepted-sun-dlj-v1-1 boolean true" | debconf-set-selections
echo "hdfs  -       nofile  32768" >> /etc/security/limits.conf
echo "hbase  -       nofile  32768" >> /etc/security/limits.conf
echo "hdfs soft/hard nproc 32000" >> /etc/security/limits.conf
echo "hbase soft/hard nproc 32000" >> /etc/security/limits.conf
echo "session required  pam_limits.so" >> /etc/pam.d/common-session

aptitude install curl wget
curl -s http://archive.cloudera.com/debian/archive.key | sudo apt-key add -
aptitude update
aptitude install openssh-server ntp
aptitude install sun-java6-jdk
aptitude safe-upgrade
reboot now

You can now use ifconfig -a to find out the IP address of the virtual machine and log into it via ssh. You will want to execute most of the commands below as root.

LZO Compression

This setup provides LZO compression for your data in HBase, which greatly reduces the amount of data stored on disk. Sadly, LZO is under the GPL license, so it can't be distributed with Apache. Therefore, I'm providing a nice Debian package that I got ahold of for you to use. On your vm:

wget "https://github.com/lookfirst/fileshare/blob/master/Cloudera-hadoop-lzo_20110510102012.2bd0d5b-1_amd64.deb?raw=true"
dpkg -i Cloudera-hadoop-lzo_20110510102012.2bd0d5b-1_amd64.deb

Hadoop / HDFS

Install some packages:
apt-get install hadoop-0.20
apt-get install hadoop-0.20-namenode hadoop-0.20-datanode hadoop-0.20-jobtracker hadoop-0.20-tasktracker
apt-get install hadoop-0.20-conf-pseudo

Edit some files:

/etc/hadoop/conf/hdfs-site.xml
<property>
   <name>dfs.datanode.max.xcievers</name>
   <value>4096</value>
</property>
/etc/hadoop/conf/core-site.xml
<property>
   <name>io.compression.codecs</name>
   <value>org.apache.hadoop.io.compress.GzipCodec,org.apache.hadoop.io.compress.DefaultCodec,com.hadoop.compression.lzo.LzoCodec,com.hadoop.compression.lzo.LzopCodec,org.apache.hadoop.io.compress.BZip2Codec</value>
</property>
 
<property>
   <name>io.compression.codec.lzo.class</name>
   <value>com.hadoop.compression.lzo.LzoCodec</value>
</property>
/etc/hadoop/conf/mapred-site.xml
<property>
   <name>mapred.compress.map.output</name>
   <value>true</value>
 </property>
 
 <property>
   <name>mapred.map.output.compression.codec</name>
   <value>com.hadoop.compression.lzo.LzoCodec</value>
 </property>
 
 <property>
   <name>mapred.child.ulimit</name>
   <value>1835008</value>
 </property>
   
 <property>
   <name>mapred.tasktracker.map.tasks.maximum</name>
   <value>2</value>
 </property>

 <property>
   <name>mapred.tasktracker.reduce.tasks.maximum</name>
   <value>2</value>
 </property>

ZooKeeper
apt-get install hadoop-zookeeper-server
/etc/zookeeper/zoo.cfg
Change localhost to 127.0.0.1
 Add: maxClientCnxns=0
service hadoop-zookeeper-server restart

HDFS/HBase Setup

Make an /hbase folder in hdfs
sudo -u hdfs hadoop fs -mkdir /hbase
sudo -u hdfs hadoop fs -chown hbase /hbase
NOTE: If you want to delete an existing hbase folder, first stop hbase!
sudo -u hdfs hadoop fs -rmr -skipTrash /hbase

HBase Installation
apt-get install hadoop-hbase
apt-get install hadoop-hbase-master

/etc/hbase/conf/hbase-site.xml
<property>
   <name>hbase.cluster.distributed</name>
   <value>true</value>
</property>
<property>
   <name>hbase.rootdir</name>
   <value>hdfs://localhost/hbase</value>
</property>

/etc/hbase/conf/hbase-env.sh
export HBASE_CLASSPATH=`ls /usr/lib/hadoop/lib/cloudera-hadoop-lzo-*.jar`
export HBASE_MANAGES_ZK=false
export HBASE_LIBRARY_PATH=/usr/lib/hadoop/lib/native/Linux-amd64-64

/etc/hadoop/conf/hadoop-env.sh
export HADOOP_CLASSPATH="$HADOOP_CLASSPATH":`hbase classpath`

Now, restart the master and start the region server:
service hadoop-hbase-master restart
apt-get install hadoop-hbase-regionserver

Starting/Stopping everything

Start
  • service hadoop-zookeeper-server start
  • for service in /etc/init.d/hadoop-0.20-*; do sudo $service start; done
  • service hadoop-hbase-master start
  • service hadoop-hbase-regionserver start
Stop
  • service hadoop-hbase-regionserver stop
  • service hadoop-hbase-master stop
  • for service in /etc/init.d/hadoop-0.20-*; do sudo $service stop; done
  • service hadoop-zookeeper-server stop

Hbase Shell
su - hbase
hbase shell

Ports

To ensure that everything is working correctly, visit your VM's IP address with these ports on the end of an http url.
  • HDFS: 50070
  • JobTracker: 50030
  • TaskTracker: 50060
  • Hbase Master: 60010
  • Hbase RegionServer: 60030

Sunday, July 24, 2011

Chickens

I added a new wing to the coop today. It's officially a doublewide now.


Friday, July 22, 2011

Lion removes Java?

Yes it does! Java really isn't installed this time. So, just open up Terminal.app and type 'java'. It will automatically install from there.

My guess is that there is some sort of Oracle licensing agreement that prevents Apple from distributing Java with the release. You'd think that these two massive public companies would be able to work something out.

Once you have done that, if you are a developer, you will need to fix where the source code is:
  1. Go to http://connect.apple.com and download: Java for Mac OS X 10.7 Developer Package
  2. Install it.
  3. Open a Terminal.app window
  4. cd /System/Library/Frameworks/JavaVM.framework/Home
  5. sudo ln -s /Library/Java/JavaVirtualMachines/1.6.0_26-b03-383.jdk/Contents/Home/src.jar .
  6. sudo ln -s /Library/Java/JavaVirtualMachines/1.6.0_26-b03-383.jdk/Contents/Home/docs.jar .
  7. sudo ln -s /Library/Java/JavaVirtualMachines/1.6.0_26-b03-383.jdk/Contents/Home/appledocs.jar .
p.s. The version number seems to have gone down (384 to 383) with 10.7 vs. 10.6.8 and there is now an appledocs.jar that I didn't notice before.

p.p.s. This fixes apps like 'Network Connect' which depend on Java being installed.

p.p.p.s. Check out Similarity.com. Developed on a Mac, using Java, running on Google App Engine.

Thursday, July 21, 2011

jmxtrans - speaking engagement

I did a fun short talk tonight for the SF Bay Area Large-Scale Production Engineering group at the Yahoo! campus on my little open source Java monitoring project called jmxtrans.

Skip to 21:45.

Sorry, I was a bit nervous as this is my first talk in front of so many people in a long time.

Monday, July 18, 2011

HBase MultiTableOutputFormat writing to multiple tables in one Map Reduce Job

Recently, I've been having a lot of fun learning about HBase and Hadoop. One esoteric thing I just learned about is the way that HBase tables are populated.

By default, HBase Map Reduce jobs can only write to a single table, because you set the output handler at the job level with job.setOutputFormatClass(). However, if you are populating an HBase table, chances are that you are also going to want to build an index related to that table so that you can do fast queries against the master table. The optimal way to do this is to write the data to both tables at the same time while you are importing it. The alternative is to write another M/R job to do this after the fact, but that means reading all of the data twice, which is a lot of extra load on the system for no real benefit. To write to both tables in the same M/R job, you need to use the MultiTableOutputFormat class. The key here is that when you write to the context, you specify the name of the table you are writing to. Here is some basic example code (with a lot of the meat removed) which demonstrates this.

static class TsvImporter extends Mapper<LongWritable, Text, ImmutableBytesWritable, Put> {
	@Override
	public void map(LongWritable offset, Text value, Context context) throws IOException, InterruptedException {
		// contains the line of tab separated data we are working on (needs to be parsed out).
		byte[] lineBytes = value.getBytes();

		// rowKey is the hbase rowKey generated from lineBytes
		Put put = new Put(rowKey);
		// Create your KeyValue object
		put.add(kv);
		// the ImmutableBytesWritable key names the table that this Put goes to
		context.write(new ImmutableBytesWritable(Bytes.toBytes("actions")), put); // write to the actions table

		// rowKey2 is the hbase rowKey for the index entry
		Put indexPut = new Put(rowKey2);
		// Create your KeyValue object
		indexPut.add(kv);
		context.write(new ImmutableBytesWritable(Bytes.toBytes("actions_index")), indexPut); // write to the actions_index table
	}
	}
}

public static Job createSubmittableJob(Configuration conf, String[] args) throws IOException {
	String pathStr = args[0];
	Path inputDir = new Path(pathStr);
	Job job = new Job(conf, "my_custom_job");
	job.setJarByClass(TsvImporter.class);
	FileInputFormat.setInputPaths(job, inputDir);
	job.setInputFormatClass(TextInputFormat.class);
	
	// this is the key to writing to multiple tables in hbase
	job.setOutputFormatClass(MultiTableOutputFormat.class);
	job.setMapperClass(TsvImporter.class);
	job.setNumReduceTasks(0);

	TableMapReduceUtil.addDependencyJars(job);
	TableMapReduceUtil.addDependencyJars(job.getConfiguration());
	return job;
}

Wednesday, June 29, 2011

Fix missing source for Java Mac OS X 10.6 Update 5

Once again, Apple does something stupid to us poor idiot Java developers. This will make clicking through to the JDK source work in Eclipse again after updating to the latest Java for Mac OS X.
  1. Go to http://connect.apple.com and download Java for Mac OS X 10.6 Update 5 Developer Package
  2. Install it.
  3. Open a Terminal.app window
  4. cd /System/Library/Frameworks/JavaVM.framework/Home
  5. sudo ln -s /Library/Java/JavaVirtualMachines/1.6.0_26-b03-384.jdk/Contents/Home/src.jar .
  6. sudo ln -s /Library/Java/JavaVirtualMachines/1.6.0_26-b03-384.jdk/Contents/Home/docs.jar .

Thursday, June 2, 2011

Mount an Ubuntu virtual machine via NFS on a Mac

I do a lot of local development in various Ubuntu virtual machines running in vmware fusion. I do some work in them, mess them up and then delete them when I'm done. It tends to be a pain to transfer data between the VM and my local host machine, so I found a simple way to do it with NFS exports from the VM, mounted on my Mac.

For simplicity, first setup your VM with bridged networking. Then, execute this as root on your vm:

aptitude install nfs-common nfs-kernel-server portmap
echo "/var *(rw,no_subtree_check,all_squash,anonuid=0,anongid=0)" >> /etc/exports
echo "/etc *(rw,no_subtree_check,all_squash,anonuid=0,anongid=0)" >> /etc/exports
invoke-rc.d nfs-kernel-server restart || true


Obviously edit the echo statements above to export different directories.

Open up the DiskUtility application and select File->NFS Mounts...

Enter the IP address of your VM / mountpoint (nfs://10.0.0.1/etc) and the mount location on your local Mac. Given the example above, which mounts /etc and /var, you'd probably want something like /Volumes/var and /Volumes/etc.

You also need to pass -P for the Advanced Mount parameters.

That's it! Now you have root write access on your VM from your Mac.

Sunday, May 29, 2011

New release of Sardine

After working for the last few months with the developer of Cyberduck (David Kocher), there is a shiny new release of Sardine. This is the best modern Java webdav client around.

This release makes it safer, easier to use and more compatible with webdav servers than ever. I can also now brag that my code is being used in a really cool product used by a ton of people. The last release had nearly 900 downloads, so I expect this next release to be even more popular.

Saturday, May 28, 2011

Release Engineering at Facebook

Along with my regular duties of coding product features and inventing cool technologies, I've been doing build and release engineering for a while now. This means that I'm responsible for building the systems which not only help developers get work done quickly, but also the building of the artifacts that are pushed to production. It isn't my favorite thing to do, but it turns out that I'm pretty ok at it.

This video is an inside look at how Facebook's release engineering process works. It is pretty amazing that this guy has been able to support the level of development that Facebook does with a very minimal team. I share quite a few of his views about how developers should own their work and the culture around that. About the only thing that we differ on is the way branches are managed. I prefer iterations and feature branches over working on trunk (which should be kept in a stable state).

It is an hour long video, but well worth the watch if you are interested in this stuff...

Sunday, May 22, 2011

fallback

fallback provides a nice example web application archive (war) for integrating Spring / Hibernate / JMX / JPA / Ehcache.

In order to come up with this clean of an integration, there are a ton of conflicting blog postings and documentation that you would have to sift through. My goal here is to do that work for you and provide a nice basis to start from. The project is up on github in the hopes that you will fork it and make improvements yourself.

The fallback project itself is a very basic 3-tier web application with a RESTful servlet frontend that takes a request and calls a method on a bean which contains the business logic. If you are coming from EJB3 experience, this will look very familiar. Annotations are used as much as possible to simplify the Spring configuration.
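
The shape of that layering looks roughly like the sketch below. To be clear, these are not classes from the fallback project itself; the names are invented and only a couple of the annotations involved are shown.

import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.stereotype.Controller;
import org.springframework.stereotype.Service;
import org.springframework.transaction.annotation.Transactional;
import org.springframework.web.bind.annotation.PathVariable;
import org.springframework.web.bind.annotation.RequestMapping;
import org.springframework.web.bind.annotation.ResponseBody;

// Hypothetical 3-tier shape: a RESTful frontend delegates to a business bean.
@Controller
public class WidgetController {

	@Autowired
	private WidgetService service; // the business-logic bean

	@RequestMapping("/widget/{id}")
	@ResponseBody
	public String get(@PathVariable("id") long id) {
		return service.describe(id); // frontend just delegates to the bean
	}
}

@Service
class WidgetService {

	// @Transactional demarcates the Hibernate/JPA unit of work.
	@Transactional(readOnly = true)
	public String describe(long id) {
		return "widget-" + id; // real code would hit the JPA EntityManager here
	}
}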

Wednesday, April 6, 2011

The Subversion Mistake

At my workplace, when I first got here, we were doing the waterfall method of development. We would do 3 months' worth of hard work, with thousands of commits and tons of features. Then we would hand things off to QA to test. Spend a few more weeks bug fixing. Then when things were 'certified' by QA, we would spend a whole weekend (and the next week) doing the release and bug fixing in production. Then the cycle would repeat. Ugly.

Now, based on feedback I gave, we use an iterative approach to development. It is extremely flexible and has allowed us to increase the rate of releases to our customers as well as the stability of our production environment. We did about 9 iteration releases in 3 months, our customers got features more quickly and we have fewer mid-week critical bug fixing patches. Everyone from the developers all the way up to the marketing department loves that we have become more agile as a company.

In order to support this model of development, we had to change the way we use subversion. Before, we would have a branch for the version in production and do all of our main work on trunk. We would then copy any bug fixes to that branch and do a release from there. I reversed and expanded that model.
  • Trunk is now what is in production.
  • Main development and bug fixes happen on numbered iteration branches (iteration-001, iteration-002, etc.)
  • Features happen on branches named after the feature. (foobranch, barbranch, etc)
  • Each iteration branch is based off of the previous iteration. If a checkin happens in iteration-001, it is merged into iteration-002. (ex: cd iteration-002; svn merge ^/branches/iteration-001 .)
  • No commits happen directly to trunk, only merges. For midweek releases, we cherry pick individual commits from an iteration branch to trunk. (ex: cd trunk; svn merge -c23423 ^/branches/iteration-001 .)
  • Feature branches are based off of an iteration.
Unfortunately, we are quickly learning that subversion does not support this model of development at all and I've had to become a subversion merge expert.

One reason for this is that every time we cherry pick a change from an iteration to trunk, we also need to merge trunk back into the iteration. This is so that the mergeinfo is recorded properly and to make --reintegrate work when we decide to 'close' the iteration. When iteration-001 is closed, I then cd iteration-002; svn merge --record-only ^/trunk to 'reset' the merge pointer on iteration-002. If I don't do this, trunk quickly gets out of sync with the iteration and subversion makes merging a nightmare of conflicts. It shouldn't be this way, but it is.

Another reason is that any feature branch that spans more than one iteration does not have its mergeinfo tracked properly. For example, I have a branch called 'foo'. It is based off of iteration-001 and kept up to date with development (cd foo; svn merge ^/branches/iteration-001 .). At some point, iteration-002 is created and people are committing to it. Also, iteration-001 is routinely merged into iteration-002.

The issue is that the mergeinfo for my branch foo knows nothing about the mergeinfo contained in iteration-002. Thus, if I try to 'upgrade' my branch foo to iteration-002, subversion will try to merge from the start of iteration-002 all the way to HEAD. This obviously won't work because it will try to re-apply changes that have already been applied to iteration-002 and conflict madness will ensue.

The only solution I've been able to come up with for this problem is to just use revision numbers when doing that first merge of iteration-002 into my branch foo. This is completely counterintuitive to having merge tracking. Therefore, I take the point where trunk was 'reset' into iteration-002 and do the merge manually with revision numbers. After that, I also need to clean up the mergeinfo to make sure it is solid and future mergeinfo merges will work. (cd mybranch; svn merge -r2334:HEAD ^/branches/iteration-002 .; svn merge --record-only ^/branches/iteration-002; svn ci).

I've watched the video from Linus about git. I'm a total convert, the subversion developers really screwed the pooch with the choices they made. (Sidenote: I worked at CollabNet during this period and was actually at some of the discussions.) The issue for me now is that in a corporate environment with 25+ developers distributed all around the world, just switching to git is not an easy task. It isn't like I can just take a 2gig repository of files and import it into git and tell people to switch. Training people who barely know subversion on how to use git is a daunting issue. We also have a lot of prior integration with subversion, such as our issue tracking, reviewboard, commit emails, commit hooks, etc. that would all need to be re-integrated.

So, for now, I keep documenting all of these subversion gotchas in our wiki while exploring and learning more in my not-so-spare time about how to migrate to git. I'm writing this post with the hope that if someone else reads it in time, they will choose to use git instead of making the subversion mistake.

Wednesday, March 9, 2011

Fix missing source for Java Mac OS X 10.6 Update 4

Once again, Apple does something stupid to us poor idiot Java developers. This will make clicking through to the JDK source work in Eclipse again after updating to the latest Java for Mac OS X.
  1. Go to http://connect.apple.com and download Java for Mac OS X 10.6 Update 4 Developer Package
  2. Install it.
  3. Open a Terminal.app window
  4. cd /System/Library/Frameworks/JavaVM.framework/Home
  5. sudo ln -s /Library/Java/JavaVirtualMachines/1.6.0_24-b07-334.jdk/Contents/Home/src.jar .
  6. sudo ln -s /Library/Java/JavaVirtualMachines/1.6.0_24-b07-334.jdk/Contents/Home/docs.jar .

Sunday, January 30, 2011

Why .deb for packaging and deployment?

Using Debian packages (.deb files) to package and deploy software across a large number of machines is a really powerful approach. It can be a bit esoteric (what isn't in unix land?), but once you get the general idea of how it works, you will be amazed at how cool it is. Obviously there are a bunch of systems out there for doing this packaging/install process, so why pick Debian over something else?

Pros:
  • The online documentation is solid and easy to find.
  • It employs a set of best practices according to well defined documentation...
  • An extremely thorough lint mechanism that checks that the debian you have built is valid and fits into those best practices.
  • No DSL. Doesn't require learning anything more than bash/make/ant or hiring expensive consultants.
  • The process around installing/updating debians on machines can be easily scripted with bash.
  • Can be easily tested on a local vmware ubuntu image.
  • Security: md5sums of all files, GPG-signed .deb files and keys, and automatic validation.
  • The distribution model is appropriate for my iterative development style. Add a config file to a server that points to the branch (trunk/iteration) you want to install and the system will automatically choose the latest version of the package and install it.
  • It is a pull based model. You log into the server you want to install software on, execute aptitude/apt-get, and it will connect via http to our central distribution host and download the latest version of the requested software from there. No need for jumphosts, and the signed files provide adequate levels of security.
  • There is a fairly sophisticated dependency mechanism and resolution system.
  • It is easily integrated with our Ant/Hudson build system and doesn't require a mess of servers/services to be running either on the build server or the servers we deploy to.
  • There are no limitations to what you can do on the system you are installing onto.
  • All of the configuration is done through a set of clearly defined small text files that are checked into subversion with each project.
  • It is easy to ask for user input and process that information as part of the installation process.
Cons:
  • It really only works if you are using some flavor of Debian (like Ubuntu) on your servers. Something I prefer, so less of an issue for me.
  • Like anything new, it can be complicated to get up to speed on and takes analyzing existing debian files to understand how others choose to implement things. On the other hand, this is also a benefit... pick a package similar to what you want to install, see how someone else did it and then replicate that yourself.

Tools that I considered are:
  • cfengine - the old complicated beast. typically overkill. has its own DSL. Many processes running.
  • puppet - Less about distributing software and more about making sure all systems are configured correctly. Documentation is questionable. Requires learning a DSL. Not really necessary if you use kickstart for initial bootstrap and make it easy to deploy code to servers because if you need the machine reconfigured, you fix kickstart, wipe the box and reinstall. Requires a daemon on the box.
  • chef - Documentation is questionable. Requires learning a DSL + ruby and writing recipes for everything, no lint process, server is a mess of complicated brittle projects (so much so that opscode is pushing their 'platform' as the way to go), requires daemon on the box.
  • fabric - A closer fit as python is better than ruby for this, but I prefer the pull based method of distribution over using ssh keys.
  • capistrano - Fairly ruby-centric focus. Simple DSL. Push based system.
In future posts, I'll go through and talk about how .deb files are developed and deployed.