Wednesday, October 24, 2012

How to determine DKIM key length?

I read this article in Wired about a researcher who figured out that Google was using a weak (512-bit) key for its implementation of DKIM. It turns out this is old news as someone did the same thing to Facebook.

This got me wondering if the keys for our emailing service are long enough. But, how to easily determine the length of the key? Turns out it is kind of convoluted so I decided to repeat the info here for my own benefit when I forget about this later.

Take your public DKIM key (probably from your DNS TXT record). It looks like this one from Google:

20120113._domainkey.google.com. 86400 IN TXT "k=rsa\; p=MIIBIjANBgkqhkiG9w0BAQEFAAOCAQ8AMIIBCgKCAQEAp5kQ31/aZDreQqR9/ikNe00ywRvZBFHod6dja+Xdui4C1y8SVrkUMQQLOO49UA+ROm4evxAru5nGPbSl7WJzyGLl0z8Lt+qjGSa3+qxf4ZhDQ2chLS+2g0Nnzi6coUpF8r" "juvuWHWXnzpvLxE5TQdfgp8yziNWUqCXG/LBbgeGqCIpaQjlaA6GtPbJbh0jl1NcQLqrOmc2Kj2urNJAW+UPehVGzHal3bCtnNz55sajugRps1rO8lYdPamQjLEJhwaEg6/E50m58BVVdK3KHvQzrQBwfvm99mHLALJqkFHnhyKARLQf8tQMy8wVtIwY2vOUwwJxt3e0KcIX6NtnjSSwIDAQAB"

Save the p= part of the TXT record into a file (google.key), line wrapped to no more than about 78 columns (yes, it needs to be line wrapped or the openssl command used below breaks; 64 columns is the usual PEM convention). Google splits its key into two quoted strings, so I removed the " " that is embedded in the middle of the blob above. You will also need the BEGIN/END PUBLIC KEY lines around it.

-----BEGIN PUBLIC KEY-----
MIIBIjANBgkqhkiG9w0BAQEFAAOCAQ8AMIIBCgKCAQEAp5kQ31/aZDreQqR9/ikN
e00ywRvZBFHod6dja+Xdui4C1y8SVrkUMQQLOO49UA+ROm4evxAru5nGPbSl7WJz
yGLl0z8Lt+qjGSa3+qxf4ZhDQ2chLS+2g0Nnzi6coUpF8rjuvuWHWXnzpvLxE5TQ
dfgp8yziNWUqCXG/LBbgeGqCIpaQjlaA6GtPbJbh0jl1NcQLqrOmc2Kj2urNJAW+
UPehVGzHal3bCtnNz55sajugRps1rO8lYdPamQjLEJhwaEg6/E50m58BVVdK3KHv
QzrQBwfvm99mHLALJqkFHnhyKARLQf8tQMy8wVtIwY2vOUwwJxt3e0KcIX6NtnjS
SwIDAQAB
-----END PUBLIC KEY-----

Then run openssl rsa -noout -text -pubin < google.key

Which then outputs this chunk with your answer:

Modulus (2048 bit):
    00:a7:99:10:df:5f:da:64:3a:de:42:a4:7d:fe:29:
    0d:7b:4d:32:c1:1b:d9:04:51:e8:77:a7:63:6b:e5:
    dd:ba:2e:02:d7:2f:12:56:b9:14:31:04:0b:38:ee:
    3d:50:0f:91:3a:6e:1e:bf:10:2b:bb:99:c6:3d:b4:
    a5:ed:62:73:c8:62:e5:d3:3f:0b:b7:ea:a3:19:26:
    b7:fa:ac:5f:e1:98:43:43:67:21:2d:2f:b6:83:43:
    67:ce:2e:9c:a1:4a:45:f2:b8:ee:be:e5:87:59:79:
    f3:a6:f2:f1:13:94:d0:75:f8:29:f3:2c:e2:35:65:
    2a:09:71:bf:2c:16:e0:78:6a:82:22:96:90:8e:56:
    80:e8:6b:4f:6c:96:e1:d2:39:75:35:c4:0b:aa:b3:
    a6:73:62:a3:da:ea:cd:24:05:be:50:f7:a1:54:6c:
    c7:6a:5d:db:0a:d9:cd:cf:9e:6c:6a:3b:a0:46:9b:
    35:ac:ef:25:61:d3:da:99:08:cb:10:98:70:68:48:
    3a:fc:4e:74:9b:9f:01:55:57:4a:dc:a1:ef:43:3a:
    d0:07:07:ef:9b:df:66:1c:b0:0b:26:a9:05:1e:78:
    72:28:04:4b:41:ff:2d:40:cc:bc:c1:5b:48:c1:8d:
    af:39:4c:30:27:1b:77:7b:42:9c:21:7e:8d:b6:78:
    d2:4b
Exponent: 65537 (0x10001)
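
For a scripted version of the same check, here is a minimal Node.js sketch. It assumes a reasonably recent Node release (the asymmetricKeyDetails property on key objects is missing from old versions), and the function names are made up for illustration. It takes the raw TXT record value, pulls out the p= tag, wraps it as PEM and asks Node for the modulus length.

```javascript
const crypto = require('crypto');

// Pull the base64 public key out of a DKIM TXT record value.
// Quotes, backslashes and whitespace are dropped first, which also
// rejoins keys that DNS split into several quoted strings.
function extractDkimKey(txtValue) {
  const cleaned = txtValue.replace(/["\\\s]/g, '');
  const tag = cleaned.split(';').find((t) => t.startsWith('p='));
  if (!tag || tag.length === 2) {
    throw new Error('no usable p= tag found');
  }
  return tag.slice(2);
}

// Wrap the base64 blob at 64 columns and add the BEGIN/END lines.
function toPem(base64Key) {
  const body = base64Key.match(/.{1,64}/g).join('\n');
  return `-----BEGIN PUBLIC KEY-----\n${body}\n-----END PUBLIC KEY-----\n`;
}

// Return the RSA modulus length in bits for a DKIM TXT record value.
function dkimKeyBits(txtValue) {
  const key = crypto.createPublicKey(toPem(extractDkimKey(txtValue)));
  return key.asymmetricKeyDetails.modulusLength;
}
```

Calling dkimKeyBits() with the quoted TXT value from the record above should report the same bit length that openssl prints on its Modulus line.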

Wednesday, May 9, 2012

Scrape webpages with node.js

I recently had the task of scraping data from a website so I chose to use node.js in order to get a bit more experience with it. After checking out a few different options for scraping, I finally settled on the node.io project which provided the most robust handling and configuration features that I could find.

The hard part about scraping data from websites is coming up with ways to quickly and reliably pick out pieces from the document object model (DOM). These days, I spend a lot of time using the jQuery selector syntax to develop my site which means that ideally I'd find a solution that can download a webpage and then provide me with jQuery-like functions and selectors to pick out pieces from the DOM. For this purpose, node.io uses a project called node-soupselect by default, but I found the selector syntax to be lacking. Thus, I layered another project called cheerio on top. Whatever you do, don't use jsdom as it is too slow and very strict in its processing of html.

Node.io stands out from the rest of the projects because it applies a 'jobs' approach to scraping. This is something that I used in another project of mine and it worked out really well. In other words, you write a 'job' which gets executed and if there is a failure during the run of the job, you have control over what you do next (skip, fail, retry).

In order to develop and debug your code, you shouldn't continually hit the website that you are scraping from. What I do is download the page I want to scrape and then run a local webserver to serve up the file. I found the python webserver the easiest to use as it is just one simple command, python -m SimpleHTTPServer 8000 (on Python 3, python3 -m http.server 8000), which will serve up files from whichever directory you run that command from.

For debugging code, I recommend setting up the excellent node-inspector which will allow you to set up breakpoints in your code and step through to inspect objects as necessary, just like you do with web page JavaScript development. This becomes invaluable with JavaScript because the lack of types makes it hard to know what properties objects have. For logging output to the console so that I could keep track of the execution, I ended up with the nlogger project which I wasn't super happy with, but which worked well enough for this project.

Writing your first job is easy and if you are a CoffeeScript (CS) fan, node.io will automatically compile your CS files for you.  If you aren't a CS fan, I apologize as my example is in CS. This simple job @get's a page and selects the <title> element:

nodeio = require 'node.io'
cheerio = require 'cheerio'
log = require('nlogger').logger(module)

count = 0
class InputScraper extends nodeio.JobClass
    input: [476,1184]
    run: (inputId) ->
        @get "http://localhost:8000/#{inputId}.html", (err, data, headers) =>
            return @exit(err) if err?
            try
                log.info('Started: {}', inputId)
                $ = cheerio.load(data)
                @emit($('title').text())
                log.info("(#{count}) Finished inputId: #{inputId}")
                count++
            catch error
                log.error('Error: {} : {}', inputId, error.stack)
                @skip()
    output: (data) ->
        console.log(data)
@class = InputScraper
@job = new InputScraper({spoof:true, max: 1})

nodeio.start(@job)

The 'run' function is called for each piece of input data in the array (476, 1184). @get() grabs the html page data. On success, @get() executes the callback function which loads the data into cheerio and @emit()'s the title. When the 'run' function is complete, node.io calls output which logs the @emit() data to the console.

In my own code, on the line above the @emit(), I pass the $ cheerio object into another class function that handles all of the parsing and returns an object, which I then pass into @emit(). This allows me to re-use the InputScraper boilerplate to parse all sorts of different pages.

I also run things directly with node 'nodeio.start(@job)' so that I can use the node-inspector more easily. This means that I also end up actually compiling the CS myself using my answer on StackOverflow.

Obviously, this is a pretty simple example, but it should get you up and running with the node.io framework quickly. The speed is fairly impressive: on my desktop and home network, I was able to crawl and extract data from about 2000 webpages using max: 20 in about 1.5 minutes, or roughly 22 pages per second. Most of the time is spent downloading the pages; the parsing only takes a few milliseconds.

Thursday, March 29, 2012

The first comprehensive review of 22 online athletic event registration services. The Good, the Bad, and the Ugly.

Before we decided to build Voost, we spent a lot of time studying the myriad registration services already available online. There are a staggeringly large number of choices with wildly differing levels of sophistication. Comparing them is incredibly difficult, in no small part due to the fact that many of these websites seem to go out of their way to hide critical information like fees and disbursement schedules.

We spent over a hundred hours on tedious research: creating accounts, combing through documentation and FAQs, setting up test events, going through registration processes, calculating fees, etc. While we certainly have biases, we have tried to present the information as objectively as possible - this isn't a cheap marketing gimmick designed to show green checkboxes for us and red Xes for our competitors. We freely admit that there are features other services have that Voost does not (yet!) - and this information is reflected in the table.

The table is not complete. We have left fields blank where we just couldn't figure out the answers (and believe me, we tried). Despite our best efforts, there may also be errors - again, some websites seem to deliberately befuddle attempts at objective comparison.



Wednesday, January 25, 2012

GitHire spam again...

Just keeping this around for posterity since they deleted the HN posting, and maybe others will want to see the truth if they are googling around for why they are getting spam from these guys.

http://news.ycombinator.com/item?id=3508655

I called them out for being the spammers that they are, and I got a rather odd response:
Hi LatchKey,
I'm really sorry that we sent you that email. We just launched a little over a week ago with this crazy idea, and were extremely surprised at how quickly we were overwhelmed with orders.
We made a bad judgement call in sending some emails to people asking if anyone is interested in jobs.
If it makes you feel any better, you can see that we aren't finding very many talented engineers, and we will likely need to refund a lot of money in a few days.
We are honestly trying to be a great service for software developers and employers. We need feedback from people like yourself to learn how we can be the best service possible to reshape the hiring industry.
We actually sent you an email, but never heard back. Please let us know if you're interested in continuing this discussion further on or off of a public message board.
Thanks for keeping us honest.
The HN thread goes on with a lot of people pointing out their own issues with GitHire and I have my response to the above quote here: http://news.ycombinator.com/item?id=3508750

Sorry, but I just don't have any tolerance for spammers or people trying to profit off my profile without my permission.

Sunday, January 22, 2012

Scalable System Architecture Comedy

I was reading Scaling a PHP MySQL Web Application, which is a technical document published on the Oracle website. As I was scrolling down the page, I saw the typical Load balancing Figure 1 that you always see for any PHP/MySQL web application.


But then, as the article goes on, it gets more entertaining. It goes on to Figure 3, showing Multiple MySQL Slaves, which is now 4 machines.

But wait, there's more. Now you need a dedicated database slave for each Web server, so the picture expands to even more lines and arrows in Figure 4. A total of 8 machines.


As you keep scrolling, you get to Figure 5. A real gem of an image. Arrows in every direction. Arrows jumping through other arrows. 8 machines, but a completely incomprehensible image.


Ok, now we've randomized the connections between all the web servers and database slaves.

Could you imagine one of these machines going down or throwing errors and trying to figure out which one it is or how it connects to the other machines?

We all know that as systems grow, they get more complex. That said, if you draw an incomprehensible picture of your architecture, it is a clear sign that you are doing it wrong.

Monday, January 16, 2012

GitHire.com is a spammer

I've kept up with all the recent Hacker News articles on GitHire. They seemed like a rather interesting service because I believe that hosting your projects on a site like GitHub is the best kind of resume a software engineer can have.

That is, until they just spammed me with a whole list of completely random and unrelated jobs. I could understand it if I signed up to their site and requested spam (aka: LinkedIn), but I'm definitely not interested as I'm in the middle of starting my own company!


After checking out their site, I realized they've also got a profile up for me that I had no hand in creating, nor any desire to have. Apologies if that link stops working, hopefully they will remove my information soon. Maybe I should be pissed that I'm only in their Top 50%? ;-)


Since I wanted to remove myself, I clicked the "Opt out of Githire" link, which then took me to a page on GitHub to authorize their application?!?


Hell no, I'm not going to authorize your application, just so I can opt out of your website. That is wrong on so many levels.

Anyway, I cc'd support@github on my response to 'Steve', so hopefully they will be going away soon. I can't see how GitHub is allowing this company to exist when it so clearly violates GitHub's terms of service.

Thursday, January 5, 2012

Going on 6 months now...

Here is a bit of a status update for the new year:

We've been working full steam ahead on Voost, going on 6 months now. In that timeframe, Jeff and I have accomplished an impressive amount of work for just two people. I've been putting in 10-15 hour days, seven days a week, of solid coding and Jeff has been doing the same. When he or I are out of town, we sit on Skype all day (and night) long working through any issues we have and bouncing ideas off each other. It has been a hugely productive cooperative development effort.

I've become a much better UX/UI designer than when I started. It had been years since I had worked on this side of things and it has been a lot of fun picking it back up. I've also become an absolute expert in CoffeeScript, jQuery, Less and all of the other hot front end technologies that are out there. On the back end, we've integrated with BrowserID for secure sign in as an alternative to Facebook Connect. We've also switched to Objectify 4 which is the most advanced way to interact with the Google App Engine database backend.

The sad face news is we have nothing public to show for all of this hard work quite yet. I could go on with a list of reasons, but they aren't really worth going over in detail. Suffice it to say, we just aren't ready to launch. I'd say that we are about 85-90% of the way there. Hopefully not more than about a month or so. For a few of my friends, what we have is enough and they are pressuring me to just put something out there, even if it is incomplete or buggy. I'm pushing back on them.

While I realize the cycling season is quickly picking up, I'm not in a huge hurry. Thankfully, after years of penny pinching, I have enough savings left to last me until we do launch. I really want to do this right. I want all of my cards on the table. I want people to wonder why nobody has done a site this good before. As cheesy as it sounds, I expect something close to perfection, even if it isn't absolutely feature complete. I think of how the original iPhone disrupted the cell phone market. We went from the clam shells and keyboards to touch screens overnight. It may sound silly, but I'm passionate about doing something like that with the event registration market.

Even without all of the features that other companies have, our application is light years more advanced than any other registration product out there. I know this because I've seen their systems, analyzed everything wrong with them and spent the time to come up with a vastly better designed user experience. This takes a lot of hard work and this will be a huge differentiator in the marketplace for us. I'm very proud of that fact. It will be very expensive and nearly impossible for our competitors to hire enough engineering talent to catch up with us.

A question I get a lot is: do you have any customers? Well, we don't. Not yet. I'm ok with that because I do have enough contacts and relationships to get the word out there to promoters. I think that people also really want this product, so when we launch, it will almost sell itself. I can't tell you how many times I've heard 'I hate XYZ's excessive fees!' and 'This XYZ registration site is so difficult to use!'

Besides, the cycling community, our initial target audience, is very small and I don't want to really start pressuring promoters to try out a system that isn't launched yet. I sure wouldn't trust anyone who doesn't have a live product. On the flip side, if I was a competitor, I'd be really scared of us right now. We are going to be very hungry for customers and it will be that much harder to compete when we have a better product with better pricing.

A bit of good news is that we are close to having a great company logo. We put a bounty up on one of those crowd sourced websites full of designers and got a number of excellent designs, out of over a hundred submissions. We are in the process of choosing the final one over the next couple of days. I look forward to announcing it.

Thanks for listening. Thanks to all my friends and family for the encouragement and advice. Thanks to my wife for putting up with me working all the time. Thanks to everyone who has offered to help. Expect another update soon. This is going to be a lot of fun!

Sunday, December 18, 2011

Don't use the jQuery .data() method. Use .attr() instead.

I just discovered the hard way that the jQuery .data() method is horribly broken. By design, it attempts to convert whatever you put into it into a native type.

I've got a template where I'm generating a button with a data-key attribute:

<button id="fooButton" data-key="1.4000">Click me to edit</button>

http://jsfiddle.net/KwjvA/

It looks like a float, and one could assume that 1.4000 === 1.4, but what I really want here is the string 1.4000. Instead, $('#fooButton').data('key') hands back the number 1.4. I certainly don't want my input to be modified. One suggestion I found in the issue tracker is to put single quotes around the field (ie: data-key="'1.4000'"). That seems rather absurd as well.
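
To make the conversion concrete, here is a rough model in plain JavaScript of what happens to a data-* attribute value on its way out of .data(). This is not jQuery's source, just a simplified sketch of the behavior described in this post (JSON-looking values are left out entirely), and the function name is made up for illustration.

```javascript
// Simplified model (not jQuery's actual code) of the type conversion
// applied to a data-* attribute string when it is read through .data().
function convertLikeData(raw) {
  if (raw === 'true') return true;
  if (raw === 'false') return false;
  if (raw === 'null') return null;
  // Anything that looks numeric is turned into a number, which is where
  // the original formatting of the string gets lost.
  if (raw.trim() !== '' && isFinite(raw)) return parseFloat(raw);
  return raw;
}

const attrValue = '1.4000';                    // the raw attribute string
const dataValue = convertLikeData(attrValue);  // the converted value

console.log(typeof attrValue, attrValue);  // string 1.4000
console.log(typeof dataValue, dataValue);  // number 1.4
```

Once the value has become the number 1.4, there is no way to recover the trailing zeros, which is exactly the problem if the attribute was meant to be an opaque key.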

The only reason why I'm warning about this here is that I've seen a bunch of libraries using .data() to store stuff in elements in the DOM. I think it is a really bad idea to have a method called .data(), where you expect to be able to store something and get back exactly what you put in, that quietly changes your values instead.

The recommended alternative is to use .attr(). The problem with this is that while it achieves the same effect, it is actually much different functionality from .data(). .data() stores information within an internal cache within jQuery, while .attr() calls element.setAttribute().

I read through several bug reports on the jQuery website, where people are also confused by this behavior and all of them get closed with a wontfix. I see this as a terrible choice. Yuck.

Update: Here is a bug I just filed, hopefully that explains things better to the people who seem to be having a hard time understanding what I'm talking about: http://bugs.jquery.com/ticket/11060

Thursday, December 8, 2011

Optimizing Web Application JavaScript Delivery

For my new company, I had the following design goals for my heavy use of JavaScript (JS):
  • I'm using CoffeeScript (CS), so I need to have my IDE automatically compile CS to JS when I save. That way the whole Change file, Save, Reload the browser process works cleanly.
  • Be able to split the files up depending on which page is loaded so that only the JS which is needed for the page is sent to the client.
  • Be able to differentiate JS files to be loaded between different states such as logged in, logged out and both. That way, once someone is already logged in, the JS which controls the login and forgot password dialog does not get served again. The flip side is that JS for logged in pages is not served up to anonymous users.
  • In development mode, have everything un-minimized, but in production mode, automatically minimize everything.
  • Run all of the JS through the Closure compiler regardless of dev/prod so that I know that things that work in dev also work in prod.
  • Limit the number of <script> tags to the bare minimum. Ideally, 2-3 for .js files served from my site and not directly off of a CDN. Fewer loads means less network traffic.
  • Be able to transparently support new code / application versions so that when I upgrade the application, the browsers dump their cached copies of my files.
  • No dependencies on external xml, json, property or other configuration file formats to implement the goals above. Everything should be configured by someone editing the html templates.
In order to accomplish the goals above, I first looked at a bunch of solutions, but they all failed in various ways. So, I started on my own and went through various iterations before I came up with the ideal solution which I think is pretty unique and easy.

Let's start off by talking about one of the tools I'm using. LabJS enables me to load JS only when I need it. As part of the 'master' template which contains the skin for all pages, at the very bottom before the closing </body> tag I have something that looks like this:

<script src="/js/LAB.min.js"></script>
<script>
    var country = "${country}";
    var fbAppId = "${fb.APP_ID}";

    var js = '${tool.jsbuilder(
        me != null,
        'json2:both',
        'handlebars.1.0.0.beta.master:both',
        'bootstrap-twipsy:both',
        'bootstrap-popover:both',
        'gen/global/page:both',
        'gen/global/common:both',
        'gen/global/search:both',
        'gen/modal/loginDialog:out', // ! logged in
        'gen/global/loggedInMenu:in', // logged in
        'gen/global/master:both'
        )}';

    var lab = $LAB
        .script("//ajax.googleapis.com/ajax/libs/jquery/1.7/jquery.min.js")
        .wait()
        .script("//ajax.googleapis.com/ajax/libs/jqueryui/1.8.16/jquery-ui.js")
        .script("//connect.facebook.net/en_US/all.js")
        .script("//apis.google.com/js/plusone.js")
        .script("//platform.twitter.com/widgets.js")
        .wait()
        .script(js)
        ;

    // Variable "pagecode" should be a function that takes a LAB and does any page-specific loading
    if (typeof pagecode === 'function')
        pagecode(lab);
</script>

Since I'm using Cambridge Templates with JEXL to process things first, the ${tool.jsbuilder(...)} section runs some Java code which does a lot of the magic during the rendering portion of the page. The first argument is a boolean to indicate whether or not I'm logged in. 'me' is an object in the context and if it is null I'm not logged in. The rest of the arguments are String[]. The method signature looks like this:

    public String jsbuilder(boolean loggedIn, String[] files);

What happens in that method is that it will parse the array of Strings, and based on a setting of 'in' for logged in, 'out' for logged out, or 'both' for either logged in or out, it will compare that to the loggedIn boolean and either load the appropriate JS file or not. The files are then loaded into memory, in order, and sent through the Closure compiler to minify the code.

The output from Closure is then cached in a global HashMap which is never cleared out. (Note: for languages that don't really persist memory between requests, like PHP, you can store this data in something like memcached).

The key of the Map is generated from an MD5 hash of the list of filenames combined together + application version. The hash looks something like this: be712950814b2ccc6b92ff5c3. This hash is the String that is returned from the jsbuilder method. Using the names + application version ensures that a new hash will be generated each time the application is upgraded.

In dev mode, the Map isn't used at all. The code is generated for each request, which ensures that my changes get immediately reflected in the browser. In production, the Map is first checked for the key and if it exists, the key is immediately returned from the jsbuilder method.

The final rendered page looks something like this to the web browser:

    var js = '/js2/be712950814b2ccc6b92ff5c3.js';

When the LabJS code executes in the browser and loads my script with the line .script(js), there is a Servlet listening for requests to /js2/*.js and it looks up the key from the url in the Map and returns the appropriate JS data. This servlet can also set the correct browser cache headers depending on dev or prod.
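
The flow described above (filter by login state, hash the names plus version, cache the bundle, look it up by URL) can be sketched roughly as follows. This is a JavaScript illustration rather than the Java the real jsbuilder is written in. Every name here is hypothetical, the loadAndMinify callback stands in for reading the files and running Closure, and rebuilding the entry on each call in dev mode is my assumption about how "not using the Map" would play out.

```javascript
const crypto = require('crypto');

const APP_VERSION = '1.0.0';  // hypothetical application version
const bundles = new Map();    // hash key -> combined, minified JS

// Keep only the files whose state matches the current login state.
function selectFiles(loggedIn, specs) {
  const wanted = loggedIn ? 'in' : 'out';
  return specs
    .map((spec) => {
      const idx = spec.lastIndexOf(':');
      return { name: spec.slice(0, idx), state: spec.slice(idx + 1) };
    })
    .filter(({ state }) => state === 'both' || state === wanted)
    .map(({ name }) => name);
}

// Hash the ordered file names plus the version, so an upgrade yields a new URL.
function bundleKey(names, version) {
  return crypto.createHash('md5').update(names.join(',') + version).digest('hex');
}

// Return the URL the page should load, building the bundle if needed.
function jsbuilder(loggedIn, specs, loadAndMinify, devMode = false) {
  const names = selectFiles(loggedIn, specs);
  const key = bundleKey(names, APP_VERSION);
  if (devMode || !bundles.has(key)) {
    bundles.set(key, loadAndMinify(names));
  }
  return `/js2/${key}.js`;
}

// What the /js2/*.js handler would do: map the URL back to the cached JS.
function lookupBundle(urlPath) {
  const match = /^\/js2\/([0-9a-f]+)\.js$/.exec(urlPath);
  return match ? bundles.get(match[1]) : undefined;
}
```

Because the key depends only on the selected file names and the version, logged in and logged out visitors naturally get different bundle URLs, and each distinct combination is only built once in production.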

As you can see, 10+ separate files have been combined and minified into a single file which makes the requests more efficient. All without configuration files or a crazy syntax that only a backend developer can understand.

If I wanted to split the JS files into more loads so that the browser can take advantage of concurrent loading, I could do that as well by just creating more calls to jsbuilder. That is effectively what is happening in the pagecode section near the end of the <script> block above. The body template, which is loaded into the master template by Cambridge, can optionally define a JS function called pagecode. When it executes, it calls lab.script() again with similar output from the jsbuilder tool. This allows me to split up my code so that there is global code as well as page specific code.

Enjoy.