Wednesday, October 24, 2012

How to determine DKIM key length?

I read this article in Wired about a researcher who figured out that Google was using a weak (512-bit) key for its implementation of DKIM. It turns out this is old news as someone did the same thing to Facebook.

This got me wondering if the keys for our emailing service are long enough. But, how to easily determine the length of the key? Turns out it is kind of convoluted so I decided to repeat the info here for my own benefit when I forget about this later.

Take your public DKIM key (probably from your DNS TXT record). It looks like this one from Google: 86400 IN TXT "k=rsa\; p=MIIBIjANBgkqhkiG9w0BAQEFAAOCAQ8AMIIBCgKCAQEAp5kQ31/aZDreQqR9/ikNe00ywRvZBFHod6dja+Xdui4C1y8SVrkUMQQLOO49UA+ROm4evxAru5nGPbSl7WJzyGLl0z8Lt+qjGSa3+qxf4ZhDQ2chLS+2g0Nnzi6coUpF8r" "juvuWHWXnzpvLxE5TQdfgp8yziNWUqCXG/LBbgeGqCIpaQjlaA6GtPbJbh0jl1NcQLqrOmc2Kj2urNJAW+UPehVGzHal3bCtnNz55sajugRps1rO8lYdPamQjLEJhwaEg6/E50m58BVVdK3KHvQzrQBwfvm99mHLALJqkFHnhyKARLQf8tQMy8wVtIwY2vOUwwJxt3e0KcIX6NtnjSSwIDAQAB"

Save the p= part of the TXT record into a file (google.key) that is line wrapped to around 78 columns (yes, it needs to be line wrapped or the openssl command used below breaks). Google's key is stored in two parts (a single TXT string is limited to 255 characters), so I removed the " " that is embedded in the middle of the blob above. You will also need to wrap the key in the BEGIN/END PUBLIC KEY lines:

-----BEGIN PUBLIC KEY-----
(the p= value, line wrapped)
-----END PUBLIC KEY-----

Then run openssl rsa -noout -text -pubin < google.key

Which then outputs this chunk with your answer:

Modulus (2048 bit):
Exponent: 65537 (0x10001)
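
If you would rather skip the file wrapping entirely, the same answer can be pulled straight out of the p= value with a few lines of Python. This is only a rough sketch, assuming an RSA key in the usual SubjectPublicKeyInfo encoding (which is what the openssl -pubin command above expects). The names dkim_key_bits and _read_tlv are my own hypothetical helpers, and the parser reads just enough DER to reach the modulus.

```python
import base64


def _read_tlv(buf, pos):
    """Return (tag, value_start, value_end) for the DER element at pos."""
    tag = buf[pos]
    length = buf[pos + 1]
    pos += 2
    if length & 0x80:
        # Long form: the low 7 bits say how many length bytes follow.
        n = length & 0x7F
        length = int.from_bytes(buf[pos:pos + n], "big")
        pos += n
    return tag, pos, pos + length


def dkim_key_bits(p_value):
    """Bit length of the RSA modulus in a DKIM p= value (assumes k=rsa)."""
    # Long TXT records are split into quoted chunks; drop quotes and whitespace.
    cleaned = "".join(p_value.replace('"', "").split())
    der = base64.b64decode(cleaned)

    _, spki_start, _ = _read_tlv(der, 0)             # outer SEQUENCE
    _, _, alg_end = _read_tlv(der, spki_start)       # algorithm identifier, skipped
    tag, bits_start, _ = _read_tlv(der, alg_end)     # BIT STRING wrapping the key
    if tag != 0x03:
        raise ValueError("expected a BIT STRING, is this an RSA public key?")
    # The first byte of a BIT STRING is the unused-bits count; skip it.
    _, rsa_start, _ = _read_tlv(der, bits_start + 1) # inner SEQUENCE
    tag, mod_start, mod_end = _read_tlv(der, rsa_start)
    if tag != 0x02:
        raise ValueError("expected an INTEGER modulus")
    return int.from_bytes(der[mod_start:mod_end], "big").bit_length()
```

Calling dkim_key_bits() with everything after p= (quotes and the embedded " " included) should report the same bit count that openssl prints on the Modulus line.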

Wednesday, May 9, 2012

Scrape webpages with node.js

I recently had the task of scraping data from a website, so I chose to use node.js in order to get a bit more experience with it. After checking out a few different options for scraping, I finally settled on node.io, the project which provided the most robust handling and configuration features that I could find.

The hard part about scraping data from websites is coming up with ways to quickly and reliably pick out pieces from the document object model (DOM). These days, I spend a lot of time using the jQuery selector syntax to develop my site, which means that ideally I'd find a solution that can download a webpage and then provide me with jQuery-like functions and selectors to pick out pieces from the DOM. For this purpose, node.io uses a project called node-soupselect by default, but I found the selector syntax to be lacking. Thus, I layered another project called cheerio on top. Whatever you do, don't use jsdom as it is too slow and very strict in its processing of html.

node.io stands out from the rest of the projects because it applies a 'jobs' approach to scraping. This is something that I used in another project of mine and it worked out really well. In other words, you write a 'job' which gets executed and, if there is a failure during the run of the job, you have control over what you do next (skip, fail, retry).
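
To make the skip/fail/retry idea concrete, here is a small language-neutral sketch in Python. It is not the framework's own job API, just an illustration of the pattern: run_jobs, on_error and max_retries are hypothetical names I made up for this example.

```python
def run_jobs(inputs, job, on_error, max_retries=2):
    """Run job(item) for every input and let on_error decide what to do on failure.

    on_error(item, exc, attempt) returns "retry", "skip" or "fail".
    """
    results = []
    for item in inputs:
        attempt = 0
        while True:
            try:
                results.append(job(item))
                break
            except Exception as exc:
                action = on_error(item, exc, attempt)
                if action == "retry" and attempt < max_retries:
                    attempt += 1
                    continue
                if action == "fail":
                    raise  # abort the whole run
                break  # "skip", or retries are used up: move on to the next input
    return results
```

The point is that the decision lives in one place (on_error) instead of being scattered through the scraping code, so a flaky page can be retried a couple of times while a genuinely broken one is skipped without killing the run.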

In order to develop and debug your code, you shouldn't continually hit the website that you are scraping from. What I do is download the page I want to scrape and then run a local webserver to serve up the file. I found the python webserver the easiest to use as it is just one simple command, python -m SimpleHTTPServer 8000, which will serve up files from whichever directory you run it from. (On Python 3, the equivalent is python3 -m http.server 8000.)
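
The same local server can also be started from a script, which is handy if you want it to come up and go away with your test run. A minimal sketch, assuming Python 3.7 or later; serve_directory is a hypothetical helper name, not part of any library.

```python
import functools
import http.server
import threading


def serve_directory(directory, port=8000):
    """Serve files from `directory` on localhost in a background thread."""
    handler = functools.partial(
        http.server.SimpleHTTPRequestHandler, directory=directory
    )
    server = http.server.ThreadingHTTPServer(("127.0.0.1", port), handler)
    # daemon=True so a forgotten server does not keep the process alive.
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server
```

Call server.shutdown() followed by server.server_close() when you are done. Passing port=0 lets the operating system pick a free port, which you can then read from server.server_address.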

For debugging code, I recommend setting up the excellent node-inspector, which will allow you to set up breakpoints in your code and step through to inspect objects as necessary, just like you do with web page JavaScript development. This becomes invaluable with JavaScript because the lack of types makes it hard to know what properties objects have. For logging output to the console so that I could keep track of the execution, I ended up with the nlogger project, which I wasn't super happy with, but which worked well enough for this project.

Writing your first job is easy and, if you are a CoffeeScript (CS) fan, node.io will automatically compile your CS files for you. If you aren't a CS fan, I apologize, as my example is in CS. This simple job @get's a page and selects the <title> element:

nodeio = require 'node.io'
cheerio = require 'cheerio'
log = require('nlogger').logger(module)

count = 0
class InputScraper extends nodeio.JobClass
    # Each element of input is passed to run() in turn
    input: [476,1184]
    run: (inputId) ->
        @get "http://localhost:8000/#{inputId}.html", (err, data, headers) =>
            return @exit(err) if err?
            try
                log.info('Started: {}', inputId)
                $ = cheerio.load(data)
                # Hand the page title to output()
                @emit($('title').text())
                count++
                log.info("(#{count}) Finished inputId: #{inputId}")
            catch error
                log.error('Error: {} : {}', inputId, error.stack)
    output: (data) ->
        log.info(data)
@class = InputScraper
@job = new InputScraper({spoof:true, max: 1})


The 'run' function is called for each piece of input data in the array (476, 1184). @get() grabs the html page data. On success, @get() executes the callback function, which loads the data into cheerio and @emit()'s the title. When the 'run' function is complete, node.io calls output, which logs the @emit() data to the console.

In my own code, on the line above the @emit(), I call another class function, passing in the $ cheerio object. That function handles all of my parsing and returns an object that I pass into @emit(). This allows me to re-use the InputScraper boilerplate to parse all sorts of different pages.
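
The split between fetching boilerplate and a page-specific parse function is not tied to any one framework. Here is the same shape sketched in Python with only the standard library. The names parse_title and scrape are hypothetical, and the standard html.parser module stands in for cheerio, so treat it as an illustration of the design rather than a port of my code.

```python
from html.parser import HTMLParser


class _TitleParser(HTMLParser):
    """Collects the text found inside the <title> element."""

    def __init__(self):
        super().__init__()
        self._in_title = False
        self.title = ""

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


def parse_title(html):
    """Page-specific parsing: returns the object that would be emitted."""
    parser = _TitleParser()
    parser.feed(html)
    parser.close()
    return {"title": parser.title.strip()}


def scrape(pages, parse):
    """Reusable boilerplate: apply whichever parse function fits the pages."""
    return [parse(html) for html in pages]
```

Swapping in a different parse function is all it takes to handle a different kind of page; the loop around it never changes.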

I also run things directly with node 'nodeio.start(@job)' so that I can use the node-inspector more easily. This means that I also end up actually compiling the CS myself using my answer on StackOverflow.

Obviously, this is a pretty simple example, but it should get you up and running with the framework quickly. The speed of this is fairly impressive. On my desktop and home network, I was able to crawl and extract data from about 2000 webpages using max: 20 in about 1.5 minutes (roughly 22 pages per second). Most of the time is spent downloading the pages; the parsing only takes a few milliseconds.

Thursday, March 29, 2012

The first comprehensive review of 22 online athletic event registration services. The Good, the Bad, and the Ugly.

Before we decided to build Voost, we spent a lot of time studying the myriad registration services already available online. There are a staggeringly large number of choices with wildly differing levels of sophistication. Comparing them is incredibly difficult, in no small part due to the fact that many of these websites seem to go out of their way to hide critical information like fees and disbursement schedules.

This took over a hundred hours of tedious research: creating accounts, combing through documentation and FAQs, setting up test events, going through registration processes, calculating fees, etc. While we certainly have biases, we have tried to present the information as objectively as possible - this isn't a cheap marketing gimmick designed to show green checkboxes for us and red Xes for our competitors. We freely admit that there are features other services have that Voost does not (yet!) - and this information is reflected in the table.

The table is not complete. We have left fields blank where we just couldn't figure out the answers (and believe me, we tried). Despite our best efforts, there may also be errors - again, some websites seem to deliberately befuddle attempts at objective comparison.

Wednesday, January 25, 2012

GitHire spam again...

Just keeping this around for posterity since they deleted the HN posting, and maybe others will want to see the truth if they are googling around for why they are getting spam from these guys.

I called them out for being the spammers that they are, and I got a rather odd response:
Hi LatchKey,
I'm really sorry that we sent you that email. We just launched a little over a week ago with this crazy idea, and were extremely surprised at how quickly we were overwhelmed with orders.
We made a bad judgement call in sending some emails to people asking if anyone is interested in jobs.
If it makes you feel any better, you can see that we aren't finding very many talented engineers, and we will likely need to refund a lot of money in a few days.
We are honestly trying to be a great service for software developers and employers. We need feedback from people like yourself to learn how we can be the best service possible to reshape the hiring industry.
We actually sent you an email, but never heard back. Please let us know if you're interested in continuing this discussion further on or off of a public message board.
Thanks for keeping us honest.
The HN thread goes on with a lot of people pointing out their own issues with GitHire and I have my response to the above quote here:

Sorry, but I just don't have any tolerance for spammers or people trying to profit off my profile without my permission.

Sunday, January 22, 2012

Scalable System Architecture Comedy

I was reading Scaling a PHP MySQL Web Application, which is a technical document published on the Oracle website. As I was scrolling down the page, I saw the typical Load balancing Figure 1 that you always see for any PHP/MySQL web application.

But then, as the article goes on, it gets more entertaining. It goes on to Figure 3, showing Multiple MySQL Slaves, which is now 4 machines.

But wait, there's more. Now you need a dedicated database slave for each Web server, so the picture expands to even more lines and arrows in Figure 4. A total of 8 machines.

As you keep scrolling, you get to Figure 5. A real gem of an image. Arrows in every direction. Arrows jumping through other arrows. 8 machines, but a completely incomprehensible image.

Ok, now we've randomized the connections between all the web servers and database slaves.

Could you imagine one of these machines going down or throwing errors and trying to figure out which one it is or how it connects to the other machines?

We all know that as systems grow, they get more complex. That said, if you draw an incomprehensible picture of your architecture, it is a clear sign that you are doing it wrong.

Monday, January 16, 2012

GitHire is a spammer

I've kept up with all the recent Hacker News articles on GitHire. They seemed like a rather interesting service because I believe that hosting your projects on a site like GitHub is the best kind of resume a software engineer can have.

That is, until they just spammed me with a whole list of completely random and unrelated jobs. I could understand it if I signed up to their site and requested spam (aka: LinkedIn), but I'm definitely not interested as I'm in the middle of starting my own company!

After checking out their site, I realized they've also got a profile up for me that I had no hand in creating, nor any desire to have. Apologies if that link stops working; hopefully they will remove my information soon. Maybe I should be pissed that I'm only in their Top 50%? ;-)

Since I wanted to remove myself, I clicked the "Opt out of Githire" link, which then took me to a page on GitHub to authorize their application?!?

Hell no, I'm not going to authorize your application, just so I can opt out of your website. That is wrong on so many levels.

Anyway, I cc'd support@github on my response to 'Steve', so hopefully they will be going away soon. I can't see how GitHub is allowing this company to exist, when they so clearly violate their terms of service policy.

Thursday, January 5, 2012

Going on 6 months now...

Here is a bit of a status update for the new year:

We've been working full steam ahead on Voost, going on 6 months now. In that timeframe, Jeff and I have accomplished an impressive amount of work for just two people. I've been putting in 10-15 hour days, seven days a week, of solid coding and Jeff has been doing the same. When he or I are out of town, we sit on Skype all day (and night) long working through any issues we have and bouncing ideas off each other. It has been a hugely productive cooperative development effort.

I've become a much better UX/UI designer than I was when I started. It had been years since I had worked on this side of things and it has been a lot of fun picking it back up. I've also become an absolute expert in CoffeeScript, jQuery, LESS and all of the other hot front end technologies that are out there. On the back end, we've integrated with BrowserID for secure sign in as an alternative to Facebook Connect. We've also switched to Objectify 4, which is the most advanced way to interact with the Google App Engine database backend.

The sad face news is we have nothing public to show for all of this hard work quite yet. I could go on with a list of reasons, but they aren't really worth going over in detail. Suffice it to say, we just aren't ready to launch. I'd say that we are about 85-90% of the way there. Hopefully not more than about a month or so. For a few of my friends, what we have is enough and they are pressuring me to just put something out there, even if it is incomplete or buggy. I'm pushing back on them.

While I realize the cycling season is quickly picking up, I'm not in a huge hurry. Thankfully, after years of penny pinching, I have enough savings left to last me until we do launch. I really want to do this right. I want all of my cards on the table. I want people to wonder why nobody has done a site this good before. As cheesy as it sounds, I expect something close to perfection, even if it isn't absolutely feature complete. I think of how the original iPhone disrupted the cell phone market. We went from the clam shells and keyboards to touch screens overnight. It may sound silly, but I'm passionate about doing something like that with the event registration market.

Even without all of the features that other companies have, our application is light years more advanced than any other registration product out there. I know this because I've seen their systems, analyzed everything wrong with them and spent the time to come up with a vastly better designed user experience. This takes a lot of hard work and this will be a huge differentiator in the marketplace for us. I'm very proud of that fact. It will be very expensive and nearly impossible for our competitors to hire enough engineering talent to catch up with us.

A question I get a lot is: do you have any customers? Well, we don't. Not yet. I'm ok with that because I do have enough contacts and relationships to get the word out there to promoters. I think that people also really want this product, so when we launch, it will almost sell itself. I can't tell you how many times I've heard 'I hate XYZ's excessive fees!' and 'This XYZ registration site is so difficult to use!'

Besides, the cycling community, our initial target audience, is very small and I don't want to really start pressuring promoters to try out a system that isn't launched yet. I sure wouldn't trust anyone who doesn't have a live product. On the flip side, if I was a competitor, I'd be really scared of us right now. We are going to be very hungry for customers and it will be that much harder to compete when we have a better product with better pricing.

A bit of good news is that we are close to having a great company logo. We put a bounty up on one of those crowd sourced websites full of designers and got a number of excellent designs, out of over a hundred submissions. We are in the process of choosing the final one over the next couple of days. I look forward to announcing it.

Thanks for listening. Thanks to all my friends and family for the encouragement and advice. Thanks to my wife for putting up with me working all the time. Thanks to everyone who has offered to help. Expect another update soon. This is going to be a lot of fun!