Sunday, January 22, 2012

Scalable System Architecture Comedy

I was reading Scaling a PHP MySQL Web Application, a technical document published on the Oracle website. As I scrolled down the page, I saw the typical load-balancing diagram (Figure 1) that you see for almost any PHP/MySQL web application.

But then, as the article goes on, it gets more entertaining. Figure 3 shows Multiple MySQL Slaves, which brings the count to 4 machines.

But wait, there's more. Now you need a dedicated database slave for each web server, so the picture expands to even more lines and arrows in Figure 4: a total of 8 machines.

As you keep scrolling, you get to Figure 5, a real gem of an image. Arrows in every direction. Arrows jumping through other arrows. Still 8 machines, but a completely incomprehensible image.

OK, so now we've randomized the connections between all the web servers and database slaves.

Can you imagine one of these machines going down or throwing errors, and trying to figure out which one it is or how it connects to the other machines?
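To see why the randomized topology is so painful to debug, a tiny sketch helps. The host names and counts below are made up to match the figures (4 web servers, 4 slaves); the point is simply that randomizing the routing makes every possible web-server-to-slave edge a live suspect.

```python
import random

# Hypothetical host names, standing in for the machines in Figure 5.
WEB_SERVERS = ["web1", "web2", "web3", "web4"]
DB_SLAVES = ["slave1", "slave2", "slave3", "slave4"]

def pick_slave(slaves):
    """Randomized routing: any web server may talk to any slave."""
    return random.choice(slaves)

def possible_edges(web_servers, slaves):
    """Every web-server -> slave connection that could exist."""
    return [(w, s) for w in web_servers for s in slaves]

edges = possible_edges(WEB_SERVERS, DB_SLAVES)
print(len(edges))  # 16 arrows to trace when one slave misbehaves
```

With dedicated slaves (Figure 4) a failing web server implicates exactly one database; with random routing, all 16 edges have to be considered.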

We all know that systems get more complex as they grow. That said, if the drawing of your architecture is incomprehensible, it is a clear sign that you are doing it wrong.


SteveL said...

It's roughly the same problem you get on any distributed system with load balancing/dynamic routing, though, isn't it? Even if you draw "message bus" or "service discovery", you mean "messages dispatched or dropped on a whim". This is why every exception forwarded back must include the host that played up.

It's nice to include things like source and port too.

For Hadoop we now not only include those things, the standard socket exceptions also include links to diagnostics pages, such as the one for Connection Refused, because I got fed up with people on the user list asking why their Hadoop server was refusing connections.
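The practice described above can be sketched in a few lines. This is not Hadoop's actual code, just a minimal illustration of wrapping a connection failure so the remote host and port travel with the exception (the wiki URL is the real Hadoop one; the function name is made up):

```python
import socket

def connect_with_context(host, port, timeout=2.0):
    """Open a TCP connection; on failure, re-raise with host:port
    in the message so the machine that played up is identifiable."""
    try:
        return socket.create_connection((host, port), timeout=timeout)
    except OSError as e:
        raise OSError(
            f"{host}:{port} - {e} "
            "(see https://wiki.apache.org/hadoop/ConnectionRefused)"
        ) from e
```

Without the wrapper, a bare "Connection refused" from deep inside a library tells you nothing about which of your 8 machines refused it.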

Unknown said...

Don't get me started on Hadoop. It is a classic case of failure to communicate properly (and with the logs spread across 50+ different machines, it is even more difficult to debug).

Instead of getting frustrated and writing a wiki page to point the dumb people to (I've been there myself, so I know what you are going through), I would fix Hadoop's error-message output to be more helpful. I would also find a way to create a centralized point (like ZooKeeper is for config) where one could look to figure out what was going wrong.

The problem is Hadoop, not the people.
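The "centralized point" idea above can be sketched in miniature. In a real cluster the shared state would live in something like ZooKeeper; the in-memory class below (all names hypothetical) just illustrates the shape: every node reports errors to one place, and the operator asks that one place which hosts are unhealthy.

```python
from collections import defaultdict
from datetime import datetime, timezone

class ErrorBoard:
    """A single place to ask 'what is going wrong, and where?'
    Stand-in for a coordination service; not a real Hadoop API."""

    def __init__(self):
        self._errors = defaultdict(list)  # host -> [(timestamp, message)]

    def report(self, host, message):
        stamp = datetime.now(timezone.utc).isoformat()
        self._errors[host].append((stamp, message))

    def unhealthy_hosts(self):
        return sorted(self._errors)

board = ErrorBoard()
board.report("slave3", "Connection refused on port 3306")
board.report("web2", "Read timeout talking to slave3")
print(board.unhealthy_hosts())  # ['slave3', 'web2']
```

One query against the board replaces grepping logs on 50+ machines to find the host at fault.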