Twitter is, fundamentally, a messaging system. Twitter was not architected as a messaging system, however. For expediency’s sake, Twitter was built with technologies and practices that are more appropriate to a content management system. Over the last year and a half we’ve tried to make our system behave like a messaging system as much as possible, but that’s introduced a great deal of complexity and unpredictability. When we’re in crisis mode, adding more instrumentation to help us navigate the web of interdependencies in our current architecture is often our primary recourse. This is, clearly, not optimal.
Our direction going forward is to replace our existing system, component-by-component, with parts that are designed from the ground up to meet the requirements that have emerged as Twitter has grown.
Amid the huge number of “oh no twitter is down. make it faster!” posts, we have some good ones.
The answer isn’t “Use PHP” ;)
If I were Twitter, I wouldn’t be looking to Erlang as the answer, but I would be interested in talking to Joe Armstrong. I wouldn’t jump to Java as the answer, but I would be reaching out to talk to Cameron Purdy of Tangosol Coherence (now Oracle). These people have seen systems that make Twitter look like a toy in comparison, and that knowledge is more valuable than any particular technology.
If you think about contorting a typical LAMP stack to run Twitter you quickly shudder. Having a database layer, even with master/slaves, is scary.
Twitter needs a Mom, and it looks like it is finally getting one. With true message oriented middleware, and money to get the systems they need, they should be fine. As Cedric says this isn’t an original problem.
The system of messages shouldn’t be living in bottleneck databases. Instead, they can be flowing through a river of distributed cache systems. The Jabber side of the house shouldn’t be able to “bring down” the entire website. The beauty of publish/subscribe and messaging is that you can throttle things nicely. You shouldn’t be “running out of database connections.” You can tune the number of listeners for the “track” command, for example, and if it is getting abused you limit its resources. Sure, this may mean that you get messages to people a little later, but who cares? If messages got a little slower, would people even realise? Compare that to the birds-lifting-the-animal-to-safety message.
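To make the throttling idea concrete, here is a minimal sketch (not Twitter’s actual code) of a pub/sub topic with a bounded listener pool and a bounded backlog. The class and parameter names are hypothetical; the point is that capping the listeners for an abuse-prone command caps its resource use, and messages simply arrive a little later instead of exhausting connections:

```python
import queue
import threading

class ThrottledTopic:
    """Toy pub/sub topic: a fixed pool of listener threads drains a
    bounded queue, so a hot command (like "track") can't soak up
    unlimited resources -- it just falls behind a little."""

    def __init__(self, handler, max_listeners=2, max_backlog=100):
        self.inbox = queue.Queue(maxsize=max_backlog)  # backpressure, not crashes
        self._handler = handler
        for _ in range(max_listeners):  # throttle: only N concurrent listeners
            threading.Thread(target=self._drain, daemon=True).start()

    def publish(self, message):
        self.inbox.put(message)  # blocks when the backlog is full

    def _drain(self):
        while True:
            msg = self.inbox.get()
            self._handler(msg)
            self.inbox.task_done()

delivered = []
topic = ThrottledTopic(handler=delivered.append, max_listeners=2)
for i in range(10):
    topic.publish(f"tweet {i}")
topic.inbox.join()  # wait for the small pool to catch up
print(len(delivered))  # all 10 arrive, just possibly a little later
```

Tuning `max_listeners` down slows delivery for that one command without letting it starve the rest of the system.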
In fact, if you think about systems such as the stock exchange, you will realise that you rarely get truly real-time access. Most of the time you are on a delay of some kind, and that is fine. Through the distributed caching architecture you can push out messages to other systems to do their work. One of those systems is the website itself: Twitter.com is just another client of the messaging backbone. Even if the backbone is in trouble, the website can still show the current view of the world, and could even batch up work to be done.
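A small sketch of what “the website is just another client” might look like, under my own assumptions (the class and method names here are made up for illustration): the site reads through a local cache of the backbone, falls back to the last known snapshot when the backbone is down, and batches writes to replay later instead of erroring out.

```python
class CachedTimeline:
    """Hypothetical website-side client of a messaging backbone.
    Serves a stale snapshot when the backbone is unreachable and
    queues writes as pending work instead of failing."""

    def __init__(self, backbone_fetch):
        self._fetch = backbone_fetch  # callable; raises ConnectionError when down
        self._last_view = []          # last snapshot we managed to fetch
        self._pending = []            # writes batched while the backbone is down

    def current_view(self):
        try:
            self._last_view = self._fetch()
        except ConnectionError:
            pass  # backbone in trouble: show the current view of the world anyway
        return self._last_view

    def queue_write(self, message):
        self._pending.append(message)  # batch up work to replay on recovery

# Simulate the backbone going down mid-session.
state = {"up": True}
def fetch():
    if not state["up"]:
        raise ConnectionError("backbone down")
    return ["hello world"]

site = CachedTimeline(fetch)
print(site.current_view())  # fresh view from the backbone
state["up"] = False
print(site.current_view())  # stale but still available -- no outage page
```

The site degrades to slightly old data rather than going down with the backbone, which is exactly the trade the stock-exchange comparison argues for.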
I was talking to another startup that is migrating away from a database backed system, and soon the entire real-time world will be in a huge distributed cache. I am sure that Twitter will be moving there too.
Currently, I still feel bad for the engineers. I have been there: the point where you are at the limits of your current architecture and you know it can tank at any time. You are firefighting all day and night, and thus don’t even have much time to fix anything at all. It is hard work. It is tiring work. It is demoralizing work.
However, I know that Alex and the rest of the crew will pull through their current situation, which, after all, came about thanks to the amount of love its users have for the service. One day the new architecture will be in place, and we will look back and remember the early days, when downtime was such an issue.
Thanks for all the hard work guys. I can’t wait to be tweeting on a fully loosely coupled architecture, talking to one of your Moms!