Twitter Scaling Issues
Published by steve May 27th, 2008 in Silicon Valley, Software, TechnologyRobert Scoble states: “First of all, Twitter doesn’t store my Tweets 25,000 times. It stores them once and then it remixes them.”
I do not know how Twitter is built. However, “back in the day,” I was the development manager for a real-time, stock quote delivery system, so I do have experience with architectural issues similar to those that Twitter may be facing.
Let’s look at the procedure Robert refers to as “remixes them.” In the simplest architecture, there would be a single list (database, flat file, etc) of all the twitters created by everyone stored in chronological order. You may, as a storage optimization, just store a user id with the twitter string, and tweet time stamp (aka a tweet).
In this single architecture, a “remix” would require a query across all the tweets for a period of time for all people that a person follows. This query would be fairly fast when the number of tweets in the specified period is fairly small and the number of users a person follows is fairly small. You can see that this type of query becomes more expensive when the number of users you follow increases and the overall number of tweets per period increases.
So to speed up this query, you could build an index of tweets based on users. But maintaining this index would become expensive, especially during high incoming tweet periods.
So one might try to optimize the architecture by breaking up the universal store into list of tweets per person. Now each incoming tweet can be easily added to the user’s tweet list. Then the “remix” of tweets from the people you follow would require a join and sort across each tweet list of the people you follow. This would become increasingly more expensive when a user starts to increase the number of people they follow. It would be particularly expensive for super users who follow lots of users.
A reasonable compromise might be to keep a single universal stream of tweets in chronological order and two lists for each user: a list of pointers of all their tweets, and a list of pointers to all the tweets from the people they follow.
Maintaining these lists would look something like: sender publishes a tweet, it is added to the universal store, a pointer is then added to the sender’s tweet list, and then “push the tweet to followers” by walking sender’s list of followers and add a pointer to the tweet to each “follow” list.
This approach scales fairly well. The act of updating a user’s tweet list is separate from the managing of follow lists. Updating follow lists can be partitioned across multiple servers. Each server can just take (using shared queues) a tweet from the universal store and “fan it out” to the appropriate follow lists.
To optimize the “fan it out” process, a messaging product like JMS or TIBCO Rendezvous can broadcast the tweets to the servers that manage follow lists. This would require a universal store process to publish all tweets and a cloud of follow list managers listening (aka subscribing) to tweet broadcasts. When a particular follow list manager receives tweets it then updates the appropriate follow lists. Each tweet is broadcast once. It can be processed zero to thousand of times based on the number of follow list servers are in the cloud. It is easy to add follow list managers to the cloud with no impact on the upstream processing.
This approach also nicely addresses Twitter’s need for separate outbound follower queues for users that use Instant Messaging or SMS. Each follow list manager could be specialized. Some follow list managers could render Web pages, others be IM bridges, and others be SMS bridges. Users would be assigned to a particular follow list server based on their delivery preference.
For further scaling optimization, one can have several tweet stores instead of one single universal store. You just need to ensure that all incoming tweets from a particular user are added to the same store to maintain ordered delivery to followers.
So it is quite reasonable to copy (at least references) each of Robert’s tweets 25,000 times, just do so in a scalable manner.