Posted by: cmani2010 | April 27, 2009

“We will be back shortly”, “We are undergoing maintenance”!!

Like a lot of people, I use Web 2.0 sites such as LinkedIn, Twitter, Gmail, and Yahoo. In fact, sites like LinkedIn have become really addictive, and I check the website at least a couple of times a day when I have time and am not on the road. So, on a Sunday morning, I opened LinkedIn.com at around 10:15 am Indian Standard Time and got this message:

I was a little annoyed that my favorite website was undergoing maintenance, and all the more so because it was daytime in India! Of course, it got me thinking about the larger aspects of this issue. In this day and age, when your business is used by people across geographies, does scheduling a maintenance window during off-peak hours in your local region (in this case, the US) really work? Is it acceptable, and are there any ideas for overcoming it?
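To make the time zone problem concrete, here is a small Python sketch that converts the time I saw the message into US time. The date (Sunday, April 26, 2009) and the choice of US Pacific daylight time are my assumptions for illustration; I do not know which clock the maintenance was actually scheduled against.

```python
from datetime import datetime, timedelta, timezone

# Fixed offsets keep the sketch free of time zone database dependencies.
IST = timezone(timedelta(hours=5, minutes=30), "IST")  # India has no daylight saving
PDT = timezone(timedelta(hours=-7), "PDT")  # assumption: US Pacific daylight time

# Assumed date: the Sunday before this post, at the 10:15 am mentioned above.
seen_in_india = datetime(2009, 4, 26, 10, 15, tzinfo=IST)
seen_in_us = seen_in_india.astimezone(PDT)

print(seen_in_india.strftime("%A %H:%M %Z"))  # Sunday 10:15 IST
print(seen_in_us.strftime("%A %H:%M %Z"))     # Saturday 21:45 PDT
```

Under these assumptions, a quiet Saturday night on the US west coast lands squarely on Sunday morning in India, so any single window that is off-peak for one region is daytime for another.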

A Google search on “LinkedIn architecture” turned up this piece of information about LinkedIn’s architecture. A couple of points caught my eye: “The Cloud is a server that caches the entire LinkedIn network graph in memory” and “Rebuilding an instance of The Cloud from disk takes 8 hours”. The components used in the architecture, such as Tomcat, Jetty, MySQL, and Oracle, are certainly capable of 99.999% availability if architected that way. The LinkedIn architecture is great material for learning how high-volume websites are built. There are plenty of details on how caching is done, and LinkedIn also uses a push-based architecture for generating content (could this be the reason?). I cannot really draw conclusions on how LinkedIn could avoid this.
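To put the 99.999% figure in perspective, here is a quick back-of-the-envelope calculation in Python. The function names are just mine for illustration, and the 8-hour outage is a hypothetical borrowed from the rebuild time quoted above, not a measured LinkedIn downtime.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes, ignoring leap years


def downtime_budget_minutes(availability_percent):
    """Minutes of downtime per year allowed at a given availability."""
    return MINUTES_PER_YEAR * (1 - availability_percent / 100)


def availability_percent(downtime_minutes):
    """Yearly availability implied by a given amount of downtime."""
    return 100 * (1 - downtime_minutes / MINUTES_PER_YEAR)


# "Five nines" leaves only a few minutes of downtime per year.
print(round(downtime_budget_minutes(99.999), 2))  # 5.26

# A single hypothetical 8-hour outage in a year already drops you to about three nines.
print(round(availability_percent(8 * 60), 3))  # 99.909
```

In other words, one visible maintenance window of a few hours uses up many years' worth of a five-nines downtime budget.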

But I have a couple of points on how downtime can be avoided while still allocating time for upgrades (hardware and software) and maintenance:
1. Use a Content Delivery Network (CDN) like Akamai to deliver content. This way, content is served from sites in different regions across the globe, and downtime in one region does not affect the others. Of course, cost could be a factor. The other thing to consider is how this would work if the content is very dynamic.
2. Use a rolling upgrade strategy. That is, if you have several servers in a clustered, high-availability setup, remove one server from the cluster, upgrade it, push it back into the cluster, and then take the next one. Of course, there may be a few minutes or hours when the servers are running different versions of the application. But you can avoid downtime.
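The rolling upgrade idea in point 2 can be sketched as a small Python simulation. Everything here (the `Server` class, the `rolling_upgrade` function, the `health_check` hook, the server names and version strings) is hypothetical and only models the bookkeeping; a real setup would drain connections at the load balancer and run actual deployment steps.

```python
from dataclasses import dataclass


@dataclass
class Server:
    name: str
    version: str
    in_rotation: bool = True  # True while the load balancer sends it traffic


def serving(cluster):
    """Servers currently taking traffic."""
    return [s for s in cluster if s.in_rotation]


def rolling_upgrade(cluster, new_version, health_check=lambda server: True):
    """Upgrade one server at a time; return a log of (name, serving_count, versions)."""
    log = []
    for server in cluster:
        server.in_rotation = False  # remove this server from the cluster
        if not serving(cluster):
            server.in_rotation = True
            raise RuntimeError("refusing to take the last serving node offline")
        serving_during_upgrade = len(serving(cluster))
        server.version = new_version  # stand-in for the real upgrade work
        if not health_check(server):
            # Leave the node out of rotation; the others keep serving.
            raise RuntimeError(f"{server.name} failed its health check")
        server.in_rotation = True  # push it back into the cluster
        versions_live = sorted({s.version for s in serving(cluster)})
        log.append((server.name, serving_during_upgrade, versions_live))
    return log


cluster = [Server("web1", "1.0"), Server("web2", "1.0"), Server("web3", "1.0")]
for name, serving_count, versions_live in rolling_upgrade(cluster, "2.0"):
    print(name, serving_count, versions_live)
# web1 2 ['1.0', '2.0']
# web2 2 ['1.0', '2.0']
# web3 2 ['2.0']
```

At every step two of the three servers keep taking traffic, and the middle rows show the mixed-version period mentioned above. That period is the price of this approach: both versions have to be able to work against the same data and sessions while the upgrade is in flight.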
My thoughts on a Monday afternoon 😉 Hope this makes sense!
