“How we messed up so you don’t have to.”
What works for someone else does not necessarily work for you. You should be trying out all kinds of technologies.
A browser is not well suited to live, dynamic content, which is why you need synchronous web apps. Facebook has two examples of this: chat and the news feed.
Be careful when releasing large changes to your project. Facebook had some issues with the synchronous push for version 2.
With these kinds of projects you shouldn’t think too far ahead of yourself, otherwise you’ll be wasting a lot of time and never get the full scope of the project completed. This will cause you to be discouraged.
Everybody does the exact same thing. When you’re broadcasting and an issue occurs, everyone who’s logged on tries to refresh their browser at the same time. On Meebo the login/logout function is very server intensive, and on holidays they get many more login/logout requests. To get around this they queue up logout requests, show the user a message saying they’ve been logged out (which makes them happy), and then actually log them out later.
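The logout trick described above can be sketched roughly as follows. This is only an illustration of the idea, not Meebo’s code: the class name LogoutQueue, the drain method, and the expensive_logout callback are all made up for the example.

```python
from collections import deque


class LogoutQueue:
    """Acknowledge logouts right away and do the expensive work later."""

    def __init__(self, expensive_logout):
        # expensive_logout is a hypothetical callback standing in for the
        # server-intensive cleanup; it is called with one user id.
        self._pending = deque()
        self._expensive_logout = expensive_logout

    def request_logout(self, user_id):
        # Cheap path: remember the request and answer the user immediately.
        self._pending.append(user_id)
        return "You have been logged out."

    def drain(self, max_items):
        # Meant to be called by a background worker when load allows.
        # Processes at most max_items requests and returns how many it did.
        done = 0
        while self._pending and done < max_items:
            self._expensive_logout(self._pending.popleft())
            done += 1
        return done

    def __len__(self):
        return len(self._pending)
```

The user sees the confirmation as soon as request_logout returns, while the worker decides how many queued logouts to process per call, so a holiday spike turns into a longer queue rather than a spike in server load.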
Front-end vs. back-end.
Facebook uses memcached, which speeds things up a lot. For the news feed/homepage they store multiple things in a cache cluster and then pass them all to the client at once. This makes the page less server intensive over time.
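One way to read “store multiple things in the cache and pass them all at once” is a batched lookup: check the cache for every item first, then make a single call for whatever is missing. The sketch below assumes that reading. It uses a plain dict in place of a real cache cluster, and fetch_feed_items and load_from_db are hypothetical names, not Facebook’s API.

```python
def fetch_feed_items(item_ids, cache, load_from_db):
    """Return {item_id: value}, using one batched backend call for cache misses."""
    # Serve whatever is already cached.
    found = {i: cache[i] for i in item_ids if i in cache}
    missing = [i for i in item_ids if i not in found]
    if missing:
        # One batched call for all misses instead of one call per item.
        loaded = load_from_db(missing)
        cache.update(loaded)   # later requests are served from the cache
        found.update(loaded)
    return found
```

After the first request warms the cache, repeated requests for the same items never reach the backend, which is where the savings over time come from.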
Can we roll back?
Meebo has 5-6 components that get rolled out in order. Incremental release is good too: you can release something to 10-20% of people, and if something goes wrong it’s less painful to roll it back. Meebo also uses symlinks as rollback pointers, so if something goes wrong they can roll back instantly.
At Facebook they feel that the response to an alert can be worse than the problem that triggered it. They had an issue where one of the servers was overheating, so one of the engineers decided to reboot all of the servers, which should have taken only 10 minutes. However, this restart caused the news feed to be down for 4-5 hours one day.
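A common way to use a symlink as a rollback pointer is to keep each release in its own directory and have a single link point at the live one. The sketch below assumes that layout on a POSIX system; the function name point_current_at and the idea of a “current” link are illustrative, not a description of Meebo’s actual setup.

```python
import os


def point_current_at(release_dir, current_link):
    """Repoint current_link at release_dir in one atomic step (POSIX)."""
    tmp_link = current_link + ".tmp"
    # Clear a leftover temporary link from an earlier interrupted run.
    if os.path.lexists(tmp_link):
        os.remove(tmp_link)
    os.symlink(release_dir, tmp_link)
    # Renaming over the old link swaps the pointer without a gap in which
    # current_link does not exist.
    os.replace(tmp_link, current_link)
```

Deploying and rolling back are then the same operation: call point_current_at with the new release directory to ship, and with the previous one to undo it.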
What do you monitor?
Meebo monitors server health, user experience, how long it takes to load the front page, and the health of the IM networks. In the past, when AIM or MSN had issues, users thought it was Meebo that was down. In one case, though, the problem wasn’t MSN: Meebo actually had a login bug. They also measure which features are used most often and trends such as how many Meebo accounts are created every day, so after a new release they can see whether these trends are affected and whether participation has dropped.
Facebook has tools that produce latency.
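Checking whether a release hurt a trend can be as simple as comparing the daily average before and after. The sketch below is a hypothetical example of that comparison; the function name signup_drop and the 10% default threshold are assumptions, not numbers from the panel.

```python
from statistics import mean


def signup_drop(before, after, threshold=0.10):
    """Return True if the mean daily count fell by more than threshold after a release."""
    baseline = mean(before)
    if baseline == 0:
        return False  # nothing to compare against
    change = (mean(after) - baseline) / baseline
    return change < -threshold
```

For example, signup_drop([100, 110, 90], [80, 85, 75]) compares an average of 100 signups a day with an average of 80, a 20% fall, and so flags the release.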
Can I throw more money at it? Changes between hardware/software.
Meebo tries to keep each server at 50-60% of its capacity. They find that above that level the server comes under heavy strain.
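A utilization target like this turns into a simple sizing rule: divide the peak load by the share of each server you are willing to use. The sketch below works through that arithmetic. The function name servers_needed and the units (requests per second) are assumptions made for the example.

```python
def servers_needed(peak_load, capacity_per_server, target_percent):
    """Smallest number of servers that keeps each one at or below target_percent."""
    usable = capacity_per_server * target_percent  # usable capacity, scaled by 100
    demand = peak_load * 100                       # scaled the same way
    return -(-demand // usable)                    # integer ceiling division
```

With a peak of 9000 requests per second and servers that can each handle 1000, a 60% target means each server takes at most 600, so servers_needed(9000, 1000, 60) gives 15.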
Good enough vs. Perfect. When do you decide when something is good enough to ship?
You need to estimate how much load a given feature will put on your servers. A lot of the time something runs just fine when you test it, and then when you push it live it ends up causing chaos. Sometimes it helps just to ask users whether it is good enough; asking for feedback also makes them feel actively involved in the project.
Gatekeeper is software that Facebook uses to slowly release new features to only some users. Meebo has switches for turning features off and on, which they use to balance load and to test new pushes.
It is good to know what you can cache and what you cannot; you should not just cache everything. Never be afraid to ask the user what is wrong with your product. You’re never going to know the product as a whole as well as the user does. You need to be aware of what the user is thinking, not of what new features your engineers want to implement that day.
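Releasing a feature to only some users is often done by hashing each user into a stable bucket and comparing the bucket with the rollout percentage. The sketch below shows that general technique. It is not how Gatekeeper itself works, which these notes do not describe, and the function name in_rollout and the feature name in the example are made up.

```python
import hashlib


def in_rollout(feature, user_id, percent):
    """Deterministically decide whether user_id sees feature at a given percent."""
    key = "{}:{}".format(feature, user_id).encode()
    digest = hashlib.sha256(key).digest()
    bucket = int.from_bytes(digest[:8], "big") % 100  # stable bucket, 0 to 99
    return bucket < percent
```

Because the bucket depends only on the feature name and the user id, a user who has the feature at 10% still has it at 20%, and setting percent to 0 turns the feature off for everyone without a deploy.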
Questions and Answers
What about automated testing?
It is very hard to load test because you cannot anticipate what all users will do. The best way to load test, sometimes, is just to put it live and see what happens. You can always roll it out to 10%/20%…60% and when it reaches 60% with no issues it should be fine to push out to everyone. Putting timers on
What happens when you make infrastructure changes?
Facebook noticed that their CPU was running slower and slower; they looked into it and found the cause was memory fragmentation. Adding more and more features will cause this to happen as your infrastructure gets bigger. One of the metrics Facebook looks at is how much CPU they can actually use before the server starts misbehaving.
What are some stories when human error caused issues?
When Facebook released the new homepage, everyone decided to take a vacation at the same time afterwards. They had monitoring scripts running, but the scripts broke, and it was only when everyone got back that they noticed the monitoring was off. Meebo accidentally deleted all of their user accounts and had to restore them. Vogt bumped a router and killed half of their users’ connections.
Facebook has 10,000 servers. Each server has 32GB of RAM. Wow!