Friday, 13 July 2012

Building and Scaling a Startup on Rails: 12 Things We Learned the Hard Way


Building and Scaling a Startup on Rails: 12 Things We Learned the Hard Way

There are a bunch of basic functional elements to building out a popular Rails app that I've never really seen explained in one place, but we had to learn the hard way while building Posterous. Here's a rundown of what we've learned, in the hopes that some Google linkjuice may bring an intrepid wanderer to this forlorn part of the woods and help you out.

Static Storage
S3 is awesome. Yes, you can host everything off S3. Is it a little more expensive? Probably. But if you're engineer constrained (and if you're a startup, you absolutely are) -- set it and forget it. If you absolutely must have local storage across app servers, then MogileFSGFS, or HDFS or even NFS(yuck) are possibilities. For alternatives to S3, Mosso is supposed to be good too.
Images, files, whatever. Just drop it there. People say a lot of stuff about the Cloud, but it's real and a game changer for anyone doing user generated content.

HTTP Cache Control
The HTTP protocol lets you tell browsers what static content they can cache. You set this in apache.  Rails automatically will put timestamps in the IMG / javascript / CSS tags, assuming you're using the helpers. The Firefox pluginYSlow coupled with Firebug are your friends here. The improvement is significant and well worth your time, especially if you add gzip'ing. 100KB initial page load can be brought down to 5K (just the HTML file) on subsequent clicks around your site. 

You're not going to run full text search out of your DB. It's totally not worth it to roll anything custom here. The smart money is on Sphinx with theThinkingSphinx plugin is probably your best bet. If you have more than one app server, you'll want to use this. Alternatively, Solr with Acts as Solr can be used if you're a Java geek / have Lucene/Solr experience previously. 

Storage engine matters, and you should probably use InnoDB

MyISAM is marginally faster for reads, but InnoDB will make you more crash resistant and will not lock tables on writes. Read about the difference, because when your servers are on fire, you will realize MySQL feels like a pretty thin layer of goop on top of your storage engine. MyISAM is actually the default on MySQL, which makes sense for most crappy phpBB installations -- but probably not good enough for you. The default can hurt you.
Oh yeah, and if you can start with some replication in place, do it. You'll want at least one slave for backups anyway. 
Fix your DB bottlenecks with query_reviewer and New RelicThis basically saves your ass completely. Everyone complains that Rails is slow. Rails is not slow, just like Java Swing is not slow. Rails makes it easy to shoot yourself in the face. If you do follow-the-textbook-example bumbling around with Rails ActiveRecord objects, you will end up with pages that drive 100 queries and take several seconds to return. 
Above is a screenshot from query_reviewer. It tells you every single query being run, and alerts you to things that use temporary tables, file sorts and/or just damn slow queries.
In a nutshell, you need indexes to avoid full table scans. The traditional way is to run EXPLAIN manually on queries coming out of your dev log. Query_reviewer lets you see it all right there in the left corner of your web browser. It's brilliant. You also need to eager load associations that will use in your views by passing :include to your ActiveRecord find method call, so that you can batch up SQL queries instead of destroying your DB server with 100 queries per dynamic page.
New Relic is new for us, but it helps us see what is really happening on our production site. If your site is on fire, it's a freaking beautiful gift from the heavens above. You'll see exactly what controllers are slow, which servers in your cluster, how load is on all your machines, and which queries are slow.
Memcache laterIf you memcache first, you will never feel the pain and never learn how bad your database indexes and Rails queries are. What happens when scale gets so big that your memcache setup is dying? Oh, right, you're even more screwed than you would have been if you got your DB right in the first place. Also, if this is your first time doing scaling Rails / a db-driven site, there's only one way to learn how, and putting it off til later probably isn't the way. Memcache is like a bandaid for a bullet hole -- you're gonna die.

You're only as fast as your slowest query.
If you're using nginx or Apache as a load balancer in front of a pack of mongrels (or thins or whatever else is cool/new/hip), then each of those mongrels acts like a queue. The upshot is that if you EVER have a request that takes a long time to finish, you're in a world of hurt. So say you have 4 mongrels, and Request A comes in to port 8000 and it takes 10 seconds. The load balancer is naive and keeps passing requests to Port 8000 even though that port is busy. (Note: This might help, but we don't use it)
Then what happens? Sad town happens. 1 in 4 requests after Request A will go to port 8000, and all of those requests will wait in line as that mongrel chugs away at the slow request. Effective wait time on 1/4th of your requests in that 10 second period may be as long as 10 seconds, even if normally it should only take 50msec!
Enter the wonderful mongrel proctitle. Now, you can see exactly what is blocking your mongrels. I keep this on a watch in a terminal at all times. It's what I look at immediately if our constant uptime tests tell us something's wrong. Super useful. 
The answer is: a) run some mongrels dedicated to slow running jobs (meh) or b) run Phusion Passenger, or c) run slow stuff offline... which leads us to...

Offline Job QueuesSo you gotta send some emails. Or maybe denormalize your DB. Or resize photos, or transcode video or audio. But how do you do it in the 200msec that you need to return a web request? You don't. You use Workling or Delayed Jobor nanite. It'll happen outside of your mongrels and everyone will be happier.
I don't know why people don't talk about this more, because if you run a site that basically does anything, you need something like this. It *should* be a part of Rails, but isn't. It isn't a part of Rails in the same way that SwingWorker in Java wasn't a part of Java Swing core like forever, even though it absolutely had to be. 
If you don't monitor it, it will probably go down, and you will never know.
Test your site uptime, not just ping but actual real user requests that hit the DB. Sure, you could use pingdom if you're lazy, but it seriously takes like 10 lines of ruby code to write an automated daemon that runs, does a user action and checks that your site is not hosed. open-uri is your friend. You don't know if you're up if you're not checking. Do not tolerate downtime.
Also, use god for mongrel and process monitoring. Mongrels die or go crazy. You gotta keep them in their place. (What's funny is that god leaks memory over time with Ruby 1.8.6 *sigh*). Muninmonit, and nagios are also great to have.
Keep an eye on your resources -- IO ok? Disk space? It's the worst thing every to have a site crash because you forgot to clean the logs or you ran out of disk space. Make cronjobs for cleaning all logs and temp directories, so that you can set it and forget it. Because you will forget, until you are reminded in the worst way. 
Read the source, and cut back the whiningYou will learn more reading the source and debugging / fixing bugs in plugins and sometimes Rails itself than a) complaining on a mailing list or b) whining about shit on your twitter. It's Ruby open source code -- if it's broken, there's a reason. There's a bug, or you're doing it wrong. Fix it yourself, drop it into a github fork, and submit back. 

Beware old plugins
They don't work well. And they sit around on Google sucking up time and effort.Acts as paranoid is one. They look legit, with famous names who created them. Don't fall for it. Insist on using code that has been updated recently. Rails changes pretty fast, and plugins that don't get updated will waste your time, cause random bugs, and basically make your life crap.
Github is new on the scene and has totally revolutionized Rails. When in doubt, search Github. If it's not on Github, it's probably dead/not-maintained. Be wary. 
Beware old anythingActually, if this blog post is older than even 6 months or 1 year -- you might want to go elsewhere. Rails moves fast. What's hot and "must have" in Rails now may be totally a piece of crap / barely functioning garbage later. Same with any blog posts. Be super wary of the Rails wiki. There be dragons -- I mean, really stuff that references Rails 1.2.6 or earlier!
And that's a wrap.There's tons more stuff, but this is a pretty decent list of stuff to watch out for. If you have any suggestions for other things I missed, or questions, please do leave a comment below!
If you liked this article, please try and/or follow me on twitter at@posterous and @garrytan!
Posterous is also hiring Rails engineers for full time positions in San Francisco. We're a small team of all hardcore developers and looking for like minded folks to join up. Well-funded by the top-tier VC's and angels. We grew over 10x in 2009!

No comments:

Post a Comment