Friday, August 15, 2014

Gigascale Engineering talk by Bruno Bowden

Bruno Bowden (engineer on Google Earth, Enterprise Gmail) gave a great talk on "Gigascale Engineering".

He shared a lot of wisdom which, as the old saying goes, came from a lot of experience, which came from a lot of mistakes (including almost bringing down Gmail, messing up Eric Schmidt's email, and having Santa Claus nearly cause an international incident).

To summarize his points:
  1. The best decision is "Not to build"
    Learn to fail fast, give up and quit
  2. Recycle
    Reuse; don't reinvent the wheel. Benefit from someone else's headaches and hard work. Use something that's been battle-tested, where someone else has already worked through the bugs.
  3. "Above the fold"
    The example he used was a single change: moving the download button up the page so that it could be seen without scrolling on low-resolution displays. That increased the number of Google Earth downloads by 50-100 million.
    My interpretation: find what will have the biggest impact, not what you think will be cool or interesting.

Then there was a really interesting section of the talk on "Pushing the Reliability/Innovation Curve". He showed a simple curve where the more innovation you have, the less reliability you get. The "pushing" involves ways in which you can decrease risk while still increasing innovation.

Below are some of the ideas he shared:
  • He talked about the basics such as Code Reviews.
  • He also mentioned that while Google "doesn't do TDD", all code is "guilty until proven innocent". He stressed that they do unit testing etc.; they just haven't bought into the "thou shalt always write tests first" approach. My thought is that people trump process, and TDD may add the most value when used with less "experienced" developers to help them stay out of trouble. That said, I do like TDD, but frankly don't always practice it.
  • He mentioned that the number one threat to Google software was Google engineers, and described how they set rate limits (throttling) to keep people from bringing systems to their knees.
  • Using an exponential backoff strategy on retries to limit potential damage. This reminded me of the retry scheme Ethernet uses after detecting a collision.
  • Having built-in "Real Time Controls" designed into the system to help you deal with performance issues
  • Having a way to safely test new code, canary-in-the-coal-mine style
  • Using N+2 Rollbacks
  • The advantage of "Keeping it Simple" and how engineers seem to love to create complexity.  A charge I have been guilty of way too many times.
  • He used a quote from Jeff Dean, "Design for 10x", which I understood as design for tenfold growth.
  • The earlier you can compartmentalize things the better, before the system becomes a monolith
  • How when you first start up you want to push things out as fast as possible and iterate quickly so you can find out what works (and hopefully abandon quickly what doesn't). But once you start to scale you need to shift to a regular release cycle and a more repeatable process
  • Using protocol buffers to help deal with multiple versions of the protocol as you upgrade in a large multi-server environment.
  • Importance of managing your management's expectations. Yes, change is risky, and when we do this there will be some fires (of course you go through the risk/reward analysis to determine it's worth the fires that will happen).
  • What to do when failure happens (really important, and unfortunately I see too many cases where this isn't handled properly or, if a Post Mortem is done, the actions are not followed up on):
    • Alert Early - i.e., don't hide the problem or cover it up. Let people know so you can start to deal with it and get some help.
    • Escalate Response - Get some help; don't try to solve it all yourself, especially if it's a potential crisis in production.
    • Stabilize - revert if necessary, but stabilize the system
    • Perform a Post Mortem - Ask "How can we avoid this in the future?" Usually there are multiple points of failure.
    • Ask where we could have done more "Defensive Engineering".
    • Reassess the risks/rewards of what you are doing.
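
To make the throttling and exponential backoff bullets above a bit more concrete, here is a minimal Python sketch. It is my own illustration, not something shown in the talk, and certainly not how Google does it. Every name in it (backoff_delays, retry, TokenBucket) and every default value is made up for the example. The sleep function and the clock are passed in as arguments so the logic can be exercised without actually waiting.

```python
import random


def backoff_delays(base=0.5, factor=2.0, cap=30.0, attempts=6):
    """Upper bound (seconds) on the wait before each retry.

    The bound grows as base * factor**n and is clipped at cap.
    All defaults are illustrative assumptions.
    """
    return [min(cap, base * factor ** n) for n in range(attempts)]


def retry(operation, delays, sleep, rng=random.random,
          retry_on=(ConnectionError,)):
    """Call operation(); after each failure wait a randomized delay.

    Multiplying the bound by a random fraction ("jitter") keeps many
    clients from retrying in lockstep and hammering the server again.
    """
    for delay in delays:
        try:
            return operation()
        except retry_on:
            sleep(delay * rng())
    # Last attempt: any exception now propagates to the caller.
    return operation()


class TokenBucket:
    """Simple rate limiter: at most `capacity` calls in a burst,
    refilled at `rate` tokens per second."""

    def __init__(self, rate, capacity, now=0.0):
        self.rate = rate
        self.capacity = capacity
        self.tokens = capacity
        self.last = now

    def allow(self, now):
        # Refill for the time elapsed since the previous call.
        elapsed = now - self.last
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False
```

In real use you would pass time.sleep as the sleep argument and time.monotonic() readings as now. The point of both pieces is the same as in the talk: a misbehaving client (or engineer) gets slowed down instead of taking the system with it.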
In addition to the wise advice, there was a great story about how Santa Tracker almost caused an international incident by flying over Toronto, Ohio (well, he said some Toronto in the US, so I assume Ohio) as opposed to that other one in Canada. Fortunately the Canadian general was kept sufficiently distracted so that he didn't look at the Santa Tracker until Mr. Claus got back on course.

Here is his TED Talk on Santa Tracker:

Cheers,
Stephen