Friday, August 15, 2014

Gigascale Engineering talk by Bruno Bowden

Bruno Bowden (an engineer on Google Earth and Enterprise Gmail) gave a great talk on "Gigascale Engineering".

He shared a lot of wisdom, which, as the old saying goes, came from a lot of experience, which came from a lot of mistakes (including almost bringing down Gmail, messing up Eric Schmidt's email, and having Santa Claus nearly cause an international incident).

To summarize his points:
  1. The best decision is "Not to build"
    Learn to fail fast, give up and quit
  2. Recycle
    Re-use; don't re-invent the wheel.  Benefit from someone else's headaches and hard work.  Use something that's been battle-tested, where someone else has already worked through the bugs.
  3. "Above the fold"
    The example he used: one change, moving the download button up the page so that on low-resolution displays it could be seen without scrolling, increased Google Earth downloads by 50-100 million.
    My interpretation: find what will have the biggest impact, not what you think will be cool or interesting.

Then there was a really interesting section of the talk on "Pushing the reliability/innovation curve". He showed a simple curve where the more innovation, the less reliability.  The "pushing" involves ways in which you can decrease risk while still increasing innovation.

Below are some of the ideas he shared:
  • He talked about the basics such as Code Reviews.
  • He also mentioned that while Google "doesn't do TDD", all code is "guilty until proven innocent".  He stressed that they do unit testing etc.; they just haven't bought into the "thou shalt always write tests first" approach.  My thought is that people trump process and TDD may add most value when used with less "experienced" developers to help them stay out of trouble.  That said, I do like TDD, but frankly don't always practice it.
  • He mentioned that the number one threat to Google software was Google engineers, and how they set rate limits (throttling) to keep people from bringing systems to their knees (see the rate-limiter sketch after this list).
  • Using an exponential backoff strategy on retries to limit potential damage, which reminded me of the retry schemes for collision detection on Ethernet (there's a backoff sketch after this list as well).
  • Having "Real Time Controls" designed into the system to help you deal with performance issues.
  • Having a "canary in the coal mine" way to safely test new code (a toy example follows the list).
  • Using N+2 Rollbacks
  • The advantage of "Keeping it Simple" and how engineers seem to love to create complexity, a charge I have been guilty of way too many times.
  • He used a quote from Jeff Dean, "Design for 10x", which I understood as designing for 10-fold growth.
  • The earlier you can compartmentalize things the better, before the system becomes a monolith.
  • How when you first start up you want to push things out as fast as possible and iterate quickly so you can find out what works (and quickly abandon what doesn't), but once you start to scale you need to shift to a regular release cycle and a more repeatable process.
  • Using protocol buffers to help deal with multiple versions of a protocol as you upgrade a large multi-server environment.
  • The importance of managing your management's expectations.  Yes, change is risky, and when we do this there will be some fires (of course you go through the risk/reward analysis to determine whether it's worth the fires that will happen).
  • What to do when failure happens (really important, and unfortunately I see too many cases where this isn't handled properly, or where the Post Mortem gets done but the actions are not followed up on):
    • Alert Early - i.e., don't hide the problem or cover it up.  Let people know so you can start to deal with it and get some help.
    • Escalate Response - Get some help; don't try to solve it all yourself, especially if it's a potential crisis in production.
    • Stabilize - Revert if necessary, but stabilize the system.
    • Perform a Post Mortem - Ask "How can we avoid this in the future?"  Usually there are multiple points of failure.
    • Ask where we could have done more "Defensive Engineering".
    • Reassess the risks/rewards of what you are doing.
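
He didn't go into implementation details, but a minimal token-bucket sketch of the throttling/rate-limiting idea might look something like this (the class, the numbers, and the demo are mine, not anything from the talk):

    import time

    class TokenBucket:
        """Minimal token-bucket rate limiter: allow roughly `rate` requests
        per second, with short bursts up to `capacity`."""

        def __init__(self, rate, capacity):
            self.rate = rate          # tokens added per second
            self.capacity = capacity  # maximum burst size
            self.tokens = capacity
            self.last = time.monotonic()

        def allow(self):
            now = time.monotonic()
            # Refill tokens based on how much time has passed since the last call.
            self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
            self.last = now
            if self.tokens >= 1:
                self.tokens -= 1
                return True
            return False  # caller should reject or delay the request

    # Example: throttle an internal client to ~5 requests/second.
    limiter = TokenBucket(rate=5, capacity=10)
    for i in range(20):
        print(i, "allowed" if limiter.allow() else "throttled")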
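
A bare-bones version of exponential backoff with jitter might look like the following (again just my sketch, with a hypothetical flaky call standing in for the real operation):

    import random
    import time

    def call_with_backoff(operation, max_attempts=5, base_delay=0.5, max_delay=30.0):
        """Retry `operation`, doubling the wait after each failure and adding
        random jitter so many clients don't all retry in lockstep (the same
        basic idea as Ethernet's collision backoff)."""
        for attempt in range(max_attempts):
            try:
                return operation()
            except Exception:
                if attempt == max_attempts - 1:
                    raise  # give up after the last attempt
                delay = min(max_delay, base_delay * (2 ** attempt))
                time.sleep(random.uniform(0, delay))

    # Usage (flaky_rpc is hypothetical):
    # result = call_with_backoff(lambda: flaky_rpc("/some/endpoint"))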
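
And a toy illustration of the canary idea: route a small, stable slice of users to the new code path, watch the metrics, and roll back if error rates climb. This shows the general technique only, not how Google actually does it:

    import hashlib

    def use_canary(user_id, canary_percent=5):
        """Send a small, stable percentage of users to the new code path.
        Hashing the user id keeps each user on the same side of the split."""
        bucket = int(hashlib.sha256(str(user_id).encode()).hexdigest(), 16) % 100
        return bucket < canary_percent

    # Example: ~5% of users get the new code.
    for uid in ["alice", "bob", "carol"]:
        print(uid, "canary" if use_canary(uid) else "stable")
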
In addition to the wise advice there was a great story about how Santa Tracker almost caused an international incident by flying over Toronto, Ohio (well, he said some Toronto in the US, so I assume Ohio) as opposed to the other one in Canada.  Fortunately, the Canadian general was kept sufficiently distracted that he didn't look at the Santa Tracker until Mr. Claus got back on course.

Here is his TED Talk on Santa Tracker:

Cheers,
Stephen

2 comments:

Unknown said...

"...people trump process and TDD may add most value when used with less "experienced" developers to help them stay out of trouble."

I think you're fundamentally misunderstanding many of the purposes and applications of TDD, and leveraging a truism ('people trump process') to allay this.

Of course, in general, people trump process, but TDD isn't necessarily applied as a salve to nurse against particularly bad people (though it can reveal them ;). It's really there to ensure that good people's expectations are met across iterations of development.

In this regard, an "experienced" developer would, in fact, go about *making sure* that such tests are in place (even before coding!), especially (and this comes within the concept of the Gigascale talk) since you may not be sure who the next developer to change your code may be in your rapidly growing enterprise.

"People trump process" works between maybe 2 or 3 people who know each other well and that they are all amazing developers (and even then it is suspect). But if what you're developing is going to be worked on by more and more people over time, how can you be certain every one will be a great developer?

So, essentially, I argue that TDD's value is less about nursing inexperienced developers than it is about augmenting the development processes of experienced ones.

MrStevesScience said...

I will be the first to admit I misunderstand some of "the purposes and application of TDD."

That said, I think we should distinguish between writing tests first and self-testing code. I wholeheartedly agree with having self-testing code and having good tests. I do take exception to the TDD dogma I sometimes hear, where people I otherwise respect say "if you don't write tests first you are not a real programmer".

Finally, on your last point, the problem is not just having "inexperienced developers" but also "experienced" ones who may not have learned much from all that experience ;)

So good people are my first choice, and having good methodologies/team habits to augment them, including self-testing code (whether or not those tests were written first), also helps.

Cheers,
Stephen