9 Comments
Abhinav Upadhyay

Fascinating story. Early days of "testing in production".

Rich

I remember this event well. I was working for a large competing telecom equipment manufacturer at the time, and we were all stunned by the extent of the outage. We gained a lot of new customers as a result of that day, and there was a renewed emphasis on testing everything before deploying it to production equipment.

Brian Vargas

Amazing article! Keep up the good work!

Anton Zaides

Great story!

Behind any tech malfunction there is a hasty engineer 🙃

John

I’m sure you mean a hasty middle manager, not a hasty engineer

Anton Zaides

A hasty middle engineering manager 🙃

Fran Soto

I see two main areas in which we have evolved since then:

- Testing as you mentioned

- Rollback mechanisms to mitigate the impact

I don't know much about how software deployments to switches work, but a canary deployment might have caught this kind of thing, since the malfunction seems to have appeared immediately once the code was deployed.
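To make the canary idea concrete, here is a rough, hypothetical sketch, not from the article, assuming the fleet can be updated incrementally; every helper name (deploy_to, health_check, rollback) is made up: push the new release to a small slice of switches, watch their health for a soak period, and roll back before touching the rest of the fleet if anything looks wrong.

```python
import time

# Hypothetical placeholders: in a real system these would call whatever
# management interface the switch fleet exposes.
def deploy_to(switch: str, version: str) -> None:
    print(f"deploying {version} to {switch}")

def health_check(switch: str) -> bool:
    # Stand-in for real telemetry (call-completion rate, reboot loops, ...).
    return True

def rollback(switch: str, version: str) -> None:
    print(f"rolling {switch} back to {version}")

def canary_rollout(switches, old_version, new_version,
                   canary_fraction=0.05, soak_seconds=600):
    """Deploy to a small canary group first; stop and roll back on failure."""
    canary_count = max(1, int(len(switches) * canary_fraction))
    canary, rest = switches[:canary_count], switches[canary_count:]

    for s in canary:
        deploy_to(s, new_version)

    # Let the canaries soak: a malfunction that shows up right after
    # deployment surfaces here, before the rest of the fleet is touched.
    time.sleep(soak_seconds)

    if not all(health_check(s) for s in canary):
        for s in canary:
            rollback(s, old_version)
        raise RuntimeError("canary group unhealthy; rollout aborted")

    for s in rest:
        deploy_to(s, new_version)
```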

Thanks for the very interesting article! It was fun to take a look at the past :)

Raja R Harinath

Unfortunately, the recent Optus outage in Australia sounds almost identical.