Fascinating story. Early days of "testing in production".
I remember this event well. I was working for a large competing telecom equipment manufacturer at the time, and we were all stunned at the extent of the outage. We gained a lot of new customers as a result of that day, and there was a renewed emphasis on testing everything before deploying to production equipment.
Amazing article! Keep up the work!
Thank you Brian!
Great story!
Behind any tech malfunction there is a hasty engineer 🙃
I’m sure you mean a hasty middle manager, not a hasty engineer
A hasty middle engineering manager 🙃
I see two main areas in which we have evolved since then:
- Testing as you mentioned
- Rollback mechanisms to mitigate the impact
I don't know much about how software deployments to switches work, but a canary deployment might catch this kind of thing, since it seems the malfunction was immediate once the code was deployed (rough sketch of the idea below).
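To illustrate what I mean, here is a minimal sketch of a canary rollout with automatic rollback. The fleet list and the deploy_to, health_check, and rollback helpers are hypothetical stand-ins for whatever the real switch-management tooling would provide; the only point is the staging logic.

```python
import random
import time

SWITCHES = [f"switch-{i:02d}" for i in range(20)]  # hypothetical fleet
CANARY_FRACTION = 0.05        # start with ~5% of the fleet
HEALTH_CHECK_WINDOW_S = 2     # shortened for the example; much longer in real life


def deploy_to(switch: str, version: str) -> None:
    """Pretend to push a new software version to one switch."""
    print(f"deploying {version} to {switch}")


def rollback(switch: str, version: str) -> None:
    """Pretend to restore the previous known-good version."""
    print(f"rolling back {switch} to {version}")


def health_check(switch: str) -> bool:
    """Pretend to poll the switch; a real check would look at call
    completion rates, error counters, crash/restart loops, etc."""
    return random.random() > 0.1  # simulated 10% failure rate


def canary_rollout(new_version: str, old_version: str) -> bool:
    canary_count = max(1, int(len(SWITCHES) * CANARY_FRACTION))
    canaries, rest = SWITCHES[:canary_count], SWITCHES[canary_count:]

    # Stage 1: deploy only to the canary switches and watch them.
    for sw in canaries:
        deploy_to(sw, new_version)
    time.sleep(HEALTH_CHECK_WINDOW_S)

    if not all(health_check(sw) for sw in canaries):
        # An immediate malfunction shows up here, on a handful of
        # switches instead of the whole network.
        for sw in canaries:
            rollback(sw, old_version)
        return False

    # Stage 2: canaries look healthy, roll out to the rest of the fleet.
    for sw in rest:
        deploy_to(sw, new_version)
    return True


if __name__ == "__main__":
    ok = canary_rollout(new_version="v2.0", old_version="v1.9")
    print("rollout succeeded" if ok else "rollout aborted and rolled back")
```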
Thanks for the very interesting article! It was great to take a look at the past :)
Unfortunately, the recent Optus outage in Australia sounds almost identical.