9 Comments

Fascinating story. Early days of "testing in production".

Nov 20, 2023 · Liked by Engineer's Codex

I remember this event well. I was working for a large competing telecom equipment manufacturer at the time and we were all stunned at the extent of the outage. A lot of new customers were gained as a result of that day and there was a renewed emphasis on testing everything before deploying to production equipment.


Amazing article! Keep up the good work!

author

Thank you Brian!


Great story!

Behind any tech malfunction there is a hasty engineer 🙃


I’m sure you mean a hasty middle manager, not a hasty engineer


A hasty middle engineering manager 🙃


I see two main areas in which we have evolved since then:

- Testing as you mentioned

- Rollback mechanisms to mitigate the impact

I don't know much about how software deployments to switches work, but a canary deployment might catch this kind of thing, since the malfunction seems to have been immediate once the code was deployed.

Thanks for the very interesting article! It was interesting to take a look at the past :)

Nov 24, 2023 · Liked by Engineer's Codex

Unfortunately, the recent Optus outage in Australia sounds almost identical.
