9 Comments
Abhinav Upadhyay

Fascinating story. Early days of "testing in production".

Rich

I remember this event well. I was working for a large competing telecom equipment manufacturer at the time, and we were all stunned by the extent of the outage. We gained a lot of new customers as a result of that day, and there was a renewed emphasis on testing everything before deploying it to production equipment.

Brian Vargas

Amazing article! Keep up the good work!

Anton Zaides

Great story!

Behind any tech malfunction there is a hasty engineer 🙃

John

I’m sure you mean a hasty middle manager, not a hasty engineer

Anton Zaides

A hasty middle engineering manager 🙃

Fran Soto

I see two main areas in which we have evolved since then:

- Testing as you mentioned

- Rollback mechanisms to mitigate the impact

I don't know much about how software deployments to switches work, but a canary deployment might have caught this kind of thing, since the malfunction seems to have appeared immediately once the code was deployed.
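To make the canary idea concrete, here is a rough, hypothetical sketch, not from the article, assuming the fleet can be updated incrementally; every helper name (deploy_to, health_check, rollback) is made up: push the new release to a small slice of switches, watch their health for a soak period, and roll back before touching the rest of the fleet if anything looks wrong.

```python
import time

# Hypothetical placeholders: in a real system these would call whatever
# management interface the switch fleet exposes.
def deploy_to(switch: str, version: str) -> None:
    print(f"deploying {version} to {switch}")

def health_check(switch: str) -> bool:
    # Stand-in for real telemetry (call-completion rate, reboot loops, ...).
    return True

def rollback(switch: str, version: str) -> None:
    print(f"rolling {switch} back to {version}")

def canary_rollout(switches, old_version, new_version,
                   canary_fraction=0.05, soak_seconds=600):
    """Deploy to a small canary group first; stop and roll back on failure."""
    canary_count = max(1, int(len(switches) * canary_fraction))
    canary, rest = switches[:canary_count], switches[canary_count:]

    for s in canary:
        deploy_to(s, new_version)

    # Let the canaries soak: a malfunction that shows up right after
    # deployment surfaces here, before the rest of the fleet is touched.
    time.sleep(soak_seconds)

    if not all(health_check(s) for s in canary):
        for s in canary:
            rollback(s, old_version)
        raise RuntimeError("canary group unhealthy; rollout aborted")

    for s in rest:
        deploy_to(s, new_version)
```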

Thanks for the very interesting article! It was fun to take a look at the past :)

Raja R Harinath

Unfortunately, the recent Optus outage in Australia sounds almost identical.