2008-01-14 19:33:14

SwissIX Outage - Rumours

At 14:12 on Sunday, the 20th of January, massive packet loss began to occur on all links at SwissIX. Shortly afterwards, many autonomous systems lost all connectivity. The Swisscom GPRS fallback worked well, though, apparently because Swisscom had enough peerings outside of SwissIX.

A look at the traffic in SwissIX showed many copies of the very same packets flooding the network; it looked entirely like a switch loop. In theory, however, all switches used in datacenters should run spanning tree precisely to avoid this situation, so the explanation is not fully convincing on its own. A mail on layerone-customer, however, also claims that a switch loop at InterXion caused the failure. Maybe an unusual situation triggered a bug that caused STP to fail?

Cyberlink restored connectivity pretty quickly: once its interfaces at SwissIX were shut down, traffic ran over Germany and similar fallback routes. swissix.ch remained unreachable, even via Swisscom GPRS. The SwiNOG coordination network was split into two pieces, which reduced it to absurdity.

Solnet took more than 2 hours to recover. Around 15:30, the total packet loss beyond the main Zuchwil router turned into a routing loop with the next hop. Around 16:30, connectivity started to get back to normal. Maybe too few non-SwissIX peers are available for a case like this?

Unfortunately, the outage fell right in the middle of the move of the NGAS.ch network monitoring infrastructure, so there are no precise graphs available showing what exactly happened. This was corrected immediately afterwards.

