|
10/16/2008 Major Network Outage
Summary: On 10/16 DoIT suffered a network problem which affected a majority of the workstations on campus. In order to troubleshoot the issue, DoIT personnel were forced to bring down large portions of the network at various times throughout the afternoon and evening.
Systems Affected: Access to all systems (including telephone) was down for all internal users at various times, for periods ranging from several minutes to several hours. Overall, the major outage event spanned 3 to 4 hours. Inbound Internet traffic was unaffected.
Background: DHCP (Dynamic Host Configuration Protocol) is the means by which the majority of user workstations on campus acquire an address so that they may communicate on the campus network. At some time on 10/15 or 10/16 (we believe) a network anomaly occurred which caused DHCP to stop functioning properly. At around 3:45 pm the campus suffered a widespread momentary power outage. This caused nearly all workstations on campus to lose power and consequently reboot. Those machines which depended on DHCP were unable to gain an address and therefore could not communicate within SPSU or with the Internet.
Events: At approximately 3:45 pm the power went out to the majority of campus momentarily (cause unknown). DoIT quickly became aware that certain areas of the network had ceased to function after power resumed, but the root cause was not readily apparent.
By 4:15 pm it was obvious that certain areas were not able to reach the DHCP server. DoIT personnel attempted to track down the cause. Network scanning revealed that DHCP requests were arriving at the server and the server was responding, but the responses were not making it back to the requesting workstations.
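The symptom described above (requests reach the server, replies leave it, but clients never receive them) can be checked by comparing DHCP transaction IDs seen at two capture points. The following is a minimal Python sketch, assuming the captures have already been reduced to (transaction ID, BOOTP op code) pairs; the function name and the sample data are hypothetical and are not taken from DoIT's tooling.

```python
from typing import Iterable, Set, Tuple

BOOTREQUEST, BOOTREPLY = 1, 2  # BOOTP op codes (RFC 951 / RFC 2131)

def replies_lost_in_transit(
    server_side: Iterable[Tuple[int, int]],
    client_side: Iterable[Tuple[int, int]],
) -> Set[int]:
    """Return transaction IDs the server replied to but the client side never saw.

    Each input is an iterable of (xid, op) pairs taken from a packet capture
    at that point in the network (hypothetical pre-processed format).
    """
    replied_at_server = {xid for xid, op in server_side if op == BOOTREPLY}
    seen_at_client = {xid for xid, op in client_side if op == BOOTREPLY}
    return replied_at_server - seen_at_client

# Illustrative data only: the server answers both requests,
# but only one reply shows up on the client subnet.
server_capture = [(0xA1, BOOTREQUEST), (0xA1, BOOTREPLY),
                  (0xB2, BOOTREQUEST), (0xB2, BOOTREPLY)]
client_capture = [(0xA1, BOOTREQUEST), (0xA1, BOOTREPLY),
                  (0xB2, BOOTREQUEST)]
lost = replies_lost_in_transit(server_capture, client_capture)
```

A non-empty result points at the path between the server and the client subnet rather than at the server itself.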
This behavior seemed to indicate a problem with the core router, which controls access to DHCP for all subnets via "helper addresses" (DHCP client broadcasts are not forwarded between subnets unless a router relays them). At approximately 4:45 pm the decision was made to reboot the core router to see if DHCP functionality would return. The reboot did not resolve the problem.
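For readers unfamiliar with helper addresses: a router configured with one acts as a DHCP relay agent. It picks up the client's broadcast, writes its own interface address into the giaddr field of the BOOTP header, increments the hop count, and forwards the packet as unicast to the DHCP server, which sends its reply back via that address. The Python sketch below shows only this header rewrite, using the fixed header layout from RFC 2131. The function names, MAC address, and IP address are hypothetical and do not reflect the campus configuration.

```python
import socket
import struct

HOPS_OFFSET = 3     # op, htype, hlen precede the hops byte
GIADDR_OFFSET = 24  # 4 (op..hops) + 4 (xid) + 4 (secs, flags) + 12 (ciaddr, yiaddr, siaddr)

def build_request_header(xid: int, mac: bytes) -> bytes:
    """Build the fixed 236-byte BOOTP header of a client request (options omitted)."""
    zero_ip = b"\x00" * 4
    return struct.pack(
        "!BBBBIHH4s4s4s4s16s64s128s",
        1,        # op: BOOTREQUEST
        1,        # htype: Ethernet
        6,        # hlen: MAC address length
        0,        # hops
        xid,      # transaction ID
        0,        # secs
        0x8000,   # flags: broadcast bit set
        zero_ip,  # ciaddr
        zero_ip,  # yiaddr
        zero_ip,  # siaddr
        zero_ip,  # giaddr (empty until a relay fills it in)
        mac,      # chaddr (struct pads to 16 bytes with zeros)
        b"",      # sname (padded to 64 bytes)
        b"",      # file (padded to 128 bytes)
    )

def relay_request(packet: bytes, relay_ip: str) -> bytes:
    """Mimic what a relay agent does to a client request before forwarding it."""
    buf = bytearray(packet)
    # Only the first relay on the path sets giaddr; later relays leave it alone.
    if bytes(buf[GIADDR_OFFSET:GIADDR_OFFSET + 4]) == b"\x00" * 4:
        buf[GIADDR_OFFSET:GIADDR_OFFSET + 4] = socket.inet_aton(relay_ip)
    buf[HOPS_OFFSET] += 1
    return bytes(buf)

# Hypothetical client MAC and router interface address.
request = build_request_header(0x1234ABCD, bytes.fromhex("001122334455"))
relayed = relay_request(request, "10.1.2.1")
```

Because the server's replies travel back through the relay, a fault in the relaying router can produce exactly the pattern of requests arriving but responses not returning.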
For a few moments after the reboot, however, DHCP traffic flowed normally. It was therefore believed that the problem might be coming from a specific building. The decision was made to systematically disconnect all external switches to see if DHCP functionality would return.
After closing down the connections to all campus nodes outside of the Data Center, DHCP began functioning again after a delay of about 5 minutes. At this point DoIT began bringing buildings back online, one by one, watching to see how DHCP responded. Unfortunately, a wait of at least 5 minutes was necessary after each building was brought back online. An attempt was made to prioritize the order in which connectivity was returned, beginning with building J.
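The one-by-one procedure described above amounts to a linear search in which every step costs a settling wait of at least 5 minutes, which is why isolation took as long as it did. The Python sketch below models it under stated assumptions: the check function and every building name other than J are hypothetical, and the report does not say how many buildings were involved.

```python
from typing import Callable, List, Optional, Tuple

SETTLE_MINUTES = 5  # minimum wait after each reconnect, per the report

def find_problem_building(
    buildings: List[str],
    dhcp_ok_after_reconnect: Callable[[str], bool],
) -> Tuple[Optional[str], int]:
    """Reconnect buildings in priority order; stop at the first one that breaks DHCP.

    Returns (problem building or None, minimum minutes spent waiting).
    """
    elapsed = 0
    for name in buildings:
        elapsed += SETTLE_MINUTES  # wait before judging DHCP health
        if not dhcp_ok_after_reconnect(name):
            return name, elapsed
    return None, elapsed

# Illustrative run: building J first, as in the report; the rest are made up.
order = ["J", "A", "B", "C"]
culprit, minutes = find_problem_building(order, lambda b: b != "B")
```

The waiting time grows linearly with the position of the faulty building in the reconnect order, so the prioritization affects which users come back first but not the worst-case total.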
By 6:30 pm the source building had been identified and connectivity had been restored to all other buildings. The "problem building" experienced several more hours of intermittent connectivity while the root cause was investigated. It is also possible that the remainder of campus experienced intermittent DHCP outages during testing, but by that point the majority of machines should have already acquired network addresses.
DoIT staff worked with Enterasys phone support until approximately 6:00 am the following morning, attempting to diagnose the root cause of the problem (those employees returned to campus at 11:00 am). Additional DoIT staff came in at around 7:30 am to continue the troubleshooting process. A dedicated DHCP server for the "problem building" was brought online, and that final building achieved somewhat normal functionality by 1:00 pm.
Diagnosis: At the time of writing, DoIT and our networking vendor still do not know the exact cause of the problem. DoIT's hypothesis is that a workstation or device in the source building malfunctioned (or launched a deliberate attack) and "poisoned" the DHCP process (affecting the DHCP server and/or network hardware). If this is the case, the "attack" is being launched as a normal-looking DHCP request (i.e., not a visible DoS or other attack) and is therefore almost impossible to track down.
Prognosis: Outlook is good/guarded. No major loss of data was reported. There does not appear to have been any security-related threat involved. DoIT will continue to monitor the situation for recurrence.
Treatment: The source building has been isolated to its own DHCP server. This will allow DoIT to protect the rest of campus while focusing on the root cause.
Findings: Edge-to-edge network flow monitoring would have allowed DoIT staff to troubleshoot and diagnose the problem more quickly and might have identified the root cause. However, these systems are extremely expensive (> $100K), so it is not likely that one will be obtained soon.
|