Southern Polytechnic State University
Information Technology





08/22/2008 Major Systems Outage

Summary: On 08/22 DoIT suffered a major systems outage which affected nearly all major campus systems.

Systems Affected: Banner, university web server, faculty and student file services, various departmental servers, others.

Background: Previously, anomalous behavior had been detected involving the main DoIT SAN disk storage array.  Per manufacturer instructions, non-disruptive replacement of a Fibre Channel (FC) interface had been planned for the evening of 08/22.

Events: At approx. 3:00pm many campus servers connected to the main SAN disk array went offline simultaneously.  DoIT Networking and Systems staff immediately began diagnostic procedures. 

Initial diagnosis seemed to point to a failure of the FC card scheduled for replacement.  However, the system's redundant controller failed to take over the storage connections, thus causing the loss of disk activity for connected servers.  It was determined that the best course of action was to replace the faulty controller while systems were down.  At 3:30pm an announcement was made to campus that repairs were being initiated, with recovery estimated at 4:00pm.

Prior to initiating the FC card swap, though, it was discovered that far more servers were down than could be explained by the loss of a single card.  Further investigation revealed that all servers connected to a certain portion (20%) of the FC fabric were functioning, and all others were down. 
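The reasoning in this step, correlating server status with fabric location rather than with a single card, can be illustrated with a minimal sketch. The server names, segment labels, and statuses below are hypothetical placeholders, not DoIT's actual inventory or tooling.

```python
from collections import defaultdict

# Hypothetical inventory: (server, fabric segment, is_up). Illustrative only.
INVENTORY = [
    ("server-a", "segment-1", True),
    ("server-b", "segment-1", True),
    ("server-c", "segment-2", False),
    ("server-d", "segment-2", False),
    ("server-e", "segment-3", False),
]


def status_by_segment(inventory):
    """Return {segment: (servers_up, servers_total)}."""
    counts = defaultdict(lambda: [0, 0])
    for _name, segment, is_up in inventory:
        counts[segment][1] += 1
        if is_up:
            counts[segment][0] += 1
    return {seg: tuple(c) for seg, c in counts.items()}


def failed_segments(inventory):
    """Return the segments on which every attached server is down."""
    summary = status_by_segment(inventory)
    return sorted(seg for seg, (up, _total) in summary.items() if up == 0)


if __name__ == "__main__":
    for seg, (up, total) in sorted(status_by_segment(INVENTORY).items()):
        print(f"{seg}: {up}/{total} servers up")
    print("Segments fully down:", failed_segments(INVENTORY))
```

When every server on some segments is down and every server on another is up, the fault lines up with the fabric itself rather than with any one card.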

A support call was placed to the FC fabric vendor.  In the meantime, DoIT staff members began relocating critical servers to the functioning portion of the FC fabric. The first such migration was of the www.spsu.edu server, and a message was immediately posted to the main campus web page.

Per the manufacturer's suggestion, several non-disruptive attempts were made to revive the non-functional portions of the FC fabric.  At approximately 6:00pm it was determined that a full reset of the entire fabric was required.

DoIT staff immediately began performing a safe shutdown of all connected servers.  The FC fabric switches were restarted and confirmed functioning.  DoIT staff then began bringing up systems (in priority order) and confirming filesystem integrity and proper system function.

The majority of systems were back online by 6:45pm.

Diagnosis: None of the affected systems showed any meaningful log data.  FC vendor's hypothesis: the most likely cause was a degradation in FC fabric integrity over time due to the malfunctioning FC card in the SAN disk array, ultimately resulting in the failure of a large portion of the Fibre Channel fabric.

Prognosis: Outlook is good/guarded.  No major system errors were detected on startup. Many major components of the SAN were restarted and the new FC card was swapped into the disk array.  However, without concrete log data from either the disk or fabric vendors it is impossible to know for sure if the root cause has been addressed.

Treatment: Certain critical servers were consolidated onto the portion of the FC fabric closest to the SAN disk array controllers, to improve their chances of staying online should this failure recur.  DoIT staff will continue to monitor the overall health of the disk array and Fibre Channel infrastructure.

Findings: Repair progress was hindered by lack of sufficient labeling in portions of the SAN cabling infrastructure; the soon-to-be-hired Data Center Manager position will be critical in addressing these issues.


Need Help? Call the IT Helpdesk: 678-915-HELP (4357)



Southern Polytechnic
State University
1100 South Marietta
Parkway
Marietta, GA
30060-2896
1-800-635-3204
678-915-7778