Saturday, October 01, 2011

What a dying SFP looks like

Fibre channel (FC) storage is handy, and generally very reliable, in my experience. I certainly do not miss the days of messing around with disks in a server-room. And I like the fact that RAIDs may be cut up into slices (LUNs) which may be shared by many servers, resulting in very efficient use of the disks (if so wanted).

One part about FC that I dislike (in addition to the price tags): SFPs. Why on earth are transceivers not an integral part of a Fibre Channel switch? Having the transceivers be separate units means more electrical contact points, and a potential support mess (it's not hard to imagine a situation where the support contract of an SFP has run out, while the switch itself is still covered).

Anyway: Today, I experienced an defunct SFP, for the first time. The following observations may give a hint of how to discover that an SFP is starting to malfunction. The setup is an IBM DS4800 storage system where port 2 on controller B is connected to port 0 on an IBM TotalStorage SAN32B FC switch (which is an IBM-branded Brocade 5100 switch).

Friday morning at 07:49, in syslog: A few messages like this from the FC switch:
raslogd: 2011/09/30-07:49:07, [SNMP-1008], 2113, WWN 10:... | FID 128, INFO, IBM_2005_B32_B,  The last device change happened at : Fri Sep 30 07:49:01 2011

At the same time the storage system started complaining about "Drive not on preferred path due to ADT/RDAC failover", meaning that at least one server had started using a non-optimal path, most likely due to I/O timeouts on the preferred path. And a first spike in the bad_os count occurred for the FC switch port:

bad_os is a counter which exists in Brocade switches, and possibly others. Brocade describes it as the number of invalid ordered sets received.

At 10:55, in syslog:
raslogd: 2011/09/30-10:55:02, [FW-1424], 2118, WWN 10:... | FID 128, WARNING, IBM_2005_B32_B, Switch status changed from HEALTHY to MARGINAL
At the same time, there was a slightly larger spike in the bad_os graph.
Coinciding: The storage system sent a mail warning about "Data rate negotiation failed" for the port.

At 17:00: The count for bit-traffic flat-lined (graph not shown). I.e.: All traffic had ceased.

At no point did the graphs for C3 discards, encoding errors or CRC errors show any spikes.

The next morning, the involved optical cable was switched; that didn't help. Inserting another SFP helped, leading to the conclusion that the old SFP had started to malfunction.

Morale: Make sure to not just keep spare cables around. A spare SFP should also be kept in stock.
And monitor your systems: A centralized and regularly inspected syslog is invaluable. Generating graphs for key counters is also mandatory for mature systems operation; one way to collect and display key counts for Brocade switches is to use Munin and a Munin plugin which I wrote.

PS: Brocade documentation states that SFP problems might result in the combination of rises in CRC errors and encoding/disparity errors. This did not happen in this situation.