Troubleshooting and Analysing Problem on Server and FC Switch


I want to share one of the case that we solved last days about Linux and Brocade SAN switch. Problem observed by application admin that queries take too much time. It was a typical I/O performance problem when they checked the application and database.

There were some tests, which have almost done by system admin when we start to analyze the problem. I‘ll typically order of controls in my post step by step.

What kind of hardware and software did we work on it?

  • Brocade SAN Switch
  • HP Proliant Servers
  • RedHat Enterprise Linux

Step 1: Check DB Queries

When DBA checks queries it was clear that queries take a much longer time than before. For example; last day the same queries finished in 60 microseconds but now it takes nearly 600 microseconds. And also there was some task which waits for disk I/O.

Step 2: Analyze System Side

All system logs can be monitored on “/var/log/messages” file. When the application got performance issues, some logs has appeared on that file, which belongs to Qlogic driver. The bad news was this problem was repeated at random intervals. When the problem occurred all disks were online and disks paths were active.

    “Jul 2 09:40:31 db02 kernel: qla2xxx [0000:47:00.1]-801c:2: Abort command issued nexus=2:4:6–1 2002.”

These logs indicate that an error condition being reported from SAN while perform I/O operation. It is generic errors that without deeply analyze you could not find root cause. However, it is clear that this performance problem related to Storage and SAN configuration or topologies.

Step 3: Check Errors on System and SAN Switch Side

Top-down and bottom-up investing are two different ways to analyze a problem. In this case, it will be an advantage to use the bottom-up method. Because an error appeared first time on system logs.

From RHEL logs, it shows that some kind of reset occurs on qlogic device which id “0000:47:00.1”. Error messages “qla2xxx [0000:47:00.1]-801c:2: Abort command issued nexus=2:4:6–1 2002” is explained like;

  • qla2xxx is the name of the driver
  • 0000:47:00.1 is PCI bus information
  • 801c:2 is a hexadecimal id which identifies the part of code
  • 1 is the number of SCSI target
  • Abort command issued nexus=2:4:6 Abort command was in progress for the SCSI target 2:4:6
  • 2002 means reset succeeded

Step 4: Enable Extended Logging

Enable extended logging for qla2xx driver. In addition, if you need more logs on SCSI layer you should enable logging from the kernel parameter.

# chmod u+w /sys/module/qla2xxx/parameters/ql2xextended_error_logging
# echo "1" > /sys/module/qla2xxx/parameters/ql2xextended_error_logging
# sysctl -w dev.scsi.logging_level=0x1003

Check additional error logging in “/var/log/messages” when the problem occurs again.

  • Jul 2 09:40:30 DB2 kernel: qla2xxx [0000:47:00.1]-8802:2: Aborting from RISC nexus=2:4:6 sp=ffff882963f50bc0 cmd=ffff883062feea80 handle=5af
  • Jul 2 09:40:30 DB2 kernel: qla2xxx [0000:47:00.1]-8804:2: Abort command mbx success cmd=ffff883062feea80.
  • Jul 2 09:40:30 DB2 kernel: qla2xxx [0000:47:00.1]-3822:2: FCP command status: 0x5–0x0 (0x80000) nexus=2:4:6 portid=040500 oxid=0x504 cdb=2a20008588b200000100 len=0x200 rsp_info=0x0 resid=0x0 fw_resid=0x0 sp=ffff882963f50bc0 cp=ffff883062feea80.

“[0000:47:00.1]” shows which HBA getting trouble while request I/O. It is a critical step to find out, which HBA is problematic.

The beginning of the number shows the PCI address of HBA in that case.

#lspci | grep -i qlogic
#systool -vc fc_host

These two commands show disk service time. You should check disk service time to get information about which disks are getting trouble while sent I/O requests.

#sar  -d  1 10000
#iostat  -xcnz 1 1000

Svctm and %util are two columns that you should check. When a problem occurs these two values will be observed svctm > 40 and %util > 100.

So now, we have finished analyzing the server site. Now let us make a clean report about the case that we focused on.

Observed:

  • Disk Performance issues
  • Syslog HBA abort messages
  • SCSI errors
  • High service time and wait
  • High disk utilization with low write and read request

Resolution:

  • These types of errors indicate an error condition returned from the SAN also Storage Side.
  • Check to verify there are no errors on SAN Switch, FC Cabling, Zoning and Storage Array.

Step 5: SAN Switch Analysis

This is time to look at the SAN switch port’s performance metrics. First, we should find out which port used by Servers on SAN switch.

a) Get Port WWN from Server Side:

#systool -vc fc_host

b) Check where this port connected to the FC Switch side.

#switchshow

c) Check port performance metrics.

#portstatsshow 12
stat_wtx                      2164888415  4-byte words transmitted
stat_wrx                      1286366630  4-byte words received
stat_ftx                      938268642   Frames transmitted
stat_frx                      1515127631  Frames received
stat_c2_frx                   0           Class 2 frames received
stat_c3_frx                   1515127631  Class 3 frames received
stat_lc_rx                    0           Link control frames received
stat_mc_rx                    0           Multicast frames received
stat_mc_to                    0           Multicast timeouts
stat_mc_tx                    0           Multicast frames transmitted
tim_rdy_pri                   0           Time R_RDY high priority
tim_txcrd_z                   876         Time TX Credit Zero (2.5Us ticks)
tim_txcrd_z_vc  0- 3:  0           0           0           876      
tim_txcrd_z_vc  4- 7:  0           0           0           0         
tim_txcrd_z_vc  8-11:  0           0           0           0        
tim_txcrd_z_vc 12-15:  0           0           0           0        
er_enc_in                     0           Encoding errors inside of frames
er_crc                        0           Frames with CRC errors
er_trunc                      0           Frames shorter than minimum
er_toolong                    0           Frames longer than maximum
er_bad_eof                    0           Frames with bad end-of-frame
er_enc_out                    13480       Encoding error outside of frames
er_bad_os                     13004       Invalid ordered set
er_rx_c3_timeout              99          Class 3 receive frames discarded due to timeout
er_tx_c3_timeout              0           Class 3 transmit frames discarded due to timeout
er_c3_dest_unreach            0           Class 3 frames discarded due to destination unreachable
er_other_discard              0           Other discards
er_type1_miss                 0           frames with FTB type 1 miss
er_type2_miss                 0           frames with FTB type 2 miss
er_type6_miss                 0           frames with FTB type 6 miss
er_zone_miss                  0           frames with hard zoning miss
er_lun_zone_miss              0           frames with LUN zoning miss
er_crc_good_eof               0           Crc error with good eof
er_inv_arb                    0           Invalid ARB
open                          0           loop_open
transfer                      0           loop_transfer
opened                        0           FL_Port opened
starve_stop                   0           tenancies stopped due to starvation
fl_tenancy                    0           number of times FL has the tenancy
nl_tenancy                    0           number of times NL has the tenancy
zero_tenancy                  0           zero tenancy

If you are observing very high er_enc_out and er_bad_os errors and they are increasing rapidly, it means there is a physical connection problem between Server and SAN switch. Please check FC Cable, SFP, and GBIC.

On the other hand, if only er_bad_os error increases that should be a “fillword“ configuration problem. You should ask your vendor about “fillword” configuration.

Device vendors have their own recommendations about “fillword” parameters. It can be useful to check this link.

 Mode “2” is the formal industry standard and it has recommended for HDS. Most of vendors recommend using mode “3” because it covers mode “1” and “2”. I recommend to use “fillword” 3 for all 8G-ports then if require change it to “0”.

Other parameters that we should check out are “er_tx_c3_timeout” and “tim_txcrd_z”. These two parameters mostly have a close relationship with the performance of FC ports.

tim_txcrd_z” parameter shows the number of times that port was unable to transmit frames because “BB credit” was zero. It means that if a port is well utilized or not. All samples get in 2.5-microsecond intervals. An increment of this parameter means that frames could not be sent to attach device in 2.5 microseconds. So this control is a good way to check if your port is well utilized or not.

d) How to monitor tim_txcrd_z parameter?

Clear specific port statistics then check “tim_txcrd_z” at every minute to observe if the counter increases more than 400.000. If your counter increases 400.000 per minute, so you have a problem with that port. Check your server and storage I/O metrics.

On the other hand, if number of frames that was unable to transmit increases more than 400.000 per minute, “er_tx_c3_timeout” will probably start to increase. It is more critical than “tim_txcrd_z” parameter. Because frames transmission will be discarded due to time_out. At this point, you should replace FC Cable, GBIC, and SFP. Ask your vendor to check I/O size and performance metrics.

Portstatsshow:

stat_wtx                      2324107763  4-byte words transmitted
stat_wrx                      2872536291  4-byte words received
stat_ftx                      403937269   Frames transmitted
stat_frx                      252991294   Frames received
stat_c2_frx                   0           Class 2 frames received
stat_c3_frx                   252991294   Class 3 frames received
stat_lc_rx                    0           Link control frames received
stat_mc_rx                    0           Multicast frames received
stat_mc_to                    0           Multicast timeouts
stat_mc_tx                    0           Multicast frames transmitted
tim_rdy_pri                   0           Time R_RDY high priority
tim_txcrd_z                   14578896    Time TX Credit Zero (2.5Us ticks)
tim_txcrd_z_vc  0- 3:  0           0           0           15426365 
tim_txcrd_z_vc  4- 7:  0           0           0           0        
tim_txcrd_z_vc  8-11:  0           0           0           0        
tim_txcrd_z_vc 12-15:  0           0           0           0        
er_enc_in                     0           Encoding errors inside of frames
er_crc                        0           Frames with CRC errors
er_trunc                      0           Frames shorter than minimum
er_toolong                    0           Frames longer than maximum
er_bad_eof                    0           Frames with bad end-of-frame
er_enc_out                    0           Encoding error outside of frames
er_bad_os                     0           Invalid ordered set
er_rx_c3_timeout              0           Class 3 receive frames discarded due to timeout
er_tx_c3_timeout              478         Class 3 transmit frames discarded due to timeout
er_c3_dest_unreach            0           Class 3 frames discarded due to destination unreachable
er_other_discard              0           Other discards
er_type1_miss                 0           frames with FTB type 1 miss
er_type2_miss                 0           frames with FTB type 2 miss
er_type6_miss                 0           frames with FTB type 6 miss
er_zone_miss                  0           frames with hard zoning miss
er_lun_zone_miss              0           frames with LUN zoning miss
er_crc_good_eof               0           Crc error with good eof
er_inv_arb                    0           Invalid ARB
open                          0           loop_open
transfer                      0           loop_transfer
opened                        0           FL_Port opened
starve_stop                   0           tenancies stopped due to starvation
fl_tenancy                    0           number of times FL has the tenancy
nl_tenancy                    0           number of times NL has the tenancy
zero_tenancy                  0           zero tenancy

 

Tagged In:

I'm a IT Infrastructure and Operations Architect with extensive experience and administration skills and works for Turk Telekom. I provide hardware and software support for the IT Infrastructure and Operations tasks.

199 Total Posts
Follow Me

1 Comments

  1. I'm extremely impressed with your writing skills as well as with the layout on your weblog. Is this a paid theme or did you customize it yourself? Anyway keep up the nice quality writing, it's rare to see a great blog like this one these days.

Leave a Reply