Troubleshooting and Analysing Problem on Server and FC Switch


I want to share a case we solved a few days ago involving Linux and a Brocade SAN switch. The application admin observed that queries were taking too long. When the application and database sides were checked, it looked like a typical I/O performance problem.

The system admin had already run most of the tests by the time we started to analyze the problem. In this post I will go through the checks in order, step by step.

What hardware and software did we work on?

  • Brocade SAN Switch
  • HP Proliant Servers
  • RedHat Enterprise Linux

 

Step 1: Check DB Queries

When the DBA checked the queries, it was clear that they were taking much longer than before. For example, queries that finished in 60 microseconds the day before now took nearly 600 microseconds. There were also some tasks waiting for disk I/O.

Step 2: Analyze System Side

System logs can be monitored in the “/var/log/messages” file. When the performance issues occurred, messages from the QLogic driver appeared in this file. The case was complex because the problem recurred at random intervals. While the problem was occurring, all disks were online and all disk paths were active.

“Jul  2 09:40:31 db02 kernel: qla2xxx [0000:47:00.1]-801c:2: Abort command issued nexus=2:4:6 --  1 2002.”

These log entries indicate that an error condition was reported from the SAN while I/O was being performed. They are generic errors, so the root cause cannot be found without deeper analysis. However, it is clear that the performance problem is related to the storage and SAN configuration or topology.

Step 3: Check Errors on System and SAN Switch Side

Top-down and bottom-up investigation are two different ways to analyze a problem. In this case the bottom-up method is the better choice, because the error first appeared in the system's logs.

The RHEL logs show that some kind of abort or reset occurs on the QLogic device with the ID “0000:47:00.1”. The error message “qla2xxx [0000:47:00.1]-801c:2: Abort command issued nexus=2:4:6 --  1 2002” can be read as follows:

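  • qla2xxx is the name of the driver.
  • 0000:47:00.1 is the PCI address of the HBA port.
  • 801c:2 contains a hexadecimal ID (801c) that identifies the part of the driver code that logged the message.
  • “Abort command issued nexus=2:4:6” means that an abort command was issued for the SCSI device at nexus 2:4:6.
  • 1 is the wait flag that the driver logs together with the abort.
  • 2002 is the return code and means that the abort succeeded.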
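When many of these messages have to be compared, it can help to split them into fields automatically. The following is a minimal sketch rather than a complete parser: the function name `parse_qla_line` and the field names are hypothetical, and treating the number after the message ID as the SCSI host number is an assumption based on it matching the first nexus field in the sample.

```python
import re

# Common prefix of qla2xxx kernel messages as seen in this post:
#   qla2xxx [<PCI address>]-<message id>:<number>: <text>
QLA_RE = re.compile(
    r"qla2xxx \[(?P<pci>[0-9a-fA-F:.]+)\]-(?P<msg_id>[0-9a-fA-F]+):(?P<host>\d+): (?P<text>.*)"
)
NEXUS_RE = re.compile(r"nexus=(\d+):(\d+):(\d+)")


def parse_qla_line(line):
    """Return a dict of fields for a qla2xxx log line, or None if it does not match."""
    match = QLA_RE.search(line)
    if match is None:
        return None
    fields = match.groupdict()
    nexus = NEXUS_RE.search(fields["text"])
    # The nexus is kept as a tuple of three integers, e.g. (2, 4, 6).
    fields["nexus"] = tuple(int(x) for x in nexus.groups()) if nexus else None
    return fields


sample = ("Jul  2 09:40:31 db02 kernel: qla2xxx [0000:47:00.1]-801c:2: "
          "Abort command issued nexus=2:4:6")
print(parse_qla_line(sample))
```

Grouping the parsed lines by `pci` and `nexus` makes it easier to see whether the aborts concentrate on one HBA port or one device.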

Step 4: Enable Extended Logging

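Enable extended logging for the qla2xxx driver (the driver provides the ql2xextended_error_logging module parameter for this). In addition, if you need more logs from the SCSI layer, enable SCSI logging through the kernel parameter (scsi_mod.scsi_logging_level).

Check the additional error messages in “/var/log/messages” when the problem occurs again.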

Jul  2 09:40:30 DB2 kernel: qla2xxx [0000:47:00.1]-8802:2: Aborting from RISC nexus=2:4:6 sp=ffff882963f50bc0 cmd=ffff883062feea80 handle=5af

Jul  2 09:40:30 DB2 kernel: qla2xxx [0000:47:00.1]-8804:2: Abort command mbx success cmd=ffff883062feea80.

Jul  2 09:40:30 DB2 kernel: qla2xxx [0000:47:00.1]-3822:2: FCP command status: 0x5-0x0 (0x80000) nexus=2:4:6 portid=040500 oxid=0x504 cdb=2a20008588b200000100 len=0x200 rsp_info=0x0 resid=0x0 fw_resid=0x0 sp=ffff882963f50bc0 cp=ffff883062feea80.

“[0000:47:00.1]” at the beginning of each message is the PCI address of the HBA that ran into trouble while issuing I/O requests. Identifying the problematic HBA is a critical step.

Extended disk statistics, such as the iostat-style output below, show disk service time. Check the service time to find out which disks are having trouble while I/O requests are sent to them.

svctm and %util are the two columns to check. When the problem occurs, you will see svctm above 40 and %util at or near 100.

Device:         rrqm/s   wrqm/s     r/s     w/s   rsec/s   wsec/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util

sdk               0.00     0.00    0.00    5.00     0.00     5.00     1.00     0.00    0.20    0.00    0.20   0.20   0.10

sdj               0.00     0.00    0.00   17.00     0.00    64.00     3.76     0.00    0.12    0.00    0.12   0.12   0.20

sdl               0.00     0.00    0.00   14.00     0.00    61.00     4.36     0.00    0.07    0.00    0.07   0.07   0.10

sds               0.00     0.00    9.00    1.00   288.00    32.00    32.00     0.00    0.30    0.33    0.00   0.30   0.30

sdx               0.00     0.00    4.00    1.00     4.00     1.00     1.00     0.00    0.20    0.25    0.00   0.20   0.10

sdv               0.00     0.00    0.00   12.00     0.00    12.00     1.00     0.00    0.17    0.00    0.17   0.17   0.20
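Scanning this kind of output by eye gets tedious when there are many devices. The following sketch flags the rows that cross the svctm and %util limits described above. The function name `slow_devices`, the default limits, and the `sdz` row are illustrative assumptions; only the `sdk` and `sdj` rows are taken from the output above.

```python
# Column names in the order of the header line shown above.
HEADER = ("Device: rrqm/s wrqm/s r/s w/s rsec/s wsec/s avgrq-sz avgqu-sz "
          "await r_await w_await svctm %util").split()


def slow_devices(rows, svctm_limit=40.0, util_limit=99.0):
    """Return device names whose svctm or %util exceed the given limits."""
    flagged = []
    for row in rows:
        parts = row.split()
        # Skip blank lines, the header, and anything with an unexpected shape.
        if len(parts) != len(HEADER) or parts[0] == "Device:":
            continue
        stats = dict(zip(HEADER[1:], map(float, parts[1:])))
        if stats["svctm"] > svctm_limit or stats["%util"] > util_limit:
            flagged.append(parts[0])
    return flagged


rows = [
    "Device: rrqm/s wrqm/s r/s w/s rsec/s wsec/s avgrq-sz avgqu-sz await r_await w_await svctm %util",
    "sdk 0.00 0.00 0.00 5.00 0.00 5.00 1.00 0.00 0.20 0.00 0.20 0.20 0.10",
    "sdj 0.00 0.00 0.00 17.00 0.00 64.00 3.76 0.00 0.12 0.00 0.12 0.12 0.20",
    # Hypothetical row, not from the real system, showing the problem pattern.
    "sdz 0.00 0.00 2.00 3.00 16.00 24.00 8.00 1.90 380.00 350.00 400.00 45.00 99.80",
]
print(slow_devices(rows))
```

The limits are parameters on purpose, since sensible values depend on the storage behind the devices.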

 

We have now finished analyzing the server side. Let us write a clear summary of the case.

Observed:

  • Disk performance issues
  • HBA abort messages in syslog
  • SCSI errors
  • High service time and wait time
  • High disk utilization with few read and write requests

Resolution:

  • Errors of this type indicate an error condition returned from the SAN or storage side.
  • Verify that there are no errors on the SAN switch, FC cabling, zoning, and storage array.

Step 5: SAN Switch Analysis

Now it is time to look at the performance metrics of the SAN switch ports. First, we need to find out which switch port the server uses.

Get Port WWN from Server Side:
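One way to collect the port WWNs on a Linux server is to read them from sysfs, where the FC transport class exposes a `port_name` file per FC host. The sketch below assumes that layout (`/sys/class/fc_host/hostN/port_name`); the function name `fc_port_wwns` is hypothetical, and the colon-separated formatting is only a convenience for comparing with switch output.

```python
import glob
import os


def fc_port_wwns(base="/sys/class/fc_host"):
    """Map each FC host (e.g. 'host2') to its port WWN in colon-separated form."""
    wwns = {}
    for path in sorted(glob.glob(os.path.join(base, "host*", "port_name"))):
        host = os.path.basename(os.path.dirname(path))
        with open(path) as fh:
            raw = fh.read().strip()
        # sysfs reports the WWN as a hex string with a leading "0x".
        digits = raw[2:] if raw.lower().startswith("0x") else raw
        wwns[host] = ":".join(digits[i:i + 2] for i in range(0, len(digits), 2))
    return wwns


if __name__ == "__main__":
    for host, wwn in fc_port_wwns().items():
        print(host, wwn)
```

On a machine without FC HBAs the function simply returns an empty dictionary.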

Check where this port is connected on the FC switch side.

Check the port performance metrics (portstatsshow output):

stat_wtx                      2164888415  4-byte words transmitted

stat_wrx                      1286366630  4-byte words received

stat_ftx                      938268642   Frames transmitted

stat_frx                      1515127631  Frames received

stat_c2_frx                   0           Class 2 frames received

stat_c3_frx                   1515127631  Class 3 frames received

stat_lc_rx                    0           Link control frames received

stat_mc_rx                    0           Multicast frames received

stat_mc_to                    0           Multicast timeouts

stat_mc_tx                    0           Multicast frames transmitted

tim_rdy_pri                   0           Time R_RDY high priority

tim_txcrd_z                   876         Time TX Credit Zero (2.5Us ticks)

tim_txcrd_z_vc  0- 3:  0           0           0           876      

tim_txcrd_z_vc  4- 7:  0           0           0           0         

tim_txcrd_z_vc  8-11:  0           0           0           0        

tim_txcrd_z_vc 12-15:  0           0           0           0        

er_enc_in                     0           Encoding errors inside of frames

er_crc                        0           Frames with CRC errors

er_trunc                      0           Frames shorter than minimum

er_toolong                    0           Frames longer than maximum

er_bad_eof                    0           Frames with bad end-of-frame

er_enc_out                    13480       Encoding error outside of frames

er_bad_os                     13004       Invalid ordered set

er_rx_c3_timeout              99          Class 3 receive frames discarded due to timeout

er_tx_c3_timeout              0           Class 3 transmit frames discarded due to timeout

er_c3_dest_unreach            0           Class 3 frames discarded due to destination unreachable

er_other_discard              0           Other discards

er_type1_miss                 0           frames with FTB type 1 miss

er_type2_miss                 0           frames with FTB type 2 miss

er_type6_miss                 0           frames with FTB type 6 miss

er_zone_miss                  0           frames with hard zoning miss

er_lun_zone_miss              0           frames with LUN zoning miss

er_crc_good_eof               0           Crc error with good eof

er_inv_arb                    0           Invalid ARB

open                          0           loop_open

transfer                      0           loop_transfer

opened                        0           FL_Port opened

starve_stop                   0           tenancies stopped due to starvation

fl_tenancy                    0           number of times FL has the tenancy

nl_tenancy                    0           number of times NL has the tenancy

zero_tenancy                  0           zero tenancy
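Counter listings like the one above are easier to compare over time once they are turned into a dictionary. The following is a minimal sketch of such a parser, assuming the "name value description" layout shown above; the function names `parse_portstats` and `nonzero_errors` are hypothetical, and the sample contains only a few lines copied from the output.

```python
def parse_portstats(text):
    """Parse 'name value description' counter lines into a dict of integers."""
    counters = {}
    for line in text.splitlines():
        parts = line.split()
        # Skip blank lines and the per-VC breakdown rows ("tim_txcrd_z_vc  0- 3: ...").
        if len(parts) < 2 or parts[0].endswith("_vc"):
            continue
        if parts[1].isdigit():
            counters[parts[0]] = int(parts[1])
    return counters


def nonzero_errors(counters):
    """Return only the er_* counters that are greater than zero."""
    return {name: value for name, value in counters.items()
            if name.startswith("er_") and value > 0}


sample = """
stat_wtx                      2164888415  4-byte words transmitted
tim_txcrd_z                   876         Time TX Credit Zero (2.5Us ticks)
tim_txcrd_z_vc  0- 3:  0           0           0           876
er_crc                        0           Frames with CRC errors
er_enc_out                    13480       Encoding error outside of frames
er_bad_os                     13004       Invalid ordered set
er_rx_c3_timeout              99          Class 3 receive frames discarded due to timeout
er_tx_c3_timeout              0           Class 3 transmit frames discarded due to timeout
"""
print(nonzero_errors(parse_portstats(sample)))
```

For this port the non-zero error counters are er_enc_out, er_bad_os and er_rx_c3_timeout, which are the ones discussed next.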

How to analyze er_enc_out and er_bad_os for Brocade  SAN Switch?  

If you observe very high er_enc_out and er_bad_os counts that are increasing rapidly, there is a physical connection problem between the server and the SAN switch. Check the FC cable, SFP, and GBIC.

On the other hand, if only er_bad_os increases, it is probably a “fillword” configuration problem. Ask your vendor about the fillword configuration.

Mode | Option | Description
0 | -idle-idle | Sets IDLE mode in the Link Init and IDLE as the fill word (default).
1 | -arbff-arbff | Sets ARB(ff) in the Link Init and ARB(ff) as the fill word.
2 | -idlef-arbff | Sets IDLE mode in the Link Init and ARB(ff) as the fill word.
3 | -aa-then-ia | Attempts hardware arbff-arbff (mode 1) first. If the attempt fails to go into active state, this command executes software idle-arb (mode 2). Mode 3 is preferable to modes 1 and 2 as it captures more cases.

Device vendors have their own recommendations for the fillword parameter, so it is worth checking your vendor's documentation.

Mode “2” is the formal industry standard and is recommended for HDS. Most vendors recommend mode “3” because it covers modes “1” and “2”. I recommend using fillword “3” for all 8G ports and changing it to “0” only if required.

portcfgshow  :

Ports of Slot 0         0   1   2   3     4   5   6   7     8   9  10  11    12  13  14  15

-----------------+---+---+---+---+-----+---+---+---+-----+---+---+---+-----+---+---+---

Speed                  AN  AN  AN  AN    AN  AN  AN  AN    AN  AN  AN  AN    AN  AN  AN  AN

Fill Word(On Active)    3   3   3   3     3   3   3   3     3   3   3   3     3   3   3   3

Fill Word(Current)      3   3   3   3     3   3   3   3     3   3   3   3     3   3   3   3 
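The rule of thumb for er_enc_out and er_bad_os in this section can be written down as a small comparison of two counter snapshots. This is only a sketch of that rule: the function name `classify_link_errors` and the returned messages are hypothetical, "rapidly" is not quantified here, and the case where only er_enc_out grows is not covered by the rule.

```python
def classify_link_errors(before, after):
    """Compare two counter snapshots (dicts) and apply the rule of thumb."""
    d_enc = after.get("er_enc_out", 0) - before.get("er_enc_out", 0)
    d_os = after.get("er_bad_os", 0) - before.get("er_bad_os", 0)
    if d_enc > 0 and d_os > 0:
        # Both counters grow: suspect the physical connection.
        return "physical link suspect: check FC cable, SFP and GBIC"
    if d_os > 0:
        # Only invalid ordered sets grow: suspect the fill word setting.
        return "fill word suspect: review the fillword mode of the port"
    if d_enc > 0:
        return "er_enc_out only: not covered by this rule of thumb"
    return "no increase in er_enc_out or er_bad_os"


earlier = {"er_enc_out": 0, "er_bad_os": 0}
later = {"er_enc_out": 13480, "er_bad_os": 13004}
print(classify_link_errors(earlier, later))
```

Taking the two snapshots a known interval apart also shows how fast the counters grow, which absolute values alone cannot tell you.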

How to analyze er_tx_c3_timeout tim_txcrd_z for Brocade SAN Switch?

The other parameters to check are “er_tx_c3_timeout” and “tim_txcrd_z”. Both are closely related to the performance of FC ports.

tim_txcrd_z shows the number of times the port was unable to transmit frames because its BB (buffer-to-buffer) credit was zero. The counter is sampled at 2.5-microsecond intervals, so each increment means that frames could not be sent to the attached device during that 2.5-microsecond tick. Watching this counter is therefore a good way to check how heavily the port is loaded.

How to monitor the tim_txcrd_z parameter?

  • Clear the port statistics.
  • Check tim_txcrd_z every minute to see whether the counter increases by more than 400,000. If it does, there is a problem on this port. Check your server and storage I/O metrics.
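To put the 400,000 figure in perspective: at 2.5 microseconds per tick, 400,000 ticks add up to about one second, or roughly 1.7% of a minute, during which the port had no transmit credit. The sketch below does this arithmetic for two readings of the counter. The function name `credit_zero_report` and its field names are hypothetical, and reading each tick as 2.5 microseconds of stalled time is an assumption based on the "(2.5Us ticks)" label in the counter description.

```python
TICK_SECONDS = 2.5e-6           # one tim_txcrd_z tick, per the counter description
THRESHOLD_PER_MINUTE = 400_000  # rule-of-thumb limit from the steps above


def credit_zero_report(count_start, count_end, interval_seconds=60.0):
    """Summarize the growth of tim_txcrd_z between two readings."""
    ticks = count_end - count_start
    per_minute = ticks * 60.0 / interval_seconds
    stalled_seconds = ticks * TICK_SECONDS
    return {
        "ticks_per_minute": per_minute,
        # Share of the interval spent with zero TX credit.
        "stalled_fraction": stalled_seconds / interval_seconds,
        "over_threshold": per_minute > THRESHOLD_PER_MINUTE,
    }


# Two hypothetical readings taken one minute apart.
print(credit_zero_report(1_000_000, 1_500_000))
```

Normalizing to ticks per minute lets you use the same limit even when the two readings are not exactly sixty seconds apart.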

On the other hand, if the number of frames that could not be transmitted increases by more than 400,000 per minute, er_tx_c3_timeout will probably start to increase as well. This is more critical than tim_txcrd_z, because frames are discarded due to timeout. At this point, you should replace the FC cable, GBIC, and SFP, and ask your vendor to check I/O size and performance metrics.

Portstatsshow:

stat_wtx                      2324107763  4-byte words transmitted

stat_wrx                      2872536291  4-byte words received

stat_ftx                      403937269   Frames transmitted

stat_frx                      252991294   Frames received

stat_c2_frx                   0           Class 2 frames received

stat_c3_frx                   252991294   Class 3 frames received

stat_lc_rx                    0           Link control frames received

stat_mc_rx                    0           Multicast frames received

stat_mc_to                    0           Multicast timeouts

stat_mc_tx                    0           Multicast frames transmitted

tim_rdy_pri                   0           Time R_RDY high priority

tim_txcrd_z                   14578896    Time TX Credit Zero (2.5Us ticks)

tim_txcrd_z_vc  0- 3:  0           0           0           15426365 

tim_txcrd_z_vc  4- 7:  0           0           0           0        

tim_txcrd_z_vc  8-11:  0           0           0           0        

tim_txcrd_z_vc 12-15:  0           0           0           0        

er_enc_in                     0           Encoding errors inside of frames

er_crc                        0           Frames with CRC errors

er_trunc                      0           Frames shorter than minimum

er_toolong                    0           Frames longer than maximum

er_bad_eof                    0           Frames with bad end-of-frame

er_enc_out                    0           Encoding error outside of frames

er_bad_os                     0           Invalid ordered set

er_rx_c3_timeout              0           Class 3 receive frames discarded due to timeout

er_tx_c3_timeout              478         Class 3 transmit frames discarded due to timeout

er_c3_dest_unreach            0           Class 3 frames discarded due to destination unreachable

er_other_discard              0           Other discards

er_type1_miss                 0           frames with FTB type 1 miss

er_type2_miss                 0           frames with FTB type 2 miss

er_type6_miss                 0           frames with FTB type 6 miss

er_zone_miss                  0           frames with hard zoning miss

er_lun_zone_miss              0           frames with LUN zoning miss

er_crc_good_eof               0           Crc error with good eof

er_inv_arb                    0           Invalid ARB

open                          0           loop_open

transfer                      0           loop_transfer

opened                        0           FL_Port opened

starve_stop                   0           tenancies stopped due to starvation

fl_tenancy                    0           number of times FL has the tenancy

nl_tenancy                    0           number of times NL has the tenancy

zero_tenancy                  0           zero tenancy


Abdurrahim

I'm a System Engineer with extensive administration experience, working for the Interbank Card Center of Turkey. I provide hardware and software support for Unix/Linux and Windows platforms (Oracle Solaris, HP-UX, Linux, IBM AIX, Windows Server).