Diagnostics guide for SGO Storage based on fibre channel disk arrays & Stornext software


NOTE: The following does not cover upgrades or new installations. It is a diagnostic guide for storage volumes that were working fine and have stopped working.


Preliminary note: The hardware tests mentioned below are meant for personnel without detailed knowledge of SAN hardware or specialized diagnostic tools. This is not the technical manual or a guide for system engineers, just an introduction designed for Mistika operators.


THE BASICS


- Stornext software has two components, the server software and the client software. Standalone systems have both components on the same computer. When there are more computers, one of them is the server (called "MDC" or "Metadata server"), and the others need an ethernet connection with it in order to use the storage.


The clients read from and write to the disk arrays directly over fibre channel connections, but they cannot do it without also having the network connection with the MDC, which is the only system that manages the centralised tables of disk content (filesystem metadata) for all the other computers.


When the storage does not work, these are the most probable reasons:


A - The network connection between our computer and the MDC server is not working (if they are not in the same computer), or the MDC is down

B - A  connection with the disk array is not working (optical fibre cables or SFP). The problem can be in the connections for the MDC or in the connections for our client computer.

C - There is no Stornext license available for our computer

D - There is a problem in the disk array

E - There is a problem in the fibre channel board

F - A power cycle was not made in the right order. (It can happen after a blackout).


(Note: From now on we will refer to the items in this list by their letter.)


When you don't have access to the storage, the first thing to do is to look for obvious causes and to rule out the elements that are not related:


- If the other computers in the SAN are working, then the problem has to be in our own connections. But if none of the computers has access to the storage, then the problem is in something centralised, like the MDC server, the storage, or a network switch.


- If one storage volume is working but another is not, most probably the problem is in the disk array containing that volume or in its connections.


- Check the lights: All related cables (both network and fibre channel) have a LED light at both ends. Also check the lights on the front panel of the disk array.


Depending on the results of the analysis above, choose among the following tests:

Checking license:


The license is a text file stored in /usr/cvfs/config/license.dat, on the system acting as metadata server.

The only expiration date that matters is the one in the first section of the file. The other sections are for optional features or maintenance warranties and are not necessary to access the storage.
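As a first step before reading the dates, it can help to confirm that the license file is actually present and not empty. The following is only a minimal sketch: the function name check_license_file is hypothetical, and the only thing taken from this guide is the file path. It makes no assumption about the internal format of the file.

```shell
# Path taken from the text above; override LICENSE_FILE to check a copy elsewhere.
LICENSE_FILE="${LICENSE_FILE:-/usr/cvfs/config/license.dat}"

# Hypothetical helper: report whether the license file exists and has content.
check_license_file() {
    file="${1:-$LICENSE_FILE}"
    if [ ! -r "$file" ]; then
        echo "missing or unreadable: $file"
        return 1
    fi
    if [ ! -s "$file" ]; then
        echo "empty: $file"
        return 1
    fi
    echo "found: $file"
    return 0
}

# Example (run on the MDC), then read the first section by eye:
#   check_license_file && head -n 20 /usr/cvfs/config/license.dat
```

The expiration date itself still has to be read manually in the first section of the file.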


Trying a power cycle:


When the reason is not clear, you can try a full power cycle as follows:
1 - Power on the network switches and fibre channel switches, if any.
2 - Power on the JBODs of the disk arrays, if any.
3 - Power on the disk arrays (the boxes where the optical fibers are connected). Wait 3 minutes, until all the status lights are solid.
4 - Power on the MDC (if any). Wait 2 minutes.
5 - Power on the client workstations.

To power off, follow the same steps in reverse order.


Diagnosing the network connection with the MDC:


On the client computer, the MDC hostname or IP address is kept in the /usr/cvfs/config/fsnameservers file. Look up the IP address or hostname in that file, and then execute this command:


ping IP_or_HostName


If the ping does not answer with "packet received" messages, then the problem is (A): either the network connection with the MDC is not working or the MDC is down.
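The lookup and the ping can be combined in a small script. This is a sketch under assumptions: the function names are hypothetical, the file is assumed to hold one hostname or IP address per line, and anything after a '#' is treated as a comment.

```shell
# Path taken from the text above; override FSNAMESERVERS to test with a copy.
FSNAMESERVERS="${FSNAMESERVERS:-/usr/cvfs/config/fsnameservers}"

# Print one MDC entry per line. Assumption: text after '#' is a comment;
# surrounding whitespace and blank lines are dropped.
list_mdc_hosts() {
    sed -e 's/#.*$//' -e 's/^[[:space:]]*//' -e 's/[[:space:]]*$//' \
        "${1:-$FSNAMESERVERS}" | grep -v '^$'
}

# Ping every entry three times and report the result.
ping_mdc_hosts() {
    list_mdc_hosts "$1" | while read -r host; do
        if ping -c 3 "$host" >/dev/null 2>&1; then
            echo "$host: reachable"
        else
            echo "$host: NO ANSWER, see reason (A)"
        fi
    done
}

# Example: ping_mdc_hosts
```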


If the local computer can reach other computers on the network but not the MDC, check that the MDC computer is up and running, and repeat the ping test explained above.


The following commands ask the MDC service which volumes are available. You can execute them on your client or on the server:


su

/usr/cvfs/bin/cvadmin 

exit


The first lines of the output should show a list of the storage volumes that are available. If your volume does not appear, then the MDC is not serving it. Potential reasons are B, C, D and E.
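The same check can be done without entering the interactive prompt. The sketch below assumes that your cvadmin version accepts the -e option to run a single command and that "select" without arguments prints the volume list. The helper name volume_in_listing and the volume name MyVolume are hypothetical. It does a plain whole-word text match and makes no assumption about the exact layout of the cvadmin output, so a match in an unrelated line is possible.

```shell
# Hypothetical helper: read a volume listing on standard input and succeed
# if the given volume name appears in it as a whole word.
volume_in_listing() {
    grep -qw -- "$1"
}

# Example (as root; "MyVolume" is a placeholder for your volume name):
#   /usr/cvfs/bin/cvadmin -e 'select' | volume_in_listing MyVolume \
#       || echo "MyVolume is not listed by the MDC"
```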


Diagnosing fibre channel connections with the storage:


Each fibre channel cable has a LED light in the ports at both ends. When there is no activity the light must be solid. If it is off, the link is not working.


If that is the case:


Check that the optical fibers are firmly attached.


Each fibre channel port connector (SFP) emits laser light from one of its two openings. Check that the light is there: for example, point your phone camera at the connector and look at the screen. You should see a red light on one side of the connector.


You can check the optical fiber cables in a similar way. Connect one end to a fibre channel port and check that light comes out of one of the fiber connectors at the other end. Then connect the cable to the port at the other end and check the other fiber.


Note: Dust in the connectors is a well known cause of random problems with optical fibers (they are made of glass, and the light needs to pass cleanly). Never let a fiber connector touch the floor.


The SFPs (the little connectors where you plug the fibers) are also a typical cause of failure. You can swap two of them to check whether the problem moves with the SFP. In general, do not move SFPs between different devices: they are not necessarily compatible and may not support the same speeds.
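As a complement to the LED checks above, a Linux system can also report the link state of each fibre channel port in software. The sketch below assumes that the board driver registers its ports under /sys/class/fc_host, which is not guaranteed for every driver. The function name fc_port_states is hypothetical.

```shell
# Print "<host>: <state>" for every fibre channel port found under the
# given sysfs directory (default: /sys/class/fc_host).
fc_port_states() {
    base="${1:-/sys/class/fc_host}"
    for dir in "$base"/host*; do
        # Skip entries without a readable port_state file
        # (this also covers the case where nothing matches the pattern).
        [ -r "$dir/port_state" ] || continue
        echo "$(basename "$dir"): $(cat "$dir/port_state")"
    done
}

# Example: fc_port_states
```

A port whose cable, SFP and remote side are all working normally reports "Online". A state such as "Linkdown" points to reasons (B) or (E). If the command prints nothing, the driver either is not loaded or does not expose this information.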



Diagnosing the disk array:


Apart from checking the LED lights on the front panel, you can connect to the disk array through its network ports:


For Dothill disk arrays, these are the default addresses. You can connect to them with the Firefox browser or similar:


Upper controller: 10.0.0.2

Bottom controller (optional): 10.0.0.3


If they have been changed, you may have them documented in the /etc/hosts file.
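Before opening the browser, you can check that the controllers answer on the network at all. This is a minimal sketch: the function name check_controllers and the PING_CMD variable are hypothetical, and the two default addresses are the ones listed above. Pass your own addresses as arguments if they have been changed.

```shell
# Command used to probe each address; override PING_CMD for testing.
PING_CMD="${PING_CMD:-ping}"

# Probe each given address (default: the two controller addresses above)
# and report whether it answers.
check_controllers() {
    [ "$#" -gt 0 ] || set -- 10.0.0.2 10.0.0.3
    for ip in "$@"; do
        if "$PING_CMD" -c 2 "$ip" >/dev/null 2>&1; then
            echo "$ip: reachable"
        else
            echo "$ip: no answer"
        fi
    done
}

# Example: check_controllers
```

No answer from the bottom controller is expected on arrays that have only one controller installed.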


The first thing you will see is the login screen. The default credentials are:


Login: manage

Password: !manage


Once you are logged in, check for abnormal warning symbols and review the logs. Do not change anything unless you are absolutely sure about what you are doing. Some operations can destroy all data.


Note: Errors of type "Degraded" mean that something is no longer redundant, but they should not block your access to the storage. Errors labeled as "Critical", on the other hand, will block it.


Note: Disk arrays provide a menu to configure email alerts that warn the user about problems. These alerts should be carefully configured and properly tested by IT personnel, as some methods are not reliable. For example, emails sent from non-fixed IP addresses to general-purpose email services like gmail will probably bounce back, and the alerts will be lost.


Note: When the problem is related to the disk array and you want to open a support case, you will get a faster answer if you attach the log files when you open the case, as they are probably the first thing support will ask you for.



Diagnosing the fibre channel board:


Swap cables between ports and check their status lights. Sometimes there is a problem in one port, but not in the others.


Execute this command:


lspci


This prints the hardware list, which should mention the FC boards. They are typically Atto or Qlogic boards. If they do not appear in the list, power off the computer and check that the boards are firmly seated, or remove each board and plug it in again.
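The hardware list can be long, so it may help to filter it. The sketch below keeps only the lines that mention "Fibre Channel" or one of the two vendors named above. The function name filter_fc_boards is hypothetical, and boards from other vendors that are described differently would not be caught by this filter.

```shell
# Keep only the lspci lines that look like fibre channel adapters.
# Case-insensitive, whole-word match on the class name or the vendor name.
filter_fc_boards() {
    grep -i -w -E 'fibre channel|qlogic|atto'
}

# Example:
#   lspci | filter_fc_boards || echo "no FC board detected"
```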


If there is no light in a port that you need, do a loopback test: take one single fiber (not the two fibers of a pair) and connect it from one opening of the port to the other opening of the same port, making a loop. This test requires the driver to be fully loaded, so do it once the system is fully up and running. If the status light still does not come on, the port is probably broken. Check all the ports; if none of them works, the board is probably broken (although a serious problem with the driver can cause the same symptom).