Service Hints & Tips

Document ID: MCGN-3N3KZG

Servers - High availability using the IBM ServeRAID Adapter

Applicable to: World-Wide

Abstract
This white paper outlines IBM recommendations for obtaining high availability of an IBM server in a RAID environment, using the IBM ServeRAID adapter.

December 1997

Note: This document is intended for use by system administrators.

Preventive Measures to Help Obtain High Availability
IBM recommends the following precautions in order to help obtain high availability of the RAID subsystem:
- Define a Hot Spare
- Install NetFinity Manager
- Data Scrub Drives Weekly (Does not apply to ServeRAID II with 2.3 or higher firmware.)
- Apply All Updates
- Install and Use RAID Administration and Monitoring Utilities
- Ensure Current Backup of RAID Configuration is Available
- Have a RAID Configuration Utility Diskette Available

Define a Hot Spare
Defining a hot spare drive minimizes the length of time a server operates with degraded performance when a defunct drive occurs. The hot spare also allows the "inconsistent" drive to be easily recognized in the event of a multiple defunct drive failure such that recovery procedures require much less technical expertise. The section below explains this advantage in greater detail:

Hot Spare Advantages
When a system has a drive that becomes defunct, data is not written to this DDD drive, but data is written to the other drives in the array. Therefore that DDD drive becomes "inconsistent" with the rest of the drives in the array. When multiple drives appear DDD, the first and most critical task is identifying the "inconsistent" drive correctly. The "inconsistent" drive must be the last drive replaced since it requires rebuilding (and, if truly defective, may need physical replacement). If the "inconsistent" drive is software replaced first when a multiple DDD failure occurs (see Software Replace vs. Physical Replace), the "inconsistent" data will be used to rebuild another drive. This eventually corrupts the other drives (and data) on the system.

However, when an HSP is defined, you are protected from rebuilding another drive from an "inconsistent" drive. This is because of the way the RAID adapter marks the states of drives. When a system has a defined HSP, as soon as the HSP takes over for the DDD drive, the RAID adapter marks the DDD drive as a defunct hot spare (DHS) drive in its configuration. If you perform a software replace or physical replace of this DHS drive, the RAID adapter starts the DHS drive and changes the state from DHS to HSP. The RAID adapter does not allow this drive to be brought back to ONL status.

When the HSP takes over for the DDD drive, the HSP is rebuilt to replace the DDD drive. During the rebuilding of the HSP drive, it appears in the RBL state. The RBL state changes to ONL once this drive is completely rebuilt and fully operational for the now DHS drive.

If an HSP is not defined and multiple drives appear DDD, then determination of the "inconsistent" drive is more difficult. You must now read the RAID log, generated by IPSMON, to determine the "inconsistent" drive. The "inconsistent" drive is the drive which goes DDD first. To examine this process in a little more depth, consider the following points. When the first drive appears DDD, the operating system remains operational with the remaining drives. It writes to all the other drives in the array except for the first DDD drive. When the second DDD occurs, the operating system is no longer functional and does not write to any drives. Because the RAID log generated by IPSMON can only be written while the operating system is operational, the first DDD drive must by default be the "inconsistent" drive. To rectify this situation, you must change the "consistent" drives from DDD to ONL by using the Set Device State option and ensure that the "inconsistent" drive is the one you try to rebuild.

If an HSP drive is defined but did not complete the rebuild, then it is much easier to identify the "inconsistent" drive. The "inconsistent" drive remains in RBL status. The DDD drive will appear with a DHS status.
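
The rule in the preceding paragraph can be sketched in a few lines of Python. This is only an illustration, not part of any IBM utility: the function name and the dictionary of bay labels are hypothetical, and the state codes are the ones used in this document. It reports the RBL drive as the "inconsistent" drive when exactly one exists, and otherwise returns nothing, meaning the RAID log must be consulted.

```python
def inconsistent_from_states(states):
    """Return the bay whose drive is in RBL state, if exactly one exists.

    `states` is a hypothetical mapping of bay label -> state code
    (ONL, DDD, DHS, RBL, HSP) copied from the View Configuration screen.
    """
    rebuilding = [bay for bay, state in states.items() if state == "RBL"]
    if len(rebuilding) == 1:
        # An interrupted hot-spare rebuild leaves the drive in RBL:
        # that drive is the "inconsistent" one.
        return rebuilding[0]
    # No RBL drive (or an ambiguous picture): fall back to the RAID log.
    return None


print(inconsistent_from_states({"bay 1": "RBL", "bay 2": "ONL",
                                "bay 3": "DHS", "bay 4": "DDD"}))
```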

Install and Use NetFinity Manager
You should install NetFinity Manager 5.0 or greater in order to monitor the RAID array remotely. Netfinity Manager can be used to schedule data scrubbing to occur at any time of the day, so synchronization of the RAID array can be scheduled for off-peak hours and will not require user input to get things started. With NetFinity services installed at the server, and the NetFinity Manager installed on a workstation, the RAID array can be monitored, and even synchronized, from a remote location. The system can also be configured to send alert messages regarding the RAID subsystem over the network to the workstation. You can even set up NetFinity Manager to page someone, e.g., the network administrator or a service technician, if a certain alert condition is reached. NetFinity Manager can also perform many other functions such as monitoring processor utilization, critical file monitoring and detecting installed software across the network. Netfinity is also used to capture PFA alerts from hard files and then send system alerts to the appropriate parties. In order to use Netfinity 5.0 to schedule data scrubbing, please download NF50RAID.EXE from http://www.us.pc.ibm.com/files.html. This file contains updated Netfinity program files which are required for scheduling data scrubbing on controllers with the write policy set to write-back cache. When installed with the NetFinity Manager code the following operating systems are affected: OS/2, WINNT, and WIN95.

Data Scrub Drives Weekly
One of the best ways to recognize potential disk media problems in advance, and to correct them before a failure occurs, is Data Scrubbing. (The ServeRAID II Adapter with firmware 2.30 or higher does this in the background.) Sector media errors can be identified and corrected simply by forcing all data sectors in the array to be accessed through Data Scrubbing. Data Scrubbing checks all data sectors in the array and should be performed weekly. With the IBM ServeRAID and ServeRAID II Adapters, an easy process used to accomplish Data Scrubbing is synchronization. Data Scrubbing will force all sectors of the drives contained in the array to be read in the background while allowing concurrent user disk activity. Netfinity Manager 5.0 will allow you to automatically schedule synchronization from either the server or a remote manager. Netfinity Manager 5.0 can be obtained at no additional charge by customers that have purchased an IBM server that ships with ServerGuide. If the customer has another type of scheduler such as the AT scheduler built into Windows NT, then the IBM ServeRAID Adapter's IPSSEND command line utility may be used to allow the customer to schedule Data Scrubbing without Netfinity Manager installed. The IPSSEND utility is available on the ServeRAID Supplemental Diskette.

Apply All Updates
You should apply all updates regarding RAID. Check the IBM Server web site at http://www.us.pc.ibm.com/server/server.html or call the HelpCenter for up-to-date information.

Install and Use RAID Administration and Monitoring Utilities
The RAID administration utility alerts the user via the speaker and display if a drive becomes DDD or if a Predictive Failure Analysis (PFA) alert occurs. PFA support on disk drives recognizes potentially bad drives and alerts systems administrators allowing them to replace the unit before a catastrophic drive failure. The PFA alert prompts you to replace the drive before actual failure, so that a HSP is always present.

The RAID administration utility monitors RAID operations, displaying results on the RAID Administration screen. A separate utility, called IPSMON.EXE, can log RAID events to a file. You can specify whether you want the utility to save the file to a diskette drive, local hard drive, or network hard drive; however, the recommended policy is to use the diskette drive or a network drive. This practice makes it easier to recover from situations where the operating system is not accessible due to the failure. The logs themselves are required to recover data from systems when multiple DDD drives occur. The logs also provide essential RAID history for that server when troubleshooting and isolating a defective part in cases where it is not the drive that is defective.

Another monitor checks the status of the adapter in a remote system in a client/server environment supported with TCP/IP. The server part of the client/server environment runs on the system containing the adapter and is supported with the following operating systems:
- Novell NetWare
- Windows NT
- OS/2
- SCO OpenServer

The client part of the client/server environment runs on the system used for monitoring, in the following Windows environments:
- Windows NT Server Version 3.51 and 4.0
- Windows NT Workstation Version 3.51 and 4.0
- Windows 95

Ensure Current Backup of RAID Configuration is Available
You should always have a current backup of the RAID configuration; anytime the array changes, you should make another backup. To create this backup, select Backup Config. to Diskette under Advanced Functions on the Main Menu of the RAID Configuration Diskette. You are prompted to enter a filename; the default is CONFIG. IBM recommends that you provide a unique name and backup to a different diskette each time. A unique name ensures that a good backup is not inadvertently overwritten, and a different diskette allows you to write-protect the diskette and keep it in a safe place. NetFinity Manager 5.0 or above also allows you to backup the configuration under the RAID manager.

Have a RAID Configuration Utility Diskette Available
Having a copy of the RAID Configuration Utility Diskette is crucial when working on a RAID system. Ensure that you always have a RAID Configuration Utility Diskette available in close proximity to all RAID systems. Due to possible changes of drive states, the backup RAID configuration stored on the diskette may differ from the current working RAID configuration.

Recovery Procedures for DDD Drives
This section provides you with procedures for recovering from many different DDD scenarios.

Topics Include:
- An Overview of Drive Replacement
- Using and Understanding the RAID Administration Log
- First Actions to be Performed on Service Call with DDD Drives
- Recovery Procedures When HSP is Present at Time of Failure
- Recovery Procedures When HSP is Not Present at Time of Failure
- Recovery From RAID Adapter Failure

An Overview of Drive Replacement
For the IBM ServeRAID Adapter, you replace drives via the RAID configuration utility. To begin, select the Rebuild Drive option under the Rebuild/Device Management option on the RAID Configuration Main Menu. The Rebuild Drive command requires that all drives in the array (except the drive being rebuilt) are online. When you select Rebuild Drive, the RAID adapter sends a start unit command to the drive being rebuilt. Once the drive starts successfully, the drive state changes from defunct (DDD) to rebuild (RBL). The logical drive remains in a critical (CRT) state until the rebuild is completed. Once the rebuild completes successfully, the logical drive changes to the okay (OKY) state.

If the logical drives are in an off-line (OFL) state, meaning that multiple drives in an array are defunct, then you must use the Set Device State to ONL on all drives except the "inconsistent" drive before rebuilding. If you mistakenly use this command on the "inconsistent" drive, the rebuild process corrupts data.

Note: Please review the following section, "Software Replace vs. Physical Replace," as well as the recovery procedures sections later in this document to understand which drive in a scenario is the "inconsistent" drive. Knowing how to determine the "inconsistent" drive is extremely important. This knowledge ensures you can troubleshoot correctly without corrupting data.

After you select the Rebuild Drive option, the next prompt asks you to indicate whether you want to rebuild the drive in the same location or a new location. Select Same Location if you physically replaced the old hard disk drive with a new one in the same bay. Select New Location to assign a hard disk drive in a new location. After the drive location has been selected, the adapter sends a start unit command to the drive and begins the rebuilding process on the critical logical drives in the array. The rebuild completes quickly under the following circumstances:

- if the defunct (DDD) drive was not defective
- if the drive is in the same bay location
- if no write operations or configuration changes occurred that require rebuilding data in that drive

During the rebuilding process, the IBM ServeRAID Adapter reports the drive state as RBL. Once the rebuild completes successfully, the drive state changes from RBL to ONL and the logical drive state changes from CRT to OKY.
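
The state changes described above can be summarized as a small Python sketch. It is an illustration only: the names are hypothetical, and two outcomes are taken from elsewhere rather than from this paragraph. The drive staying DDD when the start unit command fails is an assumption, and the drive staying RBL when the rebuild fails follows the behavior this document describes for a rebuild interrupted by power loss or another DDD drive.

```python
from enum import Enum


class DriveState(Enum):
    DDD = "defunct"
    RBL = "rebuild"
    ONL = "online"


class LogicalState(Enum):
    CRT = "critical"
    OKY = "okay"


def rebuild_outcome(start_ok, rebuild_ok):
    """Return (drive state, logical drive state) after Rebuild Drive on a DDD drive."""
    if not start_ok:
        # Assumption: a drive that does not start stays defunct.
        return DriveState.DDD, LogicalState.CRT
    if not rebuild_ok:
        # An interrupted rebuild leaves the drive in RBL.
        return DriveState.RBL, LogicalState.CRT
    # Successful rebuild: RBL -> ONL and CRT -> OKY.
    return DriveState.ONL, LogicalState.OKY


print(rebuild_outcome(start_ok=True, rebuild_ok=True))
```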

A DDD drive can occur in cases where the adapter is unable to determine the root cause failure. If the drive is not actually defective, you can use the rebuild option to resolve a DDD state without the need to replace the drive physically. This type of drive replacement where the drive is not physically replaced is called a Software Replace.

Software Replace vs. Physical Replace
When the RAID Adapter communicates with the hard file and receives an unexpected response, the adapter will mark the drive defunct in order to avoid any potential data loss. For example, this could occur in the event of a power loss to any of the components in the SCSI RAID subsystem. In this case, the RAID adapter will err on the side of safety and will no longer write to that drive although the drive may not be defective in any way.

These circumstances warrant either a software replace or a physical replace, as discussed in the following bullets:

- Using a software replace is recommended to try to recover data when multiple DDD drives occur. In this situation, you may lose data on drives that are not actually defective if you run a normal rebuild process.

WARNING: IF YOU USE THE WRONG ORDER WHEN YOU ATTEMPT A SOFTWARE REPLACE, DATA CORRUPTION RESULTS.

- A software replace should also be performed for a single DDD drive when a hot spare (HSP) is not present in the system and it is the first time the drive has been marked DDD. In this situation, the software replace requires a rebuild of the drive. During the rebuild, all sectors of the drive are rebuilt. Therefore, the drive is tested very well. If a rebuild of the drive completes successfully, the drive does not need to be physically replaced.

- If a DDD drive has been replaced by an HSP, you should physically replace the DHS drive. Under these circumstances, a software replace will only send a start unit command to the drive. If the unit starts successfully, then the drive is seen as good by the RAID adapter. Just restarting a drive does not sufficiently test the drive. Therefore, the drive should be physically replaced to ensure a good HSP drive is present in the system.

Using and Understanding the RAID Administration Log
Being able to read the RAID log produced by IPSMON or Netfinity RAID Manager is a very important part of recovering an array when one or more drives are marked DDD. From the RAID log, you can determine in what order drives went DDD, and, if multiple drives are DDD, which one is the "inconsistent" drive. The RAID log is created by running either IPSMON.EXE or Netfinity Manager. IPSMON.EXE is available on the ServeRAID Supplemental Diskette from the IBM web site: http://www.us.pc.ibm.com/files.html. Search on "ServeRAID." Netfinity Manager is part of ServerGuide, which ships with every IBM PC Server and Netfinity Server. The following is an excerpt from a RAID log created by the IPSMON utility:

09/12/97 09:33:36 INF003:A1C-B-- synchronization started

09/12/97 09:40:22 INF004:A1C-B-- synchronization completed

09/12/97 09:41:43 CRT001:A1C3B03 dead drive detected

09/12/97 09:42:13 INF001:A1C-B-- rebuild started

09/12/97 09:52:11 INF002:A1C-B-- rebuild completed

09/12/97 09:55:24 CRT001:A1C3B04 dead drive detected


The original configuration was:
- bay 1: HSP
- bay 2: ONL
- bay 3: ONL
- bay 4: ONL

The format is as follows:
date time error type:Ax Cx Bxx message

The x following A will be the adapter number, the x following C is the channel, and the xx following B is the bay number. Error type can be either informational or critical. The message will give a brief description of the RAID event that has occurred.
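
As an illustration of this format, the following Python sketch splits one log line into its fields. It is not an IBM tool. The regular expression is inferred only from the six sample lines in the excerpt above, so the field widths (one digit for adapter and channel, two for bay) and the use of "-" for "no specific drive" are assumptions; the names RaidEvent and parse_log_line are hypothetical.

```python
import re
from typing import NamedTuple, Optional

# Inferred layout: date time type:AxCxBxx message
LOG_LINE = re.compile(
    r"^(?P<date>\d{2}/\d{2}/\d{2}) (?P<time>\d{2}:\d{2}:\d{2}) "
    r"(?P<code>[A-Z]{3}\d{3}):A(?P<adapter>\d)C(?P<channel>[\d-])"
    r"B(?P<bay>\d{2}|--) (?P<message>.+)$"
)


class RaidEvent(NamedTuple):
    date: str
    time: str
    code: str               # e.g. INF003 (informational) or CRT001 (critical)
    adapter: int
    channel: Optional[int]  # None when the log shows "-"
    bay: Optional[int]      # None when the log shows "--"
    message: str


def parse_log_line(line):
    """Parse one log line; return None if it does not match the layout."""
    m = LOG_LINE.match(line.strip())
    if m is None:
        return None
    channel = None if m["channel"] == "-" else int(m["channel"])
    bay = None if m["bay"] == "--" else int(m["bay"])
    return RaidEvent(m["date"], m["time"], m["code"],
                     int(m["adapter"]), channel, bay, m["message"])


event = parse_log_line("09/12/97 09:41:43 CRT001:A1C3B03 dead drive detected")
print(event.code, event.adapter, event.channel, event.bay)
```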

The first two lines of the RAID log show that a synchronization was started and proceeded to complete successfully. At a later point in time, on line 3 of the RAID log, a dead drive is detected on adapter 1, channel 3, bay 3. In this case, since an HSP drive is defined, the rebuild starts automatically. Both the start and finish of the rebuild are logged by the IPSMON monitoring utility. Later on, the drive in bay 4 is marked dead, but no rebuild is started because the HSP drive has already been used.

In the current RAID log, the drive in bay 4 is the "inconsistent" drive, and you must physically replace it. If more drives are DDD but not listed in the RAID log because the server has trapped (OS/2 or NT) or the volume was dismounted (NetWare), then you need to software replace those drives before replacing the drive in bay 4, because the other drives contain the correct information to rebuild the "inconsistent" drive.
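
The reasoning applied to the sample log can be sketched in Python as follows. This is a simplified illustration under stated assumptions, not a supported procedure: it assumes that every "dead drive detected" line uses code CRT001 as in the excerpt, and that each "rebuild completed" line means a hot spare has taken over for the earliest dead drive still outstanding. The function name is hypothetical.

```python
import re

# Assumption: dead-drive events look exactly like the sample excerpt.
DEAD = re.compile(r"CRT001:A(\d)C(\d)B(\d{2}) dead drive detected")


def inconsistent_drive(log_lines):
    """Return (adapter, channel, bay) of the "inconsistent" drive, or None."""
    outstanding = []  # dead drives not yet covered by a completed rebuild
    for line in log_lines:
        m = DEAD.search(line)
        if m:
            outstanding.append(tuple(int(g) for g in m.groups()))
        elif "rebuild completed" in line and outstanding:
            # Assumption: the hot spare replaced the oldest outstanding drive.
            outstanding.pop(0)
    # The first drive to go DDD without being rebuilt is "inconsistent".
    return outstanding[0] if outstanding else None


sample = [
    "09/12/97 09:33:36 INF003:A1C-B-- synchronization started",
    "09/12/97 09:40:22 INF004:A1C-B-- synchronization completed",
    "09/12/97 09:41:43 CRT001:A1C3B03 dead drive detected",
    "09/12/97 09:42:13 INF001:A1C-B-- rebuild started",
    "09/12/97 09:52:11 INF002:A1C-B-- rebuild completed",
    "09/12/97 09:55:24 CRT001:A1C3B04 dead drive detected",
]
print(inconsistent_drive(sample))  # (1, 3, 4): adapter 1, channel 3, bay 4
```

Drives that went DDD after the operating system stopped logging do not appear in the log at all, so the result must always be checked against the actual drive states before any Set Device State or Rebuild action.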

Before you perform any actions on the hardware, use NetFinity Manager, the RAID administration program, or the RAID configuration program to fill in the attached template at the end of this document with the current status of all the drives, both internal and external. This template provides a three-channel diagram to accommodate all types of IBM RAID adapters.

For the IBM ServeRAID Adapters, if power is lost or another drive is marked DDD during a rebuild operation, the rebuild fails and the drive being rebuilt remains in the RBL state. Because of this, the "inconsistent" drive remains recognizable.

First Actions to be Performed On Service Call with DDD Drives
1. Pull the ServeRAID Administration Log created by the IPSMON utility or Netfinity Manager. IPSMON can be obtained from the ServeRAID Supplemental Diskette, available on the IBM web site at http://www.us.pc.ibm.com/files.html. Search on "ServeRAID."

2. View the configuration with the ServeRAID Administration and Monitoring Utility, with the ServeRAID Configuration utility if the operating system is not functional, or with RAID Netfinity Manager, to determine whether an HSP was present in the system. If any of the drives indicate DHS status, then an HSP drive was definitely present in the system.

3. View the ServeRAID Device Event Table, either by clicking on the specific ServeRAID Adapter in the Administration and Monitoring Utility and selecting Device Event Log, or by viewing the Error Counters for each drive with RAID Netfinity Manager. If the operating system is not functional, you may boot the ServeRAID Configuration Diskette and view the device errors by selecting Display Drive Information for each drive under Rebuild/Device Management.

Hard Events - The number of SCSI I/O processor errors that occurred on the drive since the Device Error Table was last cleared. It also indicates if the drive exceeded Predictive Failure Analysis (PFA) threshold.

Action: Contact your support representative for further problem determination.

Soft Events - The number of SCSI Check Condition status messages returned from the Drive (except Unit Attention) since the Device Error Table was last cleared.

Action:
1. If HSP is present, follow procedures in the next section: Recovery Procedures when HSP is Present at Time of Failure

2. If HSP is not present, follow procedures in the section: Recovery Procedures When HSP is Not Present at Time of Failure

Miscellaneous Events - The number of other errors (such as selection time-out, unexpected bus free, or SCSI phase error) that occur on the drive since the Device Error Table was last cleared.

Action: Ensure cabling and connectors are seated properly. If backplane, ensure backplane is not bowed causing poor drive connection. If there are no problems with cable, backplane, etc., determine whether HSP drive is present or not and follow appropriate Recovery Procedures listed below but do not software replace the drive. Physically replace the drive.

Parity Events - The number of parity errors that occurred on the SCSI bus since the Device Error Table was last cleared.

Action: Check to ensure the SCSI bus is properly terminated with one and only one Active Terminator placed at the end of the SCSI Chain. If a backplane is the last device on the chain, then the backplane terminates the bus as long as no cable is plugged into the daisy-chained connector on the backplane.

PFA Events - Predictive Failure Analysis

Action: Determine whether HSP drive is present or not and follow appropriate Recovery Procedures listed below, but do not software replace the drive. Physically replace the drive.
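
The event types and actions above amount to a lookup table, which the following Python sketch makes explicit. It is only a summary aid: the function name and the short event-type keys are hypothetical, and the action strings paraphrase the text of this section.

```python
def recommended_action(event_type, hsp_present):
    """Return the action this section lists for a Device Event Table entry."""
    procedure = (
        "Recovery Procedures When HSP is Present at Time of Failure"
        if hsp_present
        else "Recovery Procedures When HSP is Not Present at Time of Failure"
    )
    actions = {
        "hard": "Contact your support representative for further problem "
                "determination.",
        "soft": f"Follow: {procedure}.",
        "miscellaneous": "Check cabling, connectors, and backplane seating. "
                         f"If these are good, follow: {procedure}. Physically "
                         "replace the drive; do not software replace it.",
        "parity": "Check that the SCSI bus has one and only one active "
                  "terminator, placed at the end of the chain.",
        "pfa": f"Follow: {procedure}. Physically replace the drive; "
               "do not software replace it.",
    }
    # An unknown event type raises KeyError rather than guessing.
    return actions[event_type.lower()]


print(recommended_action("PFA", hsp_present=False))
```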

Recovery Procedures When HSP is Present at Time of Failure
One DHS Drive, No RBL

Follow the steps below to bring the DHS drive back to HSP state if the following items are true:

- Only one drive is marked DHS and the rest are ONL.

- The RAID logical drive status is OKY because an HSP is present in the system. Either the HSP drive is the hard drive that went DHS or the HSP has already automatically taken over for the DHS drive and has been rebuilt successfully.

- There are no drives with a RBL status.

Once you verify the conditions above through either the RAID administration log or the RAID administration utility, perform the following steps to bring the DHS drive back to HSP status.

1. Physically replace the hard drive in the DHS bay with a new one of the same capacity or greater.

2. With a RAID-1 or RAID-5 array, the operating system is still functional at this point. Use either NetFinity Manager 5.0 or the RAID administration utility to bring the drive back to HSP status. With the RAID administration utility, open the options menu and select Set Device State.

3. When you see the prompt to select a drive, highlight the drive you just replaced (it should still be marked DHS in the utility), and press Enter. Be careful to select the correct drive, because you have the option to select any drive connected to the IBM ServeRAID adapter, including ONL drives.

4. You now have a menu listing all the different drive states possible, but you are only able to highlight DHS, HSP, or Standby Hot Spare (SHS). Highlight HSP (or SHS if necessary), and press Enter.

5. The adapter issues a start unit command to the drive. Once the drive successfully spins up, the adapter changes the drive's status to HSP (or SHS) and saves the new configuration.

6. If you see an "Error in starting drive" message, reinsert cables, the hard drive, etc., to verify these are connected properly, then go to step 2. If the error persists, go to step 1.

7. If the error still occurs with a known good hard drive, then troubleshoot to determine defective part, which may be a cable, back plane, RAID adapter, etc. Once you have replaced the defective part so that there is a good connection between adapter and hard drive, go to step 2.

One DDD Drive, One DHS Drive, No RBL
If the system has a DDD drive and a DHS drive, and a defined hot spare existed prior to the drive failures, then the system should still be up and running as long as the logical drives are configured as RAID-5 or RAID-1. The logical drives in the array will be in the CRT state due to one drive in the array being defunct. Perform the following steps to bring the logical drive from CRT to OKY state:

Note: Because the operating system is functional, this procedure assumes you are using the RAID administration utility within the operating system to recover.

1. Physically replace the drives that are marked DDD and DHS.

2. Click on the DDD drive from within the RAID administration utility and then click on Rebuild Drive. You see a message confirming that the drive is starting. The drive then starts the rebuild process. When this process is complete, the drive is marked ONL.

3. After the rebuild is complete, click on the DHS drive from within the RAID administration utility. Select Set Device State. You then see several options. Select HSP (or SHS if necessary) and click on OK. The adapter issues a start unit command to the drive, and you see a message confirming that the drive is starting. Once the drive spins up and the adapter saves the drive's configuration, the drive is marked HSP (or SHS, as applicable).

More than One DDD Drive, One DHS, No RBL
In this scenario, the operating system is no longer functional. Therefore, you must boot to the RAID Option Diskette to recover the array. It is extremely important to confirm that either the RAID administration utility or NetFinity Manager 5.0 has been running prior to the drives being marked defunct. If so, the utility or NetFinity Manager has logged the sequence of DDD events to a log file either on a diskette or on a local or network drive. With this file, you can view the log file on another machine to determine the "inconsistent" drive. When you know which drive is "inconsistent", you can attempt to recover data.

Note: The previous paragraph states "attempt to recover" because once you lose more than one drive in a set of RAID-5 or RAID-1 logical drives, loss of data is definitely a possibility. The steps below guide you through a recovery, if at all possible.

1. View the RAID log on another machine and write down the order in which the drives went defunct.

2. Boot to the RAID configuration diskette, and select View Configuration. Make sure that the template contains the correct information for the status of all drives, not just those listed in the RAID log.

3. Using the RAID configuration utility, select Set Device State and choose a DDD drive that is not listed in the RAID log. Set that drive to an ONL state. Repeat this step until the only DDD drives remaining are those indicated in the RAID log file.

Note: The drives marked DDD that are not listed in the RAID log are the last ones to go defunct. You must recover these drives first so that the information from them can be used to rebuild the original drive that failed (the "inconsistent" drive). If you do not replace the "inconsistent" drive last, then the system uses it to rebuild the last drive that went defunct, resulting in corrupted data. Therefore, it is extremely important to perform step 3 carefully.

4. Select Set Device State and then select the last drive to go defunct according to the log file. Set that device to the ONL state. Repeat step 4 until there is only one DDD drive remaining.

5. Select Set Device State and choose the DHS drive. Change its state from DHS to HSP.

6. Select Rebuild and highlight the DDD drive.
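
The ordering rule in steps 3 through 6 can be sketched in Python. This is an illustration only, with hypothetical names and bay numbers as plain integers. It takes the set of bays currently shown as DDD and the order in which the log recorded drives going defunct (oldest first), and returns the order for Set Device State to ONL plus the one drive left for Rebuild. Logged drives that are no longer DDD (for example, one already replaced by the hot spare and now DHS) are ignored.

```python
def plan_recovery(ddd_bays, logged_order):
    """Return (bays to set ONL, in order; bay to rebuild last)."""
    logged = [bay for bay in logged_order if bay in ddd_bays]
    if not logged:
        # The procedure relies on the log to identify the first DDD drive.
        raise ValueError("no currently DDD drive appears in the RAID log")
    # Step 3: DDD drives missing from the log went defunct last; recover first.
    unlogged = sorted(bay for bay in ddd_bays if bay not in logged)
    # Step 4: then logged drives, most recent first, until one remains.
    set_online = unlogged + list(reversed(logged[1:]))
    # Step 6: the first logged DDD drive is "inconsistent"; rebuild it last.
    return set_online, logged[0]


# Bay 3 was logged first but is now DHS; bays 4 and 5 are logged and DDD;
# bay 2 is DDD but never reached the log.
print(plan_recovery({2, 4, 5}, [3, 4, 5]))  # ([2, 5], 4)
```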

One DHS Drive, Zero or More DDD Drives, and One RBL Drive
Usually when you have a RBL drive after bringing up a system, it is because the data on the drive was being rebuilt when the system went down. If there are DDD drives as well, then those drives are more than likely the cause of the system crash. The following steps allow you to attempt to recover the array:

1. Boot to the RAID configuration utility.

2. Select View Configuration and write down the current status of each drive. Physically replace the DHS drive.

3. Return to the utility's Main Menu and choose Device Management. Select Set Device State. If you see any DDD drives, highlight them and change their state to ONL. If you do not see any DDD drives, then highlight the DHS drive and change its state to HSP (or SHS). Repeat this step until there are no more drives marked DDD or DHS.

4. Select Rebuild and highlight the RBL drive. The rebuild process begins, and all data will be rebuilt to the drive.

Recovery Procedures When HSP is not Present at Time of Failure
One DDD Drive, No RBL

Follow these steps to bring the DDD drive back to the ONL state if the following items are true:

- Only one drive is marked DDD and the rest are ONL.
- There are no drives with an RBL status.

Once the conditions above are verified through either the RAID administration log or the RAID administration utility, perform the following steps to bring the DDD drive back to ONL status.

1. If drive has never been marked DDD, proceed to step 3 to software replace the drive using the ServeRAID Administration and Monitoring Utility or Netfinity RAID Manager.

Note: Refer to the "Software Replace vs. Physical Replace" section of this paper to understand the differences between software and physical replacement.

2. If the drive has been marked DDD before, proceed to step 6.

3. With a RAID-1 or RAID-5 array, the operating system will be functional. Use either NetFinity Manager or the RAID administration utility within the operating system to start the Rebuild process. With the RAID administration utility, click on the drive marked DDD, and select Rebuild from the menu that appears.

4. The adapter issues a start unit command to the drive. You receive a message confirming that the drive is starting. The drive then begins the rebuild process. Once the drive completes this process, the drive's status changes to ONL.

5. If you see an "Error in starting drive" message, reinsert the cables, hard drive, etc., to verify there is a good connection, then go to step 3. If the error persists, go to step 6.

6. Physically replace the hard drive in the DDD bay with a new one of the same capacity or greater and go to step 3.

7. If the error still occurs with a known good hard file, then troubleshoot to determine if the cable, back plane, RAID adapter, etc., is defective.

Note: The RAID adapter should not be replaced in many cases. If Hard Events are reported in the ServeRAID Device Error Table, which can be viewed by clicking on the logical drive from the ServeRAID Administration and Monitoring Utility, then contact your support representative to determine if the adapter needs replacement.

Once you have replaced the defective part so that there is a good connection between the adapter and hard drive, go to step 3.

Two DDD Drives, No RBL
In this case, with no defined hot spare drive, the server has more than likely trapped (under OS/2 and NT), or the volume was dismounted (under NetWare). To resolve this scenario, you must examine the RAID log generated by the RAID administration utility and follow the steps below:

1. Boot to the RAID configuration utility for your RAID adapter.

2. Select Set Device State and highlight the drive marked DDD last by the RAID adapter. Set this drive's state to ONL. The drive spins up and changes from DDD to ONL status.

WARNING: IF YOU USE THE WRONG ORDER WHEN YOU SELECT SET DEVICE STATE TO CHANGE A DRIVE'S STATE TO ONL, DATA CORRUPTION RESULTS. SEE THE NOTE BELOW TO DETERMINE THE LAST DRIVE MARKED DDD BY THE RAID ADAPTER.

Note: Refer to "Using and Understanding the RAID Administration Log" section of this document, for details on obtaining and interpreting the RAID log. If only one drive is recorded in the RAID log because the RAID adapter was not able to log the defunct drive before the operating system went down, then the last drive that went defunct is the drive that is not recorded in the RAID log. If two drives are recorded in the RAID log, then the last drive to go defunct is the second drive listed in the log - the drive with the most recent time stamp.

3. If the drive has been marked DDD before, proceed to step 8.

4. Proceed to step 5 to software replace the remaining DDD drive using the ServeRAID Administration and Monitoring Utility or Netfinity RAID Manager.

Note: Refer to the "Software Replace vs. Physical Replace" section of this paper to understand the differences between software and physical replacement.

5. With a RAID-1 or RAID-5 array, the operating system will be functional. Use either NetFinity Manager or the RAID administration utility within the operating system to start the Rebuild process. With the RAID administration utility, click on the drive marked DDD, and select Rebuild from the menu that appears.

6. The adapter issues a start unit command to the drive. You receive a message confirming that the drive is starting. The drive then begins the rebuild process. Once the drive completes this process, the drive's status changes to ONL.

7. If you see an "Error in starting drive" message, reseat the cables, hard drive, etc., to verify that there is a good connection, then go to step 5. If the error persists, go to step 8.

8. Physically replace the hard drive in the DDD bay with a new one of the same capacity or greater and go to step 5.

9. If the error still occurs with a known good hard file, then troubleshoot to determine if the cable, back plane, RAID adapter, etc., is defective.

Note: The RAID adapter should not be replaced in many cases. If Hard Events are reported in the ServeRAID Device Error Table, which can be viewed by clicking on the logical drive from the ServeRAID Administration and Monitoring Utility, then contact your support representative to determine if the adapter needs replacement.

10. Once you have replaced the defective part so that there is a good connection between the adapter and hard drive, go to step 5.

11. If software replacement brings all drives back ONL and makes the system operational, carefully inspect all cables and related parts to ensure that neither the cable nor the backplane is defective. Check all backplane connectors and ensure that the backplane is not bowed. When multiple drives are marked defunct, the communication channel (cable or backplane) is often the cause of the failure. If the backplane is bowed, drives and backplane connectors may not seat properly, resulting in a bad connection. Also, with hot-swap drives that are removed frequently, connectors can become damaged if too much force is exerted.
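The rule in the note under step 2 for finding the last drive marked DDD can be written as a short Python sketch. It is an illustration only, not part of any IBM utility: the drive labels, the `LogEntry` tuple, and the function name are hypothetical, and the log entries are assumed to have been transcribed by hand from the RAID log.

```python
from datetime import datetime
from typing import Sequence, Tuple

# Hypothetical log entry: (drive label, time the adapter logged the drive as DDD).
LogEntry = Tuple[str, datetime]


def last_defunct_drive(ddd_drives: Sequence[str], log: Sequence[LogEntry]) -> str:
    """Return the drive that went defunct last when exactly two drives are DDD."""
    if len(ddd_drives) != 2:
        raise ValueError("this rule covers exactly two DDD drives")
    logged = [entry for entry in log if entry[0] in ddd_drives]
    if len(logged) == 2:
        # Both drives were logged: the most recent time stamp failed last.
        return max(logged, key=lambda entry: entry[1])[0]
    if len(logged) == 1:
        # Only one drive was logged: the unlogged drive failed last.
        logged_drive = logged[0][0]
        return next(d for d in ddd_drives if d != logged_drive)
    raise ValueError("the RAID log does not identify the failure order")
```

The drive returned here is the one to set to ONL in step 2; the other DDD drive is the "inconsistent" drive that must be rebuilt.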

More than Two DDD Drives, No RBL
1. View the RAID log on another machine and write down the order in which drives went defunct.

2. Boot to the RAID Configuration Diskette and select View Configuration. Make sure that the template contains the correct information for the status of all drives, not just those listed in the RAID log.

3. Using the RAID configuration utility, select Set Device State and choose a DDD drive that is not listed in the RAID log. Change the state of this drive to ONL to software replace it. Repeat this step until only two DDD drives remain. One or both of these drives should be among the first two drives to go defunct, as indicated in the RAID log.

WARNING: IF YOU USE THE WRONG ORDER WHEN YOU SELECT SET DEVICE STATE TO CHANGE DRIVES' STATES TO ONL, DATA CORRUPTION RESULTS. ENSURE THAT YOU CHANGE THE DEVICE STATE TO ONL ONLY FOR DRIVES NOT LISTED AS DDD IN THE RAID LOG. THE FIRST DRIVE THAT WENT DEFUNCT REQUIRES REBUILDING, SO IT MUST BE REPLACED LAST.

NOTE: Refer to the "Using and Understanding the RAID Administration Log" section of this document for details on obtaining and interpreting the RAID log. Refer to the "Software Replace vs. Physical Replace" section of this paper to understand the differences between software and physical replacement.

4. Follow the same procedure used to recover from two DDD drives, as outlined in the previous section.
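As a planning aid, the selection in step 3 can be sketched in Python. This is an illustration under stated assumptions, not an IBM tool: the function name and drive labels are hypothetical, and the sketch deliberately stops with an error when more than two of the DDD drives appear in the RAID log, because the steps above do not describe that case.

```python
from typing import List, Sequence, Tuple


def plan_software_replace(
    ddd_drives: Sequence[str], log_order: Sequence[str]
) -> Tuple[List[str], List[str]]:
    """Split DDD drives into (set to ONL now, left for the two-DDD procedure).

    ddd_drives: every drive currently shown as DDD.
    log_order:  drives listed in the RAID log, earliest failure first.
    """
    logged = [d for d in log_order if d in ddd_drives]
    unlogged = [d for d in ddd_drives if d not in logged]
    # Step 3: set unlogged drives to ONL until only two DDD drives remain.
    to_online = unlogged[: max(len(ddd_drives) - 2, 0)]
    remaining = [d for d in ddd_drives if d not in to_online]
    if len(remaining) > 2:
        # More than two logged drives: never set a logged drive ONL blindly.
        raise ValueError("more than two DDD drives are listed in the RAID log")
    return to_online, remaining
```

For example, with DDD drives A, B, C, and D and a log that lists B then C, the sketch selects A and D for Set Device State and leaves B and C for the two-DDD procedure.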

Recovery From RAID Adapter Failure
When a RAID adapter fails, you must replace the RAID adapter and then restore the RAID configuration to the new RAID adapter. There are three ways to restore the RAID Configuration:

If you have a backup of the current RAID configuration, then perform the following steps:

1. Boot to the RAID Option Diskette.

2. Select Advanced Functions.

3. Select Restore/Convert Saved Configuration.

4. Insert the diskette that contains the backed up configuration, and press Enter.

5. A list of backup configuration names appears. Select the correct configuration name and press Enter.

6. A confirmation window appears. Select Yes to restore the configuration or No to return to the previous menu.

OR

If you do not have the most recent backup copy of the RAID configuration, then you can restore the array by using information stored on the hard drives. Use the following steps to perform this operation:

1. Boot to the RAID Option Diskette.

2. Select Advanced Functions.

3. Select Init/View Synchronize Config. from the Advanced Functions menu.

4. If there are drives connected to the adapter that are either not showing up or not showing up as Ready (RDY), then select Initialize Configuration. This restores the factory default settings on the RAID adapter and resets all functional hard disk drives to the RDY state.

5. Select Configuration Synchronization from the Init/View Synchronize Config. menu.

6. From this menu, select Hard Disk Drive as source. This retrieves the configuration information from the hard drive. A confirmation window appears. Select Yes if you want to restore the configuration or No if you do not want to restore the configuration.

OR

1. Press CTRL+I during POST to enter the Mini-Config.

2. Select Advanced Functions.

3. Select Import Configuration.

Drive Template
As mentioned in the section titled "Using and Understanding the RAID Administration Log," you may find this template useful to record the status of drives as you begin the troubleshooting process.

Channel 1

Channel 2

Channel 3


Definitions
Array
In the RAID environment, data is striped across multiple physical hard drives. The array is defined as the set of hard drives included in the data striping.

Data Scrubbing
Data Scrubbing forces all data sectors in a logical drive to be accessed so that sector media errors are identified and corrected at the disk level using disk ECC information if possible, or at the array level using RAID parity information if necessary. For a high level of data protection, Data Scrubbing should be performed weekly.

Logical Drive
The array specifies which drives should be included in the striping of data. Each array is subdivided into one or more logical drives. The logical drives specify the following:

- The number and size of the physical drives as seen by the operating system. The operating system sees each defined logical drive as a physical drive.

- The RAID level. When a logical drive is defined, its RAID level (0, 1, or 5) is also defined.

RAID-0
RAID level 0 stripes the data across all of the drives of the array. RAID-0 offers substantial speed enhancement, but provides for no data redundancy. Therefore, a defective hard disk within the array results in loss of data in the logical drive assigned level 0, but only in that logical drive.

RAID-1
RAID level 1 provides an enhanced feature for disk mirroring that stripes data as well as copies of the data across all the drives of the array. The first stripe is the data stripe, and the second stripe is the mirror (copy) of the first data stripe. The data in the mirror stripe is written on another drive. Because data is mirrored, the capacity of the logical drive when assigned level 1 is 50% of the physical capacity of the grouping of hard disk drives in the array.

RAID-5
RAID level 5 stripes data and parity across all drives of the array. When a disk array is assigned RAID-5, the capacity of the logical drive is reduced by one physical drive size because of parity storage. The parity is spread across all drives in the array. If one drive fails, the data can be rebuilt. If more than one drive fails, but one or none of the drives are actually defective, then data may not be lost. You can use a process called software replacement on the non-defective hard drives.
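The capacity rules in the RAID-0, RAID-1, and RAID-5 definitions can be checked with a small Python sketch. It assumes, for simplicity, that all drives in the array are the same size and that the whole array is assigned to a single logical drive; the function name is hypothetical.

```python
def logical_capacity_mb(raid_level: int, drive_count: int, drive_size_mb: int) -> int:
    """Usable capacity of one logical drive spanning an array of equal-size drives."""
    physical = drive_count * drive_size_mb
    if raid_level == 0:
        return physical                  # striping only, no redundancy
    if raid_level == 1:
        return physical // 2             # mirroring: 50% of physical capacity
    if raid_level == 5:
        return physical - drive_size_mb  # one drive's worth of space holds parity
    raise ValueError("RAID level must be 0, 1, or 5")
```

For example, four 4510 MB drives give 13530 MB at RAID-5, 9020 MB at RAID-1, and 18040 MB at RAID-0.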

Software Replace
A Software Replace of a hard file means that the hard file is not physically replaced in the system. A drive that was marked defunct is brought back online using the RAID Administration program and is rebuilt without being physically replaced. This situation can occur because the RAID adapter marks a drive defunct whenever it communicates with the hard file and receives an unexpected response, in order to avoid any potential data loss.

Synchronization
Synchronization reads all the data bits of the entire logical drive, calculates the parity bit for the data, compares the calculated parity with the existing parity, and updates the existing parity if inconsistent. With the IBM ServeRAID Adapter, synchronization must be run on RAID-5 logical drives before storing data in order to ensure data integrity and provide RAID-5 data protection.
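The read, calculate, compare, and update cycle can be illustrated with a minimal Python sketch. It assumes the usual XOR form of RAID-5 parity and a simplified in-memory layout (one dictionary per stripe); neither detail is taken from the ServeRAID firmware, and the function names are hypothetical.

```python
from functools import reduce
from operator import xor
from typing import Dict, List


def xor_parity(data_blocks: List[bytes]) -> bytes:
    """XOR equal-length blocks byte by byte to produce the parity block."""
    return bytes(reduce(xor, column) for column in zip(*data_blocks))


def synchronize(stripes: List[Dict]) -> int:
    """Recalculate parity for every stripe; return how many stripes were updated."""
    updated = 0
    for stripe in stripes:
        expected = xor_parity(stripe["data"])
        if stripe["parity"] != expected:
            stripe["parity"] = expected  # existing parity was inconsistent
            updated += 1
    return updated
```

Once parity is consistent, XOR-ing the parity block with the surviving data blocks of a stripe reproduces the missing block, which is what makes a rebuild possible.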

The following definitions describe the logical drive states for the IBM ServeRAID Adapter:

CRITICAL (CRT)
This is the status for RAID-1 and RAID-5 arrays where the system is running in degraded mode because one drive is DDD. If another drive goes DDD, the array will be OFL and the operating system will not be operational.

FRE
The drive is not defined. Only the IBM ServeRAID Adapter has this state.

LDM
The logical drive is undergoing a RAID level change. This state is only available in the remote system of the Administration and Monitoring Program. Only the IBM ServeRAID Adapter has this state.

OFFLINE (OFL)
The array has exceeded the number of DDD drives allowed for the specific RAID level; therefore, the array is no longer operational.
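The relationship between the number of DDD drives and the CRT, OFL, and OKY logical drive states in this list can be summarized in a short Python sketch. The function name is hypothetical, and treating RAID-0 as allowing zero DDD drives is an inference from the RAID-0 definition rather than a statement in this paper.

```python
def logical_drive_state(raid_level: int, ddd_count: int) -> str:
    """Map the number of defunct (DDD) drives in an array to a logical drive state."""
    if raid_level not in (0, 1, 5):
        raise ValueError("RAID level must be 0, 1, or 5")
    # RAID-1 and RAID-5 keep running with one DDD drive; RAID-0 has no redundancy.
    allowed_ddd = 0 if raid_level == 0 else 1
    if ddd_count == 0:
        return "OKY"
    if ddd_count <= allowed_ddd:
        return "CRT"
    return "OFL"
```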

OKY
The IBM ServeRAID adapter's logical drive state where all drives in the array are online and fully operational.

The IBM ServeRAID adapter also assigns device states to physical drives. The following definitions describe these device states:

DDD
The RAID adapter marks an ONL or RBL drive defunct, changing its status to DDD, when one of the following conditions occurs:
- The drive does not respond to SCSI Selection (Selection Time-out).
- A write verification failure occurs when the RAID adapter tries to recover from a media error reported by the drive media.
- Drive reported a hardware error.
Note: Media error recovery and the conditions under which a drive is marked defunct during the recovery process vary slightly depending upon the specific RAID adapter.

DHS
A hot-spare or standby hot-spare drive (see below) enters the defunct hot-spare (DHS) state if it fails to respond to the adapter commands. Once a DHS drive is replaced, its state changes from DHS to HSP. Only the IBM ServeRAID Adapter has the DHS state.

EMP
No device is present in the bay or the adapter cannot communicate with the drive. This state is represented with dashes (- - -) on the IBM ServeRAID configuration screen, or a blank space on the Administration and Monitor screen. Only the IBM ServeRAID Adapter has this state.

HSP
A hot-spare (HSP) drive is a drive designated to be a replacement for the first DDD drive that occurs. The state of the drive appears as HSP. When a DDD drive occurs and an HSP is defined, the hot-spare drive takes over for the drive that appears as DDD. The HSP drive is rebuilt to be identical to the DDD drive. During the rebuilding of the HSP drive, this drive changes to the RBL state. The RBL state changes to ONL once the drive is completely rebuilt and has fully taken over for the DDD drive.

ONL
Drives that the RAID adapter detects as installed, operational, and configured into an array appear as ONL (online).

PFA
The firmware of a hard drive uses algorithms to track the error rates on the drive. The drive alerts the user with a Predictive Failure Analysis (PFA) alert via the RAID administration utility and NetFinity Manager when degradation of drive performance (read/write errors) is detected. When a PFA alert occurs, physical replacement of the drive is recommended.

RBL
A drive in this state is being rebuilt. Only the IBM ServeRAID Adapter has this state. A physical hard drive can enter the RBL state if one of the following conditions occurs:

- A good working drive replaces a DDD drive that is part of the critical logical drive. At the end of a successful rebuild, the state of the physical drive changes to ONL, and the state of the corresponding logical drives changes to OKY.

- The HSP or standby hot-spare (SHS) drive is added to the array and the state changes from HSP or SHS to RBL. At the same time, the DDD drive is removed from the array and its state changes to DHS from DDD. The adapter then automatically reconstructs data in the RBL drive. The state of the corresponding logical drive remains CRT (if the RAID level is 1 or 5) or OFL during the rebuild process. When the rebuild completes successfully, the device state changes from RBL to ONL and the logical drive state changes from CRT to OKY.

- A ready (RDY) or standby (SBY) drive replaces a DDD drive that is part of the critical logical drive. The state of the RDY or SBY drive becomes RBL. When the rebuild completes successfully, the state changes to ONL. The DDD drive is removed from the logical drive and becomes DHS.
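The device state changes described in the DDD, DHS, and RBL definitions can be collected into a small transition table. The Python sketch below is illustrative only: the event names are hypothetical labels invented for this example, and the table covers only the transitions named in this paper.

```python
# (current device state, event) -> next device state.
# Event names are hypothetical labels, not adapter terminology.
TRANSITIONS = {
    ("ONL", "marked_defunct"): "DDD",
    ("RBL", "marked_defunct"): "DDD",
    ("HSP", "takes_over_for_ddd"): "RBL",
    ("SHS", "takes_over_for_ddd"): "RBL",
    ("RDY", "replaces_ddd"): "RBL",
    ("SBY", "replaces_ddd"): "RBL",
    ("RBL", "rebuild_complete"): "ONL",
    ("DDD", "removed_from_array"): "DHS",
    ("DHS", "drive_replaced"): "HSP",
}


def next_state(state: str, event: str) -> str:
    """Return the next device state, or raise if this paper describes no such change."""
    try:
        return TRANSITIONS[(state, event)]
    except KeyError:
        raise ValueError(f"no transition from {state} on {event}") from None
```

A table like this is easy to compare against the drive template while troubleshooting, because each observed state change either appears in it or signals something unexpected.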

RDY
RDY appears as the status of a drive that the RAID adapter detects as installed, spun up, but not configured in an array.

SBY
A standby drive is a hard disk drive that the RAID adapter has spun down. Devices such as tape drives and CD-ROM drives are also considered to be in a standby state. Only the IBM ServeRAID Adapter has this state.

SHS
A standby hot-spare is a hot-spare drive that the adapter has spun down. If a drive becomes defunct and no suitable hot-spare drive is available, a standby hot-spare of the appropriate size spins up and enters the RBL state. You must have at least four hard disk drives if you want a standby hot-spare with RAID-5. Only the IBM ServeRAID Adapter has this state.

Additional Information

Web Sites
IBM maintains extensive and timely information on the world wide web. Visit the following sites for more information on IBM servers and other IBM products. These sources contain product information, performance data, and technical literature.

IBM Home Page ............................................... http://www.ibm.com
IBM PSG Home page .................................... http://www.pc.ibm.com
IBM PSG Server Home page ....................... http://www.pc.ibm.com/us/server/server.html
IBM PSG Support .............................................. http://www.pc.ibm.com/us/support.html
TechConnect Program .................................... http://www.pc.ibm.com/techconnect/
File repositories ................................................. http://www.pc.ibm.com/us/files.html or ftp://ftp.pcco.ibm.com

FaxBack System

IBM Personal Systems Group (PSG) FaxBack............1-800-426-3395
IBM FaxBack............................................................................1-800-IBM-4FAX (426-4329)

White Papers

The following White Papers pertain to RAID and hardfile technologies. These provide procedures for ensuring the highest protection and availability of customer data and are viewable on-line in PDF format at: http://www.pc.ibm.com/support
From this site select "Other Intel processor based servers" and then select "Online publications".

1. Using IBM RAID Adapters to Avoid Data Loss (PSG FaxBack doc# 11202)
2. High Availability of Your RAID Subsystem with IBM SCSI-2 Fast/Wide PCI-Bus RAID Adapter, IBM F/W Streaming RAID Adapter/A (PSG FaxBack doc# 11204)
3. Understanding Hard Disk Drive Media Defects. (PSG FaxBack doc# 11205)

Notice
© International Business Machines Corporation 1997. All rights reserved.
References in this publication to IBM products, programs or services do not imply that IBM intends to make these available in all countries in which IBM operates. Any reference to an IBM product, program, or service is not intended to state or imply that only IBM's product, program, or service may be used. Any functional equivalent program that does not infringe any of IBM's intellectual property rights may be used instead of the IBM product, program or service.
Information in this paper was developed in conjunction with use of the equipment specified, and is limited in application to those specific hardware and software products and levels.
IBM may have patents or pending patent applications covering subject matter in this document. The furnishing of this document does not give you any license to these patents. You can send license inquiries, in writing, to the IBM Director of Licensing, IBM Corporation, 500 Columbus Avenue, Thornwood, NY 10594 USA.
The information contained in this document has not been submitted to any formal IBM test and is distributed AS IS WITHOUT WARRANTIES OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO IMPLIED WARRANTIES OF MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE. The information about non-IBM (VENDOR) products in this manual has been supplied by the vendor and IBM assumes no responsibility for its accuracy or completeness. The use of this information or the implementation of any of these techniques is a customer responsibility and depends on the customer's ability to evaluate and integrate them into the customer's operational environment. While each item may have been reviewed by IBM for accuracy in a specific situation, there is no guarantee that the same or similar results will be obtained elsewhere. Customers attempting to adapt these techniques to their own environments do so at their own risk. This publication could include technical inaccuracies or typographical errors. Changes are periodically made to the information herein. IBM may make improvements and/or changes in the product(s) and/or the program(s) described in this publication at any time.

The following terms are trademarks or registered trademarks of the International Business Machines Corporation in the United States and/or other countries.

OS/2® NetFinity®

Microsoft, Windows, Windows NT, and the Windows logo are registered trademarks of Microsoft Corporation.

UNIX is a registered trademark in the United States and other countries licensed exclusively through X/Open Company Limited.

Other company, product, and service names may be trademarks or service marks of others.

Search Keywords

Hint Category: RAID
Date Created: 21-10-97
Last Updated: 24-02-99
Revision Date: 23-02-2000
Brand: IBM PC Server
Product Family: Netfinity 7000, PC Server 300, PC Server 310, PC Server 315, PC Server 320, PC Server 325, PC Server 330, PC Server 520, PC Server 704, PC Server 720, Rack/Storage Enclosures, ServeRAID
Machine Type: 8651, 3517, 3518, 3519, 3527, 9306, Various
