Document ID: MCGN-44PLKD
ServeRAID - Recovery Procedures for DDD Drives
Applicable to: World-Wide
Recovery Procedures for DDD Drives
Note
Information in this section is for use with all ServeRAID adapters listed in this HMM.
Procedures for recovering from DDD scenarios include:
- Drive Replacement (Rebuilding a Defunct Drive)
- Software and Physical Replacement
- Using and Understanding the ServeRAID Administration Log
- Recovery From ServeRAID Adapter Failure
- When HSP is Present at Time of Failure
- When HSP is Not Present at Time of Failure
Note
The following information applies only to drives that are part of the same array.
Drive Replacement (Rebuilding a Defunct Drive)
When a hard disk drive goes defunct (DDD), a Rebuild operation is required to reconstruct the data for the device in its respective disk array. The ServeRAID adapters and controllers can reconstruct RAID level-1 and RAID level-5 logical drives, but they cannot reconstruct data stored in RAID level-0 logical drives.
To prevent data integrity problems, the ServeRAID adapters and controllers set the RAID level-0 logical drives to Blocked during a Rebuild operation. After the Rebuild operation completes, you can unblock the RAID level-0 logical drives and access them once again. Remember, however, that the logical drive might contain damaged data.
Before you rebuild a drive, review the following guidelines and general information.
Guidelines for the Rebuild operation
The replacement hard disk drive must have a capacity equal to or greater than the failed drive.
If the hard disk drive being rebuilt is part of a RAID level-0 logical drive, the RAID level-0 logical drive is blocked.
- You must unblock any RAID level-0 logical drives at the end of the rebuild operation.
- If you use the Administration and Monitoring Utility to initiate the rebuild operation, you can unblock the blocked RAID level-0 drive when the rebuild operation completes.
Data in a logical drive with RAID level-0 is lost during the rebuild operation. If you backed up your data before the drive failed, you can restore the data to the new drive.
General Information about the Rebuild Operation
A physical hard disk drive can enter the rebuild state if:
- You physically replace a defunct drive that is part of the critical logical drive. When you physically replace a defunct drive in a critical logical drive, the ServeRAID adapter/controller rebuilds the data on the new physical drive before it changes the logical drive state back to Okay.
- The ServeRAID adapter/controller adds a hot-spare or a standby hot-spare drive to the array and changes its state from Hot-Spare or Standby Hot-Spare to Rebuilding.
Automatically Rebuilding the Defunct Drive
The ServeRAID adapter/controller will rebuild a defunct drive automatically when all of the following conditions exist:
- The physical drive that failed is part of a RAID level-1 or RAID level-5 logical drive.
- A hot-spare or standby hot-spare drive with a capacity equal to or greater than the capacity of the defunct drive is available the moment the drive fails.
- When multiple hot-spare drives are available, the ServeRAID adapter/controller searches for a hot-spare drive of the appropriate size. The smallest drive that meets this requirement enters the Rebuild state.
- If no hot-spare or standby hot-spare drives are available, the rebuild operation will start the moment you replace the defective drive. Note: If you physically replace the drive and the new drive does not appear in the Physical Drives branch of the Main Tree, you must scan for new or removed Ready drives.
- No rebuild, synchronization, or logical-drive migration operation is in process.
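The hot-spare selection rule described above (the smallest spare whose capacity is at least that of the defunct drive) can be sketched as follows. This is only an illustration of the rule, not adapter firmware; the function name `pick_hot_spare`, the megabyte units, and the tie-breaking by list position are assumptions made for the example.

```python
from typing import Optional, Sequence


def pick_hot_spare(spare_capacities_mb: Sequence[int],
                   defunct_capacity_mb: int) -> Optional[int]:
    """Return the index of the smallest spare that can replace the defunct drive.

    Returns None when no spare is large enough; in that case the rebuild
    waits until the defunct drive is physically replaced.
    """
    candidates = [
        (capacity, index)
        for index, capacity in enumerate(spare_capacities_mb)
        if capacity >= defunct_capacity_mb
    ]
    if not candidates:
        return None
    # min() on (capacity, index) picks the smallest adequate capacity;
    # ties go to the lower index (an arbitrary choice for this sketch).
    return min(candidates)[1]


# Three spares are available; the 4500 MB spare is the smallest adequate one.
chosen = pick_hot_spare([9100, 4500, 18200], defunct_capacity_mb=4300)
```

Here `chosen` is the list position of the 4500 MB spare. A spare that is too small is never selected, even if it is the only one present.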
Software and Physical Replacement: When the ServeRAID adapter communicates with the hardfile and receives an unexpected response, the adapter will mark the drive defunct in order to avoid any potential data loss. For example, this could occur in the event of a power loss to any of the components in the SCSI ServeRAID subsystem. In this case, the ServeRAID adapter will err on the side of safety and will no longer write to that drive, although the drive may not be defective in any way.
Perform a software replace or a physical replace according to the following criteria:
- A software replace is recommended when trying to recover data when multiple DDD drives occur. In this situation, you may lose data on drives that are not actually defective if you run a normal rebuild process.
Warning
If you use the wrong order when you attempt a software replace, data corruption results.
- Perform a software replace also for a single DDD drive when a hot-spare (HSP) is not present in the system and the drive has been marked DDD for the first time. In such a situation, the software replace requires a rebuild of the drive. During the rebuild, all sectors of the drive are rebuilt. Therefore, the drive is well tested. If a rebuild of the drive completes successfully, the drive need not be physically replaced.
- Replace the DHS drive physically if a DDD drive has been replaced by an HSP. Under these circumstances, a software replace sends only a start unit command to the drive. If the unit starts successfully, the drive is seen as good by the ServeRAID adapter. Restarting a drive does not test the drive sufficiently. Therefore, the drive must be replaced physically to ensure that a good HSP drive is present in the system.
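The criteria above can be condensed into a small decision helper. This is a simplified sketch for orientation only: the function name and parameters are hypothetical, and the rule for a single drive that has been marked DDD before (physical replacement) is taken from the "One DDD Drive, No RBL" procedure later in this document.

```python
def replacement_method(state: str, other_ddd_drives: int,
                       marked_ddd_before: bool) -> str:
    """Suggest "software" or "physical" replacement for one drive.

    state             -- "DDD" or "DHS" (the drive's current state)
    other_ddd_drives  -- number of additional drives currently marked DDD
    marked_ddd_before -- True if this drive has gone DDD on an earlier occasion
    """
    if state == "DHS":
        # A hot-spare has already taken over; a software replace would only
        # spin the drive up, which does not test it sufficiently.
        return "physical"
    if state != "DDD":
        raise ValueError(f"no replacement needed for state {state!r}")
    if other_ddd_drives > 0:
        # Multiple DDD drives: software-replace, in the correct order.
        return "software"
    # Single DDD drive with no hot-spare: a first failure is software-replaced
    # (the full rebuild tests the drive); a repeat failure is replaced physically.
    return "physical" if marked_ddd_before else "software"
```

The helper does not decide the order in which multiple drives are software-replaced; that order comes from the ServeRAID log.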
Using and Understanding the ServeRAID Administration Log: The ability to read the ServeRAID log, generated by IPSMON or Netfinity ServeRAID Manager, is essential to recovering an array when one or more drives are marked DDD. From the ServeRAID log, you can determine the order in which drives went DDD and, if multiple drives are DDD, which one is the "inconsistent" or "out-of-sync" drive. The ServeRAID log is created by running either IPSMON.EXE or Netfinity Manager. IPSMON.EXE is available on the IBM ServeRAID Command Line Programs Diskette from the IBM web site:
http://www.us.pc.ibm.com/files.html
Search on "ServeRAID."
Netfinity Manager is part of ServerGuide, which is shipped with every IBM PC Server and Netfinity Server.
The following is an excerpt from a ServeRAID log created by the IPSMON utility:
RAID log
09/12/97 09:33:36 INF003:A1C-B -- synchronization started
09/12/97 09:40:22 INF004:A1C-B -- synchronization completed
09/12/97 09:41:43 CRT001:A1C3B03 -- dead drive detected
09/12/97 09:42:13 INF001:A1C-B -- rebuild started
09/12/97 09:52:11 INF002:A1C-B -- rebuild completed
09/12/97 09:55:24 CRT001:A1C3B04 -- dead drive detected
The original configuration was:
- SID 1: HSP
- SID 2: ONL
- SID 3: ONL
- SID 4: ONL
The format is as follows:
date time error type:A x C x B xx message
The x following A is the adapter number; the x following C is the channel, and the xx following B is the SID number. An error type can be either informational or critical. The message gives a brief description of the RAID event that has occurred.
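The line format described above can be parsed mechanically. The following is a minimal sketch based only on the excerpt shown in this section; the regular expression, the `RaidEvent` type, and the treatment of `C-B` (no channel or SID given) as missing values are assumptions, and real logs may contain variants not covered here.

```python
import re
from typing import NamedTuple, Optional

# date time errortype:AxCxBxx -- message  (pattern inferred from the excerpt)
LOG_LINE = re.compile(
    r"^(?P<date>\d{2}/\d{2}/\d{2}) (?P<time>\d{2}:\d{2}:\d{2}) "
    r"(?P<code>[A-Z]{3}\d{3}):A(?P<adapter>\d+)C(?P<channel>\d+|-)B(?P<sid>\d*)"
    r" -- (?P<message>.+)$"
)


class RaidEvent(NamedTuple):
    date: str                # kept as text; the day/month order is not assumed
    time: str
    code: str                # for example CRT001 or INF002
    critical: bool           # True for CRT (critical), False for INF
    adapter: int
    channel: Optional[int]   # None when the log shows "C-"
    sid: Optional[int]       # None when no SID follows "B"
    message: str


def parse_log_line(line: str) -> RaidEvent:
    """Parse one ServeRAID log line; raise ValueError if it does not match."""
    match = LOG_LINE.match(line.strip())
    if match is None:
        raise ValueError(f"unrecognized log line: {line!r}")
    channel, sid = match["channel"], match["sid"]
    return RaidEvent(
        date=match["date"],
        time=match["time"],
        code=match["code"],
        critical=match["code"].startswith("CRT"),
        adapter=int(match["adapter"]),
        channel=None if channel == "-" else int(channel),
        sid=int(sid) if sid else None,
        message=match["message"],
    )


event = parse_log_line("09/12/97 09:41:43 CRT001:A1C3B03 -- dead drive detected")
```

For the sample line, `event` describes a critical event on adapter 1, channel 3, SID 3.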
The first two lines of the ServeRAID log show that a synchronization was started and completed successfully. Later, on line 3 of the ServeRAID log, a dead drive is detected on adapter 1, channel 3, SID 3. In this case, since an HSP drive is defined, the rebuild starts automatically. Both the start and the finish of the rebuild are logged by the IPSMON monitoring utility. Later on, the drive in SID 4 is marked dead, but no rebuild is started because the HSP drive has already been used.
In the current ServeRAID log, the drive in SID 4 is the "inconsistent" drive, and must be physically replaced. If more drives are DDD but are not listed in the ServeRAID log because the server has trapped (OS/2 or NT) or the volume was dismounted (NetWare), you must software-replace those drives before replacing the drive in SID 4, because the other drives contain the correct information to rebuild the "inconsistent" drive.
If this is the case, you should mark other drives ONL and then rebuild drive SID 4.
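The reasoning used to read the sample log can be sketched in code. The sketch assumes that a "rebuild completed" event (INF002) covers the earliest dead drive still outstanding, because the rebuild entries in the excerpt carry no SID; this is an interpretation of the example above, not documented adapter behavior. The function name and the (code, SID) input shape are hypothetical.

```python
from typing import Iterable, List, Optional, Tuple

DEAD_DRIVE = "CRT001"         # "dead drive detected" in the sample log
REBUILD_COMPLETED = "INF002"  # "rebuild completed" in the sample log


def unrecovered_dead_drives(
    events: Iterable[Tuple[str, Optional[int]]]
) -> List[int]:
    """Return SIDs of dead drives not covered by a completed rebuild, oldest first."""
    outstanding: List[int] = []
    for code, sid in events:
        if code == DEAD_DRIVE and sid is not None:
            outstanding.append(sid)
        elif code == REBUILD_COMPLETED and outstanding:
            # Assumption: the finished rebuild replaced the oldest outstanding drive.
            outstanding.pop(0)
    return outstanding


# The six events of the sample log, reduced to (code, SID) pairs.
sample = [
    ("INF003", None), ("INF004", None),
    ("CRT001", 3), ("INF001", None), ("INF002", None),
    ("CRT001", 4),
]
failed = unrecovered_dead_drives(sample)
inconsistent_sid = failed[0] if failed else None
```

For the sample log the result is SID 4, which matches the conclusion drawn in the text. Drives that went defunct without being logged do not appear in the result and must be handled as described above.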
Before you perform any actions on the hardware, make a photocopy of the Channel Record table in "Channel Record Table" at the end of this manual. Use Netfinity Manager, the ServeRAID administration program, or the ServeRAID configuration program to fill in the copy with the current status of all the drives, both internal and external. This table provides a three-channel diagram to accommodate all types of ServeRAID adapters.
For the ServeRAID adapters, if power is lost or another drive is marked DDD during a rebuild operation, the rebuild fails and the drive being rebuilt remains in the RBL state. Consequently, the "inconsistent" drive remains recognizable.
Recovery From ServeRAID Adapter Failure
When a ServeRAID adapter fails, you must replace the ServeRAID adapter and then restore the ServeRAID configuration to the new ServeRAID adapter. There are three ways to restore the ServeRAID configuration:
Method 1 (preferred method)
1. Press CTRL+I during POST to enter the Mini-Configuration.
2. Select Advanced Functions.
3. Select Import Configuration.
Method 2
1. Boot to the ServeRAID DOS Configuration Diskette.
2. Select Advanced Functions.
3. Select Init/View Synchronize Config.
4. If there are drives connected to the adapter that are either not showing up or not showing up as Ready (RDY), then select Initialize Configuration. This restores the factory default settings on the ServeRAID adapter and resets all functional hard disk drives to the RDY state.
5. Select Configuration Synchronization from the Init/View Synchronize Config. menu.
6. From this menu, select Hard Disk Drive as source. This retrieves the configuration information from the hard drive. A confirmation window appears. Select Yes if you want to restore the configuration or No if you do not want to restore the configuration.
Method 3: In the event that you are unable to import the configuration from the hard drive, you can attempt to restore the configuration from your backed up diskette.
1. Boot to the ServeRAID DOS Configuration Diskette.
2. Select Advanced Functions.
3. Select Restore/Convert Saved Configuration.
4. Insert the diskette that contains the backed up configuration, and press Enter.
5. A list of backup configuration names appears. Select the correct configuration name and press Enter.
6. A confirmation window appears. Select Yes to restore the configuration or No to return to the previous menu.
Recovery Procedures
There are two groupings of recovery procedures:
- "When HSP is Present at Time of Failure"
- "When HSP is Not Present at Time of Failure"
When HSP is Present at Time of Failure:
Use the following recovery procedures when an HSP state is present at the time of a disk failure:
- "One DHS Drive, No RBL"
- "One DDD Drive, One DHS Drive, No RBL"
- "More Than One DDD Drive, One DHS, No RBL"
- "One DHS Drive, One or More DDD Drives, and One RBL Drive"
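As a quick aid for choosing among the four procedures listed above, the drive-state counts can be mapped to a procedure name. This is a hypothetical helper written for illustration; it only counts DHS, DDD, and RBL states and does not verify the other preconditions stated in each procedure (for example, that all remaining drives are ONL).

```python
from collections import Counter
from typing import Iterable, Optional


def hsp_present_procedure(drive_states: Iterable[str]) -> Optional[str]:
    """Map current drive states to one of the HSP-present recovery procedures.

    Returns None when the combination is not covered by this section.
    """
    counts = Counter(drive_states)
    dhs, ddd, rbl = counts["DHS"], counts["DDD"], counts["RBL"]
    if dhs != 1:
        return None
    if rbl == 0:
        if ddd == 0:
            return "One DHS Drive, No RBL"
        if ddd == 1:
            return "One DDD Drive, One DHS Drive, No RBL"
        return "More Than One DDD Drive, One DHS, No RBL"
    if rbl == 1 and ddd >= 1:
        return "One DHS Drive, One or More DDD Drives, and One RBL Drive"
    return None


procedure = hsp_present_procedure(["DHS", "ONL", "DDD", "ONL"])
```

In this example one drive is DHS and one is DDD, so the helper points to the "One DDD Drive, One DHS Drive, No RBL" procedure.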
One DHS Drive, No RBL: Follow the steps below to bring the DHS drive back to an HSP state if the following items are true:
- Only one drive is marked DHS and the rest are ONL.
- The ServeRAID logical drive status is OKY because an HSP is present in the system. Either the HSP drive is the hard drive that went DHS or the HSP drive has already taken over for the DHS drive automatically and has been rebuilt successfully.
- There are no drives with a RBL status.
Once you verify the conditions above through either the ServeRAID Administration log or the ServeRAID Administration Utility, perform the following steps to bring the DHS drive back to HSP status:
1. Physically replace the DHS hard drive with a new one of the same capacity or greater in the same location. If Hot-Swap Rebuild is Enabled, the state of the new drive will automatically be set to HSP. If this occurs, skip to step 5. If the new drive is not automatically set to HSP, then this must be done manually. Instructions for manually setting the drive state continue with step 2.
2. With a RAID-1 or RAID-5 logical drive, the operating system is still functional at this point. Use either Netfinity Manager 5.0 (or later) or the ServeRAID Administration Utility to bring the drive back to HSP status. Using the ServeRAID Administration Utility, select the DHS drive.
3. A window named Device Management appears, listing all the possible drive states. Select Hot-Spare (HSP), or Standby Hot-Spare (SHS) if necessary, then select Set Device State.
4. The adapter issues a start unit command to the drive. Once the drive successfully spins up, the adapter changes the drive's state to HSP (or SHS) and saves the new configuration. A message appears that reads, "Device state changed from DHS to HSP (or SHS)". Select OK on this message and then on the Device Management window.
5. If you see the message "Error in starting drive", reinsert the cables and hard drive to verify that they are connected properly, then go to step 2. If the error persists, go to step 1.
6. If the error still occurs with a hard drive that is known to be good, troubleshoot to determine the defective part. The defective part can be a cable, a back plane, a ServeRAID adapter, or some other component. Once you have replaced the defective part so that there is a good connection between the adapter and the hard drive, go to step 2.
One DDD Drive, One DHS Drive, No RBL: If the system has a DDD drive and a DHS drive and a hot-spare (HSP) existed prior to the drive failures, the system should still run as long as the logical drives are configured as RAID-5 or RAID-1. The logical drives in the array will be in the CRT state due to one drive in the array being defunct.
Note
Because the operating system is functional, this procedure assumes you are using the ServeRAID Administration Utility within the operating system to recover.
Perform the following steps to bring the logical drive from CRT to OKY status:
1. Physically replace the drives that are marked DDD and DHS.
Important
Before you proceed, verify whether a replaced drive is already in a rebuild state. Do not attempt to start another rebuild.
If the customer has enabled the HOT SWAP rebuild option, the adapter automatically starts a rebuild. If this is the case, the following step occurs automatically.
2. Select the DDD drive from within the ServeRAID Administration Utility and then select Rebuild Drive. You see a message confirming that the drive is starting. The drive then starts the rebuild process. When this process is complete, the drive is marked ONL.
3. After the rebuild is complete, select the DHS drive from within the ServeRAID administration utility. You then see several options. Choose HSP (or SHS if necessary) and choose Set Device State. The adapter issues a start unit command to the drive. Once the drive spins up and the adapter saves the drive's configuration, the drive is marked HSP, or SHS, as is applicable.
More Than One DDD Drive, One DHS, No RBL
Note
The following procedure for multiple DDD drives assumes that the operating system is installed on one of the DDD drives.
In this scenario, the operating system is no longer functional. Therefore, you must boot to the ServeRAID DOS Configuration Diskette to recover the logical drive. It is extremely important to confirm that either the ServeRAID Administration Utility or Netfinity Manager 5.0 has been running prior to the drives being marked defunct. If so, the utility or Netfinity Manager has logged the sequence of DDD events to a log file, either on a diskette or on a local or network drive. You can view this log file on another machine to determine the "inconsistent" drive. When you know which drive is "inconsistent", you can attempt to recover data.
Note
Once you lose more than one drive in a set of RAID-5 or RAID-1 logical drives, loss of data is definitely a possibility. The steps below guide you through a recovery. However, a recovery may not be possible in every case.
1. View the ServeRAID log on another machine and write down the order in which the drives went defunct.
2. Boot to the ServeRAID DOS Configuration Diskette and choose View Configuration. Make sure that the Channel Record Table contains the correct information for the status of all drives, not just those listed in the RAID log.
3. Using the ServeRAID Configuration Utility program, choose Set Device State and choose a DDD drive that is not listed in the ServeRAID log. Set that drive to an ONL state. Repeat this step until the only DDD drives remaining are those indicated in the ServeRAID log file.
Important
The drives marked DDD that are not listed in the ServeRAID log are the last ones to go defunct. You must recover those drives first so that the information from them can be used to rebuild the original drive that failed, that is, the "inconsistent" drive. If you do not replace the "inconsistent" drive last, the system uses it to rebuild the last drive that went defunct, and data can be corrupted as a result. Therefore, it is extremely important to perform step 3 above carefully.
4. Choose Set Device State and then choose the last drive to go defunct according to the log file. Set that device to the ONL state. Repeat until there is only one DDD drive remaining.
5. Choose Set Device State and choose the DHS drive. Change its state from DHS to HSP.
6. Choose Rebuild and highlight the DDD drive.
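The ordering in steps 3 through 6 can be summarized as a small planning function: drives missing from the log are set ONL first, then logged drives from the most recent failure backward, leaving the earliest logged failure to be rebuilt. This is a sketch of that ordering only; the function name and inputs are hypothetical, and it does not replace checking the actual log and Channel Record Table.

```python
from typing import List, Sequence, Tuple


def plan_software_replace(
    ddd_sids: Sequence[int],
    logged_defunct_order: Sequence[int],
) -> Tuple[List[int], int]:
    """Return (SIDs to set ONL, in order; SID of the drive to rebuild last).

    ddd_sids             -- SIDs currently marked DDD
    logged_defunct_order -- SIDs from the log's dead-drive entries, oldest first
    """
    logged = [sid for sid in logged_defunct_order if sid in ddd_sids]
    if not logged:
        raise ValueError("no DDD drive appears in the log; failure order is unknown")
    # Drives that never reached the log were the last to go defunct.
    unlogged = [sid for sid in ddd_sids if sid not in logged_defunct_order]
    # logged[0] failed first: it is the "inconsistent" drive and is rebuilt last.
    online_order = unlogged + list(reversed(logged[1:]))
    return online_order, logged[0]


# Example: SIDs 2, 3, 5, 6 are DDD; the log recorded SID 3 failing, then SID 5.
order, rebuild_sid = plan_software_replace([2, 3, 5, 6], [3, 5])
```

In the example, SIDs 2 and 6 (not logged) are set ONL first, then SID 5, and SID 3 is left as the single DDD drive to rebuild.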
One DHS Drive, One or More DDD Drives, and One RBL Drive
Note
The following procedure for multiple DDD drives assumes that the operating system is installed on one of the DDD drives.
Important
The ServeRAID adapter must be using firmware level 2.87 or later for the following procedure to have a chance at success.
Usually when you have a RBL drive after bringing up a system, it is because the data on the drive was being rebuilt when the system went down. If there are DDD drives as well, those drives are more than likely the cause of the system crash. The following steps allow you to attempt to recover the logical drive:
1. Boot to the ServeRAID DOS Configuration Diskette for the ServeRAID adapter.
2. Choose View Configuration.
3. Write down the current status of each drive.
4. Physically replace the DHS drive.
5. Return to the utility's Main Menu and choose Device Management.
6. Choose Set Device State.
- If you see any DDD drives, highlight them and change their status to ONL.
- If you do not see any DDD drives, highlight the DHS drive and change its state to HSP, or SHS, as appropriate. Repeat this step until there are no more drives marked DDD or DHS.
7. Choose Rebuild and highlight the RBL drive. The rebuild process begins, and all data will be rebuilt to the drive.
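The net effect of steps 6 and 7 on the drive states can be modeled as follows. This is a bookkeeping sketch only, using a hypothetical dictionary of SID to state; it does not issue any commands to an adapter, and it assumes the DHS drive has already been physically replaced as in step 4.

```python
from typing import Dict, Tuple


def prepare_rebuild(states: Dict[int, str],
                    spare_state: str = "HSP") -> Tuple[Dict[int, str], int]:
    """Return the drive states expected after step 6 and the SID to rebuild.

    Every DDD drive becomes ONL, the DHS drive becomes HSP (or SHS), and the
    single RBL drive is reported as the rebuild target for step 7.
    """
    rebuilding = [sid for sid, state in states.items() if state == "RBL"]
    if len(rebuilding) != 1:
        raise ValueError("this procedure expects exactly one RBL drive")
    updated: Dict[int, str] = {}
    for sid, state in states.items():
        if state == "DDD":
            updated[sid] = "ONL"
        elif state == "DHS":
            updated[sid] = spare_state
        else:
            updated[sid] = state
    return updated, rebuilding[0]


after, target = prepare_rebuild({1: "DHS", 2: "ONL", 3: "DDD", 4: "RBL"})
```

Writing down the expected end state before touching the configuration utility makes it easier to confirm, against the notes from step 3, that no DDD or DHS drive was missed.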
When HSP is Not Present at Time of Failure: Use the following recovery procedures when HSP is not present at the time of a disk failure:
- "One DDD Drive, No RBL"
- "Two DDD Drives, No RBL"
- "More Than Two DDD Drives, No RBL"
One DDD Drive, No RBL: Follow these steps to bring the DDD drive back to the ONL state if the following items are true:
- Only one drive is marked DDD and the rest are ONL.
- There are no drives with an RBL status.
Once the conditions above are verified through either the RAID administration log or the ServeRAID Administration Utility, perform the following steps to bring the DDD drive back to ONL status.
1. If the drive has never been marked DDD, proceed to step 3 to software-replace the drive using the ServeRAID Administration and Monitoring Utility or Netfinity ServeRAID Manager.
Note
Refer to "Software and Physical Replacement" to understand the differences between software and physical replacement.
2. If the drive has been marked DDD before, physically replace the hard drive in the DDD SID with a new one of the same capacity or greater.
3. With a RAID-1 or RAID-5 logical drive, the operating system will be functional. Use either Netfinity Manager or the ServeRAID Administration Utility within the operating system to start the Rebuild process. With the ServeRAID Administration Utility, select the drive marked DDD, and choose Rebuild from the menu that appears.
4. The adapter issues a start unit command to the drive. The drive then begins the rebuild process. Once the drive completes this process, the drive's status changes to ONL.
5. If you see the message "Error in starting drive", reinsert the cables, hard drive, and other components to verify that there is a good connection. Go to step 3. If the error persists, go to step 6.
6. Physically replace the hard drive in the DDD SID with a new one of the same capacity or greater and go to step 3.
7. If the error still occurs with a known good hard drive, troubleshoot to determine if the cable, back plane, ServeRAID adapter, or other component is defective.
Note
You can view the ServeRAID Device Event Table by clicking the logical drive in the ServeRAID Administration and Monitoring Utility. If Hard Events are reported in the ServeRAID Device Event Table, refer to "Device Event Table" for information. In many cases, the ServeRAID adapter should not be replaced.
8. Once you have replaced the defective part so that there is a good connection between the adapter and the hard drive, go to step 3.
Two DDD Drives, No RBL
Note
The following procedure for multiple DDD drives assumes that the operating system is installed on one of the DDD drives.
In this case, with no defined hot-spare drive, the server more than likely trapped (under OS/2 and NT) or the volume was dismounted (under NetWare). To resolve this scenario, examine the ServeRAID log generated by the ServeRAID Administration Utility and follow the steps below:
1. Boot to the ServeRAID DOS Configuration Diskette for the ServeRAID adapter.
2. Choose Set Device State and highlight the drive marked DDD last by the ServeRAID adapter. Set this drive's state to ONL. The drive spins up and changes from DDD to ONL status.
Warning
If you use the wrong order when you set the drive's state to ONL, data corruption results. See the following note to determine the last drive marked DDD by the ServeRAID adapter.
Note
Refer to "Using and Understanding the ServeRAID Administration Log" for details on obtaining and interpreting the ServeRAID log.
- If only one drive is recorded in the ServeRAID log because the ServeRAID adapter was not able to log the defunct drive before the operating system went down, the last drive that went defunct is the drive that is not recorded in the ServeRAID log.
- If two drives are recorded in the ServeRAID log, then the last drive to go defunct is the second drive listed in the log, that is, the drive with the most recent time stamp.
3. If the remaining DDD drive has been marked DDD before, physically replace the hard drive with a new one of the same capacity or greater. Proceed to step 5.
4. Otherwise, proceed to step 5 to software-replace the remaining DDD drive using the ServeRAID Administration and Monitoring Utility or Netfinity ServeRAID Manager.
Note
Refer to "Software and Physical Replacement" to understand the differences between software and physical replacement.
5. With a RAID-1 or RAID-5 logical drive, the operating system will be functional. Use either Netfinity Manager or the ServeRAID Administration and Monitoring Utility within the operating system to start the Rebuild process. With the ServeRAID Administration and Monitoring Utility, select the drive marked DDD and choose Rebuild from the menu that appears.
6. The adapter issues a start unit command to the drive. The drive then begins the Rebuild process. Once the drive completes this process, the drive's status changes to ONL.
7. If you see the message "Error in starting drive", reinsert the cables, hard drive, and all other components to verify that there is a good connection. Go to step 5. If the error persists, go to step 8.
8. Physically replace the hard drive in the DDD SID with a new one of the same capacity or greater and go to step 5.
9. If the error still occurs with a known good hard drive, troubleshoot to determine if the cable, back plane, ServeRAID adapter, or other component is defective.
Note
You can view the ServeRAID Device Event Table by selecting the logical drive from the ServeRAID Administration and Monitoring Utility. If Hard Events are reported in the ServeRAID Device Event Table, refer to "Device Event Table" for more information. In many cases, the ServeRAID adapter should not be replaced.
10. Once you have replaced the defective part so that there is a good connection between the adapter and the hard drive, go to step 5.
11. If software replacement brings all the drives back ONL and makes the system operational, carefully inspect all cables and connectors to ensure that the cables or backplane are not defective. Make sure that the card is seated properly. When multiple drives are marked defunct, the communication channel (cable or backplane) is often the cause of the failure. If the backplane is bowed, drives and backplane connectors may not seat properly, causing a bad connection. Also, with hot-swap drives that are removed frequently, connectors could become damaged if too much force is exerted.
More Than Two DDD Drives, No RBL
Note
The following procedure for multiple DDD drives assumes that the operating system is installed on one of the DDD drives.
1. View the ServeRAID log on another machine and write down the order in which drives went defunct.
2. Boot to the ServeRAID DOS Configuration Diskette and choose View Configuration. Make sure that the Channel Record Table contains the correct information for the status of all drives, not just those listed in the ServeRAID log.
3. Using the ServeRAID Configuration Utility program, choose Set Device State, then choose a DDD drive that is not listed in the ServeRAID log. Change the state of this drive to ONL to software-replace it.
4. Perform the previous step until only two DDD drives are remaining. One or both of these drives should be listed as the first drive(s) to go defunct as indicated in the ServeRAID log.
Note
If you choose the wrong order when you choose Set Device State to change the drives' states to ONL, data corruption results. Be sure that you only change device states to ONL for drives not listed as DDD in the ServeRAID log. The first defunct drive requires rebuilding, so it must be replaced last.
Document Category: Controllers
Date Created: 01-02-99
Last Updated: 15-03-99
Revision Date: 15-03-2000
Brand: IBM PC Server
Product Family: ServeRAID
Machine Type: Various