Diagnosing Checkstops


Contents

About This Document
    Related Documentation
How to Determine if a Checkstop Has Occurred
What Is a "machine check"?
What Is a Checkstop?
How to Proceed
Possible Software Resolutions for Checkstop Conditions

About This Document

This document describes checkstops (machine checks that occur during another machine check), how to recognize them, and how to gather the data needed to diagnose them. It applies to AIX Versions 3.2, 4.1, 4.2, and 4.3.

Related Documentation

For more in-depth coverage of this subject, the following IBM documents are recommended:

The AIX and RS/6000 product documentation library is also available:

http://www.rs6000.ibm.com/resource/


How to Determine if a Checkstop Has Occurred

A checkstop is indicated by an LED value of 185, 186, or 187 on the LED display of the main unit. If the machine does not have an LED display or the machine has been rebooted, then evidence of a checkstop should exist in the system error report. Look for an entry labeled CHECKSTOP in the error report to determine if a checkstop occurred.
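
When the LED display is not available, the error report can be searched from the command line. The following is a minimal sketch, assuming the report comes from the AIX errpt command and that checkstop entries contain the string CHECKSTOP as described above. The function name count_checkstop_entries is hypothetical.

```shell
#!/bin/sh
# Count error-report lines that mention CHECKSTOP.
# The report text is read from standard input, so the function works with
# a live report or with a saved copy of one.
# count_checkstop_entries is an illustrative name, not an AIX command.
count_checkstop_entries() {
    # grep -c prints the number of matching lines. It exits non-zero when
    # that number is 0, so "|| true" keeps the function's status at 0.
    grep -c 'CHECKSTOP' || true
}
```

Typical use:
       errpt | count_checkstop_entries

A result greater than 0 suggests that at least one checkstop was logged. The detailed report (errpt -a) can then be searched for the entry itself.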


What Is a "machine check"?

A machine check is an error logged by the machine check handler. Causes of a machine check could be:

When a machine check occurs, a non-maskable interrupt (NMI) is generated. The operating system logs the machine check, including the contents of the error logging registers that report its cause, and initiates a system dump.


What Is a Checkstop?

A checkstop is a machine check that occurs during another machine check. A checkstop also occurs when the hardware (usually a processor, but sometimes a cache, memory, or I/O bus controller) determines that something is in an "impossible" state. Examples are an error that cannot be isolated to a particular bus transfer in progress, or a processor that detects no progress being made because it has been unable to complete any instruction for some period of time.

When a system checkstops, the clocks in the machine are frozen within a few cycles after the error, and the service processor saves part of the state of the CPUs in NVRAM. It then attempts, possibly several times, to do a full hardware reset and restart the system.

When the system reboots, the saved data is copied to a file in the /usr/lib/ras directory (RAS stands for reliability, availability, and serviceability). Two file names, checkstop.A and checkstop.B, are used in rotation. The error log entry records the file name along with the total number of checkstops that occurred during the reboot attempts before the system came up successfully.
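
Because the two file names rotate, the most recent data is not always in the same file. The following is a minimal sketch for picking the newer file by modification time, assuming the files are named checkstop.A and checkstop.B as described above. The helper name newest_checkstop_file is hypothetical.

```shell
#!/bin/sh
# Print the most recently modified checkstop file in a directory.
# The directory defaults to /usr/lib/ras; pass another one to inspect a copy.
# newest_checkstop_file is an illustrative name, not an AIX command.
newest_checkstop_file() {
    ras_dir=${1:-/usr/lib/ras}
    # ls -t sorts newest first. Errors are discarded so that a directory
    # with no checkstop files simply produces no output.
    ls -t "$ras_dir"/checkstop.* 2>/dev/null | head -n 1
}
```

Typical use:
       newest_checkstop_file /usr/lib/ras

The file name recorded in the error log entry remains the authoritative reference. This check is only a quick cross-check.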

If a second machine check occurs before the operating system completes logging the error (to NVRAM) and initiates a complete hardware reset or halts, the processor will checkstop.


How to Proceed

Checkstops are inherently hardware phenomena. They do not necessarily indicate a solid failure of a component, so diagnostics will rarely find a problem. The checkstop file that is generated is needed to determine the cause of the checkstop and the corrective action required. This file must be examined by your hardware service organization. For further information, contact one of the following:

Use the following instructions to package the checkstop files for examination by hardware service.

Gather system information by performing the following steps:

  1. Run the command snap -g. This will put general information about the machine in the directory /tmp/ibmsupt.

  2. Copy the checkstop files in /usr/lib/ras to the directory /tmp/ibmsupt/testcase. Execute:
       cp /usr/lib/ras/checkstop* /tmp/ibmsupt/testcase 
    
  3. Make a file called customer_info in the directory /tmp/ibmsupt/other. In this file, include the following information:

    main contact
    telephone number of main contact
    machine type having the problem (examples: 7011, 7012, 7015)
    serial number of machine
    location of machine (physical location of machine including address)
    description of activity of machine prior to event

  4. Put the testcase on diskette:
       tar -cvf /dev/fd0 /tmp/ibmsupt 
    
    /dev/fd0 is the diskette drive. To write to tape instead, substitute the name of your tape device.

  5. Label the tape or diskette with the following information:

    customer name
    customer number
    incident#
    the command used to copy the information to the tape or diskette

    Very Important: If the person sending in this testcase is not the person who reported the problem, be sure to include the name of the person who reported it. If the proper information is not on the package, then it takes valuable time to process and delays solving your problem. The incident# will be the reference number that your hardware service organization assigns to this problem.
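
Steps 2 and 3 above can be scripted. The following is a minimal sketch, assuming snap -g has already been run. The function names and the layout of the customer_info template are illustrative only and are not a format defined by IBM.

```shell
#!/bin/sh
# Step 2: copy the checkstop files into the testcase directory.
# $1 = directory holding the checkstop files (default /usr/lib/ras)
# $2 = support directory created by snap -g (default /tmp/ibmsupt)
stage_checkstop_files() {
    ras_dir=${1:-/usr/lib/ras}
    support_dir=${2:-/tmp/ibmsupt}
    mkdir -p "$support_dir/testcase" || return 1
    cp "$ras_dir"/checkstop* "$support_dir/testcase"
}

# Step 3: create customer_info with one prompt per required item.
# The file still has to be filled in by hand afterwards.
write_customer_info_template() {
    support_dir=${1:-/tmp/ibmsupt}
    mkdir -p "$support_dir/other" || return 1
    cat > "$support_dir/other/customer_info" <<'EOF'
Main contact:
Telephone number of main contact:
Machine type (for example 7011, 7012, 7015):
Serial number of machine:
Location of machine (including address):
Activity of machine prior to event:
EOF
}
```

Typical use, with the commands from steps 1 and 4 around it:
       snap -g
       stage_checkstop_files /usr/lib/ras /tmp/ibmsupt
       write_customer_info_template /tmp/ibmsupt
       tar -cvf /dev/fd0 /tmp/ibmsupt

Before sending the diskette, tar -tvf /dev/fd0 lists the archive contents so the copy can be checked.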


Possible Software Resolutions for Checkstop Conditions

AIX Version 3.2

APAR     Description                                                     Hardware
IX26815  REQUIRED FOR MODEL 7015-970                                     7015-970, 7011-220
IX39776  LED 185 OCCURS WHEN INSTALLING OR RUNNING PCSIM
IX51519  OPENING MULTIPLE WKS ON FUTURE GRAPH. ADAPT. CAUSES CHKSTOP
IX51912  SYSTEM RELIABILITY TEST CAUSES CHECKSTOP ON FGA
IX54923  SYSTEM WITH INTEGRATED ETHERNET CHECKSTOP                       7011-250
IX59366  GXT500 CHECKSTOPS DOING SOLID MODEL ROTATION IN CATIA           GXT500
IX49100  UPDATE TO MATCH HARDWARE WORKAROUND FOR INTERRUPT PROBLEM
IX44964  SYSTEM CRASHED AFTER AUTORESTART ON 250,260                     7011-250, 7011-260
IX49220  MPV: SYSTEM HANGS WITH 185 IN THREE DIGIT DISPLAY
IX51482  CATIA (R) RUNNING 2 SESSIONS HANGS X
IX51525  RUNNING PHIGS APPS CAN 185 THE MACHINE
IX51597  CATIA (R) LOCAL TRANSFORMATIONS JERKY (HW CAUSES 185)
IX52258  GXT500: 185 MACHINE CHECK                                       GXT500
IX54698  SYSTEM CRASHES WITH GXT500 - LED 185
IX53565  GXT500: 186 WHEN RUNNING APP OVERNIGHT                          GXT500
IX59617  41T/GXT500 W/L2 DOES 186 IN PDGS                                GXT500

AIX Version 4.1

APAR     Description                                                     Hardware
IX68896  CHECKSTOP 185/186 ON GXT500D/GXT500 WITH X -BS OPT              GXT500
IX70222  NEED SW WORKAROUND FOR PEGASUS 6XX BUS LIVELOCK                 7012-G30, 7012-G40, 7012-G50,
                                                                         7013-J30, 7013-J40, 7013-J50,
                                                                         7015-R30, 7013-R40, 7013-R50
IX53518  GXT500D: CHECKSTOP 185 RUNNING X/FLIP,GL/JELLO                  GXT500D
IX55419  604 SMP CCA2 WA FOR CCA2 VER<5.0                                7012-G30, 7012-G40, 7012-G50,
                                                                         7013-J30, 7013-J40, 7013-J50,
                                                                         7015-R30, 7013-R40, 7013-R50
IX57557  CHECKSTOP ON 604 SYSTEMS RUNNING LLDB
IX59985  GXT500 CHECKSTOPS DOING SOLID MODEL ROTATION IN CATIA           GXT500
IX62529  UNALIGNED TRANSFERS ON 825A CAN CAUSE MACHINE CHECK             PCI F/W SCSI Adap.
IX66482  PCI SCSI ADAPTER CAUSES MACHINE CHECK                           PCI F/W SCSI Adap.
IX52455  GXT500: 185 MACHINE CHECK                                       GXT500
IX53935  SYSTEM CRASHES WITH GXT500 - LED 185                            GXT500
IX54685  GXT500/CATIA LED 185                                            GXT500
IX56379  TLB INVALIDATE CAUSES EXCESS DATA TO BE READ IN A XFER DATA RD
IX56380  GLREADPIXELS CALL ON GXT500 CAUSES HANG IN AIX420
IX53545  GXT500: 186 WHEN RUNNING APP OVERNIGHT                          GXT500
IX55386  42T/GXT500 LED 186                                              GXT500

AIX Version 4.2

APAR     Description                                                     Hardware
IX69143  CHECKSTOP 185/186 ON GXT500D/GXT500 WITH X -BS OPTION           GXT500
IX70175  NEED SW WORKAROUND FOR PEGASUS 6XX BUS LIVELOCK                 7012-G30, 7012-G40, 7012-G50,
                                                                         7013-J30, 7013-J40, 7013-J50,
                                                                         7015-R30, 7013-R40, 7013-R50
IX62156  UNALIGNED TRANSFERS ON 825A CAN CAUSE MACHINE CHECK             PCI F/W SCSI Adap.
IX66931  PCI SCSI ADAPTER CAUSES MACHINE CHECK IN WILDCAT                7025-F50
IX61252  GXT500 CHECKSTOPS DOING SOLID MODEL ROTATION IN CATIA           GXT500
IX83745  CHECKSTOP ON SPHINX                                             43P-260

AIX Version 4.3

APAR     Description                                                     Hardware
IX72262  APACHE DEADLOCK AVOIDANCE WORKAROUNDS                           7017-S70
IX83586  CHECKSTOP ON SPHINX                                             43P-260
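
To see whether fixes from the lists above are already on a system, each APAR number can be queried in turn. The following is a minimal sketch, assuming the instfix command is available at your AIX level and that instfix -ik returns a zero exit status only when the fix is found. The helper name report_apars is hypothetical.

```shell
#!/bin/sh
# Report, for each APAR number given, whether instfix finds it installed.
# report_apars is an illustrative name, not an AIX command.
report_apars() {
    for apar in "$@"; do
        # -i queries installation status; -k names the fix keyword (APAR).
        # The exit status is assumed to be 0 only when the fix is found.
        if instfix -ik "$apar" >/dev/null 2>&1; then
            echo "$apar installed"
        else
            echo "$apar not found"
        fi
    done
}
```

Typical use:
       report_apars IX72262 IX83586

An APAR reported as not found is not necessarily required. Whether a fix applies depends on the AIX version and hardware listed for it.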

Diagnosing Checkstops: checkstops.all.krn ITEM: FAX
Dated: 98/11/05~00:00 Category: krn
This HTML file was generated 99/06/24~12:42:06