3Ware Controller Problem Determination Procedures (RAID Card Error Descriptions)
2012.01.16 16:01:50
/var/log/messages
These messages can appear in the system syslog file. They are documented here to help distinguish real errors from false ones. Only the messages explicitly labeled as generating lemon events (e.g. RAID_TW_CTLR or RAID_TW_DISK) will be reported to the operator. The RAID_TW lemon events which are defined here are obtained from running query commands rather than looking at log history.
Message | Action |
---|---|
kernel: 3w-9xxx: scsi0: AEN: ERROR (0x04:0x0002): Degraded unit detected:unit=0, port=1 | A degraded disk has been found as part of the RAID array. Follow DiskWinTwMirrorRecover |
kernel: 3w-xxxx: scsi0: AEN: INFO: Verify started: Unit #0. | Message can be ignored. It indicates that the tw_cli start verify has been run. |
kernel: 3w-xxxx: scsi0: AEN: INFO: Verify complete: Unit #0. | Message can be ignored. It indicates that the tw_cli start verify has been run and completed. |
kernel: 3w-9xxx: scsi1: ERROR: (0x03:0x010D): Invalid field in CDB:. | According to the article on the 3ware web site, this indicates a request for a status page which does not exist. This is not an error with the adapter or disk and the message can be ignored. |
kernel: 3w-xxxx: scsi3: Command failed: status = 0xc4, flags = 0x43, unit #8. | A smartctl error listing command such as -l selftest has been issued against a disk which does not exist. If there should be no disk present in this port, this error can be ignored. Otherwise, run lemon-host-check and then follow the 3ware problem determination procedures. |
9xxx: scsi0: AEN: WARNING (0x04:0x0023): Sector repair completed:port=5 | A bad block has been found and re-located. This is not a serious problem and will occur from time to time. If there are a large number of these errors, the disk can be checked using tw_cli start verify and then replaced if more errors occur. Some corruption cases with fsprobe have been identified where these messages are also present. This is not conclusive at the moment (03/11/07). Symptoms where this message has been related to a data corruption have been further observed with the tt_07_1 models (25/01/08). Recommended action is to replace the drive at the port listed in the message (port 5 in this case). |
kernel: 3w-xxxx: scsi0: AEN: WARNING: Sector repair occurred: Port #1. | A bad block has been found and re-located. This is not a serious problem and will occur from time to time. If there are a large number of these errors, the disk can be checked using tw_cli start verify and then replaced if more errors occur. |
9xxx: scsi1: AEN: WARNING (0x04:0x004B): Battery temperature is high | The card has detected battery temperature problems. Follow the procedure in DiskPrbTwBbuFault |
3w-xxxx: scsi2: AEN: WARNING: Unclean shutdown detected: Unit #6. | This indicates the machine was powered-down without doing a clean shutdown. While this is not an error in itself and can be ignored, it may explain other errors such as a corrupted file system where the cache was not saved to disk |
kernel: 3w-9xxx: scsi0: AEN: INFO (0x04:0x0053): <NULL>:. | This message is not understood but seems to be caused by a memory space problem on the machine. It does not indicate a problem with the controller and can be ignored. |
Spare capacity too small for some units: spare unit=3, RAID unit=2 | The spare disk is too small to be incorporated into the RAID. See DiskWinTwUnitDiskSizeFix |
3w-9xxx: scsi0: AEN: ERROR (0x04:0x0057): Battery charging fault:. | A battery is failing. Run the DiskWinTwBbuTest procedure to check the battery and raise a vendor call if the messages recur. |
kernel: 3w-9xxx: scsi1: AEN: WARNING (0x04:0x0039): Buffer ECC error corrected:address=0x3449700. | This is a serious error which merits a vendor call. This message will generate a RAID_TW_CTLR error to the operator and the controller should be replaced. |
kernel: 3w-9xxx: scsi0: AEN: ERROR (0x04:0x0024): Buffer integrity test failed:error=0x3013. | This is suspected as being a serious error message which merits a vendor call. This message will generate a RAID_TW_CTLR error to the operator and the controller should be replaced. |
kernel: 3w-xxxx: scsi1: AEN: ERROR: Drive ECC error detected: Port #0. | This error will generate a RAID_TW_DISK error. The disk at the specified port has failed, usually shown up by a scheduled verification or media scan. The disk should be replaced. |
kernel: 3w-9xxx: scsi1: AEN: ERROR (0x04:0x005F): Cache synchronization failed; some data lost:unit=1. | This is suspected as being a serious error message which merits a vendor call. Some file system corruption may also occur. This message will generate a RAID_TW_CTLR error to the operator and the controller should be replaced. |
kernel: 3w-9xxx: scsi2: ERROR: (0x03:0x0101): Invalid command opcode:opcode=0x4D. | This message occurs when SMART data is requested from a 'logical' disk rather than a physical one (such as /dev/sdb on a 3ware controller). Check the /etc/smartd.conf file and the CDB configuration. |
kernel: 3w-xxxx: scsi1: Unit #0: Command (c4d44e00) timed out, resetting card. | A vendor call should be raised. This problem has occurred around the same time as data corruption problems and seems to be related to cabling or enclosure problems. |
kernel: 3w-9xxx: scsi1: AEN: ERROR (0x04:0x0047): Battery voltage is too low<br>kernel: 3w-9xxx: scsi1: AEN: WARNING (0x04:0x0045): Battery voltage is low<br>kernel: 3w-9xxx: scsi1: AEN: INFO (0x04:0x0044): Battery voltage is normal<br>kernel: 3w-9xxx: scsi1: AEN: INFO (0x04:0x0056): Battery charging completed | The battery on the cards runs down and is re-charged automatically. On their own, these messages are normal. Under some circumstances, the card will perform this in loops, recharging every few minutes (rather than once a week or so as usual). If recharging occurs very often, a vendor call should be raised to replace the battery. |
kernel: 3w-9xxx: scsi0: AEN: INFO (0x04:0x0053): Battery capacity test is overdue<br>kernel: 3w-9xxx: scsi0: AEN: INFO (0x04:0x0051): Battery health check started<br>kernel: 3w-9xxx: scsi0: AEN: INFO (0x04:0x0052): Battery health check completed<br>kernel: 3w-9xxx: scsi0: AEN: INFO (0x04:0x0055): Battery charging started | As above |
kernel: 3w-9xxx: scsi1: AEN: WARNING (0x04:0x0036): Verify fixed data/parity mismatch:unit=0. | A start verify operation has caused the unit to be checked and it has found a problem. This has been corrected automatically and no further action is required unless the problem occurs repeatedly. If the problem occurs more than 5 times in an hour, a RAID_TW_DISK alarm is raised and a vendor call should be created. See here for 3ware explanation |
kernel: 3w-xxxx: scsi0: Command failed: status = 0xc7, flags = 0xd0, unit #7. | This message has been seen when a disk has completely failed. Normally other monitoring such as the RAID_TW or SMART_SELFTEST alarm will detect the problem as well. The error indicates a SMART test is failing since the disk does not respond to SMART requests. The failing disk within the unit should be replaced |
kernel: 3w-xxxx: scsi0: AEN drain failed, retrying. | The exact cause of this message is not known. The message has been seen when a disk is determined to have failed and is rebuilding. This message therefore does not guarantee a problem, but a full verify or mediascan is recommended to assess the exposure. |
kernel: 3w-9xxx: scsi2: AEN: ERROR (0x04:0x0026): Drive ECC error reported:port=7, unit=1. | A drive has reported an ECC-error and the disk should be replaced. This will generally lead to a RAID_TW alarm and the vendor call will follow from the standard procedure. |
kernel: 3w-9xxx: AEN: WARNING (0x04:0x0042): Primary DCB read error occurred:port=0, error=0x208. | The unit has completely failed. Data loss is likely. Vendor call required |
kernel: 3w-9xxx: scsi0: AEN: WARNING(0x04:0x0043): Backup DCB read error detected:port=9, error=0x1019. | Exact cause not known but current recommendation is a vendor call |
kernel: 3w-9xxx: scsi0: AEN: WARNING (0x04:0x002F): Verify not started; unit never initialized:RAID1 subunit=0. | This message occurs when an array is verified for the first time. It can be ignored. |
kernel: 3w-9xxx: scsi0: AEN: INFO (0x04:0x000C): Initialize started:unit=0. | This message occurs when an array is verified for the first time. It can be ignored. |
kernel: 3w-9xxx: scsi0: AEN: INFO (0x04:0x0007): Initialize completed:unit=0. | This message occurs when an array is verified for the first time. It can be ignored. |
kernel: 3w-9xxx: scsi1: WARNING: (0x06:0x000C): Character ioctl (0x108) timed out, resetting card. | This message has been seen when there was a bus problem on the machine. Raise a vendor call |
kernel: Call Trace: kernel: kernel: kernel: kernel: | There is a problem with the amount of DMA memory available. This was seen on 3ware 95XX cards with an old version of the firmware (3.04). Try a newer version of the firmware to see if this resolves the problem. |
Flash file system repaired:. | This message has been seen on a few machines usually followed by a controller reset. The root cause is not known but the controller reset justified a vendor call. Follow the procedure for the controller reset |
kernel: 3w-9xxx: scsi1: ERROR: (0x06:0x000C): PCI Parity Error: clearing. | Suspect a problem with the controller card. This has been seen on s0 series of machines. Raise a vendor call for a check of motherboard and potential controller replacement. |
kernel: 3w-9xxx: scsi0: ERROR: (0x06:0x0010): Microcontroller Error: clearing. | Suspect a problem with the controller card. This has been seen on e5 series machines along with an fsprobe corruption. Raise a vendor call for a check of motherboard and potential controller replacement. |
scsi0: AEN: WARNING (0x04:0x000F): SMART threshold exceeded:port=21. | A disk has shown higher than expected levels of SMART errors. It should be replaced. Raise a vendor call for a check of motherboard and potential controller replacement. |
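As an illustration of how the AEN lines in the table above could be split into fields before being looked up, here is a minimal Python sketch. The regular expressions and the function name parse_aen are assumptions made for this example, derived only from the sample messages quoted above; they are not part of the 3ware driver or of lemon. Lines without an AEN: tag (for example the plain ERROR: and Command failed lines) are deliberately not handled and return None.

```python
import re

# Pattern inferred from the sample lines in the table (an assumption, not a
# documented format). Covers both "3w-xxxx" style (no code) and "3w-9xxx"
# style (with a "(0x04:0x00NN)" code) messages.
AEN_RE = re.compile(
    r"(?P<driver>(?:3w-)?(?:9xxx|xxxx)):\s+"
    r"(?:(?P<host>scsi\d+):\s+)?"
    r"AEN:\s+(?P<severity>INFO|WARNING|ERROR)\s*"
    r"(?:\((?P<source>0x[0-9A-Fa-f]+):(?P<code>0x[0-9A-Fa-f]+)\))?"
    r":\s*(?P<text>.*)$"
)
NEW_PARAM_RE = re.compile(r"(\w+)=(\w+)")          # e.g. "unit=0, port=1"
OLD_PARAM_RE = re.compile(r"\b(Unit|Port) #(\d+)")  # e.g. "Port #0"


def parse_aen(line):
    """Return a dict of fields for a 3ware AEN syslog line, or None."""
    match = AEN_RE.search(line)
    if match is None:
        return None
    fields = match.groupdict()
    params = dict(NEW_PARAM_RE.findall(fields["text"]))
    params.update({k.lower(): v for k, v in OLD_PARAM_RE.findall(fields["text"])})
    fields["params"] = params
    return fields


if __name__ == "__main__":
    sample = ("kernel: 3w-9xxx: scsi0: AEN: ERROR (0x04:0x0002): "
              "Degraded unit detected:unit=0, port=1")
    print(parse_aen(sample))
```

Keeping the port and unit numbers as separate fields makes it easier to tell the vendor which disk a message refers to.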
tw_cli Errors
The following errors can occur from the tw_cli command.
Message | Action |
---|---|
Controller firmware too old for application NOT COMPATIBLE. Please upgrade firmware & driver. | The 3ware command line utilities need a newer version of the firmware. Download a later version of the firmware for the card. The card version can be obtained using grep -i 3ware /proc/pci |
Unable to allocate memory | The tw_cli command produces the message "Unable to allocate memory" when it is run under high load conditions. The root cause is not understood and no references are available in the 3ware FAQ. However, the message is transient so the lemon sensor ignores the problem and does not raise an operator alarm |
In the output of tw_cli /cX show, the device status can be reported as follows.
Type | Status | Description |
---|---|---|
Unit | INOPERABLE | The RAID array has lost too many disks and is now unreadable. The disks which are not present in the unit should be replaced as part of a vendor call and the RAID rebuilt |
Unit | DEGRADED | The RAID array has lost one of the disks but is still working. A mirror broken alarm should have been generated. In the event of a further disk failing, the RAID will go INOPERABLE. Follow DiskWinTwMirrorRecover to get back to OK |
Unit | INIT_PAUSED, INITIALIZING | A new array is being set up. This state is temporary and will change to OK when the array has completed rebuilding |
Unit | NOT SUPPORTED | Status NOT SUPPORTED has been seen but is not understood. Raise a vendor call to get the disk replaced. |
Port | DEVICE-ERROR | A disk has shown a large number of errors but bad block relocation has been performed so that it is still functioning but at a reduced capacity. This problem has also been seen when a disk has been restarted due to a connection problem with the controller (such as not responding to a command). A vendor call should be placed to replace the disk before it causes the unit to DEGRADE. The problem can be worked around by exporting the disk, rescanning the controller and then defining the disk as spare. However, it is likely to recur. |
Port | SMART-ERROR | Cause currently unknown but it is suspected that this is reported if the smartd self test has failed on the drive. Open a vendor call. |
Port | ECC-ERROR | Sysadmins should open a vendor call. If the vendor is unable to repair the problem, TSI can investigate further. This may indicate a parity disk error (ECC problem from RAID5). In that case the RAID is corrupted (integrity data damaged): back up the data from Controller2, delete the RAID5 unit, and recreate the unit after removing the disk. |
tw_cli show diag Errors
The tw_cli show diag tool is primarily intended for use by the 3ware technical support team. However, it does allow access to some lower level log information compared to that provided by tw_cli show alarms or from /var/log/messages.
The following messages have been seen in the diag logs. Note that it is a binary log and contains many non-readable characters. The time stamps are also not clear.
Message | Description | Action |
---|---|---|
E=0200 I=0092EDE4 T=02:47:45 : Cable CRC error | | Raise a vendor call to request investigation |
E=0208 I=00926044 T=05:50:29 P=2 : Sata bridge reset | A drive has timed out and so the controller resets the drive. This has been known to occur when fsprobe reports corruptions. | Raise a vendor call to request investigation. The port of the disk is at the P= value. |
E=0207 I=009286DC T=11:40:53 P=0 : Reset failed | | |
E=0208 I=009286DC T=11:40:48 P=0 : Soft reset drive | | |
E=0231 I=0092331C T=12:45:17 : Next image buffer expected | Problem not understood. Seen on tt_06_8 machines when there was an fsprobe corruption. | |
E=0212 I=00925A5C T=18:32:01 : PCI parity error | Probable controller failure. | Raise a vendor call. The controller is shown by the /cN value in /var/log/messages |
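Because the diag log is binary and mixes readable entries with non-readable characters, it can help to pull out only the entries that look like the rows above. The following Python sketch does that with a regular expression inferred from the sample rows. The pattern and the function name parse_diag_line are assumptions for illustration only. The meaning of the E= and I= fields is not documented here, so they are kept as opaque strings.

```python
import re

# Pattern inferred from the sample diag rows above (an assumption, not a
# documented 3ware format). P= is optional; the message is limited to
# printable ASCII so surrounding binary debris is ignored.
DIAG_RE = re.compile(
    r"E=(?P<E>[0-9A-Fa-f]{4})\s+"
    r"I=(?P<I>[0-9A-Fa-f]{8})\s+"
    r"T=(?P<T>\d\d:\d\d:\d\d)"
    r"(?:\s+P=(?P<P>\d+))?"
    r"\s*:\s*(?P<message>[\x20-\x7E]+)"
)


def parse_diag_line(line):
    """Return the E/I/T/P fields and message of a diag entry, or None."""
    match = DIAG_RE.search(line)
    if match is None:
        return None
    record = match.groupdict()
    # P= is the port of the affected disk when present.
    record["P"] = int(record["P"]) if record["P"] is not None else None
    record["message"] = record["message"].strip()
    return record


if __name__ == "__main__":
    print(parse_diag_line("E=0208 I=00926044 T=05:50:29 P=2 : Sata bridge reset"))
```

When reading a saved diag dump from a file, decoding it with errors="replace" before splitting into lines avoids failures on the non-readable bytes.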
Lemon Errors
RAID_TW
The RAID_TW alarms are the 3ware RAID array errors logged according to the conventions in DiskRefRaidLemonSensor. These indicate problems with the 3ware disks and controllers. Problems should be handled by the system administration team, so service managers with class E support can forward such cases to that team.
The detailed status of the machine can be obtained using
# lemon-cli -m ChkRaidTw
Other 3ware related alarms are also generated such as RAID_TW_CTLR.
Using the error number, determine the action to be performed from the table below.
RAID_TW_CTLR
The RAID_TW_CTLR lemon alarm is generated when there are serious controller or disk problems identified in the /var/log/messages logs. These messages can be found by running grep 3w- /var/log/messages.
For example,
# grep 3w- /var/log/messages | grep scsi
Jan 9 12:49:40 lxfs6004 kernel: 3w-xxxx: scsi1: AEN: ERROR: Drive ECC error detected: Port #0.
The messages should be checked against those listed in the /var/log/messages section and the actions followed.
On completion of the vendor intervention, the /var/log/messages file should be rotated (logrotate --force /etc/logrotate.d/syslog) or the machine rebooted in order to clear the error message.
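The /var/log/messages table states which messages raise RAID_TW_CTLR and which raise RAID_TW_DISK. As a rough sketch of how that mapping could be applied to the grep output, the Python below groups matching lines by alarm name. The dictionary only contains the rows that the table explicitly ties to an alarm. The function names are hypothetical, and this is not the actual lemon sensor implementation.

```python
# Message text -> lemon alarm, taken only from table rows that explicitly
# name the alarm. Hypothetical helper, not the real lemon sensor logic.
ALARM_BY_TEXT = {
    "Buffer ECC error corrected": "RAID_TW_CTLR",
    "Buffer integrity test failed": "RAID_TW_CTLR",
    "Cache synchronization failed": "RAID_TW_CTLR",
    "Drive ECC error detected": "RAID_TW_DISK",
}


def alarm_for(line):
    """Return the alarm name for one syslog line, or None if it raises none."""
    if "3w-" not in line:  # same filter as "grep 3w-"
        return None
    for text, alarm in ALARM_BY_TEXT.items():
        if text in line:
            return alarm
    return None


def alarms_in(lines):
    """Group the alarm-raising lines by alarm name."""
    found = {}
    for line in lines:
        alarm = alarm_for(line)
        if alarm is not None:
            found.setdefault(alarm, []).append(line.rstrip("\n"))
    return found


if __name__ == "__main__":
    with open("/var/log/messages", errors="replace") as log:
        for alarm, hits in alarms_in(log).items():
            print(alarm, len(hits))
```

Messages not in the dictionary still need to be checked by hand against the table, since many of them call for a vendor call without raising an alarm.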
RAID_TW_DISK
The RAID_TW_DISK lemon alarm is generated when there are serious disk problems identified in the /var/log/messages logs. These messages can be found by running grep 3w- /var/log/messages.
For example,
# grep 3w- /var/log/messages | grep ERROR
Jan 9 12:49:40 lxfs6004 kernel: 3w-xxxx: scsi1: AEN: ERROR: Drive ECC error detected: Port #0.
The messages should be checked against those listed in the /var/log/messages section and the actions followed.
On completion of the vendor intervention, the fmonagent should be restarted using ncm_wrapper.sh --co fmonagent to clear the error.
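The /var/log/messages table notes that a RAID_TW_DISK alarm is raised when "Verify fixed data/parity mismatch" occurs more than 5 times in an hour. A sliding-window count is one way to check this by hand, sketched below in Python. The function names are hypothetical, the lines are assumed to be in chronological order, and the year has to be supplied by the caller because syslog time stamps do not include one. How lemon itself counts the window boundary is not documented here, so the boundary handling is an assumption.

```python
from collections import deque
from datetime import datetime, timedelta

MISMATCH_TEXT = "Verify fixed data/parity mismatch"


def syslog_time(line, year):
    """Parse the leading 'Jan 9 12:49:40' style stamp of a syslog line."""
    month, day, clock = line.split(None, 3)[:3]
    return datetime.strptime(f"{year} {month} {day} {clock}", "%Y %b %d %H:%M:%S")


def exceeds_threshold(lines, year, limit=5, window=timedelta(hours=1)):
    """True if more than `limit` mismatch messages fall within `window`."""
    recent = deque()  # time stamps of mismatches inside the current window
    for line in lines:
        if MISMATCH_TEXT not in line:
            continue
        now = syslog_time(line, year)
        recent.append(now)
        # Drop events that are a full window or more older than this one.
        while now - recent[0] >= window:
            recent.popleft()
        if len(recent) > limit:
            return True
    return False


if __name__ == "__main__":
    with open("/var/log/messages", errors="replace") as log:
        print(exceeds_threshold(log, year=datetime.now().year))
```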
RPM Installation
Message | Action |
---|---|
[ERROR] No working tw_cli found | This occurs when the RPM for the 3ware command line tools cannot find a working CLI from all of the versions it tries. The solution is to upgrade the firmware on the cards since the version is too old to support the level required. |
Further Assistance
In the event of the problem not being listed above, please contact the disk service manager listed at TsiSection#Services.
Related Documents
Link | Description |
---|---|
3Ware Downloads | Download site for new drivers and release notes |
DiskRefTwLogMetric | How to define the log metrics for 3ware |