3Ware Controller Problem Determination Procedures (RAID card error descriptions)

Views 33463 Recommendations 0 2012.01.16 16:01:50

ADMINPLAY *.90.215.4 http://adminplay.com/LETC/74321

/var/log/messages

These messages can appear in the system syslog file. They are documented here to help distinguish real errors from false ones. Only the messages explicitly labeled as generating lemon events (e.g. RAID_TW_CTLR or RAID_TW_DISK) are reported to the operator. The RAID_TW lemon events defined here are obtained by running query commands rather than by looking at log history.

Message	Action
kernel: 3w-9xxx: scsi0: AEN: ERROR (0x04:0x0002): Degraded unit detected:unit=0, port=1	A degraded disk has been found as part of the RAID array. Follow DiskWinTwMirrorRecover
kernel: 3w-xxxx: scsi0: AEN: INFO: Verify started: Unit #0.	Message can be ignored. It indicates that the `tw_cli start verify` has been run.
kernel: 3w-xxxx: scsi0: AEN: INFO: Verify complete: Unit #0.	Message can be ignored. It indicates that the `tw_cli start verify` has been run and completed
kernel: 3w-9xxx: scsi1: ERROR: (0x03:0x010D): Invalid field in CDB:.	According to the article on the 3ware web site, this indicates a request for a status page which does not exist. This is not an error with the adapter or disk and the message can be ignored.
kernel: 3w-xxxx: scsi3: Command failed: status = 0xc4, flags = 0x43, unit #8.	A smartctl error-listing command such as `-l selftest` has been issued against a disk which does not exist. If no disk should be present in this port, this error can be ignored. Otherwise, run `lemon-host-check` and then follow the 3ware problem determination procedures.
9xxx: scsi0: AEN: WARNING (0x04:0x0023): Sector repair completed:port=5	A bad block has been found and re-located. This is not a serious problem and will occur from time to time. If there are a large number of these errors, the disk can be checked using `tw_cli start verify` and then replaced if more errors occur. Some corruption cases with fsprobe have been identified where these messages are also present. This is not conclusive at the moment (03/11/07). Symptoms where this message has been related to a data corruption have been further observed with the tt_07_1 models (25/01/08). Recommended action is to replace the drive at the port listed in the message (port 5 in this case)
kernel: 3w-xxxx: scsi0: AEN: WARNING: Sector repair occurred: Port #1.	A bad block has been found and re-located. This is not a serious problem and will occur from time to time. If there are a large number of these errors, the disk can be checked using `tw_cli start verify` and then replaced if more errors occur
9xxx: scsi1: AEN: WARNING (0x04:0x004B): Battery temperature is high	The card has detected battery temperature problems. Follow the procedure in DiskPrbTwBbuFault
3w-xxxx: scsi2: AEN: WARNING: Unclean shutdown detected: Unit #6.	This indicates the machine was powered-down without doing a clean shutdown. While this is not an error in itself and can be ignored, it may explain other errors such as a corrupted file system where the cache was not saved to disk
kernel: 3w-9xxx: scsi0: AEN: INFO (0x04:0x0053): <NULL>:.	This message is not understood but seems to be caused by a memory space problem on the machine. It does not indicate a problem with the controller and can be ignored.
Spare capacity too small for some units: spare unit=3, RAID unit=2	The spare disk is too small to be incorporated into the RAID. See DiskWinTwUnitDiskSizeFix
3w-9xxx: scsi0: AEN: ERROR (0x04:0x0057): Battery charging fault:.	A battery is failing. Run the DiskWinTwBbuTest procedure to check the battery and raise a vendor call if the messages recur.
kernel: 3w-9xxx: scsi1: AEN: WARNING (0x04:0x0039): Buffer ECC error corrected:address=0x3449700.	This is a serious error which merits a vendor call. This message will generate a RAID_TW_CTLR error to the operator and the controller should be replaced.
kernel: 3w-9xxx: scsi0: AEN: ERROR (0x04:0x0024): Buffer integrity test failed:error=0x3013.	This is suspected as being a serious error message which merits a vendor call. This message will generate a RAID_TW_CTLR error to the operator and the controller should be replaced.
kernel: 3w-xxxx: scsi1: AEN: ERROR: Drive ECC error detected: Port #0.	This error will generate a RAID_TW_DISK error. The disk at the specified port has failed, usually shown up by a scheduled verification or media scan. The disk should be replaced.
kernel: 3w-9xxx: scsi1: AEN: ERROR (0x04:0x005F): Cache synchronization failed; some data lost:unit=1.	This is suspected as being a serious error message which merits a vendor call. Some file system corruption may also occur. This message will generate a RAID_TW_CTLR error to the operator and the controller should be replaced.
kernel: 3w-9xxx: scsi2: ERROR: (0x03:0x0101): Invalid command opcode:opcode=0x4D.	This message occurs when SMART data is requested from a 'logical' disk rather than a physical one (such as /dev/sdb on a 3ware controller). Check the /etc/smartd.conf file and the CDB configuration.
kernel: 3w-xxxx: scsi1: Unit #0: Command (c4d44e00) timed out, resetting card.	A vendor call should be raised. This problem has occurred around the same time as data corruption problems and seems to be related to cabling or enclosure problems.
kernel: 3w-9xxx: scsi1: AEN: ERROR (0x04:0x0047): Battery voltage is too low kernel: 3w-9xxx: scsi1: AEN: WARNING (0x04:0x0045): Battery voltage is low kernel: 3w-9xxx: scsi1: AEN: INFO (0x04:0x0044): Battery voltage is normal kernel: 3w-9xxx: scsi1: AEN: INFO (0x04:0x0056): Battery charging completed	The battery on the card runs down and is recharged automatically. On their own, these messages are normal. Under some circumstances, the card loops through this cycle, recharging every few minutes rather than once a week or so as usual. If recharging occurs very often, a vendor call should be raised to replace the battery.
kernel: 3w-9xxx: scsi0: AEN: INFO (0x04:0x0053): Battery capacity test is overdue kernel: 3w-9xxx: scsi0: AEN: INFO (0x04:0x0051): Battery health check started kernel: 3w-9xxx: scsi0: AEN: INFO (0x04:0x0052): Battery health check completed kernel: 3w-9xxx: scsi0: AEN: INFO (0x04:0x0055): Battery charging started	As above
kernel: 3w-9xxx: scsi1: AEN: WARNING (0x04:0x0036): Verify fixed data/parity mismatch:unit=0.	A `start verify` operation has caused the unit to be checked and it has found a problem. This has been corrected automatically and no further action is required unless the problem occurs repeatedly. If the problem occurs more than 5 times in an hour, a RAID_TW_DISK alarm is raised and a vendor call should be created. See here for 3ware explanation
kernel: 3w-xxxx: scsi0: Command failed: status = 0xc7, flags = 0xd0, unit #7.	This message has been seen when a disk has completely failed. Normally other monitoring such as the RAID_TW or SMART_SELFTEST alarm will detect the problem as well. The error indicates a SMART test is failing since the disk does not respond to SMART requests. The failing disk within the unit should be replaced
kernel: 3w-xxxx: scsi0: AEN drain failed, retrying.	The exact cause of this message is not known. It has been seen when a disk is determined to have failed and is rebuilding. The message does not by itself guarantee a problem, so a full `verify` or `mediascan` is recommended to determine the extent of any problem.
kernel: 3w-9xxx: scsi2: AEN: ERROR (0x04:0x0026): Drive ECC error reported:port=7, unit=1.	A drive has reported an ECC-error and the disk should be replaced. This will generally lead to a RAID_TW alarm and the vendor call will follow from the standard procedure.
kernel: 3w-9xxx: AEN: WARNING (0x04:0x0042): Primary DCB read error occurred:port=0, error=0x208.	The unit has completely failed. Data loss is likely. Vendor call required
kernel: 3w-9xxx: scsi0: AEN: WARNING(0x04:0x0043): Backup DCB read error detected:port=9, error=0x1019.	Exact cause not known but current recommendation is a vendor call
kernel: 3w-9xxx: scsi0: AEN: WARNING (0x04:0x002F): Verify not started; unit never initialized:RAID1 subunit=0.	This message occurs when an array is verified for the first time. It can be ignored.
kernel: 3w-9xxx: scsi0: AEN: INFO (0x04:0x000C): Initialize started:unit=0.	This message occurs when an array is verified for the first time. It can be ignored.
kernel: 3w-9xxx: scsi0: AEN: INFO (0x04:0x0007): Initialize completed:unit=0.	This message occurs when an array is verified for the first time. It can be ignored.
kernel: 3w-9xxx: scsi1: WARNING: (0x06:0x000C): Character ioctl (0x108) timed out, resetting card.	This message has been seen when there was a bus problem on the machine. Raise a vendor call
kernel: Call Trace:{__alloc_pages+768} {dma_alloc_pages+125} kernel: {dma_alloc_coherent+97} {:3w_9xxx:twa_chrdev_ioctl+227} kernel: {do_page_fault+575} {autoremove_wake_function+0} kernel: {dput+56} {strncpy_from_user+74} kernel: {sys_ioctl+853} {system_call+126}	There is a problem with the amount of DMA memory available. This was seen on 3ware 95XX cards with an old version of the firmware (3.04). Try a newer version of the firmware to see if this resolves the problem.
Flash file system repaired:.	This message has been seen on a few machines usually followed by a controller reset. The root cause is not known but the controller reset justified a vendor call. Follow the procedure for the controller reset
kernel: 3w-9xxx: scsi1: ERROR: (0x06:0x000C): PCI Parity Error: clearing.	Suspect a problem with the controller card. This has been seen on s0 series of machines. Raise a vendor call for a check of motherboard and potential controller replacement.
kernel: 3w-9xxx: scsi0: ERROR: (0x06:0x0010): Microcontroller Error: clearing.	Suspect a problem with the controller card. This has been seen on e5 series machines along with an fsprobe corruption. Raise a vendor call for a check of motherboard and potential controller replacement.
scsi0: AEN: WARNING (0x04:0x000F): SMART threshold exceeded:port=21.	A disk has shown higher than expected levels of SMART errors. It should be replaced. Raise a vendor call for a check of motherboard and potential controller replacement.
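
As a rough illustration of how the table above could be applied when filtering syslog, the following shell sketch maps a single log line to the lemon alarm the table names for it. The function name `classify_3w_line` and the `ignore` and `review` labels are hypothetical and are not part of lemon; only the message strings and alarm names are taken from the table, and only a subset of rows is covered.

```shell
# classify_3w_line LINE
# Prints the lemon alarm the table associates with a syslog line,
# "ignore" for messages the table says can be ignored,
# or "review" for anything not covered by this sketch.
classify_3w_line() {
  case "$1" in
    *"Buffer ECC error corrected"*|*"Buffer integrity test failed"*|*"Cache synchronization failed"*)
      echo "RAID_TW_CTLR" ;;
    *"Drive ECC error detected"*)
      echo "RAID_TW_DISK" ;;
    *"Verify started"*|*"Verify complete"*|*"Invalid field in CDB"*|*"Initialize started"*|*"Initialize completed"*|*"unit never initialized"*)
      echo "ignore" ;;
    *)
      echo "review" ;;
  esac
}
```

It could be fed from the log with something like `grep 3w- /var/log/messages | while IFS= read -r line; do classify_3w_line "$line"; done`. Lines classified as `review` still need to be looked up in the table by hand.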

tw_cli Errors

The following errors can be reported by the `tw_cli` command

Message	Action
Controller firmware too old for application NOT COMPATIBLE. Please upgrade firmware & driver.	The 3ware command line utilities need a newer version of the firmware. Download a later version of the firmware for the card. The card version can be obtained using `grep -i 3ware /proc/pci`
Unable to allocate memory	The `tw_cli` command produces the message "Unable to allocate memory" when it is run under high load conditions. The root cause is not understood and no references are available in the 3ware FAQ. However, the message is transient so the lemon sensor ignores the problem and does not raise an operator alarm
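
A monitoring script could separate the transient error from the one that needs action along the following lines. This is only a sketch: the function name `tw_cli_error_kind` and its output labels are made up for illustration, and it assumes the captured `tw_cli` output is supplied on standard input.

```shell
# tw_cli_error_kind
# Reads captured tw_cli output on stdin and prints:
#   transient - "Unable to allocate memory" (seen under high load, ignored by the sensor)
#   firmware  - the firmware is too old for the command line utilities
#   other     - anything else
tw_cli_error_kind() {
  case "$(cat)" in
    *"Unable to allocate memory"*) echo "transient" ;;
    *"Controller firmware too old"*) echo "firmware" ;;
    *) echo "other" ;;
  esac
}
```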

In the output of `tw_cli /cX show`, the device status can be reported as follows

Type	Status	Description
Unit	INOPERABLE	The RAID array has lost too many disks and is now unreadable. The disks which are not present in the unit should be replaced as part of a vendor call and the RAID rebuilt.
Unit	DEGRADED	The RAID array has lost one of its disks but is still working. A mirror broken alarm should have been generated. If a further disk fails, the RAID will go INOPERABLE. Follow the DiskWinTwMirrorRecover procedure to get back to OK.
Unit	INIT_PAUSED INITIALIZING	A new array is being set up. This state is temporary and will change to `OK` when the array has completed rebuilding.
Unit	NOT SUPPORTED	Status NOT SUPPORTED has been seen but is not understood. Raise a vendor call to get the disk replaced.
Port	DEVICE-ERROR	A disk has shown a large number of errors but bad block relocation has been performed, so it is still functioning but at a reduced capacity. This problem has also been seen when a disk has been restarted due to a connection problem with the controller (such as not responding to a command). A vendor call should be placed to replace the disk before it causes the unit to DEGRADE. The problem can be worked around by exporting the disk, rescanning the controller and then defining the disk as a spare. However, it is likely to recur.
Port	SMART-ERROR	Cause currently unknown but it is suspected that this is reported if the smartd self test has failed on the drive. Open a vendor call.
Port	ECC-ERROR	Sysadmins should open a vendor call. If the vendor is unable to repair the problem, TSI can investigate further. This may indicate a parity disk error (an ECC problem from RAID5), meaning the RAID is corrupted (integrity data damaged). Back up the data from Controller2, delete the RAID5 unit, and recreate it after removing the disk.
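
The statuses above could be pulled out of saved `tw_cli /cX show` output with a small filter such as the one below. This is a sketch under assumptions: it supposes unit rows start with `uN` and carry the status in the third column, and port rows start with `pN` and carry it in the second column. That layout is not given in this document and may differ between `tw_cli` versions, so check it against real output first. The function name `non_ok_devices` is hypothetical. A status containing a space, such as `NOT SUPPORTED`, will be cut off at the first word.

```shell
# non_ok_devices
# Reads `tw_cli /cX show` output on stdin and prints "<device> <status>"
# for every unit (uN) or port (pN) whose status is not OK.
# Assumed layout: status is field 3 on unit rows and field 2 on port rows.
non_ok_devices() {
  awk '$1 ~ /^u[0-9]+$/ && $3 != "OK" { print $1, $3 }
       $1 ~ /^p[0-9]+$/ && $2 != "OK" { print $1, $2 }'
}
```

A typical use would be `tw_cli /c0 show | non_ok_devices`, with each reported status then looked up in the table above.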

tw_cli show diag Errors

The tw_cli show diag tool is primarily intended for use by the 3ware technical support team. However, it does allow access to some lower level log information compared to that provided by tw_cli show alarms or from /var/log/messages.

The following messages have been seen in the diag logs. Note that it is a binary log and contains many non-readable characters. The time stamps are also not clear.

Message	Description	Action
E=0200 I=0092EDE4 T=02:47:45 : Cable CRC error		Raise a vendor call to request investigation
E=0208 I=00926044 T=05:50:29 P=2 : Sata bridge reset	A drive has timed out and so the controller resets the drive. This has been known to occur when fsprobe reports corruptions.	Raise a vendor call to request investigation. The port of the disk is at the P= value.
E=0207 I=009286DC T=11:40:53 P=0 : Reset failed
E=0208 I=009286DC T=11:40:48 P=0 : Soft reset drive
E=0231 I=0092331C T=12:45:17 : Next image buffer expected	Problem not understood. Seen on tt_06_8 machines when there was an fsprobe corruption.
E=0212 I=00925A5C T=18:32:01 : PCI parity error	Probable controller failure.	Raise a vendor call. The controller is shown by the /cN value in /var/log/messages
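
Since the port to report in a vendor call comes from the `P=` value, it can help to split the readable diag records into fields. The sketch below assumes each record has already been isolated on its own line in exactly the form shown in the table; the function name `parse_diag_line` and the `|`-separated output are illustrative choices and not part of `tw_cli`.

```shell
# parse_diag_line
# Reads diag records on stdin and prints "code|time|port|message".
# The port field is empty when the record has no P= value.
# Lines that do not match the expected form are silently skipped.
parse_diag_line() {
  sed -nE 's/^E=([0-9A-Fa-f]+) I=[0-9A-Fa-f]+ T=([0-9:]+)( P=([0-9]+))? : (.*)$/\1|\2|\4|\5/p'
}
```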

Lemon Errors

RAID_TW

The RAID_TW alarms are the 3ware RAID array errors logged according to the conventions in DiskRefRaidLemonSensor.

These indicate problems with the 3ware disks and controllers. Problems should be handled by the system administration team, so service managers with class E support can forward cases to that team.

The detailed status of the machine can be obtained using

# lemon-cli -m ChkRaidTw

Other 3ware related alarms are also generated such as RAID_TW_CTLR.

Using the error number, determine the action to be performed from the table below.

Id	Description	Action
TWA011I	The command line tool has not been able to allocate memory. The root cause for this is not understood but is probably a bug with the driver. It does not indicate a problem with the sensor	No operator alarm is raised and no action required. This message is logged for information only.
TWA061W	The battery has detected high temperatures. The controller is still working but this may indicate a hardware problem such as fan or computer centre temperature. No alarm will be raised in this case for the operators	Check with the data centre staff on the temperature around the machine
TWA062W	The battery is being tested or is charging. The write cache is therefore disabled. No operator alarm will be raised.	No action required and no operator alarm should be raised
TWA103I	A RAID array is rebuilding. This does not automatically generate an error but may need to be tracked if it occurs often. No operator alarm raised.	None required.
TWA111I	Disk is being rebuilt. This occurs if a RAID is automatically being rebuilt and so the disk status is degraded while the rebuild is in progress.	No action required and no operator alarm generated.
TWA201E	The size of the spare disk is smaller than disks in a RAID-5 array on the controller. This often occurs in the 1st controller where a SPARE has replaced a member of the system disk RAID-1 array. The SPARE is generally the same size as the RAID-5 disk members and so will not be the same as the system disks (which are smaller).	DiskWinTwUnitDiskSizeFix
TWA210E	The `tw_cli` command is not installed	Contact the service manager for the machine and TSI to arrange additions to CDB hardware monitoring for the machine.
TWA211E	Error retrieving list of disks using `tw_cli`	Contact TSI for assistance. The log of the command is in `/var/log/messages`.
TWA221E	The write cache on the controller has failed. This may be due to a battery or high temperature problems.	Follow the DiskWinTwWriteCache procedure
TWA231E	The high temperature has caused the battery to be disabled.	Ask CC.support to check the temperature around the machine. If there is no temperature problem, raise a vendor call since the battery is mal-functioning.
TWA232E	The battery temperature status is unknown	Contact TSI for assistance
TWA233E	The battery has failed	Currently, it is recommended to run DiskPrbTwBbuFault to really check that the battery is dead. Once this test reports that it is not possible to do the battery test due to `NoBattery`, raise a vendor call
TWA234E	The battery status is unknown	Contact TSI for assistance
TWA235E	The battery is reporting FAULT as its status	This indicates a battery problem. Run the battery test procedure at DiskWinTwBbuTest and see if the failing to charge messages appear in /var/log/messages after the recharge. If so, raise a vendor call. If there are any repeated alerts after the recharge, raise a vendor call.
TWA243E	A unit has the NCQ qpolicy set to on	Some controller and disk combinations produce disk drop-outs when NCQ is set on. Since there is only a very small performance gain with NCQ on, all units should have it set off. To turn it off, use `tw_cli` as follows: `tw_cli /c0/u2 set qpolicy=off` (where /c0/u2 is the unit in the error report). The status can be checked with `tw_cli /c0/u2 show qpolicy`. The alarm will clear itself after 10 minutes when the sensor runs again.
TWA291E	No spare found for controller with RAID-5 array. This is generally because a spare disk was present and another disk failed. The spare was then automatically configured into the RAID.	Run DiskWinTwSpareRecover
TWA292E	A disk is not associated with a unit. This may be due to a new disk being added to the array or a spare having dropped out of the configuration.	Assign the disk as a spare using `tw_cli add type=spare`. If this operation fails (such as `Drive not ready`), raise a vendor call to replace the disk or check the controller card.
TWA300E	A disk has failed within the RAID	DiskWinTwMirrorRecover
TWA301E	Disk has status `DEVICE-ERROR`. It may still be working but it is about to fail.	Vendor call for the disk
TWA302E	Disk has status `ECC-ERROR`. It may still be working but it is about to fail.	DiskWinTwMirrorRecover
TWA303E	Disk has status `SMART-ERROR`. This shows the disk is reporting SMART problems and is about to fail.	Raise a vendor call
TWA304E	A disk is failing. It is in state `DEGRADED` so it will no longer be used for new data.	Run DiskWinTwMirrorRecover
TWA305E	A disk has timed out and reset the SATA bridge in the controller. This has caused data corruptions in the past.	Raise a vendor call to replace the disk identified in the P= line in the syslog for `lemon-tw`
TWA307E	A disk has performed a soft reset. This has caused data corruptions in the past. In some cases, this was identified as a firmware problem with the disk.	Raise a vendor call to replace the disk identified in the P= line in the syslog for `lemon-tw`
TWA308V	A data ECC error was reported to the host. This occurs when a sector repair has occurred for the disk. Check /var/log/messages for the sector repair line during the previous hour. This will indicate the port of the disk which has failed. Check the SMART logs for that disk.	Raise a vendor call to exchange the disk.
TWA314W	The firmware level on the disk is a known bad firmware level. A campaign to replace the disks has been performed in the past but sometimes the disks are replaced by one from the vendor's stock and these disks have not been upgraded. See DiskPrbFirmware for more information on the firmware levels.	Raise a vendor call and ask for either a firmware upgrade of the disk or a disk with a later firmware level.
TWA315W
TWA400E	The disk array is degraded.	Raise a vendor call DiskWinVendorCall
TWA401E	RAID unit status is unknown	Contact TSI for assistance
TWA600E	A disk has failed	Raise a vendor call (DiskWinVendorCall)
TWA621E	A disk status is unknown	Contact TSI for assistance
TWA622E	A disk is reporting zero size. This indicates that the disk is no longer responding	Follow DiskWinTwSpareRecover and raise a vendor call (DiskWinVendorCall) if this does not succeed.
TWA623E	A disk is reporting status NOT-SUPPORTED. The cause for this is not known but indicates a failing disk.	DiskWinVendorCall
TWA701E	A CRC problem has been reported by `tw_cli show diag`	Raise a vendor call to check the cabling and controller card
TWA801E	A RAID unit has a fatal status. It is likely that the unit is completely failed and that data will be lost due to multiple failures.	Raise a vendor call to replace the disks and contact service manager to inform them of the probable data loss.
TWA900E	Controller is not OK	Check if there is a `HIGH_LOAD` alarm, if so this may indicate a false alarm since the `tw_cli` command sometimes fails to find the controller under load. Otherwise, raise a vendor call (DiskWinVendorCall)
TWA901E	A controller parity error has been detected. This problem is frequently seen on the s0 set of disk servers and is believed to be a systematic problem with the motherboard so the alarm will not be raised for this hardware model	Raise a vendor call to review the motherboard and controller card
TWA911E	No controllers found	Contact the service manager for the machine to check if there should be 3ware controllers installed. If not, ask for the CDB hardware profile to be fixed. If so, raise a vendor call. The log for the execution of the `tw_cli info` command is in `/var/log/messages`.
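
For the entries whose action is a single named procedure, the lookup in the table above can be expressed as a small function. This is a convenience sketch only: the function name `twa_procedure` is hypothetical, it covers just the IDs listed in its `case` statement, and it returns a non-zero status for every other ID, for which the table must be consulted directly.

```shell
# twa_procedure ID
# Prints the procedure the table names for a RAID_TW error ID.
# Returns 1 (and prints nothing) for IDs not covered by this sketch.
twa_procedure() {
  case "$1" in
    TWA201E) echo "DiskWinTwUnitDiskSizeFix" ;;
    TWA221E) echo "DiskWinTwWriteCache" ;;
    TWA233E) echo "DiskPrbTwBbuFault" ;;
    TWA235E) echo "DiskWinTwBbuTest" ;;
    TWA291E) echo "DiskWinTwSpareRecover" ;;
    TWA300E|TWA302E|TWA304E) echo "DiskWinTwMirrorRecover" ;;
    TWA400E|TWA600E|TWA623E) echo "DiskWinVendorCall" ;;
    *) return 1 ;;
  esac
}
```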

RAID_TW_CTLR

The RAID_TW_CTLR lemon alarm is generated when there are serious controller or disk problems identified in the /var/log/messages logs. These messages can be found by running grep 3w- /var/log/messages.

For example,

# grep 3w- /var/log/messages | grep scsi
Jan  9 12:49:40 lxfs6004 kernel: 3w-xxxx: scsi1: AEN: ERROR: Drive ECC error detected: Port #0.

The messages should be checked against those listed in the /var/log/messages section and the actions followed.

On completion of the vendor intervention, the /var/log/messages file should be rotated (logrotate --force /etc/logrotate.d/syslog) or the machine rebooted in order to clear the error message.

RAID_TW_DISK

The RAID_TW_DISK lemon alarm is generated when there are serious disk problems identified in the /var/log/messages logs. These messages can be found by running grep 3w- /var/log/messages.
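
The /var/log/messages section notes that more than 5 "Verify fixed data/parity mismatch" messages in an hour raise a RAID_TW_DISK alarm. A rough way to check a log by hand is sketched below. It is an approximation, not the sensor's actual logic: it counts per calendar hour (month, day, hour taken from the syslog timestamp) rather than over a sliding 60-minute window, and the function name `mismatch_hours` is made up for this example.

```shell
# mismatch_hours
# Reads syslog lines on stdin and prints "<month> <day> <hour> <count>"
# for every calendar hour with more than 5
# "Verify fixed data/parity mismatch" messages.
mismatch_hours() {
  awk '/Verify fixed data\/parity mismatch/ {
         key = $1 " " $2 " " substr($3, 1, 2)   # e.g. "Jan 9 12"
         count[key]++
       }
       END { for (k in count) if (count[k] > 5) print k, count[k] }'
}
```

For example, `grep 3w- /var/log/messages | mismatch_hours` prints nothing when no hour crosses the threshold.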