/var/log/messages

These messages can appear in the system syslog file. They are documented here to help distinguish real errors from false ones. Only the messages explicitly labeled as generating lemon events (e.g. RAID_TW_CTLR or RAID_TW_DISK) will be reported to the operator. The RAID_TW lemon events defined here are obtained by running query commands rather than by looking at log history.

Message Action
kernel: 3w-9xxx: scsi0: AEN: ERROR (0x04:0x0002): Degraded unit detected:unit=0, port=1 A degraded disk has been found as part of the RAID array. Follow DiskWinTwMirrorRecover
kernel: 3w-xxxx: scsi0: AEN: INFO: Verify started: Unit #0. Message can be ignored. It indicates that the tw_cli start verify has been run.
kernel: 3w-xxxx: scsi0: AEN: INFO: Verify complete: Unit #0. Message can be ignored. It indicates that the tw_cli start verify has been run and completed.
kernel: 3w-9xxx: scsi1: ERROR: (0x03:0x010D): Invalid field in CDB:. According to the article on the 3ware web site, this indicates a request for a status page which does not exist. This is not an error with the adapter or disk and the message can be ignored.
kernel: 3w-xxxx: scsi3: Command failed: status = 0xc4, flags = 0x43, unit #8. A smartctl error listing command such as -l selftest has been issued against a disk which does not exist. If there should be no disk present in this port, this error can be ignored. Otherwise, follow the 3ware problem determination procedures after running lemon-host-check.
9xxx: scsi0: AEN: WARNING (0x04:0x0023): Sector repair completed:port=5 A bad block has been found and re-located. This is not a serious problem and will occur from time to time. If there are a large number of these errors, the disk can be checked using tw_cli start verify and then replaced if more errors occur.
Some corruption cases with fsprobe have been identified where these messages are also present. This is not conclusive at the moment (03/11/07). Symptoms where this message has been related to a data corruption have been further observed with the tt_07_1 models (25/01/08).
The recommended action is to replace the drive at the port listed in the message (port 5 in this case).
kernel: 3w-xxxx: scsi0: AEN: WARNING: Sector repair occurred: Port #1. A bad block has been found and re-located. This is not a serious problem and will occur from time to time. If there are a large number of these errors, the disk can be checked using tw_cli start verify and then replaced if more errors occur.
9xxx: scsi1: AEN: WARNING (0x04:0x004B): Battery temperature is high The card has detected battery temperature problems. Follow the procedure in DiskPrbTwBbuFault
3w-xxxx: scsi2: AEN: WARNING: Unclean shutdown detected: Unit #6. This indicates the machine was powered down without a clean shutdown. While this is not an error in itself and can be ignored, it may explain other errors such as a corrupted file system where the cache was not saved to disk.
kernel: 3w-9xxx: scsi0: AEN: INFO (0x04:0x0053): <NULL>:. This message is not understood but seems to be caused by a memory space problem on the machine. It does not indicate a problem with the controller and can be ignored.
Spare capacity too small for some units: spare unit=3, RAID unit=2 The spare disk is too small to be incorporated into the RAID. See DiskWinTwUnitDiskSizeFix
3w-9xxx: scsi0: AEN: ERROR (0x04:0x0057): Battery charging fault:. A battery is failing. Run the DiskWinTwBbuTest procedure to check the battery and raise a vendor call if the messages re-occur.
kernel: 3w-9xxx: scsi1: AEN: WARNING (0x04:0x0039): Buffer ECC error corrected:address=0x3449700. This is a serious error which merits a vendor call. This message will generate a RAID_TW_CTLR error to the operator and the controller should be replaced.
kernel: 3w-9xxx: scsi0: AEN: ERROR (0x04:0x0024): Buffer integrity test failed:error=0x3013. This is suspected as being a serious error message which merits a vendor call. This message will generate a RAID_TW_CTLR error to the operator and the controller should be replaced.
kernel: 3w-xxxx: scsi1: AEN: ERROR: Drive ECC error detected: Port #0. This error will generate a RAID_TW_DISK error. The disk at the specified port has failed, usually shown up by a scheduled verification or media scan. The disk should be replaced.
kernel: 3w-9xxx: scsi1: AEN: ERROR (0x04:0x005F): Cache synchronization failed; some data lost:unit=1. This is suspected as being a serious error message which merits a vendor call. Some file system corruption may also occur. This message will generate a RAID_TW_CTLR error to the operator and the controller should be replaced.
kernel: 3w-9xxx: scsi2: ERROR: (0x03:0x0101): Invalid command opcode:opcode=0x4D. This message occurs when SMART data is requested from a 'logical' disk rather than a physical one (such as /dev/sdb on a 3ware controller). Check the /etc/smartd.conf file and the CDB configuration.
kernel: 3w-xxxx: scsi1: Unit #0: Command (c4d44e00) timed out, resetting card. A vendor call should be raised. This problem has occurred around the same time as data corruption problems and seems to be related to cabling or enclosure problems.
kernel: 3w-9xxx: scsi1: AEN: ERROR (0x04:0x0047): Battery voltage is too low

kernel: 3w-9xxx: scsi1: AEN: WARNING (0x04:0x0045): Battery voltage is low

kernel: 3w-9xxx: scsi1: AEN: INFO (0x04:0x0044): Battery voltage is normal

kernel: 3w-9xxx: scsi1: AEN: INFO (0x04:0x0056): Battery charging completed
The battery on the cards runs down and is re-charged automatically. On their own, these messages are normal. Under some circumstances, the card will perform this in loops, recharging every few minutes (rather than once a week or so as usual). If recharging occurs very often, a vendor call should be raised to replace the battery.
kernel: 3w-9xxx: scsi0: AEN: INFO (0x04:0x0053): Battery capacity test is overdue

kernel: 3w-9xxx: scsi0: AEN: INFO (0x04:0x0051): Battery health check started

kernel: 3w-9xxx: scsi0: AEN: INFO (0x04:0x0052): Battery health check completed

kernel: 3w-9xxx: scsi0: AEN: INFO (0x04:0x0055): Battery charging started
As above
kernel: 3w-9xxx: scsi1: AEN: WARNING (0x04:0x0036): Verify fixed data/parity mismatch:unit=0. A start verify operation has caused the unit to be checked and it has found a problem. This has been corrected automatically and no further action is required unless the problem occurs repeatedly.
If the problem occurs more than 5 times in an hour, a RAID_TW_DISK alarm is raised and a vendor call should be created.
See here for 3ware explanation
kernel: 3w-xxxx: scsi0: Command failed: status = 0xc7, flags = 0xd0, unit #7. This message has been seen when a disk has completely failed. Normally other monitoring such as the RAID_TW or SMART_SELFTEST alarm will detect the problem as well. The error indicates a SMART test is failing since the disk does not respond to SMART requests. The failing disk within the unit should be replaced.
kernel: 3w-xxxx: scsi0: AEN drain failed, retrying. The exact cause of this message is not known. The message has been seen when a disk is determined to have failed and is rebuilding. This message does not by itself confirm a problem, so a full verify or mediascan is recommended to determine the extent of any problem.
kernel: 3w-9xxx: scsi2: AEN: ERROR (0x04:0x0026): Drive ECC error reported:port=7, unit=1. A drive has reported an ECC-error and the disk should be replaced. This will generally lead to a RAID_TW alarm and the vendor call will follow from the standard procedure.
kernel: 3w-9xxx: AEN: WARNING (0x04:0x0042): Primary DCB read error occurred:port=0, error=0x208. The unit has completely failed. Data loss is likely. A vendor call is required.
kernel: 3w-9xxx: scsi0: AEN: WARNING(0x04:0x0043): Backup DCB read error detected:port=9, error=0x1019. The exact cause is not known but the current recommendation is a vendor call.
kernel: 3w-9xxx: scsi0: AEN: WARNING (0x04:0x002F): Verify not started; unit never initialized:RAID1 subunit=0. This message occurs when an array is verified for the first time. It can be ignored.
kernel: 3w-9xxx: scsi0: AEN: INFO (0x04:0x000C): Initialize started:unit=0. This message occurs when an array is verified for the first time. It can be ignored.
kernel: 3w-9xxx: scsi0: AEN: INFO (0x04:0x0007): Initialize completed:unit=0. This message occurs when an array is verified for the first time. It can be ignored.
kernel: 3w-9xxx: scsi1: WARNING: (0x06:0x000C): Character ioctl (0x108) timed out, resetting card. This message has been seen when there was a bus problem on the machine. Raise a vendor call.
kernel: Call Trace:{__alloc_pages+768}
{dma_alloc_pages+125}
kernel: {dma_alloc_coherent+97}
{:3w_9xxx:twa_chrdev_ioctl+227}
kernel: {do_page_fault+575}
{autoremove_wake_function+0}
kernel: {dput+56} {strncpy_from_user+74}
kernel: {sys_ioctl+853} {system_call+126}
There is a problem with the amount of DMA memory available. This was seen on 3ware 95XX cards with an old version of the firmware (3.04). Try a newer version of the firmware to see if this resolves the problem.
Flash file system repaired:. This message has been seen on a few machines usually followed by a controller reset. The root cause is not known but the controller reset justified a vendor call. Follow the procedure for the controller reset
kernel: 3w-9xxx: scsi1: ERROR: (0x06:0x000C): PCI Parity Error: clearing. Suspect a problem with the controller card. This has been seen on s0 series of machines. Raise a vendor call for a check of motherboard and potential controller replacement.
kernel: 3w-9xxx: scsi0: ERROR: (0x06:0x0010): Microcontroller Error: clearing. Suspect a problem with the controller card. This has been seen on e5 series machines along with an fsprobe corruption. Raise a vendor call for a check of motherboard and potential controller replacement.
scsi0: AEN: WARNING (0x04:0x000F): SMART threshold exceeded:port=21. A disk has shown higher than expected levels of SMART errors. It should be replaced. Raise a vendor call for a check of motherboard and potential controller replacement.
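
The hourly threshold quoted above for the Verify fixed data/parity mismatch message (more than 5 occurrences in an hour) can be approximated directly from the log. The following is a minimal sketch and not the lemon sensor's actual logic: the function name count_verify_mismatch is hypothetical, and it counts per clock hour rather than over a sliding 60-minute window.

```shell
# count_verify_mismatch: hypothetical helper, not part of lemon or tw_cli.
# Reads syslog lines on stdin and prints "<month> <day> <hour> <count>"
# for every clock hour that contains more than 5
# "Verify fixed data/parity mismatch" messages from the 3w-9xxx driver.
count_verify_mismatch() {
    awk '
        /3w-9xxx/ && /Verify fixed data\/parity mismatch/ {
            # Syslog timestamp: $1 = month, $2 = day, $3 = HH:MM:SS
            hour = substr($3, 1, 2)
            key = $1 " " $2 " " hour
            count[key]++
        }
        END {
            for (key in count)
                if (count[key] > 5)
                    print key, count[key]
        }
    '
}
```

It would be run as count_verify_mismatch < /var/log/messages. Bucketing by clock hour keeps the sketch short but can miss a burst that straddles an hour boundary.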

tw_cli Errors

The following errors can be produced by the tw_cli command

Message Action
Controller firmware too old for application
NOT COMPATIBLE. Please upgrade firmware & driver.
The 3ware command line utilities need a newer version of the firmware. Download a later version of the firmware for the card. The card version can be obtained using grep -i 3ware /proc/pci
Unable to allocate memory The tw_cli command produces the message "Unable to allocate memory" when it is run under high load conditions. The root cause is not understood and no references are available in the 3ware FAQ. However, the message is transient, so the lemon sensor ignores the problem and does not raise an operator alarm.

In the output of tw_cli /cX show, the device status can be reported as follows

Type Status Description
Unit INOPERABLE The RAID array has lost too many disks and is now unreadable. The disks which are not present in the unit should be replaced as part of a vendor call and the RAID rebuilt.
Unit DEGRADED The RAID array has lost one of its disks but is still working. A mirror broken alarm should have been generated. If a further disk fails, the RAID will go INOPERABLE. Follow DiskWinTwMirrorRecover to get back to OK.
Unit INIT_PAUSED
INITIALIZING
A new array is being set up. This state is temporary and will change to OK when the array has finished initializing.
Unit NOT SUPPORTED
Status NOT SUPPORTED has been seen but is not understood. Raise a vendor call to get the disk replaced.
Port DEVICE-ERROR A disk has shown a large number of errors, but bad block relocation has been performed so that it is still functioning at a reduced capacity. This problem has also been seen when a disk has been restarted due to a connection problem with the controller (such as not responding to a command). A vendor call should be placed to replace the disk before it causes the unit to DEGRADE. The problem can be worked around by exporting the disk, rescanning the controller and then defining the disk as spare. However, it is likely to re-occur.
Port SMART-ERROR Cause currently unknown but it is suspected that this is reported if the smartd self test has failed on the drive. Open a vendor call.
Port ECC-ERROR Sysadmins should open a vendor call. If the vendor is unable to repair the problem, TSI can investigate further. This may indicate a parity disk error (ECC problem from RAID5), in which case the RAID is corrupted (integrity data damaged). Back up the data from Controller2, delete the RAID5 unit and recreate it after removing the disk.

tw_cli show diag Errors

The tw_cli show diag tool is primarily intended for use by the 3ware technical support team. However, it does allow access to some lower level log information compared to that provided by tw_cli show alarms or from /var/log/messages.

The following messages have been seen in the diag logs. Note that it is a binary log and contains many non-readable characters. The time stamps are also not clear.

Message Description Action
E=0200 I=0092EDE4 T=02:47:45 : Cable CRC error   Raise a vendor call to request investigation
E=0208 I=00926044 T=05:50:29 P=2 : Sata bridge reset A drive has timed out and so the controller resets the drive. This has been known to occur when fsprobe reports corruptions. Raise a vendor call to request investigation. The port of the disk is at the P= value.
E=0207 I=009286DC T=11:40:53 P=0 : Reset failed
E=0208 I=009286DC T=11:40:48 P=0 : Soft reset drive
E=0231 I=0092331C T=12:45:17 : Next image buffer expected Problem not understood. Seen on tt_06_8 machines when there was an fsprobe corruption.
E=0212 I=00925A5C T=18:32:01 : PCI parity error Probable controller failure. Raise a vendor call. The controller is shown by the /cN value in /var/log/messages
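
Since the port of a reset disk is carried in the P= field, the reset lines can be tallied per port. This is a rough sketch under several assumptions: the diag output has already been captured as readable text, each line has the shape shown in the table above, and E=0208 is the code of interest (both reset descriptions in the table carry it). The function name diag_reset_ports is hypothetical.

```shell
# diag_reset_ports: hypothetical helper, not a tw_cli feature.
# Reads captured "tw_cli show diag" text on stdin and prints, for each
# port seen on an E=0208 line, how many such lines mention it.
diag_reset_ports() {
    awk '
        /E=0208/ {
            # Look for the P=<port> token anywhere on the line
            for (i = 1; i <= NF; i++)
                if ($i ~ /^P=[0-9]+$/)
                    resets[substr($i, 3)]++
        }
        END {
            for (port in resets)
                print "port=" port, "resets=" resets[port]
        }
    ' | sort
}
```

A port that shows repeated resets is a candidate for the vendor call described in the table; a single reset on its own may deserve less weight.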

Lemon Errors

RAID_TW

The RAID_TW alarms are the 3ware RAID array errors logged according to the conventions in DiskRefRaidLemonSensor.

These indicate problems with the 3ware disks and controllers. They should be handled by the system administration team, so service managers with class E support can forward the cases to that team.

The detailed status of the machine can be obtained using

# lemon-cli -m ChkRaidTw

Other 3ware related alarms are also generated such as RAID_TW_CTLR.

Using the error number, determine the action to be performed from the table below.

Id Description Action
TWA011I The command line tool has not been able to allocate memory. The root cause for this is not understood but is probably a bug with the driver. It does not indicate a problem with the sensor No operator alarm is raised and no action required. This message is logged for information only.
TWA061W The battery has detected high temperatures. The controller is still working but this may indicate a hardware problem such as fan or computer centre temperature. No alarm will be raised in this case for the operators Check with the data centre staff on the temperature around the machine
TWA062W The battery is being tested or is charging. The write cache is therefore disabled. No operator alarm will be raised. No action required and no operator alarm should be raised
TWA103I A RAID array is rebuilding. This does not automatically generate an error but may need to be tracked if this occurs often. No operator alarm raised. None required.
TWA111I Disk is being rebuilt. This occurs if a RAID is automatically being rebuilt and so the disk status is degraded while the rebuild is in progress. No action required and no operator alarm generated.
TWA201E The size of the spare disk is smaller than disks in a RAID-5 array on the controller. This often occurs in the 1st controller where a SPARE has replaced a member of the system disk RAID-1 array. The SPARE is generally the same size as the RAID-5 disk members and so will not be the same as the system disks (which are smaller). DiskWinTwUnitDiskSizeFix
TWA210E The tw_cli command is not installed Contact the service manager for the machine and TSI to arrange additions to CDB hardware monitoring for the machine.
TWA211E Error retrieving list of disks using tw_cli Contact TSI for assistance. The log of the command is in /var/log/messages.
TWA221E The write cache on the controller has failed. This may be due to a battery or high temperature problems. Follow the DiskWinTwWriteCache procedure
TWA231E The high temperature has caused the battery to be disabled. Ask CC.support to check the temperature around the machine. If there is no temperature problem, raise a vendor call since the battery is malfunctioning.
TWA232E The battery temperature status is unknown Contact TSI for assistance
TWA233E The battery has failed Currently, it is recommended to run DiskPrbTwBbuFault to really check that the battery is dead. Once this test reports that it is not possible to do the battery test due to NoBattery, raise a vendor call
TWA234E The battery status is unknown Contact TSI for assistance
TWA235E The battery is reporting FAULT as its status This indicates a battery problem. Run the battery test procedure at DiskWinTwBbuTest and see if the failing to charge messages appear in /var/log/messages after the recharge. If so, raise a vendor call. If there are any repeated alerts after the recharge, raise a vendor call.
TWA243E A unit has the NCQ qpolicy set to on Some controller/disk combinations produce disk drop-outs when NCQ is on. Since NCQ gives only a very small performance gain, all units should have it set off. To turn it off, use tw_cli as follows: tw_cli /c0/u2 set qpolicy=off (where /c0/u2 is the unit in the error report). The status can be checked with tw_cli /c0/u2 show qpolicy. The alarm will clear itself after 10 minutes when the sensor runs again.
TWA291E No spare found for controller with RAID-5 array. This is generally because a spare disk was present and another disk failed. The spare was then automatically configured into the RAID. Run DiskWinTwSpareRecover
TWA292E A disk is not associated with a unit. This may be due to a new disk being added to the array or a spare having dropped out of the configuration. Assign the disk as a spare using tw_cli add type=spare. If this operation fails (such as Drive not ready), raise a vendor call to replace the disk or check the controller card.
TWA300E A disk has failed within the RAID DiskWinTwMirrorRecover
TWA301E Disk has status DEVICE-ERROR. It may still be working but it is about to fail. Vendor call for the disk
TWA302E Disk has status ECC-ERROR. It may still be working but it is about to fail. DiskWinTwMirrorRecover
TWA303E Disk has status SMART-ERROR. This shows the disk is reporting SMART problems and is about to fail. Raise a vendor call
TWA304E A disk is failing. It is in state DEGRADED so it will no longer be used for new data. Run DiskWinTwMirrorRecover
TWA305E A disk has timed out and reset the SATA bridge in the controller. This has caused data corruptions in the past. Raise a vendor call to replace the disk identified in the P= line in the syslog for lemon-tw
TWA307E A disk has performed a soft reset. This has caused data corruptions in the past. In some cases, this was identified as a firmware problem with the disk. Raise a vendor call to replace the disk identified in the P= line in the syslog for lemon-tw
TWA308V A data ECC error was reported to the host. This occurs when a sector repair has occurred for the disk. Check /var/log/messages for the sector repair line during the previous hour. This will indicate the port of the disk which has failed. Check the SMART logs for that disk and raise a vendor call for a disk exchange.
TWA314W The firmware level on the disk is a known bad firmware level. A campaign to replace the disks has been performed in the past but sometimes the disks are replaced by one from the vendor's stock and these disks have not been upgraded. See DiskPrbFirmware for more information on the firmware levels. Raise a vendor call and ask for either a firmware upgrade of the disk or a disk with a later firmware level.
TWA315W
TWA400E The disk array is degraded. Raise a vendor call DiskWinVendorCall
TWA401E RAID unit status is unknown Contact TSI for assistance
TWA600E A disk has failed Raise a vendor call (DiskWinVendorCall)
TWA621E A disk status is unknown Contact TSI for assistance
TWA622E A disk is reporting zero size. This indicates that the disk is no longer responding Follow DiskWinTwSpareRecover and raise a vendor call (DiskWinVendorCall) if this does not succeed.
TWA623E A disk is reporting status NOT-SUPPORTED. The cause for this is not known but indicates a failing disk.
DiskWinVendorCall
TWA701E A CRC problem has been reported by tw_cli show diag Raise a vendor call to check the cabling and controller card
TWA801E A RAID unit has a fatal status. It is likely that the unit is completely failed and that data will be lost due to multiple failures. Raise a vendor call to replace the disks and contact service manager to inform them of the probable data loss.
TWA900E Controller is not OK Check if there is a HIGH_LOAD alarm. If so, this may be a false alarm, since the tw_cli command sometimes fails to find the controller under load. Otherwise, raise a vendor call (DiskWinVendorCall)
TWA901E A controller parity error has been detected.
This problem is frequently seen on the s0 set of disk servers and is believed to be a systematic problem with the motherboard so the alarm will not be raised for this hardware model
Raise a vendor call to review the motherboard and controller card
TWA911E No controllers found Contact the service manager for the machine to check if there should be 3ware controllers installed. If not, ask for the CDB hardware profile to be fixed. If so, raise a vendor call. The log for the execution of the tw_cli info command is in /var/log/messages.
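
For TWA243E, the two tw_cli commands quoted in the action column can be generated for whichever unit the error report names. The sketch below only prints the commands so they can be reviewed before being run by hand. The function name qpolicy_off_cmds is hypothetical and the path check is deliberately loose.

```shell
# qpolicy_off_cmds: hypothetical helper; it prints the tw_cli commands
# from the TWA243E entry for one unit and does NOT execute them.
qpolicy_off_cmds() {
    unit=$1
    case $unit in
        # Loose check that the argument looks like /cN/uN
        /c[0-9]*/u[0-9]*) ;;
        *)
            echo "usage: qpolicy_off_cmds /cN/uN" >&2
            return 1
            ;;
    esac
    echo "tw_cli $unit set qpolicy=off"
    echo "tw_cli $unit show qpolicy"
}
```

For example, qpolicy_off_cmds /c0/u2 prints the set command followed by the show command for that unit.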

RAID_TW_CTLR

The RAID_TW_CTLR lemon alarm is generated when there are serious controller or disk problems identified in the /var/log/messages logs. These messages can be found by running grep 3w- /var/log/messages.

For example,

# grep 3w- /var/log/messages | grep scsi
Jan  9 12:49:40 lxfs6004 kernel: 3w-xxxx: scsi1: AEN: ERROR: Drive ECC error detected: Port #0.

The messages should be checked against those listed in the /var/log/messages section and the actions followed.

On completion of the vendor intervention, the /var/log/messages file should be rotated (logrotate --force /etc/logrotate.d/syslog) or the machine rebooted in order to clear the error message.
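
To narrow the grep output to the messages that the /var/log/messages section describes as generating RAID_TW_CTLR, a second filter can be chained on. This is a sketch only: the three patterns are copied from that table, the real sensor's matching rules are not documented here and may differ, and the function name ctlr_candidates is hypothetical.

```shell
# ctlr_candidates: hypothetical helper, not the lemon sensor itself.
# Reads syslog lines on stdin and keeps only 3ware driver lines whose
# text matches a message documented as raising RAID_TW_CTLR.
ctlr_candidates() {
    grep '3w-' |
        grep -E 'Buffer ECC error corrected|Buffer integrity test failed|Cache synchronization failed'
}
```

It would be run as ctlr_candidates < /var/log/messages. No output means none of the three documented messages are present in the current log file.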

RAID_TW_DISK

The RAID_TW_DISK lemon alarm is generated when there are serious disk problems identified in the /var/log/messages logs. These messages can be found by running grep 3w- /var/log/messages.

For example,

# grep 3w- /var/log/messages | grep ERROR
Jan  9 12:49:40 lxfs6004 kernel: 3w-xxxx: scsi1: AEN: ERROR: Drive ECC error detected: Port #0.

The messages should be checked against those listed in the /var/log/messages section and the actions followed.

On completion of the vendor intervention, the fmonagent should be restarted using ncm_wrapper.sh --co fmonagent to clear the error.

RPM Installation

Message Action
[ERROR] No working tw_cli found This occurs when the RPM for the 3ware command line tools cannot find a working CLI from all of the versions it tries. The solution is to upgrade the firmware on the cards since the version is too old to support the level required.

Further Assistance

In the event of the problem not being listed above, please contact the disk service manager listed at TsiSection#Services.

Related Documents

Link Description
3Ware Downloads Download site for new drivers and release notes
DiskRefTwLogMetric How to define the log metrics for 3ware
