Hide Forgot
Description of problem: While running kernel testing an AMD host was selected. This system gets into a bad state while running several of the kernel tier tests. Version-Release number of selected component (if applicable): 2.6.18-273.el5 x86_64 How reproducible: Often with x86_64 Actual results: ata1.00: exception Emask 0x0 SAct 0x1 SErr 0x0 action 0x0 ata1.00: irq_stat 0x40000008 ata1.00: cmd 60/08:00:55:31:3b/00:00:16:00:00/40 tag 0 ncq 4096 in res 41/40:08:56:31:3b/61:00:16:00:00/40 Emask 0x409 (media error) <F> ata1.00: status: { DRDY ERR } ata1.00: error: { UNC } SCSI device sda: 488397168 512-byte hdwr sectors (250059 MB) sda: Write Protect is off SCSI device sda: drive cache: write back ata1.00: exception Emask 0x0 SAct 0xf SErr 0x0 action 0x0 ata1.00: irq_stat 0x40000008 ata1.00: cmd 60/08:00:55:31:3b/00:00:16:00:00/40 tag 0 ncq 4096 in res 41/40:08:56:31:3b/61:00:16:00:00/40 Emask 0x409 (media error) <F> ata1.00: status: { DRDY ERR } ata1.00: error: { UNC } SCSI device sda: 488397168 512-byte hdwr sectors (250059 MB) sda: Write Protect is off SCSI device sda: drive cache: write back ata1.00: exception Emask 0x0 SAct 0x18 SErr 0x0 action 0x0 ata1.00: irq_stat 0x40000008 ata1.00: cmd 60/08:18:55:31:3b/00:00:16:00:00/40 tag 3 ncq 4096 in res 41/40:08:56:31:3b/61:00:16:00:00/40 Emask 0x409 (media error) <F> ata1.00: status: { DRDY ERR } ata1.00: error: { UNC } SCSI device sda: 488397168 512-byte hdwr sectors (250059 MB) sda: Write Protect is off SCSI device sda: drive cache: write back ata1.00: exception Emask 0x0 SAct 0x6 SErr 0x0 action 0x0 ata1.00: irq_stat 0x40000008 ata1.00: cmd 60/08:08:55:31:3b/00:00:16:00:00/40 tag 1 ncq 4096 in res 41/40:08:56:31:3b/61:00:16:00:00/40 Emask 0x409 (media error) <F> ata1.00: status: { DRDY ERR } ata1.00: error: { UNC } SCSI device sda: 488397168 512-byte hdwr sectors (250059 MB) sda: Write Protect is off SCSI device sda: drive cache: write back ata1.00: exception Emask 0x0 SAct 0xe SErr 0x0 action 0x0 ata1.00: irq_stat 0x40000008 ata1.00: cmd 60/08:08:55:31:3b/00:00:16:00:00/40 tag 1 ncq 4096 in res 41/40:08:56:31:3b/61:00:16:00:00/40 Emask 0x409 (media error) <F> ata1.00: status: { DRDY ERR } ata1.00: error: { UNC } SCSI device sda: 488397168 512-byte hdwr sectors (250059 MB) sda: Write Protect is off SCSI device sda: drive cache: write back ata1.00: exception Emask 0x0 SAct 0x1 SErr 0x0 action 0x0 ata1.00: irq_stat 0x40000008 ata1.00: cmd 60/08:00:55:31:cb/00:00:15:00:00/40 tag 0 ncq 4096 in res 41/40:08:57:31:cb/61:00:15:00:00/40 Emask 0x409 (media error) <F> ata1.00: status: { DRDY ERR } ata1.00: error: { UNC } SCSI device sda: 488397168 512-byte hdwr sectors (250059 MB) sda: Write Protect is off SCSI device sda: drive cache: write back ata1.00: exception Emask 0x0 SAct 0x7 SErr 0x0 action 0x0 ata1.00: irq_stat 0x40000008 ata1.00: cmd 60/08:00:55:31:cb/00:00:15:00:00/40 tag 0 ncq 4096 in res 41/40:08:57:31:cb/61:00:15:00:00/40 Emask 0x409 (media error) <F> ata1.00: status: { DRDY ERR } ata1.00: error: { UNC } SCSI device sda: 488397168 512-byte hdwr sectors (250059 MB) sda: Write Protect is off SCSI device sda: drive cache: write back ata1.00: exception Emask 0x0 SAct 0x7fffffff SErr 0x0 action 0x0 ata1.00: irq_stat 0x40000008 ata1.00: cmd 60/08:10:55:31:cb/00:00:15:00:00/40 tag 2 ncq 4096 in res 41/40:08:57:31:cb/61:00:15:00:00/40 Emask 0x409 (media error) <F> ata1.00: status: { DRDY ERR } ata1.00: error: { UNC } SCSI device sda: 488397168 512-byte hdwr sectors (250059 MB) ata1.00: exception Emask 0x0 SAct 0x7fffffff SErr 0x0 action 0x0 ata1.00: irq_stat 0x40000008 ata1.00: cmd 60/08:e0:55:31:cb/00:00:15:00:00/40 tag 28 ncq 4096 in res 41/40:08:57:31:cb/00:00:15:00:00/40 Emask 0x409 (media error) <F> ata1.00: status: { DRDY ERR } ata1.00: error: { UNC } sda: Write Protect is off SCSI device sda: drive cache: write back SCSI device sda: 488397168 512-byte hdwr sectors (250059 MB) sda: Write Protect is off SCSI device sda: drive cache: write back ata1.00: exception Emask 0x0 SAct 0x4 SErr 0x0 action 0x0 ata1.00: irq_stat 0x40000008 ata1.00: cmd 60/08:10:55:31:cb/00:00:15:00:00/40 tag 2 ncq 4096 in res 41/40:08:57:31:cb/00:00:15:00:00/40 Emask 0x409 (media error) <F> ata1.00: status: { DRDY ERR } ata1.00: error: { UNC } SCSI device sda: 488397168 512-byte hdwr sectors (250059 MB) sda: Write Protect is off SCSI device sda: drive cache: write back ata1.00: exception Emask 0x0 SAct 0x1 SErr 0x0 action 0x0 ata1.00: irq_stat 0x40000008 ata1.00: cmd 60/08:00:55:31:cb/00:00:15:00:00/40 tag 0 ncq 4096 in res 41/40:08:57:31:cb/61:00:15:00:00/40 Emask 0x409 (media error) <F> ata1.00: status: { DRDY ERR } ata1.00: error: { UNC } end_request: I/O error, dev sda, sector 365637975 [-- MARK -- Tue Jul 5 13:00:00 2011] EXT3-fs error (device dm-0): read_inode_bitmap: Cannot read inode bitmap - block_group = 1394, inode_bitmap = 45678593<5>SCSI device sda: 488397168 512-byte hdwr sectors (250059 MB) sda: Write Protect is off SCSI device sda: drive cache: write back Aborting journal on device dm-0. EXT3-fs error (device dm-0) in ext3_new_inode: IO failure EXT3-fs error (device dm-0) in ext3_create: IO failure ext3_abort called. EXT3-fs error (device dm-0): ext3_journal_start_sb: Detected aborted journal Remounting filesystem read-only __journal_remove_journal_head: freeing b_committed_data __journal_remove_journal_head: freeing b_committed_data
nb: This is a storage error, not a filesystem error. The storage blew up: > ata1.00: cmd 60/08:00:55:31:cb/00:00:15:00:00/40 tag 0 ncq 4096 in res 41/40:08:57:31:cb/61:00:15:00:00/40 Emask 0x409 (media error) <F> > ata1.00: status: { DRDY ERR } > ata1.00: error: { UNC } IO failed: > end_request: I/O error, dev sda, sector 365637975 ext3 noticed: > EXT3-fs error (device dm-0): read_inode_bitmap: Cannot read inode bitmap - and shut down as it should have: > Aborting journal on device dm-0. -Eric
May just be a hardware problem with that host.
Maybe re-run the i386 one and see if blows up in the same place this time? The x86_64 runs seem to have been later..
I'll log in and look at SMART data as a start, ok?
(In reply to comment #4) > Maybe re-run the i386 one and see if blows up in the same place this time? The > x86_64 runs seem to have been later.. A second i386 Beaker job was successful: https://beaker.engineering.redhat.com/jobs/107543 Thanks, Jeff
If I can get a few minutes on the machine, I'd run smartctl to query the drive and see if it is reporting media failures...
(In reply to comment #7) > If I can get a few minutes on the machine, I'd run smartctl to query the drive > and see if it is reporting media failures... Eric, I just reinstalled the system with RHEL5.7-Server-20110630.0 x86_64. I was setting it up for you to grab the data you need. But once the installation finished. On first boot it put the system back into "read only" mode This system just finished running a complete kerneltier1 test suite with i386. Once I reinstalled it to be x86_64 I hit the issue again. ----<snip>---- ata1.00: irq_stat 0x40000008 ata1.00: cmd 60/08:18:5d:31:07/00:00:16:00:00/40 tag 3 ncq 4096 in res 41/40:08:61:31:07/c3:00:16:00:00/40 Emask 0x409 (media error) <F> ata1.00: status: { DRDY ERR } ata1.00: error: { UNC } SCSI device sda: 488397168 512-byte hdwr sectors (250059 MB) ata1.00: exception Emask 0x0 SAct 0x7fffffff SErr 0x0 action 0x0 ata1.00: irq_stat 0x40000008 ata1.00: cmd 60/08:d8:5d:31:07/00:00:16:00:00/40 tag 27 ncq 4096 in res 41/40:08:61:31:07/00:00:16:00:00/40 Emask 0x409 (media error) <F> ata1.00: status: { DRDY ERR } ata1.00: error: { UNC } sda: Write Protect is off SCSI device sda: drive cache: write back SCSI device sda: 488397168 512-byte hdwr sectors (250059 MB) sda: Write Protect is off SCSI device sda: drive cache: write back ata1.00: exception Emask 0x0 SAct 0xb SErr 0x0 action 0x0 ata1.00: irq_stat 0x40000008 ata1.00: cmd 60/08:18:5d:31:07/00:00:16:00:00/40 tag 3 ncq 4096 in res 41/40:08:61:31:07/00:00:16:00:00/40 Emask 0x409 (media error) <F> ata1.00: status: { DRDY ERR } ata1.00: error: { UNC } SCSI device sda: 488397168 512-byte hdwr sectors (250059 MB) sda: Write Protect is off SCSI device sda: drive cache: write back ata1.00: exception Emask 0x0 SAct 0x3ffff SErr 0x0 action 0x0 ata1.00: irq_stat 0x40000008 ata1.00: cmd 60/08:00:5d:31:07/00:00:16:00:00/40 tag 0 ncq 4096 in res 41/40:08:61:31:07/00:00:16:00:00/40 Emask 0x409 (media error) <F> ata1.00: status: { DRDY ERR } ata1.00: error: { UNC } SCSI device sda: 488397168 512-byte hdwr sectors (250059 MB) sda: Write Protect is off SCSI device sda: drive cache: write back ata1.00: exception Emask 0x0 SAct 0x2c001 SErr 0x0 action 0x0 ata1.00: irq_stat 0x40000008 ata1.00: cmd 60/08:88:5d:31:07/00:00:16:00:00/40 tag 17 ncq 4096 in res 41/40:08:61:31:07/00:00:16:00:00/40 Emask 0x409 (media error) <F> ata1.00: status: { DRDY ERR } ata1.00: error: { UNC } SCSI device sda: 488397168 512-byte hdwr sectors (250059 MB) sda: Write Protect is off SCSI device sda: drive cache: write back ata1.00: exception Emask 0x0 SAct 0xf SErr 0x0 action 0x0 ata1.00: irq_stat 0x40000008 ata1.00: cmd 60/08:00:5d:31:07/00:00:16:00:00/40 tag 0 ncq 4096 in res 41/40:08:61:31:07/00:00:16:00:00/40 Emask 0x409 (media error) <F> ata1.00: status: { DRDY ERR } ata1.00: error: { UNC } end_request: I/O error, dev sda, sector 369570145 EXT3-fs error (device dm-0): ext3_get_inode_loc: <6>ata1: EH complete SCSI device sda: 488397168 512-byte hdwr sectors (250059 MB) sda: Write Protect is off SCSI device sda: drive cache: write back unable to read inode block - inode=46170113, block=46170114 Aborting journal on device dm-0. journal commit I/O error ext3_abort called. EXT3-fs error (device dm-0): ext3_journal_start_sb: Detected aborted journal Remounting filesystem read-only Traceback (most recent call last): File "/usr/lib64/python2.4/logging/__init__.py", line 740, in emit self.flush() File "/usr/lib64/python2.4/logging/__init__.py", line 718, in flush self.stream.flush() IOError: [Errno 30] Read-only file system Traceback (most recent call last): File "/usr/lib64/python2.4/logging/__init__.py", line 740, in emit self.flush() File "/usr/lib64/python2.4/logging/__init__.py", line 718, in flush self.stream.flush() IOError: [Errno 30] Read-only file system Traceback (most recent call last): File "/usr/lib64/python2.4/logging/__init__.py", line 740, in emit self.flush() File "/usr/lib64/python2.4/logging/__init__.py", line 718, in flush self.stream.flush() IOError: [Errno 30] Read-only file system Traceback (most recent call last): File "/usr/lib64/python2.4/logging/__init__.py", line 740, in emit self.flush() File "/usr/lib64/python2.4/logging/__init__.py", line 718, in flush self.stream.flush() IOError: [Errno 30] Read-only file system Traceback (most recent call last): File "/usr/lib64/python2.4/logging/__init__.py", line 740, in emit self.flush() File "/usr/lib64/python2.4/logging/__init__.py", line 718, in flush self.stream.flush() IOError: [Errno 30] Read-only file system Traceback (most recent call last): File "/usr/lib64/python2.4/logging/__init__.py", line 740, in emit self.flush() File "/usr/lib64/python2.4/logging/__init__.py", line 718, in flush self.stream.flush() IOError: [Errno 30] Read-only file system Traceback (most recent call last): File "/usr/lib64/python2.4/logging/__init__.py", line 740, in emit self.flush() File "/usr/lib64/python2.4/logging/__init__.py", line 718, in flush self.stream.flush() IOError: [Errno 30] Read-only file system [ OK ] touch: cannot touch `/var/lock/subsys/haldaemon': Read-only file system touch: cannot touch `/var/lock/subsys/hidd': Read-only file system touch: cannot touch `/var/lock/subsys/lvm2-monitor': Read-only file system Starting autofs: Loading autofs4: [ OK ] Starting automount: automount: test mount forbidden or incorrect kernel protocol version, kernel protocol version 5.00 or above required. [FAILED] [FAILED] Starting hpiod: can't open or create /var/run/hpiod.pid: Read-only file system io/hpiod/hpiod.cpp 195 [FAILED] Starting hpssd: can't open pidfile /var/run/hpssd.pid: (30) Read-only file system [FAILED] Generating SSH1 RSA host key: [FAILED] Starting cups: [ OK ] touch: cannot touch `/var/lock/subsys/cups': Read-only file system Starting xinetd: [ OK ] touch: cannot touch `/var/lock/subsys/xinetd': Read-only file system ntpd: Synchronizing with time server: [ OK ] Starting ntpd: [ OK ] touch: cannot touch `/var/lock/subsys/ntpd': Read-only file system Starting sendmail: [ OK ] touch: cannot touch `/var/lock/subsys/sendmail': Read-only file system Starting sm-client: touch: cannot touch `/var/run/sm-client.pid': Read-only file system chown: cannot access `/var/run/sm-client.pid': No such file or directory lstat(/var/run/sm-client.pid) failed: No such file or directory [ OK ] touch: cannot touch `/var/lock/subsys/sm-client': Read-only file system Starting console mouse services: O0o.oops(): [startup.c(126)]: audit(1309978830.275:7): audit_pid=0 old=3012 by auid=4294967295 subj=system_u:system_r:auditd_t:s0 [ OK ] touch: cannot touch `/var/lock/subsys/gpm': Read-only file system Starting crond: crond: can't open or create /var/run/crond.pid: Read-only file system [FAILED] Starting xfs: mkdir: cannot create directory `/tmp/.font-unix': Read-only file system lstat(/tmp/.font-unix) failed: No such file or directory touch: cannot touch `/var/lock/subsys/xfs': Read-only file system Starting anacron: [ OK ] touch: cannot touch `/var/lock/subsys/atd': Read-only file system Starting background readahead: [ OK ] Starting rhsmcertd 240[FAILED] Starting yum-updatesd: [ OK ] touch: cannot touch `/var/lock/subsys/yum-updatesd': Read-only file system Starting Avahi daemon... [FAILED] touch: cannot touch `/var/lock/subsys/anamon': Read-only file system failed to make symlink /etc/rc2.d/K95anamon: File exists failed to make symlink /etc/rc3.d/K95anamon: Read-only file system failed to make symlink /etc/rc4.d/K95anamon: File exists failed to make symlink /etc/rc5.d/K95anamon: Read-only file system touch: cannot touch `/var/lock/subsys/local': Read-only file system Starting smartd: type=1400 audit(1309978833.585:8): avc: denied { read } for pid=3672 comm="sh" name="tmp" dev=dm-0 ino=47809723 scontext=system_u:system_r:fsdaemon_t:s0 tcontext=syste type=1300 audit(1309978833.585:8): arch=c000003e syscall=4 success=no exit=-13 a0=49d0ff a1=7fff8e202a70 a2=7fff8e202a70 a3=0 items=0 ppid=3671 pid=3672 auid=4294967295 uid=0 gid=0 euid=) type=1400 audit(1309978833.641:9): avc: denied { read } for pid=3674 comm="sh" name="tmp" dev=dm-0 ino=47809723 scontext=system_u:system_r:fsdaemon_t:s0 tcontext=system_u:object_r:usre type=1300 audit(1309978833.641:9): arch=c000003e syscall=4 success=no exit=-13 a0=49d0ff a1=7fff62735840 a2=7fff62735840 a3=0 items=0 ppid=3673 pid=3674 auid=4294967295 uid=0 gid=0 euid=) [ OK ] touch: cannot touch `/var/lock/subsys/smartd': Read-only file system Red Hat Enterprise Linux Server release 5.7 Beta (Tikanga) Kernel 2.6.18-269.el5 on an x86_64 amd-snook-01.lab.bos.redhat.com login: ----<\snip>----
Disk hardware errors: SMART Self-test log structure revision number 1 Num Test_Description Status Remaining LifeTime(hours) LBA_of_first_error # 1 Extended offline Completed: read failure 20% 9646 348344579
Hi, We are seeing exact same issue. =================================== EXT3-fs error (device sdd): htree_dirblock_to_tree: bad entry in directory #5833454: rec_len is smaller than minimal - offset=0, inode=0, rec_len=0, name_len=0 Aborting journal on device sdd. __journal_remove_journal_head: freeing b_committed_data ext3_abort called. EXT3-fs error (device sdd): ext3_journal_start_sb: Detected aborted journal Remounting filesystem read-only EXT3-fs error (device sdd): htree_dirblock_to_tree: bad entry in directory #5833458: rec_len is smaller than minimal - offset=0, inode=0, rec_len=0, name_len=0 EXT3-fs error (device sdd): htree_dirblock_to_tree: bad entry in directory #5833461: rec_len is smaller than minimal - offset=0, inode=0, rec_len=0, name_len=0 ======================================== Is there any work around other than upgrading the kernel/ files system to ext4 ? barrier=1 option in fstab seems to help as per our test. IS there any other recommendations form Redhat? Regards Majoosh
I believe your problem is di(In reply to Majoosh from comment #10) > Hi, > > We are seeing exact same issue. Perhaps, but by a different vector; powering off a virtual host, and guests not using barriers, if I recall correctly. Please engage your support contacts to help you troubleshoot that issue.
Jeff, I think this one should be closed, it sure look like a drive error, no? > ata1.00: cmd 60/08:18:5d:31:07/00:00:16:00:00/40 tag 3 ncq 4096 in > res 41/40:08:61:31:07/00:00:16:00:00/40 Emask 0x409 (media error) <F> > ata1.00: status: { DRDY ERR } > ata1.00: error: { UNC }
Closing NOTABUG, this looks like a hardware error causing the shutdown.