Bug 318661 - F7 crashes, kernel 2.6.22.9-91.fc7: Journal commit I/O error
Status: CLOSED NOTABUG
Product: Fedora
Classification: Fedora
Component: kernel (Show other bugs)
Version: 7
Hardware: i686 Linux
Priority: low Severity: high
Assigned To: Kernel Maintainer List
QA Contact: Fedora Extras Quality Assurance
Reported: 2007-10-04 12:11 EDT by Gilbert Sebenste
Modified: 2007-11-30 17:12 EST (History)
Last Closed: 2007-10-22 12:10:30 EDT
Attachments
This is my /var/spool/messages file, showing the relevant bootup messages. No errors occurred before this point; what you see is the first message, the following reboot, and beyond. (51.02 KB, text/plain)
2007-10-04 12:11 EDT, Gilbert Sebenste
Output from the hard drive test. (3.03 KB, text/plain)
2007-10-04 23:30 EDT, Gilbert Sebenste
Description Gilbert Sebenste 2007-10-04 12:11:45 EDT
Description of problem:


Version-Release number of selected component (if applicable): 2.6.22.9-91.fc7


How reproducible: Occasionally


Steps to Reproduce:
1. Machine is under load; load average above .5 but lower than 2; mostly from
hard drive (machine is ingesting a lot of real-time weather data)
2. Without warning, error message appears
3. Machine hangs. I can log in, but typing simple commands results in an I/O
error. A reboot clears things up. There are no error messages in any log file
prior to this. This machine is a Pentium quad core 2.8 GHz ASUS motherboard
machine with 4 GB RAM. Duly noted is that it only shows 3.6 GB officially
when doing a "top", but the kernel catches it. This is happening once every
week or two, randomly.
 
Actual results: See attachment.

Expected results: No machine crash!


Additional info:
Comment 1 Gilbert Sebenste 2007-10-04 12:11:45 EDT
Created attachment 216061 [details]
This is my /var/spool/messages file, showing the relevant bootup messages. No errors occurred before this point; what you see is the first message, the following reboot, and beyond.
Comment 2 Chuck Ebbert 2007-10-04 15:19:30 EDT
This is usually caused by failing disk drives. What does smartctl say about the
drive's health?

  # smartctl -t short <device>
  [wait for test to finish]
  # smartctl -a <device>
Comment 3 Gilbert Sebenste 2007-10-04 15:31:04 EDT
Hi Chuck,

Yes, I agree, but this is new, and is happening on two machines.
I've never used smartctl before. df -k yields:

Filesystem           1K-blocks      Used Available Use% Mounted on
/dev/mapper/VolGroup00-LogVol00
                     707523888  35975188 635028764   6% /
/dev/sda1               101086     18723     77144  20% /boot
tmpfs                  1815340         0   1815340   0% /dev/shm

So when I type:

smartctl -t short /dev/mapper/VolGroup00-LogVol00
or smartctl -t short /
I get:

Smartctl: please specify device type with the -d option.

It's a SATA drive. What would be the correct command?
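A minimal sketch of one way to pick the device argument, assuming classic sdX-style names: smartctl talks to the physical drive, so it is normally given the disk's device node (here /dev/sda, the disk behind /dev/sda1) rather than a mount point or an LVM volume. The helper names whole_disk and run_short_test are hypothetical. The -d option named in the error message is a real smartctl option (for example -d ata), but whether it is needed on this system is an assumption.

```shell
# Strip a trailing partition number from a classic sdX/hdX device node,
# e.g. /dev/sda1 -> /dev/sda. Assumption: names of the form /dev/sdXN only;
# this does not handle nvme0n1p1-style names or LVM paths.
whole_disk() {
    printf '%s\n' "$1" | sed 's/[0-9]*$//'
}

# Hypothetical wrapper: start a short self-test on the whole disk.
# Defined but not called here, because it needs root and a real drive.
run_short_test() {
    disk=$(whole_disk "$1")
    smartctl -t short "$disk"     # start the short self-test
    # once the wait time that smartctl reports has passed:
    # smartctl -a "$disk"         # print health, attributes and test log
}

whole_disk /dev/sda1
```

If smartctl still asks for a device type, adding -d ata to both commands is the usual thing to try for a SATA drive.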
Comment 4 Gilbert Sebenste 2007-10-04 15:34:55 EDT
OK, I tried this on the kernel part of it. It took 5 seconds to run,
or is it still running? It says it will take 79 minutes, but...

smartctl -t short /dev/sda1
smartctl version 5.37 [i386-redhat-linux-gnu] Copyright (C) 2002-6 Bruce Allen
Home page is http://smartmontools.sourceforge.net/

Warning! SMART Attribute Data Structure error: invalid SMART checksum.
Warning! SMART Attribute Thresholds Structure error: invalid SMART checksum.
=== START OF OFFLINE IMMEDIATE AND SELF-TEST SECTION ===
Sending command: "Execute SMART Short self-test routine immediately in off-line mode".
Drive command "Execute SMART Short self-test routine immediately in off-line mode" successful.
Testing has begun.
Please wait 79 minutes for test to complete.
Test will complete after Thu Oct  4 15:52:06 2007

Use smartctl -X to abort test.
[root@machine username]#
Comment 5 Chuck Ebbert 2007-10-04 19:54:23 EDT
Wait out the 79 minutes, then run smartctl -a
Comment 6 Gilbert Sebenste 2007-10-04 23:30:21 EDT
Created attachment 217031 [details]
Output from the hard drive test.

This one is the output from the hard drive check.
Comment 7 Gilbert Sebenste 2007-10-05 10:37:44 EDT
I saw this in the logwatch email I get daily:

WARNING:  Kernel Errors Present
             res 51/04:00:00:4f:c2/00:00:00:00:00/00 Emask 0x1 (device error) ...:  6 Time(s)
             res 51/04:00:01:4f:c2/00:00:00:00:00/00 Emask 0x1 (device error) ...:  2 Time(s)
             res 51/04:01:00:4f:c2/00:00:00:00:00/00 Emask 0x1 (device error) ...:  3 Time(s)
             res 51/04:01:01:4f:c2/00:00:00:00:00/00 Emask 0x1 (device error) ...:  4 Time(s)
             res 51/04:01:06:4f:c2/00:00:00:00:00/00 Emask 0x1 (device error) ...:  1 Time(s)

Let me guess...failing hard drive?
Comment 8 Gilbert Sebenste 2007-10-05 11:00:27 EDT
OK, here's an expanded version of the above, at 4 different times last evening. 
None so far today.

Oct  4 14:33:05 weather kernel: ata3.00: cmd b0/da:00:00:4f:c2/00:00:00:00:00/00 tag 0 cdb 0x0 data 0
Oct  4 14:33:06 weather kernel:          res 51/04:00:00:4f:c2/00:00:00:00:00/00 Emask 0x1 (device error)
Oct  4 14:33:06 weather kernel: ata3.00: configured for UDMA/133
Oct  4 14:33:06 weather kernel: ata3: EH complete
Oct  4 14:33:06 weather kernel: sd 2:0:0:0: [sda] 1465149168 512-byte hardware sectors (750156 MB)
Oct  4 14:33:06 weather kernel: sd 2:0:0:0: [sda] Write Protect is off
Oct  4 14:33:06 weather kernel: sd 2:0:0:0: [sda] Write cache: enabled, read cache: enabled, doesn't support $
Oct  4 14:33:06 weather kernel: ata3.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x0
Oct  4 14:33:06 weather kernel: ata3.00: cmd b0/da:00:00:4f:c2/00:00:00:00:00/00 tag 0 cdb 0x0 data 0
Oct  4 14:33:06 weather kernel:          res 51/04:00:00:4f:c2/00:00:00:00:00/00 Emask 0x1 (device error)
Oct  4 14:33:06 weather kernel: ata3.00: configured for UDMA/133
Oct  4 14:33:06 weather kernel: ata3: EH complete
Oct  4 14:33:06 weather kernel: sd 2:0:0:0: [sda] 1465149168 512-byte hardware sectors (750156 MB)
Oct  4 14:33:06 weather kernel: sd 2:0:0:0: [sda] Write Protect is off

I am getting ready to call this a failing hard drive, and call it a night...
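For reading the "res" lines above, here is a rough sketch that decodes the first two bytes (status/error) using the standard ATA register bit names. The function name decode_ata_res is hypothetical, and only a subset of the bits is covered (the obsolete 0x10 status bit is ignored). As far as the ATA command set goes, the "cmd b0/da" in these lines is the SMART command (opcode 0xB0) with feature 0xDA, SMART RETURN STATUS, and 4f:c2 is the SMART signature, so the entries appear to be aborted SMART status queries rather than failed reads or writes. Treat that reading as an interpretation, not a diagnosis.

```shell
# Decode the status/error bytes at the start of a libata "res" field,
# e.g. 51/04:00:00:4f:c2 -> status 0x51, error 0x04.
decode_ata_res() {
    status_hex=${1%%/*}           # text before the first "/" (51)
    rest=${1#*/}
    error_hex=${rest%%:*}         # text up to the first ":" (04)
    status=$((0x$status_hex))
    error=$((0x$error_hex))

    # ATA status register bits (subset)
    s=""
    if [ $((status & 0x80)) -ne 0 ]; then s="$s BSY"; fi
    if [ $((status & 0x40)) -ne 0 ]; then s="$s DRDY"; fi
    if [ $((status & 0x20)) -ne 0 ]; then s="$s DF"; fi
    if [ $((status & 0x08)) -ne 0 ]; then s="$s DRQ"; fi
    if [ $((status & 0x01)) -ne 0 ]; then s="$s ERR"; fi

    # ATA error register bits (subset)
    e=""
    if [ $((error & 0x80)) -ne 0 ]; then e="$e ICRC"; fi
    if [ $((error & 0x40)) -ne 0 ]; then e="$e UNC"; fi
    if [ $((error & 0x10)) -ne 0 ]; then e="$e IDNF"; fi
    if [ $((error & 0x04)) -ne 0 ]; then e="$e ABRT"; fi

    echo "status:$s error:$e"
}

decode_ata_res 51/04:00:00:4f:c2
```

For the lines in this report the result is DRDY plus ERR in the status byte and ABRT in the error byte, meaning the drive was ready but rejected the command.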
Comment 9 Gilbert Sebenste 2007-10-05 14:06:21 EDT
Gentlemen,

I am sorry to have wasted your time. It is now becoming apparent that I got a 
bad batch of hard drives from Seagate. I couldn't believe they're all bad, but 
after further analysis this afternoon, I can't escape the dreadful conclusion. 
I am terribly sorry to have wasted time here with you. My apologies, and have a 
great weekend.

Mark this as NOTABUG and close, please. Thanks.

Gilbert
Comment 10 Gilbert Sebenste 2007-10-07 23:28:29 EDT
New hard drives do not stop this from happening. It occurs randomly, but the computer
will not stay up more than 18 hours.

Using a Seagate 750 GB SATA 3Gb/s drive, 7200 RPM, 16 MB.

Since this machine is new out-of-the-box, I can tell you that older machines I
have do NOT have this problem. And they have a higher load and use the drive
more heavily than this one.
Comment 11 Gilbert Sebenste 2007-10-07 23:31:32 EDT
I have also switched back to an older kernel, the original for F7:
2.6.21-1.3194.fc7. Will notify if this holds or not.
Comment 12 Chuck Ebbert 2007-10-08 17:36:12 EDT
(In reply to comment #10)
> New hard drives do not stop this from happening. It occurs randomly, but the computer
> will not stay up more than 18 hours.
> 
> Using a Seagate 750 GB SATA 3Gb/s drive, 7200 RPM, 16 MB.
> 

Set for 3Gb/s or jumpered down to 1.5? The faster setting can cause problems...

Comment 13 Gilbert Sebenste 2007-10-08 17:47:52 EDT
Hi Chuck,

Really? How does one jumper it down to 1.5? I see 4 pins on the back, but 
nothing to jumper it with in the box. 
Comment 14 Chuck Ebbert 2007-10-08 17:54:28 EDT
(In reply to comment #13)
> How does one jumper it down to 1.5? I see 4 pins on the back, but 
> nothing to jumper it with in the box. 

They're just standard jumpers; they should be available at a parts store or
salvageable from broken equipment. But you do need the manual for the drive to
tell you which pins to jumper.
Comment 15 Gilbert Sebenste 2007-10-08 18:42:45 EDT
Wait. If my HD is 3 Gb/s, and my motherboard (ASUS) supports SATA-2, why is 
there a problem? Is it BIOS, or OS related?
Comment 16 Chuck Ebbert 2007-10-08 18:58:47 EDT
(In reply to comment #15)
> Wait. If my HD is 3 Gb/s, and my motherboard (ASUS) supports SATA-2, why is 
> there a problem? Is it BIOS, or OS related?

Could be anything: cables, drive firmware, controller or OS.
Comment 17 Gilbert Sebenste 2007-10-09 17:36:32 EDT
Jumper is on as of 19:15 CT on 10/9/07. Will keep you posted.
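One way to check that the jumper took effect is to look at the link speed the kernel logs when the port comes up. The sketch below assumes libata's usual "SATA link up ... Gbps" message wording. The function name link_speeds is hypothetical, and the sample line is illustrative only, not taken from this machine's log.

```shell
# Print "<port> <speed>" for every libata link-up message read on stdin.
# Assumption: messages look like
#   ata3: SATA link up 1.5 Gbps (SStatus 113 SControl 300)
link_speeds() {
    sed -n 's/.*\(ata[0-9][0-9]*\): SATA link up \([0-9.]* Gbps\).*/\1 \2/p'
}

# Illustrative input only; on a real system use: dmesg | link_speeds
echo 'kernel: ata3: SATA link up 1.5 Gbps (SStatus 113 SControl 300)' | link_speeds
```

A port that reports 1.5 Gbps after the reboot suggests the drive is negotiating the slower rate; 3.0 Gbps would suggest the jumper is on the wrong pins or not seated.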
Comment 18 Gilbert Sebenste 2007-10-22 12:07:40 EDT
I am not sure what to do now. I am now on kernel 2.6.23.1. I am also using a
jumper to keep my SATA drives using the SATA1 protocol instead of SATA2. No
crashes (except for an apparently unrelated bug, which I just filed).
Should I close this?
