Bug 318661

Summary: F7 crashes, kernel 2.6.22.9-91.fc7: Journal commit I/O error
Product: [Fedora] Fedora
Reporter: Gilbert Sebenste <sebenste>
Component: kernel
Assignee: Kernel Maintainer List <kernel-maint>
Status: CLOSED NOTABUG
QA Contact: Fedora Extras Quality Assurance <extras-qa>
Severity: high
Priority: low    
Version: 7   
Target Milestone: ---   
Target Release: ---   
Hardware: i686   
OS: Linux   
Last Closed: 2007-10-22 16:10:30 UTC
Attachments (description; flags):
1. This is my /var/spool/messages file, showing the relevant bootup messages. No errors occurred before this point; what you see is the first message. following reboot and beyond. (flags: none)
2. Output from the hard drive test. (flags: none)

Description Gilbert Sebenste 2007-10-04 16:11:45 UTC
Description of problem:


Version-Release number of selected component (if applicable): 2.6.22.9-91.fc7


How reproducible: Occasionally


Steps to Reproduce:
1. Machine is under load; load average is above 0.5 but below 2, mostly from
hard drive activity (machine is ingesting a lot of real-time weather data)
2. Without warning, error message appears
3. Machine hangs. I can log in, but typing simple commands results in an I/O
error. A reboot clears things up. There are no error messages in any log file
prior to this. This machine is a Pentium quad-core 2.8 GHz machine with an ASUS
motherboard and 4 GB RAM. Note that "top" officially shows only 3.6 GB,
but the kernel catches it. This is happening once every
week or two, randomly.
 
Actual results: See attachment.

Expected results: No machine crash!


Additional info:

Comment 1 Gilbert Sebenste 2007-10-04 16:11:45 UTC
Created attachment 216061 [details]
This is my /var/spool/messages file, showing the relevant bootup messages. No errors occurred before this point; what you see is the first message. following reboot and beyond.

Comment 2 Chuck Ebbert 2007-10-04 19:19:30 UTC
This is usually caused by failing disk drives. What does smartctl say about the
drive's health?

  # smartctl -t short <device>
  [wait for test to finish]
  # smartctl -a <device>
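
Here <device> is the physical disk node (for example /dev/sda), not a mount point or an LVM logical volume, because SMART data belongs to the drive itself. A minimal sketch of deriving the disk node from a partition name follows. The helper name whole_disk is hypothetical, and it assumes classic sdXN/hdXN naming only. If smartctl cannot guess the drive type, "-d ata" is the type to try for ATA/SATA disks.

```shell
# Hypothetical helper: strip a trailing partition number from a device name.
# Assumes classic /dev/sdXN or /dev/hdXN naming; it does not handle
# names such as /dev/nvme0n1p1 or device-mapper (LVM) paths.
whole_disk() {
    printf '%s\n' "$1" | sed 's/[0-9]*$//'
}

# Intended use (needs root and smartmontools; not executed here):
#   smartctl -t short "$(whole_disk /dev/sda1)"
#   smartctl -a "$(whole_disk /dev/sda1)"
```

For a volume group such as VolGroup00, the underlying disk has to be looked up first (for example with pvdisplay) before smartctl can be pointed at it.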


Comment 3 Gilbert Sebenste 2007-10-04 19:31:04 UTC
Hi Chuck,

Yes, I agree, but this is new, and is happening on two machines.
I've never used smartctl before. df -k yields:

Filesystem           1K-blocks      Used Available Use% Mounted on
/dev/mapper/VolGroup00-LogVol00
                     707523888  35975188 635028764   6% /
/dev/sda1               101086     18723     77144  20% /boot
tmpfs                  1815340         0   1815340   0% /dev/shm

So when I type:

smartctl -t short /dev/mapper/VolGroup00-LogVol00
or smartctl -t short /
I get:

Smartctl: please specify device type with the -d option.

It's a SATA drive. What would be the correct command?

Comment 4 Gilbert Sebenste 2007-10-04 19:34:55 UTC
OK, I tried this on the kernel part of it. It took 5 seconds to run,
or is it still running? It says it will take 79 minutes, but...

smartctl -t short /dev/sda1
smartctl version 5.37 [i386-redhat-linux-gnu] Copyright (C) 2002-6 Bruce Allen
Home page is http://smartmontools.sourceforge.net/

Warning! SMART Attribute Data Structure error: invalid SMART checksum.
Warning! SMART Attribute Thresholds Structure error: invalid SMART checksum.
=== START OF OFFLINE IMMEDIATE AND SELF-TEST SECTION ===
Sending command: "Execute SMART Short self-test routine immediately in off-line mode".
Drive command "Execute SMART Short self-test routine immediately in off-line mode" successful.
Testing has begun.
Please wait 79 minutes for test to complete.
Test will complete after Thu Oct  4 15:52:06 2007

Use smartctl -X to abort test.
[root@machine username]#

Comment 5 Chuck Ebbert 2007-10-04 23:54:23 UTC
Wait out the 79 minutes, then run smartctl -a

Comment 6 Gilbert Sebenste 2007-10-05 03:30:21 UTC
Created attachment 217031 [details]
Output from the hard drive test.

This one is the output from the hard drive check.

Comment 7 Gilbert Sebenste 2007-10-05 14:37:44 UTC
I saw this in the logwatch email I get daily:

WARNING:  Kernel Errors Present
             res 51/04:00:00:4f:c2/00:00:00:00:00/00 Emask 0x1 (device error) ...:  6 Time(s)
             res 51/04:00:01:4f:c2/00:00:00:00:00/00 Emask 0x1 (device error) ...:  2 Time(s)
             res 51/04:01:00:4f:c2/00:00:00:00:00/00 Emask 0x1 (device error) ...:  3 Time(s)
             res 51/04:01:01:4f:c2/00:00:00:00:00/00 Emask 0x1 (device error) ...:  4 Time(s)
             res 51/04:01:06:4f:c2/00:00:00:00:00/00 Emask 0x1 (device error) ...:  1 Time(s)

Let me guess...failing hard drive?

Comment 8 Gilbert Sebenste 2007-10-05 15:00:27 UTC
OK, here's an expanded version of the above, at 4 different times last evening. 
None so far today.

Oct  4 14:33:05 weather kernel: ata3.00: cmd b0/da:00:00:4f:c2/00:00:00:00:00/00 tag 0 cdb 0x0 data 0
Oct  4 14:33:06 weather kernel:          res 51/04:00:00:4f:c2/00:00:00:00:00/00 Emask 0x1 (device error)
Oct  4 14:33:06 weather kernel: ata3.00: configured for UDMA/133
Oct  4 14:33:06 weather kernel: ata3: EH complete
Oct  4 14:33:06 weather kernel: sd 2:0:0:0: [sda] 1465149168 512-byte hardware sectors (750156 MB)
Oct  4 14:33:06 weather kernel: sd 2:0:0:0: [sda] Write Protect is off
Oct  4 14:33:06 weather kernel: sd 2:0:0:0: [sda] Write cache: enabled, read cache: enabled, doesn't support $
Oct  4 14:33:06 weather kernel: ata3.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x0
Oct  4 14:33:06 weather kernel: ata3.00: cmd b0/da:00:00:4f:c2/00:00:00:00:00/00 tag 0 cdb 0x0 data 0
Oct  4 14:33:06 weather kernel:          res 51/04:00:00:4f:c2/00:00:00:00:00/00 Emask 0x1 (device error)
Oct  4 14:33:06 weather kernel: ata3.00: configured for UDMA/133
Oct  4 14:33:06 weather kernel: ata3: EH complete
Oct  4 14:33:06 weather kernel: sd 2:0:0:0: [sda] 1465149168 512-byte hardware sectors (750156 MB)
Oct  4 14:33:06 weather kernel: sd 2:0:0:0: [sda] Write Protect is off

I am getting ready to call this a failing hard drive, and call it a night...
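
For reference, the cmd and res fields in the log above can be read as ATA register values: cmd b0/da is the SMART command (0xB0) with the SMART RETURN STATUS feature (0xDA), and res 51/04 is status 0x51 with error 0x04. The sketch below decodes those two bytes. The function names are hypothetical, and the error table lists only a subset of the defined bits.

```shell
# Decode a hex byte into flag names, given mask:name pairs.
# Usage: decode_bits HEXBYTE MASK:NAME [MASK:NAME ...]
decode_bits() {
    value=$((0x$1))
    shift
    out=""
    for pair in "$@"; do
        mask=${pair%%:*}
        name=${pair#*:}
        if [ $((value & $mask)) -ne 0 ]; then
            out="$out $name"
        fi
    done
    printf '%s\n' "${out# }"
}

# ATA status register bits (classic ATA naming).
decode_ata_status() {
    decode_bits "$1" 0x80:BSY 0x40:DRDY 0x20:DF 0x10:DSC \
                     0x08:DRQ 0x04:CORR 0x02:IDX 0x01:ERR
}

# ATA error register bits (subset only).
decode_ata_error() {
    decode_bits "$1" 0x80:ICRC 0x40:UNC 0x10:IDNF 0x04:ABRT
}

# decode_ata_status 51   prints: DRDY DSC ERR
# decode_ata_error 04    prints: ABRT
```

Read this way, the drive aborted a SMART status query. That is an interpretation of these log lines only; by itself it does not show whether ordinary reads and writes are failing.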

Comment 9 Gilbert Sebenste 2007-10-05 18:06:21 UTC
Gentlemen,

I am sorry to have wasted your time. It is now becoming apparent that I got a 
bad batch of hard drives from Seagate. I couldn't believe they're all bad, but 
after further analysis this afternoon, I can't escape the dreadful conclusion. 
I am terribly sorry to have wasted time here with you. My apologies, and have a 
great weekend.

Mark this as NOTABUG and close, please. Thanks.

Gilbert

Comment 10 Gilbert Sebenste 2007-10-08 03:28:29 UTC
New hard drives do not stop this from happening. It occurs randomly, but the
computer will not stay up more than 18 hours.

Using a Seagate 750 GB SATA 3Gb/s drive, 7200 RPM, 16 MB.

Since this machine is new out-of-the-box, I can tell you that older machines I
have do NOT have this problem. And they have a higher load and use the drive
more heavily than this one.

Comment 11 Gilbert Sebenste 2007-10-08 03:31:32 UTC
I have also switched back to an older kernel, the original for F7:
2.6.21-1.3194.fc7. Will notify if this holds or not.


Comment 12 Chuck Ebbert 2007-10-08 21:36:12 UTC
(In reply to comment #10)
> New hard drives do not stop this from happening. occurs randomly, but computer
> will not stay up more than 18 hours.
> 
> Using a Seagate 750 GB SATA 3Gb/s drive, 7200 RPM, 16 MB.
> 

Set for 3Gb/s or jumpered down to 1.5? The faster setting can cause problems...
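
One way to see what speed a link actually negotiated is the kernel's "SATA link up" message, which libata logs per port. The filter below is a sketch: sata_link_speeds is a hypothetical name, and it assumes the message has the form "ataN: SATA link up X Gbps (...)".

```shell
# Read kernel log text on stdin and print "<port> <speed>" for every
# libata "SATA link up" message found. Lines that do not match are dropped.
sata_link_speeds() {
    sed -n 's/.*\(ata[0-9][0-9]*\): SATA link up \([0-9.][0-9.]* Gbps\).*/\1 \2/p'
}

# Intended use (not executed here):
#   dmesg | sata_link_speeds
#   sata_link_speeds < /var/log/messages
```

A port that reports 1.5 Gbps after the jumper is fitted confirms that the drive is really running at the lower speed.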



Comment 13 Gilbert Sebenste 2007-10-08 21:47:52 UTC
Hi Chuck,

Really? How does one jumper it down to 1.5? I see 4 pins on the back, but 
nothing to jumper it with in the box. 

Comment 14 Chuck Ebbert 2007-10-08 21:54:28 UTC
(In reply to comment #13)
> How does one jumper it down to 1.5? I see 4 pins on the back, but 
> nothing to jumper it with in the box. 

They're just standard jumpers; they should be available at a parts store or
salvageable from broken equipment. But you do need the manual for the drive to
tell you which pins to jumper.


Comment 15 Gilbert Sebenste 2007-10-08 22:42:45 UTC
Wait. If my HD is 3 Gb/s, and my motherboard (ASUS) supports SATA-2, why is 
there a problem? Is it BIOS, or OS related?

Comment 16 Chuck Ebbert 2007-10-08 22:58:47 UTC
(In reply to comment #15)
> Wait. If my HD is 3 GB/sec, and my motherboard (ASUS) supports SATA-2, why is 
> there a problem? Is it BIOS, or OS related?

Could be anything: cables, drive firmware, controller or OS.


Comment 17 Gilbert Sebenste 2007-10-09 21:36:32 UTC
Jumper is on as of 19:15 CT on 10/9/07. Will keep you posted.

Comment 18 Gilbert Sebenste 2007-10-22 16:07:40 UTC
I am not sure what to do now. I am now on kernel 2.6.23.1. I am also using a
jumper to keep my SATA drives using the SATA1 protocol instead of SATA2. No
crashes (except for an apparently unrelated bug, which I just filed).
Should I close this?