Bug 468391

Summary: EXT3-fs error (device dm-0) in start_transaction: Journal has aborted
Product: Red Hat Enterprise Linux 4 Reporter: Vivek Goyal <vgoyal>
Component: kernel Assignee: Tomas Henzl <thenzl>
Status: CLOSED WONTFIX QA Contact: Martin Jenner <mjenner>
Severity: medium Docs Contact:
Priority: medium    
Version: 4.8 CC: bhoefer, czfrux, duck, esandeen, herrold, jbacik, jburke, jcavallaro, jtluka, kashyap.desai, mmahut, sathya.prakash, sforsber, vgoyal
Target Milestone: rc Keywords: OtherQA
Target Release: 4.9   
Hardware: All   
OS: Linux   
Whiteboard:
Fixed In Version: Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2011-04-26 13:41:37 UTC Type: ---
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
Embargoed:

Description Vivek Goyal 2008-10-24 14:34:58 UTC
Description of problem:

During the rhts run of the 78.16.EL kernel I noticed an ext3-related error message during the "scrashme" test.

I suspect that this might be an underlying device driver problem (mptfusion), but I am not sure. Somebody needs to investigate.

scrashme was running. There were some audit-related messages, and then there seemed
to be a disk/driver failure.

audit(:0): major=252 name_count=0: freeing multiple contexts (1)
audit(:0): major=113 name_count=0: freeing multiple contexts (2)
audit(:0): major=252 name_count=0: freeing multiple contexts (1)
audit(:0): major=166 name_count=0: freeing multiple contexts (2)
[..]
SCSI error : <0 0 1 0> return code = 0x10000
end_request: I/O error, dev sda, sector 225829
SCSI error : <0 0 1 0> return code = 0x10000
end_request: I/O error, dev sda, sector 225837
[..]
EXT3-fs error (device dm-0) in start_transaction: Journal has aborted
EXT3-fs error (device dm-0) in start_transaction: Journal has aborted
EXT3-fs error (device dm-0) in start_transaction: Journal has aborted

Version-Release number of selected component (if applicable):
78.16.EL

How reproducible:

Noticed once.

Steps to Reproduce:
1. rhts scrashme test
2.
3.
  
Actual results:


Expected results:

No error messages.

Additional info:

Error logs are here.
http://rhts.redhat.com/cgi-bin/rhts/test_log.cgi?id=4804306

I suspect that I noticed a similar issue on a different machine as well, but we had somehow lost
the console, so rhts could not capture the full logs. The symptoms were the same: an i386 machine and a failure during the scrashme test.

http://rhts.redhat.com/cgi-bin/rhts/test_log.cgi?id=4551390

Comment 1 Vivek Goyal 2008-10-27 19:31:35 UTC
Noticed it one more time.

http://rhts.redhat.com/cgi-bin/rhts/test_log.cgi?id=4828688

Comment 2 Eric Sandeen 2008-11-01 18:48:52 UTC
What does scrashme do?

At first glance, it looks to me like ext3 is simply responding to SCSI errors here:

SCSI error : <0 0 1 0> return code = 0x10000

end_request: I/O error, dev sda, sector 225837

and then:

EXT3-fs error (device dm-0) in start_transaction: Journal has aborted

IOW, ext3 looks like the victim to me.
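
For anyone reading the return code above: a minimal sketch of how such a SCSI result word is usually split up, assuming the standard Linux SCSI midlayer layout (status byte in bits 0-7, message byte in bits 8-15, host byte in bits 16-23, driver byte in bits 24-31). The helper name `decode_scsi_result` is made up for illustration, and the name table follows the `DID_*` constants in the kernel's SCSI headers.

```python
# Host-byte names, following the DID_* constants in the Linux SCSI headers.
HOST_BYTE_NAMES = {
    0x00: "DID_OK",
    0x01: "DID_NO_CONNECT",
    0x02: "DID_BUS_BUSY",
    0x03: "DID_TIME_OUT",
    0x04: "DID_BAD_TARGET",
    0x05: "DID_ABORT",
    0x06: "DID_PARITY",
    0x07: "DID_ERROR",
    0x08: "DID_RESET",
    0x09: "DID_BAD_INTR",
}


def decode_scsi_result(result):
    """Split a 32-bit SCSI result word into its four bytes (hypothetical helper)."""
    host = (result >> 16) & 0xFF
    return {
        "status_byte": result & 0xFF,
        "msg_byte": (result >> 8) & 0xFF,
        "host_byte": host,
        "driver_byte": (result >> 24) & 0xFF,
        "host_name": HOST_BYTE_NAMES.get(host, "unknown"),
    }


# The return code from the log above.
print(decode_scsi_result(0x10000))
```

Under that assumption, 0x10000 has only the host byte set (0x01), which points at the host adapter or driver level rather than at the filesystem.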

Comment 3 Jeff Burke 2008-11-02 00:43:24 UTC
Eric,
   scrashme is a system call fuzzer test written by David Jones.
http://www.codemonkey.org.uk/projects/scrashme/

Comment 6 Tomas Henzl 2009-01-29 11:51:55 UTC
Sathya,
I've placed some of the logs in http://people.redhat.com/thenzl/bz468391/

Comment 7 Tomas Henzl 2009-01-29 12:01:37 UTC
I've retested this on hp-ml310g5-01 and saw some errors after several hours of running the test. ibm-firefly has been running the tests for a day without an error. I'll try the same with 2.6.9.80.

Comment 8 Tomas Henzl 2009-01-29 16:26:49 UTC
So far, the tests with 2.6.9.80 have been running without issues on both boxes.

Sathya, I have not seen a panic in the previous logs, so this is probably a different problem. Please open a new bz for your issues.

Comment 9 Tomas Henzl 2009-01-30 16:55:33 UTC
Sathya,
I removed the patch version 3.12.29.00rh; it now looks a little better,
but I'm still getting these messages:

mptbase: ioc0: IOCStatus=8000 LogInfo=31120403 Originator={PL}, Code={Abort}, SubCode(0x0403)
mptbase: ioc0: IOCStatus=804b LogInfo=31120403 Originator={PL}, Code={Abort}, SubCode(0x0403)
mptbase: ioc0: IOCStatus=804b LogInfo=31120403 Originator={PL}, Code={Abort}, SubCode(0x0403)

Does anyone know whether this is a sign of an error or is harmless?
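
As a reading aid for the LogInfo values above, here is a rough sketch of how the word can be split into fields. The layout (bus type in bits 28-31, originator in bits 24-27, code in bits 16-23, subcode in bits 0-15) and the originator table are assumptions based on how the mptfusion driver decodes SAS log info. The function name `decode_sas_loginfo` is hypothetical. The only code name used is the one the driver itself printed in the messages above (code 0x12 shown as Abort).

```python
# Originator values for SAS LogInfo words (assumed from the mptfusion decoding).
ORIGINATORS = {0x0: "IOP", 0x1: "PL", 0x2: "IR"}


def decode_sas_loginfo(log_info):
    """Split a 32-bit SAS LogInfo word into its fields (hypothetical helper)."""
    return {
        "bus_type": (log_info >> 28) & 0xF,  # 0x3 is assumed to mean SAS
        "originator": ORIGINATORS.get((log_info >> 24) & 0xF, "unknown"),
        "code": (log_info >> 16) & 0xFF,
        "subcode": log_info & 0xFFFF,
    }


# LogInfo value from the messages above.
fields = decode_sas_loginfo(0x31120403)
print(fields["originator"], hex(fields["code"]), hex(fields["subcode"]))
```

This only reproduces what the driver already prints (Originator={PL}, SubCode(0x0403)); it does not say whether the condition is harmless.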

Comment 10 Tomas Henzl 2009-02-11 11:27:05 UTC
I saw the same issues with kernel 2.6.9-55.ELsmp on hp-ml310g5-01; on ibm-firefly I've never seen a problem.
Maybe this is a hw issue? We should now look at previous results.

Comment 12 Jeff Burke 2009-02-11 19:27:14 UTC
Tomas,
 You can use the "Executed Tests" report in RHTS to get this type of information:

Here is a quick link:
Pass and Failures
http://tinyurl.com/scrashme

Just failures
http://tinyurl.com/scrashme-fail

Comment 19 sapalinux 2009-10-14 12:16:28 UTC
I have the same problem.
When will it be possible to fix it?
Do you have any workaround?

Comment 20 Tomas Henzl 2009-10-14 12:47:17 UTC
(In reply to comment #19)
> I have the same problem.
> When will it be possible to fix it?
> Do you have any workaround?  
We don't know at the moment what is causing this issue.

Comment 21 sapalinux 2009-10-14 21:00:06 UTC
Thank you. We're seeing this in our production environment. Red Hat Technical support told us to bring it to this bug. Please let us know if you need more details; we are seeing this issue on a weekly basis.

Comment 22 Tomas Henzl 2009-10-26 13:19:21 UTC
Kashyap,
have you tried to reproduce this?
The other issues we had with our testing machine (there were some problems with the remote console) seem to be gone, so we can continue working on this.

Comment 23 kashyap 2009-10-27 08:43:28 UTC
Tomas,
What are the exact steps to reproduce this ? If someone can provide steps, I can try reproducing here.

Comment 24 Tomas Henzl 2009-10-27 13:51:46 UTC
(In reply to comment #23)
> What are the exact steps to reproduce this ? If someone can provide steps, I
> can try reproducing here.  

This issue can be seen after running the scrashme test for some hours; see Comment #3. We use a modified version adapted to our test system. I can send you this version too, but it might not be easy to use on its own.

Comment 25 Tomas Henzl 2009-10-27 14:55:31 UTC
The system which had this issue previously (I just started a new test) has this controller (from dmesg):
mptbase: Initiating ioc0 bringup
ioc0: SAS1064E: Capabilities={Initiator}
scsi0 : ioc0: LSISAS1064E, FwRev=01160000h, Ports=1, MaxQ=268, IRQ=169
Using cfq io scheduler

Comment 26 Tomas Henzl 2009-11-04 15:00:42 UTC
I'm now using another test box; on the previous one I was not able to reproduce this.
The last message from the controller, printed from mpt_reply(MPT_ADAPTER *ioc, u32 pa)
if (ioc_stat & MPI_IOCSTATUS_FLAG_LOG_INFO_AVAILABLE) {
...
     printk(MYIOC_s_INFO_FMT "IOCStatus_1=%04x LogInfo=%08x\n", ioc->name, ioc_stat, log_info);
is this :
kernel: mptbase: ioc0: IOCStatus_1=8000 LogInfo=31170000

The test system uses SAS1064E
ioc0: SAS1064E: Capabilities={Initiator}
scsi0 : ioc0: LSISAS1064E, FwRev=01172a00h, Ports=1, MaxQ=163, IRQ=169
Maybe our controller does not use the latest firmware ?
Were you able to reproduce this also ?
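
A small sketch for reading the IOCStatus values in these messages. The check in mpt_reply() quoted above tests the "log info available" flag; the sketch assumes that flag is bit 15 (0x8000) and that the remaining 15 bits (mask 0x7FFF) carry the actual status code, as in the MPI headers. Both constants and the helper name `split_ioc_status` are assumptions for illustration.

```python
LOG_INFO_AVAILABLE = 0x8000  # assumed value of MPI_IOCSTATUS_FLAG_LOG_INFO_AVAILABLE
STATUS_MASK = 0x7FFF         # assumed mask for the status code itself


def split_ioc_status(ioc_stat):
    """Return (log_info_available, status_code) for an IOCStatus word."""
    return bool(ioc_stat & LOG_INFO_AVAILABLE), ioc_stat & STATUS_MASK


# IOCStatus_1=8000 from the message above: flag set, status code 0.
print(split_ioc_status(0x8000))
```

Read this way, IOCStatus 8000 would carry no error status of its own and only signal that a LogInfo word is attached.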

Comment 27 Tomas Henzl 2009-11-06 15:35:44 UTC
It looks like my test box is having some other issues, so I have to wait until they are resolved.

Comment 34 Tomas Henzl 2009-11-16 13:08:08 UTC
(In reply to comment #21)
> Thank you. We're seeing this in our production environment. Red Hat Technical
> support told us to bring it to this bug. Please let us know if you need more
> details, we are seeing this issue on a weekly basis.

Yes, please send us as many details as you can. Our test box seems to have other issues, so it is unusable at the moment.

Do you see the same symptoms (error messages), or is your case different? Which RHEL version, arch, etc.?
Do you have an easy way to reproduce this?

Comment 36 Eric Sandeen 2009-11-16 15:47:43 UTC
For anyone reporting this bug, please be sure to look earlier in the logs for storage errors; including the entire dmesg in the report would be most helpful.

The error message in the summary will always come -after- other errors - most likely IO or corruption errors - and is not really a bug in and of itself.
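
One possible way to do that check on a saved dmesg or console log, as a sketch only: collect the storage-related lines that appear before the first "Journal has aborted" message. The patterns are taken from the messages quoted in this bug and are surely not exhaustive; the function name `errors_before_abort` is made up for illustration.

```python
import re

# Patterns based on the messages quoted in this bug; extend as needed.
STORAGE_ERROR = re.compile(r"SCSI error|end_request: I/O error|mptbase: .*LogInfo")
JOURNAL_ABORT = re.compile(r"EXT3-fs error .*Journal has aborted")


def errors_before_abort(lines):
    """Return storage-error lines seen before the first journal abort.

    Returns None if the log contains no journal abort at all.
    """
    found = []
    for line in lines:
        if JOURNAL_ABORT.search(line):
            return found
        if STORAGE_ERROR.search(line):
            found.append(line.strip())
    return None


sample = """\
audit(:0): major=252 name_count=0: freeing multiple contexts (1)
SCSI error : <0 0 1 0> return code = 0x10000
end_request: I/O error, dev sda, sector 225829
EXT3-fs error (device dm-0) in start_transaction: Journal has aborted
""".splitlines()

for line in errors_before_abort(sample):
    print(line)
```

An empty list would mean the abort showed up without any of the known storage errors before it, which is the case worth reporting with the full dmesg attached.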

Comment 39 Juan J. Cavallaro 2009-11-26 21:26:20 UTC
I understand. Do you have any suggestions for what we could do to reproduce this, or to get any useful information out of this customer's box? I can provide a sosreport if it helps.

Thanks,

Comment 40 Eric Sandeen 2009-11-27 01:18:15 UTC
Juan, 

> qla2400 0000:10:00.1: scsi(1:0:0:1): Mid-layer underflow detected (68000 of 68000 bytes)...returning error status.
> SCSI error : <1 0 0 1> return code = 0x70000

I would suggest filing another bug against the scsi subsystem or scsi driver.  If the error indicates a hardware problem rather than a software bug, they can let you know.

But ext3 is just reacting in this case.

-Eric

Comment 41 Juan J. Cavallaro 2009-11-27 19:29:19 UTC
Thank you Eric! I will follow your advice.