Bug 468391
Summary: | EXT3-fs error (device dm-0) in start_transaction: Journal has aborted | ||
---|---|---|---|
Product: | Red Hat Enterprise Linux 4 | Reporter: | Vivek Goyal <vgoyal> |
Component: | kernel | Assignee: | Tomas Henzl <thenzl> |
Status: | CLOSED WONTFIX | QA Contact: | Martin Jenner <mjenner> |
Severity: | medium | Docs Contact: | |
Priority: | medium | ||
Version: | 4.8 | CC: | bhoefer, czfrux, duck, esandeen, herrold, jbacik, jburke, jcavallaro, jtluka, kashyap.desai, mmahut, sathya.prakash, sforsber, vgoyal |
Target Milestone: | rc | Keywords: | OtherQA |
Target Release: | 4.9 | ||
Hardware: | All | ||
OS: | Linux | ||
Whiteboard: | |||
Fixed In Version: | Doc Type: | Bug Fix | |
Doc Text: | Story Points: | --- | |
Clone Of: | Environment: | ||
Last Closed: | 2011-04-26 13:41:37 UTC | Type: | --- |
Description
Vivek Goyal
2008-10-24 14:34:58 UTC
Noticed it one more time.
http://rhts.redhat.com/cgi-bin/rhts/test_log.cgi?id=4828688

What does scrashme do? At first glance it looks to me like ext3 is simply responding to SCSI errors here:

SCSI error : <0 0 1 0> return code = 0x10000
end_request: I/O error, dev sda, sector 225837

and then:

EXT3-fs error (device dm-0) in start_transaction: Journal has aborted

IOW, ext3 looks like the victim to me.

Eric, scrashme is a system call fuzzer test written by David Jones.
http://www.codemonkey.org.uk/projects/scrashme/

One more instance of the problem.
http://rhts.redhat.com/cgi-bin/rhts/test_list.cgi?test_filter=/kernel/syscalls/scrashme/multiple&result=Fail&rwhiteboard=kernel%202.6.9-78.30.EL.vgoyal.test4&arch=i386&jobids=41921

Sathya, I've placed some of the logs in http://people.redhat.com/thenzl/bz468391/

I've retested this on hp-ml310g5-01 and have seen some errors after several hours of running the test. ibm-firefly has been running the tests for a day without an error. I'll try the same with 2.6.9.80.

So far, the tests with 2.6.9.80 have been running without issues on both boxes.

Sathya, I have not seen a panic in the previous logs. This is probably a different problem, so please open a new bz for your issues.

Sathya, I removed the patch version 3.12.29.00rh. It now looks a little better, but I'm still getting these messages:

mptbase: ioc0: IOCStatus=8000 LogInfo=31120403 Originator={PL}, Code={Abort}, SubCode(0x0403)
mptbase: ioc0: IOCStatus=804b LogInfo=31120403 Originator={PL}, Code={Abort}, SubCode(0x0403)
mptbase: ioc0: IOCStatus=804b LogInfo=31120403 Originator={PL}, Code={Abort}, SubCode(0x0403)

Does anyone know whether this is a sign of an error or harmless?

I saw the same issues with kernel 2.6.9-55.ELsmp on hp-ml310g5-01; on ibm-firefly I've never seen a problem. Maybe this is a hardware issue? We should now look at previous results.
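
The mptbase messages quoted above print a 32-bit LogInfo word next to its decoded fields. The following is a minimal sketch of how that word could be split, assuming the SAS layout of 4 bits bus type, 4 bits originator, 8 bits code and 16 bits subcode. That layout is an assumption that happens to agree with the decoded fields shown in this report (0x31120403 with originator PL, code Abort, SubCode 0x0403); it is not taken from the driver source, and the names `SasLogInfo` and `decode_sas_loginfo` are hypothetical.

```python
from typing import NamedTuple


class SasLogInfo(NamedTuple):
    """Fields of a LogInfo word (assumed layout, see the note above)."""
    bus_type: int    # bits 31-28
    originator: int  # bits 27-24 (the log above prints value 1 as "PL")
    code: int        # bits 23-16 (the log above prints value 0x12 as "Abort")
    subcode: int     # bits 15-0


def decode_sas_loginfo(log_info: int) -> SasLogInfo:
    """Split a 32-bit LogInfo value into its assumed bit fields."""
    return SasLogInfo(
        bus_type=(log_info >> 28) & 0xF,
        originator=(log_info >> 24) & 0xF,
        code=(log_info >> 16) & 0xFF,
        subcode=log_info & 0xFFFF,
    )


info = decode_sas_loginfo(0x31120403)
print(f"bus_type={info.bus_type:#x} originator={info.originator:#x} "
      f"code={info.code:#04x} subcode={info.subcode:#06x}")
# prints: bus_type=0x3 originator=0x1 code=0x12 subcode=0x0403
```

Decoding the word this way only separates the fields. Whether a given code/subcode pair is harmless is still a question for the driver or firmware vendor.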
Tomas, you can use the "Executed Tests" report in RHTS to get this type of information. Here is a quick link:
Passes and failures: http://tinyurl.com/scrashme
Just failures: http://tinyurl.com/scrashme-fail

I have the same problem. When will it be possible to fix it? Do you have any workaround?

(In reply to comment #19)
> I have the same problem.
> When will it be possible to fix it?
> Do you have any workaround?
We don't know at the moment what is causing this issue.

Thank you. We're seeing this in our production environment. Red Hat Technical support told us to bring it to this bug. Please let us know if you need more details; we are seeing this issue on a weekly basis.

Kashyap, have you tried to reproduce this? The other issues we had with our testing machine (there were some problems with the remote console) seem to be gone, so we can continue with this again.

Tomas, what are the exact steps to reproduce this? If someone can provide steps, I can try reproducing it here.

(In reply to comment #23)
> What are the exact steps to reproduce this ? If someone can provide steps, I
> can try reproducing here.
This issue can be seen after running the scrashme test for some hours; see Comment #3. We use a modified version adapted to our test system here. I can send you this version too, but it may not be easy to use on its own. The system which had this issue previously (I just started a new test) has this controller (from dmesg):

mptbase: Initiating ioc0 bringup
ioc0: SAS1064E: Capabilities={Initiator}
scsi0 : ioc0: LSISAS1064E, FwRev=01160000h, Ports=1, MaxQ=268, IRQ=169
Using cfq io scheduler

I'm now using another test box; on the previous one I was not able to reproduce this. The last message from the controller, printed from

mpt_reply(MPT_ADAPTER *ioc, u32 pa)
    if (ioc_stat & MPI_IOCSTATUS_FLAG_LOG_INFO_AVAILABLE) {
    ...
    printk(MYIOC_s_INFO_FMT "IOCStatus_1=%04x LogInfo=%08x\n", ioc->name, ioc_stat, log_info);

is this:

kernel: mptbase: ioc0: IOCStatus_1=8000 LogInfo=31170000

The test system uses SAS1064E:

ioc0: SAS1064E: Capabilities={Initiator}
scsi0 : ioc0: LSISAS1064E, FwRev=01172a00h, Ports=1, MaxQ=163, IRQ=169

Maybe our controller does not use the latest firmware? Were you able to reproduce this as well?

It looks like my test box is having some other issues, so I have to wait until they are resolved.

(In reply to comment #21)
> Thank you. We're seeing this in our production environment. Red Hat Technical
> support told us to bring it to this bug. Please let us know if you need more
> details, we are seeing this issue on weekly basis.
Yes, please send us as many details as you can. Our test box seems to have other issues, so it is unusable at the moment. Do you see the same symptoms (error messages), or is your case different? Which RHEL version, arch, etc.? Do you have an easy way to reproduce this?

For anyone reporting this bug, please be sure to look earlier in the logs for storage errors; including the entire dmesg in the report would be most helpful. The error message in the summary will always come -after- other errors, most likely IO or corruption errors, and is not really a bug in and of itself.

I understand. Do you have any suggestion for what we could do to reproduce this, or to get any useful information out of this customer's box? I can provide a sosreport if it helps.
Thanks,

Juan,
> qla2400 0000:10:00.1: scsi(1:0:0:1): Mid-layer underflow detected (68000 of 68000 bytes)...returning error status.
> SCSI error : <1 0 0 1> return code = 0x70000
I would suggest filing another bug against the scsi subsystem or scsi driver. If the error indicates a hardware problem rather than a software bug, they can let you know.
But ext3 is just reacting in this case.
-Eric
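
The "return code" values in the SCSI error lines quoted in this report (0x10000 and 0x70000) can be split into bytes to see which layer reported the failure. This is a rough sketch, assuming the usual Linux SCSI result layout (driver byte, host byte, message byte, status byte, from high to low) and the host byte names as I recall them from the kernel's SCSI headers; treat both as assumptions to verify against the kernel source for the release in question. `host_byte_name` is a hypothetical helper.

```python
# Host byte names as recalled from the Linux SCSI headers (assumption).
HOST_BYTE_NAMES = {
    0x00: "DID_OK",
    0x01: "DID_NO_CONNECT",
    0x02: "DID_BUS_BUSY",
    0x03: "DID_TIME_OUT",
    0x04: "DID_BAD_TARGET",
    0x05: "DID_ABORT",
    0x06: "DID_PARITY",
    0x07: "DID_ERROR",
    0x08: "DID_RESET",
}


def split_scsi_result(result: int) -> dict:
    """Split a 32-bit SCSI result into its four assumed byte fields."""
    return {
        "driver": (result >> 24) & 0xFF,
        "host": (result >> 16) & 0xFF,
        "msg": (result >> 8) & 0xFF,
        "status": result & 0xFF,
    }


def host_byte_name(result: int) -> str:
    """Return the symbolic name of the host byte, if it is in the table."""
    host = split_scsi_result(result)["host"]
    return HOST_BYTE_NAMES.get(host, f"unknown host byte {host:#04x}")


for code in (0x10000, 0x70000):
    print(f"{code:#x}: {host_byte_name(code)}")
```

Under these assumptions both values carry a non-zero host byte and a zero status byte, which fits the reading that the error came from the driver or transport layer and that ext3 only reacted to it.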
Thank you, Eric! I will follow your advice.
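
The advice in this thread to look earlier in the logs for storage errors before the journal abort can be sketched as a small log filter. This is only an illustration: the patterns are taken from the messages quoted in this report and are not an exhaustive list of storage errors, and `storage_errors_before_abort` is a hypothetical name.

```python
import re
from typing import Iterable, List

# Patterns drawn from messages quoted in this report (not exhaustive).
STORAGE_ERROR_PATTERNS = [re.compile(p) for p in (
    r"SCSI error\s*:",
    r"end_request: I/O error",
    r"mptbase: .*LogInfo=",
    r"Mid-layer underflow detected",
)]
JOURNAL_ABORT = re.compile(r"EXT3-fs error .*Journal has aborted")


def storage_errors_before_abort(lines: Iterable[str]) -> List[str]:
    """Return storage-error lines seen before the first journal abort.

    Returns an empty list if no journal abort appears in the input.
    """
    earlier: List[str] = []
    for line in lines:
        if JOURNAL_ABORT.search(line):
            return earlier
        if any(p.search(line) for p in STORAGE_ERROR_PATTERNS):
            earlier.append(line.rstrip("\n"))
    return []


sample = [
    "SCSI error : <0 0 1 0> return code = 0x10000",
    "end_request: I/O error, dev sda, sector 225837",
    "EXT3-fs error (device dm-0) in start_transaction: Journal has aborted",
]
for hit in storage_errors_before_abort(sample):
    print(hit)
```

In practice the input would be the full dmesg or /var/log/messages content, for example `storage_errors_before_abort(open(path))` with a path of your choosing. Attaching the complete log to the report is still preferable to attaching only the filtered lines.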