Bug 213921
Description
Issue Tracker
2006-11-03 19:31:10 UTC
We've received several reports of ext3 filesystems on SAN storage repeatedly going read-only, often on different filesystems, different arrays, or different hosts attached to the same arrays. This happens with a single path, with dm-multipath, or with EMC PowerPath. It seems to only happen on active/active arrays, such as EMC Symmetrix, and not on active/passive arrays, such as EMC Clariion in the same environment, but it is unclear whether this is due to the multipathing or simply the I/O load. Setting ql2xprocessrscn=1 (qla2xxx) or lpfc_use_adisc=1 (lpfc) helps in some of these cases. As of RHEL 4 U4, the problem is not known to still occur with lpfc when lpfc_use_adisc=1 is set. Both the RHEL 4 U4 version of the qla2xxx driver and version 8.01.06 from qlogic.com seem to be affected, though we also have one report of 8.01.06 fixing the problem, so it may be an improvement but not a complete fix. Currently, signs point to an inefficiency in handling RSCNs in the qla2xxx driver, which causes SCSI timeouts, which eventually propagate up the stack to cause an ext3 error.
Created attachment 140585 [details]
metadata test script
Created attachment 140586 [details]
Script to handle running of multiple instances of IOZone
Created attachment 140598 [details]
version w/o random delays in loop.
I don't yet have my hands on the customer's switch logs. Attached is a
copy of the ioloadgen script version they're actually using (without the random
delays that are in the later version). I sent a note to the customer requesting
the switch logs if they still have them. Also, just to remind everyone, now
that this is a public bug, please be sensitive about sharing customer data. For
sensitive items, we can e-mail them, or I can look into making this bug private
again for only the people on the cc list. So, if you're not on the cc list yet,
please add yourself. Thanks.
Just to be clear on the 'logs' request: QLogic would like to see qla2xxx driver logs (with extended_error_logging enabled), not switch logs, so that we can better understand the failure as well as any relevant contributing precursors.

E-mailed emc grab from 106 to EMC and QLogic folks. Customer working on providing another one from 007 as well. Testing on 007 didn't happen last night due to scheduling issues. Toure is working with the customer to get the 007 testing kicked off right now. Also, just to be clear, the emc grab I e-mailed out was taken post-failure on 106 while testing under the following conditions: new QLogic driver, Oracle running normally, new metadata script, same ioloadgen script as before, PP in RE mode.

Just sent out grab from 007; the test over the weekend was run without any multipath software loaded/running.

Plan: If the current testing of 007 (dm-multipath in multibus mode) goes read-only, we need to validate the secondary path. To find which port is the primary one, block it with dm-multipath running, then use "multipath -ll" output to determine whether that was the primary or secondary. Make sure the primary is blocked for the test. Disable all multipath software on 007, and test down the secondary path only. Pass/fail of that test will determine whether we need to look at dm-multipath or PowerPath further. Our thinking: they may be intercepting and blocking us from seeing low-level I/O failures. If possible, we will then start instrumenting dm-multipath code.

Revised plan: Test 007 as above, but also kick off a similar test on 106 (with the new QLogic driver). Later on, if 106 fails, then we'll know that 007 might have also failed if given that same amount of time. This could save us a headache down the road while providing the flexibility of terminating a potentially non-failing test on 007 tomorrow morning. Given a failure on 007 under dm-multipath tomorrow morning, the next step is as above: validate no-multipath and I/O down the secondary path.
Leave 106 running if it doesn't fail.

QLogic/EMC: When you have any status on your testing, if you could post it along with your respective setups, that would be helpful for tracking purposes.

From Andrew:
"""Could you or the customer perhaps shed some light on when the first failure occurred (date/time) along with a description of which events precipitated the 'failure' -- I'm just trying to get a baseline here, as from what I can see, the last driver load before the weekend run occurred on November 3rd:
...
Nov 3 10:37:30 chln106 kernel: QLogic Fibre Channel HBA Driver
...
with the additional noise of the LUNZ luns (sda, sdb, sdk, sdt) being inaccessible:
...
Nov 3 10:37:36 chln106 kernel: sdt: asking for cache data failed
Nov 3 10:37:36 chln106 kernel: sdt: assuming drive cache: write through
Nov 3 10:37:36 chln106 kernel: sdt : READ CAPACITY failed.
Nov 3 10:37:36 chln106 kernel: sdt : status=1, message=00, host=0, driver=08
...
Nov 4 10:40:04 chln106 kernel: end_request: I/O error, dev sdb, sector 0
Nov 4 10:40:04 chln106 kernel: Buffer I/O error on device sdb, logical block 0
Nov 4 10:40:04 chln106 kernel: end_request: I/O error, dev sdb, sector 0
Nov 4 10:40:04 chln106 kernel: Buffer I/O error on device sdb, logical block 0
Nov 4 10:40:04 chln106 kernel: end_request: I/O error, dev sdb, sector 0
Nov 4 10:40:04 chln106 kernel: Buffer I/O error on device sdb, logical block 0

and finally a sudden ext3-error 2.5 days later:
...
Nov 6 01:29:02 chln106 kernel: EXT3-fs error (device dm-18): ext3_readdir: directory #2342945 contains a hole at offset 163840
Nov 6 01:29:02 chln106 kernel: Aborting journal on device dm-18.
Nov 6 01:29:02 chln106 kernel: ext3_abort called.
Nov 6 01:29:02 chln106 kernel: EXT3-fs error (device dm-18): ext3_journal_start_sb: Detected aborted journal
Nov 6 01:29:02 chln106 kernel: Remounting filesystem read-only
Nov 6 01:29:02 chln106 kernel: EXT3-fs error (device dm-18) in start_transaction: Journal has aborted
Nov 6 01:29:33 chln106 last message repeated 575 times
...

I'm also trying to drill down to the exact scsi device used with the dm-18 device; am I correct in interpreting the logs in that dm-18 is actually part of /dev/emcpowerb -- which raw scsi-devices are part of that physical volume? In any case, there do not appear to be *any* fabric related entries logged by the device driver (or for that matter, anything) immediately before or in the near vicinity of the ext3 FS going read-only. The last driver entries were those logged during init-time (three days earlier). If in fact there were any RSCNs from the switch, the driver would have logged something like the following to the messages file:
...
scsi(0): Asynchronous RSCR UPDATE.
scsi(0): RSCN database changed -- 0070 0000
..."""

From Jerry:
"""emcpowerb is:
Pseudo name=emcpowerb
Symmetrix ID=000190101044
Logical device ID=00E3
state=alive; policy=SymmOpt; priority=0; queued-IOs=1
==============================================================================
---------------- Host --------------- - Stor - -- I/O Path - -- Stats ---
### HW Path I/O Paths Interf. Mode State Q-IOs Errors
==============================================================================
0 qla2xxx sdi FA 7aB active alive 0 0
1 qla2xxx sdr FA 10aB active alive 1 0"""

Andrew, to your question about context: that was the failure which occurred on 106 in the middle of the night, with the 8.01.06 QLogic driver, and while _not_ deliberately testing/stressing anything. No testing was done on 106 between the Friday reboot and the read-only at 1:29 Monday morning. (Whether or not the system was stressed because of some automated Oracle/System activity is another question.)
Unfortunately, the driver was loaded without the extended_error_logging parameter enabled during driver load:

Nov 3 10:10:10 chln007 kernel: qla2xxx: Unknown parameter `q12xfailover'
Nov 3 10:10:10 chln007 kernel: QLogic Fibre Channel HBA Driver
Nov 3 10:10:10 chln007 kernel: ACPI: PCI interrupt 0000:05:08.0[A] -> GSI 32 (level, low) -> IRQ 233
Nov 3 10:10:11 chln007 kernel: qla2300 0000:05:08.0: Found an ISP2312, irq 233, iobase 0xf8822000
Nov 3 10:10:11 chln007 kernel: qla2300 0000:05:08.0: Configuring PCI space...
Nov 3 10:10:11 chln007 kernel: qla2300 0000:05:08.0: Configure NVRAM parameters...
Nov 3 10:10:11 chln007 kernel: qla2300 0000:05:08.0: Verifying loaded RISC code...
Nov 3 10:10:11 chln007 kernel: qla2300 0000:05:08.0: Waiting for LIP to complete...
Nov 3 10:10:11 chln007 kernel: qla2300 0000:05:08.0: LOOP UP detected (2 Gbps).
Nov 3 10:10:11 chln007 kernel: qla2300 0000:05:08.0: Topology - (F_Port), Host Loop address 0xffff
Nov 3 10:10:11 chln007 kernel: scsi0 : qla2xxx
Nov 3 10:10:11 chln007 kernel: qla2300 0000:05:08.0:
Nov 3 10:10:11 chln007 kernel: QLogic Fibre Channel HBA Driver: 8.01.04-d7
Nov 3 10:10:11 chln007 kernel: QLogic QLA2342 -
Nov 3 10:10:11 chln007 kernel: ISP2312: PCI-X (100 MHz) @ 0000:05:08.0 hdma+, host#=0, fw=3.03.20 IPX

If it had been, there would have been a '-debug' appended to the driver version. Some time later, the midlayer in fact reports some failed SCSI I/O with a return status of DID_BUS_BUSY (0x20000), where the first began at 4:06pm:

Nov 5 16:06:01 chln007 kernel: SCSI error : <0 0 2 28> return code = 0x20000
Nov 5 16:06:01 chln007 kernel: end_request: I/O error, dev sdh, sector 128520408
Nov 5 16:06:01 chln007 kernel: end_request: I/O error, dev sdh, sector 128520416
Nov 5 16:06:01 chln007 kernel: SCSI error : <0 0 2 28> return code = 0x20000
...
Nov 5 16:43:54 chln007 kernel: end_request: I/O error, dev sdd, sector 88879880
Nov 5 16:43:54 chln007 kernel: end_request: I/O error, dev sdd, sector 88879888
Nov 5 16:43:54 chln007 kernel: SCSI error : <0 0 2 28> return code = 0x20000
Nov 5 16:43:54 chln007 kernel: end_request: I/O error, dev sdh, sector 72414784
Nov 5 16:43:54 chln007 kernel: end_request: I/O error, dev sdh, sector 72414792
Nov 5 16:43:54 chln007 kernel: SCSI error : <0 0 2 14> return code = 0x20000
Nov 5 16:43:54 chln007 kernel: end_request: I/O error, dev sdd, sector 145749192
Nov 5 16:43:54 chln007 kernel: end_request: I/O error, dev sdd, sector 145749200
Nov 5 16:43:54 chln007 kernel: SCSI error : <0 0 2 14> return code = 0x20000
Nov 5 16:43:54 chln007 kernel: end_request: I/O error, dev sdd, sector 88879920

against luns sdh, sdg, sdd. This continues until the first relevant 'EXT3 goes read-only failure' occurs at 6:43pm:

Nov 5 18:43:12 chln007 kernel: EXT3-fs error (device dm-7): ext3_readdir: directory #1020212 contains a hole at offset 45056
Nov 5 18:43:12 chln007 kernel: Aborting journal on device dm-7.
Nov 5 18:43:12 chln007 kernel: EXT3-fs error (device dm-7) in start_transaction: Readonly filesystem
Nov 5 18:43:12 chln007 kernel: Aborting journal on device dm-7.

Unfortunately, since extended_error_logging wasn't enabled, I can't say with certainty why the commands are being returned to the midlayer with a DID_BUS_BUSY status. Some possibilities include:
- SAN activity, causing the driver to want to recycle commands until SNS scans complete.
- Command timeouts, storage not responding.
- Storage logged out of the initiator (LOGO).
- Dropped frames, FC frame failures.

In any case, for each of the commands, if the fast-fail bit is set (it typically is when DM/MD is involved), then the midlayer will *not* retry the command, but instead return the failed status to the upper layers, eventually propagating to the block-layer caller.
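As a side note on reading these midlayer messages: the return code the SCSI midlayer prints packs several status bytes into one word, with the host byte in bits 16-23. A quick sketch (plain arithmetic, not from the bug data) decoding the 0x20000 seen in the logs:

```shell
# Decode the host byte from a SCSI midlayer return code such as 0x20000.
# The host byte occupies bits 16-23; 0x02 is DID_BUS_BUSY.
printf 'host byte: 0x%02x\n' $(( (0x20000 >> 16) & 0xff ))   # prints: host byte: 0x02
```

This is why every "return code = 0x20000" line above is a DID_BUS_BUSY completion.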
Now if the fast-fail bit is cleared (unset), then retries are perhaps being exhausted, as typically the midlayer should retry the commands in hopes the SAN has quiesced. If I could get a good failure with 8.01.04-d7 and extended_error_logging enabled, we'd have much more confidence in suggesting next steps. At this point, all that's known is that for some reason the driver is returning back commands with DID_BUS_BUSY statuses. Note again, this is a different signature than that of the 106 run with 8.01.06 and extended_error_logging enabled.

One comment on #18 above: that system was being exercised at that time with IOZone, the metadata script, and having someone blocking/unblocking a port within the same zone on the SAN (to trigger RSCNs). It's unfortunate extended error logging was not working; they have confirmed with me that the option is there. It's possible that the initial ram disk wasn't rebuilt (as required) and so the option never made it into the module as it was loaded. I'm following up with the customer on this last point.

All, just for the record, the plan:
for 007: run with _no_ multipath, with the primary path disabled (see my note below)
for 106: run with dm-multipath, the new QLogic driver, IOZone, metadata-test, and normal Oracle load for as long as possible (preferably through the weekend if possible, understanding this is a production box).

This event sent from IssueTracker by cevich issue 97316

Current status: No failure occurred under dm-multipath on 007 after 16+ hours of testing. We've verified the extended error logging is turned on now and functional on 007. Configuration issues prevented testing on 106, though they are being fixed now. We hope to kick off a long-running test on 106 with normal Oracle load + new QLogic driver + metadata script + ioloadgen script.

Plan: 007 switched back to PowerPath in RE mode and back to the original version of the metadata script.
We're doing this both to validate possibly reproducing a failure faster with the old script (as was done 4 times before), and to confirm that we can reproduce on 007 in RE mode with the old QLogic driver. If the old script proves to reproduce faster, I will be posting that information in addition to a copy of the old script to this BZ.

Good news from EMC: they were successful in reproducing a read-only condition after 14 hours of testing using the (attached) metadata script in addition to IOZone load generation, but without Oracle. Do we have any status or updates on reproduction efforts from the QLogic side?

Customer recreated the failure on 007 using PowerPath in RE mode, with the latest metadata test script, and an older version of the load generator script. This was a miscommunication, but is informative nonetheless. It validates that we can recreate the issue with PP in RE mode on 007. It validates that we can recreate with the new metadata script. It also further confirms that it can be recreated without Oracle. I will be e-mailing out the EMC grabs and other data shortly.

Will try the attached metadata script to see if the problem can be reproduced here. So far we haven't been able to reproduce it.

Re: #24, you'll also want to run the IOZone based load test. So far, we've only seen this problem occur under extremely high load conditions (both system and SAN).

Re: #20, if there's a script switching ports on and off in the fabric (to generate RSCNs), then I have a feeling the I/Os are being returned with BUS_BUSY due to:
- SAN activity, causing the driver to want to recycle commands until SNS scans complete.

As I mentioned before: in any case, for each of the commands, if the fast-fail bit is set (it typically is when DM/MD is involved), then the midlayer will *not* retry the command, but instead return the failed status to the upper layers, eventually propagating to the block-layer caller. Are the commands being issued with the fast-fail bit set?
If so, then that would certainly account for the journaling errors, as the data is never being written/read: the DID_BUS_BUSY statuses are being directly funneled to the upper layers without the request retry being done... 8.01.06 has some changes in the semantics of command handling during SAN scanning (as was mentioned before): commands will be returned with DID_IMM_RETRY instead of DID_BUS_BUSY, so as to avoid issues surrounding DM/MD's use of fast-fail and its effect of prematurely marking a given I/O as 'failed'.

Update: Overnight testing with dm-multipath on 106 and 007 did not reproduce any failures. Testing on 106 will be stopped EOB today. Testing on 007 will continue through the weekend.

We identified the problem that was preventing the qla module from turning on extended error logging. The options line contained "q12xfailover=0" instead of "ql2xfailover=0" (a 1 (one) instead of an "l" (ell)). This has been fixed, the ramdisk rebuilt, and the test restarted.

Re: #26, thanks for the details. Is it your recommendation that the "fast fail" bit not be set under these circumstances? Is anyone aware of possible consequences of not setting it? Is there a requirement for this bit to be set when multipath is in use vs. normal use, or against local SCSI disks?

Created attachment 140912 [details]
One of the switch logs from testing done yesterday.
Created attachment 140913 [details]
Other switch's error log from testing yesterday.
If they are using dm_emc then the REQ_FAILFAST bit is used; otherwise it's not. But my question is: if this were a problem with dm multipathing failing too quickly, then wouldn't this problem occur on dm multipathing? From what I've seen it's not reproduced with DM, and if it does reproduce, it takes much longer than with PP. I would agree that REQ_FAILFAST would be contributing to the problem, but only if we were seeing lots of failures with DM.

Regarding the latest 007 logs (from yesterday): these 007 logs from yesterday contain a run with the driver still not having extended error logging enabled:

Nov 9 09:58:39 chln007 kernel: qla2300 0000:05:08.1:
Nov 9 09:58:39 chln007 kernel: QLogic Fibre Channel HBA Driver: 8.01.04-d7
Nov 9 09:58:39 chln007 kernel: QLogic QLA2342 -
Nov 9 09:58:39 chln007 kernel: ISP2312: PCI-X (100 MHz) @ 0000:05:08.1 hdma+, host#=1, fw=3.03.20 IPX

Note, no '-debug' suffix. This is irrelevant though, as the block layer *should* have logged some sort of 'failure' message (if the failure began from a block-layer device) before the sudden 'goes read-only':

Nov 9 14:19:48 chln007 kernel: EXT3-fs error (device dm-7): ext3_readdir: directory #1216813 contains a hole at offset 159744
Nov 9 14:19:48 chln007 kernel: Aborting journal on device dm-7.
Nov 9 14:19:48 chln007 kernel: ext3_abort called.
Nov 9 14:19:48 chln007 kernel: EXT3-fs error (device dm-7): ext3_journal_start_sb: Detected aborted journal

Instead, there is nothing of the sort. Quite unlike the previous logs, where ports on the SAN were being disabled and enabled (causing RSCNs to be pushed to the HBA (initiator)).
Re: #31, from a block-driver perspective, I can't say which type of I/Os are failing from the November 5th run, either data or meta-data (journaling), as the midlayer is simply reporting some failure, probably due to fast-fail and the LLDD (qla2xxx) returning back DID_BUS_BUSY during SAN disruptions to recycle the command (again, this is a best guess given descriptions of what was occurring from emails [no extended_error_logging]). There appear to be *many* failures before ext3 goes readonly:

Nov 5 16:06:01 chln007 kernel: SCSI error : <0 0 2 28> return code = 0x20000
Nov 5 16:06:01 chln007 kernel: end_request: I/O error, dev sdh, sector 128520408
Nov 5 16:06:01 chln007 kernel: end_request: I/O error, dev sdh, sector 128520416
Nov 5 16:06:01 chln007 kernel: SCSI error : <0 0 2 28> return code = 0x20000
Nov 5 16:06:01 chln007 kernel: end_request: I/O error, dev sdh, sector 128520456
Nov 5 16:06:01 chln007 kernel: end_request: I/O error, dev sdh, sector 128520464
...
Nov 5 16:06:02 chln007 kernel: end_request: I/O error, dev sdh, sector 128520720
Nov 5 16:06:02 chln007 kernel: end_request: I/O error, dev sdh, sector 128520728
Nov 5 16:06:02 chln007 kernel: SCSI error : <0 0 2 28> return code = 0x20000
...
Nov 5 16:32:33 chln007 kernel: end_request: I/O error, dev sdg, sector 13741104
Nov 5 16:32:33 chln007 kernel: end_request: I/O error, dev sdg, sector 13741112
Nov 5 16:32:33 chln007 kernel: SCSI error : <0 0 2 27> return code = 0x20000
Nov 5 16:32:33 chln007 kernel: end_request: I/O error, dev sdg, sector 13741136
Nov 5 16:32:33 chln007 kernel: SCSI error : <0 0 2 27> return code = 0x20000
...
Nov 5 18:08:47 chln007 kernel: SCSI error : <0 0 2 27> return code = 0x20000
Nov 5 18:08:47 chln007 kernel: end_request: I/O error, dev sdg, sector 17729784
Nov 5 18:08:47 chln007 kernel: end_request: I/O error, dev sdg, sector 17729792
Nov 5 18:16:55 chln007 su(pam_unix)[23001]: session closed for user root
Nov 5 18:16:57 chln007 sshd(pam_unix)[19572]: session closed for user lidn4nj
Nov 5 18:43:12 chln007 kernel: EXT3-fs error (device dm-7): ext3_readdir: directory #1020212 contains a hole at offset 45056
Nov 5 18:43:12 chln007 kernel: Aborting journal on device dm-7.
Nov 5 18:43:12 chln007 kernel: EXT3-fs error (device dm-7) in start_transaction: Readonly filesystem
Nov 5 18:43:12 chln007 kernel: Aborting journal on device dm-7.

actually 2hrs.

Re: #32, please see the update in #27; the problem was caused by a typo on the options line (using a 1 (one) instead of an l (ell)). An on-site tech verified dmesg output on 007 after fixing this and confirmed extended error logging is now enabled.

In response to comments 26, 27, and 31, from PP engineering: It does, or at least it does as recently as PowerPath 4.5.1. But so do all instances of dm-multipath that work on all storage systems. Possibly the "queue_if_no_path" dm-multipath feature is enabled, which is preventing the error from surfacing to ext3 in the dm-multipath case. From reading the Red Hat bugzilla entry, this seems like a problem in the handling of the DID_BUS_BUSY host portion of a SCSI command error, since a DID_BUS_BUSY used to be "infinitely retryable" and now receives no retries due to the use of REQ_FAILFAST. It certainly could happen that each path used by PowerPath is failing simply due to the LLD returning DID_BUS_BUSY.
(In reply to comment #18)
> This continues until the first
> relevant 'EXT3 goes read-only failure' occurs at 6:43pm:
>
> Nov 5 18:43:12 chln007 kernel: EXT3-fs error (device dm-7): ext3_readdir:
> directory #1020212 contains a hole at offset 45056

That error message is giving the inode number and offset for a hole in a directory inode -- that's what's causing the filesystem to go readonly. It would be interesting to run debugfs on the filesystem *before* fscking it to verify that there is a hole there -- that would tell us if the problem is on magnetic media or if it is in the transport of data from the media to the VFS layer.

Chip

(In reply to comment #36)
> (In reply to comment #18)
> > This continues until the first
> > relevant 'EXT3 goes read-only failure' occurs at 6:43pm:
> >
> > Nov 5 18:43:12 chln007 kernel: EXT3-fs error (device dm-7): ext3_readdir:
> > directory #1020212 contains a hole at offset 45056
>
> That error message is giving the inode number and offset for a hole in a
> directory inode -- that's what's causing the filesystem to go readonly. It
> would be interesting to run debugfs on the filesystem *before* fscking it to
> verify that there is a hole there -- that would tell us if the problem is on
> magnetic media or if it is in the transport of data from the media to the VFS layer.

BTW, the same type of failure happened on both hosts (007 and 106). I repost the error message from 106 below:

Nov 6 01:29:02 chln106 kernel: EXT3-fs error (device dm-18): ext3_readdir: directory #2342945 contains a hole at offset 163840

and from 007:

Nov 5 18:43:12 chln007 kernel: EXT3-fs error (device dm-7): ext3_readdir: directory #1020212 contains a hole at offset 45056

Is this "directory inode contains a hole" failure common to all of the crashes?

Chip

(In reply to comment #31)
> If they are using dm_emc then the REQ_FAILFAST bit is used, else it's not.

Are you sure?
I found this in drivers/block/ll_rw_blk.c:__make_request

	/*
	 * inherit FAILFAST from bio (for read-ahead, and explicit FAILFAST)
	 */
	if (bio_rw_ahead(bio) || bio_failfast(bio))
		req->flags |= REQ_FAILFAST;

from which it looks like all read-ahead bios are marked to fail fast.

Chip

(In reply to comment #37)
> BTW, the same type of failure happened on both hosts (007 and 106). I repost
> the error message from 106 below
>
> Nov 6 01:29:02 chln106 kernel: EXT3-fs error (device dm-18): ext3_readdir:
> directory #2342945 contains a hole at offset 163840
>
> and from 007
>
> Nov 5 18:43:12 chln007 kernel: EXT3-fs error (device dm-7): ext3_readdir:
> directory #1020212 contains a hole at offset 45056
>
> Is this "directory inode contains a hole" failure common to all of the crashes?

Looks like it is; this is from comment 32 above:

Nov 9 14:19:48 chln007 kernel: EXT3-fs error (device dm-7): ext3_readdir: directory #1216813 contains a hole at offset 159744

Chip

I wanted to report that we have reproduced this 3 times now. The latest was this morning:

Nov 15 07:09:16 L8 kernel: EXT3-fs error (device emcpowera1): ext3_readdir: directory #4505601 contains a hole at offset 12288
Nov 15 07:09:16 L8 kernel: Aborting journal on device emcpowera1.
Nov 15 07:09:16 L8 kernel: ext3_abort called.
Nov 15 07:09:16 L8 kernel: EXT3-fs error (device emcpowera1): ext3_journal_start_sb: Detected aborted journal
Nov 15 07:09:16 L8 kernel: Remounting filesystem read-only
Nov 15 07:09:16 L8 kernel: EXT3-fs error (device emcpowera1) in start_transaction: Journal has aborted

The first test took 17 hours, the second 5, and the third 60 hours. All of the errors occurred on the file system exercised by the "tpsreport.sh" script. There are no errors in the messages file previous to the file system going read-only, and no fabric activity.
Our environment:
HP DL385 (AMD)
DMX2000, McData ED64M
2x QLogic QLA2342 PCI to Fibre Channel Host Adapter, driver version 8.01.02-d4
RHEL4u3, 2.6.9-34.ELsmp
PowerPath 4.5.1

The current test is running with PowerPath debugging turned up.

OK, the tpsreport.sh generated another ext3 read-only error:

----------- Wed Nov 15 12:24:12 EST 2006 looping
----------- Wed Nov 15 12:24:12 EST 2006 creating files
::::::::>Error - 1 Create failure on /test2/metadata/0/9/metadata_14055_5626.tmp, Wed Nov 15 12:41:49 EST 2006
End Time: Wed Nov 15 12:41:49 EST 2006

We see from the script that this is generated from an attempted write into this file:

<snip>
while [ ${NUM} -le ${NUM_FILES} ]; do
    FILE=${FILE_PREV}_${RAN}_${NUM}.tmp
    echo 'hello world' > ${FILE}
    checkRC $? "Create failure on ${FILE}"
<snip>

This file is actually present (but empty) in the file system:

[root@L8 9]# pwd
/test2/metadata/0/9
[root@L8 9]# ls -ial metadata_14055_5626.tmp
6334315 -rw-r--r-- 1 root root 0 Nov 15 12:41 metadata_14055_5626.tmp
[root@L8 9]# cat metadata_14055_5626.tmp

The previous file is fine:

[root@L8 9]# ls -ial metadata_14055_5625.tmp
6334314 -rw-r--r-- 1 root root 12 Nov 15 12:41 metadata_14055_5625.tmp
[root@L8 9]# cat metadata_14055_5625.tmp
hello world

The error is curious though:

Nov 15 12:41:49 L8 kernel: EXT3-fs error (device emcpowerb1): ext3_readdir: directory #5750785 contains a hole at offset 4096
Nov 15 12:41:49 L8 kernel: Aborting journal on device emcpowerb1.
Nov 15 12:41:49 L8 kernel: ext3_abort called.
Nov 15 12:41:49 L8 kernel: EXT3-fs error (device emcpowerb1): ext3_journal_start_sb: Detected aborted journal
Nov 15 12:41:49 L8 kernel: Remounting filesystem read-only
Nov 15 12:41:49 L8 kernel: EXT3-fs error (device emcpowerb1) in start_transaction: Journal has aborted

This is not the directory for our "problem" file.
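For readers without the attached script: the checkRC helper called in the excerpt above is not shown there. A minimal hypothetical reconstruction, consistent with the "::::::::>Error - 1 Create failure on ..." line the script printed, might look like:

```shell
# Hypothetical sketch of checkRC (the real helper lives in the attached
# tpsreport.sh and is not shown in the excerpt): report and stop on a
# non-zero exit status, echoing the status, message, and timestamp.
checkRC() {
    rc=$1
    msg=$2
    if [ "$rc" -ne 0 ]; then
        echo "::::::::>Error - ${rc} ${msg}, $(date)"
        exit "$rc"
    fi
}

# Example call mirroring the script's usage (succeeds, so nothing is printed):
echo 'hello world' > /tmp/metadata_demo.tmp
checkRC $? "Create failure on /tmp/metadata_demo.tmp"
```

The point of such a helper is that the first failed write aborts the loop immediately, which is why the failing file is left present but empty.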
[root@L8 ~]# debugfs /dev/emcpowerb1
debugfs 1.35 (28-Feb-2004)
debugfs: cd <5750785>
debugfs: pwd
[pwd] INODE: 5750785 PATH: /metadata/3/6
[root] INODE: 2 PATH: /
debugfs: stat <5750785>
Inode: 5750785 Type: directory Mode: 0755 Flags: 0x0 Generation: 1681403964
User: 0 Group: 0 Size: 1916928
File ACL: 11469322 Directory ACL: 0
Links: 2 Blockcount: 3760
Fragment: Address: 0 Number: 0 Size: 0
ctime: 0x455b486c -- Wed Nov 15 12:03:40 2006
atime: 0x455b515d -- Wed Nov 15 12:41:49 2006
mtime: 0x455b486c -- Wed Nov 15 12:03:40 2006
BLOCKS:
(0-11):11528192-11528203, (IND):11528204, (12-79):11528205-11528272, (80):11528285, (81):11528414, (82):11528543, (83):11528672, (84):11528801, (85):11528930, (86):11529059, (87):11529188, (88):11529317, (89):11529446, (90):11529575, (91):11529704, (92):11529833, (93):11529962, (94):11530091, (95):11530220, (96):11530349, (97):11530478, (98):11530607, (99):11530736, (100):11530865, (101):11530994, (102):11531123, (103):11531252, (104):11531381, (105):11531510, (106):11531639, (107):11531768, (108):11531897, (109):11532026, (110):11532155, (111):11532284, (112):11532413, (113):11532542, (114):11532671, (115):11532800, (116):11532929, (117):11533058, (118):11533187, (119):11533316, (120):11533445, (121):11533574, (122):11533703, (123):11533832, (124):11533961, (125):11534090, (126):11534219, (127):11534872, (128-156):11534988-11535016, (157-467):11535024-11535334
TOTAL: 469

Chip... Is there any specific debugfs (logdump?) command I can run to provide additional "useful" information?

Keith, I have set up a system in our lab to try to reproduce the problem. I don't have exactly the same HW (no Symmetrix), but I do have a multipath active/active array. I would like to know exactly how you are running the tests to reproduce the problem. I have:

ioloadgen-1.3.tgz (runs iozone-3-271)
tpsreport.sh (metadata exerciser)

Based on what I read in the BZ, I am running these both on two LUNs, each with a separate ext3 fs.
Is that all you are doing? Any more than two LUNs needed? My test ran until the disk was full. It filled up because tpsreport.sh has quotes around the path name:

rm -rf "${BASE}/*"

so files never get deleted between loops. Did you change this test script? If you have any more details or suggestions on the best way to reproduce the problem, please let us know.

Tom

Hi Tom, I did not change the script, and I noticed that the file system does fill up after 48 hours or so in my environment; I simply restart it. I have two LUNs also (73GB). Initially I executed tpsreport.sh on one and ioloadgen on the other. After 14 hours (not 17, my mistake) the one running tpsreport.sh went read-only. The other three times it was also the tpsreport.sh script that went read-only. Yesterday I combined running both scripts against the same LUN and it went read-only in about 2 hours. I left this LUN in a read-only state for investigation. I have run both scripts on the other LUN and it is still running (19 hours and counting).

-Keith

I've looked at e2images from a couple of other customers experiencing the same problem, and they have *not* had null pages on disk, with the exception of one case with a very badly mangled PowerPath installation. In general, it seems we get the "hole" message because ext3 sees an empty buffer that was never filled due to an I/O error. Can we test this with slightly modified ext3 code? Unfortunately the "hole" message could be more helpful; if nothing else it could print the block number & device it encountered the problem on... it would also be good to know whether ext3_get_blocks_handle or ext3_bread is what failed just ahead of the message.

-Eric

I am willing to test with modified ext3 code. However, the error does indicate the device, directory, and offset already. The block number would be nice.
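Returning to Tom's note about the quoting in tpsreport.sh: a glob inside double quotes is not expanded by the shell, so `rm -rf "${BASE}/*"` looks for a file literally named `*` and removes nothing (silently, because of -f). A minimal demonstration in a throwaway directory:

```shell
# Inside double quotes the * is literal; outside the quotes it is expanded.
BASE=$(mktemp -d)
touch "${BASE}/a.tmp" "${BASE}/b.tmp"

rm -rf "${BASE}/*"     # glob quoted: nothing matches, both files survive
ls "${BASE}"           # a.tmp and b.tmp still listed

rm -rf "${BASE}"/*     # glob outside the quotes: both files removed
ls "${BASE}"           # directory now empty
```

This matches the observed behavior of the test filling the disk: the cleanup step never deleted anything between loops.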
Nov 15 12:41:49 L8 kernel: EXT3-fs error (device emcpowerb1): ext3_readdir: directory #5750785 contains a hole at offset 4096

In this test emcpowerb1 is a meta volume on a DMX that is seen down two paths:

Pseudo name=emcpowerb
Symmetrix ID=000187490063
Logical device ID=01B9
state=alive; policy=SymmOpt; priority=0; queued-IOs=0
==============================================================================
---------------- Host --------------- - Stor - -- I/O Path - -- Stats ---
### HW Path I/O Paths Interf. Mode State Q-IOs Errors
==============================================================================
0 qla2xxx sdb FA 3aA active alive 0 0
1 qla2xxx sdd FA 4aA active alive 0 0

At this time we are focusing on generating a vmcore. I have not seen the problem in the lab since Nov 15 12:41:49.

We experienced another read-only error today at approximately 15:33:

EXT3-fs error (device emcpowera1): ext3_readdir: directory #6225921 contains a hole at offset 4096

I have placed the log, vmcore and System map files here: ftp://ftp.emc.com/outgoing/213921/

(In reply to comment #47)
> We experienced another read-only error today at approximately 15:33.
>
> EXT3-fs error (device emcpowera1): ext3_readdir: directory #6225921 contains a
> hole at offset 4096
>
> I have placed the log, vmcore and System map files here:
> ftp://ftp.emc.com/outgoing/213921/

Excellent. I'm downloading this now, and will post an analysis as soon as I have one.

Chip

I have tried to reproduce the problem using dm-multipath and a Dot Hill RAID box. This is not as performant as the Symmetrix, but it is active/active. I have run four overnight runs with a number of configurations, but have not reproduced the failure. I plan to add more adapters and continue testing. It would help me if anyone who has seen this problem on dm-multipath could send me the result of:

multipath -ll
ls -l /dev/mpath/

I will use that to set up my configuration to match as closely as possible.
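When comparing configurations across reports, the path states can be pulled out of saved `powermt display` or `multipath -ll` output mechanically. A small sketch (the sample lines are copied from the powermt excerpt quoted in this bug; the grep pattern is an assumption about the column layout):

```shell
# Count paths reported "alive" in a saved powermt display excerpt.
# Sample lines copied from the path table quoted earlier in this bug.
cat > /tmp/paths.txt <<'EOF'
0 qla2xxx sdb FA 3aA active alive 0 0
1 qla2xxx sdd FA 4aA active alive 0 0
EOF
grep -c ' alive ' /tmp/paths.txt   # prints 2
```

A mismatch between the counted alive paths and the expected path count is a quick check that a reporter's configuration matches the one being reproduced.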
(In reply to comment #48)
> (In reply to comment #47)
> > We experienced another read-only error today at approximately 15:33.
> >
> > EXT3-fs error (device emcpowera1): ext3_readdir: directory #6225921 contains a
> > hole at offset 4096
> >
> > I have placed the log, vmcore and System map files here:
> > ftp://ftp.emc.com/outgoing/213921/
>
> Excellent. I'm downloading this now, and will post an analysis as soon as I
> have one.

I've done some preliminary work on this core. The panic happened in process context, in an "ls" command forked from the script "tpsreport.sh". The error in the kernel log

EXT3-fs error (device emcpowera1): ext3_readdir: directory #6225921 contains a hole at offset 4096

gives the inode number of a directory inode, and the file position within that directory inode that could not be read. The error message comes from fs/ext3/dir.c:134 in the function ext3_readdir. Essentially, what this function was doing when the error occurred was iterating down the inode one buffer (block) at a time and doing an ext3_bread on each buffer. When ext3_bread returns NULL, it indicates that it was not able to read the buffer, so ext3_readdir generates the message and (because the filesystem is mounted with errors=panic) panics. As can be seen from the source, there are two ways the ext3_bread function can return NULL:

struct buffer_head *ext3_bread(handle_t *handle, struct inode *inode,
                               int block, int create, int *err)
{
        struct buffer_head *bh;
        int prev_blocks;

        prev_blocks = inode->i_blocks;

        bh = ext3_getblk (handle, inode, block, create, err);
        if (!bh)
                return bh;
        if (buffer_uptodate(bh))
                return bh;
        ll_rw_block (READ, 1, &bh);
        wait_on_buffer (bh);
        if (buffer_uptodate(bh))
                return bh;
        brelse (bh);
        *err = -EIO;
        return NULL;
}

The first is if ext3_getblk itself returns NULL; the second is if, after starting the I/O with ll_rw_block(READ, 1, &bh) and waiting for the I/O to complete, the buffer is still not up to date. In the second case, the value pointed to by the err argument is set to -EIO.
-EIO is -5, or 0xfffffffb in 32-bit hexadecimal. Since err in ext3_readdir is an automatic variable, it is allocated on the stack, and this stack address is passed in to ext3_bread. Therefore, if this value was set, it should appear in the stack frame for ext3_readdir, and indeed it does:

#16 [1002c461e00] ext3_readdir at ffffffffa00b4be5
    1002c461e08: 0000000000000001 0000000000000000
    1002c461e18: 000001007fd3b800 0000000000000000
    1002c461e28: 0000000000000000 ffffffff80189618
    1002c461e38: 000001002c461f38 fffffffb00000002
    1002c461e48: 000001003f082d40 000001007f850a00
    1002c461e58: 0000010072f2acc0 ffffffff801d1f8b

crash> rd -32 1002c461e44
1002c461e44: fffffffb  ....

Furthermore, disassembly of ext3_readdir shows the position of the err variable on the stack to be at offset 0x3c (N.B. the x86-64 calling convention puts the first six arguments in %rdi, %rsi, %rdx, %rcx, %r8, %r9, respectively):

0xffffffffa00b4ba0 <ext3_readdir+0x245>: lea 0x3c(%rsp),%r8

and, indeed, subtracting 0x3c from the address 0x1002c461e44 gives the base of the stack for ext3_readdir:

crash> eval 1002c461e44 - 3c
hexadecimal: 1002c461e08

(the return address is at stack base - 0x8). This increases my confidence that the code path through ext3_bread went via ll_rw_block and wait_on_buffer; additional evidence for this is the presence of io_schedule in the stack trace -- the function wait_on_buffer is an inline wrapper around io_schedule.

Conclusion so far: ext3_bread failed because the block could not be read.
At this point, we can find the address of the buffer_head on the stack for ext3_bread; it is at 0x10061c68ea0:

crash> buffer_head 0x10061c68ea0
struct buffer_head {
  b_state = 0x18,
  b_this_page = 0x10061c68ea0,   <== points back to the same buffer_head
  b_page = 0x100409932b8,
  b_count = {
    counter = 0x1
  },
  b_size = 0x1000,               <== (0x1000 == 4096); entire page
  b_blocknr = 0xbe7801,
  b_data = 0x1006b931000 "",
  b_bdev = 0x1003fde4640,
  b_end_io = 0xffffffff80178cb2 <end_buffer_read_sync>,
  b_private = 0x0,
  b_assoc_buffers = {
    next = 0x10061c68ee8,
    prev = 0x10061c68ee8
  }
}

From the b_size and b_this_page fields we can see that this block is 4096 bytes, so we are dealing with blocks that are equal in size to pages (this can vary depending on the block device); blocks exist in 1-1 correspondence with pages. Furthermore, from the offset in the error message, we can see that this is the second block in the directory inode. Also note that ll_rw_block sets b_end_io to <end_buffer_read_sync>, further evidence that the code path went via ll_rw_block.

ll_rw_block would have submitted the read request to the block subsystem by calling the submit_bh function, which allocates a bio and sets bio->bi_private to the address of the buffer_head. So if the bio has not already been recycled, we should be able to locate it by searching kernel memory for occurrences of the buffer_head address:

crash> search -k 0000010061c68ea0
1002c461d68: 10061c68ea0
1002c461d98: 10061c68ea0
1002c461dc8: 10061c68ea0
1002c461de8: 10061c68ea0
100409932c8: 10061c68ea0
10040e21580: 10061c68ea0
10061c68ea8: 10061c68ea0  <=== buffer_head itself
1007ffcd260: 10061c68ea0  <=== in the 1-k slab cache

I haven't been able to back-cast any of these addresses into a struct bio (with offset 0x58). So the bio might already have been recycled when the dump started; I'm still poking around.
Chip

The files from chln106 show mpath25 -> ../dm-7, and mpath25 uses hwhandler="1 emc" and is composed of sda sdb sdk sdt. An earlier sysreport from this system shows these devices are DGC Model: LUNZ. Those are Clariion pseudo devices, not legitimate storage devices. Earlier failure reports in the IT show errors on this device:

Jul 31 12:59:23 chln106 kernel: EXT3-fs error (device dm-7): ext3_readdir: directory #869102 contains a hole at offset 262144

(More recent failures on chln007 also show dm-7. Are these systems identical?) If this is the current configuration, they should not be doing I/O to dm-7. On our phone call we heard about two recent failures with dm-multipath on chln106, 11/13 and 11/20. Please post a current sysreport and the logs showing both of these errors.

(In reply to comment #51)
> we should be able to locate it by searching kernel memory for occurrences of the
> buffer_head address:
>
> crash> search -k 0000010061c68ea0
> 1002c461d68: 10061c68ea0
> 1002c461d98: 10061c68ea0
> 1002c461dc8: 10061c68ea0
> 1002c461de8: 10061c68ea0
> 100409932c8: 10061c68ea0
> 10040e21580: 10061c68ea0
> 10061c68ea8: 10061c68ea0  <=== buffer_head itself
> 1007ffcd260: 10061c68ea0  <=== in the 1-k slab cache

I've made some more progress in deciphering these:

1002c461d68: 10061c68ea0  <=== bh_wait_queue on kernel stack in io_schedule
1002c461d98: 10061c68ea0  <=== bh_wait_queue on kernel stack in io_schedule
1002c461dc8: 10061c68ea0  <=== close to ffffffffa00b8810 (t) ext3_bread+0x72
1002c461de8: 10061c68ea0  <=== close to ffffffffa00b4be5 (t) ext3_readdir+0x28a
100409932c8: 10061c68ea0
10040e21580: 10061c68ea0
10061c68ea8: 10061c68ea0  <=== buffer_head->b_this_page of the buffer_head itself
1007ffcd260: 10061c68ea0  <=== per-cpu cache in the buffer_head slab cache (not interesting)

One thing that's puzzling is that there are *two* bh_wait_queues on the kernel stack in io_schedule.
Chip

(In reply to comment #56)
> 1002c461dc8: 10061c68ea0  <=== close to ffffffffa00b8810 (t) ext3_bread+0x72
> 1002c461de8: 10061c68ea0  <=== close to ffffffffa00b4be5 (t) ext3_readdir+0x28a

These two are just places where the buffer_head was pushed on the kernel stack.

Chip

(In reply to comment #56)
>
> I've made some more progress in deciphering these:
>
> 1002c461d68: 10061c68ea0  <=== bh_wait_queue on kernel stack in io_schedule
> 1002c461d98: 10061c68ea0  <=== bh_wait_queue on kernel stack in io_schedule
[...]
> One thing that's puzzling is that there are *two* bh_wait_queues on the kernel
> stack in io_schedule.

Closer examination of these bh_wait_queues makes them look even more mysterious. Here's the first one:

crash> bh_wait_queue 1002c461d68
struct bh_wait_queue {
  bh = 0x10061c68ea0,
  wait = {
    flags = 0x0,
    task = 0x10007b347f0,
    func = 0xffffffff80178ad2 <bh_wake_function>,
    task_list = {
      next = 0x1002c461db8,
      prev = 0x1002c461db8
    }
  }
}

and the second one:

crash> bh_wait_queue 1002c461d98
struct bh_wait_queue {
  bh = 0x10061c68ea0,
  wait = {
    flags = 0x0,
    task = 0x10007b347f0,
    func = 0xffffffff80178ad2 <bh_wake_function>,
    task_list = {
      next = 0x1002c461db8,
      prev = 0x1002c461db8
    }
  }
}

These two are identical copies. In the case of the second one, the task_list indicates that it is the only element of the list: next == prev == 1002c461d98 + 0x20, where 1002c461d98 is the address of the bh_wait_queue and 0x20 is the offset of the task_list within the bh_wait_queue. This is what you would expect after a LIST_HEAD_INIT macro. However, the first bh_wait_queue is identical, but at a different address on the stack. That seems to indicate a list with two entries on it, namely the first and second bh_wait_queues. So the two list entries are inconsistent, and I can't figure out yet how there got to be two of them on the stack.

Chip

Re: "I can't figure out yet how there got to be two of them on the stack."
Could it have anything to do with two background "ls -R" processes running simultaneously? What if they were running on separate CPUs and just happened to be looking at the same block at the same time? (Please forgive my ignorance if this is completely impossible given the kernel code, which I don't fully understand.) Thinking about it some more, I see that my question is impossible, as two processes wouldn't share a stack. Would instrumenting ext3_bread() be of any help in tracking down the source of the duplicate and/or the failure mode?

Could the two seemingly identical bh_wait_queues be the result of improper cleanup before re-issuing a block operation that returned a recoverable error? Something like: a block read times out, ext3 retries it, but instead of reusing the bh_wait_queue it allocates a new one and does not initialize it properly. Perhaps some part of the error recovery path was done assuming bh reuse and another part assuming a fresh bh, and the kernel ends up with a new but only partly initialized bh. Or maybe I should just keep quiet. :)

I was thinking along the same lines as well, but in this core (triggered upon a failure) the bh data is identical. If it were as you suggest, I would expect the data to be different. However, that's not to say: what if different parts of the kernel are looking at the two different bh structs? At some point, one is considered temporary and freed (leaving the data in memory but no longer valid). Subsequently, the in-memory data changes (due to another allocation) and freaks out the kernel. However, in this case too, I would expect the two in-memory structs' data to be different, not identical. Either way, it seems like instrumenting ext3_bread() may give us helpful information, especially since we're fairly sure at what point it's failing.

(In reply to comment #61)
> Could the two seemingly identical bh_wait_queues be the result of improper cleanup
> before re-issuing a block operation that returned a recoverable error?
No. First, it's important to understand that the bh_wait_queues are allocated on the stack. So as soon as we return from the function __wait_on_buffer they should be automatically deallocated by the stack adjustment. These are local automatic variables on the stack, not kmalloc-ed data on the heap. So all I can think of is that the compiler did a defer-stack-adjustment optimization. I'm going to have a look at the assembly to see if that theory makes any sense.

Chip

Ahh, okay, so it could be that ext3_bread() was called twice on the same block, the stack moved, and the previous data, while not in use, still existed in memory. Either way, it's starting to sound as if having these two copies may be a red herring, no?

All, here's a summary of the vendor call:

+Setting /proc/sys/vm/vfs_cache_pressure to 500 will help force more frequent filesystem metadata flushes.
+3 likely failure scenarios:
 1) An actual underlying I/O error that's not being reported for whatever reason.
 2) An error between the driver and filesystem layers.
 3) An error related to metadata cache and/or cache read-ahead (FAST FAIL bit set).
+Problem has been reproduced at EMC w/ PowerPath and _only_ the metadata test script (tpsreport.sh).
+Problem seems more easily reproducible when multiple tests are run in parallel on multiple SAN filesystems:
+Customer setup: SAN LUNs -> PVs -> 1 VG -> 16 LVs -> ext3 -> two ioloadgen and one tpsreport instance(s) on each.
+At this point, we need to generate/analyze as many vmcores as possible (produced with ext3 errors=panic).
+Suggested customer testing: disable dm-multipath queue_if_no_path, everything else the same.
+Suggested vendor testing: dm-multipath, multiple filesystems, multiple tpsreport instances on each.
(Please remind me if I've left anything out.)

The multipath.conf file for testing against Symmetrix should look like this:

devnode_blacklist {
        devnode "^(ram|raw|loop|fd|md|dm-|sr|scd|st)[0-9]*"
        devnode "^hd[a-z][0-9]*"
        devnode "^cciss!c[0-9]d[0-9]*"
}
defaults {
        user_friendly_names yes
}
devices {
        device {
                vendor "EMC"
                product "SYMMETRIX"
                path_grouping_policy multibus
                path_checker tur
                getuid_callout "/sbin/scsi_id -g -u -p0x80 -s /block/%n"
                #feature "1 queue_if_no_path"
                no_path_retry fail
                failback immediate
        }
}

This event sent from IssueTracker by cevich
issue 97316

(In reply to comment #63)
> (In reply to comment #61)
> > Could the two seemingly identical bh_wait_queues be the result of improper cleanup
> > before re-issuing a block operation that returned a recoverable error?
>
> No. First, it's important to understand that the bh_wait_queues are allocated
> on the stack. So as soon as we return from the function __wait_on_buffer they
> should be automatically deallocated by the stack adjustment. These are local
> automatic variables on the stack, not kmalloc-ed data on the heap.
>
> So all I can think of is that the compiler did a defer-stack-adjustment
> optimization. I'm going to have a look at the assembly to see if that theory
> makes any sense.

Curiouser and curiouser. The automatic variables in __wait_on_buffer are

        wait_queue_head_t *wqh = bh_waitq_head(bh);
        DEFINE_BH_WAIT(wait, bh);

(the second line is a macro that expands into an automatic variable declaration). The variable 'wqh' is allocated a register (%r12), and so occupies no space on the stack.
The DEFINE_BH_WAIT macro expands into

        struct bh_wait_queue name = {                           \
                .bh     = b,                                    \
                .wait   = {                                     \
                        .task   = current,                      \
                        .flags  = f,                            \
                        .func   = bh_wake_function,             \
                        .task_list =                            \
                                LIST_HEAD_INIT(name.wait.task_list),\
                },                                              \
        }

The struct has size 0x30 (48 bytes):

crash> bh_wait_queue -o
struct bh_wait_queue {
  [0x0] struct buffer_head *bh;
  [0x8] wait_queue_t wait;
}
SIZE: 0x30

but the function allocates 0x60 (96 bytes) of stack for it:

crash> dis __wait_on_buffer
0xffffffff80178bcf <__wait_on_buffer>:     push %r12
0xffffffff80178bd1 <__wait_on_buffer+2>:   push %rbp
0xffffffff80178bd2 <__wait_on_buffer+3>:   push %rbx
0xffffffff80178bd3 <__wait_on_buffer+4>:   mov %rdi,%rbx
0xffffffff80178bd6 <__wait_on_buffer+7>:   sub $0x60,%rsp
[ ... snip ... ]
0xffffffff80178c5e <__wait_on_buffer+143>: add $0x60,%rsp
0xffffffff80178c62 <__wait_on_buffer+147>: pop %rbx
0xffffffff80178c63 <__wait_on_buffer+148>: pop %rbp
0xffffffff80178c64 <__wait_on_buffer+149>: pop %r12
0xffffffff80178c66 <__wait_on_buffer+151>: retq

/me scratches head.

Chip

Never underestimate the compiler's capacity to come up with weird sh*t.
I've been studying the disassembled code for __wait_on_buffer and yes, indeed it does allocate *two* bh_wait_queue structs on the stack. It fills in the first one with this:

0xffffffff80178be2 <__wait_on_buffer+19>: mov %rbx,(%rsp)
0xffffffff80178be6 <__wait_on_buffer+23>: movl $0x0,0x8(%rsp)
0xffffffff80178bee <__wait_on_buffer+31>: mov %gs:0x0,%rax
0xffffffff80178bf7 <__wait_on_buffer+40>: lea 0x30(%rsp),%rdi
0xffffffff80178bfc <__wait_on_buffer+45>: mov %rax,0x10(%rsp)
0xffffffff80178c01 <__wait_on_buffer+50>: movq $0xffffffff80178ad2,0x18(%rsp)
0xffffffff80178c0a <__wait_on_buffer+59>: mov %rsp,%rsi
0xffffffff80178c0d <__wait_on_buffer+62>: mov $0xc,%ecx  <== set up repz
0xffffffff80178c12 <__wait_on_buffer+67>: lea 0x20(%rdi),%rax
0xffffffff80178c16 <__wait_on_buffer+71>: cld
0xffffffff80178c17 <__wait_on_buffer+72>: mov %rax,0x20(%rsp)
0xffffffff80178c1c <__wait_on_buffer+77>: mov %rax,0x28(%rsp)

and then copies it all into the second one with this (0xc dwords of 4 bytes each = 48 bytes, the full struct):

0xffffffff80178c21 <__wait_on_buffer+82>: repz movsl %ds:(%esi),%es:(%edi)

??

(In reply to comment #64)
> Ahh, okay, so it could be that ext3_bread() was called twice on the same block,
> the stack moved and the previous data while not in use, still existed in memory.
> Either way, it's starting to sound as if having these two copies may be a red
> herring, no?

You're right; my conclusion is that this is a red herring. The two copies of the bh_wait_queue on the __wait_on_buffer stack are just an artifact of some weird stuff the compiler did. The second of the two copies, at 0x30(%rsp), is the real one, and that is the one passed in to prepare_to_wait and finish_wait.

Chip

Though it is curious: is that struct passed by value anywhere? That could also explain two copies of it, the original and one local to a function somewhere. Either way, could we maybe add some instrumentation just before "brelse (bh);" that would help validate the contents of bh?
Also, I find it curious that we've not been able to reproduce this anywhere against a local disk, only SAN devices, and maybe Symmetrix SANs exclusively. In my mind, that suggests the issue logically _must_ have something to do with driver or hardware interaction somewhere along the way. Perhaps we should re-re-validate that this cannot be reproduced on a local disk (using a beefy system). On an unrelated note: wouldn't eliminating the copying of bh provide us with a performance boost in high-metadata-load situations?

Just to summarize: the focus now is on the following lines of code within ext3_bread():

        ll_rw_block (READ, 1, &bh);
        wait_on_buffer (bh);
        if (buffer_uptodate(bh))
                return bh;
        brelse (bh);
        *err = -EIO;
        return NULL;

We know err is being set to -EIO, therefore buffer_uptodate(bh) == FALSE. To me, this reads like a failure with:

1) ll_rw_block() not doing its job for some reason
-and/or-
2) wait_on_buffer() not waiting long enough, or otherwise doing something that results in a difference between the contents of bh and the on-disk data
-and/or-
3) buffer_uptodate() returning a false positive when in fact everything is in sync.

Am I missing anything here? Assuming this issue is definitely not reproducible on local disks, what about the behavior of a SAN and/or HBA drivers could cause these functions/macros to not behave as expected? Could we instrument the kernel to give us more information on any of these (or other lower-level) conditions?

> Assuming this issue is definitely not reproducible on local disks, what about
> the behavior of a SAN and/or HBA drivers could cause these functions/macros to
> not behave as expected?
I have been going on the assumption that multipath is integral to this problem.
Multipath is exceedingly rare on local, non-FC storage.
Multipath code that is common to PowerPath and dm-multipath is the top candidate.
(In reply to comment #72)
> Just to summarize: the focus now is on the following lines of code within
> ext3_bread():
>
>         ll_rw_block (READ, 1, &bh);
>         wait_on_buffer (bh);
>         if (buffer_uptodate(bh))
>                 return bh;
>         brelse (bh);
>         *err = -EIO;
>         return NULL;
>
> We know err is being set to -EIO, therefore buffer_uptodate(bh) == FALSE.

Yes, exactly.

> To me, this reads like a failure with:
>
> 1) ll_rw_block() not doing its job for some reason

My top suspicion.

> -and/or-
> 2) wait_on_buffer() not waiting long enough, or otherwise doing something that
> results in a difference between the contents of bh and the on-disk data

Unlikely, but not impossible.

> -and/or-
> 3) buffer_uptodate() returning a false positive when in fact everything is in sync.

It's just a macro testing a flag, so it's unlikely.

> Could we instrument the kernel to give us more information on any of these (or
> other lower-level) conditions?

Yes, and perhaps I'm being a little too careful about this. There are still some things about the stack trace that I don't completely understand. Take a look at this:

#13 [1002c461d10] ext3_error at ffffffffa00be647
#14 [1002c461d40] io_schedule at ffffffff803053ef
#15 [1002c461de0] ext3_bread at ffffffffa00b8810
#16 [1002c461e00] ext3_readdir at ffffffffa00b4be5

From the code, I would have expected

io_schedule
__wait_on_buffer
ext3_bread
ext3_readdir

or

ext3_error
ext3_readdir

so there must be some deferred stack adjustments muddying the waters. I know from the disassembled code that __wait_on_buffer cleans up its own stack; that may be why it doesn't appear in the backtrace. If we want to tackle this bug from another direction, Tom is right; we should look at the multipath code and try to figure out a way for ext3_bread to fail. It's strange that it only fails for directories, though.

Chip

It's probably failing for both directories and normal files, but read() will return EIO to userspace, which will then (hopefully) retry the operation.
The problem is that readdir() and getdents() do not have the option of returning EIO to userspace, unless we want to "extend" POSIX in a non-portable fashion. Thus, the error has to be handled by the kernel (somehow) instead of being passed through to userspace. We might be able to make this go away by changing how ext3 responds to EIO, but if there's an underlying bug causing that, we're just papering over it.

(In reply to comment #75)
> It's probably failing for both directories and normal files, but read() will
> return EIO to userspace, which will then (hopefully) retry the operation.

Good point.

Chip

OK, here's what I propose. Even though Chris Snook's argument in comment #75 is indisputable, I think we should still restrict our instrumentation to catching the failure to read a directory inode, since otherwise we risk being overwhelmed with false positives. So I will instrument the function end_buffer_read_sync to check whether the buffer_head is associated with a directory inode and panic if the read operation failed. That will at least inch us closer to the problem.

Chip

Created attachment 142617 [details]
instrument end_buffer_read_sync
Created attachment 142618 [details]
instrument end_buffer_read_sync; fix missing semicolon
I took a look at the core from the instrumented kernel. It's definitely the same bug -- ext3_readdir finds a hole in a directory. The surprising thing is that the panic came from the ext3_error function (the "errors=panic" mount option), not from the bugcheck that I added to end_buffer_read_sync. I verified that the bugcheck was there by disassembling the code for that function in the core. There are at least two possible explanations for this:

1. There's something wrong with the instrumentation/bugcheck that I added.
2. end_buffer_read_sync did not run before the waiting process was woken up.

If the latter, then this would explain why the process wakes up and the buffer still isn't marked up-to-date -- one of the things end_buffer_read_sync does is set that flag. I'm going to double-check the logic in my bugcheck.

Chip

Just a thought here... if we are wondering whether end_buffer_read_sync() has run, is there a way, perhaps using a flag variable or array of some sort, to check whether it has run recently? Would logging the buffer head (bh) and the current jiffies be meaningful? It might, since we get a panic, give us some detail of the recent operations. Just a thought at the end of the day. Quite possibly a silly one... ;-)

Created attachment 143176 [details]
memory hog
This is the source code for the memory hog program. Usage is simple:
# ./memhog 95
(must be run as root!) will consume 95% of available physical memory. Note
that it runs as a daemon, so your command prompt will return immediately, but
"top" followed by "M" should show the program memhog with an RSS consuming 95%
of available memory.
Some ideas about reproducing this bug: in every core so far, the problem has been that a low-level block I/O submitted by ext3_readdir to read a directory inode from disk has failed. So I propose that we modify our strategy to exercise this code path as much as possible. What we want are directories containing *lots* of files (a wide, not necessarily deep, directory hierarchy), and enough memory pressure to prevent all of the directory inodes from sitting in the page cache. That is the purpose of the memhog program above; you can shrink the page cache by running another program that consumes a large percentage of physical memory. I also suggest that every "ls -R" should be followed by a "sync" to empty the cache. Since we could very well be dealing with a synchronization problem, it would be a good idea to have processes creating directory entries at the same time that other processes are listing the directory contents.

Chip

(In reply to comment #84)
> Created an attachment (id=143176) [edit]
> memory hog
>
> This is the source code for the memory hog program. Usage is simple:
>
> # ./memhog 95

This program should come with a warning label. It's not hard at all to bring your system to its knees if you specify a percentage that's too high.

Chip

FWIW, sync won't empty the cache; it just flushes modified pages to disk, leaving them cached. But this might help, and force a re-read:

drop_caches
-----------
Writing to this will cause the kernel to drop clean caches, dentries and inodes from memory, causing that memory to become free.

To free pagecache:
	echo 1 > /proc/sys/vm/drop_caches
To free dentries and inodes:
	echo 2 > /proc/sys/vm/drop_caches
To free pagecache, dentries and inodes:
	echo 3 > /proc/sys/vm/drop_caches

As this is a non-destructive operation and dirty objects are not freeable, the user should run `sync' first.
IIRC this was not in RHEL4, needs RHEL5 or later
Created attachment 143187 [details]
metadata test script take 2
An almost complete re-write of the script. I think you'll find that it spends
a great deal more time doing directory listings. However, there's only so much
testing I can do on my little 1 CPU laptop. So, please double-check that it's
doing what we want it to and feel free to fix it as needed.
(In reply to comment #87) > FWIW sync won't empty the cache, it just flushes modified pages to disk, leaving > them cached. Too bad. But the memhog will empty the cache. Chip Created attachment 143188 [details]
Install jprobes in end_bio_bh_io_sync and end_buffer_read_sync
Run this shell script to build and load a jprobe module that will instrument
the two functions end_bio_bh_io_sync and end_buffer_read_sync. We'll get
printk output any time the I/O completes unsuccessfully. My recommendation is
to turn off all the other device driver and SCSI midlayer debugging messages
for the time being so that we can quickly find the output from the
instrumentation in the system logs.
If the script succeeds, the last message it emits is "module loaded".
Otherwise, something went wrong so let me know.
Chip
(In reply to comment #91)
> Created an attachment (id=143188) [edit]
> Install jprobes in end_bio_bh_io_sync and end_buffer_read_sync

Another quick note -- this is to be used instead of the debugging kernel posted earlier. That debugging kernel failed to correctly identify a directory inode; i.e., my instrumentation contained a bug.

Chip

I had a look at the directory inode from the Nov 17 crash (Eric Sandeen is also taking a look) and it certainly does not seem to have a hole at offset 4096 (== 0x1000). It might also be worth pointing out that this directory was quite large; the inode is 2.5 MiB. However, the failure happened when ext3_readdir was reading the second block.

Chip

All, hot off the press: EMC reproduced the problem a half-hour before our call. There were no other reports of reproducing the issue from testing over the weekend. Keith had some interesting observations of what happened this morning:

0) EMC PowerPath + QLogic test system running w/o jprobes.
1) Running 4 instances of the "old" tpsreport.sh script for the last 3 weeks, no memhog.
2) One instance of the tpsreport.sh script was replaced with the "new" version (in bugzilla) at 8:30am EST today.
3) Some time later, Keith logged in to the X console of the box.
4) The system appeared to "hang" partway through the GNOME splash screen (where it starts up the panel, applets, etc.).
5) The system experienced the read-only problem shortly thereafter and dumped core at approx. 11am EST.

We theorized that this reproduction might have been "helped" by the extra memory pressure placed on the system by logging into a GNOME session. Keith will be posting information on obtaining the core (and other details) directly to the Bugzilla. Chip has been working on tweaking the jprobes for this issue as there appear to be some unintended problems. He is asking that any tests being restarted with the new "recipe" (below) wait for the new jprobes he is going to post into the bugzilla.
The old jprobes appear to be generating some "false-positive" data. There was a question regarding using these jprobes in customer production environments and the possible impact. While not specifically designed to cause crashes, they should be reasonably safe, though they may introduce a very slight performance degradation. However, on production systems _not_ also running with SAN filesystems set to errors=panic, having the jprobes (kernel module) loaded will be less useful. Chip said that it will nevertheless still provide somewhat helpful information if customers understand the above.

Testing update:
EMC: Gathering data, will restart testing with the new recipe (below).
QLogic: Testing with QLogic + dm-multipath, no failures; will restart with the new recipe.
Red Hat: Testing on Clariion + PowerPath, no failures. Testing with dm-multipath, no failures.
Customer labs: Various test cases, no failures.

Latest test recipe:
+Wait for Chip to post new jprobes to the Bugzilla
+Install the new jprobes on test system(s)
+Completely disable swap on test system(s)
+For test systems with > 4 CPUs and > 4 tpsreport.sh test instances:
 -Lower "LISTING_RATIO" to 1, or 2 at the most.
 -The concern is that a large number of "ls -laR" processes will overload the system.
+Wait for tpsreport.sh test(s) and system load to "settle down"
+Start "inching" up memhog _slowly_:
 -Consume "most" of the remaining system memory
 -Leave some breathing room
 -It's __very__ easy to start triggering the OOM killer.
+Go get some coffee (or tea).

We experienced another crash at approximately 10:55AM EST today. The vmcore and associated files are located here: ftp://ftp.emc.com/outgoing/213921-2/ This appears almost identical to the Nov 17 crash, the one exception being the following in the log immediately before the panic:

warning: many lost ticks.
Your time source seems to be instable or some driver is hogging interupts
rip __do_softirq+0x4d/0xd0

I assume we can attribute this to the running of the new metadata script from comment #89, which was running on another file system.

Created attachment 143334 [details]
Install jprobes in end_bio_bh_io_sync and end_buffer_read_sync
This is the updated version of the jprobe module. It is recommended that this
module should only be used with PowerPath; in our experiments dm-multipath will
generate tons of spurious output with this module loaded.
Chip
We experienced another crash at approximately 19:05 EST last night. The new jprobe from comment #97 was installed. The vmcore, log, messages file and dump of the directory can be located here: ftp://ftp.emc.com/outgoing/213921-3/

end_bio_bh_io_sync: bio 000001002cb00b80 bh 000001005080fdf0 bytes_done 4096 err -11 sector 98697767
end_buffer_read_sync: page 0000010040682fe0 mapping 000001003f5f5260 inode 000001003f5f5140 uptodate 0
EXT3-fs error (device emcpowerb1): ext3_readdir: directory #6160385 contains a hole at offset 241664
Kernel panic - not syncing: EXT3-fs (device emcpowerb1): panic forced after error

My account reported the following error on a QLogic+PP box they were setting up. Does this error come from the same code path that we're investigating in this bug?

Dec 7 19:50:38 xxxx299 kernel: EXT3-fs error (device dm-7): ext3_find_entry: reading directory #2444902 offset 1

This appears to be a different area of code (namei.c) that is doing the same thing (!buffer_uptodate(bh)):

 872         if (!buffer_uptodate(bh)) {
 873                 /* read error, skip block & hope for the best */
 874                 ext3_error(sb, __FUNCTION__, "reading directory #%lu "
 875                            "offset %lu\n", dir->i_ino, block);
 876                 brelse(bh);
 877                 goto next;
 878         }

Thanks Keith. The SAN filesystem in question is being used to house revision control data; I'd imagine that means it's got lots of directories with lots of files in them. It's good to know there are other ways of hitting the problem via different code paths as well. Hopefully analysis of your latest reproduction's vmcore/jprobes data will also be fruitful.

(In reply to comment #99)
> My account reported the following error on a QLogic+PP box they were setting up.
> Does this error come from the same code path that we're investigating in this bug?
>
> Dec 7 19:50:38 xxxx299 kernel: EXT3-fs error (device dm-7): ext3_find_entry:
> reading directory #2444902 offset 1

"device dm-7" makes me think this is the device mapper, not PP.
Chip

Hi Chip,
To me this just means they have PowerPath devices configured under LVM. I have seen this before as well. A quick "vgdisplay -v" and a review of the /dev tree will tell the story.
-Keith

(In reply to comment #103)
> Hi Chip,
> To me this just means they have PowerPath devices configured under LVM.

Oh. Of course. Nevermind.
Chip

An examination of the latest core from EMC (which had the jprobe module loaded) indicates that I have been barking up the wrong tree. Here's what I can say for certain:

1. The "int err" automatic variable on the stack of the ext3_readdir function contains the value -EIO (0xfffffffb).
2. The I/O error comes from ext3_bread.
3. ext3_bread does not follow the code path through ll_rw_block and wait_on_buffer.

The conclusion is that the call to ext3_getblk in ext3_bread is returning NULL, and setting err = -EIO. I believe the only way that can happen is if ext3_get_block_handle fails, so I would like to instrument this function next.
Chip

Chip, assuming you're writing a new jprobe for ext3_get_block_handle?

Created attachment 143456 [details]
probe ext3_get_block_handle in 2.6.9-34.EL x86_64
This jprobe/kprobe combination instruments the function ext3_get_block_handle.
It will cause a kernel panic in that function if the kernel is failing in
ext3_get_block_handle. This version is only for kernel 2.6.9-34.EL on x86_64
(the version EMC has been reproducing on). I can provide modules for other
kernels/archs if there is interest.
Chip
Thanks Chip! The new jprobe is in and running.

Dec 12 17:57:13 L8 kernel: plant jprobe at ffffffffa00b7c3e

crash> dis ffffffffa00b7c3e
0xffffffffa00b7c3e <ext3_get_block_handle>: int3

We experienced another crash at approximately 03:45PM EST this morning. The new jprobe from comment #107 was installed. The vmcore, log, messages file and dump of the directory can be located here: ftp://ftp.emc.com/outgoing/213921-4/

Please note that I had turned up PowerPath debug for most of yesterday so there are some messages in the log. I turned off PowerPath debug soon after installing the new jprobe (18:55PM EST). While some of the PowerPath debug messages may seem ominous ("MpIodone error" & "Deadlock potential!") they can be considered normal with debugging turned on. As you can see from the large messages file I've included on the ftp site they log continuously.

<2>EXT3-fs error (device emcpowera1): ext3_readdir: directory #2064385 contains a hole at offset 4096
Kernel panic - not syncing: EXT3-fs (device emcpowera1): panic forced after error

This file system was running the "new" tps script.
-Keith

We experienced another crash at approximately 09:24PM EST this morning. The new jprobe from comment #107 was installed. The vmcore, log, messages file and dump of the directory can be located here: ftp://ftp.emc.com/outgoing/213921-5/

EXT3-fs error (device emcpowerb1): ext3_readdir: directory #6127617 contains a hole at offset 8192
Kernel panic - not syncing: EXT3-fs (device emcpowerb1): panic forced after error

-Keith

(In reply to comment #112)
> We experienced another crash at approximately 09:24PM EST this morning. The
> new jprobe from comment #107 was installed.
> The vmcore, log, messages file and dump of the directory can be located here:
> ftp://ftp.emc.com/outgoing/213921-5/
>
> EXT3-fs error (device emcpowerb1): ext3_readdir: directory #6127617 contains a
> hole at offset 8192
> Kernel panic - not syncing: EXT3-fs (device emcpowerb1): panic forced after
> error

From this panic and the previous one, I think I'm still barking up the wrong tree. It should have panic'd before this, in the jprobe module. I will turn another rev on that and get it out ASAP. Also, I wrote a quick and dirty program for reading the directory blocks that you're dumping with debugfs; I'll attach the source code.
Chip

Created attachment 143555 [details]
program to read directory inode dump (gcc -o readdir readdir.c)
Created attachment 143556 [details]
program to read directory inode dump (gcc -o readdir readdir.c)
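The attached readdir.c is not reproduced in this thread. For readers without access to the attachment, here is a hypothetical minimal parser in the same spirit: it walks ext3_dir_entry_2-style records in a raw directory-block dump and reports zero-filled regions (the "holes" ext3_readdir complains about). The struct and function names are invented for illustration; only the on-disk record layout follows the ext3 format, and a little-endian host is assumed.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* On-disk layout of an ext3 directory entry (cf. ext3_dir_entry_2).
 * Assumes a little-endian host, matching the x86_64 dumps in this bug. */
struct ext3_dirent {
    uint32_t inode;     /* inode number, 0 = unused entry */
    uint16_t rec_len;   /* length of this record */
    uint8_t  name_len;  /* length of the name */
    uint8_t  file_type;
    char     name[];    /* name, not NUL-terminated */
};

/* Walk one directory block from a dump; return the number of live entries,
 * or -1 if the block is zero-filled or corrupt (what ext3 reports as a hole). */
static int walk_dir_block(const unsigned char *blk, size_t blksize)
{
    size_t off = 0;
    int count = 0;

    if (blksize >= 8 && memcmp(blk, "\0\0\0\0\0\0\0\0", 8) == 0)
        return -1;  /* zero-filled start: the "hole" case */

    while (off + 8 <= blksize) {
        const struct ext3_dirent *de = (const void *)(blk + off);
        if (de->rec_len < 8 || off + de->rec_len > blksize)
            return -1;  /* record does not fit: corrupt block */
        if (de->inode != 0) {
            printf("ino %u  %.*s\n", de->inode, de->name_len, de->name);
            count++;
        }
        off += de->rec_len;
    }
    return count;
}
```

Run against the debugfs dump of a directory block, a healthy block lists "." and ".." plus its entries, while a block read back as zeros is flagged immediately.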
Created attachment 143567 [details]
probe ext3_bread (2.6.9-34.ELsmp x86_64)
This version probes farther upstream, in ext3_bread. I'm quite certain that
ext3_bread is returning NULL to ext3_readdir, but there are two ways it could
do so. This module sets a probe on each of those two ways, and will bugcheck
if either of them happens. The bugcheck will indicate which one it was.
Chip
Thanks Chip! The new jprobes/kprobes are in and running.

Dec 13 19:56:01 L8 kernel: plant jprobe at ffffffffa00b879e
Dec 13 19:56:01 L8 kernel: plant kprobe at ffffffffa00b87a7
Dec 13 19:56:01 L8 kernel: plant kprobe at ffffffffa00b880b

-Keith

We experienced another crash at approximately 00:47AM EST this morning. The new jprobe/kprobe from comment #116 was installed. The vmcore and log can be located here: ftp://ftp.emc.com/outgoing/213921-6/

Kernel BUG at jprobe:22
RIP: 0010:[<ffffffffa03ee045>] <ffffffffa03ee045> {:jprobe:ext3_bread_before_brelse+9}

Created attachment 143633 [details]
use submit_bh instead of ll_rw_block
This is untested and has not been compiled. One often (by me anyway) forgotten
ll_rw_block feature is that it skips locked buffers, assuming they are already
in flight under IO.
It seems possible some part of the kernel is breaking this rule, and
ext3_getblk seems like as good a suspect as any. So, if someone is locking a
buffer without starting io on it, ext3_bread may return before any IO has
actually been done. I've no proof this is happening, we haven't been able to
reproduce it yet at Oracle.
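To make the rule described above concrete, here is a toy userspace model of ll_rw_block's skip-locked behavior. This is not kernel code and the toy_* names are invented for illustration; it just mirrors the control flow: an already-locked buffer gets no new I/O submitted, so if whoever locked it never started I/O, a later wait would return with the buffer still not up to date.

```c
#include <assert.h>
#include <stdio.h>

/* Toy model of 2.6-era buffer_head state bits (invented struct, not the
 * kernel's). locked/uptodate mirror BH_Lock/BH_Uptodate. */
struct toy_bh { int locked; int uptodate; int io_submitted; };

static void toy_submit_bh(struct toy_bh *bh) { bh->io_submitted = 1; }

/* Mirrors the skip-locked logic of ll_rw_block() for a single READ:
 * if the buffer is already locked, assume I/O is in flight and do nothing. */
static void toy_ll_rw_block_read(struct toy_bh *bh)
{
    if (bh->locked)        /* test_set_buffer_locked() fails: skip entirely */
        return;
    bh->locked = 1;
    if (bh->uptodate) {    /* contents already valid: nothing to read */
        bh->locked = 0;
        return;
    }
    toy_submit_bh(bh);     /* in the kernel, end_io unlocks and sets uptodate */
}
```

If a caller locked the buffer for mutual exclusion without submitting I/O, toy_ll_rw_block_read() is a no-op, and the buffer stays !uptodate after the wait, which is exactly the "hole" symptom in this bug.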
(In reply to comment #119)
> Created an attachment (id=143633) [edit]
> use submit_bh instead of ll_rw_block
>
> This is untested and has not been compiled. One often (by me anyway) forgotten
> ll_rw_block feature is that it skips locked buffers, assuming they are already
> in flight under IO.
>
> It seems possible some part of the kernel is breaking this rule, and
> ext3_getblk seems like as good a suspect as any. So, if someone is locking a
> buffer without starting io on it, ext3_bread may return before any IO has
> actually been done. I've no proof this is happening, we haven't been able to
> reproduce it yet at Oracle.

It's an interesting idea. The upstream kernel has introduced a "READ_META" (an additional flag) to indicate that the ll_rw_block is a metadata read. I haven't found how this is handled differently, yet.
Chip

Hi Chris,
Before testing any fixes I'd like to test with additional jprobes/kprobes to nail down the root cause. The main reason is that in some cases it takes days for a repro to occur. How long do we run with a fix before declaring it valid?

Also, based on the latest vmcore we are getting past a "wait_on_buffer (bh)" inside of ext3_bread. Shouldn't that wait for the locked buffer to complete its I/O?

Kearnan_Keith wrote:
> Hi folks,
>
> With the latest vmcore we panic in the jprobe.
> > Kernel BUG at jprobe:22 > invalid operand: 0000 [1] SMP > CPU 3 > Modules linked in: jprobe(U) parport_pc lp parport netconsole netdump autofs4 i2c_dev i2c_core sunrpc ds yenta_socket pcmcia_c > ore emcphr(U) emcpmpap(U) emcpmpaa(U) emcpmpc(U) emcpmp(U) emcp(U) emcplib(U) button battery ac md5 ipv6 ohci_hcd hw_random tg > 3 floppy dm_snapshot dm_zero dm_mirror ext3 jbd dm_mod qla2300 qla2xxx scsi_transport_fc cciss sd_mod scsi_mod > Pid: 3991, comm: ls Tainted: P 2.6.9-34.ELsmp > RIP: 0010:[<ffffffffa03ee045>] <ffffffffa03ee045>{:jprobe:ext3_bread_before_brelse+9} > > I know the remaining possibilities seem to be: > > 2) wait_on_buffer() not waiting long enough, or otherwise doing something that > results in a difference between the contents of bh and on-disk data > -and/or- > 3) buffer_uptodate() returning a false-positive when in fact everything is in sync. > > Could there be a 4th possibility where in between the following: > > 926 wait_on_buffer (bh); > 927 if (buffer_uptodate(bh)) > > the bh is getting altered? Should a lock_buffer(bh) be inserted here? > > The __wait_on_buffer comment seems to indicate this is a possibility. > > 149 /* > 150 * Block until a buffer comes unlocked. This doesn't stop it > 151 * from becoming locked again - you have to lock it yourself > 152 * if you want to preserve its state. > 153 */ > 154 void __wait_on_buffer(struct buffer_head * bh) > 155 { > 156 wait_queue_head_t *wqh = bh_waitq_head(bh); > 157 DEFINE_BH_WAIT(wait, bh); > 158 > 159 do { > 160 prepare_to_wait(wqh, &wait.wait, TASK_UNINTERRUPTIBLE); > 161 if (buffer_locked(bh)) { > 162 sync_buffer(bh); > 163 io_schedule(); > 164 } > 165 } while (buffer_locked(bh)); > 166 finish_wait(wqh, &wait.wait); > 167 } > > static inline int buffer_uptodate(struct buffer_head * bh) { return test_bit(BH_Uptodate, &bh->b_state); } > > ... no locking? it's atomic, but no MP protection. 
I seem to recall Chip saying that bh is supposed to be locked before the call to ext3_bread, but I could be mistaken. Either way, I have a good feeling about this line of thinking. Could this be one of those nassssssty tricksie little Hobbitsesss: wherein locking higher up the call stack is irrelevant, because we need to lock in ext3_bread (it may be running concurrently somehow) because the higher-up call has already unlocked? Another thing: is there any way the buffer could become unlocked in between lines 161 and 163 above? (I'm assuming this lock on bh is some type of mutex and not a "> 1" semaphore, right?)

(In reply to comment #121)
> Hi Chris,
> Before testing any fixes I'd like to test with additional jprobes/kprobes to
> nail down the root cause.

The patch Chris submitted could be considered as a diagnostic. Basically, it implements the same logic as ll_rw_block, with one important difference. ll_rw_block skips the buffer_head if it is locked, Chris' version skips the buffer_head if it is up-to-date.

If the bug reproduces with Chris' patch installed, then we know that the problem really is low-level I/O and not buffer locking/synchronization.
Chip

It seemed strange that we suddenly stopped getting errors from the device layer. I just double checked, and the end_request: I/O error message only comes out at the default log level. Has the console log level been increased to the max on boxes where this has been reproduced recently? Otherwise we may be missing messages that point away from the filesystem.

(In reply to comment #123)
> The patch Chris submitted could be considered as a diagnostic. Basically, it
> implements the same logic as ll_rw_block, with one important difference.
> ll_rw_block skips the buffer_head if it is locked, Chris' version skips the
> buffer_head if it is up-to-date.

This thought deserves some fleshing out. If a buffer is locked, that means an I/O is pending. When the I/O completes, the buffer will be up-to-date unless the I/O failed.
That, at least, is how it is supposed to work (ref the function end_buffer_read_sync near line 198 of fs/buffer.c). The ll_rw_block call in ext3_bread will not submit an I/O if the buffer is already locked. The logic there is correct; if there is already an I/O pending on the buffer, there is no point in submitting another one. Whether the pending I/O is a read or a write, when it completes the buffer should be up-to-date, unless the I/O fails for some reason. So we have two possibilities here: 1. The buffer locking and synchronization is somehow broken. 2. The I/Os are failing for some reason. The probability of the second possibility is much higher, since the buffer locking/synchronization is used by all I/O subsystems, not just SANs. We are only seeing this bug on SANs. Chip Created attachment 143648 [details]
more verbose check
This patch is similar, (and also untested/compiled), but if it finds a locked
buffer that is not up to date, and the buffer is still not up to date after
just waiting on it, a log message will be printed:
ext3 wait after failed lock, not uptodate
It then locks the buffer and tries the read. If the read produces an up to
date buffer, another message is printed. This way we'll be able to tell if
we're hitting this potential bug.
Please review this carefully before using it.
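As a sketch of the diagnostic sequence this patch description outlines, here is a toy userspace model. All names are invented and the real patch operates on struct buffer_head inside ext3_bread; this only illustrates the flow: wait on a buffer someone else locked, and if it is still not up to date, log it, then lock and retry the read ourselves, logging again if the retry produces good data.

```c
#include <assert.h>
#include <stdio.h>

struct toy_bh { int locked; int uptodate; };

/* Stands in for wait_on_buffer(): the other owner's failed fast-fail
 * readahead completes, unlocking the buffer WITHOUT setting uptodate. */
static void toy_wait_on_buffer(struct toy_bh *bh)
{
    bh->locked = 0;
}

/* Stands in for a synchronous retry read that succeeds:
 * end_io on success sets uptodate and unlocks. */
static void toy_retry_read(struct toy_bh *bh)
{
    bh->locked = 1;
    bh->uptodate = 1;
    bh->locked = 0;
}

/* The diagnostic sequence described above. Returns 0 if the wait was enough,
 * 1 if the retry fixed a stale buffer (both printks fire), -1 if even the
 * retry failed (a genuine I/O error below us). */
static int toy_verbose_check(struct toy_bh *bh)
{
    toy_wait_on_buffer(bh);
    if (bh->uptodate)
        return 0;
    printf("ext3 wait after failed lock, not uptodate\n");
    toy_retry_read(bh);
    if (bh->uptodate) {
        printf("ext3 after failed lock, now good\n");
        return 1;
    }
    return -1;
}
```

Seeing both messages in the logs would mean the original wait returned a bad buffer that a plain retry could fix, i.e. a transient failure rather than a dead path.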
(In reply to comment #123) > (In reply to comment #121) > > Hi Chris, > > Before testing any fixes I'd like to test with additional jprobes/kprobes to > > nail down the root cause. > The patch Chris submitted could be considered as a diagnostic. Basically, it > implements the same logic as ll_rw_block, with one important difference. > ll_rw_block skips the buffer_head if it is locked, Chris' version skips the > buffer_head if it is up-to-date. > If the bug reproduces with Chris' patch installed, then we know that the problem > really is low-level I/O and not buffer locking/synchronization. > Chip The only reason I can't totally agree with your conclusion here is because the patch also adds buffer locking. So if it fixes it we cannot jump to the conclusion that the root cause is low-level I/O. (In reply to comment #127) > > The only reason I can't totally agree with your conclusion here is because the > patch also adds buffer locking. So if it fixes it we cannot jump to the > conclusion that the root cause is low-level I/O. It's the same locking that's in ll_rw_block, ported up to ext3_bread. In fact, you could think of the patch as being a re-implementation of ll_rw_block in ext3_bread for the case of interest (nr == 1, rw == READ), with the one difference I described in comment #123. However, like I said in comment #125, I think it is more likely that the problem is caused by failing I/O than a buffer locking/synchronization problem, if only because every I/O device in the kernel uses the same buffer synchronization methods. We could implement the patch from comment #126 in the next kprobe/jprobe version, which might be interesting. Chip Chip RE: comment #125 Yes, if a buffer is locked it is supposed to mean an IO is pending. But, you only need one caller that thinks locking a buffer means excluding other users to trigger the problem. ext3_getblk looks very suspect to me. 
It probably is not possible to trigger that particular bug due to inode semaphores, but still. With all of that said, problems at the lower layers are much more likely. Earlier in the bug it looked like you had jprobes in the end_io routines that were being called with up_to_date = 0, and then later on we were only able to see errors higher up the stack in ext3. Is this true? (In reply to comment #129) > RE: comment #125 > > Yes, if a buffer is locked it is supposed to mean an IO is pending. But, you > only need one caller that thinks locking a buffer means excluding other users > to trigger the problem. ext3_getblk looks very suspect to me. Comment above journal_get_create_access in fs/jbd/transaction.c: /* * When the user wants to journal a newly created buffer_head * (ie. getblk() returned a new buffer and we are going to populate it * manually rather than reading off disk), then we need to keep the * buffer_head locked until it has been completely filled with new * data. The branch in ext3_getblk that executes lock_buffer(bh) is protected by if (buffer_new(&dummy)). The logic is contorted, but I think it's correct. > With all of that said, problems at the lower layers are much more likely. > Earlier in the bug it looked like you had jprobes in the end_io routines that > were being called with up_to_date = 0, and then later on we were only able to > see errors higher up the stack in ext3. Is this true? Yes, it is. But one should never rule out the possibility of bugs in my jprobes .... Chip (In reply to comment #128) > (In reply to comment #127) > > > > The only reason I can't totally agree with your conclusion here is because the > > patch also adds buffer locking. So if it fixes it we cannot jump to the > > conclusion that the root cause is low-level I/O. > It's the same locking that's in ll_rw_block, ported up to ext3_bread. 
> In fact, you could think of the patch as being a re-implementation of ll_rw_block in
> ext3_bread for the case of interest (nr == 1, rw == READ), with the one
> difference I described in comment #123.
> However, like I said in comment #125, I think it is more likely that the problem
> is caused by failing I/O than a buffer locking/synchronization problem, if only
> because every I/O device in the kernel uses the same buffer synchronization methods.
> We could implement the patch from comment #126 in the next kprobe/jprobe
> version, which might be interesting.
> Chip

OK, I see it. I look forward to the next jprobe.

Keith, I was wondering if I could get a little more information on your configuration and the load you're running.
How much memory does the system have?
Are you running only the exact script that you put in an earlier attachment?
Did you add any options to the filesystem other than the defaults, e.g. the number of inodes?
I would just like to align what I'm running with what has successfully reproduced this in the past.
thanks,

(In reply to comment #132)
> Keith, I was wondering if I could get a little more information on your
> configuration and the load you're running.
> How much memory does the system have?

2GB

> Are you running only the exact script that you put in an earlier attachment?

I have run a variety of different jobs and combinations but the most common "set" is below: I have two LUNs that are configured down two paths each. On one LUN I run the script from comment #89 in addition to a single ioloadgen job. On the second LUN I run the original metadata test script "tpsreport.sh" with the modification from comment #42. I also run a single ioloadgen job. I have tried other things in order to induce a failure quicker, but nothing seems to really make it happen quicker.

> Did you add any options to the filesystem other than the defaults, e.g. the number of inodes?
No, the command I used to create the filesystem:

mkfs.ext3 -j /dev/emcpowera1

The command used to mount:

mount -o errors=panic /dev/emcpowera1 /test1

> I would just like to align what I'm running with what has successfully
> reproduced this in the past.
> thanks,

(In reply to comment #126)
> Created an attachment (id=143648) [edit]
> more verbose check
>
> This patch is similar, (and also untested/compiled), but if it finds a locked
> buffer that is not up to date, and the buffer is still not up to date after
> just waiting on it, a log message will be printed:
>
> ext3 wait after failed lock, not uptodate
>
> It then locks the buffer and tries the read. If the read produces an up to
> date buffer, another message is printed. This way we'll be able to tell if
> we're hitting this potential bug.
>
> Please review this carefully before using it.

+	if (buffer_uptodate(bh)) {
+		if (long_wait)
+			printk(KERN_CRIT "ext3 after failed lock, now good\n");
+		unlock_buffer(bh);
+		return bh;
+	}

Unlocking the buffer here instead of in the I/O completion routine seems wrong to me.

On deeper reflection, I'm not sure what we're trying to catch here. Referring to the original version of ext3_bread, if the buffer we get from ext3_getblk is locked and not uptodate, then ll_rw_block is a no-op, but wait_on_buffer will sleep until the other guy's pending I/O completes. That's how it should work, right?

We can detect if the buffer is locked before the call to ll_rw_block, but even if it is, it doesn't necessarily indicate a problem. It just means somebody else started an I/O to the same block first.
Chip

Created attachment 143707 [details]
probe ext3_bread every which way (2.6.9-34.ELsmp.x86_64)
This module drops some additional kprobes in the ext3_bread function to
determine if it calls ll_rw_block with a locked buffer head, and to determine
if the failure of this function is due to an underlying I/O failure or a buffer
synchronization problem.
Chip
Thanks Chip! The new jprobe/kprobe (from comment #135) is installed and running.

Dec 14 18:31:41 L8 kernel: plant kprobe at ffffffffa00b87b8
Dec 14 18:31:41 L8 kernel: plant kprobe at ffffffffa00b87cd
Dec 14 18:31:41 L8 kernel: plant kprobe at ffffffffa00b8802

-Keith

RE: comment 134. The buffer is always unlocked in the end_io call. I'm unlocking it inside the if (buffer_uptodate(bh)) check because no new io is started, and because the buffer was locked above (either by test_set_buffer_locked or lock_buffer).

You're right, there's no significance to a buffer being locked before ll_rw_block. It only matters if it was locked, we wait on it, and it is not up to date after the wait returns. That will trigger the first printk.

printk(KERN_CRIT "ext3 wait after failed lock, not uptodate\n");

But, this doesn't indicate an FS error either. The buffer may not be up to date because of IO error (block layer bug) or because of someone breaking the buffer locking rules (probably FS level bug). In the patch I go ahead and submit the IO again, and do another print if that produces an up to date buffer. But, even this doesn't prove an FS bug because the storage array may have just come back online, or the lower paths may have worked out whatever problems they had. The only way to be 100% sure of the IO error is to instrument the end_io call(s), or lower down the stack.

And please make sure the kernel log level is set high enough to catch all the messages.

So far...
Dec 14 18:31:41 L8 kernel: plant kprobe at ffffffffa00b87b8
Dec 14 18:31:41 L8 kernel: plant kprobe at ffffffffa00b87cd
Dec 14 18:31:41 L8 kernel: plant kprobe at ffffffffa00b8802
Dec 14 18:32:17 L8 kernel: buffer_head 000001005b0a6920 locked before ll_rw_block
<snip>
Dec 14 20:51:51 L8 kernel: buffer_head 0000010035936348 locked before ll_rw_block
Dec 14 20:51:51 L8 kernel: buffer_head 00000100359362f0 locked before ll_rw_block

[root@L8 jprobe-2.6.9-34.EL.x86_64]# cat /var/log/messages | grep buffer_head | wc -l
321

Not as many as I would have expected.

We experienced a crash at approximately 21:40PM EST. The new jprobe/kprobe from comment #135 was installed. The vmcore and log can be located here: ftp://ftp.emc.com/outgoing/213921-7/

buffer_head 0000010028986fa8 locked before ll_rw_block
buffer_head 0000010028986fa8 not uptodate after wait_on_buffer inode 000001001a06fb18
EXT3-fs error (device emcpowera1): ext3_readdir: directory #7045121 contains a hole at offset 4096
Kernel panic - not syncing: EXT3-fs (device emcpowera1): panic forced after error

-Keith

(In reply to comment #139)
> We experienced a crash at approximately 21:40PM EST. The new jprobe/kprobe
> from comment #135 was installed. The vmcore and log can be located here:
> ftp://ftp.emc.com/outgoing/213921-7/
>
> buffer_head 0000010028986fa8 locked before ll_rw_block
> buffer_head 0000010028986fa8 not uptodate after wait_on_buffer inode
> 000001001a06fb18
> EXT3-fs error (device emcpowera1): ext3_readdir: directory #7045121 contains a
> hole at offset 4096
> Kernel panic - not syncing: EXT3-fs (device emcpowera1): panic forced after
> error

Interesting. In light of comment #138, we see that having a locked buffer in ext3_bread before issuing ll_rw_block happened harmlessly some hundreds of times. Then at the end it was a precursor to reproducing our bug. We still don't know if it always happens like this, but it did this time.
Chip

It might possibly be interesting to add the buffer head tracing patch; then, if we get this locked buffer and a subsequent problem, we can ASSERT and get a history of where that buffer has been - see if it truly was locked for IO or for some other reason.

I don't believe the printks from the latest hang rule out io errors from the device. I can almost but not quite explain what we're seeing here.

ext3_readdir will do ll_rw_block(READA) for readahead on directory blocks. READA requests may fail in various ways. With normal disk setups, most of the failures are synchronous, the buffer would be locked and unlocked with the directory semaphore/mutex held, and it would be very difficult to race in and wait on the buffer during the READA request.

But, the dm/multipath code are much less careful (but still safe) with bio_rw_ahead() requests, and these also have the fast fail bit set (as noted earlier here). The page cache does readahead for files, but it is generally much more careful about retrying when there are failures. The ext3/ll_rw_block method is prone to transient errors like these. The fix would be my original patch, but I would suggest instrumentation to make sure these are the failures we're seeing. The best place to probe would be the bio_end_io handler used by submit_bh.

One problem with my theory is that it doesn't explain why the probe into the end_io funcs didn't see the error, or why we aren't seeing the end_request I/O error printk. Is the EMC driver setting REQ_QUIET on failfast or readahead bios that fail? That would at least explain the printk.

Fascinating. So does that mean whatever locked the buffer either did so inappropriately, and/or failed to actually sync the disk/buffer, and/or failed to set the flag indicating the buffer is up to date? If I'm reading this right, doesn't it point to a synchronization problem higher up, since we're not seeing the printks lower down?

The short answer is...I don't know.
It all makes really good sense except the part where we don't get printks from ll_rw_block.c. If we can't explain that, then the best course is to put unlock tracing into the buffer heads, recording the eip of the last 5 or 6 people to call unlock_buffer. I looked this morning for people improperly unlocking buffers without marking them up to date, but the directory semaphore makes those kinds of races very very hard to hit. The fact that we consistently see this on directories makes readahead sound like a pretty good theory to me. (In reply to comment #143) > I don't believe the printks from the latest hang rule out io errors from the > device. Certainly not. > ext3_readdir will do ll_rw_block(READA) for readahead on directory blocks. > READA requests may fail in various ways. With normal disk setups, most of the > failures are synchronous, the buffer would be locked and unlocked with the > directory semaphore/mutex held, and it would be very difficult to race in and > wait on the buffer during the READA request. > > But, the dm/multipath code are much less careful (but still safe) with > bio_rw_ahead() requests, and these also have the fast fail bit set (as noted > earlier here). Well, ext3_bread is definitely waiting for somebody else's I/O to complete, and when it finishes, the I/O has failed. I think there's a good chance that it's a read-ahead with the fast-fail bit set (the read-ahead issued by ext3_readdir itself seems likely to me, too). > The fix would be my original patch, I provisionally disagree. The upstream kernel has a READ_META flag set in ext3_bread's ll_rw_block. I'd prefer to emulate the upstream version of the function if the problem doesn't exist there. > One problem with my theory is that it doesn't explain why the probe into the > end_io funcs didn't see the error or why we aren't seeing the end_request I/O > error printk. Is the EMC driver setting REQ_QUIET on failfast or readahead > bios that fail? That would at least explain the printk. 
We should clear that flag. It might also be interesting to plant a probe that printks the addresses of buffer_heads as we submit requests for read-ahead in ext3_readdir. However, that is likely to generate a lot of output.
Chip

(In reply to comment #145)
> The fact that we consistently see this on directories makes
> readahead sound like a pretty good theory to me.

We could take out the read-ahead. Yes, that will change timings and possibly mask the bug, but I still think it would be interesting.
Chip

As a next course of action I would suggest:
* getting confirmation from EMC about the REQ_QUIET flag
* Checking the dmesg buffer in the crash dump for the printks
* Instrumenting fs/buffer.c:end_bio_bh_io_sync(). There you can look for bio_rw_ahead requests that are failing and print the buffer head address.

I would leave the READA in so we can confirm things.

(In reply to comment #149)
> As a next course of action I would suggest:
> * Instrumenting fs/buffer.c:end_bio_bh_io_sync(). There you can look for
> bio_rw_ahead requests that are failing and print the buffer head address.

Good idea.
Chip

(In reply to comment #149)
> As a next course of action I would suggest:
> * getting confirmation from EMC about the REQ_QUIET flag

PowerPath 4.5.x does not set the REQ_QUIET flag.

From comment #98:
end_bio_bh_io_sync: bio 000001002cb00b80 bh 000001005080fdf0 bytes_done 4096 err -11 sector 98697767
end_buffer_read_sync: page 0000010040682fe0 mapping 000001003f5f5260

EIO is -5, -11 is EAGAIN. Not sure if that is coming from powerpath, qlogic or the scsi layer. ftp://ftp.emc.com/outgoing/213921-3/ still has the data in it. The EAGAINs may be coming from __make_request (it uses EWOULDBLOCK which is the same as EAGAIN). There are a bunch of these end_bio_bh_io_sync errors, it may be worthwhile to look in the dump and see if any of them correspond to the hole bh.

Created attachment 143795 [details]
probe ext3_bread and end_bio_bh_io_sync (2.6.9-34.ELsmp.x86_64)
This version keeps the kprobes in ext3_bread from the previous version, and
adds a jprobe in end_bio_bh_io_sync that drops a printk if a read-ahead
completes with -EIO.
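For reference when reading these probe messages, the negative codes map to Linux errno values as discussed above: -5 is EIO, -11 is EAGAIN (which is the same value as EWOULDBLOCK on Linux). The small decoding helper below is invented for this write-up, not part of any probe module.

```c
#include <assert.h>
#include <errno.h>
#include <string.h>

/* Map the negative error codes printed by the end_bio_bh_io_sync probe to
 * names. On Linux, EIO == 5 and EAGAIN == EWOULDBLOCK == 11, so the EAGAIN
 * case also covers the EWOULDBLOCK returned by a congested __make_request. */
static const char *bio_err_name(int err)
{
    switch (-err) {
    case 0:      return "OK";
    case EIO:    return "EIO";     /* err -5: hard I/O failure */
    case EAGAIN: return "EAGAIN";  /* err -11: fast-failed / would-block */
    default:     return "other";
    }
}
```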
Thanks Chip! The jprobe/kprobe from comment #154 is now running.

Dec 15 13:32:25 L8 kernel: plant kprobe at ffffffffa00b87b8
Dec 15 13:32:25 L8 kernel: plant kprobe at ffffffffa00b87cd
Dec 15 13:32:25 L8 kernel: plant kprobe at ffffffffa00b8802
Dec 15 13:32:25 L8 kernel: plant jprobe at ffffffff80179fbc

-Keith

Created attachment 143801 [details]
touchup the logic in the end_bio_bh_io_sync jprobe
Fixup of 143795
There's only one way I can see to explain this as a READA-related bug (without EIO in end_bio_bh_io_sync): somewhere in the path from ll_rw_block(READA) to end_bio_bh_io_sync getting error -11, the lower layers are putting the IO onto a deferred queue. One place this happens is while device mapper is taking a snapshot. If you snapshot /dev/mapper/foo, all new io to /dev/mapper/foo goes onto a queue while the snapshot is being created. Then the queued io is run from a different process context than the original submission.

Ugh. I missed the e-mail regarding the new jprobe from comment #156. The server crashed at approximately 11:04AM EST this morning with the jprobe/kprobe from comment #154. The vmcore and log are here: ftp://ftp.emc.com/outgoing/213921-8/

buffer_head 00000100297add40 locked before ll_rw_block
buffer_head 00000100297add40 not uptodate after wait_on_buffer inode 0000010039b96548
EXT3-fs error (device emcpowera1): ext3_readdir: directory #2588673 contains a hole at offset 4096
Kernel panic - not syncing: EXT3-fs (device emcpowera1): panic forced after error

I'll update the bugzilla when I start the test with the touched up jprobe.

The jprobe/kprobe from comment #156 is now running.

Dec 16 14:42:59 L8 kernel: plant kprobe at ffffffffa00b87b8
Dec 16 14:42:59 L8 kernel: plant kprobe at ffffffffa00b87cd
Dec 16 14:42:59 L8 kernel: plant kprobe at ffffffffa00b8802
Dec 16 14:42:59 L8 kernel: plant jprobe at ffffffff80179fbc

-Keith

From the logs: dm_snapshot and dm_mirror are both linked in. Is device mapper (or lvm) being used on the filesystems being exercised? Are there snapshots happening on the box during the run (on any ext3 fs)?

(In reply to comment #160)
> From the logs: dm_snapshot and dm_mirror are both linked in.

It is the standard RHEL4u3 install with no modifications to dm.

> Is device mapper (or lvm) being used on the filesystems being exercised?

No.

> Are there snapshots happening on the box during the run (on any ext3 fs)?

No.
[root@L8 ~]# lsmod | grep dm
dm_snapshot            18689  0
dm_zero                 3649  0
dm_mirror              29017  0
dm_mod                 67457  6 dm_snapshot,dm_zero,dm_mirror

Just two PowerPath devices:

[root@L8 ~]# powermt display dev=all
Pseudo name=emcpowerb
Symmetrix ID=000187490063
Logical device ID=01B1
state=alive; policy=SymmOpt; priority=0; queued-IOs=0
==============================================================================
---------------- Host ---------------   - Stor -   -- I/O Path -  -- Stats ---
###  HW Path               I/O Paths    Interf.   Mode     State  Q-IOs Errors
==============================================================================
   0 qla2xxx                  sda       FA  3aA   active   alive      0      0
   1 qla2xxx                  sdc       FA  4aA   active   alive      0      0

Pseudo name=emcpowera
Symmetrix ID=000187490063
Logical device ID=01B9
state=alive; policy=SymmOpt; priority=0; queued-IOs=3
==============================================================================
---------------- Host ---------------   - Stor -   -- I/O Path -  -- Stats ---
###  HW Path               I/O Paths    Interf.   Mode     State  Q-IOs Errors
==============================================================================
   0 qla2xxx                  sdb       FA  3aA   active   alive      0      0
   1 qla2xxx                  sdd       FA  4aA   active   alive      3      0

[root@L8 ~]# df -k | grep emc
/dev/emcpowera1       69591480   5635004  60421376   9% /test1
/dev/emcpowerb1       69591480    300692  65755688   1% /test2

[root@L8 ~]# mount | grep emc
/dev/emcpowera1 on /test1 type ext3 (rw,errors=panic)
/dev/emcpowerb1 on /test2 type ext3 (rw,errors=panic)

-Keith

(In reply to comment #160)
> From the logs: dm_snapshot and dm_mirror are both linked in. Is device mapper
> (or lvm) being used on the filesystems being exercised? Are there snapshots
> happening on the box during the run (on any ext3 fs)?

You're referring to your comment #157 here to try to explain why the end_bio_bh_io_sync jprobe didn't fire? I should point out that the unretouched probe module would only have caught an -EIO failure in end_bio_bh_io_sync.
The touched-up version will catch anything but an -EAGAIN (-EWOULDBLOCK) failure, since there are so many of those and we believe they are harmless. I think we should flog this theory (READA failing fast and causing ext3_bread to fail) a bit more before we give up on it. However, it does seem to be a repeat of our experience with the jprobe from comment #91.

Chip

Chip...we suffered a mid-air collision a minute ago :)

We experienced a crash this morning at approximately 01:35 AM EST. The vmcore and log are here: ftp://ftp.emc.com/outgoing/213921-9/ The jprobe/kprobe from comment #156 was installed.

buffer_head 0000010014483ce8 locked before ll_rw_block
buffer_head 0000010014483ce8 not uptodate after wait_on_buffer inode 000001002f5b8c58
EXT3-fs error (device emcpowera1): ext3_readdir: directory #2080769 contains a hole at offset 4096
Kernel panic - not syncing: EXT3-fs (device emcpowera1): panic forced after error

-Keith

RE: comment #162, I agree the probes right now seem to indicate the only failures we're getting are -EAGAINs. I think we need to instrument the -EAGAINs to see if they are causing the hole panics. If they are, the only explanation I can see is some kind of deferred processing from PowerPath.

Created attachment 143921 [details]
verbose logging in end_bio_bh_io_sync
This does more verbose logging of failures in end_bio_bh_io_sync. With a
healthy system under I/O load, I get a ton of these:
end_bio_bh_io_sync err = -11 bi_rw = 2 bio 000001001cad5c00 buffer_head
000001001919ff50 page 00000100015be0a8 mapping 000001003750fba0 inode
000001003750fa80 isdir 0
What would be interesting is if one of these came up with "isdir 1" and the
inode address is the same as the one reported by the kprobe in ext3_bread.
Note that there might be some distance in the log between the two messages.
Chip
(In reply to comment #166)
> end_bio_bh_io_sync err = -11 bi_rw = 2 bio 000001001cad5c00 buffer_head
> 000001001919ff50 page 00000100015be0a8 mapping 000001003750fba0 inode
> 000001003750fa80 isdir 0

BTW, this inode is in the kernel-internal bdev filesystem. I could check for inode->i_ino == 0 to suppress these messages, but maybe it's better to just let it go.

Chip

Created attachment 143938 [details]
turn off read-ahead in ext3_readdir
I built a kernel with the patch in comment #168 for x86_64; it's up at

http://people.redhat.com/coldwell/bugs/kernel/213921/x86_64/

I'll cook up another one for i686hugemem.

Chip

Now running with the jprobe/kprobe from comment #166.

Dec 18 15:43:01 L8 kernel: plant kprobe at ffffffffa00b87b8
Dec 18 15:43:01 L8 kernel: plant kprobe at ffffffffa00b87cd
Dec 18 15:43:01 L8 kernel: plant kprobe at ffffffffa00b8802
Dec 18 15:43:01 L8 kernel: plant jprobe at ffffffff80179fbc

-Keith

(In reply to comment #169)
> I built a kernel with the patch in comment #168 for x86_64; it's up at
>
> http://people.redhat.com/coldwell/bugs/kernel/213921/x86_64/
>
> I'll cook up another one for i686hugemem.

http://people.redhat.com/coldwell/bugs/kernel/213921/i686-hugemem/

We experienced a crash last night at approximately 20:49. The vmcore and log are located here: ftp://ftp.emc.com/outgoing/213921-10/ The jprobe/kprobe from comment #166 was installed.

buffer_head 0000010067d62190 locked before ll_rw_block
buffer_head 000001001beea088 locked before ll_rw_block
end_bio_bh_io_sync err = -11 bi_rw = 2 bio 000001005759a900 buffer_head 00000100245a7ad8 page 0000010040775378 mapping 00000100332c0820 inode 00000100332c0700 isdir 0
buffer_head 00000100171191e8 locked before ll_rw_block
buffer_head 0000010013ecde48 locked before ll_rw_block
buffer_head 000001004dfa7ce8 locked before ll_rw_block
end_bio_bh_io_sync err = -11 bi_rw = 2 bio 0000010003cd9780 buffer_head 00000100530b11e8 page 0000010040cd4d38 mapping 00000100332c0820 inode 00000100332c0700 isdir 0
buffer_head 0000010006763088 locked before ll_rw_block
end_bio_bh_io_sync err = -11 bi_rw = 2 bio 0000010071eb4200 buffer_head 0000010006763088 page 0000010040683b78 mapping 00000100332c0820 inode 00000100332c0700 isdir 0
buffer_head 0000010006763088 not uptodate after wait_on_buffer inode 00000100732f5300
EXT3-fs error (device emcpowera1): ext3_readdir: directory #294914 contains a hole at offset 4096
Kernel panic - not syncing:
EXT3-fs (device emcpowera1): panic forced after error

-Keith

Fairly good news, I think; this is as close as we can get to showing that the failed buffer came from a READA.

(In reply to comment #172)
>
> buffer_head 0000010006763088 locked before ll_rw_block
> end_bio_bh_io_sync err = -11 bi_rw = 2 bio 0000010071eb4200 buffer_head
> 0000010006763088 page 0000010040683b78 mapping 00000100332c0820 inode
> 00000100332c0700 isdir 0
> buffer_head 0000010006763088 not uptodate after wait_on_buffer inode
> 00000100732f5300
> EXT3-fs error (device emcpowera1): ext3_readdir: directory #294914 contains a
> hole at offset 4096
> Kernel panic - not syncing: EXT3-fs (device emcpowera1): panic forced after
> error

This is the important part here -- the buffer head address matches in all three probes (in ext3_bread before ll_rw_block, in end_bio_bh_io_sync, and back in ext3_bread after wait_on_buffer). And we know it was a read-ahead from the value in bi_rw. So this is the smoking gun.

Chip

(In reply to comment #172)
> buffer_head 0000010006763088 locked before ll_rw_block
> end_bio_bh_io_sync err = -11 bi_rw = 2 bio 0000010071eb4200 buffer_head
> 0000010006763088 page 0000010040683b78 mapping 00000100332c0820 inode
> 00000100332c0700 isdir 0

In fact, this also explains why the previous jprobe didn't catch the error in end_bio_bh_io_sync. I was only dropping a printk when the inode chased down via

struct buffer_head *bh = bio->bi_private;
struct page *page = bh ? bh->b_page : NULL;
struct address_space *mapping = page ? page->mapping : NULL;
struct inode *inode = mapping ? mapping->host : NULL;

satisfied S_ISDIR(inode->i_mode). Of course, it never does, because it is an inode in the kernel-internal bdev filesystem. So it was a bug in my jprobe.

Chip

Created attachment 144006 [details]
Fix ext3_bread and ext3_find_entry
Looking through the other ll_rw_block callers and wait_on_buffer users in ext3,
I think these two are the ones that need fixing to deal with failed READAs.
We'll put this patch through paces here, but I wanted to post for review
because the error cases will rarely be run.
The smaller alternative fix is to use READ instead of READA in ext3/dir.c.
This will have some performance impact though.
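Both patterns can be illustrated with a hypothetical userspace model. The names here (submit_read, bread_buggy, bread_fixed) are invented for the sketch and are not kernel API; the point is only the control flow: a READA may fail fast with -EAGAIN, and the fix retries as a blocking READ instead of reporting a hole.

```c
/* Hypothetical model of the failure mode and the fix.  submit_read()
 * stands in for ll_rw_block(): a READA request may fail fast with
 * -EAGAIN under load; a blocking READ always completes. */
#include <errno.h>

enum rw_hint { MODEL_READ, MODEL_READA };

static int under_load = 1;              /* pretend the queue is congested */

static int submit_read(enum rw_hint hint, int *buffer_uptodate)
{
    if (hint == MODEL_READA && under_load) {
        *buffer_uptodate = 0;           /* read-ahead dropped, like -EAGAIN */
        return -EAGAIN;
    }
    *buffer_uptodate = 1;               /* blocking READ completes */
    return 0;
}

/* Buggy pattern: issue READA, then treat !uptodate as a hole. */
static int bread_buggy(void)
{
    int uptodate;
    submit_read(MODEL_READA, &uptodate);
    return uptodate ? 0 : -EIO;         /* -EIO -> "directory contains a hole" */
}

/* Fixed pattern: if the read-ahead was dropped, retry as a plain READ. */
static int bread_fixed(void)
{
    int uptodate;
    submit_read(MODEL_READA, &uptodate);
    if (!uptodate)
        submit_read(MODEL_READ, &uptodate);
    return uptodate ? 0 : -EIO;
}
```

Under this model, bread_buggy() returns -EIO whenever the system is loaded, while bread_fixed() only gives up after a real blocking read fails -- which matches the intent of the attached patch.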
From http://kerneltrap.org/node/6110

Ext3 and Read Ahead
Posted by saisuman on Friday, January 27, 2006 - 06:54

Was writing a block driver, on which we intended some filesystem to run. So while testing out with ext3, found an interesting problem. While using the filesystem, it threw errors such as: "Directory #11 has a hole at offset 8192. Remounting as read-only." This problem is specific to the way read-ahead is handled by the block layer, and the way the ext3 filesystem uses it. Read-ahead is an optimisation intended by the filesystem, and a block driver comes to know of it by checking the bio->bi_rw flag. It should be "READA". Now, in 2.6, in case the system is under heavy load, a READA bio can return "EAGAIN" (through the bio_end_io callback) so that the request may be retried. So sometimes, we do run into this situation. But the problem is that ext3 never expects to handle this problem, because it uses the old ll_rw_block routine, and never checks for EAGAIN. Ideally, it should be patched to use submit_bio, and check for an EAGAIN return value in the callback. The moral of the story is, if you're writing a filesystem, expect to handle EAGAIN as an error condition if you're using READA for submitting block I/O requests.

Let's not forget there are likely similar code paths within other filesystem code (e.g., ext2), though Chris's proposed patch, or at least its methodology, seems like it would address them AFAIK. Since we know what to look for in the code, let's be careful there aren't any potential corner cases. Seems to me the best course of action is to:

A) Wrap our brains around the upstream code.
B) Weigh whether or not a fix should diverge/converge/complement upstream.
C) In parallel to A and B, test the hell out of any and all potential fixes.
On Linus' git tree:

commit d8733c2956968a01394a4d2a9e97a8b431a78776
Author: Andrew Morton <akpm>
Date:   Thu Mar 23 03:00:11 2006 -0800

    [PATCH] ext3_readdir: use generic readahead

    Linus points out that ext3_readdir's readahead only cuts in when
    ext3_readdir() is operating at the very start of the directory.  So for
    large directories we end up performing no readahead at all and we suck.

    So take it all out and use the core VM's page_cache_readahead().  This
    means that ext3 directory reads will use all of readahead's dynamic
    sizing goop.

    Note that we're using the directory's filp->f_ra to hold the readahead
    state, but readahead is actually being performed against the underlying
    blockdev's address_space.  Fortunately the readahead code is all set up
    to handle this.

    Tested with printk.  It works.  I was struggling to find a real workload
    which actually cared.

    (The patch also exports page_cache_readahead() to GPL modules)

    Cc: "Stephen C. Tweedie" <sct>
    Signed-off-by: Andrew Morton <akpm>
    Signed-off-by: Linus Torvalds <torvalds>

Created attachment 144007 [details]
commit d8733c2956968a01394a4d2a9e97a8b431a78776 from Linus' git tree
This is the upstream patch from Linus' git tree.
Chip
$ grep READA fs/*/*.c
fs/ext3/dir.c:        ll_rw_block (READA, num, bha);
fs/reiserfs/stree.c:  ll_rw_block(READA, 1, bh + j);
fs/udf/dir.c:         ll_rw_block(READA, num, bha);
fs/udf/directory.c:   ll_rw_block(READA, num, bha);

We don't support reiserfs. I think we do support UDF, though. This looks familiar:

struct buffer_head * udf_bread(struct inode * inode, int block, int create, int * err)
{
	struct buffer_head * bh = NULL;

	bh = udf_getblk(inode, block, create, err);
	if (!bh)
		return NULL;

	if (buffer_uptodate(bh))
		return bh;
	ll_rw_block(READ, 1, &bh);
	wait_on_buffer(bh);
	if (buffer_uptodate(bh))
		return bh;
	brelse(bh);
	*err = -EIO;
	return NULL;
}

So maybe we should try to fix this, too.

Beautiful! I'm assuming that using the more generic page_cache_readahead() would also solve the EIO/EAGAIN issue?

(In reply to comment #180)
> Created an attachment (id=144007) [edit]
> commit d8733c2956968a01394a4d2a9e97a8b431a78776 from Linus' git tree
>
> This is the upstream patch from Linus' git tree.

This applies to the RHEL-4 head with some fuzz. I'm building kernels for x86_64 and i686-hugemem. Would appreciate some test coverage.

Chip

reiserfs is also affected, I sent this bug number along to the novell guys so they can fix it up. For udf, I would just s/READA/READ.

(In reply to comment #184)
> reiserfs is also affected, I sent this bug number along to the novell guys so
> they can fix it up. For udf, I would just s/READA/READ.

Yes, I'd agree with that solution.

RE: Using Linus' patch. I'm not against it, but it is a bigger change than it looks at first glance. The existing code does logical readahead of the directory. It maps in the first set of blocks in the directory and reads those blocks, even if they are not sequential. Linus' patch does readahead on the block device. So, it starts reading at a specific block on the disk and it reads the disk sequentially. The blocks it reads may or may not correspond to the directory.
In some workloads, his patch will perform better, and in others it will be worse.

(In reply to comment #186)
> RE: Using Linus' patch.
>
> In some workloads, his patch will perform better, and in others it will be
> worse.

Andrew Morton's comment, "I was struggling to find a real workload which actually cared," seems to indicate that the performance improvement due to any read-ahead on directories is small. At any rate, the culture around here is to emulate the mainline kernel as far as possible, so in order to get your patch accepted by our kernel maintainers, I would first have to get it accepted by Andrew Morton and Linus Torvalds. It's easier to just take the patch they've already included.

Chip

Yes, Linus' motivation was probably git performance, which isn't typical. The existing code really just tries to read small directories all at once, and the patch is something different. I'm fine with either patch, just wanted to make sure people ack'd the whole change.

Sorry this is taking so long. The first test kernel based off of 2.6.9-34.EL.x86_64 is available at

http://people.redhat.com/coldwell/bugs/kernel/213921/x86_64/2.6.9-34.EL/kernel-smp-2.6.9-34.213921.x86_64.rpm

This is the one EMC was most interested in. I'll post the rest as the builds complete.

Chip

Thanks Chip! I have installed the kernel from comment #189:

[root@L8 ~]# uname -a
Linux L8.lss.emc.com 2.6.9-34.213921smp #1 SMP Tue Dec 19 15:10:56 EST 2006 x86_64 x86_64 x86_64 GNU/Linux

I will run with this for a while; I still mounted the file systems with "-o errors=panic".

Created attachment 144044 [details]
Fail all READA I/Os
This patch may induce the failure more quickly than otherwise possible. I was
certain that I'd be able to reproduce the problem on a local test system using
it, but have not been successful thus far. Could someone at EMC give this
patch a try on a known bad kernel to see if the error is triggered more
quickly? If that is the case, then I would recommend adding this patch to the
latest kernel that Chip provided in order to more quickly verify that the bug
is fixed.
Created attachment 144045 [details]
Fail all READA I/Os
(In reply to comment #189)
> I'll post the rest as the builds complete.

The rest of the kernels are up under

http://people.redhat.com/coldwell/bugs/kernel/213921/

Chip

One remaining question is where the -EAGAIN failure is coming from. Keith, please ask the PowerPath people to determine if there are any situations where they might generate this.

There were a few reports of the ext3 read-only bug that involved dm-multipath and not PowerPath. We looked at the dm code and did not see the READA/EAGAIN returns, other than the snapshot case that Chris Mason already mentioned. It would be helpful for anyone who is familiar with the dm-multipath cases to find out whether they also had dm snapshot active.

Jeff's patch in comment #192 should make it fail faster.

We're only doing RA on the first 16 or so blocks, and we start by doing a bread on the first block. On a mostly empty filesystem the directory blocks will probably be sequential, and so all the reada requests will get merged back into the original bio, and the reada bit will get dropped.

So, if you want to trigger it w/Jeff's patch, completely fill a small FS with 4k files, delete 16 of them, create empty files in a new directory until the disk is full, unmount, mount, ls -laR.

I think I see one other case where dm-multipath will trigger. It checks for bio_rw_ahead and will not retry a failed readahead IO on a different path. This is much less likely to happen, but if ql2xprocessrscn=1 isn't on it might be easier to hit...

My customer needs a 32bit i686 2.6.9-42.0.3.ELsmp kernel build with the "upstream patch" applied for testing.

The patched kernel from comment #189 has now been running for 24 hours without a problem.

I have a system running RHEL4u4 2.6.9-42.0.3.ELsmp x86_64/AMD that should be available for testing in a day or so. I'd like a kernel build for that please. Thanks!
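The merging behaviour described above -- contiguous read-ahead requests collapsing into the initial plain read and losing the READA bit -- can be modelled with a toy sketch. This is an illustrative simplification with invented names, not the real block-layer elevator code:

```c
/* Hypothetical model of request merging: when a READA request is
 * contiguous with a pending READ, they merge, and the combined
 * request is a plain READ -- so the read-ahead can no longer fail
 * fast with -EAGAIN. */
#include <stdbool.h>

struct req { unsigned start, nblocks; bool reada; };

/* Try to merge b onto the tail of a; returns true on success. */
static bool merge(struct req *a, const struct req *b)
{
    if (a->start + a->nblocks != b->start)
        return false;                 /* not contiguous: stays separate */
    a->nblocks += b->nblocks;
    a->reada = a->reada && b->reada;  /* any plain READ in the pair wins */
    return true;
}
```

On a freshly made filesystem the 16 read-ahead blocks sit contiguously behind the block fetched by the initial bread, so every READA merges into that plain READ and the fail-fast path disappears entirely; the fill/delete recipe fragments the directory so the READAs stay separate and can still fail with -EAGAIN.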
(In reply to comment #196)
> My customer needs a 32bit i686 2.6.9-42.0.3.ELsmp kernel build with the
> "upstream patch" applied for testing.

http://people.redhat.com/coldwell/bugs/kernel/213921/i686/2.6.9-42.0.3.EL/

Chip

(In reply to comment #198)
> The patched kernel from comment #189 has now been running for 24 hours without
> a problem.
>
> I have a system running RHEL4u4 2.6.9-42.0.3.ELsmp x86_64/AMD that should be
> available for testing in a day or so. I'd like a kernel build for that please.
> Thanks!

http://people.redhat.com/coldwell/bugs/kernel/213921/x86_64/2.6.9-42.0.3.EL/

Chip

(In reply to comment #195)
> Jeff's patch in comment #192 should make it fail faster.
>
> We're only doing RA on the first 16 or so blocks, and we start by doing a
> bread on the first block. On a mostly empty filesystem the directory blocks
> will probably be sequential, and so all the reada requests will get merged
> back into the original bio, and the reada bit will get dropped.

Good point. Another reason why this bug was hard to reproduce.

Chip

Chip mentioned that the "real world" performance impact of not doing any READAs should be fairly minimal. Behind fixing this bug, customers' data integrity should be our next highest concern. We need to maintain an extremely low chance of any data-corruptors creeping in. Customers are also asking for a fix *fast* given the business impact to date. Therefore, I'd like to toss this out as an idea: what if we attempt to measure the performance impact of stripping out READAs? If it's extremely low (<1%), then it's a no-brainer: get them out of there! Alternatively, we could add a kernel tunable to switch READAs on/off as needed. It seems to me this might be a faster/safer route to solving this bug, and at the same time avoid introducing other bugs. Even if there's a 5% performance trade-off, I think many customers would opt against READAs since that will avoid this bug with 100% certainty. Thoughts?
With the mainline patch, I believe the risk of corruption is small. Turning off the READAs is always an option. The best case for the reada code would be doing a single 64k bio for the whole directory instead of 16 4k bios with reada off. On my sata drive here, reading sequentially in 64k chunks goes at 51MB/s. Reading in 4k chunks runs at 15MB/s. The actual impact will vary greatly with disk layout and workload, but the most common place people will see it is while running backups. If you want to make a hotfix tomorrow that you are sure is 100% safe, turning off directory readahead isn't a bad option. For something longer term, I think the mainline patch is a better choice.

Chris, I don't mean to sound like I'm attacking your opinion; rather, I'm just throwing out potential alternatives. From the looks of it, stripping out READAs or adding a reada tunable should be fast and easy. It also wouldn't preclude parallel work on the other option(s). The crucial bit it would buy us is time. From our calls, lots of it will be needed to properly analyze and test with the somewhat "unknown" behavior of the upstream code. My customers would prefer a fix tomorrow that comes with 100% certainty, as opposed to a fix a month from now with unknowns.

Also, the behavior of removing READAs will be entirely different on SAN storage. SAN data is heavily cached on the storage side, so the latency should be much lower, especially compared to a SATA disk with maybe 16meg of cache. I think we all have the same goal of making the customers happy fast, so I'm only trying to provide some data to help you decide (sorry if it came across otherwise). Turning off directory readahead in a hotfix kernel would be acceptable, I think. It may actually hurt more on the SAN machines than the sata drive, because sending bios down the sata pipe is relatively lightweight. But it depends on how quickly the san boxes start their own readahead and other load for those spindles.
One problem with the tunable is that it doesn't really reduce the testing load. It still needs to be verified with readahead on before it can be shipped. I would do either the reada off or the mainline patch, but I'll support whatever decision people here feel is best.

The patched kernel from comment #189 has now been running for 48 hours without a problem.

----------------------
Root cause has been identified after several rounds of focused instrumentation. An upstream patch has been applied to RHEL4 U4 and is currently under heavy testing at EMC and Red Hat. The problem has not been seen in test in over 48 hours, which is promising. (Note: EMC was previously able to reproduce the issue within 4-6 hours.) However, a longer test period and full regression testing is needed before we can declare victory and approve fixes for production use. In the first week of January, EMC and Red Hat will assess if and in which form the suggested patch will be distributed.
----------------------

Thanks to everyone who worked so hard to identify and address this issue. Has it been decided to "emulate the mainline kernel as far as possible, so in order to get your patch accepted by our kernel maintainers?" Where is Red Hat on releasing a patch (or new kernel RPMs) for this issue?

Efforts over the holidays yielded 9 hours of failure-free testing with the upstream patch. However, given the impact of this issue, we'd really like to give the fix an extra special pounding. The next step is to implement another testing-only kernel patch designed to aggravate the original problem. Once we confirm that the problem is reproducible with it (should only take several seconds), we will apply the upstream fix. The goal here is to really make sure that we can't hit the same problem via a different route and while under worst-case conditions.

(In reply to comment #209)
> Efforts over the holidays yielded 9 hours of failure-free testing with the
> upstream patch.

You mean 9 days, not 9 hours.
Chip

(In reply to comment #208)
> Thanks to everyone who worked so hard to identify and address this issue. Has it
> been decided to "emulate the mainline kernel as far as possible, so in order to
> get your patch accepted by our kernel maintainers?"

I believe that is the consensus. Also, the mainline kernel patch presumably will also improve performance.

Chip

(In reply to comment #210)
> (In reply to comment #209)
> > Efforts over the holidays yielded 9 hours of failure-free testing with the
> > upstream patch.
>
> You mean 9 days, not 9 hours.
>
> Chip

Yes, sorry about that, my brain is still on a holiday. 9 days of testing is correct.

Our internal tests ran for 11 days with a similar setup, though against a Clariion.

Created attachment 144802 [details]
This is the upstream patch from attachment 144007 [details] adjusted to apply to RHEL4 kernels

QE ack for RHEL4.5.

I built i686 kernels (building x86_64 next) with the "Fail all read-aheads" patch from comment #192. These kernels should be able to reproduce on local as well as SAN filesystems. ALL ext3 read-aheads will fail with this kernel! Make sure all your non-test filesystems are mounted as ext2 or you probably won't be able to boot. Please use this kernel in combination with the filesystem setup instructions in comment #195 to verify the problem can be reproduced. Once that's done, I'll build another round of kernels with both the comment #192 patch and the attachment #144007 [details] patch. Then we'll see once and for all if it truly fixes the problem.
Kernels are at:

http://people.redhat.com/cevich/stuff/kernel-smp-2.6.9-42.0.3_bz213921c192.EL.i686.rpm
http://people.redhat.com/cevich/stuff/kernel-hugemem-2.6.9-42.0.3_bz213921c192.EL.i686.rpm

Kernel source package at:

http://people.redhat.com/cevich/stuff/kernel-2.6.9-42.0.3_bz213921c192.EL.src.rpm

And here are the 64bit packages:

http://people.redhat.com/cevich/stuff/kernel-2.6.9-42.0.3_bz213921c192.EL.x86_64.rpm
http://people.redhat.com/cevich/stuff/kernel-smp-2.6.9-42.0.3_bz213921c192.EL.x86_64.rpm

I installed the kernel from comment #223 and I experienced a crash almost immediately.

EXT3-fs error (device emcpowerb1): ext3_readdir: directory #6520835 contains a hole at offset 4096
Kernel panic - not syncing: EXT3-fs (device emcpowerb1): panic forced after error
----------- [cut here ] --------- [please bite here ] ---------
Kernel BUG at panic:75
invalid operand: 0000 [1] SMP
CPU 2
Modules linked in: emcphr(U) emcpmpap(U) emcpmpaa(U) emcpmpc(U) emcpmp(U) emcp(U) emcplib(U) parport_pc lp parport netconsole netdump autofs4 i2c_dev i2c_core sunrpc ds yenta_socket pcmcia_core button battery ac md5 ipv6 ohci_hcd hw_random tg3 floppy dm_snapshot dm_zero dm_mirror ext3 jbd dm_mod qla2300 qla2xxx scsi_transport_fc cciss sd_mod scsi_mod
Pid: 8512, comm: ls Tainted: P 2.6.9-42.0.3_bz213921c192.ELsmp
RIP: 0010:[<ffffffff8013794e>] <ffffffff8013794e>{panic+211}
RSP: 0000:0000010041785c18 EFLAGS: 00010286
RAX: 0000000000000056 RBX: ffffffffa00c9e89 RCX: 0000000000000246
RDX: 0000000000009cd3 RSI: 0000000000000246 RDI: ffffffff803e2080
RBP: 0000010025e7f400 R08: 0000000000000004 R09: ffffffffa00c9e89
R10: 0000000000000000 R11: 0000000000000001 R12: 000001007e4bd400
R13: 0000000000000000 R14: 0000010041785f38 R15: 0000010057997488
FS: 0000002a9557e580(0000) GS:ffffffff804e5300(0000) knlGS:00000000f7fd76c0
CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: 0000000000527070 CR3: 000000003f1b2000 CR4: 00000000000006e0
Process ls (pid: 8512, threadinfo 0000010041784000, task
00000100285917f0)
Stack: 0000003000000010 0000010041785cf8 0000010041785c38 ffffffff80138552
       0000003000000008 000001007e4bd640 000001003bac0000 0000000000000246
       0000000000000004 ffffffffa00c8c95
Call Trace: <ffffffff80138552>{printk+141} <ffffffff8013827a>{release_console_sem+369}
       <ffffffff801384a8>{vprintk+498} <ffffffffa00c158b>{:ext3:ext3_handle_error+69}
       <ffffffffa00c1718>{:ext3:ext3_error+193} <ffffffff8030ab57>{io_schedule+38}
       <ffffffff8017a562>{bh_wake_function+0} <ffffffff8017a562>{bh_wake_function+0}
       <ffffffffa00bb810>{:ext3:ext3_bread+114} <ffffffffa00b7be5>{:ext3:ext3_readdir+650}
       <ffffffff8018b1c4>{filldir64+0} <ffffffff801d4055>{selinux_file_permission+298}
       <ffffffff801859a8>{path_release+12} <ffffffff8018b1c4>{filldir64+0}
       <ffffffff8018af03>{vfs_readdir+155} <ffffffff8018b2f4>{sys_getdents64+118}
       <ffffffff8011026a>{system_call+126}
Code: 0f 0b 0f e6 31 80 ff ff ff ff 4b 00 31 ff e8 17 c0 fe ff e8
RIP <ffffffff8013794e>{panic+211} RSP <0000010041785c18>

Great! Does "almost immediately" mean you went through the filesystem setup and it crashed on an "ls", or that the system crashed on boot?

(In reply to comment #225)
> Great! Does "almost immediately" mean you went through the filesystem setup and
> it crashed on an "ls" or that the system crashed on boot?

I actually kicked off all the jobs from my usual reproduction steps (comment #133). This is the first time I have seen it crash so quickly, so I would say the "aggravation change" works well.

Thanks! I'm building kernels with both patches applied now. Will post them when done.

(In reply to comment #195)
> Jeff's patch in comment #192 should make it fail faster.
>
> We're only doing RA on the first 16 or so blocks, and we start by doing a
> bread on the first block. On a mostly empty filesystem the directory blocks
> will probably be sequential, and so all the reada requests will get merged
> back into the original bio, and the reada bit will get dropped.
>
> So, if you want to trigger it w/Jeff's patch, completely fill a small FS with
> 4k files, delete 16 of them, create empty files in a new directory until the
> disk is full, unmount, mount, ls -laR

I tried it, and much to my surprise I could not reproduce the bug. I also tried some variations on the formula: remove the first 16 4k files, remove the last 16 4k files, remove 16 random 4k files, remove two groups of 8 4k files, etc., and still no joy (there's an unmount, mke2fs -j, mount, etc. in between each of these attempts). Am I missing something?

Chip

> I think I see one other case where dm-multipath will trigger. It checks for
> bio_rw_ahead and will not retry a failed readahead IO on a different path.
> This is much less likely to happen, but if ql2xprocessrscn=1 isn't on it might
> be easier to hit...

(Probably should have picked a faster box to compile these on.) Here are kernels built with both patches applied:

http://people.redhat.com/cevich/stuff/kernel-smp-2.6.9-42.0.3_bz213921c192c216.EL.i686.rpm
http://people.redhat.com/cevich/stuff/kernel-hugemem-2.6.9-42.0.3_bz213921c192c216.EL.i686.rpm
http://people.redhat.com/cevich/stuff/kernel-2.6.9-42.0.3_bz213921c192c216.EL.x86_64.rpm
http://people.redhat.com/cevich/stuff/kernel-smp-2.6.9-42.0.3_bz213921c192c216.EL.x86_64.rpm

(In reply to comment #229)
> (Probably should have picked a faster box to compile these on) Here are kernels...

Thanks Chris! I kicked off a test with the kernel from comment #229.

[root@L8 ~]# uname -a
Linux L8.lss.emc.com 2.6.9-42.0.3_bz213921c192c216.ELsmp #1 SMP Fri Jan 5 15:47:27 EST 2007 x86_64 x86_64 x86_64 GNU/Linux

(In reply to comment #230)
> I kicked off a test with the kernel from comment #229.

Is it safe to assume this test ran failure-free for the weekend? Anyone else have results of testing with this worst-case, testing-only kernel (comment #229)?

(In reply to comment #231)
> (In reply to comment #230)
> Is it safe to assume this test ran failure-free for the weekend?
Yes. It is still running.

-Keith

*** Bug 208879 has been marked as a duplicate of this bug. ***

Created attachment 145256 [details]
Patch to replace all READAs with normal READs

For the record, we've decided not to go with the comment #216 patch, as further discussion showed it to be moderately risky. In weighing the criticality of this issue, our desire is to fix the problem while introducing the smallest possible risk. The attached patch accomplishes the high-level intent of the original patch, though in a much safer way. To summarize, all it does is replace read-ahead requests with normal read requests. Some optimization is thus preserved while still allowing the requests to fail gracefully, preventing the "directory hole read-only" condition. Testing with this patch is currently underway and we believe the outcome will be successful. Any and all who can test this patch are encouraged to do so. I will post a link to the kernel packages as well as start building other architecture kernels as time permits.

Kernel source package including patch from comment #235:

http://people.redhat.com/cevich/stuff/kernel-2.6.9-42.0.3_bz213921c235.EL.src.rpm

Kernel source package including patches from comment #235 and comment #216 (THIS KERNEL IS FOR TESTING ONLY - ALL ext3 READA REQUESTS WILL FAIL; HOWEVER, FILESYSTEMS SHOULD NOT GO READ-ONLY):

http://people.redhat.com/cevich/stuff/kernel-2.6.9-42.0.3_bz213921c235c216.EL.src.rpm

Could someone please summarize the objections to the page cache readahead patch? I do agree that s/READA/READ is safe, but I'm curious to hear the objections.

Committed in stream U5 build 42.40. A test kernel with this patch is available from http://people.redhat.com/~jbaron/rhel4/

(In reply to comment #238)
> Could someone please summarize the objections to the page cache readahead
> patch? I do agree that s/READA/READ is safe, but I'm curious to hear the
> objections.

Stephen Tweedie is quoted below.
On Fri, 2007-01-05 at 14:05 -0500, Jason Baron wrote:
> can you be more specific about what is making you nervous...performance
> issues from too much readahead? I'm trying to better assess the risk of
> this patch for possible inclusion in an asynchronous security erratum.

It totally changes the way we're accessing cache for directory buffers. Possible interactions include interference with journal code, or with user code accessing the cache at the same time (e.g. dump(8)). I can't see any reason why it *should* have any problems, but it's precisely the unforeseen problems which get you. :-) It's really such a big change in the internal way we do directory reads that I can't see it qualifying as a low-risk change.

--Stephen

A few compiled kernel packages as per my above comments:

http://people.redhat.com/cevich/stuff/kernel-smp-2.6.9-42.0.3_bz213921c235.EL.i686.rpm
http://people.redhat.com/cevich/stuff/kernel-smp-2.6.9-42.0.3_bz213921c235.EL.x86_64.rpm

and

http://people.redhat.com/cevich/stuff/kernel-smp-2.6.9-42.0.3_bz213921c235c216.EL.x86_64.rpm
(Both testing and fix patches applied)

All, I have been hearing about some confusion regarding the sudden and apparent "last-minute" changes in our approach here. Please allow me to clarify the situation a bit further. It is very important to us and everyone affected by this bug that we get the fix "right". After testing the "upstream patch" for this problem, it was determined that it opened up some substantial areas for concern. More specifically, these are not concerns which translate to the upstream code, because the context there is different (e.g., there are many other changes as well). Additionally, there was found to be evidence that the "upstream patch" actually causes a rare but much more severe problem. Therefore, we felt it best to delay release of the fix so that we can perform additional testing on a _much_ safer fix to this problem.
We are aiming for a much shorter test cycle with this fix because of its perceived safety. On that note, testing is underway with the comment #241 kernels, and so far it looks promising. Any and all additional help in testing these kernels is much appreciated.

(In reply to comment #241)
> A few compiled kernel packages as per my above comments:
>
> http://people.redhat.com/cevich/stuff/kernel-smp-2.6.9-42.0.3_bz213921c235.EL.i686.rpm
> http://people.redhat.com/cevich/stuff/kernel-smp-2.6.9-42.0.3_bz213921c235.EL.x86_64.rpm
>
> and
>
> http://people.redhat.com/cevich/stuff/kernel-smp-2.6.9-42.0.3_bz213921c235c216.EL.x86_64.rpm
> (Both testing and fix patches applied)

Just to clarify, ignore the first two. They're the same as what's available from comment #239, and I've removed the files from my people page. The last one, "bz213921c235c216", is still there. The key difference is that it has both the fix and the testing-only "fail all READAs" patch applied. Presumably, the failure should not be reproducible with the former; more importantly, nothing else should break with the latter.

*** Bug 207109 has been marked as a duplicate of this bug. ***

I have not seen a file system go read-only with any of the kernels that I have been asked to test:

2.6.9-42.0.3_bz213921c192c216.ELsmp (64 hours)
2.6.9-42.39.ELsct (3 days)
2.6.9-42.0.6 (4 days)

Today I am going to start testing with 2.6.9-42.0.7. -Keith

FYI, the following public KBASE article has been released as well: http://kbase.redhat.com/faq/FAQ_85_9610

An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you.
http://rhn.redhat.com/errata/RHSA-2007-0014.html

The aforementioned lpfc driver parameters did not produce statistically significant results and are considered unrelated to this issue. Any driver configuration changes should be cleared through the storage and/or adapter providers.

Internal Status set to 'Resolved'
Status set to: Closed by Tech
Resolution set to: 'Security Errata'

This event sent from IssueTracker by sputhenp, issue 111261

We are experiencing a similar issue. File systems from both Clariion and Symmetrix SANs go read-only from time to time. The syslogs show the following errors:

Aug 21 00:02:28 lndb004 EXT3-fs error (device dm-9) in start_transaction: Journal has aborted.

We are on version 2.6.9-42.0.3.ELsmp. How do we confirm we are experiencing this bug? Is there a test case we can use to attempt to reproduce, so that we know the fix works?

Craig, a filesystem can go read-only for many reasons. This bug documents the discovery and resolution of one particular problem within the kernel's ext3 code. I recommend updating to a later kernel; if the problem doesn't go away, then please contact support. They will be able to assist in determining whether or not this bug is your underlying problem.

Craig, if you're not seeing any other errors near that message, you're probably hitting something else. Temporally speaking, there should be other errors logged before it if you're hitting this bug, but since the message you've quoted would probably come from a different thread than the one actually hitting the bug directly, the error messages could end up out of order in the log. This bug is fixed in 2.6.9-42.0.8.EL, so if you're still having the problem on that kernel or later, it's definitely something else. If the problem is occurring inside VMware guests, it may be due to a different bug which is fixed in the U5 kernel, version 2.6.9-55.EL.
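Since several reporters above are unsure whether their kernel predates the fix, comparing the running kernel against the first fixed build (2.6.9-42.0.8.EL, per the comment above) is a quick first check. A minimal sketch; it assumes `sort -V` is available, which may not hold on an actual RHEL 4 host:

```shell
#!/bin/sh
# First RHEL 4 kernel build said above to contain the READA fix.
FIXED="2.6.9-42.0.8.EL"
RUNNING=$(uname -r)

# Whichever version sorts first is the older one. If the fixed build
# is not the older of the two, the running kernel already has the fix.
OLDEST=$(printf '%s\n%s\n' "$FIXED" "$RUNNING" | sort -V | head -n1)
if [ "$OLDEST" = "$FIXED" ]; then
    echo "kernel $RUNNING includes the fix; read-only events are something else"
else
    echo "kernel $RUNNING predates $FIXED; update before debugging further"
fi
```

This only rules the bug in or out by version; as noted above, a read-only ext3 filesystem can have many other causes.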
I forgot to mention: if you want to try reproducing this bug, try running the multi-instance iozone script and the metadata test script attached to this bug. Basically, if you create an intense I/O load, then do a whole lot of metadata operations, and keep doing this for a very long time, you'll eventually hit this bug if your systems are susceptible to it. In lab testing it took anywhere from an hour to a day to trigger.
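The attached scripts are the actual reproducers; for anyone who cannot retrieve the attachments, a rough stand-in for the metadata-churn half looks like the following. The directory, file count, and loop structure are this example's assumptions, not the attachment's contents. Point it at the SAN-backed filesystem under test and run it alongside a heavy iozone load:

```shell
#!/bin/sh
# Rough stand-in for the attached metadata test script: repeatedly create,
# list, rename, and delete files to force directory-block reads, the code
# path touched by the READA fix discussed in this bug.
DIR=${1:-${TMPDIR:-/tmp}/ext3-mdload.$$}
mkdir -p "$DIR"

i=0
while [ "$i" -lt 100 ]; do
    touch "$DIR/f$i"
    ls -l "$DIR" > /dev/null    # directory lookups while files churn
    mv "$DIR/f$i" "$DIR/g$i"
    rm -f "$DIR/g$i"
    i=$((i + 1))
done
echo "metadata pass complete: $DIR"
```

In the scenario described above, a loop like this ran concurrently with multiple iozone instances for an hour to a day before a susceptible system went read-only, so expect to leave it running for a long time.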