Bug 213921 - SAN file systems becoming read-only
Status: CLOSED ERRATA
Product: Red Hat Enterprise Linux 4
Classification: Red Hat
Component: kernel
Version: 4.4
Platform: All Linux
Priority: urgent  Severity: high
Assigned To: Tom Coughlan
QA Contact: Brian Brock
Duplicates: 207109 208879
Depends On:
Blocks: 195232 216986 221291 221293
Reported: 2006-11-03 14:31 EST by Issue Tracker
Modified: 2011-07-07 18:01 EDT (History)
36 users

Fixed In Version: RHSA-2007-0014
Doc Type: Bug Fix
Last Closed: 2007-01-30 09:37:00 EST


Attachments
- metadata test script (1.12 KB, application/octet-stream); 2006-11-07 14:45 EST, Chris Evich
- Script to handle running of multiple instances of IOZone (1.63 KB, application/octet-stream); 2006-11-07 14:47 EST, Chris Evich
- Previous version that doesn't include random delays (1.60 KB, application/x-gzip); 2006-11-07 14:48 EST, Chris Evich
- version w/o random delays in loop. (1.60 KB, application/octet-stream); 2006-11-07 15:19 EST, Chris Evich
- One of the switch logs from testing done yesterday. (68.06 KB, text/plain); 2006-11-10 12:50 EST, Chris Evich
- Other switch's error log from testing yesterday. (69.16 KB, text/plain); 2006-11-10 12:51 EST, Chris Evich
- instrument end_buffer_read_sync (601 bytes, patch); 2006-12-01 15:31 EST, Chip Coldwell
- instrument end_buffer_read_sync; fix missing semicolon (602 bytes, patch); 2006-12-01 15:36 EST, Chip Coldwell
- memory hog (1.77 KB, application/x-gzip); 2006-12-08 12:43 EST, Chip Coldwell
- metadata test script take 2 (2.44 KB, application/octet-stream); 2006-12-08 14:32 EST, Chris Evich
- Install jprobes in end_bio_bh_io_sync and end_buffer_read_sync (8.11 KB, text/plain); 2006-12-08 14:46 EST, Chip Coldwell
- Install jprobes in end_bio_bh_io_sync and end_buffer_read_sync (8.72 KB, text/plain); 2006-12-11 16:28 EST, Chip Coldwell
- probe ext3_get_block_handle in 2.6.9-34.EL x86_64 (7.23 KB, text/plain); 2006-12-12 17:09 EST, Chip Coldwell
- program to read directory inode dump (gcc -o readdir readdir.c) (1.13 KB, text/plain); 2006-12-13 15:56 EST, Chip Coldwell
- program to read directory inode dump (gcc -o readdir readdir.c) (1.13 KB, text/plain); 2006-12-13 16:01 EST, Chip Coldwell
- probe ext3_bread (2.6.9-34.ELsmp x86_64) (8.05 KB, application/octet-stream); 2006-12-13 17:18 EST, Chip Coldwell
- use submit_bh instead of ll_rw_block (532 bytes, patch); 2006-12-14 09:41 EST, Chris Mason
- more verbose check (1.07 KB, patch); 2006-12-14 11:05 EST, Chris Mason
- probe ext3_bread every which way (2.6.9-34.ELsmp.x86_64) (8.46 KB, text/plain); 2006-12-14 17:18 EST, Chip Coldwell
- probe ext3_bread and end_bio_bh_io_sync (2.6.9-34.ELsmp.x86_64) (9.37 KB, text/plain); 2006-12-15 13:28 EST, Chip Coldwell
- touchup the logic in the end_bio_bh_io_sync jprobe (9.37 KB, text/plain); 2006-12-15 13:41 EST, Chip Coldwell
- verbose logging in end_bio_bh_io_sync (9.68 KB, text/plain); 2006-12-18 13:40 EST, Chip Coldwell
- turn off read-ahead in ext3_readdir (401 bytes, patch); 2006-12-18 15:00 EST, Chip Coldwell
- Fix ext3_bread and ext3_find_entry (1.58 KB, patch); 2006-12-19 10:05 EST, Chris Mason
- commit d8733c2956968a01394a4d2a9e97a8b431a78776 from Linus' git tree (4.88 KB, patch); 2006-12-19 10:29 EST, Chip Coldwell
- Fail all READA I/Os (566 bytes, patch); 2006-12-19 16:47 EST, Jeffrey Moyer
- Fail all READA I/Os (566 bytes, patch); 2006-12-19 16:48 EST, Jeffrey Moyer
- This is the upstream patch from attachment 144007 adjusted to apply to RHEL4 kernels (4.05 KB, patch); 2007-01-04 10:26 EST, Chip Coldwell
- Patch to replace all READAs with normal READs (377 bytes, patch); 2007-01-10 09:47 EST, Chris Evich

Description Issue Tracker 2006-11-03 14:31:10 EST
Escalated to Bugzilla from IssueTracker
Comment 1 Chris Snook 2006-11-03 16:04:05 EST
We've received several reports of ext3 filesystems on SAN storage repeatedly
going read-only, often different filesystems, or on different arrays, or on
different hosts attached to the same arrays.  This happens with a single path,
with dm-multipath, or with EMC PowerPath.  It seems to only happen on
active/active arrays, such as EMC Symmetrix, and not happen on active/passive
arrays, such as EMC Clariion in the same environment, but it is unclear whether
this is due to the multipathing or simply the I/O load.

Setting ql2xprocessrscn=1 (qla2xxx) or lpfc_use_adisc=1 (lpfc) helps in some of
these cases.  As of RHEL 4 U4, the problem is not known to still occur with lpfc
when lpfc_use_adisc=1 is set.  Both the RHEL 4 U4 version of the qla2xxx driver,
and version 8.01.06 from qlogic.com seem to be affected, though we also have one
report of 8.01.06 fixing the problem, so it may be an improvement, but not a
complete fix.

Currently, signs point to an inefficiency in handling RSCNs in the qla2xxx
driver, which causes SCSI timeouts, which eventually propagate up the stack to
cause an ext3 error.
Comment 2 Chris Evich 2006-11-07 14:45:55 EST
Created attachment 140585 [details]
metadata test script
Comment 3 Chris Evich 2006-11-07 14:47:28 EST
Created attachment 140586 [details]
Script to handle running of multiple instances of IOZone
Comment 5 Chris Evich 2006-11-07 15:19:47 EST
Created attachment 140598 [details]
version w/o random delays in loop.

I don't yet have the customer's switch logs.  Attached is a
copy of the ioloadgen script version they're actually using (without the random
delays that are in the later version).  I sent a note to the customer requesting
the switch logs if they still have them.  Also, a reminder now
that this is a public bug: please be sensitive about sharing customer data.  For
sensitive items we can e-mail them, or I can look into making this bug private
again for only the people on the cc list.  So, if you're not on the cc list yet,
please add yourself.  Thanks.
Comment 7 Andrew Vasquez 2006-11-07 16:51:42 EST
Just to be clear on the 'logs' request: QLogic would like to
see qla2xxx driver logs (with extended_error_logging enabled),
not switch logs, so that we can better understand the failure
as well as any relevant contributing precursors.
Comment 8 Chris Evich 2006-11-08 11:40:19 EST
E-mailed the EMC grab from 106 to the EMC and QLogic folks.  The customer is
working on providing another one from 007 as well.
Comment 9 Chris Evich 2006-11-08 11:46:19 EST
Testing on 007 didn't happen last night due to scheduling issues.  Toure is
working with the customer to get the 007 testing kicked off right now.
Comment 10 Chris Evich 2006-11-08 12:30:07 EST
Also, just to be clear, the emc grab I e-mailed out was taken post-failure on
106 while testing under the following conditions:

New QLogic driver.
Oracle running normally.
New metadata script.
Same ioloadgen script as before.
PP in RE mode.
Comment 11 Chris Evich 2006-11-08 12:59:54 EST
Just sent out the grab from 007; the test over the weekend was run without any
multipath software loaded/running.
Comment 12 Chris Evich 2006-11-08 13:07:35 EST
Plan:  If current testing of 007 (dm-multipath in multibus mode) goes read-only,
we need to validate the secondary path.  To find which port is the primary one,
block it with dm-multipath running.  Then use "multipath -ll" output to
determine if that was the primary or secondary.  

Make sure the primary is blocked for the test.  Disable all multipath software
on 007, and test down the secondary path only.  Pass/fail of that test will
determine whether we need to look at dm-multipath or PowerPath further.  Our
thinking: the multipath layers may be intercepting errors and preventing us from
seeing low-level I/O failures.  If possible, we will then start instrumenting
the dm-multipath code.
Comment 13 Chris Evich 2006-11-08 13:51:59 EST
Revised plan:

Test 007 as above, but also kick off a similar test on 106 (with the new QLogic
driver).  Later on, if 106 fails, then we'll know that 007 might also have
failed given that same amount of time.  This could save us a headache down
the road while providing the flexibility of terminating a potentially
non-failing test on 007 tomorrow morning.

Given a failure on 007 under dm-multipath tomorrow morning, the next step is as
above- validate no-multipath and I/O down the secondary path.  Leave 106 running
if it doesn't fail.
Comment 14 Chris Evich 2006-11-08 13:53:54 EST
QLogic/EMC:  When you have any status on your testing, if you could post it
along with your respective setups, that would be helpful for tracking purposes.
Comment 15 Chris Evich 2006-11-08 15:35:26 EST
From Andrew:

"""Could you or the customer perhaps shed some light on when the first
failure occurred (date/time) along with a description of which events
precipitated the 'failure' --  I'm just trying to get a baseline here,
as from what I can see, the last driver load before the weekend run
occurred on November 3rd:

	...
	Nov  3 10:37:30 chln106 kernel: QLogic Fibre Channel HBA Driver
	...

with the additional noise of the LUNZ luns (sda, sdb, sdk, sdt) being
inaccessible:

	...
	Nov  3 10:37:36 chln106 kernel: sdt: asking for cache data failed
	Nov  3 10:37:36 chln106 kernel: sdt: assuming drive cache: write through
	Nov  3 10:37:36 chln106 kernel: sdt : READ CAPACITY failed.
	Nov  3 10:37:36 chln106 kernel: sdt : status=1, message=00, host=0, driver=08 
	...
	Nov  4 10:40:04 chln106 kernel: end_request: I/O error, dev sdb, sector 0
	Nov  4 10:40:04 chln106 kernel: Buffer I/O error on device sdb, logical block 0
	Nov  4 10:40:04 chln106 kernel: end_request: I/O error, dev sdb, sector 0
	Nov  4 10:40:04 chln106 kernel: Buffer I/O error on device sdb, logical block 0
	Nov  4 10:40:04 chln106 kernel: end_request: I/O error, dev sdb, sector 0
	Nov  4 10:40:04 chln106 kernel: Buffer I/O error on device sdb, logical block 0

and finally a sudden ext3-error 2.5 days later:

	...
	Nov  6 01:29:02 chln106 kernel: EXT3-fs error (device dm-18): ext3_readdir:
directory #2342945 contains a hole at offset 163840
	Nov  6 01:29:02 chln106 kernel: Aborting journal on device dm-18.
	Nov  6 01:29:02 chln106 kernel: ext3_abort called.
	Nov  6 01:29:02 chln106 kernel: EXT3-fs error (device dm-18):
ext3_journal_start_sb: Detected aborted journal
	Nov  6 01:29:02 chln106 kernel: Remounting filesystem read-only
	Nov  6 01:29:02 chln106 kernel: EXT3-fs error (device dm-18) in
start_transaction: Journal has aborted
	Nov  6 01:29:33 chln106 last message repeated 575 times
	...

I'm also trying to drill down to the exact scsi device used with the
dm-18 device, am I correct in interpreting the logs in that dm-18 is
actually part of /dev/emcpowerb -- which raw scsi-devices are part of
that physical volume?

In any case, there do not appear to be *any* fabric related entries
logged by the device driver (or for that matter, anything) immediately
before or in the near vicinity of the ext3 FS going read only.  The
last driver entries were those logged during init-time (three days
earlier).  If in fact there were any RSCNs from the switch the driver
would have logged something like the following to the messages file:

	...
	scsi(0): Asynchronous RSCR UPDATE.
	scsi(0): RSCN database changed -- 0070 0000
	...
"""
Comment 16 Chris Evich 2006-11-08 15:36:48 EST
From Jerry:

"""emcpowerb is:

Pseudo name=emcpowerb
Symmetrix ID=000190101044
Logical device ID=00E3
state=alive; policy=SymmOpt; priority=0; queued-IOs=1
==============================================================================
---------------- Host ---------------   - Stor -   -- I/O Path -  -- Stats ---
### HW Path                 I/O Paths    Interf.   Mode    State  Q-IOs Errors
==============================================================================
   0 qla2xxx                   sdi       FA  7aB   active  alive      0      0
   1 qla2xxx                   sdr       FA 10aB   active  alive      1      0

"""
Comment 17 Chris Evich 2006-11-08 15:56:54 EST
Andrew, to your question about context:

That was the failure which occurred on 106 in the middle of the night, with the
8.01.06 QLogic driver, and while _not_ deliberately testing/stressing anything.

No testing was done on 106 between the Friday reboot and the read-only at 1:29
Monday morning.

(Whether or not the system was stressed by some automated Oracle/system
activity is another question.)
Comment 18 Andrew Vasquez 2006-11-08 18:21:33 EST
Unfortunately, the driver was loaded without the
extended_error_logging parameter enabled during driver load:

	Nov  3 10:10:10 chln007 kernel: qla2xxx: Unknown parameter `q12xfailover'
	Nov  3 10:10:10 chln007 kernel: QLogic Fibre Channel HBA Driver
	Nov  3 10:10:10 chln007 kernel: ACPI: PCI interrupt 0000:05:08.0[A] -> GSI 32
(level, low) -> IRQ 233
	Nov  3 10:10:11 chln007 kernel: qla2300 0000:05:08.0: Found an ISP2312, irq
233, iobase 0xf8822000
	Nov  3 10:10:11 chln007 kernel: qla2300 0000:05:08.0: Configuring PCI space...
	Nov  3 10:10:11 chln007 kernel: qla2300 0000:05:08.0: Configure NVRAM parameters...
	Nov  3 10:10:11 chln007 kernel: qla2300 0000:05:08.0: Verifying loaded RISC code...
	Nov  3 10:10:11 chln007 kernel: qla2300 0000:05:08.0: Waiting for LIP to
complete...
	Nov  3 10:10:11 chln007 kernel: qla2300 0000:05:08.0: LOOP UP detected (2 Gbps).
	Nov  3 10:10:11 chln007 kernel: qla2300 0000:05:08.0: Topology - (F_Port), Host
Loop address 0xffff
	Nov  3 10:10:11 chln007 kernel: scsi0 : qla2xxx
	Nov  3 10:10:11 chln007 kernel: qla2300 0000:05:08.0: 
	Nov  3 10:10:11 chln007 kernel:  QLogic Fibre Channel HBA Driver: 8.01.04-d7
	Nov  3 10:10:11 chln007 kernel:   QLogic QLA2342 - 
	Nov  3 10:10:11 chln007 kernel:   ISP2312: PCI-X (100 MHz) @ 0000:05:08.0
hdma+, host#=0, fw=3.03.20 IPX

if it had been, there would have been an '-debug' appended
to the driver version.

Some time later, the midlayer in fact reports some failed
SCSI I/O with a return status of DID_BUS_BUSY (0x20000),
the first of which began at 4:06pm:

	Nov  5 16:06:01 chln007 kernel: SCSI error : <0 0 2 28> return code = 0x20000
	Nov  5 16:06:01 chln007 kernel: end_request: I/O error, dev sdh, sector 128520408
	Nov  5 16:06:01 chln007 kernel: end_request: I/O error, dev sdh, sector 128520416
	Nov  5 16:06:01 chln007 kernel: SCSI error : <0 0 2 28> return code = 0x20000
	...
	Nov  5 16:43:54 chln007 kernel: end_request: I/O error, dev sdd, sector 88879880
	Nov  5 16:43:54 chln007 kernel: end_request: I/O error, dev sdd, sector 88879888
	Nov  5 16:43:54 chln007 kernel: SCSI error : <0 0 2 28> return code = 0x20000
	Nov  5 16:43:54 chln007 kernel: end_request: I/O error, dev sdh, sector 72414784
	Nov  5 16:43:54 chln007 kernel: end_request: I/O error, dev sdh, sector 72414792
	Nov  5 16:43:54 chln007 kernel: SCSI error : <0 0 2 14> return code = 0x20000
	Nov  5 16:43:54 chln007 kernel: end_request: I/O error, dev sdd, sector 145749192
	Nov  5 16:43:54 chln007 kernel: end_request: I/O error, dev sdd, sector 145749200
	Nov  5 16:43:54 chln007 kernel: SCSI error : <0 0 2 14> return code = 0x20000
	Nov  5 16:43:54 chln007 kernel: end_request: I/O error, dev sdd, sector 88879920

against luns sdh, sdg, sdd.  This continues until the first
relevant 'EXT3 goes read-only failure' occurs at 6:43pm:

	Nov  5 18:43:12 chln007 kernel: EXT3-fs error (device dm-7): ext3_readdir:
directory #1020212 contains a hole at offset 45056
	Nov  5 18:43:12 chln007 kernel: Aborting journal on device dm-7. 
	Nov  5 18:43:12 chln007 kernel: EXT3-fs error (device dm-7) in
start_transaction: Readonly filesystem
	Nov  5 18:43:12 chln007 kernel: Aborting journal on device dm-7. 

Unfortunately, since extended_error_logging wasn't enabled,
I can't say with certainty why the commands are being returned
to the midlayer with a DID_BUS_BUSY status.  Some
possibilities include:

- SAN activity, causing the driver to want to recycle
  commands until SNS scans complete.
- Commands timeout, storage not responding
- Storage logged-out of the initiator (LOGO)
- Dropped frames, FC frame failures.

In any case, for each of the commands, if the fast-fail bit
is set (typically is when DM/MD is involved), then the
midlayer will *not* retry the command, but instead return
the failed status to the upper layers, eventually
propagating to the block-layer caller.

Now if the fast-fail bit is cleared (unset), then retries
are perhaps being exhausted, as typically the midlayer
should retry the commands in hopes the SAN is quiesced.

If I could get a good failure with 8.01.04-d7 and
extended_error_logging enabled, we'd have much more
confidence in suggesting next steps.  At this point, all
that's known is that for some reason the driver is returning
back commands with DID_BUS_BUSY statuses.  Note again, this
is a different signature than that of the 106 run with
8.01.06 and extended_error_logging enabled.
Comment 20 Chris Evich 2006-11-09 11:21:55 EST
One comment on #18 above: that system was being exercised at the time with
IOZone, the metadata script, and someone blocking/unblocking a port
within the same zone on the SAN (to trigger RSCNs).  It's unfortunate that
extended error logging was not working; they have confirmed with me that the
option is there.  It's possible that the initial ram disk wasn't rebuilt (as
required) and so the option never made it into the module as it was loaded.
I'm following up with the customer on this last point.
Comment 21 Issue Tracker 2006-11-09 11:35:17 EST
All,
     Just for the record, the plan for 007:

Run with _no_ multipath with the primary path disabled (see my note
below)

for 106:

Run with dm-multipath, the new QLogic driver, IOZone, metadata-test, and
normal Oracle load for as long as possible (preferably through the weekend,
understanding this is a production box).


This event sent from IssueTracker by cevich 
 issue 97316
Comment 22 Chris Evich 2006-11-09 13:48:45 EST
Current status:  No failure occurred under dm-multipath on 007 after 16+ hours
of testing.  We've verified the extended error logging is turned on now and
functional on 007.  Configuration issues prevented testing on 106, though they
are being fixed now.  We hope to kick off a long-running test on 106 with normal
Oracle load + new QLogic driver + metadata script + ioloadgen script.

Plan: 007 switched back to PowerPath in RE mode and back to the original version
of the metadata script.  We're doing this both to check whether the old script
reproduces a failure faster (as it did 4 times before), and to confirm that we
can reproduce on 007 in RE mode with the old QLogic driver.  If the old script
proves to reproduce faster, I will post that information, along with a copy of
the old script, to this BZ.

Good news from EMC: they were successful in reproducing a read-only condition
after 14 hours of testing using the (attached) metadata script in addition to
IOZone load generation, but without Oracle.

Do we have any status or updates on reproduction efforts from the QLogic side?
Comment 23 Chris Evich 2006-11-09 15:48:44 EST
The customer recreated the failure on 007 using PowerPath in RE mode, with the
latest metadata test script and an older version of the load generator script.
This was a miscommunication, but it is informative nonetheless.

It validates that we can recreate the issue with PP in RE mode on 007.  It
validates that we can recreate with the new metadata script.  It also further
confirms that it can be recreated without Oracle.  I will be e-mailing out the
EMC grabs and other data shortly.
Comment 24 Dave Wagner 2006-11-09 18:05:52 EST
Will try the metadata script attached to see if problem can be reproduced here.
So far we haven't been able to reproduce it.
Comment 25 Chris Evich 2006-11-10 09:54:48 EST
Re: #24, you'll also want to run the IOZone based load test.  So far, we've only
seen this problem occur under extremely high load conditions (both system and SAN).
Comment 26 Andrew Vasquez 2006-11-10 12:32:16 EST
Re: #20, If there's a script switching ports on and off in the fabric
(to generate RSCNs), then I have a feeling the I/Os are
being returned with BUS_BUSY due to:

	- SAN activity, causing the driver to want to recycle
	  commands until SNS scans complete.

As I mentioned before:

	In any case, for each of the commands, if the fast-fail bit
	is set (typically is when DM/MD is involved), then the
	midlayer will *not* retry the command, but instead return
	the failed status to the upper layers, eventually
	propagating to the block-layer caller.

Are the commands being issued with the fast-fail bit set?
If so, then that would certainly account for the journaling
errors as the data is never being written/read, as the
DID_BUS_BUSY statuses are being directly funneled to the
upper-layers without the request retry being done...

8.01.06 has some changes in the semantics of command
handling during SAN scanning (as was mentioned before),
commands will be returned with DID_IMM_RETRY, instead of
DID_BUS_BUSY, as to avoid issues surrounding DM/MD's use of
fast-fail and its effects of premature markings of 'failure'
to a given I/O.
Comment 27 Chris Evich 2006-11-10 12:37:39 EST
Update: Overnight testing with dm-multipath on 106 and 007 did not reproduce any
failures.  Testing on 106 will be stopped EOB today.   Testing on 007 will
continue through the weekend.  We identified the problem that was preventing the
qla module from turning on extended error logging.  The options line contained
"q12xfailover=0" instead of "ql2xfailover=0" (a "1" (one) in place of an "l"
(ell)).  This has been fixed, the ramdisk rebuilt, and the test restarted.
Comment 28 Chris Evich 2006-11-10 12:40:56 EST
Re: #26, Thanks for the details.  Is it your recommendation that the "fast fail"
bit not be set under these circumstances?  Is anyone aware of possible
consequences when not setting it?  Is there a requirement for this bit to be set
when multipath is in use vs. normal use, or against local SCSI disks?
Comment 29 Chris Evich 2006-11-10 12:50:13 EST
Created attachment 140912 [details]
One of the switch logs from testing done yesterday.
Comment 30 Chris Evich 2006-11-10 12:51:36 EST
Created attachment 140913 [details]
Other switch's error log from testing yesterday.
Comment 31 Josef Bacik 2006-11-10 13:36:09 EST
If they are using dm_emc then the REQ_FAILFAST bit is used, else it's not.  
But my question is: if this were a problem with multipathing failing too
quickly, then wouldn't it also occur with dm multipath?  From what I've
seen it's not reproduced with DM, and when it is, it takes much longer than
with PP.  I would agree that REQ_FAILFAST could be contributing to the problem,
but only if we were seeing lots of failures with DM.
Comment 32 Andrew Vasquez 2006-11-10 13:45:16 EST
Regarding the latest 007 logs (from yesterday):

These 007 logs from yesterday contain a run with the driver still not
having extended error logging enabled:

        Nov  9 09:58:39 chln007 kernel: qla2300 0000:05:08.1: 
        Nov  9 09:58:39 chln007 kernel:  QLogic Fibre Channel HBA Driver: 8.01.04-d7
        Nov  9 09:58:39 chln007 kernel:   QLogic QLA2342 - 
        Nov  9 09:58:39 chln007 kernel:   ISP2312: PCI-X (100 MHz) @
0000:05:08.1 hdma+, host#=1, fw=3.03.20 IPX

Note, no '-debug' suffix.  This is irrelevant though, as the block 
layer *should* have logged some sort of 'failure' message (if the 
failure began from a block-layer device) before the sudden 'goes 
read-only':

        Nov  9 14:19:48 chln007 kernel: EXT3-fs error (device dm-7):
ext3_readdir: directory #1216813 contains a hole at offset 159744 
        Nov  9 14:19:48 chln007 kernel: Aborting journal on device dm-7.
        Nov  9 14:19:48 chln007 kernel: ext3_abort called.
        Nov  9 14:19:48 chln007 kernel: EXT3-fs error (device dm-7):
ext3_journal_start_sb: Detected aborted journal

Instead, there is nothing of the sort.  Quite unlike the previous logs, 
where ports on the SAN were being disabled and enabled (causing RSCNs
to be pushed to the HBA (initiator)).
Comment 33 Andrew Vasquez 2006-11-10 13:53:22 EST
Re: #31,

From a block-driver perspective, I can't say which type of
I/Os are failing from the November 5th run, either data or
meta-data (journaling), as the midlayer is simply reporting
some failure, probably due to fast-fail and the LLDD
(qla2xxx) returning back DID_BUS_BUSY during SAN disruptions
(again, this is a best guess given descriptions of what was
occurring from emails [no extended_error_logging]) to recycle
the command.

There appear to be *many* failures before ext3 goes
readonly:

	Nov  5 16:06:01 chln007 kernel: SCSI error : <0 0 2 28> return code = 0x20000
	Nov  5 16:06:01 chln007 kernel: end_request: I/O error, dev sdh, sector 128520408
	Nov  5 16:06:01 chln007 kernel: end_request: I/O error, dev sdh, sector 128520416
	Nov  5 16:06:01 chln007 kernel: SCSI error : <0 0 2 28> return code = 0x20000 
	Nov  5 16:06:01 chln007 kernel: end_request: I/O error, dev sdh, sector 128520456
	Nov  5 16:06:01 chln007 kernel: end_request: I/O error, dev sdh, sector 128520464
	...
	Nov  5 16:06:02 chln007 kernel: end_request: I/O error, dev sdh, sector 128520720
	Nov  5 16:06:02 chln007 kernel: end_request: I/O error, dev sdh, sector 128520728
	Nov  5 16:06:02 chln007 kernel: SCSI error : <0 0 2 28> return code = 0x20000
	...
	Nov  5 16:32:33 chln007 kernel: end_request: I/O error, dev sdg, sector 13741104
	Nov  5 16:32:33 chln007 kernel: end_request: I/O error, dev sdg, sector 13741112
	Nov  5 16:32:33 chln007 kernel: SCSI error : <0 0 2 27> return code = 0x20000
	Nov  5 16:32:33 chln007 kernel: end_request: I/O error, dev sdg, sector 13741136
	Nov  5 16:32:33 chln007 kernel: SCSI error : <0 0 2 27> return code = 0x20000
	...
	Nov  5 18:08:47 chln007 kernel: SCSI error : <0 0 2 27> return code = 0x20000
	Nov  5 18:08:47 chln007 kernel: end_request: I/O error, dev sdg, sector 17729784
	Nov  5 18:08:47 chln007 kernel: end_request: I/O error, dev sdg, sector 17729792
	Nov  5 18:16:55 chln007 su(pam_unix)[23001]: session closed for user root
	Nov  5 18:16:57 chln007 sshd(pam_unix)[19572]: session closed for user lidn4nj
	Nov  5 18:43:12 chln007 kernel: EXT3-fs error (device dm-7): ext3_readdir:
directory #1020212 contains a hole at offset 45056
	Nov  5 18:43:12 chln007 kernel: Aborting journal on device dm-7.
	Nov  5 18:43:12 chln007 kernel: EXT3-fs error (device dm-7) in
start_transaction: Readonly filesystem
	Nov  5 18:43:12 chln007 kernel: Aborting journal on device dm-7.

so the failures actually span roughly 2.5 hours before ext3 goes read-only.
Comment 34 Chris Evich 2006-11-10 14:41:15 EST
Re: #32, please see the update in #27; the problem was caused by a typo on the
options line (a "1" (one) instead of an "l" (ell)).  The on-site tech verified
dmesg output on 007 after fixing this and confirmed extended error logging is
now enabled.
Comment 35 Wayne Berthiaume 2006-11-13 08:11:10 EST
In response to comments 26, 27, and 31 from PP engineering:

It does or at least it does as recently as
PowerPath 4.5.1.

But so do all instances of dm-multipath that
work on all storage systems.  Possibly the
"queue_if_no_path" dm-multipath feature is
enabled which is preventing the error from
surfacing to ext3 in the dm-multipath case.

From reading the Red Hat bugzilla entry, this seems
like a problem in the handling of the DID_BUS_BUSY
host portion of a SCSI command error since a
DID_BUS_BUSY used to be "infinitely retryable"
and now receives no retries due to the use of
REQ_FAILFAST.

It certainly could happen that each path used
by PowerPath is failing simply due to the LLD
returning DID_BUS_BUSY.
Comment 36 Chip Coldwell 2006-11-13 11:45:29 EST
(In reply to comment #18)
> This continues until the first
> relevant 'EXT3 goes read-only failure' occurs at 6:43pm:
> 
> 	Nov  5 18:43:12 chln007 kernel: EXT3-fs error (device dm-7): ext3_readdir:
> directory #1020212 contains a hole at offset 45056

That error message is giving the inode number and offset for a hole in a
directory inode -- that's what's causing the filesystem to go readonly.  It
would be interesting to run debugfs on the filesystem *before* fscking it to
verify that there is a hole there -- that would tell us if the problem is on
magnetic media or if it is in the transport of data from the media to the VFS layer.

Chip
Comment 37 Chip Coldwell 2006-11-13 13:42:03 EST
(In reply to comment #36)
> (In reply to comment #18)
> > This continues until the first
> > relevant 'EXT3 goes read-only failure' occurs at 6:43pm:
> > 
> > 	Nov  5 18:43:12 chln007 kernel: EXT3-fs error (device dm-7): ext3_readdir:
> > directory #1020212 contains a hole at offset 45056
> 
> That error message is giving the inode number and offset for a hole in a
> directory inode -- that's what's causing the filesystem to go readonly.  It
> would be interesting to run debugfs on the filesystem *before* fscking it to
> verify that there is a hole there -- that would tell us if the problem is on
> magnetic media or if it is in the transport of data from the media to the VFS
layer.

BTW, the same type of failure happened on both hosts (007 and 106).  I repost
the error message from 106 below

Nov  6 01:29:02 chln106 kernel: EXT3-fs error (device dm-18): ext3_readdir:
directory #2342945 contains a hole at offset 163840

and from 007

Nov  5 18:43:12 chln007 kernel: EXT3-fs error (device dm-7): ext3_readdir:
directory #1020212 contains a hole at offset 45056

Is this "directory inode contains a hole" failure common to all of the crashes?

Chip
Comment 38 Chip Coldwell 2006-11-13 14:14:01 EST
(In reply to comment #31)
> If they are using dm_emc then the REQ_FAILFAST bit is used, else it's not.  

Are you sure?  I found this in drivers/block/ll_rw_blk.c:__make_request

        /*
         * inherit FAILFAST from bio (for read-ahead, and explicit FAILFAST)
         */
        if (bio_rw_ahead(bio) || bio_failfast(bio))
                req->flags |= REQ_FAILFAST;

from which it looks like all read-ahead bios are marked to fail fast.

Chip
Comment 39 Chip Coldwell 2006-11-13 14:17:17 EST
(In reply to comment #37)
> 
> BTW, the same type of failure happened on both hosts (007 and 106).  I repost
> the error message from 106 below
> 
> Nov  6 01:29:02 chln106 kernel: EXT3-fs error (device dm-18): ext3_readdir:
> directory #2342945 contains a hole at offset 163840
> 
> and from 007
> 
> Nov  5 18:43:12 chln007 kernel: EXT3-fs error (device dm-7): ext3_readdir:
> directory #1020212 contains a hole at offset 45056
> 
> Is this "directory inode contains a hole" failure common to all of the crashes?

Looks like it is; this is from comment 32 above

Nov  9 14:19:48 chln007 kernel: EXT3-fs error (device dm-7):
ext3_readdir: directory #1216813 contains a hole at offset 159744 

Chip
Comment 40 Keith Kearnan 2006-11-15 11:40:13 EST
I wanted to report that we have reproduced this 3 times now.  The latest was 
this morning:

Nov 15 07:09:16 L8 kernel: EXT3-fs error (device emcpowera1): ext3_readdir: 
directory #4505601 contains a hole at offset 12288
Nov 15 07:09:16 L8 kernel: Aborting journal on device emcpowera1.
Nov 15 07:09:16 L8 kernel: ext3_abort called.
Nov 15 07:09:16 L8 kernel: EXT3-fs error (device emcpowera1): 
ext3_journal_start_sb: Detected aborted journal
Nov 15 07:09:16 L8 kernel: Remounting filesystem read-only
Nov 15 07:09:16 L8 kernel: EXT3-fs error (device emcpowera1) in 
start_transaction: Journal has aborted  

The first test took 17 hours, the second 5, and the third 60 hours.  All of the 
errors occurred on the file system exercised by the "tpsreport.sh" script.  There 
are no errors in the messages file prior to the file system going read-only, and 
no fabric activity.

Our environment:
HP DL385 (AMD)
DMX2000, McData ED64M
2x QLogic PCI to Fibre Channel Host Adapter for QLA2342
Driver version 8.01.02-d4
RHEL4u3, 2.6.9-34.ELsmp
PowerPath 4.5.1

The current test is running with PowerPath debugging turned up.
Comment 41 Keith Kearnan 2006-11-15 14:35:55 EST
OK the tpsreport.sh generated another ext3 read-only error:

----------- Wed Nov 15 12:24:12 EST 2006 looping
----------- Wed Nov 15 12:24:12 EST 2006 creating files
::::::::>Error - 1 Create failure 
on /test2/metadata/0/9/metadata_14055_5626.tmp, Wed Nov 15 12:41:49 EST 2006
End Time:        Wed Nov 15 12:41:49 EST 2006

We see from the script that this is generated from an attempted write into this 
file:

<snip>
      while [ ${NUM} -le ${NUM_FILES} ]; do
        FILE=${FILE_PREV}_${RAN}_${NUM}.tmp
        echo 'hello world' >  ${FILE}
        checkRC $? "Create failure on ${FILE}"
<snip>

This file is actually present (but empty) in the file system.

[root@L8 9]# pwd
/test2/metadata/0/9
[root@L8 9]# ls -ial metadata_14055_5626.tmp
6334315 -rw-r--r--  1 root root 0 Nov 15 12:41 metadata_14055_5626.tmp
[root@L8 9]# cat metadata_14055_5626.tmp

The previous file is fine:

[root@L8 9]# ls -ial metadata_14055_5625.tmp
6334314 -rw-r--r--  1 root root 12 Nov 15 12:41 metadata_14055_5625.tmp
[root@L8 9]# cat metadata_14055_5625.tmp
hello world

The error is curious though:

Nov 15 12:41:49 L8 kernel: EXT3-fs error (device emcpowerb1): ext3_readdir: 
directory #5750785 contains a hole at offset 4096
Nov 15 12:41:49 L8 kernel: Aborting journal on device emcpowerb1.
Nov 15 12:41:49 L8 kernel: ext3_abort called.
Nov 15 12:41:49 L8 kernel: EXT3-fs error (device emcpowerb1): 
ext3_journal_start_sb: Detected aborted journal
Nov 15 12:41:49 L8 kernel: Remounting filesystem read-only
Nov 15 12:41:49 L8 kernel: EXT3-fs error (device emcpowerb1) in 
start_transaction: Journal has aborted

This is not the directory for our "problem" file.

[root@L8 ~]# debugfs /dev/emcpowerb1
debugfs 1.35 (28-Feb-2004)
debugfs:  cd <5750785>
debugfs:  pwd
[pwd]   INODE: 5750785  PATH: /metadata/3/6
[root]  INODE:      2  PATH: /
debugfs:  stat <5750785>
Inode: 5750785   Type: directory    Mode:  0755   Flags: 0x0   Generation: 
1681403964
User:     0   Group:     0   Size: 1916928
File ACL: 11469322    Directory ACL: 0
Links: 2   Blockcount: 3760
Fragment:  Address: 0    Number: 0    Size: 0
ctime: 0x455b486c -- Wed Nov 15 12:03:40 2006
atime: 0x455b515d -- Wed Nov 15 12:41:49 2006
mtime: 0x455b486c -- Wed Nov 15 12:03:40 2006
BLOCKS:
(0-11):11528192-11528203, (IND):11528204, (12-79):11528205-11528272,
(80):11528285, (81):11528414, (82):11528543, (83):11528672, (84):11528801,
(85):11528930, (86):11529059, (87):11529188, (88):11529317, (89):11529446,
(90):11529575, (91):11529704, (92):11529833, (93):11529962, (94):11530091,
(95):11530220, (96):11530349, (97):11530478, (98):11530607, (99):11530736,
(100):11530865, (101):11530994, (102):11531123, (103):11531252, (104):11531381,
(105):11531510, (106):11531639, (107):11531768, (108):11531897, (109):11532026,
(110):11532155, (111):11532284, (112):11532413, (113):11532542, (114):11532671,
(115):11532800, (116):11532929, (117):11533058, (118):11533187, (119):11533316,
(120):11533445, (121):11533574, (122):11533703, (123):11533832, (124):11533961,
(125):11534090, (126):11534219, (127):11534872, (128-156):11534988-11535016,
(157-467):11535024-11535334
TOTAL: 469

Chip...Is there any specific debugfs (logdump?) command I can run to provide 
additional "useful" information?
Comment 42 Tom Coughlan 2006-11-15 17:13:59 EST
Keith,

I have set up a system in our lab to try to reproduce the problem. I don't have
exactly the same HW (no Symmetrix), but I do have a multipath active/active
array. I would like to know exactly how you are running the tests to reproduce
the problem.  

I have:

ioloadgen-1.3.tgz (runs iozone-3-271)
tpsreport.sh (metadata exerciser)

Based on what I read in the BZ, I am running these both on two LUNs, each with a
separate ext3 fs. Is that all you are doing? Any more than two LUNs needed?

My test ran until the disk was full. It filled up because 
tpsreport.sh has quotes around the path name: 

  rm -rf "${BASE}/*"

so files never get deleted between loops. Did you change this test script?

If you have any more details or suggestions on the best way to reproduce
the problem, please let us know.

Tom
Comment 43 Keith Kearnan 2006-11-16 08:42:12 EST
Hi Tom,

I did not change the script, and I noticed that the file system does fill up 
after 48 hours or so in my environment; I simply restart it.  

I have two LUNs also (73GB).  Initially I executed tpsreport.sh on one and 
ioloadgen on the other.  After 14 hours (not 17, my mistake) the one running 
tpsreport.sh went read only.  The other three times it was also the 
tpsreport.sh script that went read-only.  Yesterday I combined running both 
scripts against the same LUN and it went read-only in about 2 hours.  I left 
this LUN in a read-only state for investigation.  I have run both scripts on 
the other LUN and it is still running (19 hours and counting).

-Keith
Comment 44 Chris Snook 2006-11-16 12:41:46 EST
I've looked at e2images from a couple other customers experiencing the same
problem, and they have *not* had null pages on disk, with the exception of one
case with a very badly mangled powerpath installation.  In general, it seems we
get the "hole" message because ext3 sees an empty buffer that was never filled
due to an I/O error.
Comment 45 Eric Sandeen 2006-11-16 16:23:09 EST
Can we test this with slightly modified ext3 code?

Unfortunately the "hole" message could be more helpful, if nothing else it could
print the block number & device it encountered the problem on... also it would
be good to know if ext3_get_blocks_handle or ext3_bread is what failed just
ahead of the message.

-Eric
Comment 46 Keith Kearnan 2006-11-17 14:37:31 EST
I am willing to test with modified ext3 code.  However, the error does indicate 
the device, directory and offset already.  The block number would be nice.

Nov 15 12:41:49 L8 kernel: EXT3-fs error (device emcpowerb1): ext3_readdir: 
directory #5750785 contains a hole at offset 4096

In this test, emcpowerb1 is a meta volume on a DMX that is seen down two paths.

Pseudo name=emcpowerb
Symmetrix ID=000187490063
Logical device ID=01B9
state=alive; policy=SymmOpt; priority=0; queued-IOs=0
==============================================================================
---------------- Host ---------------   - Stor -   -- I/O Path -  -- Stats ---
### HW Path                 I/O Paths    Interf.   Mode    State  Q-IOs Errors
==============================================================================
   0 qla2xxx                   sdb       FA  3aA   active  alive      0      0
   1 qla2xxx                   sdd       FA  4aA   active  alive      0      0

At this time we are focusing on generating a vmcore.  I have not seen the 
problem in the lab since Nov 15 12:41:49.
Comment 47 Keith Kearnan 2006-11-17 16:35:49 EST
We experienced another read-only error today at approximately 15:33.

EXT3-fs error (device emcpowera1): ext3_readdir: directory #6225921 contains a 
hole at offset 4096

I have placed the log, vmcore and System map files here:
ftp://ftp.emc.com/outgoing/213921/
Comment 48 Chip Coldwell 2006-11-17 16:40:18 EST
(In reply to comment #47)
> We experienced another read-only error today at approximately 15:33.
> 
> EXT3-fs error (device emcpowera1): ext3_readdir: directory #6225921 contains a 
> hole at offset 4096
> 
> I have placed the log, vmcore and System map files here:
> ftp://ftp.emc.com/outgoing/213921/

Excellent.  I'm downloading this now, and will post an analysis as soon as I
have one.

Chip

Comment 49 Tom Coughlan 2006-11-20 14:43:04 EST
I have tried to reproduce the problem using dm-multipath and a Dot Hill RAID
box. This is not as performant as the Symmetrix, but it is Active/Active. I
have run four over-night runs with a number of configurations, but have not
reproduced the failure. I plan to add more adapters and continue testing. It
would help me if anyone who has seen this problem on dm-multipath could send me
the output of:

multipath -ll
ls -l /dev/mpath/

I will use that to set up my configuration to match as closely as possible. 
Comment 51 Chip Coldwell 2006-11-20 17:14:44 EST
(In reply to comment #48)
> (In reply to comment #47)
> > We experienced another read-only error today at approximately 15:33.
> > 
> > EXT3-fs error (device emcpowera1): ext3_readdir: directory #6225921 contains a 
> > hole at offset 4096
> > 
> > I have placed the log, vmcore and System map files here:
> > ftp://ftp.emc.com/outgoing/213921/
> 
> Excellent.  I'm downloading this now, and will post an analysis as soon as I
> have one.

I've done some preliminary work on this core.  The panic happened in process
context; an "ls" command forked from the script "tpsreport.sh".  The error in
the kernel log 

EXT3-fs error (device emcpowera1): ext3_readdir: directory #6225921 contains a
hole at offset 4096

gives the inode number of a directory inode, and the file position within that
directory inode that could not be read.  The error message comes from
fs/ext3/dir.c:134 in the function ext3_readdir.  Essentially, what this function
was doing when the error occurred is iterating down the inode one buffer (block)
at a time and doing an ext3_bread on each buffer.  When ext3_bread returns NULL,
it indicates that it was not able to read the buffer, so ext3_readdir generates
the message and panics.

As can be seen from the source, there are two ways the ext3_bread function can
return NULL:

struct buffer_head *ext3_bread(handle_t *handle, struct inode * inode,
			       int block, int create, int *err)
{
	struct buffer_head * bh;
	int prev_blocks;

	prev_blocks = inode->i_blocks;

	bh = ext3_getblk (handle, inode, block, create, err);
	if (!bh)
		return bh;
	if (buffer_uptodate(bh))
		return bh;
	ll_rw_block (READ, 1, &bh);
	wait_on_buffer (bh);
	if (buffer_uptodate(bh))
		return bh;
	brelse (bh);
	*err = -EIO;
	return NULL;
}

The first is if ext3_getblk itself returns NULL, the second is if after starting
the I/O with ll_rw_block(READ, 1, &bh) and waiting for the I/O to complete, the
buffer is still not up to date.  In the second case, the value pointed to by the
err argument is set to -EIO.

-EIO is -5, or 0xfffffffb in 32-bit hexadecimal.  Since err in ext3_readdir is
an automatic variable, it is allocated on the stack, and this stack address is
passed in to ext3_bread.  Therefore, if this value was set, it should appear on
the stack frame for ext3_readdir, and indeed it does:

#16 [1002c461e00] ext3_readdir at ffffffffa00b4be5
    1002c461e08: 0000000000000001 0000000000000000 
    1002c461e18: 000001007fd3b800 0000000000000000 
    1002c461e28: 0000000000000000 ffffffff80189618 
    1002c461e38: 000001002c461f38 fffffffb00000002 
    1002c461e48: 000001003f082d40 000001007f850a00 
    1002c461e58: 0000010072f2acc0 ffffffff801d1f8b 

crash> rd -32 1002c461e44
     1002c461e44:  fffffffb                              ....

Furthermore, disassembly of ext3_readdir shows the position of the err variable
on the stack to be at offset 0x3c (N.B. x86-64 calling convention puts the first
six arguments in %rdi, %rsi, %rdx, %rcx, %r8, %r9, respectively):

0xffffffffa00b4ba0 <ext3_readdir+0x245>:        lea    0x3c(%rsp),%r8

and, indeed, subtracting 0x3c from the address 0x1002c461e44 gives the base of
the stack for ext3_readdir

crash> eval 1002c461e44 - 3c
hexadecimal: 1002c461e08  

(return address is at stack base - 0x8).

This increases my confidence that the code path through ext3_bread went via
ll_rw_block and wait_on_buffer; additional evidence for this is the presence of
io_schedule in the stack trace -- the function wait_on_buffer is an inline
around io_schedule.

Conclusion so far: ext3_bread failed because the block could not be read.

At this point, we can find the address of the buffer_head on the stack for
ext3_bread; it is at 0x10061c68ea0:

crash> buffer_head 0x10061c68ea0
struct buffer_head {
  b_state = 0x18, 
  b_this_page = 0x10061c68ea0, <== points back to the same buffer_head
  b_page = 0x100409932b8, 
  b_count = {
    counter = 0x1
  }, 
  b_size = 0x1000, <== (0x1000 == 4096); entire page
  b_blocknr = 0xbe7801, 
  b_data = 0x1006b931000 "", 
  b_bdev = 0x1003fde4640, 
  b_end_io = 0xffffffff80178cb2 <end_buffer_read_sync>, 
  b_private = 0x0, 
  b_assoc_buffers = {
    next = 0x10061c68ee8, 
    prev = 0x10061c68ee8
  }
}

From the b_size and b_this_page fields we can see that this block is 4096 bytes,
so we are dealing with blocks that are equal in size to pages (this can vary
depending on the block device); blocks exist in 1-1 correspondence with pages. 
Furthermore, from the offset in the error message, we can see that this is the
second block in the directory inode.

Also note that ll_rw_block sets b_end_io to <end_buffer_read_sync>, further
evidence that the code path went via ll_rw_block.

ll_rw_block would have submitted the read request to the block subsystem by
calling the submit_bh function, which allocates a bio and sets bio->bi_private
to the address of the buffer_head.  So if the bio has not already been recycled,
we should be able to locate it by searching kernel memory for occurrences of the
buffer_head address:

crash> search -k 0000010061c68ea0
1002c461d68: 10061c68ea0
1002c461d98: 10061c68ea0
1002c461dc8: 10061c68ea0
1002c461de8: 10061c68ea0
100409932c8: 10061c68ea0
10040e21580: 10061c68ea0 
10061c68ea8: 10061c68ea0 <=== buffer_head itself
1007ffcd260: 10061c68ea0 <=== in the 1-k slab cache

I haven't been able to back-cast any of these addresses into a struct bio (with
offset 0x58).  So the bio might already have been recycled when the dump
started; I'm still poking around.

Chip
Comment 52 Tom Coughlan 2006-11-21 09:33:08 EST
The files from chln106 show

mpath25 -> ../dm-7

and 

mpath25 uses hwhandler="1 emc" and is composed of 
sda sdb sdk sdt 

An earlier sysreport from this system shows these devices are 

DGC      Model: LUNZ

Those are Clariion pseudo devices, not legitimate storage devices. 

Earlier failure reports in the IT show errors on this device:

Jul 31 12:59:23 chln106 kernel: EXT3-fs error (device dm-7): ext3_readdir:
directory #869102 contains a hole at offset 262144

(More recent failures on chln007 also show dm-7. Are these systems identical?)

If this is the current configuration, they should not be doing I/O to dm-7. 

On our phone call we heard about two recent failures with dm-multipath on
chln106, 11/13 and 11/20. Please post a current sysreport and the logs showing
both of these errors. 
Comment 56 Chip Coldwell 2006-11-21 16:39:08 EST
(In reply to comment #51)

> we should be able to locate it by searching kernel memory for occurrences of the
> buffer_head address:
> 
> crash> search -k 0000010061c68ea0
> 1002c461d68: 10061c68ea0
> 1002c461d98: 10061c68ea0
> 1002c461dc8: 10061c68ea0
> 1002c461de8: 10061c68ea0
> 100409932c8: 10061c68ea0
> 10040e21580: 10061c68ea0 
> 10061c68ea8: 10061c68ea0 <=== buffer_head itself
> 1007ffcd260: 10061c68ea0 <=== in the 1-k slab cache

I've made some more progress in deciphering these:

1002c461d68: 10061c68ea0 <=== bh_wait_queue on kernel stack in io_schedule
1002c461d98: 10061c68ea0 <=== bh_wait_queue on kernel stack in io_schedule
1002c461dc8: 10061c68ea0 <=== close to ffffffffa00b8810 (t) ext3_bread+0x72
1002c461de8: 10061c68ea0 <=== close to ffffffffa00b4be5 (t) ext3_readdir+0x28a
100409932c8: 10061c68ea0
10040e21580: 10061c68ea0 
10061c68ea8: 10061c68ea0 <=== buffer_head->b_this_page of the buffer_head itself
1007ffcd260: 10061c68ea0 <=== per-cpu cache in buffer_head slab cache (not
interesting)

One thing that's puzzling is that there are *two* bh_wait_queues on the kernel
stack in io_schedule.

Chip
Comment 57 Chip Coldwell 2006-11-21 16:43:47 EST
(In reply to comment #56)
> 1002c461dc8: 10061c68ea0 <=== close to ffffffffa00b8810 (t) ext3_bread+0x72
> 1002c461de8: 10061c68ea0 <=== close to ffffffffa00b4be5 (t) ext3_readdir+0x28a

These two are just places where the buffer_head was pushed on the kernel stack.

Chip

Comment 58 Chip Coldwell 2006-11-22 14:16:43 EST
(In reply to comment #56)
> 
> I've made some more progress in deciphering these:
> 
> 1002c461d68: 10061c68ea0 <=== bh_wait_queue on kernel stack in io_schedule
> 1002c461d98: 10061c68ea0 <=== bh_wait_queue on kernel stack in io_schedule

[...]

> One thing that's puzzling is that there are *two* bh_wait_queues on the kernel
> stack in io_schedule.

Closer examination of these bh_wait_queues makes them look even more mysterious.
 Here's the first one:

crash> bh_wait_queue 1002c461d68
struct bh_wait_queue {
  bh = 0x10061c68ea0, 
  wait = {
    flags = 0x0, 
    task = 0x10007b347f0, 
    func = 0xffffffff80178ad2 <bh_wake_function>, 
    task_list = {
      next = 0x1002c461db8, 
      prev = 0x1002c461db8
    }
  }
}

and the second one

crash> bh_wait_queue 1002c461d98
struct bh_wait_queue {
  bh = 0x10061c68ea0, 
  wait = {
    flags = 0x0, 
    task = 0x10007b347f0, 
    func = 0xffffffff80178ad2 <bh_wake_function>, 
    task_list = {
      next = 0x1002c461db8, 
      prev = 0x1002c461db8
    }
  }
}

These two are identical copies.  In the case of the second one, the task_list
indicates that it is the only element of the list: next == prev == 1002c461d98 +
0x20, where 1002c461d98 is the address of the bh_wait_queue and 0x20 is the
offset of the task_list in the bh_wait_queue.  This is what you would expect
after a LIST_HEAD_INIT macro.  However, the first bh_wait_queue is identical,
but at a different address on the stack.  That seems to indicate a list with two
entries on it, namely the first and second bh_wait_queues.  So the two list
entries are inconsistent, and I can't figure out how there got to be two of them
on the stack yet, anyway.

Chip
Comment 59 Chris Evich 2006-11-27 14:43:25 EST
Re: "I can't figure out how there got to be two of them
on the stack yet, anyway."

Could it have anything to do with two background "ls -R" processes running
simultaneously?  What if they were running on separate CPUs and just happened to
be looking at the same block at the same time?  

(please forgive my ignorance if this is completely impossible given the kernel
code which I don't fully understand).
Comment 60 Chris Evich 2006-11-28 15:34:47 EST
Thinking about it some more, I see that my question is impossible as two
processes wouldn't share a stack.  Would instrumenting ext3_bread() be of any
help in tracking down the source of the duplicate and/or failure mode?
Comment 61 Fabio Olive Leite 2006-11-28 17:29:50 EST
Could the two seemingly identical bh_wait_queues be the result of improper clean
up before re-issuing a block operation that returned a recoverable error?

Something like a block read times out, ext3 retries it but instead of reusing
the bh_wait_queue it allocates a new one and does not initialize it properly.
Perhaps some part of the error recovery path was done assuming bh reuse and
another part assuming a fresh bh, and the kernel ends up with a new but only
partly initialized bh.

Or maybe I should just keep quiet. :)
Comment 62 Chris Evich 2006-11-29 09:23:38 EST
I was thinking along the same lines as well, but in this core (triggered upon a
failure) the bh data is identical.  If it where as you suggest, I would expect
the data to be different.  

However, that's not to say there isn't another possibility: what if different
parts of the kernel are looking at the two different bh structs?  At some
point, one is considered temporary and freed (leaving the data in memory but
no longer valid).  Subsequently, the in-memory data changes (due to another
allocation) and freaks out the kernel.  However, in this case too, I would
expect the two in-memory structs' data to be different, not identical.

Either way, it seems like instrumenting ext3_bread() may give us helpful
information, especially since we're fairly sure at what point it's failing.

Comment 63 Chip Coldwell 2006-11-29 10:35:27 EST
(In reply to comment #61)
> Could the two seemingly identical bh_wait_queues be the result of improper clean
> up before re-issuing a block operation that returned a recoverable error?

No.  First, it's important to understand that the bh_wait_queues are allocated
on the stack.  So as soon as we return from the function __wait_on_buffer they
should be automatically deallocated by the stack adjustment.  These are local
automatic variables on the stack, not kmalloc-ed data on the heap.

So all I can think of is that the compiler did the defer-stack-adjustment
optimization.  I'm going to have a look at the assembly so see if that theory
makes any sense.

Chip
Comment 64 Chris Evich 2006-11-29 11:13:45 EST
Ahh, okay, so it could be that ext3_bread() was called twice on the same block,
the stack moved, and the previous data, while not in use, still existed in
memory.  Either way, it's starting to sound as if having these two copies may
be a red herring, no?
Comment 65 Chris Evich 2006-11-29 12:16:49 EST
All,
      Here's a summary of the vendor call:

+Setting /proc/sys/vm/vfs_cache_pressure to 500 will help force more frequent
filesystem metadata flushes.
+3 Likely failure scenarios:
     1) An Actual underlying I/O error that's not being reported for whatever
reason.
     2) An error between the driver and filesystem layers.
     3) An error related to metadata cache and/or cache read-ahead (FAST FAIL
bit set).
+Problem has been reproduced at EMC w/ PowerPath and _only_ metadata test script
(tpsreport.sh).
+Problem seems more easily reproducible when multiple tests are run in parallel
on multiple SAN filesystems:
+Customer setup-  SAN LUNs -> PVs -> 1 VG -> 16 LVs -> ext3 -> two ioloadgen
and one tpsreport instance(s) on each.
+At this point, we need to generate/analyze as many vmcores as possible
(produced with ext3 error=panic).
+Suggested customer testing: disable dm_multipath queue_if_no_path, everything
else the same.
+Suggested vendor testing: dm_multipath, multiple filesystems, multiple
tpsreport instances on each.

(Please remind me if I've left anything out)
Comment 66 Issue Tracker 2006-11-29 15:24:27 EST
multipath.conf file for testing against Symmetrix should look like this:

devnode_blacklist {
	devnode "^(ram|raw|loop|fd|md|dm-|sr|scd|st)[0-9]*"
	devnode "^hd[a-z][0-9]*"
	devnode "^cciss!c[0-9]d[0-9]*"
}

defaults {
	user_friendly_names yes
}

devices {
       device {
		vendor                  "EMC"
		product                 "SYMMETRIX"
		path_grouping_policy    multibus
		path_checker		    tur
		getuid_callout          "/sbin/scsi_id -g -u -p0x80 -s /block/%n"
		#feature			    "1 queue_if_no_path"
		no_path_retry		    fail
		failback		        immediate
       }
}


This event sent from IssueTracker by cevich 
 issue 97316
Comment 67 Chip Coldwell 2006-11-29 15:42:23 EST
(In reply to comment #63)
> (In reply to comment #61)
> > Could the two seemingly identical bh_wait_queues be the result of improper clean
> > up before re-issuing a block operation that returned a recoverable error?
> 
> No.  First, it's important to understand that the bh_wait_queues are allocated
> on the stack.  So as soon as we return from the function __wait_on_buffer they
> should be automatically deallocated by the stack adjustment.  These are local
> automatic variables on the stack, not kmalloc-ed data on the heap.
> 
> So all I can think of is that the compiler did the defer-stack-adjustment
> optimization.  I'm going to have a look at the assembly so see if that theory
> makes any sense.

Curiouser and curiouser.  The automatic variables in __wait_on_buffer are

	wait_queue_head_t *wqh = bh_waitq_head(bh);
	DEFINE_BH_WAIT(wait, bh);

(the second line is a macro that expands into an automatic variable
declaration).  The variable 'wqh' is allocated a register (%r12), and so
occupies no space on the stack.  The DEFINE_BH_WAIT macro expands into

	struct bh_wait_queue name = {					\
		.bh	= b,						\
		.wait	= {						\
				.task	= current,			\
				.flags	= f,				\
				.func	= bh_wake_function,		\
				.task_list =				\
					LIST_HEAD_INIT(name.wait.task_list),\
			},						\
	}

The struct has size 0x30 (48 bytes)

crash> bh_wait_queue -o
struct bh_wait_queue {
   [0x0] struct buffer_head *bh;
   [0x8] wait_queue_t wait;
}
SIZE: 0x30

but the function allocates 0x60 (96 bytes) for the wait queue head:

crash> dis __wait_on_buffer
0xffffffff80178bcf <__wait_on_buffer>:          push   %r12
0xffffffff80178bd1 <__wait_on_buffer+2>:        push   %rbp
0xffffffff80178bd2 <__wait_on_buffer+3>:        push   %rbx
0xffffffff80178bd3 <__wait_on_buffer+4>:        mov    %rdi,%rbx
0xffffffff80178bd6 <__wait_on_buffer+7>:        sub    $0x60,%rsp

[ ... snip ... ]

0xffffffff80178c5e <__wait_on_buffer+143>:      add    $0x60,%rsp
0xffffffff80178c62 <__wait_on_buffer+147>:      pop    %rbx
0xffffffff80178c63 <__wait_on_buffer+148>:      pop    %rbp
0xffffffff80178c64 <__wait_on_buffer+149>:      pop    %r12
0xffffffff80178c66 <__wait_on_buffer+151>:      retq   

/me scratches head.

Chip

> 
> Chip
> 

Comment 68 Chip Coldwell 2006-11-29 16:43:09 EST
Never underestimate the compiler's capacity to come up with weird sh*t.  I've
been studying the disassembled code for __wait_on_buffer and yes, indeed it does
allocate *two* bh_wait_queue structs on the stack, then fills in the first one
with this:

0xffffffff80178be2 <__wait_on_buffer+19>:       mov    %rbx,(%rsp)
0xffffffff80178be6 <__wait_on_buffer+23>:       movl   $0x0,0x8(%rsp)
0xffffffff80178bee <__wait_on_buffer+31>:       mov    %gs:0x0,%rax
0xffffffff80178bf7 <__wait_on_buffer+40>:       lea    0x30(%rsp),%rdi
0xffffffff80178bfc <__wait_on_buffer+45>:       mov    %rax,0x10(%rsp)
0xffffffff80178c01 <__wait_on_buffer+50>:       movq  
$0xffffffff80178ad2,0x18(%rsp)
0xffffffff80178c0a <__wait_on_buffer+59>:       mov    %rsp,%rsi
0xffffffff80178c0d <__wait_on_buffer+62>:       mov    $0xc,%ecx <== set up repz
0xffffffff80178c12 <__wait_on_buffer+67>:       lea    0x20(%rdi),%rax
0xffffffff80178c16 <__wait_on_buffer+71>:       cld    
0xffffffff80178c17 <__wait_on_buffer+72>:       mov    %rax,0x20(%rsp)
0xffffffff80178c1c <__wait_on_buffer+77>:       mov    %rax,0x28(%rsp)

and then copies it all into the second one with this:

0xffffffff80178c21 <__wait_on_buffer+82>:       repz movsl %ds:(%esi),%es:(%edi)

??

Comment 69 Chip Coldwell 2006-11-29 17:04:43 EST
(In reply to comment #64)
> Ahh, okay, so it could be that ext3_bread() was called twice on the same block,
> the stack moved and the previous data while not in use, still existed in memory.
>  Either way, it's starting to sound as if having these two copies may be a red
> herring, no?

You're right, my conclusion is that this is a red herring.  The two copies of
the bh_wait_queue on the __wait_on_buffer stack are just an artifact of some
weird stuff that the compiler did.  The second of the two copies at 0x30(%rsp)
is the real one, and that is the one that is passed in to prepare_to_wait and
finish_wait.

Chip

Comment 70 Chris Evich 2006-11-30 10:23:06 EST
Though it is curious, is that struct passed-by-value anywhere?  That could also
explain two copies of it, the original one and the one local to a function
somewhere.  Either way, could we maybe add some instrumentation just before
"brelse (bh);" that would help validate the contents of bh?  

Also, I find it curious that we've not been able to reproduce this anywhere
against a local disk, only SAN devices, and maybe Symmetrix SANs exclusively.
In my mind, that suggests the issue logically _must_ have something to do with
driver or hardware interaction somewhere along the way.  Perhaps we should
re-validate that this cannot be reproduced on a local disk (using a beefy
system).

On an unrelated note: Wouldn't eliminating the copying of bh provide us with a
performance boost in high-metadata-load situations?
Comment 72 Chris Evich 2006-12-01 09:33:29 EST
Just to summarize:  The focus now is on the following lines of code within
ext3_bread():

	ll_rw_block (READ, 1, &bh);
	wait_on_buffer (bh);
	if (buffer_uptodate(bh))
		return bh;
	brelse (bh);
	*err = -EIO;
	return NULL;

We know err is being set to -EIO, therefore buffer_uptodate(bh) == FALSE.

To me, this reads like a failure with:

1) ll_rw_block() not doing its job for some reason
-and/or-
2) wait_on_buffer() not waiting long enough, or otherwise doing something that
results in a difference between the contents of bh and on-disk data
-and/or-
3) buffer_uptodate() returning a false-positive when in fact everything is in sync.

Am I missing anything here?  

Assuming this issue is definitely not reproducible on local disks, what aspect
of the behavior of a SAN and/or HBA drivers could cause these functions/macros
to not behave as expected?

Could we instrument the kernel to give us more information on any of these (or
other lower-level) conditions?
Comment 73 Tom Coughlan 2006-12-01 11:08:50 EST
> Assuming this issue is definitely not reproducible on local disks, what aspect
> of the behavior of a SAN and/or HBA drivers could cause these functions/macros
> to not behave as expected?

I have been going on the assumption that multipath is integral to this problem.
Multipath is exceedingly rare on local, non FC, storage.

Multipath code that is common to PowerPath and dm-multipath is the top candidate. 
Comment 74 Chip Coldwell 2006-12-01 12:15:55 EST
(In reply to comment #72)
> Just to summarize:  The focus now is on the following lines of code within
> ext3_bread():
> 
> 	ll_rw_block (READ, 1, &bh);
> 	wait_on_buffer (bh);
> 	if (buffer_uptodate(bh))
> 		return bh;
> 	brelse (bh);
> 	*err = -EIO;
> 	return NULL;
> 
> We know err is being set to -EIO, therefore buffer_uptodate(bh) == FALSE.

Yes, exactly.

> To me, this reads like a failure with:
> 
> 1) ll_rw_block() not doing its job for some reason

My top suspicion

> -and/or-
> 2) wait_on_buffer() not waiting long enough, or otherwise doing something that
> results in a difference between the contents of bh and on-disk data

Unlikely, but not impossible.

> -and/or-
> 3) buffer_uptodate() returning a false-positive when in fact everything is in
sync.

It's just a macro testing a flag, so it's unlikely.

> Could we instrument the kernel to give us more information on any of these (or
> other lower-level) conditions?

Yes, and perhaps I'm being a little too careful about this.  There are still
some things about the stack trace that I don't completely understand.  Take a
look at this:

#13 [1002c461d10] ext3_error at ffffffffa00be647
#14 [1002c461d40] io_schedule at ffffffff803053ef
#15 [1002c461de0] ext3_bread at ffffffffa00b8810
#16 [1002c461e00] ext3_readdir at ffffffffa00b4be5

From the code, I would have expected

io_schedule
__wait_on_buffer
ext3_bread
ext3_readdir

or

ext3_error
ext3_readdir

so there must be some deferred stack adjustments that are muddying the waters. 
I know from the disassembled code that __wait_on_buffer cleans up its stack;
that may be why it doesn't appear in the backtrace.

If we want to tackle this bug from another direction, Tom is right; we should
look at the multipath code and try to figure out a way for ext3_bread to fail. 
It's strange that it only fails for directories, though.

Chip
Comment 75 Chris Snook 2006-12-01 12:35:37 EST
It's probably failing for both directories and normal files, but read() will
return EIO to userspace, which will then (hopefully) retry the operation.  The
problem is that readdir() and getdents() do not have the option of returning EIO
to userspace, unless we want to "extend" POSIX in a non-portable fashion.  Thus,
the error has to be handled by the kernel (somehow) instead of passing it
through to userspace.  We might be able to make this go away by changing how
ext3 responds to EIO, but if there's an underlying bug that's causing that,
we're just papering over it.
Comment 76 Chip Coldwell 2006-12-01 12:39:26 EST
(In reply to comment #75)
> It's probably failing for both directories and normal files, but read() will
> return EIO to userspace, which will then (hopefully) retry the operation.

Good point.

Chip
Comment 77 Chip Coldwell 2006-12-01 13:32:46 EST
OK, here's what I propose.  Even though Chris Snook's argument in comment #75 is
indisputable, I think we should still restrict our instrumentation to catch the
failure to read a directory inode since otherwise we risk being overwhelmed with
false positives.  So I will instrument the function end_buffer_read_sync to
check if the buffer_head is associated with a directory inode and panic if the
read operation failed.  That will at least inch us closer to the problem.

Chip
Comment 78 Chip Coldwell 2006-12-01 15:31:23 EST
Created attachment 142617 [details]
instrument end_buffer_read_sync
Comment 79 Chip Coldwell 2006-12-01 15:36:12 EST
Created attachment 142618 [details]
instrument end_buffer_read_sync; fix missing semicolon
Comment 82 Chip Coldwell 2006-12-06 16:23:57 EST
I took a look at the core from the instrumented kernel.  It's definitely the
same bug -- ext3_readdir finds a hole in a directory.  The surprising thing is
that the  panic came from the ext3_error function ("errors=panic" mount option),
not the bugcheck that I added to end_buffer_read_sync.  I verified that the
bugcheck was there by disassembling the code for that function in the core.

There are at least two possible explanations for this:

1.  There's something wrong with the instrumentation/bugcheck that I added.
2.  end_buffer_read_sync did not run before the process that was woken up.

If the latter, then this would explain why the process wakes up and the buffer
still isn't marked up-to-date -- one of the things end_buffer_read_sync does is
to set that flag.

I'm going to double check the logic in my bugcheck.

Chip
Comment 83 Rick Beldin 2006-12-07 17:37:08 EST
Just a thought here... if we are wondering if end_buffer_read_sync() has run, is
there a way, perhaps using a flag variable or array of some sort, to check if it
has run recently?   Would logging the buffer head (bh) and the current jiffies
be meaningful?   It might, as we get a panic, give us some detail of the recent
operations.   

Just a thought at the end of the day.   Quite possibly a silly one... ;-) 

Comment 84 Chip Coldwell 2006-12-08 12:43:41 EST
Created attachment 143176 [details]
memory hog

This is the source code for the memory hog program.  Usage is simple:

# ./memhog 95

(must be run as root!) will consume 95% of available physical memory.  Note
that it runs as a daemon, so your command prompt will return immediately, but
"top" followed by "M" should show the program memhog with an RSS consuming 95%
of available memory.
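The attachment itself is not reproduced here, but the idea can be sketched in a few lines (a simplified model only: it sizes the allocation from sysconf() and touches every page so the memory is actually resident; the real memhog additionally daemonizes, which this sketch omits):

```c
#include <stdlib.h>
#include <unistd.h>

/* Number of bytes corresponding to `percent` of physical memory. */
static size_t hog_bytes(int percent)
{
    size_t pages = (size_t)sysconf(_SC_PHYS_PAGES);
    size_t psize = (size_t)sysconf(_SC_PAGE_SIZE);
    return pages / 100 * percent * psize;   /* divide first to avoid overflow */
}

/* Allocate `n` bytes and touch every page so the RSS really grows;
 * returns the buffer (caller frees) or NULL on failure. */
static char *hog(size_t n)
{
    size_t psize = (size_t)sysconf(_SC_PAGE_SIZE);
    char *buf = malloc(n);
    if (buf)
        for (size_t off = 0; off < n; off += psize)
            buf[off] = 1;                   /* fault the page in */
    return buf;
}
```

Touching each page matters: an untouched malloc of 95% of RAM would mostly stay unbacked and put no pressure on the page cache.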
Comment 85 Chip Coldwell 2006-12-08 12:51:31 EST
Some ideas about reproducing this bug:

In every core so far, the problem has been that a low-level block I/O submitted
by ext3_readdir to read a directory inode from disk has failed.  So I propose
that we modify our strategy to exercise this code path as much as possible. 
What we want are directories containing *lots* of files (a wide, not necessarily
deep directory hierarchy), and enough memory pressure to prevent all of the
directory inodes from sitting in the page cache.  That is the purpose of the
memhog program above; you can shrink the page cache by running another program
that consumes a large percentage of physical memory.

I also suggest that every "ls -R" be followed by a "sync" to empty the cache.

Since we could very well be dealing with a synchronization problem, it would be
a good idea to have processes creating directory entries at the same time that
other processes are listing the directory contents.
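A minimal sketch of that "wide, not deep" layout (the directory and file names here are illustrative; the actual testing used the tpsreport.sh scripts attached to this bug):

```c
#include <stdio.h>
#include <stdlib.h>
#include <fcntl.h>
#include <sys/stat.h>
#include <unistd.h>

/* Create `nfiles` empty files directly under `dir`, producing a wide
 * (single-level) directory whose own inode spans many blocks -- the
 * layout ext3_readdir has to walk block by block.  Returns the number
 * of files created, or -1 if the directory could not be made. */
static int make_wide_dir(const char *dir, int nfiles)
{
    char path[4096];
    int made = 0;
    if (mkdir(dir, 0755) != 0)
        return -1;
    for (int i = 0; i < nfiles; i++) {
        snprintf(path, sizeof(path), "%s/f%06d", dir, i);
        int fd = open(path, O_CREAT | O_WRONLY, 0644);
        if (fd < 0)
            break;
        close(fd);
        made++;
    }
    return made;
}
```

With enough files per directory, the directory inode itself grows to many blocks, so listing it under memory pressure forces repeated metadata reads from disk.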

Chip
Comment 86 Chip Coldwell 2006-12-08 13:02:46 EST
(In reply to comment #84)
> Created an attachment (id=143176) [edit]
> memory hog
> 
> This is the source code for the memory hog program.  Usage is simple:
> 
> # ./memhog 95

This program should come with a warning label.  It's not hard at all to bring
your system to its knees if you specify a percentage that's too high.

Chip
Comment 87 Eric Sandeen 2006-12-08 13:09:47 EST
FWIW sync won't empty the cache, it just flushes modified pages to disk, leaving
them cached.

But this might help, and force a re-read:

drop_caches
-----------

Writing to this will cause the kernel to drop clean caches, dentries and
inodes from memory, causing that memory to become free.

To free pagecache:
        echo 1 > /proc/sys/vm/drop_caches
To free dentries and inodes:
        echo 2 > /proc/sys/vm/drop_caches
To free pagecache, dentries and inodes:
        echo 3 > /proc/sys/vm/drop_caches

As this is a non-destructive operation and dirty objects are not freeable, the
user should run `sync' first.
Comment 88 Rod Nayfield 2006-12-08 14:20:13 EST
>drop_caches

IIRC this was not in RHEL4, needs RHEL5 or later

Comment 89 Chris Evich 2006-12-08 14:32:32 EST
Created attachment 143187 [details]
metadata test script take 2

An almost complete re-write of the script.  I think you'll find that it spends
a great deal more time doing directory listings.  However, there's only so much
testing I can do on my little 1-CPU laptop.  So, please double-check that it's
doing what we want it to and feel free to fix it as needed.
Comment 90 Chip Coldwell 2006-12-08 14:37:22 EST
(In reply to comment #87)
> FWIW sync won't empty the cache, it just flushes modified pages to disk, leaving
> them cached.

Too bad.  But the memhog will empty the cache.

Chip

Comment 91 Chip Coldwell 2006-12-08 14:46:34 EST
Created attachment 143188 [details]
Install jprobes in end_bio_bh_io_sync and end_buffer_read_sync

Run this shell script to build and load a jprobe module that will instrument
the two functions end_bio_bh_io_sync and end_buffer_read_sync.	We'll get
printk output any time the I/O completes unsuccessfully.  My recommendation is
to turn off all the other device driver and SCSI midlayer debugging messages
for the time being so that we can quickly find the output from the
instrumentation in the system logs.

If the script succeeds, the last message it emits is "module loaded". 
Otherwise, something went wrong so let me know.

Chip
Comment 92 Chip Coldwell 2006-12-08 14:50:37 EST
(In reply to comment #91)
> Created an attachment (id=143188) [edit]
> Install jprobes in end_bio_bh_io_sync and end_buffer_read_sync

Another quick note -- this is to be used instead of the debugging kernel posted
earlier.  That debugging kernel failed to correctly identify a directory inode;
i.e. my instrumentation contained a bug.

Chip
Comment 93 Chip Coldwell 2006-12-08 15:50:34 EST
I had a look at the directory inode from the Nov 17 crash (Eric Sandeen is also
taking a look) and it certainly does not seem to have a hole at offset 4096 (==
0x1000).  It might also be worth pointing out that this directory was quite
large, the inode is 2.5 MiB.  However the failure happened when ext3_readdir was
reading the second block.

Chip
Comment 95 Chris Evich 2006-12-11 13:03:19 EST
All,

     Hot off the press - EMC reproduced the problem a half-hour before our call.
There were no other reports of reproducing the issue from testing over the
weekend.  Keith had some interesting observations of what happened this morning:

0) EMC PowerPath + QLogic test system running w/o jprobes.
1) Running 4 instances of the "old" tpsreport.sh script for the last 3 weeks, no
memhog.
2) One instance of tpsreport.sh script was replaced with the "new" version (in
bugzilla) at 8:30am EST today.
3) Some time later, Keith logged in to the X console of the box.
4) The system appeared to "Hang" partway through the GNOME splash screen (where
it starts up the panel, applets, etc.)
5) System experienced read-only problem shortly thereafter and dumped core at
approx. 11am EST.

We theorized that this repro. might have been "helped" by the extra memory
pressure placed on the system by logging into a GNOME session.  Keith will be
posting information on obtaining the core (and other details) directly to the
Bugzilla.

Chip has been working on tweaking the jprobes for this issue as there appear to
be some unintended problems.  He is asking that any tests being restarted with
the new "recipe" (below) wait for the new jprobes he is going to post into the
bugzilla.  The old jprobes appear to be generating some "false-positive" data.

There was a question regarding using these jprobes in customer production
environments and the possible impact.  While not specifically designed to cause
crashes, they should be reasonably safe, though may introduce a very slight
performance degradation.  However, on production systems _not_ also running with
SAN filesystems set to errors=panic, having the jprobes (kernel module) loaded
will be less useful.  Chip said that nevertheless, it will still provide
slightly helpful information if customers understand the above.

Testing update:
EMC: Gathering data, will restart testing with new recipe (below)
QLogic: testing with QLogic + dm-multipath, no failures, will restart with new
recipe.
Red Hat: Testing on Clariion + PowerPath, no failures.  Testing with
dm-multipath, no failures.
Customer Labs: Various test cases, no failures.

Latest test recipe:
+Wait for chip to post new jprobes to the Bugzilla
+Install new jprobes on test system(s)
+Completely disable swap on test system(s)
+For test systems with > 4 CPUs and > 4 tpsreport.sh test instances:
  -Lower "LISTING_RATIO" to 1, or 2 at the most.
  -Concern is that a large number of "ls -laR" processes will overload the system.
+Wait for tpsreport.sh test(s) and system load to "settle down"
+Start "inching" up memhog _slowly_:
  -Consume "most" of the remaining system memory
  -Leave some breathing room
  -It's __very__ easy to start triggering the OOM killer.
+Go get some coffee (or tea).
Comment 96 Keith Kearnan 2006-12-11 13:10:08 EST
We experienced another crash at approximately 10:55AM EST today.  The vmcore 
and associated files are located here: ftp://ftp.emc.com/outgoing/213921-2/

This appears almost identical to the Nov 17 crash.  The one exception being the 
following in the log immediately before the panic:

warning: many lost ticks.
Your time source seems to be instable or some driver is hogging interupts
rip __do_softirq+0x4d/0xd0

I assume we can attribute this to the running of the new metadata script from 
comment #89 which was running on another file system.
Comment 97 Chip Coldwell 2006-12-11 16:28:28 EST
Created attachment 143334 [details]
Install jprobes in end_bio_bh_io_sync and end_buffer_read_sync

This is the updated version of the jprobe module.  It is recommended that this
module should only be used with PowerPath; in our experiments dm-multipath will
generate tons of spurious output with this module loaded.

Chip
Comment 98 Keith Kearnan 2006-12-12 08:50:40 EST
We experienced another crash at approximately 19:05PM EST last night.  The new 
jprobe from comment #97 was installed.  The vmcore, log, messages file and dump 
of the directory can be located here: ftp://ftp.emc.com/outgoing/213921-3/

end_bio_bh_io_sync: bio 000001002cb00b80 bh 000001005080fdf0 bytes_done 4096 err -11 sector 98697767
end_buffer_read_sync: page 0000010040682fe0 mapping 000001003f5f5260 inode 000001003f5f5140 uptodate 0
EXT3-fs error (device emcpowerb1): ext3_readdir: directory #6160385 contains a hole at offset 241664
Kernel panic - not syncing: EXT3-fs (device emcpowerb1): panic forced after error
 
Comment 99 Chris Evich 2006-12-12 10:06:45 EST
My account reported the following error on a QLogic+PP box they were setting up.
Does this error come from the same code-path that we're investigating in this bug?


Dec  7 19:50:38 xxxx299 kernel: EXT3-fs error (device dm-7): ext3_find_entry:
reading directory #2444902 offset 1
Comment 100 Keith Kearnan 2006-12-12 10:52:46 EST
This appears to be a different area of code (namei.c) that is doing the same 
thing (!buffer_uptodate(bh)):

    872                 if (!buffer_uptodate(bh)) {
    873                         /* read error, skip block & hope for the best */
    874                         ext3_error(sb, __FUNCTION__, "reading directory #%lu "
    875                                    "offset %lu\n", dir->i_ino, block);
    876                         brelse(bh);
    877                         goto next;
    878                 }
Comment 101 Chris Evich 2006-12-12 11:27:24 EST
Thanks Keith.  The SAN filesystem in question is being used to house revision
control data.  I'd imagine that means it's got lots of directories and lots of
files in them.  It's good to know there are other ways of hitting the problem
via different code-paths as well.  Hopefully analysis of your latest
reproduction's vmcore/jprobes data will also be fruitful.
Comment 102 Chip Coldwell 2006-12-12 13:58:59 EST
(In reply to comment #99)
> My account reported the following error on a QLogic+PP box they were setting up.
> Does this error come from the same code-path that we're investigating in this bug?
> 
> 
> Dec  7 19:50:38 xxxx299 kernel: EXT3-fs error (device dm-7): ext3_find_entry:
> reading directory #2444902 offset 1

"device dm-7" makes me think this is the device mapper, not PP.

Chip

Comment 103 Keith Kearnan 2006-12-12 14:02:21 EST
Hi Chip,
To me this just means they have PowerPath devices configured under LVM.  I have 
seen this before as well.  A quick "vgdisplay -v" and a review of the /dev tree 
will tell the story.
-Keith
Comment 104 Chip Coldwell 2006-12-12 14:33:48 EST
(In reply to comment #103)
> Hi Chip,
> To me this just means they have PowerPath devices configured under LVM. 

Oh.  Of course.  Nevermind.

Chip
Comment 105 Chip Coldwell 2006-12-12 15:30:42 EST
An examination of the latest core from EMC (which had the jprobe module loaded)
indicates that I have been barking up the wrong tree.  Here's what I can say for
certain:

1.  The "int err" automatic variable on the stack of the ext3_readdir function
contains the value -EIO (0xfffffffb).
2.  The I/O error comes from ext3_bread.
3.  ext3_bread does not follow the code path through ll_rw_block and wait_on_buffer.

The conclusion is that the call to ext3_getblk in ext3_bread is returning NULL,
and setting err = -EIO.  I believe the only way that can happen is if
ext3_get_block_handle fails, so I would like to instrument this function next.
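A condensed model of the control flow being described (a simplified stand-in, not the real fs/ext3/inode.c code): when ext3_getblk fails, ext3_bread returns NULL with err = -EIO without ever reaching ll_rw_block or wait_on_buffer, which matches observation 3 above.

```c
#include <stddef.h>

#define EIO 5

struct bh2 { int uptodate; };

/* Simplified model of ext3_bread's control flow.  `getblk_ok` models
 * whether ext3_getblk returned a buffer at all; `read_ok` models
 * whether the subsequent read left the buffer up to date. */
static struct bh2 *model_ext3_bread(int *err, int getblk_ok, int read_ok,
                                    struct bh2 *bh)
{
    *err = -EIO;
    if (!getblk_ok)
        return NULL;        /* path 1: ext3_getblk failed -- no I/O was
                             * ever submitted, matching these cores */
    bh->uptodate = read_ok; /* ll_rw_block + wait_on_buffer, condensed */
    if (bh->uptodate) {
        *err = 0;
        return bh;
    }
    return NULL;            /* path 2: the read ran but failed */
}
```

Both paths hand -EIO back to ext3_readdir, which is why the caller's stack alone can't distinguish them; only a probe inside ext3_bread can.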

Chip
Comment 106 Chris Evich 2006-12-12 16:15:51 EST
Chip, I assume you're writing a new jprobe for ext3_get_block_handle?
Comment 107 Chip Coldwell 2006-12-12 17:09:59 EST
Created attachment 143456 [details]
probe ext3_get_block_handle in 2.6.9-34.EL x86_64

This jprobe/kprobe combination instruments the function ext3_get_block_handle. 
It will cause a kernel panic in that function if the kernel is failing in
ext3_get_block_handle.	This version is only for kernel 2.6.9-34.EL on x86_64
(the version EMC has been reproducing on).  I can provide modules for other
kernels/archs if there is interest.

Chip
Comment 108 Keith Kearnan 2006-12-12 18:47:39 EST
Thanks Chip!

The new jprobe is in and running.

Dec 12 17:57:13 L8 kernel: plant jprobe at ffffffffa00b7c3e

crash> dis ffffffffa00b7c3e
0xffffffffa00b7c3e <ext3_get_block_handle>:     int3
Comment 111 Keith Kearnan 2006-12-13 07:34:27 EST
We experienced another crash at approximately 03:45PM EST this morning.  The 
new jprobe from comment #107 was installed.  The vmcore, log, messages file and 
dump of the directory can be located here: ftp://ftp.emc.com/outgoing/213921-4/

Please note that I had turned up PowerPath debug for most of yesterday so there 
are some messages in the log.  I turned off PowerPath debug soon after 
installing the new jprobe (18:55PM EST).  While some of the PowerPath debug 
messages may seem ominous ("MpIodone error" & "Deadlock potential!") they can 
be considered normal with debugging turned on.  As you can see from the large 
messages file I've included on the ftp site they log continuously.  

<2>EXT3-fs error (device emcpowera1): ext3_readdir: directory #2064385 contains a hole at offset 4096
Kernel panic - not syncing: EXT3-fs (device emcpowera1): panic forced after error

This file system was running the "new" tps script.

-Keith
Comment 112 Keith Kearnan 2006-12-13 11:00:33 EST
We experienced another crash at approximately 09:24PM EST this morning.  The 
new jprobe from comment #107 was installed.  The vmcore, log, messages file and 
dump of the directory can be located here: ftp://ftp.emc.com/outgoing/213921-5/

EXT3-fs error (device emcpowerb1): ext3_readdir: directory #6127617 contains a hole at offset 8192
Kernel panic - not syncing: EXT3-fs (device emcpowerb1): panic forced after error

-Keith
Comment 113 Chip Coldwell 2006-12-13 15:50:49 EST
(In reply to comment #112)
> We experienced another crash at approximately 09:24PM EST this morning.  The 
> new jprobe from comment #107 was installed.  The vmcore, log, messages file and 
> dump of the directory can be located here: ftp://ftp.emc.com/outgoing/213921-5/
> 
> EXT3-fs error (device emcpowerb1): ext3_readdir: directory #6127617 contains a 
> hole at offset 8192
> Kernel panic - not syncing: EXT3-fs (device emcpowerb1): panic forced after 
> error

From this panic and the previous one, I think I'm still barking up the wrong
tree.  It should have panic'd before this, in the jprobe module.  I will turn
another rev on that and get it out ASAP.

Also, I wrote a quick and dirty program for reading the directory blocks that
you're dumping with debugfs; I'll attach the source code.

Chip
Comment 114 Chip Coldwell 2006-12-13 15:56:45 EST
Created attachment 143555 [details]
program to read directory inode dump (gcc -o readdir readdir.c)
Comment 115 Chip Coldwell 2006-12-13 16:01:25 EST
Created attachment 143556 [details]
program to read directory inode dump (gcc -o readdir readdir.c)
Comment 116 Chip Coldwell 2006-12-13 17:18:47 EST
Created attachment 143567 [details]
probe ext3_bread (2.6.9-34.ELsmp x86_64)

This version probes farther upstream, in ext3_bread.  I'm quite certain that
ext3_bread is returning NULL to ext3_readdir, but there are two ways it could
do so.	This module sets a probe on each of those two ways, and will bugcheck
if either of them happens.  The bugcheck will indicate which one it was.

Chip
Comment 117 Keith Kearnan 2006-12-13 19:59:22 EST
Thanks Chip!

The new jprobes/kprobes are in and running.

Dec 13 19:56:01 L8 kernel: plant jprobe at ffffffffa00b879e
Dec 13 19:56:01 L8 kernel: plant kprobe at ffffffffa00b87a7
Dec 13 19:56:01 L8 kernel: plant kprobe at ffffffffa00b880b

-Keith 
Comment 118 Keith Kearnan 2006-12-14 05:50:57 EST
We experienced another crash at approximately 00:47AM EST this morning.  The 
new jprobe/kprobe from comment #116 was installed.  The vmcore and log, can be 
located here: ftp://ftp.emc.com/outgoing/213921-6/

Kernel BUG at jprobe:22

RIP: 0010:[<ffffffffa03ee045>] <ffffffffa03ee045>{:jprobe:ext3_bread_before_brelse+9}
Comment 119 Chris Mason 2006-12-14 09:41:16 EST
Created attachment 143633 [details]
use submit_bh instead of ll_rw_block

This is untested and has not been compiled.  One often (by me anyway) forgotten
ll_rw_block feature is that it skips locked buffers, assuming they are already
in flight under IO.

It seems possible some part of the kernel is breaking this rule, and
ext3_getblk seems like as good a suspect as any.  So, if someone is locking a
buffer without starting io on it, ext3_bread may return before any IO has
actually been done.  I've no proof this is happening; we haven't been able to
reproduce it yet at Oracle.
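The hazard being described here can be captured in a tiny model (userspace stand-ins, not kernel code): a caller that locks a buffer for mutual exclusion rather than I/O makes ll_rw_block a silent no-op, so a later wait_on_buffer wakes with the buffer still not up to date even though no read ever failed.

```c
#include <stddef.h>

struct bh3 { int locked; int uptodate; int io_submitted; };

/* Model of ll_rw_block's contract for a read: if the buffer is
 * already locked, assume I/O is in flight and submit nothing. */
static void model_ll_rw_block_read(struct bh3 *bh)
{
    if (bh->locked)
        return;              /* skip -- someone else "owns" the buffer */
    bh->locked = 1;
    bh->io_submitted = 1;    /* stand-in for submit_bh(READ, bh) */
}

/* Model of the rule-breaking caller: it locked the buffer for
 * exclusion, did no I/O, and later simply unlocked it. */
static void model_bad_locker_unlock(struct bh3 *bh)
{
    bh->locked = 0;          /* wait_on_buffer() sleepers now wake up...
                              * ...but nothing ever set uptodate */
}
```

This is exactly the symptom in the cores: !buffer_uptodate(bh) after wait_on_buffer, with no I/O error anywhere in the logs. submit_bh, by contrast, always issues the read, which is why the patch uses it.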
Comment 120 Chip Coldwell 2006-12-14 10:20:33 EST
(In reply to comment #119)
> Created an attachment (id=143633) [edit]
> use submit_bh instead of ll_rw_block
> 
> This is untested and has not been compiled.  One often (by me anyway) forgotten
> ll_rw_block feature is that it skips locked buffers, assuming they are already
> in flight under IO.
> 
> It seems possible some part of the kernel is breaking this rule, and
> ext3_getblk seems like as good a suspect as any.  So, if someone is locking a
> buffer without starting io on it, ext3_bread may return before any IO has
> actually been done.  I've no proof this is happening, we haven't been able to
> reproduce it yet at Oracle.

It's an interesting idea.  The upstream kernel has introduced a "READ_META" (an
additional flag) to indicate that the ll_rw_block is a metadata read.  I haven't
yet found how this is handled differently.

Chip

Comment 121 Keith Kearnan 2006-12-14 10:26:41 EST
Hi Chris,
Before testing any fixes I'd like to test with additional jprobes/kprobes to 
nail down the root cause.  The main reason is that in some cases it takes days 
for a repro to occur.  How long do we run with a fix before declaring it 
valid?  

Also, based on the latest vmcore we are getting past a "wait_on_buffer (bh)" 
inside of ext3_bread.  Shouldn't that wait for the locked buffer to complete 
its I/O?
Comment 122 Chris Evich 2006-12-14 10:46:34 EST
Kearnan_Keith@emc.com wrote:
> Hi folks,
> 
> With the latest vmcore we panic in the jprobe.
> 
> Kernel BUG at jprobe:22
> invalid operand: 0000 [1] SMP
> CPU 3
> Modules linked in: jprobe(U) parport_pc lp parport netconsole netdump autofs4
> i2c_dev i2c_core sunrpc ds yenta_socket pcmcia_core emcphr(U) emcpmpap(U)
> emcpmpaa(U) emcpmpc(U) emcpmp(U) emcp(U) emcplib(U) button battery ac md5 ipv6
> ohci_hcd hw_random tg3 floppy dm_snapshot dm_zero dm_mirror ext3 jbd dm_mod
> qla2300 qla2xxx scsi_transport_fc cciss sd_mod scsi_mod
> Pid: 3991, comm: ls Tainted: P      2.6.9-34.ELsmp
> RIP: 0010:[<ffffffffa03ee045>] <ffffffffa03ee045>{:jprobe:ext3_bread_before_brelse+9}
> 
> I know the remaining possibilities seem to be:
> 
> 2) wait_on_buffer() not waiting long enough, or otherwise doing something that
> results in a difference between the contents of bh and on-disk data 
> -and/or-
> 3) buffer_uptodate() returning a false-positive when in fact everything is in sync.
> 
> Could there be a 4th possibility where in between the following:
> 
>     926         wait_on_buffer (bh);
>     927         if (buffer_uptodate(bh))
> 
> the bh is getting altered?  Should a lock_buffer(bh) be inserted here?
> 
> The __wait_on_buffer comment seems to indicate this is a possibility.
> 
>     149 /*
>     150  * Block until a buffer comes unlocked.  This doesn't stop it
>     151  * from becoming locked again - you have to lock it yourself
>     152  * if you want to preserve its state.
>     153  */
>     154 void __wait_on_buffer(struct buffer_head * bh)
>     155 {
>     156         wait_queue_head_t *wqh = bh_waitq_head(bh);
>     157         DEFINE_BH_WAIT(wait, bh);
>     158
>     159         do {
>     160                 prepare_to_wait(wqh, &wait.wait, TASK_UNINTERRUPTIBLE);
>     161                 if (buffer_locked(bh)) {
>     162                         sync_buffer(bh);
>     163                         io_schedule();
>     164                 }
>     165         } while (buffer_locked(bh));
>     166         finish_wait(wqh, &wait.wait);
>     167 }
> 
> static inline int buffer_uptodate(struct buffer_head * bh) { return test_bit(BH_Uptodate, &bh->b_state); }
> 
> ... no locking? it's atomic, but no MP protection. 

I seem to recall Chip saying that bh is supposed to be locked before the call to
ext3_bread, but I could be mistaken.  Either way, I have a good feeling about
this line of thinking.  Could this be one of those nassssssty tricksie little
Hobbitsesss: locking higher up the call stack is irrelevant because the caller
has already unlocked, so we need to lock in ext3_bread itself (it may be running
concurrently somehow)?

Another thing: is there any way the buffer could become unlocked in between
lines 161 and 163 above?  (I'm assuming this lock on bh is some type of mutex
and not a "> 1" semaphore, right?)
Comment 123 Chip Coldwell 2006-12-14 10:51:17 EST
(In reply to comment #121)
> Hi Chris,
> Before testing any fixes I'd like to test with additional jprobes/kprobes to 
> nail down the root cause.

The patch Chris submitted could be considered as a diagnostic.  Basically, it
implements the same logic as ll_rw_block, with one important difference. 
ll_rw_block skips the buffer_head if it is locked, Chris' version skips the
buffer_head if it is up-to-date.

If the bug reproduces with Chris' patch installed, then we know that the problem
really is low-level I/O and not buffer locking/synchronization.

Chip
Comment 124 Chris Mason 2006-12-14 10:54:23 EST
It seemed strange that we suddenly stopped getting errors from the device 
layer.  I just double checked and the end_request: I/O error message only 
comes out at the default log level.

Has the console log level been increased to the max on boxes where this has 
been reproduced recently?  Otherwise we may be missing messages that point 
away from the filesystem.
Comment 125 Chip Coldwell 2006-12-14 11:04:20 EST
(In reply to comment #123)

> The patch Chris submitted could be considered as a diagnostic.  Basically, it
> implements the same logic as ll_rw_block, with one important difference. 
> ll_rw_block skips the buffer_head if it is locked, Chris' version skips the
> buffer_head if it is up-to-date.

This thought deserves some fleshing out.  If a buffer is locked, that means an
I/O is pending.  When the I/O completes, the buffer will be up-to-date unless
the I/O failed.

That, at least, is how it is supposed to work (ref the function
end_buffer_read_sync near line 198 of fs/buffer.c).

The ll_rw_block call in ext3_bread will not submit an I/O if the buffer is
already locked.  The logic there is correct; if there is already an I/O pending
on the buffer, there is no point in submitting another one.  Whether the pending
I/O is a read or a write, when it completes the buffer should be up-to-date,
unless the I/O fails for some reason.

So we have two possibilities here:

1.  The buffer locking and synchronization is somehow broken.
2.  The I/Os are failing for some reason.

The probability of the second possibility is much higher, since the buffer
locking/synchronization is used by all I/O subsystems, not just SANs.  We are
only seeing this bug on SANs.

Chip
Comment 126 Chris Mason 2006-12-14 11:05:01 EST
Created attachment 143648 [details]
more verbose check

This patch is similar, (and also untested/compiled), but if it finds a locked
buffer that is not up to date, and the buffer is still not up to date after
just waiting on it, a log message will be printed:

ext3 wait after failed lock, not uptodate

It then locks the buffer and tries the read.  If the read produces an up to
date buffer, another message is printed.  This way we'll be able to tell if
we're hitting this potential bug.

Please review this carefully before using it.
Comment 127 Keith Kearnan 2006-12-14 11:09:04 EST
(In reply to comment #123)
> (In reply to comment #121)
> > Hi Chris,
> > Before testing any fixes I'd like to test with additional jprobes/kprobes 
to 
> > nail down the root cause.
> The patch Chris submitted could be considered as a diagnostic.  Basically, it
> implements the same logic as ll_rw_block, with one important difference. 
> ll_rw_block skips the buffer_head if it is locked, Chris' version skips the
> buffer_head if it is up-to-date.
> If the bug reproduces with Chris' patch installed, then we know that the 
problem
> really is low-level I/O and not buffer locking/synchronization.
> Chip

The only reason I can't totally agree with your conclusion here is because the 
patch also adds buffer locking.  So if it fixes it we cannot jump to the 
conclusion that the root cause is low-level I/O.
Comment 128 Chip Coldwell 2006-12-14 11:14:15 EST
(In reply to comment #127)
>
> The only reason I can't totally agree with your conclusion here is because the 
> patch also adds buffer locking.  So if it fixes it we cannot jump to the 
> conclusion that the root cause is low-level I/O.

It's the same locking that's in ll_rw_block, ported up to ext3_bread.  In fact,
you could think of the patch as being a re-implementation of ll_rw_block in
ext3_bread for the case of interest (nr == 1, rw == READ), with the one
difference I described in comment #123.

However, like I said in comment #125, I think it is more likely that the problem
is caused by failing I/O than a buffer locking/synchronization problem, if only
because every I/O device in the kernel uses the same buffer synchronization methods.

We could implement the patch from comment #126 in the next kprobe/jprobe
version, which might be interesting.

Chip


Comment 129 Chris Mason 2006-12-14 11:17:21 EST
RE: comment #125

Yes, if a buffer is locked it is supposed to mean an IO is pending.  But, you 
only need one caller that thinks locking a buffer means excluding other users 
to trigger the problem.  ext3_getblk looks very suspect to me.  It probably is 
not possible to trigger that particular bug due to inode semaphores, but 
still.

With all of that said, problems at the lower layers are much more likely.  
Earlier in the bug it looked like you had jprobes in the end_io routines that 
were being called with up_to_date = 0, and then later on we were only able to 
see errors higher up the stack in ext3.  Is this true?
Comment 130 Chip Coldwell 2006-12-14 11:28:55 EST
(In reply to comment #129)
> RE: comment #125
> 
> Yes, if a buffer is locked it is supposed to mean an IO is pending.  But, you 
> only need one caller that thinks locking a buffer means excluding other users 
> to trigger the problem.  ext3_getblk looks very suspect to me.

Comment above journal_get_create_access in fs/jbd/transaction.c:

/*
 * When the user wants to journal a newly created buffer_head
 * (ie. getblk() returned a new buffer and we are going to populate it
 * manually rather than reading off disk), then we need to keep the
 * buffer_head locked until it has been completely filled with new
 * data. 

The branch in ext3_getblk that executes lock_buffer(bh) is protected by 
if (buffer_new(&dummy)).  The logic is contorted, but I think it's correct.

> With all of that said, problems at the lower layers are much more likely.  
> Earlier in the bug it looked like you had jprobes in the end_io routines that 
> were being called with up_to_date = 0, and then later on we were only able to 
> see errors higher up the stack in ext3.  Is this true?

Yes, it is.  But one should never rule out the possibility of bugs in my jprobes
....

Chip
Comment 131 Keith Kearnan 2006-12-14 11:35:27 EST
(In reply to comment #128)
> (In reply to comment #127)
> >
> > The only reason I can't totally agree with your conclusion here is because 
the 
> > patch also adds buffer locking.  So if it fixes it we cannot jump to the 
> > conclusion that the root cause is low-level I/O.
> It's the same locking that's in ll_rw_block, ported up to ext3_bread.  In 
fact,
> you could think of the patch as being a re-implementation of ll_rw_block in
> ext3_bread for the case of interest (nr == 1, rw == READ), with the one
> difference I described in comment #123.
> However, like I said in comment #125, I think it is more likely that the 
problem
> is caused by failing I/O than a buffer locking/synchronization problem, if 
only
> because every I/O device in the kernel uses the same buffer synchronization 
methods.
> We could implement the patch from comment #126 in the next kprobe/jprobe
> version, which might be interesting.
> Chip
> Chip

OK, I see it.  I look forward to the next jprobe.
Comment 132 Barry Donahue 2006-12-14 15:56:51 EST
Keith, I was wondering if I could get a little more information on your
configuration and the load you're running. 

   How much memory does the system have? Are you running only the exact script
that you put in an earlier attachment? Did you add any options to the filesystem
other than the defaults, e.g. the number of inodes?

   I would just like to align what I'm running with what has successfully
reproduced this in the past.

   thanks,
Comment 133 Keith Kearnan 2006-12-14 16:09:53 EST
(In reply to comment #132)
> Keith, I was wondering if I could get a little more information on your
> configuration and the load you're running. 
>    How much memory does the system have? 

2GB

> Are you running only the exact script that you put in an earlier attachment? 

I have run a variety of different jobs and combinations but the most 
common "set" is below:

I have two LUNs that are configured down two paths each.  On one LUN I run the 
script from comment #89 in addition to a single ioloadgen job.

On the second LUN I run the original metadata test script "tpsreport.sh" with 
the modification from comment #42.  I also run a single ioloadgen job.

I have tried other things in order to induce a failure sooner, but nothing 
seems to really make it happen any quicker.

> Did you add any options to the filesystem other than the defaults, e.g. the 
> number of inodes?

No, the command I used to create the filesystem:
mkfs.ext3 -j /dev/emcpowera1
The command used to mount:
mount -o errors=panic /dev/emcpowera1 /test1

>    I would just like to align what I'm running with what has successfully
> reproduced this in the past.
>    thanks,

Comment 134 Chip Coldwell 2006-12-14 16:59:47 EST
(In reply to comment #126)
> Created an attachment (id=143648) [edit]
> more verbose check
> 
> This patch is similar, (and also untested/compiled), but if it finds a locked
> buffer that is not up to date, and the buffer is still not up to date after
> just waiting on it, a log message will be printed:
> 
> ext3 wait after failed lock, not uptodate
> 
> It then locks the buffer and tries the read.  If the read produces an up to
> date buffer, another message is printed.  This way we'll be able to tell if
> we're hitting this potential bug.
> 
> Please review this carefully before using it.

+	if (buffer_uptodate(bh)) {
+		if (long_wait)
+			printk(KERN_CRIT "ext3 after failed lock, now good\n");
+		unlock_buffer(bh);
+		return bh;
+	}

Unlocking the buffer here instead of in the I/O completion routine seems wrong
to me.

On deeper reflection, I'm not sure what we're trying to catch here.  Referring
to the original version of ext3_bread, if the buffer we get from ext3_getblk is
locked and not uptodate, then ll_rw_block is a no-op, but wait_on_buffer will
sleep until the other guy's pending I/O completes.  That's how it should work,
right?

We can detect if the buffer is locked before the call to ll_rw_block, but even
if it is, it doesn't necessarily indicate a problem.  It just means somebody
else started an I/O to the same block first.

Chip

Comment 135 Chip Coldwell 2006-12-14 17:18:50 EST
Created attachment 143707 [details]
probe ext3_bread every which way (2.6.9-34.ELsmp.x86_64)

This module drops some additional kprobes in the ext3_bread function to
determine if it calls ll_rw_block with a locked buffer head, and to determine
if the failure of this function is due to an underlying I/O failure or a buffer
synchronization problem.

Chip
Comment 136 Keith Kearnan 2006-12-14 18:37:08 EST
Thanks Chip!
The new jprobe/kprobe (from comment #135) is installed and running.

Dec 14 18:31:41 L8 kernel: plant kprobe at ffffffffa00b87b8
Dec 14 18:31:41 L8 kernel: plant kprobe at ffffffffa00b87cd
Dec 14 18:31:41 L8 kernel: plant kprobe at ffffffffa00b8802
-Keith
Comment 137 Chris Mason 2006-12-14 20:27:40 EST
RE: comment 134.

The buffer is always unlocked in the end_io call.  I'm unlocking it inside the 
if (buffer_uptodate(bh)) check because no new io is started, and because the 
buffer was locked above (either by test_set_buffer_locked or lock_buffer).

You're right, there's no significance to a buffer being locked before 
ll_rw_block.  It only matters if it was locked, we wait on it, and it is not 
up to date after the wait returns.  That will trigger the first printk.

printk(KERN_CRIT "ext3 wait after failed lock, not uptodate\n");

But, this doesn't indicate an FS error either.  The buffer may not be up to 
date because of IO error (block layer bug) or because of someone breaking the 
buffer locking rules (probably FS level bug).

In the patch I go ahead and submit the IO again, and do another print if that 
produces an up to date buffer.  But, even this doesn't prove an FS bug because 
the storage array may have just come back online, or the lower paths may have 
worked out whatever problems they had.

The only way to be 100% sure of the IO error is to instrument the end_io 
call(s), or lower down the stack.  And please make sure the kernel log level 
is set high enough to catch all the messages.
Comment 138 Keith Kearnan 2006-12-14 20:54:37 EST
So far...

Dec 14 18:31:41 L8 kernel: plant kprobe at ffffffffa00b87b8
Dec 14 18:31:41 L8 kernel: plant kprobe at ffffffffa00b87cd
Dec 14 18:31:41 L8 kernel: plant kprobe at ffffffffa00b8802
Dec 14 18:32:17 L8 kernel: buffer_head 000001005b0a6920 locked before ll_rw_block
<snip>
Dec 14 20:51:51 L8 kernel: buffer_head 0000010035936348 locked before ll_rw_block
Dec 14 20:51:51 L8 kernel: buffer_head 00000100359362f0 locked before ll_rw_block

[root@L8 jprobe-2.6.9-34.EL.x86_64]# cat /var/log/messages | grep buffer_head | wc -l
321

Not as many as I would have expected. 
Comment 139 Keith Kearnan 2006-12-14 22:08:11 EST
We experienced a crash at approximately 21:40 EST.  The new jprobe/kprobe 
from comment #135 was installed.  The vmcore and log can be located here: 
ftp://ftp.emc.com/outgoing/213921-7/

buffer_head 0000010028986fa8 locked before ll_rw_block
buffer_head 0000010028986fa8 not uptodate after wait_on_buffer inode 000001001a06fb18
EXT3-fs error (device emcpowera1): ext3_readdir: directory #7045121 contains a hole at offset 4096
Kernel panic - not syncing: EXT3-fs (device emcpowera1): panic forced after error

-Keith
Comment 141 Chip Coldwell 2006-12-15 09:38:29 EST
(In reply to comment #139)
> We experienced a crash at approximately 21:40 EST.  The new jprobe/kprobe 
> from comment #135 was installed.  The vmcore and log can be located here: 
> ftp://ftp.emc.com/outgoing/213921-7/
> 
> buffer_head 0000010028986fa8 locked before ll_rw_block
> buffer_head 0000010028986fa8 not uptodate after wait_on_buffer inode 000001001a06fb18
> EXT3-fs error (device emcpowera1): ext3_readdir: directory #7045121 contains a hole at offset 4096
> Kernel panic - not syncing: EXT3-fs (device emcpowera1): panic forced after error

Interesting.  In light of comment #138, we see that having a locked buffer in
ext3_bread before issuing ll_rw_block happened harmlessly some hundreds of
times.  Then at the end it was a precursor to reproducing our bug.  We still
don't know if it always happens like this, but it did this time.

Chip
Comment 142 Eric Sandeen 2006-12-15 09:48:49 EST
It might be interesting to add the buffer head tracing patch, then if
we get this locked buffer and a subsequent problem, we can ASSERT and get a
history of where that buffer has been - see if it truly was locked for IO or for
some other reason.
Comment 143 Chris Mason 2006-12-15 09:49:26 EST
I don't believe the printks from the latest hang rule out io errors from the 
device.

I can almost but not quite explain what we're seeing here.

ext3_readdir will do ll_rw_block(READA) for readahead on directory blocks.  
READA requests may fail in various ways.  With normal disk setups, most of the 
failures are synchronous, the buffer would be locked and unlocked with the 
directory semaphore/mutex held, and it would be very difficult to race in and 
wait on the buffer during the READA request.

But, the dm/multipath code is much less careful (but still safe) with 
bio_rw_ahead() requests, and these also have the fast-fail bit set (as noted 
earlier here).

The page cache does readahead for files, but it is generally much more careful 
about retrying when there are failures.  The ext3/ll_rw_block method is prone 
to transient errors like these.

The fix would be my original patch, but I would suggest instrumentation to 
make sure these are the failures we're seeing.  The best place to probe would 
be the bio_end_io handler used by submit_bh.

One problem with my theory is that it doesn't explain why the probe into the 
end_io funcs didn't see the error or why we aren't seeing the end_request I/O 
error printk.  Is the EMC driver setting REQ_QUIET on failfast or readahead 
bios that fail?  That would at least explain the printk.
Comment 144 Chris Evich 2006-12-15 09:55:17 EST
Fascinating.  So does that mean whatever locked the buffer either did so
inappropriately, and/or failed to actually sync the disk/buffer, and/or failed
to set the flag indicating the buffer is up to date?  If I'm reading this right,
doesn't it point to a synchronization problem higher up, since we're not seeing
the printks lower down?
Comment 145 Chris Mason 2006-12-15 10:02:00 EST
The short answer is...I don't know.  It all makes really good sense except the 
part where we don't get printks from ll_rw_block.c.  If we can't explain that, 
then the best course is to put unlock tracing into the buffer heads, recording 
the eip of the last 5 or 6 people to call unlock_buffer.

I looked this morning for people improperly unlocking buffers without marking 
them up to date, but the directory semaphore makes those kinds of races very 
very hard to hit.  The fact that we consistently see this on directories makes 
readahead sound like a pretty good theory to me.

Comment 146 Chip Coldwell 2006-12-15 10:08:46 EST
(In reply to comment #143)
> I don't believe the printks from the latest hang rule out io errors from the 
> device.

Certainly not.

> ext3_readdir will do ll_rw_block(READA) for readahead on directory blocks.  
> READA requests may fail in various ways.  With normal disk setups, most of the 
> failures are synchronous, the buffer would be locked and unlocked with the 
> directory semaphore/mutex held, and it would be very difficult to race in and 
> wait on the buffer during the READA request.
> 
> But, the dm/multipath code are much less careful (but still safe) with 
> bio_rw_ahead() requests, and these also have the fast fail bit set (as noted 
> earlier here).

Well, ext3_bread is definitely waiting for somebody else's I/O to complete, and
when it finishes, the I/O has failed.  I think there's a good chance that it's a
read-ahead with the fast-fail bit set (the read-ahead issued by ext3_readdir
itself seems likely to me, too).

> The fix would be my original patch,

I provisionally disagree.  The upstream kernel has a READ_META flag set in
ext3_bread's ll_rw_block.  I'd prefer to emulate the upstream version of the
function if the problem doesn't exist there.

> One problem with my theory is that it doesn't explain why the probe into the 
> end_io funcs didn't see the error or why we aren't seeing the end_request I/O 
> error printk.  Is the EMC driver setting REQ_QUIET on failfast or readahead 
> bios that fail?  That would at least explain the printk.

We should clear that flag.  It might also be interesting to plant a probe that
printks the addresses of buffer_heads as we submit requests for read-ahead in
ext3_readdir.  However, that is likely to generate a lot of output.

Chip
Comment 147 Chip Coldwell 2006-12-15 10:15:03 EST
(In reply to comment #145)
>
> The fact that we consistently see this on directories makes 
> readahead sound like a pretty good theory to me.

We could take out the read-ahead.  Yes, that will change timings and possibly
mask the bug, but I still think it would be interesting.

Chip
Comment 149 Chris Mason 2006-12-15 10:23:53 EST
As a next course of action I would suggest:

* getting confirmation from EMC about the REQ_QUIET flag
* Checking the dmesg buffer in the crash dump for the printks
* Instrumenting fs/buffer.c:end_bio_bh_io_sync().  There you can look for 
bio_rw_ahead requests that are failing and print the buffer head address.

I would leave the READA in so we can confirm things.
Comment 150 Chip Coldwell 2006-12-15 10:33:26 EST
(In reply to comment #149)
> As a next course of action I would suggest:
>
> * Instrumenting fs/buffer.c:end_bio_bh_io_sync().  There you can look for 
> bio_rw_ahead requests that are failing and print the buffer head address.

Good idea.

Chip
Comment 151 Keith Kearnan 2006-12-15 10:54:48 EST
(In reply to comment #149)
> As a next course of action I would suggest:
> * getting confirmation from EMC about the REQ_QUIET flag

PowerPath 4.5.x does not set the REQ_QUIET flag.

Comment 152 Chris Mason 2006-12-15 11:53:28 EST
From comment #98:

end_bio_bh_io_sync: bio 000001002cb00b80 bh 000001005080fdf0 bytes_done 4096 err -11 sector 98697767
end_buffer_read_sync: page 0000010040682fe0 mapping 000001003f5f5260

EIO is -5, -11 is EAGAIN.  Not sure if that is coming from powerpath, qlogic 
or the scsi layer.
Comment 153 Chris Mason 2006-12-15 12:04:48 EST
ftp://ftp.emc.com/outgoing/213921-3/ still has the data in it.  The EAGAINs 
may be coming from __make_request (it uses EWOULDBLOCK which is the same as 
EAGAIN).  There are a bunch of these end_bio_bh_io_sync errors, it may be 
worthwhile to look in the dump and see if any of them correspond to the hole 
bh.
Comment 154 Chip Coldwell 2006-12-15 13:28:37 EST
Created attachment 143795 [details]
probe ext3_bread and end_bio_bh_io_sync (2.6.9-34.ELsmp.x86_64)

This version keeps the kprobes in ext3_bread from the previous version, and
adds a jprobe in end_bio_bh_io_sync that drops a printk if a read-ahead
completes with -EIO.
Comment 155 Keith Kearnan 2006-12-15 13:34:17 EST
Thanks Chip!
The jprobe/kprobe from comment#154 is now running.
Dec 15 13:32:25 L8 kernel: plant kprobe at ffffffffa00b87b8
Dec 15 13:32:25 L8 kernel: plant kprobe at ffffffffa00b87cd
Dec 15 13:32:25 L8 kernel: plant kprobe at ffffffffa00b8802
Dec 15 13:32:25 L8 kernel: plant jprobe at ffffffff80179fbc
-Keith
Comment 156 Chip Coldwell 2006-12-15 13:41:50 EST
Created attachment 143801 [details]
touchup the logic in the end_bio_bh_io_sync jprobe

Fixup of 143795
Comment 157 Chris Mason 2006-12-15 14:30:26 EST
There's only one way I can see to explain this as a READA-related bug (without 
EIO in end_bio_bh_io_sync).

Somewhere in the path from ll_rw_block(READA) to end_bio_bh_io_sync getting 
error -11, the lower layers are putting the IO onto a deferred queue.
One place this happens is while device mapper is taking a snapshot.  If you 
snapshot /dev/mapper/foo, all new io to /dev/mapper/foo goes onto a queue 
while the snapshot is being created.  Then the queued io is run from a 
different process context than the original submission.
Comment 158 Keith Kearnan 2006-12-16 11:52:04 EST
Ugh.  I missed the e-mail regarding the new jprobe from comment #156.

The server crashed at approximately 11:04AM EST this morning with the 
jprobe/kprobe from comment #154 was installed.  The vmcore and log are here: 
ftp://ftp.emc.com/outgoing/213921-8/

buffer_head 00000100297add40 locked before ll_rw_block
buffer_head 00000100297add40 not uptodate after wait_on_buffer inode 0000010039b96548
EXT3-fs error (device emcpowera1): ext3_readdir: directory #2588673 contains a hole at offset 4096
Kernel panic - not syncing: EXT3-fs (device emcpowera1): panic forced after error

I'll update the bugzilla when I start the test with the touched up jprobe.
Comment 159 Keith Kearnan 2006-12-16 14:46:31 EST
The jprobe/kprobe from comment #156 is now running.
Dec 16 14:42:59 L8 kernel: plant kprobe at ffffffffa00b87b8
Dec 16 14:42:59 L8 kernel: plant kprobe at ffffffffa00b87cd
Dec 16 14:42:59 L8 kernel: plant kprobe at ffffffffa00b8802
Dec 16 14:42:59 L8 kernel: plant jprobe at ffffffff80179fbc
-Keith
Comment 160 Chris Mason 2006-12-16 15:10:45 EST
From the logs: dm_snapshot and dm_mirror are both linked in.  Is device mapper 
(or lvm) being used on the filesystems being exercised?  Are there snapshots 
happening on the box during the run (on any ext3 fs)?
Comment 161 Keith Kearnan 2006-12-16 20:37:43 EST
(In reply to comment #160)
> From the logs: dm_snapshot and dm_mirror are both linked in.  

It is the standard RHEL4u3 install with no modifications to dm.

> Is device mapper (or lvm) being used on the filesystems being exercised?  

No.

>Are there snapshots happening on the box during the run (on any ext3 fs)?

No.

[root@L8 ~]# lsmod | grep dm
dm_snapshot            18689  0
dm_zero                 3649  0
dm_mirror              29017  0
dm_mod                 67457  6 dm_snapshot,dm_zero,dm_mirror

Just two PowerPath devices:

[root@L8 ~]# powermt display dev=all
Pseudo name=emcpowerb
Symmetrix ID=000187490063
Logical device ID=01B1
state=alive; policy=SymmOpt; priority=0; queued-IOs=0
==============================================================================
---------------- Host ---------------   - Stor -   -- I/O Path -  -- Stats ---
### HW Path                 I/O Paths    Interf.   Mode    State  Q-IOs Errors
==============================================================================
   0 qla2xxx                   sda       FA  3aA   active  alive      0      0
   1 qla2xxx                   sdc       FA  4aA   active  alive      0      0

Pseudo name=emcpowera
Symmetrix ID=000187490063
Logical device ID=01B9
state=alive; policy=SymmOpt; priority=0; queued-IOs=3
==============================================================================
---------------- Host ---------------   - Stor -   -- I/O Path -  -- Stats ---
### HW Path                 I/O Paths    Interf.   Mode    State  Q-IOs Errors
==============================================================================
   0 qla2xxx                   sdb       FA  3aA   active  alive      0      0
   1 qla2xxx                   sdd       FA  4aA   active  alive      3      0


[root@L8 ~]# df -k | grep emc
/dev/emcpowera1       69591480   5635004  60421376   9% /test1
/dev/emcpowerb1       69591480    300692  65755688   1% /test2

[root@L8 ~]# mount | grep emc
/dev/emcpowera1 on /test1 type ext3 (rw,errors=panic)
/dev/emcpowerb1 on /test2 type ext3 (rw,errors=panic)

-Keith

Comment 162 Chip Coldwell 2006-12-17 07:47:42 EST
(In reply to comment #160)
> From the logs: dm_snapshot and dm_mirror are both linked in.  Is device mapper 
> (or lvm) being used on the filesystems being exercised?  Are there snapshots 
> happening on the box during the run (on any ext3 fs)?

You're referring to your comment #157 here to try to explain why the
end_bio_bh_io_sync jprobe didn't fire?  I should point out that the unretouched
probe module would only have caught an -EIO failure in end_bio_bh_io_sync.  The
touched up version will catch anything but an -EAGAIN (-EWOULDBLOCK) failure,
since there are so many of those and we believe they are harmless.

I think we should flog this theory (READA failing fast and causing ext3_bread to
fail) a bit more before we give up on it.  However, it does seem to be a repeat
of our experience with the jprobe from comment #91.

Chip

Comment 163 Keith Kearnan 2006-12-17 07:49:57 EST
Chip...we suffered a mid-air collision a minute ago  :)

We experienced a crash this morning at approximately 01:35AM EST.  The vmcore 
and log are here: ftp://ftp.emc.com/outgoing/213921-9/ the jprobe/kprobe from 
comment #156 was installed.

buffer_head 0000010014483ce8 locked before ll_rw_block
buffer_head 0000010014483ce8 not uptodate after wait_on_buffer inode 000001002f5b8c58
EXT3-fs error (device emcpowera1): ext3_readdir: directory #2080769 contains a hole at offset 4096
Kernel panic - not syncing: EXT3-fs (device emcpowera1): panic forced after error

-Keith
Comment 164 Chris Mason 2006-12-17 09:32:22 EST
RE: comment #162, I agree the probes right now seem to indicate the only 
failures we're getting are -EAGAINs.  I think we need to instrument 
the -EAGAINs to see if they are causing the hole panics.

If they are, the only explanation I can see is some kind of deferred 
processing from powerpath.

Comment 166 Chip Coldwell 2006-12-18 13:40:45 EST
Created attachment 143921 [details]
verbose logging in end_bio_bh_io_sync

This does more verbose logging of failures in end_bio_bh_io_sync.  With a
healthy system under I/O load, I get a ton of these:

end_bio_bh_io_sync err = -11 bi_rw = 2 bio 000001001cad5c00 buffer_head 000001001919ff50 page 00000100015be0a8 mapping 000001003750fba0 inode 000001003750fa80 isdir 0

What would be interesting is if one of these came up with "isdir 1" and the
inode address is the same as the one reported by the kprobe in ext3_bread.
Note that there might be some distance in the log between the two messages.

Chip
Comment 167 Chip Coldwell 2006-12-18 13:44:00 EST
(In reply to comment #166)
>
> end_bio_bh_io_sync err = -11 bi_rw = 2 bio 000001001cad5c00 buffer_head
> 000001001919ff50 page 00000100015be0a8 mapping 000001003750fba0 inode
> 000001003750fa80 isdir 0

BTW, this inode is in the kernel-internal bdev filesystem.  I could check for
inode->i_ino == 0 to suppress these messages, but maybe it's better to just let
it go.

Chip
Comment 168 Chip Coldwell 2006-12-18 15:00:18 EST
Created attachment 143938 [details]
turn off read-ahead in ext3_readdir
Comment 169 Chip Coldwell 2006-12-18 15:02:59 EST
I built a kernel with the patch in comment #168 for x86_64; it's up at

http://people.redhat.com/coldwell/bugs/kernel/213921/x86_64/

I'll cook up another one for i686hugemem.

Chip
Comment 170 Keith Kearnan 2006-12-18 15:51:14 EST
Now running with the jprobe/kprobe from comment #166.
Dec 18 15:43:01 L8 kernel: plant kprobe at ffffffffa00b87b8
Dec 18 15:43:01 L8 kernel: plant kprobe at ffffffffa00b87cd
Dec 18 15:43:01 L8 kernel: plant kprobe at ffffffffa00b8802
Dec 18 15:43:01 L8 kernel: plant jprobe at ffffffff80179fbc
-Keith
Comment 171 Chip Coldwell 2006-12-18 16:30:34 EST
(In reply to comment #169)
> I built a kernel with the patch in comment #168 for x86_64; it's up at
> 
> http://people.redhat.com/coldwell/bugs/kernel/213921/x86_64/
> 
> I'll cook up another one for i686hugemem.

http://people.redhat.com/coldwell/bugs/kernel/213921/i686-hugemem/
Comment 172 Keith Kearnan 2006-12-19 08:23:34 EST
We experienced a crash at approximately 20:49 last night.  The 
vmcore and log are located here: ftp://ftp.emc.com/outgoing/213921-10/
The jprobe/kprobe from comment #166 was installed.

buffer_head 0000010067d62190 locked before ll_rw_block
buffer_head 000001001beea088 locked before ll_rw_block
end_bio_bh_io_sync err = -11 bi_rw = 2 bio 000001005759a900 buffer_head 00000100245a7ad8 page 0000010040775378 mapping 00000100332c0820 inode 00000100332c0700 isdir 0
buffer_head 00000100171191e8 locked before ll_rw_block
buffer_head 0000010013ecde48 locked before ll_rw_block
buffer_head 000001004dfa7ce8 locked before ll_rw_block
end_bio_bh_io_sync err = -11 bi_rw = 2 bio 0000010003cd9780 buffer_head 00000100530b11e8 page 0000010040cd4d38 mapping 00000100332c0820 inode 00000100332c0700 isdir 0
buffer_head 0000010006763088 locked before ll_rw_block
end_bio_bh_io_sync err = -11 bi_rw = 2 bio 0000010071eb4200 buffer_head 0000010006763088 page 0000010040683b78 mapping 00000100332c0820 inode 00000100332c0700 isdir 0
buffer_head 0000010006763088 not uptodate after wait_on_buffer inode 00000100732f5300
EXT3-fs error (device emcpowera1): ext3_readdir: directory #294914 contains a hole at offset 4096
Kernel panic - not syncing: EXT3-fs (device emcpowera1): panic forced after error

-Keith
Comment 173 Chris Mason 2006-12-19 09:06:58 EST
Fairly good news, I think: this is as close as we can get to showing that the 
failed buffer came from a READA.
Comment 174 Chip Coldwell 2006-12-19 09:36:39 EST
(In reply to comment #172)
>
> buffer_head 0000010006763088 locked before ll_rw_block
> end_bio_bh_io_sync err = -11 bi_rw = 2 bio 0000010071eb4200 buffer_head 0000010006763088 page 0000010040683b78 mapping 00000100332c0820 inode 00000100332c0700 isdir 0
> buffer_head 0000010006763088 not uptodate after wait_on_buffer inode 00000100732f5300
> EXT3-fs error (device emcpowera1): ext3_readdir: directory #294914 contains a hole at offset 4096
> Kernel panic - not syncing: EXT3-fs (device emcpowera1): panic forced after error

This is the important part here -- the buffer head address matches in all three
probes (in ext3_bread before ll_rw_block, in end_bio_bh_io_sync, and back in
ext3_bread after wait_on_buffer).  And we know it was a read-ahead from the value
in bi_rw.  So this is the smoking gun.

Chip
Comment 175 Chip Coldwell 2006-12-19 09:44:38 EST
(In reply to comment #172)

> buffer_head 0000010006763088 locked before ll_rw_block
> end_bio_bh_io_sync err = -11 bi_rw = 2 bio 0000010071eb4200 buffer_head 0000010006763088 page 0000010040683b78 mapping 00000100332c0820 inode 00000100332c0700 isdir 0

In fact, this also explains why the previous jprobe didn't catch the error in
end_bio_bh_io_sync.  I was only dropping a printk when the inode chased down via 

	struct buffer_head *bh = bio->bi_private;
	struct page *page = bh ? bh->b_page : NULL;
	struct address_space *mapping = page ? page->mapping : NULL;
	struct inode *inode = mapping ? mapping->host : NULL;

satisfied S_ISDIR(inode->i_mode).  Of course, it never does, because it is an
inode in the kernel-internal bdev filesystem.  So it was a bug in my jprobe.

Chip
Comment 176 Chris Mason 2006-12-19 10:05:39 EST
Created attachment 144006 [details]
Fix ext3_bread and ext3_find_entry

Looking through the other ll_rw_block callers and wait_on_buffer users in ext3,
I think these two are the ones that need fixing to deal with failed READAs.

We'll put this patch through its paces here, but I wanted to post it for review
because the error cases will rarely be run.

The smaller alternative fix is to use READ instead of READA in ext3/dir.c. 
This will have some performance impact though.
Comment 177 Chip Coldwell 2006-12-19 10:11:42 EST
From http://kerneltrap.org/node/6110

Ext3 and Read Ahead
Posted by saisuman on Friday, January 27, 2006 - 06:54

Was writing a block driver, on which we intended some filesystem to run. So
while testing out with ext3, found an interesting problem. While using the
filesystem, it threw errors such as:

Directory #11 has a hole at offset 8192.
Remounting as read-only

This problem is specific to the way read-ahead is handled by the block layer,
and the way the ext3 filesystem uses it.

Read-ahead is an optimisation intended by the filesystem, and a block driver
comes to know of it by checking the bio->bi_rw flag. It should be "READA". Now,
in 2.6, in case the system is under heavy load, a READA bio can return "EAGAIN"
(through the bio_end_io callback) so that the request may be retried.

So sometimes, we do run into this situation. But the problem is that ext3 never
expects to handle this problem, because it uses the old ll_rw_block routine, and
never checks for EAGAIN. Ideally, it should be patched to use submit_bio, and
check for an EAGAIN return value in the callback.

The moral of the story is, if you're writing a filesystem, expect to handle
EAGAIN as an error condition if you're using READA for submitting block I/O
requests.
Comment 178 Chris Evich 2006-12-19 10:18:10 EST
Let's not forget there are likely similar code paths within other filesystem
code (e.g. ext2).  Chris's proposed patch, or at least its methodology, seems
like it would address them AFAIK.  Since we know what to look for in the code,
let's be careful there aren't any potential corner cases.  Seems to me the best
course of action is to:

A) Wrap our brains around the upstream code.
B) Weigh whether a fix should diverge from, converge with, or complement upstream.
C) In parallel with A and B, test the hell out of any and all potential fixes.
Comment 179 Chip Coldwell 2006-12-19 10:22:02 EST
On Linus' git tree:

commit d8733c2956968a01394a4d2a9e97a8b431a78776
Author: Andrew Morton <akpm@osdl.org>
Date:   Thu Mar 23 03:00:11 2006 -0800

    [PATCH] ext3_readdir: use generic readahead
    
    Linus points out that ext3_readdir's readahead only cuts in when
    ext3_readdir() is operating at the very start of the directory.  So for large
    directories we end up performing no readahead at all and we suck.
    
    So take it all out and use the core VM's page_cache_readahead().  This means
    that ext3 directory reads will use all of readahead's dynamic sizing goop.
    
    Note that we're using the directory's filp->f_ra to hold the readahead state,
    but readahead is actually being performed against the underlying blockdev's
    address_space.  Fortunately the readahead code is all set up to handle this.
    
    Tested with printk.  It works.  I was struggling to find a real workload which
    actually cared.
    
    (The patch also exports page_cache_readahead() to GPL modules)
    
    Cc: "Stephen C. Tweedie" <sct@redhat.com>
    Signed-off-by: Andrew Morton <akpm@osdl.org>
    Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Comment 180 Chip Coldwell 2006-12-19 10:29:40 EST
Created attachment 144007 [details]
commit d8733c2956968a01394a4d2a9e97a8b431a78776 from Linus' git tree

This is the upstream patch from Linus' git tree.

Chip
Comment 181 Jeffrey Moyer 2006-12-19 10:33:16 EST
$ grep READA fs/*/*.c
fs/ext3/dir.c:                          ll_rw_block (READA, num, bha);
fs/reiserfs/stree.c:        ll_rw_block(READA, 1, bh + j);
fs/udf/dir.c:                           ll_rw_block(READA, num, bha);
fs/udf/directory.c:                             ll_rw_block(READA, num, bha);

We don't support reiserfs.  I think we do support UDF, though.
This looks familiar:

struct buffer_head * udf_bread(struct inode * inode, int block,
	int create, int * err)
{
	struct buffer_head * bh = NULL;

	bh = udf_getblk(inode, block, create, err);
	if (!bh)
		return NULL;

	if (buffer_uptodate(bh))
		return bh;
	ll_rw_block(READ, 1, &bh);
	wait_on_buffer(bh);
	if (buffer_uptodate(bh))
		return bh;
	brelse(bh);
	*err = -EIO;
	return NULL;
}

So maybe we should try to fix this, too.
Comment 182 Chris Evich 2006-12-19 10:39:08 EST
Beautiful! I'm assuming that using the more generic page_cache_readahead() would
also solve the EIO/EAGAIN issue?
Comment 183 Chip Coldwell 2006-12-19 10:41:57 EST
(In reply to comment #180)
> Created an attachment (id=144007) [edit]
> commit d8733c2956968a01394a4d2a9e97a8b431a78776 from Linus' git tree
> 
> This is the upstream patch from Linus' git tree.

This applies to the RHEL-4 head with some fuzz.  I'm building kernels for x86_64
and i686-hugemem.  Would appreciate some test coverage.

Chip
Comment 184 Chris Mason 2006-12-19 10:51:20 EST
reiserfs is also affected, I sent this bug number along to the novell guys so 
they can fix it up.  For udf, I would just s/READA/READ.
Comment 185 Jeffrey Moyer 2006-12-19 10:54:18 EST
(In reply to comment #184)
> reiserfs is also affected, I sent this bug number along to the novell guys so 
> they can fix it up.  For udf, I would just s/READA/READ.

Yes, I'd agree with that solution.
Comment 186 Chris Mason 2006-12-19 11:05:22 EST
RE: Using Linus' patch.

I'm not against it, but it is a bigger change than it looks on first glance.  
The existing code does logical readahead of the directory.  It maps in the 
first set of blocks in the directory and reads those blocks, even if they are 
not sequential.

Linus' patch does readahead on the block device.  So, it starts reading at a 
specific block on the disk and it reads the disk sequentially.  The blocks it 
reads may or may not correspond to the directory.

In some workloads, his patch will perform better, and in others it will be 
worse.
Comment 187 Chip Coldwell 2006-12-19 11:15:00 EST
(In reply to comment #186)
> RE: Using Linus' patch.
> 
> In some workloads, his patch will perform better, and in others it will be 
> worse.

Andrew Morton's comment, "I was struggling to find a real workload which
actually cared" seems to indicate that the performance improvement due to any
read-ahead on directories is small.

At any rate, the culture around here is to emulate the mainline kernel as far as
possible, so in order to get your patch accepted by our kernel maintainers, I
would first have to get it accepted by Andrew Morton and Linus Torvalds.  It's
easier to just take the patch they've already included.

Chip
Comment 188 Chris Mason 2006-12-19 11:32:27 EST
Yes, Linus' motivation was probably git performance, which isn't typical.  The 
existing code really just tries to read small directories all at once, and the 
patch is something different.

I'm fine with either patch, just wanted to make sure people ack'd the whole 
change.
Comment 189 Chip Coldwell 2006-12-19 15:56:40 EST
Sorry this is taking so long.  The first test kernel based off of
2.6.9-34.EL.x86_64 is available at

http://people.redhat.com/coldwell/bugs/kernel/213921/x86_64/2.6.9-34.EL/kernel-smp-2.6.9-34.213921.x86_64.rpm

This is the one EMC was most interested in.

I'll post the rest as the builds complete.

Chip
Comment 190 Keith Kearnan 2006-12-19 16:43:34 EST
Thanks Chip!
I have installed the kernel from comment #189:

[root@L8 ~]# uname -a
Linux L8.lss.emc.com 2.6.9-34.213921smp #1 SMP Tue Dec 19 15:10:56 EST 2006 
x86_64 x86_64 x86_64 GNU/Linux

I will run with this for a while, I still mounted the file systems with "-o 
errors=panic".
Comment 192 Jeffrey Moyer 2006-12-19 16:48:13 EST
Created attachment 144045 [details]
Fail all READA I/Os

This patch may induce the failure more quickly than otherwise possible.  I was
certain that I'd be able to reproduce the problem on a local test system using
it, but have not been successful thus far.  Could someone at EMC give this
patch a try on a known bad kernel to see if the error is triggered more
quickly?  If that is the case, then I would recommend adding this patch to the
latest kernel that Chip provided in order to more quickly verify that the bug
is fixed.
Comment 193 Chip Coldwell 2006-12-19 17:02:32 EST
(In reply to comment #189)

> I'll post the rest as the builds complete.

The rest of the kernels are up under

http://people.redhat.com/coldwell/bugs/kernel/213921/

Chip
Comment 194 Tom Coughlan 2006-12-20 11:14:59 EST
One remaining question is where the -EAGAIN failure is coming from. 

Keith, please ask the PowerPath people to determine if there are any situations
where they might generate this.

There were a few reports of the ext3 read-only bug that involved dm-multipath
and not PowerPath. We looked at the dm code and did not see the READA/EAGAIN
returns, other than the snapshot case that Chris Mason already mentioned. It
would be helpful for anyone who is familiar with the dm-multipath cases to find
out whether they also had dm snapshot active.
Comment 195 Chris Mason 2006-12-20 13:49:10 EST
Jeff's patch in comment #192 should make it fail faster.

We're only doing RA on the first 16 or so blocks, and we start by doing a 
bread on the first block.  On a mostly empty filesystem the directory blocks 
will probably be sequential, and so all the reada requests will get merged 
back into the original bio, and the reada bit will get dropped.

So, if you want to trigger it w/Jeff's patch, completely fill a small FS with 
4k files, delete 16 of them, create empty files in a new directory until the 
disk is full, unmount, mount, ls -laR

I think I see one other case where dm-multipath will trigger.  It checks for 
bio_rw_ahead and will not retry a failed readahead IO on a different path.  
This is much less likely to happen, but if ql2xprocessrscn=1 isn't on it might 
be easier to hit...
Comment 196 Chris Evich 2006-12-20 15:50:40 EST
My customer needs a 32bit i686 2.6.9-42.0.3.ELsmp kernel build with the
"upstream patch" applied for testing.
Comment 198 Keith Kearnan 2006-12-20 16:34:02 EST
The patched kernel from comment #189 has now been running for 24 hours without 
a problem.  

I have a system running RHEL4u4 2.6.9-42.0.3.ELsmp x86_64/AMD that should be 
available for testing in a day or so.  I'd like a kernel build for that please.
Thanks!
Comment 199 Chip Coldwell 2006-12-20 18:51:07 EST
(In reply to comment #196)
> My customer needs a 32bit i686 2.6.9-42.0.3.ELsmp kernel build with the
> "upstream patch" applied for testing.

http://people.redhat.com/coldwell/bugs/kernel/213921/i686/2.6.9-42.0.3.EL/

Chip
Comment 200 Chip Coldwell 2006-12-20 18:53:48 EST
(In reply to comment #198)
> The patched kernel from comment #189 has now been running for 24 hours without 
> a problem.  
> 
> I have a system running RHEL4u4 2.6.9-42.0.3.ELsmp x86_64/AMD that should be 
> available for testing in a day or so.  I'd like a kernel build for that please.
> Thanks!

http://people.redhat.com/coldwell/bugs/kernel/213921/x86_64/2.6.9-42.0.3.EL/

Chip

Comment 201 Chip Coldwell 2006-12-21 10:18:34 EST
(In reply to comment #195)
> Jeff's patch in comment #192 should make it fail faster.
> 
> We're only doing RA on the first 16 or so blocks, and we start by doing a 
> bread on the first block.  On a mostly empty filesystem the directory blocks 
> will probably be sequential, and so all the reada requests will get merged 
> back into the original bio, and the reada bit will get dropped.

Good point.  Another reason why this bug was hard to reproduce.

Chip
Comment 202 Chris Evich 2006-12-21 10:45:23 EST
Chip mentioned that the "real world" performance impact of not doing any READA's
should be fairly minimal.  Behind fixing this bug, customer's data integrity
should be our next highest concern.  We need to maintain an extremely low chance
of any data-corruptors creeping in.  Customers are also asking for a fix *fast*
given the business impact to-date.  Therefor, I'd like to toss this out as an
idea:  

What if we attempt to measure the performance impact of stripping out READAs? 
If it's is extremely low (<1%); then, it's a no-brainer: Get them out of there!
 Alternatively, we could add a kernel tunable to switch on/off READAs as needed.
 It seems to me, this might be a faster/safer route to solving this bug; and at
the same time, avoid introduction of other bugs.  Even if there's a 5%
performance trade off, I think many customers would opt against READAs since
that will avoid this bug with 100% certainty.

Thoughts?
Comment 203 Chris Mason 2006-12-21 12:01:28 EST
With the mainline patch, I believe the risk of corruption is small.  Turning 
off the READAs is always an option.  The best case for the reada code would be 
doing a single 64k bio for the whole directory instead of 16 4k bios with 
reada off.

On my sata drive here, reading sequentially in 64k chunks goes at 51MB/s.  
Reading in 4k chunks runs at 15MB/s.  The actual impact will vary greatly with 
disk layout and workload, but the most common place people will see it is 
while running backups.

If you want to make a hotfix tomorrow that you are sure is 100% safe, turning 
off directory readahead isn't a bad option.  For something longer term I think 
the mainline patch is a better choice.
Comment 204 Chris Evich 2006-12-21 13:23:47 EST
Chris,
     I don't mean to sound like I'm attacking your opinion, rather I'm just
throwing out potential alternatives.

From the looks of it, stripping out or adding a reada tunable should be fast and
easy.  It also wouldn't preclude parallel work on the other option(s).  The
crucial bit it would buy us is: time.  From our calls, lots of it will be needed
to properly analyze and test with the somewhat "unknown" behavior of the
upstream code.

My customers would prefer a fix tomorrow that comes with 100% certainty, as
opposed to a fix a month from now with unknowns.  Also, the behavior of
removing READAs will be entirely different on SAN storage.  SAN data is heavily
cached on the storage side, so the latency should be much lower, especially
compared to a SATA disk with maybe 16 MB of cache.
Comment 205 Chris Mason 2006-12-21 13:57:39 EST
I think we all have the same goal of making the customers happy fast, so I'm 
only trying to provide some data to help you decide (sorry if it came across 
otherwise)

Turning off directory readahead in a hotfix kernel would be acceptable I 
think.  It may actually hurt more on the SAN machines than the sata drive 
because sending bios down the sata pipe is relatively lightweight.  But it 
depends on how quickly the san boxes start their own readahead and other load 
for those spindles.

One problem with the tunable is that it doesn't really reduce the testing 
load.  It still needs to be verified with readahead on before it can be 
shipped.  I would do either the reada off or the mainline patch, but I'll 
support whatever decision people here feel is best.


Comment 206 Keith Kearnan 2006-12-21 16:36:23 EST
The patched kernel from comment #189 has now been running for 48 hours without 
a problem.  
Comment 207 Chris Evich 2006-12-22 14:18:34 EST
----------------------
Root cause has been identified after several rounds of focused instrumentation.
An upstream patch has been applied to RHEL4 U4 and is currently under heavy
testing at EMC and Red Hat. The problem has not been seen in test in over 48
hours, which is promising. (note: EMC was previously able to reproduce the issue
within 4-6hours).   However, a longer test period and full regression testing is
needed before we can declare victory and approve fixes for production use. In
the first week of January, EMC and Red Hat will assess if and in which form the
suggested patch will be distributed.
----------------------
Comment 208 Josef Carlin 2007-01-02 10:47:35 EST
Thanks to everyone who worked so hard to identify and address this issue. Has it
been decided to "emulate the mainline kernel as far as possible, so in order to
get your patch accepted by our kernel maintainers?" Where is RedHat at on
releasing a patch (or new kernel RPM's) for this issue? 
Comment 209 Chris Evich 2007-01-02 11:02:59 EST
Efforts over the holidays yielded 9 hours of failure-free testing with the
upstream patch.  However, given the impact of this issue, we'd really like to
give the fix an extra special pounding.  The next step is to implement another
testing-only kernel patch designed to aggravate the original problem.  Once we
confirm that the problem is reproducible with it (should only take several
seconds), we will apply the upstream fix.  The goal here is to really make sure
that we can't hit the same problem via a different route while under
worst-case conditions.
Comment 210 Chip Coldwell 2007-01-02 11:54:04 EST
(In reply to comment #209)
> Efforts over the holidays yielded 9 hours of failure-free testing with the
> upstream patch.

You mean 9 days, not 9 hours.

Chip
Comment 211 Chip Coldwell 2007-01-02 11:57:28 EST
(In reply to comment #208)
> Thanks to everyone who worked so hard to identify and address this issue. Has it
> been decided to "emulate the mainline kernel as far as possible, so in order to
> get your patch accepted by our kernel maintainers?"

I believe that is the consensus.  Also, the mainline kernel patch will
presumably improve performance.

Chip
Comment 214 Chris Evich 2007-01-02 13:38:56 EST
(In reply to comment #210)
> (In reply to comment #209)
> > Efforts over the holidays yielded 9 hours of failure-free testing with the
> > upstream patch.
> 
> You mean 9 days, not 9 hours.
> 
> Chip
> 

Yes, sorry about that, my brain is still on a holiday.  9 days of testing is
correct.  Our internal tests ran for 11 days with a similar setup, though
against a Clariion.

Comment 216 Chip Coldwell 2007-01-04 10:26:47 EST
Created attachment 144802 [details]
This is the upstream patch from attachment 144007 [details] adjusted to apply to RHEL4 kernels
Comment 219 Jay Turner 2007-01-04 11:46:39 EST
QE ack for RHEL4.5.
Comment 221 Chris Evich 2007-01-05 12:29:54 EST
I built i686 kernels (building x86_64 next) with the "Fail all read-aheads"
patch from comment #192.  These kernels should be able to reproduce the problem
on local as well as SAN filesystems.  ALL ext3 read-aheads will fail with this
kernel!  Make sure all your non-test filesystems are mounted as ext2 or you
probably won't be able to boot.

Please use this kernel in combination with the filesystem setup instructions in
comment #195 to verify the problem can be reproduced.  

Once that's done, I'll build another round of kernels with both the comment #192
patch and the attachment #144007 [details] patch.  Then we'll see once and for
all whether it truly fixes the problem.

Kernels are at:
http://people.redhat.com/cevich/stuff/kernel-smp-2.6.9-42.0.3_bz213921c192.EL.i686.rpm
http://people.redhat.com/cevich/stuff/kernel-hugemem-2.6.9-42.0.3_bz213921c192.EL.i686.rpm
Comment 222 Chris Evich 2007-01-05 12:32:22 EST
Kernel source package at:
http://people.redhat.com/cevich/stuff/kernel-2.6.9-42.0.3_bz213921c192.EL.src.rpm
Comment 224 Keith Kearnan 2007-01-05 15:04:48 EST
I installed the kernel from comment #223 and I experienced a crash almost 
immediately.

EXT3-fs error (device emcpowerb1): ext3_readdir: directory #6520835 contains a 
hole at offset 4096
Kernel panic - not syncing: EXT3-fs (device emcpowerb1): panic forced after 
error

----------- [cut here ] --------- [please bite here ] ---------
Kernel BUG at panic:75
invalid operand: 0000 [1] SMP
CPU 2
Modules linked in: emcphr(U) emcpmpap(U) emcpmpaa(U) emcpmpc(U) emcpmp(U) emcp
(U) emcplib(U) parport_pc lp parport netconsole netdump autofs4 i2c_dev 
i2c_core sunrpc ds yenta_socket pcmcia_core button battery ac md5 ipv6 ohci_hcd 
hw_random tg3 floppy dm_snapshot dm_zero dm_mirror ext3 jbd dm_mod qla2300 
qla2xxx scsi_transport_fc cciss sd_mod scsi_mod
Pid: 8512, comm: ls Tainted: P      2.6.9-42.0.3_bz213921c192.ELsmp
RIP: 0010:[<ffffffff8013794e>] <ffffffff8013794e>{panic+211}
RSP: 0000:0000010041785c18  EFLAGS: 00010286
RAX: 0000000000000056 RBX: ffffffffa00c9e89 RCX: 0000000000000246
RDX: 0000000000009cd3 RSI: 0000000000000246 RDI: ffffffff803e2080
RBP: 0000010025e7f400 R08: 0000000000000004 R09: ffffffffa00c9e89
R10: 0000000000000000 R11: 0000000000000001 R12: 000001007e4bd400
R13: 0000000000000000 R14: 0000010041785f38 R15: 0000010057997488
FS:  0000002a9557e580(0000) GS:ffffffff804e5300(0000) knlGS:00000000f7fd76c0
CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: 0000000000527070 CR3: 000000003f1b2000 CR4: 00000000000006e0
Process ls (pid: 8512, threadinfo 0000010041784000, task 00000100285917f0)
Stack: 0000003000000010 0000010041785cf8 0000010041785c38 ffffffff80138552
       0000003000000008 000001007e4bd640 000001003bac0000 0000000000000246
       0000000000000004 ffffffffa00c8c95
Call Trace:<ffffffff80138552>{printk+141} <ffffffff8013827a>
{release_console_sem+369}
       <ffffffff801384a8>{vprintk+498} <ffffffffa00c158b>
{:ext3:ext3_handle_error+69}
       <ffffffffa00c1718>{:ext3:ext3_error+193} <ffffffff8030ab57>
{io_schedule+38}
       <ffffffff8017a562>{bh_wake_function+0} <ffffffff8017a562>
{bh_wake_function+0}
       <ffffffffa00bb810>{:ext3:ext3_bread+114} <ffffffffa00b7be5>
{:ext3:ext3_readdir+650}
       <ffffffff8018b1c4>{filldir64+0} <ffffffff801d4055>
{selinux_file_permission+298}
       <ffffffff801859a8>{path_release+12} <ffffffff8018b1c4>{filldir64+0}
       <ffffffff8018af03>{vfs_readdir+155} <ffffffff8018b2f4>
{sys_getdents64+118}
       <ffffffff8011026a>{system_call+126}

Code: 0f 0b 0f e6 31 80 ff ff ff ff 4b 00 31 ff e8 17 c0 fe ff e8
RIP <ffffffff8013794e>{panic+211} RSP <0000010041785c18>
Comment 225 Chris Evich 2007-01-05 15:08:45 EST
Great!  Does "almost immediately" mean you went through the filesystem setup and
it crashed on an "ls" or that the system crashed on boot?
Comment 226 Keith Kearnan 2007-01-05 15:15:10 EST
(In reply to comment #225)
> Great!  Does "almost immediately" mean you went through the filesystem setup
> and it crashed on an "ls" or that the system crashed on boot?

I actually kicked off all the jobs from my usual reproduction steps (comment 
#133).  This is the first time I have seen it crash so quickly, so I would say 
the "aggravation change" works well. 
Comment 227 Chris Evich 2007-01-05 15:31:29 EST
Thanks!   I'm building kernels with both patches applied now.  Will post them
when done.
Comment 228 Chip Coldwell 2007-01-05 16:09:13 EST
(In reply to comment #195)
> Jeff's patch in comment #192 should make it fail faster.
> 
> We're only doing RA on the first 16 or so blocks, and we start by doing a 
> bread on the first block.  On a mostly empty filesystem the directory blocks 
> will probably be sequential, and so all the reada requests will get merged 
> back into the original bio, and the reada bit will get dropped.
> 
> So, if you want to trigger it w/Jeff's patch, completely fill a small FS with 
> 4k files, delete 16 of them, create empty files in a new directory until the 
> disk is full, unmount, mount, ls -laR

I tried it, and much to my surprise I could not reproduce the bug.  I also tried
some variations on the formula: remove the first 16 4k files, remove the last 16
4k files, remove 16 random 4k files, remove two groups of 8 4k files, etc. and
still no joy (there's an unmount, mke2fs -j, mount, etc in between each of these
attempts).

Am I missing something?

Chip


> 
> I think I see one other case where dm-multipath will trigger.  It checks for 
> bio_rw_ahead and will not retry a failed readahead IO on a different path.  
> This is much less likely to happen, but if ql2xprocessrscn=1 isn't on it might 
> be easier to hit...
> 
Comment 230 Keith Kearnan 2007-01-06 16:56:52 EST
(In reply to comment #229)
> (Probably should have picked a faster box to compile these on)  Here are 
kernels...

Thanks Chris!

I kicked off a test with the kernel from comment #229.

[root@L8 ~]# uname -a
Linux L8.lss.emc.com 2.6.9-42.0.3_bz213921c192c216.ELsmp #1 SMP Fri Jan 5 
15:47:27 EST 2007 x86_64 x86_64 x86_64 GNU/Linux

Comment 231 Chris Evich 2007-01-08 14:30:30 EST
(In reply to comment #230)
> I kicked off a test with the kernel from comment #229.

Is it safe to assume this test ran failure-free for the weekend?

Anyone else have results of testing with this worst-case, testing-only kernel 
( Comment #229 )? 
Comment 232 Keith Kearnan 2007-01-08 14:35:43 EST
(In reply to comment #231)
> (In reply to comment #230)
> Is it safe to assume this test ran failure-free for the weekend?

Yes.  It is still running.
-Keith
Comment 234 Jason Baron 2007-01-09 16:15:10 EST
*** Bug 208879 has been marked as a duplicate of this bug. ***
Comment 235 Chris Evich 2007-01-10 09:47:34 EST
Created attachment 145256 [details]
Patch to replace all READAs with normal READs

For the record, we've decided not to go with the comment #216 patch as some
further discussion illustrated it as moderately risky.  In weighing the
criticality of this issue, our desire is to fix the problem while introducing
the smallest possible risk.  The attached patch accomplishes the high-level
intent of the original patch, though in a much safer way.  To summarize, all it
does is replace read-ahead requests with normal read requests.  Thus some
optimization is preserved while still allowing the requests to fail gracefully,
thus preventing the "directory hole read-only" condition.  Testing with this
patch is currently underway and we believe the outcome will be successful.  Any
and all who can test this patch are encouraged to do so.  I will post a link to
the kernel packages and start building other-architecture kernels as time
permits.
Comment 236 Chris Evich 2007-01-10 11:15:17 EST
Kernel source package including patch from comment #235

http://people.redhat.com/cevich/stuff/kernel-2.6.9-42.0.3_bz213921c235.EL.src.rpm
Comment 237 Chris Evich 2007-01-10 11:28:10 EST
Kernel source package including patches from comment #235 and comment #216

THIS KERNEL IS FOR TESTING ONLY - ALL ext3 READA REQUESTS WILL FAIL
(HOWEVER, FILESYSTEMS SHOULD NOT GO READ-ONLY)

http://people.redhat.com/cevich/stuff/kernel-2.6.9-42.0.3_bz213921c235c216.EL.src.rpm
Comment 238 Chris Mason 2007-01-10 14:08:10 EST
Could someone please summarize the objections to the page cache readahead 
patch?  I do agree that s/READA/READ is safe, but I'm curious to hear the 
objections.
Comment 239 Jason Baron 2007-01-10 14:11:41 EST
committed in stream U5 build 42.40. A test kernel with this patch is available
from http://people.redhat.com/~jbaron/rhel4/
Comment 240 Chip Coldwell 2007-01-10 14:23:25 EST
(In reply to comment #238)
> Could someone please summarize the objections to the page cache readahead 
> patch?  I do agree that s/READA/READ is safe, but I'm curious to hear the 
> objections.

Stephen Tweedie is quoted below.

On Fri, 2007-01-05 at 14:05 -0500, Jason Baron wrote:

> can you be more specific about what is making you nervous...performance 
> issues from too much readahead? I'm trying to better assess the risk of 
> this patch for possible inclusion in an asynchronous securrity erratum.

It totally changes the way we're accessing cache for directory buffers.
Possible interactions include interference with journal code, or with
user code accessing the cache at the same time (eg. dump(8).)

I can't see any reason why it *should* have any problems, but it's
precisely the unforeseen problems which get you.  :-)  It's really such
a big change in the internal way we do directory reads that I can't see
it qualifying as a low-risk change.

--Stephen

Comment 242 Chris Evich 2007-01-11 09:23:40 EST
All,

     I have been hearing about some confusion regarding the sudden and apparent
"last-minute" changes in our approach here.  Please allow me to clarify the
situation a bit further.  It is very important to us, and to everyone affected
by this bug, that we get the fix "right".  After testing the "upstream patch"
for this problem, it was determined that it opened up some substantial areas
for concern.  More specifically, these are not concerns which translate to the
upstream code because the context is different (e.g. there are many other
changes as well).  Additionally, there was found to be evidence that the
"upstream patch" actually causes a rare but much more severe problem.
Therefore, we felt it best to delay release of the fix so that we can perform
additional testing on a _much_ safer fix to this problem.  We are aiming for a
much shorter test cycle with this fix because of its perceived safety.  On that
note, testing is underway with the comment #241 kernels, and so far it looks
promising.  Any and all additional help in testing these kernels is much
appreciated.
Comment 243 Chris Evich 2007-01-11 12:11:06 EST
(In reply to comment #241)
> A few compiled kernel packages as per my above comments:
> 
>
http://people.redhat.com/cevich/stuff/kernel-smp-2.6.9-42.0.3_bz213921c235.EL.i686.rpm
>
http://people.redhat.com/cevich/stuff/kernel-smp-2.6.9-42.0.3_bz213921c235.EL.x86_64.rpm
> 
> and
> 
>
http://people.redhat.com/cevich/stuff/kernel-smp-2.6.9-42.0.3_bz213921c235c216.EL.x86_64.rpm
> (Both testing and fix patches applied)

Just to clarify, ignore the first two.  They're the same as what's available
from comment #239 and I've removed the files from my people page.  The last one
"bz213921c235c216", is still there.  The key difference is it has both the fix
and the testing-only "fail all READAs" patch applied.  Presumably, the failure
should not be reproducible with the former; more importantly, nothing else
should break with the latter.
Comment 244 Andrius Benokraitis 2007-01-12 10:32:43 EST
*** Bug 207109 has been marked as a duplicate of this bug. ***
Comment 252 Keith Kearnan 2007-01-19 11:19:14 EST
I have not seen a file system go read-only with any of the kernels that I have 
been asked to test:
2.6.9-42.0.3_bz213921c192c216.ELsmp (64 hours)
2.6.9-42.39.ELsct (3 days)
2.6.9-42.0.6 (4 days)
Today I am going to start testing with 2.6.9-42.0.7.
-Keith
Comment 255 Andrius Benokraitis 2007-01-24 10:01:18 EST
FYI, the following public KBASE article has been released as well:

http://kbase.redhat.com/faq/FAQ_85_9610
Comment 261 Red Hat Bugzilla 2007-01-30 09:37:00 EST
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHSA-2007-0014.html
Comment 270 laurie barry 2007-02-27 14:11:02 EST
The aforementioned lpfc driver parameter variables did not produce 
statistically significant results and are considered unrelated to this issue.  
Any driver configuration variable changes should be cleared through the 
storage and/or adapter providers.
Comment 284 Issue Tracker 2007-07-04 06:04:46 EDT
Internal Status set to 'Resolved'
Status set to: Closed by Tech
Resolution set to: 'Security Errata'

This event sent from IssueTracker by sputhenp 
 issue 111261
Comment 288 Craig Bogovich 2007-11-09 15:53:08 EST
We are experiencing a similar issue.  File systems from both Clariion and 
Symmetrix SANs go readonly from time to time.  The syslogs show the following 
errors:    Aug 21 00:02:28 lndb004 EXT3-fs error (device dm-9) in 
start_transaction: Journal has aborted.

We are on version 2.6.9-42.0.3.ELsmp.

How do we confirm we are experiencing this bug?  Is there a test case we can 
use to attempt to reproduce so that we know the fix works?
Comment 289 Chris Evich 2007-11-09 16:44:58 EST
Craig,

     A filesystem can go readonly for many reasons.  This bug documents the
discovery and resolution of one particular problem within the kernel's ext3
code.  I recommend updating to a later kernel; if the problem doesn't go away,
then please contact support.  They will be able to assist in determining
whether or not this bug is your underlying problem.
Comment 290 Chris Snook 2007-11-09 19:09:37 EST
Craig,

If you're not seeing any errors near that message, you're probably hitting
something else.  Temporally speaking, there should be other errors before it if
you're hitting this bug, but since the message you've quoted would probably come
from a different thread from the one actually hitting the bug directly, the
error messages could end up out of order in the log.

This bug is fixed in 2.6.9-42.0.8.EL, so if you're still having the problem on
that kernel or later, it's definitely something else.

If the problem is occurring inside VMware guests, it may be due to a different
bug which is fixed in the U5 kernel, version 2.6.9-55.EL.
Comment 291 Chris Snook 2007-11-09 19:17:13 EST
I forgot to mention, if you want to try reproducing this bug, try running the
multi-instance iozone script and the metadata test script attached to this bug.
 Basically, if you create an intense I/O load, and then do a whole lot of
metadata operations, and keep doing this for a very long time, you'll eventually
hit this bug, if your systems are susceptible to it.  In lab testing it took
anywhere from an hour to a day to trigger.
