Bug 119457
Summary:          panics in generic_aio_complete_rw and unmap_kvec after __iodesc_free calls generic_aio_complete_read()
Product:          Red Hat Enterprise Linux 3
Component:        kernel
Version:          3.0
Hardware:         i686
OS:               Linux
Status:           CLOSED ERRATA
Severity:         high
Priority:         medium
Reporter:         Chuck Berg <cberg>
Assignee:         Jeff Moyer <jmoyer>
QA Contact:       Brian Brock <bbrock>
CC:               djuran, ggrignou, petrides, riel, tao, tcallawa, vanhoof
Fixed In Version: RHSA-2006-0437
Doc Type:         Bug Fix
Bug Blocks:       181405, 186960
Last Closed:      2006-07-20 13:14:10 UTC
Description (Chuck Berg, 2004-03-30 17:57:43 UTC)
Created attachment 98988 [details]
netdump log of kernel panic
A complete panic message this time.
Great. Do you have the netdump image as well? Could you compress it and attach it if you do?

Unfortunately, I didn't get the netdump image - the netdump server didn't have enough disk space. I assumed http://www.redhat.com/support/wpapers/redhat/netdump/setup.html was current documentation and that only a 4GB dump would be saved. I'm banging on the machine some more; I'll hopefully get the netdump image the next time it crashes.

OK, we need to update the documentation - that bug was fixed some time ago. Of course, I'm not sure how well the dump will compress, so it may be a moot point. ;-) On a side note, could you open up bugs on the shortcomings of the qla2300 driver that ships with RHEL 3? We are updating the driver, and we would like to ensure that we get the fixes that you need. Thanks!

I got a crash dump. It compresses to over 2GB, so I probably shouldn't create an attachment that large (and anyway, Mozilla won't upload a file larger than 2GB). You can get it from:

ftp://rhcrash:rhcrash@nitftp.knight-sec.com/goldinst-vmcore-20040401.gz

and the panic message is:

ftp://rhcrash:rhcrash@nitftp.knight-sec.com/goldinst-panic-20040401.txt

I only have this machine until Monday; it's an eval unit from HP (just so you know, in case you need me to test a patch or collect more information). I opened a case on that qla2300 driver.
> Storage is a dual HP cciss card, with many 1 and 2GB LVM volumes
> (striped between the RAID volume on each controller).
I read this as you had configured each cciss separately, then used the
Linux LVM to stripe your volumes across the two cciss devices. Is
that correct?
From the dump, it looks like the errant process is accessing
/dev/hpdisk/s1g09 (i.e. opened a block device, not an lvm partition).
Does this make sense given your current configuration?
Has the box gone back to HP yet? Any chance you can keep it longer?
Thanks!
Jeff
That's correct about the volume layout. hpdisk is an LVM volume group. It contains physical volumes /dev/cciss/c1d0 and /dev/cciss/c2d0. /dev/hpdisk/s1g09 is a 1GB LVM volume (striped between those two physical volumes). Sybase would have opened this particular volume through /dev/raw/raw9. I have the machine until tomorrow (Tuesday) morning now. I doubt I can keep it longer, but I'll see.

If you still have the machine, let me know, and we'll turn on the slab debugging. If not, I'll continue to try to reproduce this in-house. Thanks.

I no longer have this machine, nor access to any similarly large x86 machine. Thanks for your help.

My name is Gwendal Grignou; I am working for Silverback Systems, where I wrote the driver for our new iSCSI chip, the iSNAP 2110. Our test setup is two RHEL machines, one running as the SCSI initiator, the other as the target. I open 20 iSCSI sessions (which map to 20 TCP connections and 20 SCSI host driver interfaces). We use Iometer to pump traffic, through a patched version of dynamo, the Iometer back-end thread that generates the traffic. After a few seconds, the system crashes. With kgdb, it crashes in generic_aio_complete_rw at aio.c:1135. The problem is that iocb->ctx is NULL while the code tries to access iocb->ctx->dead:

(gdb) p _iocb
$3 = (void *) 0xd12f0700
(gdb) p *(struct kiocb*)$3
$4 = {list = {next = 0xd12f3480, prev = 0xd12f2c00}, filp = 0x0, ctx = 0x0,
     user_obj = 0x0, user_data = 3076314848, pos = 387072, buf = 0,
     nr_transferred = 2048, size = 2048, this_size = 2048, key = 1002,
     cancel = 0, data = 0x0, users = 0, u = {tq = {list = {next = 0x0,
     prev = 0x0}, sync = 0, routine = 0, data = 0x0}, list = {next = 0x0,
     prev = 0x0}}, rlim_fsize = 4294967295}

It is easy to reproduce in my configuration. Do you have any idea where I should look to help you debug? Thanks, Gwendal.
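To help interpret the gdb dump above, here is a minimal, self-contained C model of the failure. It is illustrative only: the structs are pared down to the two fields that matter, complete_rw merely mimics the unconditional dereference at aio.c:1135, and none of this is the actual RHEL3 source. It assumes, consistent with the dump (filp = 0x0, ctx = 0x0), that the kiocb was already completed once and recycled with its fields cleared, so a second, spurious completion dereferences a NULL ctx. Running it segfaults on the second call by construction:

#include <stdio.h>
#include <stddef.h>

/* Pared-down stand-ins for the kernel structures (hypothetical). */
struct kioctx { int dead; };
struct kiocb  { struct kioctx *ctx; size_t nr_transferred; };

/* Mimics the crash site: ctx is dereferenced unconditionally. */
static void complete_rw(struct kiocb *iocb)
{
    if (iocb->ctx->dead)        /* faults when ctx == NULL */
        return;
    printf("completed %zu bytes\n", iocb->nr_transferred);
}

int main(void)
{
    struct kioctx ctx  = { .dead = 0 };
    struct kiocb  iocb = { .ctx = &ctx, .nr_transferred = 2048 };

    complete_rw(&iocb);         /* first, legitimate completion */

    iocb.ctx = NULL;            /* kiocb recycled, fields cleared,
                                   matching the gdb dump above */

    complete_rw(&iocb);         /* spurious second completion: SIGSEGV */
    return 0;
}

The fault signature matches what kgdb shows: a NULL pointer dereference inside the completion path.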
Hi, Gwendal,

As a start, you could enable slab debugging. Let me know what (if anything) that uncovers.

-Jeff

Reposting the prior update for inclusion in IT entry 55870.

For anyone still running into this problem, please try the debug kernel located here:

http://people.redhat.com/jmoyer/.bz119457/

Be sure to capture the kernel logs via netlog/netdump or a serial console. When next the machine panics, I'm interested in seeing these logs.

Is there anyone who is still seeing this problem?

Created attachment 122301 [details]
Add a missing return statement; without this, an I/O can be completed twice.
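To sketch what "missing return statement" means here (schematic only; the function and variable names below are made up, not the kernel's): on finding the page locked for I/O, the read path should park itself on the page wait queue and stop. Without the return, it falls through and completes the iocb immediately, and the later page-unlock wakeup completes the same iocb a second time. A toy program counting completions under both variants:

#include <stdio.h>
#include <stdbool.h>

static int completions;         /* times the same iocb was completed */

static void complete_read(void) { completions++; }

/* Toy read path. "page_locked" models a page still under write I/O;
 * "patched" models the one-line fix from attachment 122301. */
static void read_path(bool page_locked, bool patched)
{
    if (page_locked) {
        /* queue on the page wait queue; the unlock wakeup will
         * re-drive this read and complete it with real data */
        if (patched)
            return;             /* the missing return statement */
        /* unpatched: fall through and complete prematurely */
    }
    complete_read();
}

int main(void)
{
    completions = 0;
    read_path(true,  false);    /* read hits a locked page */
    read_path(false, false);    /* unlock wakeup re-drives the read */
    printf("unpatched: %d completions\n", completions);   /* 2 */

    completions = 0;
    read_path(true,  true);     /* patched: just waits */
    read_path(false, true);     /* wakeup completes it once */
    printf("patched:   %d completions\n", completions);   /* 1 */
    return 0;
}

The unpatched variant reports two completions for a single request, which is exactly the double completion that leads to the NULL ctx dereference described above.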
I'm currently testing this patch, and it's holding up quite well. I am going
to coordinate some more stress testing internally, but I'd like to see if the
customer(s) could give this a try.
Thanks.
Here is a description of the problem that the posted patch fixes. If a program issues a read for a page which is undergoing write I/O, you can trigger this problem. For example, a program which issues a number of writes to a file, followed very closely by a number of reads of the same portions of the same file, will trigger this bug. What ends up happening is that the AIO reads catch up with the writes, since everything is being cached. When the read path reaches a page which is still locked by the writer, it follows a code path that is seldom run. That code path tests whether the page is locked for I/O. If it is, the reader puts itself on the page wait queue (entries on the page wait queue are woken up when the page is unlocked). The errant code path then completes the I/O, even though it hasn't actually completed. Because of this, when the page is unlocked, we get called again with real data in the page, and the same I/O is completed a second time. This second completion then dereferences a NULL pointer in the io context structure.

Because of the racy nature of this problem, it's easier to reproduce over a networked file system than on a local file system. That isn't to say one cannot trigger the bug on a local file system; it's simply more difficult to hit the race in the local case.

A fix for this problem has just been committed to the RHEL3 U8 patch pool this evening (in kernel version 2.4.21-40.2.EL).

This issue is on Red Hat Engineering's list of planned work items for the upcoming Red Hat Enterprise Linux 3.8 release. Engineering resources have been assigned and, barring unforeseen circumstances, Red Hat intends to include this item in the 3.8 release.

An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you.

http://rhn.redhat.com/errata/RHSA-2006-0437.html
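For anyone who wants to try to hit the race (or verify a fixed kernel), here is a sketch of a reproducer along the lines described above, using the libaio userspace API. This is a guess at a trigger, not a known-good test case: the file name, request counts, and sizes are arbitrary, and because the window is narrow it would likely need to run in a loop, ideally against a networked file system:

#include <libaio.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

#define NR 64               /* requests per burst (arbitrary) */
#define SZ 4096             /* one page per request */

int main(void)
{
    static char wbuf[SZ], rbuf[NR][SZ];
    struct iocb wcb[NR], rcb[NR], *cbp[NR];
    struct io_event ev[2 * NR];
    io_context_t ctx = 0;
    int fd, i;

    fd = open("aio-race-test", O_RDWR | O_CREAT | O_TRUNC, 0600);
    if (fd < 0 || io_setup(2 * NR, &ctx) < 0) {
        perror("setup");
        return 1;
    }
    memset(wbuf, 0xab, sizeof(wbuf));

    /* A burst of AIO writes... */
    for (i = 0; i < NR; i++) {
        io_prep_pwrite(&wcb[i], fd, wbuf, SZ, (long long)i * SZ);
        cbp[i] = &wcb[i];
    }
    if (io_submit(ctx, NR, cbp) != NR) { perror("submit writes"); return 1; }

    /* ...followed immediately by AIO reads of the same regions, so the
     * reads catch up with pages still locked by the writer. */
    for (i = 0; i < NR; i++) {
        io_prep_pread(&rcb[i], fd, rbuf[i], SZ, (long long)i * SZ);
        cbp[i] = &rcb[i];
    }
    if (io_submit(ctx, NR, cbp) != NR) { perror("submit reads"); return 1; }

    /* Reap everything; on an affected kernel the box may panic before
     * this returns. */
    io_getevents(ctx, 2 * NR, 2 * NR, ev, NULL);
    io_destroy(ctx);
    close(fd);
    unlink("aio-race-test");
    return 0;
}

Build with gcc -O2 -o aio-race aio-race.c -laio and run it repeatedly; per the description above, the race is easier to hit over a networked file system than a local one.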