Bug 577968 - kernel 2.6.34-0.16.rc2.git0.fc14.x86_64 gets GPF at cfq_free_io_context+0x18/0x34
Status: CLOSED ERRATA
Product: Fedora
Classification: Fedora
Component: kernel
Version: 14
Hardware: All
OS: Linux
Priority: low
Severity: medium
Assigned To: Jeffrey Moyer
QA Contact: Fedora Extras Quality Assurance
: Reopened
Duplicates: 520620 585659 624979 637068 638677 726679 755219
Depends On:
Blocks:
Reported: 2010-03-29 17:06 EDT by Michal Jaegermann
Modified: 2011-11-21 16:38 EST
CC: 23 users

See Also:
Fixed In Version: kernel-2.6.35.14-96.fc14
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
: 716520
Environment:
Last Closed: 2011-09-12 07:27:31 EDT


Attachments
dmesg output for 2.6.34-0.16.rc2.git0.fc14.x86_64 with a fault (38.01 KB, text/plain)
2010-03-29 17:06 EDT, Michal Jaegermann
kernel config for x86_64 (102.60 KB, text/plain)
2010-03-31 16:17 EDT, Jeffrey Moyer
backtrace for 2.6.36-0.16.rc3.git0.fc15.x86_64 kernel (2.39 KB, text/plain)
2010-09-07 15:02 EDT, Michal Jaegermann
crash picture (902.96 KB, image/jpeg)
2010-09-16 08:41 EDT, Yanko Kaneti
dmesg from _2.6.36-0.16.rc3.git0.fc15.x86_64 with crash at _call_for_each_cic+0x21/0x3f (40.71 KB, application/octet-stream)
2010-09-24 21:34 EDT, Michal Jaegermann
my v2.6.39-rc6 vanilla config (119.58 KB, text/plain)
2011-05-10 15:18 EDT, Paul Bolle
CFQ cic debug patch (1.34 KB, patch)
2011-05-17 14:58 EDT, Vivek Goyal
CFQ cic debug patch V2 (2.68 KB, patch)
2011-05-18 19:03 EDT, Vivek Goyal
last 100 lines (before unsuccessful reboot in capture kernel) (6.33 KB, text/plain)
2011-05-19 17:10 EDT, Paul Bolle
16+ seconds of blktrace output (530.23 KB, text/plain)
2011-05-20 03:38 EDT, Paul Bolle
Fix frequent cgroup changed messages visible in debug logs (915 bytes, patch)
2011-05-23 10:28 EDT, Vivek Goyal
RCU collectdebugfs.sh script (1.63 KB, application/x-sh)
2011-08-08 17:46 EDT, Vivek Goyal
output of collectdebugfs.sh (52.78 KB, text/plain)
2011-08-09 15:47 EDT, Paul Bolle
Patch for v3.0..rcu/next as of August 9, 2011 (135.34 KB, text/plain)
2011-08-10 22:38 EDT, Paul E. McKenney
Diagnostic patch for RCU_FAST_NO_HZ assumptions (1.74 KB, patch)
2011-08-19 18:27 EDT, Paul E. McKenney
Patch to remove rcu_needs_cpu_flush() to avoid false quiescent states. (2.78 KB, patch)
2011-08-20 21:42 EDT, Paul E. McKenney

Description Michal Jaegermann 2010-03-29 17:06:48 EDT
Created attachment 403363 [details]
dmesg output for 2.6.34-0.16.rc2.git0.fc14.x86_64 with a fault

Description of problem:

What looks like a side effect of running /etc/cron.daily: a general protection fault ("general protection fault: 0000 [#1] SMP") was registered, triggered by "Pid: 5613, comm: ld-linux-x86-64", with
RIP: 0010:[<ffffffff81226d77>]  [<ffffffff81226d77>] cfq_free_io_context+0x18/0x34
It all ended with "Fixing recursive fault but reboot is needed!".
Full dmesg output is attached.

Version-Release number of selected component (if applicable):
kernel-2.6.34-0.16.rc2.git0.fc14.x86_64

How reproducible:
So far this was quite a surprise.
Comment 1 Jeffrey Moyer 2010-03-30 16:10:42 EDT
/usr/src/debug/kernel-2.6.33/linux-2.6.33.x86_64/block/cfq-iosched.c:2410
ffffffff81226d6b:       48 8b 5f 78             mov    0x78(%rdi),%rbx  <=======
cfq_free_io_context():
/usr/src/debug/kernel-2.6.33/linux-2.6.33.x86_64/block/cfq-iosched.c:2475
ffffffff81226d6f:       49 89 fc                mov    %rdi,%r12
__call_for_each_cic():
/usr/src/debug/kernel-2.6.33/linux-2.6.33.x86_64/block/cfq-iosched.c:2410
ffffffff81226d72:       48 85 db                test   %rbx,%rbx
ffffffff81226d75:       74 17                   je     ffffffff81226d8e <cfq_free_io_context+0x2f>
ffffffff81226d77:       48 8b 03                mov    (%rbx),%rax  <===========
/usr/src/debug/kernel-2.6.33/linux-2.6.33.x86_64/block/cfq-iosched.c:2411
ffffffff81226d7a:       48 8d 73 a8             lea    -0x58(%rbx),%rsi
ffffffff81226d7e:       4c 89 e7                mov    %r12,%rdi
/usr/src/debug/kernel-2.6.33/linux-2.6.33.x86_64/block/cfq-iosched.c:2410
ffffffff81226d81:       0f 18 08                prefetcht0 (%rax)
/usr/src/debug/kernel-2.6.33/linux-2.6.33.x86_64/block/cfq-iosched.c:2411
ffffffff81226d84:       e8 53 ff ff ff          callq  ffffffff81226cdc <cic_free_func>

rbx is ioc->cic_list (as cic_list is at offset 0x78 from the start of struct io_context), which the register dump shows is 0x6b6b6b6b6b6b6b6b.  So, the io context has already been freed.

Kerneloops.org shows that this bug has manifested as early as 2.6.30.8, so I don't think this is a new problem, though 2.6.34 seems to be triggering it more easily.
Comment 2 Jens Axboe 2010-03-31 04:20:09 EDT
OK, that's not good. Probably some tricky RCU sneakiness, I'll see if I can reproduce something like that here.
Comment 3 Jens Axboe 2010-03-31 14:35:23 EDT
Any chance someone has a full trace of this?
Comment 4 Jeffrey Moyer 2010-03-31 14:41:58 EDT
The full trace was in the attachment:

general protection fault: 0000 [#1] SMP 
last sysfs file: /sys/devices/virtual/block/dm-9/uevent
CPU 0 
Modules linked in: autofs4 w83627hf hwmon_vid sunrpc ipv6 ipt_LOG iptable_nat nf_nat ext2 uinput e100 ppdev edac_core edac_mce_amd microcode snd_mpu401 ns558 i2c_viapro parport_pc parport snd_via82xx gameport snd_ac97_codec ac97_bus snd_seq snd_pcm floppy snd_timer snd_page_alloc snd_mpu401_uart snd_rawmidi snd_seq_device snd k8temp soundcore hwmon skge mii shpchp ata_generic pata_acpi firewire_ohci pata_via firewire_core crc_itu_t sata_via sata_promise radeon ttm drm_kms_helper drm i2c_algo_bit i2c_core [last unloaded: freq_table]

Pid: 5613, comm: ld-linux-x86-64 Not tainted 2.6.34-0.16.rc2.git0.fc14.x86_64 #1 SK8V/To Be Filled By O.E.M.
RIP: 0010:[<ffffffff81226d77>]  [<ffffffff81226d77>] cfq_free_io_context+0x18/0x34
RSP: 0018:ffff880016713e28  EFLAGS: 00010202
RAX: 00000001002c0f33 RBX: 6b6b6b6b6b6b6b6b RCX: 0000000000000000
RDX: 00000001002c0f36 RSI: ffff880002dd0bc0 RDI: 0000000000000296
RBP: ffff880016713e38 R08: ffff880015374680 R09: ffff880016bac4e0
R10: 0000000000000046 R11: ffff880016713d48 R12: ffff880015374680
R13: ffff88001665cfc8 R14: 0000000000000000 R15: ffff88001f5a8000
FS:  00007f2122d0e7a0(0000) GS:ffff880002c00000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: 00007feac7253000 CR3: 000000001b77c000 CR4: 00000000000006f0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Process ld-linux-x86-64 (pid: 5613, threadinfo ffff880016712000, task ffff88001665c900)
Stack:
 ffff880015374680 ffff88001665c900 ffff880016713e68 ffffffff8121f358
<0> ffffffff8121f323 ffff88001665c900 ffff88001665cfc8 ffff880015374680
<0> ffff880016713ea8 ffffffff8121f464 ffffffff8121f40f ffffffff8107dee4
Call Trace:
 [<ffffffff8121f358>] put_io_context+0x69/0x9c
 [<ffffffff8121f323>] ? put_io_context+0x34/0x9c
 [<ffffffff8121f464>] exit_io_context+0xa0/0xac
 [<ffffffff8121f40f>] ? exit_io_context+0x4b/0xac
 [<ffffffff8107dee4>] ? trace_hardirqs_on+0xd/0xf
 [<ffffffff81054775>] do_exit+0x79d/0x7d1
 [<ffffffff81129fb7>] ? path_put+0x22/0x27
 [<ffffffff8105482d>] do_group_exit+0x84/0xb0
 [<ffffffff81054870>] sys_exit_group+0x17/0x1b
 [<ffffffff81009c72>] system_call_fastpath+0x16/0x1b
Code: c6 93 6d 22 81 e8 3d a3 e8 ff 5b 41 5c 41 5d 41 5e c9 c3 55 48 89 e5 41 54 53 0f 1f 44 00 00 48 8b 5f 78 49 89 fc 48 85 db 74 17 <48> 8b 03 48 8d 73 a8 4c 89 e7 0f 18 08 e8 53 ff ff ff 48 8b 1b 
RIP  [<ffffffff81226d77>] cfq_free_io_context+0x18/0x34
 RSP <ffff880016713e28>
---[ end trace 1a4cf5320fdfc46c ]---
Fixing recursive fault but reboot is needed!

Is that what you're looking for?
Comment 5 Jens Axboe 2010-03-31 14:56:24 EDT
Thanks Jeff, dunno how I missed that there was an attachment.

So next question, seems this was triggered by 'readahead'. That's the RH readahead thing, correct? Does it use io priorities? Or CLONE_IO?
Comment 6 Jeffrey Moyer 2010-03-31 15:18:19 EDT
I'm not convinced readahead is the problem.  I think the reporter just said that because it was the last thing in the log before the GPF.  The task associated with the failure is ld-linux-x86-64, which may be related to the prelink cron job.  The prelink cron job does this:

renice +19 -p $$ >/dev/null 2>&1

but I don't see any changing of io priorities.  The prelink binary doesn't seem to use CLONE_IO.  Not sure I want to dig around in ld-linux-x86_64.  ;-)
Comment 7 Michal Jaegermann 2010-03-31 15:24:59 EDT
(In reply to comment #5)

> So next question, seems this was triggered by 'readahead'.

By timing, this happened while running 'cron.daily', apparently in the scheduler, and dmesg output suggests that 'readahead' may be involved, so the above looked like a reasonable "working hypothesis", but I do not know if it is really proven.

> That's the RH readahead thing, correct?

This is from a "rawhide" installation for Fedora.
Comment 8 Jens Axboe 2010-03-31 15:39:26 EDT
Jeff, I would not want you to dive into ld-linux-x86_64 either :-).

So what else is being run by cron.daily on that system?

Oh, and a kernel config from this kernel would help, just in case it depends on some special RCU setting (or something like that).

I'll see if I can reproduce locally, from a look over the ioc exit/free path I didn't spot anything. I'll let something run over the next few days once I have an idea what that something should be. thu/fri/mon are holidays here, so don't expect anything before next week.
Comment 9 Jeffrey Moyer 2010-03-31 16:17:09 EDT
Created attachment 403822 [details]
kernel config for x86_64

My system (F12) shows the following in cron.daily:
  cups logrotate mlocate.cron readahead.cron makewhatis.cron prelink tmpwatch
Comment 10 Jeffrey Moyer 2010-03-31 16:18:16 EDT
Thanks for looking into this, Jens!  I'll try to spend some time on it as well when I next come up for air.
Comment 11 Jeffrey Moyer 2010-03-31 17:15:48 EDT
FYI, you can also look here for backtrace information:

http://kerneloops.org/guilty.php?guilty=cfq_free_io_context&version=2.6.34-rc&start=2228224&end=2260991&class=oops
Comment 12 Michal Jaegermann 2010-04-11 16:29:37 EDT
(In reply to comment #9)

> My system (F12) shows the following in cron.daily:
>   cups logrotate mlocate.cron readahead.cron makewhatis.cron prelink tmpwatch    

Just got that again with 2.6.34-0.28.rc3.git3.fc14.x86_64 kernel. "Call Trace" and "Code" do not really differ from those in attachment (id=403363) but this time one can find "Process gzip (pid: 5083, threadinfo ffff880013416000, task ffff880007f08000)" instead of ld-linux-x86-64 from the previous instance.

Timestamps in logs locate that oops squarely between "starting makewhatis.cron"
and "finished makewhatis.cron" with ample margins on both sides.
Comment 13 Michal Jaegermann 2010-04-19 11:40:25 EDT
I have the next one with 2.6.34-0.28.rc3.git3.fc14.x86_64.  This time makewhatis.cron is gone, but timing implicates man-db.cron, and the triggering process is gzip again.
Comment 14 Michal Jaegermann 2010-04-28 19:43:32 EDT
man-db.cron seems to be rather good at triggering this bug.  Recently I see these recursive faults quite regularly, and always when the man-db.cron script is triggered by cron.  The current ones are with kernel-2.6.34-0.38.rc5.git0.fc14.x86_64.
Comment 15 Michal Jaegermann 2010-07-14 12:52:27 EDT
I was away for quite a while and not running rawhide.  Upon return I pretty quickly got something familiar, but this time from a 2.6.35-... kernel.  Here we go:

general protection fault: 0000 [#1] SMP 
last sysfs file: /sys/devices/pci0000:00/0000:00:0f.0/host0/target0:0:0/0:0:0:0/block/sda/sda11/uevent
CPU 0 
Modules linked in: autofs4 w83627hf hwmon_vid sunrpc ipv6 ipt_LOG iptable_nat nf_nat ext2 uinput snd_via82xx microcode snd_ac97_codec ac97_bus snd_seq e100 snd_pcm snd_timer snd_page_alloc ppdev edac_core edac_mce_amd snd_mpu401 k8temp snd_mpu401_uart snd_rawmidi parport_pc snd_seq_device parport i2c_viapro snd soundcore ns558 gameport skge mii shpchp ata_generic pata_acpi firewire_ohci firewire_core sata_promise pata_via sata_via crc_itu_t radeon ttm drm_kms_helper drm i2c_algo_bit i2c_core [last unloaded: mperf]

Pid: 28772, comm: gzip Not tainted 2.6.35-0.31.rc4.git4.fc14.x86_64 #1 SK8V/To Be Filled By O.E.M.
RIP: 0010:[<ffffffff81228a7c>]  [<ffffffff81228a7c>] __call_for_each_cic+0x21/0x3f
RSP: 0018:ffff880011537e28  EFLAGS: 00010202
RAX: 00000001001fd43f RBX: 6b6b6b6b6b6b6b6b RCX: ffff88001bbdef48
RDX: 00000001001fd442 RSI: ffff880002dd0d70 RDI: 0000000000000286
RBP: ffff880011537e48 R08: 0000000000000286 R09: ffff880011537d98
R10: ffff88001b0d1028 R11: ffff880011537df8 R12: ffff88001b0d1000
R13: ffffffff81228ac8 R14: 0000000000000001 R15: ffff88001f5c0000
FS:  00007f5d74174720(0000) GS:ffff880002c00000(0000) knlGS:00000000f7782720
CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: 00007f5d73a88470 CR3: 0000000001a43000 CR4: 00000000000006f0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Process gzip (pid: 28772, threadinfo ffff880011536000, task ffff88001b6f0000)
Stack:
 0000000000000001 ffff88001b0d1000 ffff88001b6f0000 ffff88001b6f0690
<0> ffff880011537e58 ffffffff81228aaf ffff880011537e78 ffffffff8121fc25
<0> ffff880011537e78 ffff88001b0d1000 ffff880011537ea8 ffffffff8121fce5
Call Trace:
 [<ffffffff81228aaf>] cfq_free_io_context+0x15/0x17
 [<ffffffff8121fc25>] put_io_context+0x41/0x5e
 [<ffffffff8121fce5>] exit_io_context+0x6c/0x74
 [<ffffffff8105416a>] do_exit+0x7a9/0x7d0
 [<ffffffff81496aae>] ? lockdep_sys_exit_thunk+0x35/0x67
 [<ffffffff81054427>] do_group_exit+0x88/0xb6
 [<ffffffff8105446c>] sys_exit_group+0x17/0x1b
 [<ffffffff81009c72>] system_call_fastpath+0x16/0x1b
Code: ed 00 00 41 59 48 98 5b c9 c3 55 48 89 e5 41 55 41 54 53 48 83 ec 08 0f 1f 44 00 00 48 8b 5f 78 49 89 fc 49 89 f5 48 85 db 74 15 <48> 8b 03 48 8d 73 b0 4c 89 e7 0f 18 08 41 ff d5 48 8b 1b eb e6 
RIP  [<ffffffff81228a7c>] __call_for_each_cic+0x21/0x3f
 RSP <ffff880011537e28>
---[ end trace 8e807d826e855c8f ]---
Fixing recursive fault but reboot is needed!

If this is really a new bug and/or full dmesg output is desired then please let me know.  It seems to be triggered by the same man-db.cron though.
Comment 16 Bug Zapper 2010-07-30 07:13:05 EDT
This bug appears to have been reported against 'rawhide' during the Fedora 14 development cycle.
Changing version to '14'.

More information and reason for this action is here:
http://fedoraproject.org/wiki/BugZappers/HouseKeeping
Comment 17 Michal Jaegermann 2010-08-05 16:05:18 EDT
(In reply to comment #16)

> Changing version to '14'.

Should this be left as '14'?  I was just hit by one of those on
a rawhide kernel 2.6.35-0.57.rc6.git5.fc15.x86_64.  That trace is basically the same as the one from comment 15 for 2.6.35-0.31.rc4.git4.fc14.x86_64.
Comment 18 Chuck Ebbert 2010-08-06 16:15:35 EDT
*** Bug 585659 has been marked as a duplicate of this bug. ***
Comment 19 Chuck Ebbert 2010-08-11 08:09:36 EDT
block/cfq-iosched.c:2492:

static void
__call_for_each_cic(struct io_context *ioc,
                    void (*func)(struct io_context *, struct cfq_io_context *))
{
        struct cfq_io_context *cic;
        struct hlist_node *n;

        hlist_for_each_entry_rcu(cic, n, &ioc->cic_list, cic_list)
                func(ioc, cic);
}

#define hlist_for_each_entry_rcu(tpos, pos, head, member)                \
        for (pos = rcu_dereference_raw((head)->first);                   \
                pos && ({ prefetch(pos->next); 1; }) &&                  \
                ({ tpos = hlist_entry(pos, typeof(*tpos), member); 1; }); \
                pos = rcu_dereference_raw(pos->next))

We've made at least one trip through the loop (i.e. head->first was valid) and now we've hit a case where ->next points to freed memory.
Comment 20 Chuck Ebbert 2010-08-18 08:22:30 EDT
*** Bug 624979 has been marked as a duplicate of this bug. ***
Comment 21 Chuck Ebbert 2010-08-18 08:25:01 EDT
Attachment 439347 [details] has a trace from kernel 2.6.35.2-9
Comment 22 Michal Jaegermann 2010-09-07 15:02:57 EDT
Created attachment 445756 [details]
backtrace for 2.6.36-0.16.rc3.git0.fc15.x86_64 kernel

Just to record that this general protection fault shows up in 2.6.36... kernels as well.
Comment 23 Michal Jaegermann 2010-09-15 16:33:47 EDT
So far 2.6.36-0.16.rc3.git0.fc15.x86_64 was the last one with which I have seen it.  I cannot vouch that this oops will not return, but it was a pretty regular event up to that time.  Maybe it is really gone?  At least on my box.
Comment 24 Jeffrey Moyer 2010-09-15 16:49:41 EDT
I haven't seen anything go by that should fix this, so I'd be surprised.  Thanks for the update, though!
Comment 25 Yanko Kaneti 2010-09-16 08:41:55 EDT
Created attachment 447730 [details]
crash picture

I've just had a very similar crash with kernel-2.6.36-0.21.rc4.git1.fc15.x86_64
Only managed to capture the attached picture. 
This is the first time ever that I've seen this.
Maybe there was some memory pressure at that point. Evolution grows a lot.

Please ignore the taint; it's something in networking that has been going on and off for a number of kernel releases without any effect on the system as far as I can tell.
Comment 26 Jeffrey Moyer 2010-09-17 16:34:58 EDT
Staring at the code isn't bringing me any closer to enlightenment.  Is there any way someone who can reproduce this could catch a crashdump for me?
Comment 27 Michal Jaegermann 2010-09-17 18:07:23 EDT
(In reply to comment #26)
> Is there any
> way someone who can reproduce this could catch a crashdump for me?

Is there somewhere a detailed description explaining how to get what you would like to see?

There is also the problem that, as noted in comment 23, I stopped seeing this, while earlier running makewhatis.cron was a really good way to trigger such a crash.  OTOH, what Chuck Ebbert wrote in comment 19 looked really convincing.
The remaining question is how to test that pos->next points to something already freed.
Comment 28 Jeffrey Moyer 2010-09-20 07:51:07 EDT
You can install the 'kexec-tools' package.  Modify /etc/grub.conf, appending "crashkernel=256M" to the kernel command line.  Then run 'chkconfig kdump on' and reboot.  When your system comes back up, it should build an initrd for kdump.  You can check whether kdump loaded properly with 'service kdump status'; it should report that kdump is operational.

When the kernel next panics, your system will boot into the kdump kernel and store the contents of memory into a directory under /var/crash/.  I'd need the vmcore file that gets saved, and the exact kernel version you are running to then debug the problem further.

Chuck's comment means that there were at least two cics active, which means that you had at least two processes doing I/O.  Also, that list poison means exactly that there is a race between two list-removal code paths.  I've been trying to reproduce this by running the cron.daily jobs every minute, and then every hour (two separate tests).  I also tried to keep the system busy with I/O from multiple processes over the weekend, and was not able to reproduce the crash.
Comment 29 Michal Jaegermann 2010-09-20 18:18:30 EDT
(In reply to comment #28)
> When your system comes back up, it should build an initrd for kdump.

I wish it were that simple.  Unfortunately mkdumprd is utterly broken
and gets itself into an infinite loop like this:
....
+ cd ..
+ '[' '!' -f modalias ']'
+ '[' / = /sys/devices ']'
+ cd ..
+ '[' '!' -f modalias ']'
+ '[' / = /sys/devices ']'
+ cd ..
....
It also gets stuck without the -d option.  Sigh!

> When the kernel next panics,

That does not appear to be as easy as in the past where that was a quite regular event.

> I've been
> trying to reproduce this by running the cron.daily jobs every minute, and then
> every hour (two separate tests). 

There is no makewhatis.cron anymore which seemed particularly good in triggering that crash; mandb does not look like a suitable substitute.

Anyway, I reinstalled and rebooted into kernel-2.6.36-0.16.rc3.git0.fc15.x86_64, as it was crashing like that, but so far no luck.

I will try to see if I get that crash again but, like I said, the last time it happened was on September 7th at 11:12 and 11:14.  Both times "comm" was dbus-uuidgen.  I have the following in the logs:

Sep  7 11:11:54 dyna0 kernel: RPC: Registered udp transport module.
Sep  7 11:11:54 dyna0 kernel: RPC: Registered tcp transport module.
Sep  7 11:11:54 dyna0 kernel: RPC: Registered tcp NFSv4.1 backchannel transport 
module.
Sep  7 11:11:56 dyna0 kernel: general protection fault: 0000 [#1] SMP 
Sep  7 11:11:56 dyna0 kernel: last sysfs file: /sys/module/mperf/refcnt

and later

Sep  7 11:12:32 dyna0 kernel: readahead-collector: sorting
Sep  7 11:12:33 dyna0 kernel: readahead-collector: finished
Sep  7 11:14:13 dyna0 abrt: Kerneloops: Reported 1 kernel oopses to Abrt

and that created another directory in /var/spool/abrt/ but no other details in logs.

It appears that user-space changes will make tracing that much harder (or maybe even hide it completely).
Comment 30 Michal Jaegermann 2010-09-20 19:48:20 EDT
(In reply to comment #29)
> (In reply to comment #28)
> Unfortunately mkdumprd is utterly broken

See bug 635893.
Comment 31 Michal Jaegermann 2010-09-21 16:56:54 EDT
(In reply to comment #28)

> You can check to see if kdump loaded properly with 'service kdump status'. 
> That should report that kdump is operational.

There is a problem. 'service kdump status' reports "Kdump is operational", and I have reproduced this crash using the 2.6.36-0.16.rc3.git0.fc15.x86_64 kernel.  Moreover, the kernel command line has 'crashkernel=256M', and /boot/initrd-2.6.36-0.16.rc3.git0.fc15.x86_64kdump.img exists, created yesterday after fixing mkdumprd to actually run.

> When the kernel next panics, your system will boot into the kdump kernel and
> store the contents of memory into a directory under /var/crash/.

dmesg ends up with:

RIP  [<ffffffff81232c36>] __call_for_each_cic+0x21/0x3f
 RSP <ffff880017899e28>
---[ end trace 438ad5fa6ce9222e ]---
Fixing recursive fault but reboot is needed!

but /var/crash/ is empty and no vmcore anywhere in sight.  Now what?
Should initrd-2.6.36-0.16.rc3.git0.fc15.x86_64kdump.img be used instead of a corresponding initramfs while booting?
Comment 32 Jeffrey Moyer 2010-09-21 17:12:57 EDT
Sorry, I forgot that Fedora doesn't set panic_on_oops by default.  Add the following line to /etc/sysctl.conf:

kernel.panic_on_oops = 1

Then, next time you boot, that will be set.  You can set it at runtime via:

sysctl -w kernel.panic_on_oops=1

I'm pretty sure that will work for a GPF.  Again, sorry about this.
Comment 33 Michal Jaegermann 2010-09-21 18:11:26 EDT
(In reply to comment #32)
> Add the following line to /etc/sysctl.conf:
> 
> kernel.panic_on_oops = 1

OK.  Will do.  Now I will have to wait for the next crash.  Unfortunately
it does not happen on demand.

I am using 2.6.36-0.16.rc3.git0.fc15.x86_64 because I know that at least with this one this was happening.

With the last crash a trigger program was reported as "Pid: 20242, comm: gzip"
and by timestamps this happened by the end of man-db.cron run performed by cron.daily.  Running that "by hand" so far does not seem to have the same effects.
Comment 34 Michal Jaegermann 2010-09-24 21:34:33 EDT
Created attachment 449551 [details]
dmesg from _2.6.36-0.16.rc3.git0.fc15.x86_64 with crash at _call_for_each_cic+0x21/0x3f

It looks like I got it this time.  By timing, it appears that this happened in the middle of a series of updates performed by yum, messing up /var/lib/rpm/ as a side effect.  I am not exactly sure what happened, as the whole machine just died, cutting off the local console and the network, and also killing the X server for good measure.  The abrt-generated backtrace claims:

last sysfs file: /sys/devices/virtual/block/dm-9/uevent
Pid: 7124, comm: hardlink

if that can be relied upon under the circumstances.

The produced vmcore is 282M, which is not exactly of a "bugzilla size".
Comment 35 Michal Jaegermann 2010-09-25 17:24:58 EDT
Scratch comment 23.  I got hit by this bug while running 2.6.36-0.26.rc5.git4.fc15.x86_64.  I do not have anything in /var/crash from that, as in the meantime a 'kexec-tools' update broke /sbin/mkdumprd again, making it impossible to boot with the 'kdump' service active.  The GPF happened while I was in the process of tracing and fixing mkdumprd.

Hopefully the data from 2.6.36-0.16.rc3.git0.fc15.x86_64 will be good enough.  I have no reliable method of reproduction.
Comment 36 Michal Jaegermann 2010-09-25 17:35:09 EDT
In addition to comment 35:

Timestamps indicate that the crash happened again while running man-db.cron from cron.daily, and in 'backtrace' I have 'Pid: 10062, comm: gzip'.  OTOH, in /var/log/messages I can find this:

abrtd: New crash /var/spool/abrt/kerneloops-1285446849-2141-1, processing
abrtd: Registered Action plugin 'RunApp'
abrtd: RunApp('/var/spool/abrt/kerneloops-1285446849-2141-1','test x"`cat component`" = x"xorg-x11-server-Xorg" && cp /var/log/Xorg.0.log

and the X server is indeed gone.  Not sure how to interpret that.
Comment 37 Panu Matilainen 2010-09-30 02:51:18 EDT
I just got this too on kernel 2.6.35.4-28.fc14.x86_64, and based on log timestamps it occurred while running man-db.cron here too.

general protection fault: 0000 [#1] SMP
last sysfs file: /sys/devices/virtual/block/dm-1/dm/name
CPU 0
Modules linked in: fuse rfcomm sco bnep l2cap nfsd lockd nfs_acl auth_rpcgss exportfs sunrpc cpufreq_ondemand powernow_k8 freq_table mperf nf_conntrack_netbios_ns ip6t_REJECT nf_conntrack_ipv6 ip6table_filter ip6_tables ipv6 uinput snd_via82xx gameport snd_via82xx_modem snd_ac97_codec ac97_bus snd_seq snd_pcm ppdev parport_pc snd_timer shpchp snd_mpu401_uart snd_rawmidi snd_seq_device btusb bluetooth amd64_edac_mod parport rfkill snd microcode soundcore skge edac_core i2c_viapro snd_page_alloc k8temp edac_mce_amd ata_generic pata_acpi firewire_ohci pata_via firewire_core sata_via sata_promise crc_itu_t radeon ttm drm_kms_helper drm i2c_algo_bit i2c_core [last unloaded: scsi_wait_scan]

Pid: 29167, comm: gzip Not tainted 2.6.35.4-28.fc14.x86_64 #1 A8V Deluxe/To Be Filled By O.E.M.
RIP: 0010:[<ffffffff8122d6b8>]  [<ffffffff8122d6b8>] __call_for_each_cic+0x21/0x3f
RSP: 0018:ffff88000c017e28  EFLAGS: 00010202
RAX: 00000001014dbb1d RBX: 6b6b6b6b6b6b6b6b RCX: ffff880035acea98
RDX: 00000001014dbb20 RSI: ffff8800047d0da0 RDI: 0000000000000286
RBP: ffff88000c017e48 R08: 0000000000000286 R09: ffff88000c017d98
R10: ffff88003a563df8 R11: ffff88000c017df8 R12: ffff88003a563dd0
R13: ffffffff8122d704 R14: 0000000000000001 R15: ffff88003e680000
FS:  00007f0c4abb5720(0000) GS:ffff880004600000(0000) knlGS:00000000f77cc850
CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: 00007f0c4a4be770 CR3: 0000000001a43000 CR4: 00000000000006f0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Process gzip (pid: 29167, threadinfo ffff88000c016000, task ffff88001d3c8000)
 0000000000000001 ffff88003a563dd0 ffff88001d3c8000 ffff88001d3c8688
<0> ffff88000c017e58 ffffffff8122d6eb ffff88000c017e78 ffffffff81224861
<0> ffff88000c017e78 ffff88003a563dd0 ffff88000c017ea8 ffffffff81224921
Call Trace:
 [<ffffffff8122d6eb>] cfq_free_io_context+0x15/0x17
 [<ffffffff81224861>] put_io_context+0x41/0x5e
 [<ffffffff81224921>] exit_io_context+0x6c/0x74
 [<ffffffff810551b7>] do_exit+0x7a9/0x7d0
 [<ffffffff8149d437>] ? lockdep_sys_exit_thunk+0x35/0x67
 [<ffffffff81055474>] do_group_exit+0x88/0xb6
 [<ffffffff810554b9>] sys_exit_group+0x17/0x1b
 [<ffffffff81009c72>] system_call_fastpath+0x16/0x1b
Code: ed 00 00 41 59 48 98 5b c9 c3 55 48 89 e5 41 55 41 54 53 48 83 ec 08 0f 1f 44 00 00 48 8b 5f 78 49 89 fc 49 89 f5 48 85 db 74 15 <48> 8b 03 48 8d 73 b0 4c 89 e7 0f 18 08 41 ff d5 48 8b 1b eb e6
RIP  [<ffffffff8122d6b8>] __call_for_each_cic+0x21/0x3f
 RSP <ffff88000c017e28>
---[ end trace 48092b8495a64693 ]---
Fixing recursive fault but reboot is needed!
Comment 38 Chuck Ebbert 2010-09-30 17:35:35 EDT
*** Bug 637068 has been marked as a duplicate of this bug. ***
Comment 39 Chuck Ebbert 2010-09-30 17:36:25 EDT
*** Bug 638677 has been marked as a duplicate of this bug. ***
Comment 40 Michal Jaegermann 2010-09-30 21:51:47 EDT
Is anybody interested in the vmcore produced (see comment 34, and comment 28 for a description of the procedure)?  Please contact me directly if so.  It took considerable effort to produce that one, and due to its size it is not a good candidate for a bugzilla attachment.  The corresponding dmesg output is in attachment 449551 [details].
Comment 41 Jeffrey Moyer 2010-10-04 09:26:25 EDT
I sent email privately;  I am definitely interested in looking at that crash.
Comment 42 John Reiser 2010-11-04 20:26:21 EDT
*** Bug 520620 has been marked as a duplicate of this bug. ***
Comment 43 Joachim Frieben 2011-02-12 10:30:22 EST
I do not see this issue for kernel-2.6.35.11-83.fc14.x86_64 anymore. This bug should be safe to close now.
Comment 44 Jeffrey Moyer 2011-02-14 09:34:28 EST
(In reply to comment #43)
> I do not see this issue for kernel-2.6.35.11-83.fc14.x86_64 anymore. This bug
> should be safe to close now.

I'm glad it's not crashing for you anymore.  I got a similar report from Michal.  I would like to keep this bug open, though, as I believe there is still a race in the code, it's just not tickled for you anymore.
Comment 45 Paul Bolle 2011-04-08 15:39:17 EDT
0) Still occurs on vanilla v2.6.38.2 (otherwise standard Rawhide). Not a Fedora blessed kernel, but should be enough to convince people this bug can still be triggered.

1) From /var/log/messages:

Apr  6 11:46:40 t41 kernel: [  573.045271] BUG: unable to handle kernel paging request at 6b6b6b6b
Apr  6 11:46:40 t41 kernel: [  573.045411] IP: [<c05e10ae>] __call_for_each_cic+0x24/0x3a
Apr  6 11:46:40 t41 kernel: [  573.045522] *pde = 00000000 
Apr  6 11:46:40 t41 kernel: [  573.045581] Oops: 0000 [#1] SMP 
Apr  6 11:46:40 t41 kernel: [  573.045650] last sysfs file: /sys/devices/pci0000:00/0000:00:1f.1/host1/target1:0:0/1:0:0:0/block/sr0/uevent
Apr  6 11:46:40 t41 kernel: [  573.045805] Modules linked in: sco bnep l2cap bluetooth fuse cpufreq_ondemand acpi_cpufreq mperf ip6t_REJECT nf_conntrack_ipv6 nf_defrag_ipv6 ip6table_filter ip6_tables nf_conntrack_netbios_ns snd_intel8x0m arc4 snd_intel8x0 ath5k snd_ac97_codec ath ac97_bus mac80211 snd_seq snd_seq_device snd_pcm ppdev cfg80211 thinkpad_acpi microcode snd_timer parport_pc rfkill e1000 iTCO_wdt snd_page_alloc joydev pcspkr i2c_i801 iTCO_vendor_support snd soundcore parport uinput ipv6 yenta_socket video radeon ttm drm_kms_helper drm i2c_algo_bit i2c_core [last unloaded: scsi_wait_scan]
Apr  6 11:46:40 t41 kernel: [  573.046032] 
Apr  6 11:46:40 t41 kernel: [  573.046032] Pid: 2111, comm: cdrom_id Not tainted 2.6.38.2-0.local0.fc16.i686 #1 IBM        /       
Apr  6 11:46:40 t41 kernel: [  573.046032] EIP: 0060:[<c05e10ae>] EFLAGS: 00010202 CPU: 0
Apr  6 11:46:40 t41 kernel: [  573.046032] EIP is at __call_for_each_cic+0x24/0x3a
Apr  6 11:46:40 t41 kernel: [  573.046032] EAX: 00000001 EBX: 6b6b6b6b ECX: 00000246 EDX: c0a9a6b0
Apr  6 11:46:40 t41 kernel: [  573.046032] ESI: edd3a180 EDI: c05e13c1 EBP: ed793f18 ESP: ed793f0c
Apr  6 11:46:40 t41 kernel: [  573.046032]  DS: 007b ES: 007b FS: 00d8 GS: 00e0 SS: 0068
Apr  6 11:46:40 t41 kernel: [  573.046032] Process cdrom_id (pid: 2111, ti=ed792000 task=edf0bf90 task.ti=ed792000)
Apr  6 11:46:40 t41 kernel: [  573.046032] Stack:
Apr  6 11:46:40 t41 kernel: [  573.046032]  edd3a180 edf0bf90 edf0c368 ed793f20 c05e10d6 ed793f2c c05d691a edd3a180
Apr  6 11:46:40 t41 kernel: [  573.046032]  ed793f40 c05d69aa edf0bf90 00000012 ed793f74 ed793f8c c04405d6 00000085
Apr  6 11:46:40 t41 kernel: [  573.046032]  00000000 00000000 edf0bf90 ed799e28 edf0c19c edf0bf88 edf0c19c 00000001
Apr  6 11:46:40 t41 kernel: [  573.046032] Call Trace:
Apr  6 11:46:40 t41 kernel: [  573.046032]  [<c05e10d6>] cfq_free_io_context+0x12/0x14
Apr  6 11:46:40 t41 kernel: [  573.046032]  [<c05d691a>] put_io_context+0x34/0x5c
Apr  6 11:46:40 t41 kernel: [  573.046032]  [<c05d69aa>] exit_io_context+0x68/0x6d
Apr  6 11:46:40 t41 kernel: [  573.046032]  [<c04405d6>] do_exit+0x609/0x62c
Apr  6 11:46:40 t41 kernel: [  573.046032]  [<c0440810>] do_group_exit+0x63/0x86
Apr  6 11:46:40 t41 kernel: [  573.046032]  [<c044084b>] __wake_up_parent+0x0/0x2b
Apr  6 11:46:40 t41 kernel: [  573.046032]  [<c04030df>] sysenter_do_call+0x12/0x38
Apr  6 11:46:40 t41 kernel: [  573.046032] Code: e8 d8 41 e8 ff 5d c3 55 89 e5 57 56 53 3e 8d 74 26 00 8b 58 4c 89 c6 89 d7 e8 bf 22 e7 ff 85 c0 74 05 e8 bc ff ff ff 85 db 74 11 <8b> 03 0f 18 00 90 8d 53 d8 89 f0 ff d7 8b 1b eb dd 5b 5e 5f 5d 
Apr  6 11:46:40 t41 kernel: [  573.046032] EIP: [<c05e10ae>] __call_for_each_cic+0x24/0x3a SS:ESP 0068:ed793f0c
Apr  6 11:46:40 t41 kernel: [  573.046032] CR2: 000000006b6b6b6b
Apr  6 11:46:40 t41 kernel: [  573.074456] ---[ end trace 7ef6dd0aa5934c5e ]---
Apr  6 11:46:40 t41 kernel: [  573.074468] Fixing recursive fault but reboot is needed!
Comment 46 Paul Bolle 2011-04-08 16:00:22 EDT
0) And this one is from vanilla v2.6.39-rc2. Again, not Fedora blessed, but it might still be interesting:

Apr  8 09:25:34 t41 kernel: [ 1917.490095] BUG: unable to handle kernel paging request at 6b6b6b6b
Apr  8 09:25:34 t41 kernel: [ 1917.490781] IP: [<c05e84c6>] __call_for_each_cic+0x24/0x3a
Apr  8 09:25:34 t41 kernel: [ 1917.491020] *pde = 00000000 
Apr  8 09:25:34 t41 kernel: [ 1917.491020] Oops: 0000 [#1] SMP 
Apr  8 09:25:34 t41 kernel: [ 1917.491020] last sysfs file: /sys/devices/pci0000:00/0000:00:1f.1/host0/target0:0:0/0:0:0:0/block/sda/uevent
Apr  8 09:25:34 t41 kernel: [ 1917.491020] Modules linked in: bnep bluetooth fuse cpufreq_ondemand acpi_cpufreq mperf ip6t_REJECT nf_conntrack_ipv6 nf_defrag_ipv6 ip6table_filter nf_conntrack_netbios_ns nf_conntrack_broadcast ip6_tables snd_intel8x0 snd_intel8x0m arc4 snd_ac97_codec ac97_bus ath5k snd_seq ath mac80211 snd_seq_device ppdev snd_pcm microcode thinkpad_acpi parport_pc cfg80211 parport e1000 snd_timer rfkill snd iTCO_wdt joydev iTCO_vendor_support i2c_i801 soundcore pcspkr snd_page_alloc uinput ipv6 yenta_socket video radeon ttm drm_kms_helper drm i2c_algo_bit i2c_core [last unloaded: scsi_wait_scan]
Apr  8 09:25:34 t41 kernel: [ 1917.491020] 
Apr  8 09:25:34 t41 kernel: [ 1917.491020] Pid: 15233, comm: mandb Not tainted 2.6.39-0.rc2.git0.0.local0.fc16.i686 #1 IBM        /       
Apr  8 09:25:34 t41 kernel: [ 1917.491020] EIP: 0060:[<c05e84c6>] EFLAGS: 00010202 CPU: 0
Apr  8 09:25:34 t41 kernel: [ 1917.491020] EIP is at __call_for_each_cic+0x24/0x3a
Apr  8 09:25:34 t41 kernel: [ 1917.491020] EAX: 00000001 EBX: 6b6b6b6b ECX: 00000246 EDX: c0acaa98
Apr  8 09:25:34 t41 kernel: [ 1917.491020] ESI: eae6fc80 EDI: c05e87d9 EBP: e546bf18 ESP: e546bf0c
Apr  8 09:25:34 t41 kernel: [ 1917.491020]  DS: 007b ES: 007b FS: 00d8 GS: 00e0 SS: 0068
Apr  8 09:25:34 t41 kernel: [ 1917.491020] Process mandb (pid: 15233, ti=e546a000 task=e5ffe9f0 task.ti=e546a000)
Apr  8 09:25:34 t41 kernel: [ 1917.491020] Stack:
Apr  8 09:25:34 t41 kernel: [ 1917.491020]  eae6fc80 e5ffe9f0 e5ffedc0 e546bf20 c05e84ee e546bf2c c05ddbae eae6fc80
Apr  8 09:25:34 t41 kernel: [ 1917.491020]  e546bf40 c05ddc3e e5ffe9f0 00000012 e546bf74 e546bf8c c0440821 eaa97a28
Apr  8 09:25:34 t41 kernel: [ 1917.491020]  e5fb8e08 e546bf70 c04f49ea 00000001 e5ffebf4 e5ffe9e8 e5ffebf4 00000001
Apr  8 09:25:34 t41 kernel: [ 1917.491020] Call Trace:
Apr  8 09:25:34 t41 kernel: [ 1917.491020]  [<c05e84ee>] cfq_free_io_context+0x12/0x14
Apr  8 09:25:34 t41 kernel: [ 1917.491020]  [<c05ddbae>] put_io_context+0x34/0x5c
Apr  8 09:25:34 t41 kernel: [ 1917.491020]  [<c05ddc3e>] exit_io_context+0x68/0x6d
Apr  8 09:25:34 t41 kernel: [ 1917.491020]  [<c0440821>] do_exit+0x634/0x657
Apr  8 09:25:34 t41 kernel: [ 1917.491020]  [<c04f49ea>] ? fsnotify_modify+0x64/0x6c
Apr  8 09:25:34 t41 kernel: [ 1917.491020]  [<c0440a5b>] do_group_exit+0x63/0x86
Apr  8 09:25:34 t41 kernel: [ 1917.491020]  [<c0440a96>] sys_exit_group+0x18/0x18
Apr  8 09:25:34 t41 kernel: [ 1917.491020]  [<c082045f>] sysenter_do_call+0x12/0x38
Apr  8 09:25:34 t41 kernel: [ 1917.491020] Code: e8 5c db e7 ff 5d c3 55 89 e5 57 56 53 3e 8d 74 26 00 8b 58 4c 89 c6 89 d7 e8 df b1 e6 ff 85 c0 74 05 e8 bc ff ff ff 85 db 74 11 <8b> 03 0f 18 00 90 8d 53 d8 89 f0 ff d7 8b 1b eb dd 5b 5e 5f 5d 
Apr  8 09:25:34 t41 kernel: [ 1917.491020] EIP: [<c05e84c6>] __call_for_each_cic+0x24/0x3a SS:ESP 0068:e546bf0c
Apr  8 09:25:34 t41 kernel: [ 1917.491020] CR2: 000000006b6b6b6b
Apr  8 09:25:34 t41 kernel: [ 1917.575977] ---[ end trace 6895ae03d9c0f56d ]---
Apr  8 09:25:34 t41 kernel: [ 1917.575982] Fixing recursive fault but reboot is needed!

1) A quick look at the output of "objdump -l -d block/cfq-iosched.o" suggests this fault is triggered by "prefetch(pos->next)" in the hlist_for_each_entry_rcu macro in include/linux/rculist.h.

[...]block/cfq-iosched.c:2594
    1840:       e8 fc ff ff ff          call   1841 <__call_for_each_cic+0x13>
    1845:       85 c0                   test   %eax,%eax
    1847:       74 05                   je     184e <__call_for_each_cic+0x20>
    1849:       e8 bc ff ff ff          call   180a <rcu_read_lock_held>
    184e:       85 db                   test   %ebx,%ebx
    1850:       74 11                   je     1863 <__call_for_each_cic+0x35>
prefetch():
[...]arch/x86/include/asm/processor.h:833
    1852:       8b 03                   mov    (%ebx),%eax
    1854:       8d 74 26 00             lea    0x0(%esi,%eiz,1),%esi

(This was already hinted at in an earlier comment.)
Comment 47 Chuck Ebbert 2011-04-13 21:44:59 EDT
We could "fix" this by making deadline the default IO scheduler :/  I switched to that on all my systems long ago...
Comment 48 Jeffrey Moyer 2011-04-14 09:22:34 EDT
(In reply to comment #47)
> We could "fix" this by making deadline the default IO scheduler :/  I switched
> to that on all my systems long ago...

I believe that's an entirely different discussion, Chuck.  Deadline has its own set of issues when run on a S-ATA rotational disk (which, I might add, are very common).  I understand the frustration surrounding this issue, and apologize that it hasn't been addressed in a timely fashion.
Comment 49 Chuck Ebbert 2011-04-17 23:02:07 EDT
(In reply to comment #48)
> (In reply to comment #47)
> > We could "fix" this by making deadline the default IO scheduler :/  I switched
> > to that on all my systems long ago...
> 
> I believe that's an entirely different discussion, Chuck.  Deadline has its own
> set of issues when run on a S-ATA rotational disk (which, I might add, are very
> common). 

What issues? I'm certainly not getting any oopses by running with deadline.
Comment 50 Jeffrey Moyer 2011-04-18 09:12:54 EDT
Background writeout can completely starve readers with deadline on SATA disks.  For example, if you start a dd process writing to a file, then startup firefox or some other application that is not in cache, firefox may not start until the dd has completed.

For those people experiencing this oops, switching to deadline is a good workaround until it is fixed.  I don't think that many people are hitting this, though.
Comment 51 Paul Bolle 2011-04-18 10:40:45 EDT
(In reply to comment #50)
> I don't think that many people are hitting this,
> though.

And, as far as I know, no one has been able to reliably reproduce this. (I tried a few things, but without success. Ie, without even once seeing this oops when I actually wanted it to occur.)
Comment 52 Michal Jaegermann 2011-04-18 11:00:34 EDT
(In reply to comment #51)
> And, as far as I know, no one has been able to reliably reproduce this.

There was a long series of kernels last year where I was hitting that quite regularly. Jeffrey Moyer got a vmcore file from me, produced in one of those incidents. Still, at some moment it stopped and I have not seen a single instance since. No idea why.
Comment 53 Vivek Goyal 2011-04-18 11:04:33 EDT
(In reply to comment #46)
> 
> 1) A quick look at the output of "objdump -l -d block/cfq-iosched.o" suggests
> this fault is triggered by "prefetch(pos->next)" in the
> hlist_for_each_entry_rcu macro in include/linux/rculist.h
> 
> [...]block/cfq-iosched.c:2594
>     1840:       e8 fc ff ff ff          call   1841 <__call_for_each_cic+0x13>
>     1845:       85 c0                   test   %eax,%eax
>     1847:       74 05                   je     184e <__call_for_each_cic+0x20>
>     1849:       e8 bc ff ff ff          call   180a <rcu_read_lock_held>
>     184e:       85 db                   test   %ebx,%ebx
>     1850:       74 11                   je     1863 <__call_for_each_cic+0x35>
> prefetch():
> [...]arch/x86/include/asm/processor.h:833
>     1852:       8b 03                   mov    (%ebx),%eax
>     1854:       8d 74 26 00             lea    0x0(%esi,%eiz,1),%esi
> 
> (This was already hinted at in an earlier comment.)
I am a little confused between the two. IIUC, the previous comment hinted that pos->next is a valid pointer but the object pos->next points to has been freed, while this comment seems to suggest that pos->next itself is 6b6b6b.., i.e. we are already inside a freed object.

Anyway, so the bottom line seems to be that ioc->cic_list has a cic which 
has already been freed.

I have looked at the code closely and can't find anything wrong yet. There seem
to be two paths/places where cic's are removed from ioc->cic_list.

static void cic_free_func(struct io_context *ioc, struct cfq_io_context *cic)
{
        unsigned long flags;
        unsigned long dead_key = (unsigned long) cic->key;

        BUG_ON(!(dead_key & CIC_DEAD_KEY));

        spin_lock_irqsave(&ioc->lock, flags);
        radix_tree_delete(&ioc->radix_root, dead_key >> CIC_DEAD_INDEX_SHIFT);
        hlist_del_rcu(&cic->cic_list);
        spin_unlock_irqrestore(&ioc->lock, flags);

        cfq_cic_free(cic);
}

static void
cfq_drop_dead_cic(struct cfq_data *cfqd, struct io_context *ioc,
                  struct cfq_io_context *cic)
{
        unsigned long flags;

        WARN_ON(!list_empty(&cic->queue_list));
        BUG_ON(cic->key != cfqd_dead_key(cfqd));

        spin_lock_irqsave(&ioc->lock, flags);

        BUG_ON(ioc->ioc_data == cic);

        radix_tree_delete(&ioc->radix_root, cfqd->cic_index);
        hlist_del_rcu(&cic->cic_list);
        spin_unlock_irqrestore(&ioc->lock, flags);

        cfq_cic_free(cic);
}

In both the places we take ioc->lock to remove the cic from list and then
call cfq_cic_free() which frees up a cic only after rcu time has elapsed.

static void cfq_cic_free(struct cfq_io_context *cic)
{
        call_rcu(&cic->rcu_head, cfq_cic_free_rcu);
}


Given the above, I am still wondering how we can reach a situation where we get to
a cic which has been freed but is still on the ioc->cic_list.

Also a question: does anybody know how this poisoning scheme works? I.e., are
these poison bytes of 6b6b6b.. written over the whole slab object whenever an
object is freed?
Comment 54 Jeffrey Moyer 2011-04-18 11:15:30 EDT
When I was working on it, my guess was that we were getting into the drop_dead_cic case, and that this was racy (no clue why we keep the cic on the rbtree at all... I think it would be safe to remove it from there).  I couldn't prove it, though.
Comment 55 Paul Bolle 2011-04-18 11:36:30 EDT
(In reply to comment #53)
0) One annoyance with the code in question is that there are three members called cic_list which are all (slightly) different. I found that rather confusing ...
 
> Given above, I am still wondering how can we reach a situation where we get to
> a cic which has been freed but still on the ioc->cic_list.

1) This is all RCU stuff, which I'm rather clueless about.

Anyhow, my hunch is that the code here runs with readers (code just walking the hlist here) and writers (code manipulating that same list) working with different copies of that list at the same time (which is what RCU seems to be all about). Normally it shouldn't matter if a reader has an outdated view of an hlist. But here it apparently does matter to some readers (and so maybe an hlist doesn't really fit the job here). But my hunch didn't help me while looking at the code until, well, I just gave up ...

> Also a question. Does anybody know how this poisoning scheme works. I mean
> these poison bytes of 6b6b6b.. are written on whole slab object whenever a
> object is freed?

2) I'm afraid I don't.
Comment 56 Jeff Layton 2011-04-18 11:47:43 EDT
(In reply to comment #53)
> 
> Also a question. Does anybody know how this poisoning scheme works. I mean
> these poison bytes of 6b6b6b.. are written on whole slab object whenever a
> object is freed?
>

Correct:

#define POISON_FREE     0x6b    /* for use-after-free poisoning */

...mainly this happens in cache_free_debugcheck for slab allocator. I'm a little less familiar with slub though, but I'm pretty sure it uses the same preprocessor constant.
Comment 57 Vivek Goyal 2011-04-18 14:10:29 EDT
(In reply to comment #55)
> (In reply to comment #53)
> 0) One annoyance with the code in question is that there are three members
> called cic_list which are all (slightly) different. I found that rather
> confusing ...
> 
> > Given above, I am still wondering how can we reach a situation where we get to
> > a cic which has been freed but still on the ioc->cic_list.
> 
> 1) This is all RCU stuff, which I'm rather clueless about.
> 
> Anyhow, my hunch is that the code here runs with readers (code just walking the
> hlist here) and writers (code manipulating that same list) working with
> different copies of that list at the same time (which is what RCU seems to be
> all about). Normally it shouldn't matter if a reader has an outdated view of an
> hlist. But here it apparently does matter to some readers (and so maybe an
> hlist doesn't really fit the job here). But my hunch didn't help me while
> looking at the code until, well, I just gave up ...

I have checked that readers always run with the rcu read lock held. That will make sure that even if a writer has removed a cic from ioc->cic_list,
it will not free the object until we drop the rcu read lock (unless rcu has stopped working :-))

- Because we are doing call_rcu() before freeing the cic, we should never run into a situation where a reader gets a reference to a freed object, as long as we have taken the read lock. (That's how I understand it.)

Now the tricky bit is why we are running into a freed object despite the fact that we are holding the rcu read lock in the cfq_free_io_context() path.
Comment 58 Vivek Goyal 2011-04-18 14:17:21 EDT
(In reply to comment #54)
> When I was working on it, my guess was that we were getting into the
> drop_dead_cic case, and that this was racy (no clue why we keep the cic on the
> rbtree at all... I think it would be safe to remove it from there).  I couldn't
> prove it, though.

Jeff,

I guess there are two reasons.

- ioc->cic_list and the radix tree are protected by ioc->ioc_lock. Maybe we are trying to avoid taking that lock when the process is exiting, and just mark the cic dead.

- If we drop it from the rb-tree then we will lose track of the old/dead cic which needs to be dropped. By keeping it in the tree, it pops up in our search and we can drop it, hence no need to look into a separate list.

Do you recall why you think drop_dead_cic is racy? I do see that we are taking ioc->lock before we modify the list and rb-tree, and freeing of the object is done under call_rcu(). So that sounds good at least on the surface.
Comment 59 Vivek Goyal 2011-04-18 14:19:20 EDT
Paul, is this problem reproducible consistently for you? One of the big stumbling blocks here is that we can't reproduce it consistently to dive deeper.

Is there any specific config option enabled, like all the preemption stuff, in your config?
Comment 60 Jeffrey Moyer 2011-04-18 14:40:43 EDT
(In reply to comment #58)
> (In reply to comment #54)
> > When I was working on it, my guess was that we were getting into the
> > drop_dead_cic case, and that this was racy (no clue why we keep the cic on the
> > rbtree at all... I think it would be safe to remove it from there).  I couldn't
> > prove it, though.
> 
> Jeff,
> 
> I guess there are two reasons.
> 
> - ioc->cic_list and radix tree is protected by ioc->ioc_lock. May be we are
> trying to avoid taking that lock when process is exiting and just mark cic
> dead.

Could be.  I'd have to check to see if taking the lock could cause problems.

> - If we drop it from rb-tree then we will lose track of old/dead cic which
> needs to be dropped. By keeping it in tree, it pops up in our search and we can
> drop it hence no need to look into separate list.

It should be removed anyway when the grace period expires, no?  My memory is pretty fuzzy on this, and I don't have time to dig back in right now, sorry.

> Do you recall details that why do you think drop_dead_cic is racy. I do see
> that we are taking ioc->lock before we modify list and rb-tree and freeing of
> object is done under call_rcu(). So that sounds good atleast on the surface.

Fuzzy memory.  As you know, I was also trying to perform io_context tracking throughout the I/O stack, so some of my memories may be from kernels with my patches applied.
Comment 61 Paul Bolle 2011-04-18 14:43:49 EDT
(In reply to comment #59)
> Paul, is this problem reproducible consistently for you. One of the big
> stumbling blocks here is that we can't reproduce it consistently to dive
> deeper.

No, see comment #51.

(The context of the oopses in comment #45 and comment #46 - which isn't clear from the comments themselves - could be said to be something like "a period with a lot of (relatively small?) IO". I seem to remember that comment #46 was triggered by installing a new kernel rpm. But, again, I haven't been able to reproduce this oops.)
Comment 62 Paul Bolle 2011-04-22 16:28:29 EDT
For what it's worth, shutting down (not Fedora distributed) vanilla v2.6.39-rc3 just gave me this:

Apr 22 21:57:21 t41 kernel: [253346.725182] BUG: unable to handle kernel paging request at 6b6b6b6b
Apr 22 21:57:21 t41 kernel: [253346.726034] IP: [<c05e850a>] __call_for_each_cic+0x24/0x3a
Apr 22 21:57:21 t41 kernel: [253346.726034] *pde = 00000000 
Apr 22 21:57:21 t41 kernel: [253346.726034] Oops: 0000 [#1] SMP 
Apr 22 21:57:21 t41 kernel: [253346.726034] last sysfs file: /sys/devices/LNXSYSTM:00/device:00/PNP0A03:00/device:01/PNP0C09:00/PNP0C0A:00/power_supply/BAT0/energy_full
Apr 22 21:57:21 t41 kernel: [253346.726034] Modules linked in: bnep bluetooth fuse cpufreq_ondemand acpi_cpufreq mperf ip6t_REJECT nf_conntrack_ipv6 nf_defrag_ipv6 ip6table_filter nf_conntrack_netbios_ns nf_conntrack_broadcast ip6_tables snd_intel8x0 snd_intel8x0m snd_ac97_codec ac97_bus snd_seq arc4 snd_seq_device ath5k ath mac80211 ppdev snd_pcm microcode thinkpad_acpi cfg80211 parport_pc snd_timer snd e1000 parport rfkill iTCO_wdt soundcore iTCO_vendor_support snd_page_alloc pcspkr joydev i2c_i801 uinput ipv6 yenta_socket video radeon ttm drm_kms_helper drm i2c_algo_bit i2c_core [last unloaded: scsi_wait_scan]
Apr 22 21:57:21 t41 kernel: [253346.726034] 
Apr 22 21:57:21 t41 kernel: [253346.726034] Pid: 1405, comm: krb5-auth-dialo Tainted: G        W   2.6.39-0.rc3.git0.0.local0.fc16.i686 #1 IBM        /
Apr 22 21:57:21 t41 kernel: [253346.726034] EIP: 0060:[<c05e850a>] EFLAGS: 00010202 CPU: 0
Apr 22 21:57:21 t41 kernel: [253346.726034] EIP is at __call_for_each_cic+0x24/0x3a
Apr 22 21:57:21 t41 kernel: [253346.726034] EAX: 00000001 EBX: 6b6b6b6b ECX: 00000246 EDX: c0acaa98
Apr 22 21:57:21 t41 kernel: [253346.726034] ESI: c802f200 EDI: c05e881d EBP: eef91dec ESP: eef91de0
Apr 22 21:57:21 t41 kernel: [253346.726034]  DS: 007b ES: 007b FS: 00d8 GS: 00e0 SS: 0068
Apr 22 21:57:21 t41 kernel: [253346.726034] Process krb5-auth-dialo (pid: 1405, ti=eef90000 task=ee043f90 task.ti=eef90000)
Apr 22 21:57:21 t41 kernel: [253346.726034] Stack:
Apr 22 21:57:21 t41 kernel: [253346.726034]  c802f200 ee043f90 ee044360 eef91df4 c05e8532 eef91e00 c05ddbf2 c802f200
Apr 22 21:57:21 t41 kernel: [253346.726034]  eef91e14 c05ddc82 ee043f90 00000000 eef91e48 eef91e60 c04407dd eef91e24
Apr 22 21:57:21 t41 kernel: [253346.726034]  c0465bac eef91e38 c045af59 c0bb46b8 ee044194 ee043f88 ee044194 00000001
Apr 22 21:57:21 t41 kernel: [253346.726034] Call Trace:
Apr 22 21:57:21 t41 kernel: [253346.726034]  [<c05e8532>] cfq_free_io_context+0x12/0x14
Apr 22 21:57:21 t41 kernel: [253346.726034]  [<c05ddbf2>] put_io_context+0x34/0x5c
Apr 22 21:57:21 t41 kernel: [253346.726034]  [<c05ddc82>] exit_io_context+0x68/0x6d
Apr 22 21:57:21 t41 kernel: [253346.726034]  [<c04407dd>] do_exit+0x634/0x657
Apr 22 21:57:21 t41 kernel: [253346.726034]  [<c0465bac>] ? trace_hardirqs_off+0xb/0xd
Apr 22 21:57:21 t41 kernel: [253346.726034]  [<c045af59>] ? local_clock+0x32/0x49
Apr 22 21:57:21 t41 kernel: [253346.726034]  [<c0440a17>] do_group_exit+0x63/0x86
Apr 22 21:57:21 t41 kernel: [253346.726034]  [<c044c600>] get_signal_to_deliver+0x3a9/0x3c0
Apr 22 21:57:21 t41 kernel: [253346.726034]  [<c0402615>] do_signal+0x5b/0x4f0
Apr 22 21:57:21 t41 kernel: [253346.726034]  [<c0466087>] ? lock_release_holdtime.part.8+0x5d/0x63
Apr 22 21:57:21 t41 kernel: [253346.726034]  [<c046889e>] ? lock_release+0x166/0x16d
Apr 22 21:57:21 t41 kernel: [253346.726034]  [<c081a503>] ? _raw_spin_unlock+0x22/0x25
Apr 22 21:57:21 t41 kernel: [253346.726034]  [<c05008ac>] ? set_close_on_exec+0xa3/0xaa
Apr 22 21:57:21 t41 kernel: [253346.726034]  [<c04fce1f>] ? path_put+0x1a/0x1d
Apr 22 21:57:21 t41 kernel: [253346.726034]  [<c0402d3f>] do_notify_resume+0x26/0x6f
Apr 22 21:57:21 t41 kernel: [253346.726034]  [<c081a868>] work_notifysig+0x13/0x1b
Apr 22 21:57:21 t41 kernel: [253346.726034] Code: e8 d0 da e7 ff 5d c3 55 89 e5 57 56 53 3e 8d 74 26 00 8b 58 4c 89 c6 89 d7 e8 53 b1 e6 ff 85 c0 74 05 e8 bc ff ff ff 85 db 74 11 <8b> 03 0f 18 00 90 8d 53 d8 89 f0 ff d7 8b 1b eb dd 5b 5e 5f 5d 
Apr 22 21:57:21 t41 kernel: [253346.726034] EIP: [<c05e850a>] __call_for_each_cic+0x24/0x3a SS:ESP 0068:eef91de0
Apr 22 21:57:21 t41 kernel: [253346.726034] CR2: 000000006b6b6b6b

Just slightly different from the Oopses I previously submitted.
Comment 63 Chuck Ebbert 2011-04-23 02:18:48 EDT
__call_for_each_cic() has been removed by this commit in 2.6.39:

http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commitdiff;h=5f45c69589b7d2953584e6cd0b31e35dbe960ad0
Comment 64 Paul Bolle 2011-04-29 15:51:01 EDT
(In reply to comment #63)
> __call_for_each_cic() has been removed by this commit in 2.6.39:
> 
> http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commitdiff;h=5f45c69589b7d2953584e6cd0b31e35dbe960ad0

0) This Oops (generated with non-Fedora blessed vanilla 2.6.39-rc5) suggests that commit (which is included in 2.6.39-rc5, if my memory is correct) was not enough to beat mandb into submission:

BUG: unable to handle kernel paging request at 6b6b6b6b
IP: [<c05e859d>] call_for_each_cic+0x29/0x44
*pde = 00000000 
Oops: 0000 [#1] SMP 
last sysfs file: /sys/devices/LNXSYSTM:00/device:00/PNP0A03:00/device:01/PNP0C09:00/PNP0C0A:00/power_supply/BAT0/energy_full
Modules linked in: bnep bluetooth fuse cpufreq_ondemand acpi_cpufreq mperf ip6t_REJECT nf_conntrack_ipv6 nf_defrag_ipv6 nf_conntrack_netbios_ns ip6table_filter nf_conntrack_broadcast ip6_tables snd_intel8x0m snd_intel8x0 arc4 snd_ac97_codec ath5k ac97_bus snd_seq ath snd_seq_device mac80211 snd_pcm ppdev thinkpad_acpi cfg80211 microcode snd_timer parport_pc e1000 snd rfkill parport iTCO_wdt soundcore snd_page_alloc iTCO_vendor_support joydev i2c_i801 pcspkr uinput ipv6 yenta_socket video radeon ttm drm_kms_helper drm i2c_algo_bit i2c_core [last unloaded: scsi_wait_scan]

Pid: 11397, comm: mandb Not tainted 2.6.39-0.rc5.git0.0.local.fc16.i686 #1 IBM        /       
EIP: 0060:[<c05e859d>] EFLAGS: 00010202 CPU: 0
EIP is at call_for_each_cic+0x29/0x44
EAX: 00000001 EBX: 6b6b6b6b ECX: 00000246 EDX: c0ad4a98
ESI: ecfd0980 EDI: c05e88a4 EBP: c670bf18 ESP: c670bf0c
 DS: 007b ES: 007b FS: 00d8 GS: 00e0 SS: 0068
Process mandb (pid: 11397, ti=c670a000 task=d40769f0 task.ti=c670a000)
Stack:
 ecfd0980 d40769f0 d4076dc0 c670bf20 c05e85ca c670bf2c c05ddc86 ecfd0980
 c670bf40 c05ddd16 d40769f0 00000012 c670bf74 c670bf8c c0440813 f2a182e8
 d401ac08 c670bf70 c04f48da 00000001 d4076bf4 d40769e8 d4076bf4 00000001
Call Trace:
 [<c05e85ca>] cfq_free_io_context+0x12/0x14
 [<c05ddc86>] put_io_context+0x34/0x5c
 [<c05ddd16>] exit_io_context+0x68/0x6d
 [<c0440813>] do_exit+0x63e/0x661
 [<c04f48da>] ? fsnotify_modify+0x64/0x6c
 [<c0440a4d>] do_group_exit+0x63/0x86
 [<c0440a88>] sys_exit_group+0x18/0x18
 [<c082551f>] sysenter_do_call+0x12/0x38
Code: 5d c3 55 89 e5 57 56 53 3e 8d 74 26 00 89 c6 89 d7 e8 3a ec ff ff 8b 5e 4c e8 00 b1 e6 ff 85 c0 74 05 e8 b7 ff ff ff 85 db 74 11 <8b> 03 0f 18 00 90 8d 53 d8 89 f0 ff d7 8b 1b eb dd e8 49 ec ff 
EIP: [<c05e859d>] call_for_each_cic+0x29/0x44 SS:ESP 0068:c670bf0c
CR2: 000000006b6b6b6b
---[ end trace 70dea08ef41fda48 ]---
Fixing recursive fault but reboot is needed!
BUG: unable to handle kernel paging request at 6b6b6b6b
IP: [<c05e859d>] call_for_each_cic+0x29/0x44
*pde = 00000000 
Oops: 0000 [#2] SMP 
last sysfs file: /sys/devices/LNXSYSTM:00/device:00/PNP0A03:00/device:01/PNP0C09:00/PNP0C0A:00/power_supply/BAT0/energy_full
Modules linked in: bnep bluetooth fuse cpufreq_ondemand acpi_cpufreq mperf ip6t_REJECT nf_conntrack_ipv6 nf_defrag_ipv6 nf_conntrack_netbios_ns ip6table_filter nf_conntrack_broadcast ip6_tables snd_intel8x0m snd_intel8x0 arc4 snd_ac97_codec ath5k ac97_bus snd_seq ath snd_seq_device mac80211 snd_pcm ppdev thinkpad_acpi cfg80211 microcode snd_timer parport_pc e1000 snd rfkill parport iTCO_wdt soundcore snd_page_alloc iTCO_vendor_support joydev i2c_i801 pcspkr uinput ipv6 yenta_socket video radeon ttm drm_kms_helper drm i2c_algo_bit i2c_core [last unloaded: scsi_wait_scan]

Pid: 8732, comm: mandb Tainted: G      D     2.6.39-0.rc5.git0.0.local.fc16.i686 #1 IBM        /       
EIP: 0060:[<c05e859d>] EFLAGS: 00010202 CPU: 0
EIP is at call_for_each_cic+0x29/0x44
EAX: 00000000 EBX: 6b6b6b6b ECX: 00000286 EDX: 025d2b76
ESI: ecfd0700 EDI: c05e88a4 EBP: d88a1f18 ESP: d88a1f0c
 DS: 007b ES: 007b FS: 00d8 GS: 00e0 SS: 0068
Process mandb (pid: 8732, ti=d88a0000 task=f2df69f0 task.ti=d88a0000)
Stack:
 ecfd0700 f2df69f0 f2df6dc0 d88a1f20 c05e85ca d88a1f2c c05ddc86 ecfd0700
 d88a1f40 c05ddd16 f2df69f0 00000012 d88a1f74 d88a1f8c c0440813 ef3c6000
 cb778508 d88a1f70 c04f48da 00000001 f2df6bf4 f2df69e8 f2df6bf4 00000001
Call Trace:
 [<c05e85ca>] cfq_free_io_context+0x12/0x14
 [<c05ddc86>] put_io_context+0x34/0x5c
 [<c05ddd16>] exit_io_context+0x68/0x6d
 [<c0440813>] do_exit+0x63e/0x661
 [<c04f48da>] ? fsnotify_modify+0x64/0x6c
 [<c04fcd68>] ? path_put+0x1a/0x1d
 [<c0440a4d>] do_group_exit+0x63/0x86
 [<c0440a88>] sys_exit_group+0x18/0x18
 [<c082551f>] sysenter_do_call+0x12/0x38
Code: 5d c3 55 89 e5 57 56 53 3e 8d 74 26 00 89 c6 89 d7 e8 3a ec ff ff 8b 5e 4c e8 00 b1 e6 ff 85 c0 74 05 e8 b7 ff ff ff 85 db 74 11 <8b> 03 0f 18 00 90 8d 53 d8 89 f0 ff d7 8b 1b eb dd e8 49 ec ff 
EIP: [<c05e859d>] call_for_each_cic+0x29/0x44 SS:ESP 0068:d88a1f0c
CR2: 000000006b6b6b6b
---[ end trace 70dea08ef41fda49 ]---
Fixing recursive fault but reboot is needed!
BUG: unable to handle kernel paging request at 6b6b6b6b
IP: [<c05e859d>] call_for_each_cic+0x29/0x44
*pde = 00000000 
Oops: 0000 [#3] SMP 
last sysfs file: /sys/devices/LNXSYSTM:00/device:00/PNP0A03:00/device:01/PNP0C09:00/PNP0C0A:00/power_supply/BAT0/energy_full
Modules linked in: bnep bluetooth fuse cpufreq_ondemand acpi_cpufreq mperf ip6t_REJECT nf_conntrack_ipv6 nf_defrag_ipv6 nf_conntrack_netbios_ns ip6table_filter nf_conntrack_broadcast ip6_tables snd_intel8x0m snd_intel8x0 arc4 snd_ac97_codec ath5k ac97_bus snd_seq ath snd_seq_device mac80211 snd_pcm ppdev thinkpad_acpi cfg80211 microcode snd_timer parport_pc e1000 snd rfkill parport iTCO_wdt soundcore snd_page_alloc iTCO_vendor_support joydev i2c_i801 pcspkr uinput ipv6 yenta_socket video radeon ttm drm_kms_helper drm i2c_algo_bit i2c_core [last unloaded: scsi_wait_scan]

Pid: 14230, comm: mandb Tainted: G      D     2.6.39-0.rc5.git0.0.local.fc16.i686 #1 IBM        /       
EIP: 0060:[<c05e859d>] EFLAGS: 00010202 CPU: 0
EIP is at call_for_each_cic+0x29/0x44
EAX: 00000000 EBX: 6b6b6b6b ECX: 00000286 EDX: 025dbc75
ESI: ecfd0b80 EDI: c05e88a4 EBP: da0bbf18 ESP: da0bbf0c
 DS: 007b ES: 007b FS: 00d8 GS: 00e0 SS: 0068
Process mandb (pid: 14230, ti=da0ba000 task=ebb30000 task.ti=da0ba000)
Stack:
 ecfd0b80 ebb30000 ebb303d0 da0bbf20 c05e85ca da0bbf2c c05ddc86 ecfd0b80
 da0bbf40 c05ddd16 ebb30000 00000012 da0bbf74 da0bbf8c c0440813 ef3c68b8
 ca4eef08 da0bbf70 c04f48da 00000001 ebb30204 ebb2fff8 ebb30204 00000001
Call Trace:
 [<c05e85ca>] cfq_free_io_context+0x12/0x14
 [<c05ddc86>] put_io_context+0x34/0x5c
 [<c05ddd16>] exit_io_context+0x68/0x6d
 [<c0440813>] do_exit+0x63e/0x661
 [<c04f48da>] ? fsnotify_modify+0x64/0x6c
 [<c04fcd68>] ? path_put+0x1a/0x1d
 [<c0440a4d>] do_group_exit+0x63/0x86
 [<c0440a88>] sys_exit_group+0x18/0x18
 [<c082551f>] sysenter_do_call+0x12/0x38
Code: 5d c3 55 89 e5 57 56 53 3e 8d 74 26 00 89 c6 89 d7 e8 3a ec ff ff 8b 5e 4c e8 00 b1 e6 ff 85 c0 74 05 e8 b7 ff ff ff 85 db 74 11 <8b> 03 0f 18 00 90 8d 53 d8 89 f0 ff d7 8b 1b eb dd e8 49 ec ff 
EIP: [<c05e859d>] call_for_each_cic+0x29/0x44 SS:ESP 0068:da0bbf0c
CR2: 000000006b6b6b6b
---[ end trace 70dea08ef41fda4a ]---
Fixing recursive fault but reboot is needed!
Comment 65 Paul Bolle 2011-05-01 06:42:00 EDT
0) I just hit (basically) the Oops of comment #64 again, once again triggered by mandb.

1) I won't paste the backtrace, that should be noise by now. But we might be able to make this reproducible, as the events seem to be:
- install a few dozen packages that have been updated for Rawhide;
- reboot, because something low level has been updated so one might as well just reboot;
- wait until (I guess) cron fires off mandb (after, maybe, 15 minutes).

2) Would there be a way to force mandb to do what it does - including its side effects, i.e. generate troublesome disk activity somehow - in order to trigger this manually?
Comment 66 John Reiser 2011-05-01 11:04:32 EDT
(In reply to comment #65)
> ... the events seem to be:
> - install a few dozen packages that have been updated for Rawhide;
> - reboot, because something low level has been updated so one might as wel just
> reboot;
> - wait until (I guess) cron fires of mandb (after, maybe, 15 minutes).

Please post the list of packages that were updated.  On my system the list appears in /var/log/messages:
-----
Apr 30 11:12:23 f14-64 yum[3593]: Updated: firefox-3.6.17-1.fc14.x86_64
-----

If it is cron (or anacron ?) that runs mandb, then the mechanism is by invoking a shell script.  So find that shell script (look soon after reboot), and post the script.
Comment 67 Vivek Goyal 2011-05-02 10:15:04 EDT
(In reply to comment #65)
> 0) I just hit (basically) the Oops of comment #64 again, once again triggered
> by mandb.
> 
> 1) I won't paste the backtrace, that should be noise by now. But we might be
> able to make this reproducible, as the events seem to be:
> - install a few dozen packages that have been updated for Rawhide;
> - reboot, because something low level has been updated so one might as wel just
> reboot;
> - wait until (I guess) cron fires of mandb (after, maybe, 15 minutes).
> 
> 2) Would there be a way to force mandb to do what it does - including its side
> effects, ie generate troublesome disk activity somehow - in order to trigger
> this manually?

Bolle, is it possible to enable kdump on this box and capture the vmcore whenever this happens again?

Also, can you please keep kernel debug info compiled in (CONFIG_DEBUG_INFO=y), so that I can open the vmcore using "crash" and try to get some more information out?
Comment 68 Chuck Ebbert 2011-05-02 21:13:29 EDT
(In reply to comment #64)
> (In reply to comment #63)
> > __call_for_each_cic() has been removed by this commit in 2.6.39:
> > 
> > http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commitdiff;h=5f45c69589b7d2953584e6cd0b31e35dbe960ad0
> 
> 0) This Oops (generated with non-Fedroa blessed vanilla 2.6.39-rc5) suggests
> that commit (which is included in 2.6.39-rc5, if my memory is correct) was not
> enough to beat mandb into submission:
> 

Confirmed, that commit is in -rc5
Comment 69 Vivek Goyal 2011-05-03 09:01:19 EDT
(In reply to comment #68)
> (In reply to comment #64)
> > (In reply to comment #63)
> > > __call_for_each_cic() has been removed by this commit in 2.6.39:
> > > 
> > > http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commitdiff;h=5f45c69589b7d2953584e6cd0b31e35dbe960ad0
> > 
> > 0) This Oops (generated with non-Fedroa blessed vanilla 2.6.39-rc5) suggests
> > that commit (which is included in 2.6.39-rc5, if my memory is correct) was not
> > enough to beat mandb into submission:
> > 
> 
> Confirmed, that commit is in -rc5

That commit was meant to fix a different race and is not going to help with this problem.
Comment 70 Paul Bolle 2011-05-07 14:38:20 EDT
(In reply to comment #66)

The trick turned out to be to (basically) run
    /etc/cron.daily/man-db.cron

with
    OPTS="-c"

(See also /etc/sysconfig/man-db.)

Running that command triggered this Oops (within a few minutes). So, finally, it looks like I can trigger this manually. Progress, of sorts ...
Comment 71 Paul Bolle 2011-05-07 14:45:14 EDT
(In reply to comment #67) 
> Bolle,

0) Please feel free to use my first name (Paul). But in Bugzilla it isn't even impolite to not use the other commenter's name at all.

> is it possible to enable kdump on this box and capture the vmcore
> whenever this happens again.
> 
> Also can you please keep the kernel debug info compiled (CONFIG_DEBUG_INFO=y)
> so that I can open the vmcore using "crash" and try to get some more
> information out.

1) I'm about to recompile (vanilla) v2.6.39-rc6, using a (not too old) Rawhide i686 .config. Does it have all the needed CONFIG_*s?

2) Is Documentation/kdump/kdump.txt the best introduction to this (rather intimidating) stuff?
Comment 72 Paul Bolle 2011-05-08 15:40:20 EDT
(In reply to comment #71)
> 2) Is Documentation/kdump/kdump.txt the best introduction to this (rather
> intimidating) stuff?

OK, I'm in danger of ending up shaving the yak: kexec-tools is installed, kdump is apparently happy loading a "capture kernel", and the "system kernel" is set to panic on oops, but on a panic it somehow refuses to boot into the "capture kernel" and simply seems to hang... What should I try next to solve that issue?
Comment 73 Paul Bolle 2011-05-09 08:49:51 EDT
(In reply to comment #72)
> [...]
> simply (seems to) hang(s)... What should I try next to solve that issue?

Attaching a serial console taught me that the "capture kernel" actually is being booted, and also that the last thing this "capture kernel" prints is
> Trying to unpack rootfs image as initramfs...

Any suggestions what I might be doing wrong?
Comment 74 Vivek Goyal 2011-05-09 09:20:36 EDT
(In reply to comment #73)
> (In reply to comment #72)
> > [...]
> > simply (seems to) hang(s)... What should I try next to solve that issue?
> 
> Attaching a serial console learned me that the the "capture kernel" actually is
> being booted and also learned me that last thing this "capture kernel" prints
> is
> > Trying to unpack rootfs image as initramfs...
> 
> Any suggestions what I might be doing wrong?

How much memory are you reserving? Try increasing the amount of memory reserved. Also, you need not wait for it to crash: first you can manually crash the system using "echo c > /proc/sysrq-trigger".
Comment 75 Michal Jaegermann 2011-05-09 09:39:48 EDT
(In reply to comment #72)

> ... but somehow refuses to boot into the "capture kernel" and
> simply (seems to) hang(s)... What should I try next to solve that issue?

It is not clear from your description, but you are possibly running into
bug 635893 and/or bug 654431.  A workaround was to boot with a regular kernel,
by hook or by crook do the job of /sbin/mkdumprd for a kdump kernel, and reboot
into the desired kernel.  In the absence of the needed initrd, /sbin/mkdumprd
runs during the boot sequence and gets stuck.
Comment 76 Paul Bolle 2011-05-10 04:55:55 EDT
(In reply to comment #74)
> How much memory are you reserving? Try increasing the amount of memory
> reserved. Also you need not to wait for it to crash.

0) I was reserving (the suggested amount of) 64M.

> First you can manually
> crash the system using "echo c > /proc/sysrq-trigger".

1) You're not just interested in the vmcore at the moment of the Oops (upgraded to a panic, so kexec kicks in)?

(In reply to comment #75)
> In an absence of a needed initrd /sbin/mkdumprd runs
> in a boot sequence and this gets stuck.

2) This doesn't seem to be the issue. At boot of the "system kernel", /etc/rc.d/init.d/kdump did generate an initrd for the "capture kernel".

3) My trouble apparently was that I was using the initramfs for the "system kernel" while I should have been using the initramfs (called initrd) generated by kdump for the "capture kernel". However, using that initramfs doesn't help much, since kdump doesn't generate an initramfs compatible with my setup (i.e., root on ext3 on LVM2). So the kexec reboot now stopped at the BusyBox prompt of a session without a rootfs and without the means, as far as I could discover, to mount a rootfs from an LVM2 logical volume.

4) I filed bug #703369 for that last issue, because it seems if I had an initramfs.conf kdump would invoke dracut, and dracut should be able to generate an initramfs compatible with my setup (which essentially is a Fedora default setup, isn't it?).
Comment 77 Vivek Goyal 2011-05-10 08:16:35 EDT
(In reply to comment #76)
> (In reply to comment #74)
> > How much memory are you reserving? Try increasing the amount of memory
> > reserved. Also you need not to wait for it to crash.
> 
> 0) I was reserving (the suggested amount of) 64M.

Depending on the kernel and initramfs size, it can take much more than that. I would suggest reserving 256M to be safe.

> 
> > First you can manually
> > crash the system using "echo c > /proc/sysrq-trigger".
> 
> 1) You're not just interested in the vmcore at the moment of the Oops (upgraded
> to a panic, so kexec kicks in)?

I suggested the above just to make sure that you are able to capture a dump when you crash the system manually. If that works, then we are reasonably sure that it will work when CFQ crashes. So yes, I am interested in the vmcore when CFQ actually crashes.

> 
> (In reply to comment #75)
> > In an absence of a needed initrd /sbin/mkdumprd runs
> > in a boot sequence and this gets stuck.
> 
> 2) This doesn't seem to be the issue. At the boot of the "system kernel"
> /etc/rc.d/init.d/kdump did generate a initrd for for the "capture kernel".
> 
> 3) My trouble apparently was that I was using the initramfs for the "system
> kernel" while I should have been using the initramfs (called initrd) generated
> by kdump for the "capture kernel".

It is not mandatory; one can use a dracut-generated initramfs also. It is just much bigger than the kdump-generated initrd and lacks some of the kdump-specific automatic options for saving the dump.

> However using that initramfs doesn't help
> much since kdump doesn't generate an initramfs compatible with my setup (ie,
> root on ext3 on LVM2). So the kexec reboot now stopped at the Busybox prompt of
> a session without rootfs and without the means, as far as I could discover, to
> mount a rootfs from a LVM2 logical volume.

Is your network up? Can you scp the vmcore over the network to a different machine from the BusyBox prompt?

So root is not mounted because some driver or dm modules are not packed into the initrd? You can compile these into your kernel, and then you don't have to depend on the kdump initrd packing them in.

> 
> 4) I filed bug #703369 for that last issue, because it seems if I had a
> initramfs.conf kdump would invoke dracut, and dracut should be able to generate
> an initramfs compatible with my setup (which essentially is a Fedora default
> setup, isn't it?).

If you want, you can disable the kdump service entirely and rely on loading the kdump kernel manually. Reserve 256M of memory; you can use a dracut-generated initramfs for the command line below.

kexec -p /boot/vmlinuz --initrd=/boot/initrd --append="`cat /proc/cmdline` maxcpus=1 irqpoll reset_devices"
Comment 78 Vivek Goyal 2011-05-10 08:24:04 EDT
Paul,

Can you please upload your config file for the kernel you are able to recreate the crash. I will use the same config on upstream kernel and boot on fedora and try to reproduce the issue.
Comment 79 Paul Bolle 2011-05-10 15:18:12 EDT
Created attachment 498131 [details]
my v2.6.39-rc6 vanilla config

(In reply to comment #78)
> Can you please upload your config file for the kernel you are able to recreate
> the crash. I will use the same config on upstream kernel and boot on fedora and
> try to reproduce the issue.

This is based on the first appearance of v2.6.39-rc6 in Fedora's kernel.git, but only "rpmbuilt" as a vanilla kernel, so it should be close to what Fedora might be shipping by now.
Comment 80 Paul Bolle 2011-05-11 06:25:15 EDT
0) Finally managed to create a vmcore dump. crashkernel=256M helped. It also feels like using a serial console is the only (reliable) way to use kexec (see https://lkml.org/lkml/2011/4/12/210 for a similar report). Perhaps throttling the output to 115200 baud stops something, somewhere, from being racy. But this is just a feeling.

1) Now I have a (2048 - 256 =) 1792 MB core file. Not practical to upload.

2) First impressions with gdb:

Reading symbols from /home/[...]/Rpmbuild/BUILD/kernel-2.6.38.fc16/linux-2.6.38.i686/vmlinux...done.
[New LWP 14366]
#0  prefetch (x=Cannot access memory at address 0x6b6b6b6b
)
    at /home/[...]/Rpmbuild/BUILD/kernel-2.6.38.fc16/linux-2.6.38.i686/arch/x86/include/asm/processor.h:833
833             alternative_input(BASE_PREFETCH,
(gdb) bt 9
#0  prefetch (x=Cannot access memory at address 0x6b6b6b6b
)
    at /home/peb/Rpmbuild/BUILD/kernel-2.6.38.fc16/linux-2.6.38.i686/arch/x86/include/asm/processor.h:833
#1  call_for_each_cic (ioc=0xeeb60500, func=0xc05e88f8 <cic_free_func>)
    at block/cfq-iosched.c:2596
#2  0xc05e861e in cfq_free_io_context (ioc=<optimized out>)
    at block/cfq-iosched.c:2659
#3  0xc05ddcda in cfq_dtor (ioc=0xeeb60500) at block/blk-ioc.c:26
#4  put_io_context (ioc=0xeeb60500) at block/blk-ioc.c:43
#5  0xc05ddd6a in exit_io_context (task=0xed1469f0) at block/blk-ioc.c:80
#6  0xc044083f in do_exit (code=<optimized out>) at kernel/exit.c:1044
#7  0xc0440a79 in do_group_exit (exit_code=0) at kernel/exit.c:1106
#8  0xc0440ab4 in sys_exit_group (error_code=0) at kernel/exit.c:1117
(More stack frames follow...)
(gdb) f 3
#3  0xc05ddcda in cfq_dtor (ioc=0xeeb60500) at block/blk-ioc.c:26
26                      cic->dtor(ioc);
(gdb) l
21              if (!hlist_empty(&ioc->cic_list)) {
22                      struct cfq_io_context *cic;
23
24                      cic = list_entry(ioc->cic_list.first, struct cfq_io_context,
25                                                                      cic_list);
26                      cic->dtor(ioc);
27              }
28      }
29
30      /*
(gdb) print ioc->cic_list
$1 = {first = 0x0}
(gdb) 

3) So apparently something set ioc->cic_list.first to NULL after the hlist_empty() test was run but before cic->dtor() / cfq_free_io_context() was called. Is that possible?

4) Nit pick: list_entry() should actually be hlist_entry() here.
Comment 81 Vivek Goyal 2011-05-11 09:52:18 EDT
(In reply to comment #80)
> 0) Finally managed to create a vmcore dump. crashkernel=256M helped. It also
> feels like using a serial console is the only (reliable) way to use kexec (see
> https://lkml.org/lkml/2011/4/12/210 for a similar report). Perhaps throttling
> the output to 115200 baud stops something, somewhere from being racey. But,
> this is just a feeling.
> 
> 1) Now I have a (2048 - 256 =) 1792 MB core file. Not practical to upload.

You can filter it using the makedumpfile utility to extract just the kernel pages and discard user space pages, and I think cache pages too. That might bring it down to a reasonable size to upload.
Comment 82 Vivek Goyal 2011-05-11 10:01:18 EDT
(In reply to comment #80)

> (gdb) bt 9
> #0  prefetch (x=Cannot access memory at address 0x6b6b6b6b
> )
>     at
> /home/peb/Rpmbuild/BUILD/kernel-2.6.38.fc16/linux-2.6.38.i686/arch/x86/include/asm/processor.h:833
> #1  call_for_each_cic (ioc=0xeeb60500, func=0xc05e88f8 <cic_free_func>)
>     at block/cfq-iosched.c:2596
> #2  0xc05e861e in cfq_free_io_context (ioc=<optimized out>)
>     at block/cfq-iosched.c:2659
> #3  0xc05ddcda in cfq_dtor (ioc=0xeeb60500) at block/blk-ioc.c:26
> #4  put_io_context (ioc=0xeeb60500) at block/blk-ioc.c:43
> #5  0xc05ddd6a in exit_io_context (task=0xed1469f0) at block/blk-ioc.c:80
> #6  0xc044083f in do_exit (code=<optimized out>) at kernel/exit.c:1044
> #7  0xc0440a79 in do_group_exit (exit_code=0) at kernel/exit.c:1106
> #8  0xc0440ab4 in sys_exit_group (error_code=0) at kernel/exit.c:1117
> (More stack frames follow...)
> (gdb) f 3
> #3  0xc05ddcda in cfq_dtor (ioc=0xeeb60500) at block/blk-ioc.c:26
> 26                      cic->dtor(ioc);
> (gdb) l
> 21              if (!hlist_empty(&ioc->cic_list)) {
> 22                      struct cfq_io_context *cic;
> 23
> 24                      cic = list_entry(ioc->cic_list.first, struct
> cfq_io_context,
> 25                                                                     
> cic_list);
> 26                      cic->dtor(ioc);
> 27              }
> 28      }
> 29
> 30      /*
> (gdb) print ioc->cic_list
> $1 = {first = 0x0}
> (gdb) 
> 
> 3) So apparently something set ioc->cic_list.first to NULL after the
> hlist_empty() test was run but before cic->dtor() / cfq_free_io_context() was
> called. Is that possible?

I don't think that happened. After extracting the cic pointer you successfully
called into cfq_free_io_context(). I think what happened is that there was
at least 1 element on the cic_list and then cfq_free_io_context() released
that element, hence you see ioc->cic_list = NULL now.
Comment 83 Vivek Goyal 2011-05-13 08:42:39 EDT
Paul,

Are you planning to upload this vmcore (after filtering) and vmlinux? I am curious to have a look and see if I can spot something.
Comment 84 Paul Bolle 2011-05-15 14:24:04 EDT
(In reply to comment #83)
> Are you planning to upload this vmcore (after filtering) and vmlinux? I am
> curious to have a look and see if I can spot something.

0) Well, after getting sidetracked, I managed to reproduce this under v2.6.39-rc7. The vmcore file is still (almost) 200M after butchering it with makedumpfile, which is still rather big to upload.

1) Perhaps you can contact me directly (i.e., bypassing bugzilla.redhat.com) to discuss possibilities here.
Comment 85 Vivek Goyal 2011-05-17 13:31:41 EDT
Thanks for the vmcore, Paul. Some observations after looking at it.

- The crash happened when we tried to do pos->next where pos=0x6b6b6b6b.

- ioc->cic_list is empty at the time of the crash. So it looks like when we
  entered the loop, at least one element was on the list, and later we
  fetched the next element, which turned out to be a freed object:

   pos = rcu_dereference_raw(hlist_next_rcu(pos))

- One conspiracy theory is that somehow the cic we removed from the
  ioc->cic_list got freed before the RCU read lock was released. That way
  the cic object got poisoned and we fetched a poisoned pointer. Now I
  don't know if that's possible or not.

- I tried hard looking for a pointer to the recently freed object, but it
  looks like we have lost that.

- Dave Anderson helped look at the crash, trying to find anything else. We
  did "bt -F" and we see that, left over on the stack, there seems to be a
  pointer to the cic->rcu_head field. We are assuming that that's the
  pointer to the recently freed cic.

- 0xf6ab7460 seems to be the last cic freed (a guess from the stack leftovers).

- "kmem -S cfq_io_context" verifies that this is a free object in the cache.

- "struct cfq_io_context 0xf6ab7460" shows that the object is poisoned.

So at this point I am not sure whether we are looking at the right object
and whether it is the most recently freed object. If yes, then the question
becomes: did we free it up while we were inside rcu_read_lock()?

Paul, as you can reproduce the problem, I guess I will write a small patch
to print recently freed objects and see what we are freeing when.
Comment 86 Vivek Goyal 2011-05-17 14:58:02 EDT
Created attachment 499426 [details]
CFQ cic debug patch

a debug patch to get some more info out of system during crash.
Comment 87 Vivek Goyal 2011-05-17 14:58:48 EDT
Paul, can you please apply the above patch and try to reproduce the issue again? Do attach the serial console and capture the console output.
Comment 88 Vivek Goyal 2011-05-17 15:11:12 EDT
Also, keep kdump configured. A vmcore gives a whole lot of information about the various kmem caches and object states.
Comment 89 Paul Bolle 2011-05-18 03:34:41 EDT
(In reply to comment #85)
> Some observations after looking at it.

Two minor (lazy) questions, if that's allowed.

> - [..] We
>   did "bt -F" and we see that leftover on the stack there seems to be pointer
>   to cic->rcu_head field.

What's "bt -F"? 

> - "kmem -S cfq_io_context" verifies that this is a free object in the cache.

What's kmem?
Comment 90 Vivek Goyal 2011-05-18 05:47:27 EDT
These are "crash" commands; I used "crash vmlinux dumpfile" to open the dump, not gdb.

You can use "help bt" to get more details, but in summary, crash walks through the
stack contents and, if any address is found to belong to one of the slab caches, prints the name of the slab symbolically.

In this case, we saw the following.

 #3 [ec05be50] do_page_fault at c08228ba
    [RA: c08205d4  SP: ec05be54  FP: ec05becc  SIZE: 124]
    ec05be54: 00000007  f699bf90  restore_all_notrace  [cfq_io_context]  
    ec05be64: 0000005c  00000000  00000008  f699bf90  
    ec05be74: 00000000  trace_hardirqs_on_thunk+12  runqueues  init_task  
    ec05be84: 00000000  restore_all_notrace  f4d43764  00000080  
    ec05be94: 00000080  rcu_sched_state  [cfq_io_context]  ec05bec8  
    ec05bea4: 00000046  ec05beb8  trace_hardirqs_off_caller+46  6b6b6b6b  
    ec05beb4: [blkdev_ioc]  ec05bf18  6b6b6b6b  [blkdev_ioc]  
    ec05bec4: do_page_fault  ec05bf18  error_code+108  

There seem to be two addresses in one of the stack frames that are coming
from the cfq_io_context slab.

Then we did "bt -f" to get raw addresses.

 #3 [ec05be50] do_page_fault at c08228ba
    [RA: c08205d4  SP: ec05be54  FP: ec05becc  SIZE: 124]
    ec05be54: 00000007  f699bf90  c081f8ed  f6ab7498  
    ec05be64: 0000005c  00000000  00000008  f699bf90  
    ec05be74: 00000000  c05f3ef4  c0d017c0  c0ac5fe0  
    ec05be84: 00000000  c081f8ed  f4d43764  00000080  
    ec05be94: 00000080  c0ad9e80  f6ab7498  ec05bec8  
    ec05bea4: 00000046  ec05beb8  c0465c5d  6b6b6b6b  
    ec05beb4: ee9deb00  ec05bf18  6b6b6b6b  ee9deb00  
    ec05bec4: c082273f  ec05bf18  c08205d4  

And the address seems to be 0xf6ab7498 in both instances. At this point it only tells us that this address points to within some object in the cfq_io_context slab.

The kmem command shows kernel memory state; -s and -S show slab state. Then we did the following.

crash> kmem -s 0xf6ab7498
CACHE    NAME                 OBJSIZE  ALLOCATED     TOTAL  SLABS  SSIZE
f696b580 cfq_io_context            64         26        36      1     4k
  SLAB      MEMORY    NODE  TOTAL  ALLOCATED  FREE
  f58ac6e0  f6ab7000     0     36         26    10
  FREE / [ALLOCATED]
   f6ab7460  (cpu 0 cache)

This tells us that 0xf6ab7498 was pointing inside one of the cfq_io_context objects and that the beginning address of the object is 0xf6ab7460, which has been freed. It can be further verified using kmem -S.

crash> kmem -S cfq_io_context
CACHE    NAME                 OBJSIZE  ALLOCATED     TOTAL  SLABS  SSIZE
f696b580 cfq_io_context            64         26        36      1     4k
CPU 0 SLAB:
  SLAB      MEMORY    NODE  TOTAL  ALLOCATED  FREE
  f58ac6e0  f6ab7000     0     36         26    10
  FREE / [ALLOCATED]
  [f6ab7000]
  [f6ab7070]
   f6ab70e0  (cpu 0 cache)
  [f6ab7150]
  [f6ab71c0]
  [f6ab7230]
  [f6ab72a0]
  [f6ab7310]
  [f6ab7380]
  [f6ab73f0]
   f6ab7460  (cpu 0 cache)
   f6ab74d0  (cpu 0 cache)
....
....
....

Notice that the above address does not have [] around it. That means this object has been freed and returned to the SLUB allocator.

So at this point we only know that we found a cic pointer on the stack that has been freed. We don't know if that's the latest cic we freed under RCU; hence the patch with the debug messages might confirm or deny that.
Comment 91 Paul Bolle 2011-05-18 16:06:18 EDT
0) Interim report (since I, well, just forgot to enable panic on oops).

1) I captured 6M+ via minicom over 115200 baud. mandb therefore took about 20 min! This part shows that we're on the right track, doesn't it? (This will probably end up a bit mangled.)

[...]
debug: Will decouple cic=0xef3f1380, Next pointer is   (null)
debug: cic=0xef3f1380 decoupled. Next pointer is   (null)
debug: Will decouple cic=0xef3f1380, Next pointer is   (null)
debug: cic=0xef3f1380 decoupled. Next pointer is   (null)
debug: freeing up cic=ef3f1380
cic=ef3f1380 freed under while in RCU
------------[ cut here ]------------
kernel BUG at block/cfq-iosched.c:2606!
invalid opcode: 0000 [#1] SMP 
last sysfs file: /sys/firmware/memmap/8/type
Modules linked in: cpufreq_ondemand acpi_cpufreq mperf bnep bluetooth ip6t_REJECT nf_conntrack_ipv6 nf_defrag_ipv6 ip6table_filter ip6_tables nf_conntrack_netbios_ns nf_conntrack_broadcast snd_intel8x0m arc4 snd_intel8x0 ath5k snd_ac97_codec ath ac97_bus snd_seq mac80211 snd_seq_device cfg80211 e1000 thinkpad_acpi snd_pcm ppdev rfkill iTCO_wdt snd_timer parport_pc iTCO_vendor_support snd i2c_i801 snd_page_alloc parport soundcore pcspkr joydev microcode uinput ipv6 yenta_socket video radeon ttm drm_kms_helper drm i2c_algo_bit i2c_core [last unloaded: scsi_wait_scan]

Pid: 12275, comm: mandb Not tainted 2.6.39-0.rc7.git0.0.local1.fc16.i686 #1 IBM        /       
EIP: 0060:[<c05e86c6>] EFLAGS: 00010296 CPU: 0
EIP is at call_for_each_cic+0x3b/0xa8
EAX: 00000038 EBX: ef3f13a8 ECX: 00000086 EDX: 00000000
ESI: ef3f1380 EDI: ef7baf00 EBP: eca75f18 ESP: eca75ef8
 DS: 007b ES: 007b FS: 00d8 GS: 00e0 SS: 0068
Process mandb (pid: 12275, ti=eca74000 task=ef693f90 task.ti=eca74000)
Stack:
 c09a8b73 ef3f1380 00000000 c05e8a1f 6b6b6b6b ef7baf00 ef693f90 ef694360
 eca75f20 c05e8745 eca75f2c c05ddd86 ef7baf00 eca75f40 c05dde16 ef693f90
 00000012 eca75f74 eca75f8c c04408ab efbe65dc efbe6580 00000009 ef693f90
Call Trace:
 [<c05e8a1f>] ? cfqq_process_refs+0x18/0x18
 [<c05e8745>] cfq_free_io_context+0x12/0x14
 [<c05ddd86>] put_io_context+0x34/0x5c
 [<c05dde16>] exit_io_context+0x68/0x6d
 [<c04408ab>] do_exit+0x63e/0x661
 [<c0440ae5>] do_group_exit+0x63/0x86
 [<c0440b20>] sys_exit_group+0x18/0x18
 [<c082571f>] sysenter_do_call+0x12/0x38
Code: 89 55 ec e8 1f ec ff ff 8b 5f 4c eb 22 e8 ba ff ff ff 81 7d f0 6b 6b 6b 6b 75 12 89 74 24 04 c7 04 24 73 8b 9a c0 e8 51 e1 22 00 <0f> 0b 8b 1b e8 c1 b0 e6 ff 85 c0 74 05 e8 8f ff ff ff 85 db 74 
EIP: [<c05e86c6>] call_for_each_cic+0x3b/0xa8 SS:ESP 0068:eca75ef8
---[ end trace be23eecaac4d4ddf ]---
Fixing recursive fault but reboot is needed!
[...]
Comment 92 Vivek Goyal 2011-05-18 16:47:13 EDT
This one looks interesting. The only question I have is: why are we decoupling the same cic multiple times? Is it that the cic was decoupled, allocated again, and then put back on the list, or something like that?

Can you provide either the whole console log or at least the last 100-200 lines or so?
Comment 93 Paul Bolle 2011-05-18 17:06:38 EDT
(In reply to comment #92)
> This one looks interesting. The only question I have is that why are we
> decoupling same cic multiple times. Is it that cic was decoupled, allocated
> again and then put again on the list or something like that.
> 
> Can you provide either the whole of the console log or atleast last 100-200
> lines or so.

0) Well, that cic was used a lot:
$ grep "freeing up cic=ef3f1380" call_for_each_cic.out | wc -l
7367

1) Perhaps we should try another way of printk debugging this.

[2) Vivek, I just uploaded the (xz compressed) vmlinux and (ditto) dumpfile of a similar crash, now with panic_on_oops enabled. That one should have its log embedded in the dumpfile.]
Comment 94 Vivek Goyal 2011-05-18 19:01:09 EDT
Thanks, Paul. OK, call_for_each_cic() is called by 4 functions:

- exit single io context
- decouple and free cic
- change ioprio
- change group

On my system I see a call to exit_io_context first and then a call to free the cic. So I see the cic decouple message twice and then the cic is freed.

In your logs I see that call_for_each_cic() has been called 3 times on each cic. I am wondering what that 3rd call is. Is it a change of ioprio, or a change of group, etc.?

Can you please try another debug patch? In this patch I am also identifying the caller of the function.
Comment 95 Vivek Goyal 2011-05-18 19:03:07 EDT
Created attachment 499716 [details]
CFQ cic debug patch V2

Another debug patch for CFQ. It also identifies the caller of the call_for_each_cic function, and it puts a printk in drop_dead_cic(), which is the only other path where a cic is freed.
Comment 96 Paul Bolle 2011-05-19 16:34:07 EDT
0) Vivek, I ran into trouble trying your new debug patch since, for some reason, my "capture kernel" now loops while printing things like this:

irq 11: nobody cared (try booting with the "irqpoll" option)
Pid: 205, comm: modprobe Not tainted 2.6.39-0.rc7.git0.0.local.fc16.i686 #1
Call Trace:
 [<d08167d4>] ? printk+0x2d/0x2f
 [<d0818025>] __report_bad_irq+0x40/0x9f
 [<d04930b2>] note_interrupt+0x122/0x188
 [<d0491b1f>] handle_irq_event_percpu+0x171/0x18c
 [<d04689b2>] ? lock_release+0x166/0x16d
 [<d0491b6b>] handle_irq_event+0x31/0x4a
 [<d0493553>] ? unmask_irq+0x23/0x23
 [<d04935a8>] handle_level_irq+0x55/0x73
 <IRQ>  [<d0404040>] ? do_IRQ+0x3c/0x92
 [<d0465c5d>] ? trace_hardirqs_off_caller+0x2e/0x86
 [<d0825c75>] ? common_interrupt+0x35/0x3c
 [<d044007b>] ? mm_update_next_owner+0xad/0x18a
 [<d044007b>] ? mm_update_next_owner+0xad/0x18a
 [<d04600e0>] ? clocksource_watchdog_work+0x16/0x3d
 [<d0441a6e>] ? arch_local_irq_enable+0x5/0xb
 [<d044272e>] ? __do_softirq+0x68/0x185
 [<d04426c6>] ? __tasklet_hi_schedule_first+0x30/0x30
 <IRQ>  [<d0442a03>] ? irq_exit+0x4f/0x8b
 [<d0404082>] ? do_IRQ+0x7e/0x92
 [<d0465c5d>] ? trace_hardirqs_off_caller+0x2e/0x86
 [<d0825c75>] ? common_interrupt+0x35/0x3c
 [<d081007b>] ? powernow_cpu_init+0x51/0x5d0
 [<d04600e0>] ? clocksource_watchdog_work+0x16/0x3d
 [<d046dfd1>] ? arch_local_irq_restore+0x5/0xb
 [<d081f625>] ? _raw_spin_unlock_irqrestore+0x44/0x48
 [<d0492b00>] ? __setup_irq+0x27c/0x324
 [<d0492c1b>] ? request_threaded_irq+0x73/0xd9
 [<e07c048f>] ? cb_writel+0x59/0x59 [yenta_socket]
 [<d0492c5a>] ? request_threaded_irq+0xb2/0xd9
 [<e07c16e3>] ? yenta_probe_cb_irq+0x46/0x121 [yenta_socket]
 [<e07c1d54>] ? ti12xx_override+0x371/0x4b4 [yenta_socket]
 [<e07c28c4>] ? yenta_probe+0x2c9/0x568 [yenta_socket]
 [<d06a524c>] ? pm_runtime_enable+0x60/0x68
 [<d0609080>] ? pci_device_probe+0x62/0xab
 [<d069ef4a>] ? driver_probe_device+0x129/0x208
 [<d081e20d>] ? mutex_lock_nested+0x39/0x3e
 [<d069f078>] ? __driver_attach+0x4f/0x6b
 [<d069e139>] ? bus_for_each_dev+0x42/0x6b
 [<d069eb8e>] ? driver_attach+0x1f/0x23
 [<d069f029>] ? driver_probe_device+0x208/0x208
 [<d069e801>] ? bus_add_driver+0xcd/0x214
 [<d069f4b6>] ? driver_register+0x84/0xe3
 [<d05f7b6b>] ? __raw_spin_lock_init+0x2d/0x4e
 [<d0609829>] ? __pci_register_driver+0x4f/0xaa
 [<e07a6017>] ? yenta_socket_init+0x17/0x1000 [yenta_socket]
 [<d040118a>] ? do_one_initcall+0x8c/0x146
 [<e07a6000>] ? 0xe07a5fff
 [<e07a6000>] ? 0xe07a5fff
 [<d0472108>] ? sys_init_module+0x14fc/0x1704
 [<d082569f>] ? sysenter_do_call+0x12/0x38
handlers:
[<d06f0464>] (usb_hcd_irq+0x0/0x95)
[<d06f0464>] (usb_hcd_irq+0x0/0x95)
[<d06f0464>] (usb_hcd_irq+0x0/0x95)
[<d06f0464>] (usb_hcd_irq+0x0/0x95)
[<e0904a79>] (radeon_driver_irq_handler_kms+0x0/0x19 [radeon])
[<e07c17be>] (yenta_interrupt+0x0/0xb3 [yenta_socket])
[<e07c048f>] (yenta_probe_handler+0x0/0x4c [yenta_socket])
Disabling IRQ #11

1) According to the previously printed messages, "irqpoll" is already set.

2) Any suggestions what could be going wrong?
Comment 97 Vivek Goyal 2011-05-19 16:55:24 EDT
Paul, it does not look like it is related to my debug patch. I am surprised that it is looping for a long time, because generally, if lots of interrupts come in when nobody is handling them, the kernel disables the interrupt line and moves on. So the kernel should have done that.

Maybe just try the test again.

In the meantime, I can look at the console logs while you retry capturing
the crash dump.
Comment 98 Paul Bolle 2011-05-19 17:10:06 EDT
Created attachment 499949 [details]
last 100 lines (before unsuccesfull reboot in capture kernel)

(In reply to comment #97)
> Paul, does not look like that it is related to my debug patch.

Indeed. Not sure what I did to mess things up this way.

> In the mean time, I can look at the console logs while you retrying capturing
> the crash dump.

I've attached the last 100 lines printed by the (patched) "system kernel", before it rebooted into the "capture kernel".
Comment 99 Vivek Goyal 2011-05-19 17:27:20 EDT
Hmm..., I see lots of the following message, which I don't see on my system.

debug: Will change cgroup cic=0xee226380, Next pointer is   (null)
debug: cic=0xee226380 change cgroup done. Next pointer is   (null)

Have you got any type of automatic rules set up for cgroups, so that when a thread is launched it is moved into some cgroup (/etc/cgrules.conf?)?

Can you disable this cgroup-related special behavior and see if the problem is still reproducible?
Comment 100 Vivek Goyal 2011-05-19 17:29:32 EDT
You can disable the cgconfig and cgred services on the system (if these are running).
Comment 101 Paul Bolle 2011-05-19 17:50:23 EDT
(In reply to comment #99)
> Have you got any type of automatic rules setup for cgroups. So that if a thread
> is launched it is moved in some cgroup or so (/etc/cgrules.conf?).
> 
> Can you disable this cgroup related special behavior and see if problem is
> still reproducible?

I know nothing of cgroup-related stuff. I guess it should be configured, as current Rawhide has pushed it onto people (like me) tracking Rawhide. But at this moment I can't check, as the kernel is (once again) happily flooding all consoles with the "cic" debug messages I patched in.

(In reply to comment #100)
> you can disable cgconfig and cgred service on the system (if these are
> running).

Ditto.
Comment 102 Vivek Goyal 2011-05-19 17:54:51 EDT
You can just kill the mandb process, and then the kernel should stop filling the console.

I enabled cgred and put a rule in cgrules.conf that puts all root processes in /cgroup/blk/test1. Then I see the changed cgroup message only a few times, as the process migrates to the new group. But I don't see this constant stream of change cgroup messages as you are seeing.

Now my "mandb -c" is running in the /cgroup/blk/test1/ group, but there is no constant stream of cgroup changed messages.

Can you try capturing, say, 10 seconds of blktrace on your root disk (sda, I guess)?

- start mandb
- blktrace -d /dev/sda
- after 10 sec press Ctrl-C
- blkparse -i sda > blkparse-output

And upload this blkparse-output file, which contains the trace messages. You might have to install the blktrace package first, though.
Comment 103 Vivek Goyal 2011-05-19 18:05:21 EDT
Though we are changing cgroups frequently in your setup, looking at the cgroup change code it looks harmless in this context. We just release the cfqq associated with the cic and don't try to free the cic as such.

Still, I am curious what's happening in your setup that causes all these messages; maybe that could explain the issue.
Comment 104 Paul Bolle 2011-05-19 18:06:57 EDT
0) Things settled down. For once mandb did not manage to trigger this issue. The joy of race conditions?

1) /etc/cgrules.conf is entirely commented out (its default state, I suppose).

2) If chkconfig is to be believed, neither cgconfig nor cgred should be running (but maybe systemd is interfering with this, dunno...).
Comment 105 Paul Bolle 2011-05-19 18:13:52 EDT
(In reply to comment #102)
> you can just kill the mandb process and then kernel should stop filling the
> console?

My comment #104 collided with your comment #102 and comment #103. Anyway, I hope to give this (ie, your blktrace stuff) another try tomorrow. I'm a bit fed up now with seeing this laptop booting, crashing, rebooting and crashing again without any apparent progress ....
Comment 106 Paul Bolle 2011-05-20 03:38:48 EDT
Created attachment 500006 [details]
16+ seconds of blktrace output

(In reply to comment #102)
> Can you try capturing say 10 seconds of blktrace on your root disk (sda I
> guess?)

16+ seconds of blktrace output (while running mandb - which eventually oopsed the box, a single hard disk, single CPU system).
Comment 107 Vivek Goyal 2011-05-20 14:27:44 EDT
Ok, nothing suspicious in the blktrace. There are no "changed cgroup" messages. So I think these are newly allocated cics where ioc->cgroup_changed = 1 but there is no associated cfqq, hence the blktrace does not carry the "changed_cgroup" message.

Why we are changing the cgroup of mandb threads so frequently still beats me. I guess that's a different problem we can analyze later.

Back to the core issue of freeing up an object while in RCU. I am really running out of ideas on how to debug this. I am hoping that Paul McKenney gives us some tips on how to prove or disprove that it is an RCU issue.

BTW, out of curiosity, is this issue also reproducible on a stock Fedora kernel for you? Is it reproducible if you trim down your config to a bare minimum? Just trying to figure out whether it depends on some CONFIG option or not.
Comment 108 Paul Bolle 2011-05-20 15:04:01 EDT
(In reply to comment #95)
> CFQ cic debug patch V2
> 
> Another debug patch for CFQ. It also identifies the caller of the
> call_for_each_cic function. It also puts a printk for drop_dead_cic() which is
> only other path where cic is freed.

0) I've finally uploaded an (xz compressed) vmlinux and (ditto) dumpfile of this crash (both using the V2 debug patch).

(1) I've been running into trouble with kexec again and again for the last 48 hours. To aid my memory, it seems I need to add these options to the "system kernel" to get things to work:

crashkernel=256M console=ttyS0,115200n8 rd_NO_PLYMOUTH nousb nomodeset yenta_socket.foo=1 ath5k.foo=1 3

- "nomodeset" (and no console=tty0) for radeon related bugs (of the "capture kernel");
- yenta_socket.foo=1 and ath5k.foo=1 are tricks to make sure those modules won't get loaded (for IRQ=11 related bugs of the "capture kernel");
- nousb for ehci related bugs (of the "capture kernel");
- rd_NO_PLYMOUTH to be able to log into the "system kernel" over the serial line.

This is just what I ended up with after beating kexec into (apparent) submission. Not sure what's really needed and what's just cargo cult.)
Comment 109 Paul Bolle 2011-05-20 15:06:38 EDT
(In reply to comment #107)
> BTW, out of curiosity, is this issue also reproducible on stock fedora kernel
> for you? Is it reproducible if you just trim down your config to bare minimum.
> Just trying to figure out if it is some CONFIG option dependent or not.

Could also be a gcc related issue, I guess.

What is the latest i686 kernel now actually shipped on the Rawhide mirrors?
Comment 110 Paul Bolle 2011-05-21 17:57:57 EDT
(In reply to comment #109)
> Could also be a gcc related issue, I guess.

I upgraded gcc and friends from 4.6.0-6 to 4.6.0-7, recompiled v2.6.39-rc7, and my favourite testcase (mandb -c) Oops'ed twice. My next comment should prove it's not gcc related.
Comment 111 Paul Bolle 2011-05-21 18:10:27 EDT
(In reply to comment #107)
> BTW, out of curiosity, is this issue also reproducible on stock fedora kernel
> for you? Is it reproducible if you just trim down your config to bare minimum.
> Just trying to figure out if it is some CONFIG option dependent or not.

Ha!

I installed kernel-2.6.39-0 (as pre-compiled for i686 Rawhide). I also rebuilt that rpm locally (using gcc-4.6.0-7) and rebuilt that rpm in its vanilla variant (using rpmbuild's "--with vanilla" option). I'm unable to generate this Oops under v2.6.39 (in any of the three variants I used)!

This suggests things got fixed between v2.6.39-rc7 and v2.6.39.

I hope to answer Paul McKenney's message to LKML (https://lkml.org/lkml/2011/5/21/151) shortly to find out whether things really got fixed in v2.6.39 or whether the window of opportunity for this Oops just got too small for my attempts at reproducing it.
Comment 112 Paul Bolle 2011-05-22 08:15:43 EDT
(In reply to comment #111)
> I installed kernel-2.6.39-0 (as pre-compiled for i686 Rawhide). I also rebuilt
> that rpm locally (using gcc-4.6.0-7) and rebuilt that rpm in its vanilla
> variant (using rpmbuild's "--with vanilla" option). I'm unable to generate this
> Oops under v2.6.39 (in any of the three variants I used)!
> 
> This suggests things got fixed between v.2.6.39-rc7 and v2.6.39.

0) It turns out the Fedora kernel specfile uses rather different config files for release candidates and final releases: a lot of debug options will be unset for final releases. In this case I guess I was fooled by this:

diff -U 0 /boot/config-2.6.39-0.rc7.git0.0.local.fc16.i686 /boot/config-2.6.39-0.local.fc16.i686 | grep -e RCU
-CONFIG_PROVE_RCU=y
-# CONFIG_PROVE_RCU_REPEATEDLY is not set

1) I battled a bit with the specfile to get it to not delete config.debug, but I have to admit defeat. I'll have to figure out another way to use the debug config options.

(2) For those people maintaining kernel.spec:

Are these lines correct:
    %if !%{debugbuildsenabled}
    rm -f kernel-%{version}-*debug.config
    %endif

?

It seems their effect is that the information shipped in both config-debug and config-nodebug is dropped. But this stuff is all rather confusing ....)
Comment 113 Paul Bolle 2011-05-22 15:20:47 EDT
(In reply to comment #112)
> 0) It turns out the Fedora kernel specfile uses rather different config files
> for release candidates and final releases: a lot of debug options will be unset
> for final releases. In this case I guess I was fooled by this:
> 
> diff -U 0 /boot/config-2.6.39-0.rc7.git0.0.local.fc16.i686
> /boot/config-2.6.39-0.local.fc16.i686 | grep -e RCU
> -CONFIG_PROVE_RCU=y
> -# CONFIG_PROVE_RCU_REPEATEDLY is not set

And, indeed, when I rebuilt and booted v2.6.39 with a .config file more in line with what was used for the release candidates (eg, with CONFIG_PROVE_RCU=y), I ran into this issue the first time I ran my precious little testing tool (mandb -c).
Comment 114 Vivek Goyal 2011-05-23 10:11:17 EDT
(In reply to comment #113)
> (In reply to comment #112)
> > 0) It turns out the Fedora kernel specfile uses rather different config files
> > for release candidates and final releases: a lot of debug options will be unset
> > for final releases. In this case I guess I was fooled by this:
> > 
> > diff -U 0 /boot/config-2.6.39-0.rc7.git0.0.local.fc16.i686
> > /boot/config-2.6.39-0.local.fc16.i686 | grep -e RCU
> > -CONFIG_PROVE_RCU=y
> > -# CONFIG_PROVE_RCU_REPEATEDLY is not set
> 
> And, indeed, when I rebuilt and booted v2.6.39 with a .config file more in line
> with what it used for the release candidates (eg, with CONFIG_PROVE_RCU=y) I
> ran into this issue the first time I ran my precious little testing tool (mandb
> -c).

PaulB,

So if you take the Fedora final release config file and just enable CONFIG_PROVE_RCU=y, can you reproduce the issue? Or are more config options involved too?
Comment 115 Vivek Goyal 2011-05-23 10:25:43 EDT
(In reply to comment #103)
> Though we are changing cgroups frequently in your setup looking at the cgroup
> change code, it looks harmless in this context. We just release the cfqq
> associated with the cic and don't try to free cic as such.
> 
> Still I am curious what's happening in your setup which causes all these
> messages and may be that could explain the issue.

Ok, I found out why you are seeing so many cgroup changed messages. We are not initializing the ioc->cgroup_changed field during ioc creation, so CFQ thinks that the cgroup has changed.

I am going to post a patch that should remove the cgroup changed messages from the logs. Though I think this is completely orthogonal and has nothing to do with the original problem.
Comment 116 Vivek Goyal 2011-05-23 10:28:36 EDT
Created attachment 500449 [details]
Fix frequent cgroup changed messages visible in debug logs

PaulB was seeing lots of cgroup changed messages when we enabled debug. This patch removes those unnecessary extra calls to call_for_each_cic(changed_cgroup).
Comment 117 Vivek Goyal 2011-05-23 11:58:48 EDT
PaulB, would you like to re-run those tests with some config options enabled, as suggested by PaulM? Also it would be good to run the script to capture the RCU traces.

CONFIG_PROVE_RCU=y
CONFIG_PREEMPT=y
CONFIG_DEBUG_OBJECTS_RCU_HEAD=y
CONFIG_RCU_TRACE=y
Comment 118 Paul Bolle 2011-05-23 16:53:05 EDT
(In reply to comment #114)
> So if you take fedora final release file and just enable CONFIG_PROVE_RCU=y,
> then you can reproduce the issue? Or more config options are involved too.

0) The Fedora final release I ran was kernel-2.6.39-0.fc16.i686.rpm. The first version of that release that I managed to build with a config resembling the release candidates (but as a vanilla kernel) was kernel-vanilla-2.6.39-0.local2.fc16.i686.

1) This horror shows the differences in their respective configs:

diff -U 0 <(rpm2cpio Download/kernel-2.6.39-0.fc16.i686.rpm | cpio --quiet -i --to-stdout ./boot/config-2.6.39-0.fc16.i686) /boot/config-2.6.39-0.local2.fc16.i686

2) That generates this output (of which "+CONFIG_PROVE_RCU=y" seems to be the most important change, but most of it is entirely new for me anyway):

--- /dev/fd/63	2011-05-23 22:48:15.071686130 +0200
+++ /boot/config-2.6.39-0.local2.fc16.i686	2011-05-22 19:42:32.000000000 +0200
@@ -3,2 +3,2 @@
-# Linux/i386 2.6.39-0.fc16.i686 Kernel Configuration
-# Thu May 19 20:07:58 2011
+# Linux/i386 2.6.39-0.local2.fc16.i686 Kernel Configuration
+# Sun May 22 17:51:58 2011
@@ -178,0 +179 @@
+CONFIG_PERF_USE_VMALLOC=y
@@ -185 +186 @@
-# CONFIG_DEBUG_PERF_USE_VMALLOC is not set
+CONFIG_DEBUG_PERF_USE_VMALLOC=y
@@ -229 +230 @@
-# CONFIG_MODULE_FORCE_UNLOAD is not set
+CONFIG_MODULE_FORCE_UNLOAD=y
@@ -258 +259 @@
-CONFIG_INLINE_SPIN_UNLOCK=y
+# CONFIG_INLINE_SPIN_UNLOCK is not set
@@ -260 +261 @@
-CONFIG_INLINE_SPIN_UNLOCK_IRQ=y
+# CONFIG_INLINE_SPIN_UNLOCK_IRQ is not set
@@ -267 +268 @@
-CONFIG_INLINE_READ_UNLOCK=y
+# CONFIG_INLINE_READ_UNLOCK is not set
@@ -269 +270 @@
-CONFIG_INLINE_READ_UNLOCK_IRQ=y
+# CONFIG_INLINE_READ_UNLOCK_IRQ is not set
@@ -276 +277 @@
-CONFIG_INLINE_WRITE_UNLOCK=y
+# CONFIG_INLINE_WRITE_UNLOCK is not set
@@ -278 +279 @@
-CONFIG_INLINE_WRITE_UNLOCK_IRQ=y
+# CONFIG_INLINE_WRITE_UNLOCK_IRQ is not set
@@ -280 +281 @@
-CONFIG_MUTEX_SPIN_ON_OWNER=y
+# CONFIG_MUTEX_SPIN_ON_OWNER is not set
@@ -415 +416 @@
-CONFIG_SPLIT_PTLOCK_CPUS=4
+CONFIG_SPLIT_PTLOCK_CPUS=999999
@@ -501 +502,2 @@
-# CONFIG_ACPI_DEBUG is not set
+CONFIG_ACPI_DEBUG=y
+# CONFIG_ACPI_DEBUG_FUNC_TRACE is not set
@@ -596 +597,0 @@
-CONFIG_PCI_MSI_DEFAULT_ON=y
@@ -1189 +1190 @@
-# CONFIG_CAN_DEBUG_DEVICES is not set
+CONFIG_CAN_DEBUG_DEVICES=y
@@ -1323 +1324 @@
-# CONFIG_CEPH_LIB_PRETTYDEBUG is not set
+CONFIG_CEPH_LIB_PRETTYDEBUG=y
@@ -1520 +1521 @@
-# CONFIG_DRBD_FAULT_INJECTION is not set
+CONFIG_DRBD_FAULT_INJECTION=y
@@ -2115 +2116 @@
-# CONFIG_ATH_DEBUG is not set
+CONFIG_ATH_DEBUG=y
@@ -2131 +2132 @@
-# CONFIG_CARL9170_DEBUGFS is not set
+CONFIG_CARL9170_DEBUGFS=y
@@ -2180 +2181 @@
-# CONFIG_IWLWIFI_DEVICE_TRACING is not set
+CONFIG_IWLWIFI_DEVICE_TRACING=y
@@ -2662 +2662,0 @@
-CONFIG_INPUT_APPLEIR=m
@@ -2695 +2694,0 @@
-CONFIG_CRASH=m
@@ -4522 +4521,2 @@
-# CONFIG_DMADEVICES_DEBUG is not set
+CONFIG_DMADEVICES_DEBUG=y
+CONFIG_DMADEVICES_VDEBUG=y
@@ -4721 +4721 @@
-# CONFIG_EXT4_DEBUG is not set
+CONFIG_EXT4_DEBUG=y
@@ -4726 +4726 @@
-# CONFIG_JBD2_DEBUG is not set
+CONFIG_JBD2_DEBUG=y
@@ -4766 +4766 @@
-# CONFIG_QUOTA_DEBUG is not set
+CONFIG_QUOTA_DEBUG=y
@@ -4817 +4816,0 @@
-CONFIG_TMPFS_XATTR=y
@@ -5020,2 +5019,8 @@
-# CONFIG_DEBUG_OBJECTS is not set
-# CONFIG_SLUB_DEBUG_ON is not set
+CONFIG_DEBUG_OBJECTS=y
+# CONFIG_DEBUG_OBJECTS_SELFTEST is not set
+CONFIG_DEBUG_OBJECTS_FREE=y
+CONFIG_DEBUG_OBJECTS_TIMERS=y
+CONFIG_DEBUG_OBJECTS_WORK=y
+CONFIG_DEBUG_OBJECTS_PERCPU_COUNTER=y
+CONFIG_DEBUG_OBJECTS_ENABLE_DEFAULT=1
+CONFIG_SLUB_DEBUG_ON=y
@@ -5024 +5029,2 @@
-# CONFIG_DEBUG_RT_MUTEXES is not set
+CONFIG_DEBUG_RT_MUTEXES=y
+CONFIG_DEBUG_PI_LIST=y
@@ -5026,4 +5032,6 @@
-# CONFIG_DEBUG_SPINLOCK is not set
-# CONFIG_DEBUG_MUTEXES is not set
-# CONFIG_DEBUG_LOCK_ALLOC is not set
-# CONFIG_PROVE_LOCKING is not set
+CONFIG_DEBUG_SPINLOCK=y
+CONFIG_DEBUG_MUTEXES=y
+CONFIG_DEBUG_LOCK_ALLOC=y
+CONFIG_PROVE_LOCKING=y
+CONFIG_PROVE_RCU=y
+# CONFIG_PROVE_RCU_REPEATEDLY is not set
@@ -5031 +5039,4 @@
-# CONFIG_LOCK_STAT is not set
+CONFIG_LOCKDEP=y
+CONFIG_LOCK_STAT=y
+# CONFIG_DEBUG_LOCKDEP is not set
+CONFIG_TRACE_IRQFLAGS=y
@@ -5040 +5051 @@
-# CONFIG_DEBUG_VM is not set
+CONFIG_DEBUG_VM=y
@@ -5042 +5053 @@
-# CONFIG_DEBUG_WRITECOUNT is not set
+CONFIG_DEBUG_WRITECOUNT=y
@@ -5045,4 +5056,4 @@
-# CONFIG_TEST_LIST_SORT is not set
-# CONFIG_DEBUG_SG is not set
-# CONFIG_DEBUG_NOTIFIERS is not set
-# CONFIG_DEBUG_CREDENTIALS is not set
+CONFIG_TEST_LIST_SORT=y
+CONFIG_DEBUG_SG=y
+CONFIG_DEBUG_NOTIFIERS=y
+CONFIG_DEBUG_CREDENTIALS=y
@@ -5059,2 +5070,8 @@
-# CONFIG_CPU_NOTIFIER_ERROR_INJECT is not set
-# CONFIG_FAULT_INJECTION is not set
+CONFIG_CPU_NOTIFIER_ERROR_INJECT=m
+CONFIG_FAULT_INJECTION=y
+CONFIG_FAILSLAB=y
+CONFIG_FAIL_PAGE_ALLOC=y
+CONFIG_FAIL_MAKE_REQUEST=y
+CONFIG_FAIL_IO_TIMEOUT=y
+CONFIG_FAULT_INJECTION_DEBUG_FS=y
+CONFIG_FAULT_INJECTION_STACKTRACE_FILTER=y
@@ -5062 +5079 @@
-# CONFIG_SYSCTL_SYSCALL_CHECK is not set
+CONFIG_SYSCTL_SYSCALL_CHECK=y
@@ -5100 +5117,2 @@
-# CONFIG_MMIOTRACE is not set
+CONFIG_MMIOTRACE=y
+# CONFIG_MMIOTRACE_TEST is not set
@@ -5106 +5124 @@
-# CONFIG_DMA_API_DEBUG is not set
+CONFIG_DMA_API_DEBUG=y
@@ -5125,3 +5143,3 @@
-# CONFIG_DEBUG_STACK_USAGE is not set
-# CONFIG_DEBUG_PER_CPU_MAPS is not set
-# CONFIG_X86_PTDUMP is not set
+CONFIG_DEBUG_STACK_USAGE=y
+CONFIG_DEBUG_PER_CPU_MAPS=y
+CONFIG_X86_PTDUMP=y
@@ -5130 +5148 @@
-# CONFIG_DEBUG_SET_MODULE_RONX is not set
+CONFIG_DEBUG_SET_MODULE_RONX=y
@@ -5195,0 +5214 @@
+CONFIG_CRYPTO_FIPS=y
@@ -5210 +5229 @@
-CONFIG_CRYPTO_MANAGER_DISABLE_TESTS=y
+# CONFIG_CRYPTO_MANAGER_DISABLE_TESTS is not set
@@ -5373,0 +5393 @@
+CONFIG_CPUMASK_OFFSTACK=y
@@ -5378 +5397,0 @@
-CONFIG_PCI_DEFAULT_USE_CRS=y
Comment 119 Paul Bolle 2011-05-23 17:18:32 EDT
(In reply to comment #117)
> PaulB, would you like to re-run those tests with some config options enabled as
> suggested by PaulM. Also it would be good to run the script to capture the
> rcu traces.
> 
> CONFIG_PROVE_RCU = y
> CONFIG_PREEMPT = y
> CONFIG_DEBUG_OBJECTS_RCU_HEAD = y
> CONFIG_RCU_TRACE = y

0) The first of Paul E. McKenney's suggestions I tried was CONFIG_DEBUG_OBJECTS_RCU_HEAD=y (since Fedora release candidates already ship with CONFIG_PROVE_RCU=y). CONFIG_DEBUG_OBJECTS_RCU_HEAD has some dependencies, so I basically ended up with:

    $ grep -e PREEMPT -e RCU /boot/config-2.6.39-0.local3.fc16.i686 | grep -v "^#"
    CONFIG_TREE_PREEMPT_RCU=y
    CONFIG_PREEMPT_RCU=y
    CONFIG_RCU_FANOUT=32
    CONFIG_PREEMPT_NOTIFIERS=y
    CONFIG_PREEMPT=y
    CONFIG_DEBUG_OBJECTS_RCU_HEAD=y
    CONFIG_DEBUG_PREEMPT=y
    CONFIG_PROVE_RCU=y
    CONFIG_SPARSE_RCU_POINTER=y

(This doesn't yet use CONFIG_RCU_TRACE. Remember that one of the Fedora defaults seems to be CONFIG_PREEMPT_VOLUNTARY=y.)

1) Using that config I seem to be unable to trigger this Oops. Does that make sense?
Comment 120 Vivek Goyal 2011-05-24 09:14:43 EDT
(In reply to comment #118)
> (In reply to comment #114)
> > So if you take fedora final release file and just enable CONFIG_PROVE_RCU=y,
> > then you can reproduce the issue? Or more config options are involved too.
> 
> 0) The Fedora final release I ran was kernel-2.6.39-0.fc16.i686.rpm. The first
> version of that release that I managed to built with a config resembling
> release candidates (but as a vanilla kernel) was
> kernel-vanilla-2.6.39-0.local2.fc16.i686.
> 
> 1) This horror shows the differences in their respective configs:
> 
> diff -U 0 <(rpm2cpio Download/kernel-2.6.39-0.fc16.i686.rpm | cpio --quiet -i
> --to-stdout ./boot/config-2.6.39-0.fc16.i686)
> /boot/config-2.6.39-0.local2.fc16.i686
> 
> 2) That generates this output (of which "+CONFIG_PROVE_RCU=y" seems to be the
> most important change, but most of it is entirely new for me anyway):
> 

Ok, lots of debug options seem to have been turned off. Maybe we can take the Fedora final release config, turn on CONFIG_PROVE_RCU=y, and see if you can reproduce the issue or not.
Comment 121 Paul Bolle 2011-05-24 18:12:39 EDT
(In reply to comment #120)
> Ok, lots of debug options seem to have been turned off. May be we can take
> fedora final release config and turn on CONFIG_PROVE_RCU=y and see if you can
> reproduce the issue or not.

0) If I take a Fedora final release (kernel-2.6.39-0.fc16.i686.rpm) and use CONFIG_PROVE_RCU=y, I end up with these differences (of which a few are due to the fact that I rpmbuilt a vanilla kernel package):

--- /dev/fd/63	2011-05-25 00:05:59.211662288 +0200
+++ /boot/config-2.6.39-0.local4.fc16.i686	2011-05-24 22:13:54.000000000 +0200
@@ -3,2 +3,2 @@
-# Linux/i386 2.6.39-0.fc16.i686 Kernel Configuration
-# Thu May 19 20:07:58 2011
+# Linux/i386 2.6.39-0.local4.fc16.i686 Kernel Configuration
+# Tue May 24 20:15:33 2011
@@ -258 +258 @@
-CONFIG_INLINE_SPIN_UNLOCK=y
+# CONFIG_INLINE_SPIN_UNLOCK is not set
@@ -260 +260 @@
-CONFIG_INLINE_SPIN_UNLOCK_IRQ=y
+# CONFIG_INLINE_SPIN_UNLOCK_IRQ is not set
@@ -267 +267 @@
-CONFIG_INLINE_READ_UNLOCK=y
+# CONFIG_INLINE_READ_UNLOCK is not set
@@ -269 +269 @@
-CONFIG_INLINE_READ_UNLOCK_IRQ=y
+# CONFIG_INLINE_READ_UNLOCK_IRQ is not set
@@ -276 +276 @@
-CONFIG_INLINE_WRITE_UNLOCK=y
+# CONFIG_INLINE_WRITE_UNLOCK is not set
@@ -278 +278 @@
-CONFIG_INLINE_WRITE_UNLOCK_IRQ=y
+# CONFIG_INLINE_WRITE_UNLOCK_IRQ is not set
@@ -280 +280 @@
-CONFIG_MUTEX_SPIN_ON_OWNER=y
+# CONFIG_MUTEX_SPIN_ON_OWNER is not set
@@ -415 +415 @@
-CONFIG_SPLIT_PTLOCK_CPUS=4
+CONFIG_SPLIT_PTLOCK_CPUS=999999
@@ -596 +595,0 @@
-CONFIG_PCI_MSI_DEFAULT_ON=y
@@ -2662 +2660,0 @@
-CONFIG_INPUT_APPLEIR=m
@@ -2695 +2692,0 @@
-CONFIG_CRASH=m
@@ -4817 +4813,0 @@
-CONFIG_TMPFS_XATTR=y
@@ -5026,4 +5022,6 @@
-# CONFIG_DEBUG_SPINLOCK is not set
-# CONFIG_DEBUG_MUTEXES is not set
-# CONFIG_DEBUG_LOCK_ALLOC is not set
-# CONFIG_PROVE_LOCKING is not set
+CONFIG_DEBUG_SPINLOCK=y
+CONFIG_DEBUG_MUTEXES=y
+CONFIG_DEBUG_LOCK_ALLOC=y
+CONFIG_PROVE_LOCKING=y
+CONFIG_PROVE_RCU=y
+# CONFIG_PROVE_RCU_REPEATEDLY is not set
@@ -5030,0 +5029 @@
+CONFIG_LOCKDEP=y
@@ -5031,0 +5031,2 @@
+# CONFIG_DEBUG_LOCKDEP is not set
+CONFIG_TRACE_IRQFLAGS=y
@@ -5378 +5378,0 @@
-CONFIG_PCI_DEFAULT_USE_CRS=y

1) Using that Fedora final release with CONFIG_PROVE_RCU=y (and its dependencies) this issue does not seem to show up (ie, I'm unable to trigger this BUG).
Comment 122 Vivek Goyal 2011-05-25 16:13:04 EDT
Ok, so enabling CONFIG_PROVE_RCU and its associated dependencies did not reproduce the issue. I guess it's time to try Paul's debug patch for detecting blocking in an RCU critical section, along with the RCU traces.

If time permits, it might be a good idea to play with the rest of the config options and try to narrow down which config option really causes this behavior.
Comment 123 Paul E. McKenney 2011-05-25 16:32:34 EDT
My patch requires CONFIG_PROVE_RCU...  So maybe best to follow up on Jens's suggestion.  Alternatively, capture the CPU and context-switch number at the beginning of the RCU read-side critical section, then verify that there is no change in either quantity at the end of the RCU read-side critical section.
Comment 124 Vivek Goyal 2011-05-25 17:00:26 EDT
(In reply to comment #123)
> My patch requires CONFIG_PROVE_RCU...  So maybe best to follow up on Jens's
> suggestion.  

PaulB did try Jens's patch of disabling the cooperating queue logic. But that did not help.

> Alternatively, capture the CPU and context-switch number at the
> beginning of the RCU read-side critical section, then verify that there is no
> change in either quantity at the end of the RCU read-side critical section.

He has got a single-CPU machine, so the CPU number is not going to change.

The context-switch number is interesting data. Where is it stored? I might just write a quick patch to print that info and ask PaulB to try it.
Comment 125 Paul Bolle 2011-05-26 05:17:09 EDT
(In reply to comment #123)
> My patch requires CONFIG_PROVE_RCU...

This doesn't ring a bell. What patch are you referring to here?
Comment 126 Vivek Goyal 2011-05-26 09:32:08 EDT
(In reply to comment #125)
> (In reply to comment #123)
> > My patch requires CONFIG_PROVE_RCU...
> 
> This doesn't ring a bell. What patch are you referring to here?

I guess the following:

https://lkml.org/lkml/2011/5/24/362
Comment 127 Paul E. McKenney 2011-05-26 09:58:44 EDT
Further testing got me this patch replacing https://lkml.org/lkml/2011/5/24/362:

rcu: Restore checks for blocking in RCU read-side critical sections

Long ago, using TREE_RCU with PREEMPT would result in "scheduling
while atomic" diagnostics if you blocked in an RCU read-side critical
section.  However, PREEMPT now implies TREE_PREEMPT_RCU, which defeats
this diagnostic.  This commit therefore adds a replacement diagnostic
based on PROVE_RCU.

Because rcu_lockdep_assert() and lockdep_rcu_dereference() are now being
used for things that have nothing to do with rcu_dereference(), rename
lockdep_rcu_dereference() to lockdep_rcu_suspicious() and add a third
argument that is a string indicating what is suspicious.  This third
argument is passed in from a new third argument to rcu_lockdep_assert().
Update all calls to rcu_lockdep_assert() to add an informative third
argument.

Finally, add a pair of rcu_lockdep_assert() calls from within
rcu_note_context_switch(), one complaining if a context switch occurs
in an RCU-bh read-side critical section and another complaining if a
context switch occurs in an RCU-sched read-side critical section.
These are present only if the PROVE_RCU kernel parameter is enabled.

Again, you must enable PROVE_RCU to see these new diagnostics.  But you
are enabling PROVE_RCU to check out new RCU uses in any case, aren't you?

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>

diff --git a/include/linux/lockdep.h b/include/linux/lockdep.h
index 4aef1dd..bfa66e7 100644
--- a/include/linux/lockdep.h
+++ b/include/linux/lockdep.h
@@ -545,7 +545,7 @@ do {									\
 #endif
 
 #ifdef CONFIG_PROVE_RCU
-extern void lockdep_rcu_dereference(const char *file, const int line);
+void lockdep_rcu_suspicious(const char *file, const int line, const char *s);
 #endif
 
 #endif /* __LINUX_LOCKDEP_H */
diff --git a/include/linux/rcupdate.h b/include/linux/rcupdate.h
index 58b13f1..fb2933d 100644
--- a/include/linux/rcupdate.h
+++ b/include/linux/rcupdate.h
@@ -297,19 +297,20 @@ extern int rcu_my_thread_group_empty(void);
 /**
  * rcu_lockdep_assert - emit lockdep splat if specified condition not met
  * @c: condition to check
+ * @s: informative message
  */
-#define rcu_lockdep_assert(c)						\
+#define rcu_lockdep_assert(c, s)					\
 	do {								\
 		static bool __warned;					\
 		if (debug_lockdep_rcu_enabled() && !__warned && !(c)) {	\
 			__warned = true;				\
-			lockdep_rcu_dereference(__FILE__, __LINE__);	\
+			lockdep_rcu_suspicious(__FILE__, __LINE__, s);	\
 		}							\
 	} while (0)
 
 #else /* #ifdef CONFIG_PROVE_RCU */
 
-#define rcu_lockdep_assert(c) do { } while (0)
+#define rcu_lockdep_assert(c, s) do { } while (0)
 
 #endif /* #else #ifdef CONFIG_PROVE_RCU */
 
@@ -338,14 +339,16 @@ extern int rcu_my_thread_group_empty(void);
 #define __rcu_dereference_check(p, c, space) \
 	({ \
 		typeof(*p) *_________p1 = (typeof(*p)*__force )ACCESS_ONCE(p); \
-		rcu_lockdep_assert(c); \
+		rcu_lockdep_assert(c, "suspicious rcu_dereference_check()" \
+				      " usage"); \
 		rcu_dereference_sparse(p, space); \
 		smp_read_barrier_depends(); \
 		((typeof(*p) __force __kernel *)(_________p1)); \
 	})
 #define __rcu_dereference_protected(p, c, space) \
 	({ \
-		rcu_lockdep_assert(c); \
+		rcu_lockdep_assert(c, "suspicious rcu_dereference_protected()" \
+				      " usage"); \
 		rcu_dereference_sparse(p, space); \
 		((typeof(*p) __force __kernel *)(p)); \
 	})
@@ -359,7 +362,9 @@ extern int rcu_my_thread_group_empty(void);
 #define __rcu_dereference_index_check(p, c) \
 	({ \
 		typeof(p) _________p1 = ACCESS_ONCE(p); \
-		rcu_lockdep_assert(c); \
+		rcu_lockdep_assert(c, \
+				   "suspicious rcu_dereference_index_check()" \
+				   " usage"); \
 		smp_read_barrier_depends(); \
 		(_________p1); \
 	})
diff --git a/kernel/lockdep.c b/kernel/lockdep.c
index 53a6895..9789028 100644
--- a/kernel/lockdep.c
+++ b/kernel/lockdep.c
@@ -3798,7 +3798,7 @@ void lockdep_sys_exit(void)
 	}
 }
 
-void lockdep_rcu_dereference(const char *file, const int line)
+void lockdep_rcu_suspicious(const char *file, const int line, const char *s)
 {
 	struct task_struct *curr = current;
 
@@ -3807,9 +3807,10 @@ void lockdep_rcu_dereference(const char *file, const int line)
 		return;
 #endif /* #ifdef CONFIG_PROVE_RCU_REPEATEDLY */
 	/* Note: the following can be executed concurrently, so be careful. */
-	printk("\n===================================================\n");
-	printk(  "[ INFO: suspicious rcu_dereference_check() usage. ]\n");
-	printk(  "---------------------------------------------------\n");
+	printk("\n===============================\n");
+	printk(  "[ INFO: suspicious RCU usage. ]\n");
+	printk(  "-------------------------------\n");
+	printk(  "%s\n", s);
 	printk("%s:%d invoked rcu_dereference_check() without protection!\n",
 			file, line);
 	printk("\nother info that might help us debug this:\n\n");
@@ -3818,4 +3819,4 @@ void lockdep_rcu_dereference(const char *file, const int line)
 	printk("\nstack backtrace:\n");
 	dump_stack();
 }
-EXPORT_SYMBOL_GPL(lockdep_rcu_dereference);
+EXPORT_SYMBOL_GPL(lockdep_rcu_suspicious);
diff --git a/kernel/pid.c b/kernel/pid.c
index 57a8346..a7577b3 100644
--- a/kernel/pid.c
+++ b/kernel/pid.c
@@ -419,7 +419,9 @@ EXPORT_SYMBOL(pid_task);
  */
 struct task_struct *find_task_by_pid_ns(pid_t nr, struct pid_namespace *ns)
 {
-	rcu_lockdep_assert(rcu_read_lock_held());
+	rcu_lockdep_assert(rcu_read_lock_held(),
+			   "find_task_by_pid_ns() needs rcu_read_lock()"
+			   " protection");
 	return pid_task(find_pid_ns(nr, ns), PIDTYPE_PID);
 }
 
diff --git a/kernel/rcutree.c b/kernel/rcutree.c
index 88547c8..e043b4d 100644
--- a/kernel/rcutree.c
+++ b/kernel/rcutree.c
@@ -153,6 +153,12 @@ void rcu_bh_qs(int cpu)
  */
 void rcu_note_context_switch(int cpu)
 {
+	rcu_lockdep_assert(!rcu_read_lock_bh_held(),
+			   "Illegal context switch in RCU-bh"
+			   " read-side critical section");
+	rcu_lockdep_assert(!rcu_read_lock_sched_held(),
+			   "Illegal context switch in RCU-sched"
+			   " read-side critical section");
 	rcu_sched_qs(cpu);
 	rcu_preempt_note_context_switch(cpu);
 }
Comment 128 Paul Bolle 2011-05-27 03:31:01 EDT
(In reply to comment #127)
> Further testing got me this patch [...]:
> 
> rcu: Restore checks for blocking in RCU read-side critical sections
> 
> [...]

0) I've applied that patch to (mostly vanilla) v2.6.39. Specifically, I applied commit c15d76f26712bd5228aa0c6af7a7e7c492a812c9, from one of the branches of (one of?) your git repositor(y|ies) on git.kernel.org. Is that the correct commit?

1) None of the messages from that patch show up in my console or in my logs, even though I did manage (on the first try) to hit this Oops. Should I perhaps turn a knob here or there to get things to work? Or are we apparently looking in the wrong direction?
Comment 129 Chuck Ebbert 2011-06-24 14:17:41 EDT
Upstream commit:

http://git.kernel.org/linus/ab4bd22d3cce6977dc039664cc2d052e3147d662
Comment 130 Michal Jaegermann 2011-07-30 18:10:56 EDT
(In reply to comment #129)
> Upstream commit:
> 
> http://git.kernel.org/linus/ab4bd22d3cce6977dc039664cc2d052e3147d662

I got hit with a series of recursive faults, which look remarkably similar to the original issue, i.e.:

[ 3923.519247] Pid: 22341, comm: mandb Not tainted 3.1.0-0.rc0.git12.1.fc17.x86_64 #1 To Be Filled By O.E.M. To Be Filled By O.E.M./SK8V
[ 3923.520333] RIP: 0010:[<ffffffff81240bef>]  [<ffffffff81240bef>] cic_free_func+0x14/0x8c
......

while trying the 3.1.0-0.rc0.git12.1.fc17.x86_64 kernel. When this happened for the first time my machine locked up entirely, first scrolling off the screen any information about what possibly started it. In the second incident the machine stayed alive, and a complete dmesg from that time is in attachment 515990 [details] to bug 726877.
Before that happened, other locking problems were reported.
Comment 131 Paul Bolle 2011-08-01 05:26:29 EDT
(In reply to comment #130)
> (In reply to comment #129)
> > Upstream commit:
> > 
> > http://git.kernel.org/linus/ab4bd22d3cce6977dc039664cc2d052e3147d662

0) And, for some reason, 9b50902db5eb8a220160fb89e95aa11967998d12 too! I have no clue what happened there, but I guess git is smart enough to handle two identical commits.
 
> I got hit with a series of recursive faults, which look remarkably similar to
> the original issue, i.e.:
> 
> [ 3923.519247] Pid: 22341, comm: mandb Not tainted
> 3.1.0-0.rc0.git12.1.fc17.x86_64 #1 To Be Filled By O.E.M. To Be Filled By O.E.M./SK8V
> [ 3923.520333] RIP: 0010:[<ffffffff81240bef>]  [<ffffffff81240bef>]
> cic_free_func+0x14/0x8c
> ......
> 
> while trying 3.1.0-0.rc0.git12.1.fc17.x86_64 kernel.

1) Ditto here, but on a i686: 

BUG: unable to handle kernel paging request at 6b6b6b43
IP: [<c05d7d80>] cic_free_func+0x10/0x6c
*pde = 00000000 
Oops: 0000 [#1] SMP 
Modules linked in: arc4 ppdev microcode joydev pcspkr snd_intel8x0m snd_intel8x0 ath5k snd_ac97_codec ac97_bus ath snd_seq snd_seq_device thinkpad_acpi mac80211 snd_pcm parport_pc parport cfg80211 e1000 rfkill snd_timer iTCO_wdt snd soundcore snd_page_alloc iTCO_vendor_support i2c_i801 binfmt_misc uinput yenta_socket video radeon ttm drm_kms_helper drm i2c_algo_bit i2c_core [last unloaded: scsi_wait_scan]

Pid: 25711, comm: mandb Not tainted 3.0.0-1.local1.fc16.i686 #1 IBM        /       
EIP: 0060:[<c05d7d80>] EFLAGS: 00010286 CPU: 0
EIP is at cic_free_func+0x10/0x6c
EAX: f67bf900 EBX: 6b6b6b43 ECX: 00000246 EDX: 6b6b6b43
ESI: f67bf900 EDI: c05d7d70 EBP: c75e9f04 ESP: c75e9ef0
 DS: 007b ES: 007b FS: 00d8 GS: 00e0 SS: 0068
Process mandb (pid: 25711, ti=c75e8000 task=e9518000 task.ti=c75e8000)
Stack:
 6b6b6b6b f67bf900 6b6b6b6b f67bf900 c05d7d70 c75e9f18 c05d7610 f67bf900
 e9518000 e95183d8 c75e9f20 c05d7630 c75e9f2c c05cc8ba f67bf900 c75e9f40
 c05cc94a e9518000 00000012 c75e9f74 c75e9f8c c04423e2 eeb13080 cb4edb08
Call Trace:
 [<c05d7d70>] ? cfqq_process_refs+0x18/0x18
 [<c05d7610>] call_for_each_cic+0x30/0x3e
 [<c05d7630>] cfq_free_io_context+0x12/0x14
 [<c05cc8ba>] put_io_context+0x34/0x5c
 [<c05cc94a>] exit_io_context+0x68/0x6d
 [<c04423e2>] do_exit+0x62f/0x66a
 [<c04f93b2>] ? fsnotify_modify+0x64/0x6c
 [<c0442634>] do_group_exit+0x63/0x86
 [<c044266f>] sys_exit_group+0x18/0x18
 [<c083f69f>] sysenter_do_call+0x12/0x38
Code: 5f 5d c3 55 89 e5 3e 8d 74 26 00 8b 50 3c 03 50 40 8b 00 29 d0 79 02 0f 0b 5d c3 55 89 e5 57 56 53 83 ec 08 3e 8d 74 26 00 89 d3 <8b> 12 89 c6 f6 c2 01 75 02 0f 0b 8d 78 08 89 f8 89 55 ec e8 e0 
EIP: [<c05d7d80>] cic_free_func+0x10/0x6c SS:ESP 0068:c75e9ef0
CR2: 000000006b6b6b43
---[ end trace 4694910e102a3c2a ]---

> Before that happened other locking problems were reported.

2) Apparently not relevant, as I just triggered it too - on a self-compiled vanilla 3.0 running (sort of stock) Rawhide.

3) I spent countless hours trying to squash this bug in the last three weeks of May 2011 and the first week of June 2011. It was one of the longest debugging sessions I have ever participated in. I'm not sure I have the desire to once again dive into this matter....

4) Michal, can you trigger this at will (by running mandb by hand, while forcing it to redo its work from scratch, by using a command line switch which I already forgot)? If so, we might be able to speed up debugging (as it is not just my machine any more, assuming I can also trigger this at will).

5) If I recall correctly I drafted a patch for this bug (with Paul McKenney's assistance) that dived a bit deeper into this than the fix that Jens Axboe actually merged (and which really has been merged in v3.0, hasn't it?). If Michal and/or I are once again able to trigger this by running mandb manually the first thing we should try is to make the RCU locking around ioc->ioc_data not just good enough but, well, perfect.
Comment 132 Michal Jaegermann 2011-08-01 12:19:28 EDT
(In reply to comment #131)
> 
> 4) Michal, can you trigger this at will (by running mandb by hand, while
> forcing it to redo its work from scratch, by using a command line switch which
> I already forgot)?

I think that you mean 'mandb -c', as this forces redoing all mandb work from scratch. I am afraid I cannot.  I tried that a few times without causing any ill effects (not counting assorted complaints from mandb when not running '--quiet').

It is quite possible that a mandb run can do that if it overlaps with other substantial I/O, like running 'yum update' with a bunch of packages, but my attempts to recreate something of that sort have not yet triggered any kernel faults.

> If so, we might be able to speed up debugging (as it is not
> just my machine any more, assuming I can also trigger this at will).

I really have no idea why for some time interval I was seeing this problem quite regularly; after that there was a long hiatus when I could not record it at all while others were hit by it; and now it has apparently returned, causing a total lockup on one run, a series of recursive faults but no lockup on the next one, and so far nothing on the instance after that.  Apparently I have to look the other way. :-)
Comment 133 Vivek Goyal 2011-08-01 13:08:51 EDT
(In reply to comment #131)

> 
> 5) If I recall correctly I drafted a patch for this bug (with Paul McKenney's
> assistance) that dived a bit deeper into this than the fix that Jens Axboe
> actually merged (and which really has been merged in v3.0, hasn't it?). If
> Michal and/or I are once again able to trigger this by running mandb manually
> the first thing we should try is to make the RCU locking around ioc->ioc_data
> not just good enough but, well, perfect.

Paul,

Did the following commits take care of the rest of the RCU issues, or are there still more concerns?

commit 726e99ab88db059fe1422e15376ae404f8c66eb4
Author: Shaohua Li <shaohua.li@intel.com>
Date:   Mon Jun 27 09:03:48 2011 +0200

    cfq-iosched: make code consistent

commit 3181faa85bda3dc3f5e630a1846526c9caaa38e3
Author: Shaohua Li <shaohua.li@intel.com>
Date:   Mon Jun 27 09:03:47 2011 +0200

    cfq-iosched: fix a rcu warning
Comment 134 Paul Bolle 2011-08-02 06:57:20 EDT
(In reply to comment #132)
> I think that you mean 'mandb -c' as this forces redoing all mandb work from
> scratch.

0) Yes, that's the one.

> I am afraid I cannot.  I tried that a few times without causing any ill
> effects (not counting assorted complaints from mandb when not running
> '--quiet').

1) That's a pity.

> Quite possible that mandb run can do that if overlapped at the same time with
> another substantial I/O, like running 'yum update' with a bunch of packages,
> but my attempts to recreate something of that sort also did not yet trigger any
> kernel faults.

2) It's my feeling you can increase the chance of hitting this by having the daily cron run of mandb do more work than necessary. You'll need to edit /etc/sysconfig/man-db and add that -c switch. With a bit of, well, luck you'll hit this Oops every now and then and can keep us informed if that really happens (not every time, I'd say, but only after things like kernel version bumps).

(Note that I seem to remember that at one point even programs like evolution and git sometimes triggered this bug.)

3) Do the Fedora kernel maintainers agree with this suggestion?
Comment 135 Paul Bolle 2011-08-02 07:21:55 EDT
(In reply to comment #133)
> (In reply to comment #131)
> > 5) If I recall correctly I drafted a patch for this bug (with Paul McKenney's
> > assistance) that dived a bit deeper into this than the fix that Jens Axboe
> > actually merged (and which really has been merged in v3.0, hasn't it?). If
> > Michal and/or I are once again able to trigger this by running mandb manually
> > the first thing we should try is to make the RCU locking around ioc->ioc_data
> > not just good enough but, well, perfect.
> Did the following commits take care of the rest of the RCU issues, or are
> there still more concerns?
> 
> commit 726e99ab88db059fe1422e15376ae404f8c66eb4
> [...]
>     cfq-iosched: make code consistent
> 
> commit 3181faa85bda3dc3f5e630a1846526c9caaa38e3
> [...]
>     cfq-iosched: fix a rcu warning

0) It turns out I misremembered things. I never bothered to submit an updated patch (after Paul McKenney's review) for several, uninteresting reasons (like Jens Axboe being offline for some weeks after we finally pinpointed the problem in early June 2011, my not really wanting to spend even more time on a bug that I had already spent so much time on, etc.).

1) Anyhow, at first glance those two commits do seem to tighten things around ioc->ioc_data.

2) So I guess there are two things I could do right now:
    i) see if I can again trigger this manually (with a decent chance)
    ii) if I can, see what happens if I again rip out all references to ioc->ioc_data (the main side effect of that should be that it slows CFQ down a bit). 

[ 3) It might be interesting to note that I haven't yet used v3.0 a lot, and stuck to some self-patched version of v2.6.39 (because the fix for this bug was only added in -rc?, because of a summer holiday, and because after that I spent quite some time tracking down bug #722472, etc.). So perhaps there's still some undiscovered problem, and the machine that so easily triggers this just needs to be exposed to v3.0 a bit more... ]
Comment 136 Paul Bolle 2011-08-02 07:48:51 EDT
(In reply to comment #135)
> 2) So I guess there are two things I could do right now:
>     i) see if I can again trigger this manually (with a decent chance)

Well, I hit that same Oops on the first try of "mandb -c"! So I guess it's time to recompile (but now with CFQ as a module) and see whether I can torture CFQ into confessing its latest secret.

>     ii) [...].
Comment 137 Paul Bolle 2011-08-02 15:25:31 EDT
(In reply to comment #135)
0) Please be aware that advanced agitation might interfere with my actions and cloud my judgement. 

>     ii) if I can, see what happens if I again rip out all references to
> ioc->ioc_data (the main side effect of that should be that it slows CFQ down a
> bit). 

1) Ripping out ioc_data (the one-hit cic cache) has no effect. "mandb -c" reliably triggers an Oops. This suggests we've hit yet another CFQ bug. I have no clue whatsoever what that bug might be. Jens? Vivek?
Comment 138 Vivek Goyal 2011-08-02 16:23:47 EDT
Paul,

- So is it the same machine where you were hitting the bug with mandb -c?
- What kernel version are you using? Make sure you are using the latest upstream
  kernel.
- Can you also paste the latest backtrace?
- Is it crashing at the same place where we are accessing a freed cic?
- I thought that in the past you confirmed that mandb -c was working fine (after
  your fix). I am wondering what changed since then such that the same issue has
  returned.
Comment 139 Paul Bolle 2011-08-03 05:41:09 EDT
(In reply to comment #138)
> - So is it the same machine where you were hitting the bug with mandb -c?

Yes, the same machine that so easily triggered this bug two or three months ago.

> - What kernel version are you using? Make sure you are using the latest upstream
>   kernel.

v3.0 (I'm not running the daily snapshots, and wait for the release candidates to appear). Note that I've done two things:
- made CFQ modular
- removed all traces of the ioc_data one-hit cache (just to be sure this is something different).

> - Can you also paste latest backtrace?

OK, this is from my _current_ session (bugzilla messes up the layout a bit). The crash is still not fatal:

[21265.163144] BUG: unable to handle kernel paging request at 6b6b6b43
[21265.164022] IP: [<f7d55fe0>] cic_free_func+0x10/0x6c [cfq_iosched]
[21265.164022] *pde = 00000000 
[21265.164022] Oops: 0000 [#1] SMP 
[21265.164022] Modules linked in: cfq_iosched tcp_lp fuse arc4 ppdev snd_intel8x0m snd_intel8x0 snd_ac97_codec ac97_bus snd_seq snd_seq_device microcode
 snd_pcm joydev snd_timer ath5k pcspkr snd_page_alloc thinkpad_acpi i2c_i801 ath mac80211 snd soundcore cfg80211 rfkill parport_pc parport e1000 iTCO_wd
t iTCO_vendor_support binfmt_misc uinput yenta_socket video radeon ttm drm_kms_helper drm i2c_algo_bit i2c_core [last unloaded: scsi_wait_scan]
[21265.164022] 
[21265.164022] Pid: 18514, comm: mandb Not tainted 3.0.0-1.local2.fc17.i686 #1 IBM        /       
[21265.164022] EIP: 0060:[<f7d55fe0>] EFLAGS: 00010286 CPU: 0
[21265.164022] EIP is at cic_free_func+0x10/0x6c [cfq_iosched]
[21265.164022] EAX: f242e400 EBX: 6b6b6b43 ECX: 00000246 EDX: 6b6b6b43
[21265.164022] ESI: f242e400 EDI: f7d55fd0 EBP: df92df04 ESP: df92def0
[21265.164022]  DS: 007b ES: 007b FS: 00d8 GS: 00e0 SS: 0068
[21265.164022] Process mandb (pid: 18514, ti=df92c000 task=ed1554c0 task.ti=df92c000)
[21265.164022] Stack:
[21265.164022]  c0468df7 c0af0cb0 6b6b6b6b f242e400 f7d55fd0 df92df18 f7d54510 f242e400
[21265.164022]  ed1554c0 ed155898 df92df20 f7d54530 df92df2c c05cc8ba f242e400 df92df40
[21265.164022]  c05cc94a ed1554c0 00000012 df92df74 df92df8c c04423e2 eedf2b5c eedf2b00
[21265.164022] Call Trace:
[21265.164022]  [<c0468df7>] ? lock_is_held+0x73/0x7b
[21265.164022]  [<f7d55fd0>] ? cfqq_process_refs+0x18/0x18 [cfq_iosched]
[21265.164022]  [<f7d54510>] call_for_each_cic+0x3e/0x4c [cfq_iosched]
[21265.164022]  [<f7d54530>] cfq_free_io_context+0x12/0x14 [cfq_iosched]
[21265.164022]  [<c05cc8ba>] put_io_context+0x34/0x5c
[21265.164022]  [<c05cc94a>] exit_io_context+0x68/0x6d
[21265.164022]  [<c04423e2>] do_exit+0x62f/0x66a
[21265.164022]  [<c0442634>] do_group_exit+0x63/0x86
[21265.164022]  [<c044266f>] sys_exit_group+0x18/0x18
[21265.164022]  [<c083b0df>] sysenter_do_call+0x12/0x38
[21265.164022] Code: 5f 5d c3 55 89 e5 3e 8d 74 26 00 8b 50 3c 03 50 40 8b 00 29 d0 79 02 0f 0b 5d c3 55 89 e5 57 56 53 83 ec 08 3e 8d 74 26 00 89 d3 <8
b> 12 89 c6 f6 c2 01 75 02 0f 0b 8d 78 08 89 f8 89 55 ec e8 c8 
[21265.164022] EIP: [<f7d55fe0>] cic_free_func+0x10/0x6c [cfq_iosched] SS:ESP 0068:df92def0
[21265.164022] CR2: 000000006b6b6b43
[21265.363749] ---[ end trace a779435da6d0a152 ]---

Shortly afterwards:
[21484.785098] BUG: unable to handle kernel paging request at 6b6b6b43
[21484.786020] IP: [<f7d55fe0>] cic_free_func+0x10/0x6c [cfq_iosched]
[21484.786020] *pde = 00000000 
[21484.786020] Oops: 0000 [#2] SMP 
[21484.786020] Modules linked in: cfq_iosched tcp_lp fuse arc4 ppdev snd_intel8x0m snd_intel8x0 snd_ac97_codec ac97_bus snd_seq snd_seq_device microcode
 snd_pcm joydev snd_timer ath5k pcspkr snd_page_alloc thinkpad_acpi i2c_i801 ath mac80211 snd soundcore cfg80211 rfkill parport_pc parport e1000 iTCO_wd
t iTCO_vendor_support binfmt_misc uinput yenta_socket video radeon ttm drm_kms_helper drm i2c_algo_bit i2c_core [last unloaded: scsi_wait_scan]
[21484.786020] 
[21484.786020] Pid: 8230, comm: mandb Tainted: G      D     3.0.0-1.local2.fc17.i686 #1 IBM        /       
[21484.786020] EIP: 0060:[<f7d55fe0>] EFLAGS: 00010286 CPU: 0
[21484.786020] EIP is at cic_free_func+0x10/0x6c [cfq_iosched]
[21484.786020] EAX: f242e880 EBX: 6b6b6b43 ECX: 00000286 EDX: 6b6b6b43
[21484.786020] ESI: f242e880 EDI: f7d55fd0 EBP: ed0edf04 ESP: ed0edef0
[21484.786020]  DS: 007b ES: 007b FS: 00d8 GS: 00e0 SS: 0068
[21484.786020] Process mandb (pid: 8230, ti=ed0ec000 task=ed7e54c0 task.ti=ed0ec000)
[21484.786020] Stack:
[21484.786020]  00000001 00000286 6b6b6b6b f242e880 f7d55fd0 ed0edf18 f7d54510 f242e880
[21484.786020]  ed7e54c0 ed7e5898 ed0edf20 f7d54530 ed0edf2c c05cc8ba f242e880 ed0edf40
[21484.786020]  c05cc94a ed7e54c0 00000012 ed0edf74 ed0edf8c c04423e2 eef3a89c eef3a840
[21484.786020] Call Trace:
[21484.786020]  [<f7d55fd0>] ? cfqq_process_refs+0x18/0x18 [cfq_iosched]
[21484.786020]  [<f7d54510>] call_for_each_cic+0x3e/0x4c [cfq_iosched]
[21484.786020]  [<f7d54530>] cfq_free_io_context+0x12/0x14 [cfq_iosched]
[21484.786020]  [<c05cc8ba>] put_io_context+0x34/0x5c
[21484.786020]  [<c05cc94a>] exit_io_context+0x68/0x6d
[21484.786020]  [<c04423e2>] do_exit+0x62f/0x66a
[21484.786020]  [<c04f8352>] ? filp_close+0x5c/0x64
[21484.786020]  [<c0442634>] do_group_exit+0x63/0x86
[21484.786020]  [<c044266f>] sys_exit_group+0x18/0x18
[21484.786020]  [<c083b0df>] sysenter_do_call+0x12/0x38
[21484.786020] Code: 5f 5d c3 55 89 e5 3e 8d 74 26 00 8b 50 3c 03 50 40 8b 00 29 d0 79 02 0f 0b 5d c3 55 89 e5 57 56 53 83 ec 08 3e 8d 74 26 00 89 d3 <8
b> 12 89 c6 f6 c2 01 75 02 0f 0b 8d 78 08 89 f8 89 55 ec e8 c8 
[21484.786020] EIP: [<f7d55fe0>] cic_free_func+0x10/0x6c [cfq_iosched] SS:ESP 0068:ed0edef0
[21484.786020] CR2: 000000006b6b6b43
[21485.009510] ---[ end trace a779435da6d0a153 ]---

This last one is repeated twice, almost identical each time (so it occurred three times in total).

> - Is it crashing at the same place where we are accessing a freed cic?

I don't (yet) know what happens. The only thing I note is that the faulting address is 0x6b6b6b43. This suggests something is fed 0x6b6b6b6b somewhere and wants to use the address 0x28 below that address (via a container_of()-like mechanism?).

> - I thought that in the past you confirmed that mandb -c is working fine (after your fix).

Yes.

> I am wondering what changed since then that same issue has
> returned.

No idea. But it looks like this is not the same issue (as I've removed ioc->ioc_data). It does look similar (i.e., somehow stuff gets out of sync and we're suddenly using an invalid cic).
Comment 140 Paul Bolle 2011-08-03 06:12:26 EDT
(In reply to comment #139)
> This suggests something is fed 6b6b6b6b somewhere and
> wants to use the address at an 0x28 offset from that address (via a
> container_of() like mechanism?).

0) On 32-bit x86, 0x28 is the offset of the hlist_node cic_list in struct cfq_io_context (four pointers, four longs, and one struct of two pointers equal (10 * 4 =) 0x28 bytes). In Michal's backtrace I see 0x6b6b6b1b mentioned. Michal was running 64-bit x86. On 64-bit x86, six pointers and four longs equal (10 * 8 =) 0x50 bytes. Which matches neatly with the above.

1) This could be a red herring, of course, and wrongly calculated too, but for now we might focus on whether we can find a spot where a struct cfq_io_context might be used while invalid.
Comment 141 Vivek Goyal 2011-08-03 17:37:45 EDT
(In reply to comment #140)
> (In reply to comment #139)
> > This suggests something is fed 6b6b6b6b somewhere and
> > wants to use the address at an 0x28 offset from that address (via a
> > container_of() like mechanism?).
> 
> 0) On 32 bit x86 0x28 is the offset of hlist_node cic_list in struct
> cfq_io_context (four pointers, four longs, and one struct of two pointers equal
> (10 * 4) = 0x28 bytes). In Michal's backtrace I see 0x6b6b6b1b mentioned.
> Michal was running 64 bit x86. On 64 bit x86 six pointers and four longs equal
> (10 * 8 =) 0x50 bytes. Which matches neatly with the above.
> 
> 1) This could be a red herring, of course, and wrongly calculated too, but for
> now we might focus on whether we can find a spot where a struct cfq_io_context
> might be used while invalid.

This explanation makes sense. I am not sure what offset 0x10 in cic_free_func() is, but it looks like it could be the following:

hlist_del_rcu(&cic->cic_list);

So if we have fetched a bad cic pointer (because the previous cic was freed and got
poisoned before the RCU grace period was over), we would oops. Can you disassemble your vmlinux and confirm which is the faulting instruction? It looks like we are back to the original issue of how the previous cic got poisoned while we are in an RCU read lock.
Comment 142 Vivek Goyal 2011-08-03 17:48:23 EDT
Paul,

It might make sense to apply my cfq debug patch V2 again and run again to verify it is the same issue where we are freeing up a cic before rcu period is over.
Comment 143 Vivek Goyal 2011-08-03 18:24:16 EDT
(In reply to comment #141)
> 
> This explanation makes sense. I am not sure what is offset 0x10 in
> cic_free_func() but it looks like it could be following.
> 
> hlist_del_rcu(&cic->cic_list);
> 

In fact it should be the following access which caused the oops:

unsigned long dead_key = (unsigned long) cic->key;
Comment 144 Paul Bolle 2011-08-04 02:27:42 EDT
(In reply to comment #141)
> Can you disassemble your
> vmlinux and confirm which is the faulting instruction?

(In reply to comment #143)
> In fact it should be following access which caused oops.
> 
> unsigned long dead_key = (unsigned long) cic->key;

That is correct, see this gdb session with cfq-iosched.ko (a small benefit of making cfq modular):

$ gdb block/cfq-iosched.ko
Reading symbols from /home/[...]/Rpmbuild/BUILD/kernel-3.0.fc17/linux-3.0.i686/block/cfq-iosched.ko...done.
(gdb) list cic_free_func
2695	{
2696		call_rcu(&cic->rcu_head, cfq_cic_free_rcu);
2697	}
2698	
2699	static void cic_free_func(struct io_context *ioc, struct cfq_io_context *cic)
2700	{
2701		unsigned long flags;
2702		unsigned long dead_key = (unsigned long) cic->key;
2703	
2704		BUG_ON(!(dead_key & CIC_DEAD_KEY));
(gdb) info line 2702
Line 2702 of "/home/peb/Rpmbuild/BUILD/kernel-3.0.fc17/linux-3.0.i686/block/cfq-iosched.c" starts at address 0x2004 <cic_free_func+16>
   and ends at 0x2006 <cic_free_func+18>.
(gdb) info line 2704
Line 2704 of "/home/peb/Rpmbuild/BUILD/kernel-3.0.fc17/linux-3.0.i686/block/cfq-iosched.c" starts at address 0x2008 <cic_free_func+20>
   and ends at 0x200f <cic_free_func+27>.
(gdb) disassemble /m 0x2004,0x200f
Dump of assembler code from 0x2004 to 0x200f:
2700	{
   0x00002006 <cic_free_func+18>:	mov    esi,eax

2701		unsigned long flags;
2702		unsigned long dead_key = (unsigned long) cic->key;
   0x00002004 <cic_free_func+16>:	mov    edx,DWORD PTR [edx]

2703	
2704		BUG_ON(!(dead_key & CIC_DEAD_KEY));
   0x00002008 <cic_free_func+20>:	test   dl,0x1
   0x0000200b <cic_free_func+23>:	jne    0x200f <cic_free_func+27>
   0x0000200d <cic_free_func+25>:	ud2    

End of assembler dump.
(gdb) disassemble /r 0x2004,0x200f
Dump of assembler code from 0x2004 to 0x200f:
   0x00002004 <cic_free_func+16>:	8b 12	mov    edx,DWORD PTR [edx]
   0x00002006 <cic_free_func+18>:	89 c6	mov    esi,eax
   0x00002008 <cic_free_func+20>:	f6 c2 01	test   dl,0x1
   0x0000200b <cic_free_func+23>:	75 02	jne    0x200f <cic_free_func+27>
   0x0000200d <cic_free_func+25>:	0f 0b	ud2    
End of assembler dump.

(Please note that addresses are in hex but offsets are in decimal.)

The output from "disassemble /r" matches the faulting code (according to the Oops) perfectly.
Comment 145 Paul Bolle 2011-08-04 02:35:54 EDT
(In reply to comment #144)
> The output from "disassemble /r" matches the faulting code (according to the
> Oops) perfectly.

Would it be possible to double check this using the debuginfo rpm for the Fedora kernel that triggered Michal's Oops (see attachment 515990 [details])?
Comment 146 Paul Bolle 2011-08-04 03:46:53 EDT
(In reply to comment #142)
> It might make sense to apply my cfq debug patch V2 again and run again to
> verify it is the same issue where we are freeing up a cic before rcu period is
> over.

0) I have some doubts.

1) We've determined that a simple access of cic->key triggers a page fault (its address being 0x6b6b6b43). key is the first member of struct cfq_io_context, so 0x6b6b6b43 must be the address of the cic too, mustn't it? That debug patch looks for a cic with address 0x6b6b6b6b, so we will not find this exact occurrence of cic.

2) But perhaps (it's hard reasoning about this stuff) some code was fed cic->cic_list and did a container_of()-like call on it (e.g., the RCU code, which has lots of those) which _might_ give us (0x6b6b6b6b - 0x28 =) 0x6b6b6b43. If so, I could run that debug patch V2; if not, it would be better to see what a debug patch could better track.
Comment 147 Paul Bolle 2011-08-04 06:17:26 EDT
(In reply to comment #145)
> Would it be possible to double check this using the debuginfo rpm for the
> Fedora compiled that triggered Michal's Oops (see attachment 515990 [details])?

0) Yes, trivial actually, which is rather impressive. Anyway, see this gdb session with vmlinux (as shipped with kernel-debuginfo-3.1.0-0.rc0.git12.1.fc17.x86_64):

$ gdb /usr/lib/debug/lib/modules/3.1.0-0.rc0.git12.1.fc17.x86_64/vmlinux
Reading symbols from /usr/lib/debug/lib/modules/3.1.0-0.rc0.git12.1.fc17.x86_64/vmlinux...done.
(gdb) list cic_free_func
2699	{
2700		call_rcu(&cic->rcu_head, cfq_cic_free_rcu);
2701	}
2702	
2703	static void cic_free_func(struct io_context *ioc, struct cfq_io_context *cic)
2704	{
2705		unsigned long flags;
2706		unsigned long dead_key = (unsigned long) cic->key;
2707	
2708		BUG_ON(!(dead_key & CIC_DEAD_KEY));
(gdb) info line 2706
Line 2706 of "block/cfq-iosched.c"
   starts at address 0xffffffff81240bef <cic_free_func+20>
   and ends at 0xffffffff81240bf2 <cic_free_func+23>.
(gdb) info line 2708
Line 2708 of "block/cfq-iosched.c"
   starts at address 0xffffffff81240bf8 <cic_free_func+29>
   and ends at 0xffffffff81240c00 <cic_free_func+37>.
(gdb) disassemble /m 0xffffffff81240bef,0xffffffff81240c00
Dump of assembler code from 0xffffffff81240bef to 0xffffffff81240c00:
2704	{
   0xffffffff81240bf2 <cic_free_func+23>:	mov    %rdi,%r13
   0xffffffff81240bf5 <cic_free_func+26>:	mov    %rsi,%rbx

2705		unsigned long flags;
2706		unsigned long dead_key = (unsigned long) cic->key;
   0xffffffff81240bef <cic_free_func+20>:	mov    (%rsi),%r15

2707	
2708		BUG_ON(!(dead_key & CIC_DEAD_KEY));
   0xffffffff81240bf8 <cic_free_func+29>:	test   $0x1,%r15b
   0xffffffff81240bfc <cic_free_func+33>:	jne    0xffffffff81240c00 <cic_free_func+37>
   0xffffffff81240bfe <cic_free_func+35>:	ud2a   

End of assembler dump.
(gdb) disassemble /r 0xffffffff81240bef,0xffffffff81240c00
Dump of assembler code from 0xffffffff81240bef to 0xffffffff81240c00:
   0xffffffff81240bef <cic_free_func+20>:	 4c 8b 3e	mov    (%rsi),%r15
   0xffffffff81240bf2 <cic_free_func+23>:	 49 89 fd	mov    %rdi,%r13
   0xffffffff81240bf5 <cic_free_func+26>:	 48 89 f3	mov    %rsi,%rbx
   0xffffffff81240bf8 <cic_free_func+29>:	 41 f6 c7 01	test   $0x1,%r15b
   0xffffffff81240bfc <cic_free_func+33>:	 75 02	jne    0xffffffff81240c00 <cic_free_func+37>
   0xffffffff81240bfe <cic_free_func+35>:	 0f 0b	ud2a   
End of assembler dump.

(Please note that addresses are in hex but offsets are in decimal.)

The output from "disassemble /r" matches the faulting code (according to the
Oops Michal publicized in attachment 515990 [details]) perfectly.

2) So now we know exactly where we trigger the page fault and that this is not a single local oddity. Progress, I'd say!
Comment 148 Paul Bolle 2011-08-04 08:23:49 EDT
(In reply to comment #146)
> 
> 2) But perhaps (it's hard reasoning about this stuff) some code was fed
> cic->cic_list and did a container_of() like call on it (eg, the rcu code, which
> has lots of those) which _might_ give us (0x6b6b6b6b - 0x28 =) 0x6b6b6b43. If
> so, I could run that debug patch V2 [...].

0) So I decided to try debug patch V2 anyway, and hit the BUG() added at the end of call_for_each_cic() within a minute or three (I switched to cfq at about 400 secs).

1) Oops and a few preceding and following lines (entire output was almost 70k lines, while the added BUG() was hit 5 times in total):

[  562.529534] debug: Will exit io context cic=0xf0469230, Next pointer is   (null)
[  562.540965] debug: cic=0xf0469230 exit io context done. Next pointer is   (null)
[  562.552320] debug: Will decouple-free cic cic=0xf0469230, Next pointer is   (null)
[  562.563883] debug: cic=0xf0469230 decouple-free cic done. Next pointer is   (null)
[  562.578050] debug: freeing up cic=f0469230
[  562.598881] debug: Will exit io context cic=0xf0469230, Next pointer is   (null)
[  562.598881] debug: cic=0xf0469230 exit io context done. Next pointer is   (null)
[  562.598881] debug: Will decouple-free cic cic=0xf0469230, Next pointer is   (null)
[  562.633403] debug: cic=0xf0469230 decouple-free cic done. Next pointer is   (null)
[  562.644949] debug: freeing up cic=f0469230
[  562.652971] cic=f0469230 freed while in RCU read lock
[  562.661994] ------------[ cut here ]------------
[  562.662933] kernel BUG at /home/[...]/Rpmbuild/BUILD/kernel-3.0.fc17/linux-3.0.i686/block/cfq-iosched.c:2699!
[  562.662933] invalid opcode: 0000 [#1] SMP 
[  562.662933] Modules linked in: cfq_iosched arc4 ppdev snd_intel8x0 snd_intel8x0m snd_ac97_codec ac97_bus snd_seq snd_seq_device microcode snd_pcm joydev pcspkr ath5k snd_timer ath snd_page_alloc i2c_i801 thinkpad_acpi mac80211 snd soundcore cfg80211 parport_pc e1000 rfkill parport iTCO_wdt iTCO_vendor_support binfmt_misc uinput yenta_socket video radeon ttm drm_kms_helper drm i2c_algo_bit i2c_core [last unloaded: scsi_wait_scan]
[  562.662933] 
[  562.662933] Pid: 7550, comm: mandb Not tainted 3.0.0-1.local2.fc17.i686 #1 IBM        /       
[  562.662933] EIP: 0060:[<f806301f>] EFLAGS: 00010286 CPU: 0
[  562.662933] EIP is at call_for_each_cic+0x79/0xf2 [cfq_iosched]
[  562.662933] EAX: 0000003c EBX: f0469258 ECX: 00000082 EDX: 00000000
[  562.662933] ESI: f80661bd EDI: f0469230 EBP: f04b3f18 ESP: f04b3ef0
[  562.662933]  DS: 007b ES: 007b FS: 00d8 GS: 00e0 SS: 0068
[  562.662933] Process mandb (pid: 7550, ti=f04b2000 task=f40bbf90 task.ti=f04b2000)
[  562.662933] Stack:
[  562.662933]  f8066203 f0469230 f80661bd 00000000 6b6b6b6b f2fc8380 f806545b f2fc8380
[  562.662933]  f40bbf90 f40bc368 f04b3f20 f80630aa f04b3f2c c05cc8ba f2fc8380 f04b3f40
[  562.662933]  c05cc94a f40bbf90 00000012 f04b3f74 f04b3f8c c04423e2 f658ee1c f658edc0
[  562.662933] Call Trace:
[  562.662933]  [<f806545b>] ? cfq_put_request+0x5e/0x5e [cfq_iosched]
[  562.662933]  [<f80630aa>] cfq_free_io_context+0x12/0x14 [cfq_iosched]
[  562.662933]  [<c05cc8ba>] put_io_context+0x34/0x5c
[  562.662933]  [<c05cc94a>] exit_io_context+0x68/0x6d
[  562.662933]  [<c04423e2>] do_exit+0x62f/0x66a
[  562.662933]  [<c0442634>] do_group_exit+0x63/0x86
[  562.662933]  [<c044266f>] sys_exit_group+0x18/0x18
[  562.662933]  [<c083b0df>] sysenter_do_call+0x12/0x38
[  562.662933] Code: e8 81 e4 ff ff 8b 45 ec 8b 58 4c eb 22 e8 7c ff ff ff 81 7d e8 6b 6b 6b 6b 75 12 89 7c 24 04 c7 04 24 03 62 06 f8 e8 a5 93 7c c8 <0f> 0b 8b 1b e8 d0 29 3f c8 85 c0 74 05 e8 51 ff ff ff 85 db 74 
[  562.662933] EIP: [<f806301f>] call_for_each_cic+0x79/0xf2 [cfq_iosched] SS:ESP 0068:f04b3ef0
[  562.993878] ---[ end trace b9b7cf2ea11e5aaa ]---
[  563.003045] Fixing recursive fault but reboot is needed!
[  563.045875] debug: Will exit io context cic=0xf0469230, Next pointer is   (null)
[  563.045885] debug: cic=0xf0469230 exit io context done. Next pointer is   (null)
[  563.045888] debug: Will decouple-free cic cic=0xf0469230, Next pointer is   (null)
[  563.045894] debug: cic=0xf0469230 decouple-free cic done. Next pointer is   (null)
[  563.048061] debug: freeing up cic=f0469230

2) What next?
Comment 149 Vivek Goyal 2011-08-04 10:38:13 EDT
(In reply to comment #146)
> (In reply to comment #142)
> > It might make sense to apply my cfq debug patch V2 again and run again to
> > verify it is the same issue where we are freeing up a cic before rcu period is
> > over.
> 
> 0) I have some doubts.
> 
> 1) We've determined that a simple access of cic->key triggers a page fault (its
> address being 0x6b6b6b43). key is the first member of struct cfq_io_context, so
> 0x6b6b6b43 must be the address of cic too, mustn't it? That debug patch looks
> for a cic with address 6b6b6b6b, so we will not find this exact occurrence of
> cic.

No, that debug patch tracks the pointer to the next &cic->cic_list being 0x6b6b6b6b. So if the current cic is freed before the RCU grace period is over, we will fetch the poisoned value and then convert it (via the offsetof() arithmetic) to get the pointer to the cic, which in this case seems to be 0x6b6b6b43, and then the 0x6b6b6b43->key access will fault.
Comment 150 Paul Bolle 2011-08-04 11:37:06 EDT
0) Just a quick note, to make sure this is recorded somewhere. Perhaps this might give us a clue about the problem we're seeing here. Perhaps not.

1) When I do a second test run (after a first run that leads to - multiple - page faults: switching back to deadline, removing cfq_iosched.ko, then switching to cfq again, which reloads cfq_iosched.ko) I see almost no debug output. cfq_iosched.ko prints not even 1% of the output it prints otherwise.

2) Could that be just because mandb reads all (or most of) its files from some sort of memory cache and not from disk? (This machine has 2 GB of memory, that might be enough to cache all these files.)
Comment 151 Vivek Goyal 2011-08-04 11:56:51 EDT
Not sure what is happening. Just flush the cache to rule out the above possibility:

echo 3 > /proc/sys/vm/drop_caches
Comment 152 Michal Jaegermann 2011-08-04 12:22:03 EDT
(In reply to comment #150)
 
> 2) Could that be just because mandb reads all (or most of) its files from some
> sort of memory cache and not from disk? (This machine has 2 GB of memory, that
> might be enough to cache all these files.)

The x86_64 machine where I was encountering those crashes right now has 1 GB of memory, and it was running for quite a while with 512 MB.  AFAICT there was no correlation between how much memory was present and the frequency with which I was bumping into these.

It is really unclear to me why Paul, luckily enough, can apparently hit this bug quite reliably while for me it now remains quite elusive.  It could be that the amount of memory in the current setup affects some timings.
Comment 153 Paul Bolle 2011-08-04 13:36:50 EDT
(In reply to comment #151)
> Just flush the cache to rule out above possibility.

Sure, once I drop the cache(s) disk activity increases (dramatically) and I can trigger this again (with the tried and trusted "mandb -c").
Comment 154 Vivek Goyal 2011-08-04 14:17:43 EDT
(In reply to comment #148)
> 
> 0) So I decided to try debug patch V2 anyway, and hit the BUG() added at the
> end of call_for_each_cic() within a minute or three (I switched to cfq at about
> 400 secs).
> 
> 1) Oops and a few preceding and following lines (entire output was almost 70k
> lines, while the added BUG() was hit 5 times in total):
> 
> [  562.529534] debug: Will exit io context cic=0xf0469230, Next pointer is  
> (null)
> [  562.540965] debug: cic=0xf0469230 exit io context done. Next pointer is  
> (null)
> [  562.552320] debug: Will decouple-free cic cic=0xf0469230, Next pointer is  
> (null)
> [  562.563883] debug: cic=0xf0469230 decouple-free cic done. Next pointer is  
> (null)
> [  562.578050] debug: freeing up cic=f0469230
> [  562.598881] debug: Will exit io context cic=0xf0469230, Next pointer is  
> (null)
> [  562.598881] debug: cic=0xf0469230 exit io context done. Next pointer is  
> (null)
> [  562.598881] debug: Will decouple-free cic cic=0xf0469230, Next pointer is  
> (null)
> [  562.633403] debug: cic=0xf0469230 decouple-free cic done. Next pointer is  
> (null)
> [  562.644949] debug: freeing up cic=f0469230
> [  562.652971] cic=f0469230 freed while in RCU read lock

I am wondering why the time difference between the following two messages is so large.

[  562.633403] debug: cic=0xf0469230 decouple-free cic done. Next pointer is (null)

[  562.652971] cic=f0469230 freed while in RCU read lock

roughly 20 ms. Is it because printk() handles the messaging?

After calling cic_free_func() we checked the next pointer and it was null. The time was 562.633403. After that there is one rcu_dereference() statement and then the check for next == poisoned-pointer. After the first check was done, the associated cic was freed around 11 ms later, and around 8 ms after that we checked for the next pointer being poisoned.

I am wondering why we are taking so much time. Is there some kind of context switch going on, or something else? Or is it just a false pointer because one cannot rely on printk() time stamps?
Comment 155 Vivek Goyal 2011-08-04 14:37:55 EDT
PaulB,

IIUC, when we call call_for_each_cic() interrupts are still enabled. It might just be a fun experiment to disable interrupts and see if this problem still happens. 

Something like.

cfq_free_io_context() {
    unsigned long flags;

    local_irq_save(flags);
    call_for_each_cic()
    local_irq_restore(flags);
}

And keep the CFQ debug V2 applied and run mandb -c again and see if anything changes.
Comment 156 Paul Bolle 2011-08-04 16:31:31 EDT
(In reply to comment #155)
> IIUC, when we call call_for_each_cic() interrupts are still enabled. It might
> just be a fun experiment to disable interrupts and see if this problem still
> happens. 

0) I'm curious what the idea behind that experiment is. Since I have a poor grasp of all this stuff, explaining it might not be easy. On the other hand, it's my loyal CFQ torturer that's suffering all extra wear and tear involved in trying to trigger this issue, so perhaps my curiosity is, well, appropriate. 
 
> Something like.
> 
> cfq_free_io_context() {
>     unsigned long flags;
> 
>     local_irq_save(flags);
>     call_for_each_cic()
>     local_irq_restore(flags);
> }

1) I've added exactly that.

> And keep the CFQ debug V2 applied and run mandb -c again and see if anything
> changes.

2) Well, in three consecutive runs (making sure caches were dropped before every run) I was unable to trigger this bug! That's quite remarkable. This suggests you at least made the chance of hitting this issue smaller.

3) I don't really want to ruin this little experiment with yet another run. Why ruin a perfect score of 3/3?
Comment 157 Vivek Goyal 2011-08-04 16:48:28 EDT
(In reply to comment #156)
> (In reply to comment #155)
> > IIUC, when we call call_for_each_cic() interrupts are still enabled. It might
> > just be a fun experiment to disable interrupts and see if this problem still
> > happens. 
> 
> 0) I'm curious what the idea behind that experiment is. Since I have a poor
> grasp of all this stuff, explaining it might not be easy. On the other hand,
> it's my loyal CFQ torturer that's suffering all extra wear and tear involved in
> trying to trigger this issue, so perhaps my curiosity is, well, appropriate. 

Last time PaulM mentioned that in an RCU read-side section there should be no blocking instructions and no rescheduling of the thread. There was also talk of keeping track of context switches in the task (Paul had written a patch to see if any kind of context switch is happening).

I was wondering whether, with interrupts enabled, all interrupts and softirqs can run and maybe have some kind of side effect.

Hence it is just a wild guess to narrow down the problem a bit.

> 
> 2) Well, in three consecutive runs (making sure caches were dropped before
> every run) I was unable to trigger this bug! That's quite remarkable. This
> suggests you at least made the chance of hitting this issue smaller.
> 
> 3) I don't really want to ruin this little experiment with yet another run. Why
> ruin a perfect score of 3/3?

I would say let it run in a loop in a script. Once it has run for a couple of hours without any crash, we have a good data point that disabling interrupts helps solve the problem.

This might give some hint regarding where to look and what next.
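Such a loop could look something like the sketch below (illustrative only: the helper name, the writability check on drop_caches, and stopping at the first non-zero exit status are my assumptions, not an agreed-on procedure):

```shell
#!/bin/sh
# Repeatedly drop caches and run the reproducer, stopping at the first
# non-zero exit (e.g. when a mandb child dies from the page fault).
stress_loop() {
    runs=$1; shift          # first arg: iterations; rest: command under test
    i=1
    while [ "$i" -le "$runs" ]; do
        sync
        # dropping caches needs root; skip silently otherwise
        [ -w /proc/sys/vm/drop_caches ] && echo 3 > /proc/sys/vm/drop_caches || true
        if ! "$@"; then
            echo "failed on run $i"
            return 1
        fi
        i=$((i + 1))
    done
    echo "completed $runs runs"
}
# e.g.: stress_loop 8 mandb -c
```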
Comment 158 Paul Bolle 2011-08-04 17:37:06 EDT
(In reply to comment #157)
> I would say that let it run in a loop in a script. Once it has run for couple
> of hours without any crash, then we have a good data point that disabling
> interrupts helps solve the problem.
> 
> This might give some hint regarding where to look and what next.

0) I'll consider that.

1) Mind if I disable the debug verbosity? It would wipe all other information from my logs. (Perhaps set it to KERN_DEBUG, so it will only clobber dmesg.)

2) By the way, during the last few runs I had something similar to https://lkml.org/lkml/2011/8/3/310 ("[RFC] CFQ: simplify radix tree lookup in cfq_cic_lookup()") enabled, apparently without any ill effects.
Comment 159 Paul E. McKenney 2011-08-05 03:23:08 EDT
Very interesting, great work narrowing this down!

Are you running CONFIG_NO_HZ?  If so, I would be interested in hearing what happens if you run the test without the interrupt-disabling, but with CONFIG_NO_HZ=n.
Comment 160 Paul Bolle 2011-08-05 05:16:57 EDT
(In reply to comment #159) 
> Are you running CONFIG_NO_HZ?

0) Yes. See:
$ grep -e RCU_FAST_NO_HZ -e CPU_IDLE_GOV_MENU -e NO_HZ /boot/config-3.0.0-1.local2.fc17.i686 
CONFIG_RCU_FAST_NO_HZ=y
CONFIG_NO_HZ=y
CONFIG_CPU_IDLE_GOV_MENU=y

(CONFIG_RCU_FAST_NO_HZ and CONFIG_CPU_IDLE_GOV_MENU were the only two options that depend (directly and positively) on CONFIG_NO_HZ.)

> If so, I would be interested in hearing what
> happens if you run the test without the interrupt-disabling, but with
> CONFIG_NO_HZ=n.

1) I hope to do that later today (it's still morning over here). Please prod me if it takes me too long.
Comment 161 Vivek Goyal 2011-08-05 09:19:02 EDT
(In reply to comment #158)
> (In reply to comment #157)
> > I would say that let it run in a loop in a script. Once it has run for couple
> > of hours without any crash, then we have a good data point that disabling
> > interrupts helps solve the problem.
> > 
> > This might give some hint regarding where to look and what next.
> 
> 0) I'll consider that.
> 
> 1) Mind if I disable the debug verbosity? It would wipe all other information
> from my logs. (Perhaps set it KERN_DEBUG, so it will only clobber dmesg.)

Not a problem. 

> 
> 2) By the way, during the last few runs I had something similar to
> https://lkml.org/lkml/2011/8/3/310 ("[RFC] CFQ: simplify radix tree lookup in
> cfq_cic_lookup()") enabled, apparently without any ill effects.

Ok. That's fine.
Comment 162 Paul Bolle 2011-08-06 07:53:32 EDT
(In reply to comment #159)
> If so, I would be interested in hearing what
> happens if you run the test without the interrupt-disabling, but with
> CONFIG_NO_HZ=n.

0) Recompiled 3.0 with CONFIG_NO_HZ=n. Ran a script that does 8 runs of (basically) "mandb -c" (preceded by dropping caches each time). This takes almost an hour over here.

1) This triggers no page fault in cic_free_func. So, it seems we're onto something. Does 8 runs sound enough to conclude that (something somehow influenced by) CONFIG_NO_HZ triggers this issue?

2) Should I also try test runs with CONFIG_NO_HZ=y and interrupts disabled (when calling cic_free_func())?
Comment 163 Paul Bolle 2011-08-06 08:08:30 EDT
(In reply to comment #162)
> 0) Recompiled 3.0 with CONFIG_NO_HZ.

Say, couldn't I just have used the "nohz=off" kernel parameter? (Making sure I called cic_free_func() with interrupts enabled.)
Comment 164 Paul Bolle 2011-08-06 17:35:04 EDT
(In reply to comment #163)
> Say, couldn't I just have used the "nohz=off" kernel parameter?

0) It turns out that bugzilla.redhat.com doesn't answer lazyweb questions, so I had to test this myself.

1) Yes, booting with "nohz=off" on a CONFIG_NO_HZ=y kernel seems to behave identically to booting a CONFIG_NO_HZ=n kernel. Let's just say I learned something today... Anyhow, once again no page faults in 8 test runs.

2) Perhaps Jens, Vivek and Paul McKenney could discuss further, more specific tests and/or patches before I try anything else (like running cic_free_func() with interrupts disabled for a few hours, as Vivek suggested). I'll be right here once you've decided on something.

(In reply to comment #152)
> It is really unclear to me why Paul apparently can hit, luckily enough, this
> bug quite reliably while for me it remains now quite elusive.  It could be that
> an amount of memory in the current setup does affect some timings.

Michal, perhaps you'd like to try what I do now: basically, run "mandb -c" in a loop, making sure all caches are dropped before every iteration. I run this silly test in an 8-iteration loop, but any other number of iterations could be just as interesting.
Comment 165 Michal Jaegermann 2011-08-07 12:47:46 EDT
(In reply to comment #164)
> 
> Michal, perhaps you'd like to try what I do now: basically, run "mandb -c" in a
> loop, making sure all caches are dropped before every iteration.

After five times through such loop while running kernel-3.1.0-0.rc0.git21.1.fc17.x86_64 I did hit that bug with
"RIP  [<ffffffff81242f6f>] cic_free_func+0x14/0x8c" and so on.

On loop six the same fault happened, but this time I also got a message that "Fixing recursive fault but reboot is needed". What kind of new information might that bring?
Comment 166 Paul E. McKenney 2011-08-07 23:56:54 EDT
Paul Bolle: Could you please reproduce the failure (CONFIG_NO_HZ=y, no added irq disabling) and run the script from May 26?  This will require building with CONFIG_RCU_TRACE=y.
Comment 167 Paul Bolle 2011-08-08 05:47:59 EDT
(In reply to comment #165)
> After five times through such loop while running
> kernel-3.1.0-0.rc0.git21.1.fc17.x86_64 I did hit that bug with
> "RIP  [<ffffffff81242f6f>] cic_free_func+0x14/0x8c" and so on.

I'm glad to hear there's another machine on which this can be triggered manually (albeit not as easily).

> On a loop six the same fault happened but already I got a message that "Fixing
> recursive fault but reboot is needed". What kind of a new information that
> possibly brings?

I'm not sure what you mean here. If you mean that further iterations of this (silly) test are unnecessary once a page fault has been triggered then you're right. As far as I know, the page fault always happens in a child process of mandb. In that case mandb should return non-zero, shouldn't it? So you could use that non-zero return to break from the loop and stop the test.

If your point is that the kernel warns that a "reboot is needed" more than once, then that second warning could be considered superfluous. But it's not a big deal and it's perhaps more important to get that message out than to be less chatty.
Comment 168 Vivek Goyal 2011-08-08 17:46:05 EDT
Created attachment 517316 [details]
RCU collectdebugfs.sh script

I think PaulM is referring to the collectdebugfs.sh script he posted in a previous mail thread on lkml. I have extracted it from that earlier thread and attached it to the bug, which makes it easier to download and run.
Comment 169 Paul Bolle 2011-08-08 17:52:22 EDT
(In reply to comment #168)
> [...] and run.

Tomorrow, I suppose. I already recompiled with CONFIG_RCU_TRACE=y, and booted into that kernel. I just need to run the test. Feel free to prod me if I appear inactive for too long.
Comment 170 Paul E. McKenney 2011-08-08 18:46:07 EDT
Vivek: Yep, that is the one!
Comment 171 Michal Jaegermann 2011-08-08 20:51:06 EDT
(In reply to comment #167)
> (In reply to comment #165)
> > What kind of a new information that
> > possibly brings?
> 
> I'm not sure what you mean here.

I meant that we already know that I can trigger that fault too, although with current kernels, for some reason, not that readily.  After all, I filed the original report, although quite a while ago (and later that bug was "hiding" from me for a long time).

Apologies that I am not overly proactive in trying to chase that down but now things to do are piling up here.

BTW - my test system uses one of the early samples of an x86_64 board and is likely a rather uncommon specimen: an x86_64 machine with a single core.  I wonder if this has something to do with all of this.
Comment 172 Paul Bolle 2011-08-09 15:47:13 EDT
Created attachment 517474 [details]
output of collectdebugfs.sh

0) This is the output of over 10 minutes of running collectdebugfs.sh (I still had a copy somewhere on my hard disk). Only the first 3 or so minutes are probably of interest.

1) Everybody's favourite Oops occurred at 21:17:31 and 21:17:39.
Comment 173 Paul E. McKenney 2011-08-09 22:46:20 EDT
Can't say that I see anything incriminating in the collectdebugfs.sh output.

Besides which, you are running on a single CPU...

Nonetheless, I do have a new -rcu posted, if you are interested in trying it:

git://git.kernel.org/pub/scm/linux/kernel/git/paulmck/linux-2.6-rcu.git rcu/next
Comment 174 Paul Bolle 2011-08-10 05:19:57 EDT
(In reply to comment #173)
0) Thanks for diving into this again.

> Nonetheless, I do have a new -rcu posted, if you are interested in trying it:
> 
> [...]

So the choice seems to be between trying rcu/next and staring at the CFQ code (for hours on end) to see if something suspicious can be spotted ...
Comment 175 Paul E. McKenney 2011-08-10 10:31:10 EDT
Decisions, decisions.  ;-)
Comment 176 Vivek Goyal 2011-08-10 11:16:17 EDT
PaulM,

I am wondering what it means and what we should do next. I mean, using the CFQ debug patch, I can see that a pointer I dereferenced inside an RCU read-side section started pointing to a freed object.

Do you think that's not the right way to decide whether an object got freed while the RCU read lock was held?

IOW, I am wondering what to do next. The debug patch suggests it is RCU related, yet the RCU debug output does not point to anything suspicious.
Comment 177 Paul E. McKenney 2011-08-10 11:49:12 EDT
I suggest trying -rcu.  I can't see how any of the fixes there would help you, but then again, I don't fully understand cfq.

Here are reasons why a pointer dereferenced within an RCU read-side critical section might end up pointing to a freed structure within that same RCU read-side critical section:

1.  The structure is freed directly instead of waiting for an RCU grace period.

2.  The structure is protected with one flavor of RCU and freed after waiting for another flavor of RCU's grace period.

3.  There is a bug in RCU.

So trying out -rcu would be a way to check for some cases of #3 above.  I am doing a line-by-line inspection of RCU to try to find any other such cases.
Comment 178 Vivek Goyal 2011-08-10 12:02:54 EDT
(In reply to comment #177)
> I suggest trying -rcu.  I can't see how any of the fixes there would help you,
> but then again, I don't fully understand cfq.
> 
> Here are reasons why a pointer dereferenced within an RCU read-side critical
> section might end up pointing to a freed structure within that same RCU
> read-side critical section:
> 
> 1.  The structure is freed directly instead of waiting for an RCU grace period.

The cic is freed via a single function:

static void cfq_cic_free(struct cfq_io_context *cic)
{
        call_rcu(&cic->rcu_head, cfq_cic_free_rcu);
}

I can't spot anywhere else in the code where we free a cic through any other path.

> 
> 2.  The structure is protected with one flavor of RCU and freed after waiting
> for another flavor of RCU's grace period.

In this case we just free using call_rcu(), and readers are protected by rcu_read_lock() and rcu_read_unlock(). I am hoping this is the right way of doing things.

> 
> 3.  There is a bug in RCU.
> 
> So trying out -rcu would be a way to check for some cases of #3 above.  I am
> doing a line-by-line inspection of RCU to try to find any other such cases.

Ok, PaulB, can you please give -rcu tree a try. We can always stare at CFQ code but testing -rcu sounds like a relatively less time consuming option.
Comment 179 Paul E. McKenney 2011-08-10 12:28:39 EDT
Ah, and the following:

4.  Failing to remove all reader paths to the object before starting to wait
for the grace period.
Comment 180 Vivek Goyal 2011-08-10 13:32:48 EDT
(In reply to comment #179)
> Ah, and the following:
> 
> 4.  Failing to remove all reader paths to the object before starting to wait
> for the grace period.

What does that mean? We are in the reader path, and the reader itself calls call_rcu(), which should make somebody wait for the grace period asynchronously. So the wait for the grace period does start while some readers are still there. We effectively run the following code.

rcu_read_lock();
hlist_for_each_entry_rcu(cic, n, &ioc->cic_list, cic_list){
             spin_lock_irqsave(&ioc->lock, flags);
             hlist_del_rcu(&cic->cic_list);
             spin_unlock_irqrestore(&ioc->lock, flags);
   
             call_rcu(&cic->rcu_head, cfq_cic_free_rcu);
}
rcu_read_unlock();

Is the above not the right usage?
Comment 181 Paul E. McKenney 2011-08-10 17:46:11 EDT
As long as there is no other path by which readers can reach this structure, the above looks good. Also, readers must not create new paths to this structure.

How did the tests with -rcu go?
Comment 182 Paul Bolle 2011-08-10 18:21:42 EDT
(In reply to comment #181)
> How did the tests with -rcu go?

0) They did not even start.

1) Say, what's a simple way to retrieve from your rcu/next (ie, http://git.kernel.org/?p=linux/kernel/git/paulmck/linux-2.6-rcu.git;a=shortlog;h=refs/heads/rcu/next ) the diff(s) between
    http://git.kernel.org/?p=linux/kernel/git/paulmck/linux-2.6-rcu.git;a=commit;h=02f8c6aee8df3cdc935e9bdd4f2d020306035dbe

(ie, v3.0) and
    http://git.kernel.org/?p=linux/kernel/git/paulmck/linux-2.6-rcu.git;a=commit;h=0697a6c28e520ef4ce3ce9b7532e4374d47be90c

(ie, rcu/next's current tip of the tree) ?

2) Please note that I'm uncomfortable with tracking two upstream repositories. To give an example: I track Linus' tree and the trivial tree each in their own local repository. (And thus I have to jump through a few hoops when I want to copy private work from one local repository to the other.) Fear of the unknown, I guess...
Comment 183 Paul E. McKenney 2011-08-10 21:51:49 EDT
One approach would be to clone the -rcu tree locally and do:

git checkout rcu/next
git diff v3.0 > /tmp/wherever

If you wanted to get the -rcu branch pulled into one of your existing trees, one way to do this would be to again clone the -rcu tree and do:

git checkout rcu/next
cd ../linux-2.6 # or wherever you put it
git checkout v3.0
git pull ../linux-2.6-rcu rcu/next # replacing ../linux-2.6-rcu with path to -rcu
git branch rcu/next
Comment 184 Paul E. McKenney 2011-08-10 22:38:05 EDT
Created attachment 517723 [details]
Patch for v3.0..rcu/next as of August 9, 2011

Or you can download this patch.
Comment 185 Paul Bolle 2011-08-11 12:09:55 EDT
(In reply to comment #184)
> Or you can download this patch.

0) It turns out adding another "remote" is rather easy, and having your patch at hand made it easy to see I did it correctly. Thanks.

1) Same Oops while running rcu/next (even hit five times in one mandb run, but that is not really uncommon).

2) Should I perhaps play with some /sys files or tweak CONFIG_* variables to fully enjoy rcu/next? If not, is it now time to start staring at the code?
Comment 186 Paul E. McKenney 2011-08-12 14:31:50 EDT
If you have not already done so, please try either CONFIG_TINY_RCU or CONFIG_TINY_PREEMPT_RCU.  The results would be interesting with either mainline or -rcu.

And I will certainly be continuing to look at code.
Comment 187 Paul Bolle 2011-08-13 05:19:13 EDT
(In reply to comment #186)
> If you have not already done so, please try either CONFIG_TINY_RCU or
> CONFIG_TINY_PREEMPT_RCU.

Both depend on !SMP. TINY_PREEMPT_RCU depends on PREEMPT. Please note that Fedora by default enables SMP and disables PREEMPT. So do the kernels I run currently (which basically are vanilla rebuilds of Rawhide kernels).

I hope to shortly try TINY_RCU, as it should require the least .config tweaking.
Comment 188 Paul Bolle 2011-08-13 16:14:41 EDT
(In reply to comment #187)
> I hope to shortly try TINY_RCU [...].

0) I've done that now. (There's no /proc or /sys file that "proves" which RCU variant one is running, is there? Nor is that printed to the logs, apparently.)

1) Running mandb for 8 iterations, dropping caches beforehand, didn't trigger this bug. Is this small test - which took about an hour - enough to indicate this bug cannot be triggered with (!SMP and) TINY_RCU?

2) What's next?
Comment 189 Paul E. McKenney 2011-08-14 13:05:43 EDT
0) If you don't see "Preemptible hierarchical RCU implementation" in dmesg at early boot, you are running one of the TINY_RCU variants.

1) Good question -- how many iterations does it take for TREE_RCU to fail?  Given that info, the answer is a simple matter of statistics.  You can never be 100% sure, but you can compute a probability.

2) One thing that would help me in my slowly ongoing inspection of RCU is how the failures map out over the possible configurations:

a: TREE_RCU && NO_HZ:
b: TREE_RCU && !NO_HZ:
c: TREE_PREEMPT_RCU && NO_HZ:
d: TREE_PREEMPT_RCU && !NO_HZ:
e: TINY_RCU && NO_HZ:
f: TINY_RCU && !NO_HZ:
g: TINY_PREEMPT_RCU && NO_HZ:
h: TINY_PREEMPT_RCU && !NO_HZ:

If I remember correctly, (a) fails, (b) does not, and one of (e) or (f) does not.  Is that correct?
Comment 190 Paul Bolle 2011-08-15 17:11:33 EDT
(In reply to comment #189)
> 0) If you don't see "Preemptible hierarchical RCU implementation" in dmesg at
> early boot, you are running one of the TINY_RCU variants.

I see lots of lines saying
    Hierarchical RCU implementation.

(note the dot!) in the logs. Anyway, printing a different message in the opposite case wouldn't hurt, would it?
Comment 191 Paul E. McKenney 2011-08-15 17:41:11 EDT
TREE_RCU says "Hierarchical RCU implementation.", as you note, complete with the period at the end.  TREE_PREEMPT_RCU says "Preemptible hierarchical RCU implementation." -- also with a period.  TINY_RCU and TINY_PREEMPT_RCU say nothing in an attempt to save a few more bytes for deep-embedded uses of RCU.  The TINY variants of RCU will remain tiny.
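That mapping can be sketched as a throwaway shell helper (the function name is invented here; it just encodes the banners listed above):

```shell
# Print the RCU boot banner to expect, given a kernel config file.
# TREE_PREEMPT_RCU and TREE_RCU each print a line; the TINY variants
# stay silent to save a few bytes.
rcu_boot_msg() {
    if grep -q '^CONFIG_TREE_PREEMPT_RCU=y' "$1"; then
        echo "Preemptible hierarchical RCU implementation."
    elif grep -q '^CONFIG_TREE_RCU=y' "$1"; then
        echo "Hierarchical RCU implementation."
    else
        echo "(no banner: TINY_RCU or TINY_PREEMPT_RCU)"
    fi
}
# e.g.: rcu_boot_msg "/boot/config-$(uname -r)"
```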

So we are dealing with TREE_RCU.  I recall (perhaps incorrectly) that NO_HZ failed and !NO_HZ didn't fail.  Is that correct?  If so, was RCU_FAST_NO_HZ also set?
Comment 192 Paul Bolle 2011-08-15 17:47:49 EDT
(In reply to comment #189)
> 2) One thing that would help me in my slowly ongoing inspection of RCU is how
> the failures map out over the possible configurations:
> 
> a: TREE_RCU && NO_HZ:
> b: TREE_RCU && !NO_HZ:
> c: TREE_PREEMPT_RCU && NO_HZ:
> d: TREE_PREEMPT_RCU && !NO_HZ:
> e: TINY_RCU && NO_HZ:
> f: TINY_RCU && !NO_HZ:
> g: TINY_PREEMPT_RCU && NO_HZ:
> h: TINY_PREEMPT_RCU && !NO_HZ:
> 
> If I remember correctly, (a) fails, (b) does not, and one of (e) or (f) does
> not.  Is that correct?

a: TREE_RCU && NO_HZ:            fails
b: TREE_RCU && !NO_HZ:           appears OK
c: TREE_PREEMPT_RCU && NO_HZ:    n/a
d: TREE_PREEMPT_RCU && !NO_HZ:   n/a
e: TINY_RCU && NO_HZ:            appears OK
f: TINY_RCU && !NO_HZ:           n/a
g: TINY_PREEMPT_RCU && NO_HZ:    n/a
h: TINY_PREEMPT_RCU && !NO_HZ:   n/a

So you were correct.

(Filling in this table makes one realize that CONFIG_NO_HZ forces one to think in double negatives, which is always fun, isn't it?)
Comment 193 Paul Bolle 2011-08-15 17:57:05 EDT
(In reply to comment #191)
> So we are dealing with TREE_RCU.  I recall (perhaps incorrectly) that NO_HZ
> failed and !NO_HZ didn't fail.  Is that correct?  If so, was RCU_FAST_NO_HZ
> also set?

a: TREE_RCU && NO_HZ:            fails =>
    CONFIG_TREE_RCU=y
    CONFIG_RCU_FAST_NO_HZ=y
    CONFIG_NO_HZ=y

(These results are brought to you by the department of handwaving-a-lot-to-look-reliable. If more tests are needed to pinpoint this, we should think about logging the info we care about when running tests.)
Comment 194 Paul E. McKenney 2011-08-15 18:28:09 EDT
Could you please try CONFIG_RCU_FAST_NO_HZ=n along with CONFIG_TREE_RCU=y and CONFIG_NO_HZ=y?  CONFIG_RCU_FAST_NO_HZ, when set, prints "RCU dyntick-idle grace-period acceleration is enabled" at boot time.
Comment 195 Paul Bolle 2011-08-16 11:14:15 EDT
(In reply to comment #194)
> Could you please try CONFIG_RCU_FAST_NO_HZ=n along with CONFIG_TREE_RCU=y and
> CONFIG_NO_HZ=y?

0) Note to self: tested with 3.0.0-1.local7.fc17.i686. Cleaned up changelog:
    * Tue Aug 16 2011 Paul Bolle <pebolle@tiscali.nl> 3.0.0-1.local7
    - epoll: Fix spurious lockdep warnings
    - rcu/next: git diff v3.0..0697a6c2 (0697a6c2 "rcu: Prohibit grace periods
      during early boot")
    - CONFIG_IOSCHED_CFQ=m
    - CONFIG_RCU_TRACE=y
    - CONFIG_RCU_FAST_NO_HZ=n

1) Refuses to Oops in 8 mandb test runs (using a script that drops caches before every run).

2) Have we pinpointed this to CONFIG_RCU_FAST_NO_HZ?
Comment 196 Paul Bolle 2011-08-16 11:20:39 EDT
(In reply to comment #195)
> 1) Refuses to Oops in 8 mandb test runs (using a script that drops caches
> before every run).

Strike that (for the time being), as it turns out that checking cfq requires one to, well, have cfq enabled (CONFIG_IOSCHED_CFQ=m boots into deadline). What a surprise! On to an actual test run of cfq ...
Comment 197 Paul Bolle 2011-08-16 13:40:24 EDT
(In reply to comment #196)
> On to an actual test run of cfq ...

0) Do scientists have the concept of a well-behaved experiment (ie, an experiment that generates the desired outcome no matter what data one throws at it)?

1) A kernel with (in short) CONFIG_RCU_FAST_NO_HZ=n refuses to Oops in 8 mandb test runs (using a script that drops caches before every run) even if cfq is actually enabled.

2) Have we pinpointed this latest issue to CONFIG_RCU_FAST_NO_HZ?
Comment 198 Paul E. McKenney 2011-08-16 15:27:14 EDT
We certainly have pointed the finger of suspicion at CONFIG_RCU_FAST_NO_HZ.  It is relatively new and thus should be presumed less trustworthy.

But back to your questions about scientists...  Do you have data on how many mandb attempts were required to cause failures in the CONFIG_RCU_FAST_NO_HZ=y case?  Ideally, this data would take the form of the number of times failure occurred on the first mandb invocation, the second, and so on.  Given that data, we can compute a probability of the bug surviving eight attempts.

But just to humor an old man's paranoia, could you please also do a few more mandb attempts with CONFIG_RCU_FAST_NO_HZ=n?
Comment 199 Paul Bolle 2011-08-16 15:57:42 EDT
(In reply to comment #198)
> Do you have data on how many
> mandb attempts were required to cause failures in the CONFIG_RCU_FAST_NO_HZ=y
> case?  Ideally, this data would take the form of the number of times failure
> occurred on the first mandb invocation, the second, and so on.  Given that
> data, we can compute a probability of the bug surviving eight attempts.

anecdote, n. pl. data

My feeling is that, if we hit this issue, we need at most two runs. But nine out of ten times the first run is enough. (Note that "run" is about as vague as it gets. We're talking about a program that spawns X children, doing Y reads and Z writes, Y probably being a rather large number.)

> But just to humor an old man's paranoia, could you please also do a few more
> mandb attempts with CONFIG_RCU_FAST_NO_HZ=n?

I hope to do that shortly. Results are due (West European) tomorrow.
Comment 200 Michal Jaegermann 2011-08-16 16:29:52 EDT
(In reply to comment #199)

> My feeling is that, if we hit this issue, we need at most two runs. But nine
> out of ten times the first run is enough.

That may well be the case with that particular hardware/software combination.  I am repeating this experiment from time to time, running 'mandb -c' with caches dropped on my machine, with unmodified rawhide kernels, and from what I observe that bug _may_ hit here within two runs, but usually quite a few more are needed.  For some time I did not see this failure on the first run at all.  OTOH, with some patience, I eventually get it every time I try. Yes, we all know that this is clearly timing dependent somehow. Just a note.

I was not posting new comments about these results as they really look consistently the same.
Comment 201 Paul E. McKenney 2011-08-16 18:04:25 EDT
OK, suppose that each run of mandb had a 50% chance of hitting the bug.  Then you would have a 50% chance of hitting it on the first run, a 75% chance of hitting it on the second run, an 88% chance of hitting it on the third run, and so on.

The probability might vary from system to system, for example, if you have a 75% chance of hitting the bug on a given run, then there is a 94% chance of hitting it on the first two runs and a 98% chance of hitting it on the first three runs.

This is why I was hoping for a list -- I could then estimate the per-run probability, and then further estimate the number of runs required to give (say) a 99.999% chance that the bug was in fact not happening.
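The arithmetic above is just 1 - (1 - p)^n. A quick awk helper (name invented here, only for checking the numbers) reproduces the percentages quoted:

```shell
# Probability of seeing the bug at least once in n runs,
# given an assumed per-run probability p: 1 - (1 - p)^n
hit_prob() {
    awk -v p="$1" -v n="$2" 'BEGIN { printf "%.2f\n", 1 - (1 - p) ^ n }'
}
hit_prob 0.5 2    # → 0.75
hit_prob 0.75 2   # → 0.94
hit_prob 0.75 3   # → 0.98
```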
Comment 202 Paul Bolle 2011-08-17 03:32:07 EDT
(In reply to comment #201)
> This is why I was hoping for a list [...].
Well, doing a methodical and thorough job won't do you any good when working for the department of handwaving-a-lot-to-look-reliable! So no, I didn't log this stuff (one reason is that I could basically always trigger this with my outdated ThinkPad T41 and I didn't particularly care whether that was in the first or second run).

(In reply to comment #198)
> But just to humor an old man's paranoia, could you please also do a few more
> mandb attempts with CONFIG_RCU_FAST_NO_HZ=n?

Fresh boot, 16 runs (using a script that will check if cfq is enabled), which took almost 3 hours, no Oopses.
Comment 203 Paul E. McKenney 2011-08-17 17:33:07 EDT
Thank you for the additional testing!

I am reviewing the CONFIG_RCU_FAST_NO_HZ yet again.  If I still don't find anything, I will see about putting together some sort of diagnostic patch.
Comment 204 Michal Jaegermann 2011-08-18 15:04:57 EDT
Just as a reference point.  I was just hit by this bug on a kernel 3.1.0-0.rc1.git6.1.fc17.x86_64 when cron decided to run mandb.  So this was surely on the first loop. :-)
Comment 205 Paul E. McKenney 2011-08-19 12:13:53 EDT
OK, good, there seems to be a high probability of failure, which should make testing of fixes easier.

Does this bug trigger with CONFIG_RCU_FAST_NO_HZ=y and CONFIG_PROVE_RCU=y?  (Yeah, I know, way too much for me to hope for, but it really would be convenient for diagnostic patches...)
Comment 206 Paul Bolle 2011-08-19 12:38:59 EDT
(In reply to comment #205)

> Does this bug trigger with CONFIG_RCU_FAST_NO_HZ=y and CONFIG_PROVE_RCU=y? 

0) Yes, it does

1) I'm currently running pre-compiled 3.1.0-0.rc2.git0.1.fc17.i686 (the latest kernel pushed in Rawhide). Our tireless tester "mandb" once again triggered this issue (when kicked into action by cron). And
    grep -w -e CONFIG_RCU_FAST_NO_HZ -e CONFIG_PROVE_RCU /boot/config-3.1.0-0.rc2.git0.1.fc17.i686

prints:
   CONFIG_RCU_FAST_NO_HZ=y
   CONFIG_PROVE_RCU=y
Comment 207 Michal Jaegermann 2011-08-19 13:01:06 EDT
(In reply to comment #206)
> (In reply to comment #205)
> 
> > Does this bug trigger with CONFIG_RCU_FAST_NO_HZ=y and CONFIG_PROVE_RCU=y? 
> 
> 0) Yes, it does

Likely stating something "nearly obvious" but just in case.  The same is true for x86_64; both are "y" in the configurations of Rawhide kernels.
Comment 208 Paul E. McKenney 2011-08-19 18:27:32 EDT
Created attachment 519110 [details]
Diagnostic patch for RCU_FAST_NO_HZ assumptions

Could you please test the attached patch?  It should give useful information whether or not CONFIG_RCU_FAST_NO_HZ is set.
Comment 209 Paul Bolle 2011-08-20 16:16:23 EDT
(In reply to comment #208)
> Could you please test the attached patch? 

0) Note to self: tested with 3.1.0-rc2.local0.fc17.i686. Cleaned up changelog:
* Sat Aug 20 2011 Paul Bolle <pebolle@tiscali.nl> 3.1.0-rc2.local
- epoll: Fix spurious lockdep warnings
- rcu: check for entering dyntick-idle mode while in read-side critical
  section
- CONFIG_IOSCHED_CFQ=m
- CONFIG_RCU_TRACE=y

1) This already happened before I even had a chance to poke the system myself (ie, by a fsck launched by dracut):
------------[ cut here ]------------
WARNING: at kernel/rcutree.c:1746 rcu_needs_cpu_quick_check+0x31/0xbd()
Hardware name:        
Modules linked in: yenta_socket video radeon ttm drm_kms_helper drm i2c_algo_bit i2c_core
Pid: 340, comm: fsck.ext4 Not tainted 3.1.0-rc2.local0.fc17.i686 #1
Call Trace:
 [<c083e25d>] ? printk+0x2d/0x2f
 [<c0440088>] warn_slowpath_common+0x7c/0x91
 [<c049a5db>] ? rcu_needs_cpu_quick_check+0x31/0xbd
 [<c049a5db>] ? rcu_needs_cpu_quick_check+0x31/0xbd
 [<c04400bf>] warn_slowpath_null+0x22/0x24
 [<c049a5db>] rcu_needs_cpu_quick_check+0x31/0xbd
 [<c049bba9>] rcu_needs_cpu+0xd7/0x16c
 [<c049bca0>] rcu_process_callbacks+0x62/0x86
 [<c0445a9e>] __do_softirq+0xdd/0x203
 [<c04459c1>] ? __tasklet_hi_schedule_first+0x30/0x30
 <IRQ>  [<c0445dc6>] ? irq_exit+0x4f/0xac
 [<c040402a>] ? do_IRQ+0x7e/0x92
 [<c046a94c>] ? trace_hardirqs_off_caller+0x2e/0x86
 [<c084ddb5>] ? common_interrupt+0x35/0x3c
 [<c05d007b>] ? ftrace_raw_event_block_bio+0x64/0xc7
 [<c04600e0>] ? exit_creds+0x40/0xab
 [<c05dc325>] ? arch_local_irq_restore+0x5/0xb
 [<c05dcdcb>] ? blkiocg_update_dispatch_stats+0xbe/0xc6
 [<c05deb37>] ? blk_throtl_bio+0xca/0x37a
 [<c04efbff>] ? set_track+0x4d/0xab
 [<c0840f36>] ? __slab_alloc.constprop.27+0x2ef/0x304
 [<c046de9e>] ? trace_hardirqs_on_caller+0x10e/0x13f
 [<c04c2c5b>] ? mempool_alloc_slab+0x13/0x15
 [<c05d2e09>] ? generic_make_request+0x2e3/0x40f
 [<c05eaae3>] ? random32+0x1e/0x2b
 [<c05f50c8>] ? should_fail+0x79/0xdf
 [<c04f1889>] ? kmem_cache_alloc+0x47/0x11a
 [<c04c2c5b>] ? mempool_alloc_slab+0x13/0x15
 [<c05d3017>] ? submit_bio+0xe2/0xfb
 [<c046a9af>] ? trace_hardirqs_off+0xb/0xd
 [<c046aeba>] ? lock_release_holdtime.part.9+0x4b/0x51
 [<c0524d9e>] ? bio_alloc_bioset+0x3c/0x9c
 [<c046d9c5>] ? lock_release+0x188/0x1ae
 [<c0520cad>] ? submit_bh+0xcd/0xe7
 [<c0523d0c>] ? block_read_full_page+0x206/0x216
 [<c052626e>] ? blkdev_get_blocks+0x100/0x100
 [<c04c0fe1>] ? add_to_page_cache_locked+0xa7/0xe0
 [<c04c9a3f>] ? __lru_cache_add+0x34/0x5b
 [<c0525ee2>] ? blkdev_readpage+0x14/0x16
 [<c04c8c19>] ? __do_page_cache_readahead+0x149/0x17f
 [<c04c8fb4>] ? ondemand_readahead+0x163/0x1c4
 [<c04c90d0>] ? page_cache_sync_readahead+0x47/0x4f
 [<c04c22c8>] ? generic_file_aio_read+0x277/0x57c
 [<c04fed10>] ? do_sync_read+0x96/0xcf
 [<c04ff05f>] ? rw_verify_area+0xd0/0xf3
 [<c046ae1a>] ? lock_is_held+0x73/0x7b
 [<c04ff3cc>] ? vfs_read+0x8d/0xd5
 [<c04fec7a>] ? do_sync_write+0xcf/0xcf
 [<c04ff456>] ? sys_read+0x42/0x63
 [<c084d7df>] ? sysenter_do_call+0x12/0x38
---[ end trace eec8da039e1a69f5 ]---

2) $ sed -n "1746 p" kernel/rcutree.c
	PROVE_RCU(WARN_ON_ONCE(lock_is_held(&rcu_lock_map)));

3) Is there a part of this call trace you want me to peek at in depth (eg, with gdb)?
Comment 210 Paul E. McKenney 2011-08-20 19:21:57 EDT
Gah.  I guess the moral of this story is to stop using CONFIG_RCU_FAST_NO_HZ until I get this fixed.  I might well need some help from the dyntick-idle guys on this one.

My guess is that the CPU tried to enter dyntick-idle mode, and RCU took it seriously, but the system reversed its decision and RCU was blissfully unaware of this fact, with the result you see.

If you would like to confirm my guess with gdb, that would be way cool.
Comment 211 Paul E. McKenney 2011-08-20 21:42:32 EDT
Created attachment 519158 [details]
Patch to remove rcu_needs_cpu_flush() to avoid false quiescent states.

This prototype patch should fix the CONFIG_RCU_FAST_NO_HZ problem.
Comment 212 Paul Bolle 2011-08-21 10:03:01 EDT
(In reply to comment #211)
> This prototype patch should fix the CONFIG_RCU_FAST_NO_HZ problem.

0) Note to self: tested with 3.1.0-rc2.local1. Cleaned up changelog:
* Sun Aug 21 2011 Paul Bolle <pebolle@tiscali.nl> 3.1.0-rc2.local1
- rcu: Remove rcu_needs_cpu_flush() to avoid false quiescent states
  (manually removed trivial conflict)

1) Patch could be applied on top of v3.1-rc2 with one (trivial) conflict.

2) Fresh boot, 8 runs, which took over 1 hour, no Oopses. Problem seems fixed by that prototype.

(In reply to comment #210)
> If you would like to confirm my guess with gdb, that would be way cool.

3) Do you still care? If so, what part of that impressive Call Trace would you like me to examine specifically?
Comment 213 Paul E. McKenney 2011-08-22 00:38:06 EDT
Just to be sure, this test was with CONFIG_RCU_FAST_NO_HZ=y, right?  (Sorry to be so paranoid, but then again I am still alive...)

The test results speak for themselves, and I need to add the relevant tracing anyway, so no need for gdb.  (I wasn't going to turn down an offer, of course!)
Comment 214 Paul Bolle 2011-08-22 04:06:11 EDT
(In reply to comment #213)
> Just to be sure, this test was with CONFIG_RCU_FAST_NO_HZ=y, right? 

$ grep -w CONFIG_RCU_FAST_NO_HZ /boot/config-3.1.0-rc2.local1.fc17.i686
CONFIG_RCU_FAST_NO_HZ=y
Comment 215 Paul E. McKenney 2011-08-22 08:59:37 EDT
Thank you very much!  I have queued this patch.
Comment 216 Vivek Goyal 2011-08-22 09:09:59 EDT
phew....... Looks like this bug has finally been nailed down. Hopefully my euphoria is not short-lived :-)

Anyway, thanks a ton PaulB and PaulM for figuring this one out.
Comment 217 Dave Jones 2011-08-22 09:58:36 EDT
Indeed, thanks guys, I'm impressed you guys managed to get this bug over 200 comments, and still remain productive.  

I'll get this merged in all our releases for our next updates.
Comment 218 Paul Bolle 2011-08-22 12:05:51 EDT
(In reply to comment #217)
> [...] you guys managed to get this bug over 200 comments [...].

0) Note that it turned out to be two bugs (both reported first by Michal):
- the first 130 comments (and 15 months) concerned a CFQ bug that was thought to be an RCU bug first;
- the next 87 comments (and just one month) concerned an RCU bug that I thought to be a CFQ bug at first.

1) Say, what's Michal's grace period for reporting his third issue in the intersection of CFQ and RCU before this bug is closed?
Comment 219 Josh Boyer 2011-08-23 10:20:59 EDT
*** Bug 726679 has been marked as a duplicate of this bug. ***
Comment 220 Michal Jaegermann 2011-08-31 20:59:02 EDT
(In reply to comment #216)
> phew....... Looks like finally this bug has been nailed down.

So what is the status of that fix?  I do not see anything mentioned in changelogs, even for rawhide kernels, and if something showed up "upstream" I missed it.  In the meantime I was hit by this again, quite "spontaneously" (that is, without using any test scripts to provoke it), while running kernel-3.1.0-0.rc4.git0.0.fc17.x86_64.

A familiar "Fixing recursive fault but reboot is needed!" is not kidding.  On various occasions a machine that was not rebooted locked up entirely some while after that message showed up, even if immediately after the hit it appeared to be running "normally".
Comment 221 Fedora Update System 2011-09-01 07:07:07 EDT
kernel-2.6.40.4-5.fc15 has been submitted as an update for Fedora 15.
https://admin.fedoraproject.org/updates/kernel-2.6.40.4-5.fc15
Comment 222 Josh Boyer 2011-09-01 07:17:36 EDT
(In reply to comment #221)
> kernel-2.6.40.4-5.fc15 has been submitted as an update for Fedora 15.
> https://admin.fedoraproject.org/updates/kernel-2.6.40.4-5.fc15

I'll be building an f14 kernel with the fix and submitting it for updates-testing today.
Comment 223 Fedora Update System 2011-09-01 11:25:16 EDT
kernel-2.6.35.14-96.fc14 has been submitted as an update for Fedora 14.
https://admin.fedoraproject.org/updates/kernel-2.6.35.14-96.fc14
Comment 224 Fedora Update System 2011-09-02 01:30:07 EDT
Package kernel-2.6.35.14-96.fc14:
* should fix your issue,
* was pushed to the Fedora 14 testing repository,
* should be available at your local mirror within two days.
Update it with:
# su -c 'yum update --enablerepo=updates-testing kernel-2.6.35.14-96.fc14'
as soon as you are able to, then reboot.
Please go to the following url:
https://admin.fedoraproject.org/updates/kernel-2.6.35.14-96.fc14
then log in and leave karma (feedback).
Comment 225 Fedora Update System 2011-09-06 19:58:15 EDT
kernel-2.6.35.14-96.fc14 has been pushed to the Fedora 14 stable repository.  If problems still persist, please make note of it in this bug report.
Comment 226 Fedora Update System 2011-09-06 20:01:19 EDT
kernel-2.6.40.4-5.fc15 has been pushed to the Fedora 15 stable repository.  If problems still persist, please make note of it in this bug report.
Comment 227 Michal Jaegermann 2011-09-09 18:26:50 EDT
AFAICS closing that bug as "CLOSED ERRATA" is somewhat premature.  Running in a loop 'mandb -c' on a system with a kernel-3.1.0-0.rc5.git0.0.fc17 with a build date "Wed 07 Sep 2011 02:58:01 PM MDT" simply locked up that machine with no traces in logs.  This happened not very far into the second loop branch after the first one finished without any incidents.
Comment 228 Paul Bolle 2011-09-10 14:02:03 EDT
(In reply to comment #227)
> AFAICS closing that bug as "CLOSED ERRATA" is somewhat premature.  Running in a
> loop 'mandb -c' on a system with a kernel-3.1.0-0.rc5.git0.0.fc17 with a build
> date "Wed 07 Sep 2011 02:58:01 PM MDT" simply locked up that machine with no
> traces in logs.  This happened not very far into the second loop branch after
> the first one finished without any incidents.

0) Are you sure some patch for this issue is actually part of that kernel?

1) kernel-3.1.0-0.rc5.git0.0.fc17 seems to correspond to commit 1bc752781f87f13a8747eb4caee00066499e4aa1 to Fedora's kernel.git. This is what I get when I try to find a related patch in that tree:

[kernel * master]$ git checkout 1bc752781f87f13a8747eb4caee00066499e4aa1
[...]
[kernel * (1bc7527...)]$ grep -i -e rcu -e nohz -e cfq -e 577968 -e dyntick kernel.spec 
Patch12025: rcu-avoid-just-onlined-cpu-resched.patch
ApplyPatch rcu-avoid-just-onlined-cpu-resched.patch

2) It looks like Rawhide's kernel is one of the few Fedora kernels running without a patch for this issue.
Comment 229 Michal Jaegermann 2011-09-10 15:19:40 EDT
(In reply to comment #228)
> 2) It looks like Rawhide's kernel is one of the few Fedora kernels running
> without a patch for this issues.

That's why I think that closing this is a bit hasty, especially since one can potentially get a hard lock with it.  OTOH if you are really sure that this will not end up in some release ...
Comment 230 Michal Jaegermann 2011-09-10 16:02:03 EDT
For an added attraction, I just found traces of this bug for the 3.1.0-0.rc5.git0.0.fc17.x86_64 kernel, but on that occasion the trigger was not 'mandb -c' nor any other cron job.  According to the logs none were running when it hit.
A trigger process is recorded as:
Process ld-linux-x86-64 (pid: 8657, threadinfo ffff88005a91c000, task ffff880076e40000)
so this could be really anything.
Comment 231 Josh Boyer 2011-09-12 07:27:31 EDT
(In reply to comment #229)
> (In reply to comment #228)
> > 2) It looks like Rawhide's kernel is one of the few Fedora kernels running
> > without a patch for this issues.
> 
> That why I think that closing this is a bit hasty; especially that one can
> potentially get a hard lock with it.  OTOH if you are really sure that this
> will not end up in some release ...

The bug is reported against f14.  It's been fixed in f14-f16.  I can throw the patch into rawhide today.  I would have hoped it makes its way upstream before 3.1 final, but who knows.
Comment 232 Josh Boyer 2011-09-12 07:32:17 EDT
The patch has been added to rawhide.  It will be included in the next build.
Comment 233 Paul Bolle 2011-10-07 13:03:26 EDT
0) I just ran into this issue while running (self-compiled and vanilla) v3.0.6 on an x86_64. This seems to be the first time I triggered this on a machine other than the i686 that so happily reproduced this error. Anyhow, self-compiled and vanilla means that this is not a Fedora issue.

1) But I do wonder what the status is of the patch that Fedora is now carrying. Has there been any effort to get this into mainline? As far as I know there's no fix for this issue included in v3.1-rc9. Or is there? And are there any plans of backporting a fix into the stable tree?
Comment 234 Paul E. McKenney 2011-10-07 14:32:35 EDT
Indeed, I would not expect it to be specific to Fedora.  The commit is 06ae115a1 in -tip, queued for the 3.2 merge window.  I did post it separately to LKML some time back (https://lkml.org/lkml/2011/8/19/355), and got absolutely no response whatsoever.  But the 3.2 merge window should be opening up soon.

It turns out to be against the rules to CC a patch to stable before that patch is in mainline.  But once that happens, it seems to me to be a good candidate for stable.
Comment 235 Paul Bolle 2011-10-07 17:18:57 EDT
(In reply to comment #234)
> The commit is
> 06ae115a1 in -tip, queued for the 3.2 merge window.  I did post it separately
> to LKML some time back (https://lkml.org/lkml/2011/8/19/355), and got
> absolutely no response whatsoever. 

0) I didn't check the commit in whatever tree -tip might be, but the link should be to https://lkml.org/lkml/2011/9/7/71 ("rcu: Remove rcu_needs_cpu_flush() to avoid false quiescent states"), which also got no response whatsoever.

1) If you resubmit this for the 3.2 merge window you might add a reference to this bug report to the commit message. Not really important.
Comment 236 Dave Jones 2011-11-21 16:38:56 EST
*** Bug 755219 has been marked as a duplicate of this bug. ***
