Created attachment 403363 [details]
dmesg output for 2.6.34-0.16.rc2.git0.fc14.x86_64 with a fault

Description of problem:
What looks like a side effect of running /etc/cron.daily: a "general protection fault: 0000 [#1] SMP" was registered, triggered by "Pid: 5613, comm: ld-linux-x86-64" and with "RIP: 0010:[<ffffffff81226d77>] [<ffffffff81226d77>] cfq_free_io_context+0x18/0x34". All that ended with "Fixing recursive fault but reboot is needed!". Full dmesg output attached.

Version-Release number of selected component (if applicable):
kernel-2.6.34-0.16.rc2.git0.fc14.x86_64

How reproducible:
So far this was quite a surprise

Actual results:

Expected results:
/usr/src/debug/kernel-2.6.33/linux-2.6.33.x86_64/block/cfq-iosched.c:2410
ffffffff81226d6b:	48 8b 5f 78          	mov    0x78(%rdi),%rbx   <=======
cfq_free_io_context():
/usr/src/debug/kernel-2.6.33/linux-2.6.33.x86_64/block/cfq-iosched.c:2475
ffffffff81226d6f:	49 89 fc             	mov    %rdi,%r12
__call_for_each_cic():
/usr/src/debug/kernel-2.6.33/linux-2.6.33.x86_64/block/cfq-iosched.c:2410
ffffffff81226d72:	48 85 db             	test   %rbx,%rbx
ffffffff81226d75:	74 17                	je     ffffffff81226d8e <cfq_free_io_context+0x2f>
ffffffff81226d77:	48 8b 03             	mov    (%rbx),%rax   <===========
/usr/src/debug/kernel-2.6.33/linux-2.6.33.x86_64/block/cfq-iosched.c:2411
ffffffff81226d7a:	48 8d 73 a8          	lea    -0x58(%rbx),%rsi
ffffffff81226d7e:	4c 89 e7             	mov    %r12,%rdi
/usr/src/debug/kernel-2.6.33/linux-2.6.33.x86_64/block/cfq-iosched.c:2410
ffffffff81226d81:	0f 18 08             	prefetcht0 (%rax)
/usr/src/debug/kernel-2.6.33/linux-2.6.33.x86_64/block/cfq-iosched.c:2411
ffffffff81226d84:	e8 53 ff ff ff       	callq  ffffffff81226cdc <cic_free_func>

rbx is ioc->cic_list (as cic_list is at offset 0x78 from the start of struct io_context), which the register dump shows is 0x6b6b6b6b6b6b6b6b. So, the io context has already been freed.

Kerneloops.org shows that this bug has manifested as early as 2.6.30.8, so I don't think this is a new problem, though 2.6.34 seems to be triggering it more easily.
OK, that's not good. Probably some tricky RCU sneakiness, I'll see if I can reproduce something like that here.
Any chance someone has a full trace of this?
The full trace was in the attachment:

general protection fault: 0000 [#1] SMP
last sysfs file: /sys/devices/virtual/block/dm-9/uevent
CPU 0
Modules linked in: autofs4 w83627hf hwmon_vid sunrpc ipv6 ipt_LOG iptable_nat nf_nat ext2 uinput e100 ppdev edac_core edac_mce_amd microcode snd_mpu401 ns558 i2c_viapro parport_pc parport snd_via82xx gameport snd_ac97_codec ac97_bus snd_seq snd_pcm floppy snd_timer snd_page_alloc snd_mpu401_uart snd_rawmidi snd_seq_device snd k8temp soundcore hwmon skge mii shpchp ata_generic pata_acpi firewire_ohci pata_via firewire_core crc_itu_t sata_via sata_promise radeon ttm drm_kms_helper drm i2c_algo_bit i2c_core [last unloaded: freq_table]
Pid: 5613, comm: ld-linux-x86-64 Not tainted 2.6.34-0.16.rc2.git0.fc14.x86_64 #1 SK8V/To Be Filled By O.E.M.
RIP: 0010:[<ffffffff81226d77>]  [<ffffffff81226d77>] cfq_free_io_context+0x18/0x34
RSP: 0018:ffff880016713e28  EFLAGS: 00010202
RAX: 00000001002c0f33 RBX: 6b6b6b6b6b6b6b6b RCX: 0000000000000000
RDX: 00000001002c0f36 RSI: ffff880002dd0bc0 RDI: 0000000000000296
RBP: ffff880016713e38 R08: ffff880015374680 R09: ffff880016bac4e0
R10: 0000000000000046 R11: ffff880016713d48 R12: ffff880015374680
R13: ffff88001665cfc8 R14: 0000000000000000 R15: ffff88001f5a8000
FS:  00007f2122d0e7a0(0000) GS:ffff880002c00000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: 00007feac7253000 CR3: 000000001b77c000 CR4: 00000000000006f0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Process ld-linux-x86-64 (pid: 5613, threadinfo ffff880016712000, task ffff88001665c900)
Stack:
 ffff880015374680 ffff88001665c900 ffff880016713e68 ffffffff8121f358
<0> ffffffff8121f323 ffff88001665c900 ffff88001665cfc8 ffff880015374680
<0> ffff880016713ea8 ffffffff8121f464 ffffffff8121f40f ffffffff8107dee4
Call Trace:
 [<ffffffff8121f358>] put_io_context+0x69/0x9c
 [<ffffffff8121f323>] ? put_io_context+0x34/0x9c
 [<ffffffff8121f464>] exit_io_context+0xa0/0xac
 [<ffffffff8121f40f>] ? exit_io_context+0x4b/0xac
 [<ffffffff8107dee4>] ? trace_hardirqs_on+0xd/0xf
 [<ffffffff81054775>] do_exit+0x79d/0x7d1
 [<ffffffff81129fb7>] ? path_put+0x22/0x27
 [<ffffffff8105482d>] do_group_exit+0x84/0xb0
 [<ffffffff81054870>] sys_exit_group+0x17/0x1b
 [<ffffffff81009c72>] system_call_fastpath+0x16/0x1b
Code: c6 93 6d 22 81 e8 3d a3 e8 ff 5b 41 5c 41 5d 41 5e c9 c3 55 48 89 e5 41 54 53 0f 1f 44 00 00 48 8b 5f 78 49 89 fc 48 85 db 74 17 <48> 8b 03 48 8d 73 a8 4c 89 e7 0f 18 08 e8 53 ff ff ff 48 8b 1b
RIP  [<ffffffff81226d77>] cfq_free_io_context+0x18/0x34
 RSP <ffff880016713e28>
---[ end trace 1a4cf5320fdfc46c ]---
Fixing recursive fault but reboot is needed!

Is that what you're looking for?
Thanks Jeff, dunno how I missed that there was an attachment. So next question, seems this was triggered by 'readahead'. That's the RH readahead thing, correct? Does it use io priorities? Or CLONE_IO?
I'm not convinced readahead is the problem. I think the reporter just said that as that was the last thing in the log before the GPF. The task associated with the failure is ld-linux-x86-64, which may be related to the prelink cron job. prelink cron job does this: renice +19 -p $$ >/dev/null 2>&1 but I don't see any changing of io priorities. The prelink binary doesn't seem to use CLONE_IO. Not sure I want to dig around in ld-linux-x86_64. ;-)
(In reply to comment #5)
> So next question, seems this was triggered by 'readahead'.

By timing this happened while running 'cron.daily', apparently in a scheduler, and dmesg output suggests that 'readahead' may be involved, so the above looked like a reasonable "working hypothesis", but I do not know if this is really proven.

> That's the RH readahead thing, correct?

This is from a "rawhide" installation for Fedora.
Jeff, I would not want you to dive into ld-linux-x86_64 either :-). So what else is being run by cron.daily on that system? Oh, and a kernel config from this kernel would help, just in case it depends on some special RCU setting (or something like that). I'll see if I can reproduce locally; from a look over the ioc exit/free path I didn't spot anything. I'll let something run over the next few days once I have an idea what that something should be. Thu/fri/mon are holidays here, so don't expect anything before next week.
Created attachment 403822 [details]
kernel config for x86_64

My system (F12) shows the following in cron.daily:
cups logrotate mlocate.cron readahead.cron makewhatis.cron prelink tmpwatch
Thanks for looking into this, Jens! I'll try to spend some time on it as well when I next come up for air.
FYI, you can also look here for backtrace information: http://kerneloops.org/guilty.php?guilty=cfq_free_io_context&version=2.6.34-rc&start=2228224&end=2260991&class=oops
(In reply to comment #9)
> My system (F12) shows the following in cron.daily:
> cups logrotate mlocate.cron readahead.cron makewhatis.cron prelink tmpwatch

Just got that again with the 2.6.34-0.28.rc3.git3.fc14.x86_64 kernel. "Call Trace" and "Code" do not really differ from those in attachment (id=403363) but this time one can find "Process gzip (pid: 5083, threadinfo ffff880013416000, task ffff880007f08000)" instead of ld-linux-x86-64 from the previous instance. Timestamps in logs locate that oops squarely between "starting makewhatis.cron" and "finished makewhatis.cron" with ample margins on both sides.
I have the next one with 2.6.34-0.28.rc3.git3.fc14.x86_64. This time makewhatis.cron is gone but timing implicates man-db.cron, and the triggering process is gzip again.
man-db.cron seems to be rather good at triggering this bug. Recently I see these recursive faults quite regularly, and always when the man-db.cron script is triggered by cron. The most recent ones are with kernel-2.6.34-0.38.rc5.git0.fc14.x86_64.
I was away for quite a while and not running rawhide. Upon return I pretty quickly got something familiar, but this time from a 2.6.35-... kernel. Here we go:

general protection fault: 0000 [#1] SMP
last sysfs file: /sys/devices/pci0000:00/0000:00:0f.0/host0/target0:0:0/0:0:0:0/block/sda/sda11/uevent
CPU 0
Modules linked in: autofs4 w83627hf hwmon_vid sunrpc ipv6 ipt_LOG iptable_nat nf_nat ext2 uinput snd_via82xx microcode snd_ac97_codec ac97_bus snd_seq e100 snd_pcm snd_timer snd_page_alloc ppdev edac_core edac_mce_amd snd_mpu401 k8temp snd_mpu401_uart snd_rawmidi parport_pc snd_seq_device parport i2c_viapro snd soundcore ns558 gameport skge mii shpchp ata_generic pata_acpi firewire_ohci firewire_core sata_promise pata_via sata_via crc_itu_t radeon ttm drm_kms_helper drm i2c_algo_bit i2c_core [last unloaded: mperf]
Pid: 28772, comm: gzip Not tainted 2.6.35-0.31.rc4.git4.fc14.x86_64 #1 SK8V/To Be Filled By O.E.M.
RIP: 0010:[<ffffffff81228a7c>]  [<ffffffff81228a7c>] __call_for_each_cic+0x21/0x3f
RSP: 0018:ffff880011537e28  EFLAGS: 00010202
RAX: 00000001001fd43f RBX: 6b6b6b6b6b6b6b6b RCX: ffff88001bbdef48
RDX: 00000001001fd442 RSI: ffff880002dd0d70 RDI: 0000000000000286
RBP: ffff880011537e48 R08: 0000000000000286 R09: ffff880011537d98
R10: ffff88001b0d1028 R11: ffff880011537df8 R12: ffff88001b0d1000
R13: ffffffff81228ac8 R14: 0000000000000001 R15: ffff88001f5c0000
FS:  00007f5d74174720(0000) GS:ffff880002c00000(0000) knlGS:00000000f7782720
CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: 00007f5d73a88470 CR3: 0000000001a43000 CR4: 00000000000006f0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Process gzip (pid: 28772, threadinfo ffff880011536000, task ffff88001b6f0000)
Stack:
 0000000000000001 ffff88001b0d1000 ffff88001b6f0000 ffff88001b6f0690
<0> ffff880011537e58 ffffffff81228aaf ffff880011537e78 ffffffff8121fc25
<0> ffff880011537e78 ffff88001b0d1000 ffff880011537ea8 ffffffff8121fce5
Call Trace:
 [<ffffffff81228aaf>] cfq_free_io_context+0x15/0x17
 [<ffffffff8121fc25>] put_io_context+0x41/0x5e
 [<ffffffff8121fce5>] exit_io_context+0x6c/0x74
 [<ffffffff8105416a>] do_exit+0x7a9/0x7d0
 [<ffffffff81496aae>] ? lockdep_sys_exit_thunk+0x35/0x67
 [<ffffffff81054427>] do_group_exit+0x88/0xb6
 [<ffffffff8105446c>] sys_exit_group+0x17/0x1b
 [<ffffffff81009c72>] system_call_fastpath+0x16/0x1b
Code: ed 00 00 41 59 48 98 5b c9 c3 55 48 89 e5 41 55 41 54 53 48 83 ec 08 0f 1f 44 00 00 48 8b 5f 78 49 89 fc 49 89 f5 48 85 db 74 15 <48> 8b 03 48 8d 73 b0 4c 89 e7 0f 18 08 41 ff d5 48 8b 1b eb e6
RIP  [<ffffffff81228a7c>] __call_for_each_cic+0x21/0x3f
 RSP <ffff880011537e28>
---[ end trace 8e807d826e855c8f ]---
Fixing recursive fault but reboot is needed!

If this is really a new bug and/or full dmesg output is desired then please let me know. It seems to be triggered by the same man-db.cron though.
This bug appears to have been reported against 'rawhide' during the Fedora 14 development cycle. Changing version to '14'. More information and reason for this action is here: http://fedoraproject.org/wiki/BugZappers/HouseKeeping
(In reply to comment #16)
> Changing version to '14'.

Should this be left as '14'? I was just hit by one of those on a rawhide kernel 2.6.35-0.57.rc6.git5.fc15.x86_64. That trace is basically the same as the one from comment 15 for 2.6.35-0.31.rc4.git4.fc14.x86_64.
*** Bug 585659 has been marked as a duplicate of this bug. ***
block/cfq-iosched.c:2492:

static void
__call_for_each_cic(struct io_context *ioc,
		    void (*func)(struct io_context *, struct cfq_io_context *))
{
	struct cfq_io_context *cic;
	struct hlist_node *n;

	hlist_for_each_entry_rcu(cic, n, &ioc->cic_list, cic_list)
		func(ioc, cic);
}

#define hlist_for_each_entry_rcu(tpos, pos, head, member)		\
	for (pos = rcu_dereference_raw((head)->first);			\
		pos && ({ prefetch(pos->next); 1; }) &&			\
		({ tpos = hlist_entry(pos, typeof(*tpos), member); 1; }); \
		pos = rcu_dereference_raw(pos->next))

We've made at least one trip through the loop (i.e. head->first was valid) and now we've hit a case where ->next points to freed memory.
*** Bug 624979 has been marked as a duplicate of this bug. ***
Attachment 439347 [details] has a trace from kernel 2.6.35.2-9
Created attachment 445756 [details]
backtrace for 2.6.36-0.16.rc3.git0.fc15.x86_64 kernel

Just to record that this general protection fault shows up in 2.6.36... kernels as well.
So far 2.6.36-0.16.rc3.git0.fc15.x86_64 was the last one with which I have seen it. I cannot vouch that this oops will not return but this was a pretty regular event up to that time. Maybe it is really gone? At least on my box.
I haven't seen anything go by that should fix this, so I'd be surprised. Thanks for the update, though!
Created attachment 447730 [details]
crash picture

I've just had a very similar crash with kernel-2.6.36-0.21.rc4.git1.fc15.x86_64. Only managed to capture the attached picture. This is the first time ever that I've seen this. Maybe there was some memory pressure at that point; Evolution grows a lot. Please ignore the taint, it's something in networking that has been going on and off for a number of kernel releases without any effect on the system as far as I can tell.
Staring at the code isn't bringing me any closer to enlightenment. Is there any way someone who can reproduce this could catch a crashdump for me?
(In reply to comment #26)
> Is there any
> way someone who can reproduce this could catch a crashdump for me?

Is there some detailed description somewhere explaining how to get what you would like to see? There is also the problem that, as noted in comment 23, I stopped seeing this, while earlier running makewhatis.cron was a really good way to trigger such a crash. OTOH what Chuck Ebbert wrote in comment 19 looked really convincing. A good question, though, is how to test that pos->next points to something already freed.
You can install the 'kexec-tools' package. Modify /etc/grub.conf, appending a "crashkernel=256M" to the kernel command line. Then, chkconfig kdump on and reboot. When your system comes back up, it should build an initrd for kdump. You can check to see if kdump loaded properly with 'service kdump status'. That should report that kdump is operational.

When the kernel next panics, your system will boot into the kdump kernel and store the contents of memory into a directory under /var/crash/. I'd need the vmcore file that gets saved, and the exact kernel version you are running, to then debug the problem further.

Chuck's comment means that there were at least two cic's active, which means that you had at least 2 processes doing I/O. Also, that list poison means exactly that there is a race between two list removal code paths.

I've been trying to reproduce this by running the cron.daily jobs every minute, and then every hour (two separate tests). I also tried to keep the system busy with I/O from multiple processes over the weekend and was not able to reproduce the crash.
(In reply to comment #28)
> When your system comes back up, it should build an initrd for kdump.

I wish it were that simple. Unfortunately mkdumprd is utterly broken and gets itself into an infinite loop like this:

....
+ cd ..
+ '[' '!' -f modalias ']'
+ '[' / = /sys/devices ']'
+ cd ..
+ '[' '!' -f modalias ']'
+ '[' / = /sys/devices ']'
+ cd ..
....

It also gets stuck without the -d option. Sigh!

> When the kernel next panics,

That does not appear to be as easy as in the past, where that was a quite regular event.

> I've been
> trying to reproduce this by running the cron.daily jobs every minute, and then
> every hour (two separate tests).

There is no makewhatis.cron anymore, which seemed particularly good at triggering that crash; mandb does not look like a suitable substitute. Anyway, I reinstalled and rebooted into kernel-2.6.36-0.16.rc3.git0.fc15.x86_64, as it was crashing like that, but so far no luck. I will try to see if I will get that crash again but, like I said, the last time it happened on September 7th at 11:12 and 11:14. Both times "comm" was dbus-uuidgen. I have the following in logs:

Sep 7 11:11:54 dyna0 kernel: RPC: Registered udp transport module.
Sep 7 11:11:54 dyna0 kernel: RPC: Registered tcp transport module.
Sep 7 11:11:54 dyna0 kernel: RPC: Registered tcp NFSv4.1 backchannel transport module.
Sep 7 11:11:56 dyna0 kernel: general protection fault: 0000 [#1] SMP
Sep 7 11:11:56 dyna0 kernel: last sysfs file: /sys/module/mperf/refcnt

and later

Sep 7 11:12:32 dyna0 kernel: readahead-collector: sorting
Sep 7 11:12:33 dyna0 kernel: readahead-collector: finished
Sep 7 11:14:13 dyna0 abrt: Kerneloops: Reported 1 kernel oopses to Abrt

and that created another directory in /var/spool/abrt/ but no other details in logs. It appears that user-space changes will make tracing that much harder (or maybe even hide it completely).
(In reply to comment #29)
> (In reply to comment #28)
> Unfortunately mkdumprd is utterly broken

See bug 635893.
(In reply to comment #28) > You can check to see if kdump loaded properly with 'service kdump status'. > That should report that kdump is operational. There is a problem. 'service kdump status' reports "Kdump is operational" and I have reproduced this crash using 2.6.36-0.16.rc3.git0.fc15.x86_64 kernel. Moreover in cmdline I have 'crashkernel=256M' and also /boot/initrd-2.6.36-0.16.rc3.git0.fc15.x86_64kdump.img as created yesterday after fixing mkdumprd to actually run. > When the kernel next panics, your system will boot into the kdump kernel and > store the contents of memory into a directory under /var/crash/. dmesg ends up with: RIP [<ffffffff81232c36>] __call_for_each_cic+0x21/0x3f RSP <ffff880017899e28> ---[ end trace 438ad5fa6ce9222e ]--- Fixing recursive fault but reboot is needed! but /var/crash/ is empty and no vmcore anywhere in sight. Now what? Should initrd-2.6.36-0.16.rc3.git0.fc15.x86_64kdump.img be used instead of a corresponding initramfs while booting?
Sorry, I forgot that Fedora doesn't set panic_on_oops by default. Add the following line to /etc/sysctl.conf:

kernel.panic_on_oops = 1

Then, next time you boot, that will be set. You can set it at runtime via:

sysctl -w kernel.panic_on_oops=1

I'm pretty sure that will work for a GPF. Again, sorry about this.
(In reply to comment #32)
> Add the following line to /etc/sysctl.conf:
>
> kernel.panic_on_oops = 1

OK. Will do. Now I will have to wait for the next crash. Unfortunately it does not happen on demand. I am using 2.6.36-0.16.rc3.git0.fc15.x86_64 because I know that at least with this one this was happening. With the last crash the trigger program was reported as "Pid: 20242, comm: gzip" and by timestamps this happened by the end of the man-db.cron run performed by cron.daily. Running that "by hand" so far does not seem to have the same effects.
Created attachment 449551 [details]
dmesg from 2.6.36-0.16.rc3.git0.fc15.x86_64 with crash at __call_for_each_cic+0x21/0x3f

It looks like I got it this time. By timing it appears that this happened in the middle of a series of updates performed by yum, messing up /var/lib/rpm/ as a side effect. I am not exactly sure what happened as the whole machine just died, cutting off a local console and a network as well, and also killing the X server for good measure. The abrt-generated backtrace claims:

last sysfs file: /sys/devices/virtual/block/dm-9/uevent
Pid: 7124, comm: hardlink

if that can be relied upon under the circumstances. The produced vmcore is 282M. Looks like something not exactly of a "bugzilla size".
Scratch comment 23. I got hit by this bug while running 2.6.36-0.26.rc5.git4.fc15.x86_64. I do not have anything in /var/crash from that, as in the meantime a 'kexec-tools' update broke /sbin/mkdumprd again, making it impossible to boot with the 'kdump' service active. The GPF happened when I was in the process of tracing and fixing mkdumprd. Hopefully the data from 2.6.36-0.16.rc3.git0.fc15.x86_64 will be good enough. I have no reliable method of reproduction.
Added to comment 35. Timestamps indicate that the crash happened again while running man-db.cron from cron.daily, and in 'backtrace' I have 'Pid: 10062, comm: gzip'. OTOH in /var/log/messages I can find this:

abrtd: New crash /var/spool/abrt/kerneloops-1285446849-2141-1, processing
abrtd: Registered Action plugin 'RunApp'
abrtd: RunApp('/var/spool/abrt/kerneloops-1285446849-2141-1','test x"`cat component`" = x"xorg-x11-server-Xorg" && cp /var/log/Xorg.0.log

and the X server is indeed gone. Not sure how to interpret that.
I just got this too on kernel 2.6.35.4-28.fc14.x86_64, and based on log timestamps it occurred while running man-db.cron here too.

general protection fault: 0000 [#1] SMP
last sysfs file: /sys/devices/virtual/block/dm-1/dm/name
CPU 0
Modules linked in: fuse rfcomm sco bnep l2cap nfsd lockd nfs_acl auth_rpcgss exportfs sunrpc cpufreq_ondemand powernow_k8 freq_table mperf nf_conntrack_netbios_ns ip6t_REJECT nf_conntrack_ipv6 ip6table_filter ip6_tables ipv6 uinput snd_via82xx gameport snd_via82xx_modem snd_ac97_codec ac97_bus snd_seq snd_pcm ppdev parport_pc snd_timer shpchp snd_mpu401_uart snd_rawmidi snd_seq_device btusb bluetooth amd64_edac_mod parport rfkill snd microcode soundcore skge edac_core i2c_viapro snd_page_alloc k8temp edac_mce_amd ata_generic pata_acpi firewire_ohci pata_via firewire_core sata_via sata_promise crc_itu_t radeon ttm drm_kms_helper drm i2c_algo_bit i2c_core [last unloaded: scsi_wait_scan]
Pid: 29167, comm: gzip Not tainted 2.6.35.4-28.fc14.x86_64 #1 A8V Deluxe/To Be Filled By O.E.M.
RIP: 0010:[<ffffffff8122d6b8>]  [<ffffffff8122d6b8>] __call_for_each_cic+0x21/0x3f
RSP: 0018:ffff88000c017e28  EFLAGS: 00010202
RAX: 00000001014dbb1d RBX: 6b6b6b6b6b6b6b6b RCX: ffff880035acea98
RDX: 00000001014dbb20 RSI: ffff8800047d0da0 RDI: 0000000000000286
RBP: ffff88000c017e48 R08: 0000000000000286 R09: ffff88000c017d98
R10: ffff88003a563df8 R11: ffff88000c017df8 R12: ffff88003a563dd0
R13: ffffffff8122d704 R14: 0000000000000001 R15: ffff88003e680000
FS:  00007f0c4abb5720(0000) GS:ffff880004600000(0000) knlGS:00000000f77cc850
CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: 00007f0c4a4be770 CR3: 0000000001a43000 CR4: 00000000000006f0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Process gzip (pid: 29167, threadinfo ffff88000c016000, task ffff88001d3c8000)
 0000000000000001 ffff88003a563dd0 ffff88001d3c8000 ffff88001d3c8688
<0> ffff88000c017e58 ffffffff8122d6eb ffff88000c017e78 ffffffff81224861
<0> ffff88000c017e78 ffff88003a563dd0 ffff88000c017ea8 ffffffff81224921
Call Trace:
 [<ffffffff8122d6eb>] cfq_free_io_context+0x15/0x17
 [<ffffffff81224861>] put_io_context+0x41/0x5e
 [<ffffffff81224921>] exit_io_context+0x6c/0x74
 [<ffffffff810551b7>] do_exit+0x7a9/0x7d0
 [<ffffffff8149d437>] ? lockdep_sys_exit_thunk+0x35/0x67
 [<ffffffff81055474>] do_group_exit+0x88/0xb6
 [<ffffffff810554b9>] sys_exit_group+0x17/0x1b
 [<ffffffff81009c72>] system_call_fastpath+0x16/0x1b
Code: ed 00 00 41 59 48 98 5b c9 c3 55 48 89 e5 41 55 41 54 53 48 83 ec 08 0f 1f 44 00 00 48 8b 5f 78 49 89 fc 49 89 f5 48 85 db 74 15 <48> 8b 03 48 8d 73 b0 4c 89 e7 0f 18 08 41 ff d5 48 8b 1b eb e6
RIP  [<ffffffff8122d6b8>] __call_for_each_cic+0x21/0x3f
 RSP <ffff88000c017e28>
---[ end trace 48092b8495a64693 ]---
Fixing recursive fault but reboot is needed!
*** Bug 637068 has been marked as a duplicate of this bug. ***
*** Bug 638677 has been marked as a duplicate of this bug. ***
Anybody interested in the vmcore produced (see comment 34, and comment 28 for a procedure description)? Please contact me directly if yes. It took considerable effort to produce that one. Due to the size of this file it is not a good candidate for a bugzilla attachment. The corresponding dmesg output is in attachment 449551 [details].
I sent email privately; I am definitely interested in looking at that crash.
*** Bug 520620 has been marked as a duplicate of this bug. ***
I do not see this issue for kernel-2.6.35.11-83.fc14.x86_64 anymore. This bug should be safe to close now.
(In reply to comment #43)
> I do not see this issue for kernel-2.6.35.11-83.fc14.x86_64 anymore. This bug
> should be safe to close now.

I'm glad it's not crashing for you anymore. I got a similar report from Michal. I would like to keep this bug open, though, as I believe there is still a race in the code, it's just not tickled for you anymore.
0) Still occurs on vanilla v2.6.38.2 (otherwise standard Rawhide). Not a Fedora blessed kernel, but should be enough to convince people this bug can still be triggered.

1) From /var/log/messages:

Apr 6 11:46:40 t41 kernel: [  573.045271] BUG: unable to handle kernel paging request at 6b6b6b6b
Apr 6 11:46:40 t41 kernel: [  573.045411] IP: [<c05e10ae>] __call_for_each_cic+0x24/0x3a
Apr 6 11:46:40 t41 kernel: [  573.045522] *pde = 00000000
Apr 6 11:46:40 t41 kernel: [  573.045581] Oops: 0000 [#1] SMP
Apr 6 11:46:40 t41 kernel: [  573.045650] last sysfs file: /sys/devices/pci0000:00/0000:00:1f.1/host1/target1:0:0/1:0:0:0/block/sr0/uevent
Apr 6 11:46:40 t41 kernel: [  573.045805] Modules linked in: sco bnep l2cap bluetooth fuse cpufreq_ondemand acpi_cpufreq mperf ip6t_REJECT nf_conntrack_ipv6 nf_defrag_ipv6 ip6table_filter ip6_tables nf_conntrack_netbios_ns snd_intel8x0m arc4 snd_intel8x0 ath5k snd_ac97_codec ath ac97_bus mac80211 snd_seq snd_seq_device snd_pcm ppdev cfg80211 thinkpad_acpi microcode snd_timer parport_pc rfkill e1000 iTCO_wdt snd_page_alloc joydev pcspkr i2c_i801 iTCO_vendor_support snd soundcore parport uinput ipv6 yenta_socket video radeon ttm drm_kms_helper drm i2c_algo_bit i2c_core [last unloaded: scsi_wait_scan]
Apr 6 11:46:40 t41 kernel: [  573.046032]
Apr 6 11:46:40 t41 kernel: [  573.046032] Pid: 2111, comm: cdrom_id Not tainted 2.6.38.2-0.local0.fc16.i686 #1 IBM /
Apr 6 11:46:40 t41 kernel: [  573.046032] EIP: 0060:[<c05e10ae>] EFLAGS: 00010202 CPU: 0
Apr 6 11:46:40 t41 kernel: [  573.046032] EIP is at __call_for_each_cic+0x24/0x3a
Apr 6 11:46:40 t41 kernel: [  573.046032] EAX: 00000001 EBX: 6b6b6b6b ECX: 00000246 EDX: c0a9a6b0
Apr 6 11:46:40 t41 kernel: [  573.046032] ESI: edd3a180 EDI: c05e13c1 EBP: ed793f18 ESP: ed793f0c
Apr 6 11:46:40 t41 kernel: [  573.046032] DS: 007b ES: 007b FS: 00d8 GS: 00e0 SS: 0068
Apr 6 11:46:40 t41 kernel: [  573.046032] Process cdrom_id (pid: 2111, ti=ed792000 task=edf0bf90 task.ti=ed792000)
Apr 6 11:46:40 t41 kernel: [  573.046032] Stack:
Apr 6 11:46:40 t41 kernel: [  573.046032]  edd3a180 edf0bf90 edf0c368 ed793f20 c05e10d6 ed793f2c c05d691a edd3a180
Apr 6 11:46:40 t41 kernel: [  573.046032]  ed793f40 c05d69aa edf0bf90 00000012 ed793f74 ed793f8c c04405d6 00000085
Apr 6 11:46:40 t41 kernel: [  573.046032]  00000000 00000000 edf0bf90 ed799e28 edf0c19c edf0bf88 edf0c19c 00000001
Apr 6 11:46:40 t41 kernel: [  573.046032] Call Trace:
Apr 6 11:46:40 t41 kernel: [  573.046032]  [<c05e10d6>] cfq_free_io_context+0x12/0x14
Apr 6 11:46:40 t41 kernel: [  573.046032]  [<c05d691a>] put_io_context+0x34/0x5c
Apr 6 11:46:40 t41 kernel: [  573.046032]  [<c05d69aa>] exit_io_context+0x68/0x6d
Apr 6 11:46:40 t41 kernel: [  573.046032]  [<c04405d6>] do_exit+0x609/0x62c
Apr 6 11:46:40 t41 kernel: [  573.046032]  [<c0440810>] do_group_exit+0x63/0x86
Apr 6 11:46:40 t41 kernel: [  573.046032]  [<c044084b>] __wake_up_parent+0x0/0x2b
Apr 6 11:46:40 t41 kernel: [  573.046032]  [<c04030df>] sysenter_do_call+0x12/0x38
Apr 6 11:46:40 t41 kernel: [  573.046032] Code: e8 d8 41 e8 ff 5d c3 55 89 e5 57 56 53 3e 8d 74 26 00 8b 58 4c 89 c6 89 d7 e8 bf 22 e7 ff 85 c0 74 05 e8 bc ff ff ff 85 db 74 11 <8b> 03 0f 18 00 90 8d 53 d8 89 f0 ff d7 8b 1b eb dd 5b 5e 5f 5d
Apr 6 11:46:40 t41 kernel: [  573.046032] EIP: [<c05e10ae>] __call_for_each_cic+0x24/0x3a SS:ESP 0068:ed793f0c
Apr 6 11:46:40 t41 kernel: [  573.046032] CR2: 000000006b6b6b6b
Apr 6 11:46:40 t41 kernel: [  573.074456] ---[ end trace 7ef6dd0aa5934c5e ]---
Apr 6 11:46:40 t41 kernel: [  573.074468] Fixing recursive fault but reboot is needed!
0) And this one is from vanilla v2.6.39-rc2. Again, not Fedora blessed, but it might still be interesting:

Apr 8 09:25:34 t41 kernel: [ 1917.490095] BUG: unable to handle kernel paging request at 6b6b6b6b
Apr 8 09:25:34 t41 kernel: [ 1917.490781] IP: [<c05e84c6>] __call_for_each_cic+0x24/0x3a
Apr 8 09:25:34 t41 kernel: [ 1917.491020] *pde = 00000000
Apr 8 09:25:34 t41 kernel: [ 1917.491020] Oops: 0000 [#1] SMP
Apr 8 09:25:34 t41 kernel: [ 1917.491020] last sysfs file: /sys/devices/pci0000:00/0000:00:1f.1/host0/target0:0:0/0:0:0:0/block/sda/uevent
Apr 8 09:25:34 t41 kernel: [ 1917.491020] Modules linked in: bnep bluetooth fuse cpufreq_ondemand acpi_cpufreq mperf ip6t_REJECT nf_conntrack_ipv6 nf_defrag_ipv6 ip6table_filter nf_conntrack_netbios_ns nf_conntrack_broadcast ip6_tables snd_intel8x0 snd_intel8x0m arc4 snd_ac97_codec ac97_bus ath5k snd_seq ath mac80211 snd_seq_device ppdev snd_pcm microcode thinkpad_acpi parport_pc cfg80211 parport e1000 snd_timer rfkill snd iTCO_wdt joydev iTCO_vendor_support i2c_i801 soundcore pcspkr snd_page_alloc uinput ipv6 yenta_socket video radeon ttm drm_kms_helper drm i2c_algo_bit i2c_core [last unloaded: scsi_wait_scan]
Apr 8 09:25:34 t41 kernel: [ 1917.491020]
Apr 8 09:25:34 t41 kernel: [ 1917.491020] Pid: 15233, comm: mandb Not tainted 2.6.39-0.rc2.git0.0.local0.fc16.i686 #1 IBM /
Apr 8 09:25:34 t41 kernel: [ 1917.491020] EIP: 0060:[<c05e84c6>] EFLAGS: 00010202 CPU: 0
Apr 8 09:25:34 t41 kernel: [ 1917.491020] EIP is at __call_for_each_cic+0x24/0x3a
Apr 8 09:25:34 t41 kernel: [ 1917.491020] EAX: 00000001 EBX: 6b6b6b6b ECX: 00000246 EDX: c0acaa98
Apr 8 09:25:34 t41 kernel: [ 1917.491020] ESI: eae6fc80 EDI: c05e87d9 EBP: e546bf18 ESP: e546bf0c
Apr 8 09:25:34 t41 kernel: [ 1917.491020] DS: 007b ES: 007b FS: 00d8 GS: 00e0 SS: 0068
Apr 8 09:25:34 t41 kernel: [ 1917.491020] Process mandb (pid: 15233, ti=e546a000 task=e5ffe9f0 task.ti=e546a000)
Apr 8 09:25:34 t41 kernel: [ 1917.491020] Stack:
Apr 8 09:25:34 t41 kernel: [ 1917.491020]  eae6fc80 e5ffe9f0 e5ffedc0 e546bf20 c05e84ee e546bf2c c05ddbae eae6fc80
Apr 8 09:25:34 t41 kernel: [ 1917.491020]  e546bf40 c05ddc3e e5ffe9f0 00000012 e546bf74 e546bf8c c0440821 eaa97a28
Apr 8 09:25:34 t41 kernel: [ 1917.491020]  e5fb8e08 e546bf70 c04f49ea 00000001 e5ffebf4 e5ffe9e8 e5ffebf4 00000001
Apr 8 09:25:34 t41 kernel: [ 1917.491020] Call Trace:
Apr 8 09:25:34 t41 kernel: [ 1917.491020]  [<c05e84ee>] cfq_free_io_context+0x12/0x14
Apr 8 09:25:34 t41 kernel: [ 1917.491020]  [<c05ddbae>] put_io_context+0x34/0x5c
Apr 8 09:25:34 t41 kernel: [ 1917.491020]  [<c05ddc3e>] exit_io_context+0x68/0x6d
Apr 8 09:25:34 t41 kernel: [ 1917.491020]  [<c0440821>] do_exit+0x634/0x657
Apr 8 09:25:34 t41 kernel: [ 1917.491020]  [<c04f49ea>] ? fsnotify_modify+0x64/0x6c
Apr 8 09:25:34 t41 kernel: [ 1917.491020]  [<c0440a5b>] do_group_exit+0x63/0x86
Apr 8 09:25:34 t41 kernel: [ 1917.491020]  [<c0440a96>] sys_exit_group+0x18/0x18
Apr 8 09:25:34 t41 kernel: [ 1917.491020]  [<c082045f>] sysenter_do_call+0x12/0x38
Apr 8 09:25:34 t41 kernel: [ 1917.491020] Code: e8 5c db e7 ff 5d c3 55 89 e5 57 56 53 3e 8d 74 26 00 8b 58 4c 89 c6 89 d7 e8 df b1 e6 ff 85 c0 74 05 e8 bc ff ff ff 85 db 74 11 <8b> 03 0f 18 00 90 8d 53 d8 89 f0 ff d7 8b 1b eb dd 5b 5e 5f 5d
Apr 8 09:25:34 t41 kernel: [ 1917.491020] EIP: [<c05e84c6>] __call_for_each_cic+0x24/0x3a SS:ESP 0068:e546bf0c
Apr 8 09:25:34 t41 kernel: [ 1917.491020] CR2: 000000006b6b6b6b
Apr 8 09:25:34 t41 kernel: [ 1917.575977] ---[ end trace 6895ae03d9c0f56d ]---
Apr 8 09:25:34 t41 kernel: [ 1917.575982] Fixing recursive fault but reboot is needed!
1) A quick look at the output of "objdump -l -d block/cfq-iosched.o" suggests this fault is triggered by "prefetch(pos->next)" in the hlist_for_each_entry_rcu macro in include/linux/rculist.h:

[...]block/cfq-iosched.c:2594
    1840:	e8 fc ff ff ff       	call   1841 <__call_for_each_cic+0x13>
    1845:	85 c0                	test   %eax,%eax
    1847:	74 05                	je     184e <__call_for_each_cic+0x20>
    1849:	e8 bc ff ff ff       	call   180a <rcu_read_lock_held>
    184e:	85 db                	test   %ebx,%ebx
    1850:	74 11                	je     1863 <__call_for_each_cic+0x35>
prefetch():
[...]arch/x86/include/asm/processor.h:833
    1852:	8b 03                	mov    (%ebx),%eax
    1854:	8d 74 26 00          	lea    0x0(%esi,%eiz,1),%esi

(This was already hinted at in an earlier comment.)
We could "fix" this by making deadline the default IO scheduler :/ I switched to that on all my systems long ago...
(In reply to comment #47)
> We could "fix" this by making deadline the default IO scheduler :/ I switched
> to that on all my systems long ago...

I believe that's an entirely different discussion, Chuck. Deadline has its own set of issues when run on a S-ATA rotational disk (which, I might add, are very common). I understand the frustration surrounding this issue, and apologize that it hasn't been addressed in a timely fashion.
(In reply to comment #48)
> (In reply to comment #47)
> > We could "fix" this by making deadline the default IO scheduler :/ I switched
> > to that on all my systems long ago...
>
> I believe that's an entirely different discussion, Chuck. Deadline has its own
> set of issues when run on a S-ATA rotational disk (which, I might add, are very
> common).

What issues? I'm certainly not getting any oopses by running with deadline.
Background writeout can completely starve readers with deadline on SATA disks. For example, if you start a dd process writing to a file, then start up firefox or some other application that is not in cache, firefox may not start until the dd has completed.

For those people experiencing this oops, switching to deadline is a good workaround until it is fixed. I don't think that many people are hitting this, though.
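For reference, the workaround discussed here boils down to switching the elevator through sysfs at runtime, or on the kernel command line. A sketch (the device name sda is an example; adjust to your disk):

```shell
# show the available schedulers; the active one is printed in brackets
cat /sys/block/sda/queue/scheduler
# switch the running system to deadline (takes effect immediately)
echo deadline > /sys/block/sda/queue/scheduler
# or make it the boot-time default for all disks by adding this to the
# kernel command line:
#     elevator=deadline
```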
(In reply to comment #50)
> I don't think that many people are hitting this,
> though.

And, as far as I know, no one has been able to reliably reproduce this. (I tried a few things, but without success. Ie, without even once seeing this oops when I actually wanted it to occur.)
(In reply to comment #51)
> And, as far as I know, no one has been able to reliably reproduce this.

There was a long series of kernels last year where I was hitting this quite regularly. Jeffrey Moyer got a vmcore file from me, produced in one of those incidents. Still, at some moment it stopped and I have not seen a single instance since. No idea why.
(In reply to comment #46)
>
> 1) A quick look at the output of "objdump -l -d block/cfq-iosched.o" suggests
> this fault is triggered by "prefetch(pos->next)" in the
> hlist_for_each_entry_rcu macro in include/linux/rculist.h
>
> [...]block/cfq-iosched.c:2594
>     1840: e8 fc ff ff ff       call   1841 <__call_for_each_cic+0x13>
>     1845: 85 c0                test   %eax,%eax
>     1847: 74 05                je     184e <__call_for_each_cic+0x20>
>     1849: e8 bc ff ff ff       call   180a <rcu_read_lock_held>
>     184e: 85 db                test   %ebx,%ebx
>     1850: 74 11                je     1863 <__call_for_each_cic+0x35>
> prefetch():
> [...]arch/x86/include/asm/processor.h:833
>     1852: 8b 03                mov    (%ebx),%eax
>     1854: 8d 74 26 00          lea    0x0(%esi,%eiz,1),%esi
>
> (This was already hinted at in an earlier comment.)

I am a little confused between the two. IIUC, the previous comment hinted that pos->next is a valid pointer but the object pos->next is pointing to has been freed. And this comment seems to be suggesting that pos->next itself is 6b6b6b.. and that we are already inside a freed object.

Anyway, the bottom line seems to be that ioc->cic_list has a cic which has already been freed. I have looked at the code closely and can't find anything wrong yet. There seem to be two paths/places where cics are removed from ioc->cic_list.
static void cic_free_func(struct io_context *ioc, struct cfq_io_context *cic)
{
	unsigned long flags;
	unsigned long dead_key = (unsigned long) cic->key;

	BUG_ON(!(dead_key & CIC_DEAD_KEY));

	spin_lock_irqsave(&ioc->lock, flags);
	radix_tree_delete(&ioc->radix_root, dead_key >> CIC_DEAD_INDEX_SHIFT);
	hlist_del_rcu(&cic->cic_list);
	spin_unlock_irqrestore(&ioc->lock, flags);

	cfq_cic_free(cic);
}

static void cfq_drop_dead_cic(struct cfq_data *cfqd, struct io_context *ioc,
			      struct cfq_io_context *cic)
{
	unsigned long flags;

	WARN_ON(!list_empty(&cic->queue_list));
	BUG_ON(cic->key != cfqd_dead_key(cfqd));

	spin_lock_irqsave(&ioc->lock, flags);
	BUG_ON(ioc->ioc_data == cic);
	radix_tree_delete(&ioc->radix_root, cfqd->cic_index);
	hlist_del_rcu(&cic->cic_list);
	spin_unlock_irqrestore(&ioc->lock, flags);

	cfq_cic_free(cic);
}

In both places we take ioc->lock to remove the cic from the list and then call cfq_cic_free(), which frees up a cic only after an RCU grace period has elapsed.

static void cfq_cic_free(struct cfq_io_context *cic)
{
	call_rcu(&cic->rcu_head, cfq_cic_free_rcu);
}

Given the above, I am still wondering how we can reach a situation where we get to a cic which has been freed but is still on the ioc->cic_list.

Also a question: does anybody know how this poisoning scheme works? I mean, are these poison bytes of 6b6b6b.. written over the whole slab object whenever an object is freed?
When I was working on it, my guess was that we were getting into the drop_dead_cic case, and that this was racy (no clue why we keep the cic on the rbtree at all... I think it would be safe to remove it from there). I couldn't prove it, though.
(In reply to comment #53)

0) One annoyance with the code in question is that there are three members called cic_list which are all (slightly) different. I found that rather confusing ...

> Given above, I am still wondering how can we reach a situation where we get to
> a cic which has been freed but still on the ioc->cic_list.

1) This is all RCU stuff, which I'm rather clueless about.

Anyhow, my hunch is that the code here runs with readers (code just walking the hlist here) and writers (code manipulating that same list) working with different copies of that list at the same time (which is what RCU seems to be all about). Normally it shouldn't matter if a reader has an outdated view of an hlist. But here it apparently does matter to some readers (and so maybe an hlist doesn't really fit the job here). But my hunch didn't help me while looking at the code until, well, I just gave up ...

> Also a question. Does anybody know how this poisoning scheme works. I mean
> these poison bytes of 6b6b6b.. are written on whole slab object whenever a
> object is freed?

2) I'm afraid I don't.
(In reply to comment #53)
> Also a question. Does anybody know how this poisoning scheme works. I mean
> these poison bytes of 6b6b6b.. are written on whole slab object whenever a
> object is freed?

Correct:

#define POISON_FREE 0x6b /* for use-after-free poisoning */

...mainly this happens in cache_free_debugcheck for the slab allocator. I'm a little less familiar with slub though, but I'm pretty sure it uses the same preprocessor constant.
(In reply to comment #55)
> (In reply to comment #53)
> 0) One annoyance with the code in question is that there are three members
> called cic_list which are all (slightly) different. I found that rather
> confusing ...
>
> > Given above, I am still wondering how can we reach a situation where we get to
> > a cic which has been freed but still on the ioc->cic_list.
>
> 1) This is all RCU stuff, which I'm rather clueless about.
>
> Anyhow, my hunch is that the code here runs with readers (code just walking the
> hlist here) and writers (code manipulating that same list) working with
> different copies of that list at the same time (which is what RCU seems to be
> all about). Normally it shouldn't matter if a reader has an outdated view of an
> hlist. But here it apparently does matter to some readers (and so maybe an
> hlist doesn't really fit the job here). But my hunch didn't help me while
> looking at the code until, well, I just gave up ...

I have checked that readers are always running with the rcu read lock held. That will make sure that even if a writer has removed a cic from ioc->cic_list, it will not free the object till we drop the rcu read lock (unless rcu has stopped working :-)). Because we are doing call_rcu() before freeing the cic, we should never run into a situation where a reader gets a reference to a freed object, as long as we have taken the read lock. (That's how I understand it.)

Now the tricky bit is why we are running into the freed object despite the fact that we are holding the rcu read lock in the cfq_free_io_context() path.
(In reply to comment #54)
> When I was working on it, my guess was that we were getting into the
> drop_dead_cic case, and that this was racy (no clue why we keep the cic on the
> rbtree at all... I think it would be safe to remove it from there). I couldn't
> prove it, though.

Jeff,

I guess there are two reasons.

- ioc->cic_list and the radix tree are protected by ioc->lock. Maybe we are trying to avoid taking that lock when the process is exiting, and just mark the cic dead.

- If we drop it from the rb-tree then we will lose track of the old/dead cic which needs to be dropped. By keeping it in the tree, it pops up in our search and we can drop it, hence no need to look into a separate list.

Do you recall the details of why you think drop_dead_cic is racy? I do see that we are taking ioc->lock before we modify the list and rb-tree, and freeing of the object is done under call_rcu(). So that sounds good, at least on the surface.
Paul, is this problem consistently reproducible for you? One of the big stumbling blocks here is that we can't reproduce it consistently to dive deeper.

Is there any specific config option enabled in your config, like all the preemption stuff?
(In reply to comment #58)
> (In reply to comment #54)
> > When I was working on it, my guess was that we were getting into the
> > drop_dead_cic case, and that this was racy (no clue why we keep the cic on the
> > rbtree at all... I think it would be safe to remove it from there). I couldn't
> > prove it, though.
>
> Jeff,
>
> I guess there are two reasons.
>
> - ioc->cic_list and radix tree is protected by ioc->ioc_lock. May be we are
> trying to avoid taking that lock when process is exiting and just mark cic
> dead.

Could be. I'd have to check to see if taking the lock could cause problems.

> - If we drop it from rb-tree then we will lose track of old/dead cic which
> needs to be dropped. By keeping it in tree, it pops up in our search and we can
> drop it hence no need to look into separate list.

It should be removed anyway when the grace period expires, no? My memory is pretty fuzzy on this, and I don't have time to dig back in right now, sorry.

> Do you recall details that why do you think drop_dead_cic is racy. I do see
> that we are taking ioc->lock before we modify list and rb-tree and freeing of
> object is done under call_rcu(). So that sounds good atleast on the surface.

Fuzzy memory. As you know, I was also trying to perform io_context tracking throughout the I/O stack, so some of my memories may be from kernels with my patches applied.
(In reply to comment #59)
> Paul, is this problem reproducible consistently for you. One of the big
> stumbling blocks here is that we can't reproduce it consistently to dive
> deeper.

No, see comment #51.

(The context of the oopses in comment #45 and comment #46 - which isn't clear from the comments themselves - could be said to be something like "a period with a lot of (relatively small?) IO". I seem to remember that comment #46 was triggered by installing a new kernel rpm. But, again, I haven't been able to reproduce this oops.)
For what it's worth, shutting down (not Fedora distributed) vanilla v2.6.39-rc3 just gave me this:

Apr 22 21:57:21 t41 kernel: [253346.725182] BUG: unable to handle kernel paging request at 6b6b6b6b
Apr 22 21:57:21 t41 kernel: [253346.726034] IP: [<c05e850a>] __call_for_each_cic+0x24/0x3a
Apr 22 21:57:21 t41 kernel: [253346.726034] *pde = 00000000
Apr 22 21:57:21 t41 kernel: [253346.726034] Oops: 0000 [#1] SMP
Apr 22 21:57:21 t41 kernel: [253346.726034] last sysfs file: /sys/devices/LNXSYSTM:00/device:00/PNP0A03:00/device:01/PNP0C09:00/PNP0C0A:00/power_supply/BAT0/energy_full
Apr 22 21:57:21 t41 kernel: [253346.726034] Modules linked in: bnep bluetooth fuse cpufreq_ondemand acpi_cpufreq mperf ip6t_REJECT nf_conntrack_ipv6 nf_defrag_ipv6 ip6table_filter nf_conntrack_netbios_ns nf_conntrack_broadcast ip6_tables snd_intel8x0 snd_intel8x0m snd_ac97_codec ac97_bus snd_seq arc4 snd_seq_device ath5k ath mac80211 ppdev snd_pcm microcode thinkpad_acpi cfg80211 parport_pc snd_timer snd e1000 parport rfkill iTCO_wdt soundcore iTCO_vendor_support snd_page_alloc pcspkr joydev i2c_i801 uinput ipv6 yenta_socket video radeon ttm drm_kms_helper drm i2c_algo_bit i2c_core [last unloaded: scsi_wait_scan]
Apr 22 21:57:21 t41 kernel: [253346.726034]
Apr 22 21:57:21 t41 kernel: [253346.726034] Pid: 1405, comm: krb5-auth-dialo Tainted: G W 2.6.39-0.rc3.git0.0.local0.fc16.i686 #1 IBM /
Apr 22 21:57:21 t41 kernel: [253346.726034] EIP: 0060:[<c05e850a>] EFLAGS: 00010202 CPU: 0
Apr 22 21:57:21 t41 kernel: [253346.726034] EIP is at __call_for_each_cic+0x24/0x3a
Apr 22 21:57:21 t41 kernel: [253346.726034] EAX: 00000001 EBX: 6b6b6b6b ECX: 00000246 EDX: c0acaa98
Apr 22 21:57:21 t41 kernel: [253346.726034] ESI: c802f200 EDI: c05e881d EBP: eef91dec ESP: eef91de0
Apr 22 21:57:21 t41 kernel: [253346.726034] DS: 007b ES: 007b FS: 00d8 GS: 00e0 SS: 0068
Apr 22 21:57:21 t41 kernel: [253346.726034] Process krb5-auth-dialo (pid: 1405, ti=eef90000 task=ee043f90 task.ti=eef90000)
Apr 22 21:57:21 t41 kernel: [253346.726034] Stack:
Apr 22 21:57:21 t41 kernel: [253346.726034] c802f200 ee043f90 ee044360 eef91df4 c05e8532 eef91e00 c05ddbf2 c802f200
Apr 22 21:57:21 t41 kernel: [253346.726034] eef91e14 c05ddc82 ee043f90 00000000 eef91e48 eef91e60 c04407dd eef91e24
Apr 22 21:57:21 t41 kernel: [253346.726034] c0465bac eef91e38 c045af59 c0bb46b8 ee044194 ee043f88 ee044194 00000001
Apr 22 21:57:21 t41 kernel: [253346.726034] Call Trace:
Apr 22 21:57:21 t41 kernel: [253346.726034] [<c05e8532>] cfq_free_io_context+0x12/0x14
Apr 22 21:57:21 t41 kernel: [253346.726034] [<c05ddbf2>] put_io_context+0x34/0x5c
Apr 22 21:57:21 t41 kernel: [253346.726034] [<c05ddc82>] exit_io_context+0x68/0x6d
Apr 22 21:57:21 t41 kernel: [253346.726034] [<c04407dd>] do_exit+0x634/0x657
Apr 22 21:57:21 t41 kernel: [253346.726034] [<c0465bac>] ? trace_hardirqs_off+0xb/0xd
Apr 22 21:57:21 t41 kernel: [253346.726034] [<c045af59>] ? local_clock+0x32/0x49
Apr 22 21:57:21 t41 kernel: [253346.726034] [<c0440a17>] do_group_exit+0x63/0x86
Apr 22 21:57:21 t41 kernel: [253346.726034] [<c044c600>] get_signal_to_deliver+0x3a9/0x3c0
Apr 22 21:57:21 t41 kernel: [253346.726034] [<c0402615>] do_signal+0x5b/0x4f0
Apr 22 21:57:21 t41 kernel: [253346.726034] [<c0466087>] ? lock_release_holdtime.part.8+0x5d/0x63
Apr 22 21:57:21 t41 kernel: [253346.726034] [<c046889e>] ? lock_release+0x166/0x16d
Apr 22 21:57:21 t41 kernel: [253346.726034] [<c081a503>] ? _raw_spin_unlock+0x22/0x25
Apr 22 21:57:21 t41 kernel: [253346.726034] [<c05008ac>] ? set_close_on_exec+0xa3/0xaa
Apr 22 21:57:21 t41 kernel: [253346.726034] [<c04fce1f>] ? path_put+0x1a/0x1d
Apr 22 21:57:21 t41 kernel: [253346.726034] [<c0402d3f>] do_notify_resume+0x26/0x6f
Apr 22 21:57:21 t41 kernel: [253346.726034] [<c081a868>] work_notifysig+0x13/0x1b
Apr 22 21:57:21 t41 kernel: [253346.726034] Code: e8 d0 da e7 ff 5d c3 55 89 e5 57 56 53 3e 8d 74 26 00 8b 58 4c 89 c6 89 d7 e8 53 b1 e6 ff 85 c0 74 05 e8 bc ff ff ff 85 db 74 11 <8b> 03 0f 18 00 90 8d 53 d8 89 f0 ff d7 8b 1b eb dd 5b 5e 5f 5d
Apr 22 21:57:21 t41 kernel: [253346.726034] EIP: [<c05e850a>] __call_for_each_cic+0x24/0x3a SS:ESP 0068:eef91de0
Apr 22 21:57:21 t41 kernel: [253346.726034] CR2: 000000006b6b6b6b

Just slightly different from the Oopses I previously submitted.
__call_for_each_cic() has been removed by this commit in 2.6.39: http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commitdiff;h=5f45c69589b7d2953584e6cd0b31e35dbe960ad0
(In reply to comment #63)
> __call_for_each_cic() has been removed by this commit in 2.6.39:
>
> http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commitdiff;h=5f45c69589b7d2953584e6cd0b31e35dbe960ad0

0) This Oops (generated with non-Fedora blessed vanilla 2.6.39-rc5) suggests that commit (which is included in 2.6.39-rc5, if my memory is correct) was not enough to beat mandb into submission:

BUG: unable to handle kernel paging request at 6b6b6b6b
IP: [<c05e859d>] call_for_each_cic+0x29/0x44
*pde = 00000000
Oops: 0000 [#1] SMP
last sysfs file: /sys/devices/LNXSYSTM:00/device:00/PNP0A03:00/device:01/PNP0C09:00/PNP0C0A:00/power_supply/BAT0/energy_full
Modules linked in: bnep bluetooth fuse cpufreq_ondemand acpi_cpufreq mperf ip6t_REJECT nf_conntrack_ipv6 nf_defrag_ipv6 nf_conntrack_netbios_ns ip6table_filter nf_conntrack_broadcast ip6_tables snd_intel8x0m snd_intel8x0 arc4 snd_ac97_codec ath5k ac97_bus snd_seq ath snd_seq_device mac80211 snd_pcm ppdev thinkpad_acpi cfg80211 microcode snd_timer parport_pc e1000 snd rfkill parport iTCO_wdt soundcore snd_page_alloc iTCO_vendor_support joydev i2c_i801 pcspkr uinput ipv6 yenta_socket video radeon ttm drm_kms_helper drm i2c_algo_bit i2c_core [last unloaded: scsi_wait_scan]
Pid: 11397, comm: mandb Not tainted 2.6.39-0.rc5.git0.0.local.fc16.i686 #1 IBM /
EIP: 0060:[<c05e859d>] EFLAGS: 00010202 CPU: 0
EIP is at call_for_each_cic+0x29/0x44
EAX: 00000001 EBX: 6b6b6b6b ECX: 00000246 EDX: c0ad4a98
ESI: ecfd0980 EDI: c05e88a4 EBP: c670bf18 ESP: c670bf0c
DS: 007b ES: 007b FS: 00d8 GS: 00e0 SS: 0068
Process mandb (pid: 11397, ti=c670a000 task=d40769f0 task.ti=c670a000)
Stack:
 ecfd0980 d40769f0 d4076dc0 c670bf20 c05e85ca c670bf2c c05ddc86 ecfd0980
 c670bf40 c05ddd16 d40769f0 00000012 c670bf74 c670bf8c c0440813 f2a182e8
 d401ac08 c670bf70 c04f48da 00000001 d4076bf4 d40769e8 d4076bf4 00000001
Call Trace:
 [<c05e85ca>] cfq_free_io_context+0x12/0x14
 [<c05ddc86>] put_io_context+0x34/0x5c
 [<c05ddd16>] exit_io_context+0x68/0x6d
 [<c0440813>] do_exit+0x63e/0x661
 [<c04f48da>] ? fsnotify_modify+0x64/0x6c
 [<c0440a4d>] do_group_exit+0x63/0x86
 [<c0440a88>] sys_exit_group+0x18/0x18
 [<c082551f>] sysenter_do_call+0x12/0x38
Code: 5d c3 55 89 e5 57 56 53 3e 8d 74 26 00 89 c6 89 d7 e8 3a ec ff ff 8b 5e 4c e8 00 b1 e6 ff 85 c0 74 05 e8 b7 ff ff ff 85 db 74 11 <8b> 03 0f 18 00 90 8d 53 d8 89 f0 ff d7 8b 1b eb dd e8 49 ec ff
EIP: [<c05e859d>] call_for_each_cic+0x29/0x44 SS:ESP 0068:c670bf0c
CR2: 000000006b6b6b6b
---[ end trace 70dea08ef41fda48 ]---
Fixing recursive fault but reboot is needed!
BUG: unable to handle kernel paging request at 6b6b6b6b
IP: [<c05e859d>] call_for_each_cic+0x29/0x44
*pde = 00000000
Oops: 0000 [#2] SMP
last sysfs file: /sys/devices/LNXSYSTM:00/device:00/PNP0A03:00/device:01/PNP0C09:00/PNP0C0A:00/power_supply/BAT0/energy_full
Modules linked in: bnep bluetooth fuse cpufreq_ondemand acpi_cpufreq mperf ip6t_REJECT nf_conntrack_ipv6 nf_defrag_ipv6 nf_conntrack_netbios_ns ip6table_filter nf_conntrack_broadcast ip6_tables snd_intel8x0m snd_intel8x0 arc4 snd_ac97_codec ath5k ac97_bus snd_seq ath snd_seq_device mac80211 snd_pcm ppdev thinkpad_acpi cfg80211 microcode snd_timer parport_pc e1000 snd rfkill parport iTCO_wdt soundcore snd_page_alloc iTCO_vendor_support joydev i2c_i801 pcspkr uinput ipv6 yenta_socket video radeon ttm drm_kms_helper drm i2c_algo_bit i2c_core [last unloaded: scsi_wait_scan]
Pid: 8732, comm: mandb Tainted: G D 2.6.39-0.rc5.git0.0.local.fc16.i686 #1 IBM /
EIP: 0060:[<c05e859d>] EFLAGS: 00010202 CPU: 0
EIP is at call_for_each_cic+0x29/0x44
EAX: 00000000 EBX: 6b6b6b6b ECX: 00000286 EDX: 025d2b76
ESI: ecfd0700 EDI: c05e88a4 EBP: d88a1f18 ESP: d88a1f0c
DS: 007b ES: 007b FS: 00d8 GS: 00e0 SS: 0068
Process mandb (pid: 8732, ti=d88a0000 task=f2df69f0 task.ti=d88a0000)
Stack:
 ecfd0700 f2df69f0 f2df6dc0 d88a1f20 c05e85ca d88a1f2c c05ddc86 ecfd0700
 d88a1f40 c05ddd16 f2df69f0 00000012 d88a1f74 d88a1f8c c0440813 ef3c6000
 cb778508 d88a1f70 c04f48da 00000001 f2df6bf4 f2df69e8 f2df6bf4 00000001
Call Trace:
 [<c05e85ca>] cfq_free_io_context+0x12/0x14
 [<c05ddc86>] put_io_context+0x34/0x5c
 [<c05ddd16>] exit_io_context+0x68/0x6d
 [<c0440813>] do_exit+0x63e/0x661
 [<c04f48da>] ? fsnotify_modify+0x64/0x6c
 [<c04fcd68>] ? path_put+0x1a/0x1d
 [<c0440a4d>] do_group_exit+0x63/0x86
 [<c0440a88>] sys_exit_group+0x18/0x18
 [<c082551f>] sysenter_do_call+0x12/0x38
Code: 5d c3 55 89 e5 57 56 53 3e 8d 74 26 00 89 c6 89 d7 e8 3a ec ff ff 8b 5e 4c e8 00 b1 e6 ff 85 c0 74 05 e8 b7 ff ff ff 85 db 74 11 <8b> 03 0f 18 00 90 8d 53 d8 89 f0 ff d7 8b 1b eb dd e8 49 ec ff
EIP: [<c05e859d>] call_for_each_cic+0x29/0x44 SS:ESP 0068:d88a1f0c
CR2: 000000006b6b6b6b
---[ end trace 70dea08ef41fda49 ]---
Fixing recursive fault but reboot is needed!
BUG: unable to handle kernel paging request at 6b6b6b6b
IP: [<c05e859d>] call_for_each_cic+0x29/0x44
*pde = 00000000
Oops: 0000 [#3] SMP
last sysfs file: /sys/devices/LNXSYSTM:00/device:00/PNP0A03:00/device:01/PNP0C09:00/PNP0C0A:00/power_supply/BAT0/energy_full
Modules linked in: bnep bluetooth fuse cpufreq_ondemand acpi_cpufreq mperf ip6t_REJECT nf_conntrack_ipv6 nf_defrag_ipv6 nf_conntrack_netbios_ns ip6table_filter nf_conntrack_broadcast ip6_tables snd_intel8x0m snd_intel8x0 arc4 snd_ac97_codec ath5k ac97_bus snd_seq ath snd_seq_device mac80211 snd_pcm ppdev thinkpad_acpi cfg80211 microcode snd_timer parport_pc e1000 snd rfkill parport iTCO_wdt soundcore snd_page_alloc iTCO_vendor_support joydev i2c_i801 pcspkr uinput ipv6 yenta_socket video radeon ttm drm_kms_helper drm i2c_algo_bit i2c_core [last unloaded: scsi_wait_scan]
Pid: 14230, comm: mandb Tainted: G D 2.6.39-0.rc5.git0.0.local.fc16.i686 #1 IBM /
EIP: 0060:[<c05e859d>] EFLAGS: 00010202 CPU: 0
EIP is at call_for_each_cic+0x29/0x44
EAX: 00000000 EBX: 6b6b6b6b ECX: 00000286 EDX: 025dbc75
ESI: ecfd0b80 EDI: c05e88a4 EBP: da0bbf18 ESP: da0bbf0c
DS: 007b ES: 007b FS: 00d8 GS: 00e0 SS: 0068
Process mandb (pid: 14230, ti=da0ba000 task=ebb30000 task.ti=da0ba000)
Stack:
 ecfd0b80 ebb30000 ebb303d0 da0bbf20 c05e85ca da0bbf2c c05ddc86 ecfd0b80
 da0bbf40 c05ddd16 ebb30000 00000012 da0bbf74 da0bbf8c c0440813 ef3c68b8
 ca4eef08 da0bbf70 c04f48da 00000001 ebb30204 ebb2fff8 ebb30204 00000001
Call Trace:
 [<c05e85ca>] cfq_free_io_context+0x12/0x14
 [<c05ddc86>] put_io_context+0x34/0x5c
 [<c05ddd16>] exit_io_context+0x68/0x6d
 [<c0440813>] do_exit+0x63e/0x661
 [<c04f48da>] ? fsnotify_modify+0x64/0x6c
 [<c04fcd68>] ? path_put+0x1a/0x1d
 [<c0440a4d>] do_group_exit+0x63/0x86
 [<c0440a88>] sys_exit_group+0x18/0x18
 [<c082551f>] sysenter_do_call+0x12/0x38
Code: 5d c3 55 89 e5 57 56 53 3e 8d 74 26 00 89 c6 89 d7 e8 3a ec ff ff 8b 5e 4c e8 00 b1 e6 ff 85 c0 74 05 e8 b7 ff ff ff 85 db 74 11 <8b> 03 0f 18 00 90 8d 53 d8 89 f0 ff d7 8b 1b eb dd e8 49 ec ff
EIP: [<c05e859d>] call_for_each_cic+0x29/0x44 SS:ESP 0068:da0bbf0c
CR2: 000000006b6b6b6b
---[ end trace 70dea08ef41fda4a ]---
Fixing recursive fault but reboot is needed!
0) I just hit (basically) the Oops of comment #64 again, once again triggered by mandb.

1) I won't paste the backtrace; that should be noise by now. But we might be able to make this reproducible, as the events seem to be:
- install a few dozen packages that have been updated for Rawhide;
- reboot, because something low level has been updated so one might as well just reboot;
- wait until (I guess) cron fires off mandb (after, maybe, 15 minutes).

2) Would there be a way to force mandb to do what it does - including its side effects, ie generate troublesome disk activity somehow - in order to trigger this manually?
(In reply to comment #65)
> ... the events seem to be:
> - install a few dozen packages that have been updated for Rawhide;
> - reboot, because something low level has been updated so one might as wel just
> reboot;
> - wait until (I guess) cron fires of mandb (after, maybe, 15 minutes).

Please post the list of packages that were updated. On my system the list appears in /var/log/messages:
-----
Apr 30 11:12:23 f14-64 yum[3593]: Updated: firefox-3.6.17-1.fc14.x86_64
-----
If it is cron (or anacron?) that runs mandb, then the mechanism is by invoking a shell script. So find that shell script (look soon after reboot), and post the script.
(In reply to comment #65)
> 0) I just hit (basically) the Oops of comment #64 again, once again triggered
> by mandb.
>
> 1) I won't paste the backtrace, that should be noise by now. But we might be
> able to make this reproducible, as the events seem to be:
> - install a few dozen packages that have been updated for Rawhide;
> - reboot, because something low level has been updated so one might as wel just
> reboot;
> - wait until (I guess) cron fires of mandb (after, maybe, 15 minutes).
>
> 2) Would there be a way to force mandb to do what it does - including its side
> effects, ie generate troublesome disk activity somehow - in order to trigger
> this manually?

Bolle, is it possible to enable kdump on this box and capture the vmcore whenever this happens again?

Also, can you please keep the kernel debug info compiled in (CONFIG_DEBUG_INFO=y), so that I can open the vmcore using "crash" and try to get some more information out?
(In reply to comment #64)
> (In reply to comment #63)
> > __call_for_each_cic() has been removed by this commit in 2.6.39:
> >
> > http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commitdiff;h=5f45c69589b7d2953584e6cd0b31e35dbe960ad0
>
> 0) This Oops (generated with non-Fedroa blessed vanilla 2.6.39-rc5) suggests
> that commit (which is included in 2.6.39-rc5, if my memory is correct) was not
> enough to beat mandb into submission:

Confirmed, that commit is in -rc5.
(In reply to comment #68)
> (In reply to comment #64)
> > (In reply to comment #63)
> > > __call_for_each_cic() has been removed by this commit in 2.6.39:
> > >
> > > http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commitdiff;h=5f45c69589b7d2953584e6cd0b31e35dbe960ad0
> >
> > 0) This Oops (generated with non-Fedroa blessed vanilla 2.6.39-rc5) suggests
> > that commit (which is included in 2.6.39-rc5, if my memory is correct) was not
> > enough to beat mandb into submission:
>
> Confirmed, that commit is in -rc5

That commit was put in to fix a different race and is not going to help with this problem.
(In reply to comment #66)

The trick turned out to be to (basically) run /etc/cron.daily/man-db.cron with OPTS="-c" (see also /etc/sysconfig/man-db). Running that command triggered this Oops (within a few minutes). So, finally, it looks like I can trigger this manually. Progress, of sorts ...
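For anyone else trying to reproduce: based on the above, the manual trigger boils down to running the daily cron script by hand with the full-consistency-check option. A sketch (note: the cron script sources /etc/sysconfig/man-db, so depending on how it is written, OPTS may need to be set in that file rather than in the environment):

```shell
# force a full man-db consistency check (-c rebuilds the databases,
# generating the heavy small-file IO that seems to trigger the oops)
OPTS="-c" /etc/cron.daily/man-db.cron
```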
(In reply to comment #67)
> Bolle,

0) Please feel free to use my first name (Paul). But in bugzilla, one isn't even impolite when not using the other commenter's name at all.

> is it possible to enable kdump on this box and capture the vmcore
> whenever this happens again.
>
> Also can you please keep the kernel debug info compiled (CONFIG_DEBUG_INFO=y)
> so that I can open the vmcore using "crash" and try to get some more
> information out.

1) I'm about to recompile (vanilla) v2.6.39-rc6, using a (not too old) Rawhide i686 .config. Does it have all the needed CONFIG_*s?

2) Is Documentation/kdump/kdump.txt the best introduction to this (rather intimidating) stuff?
(In reply to comment #71)
> 2) Is Documentation/kdump/kdump.txt the best introduction to this (rather
> intimidating) stuff?

OK, I'm in danger of ending up shaving the yak: kexec-tools is installed, kdump is apparently happy when loading a "capture kernel", and the "system kernel" is set to panic on oops, but it somehow refuses to boot into the "capture kernel" and simply seems to hang... What should I try next to solve that issue?
(In reply to comment #72)
> [...]
> simply (seems to) hang(s)... What should I try next to solve that issue?

Attaching a serial console showed me that the "capture kernel" actually is being booted, and also that the last thing this "capture kernel" prints is

> Trying to unpack rootfs image as initramfs...

Any suggestions what I might be doing wrong?
(In reply to comment #73)
> (In reply to comment #72)
> > [...]
> > simply (seems to) hang(s)... What should I try next to solve that issue?
>
> Attaching a serial console learned me that the the "capture kernel" actually is
> being booted and also learned me that last thing this "capture kernel" prints
> is
>
> > Trying to unpack rootfs image as initramfs...
>
> Any suggestions what I might be doing wrong?

How much memory are you reserving? Try increasing the amount of memory reserved. Also, you need not wait for it to crash: first you can manually crash the system using "echo c > /proc/sysrq-trigger".
(In reply to comment #72)
> ... but somehow refuses to boot into the "capture kernel" and
> simply (seems to) hang(s)... What should I try next to solve that issue?

It is not clear from your description, but you are possibly running into bug 635893 and/or bug 654431. A workaround was to boot with a regular kernel, by hook or by crook do the job of /sbin/mkdumprd for a kdump one, and reboot into the desired kernel. In the absence of a needed initrd, /sbin/mkdumprd runs in the boot sequence and this gets stuck.
(In reply to comment #74)
> How much memory are you reserving? Try increasing the amount of memory
> reserved. Also you need not to wait for it to crash.

0) I was reserving (the suggested amount of) 64M.

> First you can manually
> crash the system using "echo c > /proc/sysrq-trigger".

1) You're not just interested in the vmcore at the moment of the Oops (upgraded to a panic, so kexec kicks in)?

(In reply to comment #75)
> In an absence of a needed initrd /sbin/mkdumprd runs
> in a boot sequence and this gets stuck.

2) This doesn't seem to be the issue. At the boot of the "system kernel" /etc/rc.d/init.d/kdump did generate an initrd for the "capture kernel".

3) My trouble apparently was that I was using the initramfs for the "system kernel" while I should have been using the initramfs (called initrd) generated by kdump for the "capture kernel". However, using that initramfs doesn't help much, since kdump doesn't generate an initramfs compatible with my setup (ie, root on ext3 on LVM2). So the kexec reboot now stopped at the Busybox prompt of a session without a rootfs and without the means, as far as I could discover, to mount a rootfs from an LVM2 logical volume.

4) I filed bug #703369 for that last issue, because it seems if I had an initramfs.conf kdump would invoke dracut, and dracut should be able to generate an initramfs compatible with my setup (which essentially is a Fedora default setup, isn't it?).
(In reply to comment #76) > (In reply to comment #74) > > How much memory are you reserving? Try increasing the amount of memory > > reserved. Also you need not to wait for it to crash. > > 0) I was reserving (the suggested amount of) 64M. Depending on kernel and initramfs size it can take much more than that. I would suggest reserving 256M to be safe. > > > First you can manually > > crash the system using "echo c > /proc/sysrq-trigger". > > 1) You're not just interested in the vmcore at the moment of the Oops (upgraded > to a panic, so kexec kicks in)? I suggested above just to make sure that you are able to capture dump when you crash the system manually. If that works, then we are reasonably sure that it will work when CFQ crashes. So yes, I am interested in the vmcore when CFQ actually crashes. > > (In reply to comment #75) > > In an absence of a needed initrd /sbin/mkdumprd runs > > in a boot sequence and this gets stuck. > > 2) This doesn't seem to be the issue. At the boot of the "system kernel" > /etc/rc.d/init.d/kdump did generate a initrd for for the "capture kernel". > > 3) My trouble apparently was that I was using the initramfs for the "system > kernel" while I should have been using the initramfs (called initrd) generated > by kdump for the "capture kernel". It is not mandatory. One can use dracut generated initramfs also. Just that it is much bigger than kdump generated initrd and does not have some automatic kdump known options for saving dump. > However using that initramfs doesn't help > much since kdump doesn't generate an initramfs compatible with my setup (ie, > root on ext3 on LVM2). So the kexec reboot now stopped at the Busybox prompt of > a session without rootfs and without the means, as far as I could discover, to > mount a rootfs from a LVM2 logical volume. Is your network up. Can you scp the vmcore over network to a different machine from busybox prompt. So root is not mounted because some driver or dm modules are not packed in initrd? 
You can compile these into your kernel, so you don't have to depend on the kdump initrd packing those modules in.

> 4) I filed bug #703369 for that last issue, because it seems if I had an
> initramfs.conf kdump would invoke dracut, and dracut should be able to generate
> an initramfs compatible with my setup (which essentially is a Fedora default
> setup, isn't it?).

If you want, you can disable the kdump service entirely and rely on loading the kdump kernel manually. Reserve 256M of memory; then you can use a dracut-generated initramfs with the command line below.

kexec -p /boot/vmlinuz --initrd=/boot/initrd --append="`cat /proc/cmdline` maxcpus=1 irqpoll reset_devices"
Paul,

Can you please upload the config file for the kernel with which you are able to recreate the crash? I will use the same config on an upstream kernel, boot it on Fedora, and try to reproduce the issue.
Created attachment 498131 [details]
my v2.6.39-rc6 vanilla config

(In reply to comment #78)
> Can you please upload your config file for the kernel you are able to recreate
> the crash. I will use the same config on upstream kernel and boot on fedora and
> try to reproduce the issue.

This is based on the first appearance of v2.6.39-rc6 in Fedora's kernel.git, but only "rpmbuilt" as a vanilla kernel, so it should be close to what Fedora might be shipping by now.
0) Finally managed to create a vmcore dump. crashkernel=256M helped. It also feels like using a serial console is the only (reliable) way to use kexec (see https://lkml.org/lkml/2011/4/12/210 for a similar report). Perhaps throttling the output to 115200 baud stops something, somewhere, from being racy. But this is just a feeling.

1) Now I have a (2048 - 256 =) 1792 MB core file. Not practical to upload.

2) First impressions with gdb:

Reading symbols from /home/[...]/Rpmbuild/BUILD/kernel-2.6.38.fc16/linux-2.6.38.i686/vmlinux...done.
[New LWP 14366]
#0  prefetch (x=Cannot access memory at address 0x6b6b6b6b)
    at /home/[...]/Rpmbuild/BUILD/kernel-2.6.38.fc16/linux-2.6.38.i686/arch/x86/include/asm/processor.h:833
833             alternative_input(BASE_PREFETCH,
(gdb) bt 9
#0  prefetch (x=Cannot access memory at address 0x6b6b6b6b)
    at /home/peb/Rpmbuild/BUILD/kernel-2.6.38.fc16/linux-2.6.38.i686/arch/x86/include/asm/processor.h:833
#1  call_for_each_cic (ioc=0xeeb60500, func=0xc05e88f8 <cic_free_func>) at block/cfq-iosched.c:2596
#2  0xc05e861e in cfq_free_io_context (ioc=<optimized out>) at block/cfq-iosched.c:2659
#3  0xc05ddcda in cfq_dtor (ioc=0xeeb60500) at block/blk-ioc.c:26
#4  put_io_context (ioc=0xeeb60500) at block/blk-ioc.c:43
#5  0xc05ddd6a in exit_io_context (task=0xed1469f0) at block/blk-ioc.c:80
#6  0xc044083f in do_exit (code=<optimized out>) at kernel/exit.c:1044
#7  0xc0440a79 in do_group_exit (exit_code=0) at kernel/exit.c:1106
#8  0xc0440ab4 in sys_exit_group (error_code=0) at kernel/exit.c:1117
(More stack frames follow...)
(gdb) f 3
#3  0xc05ddcda in cfq_dtor (ioc=0xeeb60500) at block/blk-ioc.c:26
26              cic->dtor(ioc);
(gdb) l
21          if (!hlist_empty(&ioc->cic_list)) {
22              struct cfq_io_context *cic;
23
24              cic = list_entry(ioc->cic_list.first, struct cfq_io_context,
25                               cic_list);
26              cic->dtor(ioc);
27          }
28      }
29
30      /*
(gdb) print ioc->cic_list
$1 = {first = 0x0}
(gdb)

3) So apparently something set ioc->cic_list.first to NULL after the hlist_empty() test was run but before cic->dtor() / cfq_free_io_context() was called. Is that possible?

4) Nitpick: list_entry() should actually be hlist_entry() here.
(In reply to comment #80)
> 0) Finally managed to create a vmcore dump. crashkernel=256M helped. It also
> feels like using a serial console is the only (reliable) way to use kexec (see
> https://lkml.org/lkml/2011/4/12/210 for a similar report). Perhaps throttling
> the output to 115200 baud stops something, somewhere from being racey. But,
> this is just a feeling.
>
> 1) Now I have a (2048 - 256 =) 1792 MB core file. Not practical to upload.

You can filter it using the makedumpfile utility: extract just the kernel pages and discard the userspace pages and, I think, the cache pages too. That might bring it down to a reasonable size to upload.
(In reply to comment #80)
> (gdb) bt 9
> #0  prefetch (x=Cannot access memory at address 0x6b6b6b6b)
>     at /home/peb/Rpmbuild/BUILD/kernel-2.6.38.fc16/linux-2.6.38.i686/arch/x86/include/asm/processor.h:833
> #1  call_for_each_cic (ioc=0xeeb60500, func=0xc05e88f8 <cic_free_func>)
>     at block/cfq-iosched.c:2596
> #2  0xc05e861e in cfq_free_io_context (ioc=<optimized out>)
>     at block/cfq-iosched.c:2659
> #3  0xc05ddcda in cfq_dtor (ioc=0xeeb60500) at block/blk-ioc.c:26
> #4  put_io_context (ioc=0xeeb60500) at block/blk-ioc.c:43
> #5  0xc05ddd6a in exit_io_context (task=0xed1469f0) at block/blk-ioc.c:80
> #6  0xc044083f in do_exit (code=<optimized out>) at kernel/exit.c:1044
> #7  0xc0440a79 in do_group_exit (exit_code=0) at kernel/exit.c:1106
> #8  0xc0440ab4 in sys_exit_group (error_code=0) at kernel/exit.c:1117
> (More stack frames follow...)
> (gdb) f 3
> #3  0xc05ddcda in cfq_dtor (ioc=0xeeb60500) at block/blk-ioc.c:26
> 26          cic->dtor(ioc);
> (gdb) l
> 21      if (!hlist_empty(&ioc->cic_list)) {
> 22          struct cfq_io_context *cic;
> 23
> 24          cic = list_entry(ioc->cic_list.first, struct cfq_io_context,
> 25                           cic_list);
> 26          cic->dtor(ioc);
> 27      }
> 28  }
> 29
> 30  /*
> (gdb) print ioc->cic_list
> $1 = {first = 0x0}
> (gdb)
>
> 3) So apparently something set ioc->cic_list.first to NULL after the
> hlist_empty() test was run but before cic->dtor() / cfq_free_io_context() was
> called. Is that possible?

I don't think that happened. After extracting the cic pointer you successfully called into cfq_free_io_context(). I think what happened is that there was at least one element on the cic_list, and cfq_free_io_context() then released that element; hence you see ioc->cic_list = NULL now.
Paul,

Planning to upload this vmcore (after filtering) and vmlinux? I am curious to have a look and see if I can spot something.
(In reply to comment #83)
> Planning to update this vmcore (after filtering) and vmlinux? I am curious to
> have a look and see if I can spot something.

0) Well, after getting sidetracked, I managed to reproduce this under v2.6.39-rc7. The vmcore file is still (almost) 200M after butchering it with makedumpfile, which is still rather big to upload.

1) Perhaps you can contact me directly (ie, bypassing bugzilla.redhat.com) to discuss possibilities here.
Thanks for the vmcore, Paul.

Some observations after looking at it:

- Crash happened when we tried to do pos->next where pos=0x6b6b6b6b6b.

- ioc->cic_list is empty at the time of the crash. So it looks like at least one element was on the list when we entered the loop, and later we fetched the next element, which turned out to be a freed object:

  pos = rcu_dereference_raw(hlist_next_rcu(pos))

- One conspiracy theory is that somehow the cic we removed from the ioc->cic_list got freed before the RCU read lock was released. That way the cic object got poisoned and we fetched a poisoned pointer. Now I don't know if that's possible or not.

- I tried hard looking for a pointer to the recently freed object, but it looks like we have lost that.

- Dave Anderson helped look at the crash, trying to find anything else. We did "bt -F", and in the leftovers on the stack there seems to be a pointer to the cic->rcu_head field. We are assuming that that's a pointer to the recently freed cic.

- 0xf6ab7460 seems to be the last cic freed (a guess from the stack leftovers).

- "kmem -S cfq_io_context" verifies that this is a free object in the cache.

- struct cfq_io_context 0xf6ab7460 shows that the object is poisoned.

So at this point I am not sure whether we are looking at the right object and whether it is the most recently freed object. If yes, then the question becomes: did we free it up while we were inside rcu_read_lock()?

Paul, as you can reproduce the problem, I will write a small patch to print recently freed objects and see what we are freeing and when.
Created attachment 499426 [details]
CFQ cic debug patch

A debug patch to get some more info out of the system during the crash.
Paul, can you please apply the above patch and try to reproduce the issue again? Do attach the serial console and capture the console output.
Also, keep kdump configured. A vmcore gives a whole lot of information about the various kmem caches and object states.
(In reply to comment #85)
> Some observations after looking at it.

Two minor (lazy) questions, if that's allowed.

> - [..] We
> did "bt -F" and we see that leftover on the stack there seems to be pointer
> to cic->rcu_head field.

What's "bt -F"?

> - "kmem -S cfq_io_context" verifies that this is a free object in the cache.

What's kmem?
These are "crash" commands. So I used "crash vmlinux dumpfile" to open the dump, not gdb. You can use "help bt" to get more details, but in summary: crash walks through the stack contents and, if it finds an address that comes from one of the slab caches, it prints the name of the slab symbolically. In this case, we saw the following:

#3 [ec05be50] do_page_fault at c08228ba
   [RA: c08205d4  SP: ec05be54  FP: ec05becc  SIZE: 124]
    ec05be54: 00000007  f699bf90  restore_all_notrace  [cfq_io_context]
    ec05be64: 0000005c  00000000  00000008  f699bf90
    ec05be74: 00000000  trace_hardirqs_on_thunk+12  runqueues  init_task
    ec05be84: 00000000  restore_all_notrace  f4d43764  00000080
    ec05be94: 00000080  rcu_sched_state  [cfq_io_context]  ec05bec8
    ec05bea4: 00000046  ec05beb8  trace_hardirqs_off_caller+46  6b6b6b6b
    ec05beb4: [blkdev_ioc]  ec05bf18  6b6b6b6b  [blkdev_ioc]
    ec05bec4: do_page_fault  ec05bf18  error_code+108

There seem to be two addresses in one of the stack frames that come from the cfq_io_context slab. Then we did "bt -f" to get the raw addresses:

#3 [ec05be50] do_page_fault at c08228ba
   [RA: c08205d4  SP: ec05be54  FP: ec05becc  SIZE: 124]
    ec05be54: 00000007  f699bf90  c081f8ed  f6ab7498
    ec05be64: 0000005c  00000000  00000008  f699bf90
    ec05be74: 00000000  c05f3ef4  c0d017c0  c0ac5fe0
    ec05be84: 00000000  c081f8ed  f4d43764  00000080
    ec05be94: 00000080  c0ad9e80  f6ab7498  ec05bec8
    ec05bea4: 00000046  ec05beb8  c0465c5d  6b6b6b6b
    ec05beb4: ee9deb00  ec05bf18  6b6b6b6b  ee9deb00
    ec05bec4: c082273f  ec05bf18  c08205d4

And the address is 0xf6ab7498 in both instances. At this point it only tells us that this address points within some object in the cfq_io_context slab.

The kmem command shows kernel memory state; -s and -S show slab state. Then we did the following:
crash> kmem -s 0xf6ab7498
CACHE     NAME             OBJSIZE  ALLOCATED  TOTAL  SLABS  SSIZE
f696b580  cfq_io_context        64         26     36      1     4k
SLAB      MEMORY    NODE  TOTAL  ALLOCATED  FREE
f58ac6e0  f6ab7000     0     36         26    10
FREE / [ALLOCATED]
  f6ab7460  (cpu 0 cache)

This tells us that 0xf6ab7498 points inside one of the cfq_io_context objects, and that the beginning address of the object is 0xf6ab7460, which has been freed. It can be further verified using kmem -S:

crash> kmem -S cfq_io_context
CACHE     NAME             OBJSIZE  ALLOCATED  TOTAL  SLABS  SSIZE
f696b580  cfq_io_context        64         26     36      1     4k
CPU 0 SLAB:
SLAB      MEMORY    NODE  TOTAL  ALLOCATED  FREE
f58ac6e0  f6ab7000     0     36         26    10
FREE / [ALLOCATED]
  [f6ab7000]
  [f6ab7070]
   f6ab70e0  (cpu 0 cache)
  [f6ab7150]
  [f6ab71c0]
  [f6ab7230]
  [f6ab72a0]
  [f6ab7310]
  [f6ab7380]
  [f6ab73f0]
   f6ab7460  (cpu 0 cache)
   f6ab74d0  (cpu 0 cache)
  ....

Notice the address above does not have [] around it. That means this object has been freed and returned to SLUB. So at this point we only know that we found a cic pointer on the stack which has been freed. We don't know if that's the latest cic we freed under RCU. Hence the debug patch with its messages might confirm or deny that.
0) Interim report (since I, well, just forgot to enable panic on oops).

1) I captured 6M+ via minicom over 115200 baud. mandb therefore took about 20 min! This part shows that we're on the right track, doesn't it? (This will probably end up a bit mangled.)

[...]
debug: Will decouple cic=0xef3f1380, Next pointer is (null)
debug: cic=0xef3f1380 decoupled. Next pointer is (null)
debug: Will decouple cic=0xef3f1380, Next pointer is (null)
debug: cic=0xef3f1380 decoupled. Next pointer is (null)
debug: freeing up cic=ef3f1380
cic=ef3f1380 freed under while in RCU
------------[ cut here ]------------
kernel BUG at block/cfq-iosched.c:2606!
invalid opcode: 0000 [#1] SMP
last sysfs file: /sys/firmware/memmap/8/type
Modules linked in: cpufreq_ondemand acpi_cpufreq mperf bnep bluetooth ip6t_REJECT nf_conntrack_ipv6 nf_defrag_ipv6 ip6table_filter ip6_tables nf_conntrack_netbios_ns nf_conntrack_broadcast snd_intel8x0m arc4 snd_intel8x0 ath5k snd_ac97_codec ath ac97_bus snd_seq mac80211 snd_seq_device cfg80211 e1000 thinkpad_acpi snd_pcm ppdev rfkill iTCO_wdt snd_timer parport_pc iTCO_vendor_support snd i2c_i801 snd_page_alloc parport soundcore pcspkr joydev microcode uinput ipv6 yenta_socket video radeon ttm drm_kms_helper drm i2c_algo_bit i2c_core [last unloaded: scsi_wait_scan]

Pid: 12275, comm: mandb Not tainted 2.6.39-0.rc7.git0.0.local1.fc16.i686 #1 IBM /
EIP: 0060:[<c05e86c6>] EFLAGS: 00010296 CPU: 0
EIP is at call_for_each_cic+0x3b/0xa8
EAX: 00000038 EBX: ef3f13a8 ECX: 00000086 EDX: 00000000
ESI: ef3f1380 EDI: ef7baf00 EBP: eca75f18 ESP: eca75ef8
 DS: 007b ES: 007b FS: 00d8 GS: 00e0 SS: 0068
Process mandb (pid: 12275, ti=eca74000 task=ef693f90 task.ti=eca74000)
Stack:
 c09a8b73 ef3f1380 00000000 c05e8a1f 6b6b6b6b ef7baf00 ef693f90 ef694360
 eca75f20 c05e8745 eca75f2c c05ddd86 ef7baf00 eca75f40 c05dde16 ef693f90
 00000012 eca75f74 eca75f8c c04408ab efbe65dc efbe6580 00000009 ef693f90
Call Trace:
 [<c05e8a1f>] ? cfqq_process_refs+0x18/0x18
 [<c05e8745>] cfq_free_io_context+0x12/0x14
 [<c05ddd86>] put_io_context+0x34/0x5c
 [<c05dde16>] exit_io_context+0x68/0x6d
 [<c04408ab>] do_exit+0x63e/0x661
 [<c0440ae5>] do_group_exit+0x63/0x86
 [<c0440b20>] sys_exit_group+0x18/0x18
 [<c082571f>] sysenter_do_call+0x12/0x38
Code: 89 55 ec e8 1f ec ff ff 8b 5f 4c eb 22 e8 ba ff ff ff 81 7d f0 6b 6b 6b 6b 75 12 89 74 24 04 c7 04 24 73 8b 9a c0 e8 51 e1 22 00 <0f> 0b 8b 1b e8 c1 b0 e6 ff 85 c0 74 05 e8 8f ff ff ff 85 db 74
EIP: [<c05e86c6>] call_for_each_cic+0x3b/0xa8 SS:ESP 0068:eca75ef8
---[ end trace be23eecaac4d4ddf ]---
Fixing recursive fault but reboot is needed!
[...]
This one looks interesting. The only question I have is why we are decoupling the same cic multiple times. Is it that the cic was decoupled, allocated again, and then put on the list again, or something like that?

Can you provide either the whole console log or at least the last 100-200 lines or so?
(In reply to comment #92)
> This one looks interesting. The only question I have is that why are we
> decoupling same cic multiple times. Is it that cic was decoupled, allocated
> again and then put again on the list or something like that.
>
> Can you provide either the whole of the console log or atleast last 100-200
> lines or so.

0) Well, that cic was used a lot:

$ grep "freeing up cic=ef3f1380" call_for_each_cic.out | wc -l
7367

1) Perhaps we should try another way of printk debugging this.

[2) Vivek, I just uploaded the (xz compressed) vmlinux and (ditto) dumpfile of a similar crash, now with panic_on_oops enabled. That one should have its log embedded in the dumpfile.]
Thanks Paul. Ok, call_for_each_cic() is called by 4 functions:

- exit single io context
- decouple and free cic
- change ioprio
- change group

On my system I see the call to exit_io_context first and then the call that frees the cic. So I see the cic decouple message twice and then the cic is freed. In your logs I see that call_for_each_cic() has been called 3 times on each cic. I am wondering what that 3rd call is. Is it a change of ioprio, a change of group, etc.?

Can you please try another debug patch? In this patch I am also identifying the caller of the function.
Created attachment 499716 [details]
CFQ cic debug patch V2

Another debug patch for CFQ. It also identifies the caller of the call_for_each_cic() function, and it adds a printk for drop_dead_cic(), which is the only other path where a cic is freed.
0) Vivek, I ran into trouble trying your new debug patch since, for some reason, my "capture kernel" now loops while printing things like this:

irq 11: nobody cared (try booting with the "irqpoll" option)
Pid: 205, comm: modprobe Not tainted 2.6.39-0.rc7.git0.0.local.fc16.i686 #1
Call Trace:
 [<d08167d4>] ? printk+0x2d/0x2f
 [<d0818025>] __report_bad_irq+0x40/0x9f
 [<d04930b2>] note_interrupt+0x122/0x188
 [<d0491b1f>] handle_irq_event_percpu+0x171/0x18c
 [<d04689b2>] ? lock_release+0x166/0x16d
 [<d0491b6b>] handle_irq_event+0x31/0x4a
 [<d0493553>] ? unmask_irq+0x23/0x23
 [<d04935a8>] handle_level_irq+0x55/0x73
 <IRQ>
 [<d0404040>] ? do_IRQ+0x3c/0x92
 [<d0465c5d>] ? trace_hardirqs_off_caller+0x2e/0x86
 [<d0825c75>] ? common_interrupt+0x35/0x3c
 [<d044007b>] ? mm_update_next_owner+0xad/0x18a
 [<d044007b>] ? mm_update_next_owner+0xad/0x18a
 [<d04600e0>] ? clocksource_watchdog_work+0x16/0x3d
 [<d0441a6e>] ? arch_local_irq_enable+0x5/0xb
 [<d044272e>] ? __do_softirq+0x68/0x185
 [<d04426c6>] ? __tasklet_hi_schedule_first+0x30/0x30
 <IRQ>
 [<d0442a03>] ? irq_exit+0x4f/0x8b
 [<d0404082>] ? do_IRQ+0x7e/0x92
 [<d0465c5d>] ? trace_hardirqs_off_caller+0x2e/0x86
 [<d0825c75>] ? common_interrupt+0x35/0x3c
 [<d081007b>] ? powernow_cpu_init+0x51/0x5d0
 [<d04600e0>] ? clocksource_watchdog_work+0x16/0x3d
 [<d046dfd1>] ? arch_local_irq_restore+0x5/0xb
 [<d081f625>] ? _raw_spin_unlock_irqrestore+0x44/0x48
 [<d0492b00>] ? __setup_irq+0x27c/0x324
 [<d0492c1b>] ? request_threaded_irq+0x73/0xd9
 [<e07c048f>] ? cb_writel+0x59/0x59 [yenta_socket]
 [<d0492c5a>] ? request_threaded_irq+0xb2/0xd9
 [<e07c16e3>] ? yenta_probe_cb_irq+0x46/0x121 [yenta_socket]
 [<e07c1d54>] ? ti12xx_override+0x371/0x4b4 [yenta_socket]
 [<e07c28c4>] ? yenta_probe+0x2c9/0x568 [yenta_socket]
 [<d06a524c>] ? pm_runtime_enable+0x60/0x68
 [<d0609080>] ? pci_device_probe+0x62/0xab
 [<d069ef4a>] ? driver_probe_device+0x129/0x208
 [<d081e20d>] ? mutex_lock_nested+0x39/0x3e
 [<d069f078>] ? __driver_attach+0x4f/0x6b
 [<d069e139>] ? bus_for_each_dev+0x42/0x6b
 [<d069eb8e>] ? driver_attach+0x1f/0x23
 [<d069f029>] ? driver_probe_device+0x208/0x208
 [<d069e801>] ? bus_add_driver+0xcd/0x214
 [<d069f4b6>] ? driver_register+0x84/0xe3
 [<d05f7b6b>] ? __raw_spin_lock_init+0x2d/0x4e
 [<d0609829>] ? __pci_register_driver+0x4f/0xaa
 [<e07a6017>] ? yenta_socket_init+0x17/0x1000 [yenta_socket]
 [<d040118a>] ? do_one_initcall+0x8c/0x146
 [<e07a6000>] ? 0xe07a5fff
 [<e07a6000>] ? 0xe07a5fff
 [<d0472108>] ? sys_init_module+0x14fc/0x1704
 [<d082569f>] ? sysenter_do_call+0x12/0x38
handlers:
[<d06f0464>] (usb_hcd_irq+0x0/0x95)
[<d06f0464>] (usb_hcd_irq+0x0/0x95)
[<d06f0464>] (usb_hcd_irq+0x0/0x95)
[<d06f0464>] (usb_hcd_irq+0x0/0x95)
[<e0904a79>] (radeon_driver_irq_handler_kms+0x0/0x19 [radeon])
[<e07c17be>] (yenta_interrupt+0x0/0xb3 [yenta_socket])
[<e07c048f>] (yenta_probe_handler+0x0/0x4c [yenta_socket])
Disabling IRQ #11

1) According to the previously printed messages, "irqpoll" is already set.

2) Any suggestions what could be going wrong?
Paul, it does not look like this is related to my debug patch. I am surprised that it loops for a long time, because generally, when lots of interrupts come in and nobody is handling them, the kernel disables the interrupt line and moves on. So the kernel should have done that. Maybe just try the test again.

In the meantime, I can look at the console logs while you retry capturing the crash dump.
Created attachment 499949 [details]
last 100 lines (before the unsuccessful reboot into the capture kernel)

(In reply to comment #97)
> Paul, does not look like that it is related to my debug patch.

Indeed. Not sure what I did to mess things up this way.

> In the mean time, I can look at the console logs while you retrying capturing
> the crash dump.

I've attached the last 100 lines printed by the (patched) "system kernel", before it rebooted into the "capture kernel".
Hmm..., I see lots of the following messages, which I don't see on my system:

debug: Will change cgroup cic=0xee226380, Next pointer is (null)
debug: cic=0xee226380 change cgroup done. Next pointer is (null)

Have you got any kind of automatic rules set up for cgroups, so that when a thread is launched it is moved into some cgroup (/etc/cgrules.conf?)?

Can you disable this cgroup-related special behavior and see if the problem is still reproducible?
You can disable the cgconfig and cgred services on the system (if they are running).
(In reply to comment #99)
> Have you got any type of automatic rules setup for cgroups. So that if a thread
> is launched it is moved in some cgroup or so (/etc/cgrules.conf?).
>
> Can you disable this cgroup related special behavior and see if problem is
> still reproducible?

I know nothing of cgroup-related stuff. I guess it should be configured as current Rawhide has pushed it onto people (like me) tracking Rawhide. But at this moment I can't check, as the kernel is (once again) happily flooding all consoles with the "cic" debug messages I patched in.

(In reply to comment #100)
> you can disable cgconfig and cgred service on the system (if these are
> running).

Ditto.
You can just kill the mandb process, and then the kernel should stop filling the console, right?

I enabled cgred and put a rule in cgrules.conf that puts all root processes in /cgroup/blk/test1. Then I see the "changed cgroup" message only a few times, as a process migrates to the new group. But I don't see the constant stream of "change cgroup" messages you are seeing. My "mandb -c" is now running in the /cgroup/blk/test1/ group, but there is no constant stream of cgroup-changed messages.

Can you try capturing, say, 10 seconds of blktrace on your root disk (sda, I guess)?

- start mandb
- blktrace -d /dev/sda
- after 10 sec press Ctrl-C
- blkparse -i sda > blkparse-output

And upload this blkparse-output file, which contains the trace messages. You might have to install the blktrace package though.
Though we are changing cgroups frequently in your setup, looking at the cgroup change code it seems harmless in this context: we just release the cfqq associated with the cic and don't try to free the cic as such.

Still, I am curious what's happening in your setup that causes all these messages; maybe that could explain the issue.
0) Things settled down. For once mandb did not manage to trigger this issue. The joy of race conditions?

1) /etc/cgrules.conf is entirely commented out (its default state, I suppose).

2) If chkconfig is to be believed, neither cgconfig nor cgred should run (but maybe systemd is interfering with this, dunno...).
(In reply to comment #102)
> you can just kill the mandb process and then kernel should stop filling the
> console?

My comment #104 collided with your comment #102 and comment #103. Anyway, I hope to give this (ie, your blktrace stuff) another try tomorrow. I'm a bit fed up now with seeing this laptop booting, crashing, rebooting and crashing again without any apparent progress ....
Created attachment 500006 [details]
16+ seconds of blktrace output

(In reply to comment #102)
> Can you try capturing say 10 seconds of blktrace on your root disk (sda I
> guess?)

16+ seconds of blktrace output (while running mandb, which eventually oopsed the box; a single hard disk, single CPU system).
Ok, there is nothing suspicious in the blktrace. There are no "changed cgroup" messages. So I think these are newly allocated cics where ioc->cgroup_changed = 1 but there is no associated cfqq, hence blktrace does not carry the "changed_cgroup" message. Why we are changing the cgroup of mandb threads so frequently still beats me. I guess that's a different problem we can analyze later.

Back to the core issue of freeing an object while inside RCU. I am really running out of ideas for how to debug this. I am hoping that Paul McKenney gives us some tips on how to prove or disprove that it is an RCU issue.

BTW, out of curiosity, is this issue also reproducible on a stock Fedora kernel for you? Is it reproducible if you trim down your config to the bare minimum? Just trying to figure out whether it depends on some CONFIG option or not.
(In reply to comment #95)
> CFQ cic debug patch V2
>
> Another debug patch for CFQ. It also identifies the caller of the
> call_for_each_cic function. It also puts a printk for drop_dead_cic() which is
> only other path where cic is freed.

0) I've finally uploaded a (xz compressed) vmlinux and (ditto) dumpfile of this crash (both using the V2 debug patch).

(1) I've been running into trouble with kexec again and again for the last 48 hours. To aid my memory, it seems I need to add these options to the "system kernel" to get things to work:

crashkernel=256M console=ttyS0,115200n8 rd_NO_PLYMOUTH nousb nomodeset yenta_socket.foo=1 ath5k.foo=1 3

- "nomodeset" (and no console=tty0) for radeon related bugs (of the "capture kernel");
- yenta_socket.foo=1 and ath5k.foo=1 are tricks to make sure those modules won't get loaded (for IRQ=11 related bugs of the "capture kernel");
- nousb for ehci related bugs (of the "capture kernel");
- rd_NO_PLYMOUTH to be able to log into the "system kernel" over the serial line.

This is just what I ended up with after beating kexec into (apparent) submission. Not sure what's really needed and what's just cargo cult.)
(In reply to comment #107)
> BTW, out of curiosity, is this issue also reproducible on stock fedora kernel
> for you? Is it reproducible if you just trim down your config to bare minimum.
> Just trying to figure out if it is some CONFIG option dependent or not.

Could also be a gcc-related issue, I guess. What is the latest i686 kernel actually shipped on the Rawhide mirrors now?
(In reply to comment #109)
> Could also be a gcc related issue, I guess.

I upgraded gcc and friends from 4.6.0-6 to 4.6.0-7, recompiled v2.6.39-rc7, and my favourite testcase (mandb -c) Oops'ed twice. My next comment should prove it's not gcc related.
(In reply to comment #107)
> BTW, out of curiosity, is this issue also reproducible on stock fedora kernel
> for you? Is it reproducible if you just trim down your config to bare minimum.
> Just trying to figure out if it is some CONFIG option dependent or not.

Ha! I installed kernel-2.6.39-0 (as pre-compiled for i686 Rawhide). I also rebuilt that rpm locally (using gcc-4.6.0-7) and rebuilt that rpm in its vanilla variant (using rpmbuild's "--with vanilla" option). I'm unable to generate this Oops under v2.6.39 (in any of the three variants I used)!

This suggests things got fixed between v2.6.39-rc7 and v2.6.39. I hope to answer Paul McKenney's message to LKML (https://lkml.org/lkml/2011/5/21/151) shortly to find out whether things really got fixed in v2.6.39 or whether the window of opportunity for this Oops just got too small for my attempts at reproducing it.
(In reply to comment #111)
> I installed kernel-2.6.39-0 (as pre-compiled for i686 Rawhide). I also rebuilt
> that rpm locally (using gcc-4.6.0-7) and rebuilt that rpm in its vanilla
> variant (using rpmbuild's "--with vanilla" option). I'm unable to generate this
> Oops under v2.6.39 (in any of the three variants I used)!
>
> This suggests things got fixed between v.2.6.39-rc7 and v2.6.39.

0) It turns out the Fedora kernel specfile uses rather different config files for release candidates and final releases: a lot of debug options will be unset for final releases. In this case I guess I was fooled by this:

diff -U 0 /boot/config-2.6.39-0.rc7.git0.0.local.fc16.i686 /boot/config-2.6.39-0.local.fc16.i686 | grep -e RCU
-CONFIG_PROVE_RCU=y
-# CONFIG_PROVE_RCU_REPEATEDLY is not set

1) I battled a bit with the specfile to get it to not delete config.debug, but I have to admit defeat. I'll have to figure out another way to use the debug config options.

(2) For the people maintaining kernel.spec: are these lines correct:

%if !%{debugbuildsenabled}
rm -f kernel-%{version}-*debug.config
%endif

? It seems their effect is that both the information shipped in config-debug and config-nodebug is dropped. But this stuff is all rather confusing ....)
(In reply to comment #112)
> 0) It turns out the Fedora kernel specfile uses rather different config files
> for release candidates and final releases: a lot of debug options will be unset
> for final releases. In this case I guess I was fooled by this:
>
> diff -U 0 /boot/config-2.6.39-0.rc7.git0.0.local.fc16.i686
> /boot/config-2.6.39-0.local.fc16.i686 | grep -e RCU
> -CONFIG_PROVE_RCU=y
> -# CONFIG_PROVE_RCU_REPEATEDLY is not set

And, indeed, when I rebuilt and booted v2.6.39 with a .config file more in line with what was used for the release candidates (eg, with CONFIG_PROVE_RCU=y), I ran into this issue the first time I ran my precious little testing tool (mandb -c).
(In reply to comment #113)
> (In reply to comment #112)
> > 0) It turns out the Fedora kernel specfile uses rather different config files
> > for release candidates and final releases: a lot of debug options will be unset
> > for final releases. In this case I guess I was fooled by this:
> >
> > diff -U 0 /boot/config-2.6.39-0.rc7.git0.0.local.fc16.i686
> > /boot/config-2.6.39-0.local.fc16.i686 | grep -e RCU
> > -CONFIG_PROVE_RCU=y
> > -# CONFIG_PROVE_RCU_REPEATEDLY is not set
>
> And, indeed, when I rebuilt and booted v2.6.39 with a .config file more in line
> with what it used for the release candidates (eg, with CONFIG_PROVE_RCU=y) I
> ran into this issue the first time I ran my precious little testing tool (mandb
> -c).

PaulB,

So if you take the Fedora final release config file and just enable CONFIG_PROVE_RCU=y, can you reproduce the issue? Or are more config options involved too?
(In reply to comment #103)
> Though we are changing cgroups frequently in your setup looking at the cgroup
> change code, it looks harmless in this context. We just release the cfqq
> associated with the cic and don't try to free cic as such.
>
> Still I am curious what's happening in your setup which causes all these
> messages and may be that could explain the issue.

Ok, I found out why you are seeing so many "cgroup changed" messages. We are not initializing the ioc->cgroup_changed field during ioc creation, so CFQ thinks the cgroup has changed. I am going to post a patch that should remove the "cgroup changed" messages from the logs. Though I think this is completely orthogonal and has nothing to do with the original problem.
Created attachment 500449 [details]
Fix frequent "cgroup changed" messages visible in debug logs

PaulB was seeing lots of "cgroup changed" messages when we enabled debugging. This patch removes those unnecessary extra calls to call_for_each_cic(changed_cgroup).
PaulB, would you like to re-run those tests with some config options enabled as suggested by PaulM? Also it would be good to run the script to capture the RCU traces.

CONFIG_PROVE_RCU=y
CONFIG_PREEMPT=y
CONFIG_DEBUG_OBJECTS_RCU_HEAD=y
CONFIG_RCU_TRACE=y
(In reply to comment #114)
> So if you take fedora final release file and just enable CONFIG_PROVE_RCU=y,
> then you can reproduce the issue? Or more config options are involved too.

0) The Fedora final release I ran was kernel-2.6.39-0.fc16.i686.rpm. The first version of that release that I managed to build with a config resembling the release candidates (but as a vanilla kernel) was kernel-vanilla-2.6.39-0.local2.fc16.i686.

1) This horror shows the differences in their respective configs:

diff -U 0 <(rpm2cpio Download/kernel-2.6.39-0.fc16.i686.rpm | cpio --quiet -i --to-stdout ./boot/config-2.6.39-0.fc16.i686) /boot/config-2.6.39-0.local2.fc16.i686

2) That generates this output (of which "+CONFIG_PROVE_RCU=y" seems to be the most important change, but most of it is entirely new for me anyway):

--- /dev/fd/63  2011-05-23 22:48:15.071686130 +0200
+++ /boot/config-2.6.39-0.local2.fc16.i686  2011-05-22 19:42:32.000000000 +0200
@@ -3,2 +3,2 @@
-# Linux/i386 2.6.39-0.fc16.i686 Kernel Configuration
-# Thu May 19 20:07:58 2011
+# Linux/i386 2.6.39-0.local2.fc16.i686 Kernel Configuration
+# Sun May 22 17:51:58 2011
@@ -178,0 +179 @@
+CONFIG_PERF_USE_VMALLOC=y
@@ -185 +186 @@
-# CONFIG_DEBUG_PERF_USE_VMALLOC is not set
+CONFIG_DEBUG_PERF_USE_VMALLOC=y
@@ -229 +230 @@
-# CONFIG_MODULE_FORCE_UNLOAD is not set
+CONFIG_MODULE_FORCE_UNLOAD=y
@@ -258 +259 @@
-CONFIG_INLINE_SPIN_UNLOCK=y
+# CONFIG_INLINE_SPIN_UNLOCK is not set
@@ -260 +261 @@
-CONFIG_INLINE_SPIN_UNLOCK_IRQ=y
+# CONFIG_INLINE_SPIN_UNLOCK_IRQ is not set
@@ -267 +268 @@
-CONFIG_INLINE_READ_UNLOCK=y
+# CONFIG_INLINE_READ_UNLOCK is not set
@@ -269 +270 @@
-CONFIG_INLINE_READ_UNLOCK_IRQ=y
+# CONFIG_INLINE_READ_UNLOCK_IRQ is not set
@@ -276 +277 @@
-CONFIG_INLINE_WRITE_UNLOCK=y
+# CONFIG_INLINE_WRITE_UNLOCK is not set
@@ -278 +279 @@
-CONFIG_INLINE_WRITE_UNLOCK_IRQ=y
+# CONFIG_INLINE_WRITE_UNLOCK_IRQ is not set
@@ -280 +281 @@
-CONFIG_MUTEX_SPIN_ON_OWNER=y
+# CONFIG_MUTEX_SPIN_ON_OWNER is not set
@@ -415 +416 @@
-CONFIG_SPLIT_PTLOCK_CPUS=4
+CONFIG_SPLIT_PTLOCK_CPUS=999999
@@ -501 +502,2 @@
-# CONFIG_ACPI_DEBUG is not set
+CONFIG_ACPI_DEBUG=y
+# CONFIG_ACPI_DEBUG_FUNC_TRACE is not set
@@ -596 +597,0 @@
-CONFIG_PCI_MSI_DEFAULT_ON=y
@@ -1189 +1190 @@
-# CONFIG_CAN_DEBUG_DEVICES is not set
+CONFIG_CAN_DEBUG_DEVICES=y
@@ -1323 +1324 @@
-# CONFIG_CEPH_LIB_PRETTYDEBUG is not set
+CONFIG_CEPH_LIB_PRETTYDEBUG=y
@@ -1520 +1521 @@
-# CONFIG_DRBD_FAULT_INJECTION is not set
+CONFIG_DRBD_FAULT_INJECTION=y
@@ -2115 +2116 @@
-# CONFIG_ATH_DEBUG is not set
+CONFIG_ATH_DEBUG=y
@@ -2131 +2132 @@
-# CONFIG_CARL9170_DEBUGFS is not set
+CONFIG_CARL9170_DEBUGFS=y
@@ -2180 +2181 @@
-# CONFIG_IWLWIFI_DEVICE_TRACING is not set
+CONFIG_IWLWIFI_DEVICE_TRACING=y
@@ -2662 +2662,0 @@
-CONFIG_INPUT_APPLEIR=m
@@ -2695 +2694,0 @@
-CONFIG_CRASH=m
@@ -4522 +4521,2 @@
-# CONFIG_DMADEVICES_DEBUG is not set
+CONFIG_DMADEVICES_DEBUG=y
+CONFIG_DMADEVICES_VDEBUG=y
@@ -4721 +4721 @@
-# CONFIG_EXT4_DEBUG is not set
+CONFIG_EXT4_DEBUG=y
@@ -4726 +4726 @@
-# CONFIG_JBD2_DEBUG is not set
+CONFIG_JBD2_DEBUG=y
@@ -4766 +4766 @@
-# CONFIG_QUOTA_DEBUG is not set
+CONFIG_QUOTA_DEBUG=y
@@ -4817 +4816,0 @@
-CONFIG_TMPFS_XATTR=y
@@ -5020,2 +5019,8 @@
-# CONFIG_DEBUG_OBJECTS is not set
-# CONFIG_SLUB_DEBUG_ON is not set
+CONFIG_DEBUG_OBJECTS=y
+# CONFIG_DEBUG_OBJECTS_SELFTEST is not set
+CONFIG_DEBUG_OBJECTS_FREE=y
+CONFIG_DEBUG_OBJECTS_TIMERS=y
+CONFIG_DEBUG_OBJECTS_WORK=y
+CONFIG_DEBUG_OBJECTS_PERCPU_COUNTER=y
+CONFIG_DEBUG_OBJECTS_ENABLE_DEFAULT=1
+CONFIG_SLUB_DEBUG_ON=y
@@ -5024 +5029,2 @@
-# CONFIG_DEBUG_RT_MUTEXES is not set
+CONFIG_DEBUG_RT_MUTEXES=y
+CONFIG_DEBUG_PI_LIST=y
@@ -5026,4 +5032,6 @@
-# CONFIG_DEBUG_SPINLOCK is not set
-# CONFIG_DEBUG_MUTEXES is not set
-# CONFIG_DEBUG_LOCK_ALLOC is not set
-# CONFIG_PROVE_LOCKING is not set
+CONFIG_DEBUG_SPINLOCK=y
+CONFIG_DEBUG_MUTEXES=y
+CONFIG_DEBUG_LOCK_ALLOC=y
+CONFIG_PROVE_LOCKING=y
+CONFIG_PROVE_RCU=y
+# CONFIG_PROVE_RCU_REPEATEDLY is not set
@@ -5031 +5039,4 @@
-# CONFIG_LOCK_STAT is not set
+CONFIG_LOCKDEP=y
+CONFIG_LOCK_STAT=y
+# CONFIG_DEBUG_LOCKDEP is not set
+CONFIG_TRACE_IRQFLAGS=y
@@ -5040 +5051 @@
-# CONFIG_DEBUG_VM is not set
+CONFIG_DEBUG_VM=y
@@ -5042 +5053 @@
-# CONFIG_DEBUG_WRITECOUNT is not set
+CONFIG_DEBUG_WRITECOUNT=y
@@ -5045,4 +5056,4 @@
-# CONFIG_TEST_LIST_SORT is not set
-# CONFIG_DEBUG_SG is not set
-# CONFIG_DEBUG_NOTIFIERS is not set
-# CONFIG_DEBUG_CREDENTIALS is not set
+CONFIG_TEST_LIST_SORT=y
+CONFIG_DEBUG_SG=y
+CONFIG_DEBUG_NOTIFIERS=y
+CONFIG_DEBUG_CREDENTIALS=y
@@ -5059,2 +5070,8 @@
-# CONFIG_CPU_NOTIFIER_ERROR_INJECT is not set
-# CONFIG_FAULT_INJECTION is not set
+CONFIG_CPU_NOTIFIER_ERROR_INJECT=m
+CONFIG_FAULT_INJECTION=y
+CONFIG_FAILSLAB=y
+CONFIG_FAIL_PAGE_ALLOC=y
+CONFIG_FAIL_MAKE_REQUEST=y
+CONFIG_FAIL_IO_TIMEOUT=y
+CONFIG_FAULT_INJECTION_DEBUG_FS=y
+CONFIG_FAULT_INJECTION_STACKTRACE_FILTER=y
@@ -5062 +5079 @@
-# CONFIG_SYSCTL_SYSCALL_CHECK is not set
+CONFIG_SYSCTL_SYSCALL_CHECK=y
@@ -5100 +5117,2 @@
-# CONFIG_MMIOTRACE is not set
+CONFIG_MMIOTRACE=y
+# CONFIG_MMIOTRACE_TEST is not set
@@ -5106 +5124 @@
-# CONFIG_DMA_API_DEBUG is not set
+CONFIG_DMA_API_DEBUG=y
@@ -5125,3 +5143,3 @@
-# CONFIG_DEBUG_STACK_USAGE is not set
-# CONFIG_DEBUG_PER_CPU_MAPS is not set
-# CONFIG_X86_PTDUMP is not set
+CONFIG_DEBUG_STACK_USAGE=y
+CONFIG_DEBUG_PER_CPU_MAPS=y
+CONFIG_X86_PTDUMP=y
@@ -5130 +5148 @@
-# CONFIG_DEBUG_SET_MODULE_RONX is not set
+CONFIG_DEBUG_SET_MODULE_RONX=y
@@ -5195,0 +5214 @@
+CONFIG_CRYPTO_FIPS=y
@@ -5210 +5229 @@
-CONFIG_CRYPTO_MANAGER_DISABLE_TESTS=y
+# CONFIG_CRYPTO_MANAGER_DISABLE_TESTS is not set
@@ -5373,0 +5393 @@
+CONFIG_CPUMASK_OFFSTACK=y
@@ -5378 +5397,0 @@
-CONFIG_PCI_DEFAULT_USE_CRS=y
(In reply to comment #117)
> PaulB, would you like to re-run those tests with some config options enabled as
> suggested by PaulM. Also it would be good to run the script to capture the
> rcu traces.
>
> CONFIG_PROVE_RCU = y
> CONFIG_PREEMPT = y
> CONFIG_DEBUG_OBJECTS_RCU_HEAD = y
> CONFIG_RCU_TRACE = y

0) The first of Paul E. McKenney's suggestions I tried was CONFIG_DEBUG_OBJECTS_RCU_HEAD=y (since Fedora release candidates already ship with CONFIG_PROVE_RCU=y). CONFIG_DEBUG_OBJECTS_RCU_HEAD has some dependencies, so I basically ended up with:

$ grep -e PREEMPT -e RCU /boot/config-2.6.39-0.local3.fc16.i686 | grep -v "^#"
CONFIG_TREE_PREEMPT_RCU=y
CONFIG_PREEMPT_RCU=y
CONFIG_RCU_FANOUT=32
CONFIG_PREEMPT_NOTIFIERS=y
CONFIG_PREEMPT=y
CONFIG_DEBUG_OBJECTS_RCU_HEAD=y
CONFIG_DEBUG_PREEMPT=y
CONFIG_PROVE_RCU=y
CONFIG_SPARSE_RCU_POINTER=y

(This doesn't yet use CONFIG_RCU_TRACE. Remember that one of the Fedora defaults seems to be CONFIG_PREEMPT_VOLUNTARY=y.)

1) Using that config I seem to be unable to trigger this Oops. Does that make sense?
(In reply to comment #118)
> (In reply to comment #114)
> > So if you take fedora final release file and just enable CONFIG_PROVE_RCU=y,
> > then you can reproduce the issue? Or more config options are involved too.
>
> 0) The Fedora final release I ran was kernel-2.6.39-0.fc16.i686.rpm. The first
> version of that release that I managed to built with a config resembling
> release candidates (but as a vanilla kernel) was
> kernel-vanilla-2.6.39-0.local2.fc16.i686.
>
> 1) This horror shows the differences in their respective configs:
>
> diff -U 0 <(rpm2cpio Download/kernel-2.6.39-0.fc16.i686.rpm | cpio --quiet -i
> --to-stdout ./boot/config-2.6.39-0.fc16.i686)
> /boot/config-2.6.39-0.local2.fc16.i686
>
> 2) That generates this output (of which "+CONFIG_PROVE_RCU=y" seems to be the
> most important change, but most of it is entirely new for me anyway):

Ok, lots of debug options seem to have been turned off. Maybe we can take the Fedora final release config, turn on CONFIG_PROVE_RCU=y, and see if you can reproduce the issue or not.
(In reply to comment #120)
> Ok, lots of debug options seem to have been turned off. May be we can take
> fedora final release config and turn on CONFIG_PROVE_RCU=y and see if you can
> reproduce the issue or not.

0) If I take the Fedora final release (kernel-2.6.39-0.fc16.i686.rpm) and use CONFIG_PROVE_RCU=y, I end up with these differences (of which a few are due to the fact I rpmbuilt a vanilla kernel package):

--- /dev/fd/63  2011-05-25 00:05:59.211662288 +0200
+++ /boot/config-2.6.39-0.local4.fc16.i686  2011-05-24 22:13:54.000000000 +0200
@@ -3,2 +3,2 @@
-# Linux/i386 2.6.39-0.fc16.i686 Kernel Configuration
-# Thu May 19 20:07:58 2011
+# Linux/i386 2.6.39-0.local4.fc16.i686 Kernel Configuration
+# Tue May 24 20:15:33 2011
@@ -258 +258 @@
-CONFIG_INLINE_SPIN_UNLOCK=y
+# CONFIG_INLINE_SPIN_UNLOCK is not set
@@ -260 +260 @@
-CONFIG_INLINE_SPIN_UNLOCK_IRQ=y
+# CONFIG_INLINE_SPIN_UNLOCK_IRQ is not set
@@ -267 +267 @@
-CONFIG_INLINE_READ_UNLOCK=y
+# CONFIG_INLINE_READ_UNLOCK is not set
@@ -269 +269 @@
-CONFIG_INLINE_READ_UNLOCK_IRQ=y
+# CONFIG_INLINE_READ_UNLOCK_IRQ is not set
@@ -276 +276 @@
-CONFIG_INLINE_WRITE_UNLOCK=y
+# CONFIG_INLINE_WRITE_UNLOCK is not set
@@ -278 +278 @@
-CONFIG_INLINE_WRITE_UNLOCK_IRQ=y
+# CONFIG_INLINE_WRITE_UNLOCK_IRQ is not set
@@ -280 +280 @@
-CONFIG_MUTEX_SPIN_ON_OWNER=y
+# CONFIG_MUTEX_SPIN_ON_OWNER is not set
@@ -415 +415 @@
-CONFIG_SPLIT_PTLOCK_CPUS=4
+CONFIG_SPLIT_PTLOCK_CPUS=999999
@@ -596 +595,0 @@
-CONFIG_PCI_MSI_DEFAULT_ON=y
@@ -2662 +2660,0 @@
-CONFIG_INPUT_APPLEIR=m
@@ -2695 +2692,0 @@
-CONFIG_CRASH=m
@@ -4817 +4813,0 @@
-CONFIG_TMPFS_XATTR=y
@@ -5026,4 +5022,6 @@
-# CONFIG_DEBUG_SPINLOCK is not set
-# CONFIG_DEBUG_MUTEXES is not set
-# CONFIG_DEBUG_LOCK_ALLOC is not set
-# CONFIG_PROVE_LOCKING is not set
+CONFIG_DEBUG_SPINLOCK=y
+CONFIG_DEBUG_MUTEXES=y
+CONFIG_DEBUG_LOCK_ALLOC=y
+CONFIG_PROVE_LOCKING=y
+CONFIG_PROVE_RCU=y
+# CONFIG_PROVE_RCU_REPEATEDLY is not set
@@ -5030,0 +5029 @@
+CONFIG_LOCKDEP=y
@@ -5031,0 +5031,2 @@
+# CONFIG_DEBUG_LOCKDEP is not set
+CONFIG_TRACE_IRQFLAGS=y
@@ -5378 +5378,0 @@
-CONFIG_PCI_DEFAULT_USE_CRS=y

1) Using that Fedora final release with CONFIG_PROVE_RCU=y (and its dependencies) this issue does not seem to show up (i.e., I'm unable to trigger this BUG).
Ok, so enabling CONFIG_PROVE_RCU and its associated dependencies did not reproduce the issue. I guess it is time to try Paul's debug patch for detecting blocking in an RCU critical section, along with the RCU traces. If time permits, it might also be a good idea to play with the rest of the config options and try to narrow down which config option really causes this behavior.
My patch requires CONFIG_PROVE_RCU... So maybe best to follow up on Jens's suggestion. Alternatively, capture the CPU and context-switch number at the beginning of the RCU read-side critical section, then verify that there is no change in either quantity at the end of the RCU read-side critical section.
(In reply to comment #123)
> My patch requires CONFIG_PROVE_RCU... So maybe best to follow up on Jens's
> suggestion.

PaulB did try Jens's patch of disabling the cooperating queue logic. But that did not help.

> Alternatively, capture the CPU and context-switch number at the
> beginning of the RCU read-side critical section, then verify that there is no
> change in either quantity at the end of the RCU read-side critical section.

He has got a single-CPU machine, so the CPU number is not going to change. The context-switch number is interesting data, though. Where is it stored? I might just write a quick patch to print that info and ask PaulB to try it.
(In reply to comment #123) > My patch requires CONFIG_PROVE_RCU... This doesn't ring a bell. What patch are you referring to here?
(In reply to comment #125)
> (In reply to comment #123)
> > My patch requires CONFIG_PROVE_RCU...
>
> This doesn't ring a bell. What patch are you referring to here?

I guess the following:

https://lkml.org/lkml/2011/5/24/362
Further testing got me this patch, replacing https://lkml.org/lkml/2011/5/24/362:

rcu: Restore checks for blocking in RCU read-side critical sections

Long ago, using TREE_RCU with PREEMPT would result in "scheduling while atomic" diagnostics if you blocked in an RCU read-side critical section. However, PREEMPT now implies TREE_PREEMPT_RCU, which defeats this diagnostic. This commit therefore adds a replacement diagnostic based on PROVE_RCU.

Because rcu_lockdep_assert() and lockdep_rcu_dereference() are now being used for things that have nothing to do with rcu_dereference(), rename lockdep_rcu_dereference() to lockdep_rcu_suspicious() and add a third argument that is a string indicating what is suspicious. This third argument is passed in from a new third argument to rcu_lockdep_assert(). Update all calls to rcu_lockdep_assert() to add an informative third argument.

Finally, add a pair of rcu_lockdep_assert() calls from within rcu_note_context_switch(), one complaining if a context switch occurs in an RCU-bh read-side critical section and another complaining if a context switch occurs in an RCU-sched read-side critical section. These are present only if the PROVE_RCU kernel parameter is enabled.

Again, you must enable PROVE_RCU to see these new diagnostics. But you are enabling PROVE_RCU to check out new RCU uses in any case, aren't you?

Signed-off-by: Paul E. McKenney <paulmck.ibm.com>

diff --git a/include/linux/lockdep.h b/include/linux/lockdep.h
index 4aef1dd..bfa66e7 100644
--- a/include/linux/lockdep.h
+++ b/include/linux/lockdep.h
@@ -545,7 +545,7 @@ do { \
 #endif

 #ifdef CONFIG_PROVE_RCU
-extern void lockdep_rcu_dereference(const char *file, const int line);
+void lockdep_rcu_suspicious(const char *file, const int line, const char *s);
 #endif

 #endif /* __LINUX_LOCKDEP_H */
diff --git a/include/linux/rcupdate.h b/include/linux/rcupdate.h
index 58b13f1..fb2933d 100644
--- a/include/linux/rcupdate.h
+++ b/include/linux/rcupdate.h
@@ -297,19 +297,20 @@ extern int rcu_my_thread_group_empty(void);
 /**
  * rcu_lockdep_assert - emit lockdep splat if specified condition not met
  * @c: condition to check
+ * @s: informative message
  */
-#define rcu_lockdep_assert(c) \
+#define rcu_lockdep_assert(c, s) \
 	do { \
 		static bool __warned; \
 		if (debug_lockdep_rcu_enabled() && !__warned && !(c)) { \
 			__warned = true; \
-			lockdep_rcu_dereference(__FILE__, __LINE__); \
+			lockdep_rcu_suspicious(__FILE__, __LINE__, s); \
 		} \
 	} while (0)

 #else /* #ifdef CONFIG_PROVE_RCU */

-#define rcu_lockdep_assert(c) do { } while (0)
+#define rcu_lockdep_assert(c, s) do { } while (0)

 #endif /* #else #ifdef CONFIG_PROVE_RCU */
@@ -338,14 +339,16 @@ extern int rcu_my_thread_group_empty(void);
 #define __rcu_dereference_check(p, c, space) \
 	({ \
 		typeof(*p) *_________p1 = (typeof(*p)*__force )ACCESS_ONCE(p); \
-		rcu_lockdep_assert(c); \
+		rcu_lockdep_assert(c, "suspicious rcu_dereference_check()" \
+				      " usage"); \
 		rcu_dereference_sparse(p, space); \
 		smp_read_barrier_depends(); \
 		((typeof(*p) __force __kernel *)(_________p1)); \
 	})
 #define __rcu_dereference_protected(p, c, space) \
 	({ \
-		rcu_lockdep_assert(c); \
+		rcu_lockdep_assert(c, "suspicious rcu_dereference_protected()" \
+					" usage"); \
 		rcu_dereference_sparse(p, space); \
 		((typeof(*p) __force __kernel *)(p)); \
 	})
@@ -359,7 +362,9 @@ extern int rcu_my_thread_group_empty(void);
 #define __rcu_dereference_index_check(p, c) \
 	({ \
 		typeof(p) _________p1 = ACCESS_ONCE(p); \
-		rcu_lockdep_assert(c); \
+		rcu_lockdep_assert(c, \
+				   "suspicious rcu_dereference_index_check()" \
+				   " usage"); \
 		smp_read_barrier_depends(); \
 		(_________p1); \
 	})
diff --git a/kernel/lockdep.c b/kernel/lockdep.c
index 53a6895..9789028 100644
--- a/kernel/lockdep.c
+++ b/kernel/lockdep.c
@@ -3798,7 +3798,7 @@ void lockdep_sys_exit(void)
 	}
 }

-void lockdep_rcu_dereference(const char *file, const int line)
+void lockdep_rcu_suspicious(const char *file, const int line, const char *s)
 {
 	struct task_struct *curr = current;

@@ -3807,9 +3807,10 @@ void lockdep_rcu_dereference(const char *file, const int line)
 		return;
 #endif /* #ifdef CONFIG_PROVE_RCU_REPEATEDLY */
 	/* Note: the following can be executed concurrently, so be careful. */
-	printk("\n===================================================\n");
-	printk( "[ INFO: suspicious rcu_dereference_check() usage. ]\n");
-	printk( "---------------------------------------------------\n");
+	printk("\n===============================\n");
+	printk( "[ INFO: suspicious RCU usage. ]\n");
+	printk( "-------------------------------\n");
+	printk( "%s\n", s);
 	printk("%s:%d invoked rcu_dereference_check() without protection!\n",
 	       file, line);
 	printk("\nother info that might help us debug this:\n\n");
@@ -3818,4 +3819,4 @@ void lockdep_rcu_dereference(const char *file, const int line)
 	printk("\nstack backtrace:\n");
 	dump_stack();
 }
-EXPORT_SYMBOL_GPL(lockdep_rcu_dereference);
+EXPORT_SYMBOL_GPL(lockdep_rcu_suspicious);
diff --git a/kernel/pid.c b/kernel/pid.c
index 57a8346..a7577b3 100644
--- a/kernel/pid.c
+++ b/kernel/pid.c
@@ -419,7 +419,9 @@ EXPORT_SYMBOL(pid_task);
  */
 struct task_struct *find_task_by_pid_ns(pid_t nr, struct pid_namespace *ns)
 {
-	rcu_lockdep_assert(rcu_read_lock_held());
+	rcu_lockdep_assert(rcu_read_lock_held(),
+			   "find_task_by_pid_ns() needs rcu_read_lock()"
+			   " protection");
 	return pid_task(find_pid_ns(nr, ns), PIDTYPE_PID);
 }
diff --git a/kernel/rcutree.c b/kernel/rcutree.c
index 88547c8..e043b4d 100644
--- a/kernel/rcutree.c
+++ b/kernel/rcutree.c
@@ -153,6 +153,12 @@ void rcu_bh_qs(int cpu)
  */
 void rcu_note_context_switch(int cpu)
 {
+	rcu_lockdep_assert(!rcu_read_lock_bh_held(),
+			   "Illegal context switch in RCU-bh"
+			   " read-side critical section");
+	rcu_lockdep_assert(!rcu_read_lock_sched_held(),
+			   "Illegal context switch in RCU-sched"
+			   " read-side critical section");
 	rcu_sched_qs(cpu);
 	rcu_preempt_note_context_switch(cpu);
 }
(In reply to comment #127)
> Further testing got me this patch [...]:
>
> rcu: Restore checks for blocking in RCU read-side critical sections
>
> [...]

0) I've applied that patch to (mostly vanilla) v2.6.39. Specifically, I applied commit c15d76f26712bd5228aa0c6af7a7e7c492a812c9, from one of the branches of (one of?) your git repositor(y|ies) on git.kernel.org. Is that the correct commit?

1) None of the messages from that patch show up in my console or in my logs, even though I did manage (on the first try) to hit this Oops. Should I perhaps turn a knob here or there to get things to work? Or are we apparently looking in the wrong direction?
Upstream commit: http://git.kernel.org/linus/ab4bd22d3cce6977dc039664cc2d052e3147d662
(In reply to comment #129)
> Upstream commit:
>
> http://git.kernel.org/linus/ab4bd22d3cce6977dc039664cc2d052e3147d662

I got hit with a series of recursive faults which look remarkably similar to the original issue, i.e.:

[ 3923.519247] Pid: 22341, comm: mandb Not tainted 3.1.0-0.rc0.git12.1.fc17.x86_64 #1 To Be Filled By O.E.M. To Be Filled By O.E.M./SK8V
[ 3923.520333] RIP: 0010:[<ffffffff81240bef>] [<ffffffff81240bef>] cic_free_func+0x14/0x8c
......

while trying the 3.1.0-0.rc0.git12.1.fc17.x86_64 kernel. When this happened for the first time my machine locked up entirely, first scrolling off the screen any information about what possibly started it. In the second incident the machine stayed alive, and a complete dmesg from that time is in attachment 515990 [details] to bug 726877. Before that happened, other locking problems were reported.
(In reply to comment #130)
> (In reply to comment #129)
> > Upstream commit:
> >
> > http://git.kernel.org/linus/ab4bd22d3cce6977dc039664cc2d052e3147d662

0) And, for some reason, 9b50902db5eb8a220160fb89e95aa11967998d12 too! I have no clue what happened there, but I guess git is smart enough to handle two identical commits.

> I got hit with a series of recursive faults, which look remarkably similar to
> the original issue, i.e.:
>
> [ 3923.519247] Pid: 22341, comm: mandb Not tainted
> 3.1.0-0.rc0.git12.1.fc17.x86_64 #1 To Be Filled By O.E.M. To Be Filled By O.E.M./SK8V
> [ 3923.520333] RIP: 0010:[<ffffffff81240bef>] [<ffffffff81240bef>]
> cic_free_func+0x14/0x8c
> ......
>
> while trying 3.1.0-0.rc0.git12.1.fc17.x86_64 kernel.

1) Ditto here, but on an i686:

BUG: unable to handle kernel paging request at 6b6b6b43
IP: [<c05d7d80>] cic_free_func+0x10/0x6c
*pde = 00000000
Oops: 0000 [#1] SMP
Modules linked in: arc4 ppdev microcode joydev pcspkr snd_intel8x0m snd_intel8x0 ath5k snd_ac97_codec ac97_bus ath snd_seq snd_seq_device thinkpad_acpi mac80211 snd_pcm parport_pc parport cfg80211 e1000 rfkill snd_timer iTCO_wdt snd soundcore snd_page_alloc iTCO_vendor_support i2c_i801 binfmt_misc uinput yenta_socket video radeon ttm drm_kms_helper drm i2c_algo_bit i2c_core [last unloaded: scsi_wait_scan]

Pid: 25711, comm: mandb Not tainted 3.0.0-1.local1.fc16.i686 #1 IBM /
EIP: 0060:[<c05d7d80>] EFLAGS: 00010286 CPU: 0
EIP is at cic_free_func+0x10/0x6c
EAX: f67bf900 EBX: 6b6b6b43 ECX: 00000246 EDX: 6b6b6b43
ESI: f67bf900 EDI: c05d7d70 EBP: c75e9f04 ESP: c75e9ef0
DS: 007b ES: 007b FS: 00d8 GS: 00e0 SS: 0068
Process mandb (pid: 25711, ti=c75e8000 task=e9518000 task.ti=c75e8000)
Stack:
 6b6b6b6b f67bf900 6b6b6b6b f67bf900 c05d7d70 c75e9f18 c05d7610 f67bf900
 e9518000 e95183d8 c75e9f20 c05d7630 c75e9f2c c05cc8ba f67bf900 c75e9f40
 c05cc94a e9518000 00000012 c75e9f74 c75e9f8c c04423e2 eeb13080 cb4edb08
Call Trace:
 [<c05d7d70>] ? cfqq_process_refs+0x18/0x18
 [<c05d7610>] call_for_each_cic+0x30/0x3e
 [<c05d7630>] cfq_free_io_context+0x12/0x14
 [<c05cc8ba>] put_io_context+0x34/0x5c
 [<c05cc94a>] exit_io_context+0x68/0x6d
 [<c04423e2>] do_exit+0x62f/0x66a
 [<c04f93b2>] ? fsnotify_modify+0x64/0x6c
 [<c0442634>] do_group_exit+0x63/0x86
 [<c044266f>] sys_exit_group+0x18/0x18
 [<c083f69f>] sysenter_do_call+0x12/0x38
Code: 5f 5d c3 55 89 e5 3e 8d 74 26 00 8b 50 3c 03 50 40 8b 00 29 d0 79 02 0f 0b 5d c3 55 89 e5 57 56 53 83 ec 08 3e 8d 74 26 00 89 d3 <8b> 12 89 c6 f6 c2 01 75 02 0f 0b 8d 78 08 89 f8 89 55 ec e8 e0
EIP: [<c05d7d80>] cic_free_func+0x10/0x6c SS:ESP 0068:c75e9ef0
CR2: 000000006b6b6b43
---[ end trace 4694910e102a3c2a ]---

> Before that happened other locking problems were reported.

2) Apparently not relevant, as I just triggered it too - on a self-compiled vanilla 3.0 running (sort of stock) Rawhide.

3) I spent countless hours trying to squash this bug in the last three weeks of May 2011 and the first week of June 2011. It was one of the longest debugging sessions I have ever participated in. I'm not sure I have the desire to once again dive into this matter....

4) Michal, can you trigger this at will (by running mandb by hand, while forcing it to redo its work from scratch, by using a command line switch which I already forgot)? If so, we might be able to speed up debugging (as it is not just my machine any more, assuming I can also trigger this at will).

5) If I recall correctly I drafted a patch for this bug (with Paul McKenney's assistance) that dived a bit deeper into this than the fix that Jens Axboe actually merged (and which really has been merged in v3.0, hasn't it?). If Michal and/or I are once again able to trigger this by running mandb manually, the first thing we should try is to make the RCU locking around ioc->ioc_data not just good enough but, well, perfect.
(In reply to comment #131)
> 4) Michal, can you trigger this at will (by running mandb by hand, while
> forcing it to redo its work from scratch, by using a command line switch which
> I already forgot)?

I think you mean 'mandb -c', as this forces redoing all mandb work from scratch. I am afraid not. I tried that a few times without causing any ill effects (not counting assorted complaints from mandb when not running '--quiet'). It is quite possible that a mandb run can do that if overlapped at the same time with other substantial I/O, like running 'yum update' with a bunch of packages, but my attempts to recreate something of that sort also did not yet trigger any kernel faults.

> If so, we might be able to speed up debugging (as it is not
> just my machine any more, assuming I can also trigger this at will).

I really have no idea why for some time interval I was seeing this problem quite regularly, after which there was a long hiatus when I could not record it at all while others were hit by it; now it has apparently returned, but it caused a total lockup on one run, a series of recursive faults but no lockup on the next one, and so far no new instance since. Apparently I have to look the other way. :-)
(In reply to comment #131)
> 5) If I recall correctly I drafted a patch for this bug (with Paul McKenney's
> assistance) that dived a bit deeper into this than the fix that Jens Axboe
> actually merged (and which really has been merged in v3.0, hasn't it?). If
> Michal and/or I are once again able to trigger this by running mandb manually
> the first thing we should try is to make the RCU locking around ioc->ioc_data
> not just good enough but, well, perfect.

Paul,

Did the following commits take care of the rest of the RCU issues, or are there still more concerns?

commit 726e99ab88db059fe1422e15376ae404f8c66eb4
Author: Shaohua Li <shaohua.li>
Date: Mon Jun 27 09:03:48 2011 +0200

    cfq-iosched: make code consistent

commit 3181faa85bda3dc3f5e630a1846526c9caaa38e3
Author: Shaohua Li <shaohua.li>
Date: Mon Jun 27 09:03:47 2011 +0200

    cfq-iosched: fix a rcu warning
(In reply to comment #132)
> I think that you mean 'mandb -c' as this forces redoing all mandb work from
> scratch.

0) Yes, that's the one.

> I am afraid that not. I tried that few times without causing any ill
> effects (not counting assorted complaints from mandb when not runnig
> '--quiet').

1) That's a pity.

> Quite possible that mandb run can do that if overlapped at the same time with
> another substantial I/O, like running 'yum update' with a bunch of packages,
> but my attempts to recreate something of that sort also did not yet trigger any
> kernel faults.

2) It's my feeling you can increase the chance of hitting this by having the daily cron run of mandb do more work than necessary. You'll need to edit /etc/sysconfig/man-db and add that -c switch. With a bit of, well, luck you'll hit this Oops every now and then and can keep us informed if that really happens (not every time, I'd say, but only after things like kernel version bumps). (Note that I seem to remember that at one point even programs like evolution and git sometimes triggered this bug.)

3) Do the Fedora kernel maintainers agree with this suggestion?
(In reply to comment #133)
> (In reply to comment #131)
> > 5) If I recall correctly I drafted a patch for this bug (with Paul McKenney's
> > assistance) that dived a bit deeper into this than the fix that Jens Axboe
> > actually merged (and which really has been merged in v3.0, hasn't it?). If
> > Michal and/or I are once again able to trigger this by running mandb manually
> > the first thing we should try is to make the RCU locking around ioc->ioc_data
> > not just good enough but, well, perfect.
> Did following commits took care of rest of the rcu issues or there was are
> still more concerns.
>
> commit 726e99ab88db059fe1422e15376ae404f8c66eb4
> [...]
> cfq-iosched: make code consistent
>
> commit 3181faa85bda3dc3f5e630a1846526c9caaa38e3
> [...]
> cfq-iosched: fix a rcu warning

0) It turns out I misremembered things. I never bothered to submit an updated patch (after Paul McKenney's review) for several uninteresting reasons (like Jens Axboe being offline for some weeks after we finally pinpointed the problem in early June 2011, and me not really wanting to spend even more time on a bug that I had already spent so much time on, etc.).

1) Anyhow, at first glance those two commits do seem to tighten things around ioc->ioc_data.

2) So I guess there are two things I could do right now:
i) see if I can again trigger this manually (with a decent chance)
ii) if I can, see what happens if I again rip out all the references to ioc->ioc_data (the main side effect of that should be that it slows CFQ down a bit).

[ 3) It might be interesting to note that I haven't yet used v3.0 a lot, and stuck to a self-patched version of v2.6.39 (because the fix for this bug was only added in -rc?, because of a summer holiday, and because after that I spent quite some time tracking down bug #722472, etc.). So perhaps there's still some undiscovered problem, and the machine that so easily triggers this just needs to be exposed to v3.0 a bit more... ]
(In reply to comment #135)
> 2) So I guess there are two things I could do right now:
> i) see if I can again triggers this manually (with a decent chance)

Well, I hit that same Oops at the first try of "mandb -c"! So I guess it's time to recompile (but now with CFQ as a module) and see whether I can torture CFQ into confessing its latest secret.

> ii) [...].
(In reply to comment #135)

0) Please be aware that advanced agitation might interfere with my actions and cloud my judgement.

> ii) if I can, see what happens if I again rip out the all references to
> ioc->ioc_data (the main side effect of that should be that it slows CFQ down a
> bit).

1) Ripping out ioc_data (the one-hit cic cache) has no effect: "mandb -c" reliably triggers an Oops. This suggests we've hit yet another CFQ bug. I have no clue whatsoever what that bug might be. Jens? Vivek?
Paul,

- So it is the same machine where you were hitting the bug with mandb -c?
- What kernel version are you using? Make sure you are using the latest upstream kernel.
- Can you also paste the latest backtrace?
- Is it crashing at the same place where we are accessing a freed cic?
- I thought that in the past you confirmed that mandb -c was working fine (after your fix). I am wondering what changed since then that the same issue has returned.
(In reply to comment #138) > - So it is same machine where you were hitting bug with mandb -c? Yes, the same machine as the machine that so easily triggered this bug two/three months ago. > - What kernel version you are using? Make sure you are using latest upstream > kernel. v3.0 (I'm not running the daily snapshots, and wait for the release candidates to appear). Note that I've down two things: - make CFQ modular - removed all traces of the ioc_data one hit cache (just to be sure this is something different). > - Can you also paste latest backtrace? OK, this is from my _current_ session (bugzilla messes up the layout a bit). The crash is still not fatal: [21265.163144] BUG: unable to handle kernel paging request at 6b6b6b43 [21265.164022] IP: [<f7d55fe0>] cic_free_func+0x10/0x6c [cfq_iosched] [21265.164022] *pde = 00000000 [21265.164022] Oops: 0000 [#1] SMP [21265.164022] Modules linked in: cfq_iosched tcp_lp fuse arc4 ppdev snd_intel8x0m snd_intel8x0 snd_ac97_codec ac97_bus snd_seq snd_seq_device microcode snd_pcm joydev snd_timer ath5k pcspkr snd_page_alloc thinkpad_acpi i2c_i801 ath mac80211 snd soundcore cfg80211 rfkill parport_pc parport e1000 iTCO_wd t iTCO_vendor_support binfmt_misc uinput yenta_socket video radeon ttm drm_kms_helper drm i2c_algo_bit i2c_core [last unloaded: scsi_wait_scan] [21265.164022] [21265.164022] Pid: 18514, comm: mandb Not tainted 3.0.0-1.local2.fc17.i686 #1 IBM / [21265.164022] EIP: 0060:[<f7d55fe0>] EFLAGS: 00010286 CPU: 0 [21265.164022] EIP is at cic_free_func+0x10/0x6c [cfq_iosched] [21265.164022] EAX: f242e400 EBX: 6b6b6b43 ECX: 00000246 EDX: 6b6b6b43 [21265.164022] ESI: f242e400 EDI: f7d55fd0 EBP: df92df04 ESP: df92def0 [21265.164022] DS: 007b ES: 007b FS: 00d8 GS: 00e0 SS: 0068 [21265.164022] Process mandb (pid: 18514, ti=df92c000 task=ed1554c0 task.ti=df92c000) [21265.164022] Stack: [21265.164022] c0468df7 c0af0cb0 6b6b6b6b f242e400 f7d55fd0 df92df18 f7d54510 f242e400 [21265.164022] ed1554c0 ed155898 df92df20 f7d54530 
df92df2c c05cc8ba f242e400 df92df40 [21265.164022] c05cc94a ed1554c0 00000012 df92df74 df92df8c c04423e2 eedf2b5c eedf2b00 [21265.164022] Call Trace: [21265.164022] [<c0468df7>] ? lock_is_held+0x73/0x7b [21265.164022] [<f7d55fd0>] ? cfqq_process_refs+0x18/0x18 [cfq_iosched] [21265.164022] [<f7d54510>] call_for_each_cic+0x3e/0x4c [cfq_iosched] [21265.164022] [<f7d54530>] cfq_free_io_context+0x12/0x14 [cfq_iosched] [21265.164022] [<c05cc8ba>] put_io_context+0x34/0x5c [21265.164022] [<c05cc94a>] exit_io_context+0x68/0x6d [21265.164022] [<c04423e2>] do_exit+0x62f/0x66a [21265.164022] [<c0442634>] do_group_exit+0x63/0x86 [21265.164022] [<c044266f>] sys_exit_group+0x18/0x18 [21265.164022] [<c083b0df>] sysenter_do_call+0x12/0x38 [21265.164022] Code: 5f 5d c3 55 89 e5 3e 8d 74 26 00 8b 50 3c 03 50 40 8b 00 29 d0 79 02 0f 0b 5d c3 55 89 e5 57 56 53 83 ec 08 3e 8d 74 26 00 89 d3 <8 b> 12 89 c6 f6 c2 01 75 02 0f 0b 8d 78 08 89 f8 89 55 ec e8 c8 [21265.164022] EIP: [<f7d55fe0>] cic_free_func+0x10/0x6c [cfq_iosched] SS:ESP 0068:df92def0 [21265.164022] CR2: 000000006b6b6b43 [21265.363749] ---[ end trace a779435da6d0a152 ]--- Shortly afterwards: [21484.785098] BUG: unable to handle kernel paging request at 6b6b6b43 [21484.786020] IP: [<f7d55fe0>] cic_free_func+0x10/0x6c [cfq_iosched] [21484.786020] *pde = 00000000 [21484.786020] Oops: 0000 [#2] SMP [21484.786020] Modules linked in: cfq_iosched tcp_lp fuse arc4 ppdev snd_intel8x0m snd_intel8x0 snd_ac97_codec ac97_bus snd_seq snd_seq_device microcode snd_pcm joydev snd_timer ath5k pcspkr snd_page_alloc thinkpad_acpi i2c_i801 ath mac80211 snd soundcore cfg80211 rfkill parport_pc parport e1000 iTCO_wd t iTCO_vendor_support binfmt_misc uinput yenta_socket video radeon ttm drm_kms_helper drm i2c_algo_bit i2c_core [last unloaded: scsi_wait_scan] [21484.786020] [21484.786020] Pid: 8230, comm: mandb Tainted: G D 3.0.0-1.local2.fc17.i686 #1 IBM / [21484.786020] EIP: 0060:[<f7d55fe0>] EFLAGS: 00010286 CPU: 0 [21484.786020] EIP is at 
cic_free_func+0x10/0x6c [cfq_iosched] [21484.786020] EAX: f242e880 EBX: 6b6b6b43 ECX: 00000286 EDX: 6b6b6b43 [21484.786020] ESI: f242e880 EDI: f7d55fd0 EBP: ed0edf04 ESP: ed0edef0 [21484.786020] DS: 007b ES: 007b FS: 00d8 GS: 00e0 SS: 0068 [21484.786020] Process mandb (pid: 8230, ti=ed0ec000 task=ed7e54c0 task.ti=ed0ec000) [21484.786020] Stack: [21484.786020] 00000001 00000286 6b6b6b6b f242e880 f7d55fd0 ed0edf18 f7d54510 f242e880 [21484.786020] ed7e54c0 ed7e5898 ed0edf20 f7d54530 ed0edf2c c05cc8ba f242e880 ed0edf40 [21484.786020] c05cc94a ed7e54c0 00000012 ed0edf74 ed0edf8c c04423e2 eef3a89c eef3a840 [21484.786020] Call Trace: [21484.786020] [<f7d55fd0>] ? cfqq_process_refs+0x18/0x18 [cfq_iosched] [21484.786020] [<f7d54510>] call_for_each_cic+0x3e/0x4c [cfq_iosched] [21484.786020] [<f7d54530>] cfq_free_io_context+0x12/0x14 [cfq_iosched] [21484.786020] [<c05cc8ba>] put_io_context+0x34/0x5c [21484.786020] [<c05cc94a>] exit_io_context+0x68/0x6d [21484.786020] [<c04423e2>] do_exit+0x62f/0x66a [21484.786020] [<c04f8352>] ? filp_close+0x5c/0x64 [21484.786020] [<c0442634>] do_group_exit+0x63/0x86 [21484.786020] [<c044266f>] sys_exit_group+0x18/0x18 [21484.786020] [<c083b0df>] sysenter_do_call+0x12/0x38 [21484.786020] Code: 5f 5d c3 55 89 e5 3e 8d 74 26 00 8b 50 3c 03 50 40 8b 00 29 d0 79 02 0f 0b 5d c3 55 89 e5 57 56 53 83 ec 08 3e 8d 74 26 00 89 d3 <8 b> 12 89 c6 f6 c2 01 75 02 0f 0b 8d 78 08 89 f8 89 55 ec e8 c8 [21484.786020] EIP: [<f7d55fe0>] cic_free_func+0x10/0x6c [cfq_iosched] SS:ESP 0068:ed0edef0 [21484.786020] CR2: 000000006b6b6b43 [21485.009510] ---[ end trace a779435da6d0a153 ]--- This last one is repeated twice, almost identical each time (so it occurred three times in total). > - Is it crashing at the same place where we are accessing a freed cic? I don't (yet) know what happens. The only thing I note is that the faulting address is 6b6b6b43. 
This suggests something is fed 6b6b6b6b somewhere and wants to use the address at an 0x28 offset from that address (via a container_of() like mechanism?).

> - I thought that in the past you confirmed that mandb -c is working fine (after your fix).

Yes.

> I am wondering what changed since then that same issue has
> returned.

No idea. But it looks like this is not the same issue (as I've removed ioc->ioc_data). It does look similar (ie, somehow stuff gets out of sync and we're suddenly using an invalid cic).
(In reply to comment #139) > This suggests something is fed 6b6b6b6b somewhere and > wants to use the address at an 0x28 offset from that address (via a > container_of() like mechanism?). 0) On 32 bit x86 0x28 is the offset of hlist_node cic_list in struct cfq_io_context (four pointers, four longs, and one struct of two pointers equal (10 * 4) = 0x28 bytes). In Michal's backtrace I see 0x6b6b6b1b mentioned. Michal was running 64 bit x86. On 64 bit x86 six pointers and four longs equal (10 * 8 =) 0x50 bytes. Which matches neatly with the above. 1) This could be a red herring, of course, and wrongly calculated too, but for now we might focus on whether we can find a spot where a struct cfq_io_context might be used while invalid.
(In reply to comment #140)
> (In reply to comment #139)
> > This suggests something is fed 6b6b6b6b somewhere and
> > wants to use the address at an 0x28 offset from that address (via a
> > container_of() like mechanism?).
>
> 0) On 32 bit x86 0x28 is the offset of hlist_node cic_list in struct
> cfq_io_context (four pointers, four longs, and one struct of two pointers equal
> (10 * 4) = 0x28 bytes). In Michal's backtrace I see 0x6b6b6b1b mentioned.
> Michal was running 64 bit x86. On 64 bit x86 six pointers and four longs equal
> (10 * 8 =) 0x50 bytes. Which matches neatly with the above.
>
> 1) This could be a red herring, of course, and wrongly calculated too, but for
> now we might focus on whether we can find a spot where a struct cfq_io_context
> might be used while invalid.

This explanation makes sense. I am not sure what is at offset 0x10 in cic_free_func(), but it looks like it could be the following:

hlist_del_rcu(&cic->cic_list);

So if we have fetched a bad cic pointer (because the previous cic was freed and got poisoned before the rcu period was over), we would oops. Can you disassemble your vmlinux and confirm which is the faulting instruction?

It looks like we are back to the original issue of how come the previous cic got poisoned while we are in the rcu read lock.
Paul, It might make sense to apply my cfq debug patch V2 again and run again to verify it is the same issue where we are freeing up a cic before rcu period is over.
(In reply to comment #141)
>
> This explanation makes sense. I am not sure what is offset 0x10 in
> cic_free_func() but it looks like it could be following.
>
> hlist_del_rcu(&cic->cic_list);

In fact it should be the following access which caused the oops:

unsigned long dead_key = (unsigned long) cic->key;
(In reply to comment #141)
> Can you disassemble your
> vmlinux and confirm which is the faulting instruction.

(In reply to comment #143)
> In fact it should be the following access which caused the oops.
>
> unsigned long dead_key = (unsigned long) cic->key;

That is correct, see this gdb session with cfq-iosched.ko (a small benefit of making cfq modular):

$ gdb block/cfq-iosched.ko
Reading symbols from /home/[...]/Rpmbuild/BUILD/kernel-3.0.fc17/linux-3.0.i686/block/cfq-iosched.ko...done.
(gdb) list cic_free_func
2695	{
2696		call_rcu(&cic->rcu_head, cfq_cic_free_rcu);
2697	}
2698	
2699	static void cic_free_func(struct io_context *ioc, struct cfq_io_context *cic)
2700	{
2701		unsigned long flags;
2702		unsigned long dead_key = (unsigned long) cic->key;
2703	
2704		BUG_ON(!(dead_key & CIC_DEAD_KEY));
(gdb) info line 2702
Line 2702 of "/home/peb/Rpmbuild/BUILD/kernel-3.0.fc17/linux-3.0.i686/block/cfq-iosched.c" starts at address 0x2004 <cic_free_func+16> and ends at 0x2006 <cic_free_func+18>.
(gdb) info line 2704
Line 2704 of "/home/peb/Rpmbuild/BUILD/kernel-3.0.fc17/linux-3.0.i686/block/cfq-iosched.c" starts at address 0x2008 <cic_free_func+20> and ends at 0x200f <cic_free_func+27>.
(gdb) disassemble /m 0x2004,0x200f
Dump of assembler code from 0x2004 to 0x200f:
2700	{
   0x00002006 <cic_free_func+18>:	mov esi,eax

2701		unsigned long flags;
2702		unsigned long dead_key = (unsigned long) cic->key;
   0x00002004 <cic_free_func+16>:	mov edx,DWORD PTR [edx]

2703	
2704		BUG_ON(!(dead_key & CIC_DEAD_KEY));
   0x00002008 <cic_free_func+20>:	test dl,0x1
   0x0000200b <cic_free_func+23>:	jne 0x200f <cic_free_func+27>
   0x0000200d <cic_free_func+25>:	ud2

End of assembler dump.
(gdb) disassemble /r 0x2004,0x200f
Dump of assembler code from 0x2004 to 0x200f:
   0x00002004 <cic_free_func+16>:	8b 12	mov edx,DWORD PTR [edx]
   0x00002006 <cic_free_func+18>:	89 c6	mov esi,eax
   0x00002008 <cic_free_func+20>:	f6 c2 01	test dl,0x1
   0x0000200b <cic_free_func+23>:	75 02	jne 0x200f <cic_free_func+27>
   0x0000200d <cic_free_func+25>:	0f 0b	ud2
End of assembler dump.

(Please note that addresses are in hex but offsets are in decimal.)

The output from "disassemble /r" matches the faulting code (according to the Oops) perfectly.
(In reply to comment #144)
> The output from "disassemble /r" matches the faulting code (according to the
> Oops) perfectly.

Would it be possible to double-check this using the debuginfo rpm for the Fedora build that triggered Michal's Oops (see attachment 515990 [details])?
(In reply to comment #142)
> It might make sense to apply my cfq debug patch V2 again and run again to
> verify it is the same issue where we are freeing up a cic before rcu period is
> over.

0) I have some doubts.

1) We've determined that a simple access of cic->key triggers a page fault (its address being 0x6b6b6b43). key is the first member of struct cfq_io_context, so 0x6b6b6b43 must be the address of cic too, mustn't it? That debug patch looks for a cic with address 6b6b6b6b, so we will not find this exact occurrence of cic.

2) But perhaps (it's hard reasoning about this stuff) some code was fed cic->cic_list and did a container_of()-like call on it (eg, the rcu code, which has lots of those), which _might_ give us (0x6b6b6b6b - 0x28 =) 0x6b6b6b43. If so, I could run that debug patch V2; if not, it would be better to work out what a debug patch should track instead.
(In reply to comment #145)
> Would it be possible to double-check this using the debuginfo rpm for the
> Fedora build that triggered Michal's Oops (see attachment 515990 [details])?

0) Yes, trivial actually, which is rather impressive. Anyway, see this gdb session with vmlinux (as shipped with kernel-debuginfo-3.1.0-0.rc0.git12.1.fc17.x86_64):

$ gdb /usr/lib/debug/lib/modules/3.1.0-0.rc0.git12.1.fc17.x86_64/vmlinux
Reading symbols from /usr/lib/debug/lib/modules/3.1.0-0.rc0.git12.1.fc17.x86_64/vmlinux...done.
(gdb) list cic_free_func
2699	{
2700		call_rcu(&cic->rcu_head, cfq_cic_free_rcu);
2701	}
2702	
2703	static void cic_free_func(struct io_context *ioc, struct cfq_io_context *cic)
2704	{
2705		unsigned long flags;
2706		unsigned long dead_key = (unsigned long) cic->key;
2707	
2708		BUG_ON(!(dead_key & CIC_DEAD_KEY));
(gdb) info line 2706
Line 2706 of "block/cfq-iosched.c" starts at address 0xffffffff81240bef <cic_free_func+20> and ends at 0xffffffff81240bf2 <cic_free_func+23>.
(gdb) info line 2708
Line 2708 of "block/cfq-iosched.c" starts at address 0xffffffff81240bf8 <cic_free_func+29> and ends at 0xffffffff81240c00 <cic_free_func+37>.
(gdb) disassemble /m 0xffffffff81240bef,0xffffffff81240c00
Dump of assembler code from 0xffffffff81240bef to 0xffffffff81240c00:
2704	{
   0xffffffff81240bf2 <cic_free_func+23>:	mov %rdi,%r13
   0xffffffff81240bf5 <cic_free_func+26>:	mov %rsi,%rbx

2705		unsigned long flags;
2706		unsigned long dead_key = (unsigned long) cic->key;
   0xffffffff81240bef <cic_free_func+20>:	mov (%rsi),%r15

2707	
2708		BUG_ON(!(dead_key & CIC_DEAD_KEY));
   0xffffffff81240bf8 <cic_free_func+29>:	test $0x1,%r15b
   0xffffffff81240bfc <cic_free_func+33>:	jne 0xffffffff81240c00 <cic_free_func+37>
   0xffffffff81240bfe <cic_free_func+35>:	ud2a

End of assembler dump.
(gdb) disassemble /r 0xffffffff81240bef,0xffffffff81240c00
Dump of assembler code from 0xffffffff81240bef to 0xffffffff81240c00:
   0xffffffff81240bef <cic_free_func+20>:	4c 8b 3e	mov (%rsi),%r15
   0xffffffff81240bf2 <cic_free_func+23>:	49 89 fd	mov %rdi,%r13
   0xffffffff81240bf5 <cic_free_func+26>:	48 89 f3	mov %rsi,%rbx
   0xffffffff81240bf8 <cic_free_func+29>:	41 f6 c7 01	test $0x1,%r15b
   0xffffffff81240bfc <cic_free_func+33>:	75 02	jne 0xffffffff81240c00 <cic_free_func+37>
   0xffffffff81240bfe <cic_free_func+35>:	0f 0b	ud2a
End of assembler dump.

(Please note that addresses are in hex but offsets are in decimal.)

The output from "disassemble /r" matches the faulting code (according to the Oops Michal publicized in attachment 515990 [details]) perfectly.

2) So now we know exactly where we trigger the page fault, and that this is not a single local oddity. Progress, I'd say!
(In reply to comment #146)
>
> 2) But perhaps (it's hard reasoning about this stuff) some code was fed
> cic->cic_list and did a container_of() like call on it (eg, the rcu code, which
> has lots of those) which _might_ give us (0x6b6b6b6b - 0x28 =) 0x6b6b6b43. If
> so, I could run that debug patch V2 [...].

0) So I decided to try debug patch V2 anyway, and hit the BUG() added at the end of call_for_each_cic() within a minute or three (I switched to cfq at about 400 secs).

1) Oops and a few preceding and following lines (the entire output was almost 70k lines, while the added BUG() was hit 5 times in total):

[ 562.529534] debug: Will exit io context cic=0xf0469230, Next pointer is (null)
[ 562.540965] debug: cic=0xf0469230 exit io context done. Next pointer is (null)
[ 562.552320] debug: Will decouple-free cic cic=0xf0469230, Next pointer is (null)
[ 562.563883] debug: cic=0xf0469230 decouple-free cic done. Next pointer is (null)
[ 562.578050] debug: freeing up cic=f0469230
[ 562.598881] debug: Will exit io context cic=0xf0469230, Next pointer is (null)
[ 562.598881] debug: cic=0xf0469230 exit io context done. Next pointer is (null)
[ 562.598881] debug: Will decouple-free cic cic=0xf0469230, Next pointer is (null)
[ 562.633403] debug: cic=0xf0469230 decouple-free cic done. Next pointer is (null)
[ 562.644949] debug: freeing up cic=f0469230
[ 562.652971] cic=f0469230 freed while in RCU read lock
[ 562.661994] ------------[ cut here ]------------
[ 562.662933] kernel BUG at /home/[...]/Rpmbuild/BUILD/kernel-3.0.fc17/linux-3.0.i686/block/cfq-iosched.c:2699!
[ 562.662933] invalid opcode: 0000 [#1] SMP [ 562.662933] Modules linked in: cfq_iosched arc4 ppdev snd_intel8x0 snd_intel8x0m snd_ac97_codec ac97_bus snd_seq snd_seq_device microcode snd_pcm joydev pcspkr ath5k snd_timer ath snd_page_alloc i2c_i801 thinkpad_acpi mac80211 snd soundcore cfg80211 parport_pc e1000 rfkill parport iTCO_wdt iTCO_vendor_support binfmt_misc uinput yenta_socket video radeon ttm drm_kms_helper drm i2c_algo_bit i2c_core [last unloaded: scsi_wait_scan] [ 562.662933] [ 562.662933] Pid: 7550, comm: mandb Not tainted 3.0.0-1.local2.fc17.i686 #1 IBM / [ 562.662933] EIP: 0060:[<f806301f>] EFLAGS: 00010286 CPU: 0 [ 562.662933] EIP is at call_for_each_cic+0x79/0xf2 [cfq_iosched] [ 562.662933] EAX: 0000003c EBX: f0469258 ECX: 00000082 EDX: 00000000 [ 562.662933] ESI: f80661bd EDI: f0469230 EBP: f04b3f18 ESP: f04b3ef0 [ 562.662933] DS: 007b ES: 007b FS: 00d8 GS: 00e0 SS: 0068 [ 562.662933] Process mandb (pid: 7550, ti=f04b2000 task=f40bbf90 task.ti=f04b2000) [ 562.662933] Stack: [ 562.662933] f8066203 f0469230 f80661bd 00000000 6b6b6b6b f2fc8380 f806545b f2fc8380 [ 562.662933] f40bbf90 f40bc368 f04b3f20 f80630aa f04b3f2c c05cc8ba f2fc8380 f04b3f40 [ 562.662933] c05cc94a f40bbf90 00000012 f04b3f74 f04b3f8c c04423e2 f658ee1c f658edc0 [ 562.662933] Call Trace: [ 562.662933] [<f806545b>] ? 
cfq_put_request+0x5e/0x5e [cfq_iosched] [ 562.662933] [<f80630aa>] cfq_free_io_context+0x12/0x14 [cfq_iosched] [ 562.662933] [<c05cc8ba>] put_io_context+0x34/0x5c [ 562.662933] [<c05cc94a>] exit_io_context+0x68/0x6d [ 562.662933] [<c04423e2>] do_exit+0x62f/0x66a [ 562.662933] [<c0442634>] do_group_exit+0x63/0x86 [ 562.662933] [<c044266f>] sys_exit_group+0x18/0x18 [ 562.662933] [<c083b0df>] sysenter_do_call+0x12/0x38 [ 562.662933] Code: e8 81 e4 ff ff 8b 45 ec 8b 58 4c eb 22 e8 7c ff ff ff 81 7d e8 6b 6b 6b 6b 75 12 89 7c 24 04 c7 04 24 03 62 06 f8 e8 a5 93 7c c8 <0f> 0b 8b 1b e8 d0 29 3f c8 85 c0 74 05 e8 51 ff ff ff 85 db 74 [ 562.662933] EIP: [<f806301f>] call_for_each_cic+0x79/0xf2 [cfq_iosched] SS:ESP 0068:f04b3ef0 [ 562.993878] ---[ end trace b9b7cf2ea11e5aaa ]--- [ 563.003045] Fixing recursive fault but reboot is needed! [ 563.045875] debug: Will exit io context cic=0xf0469230, Next pointer is (null) [ 563.045885] debug: cic=0xf0469230 exit io context done. Next pointer is (null) [ 563.045888] debug: Will decouple-free cic cic=0xf0469230, Next pointer is (null) [ 563.045894] debug: cic=0xf0469230 decouple-free cic done. Next pointer is (null) [ 563.048061] debug: freeing up cic=f0469230 2) What next?
(In reply to comment #146)
> (In reply to comment #142)
> > It might make sense to apply my cfq debug patch V2 again and run again to
> > verify it is the same issue where we are freeing up a cic before rcu period is
> > over.
>
> 0) I have some doubts.
>
> 1) We've determined that a simple access of cic->key triggers a page fault (its
> address being 0x6b6b6b43). key is the first member of struct cfq_io_context, so
> 0x6b6b6b43 must be the address of cic too, mustn't it? That debug patch looks
> for a cic with address 6b6b6b6b, so we will not find this exact occurrence of
> cic.

No, that debug patch checks whether the pointer to the next &cic->cic_list is 6b6b6b6b. So if the current cic is freed before the rcu period is over, we will fetch the poisoned value and then convert it (via container_of()) to get the pointer to the cic, which in this case seems to be 0x6b6b6b43, and then the 0x6b6b6b43->key access will fault.
0) Just a quick note, to make sure this is recorded somewhere. Perhaps this might give us a clue about the problem we're seeing here. Perhaps not.

1) When I do a second test run (after a first run that leads to - multiple - page faults, switching back to deadline, removing cfq_iosched.ko, switching to cfq again, which reloads cfq_iosched.ko) I see almost no debug output. cfq_iosched.ko prints not even 1% of the output it prints otherwise.

2) Could that be just because mandb reads all (or most of) its files from some sort of memory cache and not from disk? (This machine has 2 GB of memory, which might be enough to cache all these files.)
Not sure what is happening. Just flush the caches to rule out the above possibility:

echo 3 > /proc/sys/vm/drop_caches
(In reply to comment #150)
> 2) Could that be just because mandb reads all (or most of) its files from some
> sort of memory cache and not from disk? (This machine has 2 GB of memory, that
> might be enough to cache all these files.)

An x86_64 machine where I was encountering those crashes currently has 1 GB of memory, and it was running for quite a while with 512 MB. AFAICT there was no correlation between how much memory was present and the frequency with which I was bumping into these.

It is really unclear to me why Paul apparently can hit, luckily enough, this bug quite reliably while for me it remains now quite elusive. It could be that the amount of memory in the current setup does affect some timings.
(In reply to comment #151)
> Just flush the cache to rule out above possibility.

Sure, once I drop the cache(s), disk activity increases (dramatically) and I can trigger this again (with the tried and trusted "mandb -c").
(In reply to comment #148)
>
> 0) So I decided to try debug patch V2 anyway, and hit the BUG() added at the
> end of call_for_each_cic() within a minute or three (I switched to cfq at about
> 400 secs).
>
> 1) Oops and a few preceding and following lines (entire output was almost 70k
> lines, while the added BUG() was hit 5 times in total):
>
> [ 562.529534] debug: Will exit io context cic=0xf0469230, Next pointer is
> (null)
> [ 562.540965] debug: cic=0xf0469230 exit io context done. Next pointer is
> (null)
> [ 562.552320] debug: Will decouple-free cic cic=0xf0469230, Next pointer is
> (null)
> [ 562.563883] debug: cic=0xf0469230 decouple-free cic done. Next pointer is
> (null)
> [ 562.578050] debug: freeing up cic=f0469230
> [ 562.598881] debug: Will exit io context cic=0xf0469230, Next pointer is
> (null)
> [ 562.598881] debug: cic=0xf0469230 exit io context done. Next pointer is
> (null)
> [ 562.598881] debug: Will decouple-free cic cic=0xf0469230, Next pointer is
> (null)
> [ 562.633403] debug: cic=0xf0469230 decouple-free cic done. Next pointer is
> (null)
> [ 562.644949] debug: freeing up cic=f0469230
> [ 562.652971] cic=f0469230 freed while in RCU read lock

I am wondering why the time difference between the following two messages is so big:

[ 562.633403] debug: cic=0xf0469230 decouple-free cic done. Next pointer is (null)
[ 562.652971] cic=f0469230 freed while in RCU read lock

Roughly 20 ms. Is it because of the way printk() handles the messaging?

After calling cic_free_func() we checked the next pointer and it was null. Time was 562.633403. After that there is one rcu_dereference() statement and then the check for next=poisoned-pointer. After the first check was done, around 11 ms later the associated cic was freed, and around 8 ms after that we checked for the next pointer being poisoned.

I am wondering why we are taking so much time. Is there some kind of context switch going on, or something else?
Or is it just a false lead, because one cannot rely on printk() time stamps?
PaulB,

IIUC, when we call call_for_each_cic() interrupts are still enabled. It might just be a fun experiment to disable interrupts and see if this problem still happens. Something like:

cfq_free_io_context() {
	unsigned long flags;

	local_irq_save(flags);
	call_for_each_cic();
	local_irq_restore(flags);
}

And keep the CFQ debug V2 applied and run mandb -c again and see if anything changes.
(In reply to comment #155) > IIUC, when we call call_for_each_cic() interrupts are still enabled. It might > just be a fun experiment to disable interrupts and see if this problem still > happens. 0) I'm curious what the idea behind that experiment is. Since I have a poor grasp of all this stuff, explaining it might not be easy. On the other hand, it's my loyal CFQ torturer that's suffering all extra wear and tear involved in trying to trigger this issue, so perhaps my curiosity is, well, appropriate. > Something like. > > cfq_free_io_context() { > unsigned long flags; > > local_irq_save(flags); > call_for_each_cic() > local_irq_restore(flags); > } 1) I've added exactly that. > And keep the CFQ debug V2 applied and run mandb -c again and see if anything > changes. 2) Well, in three consecutive runs (making sure caches were dropped before every run) I was unable to trigger this bug! That's quite remarkable. This suggests you at least made the chance of hitting this issue smaller. 3) I don't really want to ruin this little experiment with yet another run. Why ruin a perfect score of 3/3?
(In reply to comment #156)
> (In reply to comment #155)
> > IIUC, when we call call_for_each_cic() interrupts are still enabled. It might
> > just be a fun experiment to disable interrupts and see if this problem still
> > happens.
>
> 0) I'm curious what the idea behind that experiment is. Since I have a poor
> grasp of all this stuff, explaining it might not be easy. On the other hand,
> it's my loyal CFQ torturer that's suffering all extra wear and tear involved in
> trying to trigger this issue, so perhaps my curiosity is, well, appropriate.

Last time PaulM mentioned that in an rcu read section there should be no blocking instruction, and that no re-scheduling of the thread should happen. There was also talk of keeping track of context switches in the task (PaulM had written a patch) to see if any kind of context switch is happening. I was wondering whether, with interrupts enabled, all interrupts and softirqs can run and maybe have some kind of side effect. Hence it is just a wild guess to narrow down the problem a bit.

>
> 2) Well, in three consecutive runs (making sure caches were dropped before
> every run) I was unable to trigger this bug! That's quite remarkable. This
> suggests you at least made the chance of hitting this issue smaller.
>
> 3) I don't really want to ruin this little experiment with yet another run. Why
> ruin a perfect score of 3/3?

I would say let it run in a loop in a script. Once it has run for a couple of hours without any crash, we have a good data point that disabling interrupts helps solve the problem. This might give some hint regarding where to look and what to do next.
(In reply to comment #157) > I would say that let it run in a loop in a script. Once it has run for couple > of hours without any crash, then we have a good data point that disabling > interrupts helps solve the problem. > > This might give some hint regarding where to look and what next. 0) I'll consider that. 1) Mind if I disable the debug verbosity? It would wipe all other information from my logs. (Perhaps set it KERN_DEBUG, so it will only clobber dmesg.) 2) By the way, during the last few runs I had something similar to https://lkml.org/lkml/2011/8/3/310 ("[RFC] CFQ: simplify radix tree lookup in cfq_cic_lookup()") enabled, apparently without any ill effects.
Very interesting, great work narrowing this down! Are you running CONFIG_NO_HZ? If so, I would be interested in hearing what happens if you run the test without the interrupt-disabling, but with CONFIG_NO_HZ=n.
(In reply to comment #159)
> Are you running CONFIG_NO_HZ?

0) Yes. See:

$ grep -e RCU_FAST_NO_HZ -e CPU_IDLE_GOV_MENU -e NO_HZ /boot/config-3.0.0-1.local2.fc17.i686
CONFIG_RCU_FAST_NO_HZ=y
CONFIG_NO_HZ=y
CONFIG_CPU_IDLE_GOV_MENU=y

(CONFIG_RCU_FAST_NO_HZ and CONFIG_CPU_IDLE_GOV_MENU were the only two options that depend (directly and positively) on CONFIG_NO_HZ.)

> If so, I would be interested in hearing what
> happens if you run the test without the interrupt-disabling, but with
> CONFIG_NO_HZ=n.

1) I hope to do that later today (it's still morning over here). Please prod me if it takes me too long.
(In reply to comment #158) > (In reply to comment #157) > > I would say that let it run in a loop in a script. Once it has run for couple > > of hours without any crash, then we have a good data point that disabling > > interrupts helps solve the problem. > > > > This might give some hint regarding where to look and what next. > > 0) I'll consider that. > > 1) Mind if I disable the debug verbosity? It would wipe all other information > from my logs. (Perhaps set it KERN_DEBUG, so it will only clobber dmesg.) Not a problem. > > 2) By the way, during the last few runs I had something similar to > https://lkml.org/lkml/2011/8/3/310 ("[RFC] CFQ: simplify radix tree lookup in > cfq_cic_lookup()") enabled, apparently without any ill effects. Ok. That's fine.
(In reply to comment #159)
> If so, I would be interested in hearing what
> happens if you run the test without the interrupt-disabling, but with
> CONFIG_NO_HZ=n.

0) Recompiled 3.0 with CONFIG_NO_HZ=n. Ran a script that does 8 runs of (basically) "mandb -c" (preceded by dropping caches each time). This takes almost an hour over here.

1) This triggers no page fault in cic_free_func(). So, it seems we're onto something. Do 8 runs sound like enough to conclude that (something somehow influenced by) CONFIG_NO_HZ triggers this issue?

2) Should I also try test runs with CONFIG_NO_HZ=y and interrupts disabled (when calling cic_free_func())?
(In reply to comment #162) > 0) Recompiled 3.0 with CONFIG_NO_HZ. Say, couldn't I just have used the "nohz=off" kernel parameter? (Making sure I called cic_free_func() with interrupts enabled.)
(In reply to comment #163)
> Say, couldn't I just have used the "nohz=off" kernel parameter?

0) It turns out that bugzilla.redhat.com doesn't answer lazyweb questions, so I had to test this myself.

1) Yes, booting with "nohz=off" on a CONFIG_NO_HZ=y kernel seems to behave identically to booting a CONFIG_NO_HZ=n kernel. Let's just say I learned something today... Anyhow, once again no page faults in 8 test runs.

2) Perhaps Jens, Vivek and Paul McKenney could discuss further, more specific tests and/or patches before I try anything else (like running cic_free_func() with interrupts disabled for a few hours, as Vivek suggested). I'll be right here, once you've decided on something.

(In reply to comment #152)
> It is really unclear to me why Paul apparently can hit, luckily enough, this
> bug quite reliably while for me it remains now quite elusive. It could be that
> an amount of memory in the current setup does affect some timings.

Michal, perhaps you'd like to try what I do now: basically, run "mandb -c" in a loop, making sure all caches are dropped before every iteration. I run this silly test in an 8-iteration loop, but any other number of iterations could be just as interesting.
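For reference, the loop described above can be sketched as a small shell function. This is an illustrative sketch, not the actual script used; the workload command and iteration count are parameters, and the cache drop only takes effect when running as root:

```shell
# Run a workload repeatedly, dropping the page/dentry/inode caches
# before every iteration.
reproduce_loop() {
    workload=$1
    iterations=$2
    i=1
    while [ "$i" -le "$iterations" ]; do
        sync
        if [ -w /proc/sys/vm/drop_caches ]; then
            echo 3 > /proc/sys/vm/drop_caches   # needs root
        fi
        echo "iteration $i: $workload"
        $workload || { echo "workload failed on iteration $i" >&2; return 1; }
        i=$((i + 1))
    done
}

# Intended use (as root): reproduce_loop "mandb -c" 8
```

The function stops early (with a non-zero return) as soon as the workload fails, which is handy here since the page fault kills the mandb child and makes mandb return non-zero.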
(In reply to comment #164)
>
> Michal, perhaps you'd like to try what I do now: basically, run "mandb -c" in a
> loop, making sure all caches are dropped before every iteration.

After five times through such a loop while running kernel-3.1.0-0.rc0.git21.1.fc17.x86_64, I did hit that bug with "RIP [<ffffffff81242f6f>] cic_free_func+0x14/0x8c" and so on. On loop six the same fault happened, but this time I also got a message that "Fixing recursive fault but reboot is needed". What kind of new information does that possibly bring?
Paul Bolle: Could you please reproduce the failure (CONFIG_NO_HZ=y, no added irq disabling) and run the script from May 26? This will require building with CONFIG_RCU_TRACE=y.
(In reply to comment #165) > After five times through such loop while running > kernel-3.1.0-0.rc0.git21.1.fc17.x86_64 I did hit that bug with > "RIP [<ffffffff81242f6f>] cic_free_func+0x14/0x8c" and so on. I'm glad to hear there's another machine on which this can be triggered manually (albeit not as easily). > On a loop six the same fault happened but already I got a message that "Fixing > recursive fault but reboot is needed". What kind of a new information that > possibly brings? I'm not sure what you mean here. If you mean that further iterations of this (silly) test are unnecessary once a page fault has been triggered then you're right. As far as I know, the page fault always happens in a child process of mandb. In that case mandb should return non-zero, shouldn't it? So you could use that non-zero return to break from the loop and stop the test. If your point is that the kernel warns that a "reboot is needed" more than once, then that second warning could be considered superfluous. But it's not a big deal and it's perhaps more important to get that message out than to be less chatty.
Created attachment 517316 [details]
RCU collectdebugfs.sh script

I think PaulM is referring to the collectdebugfs.sh script he had posted in a previous mail thread on lkml. I have extracted it from that earlier thread and attached it to this bug, which makes it easier to download and run.
(In reply to comment #168) > [...] and run. Tomorrow, I suppose. I already recompiled with CONFIG_RCU_TRACE=y, and booted into that kernel. I just need to run the test. Feel free to prod me if I appear inactive for too long.
Vivek: Yep, that is the one!
(In reply to comment #167)
> (In reply to comment #165)
> > What kind of a new information that
> > possibly brings?
>
> I'm not sure what you mean here.

I meant that we already know that I can trigger that fault too, although with current kernels, for some reason, not that readily. After all, I filed the original report, although quite a while ago (and later that bug was "hiding" from me for a long time). Apologies that I am not overly proactive in trying to chase this down, but things to do are piling up here.

BTW - my test system is using one of the early samples of an x86_64 board and is likely a rather uncommon specimen of a single-core x86_64 machine. I wonder if this may have something to do with all of this.
Created attachment 517474 [details] output of collectdebugfs.sh 0) This is the output of over 10 minutes of running collectdebugfs.sh (I still had a copy somewhere on my hard disk). Only the first 3 or so minutes are probably of interest. 1) Everybody's favourite Oops occurred at 21:17:31 and 21:17:39.
Can't say that I see anything incriminating in the collectdebugfs.sh output. Besides which, you are running on a single CPU...

Nonetheless, I do have a new -rcu posted, if you are interested in trying it:

git://git.kernel.org/pub/scm/linux/kernel/git/paulmck/linux-2.6-rcu.git rcu/next
(In reply to comment #173)

0) Thanks for diving into this again.

> Nonetheless, I do have a new -rcu posted, if you are interested in trying it:
>
> [...]

So the choice seems to be between trying rcu/next and staring at the CFQ code (for hours on end) to see if something suspicious can be spotted ...
Decisions, decisions. ;-)
PaulM,

I am wondering what this means and what we should do next. I mean, using the cfq debug patch, I can see that a pointer I dereferenced inside an rcu read section started pointing to a freed object. Do you think that that's not the right way to decide whether an object got freed while the rcu read lock was held?

IOW, I am wondering what to do next. The debug patch seems to suggest that it is rcu related, yet the rcu debug stuff does not point to anything suspicious.
I suggest trying -rcu. I can't see how any of the fixes there would help you, but then again, I don't fully understand cfq.

Here are reasons why a pointer dereferenced within an RCU read-side critical section might end up pointing to a freed structure within that same RCU read-side critical section:

1. The structure is freed directly instead of waiting for an RCU grace period.

2. The structure is protected with one flavor of RCU and freed after waiting for another flavor of RCU's grace period.

3. There is a bug in RCU.

So trying out -rcu would be a way to check for some cases of #3 above. I am doing a line-by-line inspection of RCU to try to find any other such cases.
(In reply to comment #177)
> I suggest trying -rcu. I can't see how any of the fixes there would help you,
> but then again, I don't fully understand cfq.
>
> Here are reasons why a pointer dereferenced within an RCU read-side critical
> section might end up pointing to a freed structure within that same RCU
> read-side critical section:
>
> 1. The structure is freed directly instead of waiting for an RCU grace period.

cic is freed using a single function:

static void cfq_cic_free(struct cfq_io_context *cic)
{
	call_rcu(&cic->rcu_head, cfq_cic_free_rcu);
}

I can't spot anywhere else in the code where we free a cic through any other path.

> 2. The structure is protected with one flavor of RCU and freed after waiting
> for another flavor of RCU's grace period.

In this case we just do call_rcu() for freeing and wait using rcu_read_lock() and rcu_read_unlock(). I am hoping this is the right way of doing things.

> 3. There is a bug in RCU.
>
> So trying out -rcu would be a way to check for some cases of #3 above. I am
> doing a line-by-line inspection of RCU to try to find any other such cases.

Ok, PaulB, can you please give the -rcu tree a try? We can always stare at the CFQ code, but testing -rcu sounds like a relatively less time-consuming option.
Ah, and the following:

4. Failing to remove all reader paths to the object before starting to wait for the grace period.
(In reply to comment #179)
> Ah, and the following:
>
> 4. Failing to remove all reader paths to the object before starting to wait
> for the grace period.

What does that mean? We are in the reader path, and the reader itself calls call_rcu(), which should make somebody wait for the grace period asynchronously. So the wait for the grace period does start while some readers are still there. We effectively run the following code:

rcu_read_lock();
hlist_for_each_entry_rcu(cic, n, &ioc->cic_list, cic_list) {
	spin_lock_irqsave(&ioc->lock, flags);
	hlist_del_rcu(&cic->cic_list);
	spin_unlock_irqrestore(&ioc->lock, flags);
	call_rcu(&cic->rcu_head, cfq_cic_free_rcu);
}
rcu_read_unlock();

Is the above not the right usage?
As long as there is no other path by which readers can reach this structure, the above looks good. Also, as long as readers don't create paths to this structure. How did the tests with -rcu go?
(In reply to comment #181)
> How did the tests with -rcu go?

0) They did not even start.

1) Say, what's a simple way to retrieve from your rcu/next (ie,
http://git.kernel.org/?p=linux/kernel/git/paulmck/linux-2.6-rcu.git;a=shortlog;h=refs/heads/rcu/next
) the diff(s) between
http://git.kernel.org/?p=linux/kernel/git/paulmck/linux-2.6-rcu.git;a=commit;h=02f8c6aee8df3cdc935e9bdd4f2d020306035dbe
(ie, v3.0) and
http://git.kernel.org/?p=linux/kernel/git/paulmck/linux-2.6-rcu.git;a=commit;h=0697a6c28e520ef4ce3ce9b7532e4374d47be90c
(ie, rcu/next's current tip of the tree)?

2) Please note that I'm uncomfortable with tracking two upstream repositories.
To give an example: I track Linus' tree and the trivial tree each in their own
local repository. (And thus I have to jump through a few hoops when I want to
copy private work from one local repository to the other.) Fear of the
unknown, I guess...
One approach would be to clone the -rcu tree locally and do:

git checkout rcu/next
git diff v3.0 > /tmp/wherever

If you wanted to get the -rcu branch pulled into one of your existing trees,
one way to do this would be to again clone the -rcu tree and do:

git checkout rcu/next
cd ../linux-2.6			# or wherever you put it
git checkout v3.0
git pull ../linux-2.6-rcu rcu/next	# replacing ../linux-2.6-rcu with the path to -rcu
git branch rcu/next
Created attachment 517723 [details] Patch for v3.0..rcu/next as of August 9, 2011 Or you can download this patch.
(In reply to comment #184)
> Or you can download this patch.

0) It turns out adding another "remote" is rather easy, and having your patch
at hand made it easy to see I did it correctly. Thanks.

1) Same Oops while running rcu/next (even hit five times in one mandb run, but
that is not really uncommon).

2) Should I perhaps play with some /sys files or tweak CONFIG_* variables to
fully enjoy rcu/next? If not, is it now time to start staring at the code?
If you have not already done so, please try either CONFIG_TINY_RCU or CONFIG_TINY_PREEMPT_RCU. The results would be interesting with either mainline or -rcu. And I will certainly be continuing to look at code.
(In reply to comment #186)
> If you have not already done so, please try either CONFIG_TINY_RCU or
> CONFIG_TINY_PREEMPT_RCU.

Both depend on !SMP. TINY_PREEMPT_RCU depends on PREEMPT. Please note that
Fedora by default enables SMP and disables PREEMPT. So do the kernels I
currently run (which basically are vanilla rebuilds of Rawhide kernels).

I hope to shortly try TINY_RCU, as it should require the least .config
tweaking.
(In reply to comment #187)
> I hope to shortly try TINY_RCU [...].

0) I've done that now. (There's no /proc or /sys file that "proves" which RCU
variant one is running, is there? Nor is that printed to the logs, apparently.)

1) Running mandb for 8 iterations, dropping caches beforehand, didn't trigger
this bug. Is this small test - which took about an hour - enough to indicate
this bug cannot be triggered with (!SMP and) TINY_RCU?

2) What's next?
0) If you don't see "Preemptible hierarchical RCU implementation" in dmesg at
early boot, you are running one of the TINY_RCU variants.

1) Good question -- how many iterations does it take for TREE_RCU to fail?
Given that info, the answer is a simple matter of statistics. You can never be
100% sure, but you can compute a probability.

2) One thing that would help me in my slowly ongoing inspection of RCU is how
the failures map out over the possible configurations:

a: TREE_RCU && NO_HZ:
b: TREE_RCU && !NO_HZ:
c: TREE_PREEMPT_RCU && NO_HZ:
d: TREE_PREEMPT_RCU && !NO_HZ:
e: TINY_RCU && NO_HZ:
f: TINY_RCU && !NO_HZ:
g: TINY_PREEMPT_RCU && NO_HZ:
h: TINY_PREEMPT_RCU && !NO_HZ:

If I remember correctly, (a) fails, (b) does not, and one of (e) or (f) does
not. Is that correct?
(In reply to comment #189)
> 0) If you don't see "Preemptible hierarchical RCU implementation" in dmesg at
> early boot, you are running one of the TINY_RCU variants.

I see lots of lines saying

    Hierarchical RCU implementation.

(note the dot!) in the logs. Anyway, printing a different message in the
opposite case wouldn't hurt, would it?
TREE_RCU says "Hierarchical RCU implementation.", as you note, complete with the period at the end. TREE_PREEMPT_RCU says "Preemptible hierarchical RCU implementation." -- also with a period. TINY_RCU and TINY_PREEMPT_RCU say nothing in an attempt to save a few more bytes for deep-embedded uses of RCU. The TINY variants of RCU will remain tiny. So we are dealing with TREE_RCU. I recall (perhaps incorrectly) that NO_HZ failed and !NO_HZ didn't fail. Is that correct? If so, was RCU_FAST_NO_HZ also set?
(In reply to comment #189)
> 2) One thing that would help me in my slowly ongoing inspection of RCU is how
> the failures map out over the possible configurations:
>
> a: TREE_RCU && NO_HZ:
> b: TREE_RCU && !NO_HZ:
> c: TREE_PREEMPT_RCU && NO_HZ:
> d: TREE_PREEMPT_RCU && !NO_HZ:
> e: TINY_RCU && NO_HZ:
> f: TINY_RCU && !NO_HZ:
> g: TINY_PREEMPT_RCU && NO_HZ:
> h: TINY_PREEMPT_RCU && !NO_HZ:
>
> If I remember correctly, (a) fails, (b) does not, and one of (e) or (f) does
> not. Is that correct?

a: TREE_RCU && NO_HZ: fails
b: TREE_RCU && !NO_HZ: appears OK
c: TREE_PREEMPT_RCU && NO_HZ: n/a
d: TREE_PREEMPT_RCU && !NO_HZ: n/a
e: TINY_RCU && NO_HZ: appears OK
f: TINY_RCU && !NO_HZ: n/a
g: TINY_PREEMPT_RCU && NO_HZ: n/a
h: TINY_PREEMPT_RCU && !NO_HZ: n/a

So you were correct. (Filling in this table makes one realize that
CONFIG_NO_HZ forces one to think in double negatives, which is always fun,
isn't it?)
(In reply to comment #191)
> So we are dealing with TREE_RCU. I recall (perhaps incorrectly) that NO_HZ
> failed and !NO_HZ didn't fail. Is that correct? If so, was RCU_FAST_NO_HZ
> also set?

a: TREE_RCU && NO_HZ: fails =>
CONFIG_TREE_RCU=y
CONFIG_RCU_FAST_NO_HZ=y
CONFIG_NO_HZ=y

(These results are brought to you by the department of
handwaving-a-lot-to-look-reliable. If more tests are needed to pinpoint this,
we should think about logging the info we care about when running tests.)
Could you please try CONFIG_RCU_FAST_NO_HZ=n along with CONFIG_TREE_RCU=y and CONFIG_NO_HZ=y? CONFIG_RCU_FAST_NO_HZ, when set, prints "RCU dyntick-idle grace-period acceleration is enabled" at boot time.
(In reply to comment #194)
> Could you please try CONFIG_RCU_FAST_NO_HZ=n along with CONFIG_TREE_RCU=y and
> CONFIG_NO_HZ=y?

0) Note to self: tested with 3.0.0-1.local7.fc17.i686. Cleaned up changelog:

* Tue Aug 16 2011 Paul Bolle <pebolle> 3.0.0-1.local7
- epoll: Fix spurious lockdep warnings
- rcu/next: git diff v3.0..0697a6c2 (0697a6c2 "rcu: Prohibit grace periods
  during early boot")
- CONFIG_IOSCHED_CFQ=m
- CONFIG_RCU_TRACE=y
- CONFIG_RCU_FAST_NO_HZ=n

1) Refuses to Oops in 8 mandb test runs (using a script that drops caches
before every run).

2) Have we pinpointed this to CONFIG_RCU_FAST_NO_HZ?
(In reply to comment #195)
> 1) Refuses to Oops in 8 mandb test runs (using a script that drops caches
> before every run).

Strike that (for the time being), as it turns out that checking cfq requires
one to, well, have cfq enabled (CONFIG_IOSCHED_CFQ=m boots into deadline).
What a surprise!

On to an actual test run of cfq ...
(In reply to comment #196)
> On to an actual test run of cfq ...

0) Do scientists have the concept of a well behaved experiment (ie, an
experiment that generates the desired outcome no matter what data one throws
at it)?

1) A kernel with (in short) CONFIG_RCU_FAST_NO_HZ=n refuses to Oops in 8 mandb
test runs (using a script that drops caches before every run) even if cfq is
actually enabled.

2) Have we pinpointed this latest issue to CONFIG_RCU_FAST_NO_HZ?
We certainly have pointed the finger of suspicion at CONFIG_RCU_FAST_NO_HZ. It
is relatively new and thus should be presumed less trustworthy.

But back to your questions about scientists... Do you have data on how many
mandb attempts were required to cause failures in the CONFIG_RCU_FAST_NO_HZ=y
case? Ideally, this data would take the form of the number of times failure
occurred on the first mandb invocation, the second, and so on. Given that
data, we can compute a probability of the bug surviving eight attempts.

But just to humor an old man's paranoia, could you please also do a few more
mandb attempts with CONFIG_RCU_FAST_NO_HZ=n?
(In reply to comment #198)
> Do you have data on how many mandb attempts were required to cause failures
> in the CONFIG_RCU_FAST_NO_HZ=y case? Ideally, this data would take the form
> of the number of times failure occurred on the first mandb invocation, the
> second, and so on. Given that data, we can compute a probability of the bug
> surviving eight attempts.

anecdote, n. pl. data

My feeling is that, if we hit this issue, we need at most two runs. But nine
out of ten times the first run is enough. (Note that "run" is about as vague
as it gets. We're talking about a program that spawns X children, doing Y
reads and Z writes, Y probably being a rather large number.)

> But just to humor an old man's paranoia, could you please also do a few more
> mandb attempts with CONFIG_RCU_FAST_NO_HZ=n?

I hope to do that shortly. Results are due (West European) tomorrow.
(In reply to comment #199)
> My feeling is that, if we hit this issue, we need at most two runs. But nine
> out of ten times the first run is enough.

That may well be the case on that particular hardware/software combination. I
am repeating this experiment of running 'mandb -c' with caches dropped from
time to time on my machine, with unmodified rawhide kernels, and from what I
observe that bug _may_ hit here within two runs, but usually quite a few more
are needed. For some time I did not see this failure on the first run at all.
OTOH, with some patience, I eventually get it every time I try. Yes, we all
know that this is clearly somehow timing dependent. Just a note.

I was not posting new comments about these results as they really look
consistently the same.
OK, suppose that each run of mandb had a 50% chance of hitting the bug. Then
you would have a 50% chance of hitting it on the first run, a 75% chance of
hitting it within the first two runs, an 88% chance of hitting it within the
first three runs, and so on. The probability might vary from system to
system; for example, if you have a 75% chance of hitting the bug on a given
run, then there is a 94% chance of hitting it within the first two runs and a
98% chance of hitting it within the first three runs.

This is why I was hoping for a list -- I could then estimate the per-run
probability, and then further estimate the number of runs required to give
(say) a 99.999% chance that the bug was in fact not happening.
(In reply to comment #201)
> This is why I was hoping for a list [...].

Well, doing a methodical and thorough job won't do you any good when working
for the department of handwaving-a-lot-to-look-reliable! So no, I didn't log
this stuff (one reason is that I could basically always trigger this with my
outdated ThinkPad T41 and I didn't particularly care whether that was in the
first or second run).

(In reply to comment #198)
> But just to humor an old man's paranoia, could you please also do a few more
> mandb attempts with CONFIG_RCU_FAST_NO_HZ=n?

Fresh boot, 16 runs (using a script that will check if cfq is enabled), which
took almost 3 hours, no Oopses.
Thank you for the additional testing! I am reviewing the CONFIG_RCU_FAST_NO_HZ yet again. If I still don't find anything, I will see about putting together some sort of diagnostic patch.
Just as a reference point: I was just hit by this bug on kernel
3.1.0-0.rc1.git6.1.fc17.x86_64 when cron decided to run mandb. So this was
surely on the first loop. :-)
OK, good, there seems to be a high probability of failure, which should make testing of fixes easier. Does this bug trigger with CONFIG_RCU_FAST_NO_HZ=y and CONFIG_PROVE_RCU=y? (Yeah, I know, way too much for me to hope for, but it really would be convenient for diagnostic patches...)
(In reply to comment #205)
> Does this bug trigger with CONFIG_RCU_FAST_NO_HZ=y and CONFIG_PROVE_RCU=y?

0) Yes, it does.

1) I'm currently running pre-compiled 3.1.0-0.rc2.git0.1.fc17.i686 (the latest
kernel pushed in Rawhide). Our tireless tester "mandb" once again triggered
this issue (when kicked into action by cron). And

    grep -w -e CONFIG_RCU_FAST_NO_HZ -e CONFIG_PROVE_RCU /boot/config-3.1.0-0.rc2.git0.1.fc17.i686

prints:

    CONFIG_RCU_FAST_NO_HZ=y
    CONFIG_PROVE_RCU=y
(In reply to comment #206)
> (In reply to comment #205)
> >
> > Does this bug trigger with CONFIG_RCU_FAST_NO_HZ=y and CONFIG_PROVE_RCU=y?
>
> 0) Yes, it does

Likely stating something "nearly obvious", but just in case: the same is true
for x86_64; both are "y" in the configurations of rawhide kernels.
Created attachment 519110 [details] Diagnostic patch for RCU_FAST_NO_HZ assumptions Could you please test the attached patch? It should give useful information whether or not CONFIG_RCU_FAST_NO_HZ is set.
(In reply to comment #208)
> Could you please test the attached patch?

0) Note to self: tested with 3.1.0-rc2.local0.fc17.i686. Cleaned up changelog:

* Sat Aug 20 2011 Paul Bolle <pebolle> 3.1.0-rc2.local
- epoll: Fix spurious lockdep warnings
- rcu: check for entering dyntick-idle mode while in read-side critical section
- CONFIG_IOSCHED_CFQ=m
- CONFIG_RCU_TRACE=y

1) This already happened before I even had a chance to poke the system myself
(ie, by a fsck launched by dracut):

------------[ cut here ]------------
WARNING: at kernel/rcutree.c:1746 rcu_needs_cpu_quick_check+0x31/0xbd()
Hardware name:
Modules linked in: yenta_socket video radeon ttm drm_kms_helper drm i2c_algo_bit i2c_core
Pid: 340, comm: fsck.ext4 Not tainted 3.1.0-rc2.local0.fc17.i686 #1
Call Trace:
 [<c083e25d>] ? printk+0x2d/0x2f
 [<c0440088>] warn_slowpath_common+0x7c/0x91
 [<c049a5db>] ? rcu_needs_cpu_quick_check+0x31/0xbd
 [<c049a5db>] ? rcu_needs_cpu_quick_check+0x31/0xbd
 [<c04400bf>] warn_slowpath_null+0x22/0x24
 [<c049a5db>] rcu_needs_cpu_quick_check+0x31/0xbd
 [<c049bba9>] rcu_needs_cpu+0xd7/0x16c
 [<c049bca0>] rcu_process_callbacks+0x62/0x86
 [<c0445a9e>] __do_softirq+0xdd/0x203
 [<c04459c1>] ? __tasklet_hi_schedule_first+0x30/0x30
 <IRQ>
 [<c0445dc6>] ? irq_exit+0x4f/0xac
 [<c040402a>] ? do_IRQ+0x7e/0x92
 [<c046a94c>] ? trace_hardirqs_off_caller+0x2e/0x86
 [<c084ddb5>] ? common_interrupt+0x35/0x3c
 [<c05d007b>] ? ftrace_raw_event_block_bio+0x64/0xc7
 [<c04600e0>] ? exit_creds+0x40/0xab
 [<c05dc325>] ? arch_local_irq_restore+0x5/0xb
 [<c05dcdcb>] ? blkiocg_update_dispatch_stats+0xbe/0xc6
 [<c05deb37>] ? blk_throtl_bio+0xca/0x37a
 [<c04efbff>] ? set_track+0x4d/0xab
 [<c0840f36>] ? __slab_alloc.constprop.27+0x2ef/0x304
 [<c046de9e>] ? trace_hardirqs_on_caller+0x10e/0x13f
 [<c04c2c5b>] ? mempool_alloc_slab+0x13/0x15
 [<c05d2e09>] ? generic_make_request+0x2e3/0x40f
 [<c05eaae3>] ? random32+0x1e/0x2b
 [<c05f50c8>] ? should_fail+0x79/0xdf
 [<c04f1889>] ? kmem_cache_alloc+0x47/0x11a
 [<c04c2c5b>] ? mempool_alloc_slab+0x13/0x15
 [<c05d3017>] ? submit_bio+0xe2/0xfb
 [<c046a9af>] ? trace_hardirqs_off+0xb/0xd
 [<c046aeba>] ? lock_release_holdtime.part.9+0x4b/0x51
 [<c0524d9e>] ? bio_alloc_bioset+0x3c/0x9c
 [<c046d9c5>] ? lock_release+0x188/0x1ae
 [<c0520cad>] ? submit_bh+0xcd/0xe7
 [<c0523d0c>] ? block_read_full_page+0x206/0x216
 [<c052626e>] ? blkdev_get_blocks+0x100/0x100
 [<c04c0fe1>] ? add_to_page_cache_locked+0xa7/0xe0
 [<c04c9a3f>] ? __lru_cache_add+0x34/0x5b
 [<c0525ee2>] ? blkdev_readpage+0x14/0x16
 [<c04c8c19>] ? __do_page_cache_readahead+0x149/0x17f
 [<c04c8fb4>] ? ondemand_readahead+0x163/0x1c4
 [<c04c90d0>] ? page_cache_sync_readahead+0x47/0x4f
 [<c04c22c8>] ? generic_file_aio_read+0x277/0x57c
 [<c04fed10>] ? do_sync_read+0x96/0xcf
 [<c04ff05f>] ? rw_verify_area+0xd0/0xf3
 [<c046ae1a>] ? lock_is_held+0x73/0x7b
 [<c04ff3cc>] ? vfs_read+0x8d/0xd5
 [<c04fec7a>] ? do_sync_write+0xcf/0xcf
 [<c04ff456>] ? sys_read+0x42/0x63
 [<c084d7df>] ? sysenter_do_call+0x12/0x38
---[ end trace eec8da039e1a69f5 ]---

1) $ sed -n "1746 p" kernel/rcutree.c
PROVE_RCU(WARN_ON_ONCE(lock_is_held(&rcu_lock_map)));

2) Is there a part of this call trace you want me to peek at in depth (eg,
with gdb)?
Gah. I guess the moral of this story is to stop using CONFIG_RCU_FAST_NO_HZ until I get this fixed. I might well need some help from the dyntick-idle guys on this one. My guess is that the CPU tried to enter dyntick-idle mode, and RCU took it seriously, but the system reversed its decision and RCU was blissfully unaware of this fact, with the result you see. If you would like to confirm my guess with gdb, that would be way cool.
Created attachment 519158 [details] Patch to remove rcu_needs_cpu_flush() to avoid false quiescent states. This prototype patch should fix the CONFIG_RCU_FAST_NO_HZ problem.
(In reply to comment #211)
> This prototype patch should fix the CONFIG_RCU_FAST_NO_HZ problem.

0) Note to self: tested with 3.1.0-rc2.local1. Cleaned up changelog:

* Sun Aug 21 2011 Paul Bolle <pebolle> 3.1.0-rc2.local1
- rcu: Remove rcu_needs_cpu_flush() to avoid false quiescent states (manually
  removed trivial conflict)

1) Patch could be applied on top of v3.1-rc2 with one (trivial) conflict.

2) Fresh boot, 8 runs, which took over 1 hour, no Oopses. Problem seems fixed
by that prototype.

(In reply to comment #210)
> If you would like to confirm my guess with gdb, that would be way cool.

3) Do you still care? If so, what part of that impressive Call Trace would you
like me to examine specifically?
Just to be sure, this test was with CONFIG_RCU_FAST_NO_HZ=y, right? (Sorry to be so paranoid, but then again I am still alive...) The test results speak for themselves, and I need to add the relevant tracing anyway, so no need for gdb. (I wasn't going to turn down an offer, of course!)
(In reply to comment #213)
> Just to be sure, this test was with CONFIG_RCU_FAST_NO_HZ=y, right?

$ grep -w CONFIG_RCU_FAST_NO_HZ /boot/config-3.1.0-rc2.local1.fc17.i686
CONFIG_RCU_FAST_NO_HZ=y
Thank you very much! I have queued this patch.
phew....... Looks like this bug has finally been nailed down. Hopefully my
euphoria is not short-lived :-) Anyway, thanks a ton PaulB and PaulM for
figuring this one out.
Indeed, thanks guys. I'm impressed you managed to take this bug past 200
comments and still remain productive. I'll get this merged in all our releases
for our next updates.
(In reply to comment #217)
> [...] you guys managed to get this bug over 200 comments [...].

0) Note that it turned out to be two bugs (both reported first by Michal):
- the first 130 comments (and 15 months) concerned a CFQ bug that was at first
  thought to be an RCU bug;
- the next 87 comments (and just one month) concerned an RCU bug that I at
  first thought to be a CFQ bug.

1) Say, what's Michal's grace period for reporting his third issue in the
intersection of CFQ and RCU before this bug is closed?
*** Bug 726679 has been marked as a duplicate of this bug. ***
(In reply to comment #216)
> phew....... Looks like finally this bug has been nailed down.

So what is the status of that fix? I do not see anything mentioned in
changelogs, even for rawhide kernels, and if something showed up "upstream" I
missed it.

In the meantime I was hit by this again, and quite "spontaneously" (that is,
without using any test scripts to provoke it), while running
kernel-3.1.0-0.rc4.git0.0.fc17.x86_64. A familiar "Fixing recursive fault but
reboot is needed!" is not kidding. On various occasions a machine that was not
rebooted locked up entirely a while after that message showed up, even if
immediately after the hit it appeared to be running "normally".
kernel-2.6.40.4-5.fc15 has been submitted as an update for Fedora 15. https://admin.fedoraproject.org/updates/kernel-2.6.40.4-5.fc15
(In reply to comment #221)
> kernel-2.6.40.4-5.fc15 has been submitted as an update for Fedora 15.
> https://admin.fedoraproject.org/updates/kernel-2.6.40.4-5.fc15

I'll be building an f14 kernel with the fix and submitting it for
updates-testing today.
kernel-2.6.35.14-96.fc14 has been submitted as an update for Fedora 14. https://admin.fedoraproject.org/updates/kernel-2.6.35.14-96.fc14
Package kernel-2.6.35.14-96.fc14:
* should fix your issue,
* was pushed to the Fedora 14 testing repository,
* should be available at your local mirror within two days.

Update it with:
# su -c 'yum update --enablerepo=updates-testing kernel-2.6.35.14-96.fc14'
as soon as you are able to, then reboot.

Please go to the following url:
https://admin.fedoraproject.org/updates/kernel-2.6.35.14-96.fc14
then log in and leave karma (feedback).
kernel-2.6.35.14-96.fc14 has been pushed to the Fedora 14 stable repository. If problems still persist, please make note of it in this bug report.
kernel-2.6.40.4-5.fc15 has been pushed to the Fedora 15 stable repository. If problems still persist, please make note of it in this bug report.
AFAICS closing this bug as "CLOSED ERRATA" is somewhat premature. Running
'mandb -c' in a loop on a system with a kernel-3.1.0-0.rc5.git0.0.fc17 with a
build date of "Wed 07 Sep 2011 02:58:01 PM MDT" simply locked up that machine
with no traces in the logs. This happened not very far into the second loop
iteration, after the first one finished without any incidents.
(In reply to comment #227)
> AFAICS closing that bug as "CLOSED ERRATA" is somewhat premature. Running in
> a loop 'mandb -c' on a system with a kernel-3.1.0-0.rc5.git0.0.fc17 with a
> build date "Wed 07 Sep 2011 02:58:01 PM MDT" simply locked up that machine
> with no traces in logs. This happened not very far into the second loop
> branch after the first one finished without any incidents.

0) Are you sure some patch for this issue is actually part of that kernel?

1) kernel-3.1.0-0.rc5.git0.0.fc17 seems to correspond to commit
1bc752781f87f13a8747eb4caee00066499e4aa1 to Fedora's kernel.git. This is what
I get when I try to find a related patch in that tree:

[kernel * master]$ git checkout 1bc752781f87f13a8747eb4caee00066499e4aa1
[...]
[kernel * (1bc7527...)]$ grep -i -e rcu -e nohz -e cfq -e 577968 -e dyntick kernel.spec
Patch12025: rcu-avoid-just-onlined-cpu-resched.patch
ApplyPatch rcu-avoid-just-onlined-cpu-resched.patch

2) It looks like Rawhide's kernel is one of the few Fedora kernels running
without a patch for this issue.
(In reply to comment #228)
> 2) It looks like Rawhide's kernel is one of the few Fedora kernels running
> without a patch for this issue.

That's why I think that closing this is a bit hasty, especially since one can
potentially get a hard lock with it. OTOH, if you are really sure that this
will not end up in some release ...
For an added attraction, I just found traces from this bug for the
3.1.0-0.rc5.git0.0.fc17.x86_64 kernel, but on that occasion the trigger was
not 'mandb -c' nor any other of the cron jobs. According to the log none were
running when that hit. The trigger process is recorded as:

Process ld-linux-x86-64 (pid: 8657, threadinfo ffff88005a91c000, task ffff880076e40000)

so this could be really anything.
(In reply to comment #229)
> (In reply to comment #228)
> > 2) It looks like Rawhide's kernel is one of the few Fedora kernels running
> > without a patch for this issue.
>
> That's why I think that closing this is a bit hasty, especially since one
> can potentially get a hard lock with it. OTOH, if you are really sure that
> this will not end up in some release ...

The bug is reported against f14. It's been fixed in f14-f16. I can throw the
patch into rawhide today. I would have hoped it makes its way upstream before
3.1 final, but who knows.
The patch has been added to rawhide. It will be included in the next build.
0) I just ran into this issue while running (self-compiled and vanilla) v3.0.6
on an x86_64 machine. This seems to be the first time I triggered this on a
machine other than the i686 that so happily reproduced this error. Anyhow,
self-compiled and vanilla means that this is not a Fedora issue.

1) But I do wonder what the status is of the patch that Fedora is now
carrying. Has there been any effort to get this into mainline? As far as I
know there's no fix for this issue included in v3.1-rc9. Or is there? And are
there any plans of backporting a fix into the stable tree?
Indeed, I would not expect it to be specific to Fedora. The commit is 06ae115a1 in -tip, queued for the 3.2 merge window. I did post it separately to LKML some time back (https://lkml.org/lkml/2011/8/19/355), and got absolutely no response whatsoever. But the 3.2 merge window should be opening up soon. It turns out to be against the rules to CC a patch to stable before that patch is in mainline. But once that happens, it seems to me to be a good candidate for stable.
(In reply to comment #234)
> The commit is 06ae115a1 in -tip, queued for the 3.2 merge window. I did post
> it separately to LKML some time back (https://lkml.org/lkml/2011/8/19/355),
> and got absolutely no response whatsoever.

0) I didn't check the commit in whatever tree -tip might be, but the link
should be to https://lkml.org/lkml/2011/9/7/71 ("rcu: Remove
rcu_needs_cpu_flush() to avoid false quiescent states"), which also got no
response whatsoever.

1) If you resubmit this for the 3.2 merge window you might add a reference to
this bug report to the commit message. Not really important.
*** Bug 755219 has been marked as a duplicate of this bug. ***