Created attachment 397076 [details]
/var/log/messages

Description of problem:
Under heavy load (doing backups) there was a page allocation failure from iwl_rx_allocate(). Once this occurred the network became unusable and several other problems ensued (NFS timeouts, etc.).

Version-Release number of selected component (if applicable):
2.6.32.9-64.fc12.x86_64 (taken from koji build)

How reproducible:

Steps to Reproduce:
1.
2.
3.

Actual results:
I will attach /var/log/messages

Expected results:

Additional info:
Looks similar to this bug: https://bugzilla.redhat.com/show_bug.cgi?id=551937
Hi gene c,
The first part is the same as in Bug 551937, but there the system became stable after printing the message. From your logs I see some processes hang; that is not good. The question is: are you able to reproduce that bug?
The hangs are because the network was not usable (NFS mounts hung, etc.). I believe I can reproduce this by putting iwlagn under heavy load (I think it needs N, not just G, to generate enough load, but I'm not sure). I'd also be happy to try 2.6.33, where I understand some allocation bugs for iwlagn are fixed. I can't find a koji build of 2.6.33 for F12, however. I'm willing to try an f13/alpha/beta/rawhide kernel if there are no core dependencies dragged in which may be problematic for stock f12 (such as dracut); I need this f12 system to keep running. Do you know if there will be f12 builds of 2.6.33?
To update to 2.6.33, just do the following; this will update grubby and should not break use of the older kernel.

yum --enablerepo="rawhide" update kernel

The patch solving this issue in 2.6.33 is problematic; I would like to have a better solution for 2.6.32. I will propose it upstream and then create a 2.6.32 kernel for testing. I'm not sure my proposal will be accepted, however. Please make sure you are able to reproduce this, since so far nobody really has been able to (including me; I got these once, but not any more), so we could not tell whether the patch fixes the problem or not. Thanks.
Ok, I will do my best to confirm it is both reproducible in 2.6.32 and fixed in 2.6.33. This will take some time, as this machine is being used for business and previously it took a long while of sustained beating on the network (via backups) to trigger it.
Did as above, and got 2.6.34, which sadly crashes on boot.

Mar 3 08:35:13 lap1 kernel: =============================================
Mar 3 08:35:13 lap1 kernel: [ INFO: possible recursive locking detected ]
Mar 3 08:35:13 lap1 kernel: 2.6.34-0.4.rc0.git2.fc14.x86_64 #1
Mar 3 08:35:13 lap1 kernel: ---------------------------------------------
Mar 3 08:35:13 lap1 kernel: K99cpuspeed/2809 is trying to acquire lock:
Mar 3 08:35:13 lap1 kernel: (s_active){++++.+}, at: [<ffffffff81176f02>] sysfs_addrm_finish+0x36/0x55
Mar 3 08:35:13 lap1 kernel:
Mar 3 08:35:13 lap1 kernel: but task is already holding lock:
Mar 3 08:35:13 lap1 kernel: (s_active){++++.+}, at: [<ffffffff811770bd>] sysfs_get_active_two+0x24/0x48
Mar 3 08:35:13 lap1 kernel:
Mar 3 08:35:13 lap1 kernel: other info that might help us debug this:
Mar 3 08:35:13 lap1 kernel: 4 locks held by K99cpuspeed/2809:
Mar 3 08:35:13 lap1 kernel: #0: (&buffer->mutex){+.+.+.}, at: [<ffffffff81175b9f>] sysfs_write_file+0x3c/0x144
Mar 3 08:35:13 lap1 kernel: #1: (s_active){++++.+}, at: [<ffffffff811770bd>] sysfs_get_active_two+0x24/0x48
Mar 3 08:35:13 lap1 kernel: #2: (s_active){++++.+}, at: [<ffffffff811770ca>] sysfs_get_active_two+0x31/0x48
Mar 3 08:35:13 lap1 kernel: #3: (dbs_mutex){+.+.+.}, at: [<ffffffffa0408da2>] cpufreq_governor_dbs+0x2a0/0x352 [cpufreq_ondemand]
Mar 3 08:35:13 lap1 kernel:
Mar 3 08:35:13 lap1 kernel: stack backtrace:
Mar 3 08:35:13 lap1 kernel: Pid: 2809, comm: K99cpuspeed Not tainted 2.6.34-0.4.rc0.git2.fc14.x86_64 #1
Mar 3 08:35:13 lap1 kernel: Call Trace:
Mar 3 08:35:13 lap1 kernel: [<ffffffff8107e94f>] __lock_acquire+0xcb5/0xd2c
Mar 3 08:35:13 lap1 kernel: [<ffffffff8107cf48>] ? mark_held_locks+0x52/0x70
Mar 3 08:35:13 lap1 kernel: [<ffffffff8107d329>] ? debug_check_no_locks_freed+0x12e/0x145
Mar 3 08:35:13 lap1 kernel: [<ffffffff8107d1c8>] ? trace_hardirqs_on_caller+0x111/0x135
Mar 3 08:35:13 lap1 kernel: [<ffffffff8107eaa2>] lock_acquire+0xdc/0x102
Mar 3 08:35:13 lap1 kernel: [<ffffffff81176f02>] ? sysfs_addrm_finish+0x36/0x55
Mar 3 08:35:13 lap1 kernel: [<ffffffff8107c300>] ? lockdep_init_map+0x9e/0x113
Mar 3 08:35:13 lap1 kernel: [<ffffffff8117690f>] sysfs_deactivate+0x9a/0x103
Mar 3 08:35:13 lap1 kernel: [<ffffffff81176f02>] ? sysfs_addrm_finish+0x36/0x55
Mar 3 08:35:13 lap1 kernel: [<ffffffff8107120a>] ? sched_clock_cpu+0xc3/0xce
Mar 3 08:35:13 lap1 kernel: [<ffffffff8147889c>] ? __mutex_unlock_slowpath+0x120/0x132
Mar 3 08:35:13 lap1 kernel: [<ffffffff81176f02>] sysfs_addrm_finish+0x36/0x55
Mar 3 08:35:13 lap1 kernel: [<ffffffff81175254>] sysfs_hash_and_remove+0x53/0x6a
Mar 3 08:35:13 lap1 kernel: [<ffffffff811782f9>] sysfs_remove_group+0x91/0xca
Mar 3 08:35:13 lap1 kernel: [<ffffffffa0408db6>] cpufreq_governor_dbs+0x2b4/0x352 [cpufreq_ondemand]
Mar 3 08:35:13 lap1 kernel: [<ffffffff8107d1f9>] ? trace_hardirqs_on+0xd/0xf
Mar 3 08:35:13 lap1 kernel: [<ffffffff813a2623>] __cpufreq_governor+0x9b/0xde
Mar 3 08:35:13 lap1 kernel: [<ffffffff813a346f>] __cpufreq_set_policy+0x1ce/0x275
Mar 3 08:35:13 lap1 kernel: [<ffffffff813a3993>] store_scaling_governor+0x1a7/0x1fb
Mar 3 08:35:13 lap1 kernel: [<ffffffff813a3c9c>] ? handle_update+0x0/0x39
Mar 3 08:35:13 lap1 kernel: [<ffffffff814791b7>] ? down_write+0x7a/0x81
Mar 3 08:35:13 lap1 kernel: [<ffffffff813a327c>] store+0x61/0x86
Mar 3 08:35:13 lap1 kernel: [<ffffffff81175c6b>] sysfs_write_file+0x108/0x144
Mar 3 08:35:13 lap1 kernel: [<ffffffff8111ed9d>] vfs_write+0xae/0x10b
Mar 3 08:35:13 lap1 kernel: [<ffffffff8107d1c8>] ? trace_hardirqs_on_caller+0x111/0x135
Mar 3 08:35:13 lap1 kernel: [<ffffffff8111eeba>] sys_write+0x4a/0x6e
Mar 3 08:35:13 lap1 kernel: [<ffffffff81009c72>] system_call_fastpath+0x16/0x1b
Mar 3 08:35:13 lap1 kernel: Kernel logging (proc) stopped.
I'll see if I can boot 2.6.33 from f13 (or fc13, or whatever the label is); if I can, then I'll at least have two kernels to compare. I prefer to install via rpm, as yum from rawhide did not keep all the kernels I asked it to (via /etc/yum.conf); it stuck to the default of 3, which was kind of annoying. Assuming 2.6.33 runs ok, I'll work on doing some tests in the coming days.
Sorry, yesterday a rawhide "yum update kernel" installed 2.6.33 :-/ I'm not recommending using 2.6.34, since it is at the very beginning of its development cycle. I'm working on a 2.6.32 memory allocation fix, but right now I have some other work at higher priority, so this will take some time.
I tried 2.6.33 from fc13; sadly it fails the same way. As soon as the modeset occurs for graphics, it dies. I looked at updating nouveau, but it needs a new X server as well. Is there a way I can run 2.6.33 on f12 with the nouveau driver? The trace is as above:

Mar 4 08:31:54 lap1 kernel: =============================================
Mar 4 08:31:54 lap1 kernel: [ INFO: possible recursive locking detected ]
Mar 4 08:31:54 lap1 kernel: 2.6.33-4.fc13.x86_64 #1
Mar 4 08:31:54 lap1 kernel: ---------------------------------------------
... etc.

I need help knowing exactly what to install, or what kernel args to use (nomodeset or whatever), to boot 2.6.33 on my hardware.

thanks
gene
To be even more clear: I need a way to run it that allows me to revert back to earlier kernels (and presumably the nouveau driver). I saw a long thread on LKML on nouveau; not sure where we stand on this. I believe I can trigger this with my regular backup. Oddly, I had plugged in a wire to do the backup, but iwl still blew up, so I had to rmmod iwlagn to do a wired backup, which I thought was strange. Please let me know if you know a clean way to boot up a later kernel with nouveau that allows me to go back to 2.6.32, etc.
(In reply to comment #10)
> Please let me know if you know a clean way to boot up a later kernel with
> nouveau that allows me to go back to 2.6.32 etc ..

Could you elaborate? I don't understand the problem here.
With nouveau, the 2.6.33 kernel won't boot on Fedora 12. In order to do so at present, we need to update the nouveau driver, which in turn requires libdrm as well as a new Xorg server. The binary API for nouveau changed in a non-backward-compatible way. If I do this, then I can no longer go back and run 2.6.32. I believe Linus had a similar complaint on LKML.
No. You can run vanilla 2.6.33 on F12 with nouveau (I'm doing that :), but you can't run vanilla 2.6.32. Fedora has extra patches for the 2.6.32 kernel to run with the F12 Xorg, and that is what Linus's complaints were about. You have a different problem with 2.6.33; it is this "INFO: possible recursive locking detected". This week I'll look at this bug more closely and provide you a working 2.6.33 or 2.6.32 kernel to test. Stay tuned :)
Ok, here is the kernel:
http://koji.fedoraproject.org/koji/taskinfo?taskID=2045788
It would be nice if you could confirm that the bug is reproducible on the Fedora released kernel and not reproducible on the kernel from the link above.
I have not had time yet to test this; I will get to it as soon as I can. I have also found an NFS crash (triggered by load over NFS) in the same kernel:

Mar 13 22:13:09 lap1 kernel: [<ffffffff8107bb23>] ? timekeeping_get_ns+0x1b/0x3d
Mar 13 22:13:09 lap1 kernel: [<ffffffffa04b2517>] ? nfs_wait_bit_uninterruptible+0x0/0x12 [nfs]

I need to file a separate bug unless this is a known issue. By the way, now that 2.6.33.1 is out in stable, do you know if this kernel will be made available for f12?

gene/
(In reply to comment #15)
> I have not had time yet to test this ... I will get to it soon as I can. I have
> also found an NFS crash (triggered by load over nfs) in same kernel
[snip]
> I need to file separate bug unless this is known issue

Please do; if this is a known issue it will be closed as a duplicate.

> By the way now that 2.6.33.1 is out in stable, will this kernel be made avail
> to f12 do you know ?

No, as far as I know. And F13 will use 2.6.34.
My system seems to be having the same problem that is discussed in this bug. Below is a representative stack trace:

Mar 14 14:54:55 localhost kernel: java: page allocation failure. order:2, mode:0x4020
Mar 14 14:54:57 localhost kernel: Pid: 4143, comm: java Tainted: G W 2.6.32.9-70.fc12.x86_64 #1
Mar 14 14:54:57 localhost kernel: Call Trace:
Mar 14 14:54:57 localhost kernel: <IRQ> [<ffffffff810dc7fc>] __alloc_pages_nodemask+0x5ad/0x631
Mar 14 14:54:57 localhost kernel: [<ffffffff8110c288>] alloc_pages_node+0x48/0x4a
Mar 14 14:54:57 localhost kernel: [<ffffffff8110c2b4>] kmalloc_large_node+0x2a/0x67
Mar 14 14:54:57 localhost kernel: [<ffffffff8110d5c7>] __kmalloc_node_track_caller+0x31/0x125
Mar 14 14:54:57 localhost kernel: [<ffffffffa01f607c>] ? iwl_rx_allocate+0x90/0x2f4 [iwlcore]
Mar 14 14:54:57 localhost kernel: [<ffffffff813abf16>] __alloc_skb+0x80/0x170
Mar 14 14:54:57 localhost kernel: [<ffffffffa01f607c>] iwl_rx_allocate+0x90/0x2f4 [iwlcore]
Mar 14 14:54:57 localhost kernel: [<ffffffff81234652>] ? is_swiotlb_buffer+0x2e/0x3b
Mar 14 14:54:57 localhost kernel: [<ffffffffa01f76c8>] iwl_rx_replenish_now+0x1b/0x27 [iwlcore]
Mar 14 14:54:57 localhost kernel: [<ffffffffa0221565>] iwl_rx_handle+0x3aa/0x3c3 [iwlagn]
Mar 14 14:54:57 localhost kernel: [<ffffffff81017d45>] ? sched_clock+0x9/0xd
Mar 14 14:54:57 localhost kernel: [<ffffffff81079325>] ? sched_clock_local+0x1c/0x82
Mar 14 14:54:57 localhost kernel: [<ffffffffa0224433>] iwl_irq_tasklet_legacy+0x50c/0x76e [iwlagn]
Mar 14 14:54:57 localhost kernel: [<ffffffffa01f0e17>] ? __iwl_read32.clone.2+0xaa/0xb9 [iwlcore]
Mar 14 14:54:57 localhost kernel: [<ffffffffa01eeef8>] ? tasklet_schedule+0x1a/0x1c [iwlcore]
Mar 14 14:54:57 localhost kernel: [<ffffffff8105bf7b>] tasklet_action+0x85/0xe4
Mar 14 14:54:57 localhost kernel: [<ffffffff8105d964>] __do_softirq+0xe5/0x1a9
Mar 14 14:54:57 localhost kernel: [<ffffffff810acd61>] ? handle_IRQ_event+0x60/0x121
Mar 14 14:54:57 localhost kernel: [<ffffffff81012e6c>] call_softirq+0x1c/0x30
Mar 14 14:54:57 localhost kernel: [<ffffffff810143ea>] do_softirq+0x46/0x86
Mar 14 14:54:57 localhost kernel: [<ffffffff8105d7a2>] irq_exit+0x3b/0x7d
Mar 14 14:54:57 localhost kernel: [<ffffffff81459c9d>] do_IRQ+0xa5/0xbc
Mar 14 14:54:57 localhost kernel: [<ffffffff81012693>] ret_from_intr+0x0/0x11
Mar 14 14:54:57 localhost kernel: <EOI> [<ffffffff8145442f>] ? thread_return+0x70/0xdb
Mar 14 14:54:57 localhost kernel: [<ffffffff8145418b>] ? schedule+0x601/0x835
Mar 14 14:54:57 localhost kernel: [<ffffffff8101268e>] ? common_interrupt+0xe/0x13
Mar 14 14:54:57 localhost kernel: [<ffffffff81017e73>] ? native_sched_clock+0x2d/0x5f
Mar 14 14:54:57 localhost kernel: [<ffffffff810a8fe5>] ? audit_syscall_entry+0x11e/0x14a
Mar 14 14:54:57 localhost kernel: [<ffffffff81051b71>] ? sys_sched_yield+0x51/0x57
Mar 14 14:54:57 localhost kernel: [<ffffffff81011d32>] ? system_call_fastpath+0x16/0x1b

I am able to easily reproduce the hang when the system is under memory pressure and starts to swap to disk. I create this condition by running the command "memhog 4g". Once swapping starts, the system will be left unresponsive within seconds. I have been able to reproduce it on the following kernels:

kernel-2.6.32.9-70.fc12.x86_64
kernel-2.6.33-10.fc13.x86_64

If I start the system in single user mode and run memhog, the system will not hang. If I "modprobe iwlagn" and then run memhog, the system will not hang. If I then start NetworkManager in single user mode and run memhog, the system hangs.

I have installed the suggested kernel-2.6.32.8-48.rc2.sg_slab.fc12.x86_64.rpm and I am no longer able to reproduce the problem. Once the system starts to swap it does become unresponsive while the swapping is occurring, but the swapping does finish and the system becomes responsive again.

Can others reproduce the hang with the memhog command?
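For anyone triaging logs like the one above: a rough sketch of what the failure line encodes. "order:2" means the kernel needed 2**2 = 4 physically contiguous pages (16 KiB with 4 KiB pages), which gets harder to satisfy as memory fragments under swap pressure, and mode 0x4020 appears to be an atomic (no-sleep) allocation that cannot wait for reclaim. The helper below is hypothetical, not part of any tool mentioned in this bug; it just decodes such lines from /var/log/messages:

```python
import re

PAGE_SIZE = 4096  # 4 KiB pages on x86_64

def decode_alloc_failure(line):
    """Parse a 'page allocation failure' log line into its parts.

    order:N means the allocator needed 2**N physically contiguous
    pages, so failures become more likely as memory fragments.
    Returns None when the line is not an allocation-failure report.
    """
    m = re.search(r"page allocation failure\. order:(\d+), mode:(0x[0-9a-fA-F]+)", line)
    if not m:
        return None
    order = int(m.group(1))
    return {
        "order": order,
        # total contiguous bytes requested: 2**order pages
        "contiguous_bytes": (1 << order) * PAGE_SIZE,
        "gfp_mode": int(m.group(2), 16),
    }

line = "java: page allocation failure. order:2, mode:0x4020"
print(decode_alloc_failure(line))
```

Run against the trace above, this reports a 16384-byte (order-2) request, matching the large receive buffers iwl_rx_allocate() needs.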
I think the system hangs are due to the iwlwifi driver or firmware, which is not robust enough to survive allocation failures in some cases. Since SLAB makes the allocation failures go away, the system does not hang.
Created attachment 402144 [details]
Messages with "page allocation failure. order:2"

I think that I spoke too soon in my last post. I was able to get page allocation failures with 2.6.32.8-48.rc2.sg_slab.fc12.x86_64 as well. The system will hang, and take about 15 minutes to become responsive again.
FYI, there is a separate bug report for the recursive locking error: bz571969
(In reply to comment #18)
> If start the system in single user mode and run memhog, the system will not
> hang. If I "modprobe iwlagn", and then run memhog, the system will not hang. If
> I then start NetworkManager in single user mode and run memhog, the system
> hangs.
[snip]
> Can others reproduce the hang with the memhog command?

I'm trying to reproduce. Are the steps to reproduce as below? Add S to the kernel parameters in grub.conf, and after boot do:

# memhog 4g
# modprobe iwlagn
# memhog 4g
# /etc/init.d/NetworkManager start

For me, this gives a NetworkManager calltrace. An additional question: what does your "free -m" show?
Hi Manny,
Could you please answer the questions from Comment 21?
Thanks.
I apologize for the delay. This has been a busy week... I have tried to reproduce what I did before, and now I too get a NetworkManager call trace. In any case, I can still reproduce the issue by starting a standard graphical session as a regular user. My network will automatically connect to a wireless network. I then run memhog 4g as a normal user. This results in the system getting stuck. It does not matter whether I am connected to a wireless N or wireless G network, at either 2.4 or 5 GHz. I have found that the system is not permanently frozen, just stuck for an extended period. I believe that the higher the memory pressure (memhog 4g vs. memhog 5g), the longer the system remains unusable. Below is my free -m output:

             total       used       free     shared    buffers     cached
Mem:          3891       3788        102          0         35       1158
-/+ buffers/cache:       2594       1296
Swap:         5887        228       5659

I should be able to answer quicker next time. Sorry for the delay...
(In reply to comment #23)
> I have tried to reproduce what I did before and now too I get a networkmanager
> call trace as well.

Hmm, what changed? Was NetworkManager updated? Or maybe you did something different than before, and now, just by following my steps to reproduce, you get this NetworkManager calltrace :) ?

> It does not matter if I am connect to a wireless N or wireless G network at
> either 2.5 or 5 ghz. I have found that the system is not permanently frozen,
> just stuck for an extended period. I believe that the higher the amount of
> memory pressure (memhog 4g vs memhog 5g), the longer the system remains
> unusable.

I see this too, but there are no allocation errors. This behaviour (stuck for a while) must be some other bug in virtual memory management :(

> I should be able to answer quicker next time. Sorry for the delay...

No problem; we (Fedora developers) do not always respond in a timely manner either.
(In reply to comment #16)
> > By the way now that 2.6.33.1 is out in stable, will this kernel be made avail
> > to f12 do you know ?
>
> No, as far as I know. And F13 will use 2.6.34.

Just to clarify: F-12 will stay with 2.6.32, and F13 is using 2.6.33 so far (I thought it would be 2.6.34, but it seems plans changed).
I took the plunge and wiped my system for F13. Under memory pressure the system remains unresponsive for an extended period of time. I wouldn't say that the system is frozen, because it does come back to life after an extended period. This behavior is identical to F12. Once the system recovers, the following appears in the logs:

INFO: task kdmflush:460 blocked for more than 120 seconds.
INFO: task jbd2/dm-1-8:470 blocked for more than 120 seconds.
INFO: task rpcbind:1531 blocked for more than 120 seconds.
INFO: task NetworkManager:1556 blocked for more than 120 seconds.
INFO: task hald:1674 blocked for more than 120 seconds.
INFO: task hald-addon-acpi:1723 blocked for more than 120 seconds.
INFO: task ntpd:1800 blocked for more than 120 seconds.
INFO: task abrtd:1837 blocked for more than 120 seconds.
INFO: task crond:1845 blocked for more than 120 seconds.
INFO: task Xorg:1908 blocked for more than 120 seconds.

along with associated stack traces. I did not get any iwlagn page allocation failures this time, but it seems to be related somehow. I was using kernel-2.6.33.1-19.fc13.x86_64 for this. The problem seems to be that the virtual memory management is getting bogged down somehow. Could the iwlagn page allocation failures be due to the pages taking excessive time to allocate? Is there a mechanism in the page request path for iwlagn that results in allocation failures if the pages take too long? Perhaps a side effect of some sort of realtime process scheduling? I am not too familiar with the inner workings of the kernel.
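Those INFO lines come from the kernel's hung-task watchdog, which fires when a task sits in uninterruptible sleep longer than the timeout in /proc/sys/kernel/hung_task_timeout_secs (120 s by default). When a log has many of them, a small script can summarize which tasks stalled; this helper is a hypothetical triage aid, not part of any tool discussed in this bug:

```python
import re

# Matches the khungtaskd watchdog message, e.g.:
#   INFO: task jbd2/dm-1-8:470 blocked for more than 120 seconds.
HUNG_RE = re.compile(r"INFO: task ([\w./-]+):(\d+) blocked for more than (\d+) seconds")

def summarize_hung_tasks(log_lines):
    """Return (comm, pid) pairs for every hung-task report in the log.

    A long list dominated by I/O and network daemons (jbd2, rpcbind,
    NetworkManager, ...) points at the whole VM/I/O path stalling
    rather than a single bad process.
    """
    tasks = []
    for line in log_lines:
        m = HUNG_RE.search(line)
        if m:
            tasks.append((m.group(1), int(m.group(2))))
    return tasks

logs = [
    "INFO: task kdmflush:460 blocked for more than 120 seconds.",
    "INFO: task jbd2/dm-1-8:470 blocked for more than 120 seconds.",
    "some unrelated line",
]
print(summarize_hung_tasks(logs))
```

Feeding it the messages file attached to this bug would list exactly the ten tasks quoted above, which is consistent with a system-wide stall rather than an iwlagn-only failure.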
Update: I have this with 2.6.32.12-115 (and earlier, of course). rdiff-backup on /home never fails to induce this; trace attached. (As always, once this occurs the machine is essentially unusable.)
Created attachment 414289 [details] traceback from /var/log/messages
Sorry it took me a while. Is there a 2.6.34.x I could test somewhere (on f12)? I will look into installing f13 as a dual boot when I can, if there isn't a convenient later kernel to test. This machine is important, so I need to make sure I keep f12 running as well, even if all backups are done over the wire. The bright side is they are a lot faster.
(In reply to comment #29)
> Sorry took me a while - is there a 2.6.34.x I could test somewhere (on f12) ?

No, F-12 will stay with 2.6.32. On F-13 we currently have 2.6.33, but a switch to 2.6.34 is possible.

> I will look into installing f13 as a dual boot when I can if there isnt a
> convenient later kernel to test .. this machine is important so I need to make
> sure I keep f12 running as well ... even if all backups are done on wire ..
> bright side is they are a lot faster.

Yes, if wireless does not work and there is a possibility to have a wired connection, that's a good idea. Did your last allocation failure from comment 28 break the system or network connection, or did the driver recover from that situation by itself?
The driver does not recover; a reboot is required and the machine is not in a good state. Fiddling with the system makes it eventually hang, and a power reset is required. I will try f13 this weekend if possible.
Ok, I have not managed to trigger this on F13 with this kernel: 2.6.33.5-124.fc13.x86_64. So far so good...
Based on the above comment I'm closing this bug.