We have found that the 2.6.9-67.ELxen kernel can occasionally enter tickless mode with an RCU callback pending. We've mainly noticed it very early on at start of day or late during shutdown, when there isn't much other activity going on. When this triggers it is usually in synchronize_kernel(), which means the guest essentially hangs until some external event (e.g. a SysRq) unwedges it.

Usually we see it when loading/unloading iptables modules during startup or shutdown, e.g.:

modprobe      D C02AB810  2932  4668   4630                     (NOTLB)
dc939ed4 00000286 dc90d2c0 c02ab810 dfca2da4 c1410320 00004a9a 6bc82a97
       00000423 c16610e0 df76e170 df76e2dc c01c2e62 dc939f38 dc939f38 dc939ec8
       c026fdac c027a631 00000000 dc939f38 dc939f3c dc939f38 dc939ef0 dc939f28
Call Trace:
 [<c01c2e62>] alloc_layer+0x3a/0x40
 [<c026fdac>] __cond_resched+0x14/0x3c
 [<c026f7dd>] wait_for_completion+0x9c/0xd3
 [<c01185cb>] default_wake_function+0x0/0x12
 [<c01185cb>] default_wake_function+0x0/0x12
 [<c01224bd>] unregister_proc_table+0x38/0x69
 [<c012d547>] synchronize_kernel+0x41/0x46
 [<c012d4fa>] wakeme_after_rcu+0x0/0xc
 [<e091e77a>] init_or_cleanup+0x18f/0x20b [ip_conntrack]
 [<e09219c1>] fini+0x7/0x9 [ip_conntrack]
 [<c01325d0>] sys_delete_module+0x13e/0x187
 [<c014f57d>] do_munmap+0x11d/0x129
 [<c014f5d1>] sys_munmap+0x48/0x63
 [<c010740f>] syscall_call+0x7/0xb

The call chain here isn't especially clear. I believe it is something like sys_delete_module -> fini -> init_or_cleanup -> nf_unregister_hook -> synchronize_net -> synchronize_kernel. The last few links are optimized into tail calls, which is why they don't appear in the trace.

It is very hard to reproduce on demand because it triggers so rarely; we mainly see it during our automated testing. About the only way I've found is a reboot loop and an awful lot of patience.

The fix is xen-unstable.hg 10327:c230dbe793d6 and 10532:4b45f7f62dc7, which in turn require git 677517771b7b6efaf8617e70f655b16f3cafcc9b and 986733e01d258c26107f1da9d8d47c718349ad2f.
http://xenbits.xensource.com/xen-unstable.hg?rev/c230dbe793d6
http://xenbits.xensource.com/xen-unstable.hg?rev/4b45f7f62dc7
http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commitdiff;h=986733e01d258c26107f1da9d8d47c718349ad2f
http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commitdiff;h=677517771b7b6efaf8617e70f655b16f3cafcc9b
Created attachment 291063: git 677517771b7b6efaf8617e70f655b16f3cafcc9b backported to 2.6.9-67.EL
Created attachment 291064: git 986733e01d258c26107f1da9d8d47c718349ad2f backported to 2.6.9-67.EL
Created attachment 291065: xen-unstable.hg 10327:c230dbe793d6 backported to 2.6.9-67.EL
Created attachment 291066: xen-unstable.hg 10532:4b45f7f62dc7 backported to 2.6.9-67.EL
Patches apply in the order:

git-677517771b7b6efaf8617e70f655b16f3cafcc9b
git-986733e01d258c26107f1da9d8d47c718349ad2f
xen-unstable-10327-c230dbe793d6
xen-unstable-10532-4b45f7f62dc7

They are against 2.6.9-67.0.1.EL, not -67.EL as I said above.
Ian, why is the 4th patch needed? It states that a problem exists for a dom0 hang, but rhel4-xenU is a domU-only kernel.
Sorry for the delay responding. I could have sworn I replied to this but I must've forgotten to hit Submit or something.

The problem was initially noticed in domain 0 in different circumstances to those reported here (all I know about it is what is given in Ack's commit message). The problem reported here was subsequently seen in domainU and the fix turned out (coincidentally) to be the same.

The issue is that a domain can go tickless with either RCU events or timers pending. In the latter case next_timer_interrupt() returns a time in the recent past, hence the changes in 10532.

The original upstream patch at http://xenbits.xensource.com/xen-unstable.hg?rev/4b45f7f62dc7 has an additional hunk which I dropped because hrtimers aren't relevant to RHEL4, but the comment in that same hunk is probably useful actually:

++	/*
++	 * If timers are pending, "expires" will be in the recent past
++	 * of "jiffies". If there are no hr_timers registered, "hr_expires"
++	 * will be "jiffies + MAX_JIFFY_OFFSET"; this is *just* short of being
++	 * considered to be before "jiffies". This makes it very likely that
++	 * "hr_expires" *will* be considered to be before "expires".
++	 * So we must check when there are pending timers (expires <= jiffies)
++	 * to ensure that we don't accidently tell the caller that there is
++	 * nothing scheduled until half an epoch (MAX_JIFFY_OFFSET)!
++	 */

Now that I look again, it's possible that I am mistaken and that without hrtimers the remaining hunk isn't needed either. Our testing has always included all 4 of the patches, so I'd be reluctant to say that it definitely isn't required.
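To make the wraparound argument in the quoted comment concrete, here is a small user-space sketch. It models the kernel's wrap-safe jiffies comparisons as plain functions; it is not the upstream hunk. The helper names pick_next_event() and pick_next_event_naive() are hypothetical, and the MAX_JIFFY_OFFSET value is an assumption based on 2.6-era kernels.

```c
#include <assert.h>
#include <limits.h>

/* User-space model of the kernel's wrap-safe comparisons. The cast of an
 * out-of-range unsigned value to long is implementation-defined in C; this
 * sketch assumes the usual two's-complement wrap. */
static int time_after(unsigned long a, unsigned long b)
{
    return (long)(b - a) < 0;           /* true if a is later than b */
}

static int time_before(unsigned long a, unsigned long b)
{
    return time_after(b, a);
}

static int time_before_eq(unsigned long a, unsigned long b)
{
    return (long)(b - a) >= 0;          /* true if a is at or before b */
}

/* Assumed value, mirroring 2.6-era kernels: just under half the range. */
#define MAX_JIFFY_OFFSET ((ULONG_MAX >> 1) - 1)

/* Hypothetical helper: choose between the two expiry times without
 * first checking whether "expires" is already in the past. */
static unsigned long pick_next_event_naive(unsigned long expires,
                                           unsigned long hr_expires)
{
    return time_before(hr_expires, expires) ? hr_expires : expires;
}

/* Hypothetical helper: same choice, but a pending timer
 * (expires <= now) is reported as-is before comparing. */
static unsigned long pick_next_event(unsigned long now,
                                     unsigned long expires,
                                     unsigned long hr_expires)
{
    if (time_before_eq(expires, now))
        return expires;
    return time_before(hr_expires, expires) ? hr_expires : expires;
}

/* Small self-check of the scenario described in the comment. */
static void check_pick_next_event(void)
{
    unsigned long now = 1000UL;
    unsigned long expires = now - 5;                  /* pending timer */
    unsigned long hr_expires = now + MAX_JIFFY_OFFSET; /* "no hr timer" */

    /* Naive choice wrongly reports nothing due for half an epoch. */
    assert(pick_next_event_naive(expires, hr_expires) == hr_expires);
    /* Checked choice reports the pending timer. */
    assert(pick_next_event(now, expires, hr_expires) == expires);
}
```

With a timer only a few jiffies overdue, the distance from "expires" to "hr_expires" exceeds half the counter range, so the naive comparison treats the far-future sentinel as the earlier event. That is the case the extra check guards against.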
The fourth patch is wrong, it may use j uninitialized:

	/* Leave ourselves in tick mode if rcu or softirq or timer pending. */
	if (rcu_needs_cpu(cpu) || local_softirq_pending() ||
	    (j = next_timer_interrupt(), time_before_eq(j, jiffies))) {
		cpu_clear(cpu, nohz_cpu_mask);
		j = jiffies + 1;
	}

	if (HYPERVISOR_set_timer_op(jiffies_to_st(j)) != 0)

I'll work on a fix to post upstream.
(In reply to comment #11)
> The fourth patch is wrong, it may use j uninitialized:
>
> 	/* Leave ourselves in tick mode if rcu or softirq or timer pending. */
> 	if (rcu_needs_cpu(cpu) || local_softirq_pending() ||
> 	    (j = next_timer_interrupt(), time_before_eq(j, jiffies))) {
> 		cpu_clear(cpu, nohz_cpu_mask);
> 		j = jiffies + 1;
> 	}
>
> 	if (HYPERVISOR_set_timer_op(jiffies_to_st(j)) != 0)

How so? If rcu_needs_cpu(cpu) or local_softirq_pending() is true, then we enter the if block and set j = jiffies + 1. If both of those are false, then we enter the third || condition and set j = next_timer_interrupt(). So how can j be uninitialized?

Chris Lalancette
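The short-circuit argument above can be checked with a small user-space sketch. Everything here is a stand-in: fake_jiffies, fake_next_timer, next_timer_interrupt_stub() and choose_timeout() are hypothetical names used only to mirror the shape of the quoted condition, not kernel code.

```c
#include <assert.h>

/* Hypothetical stand-ins for jiffies and the timer wheel. */
static unsigned long fake_jiffies = 1000;
static unsigned long fake_next_timer = 1500;
static int next_timer_calls;

static unsigned long next_timer_interrupt_stub(void)
{
    next_timer_calls++;
    return fake_next_timer;
}

/* Wrap-safe "a is at or before b", modelled on the kernel macro. */
static int time_before_eq(unsigned long a, unsigned long b)
{
    return (long)(b - a) >= 0;
}

/* Same control flow as the quoted hunk, with the rcu/softirq checks
 * passed in as flags so each path can be exercised. */
static unsigned long choose_timeout(int rcu_pending, int softirq_pending)
{
    unsigned long j; /* deliberately left unset, as in the patch */

    if (rcu_pending || softirq_pending ||
        (j = next_timer_interrupt_stub(),
         time_before_eq(j, fake_jiffies))) {
        /* Any true operand lands here, and j is overwritten. */
        j = fake_jiffies + 1;
    }
    /* Otherwise all three operands were evaluated, so the comma
     * expression has already assigned j. */
    return j;
}

/* Small self-check covering each path through the condition. */
static void check_choose_timeout(void)
{
    next_timer_calls = 0;
    fake_next_timer = 1500;

    /* RCU or softirq pending: stub never runs, j = jiffies + 1. */
    assert(choose_timeout(1, 0) == 1001UL);
    assert(choose_timeout(0, 1) == 1001UL);
    assert(next_timer_calls == 0);

    /* Nothing pending, timer in the future: j comes from the stub. */
    assert(choose_timeout(0, 0) == 1500UL);
    assert(next_timer_calls == 1);

    /* Nothing pending, timer already due: j = jiffies + 1. */
    fake_next_timer = 990;
    assert(choose_timeout(0, 0) == 1001UL);
}
```

Every path either assigns j inside the if body or has already assigned it through the comma expression. A compiler may still emit a "may be used uninitialized" warning for this pattern, since the assignment is buried in the condition.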
Oops, tricky... Well, if upstream does it this way I guess we have to do the same.
This is a difficult bug to recreate, but the proposed patches have been integrated into a test build at http://people.redhat.com/drjones/virttest/1-2/. The build is available for anyone who has seen the bug and would like to test the patches to see if it goes away.
This request was evaluated by Red Hat Product Management for inclusion in a Red Hat Enterprise Linux maintenance release. Product Management has requested further review of this request by Red Hat Engineering, for potential inclusion in a Red Hat Enterprise Linux Update release for currently deployed products. This request is not yet committed for inclusion in an Update release.
Committed in 89.42.EL. RPMs are available at http://people.redhat.com/vgoyal/rhel4/
Tested with:
host: xen-3.0.3-120-x86_64.el5, kernel-xen-2.6.18-238.el5
guest: kernel-2.6.9-94.ELxen, 64-bit

[1] No call trace after 2000 reboots with iptables enabled.
[2] Code sanity check is OK; the patch is applied successfully.

So changing this to VERIFIED.
An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you.

http://rhn.redhat.com/errata/RHSA-2011-0263.html