Bug 427998 - RHEL4: Can enter no tick idle mode with RCU pending leading to hang
Summary: RHEL4: Can enter no tick idle mode with RCU pending leading to hang
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Linux 4
Classification: Red Hat
Component: kernel-xen
Version: 4.6
Hardware: All
OS: Linux
Priority: low
Severity: low
Target Milestone: rc
Target Release: ---
Assignee: Andrew Jones
QA Contact: Virtualization Bugs
URL:
Whiteboard:
Depends On:
Blocks: 458302
Reported: 2008-01-08 16:19 UTC by Ian Campbell
Modified: 2011-02-16 16:03 UTC
CC List: 5 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2011-02-16 16:03:26 UTC
Target Upstream Version:
Embargoed:


Attachments
git 677517771b7b6efaf8617e70f655b16f3cafcc9b backported to 2.6.9-67.EL (2.75 KB, patch)
2008-01-08 16:21 UTC, Ian Campbell
Git 986733e01d258c26107f1da9d8d47c718349ad2f backported to 2.6.9-67.EL (2.29 KB, patch)
2008-01-08 16:21 UTC, Ian Campbell
xen-unstable.hg 10327:c230dbe793d6 backported to 2.6.9-67.EL (1.90 KB, patch)
2008-01-08 16:22 UTC, Ian Campbell
xen-unstable.hg 10532:4b45f7f62dc7 backported to 2.6.9-67.EL (1.36 KB, patch)
2008-01-08 16:23 UTC, Ian Campbell


Links
System: Red Hat Product Errata
ID: RHSA-2011:0263
Private: 0
Priority: normal
Status: SHIPPED_LIVE
Summary: Important: Red Hat Enterprise Linux 4.9 kernel security and bug fix update
Last Updated: 2011-02-16 15:14:55 UTC

Description Ian Campbell 2008-01-08 16:19:37 UTC
We have found that the 2.6.9-67.ELxen kernel can occasionally enter tickless
mode while RCU work is pending. We've mainly noticed it very early at start of
day or late during shutdown, when there isn't much other activity going on.
When this triggers, it is usually in synchronize_kernel(), which means the guest
essentially hangs until some external event (e.g. a SysRq) unwedges it.
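The hang can be pictured with a deliberately tiny model. In the trace below, synchronize_kernel() is blocked in wait_for_completion() next to wakeme_after_rcu(), and that completion can only be signalled once an RCU grace period ends. The sketch is a hypothetical toy, not kernel code: it assumes a single CPU whose grace period can only complete when a timer tick is delivered, and every name in it (`toy_cpu`, `toy_enter_idle`, `toy_timer_tick`, `toy_synchronize`) is invented for illustration.

```c
#include <stdbool.h>

/* Hypothetical toy model, not kernel code: one CPU whose RCU grace
 * period can only complete when a timer tick is delivered. */
struct toy_cpu {
    bool rcu_work_pending;   /* stands in for rcu_needs_cpu() */
    bool tick_stopped;       /* true once the CPU has gone tickless */
    bool grace_period_done;  /* what synchronize_kernel() waits for */
};

/* Idle-entry decision. With check_rcu == false this mimics the buggy
 * behaviour: the tick is stopped even though RCU still needs the CPU. */
static void toy_enter_idle(struct toy_cpu *c, bool check_rcu)
{
    c->tick_stopped = !(check_rcu && c->rcu_work_pending);
}

/* Deliver one tick if the tick is still running; in this toy a tick is
 * the only thing that lets the pending grace period finish. */
static void toy_timer_tick(struct toy_cpu *c)
{
    if (c->tick_stopped)
        return;                 /* no tick: nothing drives RCU forward */
    if (c->rcu_work_pending) {
        c->rcu_work_pending = false;
        c->grace_period_done = true;
    }
}

/* Simulate the waiter: go idle, then wait up to max_ticks for the grace
 * period. Returns true if it completed, false if the "guest" hangs. */
static bool toy_synchronize(bool check_rcu, int max_ticks)
{
    struct toy_cpu c = { .rcu_work_pending = true };

    toy_enter_idle(&c, check_rcu);
    for (int i = 0; i < max_ticks && !c.grace_period_done; i++)
        toy_timer_tick(&c);
    return c.grace_period_done;
}
```

`toy_synchronize(false, 1000)` returns false (the waiter never completes, like the hung guest), while `toy_synchronize(true, 1000)` returns true because the CPU stays in tick mode. In the report an external event such as SysRq eventually unwedges the guest; the toy has no such event and simply gives up after `max_ticks`.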

Usually we see it when loading/unloading iptables modules during startup or
shutdown, e.g.:
modprobe      D C02AB810  2932  4668   4630                     (NOTLB)
dc939ed4 00000286 dc90d2c0 c02ab810 dfca2da4 c1410320 00004a9a 6bc82a97 
       00000423 c16610e0 df76e170 df76e2dc c01c2e62 dc939f38 dc939f38 dc939ec8 
       c026fdac c027a631 00000000 dc939f38 dc939f3c dc939f38 dc939ef0 dc939f28 
Call Trace:
 [<c01c2e62>] alloc_layer+0x3a/0x40
 [<c026fdac>] __cond_resched+0x14/0x3c
 [<c026f7dd>] wait_for_completion+0x9c/0xd3
 [<c01185cb>] default_wake_function+0x0/0x12
 [<c01185cb>] default_wake_function+0x0/0x12
 [<c01224bd>] unregister_proc_table+0x38/0x69
 [<c012d547>] synchronize_kernel+0x41/0x46
 [<c012d4fa>] wakeme_after_rcu+0x0/0xc
 [<e091e77a>] init_or_cleanup+0x18f/0x20b [ip_conntrack]
 [<e09219c1>] fini+0x7/0x9 [ip_conntrack]
 [<c01325d0>] sys_delete_module+0x13e/0x187
 [<c014f57d>] do_munmap+0x11d/0x129
 [<c014f5d1>] sys_munmap+0x48/0x63
 [<c010740f>] syscall_call+0x7/0xb

The call chain here isn't especially clear; I believe it is something like
sys_delete_module -> fini -> init_or_cleanup -> nf_unregister_hook ->
synchronize_net -> synchronize_kernel. The last few links are optimized into
tail calls, which is why they don't appear in the trace.

It is very tricky to reproduce because it happens so rarely; we mainly see it
during our automated testing. About the only way I've found is a reboot loop and
an awful lot of patience.

The fix is xen-unstable.hg changesets 10327:c230dbe793d6 and 10532:4b45f7f62dc7,
which in turn require git commits 677517771b7b6efaf8617e70f655b16f3cafcc9b and
986733e01d258c26107f1da9d8d47c718349ad2f.

http://xenbits.xensource.com/xen-unstable.hg?rev/c230dbe793d6
http://xenbits.xensource.com/xen-unstable.hg?rev/4b45f7f62dc7
http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commitdiff;h=986733e01d258c26107f1da9d8d47c718349ad2f
http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commitdiff;h=677517771b7b6efaf8617e70f655b16f3cafcc9b

Comment 1 Ian Campbell 2008-01-08 16:21:19 UTC
Created attachment 291063 [details]
git 677517771b7b6efaf8617e70f655b16f3cafcc9b backported to 2.6.9-67.EL

Comment 2 Ian Campbell 2008-01-08 16:21:55 UTC
Created attachment 291064 [details]
Git 986733e01d258c26107f1da9d8d47c718349ad2f backported to 2.6.9-67.EL

Comment 3 Ian Campbell 2008-01-08 16:22:40 UTC
Created attachment 291065 [details]
xen-unstable.hg 10327:c230dbe793d6 backported to 2.6.9-67.EL

Comment 4 Ian Campbell 2008-01-08 16:23:15 UTC
Created attachment 291066 [details]
xen-unstable.hg 10532:4b45f7f62dc7 backported to 2.6.9-67.EL

Comment 5 Ian Campbell 2008-01-08 16:24:18 UTC
Patches apply in the order:
git-677517771b7b6efaf8617e70f655b16f3cafcc9b
git-986733e01d258c26107f1da9d8d47c718349ad2f
xen-unstable-10327-c230dbe793d6
xen-unstable-10532-4b45f7f62dc7

They are against 2.6.9-67.0.1.EL, not -67.EL as I said above.

Comment 6 Don Dutile (Red Hat) 2008-03-06 22:55:03 UTC
Ian,

Why is the 4th patch needed?
It describes a dom0 hang, but
rhel4-xenU is a domU-only kernel.



Comment 7 Ian Campbell 2008-03-14 14:28:51 UTC
Sorry for the delay in responding. I could have sworn I replied to this, but I
must've forgotten to hit Submit or something.

The problem was initially noticed in domain 0 in different circumstances from
those reported here (all I know about it is what is given in Ack's commit message).

The problem reported here was subsequently seen in domainU, and the fix turned
out (coincidentally) to be the same. The issue is that a domain can go tickless
with either RCU events or timers pending. In the latter case,
next_timer_interrupt() returns a time in the recent past, hence the changes in 10532.

The original upstream patch at
http://xenbits.xensource.com/xen-unstable.hg?rev/4b45f7f62dc7 has an additional
hunk, which I dropped because hrtimers aren't relevant to RHEL4, but the comment
in that same hunk is probably still useful:
++	/*
++	 * If timers are pending, "expires" will be in the recent past
++	 * of "jiffies". If there are no hr_timers registered, "hr_expires"
++	 * will be "jiffies + MAX_JIFFY_OFFSET"; this is *just* short of being
++	 * considered to be before "jiffies". This makes it very likely that
++	 * "hr_expires" *will* be considered to be before "expires".
++	 * So we must check when there are pending timers (expires <= jiffies)
++	 * to ensure that we don't accidently tell the caller that there is
++	 * nothing scheduled until half an epoch (MAX_JIFFY_OFFSET)!
++	 */
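The wraparound arithmetic behind that comment can be checked in user space. This is a sketch under assumptions: the comparison helpers are modelled on time_after()/time_before()/time_before_eq() from include/linux/jiffies.h but rewritten with unsigned subtraction so the user-space version avoids signed-overflow undefined behaviour, and `naive_next_event()` / `guarded_next_event()` are hypothetical reconstructions of the merge the comment warns about, not the upstream code.

```c
#include <stdbool.h>

/* Wraparound-safe comparisons modelled on include/linux/jiffies.h.
 * Unsigned subtraction avoids signed-overflow UB; the (long) conversion
 * assumes two's complement, as the kernel does. */
static bool time_after(unsigned long a, unsigned long b)
{
    return (long)(b - a) < 0;
}

static bool time_before(unsigned long a, unsigned long b)
{
    return time_after(b, a);
}

static bool time_before_eq(unsigned long a, unsigned long b)
{
    return (long)(b - a) >= 0;
}

/* Same value as the kernel's MAX_JIFFY_OFFSET: just under half the
 * unsigned long range. */
#define MAX_JIFFY_OFFSET ((~0UL >> 1) - 1)

/* Hypothetical merge the comment warns about: take whichever of the
 * timer-wheel deadline and the hrtimer deadline compares as earlier. */
static unsigned long naive_next_event(unsigned long expires,
                                      unsigned long hr_expires)
{
    return time_before(hr_expires, expires) ? hr_expires : expires;
}

/* Hypothetical guarded merge: if a timer is already due
 * (expires <= jiffies), report it instead of letting hr_expires win. */
static unsigned long guarded_next_event(unsigned long jiffies,
                                        unsigned long expires,
                                        unsigned long hr_expires)
{
    if (time_before_eq(expires, jiffies))
        return expires;
    return naive_next_event(expires, hr_expires);
}
```

With jiffies = 1000, an overdue wheel timer at expires = 997 and no hrtimers (hr_expires = 1000 + MAX_JIFFY_OFFSET), `naive_next_event()` picks hr_expires, roughly half an epoch in the future, while `guarded_next_event()` returns 997. Adding just 2 to MAX_JIFFY_OFFSET is enough for the value to compare as before jiffies, which is what "just short" means in the comment.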

Now that I look again, it's possible that I am mistaken and that without hrtimers
the remaining hunk isn't needed either. Our testing has always included all 4 of
the patches, so I'd be reluctant to say that it definitely isn't required.

Comment 11 Paolo Bonzini 2009-06-23 07:22:33 UTC
The fourth patch is wrong; it may use j uninitialized:

	/* Leave ourselves in tick mode if rcu or softirq or timer pending. */
	if (rcu_needs_cpu(cpu) || local_softirq_pending() ||
	    (j = next_timer_interrupt(), time_before_eq(j, jiffies))) {
		cpu_clear(cpu, nohz_cpu_mask);
		j = jiffies + 1;
	}

	if (HYPERVISOR_set_timer_op(jiffies_to_st(j)) != 0)

I'll work on a fix to post upstream.

Comment 12 Chris Lalancette 2009-06-23 07:41:20 UTC
(In reply to comment #11)
> The fourth patch is wrong, it may use j uninitialized:
> 
>  /* Leave ourselves in tick mode if rcu or softirq or timer pending. */
>  if (rcu_needs_cpu(cpu) || local_softirq_pending() ||
>      (j = next_timer_interrupt(), time_before_eq(j, jiffies))) {
>    cpu_clear(cpu, nohz_cpu_mask);
>    j = jiffies + 1;
>   }
> 
>   if (HYPERVISOR_set_timer_op(jiffies_to_st(j)) != 0)

How so? If rcu_needs_cpu(cpu) or local_softirq_pending() is true, then we enter the if block and set j = jiffies + 1. If both of those are false, then the third || operand is evaluated, and it sets j = next_timer_interrupt() before doing the comparison. So how can j be uninitialized?

Chris Lalancette
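Chris's reading depends on two C guarantees: || evaluates its operands left to right and stops at the first true one, and the comma operator evaluates its left operand (here the assignment to j) before yielding the value of its right operand. The sketch below is a hypothetical user-space copy of the snippet's control flow; `struct idle_inputs` and `idle_timeout()` are invented for illustration, with plain fields standing in for rcu_needs_cpu(cpu), local_softirq_pending() and next_timer_interrupt().

```c
#include <stdbool.h>

/* Hypothetical inputs; in the kernel these come from rcu_needs_cpu(cpu),
 * local_softirq_pending() and next_timer_interrupt(). */
struct idle_inputs {
    bool rcu_needs_cpu;
    bool softirq_pending;
    unsigned long next_timer;   /* what next_timer_interrupt() would return */
};

/* Modelled on time_before_eq() from include/linux/jiffies.h. */
static bool time_before_eq(unsigned long a, unsigned long b)
{
    return (long)(b - a) >= 0;
}

/* Mirrors the patched snippet: returns the jiffy at which the one-shot
 * timer should fire and reports whether the CPU stays in tick mode. */
static unsigned long idle_timeout(const struct idle_inputs *in,
                                  unsigned long jiffies, bool *stay_ticking)
{
    unsigned long j;

    *stay_ticking = false;
    /* If either of the first two operands is true, || short-circuits and
     * the body assigns j. Otherwise the comma expression runs: it assigns
     * j first, and its value is the comparison. */
    if (in->rcu_needs_cpu || in->softirq_pending ||
        (j = in->next_timer, time_before_eq(j, jiffies))) {
        *stay_ticking = true;   /* stands in for cpu_clear(cpu, nohz_cpu_mask) */
        j = jiffies + 1;
    }
    return j;                   /* assigned on every path */
}
```

With jiffies = 1000, pending RCU work, a pending softirq, or a next timer at 998 all yield 1001 with `stay_ticking` set; a next timer at 1500 yields 1500 with `stay_ticking` clear.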

Comment 13 Paolo Bonzini 2009-06-23 09:29:19 UTC
Oops, tricky...  Well, if upstream does it this way I guess we have to do the same.

Comment 15 Andrew Jones 2009-07-01 18:18:16 UTC
This is a difficult bug to recreate, but the proposed patches have been integrated into a test build at http://people.redhat.com/drjones/virttest/1-2/. The build is available for anyone who has seen the bug and would like to test the patches to see if it goes away.

Comment 20 RHEL Program Management 2010-10-12 17:51:32 UTC
This request was evaluated by Red Hat Product Management for inclusion in a Red
Hat Enterprise Linux maintenance release.  Product Management has requested
further review of this request by Red Hat Engineering, for potential
inclusion in a Red Hat Enterprise Linux Update release for currently deployed
products.  This request is not yet committed for inclusion in an Update
release.

Comment 21 Vivek Goyal 2010-10-13 16:11:19 UTC
Committed in 89.42.EL. RPMs are available at http://people.redhat.com/vgoyal/rhel4/

Comment 23 Binbin Yu 2011-01-13 05:30:46 UTC
Host: xen-3.0.3-120-x86_64.el5, kernel-xen-2.6.18-238.el5
Guest: kernel-2.6.9-94.ELxen, 64-bit
[1] No call trace after 2000 reboots with iptables enabled.
[2] Code sanity check is OK; the patch is applied successfully.

So changing this to VERIFIED.

Comment 24 errata-xmlrpc 2011-02-16 16:03:26 UTC
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHSA-2011-0263.html

