Bug 427998
Summary: | RHEL4: Can enter no tick idle mode with RCU pending leading to hang | ||
---|---|---|---|
Product: | Red Hat Enterprise Linux 4 | Reporter: | Ian Campbell <ijc> |
Component: | kernel-xen | Assignee: | Andrew Jones <drjones> |
Status: | CLOSED ERRATA | QA Contact: | Virtualization Bugs <virt-bugs> |
Severity: | low | Docs Contact: | |
Priority: | low | ||
Version: | 4.6 | CC: | byu, clalance, drjones, pbonzini, xen-maint |
Target Milestone: | rc | ||
Target Release: | --- | ||
Hardware: | All | ||
OS: | Linux | ||
Whiteboard: | |||
Fixed In Version: | Doc Type: | Bug Fix | |
Doc Text: | Story Points: | --- | |
Clone Of: | Environment: | ||
Last Closed: | 2011-02-16 16:03:26 UTC | Type: | --- |
Regression: | --- | Mount Type: | --- |
Documentation: | --- | CRM: | |
Verified Versions: | Category: | --- | |
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
Cloudforms Team: | --- | Target Upstream Version: | |
Embargoed: | |||
Bug Depends On: | |||
Bug Blocks: | 458302 | ||
Attachments: |
Description
Ian Campbell
2008-01-08 16:19:37 UTC
Created attachment 291063 [details]
git 677517771b7b6efaf8617e70f655b16f3cafcc9b backported to 2.6.9-67.EL
Created attachment 291064 [details]
Git 986733e01d258c26107f1da9d8d47c718349ad2f backported to 2.6.9-67.EL
Created attachment 291065 [details]
xen-unstable.hg 10327:c230dbe793d6 backported to 2.6.9-67.EL
Created attachment 291066 [details]
xen-unstable.hg 10532:4b45f7f62dc7 backported to 2.6.9-67.EL
Patches apply in this order:

    git-677517771b7b6efaf8617e70f655b16f3cafcc9b
    git-986733e01d258c26107f1da9d8d47c718349ad2f
    xen-unstable-10327-c230dbe793d6
    xen-unstable-10532-4b45f7f62dc7

They are against 2.6.9-67.0.1.EL, not -67.EL as I said above.

Ian, why is the 4th patch needed? It states that a problem exists for a dom0 hang, but rhel4-xenU is a domU-only kernel.

Sorry for the delay in responding. I could have sworn I replied to this, but I must have forgotten to hit Submit or something.

The problem was initially noticed in domain 0, in different circumstances from those reported here (all I know about it is what is given in Ack's commit message). The problem reported here was subsequently seen in domainU, and the fix turned out (coincidentally) to be the same.

The issue is that a domain can go tickless with either RCU events or timers pending. In the latter case next_timer_interrupt() returns a time in the recent past, hence the changes in 10532.

The original upstream patch at http://xenbits.xensource.com/xen-unstable.hg?rev/4b45f7f62dc7 has an additional hunk which I dropped because hrtimers aren't relevant to RHEL4, but the comment in that same hunk is probably useful:

    ++ /*
    ++  * If timers are pending, "expires" will be in the recent past
    ++  * of "jiffies". If there are no hr_timers registered, "hr_expires"
    ++  * will be "jiffies + MAX_JIFFY_OFFSET"; this is *just* short of being
    ++  * considered to be before "jiffies". This makes it very likely that
    ++  * "hr_expires" *will* be considered to be before "expires".
    ++  * So we must check when there are pending timers (expires <= jiffies)
    ++  * to ensure that we don't accidently tell the caller that there is
    ++  * nothing scheduled until half an epoch (MAX_JIFFY_OFFSET)!
    ++  */

Now that I look again, it's possible that I am mistaken and that without hrtimers the remaining hunk isn't needed either. Our testing has always included all 4 of the patches, so I'd be reluctant to say that it definitely isn't required.
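The wraparound behaviour described in that upstream comment can be reproduced in user space. The following is only a sketch under stated assumptions, not the upstream or RHEL4 code: the `sketch_*` helpers, `naive_next()`, `guarded_next()` and the value chosen for `SKETCH_MAX_JIFFY_OFFSET` are hypothetical stand-ins modelled on the comment's wording ("just short of" half the counter range), and the signed conversion assumes a two's complement `long`.

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>

/* Assumption: an offset "just short of" half the unsigned long range,
 * modelled on the comment's description of MAX_JIFFY_OFFSET. */
#define SKETCH_MAX_JIFFY_OFFSET ((ULONG_MAX >> 1) - 1)

/* Wrap-safe comparisons in the style of the kernel's time_before*()
 * helpers: subtract as unsigned, then interpret the difference as signed. */
static bool sketch_time_before(unsigned long a, unsigned long b)
{
    return (long)(a - b) < 0;
}

static bool sketch_time_before_eq(unsigned long a, unsigned long b)
{
    return (long)(a - b) <= 0;
}

/* Naive choice: return whichever deadline compares as earlier. */
static unsigned long naive_next(unsigned long expires, unsigned long hr_expires)
{
    return sketch_time_before(hr_expires, expires) ? hr_expires : expires;
}

/* Guarded choice: a timer that is already due always wins. */
static unsigned long guarded_next(unsigned long now, unsigned long expires,
                                  unsigned long hr_expires)
{
    if (sketch_time_before_eq(expires, now))
        return expires;
    return naive_next(expires, hr_expires);
}

/* Walks through the scenario from the comment with concrete numbers. */
void half_epoch_demo(void)
{
    unsigned long now = 1000;
    unsigned long expires = now - 5;                          /* already due */
    unsigned long hr_expires = now + SKETCH_MAX_JIFFY_OFFSET; /* nothing registered */

    assert(sketch_time_before_eq(expires, now));   /* pending timer detected */
    assert(!sketch_time_before(hr_expires, now));  /* far future relative to now */
    /* Relative to the stale "expires", the far-future value looks earlier. */
    assert(naive_next(expires, hr_expires) == hr_expires);
    /* With the pending-timer check, the due timer is reported instead. */
    assert(guarded_next(now, expires, hr_expires) == expires);
}
```

Calling `half_epoch_demo()` exercises the case where the naive comparison would report nothing scheduled for roughly half an epoch even though a timer is already due.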
The fourth patch is wrong; it may use j uninitialized:

    /* Leave ourselves in tick mode if rcu or softirq or timer pending. */
    if (rcu_needs_cpu(cpu) || local_softirq_pending() ||
        (j = next_timer_interrupt(), time_before_eq(j, jiffies))) {
        cpu_clear(cpu, nohz_cpu_mask);
        j = jiffies + 1;
    }

    if (HYPERVISOR_set_timer_op(jiffies_to_st(j)) != 0)

I'll work on a fix to post upstream.

(In reply to comment #11)
> The fourth patch is wrong, it may use j uninitialized:
>
> /* Leave ourselves in tick mode if rcu or softirq or timer pending. */
> if (rcu_needs_cpu(cpu) || local_softirq_pending() ||
>     (j = next_timer_interrupt(), time_before_eq(j, jiffies))) {
>     cpu_clear(cpu, nohz_cpu_mask);
>     j = jiffies + 1;
> }
>
> if (HYPERVISOR_set_timer_op(jiffies_to_st(j)) != 0)

How so? If rcu_needs_cpu(cpu) or local_softirq_pending() is true, then we enter the if block and set j = jiffies + 1. If both of those are false, then we evaluate the third || operand and set j = next_timer_interrupt(). So how can j be uninitialized?

Chris Lalancette

Oops, tricky... Well, if upstream does it this way I guess we have to do the same.

This is a difficult bug to recreate, but the proposed patches have been integrated into a test build at http://people.redhat.com/drjones/virttest/1-2/. The build is available for anyone who has seen the bug and would like to test the patches to see if it goes away.

This request was evaluated by Red Hat Product Management for inclusion in a Red Hat Enterprise Linux maintenance release. Product Management has requested further review of this request by Red Hat Engineering, for potential inclusion in a Red Hat Enterprise Linux Update release for currently deployed products. This request is not yet committed for inclusion in an Update release.

Committed in 89.42.EL.
RPMS are available at http://people.redhat.com/vgoyal/rhel4/

Tested with:

    host: xen-3.0.3-120-x86_64.el5, kernel-xen-2.6.18-238.el5
    guest: kernel-2.6.9-94.ELxen, 64-bit

[1] No call trace after 2000 reboots with iptables enabled.
[2] Code sanity check is OK; the patch is applied successfully.

So changing this to verified.

An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you.

http://rhn.redhat.com/errata/RHSA-2011-0263.html
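For reference, the point settled in the thread about whether j can be read uninitialized comes down to C's left-to-right, short-circuit evaluation of `||` together with the comma operator. The sketch below mirrors the shape of the quoted condition in user space; `struct tick_inputs` and `pick_timeout()` are hypothetical names introduced only for illustration, with plain fields standing in for rcu_needs_cpu(), local_softirq_pending(), next_timer_interrupt() and jiffies.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for the kernel state the quoted hunk consults. */
struct tick_inputs {
    bool rcu_pending;          /* stands in for rcu_needs_cpu(cpu) */
    bool softirq_pending;      /* stands in for local_softirq_pending() */
    unsigned long next_timer;  /* stands in for next_timer_interrupt() */
    unsigned long now;         /* stands in for jiffies */
};

/* Same control flow as the quoted condition. j has no initializer, yet
 * every path assigns it before the return:
 *  - if either of the first two operands is true, the body sets j;
 *  - otherwise the third operand runs, and its comma expression assigns j
 *    before the comparison is evaluated. */
static unsigned long pick_timeout(const struct tick_inputs *in,
                                  bool *stay_in_tick_mode)
{
    unsigned long j;

    assert(in != NULL && stay_in_tick_mode != NULL);
    *stay_in_tick_mode = false;

    if (in->rcu_pending || in->softirq_pending ||
        (j = in->next_timer, (long)(j - in->now) <= 0)) {
        *stay_in_tick_mode = true;  /* analogous to clearing nohz_cpu_mask */
        j = in->now + 1;
    }
    return j;
}
```

With RCU or softirq work pending, or with a timer already due, the function returns `now + 1` and reports that tick mode should be kept; only when nothing is pending and the next timer is in the future does it return that timer's value.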