Bug 836803
Summary: | RHEL6: Potential fix for leapsecond caused futex related load spikes | |
---|---|---|---
Product: | Red Hat Enterprise Linux 6 | Reporter: | Prarit Bhargava <prarit>
Component: | kernel | Assignee: | Prarit Bhargava <prarit>
Status: | CLOSED ERRATA | QA Contact: | Dong Zhu <dZhu>
Severity: | urgent | Docs Contact: |
Priority: | urgent | |
Version: | 6.4 | CC: | bugzilla.redhat.com, chorn, czhang, dhoward, hartsjc, hfuchi, ifloodmu, jeder, jipan, jjasghar, jwest, liko, mmahudha, mmilgram, myamazak, pep, qcai, qguo, rfujita, rrosario, SCHAKRAB, sforsber, smccarty, vgaikwad, yh.choi
Target Milestone: | rc | Keywords: | ZStream
Target Release: | --- | |
Hardware: | Unspecified | |
OS: | Unspecified | |
Whiteboard: | | |
Fixed In Version: | kernel-2.6.32-298.el6 | Doc Type: | Bug Fix
Doc Text: | | Story Points: | ---
Clone Of: | | Environment: |
Last Closed: | 2013-02-21 06:29:54 UTC | Type: | Bug
Regression: | --- | Mount Type: | ---
Documentation: | --- | CRM: |
Verified Versions: | | Category: | ---
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: |
Cloudforms Team: | --- | Target Upstream Version: |
Embargoed: | | |
Bug Depends On: | | |
Bug Blocks: | 782183, 840683, 847364, 847365, 847366, 1300182 | |
Attachments: | | |

Description
Prarit Bhargava
2012-07-01 14:43:58 UTC
Working on a backport now. Backport is likely to depend on fix for bug 836748.

P.

This request was evaluated by Red Hat Product Management for inclusion in a Red Hat Enterprise Linux release. Product Management has requested further review of this request by Red Hat Engineering, for potential inclusion in a Red Hat Enterprise Linux release for currently deployed products. This request is not yet committed for inclusion in a release.

The current situation is as follows:

A patchset has been posted upstream, http://marc.info/?l=linux-kernel&m=134138316402296&w=2 which has an Acked-by: me.

I've tested this patchset on an upstream kernel using the following tests:

1. A leap second test I wrote (but which is VERY similar to)
2. http://marc.info/?l=linux-kernel&m=134116789230177&w=2, and
3. http://marc.info/?l=linux-kernel&m=134116789230177&w=2 with and without the "-s" option.

So far all tests have been successful.

I am now doing a wider test for RHEL6 with a significant backport of patches to the kernel/time/timekeeping.c code + the patches from upstream. This testing is ongoing; however, thus far boot testing has not picked up any issues.

A smaller patchset has been identified as well that will resolve only the leapsecond issue, but still leaves the clock code susceptible to some other smaller races and issues. I've decided to go with the larger set for the sake of completeness -- besides, we're changing core code already and I don't see any reason not to broaden the changes at this point.

Watch here for further BZ updates.

P.

upstream kernel testing ....
[root@intel-canoepass-05 tmp]# uname -a
Linux intel-canoepass-05.lab.bos.redhat.com 3.4.4-5.fc17.x86_64 #1 SMP Thu Jul 5 20:20:59 UTC 2012 x86_64 x86_64 x86_64 GNU/Linux
[root@intel-canoepass-05 tmp]# ./leap-a-day -s
Setting time to Sun Jul 22 20:00:00 2012
Scheduling leap second for Sun Jul 22 20:00:00 2012
Sun Jul 22 19:59:57 2012 + 500283 us TIME_INS
Sun Jul 22 19:59:58 2012 + 521 us TIME_INS
Sun Jul 22 19:59:58 2012 + 500770 us TIME_INS
Sun Jul 22 19:59:59 2012 + 1011 us TIME_INS
Sun Jul 22 19:59:59 2012 + 501289 us TIME_INS
Sun Jul 22 19:59:59 2012 + 6806 us TIME_OOP
Sun Jul 22 19:59:59 2012 + 506946 us TIME_OOP
Sun Jul 22 20:00:00 2012 + 7143 us TIME_WAIT
Sun Jul 22 20:00:00 2012 + 507274 us TIME_WAIT
Sun Jul 22 20:00:01 2012 + 7516 us TIME_WAIT
Sun Jul 22 20:00:01 2012 + 507652 us TIME_WAIT
Sun Jul 22 20:00:02 2012 + 7898 us TIME_WAIT
Note: hrtimer early expiration failure observed.
Leap complete

.............................................................................

Modified kernel with upstream patches:

[root@intel-canoepass-05 tmp]# uname -a
Linux intel-canoepass-05.lab.bos.redhat.com 3.5.0-rc6+ #2 SMP Wed Jul 11 14:51:10 EDT 2012 x86_64 x86_64 x86_64 GNU/Linux
[root@intel-canoepass-05 tmp]#
[root@intel-canoepass-05 tmp]# ./leap-a-day -s
This runs continuously. Press ctrl-c to stop
Setting time to speed up testing
Setting time to Wed Jul 11 20:00:00 2012
Scheduling leap second for Wed Jul 11 20:00:00 2012
Something woke us up, returning to sleep
Wed Jul 11 19:59:50 2012 + 746240 us TIME_OK
Wed Jul 11 19:59:51 2012 + 246487 us TIME_INS
Wed Jul 11 19:59:51 2012 + 746702 us TIME_INS
Wed Jul 11 19:59:52 2012 + 246965 us TIME_INS
Wed Jul 11 19:59:52 2012 + 747181 us TIME_INS
Wed Jul 11 19:59:53 2012 + 247454 us TIME_INS
Wed Jul 11 19:59:53 2012 + 747677 us TIME_INS
Wed Jul 11 19:59:54 2012 + 247885 us TIME_INS
Wed Jul 11 19:59:54 2012 + 748096 us TIME_INS
Wed Jul 11 19:59:55 2012 + 248371 us TIME_INS
Wed Jul 11 19:59:55 2012 + 748623 us TIME_INS
Wed Jul 11 19:59:56 2012 + 248886 us TIME_INS
Wed Jul 11 19:59:56 2012 + 749087 us TIME_INS
Wed Jul 11 19:59:57 2012 + 249357 us TIME_INS
Wed Jul 11 19:59:57 2012 + 749597 us TIME_INS
Wed Jul 11 19:59:58 2012 + 249793 us TIME_INS
Wed Jul 11 19:59:58 2012 + 750003 us TIME_INS
Wed Jul 11 19:59:59 2012 + 250274 us TIME_INS
Wed Jul 11 19:59:59 2012 + 750494 us TIME_INS
Wed Jul 11 19:59:59 2012 + 250728 us TIME_OOP
Wed Jul 11 19:59:59 2012 + 750938 us TIME_OOP
Wed Jul 11 20:00:00 2012 + 251203 us TIME_WAIT
Wed Jul 11 20:00:00 2012 + 751426 us TIME_WAIT
Wed Jul 11 20:00:01 2012 + 251631 us TIME_WAIT
Wed Jul 11 20:00:01 2012 + 751845 us TIME_WAIT
Wed Jul 11 20:00:02 2012 + 252118 us TIME_WAIT
Leap complete

Created attachment 597663 [details]
Current upstream test for hrtimer expiration
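
For readers comparing runs like the two above, the state column follows the adjtimex(2) clock states: TIME_OK, then TIME_INS once the leap second is scheduled, TIME_OOP while the inserted second is in progress, and TIME_WAIT afterwards. The following is a minimal sketch, not the attached test, that parses output in the format shown in this report and checks that the states never move backwards. The function names (`parse_states`, `states_in_order`, `saw_leap_second`) and the regular expression are hypothetical and assume the log layout seen here.

```python
import re

# Expected order of clock states around an inserted leap second.
# The names match the adjtimex(2) state names printed in the log.
EXPECTED_ORDER = ["TIME_OK", "TIME_INS", "TIME_OOP", "TIME_WAIT"]

# Assumed line format, e.g. "Wed Jul 11 19:59:59 2012 + 250728 us TIME_OOP".
LINE_RE = re.compile(
    r"^(?P<stamp>\w{3} \w{3} +\d+ \d{2}:\d{2}:\d{2} \d{4})"
    r" \+ +(?P<usec>\d+) us (?P<state>TIME_\w+)$"
)


def parse_states(log_text):
    """Return (timestamp, microseconds, state) tuples for matching lines."""
    samples = []
    for line in log_text.splitlines():
        match = LINE_RE.match(line.strip())
        if match:
            samples.append(
                (match.group("stamp"), int(match.group("usec")), match.group("state"))
            )
    return samples


def states_in_order(samples):
    """True if every state is known and never steps back in EXPECTED_ORDER."""
    ranks = []
    for _, _, state in samples:
        if state not in EXPECTED_ORDER:
            return False
        ranks.append(EXPECTED_ORDER.index(state))
    return all(a <= b for a, b in zip(ranks, ranks[1:]))


def saw_leap_second(samples):
    """True if at least one sample was taken during the inserted second."""
    return any(state == "TIME_OOP" for _, _, state in samples)


if __name__ == "__main__":
    sample = """\
Wed Jul 11 19:59:50 2012 + 746240 us TIME_OK
Wed Jul 11 19:59:59 2012 + 750494 us TIME_INS
Wed Jul 11 19:59:59 2012 + 250728 us TIME_OOP
Wed Jul 11 20:00:00 2012 + 251203 us TIME_WAIT
Leap complete
"""
    parsed = parse_states(sample)
    print(states_in_order(parsed), saw_leap_second(parsed))
```

Note that this only checks the state sequence. It does not detect the "hrtimer early expiration failure" line, which the test program reports itself.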
I've run several systems with a modified upstream kernel and the futex patchset and haven't seen any failures in 18+ hours.

P.

Upstream patches are currently in tip.

P.

I've put together a set of patches (that depend on BZ 836748) and have started testing across a large set of systems using the test case previously provided in this BZ. I will update the BZ with testing results, and the patches after my initial testing is complete.

P.

(In reply to comment #2)
> This request was evaluated by Red Hat Product Management for
> inclusion in a Red Hat Enterprise Linux release. Product
> Management has requested further review of this request by
> Red Hat Engineering, for potential inclusion in a Red Hat
> Enterprise Linux release for currently deployed products.
> This request is not yet committed for inclusion in a release.

Is there an estimated time for inclusion of this as an official errata? There are several RHEL customers wanting this errata to come out.

Scott McCarty
Solutions Architect

Scott,

There should not be any urgency surrounding this errata as no other leap seconds are currently scheduled and they are typically announced _years_ in advance.

http://en.wikipedia.org/wiki/Leap_second

Please inform your customers that Engineering is working on a stable and well-tested solution, and that a fix will be in RHEL6.4.

P.

Prarit,

I appreciate the response. The leap second insertion is released in Bulletin C on the IERS website, which is typically 5 months before the June or December of a leap second. Our customers all have tickets open which cannot be closed until they have guaranteed that a fix is in place for their systems. These two facts combined create risk for operations folks and their managers are particularly uncomfortable closing the ticket without closing the loop. This is why a definite release for this patch is so critical. I hope that helps clarify.

I have been educating our customer's operations teams on the RHEL6.4 release. I have explained to them that it would not be released z-stream because the time system is such a critical piece of the kernel.

Best Regards
Scott M

Created attachment 603549 [details]
RHEL PATCH 1/7
Created attachment 603550 [details]
RHEL PATCH 2/7
Created attachment 603551 [details]
RHEL PATCH 3/7
Created attachment 603552 [details]
RHEL PATCH 4/7
Created attachment 603553 [details]
RHEL PATCH 5/7
Created attachment 603554 [details]
RHEL PATCH 6/7
Created attachment 603555 [details]
RHEL PATCH 7/7
Patch(es) available on kernel-2.6.32-298.el6

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

http://rhn.redhat.com/errata/RHSA-2013-0496.html