Red Hat Bugzilla – Full Text Bug Listing
|Summary:||RHEL6: Potential fix for leapsecond caused futex related load spikes|
|Product:||Red Hat Enterprise Linux 6||Reporter:||Prarit Bhargava <prarit>|
|Component:||kernel||Assignee:||Prarit Bhargava <prarit>|
|Status:||CLOSED ERRATA||QA Contact:||Dong Zhu <dZhu>|
|Version:||6.4||CC:||bugzilla.redhat.com, caiqian, chorn, czhang, dhoward, hartsjc, hfuchi, ifloodmu, jeder, jipan, jjasghar, jwest, liko, mmahudha, mmilgram, myamazak, pep, qguo, rfujita, rrosario, SCHAKRAB, sforsber, smccarty, vgaikwad, yh.choi|
|Fixed In Version:||kernel-2.6.32-298.el6||Doc Type:||Bug Fix|
|Doc Text:||Story Points:||---|
|Last Closed:||2013-02-21 01:29:54 EST||Type:||Bug|
|oVirt Team:||---||RHEL 7.3 requirements from Atomic Host:|
|Bug Depends On:|
|Bug Blocks:||782183, 840683, 1300182, 847364, 847365, 847366|
Description Prarit Bhargava 2012-07-01 10:43:58 EDT
Description of problem:
After the leap second on June 30, 2012, load spikes were noticed in userspace. After some debugging it was found that futexes were timing out, which caused CPU load to increase dramatically.

[FWIW: I noticed this myself last night. My firefox suddenly consumed 98.9% of the CPU shortly after the leap second. Resetting the date and restarting firefox resolved the problem.]

Version-Release number of selected component (if applicable):
2.6.32-279

How reproducible:
Unknown at this time. Probably fairly high.

Steps to Reproduce:
A reproducer is available here: http://marc.info/?l=linux-kernel&m=134113615122011&w=2

Actual results:
Userspace programs consume ~100% of CPU time because of futex timeouts.

Expected results:
Futexes should not time out.

Additional info:
http://marc.info/?l=linux-kernel&m=134113577921904&w=2 has an RFC patch and reproducer attached.
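The description says only that futexes were timing out and that CPU load rose as a result. One commonly cited explanation, which is an assumption here and is not spelled out in this report, is that after the leap second the timer code's view of the realtime clock was one second ahead of the actual clock, so a timed wait shorter than one second expired immediately and any caller that retries in a loop ended up spinning. The toy model below is a sketch of that idea only: the function name, parameters, and the 1 us per-iteration cost are all hypothetical, and it touches no kernel interface.

```python
def count_wakeups(timeout_us, stale_offset_us, duration_us, loop_cost_us=1):
    """Toy model of a thread that repeatedly does a timed wait.

    timeout_us      -- relative timeout the thread asks for on each wait
    stale_offset_us -- how far ahead of the real clock the timer code
                       believes the clock to be (assumption: 1 s after
                       an unhandled leap second, 0 when healthy)
    duration_us     -- how long to simulate
    loop_cost_us    -- hypothetical CPU cost of one loop iteration
    """
    now_us = 0
    wakeups = 0
    while now_us < duration_us:
        deadline_us = now_us + timeout_us
        # The timer fires once the (possibly skewed) clock reaches the
        # deadline, but never before the wait was started.
        fire_us = max(now_us, deadline_us - stale_offset_us)
        now_us = fire_us + loop_cost_us
        wakeups += 1
    return wakeups


# A 10 ms timed wait, simulated for 100 ms.
print(count_wakeups(10_000, 0, 100_000))          # healthy clock: 10
print(count_wakeups(10_000, 1_000_000, 100_000))  # 1 s stale offset: 100000
```

In the healthy case the thread sleeps for most of the interval. With the stale offset every wait returns at once, so the loop runs as fast as the per-iteration cost allows, which is consistent with the ~100% CPU usage described above.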
Comment 1 Prarit Bhargava 2012-07-01 10:44:39 EDT
Working on a backport now. Backport is likely to depend on fix for bug 836748. P.
Comment 2 RHEL Product and Program Management 2012-07-01 10:50:47 EDT
This request was evaluated by Red Hat Product Management for inclusion in a Red Hat Enterprise Linux release. Product Management has requested further review of this request by Red Hat Engineering, for potential inclusion in a Red Hat Enterprise Linux release for currently deployed products. This request is not yet committed for inclusion in a release.
Comment 3 Prarit Bhargava 2012-07-06 11:25:53 EDT
The current situation is as follows:

A patchset has been posted upstream, http://marc.info/?l=linux-kernel&m=134138316402296&w=2, which has an Acked-by: me.

I've tested this patchset on an upstream kernel using the following tests:

1. A leap second test I wrote (but which is VERY similar to)
2. http://marc.info/?l=linux-kernel&m=134116789230177&w=2, and
3. http://marc.info/?l=linux-kernel&m=134116789230177&w=2 with and without the "-s" option.

So far all tests have been successful.

I am now doing a wider test for RHEL6 with a significant backport of patches to the kernel/time/timekeeping.c code + the patches from upstream. This testing is ongoing; thus far, however, boot testing has not picked up any issues.

A smaller patchset has been identified as well that will resolve only the leapsecond issue, but still leaves the clock code susceptible to some other smaller races and issues. I've decided to go with the larger set for the sake of completeness -- besides, we're changing core code already and I don't see any reason not to broaden the changes at this point.

Watch here for further BZ updates.

P.
Comment 6 Prarit Bhargava 2012-07-11 16:11:49 EDT
upstream kernel testing ....

[root@intel-canoepass-05 tmp]# uname -a
Linux intel-canoepass-05.lab.bos.redhat.com 3.4.4-5.fc17.x86_64 #1 SMP Thu Jul 5 20:20:59 UTC 2012 x86_64 x86_64 x86_64 GNU/Linux
[root@intel-canoepass-05 tmp]# ./leap-a-day -s
Setting time to Sun Jul 22 20:00:00 2012
Scheduling leap second for Sun Jul 22 20:00:00 2012
Sun Jul 22 19:59:57 2012 + 500283 us TIME_INS
Sun Jul 22 19:59:58 2012 + 521 us TIME_INS
Sun Jul 22 19:59:58 2012 + 500770 us TIME_INS
Sun Jul 22 19:59:59 2012 + 1011 us TIME_INS
Sun Jul 22 19:59:59 2012 + 501289 us TIME_INS
Sun Jul 22 19:59:59 2012 + 6806 us TIME_OOP
Sun Jul 22 19:59:59 2012 + 506946 us TIME_OOP
Sun Jul 22 20:00:00 2012 + 7143 us TIME_WAIT
Sun Jul 22 20:00:00 2012 + 507274 us TIME_WAIT
Sun Jul 22 20:00:01 2012 + 7516 us TIME_WAIT
Sun Jul 22 20:00:01 2012 + 507652 us TIME_WAIT
Sun Jul 22 20:00:02 2012 + 7898 us TIME_WAIT
Note: hrtimer early expiration failure observed.
Leap complete

.............................................................................

Modified kernel with upstream patches:

[root@intel-canoepass-05 tmp]# uname -a
Linux intel-canoepass-05.lab.bos.redhat.com 3.5.0-rc6+ #2 SMP Wed Jul 11 14:51:10 EDT 2012 x86_64 x86_64 x86_64 GNU/Linux
[root@intel-canoepass-05 tmp]#
[root@intel-canoepass-05 tmp]# ./leap-a-day -s
This runs continuously. Press ctrl-c to stop
Setting time to speed up testing
Setting time to Wed Jul 11 20:00:00 2012
Scheduling leap second for Wed Jul 11 20:00:00 2012
Something woke us up, returning to sleep
Wed Jul 11 19:59:50 2012 + 746240 us TIME_OK
Wed Jul 11 19:59:51 2012 + 246487 us TIME_INS
Wed Jul 11 19:59:51 2012 + 746702 us TIME_INS
Wed Jul 11 19:59:52 2012 + 246965 us TIME_INS
Wed Jul 11 19:59:52 2012 + 747181 us TIME_INS
Wed Jul 11 19:59:53 2012 + 247454 us TIME_INS
Wed Jul 11 19:59:53 2012 + 747677 us TIME_INS
Wed Jul 11 19:59:54 2012 + 247885 us TIME_INS
Wed Jul 11 19:59:54 2012 + 748096 us TIME_INS
Wed Jul 11 19:59:55 2012 + 248371 us TIME_INS
Wed Jul 11 19:59:55 2012 + 748623 us TIME_INS
Wed Jul 11 19:59:56 2012 + 248886 us TIME_INS
Wed Jul 11 19:59:56 2012 + 749087 us TIME_INS
Wed Jul 11 19:59:57 2012 + 249357 us TIME_INS
Wed Jul 11 19:59:57 2012 + 749597 us TIME_INS
Wed Jul 11 19:59:58 2012 + 249793 us TIME_INS
Wed Jul 11 19:59:58 2012 + 750003 us TIME_INS
Wed Jul 11 19:59:59 2012 + 250274 us TIME_INS
Wed Jul 11 19:59:59 2012 + 750494 us TIME_INS
Wed Jul 11 19:59:59 2012 + 250728 us TIME_OOP
Wed Jul 11 19:59:59 2012 + 750938 us TIME_OOP
Wed Jul 11 20:00:00 2012 + 251203 us TIME_WAIT
Wed Jul 11 20:00:00 2012 + 751426 us TIME_WAIT
Wed Jul 11 20:00:01 2012 + 251631 us TIME_WAIT
Wed Jul 11 20:00:01 2012 + 751845 us TIME_WAIT
Wed Jul 11 20:00:02 2012 + 252118 us TIME_WAIT
Leap complete
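For readers unfamiliar with the leap-a-day output, the short sketch below may help in reading it. It parses timestamp lines of the form shown in this comment, collapses the kernel clock states into the order they were seen, and reports the state of any sample whose wall-clock time is earlier than the previous sample's (the repeated 19:59:59 second is the inserted leap second). The regular expression and the helper names are assumptions made for this illustration and are not part of the leap-a-day tool; the sample lines are copied from the 3.4.4 run.

```python
import re

# Matches e.g. "19:59:59 2012 + 501289 us TIME_INS" inside a log line.
LINE_RE = re.compile(r"(\d\d):(\d\d):(\d\d) \d{4} \+ (\d+) us (TIME_\w+)")


def parse(log):
    """Return a list of (microseconds since midnight, state) samples."""
    samples = []
    for m in LINE_RE.finditer(log):
        hh, mm, ss, usec, state = m.groups()
        t_us = ((int(hh) * 60 + int(mm)) * 60 + int(ss)) * 1_000_000 + int(usec)
        samples.append((t_us, state))
    return samples


def state_sequence(samples):
    """Collapse consecutive duplicates, keeping the order of first appearance."""
    seq = []
    for _, state in samples:
        if not seq or seq[-1] != state:
            seq.append(state)
    return seq


def backward_steps(samples):
    """States of samples whose wall-clock time precedes the previous sample's."""
    return [cur[1] for prev, cur in zip(samples, samples[1:]) if cur[0] < prev[0]]


SAMPLE = """\
Sun Jul 22 19:59:59 2012 + 1011 us TIME_INS
Sun Jul 22 19:59:59 2012 + 501289 us TIME_INS
Sun Jul 22 19:59:59 2012 + 6806 us TIME_OOP
Sun Jul 22 19:59:59 2012 + 506946 us TIME_OOP
Sun Jul 22 20:00:00 2012 + 7143 us TIME_WAIT
"""

samples = parse(SAMPLE)
print(state_sequence(samples))   # ['TIME_INS', 'TIME_OOP', 'TIME_WAIT']
print(backward_steps(samples))   # ['TIME_OOP']
```

TIME_INS means a leap second insertion is pending, TIME_OOP that the leap second is in progress, and TIME_WAIT that it has occurred. The single backward step coinciding with the switch to TIME_OOP is the expected shape of an inserted second in both runs. This sketch does not detect the hrtimer early expiration that the tool itself reports.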
Comment 7 Prarit Bhargava 2012-07-11 16:13:12 EDT
Created attachment 597663 [details] Current upstream test for hrtimer expiration
Comment 8 Prarit Bhargava 2012-07-12 09:07:19 EDT
I've run several systems with a modified upstream kernel and the futex patchset and haven't seen any failures in 18+ hours. P.
Comment 9 Prarit Bhargava 2012-07-16 08:54:30 EDT
Upstream patches are currently in tip. P.
Comment 12 Prarit Bhargava 2012-07-28 08:12:36 EDT
I've put together a set of patches (that depend on BZ 836748) and have started testing across a large set of systems using the test case previously provided in this BZ. I will update the BZ with testing results, and the patches after my initial testing is complete. P.
Comment 13 Scott McCarty 2012-07-30 09:21:16 EDT
(In reply to comment #2)
> This request was evaluated by Red Hat Product Management for
> inclusion in a Red Hat Enterprise Linux release. Product
> Management has requested further review of this request by
> Red Hat Engineering, for potential inclusion in a Red Hat
> Enterprise Linux release for currently deployed products.
> This request is not yet committed for inclusion in a release.

Is there an estimated time for inclusion of this as an official errata? There are several RHEL customers wanting this errata to come out.

Scott McCarty
Solutions Architect
Comment 14 Prarit Bhargava 2012-07-30 09:25:24 EDT
Scott, There should not be any urgency surrounding this errata as no other leap seconds are currently scheduled and they are typically announced _years_ in advance. http://en.wikipedia.org/wiki/Leap_second Please inform your customers that Engineering is working on a stable and well-tested solution, and that a fix will be in RHEL6.4. P.
Comment 15 Scott McCarty 2012-07-31 22:53:14 EDT
Prarit,

I appreciate the response. The leap second insertion is announced in Bulletin C on the IERS website, which is typically published 5 months before the June or December of a leap second. Our customers all have tickets open which cannot be closed until they have a guarantee that a fix is in place for their systems. These two facts combined create risk for operations folks, and their managers are particularly uncomfortable closing the ticket without closing the loop. This is why a definite release for this patch is so critical. I hope that helps clarify.

I have been educating our customers' operations teams on the RHEL6.4 release. I have explained to them that it would not be released in z-stream because the time system is such a critical piece of the kernel.

Best Regards
Scott M
Comment 19 Prarit Bhargava 2012-08-10 09:56:50 EDT
Created attachment 603549 [details] RHEL PATCH 1/7
Comment 20 Prarit Bhargava 2012-08-10 09:56:54 EDT
Created attachment 603550 [details] RHEL PATCH 2/7
Comment 21 Prarit Bhargava 2012-08-10 09:56:57 EDT
Created attachment 603551 [details] RHEL PATCH 3/7
Comment 22 Prarit Bhargava 2012-08-10 09:57:01 EDT
Created attachment 603552 [details] RHEL PATCH 4/7
Comment 23 Prarit Bhargava 2012-08-10 09:57:07 EDT
Created attachment 603553 [details] RHEL PATCH 5/7
Comment 24 Prarit Bhargava 2012-08-10 09:57:10 EDT
Created attachment 603554 [details] RHEL PATCH 6/7
Comment 25 Prarit Bhargava 2012-08-10 09:57:13 EDT
Created attachment 603555 [details] RHEL PATCH 7/7
Comment 32 Jarod Wilson 2012-08-16 11:01:39 EDT
Patch(es) available on kernel-2.6.32-298.el6
Comment 37 errata-xmlrpc 2013-02-21 01:29:54 EST
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. http://rhn.redhat.com/errata/RHSA-2013-0496.html