The kernel attempts to print a message when a leap second is inserted or removed. This can cause kernel versions prior to 2.6.29 to hang, due to a deadlock on xtime_lock. See http://lkml.org/lkml/2009/1/2/373 for a trace and explanation. The simplest solution for the older RHEL kernels is probably just to remove the leap second printks.
> The simplest solution for the older RHEL kernels is probably just to remove the leap second printks.
True, but the leap second printks are useful.
I've spent some time looking at a solution for this.
1. (Obviously) We currently differ greatly from upstream, so any solution that keeps the printks becomes messy.
2. Any arch using kernel/timer.c will require a messy modification to individual /arch files.
3. We just had a leap second between 2008 and 2009, and no reports of hangs were seen*. With all the installations we have around the world, no one has reported a problem.
Based on this evaluation, I'm closing this as WONTFIX. AFAICT, there is very little or no risk of hitting this, and the risk of an invasive fix far outweighs the benefit. If worst comes to worst and we do eventually hit this situation, we can revisit the issue.
* This clearly is a bug, however. It _can_ happen, but the window for it to occur seems very, very small.
P.
Created attachment 330223 [details] RHEL5 x86/x86_64 fix for this issue
The reason I chased this down is that I had a RHEL 4 server hang on the leap second. The same problem exists in both RHEL 4 and RHEL 5 kernels. I opened bugs for both, but the RHEL 5 case is more important, since it is more likely to still be under support when the next leap second comes around. If you don't want to do anything invasive to fix it, at least take out the printks. Yes, they are useful, but having servers that run reliably through a leap second is more important.
(In reply to comment #4)
> The reason I chased this down is because I had a RHEL 4 server hang on the leap second. The same problem exists between RHEL 4 and RHEL 5 kernels. I opened bugs for both, but the RHEL 5 case is more important, since it is more likely to still be under support when the next leap second comes around.
Ah -- okay. I thought you were reporting a hypothetical situation.
> If you don't want to do anything invasive to fix it, at least take out the printks. Yes they are useful, but having servers that run reliably through a leap second is more important.
No, I think the right thing to do is to keep the printks. They inform users that a leap second has occurred.
P.
(In reply to comment #5)
> Ah -- okay. I thought you were reporting a hypothetical situation.
Unfortunately not. :-( Out of a couple of dozen Linux systems, most running RHEL 4 or 5 (with several Fedora and a couple of CentOS), one RHEL 4 server here hung on New Year's Eve (luckily I wasn't on call). I just thought it was odd until I saw reports of Linux hangs on /., the NTP newsgroup, and NANOG; then I set up a test environment and tracked it down to the printks.
It only seems to hang when the system is busy; an idle system wouldn't hang, but running "find / -mount -type f | xargs cat > /dev/null" would cause it to hang at the first leap second attempt.
> No, I think the right thing to do is to keep the printks. They are informative to users that a leap second has occurred.
Oh, I definitely like the message (I run NTP with a GPS receiver for stratum 1 accuracy; in other words, I'm a time nut :-) ). I just wasn't sure how invasive the changes would be to keep it working.
BTW: if you want to test, I reproduced this with a script that used adjtimex to set the flag to insert a leap second, set the clock to 2008-12-31 23:59:59 UTC, watched the clock for a couple of seconds, and looped. I then started the above find command in another window, and the system crashed at the printk. I think I still have the script on my test system (powered off at home right now) if you want it.
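The reproduction loop described above might look roughly like the sketch below. This is not the reporter's script: the function names and the DRY_RUN switch are made up for illustration, and it assumes the adjtimex(8) command-line utility is installed and accepts --status. By default it only prints what it would do. Running it for real needs root on a disposable test machine, and ntpd would probably have to be stopped first so that it does not overwrite the status flag.

```shell
#!/bin/sh
# Hypothetical sketch of a leap-second reproducer; names are illustrative.

STA_INS=16   # 0x0010 in <sys/timex.h>: insert a leap second at end of UTC day

# Print the command by default; execute it only when DRY_RUN=0 is set.
run() {
    if [ "${DRY_RUN:-1}" = "0" ]; then
        "$@"
    else
        echo "would run: $*"
    fi
}

# One attempt: arm the insert flag, then jump to just before UTC midnight.
arm_leap_second() {
    run adjtimex --status "$STA_INS"
    run date -u -s "2008-12-31 23:59:59"
}

# Repeat the attempt, watching the clock for a few seconds each time.
reproduce() {
    attempts=${1:-3}
    i=0
    while [ "$i" -lt "$attempts" ]; do
        arm_leap_second
        run sleep 3
        run date -u
        i=$((i + 1))
    done
}
```

To try it for real, one would source the file, start the find load from the comment above in another window, and then call "DRY_RUN=0 reproduce 10" as root. Whether a given kernel hangs will depend on how busy the system is, as noted above.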
(In reply to comment #6)
> (In reply to comment #5)
> > Ah -- okay. I thought you were reporting a hypothetical situation.
> Unfortunately not. :-( Out of a couple of dozen Linux systems, most running RHEL 4 or 5 (with several Fedora and a couple of CentOS), one RHEL 4 server here hung on New Year's Eve (luckily I wasn't on call). I just thought it was odd until I saw reports of Linux hangs on /., the NTP newsgroup, and NANOG; then I set up a test environment and tracked it down to the printks.
:( You should file BZs for F8, F9, and F10. I'm more than willing to help get patches out, so cc me.
> It only seems to hang when the system is busy; an idle system wouldn't hang, but running "find / -mount -type f | xargs cat > /dev/null" would cause it to hang at the first leap second attempt.
> > No, I think the right thing to do is to keep the printks. They are informative to users that a leap second has occurred.
> Oh I definately like the message (I run NTP with a GPS receiver for stratum 1 accuracy; in other words, I'm a time nut :-) ). I just wasn't sure how invasive the changes would be to keep it working.
Exactly :) -- I've got a build running across all arches and I'll attach and post the patch for internal review shortly.
> BTW: if you want to test, I reproduced this with a script that used adjtimex to set the flag to insert a leap second, set the clock to 2008-12-31 23:59:59 UTC, watched the clock for a couple of seconds, and looped. I then started the above find command in another window, and the system crashed at the printk. I think I still have the script on my test system (powered off at home right now) if you want it.
:) I actually wrote something similar and was able to reproduce the hang & crash. But thanks -- the offer of the script is appreciated.
P.
Created attachment 330280 [details] RHEL5 fix for this issue
Brew built here: http://brewweb.devel.redhat.com/brew/taskinfo?taskID=1666146 P.
(In reply to comment #7)
> :( You should file BZs for F8, F9, and F10. I'm more than willing to help get patches out, so cc me.
I'm not all that concerned about this being a problem for Fedora. It'll be a while before another leap second (at least 5 months, and probably a couple of years), so any current Fedora is most likely going to be past end of life. Also, this won't be a problem going forward: kernel 2.6.29 fixes the hang caused by calling printk while holding the xtime lock, so once that version lands in Fedora, there won't be a problem.
I wanted to push getting this fixed in RHEL 4 and 5 since I'll probably still have RHEL 5 (and probably a few RHEL 4) servers running the next time there's a leap second, and they'll still be running kernels < 2.6.29.
Thanks for working on this.
This request was evaluated by Red Hat Product Management for inclusion in a Red Hat Enterprise Linux maintenance release. Product Management has requested further review of this request by Red Hat Engineering, for potential inclusion in a Red Hat Enterprise Linux Update release for currently deployed products. This request is not yet committed for inclusion in an Update release.
in kernel-2.6.18-131.el5
You can download this test kernel from http://people.redhat.com/dzickus/el5
Please do NOT transition this bugzilla state to VERIFIED until our QE team has sent specific instructions indicating when to do so. However, feel free to provide a comment indicating that this fix has been verified.
Updating PM score.
An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you. http://rhn.redhat.com/errata/RHSA-2009-1243.html
*** Bug 800289 has been marked as a duplicate of this bug. ***
(In reply to comment #6)
> BTW: if you want to test, I reproduced this with a script that used adjtimex to set the flag to insert a leap second, set the clock to 2008-12-31 23:59:59 UTC, watched the clock for a couple of seconds, and looped. I then started the above find command in another window, and the system crashed at the printk. I think I still have the script on my test system (powered off at home right now) if you want it.
I want to test, please send the script to zhang__3125
Could it be that this is still active on RH6.2 and RH6.3 on 2012-07-01 00:00 GMT???
(In reply to comment #22)
> Could it be that this is still active on RH6.2 and RH6.3 on 2012-07-01 00:00 GMT???
David -- no. However, other bugs were identified that impact 6.2 and 6.3. Please contact your support representative for details.
P.
Noticed jboss running at 375% CPU load (1.6GB res mem) on RHEL6 (RHEV3 manager). This caused a load of 75+ on a Xeon E5506-based machine (20GB RAM). A RHEL6-based hypervisor is also looping there, load ~70 (2*X5650, 128GB); the machine is quite jammed, but has 55GB free memory. So we shut down our client's RHEV3 test environment. Not nice. Is "contact your support representative" now going to be the standard answer? Why do you continuously close bugs when all the effects are not properly examined? //arl
Hi Ari, It's indeed a new bug - the great leap second weekend disaster of 2012, which is somehow linked to Java https://access.redhat.com/knowledge/articles/15145 - Leap Seconds in Red Hat Enterprise Linux (07/01/12 - 11:03)
http://pedroalves-bi.blogspot.fi/2012/07/java-leap-second-bug-how-to-fix-your.html Still wondering why - leap seconds do occur. http://en.wikipedia.org/wiki/Leap_second //arl
Btw, Jan 1 and Jul 1 are really quite expensive dates for bugs and problems. July is the holiday month here, and Jan 1 is also a day off. If people on holiday need to be called in for these kinds of bugs, it easily means N * 100% extra costs. Wish they could select other dates for leap seconds. //arl
We had the problem on 2 CentOS 6 iSCSI target servers. We got this error in the messages file: "tgtd: work_timer_evt_handler(89) failed to read from timerfd, Resource temporarily unavailable". After running the command /etc/init.d/ntpd stop; date; date `date +"%m%d%H%M%C%y.%S"`; date; the problem was solved.
FYI, the widely circulated fix of: # date -s "`date -u`" as root also fixes the symptoms. I watched 4 thrashing JVMs on 4 separate servers relax from 400-500% CPU to 1% or so instantly on entering this command.
This bug is about a specific leap-second deadlock that was fixed years ago. Please stop commenting in this bug about unrelated leap-second problems that occurred in 2012.
I was going on the basis that until Red Hat actually post something about the current issue, people desperate for a fix will end up here. I did, as did several of my colleagues and the above commenters. My apologies if this offends you.
(In reply to comment #30)
> This bug is about a specific leap-second deadlock that was fixed years ago. Please stop commenting in this bug about unrelated leap-second problems that occurred in 2012.
As Andrew said, people are ending up here. So thanks for posting the fix to the current leap second bug in the totally unrelated kernel bug that was fixed years ago. I sure appreciate it.
For those of you who are interested in the RHEL6 resolutions to the leap second issues, please see
https://bugzilla.redhat.com/show_bug.cgi?id=836748
https://bugzilla.redhat.com/show_bug.cgi?id=836803
Thanks, P.