479765 – Leap second message can hang the kernel

Summary: Leap second message can hang the kernel

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	Red Hat Enterprise Linux 5
Classification:	Red Hat
Component:	kernel
Sub Component:
Version:	5.2
Hardware:	All
OS:	Linux
Priority:	high
Severity:	medium
Target Milestone:	rc
Target Release:	---
Assignee:	Prarit Bhargava
QA Contact:	Red Hat Kernel QE team
Docs Contact:
URL:
Whiteboard:
Duplicates (1):	800289
Depends On:
Blocks:	483701 485920 801794 1300182

Reported:	2009-01-12 22:31 UTC by Chris Adams
Modified:	2019-07-11 07:31 UTC
CC List:	15 users
Fixed In Version:
Doc Type:	Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed:	2009-09-02 08:33:56 UTC
Target Upstream Version:
Embargoed:
Dependent Products:

Attachments
RHEL5 x86/x86_64 fix for this issue (2.75 KB, patch) 2009-01-28 12:13 UTC, Prarit Bhargava	no flags
RHEL5 fix for this issue (4.95 KB, patch) 2009-01-28 20:07 UTC, Prarit Bhargava	no flags

Links
System	ID	Private	Priority	Status	Summary	Last Updated
Red Hat Product Errata	RHSA-2009:1243	0	normal	SHIPPED_LIVE	Important: Red Hat Enterprise Linux 5.4 kernel security and bug fix update	2009-09-01 08:53:34 UTC

Description Chris Adams 2009-01-12 22:31:59 UTC

The kernel attempts to print a message when a leap second is inserted or removed.  This can cause kernel versions prior to 2.6.29 to hang, due to a deadlock on xtime_lock.  See http://lkml.org/lkml/2009/1/2/373 for a trace and explanation.
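The deadlock can be pictured with a small user-space model. This is only a sketch under stated assumptions, not kernel code: `xtime_lock` here is an ordinary non-reentrant Python lock, and `read_time`, `printk`, `leap_second_unsafe` and `leap_second_deferred` are hypothetical stand-ins. It assumes, following the linked LKML explanation, that printing the message ends up needing the current time while the leap second code still holds xtime_lock. The deferred variant shows one generic way to avoid this class of problem; it is not necessarily what the attached patches or 2.6.29 do.

```python
import threading

# Stand-in for the kernel's xtime_lock. Like the real lock, it is not
# reentrant: a holder that tries to take it again waits forever.
xtime_lock = threading.Lock()

LEAP_MSG = "Clock: inserting leap second 23:59:60 UTC"


def read_time():
    """Model a time read that needs xtime_lock.

    Returns False instead of blocking, so the model reports the
    deadlock rather than hanging like the real kernel would.
    """
    got = xtime_lock.acquire(blocking=False)
    if got:
        xtime_lock.release()
    return got


def printk(msg, log):
    """Model printk: emitting a message needs the current time (assumption)."""
    if not read_time():
        raise RuntimeError("deadlock: xtime_lock is already held")
    log.append(msg)


def leap_second_unsafe(log):
    """Print the leap second message while still holding xtime_lock."""
    with xtime_lock:
        printk(LEAP_MSG, log)


def leap_second_deferred(log):
    """Record the message under the lock, print it after the lock is dropped."""
    pending = []
    with xtime_lock:
        pending.append(LEAP_MSG)
    for msg in pending:
        printk(msg, log)
```

In this model `leap_second_unsafe([])` raises `RuntimeError`, which corresponds to the hang, while `leap_second_deferred(log)` appends the message normally.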

The simplest solution for the older RHEL kernels is probably just to remove the leap second printks.

Comment 1 Prarit Bhargava 2009-01-27 23:04:51 UTC

>The simplest solution for the older RHEL kernels is probably just to remove the
>leap second printks.

True, but the leap second printks are useful.

Comment 2 Prarit Bhargava 2009-01-28 12:11:04 UTC

I've spent some time looking at a solution for this.

1.  (Obviously) We currently differ greatly from upstream.  This causes any
solution that keeps the printks to become messy.

2.  Any arch using kernel/timer.c will require a messy modification to individual /arch files.

3.  We just had a leap second between 2008-2009.  No reports of hangs were
seen*.  With all the installations we have around the world, no one has reported a problem.

Based on this evaluation, I'm closing this as WONTFIX.  AFAICT, there is very little or no risk of hitting this, and the risk of making the change far outweighs the benefit.

If worst comes to worst and we do eventually hit this situation, we can revisit the issue.

* This clearly is a bug, however.  It _can_ happen but the window for it to occur seems very very very small.

P.

Comment 3 Prarit Bhargava 2009-01-28 12:13:34 UTC

Created attachment 330223
RHEL5 x86/x86_64 fix for this issue

Comment 4 Chris Adams 2009-01-28 13:47:54 UTC

The reason I chased this down is because I had a RHEL 4 server hang on the leap second.  The same problem exists in both RHEL 4 and RHEL 5 kernels.  I opened bugs for both, but the RHEL 5 case is more important, since it is more likely to still be under support when the next leap second comes around.

If you don't want to do anything invasive to fix it, at least take out the printks.  Yes they are useful, but having servers that run reliably through a leap second is more important.

Comment 5 Prarit Bhargava 2009-01-28 14:09:10 UTC

(In reply to comment #4)
> The reason I chased this down is because I had a RHEL 4 server hang on the leap
> second.  The same problem exists in both RHEL 4 and RHEL 5 kernels.  I opened
> bugs for both, but the RHEL 5 case is more important, since it is more likely
> to still be under support when the next leap second comes around.
> 

Ah -- okay.  I thought you were reporting a hypothetical situation.

> If you don't want to do anything invasive to fix it, at least take out the
> printks.  Yes they are useful, but having servers that run reliably through a
> leap second is more important.

No, I think the right thing to do is to keep the printks.  They are informative to users that a leap second has occurred.

P.

Comment 6 Chris Adams 2009-01-28 14:26:51 UTC

(In reply to comment #5)
> Ah -- okay.  I thought you were reporting a hypothetical situation.

Unfortunately not. :-(  Out of a couple of dozen Linux systems, most running RHEL 4 or 5 (with several Fedora and a couple of CentOS), one RHEL 4 server here hung on New Year's Eve (luckily I wasn't on call).  I just thought it was odd until I saw reports of Linux hangs on /., the NTP newsgroup, and NANOG; then I set up a test environment and tracked it down to the printks.

It only seems to hang when the system is busy; an idle system wouldn't hang, but running "find / -mount -type f | xargs cat > /dev/null" would cause it to hang at the first leap second attempt.

> No, I think the right thing to do is to keep the printks.  They are informative
> to users that a leap second has occurred.

Oh, I definitely like the message (I run NTP with a GPS receiver for stratum 1 accuracy; in other words, I'm a time nut :-) ).  I just wasn't sure how invasive the changes would be to keep it working.

BTW: if you want to test, I reproduced this with a script that used adjtimex to set the flag to insert a leap second, set the clock to 2008-12-31 23:59:59 UTC, watched the clock for a couple of seconds, and looped.  I then started the above find command in another window, and the system crashed at the printk.  I think I still have the script on my test system (powered off at home right now) if you want it.
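The original script is not attached to this bug, so the following is only a hedged reconstruction of the steps described above, not the reporter's code. It assumes a Linux system with glibc, the x86_64 layout of `struct timex`, and root privileges for the calls that change state. The names `Timex`, `with_leap_insert`, `leap_boundary_epoch`, `request_leap_insert` and `run_once` are made up for this sketch. Setting the clock disrupts a running system, so this belongs on a disposable test machine only. How close to midnight the clock needs to be set may need adjusting per kernel.

```python
import calendar
import ctypes
import ctypes.util
import time

ADJ_STATUS = 0x0010  # adjtimex mode bit: update the status word
STA_INS = 0x0010     # status bit: insert a leap second at end of day
STA_DEL = 0x0020     # status bit: delete a leap second at end of day


class Timeval(ctypes.Structure):
    _fields_ = [("tv_sec", ctypes.c_long), ("tv_usec", ctypes.c_long)]


class Timex(ctypes.Structure):
    # Field order follows struct timex in <sys/timex.h> (x86_64 assumed).
    _fields_ = [
        ("modes", ctypes.c_uint), ("offset", ctypes.c_long),
        ("freq", ctypes.c_long), ("maxerror", ctypes.c_long),
        ("esterror", ctypes.c_long), ("status", ctypes.c_int),
        ("constant", ctypes.c_long), ("precision", ctypes.c_long),
        ("tolerance", ctypes.c_long), ("time", Timeval),
        ("tick", ctypes.c_long), ("ppsfreq", ctypes.c_long),
        ("jitter", ctypes.c_long), ("shift", ctypes.c_int),
        ("stabil", ctypes.c_long), ("jitcnt", ctypes.c_long),
        ("calcnt", ctypes.c_long), ("errcnt", ctypes.c_long),
        ("stbcnt", ctypes.c_long), ("tai", ctypes.c_int),
        ("_reserved", ctypes.c_int * 11),
    ]


def with_leap_insert(status):
    """Return the status word with STA_INS set and STA_DEL cleared."""
    return (status | STA_INS) & ~STA_DEL


def leap_boundary_epoch(year):
    """Epoch seconds for Dec 31 23:59:59 UTC of the given year."""
    return calendar.timegm((year, 12, 31, 23, 59, 59))


def request_leap_insert():
    """Ask the kernel to insert a leap second at the end of the day (needs root)."""
    libc = ctypes.CDLL(ctypes.util.find_library("c"), use_errno=True)
    tx = Timex()  # modes == 0 makes the first call a read-only query
    if libc.adjtimex(ctypes.byref(tx)) < 0:
        raise OSError(ctypes.get_errno(), "adjtimex (read) failed")
    tx.modes = ADJ_STATUS
    tx.status = with_leap_insert(tx.status)
    if libc.adjtimex(ctypes.byref(tx)) < 0:
        raise OSError(ctypes.get_errno(), "adjtimex (set) failed")


def run_once(year=2008, watch_seconds=3):
    """One attempt: set the flag, jump to 23:59:59 UTC, watch the clock."""
    request_leap_insert()
    time.clock_settime(time.CLOCK_REALTIME, leap_boundary_epoch(year))
    for _ in range(watch_seconds):
        print(time.strftime("%Y-%m-%d %H:%M:%S", time.gmtime()))
        time.sleep(1)
```

Nothing runs on import. To follow the description above, one would call `run_once()` in a loop as root while the `find ... | xargs cat` load runs in another window.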

Comment 7 Prarit Bhargava 2009-01-28 19:57:21 UTC

(In reply to comment #6)
> (In reply to comment #5)
> > Ah -- okay.  I thought you were reporting a hypothetical situation.
> 
> Unfortunately not. :-(  Out of a couple of dozen Linux systems, most running
> RHEL 4 or 5 (with several Fedora and a couple of CentOS), one RHEL 4 server
> here hung on New Year's Eve (luckily I wasn't on call).  I just thought it was
> odd until I saw reports of Linux hangs on /., the NTP newsgroup, and NANOG;
> then I set up a test environment and tracked it down to the printks.
> 

:(  You should file BZs for F8, F9, and F10.  I'm more than willing to help get patches out, so cc me.

> It only seems to hang when the system is busy; an idle system wouldn't hang,
> but running "find / -mount -type f | xargs cat > /dev/null" would cause it to
> hang at the first leap second attempt.
> 
> > No, I think the right thing to do is to keep the printks.  They are informative
> > to users that a leap second has occurred.
> 
> Oh, I definitely like the message (I run NTP with a GPS receiver for stratum 1
> accuracy; in other words, I'm a time nut :-) ).  I just wasn't sure how
> invasive the changes would be to keep it working.
> 

Exactly :) -- I've got a build running across all arches and I'll attach and post the patch for internal review shortly.

> BTW: if you want to test, I reproduced this with a script that used adjtimex to
> set the flag to insert a leap second, set the clock to 2008-12-31 23:59:59 UTC,
> watched the clock for a couple of seconds, and looped.  I then started the
> above find command in another window, and the system crashed at the printk.  I
> think I still have the script on my test system (powered off at home right now)
> if you want it.

:)  I actually wrote something similar and was able to reproduce the hang & crash.  But thanks -- the offer of the script is appreciated.

P.

Comment 8 Prarit Bhargava 2009-01-28 20:07:46 UTC

Created attachment 330280
RHEL5 fix for this issue

Comment 9 Prarit Bhargava 2009-01-28 20:08:34 UTC

Brew built here:  http://brewweb.devel.redhat.com/brew/taskinfo?taskID=1666146

P.

Comment 10 Chris Adams 2009-01-29 02:00:42 UTC

(In reply to comment #7)
> :(  You should file BZs for F8, F9, and F10.  I'm more than willing to help get
> patches out, so cc me.

I'm not all that concerned about this being a problem for Fedora.  It'll be a while before another leap second (at least 5 months, and probably a couple of years), so any current Fedora is most likely going to be past end of life.  Also, this won't be a problem going forward: kernel 2.6.29 fixes the hang caused by calling printk while holding xtime_lock, so once that version lands in Fedora, the problem goes away.

I wanted to push getting this fixed in RHEL 4 and 5 since I'll probably still have RHEL 5 (and probably a few RHEL 4) servers running the next time there's a leap second, and they'll still be running kernels < 2.6.29.

Thanks for working on this.

Comment 11 RHEL Program Management 2009-01-30 08:35:00 UTC

This request was evaluated by Red Hat Product Management for inclusion in a Red
Hat Enterprise Linux maintenance release.  Product Management has requested
further review of this request by Red Hat Engineering, for potential
inclusion in a Red Hat Enterprise Linux Update release for currently deployed
products.  This request is not yet committed for inclusion in an Update
release.

Comment 12 Don Zickus 2009-02-09 18:25:56 UTC

The patch is in kernel-2.6.18-131.el5.
You can download this test kernel from http://people.redhat.com/dzickus/el5

Please do NOT transition this bugzilla state to VERIFIED until our QE team
has sent specific instructions indicating when to do so.  However feel free
to provide a comment indicating that this fix has been verified.

Comment 13 RHEL Program Management 2009-02-16 15:05:37 UTC

Updating PM score.

Comment 18 errata-xmlrpc 2009-09-02 08:33:56 UTC

An advisory has been issued which should resolve the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHSA-2009-1243.html

Comment 19 Chris Williams 2012-03-09 14:04:28 UTC

*** Bug 800289 has been marked as a duplicate of this bug. ***

Comment 21 linz 2012-06-25 11:16:21 UTC

(In reply to comment #6)
> (In reply to comment #5)
> > Ah -- okay.  I thought you were reporting a hypothetical situation.
> 
> Unfortunately not. :-(  Out of a couple of dozen Linux systems, most running
> RHEL 4 or 5 (with several Fedora and a couple of CentOS), one RHEL 4 server
> here hung on New Year's Eve (luckily I wasn't on call).  I just thought it
> was odd until I saw reports of Linux hangs on /., the NTP newsgroup, and
> NANOG; then I set up a test environment and tracked it down to the printks.
> 
> It only seems to hang when the system is busy; an idle system wouldn't hang,
> but running "find / -mount -type f | xargs cat > /dev/null" would cause it
> to hang at the first leap second attempt.
> 
> > No, I think the right thing to do is to keep the printks.  They are informative
> > to users that a leap second has occurred.
> 
> Oh, I definitely like the message (I run NTP with a GPS receiver for stratum
> 1 accuracy; in other words, I'm a time nut :-) ).  I just wasn't sure how
> invasive the changes would be to keep it working.
> 
> BTW: if you want to test, I reproduced this with a script that used adjtimex
> to set the flag to insert a leap second, set the clock to 2008-12-31
> 23:59:59 UTC, watched the clock for a couple of seconds, and looped.  I then
> started the above find command in another window, and the system crashed at
> the printk.  I think I still have the script on my test system (powered off
> at home right now) if you want it.

I want to test, please send the script to zhang__3125

Comment 22 David Tonhofer 2012-07-01 18:29:38 UTC

Could it be that this is still active on RH6.2 and RH6.3 on 2012-07-01 00:00 GMT???

Comment 23 Prarit Bhargava 2012-07-01 18:47:31 UTC

(In reply to comment #22)
> Could it be that this is still active on RH6.2 and RH6.3 on 2012-07-01 00:00
> GMT???

David -- no, however, other bugs were identified that impact 6.2 and 6.3.  Please contact your support representative for details.

P.

Comment 24 Ari Lemmke 2012-07-02 07:58:12 UTC

Noticed jboss running 375% cpu load (1.6GB res mem) on RHEL6 (RHEV3 manager). Caused load 75+ on Xeon E5506-based machine (20GB RAM).

Also RHEL6 based hypervisor loops there. load ~70. (2*X5650,128GB).
(machine is quite jammed, but has 55GB free memory).

So we shut down our client's RHEV3 test environment. Not nice.

Is "contact your support representative" now going to be the standard answer?

Why do you continuously close bugs when all the effects are not properly examined?

//arl

Comment 25 David Tonhofer 2012-07-02 08:14:03 UTC

Hi Ari,

It's indeed a new bug - the great leap second weekend disaster of 2012, which is somehow linked to Java

https://access.redhat.com/knowledge/articles/15145 - Leap Seconds in Red Hat Enterprise Linux (07/01/12 - 11:03)

Comment 26 Ari Lemmke 2012-07-02 08:54:14 UTC

http://pedroalves-bi.blogspot.fi/2012/07/java-leap-second-bug-how-to-fix-your.html

Still wondering why - leap seconds do occur.

http://en.wikipedia.org/wiki/Leap_second

//arl

Comment 27 Ari Lemmke 2012-07-02 09:06:49 UTC

Btw, Jan 1 and Jul 1 are really quite expensive dates for bugs and problems.

July is the holiday month here, and Jan 1 is also a day off.

If people on holiday need to be called in for these kinds of bugs, it easily means N * 100% extra costs.

Wish they could select other dates for leap seconds.

//arl

Comment 28 M.T 2012-07-03 12:59:41 UTC

We had the problem on two CentOS 6 iSCSI target servers. We got this error in the messages file:

"tgtd: work_timer_evt_handler(89) failed to read from timerfd, Resource temporarily unavailable"

After running the command
 /etc/init.d/ntpd stop; date; date `date +"%m%d%H%M%C%y.%S"`; date;

the problem was solved.
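For readers puzzling over the format string in that command: `%m%d%H%M%C%y.%S` produces the positional `MMDDhhmmCCYY.ss` form that `date` accepts for setting the clock, so the command simply sets the clock to the time it already shows. A small sketch of the same string, with a hypothetical helper name `date_set_arg`, and using `%Y` in place of `%C%y` (equivalent for four-digit years):

```python
from datetime import datetime


def date_set_arg(now):
    """Format a datetime as MMDDhhmmCCYY.ss, the positional form used by date."""
    return now.strftime("%m%d%H%M%Y.%S")


# Example with the timestamp of this comment (12:59:41 on 2012-07-03).
example = date_set_arg(datetime(2012, 7, 3, 12, 59, 41))
```

Here `example` is `"070312592012.41"`: month 07, day 03, 12:59, year 2012, seconds 41.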

Comment 29 Andrew Meredith 2012-07-03 15:31:48 UTC

FYI, the widely circulated fix of:

  # date -s "`date -u`"

as root also fixes the symptoms.

I watched 4 thrashing JVMs on 4 separate servers relax from 400-500% CPU to 1% or so instantly on entering this command.
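Both workarounds in this thread boil down to writing the current time back into the system clock. As far as is generally understood, it is the act of setting the clock, not the value written, that clears the symptoms. A minimal sketch of the same idea, where `reset_clock` and its `now` and `setter` parameters are hypothetical names introduced so the logic can be exercised without root:

```python
import time


def reset_clock(now=time.time, setter=None):
    """Set CLOCK_REALTIME to the value it already has and return that value.

    The default setter needs root on Linux; a fake setter can be passed
    in to try the logic without touching the real clock.
    """
    if setter is None:
        def setter(t):
            time.clock_settime(time.CLOCK_REALTIME, t)
    current = now()
    setter(current)
    return current
```

Run as root, `reset_clock()` with no arguments is roughly what the `date -s` command does, except that it keeps sub-second precision.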

Comment 30 Chris Adams 2012-07-03 15:45:10 UTC

This bug is about a specific leap-second deadlock that was fixed years ago.  Please stop commenting in this bug about unrelated leap-second problems that occurred in 2012.

Comment 31 Andrew Meredith 2012-07-03 16:30:11 UTC

I was going on the basis that until Red Hat actually posts something about the current issue, people desperate for a fix will end up here. I did, as did several of my colleagues and the above commenters. My apologies if this offends you.

Comment 32 Brian 2012-07-03 18:32:15 UTC

(In reply to comment #30)
> This bug is about a specific leap-second deadlock that was fixed years ago. 
> Please stop commenting in this bug about unrelated leap-second problems that
> occurred in 2012.

As Andrew said, people are ending up here.  So thanks for posting the fix to the current leap second bug in the totally unrelated kernel bug that was fixed years ago.  I sure appreciate it.

Comment 33 Prarit Bhargava 2012-07-06 19:30:37 UTC

For those of you who are interested in the RHEL6 resolutions to the leap second issues, please see

https://bugzilla.redhat.com/show_bug.cgi?id=836748
https://bugzilla.redhat.com/show_bug.cgi?id=836803

Thanks,

P.