575874 – ntpd can die due to ulimit on 64bit arch

Summary: ntpd can die due to ulimit on 64bit arch

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	Red Hat Enterprise Linux 5
Classification:	Red Hat
Component:	ntp
Sub Component:
Version:	5.4
Hardware:	x86_64
OS:	Linux
Priority:	medium
Severity:	medium
Target Milestone:	rc
Target Release:	---
Assignee:	Miroslav Lichvar
QA Contact:	qe-baseos-daemons
Docs Contact:
URL:
Whiteboard:
Duplicates (1):	672571 (view as bug list)
Depends On:
Blocks:

Reported:	2010-03-22 15:46 UTC by Martin Poole
Modified:	2018-11-27 19:53 UTC (History)
CC List:	12 users (show)
Fixed In Version:
Doc Type:	Bug Fix
Doc Text:	The ntpd daemon could terminate unexpectedly due to a low memory lock limit. With this update, the memory lock limit has been doubled.
Clone Of:
Environment:
Last Closed:	2011-07-21 06:42:46 UTC
Target Upstream Version:
Embargoed:
Dependent Products:

Attachments
increase memlock rlimit (580 bytes, patch) 2011-01-20 11:32 UTC, Miroslav Lichvar
ntpd.log during the stage when it dies (3.25 KB, application/rtf) 2011-03-09 14:15 UTC, dushy2010
ntp.conf (1.83 KB, text/plain) 2011-03-23 06:20 UTC, dushy2010

Links
System	ID	Private	Priority	Status	Summary	Last Updated
Red Hat Knowledge Base (Legacy)	27385	0	None	None	None	Never
Red Hat Product Errata	RHBA-2011:0980	0	normal	SHIPPED_LIVE	ntp bug fix and enhancement update	2011-07-20 15:45:14 UTC

Description Martin Poole 2010-03-22 15:46:53 UTC

Description of problem:

ntpd performs setrlimit and mlock calls to catch bloating or leaks.

The limit is hard-coded at a value that is too low for 64bit systems used as time servers for a modest-sized network. This is due to the larger allocations used for libraries on 64bit.

The stack limit is set with setrlimit(RLIMIT_STACK) to 50 * 4096 bytes,
and the locked-memory limit with setrlimit(RLIMIT_MEMLOCK) to 32*1024*1024 bytes.


Version-Release number of selected component (if applicable):

ntp-4.2.2p1-9.el5_3.2

  
Actual results:

process can die with " out of memory [21002] "

Expected results:

process lives

Additional info:

A typical 32bit machine has ntpd memory use like the following after a few months of uptime

VmPeak:     5752 kB
VmSize:     5700 kB
VmLck:         0 kB
VmHWM:      1440 kB
VmRSS:      1096 kB
VmData:     1424 kB
VmStk:        84 kB
VmExe:       536 kB
VmLib:      3448 kB
VmPTE:        28 kB

a freshly booted RHEL5 64bit machine with small RAM

VmPeak:    19244 kB
VmSize:    19188 kB
VmLck:     19188 kB
VmHWM:      4940 kB
VmRSS:      4884 kB
VmData:      780 kB
VmStk:        84 kB
VmExe:       460 kB
VmLib:      3308 kB
VmPTE:        60 kB

and on a 64bit with larger RAM

VmPeak:    29924 kB
VmSize:    29880 kB
VmLck:         0 kB
VmHWM:      1544 kB
VmRSS:      1540 kB
VmData:      564 kB
VmStk:        84 kB
VmExe:       544 kB
VmLib:      3852 kB
VmPTE:        76 kB

And on a problem machine after a few hours uptime

VmPeak:    38000 kB
VmSize:    37996 kB
VmLck:     37996 kB
VmHWM:      9604 kB
VmRSS:      9600 kB
VmData:     1716 kB
VmStk:        84 kB
VmExe:       460 kB
VmLib:      6648 kB
VmPTE:        92 kB
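
For a rough sense of scale, the VmSize figures from the four snapshots above can be set against the hard-coded 32*1024*1024-byte memlock limit. This is only an illustrative sketch: the helper name vm_field_kb and the snapshot labels are made up for this example, and the kernel accounts locked pages rather than VmSize as such, so the headroom is an approximation.

```python
import re

# Hard-coded RLIMIT_MEMLOCK value quoted in the report, converted to kB.
MEMLOCK_LIMIT_KB = 32 * 1024 * 1024 // 1024

def vm_field_kb(status_text, field):
    """Return one Vm* field (in kB) from /proc/<pid>/status text, or None if absent."""
    match = re.search(rf"^{re.escape(field)}:\s+(\d+)\s+kB", status_text, re.MULTILINE)
    return int(match.group(1)) if match else None

# VmSize lines copied from the four snapshots in the report; labels are illustrative.
snapshots = {
    "32bit, months of uptime": "VmSize:     5700 kB",
    "64bit, small RAM, fresh boot": "VmSize:    19188 kB",
    "64bit, larger RAM": "VmSize:    29880 kB",
    "64bit, problem machine": "VmSize:    37996 kB",
}

for label, text in snapshots.items():
    size_kb = vm_field_kb(text, "VmSize")
    headroom_kb = MEMLOCK_LIMIT_KB - size_kb
    print(f"{label}: VmSize={size_kb} kB, headroom={headroom_kb} kB")
```

Only the problem machine ends up with negative headroom (37996 kB against 32768 kB); the 64bit machine with larger RAM is already within about 3 MB of the limit.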


Not sure which approach to a fix is going to be most acceptable here.

 Upstream has "-m" option to specify.
 Doubling the values for 64bit arch.
 Doubling the values for all arch.
 Setting values based on consumption after startup + growth allowance.
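
The last option in the list above (startup consumption plus a growth allowance) could look roughly like the sketch below. The function name, the growth factor of 2.0 and the choice to never go below the old limit are all assumptions made for illustration; nothing in this report says the eventual fix works this way.

```python
OLD_LIMIT_BYTES = 32 * 1024 * 1024  # hard-coded value from the report

def memlock_limit_bytes(vm_size_kb, growth_factor=2.0, floor_bytes=OLD_LIMIT_BYTES):
    """Hypothetical policy: size right after startup times a growth allowance,
    but never lower than the old hard-coded limit."""
    return max(floor_bytes, int(vm_size_kb * 1024 * growth_factor))

# VmSize values (kB) taken from the snapshots in the description.
for vm_size_kb in (5700, 19188, 37996):
    print(vm_size_kb, "kB ->", memlock_limit_bytes(vm_size_kb), "bytes")
```

With these assumed parameters the 32bit machine keeps the old 32 MB limit, while the 64bit machines get a limit that scales with what the process actually maps at startup.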

Comment 1 Miroslav Lichvar 2010-03-29 11:14:30 UTC

Memory locking is enabled only when the -m option is used.

This was fixed in Fedora by doubling the limit.

Comment 2 Miroslav Lichvar 2010-03-29 11:19:26 UTC

Actually in the 4.2.2p1 version memory is locked unconditionally.

It might be useful to backport the option as running ntpd in locked memory is rarely needed.

Comment 7 RHEL Program Management 2010-08-09 18:40:04 UTC

This request was evaluated by Red Hat Product Management for
inclusion in the current release of Red Hat Enterprise Linux.
Because the affected component is not scheduled to be updated in the
current release, Red Hat is unfortunately unable to address this
request at this time. Red Hat invites you to ask your support
representative to propose this request, if appropriate and relevant,
in the next release of Red Hat Enterprise Linux.

Comment 8 Aaron Grewell 2010-11-25 00:50:10 UTC

Any chance we can get an update released for this?  Given the proliferation of x86_64 hosts it's only becoming a bigger problem as time goes on.

Comment 11 RHEL Program Management 2011-01-11 21:07:15 UTC

This request was evaluated by Red Hat Product Management for
inclusion in the current release of Red Hat Enterprise Linux.
Because the affected component is not scheduled to be updated in the
current release, Red Hat is unfortunately unable to address this
request at this time. Red Hat invites you to ask your support
representative to propose this request, if appropriate and relevant,
in the next release of Red Hat Enterprise Linux.

Comment 12 RHEL Program Management 2011-01-11 23:14:20 UTC

This request was erroneously denied for the current release of
Red Hat Enterprise Linux.  The error has been fixed and this
request has been re-proposed for the current release.

Comment 13 dushy2010 2011-01-20 11:22:11 UTC

Hi folks,

Could you let us know the version in which the fix will be provided?
Or, if a fix for this bug is already available in some version, please tell us which one.
 
Thanks,
Dushyant

Comment 14 Miroslav Lichvar 2011-01-20 11:32:17 UTC

Created attachment 474441 [details]
increase memlock rlimit

Comment 15 Miroslav Lichvar 2011-01-25 16:20:21 UTC

*** Bug 672571 has been marked as a duplicate of this bug. ***

Comment 18 dushy2010 2011-03-09 10:21:37 UTC

Hi Miroslav,

Adding the patch hasn't solved the issue, ntpd still dies with the following errors:
15 Feb 16:00:44 ntpd[6400]: synchronized to 172.22.32.6, stratum 3
15 Feb 16:40:44 ntpd[6400]: Exiting: No more memory!
15 Feb 17:23:45 ntpd[4382]: Attemping to register mDNS

15 Feb 17:23:45 ntpd[4382]: Unable to register mDNS

15 Feb 17:23:45 ntpd[4384]: logging to file /localdisk1/var/log/ntpd.log

Is there anything else that could be done to solve the issue?
If it wasn't already clear, we're using an x86_64 cluster.

Thanks,
Dushyant

Comment 19 Miroslav Lichvar 2011-03-09 12:58:09 UTC

What values do you see in /proc/`pidof ntpd`/status in the VmSize field? Is it increasing over time?

Comment 20 dushy2010 2011-03-09 14:15:59 UTC

Created attachment 483227 [details]
ntpd.log during the stage when it dies

I've attached a few lines of ntpd.log from the period when the daemon dies.
In every case where ntpd dies with "Exiting: No more memory!", the preceding lines contain the following message:
"ntp_io: estimated max descriptors: 1024, initial socket boundary: 16".

In the cases where ntpd is restarted and does not fail, the same "ntp_io" line is different. The healthy daemon logs:
"ntp_io: estimated max descriptors: 65536, initial socket boundary: 16"

Is there something fishy about the estimated max descriptors being 1024 before it dies and 65536 after it is restarted?

Comment 21 Miroslav Lichvar 2011-03-09 15:08:47 UTC

The "estimated max descriptors" number is the result of the getdtablesize() glibc call, and ntpd uses it for nothing other than printing the log message.

What VmSize values do you see, are they close to 64MB?
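
As far as the glibc documentation describes it, getdtablesize() returns the current (soft) RLIMIT_NOFILE limit of the calling process, so the number in the log reflects the environment ntpd was started from rather than any ntpd setting. A minimal sketch for checking the same values from inside a process, assuming a Linux system:

```python
import os
import resource

# Soft and hard limits on open file descriptors for the current process.
soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
print(f"RLIMIT_NOFILE soft={soft} hard={hard}")

# On glibc this is expected to report the same value as the soft limit.
print(f"SC_OPEN_MAX={os.sysconf('SC_OPEN_MAX')}")
```

Running this from an init script and again from a root shell should show whether the two environments hand different limits to their children.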

Comment 22 dushy2010 2011-03-10 09:20:10 UTC

The VmSize values are not close to 64MB yet, about 39-40MB right now.

The ulimit does not stay at 65536 across reboots.
I believe the patch you attached doesn't help once the system is rebooted, because after a reboot the max descriptors/ulimit goes back to 1024, which leads to "Exiting: No more memory!" and the ntpd daemon dying after a while.

I tried changing a parameter in /etc/security/limits.conf as follows:
soft nofile 65536
After a reboot, however, ntpd again reports 1024, even though ulimit shows the correct value of 65536.
Could you let me know which configuration file ntpd (the getdtablesize() function) reads the value 1024 from, and where we could change it so that the open file descriptor limit persists across reboots and ntpd shows the modified value?

Comment 23 Miroslav Lichvar 2011-03-10 10:08:47 UTC

The maximum number of descriptors shouldn't be related to the memory lock limit. ntpd needs only a few descriptors: one for each local address and maybe a few for logs, etc.

How exactly is the ntpd service restarted? Is it possible that a different ntpd binary is running after reboot?

Comment 24 dushy2010 2011-03-10 10:16:47 UTC

The ntpd daemon is started automatically once the system is rebooted. We observed that the max descriptors value ntpd reports after a reboot is 1024. After some time we get "Exiting: No more memory!" and ntpd dies. We then restarted the daemon using 'service ntpd restart'. After the restart the max descriptors value changed to 65536.

Comment 25 Miroslav Lichvar 2011-03-22 17:00:33 UTC

Can you please attach your ntp.conf?

Comment 27 dushy2010 2011-03-23 06:20:39 UTC

Created attachment 486964 [details]
ntp.conf

ntp.conf file attached

Comment 28 Miroslav Lichvar 2011-03-23 14:22:54 UTC

dushy2010, it is probably a limit other than memlock that is causing the problem. It seems the limits differ between the service being started on boot and being restarted from a root shell. Are there any modifications in /etc/security/limits.conf or any ulimit calls in shell configuration files such as /etc/profile?

Can you please post the content of /proc/`pidof ntpd`/limits when ntpd is started on boot and when it's restarted?

Comment 29 dushy2010 2011-03-24 08:29:18 UTC

[root@cu1admin1 ~]# ps aux|grep ntp
ntp       4384  0.0  0.0  32112  7580 ?        SLs  Feb15   4:14 ntpd -u ntp:ntp -p /var/run/ntpd.pid -x -l /localdisk1/var/log/ntpd.log
root      5711  0.0  0.0  61148   788 pts/190  S+   07:12   0:00 grep ntp
[root@cu1admin1 ~]# cat /proc/4384/status
Name:   ntpd
State:  S (sleeping)
SleepAVG:       98%
Tgid:   4384
Pid:    4384
PPid:   1
TracerPid:      0
Uid:    38      38      38      38
Gid:    38      38      38      38
FDSize: 64
Groups: 38
VmPeak:    32132 kB
VmSize:    32112 kB
VmLck:     32112 kB
VmHWM:      7600 kB
VmRSS:      7580 kB
VmData:     2996 kB
VmStk:        84 kB
VmExe:       460 kB
VmLib:      3800 kB
VmPTE:        84 kB
StaBrk: 2b2d56d3d000 kB
Brk:    2b2d56d80000 kB
StaStk: 7fffbfe6ee30 kB
Threads:        1
SigQ:   0/4096
SigPnd: 0000000000000000
ShdPnd: 0000000000000000
SigBlk: 0000000000000000
SigIgn: 0000000000001000
SigCgt: 0000000180006a47
CapInh: 0000000002000000
CapPrm: 0000000002000000
CapEff: 0000000002000000
Cpus_allowed:   00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000fff
Mems_allowed:   00000000,00000003
[root@cu1admin1 ~]#

This output is taken from a stable "restarted" ntp daemon.

I looked at the collectl logs from shortly after this daemon was started manually: it showed 31MB VSZ and 7MB RSZ on 15th Feb, and collectl shows the same quantities today. I don't believe there is an increase over time. But remember that, as I mentioned, this "restarted" daemon always seems stable, whereas after a restart of the server the daemon eventually dies.

Looking at the collectl data for the original init.d daemon started automatically during boot of cu0admin1, I see that it began life after system boot with 39MB VSZ and 15MB RSZ. It kept that sizing until it eventually died.

btw looking elsewhere, 39MB/15MB seems to be standard sizing for an ntp daemon started by init.d

Whereas 31MB/7MB seems to be standard sizing for an ntp daemon started manually by root.

Comment 30 Miroslav Lichvar 2011-03-24 09:20:33 UTC

And the difference in /proc/`pidof ntpd`/limits?

According to the ntpd log, at least the maximum number of descriptors should be different, 1024 and 65536.

Comment 31 dushy2010 2011-03-24 10:19:52 UTC

Before even applying your patch on a fresh server, please have a look at the limits:

# cat /proc/`pidof ntpd`/limits
Limit                     Soft Limit           Hard Limit           Units
Max cpu time              unlimited            unlimited            seconds
Max file size             unlimited            unlimited            bytes
Max data size             unlimited            unlimited            bytes
Max stack size            204800               unlimited            bytes
Max core file size        0                    unlimited            bytes
Max resident set          unlimited            unlimited            bytes
Max processes             36351                36351                processes
Max open files            1024                 1024                 files
Max locked memory         33554432             33554432             bytes
Max address space         unlimited            unlimited            bytes
Max file locks            unlimited            unlimited            locks
Max pending signals       36351                36351                signals
Max msgqueue size         819200               819200               bytes
Max nice priority         0                    0
Max realtime priority     0                    0


After applying your patch, have a look at limits:

# cat /proc/`pidof ntpd`/limits
Limit                     Soft Limit           Hard Limit           Units
Max cpu time              unlimited            unlimited            seconds
Max file size             unlimited            unlimited            bytes
Max data size             unlimited            unlimited            bytes
Max stack size            204800               unlimited            bytes
Max core file size        0                    unlimited            bytes
Max resident set          unlimited            unlimited            bytes
Max processes             36351                36351                processes
Max open files            1024                 8192                 files
Max locked memory         33554432             33554432             bytes
Max address space         unlimited            unlimited            bytes
Max file locks            unlimited            unlimited            locks
Max pending signals       36351                36351                signals
Max msgqueue size         819200               819200               bytes
Max nice priority         0                    0
Max realtime priority     0                    0


As you can see the open files changes to 8192.
And after restart of the server, please have a look at the limits:
# cat /proc/`pidof ntpd`/limits
Limit                     Soft Limit           Hard Limit           Units
Max cpu time              unlimited            unlimited            seconds
Max file size             unlimited            unlimited            bytes
Max data size             unlimited            unlimited            bytes
Max stack size            204800               unlimited            bytes
Max core file size        0                    unlimited            bytes
Max resident set          unlimited            unlimited            bytes
Max processes             36351                36351                processes
Max open files            1024                 1024                 files
Max locked memory         33554432             33554432             bytes
Max address space         unlimited            unlimited            bytes
Max file locks            unlimited            unlimited            locks
Max pending signals       36351                36351                signals
Max msgqueue size         819200               819200               bytes
Max nice priority         0                    0
Max realtime priority     0                    0

So something goes wrong after the machine is restarted, and the value that the patch changes is not persistent.
Maybe changing the ulimit value inside the init.d/ntpd script would permanently keep it at 8192/65536, or whatever value higher than 1024 is needed.
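
Snapshots like the ones in this comment are easier to compare mechanically than by eye. The sketch below parses /proc/<pid>/limits text and reports only the rows that differ; the function names are made up for this example, and just two rows from the "fresh" and "patched" outputs above are included to keep it short.

```python
import re

def parse_limits(text):
    """Map limit name -> (soft, hard) strings from /proc/<pid>/limits text."""
    limits = {}
    for line in text.splitlines():
        # Columns are separated by runs of spaces; names contain single spaces.
        fields = re.split(r"\s{2,}", line.strip())
        if len(fields) >= 3 and fields[0].startswith("Max "):
            limits[fields[0]] = (fields[1], fields[2])
    return limits

def diff_limits(before, after):
    """Return {name: (before, after)} for every limit whose values differ."""
    names = sorted(set(before) | set(after))
    return {n: (before.get(n), after.get(n)) for n in names if before.get(n) != after.get(n)}

fresh = parse_limits("""\
Max open files            1024                 1024                 files
Max locked memory         33554432             33554432             bytes
""")
patched = parse_limits("""\
Max open files            1024                 8192                 files
Max locked memory         33554432             33554432             bytes
""")

print(diff_limits(fresh, patched))
```

For these two rows the only difference is the hard limit of "Max open files"; "Max locked memory" is 33554432 in both snapshots.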

Comment 32 Miroslav Lichvar 2011-03-24 11:24:12 UTC

That's odd. According to the ntpd log, max open files should be 65536, but your output shows only 1024 (the soft limit is the one enforced by the kernel; the hard limit is just the allowed maximum for the soft limit). Also, with the patch applied, max locked memory should be 64MB, not 32MB.

This is the output I get here, and it doesn't change after manually restarting the service.

Limit                     Soft Limit           Hard Limit           Units     
Max cpu time              unlimited            unlimited            seconds   
Max file size             unlimited            unlimited            bytes     
Max data size             unlimited            unlimited            bytes     
Max stack size            204800               unlimited            bytes     
Max core file size        0                    unlimited            bytes     
Max resident set          unlimited            unlimited            bytes     
Max processes             4095                 4095                 processes 
Max open files            1024                 1024                 files     
Max locked memory         67108864             67108864             bytes     
Max address space         unlimited            unlimited            bytes     
Max file locks            unlimited            unlimited            locks     
Max pending signals       4095                 4095                 signals   
Max msgqueue size         819200               819200               bytes     
Max nice priority         0                    0                    
Max realtime priority     0                    0
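
The soft/hard distinction mentioned in this comment can be observed from an unprivileged process. The sketch below lowers the soft open-files limit and then restores it; the value 64 is arbitrary and chosen only for illustration.

```python
import resource

soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)

# Lowering the soft limit is always permitted.
lowered = 64 if soft == resource.RLIM_INFINITY else min(soft, 64)
resource.setrlimit(resource.RLIMIT_NOFILE, (lowered, hard))
after_lowering = resource.getrlimit(resource.RLIMIT_NOFILE)

# Raising the soft limit again works without privileges as long as it stays
# at or below the hard limit; here it is restored to its original value.
resource.setrlimit(resource.RLIMIT_NOFILE, (soft, hard))
after_restoring = resource.getrlimit(resource.RLIMIT_NOFILE)

print("original: ", (soft, hard))
print("lowered:  ", after_lowering)
print("restored: ", after_restoring)
```

By the same rule, a process with a soft limit of 1024 and a hard limit of 8192 could raise its own soft limit to 8192, but only if it asks for that.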

Comment 33 dushy2010 2011-03-24 11:45:13 UTC

For the moment this is what I've done:

----------------------------------------------------------------------------------
# diff -ruN /etc/init.d/ntpd.orig  /etc/init.d/ntpd
--- /etc/init.d/ntpd.orig       2011-03-10 14:52:44.000000000 +0530
+++ /etc/init.d/ntpd    2011-03-10 14:54:12.000000000 +0530
@@ -91,7 +91,8 @@
        [ "$NETWORKING" = "no" ] && exit 1

        readconf;
-
+       # Modifying ulimit value so ntp takes 65536 as max descriptors
+        ulimit -n 65536
        if [ -n "$dostep" ]; then
            echo -n $"$prog: Synchronizing with time server: "
            /usr/sbin/ntpdate $dropstr -s -b $NTPDATE_OPTIONS $tickers &>/dev/null

----------------------------------------------------------------------------------

Comment 38 Eva Kopalova 2011-06-30 12:43:41 UTC

    Technical note added. If any revisions are required, please edit the "Technical Notes" field
    accordingly. All revisions will be proofread by the Engineering Content Services team.
    
    New Contents:
The ntpd daemon could terminate unexpectedly due to a low memory lock limit. With this update, the memory lock limit has been doubled.

Comment 39 David Swegen 2011-07-01 14:45:10 UTC

Just a brief note: in case anyone is wondering, this bug seems to have already been addressed for RHEL6 in ntp-4.2.4p8-2.el6.

Comment 40 errata-xmlrpc 2011-07-21 06:42:46 UTC

An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHBA-2011-0980.html
