Bug 154759
Summary: | (kernel-2.6.11 exec-shield) ntpd and perl segfaults | ||
---|---|---|---|
Product: | [Fedora] Fedora | Reporter: | Florin Andrei <florin> |
Component: | kernel | Assignee: | Ingo Molnar <mingo> |
Status: | CLOSED ERRATA | QA Contact: | |
Severity: | medium | Docs Contact: | |
Priority: | medium | ||
Version: | 3 | CC: | aleksey, amk, anderson, bbaetz, davej, dblistsub-redzilla, dedourek, deron.meranda, dgunchev, drepper, dwalsh, fche, gauret, gilboad, hansecke, harald, howanitz, ianburrell, jamesodhunt, jcc103, john, jonathan.underwood, jonte, j, katzj, kevin.russell, lfarkas, mattdm, m.a.young, mingo, njh, oliva, ralston, rlocke, rnichols42, robatino, steve30401, tometzky+redhat, trevor, umar, werkman, wtogami |
Target Milestone: | --- | ||
Target Release: | --- | ||
Hardware: | i686 | ||
OS: | Linux | ||
Whiteboard: | |||
Fixed In Version: | Doc Type: | Bug Fix | |
Doc Text: | Story Points: | --- | |
Clone Of: | Environment: | ||
Last Closed: | 2005-06-27 17:49:18 UTC | Type: | --- |
Bug Blocks: | 136452 | ||
Attachments: |
Description
Florin Andrei
2005-04-13 23:41:56 UTC
Created attachment 113166 [details]
ntp.conf file
Created attachment 113167 [details]
ntp -n strace log
Hello all,

I'm having the same problem... but with a twist. Long story short: I'm using apt-get to upgrade my workstation and servers. My servers only get updates from "OS", "core" and "updates" (and some, "tupdates"). My workstations get updates from "OS", "core", "updates", "tupdates" and "kde-redhat".

While I can't really understand why, ntpd only gets a SIGSEGV on (all) my workstations, while working just fine on all my servers. I compared the list of the ntp-4.2.0.a.20040617-4.rpm dependencies (glibc, libcap, readline, cursors) on a selected server (which has a working ntpd) and a workstation (the source of the attached strace log) and they both had the same base rpms. Nevertheless, the crash happens like clockwork on all my kde-redhat workstations and on none of my servers.

NTP version: ntp-4.2.0.a.20040617-4
Attached: ntp.conf file and ntp-trace.log

Let me know if I can supply further information,
Gilboa

Created attachment 113171 [details]
sysreport
Might be some weird timing issue on certain architectures. What are your
systems?
Mine are PIII/800, regular IDE machine. I attached the archive generated by
sysreport on one of them (slightly obfuscated to remove domain names and
network addresses).
Created attachment 113172 [details]
strace -fFttx -o ntp-strace.txt /usr/sbin/ntpd -u ntp -p /var/run/ntpd.pid -g
Here's my strace.
Now I noticed that ntp also segfaults on another type of machine, still PIII
but different hardware (Xeon, SCSI...).
This bug has been reported to ntp.org as well: http://bugzilla.ntp.org/show_bug.cgi?id=413

FC4 Test 2 seems to be fine, based on shallow, circumstantial evidence (a single system, the kind that makes ntpd crash instantly on FC3).

I am pretty sure the kernel version is important. We started seeing the problem after upgrading to the 2.6.11-1.14 kernel, but not the earlier 2.6.10-1.770 kernel. There may also be a race condition, as I have found that if you try several times to start the ntpd daemon, it eventually starts and stays up.

I second Michael Young's point.
The 3 workstations are:
Athlon MP 2400 x 2 / 1GB / SCSI RAID, 2.6.11-1-SMP.
Athlon XP 1700 x 2 / 1GB / IDE RAID, 2.6.11-1-SMP.
P4 3.06 (HT) / 1GB / IDE, 2.6.11-1-SMP.
The servers are:
Athlon XP 1700 x 2 / 1GB / IDE RAID, 2.6.9-667 SMP.
K7 750 / 512MB / IDE, 2.6.10-737.
P2 366 / 256MB / IDE, 2.6.10-737.
All the 2.6.11 machines seem to kill the ntpd. (At least in my case...)

Based on memory, it kind-of looks like it's a recent issue, indeed. If I can, I will attempt to revert one system to an older kernel and see if the problem stays. And yes, if you are _very_ persistent, it sometimes does work. But after running the machine for a while, it gets very hard to make it work. It's very hard to tell anything with precision.

P.S.: Discussion thread about this bug on fedora-list, subject "BUG: check your ntpd, it may be dead": https://www.redhat.com/archives/fedora-list/2005-April/thread.html#03018

I can confirm that, at least on one system, reverting back to the original FC3 kernel eliminated the problem; ntpd now works fine. With the latest kernel update, ntpd becomes extremely unreliable.

works: kernel-2.6.9-1.667
doesn't work: kernel-2.6.11-1.14_FC3
The first time I noticed this was the first shutdown running kernel-2.6.11-1.14_FC3, when ntpd shutdown failed (since it had died). It never happened even once running earlier kernels.

I will echo comment #12.
I first noticed after shutting down a 2.6.11-1.14_FC3 kernel and seeing the failure. I do not recall seeing the problem on 2.6.10-1.770_FC3. The startup under 2.6.11-1.14_FC3 claims to successfully start ntpd, but there is no running process. I have tried repeatedly to start the service but it never leaves a running process. Of course, maybe I am not being as persistent as comment #10.... :-)

For what it's worth, I ran
strace -f -o ntpd.txt /usr/sbin/ntpd
and after it crashed, attached below is the output.

Created attachment 113273 [details]
strace -f -o ntpd.txt /usr/sbin/ntpd
Output of strace when ntpd crashes.
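A log captured this way can be scanned for the crashing process. The sketch below assumes the `strace -f -o FILE` output format, where every line starts with the PID of the traced process and a fatal signal is reported as `+++ killed by SIGSEGV +++`. `segv_pids` is a hypothetical helper name, not part of strace or ntp.

```shell
# Print the PID of every traced process that strace reports as killed by SIGSEGV.
# Reads "strace -f -o FILE" output on stdin; each line there starts with a PID.
# The pattern also matches the "(core dumped)" variant of the final line.
segv_pids() {
    awk '/\+\+\+ killed by SIGSEGV/ { print $1 }'
}
```

Typical use would be `segv_pids < ntpd.txt`, with `ntpd.txt` being the log file named in the strace command above.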
ntpd is segfaulting while starting up. Here is the output from running 'ntpd -d':
ntpd 4.2.0a Mon Oct 11 09:10:20 EDT 2004 (1)
addto_syslog: ntpd 4.2.0a Mon Oct 11 09:10:20 EDT 2004 (1)
addto_syslog: precision = 1.000 usec
create_sockets(123)
bind() fd 4, family 2, port 123, addr 0.0.0.0, flags=8
addto_syslog: Listening on interface wildcard, 0.0.0.0#123
bind() fd 5, family 10, port 123, addr ::, flags=0
addto_syslog: Listening on interface wildcard, ::#123
bind() fd 6, family 2, port 123, addr 127.0.0.1, flags=0
addto_syslog: Listening on interface lo, 127.0.0.1#123
bind() fd 7, family 2, port 123, addr 192.168.1.11, flags=8
addto_syslog: Listening on interface eth0, 192.168.1.11#123
init_io: maxactivefd 7
local_clock: time 0 clock 0.000000 offset 0.000000 freq 0.000 state 0
Segmentation fault (core dumped)
The stack trace from the core file is:

I have some core files and stack traces from 'ntpd -d' segfaulting. Every stack trace is different:
Thread 1 (process 20050):
#0 0x00e1ad30 in _IO_vfprintf (s=0xbff2e35c, format=0xbff2e460 "kernel time sync status %04x", ap=0xbff2e998 "@") at vfprintf.c:182
#1 0x00e3b536 in _IO_vsnprintf (string=0xbff2e998 "@", maxlen=3220366432, format=0xbff2e960 "\uffff\005", args=0xbff2e960 "\uffff\005") at vsnprintf.c:120
#2 0x0047b34a in msyslog (level=-1074599584, fmt=0xbff2e960 "\uffff\005") at msyslog.c:165
#3 0x004451a3 in loop_config (item=-1074599584, freq=0) at ntp_loopfilter.c:858
Thread 1 (process 20320):
#0 0x00665106 in __res_vinit (statp=0x6b8ee0, preinit=0) at res_init.c:150
fp = (FILE *) Cannot access memory at address 0xbfe85288

Just another data point: this is also failing for me with 2.6.11-1.14_FC3 (i686 architecture) on a single-processor Athlon 800MHz; SELinux is disabled. It only SEGVs during startup, and it's somewhat random, with different strace output. Occasionally it will start fine, about 1/5th of the time. Tried with and without the -u option, which doesn't seem to matter.
Most of the time the strace ends like,
5594 write(1, "local_clock: time 0 clock 0.0000"..., 70) = 70
5594 rt_sigaction(SIGSYS, {0x5343a0, [], SA_RESTORER, 0xdf78c8}, {SIG_DFL}, 8) = 0
5594 rt_sigprocmask(SIG_BLOCK, NULL, [], 8) = 0
5594 adjtimex({modes=61, offset=0, freq=-60817, maxerror=16, esterror=16, status=64, constant=0, precision=1, tolerance=33554432, time={1113804277, 564806}}) = 5
5594 rt_sigaction(SIGSYS, {SIG_DFL}, NULL, 8) = 0
5594 --- SIGSEGV (Segmentation fault) @ 0 (0) ---
5594 +++ killed by SIGSEGV +++
And with two or more -d options, it usually SEGVs after this output,
init_io: maxactivefd 7
local_clock: time 0 clock 0.000000 offset 0.000000 freq 0.000 state 0
getnetnum given 127.0.0.1, got 127.0.0.1
And running inside gdb, I get this stacktrace most of the time,
Program received signal SIGSEGV, Segmentation fault.
#0 0x00207d30 in vfprintf () from /lib/tls/libc.so.6
#1 0x00228536 in vsnprintf () from /lib/tls/libc.so.6
#2 0x007a434a in receive () from /usr/sbin/ntpd
#3 0x0076e1a3 in main () from /usr/sbin/ntpd
But I've sometimes seen the SEGV at different points in initialization too.

I also found I had this problem on my server. I had three servers called out in /etc/ntp.conf and I found that if I commented out the first server it now works fine.

First, I can confirm that this problem is absent on earlier kernels. I just rebooted a server to 2.6.10 and there is no problem there.
Secondly, I think this might be an execshield problem. I do not see these segmentation faults if I do an
echo 0 > /proc/sys/kernel/exec_shield
However, doing this:
echo 2 > /proc/sys/kernel/exec_shield
echo 0 > /proc/sys/kernel/exec_shield-randomize
prevents the problem as well.
Lastly, I attach a few gdb backtraces that I obtained by doing an "ulimit -c unlimited" followed by a few "ntpd -d -g -n". Each trace is done once with "bt", once with "bt full". debuginfo packages are the sysadmin's friend....

Created attachment 113456 [details]
a few gdb "bt" and "bt full" stacktraces
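Before changing any of these settings it can help to record their current values. This is a minimal sketch that only reads the files. `show_exec_shield` is a hypothetical helper, and because this report spells the sysctl files both with an underscore and with a hyphen, the function tries both spellings instead of assuming which one a given kernel provides.

```shell
# Print the exec-shield settings found under a sysctl directory.
# $1: directory to inspect (defaults to /proc/sys/kernel).
# Both spellings seen in this report are tried; files that do not exist are skipped.
show_exec_shield() {
    dir=${1:-/proc/sys/kernel}
    for name in exec_shield exec-shield exec_shield-randomize exec-shield-randomize; do
        if [ -r "$dir/$name" ]; then
            printf '%s = %s\n' "$name" "$(cat "$dir/$name")"
        fi
    done
}
```

Called without arguments it inspects the live system; the directory parameter exists only so the function can be tried against a scratch directory first.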
*** Bug 155498 has been marked as a duplicate of this bug. ***

I see the problem on all (3) of my servers, all since updating to the new kernel release 2.6.11-1.14_FC3. If you repeat the starting process for ntpd, it will start at about every third attempt on my slowest server (PIII-1000); on the fastest (Dual P4-XEON-2400) it will start at about every tenth attempt. Strace'ing the start process leads to different output every time, so this doesn't help, I think. If ntpd is started successfully it stays up all the time (no problems for 5 days now).
Only a thought: is the problem possibly related to an old version of libcap?

*** Bug 155490 has been marked as a duplicate of this bug. ***

For about 2 days now I have had all my servers running with exec-shield RANDOMIZATION turned OFF and it completely fixes this problem. You can try this yourself like this (as root):
echo 0 > /proc/sys/kernel/exec_shield-randomize
This is an absolutely functional workaround for me. I believe some of redhat's kernel people (davej, riel or arjanv) should look at this. Additionally, I used to see perl crashes (bug #155490) with this kernel, but they have disappeared as well.

echo 0 > /proc/sys/kernel/exec_shield-randomize
solves the problem here too.

Both ntp and perl crashing with exec-shield enabled, but only in 2.6.11? Perhaps an exec-shield issue in the new 2.6.11? Please reassign back if this turns out to be multiple userspace problems instead.

Please see also: https://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=145258 which is related.

Perhaps the discussion of NTP Bug 220 might be relevant. http://bugzilla.ntp.org/show_bug.cgi?id=220 It seems that ntp-4.2.0.a.20040617-4 uses setrlimit to reduce the amount of locked stack space. I believe my system tickled this bug during calls to getaddrinfo().

I'll just throw in another "me too".
kernel-2.6.10-1.766_FC3 == no problems.
kernel-2.6.11-1.14_FC3 == can't start ntpd (segfault).
Seems to be most prevalent on the systems I run that have ide=nodma, though I can't see how that would be related unless it is a timing issue (disk access is dog slow on those systems).

We need /proc/cpuinfo and core dumps from people affected by this bug.

How do I enable core dumps? I did a "ulimit -c unlimited" and still no core dump (although ntpd crashes).

Created attachment 113819 [details]
/proc/cpuinfo from affected servers
/proc/cpuinfo from servers affected by this bug. Next up: /proc/cpuinfo from the
ones not affected.... The test is really easy:
echo 1 > /proc/sys/kernel/exec-shield-randomize
ntpd -g -n
echo 0 > /proc/sys/kernel/exec-shield-randomize
I do not see a correlation with DMA/noDMA: one of the affected ones is a SCSI
system with just one backup IDE disc which has DMA enabled.
I run all my systems with kde-redhat packages.
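Because the crash is intermittent, a single run of the test above proves little; counting failures over several runs gives a rough crash rate for each exec-shield setting. The following is a sketch under stated assumptions: `count_segv` is a hypothetical helper, it relies on the Linux convention that a shell reports a child killed by signal 11 (SIGSEGV) as exit status 139 (128 + 11), and it assumes the command being tested returns on its own.

```shell
# Run a command N times and print how many runs were killed by SIGSEGV.
# $1: number of attempts; remaining arguments: the command to run.
# Exit status 139 means 128 + 11, i.e. the child died from SIGSEGV on Linux.
count_segv() {
    attempts=$1
    shift
    crashes=0
    i=0
    while [ "$i" -lt "$attempts" ]; do
        "$@" >/dev/null 2>&1
        if [ $? -eq 139 ]; then
            crashes=$((crashes + 1))
        fi
        i=$((i + 1))
    done
    echo "$crashes"
}
```

Note that `ntpd -n` stays in the foreground after a successful start, so each run has to be bounded in some way (for instance by stopping ntpd from another terminal); otherwise the loop waits forever on the first run that does not crash.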
florin: do you try to run ntpd directly from the command line or via
/etc/rc.d/init.d/ntpd? "ulimit -c 0" is part of that.... Try running
it as "ntpd -d -g -n" or so.
Created attachment 113820 [details]
/proc/cpuinfo from un-affected servers
Ok, I figured it out, Warren has the info now, thanks.

Created attachment 113829 [details]
core from ntpd segfault
Created attachment 113831 [details]
output from ntpd command
Created attachment 113832 [details]
cpuinfo on system with crashing ntpd
I just realized that this bug has been occurring on most of the machines I administer since I switched to 2.6.11... I just didn't notice it at first because on most of them it starts after only 1 or 2 failures and there is zero log output indicating a crash or failure.

*** Bug 151262 has been marked as a duplicate of this bug. ***

*** Bug 145258 has been marked as a duplicate of this bug. ***

execstack works around the problem for me, as noted in one of the duplicate bugs. This should be published in the FC3 summary so that everybody can do this without having to resort to a bug search to discover the workaround.

Please see also this comment for bug #145258
https://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=145258#c13
for a workaround - disabling exec-shield only for ntpd - not for the whole system:
execstack -s /usr/sbin/ntpd
It can be dangerous - I think the cause of the problem is not known yet - exec-shield randomization can break some other applications as well.

But I haven't found a perl testcase. In bug #151262 harald gave a simple C sample that demonstrates this problem. Unfortunately I cannot reproduce it with that sample though.

bad example... "-fstack-check" seems to be broken..

*** Bug 156603 has been marked as a duplicate of this bug. ***

I suddenly have the same problem on my server; using execstack -s let ntpd start again.
# cat /proc/cpuinfo
processor : 0
vendor_id : GenuineIntel
cpu family : 6
model : 8
model name : Pentium III (Coppermine)
stepping : 6
cpu MHz : 930.391
cache size : 256 KB
fdiv_bug : no
hlt_bug : no
f00f_bug : no
coma_bug : no
fpu : yes
fpu_exception : yes
cpuid level : 2
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 mtrr pge mca cmov pat pse36 mmx fxsr sse
bogomips : 1839.10

Problem persists with kernel-2.6.11-1.27_FC3. (Not too surprising since I didn't see it mentioned in the changelog.)

And now maybe bind/named as well?
After the switch from 2.6.11-1.14_FC3smp to 2.6.11-1.27_FC3, named/bind also started showing seemingly similar misbehaviour on a P4 machine but not on a PIII box. I am seeing the ntpd and bind problem with 2.6.11-1.27_FC3 on a Pentium 4 running both SMP (using hyperthreading) and non-SMP. However, I am *not* seeing either the ntpd (or the bind) problem on an old PIII machine that is also running 2.6.11-1.27_FC3.

Oh yes - the named/bind that seems to be failing is bind-9.2.5-1

I'm not seeing the named problem on my Pentium 4, but it is not a hyperthreaded machine. Also the "execstack -s /usr/sbin/ntpd" workaround seems to have taken care of my ntpd problem.
bind-9.2.5-1
kernel 2.6.11-1.27_FC3

These reports don't seem to be happening with FC4 kernels so I'm removing from FC4Blocker. If you *are* seeing this on FC4 (2.6.11-1.1340_FC4 or later), please speak up :)

Just to confirm, 2.6.11-1.27_FC3 on Athlon and on P3 still does this.

*** Bug 159060 has been marked as a duplicate of this bug. ***

*** Bug 159132 has been marked as a duplicate of this bug. ***

Inspired by comment #51 I've just compiled kernel-2.6.11-1.27_FC3 with a patch from
http://people.redhat.com/mingo/exec-shield/exec-shield-nx-2.6.11-A8
instead of patches
511: linux-2.6.11-execshield.patch
512: linux-2.6.8-print-fatal-signals.patch
513: linux-2.6.8-execshield-vaspace.patch
515: linux-2.6.10-x86_64-read-implies-exec32.patch
They are commented as "The execshield patch series, broken into smaller pieces" in the spec file. I ignored the reject of the last hunk of the patch - it just changes EXTRAVERSION in the Makefile and conflicts with 1: patch-2.6.11.10.bz2.
After installing the kernel and rebooting, the ntp crash _disappeared_. I have started ntpd more than 50 times and it didn't segfault.
So it looks like there's a bug in the exec-shield used in the FC3 kernel. It causes the ntp crash and could cause crashes in other programs as well - AFAIK nobody knows what exactly causes the ntpd crash.
I think an errata kernel should be released as soon as possible. I can provide my i686.rpm and src.rpm, but this kernel is a little more customized than described above - it has the Cyclades driver compiled in and it is optimized for Pentium III.

Same problem with ntpd happens here on Fedora Core 3 (with all updates) on three Pentium III machines (600, 650, 800 MHz) and a Pentium IV machine (2.4 GHz, no HyperThreading). As suggested, "execstack -s /usr/sbin/ntpd" seems to fix this problem, but SELinux now complains on every start of ntpd:
Jun 5 11:12:43 myhost kernel: audit(1117789963.646:0): avc: denied { execute } for pid=3137 comm=ntpd path=/etc/ld.so.cache dev=md6 ino=30082 scontext=user_u:system_r:ntpd_t tcontext=user_u:object_r:ld_so_cache_t tclass=file
Should I be worried about this?

What version of the selinux policy rpm do you have installed?

I'm seeing that message with selinux-policy-targeted-1.17.30-2.96 (and execstack -s to avoid the crash), but I have selinux in permissive mode, so ntpd's behaviour should be the same.

I get the messages with all current FC3 updates, including the selinux-policy packages:
selinux-policy-strict-1.19.10-2
selinux-policy-strict-sources-1.19.10-2
selinux-policy-targeted-1.17.30-2.96
selinux-policy-targeted-sources-1.17.30-2.96
My SELinux config is the default one (/etc/sysconfig/selinux):
SELINUX=enforcing
SELINUXTYPE=targeted
The funny thing is that ntpd seems to work fine.

This problem DOES seem to happen with FC4 kernels - with 2.6.11-1.1369_FC4, ntpd seems to stay up, but nscd usually dies, though it doesn't happen for at least a few seconds.

*** Bug 149135 has been marked as a duplicate of this bug. ***

Me too: FC3, kernel 2.6.11-1.14_FC3, selinux-policy-targeted-1.17.30-2.96

This problem seems to be fixed, for ntpd at least, in the kernel currently in testing, kernel-2.6.11-1.35_FC3, downloadable from http://download.fedora.redhat.com/pub/fedora/linux/core/updates/testing/3/ .
This should be closed when the fix makes it to updates-released.

So what was the problem? For me ntpd works now with kernel-2.6.11-1.1369_FC4 and 2.6.11-1.35_FC3smp. What were the 'Exec-shield improvements' :) ? Thanks for fixing it.

There was a flaw in the address space randomisation.