Bug 236705

Summary:	strange segfaults from crond, tar and rsync
Product:	Red Hat Enterprise Linux 4	Reporter:	Jan-Frode Myklebust <mykleb>
Component:	glibc	Assignee:	Jakub Jelinek <jakub>
Status:	CLOSED DUPLICATE	QA Contact:	Brian Brock <bbrock>
Severity:	high	Docs Contact:
Priority:	medium
Version:	4.4
Target Milestone:	---
Target Release:	---
Hardware:	ia32e
OS:	Linux
Whiteboard:
Fixed In Version:		Doc Type:	Bug Fix
Doc Text:		Story Points:	---
Clone Of:		Environment:
Last Closed:	2007-05-29 20:33:12 UTC	Type:	---

Description Jan-Frode Myklebust 2007-04-17 07:38:28 UTC

Description of problem:

I'm getting a few semi-random segfaults from crond, tar and rsync. 'dmesg' is
reporting:

rsync[28504]: segfault at 0000002a9813f0c8 rip 0000003ea24f504f rsp 0000007fbfffd900 error 4
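
For readers decoding the "error 4" field: on x86 the number printed after "error" is the page-fault error code, a small bit mask. The following is a minimal sketch of decoding its three low bits (the function name `decode_fault_error` is made up for illustration):

```python
def decode_fault_error(code):
    """Decode the low three bits of an x86 page-fault error code."""
    return {
        # bit 0: 0 = page not present, 1 = protection violation
        "cause": "protection violation" if code & 1 else "page not present",
        # bit 1: 0 = read access, 1 = write access
        "access": "write" if code & 2 else "read",
        # bit 2: 0 = fault taken in kernel mode, 1 = in user mode
        "mode": "user" if code & 4 else "kernel",
    }

print(decode_fault_error(4))
```

Under this reading, "error 4" is a user-mode read of an address with no page mapped, which is what one would expect from a process dereferencing a stale or bogus pointer.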

I'm running an rsync from another machine. I seem to be getting this error the
first time I run the rsync after a boot, and then I'm unable to reproduce it
(verified 3 times). I have no idea what triggers the crond segfaults, but they
seem to be coming around the time I'm doing the rsyncs.


# grep segfault messages
Apr 16 09:06:07 http2 kernel: rsync[21417]: segfault at 0000002a9813f020 rip 0000003ea24f504f rsp 0000007fbfffd910 error 4
Apr 16 13:01:01 http2 kernel: crond[13764]: segfault at 0000002a9563e818 rip 0000002a95a7e04f rsp 0000007fbffbf690 error 4
Apr 16 13:01:01 http2 kernel: crond[13767]: segfault at 0000002a9563e818 rip 0000002a95a7e04f rsp 0000007fbffbf690 error 4
Apr 16 14:53:01 http2 kernel: crond[9345]: segfault at 0000002a955c5218 rip 0000002a95a7e04f rsp 0000007fbffbf690 error 4
Apr 16 14:53:01 http2 kernel: crond[9346]: segfault at 0000002a955c5218 rip 0000002a95a7e04f rsp 0000007fbffbf690 error 4
Apr 16 14:53:01 http2 kernel: crond[9347]: segfault at 0000002a955c5218 rip 0000002a95a7e04f rsp 0000007fbffbf690 error 4
Apr 16 14:53:01 http2 kernel: crond[9348]: segfault at 0000002a955c5218 rip 0000002a95a7e04f rsp 0000007fbffbf690 error 4
Apr 16 14:53:01 http2 kernel: crond[9349]: segfault at 0000002a955c5218 rip 0000002a95a7e04f rsp 0000007fbffbf690 error 4
Apr 16 17:21:30 http2 kernel: rsync[28504]: segfault at 0000002a9813f0c8 rip 0000003ea24f504f rsp 0000007fbfffd900 error 4
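
To make the pattern in these messages easier to see, here is a minimal sketch that groups segfault lines by faulting instruction pointer (rip). The regular expression and the helper name `group_by_rip` are assumptions written for this log format, not an existing tool; the sample input is two entries copied from the log above.

```python
import re
from collections import defaultdict

# Matches the kernel message format shown in the log above.
SEGFAULT_RE = re.compile(
    r"(?P<prog>[\w.-]+)\[(?P<pid>\d+)\]: segfault at (?P<addr>[0-9a-f]+) "
    r"rip (?P<rip>[0-9a-f]+) rsp (?P<rsp>[0-9a-f]+) error (?P<err>\d+)"
)

def group_by_rip(log_text):
    """Map each faulting rip to the (program, pid) pairs that crashed there."""
    # Collapse all whitespace so entries wrapped over two lines still match.
    flat = " ".join(log_text.split())
    groups = defaultdict(list)
    for m in SEGFAULT_RE.finditer(flat):
        groups[int(m.group("rip"), 16)].append((m.group("prog"), int(m.group("pid"))))
    return dict(groups)

sample = """\
Apr 16 09:06:07 http2 kernel: rsync[21417]: segfault at 0000002a9813f020 rip
0000003ea24f504f rsp 0000007fbfffd910 error 4
Apr 16 13:01:01 http2 kernel: crond[13764]: segfault at 0000002a9563e818 rip
0000002a95a7e04f rsp 0000007fbffbf690 error 4
"""

for rip, hits in sorted(group_by_rip(sample).items()):
    # The low 12 bits are the offset within a 4 KiB page.
    print(f"rip={rip:#x} page_offset={rip & 0xfff:#05x} crashed={hits}")
```

Every crond crash shares one rip and every rsync crash shares another, and both values have the same offset within a page. That would be consistent with both programs faulting at the same instruction in a shared library mapped at different base addresses, though the log alone does not prove it.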


How reproducible:

Too reproducible for me to dare put this machine into production.

Additional info:

This is an IBM x346 with 1 CPU (3.2 GHz Xeon), 5 GB memory, and 1 TB storage on
SAN. It uses the RDAC multipath driver (www.lsi.com/rdac) and the IBM GPFS
filesystem, so I'm uncertain where to put the blame. I guess any of RHEL, GPFS,
or RDAC could be the cause.

And I've run extensive memory, cpu and cache tests from the "boot to
diagnostics" option.

Could this be related to
https://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=181721  ??

Comment 1 Jan-Frode Myklebust 2007-04-17 08:07:51 UTC

BTW: the rsync command I'm running from another machine is

time /usr/local/rsync-2.6.9/bin/rsync -av --delete --progress -e ssh /home/ root.41.4:/home/

and the cron jobs I put in to try to "stress" it were simply 6 copies of this one:

# cat /etc/cron.d/segfault-test
* * * * * root date > /dev/null

Comment 2 Jan-Frode Myklebust 2007-04-17 08:27:06 UTC

Strange.. I rebooted this server yesterday, ran the 'rsync' and saw it fail
immediately. Then there were no other segfaults until this morning, when I ran a
new, successful rsync and got these segfaults at the same time:

crond[1747]: segfault at 0000002a955ee218 rip 0000002a95a7e04f rsp 0000007fbffbf690 error 4
crond[1748]: segfault at 0000002a955ee218 rip 0000002a95a7e04f rsp 0000007fbffbf690 error 4
crond[1749]: segfault at 0000002a955ee218 rip 0000002a95a7e04f rsp 0000007fbffbf690 error 4
crond[1750]: segfault at 0000002a955ee218 rip 0000002a95a7e04f rsp 0000007fbffbf690 error 4
crond[1751]: segfault at 0000002a955ee218 rip 0000002a95a7e04f rsp 0000007fbffbf690 error 4


BTW: I've also tried running a bunch of "find /home -type f -exec md5sum '{}' \;"
commands to see if that would trigger the same segfault for 'md5sum', thinking it
might be a general problem with forking new processes, but got no segfaults from
this.

Comment 3 Jan-Frode Myklebust 2007-04-17 08:40:06 UTC

.. And an rsync to /home_local (internal disk, ext3 on LVM) also triggered a
bunch of crond segfaults.

# dmesg
crond[2265]: segfault at 0000002a955ee780 rip 0000002a95a7e04f rsp 0000007fbffbf690 error 4
crond[2266]: segfault at 0000002a955ee780 rip 0000002a95a7e04f rsp 0000007fbffbf690 error 4
crond[2267]: segfault at 0000002a955ee780 rip 0000002a95a7e04f rsp 0000007fbffbf690 error 4
crond[2268]: segfault at 0000002a955ee780 rip 0000002a95a7e04f rsp 0000007fbffbf690 error 4
crond[2269]: segfault at 0000002a955ee780 rip 0000002a95a7e04f rsp 0000007fbffbf690 error 4

Comment 4 Jan-Frode Myklebust 2007-04-17 10:02:43 UTC

I added a "ulimit -c unlimited" to /etc/init.d/crond and stopped/started
crond. After that I did not seem to be able to get the segfaults from crond while
running rsyncs. I have also rebooted, and am now trying to trigger these
segfaults. After rebooting I also commented out the "ulimit -S -c 0" in
/etc/profile, but when the first rsync segfaulted I couldn't find any core files.
Now I'm trying again and again to trigger the crond segfaults, hoping they will
produce a core file. But I'm afraid the "ulimit -c unlimited" might have removed
the problem from crond... ??

Comment 5 Jan-Frode Myklebust 2007-04-17 10:20:10 UTC

Nope, there it triggered again, but I can't find any core files. Is there any
special procedure (besides adding "ulimit -c unlimited" to the initscript) to
make crond dump cores?
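
One relevant detail when chasing missing core files: the core-file size limit (RLIMIT_CORE, what "ulimit -c" sets) is a per-process limit inherited from the parent, so what matters is the limit in effect in the process that actually spawns the crashing program. The following is a minimal sketch demonstrating that inheritance; the helper name `child_core_limit` is made up for illustration.

```python
import resource
import subprocess
import sys

def child_core_limit(soft_limit):
    """Start a child with the given soft RLIMIT_CORE and report what it sees."""
    def set_limit():
        # Runs in the child between fork and exec, like "ulimit -S -c N"
        # in an initscript before the daemon is started.
        _, hard = resource.getrlimit(resource.RLIMIT_CORE)
        resource.setrlimit(resource.RLIMIT_CORE, (soft_limit, hard))

    probe = "import resource; print(resource.getrlimit(resource.RLIMIT_CORE)[0])"
    out = subprocess.run(
        [sys.executable, "-c", probe],
        preexec_fn=set_limit,
        capture_output=True,
        text=True,
        check=True,
    )
    return int(out.stdout.strip())

# A soft limit of 0 disables core files for the child and its descendants.
print(child_core_limit(0))
```

Even with a non-zero limit, a core file may still be absent for other reasons, for example a working directory the process cannot write to, or a process that has changed user ID; which of these applied here is not established by the report.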

Comment 6 Jan-Frode Myklebust 2007-04-17 11:57:36 UTC

Minor correction.. I said I made 6 of the cronjobs that executed "date" every
minute.. Actually it was 5, which matches perfectly with the 5 segfaults that
seem to come together. I don't think the problem is caused by these running at
the same time, as initially I had only one cronjob running every minute, and that
triggered it too.

Comment 7 Jan-Frode Myklebust 2007-04-18 08:55:40 UTC

BTW; the message from the rsync-sending side is:

# time /usr/local/rsync-2.6.9/bin/rsync -av --delete --progress -e ssh /home/ root.41.4:/home/
building file list ... 
439888 files to consider
rsync: connection unexpectedly closed (8 bytes received so far) [sender]
rsync error: unexplained error (code 255) at io.c(453) [sender=2.6.9]

real    1m3.876s
user    0m19.660s
sys     0m9.290s


That is on the first rsync. Later rsyncs always work fine.

Comment 8 Jan-Frode Myklebust 2007-04-18 09:09:09 UTC

Inspired by bug 181721 I tried booting with mem=4G, mem=3G and mem=2G.. but it
had no effect on the first rsync. It crashed in all 3 cases.

Comment 9 Jan-Frode Myklebust 2007-04-18 09:22:44 UTC

I removed GPFS and RDAC from the equation.. and it still fails !

I.e., I turned GPFS off with chkconfig, changed grub.conf to boot with the
default RHEL4u4 initrd instead of the rdac-enabled one, rebooted, and saw the
same error when running rsync to the /home_local LVM/ext3 file system.

# lsmod|egrep -i 'mpp|rdac|gpfs'
# cat /proc/sys/kernel/tainted 
0
# uname -r -s -p -i -m
Linux 2.6.9-42.0.10.ELsmp x86_64 x86_64 x86_64
# dmesg
rsync[4351]: segfault at 0000002a9813f528 rip 0000003ea24f504f rsp 0000007fbfffd8f0 error 4

Comment 10 Jan-Frode Myklebust 2007-04-20 09:14:46 UTC

After finally getting a core dumped (needed to set ulimit in /etc/init.d/sshd,
/etc/profile wasn't enough), it points at nscd:

(gdb) where
#0  0x0000003ea24f504f in __nscd_cache_search () from /lib64/tls/libc.so.6
#1  0x0000003ea24f2e91 in nscd_getpw_r () from /lib64/tls/libc.so.6
#2  0x0000003ea24f3226 in __nscd_getpwnam_r () from /lib64/tls/libc.so.6
#3  0x0000003ea248e7cd in getpwnam_r@@GLIBC_2.2.5 () from /lib64/tls/libc.so.6
#4  0x0000003ea248e23f in getpwnam () from /lib64/tls/libc.so.6
#5  0x0000000000407a07 in ?? ()
#6  0x0000000000415f9a in ?? ()
#7  0x0000000000411dc4 in ?? ()
#8  0x00000000004098dc in ?? ()
#9  0x000000000040a6d1 in ?? ()
#10 0x0000003ea241c3fb in __libc_start_main () from /lib64/tls/libc.so.6
#11 0x0000000000402fca in ?? ()
#12 0x0000007fbffffbf8 in ?? ()
#13 0x000000000000001c in ?? ()
#14 0x0000000000000006 in ?? ()
#15 0x0000007fbffffddf in ?? ()
#16 0x0000007fbffffde5 in ?? ()
#17 0x0000007fbffffdee in ?? ()
#18 0x0000007fbffffdf8 in ?? ()
#19 0x0000007fbffffe01 in ?? ()
#20 0x0000007fbffffe03 in ?? ()
#21 0x0000000000000000 in ?? ()


and turning off nscd seems to make the problem go away. At least for my reliable
rsync test.
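
The backtrace enters libc through getpwnam(), i.e. a passwd lookup by user name. For anyone wanting to exercise that lookup path in isolation, here is a minimal sketch using Python's standard pwd module, which wraps the C library's passwd lookup functions. The helper name `lookup_many` and the user names are placeholders, and whether this loop reproduces the crash has not been tested.

```python
import pwd

def lookup_many(names, rounds=100):
    """Repeatedly look up user names; return the number of successful lookups."""
    found = 0
    for _ in range(rounds):
        for name in names:
            try:
                # With nscd running, glibc may answer this from nscd's cache.
                pwd.getpwnam(name)
                found += 1
            except KeyError:
                # Unknown user: the lookup still went through the same path.
                pass
    return found

print(lookup_many(["root", "no-such-user-236705"], rounds=10))
```

Running such a loop with nscd started and then stopped would be one way to compare behaviour with and without the cache in the path.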

Comment 11 Jakub Jelinek 2007-05-29 16:34:23 UTC

What glibc version are you using?  There were several fixes to nscd and to the
nscd client code in libc.so in the RHEL4.5 glibc.

Comment 12 Jan-Frode Myklebust 2007-05-29 20:24:41 UTC

# rpm -q glibc
glibc-2.3.4-2.25
glibc-2.3.4-2.25

Comment 13 Jan-Frode Myklebust 2007-05-29 20:26:32 UTC

BTW: this server went into production just before U5, and unfortunately we
won't be able to upgrade to U5 for a while.. Running without nscd has been
working fine so far.

Comment 14 Jakub Jelinek 2007-05-29 20:33:12 UTC

Closing as dup of #219145 then.
- fix application crashes when doing NSS lookups through nscd
  mmapped databases and nscd decides to start garbage collection
  during the lookups (#219145)
which is fixed in 2.3.4-2.36 and above.  If you get a chance to upgrade to it
and happen to reproduce it even with that glibc, please reopen.
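
As a quick way to check an installed version against the fixed one, here is a minimal sketch of a version-release comparison. It is a simplification that compares only the numeric segments and is not rpm's own comparison algorithm; the helper names are made up for illustration.

```python
import re

FIXED_IN = "2.3.4-2.36"  # first glibc version-release with the fix, per the comment above

def evr_key(version_release):
    """Turn e.g. '2.3.4-2.25' into (2, 3, 4, 2, 25) for tuple comparison."""
    return tuple(int(n) for n in re.findall(r"\d+", version_release))

def has_nscd_gc_fix(installed):
    """True if the installed version-release is at or above FIXED_IN."""
    return evr_key(installed) >= evr_key(FIXED_IN)

print(has_nscd_gc_fix("2.3.4-2.25"))  # the version reported in comment 12
```

For the version reported in comment 12, the comparison comes out below the fixed release, which matches the decision to close this as a duplicate.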

*** This bug has been marked as a duplicate of 219145 ***