Bug 236705
Summary: | strange segfaults from crond, tar and rsync | ||
---|---|---|---|
Product: | Red Hat Enterprise Linux 4 | Reporter: | Jan-Frode Myklebust <mykleb> |
Component: | glibc | Assignee: | Jakub Jelinek <jakub> |
Status: | CLOSED DUPLICATE | QA Contact: | Brian Brock <bbrock> |
Severity: | high | Docs Contact: | |
Priority: | medium | ||
Version: | 4.4 | ||
Target Milestone: | --- | ||
Target Release: | --- | ||
Hardware: | ia32e | ||
OS: | Linux | ||
Whiteboard: | |||
Fixed In Version: | | Doc Type: | Bug Fix |
Doc Text: | | Story Points: | --- |
Clone Of: | | Environment: | |
Last Closed: | 2007-05-29 20:33:12 UTC | Type: | --- |
Description
Jan-Frode Myklebust
2007-04-17 07:38:28 UTC
BTW: the rsync command I'm running from another machine is

    time /usr/local/rsync-2.6.9/bin/rsync -av --delete --progress -e ssh /home/ root.41.4:/home/

and the cron jobs I put in to try to "stress" it were simply 6 copies of this one:

    # cat /etc/cron.d/segfault-test
    * * * * * root date > /dev/null

Strange. I rebooted this server yesterday, ran the rsync and saw it fail immediately. There were no other segfaults until this morning, when I ran a new, successful rsync and got these segfaults at the same time:

    crond[1747]: segfault at 0000002a955ee218 rip 0000002a95a7e04f rsp 0000007fbffbf690 error 4
    crond[1748]: segfault at 0000002a955ee218 rip 0000002a95a7e04f rsp 0000007fbffbf690 error 4
    crond[1749]: segfault at 0000002a955ee218 rip 0000002a95a7e04f rsp 0000007fbffbf690 error 4
    crond[1750]: segfault at 0000002a955ee218 rip 0000002a95a7e04f rsp 0000007fbffbf690 error 4
    crond[1751]: segfault at 0000002a955ee218 rip 0000002a95a7e04f rsp 0000007fbffbf690 error 4

BTW: I've also tried running a bunch of "find /home -type f -exec md5sum '{}' \;" to see if that would trigger the same segfault for md5sum, thinking it might be a general problem with forking new processes, but got no segfaults from this.

An rsync to /home_local (internal disk, ext3 on LVM) also triggered a bunch of crond segfaults:

    # dmesg
    crond[2265]: segfault at 0000002a955ee780 rip 0000002a95a7e04f rsp 0000007fbffbf690 error 4
    crond[2266]: segfault at 0000002a955ee780 rip 0000002a95a7e04f rsp 0000007fbffbf690 error 4
    crond[2267]: segfault at 0000002a955ee780 rip 0000002a95a7e04f rsp 0000007fbffbf690 error 4
    crond[2268]: segfault at 0000002a955ee780 rip 0000002a95a7e04f rsp 0000007fbffbf690 error 4
    crond[2269]: segfault at 0000002a955ee780 rip 0000002a95a7e04f rsp 0000007fbffbf690 error 4

I added a "ulimit -c unlimited" to /etc/init.d/crond and stopped and restarted crond. After that I could not seem to get the segfaults from crond while running rsyncs.
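For readers unfamiliar with the kernel's "segfault at ... error N" lines, here is a minimal sketch of how they can be picked apart. The function names `decode_fault` and `parse_segfault` are hypothetical helpers, not part of any tool mentioned in this report. The sketch assumes the error value is printed in hexadecimal and only looks at the low three bits of the x86 page-fault error code (bit 0: protection violation vs. page not present, bit 1: write vs. read, bit 2: user vs. kernel mode).

```shell
#!/bin/sh
# Decode the low three bits of an x86 page-fault error code.
# Higher bits (reserved-bit violation, instruction fetch) are ignored here.
decode_fault() (
    code=$((0x$1))   # assumption: the kernel prints the code in hex
    if [ $((code & 4)) -ne 0 ]; then mode="user-mode"; else mode="kernel-mode"; fi
    if [ $((code & 2)) -ne 0 ]; then access="write"; else access="read"; fi
    if [ $((code & 1)) -ne 0 ]; then kind="protection violation"; else kind="page not present"; fi
    echo "$mode $access, $kind"
)

# Split one log line of the form seen in this report into its fields.
# Exits non-zero if the line does not match that form.
parse_segfault() (
    fields=$(printf '%s\n' "$1" | sed -n \
        's/^\([^[]*\)\[\([0-9]*\)]: segfault at \([0-9a-f]*\) rip \([0-9a-f]*\) rsp [0-9a-f]* error \([0-9a-f]*\)$/\1 \2 \3 \4 \5/p')
    [ -n "$fields" ] || exit 1
    set -- $fields
    echo "prog=$1 pid=$2 addr=$3 rip=$4 fault=$(decode_fault "$5")"
)

# Example with one of the lines quoted above.
parse_segfault "crond[1747]: segfault at 0000002a955ee218 rip 0000002a95a7e04f rsp 0000007fbffbf690 error 4"
```

Under those assumptions, "error 4" reads as a user-mode read of an address with no page mapped. The identical rip value across all the crond crashes suggests they all fault at the same instruction.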
Also rebooted, and am now trying to trigger these segfaults. After rebooting I also commented out the "ulimit -S -c 0" from /etc/profile, but when I ran the first rsync it segfaulted and I couldn't find any core files. Now I'm trying again and again to trigger the crond segfaults, hoping they will produce a core file. But I'm afraid the "ulimit -c unlimited" might have removed the problem from crond?

Nope, there it triggered again, but I can't find any core files. Is there any special procedure (besides adding "ulimit -c unlimited" to the initscript) to make crond dump core?

Minor correction: I said I made 6 of the cron jobs that executed "date" every minute. Actually it was 5, which matches perfectly with the 5 segfaults that seem to come together. I don't think it's a problem with these running at the same time, as initially I had only one cron job running every minute, and that triggered it too.

BTW: the message from the rsync-sending side is:

    # time /usr/local/rsync-2.6.9/bin/rsync -av --delete --progress -e ssh /home/ root.41.4:/home/
    building file list ...
    439888 files to consider
    rsync: connection unexpectedly closed (8 bytes received so far) [sender]
    rsync error: unexplained error (code 255) at io.c(453) [sender=2.6.9]

    real 1m3.876s
    user 0m19.660s
    sys 0m9.290s

That is on the first rsync. Later rsyncs always work fine.

Inspired by bug 181721 I tried booting with mem=4G, mem=3G and mem=2G, but it had no effect on the first rsync. It crashed in all 3 cases.

I removed GPFS and RDAC from the equation, and it still fails! I.e. I chkconfig'ed GPFS off, changed grub.conf to boot with the default RHEL4u4 initrd instead of the rdac-enabled one, rebooted, and saw the same error when running rsync to the /home_local LVM/ext3 file system.
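On the core-file question above: one likely reason the /etc/profile change had no effect (an assumption, not something confirmed in this report) is that resource limits are inherited from the parent process, and /etc/profile is only read by login shells. A program started by a daemon therefore gets the daemon's limit. The sketch below only demonstrates the inheritance; `child_core_limit` is a hypothetical helper name.

```shell
#!/bin/sh
# Print the core-size soft limit that a child process sees when its
# parent has set the limit to "$1". Runs in a subshell, so the calling
# shell's own limit is left untouched.
child_core_limit() (
    ulimit -S -c "$1" || exit 1
    sh -c 'ulimit -S -c'
)

# A parent that lowers the soft limit to 0 passes that 0 on to its child.
child_core_limit 0
```

By the same logic, the limit has to be raised in whatever process actually spawns the crashing program, which for a daemon means its initscript rather than a user's shell profile.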
    # lsmod|egrep -i 'mpp|rdac|gpfs'
    # cat /proc/sys/kernel/tainted
    0
    # uname -r -s -p -i -m
    Linux 2.6.9-42.0.10.ELsmp x86_64 x86_64 x86_64
    # dmesg
    rsync[4351]: segfault at 0000002a9813f528 rip 0000003ea24f504f rsp 0000007fbfffd8f0 error 4

After finally getting a core dumped (I needed to set the ulimit in /etc/init.d/sshd; /etc/profile wasn't enough), it points at nscd:

    (gdb) where
    #0 0x0000003ea24f504f in __nscd_cache_search () from /lib64/tls/libc.so.6
    #1 0x0000003ea24f2e91 in nscd_getpw_r () from /lib64/tls/libc.so.6
    #2 0x0000003ea24f3226 in __nscd_getpwnam_r () from /lib64/tls/libc.so.6
    #3 0x0000003ea248e7cd in getpwnam_r@@GLIBC_2.2.5 () from /lib64/tls/libc.so.6
    #4 0x0000003ea248e23f in getpwnam () from /lib64/tls/libc.so.6
    #5 0x0000000000407a07 in ?? ()
    #6 0x0000000000415f9a in ?? ()
    #7 0x0000000000411dc4 in ?? ()
    #8 0x00000000004098dc in ?? ()
    #9 0x000000000040a6d1 in ?? ()
    #10 0x0000003ea241c3fb in __libc_start_main () from /lib64/tls/libc.so.6
    #11 0x0000000000402fca in ?? ()
    #12 0x0000007fbffffbf8 in ?? ()
    #13 0x000000000000001c in ?? ()
    #14 0x0000000000000006 in ?? ()
    #15 0x0000007fbffffddf in ?? ()
    #16 0x0000007fbffffde5 in ?? ()
    #17 0x0000007fbffffdee in ?? ()
    #18 0x0000007fbffffdf8 in ?? ()
    #19 0x0000007fbffffe01 in ?? ()
    #20 0x0000007fbffffe03 in ?? ()
    #21 0x0000000000000000 in ?? ()

and turning off nscd seems to make the problem go away, at least for my reliable rsync test.

What glibc version are you using? There were several fixes to nscd and to the nscd client code in libc.so in the RHEL4.5 glibc.

    # rpm -q glibc
    glibc-2.3.4-2.25
    glibc-2.3.4-2.25

BTW: this server went into production just before U5, and unfortunately we won't be able to upgrade to U5 for a while. Running without nscd has been working fine so far.

Closing as dup of #219145 then.

    - fix application crashes when doing NSS lookups through nscd mmapped databases and nscd decides to start garbage collection during the lookups (#219145)

which is fixed in 2.3.4-2.36 and above.
If you get a chance to upgrade to it and happen to reproduce it even with that glibc, please reopen.

*** This bug has been marked as a duplicate of 219145 ***
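To check whether a given machine already has the glibc that carries the #219145 fix, the installed version can be compared against 2.3.4-2.36. The following is a simplified sketch: `version_ge` is a hypothetical helper that compares purely numeric fields split on "." and "-". It is not rpm's own comparison algorithm and will not handle non-numeric release tags.

```shell
#!/bin/sh
# Succeed if version "$1" is greater than or equal to version "$2".
# Fields are separated by "." or "-" and compared as integers,
# left to right; a missing field counts as 0.
version_ge() (
    a=$1 b=$2
    while [ -n "$a" ] || [ -n "$b" ]; do
        fa=${a%%[.-]*}; fb=${b%%[.-]*}
        # drop the field just read, plus its separator if there is one
        case $a in *[.-]*) a=${a#*[.-]} ;; *) a= ;; esac
        case $b in *[.-]*) b=${b#*[.-]} ;; *) b= ;; esac
        [ "${fa:-0}" -gt "${fb:-0}" ] && exit 0
        [ "${fa:-0}" -lt "${fb:-0}" ] && exit 1
    done
    exit 0
)

installed=glibc-2.3.4-2.25   # as printed by "rpm -q glibc" in this report
if version_ge "${installed#glibc-}" 2.3.4-2.36; then
    echo "has the #219145 fix"
else
    echo "predates the #219145 fix"
fi
```

For the version in this report the check takes the second branch, which matches the advice above to keep nscd off until the upgrade.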