Bug 488449 - segfaults from ld-linux-x86-64

Product: Fedora
Component: kernel
Version: 10
Hardware: x86_64
OS: Linux
Status: CLOSED ERRATA
Severity: high
Priority: high
Reporter: Neal Becker <ndbecker2>
Assignee: Kernel Maintainer List <kernel-maint>
QA Contact: Fedora Extras Quality Assurance <extras-qa>
CC: drepper, jakub, jbs, kernel-maint, knightr, krellan, kszysiu, lucas, markrwatts, michael.madore, quintela, rocketraman, sergey_bogomolov
Fixed In Version: 2.6.27.21-170.2.56.fc10
Doc Type: Bug Fix
Last Closed: 2009-04-02 21:14:02 UTC
Description (Neal Becker, 2009-03-04 12:52:54 UTC)
Try a different kernel. I saw something like this for specific kernel versions.

*** Bug 490514 has been marked as a duplicate of this bug. ***

I'm seeing this issue with 2.6.27.19-170.2.35.fc10.x86_64. Is it worth downgrading to try?

*** Bug 483449 has been marked as a duplicate of this bug. ***

(In reply to comment #3)
> Is it worth downgrading to try?

I very much suspect kernel issues, so, yes, downgrade, upgrade, whatever. Try different kernel versions.

I tried with kernels 2.6.27.15-170.2.24.fc10.x86_64 and 2.6.27.19-170.2.35.fc10.x86_64:

glibc-2.9-3.x86_64
glibc-2.9-3.i686
SELINUX=permissive
SELINUXTYPE=targeted
setsebool -P allow_unconfined_mmap_low 1

$ ldd /usr/bin/*
...
ldd: exited with unknown exit code (139)

kernel: ld-linux-x86-64[27165]: segfault at 0 ip 000000000040074b sp 00007fffb68943c0 error 6 in addr2line[400000+6000]

$ sync
$ ldd /usr/bin/addr2line
        linux-vdso.so.1 => (0x00007fff1a9fe000)
        libbfd-2.18.50.0.9-8.fc10.so => /usr/lib64/libbfd-2.18.50.0.9-8.fc10.so (0x0000000000606000)
        libz.so.1 => /lib64/libz.so.1 (0x0000000000110000)
        libc.so.6 => /lib64/libc.so.6 (0x0000000000c94000)
        /lib64/ld-linux-x86-64.so.2 (0x0000000000a73000)

The name of the "bad" program (addr2line here) is random. I have never enabled SELinux here.

I still see these issues with 2.6.29-3.fc10.x86_64.

I too am having the same problems on two different installations of F10 x86_64. One of the PCs in question (built around an Asus P5B-plus) has been running multiple Fedora releases (32-bit, though) in the past, while the other is basically a new PC.
The other PC (built around an Asus P5Q-E) has a hardware RAID controller (3ware 9650SE), and it complained about some sort of write error within a minute of the last segfault, so my guess is that the problems are related:

Mar 15 18:01:33 hunter kernel: ld-linux-x86-64[8533]: segfault at de6000 ip 00000000003d057b sp 00007fffb86c8b58 error 6 in ld-2.9.so[3b8000+20000]
Mar 15 18:01:33 hunter kernel: ld-linux-x86-64[8536]: segfault at e06000 ip 00000000003c258b sp 00007fff8672fb98 error 6 in ld-2.9.so[3aa000+20000]
Mar 15 18:01:36 hunter kernel: __ratelimit: 6 callbacks suppressed
Mar 15 18:01:36 hunter kernel: ld-linux-x86-64[8824]: segfault at 5 ip 0000000000400dbb sp 00007fffb74fd970 error 6 in jhat[400000+8000]
Mar 15 18:01:42 hunter kernel: ld-linux-x86-64[8908]: segfault at 85a000 ip 00000000002c658f sp 00007fff828e0d68 error 6 in ld-2.9.so[2ae000+20000]
Mar 15 18:01:49 hunter kernel: ld-linux-x86-64[8983]: segfault at 60e510 ip 000000000060e510 sp 00007fff8aecc348 error 15 in gnome-session-splash[60e000+1000]
Mar 15 18:01:53 hunter kernel: ld-linux-x86-64[8991]: segfault at 607410 ip 0000000000402da8 sp 00007fff138d2d50 error 4 in screenshot[400000+7000]
Mar 15 18:01:53 hunter kernel: ld-linux-x86-64[9000]: segfault at 606c20 ip 0000000000606c20 sp 00007fffa06cb3b8 error 15 in file-wmf[606000+1000]
Mar 15 18:01:53 hunter kernel: ld-linux-x86-64[9045]: segfault at 605dbb ip 0000000000605dbb sp 00007fffd2fb2440 error 15 in color-exchange[605000+1000]
Mar 15 18:01:54 hunter kernel: ld-linux-x86-64[9133]: segfault at 603310 ip 0000000000603310 sp 00007fffce92c5d8 error 15 in shift[603000+1000]
Mar 15 18:01:54 hunter kernel: ld-linux-x86-64[9154]: segfault at 60492a ip 000000000060492a sp 00007fffa35b0a38 error 15 in unsharp-mask[604000+1000]
Mar 15 18:02:43 hunter kernel: attempt to access beyond end of device
Mar 15 18:02:43 hunter kernel: dm-0: rw=1, want=2817055128, limit=80216064
Mar 15 18:02:43 hunter kernel: Buffer I/O error on device dm-0, logical block 1370162
Mar 15 18:02:43 hunter kernel: lost page write due to I/O error on dm-0
Mar 15 18:02:43 hunter kernel: ext3_journal_dirty_data: aborting transaction: IO failure in ext3_journal_dirty_data
Mar 15 18:02:43 hunter kernel: EXT3-fs error (device dm-0) in ext3_ordered_writepage: IO failure
Mar 15 18:02:43 hunter kernel: JBD: Detected IO errors while flushing file data on dm-0

This problem is a real show stopper, as it is impossible to trust a PC/server running this OS when it crashes with these kinds of errors.

I should have changed it to a kernel bug a long time ago.

I started getting a bunch of these after my FC10 x86_64 laptop crashed. Programs like NetworkManager and packagekitd were segfaulting, and the kernel was spitting out these "general protection" errors in ld-2.9.so. So I worked backwards through my log to find the first "general protection" error, and only a few lines above it I had some of this:

Mar 30 16:49:42 hutl13413 kernel: EXT3-fs: INFO: recovery required on readonly filesystem.
Mar 30 16:49:42 hutl13413 kernel: EXT3-fs: write access will be enabled during recovery.
Mar 30 16:49:42 hutl13413 kernel: kjournald starting.  Commit interval 5 seconds
Mar 30 16:49:42 hutl13413 kernel: EXT3-fs: sda2: orphan cleanup on readonly fs
Mar 30 16:49:42 hutl13413 kernel: EXT3-fs: sda2: 71 orphan inodes deleted
Mar 30 16:49:42 hutl13413 kernel: EXT3-fs: recovery complete.
Mar 30 16:49:42 hutl13413 kernel: EXT3-fs: mounted filesystem with ordered data mode.

Hmmm... Next I used yum to reinstall glibc, which in turn runs /sbin/ldconfig, which spat out a whole bunch of angry crap:

/sbin/ldconfig: /lib64/libdevmapper.so.1.02 is not an ELF file - it has the wrong magic bytes at the start.
/sbin/ldconfig: /lib64/libdevmapper.so is not an ELF file - it has the wrong magic bytes at the start.
/sbin/ldconfig: /usr/lib64/libkalarm_resources.so.4 is not an ELF file - it has the wrong magic bytes at the start.
/sbin/ldconfig: file /usr/lib64/libnm-util.so.1.0.0 is truncated
...etc...

Now I'm thinking: aha! It's not a glibc or kernel fault, it's corrupted libraries...

$ rpm -qf /usr/lib64/libnm-util.so.1.0.0
NetworkManager-glib-0.7.0.99-5.git20090326.fc10.x86_64
$ yum reinstall NetworkManager-glib-0.7.0.99-5.git20090326.fc10.x86_64
$ sudo service NetworkManager start
Setting network parameters...  [ OK ]
Starting NetworkManager daemon:  [ OK ]
$ nm-applet
libnotify-Message: GetCapabilities call failed: Launch helper exited with unknown return code 127
/usr/lib64/libgvfscommon.so.0: invalid ELF header
Failed to load module: /usr/lib64/gio/modules/libgioremote-volume-monitor.so
/usr/lib64/libgvfscommon.so.0: invalid ELF header
Failed to load module: /usr/lib64/gio/modules/libgvfsdbus.so

So I've still got some work to do... but at least I've made progress: NetworkManager now runs rather than segfaulting, and nm-applet runs with the default config. Just gotta reinstall the rest of those corrupt libraries and maybe get rpm to verify all packages, just to be sure.

Hope this helps some of the rest of you!

-- Rich

Created attachment 337372 [details]
strace output from SEGFAULTed run of "ldd ls"
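The ldconfig complaints quoted earlier can be triaged mechanically by extracting each corrupted file path and mapping it back to its owning package. A minimal sketch, assuming POSIX sh; the embedded error text is sample data copied from the comment above (not live output), and the rpm/yum steps are left as comments since they would modify the system:

```shell
#!/bin/sh
# Sample ldconfig errors, copied from the comment above (not live output).
errors='/sbin/ldconfig: /lib64/libdevmapper.so.1.02 is not an ELF file - it has the wrong magic bytes at the start.
/sbin/ldconfig: file /usr/lib64/libnm-util.so.1.0.0 is truncated'

# Strip the "/sbin/ldconfig: " (and optional "file ") prefix, then drop
# everything after the path, leaving one corrupted library per line.
paths=$(printf '%s\n' "$errors" \
    | sed -e 's|^/sbin/ldconfig: file ||' -e 's|^/sbin/ldconfig: ||' -e 's| .*||')
printf '%s\n' "$paths"

# For each path one would then run (not executed here):
#   rpm -qf "$path"            # find the owning package
#   yum reinstall "$package"   # restore a clean copy
```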
I tried reinstalling glibc, but no errors or corrupt libraries were reported as in the previous comment.
I do not have selinux enabled.
I get this error randomly from ldd:
ldd: exited with unknown exit code (139)
which seems to be due to a segmentation fault in /lib64/ld-linux-x86-64.so.2.
I was able to capture an strace from a failed execution via:
# cd /bin
# while [ $(strace -f ldd ls 2>&1 | tee ls.strace.log | grep -c SEG) = 0 ]; do echo "trying ldd"; done;
(as a previous poster mentioned, ls is a random binary -- I can trigger the error much faster by using other binaries -- the error seems to occur more often the bigger the binary is, but I haven't tested that rigorously)
I have attached the strace output.
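The repro loop above can be generalized into a sweep over many binaries, using ldd's "exited with unknown exit code (139)" message (139 = 128 + SIGSEGV) as the crash signal. A minimal sketch, assuming POSIX sh; check_ldd is a made-up helper name, not something from the thread:

```shell
#!/bin/sh
# Report binaries whose ldd run hits the segfault described in this bug.
# ldd prints "exited with unknown exit code (139)" when the program it
# inspects is killed by SIGSEGV (exit status 139 = 128 + 11).
check_ldd() {
    for bin in "$@"; do
        if ldd "$bin" 2>&1 | grep -q 'exit code (139)'; then
            echo "SEGV: $bin"
        fi
    done
}

# Example sweep (a full /usr/bin/* sweep works too, just slowly):
check_ldd /bin/ls /bin/sh
```

On an unaffected kernel this prints nothing; on an affected one it names the binaries whose loader run crashed.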
(In reply to comment #12)
> I get this error randomly from ldd:
>
> ldd: exited with unknown exit code (139)

Forgot to mention:

# uname -r
2.6.27.19-170.2.35.fc10.x86_64

Hi,

I am also seeing this problem, but not with all kernels:

kernel-2.6.27.5-117.fc10.src.rpm = OK
kernel-2.6.27.19-170.2.35.fc10.src.rpm = BAD

I can now reproduce the behaviour consistently:

1) Perform a fresh install of Fedora 10
2) Disable crond (to prevent prelink from running)
3) Perform yum update
4) Reboot
5) Run /etc/cron.daily/prelink

If I remove the following patch from the kernel:

Patch160: linux-2.6-execshield.patch

then the segfaults don't occur. Let me know if I can provide any additional information.

Mike Madore

I avoided the segfault with:

setarch x86_64 --addr-compat-layout ldd ...

(In reply to comment #15)
> I avoided the segfault with
>
> setarch x86_64 --addr-compat-layout ldd ...

I confirm this works around the segfault on my machine as well.

Hi,

If I modify the call to prelink in /etc/cron.daily/prelink like so:

setarch x86_64 -L /usr/sbin/prelink -av $PRELINK_OPTS >> /var/log/prelink/prelink.log 2>&1 \

then the segfaults are eliminated on my system.

Mike Madore

This bug also breaks VMware Workstation 6.5.1 on Fedora 10: VMware fails to start when clicked on. The interesting thing is that it happened only sometimes, not all of the time, and seemed internal to ldd.

http://communities.vmware.com//message/1213672

That thread went on for a while, until somebody kindly discovered this bug and the workarounds in the comments here. I applied the workaround in comment 17 to prelink, and it seemed to make the problem in VMware go away.

(In reply to comment #14)
> If I remove the following patch from the kernel:
>
> Patch160: linux-2.6-execshield.patch
>
> then the segfaults don't occur.

There was one change to that patch recently. If you revert the below change instead of removing the execshield patch, does it fix the problem?
--- linux-2.6-execshield.patch  29 Jan 2009 21:14:11 -0000  1.100
+++ linux-2.6-execshield.patch  11 Mar 2009 16:35:27 -0000  1.101
@@ -456,10 +456,10 @@
  	/* Enable PSE if available */
  	if (cpu_has_pse)
 diff --git a/arch/x86/mm/mmap.c b/arch/x86/mm/mmap.c
-index 56fe712..ec932ae 100644
+index 56fe712..30d2be7 100644
 --- a/arch/x86/mm/mmap.c
 +++ b/arch/x86/mm/mmap.c
-@@ -111,13 +111,15 @@ static unsigned long mmap_legacy_base(void)
+@@ -111,13 +111,16 @@ static unsigned long mmap_legacy_base(void)
   */
  void arch_pick_mmap_layout(struct mm_struct *mm)
  {
@@ -471,7 +471,8 @@
  	} else {
  		mm->mmap_base = mmap_base();
  		mm->get_unmapped_area = arch_get_unmapped_area_topdown;
-+		if (!(current->personality & READ_IMPLIES_EXEC))
++		if (!(current->personality & READ_IMPLIES_EXEC)
++		    && mmap_is_ia32())
 +		mm->get_unmapped_exec_area = arch_get_unmapped_exec_area;
  		mm->unmap_area = arch_unmap_area_topdown;
  	}

(In reply to comment #19)
> There was one change to that patch recently. If you revert the below change
> instead of removing the execshield patch does it fix the problem?

Never mind, that change was made after the bug was first reported.

Can someone confirm:

kernel-2.6.27.12-170.2.5.fc10 should be BAD: http://koji.fedoraproject.org/koji/buildinfo?buildID=79612
kernel-2.6.27.9-159.fc10 should be GOOD: http://koji.fedoraproject.org/koji/buildinfo?buildID=74993

Has anyone tested kernel-2.6.27.21-170.2.56.fc10 from updates-testing to see if the latest execshield update fixes the problem?

Hi Chuck,

Here are the results of testing the various kernel packages:

1) kernel-2.6.27.9-159.fc10 = GOOD
2) kernel-2.6.27.12-170.2.5.fc10 = BAD
3) kernel-2.6.27.21-170.2.56.fc10 = GOOD

I then applied the patch from above to my bad kernel (kernel-2.6.27.19-170.2.35.fc10), and that fixed it.

Do you know what the timeframe is for moving kernel 3) from testing to updates?

Mike

(In reply to comment #22)
> Do you know what the timeframe is for moving kernel 3) from testing to updates?
The move was requested on Mar 30.

kernel-2.6.27.21-170.2.56.fc10 has been marked stable.
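The --addr-compat-layout workaround reported in the thread can be wrapped in a small helper for reuse; a minimal sketch, assuming util-linux setarch is installed (run_compat is a made-up name for illustration):

```shell
#!/bin/sh
# Run a command with the legacy (compat) address-space layout, avoiding
# the flexible-mmap/execshield interaction that crashes ld-linux here.
# setarch -L / --addr-compat-layout sets the ADDR_COMPAT_LAYOUT
# personality flag before exec'ing the command.
run_compat() {
    setarch "$(uname -m)" --addr-compat-layout "$@"
}

# Example: the crash-prone ldd invocation, wrapped.
run_compat ldd /bin/ls
```

The same wrapper generalizes the comment-17 prelink fix: prefixing any affected invocation with the compat-layout personality sidesteps the buggy mmap placement without a kernel change.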