Bug 488449

Summary: segfaults from ld-linux-x86-64
Product: [Fedora] Fedora
Reporter: Neal Becker <ndbecker2>
Component: kernel
Assignee: Kernel Maintainer List <kernel-maint>
Status: CLOSED ERRATA
QA Contact: Fedora Extras Quality Assurance <extras-qa>
Severity: high
Priority: high
Version: 10
CC: drepper, jakub, jbs, kernel-maint, knightr, krellan, kszysiu, lucas, markrwatts, michael.madore, quintela, rocketraman, sergey_bogomolov
Hardware: x86_64   
OS: Linux   
Fixed In Version: 2.6.27.21-170.2.56.fc10
Doc Type: Bug Fix
Last Closed: 2009-04-02 21:14:02 UTC
Attachments: strace output from SEGFAULTed run of "ldd ls"

Description Neal Becker 2009-03-04 12:52:54 UTC
Description of problem:
I just installed F10 onto a new machine, and I see many messages like this:

Mar  3 11:27:28 localhost kernel: ld-linux-x86-64[18733]: segfault at 400000 ip 000000000042b976 sp 00007fff5c8e9d88 error 6 in find[400000+37000]
Mar  3 11:27:29 localhost kernel: ld-linux-x86-64[18891] general protection ip:1e399b sp:7fff20c007d0 error:0 in ld-2.9.so[1e0000+20000]
Mar  3 11:27:29 localhost kernel: ld-linux-x86-64[18900]: segfault at 400000 ip 000000000040092a sp 00007fff672fa798 error 7 in kaffeine[400000+160000]
Mar  3 11:27:29 localhost kernel: ld-linux-x86-64[18945] general protection ip:4003c0 sp:7fffaf6b5278 error:0 in fc-match[400000+2000]
Mar  3 11:27:30 localhost kernel: ld-linux-x86-64[19012] general protection ip:2181f0 sp:7fffd04cbb00 error:0 in ld-2.9.so[210000+20000]
Mar  3 11:27:32 localhost kernel: ld-linux-x86-64[19140] general protection ip:401963 sp:7fff72ef6aa8 error:0 in hal-is-caller-privileged[400000+2000]
Mar  3 11:27:32 localhost kernel: ld-linux-x86-64[19143]: segfault at 75d3d8 ip 0000000000428a20 sp 00007fffddbcd060 error 4 in nautilus[400000+158000]
Mar  3 11:27:32 localhost kernel: ld-linux-x86-64[19147]: segfault at d1092a ip 0000000000d1092a sp 00007fff47cd2168 error 15
Mar  3 11:27:32 localhost kernel: ld-linux-x86-64[19186] trap invalid opcode ip:407ab8 sp:7fff6968d1f8 error:0 in paste[400000+8000]
Mar  3 11:27:33 localhost kernel: ld-linux-x86-64[19191]: segfault at 2a8 ip 00000000004d199b sp 00007ffff282f400 error 4 in ld-2.9.so[4ce000+20000]
Mar  3 11:27:34 localhost kernel: __ratelimit: 1 callbacks suppressed
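
For readers decoding the "segfault at" lines above: on x86 the "error" value is the hardware page-fault error code, and the kernel prints it in hexadecimal, so "error 15" means 0x15. The Python sketch below spells out the bit layout. The name decode_fault is illustrative only and is not part of any tool mentioned in this report. It does not apply to the "general protection" or "trap invalid opcode" lines, which are different exceptions.

```python
# Bit layout of the x86 page-fault error code (the kernel prints it in hex).
FLAGS = [
    (0x01, "protection violation", "page not present"),
    (0x02, "write", "read"),
    (0x04, "user mode", "kernel mode"),
    (0x08, "reserved bit set", None),
    (0x10, "instruction fetch", None),
]


def decode_fault(code_text):
    """Decode the hex 'error' field of a kernel 'segfault at' line."""
    code = int(code_text, 16)
    parts = []
    for bit, if_set, if_clear in FLAGS:
        if code & bit:
            parts.append(if_set)
        elif if_clear is not None:
            parts.append(if_clear)
    return parts


# The four distinct error values that appear in the log above.
for text in ("4", "6", "7", "15"):
    print(text, "->", ", ".join(decode_fault(text)))
```

Read this way, the "error 15" entries (where the faulting address equals the instruction pointer) are attempts to execute code from a mapped but non-executable page.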


Version-Release number of selected component (if applicable):

glibc-2.9-3.x86_64


Comment 1 Ulrich Drepper 2009-03-16 20:54:33 UTC
Try a different kernel.  I saw something like this for specific kernel versions.

Comment 2 Ulrich Drepper 2009-03-16 20:56:00 UTC
*** Bug 490514 has been marked as a duplicate of this bug. ***

Comment 3 Mark Watts 2009-03-16 20:59:40 UTC
I'm seeing these issues with 2.6.27.19-170.2.35.fc10.x86_64

Is it worth downgrading to try?

Comment 4 Ulrich Drepper 2009-03-16 21:29:15 UTC
*** Bug 483449 has been marked as a duplicate of this bug. ***

Comment 5 Ulrich Drepper 2009-03-16 21:30:07 UTC
(In reply to comment #3)
> Is it worth downgrading to try?  

I very much suspect kernel issues, so, yes, downgrade, upgrade, whatever.  Try different kernel versions.

Comment 6 Sergey 2009-03-17 09:52:43 UTC
I tried with kernel 2.6.27.15-170.2.24.fc10.x86_64 and 2.6.27.19-170.2.35.fc10.x86_64

glibc-2.9-3.x86_64
glibc-2.9-3.i686

SELINUX=permissive
SELINUXTYPE=targeted
setsebool -P allow_unconfined_mmap_low 1

$ ldd /usr/bin/*
...
ldd: exited with unknown exit code (139)

kernel: ld-linux-x86-64[27165]: segfault at 0 ip 000000000040074b sp 00007fffb68943c0 error 6 in addr2line[400000+6000]

$ sync
$ ldd /usr/bin/addr2line
 linux-vdso.so.1 =>  (0x00007fff1a9fe000)
 libbfd-2.18.50.0.9-8.fc10.so => /usr/lib64/libbfd-2.18.50.0.9-8.fc10.so (0x0000000000606000)
 libz.so.1 => /lib64/libz.so.1 (0x0000000000110000)
 libc.so.6 => /lib64/libc.so.6 (0x0000000000c94000)
 /lib64/ld-linux-x86-64.so.2 (0x0000000000a73000)

The name of the "bad" program (addr2line here) is random.
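
A note on the "exit code (139)" that ldd reports: by shell convention, a status above 128 means the process was killed by signal number (status - 128), and 139 - 128 = 11 is SIGSEGV on Linux. A minimal Python sketch of that arithmetic follows; the helper name describe_exit_status is made up for illustration.

```python
import signal


def describe_exit_status(status):
    """Interpret a shell-style exit status; values above 128 mean death by signal."""
    if status > 128:
        return "killed by " + signal.Signals(status - 128).name
    return "exited with code %d" % status


print(describe_exit_status(139))
```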

Comment 7 Neal Becker 2009-03-17 10:59:10 UTC
I have never enabled SELinux here.

Comment 8 Mark Watts 2009-03-26 22:39:18 UTC
I still see these issues with 2.6.29-3.fc10.x86_64

Comment 9 Nicolas Riendeau 2009-03-27 00:57:12 UTC
I too am having the same problems on two different installations of F10 x86_64.

One of the PCs in question (built around an Asus P5B-plus) has run multiple Fedora releases (32-bit, though) in the past, while the other is basically a new PC.

The other PC (built around an Asus P5Q-E) has a hardware RAID controller (3ware 9650SE), and it complained about some sort of write error within a minute of the last segfault, so my guess is that the problems are related:

re:

<snip>

Mar 15 18:01:33 hunter kernel: ld-linux-x86-64[8533]: segfault at de6000 ip 00000000003d057b sp 00007fffb86c8b58 error 6 in ld-2.9.so[3b8000+20000]
Mar 15 18:01:33 hunter kernel: ld-linux-x86-64[8536]: segfault at e06000 ip 00000000003c258b sp 00007fff8672fb98 error 6 in ld-2.9.so[3aa000+20000]
Mar 15 18:01:36 hunter kernel: __ratelimit: 6 callbacks suppressed
Mar 15 18:01:36 hunter kernel: ld-linux-x86-64[8824]: segfault at 5 ip 0000000000400dbb sp 00007fffb74fd970 error 6 in jhat[400000+8000]
Mar 15 18:01:42 hunter kernel: ld-linux-x86-64[8908]: segfault at 85a000 ip 00000000002c658f sp 00007fff828e0d68 error 6 in ld-2.9.so[2ae000+20000]
Mar 15 18:01:49 hunter kernel: ld-linux-x86-64[8983]: segfault at 60e510 ip 000000000060e510 sp 00007fff8aecc348 error 15 in gnome-session-splash[60e000+1000]
Mar 15 18:01:53 hunter kernel: ld-linux-x86-64[8991]: segfault at 607410 ip 0000000000402da8 sp 00007fff138d2d50 error 4 in screenshot[400000+7000]
Mar 15 18:01:53 hunter kernel: ld-linux-x86-64[9000]: segfault at 606c20 ip 0000000000606c20 sp 00007fffa06cb3b8 error 15 in file-wmf[606000+1000]
Mar 15 18:01:53 hunter kernel: ld-linux-x86-64[9045]: segfault at 605dbb ip 0000000000605dbb sp 00007fffd2fb2440 error 15 in color-exchange[605000+1000]
Mar 15 18:01:54 hunter kernel: ld-linux-x86-64[9133]: segfault at 603310 ip 0000000000603310 sp 00007fffce92c5d8 error 15 in shift[603000+1000]
Mar 15 18:01:54 hunter kernel: ld-linux-x86-64[9154]: segfault at 60492a ip 000000000060492a sp 00007fffa35b0a38 error 15 in unsharp-mask[604000+1000]
Mar 15 18:02:43 hunter kernel: attempt to access beyond end of device
Mar 15 18:02:43 hunter kernel: dm-0: rw=1, want=2817055128, limit=80216064
Mar 15 18:02:43 hunter kernel: Buffer I/O error on device dm-0, logical block 1370162
Mar 15 18:02:43 hunter kernel: lost page write due to I/O error on dm-0
Mar 15 18:02:43 hunter kernel: ext3_journal_dirty_data: aborting transaction: IO failure in ext3_journal_dirty_data
Mar 15 18:02:43 hunter kernel: EXT3-fs error (device dm-0) in ext3_ordered_writepage: IO failure
Mar 15 18:02:43 hunter kernel: JBD: Detected IO errors while flushing file data on dm-0


That problem is a real showstopper, as it is impossible to trust a PC or server running this OS when it fails with these kinds of errors...

Comment 10 Ulrich Drepper 2009-03-27 20:40:42 UTC
I should have changed it to a kernel bug a long time ago.

Comment 11 Richard Guest 2009-03-30 20:54:25 UTC
I started getting a bunch of these after my FC10 x86_64 laptop crashed. Programs like NetworkManager and packagekitd were segfaulting and the kernel was spitting out these "general protection" errors in ld-2.9.so

...so I worked backwards through my log to find the first "general protection" error and only a few lines above it, I had some of this:
Mar 30 16:49:42 hutl13413 kernel: EXT3-fs: INFO: recovery required on readonly filesystem.
Mar 30 16:49:42 hutl13413 kernel: EXT3-fs: write access will be enabled during recovery.
Mar 30 16:49:42 hutl13413 kernel: kjournald starting.  Commit interval 5 seconds
Mar 30 16:49:42 hutl13413 kernel: EXT3-fs: sda2: orphan cleanup on readonly fs
Mar 30 16:49:42 hutl13413 kernel: EXT3-fs: sda2: 71 orphan inodes deleted
Mar 30 16:49:42 hutl13413 kernel: EXT3-fs: recovery complete.
Mar 30 16:49:42 hutl13413 kernel: EXT3-fs: mounted filesystem with ordered data mode.

Hmmm... Next I used yum to reinstall glibc, which in turn runs /sbin/ldconfig which spat out a whole bunch of angry crap:
/sbin/ldconfig: /lib64/libdevmapper.so.1.02 is not an ELF file - it has the wrong magic bytes at the start.

/sbin/ldconfig: /lib64/libdevmapper.so is not an ELF file - it has the wrong magic bytes at the start.

/sbin/ldconfig: /usr/lib64/libkalarm_resources.so.4 is not an ELF file - it has the wrong magic bytes at the start.

/sbin/ldconfig: file /usr/lib64/libnm-util.so.1.0.0 is truncated

...etc...
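
For context on the "wrong magic bytes" messages: every ELF file starts with the four bytes 0x7f 'E' 'L' 'F'. The Python sketch below performs only that first check; it is not how ldconfig is implemented, it would not catch the "truncated" case, and the helper name has_elf_magic is hypothetical. The demonstration uses throwaway files rather than real system libraries.

```python
import os
import tempfile

ELF_MAGIC = b"\x7fELF"


def has_elf_magic(path):
    """Return True if the file starts with the four ELF magic bytes."""
    with open(path, "rb") as f:
        return f.read(4) == ELF_MAGIC


# Demonstration on two temporary files: one with the magic, one zero-filled.
with tempfile.TemporaryDirectory() as tmp:
    good = os.path.join(tmp, "good.so")
    bad = os.path.join(tmp, "bad.so")
    with open(good, "wb") as f:
        f.write(ELF_MAGIC + b"\x02\x01\x01")
    with open(bad, "wb") as f:
        f.write(b"\x00" * 16)
    print(has_elf_magic(good), has_elf_magic(bad))
```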

Now I'm thinking, aha! It's not a glibc or kernel fault, it's corrupted libraries...
$ rpm -qf /usr/lib64/libnm-util.so.1.0.0
NetworkManager-glib-0.7.0.99-5.git20090326.fc10.x86_64
$ yum reinstall NetworkManager-glib-0.7.0.99-5.git20090326.fc10.x86_64
$ sudo service NetworkManager start 
Setting network parameters...                              [  OK  ]
Starting NetworkManager daemon:                            [  OK  ]
$ nm-applet
libnotify-Message: GetCapabilities call failed: Launch helper exited with unknown return code 127
/usr/lib64/libgvfscommon.so.0: invalid ELF header
Failed to load module: /usr/lib64/gio/modules/libgioremote-volume-monitor.so
/usr/lib64/libgvfscommon.so.0: invalid ELF header
Failed to load module: /usr/lib64/gio/modules/libgvfsdbus.so


So I've still got some work to do... but at least I've made progress, NetworkManager now runs rather than segfaulting and nm-applet runs with default config... Just gotta reinstall the rest of those corrupt libraries and maybe get rpm to verify all packages just to be sure...


Hope this helps some of the rest of you!

--
Rich

Comment 12 Raman Gupta 2009-03-31 18:12:42 UTC
Created attachment 337372 [details]
strace output from SEGFAULTed run of "ldd ls"

I tried reinstalling glibc, but unlike in the previous comment, no errors or corrupt libraries were reported.

I do not have selinux enabled.

I get this error randomly from ldd:

  ldd: exited with unknown exit code (139)

which seems to be due to a segmentation fault in /lib64/ld-linux-x86-64.so.2.

I was able to capture an strace from a failed execution via:

# cd /bin
# while [ $(strace -f ldd ls 2>&1 | tee ls.strace.log | grep -c SEG) = 0 ]; do echo "trying ldd"; done;

(As a previous poster mentioned, the failing binary is random; ls is just an example. I can trigger the error much faster with other binaries. The error seems to occur more often the bigger the binary is, but I haven't tested that rigorously.)

I have attached the strace output.
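
The shell loop above stops at the first failure. To put a rough number on how often a given binary fails (for example, to check the impression that bigger binaries fail more often), something like the following Python sketch could be used. It counts runs whose output contains the ldd message quoted above. The function name count_failures is an assumption for illustration, and the demo command is a stand-in that merely prints the message so the sketch runs on any machine.

```python
import subprocess
import sys

# Message that ldd prints when the dynamic loader dies with SIGSEGV.
MARKER = "exited with unknown exit code (139)"


def count_failures(cmd, runs, marker=MARKER):
    """Run cmd `runs` times; count runs whose combined output contains marker."""
    failures = 0
    for _ in range(runs):
        result = subprocess.run(
            cmd,
            stdout=subprocess.PIPE,
            stderr=subprocess.STDOUT,
            universal_newlines=True,
        )
        if marker in result.stdout:
            failures += 1
    return failures


# Stand-in command; on an affected system use e.g. ["ldd", "/bin/ls"] instead.
demo = [sys.executable, "-c", "print('ldd: " + MARKER + "')"]
print(count_failures(demo, runs=3), "of 3 runs failed")
```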

Comment 13 Raman Gupta 2009-03-31 18:15:54 UTC
(In reply to comment #12)
> I get this error randomly from ldd:
> 
>   ldd: exited with unknown exit code (139)

Forgot to mention:

# uname -r
2.6.27.19-170.2.35.fc10.x86_64

Comment 14 Michael Madore 2009-04-01 04:16:49 UTC
Hi,

I am also seeing this problem, but not with all kernels:

kernel-2.6.27.5-117.fc10.src.rpm = OK

kernel-2.6.27.19-170.2.35.fc10.src.rpm = BAD

I can now reproduce the behaviour consistently:

1) Perform a fresh install of Fedora 10
2) Disable crond (to prevent prelink from running)
3) Perform yum update
4) Reboot
5) Run /etc/cron.daily/prelink

If I remove the following patch from the kernel:

Patch160:  linux-2.6-execshield.patch

Then the segfaults don't occur.

Let me know if I can provide any additional information.

Mike Madore

Comment 15 Sergey 2009-04-01 11:34:53 UTC
I avoided segfault with

setarch x86_64 --addr-compat-layout ldd ...

Comment 16 Raman Gupta 2009-04-01 14:12:56 UTC
(In reply to comment #15)
> I avoided segfault with
> 
> setarch x86_64 --addr-compat-layout ldd ...  

I confirm this works around the segfault on my machine as well.
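
Background on this workaround, as far as it can be stated with confidence: setarch's --addr-compat-layout option (short form -L) sets the ADDR_COMPAT_LAYOUT personality flag before running the command, which asks the kernel for the legacy address-space layout. The Python sketch below only builds the wrapped command line and tests the flag bit; it makes no system calls. The helper names are hypothetical, and the flag value is the one defined in <linux/personality.h>.

```python
# Personality flag from <linux/personality.h>; setarch's
# --addr-compat-layout (-L) option sets it before exec'ing the command.
ADDR_COMPAT_LAYOUT = 0x0200000


def with_compat_layout(cmd, arch="x86_64"):
    """Prefix cmd so that it runs under the legacy address-space layout."""
    return ["setarch", arch, "--addr-compat-layout"] + list(cmd)


def uses_compat_layout(personality):
    """True if a personality value has ADDR_COMPAT_LAYOUT set."""
    return bool(personality & ADDR_COMPAT_LAYOUT)


print(" ".join(with_compat_layout(["ldd", "/bin/ls"])))
```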

Comment 17 Michael Madore 2009-04-01 17:34:14 UTC
Hi,

If I modify the call to prelink in /etc/cron.daily/prelink like so: 

setarch x86_64 -L /usr/sbin/prelink -av $PRELINK_OPTS >> /var/log/prelink/prelink.log 2>&1 \

Then the segfaults are eliminated on my system.

Mike Madore

Comment 18 JoSH Lehan 2009-04-01 22:18:04 UTC
This bug also breaks VMware Workstation 6.5.1 on Fedora 10.

With this bug, VMware fails to start when launched.  The interesting thing is that it happens only sometimes, not all of the time, and seems to be internal to ldd.

http://communities.vmware.com//message/1213672

This thread went on for a while, until somebody kindly discovered this bug and the workarounds in the comments here.

I applied the workaround in comment 17 to prelink, and it seemed to make the problem in VMware go away.

Comment 19 Chuck Ebbert 2009-04-02 05:08:31 UTC
(In reply to comment #14)
> If I remove the following patch from the kernel:
> 
> Patch160:  linux-2.6-execshield.patch
> 
> Then the segfaults don't occur.
> 

There was one change to that patch recently. If you revert the below change instead of removing the execshield patch does it fix the problem?

--- linux-2.6-execshield.patch	29 Jan 2009 21:14:11 -0000	1.100
+++ linux-2.6-execshield.patch	11 Mar 2009 16:35:27 -0000	1.101
@@ -456,10 +456,10 @@
  	/* Enable PSE if available */
  	if (cpu_has_pse)
 diff --git a/arch/x86/mm/mmap.c b/arch/x86/mm/mmap.c
-index 56fe712..ec932ae 100644
+index 56fe712..30d2be7 100644
 --- a/arch/x86/mm/mmap.c
 +++ b/arch/x86/mm/mmap.c
-@@ -111,13 +111,15 @@ static unsigned long mmap_legacy_base(void)
+@@ -111,13 +111,16 @@ static unsigned long mmap_legacy_base(void)
   */
  void arch_pick_mmap_layout(struct mm_struct *mm)
  {
@@ -471,7 +471,8 @@
  	} else {
  		mm->mmap_base = mmap_base();
  		mm->get_unmapped_area = arch_get_unmapped_area_topdown;
-+		if (!(current->personality & READ_IMPLIES_EXEC))
++		if (!(current->personality & READ_IMPLIES_EXEC)
++		    && mmap_is_ia32())
 +			mm->get_unmapped_exec_area = arch_get_unmapped_exec_area;
  		mm->unmap_area = arch_unmap_area_topdown;
  	}
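
To make the effect of the quoted change easier to see, here is a small Python model of the condition before (revision 1.100) and after (revision 1.101). It is a sketch of the boolean logic in the top-down layout branch only, not kernel code; the function names are invented for illustration, and the READ_IMPLIES_EXEC value is the one from <linux/personality.h>.

```python
READ_IMPLIES_EXEC = 0x0400000  # personality flag from <linux/personality.h>


def uses_exec_area_1_100(personality, is_ia32):
    """Revision 1.100: any process without READ_IMPLIES_EXEC (is_ia32 is ignored)."""
    return not (personality & READ_IMPLIES_EXEC)


def uses_exec_area_1_101(personality, is_ia32):
    """Revision 1.101: additionally requires a 32-bit (ia32) process."""
    return not (personality & READ_IMPLIES_EXEC) and is_ia32


# Columns: is_ia32, result under 1.100, result under 1.101 (personality 0).
for is_ia32 in (False, True):
    print(is_ia32, uses_exec_area_1_100(0, is_ia32), uses_exec_area_1_101(0, is_ia32))
```

Under this model, the only case that changes is a 64-bit process without READ_IMPLIES_EXEC, which no longer gets the separate exec-area allocator.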

Comment 20 Chuck Ebbert 2009-04-02 05:33:29 UTC
(In reply to comment #19)
> There was one change to that patch recently. If you revert the below change
> instead of removing the execshield patch does it fix the problem?
> 

Never mind, that change was made after the bug was first reported.

Can someone confirm:

kernel-2.6.27.12-170.2.5.fc10 should be BAD:
  http://koji.fedoraproject.org/koji/buildinfo?buildID=79612

kernel-2.6.27.9-159.fc10 should be GOOD:
  http://koji.fedoraproject.org/koji/buildinfo?buildID=74993

Comment 21 Chuck Ebbert 2009-04-02 05:53:12 UTC
Has anyone tested kernel-2.6.27.21-170.2.56.fc10 from updates-testing to see if the latest execshield update fixes the problem?

Comment 22 Michael Madore 2009-04-02 15:16:36 UTC
Hi Chuck,

Here are the results of testing the various kernel packages:

1) kernel-2.6.27.9-159.fc10 = GOOD
2) kernel-2.6.27.12-170.2.5.fc10 = BAD
3) kernel-2.6.27.21-170.2.56.fc10 = GOOD

I then applied the patch from above to my bad kernel (kernel-2.6.27.19-170.2.35.fc10) and that fixed it.
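
For anyone collating results across comments, the following Python sketch orders the kernels tested in comments 14 and 22 so that the GOOD/BAD boundaries are visible. The sort key is a simplification that splits on dots and compares integers; it is not RPM's own version comparison, and the name version_key is hypothetical.

```python
# Results as reported in comments 14 and 22 of this bug.
RESULTS = {
    "kernel-2.6.27.21-170.2.56.fc10": "GOOD",
    "kernel-2.6.27.12-170.2.5.fc10": "BAD",
    "kernel-2.6.27.5-117.fc10": "GOOD",  # reported as "OK" in comment 14
    "kernel-2.6.27.19-170.2.35.fc10": "BAD",
    "kernel-2.6.27.9-159.fc10": "GOOD",
}


def version_key(name):
    """Turn 'kernel-2.6.27.12-170.2.5.fc10' into a sortable pair of int tuples."""
    version, release = name[len("kernel-"):].split("-", 1)
    release = release.rsplit(".fc", 1)[0]  # drop the distribution tag

    def to_ints(text):
        return tuple(int(part) for part in text.split("."))

    return to_ints(version), to_ints(release)


for name in sorted(RESULTS, key=version_key):
    print(name, RESULTS[name])
```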

Do you know what the timeframe is for moving kernel 3) from testing to updates?

Mike

Comment 23 Chuck Ebbert 2009-04-02 20:38:10 UTC
(In reply to comment #22)
> Do you know what the timeframe is for moving kernel 3) from testing to updates?

The move was requested on Mar 30.

Comment 24 Chuck Ebbert 2009-04-02 21:14:02 UTC
kernel-2.6.27.21-170.2.56.fc10 has been marked stable.