Bug 160341 - executing arbitrary programs sometimes show usage info about ld.so
Status: CLOSED ERRATA
Product: Fedora
Classification: Fedora
Component: kernel
Version: 3
Hardware: x86_64 Linux
Priority: medium, Severity: medium
Assigned To: Dave Jones
Brian Brock
Reported: 2005-06-14 11:31 EDT by Gerben Roest
Modified: 2015-01-04 17:20 EST (History)

Doc Type: Bug Fix
Last Closed: 2005-12-07 02:30:33 EST


Attachments
strace file of the "rsh ls /tmp/poep" command which went wrong on node07 (6.70 KB, text/plain)
2005-06-14 13:17 EDT, Gerben Roest
strace file where in.rshd got straced, and where it went wrong on node03. (115.92 KB, text/plain)
2005-06-14 13:34 EDT, Gerben Roest
strace file of "rsh node01 uptime" with aux vector output. (39.91 KB, text/plain)
2005-06-15 05:12 EDT, Gerben Roest
strace file of "rsh node01 uptime" with aux vector output. (116.43 KB, text/plain)
2005-06-15 06:00 EDT, Gerben Roest


External Trackers
Tracker ID Priority Status Summary Last Updated
Linux Kernel 4851 None None None Never

Description Gerben Roest 2005-06-14 11:31:05 EDT
From Bugzilla Helper:
User-Agent: Mozilla/5.0 (Windows; U; Windows NT 5.0; en-US; rv:1.7.8) Gecko/20050511 Firefox/1.0.4

Description of problem:
This mostly occurs when I "rsh" to the machine and run a program given as a command-line argument to rsh, but it also happens when running programs after I log in with rsh.

Executing an arbitrary program gives this output:
Usage: ld.so [OPTION]... EXECUTABLE-FILE [ARGS-FOR-PROGRAM...]
You have invoked `ld.so', the helper program for shared library executables.
This program usually lives in the file `/lib/ld.so', and special directives
in executable files using ELF shared libraries tell the system's program
--- etc
This most recently occurred when I typed "mail" (after logging in with rsh). I typed "mail" again and it went fine.
This also happens when crond executes something: I get the ld.so usage text in the mail.

This erratic behaviour makes the cluster unusable.

Version-Release number of selected component (if applicable):
glibc-2.3.5-0.fc3.1

How reproducible:
Sometimes

Steps to Reproduce:
1. for i in `seq -w 36`;do rsh node$i uptime;done

Actual Results:  The above script runs "uptime" on every node in the cluster.
Often at least one of the nodes shows the "Usage: ld.so" message.

Expected Results:  show info about the uptime and load.
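
A minimal sketch for putting a number on "Sometimes", assuming the same rsh access to node01..node36 as in the reproduction step. The helper name count_ldso_usage is hypothetical; it only counts how often the ld.so usage banner shows up in whatever text it is given.

```shell
# count_ldso_usage (hypothetical helper): count how many times the ld.so
# usage banner appears in the text read from stdin.
count_ldso_usage() {
    # grep -c prints 0 but exits non-zero when nothing matches,
    # so normalise the exit status.
    grep -c '^Usage: ld\.so' || true
}

# Example use, mirroring the reproduction loop (assumes rsh access to
# node01..node36):
#   for i in $(seq -w 36); do rsh "node$i" uptime 2>&1; done | count_ldso_usage
```

Running the loop several times and comparing the counts gives a rough failure frequency per node sweep.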

Additional info:

This is the complete output:

Usage: ld.so [OPTION]... EXECUTABLE-FILE [ARGS-FOR-PROGRAM...]
You have invoked `ld.so', the helper program for shared library executables.
This program usually lives in the file `/lib/ld.so', and special directives
in executable files using ELF shared libraries tell the system's program
loader to load the helper program from this file.  This helper program loads
the shared libraries needed by the program executable, prepares the program
to run, and runs it.  You may invoke this helper program directly from the
command line to load and run an ELF executable file; this is like executing
that file itself, but always uses this helper program from the file you
specified, instead of the helper program file specified in the executable
file you run.  This is mostly of use for maintainers to test new versions
of this helper program; chances are you did not intend to run this program.

  --list                list all dependencies and how they are resolved
  --verify              verify that given object really is a dynamically linked
                        object we can handle
  --library-path PATH   use given PATH instead of content of the environment
                        variable LD_LIBRARY_PATH
  --inhibit-rpath LIST  ignore RUNPATH and RPATH information in object names
                        in LIST


Using ssh instead of rsh shows no errors when running the "for i in" script, but since it also happens with some cron-activated programs (gmetric in this case), I doubt that rsh(-server) is the culprit.
Comment 1 Jakub Jelinek 2005-06-14 11:55:48 EDT
If you execute /lib64/ld-linux-x86-64.so.2 or /lib/ld-linux.so.2 and don't
supply arguments to it, then that's the expected output.
I have never seen that come up in other situations though, nor can I reproduce it
now with rsh.
The only other way this could happen is if the kernel doesn't supply
AT_ENTRY in the auxiliary vector on the stack (or if the auxiliary vector
can't be found).
Can you run the rshd under strace -f to see if you can reproduce it?
Can you reproduce it on any box in the cluster or just on one particular box (could be
a memory problem)?
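
A minimal sketch of how one might check for the AT_ENTRY condition described above, assuming a glibc system where setting LD_SHOW_AUXV=1 makes the dynamic loader print the auxiliary vector before running the program. The helper name auxv_has_entry is hypothetical.

```shell
# auxv_has_entry (hypothetical helper): read LD_SHOW_AUXV=1 output on stdin
# and succeed only if it contains an AT_ENTRY line.
auxv_has_entry() {
    grep -q '^AT_ENTRY:'
}

# Example use, assuming glibc's loader honours LD_SHOW_AUXV:
#   LD_SHOW_AUXV=1 /bin/true | auxv_has_entry && echo "AT_ENTRY present"
```

If the vector is missing or has no AT_ENTRY, the loader cannot tell that it was started on behalf of another program, which matches the usage text appearing instead of the program's output.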
Comment 2 Gerben Roest 2005-06-14 12:20:48 EDT
I wanted to run in.rshd with strace, and while being busy on the node that I
wanted to test (ALL nodes have the same problem, by the way), I typed the following:

[root@node01 xinetd.d]# man in.rshd
Usage: ld.so [OPTION]... EXECUTABLE-FILE [ARGS-FOR-PROGRAM...]
You have invoked `ld.so', the helper program for shared library executables.
This program usually lives in the file `/lib/ld.so', and special directives

Also because a crond'd program has the same error, doesn't this exclude rshd or
rlogind?

I have installed a second cluster with nodes on Fedora3 which has the exact same
problem, so it is not the Fedora3 server: the other cluster has a RH7.3 server
from which I rsh.
Comment 3 Jakub Jelinek 2005-06-14 12:47:15 EDT
man in.rshd works here just fine too.
Really, unless you manage to provide any details that could show where the bug
could be (strace, LD_SHOW_AUXV=1 dumps when it fails, etc.), I'm afraid we
can't move forward with this, as I can't reproduce it myself.
Comment 4 Gerben Roest 2005-06-14 13:17:07 EDT
Created attachment 115420 [details]
strace file of the "rsh ls /tmp/poep" command which went wrong on node07
Comment 5 Gerben Roest 2005-06-14 13:19:05 EDT
I did the following to hopefully produce something you find interesting (if you
know something better, please let me know.. if you want me to "strace -f -o" the
in.rshd, can you tell me how to fix that in /etc/xinetd.d/rsh?)

[root@master strace]# rshe "strace -f -o /tmp/ls-strace ls -al /tmp/poep"
node01:
ls: /tmp/poep: No such file or directory
node02:
ls: /tmp/poep: No such file or directory
node03:
ls: /tmp/poep: No such file or directory
node04:
ls: /tmp/poep: No such file or directory
node05:
ls: /tmp/poep: No such file or directory
node06:
ls: /tmp/poep: No such file or directory
node07:
Usage: ld.so [OPTION]... EXECUTABLE-FILE [ARGS-FOR-PROGRAM...]
You have invoked `ld.so', the helper program for shared library executables.
This program usually lives in the file `/lib/ld.so', and special directives

The /tmp/ls-strace from node07 is no different from that of node06. It ends
with:

> 13823 write(2, "ls: ", 4)               = 4
> 13823 write(2, "/tmp/poep", 9)          = 9
> 13823 write(2, ": No such file or directory", 27) = 27
> 13823 write(2, "\n", 1)                 = 1
> 13823 exit_group(0x1, 0x1, 0x7ffffffff970, 0x37d2e30e88, 0x3c <unfinished ... exit status 1>

As you can see, the output of these "write" calls does not appear in the node07
output above, only the ld.so usage text, but I attached the strace file nonetheless.
Comment 6 Gerben Roest 2005-06-14 13:34:29 EDT
Created attachment 115421 [details]
strace file where in.rshd got straced, and where it went wrong on node03.

I managed to strace in.rshd on all the nodes, and then did a "rshe ls -al
/tmp/poep", which went fine on all nodes except node03. I attached its strace
file; the ld.so usage text appears in it.
Comment 7 Gerben Roest 2005-06-14 13:43:47 EDT
While I was busy with copying to/from the nodes I noticed that I got the same
problem with "rcp" and with "ssh". 
Comment 8 Jakub Jelinek 2005-06-14 17:10:26 EDT
Weird.  Can you perhaps also run in.rshd under
strace -E LD_SHOW_AUXV=1 -o /tmp/rshd.log /usr/sbin/in.rshd
?  That ought to show the auxiliary vector after each exec.
Thanks.
Comment 9 Ulrich Drepper 2005-06-14 19:04:13 EDT
As Jakub said, this behavior is completely impossible unless

a) the kernel messes the process creation up and doesn't pass the correct
information

b) you have some sort of breakin and either a root kit or userlevel trick is
installed and fails occasionally.

Since something like this hasn't been reported ever, a) would probably mean a
hardware problem.  This is why I probably discount this possibility.

b) is more likely: somebody intercepting exec calls, inserting explicit calls
to ld.so and botching this up.  I suggest auditing your system or, even better,
reinstalling the system while not preserving any executables, just data files.
Comment 10 Gerben Roest 2005-06-15 05:07:39 EDT
I used the strace that Jakub suggested, and the output of that, when it gave the
ld.so usage info:

AT_HWCAP:        78bfbff
AT_PAGESZ:       4096
AT_CLKTCK:       100
AT_PHDR:         0x555555555040
AT_PHENT:        56
AT_PHNUM:        8
AT_BASE:         0x2a76041ab000
AT_FLAGS:        0x0
AT_ENTRY:        0x555555556e80
AT_UID:          0
AT_EUID:         0
AT_GID:          0
AT_EGID:         0
AT_SECURE:       0
AT_PLATFORM:     x86_64

The actual strace file I will attach in a separate comment.
Comment 11 Gerben Roest 2005-06-15 05:12:48 EDT
Created attachment 115466 [details]
strace file of "rsh node01 uptime" with aux vector output.

This is given by node01 when I did "rsh node01 uptime". The strace command was:


/usr/bin/strace -E LD_SHOW_AUXV=1 -o /tmp/rshd.log /usr/sbin/in.rshd 2>&1 > /tmp/rshd.out

The contents of rshd.out I have given in my previous comment.
Comment 12 Jakub Jelinek 2005-06-15 05:34:53 EDT
Oops, sorry, that one was without the strace -f option, which is needed too.
Comment 13 Gerben Roest 2005-06-15 06:00:40 EDT
Created attachment 115467 [details]
strace file of "rsh node01 uptime" with aux vector output.

this time with "-f".
Comment 14 Jakub Jelinek 2005-06-15 18:50:52 EDT
It did not print the aux table, so it is quite likely the kernel has not provided
it (but that would be a kernel bug, not glibc).
Now, to prove it, you'd ideally want to get a core dump.
The best way, I think, is to put an hlt instruction at the place that prints the
Usage: ld.so message.
For glibc-2.3.5-0.fc3.1.x86_64's /lib64/ld-2.3.5.so that is, I think, at
offset 0x2103 or 0x2120.  The hlt opcode is 0xf4, so if you just store a
0xf4 byte at offsets 0x2103 and 0x2120 in the file, you should be set.
echo -n -e '\xf4' | dd of=/lib64/ld-2.3.5.so conv=notrunc bs=1 count=1 seek=8451
echo -n -e '\xf4' | dd of=/lib64/ld-2.3.5.so conv=notrunc bs=1 count=1 seek=8480
could do the job.
Then just ulimit -c unlimited and run in.rshd to look for the crash.
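
A minimal sketch of the same patch wrapped in a function, assuming it is applied to a copy of the loader rather than the live file. The helper name patch_hlt and the /tmp path are hypothetical; the offsets are the ones suggested above and are not verified here.

```shell
# patch_hlt (hypothetical helper): write a single hlt opcode (0xf4) into
# FILE at byte OFFSET without truncating the file.
patch_hlt() {
    file=$1
    offset=$(( $2 ))        # accepts hex, e.g. 0x2103 (= 8451 decimal)
    # \364 is the octal escape for byte 0xf4
    printf '\364' | dd of="$file" conv=notrunc bs=1 count=1 seek="$offset" 2>/dev/null
}

# Example use on a copy (offsets as suggested for
# glibc-2.3.5-0.fc3.1.x86_64, unverified):
#   cp /lib64/ld-2.3.5.so /tmp/ld-test.so
#   patch_hlt /tmp/ld-test.so 0x2103
#   patch_hlt /tmp/ld-test.so 0x2120
```

Working on a copy keeps the system bootable if an offset turns out to be wrong; the shell arithmetic converts the hex offsets to the decimal seek values 8451 and 8480 used in the dd commands.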
Comment 15 Gerben Roest 2005-06-23 06:02:14 EDT
I have installed the latest update for FC3 (kernel-smp-2.6.11-1.27_FC3) and
re-installed glibc*, but after initial hope (it seemed to go fine) it is
reporting the ld.so usage again. I made the changes to ld-2.3.5 but couldn't save
them in place, so I made the changes to a copy of it and let the link
"/lib64/ld-linux-x86-64.so.2 -> ld-2.3.5.so" point to the copy. I saw that
in.rlogind used the copy. Should this produce a core file, and where? I haven't
been able to find one.

thanks.
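
On the core-file question, a minimal sketch, assuming a 2.6-era kernel that exposes /proc/sys/kernel/core_pattern. The core size limit is inherited from the parent process, so "ulimit -c unlimited" in a login shell does not apply to an in.rshd or in.rlogind started by xinetd; with the default pattern the file is written to the crashing process's current directory. The helper name show_core_settings is hypothetical.

```shell
# show_core_settings (hypothetical helper): print the settings that decide
# whether and where a core file is written for processes started from
# this shell.
show_core_settings() {
    echo "core size limit: $(ulimit -c)"
    if [ -r /proc/sys/kernel/core_pattern ]; then
        echo "core pattern: $(cat /proc/sys/kernel/core_pattern)"
    fi
}

show_core_settings
```

A limit of 0 means no core file is written at all, whichever directory the process runs in.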
Comment 16 Gerben Roest 2005-07-18 09:59:52 EDT
It seems that the combination of the following three things may have solved the
problem:

- upgrading to latest Fedora kernel
- NOT setting "noapic" as kernel parameter
- enabling ACPI 2.0 in bios (machines are Tyan S2882)

I think that because the kernel now can find the HPET timer, things are going
better. The machines all had problems with keeping their time accurate (doing a
"date" within 5 seconds sometimes showed 30 minutes difference) and that's what
led me to suspecting the timer. I used "noapic" because earlier kernels had
problems booting without it.

Does this all sound reasonable?
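
A minimal sketch for confirming which of these settings a node actually booted with, assuming /proc/cmdline and dmesg are available. The helper name cmdline_has_param is hypothetical.

```shell
# cmdline_has_param (hypothetical helper): succeed if the kernel command
# line given as $1 contains the exact parameter $2.
cmdline_has_param() {
    case " $1 " in
        *" $2 "*) return 0 ;;
        *)        return 1 ;;
    esac
}

# Example use on a running system:
#   cmdline_has_param "$(cat /proc/cmdline)" noapic && echo "noapic is set"
#   dmesg | grep -i hpet     # shows whether the kernel found the HPET
```

Padding both strings with spaces makes the match exact, so a longer parameter that merely starts with "noapic" is not mistaken for it.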

Comment 17 Konstantin Olchanski 2005-07-19 22:12:07 EDT
Me too! I see the same problem with programs running from cron, on SMP AMD64
machines running FC3 kernel 2.6.11-1.27_FC3smp or earlier. I think I do not see
this problem on my lone non-SMP AMD64 machine. The frequency of this fault is
maybe 10 times per day for a cron job that runs every 5 minutes. All the
affected machines are in production so it is a bit hard for me to play with
kernel versions and stuff. K.O.
Comment 18 Ulrich Drepper 2005-07-25 18:14:53 EDT
Reassigning to kernel.  This is in any case a kernel issue.  Maybe a kernel
person can answer the question in comment #16.
Comment 19 Dan Carpenter 2005-07-26 03:43:03 EDT
Yeah.  It does sound like a hardware/kernel thing.

>  The machines all had problems with keeping their time accurate (doing a
"date" within 5 seconds sometimes showed 30 minutes difference) and that's what
led me to suspecting the timer.

I use a lot of 2882 mobos.  I've seen a few quirky things with the BIOS but
nothing like what you're describing.  What BIOS are you using?  Can you post
your dmesg output?

Comment 20 Dave Jones 2005-07-28 17:03:38 EDT
Interesting. I've seen this happen once too, also on a Tyan S2882.
Shortly afterwards, for other reasons, I reinstalled with FC4, and haven't seen
this reoccur since.  FC3 now has a 2.6.12 kernel that's very similar to the one
in FC4. Can you try to reproduce with that update?

Comment 21 Konstantin Olchanski 2005-07-30 02:15:37 EDT
I have one dual-opteron machine that has the ld.so problem scheduled to be
rebooted into the 2.6.12 kernel, and I will post results when available. (this
got delayed by the mkinitrd problem that prevented the 2.6.12 kernel update from
booting).

For the record, on a dual-opteron machine with the 2.6.11-1.27_FC3smp kernel, I
observe a 0.7% failure rate running my script in a tight loop: out of 633904
invocations, I got 4754 "ld.so usage" dumps. Unfortunately I cannot post the
scripts as they involve convoluted perl scripts for feeding "sensors" data into
ganglia that have to be reduced to something smaller to become a useful test case.

K.O.
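
For reference, a minimal sketch of the arithmetic behind the failure rate quoted above, using only the two counts from this comment. The helper name failure_rate is hypothetical.

```shell
# failure_rate (hypothetical helper): print FAILURES/TOTAL as a percentage
# with two decimal places.
failure_rate() {
    awk -v f="$1" -v t="$2" 'BEGIN { printf "%.2f\n", 100 * f / t }'
}

failure_rate 4754 633904    # prints 0.75
```

That is roughly three failures in every four hundred invocations, consistent with the quoted figure of about 0.7%.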
Comment 22 Ulrich Drepper 2005-07-31 15:42:56 EDT
The bug in the kernel.org bz might point at a similar issue.  This too is on
Tyan 28?2 SMP motherboards.
Comment 23 Konstantin Olchanski 2005-08-01 19:23:53 EDT
I think this Tyan thing is a red herring. My two problem machines are both MSI
MS-9161 mobos (K8D Master-blah...). My two no-problem machines are Tyan S2880
Thunder K8S and Tyan S2885 Thunder K8W mobos. K.O.
Comment 24 Konstantin Olchanski 2005-08-02 17:56:10 EDT
2.6.12-1.1372_FC3smp freezes within 5 minutes of running my test scripts
(perl+sensors+ganglia). Last entry in the system log file is "swap_free: Bad
swap file entry" and there is a panic stack trace, showing functions with names
containing irq, nmi, tcp_send, HPET (whatever that is) etc. Next time I will
write it down on paper (10 years ago, SGI IRIX could save the panic stack traces
to disk and paper+pencil were not required. Why can't Linux do it today?!?). K.O.
Comment 25 Dave Jones 2005-08-02 20:09:29 EDT
You already filed that bug as #164941; let's not clutter this bug up with other
issues, as right now there's no indication they are related.
Comment 26 Konstantin Olchanski 2005-08-03 20:49:59 EDT
With 2.6.12-1.1372_FC3smp the frequency of "ld.so usage dumps" is greatly diminished.
In two days of running I only got 1 dump from my cron job, compared to maybe 10
dumps from the machine running the older kernel. K.O.
Comment 27 Konstantin Olchanski 2005-08-05 03:18:16 EDT
Here is my verdict on 2.6.12-1.1372_FC3smp: the "ld.so usage" problem is
definitely still there, I saw it three times after running for 3-4 days.

I am curious if the vanilla kernels show this (and other) problems, so I will
eventually continue playing with it on another dual-opteron machine, after I
reinstall the second cpu and after I obtain a power supply that does not
crowbar on startup. The present machine is back to running 2.6.11-1.27_FC3smp to
avoid the panics from the "bad swap file entry" problem. K.O.
Comment 28 Dave Jones 2005-12-07 02:30:33 EST
Haven't seen this in a long time. I believe this was caused by the AMD TLB flush
filter errata fixed in kernel 2.6.12-1.1375_FC3 and newer.
Comment 29 Rudi Chiarito 2006-04-18 13:08:36 EDT
As a data point, it happened again with 2.6.12-1.1381_FC3smp, followed by a bunch of these:

Apr 17 07:07:56 localhost kernel: Unable to handle kernel NULL pointer dereference at 000000000000000b RIP:
Apr 17 07:07:56 localhost kernel: [<000000000000000b>]
Apr 17 07:07:56 localhost kernel: PGD 0
Apr 17 07:07:56 localhost kernel: Oops: 0010 [1] SMP
Apr 17 07:07:56 localhost kernel: CPU 0
Apr 17 07:07:56 localhost kernel: Modules linked in: loop nfsd exportfs lockd jfs parport_pc lp parport autofs4 w83627hf eeprom adm1026 i2c_sensor i2c_isa i2c_amd756 i2c_dev i2c_core sunrpc md5 ipv6 pcmcia yenta_socket rsrc_nonstatic pcmcia_core iptable_nat ip_conntrack iptable_filter ip_tables video button battery ac ohci_hcd hw_random e1000 dm_snapshot dm_zero dm_mirror ext3 jbd dm_mod qla2300 qla2xxx scsi_transport_fc sd_mod scsi_mod
Apr 17 07:07:56 localhost kernel: Pid: 19438, comm: su Not tainted 2.6.12-1.1381_FC3smp
Apr 17 07:07:56 localhost kernel: RIP: 0010:[<000000000000000b>] [<000000000000000b>]
Apr 17 07:07:56 localhost kernel: RSP: 0018:ffff8100580a9e70  EFLAGS: 00010202
Apr 17 07:07:56 localhost kernel: RAX: 000000000000000b RBX: ffffffff804088d0 RCX: 0000000000000000
Apr 17 07:07:56 localhost kernel: RDX: ffff8100f6c2c400 RSI: ffff8100f6d5c3c0 RDI: ffff810092baee00
Apr 17 07:07:56 localhost kernel: RBP: ffff8100f6d5c3c0 R08: ffffffffffffffff R09: 0000000000000f18
Apr 17 07:07:56 localhost kernel: R10: 0000000000000022 R11: 0000000000000246 R12: ffff810092baee00
Apr 17 07:07:56 localhost kernel: R13: 0000000000000000 R14: 0000000000000400 R15: ffff810092baee30
Apr 17 07:07:56 localhost kernel: FS:  00002aaaab237e80(0000) GS:ffffffff804e1300(0000) knlGS:00000000f7fbf8e0
Apr 17 07:07:56 localhost kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
Apr 17 07:07:56 localhost kernel: CR2: 000000000000000b CR3: 00000000f395c000 CR4: 00000000000006e0
Apr 17 07:07:56 localhost kernel: Process su (pid: 19438, threadinfo ffff8100580a8000, task ffff810002f467c0)
Apr 17 07:07:56 localhost kernel: Stack: ffffffff801977f4 0000000000000000 ffff810092baee00 ffff8100f6d5c3c0
Apr 17 07:07:56 localhost kernel:        0000000000000000 00000000000000e3 ffffffff8019b9fd 0000000000000000
Apr 17 07:07:56 localhost kernel:        0000000000000000 ffff8100580a9f50
Apr 17 07:07:56 localhost kernel: Call Trace:<ffffffff801977f4>{show_vfsmnt+279} <ffffffff8019b9fd>{seq_read+473}
Apr 17 07:07:56 localhost kernel:        <ffffffff8017ca08>{vfs_read+205} <ffffffff8017ccaa>{sys_read+69}
Apr 17 07:07:56 localhost kernel:        <ffffffff8010e7be>{system_call+126}
Apr 17 07:07:56 localhost kernel:
Apr 17 07:07:56 localhost kernel: Code:  Bad RIP value.
Apr 17 07:07:56 localhost kernel: RIP [<000000000000000b>] RSP <ffff8100580a9e70>
Apr 17 07:07:56 localhost kernel: CR2: 000000000000000b
Apr 17 07:07:56 localhost kernel:  <3>Debug: sleeping function called from invalid context at include/linux/rwsem.h:43
Apr 17 07:07:56 localhost kernel: in_atomic():0, irqs_disabled():1
Apr 17 07:07:56 localhost kernel:
Apr 17 07:07:56 localhost kernel: Call Trace:<ffffffff8012ff03>{__might_sleep+193} <ffffffff80136842>{profile_task_exit+34}
Apr 17 07:07:56 localhost kernel:        <ffffffff80137e84>{do_exit+34} <ffffffff80202b32>{vgacon_cursor+228}
Apr 17 07:07:56 localhost kernel:        <ffffffff801225f8>{do_page_fault+1904} <ffffffff8010f2e1>{error_exit+0}
Apr 17 07:07:56 localhost kernel:        <ffffffff8010f2e1>{error_exit+0} <ffffffff801977f4>{show_vfsmnt+279}
Apr 17 07:07:56 localhost kernel:        <ffffffff8019b9fd>{seq_read+473} <ffffffff8017ca08>{vfs_read+205}
Apr 17 07:07:56 localhost kernel:        <ffffffff8017ccaa>{sys_read+69} <ffffffff8010e7be>{system_call+126}
Apr 17 07:07:56 localhost kernel:

Does anything in that oops shed any light?

The system is now running the first FC3 Legacy kernel update, we'll see how that
one fares. Other data points: when the problem occurred, the system was doing a
bunch of find/xargs with very long command lines. Processor is a 2x Opteron 246,
stepping 10. Motherboard is an Iwill DK8S2. There is a BIOS update available
which mentions some very vague "update mirco(sic) code" for version AMI BIOS
1.50 (2005/3/10), but I assumed that there are workarounds in software for all
the AMD errata. Anything that can be done to further investigate the issue? 

