Red Hat Bugzilla – Bug 502074
running commands segfault randomly, with 32 bit host and guest, when host is a Xen or VMWare guest
Last modified: 2011-04-12 14:00:40 EDT
This is the error message, which seems to indicate a
bug in the sha1sum program itself:
sha1sum: segfault at 0 ip (null) sp bfe1887c error 4 in libc-2.10.1.so[110000+16b000]
Created attachment 345019
Search the attachment for 'sha1sum' and you can see the
program segfaulting like crazy.
I wonder if this is a general 'coreutils programs segfault in
the guest' problem. For example, here's another one:
md5sum: segfault at b697 ip 0000af67 sp bfb332d0 error 4 in libc-2.10.1.so[110000+16b000]
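For anyone comparing several of these messages, here is a minimal sketch (not part of the original report) that pulls the fields out of a line in the format quoted above and decodes the low bits of the x86 page-fault error code. The names SEGV_RE, decode_error and parse_segfault are made up for illustration; the bit meanings (bit 0 = protection violation rather than page not present, bit 1 = write, bit 2 = user mode) are the standard x86 ones, and the error value is assumed to be printed in hex.

```python
import re

# Format seen in this report:
#   <comm>: segfault at <addr> ip <ip> sp <sp> error <code> in <object>[<base>+<size>]
# The ip field can be the literal string "(null)".
SEGV_RE = re.compile(
    r"(?P<comm>\S+): segfault at (?P<addr>[0-9a-f]+) "
    r"ip (?P<ip>\(null\)|[0-9a-f]+) sp (?P<sp>[0-9a-f]+) "
    r"error (?P<error>[0-9a-f]+)"
)


def decode_error(code):
    """Decode the low three bits of an x86 page-fault error code."""
    return {
        "present": bool(code & 1),  # False: page not present; True: protection violation
        "write": bool(code & 2),    # False: read access; True: write access
        "user": bool(code & 4),     # True: fault happened in user mode
    }


def parse_segfault(line):
    """Return the parsed fields of one segfault line, or None if it does not match."""
    m = SEGV_RE.search(line)
    if m is None:
        return None
    ip = m.group("ip")
    return {
        "comm": m.group("comm"),
        "addr": int(m.group("addr"), 16),
        "ip": None if ip == "(null)" else int(ip, 16),
        "sp": int(m.group("sp"), 16),
        "error": decode_error(int(m.group("error"), 16)),
    }
```

Under this decoding, "error 4" on both lines above reads as a user-mode read of a page that is not present.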
Adding Jim Meyering to the CC of this bug.
This is very out of left field for coreutils, but I wonder if you
have any idea why the *sum programs in coreutils could
possibly segfault randomly when run in a Fedora 11 i586
qemu guest? F-11, i586 and qemu [not KVM] all seem
to be significant factors.
A stack trace or gdb backtrace from sha1sum would help, but here's a shot in the dark:
could this be due to locale settings different from LC_ALL=C and running without locale-related files? I.e., could your removing some locale-related infrastructure have caused this?
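One way to check the locale guess would be to run the same command many times with LC_ALL=C forced and count how many runs are killed by a signal. A minimal sketch, assuming a POSIX system where Python's subprocess reports death by signal as a negative return code; the function name and the default run count are my own choices, not anything from libguestfs or coreutils.

```python
import os
import subprocess


def count_signal_deaths(cmd, runs=50, env_overrides=None):
    """Run cmd `runs` times and return how many runs were killed by a signal."""
    env = dict(os.environ)
    env.update(env_overrides or {})  # e.g. {"LC_ALL": "C"}
    deaths = 0
    for _ in range(runs):
        proc = subprocess.run(
            cmd,
            env=env,
            stdin=subprocess.DEVNULL,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
        )
        # On POSIX a negative return code -N means "terminated by signal N".
        if proc.returncode < 0:
            deaths += 1
    return deaths


# Hypothetical usage inside the guest (the file path is only an example):
#   count_signal_deaths(["sha1sum", "/etc/hostname"], env_overrides={"LC_ALL": "C"})
```

Comparing the count with and without the override would show whether the locale makes any difference; given how random the crashes look, a single run proves little.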
Then I looked at your attachment.
Right before the first segfault, I see what looks like serious FS trouble with a mkdir syscall.
It's probably worth investigating that first.
------------[ cut here ]------------
WARNING: at fs/fs-writeback.c:302 __writeback_single_inode+0x1d4/0x27a() (Tainted: G W )
Modules linked in: ext2 virtio_net virtio_pci
Pid: 276, comm: mkdir Tainted: G W 18.104.22.168-155.fc11.i586 #1
[<c0478bf0>] ? find_get_pages_tag+0x32/0xa2
[<c047fb86>] ? pagevec_lookup_tag+0x1e/0x25
[<c0479935>] ? wait_on_page_writeback_range+0xa2/0xdd
[<c04bb661>] ? wait_on_buffer+0x32/0x35
[<c04bb8e2>] ? sync_dirty_buffer+0x59/0x8d
[<d887409f>] ? brelse+0x11/0x13 [ext2]
[<d887432d>] ? ext2_update_inode+0x28c/0x2c7 [ext2]
[<c044077d>] ? wake_bit_function+0x0/0x3c
[<d8873fa0>] ext2_sync_inode+0x2c/0x33 [ext2]
[<d8872838>] ext2_commit_chunk+0x92/0xa6 [ext2]
[<d88729af>] ext2_make_empty+0x163/0x17a [ext2]
[<d8875d03>] ext2_mkdir+0x9d/0xf1 [ext2]
[<c0433ee0>] ? _local_bh_enable+0x8d/0x9d
[<c043401a>] ? __do_softirq+0x12a/0x139
---[ end trace fe7116eb9e9c7886 ]---
This is the environment that commands running in the
I'm now in the process of adding more environment
variables (particularly PATH) to see if that fixes
Nasty, nasty - could be worth trying to reproduce locally with qemu -no-kvm
fixed typo in title: s/libguetfs/libguestfs/
This bug appears to have been reported against 'rawhide' during the Fedora 11 development cycle.
Changing version to '11'.
*** Bug 512680 has been marked as a duplicate of this bug. ***
Changed the summary to:
"running commands segfault randomly, with 32 bit host and guest,
when host is a Xen or VMWare guest"
All the factors in that summary seem to be crucial: the "host"
(i.e. what is running libguestfs) will be a 32 bit Xen or VMWare guest.
Programs running in the guestfs will segfault randomly, or
give a spurious kernel error like "no vm86_info: BAD error",
or give a qemu error "tcg fatal error".
The problem seems to be confined to F11, i586, RHEL 5 Xen.
Verified that this doesn't happen with the vanilla,
upstream qemu 0.10.4. Solution is probably to upgrade
qemu, but we need to do a proper bisect.
Setting Product to Virtualization Tools.
Haven't seen this for a long time, and it's probably a
TCG-related bug, but I'll leave it open for now.
Hrm, I just started seeing this (or something else with the same symptoms) after a domU reboot. 5.3 Dom0, F10 domU.
When it gets like this, seemingly everything will segfault (even 'less'), though I can get a reboot to happen. A reboot of the domU will forestall the problem for something on the order of a day. I see no obvious problems on the Dom0 or in a light Fedora 11 DomU or a Nexenta (OpenSolaris) DomU.
segfaults look like:
Sep 29 15:14:38 borlaug kernel: MailScanner: segfault at 50 ip 008071ba sp bfdc7820 error 6 in libperl.so[6f7000+269000]
Sep 29 15:15:10 borlaug kernel: MailScanner: segfault at 639ce069 ip 007a83b1 sp bfdc7990 error 4 in libperl.so[6f7000+269000]
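As a rough aid for reading these, the ip can be compared against the [base+size] range printed at the end of each line. The sketch below is mine, not from the reporter; MAP_RE and locate_ip are hypothetical names, and it assumes the bracketed values are the hex start address and length of the mapping.

```python
import re

# Matches "... ip <ip> ... in <object>[<base>+<size>]" with all numbers in hex.
MAP_RE = re.compile(
    r"ip (?P<ip>[0-9a-f]+) .* in (?P<obj>\S+)"
    r"\[(?P<base>[0-9a-f]+)\+(?P<size>[0-9a-f]+)\]"
)


def locate_ip(line):
    """Return (object, offset) when ip lies inside the reported mapping,
    (object, None) when it lies outside, or None if the line does not match."""
    m = MAP_RE.search(line)
    if m is None:
        return None
    ip, base, size = (int(m.group(k), 16) for k in ("ip", "base", "size"))
    offset = ip - base if base <= ip < base + size else None
    return m.group("obj"), offset
```

For the two MailScanner lines this gives offsets 0x1101ba and 0xb13b1 into libperl.so. For the md5sum line earlier in this bug (ip 0000af67, libc-2.10.1.so[110000+16b000]) the ip falls outside the printed range, so the sketch returns None for the offset.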
One difference would be these two kernel versions:
Aug 25 21:28:31 borlaug kernel: Linux version 22.214.171.124-170.2.56.fc10.i686.PAE (firstname.lastname@example.org) (gcc version 4.3.2 20081105 (Red Hat 4.3.2-7) (GCC) ) #1 SMP Mon Mar 23 23:24:26 EDT 2009
Sep 14 18:35:31 borlaug kernel: Linux version 126.96.36.199-170.2.82.fc10.i686.PAE (email@example.com) (gcc version 4.3.2 20081105 (Red Hat 4.3.2-7) (GCC) ) #1 SMP Mon Aug 17 08:24:23 EDT 2009
though there have been other subsequent package updates. Is it helpful to get a bt on e.g. 'less' when this happens again? I'll try switching back to the old kernel to see if that makes a difference.
The domU kernel didn't make a difference, but the dom0 kernel did.
Problems (any of several f10 domU kernels):
Sep 26 19:08:02 stevens kernel: Linux version 2.6.18-164.el5xen (firstname.lastname@example.org) (gcc version 4.1.2 20080704 (Red Hat 4.1.2-46)) #1 SMP Thu Sep 3 04:47:32 EDT 2009
No problems (most recent f10 domU kernel):
Oct 7 23:42:28 stevens kernel: Linux version 2.6.18-128.1.10.el5xen (email@example.com) (gcc version 4.1.2 20080704 (Red Hat 4.1.2-44)) #1 SMP Thu May 7 11:51:15 EDT 2009
Am I chasing the same bug here? I skipped many updates with a reboot of the Dom0 - I'll try a binary search of the updates to see where the problems started if this isn't already being hunted in another bug.
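The binary search over updates could look roughly like the sketch below. It is only an illustration: first_bad and is_bad are hypothetical names, the list is assumed to be ordered oldest to newest with the newest entry known bad, and the search only works if every entry after the first bad one is also bad. Since the crashes show up intermittently, is_bad would need to repeat the test enough times before calling an update good.

```python
def first_bad(updates, is_bad):
    """Return the index of the first update for which is_bad(update) is true.

    Assumes updates is ordered oldest to newest, updates[-1] is known bad,
    and badness is monotonic (once bad, always bad afterwards).
    """
    lo, hi = 0, len(updates) - 1
    while lo < hi:
        mid = (lo + hi) // 2
        if is_bad(updates[mid]):
            hi = mid        # first bad update is at mid or earlier
        else:
            lo = mid + 1    # first bad update is after mid
    return lo
```

With n candidate updates this needs about log2(n) test rounds instead of n.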
I'm pretty certain this is a different bug. The current bug is with
TCG emulation under QEMU, and nothing to do with Xen.
Ah, I see I've misparsed comment #10. I'll find/make a different one.
Still seeing this on Koji, e.g.:
[ 966.687020] sha224sum: segfault at eae7 ip 0000e3b7 sp bfa418d0 error 4 in ld-2.11.1.so[581000+1e000]
[ 966.691152] no vm86_info: BAD
[ 967.673020] sha256sum: segfault at f056 ip 0000e926 sp bfbf4ef0 error 4 in libc-2.11.1.so[16a000+16f000]
[ 967.674848] no vm86_info: BAD
[ 1013.971021] sha224sum: segfault at e864 ip 0000e134 sp bfbfb7f0 error 4 in libc-2.11.1.so[5fd000+16f000]
[ 1013.975147] no vm86_info: BAD
(Note that this test was exactly the same as comment 17, and yet it only
failed once this time, so the failures are not completely deterministic).
Haven't seen this happen for quite a long time, and we regularly
run the full test suite on i386 and x86-64.