502074 – running commands segfault randomly, with 32 bit host and guest, when host is a Xen or VMWare guest

Keywords:
Status:	CLOSED INSUFFICIENT_DATA
Alias:	None
Product:	Virtualization Tools
Classification:	Community
Component:	libguestfs
Sub Component:
Version:	unspecified
Hardware:	All
OS:	Linux
Priority:	medium
Severity:	medium
Target Milestone:	---
Assignee:	Richard W.M. Jones
QA Contact:
Docs Contact:
URL:
Whiteboard:
Duplicates (1):	512680
Depends On:
Blocks:	F11VirtTarget

Reported:	2009-05-21 18:27 UTC by Richard W.M. Jones
Modified:	2011-04-12 18:00 UTC
CC List:	6 users
Fixed In Version:
Clone Of:
Environment:
Last Closed:	2011-04-12 18:00:40 UTC
Embargoed:
Dependent Products:

Attachments
build.log (1.63 MB, text/plain) 2009-05-21 19:57 UTC, Richard W.M. Jones

Description Richard W.M. Jones 2009-05-21 18:27:31 UTC

This is the error message, which seems to indicate a
bug in the sha1sum program itself:

sha1sum /sysroot/new
sha1sum[986]: segfault at 0 ip (null) sp bfe1887c error 4 in libc-2.10.1.so[110000+16b000]

Comment 1 Richard W.M. Jones 2009-05-21 19:57:05 UTC

Created attachment 345019
build.log

Search the attachment for 'sha1sum' and you can see the
program segfaulting like crazy.

Comment 2 Richard W.M. Jones 2009-05-21 21:53:37 UTC

I wonder if this is a general 'coreutils programs segfault' in
the guest problem.  For example, here's another one:

md5sum /sysroot/upload
md5sum[1038]: segfault at b697 ip 0000af67 sp bfb332d0 error 4 in libc-2.10.1.so[110000+16b000]

from http://koji.fedoraproject.org/koji/getfile?taskID=1369168&name=build.log
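
The kernel lines quoted in this bug can be picked apart mechanically. What follows is a minimal sketch, assuming the message layout seen in these logs and the standard meaning of the low three bits of the x86 page-fault error code (bit 0: protection violation rather than page not present, bit 1: write rather than read, bit 2: user mode rather than kernel mode). The names SEGFAULT_RE, decode_error and parse_segfault are hypothetical helpers, not part of libguestfs or coreutils.

```python
import re

# Layout assumed from the messages quoted in this bug, for example:
# md5sum[1038]: segfault at b697 ip 0000af67 sp bfb332d0 error 4 in libc-2.10.1.so[110000+16b000]
SEGFAULT_RE = re.compile(
    r"(?P<comm>\S+)\[(?P<pid>\d+)\]: segfault at (?P<addr>[0-9a-f]+) "
    r"ip (?P<ip>[0-9a-f]+|\(null\)) sp (?P<sp>[0-9a-f]+) "
    r"error (?P<err>[0-9a-f]+)"
    r"(?: in (?P<obj>\S+)\[(?P<base>[0-9a-f]+)\+(?P<size>[0-9a-f]+)\])?"
)


def decode_error(code):
    """Decode the low three bits of an x86 page-fault error code."""
    return {
        "cause": "protection violation" if code & 1 else "page not present",
        "access": "write" if code & 2 else "read",
        "mode": "user" if code & 4 else "kernel",
    }


def parse_segfault(line):
    """Return a dict describing one segfault line, or None if it does not match."""
    m = SEGFAULT_RE.search(line)
    if m is None:
        return None
    # The kernel prints "(null)" for a zero instruction pointer.
    ip = 0 if m["ip"] == "(null)" else int(m["ip"], 16)
    info = {
        "comm": m["comm"],
        "pid": int(m["pid"]),
        "addr": int(m["addr"], 16),
        "ip": ip,
        "error": decode_error(int(m["err"], 16)),  # error field treated as hex
        "object": m["obj"],
        "ip_in_mapping": None,
    }
    if m["obj"] is not None:
        base, size = int(m["base"], 16), int(m["size"], 16)
        # Does the faulting ip fall inside the mapping named in the message?
        info["ip_in_mapping"] = base <= ip < base + size
    return info
```

Applied to the md5sum line above, this reports a user-mode read of a page that is not present, and an ip (0xaf67) that lies below the reported libc mapping starting at 0x110000. That would be consistent with the process jumping to a bogus address rather than faulting inside library code, although the sketch alone cannot prove that.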

Comment 3 Richard W.M. Jones 2009-05-21 21:58:14 UTC

Adding Jim Meyering to the CC of this bug.

Jim:
This is very out of left field for coreutils, but I wonder if you
have any idea why the *sum programs in coreutils could
possibly segfault randomly when run in a Fedora 11 i586
qemu guest?  F-11, i586 and qemu [not KVM] all seem
to be significant factors.

Comment 4 Jim Meyering 2009-05-22 06:28:53 UTC

Hi Rich,
a stack trace or gdb backtrace from sha1sum would help, but here's a shot in the dark:

could this be due to locale settings different from LC_ALL=C and running without locale-related files?  I.e., could your removing some locale-related infrastructure have caused this?

Then I looked at your attachment.
Right before the first segfault, I see what looks like serious FS trouble with a mkdir syscall.
It's probably worth investigating that first.

------------[ cut here ]------------
WARNING: at fs/fs-writeback.c:302 __writeback_single_inode+0x1d4/0x27a() (Tainted: G        W )
Hardware name: 
Modules linked in: ext2 virtio_net virtio_pci
Pid: 276, comm: mkdir Tainted: G        W  2.6.29.3-155.fc11.i586 #1
Call Trace:
 [<c042f2f2>] warn_slowpath+0x7c/0xa4
 [<c0478bf0>] ? find_get_pages_tag+0x32/0xa2
 [<c047fb86>] ? pagevec_lookup_tag+0x1e/0x25
 [<c0479935>] ? wait_on_page_writeback_range+0xa2/0xdd
 [<c04bb661>] ? wait_on_buffer+0x32/0x35
 [<c04bb8e2>] ? sync_dirty_buffer+0x59/0x8d
 [<d887409f>] ? brelse+0x11/0x13 [ext2]
 [<d887432d>] ? ext2_update_inode+0x28c/0x2c7 [ext2]
 [<c04b673d>] __writeback_single_inode+0x1d4/0x27a
 [<c044077d>] ? wake_bit_function+0x0/0x3c
 [<c04b6808>] sync_inode+0x25/0x38
 [<d8873fa0>] ext2_sync_inode+0x2c/0x33 [ext2]
 [<d8872838>] ext2_commit_chunk+0x92/0xa6 [ext2]
 [<d88729af>] ext2_make_empty+0x163/0x17a [ext2]
 [<d8875d03>] ext2_mkdir+0x9d/0xf1 [ext2]
 [<c04a8501>] vfs_mkdir+0x61/0x9f
 [<c04a9b8b>] sys_mkdirat+0x89/0xc2
 [<c0433ee0>] ? _local_bh_enable+0x8d/0x9d
 [<c043401a>] ? __do_softirq+0x12a/0x139
 [<c04a9bd9>] sys_mkdir+0x15/0x17
 [<c0403f72>] syscall_call+0x7/0xb
---[ end trace fe7116eb9e9c7886 ]---

Comment 5 Richard W.M. Jones 2009-05-22 08:30:13 UTC

This is the environment that commands running in the
daemon get.

 TERM=linux
 PWD=/
 guestfs=10.0.2.4:6666
 SHLVL=0
 HOME=/

I'm now in the process of adding more environment
variables (particularly PATH) to see if that fixes
the problem.
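
One way to try this outside the daemon is to run the failing command under exactly the environment listed above, then again with extra variables added. A minimal sketch, assuming a Python interpreter is available wherever the command runs. The helper name run_with_env, the PATH value and the sha1sum location in the usage comment are illustrative assumptions and do not describe what the libguestfs daemon actually does; LC_ALL=C is the setting suggested in comment 4.

```python
import subprocess

# Environment quoted in this comment, copied verbatim.
DAEMON_ENV = {
    "TERM": "linux",
    "PWD": "/",
    "guestfs": "10.0.2.4:6666",
    "SHLVL": "0",
    "HOME": "/",
}


def run_with_env(argv, extra_env=None):
    """Run argv with the daemon-like environment plus optional extra variables."""
    env = dict(DAEMON_ENV)
    env.update(extra_env or {})
    # env= replaces the inherited environment entirely, so nothing leaks in
    # from the calling shell.
    return subprocess.run(argv, env=env, capture_output=True, text=True)


# Hypothetical usage (paths are assumptions):
# run_with_env(["/usr/bin/sha1sum", "/sysroot/new"],
#              {"PATH": "/usr/sbin:/usr/bin:/sbin:/bin", "LC_ALL": "C"})
```

Comparing the results with and without the extra variables would show whether the missing environment is a factor at all.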

Comment 6 Mark McLoughlin 2009-05-25 15:00:18 UTC

Nasty, nasty - could be worth trying to reproduce locally with qemu -no-kvm

Comment 7 Jim Meyering 2009-05-25 15:08:26 UTC

fixed typo in title: s/libguetfs/libguestfs/

Comment 8 Bug Zapper 2009-06-09 16:16:50 UTC

This bug appears to have been reported against 'rawhide' during the Fedora 11 development cycle.
Changing version to '11'.

More information and reason for this action is here:
http://fedoraproject.org/wiki/BugZappers/HouseKeeping

Comment 9 Richard W.M. Jones 2009-07-21 12:51:15 UTC

*** Bug 512680 has been marked as a duplicate of this bug. ***

Comment 10 Richard W.M. Jones 2009-07-21 12:58:18 UTC

Changed the summary to:

  "running commands segfault randomly, with 32 bit host and guest,
   when host is a Xen or VMWare guest"

All the factors in that summary seem to be crucial: the "host"
(ie. what is running libguestfs) will be a 32 bit Xen or VMWare
guest.

Programs running in the guestfs will segfault randomly, or
give a spurious kernel error like "no vm86_info: BAD error",
or give a qemu error "tcg fatal error".

The problem seems to be confined to F11, i586, RHEL 5 Xen.

Comment 11 Richard W.M. Jones 2009-07-23 09:58:56 UTC

Verified that this doesn't happen with the vanilla,
upstream qemu 0.10.4.  Solution is probably to upgrade
qemu, but we need to do a proper bisect.
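
The bisect itself is an ordinary binary search over an ordered list of qemu revisions. A minimal sketch, assuming the first revision in the list is known good, the last is known bad, and every revision after the first bad one is also bad. The function name first_bad and the is_bad callback are hypothetical; because the failures are random, is_bad would need to run the test several times before declaring a revision good.

```python
def first_bad(revisions, is_bad):
    """Return the earliest revision for which is_bad(revision) is true.

    Assumes revisions[0] is good, revisions[-1] is bad, and badness is
    monotonic along the list.
    """
    lo, hi = 0, len(revisions) - 1  # invariant: revisions[lo] good, revisions[hi] bad
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if is_bad(revisions[mid]):
            hi = mid  # the first bad revision is at mid or earlier
        else:
            lo = mid  # the first bad revision is after mid
    return revisions[hi]
```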

Comment 12 Richard W.M. Jones 2009-09-22 14:24:32 UTC

Setting Product to Virtualization Tools.

Haven't seen this for a long time, and it's probably a
TCG-related bug, but I'll leave it open for now.

Comment 13 Bill McGonigle 2009-10-01 17:16:09 UTC

Hrm, I just started seeing this (or something else with the same symptoms) after a domU reboot.  5.3 Dom0, F10 domU.

When it gets like this, seemingly everything will segfault (even 'less'), though I can get a reboot to happen.  A reboot of the domU will forestall the problem for something on the order of a day.  I see no obvious problems on the Dom0 or in a light Fedora 11 DomU or a Nexenta (OpenSolaris) DomU.

segfaults look like:

  Sep 29 15:14:38 borlaug kernel: MailScanner[26490]: segfault at 50 ip 008071ba sp bfdc7820 error 6 in libperl.so[6f7000+269000]
  Sep 29 15:15:10 borlaug kernel: MailScanner[26599]: segfault at 639ce069 ip 007a83b1 sp bfdc7990 error 4 in libperl.so[6f7000+269000]

One difference would be these two kernel versions:

  Aug 25 21:28:31 borlaug kernel: Linux version 2.6.27.21-170.2.56.fc10.i686.PAE (mockbuild.redhat.com) (gcc version 4.3.2 20081105 (Red Hat 4.3.2-7) (GCC) ) #1 SMP Mon Mar 23 23:24:26 EDT 2009
  Sep 14 18:35:31 borlaug kernel: Linux version 2.6.27.30-170.2.82.fc10.i686.PAE (mockbuild.phx.redhat.com) (gcc version 4.3.2 20081105 (Red Hat 4.3.2-7) (GCC) ) #1 SMP Mon Aug 17 08:24:23 EDT 2009

though there have been other subsequent package updates.  Is it helpful to get a bt on e.g. 'less' when this happens again?  I'll try switching back to the old kernel to see if that makes a difference.

Comment 14 Bill McGonigle 2009-10-09 17:18:53 UTC

The domU kernel didn't make a difference, but the dom0 kernel did.

Problems (any of several f10 domU kernels):

Sep 26 19:08:02 stevens kernel: Linux version 2.6.18-164.el5xen (mockbuild.org) (gcc version 4.1.2 20080704 (Red Hat 4.1.2-46)) #1 SMP Thu Sep 3 04:47:32 EDT 2009

No problems (most recent f10 domU kernel):

Oct  7 23:42:28 stevens kernel: Linux version 2.6.18-128.1.10.el5xen (mockbuild.org) (gcc version 4.1.2 20080704 (Red Hat 4.1.2-44)) #1 SMP Thu May 7 11:51:15 EDT 2009

Am I chasing the same bug here?  I skipped many updates with a reboot of the Dom0 - I'll try a binary search of the updates to see where the problems started if this isn't already being hunted in another bug.

Comment 15 Richard W.M. Jones 2009-10-09 17:41:35 UTC

I'm pretty certain this is a different bug.  The current bug is with
TCG emulation under QEMU, and nothing to do with Xen.

Comment 16 Bill McGonigle 2009-10-09 17:49:20 UTC

Ah, I see I've misparsed comment #10.  I'll find/make a different one.

Comment 17 Richard W.M. Jones 2010-03-02 16:37:25 UTC

Still seeing this on Koji, eg:

http://koji.fedoraproject.org/koji/getfile?taskID=2025462&name=build.log

sha224sum /sysroot/known-3
[  966.687020] sha224sum[5431]: segfault at eae7 ip 0000e3b7 sp bfa418d0 error 4 in ld-2.11.1.so[581000+1e000]
[  966.691152] no vm86_info: BAD

sha256sum /sysroot/known-3
[  967.673020] sha256sum[5441]: segfault at f056 ip 0000e926 sp bfbf4ef0 error 4 in libc-2.11.1.so[16a000+16f000]
[  967.674848] no vm86_info: BAD

Comment 18 Richard W.M. Jones 2010-03-02 17:36:56 UTC

Another example:

http://koji.fedoraproject.org/koji/getfile?taskID=2025622&name=build.log

sha224sum /sysroot/known-3
[ 1013.971021] sha224sum[5281]: segfault at e864 ip 0000e134 sp bfbfb7f0 error 4 in libc-2.11.1.so[5fd000+16f000]
[ 1013.975147] no vm86_info: BAD

(Note that this test was exactly the same as comment 17, and yet it only
failed once this time, so the failures are not completely deterministic).
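
Since a single run proves little when failures are random, a crash rate over many runs is more informative. A minimal sketch, assuming a POSIX system where Python's subprocess module reports a child killed by signal N as return code -N. The function name count_segfaults is hypothetical, and the command would have to run inside the affected guest for the count to mean anything.

```python
import signal
import subprocess


def count_segfaults(argv, runs=20):
    """Run argv repeatedly and count how many runs were killed by SIGSEGV."""
    crashes = 0
    for _ in range(runs):
        result = subprocess.run(
            argv, stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL
        )
        # On POSIX a negative return code -N means "terminated by signal N".
        if result.returncode == -signal.SIGSEGV:
            crashes += 1
    return crashes
```

Running this for the same command on two setups (for example, two qemu builds) gives a rough failure rate for each, which is easier to compare than individual log excerpts.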

Comment 19 Richard W.M. Jones 2011-04-12 18:00:40 UTC

Haven't seen this happen for quite a long time, and we regularly
run the full test suite on i386 and x86-64.

Closing ...
