Bug 555788

Summary:

SIGTRAP leakage between separate virtual machines

Product:

[Fedora] Fedora

Reporter:

Tom Horsley <horsley1953>

Component:

kvm

Assignee:

Glauber Costa <gcosta>

Status:

CLOSED CURRENTRELEASE

QA Contact:

Fedora Extras Quality Assurance <extras-qa>

Severity:

high

Priority:

low

CC:

berrange, clalance, ehabkost, gcosta, jforbes, markmc, quintela, virt-maint

Hardware:

x86_64

OS:

Linux

Doc Type:

Bug Fix

Clones:

555889

Last Closed:

2010-06-08 15:08:41 UTC

Bug Depends On:

546327

Bug Blocks:

555889

Attachments:

Description	Flags
gzipped tar archive with test program to generate address traps	none
normal boot on 2nd virtual machine	none
screenshot of another boot showing sigtrap disruption	none
sles10i virtual machine xml definition	none
the 2nd virtual machine sles10x xml definition	none

Description Tom Horsley 2010-01-15 14:39:47 UTC

Description of problem:

I run a program on one virtual machine which generates gazillions of
SIGTRAPs by using the DBn debug registers to do address traps.

I run a completely separate virtual machine on the same host, and random
processes on that 2nd virtual machine get SIGTRAPs as though either the
traps are being delivered to the wrong process, or the contents of the
debug registers are leaking across virtual machines and causing traps
in the wrong process.

Version-Release number of selected component (if applicable):

qemu-kvm-0.11.0-12.fc12.x86_64
kernel-2.6.31.9-174.fc12.x86_64

How reproducible:
Somewhat random, but with the program I'll attach to generate vast
numbers of address traps, it does seem to eventually happen every time.

Steps to Reproduce:
1. boot one VM, unpack watchme.tar.gz, run make
2. boot another VM
3. watch random processes complain about traps during boot
  
Actual results:
leakage across virtual machines

Expected results:
no leakage across virtual machines

Additional info:

I only see this behavior on my opteron based host system. I have another
host running xeon chips where the sigtraps have never appeared.

This may also be a very old problem. I was previously using an ancient version
of xen to host my virtual machines on the same hardware, and
was getting the spurious SIGTRAP problems then as well (which is one of
the motives I had to upgrade to shiny new KVM and fedora 12).

I'd also point out this could be considered a nasty security problem with
one user on one virtual machine being able to disrupt other virtual
machines at random.

Comment 1 Tom Horsley 2010-01-15 14:43:28 UTC

Created attachment 384628
gzipped tar archive with test program to generate address traps

This program consists of a custom "debugger" (watcher) which debugs the
watchme program, using the debug registers to generate zillions of
address traps in multiple threads.
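
For readers without the attachment, here is a minimal sketch in C of how a watcher like this might build the DR7 control value that arms one of the four x86 hardware watchpoints (DR0 to DR3). It is not the code from the attachment: the helper name dr7_enable and the DR7_* constants are hypothetical, and only the bit layout (enable bits in the low byte, R/W and LEN fields starting at bit 16) follows the x86 architecture.

```c
#include <stdint.h>

/* R/W field encodings for one DR7 slot (x86 architecture). */
enum { DR7_RW_EXEC = 0, DR7_RW_WRITE = 1, DR7_RW_READWRITE = 3 };

/* LEN field encodings. Note the order: 8 bytes is encoded as 2, 4 bytes as 3. */
enum { DR7_LEN_1 = 0, DR7_LEN_2 = 1, DR7_LEN_8 = 2, DR7_LEN_4 = 3 };

/*
 * Hypothetical helper (not taken from the attached test program):
 * return dr7 with watchpoint `slot` (0..3) locally enabled and its
 * R/W and LEN fields set. Any previous settings for that slot are cleared;
 * the other slots are left untouched.
 */
static uint64_t dr7_enable(uint64_t dr7, unsigned slot, unsigned rw, unsigned len)
{
    unsigned enable_shift = 2 * slot;       /* L<slot> is bit 2*slot, G<slot> the next bit */
    unsigned field_shift  = 16 + 4 * slot;  /* R/W<slot> (2 bits), then LEN<slot> (2 bits) */

    dr7 &= ~((uint64_t)0x3 << enable_shift);  /* clear old local/global enable bits */
    dr7 &= ~((uint64_t)0xF << field_shift);   /* clear old R/W and LEN fields */
    dr7 |= (uint64_t)1 << enable_shift;       /* set the local enable bit */
    dr7 |= (uint64_t)((rw & 3) | ((len & 3) << 2)) << field_shift;
    return dr7;
}
```

For example, dr7_enable(0, 0, DR7_RW_WRITE, DR7_LEN_4) yields 0xD0001, a 4-byte write watchpoint in slot 0. On Linux a tracer would typically store the watched address and this value in the stopped tracee with ptrace(PTRACE_POKEUSER, pid, offsetof(struct user, u_debugreg[0]), addr) and the same call for u_debugreg[7]; that the attached watcher works this way is an assumption.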

Comment 2 Tom Horsley 2010-01-15 14:45:37 UTC

Created attachment 384630
normal boot on 2nd virtual machine

Here is a screenshot of the virtual machine booting normally. This was the
very first virtual machine booted after a reboot of the host.

Comment 3 Tom Horsley 2010-01-15 14:48:08 UTC

Created attachment 384631
screenshot of another boot showing sigtrap disruption

In this screen shot, another virtual machine is running the test program, and
you can see the sigtrap abort being reported during this boot of the same
vm that had no problem booting before the test program was started.

Comment 4 Tom Horsley 2010-01-15 14:51:34 UTC

Created attachment 384633
sles10i virtual machine xml definition

The sles10i virtual machine was the one I built and ran the test program
on where the debug registers were modified and the address traps generated.
My impression is that the specific virtual machines don't really make any
difference, but I include this for a complete description of the problem.

Comment 5 Tom Horsley 2010-01-15 14:53:50 UTC

Created attachment 384635
the 2nd virtual machine sles10x xml definition

This is the sles10x virtual machine which the screen shots were from. Booting
this machine worked before I ran the test over on sles10i, and failed after
I shut it down, started the test over on sles10i, and tried booting it again.

Comment 6 Tom Horsley 2010-01-15 14:58:48 UTC

The kvm host is an 8 core opteron with 8 gig of memory, smolt profile:

http://www.smolts.org/client/show/pub_fde3037c-3bfd-4a28-9f73-601d559b6567

Comment 7 Tom Horsley 2010-01-15 16:07:02 UTC

>I run a program on one virtual machine which generates gazillions of
>SIGTRAPs by using the DBn debug registers to do address traps.

I meant to say the DRn registers, not DBn.

Comment 8 Tom Horsley 2010-01-16 00:43:13 UTC

More data (but I don't know what it means):

I've been trying to use the test program to trigger this bug on my home
system (a 4 core single chip intel box), and I have not yet seen it
fail.

The intel system at work, however, has exhibited this symptom by having
a kernel build fail with the compiler aborting due to a SIGTRAP, but I
haven't explicitly run my test program to see it trigger the failure on
an intel box.

The two systems on which I have now seen this symptom are both dual socket
motherboards with two 4 core chips (opterons in one, xeons in the other),
so perhaps something about multiple cpu chips makes this more likely.

Here's the smolt profile for the dual cpu intel box:

http://www.smolts.org/client/show/pub_7cfd66c1-0166-4f15-92ca-739b774c9559

(It is running a somewhat old fedora 11).

Comment 9 Tom Horsley 2010-06-08 14:36:09 UTC

I have just installed fedora 13 (and grabbed all updates) on the host machine
described in the original bug. I then ran the exact same test with the
exact same test program and virtual machines hosted on the new fedora 13
host kernel, and I am happy to say that after multiple reboots of the sles10x
virtual machine while sles10i was running my test program, I did not see
a single spurious SIGTRAP interfere with the sles10x virtual machine.
It will be a while before I have lots of results from running my testbeds
continuously, but it certainly looks like this problem may be fixed in the
latest KVM code. Rpms currently on system for this test are:

qemu-kvm-0.12.3-8.fc13.x86_64
qemu-img-0.12.3-8.fc13.x86_64
qemu-system-x86-0.12.3-8.fc13.x86_64
qemu-common-0.12.3-8.fc13.x86_64
gpxe-roms-qemu-1.0.0-1.fc13.noarch
kernel-2.6.33.5-112.fc13.x86_64

Comment 10 Justin M. Forbes 2010-06-08 15:08:41 UTC

Right, the fix for this went into the upstream kernel and should be available in an update for the F-12 kernel.