Bug 555788

Summary: SIGTRAP leakage between separate virtual machines
Product: Fedora
Component: kvm
Version: 12
Hardware: x86_64
OS: Linux
Status: CLOSED CURRENTRELEASE
Severity: high
Priority: low
Reporter: Tom Horsley <horsley1953>
Assignee: Glauber Costa <gcosta>
QA Contact: Fedora Extras Quality Assurance <extras-qa>
CC: berrange, clalance, ehabkost, gcosta, jforbes, markmc, quintela, virt-maint
Doc Type: Bug Fix
Last Closed: 2010-06-08 11:08:41 EDT
Bug Depends On: 546327
Bug Blocks: 555889
Attachments:
gzipped tar archive with test program to generate address traps
normal boot on 2nd virtual machine
screenshot of another boot showing sigtrap disruption
sles10i virtual machine xml definition
the 2nd virtual machine sles10x xml definition

Description Tom Horsley 2010-01-15 09:39:47 EST
Description of problem:

I run a program on one virtual machine which generates gazillions of
SIGTRAPs by using the DBn debug registers to do address traps.

I run a completely separate virtual machine on the same host, and random
processes on that 2nd virtual machine get SIGTRAPs as though either the
traps are being delivered to the wrong process, or the contents of the
debug registers are leaking across virtual machines and causing traps
in the wrong process.

Version-Release number of selected component (if applicable):


How reproducible:
Somewhat random, but with the program I'll attach to generate vast
numbers of address traps, it does seem to eventually happen every time.

Steps to Reproduce:
1. boot one VM, unpack watchme.tar.gz, run make
2. boot another VM
3. watch random processes complain about traps during boot

Actual results:
leakage across virtual machines

Expected results:
no leakage across virtual machines

Additional info:

I only see this behavior on my opteron based host system. I have another
host running xeon chips where the sigtraps have never appeared.

This may also be a very, very old problem. I was using an ancient version
of xen to host my virtual machines on the same hardware previously, and
was getting the spurious SIGTRAP problems then as well (which is one of
the motives I had to upgrade to shiny new KVM and fedora 12).

I'd also point out this could be considered a nasty security problem with
one user on one virtual machine being able to disrupt other virtual
machines at random.
Comment 1 Tom Horsley 2010-01-15 09:43:28 EST
Created attachment 384628 [details]
gzipped tar archive with test program to generate address traps

This program consists of a custom "debugger" (watcher) which debugs the
watchme program, using the debug registers to generate zillions of
address traps in multiple threads.
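
For readers without the attachment, here is a minimal sketch of how a
ptrace-based watcher on x86 Linux typically arms an address trap. It is not
the code from watchme.tar.gz; the names dr7_bits and set_write_watchpoint are
hypothetical, and the sketch assumes a tracee that is already attached and
stopped. The bit layout (enable bit at 2*n, R/W field at 16+4*n, LEN field at
18+4*n) is the architectural DR7 layout.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#if defined(__linux__) && (defined(__x86_64__) || defined(__i386__))
#include <sys/types.h>
#include <sys/ptrace.h>
#include <sys/user.h>
#define HAVE_X86_DEBUGREG 1
#endif

/* DR7 R/W field: which kind of access fires the trap. */
enum { DR_RW_EXECUTE = 0, DR_RW_WRITE = 1, DR_RW_READWRITE = 3 };
/* DR7 LEN field: width of the watched region (note 8 bytes encodes as 2). */
enum { DR_LEN_1 = 0, DR_LEN_2 = 1, DR_LEN_8 = 2, DR_LEN_4 = 3 };

/* Build the DR7 bits that arm debug register `slot` (DR0..DR3). */
uint64_t dr7_bits(unsigned slot, unsigned rw, unsigned len)
{
    assert(slot < 4);
    uint64_t enable   = 1ull << (slot * 2);                  /* local enable Ln */
    uint64_t rw_bits  = (uint64_t)(rw & 3)  << (16 + slot * 4);
    uint64_t len_bits = (uint64_t)(len & 3) << (18 + slot * 4);
    return enable | rw_bits | len_bits;
}

#ifdef HAVE_X86_DEBUGREG
/*
 * Hypothetical helper: arm a 4-byte write watchpoint at `addr` in a stopped
 * tracee. It overwrites DR7 completely, so any other armed slot is disabled.
 * Returns 0 on success, -1 on failure.
 */
int set_write_watchpoint(pid_t pid, unsigned slot, uintptr_t addr)
{
    size_t base = offsetof(struct user, u_debugreg);
    size_t reg  = sizeof(((struct user *)0)->u_debugreg[0]);
    uintptr_t ctl = (uintptr_t)dr7_bits(slot, DR_RW_WRITE, DR_LEN_4);

    /* DRn holds the watched address. */
    if (ptrace(PTRACE_POKEUSER, pid, (void *)(base + slot * reg),
               (void *)addr) == -1)
        return -1;
    /* DR7 holds the enable, access-type and length fields. */
    if (ptrace(PTRACE_POKEUSER, pid, (void *)(base + 7 * reg),
               (void *)ctl) == -1)
        return -1;
    return 0;
}
#endif
```

Once the tracee is resumed, each write to the watched address stops it with
SIGTRAP, which is the kind of trap the test generates in large numbers.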
Comment 2 Tom Horsley 2010-01-15 09:45:37 EST
Created attachment 384630 [details]
normal boot on 2nd virtual machine

Here is a screenshot of the virtual machine booting normally. This was the
very first virtual machine booted after a reboot of the host.
Comment 3 Tom Horsley 2010-01-15 09:48:08 EST
Created attachment 384631 [details]
screenshot of another boot showing sigtrap disruption

In this screen shot, another virtual machine is running the test program, and
you can see the sigtrap abort being reported during this boot of the same
vm that had no problem booting before the test program was started.
Comment 4 Tom Horsley 2010-01-15 09:51:34 EST
Created attachment 384633 [details]
sles10i virtual machine xml definition

The sles10i virtual machine was the one I built and ran the test program
on where the debug registers were modified and the address traps generated.
My impression is that the specific virtual machines don't really make any
difference, but I include this for a complete description of the problem.
Comment 5 Tom Horsley 2010-01-15 09:53:50 EST
Created attachment 384635 [details]
the 2nd virtual machine sles10x xml definition

This is the sles10x virtual machine which the screen shots were from. Booting
this machine worked before I ran the test over on sles10i, and failed after
I shut it down, started the test over on sles10i, and tried booting it again.
Comment 6 Tom Horsley 2010-01-15 09:58:48 EST
The kvm host is an 8 core opteron with 8 gig of memory, smolt profile:

Comment 7 Tom Horsley 2010-01-15 11:07:02 EST
>I run a program on one virtual machine which generates gazillions of
>SIGTRAPs by using the DBn debug registers to do address traps.

I meant to say the DRn registers, not DBn.
Comment 8 Tom Horsley 2010-01-15 19:43:13 EST
More data (but I don't know what it means):

I've been trying to use the test program to trigger this bug on my home
system (a 4 core single chip intel box), and I have not yet seen it.

The intel system at work, however, has exhibited this symptom by having
a kernel build fail with the compiler aborting due to a SIGTRAP, but I
haven't explicitly run my test program to see it trigger the failure on
an intel box.

The two systems on which I have now seen this symptom are both dual socket
motherboards with two 4 core chips (opterons in one, xeons in the other),
so perhaps something about multiple cpu chips makes this more likely.

Here's the smolt profile for the dual cpu intel box:


(It is running a somewhat old fedora 11).
Comment 9 Tom Horsley 2010-06-08 10:36:09 EDT
I have just installed fedora 13 (and grabbed all updates) on the host machine
described in the original bug. I then ran the exact same test with the
exact same test program and virtual machines hosted on the new fedora 13
host kernel, and I am happy to say that after multiple reboots of the sles10x
virtual machine while sles10i was running my test program, I did not see
a single spurious SIGTRAP interfere with the sles10x virtual machine.
It will be a while before I have lots of results from running my testbeds
continuously, but it certainly looks like this problem may be fixed in the
latest KVM code. Rpms currently on system for this test are:

Comment 10 Justin M. Forbes 2010-06-08 11:08:41 EDT
Right, the fix for this went into the upstream kernel and should be available in an update for the F-12 kernel.