Bug 1408333 - Regression: [BISECTED] Guest hangs on migrate, reverting patch fixes the problem
Status: CLOSED NOTABUG
Product: Red Hat Enterprise Linux 7
Classification: Red Hat
Component: qemu-kvm
Version: 7.3
Hardware: x86_64 Linux
Priority: unspecified Severity: urgent
Target Milestone: rc
Target Release: ---
Assigned To: Dr. David Alan Gilbert
Virtualization Bugs
Depends On:
Blocks:
 
Reported: 2016-12-22 18:59 EST by bugzilla
Modified: 2017-02-15 13:36 EST
CC List: 14 users

See Also:
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2017-02-15 13:36:36 EST
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---


Attachments
Virtual machine XML (1.68 KB, text/plain)
2016-12-22 18:59 EST, bugzilla
Guest kernel panic after migrate (28.70 KB, image/png)
2016-12-22 19:00 EST, bugzilla


External Trackers
Tracker | ID | Priority | Status | Summary | Last Updated
Red Hat Bugzilla | 1261797 | None | None | None | 2016-12-22 18:59 EST

Description bugzilla 2016-12-22 18:59:31 EST
Created attachment 1234900
Virtual machine XML

Description of problem:

We did a full update to el7.3 last week, and it broke live migration of virtual machines on our infrastructure. After a migration, the VM hangs on the destination, spinning at 100% CPU, and eventually produces a double-fault kernel panic (image attached).

After a week of troubleshooting, we downloaded the src.rpm for qemu-kvm and bisected it by commenting out patches in the .spec file until we found the offending patch:
    kvm-target-i386-get-put-MSR_TSC_AUX-across-reset-and-mig.patch
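The bisection described above can be sketched as a binary search over the ordered patch list. This is only an illustration, not the procedure the reporters actually scripted: `first_bad_patch` and the `is_bad` callback are hypothetical names, and the sketch assumes the build with no patches migrates cleanly, the build with all patches hangs, and patches can be applied as an ordered prefix. In practice `is_bad` would mean rebuilding the RPM with only that prefix enabled in the .spec and attempting a migration.

```python
from typing import Callable, Sequence


def first_bad_patch(
    patches: Sequence[str],
    is_bad: Callable[[Sequence[str]], bool],
) -> str:
    """Return the first patch whose inclusion makes the build bad.

    Assumes the empty prefix is good, the full list is bad, and that
    once a prefix is bad every longer prefix is bad as well.
    """
    if not patches:
        raise ValueError("no patches to bisect")
    lo, hi = 0, len(patches)  # prefix of length lo is good, length hi is bad
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if is_bad(patches[:mid]):
            hi = mid
        else:
            lo = mid
    return patches[hi - 1]
```

With N patches this needs about log2(N) rebuild-and-migrate cycles instead of N.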

See also bz#1261797

After reverting this patch, we can use the latest version of qemu-kvm (qemu-kvm-1.5.3-126.el7.x86_64.rpm) and migrate without issue.

Version-Release number of selected component (if applicable):

1.5.3-126.el7.x86_64


How reproducible:

Very. We can reproduce it reliably on both hardware (our production environment), and nested KVM (our test environment).

Steps to Reproduce:
1. Configure a shared LUN (we use DRBD).
2. Configure two hypervisors with el7.3.1611 and access to the shared LUN
3. Install el7.3.1611 minimal in the guest
4. Attempt to migrate


Actual results:

Guest hangs and ultimately presents the attached double fault panic.


Expected results:

Guest should migrate successfully and continue normal operation.


Additional info:

See attached. We are providing the dumpxml output and the kernel panic screenshot.
Comment 1 bugzilla 2016-12-22 19:00 EST
Created attachment 1234901
Guest kernel panic after migrate
Comment 2 Qunfang Zhang 2016-12-22 21:56:29 EST
Hi, Min

Please help reproduce this bug, thanks.
Comment 3 Min Deng 2016-12-23 04:19:27 EST
Could you please provide the entire qemu cli and the exact version of el7.3.1611, if possible? Thanks in advance!
Min
Comment 5 Eric Wheeler 2016-12-26 12:34:09 EST
What do you mean by entire qemu cli and accurate version?  The qemu-kvm package version at issue is as above: qemu-kvm-1.5.3-126.el7.x86_64.rpm

We downloaded the .src.rpm, rebuilt and confirmed the problem.  We then removed the patch shown above and our VM stopped hanging on migrate.

We just use this to migrate, nothing special and no direct qemu monitor interaction:
  virsh --connect=qemu:///system --quiet migrate --live myfavoritevm qemu+ssh://remotenode/system

Libvirt doesn't seem involved, but we are using this version: libvirt-2.0.0-10.el7_3.2.x86_64 which comes with the latest el7 7.3.1611 release.
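For anyone scripting the reproduction, the migrate invocation above could be wrapped roughly as follows. This is a minimal sketch: the function names `build_migrate_command` and `migrate` are made up for illustration, and only the virsh options quoted in this comment are used.

```python
import subprocess
from typing import List


def build_migrate_command(domain: str, dest_host: str) -> List[str]:
    """Build the same virsh live-migration command quoted in this comment."""
    return [
        "virsh",
        "--connect=qemu:///system",
        "--quiet",
        "migrate",
        "--live",
        domain,
        f"qemu+ssh://{dest_host}/system",
    ]


def migrate(domain: str, dest_host: str) -> int:
    """Run the migration and return virsh's exit status (0 on success)."""
    return subprocess.run(build_migrate_command(domain, dest_host)).returncode
```

Calling `migrate("myfavoritevm", "remotenode")` would run the command shown above. Note that a zero exit status only means virsh reported success; the hang in this report shows up in the guest afterwards.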
Comment 6 Qunfang Zhang 2016-12-27 03:51:35 EST
(In reply to Eric Wheeler from comment #5)
> What do you mean entire qemu cli and accurate version?  The qemu-kvm package
> version at issue as above: qemu-kvm-1.5.3-126.el7.x86_64.rpm
> 
> We downloaded the .src.rpm, rebuilt and confirmed the problem.  We then
> removed the patch shown above and our VM stopped hanging on migrate.
> 
> We just use  this to migrate, nothing special and no direct qemu monitor
> interaction:
>   virsh --connect=qemu:///system --quiet migrate --live myfavoritevm
> qemu+ssh://remotenode/system
> 
> Libvirt doesn't seem involved, but we are using this version:
> libvirt-2.0.0-10.el7_3.2.x86_64 which comes with the latest el7 7.3.1611
> release.

Hi, Eric

Thanks for your reply. By "entire qemu cli" we mean the full qemu command line, which can be gathered on the host with "ps ax | grep kvm" while the VM is running.
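As an alternative to grepping ps output, the same information can be read from /proc on the host. The sketch below is an illustration only; `parse_cmdline` and `qemu_command_lines` are hypothetical helper names, and matching on "qemu-kvm" in the executable name is an assumption based on the binary path used in RHEL.

```python
import os
from typing import Dict, List


def parse_cmdline(raw: bytes) -> List[str]:
    """Split the NUL-separated contents of /proc/<pid>/cmdline into arguments.

    Empty arguments are dropped, which also removes the trailing NUL.
    """
    return [arg.decode(errors="replace") for arg in raw.split(b"\0") if arg]


def qemu_command_lines(proc_root: str = "/proc") -> Dict[int, List[str]]:
    """Map PID to argument list for every running qemu-kvm process."""
    result: Dict[int, List[str]] = {}
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # not a process directory
        try:
            with open(os.path.join(proc_root, entry, "cmdline"), "rb") as f:
                args = parse_cmdline(f.read())
        except OSError:
            continue  # process exited or is not readable
        if args and "qemu-kvm" in os.path.basename(args[0]):
            result[int(entry)] = args
    return result
```

Joining each argument list with spaces gives the command line in the form requested here.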

Regards,
Qunfang
Comment 7 bugzilla 2016-12-27 11:47:47 EST
Ah! That makes sense. Here it is:

/usr/libexec/qemu-kvm -name demo-1 -S -machine pc-i440fx-rhel7.0.0,accel=kvm,usb=off -m 384 -realtime mlock=off -smp 2,sockets=2,cores=1,threads=1 -uuid 08edf62d-1580-41f9-9fbc-36310a48bbca -no-user-config -nodefaults -chardev socket,id=charmonitor,path=/var/lib/libvirt/qemu/domain-105-demo-1/monitor.sock,server,nowait -mon chardev=charmonitor,id=monitor,mode=control -rtc base=utc -no-shutdown -boot strict=on -device piix3-usb-uhci,id=usb,bus=pci.0,addr=0x1.0x2 -drive file=/dev/drbd/by-res/demo-1,format=raw,if=none,id=drive-virtio-disk0,cache=none -device virtio-blk-pci,scsi=off,bus=pci.0,addr=0x7,drive=drive-virtio-disk0,id=virtio-disk0,bootindex=1 -chardev pty,id=charserial0 -device isa-serial,chardev=charserial0,id=serial0 -vnc 127.0.0.1:0 -vga cirrus -msg timestamp=on
Comment 8 Eric Wheeler 2016-12-27 17:31:31 EST
If it is helpful, we have confirmed this on the following CPU hardware:

Intel(R) Xeon(R) CPU E3-1230 V2 @ 3.30GHz
Intel(R) Core(TM) i5-4460  CPU @ 3.20GHz
Comment 9 Qunfang Zhang 2016-12-27 21:01:03 EST
Thanks for the information.
Comment 10 Xujun Ma 2016-12-30 07:15:12 EST
qemu-kvm-1.5.3-126.el7.x86_64
guest OS: RHEL-7.3-updates-20161130.1
Intel(R) Xeon(R) CPU E5-2630 v3 @ 2.40GHz
Intel(R) Core(TM) i7-4770 CPU @ 3.40GHz
shared storage: nfs

I can't reproduce this issue with the qemu command line from comment 7. I will try again if I can reserve a machine with one of the CPUs listed in comment 8.
Comment 11 bugzilla 2017-01-24 19:21:23 EST
Hello All,

We continued troubleshooting on our side. It occurred to us that perhaps this is not a user space problem. We were running the Linux longterm 4.1.y releases and discovered that the kernel was causing the problem. It turns out that in 4.1.16, this patch was merged into the kernel: KVM: x86: expose MSR_TSC_AUX to userspace
	https://git.kernel.org/cgit/linux/kernel/git/stable/linux-stable.git/commit/?id=8a3185c54d650a86dafc8d8bcafa124b50944315

It was flagged for cc: stable@vger.kernel.org, but had some dependencies that were missed. In order to be stable, these commits must also be pulled into the 4.1.y series:
	609e36d372a KVM: x86: pass host_initiated to functions that read MSRs
	81b1b9ca6d5 backport: KVM: VMX: Fix host initiated access to guest MSR_TSC_AUX
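To check whether a given kernel tree already carries these dependencies, one option is to search the commit subjects rather than the hashes, since a stable backport gets a new hash. The sketch below is an assumption about how one might do that, not something from this report: `missing_subjects` is a hypothetical helper, and its input is expected to be subject lines such as those printed by `git log --format=%s` for the tree in question.

```python
from typing import Iterable, List, Sequence

# Subjects of the two dependency commits named in this comment.
REQUIRED_SUBJECTS = (
    "KVM: x86: pass host_initiated to functions that read MSRs",
    "KVM: VMX: Fix host initiated access to guest MSR_TSC_AUX",
)


def missing_subjects(
    log_subjects: Iterable[str],
    required: Sequence[str] = REQUIRED_SUBJECTS,
) -> List[str]:
    """Return the required subjects that do not appear in the log.

    Substring matching is used so that a prefixed subject such as
    "backport: KVM: VMX: ..." still counts as present.
    """
    lines = [line.strip() for line in log_subjects]
    return [s for s in required if not any(s in line for line in lines)]
```

An empty result suggests both dependencies are present; subject matching is a heuristic and does not prove the backports are complete or correct.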

These commits were signed off by:
	Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
	Signed-off-by: Haozhong Zhang <haozhong.zhang@intel.com>

I'm not sure if they should be added to this BZ or not, so I will let your team decide on that.

I understand that because this is not a supported kernel, you may be inclined to mark this as "not a bug" or "won't fix" or some other appropriate flag for RHEL itself. Please feel free to do whatever is most appropriate for your workflow.

However, please leave this BZ public so that I can post to the KVM and Linux-stable lists and reference this bug for complete information.

Thank you everyone for your help in troubleshooting to get to the root of this issue!

Sincerely,

Eric Wheeler
Comment 12 Dr. David Alan Gilbert 2017-02-15 07:56:24 EST
(In reply to bugzilla from comment #11)
> Hello All,
> 
> We continued to do troubleshooting on our side. It occurred to us that
> perhaps this is not a user space problem. We were running the Linux longterm
> 4.1.y releases and discovered that this was causing the problem.

I wish you'd mentioned you were using a non-distro kernel earlier!
Does it work with the distro kernel?

> It turns
> out that in 4.1.16, This patch was merged into the kernel: KVM: x86: expose
> MSR_TSC_AUX to userspace
> 	https://git.kernel.org/cgit/linux/kernel/git/stable/linux-stable.git/commit/
> ?id=8a3185c54d650a86dafc8d8bcafa124b50944315

OK, I see that in our distro kernel.

> It was flagged for cc: stable@vger.kernel.org, but had some dependencies
> that were missed. In order to be stable, these commits must also be pulled
> into the 4.1.y series:
> 	609e36d372a KVM: x86: pass host_initiated to functions that read MSRs
> 	81b1b9ca6d5 backport: KVM: VMX: Fix host initiated access to guest
> MSR_TSC_AUX

I see both of those in our distro kernel.

> These commits were signed off by:
> 	Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
> 	Signed-off-by: Haozhong Zhang <haozhong.zhang@intel.com>
> 
> I'm not sure if they should be added to this BZ or not, so I will let your
> team decide on that.

You might want b0996ae48 as well which is a fix for the first of those.

> 
> I understand that because that this is not a supported kernel that you may
> be inclined to mark this as "not a bug" or "won't fix" or some other
> appropriate flag for RHEL itself. Please feel free to do whatever is most
> appropriate for your work flow.
> 
> However, Please leave this BZ public so that I can post to the KVM and
> Linux-stable lists and reference this bug for complete information.
> 
> Thank you everyone for your help in troubleshooting to get to the root of
> this issue!

Can you just confirm it works fine with the distro kernel?
Thanks for tracking it down and making sure the missing fixes went into stable.

Dave
Comment 13 bugzilla 2017-02-15 13:30:22 EST
(In reply to Dr. David Alan Gilbert from comment #12)
> (In reply to bugzilla from comment #11)
> > Hello All,
> > 
> > We continued to do troubleshooting on our side. It occurred to us that
> > perhaps this is not a user space problem. We were running the Linux longterm
> > 4.1.y releases and discovered that this was causing the problem.
> 
> I wish you'd mentioned you were using a non-distro kernel earlier!
> Does it work with the distro kernel?

My apologies for not mentioning the kernel version. Since I was able to fix this in userspace, I had not considered it could be a kernel issue and forgot to mention it.

Yes, it works with the distro kernel.
 
> > It turns
> > out that in 4.1.16, This patch was merged into the kernel: KVM: x86: expose
> > MSR_TSC_AUX to userspace
> > 	https://git.kernel.org/cgit/linux/kernel/git/stable/linux-stable.git/commit/
> > ?id=8a3185c54d650a86dafc8d8bcafa124b50944315
> 
> OK, I see that in our distro kernel.
> 
> > It was flagged for cc: stable@vger.kernel.org, but had some dependencies
> > that were missed. In order to be stable, these commits must also be pulled
> > into the 4.1.y series:
> > 	609e36d372a KVM: x86: pass host_initiated to functions that read MSRs
> > 	81b1b9ca6d5 backport: KVM: VMX: Fix host initiated access to guest
> > MSR_TSC_AUX
> 
> I see both of those in our distro kernel.
> 
> > These commits were signed off by:
> > 	Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
> > 	Signed-off-by: Haozhong Zhang <haozhong.zhang@intel.com>
> > 
> > I'm not sure if they should be added to this BZ or not, so I will let your
> > team decide on that.
> 
> You might want b0996ae48 as well which is a fix for the first of those.

Thank you for that!

> > I understand that because that this is not a supported kernel that you may
> > be inclined to mark this as "not a bug" or "won't fix" or some other
> > appropriate flag for RHEL itself. Please feel free to do whatever is most
> > appropriate for your work flow.
> > 
> > However, Please leave this BZ public so that I can post to the KVM and
> > Linux-stable lists and reference this bug for complete information.
> > 
> > Thank you everyone for your help in troubleshooting to get to the root of
> > this issue!
> 
> Can you just confirm it works fine with the distro kernel?

Confirmed. This problem does not present itself with the distro kernel.

> Thanks for tracking it down and making sure the missing fixes went into
> stable.

You're welcome, I am happy to help!

-Eric

Comment 14 Dr. David Alan Gilbert 2017-02-15 13:36:36 EST
Thanks!

Based on comment 13: closed as Not-a-bug. The distro kernel works and the main upstream works; it's just a bug in the upstream stable tree.
