Bug 1408333

Summary: Regression: [BISECTED] Guest hangs on migrate, reverting patch fixes the problem
Product: Red Hat Enterprise Linux 7
Reporter: bugzilla
Component: qemu-kvm
Assignee: Dr. David Alan Gilbert <dgilbert>
Status: CLOSED NOTABUG
QA Contact: Virtualization Bugs <virt-bugs>
Severity: urgent
Docs Contact:
Priority: unspecified
Version: 7.3
CC: bugzilla, chayang, hhuang, juzhang, knoel, mdeng, michen, qzhang, rbalakri, rh-bugzilla, virt-maint, xianwang, xuma, zhengtli
Target Milestone: rc
Target Release: ---
Hardware: x86_64
OS: Linux
Whiteboard:
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2017-02-15 18:36:36 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Attachments:
  Virtual machine XML (flags: none)
  Guest kernel panic after migrate (flags: none)

Description bugzilla 2016-12-22 23:59:31 UTC
Created attachment 1234900 [details]
Virtual machine XML

Description of problem:

We did a full update to el7.3 last week, and it broke live migration of virtual machines on our infrastructure. A migrating VM would hang on the destination, spinning at 100% CPU, and eventually produce a double fault kernel panic (image attached).

After a week of troubleshooting, we downloaded the src.rpm for qemu-kvm and started bisecting by commenting out patches in the .spec until we found the offending patch:
    kvm-target-i386-get-put-MSR_TSC_AUX-across-reset-and-mig.patch

See also bz#1261797

After reverting this patch, we can use the latest version of qemu-kvm (qemu-kvm-1.5.3-126.el7.x86_64.rpm) and migrate without issue.
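
The bisection described above can be sketched as a binary search over the ordered patch list. This is only an illustration, not the reporter's actual procedure: `find_offending_patch` and the `build_is_good` callback are hypothetical names, and the sketch assumes that a build with no patches migrates fine, a build with every patch hangs, and exactly one patch flips the result.

```python
from typing import Callable, Sequence


def find_offending_patch(
    patches: Sequence[str],
    build_is_good: Callable[[Sequence[str]], bool],
) -> str:
    """Return the first patch whose inclusion turns a good build bad.

    `build_is_good` (hypothetical) rebuilds the package with only the given
    leading patches enabled and reports whether migration then succeeds.
    """
    if not patches:
        raise ValueError("no patches to bisect")
    # Invariant: patches[:lo] gives a good build, patches[:hi] a bad one.
    lo, hi = 0, len(patches)
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if build_is_good(patches[:mid]):
            lo = mid
        else:
            hi = mid
    return patches[hi - 1]
```

Each call to `build_is_good` stands for one rebuild-and-migrate cycle, so the number of rebuilds grows with the logarithm of the patch count rather than with the count itself.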

Version-Release number of selected component (if applicable):

1.5.3-126.el7.x86_64


How reproducible:

Very. We can reproduce it reliably both on hardware (our production environment) and under nested KVM (our test environment).

Steps to Reproduce:
1. Configure a shared LUN (we use DRBD).
2. Configure two hypervisors with el7.3.1611 and access to the shared LUN
3. Install el7.3.1611 minimal in the guest
4. Attempt to migrate


Actual results:

Guest hangs and ultimately presents the attached double fault panic.


Expected results:

Guest should migrate successfully and continue normal operation.


Additional info:

See attached. We are providing the dumpxml output and the kernel panic screenshot.

Comment 1 bugzilla 2016-12-23 00:00:22 UTC
Created attachment 1234901 [details]
Guest kernel panic after migrate

Comment 2 Qunfang Zhang 2016-12-23 02:56:29 UTC
Hi, Min

Please help reproduce this bug, thanks.

Comment 3 Min Deng 2016-12-23 09:19:27 UTC
Could you please provide the entire qemu cli and the exact version of el7.3.1611, if possible?
Thanks in advance!
Min

Comment 5 Eric Wheeler 2016-12-26 17:34:09 UTC
What do you mean by "entire qemu cli" and "accurate version"?  The qemu-kvm package version at issue is as above: qemu-kvm-1.5.3-126.el7.x86_64.rpm

We downloaded the .src.rpm, rebuilt and confirmed the problem.  We then removed the patch shown above and our VM stopped hanging on migrate.

We just use this to migrate, with nothing special and no direct qemu monitor interaction:
  virsh --connect=qemu:///system --quiet migrate --live myfavoritevm qemu+ssh://remotenode/system

Libvirt doesn't seem involved, but we are using this version: libvirt-2.0.0-10.el7_3.2.x86_64 which comes with the latest el7 7.3.1611 release.

Comment 6 Qunfang Zhang 2016-12-27 08:51:35 UTC
(In reply to Eric Wheeler from comment #5)
> What do you mean entire qemu cli and accurate version?  The qemu-kvm package
> version at issue as above: qemu-kvm-1.5.3-126.el7.x86_64.rpm
> 
> We downloaded the .src.rpm, rebuilt and confirmed the problem.  We then
> removed the patch shown above and our VM stopped hanging on migrate.
> 
> We just use  this to migrate, nothing special and no direct qemu monitor
> interaction:
>   virsh --connect=qemu:///system --quiet migrate --live myfavoritevm
> qemu+ssh://remotenode/system
> 
> Libvirt doesn't seem involved, but we are using this version:
> libvirt-2.0.0-10.el7_3.2.x86_64 which comes with the latest el7 7.3.1611
> release.

Hi, Eric

Thanks for your reply. By "entire qemu cli" we mean the full qemu command line, which can be gathered on the host with "# ps ax | grep kvm" while the VM is running.

Regards,
Qunfang
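
For scripted collection, the same information can be pulled out of captured `ps ax` output. The following is a minimal sketch, assuming the default `ps ax` column layout (PID, TTY, STAT, TIME, COMMAND); the function name `qemu_command_lines` is hypothetical.

```python
from typing import List


def qemu_command_lines(ps_output: str) -> List[str]:
    """Return the COMMAND column of every qemu-kvm process in `ps ax` output."""
    commands = []
    for row in ps_output.splitlines():
        # Split into at most five fields so the command keeps its spaces.
        fields = row.split(None, 4)
        if len(fields) != 5:
            continue
        command = fields[4]
        # Match on the executable only, which skips the "grep kvm" process.
        if "qemu-kvm" in command.split()[0]:
            commands.append(command)
    return commands
```

Matching on the first word of the command rather than the whole row avoids picking up the `grep` process itself.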

Comment 7 bugzilla 2016-12-27 16:47:47 UTC
Ah! That makes sense. Here it is:

/usr/libexec/qemu-kvm -name demo-1 -S -machine pc-i440fx-rhel7.0.0,accel=kvm,usb=off -m 384 -realtime mlock=off -smp 2,sockets=2,cores=1,threads=1 -uuid 08edf62d-1580-41f9-9fbc-36310a48bbca -no-user-config -nodefaults -chardev socket,id=charmonitor,path=/var/lib/libvirt/qemu/domain-105-demo-1/monitor.sock,server,nowait -mon chardev=charmonitor,id=monitor,mode=control -rtc base=utc -no-shutdown -boot strict=on -device piix3-usb-uhci,id=usb,bus=pci.0,addr=0x1.0x2 -drive file=/dev/drbd/by-res/demo-1,format=raw,if=none,id=drive-virtio-disk0,cache=none -device virtio-blk-pci,scsi=off,bus=pci.0,addr=0x7,drive=drive-virtio-disk0,id=virtio-disk0,bootindex=1 -chardev pty,id=charserial0 -device isa-serial,chardev=charserial0,id=serial0 -vnc 127.0.0.1:0 -vga cirrus -msg timestamp=on
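
When comparing a reproduction setup against a command line like the one above, it can help to split it into option/value pairs. This is a rough sketch, not a full qemu option parser: `parse_qemu_args` is a hypothetical helper, and it assumes that no option value starts with "-", which holds for this command line.

```python
import shlex
from typing import List, Optional, Tuple


def parse_qemu_args(cmdline: str) -> List[Tuple[str, Optional[str]]]:
    """Split a qemu command line into (option, value) pairs.

    Options without a value, such as -S or -nodefaults, get None.
    """
    tokens = shlex.split(cmdline)[1:]  # drop the binary path
    pairs: List[Tuple[str, Optional[str]]] = []
    i = 0
    while i < len(tokens):
        option = tokens[i]
        # Heuristic: the next token is a value unless it looks like an option.
        has_value = i + 1 < len(tokens) and not tokens[i + 1].startswith("-")
        pairs.append((option, tokens[i + 1] if has_value else None))
        i += 2 if has_value else 1
    return pairs
```

Looking up `-machine`, `-smp`, or `-drive` in the result makes it easier to see where two setups differ.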

Comment 8 Eric Wheeler 2016-12-27 22:31:31 UTC
If it is helpful, we have confirmed this on the following CPU hardware:

Intel(R) Xeon(R) CPU E3-1230 V2 @ 3.30GHz
Intel(R) Core(TM) i5-4460  CPU @ 3.20GHz

Comment 9 Qunfang Zhang 2016-12-28 02:01:03 UTC
Thanks for the information.

Comment 10 Xujun Ma 2016-12-30 12:15:12 UTC
qemu-kvm-1.5.3-126.el7.x86_64
Guest OS: RHEL-7.3-updates-20161130.1
Intel(R) Xeon(R) CPU E5-2630 v3 @ 2.40GHz
Intel(R) Core(TM) i7-4770 CPU @ 3.40GHz
Shared storage: NFS

I can't reproduce this issue with the qemu command line from comment 7. I will try again if I can reserve a machine with one of the CPUs listed in comment 8.

Comment 11 bugzilla 2017-01-25 00:21:23 UTC
Hello All,

We continued troubleshooting on our side. It occurred to us that perhaps this is not a user space problem. We were running the Linux longterm 4.1.y releases and discovered that the kernel was causing the problem. It turns out that in 4.1.16, this patch was merged into the kernel: KVM: x86: expose MSR_TSC_AUX to userspace
	https://git.kernel.org/cgit/linux/kernel/git/stable/linux-stable.git/commit/?id=8a3185c54d650a86dafc8d8bcafa124b50944315

It was flagged for cc: stable.org, but had some dependencies that were missed. In order to be stable, these commits must also be pulled into the 4.1.y series:
	609e36d372a KVM: x86: pass host_initiated to functions that read MSRs
	81b1b9ca6d5 backport: KVM: VMX: Fix host initiated access to guest MSR_TSC_AUX

These commits were signed off by:
	Signed-off-by: Paolo Bonzini <pbonzini>
	Signed-off-by: Haozhong Zhang <haozhong.zhang>

I'm not sure if they should be added to this BZ or not, so I will let your team decide on that.

I understand that, because this is not a supported kernel, you may be inclined to mark this as "not a bug" or "won't fix" or some other appropriate flag for RHEL itself. Please feel free to do whatever is most appropriate for your workflow.

However, please leave this BZ public so that I can post to the KVM and Linux-stable lists and reference this bug for complete information.

Thank you everyone for your help in troubleshooting to get to the root of this issue!

Sincerely,

Eric Wheeler
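
One way to check whether a given kernel branch carries the dependent fixes named in comment 11 is to match on commit subjects, since backported commits get new hashes in a stable tree. The sketch below assumes the subjects come from something like `git log --format=%s` run on the branch in question; `missing_backports` is a hypothetical helper name.

```python
from typing import Iterable, List

# Subjects of the dependent commits listed in comment 11.
REQUIRED_SUBJECTS = [
    "KVM: x86: pass host_initiated to functions that read MSRs",
    "KVM: VMX: Fix host initiated access to guest MSR_TSC_AUX",
]


def missing_backports(
    log_subjects: Iterable[str],
    required: Iterable[str] = REQUIRED_SUBJECTS,
) -> List[str]:
    """Return the required subjects that do not appear in the branch log.

    Substring matching tolerates prefixes such as "backport: ".
    """
    present = list(log_subjects)
    return [r for r in required if not any(r in s for s in present)]
```

An empty result only means the subjects are present; it does not prove the backports themselves are correct.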

Comment 12 Dr. David Alan Gilbert 2017-02-15 12:56:24 UTC
(In reply to bugzilla from comment #11)
> Hello All,
> 
> We continued to do troubleshooting on our side. It occurred to us that
> perhaps this is not a user space problem. We were running the Linux longterm
> 4.1.y releases and discovered that this was causing the problem.

I wish you'd mentioned you were using a non-distro kernel earlier!
Does it work with the distro kernel?

> It turns
> out that in 4.1.16, This patch was merged into the kernel: KVM: x86: expose
> MSR_TSC_AUX to userspace
> 	https://git.kernel.org/cgit/linux/kernel/git/stable/linux-stable.git/commit/?id=8a3185c54d650a86dafc8d8bcafa124b50944315

OK, I see that in our distro kernel.

> It was flagged for cc: stable.org, but had some dependencies
> that were missed. In order to be stable, these commits must also be pulled
> into the 4.1.y series:
> 	609e36d372a KVM: x86: pass host_initiated to functions that read MSRs
> 	81b1b9ca6d5 backport: KVM: VMX: Fix host initiated access to guest
> MSR_TSC_AUX

I see both of those in our distro kernel.

> These commits were signed off by:
> 	Signed-off-by: Paolo Bonzini <pbonzini>
> 	Signed-off-by: Haozhong Zhang <haozhong.zhang>
> 
> I'm not sure if they should be added to this BZ or not, so I will let your
> team decide on that.

You might want b0996ae48 as well which is a fix for the first of those.

> 
> I understand that because that this is not a supported kernel that you may
> be inclined to mark this as "not a bug" or "won't fix" or some other
> appropriate flag for RHEL itself. Please feel free to do whatever is most
> appropriate for your work flow.
> 
> However, Please leave this BZ public so that I can post to the KVM and
> Linux-stable lists and reference this bug for complete information.
> 
> Thank you everyone for your help in troubleshooting to get to the root of
> this issue!

Can you just confirm it works fine with the distro kernel?
Thanks for tracking it down and making sure the missing fixes went into stable.

Dave
> Sincerely,
> 
> Eric Wheeler

Comment 13 bugzilla 2017-02-15 18:30:22 UTC
(In reply to Dr. David Alan Gilbert from comment #12)
> (In reply to bugzilla from comment #11)
> > Hello All,
> > 
> > We continued to do troubleshooting on our side. It occurred to us that
> > perhaps this is not a user space problem. We were running the Linux longterm
> > 4.1.y releases and discovered that this was causing the problem.
> 
> I wish you'd mentioned you were using a non-distro kernel earlier!
> Does it work with the distro kernel?

My apologies for not mentioning the kernel version. Since I was able to fix this in userspace, I had not considered it could be a kernel issue and forgot to mention it.

Yes, it works with the distro kernel.
 
> > It turns
> > out that in 4.1.16, This patch was merged into the kernel: KVM: x86: expose
> > MSR_TSC_AUX to userspace
> > 	https://git.kernel.org/cgit/linux/kernel/git/stable/linux-stable.git/commit/?id=8a3185c54d650a86dafc8d8bcafa124b50944315
> 
> OK, I see that in our distro kernel.
> 
> > It was flagged for cc: stable.org, but had some dependencies
> > that were missed. In order to be stable, these commits must also be pulled
> > into the 4.1.y series:
> > 	609e36d372a KVM: x86: pass host_initiated to functions that read MSRs
> > 	81b1b9ca6d5 backport: KVM: VMX: Fix host initiated access to guest
> > MSR_TSC_AUX
> 
> I see both of those in our distro kernel.
> 
> > These commits were signed off by:
> > 	Signed-off-by: Paolo Bonzini <pbonzini>
> > 	Signed-off-by: Haozhong Zhang <haozhong.zhang>
> > 
> > I'm not sure if they should be added to this BZ or not, so I will let your
> > team decide on that.
> 
> You might want b0996ae48 as well which is a fix for the first of those.

Thank you for that!

> > I understand that because that this is not a supported kernel that you may
> > be inclined to mark this as "not a bug" or "won't fix" or some other
> > appropriate flag for RHEL itself. Please feel free to do whatever is most
> > appropriate for your work flow.
> > 
> > However, Please leave this BZ public so that I can post to the KVM and
> > Linux-stable lists and reference this bug for complete information.
> > 
> > Thank you everyone for your help in troubleshooting to get to the root of
> > this issue!
> 
> Can you just confirm it works fine with the distro kernel?

Confirmed. This problem does not present itself with the distro kernel.

> Thanks for tracking it down and making sure the missing fixes went into
> stable.

You're welcome, I am happy to help!

-Eric

> Dave
> > Sincerely,
> > 
> > Eric Wheeler

Comment 14 Dr. David Alan Gilbert 2017-02-15 18:36:36 UTC
Thanks!

Based on comment 13, closing as Not-a-bug: the distro kernel works and the main upstream works; it's just a bug in the upstream stable tree.