Bug 1584775
Summary: | VMs hung after migration | |
---|---|---|---
Product: | Red Hat Enterprise Linux 7 | Reporter: | Kapetanakis Giannis <bilias>
Component: | kernel | Assignee: | Dr. David Alan Gilbert <dgilbert>
kernel sub component: | Virtualization | QA Contact: | Yumei Huang <yuhuang>
Status: | CLOSED ERRATA | Docs Contact: |
Severity: | high | |
Priority: | urgent | CC: | alejandro.cortina2, amashah, bilias, chayang, dhoward, fbaudin, gveitmic, jinzhao, juzhang, knoel, michal.skrivanek, michen, mtessun, qzhang, ruben, slopezpa, yuhuang
Version: | 7.5 | Keywords: | Regression, ZStream
Target Milestone: | rc | |
Target Release: | --- | |
Hardware: | x86_64 | |
OS: | Linux | |
Whiteboard: | | |
Fixed In Version: | kernel-3.10.0-911.el7 | Doc Type: | If docs needed, set a value
Doc Text: | | Story Points: | ---
Clone Of: | | |
: | 1594288 1594292 (view as bug list) | Environment: |
Last Closed: | 2018-10-30 09:18:51 UTC | Type: | Bug
Regression: | --- | Mount Type: | ---
Documentation: | --- | CRM: |
Verified Versions: | | Category: | ---
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: |
Cloudforms Team: | --- | Target Upstream Version: |
Embargoed: | | |
Bug Depends On: | | |
Bug Blocks: | 1594288, 1594292 | |
Attachments: | | |
Description
Kapetanakis Giannis
2018-05-31 15:44:47 UTC
Unfortunately kernel-3.10.0-862.el7.x86_64 is also causing the exact same problems. https://access.redhat.com/errata/RHSA-2018:1062

Hi Giannis,

Thanks for the report; I've got some questions:

a) Can you describe the hardware you're running this on - e.g. a cat /proc/cpuinfo from the host; also what connection are you migrating over (1G or 10G, etc.)?
b) Are there any errors in the host dmesg after the migrate?
c) Can you provide a copy of /var/log/libvirt/qemu/THEINSTANCE.log for the VM instances of both the openbsd and rhel5 VMs; preferably matching ones from a source and destination host which show the migration?
d) Have you got matching versions of qemu-kvm installed on source and destination?
e) Are the host clocks synchronised (e.g. with ntp)?
f) How often does the rhel5 migration fail - e.g. 1/5 or 1/10 etc.?
g) How often does the openbsd migration fail?

We don't test OpenBSD much - but the fact it's a regression is interesting, so worth understanding; and a rhel5 guest should work.

Thanks,
Dave

Created attachment 1446729 [details]
engine.log
Created attachment 1446730 [details]
qemu from source host
Created attachment 1446731 [details]
qemu from dest host
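
As a side note on the data gathering requested in the questions above, here is a minimal sketch (not part of the original thread) of host-side commands that cover points (a), (b), (d) and (e). Paths and package names assume a stock RHEL 7 / CentOS 7 hypervisor and may differ (for example, oVirt nodes typically ship qemu-kvm-ev); the interface name is only a placeholder.

    # Run on both source and destination hosts and compare the output.
    grep -m1 'model name' /proc/cpuinfo   # host CPU model
    ethtool em1                           # link speed of the migration interface (name is an example)
    dmesg -T | tail -n 100                # recent host kernel messages after the migrate
    rpm -q kernel qemu-kvm libvirt        # confirm matching versions on both hosts
    chronyc tracking                      # clock synchronisation status (or: ntpq -p)
    ls -l /var/log/libvirt/qemu/          # per-VM qemu logs; copy the ones for the affected guests
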
I have 2 kinds of machines but they behave the same:

1) nodes of Dell PowerEdge R730: 2 x E5-2680 v4
2) nodes of IBM System x3650 M4: -[7915LRN]-: 2 x E5-2640 v2

Do you want flags?

Migration is over 10G: a vlan on top of a bond (mode=1), master 10G / slave 1G.
Storage is also iSCSI over that 10G bond interface.
The migration network is the same as the storage network (same vlan).
VMs run on different vlans.

No, there are no errors at all on the VM. No errors in engine.log either. I'll have to check with libvirt/qemu.

All software is the same on all hypervisors (kernel, libvirt, vdsm, qemu etc.)

Clocks are synced on hypervisors with chrony. VMs also sync time (ntp/chrony).

Migration is not failing. Problems/hangs occur after migration. It doesn't seem like a network problem because a shell opened prior to migration continues to operate normally. New logins (console included) are delayed. Funny stuff with top/tcpdump.

On OpenBSD: 100%, every time.

On EL5 it's quite random. I've noticed this twice on the last 2 upgrades (from 7.4->7.5 and yesterday patching 7.5) that the EL5 machines got stuck. I couldn't ping them and I couldn't access the VM console, so I did a power cycle on them. However on later migrations I didn't -always- reproduce the problem.

Nevertheless I would also like to try reverting some patches and see what happens. For instance I see this:

https://access.redhat.com/articles/3411331 coming from https://access.redhat.com/errata/RHSA-2018:1130

"Previously, migrating a virtual machine (VM) using Advanced Vector Extensions (AVX) sometimes corrupted the ymm registers, leading to guest-visible register corruption. This happened because the kernel failed to preserve some vector registers when asked by QEMU. With this update, the kernel now preserves the correct registers, and the described problem no longer occurs. (BZ#1542617)"

Don't know if it's related, but I don't have access to that BZ/patch to test it out.

Also I'm willing to try kernel-3.10.0-693.25.2.el7 to see if I can reproduce it there too. Don't have access there either.

About the qemu logs: they are in UTC. The engine log is in UTC+3. You'll see that @ 13:19:04 I requested ovirt to power off the VM (EL5) because it was not responding. I can see logs on the VM up to that time, so it was not crashed or anything.

(In reply to Kapetanakis Giannis from comment #7)
> I have 2 kinds of machines but they behave the same:
>
> 1) nodes of Dell PowerEdge R730: 2 x E5-2680 v4
> 2) nodes of IBM System x3650 M4: -[7915LRN]-: 2 x E5-2640 v2
> Do you want flags?
>
> Migration is over 10G: a vlan on top of a bond (mode=1), master 10G / slave 1G.
> Storage is also iSCSI over that 10G bond interface.
> The migration network is the same as the storage network (same vlan).
> VMs run on different vlans.

OK, thanks for the info; nothing too unusual there - I've got a couple of E5-2620 v2's I can easily test on, so that should be similar to your second box. (Although I can probably find an exact match if I need to.)

> No, there are no errors at all on the VM. No errors in engine.log either.
> I'll have to check with libvirt/qemu.

They look clean.

> All software is the same on all hypervisors (kernel, libvirt, vdsm, qemu
> etc.)
>
> Clocks are synced on hypervisors with chrony. VMs also sync time
> (ntp/chrony).
>
> Migration is not failing. Problems/hangs occur after migration.
> It doesn't seem like a network problem because a shell opened prior to migration
> continues to operate normally. New logins (console included) are delayed.

OK.

> Funny stuff with top/tcpdump.
>
> On OpenBSD: 100%, every time.
>
> On EL5 it's quite random.
> I've noticed this twice on the last 2 upgrades
> (from 7.4->7.5 and yesterday patching 7.5) that the EL5 machines got stuck.
> I couldn't ping them and I couldn't access the VM console, so I did a power
> cycle on them.
> However on later migrations I didn't -always- reproduce the problem.

7.4->7.5 is a bit of a different case - starting on one qemu and landing on a different version; still, I can see about trying to reproduce EL5 7.5<->7.5.

> Nevertheless I would also like to try reverting some patches and see what
> happens. For instance I see this:
>
> https://access.redhat.com/articles/3411331
> coming from https://access.redhat.com/errata/RHSA-2018:1130
> "Previously, migrating a virtual machine (VM) using Advanced Vector
> Extensions (AVX) sometimes corrupted the ymm registers, leading to
> guest-visible register corruption. This happened because the kernel failed
> to preserve some vector registers when asked by QEMU. With this update, the
> kernel now preserves the correct registers, and the described problem no
> longer occurs. (BZ#1542617)"
>
> Don't know if it's related, but I don't have access to that BZ/patch to test
> it out.

Oh that bug; it was a fun one.... You should be able to try that by trying kernel-3.10.0-693.23.1.el7 and the previous version, if you can get those.

> Also I'm willing to try kernel-3.10.0-693.25.2.el7 to see if I can
> reproduce it there too. Don't have access there either.

If I'm right, from those logs I think you're running CentOS rather than RHEL? If so, do you have access to 4.x kernels you can easily try, to see if they work?

> About the qemu logs: they are in UTC. The engine log is in UTC+3.
>
> You'll see that @ 13:19:04 I requested ovirt to power off the VM (EL5) because
> it was not responding.
>
> I can see logs on the VM up to that time, so it was not crashed or anything.

I'll see if I can reproduce it here and see what happens.

(In reply to Dr. David Alan Gilbert from comment #8)
> 7.4->7.5 is a bit of a different case - starting on one qemu and landing on
> a different version; still, I can see about trying to reproduce EL5
> 7.5<->7.5.

Well right now all machines are on 7.5 fully patched, but running the 7.4 kernel-3.10.0-693.21.1.el7.x86_64, and I can't reproduce it.

> Oh that bug; it was a fun one....
> You should be able to try that by trying kernel-3.10.0-693.23.1.el7 and the
> previous version, if you can get those.
No access to this kernel.

Hmm OK; I don't think I can get it to you easily. For reference, the upstream kernel fix is: a05917b6ba9dc9a95fc42bdcbe3a875e8ad83935

> > > Also I'm willing to try kernel-3.10.0-693.25.2.el7 to see if I can
> > > reproduce it there too. Don't have access there either.
> >
> > If I'm right, from those logs I think you're running CentOS rather than RHEL?
> > If so, do you have access to 4.x kernels you can easily try, to see if they work?
>
> Yes I'm on CentOS.
> Didn't know you produce 4.x kernels for EL7 versions...

We don't, but I thought there were CentOS builds somewhere. (Sorry, I don't use CentOS much, so I don't know where to look for stuff as much.)

> I could test but that would not help locating the bug I guess...
>
> Anyway, since you say you'll try to reproduce:
> in case you try OpenBSD, I used a clean 6.3-amd64 release yesterday.

Thanks; downloading.

I've got OpenBSD installed now; interestingly, doing an install using the -8xx kernel I had (not quite up to the 7.5 release) hung near the package selection/CD/http select a few times; I rebooted to a 6xx kernel and it was OK. Have you tried a fresh install on a -8xx VM - i.e. is it more general than a migration problem?

Almost positive I did, because I set up a test VM to debug the problem in order not to delay production machines. No problem in the install. My setup at that time was with the 3.10.0-862.3.2 kernel. I will try again tomorrow when I get to the office and report back.

Use virtio devices with OpenBSD.

I did an OpenBSD-6.3 install today on top of the 3.10.0-862.3.2 kernel. No problems during installation (booted/installed from the CD ISO).

Problems appear 100% of the time when I migrate TO a 3.10.0-862 node. If I migrate to a 3.10.0-693.21.1 kernel the problems are resolved.

I also did a fresh install of CentOS 5.11 and could not reproduce my problems with EL5. I also did a snapshot-clone of an EL5 that was failing before and could not reproduce it...

Yeh, I can recreate this here with both rhel5 and openbsd6.3. My simplest test is to have the guest run:

    while true
    do
      date
      sleep 10
    done

and/or top; one or both of them stop updating after the migrate, even if other bits of the guest are apparently working. Working with a -693 kernel, broken with -862. I'll go and bisect to find the culprit.

I remember seeing a similar post on the OpenBSD lists, about time drifting a lot. Maybe it can help you pinpoint it better. It had to do with the Intel KVM preemption_timer: https://marc.info/?l=openbsd-misc&m=151605213329615&w=2

That while loop is giving pretty crazy output:

    Mon Jun  4 13:32:37 BST 2018
    Mon Jun  4 13:32:37 BST 2018
    Mon Jun  4 13:32:47 BST 2018
    Mon Jun  4 13:32:47 BST 2018
    Mon Jun  4 13:34:03 BST 2018
    Mon Jun  4 13:34:03 BST 2018
    Mon Jun  4 13:35:20 BST 2018
    Mon Jun  4 13:35:20 BST 2018
    Mon Jun  4 13:36:38 BST 2018
    Mon Jun  4 13:36:38 BST 2018
    Mon Jun  4 13:37:48 BST 2018
    Mon Jun  4 13:37:48 BST 2018
    Mon Jun  4 13:39:03 BST 2018
    Mon Jun  4 13:39:03 BST 2018

This looks like it's somewhere between our -744 (good) and our -746 (bad); -746 has a big kvm merge in it (-745 seems rather ill).

Works on upstream 4.17.0.1.

Still fails with our current downstream test kernels (-897).

Paolo suggested upstream commit d8f2f498d9ed0c5010bc1bbc1146f94c8bf9f8cc, which went in after 4.17.0-rc4; I tested -rc3 and it's still broken. I built a downstream -746 (which was broken) with that cherry-picked and that seems to work, so it does look like it. (It applies fairly cleanly - a slight offset, and you need to add a call to ktime_to_ns() to fix up some types.)
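
For anyone wanting to reproduce the simple date/sleep test above in a slightly more self-diagnosing form, here is a minimal sketch (not from the original thread) of a guest-side script that also flags unexpectedly long intervals; the 10-second period matches the loop above, and the 15-second threshold is just an illustrative slack value.

    #!/bin/bash
    # Print the time every 10 seconds and flag any interval that is much
    # longer than expected (as seen in the guest after a bad migration).
    prev=$(date +%s)
    while true; do
        sleep 10
        now=$(date +%s)
        delta=$((now - prev))
        echo "$(date) (interval: ${delta}s)"
        if [ "$delta" -gt 15 ]; then
            echo "WARNING: clock/timer gap of ${delta}s detected" >&2
        fi
        prev=$now
    done
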
Nice :) If you send a patch for -862 I will also test and confirm.

Hi Giannis,

It's this patch here: https://patchwork.kernel.org/patch/10411125/ from upstream, with one small change: the last line needs to change from

    + nsec_to_cycles(apic->vcpu, delta);

to

    + nsec_to_cycles(apic->vcpu, ktime_to_ns(delta));

(patch applied fine for me; git am was a bit fussier because it's moved down a few lines).

Works for me. Thanks for reporting this!

Dave

Applying the above patch on 3.10.0-862.3.2 fixed all of my problems :) Thank you for looking into this so fast.

(In reply to Kapetanakis Giannis from comment #24)
> Applying the above patch on 3.10.0-862.3.2 fixed all of my problems :)
>
> Thank you for looking into this so fast.

OK, thanks for confirming. You should find it appear in a later released version, but I can't confirm when exactly.

Patch(es) committed on kernel repository and an interim kernel build is undergoing testing.

Patch(es) available on kernel-3.10.0-911.el7.

Reproduce:
kernel-3.10.0-907.el7.x86_64
qemu-kvm-rhev-2.12.0-5.el7
Guest: RHEL5.11, kernel-2.6.18-398.el5
Host: two Xeon systems (src: Intel(R) Xeon(R) CPU E5-2630 v3, dst: Intel(R) Xeon(R) CPU E7-4830)

Steps:
1. Boot RHEL5 guest on src host
   # /usr/libexec/qemu-kvm -m 4G -smp 8 rhel511-64-virtio.qcow2 \
     -netdev tap,id=tap0 -device virtio-net-pci,id=net0,netdev=tap0 \
     -monitor stdio -vnc :0
2. Boot guest on dst host in incoming mode
   # /usr/libexec/qemu-kvm -m 4G -smp 8 rhel511-64-virtio.qcow2 \
     -netdev tap,id=tap0 -device virtio-net-pci,id=net0,netdev=tap0 \
     -monitor stdio -vnc :0 -incoming tcp:0:5555
3. Run top and the following script in the guest
   # cat test.sh
   #! /bin/bash
   while true
   do
     date
     sleep 5
   done
4. Migrate guest to dst host
   (qemu) migrate -d tcp:$(dst host ip):5555
5. After migration completes, let the script keep running for a few minutes.

Result: after migration, the date output sometimes shows a time interval longer than 5 seconds.

    Tue Jun 26 14:17:50 CST 2018
    Tue Jun 26 14:17:55 CST 2018
    Tue Jun 26 14:18:00 CST 2018
    Tue Jun 26 14:18:42 CST 2018  --> 42 seconds
    Tue Jun 26 14:18:47 CST 2018
    Tue Jun 26 14:19:27 CST 2018  --> 40 seconds
    Tue Jun 26 14:19:32 CST 2018
    Tue Jun 26 14:19:37 CST 2018
    Tue Jun 26 14:20:15 CST 2018  --> 38 seconds
    Tue Jun 26 14:20:20 CST 2018

Verify:
kernel-3.10.0-915.el7.x86_64
qemu-kvm-rhev-2.12.0-5.el7

With the same steps as above, got the following result. The time interval is always 5 seconds.

    Tue Jun 26 15:36:17 CST 2018
    Tue Jun 26 15:36:22 CST 2018
    Tue Jun 26 15:36:27 CST 2018
    Tue Jun 26 15:36:32 CST 2018
    Tue Jun 26 15:36:37 CST 2018
    Tue Jun 26 15:36:42 CST 2018
    Tue Jun 26 15:36:47 CST 2018
    Tue Jun 26 15:36:52 CST 2018
    Tue Jun 26 15:36:57 CST 2018
    Tue Jun 26 15:37:02 CST 2018
    Tue Jun 26 15:37:07 CST 2018
    Tue Jun 26 15:37:12 CST 2018
    Tue Jun 26 15:37:17 CST 2018
    Tue Jun 26 15:37:22 CST 2018
    Tue Jun 26 15:37:27 CST 2018
    Tue Jun 26 15:37:32 CST 2018

Kernel 3.10.0-862.11.6.el7.x86_64 from #1594292 works fine for me, thanks.

Thanks for confirming, Giannis; and thanks for reporting the bug.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory, and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2018:3083
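
As an aside, a minimal sketch (not part of the errata verification above) of how the captured date output could be checked automatically for intervals larger than the expected 5 seconds; it assumes GNU date on the host and a hypothetical log file name, and allows a little slack over the nominal interval.

    #!/bin/bash
    # Scan a captured 'date' log (one timestamp per line, as produced by test.sh)
    # and report any interval noticeably longer than the expected 5 seconds.
    prev=""
    while read -r line; do
        ts=$(date -d "$line" +%s 2>/dev/null) || continue
        if [ -n "$prev" ] && [ $((ts - prev)) -gt 10 ]; then
            echo "gap of $((ts - prev))s before: $line"
        fi
        prev=$ts
    done < guest-date.log   # hypothetical file holding the guest's date output
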