Bug 1584775
Summary: | VMs hung after migration | |
---|---|---|---
Product: | Red Hat Enterprise Linux 7 | Reporter: | Kapetanakis Giannis <bilias>
Component: | kernel | Assignee: | Dr. David Alan Gilbert <dgilbert>
kernel sub component: | Virtualization | QA Contact: | Yumei Huang <yuhuang>
Status: | CLOSED ERRATA | Docs Contact: |
Severity: | high | |
Priority: | urgent | CC: | alejandro.cortina2, amashah, bilias, chayang, dhoward, fbaudin, gveitmic, jinzhao, juzhang, knoel, michal.skrivanek, michen, mtessun, qzhang, ruben, slopezpa, yuhuang
Version: | 7.5 | Keywords: | Regression, ZStream
Target Milestone: | rc | |
Target Release: | --- | |
Hardware: | x86_64 | |
OS: | Linux | |
Whiteboard: | | |
Fixed In Version: | kernel-3.10.0-911.el7 | Doc Type: | If docs needed, set a value
Doc Text: | | Story Points: | ---
Clone Of: | | |
: | 1594288 1594292 (view as bug list) | Environment: |
Last Closed: | 2018-10-30 09:18:51 UTC | Type: | Bug
Regression: | --- | Mount Type: | ---
Documentation: | --- | CRM: |
Verified Versions: | | Category: | ---
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: |
Cloudforms Team: | --- | Target Upstream Version: |
Embargoed: | | |
Bug Depends On: | | |
Bug Blocks: | 1594288, 1594292 | |
Attachments: | | |
Description
Kapetanakis Giannis
2018-05-31 15:44:47 UTC
Unfortunately kernel-3.10.0-862.el7.x86_64 is also causing the exact same problems. https://access.redhat.com/errata/RHSA-2018:1062

Hi Giannis,

Thanks for the report; I've got some questions:

a) Can you describe the hardware you're running this on - e.g. a cat /proc/cpuinfo from the host; also what connection are you migrating over (1G or 10G, etc.)?
b) Are there any errors in the host dmesg after the migrate?
c) Can you provide a copy of /var/log/libvirt/qemu/THEINSTANCE.log for the VM instances of both the openbsd and rhel5 VMs; preferably matching ones from a source and destination host which show the migration?
d) Have you got matching versions of qemu-kvm installed on source and destination?
e) Are the host clocks synchronised (e.g. with ntp)?
f) How often does the rhel5 migration fail - e.g. 1/5 or 1/10 etc.?
g) How often does the openbsd migration fail?

We don't test OpenBSD much - but the fact it's a regression is interesting, so worth understanding; and a rhel5 guest should work.

Thanks,
Dave

Created attachment 1446729 [details]
engine.log
Created attachment 1446730 [details]
qemu from source host
Created attachment 1446731 [details]
qemu from dest host
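
As a side note on the data gathering requested in the questions above, here is a minimal sketch (not part of the original thread) of host-side commands that cover points (a), (b), (d) and (e). Paths and package names assume a stock RHEL 7 / CentOS 7 hypervisor and may differ (for example, oVirt nodes typically ship qemu-kvm-ev); the interface name is only a placeholder.

    # Run on both source and destination hosts and compare the output.
    grep -m1 'model name' /proc/cpuinfo   # host CPU model
    ethtool em1                           # link speed of the migration interface (name is an example)
    dmesg -T | tail -n 100                # recent host kernel messages after the migrate
    rpm -q kernel qemu-kvm libvirt        # confirm matching versions on both hosts
    chronyc tracking                      # clock synchronisation status (or: ntpq -p)
    ls -l /var/log/libvirt/qemu/          # per-VM qemu logs; copy the ones for the affected guests
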
I have 2 kinds of machines but they behave the same:

1) nodes of Dell PowerEdge R730: 2 x E5-2680 v4
2) nodes of IBM System x3650 M4: -[7915LRN]-: 2 x E5-2640 v2

Do you want flags?

Migration is over 10G: a vlan on top of a bond (mode=1), master 10G / slave 1G.
Storage is also iSCSI over that 10G bond interface.
The migration network is the same as the storage network (same vlan).
VMs run on different vlans.

No, there are no errors at all on the VM. No errors in engine.log either. I'll have to check with libvirt/qemu.

All software is the same on all hypervisors (kernel, libvirt, vdsm, qemu etc.)

Clocks are synced on hypervisors with chrony. VMs also sync time (ntp/chrony).

Migration is not failing. Problems/hangs occur after migration. It doesn't seem like a network problem because a shell opened prior to migration continues to operate normally. New logins (console included) are delayed. Funny stuff with top/tcpdump.

On OpenBSD: 100%, every time.

On EL5 it's quite random. I've noticed this twice on the last 2 upgrades (from 7.4->7.5 and yesterday patching 7.5) that the EL5 machines got stuck. I couldn't ping them and I couldn't access the VM console, so I did a power cycle on them. However on later migrations I didn't -always- reproduce the problem.

Nevertheless I would also like to try reverting some patches and see what happens. For instance I see this:

https://access.redhat.com/articles/3411331 coming from https://access.redhat.com/errata/RHSA-2018:1130

"Previously, migrating a virtual machine (VM) using Advanced Vector Extensions (AVX) sometimes corrupted the ymm registers, leading to guest-visible register corruption. This happened because the kernel failed to preserve some vector registers when asked by QEMU. With this update, the kernel now preserves the correct registers, and the described problem no longer occurs. (BZ#1542617)"

Don't know if it's related, but I don't have access to that BZ/patch to test it out.

Also I'm willing to try kernel-3.10.0-693.25.2.el7 to see if I can reproduce it there too. Don't have access there either.

About the qemu logs: they are in UTC. The engine log is in UTC+3. You'll see that @ 13:19:04 I requested ovirt to power off the VM (EL5) because it was not responding. I can see logs on the VM up to that time, so it was not crashed or anything.

(In reply to Kapetanakis Giannis from comment #7)
> I have 2 kinds of machines but they behave the same:
>
> 1) nodes of Dell PowerEdge R730: 2 x E5-2680 v4
> 2) nodes of IBM System x3650 M4: -[7915LRN]-: 2 x E5-2640 v2
> Do you want flags?
>
> Migration is over 10G: a vlan on top of a bond (mode=1), master 10G / slave 1G.
> Storage is also iSCSI over that 10G bond interface.
> The migration network is the same as the storage network (same vlan).
> VMs run on different vlans.

OK, thanks for the info; nothing too unusual there - I've got a couple of E5-2620 v2's I can easily test on, so that should be similar to your second box. (Although I can probably find an exact match if I need to.)

> No, there are no errors at all on the VM. No errors in engine.log either.
> I'll have to check with libvirt/qemu.

They look clean.

> All software is the same on all hypervisors (kernel, libvirt, vdsm, qemu
> etc.)
>
> Clocks are synced on hypervisors with chrony. VMs also sync time
> (ntp/chrony).
>
> Migration is not failing. Problems/hangs occur after migration.
> It doesn't seem like a network problem because a shell opened prior to migration
> continues to operate normally. New logins (console included) are delayed.

OK.

> Funny stuff with top/tcpdump.
>
> On OpenBSD: 100%, every time.
>
> On EL5 it's quite random.
> I've noticed this twice on the last 2 upgrades
> (from 7.4->7.5 and yesterday patching 7.5) that the EL5 machines got stuck.
> I couldn't ping them and I couldn't access the VM console, so I did a power
> cycle on them.
> However on later migrations I didn't -always- reproduce the problem.

7.4->7.5 is a bit of a different case - starting on one qemu and landing on a different version; still, I can see about trying to reproduce EL5 7.5<->7.5.

> Nevertheless I would also like to try reverting some patches and see what
> happens. For instance I see this:
>
> https://access.redhat.com/articles/3411331
> coming from https://access.redhat.com/errata/RHSA-2018:1130
> "Previously, migrating a virtual machine (VM) using Advanced Vector
> Extensions (AVX) sometimes corrupted the ymm registers, leading to
> guest-visible register corruption. This happened because the kernel failed
> to preserve some vector registers when asked by QEMU. With this update, the
> kernel now preserves the correct registers, and the described problem no
> longer occurs. (BZ#1542617)"
>
> Don't know if it's related, but I don't have access to that BZ/patch to test
> it out.

Oh that bug; it was a fun one.... You should be able to try that by trying kernel-3.10.0-693.23.1.el7 and the previous version, if you can get those.

> Also I'm willing to try kernel-3.10.0-693.25.2.el7 to see if I can
> reproduce it there too. Don't have access there either.

If I'm right, from those logs I think you're running CentOS rather than RHEL? If so, do you have access to 4.x kernels you can easily try, to see if they work?

> About the qemu logs: they are in UTC. The engine log is in UTC+3.
>
> You'll see that @ 13:19:04 I requested ovirt to power off the VM (EL5) because
> it was not responding.
>
> I can see logs on the VM up to that time, so it was not crashed or anything.

I'll see if I can reproduce it here and see what happens.

(In reply to Dr. David Alan Gilbert from comment #8)
> 7.4->7.5 is a bit of a different case - starting on one qemu and landing on
> a different version; still, I can see about trying to reproduce EL5
> 7.5<->7.5.

Well right now all machines are on 7.5 fully patched, but running the 7.4 kernel-3.10.0-693.21.1.el7.x86_64, and I can't reproduce it.

> Oh that bug; it was a fun one....
> You should be able to try that by trying kernel-3.10.0-693.23.1.el7 and the
> previous version, if you can get those.
No access to this kernel.

Hmm OK; I don't think I can get it to you easily. For reference, the upstream kernel fix is: a05917b6ba9dc9a95fc42bdcbe3a875e8ad83935

> > > Also I'm willing to try kernel-3.10.0-693.25.2.el7 to see if I can
> > > reproduce it there too. Don't have access there either.
> >
> > If I'm right, from those logs I think you're running CentOS rather than RHEL?
> > If so, do you have access to 4.x kernels you can easily try, to see if they work?
>
> Yes I'm on CentOS.
> Didn't know you produce 4.x kernels for EL7 versions...

We don't, but I thought there were CentOS builds somewhere. (Sorry, I don't use CentOS much, so I don't know where to look for stuff as much.)

> I could test but that would not help locating the bug I guess...
>
> Anyway, since you say you'll try to reproduce:
> in case you try OpenBSD, I used a clean 6.3-amd64 release yesterday.

Thanks; downloading.

I've got OpenBSD installed now; interestingly, doing an install using the -8xx kernel I had (not quite up to the 7.5 release) hung near the package selection/CD/http select a few times; I rebooted to a 6xx kernel and it was OK. Have you tried a fresh install on a -8xx VM - i.e. is it more general than a migration problem?

Almost positive I did, because I set up a test VM to debug the problem in order not to delay production machines. No problem in the install. My setup at that time was with the 3.10.0-862.3.2 kernel. I will try again tomorrow when I get to the office and report back.

Use virtio devices with OpenBSD.

I did an OpenBSD-6.3 install today on top of the 3.10.0-862.3.2 kernel. No problems during installation (booted/installed from the CD ISO).

Problems appear 100% of the time when I migrate TO a 3.10.0-862 node. If I migrate to a 3.10.0-693.21.1 kernel the problems are resolved.

I also did a fresh install of CentOS 5.11 and could not reproduce my problems with EL5. I also did a snapshot-clone of an EL5 that was failing before and could not reproduce it...

Yeh, I can recreate this here with both rhel5 and openbsd6.3. My simplest test is to have the guest run:

    while true
    do
      date
      sleep 10
    done

and/or top; one or both of them stop updating after the migrate, even if other bits of the guest are apparently working. Working with a -693 kernel, broken with -862. I'll go and bisect to find the culprit.

I remember seeing a similar post on the OpenBSD lists, about time drifting a lot. Maybe it can help you pinpoint it better. It had to do with the Intel KVM preemption_timer: https://marc.info/?l=openbsd-misc&m=151605213329615&w=2

That while loop is giving pretty crazy output:

    Mon Jun  4 13:32:37 BST 2018
    Mon Jun  4 13:32:37 BST 2018
    Mon Jun  4 13:32:47 BST 2018
    Mon Jun  4 13:32:47 BST 2018
    Mon Jun  4 13:34:03 BST 2018
    Mon Jun  4 13:34:03 BST 2018
    Mon Jun  4 13:35:20 BST 2018
    Mon Jun  4 13:35:20 BST 2018
    Mon Jun  4 13:36:38 BST 2018
    Mon Jun  4 13:36:38 BST 2018
    Mon Jun  4 13:37:48 BST 2018
    Mon Jun  4 13:37:48 BST 2018
    Mon Jun  4 13:39:03 BST 2018
    Mon Jun  4 13:39:03 BST 2018

This looks like it's somewhere between our -744 (good) and our -746 (bad); -746 has a big kvm merge in it (-745 seems rather ill).

Works on upstream 4.17.0.1.

Still fails with our current downstream test kernels (-897).

Paolo suggested upstream commit d8f2f498d9ed0c5010bc1bbc1146f94c8bf9f8cc, which went in after 4.17.0-rc4; I tested -rc3 and it's still broken. I built a downstream -746 (which was broken) with that cherry-picked and that seems to work, so it does look like it. (It applies fairly cleanly - a slight offset, and you need to add a call to ktime_to_ns() to fix up some types.)
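
For anyone wanting to reproduce the simple date/sleep test above in a slightly more self-diagnosing form, here is a minimal sketch (not from the original thread) of a guest-side script that also flags unexpectedly long intervals; the 10-second period matches the loop above, and the 15-second threshold is just an illustrative slack value.

    #!/bin/bash
    # Print the time every 10 seconds and flag any interval that is much
    # longer than expected (as seen in the guest after a bad migration).
    prev=$(date +%s)
    while true; do
        sleep 10
        now=$(date +%s)
        delta=$((now - prev))
        echo "$(date) (interval: ${delta}s)"
        if [ "$delta" -gt 15 ]; then
            echo "WARNING: clock/timer gap of ${delta}s detected" >&2
        fi
        prev=$now
    done
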
Nice :) If you send a patch for -862 I will also test and confirm.

Hi Giannis,

It's this patch here: https://patchwork.kernel.org/patch/10411125/ from upstream, with one small change: the last line needs to change from

    + nsec_to_cycles(apic->vcpu, delta);

to

    + nsec_to_cycles(apic->vcpu, ktime_to_ns(delta));

(patch applied fine for me; git am was a bit fussier because it's moved down a few lines).

Works for me. Thanks for reporting this!

Dave

Applying the above patch on 3.10.0-862.3.2 fixed all of my problems :) Thank you for looking into this so fast.

(In reply to Kapetanakis Giannis from comment #24)
> Applying the above patch on 3.10.0-862.3.2 fixed all of my problems :)
>
> Thank you for looking into this so fast.

OK, thanks for confirming. You should find it appear in a later released version, but I can't confirm when exactly.

Patch(es) committed on kernel repository and an interim kernel build is undergoing testing.

Patch(es) available on kernel-3.10.0-911.el7.

Reproduce:
kernel-3.10.0-907.el7.x86_64
qemu-kvm-rhev-2.12.0-5.el7
Guest: RHEL5.11, kernel-2.6.18-398.el5
Host: two Xeon systems (src: Intel(R) Xeon(R) CPU E5-2630 v3, dst: Intel(R) Xeon(R) CPU E7-4830)

Steps:
1. Boot RHEL5 guest on src host
   # /usr/libexec/qemu-kvm -m 4G -smp 8 rhel511-64-virtio.qcow2 \
     -netdev tap,id=tap0 -device virtio-net-pci,id=net0,netdev=tap0 \
     -monitor stdio -vnc :0
2. Boot guest on dst host in incoming mode
   # /usr/libexec/qemu-kvm -m 4G -smp 8 rhel511-64-virtio.qcow2 \
     -netdev tap,id=tap0 -device virtio-net-pci,id=net0,netdev=tap0 \
     -monitor stdio -vnc :0 -incoming tcp:0:5555
3. Run top and the following script in the guest
   # cat test.sh
   #! /bin/bash
   while true
   do
     date
     sleep 5
   done
4. Migrate guest to dst host
   (qemu) migrate -d tcp:$(dst host ip):5555
5. After migration completes, let the script keep running for a few minutes.

Result: after migration, the date output sometimes shows a time interval longer than 5 seconds.

    Tue Jun 26 14:17:50 CST 2018
    Tue Jun 26 14:17:55 CST 2018
    Tue Jun 26 14:18:00 CST 2018
    Tue Jun 26 14:18:42 CST 2018  --> 42 seconds
    Tue Jun 26 14:18:47 CST 2018
    Tue Jun 26 14:19:27 CST 2018  --> 40 seconds
    Tue Jun 26 14:19:32 CST 2018
    Tue Jun 26 14:19:37 CST 2018
    Tue Jun 26 14:20:15 CST 2018  --> 38 seconds
    Tue Jun 26 14:20:20 CST 2018

Verify:
kernel-3.10.0-915.el7.x86_64
qemu-kvm-rhev-2.12.0-5.el7

With the same steps as above, got the following result. The time interval is always 5 seconds.

    Tue Jun 26 15:36:17 CST 2018
    Tue Jun 26 15:36:22 CST 2018
    Tue Jun 26 15:36:27 CST 2018
    Tue Jun 26 15:36:32 CST 2018
    Tue Jun 26 15:36:37 CST 2018
    Tue Jun 26 15:36:42 CST 2018
    Tue Jun 26 15:36:47 CST 2018
    Tue Jun 26 15:36:52 CST 2018
    Tue Jun 26 15:36:57 CST 2018
    Tue Jun 26 15:37:02 CST 2018
    Tue Jun 26 15:37:07 CST 2018
    Tue Jun 26 15:37:12 CST 2018
    Tue Jun 26 15:37:17 CST 2018
    Tue Jun 26 15:37:22 CST 2018
    Tue Jun 26 15:37:27 CST 2018
    Tue Jun 26 15:37:32 CST 2018

Kernel 3.10.0-862.11.6.el7.x86_64 from #1594292 works fine for me, thanks.

Thanks for confirming, Giannis; and thanks for reporting the bug.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory, and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2018:3083
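
As an aside, a minimal sketch (not part of the errata verification above) of how the captured date output could be checked automatically for intervals larger than the expected 5 seconds; it assumes GNU date on the host and a hypothetical log file name, and allows a little slack over the nominal interval.

    #!/bin/bash
    # Scan a captured 'date' log (one timestamp per line, as produced by test.sh)
    # and report any interval noticeably longer than the expected 5 seconds.
    prev=""
    while read -r line; do
        ts=$(date -d "$line" +%s 2>/dev/null) || continue
        if [ -n "$prev" ] && [ $((ts - prev)) -gt 10 ]; then
            echo "gap of $((ts - prev))s before: $line"
        fi
        prev=$ts
    done < guest-date.log   # hypothetical file holding the guest's date output
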