Bug 1167905
| Summary: | Repeated Live Migration can cause instance sessions to hang. | |||
|---|---|---|---|---|
| Product: | Red Hat OpenStack | Reporter: | John Trowbridge <jtrowbri> | |
| Component: | openstack-nova | Assignee: | Nikola Dipanov <ndipanov> | |
| Status: | CLOSED INSUFFICIENT_DATA | QA Contact: | nlevinki <nlevinki> | |
| Severity: | high | Docs Contact: | ||
| Priority: | high | |||
| Version: | 5.0 (RHEL 7) | CC: | benglish, berrange, dasmith, dgilbert, dmaley, eglynn, jtrowbri, kchamart, laine, ndipanov, pbrady, sbauza, sclewis, sferdjao, sgordon, vromanso, wdaniel, yeylon | |
| Target Milestone: | --- | Keywords: | ZStream | |
| Target Release: | 5.0 (RHEL 7) | |||
| Hardware: | All | |||
| OS: | Linux | |||
| Whiteboard: | ||||
| Fixed In Version: | Doc Type: | Bug Fix | ||
| Doc Text: | Story Points: | --- | ||
| Clone Of: | ||||
| : | 1179934 (view as bug list) | Environment: | ||
| Last Closed: | 2015-04-28 14:07:02 UTC | Type: | Bug | |
| Regression: | --- | Mount Type: | --- | |
| Documentation: | --- | CRM: | ||
| Verified Versions: | Category: | --- | ||
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | ||
| Cloudforms Team: | --- | Target Upstream Version: | ||
| Embargoed: | ||||
| Bug Depends On: | 1179934 | |||
| Bug Blocks: | ||||
|
Description
John Trowbridge
2014-11-25 15:26:52 UTC
I just extracted the tar file. Looking at the Neutron configs, the
setup seems to involve GRE tunneling with the Neutron Open vSwitch plugin
and RabbitMQ. The environment involves 3 Nova Compute nodes and a
Controller node.
Also, some more questions to help diagnose the issue:
- How was live migration performed? If via the Nova CLI, the
precise invocation would be useful to know.
- What kind of live migration? The environment involves shared
storage, correct?
- What is the exact libvirt error?
- Just to ensure I understand the issue correctly from your original
description: Live migration fails in this case only when you try to
migrate the Nova instance back to the Compute node where the
instance was *originally* launched, right?
If possible to perform the live-migration test again, some log files
that can be useful:
- Sanitized Nova debug logs (preferably contextual on the source and
destination hosts): compute.log, conductor.log, scheduler.log
- /var/log/libvirt/qemu/instance.log -- For both compute nodes involved.
This will give us the QEMU CLI that's being used by libvirt
If the test can be repeated, it'll be useful to enable debug logging
for libvirt on the Compute nodes involved, as libvirt is the
component that's doing the heavy lifting under the hood. These
logs give us the detailed interactions between libvirt and QEMU on the
source and destination Compute nodes. To enable:
1. In /etc/libvirt/libvirtd.conf, have these two config attributes:
. . .
log_level = 1
log_outputs="1:file:/var/tmp/libvirtd.log"
. . .
NOTE: Do *not* forget to turn off this debugging once the
test is finished -- otherwise it continues to log heavily and might
fill up the disk.
2. Restart libvirtd:
$ systemctl restart libvirtd
3. Repeat your test.
4. Capture the libvirt logs and attach them as plain text to the
bug.
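The steps above can be sketched as a short shell snippet. This is illustrative only: to avoid touching a live system it performs the edit on a scratch copy, whereas on a real Compute node you would edit /etc/libvirt/libvirtd.conf as root and then restart the daemon.

```shell
# Step 1 (illustrated on a scratch copy standing in for
# /etc/libvirt/libvirtd.conf): add the two debug attributes.
conf=$(mktemp)
cat >> "$conf" <<'EOF'
log_level = 1
log_outputs="1:file:/var/tmp/libvirtd.log"
EOF

# Confirm the attributes are present before restarting the daemon.
grep -q '^log_level = 1$' "$conf" && echo "debug logging configured"

# Step 2 (on the real host): systemctl restart libvirtd
# Steps 3-4: repeat the test, then attach /var/tmp/libvirtd.log
# to the bug as plain text.
```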
Created attachment 976809 [details]
compute node 46a, original node
Created attachment 976810 [details]
node 47a 2nd node in chain
So I have several more questions, since I was not able to fully reproduce the issue in my test environment:
1) Where did you run the SSH client? On one of the nodes involved in the deployment, or on a laptop/workstation?
2) What would be really useful is the output of 'neutron net-list' and 'neutron net-show' for all the networks, including the networks of the instance.
3) Finally, as a workaround, would it be possible to try tweaking ServerAliveInterval and ServerAliveCountMax (on the client), and ClientAliveInterval and ClientAliveCountMax (in the instance), and see if that helps?

Regarding 2) in comment #9 -- please see the following link for a handy way to provide all the interesting info on this:
https://kashyapc.fedorapeople.org/virt/openstack/debugging-neutron.txt

If this is just ssh connections that are in flight dropping, I wonder if this could be a case of:
https://bugzilla.redhat.com/show_bug.cgi?id=1081461
That bug is partially to do with the order in which network interfaces are taken 'up', so the interactions are a bit weird.

(In reply to Dr. David Alan Gilbert from comment #11)
> If this is just ssh connections that are in flight dropping, I wonder if
> this could be a case of:
>
> https://bugzilla.redhat.com/show_bug.cgi?id=1081461
>
> that is partially to do with the order of which network interfaces are taken
> 'up', so the interactions are a bit weird.

Not sure if people on the above bug will take note of the "Doc Text" field written by Laine Stump, so posting it here for convenience:
"In previous versions of libvirt, when migrating a guest that used macvtap ("direct") network interfaces, network connections to the guest would often be reset as soon as the migration started, potentially leaving the guest unreachable until migration was finished. This was due to the destination host beginning to send out packets with the guest's MAC address while the guest was still running on the source host.
libvirt now assures that the destination host keeps the guest's macvtap devices inactive until the guest has been stopped on the source host, thus eliminating any interruptions in guest connectivity."

To further aid in deciding whether Bug 1081461 is the source of the problem here, note that:
1) If the guest is not using macvtap, then it most definitely isn't the source of the problem (for other types of interfaces, only the guest itself will send packets with the MAC address used by the guest).
2) In the case described in that bug, libvirt did not log any errors (and none would be expected). So any errors in the log that seem to be related to the problem (the errors in Comment 5 appear to be unrelated to me) would be a vote against Bug 1081461 being the cause.

We did not hear back from the customer after suggesting they set the AliveInterval value to non-zero, and the support case has since closed. We'll re-open this bug if the customer re-opens the support ticket.
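As a concrete illustration of the keepalive workaround suggested in comment #9 (the interval and count values here are example settings, not tested recommendations), the client side would go in ~/.ssh/config or /etc/ssh/ssh_config:

```
Host *
    # Send a keepalive probe every 30 seconds; give up after 4 missed
    # replies, i.e. declare the connection dead after roughly
    # 30 * 4 = 120 seconds instead of hanging indefinitely.
    ServerAliveInterval 30
    ServerAliveCountMax 4
```

with the server-side counterparts inside the instance's /etc/ssh/sshd_config:

```
ClientAliveInterval 30
ClientAliveCountMax 4
```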