Bug 1903134 - Instance remains in powering-on state forever after compute reboots
Summary: Instance remains in powering-on state forever after compute reboots
Keywords:
Status: CLOSED DUPLICATE of bug 1890895
Alias: None
Product: Red Hat OpenStack
Classification: Red Hat
Component: python-networking-ovn
Version: 16.1 (Train)
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: high
Target Milestone: ---
Target Release: ---
Assignee: RHOS Maint
QA Contact: Eran Kuris
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2020-12-01 12:18 UTC by Eduardo Olivares
Modified: 2020-12-01 14:15 UTC (History)
CC List: 6 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2020-12-01 14:15:00 UTC
Target Upstream Version:
Embargoed:



Description Eduardo Olivares 2020-12-01 12:18:32 UTC
Description of problem:
Issue reproduced by the tobiko test test_reboot_computes_recovery:
https://rhos-ci-jenkins.lab.eng.tlv2.redhat.com/view/DFG/view/network/view/networking-ovn/job/DFG-network-networking-ovn-16.1_director-rhel-virthost-3cont_2comp-ipv4-geneve-tobiko-neutron/43/testReport/tobiko.tests.faults.ha.test_cloud_recovery/DisruptTripleoNodesTest/Tobiko___test_reboot_computes_recovery/


compute-0 is hard rebooted:
2020-11-29 12:02:03,431 INFO tobiko.tests.faults.ha.cloud_disruptions | reboot exec:  sudo chmod o+w /proc/sysrq-trigger;sudo echo b > /proc/sysrq-trigger on server: compute-0

All instances are in status SHUTOFF at 12:03:57

Request to start instance 0cf82ec3-efd1-4f6e-baf4-8f04efe90925 is sent via nova API at 12:04:07:
2020-11-29 12:04:07,156 DEBUG tobiko.openstack.nova._client | Waiting for server 0cf82ec3-efd1-4f6e-baf4-8f04efe90925 status to get from SHUTOFF to ACTIVE (progress=None%)


The instance status is still not ACTIVE 5 minutes later:
2020-11-29 12:09:03,483 DEBUG tobiko.openstack.nova._client | Waiting for server 0cf82ec3-efd1-4f6e-baf4-8f04efe90925 status to get from SHUTOFF to ACTIVE (progress=None%)


The subsequent test cases fail because that instance never reaches status ACTIVE (the last checks fail at 12:14:00).
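
For reference, a minimal python-novaclient sketch of the start-and-wait flow described above (this is not the tobiko code; the auth endpoint, credentials and polling interval are assumptions):

import time

from keystoneauth1 import session
from keystoneauth1.identity import v3
from novaclient import client as nova_client

# Assumed credentials/endpoint for illustration; in practice they come from
# the cloud's OS_* environment variables or clouds.yaml.
auth = v3.Password(auth_url='http://keystone.example:5000/v3',
                   username='admin', password='secret', project_name='admin',
                   user_domain_name='Default', project_domain_name='Default')
nova = nova_client.Client('2.1', session=session.Session(auth=auth))

server_id = '0cf82ec3-efd1-4f6e-baf4-8f04efe90925'
nova.servers.start(server_id)      # POST /servers/{id}/action (os-start)

deadline = time.time() + 300       # the test gives up after roughly 5 minutes
while time.time() < deadline:
    server = nova.servers.get(server_id)
    if server.status == 'ACTIVE':
        break
    time.sleep(10)
else:
    raise TimeoutError('server %s stuck in status %s' % (server_id, server.status))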



Nova compute logs:
http://rhos-ci-logs.lab.eng.tlv2.redhat.com/logs/rcj/DFG-network-networking-ovn-16.1_director-rhel-virthost-3cont_2comp-ipv4-geneve-tobiko-neutron/43/compute-0/var/log/containers/nova/nova-compute.log.gz
2020-11-29 12:04:09.452 8 DEBUG nova.compute.manager [req-557d9ed5-94f3-466e-9833-a83f1d9a4f96 f653062052794d108ffdf4add40a89d9 9869705add1e4c939e9f055a39d7950b - default default] [instance: 0cf82ec3-efd1-4f6e-baf4-8f04efe90925] No waiting events found dispatching network-vif-plugged-8d6b66ed-587b-4d2b-bd6a-7b02269e1ecd pop_instance_event /usr/lib/python3.6/site-packages/nova/compute/manager.py:361
2020-11-29 12:04:09.452 8 WARNING nova.compute.manager [req-557d9ed5-94f3-466e-9833-a83f1d9a4f96 f653062052794d108ffdf4add40a89d9 9869705add1e4c939e9f055a39d7950b - default default] [instance: 0cf82ec3-efd1-4f6e-baf4-8f04efe90925] Received unexpected event network-vif-plugged-8d6b66ed-587b-4d2b-bd6a-7b02269e1ecd for instance with vm_state stopped and task_state powering-on.
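
These two messages indicate that Nova received the network-vif-plugged event from Neutron (matching the OVN port claim below) before any waiter was registered for it. For context, a hedged sketch of how Neutron's Nova notifier delivers such an event through the os-server-external-events API (simplified; the real notifier batches events and authenticates as the neutron service user):

from keystoneauth1 import session
from keystoneauth1.identity import v3
from novaclient import client as nova_client

# Assumed endpoint/credentials for illustration only.
auth = v3.Password(auth_url='http://keystone.example:5000/v3',
                   username='neutron', password='secret', project_name='service',
                   user_domain_name='Default', project_domain_name='Default')
nova = nova_client.Client('2.1', session=session.Session(auth=auth))

# Tell Nova that the VIF for this port has been plugged on the host.
nova.server_external_events.create([{
    'server_uuid': '0cf82ec3-efd1-4f6e-baf4-8f04efe90925',
    'name': 'network-vif-plugged',
    'status': 'completed',
    'tag': '8d6b66ed-587b-4d2b-bd6a-7b02269e1ecd',  # the Neutron port id
}])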


OVN controller logs:
http://rhos-ci-logs.lab.eng.tlv2.redhat.com/logs/rcj/DFG-network-networking-ovn-16.1_director-rhel-virthost-3cont_2comp-ipv4-geneve-tobiko-neutron/43/compute-0/var/log/containers/openvswitch/ovn-controller.log.gz
2020-11-29T12:02:48.441Z|00013|binding|INFO|Releasing lport 8d6b66ed-587b-4d2b-bd6a-7b02269e1ecd from this chassis.
2020-11-29T12:02:48.442Z|00014|binding|INFO|Releasing lport 741e9af3-ee2b-42a7-9190-a216cb8f7d24 from this chassis.
2020-11-29T12:02:48.442Z|00015|binding|INFO|Releasing lport 9d6b0a2d-2bff-4059-8323-4019a554abc9 from this chassis.
2020-11-29T12:04:09.104Z|00016|binding|INFO|Claiming lport 8d6b66ed-587b-4d2b-bd6a-7b02269e1ecd for this chassis.
2020-11-29T12:04:09.104Z|00017|binding|INFO|8d6b66ed-587b-4d2b-bd6a-7b02269e1ecd: Claiming fa:16:3e:c3:ab:7f 10.100.114.175 2001:db8:0:72b0:f816:3eff:fec3:ab7f
2020-11-29T12:04:09.104Z|00018|binding|INFO|8d6b66ed-587b-4d2b-bd6a-7b02269e1ecd: Claiming unknown



Can this bug be related to https://bugzilla.redhat.com/show_bug.cgi?id=1890895?
Both are reproduced by test_reboot_computes_recovery, but BZ1890895 occurs more often.




Version-Release number of selected component (if applicable):
RHOS-16.1-RHEL-8-20201110.n.1


How reproducible:
Infrequently (I have only seen it once)

Steps to Reproduce:
1. run tobiko test test_reboot_computes_recovery

Comment 1 Lucas Alvares Gomes 2020-12-01 13:52:54 UTC
Hi,

I quickly skimmed the logs from Neutron and Nova compute but I don't see anything specific that I think would cause it to fail.

However, I believe the error occurred in QEMU when restarting the VM.

Here is the creation of the VM from the nova logs [0]:

2020-11-29 10:38:39.943 7 DEBUG nova.virt.libvirt.driver [req-1ac3db66-07c2-4882-b177-a56a14b384c6 e678d845e06f46a7b588ccd7cc732eb9 48ce485e440c4ca0b10a4b4b1c4f24fb - default default] [instance: 0cf82ec3-efd1-4f6e-baf4-8f04efe90925] End _get_guest_xml xml=<domain type="kvm">
  <uuid>0cf82ec3-efd1-4f6e-baf4-8f04efe90925</uuid>
  <name>instance-0000019a</name>
...

So the instance is "instance-0000019a".
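
For what it's worth, the same UUID-to-domain-name mapping can be done directly on the compute node with libvirt-python; a small sketch, assuming access to libvirt on the host (e.g. from inside the nova_libvirt container):

import libvirt

# Connect to the local hypervisor and look the domain up by its Nova UUID.
conn = libvirt.open('qemu:///system')
dom = conn.lookupByUUIDString('0cf82ec3-efd1-4f6e-baf4-8f04efe90925')
print(dom.name())   # -> instance-0000019a; its QEMU log is then
                    #    /var/log/libvirt/qemu/<name>.log
conn.close()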

When I look at the libvirt logs for that particular instance, I see that the VM failed to start [1] with:

2020-11-29T12:04:09.254297Z qemu-kvm: -device cirrus-vga,id=video0,bus=pci.0,addr=0x2: warning: 'cirrus-vga' is deprecated, please use a different VGA card instead
KVM: entry failed, hardware error 0x80000021

If you're running a guest on an Intel machine without unrestricted mode
support, the failure can be most likely due to the guest entering an invalid
state for Intel VT. For example, the guest maybe running in big real mode
which is not supported on less recent Intel processors.

EAX=00000000 EBX=00000000 ECX=00000000 EDX=000006d3
ESI=00000000 EDI=00000000 EBP=00000000 ESP=00000000
EIP=0000fff0 EFL=00000002 [-------] CPL=0 II=0 A20=1 SMM=0 HLT=0
ES =0000 00000000 00000000 00008000
CS =0000 00000000 00000000 00009b00
SS =0000 00000000 00000000 00009300
DS =0000 00000000 00000000 00008000
FS =0000 00000000 00000000 00008000
GS =0000 00000000 00000000 00008000
LDT=0000 00000000 00000000 00008000
TR =0000 00000000 00000000 00008000
GDT=     00000000 00000000
IDT=     00000000 00000000
CR0=60000010 CR2=00000000 CR3=00000000 CR4=00000000
DR0=0000000000000000 DR1=0000000000000000 DR2=0000000000000000 DR3=0000000000000000 
DR6=00000000ffff0ff0 DR7=0000000000000400
EFER=0000000000000000
Code=00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 <00> 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00


So maybe the problem is that the VM itself never started due to the error above and therefore never moved to ACTIVE.

I don't know exactly what the error above means; maybe someone more familiar with libvirt/QEMU can give us some clues.

[0] http://rhos-ci-logs.lab.eng.tlv2.redhat.com/logs/rcj/DFG-network-networking-ovn-16.1_director-rhel-virthost-3cont_2comp-ipv4-geneve-tobiko-neutron/43/compute-0/var/log/containers/nova/nova-compute.log.1.gz
[1] http://rhos-ci-logs.lab.eng.tlv2.redhat.com/logs/rcj/DFG-network-networking-ovn-16.1_director-rhel-virthost-3cont_2comp-ipv4-geneve-tobiko-neutron/43/compute-0/var/log/libvirt/qemu/instance-0000019a.log.gz

Comment 2 Jakub Libosvar 2020-12-01 14:15:00 UTC

*** This bug has been marked as a duplicate of bug 1890895 ***

