Bug 1382389 - domain not booting fully when launched via HA configuration w/ qemu [NEEDINFO]
Summary: domain not booting fully when launched via HA configuration w/ qemu
Keywords:
Status: CLOSED INSUFFICIENT_DATA
Alias: None
Product: Red Hat Enterprise Linux 7
Classification: Red Hat
Component: qemu-kvm-rhev
Version: 7.3
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: medium
Target Milestone: pre-dev-freeze
Target Release: 7.3
Assignee: Bandan Das
QA Contact: Virtualization Bugs
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2016-10-06 14:28 UTC by Matt Young
Modified: 2017-01-17 03:36 UTC
CC List: 18 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2017-01-17 03:36:33 UTC
Target Upstream Version:
bdas: needinfo? (matyoung)


Description Matt Young 2016-10-06 14:28:15 UTC
OSP 10 (newton) puddle: 
- 2016-10-04.2


The failing test is our basic "pingtest."
We have observed the failure in two successive runs with the same puddle, and
reproduced it on a machine in isolation to facilitate debugging.  Please contact
weshay or myoung for access.  The (failing) test does the following:

1. Deploy an HA overcloud (undercloud, 3 controllers, 1 compute) in virt (using tripleo-quickstart).
2. Launch a basic heat stack template that creates a single VM/domain.
3. Attempt to ping the VM and get a response.

Observed behavior:

- the console log is empty for the VM created and launched by the heat stack
- unable to connect to console via virsh

Comment 3 wes hayutin 2016-10-06 20:54:04 UTC
FYI.. 
from the compute node:

[heat-admin@overcloud-novacompute-0 ~]$ lsmod | grep kvm
kvm_intel             170181  0 
kvm                   554609  1 kvm_intel
irqbypass              13503  1 kvm

from the virthost:
[root@localhost ~]# cat /etc/modprobe.d/kvm.conf 
options kvm_intel nested=1 
options kvm_amd nested=1 
[root@localhost ~]# lsmod |grep kvm
kvm_intel             162153  18 
kvm                   525409  1 kvm_intel
[root@localhost ~]#
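
Note for anyone reproducing this: the option in kvm.conf only takes effect when the module is loaded, so it may be worth confirming the runtime value as well. A minimal sketch, assuming the usual sysfs location for the kvm_intel parameter; the helper name nested_enabled is hypothetical and not part of any tool mentioned in this bug:

```shell
#!/bin/sh
# Hypothetical helper: report whether nested virtualization is enabled.
# Argument: path to the module's "nested" parameter file.
# The default below is an assumption for Intel hosts; on AMD hosts the
# equivalent file lives under the kvm_amd module directory.
nested_enabled() {
    param_file="${1:-/sys/module/kvm_intel/parameters/nested}"
    # Return 2 if the module is not loaded or the path does not exist.
    [ -r "$param_file" ] || return 2
    case "$(cat "$param_file")" in
        Y|1) return 0 ;;   # kernels report the flag as either "Y" or "1"
        *)   return 1 ;;
    esac
}
```

On the virthost this could be used as `nested_enabled && echo "nested virt is on"`.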

Comment 5 Matt Young 2016-10-06 21:37:00 UTC
We confirmed today the following (with debug help from dansmith):

1. The domain is booting --> BIOS, but the kernel is not booting (at least not far enough to init the serial port).  This explains why there is no console log present.  Here's a screenshot: http://imgur.com/a/bX79E

2. The VM launched by the heat stack is configured to boot from a cinder volume.  The block device is created and is readable; the entire cirros image can be dd'd successfully.  This ruled out a working hypothesis: that even though the volume is created and the block device is present, the VM (after loading the BIOS) was attempting to read the initial blocks from the volume and hanging on a read().

3. (later) Reproduces without a cinder volume at all, booting from an ephemeral disk.  This confirms #2: the problem is not storage/cinder related.

4. A reset of the domain (power cycle) either is not responsive, or the domain reboots so quickly that it does not register on the VGA console.  We did not determine which.  However, destroying the domain and restarting it leaves the domain wedged in a similar fashion.

5. Have reproduced on my own (myoung) hardware (again virt, HA).

Comment 6 wes hayutin 2016-10-07 10:57:43 UTC
https://review.gerrithub.io/#/c/297457/ in ci to switch the compute to kvm from qemu

Comment 7 wes hayutin 2016-10-07 12:06:45 UTC
whooops.. could be a bug still apparently

Comment 8 Matt Young 2016-10-07 15:29:32 UTC
We've got initial confirmation that switching CI --> KVM resolves this issue, and are working to land patches and fully validate.

Per discussion, changed subject / focus of this particular issue to be QEMU specific.  We still clearly have a bug here, but it's not blocking CI/automation, and this (nested virt + qemu) clearly isn't a recommended customer configuration.  Dropping severity to medium to reflect this.

Comment 10 Matt Young 2016-10-14 14:18:43 UTC
We just landed

https://github.com/redhat-openstack/ansible-role-tripleo-overcloud/commit/fdaeeedb1cc54122eed5fe3adc82d86ab911f0dc

which unblocks rhos-delivery tripleo-quickstart based CI by switching the overcloud node libvirt type --> kvm for RHEL.
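
To confirm which mode a given compute node is actually running in, one option is to read the value out of the nova configuration. This is a sketch only: it assumes the change surfaces as nova's virt_type option in the [libvirt] section of /etc/nova/nova.conf on the compute node, which has not been verified against the commit above. The helper name nova_virt_type is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: print the virt_type value from the [libvirt] section
# of a nova.conf-style ini file. The default path is an assumption.
nova_virt_type() {
    conf="${1:-/etc/nova/nova.conf}"
    awk -F= '
        # Track whether the current section header is [libvirt].
        /^[ \t]*\[/ { in_libvirt = ($0 ~ /^[ \t]*\[libvirt\]/); next }
        # Inside [libvirt], match an uncommented virt_type key.
        in_libvirt && $1 ~ /^[ \t]*virt_type[ \t]*$/ {
            gsub(/[ \t]/, "", $2)
            print $2
        }
    ' "$conf"
}
```

Commented-out keys and virt_type entries in other sections are ignored, so the output reflects only the setting nova would read from that section.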

Comment 11 Matthew Booth 2016-10-14 14:51:10 UTC
Sounds like you got this? Closing it out, because it doesn't sound like there's anything for Nova here.

Comment 12 Matt Young 2016-10-14 18:59:08 UTC
Well...no...there's still a bug here, but it's not something blocking CI any more.  This configuration isn't a recommended config for nova, but it seems like there's still a bug here (perhaps deeper, qemu/kvm)...yes?

Comment 13 Matt Young 2016-10-14 19:07:06 UTC
(hit save too soon)

Also, this reliably reproduces in HA (3 controller), and reliably works with 1 controller.  Is this from your perspective not a nova issue because qemu was being used?

Comment 14 Matt Young 2016-10-18 16:45:20 UTC
This is a bug (not in OpenStack nova), but it is still a bug.  Reassigning to the correct group.

Comment 16 Bandan Das 2016-11-08 19:07:56 UTC
Sorry, I am new to this. Can you describe the setup in a little more detail?

Also, can you please attach the host dmesg? And, if possible, the qemu command line on the host when this happens.
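
One possible way to gather both pieces of data on the host where the wedged domain runs is sketched below. The helper name qemu_cmdlines and the output file names are hypothetical, and the binary name pattern is an assumption (qemu-kvm or qemu-system-*). libvirt also records the command line it used in the per-domain log under /var/log/libvirt/qemu/.

```shell
#!/bin/sh
# Hypothetical helper: keep only the lines that are qemu command lines.
# It reads "ps -eo args"-style lines on stdin, so it can also be run
# against a previously captured process listing.
qemu_cmdlines() {
    grep -E '(^|/)qemu-(kvm|system-[a-z0-9_]+)( |$)'
}

# Typical use on the host while the domain is wedged
# (output file names are arbitrary):
#   dmesg > host-dmesg.txt
#   ps -eo args | qemu_cmdlines > qemu-cmdline.txt
```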

Comment 17 Bandan Das 2017-01-17 03:36:33 UTC
I am closing this due to lack of data. Please reopen if you can provide the further information requested in comment 16.

