OSP 10 (newton) puddle: 2016-10-04.2

The test that is failing is our basic "pingtest." We have observed this in 2 successive runs with the same puddle, and reproduced it on a machine in isolation to facilitate debugging. Please contact weshay or myoung for access.

The (failing) test does the following:
1. Deploy an HA overcloud (undercloud, 3 controllers, 1 compute) in virt (using tripleo-quickstart).
2. Launch a basic stack template that launches a single VM/domain.
3. Attempt to ping the VM and get a response.

Observed behavior:
- The console log is empty for the VM created and launched by the heat stack.
- Unable to connect to the console via virsh.
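Step 3 above boils down to a retry-until-reachable loop. A minimal sketch of that pattern (the helper name, timings, and probe command are illustrative, not the actual test code; the real test pings the floating IP Heat assigned):

```shell
# Retry a probe command until it succeeds or attempts are exhausted.
# $1 = number of attempts, $2 = delay between attempts in seconds,
# remaining args = the command to retry.
wait_for() {
  local tries=$1 delay=$2 i
  shift 2
  for i in $(seq 1 "$tries"); do
    "$@" && return 0
    sleep "$delay"
  done
  return 1
}

# In CI this would look roughly like (VM_IP = whatever the stack assigned):
#   wait_for 30 5 ping -c1 -W2 "$VM_IP" || echo "pingtest FAILED"
```

In the failing runs, a loop like this exhausts its retries: the VM never answers.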
FYI.. from the compute node:

[heat-admin@overcloud-novacompute-0 ~]$ lsmod | grep kvm
kvm_intel             170181  0
kvm                   554609  1 kvm_intel
irqbypass              13503  1 kvm

from the virthost:

[root@localhost ~]# cat /etc/modprobe.d/kvm.conf
options kvm_intel nested=1
options kvm_amd nested=1
[root@localhost ~]# lsmod | grep kvm
kvm_intel             162153  18
kvm                   525409  1 kvm_intel
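Note that the modprobe option alone doesn't prove nested virt is live; the module parameter under /sys is what the running kernel actually honors. A small helper to check it (the function name is ours; the paths and Y/N vs 1/0 values follow standard kvm_intel/kvm_amd behavior):

```shell
# Return success iff the given nested-virt parameter file exists and is
# enabled. Typical paths: /sys/module/kvm_intel/parameters/nested (Intel)
# or /sys/module/kvm_amd/parameters/nested (AMD).
nested_enabled() {
  local v
  [ -r "$1" ] || return 1           # module not loaded on this host
  v=$(cat "$1")
  [ "$v" = "Y" ] || [ "$v" = "1" ]  # kvm_intel reports Y/N, kvm_amd 1/0
}

# e.g.: nested_enabled /sys/module/kvm_intel/parameters/nested && echo "nested on"
```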
We confirmed today the following (with debug help from dansmith):

1. The domain is booting --> BIOS, but the kernel is not booting (at least not far enough to init the serial port). This explains why there is no console log present. Here's a screenshot: http://imgur.com/a/bX79E
2. The VM launched by the heat stack is configured to boot from a cinder volume. The block device is created and is readable; the entire cirros image can be dd'd off it successfully. This ruled out a working hypothesis: that even though the volume was created and the block device present, the VM (after loading the BIOS) was attempting to read initial blocks from the volume and hanging on a read().
3. (later) Reproduces without a cinder volume at all, booting from an ephemeral disk. This confirms #2: it is not storage/cinder related.
4. A reset of the domain (power cycle) seems not to be responsive, or it's rebooting so quickly it doesn't register on the VGA console; we did not determine which. However, destroying the domain and restarting it yields the domain wedged in a similar fashion.
5. Reproduced on my own (myoung) hardware (again virt, HA).
https://review.gerrithub.io/#/c/297457/ -- in CI, to switch the compute from qemu to kvm
Whoops.. could still be a bug, apparently.
We've got initial confirmation that switching CI --> KVM resolves this issue, and are working to land patches and fully validate. Per discussion, changed the subject/focus of this particular issue to be QEMU specific. We still clearly have a bug here, but it's not blocking CI/automation, and this (nested virt + qemu) clearly isn't a recommended customer configuration. Dropping severity to medium to reflect this.
We just landed https://github.com/redhat-openstack/ansible-role-tripleo-overcloud/commit/fdaeeedb1cc54122eed5fe3adc82d86ab911f0dc which unblocks rhos-delivery tripleo-quickstart based CI by switching the overcloud node libvirt type --> kvm for RHEL.
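The effect of that commit on the compute node is to set nova's libvirt virt_type to kvm instead of qemu. A hedged sketch of the resulting config change (applied to a scratch copy here for illustration; the real role templates this into the deployed nova.conf):

```shell
# Demonstrate the qemu -> kvm flip on a throwaway copy of the [libvirt]
# section rather than a live nova.conf.
conf=$(mktemp)
printf '[libvirt]\nvirt_type = qemu\n' > "$conf"
sed -i 's/^virt_type = qemu$/virt_type = kvm/' "$conf"
grep '^virt_type' "$conf"    # prints: virt_type = kvm
```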
Sounds like you got this? Closing it out, because it doesn't sound like there's anything for Nova here.
Well... no... there's still a bug here, but it's no longer blocking CI. This configuration isn't a recommended config for nova, but the bug itself remains (perhaps deeper, in qemu/kvm)... yes?
(hit save too soon) Also, this reliably reproduces in HA (3 controllers), and reliably works with 1 controller. From your perspective, is this not a nova issue because qemu was being used?
This is a bug (it's just not an OpenStack Nova bug). Reassigning to the correct group.
Sorry, I am new to this; can you describe the setup in a little more detail? Also, can you please attach the host dmesg? And if possible, also the qemu command line on the host when this happens.
I am closing this due to lack of data; please reopen if you can provide further information as mentioned in comment 16.