OSP 10 (newton) puddle: 2016-10-04.2

The test that is failing is our basic "pingtest." We have observed this in 2 successive runs with the same puddle, and reproduced it on a machine in isolation to facilitate debugging. Please contact weshay or myoung for access.

The (failing) test does the following:
1. Deploy an HA overcloud (undercloud, 3 controllers, 1 compute) in virt (using tripleo-quickstart).
2. Launch a basic stack template that launches a single VM/domain.
3. Attempt to ping the VM and get a response.

Observed behavior:
- The console log is empty for the VM created and launched by the heat stack.
- Unable to connect to the console via virsh.
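Step 3 above boils down to a retry-until-reachable loop. A minimal sketch of that pattern (the helper name, timings, and probe command are illustrative, not the actual test code; the real test pings the floating IP Heat assigned):

```shell
# Retry a probe command until it succeeds or attempts are exhausted.
# $1 = number of attempts, $2 = delay between attempts in seconds,
# remaining args = the command to retry.
wait_for() {
  local tries=$1 delay=$2 i
  shift 2
  for i in $(seq 1 "$tries"); do
    "$@" && return 0
    sleep "$delay"
  done
  return 1
}

# In CI this would look roughly like (VM_IP = whatever the stack assigned):
#   wait_for 30 5 ping -c1 -W2 "$VM_IP" || echo "pingtest FAILED"
```

In the failing runs, a loop like this exhausts its retries: the VM never answers.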
FYI.. from the compute node:

[heat-admin@overcloud-novacompute-0 ~]$ lsmod | grep kvm
kvm_intel             170181  0
kvm                   554609  1 kvm_intel
irqbypass              13503  1 kvm

from the virthost:

[root@localhost ~]# cat /etc/modprobe.d/kvm.conf
options kvm_intel nested=1
options kvm_amd nested=1
[root@localhost ~]# lsmod | grep kvm
kvm_intel             162153  18
kvm                   525409  1 kvm_intel
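Note that the modprobe option alone doesn't prove nested virt is live; the module parameter under /sys is what the running kernel actually honors. A small helper to check it (the function name is ours; the paths and Y/N vs 1/0 values follow standard kvm_intel/kvm_amd behavior):

```shell
# Return success iff the given nested-virt parameter file exists and is
# enabled. Typical paths: /sys/module/kvm_intel/parameters/nested (Intel)
# or /sys/module/kvm_amd/parameters/nested (AMD).
nested_enabled() {
  local v
  [ -r "$1" ] || return 1           # module not loaded on this host
  v=$(cat "$1")
  [ "$v" = "Y" ] || [ "$v" = "1" ]  # kvm_intel reports Y/N, kvm_amd 1/0
}

# e.g.: nested_enabled /sys/module/kvm_intel/parameters/nested && echo "nested on"
```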
We confirmed today the following (with debug help from dansmith):

1. The domain is booting --> BIOS, but the kernel is not booting (at least not far enough to init the serial port). This explains why there is no console log present. Here's a screenshot: http://imgur.com/a/bX79E
2. The VM launched by the heat stack is configured to boot from a cinder volume. The block device is created and is readable; the entire cirros image can be dd'd off it successfully. This ruled out a working hypothesis: that even though the volume was created and the block device present, the VM (after loading the BIOS) was attempting to read initial blocks from the volume and hanging on a read().
3. (later) Reproduces without a cinder volume at all, booting from an ephemeral disk. This confirms #2: it is not storage/cinder related.
4. A reset of the domain (power cycle) seems not to be responsive, or it's rebooting so quickly it doesn't register on the VGA console; we did not determine which. However, destroying the domain and restarting it yields the domain wedged in a similar fashion.
5. Reproduced on my own (myoung) hardware (again virt, HA).
https://review.gerrithub.io/#/c/297457/ -- in CI, to switch the compute from qemu to kvm
Whoops.. could still be a bug, apparently.
We've got initial confirmation that switching CI --> KVM resolves this issue, and are working to land patches and fully validate. Per discussion, changed the subject/focus of this particular issue to be QEMU specific. We still clearly have a bug here, but it's not blocking CI/automation, and this (nested virt + qemu) clearly isn't a recommended customer configuration. Dropping severity to medium to reflect this.
We just landed https://github.com/redhat-openstack/ansible-role-tripleo-overcloud/commit/fdaeeedb1cc54122eed5fe3adc82d86ab911f0dc which unblocks rhos-delivery tripleo-quickstart based CI by switching the overcloud node libvirt type --> kvm for RHEL.
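The effect of that commit on the compute node is to set nova's libvirt virt_type to kvm instead of qemu. A hedged sketch of the resulting config change (applied to a scratch copy here for illustration; the real role templates this into the deployed nova.conf):

```shell
# Demonstrate the qemu -> kvm flip on a throwaway copy of the [libvirt]
# section rather than a live nova.conf.
conf=$(mktemp)
printf '[libvirt]\nvirt_type = qemu\n' > "$conf"
sed -i 's/^virt_type = qemu$/virt_type = kvm/' "$conf"
grep '^virt_type' "$conf"    # prints: virt_type = kvm
```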
Sounds like you got this? Closing it out, because it doesn't sound like there's anything for Nova here.
Well... no... there's still a bug here, but it's no longer blocking CI. This configuration isn't a recommended config for nova, but the bug itself remains (perhaps deeper, in qemu/kvm)... yes?
(hit save too soon) Also, this reliably reproduces in HA (3 controllers), and reliably works with 1 controller. From your perspective, is this not a nova issue because qemu was being used?
This is a bug (it's just not an OpenStack Nova bug). Reassigning to the correct group.
Sorry, I am new to this; can you describe the setup in a little more detail? Also, can you please attach the host dmesg? And if possible, also the qemu command line on the host when this happens.
I am closing this due to lack of data; please reopen if you can provide further information as mentioned in comment 16.