Bug 1610196 - FFU - All VMS are paused after FFU
Summary: FFU - All VMS are paused after FFU
Keywords:
Status: CLOSED NOTABUG
Alias: None
Product: Red Hat OpenStack
Classification: Red Hat
Component: openstack-tripleo
Version: 13.0 (Queens)
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: high
Target Milestone: ---
Target Release: ---
Assignee: James Slagle
QA Contact: Arik Chernetsky
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2018-07-31 07:48 UTC by Yolanda Robla
Modified: 2018-07-31 12:33 UTC (History)

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2018-07-31 12:33:22 UTC
Target Upstream Version:
Embargoed:


Description Yolanda Robla 2018-07-31 07:48:33 UTC
Description of problem:

After performing a fast forward upgrade (FFU) from 10 to 13, we are hitting problems with both existing VMs (they are paused) and new VMs (they stay in the building state and then go into the error state).
When we log in to the computes that host them, we can see that they are in the paused state:

LANG=EN virsh list
setlocale: No such file or directory
 Id    Name                           State
----------------------------------------------------
 15    instance-0000002f              paused
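
To check many computes at once, the paused domains can be pulled out of the `virsh list` text. The following is a minimal sketch, not part of the original report: the function name `paused_domains` is hypothetical, and it assumes the three-column Id / Name / State layout shown above.

```python
from typing import List


def paused_domains(virsh_output: str) -> List[str]:
    """Return the names of domains whose state column reads 'paused'."""
    paused = []
    for line in virsh_output.splitlines():
        fields = line.split()
        # Data rows start with a numeric Id (or "-" for inactive domains);
        # this skips the header, the dashed rule and stray warnings.
        if len(fields) >= 3 and (fields[0].isdigit() or fields[0] == "-"):
            state = " ".join(fields[2:])  # states can be two words, e.g. "shut off"
            if state == "paused":
                paused.append(fields[1])
    return paused


# Sample text copied from the output in this report.
SAMPLE = """\
setlocale: No such file or directory
 Id    Name                           State
----------------------------------------------------
 15    instance-0000002f              paused
"""

if __name__ == "__main__":
    print(paused_domains(SAMPLE))
```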


The journal shows this libvirt error, which repeats whenever I try to create VMs:

jul 31 07:44:59 compute-0 dockerd-current[203282]: 2018-07-31 07:44:59.371+0000: 431434: error : qemuDomainObjBeginJobInternal:4721 : Timed out during operation: cannot acquire state change lock (held by remoteDispatchDomainCreateWithFlag
jul 31 07:44:59 compute-0 dockerd-current[203282]: libvirt: QEMU Driver error : Timed out during operation: cannot acquire state change lock (held by remoteDispatchDomainCreateWithFlags)
jul 31 07:44:59 compute-0 dockerd-current[203282]: libvirt: XML-RPC error : Cannot write data: Broken pipe

When I enter the container and read the log of the created VM, I can see:

()[root@compute-0 qemu]# cat instance-0000002f.log
2018-07-31 07:42:21.668+0000: starting up libvirt version: 3.9.0, package: 14.el7_5.6 (Red Hat, Inc. <http://bugzilla.redhat.com/bugzilla>, 2018-06-05-05:26:44, x86-041.build.eng.bos.redhat.com), qemu version: 2.10.0(qemu-kvm-rhev-2.10.0-21.el7_5.4), hostname: compute-0.localdomain
LC_ALL=C PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin HOME=/root QEMU_AUDIO_DRV=none /usr/libexec/qemu-kvm -name guest=instance-0000002f,debug-threads=on -S -object secret,id=masterKey0,format=raw,file=/var/lib/libvirt/qemu/domain-15-instance-0000002f/master-key.aes -machine pc-i440fx-rhel7.5.0,accel=kvm,usb=off,dump-guest-core=off -cpu Skylake-Client-IBRS,ss=on,hypervisor=on,tsc_adjust=on,stibp=on,pdpe1gb=on,mpx=off,xsavec=off,xgetbv1=off -m 64 -realtime mlock=off -smp 1,sockets=1,cores=1,threads=1 -uuid 8ead153a-0a02-4a6c-8dfe-6a388c8e8e9e -smbios 'type=1,manufacturer=Red Hat,product=OpenStack Compute,version=17.0.3-0.20180420001141.el7ost,serial=4c4c4544-004c-4310-8039-b1c04f535032,uuid=8ead153a-0a02-4a6c-8dfe-6a388c8e8e9e,family=Virtual Machine' -no-user-config -nodefaults -chardev socket,id=charmonitor,path=/var/lib/libvirt/qemu/domain-15-instance-0000002f/monitor.sock,server,nowait -mon chardev=charmonitor,id=monitor,mode=control -rtc base=utc,driftfix=slew -global kvm-pit.lost_tick_policy=delay -no-hpet -no-shutdown -boot strict=on -device piix3-usb-uhci,id=usb,bus=pci.0,addr=0x1.0x2 -drive file=/var/lib/nova/instances/8ead153a-0a02-4a6c-8dfe-6a388c8e8e9e/disk,format=qcow2,if=none,id=drive-virtio-disk0,cache=none -device virtio-blk-pci,scsi=off,bus=pci.0,addr=0x4,drive=drive-virtio-disk0,id=virtio-disk0,bootindex=1 -chardev socket,id=charnet0,path=/var/lib/vhost_sockets/vhu9f3a18d4-59,server -netdev vhost-user,chardev=charnet0,id=hostnet0 -device virtio-net-pci,netdev=hostnet0,id=net0,mac=fa:16:3e:bf:49:60,bus=pci.0,addr=0x3 -add-fd set=0,fd=28 -chardev pty,id=charserial0,logfile=/dev/fdset/0,logappend=on -device isa-serial,chardev=charserial0,id=serial0 -device usb-tablet,id=input0,bus=usb.0,port=1 -vnc 10.10.125.111:0 -k en-us -device cirrus-vga,id=video0,bus=pci.0,addr=0x2 -device virtio-balloon-pci,id=balloon0,bus=pci.0,addr=0x5 -msg timestamp=on
2018-07-31T07:42:22.122739Z qemu-kvm: -chardev socket,id=charnet0,path=/var/lib/vhost_sockets/vhu9f3a18d4-59,server: info: QEMU waiting for connection on: disconnected:unix:/var/lib/vhost_sockets/vhu9f3a18d4-59,server
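
The last log line is the useful one: QEMU is still waiting for a connection on the vhost-user socket. A small sketch for spotting that condition in a domain log follows. The function name `waiting_vhost_sockets` and the regular expression are assumptions based only on the message format seen in this log, not on any documented QEMU log format.

```python
import re
from typing import List

# Matches e.g. "QEMU waiting for connection on: disconnected:unix:/path,server"
# and captures the socket path up to the next comma or whitespace.
WAIT_RE = re.compile(
    r"QEMU waiting for connection on:\s*(?:\w+:)?unix:([^,\s]+)"
)


def waiting_vhost_sockets(log_text: str) -> List[str]:
    """Return the socket paths QEMU reported it is waiting on."""
    return WAIT_RE.findall(log_text)


if __name__ == "__main__":
    line = (
        "2018-07-31T07:42:22.122739Z qemu-kvm: -chardev socket,id=charnet0,"
        "path=/var/lib/vhost_sockets/vhu9f3a18d4-59,server: info: QEMU waiting "
        "for connection on: disconnected:unix:/var/lib/vhost_sockets/"
        "vhu9f3a18d4-59,server"
    )
    print(waiting_vhost_sockets(line))
```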

Comment 4 Lee Yarwood 2018-07-31 09:15:53 UTC
Hey Yolanda, 

As discussed, this appears to be the result of the ovs-dpdk-permissions.yaml environment [1] not being used when updating the stack ahead of the FFU run. That leaves the /var/lib/vhost_sockets/ directory inaccessible, which in turn leaves the QEMU processes paused.
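
A quick way to confirm the directory state on a compute is to inspect its ownership and mode. This is only an illustrative sketch: the helper name `describe_dir` is hypothetical, and which uid, gid and mode are correct depends on the deployment (the group is set through VhostuserSocketGroup), so the script only reports what it finds.

```python
import os
import stat


def describe_dir(path: str) -> dict:
    """Report the ownership and permission bits of a directory."""
    st = os.stat(path)
    return {
        "is_dir": stat.S_ISDIR(st.st_mode),
        "mode": oct(stat.S_IMODE(st.st_mode)),  # e.g. "0o750"
        "uid": st.st_uid,
        "gid": st.st_gid,
        # The group needs the execute bit to reach sockets inside the directory.
        "group_can_traverse": bool(st.st_mode & stat.S_IXGRP),
    }


if __name__ == "__main__":
    target = "/var/lib/vhost_sockets"  # path taken from the QEMU log in this bug
    if os.path.isdir(target):
        print(describe_dir(target))
    else:
        print(f"{target} does not exist on this host")
```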

This was seen and documented in the following bug:

[OSP13]Boot guest with vhostuser port,QEMU waiting for connection on: disconnected:unix:/var/lib/vhost_sockets/vhu73c7fe09-7d
https://bugzilla.redhat.com/show_bug.cgi?id=1583447

Really we need this in the FFU upgrade docs as well.

Leaving a needinfo against you while you rerun your tests to confirm this.

[1] https://github.com/openstack/tripleo-heat-templates/blob/master/environments/ovs-dpdk-permissions.yaml

Comment 5 Yolanda Robla 2018-07-31 11:12:18 UTC
So the problem was an incorrect mapping of the VhostuserSocketGroup parameter. The template mapped it to the ComputeOvsDpdk role, but that role was not in use. We were using the regular Compute role, so it needed to be:

parameter_defaults:
  ComputeParameters:
    VhostuserSocketGroup: "hugetlbfs"
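
This kind of mismatch can be caught before deploying. The sketch below assumes that role-specific parameters live under a `<RoleName>Parameters` key in `parameter_defaults`, as in the snippet above. The function name `roles_missing_param` is hypothetical, the environment is passed as an already-parsed dictionary, and parameters set directly at the top level of `parameter_defaults` are deliberately ignored.

```python
from typing import Dict, List


def roles_missing_param(env: Dict, roles: List[str], param: str) -> List[str]:
    """Return the deployed roles that do not get `param` from a role-specific block."""
    defaults = env.get("parameter_defaults", {})
    missing = []
    for role in roles:
        # Assumed convention: role "Compute" reads "ComputeParameters".
        role_params = defaults.get(f"{role}Parameters", {})
        if param not in role_params:
            missing.append(role)
    return missing


# The mapping as it was: keyed to a role that was not deployed.
broken_env = {
    "parameter_defaults": {
        "ComputeOvsDpdkParameters": {"VhostuserSocketGroup": "hugetlbfs"}
    }
}
# The mapping after the fix described in this comment.
fixed_env = {
    "parameter_defaults": {
        "ComputeParameters": {"VhostuserSocketGroup": "hugetlbfs"}
    }
}

if __name__ == "__main__":
    print(roles_missing_param(broken_env, ["Compute"], "VhostuserSocketGroup"))
    print(roles_missing_param(fixed_env, ["Compute"], "VhostuserSocketGroup"))
```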

