Description of problem:

After performing a fast forward upgrade (FFU) from 10 to 13, we are hitting problems with both existing VMs (they are paused) and new VMs (they remain in the building state, then go into the error state). When we log in to the computes that host them, we can see that they are in the paused state:

    LANG=EN virsh list
    setlocale: No such file or directory
     Id    Name                           State
    ----------------------------------------------------
     15    instance-0000002f              paused

The journal shows this error from libvirt, which repeats when I try to create VMs:

    jul 31 07:44:59 compute-0 dockerd-current[203282]: 2018-07-31 07:44:59.371+0000: 431434: error : qemuDomainObjBeginJobInternal:4721 : Timed out during operation: cannot acquire state change lock (held by remoteDispatchDomainCreateWithFlag
    jul 31 07:44:59 compute-0 dockerd-current[203282]: libvirt: QEMU Driver error : Timed out during operation: cannot acquire state change lock (held by remoteDispatchDomainCreateWithFlags)
    jul 31 07:44:59 compute-0 dockerd-current[203282]: libvirt: XML-RPC error : Cannot write data: Broken pipe

When I access the container and read the log of the created VM, I see:

    ()[root@compute-0 qemu]# cat instance-0000002f.log
    2018-07-31 07:42:21.668+0000: starting up libvirt version: 3.9.0, package: 14.el7_5.6 (Red Hat, Inc. <http://bugzilla.redhat.com/bugzilla>, 2018-06-05-05:26:44, x86-041.build.eng.bos.redhat.com), qemu version: 2.10.0(qemu-kvm-rhev-2.10.0-21.el7_5.4), hostname: compute-0.localdomain
    LC_ALL=C PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin HOME=/root QEMU_AUDIO_DRV=none /usr/libexec/qemu-kvm -name guest=instance-0000002f,debug-threads=on -S -object secret,id=masterKey0,format=raw,file=/var/lib/libvirt/qemu/domain-15-instance-0000002f/master-key.aes -machine pc-i440fx-rhel7.5.0,accel=kvm,usb=off,dump-guest-core=off -cpu Skylake-Client-IBRS,ss=on,hypervisor=on,tsc_adjust=on,stibp=on,pdpe1gb=on,mpx=off,xsavec=off,xgetbv1=off -m 64 -realtime mlock=off -smp 1,sockets=1,cores=1,threads=1 -uuid 8ead153a-0a02-4a6c-8dfe-6a388c8e8e9e -smbios 'type=1,manufacturer=Red Hat,product=OpenStack Compute,version=17.0.3-0.20180420001141.el7ost,serial=4c4c4544-004c-4310-8039-b1c04f535032,uuid=8ead153a-0a02-4a6c-8dfe-6a388c8e8e9e,family=Virtual Machine' -no-user-config -nodefaults -chardev socket,id=charmonitor,path=/var/lib/libvirt/qemu/domain-15-instance-0000002f/monitor.sock,server,nowait -mon chardev=charmonitor,id=monitor,mode=control -rtc base=utc,driftfix=slew -global kvm-pit.lost_tick_policy=delay -no-hpet -no-shutdown -boot strict=on -device piix3-usb-uhci,id=usb,bus=pci.0,addr=0x1.0x2 -drive file=/var/lib/nova/instances/8ead153a-0a02-4a6c-8dfe-6a388c8e8e9e/disk,format=qcow2,if=none,id=drive-virtio-disk0,cache=none -device virtio-blk-pci,scsi=off,bus=pci.0,addr=0x4,drive=drive-virtio-disk0,id=virtio-disk0,bootindex=1 -chardev socket,id=charnet0,path=/var/lib/vhost_sockets/vhu9f3a18d4-59,server -netdev vhost-user,chardev=charnet0,id=hostnet0 -device virtio-net-pci,netdev=hostnet0,id=net0,mac=fa:16:3e:bf:49:60,bus=pci.0,addr=0x3 -add-fd set=0,fd=28 -chardev pty,id=charserial0,logfile=/dev/fdset/0,logappend=on -device isa-serial,chardev=charserial0,id=serial0 -device usb-tablet,id=input0,bus=usb.0,port=1 -vnc 10.10.125.111:0 -k en-us -device cirrus-vga,id=video0,bus=pci.0,addr=0x2 -device virtio-balloon-pci,id=balloon0,bus=pci.0,addr=0x5 -msg timestamp=on
    2018-07-31T07:42:22.122739Z qemu-kvm: -chardev socket,id=charnet0,path=/var/lib/vhost_sockets/vhu9f3a18d4-59,server: info: QEMU waiting for connection on: disconnected:unix:/var/lib/vhost_sockets/vhu9f3a18d4-59,server

Hey Yolanda,

As discussed, this appears to be the result of the ovs-dpdk-permissions.yaml environment [1] not being used when updating the stack ahead of the FFU run. That leaves the /var/lib/vhost_sockets/ directory inaccessible, which in turn leaves the QEMU processes paused. This was seen and documented in the following bug:

[OSP13]Boot guest with vhostuser port,QEMU waiting for connection on: disconnected:unix:/var/lib/vhost_sockets/vhu73c7fe09-7d
https://bugzilla.redhat.com/show_bug.cgi?id=1583447

We really need this in the FFU upgrade docs as well. Leaving a needinfo against you while you rerun your tests to confirm this.

[1] https://github.com/openstack/tripleo-heat-templates/blob/master/environments/ovs-dpdk-permissions.yaml
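
For anyone rerunning the tests, a quick way to see whether the socket directory is the problem is to look at which group owns it on the compute node. The following is only a minimal sketch: the helper names `dir_group` and `check_socket_dir` are hypothetical, and comparing against the group "hugetlbfs" assumes that the VhostuserSocketGroup value ends up as the group owner of /var/lib/vhost_sockets, which this bug does not state explicitly.

```python
import grp
import os


def dir_group(path):
    """Return the name of the group owning `path`, or the numeric gid as a string."""
    gid = os.stat(path).st_gid
    try:
        return grp.getgrgid(gid).gr_name
    except KeyError:
        # The gid has no entry in the group database (possible inside containers).
        return str(gid)


def check_socket_dir(path, expected_group):
    """Return (matches, actual_group) for the directory at `path`.

    `expected_group` is an assumption supplied by the caller, for example
    "hugetlbfs" if that is what VhostuserSocketGroup is set to.
    """
    actual = dir_group(path)
    return actual == expected_group, actual
```

On a compute node this could be called as `check_socket_dir("/var/lib/vhost_sockets", "hugetlbfs")`. A mismatch would only be a hint to go and check whether the permissions environment was included in the stack update, not proof of the root cause.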

So the problem was an incorrect mapping of the VhostuserSocketGroup parameter. The template was mapping it to the ComputeOvsDpdk role, but that role was not used. We were using the regular Compute role, so it needed to be:

    parameter_defaults:
      ComputeParameters:
        VhostuserSocketGroup: "hugetlbfs"
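
This kind of mismatch is easy to miss, so here is a rough sketch of a check for it. It assumes the role-specific section is named `<RoleName>Parameters`, as with ComputeParameters above; the key `ComputeOvsDpdkParameters` is inferred from that pattern, and the function name, the role list, and the already-parsed dictionaries are all made up for illustration. The suffix test is a heuristic and may flag keys that are not role sections.

```python
def unmatched_role_sections(env, deployed_roles):
    """Return role-specific sections in `env` that match none of `deployed_roles`.

    `env` is an environment file already parsed into a dict. A key under
    parameter_defaults is treated as a role section if it ends in
    "Parameters" and holds a mapping (heuristic, see the lead-in).
    """
    expected = {role + "Parameters" for role in deployed_roles}
    defaults = env.get("parameter_defaults", {})
    return sorted(
        key
        for key, value in defaults.items()
        if key.endswith("Parameters")
        and isinstance(value, dict)
        and key not in expected
    )


# Hypothetical inputs: the mapping as shipped versus the corrected one.
env_before = {
    "parameter_defaults": {
        "ComputeOvsDpdkParameters": {"VhostuserSocketGroup": "hugetlbfs"}
    }
}
env_after = {
    "parameter_defaults": {
        "ComputeParameters": {"VhostuserSocketGroup": "hugetlbfs"}
    }
}
roles = ["Controller", "Compute"]  # assumed role list for this deployment

print(unmatched_role_sections(env_before, roles))  # ['ComputeOvsDpdkParameters']
print(unmatched_role_sections(env_after, roles))   # []
```

A non-empty result means a section is keyed to a role that is not in the role list, so its settings would presumably not be applied to any node, which is what happened here.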