Bug 1574465

Summary: OSP13 2018-05-01.6 start guest fail
Product: Red Hat OpenStack Reporter: jianzzha
Component: openstack-novaAssignee: OSP DFG:Compute <osp-dfg-compute>
Status: CLOSED NOTABUG QA Contact: OSP DFG:Compute <osp-dfg-compute>
Severity: urgent Docs Contact:
Priority: unspecified    
Version: 13.0 (Queens)CC: atheurer, berrange, dasmith, eglynn, jhakimra, jianzzha, kchamart, nova-maint, sbauza, sferdjao, sgordon, skramaja, srevivo, vromanso
Target Milestone: ---   
Target Release: ---   
Hardware: x86_64   
OS: Linux   
Whiteboard:
Fixed In Version: Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2018-05-11 00:22:13 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:

Description jianzzha 2018-05-03 11:30:43 UTC
Description of problem:
with 2018-05-01.6 build, overcloud deployment was success, but fail to start guest. The instance stuck at state "BUILD",

(overcloud) [heat-admin@controller-0 ~]$ openstack server list -f json
[
  {
    "Status": "BUILD", 
    "Name": "demo1", 
    "Image": "nfv", 
    "ID": "eeca6723-cb58-4262-8aea-457a231f0c59", 
    "Flavor": "nfv", 
    "Networks": "access=10.1.1.3; provider-nfv1=20.1.0.9; provider-nfv0=20.0.0.8"
  }
]

(overcloud) [heat-admin@controller-0 ~]$ openstack network list -f json
[
  {
    "Subnets": "f265ed62-2887-40fb-92ec-f427327cb81a", 
    "ID": "04cd2374-fa8c-485b-ae84-fd76bb587411", 
    "Name": "access"
  }, 
  {
    "Subnets": "4bbd1cfa-e984-40f4-bd1c-d31d6c523a12", 
    "ID": "88d04486-ca26-4d7a-ba42-f43d6b144ef1", 
    "Name": "provider-nfv1"
  }, 
  {
    "Subnets": "e60dacb2-7b8c-4e6e-97fd-f204a390e1dd", 
    "ID": "9b7d6ef2-f7b2-47d8-8292-86299d88334b", 
    "Name": "provider-nfv0"
  }
]



(overcloud) [heat-admin@controller-0 ~]$ openstack flavor  show nfv -f json
{
  "name": "nfv", 
  "ram": 4096, 
  "properties": "hw:cpu_policy='dedicated', hw:cpu_thread_policy='require', hw:emulator_threads_policy='isolate', hw:mem_page_size='1GB'", 
  "OS-FLV-DISABLED:disabled": false, 
  "vcpus": 6, 
  "access_project_ids": null, 
  "os-flavor-access:is_public": true, 
  "rxtx_factor": 1.0, 
  "OS-FLV-EXT-DATA:ephemeral": 0, 
  "disk": 20, 
  "id": "1", 
  "swap": ""


in OSP12, The same script starts guest fine.

How reproducible:
every  time

Steps to Reproduce:
1. create flavor with "properties": "hw:cpu_policy='dedicated', hw:cpu_thread_policy='require', hw:emulator_threads_policy='isolate', hw:mem_page_size='1GB'"
2. create 3 provider network: vlan provider net for access; flat provider for nfv data flow
3. start instance using above flavor and network; the instance creation stuck at BUILD status

Comment 1 Saravanan KR 2018-05-03 11:45:55 UTC
Could you confirm if it is related to SELinux or not, because i faced a similar issue #1572510 in OSP10 with ovs2.9 version.

Comment 2 jianzzha 2018-05-03 12:45:50 UTC
(In reply to Saravanan KR from comment #1)
> Could you confirm if it is related to SELinux or not, because i faced a
> similar issue #1572510 in OSP10 with ovs2.9 version.

I did notice some syslog err from ovs,

May  2 21:18:55 compute-0 ovs-vswitchd[4902]: ovs|00205|dpdk|ERR|RING: Cannot ress:iface-status=active external-ids:attached-mac=fa:16:3e:59:49:11 external-ids:v
m-uuid=eeca6723-cb58-4262-8aea-457a231f0c59 type=dpdkvhostuserclient options:vho
st-server-path=/var/lib/vhost_sockets/vhud7b18355-c7
May  2 21:18:55 compute-0 ovs-vsctl: ovs|00001|vsctl|INFO|Called as ovs-vsctl --
timeout=120 -- set interface vhud7b18355-c7 mtu_request=9000
erve memory
May  2 21:18:55 compute-0 ovs-vswitchd[4902]: ovs|00206|dpdk|ERR|RING: Cannot reserve memory
May  2 21:18:55 compute-0 ovs-vswitchd[4902]: ovs|00207|dpdk|ERR|RING: Cannot reserve memory
May  2 21:18:55 compute-0 ovs-vswitchd[4902]: ovs|00208|dpdk|ERR|RING: Cannot reserve memory
May  2 21:18:55 compute-0 ovs-vsctl: ovs|00001|vsctl|INFO|Called as ovs-vsctl -- --may-exist add-br br-int -- set Bridge br-int datapath_type=netdev
May  2 21:18:55 compute-0 ovs-vsctl: ovs|00001|vsctl|INFO|Called as ovs-vsctl --timeout=120 -- --may-exist add-port br-int vhud3b6b1a9-eb -- set Interface vhud3b6b1a9-eb external-ids:iface-id=d3b6b1a9-eb92-476d-b965-a421a0ad67d9 external-ids:iface-status=active external-ids:attached-mac=fa:16:3e:09:3b:ec external-ids:vm-uuid=eeca6723-cb58-4262-8aea-457a231f0c59 type=dpdkvhostuserclient options:vhost-server-path=/var/lib/vhost_sockets/vhud3b6b1a9-eb

Comment 3 Saravanan KR 2018-05-03 13:24:59 UTC
(In reply to jianzzha from comment #2)
> (In reply to Saravanan KR from comment #1)
> > Could you confirm if it is related to SELinux or not, because i faced a
> > similar issue #1572510 in OSP10 with ovs2.9 version.
> 
> I did notice some syslog err from ovs,
> 
> May  2 21:18:55 compute-0 ovs-vswitchd[4902]: ovs|00205|dpdk|ERR|RING:
> Cannot ress:iface-status=active external-ids:attached-mac=fa:16:3e:59:49:11
> external-ids:v
> m-uuid=eeca6723-cb58-4262-8aea-457a231f0c59 type=dpdkvhostuserclient
> options:vho
> st-server-path=/var/lib/vhost_sockets/vhud7b18355-c7
> May  2 21:18:55 compute-0 ovs-vsctl: ovs|00001|vsctl|INFO|Called as
> ovs-vsctl --
> timeout=120 -- set interface vhud7b18355-c7 mtu_request=9000
> erve memory
> May  2 21:18:55 compute-0 ovs-vswitchd[4902]: ovs|00206|dpdk|ERR|RING:
> Cannot reserve memory
> May  2 21:18:55 compute-0 ovs-vswitchd[4902]: ovs|00207|dpdk|ERR|RING:
> Cannot reserve memory
> May  2 21:18:55 compute-0 ovs-vswitchd[4902]: ovs|00208|dpdk|ERR|RING:
> Cannot reserve memory
> May  2 21:18:55 compute-0 ovs-vsctl: ovs|00001|vsctl|INFO|Called as
> ovs-vsctl -- --may-exist add-br br-int -- set Bridge br-int
> datapath_type=netdev
> May  2 21:18:55 compute-0 ovs-vsctl: ovs|00001|vsctl|INFO|Called as
> ovs-vsctl --timeout=120 -- --may-exist add-port br-int vhud3b6b1a9-eb -- set
> Interface vhud3b6b1a9-eb
> external-ids:iface-id=d3b6b1a9-eb92-476d-b965-a421a0ad67d9
> external-ids:iface-status=active external-ids:attached-mac=fa:16:3e:09:3b:ec
> external-ids:vm-uuid=eeca6723-cb58-4262-8aea-457a231f0c59
> type=dpdkvhostuserclient
> options:vhost-server-path=/var/lib/vhost_sockets/vhud3b6b1a9-eb

Is it related to MTU 9000? Could you check with a lower value to confirm? How much socket memory is configured?

Comment 4 jianzzha 2018-05-03 14:22:28 UTC
huge page is setup,
[heat-admin@compute-0 ~]$ cat /proc/meminfo | grep Huge
AnonHugePages:      2048 kB
HugePages_Total:      32
HugePages_Free:       30
HugePages_Rsvd:        0
HugePages_Surp:        0
Hugepagesize:    1048576 kB

Comment 5 jianzzha 2018-05-03 14:24:52 UTC
(In reply to Saravanan KR from comment #3)
> (In reply to jianzzha from comment #2)
> > (In reply to Saravanan KR from comment #1)
> > > Could you confirm if it is related to SELinux or not, because i faced a
> > > similar issue #1572510 in OSP10 with ovs2.9 version.
> > 
> > I did notice some syslog err from ovs,
> > 
> > May  2 21:18:55 compute-0 ovs-vswitchd[4902]: ovs|00205|dpdk|ERR|RING:
> > Cannot ress:iface-status=active external-ids:attached-mac=fa:16:3e:59:49:11
> > external-ids:v
> > m-uuid=eeca6723-cb58-4262-8aea-457a231f0c59 type=dpdkvhostuserclient
> > options:vho
> > st-server-path=/var/lib/vhost_sockets/vhud7b18355-c7
> > May  2 21:18:55 compute-0 ovs-vsctl: ovs|00001|vsctl|INFO|Called as
> > ovs-vsctl --
> > timeout=120 -- set interface vhud7b18355-c7 mtu_request=9000
> > erve memory
> > May  2 21:18:55 compute-0 ovs-vswitchd[4902]: ovs|00206|dpdk|ERR|RING:
> > Cannot reserve memory
> > May  2 21:18:55 compute-0 ovs-vswitchd[4902]: ovs|00207|dpdk|ERR|RING:
> > Cannot reserve memory
> > May  2 21:18:55 compute-0 ovs-vswitchd[4902]: ovs|00208|dpdk|ERR|RING:
> > Cannot reserve memory
> > May  2 21:18:55 compute-0 ovs-vsctl: ovs|00001|vsctl|INFO|Called as
> > ovs-vsctl -- --may-exist add-br br-int -- set Bridge br-int
> > datapath_type=netdev
> > May  2 21:18:55 compute-0 ovs-vsctl: ovs|00001|vsctl|INFO|Called as
> > ovs-vsctl --timeout=120 -- --may-exist add-port br-int vhud3b6b1a9-eb -- set
> > Interface vhud3b6b1a9-eb
> > external-ids:iface-id=d3b6b1a9-eb92-476d-b965-a421a0ad67d9
> > external-ids:iface-status=active external-ids:attached-mac=fa:16:3e:09:3b:ec
> > external-ids:vm-uuid=eeca6723-cb58-4262-8aea-457a231f0c59
> > type=dpdkvhostuserclient
> > options:vhost-server-path=/var/lib/vhost_sockets/vhud3b6b1a9-eb
> 
> Is it related to MTU 9000? Could you check with a lower value to confirm?
> How much socket memory is configured?

I'm assuming the 9000 is default on the vhost port, not sure what option to use to change that. I feel this might still relate to selinux. how do you confirm if it is related to selinux?

Comment 6 Artom Lifshitz 2018-05-03 16:53:13 UTC
I seem to recall seeing something similar to this before, but IIRC in that case the root cause was hugepages not being enabled.

In this particular case, I do find it curious that only 2 1GB pages are free when a VM with 4GB of RAM is supposed to exist.

In any case, it's hard to speculate without more information. Would it be possible to reproduce this with nova in debug mode and collect sosreports from all nodes?

Comment 7 Saravanan KR 2018-05-04 04:35:31 UTC
(In reply to jianzzha from comment #5)
> (In reply to Saravanan KR from comment #3)
> > (In reply to jianzzha from comment #2)
> > > (In reply to Saravanan KR from comment #1)

> > Is it related to MTU 9000? Could you check with a lower value to confirm?
> > How much socket memory is configured?
> 
> I'm assuming the 9000 is default on the vhost port, not sure what option to
> use to change that. I feel this might still relate to selinux. how do you
> confirm if it is related to selinux?

You can disable selinux on on the compute nodes and then create a VM to ensure that it is working fine.

The default MTU is 1500. But MTU can be configured using the nic templates  and the template parameter NeutronGlobalPhysnetMtu.

Can you share the templates that you are using? socket-mem (OvsDpdkSocketMemory) value need to be checked whether it can hold the 9000 MTU buffer?

Comment 8 jianzzha 2018-05-04 13:11:40 UTC
ok I see the 9000 in te template. Here is the templates:
https://github.com/jianzzha/director-deploy/blob/master/files/skylake-environment.yaml
https://github.com/jianzzha/director-deploy/blob/master/files/nic-configs/skylake.yaml
https://github.com/jianzzha/director-deploy/blob/master/files/nic-configs/controller.yaml

The same templates works fine in osp12

I did try disable selinux on on the compute nodes, same result.

Comment 9 Saravanan KR 2018-05-04 13:17:47 UTC
Thanks for the templates. The socket memory is configured as "1024,1024", which is definitely not enough for 9000MTU.
https://github.com/jianzzha/director-deploy/blob/master/files/network-environment.yaml#L73

Can you increase value of OvsDpdkSocketMemory to "8192,8192" and try the deployment?

Comment 10 jianzzha 2018-05-04 13:30:12 UTC
ok will do. by the way, here is the domainxml:

[heat-admin@compute-0 ~]$ sudo virsh list
 Id    Name                           State
----------------------------------------------------
 1     instance-00000002              paused

[heat-admin@compute-0 ~]$ sudo virsh dumpxml instance-00000002
<domain type='kvm' id='1'>
  <name>instance-00000002</name>
  <uuid>4df64de3-ffa1-43db-95a8-09b49f30aac2</uuid>
  <metadata>
    <nova:instance xmlns:nova="http://openstack.org/xmlns/libvirt/nova/1.0">
      <nova:package version="17.0.3-0.20180420001138.el7ost"/>
      <nova:name>demo1</nova:name>
      <nova:creationTime>2018-05-03 14:03:43</nova:creationTime>
      <nova:flavor name="nfv">
        <nova:memory>4096</nova:memory>
        <nova:disk>20</nova:disk>
        <nova:swap>0</nova:swap>
        <nova:ephemeral>0</nova:ephemeral>
        <nova:vcpus>6</nova:vcpus>
      </nova:flavor>
      <nova:owner>
        <nova:user uuid="b50648d154fc4edbae2319f7c39fffbc">admin</nova:user>
        <nova:project uuid="386dabf9ab164738b2541f40ae4f52db">admin</nova:project>
      </nova:owner>
      <nova:root type="image" uuid="fd6db93a-2fca-425f-9f83-0fb8a9265e9d"/>
    </nova:instance>
  </metadata>
  <memory unit='KiB'>4194304</memory>
  <currentMemory unit='KiB'>4194304</currentMemory>
  <memoryBacking>
    <hugepages>
      <page size='1048576' unit='KiB' nodeset='0'/>
    </hugepages>
  </memoryBacking>
  <vcpu placement='static'>6</vcpu>
  <cputune>
    <shares>6144</shares>
    <vcpupin vcpu='0' cpuset='38'/>
    <vcpupin vcpu='1' cpuset='14'/>
    <vcpupin vcpu='2' cpuset='36'/>
    <vcpupin vcpu='3' cpuset='12'/>
    <vcpupin vcpu='4' cpuset='8'/>
    <vcpupin vcpu='5' cpuset='32'/>
    <emulatorpin cpuset='10'/>
  </cputune>
  <numatune>
    <memory mode='strict' nodeset='0'/>
    <memnode cellid='0' mode='strict' nodeset='0'/>
  </numatune>
  <resource>
    <partition>/machine</partition>
  </resource>
  <sysinfo type='smbios'>
    <system>
      <entry name='manufacturer'>Red Hat</entry>
      <entry name='product'>OpenStack Compute</entry>
      <entry name='version'>17.0.3-0.20180420001138.el7ost</entry>
      <entry name='serial'>4c4c4544-0052-4e10-8044-b8c04f394e32</entry>
      <entry name='uuid'>4df64de3-ffa1-43db-95a8-09b49f30aac2</entry>
      <entry name='family'>Virtual Machine</entry>
    </system>
  </sysinfo>
  <os>
    <type arch='x86_64' machine='pc-i440fx-rhel7.5.0'>hvm</type>
    <boot dev='hd'/>
    <smbios mode='sysinfo'/>
  </os>
  <features>
    <acpi/>
    <apic/>
  </features>
  <cpu mode='custom' match='exact' check='partial'>
    <model fallback='forbid'>Skylake-Server-IBRS</model>
    <vendor>Intel</vendor>
    <topology sockets='3' cores='1' threads='2'/>
    <feature policy='require' name='ss'/>
    <feature policy='require' name='hypervisor'/>
    <feature policy='require' name='tsc_adjust'/>
    <feature policy='require' name='clflushopt'/>
    <feature policy='require' name='pku'/>
    <feature policy='require' name='stibp'/>
    <numa>
      <cell id='0' cpus='0-5' memory='4194304' unit='KiB' memAccess='shared'/>
    </numa>
  </cpu>
  <clock offset='utc'>
    <timer name='pit' tickpolicy='delay'/>
    <timer name='rtc' tickpolicy='catchup'/>
    <timer name='hpet' present='no'/>
  </clock>
  <on_poweroff>destroy</on_poweroff>
  <on_reboot>restart</on_reboot>
  <on_crash>destroy</on_crash>
  <devices>
    <emulator>/usr/libexec/qemu-kvm</emulator>
    <disk type='file' device='disk'>
      <driver name='qemu' type='qcow2' cache='none'/>
      <source file='/var/lib/nova/instances/4df64de3-ffa1-43db-95a8-09b49f30aac2/disk'/>
      <backingStore type='file' index='1'>
        <format type='raw'/>
        <source file='/var/lib/nova/instances/_base/2db52b2b5ac9c44f05b3fbd12878c6dde0d7645b'/>
        <backingStore/>
      </backingStore>
      <target dev='vda' bus='virtio'/>
      <alias name='virtio-disk0'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x06' function='0x0'/>
    </disk>
    <controller type='usb' index='0' model='piix3-uhci'>
      <alias name='usb'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x01' function='0x2'/>
    </controller>
    <controller type='pci' index='0' model='pci-root'>
      <alias name='pci.0'/>
    </controller>
    <interface type='vhostuser'>
      <mac address='fa:16:3e:33:6d:cd'/>
      <source type='unix' path='/var/lib/vhost_sockets/vhu88c61e5d-b9' mode='server'/>
      <target dev='vhu88c61e5d-b9'/>
      <model type='virtio'/>
      <alias name='net0'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x03' function='0x0'/>
    </interface>
    <interface type='vhostuser'>
      <mac address='fa:16:3e:9c:2e:68'/>
      <source type='unix' path='/var/lib/vhost_sockets/vhue9250d57-10' mode='server'/>
      <target dev='vhue9250d57-10'/>
      <model type='virtio'/>
      <alias name='net1'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x04' function='0x0'/>
    </interface>
    <interface type='vhostuser'>
      <mac address='fa:16:3e:7e:0d:18'/>
      <source type='unix' path='/var/lib/vhost_sockets/vhuf02d14eb-bf' mode='server'/>
      <target dev='vhuf02d14eb-bf'/>
      <model type='virtio'/>
      <alias name='net2'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x05' function='0x0'/>
    </interface>
    <serial type='pty'>
      <log file='/var/lib/nova/instances/4df64de3-ffa1-43db-95a8-09b49f30aac2/console.log' append='off'/>
      <target type='isa-serial' port='0'>
        <model name='isa-serial'/>
      </target>
      <alias name='serial0'/>
    </serial>
    <console type='pty'>
      <log file='/var/lib/nova/instances/4df64de3-ffa1-43db-95a8-09b49f30aac2/console.log' append='off'/>
      <target type='serial' port='0'/>
      <alias name='serial0'/>
    </console>
    <input type='tablet' bus='usb'>
      <alias name='input0'/>
      <address type='usb' bus='0' port='1'/>
    </input>
    <input type='mouse' bus='ps2'>
      <alias name='input1'/>
    </input>
    <input type='keyboard' bus='ps2'>
      <alias name='input2'/>
    </input>
    <graphics type='vnc' port='5900' autoport='yes' listen='172.22.33.18' keymap='en-us'>
      <listen type='address' address='172.22.33.18'/>
    </graphics>
    <video>
      <model type='cirrus' vram='16384' heads='1' primary='yes'/>
      <alias name='video0'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x02' function='0x0'/>
    </video>
    <memballoon model='virtio'>
      <stats period='10'/>
      <alias name='balloon0'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x07' function='0x0'/>
    </memballoon>
  </devices>
  <seclabel type='dynamic' model='selinux' relabel='yes'>
    <label>system_u:system_r:svirt_t:s0:c30,c967</label>
    <imagelabel>system_u:object_r:svirt_image_t:s0:c30,c967</imagelabel>
  </seclabel>
  <seclabel type='dynamic' model='dac' relabel='yes'>
    <label>+107:+107</label>
    <imagelabel>+107:+107</imagelabel>
  </seclabel>
</domain>

Comment 11 jianzzha 2018-05-04 15:03:01 UTC
this is the shell script I used to run the overcloud  deployment,
https://github.com/jianzzha/directordeploy/blob/master/files/overcloud_skylake_deploy.sh

Comment 13 Andrew Theurer 2018-05-08 18:51:37 UTC
We no longer have the memory problem (I believe Jianzhu redeployed with 1500 MTU), but we now have a problem with what appears to be a race condition between Qemu and OVS.  

Here is the log of qemu:
2018-05-08 17:25:26.615+0000: starting up libvirt version: 3.9.0, package: 14.el7_5.2 (Red Hat, Inc. <http://bugzilla.redhat.com/bugzilla>, 2018-03-20-14:43:16, x86-041.build.eng.bos.redhat.com), qemu version: 2.10.0(qemu-kvm-rhev-2.10.0-21.el7_5.2), hostname: compute-0.localdomain
LC_ALL=C PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin HOME=/root QEMU_AUDIO_DRV=none /usr/libexec/qemu-kvm -name guest=instance-00000001,debug-threads=on -S -object secret,id=masterKey0,format=raw,file=/var/lib/libvirt/qemu/domain-1-instance-00000001/master-key.aes -machine pc-i440fx-rhel7.5.0,accel=kvm,usb=off,dump-guest-core=off -cpu Skylake-Server-IBRS,ss=on,hypervisor=on,tsc_adjust=on,clflushopt=on,pku=on,stibp=on -m 4096 -realtime mlock=off -smp 6,sockets=3,cores=1,threads=2 -object memory-backend-file,id=ram-node0,prealloc=yes,mem-path=/dev/hugepages/libvirt/qemu/1-instance-00000001,share=yes,size=4294967296,host-nodes=0,policy=bind -numa node,nodeid=0,cpus=0-5,memdev=ram-node0 -uuid 1b5ecf77-2b4e-475d-a064-6682ea4f2967 -smbios 'type=1,manufacturer=Red Hat,product=OpenStack Compute,version=17.0.3-0.20180420001138.el7ost,serial=4c4c4544-0052-4e10-8044-b8c04f394e32,uuid=1b5ecf77-2b4e-475d-a064-6682ea4f2967,family=Virtual Machine' -no-user-config -nodefaults -chardev socket,id=charmonitor,path=/var/lib/libvirt/qemu/domain-1-instance-00000001/monitor.sock,server,nowait -mon chardev=charmonitor,id=monitor,mode=control -rtc base=utc,driftfix=slew -global kvm-pit.lost_tick_policy=delay -no-hpet -no-shutdown -boot strict=on -device piix3-usb-uhci,id=usb,bus=pci.0,addr=0x1.0x2 -drive file=/var/lib/nova/instances/1b5ecf77-2b4e-475d-a064-6682ea4f2967/disk,format=qcow2,if=none,id=drive-virtio-disk0,cache=none -device virtio-blk-pci,scsi=off,bus=pci.0,addr=0x6,drive=drive-virtio-disk0,id=virtio-disk0,bootindex=1 -chardev socket,id=charnet0,path=/var/lib/vhost_sockets/vhue81cc988-2e,server -netdev vhost-user,chardev=charnet0,id=hostnet0 -device virtio-net-pci,netdev=hostnet0,id=net0,mac=fa:16:3e:91:13:88,bus=pci.0,addr=0x3 -chardev socket,id=charnet1,path=/var/lib/vhost_sockets/vhud31272c8-cb,server -netdev vhost-user,chardev=charnet1,id=hostnet1 -device virtio-net-pci,netdev=hostnet1,id=net1,mac=fa:16:3e:b0:5a:b9,bus=pci.0,addr=0x4 -chardev socket,id=charnet2,path=/var/lib/vhost_sockets/vhu998df8be-ec,server -netdev vhost-user,chardev=charnet2,id=hostnet2 -device virtio-net-pci,netdev=hostnet2,id=net2,mac=fa:16:3e:70:e0:b9,bus=pci.0,addr=0x5 -add-fd set=0,fd=28 -chardev pty,id=charserial0,logfile=/dev/fdset/0,logappend=on -device isa-serial,chardev=charserial0,id=serial0 -device usb-tablet,id=input0,bus=usb.0,port=1 -vnc 172.22.33.18:0 -k en-us -device cirrus-vga,id=video0,bus=pci.0,addr=0x2 -device virtio-balloon-pci,id=balloon0,bus=pci.0,addr=0x7 -msg timestamp=on
2018-05-08T17:25:27.410565Z qemu-kvm: -chardev socket,id=charnet0,path=/var/lib/vhost_sockets/vhue81cc988-2e,server: info: QEMU waiting for connection on: disconnected:unix:/var/lib/vhost_sockets/vhue81cc988-2e,server


And for OVS:
2018-05-08T17:25:24.423Z|00195|dpdk|INFO|VHOST_CONFIG: vhost-user client: socket created, fd: 133
2018-05-08T17:25:24.423Z|00196|netdev_dpdk|INFO|vHost User device 'vhue81cc988-2e' created in 'client' mode, using client socket '/var/lib/vhost_sockets/vhue81cc988-2e'
2018-05-08T17:25:24.426Z|00197|dpdk|WARN|VHOST_CONFIG: failed to connect to /var/lib/vhost_sockets/vhue81cc988-2e: No such file or directory
2018-05-08T17:25:24.426Z|00198|dpdk|INFO|VHOST_CONFIG: /var/lib/vhost_sockets/vhue81cc988-2e: reconnecting...
2018-05-08T17:25:24.426Z|00199|dpif_netdev|INFO|Core 18 on numa node 0 assigned port 'dpdk2' rx queue 0 (measured processing cycles 834422).
2018-05-08T17:25:24.426Z|00200|dpif_netdev|INFO|Core 2 on numa node 0 assigned port 'dpdk0' rx queue 0 (measured processing cycles 0).
2018-05-08T17:25:24.426Z|00201|dpif_netdev|INFO|Core 26 on numa node 0 assigned port 'dpdk1' rx queue 0 (measured processing cycles 0).
2018-05-08T17:25:24.426Z|00202|dpif_netdev|INFO|Core 42 on numa node 0 assigned port 'vhue81cc988-2e' rx queue 0 (measured processing cycles 0).
2018-05-08T17:25:24.426Z|00203|bridge|INFO|bridge br-int: added interface vhue81cc988-2e on port 5

Note that Qemu did not create the vhostuser socket until 2 seconds after OVS attempted to connect to it (and could not find it).  Does OVS attempt to retry a connection to the socket for vhostuser?

OVS appears to give up quickly on connecting and tries the next vhostuser port, which of course will not yet be there.

Comment 14 jianzzha 2018-05-10 18:24:49 UTC
/usr/share/openstack-tripleo-heat-templates/environments/ovs-dpdk-permissions.yaml   is missing from the deployment. Once this template is applied for overcloud, I can start the guest.

There is still an issue with containerized neutron-ovs-dpdk, the over cloud deployment would fail when this template is applied: /usr/share/openstack-tripleo-heat-templates/environments/services-docker/neutron-ovs-dpdk.yaml

if using the non docker version template in stead, it has no problem:  -e /usr/share/openstack-tripleo-heat-templates/environments/neutron-ovs-dpdk.yaml \

Comment 15 jianzzha 2018-05-10 18:28:35 UTC
This is the error when services-docker/neutron-ovs-dpdk.yaml is applied during overcloud deployment:

[jianzzha@jianzzha ~]$ cat errors
    
    TASK [Run docker-puppet tasks (generate config) during step 1] *****************
    ok: [localhost]
    
    TASK [Debug output for task which failed: Run docker-puppet tasks (generate config) during step 1] ***
    fatal: [localhost]: FAILED! => {
        "changed": false, 
        "failed_when_result": true, 
        "outputs.stdout_lines|default([])|union(outputs.stderr_lines|default([]))": [
            "2018-05-10 11:39:23,049 INFO: 12170 -- Running docker-puppet", 
            "2018-05-10 11:39:23,050 INFO: 12170 -- Service compilation completed.", 
            "2018-05-10 11:39:23,050 INFO: 12170 -- Starting multiprocess configuration steps.  Using 3 processes.", 
            "2018-05-10 11:39:23,058 INFO: 12173 -- Starting configuration of neutron using image docker-registry.engineering.redhat.com/rhosp13/openstack-neutron-server:2018-05-01.6", 
            "2018-05-10 11:39:23,058 INFO: 12171 -- Starting configuration of nova_libvirt using image docker-registry.engineering.redhat.com/rhosp13/openstack-nova-compute:2018-05-01.6", 
            "2018-05-10 11:39:23,058 INFO: 12172 -- Starting configuration of crond using image docker-registry.engineering.redhat.com/rhosp13/openstack-cron:2018-05-01.6", 
            "2018-05-10 11:39:23,060 INFO: 12173 -- Removing container: docker-puppet-neutron", 
            "2018-05-10 11:39:23,060 INFO: 12172 -- Removing container: docker-puppet-crond", 
            "2018-05-10 11:39:23,060 INFO: 12171 -- Removing container: docker-puppet-nova_libvirt", 
            "2018-05-10 11:39:23,085 INFO: 12171 -- Pulling image: docker-registry.engineering.redhat.com/rhosp13/openstack-nova-compute:2018-05-01.6", 
            "2018-05-10 11:39:23,085 INFO: 12172 -- Pulling image: docker-registry.engineering.redhat.com/rhosp13/openstack-cron:2018-05-01.6", 
            "2018-05-10 11:39:23,085 INFO: 12173 -- Pulling image: docker-registry.engineering.redhat.com/rhosp13/openstack-neutron-server:2018-05-01.6", 
            "2018-05-10 11:39:50,645 INFO: 12172 -- Removing container: docker-puppet-crond", 
            "2018-05-10 11:39:50,690 INFO: 12172 -- Finished processing puppet configs for crond", 
            "2018-05-10 11:39:50,690 INFO: 12172 -- Starting configuration of iscsid using image docker-registry.engineering.redhat.com/rhosp13/openstack-iscsid:2018-05-01.6", 
            "2018-05-10 11:39:50,691 INFO: 12172 -- Removing container: docker-puppet-iscsid", 
            "2018-05-10 11:39:50,706 INFO: 12172 -- Pulling image: docker-registry.engineering.redhat.com/rhosp13/openstack-iscsid:2018-05-01.6", 
            "2018-05-10 11:39:56,946 ERROR: 12173 -- Failed running docker-puppet.py for neutron", 
            "2018-05-10 11:39:56,947 ERROR: 12173 -- Notice: hiera(): Cannot load backend module_data: cannot load such file -- hiera/backend/module_data_backend", 
            "Notice: hiera(): Cannot load backend module_data: cannot load such file -- hiera/backend/module_data_backend", 
            "Notice: Compiled catalog for overcloud-computeovsdpdk-0.localdomain in environment production in 2.71 seconds", 
            "Notice: /Stage[main]/Tripleo::Profile::Base::Neutron::Ovs/File[/var/lib/vhost_sockets]/ensure: created", 
            "Notice: /Stage[main]/Vswitch::Dpdk/File_line[/etc/sysconfig/openvswitch]/ensure: created", 
            "Changes:", 
            "            Total: 2", 
            "Events:", 
            "          Success: 2", 
            "Resources:", 
            "          Changed: 2", 
            "      Out of sync: 2", 
            "          Skipped: 5", 
            "            Total: 9", 
            "Time:", 
            "        Resources: 0.00", 
            "        File line: 0.00", 
            "             File: 0.00", 
            "         Last run: 1525952396", 
            "   Config retrieval: 2.91", 
            "            Total: 2.91", 
            "Version:", 
            "           Config: 1525952393", 
            "           Puppet: 4.8.2", 
            "", 
            "2018-05-10 11:39:56,947 ERROR: 12173 -- + mkdir -p /etc/puppet", 
            "+ cp -a /tmp/puppet-etc/auth.conf /tmp/puppet-etc/hiera.yaml /tmp/puppet-etc/hieradata /tmp/puppet-etc/modules /tmp/puppet-etc/puppet.conf /tmp/puppet-etc/ssl /etc/puppet", 
            "+ rm -Rf /etc/puppet/ssl", 
            "+ echo '{\"step\": 6}'", 
            "+ TAGS=", 
            "+ '[' -n file,file_line,concat,augeas,cron,neutron_plugin_ml2,neutron_config,neutron_agent_ovs,neutron_plugin_ml2 ']'", 
            "+ TAGS='--tags file,file_line,concat,augeas,cron,neutron_plugin_ml2,neutron_config,neutron_agent_ovs,neutron_plugin_ml2'", 
            "+ origin_of_time=/var/lib/config-data/neutron.origin_of_time", 
            "+ touch /var/lib/config-data/neutron.origin_of_time", 
            "+ sync", 
            "+ set +e", 
            "+ FACTER_hostname=overcloud-computeovsdpdk-0", 
            "+ FACTER_uuid=docker", 
            "+ /usr/bin/puppet apply --summarize --detailed-exitcodes --color=false --logdest syslog --logdest console --modulepath=/etc/puppet/modules:/usr/share/openstack-puppet/modules --tags file,file_line,concat,augeas,cron,neutron_plugin_ml2,neutron_config,neutron_agent_ovs,neutron_plugin_ml2 /etc/config.pp", 
            "ovs-vsctl: unix:/var/run/openvswitch/db.sock: database connection failed (No such file or directory)", 
            "Failed to get D-Bus connection: Operation not permitted", 
            "Warning: Undefined variable 'deploy_config_name'; ", 
            "   (file & line not available)", 
            "Warning: ModuleLoader: module 'neutron' has unresolved dependencies - it will only see those that are resolved. Use 'puppet module list --tree' to see information about modules", 
            "Warning: ModuleLoader: module 'openstacklib' has unresolved dependencies - it will only see those that are resolved. Use 'puppet module list --tree' to see information about modules", 
            "Warning: Scope(Class[Neutron]): neutron::rabbit_host, neutron::rabbit_hosts, neutron::rabbit_password, neutron::rabbit_port, neutron::rabbit_user, neutron::rabbit_virtual_host and neutron::rpc_backend are deprecated. Please use neutron::default_transport_url instead.", 
            "Warning: This method is deprecated, please use match expressions with Stdlib::Compat::Array instead. They are described at https://docs.puppet.com/puppet/latest/reference/lang_data_type.html#match-expressions. at [\"/etc/puppet/modules/neutron/manifests/init.pp\", 530]:[\"/etc/puppet/modules/tripleo/manifests/profile/base/neutron/plugins/ml2.pp\", 45]", 
            "   (at /etc/puppet/modules/stdlib/lib/puppet/functions/deprecation.rb:28:in `deprecation')", 
            "Warning: This method is deprecated, please use the stdlib validate_legacy function,", 
            "                    with Stdlib::Compat::Hash. There is further documentation for validate_legacy function in the README. at [\"/etc/puppet/modules/neutron/manifests/config.pp\", 132]:[\"/etc/puppet/modules/tripleo/manifests/profile/base/neutron.pp\", 131]", 
            "                    with Stdlib::Compat::Array. There is further documentation for validate_legacy function in the README. at [\"/etc/puppet/modules/neutron/manifests/agents/ml2/ovs.pp\", 219]:[\"/etc/puppet/modules/tripleo/manifests/profile/base/neutron/ovs.pp\", 59]", 
            "Warning: ModuleLoader: module 'oslo' has unresolved dependencies - it will only see those that are resolved. Use 'puppet module list --tree' to see information about modules", 
            "Error: Failed to apply catalog: Execution of '/usr/bin/ovs-vsctl list Open_vSwitch .' returned 1: ovs-vsctl: unix:/var/run/openvswitch/db.sock: database connection failed (No such file or directory)", 
            "+ rc=1", 
            "+ set -e", 
            "+ '[' 1 -ne 2 -a 1 -ne 0 ']'", 
            "+ exit 1", 
            "2018-05-10 11:39:56,947 INFO: 12173 -- Finished processing puppet configs for neutron", 
            "2018-05-10 11:40:00,788 INFO: 12172 -- Removing container: docker-puppet-iscsid", 
            "2018-05-10 11:40:00,826 INFO: 12172 -- Finished processing puppet configs for iscsid", 
            "2018-05-10 11:40:09,136 INFO: 12171 -- Removing container: docker-puppet-nova_libvirt", 
            "2018-05-10 11:40:09,175 INFO: 12171 -- Finished processing puppet configs for nova_libvirt", 
            "2018-05-10 11:40:09,175 ERROR: 12170 -- ERROR configuring neutron"
        ]
    }
    	to retry, use: --limit @/var/lib/heat-config/heat-config-ansible/0287566c-cf41-430c-bf32-b79ffeb51128_playbook.retry
    
    PLAY RECAP *********************************************************************
    localhost                  : ok=25   changed=12   unreachable=0    failed=1   
    
  deploy_stderr: |


[jianzzha@jianzzha ~]$

Comment 16 Artom Lifshitz 2018-05-11 00:22:13 UTC
> /usr/share/openstack-tripleo-heat-templates/environments/ovs-dpdk-
> permissions.yaml   is missing from the deployment. Once this template is
> applied for overcloud, I can start the guest.

While there are other ongoing issues, IIUC there is no nova bug here. I'm going to close this BZ as a NOTABUG. By all means reopen it if there's a nova bug that engineering can help with.

Cheers!

Comment 17 Saravanan KR 2018-05-11 10:03:32 UTC
(In reply to jianzzha from comment #15)
>             "Error: Failed to apply catalog: Execution of
> '/usr/bin/ovs-vsctl list Open_vSwitch .' returned 1: ovs-vsctl:
> unix:/var/run/openvswitch/db.sock: database connection failed (No such file
> or directory)", 
>             "+ rc=1", 
>             "+ set -e", 
This issue is related to BZ #1571297, which has been fixed in the latest puddle 2018-05-07.2. Could you use this puddle and re-check it?

Comment 18 jianzzha 2018-05-14 13:50:09 UTC
(In reply to Saravanan KR from comment #17)
> (In reply to jianzzha from comment #15)
> >             "Error: Failed to apply catalog: Execution of
> > '/usr/bin/ovs-vsctl list Open_vSwitch .' returned 1: ovs-vsctl:
> > unix:/var/run/openvswitch/db.sock: database connection failed (No such file
> > or directory)", 
> >             "+ rc=1", 
> >             "+ set -e", 
> This issue is related to BZ #1571297, which has been fixed in the latest
> puddle 2018-05-07.2. Could you use this puddle and re-check it?

Thanks Saravanan, I have proceeded to the OSP13 Mellanox perf evaluation without the docker ovs agent. Will let you know once I'm done with that and re-install with latest puddle.