Bug 1489631

Summary:          OVS-dpdk: A vhostuser server guest hangs on start
Product:          Red Hat Enterprise Linux 7
Component:        openvswitch
Version:          7.4
Hardware:         x86_64
OS:               Linux
Status:           CLOSED NOTABUG
Severity:         high
Priority:         high
Reporter:         Jean-Tsung Hsiao <jhsiao>
Assignee:         Aaron Conole <aconole>
QA Contact:       Jean-Tsung Hsiao <jhsiao>
CC:               aconole, atragler, echaudro, fleitner, jhsiao, ktraynor, kzhang, rcain
Target Milestone: rc
Type:             Bug
Last Closed:      2017-10-13 18:02:51 UTC

Description Jean-Tsung Hsiao 2017-09-08 00:49:28 UTC
Description of problem: OVS-dpdk: A vhostuser server guest hangs on start

A VM was configured with a vhostuser server port:
<source type='unix' path='/tmp/vhost0' mode='server'/>

When started, it hung after qemu-kvm created /tmp/vhost0.
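
(For reference, the full <interface> stanza in the domain XML looks roughly like this --- a sketch, with the MAC address and queue count taken from the qemu command line in comment #4:)

<interface type='vhostuser'>
  <mac address='52:54:00:32:0f:71'/>
  <source type='unix' path='/tmp/vhost0' mode='server'/>
  <model type='virtio'/>
  <driver queues='2'/>
</interface>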

Version-Release number of selected component (if applicable):

[root@netqe19 ~]# uname -a
Linux netqe19.knqe.lab.eng.bos.redhat.com 3.10.0-693.el7.x86_64 #1 SMP Thu Jul 6 19:56:57 EDT 2017 x86_64 x86_64 x86_64 GNU/Linux

[root@netqe19 ~]# rpm -q openvswitch
openvswitch-2.8.0-0.2.20170810git3631ed2.el7fdb.x86_64

[root@netqe19 ~]# rpm -q qemu-kvm-rhev
qemu-kvm-rhev-2.6.0-25.el7.x86_64
[root@netqe19 ~]# 

*** NOTE: When OVS 2.7.0-10 was deployed, the VM started successfully. ***

How reproducible: Reproducible


Steps to Reproduce:
1. Configure an OVS-dpdk bridge with a dpdkvhostuserclient port (see the sketch after these steps)
2. Configure the corresponding VM with a vhostuser server port
3. Start the bridge
4. Start the VM
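
(A sketch of steps 1 and 2; the bridge and port names br0/vhost0 are illustrative, and the socket path matches the one used in this report:)

# OVS side: userspace bridge with a vhost-user client port
ovs-vsctl add-br br0 -- set bridge br0 datapath_type=netdev
ovs-vsctl add-port br0 vhost0 -- set Interface vhost0 type=dpdkvhostuserclient options:vhost-server-path=/tmp/vhost0

# VM side: mode='server' on the <source> element, as in the description above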

Actual results:
VM hangs on start

Expected results:
VM should start successfully.

Additional info:

Comment 2 Kevin Traynor 2017-09-22 09:28:54 UTC
Hi, this works for me using qemu. I've tested with
qemu-kvm-rhev-2.6.0-25.el7.x86_64.rpm
using both 
openvswitch-2.7.0-10.git20170530.el7fdb.x86_64.rpm and
openvswitch-2.8.0-1.el7fdb.x86_64.rpm

The package openvswitch-2.8.0-0.2.20170810git3631ed2.el7fdb.x86_64 was a pre-release and is not in brew.
Can you retest with openvswitch-2.8.0-1.el7fdb.x86_64.rpm ?

If it still doesn't work, then please provide the virsh dumpxml output, exact steps to reproduce, the last log message from qemu, and numactl -H

thanks,
Kevin.

Comment 3 Jean-Tsung Hsiao 2017-09-22 13:23:07 UTC
(In reply to Kevin Traynor from comment #2)

Yes, I will.
Thanks!
Jean

Comment 4 Jean-Tsung Hsiao 2017-09-23 17:36:18 UTC
Hi Kevin,
I tried with openvswitch-2.8.0-1.el7fdb.x86_64.rpm, but the same issue remains.
Please see the details attached below.
What else should I try?
Thanks!
Jean

*** kernel and OVS ***
[root@netqe5 vxlan-tunnel]# uname -a
Linux netqe5.knqe.lab.eng.bos.redhat.com 4.11.0-16.el7a.x86_64 #1 SMP Mon Jul 24 22:25:46 EDT 2017 x86_64 x86_64 x86_64 GNU/Linux
[root@netqe5 vxlan-tunnel]# rpm -q openvswitch
openvswitch-2.8.0-1.el7fdb.x86_64

*** virsh start mq-vhu-tunnel-server hang here ***
virsh # start mq-vhu-tunnel-server

*** It's now in paused state ***
[root@netqe5 vxlan-tunnel]# virsh dominfo mq-vhu-tunnel-server
Id:             5
Name:           mq-vhu-tunnel-server
UUID:           599be106-543c-4c3a-b65f-4e11f5bd8d79
OS Type:        hvm
State:          paused
CPU(s):         5
CPU time:       0.1s
Max memory:     8388608 KiB
Used memory:    8388608 KiB
Persistent:     yes
Autostart:      disable
Managed save:   no
Security model: selinux
Security DOI:   0
Security label: system_u:system_r:svirt_t:s0:c416,c1017 (permissive)

*** socket created ***
[root@netqe5 vxlan-tunnel]# ls -l /tmp/vhost0 
srwxrwxr-x. 1 qemu qemu 0 Sep 23 13:24 /tmp/vhost0

*** qemu task ***
[root@netqe5 vxlan-tunnel]# ps -elf | grep qemu-kvm
6 S qemu      2412     1  0  80   0 - 77917 skb_wa 13:24 ?        00:00:00 /usr/libexec/qemu-kvm -name guest=mq-vhu-tunnel-server,debug-threads=on -S -object secret,id=masterKey0,format=raw,file=/var/lib/libvirt/qemu/domain-5-mq-vhu-tunnel-server/master-key.aes -machine pc-i440fx-rhel7.3.0,accel=kvm,usb=off,dump-guest-core=off -cpu Haswell-noTSX -m 8192 -realtime mlock=off -smp 5,sockets=5,cores=1,threads=1 -object memory-backend-file,id=ram-node0,prealloc=yes,mem-path=/dev/hugepages/libvirt/qemu/5-mq-vhu-tunnel-server,share=yes,size=8589934592 -numa node,nodeid=0,cpus=0,memdev=ram-node0 -uuid 599be106-543c-4c3a-b65f-4e11f5bd8d79 -no-user-config -nodefaults -chardev socket,id=charmonitor,path=/var/lib/libvirt/qemu/domain-5-mq-vhu-tunnel-server/monitor.sock,server,nowait -mon chardev=charmonitor,id=monitor,mode=control -rtc base=utc,driftfix=slew -global kvm-pit.lost_tick_policy=delay -no-hpet -no-shutdown -global PIIX4_PM.disable_s3=1 -global PIIX4_PM.disable_s4=1 -boot strict=on -device ich9-usb-ehci1,id=usb,bus=pci.0,addr=0x5.0x7 -device ich9-usb-uhci1,masterbus=usb.0,firstport=0,bus=pci.0,multifunction=on,addr=0x5 -device ich9-usb-uhci2,masterbus=usb.0,firstport=2,bus=pci.0,addr=0x5.0x1 -device ich9-usb-uhci3,masterbus=usb.0,firstport=4,bus=pci.0,addr=0x5.0x2 -device virtio-serial-pci,id=virtio-serial0,bus=pci.0,addr=0x4 -drive file=/home/images/mq_vhu_tunnel_server.img,format=qcow2,if=none,id=drive-virtio-disk0 -device virtio-blk-pci,scsi=off,bus=pci.0,addr=0x6,drive=drive-virtio-disk0,id=virtio-disk0,bootindex=1 -chardev socket,id=charnet0,path=/tmp/vhost0,server -netdev vhost-user,chardev=charnet0,queues=2,id=hostnet0 -device virtio-net-pci,mq=on,vectors=6,netdev=hostnet0,id=net0,mac=52:54:00:32:0f:71,bus=pci.0,addr=0x3 -netdev tap,fd=25,id=hostnet1,vhost=on,vhostfd=27 -device virtio-net-pci,netdev=hostnet1,id=net1,mac=52:54:00:77:96:95,bus=pci.0,addr=0xb -chardev pty,id=charserial0 -device isa-serial,chardev=charserial0,id=serial0 -chardev socket,id=charchannel0,path=/var/lib/libvirt/qemu/channel/target/domain-5-mq-vhu-tunnel-server/org.qemu.guest_agent.0,server,nowait -device virtserialport,bus=virtio-serial0.0,nr=1,chardev=charchannel0,id=channel0,name=org.qemu.guest_agent.0 -device usb-tablet,id=input0,bus=usb.0,port=1 -vnc 127.0.0.1:0 -device cirrus-vga,id=video0,bus=pci.0,addr=0x2 -device virtio-balloon-pci,id=balloon0,bus=pci.0,addr=0x7 -msg timestamp=on

Comment 5 Kevin Traynor 2017-10-03 18:25:51 UTC
Hanging after the socket /tmp/vhost0 is created can indicate that qemu is waiting for ovs-dpdk to connect to the socket. Is ovs-dpdk running and trying to connect to the right socket location?
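
(A quick sketch of how to check, assuming iproute's ss and ovs-vsctl are available: list unix sockets attached to the path, and look for a per-port error in OVS:)

ss -x | grep vhost0
ovs-vsctl --columns=name,options,error list Interface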

Can you run an sosreport when the hang happens?

Comment 6 Jean-Tsung Hsiao 2017-10-04 11:30:27 UTC
(In reply to Kevin Traynor from comment #5)
> Hanging after the socket is /tmp/vhost0 is created can indicate it is
> waiting for the ovs-dpdk to connect to the socket. Is ovs-dpdk running and

The guest is the server now; "virsh start" should just return, whether ovs-dpdk is running or not. But the domain state of the guest is paused --- see comment #4.

NOTE: The same configuration and process work with 2.7. This is a 2.8-specific issue.

> trying to connect to the right socket location?
> 
> Can you run an sosreport when the hang happens?

I will when I have a chance to reproduce the issue.

Comment 7 Jean-Tsung Hsiao 2017-10-04 12:57:22 UTC
(In reply to Jean-Tsung Hsiao from comment #6)

http://file.bos.redhat.com/~jhsiao/sosreport-JHsiao.1489631-20171004083450.tar.xz

Hi Kevin,

Reproduced the issue again with ovs-2.8.0-1. Please get the sosreport from the link above.

Here are the steps:

* Start OVS-dpdk first, then
* start the guest

Following the same steps, there is no such issue with 2.7.0-10 or 2.7.2-10.

More questions?

Thanks!

Jean

Comment 8 Jean-Tsung Hsiao 2017-10-04 13:30:05 UTC
Hi Kevin,

Here is an update.

If the guest is started first, it will be in the paused state. This is true for both 2.7 and 2.8.

Once OVS-dpdk gets started, the state becomes active. This is true for 2.7, but not for 2.8.

Thanks!

Jean

Comment 9 Kevin Traynor 2017-10-05 17:01:35 UTC
> http://file.bos.redhat.com/~jhsiao/sosreport-JHsiao.1489631-20171004083450.tar.xz

Hi Jean, Can you put this on netqe5? I don't seem to have permission to pick it up in the link above.

Comment 10 Jean-Tsung Hsiao 2017-10-05 23:53:53 UTC
(In reply to Kevin Traynor from comment #9)

[root@netqe5 ~]# ll
total 11608
-rw-------. 1 root root    15689 Jul  6 12:16 anaconda-ks.cfg
drwxr-xr-x. 7 root root      226 Jul 26 09:01 kev
-rw-------. 1 root root    14099 Jul  6 12:16 original-ks.cfg
-rw-------. 1 root root 11849948 Oct  4 08:35 sosreport-JHsiao.1489631-20171004083450.tar.xz
[root@netqe5 ~]#

NOTE: I also made the following file readable.

http://file.bos.redhat.com/~jhsiao/sosreport-JHsiao.1489631-20171004083450.tar.xz

Comment 11 Kevin Traynor 2017-10-09 16:50:24 UTC
I tested on Jean's board and my own. I think the issue is related to permissions when accessing memory under systemd. After starting under systemd, here is a comparison of OVS 2.7 and OVS 2.8 when DPDK is initialized:

[root@netqe5 qemu_hang]# ovs-vsctl -t 5 get Open_vSwitch . other_config
{dpdk-extra="--log-level 8", dpdk-lcore-mask="0x4", dpdk-socket-mem="1024"}
[root@netqe5 qemu_hang]# cat /proc/meminfo | grep Huge
AnonHugePages:      4096 kB
ShmemHugePages:        0 kB
HugePages_Total:      32
HugePages_Free:       31
HugePages_Rsvd:        0
HugePages_Surp:        0
Hugepagesize:    1048576 kB
[root@netqe5 qemu_hang]# grep huge /proc/mounts 
cgroup /sys/fs/cgroup/hugetlb cgroup rw,nosuid,nodev,noexec,relatime,hugetlb 0 0
hugetlbfs /dev/hugepages hugetlbfs rw,seclabel,relatime 0 0
[root@netqe5 qemu_hang]# cat /sys/devices/system/node/node*/meminfo | fgrep Huge
Node 0 AnonHugePages:      4096 kB
Node 0 ShmemHugePages:        0 kB
Node 0 HugePages_Total:    16
Node 0 HugePages_Free:     15
Node 0 HugePages_Surp:      0
Node 1 AnonHugePages:         0 kB
Node 1 ShmemHugePages:        0 kB
Node 1 HugePages_Total:    16
Node 1 HugePages_Free:     16
Node 1 HugePages_Surp:      0
[root@netqe5 qemu_hang]# ovs-vsctl --no-wait set Open_vSwitch . other_config:dpdk-init=true



######openvswitch-2.8.0-1.el7fdb.x86_64:

2017-10-09T14:58:21.226Z|00001|vlog|INFO|opened log file /var/log/openvswitch/ovs-vswitchd.log
2017-10-09T14:58:21.230Z|00002|ovs_numa|INFO|Discovered 12 CPU cores on NUMA node 0
2017-10-09T14:58:21.230Z|00003|ovs_numa|INFO|Discovered 12 CPU cores on NUMA node 1
2017-10-09T14:58:21.230Z|00004|ovs_numa|INFO|Discovered 2 NUMA nodes and 24 CPU cores
2017-10-09T14:58:21.230Z|00005|reconnect|INFO|unix:/var/run/openvswitch/db.sock: connecting...
2017-10-09T14:58:21.230Z|00006|reconnect|INFO|unix:/var/run/openvswitch/db.sock: connected
2017-10-09T14:58:21.231Z|00007|dpdk|INFO|DPDK Disabled - Use other_config:dpdk-init to enable
2017-10-09T14:58:21.232Z|00008|bridge|INFO|ovs-vswitchd (Open vSwitch) 2.8.0
2017-10-09T14:59:00.127Z|00009|memory|INFO|18020 kB peak resident set size after 38.9 seconds

2017-10-09T15:01:22.683Z|00010|dpdk|INFO|DPDK Enabled - initializing...
2017-10-09T15:01:22.683Z|00011|dpdk|INFO|No vhost-sock-dir provided - defaulting to /var/run/openvswitch
2017-10-09T15:01:22.683Z|00012|dpdk|INFO|EAL ARGS: ovs-vswitchd -c 0x4 --socket-mem 1024 --log-level 8
2017-10-09T15:01:22.684Z|00013|dpdk|INFO|EAL: Detected 24 lcore(s)
2017-10-09T15:01:22.685Z|00014|dpdk|INFO|EAL: Probing VFIO support...
2017-10-09T15:01:22.685Z|00015|dpdk|INFO|EAL: VFIO support initialized
2017-10-09T15:01:22.687Z|00016|dpdk|ERR|EAL: rte_mem_virt2phy(): cannot open /proc/self/pagemap: Permission denied
2017-10-09T15:01:22.687Z|00017|dpdk|ERR|EAL: Cannot obtain physical addresses: Permission denied. Only vfio will function.
2017-10-09T15:01:22.687Z|00018|dpdk|ERR|EAL: Not enough memory available on socket 0! Requested: 1024MB, available: 0MB
2017-10-09T15:01:22.687Z|00019|dpdk|ERR|EAL: Cannot init memory
2017-10-09T15:01:23.059Z|00001|vlog|INFO|opened log file /var/log/openvswitch/ovs-vswitchd.log
2017-10-09T15:01:23.063Z|00002|ovs_numa|INFO|Discovered 12 CPU cores on NUMA node 0
2017-10-09T15:01:23.063Z|00003|ovs_numa|INFO|Discovered 12 CPU cores on NUMA node 1
2017-10-09T15:01:23.063Z|00004|ovs_numa|INFO|Discovered 2 NUMA nodes and 24 CPU cores
2017-10-09T15:01:23.063Z|00005|reconnect|INFO|unix:/var/run/openvswitch/db.sock: connecting...
2017-10-09T15:01:23.063Z|00006|reconnect|INFO|unix:/var/run/openvswitch/db.sock: connected
2017-10-09T15:01:23.064Z|00007|dpdk|INFO|DPDK Enabled - initializing...
2017-10-09T15:01:23.064Z|00008|dpdk|INFO|No vhost-sock-dir provided - defaulting to /var/run/openvswitch
2017-10-09T15:01:23.064Z|00009|dpdk|INFO|EAL ARGS: ovs-vswitchd -c 0x4 --socket-mem 1024 --log-level 8
2017-10-09T15:01:23.065Z|00010|dpdk|INFO|EAL: Detected 24 lcore(s)
2017-10-09T15:01:23.065Z|00011|dpdk|INFO|EAL: Probing VFIO support...
2017-10-09T15:01:23.065Z|00012|dpdk|INFO|EAL: VFIO support initialized
2017-10-09T15:01:23.068Z|00013|dpdk|ERR|EAL: rte_mem_virt2phy(): cannot open /proc/self/pagemap: Permission denied
2017-10-09T15:01:23.068Z|00014|dpdk|ERR|EAL: Cannot obtain physical addresses: Permission denied. Only vfio will function.
2017-10-09T15:01:23.068Z|00015|dpdk|ERR|EAL: Not enough memory available on socket 0! Requested: 1024MB, available: 0MB
2017-10-09T15:01:23.068Z|00016|dpdk|ERR|EAL: Cannot init memory
2017-10-09T15:01:23.073Z|00002|daemon_unix|ERR|fork child died before signaling startup (killed (Aborted))
2017-10-09T15:01:23.073Z|00003|daemon_unix|EMER|could not detach from foreground session
[The same failure cycle --- "EAL: Cannot init memory" followed by "fork child died before signaling startup (killed (Aborted))" and "could not detach from foreground session" --- repeats four more times as the daemon restarts (15:01:23.379, 15:01:23.629, 15:01:23.879, 15:01:24.129); the duplicate log cycles are omitted.]



######openvswitch-2.7.2-10.git20170914.el7fdp.x86_64:

2017-10-09T15:14:58.928Z|00001|vlog|INFO|opened log file /var/log/openvswitch/ovs-vswitchd.log
2017-10-09T15:14:58.931Z|00002|ovs_numa|INFO|Discovered 12 CPU cores on NUMA node 0
2017-10-09T15:14:58.931Z|00003|ovs_numa|INFO|Discovered 12 CPU cores on NUMA node 1
2017-10-09T15:14:58.931Z|00004|ovs_numa|INFO|Discovered 2 NUMA nodes and 24 CPU cores
2017-10-09T15:14:58.931Z|00005|reconnect|INFO|unix:/var/run/openvswitch/db.sock: connecting...
2017-10-09T15:14:58.931Z|00006|reconnect|INFO|unix:/var/run/openvswitch/db.sock: connected
2017-10-09T15:14:58.932Z|00007|dpdk|INFO|DPDK Disabled - Use other_config:dpdk-init to enable
2017-10-09T15:14:58.933Z|00008|bridge|INFO|ovs-vswitchd (Open vSwitch) 2.7.2
2017-10-09T15:15:12.043Z|00009|memory|INFO|15284 kB peak resident set size after 13.1 seconds
2017-10-09T15:15:22.211Z|00010|dpdk|INFO|DPDK Enabled - initializing...
2017-10-09T15:15:22.211Z|00011|dpdk|INFO|No vhost-sock-dir provided - defaulting to /var/run/openvswitch
2017-10-09T15:15:22.211Z|00012|dpdk|INFO|EAL ARGS: ovs-vswitchd -c 0x4 --socket-mem 1024 --log-level 8
2017-10-09T15:15:22.212Z|00013|dpdk|INFO|EAL: Detected 24 lcore(s)
2017-10-09T15:15:22.235Z|00014|dpdk|INFO|EAL: Probing VFIO support...
2017-10-09T15:15:22.235Z|00015|dpdk|INFO|EAL: VFIO support initialized
2017-10-09T15:15:30.851Z|00016|dpdk|INFO|EAL: PCI device 0000:81:00.0 on NUMA socket 1
2017-10-09T15:15:30.851Z|00017|dpdk|INFO|EAL:   probe driver: 8086:154d net_ixgbe
2017-10-09T15:15:30.851Z|00018|dpdk|INFO|EAL:   using IOMMU type 1 (Type 1)
2017-10-09T15:15:31.021Z|00019|dpdk|INFO|EAL: Ignore mapping IO port bar(2) addr: 8021
2017-10-09T15:15:31.277Z|00020|dpdk|INFO|EAL: PCI device 0000:81:00.1 on NUMA socket 1
2017-10-09T15:15:31.277Z|00021|dpdk|INFO|EAL:   probe driver: 8086:154d net_ixgbe
2017-10-09T15:15:31.441Z|00022|dpdk|INFO|EAL: Ignore mapping IO port bar(2) addr: 8001
2017-10-09T15:15:31.697Z|00023|dpdk|INFO|EAL: PCI device 0000:83:00.0 on NUMA socket 1
2017-10-09T15:15:31.697Z|00024|dpdk|INFO|EAL:   probe driver: 8086:1528 net_ixgbe
2017-10-09T15:15:31.697Z|00025|dpdk|INFO|EAL: PCI device 0000:83:00.1 on NUMA socket 1
2017-10-09T15:15:31.697Z|00026|dpdk|INFO|EAL:   probe driver: 8086:1528 net_ixgbe
2017-10-09T15:15:31.697Z|00027|dpdk|INFO|EAL: PCI device 0000:84:00.0 on NUMA socket 1
2017-10-09T15:15:31.697Z|00028|dpdk|INFO|EAL:   probe driver: 8086:1583 net_i40e
2017-10-09T15:15:31.994Z|00029|dpdk|INFO|PMD: eth_i40e_dev_init(): FW 5.0 API 1.5 NVM 05.00.04 eetrack 80002505
2017-10-09T15:15:32.097Z|00030|dpdk|INFO|EAL: PCI device 0000:84:00.1 on NUMA socket 1
2017-10-09T15:15:32.097Z|00031|dpdk|INFO|EAL:   probe driver: 8086:1583 net_i40e
2017-10-09T15:15:32.102Z|00032|dpdk|INFO|DPDK Enabled - initialized

Comment 12 Kevin Traynor 2017-10-09 16:55:29 UTC
I've also seen permission issues when running the openvswitch service:

-- Unit ovs-vswitchd.service has begun starting up.
Oct 09 06:28:25 netqe5.knqe.lab.eng.bos.redhat.com ovs-vswitchd[4996]: ovs|00013|dpdk|ERR|EAL: rte_mem_virt2phy(): cannot open /proc/self/pagemap: Permission denied
Oct 09 06:28:25 netqe5.knqe.lab.eng.bos.redhat.com ovs-vswitchd[4996]: ovs|00014|dpdk|ERR|EAL: Cannot obtain physical addresses: Permission denied. Only vfio will function.
Oct 09 06:28:33 netqe5.knqe.lab.eng.bos.redhat.com ovs-vswitchd[4996]: ovs|00017|dpdk|ERR|EAL: Cannot open /dev/vfio/17: Permission denied
Oct 09 06:28:33 netqe5.knqe.lab.eng.bos.redhat.com ovs-vswitchd[4996]: ovs|00018|dpdk|ERR|EAL: Requested device 0000:04:00.0 cannot be used
Oct 09 06:28:33 netqe5.knqe.lab.eng.bos.redhat.com ovs-vswitchd[4996]: ovs|00021|dpdk|ERR|EAL: Cannot open /dev/vfio/17: Permission denied
Oct 09 06:28:33 netqe5.knqe.lab.eng.bos.redhat.com ovs-vswitchd[4996]: ovs|00022|dpdk|ERR|EAL: Requested device 0000:04:00.1 cannot be used
Oct 09 06:28:33 netqe5.knqe.lab.eng.bos.redhat.com ovs-vswitchd[4996]: ovs|00025|dpdk|ERR|EAL: Cannot open /dev/vfio/40: Permission denied
Oct 09 06:28:33 netqe5.knqe.lab.eng.bos.redhat.com ovs-vswitchd[4996]: ovs|00026|dpdk|ERR|EAL: Requested device 0000:81:00.0 cannot be used
Oct 09 06:28:33 netqe5.knqe.lab.eng.bos.redhat.com ovs-vswitchd[4996]: ovs|00029|dpdk|ERR|EAL: Cannot open /dev/vfio/41: Permission denied
Oct 09 06:28:33 netqe5.knqe.lab.eng.bos.redhat.com ovs-vswitchd[4996]: ovs|00030|dpdk|ERR|EAL: Requested device 0000:81:00.1 cannot be used
Oct 09 06:28:33 netqe5.knqe.lab.eng.bos.redhat.com ovs-vswitchd[4996]: ovs|00037|dpdk|ERR|EAL: Cannot open /dev/vfio/45: Permission denied
Oct 09 06:28:33 netqe5.knqe.lab.eng.bos.redhat.com ovs-vswitchd[4996]: ovs|00038|dpdk|ERR|EAL: Requested device 0000:84:00.0 cannot be used

Comment 13 Kevin Traynor 2017-10-09 16:56:43 UTC
And a similar error when dpdk-init=true is set before starting ovs-vswitchd:

2017-10-09T16:45:08.473Z|00007|dpdk|INFO|DPDK Enabled - initializing...
2017-10-09T16:45:08.473Z|00008|dpdk|INFO|No vhost-sock-dir provided - defaulting to /var/run/openvswitch
2017-10-09T16:45:08.473Z|00009|dpdk|INFO|EAL ARGS: ovs-vswitchd -c 0x4 --socket-mem 1024,1024 --log-level 8
2017-10-09T16:45:08.474Z|00010|dpdk|INFO|EAL: Detected 28 lcore(s)
2017-10-09T16:45:08.474Z|00011|dpdk|INFO|EAL: Probing VFIO support...
2017-10-09T16:45:08.474Z|00012|dpdk|INFO|EAL: VFIO support initialized
2017-10-09T16:45:16.922Z|00013|dpdk|INFO|EAL: PCI device 0000:01:00.0 on NUMA socket 0
2017-10-09T16:45:16.922Z|00014|dpdk|INFO|EAL:   probe driver: 8086:10fb net_ixgbe
2017-10-09T16:45:16.922Z|00015|dpdk|ERR|EAL: Cannot open /dev/vfio/15: Permission denied
2017-10-09T16:45:16.922Z|00016|dpdk|ERR|EAL: Requested device 0000:01:00.0 cannot be used
2017-10-09T16:45:16.922Z|00017|dpdk|INFO|EAL: PCI device 0000:01:00.1 on NUMA socket 0
2017-10-09T16:45:16.922Z|00018|dpdk|INFO|EAL:   probe driver: 8086:10fb net_ixgbe
2017-10-09T16:45:16.922Z|00019|dpdk|ERR|EAL: Cannot open /dev/vfio/16: Permission denied
2017-10-09T16:45:16.922Z|00020|dpdk|ERR|EAL: Requested device 0000:01:00.1 cannot be used
2017-10-09T16:45:16.922Z|00021|dpdk|INFO|EAL: PCI device 0000:07:00.0 on NUMA socket 0
2017-10-09T16:45:16.922Z|00022|dpdk|INFO|EAL:   probe driver: 8086:1521 net_e1000_igb
2017-10-09T16:45:16.922Z|00023|dpdk|INFO|EAL: PCI device 0000:07:00.1 on NUMA socket 0
2017-10-09T16:45:16.922Z|00024|dpdk|INFO|EAL:   probe driver: 8086:1521 net_e1000_igb
2017-10-09T16:45:16.923Z|00025|dpdk|INFO|DPDK pdump packet capture enabled
2017-10-09T16:45:16.927Z|00026|dpdk|INFO|DPDK Enabled - initialized
2017-10-09T16:45:16.929Z|00027|bridge|INFO|ovs-vswitchd (Open vSwitch) 2.8.0

Comment 14 Kevin Traynor 2017-10-09 17:25:55 UTC
Jean, I spoke to Aaron about this issue. He says a reboot is needed when upgrading from OVS 2.7 to OVS 2.8. Can you test with a reboot between them?

Comment 15 Jean-Tsung Hsiao 2017-10-10 02:39:26 UTC
Hi Kevin,

I have provisioned a new system, netqe19 (10.19.15.69), and installed OVS 2.8.0-1. I still encountered the same issue.

Please log in and debug there.

Thanks!

Jean

Comment 16 Kevin Traynor 2017-10-10 16:57:00 UTC
I tested today on my lab machine with reboots. What I am seeing is that when ovs-vswitchd is run through systemd as user openvswitch, it can't connect to the vhost sockets created by qemu or libvirt/qemu.
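
(This is consistent with the socket listing in comment #4: /tmp/vhost0 is owned by qemu:qemu with mode srwxrwxr-x, and connecting to a unix socket requires write permission on it, so a connect() by the openvswitch user fails. A hypothetical way to confirm, assuming ncat is installed:)

ls -l /tmp/vhost0
sudo -u openvswitch nc -U /tmp/vhost0    # expect "Permission denied" if this is the cause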

Comment 17 Aaron Conole 2017-10-10 20:07:31 UTC
Here's the deal:

1. QEMU, when started by libvirt, spawns the vhost chardev as a socket in server mode.  Whenever a socket chardev is spawned as a listening socket (which a server socket is), it defaults to 'wait for connection'.  This means that the qemu process will *always* wait for a connection on a server socket for vhostuser unless a chardev option directs it not to.  Hence the 'stuck' observation (see the chardev sketch after point 2).

2. When libvirt starts QEMU, it first checks in /etc/libvirt/qemu.conf for a user/group to spawn the vm process (qemu).  This defaults to qemu:qemu.
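
(Regarding point 1: this matches the qemu command line in comment #4, where the vhost chardev is created with 'server' and no 'nowait' flag.  A sketch of the difference:

-chardev socket,id=charnet0,path=/tmp/vhost0,server           # qemu waits here until a client connects
-chardev socket,id=charnet0,path=/tmp/vhost0,server,nowait    # qemu would continue without waiting

libvirt did not add 'nowait' here, so the wait itself is expected behavior.)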

To get rid of this, edit /etc/libvirt/qemu.conf and change the line:

#group = "root"

to 

group = "hugetlbfs"

Please try this, and see if it gets you past the problem.  I don't think this is a bug in the code, but it will need to be documented somewhere for client-mode support.
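
(Putting it together, a sketch of the workaround, assuming the hugetlbfs group covers access to the hugepage mount at /dev/hugepages:

# /etc/libvirt/qemu.conf
group = "hugetlbfs"

# restart libvirtd so newly started VMs pick up the new group
systemctl restart libvirtd

The idea is that the qemu process spawned by libvirt then runs with a group that has permission on the shared hugepage memory that vhost-user needs.)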

Comment 18 Jean-Tsung Hsiao 2017-10-10 22:26:55 UTC
(In reply to Aaron Conole from comment #17)

Hi Aaron,

I tried what you instructed on netqe19, but still encountered the same issue. I even changed the user to openvswitch --- same issue.

Feel free to log into netqe19 and check.

Thanks!

Jean

Comment 21 Aaron Conole 2017-10-13 18:02:51 UTC
This is something that needs to be changed in libvirtd.  We have bug 1478791 to address this, but it has to be done from the triple-o / orchestration side.

Thanks for the confirmation, Jean.