Description of problem:
openstack-neutron-openvswitch-agent:2018-08-17.2 is missing the libibverbs package and overcloud instances cannot be spawned. After the overcloud gets deployed we can see in the neutron_ovs_agent container logs:

INFO:__main__:Setting permission for /var/log/neutron/openvswitch-agent.log
++ cat /run_command
+ CMD=/neutron_ovs_agent_launcher.sh
+ ARGS=
+ [[ ! -n '' ]]
+ . kolla_extend_start
++ [[ ! -d /var/log/kolla/neutron ]]
+++ stat -c %a /var/log/kolla/neutron
++ [[ 2755 != \7\5\5 ]]
++ chmod 755 /var/log/kolla/neutron
++ . /usr/local/bin/kolla_neutron_extend_start
+ echo 'Running command: '\''/neutron_ovs_agent_launcher.sh'\'''
+ exec /neutron_ovs_agent_launcher.sh
Running command: '/neutron_ovs_agent_launcher.sh'
+ /usr/bin/python -m neutron.cmd.destroy_patch_ports --config-file /usr/share/neutron/neutron-dist.conf --config-file /etc/neutron/neutron.conf --config-file /etc/neutron/plugins/ml2/openvswitch_agent.ini --config-dir /etc/neutron/conf.d/common --config-dir /etc/neutron/conf.d/neutron-openvswitch-agent
PMD: net_mlx5: cannot load glue library: libibverbs.so.1: cannot open shared object file: No such file or directory
PMD: net_mlx5: cannot initialize PMD due to missing run-time dependency on rdma-core libraries (libibverbs, libmlx5)
PMD: net_mlx4: cannot load glue library: libibverbs.so.1: cannot open shared object file: No such file or directory
PMD: net_mlx4: cannot initialize PMD due to missing run-time dependency on rdma-core libraries (libibverbs, libmlx4)
+ /usr/bin/neutron-openvswitch-agent --config-file /usr/share/neutron/neutron-dist.conf --config-file /etc/neutron/neutron.conf --config-file /etc/neutron/plugins/ml2/openvswitch_agent.ini --config-file /etc/neutron/plugins/ml2/ml2_conf.ini --config-dir /etc/neutron/conf.d/common --log-file=/var/log/neutron/openvswitch-agent.log
PMD: net_mlx5: cannot load glue library: libibverbs.so.1: cannot open shared object file: No such file or directory
PMD: net_mlx5: cannot initialize PMD due to missing run-time dependency on rdma-core libraries (libibverbs, libmlx5)
PMD: net_mlx4: cannot load glue library: libibverbs.so.1: cannot open shared object file: No such file or directory
PMD: net_mlx4: cannot initialize PMD due to missing run-time dependency on rdma-core libraries (libibverbs, libmlx4)

Version-Release number of selected component (if applicable):
openstack-neutron-openvswitch-agent:2018-08-17.2

How reproducible:
100%

Steps to Reproduce:
1. Deploy OSP14 with 1 controller + 1 compute
2. Launch instance attached to a vxlan network

Actual results:
Instance ends up in ERROR state.

Expected results:
Instance launches successfully.

Additional info:
After installing the libibverbs package inside the neutron_ovs_agent container I was able to spawn an instance successfully.

[root@compute-0 heat-admin]# docker exec --user root -it neutron_ovs_agent yum localinstall -y http://$url/libibverbs-15-7.el7_5.x86_64.rpm http://$url/rdma-core-15-7.el7_5.x86_64.rpm
[root@compute-0 heat-admin]# docker restart neutron_ovs_agent
[root@compute-0 heat-admin]# docker logs -f neutron_ovs_agent
INFO:__main__:Setting permission for /var/log/neutron/openvswitch-agent.log
++ cat /run_command
+ CMD=/neutron_ovs_agent_launcher.sh
+ ARGS=
+ [[ ! -n '' ]]
+ . kolla_extend_start
++ [[ ! -d /var/log/kolla/neutron ]]
+++ stat -c %a /var/log/kolla/neutron
++ [[ 2755 != \7\5\5 ]]
++ chmod 755 /var/log/kolla/neutron
Running command: '/neutron_ovs_agent_launcher.sh'
++ . /usr/local/bin/kolla_neutron_extend_start
+ echo 'Running command: '\''/neutron_ovs_agent_launcher.sh'\'''
+ exec /neutron_ovs_agent_launcher.sh
+ /usr/bin/python -m neutron.cmd.destroy_patch_ports --config-file /usr/share/neutron/neutron-dist.conf --config-file /etc/neutron/neutron.conf --config-file /etc/neutron/plugins/ml2/openvswitch_agent.ini --config-dir /etc/neutron/conf.d/common --config-dir /etc/neutron/conf.d/neutron-openvswitch-agent
+ /usr/bin/neutron-openvswitch-agent --config-file /usr/share/neutron/neutron-dist.conf --config-file /etc/neutron/neutron.conf --config-file /etc/neutron/plugins/ml2/openvswitch_agent.ini --config-file /etc/neutron/plugins/ml2/ml2_conf.ini --config-dir /etc/neutron/conf.d/common --log-file=/var/log/neutron/openvswitch-agent.log
Created attachment 1477316: openvswitch-agent.log
Created attachment 1477317: nova-compute.log
Some ovs/packaging questions here, as this library seems to be focused on Mellanox and/or DPDK:
* Is this core library mandatory now in ovs 2.10? In that case the fix should be to add it to the requires.
* If not, can it be disabled, and is that a configuration issue at build time for "vanilla" ovs?
Recapping feedback from Flavio and my investigation:

openvswitch has all drivers built in, and unfortunately the Mellanox drivers need extra libraries, so they cannot be initialized if those libraries are missing. But this is just a warning and does not block "normal" ovs operations. Adding libibverbs as a hard dependency in ovs pulls in more packages that are unneeded for 99% of our customers.

Looking at openvswitch-agent.log, I think what happens here is that the ovsdb-client calls (from the agent) output:

PMD: net_mlx5: cannot load glue library: libibverbs.so.1: cannot open shared object file: No such file or directory
[...]
{"data":[...]} # Proper JSON output

and that confuses the agent when parsing the output.

The workaround here (installing libibverbs) works because it makes the warnings disappear.

To check: these warnings should be on stderr and should not end up in the string the agent parses.
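To illustrate why the stream matters, here is a minimal Python sketch. It does not use neutron or ovsdb-client: the child script is a hypothetical stand-in that prints a PMD-style warning on stderr and a JSON document on stdout, and the helper names (run_child, try_parse) are made up for this example.

```python
import json
import subprocess
import sys

# Hypothetical stand-in for ovsdb-client: a warning on stderr, JSON on stdout.
CHILD = r'''
import sys
sys.stderr.write("PMD: net_mlx5: cannot load glue library: libibverbs.so.1\n")
sys.stdout.write('{"data": [["row1"]]}\n')
'''


def run_child(merge_streams):
    """Run the stand-in child and return (stdout_text, stderr_text)."""
    # With merge_streams=True, stderr is folded into the stdout pipe.
    stderr = subprocess.STDOUT if merge_streams else subprocess.PIPE
    proc = subprocess.run(
        [sys.executable, "-c", CHILD],
        stdout=subprocess.PIPE,
        stderr=stderr,
        universal_newlines=True,
    )
    return proc.stdout, proc.stderr or ""


def try_parse(text):
    """Return the parsed JSON document, or None if text is not valid JSON."""
    try:
        return json.loads(text)
    except ValueError:
        return None


merged_out, _ = run_child(merge_streams=True)
separate_out, warnings = run_child(merge_streams=False)

# Merged output contains the warning, so it no longer parses as JSON.
print("merged parses:", try_parse(merged_out) is not None)
# Separated stdout is clean JSON; the warning is available on its own.
print("separate parses:", try_parse(separate_out) is not None)
print("warnings:", warnings.strip())
```

If the warnings do arrive on stderr, a consumer that reads only stdout still gets parseable JSON, so the remaining question is how the agent reacts to the stderr output itself.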
Confirming my theory, this is agent code that makes ovsdb-client calls: https://git.openstack.org/cgit/openstack/neutron/tree/neutron/agent/linux/ovsdb_monitor.py#n72 Setting die_on_error on AsyncProcess means the process is killed on any stderr output.
Possible fixes:

1. Current workaround: install libibverbs in the container.
   Pro: easy fix.
   Con: adds extra (mostly unneeded) packages and library loading in ovs, with no guarantee against further breakage.
2. Add libibverbs as a dependency in openvswitch.
   Similar to 1, with the added con that it brings in the unwelcome dependency everywhere, not only in our containers.
3. Make the standard package buildable/built without Mellanox support.
   Not sure of the feasibility here; it may need ovs code changes beyond packaging changes.

Apart from these, neutron-side fix options:

4. Disable die_on_error in OvsdbMonitor, and update the sub-classes' process_events() to filter out non-JSON output. Log error lines at debug level or similar.
   Pro: we only go through JSON output in any case.
   Con: we may miss actual errors, and react more slowly to them (until we hit a timeout).
5. Update the OvsdbMonitor/AsyncProcess logic to check the process return code.
   Pro: we can ignore stderr output (or log it at a low level) and rely on the process reporting success.
   Con: may not work for all processes. Are there cases where we actually need to check stderr output?
6. Directly interrogate ovsdb.
   Pros: robust and clean handling, no more subprocess and no vulnerability to CLI changes.
   Cons: longer-term fix. Do we have valid use cases where there is no direct ovsdb_connection?
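As a rough illustration of option 4, the filtering step could look like the sketch below. This is not the neutron code: filter_json_lines is a hypothetical helper, and it assumes one JSON object per output line.

```python
import json
import logging

LOG = logging.getLogger(__name__)


def filter_json_lines(lines):
    """Split raw output lines into (json_events, skipped_lines).

    Lines that do not decode to a JSON object are logged at debug level
    and returned separately instead of being treated as fatal.
    """
    events, skipped = [], []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            doc = json.loads(line)
        except ValueError:
            doc = None
        if isinstance(doc, dict):
            events.append(doc)
        else:
            LOG.debug("Ignoring non-JSON output: %s", line)
            skipped.append(line)
    return events, skipped


sample = [
    "PMD: net_mlx5: cannot load glue library: libibverbs.so.1: "
    "cannot open shared object file: No such file or directory",
    '{"data": []}',
]
events, skipped = filter_json_lines(sample)
print(events)
print(len(skipped), "line(s) skipped")
```

The con listed for option 4 shows up directly here: a genuine error message would also land in the skipped list and only be visible in the debug log.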
The master review has merged; I created cherry-picks for the stable branches.
We now have a workaround that installs libibverbs from the Dockerfile, but the right thing to do is to require it in the neutron spec file. The linked reviews seem to be for a different bug, so I will remove them.
Actually, the fix is to change neutron-openvswitch-agent behaviour not to die if ovsdb-client generates output on stderr (which is what happens with ovs 2.10 when libibverbs package is not installed). That is what the linked review is for. The workaround was to install this optional library, so ovsdb-client stays quiet (even if it does not need this library)
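For readers following along, the general idea of tolerating stderr output can be sketched as below. This is only an illustration under stated assumptions and not the merged neutron change: run_tolerating_stderr is a hypothetical helper, and the two commands are stand-ins for a noisy-but-successful and a failing ovsdb-client call.

```python
import subprocess
import sys


def run_tolerating_stderr(cmd):
    """Run cmd and return (stdout_text, stderr_lines).

    Output on stderr is collected as warnings; only a non-zero exit
    status is treated as a failure.
    """
    proc = subprocess.run(
        cmd,
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
        universal_newlines=True,
    )
    warnings = [line for line in proc.stderr.splitlines() if line]
    if proc.returncode != 0:
        raise RuntimeError(
            "command failed with status %d: %s"
            % (proc.returncode, proc.stderr.strip())
        )
    return proc.stdout, warnings


# Stand-in commands (assumptions, not real ovsdb-client invocations).
noisy_ok = [sys.executable, "-c",
            "import sys; sys.stderr.write('PMD: warning\\n'); print('ok')"]
failing = [sys.executable, "-c",
           "import sys; sys.stderr.write('fatal\\n'); sys.exit(2)"]

out, warnings = run_tolerating_stderr(noisy_ok)
print(out.strip(), warnings)
```

Calling run_tolerating_stderr(failing) raises RuntimeError, so real failures are still surfaced as long as the command reports them through its exit status.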
(In reply to Bernard Cafarelli from comment #15)
> Actually, the fix is to change neutron-openvswitch-agent behaviour not to
> die if ovsdb-client generates output on stderr (which is what happens with
> ovs 2.10 when libibverbs package is not installed). That is what the linked
> review is for.
>
> The workaround was to install this optional library, so ovsdb-client stays
> quiet (even if it does not need this library)

Thanks for the clarification, Bernard. I was surprised and thought it was a different review.
I see two issues mentioned here:

A) According to http://post-office.corp.redhat.com/archives/rhos-qe-dept/2018-September/msg00775.html, puddle 2018-09-26.1 was not CI ready, so it could not reach a successful OC deployment. On the other hand, the next puddle, 2018-09-27.3, with openstack-neutron-13.0.2-0.20180922043831.266d1ad.el7ost (containers on UC and OC, openstack-neutron-openvswitch-agent:2018-09-26.1 (build 52)), can reach that stage and seems to have this issue fixed. VMs are spawnable, and there are no errors in the neutron_ovs_agent containers on the OC or UC node(s). Full Tempest passed with 0 errors (1342 tests), OSP14, topology 1:1:1:1. Marking Verified and bumping the Build ID and Fixed-in values.

B) The problem with the "ovs-vsctl" command on UC producing warnings still stands, but that should likely be addressed by a separate, dedicated BZ since it is more openvswitch2.10 related. I will first make sure that the issue is still around in the newest puddles.
Ad B) Tested with 2018-12-05.2 puddle, topology 1:1:1:1, ovs-vsctl is runnable without any issue on all nodes (UC+OC). Both issues seem to be resolved.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHEA-2019:0045