Bug 1619387 - openstack-neutron-openvswitch-agent:2018-08-17.2 image is missing libibverbs package and overcloud instances cannot be spawned
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat OpenStack
Classification: Red Hat
Component: openstack-neutron
Version: 14.0 (Rocky)
Hardware: Unspecified
OS: Unspecified
Priority: medium
Severity: urgent
Target Milestone: beta
Target Release: 14.0 (Rocky)
Assignee: Bernard Cafarelli
QA Contact: Filip Hubík
URL:
Whiteboard:
Depends On:
Blocks: 1629629
 
Reported: 2018-08-20 16:48 UTC by Marius Cornea
Modified: 2019-01-11 11:51 UTC (History)

Fixed In Version: openstack-neutron-13.0.2-0.20180922043831.266d1ad.el7ost
Doc Type: No Doc Update
Doc Text:
Clone Of:
Clones: 1629629
Environment:
Last Closed: 2019-01-11 11:51:21 UTC
Target Upstream Version:
Embargoed:


Attachments
openvswitch-agent.log (9.14 MB, text/plain), 2018-08-20 17:18 UTC, Marius Cornea
nova-compute.log (5.06 MB, text/plain), 2018-08-20 17:20 UTC, Marius Cornea


Links
System ID Private Priority Status Summary Last Updated
Launchpad 1788865 0 None None None 2018-08-24 13:30:16 UTC
OpenStack gerrit 596717 0 None MERGED ovsdb monitor: do not die on ovsdb-client stderr output 2020-09-16 08:59:23 UTC
OpenStack gerrit 603046 0 None MERGED ovsdb monitor: do not die on ovsdb-client stderr output 2020-09-16 08:59:23 UTC
Red Hat Product Errata RHEA-2019:0045 0 None None None 2019-01-11 11:51:42 UTC

Description Marius Cornea 2018-08-20 16:48:45 UTC
Description of problem:

openstack-neutron-openvswitch-agent:2018-08-17.2 is missing the libibverbs package, and overcloud instances cannot be spawned.

After the overcloud is deployed, the neutron_ovs_agent container logs show:

INFO:__main__:Setting permission for /var/log/neutron/openvswitch-agent.log
++ cat /run_command
+ CMD=/neutron_ovs_agent_launcher.sh
+ ARGS=
+ [[ ! -n '' ]]
+ . kolla_extend_start
++ [[ ! -d /var/log/kolla/neutron ]]
+++ stat -c %a /var/log/kolla/neutron
++ [[ 2755 != \7\5\5 ]]
++ chmod 755 /var/log/kolla/neutron
++ . /usr/local/bin/kolla_neutron_extend_start
+ echo 'Running command: '\''/neutron_ovs_agent_launcher.sh'\'''
+ exec /neutron_ovs_agent_launcher.sh
Running command: '/neutron_ovs_agent_launcher.sh'
+ /usr/bin/python -m neutron.cmd.destroy_patch_ports --config-file /usr/share/neutron/neutron-dist.conf --config-file /etc/neutron/neutron.conf --config-file /etc/neutron/plugins/ml2/openvswitch_agent.ini --config-dir /etc/neutron/conf.d/common --config-dir /etc/neutron/conf.d/neutron-openvswitch-agent
PMD: net_mlx5: cannot load glue library: libibverbs.so.1: cannot open shared object file: No such file or directory
PMD: net_mlx5: cannot initialize PMD due to missing run-time dependency on rdma-core libraries (libibverbs, libmlx5)
PMD: net_mlx4: cannot load glue library: libibverbs.so.1: cannot open shared object file: No such file or directory
PMD: net_mlx4: cannot initialize PMD due to missing run-time dependency on rdma-core libraries (libibverbs, libmlx4)
+ /usr/bin/neutron-openvswitch-agent --config-file /usr/share/neutron/neutron-dist.conf --config-file /etc/neutron/neutron.conf --config-file /etc/neutron/plugins/ml2/openvswitch_agent.ini --config-file /etc/neutron/plugins/ml2/ml2_conf.ini --config-dir /etc/neutron/conf.d/common --log-file=/var/log/neutron/openvswitch-agent.log
PMD: net_mlx5: cannot load glue library: libibverbs.so.1: cannot open shared object file: No such file or directory
PMD: net_mlx5: cannot initialize PMD due to missing run-time dependency on rdma-core libraries (libibverbs, libmlx5)
PMD: net_mlx4: cannot load glue library: libibverbs.so.1: cannot open shared object file: No such file or directory
PMD: net_mlx4: cannot initialize PMD due to missing run-time dependency on rdma-core libraries (libibverbs, libmlx4)


Version-Release number of selected component (if applicable):
openstack-neutron-openvswitch-agent:2018-08-17.2

How reproducible:
100%

Steps to Reproduce:
1. Deploy OSP14 with 1 controller + 1 compute
2. Launch instance attached to a vxlan network

Actual results:
Instance ends up in ERROR state.

Expected results:
Instance launches successfully.

Additional info:

After installing the libibverbs package inside the neutron_ovs_agent container, I was able to spawn an instance successfully.

[root@compute-0 heat-admin]# docker exec --user root -it neutron_ovs_agent yum localinstall -y http://$url/libibverbs-15-7.el7_5.x86_64.rpm http://$url/rdma-core-15-7.el7_5.x86_64.rpm

[root@compute-0 heat-admin]# docker restart neutron_ovs_agent

[root@compute-0 heat-admin]# docker logs -f neutron_ovs_agent

INFO:__main__:Setting permission for /var/log/neutron/openvswitch-agent.log
++ cat /run_command
+ CMD=/neutron_ovs_agent_launcher.sh
+ ARGS=
+ [[ ! -n '' ]]
+ . kolla_extend_start
++ [[ ! -d /var/log/kolla/neutron ]]
+++ stat -c %a /var/log/kolla/neutron
++ [[ 2755 != \7\5\5 ]]
++ chmod 755 /var/log/kolla/neutron
Running command: '/neutron_ovs_agent_launcher.sh'
++ . /usr/local/bin/kolla_neutron_extend_start
+ echo 'Running command: '\''/neutron_ovs_agent_launcher.sh'\'''
+ exec /neutron_ovs_agent_launcher.sh
+ /usr/bin/python -m neutron.cmd.destroy_patch_ports --config-file /usr/share/neutron/neutron-dist.conf --config-file /etc/neutron/neutron.conf --config-file /etc/neutron/plugins/ml2/openvswitch_agent.ini --config-dir /etc/neutron/conf.d/common --config-dir /etc/neutron/conf.d/neutron-openvswitch-agent
+ /usr/bin/neutron-openvswitch-agent --config-file /usr/share/neutron/neutron-dist.conf --config-file /etc/neutron/neutron.conf --config-file /etc/neutron/plugins/ml2/openvswitch_agent.ini --config-file /etc/neutron/plugins/ml2/ml2_conf.ini --config-dir /etc/neutron/conf.d/common --log-file=/var/log/neutron/openvswitch-agent.log

Comment 2 Marius Cornea 2018-08-20 17:18:40 UTC
Created attachment 1477316
openvswitch-agent.log

Comment 3 Marius Cornea 2018-08-20 17:20:19 UTC
Created attachment 1477317
nova-compute.log

Comment 4 Bernard Cafarelli 2018-08-21 08:23:59 UTC
Some ovs/packaging questions here, as this library seems to be focused on Mellanox and/or DPDK:
* Is this core library mandatory now in ovs 2.10? In that case the fix should be to add it to requires.
* If not, can it be disabled, and is that a build-time configuration issue for "vanilla" ovs?

Comment 6 Bernard Cafarelli 2018-08-22 13:37:10 UTC
Recapping feedback from Flavio and my investigation:

openvswitch has all drivers built in, and unfortunately the Mellanox drivers need extra libs, so they cannot be initialized if those libs are missing. But this is just a warning and does not block "normal" ovs operations.
Adding libibverbs as a hard dependency in ovs pulls in more packages that are unneeded for 99% of our customers.

Looking at openvswitch-agent.log, I think what happens here is that ovsdb-client calls (from the agent) output:
PMD: net_mlx5: cannot load glue library: libibverbs.so.1: cannot open shared object file: No such file or directory
[...]
{"data":[...]} # Proper JSON output
And that confuses the agent when parsing the output

The workaround here (installing libibverbs) works because it makes the warnings disappear.

To be checked: these warnings should go to stderr and not end up in the string the agent parses.
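
A minimal sketch of that check, in Python: if the warnings really go to stderr, reading the two streams through separate pipes keeps them out of the string that gets parsed as JSON. This is not neutron code. The child process is a stand-in that imitates ovsdb-client (one PMD warning on stderr, JSON on stdout), and the names FAKE_OVSDB_CLIENT and run_and_parse are hypothetical.

```python
import json
import subprocess
import sys

# Stand-in for ovsdb-client (assumption): a warning on stderr, JSON on stdout.
FAKE_OVSDB_CLIENT = [
    sys.executable, "-c",
    "import sys; "
    "sys.stderr.write('PMD: net_mlx5: cannot load glue library\\n'); "
    "sys.stdout.write('{\"data\": []}\\n')",
]


def run_and_parse(cmd):
    """Run cmd with separate pipes; return (parsed stdout JSON, stderr lines)."""
    proc = subprocess.run(cmd, stdout=subprocess.PIPE, stderr=subprocess.PIPE,
                          universal_newlines=True, check=True)
    # Only stdout is handed to the JSON parser; stderr is kept aside for logging.
    return json.loads(proc.stdout), proc.stderr.splitlines()


parsed, warnings = run_and_parse(FAKE_OVSDB_CLIENT)
```

If the two streams were merged (for example with stderr redirected into stdout), json.loads would instead see the warning text ahead of the JSON and fail.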

Comment 7 Bernard Cafarelli 2018-08-23 10:09:13 UTC
Confirming my theory, this is the agent code that makes the ovsdb-client calls:
https://git.openstack.org/cgit/openstack/neutron/tree/neutron/agent/linux/ovsdb_monitor.py#n72

Setting die_on_error on AsyncProcess means the process is killed on any stderr output.
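
To illustrate that behaviour, here is a toy model in Python. It is not the AsyncProcess implementation, only a sketch of the semantics described above: with die_on_error set, the first line on stderr gets the child killed and its output discarded, even though the child also produced valid JSON. The names CHILD and monitor are hypothetical, and the child is a stand-in for ovsdb-client.

```python
import subprocess
import sys

# Stand-in child (assumption): one warning on stderr, then one JSON row on stdout.
CHILD = [
    sys.executable, "-c",
    "import sys; "
    "sys.stderr.write('PMD: net_mlx5: cannot load glue library\\n'); "
    "sys.stderr.flush(); "
    "sys.stdout.write('{\"data\": []}\\n')",
]


def monitor(cmd, die_on_error):
    """Toy model of a monitored process; returns (stdout_text, was_killed)."""
    proc = subprocess.Popen(cmd, stdout=subprocess.PIPE,
                            stderr=subprocess.PIPE, universal_newlines=True)
    first_err = proc.stderr.readline()  # blocks until a stderr line or EOF
    if first_err and die_on_error:
        # Any stderr output is treated as fatal: stop the child, drop its output.
        proc.kill()
        proc.communicate()
        return "", True
    out, _ = proc.communicate()  # tolerate stderr noise, keep stdout
    return out, False


tolerant_out, tolerant_killed = monitor(CHILD, die_on_error=False)
strict_out, strict_killed = monitor(CHILD, die_on_error=True)
```

With die_on_error=False the JSON row survives; with die_on_error=True the harmless PMD warning is enough to lose it.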

Comment 8 Bernard Cafarelli 2018-08-23 14:37:16 UTC
Possible fixes:
1. Current workaround: install libibverbs in the container. Pro: easy fix. Con: adds extra (mostly unneeded) packages and library loading in ovs, with no guarantee against further breakage.
2. Add libibverbs as a dependency in openvswitch. Similar to 1, with the added con that it brings in the unwelcome dependency everywhere, not only in our containers.
3. Make the standard package buildable/built without Mellanox support. Not sure of the feasibility here; it may need ovs code changes beyond packaging changes.
-- Beyond these, neutron-side fix options:
4. Disable die_on_error in OvsdbMonitor and update the sub-classes' process_events() to filter out non-JSON output. Log error lines at debug level or similar. Pro: we only go through JSON output in any case. Con: we may miss actual errors and react to them more slowly (until we hit a timeout).
5. Update the OvsdbMonitor/AsyncProcess logic to check the process return code. Pro: we can ignore stderr output (or log it at a low level) and rely on the process reporting success. Con: may not work for all processes; are there cases where we actually need to check stderr output?
6. Directly interrogate ovsdb. Pros: robust and clean handling, no more subprocess and no vulnerability to CLI changes. Cons: longer-term fix; do we have valid use cases where there is no direct ovsdb_connection?
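
A rough Python sketch of the filtering idea in option 4, assuming the monitor output arrives as a list of text lines: anything that does not parse as a JSON object is set aside (to be logged at debug level) instead of being treated as an event. The function name split_json_lines and the sample row are hypothetical, and this is not the change that was eventually merged.

```python
import json


def split_json_lines(lines):
    """Separate lines that parse as JSON objects from everything else."""
    events, noise = [], []
    for line in lines:
        try:
            obj = json.loads(line)
        except ValueError:  # json.JSONDecodeError is a ValueError subclass
            noise.append(line)
            continue
        if isinstance(obj, dict):
            events.append(obj)
        else:
            noise.append(line)
    return events, noise


# Sample input: the PMD warning from this report plus a made-up JSON row.
lines = [
    "PMD: net_mlx5: cannot load glue library: libibverbs.so.1: "
    "cannot open shared object file: No such file or directory",
    '{"data": [["row-1", "insert", "tap0"]]}',
]
events, noise = split_json_lines(lines)
```

The con listed for option 4 shows up directly here: a real error message lands in noise just like the PMD warning, so nothing reacts to it until a timeout is hit.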

Comment 10 Bernard Cafarelli 2018-09-17 08:21:27 UTC
Master review merged, created cherry-picks for stable branches

Comment 14 Miguel Angel Ajo 2018-09-24 13:02:09 UTC
We now have a workaround that includes libibverbs from the Dockerfile, but the right thing to do is to require it from the neutron spec file.

The linked reviews seem to be for a different bug, so I will remove them.

Comment 15 Bernard Cafarelli 2018-09-24 13:33:36 UTC
Actually, the fix is to change neutron-openvswitch-agent behaviour not to die if ovsdb-client generates output on stderr (which is what happens with ovs 2.10 when libibverbs package is not installed). That is what the linked review is for.

The workaround was to install this optional library, so ovsdb-client stays quiet (even if it does not need this library)

Comment 16 Miguel Angel Ajo 2018-09-25 06:17:00 UTC
(In reply to Bernard Cafarelli from comment #15)
> Actually, the fix is to change neutron-openvswitch-agent behaviour not to
> die if ovsdb-client generates output on stderr (which is what happens with
> ovs 2.10 when libibverbs package is not installed). That is what the linked
> review is for.
> 
> The workaround was to install this optional library, so ovsdb-client stays
> quiet (even if it does not need this library)

Thanks for the clarification, Bernard. I was surprised and thought it was a different review.

Comment 20 Filip Hubík 2018-10-05 11:13:34 UTC
I see two issues mentioned here:

A) According to http://post-office.corp.redhat.com/archives/rhos-qe-dept/2018-September/msg00775.html, puddle 2018-09-26.1 was not CI ready, so it could not reach a successful OC deployment.

On the other hand, the next puddle, 2018-09-27.3, with openstack-neutron-13.0.2-0.20180922043831.266d1ad.el7ost (containers on UC and OC, openstack-neutron-openvswitch-agent:2018-09-26.1 (build 52)), can reach that stage and seems to have this issue fixed.

VMs are spawnable, with no errors in the neutron_ovs_agent containers on OC or UC node(s). Full Tempest passed with 0 errors (1342 tests), OSP14, topology 1:1:1:1. Marking Verified and bumping Build ID and Fixed-in values.

B) The problem with the "ovs-vsctl" command on UC producing warnings still stands, but that should likely be addressed by another dedicated BZ, since it is more openvswitch2.10 related. I will first make sure that the issue is still around in the newest puddles.

Comment 21 Filip Hubík 2018-12-07 14:15:17 UTC
Re B) Tested with the 2018-12-05.2 puddle, topology 1:1:1:1: ovs-vsctl runs without any issue on all nodes (UC+OC).

Both issues seem to be resolved.

Comment 23 errata-xmlrpc 2019-01-11 11:51:21 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2019:0045

