Bug 1608508

Summary: Overcloud instances cannot pxe boot during overcloud deployment, ovsdb monitor regressed
Product: Red Hat OpenStack
Reporter: Marius Cornea <mcornea>
Component: openvswitch
Assignee: Terry Wilson <twilson>
Status: CLOSED ERRATA
QA Contact: Roee Agiman <ragiman>
Severity: urgent
Docs Contact:
Priority: urgent
Version: 14.0 (Rocky)
CC: agurenko, amuller, apevec, bcafarel, bfournie, bhaley, chrisw, dbecker, fhubik, fleitner, jschluet, mburns, mcornea, morazi, njohnston, nyechiel, ohochman, rhos-maint, sasha, srevivo, tvignaud, twilson
Target Milestone: beta
Keywords: AutomationBlocker, Regression, Triaged
Target Release: 14.0 (Rocky)
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version: openvswitch2.10-2.10.0-0.20180810git58a7ce6.el7fdp
Doc Type: Bug Fix
Doc Text:
Cause: A change in Open vSwitch's table printing inadvertently removed appending a newline to the output.
Consequence: When using ovsdb_interface=vsctl, Neutron's parsing of ovsdb-client monitor output relied on the newline for detecting added ports. Since the port was never detected, the tag was not set and traffic could not be passed over the port.
Fix: Re-add the newline in the Open vSwitch code.
Result: Newly added ports can now pass traffic again.
Story Points: ---
Clone Of:
Environment:
Last Closed: 2019-01-11 11:50:58 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Attachments:
sosreport-undercloud-0.redhat.local-20180725122842.tar.xz (flags: none)
combined ovs-vswitchd, dhcp and openvswitch agent log files (flags: none)

Description Marius Cornea 2018-07-25 16:36:16 UTC
Created attachment 1470542 [details]
sosreport-undercloud-0.redhat.local-20180725122842.tar.xz

Description of problem:

Overcloud instances cannot PXE boot during overcloud deployment. Sniffing the br-ctlplane traffic on the undercloud, we can see that DHCP requests hit the interface but never get any replies:

 [root@undercloud-0 stack]# tcpdump -i br-ctlplane  -n port 67 and port 68
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on br-ctlplane, link-type EN10MB (Ethernet), capture size 262144 bytes
11:58:29.434682 IP 0.0.0.0.bootpc > 255.255.255.255.bootps: BOOTP/DHCP, Request from 52:54:00:31:5a:82, length 396
11:58:37.343693 IP 0.0.0.0.bootpc > 255.255.255.255.bootps: BOOTP/DHCP, Request from 52:54:00:31:5a:82, length 396

Checking audit.log for selinux denials we can spot:
type=AVC msg=audit(1532527823.424:683): avc:  denied  { net_broadcast } for  pid=13881 comm="ovs-vswitchd" capability=11  scontext=system_u:system_r:openvswitch_t:s0 tcontext=system_u:system_r:openvswitch_t:s0 tclass=capability

After switching to permissive the issue persists. Checking the ovs ports:

 [root@undercloud-0 stack]# ovs-vsctl show
4e7e276f-2132-463d-9e4c-fe2018b8be1e
    Manager "ptcp:6640:127.0.0.1"
        is_connected: true
    Bridge br-ctlplane
        Controller "tcp:127.0.0.1:6633"
            is_connected: true
        fail_mode: secure
        Port "eth0"
            Interface "eth0"
        Port br-ctlplane
            Interface br-ctlplane
                type: internal
        Port phy-br-ctlplane
            Interface phy-br-ctlplane
                type: patch
                options: {peer=int-br-ctlplane}
    Bridge br-int
        Controller "tcp:127.0.0.1:6633"
            is_connected: true
        fail_mode: secure
        Port br-int
            Interface br-int
                type: internal
        Port int-br-ctlplane
            Interface int-br-ctlplane
                type: patch
                options: {peer=phy-br-ctlplane}
        Port "tap207ed573-fa"
            tag: 4095
            Interface "tap207ed573-fa"
                type: internal
    ovs_version: "2.10.0"

We can see that the tap interface (tap207ed573-fa) used by the undercloud's DHCP agent has tag: 4095.

Checking the neutron port for the DHCP agent, it is reported as DOWN:

(undercloud) [stack@undercloud-0 ~]$ neutron port-show 207ed573-fad3-4345-b093-4e57007164e5
neutron CLI is deprecated and will be removed in the future. Use openstack CLI instead.
+-----------------------+-------------------------------------------------------------------------------------+
| Field                 | Value                                                                               |
+-----------------------+-------------------------------------------------------------------------------------+
| admin_state_up        | True                                                                                |
| allowed_address_pairs |                                                                                     |
| binding:host_id       | undercloud-0.localdomain                                                            |
| binding:profile       | {}                                                                                  |
| binding:vif_details   | {"port_filter": true, "datapath_type": "system", "ovs_hybrid_plug": false}          |
| binding:vif_type      | ovs                                                                                 |
| binding:vnic_type     | normal                                                                              |
| created_at            | 2018-07-25T14:23:33Z                                                                |
| description           |                                                                                     |
| device_id             | dhcp3399fad5-b1b7-5f16-832b-991aeb0022d2-6a3b6822-9104-4322-a2f5-4b235e2a93ed       |
| device_owner          | network:dhcp                                                                        |
| extra_dhcp_opts       |                                                                                     |
| fixed_ips             | {"subnet_id": "1025671f-c88c-4cff-ad47-fa59c1af7574", "ip_address": "192.168.24.5"} |
| id                    | 207ed573-fad3-4345-b093-4e57007164e5                                                |
| ip_allocation         | immediate                                                                           |
| mac_address           | fa:16:3e:e7:07:7d                                                                   |
| name                  |                                                                                     |
| network_id            | 6a3b6822-9104-4322-a2f5-4b235e2a93ed                                                |
| port_security_enabled | False                                                                               |
| project_id            | fc64981873354f2a91b8fdfb6b95750f                                                    |
| revision_number       | 2                                                                                   |
| security_groups       |                                                                                     |
| status                | DOWN                                                                                |
| tags                  |                                                                                     |
| tenant_id             | fc64981873354f2a91b8fdfb6b95750f                                                    |
| updated_at            | 2018-07-25T14:23:34Z                                                                |
+-----------------------+-------------------------------------------------------------------------------------+


Version-Release number of selected component (if applicable):


How reproducible:
100%

Steps to Reproduce:
1. Deploy OSP14 undercloud
2. Deploy overcloud

Actual results:
Overcloud nodes do not get any DHCP address.

Expected results:
Nodes boot properly.

Additional info:
Attaching sosreport.

Comment 1 Marius Cornea 2018-07-25 18:39:56 UTC
The workaround is to disable SELinux and reboot (or set undercloud_enable_selinux in undercloud.conf before starting the undercloud installation):

sed -i s/SELINUX=enforcing/SELINUX=permissive/ /etc/selinux/config
reboot


 [root@undercloud-0 stack]# grep denied /var/log/audit/audit.log 
type=AVC msg=audit(1532537334.221:459): avc:  denied  { net_broadcast } for  pid=6385 comm="privsep-helper" capability=11  scontext=system_u:system_r:openvswitch_t:s0 tcontext=system_u:system_r:openvswitch_t:s0 tclass=capability
type=AVC msg=audit(1532537334.321:466): avc:  denied  { net_broadcast } for  pid=6408 comm="privsep-helper" capability=11  scontext=system_u:system_r:openvswitch_t:s0 tcontext=system_u:system_r:openvswitch_t:s0 tclass=capability
type=AVC msg=audit(1532537836.193:50): avc:  denied  { net_broadcast } for  pid=1057 comm="ovs-vswitchd" capability=11  scontext=system_u:system_r:openvswitch_t:s0 tcontext=system_u:system_r:openvswitch_t:s0 tclass=capability
type=AVC msg=audit(1532537920.353:49): avc:  denied  { net_broadcast } for  pid=1058 comm="ovs-vswitchd" capability=11  scontext=system_u:system_r:openvswitch_t:s0 tcontext=system_u:system_r:openvswitch_t:s0 tclass=capability
type=AVC msg=audit(1532538150.079:50): avc:  denied  { net_broadcast } for  pid=1056 comm="ovs-vswitchd" capability=11  scontext=system_u:system_r:openvswitch_t:s0 tcontext=system_u:system_r:openvswitch_t:s0 tclass=capability
type=AVC msg=audit(1532538421.132:50): avc:  denied  { net_broadcast } for  pid=1059 comm="ovs-vswitchd" capability=11  scontext=system_u:system_r:openvswitch_t:s0 tcontext=system_u:system_r:openvswitch_t:s0 tclass=capability
type=AVC msg=audit(1532538699.233:48): avc:  denied  { net_broadcast } for  pid=1021 comm="ovs-vswitchd" capability=11  scontext=system_u:system_r:openvswitch_t:s0 tcontext=system_u:system_r:openvswitch_t:s0 tclass=capability
type=AVC msg=audit(1532538742.752:430): avc:  denied  { net_broadcast } for  pid=5902 comm="privsep-helper" capability=11  scontext=system_u:system_r:openvswitch_t:s0 tcontext=system_u:system_r:openvswitch_t:s0 tclass=capability
 [root@undercloud-0 stack]#

Comment 2 Lon Hohberger 2018-07-25 18:49:47 UTC
Is openvswitch-selinux-extra-policy installed? If so, we can move this to the OVS team.

Comment 3 Marius Cornea 2018-07-25 18:56:32 UTC
(In reply to Lon Hohberger from comment #2)
> Is openvswitch-selinux-extra-policy installed? If so, we can move this to
> the OVS team.

Yes, openvswitch-selinux-extra-policy-1.0-3.el7fdp.noarch is installed

Comment 4 Marius Cornea 2018-07-27 15:37:40 UTC
The issue is still present even with the selinux fix for openvswitch.

Comment 5 Marius Cornea 2018-07-27 19:02:29 UTC
Note that switching SELinux to permissive doesn't work as a workaround. It seems that the port tag can get set to '1' after a random number of undercloud reboot attempts which leads me to guess it's some sort of race condition introduced with the containerized Neutron services.

Comment 6 Filip Hubík 2018-07-30 11:19:25 UTC
I can confirm this issue being hit in downstream CI, puddle 2018-07-26.7:

uc $ tcpdump -i br-ctlplane  -n port 67 and port 68                                                                                                                                                           
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on br-ctlplane, link-type EN10MB (Ethernet), capture size 262144 bytes
07:13:50.117185 IP 0.0.0.0.bootpc > 255.255.255.255.bootps: BOOTP/DHCP, Request from 52:54:00:d1:87:df, length 396
07:13:55.478138 IP 0.0.0.0.bootpc > 255.255.255.255.bootps: BOOTP/DHCP, Request from 52:54:00:dc:85:50, length 396
07:13:55.754999 IP 0.0.0.0.bootpc > 255.255.255.255.bootps: BOOTP/DHCP, Request from 52:54:00:53:05:44, length 396
07:13:58.021948 IP 0.0.0.0.bootpc > 255.255.255.255.bootps: BOOTP/DHCP, Request from 52:54:00:d1:87:df, length 396
07:14:03.387144 IP 0.0.0.0.bootpc > 255.255.255.255.bootps: BOOTP/DHCP, Request from 52:54:00:dc:85:50, length 396
07:14:03.663673 IP 0.0.0.0.bootpc > 255.255.255.255.bootps: BOOTP/DHCP, Request from 52:54:00:53:05:44, length 396
07:14:13.841256 IP 0.0.0.0.bootpc > 255.255.255.255.bootps: BOOTP/DHCP, Request from 52:54:00:d1:87:df, length 396
07:14:19.205997 IP 0.0.0.0.bootpc > 255.255.255.255.bootps: BOOTP/DHCP, Request from 52:54:00:dc:85:50, length 396
07:14:19.482659 IP 0.0.0.0.bootpc > 255.255.255.255.bootps: BOOTP/DHCP, Request from 52:54:00:53:05:44, length 396

Nodes are stuck in the BUILD state until the job is explicitly killed by timeout. No AVC denials.

iPXE fails:

No more network devices

No bootable device.

openvswitch-selinux-extra-policy-1.0-4.1.el7ost.noarch
openstack-tripleo-heat-templates-9.0.0-0.20180720154239.959e1d7.el7ost.noarch

Comment 7 Aaron Conole 2018-07-30 12:08:43 UTC
"The issue is still present even with the selinux fix for openvswitch."

Additionally,

"No avc denials."

This tells me that it's unlikely to be an selinux issue.  What are the logs from the jobs?

Comment 11 Marius Cornea 2018-07-31 16:11:24 UTC
I'm moving the bug back to NEW since the issue reported initially is still present regardless of the SELinux fix. The issue is currently under investigation by the Networking DFG.

Comment 12 Aaron Conole 2018-08-02 13:17:37 UTC
Given c#5 can we reassign this?  It doesn't seem likely to be an ovs bug.

Comment 13 Marius Cornea 2018-08-02 13:27:28 UTC
(In reply to Aaron Conole from comment #12)
> Given c#5 can we reassign this?  It doesn't seem likely to be an ovs bug.

Yes, re-assigned.

Comment 16 Nate Johnston 2018-08-02 21:14:01 UTC
8/2 COB update: Checking the test system where the issue is replicated, I see after multiple reboots of the undercloud node that sometimes the interface corresponding to the DHCP agent has VLAN 1 tagged (correct) and other times it has VLAN 4095 tagged.  VLAN 4095 functions to discard all traffic to the interface, so that is congruent with the input in the bug report.  Inspecting the logs, so far the main thing that I have noticed is that oslo.privsep is complaining about not being able to find an interface - but that interface does exist.

    2018-08-02 14:31:49.128 6218 DEBUG oslo.privsep.daemon [-] privsep: Exception during request[139927564007376]: Network interface tap5ddf0705-0e not found in namespace qdhcp-3a7596e0-4556-4db4-8777-d420f547736c. loop /usr/lib/python2.7/site-packages/oslo_privsep/daemon.py:449

I will continue debugging with this as my top priority.

Comment 17 Terry Wilson 2018-08-02 22:26:40 UTC
Created attachment 1472826 [details]
combined ovs-vswitchd, dhcp and openvswitch agent log files

Note that the merged ovs-vswitchd log lines seem to be merged into the correct second with the other logs, but for some reason are listed at the top of that second.
Some relevant info from the combined logs:


DHCP agent starts
2018-07-30 11:20:47.813 3976 INFO neutron.common.config [-] Logging enabled!
2018-07-30 11:20:47.824 3976 INFO neutron.common.config [-] /usr/bin/neutron-dhcp-agent version 13.0.0.0b3.dev177
2018-07-30 11:20:47.824 3976 DEBUG neutron.common.config [-] command line: /usr/bin/neutron-dhcp-agent --config-file /usr/share/neutron/neutron-dist.conf --config-file /etc/neutron/neutron.conf --config-file /etc/neutron/dhcp_agent.ini --config-dir /etc/neutron/conf.d/common --config-dir /etc/neutron/conf.d/neutron-dhcp-agent --log-file=/var/log/neutron/dhcp-agent.log setup_logging /usr/lib/python2.7/site-packages/neutron/common/config.py:104

OVS agent starts
2018-07-30 11:20:51.177 4377 INFO neutron.common.config [-] Logging enabled!
2018-07-30 11:20:51.199 4377 INFO neutron.common.config [-] /usr/bin/neutron-openvswitch-agent version 13.0.0.0b3.dev177
2018-07-30 11:20:51.199 4377 DEBUG neutron.common.config [-] command line: /usr/bin/neutron-openvswitch-agent --config-file /usr/share/neutron/neutron-dist.conf --config-file /etc/neutron/neutron.conf --config-file /etc/neutron/plugins/ml2/openvswitch_agent.ini --config-file /etc/neutron/plugins/ml2/ml2_conf.ini --config-dir /etc/neutron/conf.d/common --log-file=/var/log/neutron/openvswitch-agent.log setup_logging /usr/lib/python2.7/site-packages/neutron/common/config.py:104


DHCP Agent actually starts doing things and recreates interface
   Lots of RPC failures until we finally sync up
2018-07-30 11:22:53.546 [dhcp-agent.log.1] 3976 INFO oslo.privsep.daemon [-] Running privsep helper: ['sudo', 'neutron-rootwrap', '/etc/neutron/rootwrap.conf', 'privsep-helper', '--config-file', '/usr/share/neutron/neutron-dist.conf', '--config-file', '/etc/neutron/neutron.conf', '--config-file', '/etc/neutron/dhcp_agent.ini', '--config-dir', '/etc/neutron/conf.d/neutron-dhcp-agent', '--privsep_context', 'neutron.privileged.default', '--privsep_sock_path', '/tmp/tmpWrqKIu/privsep.sock']
...
2018-07-30 11:22:54.435 [dhcp-agent.log.1] 3976 DEBUG ovsdbapp.backend.ovs_idl.transaction [-] Running txn n=1 command(idx=0): DelPortCommand(if_exists=True, bridge=None, port=tapcf84dbec-1e) do_commit /usr/lib/python2.7/site-packages/ovsdbapp/backend/ovs_idl/transaction.py:84
2018-07-30T15:22:54.437Z| [ovs-vswitchd.log-20180731] 00064|bridge|INFO|bridge br-int: deleted interface tapcf84dbec-1e on port 4
2018-07-30 11:22:54.450 [dhcp-agent.log.1] 3976 DEBUG ovsdbapp.backend.ovs_idl.transaction [-] Running txn n=1 command(idx=0): AddPortCommand(bridge=br-int, may_exist=False, port=tapcf84dbec-1e) do_commit /usr/lib/python2.7/site-packages/ovsdbapp/backend/ovs_idl/transaction.py:84
2018-07-30 11:22:54.450 [dhcp-agent.log.1] 3976 DEBUG ovsdbapp.backend.ovs_idl.transaction [-] Running txn n=1 command(idx=1): DbSetCommand(table=Port, col_values=(('tag', 4095),), record=tapcf84dbec-1e) do_commit /usr/lib/python2.7/site-packages/ovsdbapp/backend/ovs_idl/transaction.py:84
2018-07-30 11:22:54.452 [dhcp-agent.log.1] 3976 DEBUG ovsdbapp.backend.ovs_idl.transaction [-] Running txn n=1 command(idx=2): DbSetCommand(table=Interface, col_values=(('type', 'internal'), ('external_ids', {'iface-id': u'cf84dbec-1ea0-486c-b464-b774b8c576ff', 'iface-status': 'active', 'attached-mac': u'fa:16:3e:50:bd:75'})), record=tapcf84dbec-1e) do_commit /usr/lib/python2.7/site-packages/ovsdbapp/backend/ovs_idl/transaction.py:84
2018-07-30T15:22:54.460Z| [ovs-vswitchd.log-20180731] 00065|bridge|INFO|bridge br-int: added interface tapcf84dbec-1e on port 3
2018-07-30T15:22:54.493Z| [ovs-vswitchd.log-20180731] 00066|netdev_linux|INFO|ioctl(SIOCGIFHWADDR) on tapcf84dbec-1e device failed: No such device

After this, the openvswitch agent never detects that the ovsdb port was deleted and re-added, so the port is left with tag=4095. Not sure why ovs-vswitchd complains about the tapcf84dbec-1e device not existing when it just created it and it is a type 'internal' port.

Comment 19 Terry Wilson 2018-08-03 19:19:25 UTC
For reference, from a working devstack install, the merged log will have output from ovsdb-client monitor after the dhcp agent recreates the port. It will look something like:

Aug 03 13:02:45 neutron-dhcp-agent[26581]: DEBUG ovsdbapp.backend.ovs_idl.transaction [-] Running txn n=1 command(idx=0): DelPortCommand(if_exists=True, bridge=None, port=tapbda81760-59) {{(pid=26581) do_commit /usr/lib/python2.7/site-packages/ovsdbapp/backend/ovs_idl/transaction.py:84}}
Aug 03 13:02:45 neutron-dhcp-agent[26581]: DEBUG ovsdbapp.backend.ovs_idl.transaction [-] Transaction caused no change {{(pid=26581) do_commit /usr/lib/python2.7/site-packages/ovsdbapp/backend/ovs_idl/transaction.py:121}}
Aug 03 13:02:45 neutron-dhcp-agent[26581]: DEBUG ovsdbapp.backend.ovs_idl.transaction [-] Running txn n=1 command(idx=0): AddPortCommand(bridge=br-int, may_exist=False, port=tapbda81760-59) {{(pid=26581) do_commit /usr/lib/python2.7/site-packages/ovsdbapp/backend/ovs_idl/transaction.py:84}}
Aug 03 13:02:45 neutron-dhcp-agent[26581]: DEBUG ovsdbapp.backend.ovs_idl.transaction [-] Running txn n=1 command(idx=1): DbSetCommand(table=Port, col_values=(('tag', 4095),), record=tapbda81760-59) {{(pid=26581) do_commit /usr/lib/python2.7/site-packages/ovsdbapp/backend/ovs_idl/transaction.py:84}}
Aug 03 13:02:45 neutron-dhcp-agent[26581]: DEBUG ovsdbapp.backend.ovs_idl.transaction [-] Running txn n=1 command(idx=2): DbSetCommand(table=Interface, col_values=(('type', 'internal'), ('external_ids', {'iface-id': u'bda81760-5978-471d-99a3-927f69d16523', 'iface-status': 'active', 'attached-mac': u'fa:16:3e:61:e0:67'})), record=tapbda81760-59) {{(pid=26581) do_commit /usr/lib/python2.7/site-packages/ovsdbapp/backend/ovs_idl/transaction.py:84}}
Aug 03 13:02:45 neutron-openvswitch-agent[25979]: DEBUG neutron.agent.linux.async_process [-] Output received from [ovsdb-client monitor tcp:127.0.0.1:6640 Interface name,ofport,external_ids --format=json]: {"data":[["74270315-5533-4d5b-98f2-02cbb56e9f98","insert","tapbda81760-59",["set",[]],["map",[["attached-mac","fa:16:3e:61:e0:67"],["iface-id","bda81760-5978-471d-99a3-927f69d16523"],["iface-status","active"]]]]],"headings":["row","action","name","ofport","external_ids"]} {{(pid=25979) _read_stdout /opt/stack/neutron/neutron/agent/linux/async_process.py:241}}
Aug 03 13:02:45 neutron-openvswitch-agent[25979]: DEBUG neutron.agent.linux.async_process [-] Output received from [ovsdb-client monitor tcp:127.0.0.1:6640 Interface name,ofport,external_ids --format=json]: {"data":[["74270315-5533-4d5b-98f2-02cbb56e9f98","old",null,["set",[]],null],["","new","tapbda81760-59",3,["map",[["attached-mac","fa:16:3e:61:e0:67"],["iface-id","bda81760-5978-471d-99a3-927f69d16523"],["iface-status","active"]]]]],"headings":["row","action","name","ofport","external_ids"]} {{(pid=25979) _read_stdout /opt/stack/neutron/neutron/agent/linux/async_process.py:241}}

In our logs, that doesn't happen, even though the ovsdb-client monitor...Interface command was started a minute earlier.
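For reference, each of those monitor output lines is a self-contained JSON document; pairing its "data" rows with the "headings" is enough to pick out inserted ports. A minimal standalone sketch of that idea (illustrative only, not neutron's actual parsing code):

```python
import json

# One line of `ovsdb-client monitor ... --format=json` output,
# taken from the working devstack log above.
line = ('{"data":[["74270315-5533-4d5b-98f2-02cbb56e9f98","insert",'
        '"tapbda81760-59",["set",[]],["map",[["attached-mac","fa:16:3e:61:e0:67"],'
        '["iface-id","bda81760-5978-471d-99a3-927f69d16523"],'
        '["iface-status","active"]]]]],'
        '"headings":["row","action","name","ofport","external_ids"]}')

msg = json.loads(line)
# Pair each data row with the headings to get named fields.
rows = [dict(zip(msg["headings"], row)) for row in msg["data"]]
added = [r["name"] for r in rows if r["action"] == "insert"]
print(added)  # ['tapbda81760-59']
```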

Comment 20 Brian Haley 2018-08-06 21:03:27 UTC
I looked at this some more with Slawek today, just wanted to add that info here.

After rebooting Nate's test undercloud, we had some current OVS and DHCP logs that showed the failure.  Since the logs are being rotated in just a few days this is almost a necessary step.  Either way, from the OVS agent log we can see that it started and the dhcp tap port was in the DB:

2018-08-06 11:12:36.243 4595 DEBUG ovsdbapp.backend.ovs_idl.transaction [-] Running txn n=1 command(idx=0): DbListCommand(if_exists=False, records=[u'tap5ddf0705-0e'], table=Port, columns=['name', 'other_config', 'tag'], row=False) do_commit /usr/lib/python2.7/site-packages/ovsdbapp/backend/ovs_idl/transaction.py:84
2018-08-06 11:12:36.244 4595 DEBUG ovsdbapp.backend.ovs_idl.transaction [-] Transaction caused no change do_commit /usr/lib/python2.7/site-packages/ovsdbapp/backend/ovs_idl/transaction.py:121

"Transaction caused no change" means the port was already there.

Right afterwards the ovsdb monitor process starts, which watches for events:

2018-08-06 11:12:36.579 4595 DEBUG neutron.agent.linux.async_process [req-07b9dca0-d1b5-4ffd-b7a9-5c0ecc52fb9d - - - - -] Launching async process [ovsdb-client monitor tcp:127.0.0.1:6640 Interface name,ofport,external_ids --format=json]. start /usr/lib/python2.7/site-packages/neutron/agent/linux/async_process.py:111

Just a couple of seconds later the dhcp agent has started, and realized the dhcp tap port for the undercloud doesn't exist, so creates it:

2018-08-06 11:12:39.055 4897 DEBUG ovsdbapp.backend.ovs_idl.transaction [-] Running txn n=1 command(idx=0): DelPortCommand(if_exists=True, bridge=None, port=tap5ddf0705-0e) do_commit /usr/lib/python2.7/site-packages/ovsdbapp/backend/ovs_idl/transaction.py:84
2018-08-06 11:12:39.066 4897 DEBUG ovsdbapp.backend.ovs_idl.transaction [-] Running txn n=1 command(idx=0): AddPortCommand(bridge=br-int, may_exist=False, port=tap5ddf0705-0e) do_commit /usr/lib/python2.7/site-packages/ovsdbapp/backend/ovs_idl/transaction.py:84
2018-08-06 11:12:39.066 4897 DEBUG ovsdbapp.backend.ovs_idl.transaction [-] Running txn n=1 command(idx=1): DbSetCommand(table=Port, col_values=(('tag', 4095),), record=tap5ddf0705-0e) do_commit /usr/lib/python2.7/site-packages/ovsdbapp/backend/ovs_idl/transaction.py:84
2018-08-06 11:12:39.068 4897 DEBUG ovsdbapp.backend.ovs_idl.transaction [-] Running txn n=1 command(idx=2): DbSetCommand(table=Interface, col_values=(('type', 'internal'), ('external_ids', {'iface-id': u'5ddf0705-0ede-46c3-b545-997a34334d24', 'iface-status': 'active', 'attached-mac': u'fa:16:3e:49:50:5e'})), record=tap5ddf0705-0e) do_commit /usr/lib/python2.7/site-packages/ovsdbapp/backend/ovs_idl/transaction.py:84

The ovsdb monitor should have noticed that, and the ovs agent should have started work right afterwards.  But there is no indication in the log that the ovs agent did anything.

If we assume somehow the monitor missed the event, from looking at the ovs agent code it's not clear it could detect a change in the port, since it should (I believe) look the same as far as config info even when the dhcp agent plugs it in.  We would need to do some further debugging to see if that's the case and work on something to address it if so.

For now though we are thinking that on a boot, it might be enough to make sure the dhcp agent is always started after the ovs agent, which should give it enough time to initialize itself and the monitor before the dhcp agent plugs the port.  I believe there is a way to do that via systemd so we will look at doing that and running some tests to see if it helps.

Comment 21 Terry Wilson 2018-08-07 02:21:22 UTC
> "Transaction caused no change" means the port was already there.

Just as an FYI, since it was a DbListCommand and not an AddPortCommand, this just means that we didn't modify the DB by doing the command (dblist only reads). So it doesn't speak to whether the port was there (though if it was a reboot, it almost certainly *was* there since ovs/netns cleanup aren't running).

Although the DHCP agent is starting before the OVS agent, I can see from the attached logs that the ovsdb-client monitor command runs in the OVS agent well before the DHCP agent tries to add the port, so although starting the DHCP agent afterwards certainly sounds like a good idea to me as well, I'm not sure it will fix the issue.

I'm still really confused why the supposedly running ovsdb-client monitor isn't seeing port add/del. I haven't recreated that yet, even running ovs-vsctl from the dhcp-agent container and ovsdb-client monitor from the ovs agent container to replicate what should be going on.

NOTE: OVS logs I've looked at so far show an ioctl(SIOCGIFHWADDR) device-not-found error (trying to get the ethaddr) each time that I've seen the port add with no ovsdb-client monitor log lines in the ovs agent. The ovs code mentions that it probably just means the port was deleted asynchronously, and the code path doesn't *look* like it should matter. But it is something to note.

Comment 22 Brian Haley 2018-08-07 21:03:35 UTC
Terry found an issue where the ovsdb-client monitor when outputting json does not appear to be emitting newlines, which is most likely causing the agent to not get the notification the port has been added/updated.  He's doing some additional testing now.

Comment 24 Terry Wilson 2018-08-07 21:14:31 UTC
It appears that OVS commit cb139fa8b3a1d653e123df8533ab8d42ae354149 modified the output of the table printing in format=json so that ovsdb-client monitor no longer prints a newline for each change. The async_process code in neutron calls readline(), which blocks forever since there is never a newline; thus neutron_ovs_agent never sees the port being added and never sets the tag, so no traffic passes on the dhcp interface.
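The failure mode can be sketched in a few lines (a standalone illustration, not neutron's actual async_process code): a reader that only delivers newline-terminated records sees nothing at all if the producer stops emitting the trailing '\n':

```python
import json

class LineReader:
    """Buffers raw output and only delivers complete, newline-terminated
    records -- the same contract readline() gives the agent's read loop.
    (Illustrative class, not neutron's real implementation.)"""

    def __init__(self):
        self._buf = ""
        self.events = []

    def feed(self, chunk):
        self._buf += chunk
        # Only complete lines ending in '\n' are ever processed.
        while "\n" in self._buf:
            line, self._buf = self._buf.split("\n", 1)
            if line:
                self.events.append(json.loads(line))

reader = LineReader()
update = json.dumps({"headings": ["row", "action", "name"],
                     "data": [["74270315", "insert", "tapbda81760-59"]]})

# Broken OVS (post cb139fa8): the record arrives without a trailing
# newline, so readline() never returns and no event is ever delivered.
reader.feed(update)
assert reader.events == []

# Fixed OVS: the trailing newline completes the record.
reader.feed("\n")
assert len(reader.events) == 1
```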

I'll update the bug and look for exactly where in that commit the offending code is, but before that commit there were newlines and after it no newlines.

Comment 25 Terry Wilson 2018-08-08 05:00:22 UTC
aconole has a patch for the ovs newline issue upstream at https://mail.openvswitch.org/pipermail/ovs-dev/2018-August/350795.html

Comment 26 Terry Wilson 2018-08-08 15:11:24 UTC
I can verify that, using an RPM I built with aconole's patch on the undercloud and in the ovs-agent container, when I manually set tag=4095 and reboot the undercloud, the tag eventually gets set to '1' by the ovs agent as it should.

Comment 27 Terry Wilson 2018-08-09 03:00:23 UTC
fbl built openvswitch2.10-2.10.0-0.20180808gitfaf64fb.el7fdp which includes aconole's fix.

Comment 32 Filip Hubík 2018-08-17 12:28:06 UTC
I can no longer confirm this issue as solved in CI with the latest puddle (2018-08-15.1). I am most likely hitting the same issue as described here, in the phase1 job and also in manual deployments. The openvswitch2.10-2.10.0-0.20180808gitfaf64fb.el7fdp package is installed on the UC, but it looks like the OVS network layout on the UC changed (since puddle 2018-08-02.2, on which I previously verified the fix: https://bugzilla.redhat.com/show_bug.cgi?id=1608508#c30):

uc $ sudo ovs-vsctl show
PMD: net_mlx5: cannot load glue library: libibverbs.so.1: cannot open shared object file: No such file or directory
PMD: net_mlx5: cannot initialize PMD due to missing run-time dependency on rdma-core libraries (libibverbs, libmlx5)
PMD: net_mlx4: cannot load glue library: libibverbs.so.1: cannot open shared object file: No such file or directory
PMD: net_mlx4: cannot initialize PMD due to missing run-time dependency on rdma-core libraries (libibverbs, libmlx4)
453cae58-2513-4d4e-90c9-773676d4993d
    Bridge br-ctlplane
        fail_mode: standalone
        Port br-ctlplane
            Interface br-ctlplane
                type: internal
        Port "eth0"
            Interface "eth0"
    ovs_version: "2.10.0"


OC nodes are stuck in BUILD state and DHCP traffic looks like:

uc $ sudo tcpdump -i br-ctlplane -n port 67 and port 68
07:47:12.251548 IP 0.0.0.0.bootpc > 255.255.255.255.bootps: BOOTP/DHCP, Request from 52:54:00:ed:20:fd, length 396
07:47:12.582448 IP 0.0.0.0.bootpc > 255.255.255.255.bootps: BOOTP/DHCP, Request from 52:54:00:84:54:dd, length 396
07:47:13.039903 IP 0.0.0.0.bootpc > 255.255.255.255.bootps: BOOTP/DHCP, Request from 52:54:00:41:dc:84, length 396
07:47:20.160399 IP 0.0.0.0.bootpc > 255.255.255.255.bootps: BOOTP/DHCP, Request from 52:54:00:ed:20:fd, length 396
07:47:20.491573 IP 0.0.0.0.bootpc > 255.255.255.255.bootps: BOOTP/DHCP, Request from 52:54:00:84:54:dd, length 396
07:47:20.949136 IP 0.0.0.0.bootpc > 255.255.255.255.bootps: BOOTP/DHCP, Request from 52:54:00:41:dc:84, length 396
07:47:35.980112 IP 0.0.0.0.bootpc > 255.255.255.255.bootps: BOOTP/DHCP, Request from 52:54:00:ed:20:fd, length 396
07:47:36.310307 IP 0.0.0.0.bootpc > 255.255.255.255.bootps: BOOTP/DHCP, Request from 52:54:00:84:54:dd, length 396
07:47:36.767715 IP 0.0.0.0.bootpc > 255.255.255.255.bootps: BOOTP/DHCP, Request from 52:54:00:41:dc:84, length 396
...

rpm -qa | grep openvswi
openvswitch-selinux-extra-policy-1.0-5.el7fdp.noarch
rhosp-openvswitch-2.10-0.1.el7ost.noarch
openvswitch2.10-2.10.0-0.20180808gitfaf64fb.el7fdp.x86_64

Also, docker reports openstack-neutron-openvswitch-agent restarting:
38284633a770        docker-registry.engineering.redhat.com/rhosp14/openstack-neutron-openvswitch-agent:2018-08-15.1   "kolla_start"            About an hour ago   Restarting (1) 4 minutes ago                       neutron_ovs_agent

The Docker log for the related container indicates a compatibility issue:
uc $ sudo docker logs 38284633a770

 /usr/bin/python -m neutron.cmd.destroy_patch_ports --config-file /usr/share/neutron/neutron-dist.conf --config-file /etc/neutron/neutron.conf --config-file /etc/neutron/plugins/ml2/openvswitch_agent.ini --config-dir /etc/neutron/conf.d/common --config-dir /etc/neutron/conf.d/neutron-openvswitch-agent
PMD: net_mlx5: cannot load glue library: libibverbs.so.1: cannot open shared object file: No such file or directory
PMD: net_mlx5: cannot initialize PMD due to missing run-time dependency on rdma-core libraries (libibverbs, libmlx5)
PMD: net_mlx4: cannot load glue library: libibverbs.so.1: cannot open shared object file: No such file or directory
PMD: net_mlx4: cannot initialize PMD due to missing run-time dependency on rdma-core libraries (libibverbs, libmlx4)
Traceback (most recent call last):
  File "/usr/lib64/python2.7/runpy.py", line 162, in _run_module_as_main
    "__main__", fname, loader, pkg_name)
  File "/usr/lib64/python2.7/runpy.py", line 72, in _run_code
    exec code in run_globals
  File "/usr/lib/python2.7/site-packages/neutron/cmd/destroy_patch_ports.py", line 83, in <module>
    main()
  File "/usr/lib/python2.7/site-packages/neutron/cmd/destroy_patch_ports.py", line 78, in main
    port_cleaner = PatchPortCleaner(cfg.CONF)
  File "/usr/lib/python2.7/site-packages/neutron/cmd/destroy_patch_ports.py", line 44, in __init__
    for bridge in mappings.values()]
  File "/usr/lib/python2.7/site-packages/neutron/agent/common/ovs_lib.py", line 218, in __init__
    super(OVSBridge, self).__init__()
  File "/usr/lib/python2.7/site-packages/neutron/agent/common/ovs_lib.py", line 117, in __init__
    self.ovsdb = ovsdb_api.from_config(self)
  File "/usr/lib/python2.7/site-packages/neutron/agent/ovsdb/api.py", line 30, in from_config
    ovsdb_api.interface_map[iface_name or cfg.CONF.OVS.ovsdb_interface])
  File "/usr/lib/python2.7/site-packages/oslo_utils/importutils.py", line 73, in import_module
    __import__(import_str)
  File "/usr/lib/python2.7/site-packages/neutron/agent/ovsdb/impl_idl.py", line 17, in <module>
    from ovsdbapp.backend.ovs_idl import command
  File "/usr/lib/python2.7/site-packages/ovsdbapp/backend/ovs_idl/__init__.py", line 16, in <module>
    from ovsdbapp.backend.ovs_idl import command as cmd
  File "/usr/lib/python2.7/site-packages/ovsdbapp/backend/ovs_idl/command.py", line 21, in <module>
    from ovsdbapp.backend.ovs_idl import idlutils
  File "/usr/lib/python2.7/site-packages/ovsdbapp/backend/ovs_idl/idlutils.py", line 21, in <module>
    from ovs.db import idl
  File "/usr/lib64/python2.7/site-packages/ovs/db/idl.py", line 25, in <module>
    from ovs.db import custom_index
  File "/usr/lib64/python2.7/site-packages/ovs/db/custom_index.py", line 12, in <module>
    from ovs.compat import sortedcontainers
ImportError: No module named compat
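The failing import can be probed directly; a minimal sketch (plain stdlib, nothing Neutron-specific) of checking whether a vendored submodule such as ovs.compat is locatable before a real import blows up:

```python
import importlib.util

def has_module(name):
    """Return True if the named module can be located on sys.path.

    find_spec() only searches for the module; it does not execute it,
    so it is a safe way to probe for a missing vendored submodule
    (like ovs.compat here) without triggering the full import chain.
    """
    try:
        return importlib.util.find_spec(name) is not None
    except ImportError:
        # find_spec("a.b") raises if the parent package "a" is absent
        return False

# Demonstrated with stdlib names (ovs.compat is what failed above):
print(has_module("json"))              # True
print(has_module("no_such_module_x"))  # False
```

Running the same check for "ovs.compat" inside the container would confirm whether the packaged python-openvswitch build ships the compat shim.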

Comment 33 Filip Hubík 2018-08-17 12:33:28 UTC
This may be related to https://bugzilla.redhat.com/show_bug.cgi?id=1615892 and https://bugzilla.redhat.com/show_bug.cgi?id=1614817; possibly we are missing the fixed package in the container?

Comment 34 Filip Hubík 2018-08-17 12:43:37 UTC
Ignore the last part of my previous comment; it looks like the supposedly fixed package is present inside the failing openstack-neutron-openvswitch-agent:2018-08-15.1 container:

uc $ docker exec -it 195fa477cc1b rpm -qa | grep openvswitch
rhosp-openvswitch-2.10-0.1.el7ost.noarch
openvswitch-selinux-extra-policy-1.0-5.el7fdp.noarch
python-openvswitch2.10-2.10.0-0.20180808gitfaf64fb.el7fdp.x86_64

Comment 35 Bernard Cafarelli 2018-08-17 13:03:31 UTC
From bug #1614817, the fixed version of python-openvswitch should be 20180810, while the container's version is 20180808 (so without the compat fix).

Comment 36 Filip Hubík 2018-08-20 12:31:41 UTC
It seems that some of the common symptoms (missing DHCP replies, overcloud VMs stuck in BUILD state, and the openstack-neutron-openvswitch-agent container restarting periodically) were indeed caused by a different issue (the compat problem, https://bugzilla.redhat.com/show_bug.cgi?id=1614817), not by improper tagging of one of the br-int interfaces.

Since the compat-related issues were resolved in openvswitch2.10-2.10.0-0.20180810git58a7ce6.el7fdp but not in the older openvswitch2.10-2.10.0-0.20180808gitfaf64fb.el7fdp, there is no way to verify the OVS tagging fix against the older 20180808 build; both issues are fixed as of 20180810git58a7ce6, and director is able to deploy the overcloud stack.

Moving back to Modified for another spin through CI, since we already have both issues fixed in the newer package.
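For reference, the tagging regression itself (per the fix description) came from OVS dropping the trailing newline in ovsdb-client monitor output, which a line-buffered consumer must see before it processes a record. A minimal sketch (hypothetical parser, not Neutron's actual code, with a made-up record format) of why the missing newline hides the port-added event:

```python
def parse_lines(chunks):
    """Line-buffered parsing: a record is emitted only once its
    terminating newline has arrived; anything after the last newline
    stays in the buffer waiting for more input.
    """
    buf = ""
    records = []
    for chunk in chunks:
        buf += chunk
        while "\n" in buf:
            line, buf = buf.split("\n", 1)
            records.append(line)
    return records

# With the trailing newline, the port-added record is processed:
print(parse_lines(["row,insert,tap42ab\n"]))  # ['row,insert,tap42ab']

# Without it (the regression), the record never leaves the buffer,
# so the new port is never detected and its VLAN tag is never set:
print(parse_lines(["row,insert,tap42ab"]))    # []
```

This matches the observed symptom: traffic (including DHCP replies) cannot pass over a port whose tag was never applied.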

Comment 39 Filip Hubík 2018-08-20 14:49:25 UTC
Verified: the fixed package is included since puddle 2018-08-17.2, the tap interface connected to br-int has the right tag ("1"), and heat deployment passes (overcloud deployment reaches CREATE_COMPLETE).

Comment 40 Filip Hubík 2018-10-26 15:09:06 UTC
Note: although the main issue with interface tagging was fixed, the SELinux AVC denials mentioned here are still visible in deployments. I created https://bugzilla.redhat.com/show_bug.cgi?id=1643571 to track that issue specifically.

Comment 43 errata-xmlrpc 2019-01-11 11:50:58 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2019:0045