Bug 1608508
Summary: Overcloud instances cannot pxe boot during overcloud deployment, ovsdb monitor regressed

Product: Red Hat OpenStack
Component: openvswitch
Version: 14.0 (Rocky)
Target Release: 14.0 (Rocky)
Target Milestone: beta
Hardware: Unspecified
OS: Unspecified
Status: CLOSED ERRATA
Severity: urgent
Priority: urgent
Keywords: AutomationBlocker, Regression, Triaged
Reporter: Marius Cornea <mcornea>
Assignee: Terry Wilson <twilson>
QA Contact: Roee Agiman <ragiman>
CC: agurenko, amuller, apevec, bcafarel, bfournie, bhaley, chrisw, dbecker, fhubik, fleitner, jschluet, mburns, mcornea, morazi, njohnston, nyechiel, ohochman, rhos-maint, sasha, srevivo, tvignaud, twilson
Fixed In Version: openvswitch2.10-2.10.0-0.20180810git58a7ce6.el7fdp
Doc Type: Bug Fix
Doc Text:
Cause: A change in Open vSwitch's table printing inadvertently removed appending a newline to the output.
Consequence: When using ovsdb_interface=vsctl, Neutron's parsing of ovsdb-client monitor output relied on the newline for detecting added ports. Since the port was never detected, the tag was not set and traffic could not pass over the port.
Fix: Re-add the newline in the Open vSwitch code.
Result: Newly added ports can now pass traffic again.
Story Points: ---
Last Closed: 2019-01-11 11:50:58 UTC
Type: Bug
Workaround: disable SELinux and reboot (or set undercloud_enable_selinux in undercloud.conf before starting the undercloud installation):

sed -i s/SELINUX=enforcing/SELINUX=permissive/ /etc/selinux/config
reboot

[root@undercloud-0 stack]# grep denied /var/log/audit/audit.log
type=AVC msg=audit(1532537334.221:459): avc: denied { net_broadcast } for pid=6385 comm="privsep-helper" capability=11 scontext=system_u:system_r:openvswitch_t:s0 tcontext=system_u:system_r:openvswitch_t:s0 tclass=capability
type=AVC msg=audit(1532537334.321:466): avc: denied { net_broadcast } for pid=6408 comm="privsep-helper" capability=11 scontext=system_u:system_r:openvswitch_t:s0 tcontext=system_u:system_r:openvswitch_t:s0 tclass=capability
type=AVC msg=audit(1532537836.193:50): avc: denied { net_broadcast } for pid=1057 comm="ovs-vswitchd" capability=11 scontext=system_u:system_r:openvswitch_t:s0 tcontext=system_u:system_r:openvswitch_t:s0 tclass=capability
type=AVC msg=audit(1532537920.353:49): avc: denied { net_broadcast } for pid=1058 comm="ovs-vswitchd" capability=11 scontext=system_u:system_r:openvswitch_t:s0 tcontext=system_u:system_r:openvswitch_t:s0 tclass=capability
type=AVC msg=audit(1532538150.079:50): avc: denied { net_broadcast } for pid=1056 comm="ovs-vswitchd" capability=11 scontext=system_u:system_r:openvswitch_t:s0 tcontext=system_u:system_r:openvswitch_t:s0 tclass=capability
type=AVC msg=audit(1532538421.132:50): avc: denied { net_broadcast } for pid=1059 comm="ovs-vswitchd" capability=11 scontext=system_u:system_r:openvswitch_t:s0 tcontext=system_u:system_r:openvswitch_t:s0 tclass=capability
type=AVC msg=audit(1532538699.233:48): avc: denied { net_broadcast } for pid=1021 comm="ovs-vswitchd" capability=11 scontext=system_u:system_r:openvswitch_t:s0 tcontext=system_u:system_r:openvswitch_t:s0 tclass=capability
type=AVC msg=audit(1532538742.752:430): avc: denied { net_broadcast } for pid=5902 comm="privsep-helper" capability=11 scontext=system_u:system_r:openvswitch_t:s0 tcontext=system_u:system_r:openvswitch_t:s0 tclass=capability
[root@undercloud-0 stack]#

Is openvswitch-selinux-extra-policy installed? If so, we can move this to the OVS team.

(In reply to Lon Hohberger from comment #2)
> Is openvswitch-selinux-extra-policy installed? If so, we can move this to
> the OVS team.

Yes, openvswitch-selinux-extra-policy-1.0-3.el7fdp.noarch is installed.

The issue is still present even with the selinux fix for openvswitch. Note that switching SELinux to permissive doesn't work as a workaround. It seems that the port tag can get set to '1' after a random number of undercloud reboot attempts, which leads me to guess it's some sort of race condition introduced with the containerized Neutron services.

I can confirm this issue being hit in downstream CI, puddle 2018-07-26.7:

uc $ tcpdump -i br-ctlplane -n port 67 and port 68
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on br-ctlplane, link-type EN10MB (Ethernet), capture size 262144 bytes
07:13:50.117185 IP 0.0.0.0.bootpc > 255.255.255.255.bootps: BOOTP/DHCP, Request from 52:54:00:d1:87:df, length 396
07:13:55.478138 IP 0.0.0.0.bootpc > 255.255.255.255.bootps: BOOTP/DHCP, Request from 52:54:00:dc:85:50, length 396
07:13:55.754999 IP 0.0.0.0.bootpc > 255.255.255.255.bootps: BOOTP/DHCP, Request from 52:54:00:53:05:44, length 396
07:13:58.021948 IP 0.0.0.0.bootpc > 255.255.255.255.bootps: BOOTP/DHCP, Request from 52:54:00:d1:87:df, length 396
07:14:03.387144 IP 0.0.0.0.bootpc > 255.255.255.255.bootps: BOOTP/DHCP, Request from 52:54:00:dc:85:50, length 396
07:14:03.663673 IP 0.0.0.0.bootpc > 255.255.255.255.bootps: BOOTP/DHCP, Request from 52:54:00:53:05:44, length 396
07:14:13.841256 IP 0.0.0.0.bootpc > 255.255.255.255.bootps: BOOTP/DHCP, Request from 52:54:00:d1:87:df, length 396
07:14:19.205997 IP 0.0.0.0.bootpc > 255.255.255.255.bootps: BOOTP/DHCP, Request from 52:54:00:dc:85:50, length 396
07:14:19.482659 IP 0.0.0.0.bootpc > 255.255.255.255.bootps: BOOTP/DHCP, Request from 52:54:00:53:05:44, length 396

Nodes are stuck in BUILD state until the job is explicitly killed by timeout. No avc denials. iPXE fails:

No more network devices
No bootable device.

openvswitch-selinux-extra-policy-1.0-4.1.el7ost.noarch
openstack-tripleo-heat-templates-9.0.0-0.20180720154239.959e1d7.el7ost.noarch

"The issue is still present even with the selinux fix for openvswitch." Additionally, "No avc denials." This tells me that it's unlikely to be an SELinux issue. What are the logs from the jobs?

I'm moving the bug back to NEW since the issue reported initially is still present regardless of the SELinux fix. The issue is presently under investigation by Networking DFG.

Given c#5 can we reassign this? It doesn't seem likely to be an ovs bug.

(In reply to Aaron Conole from comment #12)
> Given c#5 can we reassign this? It doesn't seem likely to be an ovs bug.

Yes, re-assigned.

8/2 COB update: Checking the test system where the issue is replicated, I see after multiple reboots of the undercloud node that sometimes the interface corresponding to the DHCP agent has VLAN 1 tagged (correct) and other times it has VLAN 4095 tagged. VLAN 4095 functions to discard all traffic to the interface, so that is congruent with the input in the bug report.

Inspecting the logs, so far the main thing that I have noticed is that oslo.privsep is complaining about not being able to find an interface - but that interface does exist.

2018-08-02 14:31:49.128 6218 DEBUG oslo.privsep.daemon [-] privsep: Exception during request[139927564007376]: Network interface tap5ddf0705-0e not found in namespace qdhcp-3a7596e0-4556-4db4-8777-d420f547736c. loop /usr/lib/python2.7/site-packages/oslo_privsep/daemon.py:449

I will continue debugging with this as my top priority.

Created attachment 1472826 [details]
combined ovs-vswitchd, dhcp and openvswitch agent log files
Note that the ovs-vswitchd log lines seem to be merged into the correct second with the other logs, but for some reason are listed at the top of that second.
Some relevant info from the combined logs:
DHCP agent starts
2018-07-30 11:20:47.813 3976 INFO neutron.common.config [-] Logging enabled!
2018-07-30 11:20:47.824 3976 INFO neutron.common.config [-] /usr/bin/neutron-dhcp-agent version 13.0.0.0b3.dev177
2018-07-30 11:20:47.824 3976 DEBUG neutron.common.config [-] command line: /usr/bin/neutron-dhcp-agent --config-file /usr/share/neutron/neutron-dist.conf --config-file /etc/neutron/neutron.conf --config-file /etc/neutron/dhcp_agent.ini --config-dir /etc/neutron/conf.d/common --config-dir /etc/neutron/conf.d/neutron-dhcp-agent --log-file=/var/log/neutron/dhcp-agent.log setup_logging /usr/lib/python2.7/site-packages/neutron/common/config.py:104
OVS agent starts
2018-07-30 11:20:51.177 4377 INFO neutron.common.config [-] Logging enabled!
2018-07-30 11:20:51.199 4377 INFO neutron.common.config [-] /usr/bin/neutron-openvswitch-agent version 13.0.0.0b3.dev177
2018-07-30 11:20:51.199 4377 DEBUG neutron.common.config [-] command line: /usr/bin/neutron-openvswitch-agent --config-file /usr/share/neutron/neutron-dist.conf --config-file /etc/neutron/neutron.conf --config-file /etc/neutron/plugins/ml2/openvswitch_agent.ini --config-file /etc/neutron/plugins/ml2/ml2_conf.ini --config-dir /etc/neutron/conf.d/common --log-file=/var/log/neutron/openvswitch-agent.log setup_logging /usr/lib/python2.7/site-packages/neutron/common/config.py:104
DHCP Agent actually starts doing things and recreates interface
Lots of RPC failures until we finally sync up
2018-07-30 11:22:53.546 [dhcp-agent.log.1] 3976 INFO oslo.privsep.daemon [-] Running privsep helper: ['sudo', 'neutron-rootwrap', '/etc/neutron/rootwrap.conf', 'privsep-helper', '--config-file', '/usr/share/neutron/neutron-dist.conf', '--config-file', '/etc/neutron/neutron.conf', '--config-file', '/etc/neutron/dhcp_agent.ini', '--config-dir', '/etc/neutron/conf.d/neutron-dhcp-agent', '--privsep_context', 'neutron.privileged.default', '--privsep_sock_path', '/tmp/tmpWrqKIu/privsep.sock']
...
2018-07-30 11:22:54.435 [dhcp-agent.log.1] 3976 DEBUG ovsdbapp.backend.ovs_idl.transaction [-] Running txn n=1 command(idx=0): DelPortCommand(if_exists=True, bridge=None, port=tapcf84dbec-1e) do_commit /usr/lib/python2.7/site-packages/ovsdbapp/backend/ovs_idl/transaction.py:84
2018-07-30T15:22:54.437Z| [ovs-vswitchd.log-20180731] 00064|bridge|INFO|bridge br-int: deleted interface tapcf84dbec-1e on port 4
2018-07-30 11:22:54.450 [dhcp-agent.log.1] 3976 DEBUG ovsdbapp.backend.ovs_idl.transaction [-] Running txn n=1 command(idx=0): AddPortCommand(bridge=br-int, may_exist=False, port=tapcf84dbec-1e) do_commit /usr/lib/python2.7/site-packages/ovsdbapp/backend/ovs_idl/transaction.py:84
2018-07-30 11:22:54.450 [dhcp-agent.log.1] 3976 DEBUG ovsdbapp.backend.ovs_idl.transaction [-] Running txn n=1 command(idx=1): DbSetCommand(table=Port, col_values=(('tag', 4095),), record=tapcf84dbec-1e) do_commit /usr/lib/python2.7/site-packages/ovsdbapp/backend/ovs_idl/transaction.py:84
2018-07-30 11:22:54.452 [dhcp-agent.log.1] 3976 DEBUG ovsdbapp.backend.ovs_idl.transaction [-] Running txn n=1 command(idx=2): DbSetCommand(table=Interface, col_values=(('type', 'internal'), ('external_ids', {'iface-id': u'cf84dbec-1ea0-486c-b464-b774b8c576ff', 'iface-status': 'active', 'attached-mac': u'fa:16:3e:50:bd:75'})), record=tapcf84dbec-1e) do_commit /usr/lib/python2.7/site-packages/ovsdbapp/backend/ovs_idl/transaction.py:84
2018-07-30T15:22:54.460Z| [ovs-vswitchd.log-20180731] 00065|bridge|INFO|bridge br-int: added interface tapcf84dbec-1e on port 3
2018-07-30T15:22:54.493Z| [ovs-vswitchd.log-20180731] 00066|netdev_linux|INFO|ioctl(SIOCGIFHWADDR) on tapcf84dbec-1e device failed: No such device
After this, the openvswitch agent never detects that the ovsdb port was deleted and re-added, so the port is left with tag=4095. Not sure why ovs-vswitchd complains about the tapcf84dbec-1e device not existing when it just created it and it is a type 'internal' port.
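To make the failure mode concrete, here is a tiny, purely illustrative Python model (the FakeAgent class is invented for this sketch; only the dead-VLAN value 4095 and the port name come from the logs above). The DHCP agent plugs the port with the dead tag 4095, and the real tag is only set when the OVS agent sees the monitor event, so a missed event leaves the dead tag in place:

```python
# Illustrative model only (not Neutron code): the OVS agent retags a port
# exclusively in response to an ovsdb monitor event for it. If that event
# is lost, the port keeps the dead VLAN tag 4095 and passes no traffic.
DEAD_VLAN_TAG = 4095

class FakeAgent:
    def __init__(self):
        self.tags = {}

    def plug_port(self, name):
        # What the DHCP agent does: plug the port with the dead tag.
        self.tags[name] = DEAD_VLAN_TAG

    def on_monitor_event(self, name):
        # What the OVS agent does when the monitor reports the new port:
        # provision the real local VLAN (1 in the undercloud case).
        self.tags[name] = 1

agent = FakeAgent()

# Healthy path: the monitor emits the event and the tag gets fixed up.
agent.plug_port("tap5ddf0705-0e")
agent.on_monitor_event("tap5ddf0705-0e")
assert agent.tags["tap5ddf0705-0e"] == 1

# Buggy path: the event is never delivered (readline() never returns),
# so on_monitor_event() is never called and the dead tag remains.
agent.plug_port("tap5ddf0705-0e")
print("missed event leaves tag", agent.tags["tap5ddf0705-0e"])
```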
For reference, from a working devstack install, the merged log will have output from ovsdb-client monitor after the dhcp agent recreates the port. It will look something like:

Aug 03 13:02:45 neutron-dhcp-agent[26581]: DEBUG ovsdbapp.backend.ovs_idl.transaction [-] Running txn n=1 command(idx=0): DelPortCommand(if_exists=True, bridge=None, port=tapbda81760-59) {{(pid=26581) do_commit /usr/lib/python2.7/site-packages/ovsdbapp/backend/ovs_idl/transaction.py:84}}
Aug 03 13:02:45 neutron-dhcp-agent[26581]: DEBUG ovsdbapp.backend.ovs_idl.transaction [-] Transaction caused no change {{(pid=26581) do_commit /usr/lib/python2.7/site-packages/ovsdbapp/backend/ovs_idl/transaction.py:121}}
Aug 03 13:02:45 neutron-dhcp-agent[26581]: DEBUG ovsdbapp.backend.ovs_idl.transaction [-] Running txn n=1 command(idx=0): AddPortCommand(bridge=br-int, may_exist=False, port=tapbda81760-59) {{(pid=26581) do_commit /usr/lib/python2.7/site-packages/ovsdbapp/backend/ovs_idl/transaction.py:84}}
Aug 03 13:02:45 neutron-dhcp-agent[26581]: DEBUG ovsdbapp.backend.ovs_idl.transaction [-] Running txn n=1 command(idx=1): DbSetCommand(table=Port, col_values=(('tag', 4095),), record=tapbda81760-59) {{(pid=26581) do_commit /usr/lib/python2.7/site-packages/ovsdbapp/backend/ovs_idl/transaction.py:84}}
Aug 03 13:02:45 neutron-dhcp-agent[26581]: DEBUG ovsdbapp.backend.ovs_idl.transaction [-] Running txn n=1 command(idx=2): DbSetCommand(table=Interface, col_values=(('type', 'internal'), ('external_ids', {'iface-id': u'bda81760-5978-471d-99a3-927f69d16523', 'iface-status': 'active', 'attached-mac': u'fa:16:3e:61:e0:67'})), record=tapbda81760-59) {{(pid=26581) do_commit /usr/lib/python2.7/site-packages/ovsdbapp/backend/ovs_idl/transaction.py:84}}
Aug 03 13:02:45 neutron-openvswitch-agent[25979]: DEBUG neutron.agent.linux.async_process [-] Output received from [ovsdb-client monitor tcp:127.0.0.1:6640 Interface name,ofport,external_ids --format=json]: {"data":[["74270315-5533-4d5b-98f2-02cbb56e9f98","insert","tapbda81760-59",["set",[]],["map",[["attached-mac","fa:16:3e:61:e0:67"],["iface-id","bda81760-5978-471d-99a3-927f69d16523"],["iface-status","active"]]]]],"headings":["row","action","name","ofport","external_ids"]} {{(pid=25979) _read_stdout /opt/stack/neutron/neutron/agent/linux/async_process.py:241}}
Aug 03 13:02:45 neutron-openvswitch-agent[25979]: DEBUG neutron.agent.linux.async_process [-] Output received from [ovsdb-client monitor tcp:127.0.0.1:6640 Interface name,ofport,external_ids --format=json]: {"data":[["74270315-5533-4d5b-98f2-02cbb56e9f98","old",null,["set",[]],null],["","new","tapbda81760-59",3,["map",[["attached-mac","fa:16:3e:61:e0:67"],["iface-id","bda81760-5978-471d-99a3-927f69d16523"],["iface-status","active"]]]]],"headings":["row","action","name","ofport","external_ids"]} {{(pid=25979) _read_stdout /opt/stack/neutron/neutron/agent/linux/async_process.py:241}}

In our logs, that doesn't happen even though a minute before that the ovsdb-client monitor...Interface command is started.

I looked at this some more with Slawek today, just wanted to add that info here. After rebooting Nate's test undercloud, we had some current OVS and DHCP logs that showed the failure. Since the logs are being rotated in just a few days, this is almost a necessary step.
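The monitor output shown in the working devstack logs is OVSDB's JSON value notation (RFC 7047), where set and map values are wrapped as ["set", ...] and ["map", [[key, value], ...]] and each message pairs "data" rows with "headings". As a minimal sketch (the decode() helper below is my own illustration, not Neutron's actual parser), one such line could be turned into per-row dicts like this:

```python
import json

# One "insert" record in the format emitted by
# "ovsdb-client monitor ... Interface name,ofport,external_ids --format=json".
line = ('{"data":[["74270315-5533-4d5b-98f2-02cbb56e9f98","insert",'
        '"tapbda81760-59",["set",[]],'
        '["map",[["attached-mac","fa:16:3e:61:e0:67"],'
        '["iface-id","bda81760-5978-471d-99a3-927f69d16523"],'
        '["iface-status","active"]]]]],'
        '"headings":["row","action","name","ofport","external_ids"]}')

def decode(value):
    """Unwrap OVSDB JSON notation: ["set", [...]] and ["map", [[k, v], ...]]."""
    if isinstance(value, list) and len(value) == 2:
        kind, payload = value
        if kind == "set":
            return payload
        if kind == "map":
            return dict(payload)
    return value

msg = json.loads(line)
# Pair each data row with the headings to get one dict per change event.
events = [dict(zip(msg["headings"], (decode(v) for v in row)))
          for row in msg["data"]]
print(events[0]["action"], events[0]["name"])  # → insert tapbda81760-59
```

This is exactly the kind of record the OVS agent needs to see to learn that the DHCP tap port was re-added; without a trailing newline, readline()-based consumers never get it.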
Either way, from the OVS agent log we can see that it started and the dhcp tap port was in the DB:

2018-08-06 11:12:36.243 4595 DEBUG ovsdbapp.backend.ovs_idl.transaction [-] Running txn n=1 command(idx=0): DbListCommand(if_exists=False, records=[u'tap5ddf0705-0e'], table=Port, columns=['name', 'other_config', 'tag'], row=False) do_commit /usr/lib/python2.7/site-packages/ovsdbapp/backend/ovs_idl/transaction.py:84
2018-08-06 11:12:36.244 4595 DEBUG ovsdbapp.backend.ovs_idl.transaction [-] Transaction caused no change do_commit /usr/lib/python2.7/site-packages/ovsdbapp/backend/ovs_idl/transaction.py:121

"Transaction caused no change" means the port was already there. Right afterwards the ovsdb monitor process starts, which watches for events:

2018-08-06 11:12:36.579 4595 DEBUG neutron.agent.linux.async_process [req-07b9dca0-d1b5-4ffd-b7a9-5c0ecc52fb9d - - - - -] Launching async process [ovsdb-client monitor tcp:127.0.0.1:6640 Interface name,ofport,external_ids --format=json]. start /usr/lib/python2.7/site-packages/neutron/agent/linux/async_process.py:111

Just a couple of seconds later the dhcp agent has started and realized the dhcp tap port for the undercloud doesn't exist, so it creates it:

2018-08-06 11:12:39.055 4897 DEBUG ovsdbapp.backend.ovs_idl.transaction [-] Running txn n=1 command(idx=0): DelPortCommand(if_exists=True, bridge=None, port=tap5ddf0705-0e) do_commit /usr/lib/python2.7/site-packages/ovsdbapp/backend/ovs_idl/transaction.py:84
2018-08-06 11:12:39.066 4897 DEBUG ovsdbapp.backend.ovs_idl.transaction [-] Running txn n=1 command(idx=0): AddPortCommand(bridge=br-int, may_exist=False, port=tap5ddf0705-0e) do_commit /usr/lib/python2.7/site-packages/ovsdbapp/backend/ovs_idl/transaction.py:84
2018-08-06 11:12:39.066 4897 DEBUG ovsdbapp.backend.ovs_idl.transaction [-] Running txn n=1 command(idx=1): DbSetCommand(table=Port, col_values=(('tag', 4095),), record=tap5ddf0705-0e) do_commit /usr/lib/python2.7/site-packages/ovsdbapp/backend/ovs_idl/transaction.py:84
2018-08-06 11:12:39.068 4897 DEBUG ovsdbapp.backend.ovs_idl.transaction [-] Running txn n=1 command(idx=2): DbSetCommand(table=Interface, col_values=(('type', 'internal'), ('external_ids', {'iface-id': u'5ddf0705-0ede-46c3-b545-997a34334d24', 'iface-status': 'active', 'attached-mac': u'fa:16:3e:49:50:5e'})), record=tap5ddf0705-0e) do_commit /usr/lib/python2.7/site-packages/ovsdbapp/backend/ovs_idl/transaction.py:84

The ovsdb monitor should have noticed that, and the ovs agent should have started work right afterwards. But there is no indication in the log that the ovs agent did anything. If we assume somehow the monitor missed the event, from looking at the ovs agent code it's not clear it could detect a change in the port, since it should (I believe) look the same as far as config info even when the dhcp agent plugs it in. We would need to do some further debugging to see if that's the case and work on something to address it if so.

For now though we are thinking that on a boot, it might be enough to make sure the dhcp agent is always started after the ovs agent, which should give it enough time to initialize itself and the monitor before the dhcp agent plugs the port. I believe there is a way to do that via systemd, so we will look at doing that and running some tests to see if it helps.

> "Transaction caused no change" means the port was already there.
Just as an FYI, since it was a DbListCommand and not an AddPortCommand, this just means that we didn't modify the DB by running the command (dblist only reads). So it doesn't speak to whether the port was there (though if it was a reboot, it almost certainly *was* there, since ovs/netns cleanup aren't running).
Although the DHCP agent is starting before the OVS agent, I can see from the attached logs that the ovsdb-client monitor command runs in the OVS agent well before the DHCP agent tries to add the port, so although starting the DHCP agent afterwards certainly sounds like a good idea to me as well, I'm not sure it will fix the issue.
I'm still really confused why the supposedly running ovsdb-client monitor isn't seeing port add/del. I haven't recreated that yet, even running ovs-vsctl from the dhcp-agent container and ovsdb-client monitor from the ovs agent container to replicate what should be going on.
NOTE: OVS logs I've looked at so far show an ioctl(SIOCGIFHWADDR) "device not found" error (trying to get the ethaddr) each time that I've seen the port add, and no ovsdb-client monitor log lines in the ovs agent. The OVS code mentions that it probably just means the port was deleted asynchronously, and the code path doesn't *look* like it should matter. But it is something to note.
Terry found an issue where ovsdb-client monitor, when outputting json, does not appear to be emitting newlines, which is most likely causing the agent to not get the notification that the port has been added/updated. He's doing some additional testing now.

It appears that OVS commit cb139fa8b3a1d653e123df8533ab8d42ae354149 modified the output of the table printing in format=json so that ovsdb-client monitor no longer prints a newline for each change. The async_process code in neutron calls readline(), and thus blocks forever since there is never a newline, so neutron_ovs_agent never sees the port being added and never sets the tag, and no traffic passes on the dhcp interface. I'll update the bug and look for exactly where in that commit the offending code is, but before that commit there were newlines and after it no newlines.

aconole has a patch for the ovs newline issue upstream at https://mail.openvswitch.org/pipermail/ovs-dev/2018-August/350795.html

I can verify that when I use an rpm I built with aconole's patch on the undercloud and in the ovs-agent container, and manually set tag=4095 and reboot the undercloud, eventually the tag gets set to '1' by the ovs agent like it should.

fbl built openvswitch2.10-2.10.0-0.20180808gitfaf64fb.el7fdp which includes aconole's fix.
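The readline() hang at the root of this bug can be reproduced in miniature without OVS at all. In the hypothetical sketch below, a stand-in child process (not ovsdb-client itself) writes one record with no trailing newline; the parent's readline(), like neutron's async_process, then never returns:

```python
import subprocess
import sys
import threading

# Child process standing in for the regressed ovsdb-client monitor:
# it writes one JSON record WITHOUT a trailing newline, then idles.
child_src = r'''
import sys, time
sys.stdout.write('{"data": [["row1", "insert", "tap0"]]}')  # no "\n"
sys.stdout.flush()
time.sleep(30)
'''

proc = subprocess.Popen([sys.executable, "-c", child_src],
                        stdout=subprocess.PIPE, text=True)

result = []

def reader():
    # Like neutron's async_process, consume events one line at a time.
    # readline() only returns on a newline or EOF, so it hangs here.
    result.append(proc.stdout.readline())

t = threading.Thread(target=reader, daemon=True)
t.start()
t.join(timeout=2)        # give the reader two seconds to produce a line
blocked = t.is_alive()   # still alive: readline() never returned

proc.kill()              # EOF finally unblocks the reader
proc.wait()
print("reader blocked:", blocked)
```

With the upstream newline fix applied, each monitor record ends in "\n" again and the equivalent readline() returns immediately.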
I can no longer confirm this issue being solved in CI with the latest puddle (2018-08-15.1). I am most likely hitting the same issue as described here in the phase1 job and also in manual deployments. The openvswitch2.10-2.10.0-0.20180808gitfaf64fb.el7fdp package is installed on the undercloud - but it looks like the OVS network layout changed on the undercloud (since puddle 2018-08-02.2, on which I previously verified the fix, https://bugzilla.redhat.com/show_bug.cgi?id=1608508#c30):

uc $ sudo ovs-vsctl show
PMD: net_mlx5: cannot load glue library: libibverbs.so.1: cannot open shared object file: No such file or directory
PMD: net_mlx5: cannot initialize PMD due to missing run-time dependency on rdma-core libraries (libibverbs, libmlx5)
PMD: net_mlx4: cannot load glue library: libibverbs.so.1: cannot open shared object file: No such file or directory
PMD: net_mlx4: cannot initialize PMD due to missing run-time dependency on rdma-core libraries (libibverbs, libmlx4)
453cae58-2513-4d4e-90c9-773676d4993d
    Bridge br-ctlplane
        fail_mode: standalone
        Port br-ctlplane
            Interface br-ctlplane
                type: internal
        Port "eth0"
            Interface "eth0"
    ovs_version: "2.10.0"

OC nodes are stuck in BUILD state and DHCP traffic looks like:

uc $ sudo tcpdump -i br-ctlplane -n port 67 and port 68
07:47:12.251548 IP 0.0.0.0.bootpc > 255.255.255.255.bootps: BOOTP/DHCP, Request from 52:54:00:ed:20:fd, length 396
07:47:12.582448 IP 0.0.0.0.bootpc > 255.255.255.255.bootps: BOOTP/DHCP, Request from 52:54:00:84:54:dd, length 396
07:47:13.039903 IP 0.0.0.0.bootpc > 255.255.255.255.bootps: BOOTP/DHCP, Request from 52:54:00:41:dc:84, length 396
07:47:20.160399 IP 0.0.0.0.bootpc > 255.255.255.255.bootps: BOOTP/DHCP, Request from 52:54:00:ed:20:fd, length 396
07:47:20.491573 IP 0.0.0.0.bootpc > 255.255.255.255.bootps: BOOTP/DHCP, Request from 52:54:00:84:54:dd, length 396
07:47:20.949136 IP 0.0.0.0.bootpc > 255.255.255.255.bootps: BOOTP/DHCP, Request from 52:54:00:41:dc:84, length 396
07:47:35.980112 IP 0.0.0.0.bootpc > 255.255.255.255.bootps: BOOTP/DHCP, Request from 52:54:00:ed:20:fd, length 396
07:47:36.310307 IP 0.0.0.0.bootpc > 255.255.255.255.bootps: BOOTP/DHCP, Request from 52:54:00:84:54:dd, length 396
07:47:36.767715 IP 0.0.0.0.bootpc > 255.255.255.255.bootps: BOOTP/DHCP, Request from 52:54:00:41:dc:84, length 396
...

rpm -qa | grep openvswi
openvswitch-selinux-extra-policy-1.0-5.el7fdp.noarch
rhosp-openvswitch-2.10-0.1.el7ost.noarch
openvswitch2.10-2.10.0-0.20180808gitfaf64fb.el7fdp.x86_64

Also, docker reports openstack-neutron-openvswitch-agent restarting:

38284633a770 docker-registry.engineering.redhat.com/rhosp14/openstack-neutron-openvswitch-agent:2018-08-15.1 "kolla_start" About an hour ago Restarting (1) 4 minutes ago neutron_ovs_agent

The docker log for the related container indicates a compat issue:

uc $ sudo docker logs 38284633a770
/usr/bin/python -m neutron.cmd.destroy_patch_ports --config-file /usr/share/neutron/neutron-dist.conf --config-file /etc/neutron/neutron.conf --config-file /etc/neutron/plugins/ml2/openvswitch_agent.ini --config-dir /etc/neutron/conf.d/common --config-dir /etc/neutron/conf.d/neutron-openvswitch-agent
PMD: net_mlx5: cannot load glue library: libibverbs.so.1: cannot open shared object file: No such file or directory
PMD: net_mlx5: cannot initialize PMD due to missing run-time dependency on rdma-core libraries (libibverbs, libmlx5)
PMD: net_mlx4: cannot load glue library: libibverbs.so.1: cannot open shared object file: No such file or directory
PMD: net_mlx4: cannot initialize PMD due to missing run-time dependency on rdma-core libraries (libibverbs, libmlx4)
Traceback (most recent call last):
  File "/usr/lib64/python2.7/runpy.py", line 162, in _run_module_as_main
    "__main__", fname, loader, pkg_name)
  File "/usr/lib64/python2.7/runpy.py", line 72, in _run_code
    exec code in run_globals
  File "/usr/lib/python2.7/site-packages/neutron/cmd/destroy_patch_ports.py", line 83, in <module>
    main()
  File "/usr/lib/python2.7/site-packages/neutron/cmd/destroy_patch_ports.py", line 78, in main
    port_cleaner = PatchPortCleaner(cfg.CONF)
  File "/usr/lib/python2.7/site-packages/neutron/cmd/destroy_patch_ports.py", line 44, in __init__
    for bridge in mappings.values()]
  File "/usr/lib/python2.7/site-packages/neutron/agent/common/ovs_lib.py", line 218, in __init__
    super(OVSBridge, self).__init__()
  File "/usr/lib/python2.7/site-packages/neutron/agent/common/ovs_lib.py", line 117, in __init__
    self.ovsdb = ovsdb_api.from_config(self)
  File "/usr/lib/python2.7/site-packages/neutron/agent/ovsdb/api.py", line 30, in from_config
    ovsdb_api.interface_map[iface_name or cfg.CONF.OVS.ovsdb_interface])
  File "/usr/lib/python2.7/site-packages/oslo_utils/importutils.py", line 73, in import_module
    __import__(import_str)
  File "/usr/lib/python2.7/site-packages/neutron/agent/ovsdb/impl_idl.py", line 17, in <module>
    from ovsdbapp.backend.ovs_idl import command
  File "/usr/lib/python2.7/site-packages/ovsdbapp/backend/ovs_idl/__init__.py", line 16, in <module>
    from ovsdbapp.backend.ovs_idl import command as cmd
  File "/usr/lib/python2.7/site-packages/ovsdbapp/backend/ovs_idl/command.py", line 21, in <module>
    from ovsdbapp.backend.ovs_idl import idlutils
  File "/usr/lib/python2.7/site-packages/ovsdbapp/backend/ovs_idl/idlutils.py", line 21, in <module>
    from ovs.db import idl
  File "/usr/lib64/python2.7/site-packages/ovs/db/idl.py", line 25, in <module>
    from ovs.db import custom_index
  File "/usr/lib64/python2.7/site-packages/ovs/db/custom_index.py", line 12, in <module>
    from ovs.compat import sortedcontainers
ImportError: No module named compat

This can be related to https://bugzilla.redhat.com/show_bug.cgi?id=1615892 and https://bugzilla.redhat.com/show_bug.cgi?id=1614817 ; possibly we are missing the fixed package in the container?

Ignore the last part of my previous comment; it looks like the supposedly fixed package is present inside the failing openstack-neutron-openvswitch-agent:2018-08-15.1 container:

uc $ docker exec -it 195fa477cc1b rpm -qa | grep openvswi
openvswitch2.10-2.10.0-0.20180808gitfaf64fb.el7fdp.x86_64
rhosp-openvswitch-2.10-0.1.el7ost.noarch
openvswitch-selinux-extra-policy-1.0-5.el7fdp.noarch
python-openvswitch2.10-2.10.0-0.20180808gitfaf64fb.el7fdp.x86_64

From bug #1614817 the fixed version for python-openvswitch should be 20180810, while the container's version is 20180808 (so without the compat fix).

It seems that some of the common symptoms (missing dhcp replies, OC VMs stuck in BUILD state, and the openstack-neutron-openvswitch-agent container being restarted periodically) were indeed caused by a different issue (the compat one, https://bugzilla.redhat.com/show_bug.cgi?id=1614817), not improper tagging of one of the br-int interfaces. Since the compat-related issues were resolved by the newer openvswitch2.10-2.10.0-0.20180810git58a7ce6.el7fdp and not yet by openvswitch2.10-2.10.0-0.20180808gitfaf64fb.el7fdp, there is no way to verify these OVS-related issues being fixed on the older 20180808 build, but we have them fixed since 20180810git58a7ce6 and director is able to deploy the overcloud stack. Moving back to Modified for another spin through CI since we already have both issues fixed in the newer package.

Verified: the fixed package is included since puddle 2018-08-17.2, the tap interface connected to br-int has the right tag "1", and we are able to pass heat deployment (overcloud deployment reaches CREATE_COMPLETE).

Note: even though the main issue with interface tagging was fixed, the selinux avc denials mentioned here are still visible in deployments. I created https://bugzilla.redhat.com/show_bug.cgi?id=1643571 to track that issue specifically.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2019:0045
Created attachment 1470542 [details]
sosreport-undercloud-0.redhat.local-20180725122842.tar.xz

Description of problem:

Overcloud instances cannot pxe boot during overcloud deployment. Sniffing the br-ctlplane traffic on the undercloud we can notice that dhcp requests hit the interface but without getting any replies:

[root@undercloud-0 stack]# tcpdump -i br-ctlplane -n port 67 and port 68
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on br-ctlplane, link-type EN10MB (Ethernet), capture size 262144 bytes
11:58:29.434682 IP 0.0.0.0.bootpc > 255.255.255.255.bootps: BOOTP/DHCP, Request from 52:54:00:31:5a:82, length 396
11:58:37.343693 IP 0.0.0.0.bootpc > 255.255.255.255.bootps: BOOTP/DHCP, Request from 52:54:00:31:5a:82, length 396

Checking audit.log for selinux denials we can spot:

type=AVC msg=audit(1532527823.424:683): avc: denied { net_broadcast } for pid=13881 comm="ovs-vswitchd" capability=11 scontext=system_u:system_r:openvswitch_t:s0 tcontext=system_u:system_r:openvswitch_t:s0 tclass=capability

After switching to permissive the issue persists.

Checking the ovs ports:

[root@undercloud-0 stack]# ovs-vsctl show
4e7e276f-2132-463d-9e4c-fe2018b8be1e
    Manager "ptcp:6640:127.0.0.1"
        is_connected: true
    Bridge br-ctlplane
        Controller "tcp:127.0.0.1:6633"
            is_connected: true
        fail_mode: secure
        Port "eth0"
            Interface "eth0"
        Port br-ctlplane
            Interface br-ctlplane
                type: internal
        Port phy-br-ctlplane
            Interface phy-br-ctlplane
                type: patch
                options: {peer=int-br-ctlplane}
    Bridge br-int
        Controller "tcp:127.0.0.1:6633"
            is_connected: true
        fail_mode: secure
        Port br-int
            Interface br-int
                type: internal
        Port int-br-ctlplane
            Interface int-br-ctlplane
                type: patch
                options: {peer=phy-br-ctlplane}
        Port "tap207ed573-fa"
            tag: 4095
            Interface "tap207ed573-fa"
                type: internal
    ovs_version: "2.10.0"

We can see that the tap interface (tap207ed573-fa) used for the undercloud's dhcp agent has tag: 4095.

Checking the neutron port for the dhcp agent, it's reported as DOWN:

(undercloud) [stack@undercloud-0 ~]$ neutron port-show 207ed573-fad3-4345-b093-4e57007164e5
neutron CLI is deprecated and will be removed in the future. Use openstack CLI instead.
+-----------------------+-------------------------------------------------------------------------------------+
| Field                 | Value                                                                               |
+-----------------------+-------------------------------------------------------------------------------------+
| admin_state_up        | True                                                                                |
| allowed_address_pairs |                                                                                     |
| binding:host_id       | undercloud-0.localdomain                                                            |
| binding:profile       | {}                                                                                  |
| binding:vif_details   | {"port_filter": true, "datapath_type": "system", "ovs_hybrid_plug": false}          |
| binding:vif_type      | ovs                                                                                 |
| binding:vnic_type     | normal                                                                              |
| created_at            | 2018-07-25T14:23:33Z                                                                |
| description           |                                                                                     |
| device_id             | dhcp3399fad5-b1b7-5f16-832b-991aeb0022d2-6a3b6822-9104-4322-a2f5-4b235e2a93ed       |
| device_owner          | network:dhcp                                                                        |
| extra_dhcp_opts       |                                                                                     |
| fixed_ips             | {"subnet_id": "1025671f-c88c-4cff-ad47-fa59c1af7574", "ip_address": "192.168.24.5"} |
| id                    | 207ed573-fad3-4345-b093-4e57007164e5                                                |
| ip_allocation         | immediate                                                                           |
| mac_address           | fa:16:3e:e7:07:7d                                                                   |
| name                  |                                                                                     |
| network_id            | 6a3b6822-9104-4322-a2f5-4b235e2a93ed                                                |
| port_security_enabled | False                                                                               |
| project_id            | fc64981873354f2a91b8fdfb6b95750f                                                    |
| revision_number       | 2                                                                                   |
| security_groups       |                                                                                     |
| status                | DOWN                                                                                |
| tags                  |                                                                                     |
| tenant_id             | fc64981873354f2a91b8fdfb6b95750f                                                    |
| updated_at            | 2018-07-25T14:23:34Z                                                                |
+-----------------------+-------------------------------------------------------------------------------------+

Version-Release number of selected component (if applicable):

How reproducible:
100%

Steps to Reproduce:
1. Deploy OSP14 undercloud
2. Deploy overcloud

Actual results:
Overcloud nodes do not get any DHCP address.

Expected results:
Nodes boot properly.

Additional info:
Attaching sosreport.