Description of problem:

Trying to install OCP 4.2 (4.2.0-0.nightly-2019-09-10-074025) on OSP15. The installer hangs waiting on the bootstrap step, because the bootstrap node cannot confirm that etcd is running on the master nodes:

Sep 10 14:08:11 morenod-osp15-77t2b-bootstrap bootkube.sh[1740]: https://etcd-2.morenod-osp15.qe.devcluster.openshift.com:2379 is unhealthy: failed to connect: dial tcp 192.168.3.137:2379: i/o timeout
Sep 10 14:08:11 morenod-osp15-77t2b-bootstrap bootkube.sh[1740]: https://etcd-0.morenod-osp15.qe.devcluster.openshift.com:2379 is unhealthy: failed to connect: dial tcp 192.168.1.9:2379: i/o timeout
Sep 10 14:08:11 morenod-osp15-77t2b-bootstrap bootkube.sh[1740]: https://etcd-1.morenod-osp15.qe.devcluster.openshift.com:2379 is unhealthy: failed to connect: dial tcp 192.168.0.20:2379: i/o timeout
Sep 10 14:08:11 morenod-osp15-77t2b-bootstrap bootkube.sh[1740]: Error: unhealthy cluster
Sep 10 14:08:12 morenod-osp15-77t2b-bootstrap bootkube.sh[1740]: etcdctl failed. Retrying in 5 seconds...

Checking the master nodes, etcd is up:

# crictl ps -a | grep etc
d9d6da7b1d631 a0a9096ca425066cc5765654459c56b96e7e504bc48854a21668b66b489613ff About an hour ago Running etcd-metrics 0 cfcd97f08c964
cde08cac57dfb quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:6dae5b313efac6d5f43015befb5a49b354330275714814724a7e89f10c65e452 About an hour ago Running etcd-member 0 cfcd97f08c964

etcd is listening correctly, but the etcd nodes cannot talk to each other (connections stuck in SYN_SENT):

# netstat -anp | grep etcd
tcp 0 0 127.0.0.1:9977 0.0.0.0:* LISTEN 4134/etcd
tcp 0 0 192.168.1.9:36276 192.168.1.9:9978 ESTABLISHED 4134/etcd
tcp 0 1 192.168.1.9:49088 192.168.0.20:2380 SYN_SENT 4039/etcd
tcp 0 1 192.168.1.9:46908 192.168.3.137:2380 SYN_SENT 4039/etcd
tcp 0 1 192.168.1.9:49080 192.168.0.20:2380 SYN_SENT 4039/etcd
tcp 0 1 192.168.1.9:49102 192.168.0.20:2380 SYN_SENT 4039/etcd
tcp 0 1 192.168.1.9:46888 192.168.3.137:2380 SYN_SENT 4039/etcd
tcp 0 1 192.168.1.9:49084 192.168.0.20:2380 SYN_SENT 4039/etcd
tcp 0 1 192.168.1.9:46906 192.168.3.137:2380 SYN_SENT 4039/etcd
tcp 0 1 192.168.1.9:46884 192.168.3.137:2380 SYN_SENT 4039/etcd
tcp 0 1 192.168.1.9:46892 192.168.3.137:2380 SYN_SENT 4039/etcd
tcp 0 1 192.168.1.9:49100 192.168.0.20:2380 SYN_SENT 4039/etcd
tcp6 0 0 :::9978 :::* LISTEN 4039/etcd
tcp6 0 0 :::9979 :::* LISTEN 4134/etcd
tcp6 4 0 :::2379 :::* LISTEN 4039/etcd
tcp6 0 0 :::2380 :::* LISTEN 4039/etcd
tcp6 0 0 192.168.1.9:9978 192.168.1.9:36276 ESTABLISHED 4039/etcd

From the kubelet logs, this line seems to be network-related:

Sep 10 14:40:35 morenod-osp15-77t2b-master-0 hyperkube[1192]: E0910 14:40:35.999601 1192 kubelet.go:2180] Container runtime network not ready: NetworkReady=false reason:NetworkPluginNotReady message:Network plugin returns error: Missing CNI default network

Related config in install-config.yaml:

networking:
  clusterNetworks:
  - cidr: 10.128.0.0/14
    hostSubnetLength: 9
  machineCIDR: 192.168.0.0/18
  serviceCIDR: 172.30.0.0/16
  type: OpenShiftSDN

Version-Release number of selected component (if applicable):

$ ./openshift-install version
./openshift-install v4.2.0
built from commit 6fdffbb9c21d90ba97fc79dff3ccbeac22fb2a0e
release image registry.svc.ci.openshift.org/ocp/release@sha256:7236b09114158a378bf82fb33814f3ba2dcc94b0e48e10a7ff2ccadd07f201a9

# cat /etc/os-release
NAME="Red Hat Enterprise Linux CoreOS"
VERSION="42.80.20190910.0"
VERSION_ID="4.2"
PRETTY_NAME="Red Hat Enterprise Linux CoreOS 42.80.20190910.0 (Ootpa)"
ID="rhcos"
ID_LIKE="rhel fedora"
ANSI_COLOR="0;31"
HOME_URL="https://www.redhat.com/"
BUG_REPORT_URL="https://bugzilla.redhat.com/"
REDHAT_BUGZILLA_PRODUCT="OpenShift Container Platform"
REDHAT_BUGZILLA_PRODUCT_VERSION="4.2"
REDHAT_SUPPORT_PRODUCT="OpenShift Container Platform"
REDHAT_SUPPORT_PRODUCT_VERSION="4.2"
OSTREE_VERSION=42.80.20190910.0

How reproducible:

Steps to Reproduce:
1. Install OCP on OSP15

Actual results:
The installer does not finish.

Expected results:
The installer finishes.

Additional info:
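A quick way to confirm raw TCP reachability of the etcd peer port between masters, without tcpdump, is the bash /dev/tcp built-in. This is only a sketch (run from master-0, 192.168.1.9, against one of the peer addresses from the netstat output above); a fast "Connection refused" means packets get through but nothing is listening, while a timeout, as seen here, means the traffic is dropped on the way:

# Test the etcd peer port on another master from master-0 (sketch):
timeout 5 bash -c '</dev/tcp/192.168.0.20/2380' && echo open || echo 'blocked or refused'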
The SDN doesn't mediate communication between the etcd members; etcd runs on the host network and the members talk to each other directly. This means you probably have a network or firewall issue. Please check that you've configured your networks correctly. I suspect this is a testinfra issue. Can you run a tcpdump and see if the traffic is flowing?
Security groups allow traffic on 2380:

$ openstack security group show morenod-osp15-688tl-master | grep 2380
| | created_at='2019-09-10T15:06:56Z', direction='ingress', ethertype='IPv4', id='e66e1758-67ee-4ffa-bb35-0be3747d77bc', port_range_max='2380', port_range_min='2379', protocol='tcp', updated_at='2019-09-10T15:06:56Z'

# iptables -L -n
Chain INPUT (policy ACCEPT)
target prot opt source destination
KUBE-FIREWALL all -- 0.0.0.0/0 0.0.0.0/0

Chain FORWARD (policy ACCEPT)
target prot opt source destination

Chain OUTPUT (policy ACCEPT)
target prot opt source destination
KUBE-FIREWALL all -- 0.0.0.0/0 0.0.0.0/0

Chain KUBE-FIREWALL (2 references)
target prot opt source destination
DROP all -- 0.0.0.0/0 0.0.0.0/0 /* kubernetes firewall for dropping marked packets */ mark match 0x8000/0x8000

There is no tcpdump command installed on RHCOS.
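RHCOS does ship the toolbox wrapper, though, which starts a support container where tcpdump can be run. A sketch, assuming the default support-tools image and that it provides (or lets you install) tcpdump:

# toolbox
# Inside the toolbox container, capture etcd peer traffic on port 2380:
tcpdump -i any -nn -c 50 'tcp port 2380'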
openshift-sdn won't mark the node as ready (or install the CNI plugin) until etcd is up, the control plane is running, and the pods are installed. That error message is expected and harmless. If this is only happening on OSP15, please assign to testinfra. If this happens on every other cluster, then this is an installer bug.
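For reference, once the control plane is reachable, node readiness and the SDN pods can be checked with the installer-generated kubeconfig (a sketch, assuming the default auth/kubeconfig path in the install directory):

$ export KUBECONFIG=<install-dir>/auth/kubeconfig
$ oc get nodes
$ oc get pods -n openshift-sdn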
There seems to be an issue on OSP15 with security group rules that span a port range. See the following reproducer.

Create a rule that spans multiple ports:

openstack security group rule create morenod-osp15-s2hbx-master --protocol tcp --dst-port 2391:2392

The port is still blocked:

sh-5.0# telnet 192.168.1.175 2392
Trying 192.168.1.175...
^C

Create a rule with only one port:

openstack security group rule create morenod-osp15-s2hbx-master --protocol tcp --dst-port 2392:2392

We can now connect to the port:

sh-5.0# telnet 192.168.1.175 2392
Trying 192.168.1.175...
telnet: connect to address 192.168.1.175: Connection refused

(We get "Connection refused" because nothing is listening on that port.)
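For reference, the rules created by the reproducer above can be double-checked (and later removed by UUID) with the standard CLI; just a sketch to confirm both rules actually exist in Neutron:

$ openstack security group rule list morenod-osp15-s2hbx-master | grep 239
$ openstack security group rule delete <rule-uuid>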
I see the following errors in the neutron server.log (specifically server.log.9.gz at the time of this writing, because so much time has elapsed).

2019-09-10 09:08:23.498 34 DEBUG oslo_concurrency.lockutils [req-76d596b8-7ed7-4367-b222-2bf0178c5468 - - - - -] Lock "update_segment_host_mapping" released by "neutron.services.segments.db.update_segment_host_mapping" :: held 0.003s inner /usr/lib/python3.6/site-packages/oslo_concurrency/lockutils.py:339
2019-09-10 09:08:23.512 28 DEBUG ovsdbapp.backend.ovs_idl.transaction [-] Running txn n=1 command(idx=0): CheckLivenessCommand() do_commit /usr/lib/python3.6/site-packages/ovsdbapp/backend/ovs_idl/transaction.py:84
2019-09-10 09:08:23.514 28 ERROR neutron.plugins.ml2.managers [req-a360806c-3757-473e-9260-a0c35dddc941 a98c5115917c44ffa010e9b52460612b 9d52881fecd3466892a4a6bb027531af - default default] Mechanism driver 'ovn' failed in update_port_postcommit: ovsdbapp.backend.ovs_idl.idlutils.RowNotFound: Cannot find Logical_Switch_Port with name=1baa667d-008a-44ec-a454-663b9d056082
2019-09-10 09:08:23.514 28 ERROR neutron.plugins.ml2.managers Traceback (most recent call last):
2019-09-10 09:08:23.514 28 ERROR neutron.plugins.ml2.managers   File "/usr/lib/python3.6/site-packages/neutron/plugins/ml2/managers.py", line 447, in _call_on_drivers
2019-09-10 09:08:23.514 28 ERROR neutron.plugins.ml2.managers     getattr(driver.obj, method_name)(context)
2019-09-10 09:08:23.514 28 ERROR neutron.plugins.ml2.managers   File "/usr/lib/python3.6/site-packages/networking_ovn/ml2/mech_driver.py", line 566, in update_port_postcommit
2019-09-10 09:08:23.514 28 ERROR neutron.plugins.ml2.managers     self._ovn_client.update_port(port, port_object=original_port)
2019-09-10 09:08:23.514 28 ERROR neutron.plugins.ml2.managers   File "/usr/lib/python3.6/site-packages/networking_ovn/common/ovn_client.py", line 462, in update_port
2019-09-10 09:08:23.514 28 ERROR neutron.plugins.ml2.managers     ovn_port = self._nb_idl.lookup('Logical_Switch_Port', port['id'])
2019-09-10 09:08:23.514 28 ERROR neutron.plugins.ml2.managers   File "/usr/lib/python3.6/site-packages/ovsdbapp/backend/ovs_idl/__init__.py", line 103, in lookup
2019-09-10 09:08:23.514 28 ERROR neutron.plugins.ml2.managers     return self._lookup(table, record)
2019-09-10 09:08:23.514 28 ERROR neutron.plugins.ml2.managers   File "/usr/lib/python3.6/site-packages/ovsdbapp/backend/ovs_idl/__init__.py", line 143, in _lookup
2019-09-10 09:08:23.514 28 ERROR neutron.plugins.ml2.managers     row = idlutils.row_by_value(self, rl.table, rl.column, record)
2019-09-10 09:08:23.514 28 ERROR neutron.plugins.ml2.managers   File "/usr/lib/python3.6/site-packages/ovsdbapp/backend/ovs_idl/idlutils.py", line 65, in row_by_value
2019-09-10 09:08:23.514 28 ERROR neutron.plugins.ml2.managers     raise RowNotFound(table=table, col=column, match=match)
2019-09-10 09:08:23.514 28 ERROR neutron.plugins.ml2.managers ovsdbapp.backend.ovs_idl.idlutils.RowNotFound: Cannot find Logical_Switch_Port with name=1baa667d-008a-44ec-a454-663b9d056082
2019-09-10 09:08:23.514 28 ERROR neutron.plugins.ml2.managers
2019-09-10 09:08:23.517 28 ERROR neutron.plugins.ml2.plugin [req-a360806c-3757-473e-9260-a0c35dddc941 a98c5115917c44ffa010e9b52460612b 9d52881fecd3466892a4a6bb027531af - default default] mechanism_manager.update_port_postcommit failed for port 1baa667d-008a-44ec-a454-663b9d056082: neutron.plugins.ml2.common.exceptions.MechanismDriverError

This looks like an error coming from the OVN subsystem, so I am going to pass it along to the OVN team.
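For anyone re-checking this later, the same errors can be pulled out of the rotated logs directly on the controller; a sketch, assuming the usual containerized Neutron log location on OSP15:

# zgrep -h 'RowNotFound\|MechanismDriverError' /var/log/containers/neutron/server.log*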
I have copied server.log.9.gz to /opt/stack so that it will not be logrotated out of existence. I have asked Terry Wilson (irc: otherwiseguy) to take a look at it.
I wasn't able to dig into this much, but I also wasn't able to reproduce it on the master branch: with a rule using ports 21:22, ssh still worked. I'm wondering whether an OpenFlow rule was added when 2391:2392 was used.

$ sudo ovs-ofctl dump-flows br-int | grep tp_dst=2392
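On an environment that does reproduce the problem, it may also be worth grepping directly for the conjunctive flows a port-range ACL is expected to compile to (a sketch):

$ sudo ovs-ofctl dump-flows br-int | grep 'conjunction('
$ sudo ovs-ofctl dump-flows br-int | grep 'conj_id='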
I did a test in this environment with two different security group rules (see below).

TL;DR: All looks good from the networking-ovn and NB/SB database standpoint. The problem seems to be with the use of conjunctive actions by ovn-controller when specifying port ranges (or maybe with the conjunctive actions themselves in OVS).

Details:

openstack security group rule create morenod-osp15-s2hbx-master --protocol tcp --dst-port 504:505
openstack security group rule create morenod-osp15-s2hbx-master --protocol tcp --dst-port 506:506

()[root@controller-0 /]# ovn-nbctl list acl | grep 504
_uuid : 05c91504-ddd4-47d1-ba93-d80a73dab298
match : "outport == @pg_4e10e382_e96a_4514_b211_606509b3f7cc && ip4 && ip4.src == 0.0.0.0/0 && tcp && tcp.dst >= 504 && tcp.dst <= 505"

()[root@controller-0 /]# ovn-nbctl list acl | grep "tcp.dst == 506"
match : "outport == @pg_4e10e382_e96a_4514_b211_606509b3f7cc && ip4 && ip4.src == 0.0.0.0/0 && tcp && tcp.dst == 506"

()[root@controller-0 /]# ovn-sbctl list logical_fl | grep 505 | grep match
match : "((ct.new && !ct.est) || (!ct.new && ct.est && !ct.rpl && ct_label.blocked == 1)) && (outport == @pg_4e10e382_e96a_4514_b211_606509b3f7cc && ip4 && ip4.src == 0.0.0.0/0 && tcp && tcp.dst >= 504 && tcp.dst <= 505)"
match : "!ct.new && ct.est && !ct.rpl && ct_label.blocked == 0 && (outport == @pg_4e10e382_e96a_4514_b211_606509b3f7cc && ip4 && ip4.src == 0.0.0.0/0 && tcp && tcp.dst >= 504 && tcp.dst <= 505)"

()[root@controller-0 /]# ovn-sbctl list logical_fl | grep 506 | grep match
match : "((ct.new && !ct.est) || (!ct.new && ct.est && !ct.rpl && ct_label.blocked == 1)) && (outport == @pg_4e10e382_e96a_4514_b211_606509b3f7cc && ip4 && ip4.src == 0.0.0.0/0 && tcp && tcp.dst == 506)"
match : "!ct.new && ct.est && !ct.rpl && ct_label.blocked == 0 && (outport == @pg_4e10e382_e96a_4514_b211_606509b3f7cc && ip4 && ip4.src == 0.0.0.0/0 && tcp && tcp.dst == 506)"

[root@compute-0 ~]# ovs-ofctl dump-flows br-int table=44 | grep dst=504 | grep "+new"
cookie=0x0, duration=2389.963s, table=44, n_packets=0, n_bytes=0, idle_age=2389, priority=2002,ct_state=+new-est+trk,tcp,metadata=0x1f,tp_dst=504 actions=conjunction(19,2/2)

[root@compute-0 ~]# ovs-ofctl dump-flows br-int table=44 | grep dst=505 | grep "+new"
cookie=0x0, duration=2418.667s, table=44, n_packets=0, n_bytes=0, idle_age=2418, priority=2002,ct_state=+new-est+trk,tcp,metadata=0x1f,tp_dst=505 actions=conjunction(19,2/2)

[root@compute-0 ~]# ovs-ofctl dump-flows br-int table=44 | grep dst=506 | grep "+new"
cookie=0xa4a1b64c, duration=852.386s, table=44, n_packets=0, n_bytes=0, idle_age=852, priority=2002,ct_state=+new-est+trk,tcp,reg15=0x7,metadata=0x1f,tp_dst=506 actions=load:0x1->NXM_NX_XXREG0[97],resubmit(,45)
cookie=0xa4a1b64c, duration=852.386s, table=44, n_packets=1, n_bytes=74, idle_age=487, priority=2002,ct_state=+new-est+trk,tcp,reg15=0x2,metadata=0x1f,tp_dst=506 actions=load:0x1->NXM_NX_XXREG0[97],resubmit(,45)
cookie=0xa4a1b64c, duration=852.386s, table=44, n_packets=0, n_bytes=0, idle_age=852, priority=2002,ct_state=+new-est+trk,tcp,reg15=0x5,metadata=0x1f,tp_dst=506 actions=load:0x1->NXM_NX_XXREG0[97],resubmit(,45)
cookie=0xa4a1b64c, duration=852.386s, table=44, n_packets=0, n_bytes=0, idle_age=852, priority=2002,ct_state=+new-est+trk,tcp,reg15=0x6,metadata=0x1f,tp_dst=506 actions=load:0x1->NXM_NX_XXREG0[97],resubmit(,45)

If I try to connect from 192.168.3.151 to 192.168.1.175 on port 506, everything works, and the corresponding flow gets hit in table=44 when the SYN packet arrives:

cookie=0xa4a1b64c, duration=369.066s, table=44, n_packets=1, n_bytes=74, idle_age=3, priority=2002,ct_state=+new-est+trk,tcp,reg15=0x2,metadata=0x1f,tp_dst=506 actions=load:0x1->NXM_NX_XXREG0[97],resubmit(,45)

However, if I do the same on port 504 or 505, the packet does not hit the rules with the conjunctive actions and I can see the drop in the datapath:

# ovs-dpctl dump-flows | grep dst=504
recirc_id(0x26e1),in_port(14),ct_state(+new-est-rel-rpl-inv+trk),ct_label(0/0x1),eth(src=fa:16:3e:b3:d2:43),eth_type(0x0800),ipv4(src=192.168.0.0/255.255.192.0,proto=6,frag=no),tcp(dst=504), packets:2, bytes:148, used:1.118s, flags:S, actions:drop

As you can see from the flows above, ovn-controller uses conjunctive actions when there's a range of ports, while for a single port it installs one OpenFlow rule per logical output port (reg15). This looks like a problem in core OVN or even in OVS. I'll loop in Numan from the core OVN team to take a look now that it's more narrowed down.
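To see exactly where the SYN to port 504 is dropped, an ofproto/trace of a synthetic flow can help. Only a sketch: the in_port here is copied from the datapath flow above and may need translating to the corresponding OpenFlow port number, and tracing through conntrack may require additional --ct-next arguments:

[root@compute-0 ~]# ovs-appctl ofproto/trace br-int 'in_port=14,tcp,nw_src=192.168.3.151,nw_dst=192.168.1.175,tp_dst=504'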
For now it looks like a bug in ovn-controller, which is not installing the first half of the conjunction (the clause matching on the logical output port); only the per-destination-port halves are present:

[root@compute-0 openvswitch]# ovs-ofctl dump-flows br-int | grep "conjunction(19,1/2)" | wc -l
0

We'd expect ovn-controller to install that clause for every logical output port in the port group. I tried to create those flows manually and it worked:

[root@compute-0 ~]# podman stop ovn_controller
ba98460940bc54110148ee232652d370bdfe315acf3dad3078e676262c8a8813
[root@compute-0 ~]# ovs-ofctl add-flow br-int "table=44,priority=2002,ct_state=+new-est+trk,tcp,metadata=0x1f,reg15=0x6 actions=conjunction(31,1/2)"
[root@compute-0 ~]# ovs-ofctl add-flow br-int "table=44,priority=2002,ct_state=+new-est+trk,tcp,metadata=0x1f,reg15=0x5 actions=conjunction(31,1/2)"
[root@compute-0 ~]# ovs-ofctl add-flow br-int "table=44,priority=2002,ct_state=+new-est+trk,tcp,metadata=0x1f,reg15=0x2 actions=conjunction(31,1/2)"
[root@compute-0 ~]# ovs-ofctl add-flow br-int "table=44,priority=2002,ct_state=+new-est+trk,tcp,metadata=0x1f,reg15=0x7 actions=conjunction(31,1/2)"

[root@compute-0 ~]# ovs-ofctl dump-flows br-int table=44 | grep "conjunction(31"
cookie=0x0, duration=166.939s, table=44, n_packets=0, n_bytes=0, idle_age=166, priority=2002,ct_state=+new-est+trk,tcp,metadata=0x1f,tp_dst=700 actions=conjunction(31,2/2)
cookie=0x0, duration=166.925s, table=44, n_packets=0, n_bytes=0, idle_age=166, priority=2002,ct_state=+new-est+trk,tcp,metadata=0x1f,tp_dst=701 actions=conjunction(31,2/2)
cookie=0x0, duration=166.916s, table=44, n_packets=0, n_bytes=0, idle_age=166, priority=2002,ct_state=+new-est+trk,tcp,metadata=0x1f,tp_dst=702 actions=conjunction(31,2/2)
cookie=0x0, duration=94.132s, table=44, n_packets=0, n_bytes=0, idle_age=166, priority=2002,ct_state=+new-est+trk,tcp,reg15=0x6,metadata=0x1f actions=conjunction(31,1/2)
cookie=0x0, duration=79.851s, table=44, n_packets=0, n_bytes=0, idle_age=166, priority=2002,ct_state=+new-est+trk,tcp,reg15=0x5,metadata=0x1f actions=conjunction(31,1/2)
cookie=0x0, duration=76.857s, table=44, n_packets=0, n_bytes=0, idle_age=166, priority=2002,ct_state=+new-est+trk,tcp,reg15=0x2,metadata=0x1f actions=conjunction(31,1/2)
cookie=0x0, duration=73.560s, table=44, n_packets=0, n_bytes=0, idle_age=166, priority=2002,ct_state=+new-est+trk,tcp,reg15=0x7,metadata=0x1f actions=conjunction(31,1/2)

(This test uses a 700:702 port-range rule, hence the tp_dst=700-702 flows.) Now if I try to telnet to port 700, 701 or 702, I can see the conj_id=31 flow being hit:

sh-5.0# telnet 192.168.1.175 701
Trying 192.168.1.175...
telnet: connect to address 192.168.1.175: Connection refused   <- refused because nobody is listening on that port

[root@compute-0 ~]# ovs-ofctl dump-flows br-int table=44 | grep conj_id=31
cookie=0xe52a3afb, duration=195.892s, table=44, n_packets=1, n_bytes=74, idle_age=99, priority=2002,conj_id=31,ct_state=+new-est+trk,tcp,metadata=0x1f actions=load:0x1->NXM_NX_XXREG0[97],resubmit(,45)

At this point we can confirm that the problem is in ovn-controller not generating the conjunctive clauses properly when using ACLs with port ranges in combination with Port Groups. I believe we need an OSP15 BZ against the 'ovn' component to be able to track it.
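After the manual experiment, ovn-controller can simply be started again; on startup it resyncs the br-int flow table, which should replace the hand-installed flows (a sketch):

[root@compute-0 ~]# podman start ovn_controller
[root@compute-0 ~]# ovs-ofctl dump-flows br-int table=44 | grep -c 'conjunction('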
I tried to reproduce the issue on an isolated OVN environment, and even on OSP, but it was not possible to reproduce it. By accident I found out that the port receiving the connections in this BZ is the parent port of a trunk:

(shiftstack) [stack@undercloud-0 ~]$ openstack port show 1bbd99ac-8065-47e0-85ca-33f6bf642a06 | grep trunk
| trunk_details | {'trunk_id': '925d4427-89c3-4c4a-93e4-0a7b6748da7b', 'sub_ports': []} |

We're investigating the conjunctive clauses for trunk ports to see where the issue could be.
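The trunk itself can be inspected to confirm it has no subports and to see its parent port (trunk ID taken from the output above); a sketch:

(shiftstack) [stack@undercloud-0 ~]$ openstack network trunk show 925d4427-89c3-4c4a-93e4-0a7b6748da7b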
The current plan is for this to be resolved in OSP. We plan to attach the OSP bug to this.