Description of problem:

While installing OCP 4.7.2 over OSP 16.1 (RHOS-16.1-RHEL-8-20210304.n.0) with the OVN-Octavia provider and TLS-Everywhere enabled, the installation timed out because the network operator was in DEGRADED state. kuryr-controller is being restarted with the log below:

2021-03-10T18:26:52.962063791Z 2021-03-10 18:26:52.958 1 ERROR kuryr_kubernetes.handlers.logging     raise k_exc.ResourceNotReady(vif)
2021-03-10T18:26:52.962063791Z 2021-03-10 18:26:52.958 1 ERROR kuryr_kubernetes.handlers.logging kuryr_kubernetes.exceptions.ResourceNotReady: Resource not ready: VIFVlanNested(active=False,address=fa:16:3e:fc:97:89,has_traffic_filtering=False,id=c626b527-c922-4e0c-a737-aa39ac6f5297,network=Network(011040fd-069a-44ec-8211-66b2233cbda3),plugin='noop',port_profile=<?>,preserve_on_delete=False,vif_name='tapc626b527-c9',vlan_id=2600)

The issue is resolved by destroying the cluster and recreating it.

Version-Release number of selected component (if applicable):
OCP 4.7.2
RHOS-16.1-RHEL-8-20210304.n.0

How reproducible:
Random.

Steps to Reproduce:
1. Run the kuryr CI job: https://rhos-ci-jenkins.lab.eng.tlv2.redhat.com/view/DFG/view/osasinfra/view/shiftstack_ci/job/DFG-osasinfra-shiftstack_ci-osp_verification-osp16.1-passed_phase2-ocp4-stable

Actual results:
Installation fails.

Expected results:
Installation succeeds.

Additional info:
must-gather: https://rhos-ci-jenkins.lab.eng.tlv2.redhat.com/view/DFG/view/osasinfra/view/shiftstack_ci/job/DFG-osasinfra-shiftstack_ci-osp_verification-osp16.1-passed_phase2-ocp4-stable/17/artifact/must-gather/must-gather-install.tar.gz
job artifacts: https://rhos-ci-jenkins.lab.eng.tlv2.redhat.com/view/DFG/view/osasinfra/view/shiftstack_ci/job/DFG-osasinfra-shiftstack_ci-osp_verification-osp16.1-passed_phase2-ocp4-stable/17/
This is caused by Neutron never transitioning the port c626b527-c922-4e0c-a737-aa39ac6f5297 to ACTIVE state. In the kuryr-cni logs from ostest-4fn7j-master-2 we can see that the port is plugged into the pod correctly:

2021-03-10T18:20:58.132721473Z 2021-03-10 18:20:58.131 366 INFO os_vif [-] Successfully plugged vif VIFVlanNested(active=False,address=fa:16:3e:fc:97:89,has_traffic_filtering=False,id=c626b527-c922-4e0c-a737-aa39ac6f5297,network=Network(011040fd-069a-44ec-8211-66b2233cbda3),plugin='noop',port_profile=<?>,preserve_on_delete=False,vif_name='tapc626b527-c9',vlan_id=2600)

The problem is that Neutron never transitions it to ACTIVE. Do we have Neutron logs for this one? Otherwise I assume there's no way the networking team can find the reason and I'd just close it.
You can find the neutron logs in the artifacts (compute-0.tar.gz and controller-[0|1|2].tar.gz): https://rhos-ci-jenkins.lab.eng.tlv2.redhat.com/view/DFG/view/osasinfra/view/shiftstack_ci/job/DFG-osasinfra-shiftstack_ci-osp_verification-osp16.1-passed_phase2-ocp4-stable/17/artifact/
Port c626b527-c922-4e0c-a737-aa39ac6f5297 is a trunk's subport. This is a networking-ovn deployment, so I think someone from the OVN squad should take a look at it.
Hit again on RHOS-16.1-RHEL-8-20210311.n.1 and OCP 4.7.2 with OVN-Octavia and kuryr.

One of the trunk ports remained in DOWN state, leading to failures during OCP installation with kuryr:

$ oc logs -n openshift-kuryr kuryr-controller-6cbddb865d-v8krf -p
[...]
2021-03-17 08:05:13.308 1 ERROR kuryr_kubernetes.handlers.logging kuryr_kubernetes.exceptions.ResourceNotReady: Resource not ready: VIFVlanNested(active=False,address=fa:16:3e:d6:95:8e,has_traffic_filtering=False,id=9f096afa-8ae3-47d8-aeda-354b48027d44,network=Network(f38dcfe9-fdd3-4585-9773-25d6cf82083e),plugin='noop',port_profile=<?>,preserve_on_delete=False,vif_name='tap9f096afa-8a',vlan_id=3405)

$ openstack port show 9f096afa-8ae3-47d8-aeda-354b48027d44
+-------------------------+-----------------------------------------------------------------------------+
| Field                   | Value                                                                       |
+-------------------------+-----------------------------------------------------------------------------+
| admin_state_up          | UP |
| allowed_address_pairs   | |
| binding_host_id         | None |
| binding_profile         | None |
| binding_vif_details     | None |
| binding_vif_type        | None |
| binding_vnic_type       | normal |
| created_at              | 2021-03-17T00:34:35Z |
| data_plane_status       | None |
| description             | |
| device_id               | |
| device_owner            | trunk:subport |
| dns_assignment          | fqdn='host-10-128-29-169.shiftstack.com.', hostname='host-10-128-29-169', ip_address='10.128.29.169' |
| dns_domain              | |
| dns_name                | |
| extra_dhcp_opts         | |
| fixed_ips               | ip_address='10.128.29.169', subnet_id='e86c3423-f53a-4424-b565-127fda402d9a' |
| id                      | 9f096afa-8ae3-47d8-aeda-354b48027d44 |
| location                | cloud='', project.domain_id=, project.domain_name='Default', project.id='1b9474ca19ad4080999e3a2b663cbc03', project.name='shiftstack', region_name='regionOne', zone= |
| mac_address             | fa:16:3e:d6:95:8e |
| name                    | |
| network_id              | f38dcfe9-fdd3-4585-9773-25d6cf82083e |
| port_security_enabled   | True |
| project_id              | 1b9474ca19ad4080999e3a2b663cbc03 |
| propagate_uplink_status | None |
| qos_policy_id           | None |
| resource_request        | None |
| revision_number         | 6 |
| security_group_ids      | 8940af79-40b5-4f30-bda1-24ad399c4cf5 |
| status                  | DOWN |
| tags                    | openshiftClusterID=ostest-2w8d5 |
| trunk_details           | None |
| updated_at              | 2021-03-17T07:02:50Z |
+-------------------------+-----------------------------------------------------------------------------+

The installation is stuck with pods in ContainerCreating status:

$ oc get pods -A -o wide | grep -v -e Running -e Completed | head -2
NAMESPACE                NAME                            READY   STATUS              RESTARTS   AGE     IP       NODE                    NOMINATED NODE   READINESS GATES
openshift-dns-operator   dns-operator-5f6cc86fb5-wd7zb   0/2     ContainerCreating   0          7h44m   <none>   ostest-2w8d5-master-2   <none>           <none>

$ oc get clusterversion
NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
version             False       True          7h45m   Unable to apply 4.7.2: some cluster operators have not yet rolled out

sos-report: http://rhos-release.virt.bos.redhat.com/log/bz1937851/
job artifacts: https://rhos-ci-jenkins.lab.eng.tlv2.redhat.com/job/DFG-osasinfra-shiftstack_ci-osp_verification-osp16.1-passed_phase2-ocp4-stable/21/

The issue is not persistent. We ran our CI job 4 times: 2 passed and 2 failed.
(In reply to rlobillo from comment #4)
> Hit again on RHOS-16.1-RHEL-8-20210311.n.1 and OCP4.7.2 with OVN-Octavia and
> kuryr.
>
> One of the trunk ports remained on DOWN state leading to failures during OCP
> installation with kuryr:
> [...]
> The issue is not persistent. We run our CI job 4 times: 2 passed and 2
> failed.

Clarifying the above comment: the mentioned DOWN port is a trunk's subport, not a trunk port. Trunk ports are attached to the OpenShift node VMs, and those trunk ports then have subports attached that serve as ports for the pods.
Reason: This was proposed as a blocker for 16.1.5; the TRAC has decided it is an exception and should be moved to 16.1.6, because it is not a regression and does not fit the blocker criteria.
Tested with a newly deployed 16.1 and python-networking-ovn-7.3.1-1.20210518143301.4e24f4c.el8ost.

The good news: both OCP 4.6 and 4.7 install reliably. Unfortunately, we still have:

(overcloud) [stack@tel-director ~]$ openstack port list --device-owner trunk:subport | grep DOWN
| 5b5295c4-d671-47f3-b2d4-f70b4bd51b6b | | fa:16:3e:f7:f0:fe | ip_address='10.128.91.97', subnet_id='3326a8e9-db6b-4f8f-8e59-6f56d2717b5f'  | DOWN |
| 714c8ca5-757f-45ee-9fd2-3670689547a6 | | fa:16:3e:76:1e:e0 | ip_address='10.128.45.250', subnet_id='1f4c68ef-c641-4476-8240-91da1fd7bc41' | DOWN |
| 955f9511-b224-45a7-bce2-5983057f9a5e | | fa:16:3e:3d:9a:7f | ip_address='10.128.12.80', subnet_id='38337b90-1dd0-417d-8713-5b337cc56730'  | DOWN |
| c25ee044-aee0-4ded-84f3-40a7f178157e | | fa:16:3e:52:ef:d2 | ip_address='10.128.83.44', subnet_id='00bc3032-145d-429f-91c2-9b7997e1ed64'  | DOWN |
| d2086858-687b-4477-b2ef-eab331b5e34d | | fa:16:3e:82:b7:a6 | ip_address='10.128.2.230', subnet_id='f44d85fc-6792-4047-a98c-3dffa8bebd2a'  | DOWN |

The installation works, but there are cases where pods have to be restarted frequently, and updating e.g. the CNV operator always fails because the old version does not vanish; sometimes pods have to be deleted manually to make progress. So there are presumably still problems with the ports.
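In case it helps triage, the stuck ports can also be pulled out of the CLI's JSON output programmatically. This is only a sketch: the sample data below is invented for illustration, and the field names assume the JSON formatter of `openstack port list --device-owner trunk:subport -f json`:

```python
import json

# Invented sample of what the CLI's JSON formatter returns (two ports,
# one stuck in DOWN, one healthy).
sample = json.loads("""
[
  {"ID": "5b5295c4-d671-47f3-b2d4-f70b4bd51b6b", "Name": "", "Status": "DOWN"},
  {"ID": "aaaaaaaa-bbbb-cccc-dddd-eeeeeeeeeeee", "Name": "", "Status": "ACTIVE"}
]
""")

def down_port_ids(ports):
    """Return the IDs of ports that report DOWN status."""
    return [p["ID"] for p in ports if p["Status"] == "DOWN"]

print(down_port_ids(sample))  # ['5b5295c4-d671-47f3-b2d4-f70b4bd51b6b']
```

The resulting ID list is what a repair loop (such as the unset/set script later in this thread) would iterate over.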
Additionally, I see many KuryrCNISlow warnings on the Kuryr CNI pods on OCP 4.7 with the message "kuryr-cni-xxxxx is taking too long, on average, to perform CNI ADD requests."

So the kuryr integration is still flaky.
Hello Joachim, can you open a BZ for the KuryrCNISlow warnings, sharing the OSP and OVN versions used? This way we can analyze it better. It may be a false positive or some slowness around Neutron transitioning the ports to ACTIVE. Thanks.
(In reply to Joachim von Thadden from comment #30)
> Additionally I see many warnings on "Kuryr CNI pod Pod" with KuryrCNISlow
> warnings on OCP4.7 with the message "kuryr-cni-xxxxx is taking too long, on
> average, to perform CNI ADD requests."
>
> So the kuryr integration is still flaky.

That message is alerting you that CNI ADD requests are taking too long. Kuryr CNI ADD requests need to wait until the Neutron port is ACTIVE before they can complete. If the problem is that Neutron is not transitioning some ports from DOWN to ACTIVE, then those Kuryr CNI ADD requests will wait there forever and the alert will be raised.

That said, if this is happening on nodes where there are no ports in that situation (DOWN status), then, as Maysa said, please file a new bug and we will investigate whether there is something wrong in the way those times are estimated or whether some of the thresholds need to be adapted.
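For context, the "wait until the port is ACTIVE" behaviour described above can be sketched as a simple polling loop. This is only an illustration of the pattern, not kuryr's actual implementation; `get_port_status` is a hypothetical stand-in for the Neutron API lookup:

```python
import time

def wait_for_port_active(get_port_status, port_id, timeout=60, interval=1):
    """Poll until the port reports ACTIVE, or raise after `timeout` seconds.

    get_port_status is a stand-in for a Neutron API call. If the port
    stays DOWN forever, a loop like this is where a CNI ADD request
    effectively gets stuck until its own timeout fires.
    """
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if get_port_status(port_id) == "ACTIVE":
            return True
        time.sleep(interval)
    raise TimeoutError(f"port {port_id} never became ACTIVE")

# Simulate a port that becomes ACTIVE on the third poll.
statuses = iter(["DOWN", "DOWN", "ACTIVE"])
print(wait_for_port_active(lambda _pid: next(statuses), "9f096afa", timeout=5, interval=0))  # True
```

With a port that never leaves DOWN, the same call raises TimeoutError, which is the situation the KuryrCNISlow alert surfaces.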
(In reply to Luis Tomas Bolivar from comment #32)
> That message is alerting you that CNI ADD requests are taking too long.
> Kuryr CNI ADD requests need to wait until the Neutron port is ACTIVE to
> complete.
> [...]
> please file a new bug and we will investigate if there is something wrong
> on the way those times are estimated or if there is a need to adapt some of
> the thresholds.

It vanished after I re-connected the ports with:

#!/bin/bash -x
ports=$(openstack port list --device-owner "trunk:subport" -f value | grep DOWN | cut -d " " -f 1)
trunks=$(openstack network trunk list -f value -c ID)

for port in $ports; do
    for trunk in $trunks; do
        # Look up the VLAN ID of this subport on the current trunk
        vlan_id=$(openstack network subport list --trunk $trunk -f value | grep $port | cut -d " " -f 3)
        if [[ $vlan_id ]]; then
            # Found it! Detach and re-attach the subport
            echo "Port $port is in trunk $trunk with VLAN ID $vlan_id"
            openstack network trunk unset --subport $port $trunk
            openstack network trunk set --subport port=$port,segmentation-type=vlan,segmentation-id=$vlan_id $trunk
            break
        fi
    done
done

So with that it is possible to "heal" the deployment.
Verified OpenShift on OpenStack:
OCP 4.8.3
OSP 16.1.7

Installed and destroyed the OpenShift cluster a couple of times, checked that there were no DOWN trunk subports, and found no messages in the kuryr-controller logs regarding DOWN ports.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Red Hat OpenStack Platform 16.1.7 (Train) bug fix and enhancement advisory), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2021:3762