Description of problem:

While installing OCP 4.7.2 over OSP 16.1 (RHOS-16.1-RHEL-8-20210304.n.0) with the OVN-Octavia provider and TLS-Everywhere enabled, the installation timed out because the network operator was in DEGRADED state. kuryr-controller is being restarted with the log below:

2021-03-10T18:26:52.962063791Z 2021-03-10 18:26:52.958 1 ERROR kuryr_kubernetes.handlers.logging     raise k_exc.ResourceNotReady(vif)
2021-03-10T18:26:52.962063791Z 2021-03-10 18:26:52.958 1 ERROR kuryr_kubernetes.handlers.logging kuryr_kubernetes.exceptions.ResourceNotReady: Resource not ready: VIFVlanNested(active=False,address=fa:16:3e:fc:97:89,has_traffic_filtering=False,id=c626b527-c922-4e0c-a737-aa39ac6f5297,network=Network(011040fd-069a-44ec-8211-66b2233cbda3),plugin='noop',port_profile=<?>,preserve_on_delete=False,vif_name='tapc626b527-c9',vlan_id=2600)

The issue is resolved by destroying the cluster and recreating it.

Version-Release number of selected component (if applicable):
OCP 4.7.2
RHOS-16.1-RHEL-8-20210304.n.0

How reproducible:
Random.

Steps to Reproduce:
1. Run the kuryr CI job: https://rhos-ci-jenkins.lab.eng.tlv2.redhat.com/view/DFG/view/osasinfra/view/shiftstack_ci/job/DFG-osasinfra-shiftstack_ci-osp_verification-osp16.1-passed_phase2-ocp4-stable

Actual results:
Installation fails.

Expected results:
Installation succeeds.

Additional info:
must-gather: https://rhos-ci-jenkins.lab.eng.tlv2.redhat.com/view/DFG/view/osasinfra/view/shiftstack_ci/job/DFG-osasinfra-shiftstack_ci-osp_verification-osp16.1-passed_phase2-ocp4-stable/17/artifact/must-gather/must-gather-install.tar.gz
job artifacts: https://rhos-ci-jenkins.lab.eng.tlv2.redhat.com/view/DFG/view/osasinfra/view/shiftstack_ci/job/DFG-osasinfra-shiftstack_ci-osp_verification-osp16.1-passed_phase2-ocp4-stable/17/
This is caused by Neutron never transitioning the port c626b527-c922-4e0c-a737-aa39ac6f5297 to ACTIVE state. In the kuryr-cni logs from ostest-4fn7j-master-2 we can see that the port is plugged into the pod correctly:

2021-03-10T18:20:58.132721473Z 2021-03-10 18:20:58.131 366 INFO os_vif [-] Successfully plugged vif VIFVlanNested(active=False,address=fa:16:3e:fc:97:89,has_traffic_filtering=False,id=c626b527-c922-4e0c-a737-aa39ac6f5297,network=Network(011040fd-069a-44ec-8211-66b2233cbda3),plugin='noop',port_profile=<?>,preserve_on_delete=False,vif_name='tapc626b527-c9',vlan_id=2600)

The problem is that Neutron never transitions it to ACTIVE. Do we have Neutron logs for this one? Otherwise I assume there's no way the networking team can find the reason and I'd just close it.
You can find the neutron logs in the artifacts (compute-0.tar.gz and controller-[0|1|2].tar.gz): https://rhos-ci-jenkins.lab.eng.tlv2.redhat.com/view/DFG/view/osasinfra/view/shiftstack_ci/job/DFG-osasinfra-shiftstack_ci-osp_verification-osp16.1-passed_phase2-ocp4-stable/17/artifact/
Port c626b527-c922-4e0c-a737-aa39ac6f5297 is a trunk's subport. This is a networking-ovn deployment, so I think someone from the OVN squad should take a look at it.
Hit again on RHOS-16.1-RHEL-8-20210311.n.1 and OCP 4.7.2 with OVN-Octavia and kuryr.

One of the trunk ports remained in DOWN state, leading to failures during OCP installation with kuryr:

$ oc logs -n openshift-kuryr kuryr-controller-6cbddb865d-v8krf -p
[...]
2021-03-17 08:05:13.308 1 ERROR kuryr_kubernetes.handlers.logging kuryr_kubernetes.exceptions.ResourceNotReady: Resource not ready: VIFVlanNested(active=False,address=fa:16:3e:d6:95:8e,has_traffic_filtering=False,id=9f096afa-8ae3-47d8-aeda-354b48027d44,network=Network(f38dcfe9-fdd3-4585-9773-25d6cf82083e),plugin='noop',port_profile=<?>,preserve_on_delete=False,vif_name='tap9f096afa-8a',vlan_id=3405)

$ openstack port show 9f096afa-8ae3-47d8-aeda-354b48027d44
+-------------------------+-----------------------------------------------------------------------------+
| Field                   | Value                                                                       |
+-------------------------+-----------------------------------------------------------------------------+
| admin_state_up          | UP |
| allowed_address_pairs   | |
| binding_host_id         | None |
| binding_profile         | None |
| binding_vif_details     | None |
| binding_vif_type        | None |
| binding_vnic_type       | normal |
| created_at              | 2021-03-17T00:34:35Z |
| data_plane_status       | None |
| description             | |
| device_id               | |
| device_owner            | trunk:subport |
| dns_assignment          | fqdn='host-10-128-29-169.shiftstack.com.', hostname='host-10-128-29-169', ip_address='10.128.29.169' |
| dns_domain              | |
| dns_name                | |
| extra_dhcp_opts         | |
| fixed_ips               | ip_address='10.128.29.169', subnet_id='e86c3423-f53a-4424-b565-127fda402d9a' |
| id                      | 9f096afa-8ae3-47d8-aeda-354b48027d44 |
| location                | cloud='', project.domain_id=, project.domain_name='Default', project.id='1b9474ca19ad4080999e3a2b663cbc03', project.name='shiftstack', region_name='regionOne', zone= |
| mac_address             | fa:16:3e:d6:95:8e |
| name                    | |
| network_id              | f38dcfe9-fdd3-4585-9773-25d6cf82083e |
| port_security_enabled   | True |
| project_id              | 1b9474ca19ad4080999e3a2b663cbc03 |
| propagate_uplink_status | None |
| qos_policy_id           | None |
| resource_request        | None |
| revision_number         | 6 |
| security_group_ids      | 8940af79-40b5-4f30-bda1-24ad399c4cf5 |
| status                  | DOWN |
| tags                    | openshiftClusterID=ostest-2w8d5 |
| trunk_details           | None |
| updated_at              | 2021-03-17T07:02:50Z |
+-------------------------+-----------------------------------------------------------------------------+

The installation is stuck with pods in ContainerCreating status:

$ oc get pods -A -o wide | grep -v -e Running -e Completed | head -2
NAMESPACE                NAME                            READY   STATUS              RESTARTS   AGE     IP       NODE                    NOMINATED NODE   READINESS GATES
openshift-dns-operator   dns-operator-5f6cc86fb5-wd7zb   0/2     ContainerCreating   0          7h44m   <none>   ostest-2w8d5-master-2   <none>           <none>

$ oc get clusterversion
NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
version             False       True          7h45m   Unable to apply 4.7.2: some cluster operators have not yet rolled out

sos-report: http://rhos-release.virt.bos.redhat.com/log/bz1937851/
job artifacts: https://rhos-ci-jenkins.lab.eng.tlv2.redhat.com/job/DFG-osasinfra-shiftstack_ci-osp_verification-osp16.1-passed_phase2-ocp4-stable/21/

The issue is not persistent. We ran our CI job 4 times: 2 passed and 2 failed.
(In reply to rlobillo from comment #4)
> Hit again on RHOS-16.1-RHEL-8-20210311.n.1 and OCP4.7.2 with OVN-Octavia and
> kuryr.
>
> One of the trunk ports remained on DOWN state leading to failures during OCP
> installation with kuryr:
> [...]
> The issue is not persistent. We run our CI job 4 times: 2 passed and 2
> failed.

Clarifying the above comment: the mentioned DOWN port is a trunk's subport, not a trunk port. Trunk ports are attached to the OpenShift node VMs, and those trunk ports then have subports attached that serve as ports for the pods.
Reason: This was proposed as a blocker for 16.1.5; the TRAC has decided it is an exception and should be moved to 16.1.6, because it is not a regression and does not fit the blocker criteria.
Tested with a newly deployed 16.1 and python-networking-ovn-7.3.1-1.20210518143301.4e24f4c.el8ost.

The good news: both OCP 4.6 and 4.7 install reliably. Unfortunately, we still have:

(overcloud) [stack@tel-director ~]$ openstack port list --device-owner trunk:subport | grep DOWN
| 5b5295c4-d671-47f3-b2d4-f70b4bd51b6b | | fa:16:3e:f7:f0:fe | ip_address='10.128.91.97', subnet_id='3326a8e9-db6b-4f8f-8e59-6f56d2717b5f'  | DOWN |
| 714c8ca5-757f-45ee-9fd2-3670689547a6 | | fa:16:3e:76:1e:e0 | ip_address='10.128.45.250', subnet_id='1f4c68ef-c641-4476-8240-91da1fd7bc41' | DOWN |
| 955f9511-b224-45a7-bce2-5983057f9a5e | | fa:16:3e:3d:9a:7f | ip_address='10.128.12.80', subnet_id='38337b90-1dd0-417d-8713-5b337cc56730'  | DOWN |
| c25ee044-aee0-4ded-84f3-40a7f178157e | | fa:16:3e:52:ef:d2 | ip_address='10.128.83.44', subnet_id='00bc3032-145d-429f-91c2-9b7997e1ed64'  | DOWN |
| d2086858-687b-4477-b2ef-eab331b5e34d | | fa:16:3e:82:b7:a6 | ip_address='10.128.2.230', subnet_id='f44d85fc-6792-4047-a98c-3dffa8bebd2a'  | DOWN |

The installation works, but there are cases where pods have to be restarted frequently, and updating e.g. the CNV operator always fails because the old version does not vanish; sometimes pods have to be deleted manually to make progress. So there are presumably still problems with the ports.
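In case it helps triage, the stuck ports can also be pulled out of the CLI's JSON output programmatically. This is only a sketch: the sample data below is invented for illustration, and the field names assume the JSON formatter of `openstack port list --device-owner trunk:subport -f json`:

```python
import json

# Invented sample of what the CLI's JSON formatter returns (two ports,
# one stuck in DOWN, one healthy).
sample = json.loads("""
[
  {"ID": "5b5295c4-d671-47f3-b2d4-f70b4bd51b6b", "Name": "", "Status": "DOWN"},
  {"ID": "aaaaaaaa-bbbb-cccc-dddd-eeeeeeeeeeee", "Name": "", "Status": "ACTIVE"}
]
""")

def down_port_ids(ports):
    """Return the IDs of ports that report DOWN status."""
    return [p["ID"] for p in ports if p["Status"] == "DOWN"]

print(down_port_ids(sample))  # ['5b5295c4-d671-47f3-b2d4-f70b4bd51b6b']
```

The resulting ID list is what a repair loop (such as the unset/set script later in this thread) would iterate over.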
Additionally, I see many KuryrCNISlow warnings on the Kuryr CNI pods on OCP 4.7 with the message "kuryr-cni-xxxxx is taking too long, on average, to perform CNI ADD requests."

So the kuryr integration is still flaky.
Hello Joachim, can you open a BZ for the KuryrCNISlow warnings, sharing the OSP and OVN versions used? This way we can analyze it better. It may be a false positive or some slowness around Neutron transitioning the ports to ACTIVE. Thanks.
(In reply to Joachim von Thadden from comment #30)
> Additionally I see many warnings on "Kuryr CNI pod Pod" with KuryrCNISlow
> warnings on OCP4.7 with the message "kuryr-cni-xxxxx is taking too long, on
> average, to perform CNI ADD requests."
>
> So the kuryr integration is still flaky.

That message is alerting you that CNI ADD requests are taking too long. Kuryr CNI ADD requests need to wait until the Neutron port is ACTIVE before they can complete. If the problem is that Neutron is not transitioning some ports from DOWN to ACTIVE, then those Kuryr CNI ADD requests will wait there forever and the alert will be raised.

That said, if this is happening on nodes where there are no ports in that situation (DOWN status), then, as Maysa said, please file a new bug and we will investigate whether there is something wrong in the way those times are estimated or whether some of the thresholds need to be adapted.
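For context, the "wait until the port is ACTIVE" behaviour described above can be sketched as a simple polling loop. This is only an illustration of the pattern, not kuryr's actual implementation; `get_port_status` is a hypothetical stand-in for the Neutron API lookup:

```python
import time

def wait_for_port_active(get_port_status, port_id, timeout=60, interval=1):
    """Poll until the port reports ACTIVE, or raise after `timeout` seconds.

    get_port_status is a stand-in for a Neutron API call. If the port
    stays DOWN forever, a loop like this is where a CNI ADD request
    effectively gets stuck until its own timeout fires.
    """
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if get_port_status(port_id) == "ACTIVE":
            return True
        time.sleep(interval)
    raise TimeoutError(f"port {port_id} never became ACTIVE")

# Simulate a port that becomes ACTIVE on the third poll.
statuses = iter(["DOWN", "DOWN", "ACTIVE"])
print(wait_for_port_active(lambda _pid: next(statuses), "9f096afa", timeout=5, interval=0))  # True
```

With a port that never leaves DOWN, the same call raises TimeoutError, which is the situation the KuryrCNISlow alert surfaces.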
(In reply to Luis Tomas Bolivar from comment #32)
> That message is alerting you that CNI ADD requests are taking too long.
> Kuryr CNI ADD requests need to wait until the Neutron port is ACTIVE to
> complete.
> [...]
> please file a new bug and we will investigate if there is something wrong
> on the way those times are estimated or if there is a need to adapt some of
> the thresholds.

It vanished after I re-connected the ports with:

#!/bin/bash -x
ports=$(openstack port list --device-owner "trunk:subport" -f value | grep DOWN | cut -d " " -f 1)
trunks=$(openstack network trunk list -f value -c ID)

for port in $ports; do
    for trunk in $trunks; do
        # Look up the VLAN ID of this subport on the current trunk
        vlan_id=$(openstack network subport list --trunk $trunk -f value | grep $port | cut -d " " -f 3)
        if [[ $vlan_id ]]; then
            # Found it! Detach and re-attach the subport
            echo "Port $port is in trunk $trunk with VLAN ID $vlan_id"
            openstack network trunk unset --subport $port $trunk
            openstack network trunk set --subport port=$port,segmentation-type=vlan,segmentation-id=$vlan_id $trunk
            break
        fi
    done
done

So with that it is possible to "heal" the deployment.
Verified OpenShift on OpenStack:
OCP 4.8.3
OSP 16.1.7

Installed and destroyed the OpenShift cluster a couple of times, checked that there were no DOWN trunk subports, and found no messages in the kuryr-controller logs regarding DOWN ports.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Red Hat OpenStack Platform 16.1.7 (Train) bug fix and enhancement advisory), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2021:3762