1968411 – Bad SDN setup after openshift-ansible restarts it during normal operation

Summary: Bad SDN setup after openshift-ansible restarts it during normal operation

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	OpenShift Container Platform
Classification:	Red Hat
Component:	Networking
Sub Component:
Version:	3.11.0
Hardware:	Unspecified
OS:	Unspecified
Priority:	high
Severity:	high
Target Milestone:	---
Target Release:	3.11.z
Assignee:	Alexander Constantinescu
QA Contact:	huirwang
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:

Reported:	2021-06-07 10:53 UTC by Pablo Alonso Rodriguez
Modified:	2021-08-04 11:18 UTC
CC List:	3 users
Fixed In Version:
Doc Type:	If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed:	2021-08-04 11:18:17 UTC
Target Upstream Version:
Embargoed:

Links
System	ID	Private	Priority	Status	Summary	Last Updated
Github	openshift openshift-ansible pull 12336	0	None	open	Bug 1968411: do not specify `force` and `grace-period 0` for SDN deleted	2021-06-16 13:42:32 UTC
Red Hat Product Errata	RHBA-2021:2928	0	None	None	None	2021-08-04 11:18:31 UTC

Description Pablo Alonso Rodriguez 2021-06-07 10:53:23 UTC

Description of problem:

Node connectivity stops working at a certain moment during what looks like an SDN pod restart performed by an openshift-ansible playbook (specifically, at this step [1]).

What happens exactly is that the OVS flow table contains the note flow in table 253 that signals that the SDN pod was set up, but none of the flows created during the SDN setup process: no table=0 flows, no drop flows in several tables, and so on. None of the rules set up here [2] are present.

As far as I understand the source code, I see no obvious code path that could lead to the flow in table 253 existing without the flows from [2] having been created as well.
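The inconsistent state described above (note flow present in table 253, nothing in table 0) can be detected mechanically from a flow dump. The following is only a minimal sketch, not part of openshift-sdn or openshift-ansible: the function names are hypothetical, and it assumes the input is the text output of `ovs-ofctl dump-flows`, where each flow line contains `actions=` and usually a `table=N` field.

```python
import re
from typing import Dict

# Table holding the "SDN was set up" note flow, per the description above.
NOTE_TABLE = 253


def flows_by_table(dump: str) -> Dict[int, int]:
    """Count flows per table in `ovs-ofctl dump-flows` text output."""
    counts: Dict[int, int] = {}
    for line in dump.splitlines():
        if "actions=" not in line:
            continue  # skip the reply header and blank lines
        match = re.search(r"\btable=(\d+)", line)
        # Assumption: a flow line without a table= field belongs to table 0.
        table = int(match.group(1)) if match else 0
        counts[table] = counts.get(table, 0) + 1
    return counts


def is_half_configured(dump: str) -> bool:
    """True if the note flow exists but table 0 has no flows at all."""
    counts = flows_by_table(dump)
    return counts.get(NOTE_TABLE, 0) > 0 and counts.get(0, 0) == 0
```

On an affected node, the dump could be collected with something like `ovs-ofctl -O OpenFlow13 dump-flows br0` (the bridge name and OpenFlow version are assumptions about the node's SDN setup) and passed to `is_half_configured` as a string.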

Just as a last note: The correlation between the pod restart and the start of the failures was confirmed by a customer connectivity test. More details in attachments.

Version-Release number of selected component (if applicable):

3.11.439

How reproducible:

Consistently

Steps to Reproduce:
1. Run an upgrade playbook that affects the nodes
2.
3.

Actual results:

Inconsistent OVS flow tables. Connectivity lost.

Expected results:

Consistent OVS flow tables. Connectivity working.

Additional info:

I'll provide detailed attachments.

References:
[1] - https://github.com/openshift/openshift-ansible/blob/release-3.11/roles/openshift_node/tasks/sdn_delete.yml#L5
[2] - https://github.com/openshift/origin/blob/release-3.11/pkg/network/node/ovscontroller.go#L73

Comment 7 Alexander Constantinescu 2021-06-07 11:07:54 UTC

Hmm, this looks suspiciously similar to: https://bugzilla.redhat.com/show_bug.cgi?id=1893067 and https://bugzilla.redhat.com/show_bug.cgi?id=1958390 

It seems like there was an upgrade involved in this scenario - from which version to which? The bugs I have linked to have to do with an openshift-sdn issue on upgrades. I have a PR for it (https://github.com/openshift/sdn/pull/306), but I am not sure if it's valid on 3.11...I need to double check the architecture on old versions.

Comment 8 Pablo Alonso Rodriguez 2021-06-07 11:10:29 UTC

Hi,

It happens from any version to any version, just by running the upgrade playbook. It has happened even when upgrading "from same version to same version".

Comment 10 Alexander Constantinescu 2021-06-07 11:42:31 UTC

Hi Pablo

Could you also upload the OVS logs from the same upgrade (i.e., before and after), if they are still available?

/Alex

Comment 13 Pablo Alonso Rodriguez 2021-06-07 13:01:08 UTC

Thanks for attaching. I was about to ask for them through the support case.

However, watch out: the way you attached them directly made them public, so they could have been accessed by anyone outside Red Hat or your company. For that reason, I have made them private.

In the future, please attach them through the support case (which is private) so that I can attach them here privately and they are not exposed to the public.

Regards.

Comment 14 Pablo Alonso Rodriguez 2021-06-07 13:04:29 UTC

Also a correction to the bug description: it doesn't happen that consistently; it sometimes takes many re-runs before it happens on a few machines.

Comment 26 errata-xmlrpc 2021-08-04 11:18:17 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (OpenShift Container Platform 3.11.487 bug fix and enhancement update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2021:2928
