Bug 1968411 - Bad SDN setup after openshift-ansible restarts it during normal operation
Summary: Bad SDN setup after openshift-ansible restarts it during normal operation
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Networking
Version: 3.11.0
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: ---
Target Release: 3.11.z
Assignee: Alexander Constantinescu
QA Contact: huirwang
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2021-06-07 10:53 UTC by Pablo Alonso Rodriguez
Modified: 2021-08-04 11:18 UTC

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2021-08-04 11:18:17 UTC
Target Upstream Version:
Embargoed:



Links
System | ID | Private | Priority | Status | Summary | Last Updated
Github | openshift openshift-ansible pull 12336 | 0 | None | open | Bug 1968411: do not specify `force` and `grace-period 0` for SDN deleted | 2021-06-16 13:42:32 UTC
Red Hat Product Errata | RHBA-2021:2928 | 0 | None | None | None | 2021-08-04 11:18:31 UTC

Description Pablo Alonso Rodriguez 2021-06-07 10:53:23 UTC
Description of problem:

Node connectivity stops working at a certain moment during what looks like an SDN pod restart performed by an openshift-ansible playbook (specifically, at this step [1]).

Specifically, the OVS flow table contains the note flow in table 253 that signals that the SDN pod was set up, but none of the flows installed during the SDN setup process: no table=0 flows, no drop flows in the various tables, none of the rules set up here [2].
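The inconsistency described above can be checked mechanically against a flow dump. The following is only a minimal sketch, assuming the dump is the text output of `ovs-ofctl dump-flows` on the SDN bridge and that each flow line carries a `table=N` field; the function names and the sample flow line are hypothetical and not taken from openshift-sdn.

```python
import re
from collections import Counter

TABLE_RE = re.compile(r"\btable=(\d+)")
NOTE_TABLE = 253  # table holding the "SDN was set up" note flow, per this report


def summarize_flows(dump):
    """Return (flows per table, whether a note flow exists in NOTE_TABLE)."""
    counts = Counter()
    has_note = False
    for line in dump.splitlines():
        if "actions=" not in line:
            continue  # skip reply headers and blank lines
        match = TABLE_RE.search(line)
        # Assumption: a flow line without "table=" belongs to table 0.
        table = int(match.group(1)) if match else 0
        counts[table] += 1
        if table == NOTE_TABLE and "actions=note:" in line:
            has_note = True
    return counts, has_note


def is_half_configured(dump):
    """True when the note flow is present but table 0 has no flows at all."""
    counts, has_note = summarize_flows(dump)
    return has_note and counts[0] == 0


# Made-up sample line for illustration only; real note values will differ.
SAMPLE_BAD = (
    " cookie=0x0, duration=12.3s, table=253, n_packets=0, n_bytes=0, "
    "actions=note:00.07.00.00.00.00\n"
)

if __name__ == "__main__":
    print(is_half_configured(SAMPLE_BAD))
```

A check like this only detects the state reported here (note flow present, table 0 empty); it does not verify that every rule from [2] exists.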

As far as I understand the source code, I see no obvious code path that could create the flow in table 253 without also creating the flows from [2].

Just as a last note: The correlation between the pod restart and the start of the failures was confirmed by a customer connectivity test. More details in attachments.

Version-Release number of selected component (if applicable):

3.11.439

How reproducible:

Consistently

Steps to Reproduce:
1. Run an upgrade playbook that affects the nodes
2.
3.

Actual results:

Inconsistent OVS flow tables. Connectivity lost.

Expected results:

Consistent OVS flow tables. Connectivity working.

Additional info:

I'll provide detailed attachments.

References:
[1] - https://github.com/openshift/openshift-ansible/blob/release-3.11/roles/openshift_node/tasks/sdn_delete.yml#L5
[2] - https://github.com/openshift/origin/blob/release-3.11/pkg/network/node/ovscontroller.go#L73

Comment 7 Alexander Constantinescu 2021-06-07 11:07:54 UTC
Hmm, this looks suspiciously similar to: https://bugzilla.redhat.com/show_bug.cgi?id=1893067 and https://bugzilla.redhat.com/show_bug.cgi?id=1958390 

It seems like there was an upgrade involved in this scenario - from which version to which? The bugs I have linked to have to do with an openshift-sdn issue on upgrades. I have a PR for it (https://github.com/openshift/sdn/pull/306), but I am not sure if it's valid on 3.11...I need to double check the architecture on old versions.

Comment 8 Pablo Alonso Rodriguez 2021-06-07 11:10:29 UTC
Hi,

It happens from any version to any version just by running the upgrade playbook. It has happened even when upgrading "from same version to same version".

Comment 10 Alexander Constantinescu 2021-06-07 11:42:31 UTC
Hi Pablo

Could you also upload the OVS logs from the same upgrade (i.e. before and after), if they are still available?

/Alex

Comment 13 Pablo Alonso Rodriguez 2021-06-07 13:01:08 UTC
Thanks for attaching them. I was about to ask for them through the support case.

However, watch out: attaching them directly made them public, so anyone outside Red Hat or your company could have accessed them. For that reason, I have made them private.

In the future, please attach them through the support case (which is private) so I can attach them here privately and they are not exposed to the public.

Regards.

Comment 14 Pablo Alonso Rodriguez 2021-06-07 13:04:29 UTC
Also, a correction to the bug description: it does not happen that consistently. It sometimes takes many re-runs before it occurs, and then only on a few machines.

Comment 26 errata-xmlrpc 2021-08-04 11:18:17 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (OpenShift Container Platform 3.11.487 bug fix and enhancement update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2021:2928
