Description of problem:
Node connectivity stops working at a certain moment during what looks like a pod restart performed by an openshift-ansible playbook (specifically, at this step [1]). What happens is that the OVS flow table contains the note flow in table 253, which signals that the SDN pod was set up, but none of the flows created during the SDN setup process: no table=0 flows, no drop flows in several tables, none of the rules set up here [2]. From my reading of the source code, I see no obvious code path that could create the flow in table 253 without also creating the flows from [2]. One last note: the correlation between the pod restart and the start of the failures was confirmed by a customer connectivity test. More details in attachments.

Version-Release number of selected component (if applicable):
3.11.439

How reproducible:
Consistently

Steps to Reproduce:
1. Run an upgrade playbook that affects the nodes

Actual results:
Inconsistent OVS flow tables. Connectivity lost.

Expected results:
Consistent OVS flow tables. Connectivity working.

Additional info:
I'll provide detailed attachments.

References:
[1] - https://github.com/openshift/openshift-ansible/blob/release-3.11/roles/openshift_node/tasks/sdn_delete.yml#L5
[2] - https://github.com/openshift/origin/blob/release-3.11/pkg/network/node/ovscontroller.go#L73
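To make the inconsistent state easier to spot on a node, the check described above (note flow present in table 253, nothing in table 0) can be sketched as a small script. This is only a minimal sketch, not part of openshift-sdn: the function names are hypothetical, the sample flow lines are made up for illustration rather than taken from the attachments, and it assumes the usual `table=N` field in `ovs-ofctl dump-flows` output.

```python
import re
from collections import Counter

# Matches the "table=N" field of one flow line in `ovs-ofctl dump-flows` output.
TABLE_RE = re.compile(r"\btable=(\d+)\b")


def flows_per_table(dump: str) -> Counter:
    """Count flow entries per OpenFlow table in a dump-flows text."""
    counts = Counter()
    for line in dump.splitlines():
        line = line.strip()
        # Skip blank lines and the reply header; flow entries carry "actions=".
        if not line or "actions=" not in line:
            continue
        match = TABLE_RE.search(line)
        # Assumption: a flow line without a table field belongs to table 0.
        table = int(match.group(1)) if match else 0
        counts[table] += 1
    return counts


def looks_half_initialized(dump: str, marker_table: int = 253) -> bool:
    """True if the marker table has a flow while table 0 has none."""
    counts = flows_per_table(dump)
    return counts[marker_table] > 0 and counts[0] == 0


# Illustrative input only (hypothetical values, not from the affected nodes).
BROKEN = (
    " cookie=0x0, duration=12.3s, table=253, n_packets=0, n_bytes=0, "
    "actions=note:00.00.00.00.00.00\n"
)
HEALTHY = BROKEN + (
    " cookie=0x0, duration=12.3s, table=0, n_packets=5, n_bytes=300, "
    "priority=0 actions=drop\n"
)
```

On a node, the input would presumably come from something like `ovs-ofctl -O OpenFlow13 dump-flows br0`, run from wherever the OVS tools are available. `looks_half_initialized` returning true would match the state reported here.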
Hmm, this looks suspiciously similar to https://bugzilla.redhat.com/show_bug.cgi?id=1893067 and https://bugzilla.redhat.com/show_bug.cgi?id=1958390. It seems there was an upgrade involved in this scenario: from which version to which? The bugs I linked concern an openshift-sdn issue during upgrades. I have a PR for it (https://github.com/openshift/sdn/pull/306), but I am not sure whether it is valid on 3.11. I need to double-check the architecture of the old versions.
Hi, it happens from any version to any version, just by running the upgrade playbook. It has even happened when upgrading "from same version to same version".
Hi Pablo, could you also upload the OVS logs from the same upgrade (i.e. before and after), if they are still available? /Alex
Thanks for attaching them. I was about to ask for them through the support case. However, watch out: attaching them directly made them public, so anyone outside Red Hat or your company could have accessed them. For that reason, I have marked them private. In the future, please attach them to the support case (which is private) so that I can attach them here privately and they are not exposed publicly. Regards.
Also, a correction to the bug description: it does not happen that consistently. It sometimes takes many re-runs before it happens on a few machines.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (OpenShift Container Platform 3.11.487 bug fix and enhancement update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2021:2928