We're asking the following questions to evaluate whether or not this bug warrants blocking an upgrade edge from either the previous X.Y or X.Y.Z. The ultimate goal is to avoid delivering an update which introduces new risk or reduces cluster functionality in any way. Sample answers are provided to give more context and the UpgradeBlocker flag has been added to this bug. It will be removed if the assessment indicates that this should not block upgrade edges. The expectation is that the assignee answers these questions.

Who is impacted?
  Customers upgrading from 4.y.Z to 4.y+1.z running on GCP with thousands of namespaces, approximately 5% of the subscribed fleet
  All customers upgrading from 4.y.z to 4.y+1.z fail approximately 10% of the time
What is the impact?
  Up to 2 minute disruption in edge routing
  Up to 90 seconds of API downtime
  etcd loses quorum and you have to restore from backup
How involved is remediation?
  Issue resolves itself after five minutes
  Admin uses oc to fix things
  Admin must SSH to hosts, restore from backups, or other non standard admin activities
Is this a regression?
  No, it's always been like this we just never noticed
  Yes, from 4.y.z to 4.y+1.z Or 4.y.z to 4.y.z+1
Once the problem here is understood, please take a moment to answer the questions in Comment #2. We'd like to understand as soon as possible whether or not the product has regressed.
This could be a red herring, but https://github.com/openshift/machine-config-operator/pull/1668 was included in 4.3. That PR changed the default to NSS_SDB_USE_CACHE=no.
See also https://bugzilla.redhat.com/show_bug.cgi?id=1820507#c9
It would be worth trying the getent/GET test case with NSS_SDB_USE_CACHE to rule out that change as a suspect.
(In reply to Mark McLoughlin from comment #29)
> This could be a red herring, but
> https://github.com/openshift/machine-config-operator/pull/1668 was included
> in 4.3

Correction - it was included in 4.3.19
https://openshift-release.svc.ci.openshift.org/releasestream/4-stable/release/4.3.19?from=4.3.18

> machine-config-operator:
> * Bug 1822269: Add new crio.conf field to the template #1668
Still investigating, moving to 4.6 and will do a backport to 4.5 once fixed.
@aleksander can you grab full flow dumps from good and bad nodes, and also try an ofproto/trace for traffic going to the bad pod?
Yeah, if syncNamespaceFlows() fails it really should retry.
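To make the retry idea concrete, here is a minimal sketch of retrying a failing sync step with exponential backoff. It is written in Python purely for illustration and is not the openshift-sdn code; `retry_with_backoff`, its parameters, and the default values are all hypothetical names and numbers chosen for this example.

```python
import time


def retry_with_backoff(operation, attempts=5, initial_delay=0.1, factor=2.0,
                       sleep=time.sleep):
    """Call operation() until it succeeds or the attempts run out.

    Returns whatever operation() returns. If every attempt raises, the
    exception from the last attempt is re-raised to the caller.
    All names and defaults here are hypothetical, for illustration only.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    delay = initial_delay
    for attempt in range(1, attempts + 1):
        try:
            return operation()
        except Exception:
            if attempt == attempts:
                # Out of attempts: surface the failure instead of hiding it.
                raise
            # Wait before the next attempt, then lengthen the wait.
            sleep(delay)
            delay *= factor
```

A caller would wrap the sync step, for example `retry_with_backoff(lambda: sync_namespace_flows(ns))`, where `sync_namespace_flows` is a stand-in for whatever function programs the flows. The point of the pattern is that a transient failure gets retried and a persistent one is still reported rather than silently dropped.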
I am closing this issue. Several openshift-sdn networking issues have been fixed and backported to 4.4 (the 4.3 backports will happen this week), see below:
https://bugzilla.redhat.com/show_bug.cgi?id=1855118
https://bugzilla.redhat.com/show_bug.cgi?id=1853193
https://bugzilla.redhat.com/show_bug.cgi?id=1857738
Moreover, the effort to investigate the quay outage has stopped.