Bug 1838007

Summary: Networking issue during OSD service outage 2020-05-19
Product: OpenShift Container Platform
Reporter: Alexander Constantinescu <aconstan>
Component: Networking
Assignee: Alexander Constantinescu <aconstan>
Networking sub component: openshift-sdn
QA Contact: zhaozhanqi <zzhao>
Status: CLOSED DEFERRED
Docs Contact:
Severity: urgent
Priority: urgent
CC: aaleman, aarapov, agarcial, akonarde, aos-bugs, apahim, asegundo, bbennett, cattias, cblecker, cdc, dcbw, dhansen, jaharrin, jbeakley, jchevret, jeder, kbsingh, lmohanty, markmc, marobrie, mcambria, nmalik, pbergene, scuppett, sdodson, tparikh, trankin, tsmetana, vrutkovs, wking, yanyang, yufchang
Version: 4.3.0
Keywords: ServiceDeliveryBlocker, Upgrades
Target Milestone: ---   
Target Release: 4.6.0   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
: 1851182 (view as bug list) Environment:
Last Closed: 2020-08-04 08:28:53 UTC
Type: Bug
Bug Depends On: 1851182    
Bug Blocks:    

Comment 2 Lalatendu Mohanty 2020-05-21 14:59:26 UTC
We're asking the following questions to evaluate whether or not this bug warrants blocking an upgrade edge from either the previous X.Y or X.Y.Z. The ultimate goal is to avoid delivering an update which introduces new risk or reduces cluster functionality in any way. Sample answers are provided to give more context and the UpgradeBlocker flag has been added to this bug. It will be removed if the assessment indicates that this should not block upgrade edges. The expectation is that the assignee answers these questions.

Who is impacted?
  Customers upgrading from 4.y.z to 4.y+1.z running on GCP with thousands of namespaces, approximately 5% of the subscribed fleet
  All customers upgrading from 4.y.z to 4.y+1.z fail approximately 10% of the time
What is the impact?
  Up to 2 minute disruption in edge routing
  Up to 90 seconds of API downtime
  etcd loses quorum and you have to restore from backup
How involved is remediation?
  Issue resolves itself after five minutes
  Admin uses oc to fix things
  Admin must SSH to hosts, restore from backups, or other non standard admin activities
Is this a regression?
  No, it’s always been like this; we just never noticed
  Yes, from 4.y.z to 4.y+1.z or 4.y.z to 4.y.z+1

Comment 16 Scott Dodson 2020-06-01 17:57:43 UTC
Once the problem here is understood, please take a moment to answer the questions in Comment #2. We'd like to have a better understanding of whether or not the product has regressed ASAP.

Comment 29 Mark McLoughlin 2020-06-17 14:29:14 UTC
This could be a red herring, but https://github.com/openshift/machine-config-operator/pull/1668 was included in 4.3

This changed to NSS_SDB_USE_CACHE=no by default. See also https://bugzilla.redhat.com/show_bug.cgi?id=1820507#c9

It would be worth trying the getent/GET test case with NSS_SDB_USE_CACHE to rule out that change as a suspect.
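
A minimal sketch of how such a comparison might be scripted, assuming the test case can be run as a single command. The helper name `run_with_nss_cache` is illustrative and not part of any tooling referenced in this bug; `yes` and `no` are both values NSS accepts for this variable.

```python
import os
import subprocess
import time


def run_with_nss_cache(cmd, value):
    """Run cmd with NSS_SDB_USE_CACHE set to value.

    Returns (exit code, elapsed seconds, captured stdout).
    """
    # Copy the current environment and override only the variable under test.
    env = dict(os.environ, NSS_SDB_USE_CACHE=value)
    start = time.monotonic()
    proc = subprocess.run(cmd, env=env, capture_output=True, text=True)
    return proc.returncode, time.monotonic() - start, proc.stdout
```

Calling it once with `"no"` and once with `"yes"` for the same command (the real getent/GET test case would be substituted here) gives two exit codes and timings to compare. A single run proves little, so repeating each setting several times would be more convincing.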

Comment 30 Mark McLoughlin 2020-06-17 14:30:55 UTC
(In reply to Mark McLoughlin from comment #29)
> This could be a red herring, but
> https://github.com/openshift/machine-config-operator/pull/1668 was included
> in 4.3

Correction - it was included in 4.3.19

https://openshift-release.svc.ci.openshift.org/releasestream/4-stable/release/4.3.19?from=4.3.18

> machine-config-operator:
> * Bug 1822269: Add new crio.conf field to the template #1668

Comment 33 Andrew McDermott 2020-06-18 16:06:26 UTC
Still investigating, moving to 4.6 and will do a backport to 4.5 once fixed.

Comment 48 Dan Williams 2020-06-24 17:29:33 UTC
@aleksander can you grab full flow dumps from good and bad nodes, and also try an ofproto/trace for traffic going to the bad pod?
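
Once the dumps are collected, comparing them by eye is tedious because counters and timers differ on every line. The following is a rough sketch of a comparison helper, assuming each dump was saved as text (for example from `ovs-ofctl -O OpenFlow13 dump-flows br0`) and that every rule line contains a `table=` field. The function names are illustrative only.

```python
import re

# Counters and timers that differ between any two dumps and say nothing about the rule itself.
_VOLATILE = re.compile(r"\b(?:duration|n_packets|n_bytes|idle_age|hard_age)=[^,\s]+,?\s*")


def normalize_flows(dump_text):
    """Return the set of flow rules in a dump, with volatile fields stripped."""
    flows = set()
    for line in dump_text.splitlines():
        line = line.strip()
        # Skip blank lines and the reply header line, which has no table= field.
        if not line or "table=" not in line:
            continue
        flows.add(_VOLATILE.sub("", line))
    return flows


def diff_flows(good_dump, bad_dump):
    """Return (rules only on the good node, rules only on the bad node), sorted."""
    good, bad = normalize_flows(good_dump), normalize_flows(bad_dump)
    return sorted(good - bad), sorted(bad - good)
```

Some differences between nodes are expected, since each node carries rules for its own local pods. The interesting entries are rules for the affected namespace or pod that exist on the good node but not on the bad one.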

Comment 59 Dan Williams 2020-06-25 19:03:40 UTC
Yeah, if syncNamespaceFlows() fails it really should retry.
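
To illustrate the kind of retry meant here, a generic retry-with-backoff wrapper might look like the sketch below. This is not the openshift-sdn code (which is written in Go); the wrapper name, the attempt count and the delays are all assumptions chosen for illustration.

```python
import time


def retry_with_backoff(operation, attempts=5, initial_delay=0.5, factor=2.0, sleep=time.sleep):
    """Call operation() until it succeeds or attempts run out.

    Waits initial_delay seconds after the first failure and multiplies the
    wait by factor after each further failure. Re-raises the last error.
    """
    delay = initial_delay
    for attempt in range(1, attempts + 1):
        try:
            return operation()
        except Exception:
            if attempt == attempts:
                raise  # Out of attempts: surface the failure to the caller.
            sleep(delay)
            delay *= factor
```

A caller would pass the flow-sync step as `operation`, for instance a hypothetical `lambda: sync_namespace_flows(namespace)`. The `sleep` parameter exists only so that the waiting can be replaced in tests.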

Comment 60 Alexander Constantinescu 2020-08-04 08:28:53 UTC
I am closing this issue. 

Several openshift-sdn networking issues have been fixed and back-ported to 4.4 (the 4.3 back-port will happen this week); see below:

https://bugzilla.redhat.com/show_bug.cgi?id=1855118
https://bugzilla.redhat.com/show_bug.cgi?id=1853193
https://bugzilla.redhat.com/show_bug.cgi?id=1857738

Moreover, the effort to investigate the quay outage has stopped.