Bug 1838007 - Networking issue during OSD service outage 2020-05-19
Summary: Networking issue during OSD service outage 2020-05-19
Keywords:
Status: CLOSED DEFERRED
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Networking
Version: 4.3.0
Hardware: Unspecified
OS: Unspecified
Priority: urgent
Severity: urgent
Target Milestone: ---
Target Release: 4.6.0
Assignee: Alexander Constantinescu
QA Contact: zhaozhanqi
URL:
Whiteboard:
Depends On: 1851182
Blocks:
 
Reported: 2020-05-20 11:10 UTC by Alexander Constantinescu
Modified: 2021-04-05 17:24 UTC
CC List: 33 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Clones: 1851182
Environment:
Last Closed: 2020-08-04 08:28:53 UTC
Target Upstream Version:
Embargoed:



Comment 2 Lalatendu Mohanty 2020-05-21 14:59:26 UTC
We're asking the following questions to evaluate whether or not this bug warrants blocking an upgrade edge from either the previous X.Y or X.Y.Z. The ultimate goal is to avoid delivering an update which introduces new risk or reduces cluster functionality in any way. Sample answers are provided to give more context and the UpgradeBlocker flag has been added to this bug. It will be removed if the assessment indicates that this should not block upgrade edges. The expectation is that the assignee answers these questions.

Who is impacted?
  Customers upgrading from 4.y.z to 4.y+1.z running on GCP with thousands of namespaces, approximately 5% of the subscribed fleet
  All customers upgrading from 4.y.z to 4.y+1.z fail approximately 10% of the time
What is the impact?
  Up to a 2-minute disruption in edge routing
  Up to 90 seconds of API downtime
  etcd loses quorum and you have to restore from backup
How involved is remediation?
  Issue resolves itself after five minutes
  Admin uses oc to fix things
  Admin must SSH to hosts, restore from backups, or perform other non-standard admin activities
Is this a regression?
  No, it's always been like this; we just never noticed
  Yes, from 4.y.z to 4.y+1.z or from 4.y.z to 4.y.z+1

Comment 16 Scott Dodson 2020-06-01 17:57:43 UTC
Once the problem here is understood, please take a moment to answer the questions in Comment #2. We'd like to understand as soon as possible whether or not the product has regressed.

Comment 29 Mark McLoughlin 2020-06-17 14:29:14 UTC
This could be a red herring, but https://github.com/openshift/machine-config-operator/pull/1668 was included in 4.3

This changed the default to NSS_SDB_USE_CACHE=no. See also https://bugzilla.redhat.com/show_bug.cgi?id=1820507#c9

It would be worth re-running the getent/GET test case with NSS_SDB_USE_CACHE toggled to rule out that change as a suspect.

Comment 30 Mark McLoughlin 2020-06-17 14:30:55 UTC
(In reply to Mark McLoughlin from comment #29)
> This could be a red herring, but
> https://github.com/openshift/machine-config-operator/pull/1668 was included
> in 4.3

Correction - it was included in 4.3.19

https://openshift-release.svc.ci.openshift.org/releasestream/4-stable/release/4.3.19?from=4.3.18

> machine-config-operator:
> * Bug 1822269: Add new crio.conf field to the template #1668

Comment 33 Andrew McDermott 2020-06-18 16:06:26 UTC
Still investigating; moving to 4.6 and will backport to 4.5 once fixed.

Comment 48 Dan Williams 2020-06-24 17:29:33 UTC
@aleksander can you grab full flow dumps from good and bad nodes, and also try an ofproto/trace for traffic going to the bad pod?

Comment 59 Dan Williams 2020-06-25 19:03:40 UTC
Yeah, if syncNamespaceFlows() fails it really should retry.

Comment 60 Alexander Constantinescu 2020-08-04 08:28:53 UTC
I am closing this issue. 

Several issues with openshift-sdn networking have been fixed and back-ported to 4.4 (the 4.3 backport will happen this week); see below:

https://bugzilla.redhat.com/show_bug.cgi?id=1855118
https://bugzilla.redhat.com/show_bug.cgi?id=1853193
https://bugzilla.redhat.com/show_bug.cgi?id=1857738

Moreover, the effort to investigate the Quay outage has stopped.

