Bug 1981872 - SDN networking failures during GCP upgrades
Summary: SDN networking failures during GCP upgrades
Keywords:
Status: CLOSED CANTFIX
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Networking
Version: 4.9
Hardware: Unspecified
OS: Unspecified
Priority: urgent
Severity: urgent
Target Milestone: ---
Target Release: 4.9.0
Assignee: jamo luhrsen
QA Contact: zhaozhanqi
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2021-07-13 15:47 UTC by Stephen Benjamin
Modified: 2021-08-20 18:51 UTC
CC: 7 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
job=periodic-ci-openshift-release-master-ci-4.9-upgrade-from-stable-4.8-e2e-gcp-upgrade=all
Last Closed: 2021-08-20 18:51:14 UTC
Target Upstream Version:
Embargoed:



Description Stephen Benjamin 2021-07-13 15:47:09 UTC
We saw in this CI job https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.9-upgrade-from-stable-4.8-e2e-gcp-upgrade/1414662507020161024 the following symptoms:

- Multiple unrelated pods fail probes around the same time
- API server is unreachable on old and new connections (see the sketch below)
- etcd communication using the host network is disrupted
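
For illustration (not taken from the job artifacts): the upgrade suite's disruption checks poll the API server both over a reused keep-alive connection and over a fresh connection for every request, which is what "old and new connections" refers to above. A minimal Python sketch of that kind of poller, using a hypothetical endpoint and token:

  import time
  import requests

  API = "https://api.example-cluster:6443/healthz"       # hypothetical API server endpoint
  HEADERS = {"Authorization": "Bearer sha256~REDACTED"}   # hypothetical token

  reused = requests.Session()  # keep-alive session: models the "old connection" probe

  def probe(label, do_get):
      try:
          ok = do_get().status_code == 200
      except requests.RequestException:
          ok = False
      print(time.strftime("%H:%M:%S"), label, "ok" if ok else "DISRUPTED")

  while True:
      # old/reused connection: only breaks if the established TCP path is disrupted
      probe("reused-connection", lambda: reused.get(API, headers=HEADERS, verify=False, timeout=5))
      # new connection: a fresh TCP+TLS handshake, so it also catches LB/DNS/SYN problems
      probe("new-connection", lambda: requests.get(API, headers=HEADERS, verify=False, timeout=5))
      time.sleep(1)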

I dug into the SDN logs and can't find anything obviously wrong, but everything points to something systemic with networking.

I also dug into the historical Loki logs for this job in the SDN namespace, and likewise nothing sticks out:

https://grafana-loki.ci.openshift.org/explore?orgId=1&left=%5B%221625518320244%22,%221626123120244%22,%22Grafana%20Cloud%22,%7B%22expr%22:%22%7Binvoker%3D%5C%22openshift-internal-ci%2Fperiodic-ci-openshift-release-master-ci-4.9-upgrade-from-stable-4.8-e2e-gcp-upgrade%2F1414662507020161024%5C%22%7D%20%7C%20unpack%20%7C%20namespace%3D%5C%22openshift-sdn%5C%22%22%7D%5D
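
For readability, the LogQL embedded in that URL selects this job run's invoker label, unpacks the packed log lines, and filters to the openshift-sdn namespace. A hedged sketch of pulling the same logs through Loki's standard query_range HTTP API; whether that hostname serves the raw Loki API directly (rather than only through the Grafana datasource proxy) and what auth it needs are assumptions here:

  import requests

  LOKI = "https://grafana-loki.ci.openshift.org"   # instance linked above; direct API access is an assumption
  HEADERS = {"Authorization": "Bearer REDACTED"}   # assumption: some bearer auth is required

  # Same LogQL as in the Grafana URL: this run's invoker label, unpack, filter to openshift-sdn.
  query = (
      '{invoker="openshift-internal-ci/periodic-ci-openshift-release-master-ci-'
      '4.9-upgrade-from-stable-4.8-e2e-gcp-upgrade/1414662507020161024"} '
      '| unpack | namespace="openshift-sdn"'
  )

  resp = requests.get(
      f"{LOKI}/loki/api/v1/query_range",           # standard Loki HTTP API endpoint
      headers=HEADERS,
      params={
          "query": query,
          "start": "1625518320244000000",          # nanoseconds; the URL carries milliseconds
          "end": "1626123120244000000",
          "limit": 5000,
      },
  )
  resp.raise_for_status()
  for stream in resp.json()["data"]["result"]:
      for ts, line in stream["values"]:
          print(ts, line)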

Comment 2 Stephen Benjamin 2021-07-13 16:40:19 UTC
TRT is setting blocker+ due to impact on upgrades. This appears to impact most GCP upgrade jobs, and there's some evidence it may be happening on AWS too.

https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.9-e2e-aws-upgrade/1414686006967275520

Comment 3 jamo luhrsen 2021-07-22 06:41:08 UTC
@bbennett, @vpickard, I will not be able to figure this one out in any timely fashion. I can help however possible,
but since this is blocker+ I suggest we find someone with more expertise in the nitty-gritty of what could be wrong.

The most common problem with this job is the failure of "Cluster should remain functional during upgrade", which, if you dig into a few
of the jobs, tells us that different APIs are simply not responding due to i/o timeouts.

Another test case that frequently shows up is "Check if alerts are firing during or after upgrade success". It usually flakes, but sometimes
fails (meaning it failed twice in a row). These are the messages (a sketch of the underlying alert check follows them):

  alert AggregatedAPIErrors fired for 30 seconds with labels: {name="v1.packages.operators.coreos.com", namespace="default", severity="warning"}
  alert HighOverallControlPlaneCPU fired for 180 seconds with labels: {severity="warning"}
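
For context on what this check looks at: the built-in ALERTS series in the cluster's Prometheus carries one sample per rule evaluation while an alert is pending or firing, so firing durations can be read back out of it after the fact. A rough, hedged sketch of that query against the standard Prometheus HTTP API (the route, token, and time window below are placeholders):

  import requests

  PROM = "https://prometheus-k8s-openshift-monitoring.apps.example.com"  # placeholder route
  HEADERS = {"Authorization": "Bearer REDACTED"}                          # placeholder token

  # ALERTS: one sample per evaluation while an alert is pending or firing,
  # labeled with the alert's own labels plus alertstate.
  resp = requests.get(
      f"{PROM}/api/v1/query_range",
      headers=HEADERS,
      params={
          "query": 'ALERTS{alertstate="firing", severity!="info"}',
          "start": "2021-07-13T00:00:00Z",   # illustrative upgrade window
          "end": "2021-07-13T02:00:00Z",
          "step": "30s",
      },
      verify=False,
  )
  resp.raise_for_status()
  for series in resp.json()["data"]["result"]:
      labels = series["metric"]
      seconds = len(series["values"]) * 30   # rough duration at a 30s step
      print(f'alert {labels.get("alertname")} fired for ~{seconds}s with labels: {labels}')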

Also, the last three GCP upgrade jobs failed the test about the upgrade taking too long. Is the cluster choking? Is it worth trying
to see what these jobs would look like running on more resources? I think we do that with our network-stress job, so I can look into
trying that just to give us a data point.

I found similar types of failures, although not as consistent, in the ovn upgrade job (same versions), just for reference.
The aws upgrade job [1] (not ovn) failed the same top-level test case (cluster should remain functional...), but that has a
different log, about a KubePodNotReady alert firing.

Another test case that is frequently failing is "Cluster frontend ingress remain available", which may be related to this. There
is a BZ for that already [2], but the PR for it to pre-pull images doesn't seem to help.


[0] https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.9-upgrade-from-stable-4.8-e2e-gcp-upgrade/1417562198238040064
[1] https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.9-upgrade-from-stable-4.8-e2e-aws-upgrade/1418013626882592768
[2] https://bugzilla.redhat.com/show_bug.cgi?id=1970985

Comment 4 W. Trevor King 2021-07-22 17:58:09 UTC
>  alert AggregatedAPIErrors fired for 30 seconds with labels: {name="v1.packages.operators.coreos.com", namespace="default", severity="warning"}

This is bad component behavior.

>  alert HighOverallControlPlaneCPU fired for 180 seconds with labels: {severity="warning"}

This one might be bad component behavior?  Or might be an overly-touchy alert rule?  Or might be underprovisioned CI machines?  Or might be overly hungry OpenShift components?  I'd opened [1] for "it's fine, just ignore", but will try to drum up some interest in getting this addressed one way or the other.  I expect it's orthogonal to the SDN/API-connectivity issues.

[1]: https://github.com/openshift/origin/pull/26341

Comment 5 jamo luhrsen 2021-08-03 17:13:35 UTC
> Another test case that shows up to me is "Check if alerts are firing during
> or after upgrade success". It usually flakes, but sometimes
> fails (means it failed twice in a row). This is the message:
> 
>   alert AggregatedAPIErrors fired for 30 seconds with labels:
> {name="v1.packages.operators.coreos.com", namespace="default",
> severity="warning"}
>   alert HighOverallControlPlaneCPU fired for 180 seconds with labels:
> {severity="warning"}

Because of this warning, I wrote a Jira ticket to investigate what these
upgrade jobs look like on larger nodes. I was able to test it, and there is
no difference in test results. Just an FYI:
  https://issues.redhat.com/browse/SDN-2069

Comment 6 Alexander Constantinescu 2021-08-18 14:46:40 UTC
Hi Jamo

Do we have a status update on this bug? Any progress made or help needed? 

/Alex

Comment 7 jamo luhrsen 2021-08-18 18:50:16 UTC
(In reply to Alexander Constantinescu from comment #6)
> Hi Jamo
> 
> Do we have a status update on this bug? Any progress made or help needed? 
> 
> /Alex

@aconstan, back in comment #3 I was asking Ben and Vic if we could get help. This is essentially a
catch-all bz because our upgrade-gcp job is failing most of the time, for multiple reasons (see description).
I moved this bz into our current sprint so I can finally take a closer look. I think it was
actually worse back in mid-July when this was filed. Currently (the past ~2 weeks) the job is failing 40% of the
time. The failures in this time period are almost always caused by one failing test case:

  "[sig-arch][Feature:ClusterUpgrade] Cluster should remain functional during upgrade [Disruptive] [Serial]"

The reasons for this test case failing seem to be multiple as well, although I can't say whether or not they all share
the same root cause. There are 15 unresolved bzs that have some comment calling out this test
case [0]. Digging through the last two weeks of failures of this test case, I see:

1)
  "Cluster did not complete upgrade: timed out waiting for the condition: Cluster operator openshift-apiserver is degraded" -and-
  "Cluster did not complete upgrade: timed out waiting for the condition: Cluster operator authentication is degraded"
  -  There are plenty of open bugs in bz matching 'Cluster operator <X> is degraded'. Do we need deep dives into
     each case to open new bugs, or can we attribute them to active, open bzs? (A sketch for pulling the
     Degraded conditions follows this list.)

2)
  "kube-apiserver-new-connection kube-apiserver-new-connection is not responding to GET requests"
  -  I think this is tracked in this NEW bz: https://bugzilla.redhat.com/show_bug.cgi?id=1955333

3)
  "alert HighOverallControlPlaneCPU fired for 120 seconds with labels" -and-
  "alert ExtremelyHighIndividualControlPlaneCPU fired for 90 seconds with labels"
  -  this seems likely to be https://bugzilla.redhat.com/show_bug.cgi?id=1985073 which has some
     PRs in POST

4)
  "alert AggregatedAPIErrors fired for 30 seconds with labels: {name="v1.packages.operators.coreos.com", namespace="default", severity="warning"}"
  -  There was a similar bug that seemed specific to oVirt, but it was closed recently due to
     insufficient data: https://bugzilla.redhat.com/show_bug.cgi?id=1974283. Either we can try
     to re-open that one or file a new one.
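
For 1), as referenced above, a quick way to see which operators report Degraded (and why) is to read the ClusterOperator conditions straight from the API. A hedged Python sketch using the Kubernetes client against the config.openshift.io group (cluster access and a kubeconfig for the cluster under test are assumed):

  from kubernetes import client, config

  config.load_kube_config()   # assumes a kubeconfig pointing at the cluster under test
  api = client.CustomObjectsApi()

  # ClusterOperator is a cluster-scoped resource in the config.openshift.io/v1 group.
  operators = api.list_cluster_custom_object(
      group="config.openshift.io", version="v1", plural="clusteroperators"
  )

  for co in operators["items"]:
      for cond in co.get("status", {}).get("conditions", []):
          if cond["type"] == "Degraded" and cond["status"] == "True":
              # The condition message usually names the failing operand, which is what
              # we need to attribute the failure to an existing bz or open a new one.
              print(co["metadata"]["name"], cond.get("lastTransitionTime"), cond.get("message"))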


What is the consensus here? Keep this open as a catch-all until the upgrade jobs start
passing more reliably? If so, I would not keep this as a blocker; making upgrade jobs
pass is not something we have blocked releases on in the past. Or can we close this
bug and create some tasks (jira or bzs) to dig into 1) and possibly re-open 4)?




[0] https://bugzilla.redhat.com/buglist.cgi?bug_status=ASSIGNED&bug_status=POST&bug_status=NEW&bug_status=MODIFIED&known_name=Openshift%20Container%20Search&list_id=12085872&longdesc=Cluster%20should%20remain%20functional%20during%20upgrade&longdesc_type=substring&product=OpenShift%20Container%20Platform&query_based_on=Openshift%20Container%20Search&query_format=advanced

Comment 8 jamo luhrsen 2021-08-20 18:51:14 UTC
I filed https://bugzilla.redhat.com/show_bug.cgi?id=1996187 for 1) in comment #7.

Closing this out since we have multiple other bugs to track failures in the upgrade job. I don't expect the upgrade job to
turn fully green after all of them are fixed, but it would behoove us to find some way to have someone consistently
dig into the failures here (and in the ovn job) and file specific BZs as they are uncovered.

