Bug 1983758 - upgrades are failing on disruptive tests
Summary: upgrades are failing on disruptive tests
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Networking
Version: 4.9
Hardware: Unspecified
OS: Unspecified
Priority: medium
Severity: medium
Target Milestone: ---
Target Release: 4.10.0
Assignee: Chad Scribner
QA Contact: Melvin Joseph
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2021-07-19 17:08 UTC by Stephen Benjamin
Modified: 2022-08-04 22:32 UTC
CC List: 9 users

Fixed In Version:
Doc Type: No Doc Update
Doc Text:
Clone Of:
Environment:
[sig-network-edge] Cluster frontend ingress remain available
[sig-api-machinery] OpenShift APIs remain available for new connections
[sig-api-machinery] OAuth APIs remain available for new connections
[sig-api-machinery] OAuth APIs remain available with reused connections 84.783 -10.79 95.573
[sig-api-machinery] OpenShift APIs remain available with reused connections
[sig-api-machinery] Kubernetes APIs remain available with reused connections 84.647 -10.62 95.267
[sig-api-machinery] Kubernetes APIs remain available for new connections
job=periodic-ci-openshift-release-master-ci-4.9-e2e-gcp-upgrade=all
job=periodic-ci-openshift-release-master-ci-4.9-upgrade-from-stable-4.8-e2e-gcp-upgrade=all
job=periodic-ci-openshift-release-master-ci-4.9-upgrade-from-stable-4.8-e2e-gcp-ovn-upgrade=all
Last Closed: 2022-03-12 04:36:01 UTC
Target Upstream Version:
Embargoed:




Links
Github openshift/origin pull 26699 (open): Bug 1983758: Add GCE back into the frontend disruption test (last updated 2021-12-16 21:27:09 UTC)
Red Hat Product Errata RHSA-2022:0056 (last updated 2022-03-12 04:36:22 UTC)

Description Stephen Benjamin 2021-07-19 17:08:10 UTC
Description of problem:
Sometime around 7/12, GCP upgrades regressed and the ingress disruption tests started failing.

The 4.8 test grid[1] shows it pretty clearly; the 4.9 test grid[2] is a little noisier.

We investigated code changes. There were only two: an RHCOS bump that included a one-line cri-o change[3], and an upgrade to stalld in CNTO[4].

The stalld change was interesting in that it rebuilt the image and pulled in new kernel-tools RPMs, but we've tried pinning those, as well as reverting the cri-o change, and we still see failures.

That leaves some change in GCP load balancing as a possible culprit.


[1] https://testgrid.k8s.io/redhat-openshift-ocp-release-4.8-informing#periodic-ci-openshift-release-master-ci-4.8-e2e-gcp-upgrade&include-filter-by-regex=available
[2] https://testgrid.k8s.io/redhat-openshift-ocp-release-4.9-informing#periodic-ci-openshift-release-master-ci-4.9-e2e-gcp-upgrade&include-filter-by-regex=available
[3] https://github.com/cri-o/cri-o/compare/30ca719956bba567a93876beaa24773428b2ddb4..8d2015333c7da5d697939bcab352c305e3a19851
[4] https://github.com/openshift/cluster-node-tuning-operator/pull/247


The stalld bump in https://github.com/openshift/cluster-node-tuning-operator/pull/247 rebuilt the CNTO image, and with it we got new kernel-tools/kernel-tools-libs packages:

< kernel-tools-4.18.0-305.3.1.el8_4.x86_64
< kernel-tools-libs-4.18.0-305.3.1.el8_4.x86_64
---
> kernel-tools-4.18.0-305.7.1.el8_4.x86_64
> kernel-tools-libs-4.18.0-305.7.1.el8_4.x86_64
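
For reference, a package-level diff like the one above can be reproduced by listing the RPMs installed in the CNTO image from each of the two payloads; a rough sketch, with the release pullspecs as placeholders:

# Placeholders: substitute the two release pullspecs being compared.
OLD_RELEASE=quay.io/openshift-release-dev/ocp-release:4.8.old-x86_64
NEW_RELEASE=quay.io/openshift-release-dev/ocp-release:4.8.new-x86_64

for rel in "$OLD_RELEASE" "$NEW_RELEASE"; do
  # Resolve the CNTO image inside the payload, then list its installed RPMs.
  img=$(oc adm release info --image-for=cluster-node-tuning-operator "$rel")
  podman run --rm --entrypoint rpm "$img" -qa | sort > "rpms-${rel##*:}.txt"
done
diff rpms-*.txt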


OpenShift release version:
4.8 and 4.9 latest builds


Cluster Platform:
GCP

How reproducible:
Most of the time

Steps to Reproduce (in detail):
Upgrade a cluster on GCP.

Actual results:
The frontend ingress route becomes briefly unavailable during the upgrade.

Expected results:
No disruption
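
What the disruption test measures is, roughly, whether that frontend route keeps answering throughout the upgrade. A minimal outside-the-cluster approximation (the route host is a placeholder for whatever the test exposes):

# Poll the frontend route once a second and log every non-200 response.
# ROUTE_HOST is a placeholder for the route the disruption test app exposes.
ROUTE_HOST=frontend.apps.example.com
while true; do
  code=$(curl -sk -o /dev/null -w '%{http_code}' --max-time 5 "https://${ROUTE_HOST}/") || code=000
  [ "$code" = "200" ] || echo "$(date -u +%FT%TZ) disruption: HTTP ${code}"
  sleep 1
done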


Impact of the problem:
GCP upgrade jobs are failing.

Additional info:

Comment 2 Stephen Benjamin 2021-08-25 15:12:14 UTC
Last time it fixed itself after a few days, but starting on 8/22 we're seeing it again on both 4.8 and 4.9 GCP upgrades:

e.g.:

https://sippy.ci.openshift.org/sippy-ng/tests/4.9/analysis?test=[sig-api-machinery]%20Kubernetes%20APIs%20remain%20available%20with%20reused%20connections

https://sippy.ci.openshift.org/sippy-ng/tests/4.8/analysis?test=[sig-api-machinery]%20Kubernetes%20APIs%20remain%20available%20with%20reused%20connections

Given that this has happened twice now, that both times we found no commits that could plausibly be culprits, that it's only GCP, and that it's hitting two releases at the exact same time, I think it's unlikely to be related to code changes on our side.

Is it possible to get logging from the LBs on GCP, or to escalate this to their support?
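
If it helps, access logging can usually be enabled from our side for GCP backend services behind the HTTP(S)/proxy load balancers; a hedged sketch (the backend service name is a placeholder, and this may not cover the network LB the router actually uses):

# BACKEND is a placeholder; list the backend services in the project first.
gcloud compute backend-services list
gcloud compute backend-services update BACKEND --global \
  --enable-logging --logging-sample-rate=1.0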

Comment 8 Chad Scribner 2021-11-18 18:04:08 UTC
Hi Stephen -- I've picked up this BZ. Would you mind confirming whether this issue is still occurring? If it is, are the links you posted previously still the right place to check for the overall picture of the failures/flakes?

Comment 9 Stephen Benjamin 2021-11-18 18:06:48 UTC
Hi Chad! I think we can close this -- ultimately we determined the cause was the CI cluster in use.

Comment 13 Melvin Joseph 2021-12-20 16:03:47 UTC
The PR for this bug is merged and included in 4.10.0-0.ci-2021-12-18-095230.
From the PR's CI job history, it is evident that the issue is no longer recurring:
https://prow.ci.openshift.org/pr-history/?org=openshift&repo=origin&pr=26699

So I'm marking the bug as verified.
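
For reference, a quick way to double-check that an origin change is carried by a given payload is to look at the commit recorded for the tests image, which is built from openshift/origin; a rough sketch, assuming the payload is still available in the CI release registry:

# The tests image in the payload is built from openshift/origin, so its
# recorded commit shows which origin code the payload carries.
oc adm release info registry.ci.openshift.org/ocp/release:4.10.0-0.ci-2021-12-18-095230 \
  --commits | grep -w tests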

Comment 16 Brandi Munilla 2022-02-10 20:28:27 UTC
Hi, if there is anything that customers should know about this bug, or if there are any important workarounds that should be outlined in the bug fixes section of the OpenShift Container Platform 4.10 release notes, please update the Doc Type and Doc Text fields. If not, can you please mark it as "no doc update"? Thanks!

Comment 18 errata-xmlrpc 2022-03-12 04:36:01 UTC
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.10.3 security update), and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:0056

