Bug 1983758 - upgrades are failing on disruptive tests
Summary: upgrades are failing on disruptive tests
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Networking
Version: 4.9
Hardware: Unspecified
OS: Unspecified
Priority: medium
Severity: medium
Target Milestone: ---
Target Release: 4.10.0
Assignee: Chad Scribner
QA Contact: Melvin Joseph
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2021-07-19 17:08 UTC by Stephen Benjamin
Modified: 2022-08-04 22:32 UTC
CC List: 9 users

Fixed In Version:
Doc Type: No Doc Update
Doc Text:
Clone Of:
Environment:
[sig-network-edge] Cluster frontend ingress remain available
[sig-api-machinery] OpenShift APIs remain available for new connections
[sig-api-machinery] OAuth APIs remain available for new connections
[sig-api-machinery] OAuth APIs remain available with reused connections 84.783 -10.79 95.573
[sig-api-machinery] OpenShift APIs remain available with reused connections
[sig-api-machinery] Kubernetes APIs remain available with reused connections 84.647 -10.62 95.267
[sig-api-machinery] Kubernetes APIs remain available for new connections
job=periodic-ci-openshift-release-master-ci-4.9-e2e-gcp-upgrade=all
job=periodic-ci-openshift-release-master-ci-4.9-upgrade-from-stable-4.8-e2e-gcp-upgrade=all
job=periodic-ci-openshift-release-master-ci-4.9-upgrade-from-stable-4.8-e2e-gcp-ovn-upgrade=all
Last Closed: 2022-03-12 04:36:01 UTC
Target Upstream Version:
Embargoed:




Links
Github openshift/origin pull 26699 (open): Bug 1983758: Add GCE back into the frontend disruption test (last updated 2021-12-16 21:27:09 UTC)
Red Hat Product Errata RHSA-2022:0056 (last updated 2022-03-12 04:36:22 UTC)

Description Stephen Benjamin 2021-07-19 17:08:10 UTC
Description of problem:
Sometime around 7/12, GCP upgrades regressed and the ingress disruption tests started failing.

The 4.8 test grid[1] shows it pretty clearly; the 4.9 test grid[2] is a little noisier.

We investigated code changes. There were only two: an RHCOS bump that included a one-line cri-o change[3], and an upgrade to stalld in CNTO[4].

The stalld change was interesting in that it rebuilt the image and pulled in new kernel-tools RPMs, but we've tried pinning those, as well as reverting the cri-o change, and we still see failures.

That leaves some change in GCP load balancing as a possible culprit.


[1] https://testgrid.k8s.io/redhat-openshift-ocp-release-4.8-informing#periodic-ci-openshift-release-master-ci-4.8-e2e-gcp-upgrade&include-filter-by-regex=available
[2] https://testgrid.k8s.io/redhat-openshift-ocp-release-4.9-informing#periodic-ci-openshift-release-master-ci-4.9-e2e-gcp-upgrade&include-filter-by-regex=available
[3] https://github.com/cri-o/cri-o/compare/30ca719956bba567a93876beaa24773428b2ddb4..8d2015333c7da5d697939bcab352c305e3a19851
[4] https://github.com/openshift/cluster-node-tuning-operator/pull/247


The stalld bump in https://github.com/openshift/cluster-node-tuning-operator/pull/247 rebuilt the CNTO image, and with it we got new kernel-tools/kernel-tools-libs packages:

< kernel-tools-4.18.0-305.3.1.el8_4.x86_64
< kernel-tools-libs-4.18.0-305.3.1.el8_4.x86_64
---
> kernel-tools-4.18.0-305.7.1.el8_4.x86_64
> kernel-tools-libs-4.18.0-305.7.1.el8_4.x86_64
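
For reference, a package-level diff like the one above can be reproduced by listing the RPMs installed in the CNTO image from each of the two payloads; a rough sketch, with the release pullspecs as placeholders:

# Placeholders: substitute the two release pullspecs being compared.
OLD_RELEASE=quay.io/openshift-release-dev/ocp-release:4.8.old-x86_64
NEW_RELEASE=quay.io/openshift-release-dev/ocp-release:4.8.new-x86_64

for rel in "$OLD_RELEASE" "$NEW_RELEASE"; do
  # Resolve the CNTO image inside the payload, then list its installed RPMs.
  img=$(oc adm release info --image-for=cluster-node-tuning-operator "$rel")
  podman run --rm --entrypoint rpm "$img" -qa | sort > "rpms-${rel##*:}.txt"
done
diff rpms-*.txt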


OpenShift release version:
4.8 and 4.9 latest builds


Cluster Platform:
GCP

How reproducible:
Most of the time

Steps to Reproduce (in detail):
Upgrade a cluster on GCP.

Actual results:
The frontend ingress route becomes briefly unavailable during the upgrade.

Expected results:
No disruption
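
What the disruption test measures is, roughly, whether that frontend route keeps answering throughout the upgrade. A minimal outside-the-cluster approximation (the route host is a placeholder for whatever the test exposes):

# Poll the frontend route once a second and log every non-200 response.
# ROUTE_HOST is a placeholder for the route the disruption test app exposes.
ROUTE_HOST=frontend.apps.example.com
while true; do
  code=$(curl -sk -o /dev/null -w '%{http_code}' --max-time 5 "https://${ROUTE_HOST}/") || code=000
  [ "$code" = "200" ] || echo "$(date -u +%FT%TZ) disruption: HTTP ${code}"
  sleep 1
done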


Impact of the problem:
GCP upgrade jobs are failing.

Additional info:

Comment 2 Stephen Benjamin 2021-08-25 15:12:14 UTC
Last time it fixed itself after a few days, but starting on 8/22 we're seeing it again on both 4.8 and 4.9 GCP upgrades:

e.g.:

https://sippy.ci.openshift.org/sippy-ng/tests/4.9/analysis?test=[sig-api-machinery]%20Kubernetes%20APIs%20remain%20available%20with%20reused%20connections

https://sippy.ci.openshift.org/sippy-ng/tests/4.8/analysis?test=[sig-api-machinery]%20Kubernetes%20APIs%20remain%20available%20with%20reused%20connections

Given that this has happened twice now, that both times we found no commits that could plausibly be culprits, that it's only GCP, and that it's hitting two releases at the exact same time, I think it's unlikely to be related to code changes on our side.

Is it possible to get logging from the LBs on GCP, or to escalate this to their support?
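
If it helps, access logging can usually be enabled from our side for GCP backend services behind the HTTP(S)/proxy load balancers; a hedged sketch (the backend service name is a placeholder, and this may not cover the network LB the router actually uses):

# BACKEND is a placeholder; list the backend services in the project first.
gcloud compute backend-services list
gcloud compute backend-services update BACKEND --global \
  --enable-logging --logging-sample-rate=1.0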

Comment 8 Chad Scribner 2021-11-18 18:04:08 UTC
Hi Stephen -- I've picked up this BZ. Would you mind confirming whether this issue is still occurring? If it is, are the links you posted previously still the right place to check for the overall picture of the failures/flakes?

Comment 9 Stephen Benjamin 2021-11-18 18:06:48 UTC
Hi Chad! I think we can close this -- ultimately we determined the cause was the CI cluster in use.

Comment 13 Melvin Joseph 2021-12-20 16:03:47 UTC
The PR for this bug is merged and included in 4.10.0-0.ci-2021-12-18-095230.
From the PR's CI job history, it is evident that the issue is no longer recurring:
https://prow.ci.openshift.org/pr-history/?org=openshift&repo=origin&pr=26699

So I'm marking the bug as verified.
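
For reference, a quick way to double-check that an origin change is carried by a given payload is to look at the commit recorded for the tests image, which is built from openshift/origin; a rough sketch, assuming the payload is still available in the CI release registry:

# The tests image in the payload is built from openshift/origin, so its
# recorded commit shows which origin code the payload carries.
oc adm release info registry.ci.openshift.org/ocp/release:4.10.0-0.ci-2021-12-18-095230 \
  --commits | grep -w tests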

Comment 16 Brandi Munilla 2022-02-10 20:28:27 UTC
Hi, if there is anything that customers should know about this bug, or if there are any important workarounds that should be outlined in the bug fixes section of the OpenShift Container Platform 4.10 release notes, please update the Doc Type and Doc Text fields. If not, can you please mark it as "no doc update"? Thanks!

Comment 18 errata-xmlrpc 2022-03-12 04:36:01 UTC
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.10.3 security update), and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:0056

