Description of problem:
Sometime around 7/12, gcp upgrades regressed and the ingress disruptive tests started failing.
The 4.8 test grid shows it pretty clearly; the 4.9 test grid is a little noisier.
We investigated code changes. There were only two: an RHCOS bump that included a one-liner cri-o change, and an upgrade to stalld in CNTO.
The stalld change was interesting in that it rebuilt the image and pulled in new kernel-tools RPMs, but we've tried pinning those and reverting the cri-o change, and we still see failures.
That leaves some change in GCP load balancing as a possible culprit.
stalld bump in https://github.com/openshift/cluster-node-tuning-operator/pull/247
This rebuilt the CNTO image, and we got a new kernel-tools/kernel-tools-libs:
OpenShift release version:
4.8 and 4.9 latest builds
How reproducible:
Most of the time
Steps to Reproduce (in detail):
Upgrade a cluster
Ingress route for frontend becomes briefly unavailable
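To quantify "briefly unavailable" while an upgrade runs, one option is to poll the route and collapse the results into outage windows. The following is only a minimal sketch and not the disruption monitor the CI jobs actually use; the function names, the polling interval, and the route URL passed in are all hypothetical.

```python
import time
import urllib.request
from typing import Callable, List, Tuple

Sample = Tuple[float, bool]  # (timestamp in seconds, request succeeded?)


def probe(url: str, timeout: float = 3.0) -> bool:
    """Return True if the URL answers with a 2xx/3xx status within the timeout."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return 200 <= resp.status < 400
    except Exception:
        # Connection errors, timeouts and HTTP 4xx/5xx all count as unavailable.
        return False


def poll(url: str, duration: float, interval: float = 1.0,
         check: Callable[[str], bool] = probe) -> List[Sample]:
    """Probe the (hypothetical) route URL every `interval` seconds for `duration` seconds."""
    samples: List[Sample] = []
    deadline = time.monotonic() + duration
    while time.monotonic() < deadline:
        samples.append((time.monotonic(), check(url)))
        time.sleep(interval)
    return samples


def outage_windows(samples: List[Sample]) -> List[Tuple[float, float]]:
    """Collapse samples into (start, end) windows of consecutive failures.

    A window starts at the first failed sample and ends at the next
    successful one, or at the last failed sample if no success follows.
    """
    windows: List[Tuple[float, float]] = []
    start = None
    last_fail = None
    for ts, ok in samples:
        if not ok:
            if start is None:
                start = ts
            last_fail = ts
        elif start is not None:
            windows.append((start, ts))
            start = None
    if start is not None:
        windows.append((start, last_fail))
    return windows
```

Running `outage_windows(poll(route_url, duration=3600))` alongside the upgrade gives the start and end of each observed disruption, which can then be lined up against node reboots or load balancer events.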
Impact of the problem:
GCP upgrade jobs are failing.
Last time it fixed itself after a few days, but starting on 8/22, we're seeing it again on both 4.8 and 4.9 gcp upgrades:
Given that this has happened twice now, that both times we found no commits that could plausibly be the culprit, that it only affects GCP, and that it is happening on two releases at exactly the same time, I think it's unlikely to be related to code changes on our side.
Is it possible to get logging from the LBs on GCP, or to escalate this to their support?
The issues appear to be for more than just gcp this time: https://sippy.ci.openshift.org/sippy-ng/tests/4.9/analysis?test=[sig-api-machinery]%20Kubernetes%20APIs%20remain%20available%20with%20reused%20connections
Hi Stephen -- I've picked up this BZ. Would you mind confirming if this issue is still occurring? If it is, are the links you posted previously still the right place to check to get the overall picture of the failures/flakes?
Hi Chad! I think we can close this -- ultimately we determined the cause was the CI cluster in use.
The PR for this bug is merged to 4.10.0-0.ci-2021-12-18-095230.
From this CI job, it is evident the issue is no longer recurring.
So marking the bug as verified.
Hi, if there is anything that customers should know about this bug or if there are any important workarounds that should be outlined in the bug fixes section of the OpenShift Container Platform 4.10 release notes, please update the Doc Type and Doc Text fields. If not, can you please mark it as "no doc update"? Thanks!
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.
For information on the advisory (Moderate: OpenShift Container Platform 4.10.3 security update), and where to find the updated files, follow the link below.
If the solution does not work for you, open a new bug report.