Description of problem:
Sometime around 7/12, GCP upgrades regressed and the ingress disruptive tests started failing. The 4.8 test grid [1] shows it pretty clearly; the 4.9 test grid [2] is a little noisier. We investigated code changes, and there were only two: an RHCOS bump that included a one-liner cri-o change [3], and an upgrade to stalld in the cluster-node-tuning-operator (CNTO) [4]. The stalld change was interesting in that it rebuilt the CNTO image and pulled in new kernel-tools RPMs, but we've tried pinning that, as well as reverting the cri-o change, and we still see failures. That leaves some change in GCP load balancing as a possible culprit.

[1] https://testgrid.k8s.io/redhat-openshift-ocp-release-4.8-informing#periodic-ci-openshift-release-master-ci-4.8-e2e-gcp-upgrade&include-filter-by-regex=available
[2] https://testgrid.k8s.io/redhat-openshift-ocp-release-4.9-informing#periodic-ci-openshift-release-master-ci-4.9-e2e-gcp-upgrade&include-filter-by-regex=available
[3] https://github.com/cri-o/cri-o/compare/30ca719956bba567a93876beaa24773428b2ddb4..8d2015333c7da5d697939bcab352c305e3a19851
[4] https://github.com/openshift/cluster-node-tuning-operator/pull/247

The stalld bump in [4] rebuilt the CNTO image, which brought in new kernel-tools/kernel-tools-libs:

< kernel-tools-4.18.0-305.3.1.el8_4.x86_64
< kernel-tools-libs-4.18.0-305.3.1.el8_4.x86_64
---
> kernel-tools-4.18.0-305.7.1.el8_4.x86_64
> kernel-tools-libs-4.18.0-305.7.1.el8_4.x86_64

OpenShift release version: 4.8 and 4.9 latest builds

Cluster Platform: GCP

How reproducible: Most of the time

Steps to Reproduce (in detail): Upgrade a cluster

Actual results: The ingress route for the frontend becomes briefly unavailable

Expected results: No disruption

Impact of the problem: GCP upgrade jobs are failing.

Additional info:
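For anyone unfamiliar with what "disruption" means here: these checks poll the route/API continuously during the upgrade over a kept-alive client and report any window in which requests fail. The real monitor lives in openshift/origin; the following is only a rough standalone sketch of that polling-with-reused-connections idea (the URL, interval, and duration are placeholders, not the real test values):

// disruption_probe.go - rough sketch of a "remains available with reused
// connections" style check: one shared http.Client (so TCP connections are
// kept alive and reused) polls an endpoint and reports any failure windows.
package main

import (
	"fmt"
	"net/http"
	"time"
)

func main() {
	// A single shared client means keep-alive connections get reused across polls.
	client := &http.Client{Timeout: 5 * time.Second}

	target := "https://example-route.apps.cluster.example.com/healthz" // placeholder
	interval := time.Second
	deadline := time.Now().Add(10 * time.Minute) // roughly an upgrade window

	var outageStart time.Time
	for time.Now().Before(deadline) {
		resp, err := client.Get(target)
		healthy := err == nil && resp.StatusCode < 500
		if resp != nil {
			resp.Body.Close()
		}

		switch {
		case !healthy && outageStart.IsZero():
			// Transition from healthy to unhealthy: start of a disruption window.
			outageStart = time.Now()
		case healthy && !outageStart.IsZero():
			// Recovered: report how long the endpoint was unavailable.
			fmt.Printf("disruption: %s unavailable for %v\n", target, time.Since(outageStart))
			outageStart = time.Time{}
		}
		time.Sleep(interval)
	}
}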
Last time it fixed itself after a few days, but starting on 8/22 we're seeing it again on both 4.8 and 4.9 GCP upgrades, e.g.:

https://sippy.ci.openshift.org/sippy-ng/tests/4.9/analysis?test=[sig-api-machinery]%20Kubernetes%20APIs%20remain%20available%20with%20reused%20connections
https://sippy.ci.openshift.org/sippy-ng/tests/4.8/analysis?test=[sig-api-machinery]%20Kubernetes%20APIs%20remain%20available%20with%20reused%20connections

Given that this has happened twice now, that both times we found no commits that are plausible culprits, that it's only GCP, and that it's hitting two releases at the exact same time, I think it's unlikely to be related to code changes on our side. Is it possible to get logging from the LBs on GCP, or to escalate this to their support?
The issue appears to affect more than just GCP this time: https://sippy.ci.openshift.org/sippy-ng/tests/4.9/analysis?test=[sig-api-machinery]%20Kubernetes%20APIs%20remain%20available%20with%20reused%20connections
Hi Stephen -- I've picked up this BZ. Would you mind confirming if this issue is still occurring? If it is, are the links you posted previously still the right place to check to get the overall picture of the failures/flakes?
Hi Chad! I think we can close this -- ultimately we determined the cause was the CI cluster in use.
The PR for this bug is merged into 4.10.0-0.ci-2021-12-18-095230. From the CI job history below, it is evident the issue is no longer recurring:

https://prow.ci.openshift.org/pr-history/?org=openshift&repo=origin&pr=26699

Marking the bug as verified.
Hi, if there is anything that customers should know about this bug, or if there are any important workarounds that should be outlined in the bug fixes section of the OpenShift Container Platform 4.10 release notes, please update the Doc Type and Doc Text fields. If not, can you please mark it as "no doc update"? Thanks!
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Container Platform 4.10.3 security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2022:0056