Bug 1983758

Summary: upgrades are failing on disruptive tests
Product: OpenShift Container Platform
Component: Networking (sub-component: router)
Reporter: Stephen Benjamin <stbenjam>
Assignee: Chad Scribner <cscribne>
QA Contact: Melvin Joseph <mjoseph>
Status: CLOSED ERRATA
Severity: medium
Priority: medium
CC: amcdermo, aos-bugs, bmcelvee, cholman, cscribne, ggiguash, hongli, mmasters, wking
Version: 4.9
Keywords: Reopened
Target Milestone: ---
Target Release: 4.10.0
Hardware: Unspecified
OS: Unspecified
Doc Type: No Doc Update
Story Points: ---
Type: Bug
Last Closed: 2022-03-12 04:36:01 UTC

Environment (affected tests and jobs):
[sig-network-edge] Cluster frontend ingress remain available
[sig-api-machinery] OpenShift APIs remain available for new connections
[sig-api-machinery] OAuth APIs remain available for new connections
[sig-api-machinery] OAuth APIs remain available with reused connections (84.783, -10.79, 95.573)
[sig-api-machinery] OpenShift APIs remain available with reused connections
[sig-api-machinery] Kubernetes APIs remain available with reused connections (84.647, -10.62, 95.267)
[sig-api-machinery] Kubernetes APIs remain available for new connections
job=periodic-ci-openshift-release-master-ci-4.9-e2e-gcp-upgrade=all
job=periodic-ci-openshift-release-master-ci-4.9-upgrade-from-stable-4.8-e2e-gcp-upgrade=all
job=periodic-ci-openshift-release-master-ci-4.9-upgrade-from-stable-4.8-e2e-gcp-ovn-upgrade=all

Description Stephen Benjamin 2021-07-19 17:08:10 UTC
Description of problem:
Sometime around 7/12, GCP upgrades regressed and the ingress disruptive tests started failing.

The 4.8 test grid[1] shows it pretty clearly; the 4.9 test grid[2] is a little noisier.

We investigated code changes. There were only two: an RHCOS bump that included a one-line cri-o change[3], and an upgrade to stalld in CNTO[4].

The stalld change was interesting in that it rebuilt the image and pulled in new kernel-tools RPMs, but we've tried pinning that and reverting the cri-o change, and we still see failures.

That leaves some change in GCP load balancing as a possible culprit.


[1] https://testgrid.k8s.io/redhat-openshift-ocp-release-4.8-informing#periodic-ci-openshift-release-master-ci-4.8-e2e-gcp-upgrade&include-filter-by-regex=available
[2] https://testgrid.k8s.io/redhat-openshift-ocp-release-4.9-informing#periodic-ci-openshift-release-master-ci-4.9-e2e-gcp-upgrade&include-filter-by-regex=available
[3] https://github.com/cri-o/cri-o/compare/30ca719956bba567a93876beaa24773428b2ddb4..8d2015333c7da5d697939bcab352c305e3a19851
[4] https://github.com/openshift/cluster-node-tuning-operator/pull/247


The stalld bump in https://github.com/openshift/cluster-node-tuning-operator/pull/247 rebuilt the CNTO image, and we got new kernel-tools/kernel-tools-libs packages:

< kernel-tools-4.18.0-305.3.1.el8_4.x86_64
< kernel-tools-libs-4.18.0-305.3.1.el8_4.x86_64
---
> kernel-tools-4.18.0-305.7.1.el8_4.x86_64
> kernel-tools-libs-4.18.0-305.7.1.el8_4.x86_64
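
For reference, the listing above is just a diff of the RPM package lists from the two CNTO image builds. A minimal sketch of that kind of comparison in Python (the file names are hypothetical, not the actual CI tooling; each file is assumed to hold one package name per line, e.g. as captured from `rpm -qa` in each image):

# Minimal sketch: diff two RPM package lists, one package per line.
# File names are hypothetical; this is not the actual CI tooling.
from pathlib import Path


def read_packages(path: str) -> set[str]:
    """Read a newline-separated package list into a set."""
    return {line.strip() for line in Path(path).read_text().splitlines() if line.strip()}


def diff_packages(old_path: str, new_path: str) -> None:
    """Print packages removed from / added to the image, diff-style."""
    old, new = read_packages(old_path), read_packages(new_path)
    for pkg in sorted(old - new):
        print(f"< {pkg}")  # only in the old image build
    print("---")
    for pkg in sorted(new - old):
        print(f"> {pkg}")  # only in the new image build


if __name__ == "__main__":
    # Hypothetical file names; capture the package lists from each build first.
    diff_packages("cnto-old-rpms.txt", "cnto-new-rpms.txt")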


OpenShift release version:
4.8 and 4.9 latest builds


Cluster Platform:
GCP

How reproducible:
Most of the time

Steps to Reproduce (in detail):
Upgrade a cluster

Actual results:
The frontend ingress route becomes briefly unavailable during the upgrade.

Expected results:
No disruption
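
For context, the disruption tests poll the route continuously during the upgrade and record any window in which requests fail. A rough sketch of that kind of poller, assuming a placeholder route URL and polling interval (this is not the openshift/origin test implementation):

# Rough sketch of an availability poller in the spirit of the disruption
# tests: hit an endpoint once per second for the duration of an upgrade and
# report any window where requests fail. URL and interval are placeholders;
# the real checks live in the openshift/origin test suite.
import time
import urllib.request


def poll(url: str, duration_s: int = 300, interval_s: float = 1.0) -> list[tuple[float, float]]:
    outages: list[tuple[float, float]] = []
    outage_start = None
    deadline = time.monotonic() + duration_s
    while time.monotonic() < deadline:
        now = time.monotonic()
        try:
            with urllib.request.urlopen(url, timeout=5):
                ok = True
        except OSError:  # URLError, timeouts, connection resets, etc.
            ok = False
        if not ok and outage_start is None:
            outage_start = now                   # disruption begins
        elif ok and outage_start is not None:
            outages.append((outage_start, now))  # disruption ends
            outage_start = None
        time.sleep(interval_s)
    if outage_start is not None:
        outages.append((outage_start, time.monotonic()))
    return outages


if __name__ == "__main__":
    # Placeholder URL; point this at the frontend ingress route under test.
    for start, end in poll("https://example.apps.cluster.example.com/healthz"):
        print(f"unavailable for {end - start:.1f}s")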


Impact of the problem:
GCP upgrade jobs are failing.

Additional info:

Comment 2 Stephen Benjamin 2021-08-25 15:12:14 UTC
Last time this fixed itself after a few days, but starting on 8/22 we're seeing it again on both 4.8 and 4.9 GCP upgrades:

e.g.:

https://sippy.ci.openshift.org/sippy-ng/tests/4.9/analysis?test=[sig-api-machinery]%20Kubernetes%20APIs%20remain%20available%20with%20reused%20connections

https://sippy.ci.openshift.org/sippy-ng/tests/4.8/analysis?test=[sig-api-machinery]%20Kubernetes%20APIs%20remain%20available%20with%20reused%20connections
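
(Those analysis links are just the test name URL-encoded into Sippy's per-release test analysis page. A small illustrative helper for constructing them, not part of Sippy itself:)

# Illustrative helper for building the Sippy test-analysis URLs quoted above.
from urllib.parse import quote


def sippy_analysis_url(release: str, test_name: str) -> str:
    # Keep the literal [sig-...] brackets, encode spaces and other characters.
    return (
        f"https://sippy.ci.openshift.org/sippy-ng/tests/{release}/analysis"
        f"?test={quote(test_name, safe='[]')}"
    )


if __name__ == "__main__":
    name = "[sig-api-machinery] Kubernetes APIs remain available with reused connections"
    for release in ("4.8", "4.9"):
        print(sippy_analysis_url(release, name))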

Given that this has happened twice now, that both times we found no commits that are plausible culprits, that it affects only GCP, and that it is happening on two releases at exactly the same time, I think it's unlikely to be related to code changes on our side.

Is it possible to get logging from the load balancers on GCP, or to escalate this to their support?

Comment 8 Chad Scribner 2021-11-18 18:04:08 UTC
Hi Stephen -- I've picked up this BZ. Would you mind confirming if this issue is still occurring? If it is, are the links you posted previously still the right place to check to get the overall picture of the failures/flakes?

Comment 9 Stephen Benjamin 2021-11-18 18:06:48 UTC
Hi Chad! I think we can close this -- ultimately we determined the cause was the CI cluster in use.

Comment 13 Melvin Joseph 2021-12-20 16:03:47 UTC
The PR for this bug was merged into 4.10.0-0.ci-2021-12-18-095230. The CI job history below shows that the issue is no longer recurring:
https://prow.ci.openshift.org/pr-history/?org=openshift&repo=origin&pr=26699

So I am marking the bug as verified.

Comment 16 Brandi Munilla 2022-02-10 20:28:27 UTC
Hi, if there is anything that customers should know about this bug, or if there are any important workarounds that should be outlined in the bug fixes section of the OpenShift Container Platform 4.10 release notes, please update the Doc Type and Doc Text fields. If not, can you please mark it as "No Doc Update"? Thanks!

Comment 18 errata-xmlrpc 2022-03-12 04:36:01 UTC
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.10.3 security update), and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:0056