Bug 1994655
| Summary: | openshift-apiserver should not set Available=False APIServicesAvailable on update | | |
|---|---|---|---|
| Product: | OpenShift Container Platform | Reporter: | Scott Dodson <sdodson> |
| Component: | Networking | Assignee: | jamo luhrsen <jluhrsen> |
| Networking sub component: | openshift-sdn | QA Contact: | zhaozhanqi <zzhao> |
| Status: | CLOSED CURRENTRELEASE | Docs Contact: | |
| Severity: | high | | |
| Priority: | low | CC: | aos-bugs, bleanhar, ffernand, kewang, lszaszki, mfojtik, sttts, wking, wlewis, xxia |
| Version: | 4.8 | Keywords: | Reopened, Upgrades |
| Target Milestone: | --- | | |
| Target Release: | 4.8.z | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | tag-ci LifecycleReset | | |
| Fixed In Version: | | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | 1948089 | Environment: | |
| Last Closed: | 2022-01-05 15:14:51 UTC | Type: | --- |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Bug Depends On: | 1948089 | | |
| Bug Blocks: | | | |
Description
Scott Dodson
2021-08-17 16:01:42 UTC
I'm adding UpcomingSprint, because I was occupied by fixing bugs with higher priority/severity, developing new features with higher priority, or developing new features to improve stability at a macro level. I will revisit this bug next sprint.

PRs are opened, waiting for review.

@lszaszki Failures are still frequent. Can you please confirm?

w3m -dump -cols 200 'https://search.ci.openshift.org/?maxAge=24h&name=^periodic.*upgrade&type=junit&search=clusteroperator/openshift-apiserver+should+not+change+condition/Available' | grep 'failures match' | sort | grep 4.8
periodic-ci-openshift-multiarch-master-nightly-4.8-upgrade-from-nightly-4.7-ocp-remote-libvirt-s390x (all) - 1 runs, 100% failed, 100% of failures match = 100% impact
periodic-ci-openshift-release-master-ci-4.8-e2e-gcp-upgrade (all) - 6 runs, 50% failed, 67% of failures match = 33% impact
periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-aws-upgrade (all) - 7 runs, 14% failed, 700% of failures match = 100% impact
periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-azure-ovn-upgrade (all) - 1 runs, 100% failed, 100% of failures match = 100% impact
periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-gcp-upgrade (all) - 6 runs, 50% failed, 167% of failures match = 83% impact
periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-ovirt-upgrade (all) - 4 runs, 50% failed, 150% of failures match = 75% impact
periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-vsphere-upgrade (all) - 2 runs, 100% failed, 100% of failures match = 100% impact
periodic-ci-openshift-release-master-ci-4.9-upgrade-from-stable-4.8-e2e-aws-ovn-upgrade (all) - 5 runs, 80% failed, 50% of failures match = 40% impact
periodic-ci-openshift-release-master-ci-4.9-upgrade-from-stable-4.8-e2e-aws-upgrade (all) - 5 runs, 40% failed, 50% of failures match = 20% impact
periodic-ci-openshift-release-master-ci-4.9-upgrade-from-stable-4.8-e2e-azure-upgrade (all) - 1 runs, 100% failed, 100% of failures match = 100% impact
periodic-ci-openshift-release-master-ci-4.9-upgrade-from-stable-4.8-e2e-openstack-upgrade (all) - 1 runs, 100% failed, 100% of failures match = 100% impact
periodic-ci-openshift-release-master-nightly-4.8-upgrade-from-stable-4.7-e2e-aws-upgrade (all) - 1 runs, 100% failed, 100% of failures match = 100% impact

Availability can be affected by many things and will never be perfect (100%). I used the following query to compare the results [1] vs [2]. I excluded updates from 4.7 because the fix has been applied to 4.8+. I also excluded OVN since the outage in that environment is usually longer, which might point at the underlying network provider.

[1] https://search.ci.openshift.org/?search=clusteroperator%2Fopenshift-apiserver.*status%2FFalse+reason%2FAPIServicesAvailable&maxAge=24h&context=1&type=junit&name=%5Eperiodic.*upgrade&excludeName=4.7%7Covn%7Csingle-node&maxMatches=5&maxBytes=20971520&groupBy=job
[2] https://search.ci.openshift.org/?search=clusteroperator%2Fopenshift-apiserver.*status%2FFalse+reason%2FAPIServicesAvailable&maxAge=336h&context=1&type=junit&name=%5Eperiodic.*upgrade&excludeName=4.7%7Covn%7Csingle-node&maxMatches=5&maxBytes=20971520&groupBy=job
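(For quick triage, the [1]/[2] comparison above can also be reduced to per-job counts of matching runs. This is only a sketch, assuming the /search JSON endpoint returns prow run URLs as its keys, as in the jq example further down in this bug:)

$ curl -s 'https://search.ci.openshift.org/search?search=clusteroperator%2Fopenshift-apiserver.*status%2FFalse+reason%2FAPIServicesAvailable&maxAge=24h&type=junit&name=%5Eperiodic.*upgrade&excludeName=4.7%7Covn%7Csingle-node' \
    | jq -r 'keys[]' \
    | sed 's|.*/logs/||; s|/[0-9]*$||' \
    | sort | uniq -c | sort -rn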
@lszaszki I used the query you shared above and failures are still frequent:

periodic-ci-openshift-release-master-ci-4.10-e2e-gcp-upgrade (all) - 82 runs, 51% failed, 5% of failures match = 2% impact

#1460265835862953984 junit 10 hours ago
Nov 15 16:44:02.854 E clusteroperator/openshift-apiserver condition/Available status/False reason/APIServices_Error changed: APIServicesAvailable: rpc error: code = Unknown desc = OK: HTTP status code 200; transport: missing content-type field
Nov 15 16:44:02.854 - 1s E clusteroperator/openshift-apiserver condition/Available status/False reason/APIServicesAvailable: rpc error: code = Unknown desc = OK: HTTP status code 200; transport: missing content-type field
Nov 15 16:45:11.000 - 10s E disruption/service-loadbalancer-with-pdb connection/reused disruption/service-loadbalancer-with-pdb connection/reused is not responding to GET requests over reused connections: missing error in the code

#1460265835862953984 junit 10 hours ago
Nov 15 16:44:02.854 - 1s E clusteroperator/openshift-apiserver condition/Available status/False reason/APIServicesAvailable: rpc error: code = Unknown desc = OK: HTTP status code 200; transport: missing content-type field
1 tests failed during this blip (2021-11-15 16:44:02.854351947 +0000 UTC to 2021-11-15 16:44:02.854351947 +0000 UTC): [sig-arch][Feature:ClusterUpgrade] Cluster should remain functional during upgrade [Disruptive] [Serial]

#1460108076450320384 junit 21 hours ago
Nov 15 05:32:41.654 E clusteroperator/openshift-apiserver condition/Available status/False reason/APIServices_Error changed: APIServicesAvailable: "security.openshift.io.v1" is not ready: an attempt failed with statusCode = 503, err = the server is currently unable to handle the request
Nov 15 05:32:41.654 - 2s E clusteroperator/openshift-apiserver condition/Available status/False reason/APIServicesAvailable: "security.openshift.io.v1" is not ready: an attempt failed with statusCode = 503, err = the server is currently unable to handle the request
Nov 15 05:37:35.015 E ns/openshift-machine-api pod/machine-api-operator-56d84db788-djsq7 node/ci-op-348q3xzi-a9175-m75l6-master-2 container/machine-api-operator reason/ContainerExit code/2 cause/Error

#1460108076450320384 junit 21 hours ago
Nov 15 05:32:41.654 - 2s E clusteroperator/openshift-apiserver condition/Available status/False reason/APIServicesAvailable: "security.openshift.io.v1" is not ready: an attempt failed with statusCode = 503, err = the server is currently unable to handle the request

I looked into the recent failures and they seem to correspond to updates of the sdn containers. In general, the failures are brief and seem to happen while the sdn is being installed/restarted. It looks like installing a new sdn container is not graceful and cuts off aggregated APIs (at least).
For example, in https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.10-upgrade-from-stable-4.9-e2e-azure-upgrade/1473107794574970880:

sdn-zstbb (master-1) was killed at 03:02:27 and we lost openshift-oauth-apiserver (apiserver-5b7547db44-4vccw) and openshift-apiserver (apiserver-84c48bd9df-bnqmh); both were running on the same node (master-1).
sdn-hp5f5 (master-2) was killed at 03:02:40 and we lost openshift-oauth-apiserver (apiserver-5b7547db44-pk484) and openshift-apiserver (apiserver-84c48bd9df-xjwvk); both were running on the same node (master-2).
It looks similar on master-0, and happened around the initialization of the openshift-sdn/sdn-bxfqj container.

The failures are quite common in CI: https://search.ci.openshift.org/?search=clusteroperator%2Fopenshift-apiserver.*status%2FFalse+reason%2FAPIServicesAvailable&maxAge=24h&context=1&type=junit&name=%5Eperiodic.*upgrade&excludeName=4.7%7Covn%7Csingle-node&maxMatches=5&maxBytes=20971520&groupBy=job

Based on the above I am moving this to the SDN team.

(In reply to Scott Dodson from comment #0)
> This clone is intended to track backporting of the library-go bumps for 4.8 cluster-kube-apiserver-operator.

@sdodson, looks like the backports have all merged.

@rgangwar, @wking, @lszaszki, this clone of an already closed bug fell on my plate recently. I didn't know much about it, but as I dug in I realized that all the "failures" we are getting from those search queries are coming from "flakes". The test case fails once, but it always passes on the 2nd try. These are not affecting our job pass rate. I checked 4.7->4.8, 4.8->4.9 and 4.9->4.10 in testgrid, and you can see that the test case is never marked as a failure, just a flake:

"openshift-tests.[bz-apiserver-auth] clusteroperator/authentication should not change condition/Available"

I checked both GCP and AWS. BTW, there are *lots* of test cases in these upgrade jobs that flake once and pass on the 2nd try. Do we want to close this bug, or is there something I'm missing that we want to dig deeper on to get fixed?
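(To eyeball the SDN correlation described above in a given run, here is a minimal sketch; it assumes the raw build-log.txt URL pattern and the monitor line formats (`ns/<namespace> pod/<name>`, `clusteroperator/... condition/Available`) quoted elsewhere in this bug, using the azure run from the example above:)

$ LOG=https://storage.googleapis.com/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.10-upgrade-from-stable-4.9-e2e-azure-upgrade/1473107794574970880/build-log.txt
$ curl -s "$LOG" | grep -E 'ns/openshift-sdn pod/sdn-|clusteroperator/(openshift-apiserver|authentication) condition/Available'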
FYI, here are the 6 testgrid links I referenced above:

https://testgrid.k8s.io/redhat-openshift-ocp-release-4.10-informing#periodic-ci-openshift-release-master-ci-4.10-upgrade-from-stable-4.9-e2e-aws-ovn-upgrade&show-stale-tests=
https://testgrid.k8s.io/redhat-openshift-ocp-release-4.9-informing#periodic-ci-openshift-release-master-ci-4.9-upgrade-from-stable-4.8-e2e-aws-ovn-upgrade&show-stale-tests=
https://testgrid.k8s.io/redhat-openshift-ocp-release-4.8-informing#periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-aws-ovn-upgrade&show-stale-tests=
https://testgrid.k8s.io/redhat-openshift-ocp-release-4.10-informing#periodic-ci-openshift-release-master-ci-4.10-upgrade-from-stable-4.9-e2e-gcp-ovn-upgrade&show-stale-tests=
https://testgrid.k8s.io/redhat-openshift-ocp-release-4.9-informing#periodic-ci-openshift-release-master-ci-4.9-upgrade-from-stable-4.8-e2e-gcp-ovn-upgrade&show-stale-tests=
https://testgrid.k8s.io/redhat-openshift-ocp-release-4.8-informing#periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-gcp-ovn-upgrade&show-stale-tests=

Using the queries based on the 4.9.0 verification [1], but rolled back to look at 4.8 -> 4.9:

$ w3m -dump -cols 200 'https://search.ci.openshift.org/?search=clusteroperator%2Fopenshift-apiserver+condition%2FAvailable+status%2FFalse+reason%2FAPIServicesAvailable&maxAge=24h&type=junit&name=%5Eperiodic.*4.8.*upgrade&excludeName=4%5C.7' | grep 'failures match' | sort
periodic-ci-openshift-release-master-ci-4.8-e2e-aws-upgrade-single-node (all) - 1 runs, 100% failed, 100% of failures match = 100% impact
periodic-ci-openshift-release-master-ci-4.9-upgrade-from-stable-4.8-e2e-aws-ovn-upgrade (all) - 5 runs, 80% failed, 125% of failures match = 100% impact
periodic-ci-openshift-release-master-ci-4.9-upgrade-from-stable-4.8-e2e-gcp-ovn-upgrade (all) - 1 runs, 100% failed, 100% of failures match = 100% impact
periodic-ci-openshift-release-master-ci-4.9-upgrade-from-stable-4.8-e2e-ovirt-upgrade (all) - 4 runs, 75% failed, 33% of failures match = 25% impact
periodic-ci-openshift-release-master-okd-4.9-upgrade-from-okd-4.8-e2e-upgrade-gcp (all) - 1 runs, 100% failed, 100% of failures match = 100% impact

Those runs:

$ curl -s 'https://search.ci.openshift.org/search?search=clusteroperator%2Fopenshift-apiserver+condition%2FAvailable+status%2FFalse+reason%2FAPIServicesAvailable&maxAge=24h&type=junit&name=%5Eperiodic.*4.8.*upgrade&excludeName=4%5C.7' | jq -r 'keys[]'
https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.8-e2e-aws-upgrade-single-node/1478496947181457408
https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.9-upgrade-from-stable-4.8-e2e-aws-ovn-upgrade/1478239379674632192
https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.9-upgrade-from-stable-4.8-e2e-aws-ovn-upgrade/1478309813623459840
https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.9-upgrade-from-stable-4.8-e2e-aws-ovn-upgrade/1478413114000019456
https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.9-upgrade-from-stable-4.8-e2e-aws-ovn-upgrade/1478420659305451520
https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.9-upgrade-from-stable-4.8-e2e-aws-ovn-upgrade/1478508721863659520
https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.9-upgrade-from-stable-4.8-e2e-gcp-ovn-upgrade/1478442060850663424
https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.9-upgrade-from-stable-4.8-e2e-ovirt-upgrade/1478219006035890176
https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-okd-4.9-upgrade-from-okd-4.8-e2e-upgrade-gcp/1478432819796512768

I dunno how important single-node is. Let's skip it and look at [2]:

[bz-openshift-apiserver] clusteroperator/openshift-apiserver should not change condition/Available
Run #0: Failed 1h43m50s
1 unexpected clusteroperator state transitions during e2e test run
Jan 05 01:46:38.189 - 155s E clusteroperator/openshift-apiserver condition/Available status/False reason/APIServicesAvailable: "apps.openshift.io.v1" is not ready: an attempt failed with statusCode = 503, err = the server is currently unable to handle the request\nAPIServicesAvailable: "authorization.openshift.io.v1" is not ready: an attempt failed with statusCode = 503, err = the server is currently unable to handle the request\nAPIServicesAvailable: "build.openshift.io.v1" is not ready: an attempt failed with statusCode = 503, err = the server is currently unable to handle the request\nAPIServicesAvailable: "image.openshift.io.v1" is not ready: an attempt failed with statusCode = 503, err = the server is currently unable to handle the request\nAPIServicesAvailable: "project.openshift.io.v1" is not ready: an attempt failed with statusCode = 503, err = the server is currently unable to handle the request\nAPIServicesAvailable: "quota.openshift.io.v1" is not ready: an attempt failed with statusCode = 503, err = the server is currently unable to handle the request\nAPIServicesAvailable: "route.openshift.io.v1" is not ready: an attempt failed with statusCode = 503, err = the server is currently unable to handle the request\nAPIServicesAvailable: "security.openshift.io.v1" is not ready: an attempt failed with statusCode = 503, err = the server is currently unable to handle the request\nAPIServicesAvailable: "template.openshift.io.v1" is not ready: an attempt failed with statusCode = 503, err = the server is currently unable to handle the request

That certainly feels similar to the issue that caused me to initially open the 4.9.0 bug [3]. But I haven't internalized the distinctions in [4]; perhaps this 503 is one of the issues that got punted to some other bug series? And yes, the test-case is never fatal [5], which I guess is grounds to say we don't care all that much about backporting fixes, although ideally we're driving out alarmist noise like this.
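(For reference, the operator condition these tests watch can also be inspected directly on a live cluster; a minimal sketch using standard `oc` jsonpath, not specific to any of the runs above:)

$ oc get clusteroperator openshift-apiserver \
    -o jsonpath='{range .status.conditions[?(@.type=="Available")]}{.status}{"\t"}{.reason}{"\t"}{.message}{"\n"}{end}'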
[2] actually included some fatal test-cases, including:

disruption_tests: [sig-api-machinery] OpenShift APIs remain available with reused connections 1h40m53s
Jan 4 08:04:46.751: API "openshift-api-available-reused-connections" was unreachable during disruption (AWS has a known issue: https://bugzilla.redhat.com/show_bug.cgi?id=1943804) for at least 30s of 1h40m51s (1%):
Jan 04 07:57:10.654 E openshift-apiserver-reused-connection openshift-apiserver-reused-connection started failing: Get "https://api.ci-op-byyjrxly-978ed.aws-2.ci.openshift.org:6443/apis/image.openshift.io/v1/namespaces/default/imagestreams": read tcp 10.129.9.1:56734->52.9.155.127:6443: read: connection reset by peer
Jan 04 07:57:10.654 - 30s E openshift-apiserver-reused-connection openshift-apiserver-reused-connection is not responding to GET requests
Jan 04 07:57:41.260 I openshift-apiserver-reused-connection openshift-apiserver-reused-connection started responding to GET requests

although that 7:57 business diverges from the 7:38 Available=False block:

$ curl -s https://storage.googleapis.com/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.9-upgrade-from-stable-4.8-e2e-aws-ovn-upgrade/1478239379674632192/build-log.txt | grep 'clusteroperator/openshift-apiserver condition/Available'
Jan 04 07:38:00.674 E clusteroperator/openshift-apiserver condition/Available status/False reason/APIServices_Error changed: APIServicesAvailable: "authorization.openshift.io.v1" is not ready: an attempt failed with statusCode = 503, err = the server is currently unable to handle the request\nAPIServicesAvailable: "build.openshift.io.v1" is not ready: an attempt failed with statusCode = 503, err = the server is currently unable to handle the request\nAPIServicesAvailable: "image.openshift.io.v1" is not ready: an attempt failed with statusCode = 503, err = the server is currently unable to handle the request\nAPIServicesAvailable: "project.openshift.io.v1" is not ready: an attempt failed with statusCode = 503, err = the server is currently unable to handle the request\nAPIServicesAvailable: "quota.openshift.io.v1" is not ready: an attempt failed with statusCode = 503, err = the server is currently unable to handle the request\nAPIServicesAvailable: "route.openshift.io.v1" is not ready: an attempt failed with statusCode = 503, err = the server is currently unable to handle the request\nAPIServicesAvailable: "security.openshift.io.v1" is not ready: an attempt failed with statusCode = 503, err = the server is currently unable to handle the request\nAPIServicesAvailable: "template.openshift.io.v1" is not ready: an attempt failed with statusCode = 503, err = the server is currently unable to handle the request
Jan 04 07:38:00.674 - 150s E clusteroperator/openshift-apiserver condition/Available status/False reason/APIServicesAvailable: "authorization.openshift.io.v1" is not ready: an attempt failed with statusCode = 503, err = the server is currently unable to handle the request\nAPIServicesAvailable: "build.openshift.io.v1" is not ready: an attempt failed with statusCode = 503, err = the server is currently unable to handle the request\nAPIServicesAvailable: "image.openshift.io.v1" is not ready: an attempt failed with statusCode = 503, err = the server is currently unable to handle the request\nAPIServicesAvailable: "project.openshift.io.v1" is not ready: an attempt failed with statusCode = 503, err = the server is currently unable to handle the request\nAPIServicesAvailable: "quota.openshift.io.v1" is not ready: an attempt failed with statusCode = 503, err = the server is currently unable to handle the request\nAPIServicesAvailable: "route.openshift.io.v1" is not ready: an attempt failed with statusCode = 503, err = the server is currently unable to handle the request\nAPIServicesAvailable: "security.openshift.io.v1" is not ready: an attempt failed with statusCode = 503, err = the server is currently unable to handle the request\nAPIServicesAvailable: "template.openshift.io.v1" is not ready: an attempt failed with statusCode = 503, err = the server is currently unable to handle the request
Jan 04 07:40:31.187 W clusteroperator/openshift-apiserver condition/Available status/True reason/AsExpected changed: All is well

Anyhow, I'm agnostic on backports here, so feel free to WONTFIX or CURRENTRELEASE or MODIFIED or whatever, as you see fit.

[1]: https://bugzilla.redhat.com/show_bug.cgi?id=1948089#c31
[2]: https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.9-upgrade-from-stable-4.8-e2e-aws-ovn-upgrade/1478239379674632192
[3]: https://bugzilla.redhat.com/show_bug.cgi?id=1948089#c0
[4]: https://bugzilla.redhat.com/show_bug.cgi?id=1948089#c20
[5]: https://github.com/openshift/origin/blame/73f3c46763dc2afe16400f5e1cc18f7d2f399a59/pkg/synthetictests/operators.go#L67-L68

The changes which were made to 4.9 have been backported successfully to 4.8, which was my request. Given that there are other contributing factors which lead to these tests flaking, but not failing, as Jamo mentioned, I will mark this as CLOSED CURRENTRELEASE. That should not, however, stop us from pursuing additional fixes which reduce the flake rate of this job.