From CI runs like [1]:

: [bz-openshift-apiserver] clusteroperator/openshift-apiserver should not change condition/Available

Run #0: Failed  0s
6 unexpected clusteroperator state transitions during e2e test run

Apr 09 13:17:46.730 - 1s   E clusteroperator/openshift-apiserver condition/Available status/False reason/APIServicesAvailable: apiservices.apiregistration.k8s.io/v1.apps.openshift.io: not available: failing or missing response from https://10.128.0.73:8443/apis/apps.openshift.io/v1: Get "https://10.128.0.73:8443/apis/apps.openshift.io/v1": context deadline exceeded (Client.Timeout exceeded while awaiting headers)
Apr 09 13:18:01.727 - 1s   E clusteroperator/openshift-apiserver condition/Available status/False reason/APIServicesAvailable: apiservices.apiregistration.k8s.io/v1.build.openshift.io: not available: failing or missing response from https://10.128.0.73:8443/apis/build.openshift.io/v1: Get "https://10.128.0.73:8443/apis/build.openshift.io/v1": context deadline exceeded
Apr 09 13:18:16.874 - 1s   E clusteroperator/openshift-apiserver condition/Available status/False reason/APIServicesAvailable: apiservices.apiregistration.k8s.io/v1.apps.openshift.io: not available: failing or missing response from https://10.129.0.50:8443/apis/apps.openshift.io/v1: Get "https://10.129.0.50:8443/apis/apps.openshift.io/v1": context deadline exceeded
APIServicesAvailable: apiservices.apiregistration.k8s.io/v1.project.openshift.io: not available: failing or missing response from https://10.128.0.73:8443/apis/project.openshift.io/v1: Get "https://10.128.0.73:8443/apis/project.openshift.io/v1": dial tcp 10.128.0.73:8443: i/o timeout
Apr 09 13:25:10.247 - 25s  E clusteroperator/openshift-apiserver condition/Available status/False reason/APIServicesAvailable: "apps.openshift.io.v1" is not ready: 503 (the server is currently unable to handle the request)
APIServicesAvailable: "authorization.openshift.io.v1" is not ready: 503 (the server is currently unable to handle the request)
APIServicesAvailable: "build.openshift.io.v1" is not ready: 503 (the server is currently unable to handle the request)
APIServicesAvailable: "image.openshift.io.v1" is not ready: 503 (the server is currently unable to handle the request)
APIServicesAvailable: "project.openshift.io.v1" is not ready: 503 (the server is currently unable to handle the request)
APIServicesAvailable: "quota.openshift.io.v1" is not ready: 503 (the server is currently unable to handle the request)
APIServicesAvailable: "route.openshift.io.v1" is not ready: 503 (the server is currently unable to handle the request)
APIServicesAvailable: "security.openshift.io.v1" is not ready: 503 (the server is currently unable to handle the request)
APIServicesAvailable: "template.openshift.io.v1" is not ready: 503 (the server is currently unable to handle the request)
Apr 09 13:31:25.504 - 1s   E clusteroperator/openshift-apiserver condition/Available status/False reason/APIServicesAvailable: apiservices.apiregistration.k8s.io/v1.apps.openshift.io: not available: failing or missing response from https://10.128.0.10:8443/apis/apps.openshift.io/v1: Get "https://10.128.0.10:8443/apis/apps.openshift.io/v1": context deadline exceeded
Apr 09 13:31:45.328 - 1s   E clusteroperator/openshift-apiserver condition/Available status/False reason/APIServicesAvailable: apiservices.apiregistration.k8s.io/v1.build.openshift.io: not available: failing or missing response from https://10.130.0.14:8443/apis/build.openshift.io/v1: Get "https://10.130.0.14:8443/apis/build.openshift.io/v1": net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)

Very popular:

$ w3m -dump -cols 200 'https://search.ci.openshift.org/?maxAge=24h&name=^periodic.*upgrade&type=junit&search=clusteroperator/openshift-apiserver+should+not+change+condition/Available' | grep 'failures match' | sort
periodic-ci-openshift-release-master-ci-4.8-e2e-aws-upgrade (all) - 1 runs, 100% failed, 100% of failures match = 100% impact
periodic-ci-openshift-release-master-ci-4.8-e2e-azure-upgrade (all) - 1 runs, 100% failed, 100% of failures match = 100% impact
periodic-ci-openshift-release-master-ci-4.8-e2e-gcp-upgrade (all) - 17 runs, 100% failed, 88% of failures match = 88% impact
periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-aws-ovn-upgrade (all) - 19 runs, 100% failed, 95% of failures match = 95% impact
periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-aws-upgrade (all) - 17 runs, 100% failed, 100% of failures match = 100% impact
periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-azure-ovn-upgrade (all) - 4 runs, 100% failed, 75% of failures match = 75% impact
periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-azure-upgrade (all) - 1 runs, 100% failed, 100% of failures match = 100% impact
periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-gcp-ovn-upgrade (all) - 4 runs, 100% failed, 100% of failures match = 100% impact
periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-gcp-upgrade (all) - 1 runs, 100% failed, 100% of failures match = 100% impact
periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-ovirt-upgrade (all) - 2 runs, 100% failed, 50% of failures match = 50% impact
periodic-ci-openshift-release-master-ci-4.9-e2e-gcp-upgrade (all) - 21 runs, 100% failed, 81% of failures match = 81% impact
periodic-ci-openshift-release-master-nightly-4.8-e2e-aws-upgrade (all) - 10 runs, 80% failed, 50% of failures match = 40% impact
periodic-ci-openshift-release-master-nightly-4.8-e2e-metal-ipi-upgrade (all) - 10 runs, 50% failed, 60% of failures match = 30% impact
periodic-ci-openshift-release-master-nightly-4.8-upgrade-from-stable-4.7-e2e-aws-upgrade (all) - 10 runs, 100% failed, 90% of failures match = 90% impact
periodic-ci-openshift-release-master-nightly-4.8-upgrade-from-stable-4.7-e2e-metal-ipi-upgrade (all) - 10 runs, 100% failed, 90% of failures match = 90% impact

Possibly a dup of some non-update bug, but if so, please mention the test-case in that bug for Sippy ;).

[1]: https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/openshift_installer/4831/pull-ci-openshift-installer-master-e2e-aws-upgrade/1380486185595441152
*** This bug has been marked as a duplicate of bug 1943442 ***
This update issue is possibly a dup of the bug 1926867 series?
This bug hasn't had any activity in the last 30 days. Maybe the problem got resolved, was a duplicate of something else, or became less pressing for some reason - or maybe it's still relevant but just hasn't been looked at yet. As such, we're marking this bug as "LifecycleStale" and decreasing the severity/priority. If you have further information on the current state of the bug, please update it, otherwise this bug can be closed in about 7 days. The information can be, for example, that the problem still occurs, that you still want the feature, that more information is needed, or that the bug is (for whatever reason) no longer relevant. Additionally, you can add LifecycleFrozen into Keywords if you think this bug should never be marked as stale. Please consult with bug assignee before you do that.
Still popular:

$ w3m -dump -cols 200 'https://search.ci.openshift.org/?maxAge=24h&name=^periodic.*upgrade&type=junit&search=clusteroperator/openshift-apiserver+should+not+change+condition/Available' | grep 'failures match' | sort
periodic-ci-openshift-release-master-ci-4.8-e2e-aws-upgrade-single-node (all) - 1 runs, 100% failed, 100% of failures match = 100% impact
periodic-ci-openshift-release-master-ci-4.8-e2e-gcp-upgrade (all) - 18 runs, 78% failed, 107% of failures match = 83% impact
periodic-ci-openshift-release-master-ci-4.8-upgrade-from-from-stable-4.7-from-stable-4.6-e2e-aws-upgrade (all) - 1 runs, 100% failed, 100% of failures match = 100% impact
periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-aws-ovn-upgrade (all) - 19 runs, 100% failed, 84% of failures match = 84% impact
periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-aws-upgrade (all) - 18 runs, 100% failed, 89% of failures match = 89% impact
periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-azure-ovn-upgrade (all) - 4 runs, 100% failed, 50% of failures match = 50% impact
periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-azure-upgrade (all) - 1 runs, 100% failed, 100% of failures match = 100% impact
periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-gcp-ovn-upgrade (all) - 4 runs, 100% failed, 100% of failures match = 100% impact
periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-gcp-upgrade (all) - 1 runs, 100% failed, 100% of failures match = 100% impact
periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-ovirt-upgrade (all) - 3 runs, 100% failed, 67% of failures match = 67% impact
periodic-ci-openshift-release-master-ci-4.9-e2e-gcp-upgrade (all) - 20 runs, 100% failed, 100% of failures match = 100% impact
periodic-ci-openshift-release-master-nightly-4.8-e2e-aws-upgrade (all) - 8 runs, 100% failed, 13% of failures match = 13% impact
periodic-ci-openshift-release-master-nightly-4.8-e2e-metal-ipi-upgrade (all) - 7 runs, 86% failed, 50% of failures match = 43% impact
periodic-ci-openshift-release-master-nightly-4.8-upgrade-from-stable-4.7-e2e-aws-upgrade (all) - 8 runs, 100% failed, 63% of failures match = 63% impact
periodic-ci-openshift-release-master-nightly-4.8-upgrade-from-stable-4.7-e2e-metal-ipi-upgrade (all) - 8 runs, 100% failed, 100% of failures match = 100% impact
The LifecycleStale keyword was removed because the bug got commented on recently. The bug assignee was notified.
I’m adding UpcomingSprint, because I was occupied by fixing bugs with higher priority/severity, developing new features with higher priority, or developing new features to improve stability at a macro level. I will revisit this bug next sprint.
Still popular:

$ w3m -dump -cols 200 'https://search.ci.openshift.org/?maxAge=24h&name=^periodic.*upgrade&type=junit&search=clusteroperator/openshift-apiserver+should+not+change+condition/Available' | grep 'failures match' | sort
periodic-ci-openshift-release-master-ci-4.8-e2e-aws-upgrade-single-node (all) - 1 runs, 100% failed, 100% of failures match = 100% impact
periodic-ci-openshift-release-master-ci-4.8-e2e-azure-upgrade-single-node (all) - 1 runs, 100% failed, 100% of failures match = 100% impact
periodic-ci-openshift-release-master-ci-4.8-e2e-gcp-upgrade (all) - 4 runs, 75% failed, 100% of failures match = 75% impact
periodic-ci-openshift-release-master-ci-4.8-upgrade-from-from-stable-4.7-from-stable-4.6-e2e-aws-upgrade (all) - 1 runs, 100% failed, 100% of failures match = 100% impact
periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-aws-compact-upgrade (all) - 1 runs, 100% failed, 100% of failures match = 100% impact
periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-aws-ovn-upgrade (all) - 9 runs, 100% failed, 67% of failures match = 67% impact
periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-aws-upgrade (all) - 4 runs, 75% failed, 100% of failures match = 75% impact
periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-azure-ovn-upgrade (all) - 4 runs, 100% failed, 75% of failures match = 75% impact
periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-gcp-ovn-upgrade (all) - 4 runs, 100% failed, 100% of failures match = 100% impact
periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-ovirt-upgrade (all) - 2 runs, 50% failed, 200% of failures match = 100% impact
periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-vsphere-upgrade (all) - 2 runs, 50% failed, 100% of failures match = 50% impact
periodic-ci-openshift-release-master-ci-4.9-e2e-aws-upgrade-single-node (all) - 1 runs, 100% failed, 100% of failures match = 100% impact
periodic-ci-openshift-release-master-ci-4.9-e2e-azure-upgrade-single-node (all) - 1 runs, 100% failed, 100% of failures match = 100% impact
periodic-ci-openshift-release-master-ci-4.9-e2e-gcp-upgrade (all) - 10 runs, 80% failed, 88% of failures match = 70% impact
periodic-ci-openshift-release-master-ci-4.9-upgrade-from-stable-4.8-e2e-aws-ovn-upgrade (all) - 14 runs, 93% failed, 23% of failures match = 21% impact
periodic-ci-openshift-release-master-ci-4.9-upgrade-from-stable-4.8-e2e-aws-ovn-upgrade-rollback (all) - 1 runs, 100% failed, 100% of failures match = 100% impact
periodic-ci-openshift-release-master-ci-4.9-upgrade-from-stable-4.8-e2e-aws-upgrade (all) - 10 runs, 80% failed, 25% of failures match = 20% impact
periodic-ci-openshift-release-master-ci-4.9-upgrade-from-stable-4.8-e2e-azure-ovn-upgrade (all) - 4 runs, 100% failed, 25% of failures match = 25% impact
periodic-ci-openshift-release-master-ci-4.9-upgrade-from-stable-4.8-e2e-gcp-ovn-upgrade (all) - 4 runs, 100% failed, 75% of failures match = 75% impact
periodic-ci-openshift-release-master-ci-4.9-upgrade-from-stable-4.8-e2e-ovirt-upgrade (all) - 3 runs, 100% failed, 33% of failures match = 33% impact
periodic-ci-openshift-release-master-nightly-4.8-e2e-aws-upgrade (all) - 2 runs, 100% failed, 50% of failures match = 50% impact
periodic-ci-openshift-release-master-nightly-4.8-e2e-metal-ipi-upgrade (all) - 1 runs, 100% failed, 100% of failures match = 100% impact
periodic-ci-openshift-release-master-nightly-4.8-upgrade-from-stable-4.7-e2e-aws-upgrade (all) - 2 runs, 100% failed, 50% of failures match = 50% impact
periodic-ci-openshift-release-master-nightly-4.8-upgrade-from-stable-4.7-e2e-metal-ipi-upgrade (all) - 1 runs, 100% failed, 100% of failures match = 100% impact
periodic-ci-openshift-release-master-nightly-4.9-e2e-metal-ipi-upgrade (all) - 2 runs, 50% failed, 200% of failures match = 100% impact
periodic-ci-openshift-release-master-nightly-4.9-upgrade-from-stable-4.8-e2e-metal-ipi-upgrade (all) - 2 runs, 100% failed, 100% of failures match = 100% impact
This really should've been a 4.8.0 blocker, but that intent was never conveyed to the assignees. I'm marking this as a blocker for 4.9.0 and would request that we backport the fix to 4.8 as soon as reasonable. We really need to get rid of the negative signal we generate during upgrades by operators going degraded during normal operations.
Updating fields based on comment 9 and to clear the blocker? list.
The PR merged long ago, and we have done bumps in cluster-kube-apiserver-operator since then.
Still popular in CI, including for 4.9 jobs:

$ w3m -dump -cols 200 'https://search.ci.openshift.org/?maxAge=24h&name=^periodic.*upgrade&type=junit&search=clusteroperator/openshift-apiserver+should+not+change+condition/Available' | grep 'failures match' | sort
periodic-ci-openshift-multiarch-master-nightly-4.8-upgrade-from-nightly-4.7-ocp-remote-libvirt-s390x (all) - 2 runs, 100% failed, 100% of failures match = 100% impact
periodic-ci-openshift-release-master-ci-4.8-e2e-azure-upgrade-single-node (all) - 1 runs, 100% failed, 100% of failures match = 100% impact
periodic-ci-openshift-release-master-ci-4.8-e2e-gcp-upgrade (all) - 7 runs, 71% failed, 40% of failures match = 29% impact
periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-aws-upgrade (all) - 7 runs, 29% failed, 350% of failures match = 100% impact
periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-azure-ovn-upgrade (all) - 1 runs, 100% failed, 100% of failures match = 100% impact
periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-gcp-upgrade (all) - 7 runs, 86% failed, 83% of failures match = 71% impact
periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-ovirt-upgrade (all) - 1 runs, 100% failed, 100% of failures match = 100% impact
periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-vsphere-upgrade (all) - 2 runs, 100% failed, 100% of failures match = 100% impact
periodic-ci-openshift-release-master-ci-4.9-e2e-aws-upgrade (all) - 1 runs, 100% failed, 100% of failures match = 100% impact
periodic-ci-openshift-release-master-ci-4.9-e2e-aws-upgrade-single-node (all) - 1 runs, 100% failed, 100% of failures match = 100% impact
periodic-ci-openshift-release-master-ci-4.9-e2e-gcp-upgrade (all) - 59 runs, 90% failed, 34% of failures match = 31% impact
periodic-ci-openshift-release-master-ci-4.9-upgrade-from-stable-4.8-e2e-aws-ovn-upgrade (all) - 13 runs, 100% failed, 31% of failures match = 31% impact
periodic-ci-openshift-release-master-ci-4.9-upgrade-from-stable-4.8-e2e-aws-upgrade (all) - 13 runs, 85% failed, 45% of failures match = 38% impact
periodic-ci-openshift-release-master-nightly-4.8-e2e-metal-ipi-upgrade (all) - 3 runs, 100% failed, 33% of failures match = 33% impact
periodic-ci-openshift-release-master-nightly-4.8-upgrade-from-stable-4.7-e2e-aws-upgrade (all) - 3 runs, 33% failed, 300% of failures match = 100% impact
periodic-ci-openshift-release-master-nightly-4.8-upgrade-from-stable-4.7-e2e-metal-ipi-upgrade (all) - 3 runs, 100% failed, 100% of failures match = 100% impact
periodic-ci-openshift-release-master-nightly-4.9-e2e-aws-upgrade (all) - 3 runs, 67% failed, 50% of failures match = 33% impact
periodic-ci-openshift-release-master-nightly-4.9-upgrade-from-stable-4.8-e2e-aws-upgrade (all) - 3 runs, 100% failed, 67% of failures match = 67% impact

Drilling into the periodic-ci-openshift-release-master-nightly-4.9-e2e-aws-upgrade job:

$ curl -s 'https://search.ci.openshift.org/search?maxAge=24h&name=^periodic.*upgrade&type=junit&search=clusteroperator/openshift-apiserver+should+not+change+condition/Available' | jq -r 'keys[]' | grep nightly-4.9-e2e-aws-upgrade/
https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.9-e2e-aws-upgrade/1428038724414869504

which has:

: [bz-openshift-apiserver] clusteroperator/openshift-apiserver should not change condition/Available

Run #0: Failed  48m18s
1 unexpected clusteroperator state transitions during e2e test run

Aug 18 18:16:03.612 - 793ms E clusteroperator/openshift-apiserver condition/Available status/False reason/APIServicesAvailable: [Get "https://172.30.0.1:443/apis/apiregistration.k8s.io/v1/apiservices/v1.route.openshift.io": context canceled, context canceled]

From the e2e-interval chart, that's happening as the first control-plane machine to reboot is coming back up, and the second one is about to start draining. I dunno if the "context canceled" is sufficiently different from this bug's original 503 to be worth a separate bug or not.
Statistics on the messages:

$ w3m -dump -cols 200 'https://search.ci.openshift.org/?maxAge=24h&name=^periodic.*upgrade&type=junit&search=clusteroperator/openshift-apiserver+condition/Available+status/False+reason/APIServicesAvailable.*is not ready: 503' | grep 'failures match'
...no hits...

So the original "is not ready: 503" message no longer shows up. The "context canceled" issue is unique to 4.9, which supports the new message being the same underlying issue, but with the messaging altered by the library-go pivot (or some other 4.9 change):

$ w3m -dump -cols 200 'https://search.ci.openshift.org/?maxAge=24h&name=^periodic.*upgrade&type=junit&search=clusteroperator/openshift-apiserver+condition/Available+status/False+reason/APIServicesAvailable.*Get.*context+canceled' | grep 'failures match' | sort
periodic-ci-openshift-release-master-ci-4.9-e2e-gcp-upgrade (all) - 55 runs, 89% failed, 8% of failures match = 7% impact
periodic-ci-openshift-release-master-ci-4.9-upgrade-from-stable-4.8-e2e-aws-ovn-upgrade (all) - 13 runs, 100% failed, 8% of failures match = 8% impact
periodic-ci-openshift-release-master-ci-4.9-upgrade-from-stable-4.8-e2e-aws-upgrade (all) - 13 runs, 85% failed, 18% of failures match = 15% impact
periodic-ci-openshift-release-master-nightly-4.9-e2e-aws-upgrade (all) - 3 runs, 67% failed, 50% of failures match = 33% impact
periodic-ci-openshift-release-master-nightly-4.9-upgrade-from-stable-4.8-e2e-aws-upgrade (all) - 3 runs, 100% failed, 67% of failures match = 67% impact

Poking around in job names involving 4.9:

$ curl -s 'https://search.ci.openshift.org/search?maxAge=24h&name=^periodic.*4.9.*upgrade&type=junit&context=0&search=clusteroperator/openshift-apiserver+condition/Available+status/False+reason/APIServicesAvailable' | jq -r 'to_entries[].value | to_entries[].value[].context[]' | sed 's/.*APIServicesAvailable: //' | sort | uniq -c | sort -n
...lots of stuff...

Other common failure messages include "context deadline exceeded", "dial tcp ...:8443: i/o timeout", "Client.Timeout exceeded while awaiting headers", "etcdserver: leader changed", and other things going on as well.

$ curl -s 'https://search.ci.openshift.org/search?maxAge=24h&name=^periodic.*4.9.*upgrade&type=junit&context=0&search=clusteroperator/openshift-apiserver+condition/Available+status/False+reason/APIServicesAvailable' | jq -r 'to_entries[].value | to_entries[].value[].context[]' | sed 's/.*APIServicesAvailable: //' | sort | grep "context canceled" | wc -l
12
$ curl -s 'https://search.ci.openshift.org/search?maxAge=24h&name=^periodic.*4.9.*upgrade&type=junit&context=0&search=clusteroperator/openshift-apiserver+condition/Available+status/False+reason/APIServicesAvailable' | jq -r 'to_entries[].value | to_entries[].value[].context[]' | sed 's/.*APIServicesAvailable: //' | sort | grep "context deadline exceeded" | wc -l
3
$ curl -s 'https://search.ci.openshift.org/search?maxAge=24h&name=^periodic.*4.9.*upgrade&type=junit&context=0&search=clusteroperator/openshift-apiserver+condition/Available+status/False+reason/APIServicesAvailable' | jq -r 'to_entries[].value | to_entries[].value[].context[]' | sed 's/.*APIServicesAvailable: //' | sort | grep "dial tcp.*8443: i/o timeout" | wc -l
4

The other modes each occurred only once in the past 24h.
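For convenience, the per-message counting above can be folded into a single pass. This is just a minimal bash sketch: it reuses the exact search URL and jq expression from the commands above, and the substring list is only illustrative.

#!/usr/bin/env bash
# Count how often each failure-message substring shows up in the matched
# APIServicesAvailable transitions over the past 24h (same search as above).
url='https://search.ci.openshift.org/search?maxAge=24h&name=^periodic.*4.9.*upgrade&type=junit&context=0&search=clusteroperator/openshift-apiserver+condition/Available+status/False+reason/APIServicesAvailable'

# Fetch once, extract the matched context lines, and strip the common prefix.
messages="$(curl -s "$url" | jq -r 'to_entries[].value | to_entries[].value[].context[]' | sed 's/.*APIServicesAvailable: //')"

# Illustrative substrings; extend the list as new failure modes show up.
for pattern in 'context canceled' 'context deadline exceeded' 'dial tcp.*8443: i/o timeout' 'is not ready: 503'; do
  count="$(grep -c "$pattern" <<<"$messages" || true)"
  printf '%5s  %s\n' "$count" "$pattern"
done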
It looks like there are at least 3 categories of errors:

1. An "i/o timeout" while connecting to an aggregated API - possibly an SDN error:
   https://search.ci.openshift.org/?search=clusteroperator%2Fopenshift-apiserver+condition%2FAvailable+status%2FFalse+reason%2FAPIServicesAvailable.*Get.*dial&maxAge=24h&context=1&type=junit&name=periodic.*4.9.*upgrade&excludeName=&maxMatches=5&maxBytes=20971520&groupBy=job

2. A client-side timeout ("context canceled") while getting an APIService resource from the Kube API Server:
   https://search.ci.openshift.org/?search=clusteroperator%2Fopenshift-apiserver+condition%2FAvailable+status%2FFalse+reason%2FAPIServicesAvailable.*Get.*context+canceled&maxAge=24h&context=1&type=junit&name=periodic.*4.9.*upgrade&excludeName=&maxMatches=5&maxBytes=20971520&groupBy=job

3. A "failing or missing response from an aggregated API ... context deadline exceeded" error, reported by the Kube API Server status controller:
   https://search.ci.openshift.org/?search=clusteroperator%2Fopenshift-apiserver+condition%2FAvailable+status%2FFalse+reason%2FAPIServicesAvailable.*Get.*8443%2Fapis.*context&maxAge=24h&context=1&type=junit&name=periodic.*4.9.*upgrade&excludeName=&maxMatches=5&maxBytes=20971520&groupBy=job
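For a rough sense of how common each category is, the three searches above can be run through the same w3m-based impact query used earlier in this bug. A minimal sketch; the regexes simply mirror the three links above:

#!/usr/bin/env bash
# Rough per-category impact over the last 24h of 4.9 upgrade periodics.
# Regex 1: dial / i-o timeout to the aggregated API (possible SDN issue).
# Regex 2: client-side "context canceled" getting the APIService resource.
# Regex 3: aggregated-API discovery check hitting a context deadline.
base='https://search.ci.openshift.org/?maxAge=24h&type=junit&name=periodic.*4.9.*upgrade&context=1&maxMatches=5&maxBytes=20971520&groupBy=job&search='
for category in \
  'clusteroperator/openshift-apiserver+condition/Available+status/False+reason/APIServicesAvailable.*Get.*dial' \
  'clusteroperator/openshift-apiserver+condition/Available+status/False+reason/APIServicesAvailable.*Get.*context+canceled' \
  'clusteroperator/openshift-apiserver+condition/Available+status/False+reason/APIServicesAvailable.*Get.*8443/apis.*context'; do
  echo "=== ${category}"
  w3m -dump -cols 200 "${base}${category}" | grep 'failures match' | sort
done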
Looking at the first category, https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.9-e2e-gcp-upgrade/1429964492430643200

# The API server was unavailable around 01:43
01:43:32 openshift-apiserver-operator openshift-apiserver-operator-status-controller-statussyncer_openshift-apiserver openshift-apiserver-operator OperatorStatusChanged Status for clusteroperator/openshift-apiserver changed: Available changed from True to False ("APIServicesAvailable: apiservices.apiregistration.k8s.io/v1.apps.openshift.io: not available: failing or missing response from https://10.130.0.60:8443/apis/apps.openshift.io/v1: Get \"https://10.130.0.60:8443/apis/apps.openshift.io/v1\": dial tcp 10.130.0.60:8443: i/o timeout\nAPIServicesAvailable: apiservices.apiregistration.k8s.io/v1.build.openshift.io: not available: failing or missing response from https://10.130.0.60:8443/apis/build.openshift.io/v1: Get \"https://10.130.0.60:8443/apis/build.openshift.io/v1\": dial tcp 10.130.0.60:8443: i/o timeout")

# The Kube API Server (actually a controller) on master-0 marked the API service as unavailable around that time
2021-08-24T01:43:32.042537997Z I0824 01:43:32.042433 18 available_controller.go:474] "changing APIService availability" name="v1.image.openshift.io" oldStatus=True newStatus=False message="failing or missing response from https://10.130.0.60:8443/apis/image.openshift.io/v1: Get \"https://10.130.0.60:8443/apis/image.openshift.io/v1\": dial tcp 10.130.0.60:8443: i/o timeout" reason="FailedDiscoveryCheck"
2021-08-24T01:43:32.049374015Z I0824 01:43:32.049281 18 available_controller.go:474] "changing APIService availability" name="v1.apps.openshift.io" oldStatus=True newStatus=False message="failing or missing response from https://10.130.0.60:8443/apis/apps.openshift.io/v1: Get \"https://10.130.0.60:8443/apis/apps.openshift.io/v1\": dial tcp 10.130.0.60:8443: i/o timeout" reason="FailedDiscoveryCheck"
2021-08-24T01:43:32.050004876Z I0824 01:43:32.049947 18 available_controller.go:474] "changing APIService availability" name="v1.build.openshift.io" oldStatus=True newStatus=False message="failing or missing response from https://10.130.0.60:8443/apis/build.openshift.io/v1: Get \"https://10.130.0.60:8443/apis/build.openshift.io/v1\": dial tcp 10.130.0.60:8443: i/o timeout" reason="FailedDiscoveryCheck"
2021-08-24T01:43:32.050549703Z I0824 01:43:32.050489 18 available_controller.go:474] "changing APIService availability" name="v1.project.openshift.io" oldStatus=True newStatus=False message="failing or missing response from https://10.130.0.60:8443/apis/project.openshift.io/v1: Get \"https://10.130.0.60:8443/apis/project.openshift.io/v1\": context deadline exceeded" reason="FailedDiscoveryCheck"
2021-08-24T01:43:32.051029638Z I0824 01:43:32.050973 18 available_controller.go:474] "changing APIService availability" name="v1.template.openshift.io" oldStatus=True newStatus=False message="failing or missing response from https://10.128.0.74:8443/apis/template.openshift.io/v1: Get \"https://10.128.0.74:8443/apis/template.openshift.io/v1\": context deadline exceeded" reason="FailedDiscoveryCheck"

# master-0 was drained before 01:43
# 10.130.0.60 was running on master-1
# SDN on master-0 wasn't ready around 01:43
# In consequence, the controller wasn't able to reach a pod on a different host and marked the API service as unavailable
# The outage was reported by the operator
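For reference, the timeline above can be re-derived from the job artifacts. A minimal sketch, assuming the gathered event list and kube-apiserver pod logs have already been downloaded locally; the ARTIFACTS directory is a hypothetical placeholder, adjust it to wherever the job's artifacts land:

#!/usr/bin/env bash
# ARTIFACTS points to a hypothetical local copy of the job's gathered artifacts.
ARTIFACTS=${ARTIFACTS:-./artifacts}

# 1. When did the operator report Available=False? (operator events)
grep -rh 'OperatorStatusChanged.*openshift-apiserver.*Available changed from True to False' "$ARTIFACTS" | sort

# 2. Which APIService flips did the kube-apiserver availability controller record?
grep -rh 'changing APIService availability' "$ARTIFACTS" | grep 'openshift.io' | sort

# 3. Cross-check against node drain / SDN readiness around the same timestamps.
grep -rhE 'drain|sdn.*(NotReady|not ready)' "$ARTIFACTS" | sort | tail -n 50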
The failures in the second category seem to correspond to the operator being terminated. During termination, the context is canceled. The cancellation signal propagates to the in-flight network calls, which makes them fail. The failures are reported by the operator and, finally, the operator terminates.
https://bugzilla.redhat.com/show_bug.cgi?id=1998516 will address category #3 from c20 (non-blocker).
The first category seems to be resolved:
https://search.ci.openshift.org/?search=clusteroperator%2Fopenshift-apiserver+condition%2FAvailable+status%2FFalse+reason%2FAPIServicesAvailable.*Get.*context+canceled&maxAge=24h&context=1&type=junit&name=periodic.*4.9.*upgrade&excludeName=&maxMatches=5&maxBytes=20971520&groupBy=job

The second looks better:
https://search.ci.openshift.org/?search=clusteroperator%2Fopenshift-apiserver+condition%2FAvailable+status%2FFalse+reason%2FAPIServicesAvailable.*Get.*dial&maxAge=24h&context=1&type=junit&name=periodic.*4.9.*upgrade&excludeName=&maxMatches=5&maxBytes=20971520&groupBy=job

The overall availability also seems better:
https://search.ci.openshift.org/?search=clusteroperator%2Fopenshift-apiserver+condition%2FAvailable+status%2FFalse+reason%2FAPIServicesAvailable.*&maxAge=24h&context=1&type=junit&name=periodic.*4.9.*upgrade&excludeName=4.8&maxMatches=5&maxBytes=20971520&groupBy=job

Not sure if the periodic-ci-openshift-release-master jobs contain the fixes.
QE also noticed a similar issue in a QE upgrade test case, recorded in https://issues.redhat.com/browse/OCPQE-5300; we have not yet been able to investigate it in time and will keep watching the latest QE CI jobs and investigate. Could you give some clue on how to verify this bug? It seems we can only rely on CI? But CI is not completely free of failures, e.g. https://search.ci.openshift.org/?search=clusteroperator%2Fopenshift-apiserver+condition%2FAvailable+status%2FFalse.*openshift.io.v1%22+is+not+ready&maxAge=24h&context=1&type=junit&name=upgrade.*4%5C.9%7C4%5C.9.*upgrade&excludeName=4%5C.7&maxMatches=1&maxBytes=20971520&groupBy=job . Thanks
Probably the best way is to search for occurrences in CI. The fixes were applied only to 4.9.

So far we are still getting ([1]):
- "dial tcp 172.30.0.1:443: connect: connection refused" - it seems to be restricted to SNO clusters
- "APIServicesAvailable: etcdserver: leader changed" - which seems to be a response from the Kube API Server

[1] https://search.ci.openshift.org/?search=clusteroperator%2Fopenshift-apiserver+condition%2FAvailable+status%2FFalse+reason%2FAPIServicesAvailable.*&maxAge=24h&context=1&type=junit&name=periodic.*4.9.*upgrade&excludeName=4.8&maxMatches=5&maxBytes=20971520&groupBy=job
OK, using the approach from comment 19, this confirms comment 30's conclusion:

[xxia 2021-09-02 18:49:07 CST my]$ w3m -dump -cols 200 'https://search.ci.openshift.org/?search=clusteroperator%2Fopenshift-apiserver+condition%2FAvailable+status%2FFalse+reason%2FAPIServicesAvailable&maxAge=24h&context=0&type=junit&name=%5Eperiodic.*4.9.*upgrade&excludeName=4%5C.8&maxMatches=5&maxBytes=20971520&groupBy=job'

periodic-ci-openshift-release-master-ci-4.9-e2e-gcp-upgrade (all)
#1433278112061198336 junit 5 hours ago
Sep 02 05:18:10.803 - 1s E clusteroperator/openshift-apiserver condition/Available status/False reason/APIServicesAvailable: etcdserver: leader changed
Sep 02 05:18:10.803 - 1s E clusteroperator/openshift-apiserver condition/Available status/False reason/APIServicesAvailable: etcdserver: leader changed
#1433223340943740928 junit 8 hours ago
Sep 02 01:48:19.803 - 2s E clusteroperator/openshift-apiserver condition/Available status/False reason/APIServicesAvailable: etcdserver: leader changed
Sep 02 01:48:19.803 - 2s E clusteroperator/openshift-apiserver condition/Available status/False reason/APIServicesAvailable: etcdserver: leader changed
#1433174896422162432 junit 12 hours ago
Sep 01 21:45:13.697 - 8s E clusteroperator/openshift-apiserver condition/Available status/False reason/APIServicesAvailable: etcdserver: leader changed
Sep 01 21:45:13.697 - 8s E clusteroperator/openshift-apiserver condition/Available status/False reason/APIServicesAvailable: etcdserver: leader changed
#1433038397445771264 junit 21 hours ago
Sep 01 12:38:18.730 - 1s E clusteroperator/openshift-apiserver condition/Available status/False reason/APIServicesAvailable: etcdserver: leader changed
Sep 01 12:38:18.730 - 1s E clusteroperator/openshift-apiserver condition/Available status/False reason/APIServicesAvailable: etcdserver: leader changed

periodic-ci-openshift-release-master-ci-4.9-e2e-azure-upgrade-single-node (all)
#1433116964133277696 junit 14 hours ago
Sep 01 18:03:14.778 - 5s E clusteroperator/openshift-apiserver condition/Available status/False reason/APIServicesAvailable: Get "https://172.30.0.1:443/apis/apiregistration.k8s.io/v1/apiservices/v1.apps.openshift.io": dial tcp 172.30.0.1:443: i/o timeout
Sep 01 18:03:14.778 - 5s E clusteroperator/openshift-apiserver condition/Available status/False reason/APIServicesAvailable: Get "https://172.30.0.1:443/apis/apiregistration.k8s.io/v1/apiservices/v1.apps.openshift.io": dial tcp 172.30.0.1:443: i/o timeout

periodic-ci-openshift-release-master-nightly-4.9-upgrade-from-stable-4.7-e2e-aws-upgrade-paused (all)
#1433132315776651264 junit 15 hours ago
Sep 01 19:58:20.010 - 6s E clusteroperator/openshift-apiserver condition/Available status/False reason/APIServicesAvailable: apiservices.apiregistration.k8s.io/v1.apps.openshift.io: not available: failing or missing response from https://10.129.0.11:8443/apis/apps.openshift.io/v1: Get "https://10.129.0.11:8443/apis/apps.openshift.io/v1": net/http: request canceled (Client.Timeout exceeded while awaiting headers)
Sep 01 19:58:20.010 - 6s E clusteroperator/openshift-apiserver condition/Available status/False reason/APIServicesAvailable: apiservices.apiregistration.k8s.io/v1.apps.openshift.io: not available: failing or missing response from https://10.129.0.11:8443/apis/apps.openshift.io/v1: Get "https://10.129.0.11:8443/apis/apps.openshift.io/v1": net/http: request canceled (Client.Timeout exceeded while awaiting headers)

I would therefore like to move this to VERIFIED.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Container Platform 4.9.0 bug fix and security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2021:3759