
Bug 1812142

Summary: [Disruptive] Cluster upgrade should maintain a functioning cluster [Feature:ClusterUpgrade] [Suite:openshift] cluster upgrade is Failing: Could not update deployment "openshift-kube-scheduler-operator/openshift-kube-scheduler-operator"
Product: OpenShift Container Platform
Reporter: tflannag
Component: kube-scheduler
Assignee: Jan Chaloupka <jchaloup>
Status: CLOSED NOTABUG
QA Contact: RamaKasturi <knarra>
Severity: medium
Docs Contact:
Priority: medium
Version: 4.2.0
CC: aos-bugs, bparees, cajieh, dgoodwin, jchaloup, mfojtik, wking
Target Milestone: ---
Keywords: Upgrades
Target Release: 4.6.0
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2020-06-18 10:21:11 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:

Description tflannag 2020-03-10 15:46:18 UTC
Description of problem:
The disruptive cluster upgrade test results in a panic (SIGABRT) after attempting to update the openshift-kube-scheduler-operator deployment:

```
Mar 10 13:20:31.206: INFO: cluster upgrade is Failing: Could not update deployment "openshift-kube-scheduler-operator/openshift-kube-scheduler-operator" (66 of 350)
SIGABRT: abort
PC=0x462161 m=0 sigcode=0

goroutine 0 [idle]:
runtime.futex(0xa66db28, 0x80, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x7ffd6eb4fc18, 0x40d9d1, ...)
	/usr/local/go/src/runtime/sys_linux_amd64.s:535 +0x21
runtime.futexsleep(0xa66db28, 0x0, 0xffffffffffffffff)
	/usr/local/go/src/runtime/os_linux.go:46 +0x4b
runtime.notesleep(0xa66db28)
	/usr/local/go/src/runtime/lock_futex.go:151 +0xa1
runtime.stopm()
	/usr/local/go/src/runtime/proc.go:1936 +0xc1
runtime.findrunnable(0xc000068f00, 0x0)
	/usr/local/go/src/runtime/proc.go:2399 +0x54a
runtime.schedule()
	/usr/local/go/src/runtime/proc.go:2525 +0x21c
runtime.park_m(0xc0027ed200)
	/usr/local/go/src/runtime/proc.go:2605 +0xa1
runtime.mcall(0x0)
	/usr/local/go/src/runtime/asm_amd64.s:299 +0x5b
``` 

Link to job:
https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade-rollback-4.1-to-4.2/514

Link to build-log:
https://storage.googleapis.com/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade-rollback-4.1-to-4.2/514/build-log.txt

Version-Release number of selected component (if applicable):
4.1 to 4.2 upgrade rollback

How reproducible:


Steps to Reproduce:
1.
2.
3.

Actual results:


Expected results:


Additional info:

Comment 1 Maciej Szulik 2020-03-11 10:01:05 UTC
That stack trace comes from the tests and is irrelevant to the problem here. From what I can see, the problem is either on the CVO side, which is reporting:

Deployment openshift-kube-scheduler-operator is not ready. status: (replicas: 1, updated: 1, ready: 0, unavailable: 1)

even though ks-o is meant to have just one replica and that replica is running;

or on the apiserver side, since the CVO is reporting:

I0310 12:18:21.896804       1 sync_worker.go:740] Update error 66/350: UpdatePayloadFailed Could not update deployment "openshift-kube-scheduler-operator/openshift-kube-scheduler-operator" (66 of 350) (*errors.errorString: timed out waiting for the condition)
E0310 12:18:21.896824       1 sync_worker.go:311] unable to synchronize image (waiting 1m26.262851224s): Could not update deployment "openshift-kube-scheduler-operator/openshift-kube-scheduler-operator" (66 of 350)
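
The `timed out waiting for the condition` part is the generic error string returned by the apimachinery wait helpers when a polled condition never becomes true before the timeout. A minimal sketch of the kind of availability poll that produces it (illustrative only, assuming current client-go/apimachinery APIs; this is not the actual cluster-version-operator code):

```
// Illustrative sketch: poll a deployment until all desired replicas are
// updated and available, or time out. Names and thresholds are assumptions.
package main

import (
	"context"
	"fmt"
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/util/wait"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

// waitForDeploymentAvailable polls until every desired replica of the named
// deployment is updated and available, or the timeout expires.
func waitForDeploymentAvailable(client kubernetes.Interface, namespace, name string, timeout time.Duration) error {
	return wait.PollImmediate(5*time.Second, timeout, func() (bool, error) {
		d, err := client.AppsV1().Deployments(namespace).Get(context.TODO(), name, metav1.GetOptions{})
		if err != nil {
			return false, nil // tolerate transient errors and keep polling
		}
		desired := int32(1)
		if d.Spec.Replicas != nil {
			desired = *d.Spec.Replicas
		}
		// A status like "replicas: 1, updated: 1, ready: 0, unavailable: 1"
		// keeps this condition false.
		return d.Status.UpdatedReplicas == desired && d.Status.AvailableReplicas == desired, nil
	})
	// On timeout this returns wait.ErrWaitTimeout, whose message is exactly
	// "timed out waiting for the condition" -- the suffix wrapped into the
	// "Could not update deployment ..." error in the sync_worker log above.
}

func main() {
	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)
	err = waitForDeploymentAvailable(client, "openshift-kube-scheduler-operator", "openshift-kube-scheduler-operator", 2*time.Minute)
	fmt.Println("result:", err)
}
```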

There are also these failures in the apiserver logs:

E0310 12:03:31.003196       1 pathrecorder.go:107] registered "/healthz/crd-informer-synced" from goroutine 1 [running]:
runtime/debug.Stack(0x4e952e0, 0xc006ed1680, 0xc00a0b4740)
	/opt/rh/go-toolset-1.11/root/usr/lib/go-toolset-1.11-golang/src/runtime/debug/stack.go:24 +0xa7
github.com/openshift/origin/vendor/k8s.io/apiserver/pkg/server/mux.(*PathRecorderMux).trackCallers(0xc002c15340, 0xc00a0b4740, 0x1c)
	/go/src/github.com/openshift/origin/_output/local/go/src/github.com/openshift/origin/vendor/k8s.io/apiserver/pkg/server/mux/pathrecorder.go:109 +0x89
github.com/openshift/origin/vendor/k8s.io/apiserver/pkg/server/mux.(*PathRecorderMux).Handle(0xc002c15340, 0xc00a0b4740, 0x1c, 0xaa61480, 0xc009d84260)
	/go/src/github.com/openshift/origin/_output/local/go/src/github.com/openshift/origin/vendor/k8s.io/apiserver/pkg/server/mux/pathrecorder.go:173 +0x86
github.com/openshift/origin/vendor/k8s.io/apiserver/pkg/server/healthz.InstallPathHandler(0xaa5a6c0, 0xc002c15340, 0x5977e1f, 0x8, 0xc003600c00, 0x1c, 0x20)
	/go/src/github.com/openshift/origin/_output/local/go/src/github.com/openshift/origin/vendor/k8s.io/apiserver/pkg/server/healthz/healthz.go:120 +0x3c1
github.com/openshift/origin/vendor/k8s.io/apiserver/pkg/server/healthz.InstallHandler(0xaa5a6c0, 0xc002c15340, 0xc003600c00, 0x1c, 0x20)
	/go/src/github.com/openshift/origin/_output/local/go/src/github.com/openshift/origin/vendor/k8s.io/apiserver/pkg/server/healthz/healthz.go:102 +0x68
github.com/openshift/origin/vendor/k8s.io/apiserver/pkg/server.(*GenericAPIServer).installHealthz(0xc0086a4c60)
	/go/src/github.com/openshift/origin/_output/local/go/src/github.com/openshift/origin/vendor/k8s.io/apiserver/pkg/server/healthz.go:45 +0xb0
github.com/openshift/origin/vendor/k8s.io/apiserver/pkg/server.(*GenericAPIServer).PrepareRun(0xc0086a4c60, 0xc0002e52c0)
	/go/src/github.com/openshift/origin/_output/local/go/src/github.com/openshift/origin/vendor/k8s.io/apiserver/pkg/server/genericapiserver.go:265 +0x4f
github.com/openshift/origin/vendor/k8s.io/kubernetes/cmd/kube-apiserver/app.Run(0xc00024efc0, 0xc0002e52c0, 0x0, 0x0)
	/go/src/github.com/openshift/origin/_output/local/go/src/github.com/openshift/origin/vendor/k8s.io/kubernetes/cmd/kube-apiserver/app/server.go:153 +0x12a
github.com/openshift/origin/vendor/k8s.io/kubernetes/cmd/kube-apiserver/app.NewAPIServerCommand.func1(0xc000aeb400, 0x0, 0x0, 0x0, 0x1, 0x0)
	/go/src/github.com/openshift/origin/_output/local/go/src/github.com/openshift/origin/vendor/k8s.io/kubernetes/cmd/kube-apiserver/app/server.go:115 +0x109
github.com/openshift/origin/pkg/cmd/openshift-kube-apiserver.RunOpenShiftKubeAPIServerServer(0xc0009bec00, 0xc0002e52c0, 0x24, 0x7ffdbd8bca8c)
	/go/src/github.com/openshift/origin/_output/local/go/src/github.com/openshift/origin/pkg/cmd/openshift-kube-apiserver/server.go:70 +0x57e
github.com/openshift/origin/pkg/cmd/openshift-kube-apiserver.(*OpenShiftKubeAPIServerServer).RunAPIServer(0xc000358ce0, 0xc0002e52c0, 0xc0004b4420, 0xc001118600)
	/go/src/github.com/openshift/origin/_output/local/go/src/github.com/openshift/origin/pkg/cmd/openshift-kube-apiserver/cmd.go:129 +0x817
github.com/openshift/origin/pkg/cmd/openshift-kube-apiserver.NewOpenShiftKubeAPIServerServerCommand.func1(0xc000af3b80, 0xc000359420, 0x0, 0x2)
	/go/src/github.com/openshift/origin/_output/local/go/src/github.com/openshift/origin/pkg/cmd/openshift-kube-apiserver/cmd.go:61 +0x10c
github.com/openshift/origin/vendor/github.com/spf13/cobra.(*Command).execute(0xc000af3b80, 0xc000359380, 0x2, 0x2, 0xc000af3b80, 0xc000359380)
	/go/src/github.com/openshift/origin/_output/local/go/src/github.com/openshift/origin/vendor/github.com/spf13/cobra/command.go:760 +0x2cc
github.com/openshift/origin/vendor/github.com/spf13/cobra.(*Command).ExecuteC(0xc000af3180, 0xc000aea280, 0xc000aea000, 0xc000af3b80)
	/go/src/github.com/openshift/origin/_output/local/go/src/github.com/openshift/origin/vendor/github.com/spf13/cobra/command.go:846 +0x2fd
github.com/openshift/origin/vendor/github.com/spf13/cobra.(*Command).Execute(0xc000af3180, 0xc000af3180, 0x0)
	/go/src/github.com/openshift/origin/_output/local/go/src/github.com/openshift/origin/vendor/github.com/spf13/cobra/command.go:794 +0x2b
main.main()
	/go/src/github.com/openshift/origin/_output/local/go/src/github.com/openshift/origin/cmd/hypershift/main.go:45 +0x2b8

which suggests there is a problem waiting for CRDs to become available, which might in turn be caused either by etcd not being available or by the network not being reachable.
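
For context, `/healthz/crd-informer-synced` in that stack is a named health check that only starts passing once the CRD informers complete their initial sync; if etcd or the network is unavailable, the informer never syncs and the check keeps failing. A minimal sketch of how such a check is wired with the upstream healthz helpers (an assumption-laden illustration, not the actual origin/kube-apiserver wiring):

```
// Sketch using the standard k8s.io/apiserver healthz helpers and a client-go
// informer sync function; not the real kube-apiserver code.
package main

import (
	"fmt"
	"net/http"

	"k8s.io/apiserver/pkg/server/healthz"
	"k8s.io/client-go/tools/cache"
)

// newInformerSyncedCheck returns a named healthz check that fails until the
// given informer reports that its initial list/watch has completed.
func newInformerSyncedCheck(name string, synced cache.InformerSynced) healthz.HealthChecker {
	return healthz.NamedCheck(name, func(_ *http.Request) error {
		if !synced() {
			// While this returns an error, GET /healthz/<name> serves 500 --
			// e.g. when the CRD informer cannot sync because etcd or the
			// network is unreachable, as suspected above.
			return fmt.Errorf("informer %q has not synced yet", name)
		}
		return nil
	})
}

func main() {
	mux := http.NewServeMux()
	crdSynced := func() bool { return false } // stand-in: pretend the CRD informer never syncs

	// InstallHandler registers /healthz plus a /healthz/<check-name> sub-path
	// per check, which is where the apiserver's `registered
	// "/healthz/crd-informer-synced"` log line above comes from.
	healthz.InstallHandler(mux, healthz.PingHealthz, newInformerSyncedCheck("crd-informer-synced", crdSynced))

	// GET http://localhost:8080/healthz/crd-informer-synced returns 500 until crdSynced() is true.
	_ = http.ListenAndServe(":8080", mux)
}
```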

For starters, I'm sending this to the API server team to double-check.

Comment 6 Jan Chaloupka 2020-05-18 12:58:08 UTC
> Job view: https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade-4.1-to-4.2-to-4.3-nightly/80

fail [github.com/openshift/origin/test/extended/util/disruption/disruption.go:226]: May  1 03:37:34.248: Service was unreachable during disruption for at least 2m18s of 53m55s (4%):

May 01 02:57:24.994 E ns/e2e-k8s-service-lb-available-8928 svc/service-test Service stopped responding to GET requests over new connections
May 01 02:57:25.993 - 19s   E ns/e2e-k8s-service-lb-available-8928 svc/service-test Service is not responding to GET requests over new connections
May 01 02:57:28.994 E ns/e2e-k8s-service-lb-available-8928 svc/service-test Service stopped responding to GET requests on reused connections
...
May 01 03:34:22.072 I ns/e2e-k8s-service-lb-available-8928 svc/service-test Service started responding to GET requests over new connections
May 01 03:35:20.994 E ns/e2e-k8s-service-lb-available-8928 svc/service-test Service stopped responding to GET requests over new connections
May 01 03:35:21.993 - 1s    E ns/e2e-k8s-service-lb-available-8928 svc/service-test Service is not responding to GET requests over new connections
May 01 03:35:23.363 I ns/e2e-k8s-service-lb-available-8928 svc/service-test Service started responding to GET requests over new connections

This does not resemble the `Could not update deployment "openshift-kube-scheduler-operator/openshift-kube-scheduler-operator"` error.
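
(For reference, the service-lb-available events quoted above come from a monitor that keeps issuing GET requests against the test service and records the intervals during which they fail. A rough sketch of that measurement, not the origin disruption monitor itself:)

```
// Rough sketch of what the e2e "service-test" availability monitor measures:
// poll an endpoint, record unavailable intervals, and report the total
// disruption as a fraction of the monitoring window. Not the origin test code.
package main

import (
	"fmt"
	"net/http"
	"time"
)

func monitor(url string, interval, duration time.Duration) {
	client := &http.Client{Timeout: 3 * time.Second}
	start := time.Now()
	var down time.Duration
	var downSince time.Time

	for time.Since(start) < duration {
		resp, err := client.Get(url)
		healthy := err == nil && resp.StatusCode == http.StatusOK
		if resp != nil {
			resp.Body.Close()
		}
		switch {
		case !healthy && downSince.IsZero():
			downSince = time.Now() // "Service stopped responding to GET requests"
		case healthy && !downSince.IsZero():
			down += time.Since(downSince) // "Service started responding ..."
			downSince = time.Time{}
		}
		time.Sleep(interval)
	}
	if !downSince.IsZero() {
		down += time.Since(downSince)
	}
	total := time.Since(start)
	// e.g. "unreachable during disruption for at least 2m18s of 53m55s (4%)"
	fmt.Printf("unreachable for at least %s of %s (%d%%)\n",
		down.Round(time.Second), total.Round(time.Second), int(float64(down)/float64(total)*100))
}

func main() {
	monitor("http://service-test.example/", time.Second, time.Minute) // hypothetical endpoint
}
```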

Comment 7 Jan Chaloupka 2020-05-18 13:38:05 UTC
Checking the latest release-openshift-origin-installer-e2e-aws-upgrade-rollback-4.1-to-4.2 run (6 days ago):

https://deck-ci.apps.ci.l2s4.p1.openshiftapps.com/view/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade-rollback-4.1-to-4.2/640

From the upgrade logs:
* Cluster operator cloud-credential is reporting a failure: 8 of 13 credentials requests are failing to sync.
* Cluster operator monitoring is still updating
* Cluster operator openshift-samples is still updating
* Could not update customresourcedefinition "catalogsourceconfigs.operators.coreos.com" (272 of 350): the object is invalid, possibly due to local cluster configuration

Checking openshift-cloud-credential-operator_cloud-credential-operator-6f84d766f7-k47nk_manager.log:

time="2020-05-11T19:48:21Z" level=error msg="error syncing creds in mint-mode" actuator=aws cr=openshift-cloud-credential-operator/openshift-machine-api-gcp error="error decoding provider v1 spec: decoding failure: no kind \"GCPProviderSpec\" is registered for version \"cloudcredential.openshift.io/v1\" in scheme \"sigs.k8s.io/controller-runtime/pkg/runtime/scheme/scheme.go:54\""
time="2020-05-11T19:48:21Z" level=error msg="error syncing credentials: error syncing creds in mint-mode: error decoding provider v1 spec: decoding failure: no kind \"GCPProviderSpec\" is registered for version \"cloudcredential.openshift.io/v1\" in scheme \"sigs.k8s.io/controller-runtime/pkg/runtime/scheme/scheme.go:54\"" controller=credreq cr=openshift-cloud-credential-operator/openshift-machine-api-gcp secret=openshift-machine-api/gcp-cloud-credentials

This does not resemble the `Could not update deployment "openshift-kube-scheduler-operator/openshift-kube-scheduler-operator"` error.

Comment 8 Jan Chaloupka 2020-05-18 13:51:47 UTC
https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade-4.1-to-4.2-to-4.3-nightly/97/ (9 hours ago)

fail [github.com/openshift/origin/test/extended/util/disruption/disruption.go:226]: May 18 03:48:25.796: Service was unreachable during disruption for at least 1m10s of 42m40s (3%):

May 18 03:40:20.891 E ns/e2e-k8s-service-lb-available-1240 svc/service-test Service stopped responding to GET requests over new connections
May 18 03:40:21.890 - 18s   E ns/e2e-k8s-service-lb-available-1240 svc/service-test Service is not responding to GET requests over new connections
May 18 03:40:40.579 I ns/e2e-k8s-service-lb-available-1240 svc/service-test Service started responding to GET requests over new connections
May 18 03:40:50.579 E ns/e2e-k8s-service-lb-available-1240 svc/service-test Service stopped responding to GET requests over new connections
May 18 03:40:50.890 - 11s   E ns/e2e-k8s-service-lb-available-1240 svc/service-test Service is not responding to GET requests over new connections
May 18 03:41:03.811 I ns/e2e-k8s-service-lb-available-1240 svc/service-test Service started responding to GET requests over new connections
May 18 03:41:24.891 E ns/e2e-k8s-service-lb-available-1240 svc/service-test Service stopped responding to GET requests over new connections
May 18 03:41:25.890 - 13s   E ns/e2e-k8s-service-lb-available-1240 svc/service-test Service is not responding to GET requests over new connections
May 18 03:41:27.891 E ns/e2e-k8s-service-lb-available-1240 svc/service-test Service stopped responding to GET requests on reused connections
May 18 03:41:28.890 - 21s   E ns/e2e-k8s-service-lb-available-1240 svc/service-test Service is not responding to GET requests on reused connections
May 18 03:41:39.813 I ns/e2e-k8s-service-lb-available-1240 svc/service-test Service started responding to GET requests over new connections
May 18 03:41:49.814 E ns/e2e-k8s-service-lb-available-1240 svc/service-test Service stopped responding to GET requests over new connections
May 18 03:41:49.890 - 1s    E ns/e2e-k8s-service-lb-available-1240 svc/service-test Service is not responding to GET requests over new connections
May 18 03:41:51.539 I ns/e2e-k8s-service-lb-available-1240 svc/service-test Service started responding to GET requests over new connections
May 18 03:41:51.539 I ns/e2e-k8s-service-lb-available-1240 svc/service-test Service started responding to GET requests on reused connections


https://deck-ci.apps.ci.l2s4.p1.openshiftapps.com/view/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade-4.1-to-4.2-to-4.3-nightly/94 (3 days ago)

May 15 03:55:38.102: INFO: Service e2e-k8s-service-lb-available-533/service-test hasFinalizer=false, want true
May 15 03:56:08.101: INFO: Service e2e-k8s-service-lb-available-533/service-test hasFinalizer=false, want true
May 15 03:56:38.102: INFO: Service e2e-k8s-service-lb-available-533/service-test hasFinalizer=false, want true
May 15 03:56:38.175: INFO: Service e2e-k8s-service-lb-available-533/service-test hasFinalizer=false, want true
May 15 03:56:38.175: INFO: Failed to wait for service to hasFinalizer=true: timed out waiting for the condition

https://deck-ci.apps.ci.l2s4.p1.openshiftapps.com/view/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade-4.1-to-4.2-to-4.3-nightly/93 (4 days ago)

May 14 04:11:03.145: INFO: Service e2e-k8s-service-lb-available-4897/service-test hasFinalizer=false, want true
May 14 04:11:03.162: INFO: Service e2e-k8s-service-lb-available-4897/service-test hasFinalizer=false, want true
May 14 04:11:03.162: INFO: Failed to wait for service to hasFinalizer=true: timed out waiting for the condition
...
fail [k8s.io/kubernetes/test/e2e/framework/service/wait.go:115]: May 14 04:11:03.162: Failed to wait for service to hasFinalizer=true: timed out waiting for the condition

https://deck-ci.apps.ci.l2s4.p1.openshiftapps.com/view/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade-4.1-to-4.2-to-4.3-nightly/92 (5 days ago)

May 13 03:59:06.034: INFO: Service e2e-k8s-service-lb-available-8199/service-test hasFinalizer=false, want true
May 13 03:59:36.034: INFO: Service e2e-k8s-service-lb-available-8199/service-test hasFinalizer=false, want true
May 13 03:59:36.107: INFO: Service e2e-k8s-service-lb-available-8199/service-test hasFinalizer=false, want true
May 13 03:59:36.107: INFO: Failed to wait for service to hasFinalizer=true: timed out waiting for the condition
...
fail [k8s.io/kubernetes/test/e2e/framework/service/wait.go:115]: May 13 03:59:36.107: Failed to wait for service to hasFinalizer=true: timed out waiting for the condition

This does not resemble the `Could not update deployment "openshift-kube-scheduler-operator/openshift-kube-scheduler-operator"` error.
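
(Side note on the `hasFinalizer=true` failures above: the e2e framework essentially polls the Service object until the `service.kubernetes.io/load-balancer-cleanup` finalizer matches the expected state, along these lines; an illustrative sketch, not the actual k8s.io/kubernetes test helper:)

```
// Illustrative sketch of the kind of poll behind "Failed to wait for service
// to hasFinalizer=true: timed out waiting for the condition".
package main

import (
	"context"
	"fmt"
	"time"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/util/wait"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/kubernetes/fake"
)

const lbCleanupFinalizer = "service.kubernetes.io/load-balancer-cleanup"

func hasFinalizer(svc *corev1.Service, finalizer string) bool {
	for _, f := range svc.Finalizers {
		if f == finalizer {
			return true
		}
	}
	return false
}

// waitForServiceFinalizer polls the Service until the load-balancer-cleanup
// finalizer matches the wanted state; if the service controller never attaches
// (or removes) it, the poll ends with "timed out waiting for the condition".
func waitForServiceFinalizer(client kubernetes.Interface, ns, name string, want bool, timeout time.Duration) error {
	return wait.PollImmediate(time.Second, timeout, func() (bool, error) {
		svc, err := client.CoreV1().Services(ns).Get(context.TODO(), name, metav1.GetOptions{})
		if err != nil {
			return false, nil // keep polling through transient errors
		}
		got := hasFinalizer(svc, lbCleanupFinalizer)
		fmt.Printf("Service %s/%s hasFinalizer=%t, want %t\n", ns, name, got, want)
		return got == want, nil
	})
}

func main() {
	// Demo with a fake clientset and a Service that never gains the finalizer.
	svc := &corev1.Service{ObjectMeta: metav1.ObjectMeta{Namespace: "e2e", Name: "service-test"}}
	client := fake.NewSimpleClientset(svc)
	err := waitForServiceFinalizer(client, "e2e", "service-test", true, 3*time.Second)
	fmt.Println("Failed to wait for service to hasFinalizer=true:", err)
}
```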

Comment 9 Jan Chaloupka 2020-05-18 13:59:14 UTC
https://deck-ci.apps.ci.l2s4.p1.openshiftapps.com/view/gcs/origin-ci-test/pr-logs/pull/openshift_release/9121/rehearse-9121-release-openshift-origin-installer-e2e-aws-upgrade-rollback-4.2-to-4.3/3 (2 days ago)

fail [github.com/openshift/origin/test/e2e/upgrade/upgrade.go:130]: during upgrade to registry.svc.ci.openshift.org/ocp/release:4.3.0-0.ci-2020-05-15-060031
Unexpected error:
    <*errors.errorString | 0xc001b11390>: {
        s: "Cluster did not complete upgrade: timed out waiting for the condition: Cluster operator machine-config is still updating",
    }
    Cluster did not complete upgrade: timed out waiting for the condition: Cluster operator machine-config is still updating
occurred

https://deck-ci.apps.ci.l2s4.p1.openshiftapps.com/view/gcs/origin-ci-test/pr-logs/pull/openshift_release/9121/rehearse-9121-release-openshift-origin-installer-e2e-aws-upgrade-4.2-to-4.3/1 (2 days ago)

fail [github.com/openshift/origin/test/extended/util/disruption/disruption.go:226]: May 15 23:11:35.709: Frontends were unreachable during disruption for at least 12m31s of 45m49s (27%):

May 15 22:38:10.297 E ns/openshift-console route/console Route stopped responding to GET requests over new connections
May 15 22:38:11.227 - 7s    E ns/openshift-console route/console Route is not responding to GET requests over new connections
May 15 22:38:19.173 I ns/openshift-console route/console Route started responding to GET requests over new connections
May 15 22:38:28.609 E ns/openshift-authentication route/oauth-openshift Route stopped responding to GET requests over new connections
...
May 15 23:02:54.303 I ns/openshift-authentication route/oauth-openshift Route started responding to GET requests over new connections
May 15 23:03:56.540 E ns/openshift-console route/console Route stopped responding to GET requests over new connections
May 15 23:03:57.227 - 3s    E ns/openshift-console route/console Route is not responding to GET requests over new connections
May 15 23:04:01.379 I ns/openshift-console route/console Route started responding to GET requests over new connections
May 15 23:04:06.228 E ns/openshift-authentication route/oauth-openshift Route stopped responding to GET requests on reused connections
May 15 23:04:06.228 E ns/openshift-authentication route/oauth-openshift Route stopped responding to GET requests over new connections
May 15 23:04:06.296 I ns/openshift-authentication route/oauth-openshift Route started responding to GET requests over new connections
May 15 23:04:06.297 I ns/openshift-authentication route/oauth-openshift Route started responding to GET requests on reused connections
May 15 23:04:11.335 E ns/openshift-console route/console Route stopped responding to GET requests on reused connections
May 15 23:04:11.466 I ns/openshift-console route/console Route started responding to GET requests on reused connections

https://deck-ci.apps.ci.l2s4.p1.openshiftapps.com/view/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade-rollback-4.2-to-4.3/416 (7 days ago)

May 11 13:11:22.970: INFO: Service e2e-k8s-service-lb-available-8895/service-test still exists with finalizers: [service.kubernetes.io/load-balancer-cleanup]
May 11 13:11:52.970: INFO: Service e2e-k8s-service-lb-available-8895/service-test still exists with finalizers: [service.kubernetes.io/load-balancer-cleanup]
May 11 13:12:22.970: INFO: Service e2e-k8s-service-lb-available-8895/service-test still exists with finalizers: [service.kubernetes.io/load-balancer-cleanup]
May 11 13:12:22.997: INFO: Service e2e-k8s-service-lb-available-8895/service-test still exists with finalizers: [service.kubernetes.io/load-balancer-cleanup]
May 11 13:12:22.997: INFO: Failed to wait for service to disappear: timed out waiting for the condition
...
fail [github.com/openshift/origin/test/e2e/upgrade/upgrade.go:130]: during upgrade to registry.svc.ci.openshift.org/ocp/release:4.3.0-0.ci-2020-05-08-220837
Unexpected error:
    <*errors.errorString | 0xc002891250>: {
        s: "Cluster did not complete upgrade: timed out waiting for the condition: Working towards 4.2.33: 88% complete",
    }
    Cluster did not complete upgrade: timed out waiting for the condition: Working towards 4.2.33: 88% complete
occurred

This does not resemble the `Could not update deployment "openshift-kube-scheduler-operator/openshift-kube-scheduler-operator"` error.

Comment 11 Ben Parees 2020-06-03 20:29:49 UTC
Jan, what was the basis for deferring this bug out of 4.5? I don't see any explanation, just a reassignment of the target release.

Comment 12 Jan Chaloupka 2020-06-04 09:26:45 UTC
Checking the latest failures in https://search.svc.ci.openshift.org/?maxAge=168h&context=1&type=bug%2Bjunit&name=&maxMatches=5&maxBytes=20971520&groupBy=job&search=%5C%5BDisruptive%5C%5D+Cluster+upgrade+should+maintain+a+functioning+cluster+%5C%5BFeature%3AClusterUpgrade%5C%5D+%5C%5BSuite%3Aopenshift%5C%5D+%5C%5BSerial%5C%5D.

The jobs are failing for the following reasons:
- Unable to retrieve available updates: currently installed version 0.0.1-2020-06-04-053610 not found in the \"stable-4.3\" channel
  Unable to retrieve available updates: currently installed version 4.3.0-0.ci-2020-06-03-170945 not found in the \"stable-4.3\" channel
  Unable to retrieve available updates: currently installed version 4.3.0-0.ci-2020-06-03-175744 not found in the \"stable-4.2\" channel
- Frontends were unreachable during disruption for at least 1m34s of 31m30s (5%):
  Service was unreachable during disruption for at least 40s of 29m35s (2%!)(MISSING):
  fail [github.com/openshift/origin/test/extended/util/disruption/disruption.go:226]: May 29 01:47:09.034: Service was unreachable during disruption for at least 1m5s of 29m40s (4%):
- Jun  4 04:22:08.230: INFO: Failed to wait for service to hasFinalizer=true: timed out waiting for the condition
- fail [github.com/openshift/origin/test/extended/util/disruption/disruption.go:226]: Jun  2 04:04:50.359: API was unreachable during disruption for at least 9m7s of 58m59s (15%):
- SDN healthcheck unable to reconnect to OVS server
- openshift-apiserver OpenShift API stopped responding to GET requests: the server is currently unable to handle the request (get imagestreams.image.openshift.io missing)
- network is not ready: runtime network not ready: NetworkReady=false reason:NetworkPluginNotReady message:Network plugin returns error: Missing CNI default network
- error waiting for deployment \"dp\" status to match expectation
- Cluster operator cloud-credential is reporting a failure: 8 of 13 credentials requests are failing to sync.
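
(Aside: the `(2%!)(MISSING)` fragment in the quoted output above is not part of the failure itself; it appears to be Go's fmt error notation for an unescaped `%` with no matching argument in the disruption message's format string. A hypothetical illustration of the two format strings involved:)

```
// The "(2%!)(MISSING)" text in the CI output is fmt's way of flagging a bad
// verb: an unescaped '%' before ')' with no remaining argument. Writing "%%"
// produces the intended "(2%)". Format strings here are for illustration only.
package main

import "fmt"

func main() {
	fmt.Printf("unreachable for at least 40s of 29m35s (%d%)\n", 2)  // prints "... (2%!)(MISSING)"
	fmt.Printf("unreachable for at least 40s of 29m35s (%d%%)\n", 2) // prints "... (2%)"
}
```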

> Jan, what was the basis for deferring this bug out of 4.5? I don't see any explanation, just a reassignment of the target release.

- The issue was reported specifically for the 'Could not update deployment "openshift-kube-scheduler-operator/openshift-kube-scheduler-operator"' error message, and I have not seen that error message in any of the failed jobs.
- The issue was reported for the 4.1 to 4.2 upgrade rollback, yet I have not seen the error message in any of those upgrades either.

Apologies for not being clear right away. I have also reassigned the bug to 4.6 to keep the error in mind. However, unless the error reappears in the near future, I plan to close the issue completely.

Comment 14 Jan Chaloupka 2020-06-18 10:21:11 UTC
Checking the failed runs again: only a single run [1] shows the `Could not update deployment "openshift-kube-scheduler-operator/openshift-kube-scheduler-operator"` error message. Yet even there the operator eventually gets deployed and is running [2].

[1] https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/release-openshift-origin-installer-launch-aws/1271859844130803712/artifacts/launch/
[2] https://storage.googleapis.com/origin-ci-test/logs/release-openshift-origin-installer-launch-aws/1271859844130803712/artifacts/launch/pods/openshift-kube-scheduler-operator_openshift-kube-scheduler-operator-59d6449bc7-zprrw_kube-scheduler-operator-container.log

Closing the issue as the reported incident has not occurred for quite some time.