Bug 2026352 - Kube-Scheduler revision-pruner fails during install of new cluster
Summary: Kube-Scheduler revision-pruner fails during install of new cluster
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: kube-scheduler
Version: 4.9
Hardware: Unspecified
OS: Unspecified
Priority: low
Severity: low
Target Milestone: ---
Target Release: 4.10.0
Assignee: Jan Chaloupka
QA Contact: RamaKasturi
URL:
Whiteboard:
Depends On:
Blocks: 2044622
 
Reported: 2021-11-24 12:35 UTC by Neil Girard
Modified: 2022-12-02 13:47 UTC
CC List: 3 users

Fixed In Version:
Doc Type: No Doc Update
Doc Text:
Clone Of:
Environment:
Last Closed: 2022-03-12 04:39:01 UTC
Target Upstream Version:
Embargoed:


Attachments


Links
System ID Private Priority Status Summary Last Updated
Github openshift cluster-kube-apiserver-operator pull 1260 0 None open bug 2026352: Sync with library-go to pick fixes for pruner panic 2021-11-29 13:30:57 UTC
Github openshift cluster-kube-controller-manager-operator pull 578 0 None open bug 2026352: Sync with the latest openshift/library-go@master to pick pruner cert dir check 2021-11-29 09:52:26 UTC
Github openshift cluster-kube-scheduler-operator pull 383 0 None open bug 2026352: Sync with library-go to pick fixes for pruner panic 2021-11-26 09:38:33 UTC
Github openshift library-go pull 1255 0 None open bug 2026352: staticpod pruner: check if the cert directory exists 2021-11-26 09:23:28 UTC
Red Hat Product Errata RHSA-2022:0056 0 None None None 2022-03-12 04:39:16 UTC

Description Neil Girard 2021-11-24 12:35:49 UTC
Description of problem:

During the installation of a new cluster, the kube-scheduler revision-pruner pods fail several times during control plane creation.

Version-Release number of selected component (if applicable):
4.9.4

How reproducible:

Reproducibility seems to be low.

Steps to Reproduce:
N/A

Actual results:
revision-pruner pods fail

Expected results:
revision-pruner pods complete

Additional info:

~~~
$ omg get pods -o wide
NAME                                                                READY  STATUS     RESTARTS  AGE    IP           NODE
installer-5-ip-10-0-15-64.us-west-2.compute.internal                0/1    Succeeded  0         9h34m  10.129.0.16  ip-10-0-15-64.us-west-2.compute.internal
installer-6-ip-10-0-15-64.us-west-2.compute.internal                0/1    Succeeded  0         9h32m  10.129.0.24  ip-10-0-15-64.us-west-2.compute.internal
installer-6-ip-10-0-16-77.us-west-2.compute.internal                0/1    Succeeded  0         9h31m  10.128.0.42  ip-10-0-16-77.us-west-2.compute.internal
installer-7-ip-10-0-16-77.us-west-2.compute.internal                0/1    Succeeded  0         9h31m  10.128.0.43  ip-10-0-16-77.us-west-2.compute.internal
installer-7-ip-10-0-17-153.us-west-2.compute.internal               0/1    Succeeded  0         9h30m  10.130.0.27  ip-10-0-17-153.us-west-2.compute.internal
installer-8-ip-10-0-15-64.us-west-2.compute.internal                0/1    Succeeded  0         9h28m  10.129.0.32  ip-10-0-15-64.us-west-2.compute.internal
installer-8-ip-10-0-16-77.us-west-2.compute.internal                0/1    Succeeded  0         9h27m  10.128.0.52  ip-10-0-16-77.us-west-2.compute.internal
installer-8-ip-10-0-17-153.us-west-2.compute.internal               0/1    Succeeded  0         9h30m  10.130.0.28  ip-10-0-17-153.us-west-2.compute.internal
openshift-kube-scheduler-ip-10-0-15-64.us-west-2.compute.internal   3/3    Running    0         9h28m  10.0.15.64   ip-10-0-15-64.us-west-2.compute.internal
openshift-kube-scheduler-ip-10-0-16-77.us-west-2.compute.internal   3/3    Running    0         9h27m  10.0.16.77   ip-10-0-16-77.us-west-2.compute.internal
openshift-kube-scheduler-ip-10-0-17-153.us-west-2.compute.internal  3/3    Running    0         9h29m  10.0.17.153  ip-10-0-17-153.us-west-2.compute.internal
revision-pruner-6-ip-10-0-15-64.us-west-2.compute.internal          0/1    Succeeded  0         9h32m  10.129.0.23  ip-10-0-15-64.us-west-2.compute.internal
revision-pruner-6-ip-10-0-16-77.us-west-2.compute.internal          0/1    Failed     0         9h32m  10.128.0.39  ip-10-0-16-77.us-west-2.compute.internal
revision-pruner-6-ip-10-0-17-153.us-west-2.compute.internal         0/1    Failed     0         9h32m  10.130.0.21  ip-10-0-17-153.us-west-2.compute.internal
revision-pruner-7-ip-10-0-15-64.us-west-2.compute.internal          0/1    Succeeded  0         9h31m  10.129.0.25  ip-10-0-15-64.us-west-2.compute.internal
revision-pruner-7-ip-10-0-16-77.us-west-2.compute.internal          0/1    Succeeded  0         9h31m  10.128.0.44  ip-10-0-16-77.us-west-2.compute.internal
revision-pruner-7-ip-10-0-17-153.us-west-2.compute.internal         0/1    Failed     0         9h31m  10.130.0.24  ip-10-0-17-153.us-west-2.compute.internal
revision-pruner-8-ip-10-0-15-64.us-west-2.compute.internal          0/1    Succeeded  0         9h30m  10.129.0.28  ip-10-0-15-64.us-west-2.compute.internal
revision-pruner-8-ip-10-0-16-77.us-west-2.compute.internal          0/1    Succeeded  0         9h30m  10.128.0.46  ip-10-0-16-77.us-west-2.compute.internal
revision-pruner-8-ip-10-0-17-153.us-west-2.compute.internal         0/1    Succeeded  0         9h30m  10.130.0.29  ip-10-0-17-153.us-west-2.compute.internal
~~~

Logs:

~~~
$ omg logs revision-pruner-7-ip-10-0-17-153.us-west-2.compute.internal
/cases/03072554/0020-must-gather.tar.gz/must-gather.local.3799198186116137965/quay-io-openshift-release-dev-ocp-v4-0-art-dev-sha256-ef4d0df32283c1aae39942b149010b23c98659d54e8845ea0fcfffc36ea99f4e/namespaces/openshift-kube-scheduler/pods/revision-pruner-7-ip-10-0-17-153.us-west-2.compute.internal/pruner/pruner/logs/current.log
2021-11-03T04:25:43.102103535Z I1103 04:25:43.101919       1 cmd.go:41] &{<nil> true {false} prune true map[cert-dir:0xc0005fe960 max-eligible-revision:0xc0005fe6e0 protected-revisions:0xc0005fe780 resource-dir:0xc0005fe820 static-pod-name:0xc0005fe8c0 v:0xc00032cc80] [0xc00032cc80 0xc0005fe6e0 0xc0005fe780 0xc0005fe820 0xc0005fe960 0xc0005fe8c0] [] map[add-dir-header:0xc00032c5a0 alsologtostderr:0xc00032c640 cert-dir:0xc0005fe960 help:0xc0005fefa0 log-backtrace-at:0xc00032c6e0 log-dir:0xc00032c780 log-file:0xc00032c820 log-file-max-size:0xc00032c8c0 log-flush-frequency:0xc000719680 logtostderr:0xc00032c960 max-eligible-revision:0xc0005fe6e0 one-output:0xc00032ca00 protected-revisions:0xc0005fe780 resource-dir:0xc0005fe820 skip-headers:0xc00032caa0 skip-log-headers:0xc00032cb40 static-pod-name:0xc0005fe8c0 stderrthreshold:0xc00032cbe0 v:0xc00032cc80 vmodule:0xc00032cd20] [0xc0005fe6e0 0xc0005fe780 0xc0005fe820 0xc0005fe8c0 0xc0005fe960 0xc00032c5a0 0xc00032c640 0xc00032c6e0 0xc00032c780 0xc00032c820 0xc00032c8c0 0xc000719680 0xc00032c960 0xc00032ca00 0xc00032caa0 0xc00032cb40 0xc00032cbe0 0xc00032cc80 0xc00032cd20 0xc0005fefa0] [0xc00032c5a0 0xc00032c640 0xc0005fe960 0xc0005fefa0 0xc00032c6e0 0xc00032c780 0xc00032c820 0xc00032c8c0 0xc000719680 0xc00032c960 0xc0005fe6e0 0xc00032ca00 0xc0005fe780 0xc0005fe820 0xc00032caa0 0xc00032cb40 0xc0005fe8c0 0xc00032cbe0 0xc00032cc80 0xc00032cd20] map[104:0xc0005fefa0 118:0xc00032cc80] [] -1 0 0xc0005caff0 true <nil> []}
2021-11-03T04:25:43.102190905Z I1103 04:25:43.102101       1 cmd.go:42] (*prune.PruneOptions)(0xc0005e4550)({
2021-11-03T04:25:43.102190905Z  MaxEligibleRevision: (int) 7,
2021-11-03T04:25:43.102190905Z  ProtectedRevisions: ([]int) (len=6 cap=6) {
2021-11-03T04:25:43.102190905Z   (int) 2,
2021-11-03T04:25:43.102190905Z   (int) 3,
2021-11-03T04:25:43.102190905Z   (int) 4,
2021-11-03T04:25:43.102190905Z   (int) 5,
2021-11-03T04:25:43.102190905Z   (int) 6,
2021-11-03T04:25:43.102190905Z   (int) 7
2021-11-03T04:25:43.102190905Z  },
2021-11-03T04:25:43.102190905Z  ResourceDir: (string) (len=36) "/etc/kubernetes/static-pod-resources",
2021-11-03T04:25:43.102190905Z  CertDir: (string) (len=20) "kube-scheduler-certs",
2021-11-03T04:25:43.102190905Z  StaticPodName: (string) (len=18) "kube-scheduler-pod"
2021-11-03T04:25:43.102190905Z })
2021-11-03T04:25:43.102203390Z F1103 04:25:43.102194       1 cmd.go:48] lstat /etc/kubernetes/static-pod-resources/kube-scheduler-certs: no such file or directory
2021-11-03T04:25:43.194947275Z goroutine 1 [running]:
2021-11-03T04:25:43.194947275Z k8s.io/klog/v2.stacks(0xc000012001, 0xc0001c81c0, 0x84, 0xda)
2021-11-03T04:25:43.194947275Z     k8s.io/klog/v2.0/klog.go:1026 +0xb9
2021-11-03T04:25:43.194947275Z k8s.io/klog/v2.(*loggingT).output(0x3a8bd60, 0xc000000003, 0x0, 0x0, 0xc0003c81c0, 0x1, 0x2f628ac, 0x6, 0x30, 0x414600)
2021-11-03T04:25:43.194947275Z     k8s.io/klog/v2.0/klog.go:975 +0x1e5
2021-11-03T04:25:43.194947275Z k8s.io/klog/v2.(*loggingT).printDepth(0x3a8bd60, 0xc000000003, 0x0, 0x0, 0x0, 0x0, 0x1, 0xc000496910, 0x1, 0x1)
2021-11-03T04:25:43.194947275Z     k8s.io/klog/v2.0/klog.go:735 +0x185
2021-11-03T04:25:43.194947275Z k8s.io/klog/v2.(*loggingT).print(...)
2021-11-03T04:25:43.194947275Z     k8s.io/klog/v2.0/klog.go:717
2021-11-03T04:25:43.194947275Z k8s.io/klog/v2.Fatal(...)
2021-11-03T04:25:43.194947275Z     k8s.io/klog/v2.0/klog.go:1494
2021-11-03T04:25:43.194947275Z github.com/openshift/library-go/pkg/operator/staticpod/prune.NewPrune.func1(0xc000267680, 0xc0003cb680, 0x0, 0x6)
2021-11-03T04:25:43.194947275Z     github.com/openshift/library-go.0-20210915142033-188c3c82f817/pkg/operator/staticpod/prune/cmd.go:48 +0x3aa
2021-11-03T04:25:43.194947275Z github.com/spf13/cobra.(*Command).execute(0xc000267680, 0xc0003cb620, 0x6, 0x6, 0xc000267680, 0xc0003cb620)
2021-11-03T04:25:43.194947275Z     github.com/spf13/cobra.3/command.go:856 +0x2c2
2021-11-03T04:25:43.194947275Z github.com/spf13/cobra.(*Command).ExecuteC(0xc000266c80, 0xc000056080, 0xc000266c80, 0xc000000180)
2021-11-03T04:25:43.194947275Z     github.com/spf13/cobra.3/command.go:960 +0x375
2021-11-03T04:25:43.194947275Z github.com/spf13/cobra.(*Command).Execute(...)
2021-11-03T04:25:43.194947275Z     github.com/spf13/cobra.3/command.go:897
2021-11-03T04:25:43.194947275Z main.main()
2021-11-03T04:25:43.194947275Z     github.com/openshift/cluster-kube-scheduler-operator/cmd/cluster-kube-scheduler-operator/main.go:34 +0x176
2021-11-03T04:25:43.194947275Z 
2021-11-03T04:25:43.194947275Z goroutine 6 [chan receive]:
2021-11-03T04:25:43.194947275Z k8s.io/klog/v2.(*loggingT).flushDaemon(0x3a8bd60)
2021-11-03T04:25:43.194947275Z     k8s.io/klog/v2.0/klog.go:1169 +0x8b
2021-11-03T04:25:43.194947275Z created by k8s.io/klog/v2.init.0
2021-11-03T04:25:43.194947275Z     k8s.io/klog/v2.0/klog.go:420 +0xdf
2021-11-03T04:25:43.194947275Z 
2021-11-03T04:25:43.194947275Z goroutine 64 [runnable]:
2021-11-03T04:25:43.194947275Z k8s.io/apimachinery/pkg/util/wait.Forever(0x27490f0, 0x12a05f200)
2021-11-03T04:25:43.194947275Z     k8s.io/apimachinery.1/pkg/util/wait/wait.go:80
2021-11-03T04:25:43.194947275Z created by k8s.io/component-base/logs.InitLogs
2021-11-03T04:25:43.194947275Z     k8s.io/component-base.1/logs/logs.go:58 +0x8a
~~~

The linked support case has a must-gather available for further investigation.
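
For context on why the pod reports Failed rather than just logging an error: the prune command treats the missing directory as fatal and exits non-zero. The following Go sketch is only an approximation of that failure path; the function and its wiring are illustrative, not the actual library-go cmd.go code.

```go
// Illustrative sketch of the pruner failure path; not the actual
// library-go pkg/operator/staticpod/prune implementation.
package prunesketch

import (
	"os"
	"path/filepath"

	"k8s.io/klog/v2"
)

// pruneCertDir walks the cert directory under the static-pod resource dir.
// If the installer pod has not created the directory yet, filepath.Walk
// surfaces an lstat "no such file or directory" error, and klog.Fatal turns
// that into the non-zero exit that makes the revision-pruner pod report Failed.
func pruneCertDir(resourceDir, certDir string) {
	dir := filepath.Join(resourceDir, certDir) // e.g. /etc/kubernetes/static-pod-resources/kube-scheduler-certs
	err := filepath.Walk(dir, func(path string, info os.FileInfo, err error) error {
		if err != nil {
			return err // for a missing root this is the lstat error quoted in the log above
		}
		// ... actual pruning of old revision files elided ...
		return nil
	})
	if err != nil {
		klog.Fatal(err) // logs the F-line and exits with a non-zero status
	}
}
```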

Comment 1 Jan Chaloupka 2021-11-25 13:51:51 UTC
From openshift-kube-scheduler/pods/installer-7-ip-10-0-17-153.us-west-2.compute.internal/installer/installer/logs/current.log:
```

2021-11-03T04:26:41.149182411Z I1103 04:26:41.148843       1 cmd.go:186] Creating target resource directory "/etc/kubernetes/static-pod-resources/kube-scheduler-certs" ...
2021-11-03T04:26:41.149182411Z I1103 04:26:41.148914       1 cmd.go:194] Getting secrets ...
2021-11-03T04:26:41.275975557Z I1103 04:26:41.275933       1 copy.go:32] Got secret openshift-kube-scheduler/kube-scheduler-client-cert-key
2021-11-03T04:26:41.276021276Z I1103 04:26:41.275976       1 cmd.go:207] Getting config maps ...
2021-11-03T04:26:41.276021276Z I1103 04:26:41.275985       1 cmd.go:226] Creating directory "/etc/kubernetes/static-pod-resources/kube-scheduler-certs/secrets/kube-scheduler-client-cert-key" ...
2021-11-03T04:26:41.276127775Z I1103 04:26:41.276101       1 cmd.go:449] Writing secret manifest "/etc/kubernetes/static-pod-resources/kube-scheduler-certs/secrets/kube-scheduler-client-cert-key/tls.crt" ...
2021-11-03T04:26:41.276210474Z I1103 04:26:41.276192       1 cmd.go:449] Writing secret manifest "/etc/kubernetes/static-pod-resources/kube-scheduler-certs/secrets/kube-scheduler-client-cert-key/tls.key" ...
```

revision-pruner-7-ip-10-0-17-153.us-west-2.compute.internal panics at 04:25:43.102194.
```
2021-11-03T04:25:43.102203390Z F1103 04:25:43.102194       1 cmd.go:48] lstat /etc/kubernetes/static-pod-resources/kube-scheduler-certs: no such file or directory
```

So the missing /etc/kubernetes/static-pod-resources/kube-scheduler-certs directory does get created eventually, just not soon enough for the pruner.

From the openshift-kube-scheduler-operator:
```
2021-11-03T04:25:39.184994392Z I1103 04:25:39.184944       1 request.go:665] Waited for 1.192417299s due to client-side throttling, not priority and fairness, request: GET:https://172.30.0.1:443/api/v1/namespaces/openshift-kube-scheduler/pods/revision-pruner-7-ip-10-0-17-153.us-west-2.compute.internal
2021-11-03T04:25:40.383204649Z I1103 04:25:40.383152       1 request.go:665] Waited for 1.190625443s due to client-side throttling, not priority and fairness, request: POST:https://172.30.0.1:443/api/v1/namespaces/openshift-kube-scheduler/pods
2021-11-03T04:25:40.404609063Z I1103 04:25:40.404559       1 event.go:282] Event(v1.ObjectReference{Kind:"Deployment", Namespace:"openshift-kube-scheduler-operator", Name:"openshift-kube-scheduler-operator", UID:"6bb5dcf7-ffe6-4ace-a208-a476379fb082", APIVersion:"apps/v1", ResourceVersion:"", FieldPath:""}): type: 'Normal' reason: 'PodCreated' Created Pod/revision-pruner-7-ip-10-0-17-153.us-west-2.compute.internal -n openshift-kube-scheduler because it was missing
2021-11-03T04:26:35.508527549Z I1103 04:26:35.508455       1 request.go:665] Waited for 1.179951135s due to client-side throttling, not priority and fairness, request: GET:https://172.30.0.1:443/api/v1/namespaces/openshift-kube-scheduler/pods/installer-7-ip-10-0-17-153.us-west-2.compute.internal
2021-11-03T04:26:36.557796191Z I1103 04:26:36.556367       1 event.go:282] Event(v1.ObjectReference{Kind:"Deployment", Namespace:"openshift-kube-scheduler-operator", Name:"openshift-kube-scheduler-operator", UID:"6bb5dcf7-ffe6-4ace-a208-a476379fb082", APIVersion:"apps/v1", ResourceVersion:"", FieldPath:""}): type: 'Normal' reason: 'PodCreated' Created Pod/installer-7-ip-10-0-17-153.us-west-2.compute.internal -n openshift-kube-scheduler because it was missing
2021-11-03T04:26:36.722600664Z I1103 04:26:36.722546       1 request.go:665] Waited for 1.010649519s due to client-side throttling, not priority and fairness, request: GET:https://172.30.0.1:443/api/v1/namespaces/openshift-kube-scheduler/pods/revision-pruner-7-ip-10-0-16-77.us-west-2.compute.internal
2021-11-03T04:26:37.592274527Z I1103 04:26:37.592233       1 installer_controller.go:512] "ip-10-0-17-153.us-west-2.compute.internal" is in transition to 7, but has not made progress because installer is not finished, but in Pending phase
2021-11-03T04:26:37.904840556Z I1103 04:26:37.904787       1 request.go:665] Waited for 1.124217196s due to client-side throttling, not priority and fairness, request: GET:https://172.30.0.1:443/api/v1/namespaces/openshift-kube-scheduler/pods/revision-pruner-7-ip-10-0-17-153.us-west-2.compute.internal
2021-11-03T04:26:38.906680674Z I1103 04:26:38.906637       1 request.go:665] Waited for 1.310842986s due to client-side throttling, not priority and fairness, request: GET:https://172.30.0.1:443/api/v1/namespaces/openshift-kube-scheduler/pods/installer-7-ip-10-0-17-153.us-west-2.compute.internal
2021-11-03T04:26:40.111645240Z I1103 04:26:40.111607       1 request.go:665] Waited for 1.164360997s due to client-side throttling, not priority and fairness, request: GET:https://172.30.0.1:443/api/v1/namespaces/openshift-kube-scheduler/pods/installer-7-ip-10-0-17-153.us-west-2.compute.internal
```

The operator created the pruner pod almost a minute before the installer pod:
```
2021-11-03T04:25:40.404609063Z I1103 04:25:40.404559       1 event.go:282] Event(v1.ObjectReference{Kind:"Deployment", Namespace:"openshift-kube-scheduler-operator", Name:"openshift-kube-scheduler-operator", UID:"6bb5dcf7-ffe6-4ace-a208-a476379fb082", APIVersion:"apps/v1", ResourceVersion:"", FieldPath:""}): type: 'Normal' reason: 'PodCreated' Created Pod/revision-pruner-7-ip-10-0-17-153.us-west-2.compute.internal -n openshift-kube-scheduler because it was missing
2021-11-03T04:26:36.557796191Z I1103 04:26:36.556367       1 event.go:282] Event(v1.ObjectReference{Kind:"Deployment", Namespace:"openshift-kube-scheduler-operator", Name:"openshift-kube-scheduler-operator", UID:"6bb5dcf7-ffe6-4ace-a208-a476379fb082", APIVersion:"apps/v1", ResourceVersion:"", FieldPath:""}): type: 'Normal' reason: 'PodCreated' Created Pod/installer-7-ip-10-0-17-153.us-west-2.compute.internal -n openshift-kube-scheduler because it was missing
```

The static pod builder creates both the pruner and the installer as two independently running controllers. The pruner has no check saying "wait until the installer pod of my revision finishes", so there is no safeguard to avoid this incident.

We have two options:
- have the pruner pod wait for the installer pod to finish (or for the required directories to appear), e.g. for up to a few minutes before it fails (see the sketch below)
- update the pruner controller so it does not create the pruner pod until the corresponding installer pod has finished
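
A minimal sketch of the first option, assuming a hypothetical waitForCertDir step executed by the pruner before it starts pruning; wait.PollImmediate comes from the k8s.io/apimachinery wait package visible in the stack trace above:

```go
// Sketch only: a hypothetical pre-flight wait, not code that exists in library-go.
package prunesketch

import (
	"os"
	"path/filepath"
	"time"

	"k8s.io/apimachinery/pkg/util/wait"
	"k8s.io/klog/v2"
)

// waitForCertDir polls until the installer pod has created the cert directory,
// instead of failing on the first lstat error. It gives up after `timeout`.
func waitForCertDir(resourceDir, certDir string, timeout time.Duration) error {
	dir := filepath.Join(resourceDir, certDir)
	return wait.PollImmediate(5*time.Second, timeout, func() (bool, error) {
		_, err := os.Stat(dir)
		if os.IsNotExist(err) {
			klog.Infof("waiting for %q to be created by the installer pod", dir)
			return false, nil // not there yet, keep polling
		}
		if err != nil {
			return false, err // unexpected error, stop waiting
		}
		return true, nil // directory exists, pruning can proceed
	})
}
```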

Comment 2 Jan Chaloupka 2021-11-25 15:19:48 UTC
The issue was reported yesterday. It will need another sprint to properly implement the changes.

Comment 3 Jan Chaloupka 2021-11-26 09:22:57 UTC
Based on the provided oc get pods output for ip-10-0-17-153.us-west-2.compute.internal:

~~~
$ omg get pods -o wide
NAME                                                                READY  STATUS     RESTARTS  AGE    IP           NODE
installer-7-ip-10-0-17-153.us-west-2.compute.internal               0/1    Succeeded  0         9h30m  10.130.0.27  ip-10-0-17-153.us-west-2.compute.internal
installer-8-ip-10-0-17-153.us-west-2.compute.internal               0/1    Succeeded  0         9h30m  10.130.0.28  ip-10-0-17-153.us-west-2.compute.internal
openshift-kube-scheduler-ip-10-0-17-153.us-west-2.compute.internal  3/3    Running    0         9h29m  10.0.17.153  ip-10-0-17-153.us-west-2.compute.internal
revision-pruner-6-ip-10-0-17-153.us-west-2.compute.internal         0/1    Failed     0         9h32m  10.130.0.21  ip-10-0-17-153.us-west-2.compute.internal
revision-pruner-7-ip-10-0-17-153.us-west-2.compute.internal         0/1    Failed     0         9h31m  10.130.0.24  ip-10-0-17-153.us-west-2.compute.internal
revision-pruner-8-ip-10-0-17-153.us-west-2.compute.internal         0/1    Succeeded  0         9h30m  10.130.0.29  ip-10-0-17-153.us-west-2.compute.internal
~~~

The pruner does eventually run to completion (revision-pruner-8 succeeded).

Comment 4 Jan Chaloupka 2021-11-26 09:29:15 UTC
In this case the failing pruner does not cause any issues. The pruner has nothing to prune in the non-existent directory; it just panics. Once the PR is merged, the panic disappears.
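
For illustration, the guard described by the linked library-go pull request ("staticpod pruner: check if the cert directory exists") amounts to something like the sketch below; the function name and structure are illustrative, not the PR's actual diff:

```go
// Sketch of a cert-directory existence check; names are illustrative.
package prunesketch

import (
	"os"
	"path/filepath"

	"k8s.io/klog/v2"
)

// pruneCertDirIfPresent skips cert pruning gracefully when the directory has
// not been created yet, instead of treating the missing directory as fatal.
func pruneCertDirIfPresent(resourceDir, certDir string) error {
	dir := filepath.Join(resourceDir, certDir)
	if _, err := os.Stat(dir); os.IsNotExist(err) {
		klog.Infof("cert directory %q does not exist yet, nothing to prune", dir)
		return nil // the installer pod will create it later
	} else if err != nil {
		return err // some other, unexpected error
	}
	// ... prune old revisions under dir as before ...
	return nil
}
```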

Comment 5 Jan Chaloupka 2021-11-29 10:04:51 UTC
Neil, as it appears, this issue has no impact on the functionality of the pruner, so we will not be fixing it in 4.9. The missing cert directory will eventually be created by one of the installer pods; once that is done, the pruner stops failing with the reported stack trace. Would it be sufficient for the customer to have this fixed in 4.10, with the explanation I provided?

Comment 6 Neil Girard 2021-11-29 12:25:58 UTC
Hello Jan, that is acceptable. I'll let the customer know. Thanks for looking into it.

Comment 10 Yuri Obshansky 2022-01-24 21:30:40 UTC
I test the Assisted Service and hit a similar problem when running version 4.9.15.
See https://issues.redhat.com/browse/MGMT-9036 for more information.
There is no problem with OCP 4.10.0-fc.0.

Comment 11 RamaKasturi 2022-01-25 10:57:13 UTC
Moving the bug to the verified state as I did not see the revision pruner in a failed state during a new cluster install. I will reopen the bug if I hit it during any future installs.

[knarra@knarra ~]$ oc get clusterversion
NAME      VERSION                              AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.10.0-0.nightly-2022-01-24-070025   True        False         4h52m   Cluster version is 4.10.0-0.nightly-2022-01-24-070025

[knarra@knarra ~]$ oc get pods -n openshift-kube-scheduler
NAME                                                                             READY   STATUS      RESTARTS   AGE
installer-5-ip-10-0-189-93.ap-northeast-1.compute.internal                       0/1     Completed   0          5h11m
installer-5-ip-10-0-204-104.ap-northeast-1.compute.internal                      0/1     Completed   0          5h10m
installer-6-ip-10-0-156-150.ap-northeast-1.compute.internal                      0/1     Completed   0          5h4m
installer-6-ip-10-0-204-104.ap-northeast-1.compute.internal                      0/1     Completed   0          5h9m
installer-7-ip-10-0-156-150.ap-northeast-1.compute.internal                      0/1     Completed   0          5h4m
installer-7-ip-10-0-189-93.ap-northeast-1.compute.internal                       0/1     Completed   0          5h3m
installer-8-ip-10-0-156-150.ap-northeast-1.compute.internal                      0/1     Completed   0          4h59m
installer-8-ip-10-0-189-93.ap-northeast-1.compute.internal                       0/1     Completed   0          5h2m
installer-8-ip-10-0-204-104.ap-northeast-1.compute.internal                      0/1     Completed   0          5h1m
installer-9-ip-10-0-156-150.ap-northeast-1.compute.internal                      0/1     Completed   0          4h36m
installer-9-ip-10-0-189-93.ap-northeast-1.compute.internal                       0/1     Completed   0          4h35m
installer-9-ip-10-0-204-104.ap-northeast-1.compute.internal                      0/1     Completed   0          4h34m
openshift-kube-scheduler-guard-ip-10-0-156-150.ap-northeast-1.compute.internal   1/1     Running     0          5h13m
openshift-kube-scheduler-guard-ip-10-0-189-93.ap-northeast-1.compute.internal    1/1     Running     0          5h12m
openshift-kube-scheduler-guard-ip-10-0-204-104.ap-northeast-1.compute.internal   1/1     Running     0          5h10m
openshift-kube-scheduler-ip-10-0-156-150.ap-northeast-1.compute.internal         3/3     Running     0          4h36m
openshift-kube-scheduler-ip-10-0-189-93.ap-northeast-1.compute.internal          3/3     Running     0          4h35m
openshift-kube-scheduler-ip-10-0-204-104.ap-northeast-1.compute.internal         3/3     Running     0          4h33m
revision-pruner-8-ip-10-0-156-150.ap-northeast-1.compute.internal                0/1     Completed   0          5h1m
revision-pruner-8-ip-10-0-189-93.ap-northeast-1.compute.internal                 0/1     Completed   0          5h1m
revision-pruner-8-ip-10-0-204-104.ap-northeast-1.compute.internal                0/1     Completed   0          5h1m
revision-pruner-9-ip-10-0-156-150.ap-northeast-1.compute.internal                0/1     Completed   0          4h36m
revision-pruner-9-ip-10-0-189-93.ap-northeast-1.compute.internal                 0/1     Completed   0          4h36m
revision-pruner-9-ip-10-0-204-104.ap-northeast-1.compute.internal                0/1     Completed   0          4h36m

Comment 12 Yuri Obshansky 2022-01-25 16:32:37 UTC
Hi @knarra
As I reported earlier, the issue is reproducible on OCP 4.9.15.
Are you going to backport this to a 4.9.* version?
Thank you

Comment 13 RamaKasturi 2022-01-25 17:17:28 UTC
(In reply to Yuri Obshansky from comment #12)
> Hi @knarra
> As I reported earlier, the issue is reproducible on OCP 4.9.15.
> Are you going to backport this to a 4.9.* version?
> Thank you

Hello Yuri,

   Yes, I already see that the 4.9.z bug is in the POST state; please see https://bugzilla.redhat.com/show_bug.cgi?id=2044622

Thanks
kasturi

Comment 14 Yuri Obshansky 2022-01-25 18:33:40 UTC
Thank you for the update.
Yuri
(In reply to RamaKasturi from comment #13)
> (In reply to Yuri Obshansky from comment #12)
> > Hi @knarra
> > As I reported earlier, the issue is reproducible on OCP 4.9.15.
> > Are you going to backport this to a 4.9.* version?
> > Thank you
> 
> Hello Yuri,
> 
>    Yes, I already see that the 4.9.z bug is in the POST state; please see
> https://bugzilla.redhat.com/show_bug.cgi?id=2044622
> 
> Thanks
> kasturi

Comment 16 RamaKasturi 2022-02-03 07:28:34 UTC
Hello Yuri,

    I am trying to verify the 4.9.z bug and wanted to understand which cloud provider you hit the issue on, so that I can try to reproduce it.

Thanks
kasturi

Comment 17 Yuri Obshansky 2022-02-07 14:28:51 UTC
(In reply to RamaKasturi from comment #16)
> Hello Yuri,
> 
>     I am trying to verify the 4.9.z bug and wanted to understand which cloud
> provider you hit the issue on, so that I can try to reproduce it.
> 
> Thanks
> kasturi

Hi kasturi,

We test the Assisted Service cloud approach to installing OpenShift.
https://qaprodauth.cloud.redhat.com/openshift/assisted-installer/clusters

Here is repo -> https://github.com/openshift/assisted-service

https://cloud.redhat.com/blog/using-the-openshift-assisted-installer-service-to-deploy-an-openshift-cluster-on-metal-and-vsphere

Let me know if you need more info

Thanks
Yuri

Comment 18 RamaKasturi 2022-02-07 16:23:06 UTC
(In reply to Yuri Obshansky from comment #17)
> (In reply to RamaKasturi from comment #16)
> > Hello Yuri,
> > 
> >     I am trying to verify the 4.9.z bug and wanted to understand which cloud
> > provider you hit the issue on, so that I can try to reproduce it.
> > 
> > Thanks
> > kasturi
> 
> Hi kasturi,
> 
> We test the Assisted Service cloud approach to installing OpenShift.
> https://qaprodauth.cloud.redhat.com/openshift/assisted-installer/clusters
> 
> Here is repo -> https://github.com/openshift/assisted-service
> 
> https://cloud.redhat.com/blog/using-the-openshift-assisted-installer-service-to-deploy-an-openshift-cluster-on-metal-and-vsphere
> 
> Let me know if you need more info
> 
> Thanks
> Yuri

Okay, thank you.

Could you please check whether your issue is resolved in the new 4.9 build?

Comment 19 Yuri Obshansky 2022-02-07 17:26:39 UTC
(In reply to RamaKasturi from comment #18)
> 
> Okay, thank you.
> 
> Could you please check whether your issue is resolved in the new 4.9 build?

We can only test the version that is deployed on the cloud.
I cannot update its version.
Right now it is 4.9.17.
In which version is the issue fixed?

Comment 20 RamaKasturi 2022-02-08 06:46:21 UTC
(In reply to Yuri Obshansky from comment #19)
> (In reply to RamaKasturi from comment #18)
> > 
> > okay, thank you. 
> > 
> > Could you please help try if your issue is resolved in the new 4.9 build?
> 
> We can only test the version that is deployed on the cloud.
> I cannot update its version.
> Right now it is 4.9.17.
> In which version is the issue fixed?

The issue is fixed in the 4.9.19 build. Here is the complete change log: https://openshift-release.apps.ci.l2s4.p1.openshiftapps.com/releasestream/4-stable/release/4.9.19 and here is the 4.9 bug, which I have moved to the verified state: https://bugzilla.redhat.com/show_bug.cgi?id=2044622

Comment 21 Yuri Obshansky 2022-02-08 16:05:17 UTC
(In reply to RamaKasturi from comment #20)
> The issue is fixed in the 4.9.19 build. Here is the complete change log:
> https://openshift-release.apps.ci.l2s4.p1.openshiftapps.com/releasestream/4-stable/release/4.9.19
> and here is the 4.9 bug, which I have moved to the verified state:
> https://bugzilla.redhat.com/show_bug.cgi?id=2044622
Great. I'll verify the issue when we get image 4.9.19 on Staging
and will update Bugzilla with the results.

Comment 23 errata-xmlrpc 2022-03-12 04:39:01 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.10.3 security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:0056

Comment 24 RamaKasturi 2022-11-16 11:10:59 UTC
Hello Yuri,

   Could you please help set the right test coverage flag here?

Thanks
kasturi

Comment 25 RamaKasturi 2022-11-30 06:11:37 UTC
Hello Yuri,

   Is there any reason you cleared the needinfo without setting the test coverage flag here? I did not see any comments mentioning why; could you please help me understand?

Thanks
kasturi

Comment 27 Yuri Obshansky 2022-11-30 14:07:26 UTC
Hello Rama, 

We do not have image 4.9.19 on our Staging setup,
so I cannot verify this bug.
Please find attached a screenshot with the list of images on Staging.
Let me know which version would be good for bug verification.

Thank you
Yuri

Comment 29 RamaKasturi 2022-11-30 15:34:53 UTC
Hi Yuri,

   Please help verify with 4.9.37, as I understand the fix should be present there.

Thanks
kasturi

Comment 30 Yuri Obshansky 2022-11-30 21:00:58 UTC
Hi, 

Just verified with 4.9.37

# oc get clusterversion
NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.9.37    True        False         93m     Cluster version is 4.9.37
# oc get pods -n openshift-kube-scheduler
NAME                                  READY   STATUS      RESTARTS   AGE
installer-3-master-0-0                0/1     Completed   0          112m
installer-4-master-0-0                0/1     Completed   0          111m
installer-5-master-0-0                0/1     Completed   0          110m
installer-5-master-0-2                0/1     Completed   0          108m
installer-6-master-0-0                0/1     Completed   0          105m
installer-6-master-0-1                0/1     Completed   0          101m
installer-6-master-0-2                0/1     Completed   0          104m
openshift-kube-scheduler-master-0-0   3/3     Running     0          105m
openshift-kube-scheduler-master-0-1   3/3     Running     0          99m
openshift-kube-scheduler-master-0-2   3/3     Running     0          104m
revision-pruner-6-master-0-0          0/1     Completed   0          103m
revision-pruner-6-master-0-1          0/1     Completed   0          101m
revision-pruner-6-master-0-2          0/1     Completed   0          103m

Issue resolved.

Comment 31 RamaKasturi 2022-12-01 07:42:26 UTC
Hello Yuri,

   Thanks for verifying. Could you please help set the right test coverage flag here?

Thanks
kasturi

Comment 32 Yuri Obshansky 2022-12-01 16:59:04 UTC
Hi,
Honestly, I do not know what the qe_test_coverage flag should be set to.
Please discuss with your QE managers.
Sorry about that.
Thank you

Comment 33 RamaKasturi 2022-12-02 11:44:04 UTC
Hello Yuri,

   Since the issue happens during installation with the Assisted Installer, I was wondering if you have a test case added to check that. If you have a test case added, please set '+' in the qe_test_coverage flag; otherwise set '-' explaining why you think we do not need to add a case for this.

Thanks
kasturi

Comment 34 Yuri Obshansky 2022-12-02 13:47:09 UTC
Hi, 

This issue is not specific to the Assisted Service deployment.
We do not need to add a special test case for it;
it should probably be covered by the regular cluster deployment flow.

Thank you
Yuri

