Bug 2026352
Summary: | Kube-Scheduler revision-pruner fails during install of new cluster | ||
---|---|---|---|
Product: | OpenShift Container Platform | Reporter: | Neil Girard <ngirard> |
Component: | kube-scheduler | Assignee: | Jan Chaloupka <jchaloup> |
Status: | CLOSED ERRATA | QA Contact: | RamaKasturi <knarra> |
Severity: | low | Docs Contact: | |
Priority: | low | ||
Version: | 4.9 | CC: | aos-bugs, mfojtik, yobshans |
Target Milestone: | --- | ||
Target Release: | 4.10.0 | ||
Hardware: | Unspecified | ||
OS: | Unspecified | ||
Whiteboard: | |||
Fixed In Version: | | Doc Type: | No Doc Update |
Doc Text: | | Story Points: | --- |
Clone Of: | | Environment: | |
Last Closed: | 2022-03-12 04:39:01 UTC | Type: | Bug |
Bug Blocks: | 2044622 |
Description
Neil Girard
2021-11-24 12:35:49 UTC
From openshift-kube-scheduler/pods/installer-7-ip-10-0-17-153.us-west-2.compute.internal/installer/installer/logs/current.log:

```
2021-11-03T04:26:41.149182411Z I1103 04:26:41.148843 1 cmd.go:186] Creating target resource directory "/etc/kubernetes/static-pod-resources/kube-scheduler-certs" ...
2021-11-03T04:26:41.149182411Z I1103 04:26:41.148914 1 cmd.go:194] Getting secrets ...
2021-11-03T04:26:41.275975557Z I1103 04:26:41.275933 1 copy.go:32] Got secret openshift-kube-scheduler/kube-scheduler-client-cert-key
2021-11-03T04:26:41.276021276Z I1103 04:26:41.275976 1 cmd.go:207] Getting config maps ...
2021-11-03T04:26:41.276021276Z I1103 04:26:41.275985 1 cmd.go:226] Creating directory "/etc/kubernetes/static-pod-resources/kube-scheduler-certs/secrets/kube-scheduler-client-cert-key" ...
2021-11-03T04:26:41.276127775Z I1103 04:26:41.276101 1 cmd.go:449] Writing secret manifest "/etc/kubernetes/static-pod-resources/kube-scheduler-certs/secrets/kube-scheduler-client-cert-key/tls.crt" ...
2021-11-03T04:26:41.276210474Z I1103 04:26:41.276192 1 cmd.go:449] Writing secret manifest "/etc/kubernetes/static-pod-resources/kube-scheduler-certs/secrets/kube-scheduler-client-cert-key/tls.key" ...
```

revision-pruner-7-ip-10-0-17-153.us-west-2.compute.internal panics at 04:25:43.102194:

```
2021-11-03T04:25:43.102203390Z F1103 04:25:43.102194 1 cmd.go:48] lstat /etc/kubernetes/static-pod-resources/kube-scheduler-certs: no such file or directory
```

So the missing /etc/kubernetes/static-pod-resources/kube-scheduler-certs directory is created eventually, just not quickly enough.

From the openshift-kube-scheduler-operator:

```
2021-11-03T04:25:39.184994392Z I1103 04:25:39.184944 1 request.go:665] Waited for 1.192417299s due to client-side throttling, not priority and fairness, request: GET:https://172.30.0.1:443/api/v1/namespaces/openshift-kube-scheduler/pods/revision-pruner-7-ip-10-0-17-153.us-west-2.compute.internal
2021-11-03T04:25:40.383204649Z I1103 04:25:40.383152 1 request.go:665] Waited for 1.190625443s due to client-side throttling, not priority and fairness, request: POST:https://172.30.0.1:443/api/v1/namespaces/openshift-kube-scheduler/pods
2021-11-03T04:25:40.404609063Z I1103 04:25:40.404559 1 event.go:282] Event(v1.ObjectReference{Kind:"Deployment", Namespace:"openshift-kube-scheduler-operator", Name:"openshift-kube-scheduler-operator", UID:"6bb5dcf7-ffe6-4ace-a208-a476379fb082", APIVersion:"apps/v1", ResourceVersion:"", FieldPath:""}): type: 'Normal' reason: 'PodCreated' Created Pod/revision-pruner-7-ip-10-0-17-153.us-west-2.compute.internal -n openshift-kube-scheduler because it was missing
2021-11-03T04:26:35.508527549Z I1103 04:26:35.508455 1 request.go:665] Waited for 1.179951135s due to client-side throttling, not priority and fairness, request: GET:https://172.30.0.1:443/api/v1/namespaces/openshift-kube-scheduler/pods/installer-7-ip-10-0-17-153.us-west-2.compute.internal
2021-11-03T04:26:36.557796191Z I1103 04:26:36.556367 1 event.go:282] Event(v1.ObjectReference{Kind:"Deployment", Namespace:"openshift-kube-scheduler-operator", Name:"openshift-kube-scheduler-operator", UID:"6bb5dcf7-ffe6-4ace-a208-a476379fb082", APIVersion:"apps/v1", ResourceVersion:"", FieldPath:""}): type: 'Normal' reason: 'PodCreated' Created Pod/installer-7-ip-10-0-17-153.us-west-2.compute.internal -n openshift-kube-scheduler because it was missing
2021-11-03T04:26:36.722600664Z I1103 04:26:36.722546 1 request.go:665] Waited for 1.010649519s due to client-side throttling, not priority and fairness, request: GET:https://172.30.0.1:443/api/v1/namespaces/openshift-kube-scheduler/pods/revision-pruner-7-ip-10-0-16-77.us-west-2.compute.internal
2021-11-03T04:26:37.592274527Z I1103 04:26:37.592233 1 installer_controller.go:512] "ip-10-0-17-153.us-west-2.compute.internal" is in transition to 7, but has not made progress because installer is not finished, but in Pending phase
2021-11-03T04:26:37.904840556Z I1103 04:26:37.904787 1 request.go:665] Waited for 1.124217196s due to client-side throttling, not priority and fairness, request: GET:https://172.30.0.1:443/api/v1/namespaces/openshift-kube-scheduler/pods/revision-pruner-7-ip-10-0-17-153.us-west-2.compute.internal
2021-11-03T04:26:38.906680674Z I1103 04:26:38.906637 1 request.go:665] Waited for 1.310842986s due to client-side throttling, not priority and fairness, request: GET:https://172.30.0.1:443/api/v1/namespaces/openshift-kube-scheduler/pods/installer-7-ip-10-0-17-153.us-west-2.compute.internal
2021-11-03T04:26:40.111645240Z I1103 04:26:40.111607 1 request.go:665] Waited for 1.164360997s due to client-side throttling, not priority and fairness, request: GET:https://172.30.0.1:443/api/v1/namespaces/openshift-kube-scheduler/pods/installer-7-ip-10-0-17-153.us-west-2.compute.internal
```

The operator created the pruner pod almost a minute before the installer pod:

```
2021-11-03T04:25:40.404609063Z I1103 04:25:40.404559 1 event.go:282] Event(v1.ObjectReference{Kind:"Deployment", Namespace:"openshift-kube-scheduler-operator", Name:"openshift-kube-scheduler-operator", UID:"6bb5dcf7-ffe6-4ace-a208-a476379fb082", APIVersion:"apps/v1", ResourceVersion:"", FieldPath:""}): type: 'Normal' reason: 'PodCreated' Created Pod/revision-pruner-7-ip-10-0-17-153.us-west-2.compute.internal -n openshift-kube-scheduler because it was missing
2021-11-03T04:26:36.557796191Z I1103 04:26:36.556367 1 event.go:282] Event(v1.ObjectReference{Kind:"Deployment", Namespace:"openshift-kube-scheduler-operator", Name:"openshift-kube-scheduler-operator", UID:"6bb5dcf7-ffe6-4ace-a208-a476379fb082", APIVersion:"apps/v1", ResourceVersion:"", FieldPath:""}): type: 'Normal' reason: 'PodCreated' Created Pod/installer-7-ip-10-0-17-153.us-west-2.compute.internal -n openshift-kube-scheduler because it was missing
```

The static pod builder creates the pruner and the installer as two independently running controllers. The pruner has no check saying "wait until the installer pod of my revision finishes", so there is no safeguard against this incident. We have two options:

- have the pruner pod wait for the installer pod to finish (or for the required directories to be present), e.g. for up to a few minutes before it fails (a sketch of this approach follows below)
- update the pruner controller so it does not create the pruner pod before the corresponding installer pod has finished

The issue was reported yesterday. It will need another sprint to properly implement the changes.
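To make the first option concrete, the following Go sketch polls for the installer-created resource directory instead of failing on the first lstat. This is only an illustration of the idea, not the actual library-go pruner code; the directory path, polling interval, and timeout are assumed values.

```go
// Illustrative sketch only: wait for the resource directory that the installer
// pod creates before starting to prune, instead of panicking on a missing lstat.
// Path, interval, and timeout are assumptions, not the operator's real values.
package main

import (
	"fmt"
	"os"
	"time"

	"k8s.io/apimachinery/pkg/util/wait"
)

// waitForResourceDir blocks until dir exists or the timeout expires.
func waitForResourceDir(dir string, timeout time.Duration) error {
	return wait.PollImmediate(10*time.Second, timeout, func() (bool, error) {
		_, err := os.Stat(dir)
		switch {
		case err == nil:
			return true, nil // installer has created the directory; safe to prune
		case os.IsNotExist(err):
			return false, nil // not there yet; keep polling instead of failing
		default:
			return false, err // unexpected error (e.g. permissions) aborts the wait
		}
	})
}

func main() {
	dir := "/etc/kubernetes/static-pod-resources/kube-scheduler-certs"
	if err := waitForResourceDir(dir, 5*time.Minute); err != nil {
		fmt.Fprintf(os.Stderr, "giving up waiting for %s: %v\n", dir, err)
		os.Exit(1)
	}
	fmt.Println("resource directory present, pruning can proceed")
}
```

In the pruner, such a wait would run at start-up, before any revision directories are read, so the transient race with the installer pod no longer ends in a panic; the second option instead moves the ordering into the pruner controller itself.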
Based on the provided oc get pods output for ip-10-0-17-153.us-west-2.compute.internal:

~~~
$ omg get pods -o wide
NAME                                                                READY  STATUS     RESTARTS  AGE    IP           NODE
installer-7-ip-10-0-17-153.us-west-2.compute.internal               0/1    Succeeded  0         9h30m  10.130.0.27  ip-10-0-17-153.us-west-2.compute.internal
installer-8-ip-10-0-17-153.us-west-2.compute.internal               0/1    Succeeded  0         9h30m  10.130.0.28  ip-10-0-17-153.us-west-2.compute.internal
openshift-kube-scheduler-ip-10-0-17-153.us-west-2.compute.internal  3/3    Running    0         9h29m  10.0.17.153  ip-10-0-17-153.us-west-2.compute.internal
revision-pruner-6-ip-10-0-17-153.us-west-2.compute.internal         0/1    Failed     0         9h32m  10.130.0.21  ip-10-0-17-153.us-west-2.compute.internal
revision-pruner-7-ip-10-0-17-153.us-west-2.compute.internal         0/1    Failed     0         9h31m  10.130.0.24  ip-10-0-17-153.us-west-2.compute.internal
revision-pruner-8-ip-10-0-17-153.us-west-2.compute.internal         0/1    Succeeded  0         9h30m  10.130.0.29  ip-10-0-17-153.us-west-2.compute.internal
~~~

The pruner eventually runs. In this case the failing pruner does not cause any issues: while the directory does not exist there is nothing for it to prune, it just panics. Once the PR is merged, the panic disappears.

Neil, as the output suggests, this issue has no impact on the functionality of the pruner, so we will not be fixing it in 4.9. The missing cert directory (which is reported as missing) will eventually get created by one of the installer pods. Once that happens, the pruner will stop failing with the reported stack trace. Would it be sufficient for the customer to have this fixed in 4.10 with the explanation I provided?

Hello Jan, that is acceptable. I'll let the customer know. Thanks for taking a look into it.

I test the Assisted Service and hit a similar problem when running version 4.9.15. See https://issues.redhat.com/browse/MGMT-9036 for more information. There is no problem with OCP 4.10.0-fc.0.

Moving the bug to verified state as I did not see the revision pruner in a failed state during a new install of a cluster. I will reopen the bug if I hit it during any of the installs.
[knarra@knarra ~]$ oc get clusterversion
NAME      VERSION                              AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.10.0-0.nightly-2022-01-24-070025   True        False         4h52m   Cluster version is 4.10.0-0.nightly-2022-01-24-070025

[knarra@knarra ~]$ oc get pods -n openshift-kube-scheduler
NAME                                                                             READY   STATUS      RESTARTS   AGE
installer-5-ip-10-0-189-93.ap-northeast-1.compute.internal                       0/1     Completed   0          5h11m
installer-5-ip-10-0-204-104.ap-northeast-1.compute.internal                      0/1     Completed   0          5h10m
installer-6-ip-10-0-156-150.ap-northeast-1.compute.internal                      0/1     Completed   0          5h4m
installer-6-ip-10-0-204-104.ap-northeast-1.compute.internal                      0/1     Completed   0          5h9m
installer-7-ip-10-0-156-150.ap-northeast-1.compute.internal                      0/1     Completed   0          5h4m
installer-7-ip-10-0-189-93.ap-northeast-1.compute.internal                       0/1     Completed   0          5h3m
installer-8-ip-10-0-156-150.ap-northeast-1.compute.internal                      0/1     Completed   0          4h59m
installer-8-ip-10-0-189-93.ap-northeast-1.compute.internal                       0/1     Completed   0          5h2m
installer-8-ip-10-0-204-104.ap-northeast-1.compute.internal                      0/1     Completed   0          5h1m
installer-9-ip-10-0-156-150.ap-northeast-1.compute.internal                      0/1     Completed   0          4h36m
installer-9-ip-10-0-189-93.ap-northeast-1.compute.internal                       0/1     Completed   0          4h35m
installer-9-ip-10-0-204-104.ap-northeast-1.compute.internal                      0/1     Completed   0          4h34m
openshift-kube-scheduler-guard-ip-10-0-156-150.ap-northeast-1.compute.internal   1/1     Running     0          5h13m
openshift-kube-scheduler-guard-ip-10-0-189-93.ap-northeast-1.compute.internal    1/1     Running     0          5h12m
openshift-kube-scheduler-guard-ip-10-0-204-104.ap-northeast-1.compute.internal   1/1     Running     0          5h10m
openshift-kube-scheduler-ip-10-0-156-150.ap-northeast-1.compute.internal         3/3     Running     0          4h36m
openshift-kube-scheduler-ip-10-0-189-93.ap-northeast-1.compute.internal          3/3     Running     0          4h35m
openshift-kube-scheduler-ip-10-0-204-104.ap-northeast-1.compute.internal         3/3     Running     0          4h33m
revision-pruner-8-ip-10-0-156-150.ap-northeast-1.compute.internal                0/1     Completed   0          5h1m
revision-pruner-8-ip-10-0-189-93.ap-northeast-1.compute.internal                 0/1     Completed   0          5h1m
revision-pruner-8-ip-10-0-204-104.ap-northeast-1.compute.internal                0/1     Completed   0          5h1m
revision-pruner-9-ip-10-0-156-150.ap-northeast-1.compute.internal                0/1     Completed   0          4h36m
revision-pruner-9-ip-10-0-189-93.ap-northeast-1.compute.internal                 0/1     Completed   0          4h36m
revision-pruner-9-ip-10-0-204-104.ap-northeast-1.compute.internal                0/1     Completed   0          4h36m

Hi @knarra, as I reported earlier, the issue is reproducible on OCP 4.9.15. Are you going to backport to the 4.9.* versions? Thank you

(In reply to Yuri Obshansky from comment #12)
> Hi @knarra, as I reported earlier, the issue is reproducible on OCP 4.9.15.
> Are you going to backport to the 4.9.* versions? Thank you

Hello Yuri, yes, I already see that the 4.9.z bug is in POST state, please see https://bugzilla.redhat.com/show_bug.cgi?id=2044622

Thanks
kasturi

Thank you for the update
Yuri

(In reply to RamaKasturi from comment #13)
> Hello Yuri, yes, I already see that the 4.9.z bug is in POST state, please see
> https://bugzilla.redhat.com/show_bug.cgi?id=2044622

Hello Yuri, I am trying to verify the 4.9.z bug and just wanted to understand which cloud provider you hit the issue on, since I am trying to reproduce it.
Thanks
kasturi

(In reply to RamaKasturi from comment #16)
> Hello Yuri, I am trying to verify the 4.9.z bug and just wanted to understand
> which cloud provider you hit the issue on, since I am trying to reproduce it.

Hi kasturi,

We test the Assisted Service cloud approach to installing OpenShift:
https://qaprodauth.cloud.redhat.com/openshift/assisted-installer/clusters

Here is the repo -> https://github.com/openshift/assisted-service

https://cloud.redhat.com/blog/using-the-openshift-assisted-installer-service-to-deploy-an-openshift-cluster-on-metal-and-vsphere

Let me know if you need more info.

Thanks
Yuri

(In reply to Yuri Obshansky from comment #17)
> We test the Assisted Service cloud approach to installing OpenShift:
> https://qaprodauth.cloud.redhat.com/openshift/assisted-installer/clusters
> Here is the repo -> https://github.com/openshift/assisted-service
> Let me know if you need more info.

Okay, thank you. Could you please help try whether your issue is resolved in the new 4.9 build?

(In reply to RamaKasturi from comment #18)
> Okay, thank you. Could you please help try whether your issue is resolved in
> the new 4.9 build?

We can only test the version that is deployed on the Cloud; I cannot update the version on it. Right now it is 4.9.17. Where is the issue fixed? In what version?

(In reply to Yuri Obshansky from comment #19)
> We can only test the version that is deployed on the Cloud; I cannot update
> the version on it. Right now it is 4.9.17. Where is the issue fixed? In what
> version?

The issue is fixed in the 4.9.19 build. Here is the complete change log:
https://openshift-release.apps.ci.l2s4.p1.openshiftapps.com/releasestream/4-stable/release/4.9.19
and the 4.9 bug which I have moved to verified state:
https://bugzilla.redhat.com/show_bug.cgi?id=2044622

(In reply to RamaKasturi from comment #20)
> The issue is fixed in the 4.9.19 build. Here is the complete change log:
> https://openshift-release.apps.ci.l2s4.p1.openshiftapps.com/releasestream/4-stable/release/4.9.19
> and the 4.9 bug which I have moved to verified state:
> https://bugzilla.redhat.com/show_bug.cgi?id=2044622

Great. I'll verify the issue when we get image 4.9.19 on Staging and will update the bugzilla with the results.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Container Platform 4.10.3 security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:0056

Hello Yuri, could you please help set the right test coverage flag here?

Thanks
kasturi

Hello Yuri, is there any reason you cleared the needinfo without setting the test coverage flag? I did not see any comments mentioning why; could you please help me understand?
Thanks
kasturi

Hello Rama,

We do not have image 4.9.19 on our Staging setup, so I cannot verify this bug. Please find attached a screenshot with the list of images on Staging and let me know which version would be good for bug verification.

Thank you
Yuri

Hi Yuri, please help verify with 4.9.37, as I understand the fix should be present there.

Thanks
kasturi

Hi,

Just verified with 4.9.37:

# oc get clusterversion
NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.9.37    True        False         93m     Cluster version is 4.9.37

# oc get pods -n openshift-kube-scheduler
NAME                                  READY   STATUS      RESTARTS   AGE
installer-3-master-0-0                0/1     Completed   0          112m
installer-4-master-0-0                0/1     Completed   0          111m
installer-5-master-0-0                0/1     Completed   0          110m
installer-5-master-0-2                0/1     Completed   0          108m
installer-6-master-0-0                0/1     Completed   0          105m
installer-6-master-0-1                0/1     Completed   0          101m
installer-6-master-0-2                0/1     Completed   0          104m
openshift-kube-scheduler-master-0-0   3/3     Running     0          105m
openshift-kube-scheduler-master-0-1   3/3     Running     0          99m
openshift-kube-scheduler-master-0-2   3/3     Running     0          104m
revision-pruner-6-master-0-0          0/1     Completed   0          103m
revision-pruner-6-master-0-1          0/1     Completed   0          101m
revision-pruner-6-master-0-2          0/1     Completed   0          103m

Issue resolved.

Hello Yuri, thanks for verifying. Could you please help set the right test coverage flag here?

Thanks
kasturi

Hi,

Honestly, I do not know what the qe_test_coverage flag should be. Please discuss with your QE managers. Sorry about that. Thank you

Hello Yuri, since the issue happened during installation with the assisted installer, I was wondering if you have a test case added to check that. If you have a test case added, please set '+' in the qe_test_coverage flag; otherwise set '-' and explain why you think we do not need to add a case for this.

Thanks
kasturi

Hi,

This issue is not specific to Assisted Service deployment, so we do not need to add a special test case for it. It should probably be covered by the regular cluster deployment flow.

Thank you
Yuri