Bug 2045872

Summary: SNO: cluster-policy-controller failed to start due to missing serving-cert/tls.crt
Product: OpenShift Container Platform
Reporter: Igal Tsoiref <itsoiref>
Component: kube-controller-manager
Assignee: Filip Krepinsky <fkrepins>
Status: CLOSED ERRATA
QA Contact: zhou ying <yinzhou>
Severity: high
Priority: high
Version: 4.10
CC: calfonso, maszulik, mfojtik, rfreiman, surbania, tkatarki, vrutkovs
Target Milestone: ---
Target Release: 4.11.0
Hardware: Unspecified
OS: Unspecified
Doc Type: No Doc Update
Last Closed: 2022-08-10 10:43:43 UTC
Type: Bug
Bug Blocks: 2048484

Description Igal Tsoiref 2022-01-25 20:19:31 UTC
Description of problem:
Recently, SNO CI jobs have been failing.

We found that cluster-policy-controller failed to start due to:


2022-01-22T15:55:00.211340589Z F0122 15:55:00.211297       1 cmd.go:138] open /etc/kubernetes/static-pod-resources/secrets/serving-cert/tls.crt: no such file or directory
2022-01-22T15:55:00.211691699Z goroutine 1 [running]:
2022-01-22T15:55:00.211691699Z k8s.io/klog/v2.stacks(0x1)
2022-01-22T15:55:00.211691699Z 	k8s.io/klog/v2.0/klog.go:1038 +0x8a
2022-01-22T15:55:00.211691699Z k8s.io/klog/v2.(*loggingT).output(0x3a75fe0, 0x3, 0x0, 0xc0006f6000, 0x1, {0x2cfcd69, 0x10}, 0xc000100000, 0x0)
2022-01-22T15:55:00.211691699Z 	k8s.io/klog/v2.0/klog.go:987 +0x5fd
2022-01-22T15:55:00.211691699Z k8s.io/klog/v2.(*loggingT).printDepth(0xc000253bc0, 0x26b1c70, 0x0, {0x0, 0x0}, 0x37, {0xc00089da70, 0x1, 0x1})
2022-01-22T15:55:00.211691699Z 	k8s.io/klog/v2.0/klog.go:735 +0x1ae
2022-01-22T15:55:00.211691699Z k8s.io/klog/v2.(*loggingT).print(...)
2022-01-22T15:55:00.211691699Z 	k8s.io/klog/v2.0/klog.go:717
2022-01-22T15:55:00.211691699Z k8s.io/klog/v2.Fatal(...)
2022-01-22T15:55:00.211691699Z 	k8s.io/klog/v2.0/klog.go:1512

link to specific logfile:
https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.10-e2e-metal-single-node-live-iso/1484898355296342016/artifacts/e2e-metal-single-node-live-iso/baremetalds-sno-gather/artifacts/post-tests-must-gather/registry-redhat-io-openshift4-ose-must-gather-sha256-8c0b3bc10756c463f1aa6b622e396ae244079dd8f7f2f3c5d8695a777c95eec6/namespaces/openshift-kube-controller-manager/pods/kube-controller-manager-test-infra-cluster-master-0/cluster-policy-controller/cluster-policy-controller/logs/current.log




Actual results:
cluster-policy-controller fails to start

Expected results:
cluster-policy-controller starts

Additional info:
You can find must-gather logs here:
https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.10-e2e-metal-single-node-live-iso/1484898355296342016/artifacts/e2e-metal-single-node-live-iso/baremetalds-sno-gather/artifacts/post-tests-must-gather

Another example:

https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.10-e2e-metal-single-node-live-iso/1484777420153163776/artifacts/e2e-metal-single-node-live-iso/baremetalds-sno-gather/artifacts/post-tests-must-gather/

We have a couple more examples if needed.

Comment 1 Sergiusz Urbaniak 2022-01-26 14:39:42 UTC
1. In “regular” installs, the bootstrap cluster-policy-controller ensures that UID ranges are applied in namespaces. This instance does not require service-ca certs.
2. The final cluster-policy-controller does require a service-ca cert. Hence, if the bootstrap cluster-policy-controller from step 1 did not provision UID ranges in the service-ca-operator's namespace, SCC admission will fail and prevent service-ca-operator from starting, resulting in the failure state seen here.

The fix is to make cluster-policy-controller not depend on the service-ca cert secret: it must be able to start without it even in scenario 2. Likewise, for rebootstrapping scenarios, cluster-policy-controller must be able to start without a service-ca-generated serving certificate.

Comment 10 Filip Krepinsky 2022-01-31 10:51:52 UTC
backport created https://bugzilla.redhat.com/show_bug.cgi?id=2048484

Comment 13 Filip Krepinsky 2022-04-04 21:50:41 UTC
*** Bug 1961204 has been marked as a duplicate of this bug. ***

Comment 17 errata-xmlrpc 2022-08-10 10:43:43 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Important: OpenShift Container Platform 4.11.0 bug fix and security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:5069

Comment 18 Red Hat Bugzilla 2023-09-15 01:51:19 UTC
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 365 days