Description of problem:
We use ConfigMaps in several places that get certificate-related assets injected by operators. Some Pods get started before these resources are available, leading to failures and restarts.

How reproducible:
Every time, I think.

Steps to Reproduce:
Start a cluster and check the prometheus-operator pod. The kube-rbac-proxy container will have been restarted once and the previous log lists an error.

Actual results:
kube-rbac-proxy in the prometheus-operator pod will have a previous log containing an error:

I1019 19:32:47.579385 1 main.go:181] Valid token audiences:
F1019 19:32:47.579986 1 main.go:222] Failed to set up upstream TLS connection: error parsing upstream CA certificate
goroutine 1 [running]:
k8s.io/klog/v2.stacks(0xc000128001, 0xc000512280, 0x7b, 0x97)
	/go/src/github.com/brancz/kube-rbac-proxy/vendor/k8s.io/klog/v2/klog.go:996 +0xb9
k8s.io/klog/v2.(*loggingT).output(0x22992c0, 0xc000000003, 0x0, 0x0, 0xc0001c05b0, 0x1bff7db, 0x7, 0xde, 0x0)
	/go/src/github.com/brancz/kube-rbac-proxy/vendor/k8s.io/klog/v2/klog.go:945 +0x191
k8s.io/klog/v2.(*loggingT).printf(0x22992c0, 0x3, 0x0, 0x0, 0x176b141, 0x2c, 0xc000535cc8, 0x1, 0x1)
	/go/src/github.com/brancz/kube-rbac-proxy/vendor/k8s.io/klog/v2/klog.go:733 +0x17a
k8s.io/klog/v2.Fatalf(...)
	/go/src/github.com/brancz/kube-rbac-proxy/vendor/k8s.io/klog/v2/klog.go:1463
main.main()
	/go/src/github.com/brancz/kube-rbac-proxy/main.go:222 +0x166e

goroutine 18 [chan receive]:
k8s.io/klog/v2.(*loggingT).flushDaemon(0x22992c0)
	/go/src/github.com/brancz/kube-rbac-proxy/vendor/k8s.io/klog/v2/klog.go:1131 +0x8b
created by k8s.io/klog/v2.init.0
	/go/src/github.com/brancz/kube-rbac-proxy/vendor/k8s.io/klog/v2/klog.go:416 +0xd8

goroutine 43 [select]:
k8s.io/client-go/util/workqueue.(*delayingType).waitingLoop(0xc0002f3c80)
	/go/src/github.com/brancz/kube-rbac-proxy/vendor/k8s.io/client-go/util/workqueue/delaying_queue.go:231 +0x3df
created by k8s.io/client-go/util/workqueue.newDelayingQueue
	/go/src/github.com/brancz/kube-rbac-proxy/vendor/k8s.io/client-go/util/workqueue/delaying_queue.go:68 +0x185

goroutine 42 [chan receive]:
k8s.io/client-go/util/workqueue.(*Type).updateUnfinishedWorkLoop(0xc0002f3a40)
	/go/src/github.com/brancz/kube-rbac-proxy/vendor/k8s.io/client-go/util/workqueue/queue.go:198 +0xac
created by k8s.io/client-go/util/workqueue.newQueue
	/go/src/github.com/brancz/kube-rbac-proxy/vendor/k8s.io/client-go/util/workqueue/queue.go:58 +0x135

goroutine 44 [semacquire]:
sync.runtime_SemacquireMutex(0x22992dc, 0xc0000c1d00, 0x1)
	/usr/lib/golang/src/runtime/sema.go:71 +0x47
sync.(*Mutex).lockSlow(0x22992d8)
	/usr/lib/golang/src/sync/mutex.go:138 +0x105
sync.(*Mutex).Lock(...)
	/usr/lib/golang/src/sync/mutex.go:81
k8s.io/klog/v2.(*loggingT).output(0x22992c0, 0xc000000000, 0x0, 0x0, 0xc0002093b0, 0x1bfed0d, 0x19, 0xa7, 0x0)
	/go/src/github.com/brancz/kube-rbac-proxy/vendor/k8s.io/klog/v2/klog.go:882 +0x825
k8s.io/klog/v2.(*loggingT).printf(0x22992c0, 0x0, 0x0, 0x0, 0x1742b89, 0xb, 0xc0000c1f20, 0x1, 0x1)
	/go/src/github.com/brancz/kube-rbac-proxy/vendor/k8s.io/klog/v2/klog.go:733 +0x17a
k8s.io/klog/v2.Infof(...)
	/go/src/github.com/brancz/kube-rbac-proxy/vendor/k8s.io/klog/v2/klog.go:1363
k8s.io/apiserver/pkg/server/dynamiccertificates.(*DynamicFileCAContent).Run(0xc0002f3ce0, 0x1, 0x0)
	/go/src/github.com/brancz/kube-rbac-proxy/vendor/k8s.io/apiserver/pkg/server/dynamiccertificates/dynamic_cafile_content.go:167 +0x145
github.com/brancz/kube-rbac-proxy/pkg/authn.(*DelegatingAuthenticator).Run(0xc00000fea8, 0x1, 0x0)
	/go/src/github.com/brancz/kube-rbac-proxy/pkg/authn/delegating.go:82 +0x51
created by main.main
	/go/src/github.com/brancz/kube-rbac-proxy/main.go:189 +0x3547

Expected results:
The Pod waits until the CA has been injected into the ConfigMap.

Additional info:
This is caused by improper usage. We must:
a) In the Volume spec of a Pod, list the items we want specifically. This will cause Kubernetes to start the Pod only once the key has been created, cf. https://kubernetes.io/docs/tasks/configure-pod-container/configure-pod-configmap/#restrictions
b) Not pre-create the key that will be injected. Currently we create the key with an empty value (e.g. serving-ca.crt: "") in the respective ConfigMaps, which allows Pods to start before the CA data is present. (A sketch of both changes is shown below.)

This was discovered while investigating https://bugzilla.redhat.com/show_bug.cgi?id=1953264 and may even fix it.
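To illustrate both points, here is a minimal sketch, assuming a hypothetical ConfigMap name, Pod name, and image (these are not the actual cluster-monitoring-operator assets):

apiVersion: v1
kind: ConfigMap
metadata:
  name: ca-bundle                       # hypothetical name
  namespace: openshift-monitoring
# b) no pre-created empty "serving-ca.crt" key here; the CA-injecting operator
#    adds the key once the certificate data is available
---
apiVersion: v1
kind: Pod
metadata:
  name: example-kube-rbac-proxy         # hypothetical name
  namespace: openshift-monitoring
spec:
  containers:
  - name: kube-rbac-proxy
    image: example.invalid/kube-rbac-proxy:latest   # placeholder image
    volumeMounts:
    - name: ca-bundle
      mountPath: /etc/tls/client
      readOnly: true
  volumes:
  - name: ca-bundle
    configMap:
      name: ca-bundle
      items:                            # a) list the wanted key explicitly
      - key: serving-ca.crt
        path: serving-ca.crt

With the items entry in place, a missing serving-ca.crt key should keep the container from starting (the volume mount fails and the Pod waits in ContainerCreating) instead of letting kube-rbac-proxy crash on an empty CA file, which matches the expected results above.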
Yes, found the issue in 4.10.0-0.nightly-2021-10-25-190146.

# oc -n openshift-monitoring get pod prometheus-operator-54695d6648-lskjc -oyaml
...
- containerID: cri-o://adb147861b4c06252af9599936b90d684dfbe911a77652820f0675a225327c61
  image: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:dfc11491ae5f242603571cefaae1879af9427ca1178bedf53cc385d1dae115a7
  imageID: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:dfc11491ae5f242603571cefaae1879af9427ca1178bedf53cc385d1dae115a7
  lastState:
    terminated:
      containerID: cri-o://8332e0b9e0b79c91eb7d3b4ae2639118fef008b1593df72af345174e81f4e053
      exitCode: 255
      finishedAt: "2021-10-27T00:54:11Z"
      message: "ueue.(*Type).updateUnfinishedWorkLoop(0xc000370a20)\n\t/go/src/github.com/brancz/kube-rbac-proxy/vendor/k8s.io/client-go/util/workqueue/queue.go:198 +0xac\ncreated by k8s.io/client-go/util/workqueue.newQueue\n\t/go/src/github.com/brancz/kube-rbac-proxy/vendor/k8s.io/client-go/util/workqueue/queue.go:58 +0x135\n\ngoroutine 13 [select]:\nk8s.io/client-go/util/workqueue.(*delayingType).waitingLoop(0xc000370b40)\n\t/go/src/github.com/brancz/kube-rbac-proxy/vendor/k8s.io/client-go/util/workqueue/delaying_queue.go:231 +0x3df\ncreated by k8s.io/client-go/util/workqueue.newDelayingQueue\n\t/go/src/github.com/brancz/kube-rbac-proxy/vendor/k8s.io/client-go/util/workqueue/delaying_queue.go:68 +0x185\n\ngoroutine 14 [semacquire]:\nsync.runtime_SemacquireMutex(0x22992dc, 0xc0000f3d00, 0x1)\n\t/usr/lib/golang/src/runtime/sema.go:71 +0x47\nsync.(*Mutex).lockSlow(0x22992d8)\n\t/usr/lib/golang/src/sync/mutex.go:138 +0x105\nsync.(*Mutex).Lock(...)\n\t/usr/lib/golang/src/sync/mutex.go:81\nk8s.io/klog/v2.(*loggingT).output(0x22992c0, 0xc000000000, 0x0, 0x0, 0xc00043c070, 0x1bfed0d, 0x19, 0xa7, 0x0)\n\t/go/src/github.com/brancz/kube-rbac-proxy/vendor/k8s.io/klog/v2/klog.go:882 +0x825\nk8s.io/klog/v2.(*loggingT).printf(0x22992c0, 0x0, 0x0, 0x0, 0x1742b89, 0xb, 0xc0000f3f20, 0x1, 0x1)\n\t/go/src/github.com/brancz/kube-rbac-proxy/vendor/k8s.io/klog/v2/klog.go:733 +0x17a\nk8s.io/klog/v2.Infof(...)\n\t/go/src/github.com/brancz/kube-rbac-proxy/vendor/k8s.io/klog/v2/klog.go:1363\nk8s.io/apiserver/pkg/server/dynamiccertificates.(*DynamicFileCAContent).Run(0xc000370ba0, 0x1, 0x0)\n\t/go/src/github.com/brancz/kube-rbac-proxy/vendor/k8s.io/apiserver/pkg/server/dynamiccertificates/dynamic_cafile_content.go:167 +0x145\ngithub.com/brancz/kube-rbac-proxy/pkg/authn.(*DelegatingAuthenticator).Run(0xc00009e288, 0x1, 0x0)\n\t/go/src/github.com/brancz/kube-rbac-proxy/pkg/authn/delegating.go:82 +0x51\ncreated by main.main\n\t/go/src/github.com/brancz/kube-rbac-proxy/main.go:189 +0x3547\nI1027 00:54:11.192982 1 dynamic_cafile_content.go:167] Starting client-ca::/etc/tls/client/client-ca.crt\n"
      reason: Error
      startedAt: "2021-10-27T00:54:11Z"
  name: kube-rbac-proxy
  ready: true
  restartCount: 1
  started: true
  state:
    running:
      startedAt: "2021-10-27T00:54:11Z"

***************
Waiting for the fix to be merged into the payload.
*** Bug 2017616 has been marked as a duplicate of this bug. ***
Tested with the PRs, no restarted pods in the fresh cluster.

# oc -n openshift-monitoring get pod
NAME                                           READY   STATUS    RESTARTS   AGE
alertmanager-main-0                            6/6     Running   0          132m
alertmanager-main-1                            6/6     Running   0          132m
alertmanager-main-2                            6/6     Running   0          132m
cluster-monitoring-operator-85489b49d6-n7hr7   2/2     Running   0          146m
grafana-6fddbbcc5c-v5tcx                       3/3     Running   0          132m
kube-state-metrics-86c755dc84-tprm9            3/3     Running   0          140m
node-exporter-d99lw                            2/2     Running   0          140m
node-exporter-r4x44                            2/2     Running   0          140m
node-exporter-rjzkw                            2/2     Running   0          140m
node-exporter-t2b7h                            2/2     Running   0          133m
node-exporter-x76d9                            2/2     Running   0          133m
node-exporter-z4rq9                            2/2     Running   0          133m
openshift-state-metrics-7b5dd5bdd4-5jnrt       3/3     Running   0          140m
prometheus-adapter-6fb8fd644-r4njz             1/1     Running   0          134m
prometheus-adapter-6fb8fd644-vmvnr             1/1     Running   0          134m
prometheus-k8s-0                               6/6     Running   0          132m
prometheus-k8s-1                               6/6     Running   0          132m
prometheus-operator-7776f6885f-5l6kq           2/2     Running   0          140m
telemeter-client-68494bcb44-6tk6c              3/3     Running   0          140m
thanos-querier-58dcb67767-l2zd2                6/6     Running   0          132m
thanos-querier-58dcb67767-mllhv                6/6     Running   0          132m
# oc -n openshift-user-workload-monitoring get pod
NAME                                   READY   STATUS    RESTARTS   AGE
prometheus-operator-57599545c8-v7qn6   2/2     Running   0          3m37s
prometheus-user-workload-0             5/5     Running   0          3m34s
prometheus-user-workload-1             5/5     Running   0          3m34s
thanos-ruler-user-workload-0           3/3     Running   0          3m30s
thanos-ruler-user-workload-1           3/3     Running   0          3m30s
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.10.3 security update), and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:0056