Bug 2016352 - Some pods start before CA resources are present
Summary: Some pods start before CA resources are present
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Monitoring
Version: 4.10
Hardware: Unspecified
OS: Unspecified
Priority: medium
Severity: medium
Target Milestone: ---
Target Release: 4.10.0
Assignee: Jan Fajerski
QA Contact: Junqi Zhao
Docs Contact: Brian Burt
URL:
Whiteboard:
Duplicates: 2017616 (view as bug list)
Depends On:
Blocks:
 
Reported: 2021-10-21 11:07 UTC by Jan Fajerski
Modified: 2022-03-10 16:21 UTC
CC List: 7 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
This release fixes an issue in which some pods in the monitoring stack would start before TLS certificate-related resources were present, which resulted in failures and restarts.
See also: https://bugzilla.redhat.com/show_bug.cgi?id=1953264
Clone Of:
Environment:
Last Closed: 2022-03-10 16:21:33 UTC
Target Upstream Version:
Embargoed:




Links
System ID / Status / Summary / Last Updated
Github openshift cluster-monitoring-operator pull 1451 / open / Bug 2016352: Fix cert injection / 2021-10-27 12:01:50 UTC
Github openshift telemeter pull 398 / Merged / Bug 2016352: client: don't create empty service-ca.crt key in configmap / 2021-10-27 12:01:04 UTC
Red Hat Product Errata RHSA-2022:0056 / 2022-03-10 16:21:54 UTC

Description Jan Fajerski 2021-10-21 11:07:08 UTC
Description of problem:
We use ConfigMaps in several places that get certificate-related assets injected by operators.
Some pods start before these resources are available, leading to failures and restarts.
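
The problematic pattern looks roughly like this (a minimal sketch; the resource names are illustrative):

# The ConfigMap is created with an empty placeholder key; the service CA
# operator injects the real CA bundle asynchronously, some time later.
apiVersion: v1
kind: ConfigMap
metadata:
  name: serving-certs-ca-bundle
  annotations:
    service.beta.openshift.io/inject-cabundle: "true"
data:
  service-ca.crt: ""

# The pod mounts the whole ConfigMap without listing items, so nothing
# prevents its containers from starting while the key is still empty.
volumes:
- name: serving-certs-ca-bundle
  configMap:
    name: serving-certs-ca-bundle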

How reproducible:
Every time, I think.

Steps to Reproduce:
Start a cluster and check the prometheus-operator pod. The kube-rbac-proxy container will have been restarted once, and its previous log lists an error.
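
The previous log can be retrieved with something like (the pod name is a placeholder):

# oc -n openshift-monitoring logs prometheus-operator-<hash> -c kube-rbac-proxy --previous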


Actual results:
kube-rbac-proxy in the prometheus-operator pod will have a previous log containing an error:

I1019 19:32:47.579385       1 main.go:181] Valid token audiences: 
F1019 19:32:47.579986       1 main.go:222] Failed to set up upstream TLS connection: error parsing upstream CA certificate
goroutine 1 [running]:
k8s.io/klog/v2.stacks(0xc000128001, 0xc000512280, 0x7b, 0x97)
	/go/src/github.com/brancz/kube-rbac-proxy/vendor/k8s.io/klog/v2/klog.go:996 +0xb9
k8s.io/klog/v2.(*loggingT).output(0x22992c0, 0xc000000003, 0x0, 0x0, 0xc0001c05b0, 0x1bff7db, 0x7, 0xde, 0x0)
	/go/src/github.com/brancz/kube-rbac-proxy/vendor/k8s.io/klog/v2/klog.go:945 +0x191
k8s.io/klog/v2.(*loggingT).printf(0x22992c0, 0x3, 0x0, 0x0, 0x176b141, 0x2c, 0xc000535cc8, 0x1, 0x1)
	/go/src/github.com/brancz/kube-rbac-proxy/vendor/k8s.io/klog/v2/klog.go:733 +0x17a
k8s.io/klog/v2.Fatalf(...)
	/go/src/github.com/brancz/kube-rbac-proxy/vendor/k8s.io/klog/v2/klog.go:1463
main.main()
	/go/src/github.com/brancz/kube-rbac-proxy/main.go:222 +0x166e

goroutine 18 [chan receive]:
k8s.io/klog/v2.(*loggingT).flushDaemon(0x22992c0)
	/go/src/github.com/brancz/kube-rbac-proxy/vendor/k8s.io/klog/v2/klog.go:1131 +0x8b
created by k8s.io/klog/v2.init.0
	/go/src/github.com/brancz/kube-rbac-proxy/vendor/k8s.io/klog/v2/klog.go:416 +0xd8

goroutine 43 [select]:
k8s.io/client-go/util/workqueue.(*delayingType).waitingLoop(0xc0002f3c80)
	/go/src/github.com/brancz/kube-rbac-proxy/vendor/k8s.io/client-go/util/workqueue/delaying_queue.go:231 +0x3df
created by k8s.io/client-go/util/workqueue.newDelayingQueue
	/go/src/github.com/brancz/kube-rbac-proxy/vendor/k8s.io/client-go/util/workqueue/delaying_queue.go:68 +0x185

goroutine 42 [chan receive]:
k8s.io/client-go/util/workqueue.(*Type).updateUnfinishedWorkLoop(0xc0002f3a40)
	/go/src/github.com/brancz/kube-rbac-proxy/vendor/k8s.io/client-go/util/workqueue/queue.go:198 +0xac
created by k8s.io/client-go/util/workqueue.newQueue
	/go/src/github.com/brancz/kube-rbac-proxy/vendor/k8s.io/client-go/util/workqueue/queue.go:58 +0x135

goroutine 44 [semacquire]:
sync.runtime_SemacquireMutex(0x22992dc, 0xc0000c1d00, 0x1)
	/usr/lib/golang/src/runtime/sema.go:71 +0x47
sync.(*Mutex).lockSlow(0x22992d8)
	/usr/lib/golang/src/sync/mutex.go:138 +0x105
sync.(*Mutex).Lock(...)
	/usr/lib/golang/src/sync/mutex.go:81
k8s.io/klog/v2.(*loggingT).output(0x22992c0, 0xc000000000, 0x0, 0x0, 0xc0002093b0, 0x1bfed0d, 0x19, 0xa7, 0x0)
	/go/src/github.com/brancz/kube-rbac-proxy/vendor/k8s.io/klog/v2/klog.go:882 +0x825
k8s.io/klog/v2.(*loggingT).printf(0x22992c0, 0x0, 0x0, 0x0, 0x1742b89, 0xb, 0xc0000c1f20, 0x1, 0x1)
	/go/src/github.com/brancz/kube-rbac-proxy/vendor/k8s.io/klog/v2/klog.go:733 +0x17a
k8s.io/klog/v2.Infof(...)
	/go/src/github.com/brancz/kube-rbac-proxy/vendor/k8s.io/klog/v2/klog.go:1363
k8s.io/apiserver/pkg/server/dynamiccertificates.(*DynamicFileCAContent).Run(0xc0002f3ce0, 0x1, 0x0)
	/go/src/github.com/brancz/kube-rbac-proxy/vendor/k8s.io/apiserver/pkg/server/dynamiccertificates/dynamic_cafile_content.go:167 +0x145
github.com/brancz/kube-rbac-proxy/pkg/authn.(*DelegatingAuthenticator).Run(0xc00000fea8, 0x1, 0x0)
	/go/src/github.com/brancz/kube-rbac-proxy/pkg/authn/delegating.go:82 +0x51
created by main.main
	/go/src/github.com/brancz/kube-rbac-proxy/main.go:189 +0x3547

Expected results:

The Pod waits until the CA has been injected into the ConfigMap.

Additional info:
This is caused by improper usage. We must:
a) In the Volume spec of a pod, explicitly list the items we want. This causes Kubernetes to start the pod only once the key has been created; cf. https://kubernetes.io/docs/tasks/configure-pod-container/configure-pod-configmap/#restrictions
b) Not create the ConfigMap with the to-be-injected key pre-created. Currently we create the key with an empty value (e.g. serving-ca.crt: "") in the respective ConfigMaps, which allows pods to start before the CA data is present.
Both points are illustrated in the sketch below.
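
A minimal sketch of both points (resource names are illustrative; the actual changes are in the linked cluster-monitoring-operator and telemeter PRs):

# a) Mount the injected key explicitly via items; the kubelet then starts
#    the pod only once the key actually exists in the ConfigMap.
volumes:
- name: serving-certs-ca-bundle
  configMap:
    name: serving-certs-ca-bundle
    items:
    - key: service-ca.crt
      path: service-ca.crt

# b) Create the ConfigMap with only the injection annotation and no empty
#    placeholder key; the key appears only once the CA has been injected.
apiVersion: v1
kind: ConfigMap
metadata:
  name: serving-certs-ca-bundle
  annotations:
    service.beta.openshift.io/inject-cabundle: "true"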

This was discovered while investigating https://bugzilla.redhat.com/show_bug.cgi?id=1953264 and may even fix it.

Comment 4 Junqi Zhao 2021-10-27 02:30:05 UTC
Yes, I found the issue in 4.10.0-0.nightly-2021-10-25-190146:
# oc -n openshift-monitoring get pod prometheus-operator-54695d6648-lskjc -oyaml
...
  - containerID: cri-o://adb147861b4c06252af9599936b90d684dfbe911a77652820f0675a225327c61
    image: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:dfc11491ae5f242603571cefaae1879af9427ca1178bedf53cc385d1dae115a7
    imageID: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:dfc11491ae5f242603571cefaae1879af9427ca1178bedf53cc385d1dae115a7
    lastState:
      terminated:
        containerID: cri-o://8332e0b9e0b79c91eb7d3b4ae2639118fef008b1593df72af345174e81f4e053
        exitCode: 255
        finishedAt: "2021-10-27T00:54:11Z"
        message: "ueue.(*Type).updateUnfinishedWorkLoop(0xc000370a20)\n\t/go/src/github.com/brancz/kube-rbac-proxy/vendor/k8s.io/client-go/util/workqueue/queue.go:198
          +0xac\ncreated by k8s.io/client-go/util/workqueue.newQueue\n\t/go/src/github.com/brancz/kube-rbac-proxy/vendor/k8s.io/client-go/util/workqueue/queue.go:58
          +0x135\n\ngoroutine 13 [select]:\nk8s.io/client-go/util/workqueue.(*delayingType).waitingLoop(0xc000370b40)\n\t/go/src/github.com/brancz/kube-rbac-proxy/vendor/k8s.io/client-go/util/workqueue/delaying_queue.go:231
          +0x3df\ncreated by k8s.io/client-go/util/workqueue.newDelayingQueue\n\t/go/src/github.com/brancz/kube-rbac-proxy/vendor/k8s.io/client-go/util/workqueue/delaying_queue.go:68
          +0x185\n\ngoroutine 14 [semacquire]:\nsync.runtime_SemacquireMutex(0x22992dc,
          0xc0000f3d00, 0x1)\n\t/usr/lib/golang/src/runtime/sema.go:71 +0x47\nsync.(*Mutex).lockSlow(0x22992d8)\n\t/usr/lib/golang/src/sync/mutex.go:138
          +0x105\nsync.(*Mutex).Lock(...)\n\t/usr/lib/golang/src/sync/mutex.go:81\nk8s.io/klog/v2.(*loggingT).output(0x22992c0,
          0xc000000000, 0x0, 0x0, 0xc00043c070, 0x1bfed0d, 0x19, 0xa7, 0x0)\n\t/go/src/github.com/brancz/kube-rbac-proxy/vendor/k8s.io/klog/v2/klog.go:882
          +0x825\nk8s.io/klog/v2.(*loggingT).printf(0x22992c0, 0x0, 0x0, 0x0, 0x1742b89,
          0xb, 0xc0000f3f20, 0x1, 0x1)\n\t/go/src/github.com/brancz/kube-rbac-proxy/vendor/k8s.io/klog/v2/klog.go:733
          +0x17a\nk8s.io/klog/v2.Infof(...)\n\t/go/src/github.com/brancz/kube-rbac-proxy/vendor/k8s.io/klog/v2/klog.go:1363\nk8s.io/apiserver/pkg/server/dynamiccertificates.(*DynamicFileCAContent).Run(0xc000370ba0,
          0x1, 0x0)\n\t/go/src/github.com/brancz/kube-rbac-proxy/vendor/k8s.io/apiserver/pkg/server/dynamiccertificates/dynamic_cafile_content.go:167
          +0x145\ngithub.com/brancz/kube-rbac-proxy/pkg/authn.(*DelegatingAuthenticator).Run(0xc00009e288,
          0x1, 0x0)\n\t/go/src/github.com/brancz/kube-rbac-proxy/pkg/authn/delegating.go:82
          +0x51\ncreated by main.main\n\t/go/src/github.com/brancz/kube-rbac-proxy/main.go:189
          +0x3547\nI1027 00:54:11.192982       1 dynamic_cafile_content.go:167] Starting
          client-ca::/etc/tls/client/client-ca.crt\n"
        reason: Error
        startedAt: "2021-10-27T00:54:11Z"
    name: kube-rbac-proxy
    ready: true
    restartCount: 1
    started: true
    state:
      running:
        startedAt: "2021-10-27T00:54:11Z"
***************
Waiting for the fix to be merged into the payload.

Comment 5 Jan Fajerski 2021-10-27 08:30:14 UTC
*** Bug 2017616 has been marked as a duplicate of this bug. ***

Comment 6 Junqi Zhao 2021-10-28 06:00:26 UTC
Tested with the PRs; there are no restarted pods in the fresh cluster.
# oc -n openshift-monitoring get pod
NAME                                           READY   STATUS    RESTARTS   AGE
alertmanager-main-0                            6/6     Running   0          132m
alertmanager-main-1                            6/6     Running   0          132m
alertmanager-main-2                            6/6     Running   0          132m
cluster-monitoring-operator-85489b49d6-n7hr7   2/2     Running   0          146m
grafana-6fddbbcc5c-v5tcx                       3/3     Running   0          132m
kube-state-metrics-86c755dc84-tprm9            3/3     Running   0          140m
node-exporter-d99lw                            2/2     Running   0          140m
node-exporter-r4x44                            2/2     Running   0          140m
node-exporter-rjzkw                            2/2     Running   0          140m
node-exporter-t2b7h                            2/2     Running   0          133m
node-exporter-x76d9                            2/2     Running   0          133m
node-exporter-z4rq9                            2/2     Running   0          133m
openshift-state-metrics-7b5dd5bdd4-5jnrt       3/3     Running   0          140m
prometheus-adapter-6fb8fd644-r4njz             1/1     Running   0          134m
prometheus-adapter-6fb8fd644-vmvnr             1/1     Running   0          134m
prometheus-k8s-0                               6/6     Running   0          132m
prometheus-k8s-1                               6/6     Running   0          132m
prometheus-operator-7776f6885f-5l6kq           2/2     Running   0          140m
telemeter-client-68494bcb44-6tk6c              3/3     Running   0          140m
thanos-querier-58dcb67767-l2zd2                6/6     Running   0          132m
thanos-querier-58dcb67767-mllhv                6/6     Running   0          132m

Comment 7 Junqi Zhao 2021-10-28 06:19:49 UTC
# oc -n openshift-user-workload-monitoring get pod
NAME                                   READY   STATUS    RESTARTS   AGE
prometheus-operator-57599545c8-v7qn6   2/2     Running   0          3m37s
prometheus-user-workload-0             5/5     Running   0          3m34s
prometheus-user-workload-1             5/5     Running   0          3m34s
thanos-ruler-user-workload-0           3/3     Running   0          3m30s
thanos-ruler-user-workload-1           3/3     Running   0          3m30s

Comment 13 errata-xmlrpc 2022-03-10 16:21:33 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.10.3 security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:0056

