Version: 4.7.0-0.ci.test-2021-01-19-093712-ci-ln-cvxm9pb
Image: registry.svc.ci.openshift.org/sno-dev/openshift-bip:0.2.0

Steps to reproduce:
1. Deploy SNO with BIP.
2. Check for pods that are neither Running nor Completed.

Result: prometheus-adapter pods are stuck in "Terminating" status.

[root@sealusa52 ~]# oc get pod -A|grep -v Run|grep -v Comple
NAMESPACE              NAME                                 READY   STATUS        RESTARTS   AGE
openshift-monitoring   prometheus-adapter-77d9c78b6-pfjj6   0/1     Terminating   0          29m
openshift-monitoring   prometheus-adapter-77d9c78b6-t2mf5   0/1     Terminating   0          29m
[root@sealusa52 ~]#
[root@sealusa52 ~]# oc describe pod -n openshift-monitoring prometheus-adapter-77d9c78b6-pfjj6
Name:                 prometheus-adapter-77d9c78b6-pfjj6
Namespace:            openshift-monitoring
Priority:             2000000000
Priority Class Name:  system-cluster-critical
Node:                 sno-0-0/192.168.123.116
Start Time:           Tue, 26 Jan 2021 16:41:46 -0500
Labels:               name=prometheus-adapter
                      pod-template-hash=77d9c78b6
Annotations:          k8s.v1.cni.cncf.io/network-status:
                        [{ "name": "", "interface": "eth0", "ips": [ "10.128.0.52" ], "default": true, "dns": {} }]
                      k8s.v1.cni.cncf.io/networks-status:
                        [{ "name": "", "interface": "eth0", "ips": [ "10.128.0.52" ], "default": true, "dns": {} }]
                      openshift.io/scc: restricted
Status:               Terminating (lasts 28m)
Termination Grace Period:  30s
IP:                   10.128.0.52
IPs:
  IP:  10.128.0.52
Controlled By:  ReplicaSet/prometheus-adapter-77d9c78b6
Containers:
  prometheus-adapter:
    Container ID:
    Image:       registry.svc.ci.openshift.org/sno-dev/openshift-bip@sha256:fa65d90f0e3a10463c03fa4e6c167f296843af51eb87b4df95733542bc8ab779
    Image ID:
    Port:        6443/TCP
    Host Port:   0/TCP
    Args:
      --prometheus-auth-config=/etc/prometheus-config/prometheus-config.yaml
      --config=/etc/adapter/config.yaml
      --logtostderr=true
      --metrics-relist-interval=1m
      --prometheus-url=https://prometheus-k8s.openshift-monitoring.svc:9091
      --secure-port=6443
      --client-ca-file=/etc/tls/private/client-ca-file
      --requestheader-client-ca-file=/etc/tls/private/requestheader-client-ca-file
      --requestheader-allowed-names=kube-apiserver-proxy,system:kube-apiserver-proxy,system:openshift-aggregator
      --requestheader-extra-headers-prefix=X-Remote-Extra-
      --requestheader-group-headers=X-Remote-Group
      --requestheader-username-headers=X-Remote-User
      --tls-cert-file=/etc/tls/private/tls.crt
      --tls-private-key-file=/etc/tls/private/tls.key
    State:          Waiting
      Reason:       ContainerCreating
    Last State:     Terminated
      Reason:       ContainerStatusUnknown
      Message:      The container could not be located when the pod was deleted. The container used to be Running
      Exit Code:    137
      Started:      Mon, 01 Jan 0001 00:00:00 +0000
      Finished:     Mon, 01 Jan 0001 00:00:00 +0000
    Ready:          False
    Restart Count:  0
    Requests:
      cpu:     1m
      memory:  25Mi
    Environment:  <none>
    Mounts:
      /etc/adapter from config (rw)
      /etc/prometheus-config from prometheus-adapter-prometheus-config (rw)
      /etc/ssl/certs from serving-certs-ca-bundle (rw)
      /etc/tls/private from tls (ro)
      /tmp from tmpfs (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from prometheus-adapter-token-4zmn2 (ro)
Conditions:
  Type              Status
  Initialized       True
  Ready             False
  ContainersReady   False
  PodScheduled      True
Volumes:
  tmpfs:
    Type:       EmptyDir (a temporary directory that shares a pod's lifetime)
    Medium:
    SizeLimit:  <unset>
  config:
    Type:      ConfigMap (a volume populated by a ConfigMap)
    Name:      adapter-config
    Optional:  false
  prometheus-adapter-prometheus-config:
    Type:      ConfigMap (a volume populated by a ConfigMap)
    Name:      prometheus-adapter-prometheus-config
    Optional:  false
  serving-certs-ca-bundle:
    Type:      ConfigMap (a volume populated by a ConfigMap)
    Name:      serving-certs-ca-bundle
    Optional:  false
  tls:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  prometheus-adapter-37virbg3cl6r0
    Optional:    false
  prometheus-adapter-token-4zmn2:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  prometheus-adapter-token-4zmn2
    Optional:    false
QoS Class:       Burstable
Node-Selectors:  kubernetes.io/os=linux
Tolerations:     node.kubernetes.io/memory-pressure:NoSchedule op=Exists
                 node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                 node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
  Type     Reason            Age                 From               Message
  ----     ------            ----                ----               -------
  Warning  FailedScheduling  32m (x2 over 32m)   default-scheduler  0/1 nodes are available: 1 node(s) had taint {node-role.kubernetes.io/master: }, that the pod didn't tolerate.
  Normal   Scheduled         32m                 default-scheduler  Successfully assigned openshift-monitoring/prometheus-adapter-77d9c78b6-pfjj6 to sno-0-0
  Normal   AddedInterface    32m                 multus             Add eth0 [10.128.0.52/23]
  Normal   Pulling           32m                 kubelet            Pulling image "registry.svc.ci.openshift.org/sno-dev/openshift-bip@sha256:fa65d90f0e3a10463c03fa4e6c167f296843af51eb87b4df95733542bc8ab779"
  Normal   Pulled            32m                 kubelet            Successfully pulled image "registry.svc.ci.openshift.org/sno-dev/openshift-bip@sha256:fa65d90f0e3a10463c03fa4e6c167f296843af51eb87b4df95733542bc8ab779" in 4.322131984s
  Normal   Created           32m                 kubelet            Created container prometheus-adapter
  Normal   Started           32m                 kubelet            Started container prometheus-adapter
  Warning  FailedMount       27m                 kubelet            Unable to attach or mount volumes: unmounted volumes=[tls], unattached volumes=[tmpfs config prometheus-adapter-prometheus-config serving-certs-ca-bundle tls prometheus-adapter-token-4zmn2]: timed out waiting for the condition
  Normal   Killing           27m                 kubelet            Stopping container prometheus-adapter
  Warning  FailedMount       68s (x22 over 29m)  kubelet            MountVolume.SetUp failed for volume "tls" : secret "prometheus-adapter-37virbg3cl6r0" not found
[root@sealusa52 ~]#

Note: Force-deleting the pods manually seems to work.
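For reference, the manual workaround mentioned above can be scripted. This is a minimal sketch, assuming force deletion is acceptable in the affected cluster; `force_delete` is a hypothetical helper, and `DRY_RUN=1` only prints the commands instead of running `oc`:

```shell
# Hypothetical helper: force-delete each stuck pod so the ReplicaSet can
# recover. --grace-period=0 --force removes the API object without waiting
# for kubelet confirmation. Set DRY_RUN=1 to only print the commands.
force_delete() {
  ns=$1; shift
  for pod in "$@"; do
    if [ "${DRY_RUN:-0}" = "1" ]; then
      echo "oc delete pod -n $ns $pod --grace-period=0 --force"
    else
      oc delete pod -n "$ns" "$pod" --grace-period=0 --force
    fi
  done
}

DRY_RUN=1 force_delete openshift-monitoring \
  prometheus-adapter-77d9c78b6-pfjj6 prometheus-adapter-77d9c78b6-t2mf5
```

Note that force deletion only hides the symptom; the kubelet may still hold the stuck sandbox until it is restarted.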
From the information you shared, the problem seems to be coming from the kubelet. Thus, I am transferring this bug to the Node team.
The issue is intermittent. Reproduced again (same version).
reproduced - attaching must-gather
Created attachment 1754157 [details] must-gather logs
[root@sealusa34 ~]# oc get pod -A|grep -v Run|grep -v Comple
NAMESPACE              NAME                                  READY   STATUS        RESTARTS   AGE
openshift-monitoring   prometheus-adapter-7b549d98d7-49vvz   0/1     Terminating   0          14h
openshift-monitoring   prometheus-adapter-7b549d98d7-7vf6n   0/1     Terminating   0          14h
[root@sealusa34 ~]# oc describe pod -n openshift-monitoring prometheus-adapter-7b549d98d7-49vvz
Name:                 prometheus-adapter-7b549d98d7-49vvz
Namespace:            openshift-monitoring
Priority:             2000000000
Priority Class Name:  system-cluster-critical
Node:                 sno-0-0.ocp-edge-cluster-0.qe.lab.redhat.com/fd2e:6f44:5dd8::13c
Start Time:           Mon, 01 Feb 2021 20:31:24 -0500
Labels:               name=prometheus-adapter
                      pod-template-hash=7b549d98d7
Annotations:          k8s.ovn.org/pod-networks:
                        {"default":{"ip_addresses":["fd01:0:0:1::30/64"],"mac_address":"0a:58:a7:3b:11:54","gateway_ips":["fd01:0:0:1::1"],"ip_address":"fd01:0:0:...
                      k8s.v1.cni.cncf.io/network-status:
                        [{ "name": "", "interface": "eth0", "ips": [ "fd01:0:0:1::30" ], "mac": "0a:58:a7:3b:11:54", "default": true, "dns": {} }]
                      k8s.v1.cni.cncf.io/networks-status:
                        [{ "name": "", "interface": "eth0", "ips": [ "fd01:0:0:1::30" ], "mac": "0a:58:a7:3b:11:54", "default": true, "dns": {} }]
                      openshift.io/scc: restricted
Status:               Terminating (lasts 14h)
Termination Grace Period:  30s
IP:                   fd01:0:0:1::30
IPs:
  IP:  fd01:0:0:1::30
Controlled By:  ReplicaSet/prometheus-adapter-7b549d98d7
Containers:
  prometheus-adapter:
    Container ID:  cri-o://eaf3ef2f7efad9de19e4ba0cff90b8b102a46e4d43bfa962c8748701280f24b9
    Image:         registry.svc.ci.openshift.org/sno-dev/openshift-bip@sha256:fa65d90f0e3a10463c03fa4e6c167f296843af51eb87b4df95733542bc8ab779
    Image ID:      registry.svc.ci.openshift.org/sno-dev/openshift-bip@sha256:fa65d90f0e3a10463c03fa4e6c167f296843af51eb87b4df95733542bc8ab779
    Port:          6443/TCP
    Host Port:     0/TCP
    Args:
      --prometheus-auth-config=/etc/prometheus-config/prometheus-config.yaml
      --config=/etc/adapter/config.yaml
      --logtostderr=true
      --metrics-relist-interval=1m
      --prometheus-url=https://prometheus-k8s.openshift-monitoring.svc:9091
      --secure-port=6443
      --client-ca-file=/etc/tls/private/client-ca-file
      --requestheader-client-ca-file=/etc/tls/private/requestheader-client-ca-file
      --requestheader-allowed-names=kube-apiserver-proxy,system:kube-apiserver-proxy,system:openshift-aggregator
      --requestheader-extra-headers-prefix=X-Remote-Extra-
      --requestheader-group-headers=X-Remote-Group
      --requestheader-username-headers=X-Remote-User
      --tls-cert-file=/etc/tls/private/tls.crt
      --tls-private-key-file=/etc/tls/private/tls.key
    State:       Terminated
      Reason:    Error
      Message:   :CA:20, AKID=FB:F6:2E:F6:99:60:BC:F7:63:EB:6B:51:17:44:9D:71:52:25:96:EA failed: x509: certificate signed by unknown authority]
E0202 01:31:49.294047       1 authentication.go:53] Unable to authenticate the request due to an error: [x509: certificate signed by unknown authority, verifying certificate SN=6838792845141263573, SKID=16:E7:74:4D:12:62:BF:25:1C:93:84:7C:74:55:3C:89:74:E6:CA:20, AKID=FB:F6:2E:F6:99:60:BC:F7:63:EB:6B:51:17:44:9D:71:52:25:96:EA failed: x509: certificate signed by unknown authority]
E0202 01:31:49.294200       1 authentication.go:53] Unable to authenticate the request due to an error: [x509: certificate signed by unknown authority, verifying certificate SN=6838792845141263573, SKID=16:E7:74:4D:12:62:BF:25:1C:93:84:7C:74:55:3C:89:74:E6:CA:20, AKID=FB:F6:2E:F6:99:60:BC:F7:63:EB:6B:51:17:44:9D:71:52:25:96:EA failed: x509: certificate signed by unknown authority]
E0202 01:31:49.294310       1 authentication.go:53] Unable to authenticate the request due to an error: [x509: certificate signed by unknown authority, verifying certificate SN=6838792845141263573, SKID=16:E7:74:4D:12:62:BF:25:1C:93:84:7C:74:55:3C:89:74:E6:CA:20, AKID=FB:F6:2E:F6:99:60:BC:F7:63:EB:6B:51:17:44:9D:71:52:25:96:EA failed: x509: certificate signed by unknown authority]
E0202 01:32:09.146572       1 authentication.go:53] Unable to authenticate the request due to an error: [x509: certificate signed by unknown authority, verifying certificate SN=6838792845141263573, SKID=16:E7:74:4D:12:62:BF:25:1C:93:84:7C:74:55:3C:89:74:E6:CA:20, AKID=FB:F6:2E:F6:99:60:BC:F7:63:EB:6B:51:17:44:9D:71:52:25:96:EA failed: x509: certificate signed by unknown authority]
E0202 01:32:09.146614       1 authentication.go:53] Unable to authenticate the request due to an error: [x509: certificate signed by unknown authority, verifying certificate SN=6838792845141263573, SKID=16:E7:74:4D:12:62:BF:25:1C:93:84:7C:74:55:3C:89:74:E6:CA:20, AKID=FB:F6:2E:F6:99:60:BC:F7:63:EB:6B:51:17:44:9D:71:52:25:96:EA failed: x509: certificate signed by unknown authority]
      Exit Code:    2
      Started:      Mon, 01 Feb 2021 20:31:32 -0500
      Finished:     Mon, 01 Feb 2021 20:35:41 -0500
    Ready:          False
    Restart Count:  0
    Requests:
      cpu:     1m
      memory:  25Mi
    Environment:  <none>
    Mounts:
      /etc/adapter from config (rw)
      /etc/prometheus-config from prometheus-adapter-prometheus-config (rw)
      /etc/ssl/certs from serving-certs-ca-bundle (rw)
      /etc/tls/private from tls (ro)
      /tmp from tmpfs (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from prometheus-adapter-token-t8h6p (ro)
Conditions:
  Type              Status
  Initialized       True
  Ready             False
  ContainersReady   False
  PodScheduled      True
Volumes:
  tmpfs:
    Type:       EmptyDir (a temporary directory that shares a pod's lifetime)
    Medium:
    SizeLimit:  <unset>
  config:
    Type:      ConfigMap (a volume populated by a ConfigMap)
    Name:      adapter-config
    Optional:  false
  prometheus-adapter-prometheus-config:
    Type:      ConfigMap (a volume populated by a ConfigMap)
    Name:      prometheus-adapter-prometheus-config
    Optional:  false
  serving-certs-ca-bundle:
    Type:      ConfigMap (a volume populated by a ConfigMap)
    Name:      serving-certs-ca-bundle
    Optional:  false
  tls:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  prometheus-adapter-sbmaajok6lbe
    Optional:    false
  prometheus-adapter-token-t8h6p:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  prometheus-adapter-token-t8h6p
    Optional:    false
QoS Class:       Burstable
Node-Selectors:  kubernetes.io/os=linux
Tolerations:     node.kubernetes.io/memory-pressure:NoSchedule op=Exists
                 node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                 node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
  Type     Reason       Age                  From     Message
  ----     ------       ----                 ----     -------
  Warning  FailedMount  83s (x421 over 14h)  kubelet  MountVolume.SetUp failed for volume "tls" : secret "prometheus-adapter-sbmaajok6lbe" not found
[root@sealusa34 ~]#
I am having this issue as well in 4.6.12, after restoring my cluster from an etcd backup.

NAMESPACE              NAME                                            READY   STATUS              RESTARTS   AGE
openshift-insights     insights-operator-68b87568c7-wvgvj              0/1     ContainerCreating   0          17d
openshift-logging      elasticsearch-cdm-xqj5yjib-1-5f7466cf75-426lb   0/2     ContainerCreating   0          29m
openshift-monitoring   cluster-monitoring-operator-5cdc6d5fcb-9ptb9    0/2     ContainerCreating   0          17d
openshift-monitoring   prometheus-adapter-d8c689779-fw6vn              0/1     ContainerCreating   0          12d
openshift-monitoring   prometheus-adapter-d8c689779-trs9v              0/1     ContainerCreating   0          6m8s
openshift-monitoring   prometheus-operator-76567945c4-j4wlh            0/2     ContainerCreating   0          17d

Events for prometheus-adapter-d8c689779-trs9v:
Events:
  Type     Reason       Age                  From     Message
  ----     ------       ----                 ----     -------
  Warning  FailedMount  5m2s                 kubelet  Unable to attach or mount volumes: unmounted volumes=[tls], unattached volumes=[prometheus-adapter-prometheus-config serving-certs-ca-bundle tls prometheus-adapter-token-czsgr tmpfs config]: timed out waiting for the condition
  Warning  FailedMount  53s (x11 over 7m5s)  kubelet  MountVolume.SetUp failed for volume "tls" : secret "prometheus-adapter-ai86df6ln5tau" not found
  Warning  FailedMount  28s (x2 over 2m43s)  kubelet  Unable to attach or mount volumes: unmounted volumes=[tls], unattached volumes=[serving-certs-ca-bundle tls prometheus-adapter-token-czsgr tmpfs config prometheus-adapter-prometheus-config]: timed out waiting for the condition
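As an aside, the `grep -v Run | grep -v Comple` pipeline used throughout this bug filters on the whole line, so a pod whose name happens to contain "Run" would be hidden too. A stricter sketch filters on the STATUS column instead; `abnormal_pods` is a hypothetical helper, and the heredoc stands in for live `oc get pod -A` output:

```shell
# Filter on field 4 (STATUS) of `oc get pod -A` output instead of grepping
# the whole line; NR > 1 skips the header row.
abnormal_pods() {
  awk 'NR > 1 && $4 != "Running" && $4 != "Completed"'
}

# Sample input standing in for `oc get pod -A` on an affected cluster.
abnormal_pods <<'EOF'
NAMESPACE              NAME                                 READY   STATUS        RESTARTS   AGE
openshift-monitoring   prometheus-adapter-77d9c78b6-pfjj6   0/1     Terminating   0          29m
openshift-monitoring   node-exporter-abcde                  2/2     Running       0          29m
EOF
```

On a live cluster, `oc get pod -A | abnormal_pods` would give the same filtered view.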
Can we get this retested with the rc.0 candidate? There was a memory leak fix in thanos that could have contributed to this bz.
FWIW, the thanos querier bug is included in 4.6.16, if that helps in verifying.
s/bug/bug fix
The issue intermittently reproduced with: 4.8.0-0.ci.test-2021-02-14-182353-ci-ln-0n8kv3b Image: registry.svc.ci.openshift.org/sno-dev/openshift-bip:0.5.0
*** This bug has been marked as a duplicate of bug 1929463 ***
Re-opened and transferred to monitoring after discussions in https://bugzilla.redhat.com/show_bug.cgi?id=1929463
Although the prometheus-adapter pods are Terminating in both cases, the failure reason in Comment 0 and Comment 5 is different: Comment 0 should be a node issue, while Comment 5 may be an auth issue or a monitoring issue.
FYI, from the must-gather file in Comment 4, the cluster has only one node (sno-0-0.ocp-edge-cluster-0.qe.lab.redhat.com) and it is an IPv6 cluster.
The relevant comment from bug 1929463 is: https://bugzilla.redhat.com/show_bug.cgi?id=1929463#c25
Assigning to damien, who can take a peek at the signal handling logic within prometheus-adapter.
I attached the upstream PR to add a signal handler to prometheus-adapter.
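For readers following along, the idea behind the upstream PR is graceful shutdown on SIGTERM. prometheus-adapter itself is Go, so this is only a minimal shell sketch of the behavior, not the actual patch: a process that traps SIGTERM can clean up and exit 0 within the 30s grace period, instead of being SIGKILLed:

```shell
# Minimal graceful-shutdown demo: the subshell installs a TERM trap and
# exits cleanly when kubelet-style termination (SIGTERM) arrives.
( trap 'echo "caught SIGTERM, exiting cleanly"; exit 0' TERM
  while :; do sleep 1; done ) &
pid=$!
sleep 2            # give the subshell time to install the trap
kill -TERM "$pid"
wait "$pid"
echo "exit status: $?"
```

Without the trap, the process would die from the default SIGTERM disposition with a non-zero status, which is what the stuck pods showed before the fix.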
# oc version
Client Version: 4.8.0-0.nightly-2021-04-08-200632
Server Version: 4.8.0-0.nightly-2021-04-08-200632
Kubernetes Version: v1.21.0-rc.0+6d27558

# oc -n openshift-monitoring get deploy cluster-monitoring-operator -oyaml | grep prometheus-adapter
        - -images=k8s-prometheus-adapter=quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:a48430f2ae4d6f9382218868aad95639ae17e3e0824f209b2f7fde219c87b8e9

# docker inspect quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:a48430f2ae4d6f9382218868aad95639ae17e3e0824f209b2f7fde219c87b8e9 | grep "io.openshift.build.commit.url"
        "io.openshift.build.commit.url": "https://github.com/openshift/images/commit/bcab0f7337420343611546aae2634eaf0d36c33e",
        "io.openshift.build.commit.url": "https://github.com/openshift/k8s-prometheus-adapter/commit/2856bc27f7319c069c02cbc5210852c34ef6e4ef",

Checked the files in https://github.com/openshift/k8s-prometheus-adapter/commit/2856bc27f7319c069c02cbc5210852c34ef6e4ef; the fix is in the payload now. We should update the other resources to 0.8.4 as well, for example:

# oc -n openshift-monitoring get pod --show-labels | grep prometheus-adapter
prometheus-adapter-5967cb7df6-6xgrt   1/1     Running   0          59m   app.kubernetes.io/component=metrics-adapter,app.kubernetes.io/managed-by=cluster-monitoring-operator,app.kubernetes.io/name=prometheus-adapter,app.kubernetes.io/part-of=openshift-monitoring,app.kubernetes.io/version=0.8.2,pod-template-hash=5967cb7df6

# oc -n openshift-monitoring get deploy prometheus-adapter -oyaml
...
  labels:
    app.kubernetes.io/component: metrics-adapter
    app.kubernetes.io/managed-by: cluster-monitoring-operator
    app.kubernetes.io/name: prometheus-adapter
    app.kubernetes.io/part-of: openshift-monitoring
    app.kubernetes.io/version: 0.8.2
...

# oc -n openshift-monitoring get clusterrolebinding prometheus-adapter -oyaml
...
  labels:
    app.kubernetes.io/component: metrics-adapter
    app.kubernetes.io/managed-by: cluster-monitoring-operator
    app.kubernetes.io/name: prometheus-adapter
    app.kubernetes.io/part-of: openshift-monitoring
    app.kubernetes.io/version: 0.8.2
...

# oc -n openshift-monitoring get svc prometheus-adapter -oyaml
...
  labels:
    app.kubernetes.io/component: metrics-adapter
    app.kubernetes.io/managed-by: cluster-monitoring-operator
    app.kubernetes.io/name: prometheus-adapter
    app.kubernetes.io/part-of: openshift-monitoring
    app.kubernetes.io/version: 0.8.2
Thank you for verifying; I forgot to update the labels. It should be good with the new PR.
Tested with 4.8.0-0.nightly-2021-04-13-171608: prometheus-adapter is bumped to 0.8.4, the labels are also correct, and there are no regression issues.
FailedQA. Reproduced:

[kni@r640-u09 ~]$ oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.8.0-0.nightly-2021-04-21-084059   True        False         57m     Cluster version is 4.8.0-0.nightly-2021-04-21-084059

[kni@r640-u09 ~]$ oc get pod -A|grep Term
openshift-monitoring   prometheus-adapter-6b7474585-2drrw   0/1     Terminating   0          77m
[kni@r640-u09 ~]$
The following PR may fix the issue because it removes the anti-affinity constraints for SNO: https://github.com/openshift/cluster-monitoring-operator/pull/1124
(In reply to hongyan li from comment #29)
> The following PR may fix the issue because it remove anti-affinity
> constraints for SNO
> https://github.com/openshift/cluster-monitoring-operator/pull/1124

Yes, no issue with 4.8.0-0.nightly-2021-04-22-225832 in an SNO cluster:

# oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.8.0-0.nightly-2021-04-22-225832   True        False         2m26s   Cluster version is 4.8.0-0.nightly-2021-04-22-225832

# oc get co monitoring
NAME         VERSION                             AVAILABLE   PROGRESSING   DEGRADED   SINCE
monitoring   4.8.0-0.nightly-2021-04-22-225832   True        False         False      2m32s

# oc get no
NAME                                         STATUS   ROLES           AGE   VERSION
ip-10-0-157-157.us-west-2.compute.internal   Ready    master,worker   24m   v1.21.0-rc.0+af8ab09

# oc -n openshift-monitoring get pod -o wide | grep prometheus-adapter
prometheus-adapter-5b4659d98-vvw5m   1/1     Running   0          16m   10.128.0.71   ip-10-0-157-157.us-west-2.compute.internal   <none>   <none>

# oc -n openshift-monitoring get deploy prometheus-adapter -oyaml | grep affinity -A10
# oc -n openshift-monitoring get deploy prometheus-adapter -oyaml | grep maxUnavailable -C4
    app.kubernetes.io/part-of: openshift-monitoring
  strategy:
    rollingUpdate:
      maxSurge: 1
      maxUnavailable: 0
    type: RollingUpdate
  template:
    metadata:
      annotations:
FailedQA: @hongyan The issue is not resolved. Please note that it is intermittent and may or may not reproduce. My last deployment was done a few hours ago and the issue reproduced:

[root@sealusa35 ~]# oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.8.0-0.nightly-2021-04-22-182303   True        False         14h     Cluster version is 4.8.0-0.nightly-2021-04-22-182303

[root@sealusa35 ~]# oc get pod -A|grep -v Run|grep -v Comple
NAMESPACE              NAME                                  READY   STATUS        RESTARTS   AGE
openshift-monitoring   prometheus-adapter-655d6fdbc8-6tlfw   0/1     Terminating   0          14h
Closing, @sasha feel free to reopen if you can reproduce the issue.
Just reproduced:

Version: 4.8.0-0.nightly-2021-05-25-121139

[kni@r640-u01 ~]$ oc get pod -A|grep -v Run|grep -v Comple
NAMESPACE              NAME                                 READY   STATUS        RESTARTS   AGE
openshift-monitoring   prometheus-adapter-64cd49f459-f44hx  0/1     Terminating   0          56m

[kni@r640-u01 ~]$ oc describe pod -n openshift-monitoring prometheus-adapter-64cd49f459-f44hx
Name:                 prometheus-adapter-64cd49f459-f44hx
Namespace:            openshift-monitoring
Priority:             2000000000
Priority Class Name:  system-cluster-critical
Node:                 openshift-master-0.qe1.kni.lab.eng.bos.redhat.com/10.19.134.13
Start Time:           Tue, 25 May 2021 15:02:10 -0400
Labels:               app.kubernetes.io/component=metrics-adapter
                      app.kubernetes.io/name=prometheus-adapter
                      app.kubernetes.io/part-of=openshift-monitoring
                      app.kubernetes.io/version=0.8.4
                      pod-template-hash=64cd49f459
Annotations:          k8s.v1.cni.cncf.io/network-status:
                        [{ "name": "", "interface": "eth0", "ips": [ "10.128.0.45" ], "default": true, "dns": {} }]
                      k8s.v1.cni.cncf.io/networks-status:
                        [{ "name": "", "interface": "eth0", "ips": [ "10.128.0.45" ], "default": true, "dns": {} }]
                      openshift.io/scc: restricted
                      workload.openshift.io/warning: the node "openshift-master-0.qe1.kni.lab.eng.bos.redhat.com" does not have resource "management.workload.openshift.io/cores"
Status:               Terminating (lasts 51m)
Termination Grace Period:  30s
IP:                   10.128.0.45
IPs:
  IP:  10.128.0.45
Controlled By:  ReplicaSet/prometheus-adapter-64cd49f459
Containers:
  prometheus-adapter:
    Container ID:  cri-o://00f02317c46c5b42b8f900e017b1b70dffe1bb3726a03080dc3e511b1e43c5a1
    Image:         quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:59cba985b3ba921ce66139743b463503ce7a83284f424e63b5ede6e278c6b623
    Image ID:      quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:59cba985b3ba921ce66139743b463503ce7a83284f424e63b5ede6e278c6b623
    Port:          6443/TCP
    Host Port:     0/TCP
    Args:
      --prometheus-auth-config=/etc/prometheus-config/prometheus-config.yaml
      --config=/etc/adapter/config.yaml
      --logtostderr=true
      --metrics-relist-interval=1m
      --prometheus-url=https://prometheus-k8s.openshift-monitoring.svc:9091
      --secure-port=6443
      --client-ca-file=/etc/tls/private/client-ca-file
      --requestheader-client-ca-file=/etc/tls/private/requestheader-client-ca-file
      --requestheader-allowed-names=kube-apiserver-proxy,system:kube-apiserver-proxy,system:openshift-aggregator
      --requestheader-extra-headers-prefix=X-Remote-Extra-
      --requestheader-group-headers=X-Remote-Group
      --requestheader-username-headers=X-Remote-User
      --tls-cert-file=/etc/tls/private/tls.crt
      --tls-private-key-file=/etc/tls/private/tls.key
    State:          Terminated
      Reason:       Completed
      Exit Code:    0
      Started:      Tue, 25 May 2021 15:02:23 -0400
      Finished:     Tue, 25 May 2021 15:04:57 -0400
    Ready:          False
    Restart Count:  0
    Requests:
      cpu:     1m
      memory:  40Mi
    Environment:  <none>
    Mounts:
      /etc/adapter from config (rw)
      /etc/prometheus-config from prometheus-adapter-prometheus-config (rw)
      /etc/ssl/certs from serving-certs-ca-bundle (rw)
      /etc/tls/private from tls (ro)
      /tmp from tmpfs (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from prometheus-adapter-token-6cwx7 (ro)
Conditions:
  Type              Status
  Initialized       True
  Ready             False
  ContainersReady   False
  PodScheduled      True
Volumes:
  tmpfs:
    Type:       EmptyDir (a temporary directory that shares a pod's lifetime)
    Medium:
    SizeLimit:  <unset>
  config:
    Type:      ConfigMap (a volume populated by a ConfigMap)
    Name:      adapter-config
    Optional:  false
  prometheus-adapter-prometheus-config:
    Type:      ConfigMap (a volume populated by a ConfigMap)
    Name:      prometheus-adapter-prometheus-config
    Optional:  false
  serving-certs-ca-bundle:
    Type:      ConfigMap (a volume populated by a ConfigMap)
    Name:      serving-certs-ca-bundle
    Optional:  false
  tls:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  prometheus-adapter-129lg25chsi53
    Optional:    false
  prometheus-adapter-token-6cwx7:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  prometheus-adapter-token-6cwx7
    Optional:    false
QoS Class:       Burstable
Node-Selectors:  kubernetes.io/os=linux
Tolerations:     node.kubernetes.io/memory-pressure:NoSchedule op=Exists
                 node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                 node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
  Type     Reason            Age                 From               Message
  ----     ------            ----                ----               -------
  Warning  FailedScheduling  56m                 default-scheduler  0/1 nodes are available: 1 node(s) had taint {node-role.kubernetes.io/master: }, that the pod didn't tolerate.
  Warning  FailedScheduling  56m                 default-scheduler  0/1 nodes are available: 1 node(s) had taint {node-role.kubernetes.io/master: }, that the pod didn't tolerate.
  Warning  FailedScheduling  54m                 default-scheduler  0/1 nodes are available: 1 node(s) had taint {node-role.kubernetes.io/master: }, that the pod didn't tolerate.
  Normal   Scheduled         54m                 default-scheduler  Successfully assigned openshift-monitoring/prometheus-adapter-64cd49f459-f44hx to openshift-master-0.qe1.kni.lab.eng.bos.redhat.com
  Normal   AddedInterface    54m                 multus             Add eth0 [10.128.0.45/23]
  Normal   Pulling           54m                 kubelet            Pulling image "quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:59cba985b3ba921ce66139743b463503ce7a83284f424e63b5ede6e278c6b623"
  Normal   Pulled            54m                 kubelet            Successfully pulled image "quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:59cba985b3ba921ce66139743b463503ce7a83284f424e63b5ede6e278c6b623" in 9.20035679s
  Normal   Created           54m                 kubelet            Created container prometheus-adapter
  Normal   Started           54m                 kubelet            Started container prometheus-adapter
  Normal   Killing           51m                 kubelet            Stopping container prometheus-adapter
  Warning  FailedMount       58s (x33 over 51m)  kubelet            MountVolume.SetUp failed for volume "tls" : secret "prometheus-adapter-129lg25chsi53" not found
[kni@r640-u01 ~]$
Created attachment 1786983 [details] must-gather logs from 4.8.0-0.nightly-2021-05-25-121139
As far as I can tell from the logs, prometheus-adapter now completes properly after the addition of a signal handler for SIGINT and SIGTERM. However, even though cri-o reports the container as completed with an exit code of 0, the pod itself is still marked as running.

From the status of the prometheus-adapter pod stuck in the Terminating state, we can see that the container completed, but the pod is still running:
```
status:
  conditions:
  ...
  - lastProbeTime: null
    lastTransitionTime: "2021-05-25T19:02:10Z"
    status: "True"
    type: PodScheduled
  containerStatuses:
  - containerID: cri-o://00f02317c46c5b42b8f900e017b1b70dffe1bb3726a03080dc3e511b1e43c5a1
    image: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:59cba985b3ba921ce66139743b463503ce7a83284f424e63b5ede6e278c6b623
    imageID: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:59cba985b3ba921ce66139743b463503ce7a83284f424e63b5ede6e278c6b623
    lastState: {}
    name: prometheus-adapter
    ready: false
    restartCount: 0
    started: false
    state:
      terminated:
        containerID: cri-o://00f02317c46c5b42b8f900e017b1b70dffe1bb3726a03080dc3e511b1e43c5a1
        exitCode: 0
        finishedAt: "2021-05-25T19:04:57Z"
        reason: Completed
        startedAt: "2021-05-25T19:02:23Z"
  hostIP: 10.19.134.13
  phase: Running
  ...
```
Also, the cri-o logs confirm that the container was removed:
```
May 25 19:04:58.436206 openshift-master-0.qe1.kni.lab.eng.bos.redhat.com crio[2720]: time="2021-05-25 19:04:58.436162287Z" level=info msg="Removed container 00f02317c46c5b42b8f900e017b1b70dffe1bb3726a03080dc3e511b1e43c5a1: openshift-monitoring/prometheus-adapter-64cd49f459-f44hx/prometheus-adapter" id=dba5daab-c886-472c-a0a4-99f80fb966b8 name=/runtime.v1alpha2.RuntimeService/RemoveContainer
```
And the original ReplicaSet for the prometheus-adapter deployment reports 0 replicas, so the pod shouldn't be in a running state:
```
- apiVersion: apps/v1
  kind: ReplicaSet
  metadata:
    annotations:
      deployment.kubernetes.io/desired-replicas: "1"
      deployment.kubernetes.io/max-replicas: "2"
      deployment.kubernetes.io/revision: "1"
    creationTimestamp: "2021-05-25T19:00:27Z"
    generation: 2
    labels:
      app.kubernetes.io/component: metrics-adapter
      app.kubernetes.io/name: prometheus-adapter
      app.kubernetes.io/part-of: openshift-monitoring
      app.kubernetes.io/version: 0.8.4
      pod-template-hash: 64cd49f459
    name: prometheus-adapter-64cd49f459
    namespace: openshift-monitoring
    ownerReferences:
    - apiVersion: apps/v1
      blockOwnerDeletion: true
      controller: true
      kind: Deployment
      name: prometheus-adapter
      uid: 983a2d50-45c7-46ab-943e-94aeac5033f5
    resourceVersion: "9814"
    uid: ee599589-0b68-44c9-8f23-29dd01db50ed
  spec:
    replicas: 0
```
Originally, according to comment 18, it seemed that the issue was caused by prometheus-adapter not terminating gracefully after receiving a SIGTERM. Now that it handles the signal properly, the issue is still here, and it seems to be caused by the pod resource not being properly updated. Note that prometheus-adapter seems to have been killed because of "No sandbox for pod can be found. Need to start a new one" pod="openshift-monitoring/prometheus-adapter-64cd49f459-f44hx", so there might be a race occurring when this happens.
That said, I am sending the bug over to the Node team to further investigate why the pod is stuck in Terminating state.
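To spot this inconsistent state (every container terminated, yet `status.phase` still `Running`) without eyeballing the full YAML, a small text filter can help. This is a rough sketch; `stuck_terminating` is a hypothetical helper, and the heredoc stands in for trimmed `oc get pod -o yaml` output with the indentation shown above:

```shell
# Flag a pod whose container state is terminated while status.phase is
# still Running, matching the symptom described in this bug.
stuck_terminating() {
  awk '/^  phase:/ { phase = $2 }
       /^      terminated:/ { term = 1 }
       END { if (phase == "Running" && term) print "inconsistent: terminated container but phase=Running" }'
}

# Sample input standing in for `oc get pod ... -o yaml` of an affected pod.
stuck_terminating <<'EOF'
status:
  containerStatuses:
  - name: prometheus-adapter
    state:
      terminated:
        exitCode: 0
  phase: Running
EOF
```

A healthy pod (phase `Running` with a running container, or phase `Succeeded` with terminated containers) produces no output.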
oc get clusterversion --kubeconfig sno-0.kubeconfig
NAME      VERSION      AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.8.0-fc.7   True        False         28m     Cluster version is 4.8.0-fc.7

oc get pod --kubeconfig sno-0.kubeconfig -n openshift-monitoring |grep -v Run|grep -v Comple
NAME                                  READY   STATUS        RESTARTS   AGE
prometheus-adapter-7c8ffccf56-66rl4   0/1     Terminating   0          37m

oc describe pod prometheus-adapter-7c8ffccf56-66rl4 --kubeconfig sno-0.kubeconfig -n openshift-monitoring
Name:                 prometheus-adapter-7c8ffccf56-66rl4
Namespace:            openshift-monitoring
Priority:             2000000000
Priority Class Name:  system-cluster-critical
Node:                 sno-0.vlan614.rdu2.scalelab.redhat.com/1000::1:1
Start Time:           Tue, 08 Jun 2021 17:14:23 +0000
Labels:               app.kubernetes.io/component=metrics-adapter
                      app.kubernetes.io/name=prometheus-adapter
                      app.kubernetes.io/part-of=openshift-monitoring
                      app.kubernetes.io/version=0.8.4
                      pod-template-hash=7c8ffccf56
Annotations:          k8s.ovn.org/pod-networks:
                        {"default":{"ip_addresses":["fd01:0:0:1::3a/64"],"mac_address":"0a:58:ed:0b:91:6d","gateway_ips":["fd01:0:0:1::1"],"ip_address":"fd01:0:0:...
                      k8s.v1.cni.cncf.io/network-status:
                        [{ "name": "", "interface": "eth0", "ips": [ "fd01:0:0:1::3a" ], "mac": "0a:58:ed:0b:91:6d", "default": true, "dns": {} }]
                      k8s.v1.cni.cncf.io/networks-status:
                        [{ "name": "", "interface": "eth0", "ips": [ "fd01:0:0:1::3a" ], "mac": "0a:58:ed:0b:91:6d", "default": true, "dns": {} }]
                      openshift.io/scc: restricted
                      workload.openshift.io/warning: the node "sno-0.vlan614.rdu2.scalelab.redhat.com" does not have resource "management.workload.openshift.io/cores"
Status:               Terminating (lasts 36m)
Termination Grace Period:  30s
IP:                   fd01:0:0:1::3a
IPs:
  IP:  fd01:0:0:1::3a
Controlled By:  ReplicaSet/prometheus-adapter-7c8ffccf56
Containers:
  prometheus-adapter:
    Container ID:  cri-o://90969aa50f20a0dd9adda7146f60c0e2b847fe28ca63e26d8c556a192fdec948
    Image:         quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:668f06675e7bc86e28a0b8f8011092bf72cb45368daacaa36393f683f91d326d
    Image ID:      quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:668f06675e7bc86e28a0b8f8011092bf72cb45368daacaa36393f683f91d326d
    Port:          6443/TCP
    Host Port:     0/TCP
    Args:
      --prometheus-auth-config=/etc/prometheus-config/prometheus-config.yaml
      --config=/etc/adapter/config.yaml
      --logtostderr=true
      --metrics-relist-interval=1m
      --prometheus-url=https://prometheus-k8s.openshift-monitoring.svc:9091
      --secure-port=6443
      --client-ca-file=/etc/tls/private/client-ca-file
      --requestheader-client-ca-file=/etc/tls/private/requestheader-client-ca-file
      --requestheader-allowed-names=kube-apiserver-proxy,system:kube-apiserver-proxy,system:openshift-aggregator
      --requestheader-extra-headers-prefix=X-Remote-Extra-
      --requestheader-group-headers=X-Remote-Group
      --requestheader-username-headers=X-Remote-User
      --tls-cert-file=/etc/tls/private/tls.crt
      --tls-private-key-file=/etc/tls/private/tls.key
    State:          Terminated
      Reason:       Completed
      Exit Code:    0
      Started:      Tue, 08 Jun 2021 17:14:28 +0000
      Finished:     Tue, 08 Jun 2021 17:18:24 +0000
    Ready:          False
    Restart Count:  0
    Requests:
      cpu:     1m
      memory:  40Mi
    Environment:  <none>
    Mounts:
      /etc/adapter from config (rw)
      /etc/prometheus-config from prometheus-adapter-prometheus-config (rw)
      /etc/ssl/certs from serving-certs-ca-bundle (rw)
      /etc/tls/private from tls (ro)
      /tmp from tmpfs (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from prometheus-adapter-token-hzmzh (ro)
Conditions:
  Type              Status
  Initialized       True
  Ready             False
  ContainersReady   False
  PodScheduled      True
Volumes:
  tmpfs:
    Type:       EmptyDir (a temporary directory that shares a pod's lifetime)
    Medium:
    SizeLimit:  <unset>
  config:
    Type:      ConfigMap (a volume populated by a ConfigMap)
    Name:      adapter-config
    Optional:  false
  prometheus-adapter-prometheus-config:
    Type:      ConfigMap (a volume populated by a ConfigMap)
    Name:      prometheus-adapter-prometheus-config
    Optional:  false
  serving-certs-ca-bundle:
    Type:      ConfigMap (a volume populated by a ConfigMap)
    Name:      serving-certs-ca-bundle
    Optional:  false
  tls:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  prometheus-adapter-4rmrrbs3tregh
    Optional:    false
  prometheus-adapter-token-hzmzh:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  prometheus-adapter-token-hzmzh
    Optional:    false
QoS Class:       Burstable
Node-Selectors:  kubernetes.io/os=linux
Tolerations:     node.kubernetes.io/memory-pressure:NoSchedule op=Exists
                 node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                 node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
  Type     Reason          Age                 From               Message
  ----     ------          ----                ----               -------
  Normal   Scheduled       38m                 default-scheduler  Successfully assigned openshift-monitoring/prometheus-adapter-7c8ffccf56-66rl4 to sno-0.vlan614.rdu2.scalelab.redhat.com
  Normal   AddedInterface  38m                 multus             Add eth0 [fd01:0:0:1::3a/64]
  Normal   Pulling         38m                 kubelet            Pulling image "quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:668f06675e7bc86e28a0b8f8011092bf72cb45368daacaa36393f683f91d326d"
  Normal   Pulled          38m                 kubelet            Successfully pulled image "quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:668f06675e7bc86e28a0b8f8011092bf72cb45368daacaa36393f683f91d326d" in 1.380712712s
  Normal   Created         38m                 kubelet            Created container prometheus-adapter
  Normal   Started         38m                 kubelet            Started container prometheus-adapter
  Warning  FailedMount     34m                 kubelet            Unable to attach or mount volumes: unmounted volumes=[tls], unattached volumes=[serving-certs-ca-bundle tls prometheus-adapter-token-hzmzh tmpfs config prometheus-adapter-prometheus-config]: timed out waiting for the condition
  Normal   Killing         34m                 kubelet            Stopping container prometheus-adapter
  Warning  FailedMount     12s (x26 over 36m)  kubelet            MountVolume.SetUp failed for volume "tls" : secret "prometheus-adapter-4rmrrbs3tregh" not found