Bug 1920700
| Field | Value |
|---|---|
| Summary | SNO: prometheus-adapter pods are stuck in "Terminating" status. |
| Product | OpenShift Container Platform |
| Component | Node |
| Node sub component | Kubelet |
| Status | CLOSED DUPLICATE |
| Severity | medium |
| Priority | unspecified |
| Version | 4.8 |
| Target Release | 4.8.0 |
| Hardware | Unspecified |
| OS | Unspecified |
| Reporter | Alexander Chuzhoy <sasha> |
| Assignee | Ryan Phillips <rphillips> |
| QA Contact | Sunil Choudhary <schoudha> |
| CC | achernet, alegrand, anpicker, aos-bugs, dgrisonn, erooth, hongyli, kakkoyun, lcosic, nagrawal, ohochman, pkrupa, rphillips, spasquie, tsweeney |
| Keywords | Reopened |
| Type | Bug |
| Last Closed | 2021-06-09 14:19:58 UTC |
From the information you shared, the problem seems to be coming from the kubelet, so I am transferring this bug to the Node team.

The issue is intermittent. Reproduced again (same version).

Reproduced - attaching must-gather.

Created attachment 1754157 [details]
must-gather logs
[root@sealusa34 ~]# oc get pod -A|grep -v Run|grep -v Comple
NAMESPACE NAME READY STATUS RESTARTS AGE
openshift-monitoring prometheus-adapter-7b549d98d7-49vvz 0/1 Terminating 0 14h
openshift-monitoring prometheus-adapter-7b549d98d7-7vf6n 0/1 Terminating 0 14h
[root@sealusa34 ~]# oc describe pod -n openshift-monitoring prometheus-adapter-7b549d98d7-49vvz
Name: prometheus-adapter-7b549d98d7-49vvz
Namespace: openshift-monitoring
Priority: 2000000000
Priority Class Name: system-cluster-critical
Node: sno-0-0.ocp-edge-cluster-0.qe.lab.redhat.com/fd2e:6f44:5dd8::13c
Start Time: Mon, 01 Feb 2021 20:31:24 -0500
Labels: name=prometheus-adapter
pod-template-hash=7b549d98d7
Annotations: k8s.ovn.org/pod-networks:
{"default":{"ip_addresses":["fd01:0:0:1::30/64"],"mac_address":"0a:58:a7:3b:11:54","gateway_ips":["fd01:0:0:1::1"],"ip_address":"fd01:0:0:...
k8s.v1.cni.cncf.io/network-status:
[{
"name": "",
"interface": "eth0",
"ips": [
"fd01:0:0:1::30"
],
"mac": "0a:58:a7:3b:11:54",
"default": true,
"dns": {}
}]
k8s.v1.cni.cncf.io/networks-status:
[{
"name": "",
"interface": "eth0",
"ips": [
"fd01:0:0:1::30"
],
"mac": "0a:58:a7:3b:11:54",
"default": true,
"dns": {}
}]
openshift.io/scc: restricted
Status: Terminating (lasts 14h)
Termination Grace Period: 30s
IP: fd01:0:0:1::30
IPs:
IP: fd01:0:0:1::30
Controlled By: ReplicaSet/prometheus-adapter-7b549d98d7
Containers:
prometheus-adapter:
Container ID: cri-o://eaf3ef2f7efad9de19e4ba0cff90b8b102a46e4d43bfa962c8748701280f24b9
Image: registry.svc.ci.openshift.org/sno-dev/openshift-bip@sha256:fa65d90f0e3a10463c03fa4e6c167f296843af51eb87b4df95733542bc8ab779
Image ID: registry.svc.ci.openshift.org/sno-dev/openshift-bip@sha256:fa65d90f0e3a10463c03fa4e6c167f296843af51eb87b4df95733542bc8ab779
Port: 6443/TCP
Host Port: 0/TCP
Args:
--prometheus-auth-config=/etc/prometheus-config/prometheus-config.yaml
--config=/etc/adapter/config.yaml
--logtostderr=true
--metrics-relist-interval=1m
--prometheus-url=https://prometheus-k8s.openshift-monitoring.svc:9091
--secure-port=6443
--client-ca-file=/etc/tls/private/client-ca-file
--requestheader-client-ca-file=/etc/tls/private/requestheader-client-ca-file
--requestheader-allowed-names=kube-apiserver-proxy,system:kube-apiserver-proxy,system:openshift-aggregator
--requestheader-extra-headers-prefix=X-Remote-Extra-
--requestheader-group-headers=X-Remote-Group
--requestheader-username-headers=X-Remote-User
--tls-cert-file=/etc/tls/private/tls.crt
--tls-private-key-file=/etc/tls/private/tls.key
State: Terminated
Reason: Error
Message: :CA:20, AKID=FB:F6:2E:F6:99:60:BC:F7:63:EB:6B:51:17:44:9D:71:52:25:96:EA failed: x509: certificate signed by unknown authority]
E0202 01:31:49.294047 1 authentication.go:53] Unable to authenticate the request due to an error: [x509: certificate signed by unknown authority, verifying certificate SN=6838792845141263573, SKID=16:E7:74:4D:12:62:BF:25:1C:93:84:7C:74:55:3C:89:74:E6:CA:20, AKID=FB:F6:2E:F6:99:60:BC:F7:63:EB:6B:51:17:44:9D:71:52:25:96:EA failed: x509: certificate signed by unknown authority]
E0202 01:31:49.294200 1 authentication.go:53] Unable to authenticate the request due to an error: [x509: certificate signed by unknown authority, verifying certificate SN=6838792845141263573, SKID=16:E7:74:4D:12:62:BF:25:1C:93:84:7C:74:55:3C:89:74:E6:CA:20, AKID=FB:F6:2E:F6:99:60:BC:F7:63:EB:6B:51:17:44:9D:71:52:25:96:EA failed: x509: certificate signed by unknown authority]
E0202 01:31:49.294310 1 authentication.go:53] Unable to authenticate the request due to an error: [x509: certificate signed by unknown authority, verifying certificate SN=6838792845141263573, SKID=16:E7:74:4D:12:62:BF:25:1C:93:84:7C:74:55:3C:89:74:E6:CA:20, AKID=FB:F6:2E:F6:99:60:BC:F7:63:EB:6B:51:17:44:9D:71:52:25:96:EA failed: x509: certificate signed by unknown authority]
E0202 01:32:09.146572 1 authentication.go:53] Unable to authenticate the request due to an error: [x509: certificate signed by unknown authority, verifying certificate SN=6838792845141263573, SKID=16:E7:74:4D:12:62:BF:25:1C:93:84:7C:74:55:3C:89:74:E6:CA:20, AKID=FB:F6:2E:F6:99:60:BC:F7:63:EB:6B:51:17:44:9D:71:52:25:96:EA failed: x509: certificate signed by unknown authority]
E0202 01:32:09.146614 1 authentication.go:53] Unable to authenticate the request due to an error: [x509: certificate signed by unknown authority, verifying certificate SN=6838792845141263573, SKID=16:E7:74:4D:12:62:BF:25:1C:93:84:7C:74:55:3C:89:74:E6:CA:20, AKID=FB:F6:2E:F6:99:60:BC:F7:63:EB:6B:51:17:44:9D:71:52:25:96:EA failed: x509: certificate signed by unknown authority]
Exit Code: 2
Started: Mon, 01 Feb 2021 20:31:32 -0500
Finished: Mon, 01 Feb 2021 20:35:41 -0500
Ready: False
Restart Count: 0
Requests:
cpu: 1m
memory: 25Mi
Environment: <none>
Mounts:
/etc/adapter from config (rw)
/etc/prometheus-config from prometheus-adapter-prometheus-config (rw)
/etc/ssl/certs from serving-certs-ca-bundle (rw)
/etc/tls/private from tls (ro)
/tmp from tmpfs (rw)
/var/run/secrets/kubernetes.io/serviceaccount from prometheus-adapter-token-t8h6p (ro)
Conditions:
Type Status
Initialized True
Ready False
ContainersReady False
PodScheduled True
Volumes:
tmpfs:
Type: EmptyDir (a temporary directory that shares a pod's lifetime)
Medium:
SizeLimit: <unset>
config:
Type: ConfigMap (a volume populated by a ConfigMap)
Name: adapter-config
Optional: false
prometheus-adapter-prometheus-config:
Type: ConfigMap (a volume populated by a ConfigMap)
Name: prometheus-adapter-prometheus-config
Optional: false
serving-certs-ca-bundle:
Type: ConfigMap (a volume populated by a ConfigMap)
Name: serving-certs-ca-bundle
Optional: false
tls:
Type: Secret (a volume populated by a Secret)
SecretName: prometheus-adapter-sbmaajok6lbe
Optional: false
prometheus-adapter-token-t8h6p:
Type: Secret (a volume populated by a Secret)
SecretName: prometheus-adapter-token-t8h6p
Optional: false
QoS Class: Burstable
Node-Selectors: kubernetes.io/os=linux
Tolerations: node.kubernetes.io/memory-pressure:NoSchedule op=Exists
node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Warning FailedMount 83s (x421 over 14h) kubelet MountVolume.SetUp failed for volume "tls" : secret "prometheus-adapter-sbmaajok6lbe" not found
[root@sealusa34 ~]#
I am having this issue as well in 4.6.12 after restoring my cluster from etcd backup.

NAMESPACE              NAME                                            READY   STATUS              RESTARTS   AGE
openshift-insights     insights-operator-68b87568c7-wvgvj              0/1     ContainerCreating   0          17d
openshift-logging      elasticsearch-cdm-xqj5yjib-1-5f7466cf75-426lb   0/2     ContainerCreating   0          29m
openshift-monitoring   cluster-monitoring-operator-5cdc6d5fcb-9ptb9    0/2     ContainerCreating   0          17d
openshift-monitoring   prometheus-adapter-d8c689779-fw6vn              0/1     ContainerCreating   0          12d
openshift-monitoring   prometheus-adapter-d8c689779-trs9v              0/1     ContainerCreating   0          6m8s
openshift-monitoring   prometheus-operator-76567945c4-j4wlh            0/2     ContainerCreating   0          17d

Events for prometheus-adapter-d8c689779-trs9:

Events:
  Type     Reason       Age                  From     Message
  ----     ------       ----                 ----     -------
  Warning  FailedMount  5m2s                 kubelet  Unable to attach or mount volumes: unmounted volumes=[tls], unattached volumes=[prometheus-adapter-prometheus-config serving-certs-ca-bundle tls prometheus-adapter-token-czsgr tmpfs config]: timed out waiting for the condition
  Warning  FailedMount  53s (x11 over 7m5s)  kubelet  MountVolume.SetUp failed for volume "tls" : secret "prometheus-adapter-ai86df6ln5tau" not found
  Warning  FailedMount  28s (x2 over 2m43s)  kubelet  Unable to attach or mount volumes: unmounted volumes=[tls], unattached volumes=[serving-certs-ca-bundle tls prometheus-adapter-token-czsgr tmpfs config prometheus-adapter-prometheus-config]: timed out waiting for the condition

Can we get this retested with the rc.0 candidate? There was a memory leak fix in Thanos that could have contributed to this bz.

FWIW, the Thanos querier bug fix is included in 4.6.16 if that helps in verifying.

The issue intermittently reproduced with: 4.8.0-0.ci.test-2021-02-14-182353-ci-ln-0n8kv3b
Image: registry.svc.ci.openshift.org/sno-dev/openshift-bip:0.5.0

*** This bug has been marked as a duplicate of bug 1929463 ***

Re-opened and transferred to monitoring after discussions in https://bugzilla.redhat.com/show_bug.cgi?id=1929463

Although the prometheus-adapter pods are Terminating in both cases, the failure reason in Comment 0 and Comment 5 is different: Comment 0 should be a node issue, while Comment 5 may be an auth or monitoring issue.

FYI, from the must-gather file in Comment 4, the cluster only has one node, sno-0-0.ocp-edge-cluster-0.qe.lab.redhat.com, and it is an IPv6 cluster.

The relevant comment from bug 1929463 is: https://bugzilla.redhat.com/show_bug.cgi?id=1929463#c25

Assigning to Damien, who can take a peek at the signal handling logic within prometheus-adapter.

I attached the upstream PR to add a signal handler to prometheus-adapter.
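For context, the kind of handler referenced above typically looks like the following in Go. This is only an illustrative sketch of the SIGINT/SIGTERM handling pattern (the listen address and shutdown timeout here are made up), not the code from the attached upstream PR:

```go
// Illustrative sketch of a SIGINT/SIGTERM handler of the kind the upstream PR
// adds; not the actual prometheus-adapter code.
package main

import (
	"context"
	"log"
	"net/http"
	"os"
	"os/signal"
	"syscall"
	"time"
)

func main() {
	srv := &http.Server{Addr: ":6443"}

	// Run the server in the background.
	go func() {
		if err := srv.ListenAndServe(); err != nil && err != http.ErrServerClosed {
			log.Fatalf("server error: %v", err)
		}
	}()

	// Block until SIGINT or SIGTERM is received (e.g. from the kubelet
	// stopping the container), then shut down gracefully so the process
	// exits with code 0 instead of being killed.
	stop := make(chan os.Signal, 1)
	signal.Notify(stop, syscall.SIGINT, syscall.SIGTERM)
	<-stop

	ctx, cancel := context.WithTimeout(context.Background(), 25*time.Second)
	defer cancel()
	if err := srv.Shutdown(ctx); err != nil {
		log.Printf("graceful shutdown failed: %v", err)
	}
}
```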
# oc version
Client Version: 4.8.0-0.nightly-2021-04-08-200632
Server Version: 4.8.0-0.nightly-2021-04-08-200632
Kubernetes Version: v1.21.0-rc.0+6d27558
# oc -n openshift-monitoring get deploy cluster-monitoring-operator -oyaml | grep prometheus-adapter
- -images=k8s-prometheus-adapter=quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:a48430f2ae4d6f9382218868aad95639ae17e3e0824f209b2f7fde219c87b8e9
# docker inspect quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:a48430f2ae4d6f9382218868aad95639ae17e3e0824f209b2f7fde219c87b8e9 | grep "io.openshift.build.commit.url"
"io.openshift.build.commit.url": "https://github.com/openshift/images/commit/bcab0f7337420343611546aae2634eaf0d36c33e",
"io.openshift.build.commit.url": "https://github.com/openshift/k8s-prometheus-adapter/commit/2856bc27f7319c069c02cbc5210852c34ef6e4ef",
Checked the files in https://github.com/openshift/k8s-prometheus-adapter/commit/2856bc27f7319c069c02cbc5210852c34ef6e4ef.
The fix is in the payload now; we should update the other resources to 0.8.4 as well. Example:
# oc -n openshift-monitoring get pod --show-labels | grep prometheus-adapter
prometheus-adapter-5967cb7df6-6xgrt 1/1 Running 0 59m app.kubernetes.io/component=metrics-adapter,app.kubernetes.io/managed-by=cluster-monitoring-operator,app.kubernetes.io/name=prometheus-adapter,app.kubernetes.io/part-of=openshift-monitoring,app.kubernetes.io/version=0.8.2,pod-template-hash=5967cb7df6
# oc -n openshift-monitoring get deploy prometheus-adapter -oyaml
...
labels:
app.kubernetes.io/component: metrics-adapter
app.kubernetes.io/managed-by: cluster-monitoring-operator
app.kubernetes.io/name: prometheus-adapter
app.kubernetes.io/part-of: openshift-monitoring
app.kubernetes.io/version: 0.8.2
...
# oc -n openshift-monitoring get clusterrolebinding prometheus-adapter -oyaml
...
labels:
app.kubernetes.io/component: metrics-adapter
app.kubernetes.io/managed-by: cluster-monitoring-operator
app.kubernetes.io/name: prometheus-adapter
app.kubernetes.io/part-of: openshift-monitoring
app.kubernetes.io/version: 0.8.2
...
# oc -n openshift-monitoring get svc prometheus-adapter -oyaml
...
labels:
app.kubernetes.io/component: metrics-adapter
app.kubernetes.io/managed-by: cluster-monitoring-operator
app.kubernetes.io/name: prometheus-adapter
app.kubernetes.io/part-of: openshift-monitoring
app.kubernetes.io/version: 0.8.2
Thank you for verifying. I forgot to update the labels; it should be good with the new PR.

Tested with 4.8.0-0.nightly-2021-04-13-171608: prometheus-adapter is bumped to 0.8.4, the labels are also right, and there are no regression issues.

FailedQA. Reproduced:

[kni@r640-u09 ~]$ oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.8.0-0.nightly-2021-04-21-084059   True        False         57m     Cluster version is 4.8.0-0.nightly-2021-04-21-084059
[kni@r640-u09 ~]$ oc get pod -A|grep Term
openshift-monitoring   prometheus-adapter-6b7474585-2drrw   0/1   Terminating   0   77m
[kni@r640-u09 ~]$

The following PR may fix the issue because it removes the anti-affinity constraints for SNO: https://github.com/openshift/cluster-monitoring-operator/pull/1124

(In reply to hongyan li from comment #29)
> The following PR may fix the issue because it removes the anti-affinity
> constraints for SNO
> https://github.com/openshift/cluster-monitoring-operator/pull/1124

Yes, no issue with 4.8.0-0.nightly-2021-04-22-225832 in an SNO cluster.

# oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.8.0-0.nightly-2021-04-22-225832   True        False         2m26s   Cluster version is 4.8.0-0.nightly-2021-04-22-225832
# oc get co monitoring
NAME         VERSION                             AVAILABLE   PROGRESSING   DEGRADED   SINCE
monitoring   4.8.0-0.nightly-2021-04-22-225832   True        False         False      2m32s
# oc get no
NAME                                         STATUS   ROLES           AGE   VERSION
ip-10-0-157-157.us-west-2.compute.internal   Ready    master,worker   24m   v1.21.0-rc.0+af8ab09
# oc -n openshift-monitoring get pod -o wide | grep prometheus-adapter
prometheus-adapter-5b4659d98-vvw5m   1/1   Running   0   16m   10.128.0.71   ip-10-0-157-157.us-west-2.compute.internal   <none>   <none>
# oc -n openshift-monitoring get deploy prometheus-adapter -oyaml | grep affinity -A10
# oc -n openshift-monitoring get deploy prometheus-adapter -oyaml | grep maxUnavailable -C4
app.kubernetes.io/part-of: openshift-monitoring
strategy:
rollingUpdate:
maxSurge: 1
maxUnavailable: 0
type: RollingUpdate
template:
metadata:
annotations:
FailedQA: @hongyan The issue is not resolved. Please note that it is intermittent and may or may not reproduce. My last deployment was done a few hours ago and the issue reproduced:

[root@sealusa35 ~]# oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.8.0-0.nightly-2021-04-22-182303   True        False         14h     Cluster version is 4.8.0-0.nightly-2021-04-22-182303
[root@sealusa35 ~]# oc get pod -A|grep -v Run|grep -v Comple
NAMESPACE              NAME                                  READY   STATUS        RESTARTS   AGE
openshift-monitoring   prometheus-adapter-655d6fdbc8-6tlfw   0/1     Terminating   0          14h

Closing. @sasha feel free to reopen if you can reproduce the issue.

Just reproduced:
Version: 4.8.0-0.nightly-2021-05-25-121139
[kni@r640-u01 ~]$ oc get pod -A|grep -v Run|grep -v Comple
NAMESPACE NAME READY STATUS RESTARTS AGE
openshift-monitoring prometheus-adapter-64cd49f459-f44hx 0/1 Terminating 0 56m
[kni@r640-u01 ~]$ oc describe pod -n openshift-monitoring prometheus-adapter-64cd49f459-f44hx
Name: prometheus-adapter-64cd49f459-f44hx
Namespace: openshift-monitoring
Priority: 2000000000
Priority Class Name: system-cluster-critical
Node: openshift-master-0.qe1.kni.lab.eng.bos.redhat.com/10.19.134.13
Start Time: Tue, 25 May 2021 15:02:10 -0400
Labels: app.kubernetes.io/component=metrics-adapter
app.kubernetes.io/name=prometheus-adapter
app.kubernetes.io/part-of=openshift-monitoring
app.kubernetes.io/version=0.8.4
pod-template-hash=64cd49f459
Annotations: k8s.v1.cni.cncf.io/network-status:
[{
"name": "",
"interface": "eth0",
"ips": [
"10.128.0.45"
],
"default": true,
"dns": {}
}]
k8s.v1.cni.cncf.io/networks-status:
[{
"name": "",
"interface": "eth0",
"ips": [
"10.128.0.45"
],
"default": true,
"dns": {}
}]
openshift.io/scc: restricted
workload.openshift.io/warning:
the node "openshift-master-0.qe1.kni.lab.eng.bos.redhat.com" does not have resource "management.workload.openshift.io/cores"
Status: Terminating (lasts 51m)
Termination Grace Period: 30s
IP: 10.128.0.45
IPs:
IP: 10.128.0.45
Controlled By: ReplicaSet/prometheus-adapter-64cd49f459
Containers:
prometheus-adapter:
Container ID: cri-o://00f02317c46c5b42b8f900e017b1b70dffe1bb3726a03080dc3e511b1e43c5a1
Image: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:59cba985b3ba921ce66139743b463503ce7a83284f424e63b5ede6e278c6b623
Image ID: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:59cba985b3ba921ce66139743b463503ce7a83284f424e63b5ede6e278c6b623
Port: 6443/TCP
Host Port: 0/TCP
Args:
--prometheus-auth-config=/etc/prometheus-config/prometheus-config.yaml
--config=/etc/adapter/config.yaml
--logtostderr=true
--metrics-relist-interval=1m
--prometheus-url=https://prometheus-k8s.openshift-monitoring.svc:9091
--secure-port=6443
--client-ca-file=/etc/tls/private/client-ca-file
--requestheader-client-ca-file=/etc/tls/private/requestheader-client-ca-file
--requestheader-allowed-names=kube-apiserver-proxy,system:kube-apiserver-proxy,system:openshift-aggregator
--requestheader-extra-headers-prefix=X-Remote-Extra-
--requestheader-group-headers=X-Remote-Group
--requestheader-username-headers=X-Remote-User
--tls-cert-file=/etc/tls/private/tls.crt
--tls-private-key-file=/etc/tls/private/tls.key
State: Terminated
Reason: Completed
Exit Code: 0
Started: Tue, 25 May 2021 15:02:23 -0400
Finished: Tue, 25 May 2021 15:04:57 -0400
Ready: False
Restart Count: 0
Requests:
cpu: 1m
memory: 40Mi
Environment: <none>
Mounts:
/etc/adapter from config (rw)
/etc/prometheus-config from prometheus-adapter-prometheus-config (rw)
/etc/ssl/certs from serving-certs-ca-bundle (rw)
/etc/tls/private from tls (ro)
/tmp from tmpfs (rw)
/var/run/secrets/kubernetes.io/serviceaccount from prometheus-adapter-token-6cwx7 (ro)
Conditions:
Type Status
Initialized True
Ready False
ContainersReady False
PodScheduled True
Volumes:
tmpfs:
Type: EmptyDir (a temporary directory that shares a pod's lifetime)
Medium:
SizeLimit: <unset>
config:
Type: ConfigMap (a volume populated by a ConfigMap)
Name: adapter-config
Optional: false
prometheus-adapter-prometheus-config:
Type: ConfigMap (a volume populated by a ConfigMap)
Name: prometheus-adapter-prometheus-config
Optional: false
serving-certs-ca-bundle:
Type: ConfigMap (a volume populated by a ConfigMap)
Name: serving-certs-ca-bundle
Optional: false
tls:
Type: Secret (a volume populated by a Secret)
SecretName: prometheus-adapter-129lg25chsi53
Optional: false
prometheus-adapter-token-6cwx7:
Type: Secret (a volume populated by a Secret)
SecretName: prometheus-adapter-token-6cwx7
Optional: false
QoS Class: Burstable
Node-Selectors: kubernetes.io/os=linux
Tolerations: node.kubernetes.io/memory-pressure:NoSchedule op=Exists
node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Warning FailedScheduling 56m default-scheduler 0/1 nodes are available: 1 node(s) had taint {node-role.kubernetes.io/master: }, that the pod didn't tolerate.
Warning FailedScheduling 56m default-scheduler 0/1 nodes are available: 1 node(s) had taint {node-role.kubernetes.io/master: }, that the pod didn't tolerate.
Warning FailedScheduling 54m default-scheduler 0/1 nodes are available: 1 node(s) had taint {node-role.kubernetes.io/master: }, that the pod didn't tolerate.
Normal Scheduled 54m default-scheduler Successfully assigned openshift-monitoring/prometheus-adapter-64cd49f459-f44hx to openshift-master-0.qe1.kni.lab.eng.bos.redhat.com
Normal AddedInterface 54m multus Add eth0 [10.128.0.45/23]
Normal Pulling 54m kubelet Pulling image "quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:59cba985b3ba921ce66139743b463503ce7a83284f424e63b5ede6e278c6b623"
Normal Pulled 54m kubelet Successfully pulled image "quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:59cba985b3ba921ce66139743b463503ce7a83284f424e63b5ede6e278c6b623" in 9.20035679s
Normal Created 54m kubelet Created container prometheus-adapter
Normal Started 54m kubelet Started container prometheus-adapter
Normal Killing 51m kubelet Stopping container prometheus-adapter
Warning FailedMount 58s (x33 over 51m) kubelet MountVolume.SetUp failed for volume "tls" : secret "prometheus-adapter-129lg25chsi53" not found
[kni@r640-u01 ~]$
Created attachment 1786983 [details]
must-gather logs from 4.8.0-0.nightly-2021-05-25-121139
As far as I can tell from the logs, prometheus-adapter is now completing properly after the addition of a signal handler for SIGINT and SIGTERM. However, even though the container is reported as completed with an exit code of 0 by CRI-O, the pod itself is still marked as running.
From the statuses of the prometheus-adapter pod that is stuck in the Terminating state, we can see that the container completed, but the pod is still running:
```
status:
conditions:
...
- lastProbeTime: null
lastTransitionTime: "2021-05-25T19:02:10Z"
status: "True"
type: PodScheduled
containerStatuses:
- containerID: cri-o://00f02317c46c5b42b8f900e017b1b70dffe1bb3726a03080dc3e511b1e43c5a1
image: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:59cba985b3ba921ce66139743b463503ce7a83284f424e63b5ede6e278c6b623
imageID: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:59cba985b3ba921ce66139743b463503ce7a83284f424e63b5ede6e278c6b623
lastState: {}
name: prometheus-adapter
ready: false
restartCount: 0
started: false
state:
terminated:
containerID: cri-o://00f02317c46c5b42b8f900e017b1b70dffe1bb3726a03080dc3e511b1e43c5a1
exitCode: 0
finishedAt: "2021-05-25T19:04:57Z"
reason: Completed
startedAt: "2021-05-25T19:02:23Z"
hostIP: 10.19.134.13
phase: Running
...
```
The CRI-O logs also confirm that the container was removed:
```
May 25 19:04:58.436206 openshift-master-0.qe1.kni.lab.eng.bos.redhat.com crio[2720]: time="2021-05-25 19:04:58.436162287Z" level=info msg="Removed container 00f02317c46c5b42b8f900e017b1b70dffe1bb3726a03080dc3e511b1e43c5a1: openshift-monitoring/prometheus-adapter-64cd49f459-f44hx/prometheus-adapter" id=dba5daab-c886-472c-a0a4-99f80fb966b8 name=/runtime.v1alpha2.RuntimeService/RemoveContainer
```
And the original ReplicaSet for the prometheus-adapter deployment reports 0 replicas, so the pod shouldn't be in a running state:
```
- apiVersion: apps/v1
kind: ReplicaSet
metadata:
annotations:
deployment.kubernetes.io/desired-replicas: "1"
deployment.kubernetes.io/max-replicas: "2"
deployment.kubernetes.io/revision: "1"
creationTimestamp: "2021-05-25T19:00:27Z"
generation: 2
labels:
app.kubernetes.io/component: metrics-adapter
app.kubernetes.io/name: prometheus-adapter
app.kubernetes.io/part-of: openshift-monitoring
app.kubernetes.io/version: 0.8.4
pod-template-hash: 64cd49f459
name: prometheus-adapter-64cd49f459
namespace: openshift-monitoring
ownerReferences:
- apiVersion: apps/v1
blockOwnerDeletion: true
controller: true
kind: Deployment
name: prometheus-adapter
uid: 983a2d50-45c7-46ab-943e-94aeac5033f5
resourceVersion: "9814"
uid: ee599589-0b68-44c9-8f23-29dd01db50ed
spec:
replicas: 0
```
Originally, according to comment 18, it seemed that the issue was caused by prometheus-adapter not terminating gracefully after receiving a SIGTERM, but now that it handles the signal properly the issue is still here, and it seems to be caused by the pod resource not being updated properly.
Note that prometheus-adapter seems to have been killed because of "No sandbox for pod can be found. Need to start a new one" pod="openshift-monitoring/prometheus-adapter-64cd49f459-f44hx", so there might be a race occurring when this happens.
That said, I am sending the bug over to the Node team to further investigate why the pod is stuck in Terminating state.
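As an aside, the inconsistency described above (all containers terminated while the pod phase is still reported as Running) can be spotted programmatically. The following is only an illustrative client-go sketch, not part of any fix; the namespace is taken from this bug and the kubeconfig location is an assumption:

```go
// Illustrative only: list pods whose containers have all terminated while the
// pod phase is still Running, i.e. the mismatch described above.
package main

import (
	"context"
	"fmt"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	// Assumes a kubeconfig at the default location; adjust as needed.
	config, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	client, err := kubernetes.NewForConfig(config)
	if err != nil {
		panic(err)
	}

	pods, err := client.CoreV1().Pods("openshift-monitoring").List(context.TODO(), metav1.ListOptions{})
	if err != nil {
		panic(err)
	}
	for _, p := range pods.Items {
		if p.Status.Phase != corev1.PodRunning || len(p.Status.ContainerStatuses) == 0 {
			continue
		}
		allTerminated := true
		for _, cs := range p.Status.ContainerStatuses {
			if cs.State.Terminated == nil {
				allTerminated = false
				break
			}
		}
		if allTerminated {
			fmt.Printf("%s: phase=Running but all containers terminated\n", p.Name)
		}
	}
}
```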
oc get clusterversion --kubeconfig sno-0.kubeconfig
NAME VERSION AVAILABLE PROGRESSING SINCE STATUS
version 4.8.0-fc.7 True False 28m Cluster version is 4.8.0-fc.7
oc get pod --kubeconfig sno-0.kubeconfig -n openshift-monitoring |grep -v Run|grep -v Comple
NAME READY STATUS RESTARTS AGE
prometheus-adapter-7c8ffccf56-66rl4 0/1 Terminating 0 37m
oc describe pod prometheus-adapter-7c8ffccf56-66rl4 --kubeconfig sno-0.kubeconfig -n openshift-monitoring
Name: prometheus-adapter-7c8ffccf56-66rl4
Namespace: openshift-monitoring
Priority: 2000000000
Priority Class Name: system-cluster-critical
Node: sno-0.vlan614.rdu2.scalelab.redhat.com/1000::1:1
Start Time: Tue, 08 Jun 2021 17:14:23 +0000
Labels: app.kubernetes.io/component=metrics-adapter
app.kubernetes.io/name=prometheus-adapter
app.kubernetes.io/part-of=openshift-monitoring
app.kubernetes.io/version=0.8.4
pod-template-hash=7c8ffccf56
Annotations: k8s.ovn.org/pod-networks:
{"default":{"ip_addresses":["fd01:0:0:1::3a/64"],"mac_address":"0a:58:ed:0b:91:6d","gateway_ips":["fd01:0:0:1::1"],"ip_address":"fd01:0:0:...
k8s.v1.cni.cncf.io/network-status:
[{
"name": "",
"interface": "eth0",
"ips": [
"fd01:0:0:1::3a"
],
"mac": "0a:58:ed:0b:91:6d",
"default": true,
"dns": {}
}]
k8s.v1.cni.cncf.io/networks-status:
[{
"name": "",
"interface": "eth0",
"ips": [
"fd01:0:0:1::3a"
],
"mac": "0a:58:ed:0b:91:6d",
"default": true,
"dns": {}
}]
openshift.io/scc: restricted
workload.openshift.io/warning:
the node "sno-0.vlan614.rdu2.scalelab.redhat.com" does not have resource "management.workload.openshift.io/cores"
Status: Terminating (lasts 36m)
Termination Grace Period: 30s
IP: fd01:0:0:1::3a
IPs:
IP: fd01:0:0:1::3a
Controlled By: ReplicaSet/prometheus-adapter-7c8ffccf56
Containers:
prometheus-adapter:
Container ID: cri-o://90969aa50f20a0dd9adda7146f60c0e2b847fe28ca63e26d8c556a192fdec948
Image: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:668f06675e7bc86e28a0b8f8011092bf72cb45368daacaa36393f683f91d326d
Image ID: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:668f06675e7bc86e28a0b8f8011092bf72cb45368daacaa36393f683f91d326d
Port: 6443/TCP
Host Port: 0/TCP
Args:
--prometheus-auth-config=/etc/prometheus-config/prometheus-config.yaml
--config=/etc/adapter/config.yaml
--logtostderr=true
--metrics-relist-interval=1m
--prometheus-url=https://prometheus-k8s.openshift-monitoring.svc:9091
--secure-port=6443
--client-ca-file=/etc/tls/private/client-ca-file
--requestheader-client-ca-file=/etc/tls/private/requestheader-client-ca-file
--requestheader-allowed-names=kube-apiserver-proxy,system:kube-apiserver-proxy,system:openshift-aggregator
--requestheader-extra-headers-prefix=X-Remote-Extra-
--requestheader-group-headers=X-Remote-Group
--requestheader-username-headers=X-Remote-User
--tls-cert-file=/etc/tls/private/tls.crt
--tls-private-key-file=/etc/tls/private/tls.key
State: Terminated
Reason: Completed
Exit Code: 0
Started: Tue, 08 Jun 2021 17:14:28 +0000
Finished: Tue, 08 Jun 2021 17:18:24 +0000
Ready: False
Restart Count: 0
Requests:
cpu: 1m
memory: 40Mi
Environment: <none>
Mounts:
/etc/adapter from config (rw)
/etc/prometheus-config from prometheus-adapter-prometheus-config (rw)
/etc/ssl/certs from serving-certs-ca-bundle (rw)
/etc/tls/private from tls (ro)
/tmp from tmpfs (rw)
/var/run/secrets/kubernetes.io/serviceaccount from prometheus-adapter-token-hzmzh (ro)
Conditions:
Type Status
Initialized True
Ready False
ContainersReady False
PodScheduled True
Volumes:
tmpfs:
Type: EmptyDir (a temporary directory that shares a pod's lifetime)
Medium:
SizeLimit: <unset>
config:
Type: ConfigMap (a volume populated by a ConfigMap)
Name: adapter-config
Optional: false
prometheus-adapter-prometheus-config:
Type: ConfigMap (a volume populated by a ConfigMap)
Name: prometheus-adapter-prometheus-config
Optional: false
serving-certs-ca-bundle:
Type: ConfigMap (a volume populated by a ConfigMap)
Name: serving-certs-ca-bundle
Optional: false
tls:
Type: Secret (a volume populated by a Secret)
SecretName: prometheus-adapter-4rmrrbs3tregh
Optional: false
prometheus-adapter-token-hzmzh:
Type: Secret (a volume populated by a Secret)
SecretName: prometheus-adapter-token-hzmzh
Optional: false
QoS Class: Burstable
Node-Selectors: kubernetes.io/os=linux
Tolerations: node.kubernetes.io/memory-pressure:NoSchedule op=Exists
node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Scheduled 38m default-scheduler Successfully assigned openshift-monitoring/prometheus-adapter-7c8ffccf56-66rl4 to sno-0.vlan614.rdu2.scalelab.redhat.com
Normal AddedInterface 38m multus Add eth0 [fd01:0:0:1::3a/64]
Normal Pulling 38m kubelet Pulling image "quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:668f06675e7bc86e28a0b8f8011092bf72cb45368daacaa36393f683f91d326d"
Normal Pulled 38m kubelet Successfully pulled image "quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:668f06675e7bc86e28a0b8f8011092bf72cb45368daacaa36393f683f91d326d" in 1.380712712s
Normal Created 38m kubelet Created container prometheus-adapter
Normal Started 38m kubelet Started container prometheus-adapter
Warning FailedMount 34m kubelet Unable to attach or mount volumes: unmounted volumes=[tls], unattached volumes=[serving-certs-ca-bundle tls prometheus-adapter-token-hzmzh tmpfs config prometheus-adapter-prometheus-config]: timed out waiting for the condition
Normal Killing 34m kubelet Stopping container prometheus-adapter
Warning FailedMount 12s (x26 over 36m) kubelet MountVolume.SetUp failed for volume "tls" : secret "prometheus-adapter-4rmrrbs3tregh" not found
*** This bug has been marked as a duplicate of bug 1929463 ***
Version: 4.7.0-0.ci.test-2021-01-19-093712-ci-ln-cvxm9pb
registry.svc.ci.openshift.org/sno-dev/openshift-bip:0.2.0

Steps to reproduce:
1. Deploy SNO with BIP.
2. Check all the pods that are not Running nor Complete.

Result: prometheus-adapter pods are stuck in "Terminating" status.

[root@sealusa52 ~]# oc get pod -A|grep -v Run|grep -v Comple
NAMESPACE              NAME                                 READY   STATUS        RESTARTS   AGE
openshift-monitoring   prometheus-adapter-77d9c78b6-pfjj6   0/1     Terminating   0          29m
openshift-monitoring   prometheus-adapter-77d9c78b6-t2mf5   0/1     Terminating   0          29m
[root@sealusa52 ~]#
[root@sealusa52 ~]# oc describe pod -n openshift-monitoring prometheus-adapter-77d9c78b6-pfjj6
Name: prometheus-adapter-77d9c78b6-pfjj6
Namespace: openshift-monitoring
Priority: 2000000000
Priority Class Name: system-cluster-critical
Node: sno-0-0/192.168.123.116
Start Time: Tue, 26 Jan 2021 16:41:46 -0500
Labels: name=prometheus-adapter
pod-template-hash=77d9c78b6
Annotations: k8s.v1.cni.cncf.io/network-status:
[{
"name": "",
"interface": "eth0",
"ips": [
"10.128.0.52"
],
"default": true,
"dns": {}
}]
k8s.v1.cni.cncf.io/networks-status:
[{
"name": "",
"interface": "eth0",
"ips": [
"10.128.0.52"
],
"default": true,
"dns": {}
}]
openshift.io/scc: restricted
Status: Terminating (lasts 28m)
Termination Grace Period: 30s
IP: 10.128.0.52
IPs:
IP: 10.128.0.52
Controlled By: ReplicaSet/prometheus-adapter-77d9c78b6
Containers:
prometheus-adapter:
Container ID:
Image: registry.svc.ci.openshift.org/sno-dev/openshift-bip@sha256:fa65d90f0e3a10463c03fa4e6c167f296843af51eb87b4df95733542bc8ab779
Image ID:
Port: 6443/TCP
Host Port: 0/TCP
Args:
--prometheus-auth-config=/etc/prometheus-config/prometheus-config.yaml
--config=/etc/adapter/config.yaml
--logtostderr=true
--metrics-relist-interval=1m
--prometheus-url=https://prometheus-k8s.openshift-monitoring.svc:9091
--secure-port=6443
--client-ca-file=/etc/tls/private/client-ca-file
--requestheader-client-ca-file=/etc/tls/private/requestheader-client-ca-file
--requestheader-allowed-names=kube-apiserver-proxy,system:kube-apiserver-proxy,system:openshift-aggregator
--requestheader-extra-headers-prefix=X-Remote-Extra-
--requestheader-group-headers=X-Remote-Group
--requestheader-username-headers=X-Remote-User
--tls-cert-file=/etc/tls/private/tls.crt
--tls-private-key-file=/etc/tls/private/tls.key
State: Waiting
Reason: ContainerCreating
Last State: Terminated
Reason: ContainerStatusUnknown
Message: The container could not be located when the pod was deleted. The container used to be Running
Exit Code: 137
Started: Mon, 01 Jan 0001 00:00:00 +0000
Finished: Mon, 01 Jan 0001 00:00:00 +0000
Ready: False
Restart Count: 0
Requests:
cpu: 1m
memory: 25Mi
Environment: <none>
Mounts:
/etc/adapter from config (rw)
/etc/prometheus-config from prometheus-adapter-prometheus-config (rw)
/etc/ssl/certs from serving-certs-ca-bundle (rw)
/etc/tls/private from tls (ro)
/tmp from tmpfs (rw)
/var/run/secrets/kubernetes.io/serviceaccount from prometheus-adapter-token-4zmn2 (ro)
Conditions:
Type Status
Initialized True
Ready False
ContainersReady False
PodScheduled True
Volumes:
tmpfs:
Type: EmptyDir (a temporary directory that shares a pod's lifetime)
Medium:
SizeLimit: <unset>
config:
Type: ConfigMap (a volume populated by a ConfigMap)
Name: adapter-config
Optional: false
prometheus-adapter-prometheus-config:
Type: ConfigMap (a volume populated by a ConfigMap)
Name: prometheus-adapter-prometheus-config
Optional: false
serving-certs-ca-bundle:
Type: ConfigMap (a volume populated by a ConfigMap)
Name: serving-certs-ca-bundle
Optional: false
tls:
Type: Secret (a volume populated by a Secret)
SecretName: prometheus-adapter-37virbg3cl6r0
Optional: false
prometheus-adapter-token-4zmn2:
Type: Secret (a volume populated by a Secret)
SecretName: prometheus-adapter-token-4zmn2
Optional: false
QoS Class: Burstable
Node-Selectors: kubernetes.io/os=linux
Tolerations: node.kubernetes.io/memory-pressure:NoSchedule op=Exists
node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Warning FailedScheduling 32m (x2 over 32m) default-scheduler 0/1 nodes are available: 1 node(s) had taint {node-role.kubernetes.io/master: }, that the pod didn't tolerate.
Normal Scheduled 32m default-scheduler Successfully assigned openshift-monitoring/prometheus-adapter-77d9c78b6-pfjj6 to sno-0-0
Normal AddedInterface 32m multus Add eth0 [10.128.0.52/23]
Normal Pulling 32m kubelet Pulling image "registry.svc.ci.openshift.org/sno-dev/openshift-bip@sha256:fa65d90f0e3a10463c03fa4e6c167f296843af51eb87b4df95733542bc8ab779"
Normal Pulled 32m kubelet Successfully pulled image "registry.svc.ci.openshift.org/sno-dev/openshift-bip@sha256:fa65d90f0e3a10463c03fa4e6c167f296843af51eb87b4df95733542bc8ab779" in 4.322131984s
Normal Created 32m kubelet Created container prometheus-adapter
Normal Started 32m kubelet Started container prometheus-adapter
Warning FailedMount 27m kubelet Unable to attach or mount volumes: unmounted volumes=[tls], unattached volumes=[tmpfs config prometheus-adapter-prometheus-config serving-certs-ca-bundle tls prometheus-adapter-token-4zmn2]: timed out waiting for the condition
Normal Killing 27m kubelet Stopping container prometheus-adapter
Warning FailedMount 68s (x22 over 29m) kubelet MountVolume.SetUp failed for volume "tls" : secret "prometheus-adapter-37virbg3cl6r0" not found
[root@sealusa52 ~]#

Note: Force deleting the pods manually seems to work.
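For reference, the manual force-delete workaround mentioned above amounts to deleting the pod with a zero grace period (roughly what `oc delete pod --grace-period=0 --force` does). Below is a minimal client-go sketch of the same operation; the pod name is a placeholder from this bug and the kubeconfig location is an assumption:

```go
// Illustrative only: force-delete a pod stuck in Terminating by requesting
// a zero grace period, the API-level equivalent of the manual workaround.
package main

import (
	"context"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	// Assumes a kubeconfig at the default location; adjust as needed.
	config, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	client, err := kubernetes.NewForConfig(config)
	if err != nil {
		panic(err)
	}

	grace := int64(0)
	err = client.CoreV1().Pods("openshift-monitoring").Delete(
		context.TODO(),
		"prometheus-adapter-77d9c78b6-pfjj6", // placeholder pod name
		metav1.DeleteOptions{GracePeriodSeconds: &grace},
	)
	if err != nil {
		panic(err)
	}
}
```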