Version: 4.7.0-0.ci.test-2021-01-19-093712-ci-ln-cvxm9pb
Image: registry.svc.ci.openshift.org/sno-dev/openshift-bip:0.2.0

Steps to reproduce:
1. Deploy SNO with BIP.
2. Check for pods that are neither Running nor Completed.

Result: prometheus-adapter pods are stuck in "Terminating" status.

[root@sealusa52 ~]# oc get pod -A|grep -v Run|grep -v Comple
NAMESPACE              NAME                                 READY   STATUS        RESTARTS   AGE
openshift-monitoring   prometheus-adapter-77d9c78b6-pfjj6   0/1     Terminating   0          29m
openshift-monitoring   prometheus-adapter-77d9c78b6-t2mf5   0/1     Terminating   0          29m
[root@sealusa52 ~]#
[root@sealusa52 ~]# oc describe pod -n openshift-monitoring prometheus-adapter-77d9c78b6-pfjj6
Name:                 prometheus-adapter-77d9c78b6-pfjj6
Namespace:            openshift-monitoring
Priority:             2000000000
Priority Class Name:  system-cluster-critical
Node:                 sno-0-0/192.168.123.116
Start Time:           Tue, 26 Jan 2021 16:41:46 -0500
Labels:               name=prometheus-adapter
                      pod-template-hash=77d9c78b6
Annotations:          k8s.v1.cni.cncf.io/network-status:
                        [{ "name": "", "interface": "eth0", "ips": [ "10.128.0.52" ], "default": true, "dns": {} }]
                      k8s.v1.cni.cncf.io/networks-status:
                        [{ "name": "", "interface": "eth0", "ips": [ "10.128.0.52" ], "default": true, "dns": {} }]
                      openshift.io/scc: restricted
Status:               Terminating (lasts 28m)
Termination Grace Period:  30s
IP:                   10.128.0.52
IPs:
  IP:  10.128.0.52
Controlled By:  ReplicaSet/prometheus-adapter-77d9c78b6
Containers:
  prometheus-adapter:
    Container ID:
    Image:       registry.svc.ci.openshift.org/sno-dev/openshift-bip@sha256:fa65d90f0e3a10463c03fa4e6c167f296843af51eb87b4df95733542bc8ab779
    Image ID:
    Port:        6443/TCP
    Host Port:   0/TCP
    Args:
      --prometheus-auth-config=/etc/prometheus-config/prometheus-config.yaml
      --config=/etc/adapter/config.yaml
      --logtostderr=true
      --metrics-relist-interval=1m
      --prometheus-url=https://prometheus-k8s.openshift-monitoring.svc:9091
      --secure-port=6443
      --client-ca-file=/etc/tls/private/client-ca-file
      --requestheader-client-ca-file=/etc/tls/private/requestheader-client-ca-file
      --requestheader-allowed-names=kube-apiserver-proxy,system:kube-apiserver-proxy,system:openshift-aggregator
      --requestheader-extra-headers-prefix=X-Remote-Extra-
      --requestheader-group-headers=X-Remote-Group
      --requestheader-username-headers=X-Remote-User
      --tls-cert-file=/etc/tls/private/tls.crt
      --tls-private-key-file=/etc/tls/private/tls.key
    State:          Waiting
      Reason:       ContainerCreating
    Last State:     Terminated
      Reason:       ContainerStatusUnknown
      Message:      The container could not be located when the pod was deleted. The container used to be Running
      Exit Code:    137
      Started:      Mon, 01 Jan 0001 00:00:00 +0000
      Finished:     Mon, 01 Jan 0001 00:00:00 +0000
    Ready:          False
    Restart Count:  0
    Requests:
      cpu:     1m
      memory:  25Mi
    Environment:  <none>
    Mounts:
      /etc/adapter from config (rw)
      /etc/prometheus-config from prometheus-adapter-prometheus-config (rw)
      /etc/ssl/certs from serving-certs-ca-bundle (rw)
      /etc/tls/private from tls (ro)
      /tmp from tmpfs (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from prometheus-adapter-token-4zmn2 (ro)
Conditions:
  Type              Status
  Initialized       True
  Ready             False
  ContainersReady   False
  PodScheduled      True
Volumes:
  tmpfs:
    Type:       EmptyDir (a temporary directory that shares a pod's lifetime)
    Medium:
    SizeLimit:  <unset>
  config:
    Type:      ConfigMap (a volume populated by a ConfigMap)
    Name:      adapter-config
    Optional:  false
  prometheus-adapter-prometheus-config:
    Type:      ConfigMap (a volume populated by a ConfigMap)
    Name:      prometheus-adapter-prometheus-config
    Optional:  false
  serving-certs-ca-bundle:
    Type:      ConfigMap (a volume populated by a ConfigMap)
    Name:      serving-certs-ca-bundle
    Optional:  false
  tls:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  prometheus-adapter-37virbg3cl6r0
    Optional:    false
  prometheus-adapter-token-4zmn2:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  prometheus-adapter-token-4zmn2
    Optional:    false
QoS Class:       Burstable
Node-Selectors:  kubernetes.io/os=linux
Tolerations:     node.kubernetes.io/memory-pressure:NoSchedule op=Exists
                 node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                 node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
  Type     Reason            Age                 From               Message
  ----     ------            ----                ----               -------
  Warning  FailedScheduling  32m (x2 over 32m)   default-scheduler  0/1 nodes are available: 1 node(s) had taint {node-role.kubernetes.io/master: }, that the pod didn't tolerate.
  Normal   Scheduled         32m                 default-scheduler  Successfully assigned openshift-monitoring/prometheus-adapter-77d9c78b6-pfjj6 to sno-0-0
  Normal   AddedInterface    32m                 multus             Add eth0 [10.128.0.52/23]
  Normal   Pulling           32m                 kubelet            Pulling image "registry.svc.ci.openshift.org/sno-dev/openshift-bip@sha256:fa65d90f0e3a10463c03fa4e6c167f296843af51eb87b4df95733542bc8ab779"
  Normal   Pulled            32m                 kubelet            Successfully pulled image "registry.svc.ci.openshift.org/sno-dev/openshift-bip@sha256:fa65d90f0e3a10463c03fa4e6c167f296843af51eb87b4df95733542bc8ab779" in 4.322131984s
  Normal   Created           32m                 kubelet            Created container prometheus-adapter
  Normal   Started           32m                 kubelet            Started container prometheus-adapter
  Warning  FailedMount       27m                 kubelet            Unable to attach or mount volumes: unmounted volumes=[tls], unattached volumes=[tmpfs config prometheus-adapter-prometheus-config serving-certs-ca-bundle tls prometheus-adapter-token-4zmn2]: timed out waiting for the condition
  Normal   Killing           27m                 kubelet            Stopping container prometheus-adapter
  Warning  FailedMount       68s (x22 over 29m)  kubelet            MountVolume.SetUp failed for volume "tls" : secret "prometheus-adapter-37virbg3cl6r0" not found
[root@sealusa52 ~]#

Note: Force-deleting the pods manually seems to work.
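For reference, the manual workaround mentioned above can be scripted. This is a minimal sketch, assuming force deletion is acceptable in the affected cluster; `force_delete` is a hypothetical helper, and `DRY_RUN=1` only prints the commands instead of running `oc`:

```shell
# Hypothetical helper: force-delete each stuck pod so the ReplicaSet can
# recover. --grace-period=0 --force removes the API object without waiting
# for kubelet confirmation. Set DRY_RUN=1 to only print the commands.
force_delete() {
  ns=$1; shift
  for pod in "$@"; do
    if [ "${DRY_RUN:-0}" = "1" ]; then
      echo "oc delete pod -n $ns $pod --grace-period=0 --force"
    else
      oc delete pod -n "$ns" "$pod" --grace-period=0 --force
    fi
  done
}

DRY_RUN=1 force_delete openshift-monitoring \
  prometheus-adapter-77d9c78b6-pfjj6 prometheus-adapter-77d9c78b6-t2mf5
```

Note that force deletion only hides the symptom; the kubelet may still hold the stuck sandbox until it is restarted.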
From the information you shared, the problem seems to be coming from the kubelet. Thus, I am transferring this bug to the Node team.
The issue is intermittent. Reproduced again (same version).
reproduced - attaching must-gather
Created attachment 1754157 [details] must-gather logs
[root@sealusa34 ~]# oc get pod -A|grep -v Run|grep -v Comple
NAMESPACE              NAME                                  READY   STATUS        RESTARTS   AGE
openshift-monitoring   prometheus-adapter-7b549d98d7-49vvz   0/1     Terminating   0          14h
openshift-monitoring   prometheus-adapter-7b549d98d7-7vf6n   0/1     Terminating   0          14h
[root@sealusa34 ~]# oc describe pod -n openshift-monitoring prometheus-adapter-7b549d98d7-49vvz
Name:                 prometheus-adapter-7b549d98d7-49vvz
Namespace:            openshift-monitoring
Priority:             2000000000
Priority Class Name:  system-cluster-critical
Node:                 sno-0-0.ocp-edge-cluster-0.qe.lab.redhat.com/fd2e:6f44:5dd8::13c
Start Time:           Mon, 01 Feb 2021 20:31:24 -0500
Labels:               name=prometheus-adapter
                      pod-template-hash=7b549d98d7
Annotations:          k8s.ovn.org/pod-networks:
                        {"default":{"ip_addresses":["fd01:0:0:1::30/64"],"mac_address":"0a:58:a7:3b:11:54","gateway_ips":["fd01:0:0:1::1"],"ip_address":"fd01:0:0:...
                      k8s.v1.cni.cncf.io/network-status:
                        [{ "name": "", "interface": "eth0", "ips": [ "fd01:0:0:1::30" ], "mac": "0a:58:a7:3b:11:54", "default": true, "dns": {} }]
                      k8s.v1.cni.cncf.io/networks-status:
                        [{ "name": "", "interface": "eth0", "ips": [ "fd01:0:0:1::30" ], "mac": "0a:58:a7:3b:11:54", "default": true, "dns": {} }]
                      openshift.io/scc: restricted
Status:               Terminating (lasts 14h)
Termination Grace Period:  30s
IP:                   fd01:0:0:1::30
IPs:
  IP:  fd01:0:0:1::30
Controlled By:  ReplicaSet/prometheus-adapter-7b549d98d7
Containers:
  prometheus-adapter:
    Container ID:  cri-o://eaf3ef2f7efad9de19e4ba0cff90b8b102a46e4d43bfa962c8748701280f24b9
    Image:         registry.svc.ci.openshift.org/sno-dev/openshift-bip@sha256:fa65d90f0e3a10463c03fa4e6c167f296843af51eb87b4df95733542bc8ab779
    Image ID:      registry.svc.ci.openshift.org/sno-dev/openshift-bip@sha256:fa65d90f0e3a10463c03fa4e6c167f296843af51eb87b4df95733542bc8ab779
    Port:          6443/TCP
    Host Port:     0/TCP
    Args:
      --prometheus-auth-config=/etc/prometheus-config/prometheus-config.yaml
      --config=/etc/adapter/config.yaml
      --logtostderr=true
      --metrics-relist-interval=1m
      --prometheus-url=https://prometheus-k8s.openshift-monitoring.svc:9091
      --secure-port=6443
      --client-ca-file=/etc/tls/private/client-ca-file
      --requestheader-client-ca-file=/etc/tls/private/requestheader-client-ca-file
      --requestheader-allowed-names=kube-apiserver-proxy,system:kube-apiserver-proxy,system:openshift-aggregator
      --requestheader-extra-headers-prefix=X-Remote-Extra-
      --requestheader-group-headers=X-Remote-Group
      --requestheader-username-headers=X-Remote-User
      --tls-cert-file=/etc/tls/private/tls.crt
      --tls-private-key-file=/etc/tls/private/tls.key
    State:       Terminated
      Reason:    Error
      Message:   :CA:20, AKID=FB:F6:2E:F6:99:60:BC:F7:63:EB:6B:51:17:44:9D:71:52:25:96:EA failed: x509: certificate signed by unknown authority]
E0202 01:31:49.294047       1 authentication.go:53] Unable to authenticate the request due to an error: [x509: certificate signed by unknown authority, verifying certificate SN=6838792845141263573, SKID=16:E7:74:4D:12:62:BF:25:1C:93:84:7C:74:55:3C:89:74:E6:CA:20, AKID=FB:F6:2E:F6:99:60:BC:F7:63:EB:6B:51:17:44:9D:71:52:25:96:EA failed: x509: certificate signed by unknown authority]
E0202 01:31:49.294200       1 authentication.go:53] Unable to authenticate the request due to an error: [x509: certificate signed by unknown authority, verifying certificate SN=6838792845141263573, SKID=16:E7:74:4D:12:62:BF:25:1C:93:84:7C:74:55:3C:89:74:E6:CA:20, AKID=FB:F6:2E:F6:99:60:BC:F7:63:EB:6B:51:17:44:9D:71:52:25:96:EA failed: x509: certificate signed by unknown authority]
E0202 01:31:49.294310       1 authentication.go:53] Unable to authenticate the request due to an error: [x509: certificate signed by unknown authority, verifying certificate SN=6838792845141263573, SKID=16:E7:74:4D:12:62:BF:25:1C:93:84:7C:74:55:3C:89:74:E6:CA:20, AKID=FB:F6:2E:F6:99:60:BC:F7:63:EB:6B:51:17:44:9D:71:52:25:96:EA failed: x509: certificate signed by unknown authority]
E0202 01:32:09.146572       1 authentication.go:53] Unable to authenticate the request due to an error: [x509: certificate signed by unknown authority, verifying certificate SN=6838792845141263573, SKID=16:E7:74:4D:12:62:BF:25:1C:93:84:7C:74:55:3C:89:74:E6:CA:20, AKID=FB:F6:2E:F6:99:60:BC:F7:63:EB:6B:51:17:44:9D:71:52:25:96:EA failed: x509: certificate signed by unknown authority]
E0202 01:32:09.146614       1 authentication.go:53] Unable to authenticate the request due to an error: [x509: certificate signed by unknown authority, verifying certificate SN=6838792845141263573, SKID=16:E7:74:4D:12:62:BF:25:1C:93:84:7C:74:55:3C:89:74:E6:CA:20, AKID=FB:F6:2E:F6:99:60:BC:F7:63:EB:6B:51:17:44:9D:71:52:25:96:EA failed: x509: certificate signed by unknown authority]
      Exit Code:    2
      Started:      Mon, 01 Feb 2021 20:31:32 -0500
      Finished:     Mon, 01 Feb 2021 20:35:41 -0500
    Ready:          False
    Restart Count:  0
    Requests:
      cpu:     1m
      memory:  25Mi
    Environment:  <none>
    Mounts:
      /etc/adapter from config (rw)
      /etc/prometheus-config from prometheus-adapter-prometheus-config (rw)
      /etc/ssl/certs from serving-certs-ca-bundle (rw)
      /etc/tls/private from tls (ro)
      /tmp from tmpfs (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from prometheus-adapter-token-t8h6p (ro)
Conditions:
  Type              Status
  Initialized       True
  Ready             False
  ContainersReady   False
  PodScheduled      True
Volumes:
  tmpfs:
    Type:       EmptyDir (a temporary directory that shares a pod's lifetime)
    Medium:
    SizeLimit:  <unset>
  config:
    Type:      ConfigMap (a volume populated by a ConfigMap)
    Name:      adapter-config
    Optional:  false
  prometheus-adapter-prometheus-config:
    Type:      ConfigMap (a volume populated by a ConfigMap)
    Name:      prometheus-adapter-prometheus-config
    Optional:  false
  serving-certs-ca-bundle:
    Type:      ConfigMap (a volume populated by a ConfigMap)
    Name:      serving-certs-ca-bundle
    Optional:  false
  tls:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  prometheus-adapter-sbmaajok6lbe
    Optional:    false
  prometheus-adapter-token-t8h6p:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  prometheus-adapter-token-t8h6p
    Optional:    false
QoS Class:       Burstable
Node-Selectors:  kubernetes.io/os=linux
Tolerations:     node.kubernetes.io/memory-pressure:NoSchedule op=Exists
                 node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                 node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
  Type     Reason       Age                  From     Message
  ----     ------       ----                 ----     -------
  Warning  FailedMount  83s (x421 over 14h)  kubelet  MountVolume.SetUp failed for volume "tls" : secret "prometheus-adapter-sbmaajok6lbe" not found
[root@sealusa34 ~]#
I am having this issue as well in 4.6.12, after restoring my cluster from an etcd backup.

NAMESPACE              NAME                                            READY   STATUS              RESTARTS   AGE
openshift-insights     insights-operator-68b87568c7-wvgvj              0/1     ContainerCreating   0          17d
openshift-logging      elasticsearch-cdm-xqj5yjib-1-5f7466cf75-426lb   0/2     ContainerCreating   0          29m
openshift-monitoring   cluster-monitoring-operator-5cdc6d5fcb-9ptb9    0/2     ContainerCreating   0          17d
openshift-monitoring   prometheus-adapter-d8c689779-fw6vn              0/1     ContainerCreating   0          12d
openshift-monitoring   prometheus-adapter-d8c689779-trs9v              0/1     ContainerCreating   0          6m8s
openshift-monitoring   prometheus-operator-76567945c4-j4wlh            0/2     ContainerCreating   0          17d

Events for prometheus-adapter-d8c689779-trs9v:
Events:
  Type     Reason       Age                  From     Message
  ----     ------       ----                 ----     -------
  Warning  FailedMount  5m2s                 kubelet  Unable to attach or mount volumes: unmounted volumes=[tls], unattached volumes=[prometheus-adapter-prometheus-config serving-certs-ca-bundle tls prometheus-adapter-token-czsgr tmpfs config]: timed out waiting for the condition
  Warning  FailedMount  53s (x11 over 7m5s)  kubelet  MountVolume.SetUp failed for volume "tls" : secret "prometheus-adapter-ai86df6ln5tau" not found
  Warning  FailedMount  28s (x2 over 2m43s)  kubelet  Unable to attach or mount volumes: unmounted volumes=[tls], unattached volumes=[serving-certs-ca-bundle tls prometheus-adapter-token-czsgr tmpfs config prometheus-adapter-prometheus-config]: timed out waiting for the condition
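As an aside, the `grep -v Run | grep -v Comple` pipeline used throughout this bug filters on the whole line, so a pod whose name happens to contain "Run" would be hidden too. A stricter sketch filters on the STATUS column instead; `abnormal_pods` is a hypothetical helper, and the heredoc stands in for live `oc get pod -A` output:

```shell
# Filter on field 4 (STATUS) of `oc get pod -A` output instead of grepping
# the whole line; NR > 1 skips the header row.
abnormal_pods() {
  awk 'NR > 1 && $4 != "Running" && $4 != "Completed"'
}

# Sample input standing in for `oc get pod -A` on an affected cluster.
abnormal_pods <<'EOF'
NAMESPACE              NAME                                 READY   STATUS        RESTARTS   AGE
openshift-monitoring   prometheus-adapter-77d9c78b6-pfjj6   0/1     Terminating   0          29m
openshift-monitoring   node-exporter-abcde                  2/2     Running       0          29m
EOF
```

On a live cluster, `oc get pod -A | abnormal_pods` would give the same filtered view.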
Can we get this retested with the rc.0 candidate? There was a memory leak fix in thanos that could have contributed to this bz.
FWIW, the thanos querier bug is included in 4.6.16, if that helps in verifying.
s/bug/bug fix
The issue intermittently reproduced with: 4.8.0-0.ci.test-2021-02-14-182353-ci-ln-0n8kv3b Image: registry.svc.ci.openshift.org/sno-dev/openshift-bip:0.5.0
*** This bug has been marked as a duplicate of bug 1929463 ***
Re-opened and transferred to monitoring after discussions in https://bugzilla.redhat.com/show_bug.cgi?id=1929463
Although the prometheus-adapter pods are Terminating in both cases, the failure reason in Comment 0 and Comment 5 is different: Comment 0 should be a node issue, while Comment 5 may be an auth issue or a monitoring issue.
FYI, from the must-gather file in Comment 4, the cluster has only one node (sno-0-0.ocp-edge-cluster-0.qe.lab.redhat.com) and it is an IPv6 cluster.
The relevant comment from bug 1929463 is: https://bugzilla.redhat.com/show_bug.cgi?id=1929463#c25
Assigning to damien, who can take a peek at the signal handling logic within prometheus-adapter.
I attached the upstream PR to add a signal handler to prometheus-adapter.
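For readers following along, the idea behind the upstream PR is graceful shutdown on SIGTERM. prometheus-adapter itself is Go, so this is only a minimal shell sketch of the behavior, not the actual patch: a process that traps SIGTERM can clean up and exit 0 within the 30s grace period, instead of being SIGKILLed:

```shell
# Minimal graceful-shutdown demo: the subshell installs a TERM trap and
# exits cleanly when kubelet-style termination (SIGTERM) arrives.
( trap 'echo "caught SIGTERM, exiting cleanly"; exit 0' TERM
  while :; do sleep 1; done ) &
pid=$!
sleep 2            # give the subshell time to install the trap
kill -TERM "$pid"
wait "$pid"
echo "exit status: $?"
```

Without the trap, the process would die from the default SIGTERM disposition with a non-zero status, which is what the stuck pods showed before the fix.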
# oc version
Client Version: 4.8.0-0.nightly-2021-04-08-200632
Server Version: 4.8.0-0.nightly-2021-04-08-200632
Kubernetes Version: v1.21.0-rc.0+6d27558

# oc -n openshift-monitoring get deploy cluster-monitoring-operator -oyaml | grep prometheus-adapter
        - -images=k8s-prometheus-adapter=quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:a48430f2ae4d6f9382218868aad95639ae17e3e0824f209b2f7fde219c87b8e9

# docker inspect quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:a48430f2ae4d6f9382218868aad95639ae17e3e0824f209b2f7fde219c87b8e9 | grep "io.openshift.build.commit.url"
        "io.openshift.build.commit.url": "https://github.com/openshift/images/commit/bcab0f7337420343611546aae2634eaf0d36c33e",
        "io.openshift.build.commit.url": "https://github.com/openshift/k8s-prometheus-adapter/commit/2856bc27f7319c069c02cbc5210852c34ef6e4ef",

Checked the files in https://github.com/openshift/k8s-prometheus-adapter/commit/2856bc27f7319c069c02cbc5210852c34ef6e4ef; the fix is in the payload now. We should update the other resources to 0.8.4 as well, for example:

# oc -n openshift-monitoring get pod --show-labels | grep prometheus-adapter
prometheus-adapter-5967cb7df6-6xgrt   1/1     Running   0          59m   app.kubernetes.io/component=metrics-adapter,app.kubernetes.io/managed-by=cluster-monitoring-operator,app.kubernetes.io/name=prometheus-adapter,app.kubernetes.io/part-of=openshift-monitoring,app.kubernetes.io/version=0.8.2,pod-template-hash=5967cb7df6

# oc -n openshift-monitoring get deploy prometheus-adapter -oyaml
...
  labels:
    app.kubernetes.io/component: metrics-adapter
    app.kubernetes.io/managed-by: cluster-monitoring-operator
    app.kubernetes.io/name: prometheus-adapter
    app.kubernetes.io/part-of: openshift-monitoring
    app.kubernetes.io/version: 0.8.2
...

# oc -n openshift-monitoring get clusterrolebinding prometheus-adapter -oyaml
...
  labels:
    app.kubernetes.io/component: metrics-adapter
    app.kubernetes.io/managed-by: cluster-monitoring-operator
    app.kubernetes.io/name: prometheus-adapter
    app.kubernetes.io/part-of: openshift-monitoring
    app.kubernetes.io/version: 0.8.2
...

# oc -n openshift-monitoring get svc prometheus-adapter -oyaml
...
  labels:
    app.kubernetes.io/component: metrics-adapter
    app.kubernetes.io/managed-by: cluster-monitoring-operator
    app.kubernetes.io/name: prometheus-adapter
    app.kubernetes.io/part-of: openshift-monitoring
    app.kubernetes.io/version: 0.8.2
Thank you for verifying; I forgot to update the labels. It should be good with the new PR.
Tested with 4.8.0-0.nightly-2021-04-13-171608: prometheus-adapter is bumped to 0.8.4, the labels are also correct, and there are no regression issues.
FailedQA. Reproduced:

[kni@r640-u09 ~]$ oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.8.0-0.nightly-2021-04-21-084059   True        False         57m     Cluster version is 4.8.0-0.nightly-2021-04-21-084059

[kni@r640-u09 ~]$ oc get pod -A|grep Term
openshift-monitoring   prometheus-adapter-6b7474585-2drrw   0/1     Terminating   0          77m
[kni@r640-u09 ~]$
The following PR may fix the issue because it removes the anti-affinity constraints for SNO: https://github.com/openshift/cluster-monitoring-operator/pull/1124
(In reply to hongyan li from comment #29)
> The following PR may fix the issue because it remove anti-affinity
> constraints for SNO
> https://github.com/openshift/cluster-monitoring-operator/pull/1124

Yes, no issue with 4.8.0-0.nightly-2021-04-22-225832 in an SNO cluster:

# oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.8.0-0.nightly-2021-04-22-225832   True        False         2m26s   Cluster version is 4.8.0-0.nightly-2021-04-22-225832

# oc get co monitoring
NAME         VERSION                             AVAILABLE   PROGRESSING   DEGRADED   SINCE
monitoring   4.8.0-0.nightly-2021-04-22-225832   True        False         False      2m32s

# oc get no
NAME                                         STATUS   ROLES           AGE   VERSION
ip-10-0-157-157.us-west-2.compute.internal   Ready    master,worker   24m   v1.21.0-rc.0+af8ab09

# oc -n openshift-monitoring get pod -o wide | grep prometheus-adapter
prometheus-adapter-5b4659d98-vvw5m   1/1     Running   0          16m   10.128.0.71   ip-10-0-157-157.us-west-2.compute.internal   <none>   <none>

# oc -n openshift-monitoring get deploy prometheus-adapter -oyaml | grep affinity -A10
# oc -n openshift-monitoring get deploy prometheus-adapter -oyaml | grep maxUnavailable -C4
    app.kubernetes.io/part-of: openshift-monitoring
  strategy:
    rollingUpdate:
      maxSurge: 1
      maxUnavailable: 0
    type: RollingUpdate
  template:
    metadata:
      annotations:
FailedQA: @hongyan The issue is not resolved. Please note that it is intermittent and may or may not reproduce. My last deployment was done a few hours ago and the issue reproduced:

[root@sealusa35 ~]# oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.8.0-0.nightly-2021-04-22-182303   True        False         14h     Cluster version is 4.8.0-0.nightly-2021-04-22-182303

[root@sealusa35 ~]# oc get pod -A|grep -v Run|grep -v Comple
NAMESPACE              NAME                                  READY   STATUS        RESTARTS   AGE
openshift-monitoring   prometheus-adapter-655d6fdbc8-6tlfw   0/1     Terminating   0          14h
Closing, @sasha feel free to reopen if you can reproduce the issue.
Just reproduced:

Version: 4.8.0-0.nightly-2021-05-25-121139

[kni@r640-u01 ~]$ oc get pod -A|grep -v Run|grep -v Comple
NAMESPACE              NAME                                 READY   STATUS        RESTARTS   AGE
openshift-monitoring   prometheus-adapter-64cd49f459-f44hx  0/1     Terminating   0          56m

[kni@r640-u01 ~]$ oc describe pod -n openshift-monitoring prometheus-adapter-64cd49f459-f44hx
Name:                 prometheus-adapter-64cd49f459-f44hx
Namespace:            openshift-monitoring
Priority:             2000000000
Priority Class Name:  system-cluster-critical
Node:                 openshift-master-0.qe1.kni.lab.eng.bos.redhat.com/10.19.134.13
Start Time:           Tue, 25 May 2021 15:02:10 -0400
Labels:               app.kubernetes.io/component=metrics-adapter
                      app.kubernetes.io/name=prometheus-adapter
                      app.kubernetes.io/part-of=openshift-monitoring
                      app.kubernetes.io/version=0.8.4
                      pod-template-hash=64cd49f459
Annotations:          k8s.v1.cni.cncf.io/network-status:
                        [{ "name": "", "interface": "eth0", "ips": [ "10.128.0.45" ], "default": true, "dns": {} }]
                      k8s.v1.cni.cncf.io/networks-status:
                        [{ "name": "", "interface": "eth0", "ips": [ "10.128.0.45" ], "default": true, "dns": {} }]
                      openshift.io/scc: restricted
                      workload.openshift.io/warning: the node "openshift-master-0.qe1.kni.lab.eng.bos.redhat.com" does not have resource "management.workload.openshift.io/cores"
Status:               Terminating (lasts 51m)
Termination Grace Period:  30s
IP:                   10.128.0.45
IPs:
  IP:  10.128.0.45
Controlled By:  ReplicaSet/prometheus-adapter-64cd49f459
Containers:
  prometheus-adapter:
    Container ID:  cri-o://00f02317c46c5b42b8f900e017b1b70dffe1bb3726a03080dc3e511b1e43c5a1
    Image:         quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:59cba985b3ba921ce66139743b463503ce7a83284f424e63b5ede6e278c6b623
    Image ID:      quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:59cba985b3ba921ce66139743b463503ce7a83284f424e63b5ede6e278c6b623
    Port:          6443/TCP
    Host Port:     0/TCP
    Args:
      --prometheus-auth-config=/etc/prometheus-config/prometheus-config.yaml
      --config=/etc/adapter/config.yaml
      --logtostderr=true
      --metrics-relist-interval=1m
      --prometheus-url=https://prometheus-k8s.openshift-monitoring.svc:9091
      --secure-port=6443
      --client-ca-file=/etc/tls/private/client-ca-file
      --requestheader-client-ca-file=/etc/tls/private/requestheader-client-ca-file
      --requestheader-allowed-names=kube-apiserver-proxy,system:kube-apiserver-proxy,system:openshift-aggregator
      --requestheader-extra-headers-prefix=X-Remote-Extra-
      --requestheader-group-headers=X-Remote-Group
      --requestheader-username-headers=X-Remote-User
      --tls-cert-file=/etc/tls/private/tls.crt
      --tls-private-key-file=/etc/tls/private/tls.key
    State:          Terminated
      Reason:       Completed
      Exit Code:    0
      Started:      Tue, 25 May 2021 15:02:23 -0400
      Finished:     Tue, 25 May 2021 15:04:57 -0400
    Ready:          False
    Restart Count:  0
    Requests:
      cpu:     1m
      memory:  40Mi
    Environment:  <none>
    Mounts:
      /etc/adapter from config (rw)
      /etc/prometheus-config from prometheus-adapter-prometheus-config (rw)
      /etc/ssl/certs from serving-certs-ca-bundle (rw)
      /etc/tls/private from tls (ro)
      /tmp from tmpfs (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from prometheus-adapter-token-6cwx7 (ro)
Conditions:
  Type              Status
  Initialized       True
  Ready             False
  ContainersReady   False
  PodScheduled      True
Volumes:
  tmpfs:
    Type:       EmptyDir (a temporary directory that shares a pod's lifetime)
    Medium:
    SizeLimit:  <unset>
  config:
    Type:      ConfigMap (a volume populated by a ConfigMap)
    Name:      adapter-config
    Optional:  false
  prometheus-adapter-prometheus-config:
    Type:      ConfigMap (a volume populated by a ConfigMap)
    Name:      prometheus-adapter-prometheus-config
    Optional:  false
  serving-certs-ca-bundle:
    Type:      ConfigMap (a volume populated by a ConfigMap)
    Name:      serving-certs-ca-bundle
    Optional:  false
  tls:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  prometheus-adapter-129lg25chsi53
    Optional:    false
  prometheus-adapter-token-6cwx7:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  prometheus-adapter-token-6cwx7
    Optional:    false
QoS Class:       Burstable
Node-Selectors:  kubernetes.io/os=linux
Tolerations:     node.kubernetes.io/memory-pressure:NoSchedule op=Exists
                 node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                 node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
  Type     Reason            Age                 From               Message
  ----     ------            ----                ----               -------
  Warning  FailedScheduling  56m                 default-scheduler  0/1 nodes are available: 1 node(s) had taint {node-role.kubernetes.io/master: }, that the pod didn't tolerate.
  Warning  FailedScheduling  56m                 default-scheduler  0/1 nodes are available: 1 node(s) had taint {node-role.kubernetes.io/master: }, that the pod didn't tolerate.
  Warning  FailedScheduling  54m                 default-scheduler  0/1 nodes are available: 1 node(s) had taint {node-role.kubernetes.io/master: }, that the pod didn't tolerate.
  Normal   Scheduled         54m                 default-scheduler  Successfully assigned openshift-monitoring/prometheus-adapter-64cd49f459-f44hx to openshift-master-0.qe1.kni.lab.eng.bos.redhat.com
  Normal   AddedInterface    54m                 multus             Add eth0 [10.128.0.45/23]
  Normal   Pulling           54m                 kubelet            Pulling image "quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:59cba985b3ba921ce66139743b463503ce7a83284f424e63b5ede6e278c6b623"
  Normal   Pulled            54m                 kubelet            Successfully pulled image "quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:59cba985b3ba921ce66139743b463503ce7a83284f424e63b5ede6e278c6b623" in 9.20035679s
  Normal   Created           54m                 kubelet            Created container prometheus-adapter
  Normal   Started           54m                 kubelet            Started container prometheus-adapter
  Normal   Killing           51m                 kubelet            Stopping container prometheus-adapter
  Warning  FailedMount       58s (x33 over 51m)  kubelet            MountVolume.SetUp failed for volume "tls" : secret "prometheus-adapter-129lg25chsi53" not found
[kni@r640-u01 ~]$
Created attachment 1786983 [details] must-gather logs from 4.8.0-0.nightly-2021-05-25-121139
As far as I can tell from the logs, prometheus-adapter now completes properly after the addition of a signal handler for SIGINT and SIGTERM. However, even though cri-o reports the container as completed with an exit code of 0, the pod itself is still marked as running.

From the status of the prometheus-adapter pod stuck in the Terminating state, we can see that the container completed, but the pod is still running:
```
status:
  conditions:
  ...
  - lastProbeTime: null
    lastTransitionTime: "2021-05-25T19:02:10Z"
    status: "True"
    type: PodScheduled
  containerStatuses:
  - containerID: cri-o://00f02317c46c5b42b8f900e017b1b70dffe1bb3726a03080dc3e511b1e43c5a1
    image: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:59cba985b3ba921ce66139743b463503ce7a83284f424e63b5ede6e278c6b623
    imageID: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:59cba985b3ba921ce66139743b463503ce7a83284f424e63b5ede6e278c6b623
    lastState: {}
    name: prometheus-adapter
    ready: false
    restartCount: 0
    started: false
    state:
      terminated:
        containerID: cri-o://00f02317c46c5b42b8f900e017b1b70dffe1bb3726a03080dc3e511b1e43c5a1
        exitCode: 0
        finishedAt: "2021-05-25T19:04:57Z"
        reason: Completed
        startedAt: "2021-05-25T19:02:23Z"
  hostIP: 10.19.134.13
  phase: Running
  ...
```
Also, the cri-o logs confirm that the container was removed:
```
May 25 19:04:58.436206 openshift-master-0.qe1.kni.lab.eng.bos.redhat.com crio[2720]: time="2021-05-25 19:04:58.436162287Z" level=info msg="Removed container 00f02317c46c5b42b8f900e017b1b70dffe1bb3726a03080dc3e511b1e43c5a1: openshift-monitoring/prometheus-adapter-64cd49f459-f44hx/prometheus-adapter" id=dba5daab-c886-472c-a0a4-99f80fb966b8 name=/runtime.v1alpha2.RuntimeService/RemoveContainer
```
And the original ReplicaSet for the prometheus-adapter deployment reports 0 replicas, so the pod shouldn't be in a running state:
```
- apiVersion: apps/v1
  kind: ReplicaSet
  metadata:
    annotations:
      deployment.kubernetes.io/desired-replicas: "1"
      deployment.kubernetes.io/max-replicas: "2"
      deployment.kubernetes.io/revision: "1"
    creationTimestamp: "2021-05-25T19:00:27Z"
    generation: 2
    labels:
      app.kubernetes.io/component: metrics-adapter
      app.kubernetes.io/name: prometheus-adapter
      app.kubernetes.io/part-of: openshift-monitoring
      app.kubernetes.io/version: 0.8.4
      pod-template-hash: 64cd49f459
    name: prometheus-adapter-64cd49f459
    namespace: openshift-monitoring
    ownerReferences:
    - apiVersion: apps/v1
      blockOwnerDeletion: true
      controller: true
      kind: Deployment
      name: prometheus-adapter
      uid: 983a2d50-45c7-46ab-943e-94aeac5033f5
    resourceVersion: "9814"
    uid: ee599589-0b68-44c9-8f23-29dd01db50ed
  spec:
    replicas: 0
```
Originally, according to comment 18, it seemed that the issue was caused by prometheus-adapter not terminating gracefully after receiving a SIGTERM. Now that it handles the signal properly, the issue is still here, and it seems to be caused by the pod resource not being properly updated. Note that prometheus-adapter seems to have been killed because of "No sandbox for pod can be found. Need to start a new one" pod="openshift-monitoring/prometheus-adapter-64cd49f459-f44hx", so there might be a race occurring when this happens.
That said, I am sending the bug over to the Node team to further investigate why the pod is stuck in Terminating state.
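To spot this inconsistent state (every container terminated, yet `status.phase` still `Running`) without eyeballing the full YAML, a small text filter can help. This is a rough sketch; `stuck_terminating` is a hypothetical helper, and the heredoc stands in for trimmed `oc get pod -o yaml` output with the indentation shown above:

```shell
# Flag a pod whose container state is terminated while status.phase is
# still Running, matching the symptom described in this bug.
stuck_terminating() {
  awk '/^  phase:/ { phase = $2 }
       /^      terminated:/ { term = 1 }
       END { if (phase == "Running" && term) print "inconsistent: terminated container but phase=Running" }'
}

# Sample input standing in for `oc get pod ... -o yaml` of an affected pod.
stuck_terminating <<'EOF'
status:
  containerStatuses:
  - name: prometheus-adapter
    state:
      terminated:
        exitCode: 0
  phase: Running
EOF
```

A healthy pod (phase `Running` with a running container, or phase `Succeeded` with terminated containers) produces no output.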
oc get clusterversion --kubeconfig sno-0.kubeconfig
NAME      VERSION      AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.8.0-fc.7   True        False         28m     Cluster version is 4.8.0-fc.7

oc get pod --kubeconfig sno-0.kubeconfig -n openshift-monitoring |grep -v Run|grep -v Comple
NAME                                  READY   STATUS        RESTARTS   AGE
prometheus-adapter-7c8ffccf56-66rl4   0/1     Terminating   0          37m

oc describe pod prometheus-adapter-7c8ffccf56-66rl4 --kubeconfig sno-0.kubeconfig -n openshift-monitoring
Name:                 prometheus-adapter-7c8ffccf56-66rl4
Namespace:            openshift-monitoring
Priority:             2000000000
Priority Class Name:  system-cluster-critical
Node:                 sno-0.vlan614.rdu2.scalelab.redhat.com/1000::1:1
Start Time:           Tue, 08 Jun 2021 17:14:23 +0000
Labels:               app.kubernetes.io/component=metrics-adapter
                      app.kubernetes.io/name=prometheus-adapter
                      app.kubernetes.io/part-of=openshift-monitoring
                      app.kubernetes.io/version=0.8.4
                      pod-template-hash=7c8ffccf56
Annotations:          k8s.ovn.org/pod-networks:
                        {"default":{"ip_addresses":["fd01:0:0:1::3a/64"],"mac_address":"0a:58:ed:0b:91:6d","gateway_ips":["fd01:0:0:1::1"],"ip_address":"fd01:0:0:...
                      k8s.v1.cni.cncf.io/network-status:
                        [{ "name": "", "interface": "eth0", "ips": [ "fd01:0:0:1::3a" ], "mac": "0a:58:ed:0b:91:6d", "default": true, "dns": {} }]
                      k8s.v1.cni.cncf.io/networks-status:
                        [{ "name": "", "interface": "eth0", "ips": [ "fd01:0:0:1::3a" ], "mac": "0a:58:ed:0b:91:6d", "default": true, "dns": {} }]
                      openshift.io/scc: restricted
                      workload.openshift.io/warning: the node "sno-0.vlan614.rdu2.scalelab.redhat.com" does not have resource "management.workload.openshift.io/cores"
Status:               Terminating (lasts 36m)
Termination Grace Period:  30s
IP:                   fd01:0:0:1::3a
IPs:
  IP:  fd01:0:0:1::3a
Controlled By:  ReplicaSet/prometheus-adapter-7c8ffccf56
Containers:
  prometheus-adapter:
    Container ID:  cri-o://90969aa50f20a0dd9adda7146f60c0e2b847fe28ca63e26d8c556a192fdec948
    Image:         quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:668f06675e7bc86e28a0b8f8011092bf72cb45368daacaa36393f683f91d326d
    Image ID:      quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:668f06675e7bc86e28a0b8f8011092bf72cb45368daacaa36393f683f91d326d
    Port:          6443/TCP
    Host Port:     0/TCP
    Args:
      --prometheus-auth-config=/etc/prometheus-config/prometheus-config.yaml
      --config=/etc/adapter/config.yaml
      --logtostderr=true
      --metrics-relist-interval=1m
      --prometheus-url=https://prometheus-k8s.openshift-monitoring.svc:9091
      --secure-port=6443
      --client-ca-file=/etc/tls/private/client-ca-file
      --requestheader-client-ca-file=/etc/tls/private/requestheader-client-ca-file
      --requestheader-allowed-names=kube-apiserver-proxy,system:kube-apiserver-proxy,system:openshift-aggregator
      --requestheader-extra-headers-prefix=X-Remote-Extra-
      --requestheader-group-headers=X-Remote-Group
      --requestheader-username-headers=X-Remote-User
      --tls-cert-file=/etc/tls/private/tls.crt
      --tls-private-key-file=/etc/tls/private/tls.key
    State:          Terminated
      Reason:       Completed
      Exit Code:    0
      Started:      Tue, 08 Jun 2021 17:14:28 +0000
      Finished:     Tue, 08 Jun 2021 17:18:24 +0000
    Ready:          False
    Restart Count:  0
    Requests:
      cpu:     1m
      memory:  40Mi
    Environment:  <none>
    Mounts:
      /etc/adapter from config (rw)
      /etc/prometheus-config from prometheus-adapter-prometheus-config (rw)
      /etc/ssl/certs from serving-certs-ca-bundle (rw)
      /etc/tls/private from tls (ro)
      /tmp from tmpfs (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from prometheus-adapter-token-hzmzh (ro)
Conditions:
  Type              Status
  Initialized       True
  Ready             False
  ContainersReady   False
  PodScheduled      True
Volumes:
  tmpfs:
    Type:       EmptyDir (a temporary directory that shares a pod's lifetime)
    Medium:
    SizeLimit:  <unset>
  config:
    Type:      ConfigMap (a volume populated by a ConfigMap)
    Name:      adapter-config
    Optional:  false
  prometheus-adapter-prometheus-config:
    Type:      ConfigMap (a volume populated by a ConfigMap)
    Name:      prometheus-adapter-prometheus-config
    Optional:  false
  serving-certs-ca-bundle:
    Type:      ConfigMap (a volume populated by a ConfigMap)
    Name:      serving-certs-ca-bundle
    Optional:  false
  tls:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  prometheus-adapter-4rmrrbs3tregh
    Optional:    false
  prometheus-adapter-token-hzmzh:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  prometheus-adapter-token-hzmzh
    Optional:    false
QoS Class:       Burstable
Node-Selectors:  kubernetes.io/os=linux
Tolerations:     node.kubernetes.io/memory-pressure:NoSchedule op=Exists
                 node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                 node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
  Type     Reason          Age                 From               Message
  ----     ------          ----                ----               -------
  Normal   Scheduled       38m                 default-scheduler  Successfully assigned openshift-monitoring/prometheus-adapter-7c8ffccf56-66rl4 to sno-0.vlan614.rdu2.scalelab.redhat.com
  Normal   AddedInterface  38m                 multus             Add eth0 [fd01:0:0:1::3a/64]
  Normal   Pulling         38m                 kubelet            Pulling image "quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:668f06675e7bc86e28a0b8f8011092bf72cb45368daacaa36393f683f91d326d"
  Normal   Pulled          38m                 kubelet            Successfully pulled image "quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:668f06675e7bc86e28a0b8f8011092bf72cb45368daacaa36393f683f91d326d" in 1.380712712s
  Normal   Created         38m                 kubelet            Created container prometheus-adapter
  Normal   Started         38m                 kubelet            Started container prometheus-adapter
  Warning  FailedMount     34m                 kubelet            Unable to attach or mount volumes: unmounted volumes=[tls], unattached volumes=[serving-certs-ca-bundle tls prometheus-adapter-token-hzmzh tmpfs config prometheus-adapter-prometheus-config]: timed out waiting for the condition
  Normal   Killing         34m                 kubelet            Stopping container prometheus-adapter
  Warning  FailedMount     12s (x26 over 36m)  kubelet            MountVolume.SetUp failed for volume "tls" : secret "prometheus-adapter-4rmrrbs3tregh" not found