Bug 1889111 - Prometheus and alertmanager pods won't start on specific node - in Terminated state
Keywords:
Status: CLOSED DUPLICATE of bug 1887354
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Monitoring
Version: 4.5
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: medium
Target Milestone: ---
Target Release: 4.7.0
Assignee: Simon Pasquier
QA Contact: Junqi Zhao
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2020-10-17 22:11 UTC by Vladislav Walek
Modified: 2023-12-15 19:49 UTC
CC List: 12 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2020-10-21 14:02:27 UTC
Target Upstream Version:
Embargoed:



Description Vladislav Walek 2020-10-17 22:11:38 UTC
Description of problem:

- The alertmanager-main-2 and prometheus-k8s-1 pods are not able to start and are stuck in a Terminating state.
- The other pods from the statefulsets are running normally, without any error.
- Both problematic pods are scheduled on the same infra node.
- Other (non-monitoring) pods on that node are running normally.

alertmanager-main-2                            0/5     Terminating   0          1s
prometheus-k8s-1                               0/7     Terminating   0          2s

After checking the kubelet and crio logs, I can see that the containers are constantly brought down and up in an endless loop without the pod status ever being cleared.

Scaling the statefulsets down and up does not help.

Restarting the kubelet does not solve the issue either.
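
For completeness, a rough way to confirm the same symptom and pull the node-level logs (the node name below is a placeholder and the exact options may vary by 4.5.z release):

# confirm both pods sit on the same infra node
oc -n openshift-monitoring get pods alertmanager-main-2 prometheus-k8s-1 -o wide

# pull kubelet and crio logs from that node
oc adm node-logs <infra-node-name> -u kubelet | tail -n 200
oc adm node-logs <infra-node-name> -u crio | tail -n 200

# last resort: force-remove the stuck pods so the statefulsets recreate them
oc -n openshift-monitoring delete pod alertmanager-main-2 prometheus-k8s-1 --grace-period=0 --force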

Version-Release number of selected component (if applicable):
OpenShift Container Platform 4.5


How reproducible:
n/a


Additional info:
Will provide data in an additional comment.

Comment 5 Nicolas Nosenzo 2020-10-19 08:51:01 UTC
The prometheus-operator is erroring with the following message:

"""
2020-10-16T15:33:42.713855516+00:00 stderr F E1016 15:33:42.713828       1 operator.go:996] Sync "openshift-monitoring/k8s" failed: failed to create new ConfigMap 'prometheus-k8s-rulefiles-0': configmaps "prometheus-k8s-rulefiles-0" is forbidden: cannot set blockOwnerDeletion if an ownerReference refers to a resource you can't set finalizers on: , <nil>
"""

Then, it repeatedly posts:

"""
2020-10-17T18:22:33.440002053Z E1017 18:22:33.439961       1 operator.go:996] Sync "openshift-monitoring/k8s" failed: failed to create new ConfigMap 'prometheus-k8s-rulefiles-0': configmaps is forbidden: User "system:serviceaccount:openshift-monitoring:prometheus-operator" cannot create resource "configmaps" in API group "" in the namespace "openshift-monitoring"
2020-10-17T18:22:46.314927734Z E1017 18:22:46.314847       1 operator.go:996] Sync "openshift-monitoring/k8s" failed: failed to create ConfigMap 'prometheus-k8s-rulefiles-0': configmaps "prometheus-k8s-rulefiles-0" already exists
"""

Comment 6 Ryan Phillips 2020-10-19 19:32:30 UTC
This looks like a monitoring operator RBAC permissions issue.
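
If it is RBAC, a quick way to see which bindings grant that service account access (binding names here are only examples and may differ per release):

# bindings in the namespace that mention the operator's service account
oc -n openshift-monitoring get rolebindings -o wide | grep -i prometheus-operator

# cluster-scoped bindings as well
oc get clusterrolebindings -o wide | grep -i prometheus-operator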

Moving over...

Comment 10 Simon Pasquier 2020-10-21 14:02:27 UTC
After discussing offline with @surbania, we concluded that it might be related to bug 1863011 (bug 1887354 for 4.5.z), so I'm closing this one as a duplicate since the resolution is already ongoing.

*** This bug has been marked as a duplicate of bug 1887354 ***

