Bug 1848833 - 1 of 2 prometheus-user-workload pods fails to run
Summary: 1 of 2 prometheus-user-workload pods fails to run
Keywords:
Status: CLOSED DUPLICATE of bug 1848450
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Monitoring
Version: 4.3.z
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: unspecified
Target Milestone: ---
Target Release: ---
Assignee: Pawel Krupa
QA Contact: Junqi Zhao
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2020-06-19 04:10 UTC by Daneyon Hansen
Modified: 2020-06-19 07:01 UTC
CC List: 8 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2020-06-19 07:01:27 UTC
Target Upstream Version:
Embargoed:



Description Daneyon Hansen 2020-06-19 04:10:32 UTC
Description of problem:
One of the two prometheus-user-workload pods fails to run when following the product docs to monitor my own service. The pending pod only runs after the master nodes are made schedulable (see Additional info).

Version-Release number of selected component (if applicable):
4.3.19

How reproducible:
Always

Steps to Reproduce:
1. Create a cluster
2. Follow product docs [1] to monitor my own service (an example of the enabling step is sketched below)
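
For context, the docs step in [1] is preceded by enabling user workload monitoring. A minimal sketch of that ConfigMap, assuming the 4.3 tech-preview flag (techPreviewUserWorkload) described in the 4.3 docs rather than anything stated in this report:

$ cat <<EOF | oc apply -f -
apiVersion: v1
kind: ConfigMap
metadata:
  name: cluster-monitoring-config
  namespace: openshift-monitoring
data:
  config.yaml: |
    # tech-preview switch that deploys the prometheus-user-workload pods
    techPreviewUserWorkload:
      enabled: true
EOF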

Actual results:

$ oc -n openshift-user-workload-monitoring get pod
NAME                                   READY   STATUS    RESTARTS   AGE
prometheus-operator-756b9cbd89-zx6w6   1/1     Running   0          5h28m
prometheus-user-workload-0             0/5     Pending   0          4h23m
prometheus-user-workload-1             5/5     Running   1          4h23m

Expected results:

$ oc -n openshift-user-workload-monitoring get pod
NAME                                   READY   STATUS    RESTARTS   AGE
prometheus-operator-756b9cbd89-zx6w6   1/1     Running   0          5h28m
prometheus-user-workload-0             5/5     Running   0          4h23m
prometheus-user-workload-1             5/5     Running   1          4h23m

Additional info:

$ oc -n openshift-user-workload-monitoring describe po/prometheus-user-workload-0 
<SNIP>
Warning  FailedScheduling  <unknown>  default-scheduler  0/6 nodes are available: 3 Insufficient cpu, 3 node(s) had taints that the pod didn't tolerate.

Worker nodes have no taints and master nodes have the following taint:

  taints:
  - effect: NoSchedule
    key: node-role.kubernetes.io/master

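Given the scheduler reports "3 Insufficient cpu" and "3 node(s) had taints", the three tainted nodes are the masters and the three workers appear to be out of allocatable CPU. Two commands that can confirm this (not taken from the report; <worker-node> is a placeholder):

$ oc get nodes -o custom-columns=NAME:.metadata.name,TAINTS:.spec.taints
$ oc describe node <worker-node> | grep -A 10 'Allocated resources'
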
Other than control plane pods, only my test app pod is running:

$ oc get po -o wide
NAME                            READY   STATUS    RESTARTS   AGE     IP            NODE                                        NOMINATED NODE   READINESS GATES
network-check-9ffbdd476-8lp96   1/1     Running   0          4h57m   10.131.0.20   ip-10-0-159-58.us-west-2.compute.internal   <none>           <none> 

No worker node is reporting memory, disk, or PID pressure. See [2] for details of the worker nodes. The pending pod (prometheus-user-workload-0) runs after making the master nodes schedulable:

$ oc -n openshift-user-workload-monitoring get pod
NAME                                   READY   STATUS    RESTARTS   AGE
prometheus-operator-756b9cbd89-zx6w6   1/1     Running   0          5h45m
prometheus-user-workload-0             5/5     Running   1          4h40m
prometheus-user-workload-1             5/5     Running   1          4h40m
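
The report does not show how the masters were made schedulable; one way to do it on OCP 4.x, included here only as an assumed reproduction step, is to patch the cluster Scheduler resource:

$ oc patch schedulers.config.openshift.io cluster --type merge \
    -p '{"spec":{"mastersSchedulable":true}}'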

[1] https://docs.openshift.com/container-platform/4.3/monitoring/monitoring-your-own-services.html#creating-a-role-for-setting-up-metrics-collection_monitoring-your-own-services

[2] https://gist.githubusercontent.com/danehans/3b075e36cf65184ffacc0569103e25d1/raw/7b874d05c15c6eb0f2515fe65549db31645459c3/02_worker_node_details

Comment 1 Pawel Krupa 2020-06-19 07:01:27 UTC

*** This bug has been marked as a duplicate of bug 1848450 ***

