Description of problem:
Get 'Application is not available' when accessing the Prometheus UI

Version-Release number of selected component (if applicable):
4.8.0-0.nightly-2021-04-28-005607

How reproducible:
Always

Steps to Reproduce:
1. Go to console -> Monitoring -> Metrics
2. Click the "Platform Prometheus UI" link to access the Prometheus UI

Actual results:
The console works well, but we get 'Application is not available' when accessing the Prometheus UI.

Expected results:
The Prometheus UI is accessible.

Additional info:
Running the following times out:
# oc -n openshift-monitoring exec -c prometheus prometheus-k8s-0 -- curl -k -H "Authorization: Bearer $token" 'https://prometheus-k8s.openshift-monitoring.svc:9091/api/v1/query?query=ALERTS'
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
  0     0    0     0    0     0      0      0 --:--:--  0:02:09 --:--:--     0
curl: (7) Failed to connect to prometheus-k8s.openshift-monitoring.svc port 9091: Connection timed out
command terminated with exit code 7

We get the correct result when using the pod IP:
# oc -n openshift-monitoring exec -c prometheus prometheus-k8s-0 -- curl -k -H "Authorization: Bearer $token" 'https://{prometheus_pod_ip}:9091/api/v1/query?query=ALERTS' | jq

# oc -n openshift-monitoring get svc prometheus-k8s -oyaml
apiVersion: v1
kind: Service
metadata:
  annotations:
    service.alpha.openshift.io/serving-cert-signed-by: openshift-service-serving-signer@1619581198
    service.beta.openshift.io/serving-cert-secret-name: prometheus-k8s-tls
    service.beta.openshift.io/serving-cert-signed-by: openshift-service-serving-signer@1619581198
  creationTimestamp: "2021-04-28T03:51:36Z"
  labels:
    app.kubernetes.io/component: prometheus
    app.kubernetes.io/managed-by: cluster-monitoring-operator
    app.kubernetes.io/name: prometheus
    app.kubernetes.io/part-of: openshift-monitoring
    app.kubernetes.io/version: 2.24.0
    prometheus: k8s
  managedFields:
  - apiVersion: v1
  name: prometheus-k8s
  namespace: openshift-monitoring
  resourceVersion: "22238"
  uid: b96cf523-bdbd-4fa5-8b54-1c41b4da0b11
spec:
  clusterIP: 172.30.87.10
  clusterIPs:
  - 172.30.87.10
  ipFamilies:
  - IPv4
  ipFamilyPolicy: SingleStack
  ports:
  - name: web
    port: 9091
    protocol: TCP
    targetPort: web
  - name: tenancy
    port: 9092
    protocol: TCP
    targetPort: tenancy
  selector:
    app: prometheus
    app.kubernetes.io/component: prometheus
    app.kubernetes.io/managed-by: cluster-monitoring-operator
    app.kubernetes.io/name: prometheus
    app.kubernetes.io/part-of: openshift-monitoring
    prometheus: k8s
  sessionAffinity: ClientIP
  sessionAffinityConfig:
    clientIP:
      timeoutSeconds: 10800
  type: ClusterIP
status:
  loadBalancer: {}
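For context on why this Service ends up with no endpoints: Kubernetes only adds a pod to a Service's Endpoints when every key/value pair in `spec.selector` is present in the pod's labels. A minimal Python sketch of that matching rule (illustrative only; the "pod labels missing one key" example below is hypothetical, not taken from this cluster):

```python
def selector_matches(selector: dict, pod_labels: dict) -> bool:
    """A Service selects a pod only when every key/value pair in
    spec.selector is also present in the pod's labels."""
    return all(pod_labels.get(k) == v for k, v in selector.items())

# Selector copied from the prometheus-k8s Service above.
selector = {
    "app": "prometheus",
    "app.kubernetes.io/component": "prometheus",
    "app.kubernetes.io/managed-by": "cluster-monitoring-operator",
    "app.kubernetes.io/name": "prometheus",
    "app.kubernetes.io/part-of": "openshift-monitoring",
    "prometheus": "k8s",
}

# Hypothetical pod label set missing one selector key: the pod is
# not selected, so the Endpoints object stays empty.
pod_labels = {k: v for k, v in selector.items()
              if k != "app.kubernetes.io/managed-by"}

print(selector_matches(selector, pod_labels))  # False
print(selector_matches(selector, selector))    # True
```

A single missing or mismatched label is enough to leave the Endpoints list empty, which matches the `<none>` endpoints and the connection timeout seen above.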
The Monitoring -> Metrics console works well.
The Grafana UI displays well, but we get 'Application is not available' when accessing both the Prometheus UI and the Alertmanager UI.
# oc -n openshift-monitoring get svc prometheus-k8s
NAME             TYPE        CLUSTER-IP     EXTERNAL-IP   PORT(S)             AGE
prometheus-k8s   ClusterIP   172.30.87.10   <none>        9091/TCP,9092/TCP   4h42m

# oc -n openshift-monitoring get pod -o wide | grep prometheus-k8s
prometheus-k8s-0   7/7   Running   1   4h20m   10.131.0.33   ip-10-0-197-153.ap-northeast-1.compute.internal   <none>   <none>
prometheus-k8s-1   7/7   Running   1   4h17m   10.128.2.30   ip-10-0-142-190.ap-northeast-1.compute.internal   <none>   <none>

# oc debug node/ip-10-0-197-153.ap-northeast-1.compute.internal
sh-4.4# chroot /host
sh-4.4# iptables-save | grep 9091
-A KUBE-SERVICES ! -s 10.128.0.0/14 -d 172.30.212.37/32 -p tcp -m comment --comment "openshift-monitoring/thanos-querier:web cluster IP" -m tcp --dport 9091 -j KUBE-MARK-MASQ
-A KUBE-SERVICES -d 172.30.212.37/32 -p tcp -m comment --comment "openshift-monitoring/thanos-querier:web cluster IP" -m tcp --dport 9091 -j KUBE-SVC-G5A7ID5ATXHWKRS5
-A KUBE-SEP-6K6USKENEZOOGRJZ -p tcp -m comment --comment "openshift-monitoring/thanos-querier:web" -m tcp -j DNAT --to-destination 10.131.0.30:9091
-A KUBE-SEP-SSSHBF3TPULS5UN7 -p tcp -m comment --comment "openshift-monitoring/thanos-querier:web" -m tcp -j DNAT --to-destination 10.128.2.25:9091
-A KUBE-SERVICES -d 172.30.87.10/32 -p tcp -m comment --comment "openshift-monitoring/prometheus-k8s:web has no endpoints" -m tcp --dport 9091 -j REJECT --reject-with icmp-port-unreachable
************************************************
There is no "prometheus-k8s:web cluster IP" rule in the result. A healthy service would have one like:
-A KUBE-SERVICES -d 172.30.87.10/32 -p tcp -m comment --comment "openshift-monitoring/prometheus-k8s:web cluster IP" -m tcp --dport 9091 -j KUBE-SVC-DCLNKYLNAMROIJRV
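The "has no endpoints ... -j REJECT" rule above is kube-proxy's standard behavior: a ClusterIP Service with no ready endpoints gets a REJECT rule instead of DNAT targets. A rough sketch of that decision (illustrative Python, not kube-proxy's actual implementation; the rule strings are abbreviated):

```python
def clusterip_rules(service_name: str, cluster_ip: str, port: int,
                    endpoints: list) -> list:
    """Sketch of the iptables outcome seen in the transcript: with
    ready endpoints the cluster IP is DNAT-ed to pod IPs; with no
    endpoints, traffic to the cluster IP is rejected outright."""
    if not endpoints:
        return [f"-A KUBE-SERVICES -d {cluster_ip}/32 -p tcp "
                f'-m comment --comment "{service_name} has no endpoints" '
                f"-m tcp --dport {port} -j REJECT"]
    return [f"-A KUBE-SEP-... -p tcp -m tcp -j DNAT "
            f"--to-destination {ep}:{port}" for ep in endpoints]

# No endpoints -> a single REJECT rule, as observed for prometheus-k8s.
print(clusterip_rules("openshift-monitoring/prometheus-k8s:web",
                      "172.30.87.10", 9091, []))
```

This is why the curl to the service DNS name fails while the pod IP works: the pods are healthy, but the service-level rules never forward to them.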
# oc -n openshift-monitoring get ep
NAME                            ENDPOINTS                                                          AGE
alertmanager-main               <none>                                                             4h55m
alertmanager-operated           10.128.2.31:9095,10.128.2.32:9095,10.131.0.35:9095 + 6 more...     4h55m
cluster-monitoring-operator     10.128.0.89:8443                                                   5h7m
grafana                         10.128.2.29:3000                                                   4h55m
kube-state-metrics              10.131.0.26:8443,10.131.0.26:9443                                  5h7m
node-exporter                   10.0.142.190:9100,10.0.153.211:9100,10.0.168.159:9100 + 3 more...  5h7m
openshift-state-metrics         10.131.0.29:8443,10.131.0.29:9443                                  5h7m
prometheus-adapter              10.128.2.27:6443,10.131.0.27:6443                                  5h7m
prometheus-k8s                  <none>                                                             4h55m
prometheus-k8s-thanos-sidecar   <none>                                                             4h55m
prometheus-operated             10.128.2.30:9091,10.131.0.33:9091,10.128.2.30:10901 + 1 more...    4h55m
prometheus-operator             10.130.0.79:8080,10.130.0.79:8443                                  5h7m
telemeter-client                10.128.2.24:8443                                                   5h7m
thanos-querier                  10.128.2.25:9093,10.131.0.30:9093,10.128.2.25:9092 + 3 more...     5h7m

# oc -n openshift-monitoring get ep prometheus-k8s -oyaml
apiVersion: v1
kind: Endpoints
metadata:
  annotations:
    endpoints.kubernetes.io/last-change-trigger-time: "2021-04-28T03:51:36Z"
  creationTimestamp: "2021-04-28T03:51:48Z"
  labels:
    app.kubernetes.io/component: prometheus
    app.kubernetes.io/managed-by: cluster-monitoring-operator
    app.kubernetes.io/name: prometheus
    app.kubernetes.io/part-of: openshift-monitoring
    app.kubernetes.io/version: 2.24.0
    prometheus: k8s
  name: prometheus-k8s
  namespace: openshift-monitoring
  resourceVersion: "22584"
  uid: 7f3573f1-e28d-4785-a172-de03797da1cb
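When an Endpoints object is unexpectedly `<none>` like this, the usual cause is a mismatch between the Service's `spec.selector` and the pods' labels. A small triage helper that reports exactly which selector entries a pod fails to satisfy (illustrative sketch, not part of any OpenShift tooling; the function name is made up for this example):

```python
def missing_selector_labels(selector: dict, pod_labels: dict) -> dict:
    """Return the selector entries a pod does not satisfy.
    An empty dict means the pod would appear in the Endpoints."""
    return {k: v for k, v in selector.items()
            if pod_labels.get(k) != v}

# Hypothetical example: a pod carrying only a subset of the
# prometheus-k8s selector labels.
selector = {"app": "prometheus", "prometheus": "k8s",
            "app.kubernetes.io/managed-by": "cluster-monitoring-operator"}
pod_labels = {"app": "prometheus", "prometheus": "k8s"}
print(missing_selector_labels(selector, pod_labels))
# -> the managed-by entry is reported as unsatisfied
```

Comparing this output against `oc get pod --show-labels` pinpoints the offending label without reading the full YAML by eye.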
Checked with 4.8.0-0.nightly-2021-04-29-063720: the Prometheus UI can be logged in to now (see the attached picture), and all the endpoints are normal.

# oc -n openshift-monitoring get ep
NAME                            ENDPOINTS                                                         AGE
alertmanager-main               10.128.2.27:9095,10.131.0.32:9095,10.131.0.38:9095 + 3 more...    66m
alertmanager-operated           10.128.2.27:9095,10.131.0.32:9095,10.131.0.38:9095 + 6 more...    66m
cluster-monitoring-operator     10.130.0.76:8443                                                  74m
grafana                         10.128.2.24:3000                                                  66m
kube-state-metrics              10.131.0.26:8443,10.131.0.26:9443                                 74m
node-exporter                   10.0.0.3:9100,10.0.0.4:9100,10.0.0.5:9100 + 3 more...             74m
openshift-state-metrics         10.131.0.30:8443,10.131.0.30:9443                                 74m
prometheus-adapter              10.128.2.23:6443,10.131.0.29:6443                                 74m
prometheus-k8s                  10.128.2.29:9092,10.131.0.34:9092,10.128.2.29:9091 + 1 more...    66m
prometheus-k8s-thanos-sidecar   10.128.2.29:10902,10.131.0.34:10902                               66m
prometheus-operated             10.128.2.29:9091,10.131.0.34:9091,10.128.2.29:10901 + 1 more...   66m
prometheus-operator             10.128.0.96:8080,10.128.0.96:8443                                 74m
telemeter-client                10.128.2.25:8443                                                  74m
thanos-querier                  10.128.2.36:9093,10.129.2.33:9093,10.128.2.36:9092 + 3 more...    74m

# oc -n openshift-monitoring get sts prometheus-k8s -oyaml
  labels:
    app.kubernetes.io/component: prometheus
    app.kubernetes.io/managed-by: cluster-monitoring-operator
    app.kubernetes.io/name: prometheus
    app.kubernetes.io/part-of: openshift-monitoring
    app.kubernetes.io/version: 2.24.0

# oc -n openshift-monitoring get sts alertmanager-main -oyaml
  labels:
    alertmanager: main
    app.kubernetes.io/component: alert-router
    app.kubernetes.io/managed-by: cluster-monitoring-operator
    app.kubernetes.io/name: alertmanager
    app.kubernetes.io/part-of: openshift-monitoring
    app.kubernetes.io/version: 0.21.0

# oc -n openshift-monitoring get pod --show-labels | grep -E "prometheus-k8s|alertmanager-main"
alertmanager-main-0   5/5   Running   0   46m   alertmanager=main,app.kubernetes.io/component=alert-router,app.kubernetes.io/managed-by=cluster-monitoring-operator,app.kubernetes.io/name=alertmanager,app.kubernetes.io/part-of=openshift-monitoring,app.kubernetes.io/version=0.21.0,app=alertmanager,controller-revision-hash=alertmanager-main-78f7cc764d,statefulset.kubernetes.io/pod-name=alertmanager-main-0
alertmanager-main-1   5/5   Running   0   51m   alertmanager=main,app.kubernetes.io/component=alert-router,app.kubernetes.io/managed-by=cluster-monitoring-operator,app.kubernetes.io/name=alertmanager,app.kubernetes.io/part-of=openshift-monitoring,app.kubernetes.io/version=0.21.0,app=alertmanager,controller-revision-hash=alertmanager-main-78f7cc764d,statefulset.kubernetes.io/pod-name=alertmanager-main-1
alertmanager-main-2   5/5   Running   0   45m   alertmanager=main,app.kubernetes.io/component=alert-router,app.kubernetes.io/managed-by=cluster-monitoring-operator,app.kubernetes.io/name=alertmanager,app.kubernetes.io/part-of=openshift-monitoring,app.kubernetes.io/version=0.21.0,app=alertmanager,controller-revision-hash=alertmanager-main-78f7cc764d,statefulset.kubernetes.io/pod-name=alertmanager-main-2
prometheus-k8s-0      7/7   Running   1   45m   app.kubernetes.io/component=prometheus,app.kubernetes.io/managed-by=cluster-monitoring-operator,app.kubernetes.io/name=prometheus,app.kubernetes.io/part-of=openshift-monitoring,app.kubernetes.io/version=2.24.0,app=prometheus,controller-revision-hash=prometheus-k8s-588f669d48,operator.prometheus.io/name=k8s,operator.prometheus.io/shard=0,prometheus=k8s,statefulset.kubernetes.io/pod-name=prometheus-k8s-0
prometheus-k8s-1      7/7   Running   1   51m   app.kubernetes.io/component=prometheus,app.kubernetes.io/managed-by=cluster-monitoring-operator,app.kubernetes.io/name=prometheus,app.kubernetes.io/part-of=openshift-monitoring,app.kubernetes.io/version=2.24.0,app=prometheus,controller-revision-hash=prometheus-k8s-588f669d48,operator.prometheus.io/name=k8s,operator.prometheus.io/shard=0,prometheus=k8s,statefulset.kubernetes.io/pod-name=prometheus-k8s-1

There is no need to remove the managed-by label; the earlier failure may have been caused by other issues.
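The label check above can be automated by parsing the LABELS column of `oc get pod --show-labels` and comparing it against the Service selector. An illustrative sketch (the helper name is made up; the label string below is an abbreviated subset of the transcript, not the full set):

```python
def parse_labels(show_labels_field: str) -> dict:
    """Parse the comma-separated LABELS column of
    `oc get pod --show-labels` into a dict."""
    return dict(item.split("=", 1)
                for item in show_labels_field.split(","))

# Abbreviated labels from the fixed prometheus-k8s-0 pod above.
labels = parse_labels(
    "app.kubernetes.io/component=prometheus,"
    "app.kubernetes.io/managed-by=cluster-monitoring-operator,"
    "app=prometheus,prometheus=k8s")

# Subset of the prometheus-k8s Service selector.
selector = {"app": "prometheus",
            "app.kubernetes.io/managed-by": "cluster-monitoring-operator",
            "prometheus": "k8s"}

# The fixed pods satisfy the selector, so the Endpoints are populated.
print(all(labels.get(k) == v for k, v in selector.items()))  # True
```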
Created attachment 1777086 [details]
Prometheus UI can be logged in now
We couldn't merge the PR [1] that fixed the selector labels in time because the CMO CI pipeline was broken (for other reasons), so we decided to revert the prometheus-operator bump [2]. I'm moving the bug to MODIFIED.

[1] https://github.com/openshift/cluster-monitoring-operator/pull/1138
[2] https://github.com/openshift/prometheus-operator/pull/116
Verified with payload 4.8.0-0.nightly-2021-04-29-151418; the Prometheus UI works well.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Container Platform 4.8.2 bug fix and security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2021:2438