Bug 2063047

Summary: Configuring a full-path query log file in CMO breaks Prometheus with the latest version of the operator
Product: OpenShift Container Platform
Reporter: Simon Pasquier <spasquie>
Component: Monitoring
Assignee: Joao Marcal <jmarcal>
Status: CLOSED ERRATA
QA Contact: hongyan li <hongyli>
Severity: medium
Priority: medium
Version: 4.11
CC: amuller, anpicker, aos-bugs, hongyli
Target Release: 4.11.0
Hardware: Unspecified
OS: Unspecified
Doc Type: No Doc Update
Type: Bug
Last Closed: 2022-08-10 10:53:18 UTC

Description Simon Pasquier 2022-03-11 08:29:35 UTC
Description of problem:

When passing the following configuration to CMO, the deployment of Prometheus fails with the latest upstream version of the operator (v0.55.0) because it enforces a read-only root filesystem.

apiVersion: v1
kind: ConfigMap
metadata:
  name: cluster-monitoring-config
  namespace: openshift-monitoring
data:
  config.yaml: |-
    prometheusK8s:
      queryLogFile: /tmp/test.log   
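
Not part of the original report, but a quick way to observe the failure: when the deployment breaks, the monitoring ClusterOperator reports Degraded and the prometheus-k8s pods stop becoming ready.

oc get clusteroperator monitoring
oc -n openshift-monitoring get pods | grep prometheus-k8s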

Version-Release number of selected component (if applicable):
4.11

How reproducible:
Always

Steps to Reproduce:
1. Apply the configmap from above

Actual results:
CMO gets degraded.

Expected results:
CMO doesn't report degraded.


Additional info:
PR bumping the Prometheus operator version to v0.55.0 => https://github.com/openshift/prometheus-operator/pull/162
Failed job => https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/openshift_prometheus-operator/162/pull-ci-openshift-prometheus-operator-master-e2e-agnostic-cmo/1502153637059629056
The same fix needs to be done for UWM Prometheus.
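
For completeness, the equivalent UWM configuration has the same shape (the ConfigMap below is the one exercised in comment 4):

apiVersion: v1
kind: ConfigMap
metadata:
  name: user-workload-monitoring-config
  namespace: openshift-user-workload-monitoring
data:
  config.yaml: |
    prometheus:
      queryLogFile: /tmp/test.log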

Comment 1 Simon Pasquier 2022-03-11 08:40:41 UTC
Because the root filesystem is read-only, an emptyDir volume needs to be provisioned by CMO if the query log file is a full path (as explained in the release CHANGELOG [1]).

CMO should automatically add the volume and mount it at the queryLogFile's directory, except for the following edge cases (a short sketch of this logic follows the reference below):
* when the queryLogFile's directory starts with "/dev" (e.g. "/dev/stdout") because this destination is writable.
* when the queryLogFile's directory starts with "/prometheus" (e.g. "/prometheus/query.log") because it is already writable (TSDB storage directory).
* when the queryLogFile's directory starts with "/" (e.g. "/query.log"), this should be rejected since we don't want to mount a volume at the root location.

[1] https://github.com/prometheus-operator/prometheus-operator/releases/tag/v0.55.0
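
A minimal sketch of that decision logic (illustrative shell pseudocode only; CMO implements this in Go, and the variable name is hypothetical):

# queryLogFile holds the configured path.
dir=$(dirname "$queryLogFile")
case "$dir" in
  /dev|/dev/*)               : ;;  # already writable, no volume needed
  /prometheus|/prometheus/*) : ;;  # TSDB storage directory, already writable
  /)  echo "reject: cannot mount a volume at /" >&2; exit 1 ;;
  /*) echo "provision an emptyDir volume and mount it at $dir" ;;
esac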

Comment 3 hongyan li 2022-03-18 05:00:55 UTC
Tested with payload 4.11.0-0.nightly-2022-03-18-003836
% oc apply -f - <<EOF
heredoc> apiVersion: v1
kind: ConfigMap
metadata:
  name: cluster-monitoring-config
  namespace: openshift-monitoring
data:
  config.yaml: |-
    prometheusK8s:
      queryLogFile: /tmp/test.log
heredoc> EOF
tail /tmp/test.log
{"params":{"end":"2022-03-18T02:56:46.864Z","query":"min_over_time(prometheus_operator_managed_resources{job=\"prometheus-operator\",namespace=~\"openshift-monitoring|openshift-user-workload-monitoring\",state=\"rejected\"}[5m]) > 0","start":"2022-03-18T02:56:46.864Z","step":0},"ruleGroup":{"file":"/etc/prometheus/rules/prometheus-k8s-rulefiles-0/openshift-monitoring-prometheus-operator-rules-1906bb0c-447d-4527-96e9-2f9d29cb61c3.yaml","name":"prometheus-operator"},"stats":{"timings":{"evalTotalTime":0.00013654,"resultSortTime":0,"queryPreparationTime":0.000072124,"innerEvalTime":0.000054378,"execQueueTime":0.000008232,"execTotalTime":0.000149892}},"ts":"2022-03-18T02:56:46.868Z"}
{"params":{"end":"2022-03-18T02:56:47.019Z","query":"kube_pod_status_ready{condition=\"true\",namespace=\"openshift-cluster-node-tuning-operator\"} == 0","start":"2022-03-18T02:56:47.019Z","step":0},"ruleGroup":{"file":"/etc/prometheus/rules/prometheus-k8s-rulefiles-0/openshift-cluster-node-tuning-operator-node-tuning-operator-c9a433ca-1024-468b-82ef-c04e973d1c8e.yaml","name":"node-tuning-operator.rules"},"stats":{"timings":{"evalTotalTime":0.000189451,"resultSortTime":0,"queryPreparationTime":0.000092864,"innerEvalTime":0.000089069,"execQueueTime":0.000011035,"execTotalTime":0.000206109}},"ts":"2022-03-18T02:56:47.020Z"}

% oc apply -f - <<EOF
heredoc> apiVersion: v1
kind: ConfigMap
metadata:
  name: cluster-monitoring-config
  namespace: openshift-monitoring
data:
  config.yaml: |-
    prometheusK8s:
      queryLogFile: /var/test.log  
heredoc> EOF
tail /var/test.log
{"params":{"end":"2022-03-18T02:59:53.532Z","query":"cco_credentials_requests_conditions{condition=\"InsufficientCloudCreds\"} > 0","start":"2022-03-18T02:59:53.532Z","step":0},"ruleGroup":{"file":"/etc/prometheus/rules/prometheus-k8s-rulefiles-0/openshift-cloud-credential-operator-cloud-credential-operator-alerts-f3c8c22c-1897-4f4f-80a3-642504067996.yaml","name":"CloudCredentialOperator"},"stats":{"timings":{"evalTotalTime":0.000089687,"resultSortTime":0,"queryPreparationTime":0.000059958,"innerEvalTime":0.00002176,"execQueueTime":0.000008778,"execTotalTime":0.000104447}},"ts":"2022-03-18T02:59:53.558Z"}

% oc apply -f - <<EOF
heredoc> apiVersion: v1
kind: ConfigMap
metadata:
  name: cluster-monitoring-config
  namespace: openshift-monitoring
data:
  config.yaml: |-
    prometheusK8s:
      queryLogFile: /dev/test.log  
heredoc> EOF
% oc -n openshift-monitoring logs cluster-monitoring-operator-69d4486df9-xq4s6 cluster-monitoring-operator
-----
W0318 03:03:37.459612       1 tasks.go:71] task 4 of 14: Updating Prometheus-k8s failed: initializing Prometheus object failed: query log file can't be stored on a new file on the dev directory: invalid value for config
% oc apply -f - <<EOF
apiVersion: v1
kind: ConfigMap
metadata:
  name: cluster-monitoring-config
  namespace: openshift-monitoring
data:
  config.yaml: |-
    prometheusK8s:
      queryLogFile: /test.log   
EOF
% oc -n openshift-monitoring logs cluster-monitoring-operator-69d4486df9-xq4s6 cluster-monitoring-operator
-----
W0318 03:07:23.079970       1 tasks.go:71] task 4 of 14: Updating Prometheus-k8s failed: initializing Prometheus object failed: query log file can't be stored on the root directory: invalid value for config
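
Not in the original transcript: the same validation errors should also surface in the monitoring ClusterOperator's Degraded condition, e.g.:

oc get clusteroperator monitoring \
  -o jsonpath='{.status.conditions[?(@.type=="Degraded")].message}'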

Comment 4 hongyan li 2022-03-18 05:16:11 UTC
% oc apply -f - <<EOF
apiVersion: v1
kind: ConfigMap
metadata:
  name: user-workload-monitoring-config
  namespace: openshift-user-workload-monitoring
data:
  config.yaml: |
    prometheus:
      queryLogFile: /tmp/test.log

heredoc> EOF
% oc -n openshift-user-workload-monitoring get pod
NAME                                   READY   STATUS    RESTARTS   AGE
prometheus-operator-5fc4b476dc-x5zc2   2/2     Running   0          4m28s
prometheus-user-workload-0             4/5     Running   0          8s
prometheus-user-workload-1             5/5     Running   0          26s
thanos-ruler-user-workload-0           3/3     Running   0          4m18s
thanos-ruler-user-workload-1           3/3     Running   0          4m18s
hongyli@hongyli-mac Downloads % oc -n openshift-user-workload-monitoring rsh prometheus-user-workload-0
sh-4.4$ tail -f /tmp/test.log 
no output (the query log stays empty)

Comment 6 hongyan li 2022-03-18 08:05:25 UTC
When configuring an invalid directory, the error messages in the operator log are not correct.

% oc apply -f - <<EOF
apiVersion: v1
kind: ConfigMap
metadata:
  name: user-workload-monitoring-config
  namespace: openshift-user-workload-monitoring
data:
  config.yaml: |
    prometheus:
      queryLogFile: /test.log
EOF
% oc -n openshift-monitoring logs cluster-monitoring-operator-69d4486df9-xq4s6 cluster-monitoring-operator
-----
E0318 07:34:42.684084       1 operator.go:537] Syncing "openshift-monitoring/cluster-monitoring-config" failed
E0318 07:34:42.684167       1 operator.go:538] sync "openshift-monitoring/cluster-monitoring-config" failed: the User Workload Configuration from "config.yaml" key in the "openshift-user-workload-monitoring/user-workload-monitoring-config" ConfigMap could not be parsed: error unmarshaling JSON: while decoding JSON: json: cannot unmarshal string into Go struct field UserWorkloadConfiguration.prometheus of type manifests.PrometheusRestrictedConfig
W0318 07:35:30.756558       1 operator.go:781] Error creating User Workload Configuration from "config.yaml" key in the "openshift-user-workload-monitoring/user-workload-monitoring-config" ConfigMap. Error: error unmarshaling JSON: while decoding JSON: json: cannot unmarshal string into Go struct field UserWorkloadConfiguration.prometheus of type manifests.PrometheusRestrictedConfig
I0318 07:35:30.756586       1 operator.go:681] ClusterOperator reconciliation failed (attempt 34), retrying. 
W0318 07:35:30.756592       1 operator.go:684] Updating ClusterOperator status to failed after 34 attempts.
E0318 07:35:30.773779       1 operator.go:537] Syncing "openshift-monitoring/cluster-monitoring-config" failed
E0318 07:35:30.773809       1 operator.go:538] sync "openshift-monitoring/cluster-monitoring-config" failed: the User Workload Configuration from "config.yaml" key in the "openshift-user-workload-monitoring/user-workload-monitoring-config" ConfigMap could not be parsed: error unmarshaling JSON: while decoding JSON: json: cannot unmarshal string into Go struct field UserWorkloadConfiguration.prometheus of type manifests.PrometheusRestrictedConfig
I0318 07:38:15.576292       1 operator.go:509] Triggering an update due to ConfigMap or Secret: openshift-user-workload-monitoring/user-workload-monitoring-config

Comment 7 hongyan li 2022-03-18 08:15:05 UTC
Facing 2 issues now:
When a valid path is configured for the user-workload query log, no query log entries appear, even though I ran many queries against the prometheus-example app from the dashboard and also queried 'version' directly.
When an invalid path is configured for the user-workload query log, the error messages in the operator log are not correct.

So I reopened the bug.

Comment 8 hongyan li 2022-03-18 08:54:36 UTC
There is another critical issue: configuring the query log file as follows causes the prometheus-k8s pod to crash.
% oc apply -f - <<EOF                            
apiVersion: v1
kind: ConfigMap
metadata:
  name: cluster-monitoring-config
  namespace: openshift-monitoring
data:
  config.yaml: |
    enableUserWorkload: true
    prometheusK8s:
      queryLogFile: ./lhy/test.log      
EOF

% oc -n openshift-monitoring get pod|grep prometheus-k8s
prometheus-k8s-0                               6/6     Running            0             15m
prometheus-k8s-1                               5/6     CrashLoopBackOff   5 (39s ago)   3m39s
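
Not part of the original transcript: the crash reason can be pulled from the terminated container's logs with the --previous flag, e.g.:

oc -n openshift-monitoring logs prometheus-k8s-1 -c prometheus --previous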

Comment 9 Simon Pasquier 2022-03-18 09:17:23 UTC
Regarding comment 8, this should be fixed by bumping to the latest version of prometheus-operator (https://github.com/openshift/prometheus-operator/pull/162), which merged 1 hour ago.

Comment 10 Joao Marcal 2022-03-18 09:53:06 UTC
Regarding the first issue:
=================
jmarcal ~ → oc get cm user-workload-monitoring-config -o yaml | cat
apiVersion: v1
data:
  config.yaml: |
    prometheus:
      queryLogFile: /tmp/test.log
kind: ConfigMap
metadata:
  creationTimestamp: "2022-03-18T08:48:53Z"
  name: user-workload-monitoring-config
  namespace: openshift-user-workload-monitoring
  resourceVersion: "46245"
  uid: 0f341905-4ff6-43fd-a3a2-0eb5b9b67ff5

jmarcal ~ → oc rsh prometheus-user-workload-1
sh-4.4$ curl --data-urlencode 'query=up{job="alertmanager"}' 127.0.0.1:9090/api/v1/query
sh-4.4$ cat /tmp/test.log
{"httpRequest":{"clientIP":"127.0.0.1","method":"POST","path":"/api/v1/query"},"params":{"end":"2022-03-18T09:35:15.992Z","query":"up{job=\"alertmanager\"}","start":"2022-03-18T09:35:15.992Z","step":0},"stats":{"timings":{"evalTotalTime":0.000053284,"resultSortTime":0,"queryPreparationTime":0.000036362,"innerEvalTime":0.000010217,"execQueueTime":0.000013256,"execTotalTime":0.000081297}},"ts":"2022-03-18T09:35:15.993Z"}
=================
So from my side everything seems to be working as expected. From what Simon told me, for user workload we need to hit the Prometheus API query endpoint directly, because going through the Thanos Querier wouldn't trigger query logging.

Regarding the second issue:
=================
jmarcal ~ → oc get cm user-workload-monitoring-config -o yaml | cat
apiVersion: v1
data:
  config.yaml: |
    prometheus:
      queryLogFile: /test.log
kind: ConfigMap
metadata:
  creationTimestamp: "2022-03-18T08:48:53Z"
  name: user-workload-monitoring-config
  namespace: openshift-user-workload-monitoring
  resourceVersion: "52109"
  uid: 0f341905-4ff6-43fd-a3a2-0eb5b9b67ff5
jmarcal ~ → oc -n openshift-monitoring logs cluster-monitoring-operator-69d4486df9-9z7c8 -c cluster-monitoring-operator | tail
I0318 09:50:03.950181       1 tasks.go:74] ran task 9 of 14: Updating openshift-state-metrics
I0318 09:50:03.989334       1 tasks.go:74] ran task 8 of 14: Updating kube-state-metrics
I0318 09:50:04.057089       1 tasks.go:74] ran task 7 of 14: Updating node-exporter
I0318 09:50:04.168306       1 tasks.go:74] ran task 11 of 14: Updating Telemeter client
W0318 09:50:04.249201       1 tasks.go:71] task 5 of 14: Updating Prometheus-user-workload failed: initializing UserWorkload Prometheus object failed: query log file can't be stored on the root directory: invalid value for config
I0318 09:50:04.312247       1 tasks.go:74] ran task 10 of 14: Updating prometheus-adapter
I0318 09:50:04.989025       1 tasks.go:74] ran task 1 of 14: Updating user workload Prometheus Operator
I0318 09:50:05.574858       1 tasks.go:74] ran task 3 of 14: Updating Grafana
I0318 09:50:07.315837       1 tasks.go:74] ran task 12 of 14: Updating Thanos Querier
I0318 09:50:14.212797       1 tasks.go:74] ran task 6 of 14: Updating Alertmanager
=================
So from my side it seems that the YAML you used was not valid for some reason, and that caused the error message you saw.
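
A hypothetical illustration (not the reporter's actual file) of how that unmarshal error arises: Go's JSON decoder reports "cannot unmarshal string into Go struct field ..." when the "prometheus" key holds a plain string instead of a mapping, e.g. when indentation is lost:

config.yaml: |
  prometheus: /test.log    # "prometheus" is a scalar string -> unmarshal error

versus the valid mapping form:

config.yaml: |
  prometheus:
    queryLogFile: /test.log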

Comment 11 hongyan li 2022-03-21 02:52:59 UTC
Tested again; configuring a full-path query log file works well for user-workload monitoring.
@jmarcal, thanks for checking.

Comment 12 hongyan li 2022-03-21 03:03:01 UTC
Tested with payload 4.11.0-0.nightly-2022-03-20-160505
prometheus-operator's version is 0.55.0

oc -n openshift-monitoring logs prometheus-operator-5cb86ff95c-xglgp 
level=info ts=2022-03-21T02:02:52.099203159Z caller=main.go:220 msg="Starting Prometheus Operator" version="(version=0.55.0, branch=rhaos-4.11-rhel-8, revision=5d799b9)

oc apply -f - <<EOF                            
apiVersion: v1
kind: ConfigMap
metadata:
  name: cluster-monitoring-config
  namespace: openshift-monitoring
data:
  config.yaml: |
    enableUserWorkload: true
    prometheusK8s:
      queryLogFile: ./lhy/test.log      
EOF

oc -n openshift-monitoring get pod |grep prometheus-k8s
prometheus-k8s-0                               6/6     Running            0              46m
prometheus-k8s-1                               5/6     CrashLoopBackOff   6 (5m6s ago)   10m

$ oc -n openshift-monitoring describe pod prometheus-k8s-1
Name:                 prometheus-k8s-1
Namespace:            openshift-monitoring
Priority:             2000000000
Priority Class Name:  system-cluster-critical
Node:                 ip-10-0-153-240.us-east-2.compute.internal/10.0.153.240
---------
Containers:
  prometheus:
    Container ID:  cri-o://ff7b03f6a64e625d3eddb2b9fb8638a0bcbc3582aacb1a82286d57ab8dd0aeab
    Image:         quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:81ea1941f7a902c68c696e99afa860c41ac7fe0ab0c209e79cc2a7855cdbd5b7
    Image ID:      quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:81ea1941f7a902c68c696e99afa860c41ac7fe0ab0c209e79cc2a7855cdbd5b7
    Port:          <none>
    Host Port:     <none>
    Args:
      --web.console.templates=/etc/prometheus/consoles
      --web.console.libraries=/etc/prometheus/console_libraries
      --storage.tsdb.retention.time=15d
      --config.file=/etc/prometheus/config_out/prometheus.env.yaml
      --storage.tsdb.path=/prometheus
      --web.enable-lifecycle
      --web.external-url=https:/console-openshift-console.apps.hongyli-0321.qe.devcluster.openshift.com/monitoring
      --web.route-prefix=/
      --web.listen-address=127.0.0.1:9090
      --web.config.file=/etc/prometheus/web_config/web-config.yaml
    State:       Waiting
      Reason:    CrashLoopBackOff
    Last State:  Terminated
      Reason:    Error
      Message:   msg="Using pod service account via in-cluster config"
ts=2022-03-21T02:59:25.993Z caller=kubernetes.go:313 level=info component="discovery manager scrape" discovery=kubernetes msg="Using pod service account via in-cluster config"
ts=2022-03-21T02:59:25.994Z caller=kubernetes.go:313 level=info component="discovery manager scrape" discovery=kubernetes msg="Using pod service account via in-cluster config"
ts=2022-03-21T02:59:25.995Z caller=kubernetes.go:313 level=info component="discovery manager scrape" discovery=kubernetes msg="Using pod service account via in-cluster config"
ts=2022-03-21T02:59:25.996Z caller=kubernetes.go:313 level=info component="discovery manager notify" discovery=kubernetes msg="Using pod service account via in-cluster config"
ts=2022-03-21T02:59:26.132Z caller=main.go:815 level=info msg="Stopping scrape discovery manager..."
ts=2022-03-21T02:59:26.132Z caller=main.go:829 level=info msg="Stopping notify discovery manager..."
ts=2022-03-21T02:59:26.132Z caller=main.go:851 level=info msg="Stopping scrape manager..."
ts=2022-03-21T02:59:26.132Z caller=main.go:811 level=info msg="Scrape discovery manager stopped"
ts=2022-03-21T02:59:26.133Z caller=main.go:825 level=info msg="Notify discovery manager stopped"
ts=2022-03-21T02:59:26.134Z caller=manager.go:945 level=info component="rule manager" msg="Stopping rule manager..."
ts=2022-03-21T02:59:26.134Z caller=main.go:845 level=info msg="Scrape manager stopped"
ts=2022-03-21T02:59:26.134Z caller=manager.go:955 level=info component="rule manager" msg="Rule manager stopped"
ts=2022-03-21T02:59:26.135Z caller=notifier.go:600 level=info component=notifier msg="Stopping notification manager..."
ts=2022-03-21T02:59:26.135Z caller=main.go:1071 level=info msg="Notifier 
      Exit Code:    1
      Started:      Mon, 21 Mar 2022 10:59:25 +0800
      Finished:     Mon, 21 Mar 2022 10:59:26 +0800
    Ready:          False
    Restart Count:  7
    Requests:
      cpu:        70m
      memory:     1Gi
    Liveness:     exec [sh -c if [ -x "$(command -v curl)" ]; then exec curl http://localhost:9090/-/healthy; elif [ -x "$(command -v wget)" ]; then exec wget -q -O /dev/null http://localhost:9090/-/healthy; else exit 1; fi] delay=0s timeout=3s period=5s #success=1 #failure=6
    Readiness:    exec [sh -c if [ -x "$(command -v curl)" ]; then exec curl http://localhost:9090/-/ready; elif [ -x "$(command -v wget)" ]; then exec wget -q -O /dev/null http://localhost:9090/-/ready; else exit 1; fi] delay=0s timeout=3s period=5s #success=1 #failure=3
    Startup:      exec [sh -c if [ -x "$(command -v curl)" ]; then exec curl http://localhost:9090/-/ready; elif [ -x "$(command -v wget)" ]; then exec wget -q -O /dev/null http://localhost:9090/-/ready; else exit 1; fi] delay=0s timeout=3s period=15s #success=1 #failure=60
    Environment:  <none>
    Mounts:
      /etc/pki/ca-trust/extracted/pem/ from prometheus-trusted-ca-bundle (ro)
      /etc/prometheus/certs from tls-assets (ro)
      /etc/prometheus/config_out from config-out (ro)
      /etc/prometheus/configmaps/kubelet-serving-ca-bundle from configmap-kubelet-serving-ca-bundle (ro)
      /etc/prometheus/configmaps/metrics-client-ca from configmap-metrics-client-ca (ro)
      /etc/prometheus/configmaps/serving-certs-ca-bundle from configmap-serving-certs-ca-bundle (ro)
      /etc/prometheus/rules/prometheus-k8s-rulefiles-0 from prometheus-k8s-rulefiles-0 (rw)
      /etc/prometheus/secrets/kube-etcd-client-certs from secret-kube-etcd-client-certs (ro)
      /etc/prometheus/secrets/kube-rbac-proxy from secret-kube-rbac-proxy (ro)
      /etc/prometheus/secrets/metrics-client-certs from secret-metrics-client-certs (ro)
      /etc/prometheus/secrets/prometheus-k8s-proxy from secret-prometheus-k8s-proxy (ro)
      /etc/prometheus/secrets/prometheus-k8s-thanos-sidecar-tls from secret-prometheus-k8s-thanos-sidecar-tls (ro)
      /etc/prometheus/secrets/prometheus-k8s-tls from secret-prometheus-k8s-tls (ro)
      /etc/prometheus/web_config/web-config.yaml from web-config (ro,path="web-config.yaml")
      /prometheus from prometheus-k8s-db (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-h45dn (ro)
      lhy from query-log (rw)
  config-reloader:
    Container ID:  cri-o://c180d9ed73e68e40ce64bb0571bec00f0038b5773cd9ab39c09f7a0319ab96e0
    Image:         quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:6b25becec8ff8c04d449d7133fb54b2b082f1bc779dc0a289013cbe9e9dd87db
    Image ID:      quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:6b25becec8ff8c04d449d7133fb54b2b082f1bc779dc0a289013cbe9e9dd87db
    Port:          <none>
    Host Port:     <none>
    Command:
      /bin/prometheus-config-reloader
    Args:
      --listen-address=localhost:8080
      --reload-url=http://localhost:9090/-/reload
      --config-file=/etc/prometheus/config/prometheus.yaml.gz
      --config-envsubst-file=/etc/prometheus/config_out/prometheus.env.yaml
      --watched-dir=/etc/prometheus/rules/prometheus-k8s-rulefiles-0
    State:          Running
      Started:      Mon, 21 Mar 2022 10:48:39 +0800
    Ready:          True
    Restart Count:  0
    Requests:
      cpu:     1m
      memory:  10Mi
    Environment:
      POD_NAME:  prometheus-k8s-1 (v1:metadata.name)
      SHARD:     0
    Mounts:
      /etc/prometheus/config from config (rw)
      /etc/prometheus/config_out from config-out (rw)
      /etc/prometheus/rules/prometheus-k8s-rulefiles-0 from prometheus-k8s-rulefiles-0 (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-h45dn (ro)
  thanos-sidecar:
    Container ID:  cri-o://163d59de54f18f3963a71ecad205578a1cf53d3ab23ee9c0a3093b3ec1a23ed3
    Image:         quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:419ae6e3ddc102d59063407ab6d287f3c5a5fcddca921b4fe5c09f515eb1a72e
    Image ID:      quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:419ae6e3ddc102d59063407ab6d287f3c5a5fcddca921b4fe5c09f515eb1a72e
    Ports:         10902/TCP, 10901/TCP
    Host Ports:    0/TCP, 0/TCP
    Args:
      sidecar
      --prometheus.url=http://localhost:9090/
      --tsdb.path=/prometheus
      --http-address=127.0.0.1:10902
      --grpc-server-tls-cert=/etc/tls/grpc/server.crt
      --grpc-server-tls-key=/etc/tls/grpc/server.key
      --grpc-server-tls-client-ca=/etc/tls/grpc/ca.crt
    State:          Running
      Started:      Mon, 21 Mar 2022 10:48:40 +0800
    Ready:          True
    Restart Count:  0
    Requests:
      cpu:        1m
      memory:     25Mi
    Environment:  <none>
    Mounts:
      /etc/tls/grpc from secret-grpc-tls (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-h45dn (ro)
  prometheus-proxy:
    Container ID:  cri-o://f74fe302e5390e163b9e205b627dec153cd14b324f215a33a8d35d8c72dbe297
    Image:         quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:e1e0560f81cde0731eeb20f6332ee56853cc08652e2212511b02d53d1a97bc8e
    Image ID:      quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:e1e0560f81cde0731eeb20f6332ee56853cc08652e2212511b02d53d1a97bc8e
    Port:          9091/TCP
    Host Port:     0/TCP
    Args:
      -provider=openshift
      -https-address=:9091
      -http-address=
      -email-domain=*
      -upstream=http://localhost:9090
      -openshift-service-account=prometheus-k8s
      -openshift-sar={"resource": "namespaces", "verb": "get"}
      -openshift-delegate-urls={"/": {"resource": "namespaces", "verb": "get"}}
      -tls-cert=/etc/tls/private/tls.crt
      -tls-key=/etc/tls/private/tls.key
      -client-secret-file=/var/run/secrets/kubernetes.io/serviceaccount/token
      -cookie-secret-file=/etc/proxy/secrets/session_secret
      -openshift-ca=/etc/pki/tls/cert.pem
      -openshift-ca=/var/run/secrets/kubernetes.io/serviceaccount/ca.crt
    State:          Running
      Started:      Mon, 21 Mar 2022 10:48:40 +0800
    Ready:          True
    Restart Count:  0
    Requests:
      cpu:     1m
      memory:  20Mi
    Environment:
      HTTP_PROXY:   
      HTTPS_PROXY:  
      NO_PROXY:     
    Mounts:
      /etc/pki/ca-trust/extracted/pem/ from prometheus-trusted-ca-bundle (ro)
      /etc/proxy/secrets from secret-prometheus-k8s-proxy (rw)
      /etc/tls/private from secret-prometheus-k8s-tls (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-h45dn (ro)
  kube-rbac-proxy:
    Container ID:  cri-o://d719ca84e6a03ae493ee25aa5d1152763e2a0b3cc10980e9691b1e62530a83b6
    Image:         quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:eae733a92bbd6fb71e26105cca172aceb5a72ccf38daf6f8115a3243edcaf840
    Image ID:      quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:eae733a92bbd6fb71e26105cca172aceb5a72ccf38daf6f8115a3243edcaf840
    Port:          9092/TCP
    Host Port:     0/TCP
    Args:
      --secure-listen-address=0.0.0.0:9092
      --upstream=http://127.0.0.1:9090
      --allow-paths=/metrics
      --config-file=/etc/kube-rbac-proxy/config.yaml
      --tls-cert-file=/etc/tls/private/tls.crt
      --tls-private-key-file=/etc/tls/private/tls.key
      --client-ca-file=/etc/tls/client/client-ca.crt
      --tls-cipher-suites=TLS_ECDHE_ECDSA_WITH_AES_128_GCM_SHA256,TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256,TLS_ECDHE_ECDSA_WITH_AES_256_GCM_SHA384,TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384,TLS_ECDHE_ECDSA_WITH_CHACHA20_POLY1305_SHA256,TLS_ECDHE_RSA_WITH_CHACHA20_POLY1305_SHA256
      --logtostderr=true
      --v=10
      --tls-min-version=VersionTLS12
    State:          Running
      Started:      Mon, 21 Mar 2022 10:48:40 +0800
    Ready:          True
    Restart Count:  0
    Requests:
      cpu:        1m
      memory:     15Mi
    Environment:  <none>
    Mounts:
      /etc/kube-rbac-proxy from secret-kube-rbac-proxy (rw)
      /etc/tls/client from configmap-metrics-client-ca (ro)
      /etc/tls/private from secret-prometheus-k8s-tls (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-h45dn (ro)
  kube-rbac-proxy-thanos:
    Container ID:  cri-o://1ca96f73d5547c7ab6a498de1625df7b96dfcd864f10b6267f1d1319ae0428fc
    Image:         quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:eae733a92bbd6fb71e26105cca172aceb5a72ccf38daf6f8115a3243edcaf840
    Image ID:      quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:eae733a92bbd6fb71e26105cca172aceb5a72ccf38daf6f8115a3243edcaf840
    Port:          10902/TCP
    Host Port:     0/TCP
    Args:
      --secure-listen-address=[$(POD_IP)]:10902
      --upstream=http://127.0.0.1:10902
      --tls-cert-file=/etc/tls/private/tls.crt
      --tls-private-key-file=/etc/tls/private/tls.key
      --client-ca-file=/etc/tls/client/client-ca.crt
      --config-file=/etc/kube-rbac-proxy/config.yaml
      --tls-cipher-suites=TLS_ECDHE_ECDSA_WITH_AES_128_GCM_SHA256,TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256,TLS_ECDHE_ECDSA_WITH_AES_256_GCM_SHA384,TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384,TLS_ECDHE_ECDSA_WITH_CHACHA20_POLY1305_SHA256,TLS_ECDHE_RSA_WITH_CHACHA20_POLY1305_SHA256
      --allow-paths=/metrics
      --logtostderr=true
      --tls-min-version=VersionTLS12
      --client-ca-file=/etc/tls/client/client-ca.crt
    State:          Running
      Started:      Mon, 21 Mar 2022 10:48:40 +0800
    Ready:          True
    Restart Count:  0
    Requests:
      cpu:     1m
      memory:  10Mi
    Environment:
      POD_IP:   (v1:status.podIP)
    Mounts:
      /etc/kube-rbac-proxy from secret-kube-rbac-proxy (rw)
      /etc/tls/client from metrics-client-ca (ro)
      /etc/tls/private from secret-prometheus-k8s-thanos-sidecar-tls (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-h45dn (ro)
Conditions:
  Type              Status
  Initialized       True 
  Ready             False 
  ContainersReady   False 
  PodScheduled      True 
Volumes:
  config:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  prometheus-k8s
    Optional:    false
  tls-assets:
    Type:                Projected (a volume that contains injected data from multiple sources)
    SecretName:          prometheus-k8s-tls-assets-0
    SecretOptionalName:  <nil>
  config-out:
    Type:       EmptyDir (a temporary directory that shares a pod's lifetime)
    Medium:     
    SizeLimit:  <unset>
  prometheus-k8s-rulefiles-0:
    Type:      ConfigMap (a volume populated by a ConfigMap)
    Name:      prometheus-k8s-rulefiles-0
    Optional:  false
  web-config:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  prometheus-k8s-web-config
    Optional:    false
  secret-kube-etcd-client-certs:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  kube-etcd-client-certs
    Optional:    false
  secret-prometheus-k8s-tls:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  prometheus-k8s-tls
    Optional:    false
  secret-prometheus-k8s-proxy:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  prometheus-k8s-proxy
    Optional:    false
  secret-prometheus-k8s-thanos-sidecar-tls:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  prometheus-k8s-thanos-sidecar-tls
    Optional:    false
  secret-kube-rbac-proxy:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  kube-rbac-proxy
    Optional:    false
  secret-metrics-client-certs:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  metrics-client-certs
    Optional:    false
  configmap-serving-certs-ca-bundle:
    Type:      ConfigMap (a volume populated by a ConfigMap)
    Name:      serving-certs-ca-bundle
    Optional:  false
  configmap-kubelet-serving-ca-bundle:
    Type:      ConfigMap (a volume populated by a ConfigMap)
    Name:      kubelet-serving-ca-bundle
    Optional:  false
  configmap-metrics-client-ca:
    Type:      ConfigMap (a volume populated by a ConfigMap)
    Name:      metrics-client-ca
    Optional:  false
  prometheus-k8s-db:
    Type:       EmptyDir (a temporary directory that shares a pod's lifetime)
    Medium:     
    SizeLimit:  <unset>
  query-log:
    Type:       EmptyDir (a temporary directory that shares a pod's lifetime)
    Medium:     
    SizeLimit:  <unset>
  metrics-client-ca:
    Type:      ConfigMap (a volume populated by a ConfigMap)
    Name:      metrics-client-ca
    Optional:  false
  secret-grpc-tls:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  prometheus-k8s-grpc-tls-8la0v33v3r7hi
    Optional:    false
  prometheus-trusted-ca-bundle:
    Type:      ConfigMap (a volume populated by a ConfigMap)
    Name:      prometheus-trusted-ca-bundle-2rsonso43rc5p
    Optional:  true
  kube-api-access-h45dn:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  3607
    ConfigMapName:           kube-root-ca.crt
    ConfigMapOptional:       <nil>
    DownwardAPI:             true
    ConfigMapName:           openshift-service-ca.crt
    ConfigMapOptional:       <nil>
QoS Class:                   Burstable
Node-Selectors:              kubernetes.io/os=linux
Tolerations:                 node.kubernetes.io/memory-pressure:NoSchedule op=Exists
                             node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                             node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
  Type     Reason          Age                 From               Message
  ----     ------          ----                ----               -------
  Normal   Scheduled       11m                 default-scheduler  Successfully assigned openshift-monitoring/prometheus-k8s-1 to ip-10-0-153-240.us-east-2.compute.internal
  Normal   AddedInterface  11m                 multus             Add eth0 [10.129.2.17/23] from openshift-sdn
  Normal   Pulled          11m                 kubelet            Container image "quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:6b25becec8ff8c04d449d7133fb54b2b082f1bc779dc0a289013cbe9e9dd87db" already present on machine
  Normal   Created         11m                 kubelet            Created container init-config-reloader
  Normal   Started         11m                 kubelet            Started container init-config-reloader
  Normal   Created         11m                 kubelet            Created container config-reloader
  Normal   Pulled          11m                 kubelet            Container image "quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:419ae6e3ddc102d59063407ab6d287f3c5a5fcddca921b4fe5c09f515eb1a72e" already present on machine
  Normal   Started         11m                 kubelet            Started container config-reloader
  Normal   Pulled          11m                 kubelet            Container image "quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:6b25becec8ff8c04d449d7133fb54b2b082f1bc779dc0a289013cbe9e9dd87db" already present on machine
  Normal   Created         11m                 kubelet            Created container thanos-sidecar
  Normal   Started         11m (x2 over 11m)   kubelet            Started container prometheus
  Normal   Created         11m (x2 over 11m)   kubelet            Created container prometheus
  Normal   Pulled          11m (x2 over 11m)   kubelet            Container image "quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:81ea1941f7a902c68c696e99afa860c41ac7fe0ab0c209e79cc2a7855cdbd5b7" already present on machine
  Normal   Started         11m                 kubelet            Started container thanos-sidecar
  Normal   Pulled          11m                 kubelet            Container image "quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:e1e0560f81cde0731eeb20f6332ee56853cc08652e2212511b02d53d1a97bc8e" already present on machine
  Normal   Created         11m                 kubelet            Created container prometheus-proxy
  Normal   Started         11m                 kubelet            Started container prometheus-proxy
  Normal   Pulled          11m                 kubelet            Container image "quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:eae733a92bbd6fb71e26105cca172aceb5a72ccf38daf6f8115a3243edcaf840" already present on machine
  Normal   Created         11m                 kubelet            Created container kube-rbac-proxy
  Normal   Started         11m                 kubelet            Started container kube-rbac-proxy
  Normal   Pulled          11m                 kubelet            Container image "quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:eae733a92bbd6fb71e26105cca172aceb5a72ccf38daf6f8115a3243edcaf840" already present on machine
  Normal   Created         11m                 kubelet            Created container kube-rbac-proxy-thanos
  Normal   Started         11m                 kubelet            Started container kube-rbac-proxy-thanos
  Warning  BackOff         78s (x64 over 11m)  kubelet            Back-off restarting failed container

There are no abnormal logs in the cluster monitoring operator or the prometheus operator.

Comment 13 hongyan li 2022-03-21 03:04:35 UTC
When configuring the query log file as ./lhy/test.log, the pod crash issue still occurs.

Comment 14 Simon Pasquier 2022-03-21 11:34:14 UTC
Right, I think that to be error-proof, CMO should forbid relative paths for queryLogFile.
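
A sketch of that additional guard (illustrative shell, not CMO source; the verified behavior is in comment 17 below):

case "$queryLogFile" in
  /*) : ;;  # absolute path, continue with the directory checks above
  *)  echo "relative paths to query log file are not supported" >&2; exit 1 ;;
esac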

Comment 17 hongyan li 2022-04-02 02:00:06 UTC
Tested with payload 4.11.0-0.nightly-2022-04-01-172551; when configuring a relative path:

% oc -n openshift-monitoring logs cluster-monitoring-operator-5dd6f54457-6hj7g cluster-monitoring-operator
-----
W0402 01:57:22.591131       1 tasks.go:71] task 4 of 14: Updating Prometheus-k8s failed: initializing Prometheus object failed: relative paths to query log file are not supported: invalid value for config

Comment 20 errata-xmlrpc 2022-08-10 10:53:18 UTC
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (Important: OpenShift Container Platform 4.11.0 bug fix and security update), and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:5069