Bug 1852846

Summary: logLevel for prometheusOperator/thanosRuler does not work
Product: OpenShift Container Platform
Component: Monitoring
Version: 4.6
Target Release: 4.6.0
Hardware: Unspecified
OS: Unspecified
Severity: medium
Priority: medium
Status: CLOSED ERRATA
Reporter: Junqi Zhao <juzhao>
Assignee: Lili Cosic <lcosic>
QA Contact: Junqi Zhao <juzhao>
CC: alegrand, anpicker, erooth, kakkoyun, lcosic, mloibl, pkrupa, surbania
Type: Bug
Last Closed: 2020-10-27 16:11:46 UTC
Attachments:
- openshift-user-workload-monitoring dump file
- prometheus-operator deployment file

Description Junqi Zhao 2020-07-01 12:32:15 UTC
Created attachment 1699481 [details]
openshift-user-workload-monitoring dump file

Description of problem:
Enable enableUserWorkload and set logLevel for prometheusOperator, prometheus, and thanosRuler. The logLevel settings for prometheusOperator and thanosRuler do not take effect; only the prometheus setting is applied to the prometheus-user-workload pods. For more details, see the attached dump file.
******************************
apiVersion: v1
kind: ConfigMap
metadata:
  name: cluster-monitoring-config
  namespace: openshift-monitoring
data:
  config.yaml: |
    enableUserWorkload: true
******************************
apiVersion: v1
kind: ConfigMap
metadata:
  name: user-workload-monitoring-config
  namespace: openshift-user-workload-monitoring
data:
  config.yaml: |
    prometheusOperator:
      logLevel: error
    prometheus:
      logLevel: warn
      retention: 48h
    thanosRuler:
      logLevel: info
******************************

# oc -n openshift-user-workload-monitoring get sts prometheus-user-workload -oyaml |  grep "log.level"
        - --log.level=warn

# oc -n openshift-user-workload-monitoring get sts  thanos-ruler-user-workload -oyaml |  grep "log"
        terminationMessagePath: /dev/termination-log
        terminationMessagePath: /dev/termination-log
        terminationMessagePath: /dev/termination-log

# oc -n openshift-user-workload-monitoring get deploy prometheus-operator -oyaml |  grep "log"
        - --logtostderr=true
        - --deny-namespaces=openshift-apiserver,openshift-apiserver-operator,openshift-authentication,openshift-authentication-operator,openshift-cloud-credential-operator,openshift-cluster-machine-approver,openshift-cluster-samples-operator,openshift-cluster-storage-operator,openshift-cluster-version,openshift-config-operator,openshift-console-operator,openshift-controller-manager,openshift-controller-manager-operator,openshift-dns,openshift-dns-operator,openshift-etcd-operator,openshift-image-registry,openshift-ingress,openshift-ingress-operator,openshift-insights,openshift-kube-apiserver,openshift-kube-apiserver-operator,openshift-kube-controller-manager,openshift-kube-controller-manager-operator,openshift-kube-scheduler,openshift-kube-scheduler-operator,openshift-kube-storage-version-migrator,openshift-kube-storage-version-migrator-operator,openshift-machine-api,openshift-machine-config-operator,openshift-marketplace,openshift-monitoring,openshift-multus,openshift-operator-lifecycle-manager,openshift-sdn,openshift-service-ca-operator,openshift-service-catalog-removed,openshift-user-workload-monitoring
        terminationMessagePath: /dev/termination-log
        - --logtostderr
        terminationMessagePath: /dev/termination-log

# oc -n openshift-user-workload-monitoring logs prometheus-user-workload-0 -c prometheus
No output, since there are no warn-level messages.

Version-Release number of selected component (if applicable):
4.6.0-0.nightly-2020-06-30-000342

How reproducible:
always

Steps to Reproduce:
1. Enable enableUserWorkload in the cluster-monitoring-config ConfigMap.
2. Set logLevel for prometheusOperator, prometheus, and thanosRuler in the user-workload-monitoring-config ConfigMap (see the description).
3. Check the log level flags on the resulting workloads.

Actual results:
The logLevel settings for prometheusOperator and thanosRuler are not applied; only the prometheus setting reaches the prometheus-user-workload pods.

Expected results:
All three logLevel settings are applied to their corresponding workloads.

Additional info:

Comment 1 Junqi Zhao 2020-07-01 12:43:35 UTC
# oc -n openshift-user-workload-monitoring get thanosruler user-workload -oyaml | grep logLevel
  logLevel: info

#  oc -n openshift-user-workload-monitoring get deploy prometheus-operator  -oyaml | grep logLevel
no result

Comment 2 Junqi Zhao 2020-07-01 12:45:33 UTC
The logLevel setting from the thanosruler CR is not injected into the thanos-ruler-user-workload statefulset, and there may be no logLevel flag on the prometheus-operator deployment.

Comment 3 Lili Cosic 2020-07-01 13:10:40 UTC
I see that logLevel: info gets set on the ThanosRuler CR instance in the openshift-user-workload-monitoring namespace:
 
 logLevel: info

Can you confirm that as well?

Comment 11 Junqi Zhao 2020-07-27 01:56:49 UTC
Tested with 4.6.0-0.nightly-2020-07-25-091217. After setting logLevel: error for prometheusOperator, "- --log-level=error" is present in the prometheus-operator deployment, but logs whose level is not error still appear:
***************
apiVersion: v1
kind: ConfigMap
metadata:
  name: user-workload-monitoring-config
  namespace: openshift-user-workload-monitoring
data:
  config.yaml: |
    prometheusOperator:
      logLevel: error
****************
# oc -n openshift-user-workload-monitoring get deploy prometheus-operator -oyaml | grep "log-level"
        - --log-level=error

# oc -n openshift-user-workload-monitoring logs $(oc -n openshift-user-workload-monitoring get po | grep prometheus-operator | awk '{print $1}') -c prometheus-operator
ts=2020-07-27T01:28:08.920655206Z caller=main.go:217 msg="Starting Prometheus Operator version '0.40.0'."
ts=2020-07-27T01:28:08.934023544Z caller=main.go:104 msg="Starting insecure server on [::]:8080"

Comment 12 Junqi Zhao 2020-07-27 01:58:32 UTC
Created attachment 1702466 [details]
prometheus-operator deployment file

Comment 13 Lili Cosic 2020-07-27 09:08:41 UTC
@Junqi can you explain what failed? It seems the deployment file does have log-level=error set?

Comment 14 Junqi Zhao 2020-07-27 09:43:14 UTC
(In reply to Lili Cosic from comment #13)
> @Junqi can you explain what failed? Seems like the deployment file got
> log-level=error?

See the output of:
# oc -n openshift-user-workload-monitoring logs $(oc -n openshift-user-workload-monitoring get po | grep prometheus-operator | awk '{print $1}') -c prometheus-operator
ts=2020-07-27T01:28:08.920655206Z caller=main.go:217 msg="Starting Prometheus Operator version '0.40.0'."
ts=2020-07-27T01:28:08.934023544Z caller=main.go:104 msg="Starting insecure server on [::]:8080"

We should only see error-level logs if we set log-level=error.

Comment 15 Lili Cosic 2020-07-27 09:46:44 UTC
Those lines are always there regardless of which log level you set; they are the first two info messages that always get logged, and that is expected. Setting log-level=error means error logs are allowed; it does not suppress these initial messages. Hope that makes sense?
https://github.com/coreos/prometheus-operator/blob/ad3571f1e23c51277f6522dee93919ce153d1f46/cmd/operator/main.go#L208
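(For reference, a minimal Go sketch of the go-kit/log behaviour behind this, not the actual operator wiring. It assumes the startup messages are logged without a level keyval, which is what the always-visible behaviour suggests:)

**********************
package main

import (
	"os"

	"github.com/go-kit/kit/log"
	"github.com/go-kit/kit/log/level"
)

func main() {
	logger := log.NewLogfmtLogger(log.NewSyncWriter(os.Stdout))
	// --log-level=error maps to an error-only filter.
	logger = level.NewFilter(logger, level.AllowError())

	// Records without a level keyval pass the filter by default
	// (no SquelchNoLevel option), so startup messages still appear.
	logger.Log("msg", "Starting Prometheus Operator version '0.40.0'.")

	// Leveled records below error are dropped.
	level.Info(logger).Log("msg", "this info line is filtered out")
	level.Error(logger).Log("msg", "this error line is printed")
}
**********************
Running this sketch prints only the unleveled startup line and the error line.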

Comment 16 Junqi Zhao 2020-07-27 10:45:23 UTC
move to VERIFIED

Comment 17 Junqi Zhao 2020-07-28 06:05:53 UTC
(In reply to Lili Cosic from comment #15)
> Those are always there, regardless of which log level you set as its the
> first two info that always needs to be logged. It is expected to be logged.
> It just means to allow error logs, not to deny any other log levels. Hope
> that makes sense?
> https://github.com/coreos/prometheus-operator/blob/
> ad3571f1e23c51277f6522dee93919ce153d1f46/cmd/operator/main.go#L208

There is an exception for logLevel: error: if there are no error-level messages, there is no log output at all.

**********************
apiVersion: v1
kind: ConfigMap
metadata:
  name: user-workload-monitoring-config
  namespace: openshift-user-workload-monitoring
data:
  config.yaml: |
    prometheusOperator:
      logLevel: error
    thanosRuler:
      logLevel: error
**********************
# oc -n openshift-user-workload-monitoring logs thanos-ruler-user-workload-0 -c thanos-ruler
no result

# oc -n openshift-user-workload-monitoring logs prometheus-operator-56fcff76cd-pvlwx -c prometheus-operator
ts=2020-07-28T05:50:33.440295623Z caller=main.go:217 msg="Starting Prometheus Operator version '0.40.0'."
ts=2020-07-28T05:50:33.452148767Z caller=main.go:104 msg="Starting insecure server on [::]:8080"

Comment 19 errata-xmlrpc 2020-10-27 16:11:46 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (OpenShift Container Platform 4.6 GA Images), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:4196