Bug 1866791

Summary: storage co operatorLogLevel is not validated
Product: OpenShift Container Platform Reporter: Qin Ping <piqin>
Component: StorageAssignee: Jan Safranek <jsafrane>
Storage sub component: Operators QA Contact: Qin Ping <piqin>
Status: CLOSED ERRATA Docs Contact:
Severity: medium    
Priority: medium CC: aos-bugs, jsafrane
Version: 4.6   
Target Milestone: ---   
Target Release: 4.6.0   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
: 1877408 (view as bug list) Environment:
Last Closed: 2020-10-27 16:25:25 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On: 1877408, 1878007    
Bug Blocks:    

Description Qin Ping 2020-08-06 11:18:07 UTC
Description of problem:
The description of OperatorLogLevelChange event is not correct

Version-Release number of selected component (if applicable):
4.6.0-0.nightly-2020-08-05-203353

How reproducible:
Always

Steps to Reproduce:
1.oc patch  storage cluster -p '{"spec": {"operatorLogLevel" : "Debug"}}' --type merge
2.oc patch  storage cluster -p '{"spec": {"operatorLogLevel" : "TraceALL"}}' --type merge
3.oc patch  storage cluster -p '{"spec": {"operatorLogLevel" : "test"}}' --type merge
4. oc patch  storage cluster -p '{"spec": {"operatorLogLevel" : "Normal"}}' --type merge
5. Check the log of cluster-storage-operator
6. Check the events of cluster-storage-operator deloyment

Actual results:
1. The logs did not change when updating the operatorLogLevel
2. All the events:
Normal  OperatorLogLevelChange  6m35s (x2 over 12m)  cluster-storage-operator-loggingsyncer  Operator log level changed from "Normal" to "Debug"
  Normal  OperatorLogLevelChange  6m20s                cluster-storage-operator-loggingsyncer  Operator log level changed from "Debug" to "TraceALL"
  Normal  OperatorLogLevelChange  5m54s (x3 over 30m)  cluster-storage-operator-loggingsyncer  Operator log level changed from "Normal" to "test"

When updating operatorloglevel from TraceALL to the test, the event report "changed from Normal to test"
When updating operatorloglevel from TraceALL to the normal, there is no event.

Expected results:
When changing the operatorLogLevel, the logs level is changed actually.
The events can record the operation correctly.

Master Log:

Node Log (of failed PODs):

PV Dump:

PVC Dump:

StorageClass Dump (if StorageClass used by PV/PVC):

Additional info:

Comment 1 Jan Safranek 2020-08-06 13:25:28 UTC
3. oc patch  storage cluster -p '{"spec": {"operatorLogLevel" : "test"}}' --type merge

This one looks like a real bug, validation should reject value "test".

On the other hand, the other are working for me. Especially change from/to TraceAll is very noticable. Please make sure that CVO is not running, because it will overwrite your Storage changes back to Normal:
oc scale -n openshift-cluster-version  deployment/cluster-version-operator --replicas=0

On Trace level, you should see for example:
I0806 15:23:13.747678 2714590 crcontroller.go:154] CSIDriverOperatorCRController sync finished
I0806 15:23:13.811154 2714590 crcontroller.go:109] CSIDriverOperatorCRController sync started

On TraceAll, you should see: 
I0806 15:23:00.672438 2714590 request.go:1095] Request Body:
00000000  6b 38 73 00 0a 0f 0a 02  76 31 12 09 43 6f 6e 66  |k8s.....v1..Conf|
[whole protobuff dump]

Comment 2 Qin Ping 2020-08-07 02:43:43 UTC
Yes, "TraceAll" is working but "TraceALL" is not working.

Comment 3 Jan Safranek 2020-08-07 06:54:57 UTC
Oh, ok. It's the validation then.

Comment 4 Jan Safranek 2020-08-12 20:48:26 UTC
Since we use library-go, we cannot add hard validation of all OperatorSpec.LogLevel there - other CRs that use the same OperatorSpec and thus have the same validation can already have a CR with wrong LogLevel in etcd and this CR would get unusable after upgrade to new version with tight validation.

Adding soft validation (log message, event), but keep the CR working: https://github.com/openshift/library-go/pull/864

Comment 6 Jan Safranek 2020-09-11 11:08:38 UTC
Note to QA: operatorLogLevel should have very soft validation in CSO and all our CSI operators. Unknown operatorLoglevel leads to a log message + "Normal" level is used.

Check bug #1878007 for solution for all operator CRs + both logLevel / operatorLogLevel fields.

Comment 8 Qin Ping 2020-09-16 06:09:23 UTC
verified with: 4.6.0-0.nightly-2020-09-14-225526

Comment 11 errata-xmlrpc 2020-10-27 16:25:25 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (OpenShift Container Platform 4.6 GA Images), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:4196