2038275 – Change cluster version operator log level

Bug 2038275 - Change cluster version operator log level

Summary: Change cluster version operator log level

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	OpenShift Container Platform
Classification:	Red Hat
Component:	Cluster Version Operator
Sub Component:
Version:	4.8
Hardware:	x86_64
OS:	Linux
Priority:	low
Severity:	medium
Target Milestone:	---
Target Release:	4.9.z
Assignee:	Jack Ottofaro
QA Contact:	Evgeni Vakhonin
Docs Contact:
URL:
Whiteboard:
Depends On:	2034493
Blocks:	2038936
TreeView+	depends on / blocked

Reported:	2022-01-07 16:44 UTC by Scott Dodson
Modified:	2024-04-24 15:38 UTC (History)
CC List:	9 users (show)
Fixed In Version:
Doc Type:	No Doc Update
Doc Text:
Clone Of:	2034493
Clones:	2038936 (view as bug list)
Environment:
Last Closed:	2022-01-17 08:07:31 UTC
Target Upstream Version:
Embargoed:

Attachments	(Terms of Use)

Links
System	ID	Private	Priority	Status	Summary	Last Updated
Github	openshift cluster-version-operator pull 723	0	None	Merged	Bug 2038275: *: Use --v=2 logging to drop client-side throttling noise	2022-02-01 05:27:07 UTC
Red Hat Product Errata	RHBA-2022:0110	0	None	None	None	2022-01-17 08:07:49 UTC

Description Scott Dodson 2022-01-07 16:44:01 UTC

+++ This bug was initially created as a clone of Bug #2034493 +++

Description of problem:

Hello Team

The cluster version operator will produce a lot of logs, those logs will into the EFK stack.

I will check those logs find the logs are info level logs and sync the cluster status, becasue of so many logs to produce, so May I know we have some method to change the cluster version log level?

More Thanks

--- Additional comment from W. Trevor King on 2021-12-21 15:50:25 EST ---

There is no knob for this today; growing one would be an RFE [1].

We have had bugs in the past around blanket reductions in CVO log volume, e.g. bug 1984414 and bug 1886900.  Repeating the analysis from bug 1886900 on a recent 4.10 CI job [2]:

  $ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.10-e2e-aws/1473280349592096768/artifacts/e2e-aws/gather-extra/artifacts/pods/openshift-cluster-version_cluster-version-operator-57cb78fc99-4kswj_cluster-version-operator.log | sed -n 's/.* \([^ ]*[.]go:[0-9]*\)].*/\1/p' | sort | uniq -c | sort -n | tail -n7
    184 task_graph.go:474
    242 apps.go:38
    760 sync_worker.go:393
   1053 task_graph.go:477
   7569 request.go:597
   9940 sync_worker.go:771
   9941 sync_worker.go:759

Getting the relevant CVO commit:

  $ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.10-e2e-aws/1473280349592096768/artifacts/e2e-aws/gather-extra/artifacts/pods/openshift-cluster-version_cluster-version-operator-57cb78fc99-4kswj_cluster-version-operator.log | head -n1
  I1221 13:29:51.914804       1 start.go:23] ClusterVersionOperator 4.10.0-202112180033.p0.gede5b83.assembly.stream-ede5b83

Checking those noisy entries:

  $ git cat-file -p ede5b83:pkg/cvo/sync_worker.go | grep -n . | grep '^\(759\|771\):'
  759:                    klog.V(4).Infof("Running sync for %s", task)
  771:                    klog.V(4).Infof("Done syncing for %s", task)

So logging the start and end of each manifest task is nothing new for us, and it's useful to confirm that the CVO is walking the manifest task-node graph appropriately.

  $ git cat-file -p ede5b83:./vendor/k8s.io/client-go/rest/request.go | grep -n . | grep -B7 '^597:'
  589:    switch {
  590:    case len(retryInfo) > 0:
  591:            message = fmt.Sprintf("Waited for %v, %s - request: %s:%s", latency, retryInfo, r.verb, r.URL().String())
  592:    default:
  593:            message = fmt.Sprintf("Waited for %v due to client-side throttling, not priority and fairness, request: %s:%s", latency, r.verb, r.URL().String())
  594:    }
  596:    if latency > longThrottleLatency {
  597:            klog.V(3).Info(message)

We've received requests to reduce that client-side throttling volume in the past, and are completely fine with dropping it, but there's currently no library knob that allows us to drop it [3].

So options:

a. File an RFE ticket asking for a CVO knob to allow folks to tune log verbosity.
b. File an upstream RFE asking for configurable logging in k8s.io/client-go/rest, to unblock [3].
c. Comment in this bug explaining that in your particular case, the unwanted volume was really from some other type of entry that we didn't see as much of in the CI run I picked for analysis.
d. I'm sure there are other options besides the three I'm listing above; feel free to get creative and not be constrained to the option's I'm presenting ;).

[1]: https://issues.redhat.com/projects/RFE/issues
[2]: https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.10-e2e-aws/1473280349592096768
[3]: https://issues.redhat.com/browse/OTA-414

--- Additional comment from  on 2021-12-22 03:22:33 EST ---

Hello 

Thank you for your confirm, I will open a RFE for the question, because of the customer feed back them ocp 4.8 cluster one day will produce 800 000 logs from the cluster version operator, those logs will make the EFK logging system pressure，so I think only keep the Error level log can mitigate the issue.

Thank you, I will close the bug and open a RFE

Regards
Li Yao

--- Additional comment from W. Trevor King on 2022-01-04 13:29:18 EST ---

Scott had the good idea of adjusting the default log level to work around the klog.V(3) noise, so I'll re-open while we think about that.

--- Additional comment from OpenShift Automated Release Tooling on 2022-01-05 16:57:49 EST ---

Elliott changed bug status from MODIFIED to ON_QA.
This bug is expected to ship in the next 4.10 release created.

--- Additional comment from OpenShift BugZilla Robot on 2022-01-06 19:44:22 EST ---

Bugfix included in accepted release 4.10.0-0.nightly-2022-01-06-183041
Bug will not be automatically moved to VERIFIED for the following reasons:
- PR openshift/cluster-version-operator#721 not approved by QA contact

This bug must now be manually moved to VERIFIED by jiajliu

--- Additional comment from liujia on 2022-01-07 04:13:33 EST ---

Reproduced on 4.10.0-0.nightly-2022-01-05-181126

# ./oc -n openshift-cluster-version get deployment cluster-version-operator -oyaml|grep -e "--v"
        - --v=5

Cut out 1 hour's logs from cvo log for statistics:
# ./oc -n openshift-cluster-version logs cluster-version-operator-bbfbc5fc5-b889s|grep "I0107 07:[0-5][0-9]" > cvo1.log
# ll cvo1.log 
-rw-r--r--. 1 root root 6755882 Jan  7 07:50 cvo1.log

# grep "due to client-side throttling" cvo1.log |wc -l
9590


Verified on 4.10.0-0.nightly-2022-01-06-183041(Also cutting out 1 hour's logs from cvo log for statistics.)

1. The log level is updated to --v=2
# ./oc -n openshift-cluster-version get deployment cluster-version-operator -oyaml|grep -e "--v"
        - --v=2
2. The total size reduced about 34% for 1 hour's log
# ./oc -n openshift-cluster-version logs cluster-version-operator-5d97cf9bd8-h2m9t|grep "I0107 07:[0-5][0-9]" > cvo2.log
# ll cvo2.log 
-rw-r--r--. 1 root root 4486883 Jan  7 07:50 cvo2.log

3. Most of client-side throttling noise dropped.
# grep "due to client-side throttling" cvo2.log |wc -l
14

Comment 1 Scott Dodson 2022-01-07 16:45:08 UTC

Nothing in comment 0 looks private, also reset to default owner, Trevor fixed it on master but doesn't have to be the one to backport.

Comment 5 Evgeni Vakhonin 2022-01-10 18:14:44 UTC

using the same method as #2034493

Reproduced
Cluster version is 4.9.0-0.nightly-2022-01-07-212215

# oc -n openshift-cluster-version get deployment cluster-version-operator -oyaml|grep -e "--v"
        - --v=5
# oc logs -n openshift-cluster-version cluster-version-operator-8599f5dc55-vpcl8 |grep "I0110 16:[0-5][0-9]" > cvo-old.log
# ls -l cvo-old.log
-rw-r--r--. 1 evakhoni evakhoni 6987198 Jan 10 19:55 cvo-old.log
# grep "due to client-side throttling" cvo-old.log |wc -l 
9884

Verified
Cluster version is 4.9.0-0.nightly-2022-01-10-045851

# oc -n openshift-cluster-version get deployment cluster-version-operator -oyaml|grep -e "--v"
        - --v=2
# oc logs -n openshift-cluster-version cluster-version-operator-5c8fcb4896-f76nh |grep "I0110 16:[0-5][0-9]" > cvo-new.log
# ls -l cvo-new.log 
-rw-r--r--. 1 evakhoni evakhoni 4436218 Jan 10 19:57 cvo-new.log
# grep "due to client-side throttling" cvo-new.log |wc -l
0

log size reduced by 36.5%

Comment 8 errata-xmlrpc 2022-01-17 08:07:31 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (OpenShift Container Platform 4.9.15 bug fix update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2022:0110

Comment 11 Scott Dodson 2024-04-24 15:38:07 UTC

Locking to prevent additional spam comments.

Note You need to log in before you can comment on or make changes to this bug.