Bug 1925168 - nfd-operator generates over 15,000 log entries per minute, appears to be logging in "info" mode
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Node Feature Discovery Operator
Version: 4.8
Hardware: All
OS: Linux
Priority: high
Severity: high
Target Milestone: ---
Target Release: 4.8.0
Assignee: Carlos Eduardo Arango Gutierrez
QA Contact: Walid A.
URL:
Whiteboard:
Depends On:
Blocks: 1936464
 
Reported: 2021-02-04 14:14 UTC by Scott Worthington
Modified: 2024-06-14 00:10 UTC
CC: 6 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
: 1936464 (view as bug list)
Environment:
Last Closed: 2021-07-27 22:19:28 UTC
Target Upstream Version:
Embargoed:


Attachments (Terms of Use)
Sample of nfd-operator logs containing 17251 lines of "info" level logs (2.70 MB, text/plain)
2021-02-04 14:14 UTC, Scott Worthington


Links
- GitHub openshift/cluster-nfd-operator issue 133 (open): "RH nfd-operator excessively logs at over 15,000 logged lines per minute" (last updated 2021-02-19 21:19:15 UTC)
- GitHub openshift/cluster-nfd-operator pull 140 (closed): "nodefeaturediscovery_state: avoid accumulating states" (last updated 2021-03-08 14:30:50 UTC)
- Red Hat Product Errata RHBA-2021:2435 (last updated 2021-07-27 22:19:47 UTC)

Description Scott Worthington 2021-02-04 14:14:32 UTC
Created attachment 1755074 [details]
Sample of nfd-operator logs containing 17251 lines of "info" level logs

Description of problem:

After installing the Red Hat "node feature discovery" operator and the gpu-operator to support NVIDIA GPU nodes on AWS, the nfd-operator pod is logging over 14,400 lines per minute. It appears that the nfd-operator is set to either "debug" or "info" level logging. I expected the nfd-operator to only minimally log errors and failures.


Version-Release number of selected component (if applicable):

$ oc get clusterversion
NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.5.22    True        False         50d     Cluster version is 4.5.22

$ oc get csv
NAME                                           DISPLAY                          VERSION                 REPLACES                                       PHASE
elasticsearch-operator.4.5.0-202011132127.p0   Elasticsearch Operator           4.5.0-202011132127.p0   elasticsearch-operator.4.5.0-202010161522.p0   Succeeded
gpu-operator-certified.v1.4.0                  NVIDIA GPU Operator              1.4.0                                                                  Succeeded
jaeger-operator.v1.17.7                        Red Hat OpenShift Jaeger         1.17.7                  jaeger-operator.v1.17.6                        Succeeded
kiali-operator.v1.24.2                         Kiali Operator                   1.24.2                  kiali-operator.v1.12.16                        Succeeded
nfd.4.5.0-202011132127.p0                      Node Feature Discovery           4.5.0                                                                  Succeeded
servicemeshoperator.v2.0.0.2                   Red Hat OpenShift Service Mesh   2.0.0-2


How reproducible:
Always

Steps to Reproduce:
1. Install nfd-operator
2. Install gpu-operator
3. Create machineset with GPU and spin up an AWS EC2 with GPU
4. Examine logs from the nfd-operator
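The measurement in step 4 can be scripted. A minimal sketch, assuming a POSIX shell; `sample_rate` is a hypothetical helper (not part of `oc` or any NFD tooling), and the pod name in the comment is the one from this report:

```shell
#!/bin/sh
# Hypothetical helper: run a log-dumping command twice, `interval` seconds
# apart, and print the approximate number of new lines per minute.
sample_rate() {
  logcmd=$1
  interval=$2
  first=$($logcmd | wc -l)
  sleep "$interval"
  second=$($logcmd | wc -l)
  # Integer lines-per-minute over the sampling interval.
  echo $(( (second - first) * 60 / interval ))
}

# Against a live cluster (pod name taken from this report; adjust to yours):
#   sample_rate "oc logs nfd-operator-7c69cf4f44-jkbwq" 150
```

A healthy operator at rest should report a rate near zero; a rate in the thousands reproduces the symptom described here.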

Actual results:

A very large number of log lines written per minute:

$ date ; oc logs nfd-operator-7c69cf4f44-jkbwq | wc -l
Wed, Feb  3, 2021  2:44:30 PM
171957

$ date ; oc logs nfd-operator-7c69cf4f44-jkbwq | wc -l
Wed, Feb  3, 2021  2:47:01 PM
203070

Calculations:
2 minutes, 31 seconds (151 seconds) since the last count
203,070 - 171,957 = 31,113 messages
31,113 logs / 151 seconds ≈ 12,363 logs per minute
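The rate implied by the two counts can be checked with a quick shell calculation (2m31s = 151 seconds; integer division):

```shell
# Lines-per-minute from two log line counts taken a known interval apart.
rate_per_min() {  # usage: rate_per_min FIRST_COUNT SECOND_COUNT INTERVAL_SECONDS
  echo $(( ($2 - $1) * 60 / $3 ))
}

# Counts from the transcript above, 151 seconds apart:
rate_per_min 171957 203070 151   # prints 12362
```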


Expected results:

Logs that contain start-up initialization and errors, not a log entry for every reconciliation action


Additional info:

Comment 2 Matthew Sweikert 2021-02-22 17:48:08 UTC
Moving to high. This operator is generating an excessive amount of log noise that may contribute to performance issues.

Comment 3 Carlos Eduardo Arango Gutierrez 2021-02-25 13:58:23 UTC
I just tested this and couldn't reproduce it: the logging stops after all the pods are running successfully, and the operator then simply waits for an event on any of its OWNED or WATCHED resources.

Maybe something on this customer's cluster is interfering with one of the OWNED resources of NFD.

Comment 6 Walid A. 2021-03-09 23:30:54 UTC
Verified on OCP 4.6.15 with the latest master image of nfd-operator that the nfd-operator logs no longer keep growing over time.

After deploying the NFD operator and waiting 10 minutes:

# oc logs -n openshift-nfd nfd-operator-7559cf85fb-sjxbs | wc -l
461

This is after 4 hours, during which a new machineset was added, the GPU operator deployed, a GPU workload run, etc.:

# oc logs -n openshift-nfd nfd-operator-7559cf85fb-sjxbs | wc -l
1151
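For comparison with the original report, the post-fix rate implied by these two counts is tiny. A rough shell check, treating the second sample as roughly 4 hours (240 minutes) after deployment, as the comment states:

```shell
# New log lines per minute between the 10-minute sample (461 lines) and the
# ~4-hour sample (1151 lines); 240 minutes is assumed from the comment above.
echo $(( (1151 - 461) / 240 ))   # prints 2
```

Roughly 2 lines per minute versus the ~15,000 per minute originally reported.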

Comment 7 Sergio G. 2021-04-09 08:01:39 UTC
Is this going to be backported to 4.6?

Comment 8 Frederic Giloux 2021-04-12 15:12:09 UTC
Same question here as comment 7
I can see that the code has been committed in the 4.6 release branch: https://github.com/openshift/cluster-nfd-operator/commit/6d13ce0ba68255518e3faca63ccb1931ca6b1c3d
but I could not find any errata with the fix.

Comment 9 Carlos Eduardo Arango Gutierrez 2021-04-12 15:14:22 UTC
It has been backported to 4.5: https://github.com/openshift/cluster-nfd-operator/pull/143
For questions on when it will hit OperatorHub, please refer to the ART team.

Comment 12 errata-xmlrpc 2021-07-27 22:19:28 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (OpenShift Container Platform 4.8.2 extras update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2021:2435

