Created attachment 1755074 [details]
Sample of nfd-operator logs containing 17251 lines of "info" level logs

Description of problem:

After installing the Red Hat Node Feature Discovery (NFD) operator and the gpu-operator to support NVIDIA GPU nodes on AWS, the nfd-operator pod is writing roughly 12,000 log lines per minute. The nfd-operator appears to be running at either "debug" or "info" level logging. I expected the nfd-operator to log only minimal information: start-up, errors, and failures.

Version-Release number of selected component (if applicable):

$ oc get clusterversion
NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.5.22    True        False         50d     Cluster version is 4.5.22

$ oc get csv
NAME                                           DISPLAY                          VERSION                 REPLACES                                       PHASE
elasticsearch-operator.4.5.0-202011132127.p0   Elasticsearch Operator           4.5.0-202011132127.p0   elasticsearch-operator.4.5.0-202010161522.p0   Succeeded
gpu-operator-certified.v1.4.0                  NVIDIA GPU Operator              1.4.0                                                                  Succeeded
jaeger-operator.v1.17.7                        Red Hat OpenShift Jaeger         1.17.7                  jaeger-operator.v1.17.6                        Succeeded
kiali-operator.v1.24.2                         Kiali Operator                   1.24.2                  kiali-operator.v1.12.16                        Succeeded
nfd.4.5.0-202011132127.p0                      Node Feature Discovery           4.5.0                                                                  Succeeded
servicemeshoperator.v2.0.0.2                   Red Hat OpenShift Service Mesh   2.0.0-2

How reproducible:

Always

Steps to Reproduce:
1. Install the nfd-operator
2. Install the gpu-operator
3. Create a machineset with a GPU instance type and spin up an AWS EC2 instance with a GPU
4. Examine the logs from the nfd-operator

Actual results:

A very large number of log lines written per minute:

$ date ; oc logs nfd-operator-7c69cf4f44-jkbwq | wc -l
Wed, Feb 3, 2021 2:44:30 PM
171957

$ date ; oc logs nfd-operator-7c69cf4f44-jkbwq | wc -l
Wed, Feb 3, 2021 2:47:01 PM
203070

Calculations:
2 minutes, 31 seconds between the two counts
203,070 - 171,957 = 31,113 new messages
31,113 messages / ~2.5 minutes ≈ 12,400 messages per minute

Expected results:

Logs that contain start-up initialization and errors, not a log entry for every reconciliation action.

Additional info:
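For reference, the rate above can be sampled automatically with a small script along these lines (a rough sketch; the openshift-nfd namespace and the grep pattern for the pod name are assumptions, adjust them to match the actual deployment):

#!/usr/bin/env bash
# Rough sketch: count nfd-operator log lines twice, INTERVAL seconds apart,
# and report the growth per minute. Namespace and grep pattern are assumptions.
NS=openshift-nfd
POD=$(oc get pods -n "$NS" -o name | grep nfd-operator | head -n1)
INTERVAL=150

first=$(oc logs -n "$NS" "$POD" | wc -l)
sleep "$INTERVAL"
second=$(oc logs -n "$NS" "$POD" | wc -l)

echo "new log lines:    $((second - first))"
echo "lines per minute: $(( (second - first) * 60 / INTERVAL ))"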
Moving to high. This operator is generating an excessive amount of noise that may contribute to performance issues.
I just tested this and couldn't reproduce it. The logging stops once all the pods are running successfully; after that the operator just waits for an event on any of its OWNED or WATCHED resources. Maybe something is happening on this customer cluster that is interfering with one of the OWNED resources of NFD.
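If that is the case, a few checks along these lines (a sketch; the namespace is assumed to be openshift-nfd) should show whether one of the owned objects is churning and re-queueing reconciles:

Recent events, sorted so the newest are last; repeated update/kill/create events on the same object point at whatever keeps re-triggering reconciliation:
$ oc get events -n openshift-nfd --sort-by='.lastTimestamp'

Restart counts of the NFD pods; crash-looping workers also cause continuous reconcile activity in the operator:
$ oc get pods -n openshift-nfd -o wide

Watch the owned DaemonSets for unexpected updates:
$ oc get daemonsets -n openshift-nfd -w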
Verified on OCP 4.6.15 with the latest master image of the nfd-operator that the nfd-operator logs no longer grow continuously over time.

After deploying the NFD operator and waiting 10 minutes:

# oc logs -n openshift-nfd nfd-operator-7559cf85fb-sjxbs | wc -l
461

After 4 hours, adding a new machineset, deploying the GPU operator, running a GPU workload, etc.:

# oc logs -n openshift-nfd nfd-operator-7559cf85fb-sjxbs | wc -l
1151
Is this going to be backported to 4.6?
Same question here as in comment 7. I can see that the code has been committed to the 4.6 release branch: https://github.com/openshift/cluster-nfd-operator/commit/6d13ce0ba68255518e3faca63ccb1931ca6b1c3d but I could not find any errata with the fix.
It has been backported to 4.5: https://github.com/openshift/cluster-nfd-operator/pull/143. For questions on when it will hit OperatorHub, please refer to the ART team.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (OpenShift Container Platform 4.8.2 extras update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2021:2435