Description of problem (please be as detailed as possible and provide log snippets):

I am trying to debug one of the OSD nodes that is down, but the ocs-operator log is flooded with the messages below, which makes it difficult to catch the actual error message:

{"level":"info","ts":"2020-10-07T16:20:39.974Z","logger":"controller_storagecluster","msg":"Reconciling StorageCluster","Request.Namespace":"openshift-storage","Request.Name":"ocs-storagecluster"}
{"level":"info","ts":"2020-10-07T16:20:39.983Z","logger":"controller_storagecluster","msg":"No component operator reported negatively","Request.Namespace":"openshift-storage","Request.Name":"ocs-storagecluster"}
{"level":"info","ts":"2020-10-07T16:20:39.992Z","logger":"controller_storagecluster","msg":"Reconciling StorageCluster","Request.Namespace":"openshift-storage","Request.Name":"ocs-storagecluster"}
{"level":"info","ts":"2020-10-07T16:20:40.000Z","logger":"controller_storagecluster","msg":"No component operator reported negatively","Request.Namespace":"openshift-storage","Request.Name":"ocs-storagecluster"}
{"level":"info","ts":"2020-10-07T16:20:40.009Z","logger":"controller_storagecluster","msg":"Reconciling StorageCluster","Request.Namespace":"openshift-storage","Request.Name":"ocs-storagecluster"}
{"level":"info","ts":"2020-10-07T16:20:40.014Z","logger":"controller_storagecluster","msg":"No component operator reported negatively","Request.Namespace":"openshift-storage","Request.Name":"ocs-storagecluster"}
{"level":"info","ts":"2020-10-07T16:20:44.606Z","logger":"controller_storagecluster","msg":"Reconciling StorageCluster","Request.Namespace":"openshift-storage","Request.Name":"ocs-storagecluster"}
{"level":"info","ts":"2020-10-07T16:20:44.612Z","logger":"controller_storagecluster","msg":"No component operator reported negatively","Request.Namespace":"openshift-storage","Request.Name":"ocs-storagecluster"}
{"level":"info","ts":"2020-10-07T16:20:44.621Z","logger":"controller_storagecluster","msg":"Reconciling StorageCluster","Request.Namespace":"openshift-storage","Request.Name":"ocs-storagecluster"}
{"level":"info","ts":"2020-10-07T16:20:44.629Z","logger":"controller_storagecluster","msg":"No component operator reported negatively","Request.Namespace":"openshift-storage","Request.Name":"ocs-storagecluster"}
{"level":"info","ts":"2020-10-07T16:20:48.105Z","logger":"controller_storagecluster","msg":"Reconciling StorageCluster","Request.Namespace":"openshift-storage","Request.Name":"ocs-storagecluster"}
{"level":"info","ts":"2020-10-07T16:20:48.112Z","logger":"controller_storagecluster","msg":"No component operator reported negatively","Request.Namespace":"openshift-storage","Request.Name":"ocs-storagecluster"}
{"level":"info","ts":"2020-10-07T16:20:48.121Z","logger":"controller_storagecluster","msg":"Reconciling StorageCluster","Request.Namespace":"openshift-storage","Request.Name":"ocs-storagecluster"}
{"level":"info","ts":"2020-10-07T16:20:48.127Z","logger":"controller_storagecluster","msg":"No component operator reported negatively","Request.Namespace":"openshift-storage","Request.Name":"ocs-storagecluster"}
{"level":"info","ts":"2020-10-07T16:20:49.094Z","logger":"controller_storagecluster","msg":"Reconciling StorageCluster","Request.Namespace":"openshift-storage","Request.Name":"ocs-storagecluster"}
{"level":"info","ts":"2020-10-07T16:20:49.100Z","logger":"controller_storagecluster","msg":"No component operator reported negatively","Request.Namespace":"openshift-storage","Request.Name":"ocs-storagecluster"}
{"level":"info","ts":"2020-10-07T16:20:49.113Z","logger":"controller_storagecluster","msg":"Reconciling StorageCluster","Request.Namespace":"openshift-storage","Request.Name":"ocs-storagecluster"}
{"level":"info","ts":"2020-10-07T16:20:49.129Z","logger":"controller_storagecluster","msg":"No component operator reported negatively","Request.Namespace":"openshift-storage","Request.Name":"ocs-storagecluster"}
{"level":"info","ts":"2020-10-07T16:20:57.406Z","logger":"controller_storagecluster","msg":"Reconciling StorageCluster","Request.Namespace":"openshift-storage","Request.Name":"ocs-storagecluster"}
{"level":"info","ts":"2020-10-07T16:20:57.414Z","logger":"controller_storagecluster","msg":"No component operator reported negatively","Request.Namespace":"openshift-storage","Request.Name":"ocs-storagecluster"}
{"level":"info","ts":"2020-10-07T16:20:57.425Z","logger":"controller_storagecluster","msg":"Reconciling StorageCluster","Request.Namespace":"openshift-storage","Request.Name":"ocs-storagecluster"}
{"level":"info","ts":"2020-10-07T16:20:57.430Z","logger":"controller_storagecluster","msg":"No component operator reported negatively","Request.Namespace":"openshift-storage","Request.Name":"ocs-storagecluster"}
{"level":"info","ts":"2020-10-07T16:21:06.631Z","logger":"controller_storagecluster","msg":"Reconciling StorageCluster","Request.Namespace":"openshift-storage","Request.Name":"ocs-storagecluster"}

Version of all relevant components (if applicable):
Client Version: 4.5.0-0.nightly-2020-08-15-052753
Server Version: 4.5.0-0.nightly-2020-09-20-185910
Kubernetes Version: v1.18.3+47c0e71

sh-4.4# rook version
rook: 4.5-49.f5afaeeb.release_4.5
go: go1.13.4

sh-4.4# ceph version
ceph version 14.2.8-91.el8cp (75b4845da7d469665bd48d1a49badcc3677bf5cd) nautilus (stable)

Rate from 1 - 5 the complexity of the scenario you performed that caused this bug (1 - very simple, 5 - very complex)?
3

Is this issue reproducible?
not sure

Can this issue be reproduced from the UI?
not sure

If this is a regression, please provide more details to justify this:

Steps to Reproduce:
1. Keep an OCS 4.5 nightly build running for a couple of days (> 4 days).
2. Observe that the logs are flooded with the messages above.
Created attachment 1719806 [details] ocs-operator-logs
$ grep "Reconciling StorageCluster" ocs-operator-logs.txt | wc 44281 88562 8723357 $ wc ocs-operator-logs.txt 89101 325554 18958957 ocs-operator-logs.txt =~ 50% filled with same lines
This is the OCS operator log, rather than rook.
This is expected behavior, since we periodically reconcile the StorageCluster. I have a PR in the works that should at least reduce this chatter, though it will never (and should never) be completely eliminated: https://github.com/openshift/ocs-operator/pull/796

Still, this is not an urgent matter, so I'm moving this to OCS 4.7.
I think it is important to reduce the frequency of these messages so that the log is actually useful for debugging and leaves room for other important lines that help us debug better (like the OSD that went into the init stage and has missing logs). The current messages (see the attached operator logs) aren't really helpful, so I believe we should fix this sooner.
Jose, would it make sense to only log every 10th reconcile message or similar? That would reduce the number of messages and would be an easy change.
(In reply to N Balachandran from comment #7)
> Jose, would it make sense to only log every 10th reconcile message or
> similar? That would reduce the number of messages and would be an easy
> change.

That's how the logging is configured currently. The problem is that the logs themselves aren't formatted in a way that makes it easy for the logger to filter these out. We did a small refactor for Error-level logs; we need to do the same for Info and Warning.
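For illustration, something along these lines is what "only log every 10th reconcile" could look like. This is a minimal sketch, not the actual ocs-operator code; the names logEvery, reconcileCount, and logReconcile are hypothetical, and only the logr-style Info call mirrors what the operator already emits.

package storagecluster

import (
	"sync/atomic"

	"github.com/go-logr/logr"
)

// Hypothetical sampling factor: emit the periodic info line once per 10 reconciles.
const logEvery = 10

type ReconcileStorageCluster struct {
	log            logr.Logger
	reconcileCount uint64 // hypothetical counter, bumped on every Reconcile call
}

// logReconcile prints the "Reconciling StorageCluster" message only on the
// 1st, 11th, 21st, ... reconcile instead of on every single one.
func (r *ReconcileStorageCluster) logReconcile(namespace, name string) {
	n := atomic.AddUint64(&r.reconcileCount, 1)
	if n%logEvery == 1 {
		r.log.Info("Reconciling StorageCluster",
			"Request.Namespace", namespace,
			"Request.Name", name)
	}
}

The trade-off is that genuinely interesting reconciles can fall into the suppressed window, which is why level-based filtering (discussed below) is usually preferable to simple sampling.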
Umanga, are we targeting it for 4.7?
In OCS 4.6 we did https://github.com/openshift/ocs-operator/pull/717 and https://github.com/openshift/ocs-operator/pull/666 to try to reduce the logs. In OCS 4.7 some of these updates were removed and we now depend on the default Kubernetes-aware logger configuration, so this should already be testable. If this still doesn't meet expectations, we'll take it up as part of a bigger logging enhancement in 4.8.

There is also https://bugzilla.redhat.com/show_bug.cgi?id=1890978, which asks for logging on every reconcile (more frequently), which seems to be the opposite of what's expected in this BZ.
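For reference, here is a minimal sketch (assuming a controller-runtime based setup; not taken from the ocs-operator repo) of how the default Kubernetes-aware zap logger can be wired up so the per-reconcile Info lines are filtered by verbosity level. The chosen Error level is an example only.

package main

import (
	"flag"

	"go.uber.org/zap/zapcore"
	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/log/zap"
)

func main() {
	// Assumption: the operator uses controller-runtime's zap integration.
	opts := zap.Options{
		Development: false,
		// Example only: at Error level the periodic "Reconciling
		// StorageCluster" Info lines are dropped entirely.
		Level: zapcore.ErrorLevel,
	}
	// Exposes flags such as --zap-log-level, so verbosity can be changed
	// at deploy time without rebuilding the operator.
	opts.BindFlags(flag.CommandLine)
	flag.Parse()

	ctrl.SetLogger(zap.New(zap.UseFlagOptions(&opts)))

	log := ctrl.Log.WithName("controller_storagecluster")
	log.Info("Reconciling StorageCluster")        // suppressed at Error level
	log.Error(nil, "component reported failure")  // still emitted
}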
Moving to ON_QA based on the above comment. If we need more than what has been done, it's better to open a new BZ.
Explanation of the fix from today's bug triage meeting: logging frequency was reduced, so the number of the particular log lines listed in this bug report should go down (where there were 20 lines, we expect about 4 after the change). QE will test this bug based on this explanation.
It looks like about 1/8 of the lines are now "Reconciling StorageCluster" messages, so I changed this to VERIFIED.

(venv) wusui@localhost:~/Downloads/logs_1612481699/home/jenkins/current-cluster-dir/logs/failed_testcase_ocs_logs_1612435608$ find . -type f | grep current.log | grep ocs-operator | xargs wc
   25797  129167  6753438 ./test_monitoring_reporting_ok_when_idle_ocs_logs/ocs_must_gather/registry-redhat-io-ocs4-ocs-must-gather-rhel8-sha256-b3e2cbb55f8cfa82480b70e73576ed049cb4e9de984c46f247896951a1746184/namespaces/openshift-storage/pods/ocs-operator-66b79fb54f-fcncf/ocs-operator/ocs-operator/logs/current.log
    5438   27247  1425039 ./test_create_new_project_to_verify_logging_ocs_logs/ocs_must_gather/registry-redhat-io-ocs4-ocs-must-gather-rhel8-sha256-b3e2cbb55f8cfa82480b70e73576ed049cb4e9de984c46f247896951a1746184/namespaces/openshift-storage/pods/ocs-operator-66b79fb54f-fcncf/ocs-operator/ocs-operator/logs/current.log
    2137   10678   557983 ./test_monitoring_enabled_ocs_logs/ocs_must_gather/registry-redhat-io-ocs4-ocs-must-gather-rhel8-sha256-b3e2cbb55f8cfa82480b70e73576ed049cb4e9de984c46f247896951a1746184/namespaces/openshift-storage/pods/ocs-operator-66b79fb54f-fcncf/ocs-operator/ocs-operator/logs/current.log
   41643  208420 10893114 ./test_pvc_expand_expanded_pvc_ocs_logs/ocs_must_gather/registry-redhat-io-ocs4-ocs-must-gather-rhel8-sha256-b3e2cbb55f8cfa82480b70e73576ed049cb4e9de984c46f247896951a1746184/namespaces/openshift-storage/pods/ocs-operator-66b79fb54f-fcncf/ocs-operator/ocs-operator/logs/current.log
   14752   73878  3863131 ./test_namespace_bucket_creation_with_rgw_ocs_logs/ocs_must_gather/registry-redhat-io-ocs4-ocs-must-gather-rhel8-sha256-b3e2cbb55f8cfa82480b70e73576ed049cb4e9de984c46f247896951a1746184/namespaces/openshift-storage/pods/ocs-operator-66b79fb54f-fcncf/ocs-operator/ocs-operator/logs/current.log
   47635  238380 12458610 ./test_must_gather[OTHERS]_ocs_logs/ocs_must_gather/registry-redhat-io-ocs4-ocs-must-gather-rhel8-sha256-b3e2cbb55f8cfa82480b70e73576ed049cb4e9de984c46f247896951a1746184/namespaces/openshift-storage/pods/ocs-operator-66b79fb54f-fcncf/ocs-operator/ocs-operator/logs/current.log
  137402  687770 35951315 total

(venv) wusui@localhost:~/Downloads/logs_1612481699/home/jenkins/current-cluster-dir/logs/failed_testcase_ocs_logs_1612435608$ find . -type f | grep current.log | grep ocs-operator | xargs grep "Reconciling StorageCluster" | wc
   17145   51435  8846713
It looks like the changes from https://github.com/openshift/ocs-operator/pull/717 fix this.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: Red Hat OpenShift Container Storage 4.7.0 security, bug fix, and enhancement update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2021:2041
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 500 days