Note: This bug is displayed in read-only format because the product is no longer active in Red Hat Bugzilla.

Bug 2019532

Summary: Logger object in LSO does not log source location accurately
Product: OpenShift Container Platform Reporter: Hemant Kumar <hekumar>
Component: StorageAssignee: Jonathan Dobson <jdobson>
Storage sub component: Local Storage Operator QA Contact: Chao Yang <chaoyang>
Status: CLOSED ERRATA Docs Contact:
Severity: medium    
Priority: medium CC: aos-bugs, jsafrane
Version: 4.10   
Target Milestone: ---   
Target Release: 4.11.0   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: Doc Type: No Doc Update
Doc Text:
Story Points: ---
Clone Of:
: 2050420 (view as bug list) Environment:
Last Closed: 2022-08-10 10:39:06 UTC Type: ---
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On:    
Bug Blocks: 2050420, 2050883    

Description Hemant Kumar 2021-11-02 18:45:51 UTC
Logger object in LSO does not log source location accurately

Comment 1 Hemant Kumar 2021-11-02 18:48:55 UTC
I think that *sometimes* the logger object does not log source information accurately. For example, I tried adding a logging statement in https://github.com/openshift/local-storage-operator/blob/master/diskmaker/controllers/lv/reconcile.go#L219 and it logged:

2021-11-02T18:33:16.207Z        INFO    diskmaker-controllers.Deleter   controller/controller.go:298    Looking for released PVs to clean up    {"Request.Namespace": "openshift-local-storage", "Request.Name": ""}                                                                                                                                                                                                                    2021-11-02T18:33:16.207Z        INFO    diskmaker-controllers.LocalVolume       controller/controller.go:298    ignoring pv     {"request.namespace": "openshift-local-storage", "Request.Name": "local-disks", "pvname": "pv0018", "prov-by": "", "runtime-name": "local-volume-provisioner-crc-dzk9v-master-0-febf7407-d1ed-4a8a-a94b-77724408b477"}                                                                                  2021-11-02T18:33:16.207Z        INFO    diskmaker-controllers.LocalVolume       controller/controller.go:298    ignoring pv     {"request.namespace": "openshift-local-storage", "Request.Name": "local-disks", "pvname": "pv0007", "prov-by": "", "runtime-name": "local-volume-provisioner-crc-dzk9v-master-0-febf7407-d1ed-4a8a-a94b-77724408b477"}                                                                                  I1102 18:33:16.207685 2670386 cache.go:55] Added pv "local-pv-77682331" to cache                                                                                                                                                                                                                                                                                                                                                        2021-11-02T18:33:16.207Z        INFO    diskmaker-controllers.LocalVolume       controller/controller.go:298    ignoring pv     {"request.namespace": "openshift-local-storage", "Request.Name": "local-disks", "pvname": "pv0024", "prov-by": "", "runtime-name": "local-volume-provisioner-crc-dzk9v-master-0-febf7407-d1ed-4a8a-a94b-77724408b477"}                                                                                  2021-11-02T18:33:16.207Z        INFO    diskmaker-controllers.LocalVolume       controller/controller.go:298    ignoring pv     {"request.namespace": "openshift-local-storage", "Request.Name": "local-disks", "pvname": "pv0030", "prov-by": "", "runtime-name": "local-volume-provisioner-crc-dzk9v-master-0-febf7407-d1ed-4a8a-a94b-77724408b477"}                                                                                  2021-11-02T18:33:16.207Z        INFO    diskmaker-controllers.LocalVolume       controller/controller.go:298    ignoring pv     {"request.namespace": "openshift-local-storage", "Request.Name": "local-disks", "pvname": "pv0029", "prov-by": "", "runtime-name": "local-volume-provisioner-crc-dzk9v-master-0-febf7407-d1ed-4a8a-a94b-77724408b477"}                                                                                  2021-11-02T18:33:16.207Z        INFO    diskmaker-controllers.LocalVolume       controller/controller.go:298    ignoring pv     {"request.namespace": "openshift-local-storage", "Request.Name": "local-disks", "pvname": "pv0006", "prov-by": "", "runtime-name": "local-volume-provisioner-crc-dzk9v-master-0-febf7407-d1ed-4a8a-a94b-77724408b477"}                                                                                  2021-11-02T18:33:16.207Z        INFO    diskmaker-controllers.LocalVolume       controller/controller.go:298    ignoring pv     {"request.namespace": "openshift-local-storage", "Request.Name": "local-disks", "pvname": "pv0021", "prov-by": "", "runtime-name": "local-volume-provisioner-crc-dzk9v-master-0-febf7407-d1ed-4a8a-a94b-77724408b477"}                                                                                  2021-11-02T18:33:16.207Z        INFO    diskmaker-controllers.LocalVolume       controller/controller.go:298    ignoring pv     {"request.namespace": "openshift-local-storage", "Request.Name": "local-disks", "pvname": "pv0020", "prov-by": "", "runtime-name": "local-volume-provisioner-crc-dzk9v-master-0-febf7407-d1ed-4a8a-a94b-77724408b477"}                                                                                  2021-11-02T18:33:16.207Z        INFO    diskmaker-controllers.LocalVolume       controller/controller.go:298    ignoring pv     {"request.namespace": "openshift-local-storage", "Request.Name": "local-disks", "pvname": "pv0009", "prov-by": "", "runtime-name": "local-volume-provisioner-crc-dzk9v-master-0-febf7407-d1ed-4a8a-a94b-77724408b477"}                                                                                  

Which seems strange to me. This will perhaps require investigation. Overall, I wonder if we should just move back to klog everywhere since that is what all OCP projects use.

Comment 2 Jonathan Dobson 2021-11-24 01:20:55 UTC
The caller being logged (controller/controller.go:298) is the function that calls our Reconcile() from controller-runtime, rather than the actual log statements within Reconcile():

https://github.com/openshift/local-storage-operator/blob/8fc42cc8b990907c88a6da551dc85b55c2dc4417/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go#L298

That incorrect caller is being logged due this PR:

https://github.com/openshift/local-storage-operator/pull/267

Specifically these 2 lines:

https://github.com/openshift/local-storage-operator/blob/8fc42cc8b990907c88a6da551dc85b55c2dc4417/main.go#L81
https://github.com/openshift/local-storage-operator/blob/8fc42cc8b990907c88a6da551dc85b55c2dc4417/diskmaker_manager/manager.go#L45

I think that should actually be zap.AddCallerSkip(1) instead of zaplog.AddCaller().
The part that doesn't make sense yet, is if I replace zaplog.AddCaller() with zap.AddCallerSkip(1), no caller gets logged at all.
And further, we actually do have code in controller-runtime that adds zap.AddCallerSkip(1) by default:

https://github.com/openshift/local-storage-operator/blob/8fc42cc8b990907c88a6da551dc85b55c2dc4417/vendor/sigs.k8s.io/controller-runtime/pkg/log/zap/zap.go#L234

So the question now is... if I just remove this line:
https://github.com/openshift/local-storage-operator/blob/8fc42cc8b990907c88a6da551dc85b55c2dc4417/main.go#L81
why isn't the caller logged correctly due to the line here?
https://github.com/openshift/local-storage-operator/blob/8fc42cc8b990907c88a6da551dc85b55c2dc4417/vendor/sigs.k8s.io/controller-runtime/pkg/log/zap/zap.go#L234

Comment 3 Jonathan Dobson 2021-11-24 19:19:52 UTC
(In reply to Jonathan Dobson from comment #2)
> I think that should actually be zap.AddCallerSkip(1) instead of
> zaplog.AddCaller().
> The part that doesn't make sense yet, is if I replace zaplog.AddCaller()
> with zap.AddCallerSkip(1), no caller gets logged at all.
> And further, we actually do have code in controller-runtime that adds
> zap.AddCallerSkip(1) by default

One thing that I misunderstood: we actually do need zaplog.AddCaller() to enable logging callers in the first place.
The call depth is a separate parameter, and that parameter is wrong.

If I remove this line:
https://github.com/openshift/local-storage-operator/blob/8fc42cc8b990907c88a6da551dc85b55c2dc4417/main.go#L81

Then the logr.Logger object ends up with:

  addCaller: (bool) false,
  callerSkip: (int) 2,

callerSkip should be set to 1... so it seems that we're somehow ending up with 2 different places calling zap.AddCallerSkip(1) on the same logger.
As a quick test, I changed ZapOpts to:

    ZapOpts: []zaplog.Option{zaplog.AddCaller(), zaplog.AddCallerSkip(-1)},

And added a test log message, which logged the caller correctly.

{"level":"info","ts":1637780758.5367253,"logger":"controllers.NodeDaemon","caller":"nodedaemon/reconcile.go:63","msg":"jdd test","Request.Namespace":"openshift-local-storage","Request.Name":""}

That -1 changes callerSkip from 2 to 1, which solves the issue, but it's a hack.
So why is zap.AddCallerSkip(1) getting called twice?

Comment 4 Jonathan Dobson 2021-11-24 19:57:46 UTC
(In reply to Jonathan Dobson from comment #3)
> So why is zap.AddCallerSkip(1) getting called twice?

I added debug.PrintStack to AddCallerSkip(), and these were the places where it was provided a non-zero argument:

https://github.com/openshift/local-storage-operator/blob/8fc42cc8b990907c88a6da551dc85b55c2dc4417/vendor/sigs.k8s.io/controller-runtime/pkg/log/zap/zap.go#L234

goroutine 1 [running]:
runtime/debug.Stack(0x1ed08f0, 0xc0002fddd0, 0xc000590d00)
        /usr/lib/golang/src/runtime/debug/stack.go:24 +0x9f
runtime/debug.PrintStack()
        /usr/lib/golang/src/runtime/debug/stack.go:16 +0x25
go.uber.org/zap.AddCallerSkip(0x1, 0x1ec45e0, 0xc000010020)
        /go/src/github.com/openshift/local-storage-operator/vendor/go.uber.org/zap/options.go:110 +0x25
sigs.k8s.io/controller-runtime/pkg/log/zap.NewRaw(0xc00020bcf8, 0x1, 0x1, 0xc0004a8320)
        /go/src/github.com/openshift/local-storage-operator/vendor/sigs.k8s.io/controller-runtime/pkg/log/zap/zap.go:234 +0x106
sigs.k8s.io/controller-runtime/pkg/log/zap.New(0xc00020bcf8, 0x1, 0x1, 0x1, 0x0)
        /go/src/github.com/openshift/local-storage-operator/vendor/sigs.k8s.io/controller-runtime/pkg/log/zap/zap.go:43 +0x3f
main.main()
        /go/src/github.com/openshift/local-storage-operator/main.go:93 +0x3f4

https://github.com/openshift/local-storage-operator/blob/8fc42cc8b990907c88a6da551dc85b55c2dc4417/vendor/github.com/go-logr/zapr/zapr.go#L182

goroutine 1 [running]:
runtime/debug.Stack(0x6, 0x8, 0xc000339920)
        /usr/lib/golang/src/runtime/debug/stack.go:24 +0x9f
runtime/debug.PrintStack()
        /usr/lib/golang/src/runtime/debug/stack.go:16 +0x25
go.uber.org/zap.AddCallerSkip(0x1, 0xc00000ff50, 0x1ec5120)
        /go/src/github.com/openshift/local-storage-operator/vendor/go.uber.org/zap/options.go:110 +0x25
github.com/go-logr/zapr.newLoggerWithExtraSkip(0xc000339920, 0x1, 0x1, 0xc000339920)
        /go/src/github.com/openshift/local-storage-operator/vendor/github.com/go-logr/zapr/zapr.go:172 +0x2f
github.com/go-logr/zapr.NewLogger(...)
        /go/src/github.com/openshift/local-storage-operator/vendor/github.com/go-logr/zapr/zapr.go:182
sigs.k8s.io/controller-runtime/pkg/log/zap.New(0xc00020bcf8, 0x1, 0x1, 0x1, 0x0)
        /go/src/github.com/openshift/local-storage-operator/vendor/sigs.k8s.io/controller-runtime/pkg/log/zap/zap.go:43 +0x56
main.main()
        /go/src/github.com/openshift/local-storage-operator/main.go:93 +0x3f4

Comment 5 Jonathan Dobson 2021-11-24 20:11:44 UTC
Seems the issue is that LSO creates the logger via controller-runtime, which adds an extra layer and increments callerSkip accordingly. But then LSO uses the go-logr functions directly, instead of the controller-runtime wrapper functions. This causes the parent caller to be logged instead of the line where the log message actually is.

Comment 6 Jonathan Dobson 2021-11-25 01:28:10 UTC
It seems a little strange that these DelegatingLogger methods all return logr.Logger:
https://github.com/openshift/local-storage-operator/blob/8fc42cc8b990907c88a6da551dc85b55c2dc4417/vendor/sigs.k8s.io/controller-runtime/pkg/log/deleg.go#L183

Nonetheless, the following change does at least log the correct caller by creating a new DelegatingLogger from the logr.Logger object:

 func (r *DaemonReconciler) Reconcile(ctx context.Context, request ctrl.Request) (ctrl.Result, error) {
-       logger := r.Log.WithValues("Request.Namespace", request.Namespace, "Request.Name", r
equest.Name)
+       logger := log.NewDelegatingLogger(log.FromContext(ctx))
+       logger.Info("jdd test")

This causes us to use the DelegatingLogger wrapper function instead of the go-logr methods:
https://github.com/openshift/local-storage-operator/blob/8fc42cc8b990907c88a6da551dc85b55c2dc4417/vendor/sigs.k8s.io/controller-runtime/pkg/log/deleg.go#L127
It's only when using the go-logr methods directly that callerSkip = 2 is incorrect.

Comment 7 Jonathan Dobson 2021-11-30 15:55:35 UTC
sample output line from:

+       logger := log.NewDelegatingLogger(log.FromContext(ctx))
+       logger.Info("jdd test")

{"level":"info","ts":1637802914.009081,"logger":"controller-runtime.manager.controller.localvolume","caller":"nodedaemon/reconcile.go:66","msg":"jdd test","reconciler group":"local.storage.openshift.io","reconciler kind":"LocalVolume","name":"","namespace":"openshift-local-storage"}

Comment 8 Hemant Kumar 2021-11-30 15:58:46 UTC
okay it is not that bad. If that is only available fix, lets go for it.

Comment 9 Jonathan Dobson 2021-12-03 00:52:55 UTC
Alright, I filed a PR for this issue, but I reverted to klog for now. Seems to be the simplest and most reliable option at the moment.

  https://github.com/openshift/local-storage-operator/pull/299

I stashed the NewDelegatingLogger workaround on a separate branch if we end up going that direction instead (I'm still open to this, but somewhat prefer klog after seeing both implementations):

  https://github.com/openshift/local-storage-operator/compare/master...dobsonj:bug2019532

controller-runtime uses go-logr while client-go uses klog, so we still get a mix of both in the log output no matter what, but I tried to at least make the LSO code use a single logger across the board.

As for why controller-runtime is logging the incorrect source line, I filed an upstream issue for this:
  https://github.com/kubernetes-sigs/controller-runtime/issues/1737

Comment 11 Chao Yang 2022-02-11 13:02:40 UTC
Passed with local-storage-operator.4.10.0-202202101606

Comment 12 Chao Yang 2022-02-21 07:09:37 UTC
Passed with local-storage-operator.4.11.0-202202172124

Comment 14 errata-xmlrpc 2022-08-10 10:39:06 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Important: OpenShift Container Platform 4.11.0 bug fix and security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:5069