Description of problem (please be as detailed as possible and provide log snippets):

We have encountered consistent hanging of CSI backups via Ceph on OpenShift v4.10.3. We successfully took 5 backups with Ceph CSI snapshots for 34+ PVCs via OADP v1.0.2, but the next backup always hangs indefinitely in the velero-plugin-for-csi plugin. There are no errors in the velero log and the backup stays stuck in the InProgress state. The problem is consistent and repeatable on another OpenShift v4.10 cluster. (Kubernetes Version: v1.23.3+e419edf)

The last log messages from that plugin in the velero log were:

time="2022-04-16T00:15:21Z" level=info msg="Executing custom action" backup=oadp/ckpt-backup-1f logSource="pkg/backup/item_backupper.go:327" name=velero-zookeeper-data-zookeeper-0-2pjzv namespace=wkc resource=volumesnapshots.snapshot.storage.k8s.io
time="2022-04-16T00:15:21Z" level=info msg="Executing VolumeSnapshotBackupItemAction" backup=oadp/ckpt-backup-1f cmd=/plugins/velero-plugin-for-csi logSource="/remote-source/app/internal/backup/volumesnapshot_action.go:58" pluginName=velero-plugin-for-csi
time="2022-04-16T00:15:21Z" level=info msg="Getting VolumesnapshotContent for Volumesnapshot wkc/velero-zookeeper-data-zookeeper-0-2pjzv" backup=oadp/ckpt-backup-1f cmd=/plugins/velero-plugin-for-csi logSource="/remote-source/app/internal/backup/volumesnapshot_action.go:91" pluginName=velero-plugin-for-csi

Version of all relevant components (if applicable):
OADP version: v1.0.2
Velero version: v1.7.1
velero-plugin-for-csi version: 0.2.0

oc version
Client Version: 4.10.3
Server Version: 4.10.3
Kubernetes Version: v1.23.3+e419edf

Does this issue impact your ability to continue to work with the product (please explain in detail what is the user impact)?
Unable to perform further backups.

Is there any workaround available to the best of your knowledge?
See explanation below.

I debugged this further and took a stack dump of the velero-plugin-for-csi gRPC server. The goroutine in question (shown under "Additional info" below) was in IO wait: it was writing deprecation warnings from client-go via klog.Warning(), which normally writes to stderr. The warnings report that the v1beta1 VolumeSnapshot and VolumeSnapshotContent APIs are deprecated. (These warning messages actually went into a pipe; see the explanation below.)

W0422 22:59:43.573612 66770 warnings.go:70] snapshot.storage.k8s.io/v1beta1 VolumeSnapshot is deprecated; use snapshot.storage.k8s.io/v1 VolumeSnapshot
W0422 22:55:37.607548 66737 warnings.go:70] snapshot.storage.k8s.io/v1beta1 VolumeSnapshotContent is deprecated; use snapshot.storage.k8s.io/v1 VolumeSnapshotContent

At first it seemed strange that the goroutine would block while writing to stderr. After digging into the go-plugin code (my understanding is that Velero uses go-plugin v1.0.1 as of this writing), it turns out the go-plugin framework replaces both os.Stdout and os.Stderr with a pipe in its Serve() function:
https://github.com/hashicorp/go-plugin/blob/9e3e1c37db188a1acb66561ee0ed4bf4d5e77554/server.go#L350

However, in go-plugin v1.0.1 I don't see any code that consumes from those pipes, which explains why the plugin eventually blocked: the accumulated warning messages filled the pipe to its maximum capacity. I went through the go-plugin GitHub issues and found a comment that confirms my understanding: hashicorp/go-plugin#147 (comment). This appears to be a known issue that was addressed in go-plugin releases after v1.0.1.
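To illustrate the failure mode, here is a minimal, self-contained Go sketch (my own illustration, not code from go-plugin or velero) that redirects os.Stderr to an undrained os.Pipe() the way go-plugin v1.0.1's Serve() does; once the kernel pipe buffer (typically 64 KiB on Linux) fills up, the next write blocks forever and the writing goroutine sits in "IO wait", exactly as in the stack dump below:

// pipehang.go: demonstrate that writes to an undrained pipe eventually block.
package main

import (
	"fmt"
	"os"
	"strings"
)

func main() {
	r, w, err := os.Pipe()
	if err != nil {
		panic(err)
	}
	defer r.Close() // keep the read end open, but never read from it (as in go-plugin v1.0.1)

	os.Stderr = w // mimic go-plugin's Serve() redirecting stderr into the pipe

	warning := strings.Repeat("x", 1024) // stand-in for one klog deprecation warning
	for i := 0; i < 128; i++ {
		fmt.Println("writing warning", i) // real stdout: only roughly the first 64 appear
		fmt.Fprintln(os.Stderr, warning)  // blocks once the pipe buffer is full
	}
	fmt.Println("never reached")
}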
This issue can easily occur on newer Kubernetes versions once multiple CSI backups are created over time: eventually a backup gets stuck and no further backups can be performed. Updating the CSI plugin to use snapshot.storage.k8s.io/v1, or installing a custom rest.WarningHandler (see the sketch after the stack trace below), are possible workarounds. Nevertheless, this looks like a general Velero plugin issue: plugin processing can hang for the reasons above, and it needs to be addressed at the plugin framework level. I have reported this to Velero at https://github.com/vmware-tanzu/velero/issues/4863 and they have targeted a fix for v1.9.0. What is the plan for OADP?

Rate from 1 - 5 the complexity of the scenario you performed that caused this bug (1 - very simple, 5 - very complex)?
2

Can this issue be reproduced?
Yes

Can this issue be reproduced from the UI?

If this is a regression, please provide more details to justify this:

Steps to Reproduce:
1. Perform multiple CSI snapshot backups using OADP on OpenShift 4.10
2. Eventually a backup will hang

Actual results:
Backup hangs indefinitely.

Expected results:
Backup should complete, or fail with an error, without hanging.

Additional info:

goroutine 1154 [IO wait, 34 minutes]:
internal/poll.runtime_pollWait(0x7fc4a63fc2a0, 0x77, 0xffffffffffffffff)
    /usr/local/go/src/runtime/netpoll.go:227 +0x55
internal/poll.(*pollDesc).wait(0xc000246cd8, 0x77, 0x1, 0xf5, 0xffffffffffffffff)
    /usr/local/go/src/internal/poll/fd_poll_runtime.go:87 +0x45
internal/poll.(*pollDesc).waitWrite(...)
    /usr/local/go/src/internal/poll/fd_poll_runtime.go:96
internal/poll.(*FD).Write(0xc000246cc0, 0xc000878000, 0xa8, 0xf5, 0x0, 0x0, 0x0)
    /usr/local/go/src/internal/poll/fd_unix.go:282 +0x325
os.(*File).write(...)
    /usr/local/go/src/os/file_posix.go:48
os.(*File).Write(0xc000342038, 0xc000878000, 0xa8, 0xf5, 0xc000936a00, 0x4, 0xc0009beb90)
    /usr/local/go/src/os/file.go:174 +0x8e
k8s.io/klog/v2.(*loggingT).output(0x266eaa0, 0xc000000001, 0x0, 0x0, 0xc000876000, 0x1, 0x1f1a7d4, 0xb, 0x46, 0x40e000)
    /go/pkg/mod/k8s.io/klog/v2.0/klog.go:924 +0x905
k8s.io/klog/v2.(*loggingT).printDepth(0x266eaa0, 0xc000000001, 0x0, 0x0, 0x0, 0x0, 0x1, 0xc0003ec050, 0x1, 0x1)
    /go/pkg/mod/k8s.io/klog/v2.0/klog.go:735 +0x185
k8s.io/klog/v2.(*loggingT).print(...)
    /go/pkg/mod/k8s.io/klog/v2.0/klog.go:717
k8s.io/klog/v2.Warning(...)
    /go/pkg/mod/k8s.io/klog/v2.0/klog.go:1426
k8s.io/client-go/rest.WarningLogger.HandleWarningHeader(0x12b, 0xc000392d84, 0x1, 0xc000990280, 0x79)
    /go/pkg/mod/k8s.io/client-go.2/rest/warnings.go:70 +0xfe
k8s.io/client-go/rest.handleWarnings(0xc00017e060, 0x1bd71c0, 0x269de80, 0xc00017e0c0, 0x1bd7120, 0xc0003ec030)
    /go/pkg/mod/k8s.io/client-go.2/rest/warnings.go:144 +0x16d
k8s.io/client-go/rest.(*Request).transformResponse(0xc000510500, 0xc00080afc0, 0xc000510700, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, ...)
    /go/pkg/mod/k8s.io/client-go.2/rest/request.go:1148 +0x178
k8s.io/client-go/rest.(*Request).Do.func1(0xc000510700, 0xc00080afc0)
    /go/pkg/mod/k8s.io/client-go.2/rest/request.go:1039 +0x65
k8s.io/client-go/rest.(*Request).request.func2.1(...)
    /go/pkg/mod/k8s.io/client-go.2/rest/request.go:996
k8s.io/client-go/rest.(*Request).request.func2(0xc00080afc0, 0xc0009bf388, 0xc000510500, 0xc000510700, 0x0, 0x0, 0xc0009bf290, 0xc0009bf360, 0x0)
    /go/pkg/mod/k8s.io/client-go.2/rest/request.go:1021 +0x175
k8s.io/client-go/rest.(*Request).request(0xc000510500, 0x1c07a70, 0xc000040058, 0xc0009bf388, 0x0, 0x0)
    /go/pkg/mod/k8s.io/client-go.2/rest/request.go:1023 +0x55d
k8s.io/client-go/rest.(*Request).Do(0xc000510500, 0x1c07a70, 0xc000040058, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, ...)
    /go/pkg/mod/k8s.io/client-go.2/rest/request.go:1038 +0xf1
github.com/kubernetes-csi/external-snapshotter/client/v4/clientset/versioned/typed/volumesnapshot/v1beta1.(*volumeSnapshotContents).Get(0xc000c63f20, 0x1c07a70, 0xc000040058, 0xc00128a720, 0x30, 0x0, 0x0, 0x0, 0x0, 0x0, ...)
    /go/pkg/mod/github.com/kubernetes-csi/external-snapshotter/client/v4.0/clientset/versioned/typed/volumesnapshot/v1beta1/volumesnapshotcontent.go:72 +0x193
github.com/vmware-tanzu/velero-plugin-for-csi/internal/util.GetVolumeSnapshotContentForVolumeSnapshot(0xc00022c3c0, 0x1c07c30, 0xc00032b6b0, 0x1c23f70, 0xc0001e9d50, 0x0, 0x199, 0x0, 0x0)
    /go/src/velero-plugin-for-csi/internal/util/util.go:165 +0x47b
github.com/vmware-tanzu/velero-plugin-for-csi/internal/backup.(*VolumeSnapshotBackupItemAction).Execute(0xc0001580a0, 0x1c15448, 0xc0003420b8, 0xc0009ce280, 0xc0009ce280, 0x0, 0x0, 0xc000a3e2a0, 0xc000a3e2a8, 0x0, ...)
    /go/src/velero-plugin-for-csi/internal/backup/volumesnapshot_action.go:92 +0x519
github.com/vmware-tanzu/velero/pkg/plugin/framework.(*BackupItemActionGRPCServer).Execute(0xc00000e068, 0x1c07ae0, 0xc00078a930, 0xc0007c8780, 0x0, 0x0, 0x0)
    /go/pkg/mod/github.com/vmware-tanzu/velero.0/pkg/plugin/framework/backup_item_action_server.go:101 +0x425
github.com/vmware-tanzu/velero/pkg/plugin/generated._BackupItemAction_Execute_Handler(0x17d6820, 0xc00000e068, 0x1c07ae0, 0xc00078a930, 0xc000edc4e0, 0x0, 0x1c07ae0, 0xc00078a930, 0xc0009d6000, 0x1065)
    /go/pkg/mod/github.com/vmware-tanzu/velero.0/pkg/plugin/generated/BackupItemAction.pb.go:258 +0x214
google.golang.org/grpc.(*Server).processUnaryRPC(0xc000398380, 0x1c18798, 0xc000582a80, 0xc000510400, 0xc00038ca20, 0x264f198, 0x0, 0x0, 0x0)
    /go/pkg/mod/google.golang.org/grpc.0/server.go:1180 +0x52b
google.golang.org/grpc.(*Server).handleStream(0xc000398380, 0x1c18798, 0xc000582a80, 0xc000510400, 0x0)
    /go/pkg/mod/google.golang.org/grpc.0/server.go:1503 +0xd0c
google.golang.org/grpc.(*Server).serveStreams.func1.2(0xc000790070, 0xc000398380, 0x1c18798, 0xc000582a80, 0xc000510400)
    /go/pkg/mod/google.golang.org/grpc.0/server.go:843 +0xab
created by google.golang.org/grpc.(*Server).serveStreams.func1
    /go/pkg/mod/google.golang.org/grpc.0/server.go:841 +0x1fd
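As a sketch of the rest.WarningHandler workaround mentioned above (my own illustration, not the actual velero-plugin-for-csi code, and the type and function names here are hypothetical): instead of letting client-go hand deprecation warnings to klog, which writes them into the undrained stderr pipe, the plugin could route them through its structured logger, or drop them entirely with rest.NoWarnings{}.

// Hypothetical example of wiring a custom warning handler into a rest.Config.
package backup

import (
	"github.com/sirupsen/logrus"
	"k8s.io/client-go/rest"
)

// logrusWarningHandler forwards API server warning headers to a logrus logger
// instead of klog/stderr. The type name is mine, for illustration only.
type logrusWarningHandler struct {
	log logrus.FieldLogger
}

func (h logrusWarningHandler) HandleWarningHeader(code int, agent string, text string) {
	// Only HTTP 299 warning headers with non-empty text are meaningful.
	if code != 299 || len(text) == 0 {
		return
	}
	h.log.Warn(text)
}

// newSnapshotClientConfig (hypothetical helper) returns a copy of cfg whose
// warnings go to the given logger rather than to stderr.
func newSnapshotClientConfig(cfg *rest.Config, log logrus.FieldLogger) *rest.Config {
	cfg = rest.CopyConfig(cfg)
	// Option 1: keep the warnings, but send them somewhere safe.
	cfg.WarningHandler = logrusWarningHandler{log: log}
	// Option 2 (alternative): silence them completely.
	// cfg.WarningHandler = rest.NoWarnings{}
	return cfg
}

Moving the plugin's snapshot client to snapshot.storage.k8s.io/v1 would avoid these particular deprecation warnings altogether; a handler like the one above only keeps whatever warnings do arrive out of the blocked stderr pipe, and does not fix the underlying go-plugin issue.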
@Yip, are you facing any issues with ODF here? If yes, please attach the ODF must-gather logs. As you mentioned, the issue is in the OADP layer; this needs to be moved to OADP, since OADP is a different component. @Mudit, can you please help here?
Yip, can you please respond to https://bugzilla.redhat.com/show_bug.cgi?id=2083353#c3 ?
@Madhu/@Mudit, I am having issues with the OADP layer, which uses Velero under the covers. I filed the same information with Velero; they have reproduced the issue on their end and have slated fixes for Velero v1.9.0. I am asking here what the plans are for OADP to include the upstream fixes and when they will be available. I am not able to select OADP as the component here, hence "unclassified". I would appreciate it if this Bugzilla could be redirected or replicated to the OADP bug-tracking system for tracking purposes.
OADP tracks bugs in Jira; I have opened a Jira bug here: https://issues.redhat.com/browse/OADP-533. I am closing this one.