Bug 2083353 - OADP Backup via Ceph CSI snapshot hangs indefinitely on OpenShift v4.10
Keywords:
Status: CLOSED NOTABUG
Alias: None
Product: Red Hat OpenShift Data Foundation
Classification: Red Hat Storage
Component: unclassified
Version: 4.10
Hardware: x86_64
OS: Linux
Priority: unspecified
Severity: urgent
Target Milestone: ---
Target Release: ---
Assignee: Mudit Agarwal
QA Contact: Elad
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2022-05-09 18:35 UTC by Yip Ng
Modified: 2023-08-09 17:03 UTC
CC List: 8 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2022-05-25 04:29:55 UTC
Embargoed:



Description Yip Ng 2022-05-09 18:35:18 UTC
Description of problem (please be as detailed as possible and provide log
snippets):

We have encountered consistent hanging of CSI backups via Ceph on OpenShift v4.10.3. We successfully took 5 backups with Ceph CSI snapshots for 34+ PVCs via OADP v1.0.2, but the next backup always hangs indefinitely in the velero-plugin-for-csi plugin. There are no errors in the velero log and the backup is stuck in the InProgress state. The problem is consistent and reproducible on another OpenShift v4.10 cluster. (Kubernetes Version: v1.23.3+e419edf)

The last log messages from that plugin in the velero log were:

time="2022-04-16T00:15:21Z" level=info msg="Executing custom action" backup=oadp/ckpt-backup-1f logSource="pkg/backup/item_backupper.go:327" name=velero-zookeeper-data-zookeeper-0-2pjzv namespace=wkc resource=volumesnapshots.snapshot.storage.k8s.io
time="2022-04-16T00:15:21Z" level=info msg="Executing VolumeSnapshotBackupItemAction" backup=oadp/ckpt-backup-1f cmd=/plugins/velero-plugin-for-csi logSource="/remote-source/app/internal/backup/volumesnapshot_action.go:58" pluginName=velero-plugin-for-csi
time="2022-04-16T00:15:21Z" level=info msg="Getting VolumesnapshotContent for Volumesnapshot wkc/velero-zookeeper-data-zookeeper-0-2pjzv" backup=oadp/ckpt-backup-1f cmd=/plugins/velero-plugin-for-csi logSource="/remote-source/app/internal/backup/volumesnapshot_action.go:91" pluginName=velero-plugin-for-csi


Version of all relevant components (if applicable):
OADP version: v1.0.2
Velero version: v1.7.1
velero-plugin-for-csi version: 0.2.0

oc version
Client Version: 4.10.3
Server Version: 4.10.3
Kubernetes Version: v1.23.3+e419edf

Does this issue impact your ability to continue to work with the product
(please explain in detail what is the user impact)?
Unable to perform further backups

Is there any workaround available to the best of your knowledge?
See explanation below:

I debugged this further and got a stack dump of the velero-plugin-for-csi gRPC server; the goroutine in question was in an IO wait state (shown below). It was writing deprecation warnings from client-go via klog.Warning(), which normally writes to stderr. The warnings were about VolumeSnapshot and VolumeSnapshotContent being deprecated. (These warning messages were going into a pipe; see the explanation below.)

W0422 22:59:43.573612  66770 warnings.go:70] snapshot.storage.k8s.io/v1beta1 VolumeSnapshot is deprecated; use snapshot.storage.k8s.io/v1 VolumeSnapshot
W0422 22:55:37.607548  66737 warnings.go:70] snapshot.storage.k8s.io/v1beta1 VolumeSnapshotContent is deprecated; use snapshot.storage.k8s.io/v1 VolumeSnapshotContent

It seemed strange at first that the goroutine was blocked writing to stderr. After digging deeper into the go-plugin code (my understanding is that velero uses go-plugin v1.0.1 as of this writing), it turns out that the go-plugin framework redirects both os.Stdout and os.Stderr to a pipe in its Serve() function:

https://github.com/hashicorp/go-plugin/blob/9e3e1c37db188a1acb66561ee0ed4bf4d5e77554/server.go#L350
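
To illustrate the failure mode, here is a small standalone sketch I put together (not code from velero or go-plugin): once os.Stderr points at a pipe whose read end is never drained, writes block as soon as the kernel pipe buffer (typically 64 KiB on Linux) fills up, which is exactly the IO wait seen in the stack dump below.

// Standalone sketch (not velero/go-plugin code): writes to a pipe that nobody
// reads block forever once the kernel pipe buffer (typically 64 KiB on Linux)
// is full.
package main

import (
	"fmt"
	"os"
)

func main() {
	r, w, err := os.Pipe()
	if err != nil {
		panic(err)
	}
	// Keep the read end open but never read from it, which mirrors
	// go-plugin v1.0.1 not consuming the plugin's redirected stderr.
	defer r.Close()

	// What go-plugin's Serve() effectively does to the plugin process.
	os.Stderr = w

	msg := []byte("W0422 22:59:43 warnings.go:70] snapshot.storage.k8s.io/v1beta1 VolumeSnapshot is deprecated\n")
	for i := 0; ; i++ {
		// Once roughly 64 KiB of warnings have accumulated, this Write
		// parks the goroutine in IO wait and never returns.
		if _, err := os.Stderr.Write(msg); err != nil {
			fmt.Println("write failed:", err)
			return
		}
		fmt.Println("wrote warning", i)
	}
}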

However, in go-plugin v1.0.1 I don't see any code that consumes from those pipes, which explains why the plugin eventually blocked: the accumulated warning messages filled the pipe to its maximum capacity. I went through the go-plugin GitHub issues and found this comment, which confirms my understanding:

hashicorp/go-plugin#147 (comment)

It appears this is a known issue that was addressed in go-plugin releases after v1.0.1.

This issue can easily occur when multiple CSI backups are created over time; eventually a backup gets stuck and no further backups can be performed on newer Kubernetes versions that emit these deprecation warnings. Updating the CSI plugin to use snapshot.storage.k8s.io/v1, or installing a custom rest.WarningHandler, could work around it (a sketch of the latter follows), but this nevertheless looks like a general Velero plugin issue: plugin processing can hang for the reasons above, and it needs to be addressed at the plugin framework level.
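
For illustration, here is a minimal sketch of the rest.WarningHandler workaround (this is not the actual plugin code, the package and function names are hypothetical, and silencing the warnings is only a mitigation, not a fix for the framework issue): client-go allows the warning handler to be replaced, so deprecation warnings are no longer routed through klog to the redirected stderr.

// Sketch only: build the snapshot client with client-go's warning handler
// replaced, so API deprecation warnings never reach the plugin's stderr.
package backuputil

import (
	snapshotclient "github.com/kubernetes-csi/external-snapshotter/client/v4/clientset/versioned"
	"k8s.io/client-go/rest"
)

func newSnapshotClient(config *rest.Config) (snapshotclient.Interface, error) {
	// Option 1: drop warnings for clients built from this config only.
	config.WarningHandler = rest.NoWarnings{}

	// Option 2 (process-wide): replace the default handler for every
	// client-go client in the process.
	// rest.SetDefaultWarningHandler(rest.NoWarnings{})

	return snapshotclient.NewForConfig(config)
}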

I have reported this to Velero at https://github.com/vmware-tanzu/velero/issues/4863, and they have targeted a fix for v1.9.0. What is the plan for OADP?


Rate from 1 - 5 the complexity of the scenario you performed that caused this
bug (1 - very simple, 5 - very complex)?
2

Is this issue reproducible?
Yes

Can this issue be reproduced from the UI?


If this is a regression, please provide more details to justify this:


Steps to Reproduce:
1.  Perform multiple CSI snapshot backups using OADP on OpenShift 4.10
2.  Eventually the backup will hang


Actual results:
Backup hangs indefinitely

Expected results:
The backup should complete or return an error instead of hanging.

Additional info:

goroutine 1154 [IO wait, 34 minutes]:
internal/poll.runtime_pollWait(0x7fc4a63fc2a0, 0x77, 0xffffffffffffffff)
    /usr/local/go/src/runtime/netpoll.go:227 +0x55
internal/poll.(*pollDesc).wait(0xc000246cd8, 0x77, 0x1, 0xf5, 0xffffffffffffffff)
    /usr/local/go/src/internal/poll/fd_poll_runtime.go:87 +0x45
internal/poll.(*pollDesc).waitWrite(...)
    /usr/local/go/src/internal/poll/fd_poll_runtime.go:96
internal/poll.(*FD).Write(0xc000246cc0, 0xc000878000, 0xa8, 0xf5, 0x0, 0x0, 0x0)
    /usr/local/go/src/internal/poll/fd_unix.go:282 +0x325
os.(*File).write(...)
    /usr/local/go/src/os/file_posix.go:48
os.(*File).Write(0xc000342038, 0xc000878000, 0xa8, 0xf5, 0xc000936a00, 0x4, 0xc0009beb90)
    /usr/local/go/src/os/file.go:174 +0x8e
k8s.io/klog/v2.(*loggingT).output(0x266eaa0, 0xc000000001, 0x0, 0x0, 0xc000876000, 0x1, 0x1f1a7d4, 0xb, 0x46, 0x40e000)
    /go/pkg/mod/k8s.io/klog/v2.0/klog.go:924 +0x905
k8s.io/klog/v2.(*loggingT).printDepth(0x266eaa0, 0xc000000001, 0x0, 0x0, 0x0, 0x0, 0x1, 0xc0003ec050, 0x1, 0x1)
    /go/pkg/mod/k8s.io/klog/v2.0/klog.go:735 +0x185
k8s.io/klog/v2.(*loggingT).print(...)
    /go/pkg/mod/k8s.io/klog/v2.0/klog.go:717
k8s.io/klog/v2.Warning(...)
    /go/pkg/mod/k8s.io/klog/v2.0/klog.go:1426
k8s.io/client-go/rest.WarningLogger.HandleWarningHeader(0x12b, 0xc000392d84, 0x1, 0xc000990280, 0x79)
    /go/pkg/mod/k8s.io/client-go.2/rest/warnings.go:70 +0xfe
k8s.io/client-go/rest.handleWarnings(0xc00017e060, 0x1bd71c0, 0x269de80, 0xc00017e0c0, 0x1bd7120, 0xc0003ec030)
    /go/pkg/mod/k8s.io/client-go.2/rest/warnings.go:144 +0x16d
k8s.io/client-go/rest.(*Request).transformResponse(0xc000510500, 0xc00080afc0, 0xc000510700, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, ...)
    /go/pkg/mod/k8s.io/client-go.2/rest/request.go:1148 +0x178
k8s.io/client-go/rest.(*Request).Do.func1(0xc000510700, 0xc00080afc0)
    /go/pkg/mod/k8s.io/client-go.2/rest/request.go:1039 +0x65
k8s.io/client-go/rest.(*Request).request.func2.1(...)
    /go/pkg/mod/k8s.io/client-go.2/rest/request.go:996
k8s.io/client-go/rest.(*Request).request.func2(0xc00080afc0, 0xc0009bf388, 0xc000510500, 0xc000510700, 0x0, 0x0, 0xc0009bf290, 0xc0009bf360, 0x0)
    /go/pkg/mod/k8s.io/client-go.2/rest/request.go:1021 +0x175
k8s.io/client-go/rest.(*Request).request(0xc000510500, 0x1c07a70, 0xc000040058, 0xc0009bf388, 0x0, 0x0)
    /go/pkg/mod/k8s.io/client-go.2/rest/request.go:1023 +0x55d
k8s.io/client-go/rest.(*Request).Do(0xc000510500, 0x1c07a70, 0xc000040058, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, ...)
    /go/pkg/mod/k8s.io/client-go.2/rest/request.go:1038 +0xf1
github.com/kubernetes-csi/external-snapshotter/client/v4/clientset/versioned/typed/volumesnapshot/v1beta1.(*volumeSnapshotContents).Get(0xc000c63f20, 0x1c07a70, 0xc000040058, 0xc00128a720, 0x30, 0x0, 0x0, 0x0, 0x0, 0x0, ...)
    /go/pkg/mod/github.com/kubernetes-csi/external-snapshotter/client/v4.0/clientset/versioned/typed/volumesnapshot/v1beta1/volumesnapshotcontent.go:72 +0x193
github.com/vmware-tanzu/velero-plugin-for-csi/internal/util.GetVolumeSnapshotContentForVolumeSnapshot(0xc00022c3c0, 0x1c07c30, 0xc00032b6b0, 0x1c23f70, 0xc0001e9d50, 0x0, 0x199, 0x0, 0x0)
    /go/src/velero-plugin-for-csi/internal/util/util.go:165 +0x47b
github.com/vmware-tanzu/velero-plugin-for-csi/internal/backup.(*VolumeSnapshotBackupItemAction).Execute(0xc0001580a0, 0x1c15448, 0xc0003420b8, 0xc0009ce280, 0xc0009ce280, 0x0, 0x0, 0xc000a3e2a0, 0xc000a3e2a8, 0x0, ...)
    /go/src/velero-plugin-for-csi/internal/backup/volumesnapshot_action.go:92 +0x519
github.com/vmware-tanzu/velero/pkg/plugin/framework.(*BackupItemActionGRPCServer).Execute(0xc00000e068, 0x1c07ae0, 0xc00078a930, 0xc0007c8780, 0x0, 0x0, 0x0)
    /go/pkg/mod/github.com/vmware-tanzu/velero.0/pkg/plugin/framework/backup_item_action_server.go:101 +0x425
github.com/vmware-tanzu/velero/pkg/plugin/generated._BackupItemAction_Execute_Handler(0x17d6820, 0xc00000e068, 0x1c07ae0, 0xc00078a930, 0xc000edc4e0, 0x0, 0x1c07ae0, 0xc00078a930, 0xc0009d6000, 0x1065)
    /go/pkg/mod/github.com/vmware-tanzu/velero.0/pkg/plugin/generated/BackupItemAction.pb.go:258 +0x214
google.golang.org/grpc.(*Server).processUnaryRPC(0xc000398380, 0x1c18798, 0xc000582a80, 0xc000510400, 0xc00038ca20, 0x264f198, 0x0, 0x0, 0x0)
    /go/pkg/mod/google.golang.org/grpc.0/server.go:1180 +0x52b
google.golang.org/grpc.(*Server).handleStream(0xc000398380, 0x1c18798, 0xc000582a80, 0xc000510400, 0x0)
    /go/pkg/mod/google.golang.org/grpc.0/server.go:1503 +0xd0c
google.golang.org/grpc.(*Server).serveStreams.func1.2(0xc000790070, 0xc000398380, 0x1c18798, 0xc000582a80, 0xc000510400)
    /go/pkg/mod/google.golang.org/grpc.0/server.go:843 +0xab
created by google.golang.org/grpc.(*Server).serveStreams.func1
    /go/pkg/mod/google.golang.org/grpc.0/server.go:841 +0x1fd
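
(Aside: one way to capture a goroutine dump like the one above from a running Go process, shown here as an illustrative sketch only, is to register a signal handler that writes the goroutine profile with runtime/pprof.)

// Illustrative sketch: dump all goroutine stacks (debug=2, same format as the
// trace above) to a file when the process receives SIGUSR1.
package main

import (
	"os"
	"os/signal"
	"runtime/pprof"
	"syscall"
)

func main() {
	sigs := make(chan os.Signal, 1)
	signal.Notify(sigs, syscall.SIGUSR1)
	go func() {
		for range sigs {
			// Write to a file rather than stderr, since in the plugin's
			// case stderr is the unread go-plugin pipe.
			if f, err := os.Create("/tmp/goroutines.txt"); err == nil {
				pprof.Lookup("goroutine").WriteTo(f, 2)
				f.Close()
			}
		}
	}()

	// ... the rest of the program would run here ...
	select {}
}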

Comment 3 Madhu Rajanna 2022-05-10 05:03:28 UTC
@Yip, are you facing any issues with ODF here? If yes, please attach the ODF must-gather logs. As you mentioned, the issue is in the OADP layer; this needs to be moved to OADP, as OADP is a different component. @Mudit, can you please help here?

Comment 4 Mudit Agarwal 2022-05-24 05:40:18 UTC
Yip, can you please respond to https://bugzilla.redhat.com/show_bug.cgi?id=2083353#c3

Comment 5 Yip Ng 2022-05-24 06:45:23 UTC
@Madhu/@Mudit I am having issues with the OADP layer, which uses Velero under the covers. I filed the same information with Velero; they have reproduced the issue on their end and have slated fixes for Velero v1.9.0. I am asking here what the plans are for OADP to include the upstream fixes and when they will be available. I am not able to select OADP as the component here; hence, unclassified. I would appreciate it if this Bugzilla could be directed and replicated to the OADP bug tracking system for tracking purposes.

Comment 6 Madhu Rajanna 2022-05-25 04:29:55 UTC
OADP tracks bugs in Jira; I have opened a Jira bug here: https://issues.redhat.com/browse/OADP-533. I am closing this one.

