Bug 1870195
| Summary: | [sig-apps] StatefulSet [k8s.io] Basic StatefulSet functionality [StatefulSetBasic] should not deadlock when a pod's predecessor fails | | |
|---|---|---|---|
| Product: | OpenShift Container Platform | Reporter: | David Eads <deads> |
| Component: | Storage | Assignee: | Benny Zlotnik <bzlotnik> |
| Storage sub component: | oVirt CSI Driver | QA Contact: | Guilherme Santos <gdeolive> |
| Status: | CLOSED ERRATA | Docs Contact: | |
| Severity: | urgent | | |
| Priority: | urgent | CC: | aos-bugs, bzlotnik, gzaidman, jsafrane, tsmetana |
| Version: | 4.6 | | |
| Target Milestone: | --- | | |
| Target Release: | 4.6.0 | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | [sig-apps] StatefulSet [k8s.io] Basic StatefulSet functionality [StatefulSetBasic] should not deadlock when a pod's predecessor fails |
| Last Closed: | 2020-10-27 16:29:12 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | Category: | --- | |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
Description
David Eads
2020-08-19 14:03:02 UTC
No action for two days; this is failing the majority of the oVirt CI runs.

Looking at https://testgrid.k8s.io/redhat-openshift-ocp-release-4.6-informing#release-openshift-ocp-installer-e2e-ovirt-4.6, the first failed test is:

[sig-apps] StatefulSet [k8s.io] Basic StatefulSet functionality [StatefulSetBasic] should adopt matching orphans and release non-matching pods [Suite:openshift/conformance/parallel] [Suite:k8s]

The test creates StatefulSet e2e-statefulset-9197/ss, which creates PVC e2e-statefulset-9197/datadir-ss-0, for which PV pvc-114509b2-532f-4200-bcff-eef1ca358128 is provisioned. From the test events:

Warning FailedMount 2m40s (x11 over 8m54s) kubelet, ovirt11-t77z5-worker-0-r5952 MountVolume.MountDevice failed for volume "pvc-114509b2-532f-4200-bcff-eef1ca358128" : rpc error: code = Unknown desc = exit status 1 mkfs failed with exit status 1

Driver logs on the node (https://storage.googleapis.com/origin-ci-test/logs/release-openshift-ocp-installer-e2e-ovirt-4.6/1295983476473860096/artifacts/e2e-ovirt/pods/openshift-cluster-csi-drivers_ovirt-csi-driver-node-c9ch7_csi-driver.log) do not show much more:

I0819 08:09:41.320324 1 node.go:41] Staging volume 8e2c54d6-b312-4e3b-b510-8d34048ec98d with volume_id:"8e2c54d6-b312-4e3b-b510-8d34048ec98d" staging_target_path:"/var/lib/kubelet/plugins/kubernetes.io/csi/pv/pvc-114509b2-532f-4200-bcff-eef1ca358128/globalmount" volume_capability:<mount:<fs_type:"ext4" > access_mode:<mode:SINGLE_NODE_WRITER > > volume_context:<key:"storage.kubernetes.io/csiProvisionerIdentity" value:"1597823051721-8081-csi.ovirt.org" >
I0819 08:09:41.576150 1 node.go:160] Extracting pvc volume name 8e2c54d6-b312-4e3b-b510-8d34048ec98d
I0819 08:09:41.634723 1 node.go:166] Extracted disk ID from PVC 8e2c54d6-b312-4e3b-b510-8d34048ec98d
I0819 08:09:41.634837 1 node.go:186] Device path /dev/disk/by-id/scsi-0QEMU_QEMU_HARDDISK_8e2c54d6-b312-4e3b-b exists
I0819 08:09:41.634894 1 node.go:203] lsblk -nro FSTYPE /dev/sdb
I0819 08:09:41.644493 1 node.go:67] Creating FS ext4 on device /dev/disk/by-id/scsi-0QEMU_QEMU_HARDDISK_8e2c54d6-b312-4e3b-b
I0819 08:09:41.644524 1 node.go:223] Mounting device /dev/disk/by-id/scsi-0QEMU_QEMU_HARDDISK_8e2c54d6-b312-4e3b-b, with FS ext4
E0819 08:09:41.650098 1 node.go:70] Could not create filesystem ext4 on /dev/disk/by-id/scsi-0QEMU_QEMU_HARDDISK_8e2c54d6-b312-4e3b-b
E0819 08:09:41.650137 1 server.go:125] /csi.v1.Node/NodeStageVolume returned with error: exit status 1 mkfs failed with exit status 1

Why did mkfs fail? It seems mkfs failed because the disk was not readable (yet?):
Aug 19 08:13:32.630195 ovirt11-t77z5-worker-0-cvbfv hyperkube[1609]: E0819 08:13:32.630171 1609 nestedpendingoperations.go:301] Operation for "{volumeName:kubernetes.io/csi/csi.ovirt.org^205d55a0-1b77-4082-8e61-3cace5b07d95 podName: nodeName:}" failed. No retries permitted until 2020-08-19 08:13:33.13014761 +0000 UTC m=+1953.218822405 (durationBeforeRetry 500ms). Error: "MountVolume.MountDevice failed for volume \"pvc-4592db77-1d7f-413b-b993-c0ad08d11192\" (UniqueName: \"kubernetes.io/csi/csi.ovirt.org^205d55a0-1b77-4082-8e61-3cace5b07d95\") pod \"ss-0\" (UID: \"4a524db4-9fb6-4802-8496-69a1dd77163f\") : rpc error: code = Unknown desc = exit status 1 mkfs failed with exit status 1"
Aug 19 08:13:32.630261 ovirt11-t77z5-worker-0-cvbfv hyperkube[1609]: I0819 08:13:32.630234 1609 event.go:291] "Event occurred" object="e2e-statefulset-7400/ss-0" kind="Pod" apiVersion="v1" type="Warning" reason="FailedMount" message="MountVolume.MountDevice failed for volume \"pvc-4592db77-1d7f-413b-b993-c0ad08d11192\" : rpc error: code = Unknown desc = exit status 1 mkfs failed with exit status 1"
Aug 19 08:13:32.639792 ovirt11-t77z5-worker-0-cvbfv kernel: Dev sdc: unable to read RDB block 8
Aug 19 08:13:32.639914 ovirt11-t77z5-worker-0-cvbfv kernel: sdc: unable to read partition table
Aug 19 08:13:32.639938 ovirt11-t77z5-worker-0-cvbfv kernel: sdc: partition table beyond EOD, truncated
We started looking at this yesterday. The issue, it seems, is that the disk is created with only 4k bytes instead of 1 MiB, leaving insufficient space for mkfs. I think the problem is that the "1" is passed through as-is and not converted to bytes. I am not sure where exactly the issue is, because manually creating a 1Mi PVC for the same storage class worked fine when I tried it. The issue can be reproduced manually by creating a PVC with size 1 instead of 1Mi:
kind: PersistentVolumeClaim
apiVersion: v1
metadata:
  name: 1m-ovirt-cow-disk
  annotations:
    volume.beta.kubernetes.io/storage-class: ovirt-csi-sc
spec:
  storageClassName: ovirt-csi-sc
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 1
This will make CreateVolumeRequest.CapacityRange#GetRequiredBytes return 1 (a single byte), not 1Mi expressed in bytes (1048576).
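A plausible shape for the fix (a sketch only; the helper name is hypothetical and this is not the driver's actual code) is to round the raw byte count from GetRequiredBytes up to the backend's minimum allocation unit instead of passing it through unchanged:

```go
package main

import "fmt"

const mib int64 = 1 << 20 // 1 MiB in bytes

// roundUpToMiB illustrates the conversion the bug is missing:
// GetRequiredBytes() returns raw bytes (literally 1 for a PVC
// requesting "storage: 1"), and the provisioned disk size must be
// rounded up to the next MiB boundary, never used as-is.
func roundUpToMiB(requiredBytes int64) int64 {
	if requiredBytes <= 0 {
		return mib
	}
	return (requiredBytes + mib - 1) / mib * mib
}

func main() {
	fmt.Println(roundUpToMiB(1))       // 1048576
	fmt.Println(roundUpToMiB(mib))     // 1048576
	fmt.Println(roundUpToMiB(mib + 1)) // 2097152
}
```

Rounding this way is consistent with the verified behavior below, where a PVC requesting `storage: 1` ends up Bound with a 1Mi capacity.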
Hi Benny, a couple of questions here:
1. Are we sure that this bug is related to the installation process? From afar it seems to me like a storage-related operation. If that's indeed the case, please change the Component.
2. Could you please translate for us what this issue's symptom was before the fix?
3. What are the verification steps here?
Thank you!

(In reply to Jan Zmeskal from comment #8)
> 1. Are we sure that this bug is related to the installation process?
Yes, will change.
> 2. Could you please translate for us what this issue's symptom was before the fix?
mkfs would fail, see comment #2 and comment #3.
> 3. What are the verification steps here?
1. Create the following PVC (comment #7), with storage 1 instead of 1Gi:

    kind: PersistentVolumeClaim
    apiVersion: v1
    metadata:
      name: 1m-ovirt-cow-disk
      annotations:
        volume.beta.kubernetes.io/storage-class: ovirt-csi-sc
    spec:
      storageClassName: ovirt-csi-sc
      accessModes:
        - ReadWriteOnce
      resources:
        requests:
          storage: 1

2. Try to start a pod using this PVC.

Verified on: openshift-4.6.0-0.nightly-2020-09-28-212756
Steps:
1. Created a PVC with a storage size of 1.
2. $ oc get -n openshift-cluster-csi-drivers pvc
   NAME               STATUS   VOLUME                                     CAPACITY   ACCESS MODES   STORAGECLASS   AGE
   1-ovirt-cow-disk   Bound    pvc-bc27de69-b211-405a-ae38-16d95fa87156   1Mi        RWO            ovirt-csi-sc   28s
3. Created a pod that uses the created PVC.
4. $ oc -n openshift-cluster-csi-drivers get pods
   ...
   testpodwithcsi   1/1   Running   0   2m37s
   ...
5. $ oc -n openshift-cluster-csi-drivers logs ovirt-csi-driver-controller-77cbc597c8-2ppr4 csi-driver
   ...
   I1001 15:14:00.169218 1 controller.go:36] Creating disk pvc-bc27de69-b211-405a-ae38-16d95fa87156
   ...
Results: the PVC is created with 1Mi and the disk is created successfully.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.
For information on the advisory (OpenShift Container Platform 4.6 GA Images), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:4196