test: [sig-apps] StatefulSet [k8s.io] Basic StatefulSet functionality [StatefulSetBasic] should not deadlock when a pod's predecessor fails

This test is failing frequently in CI; see search results: https://search.ci.openshift.org/?maxAge=168h&context=1&type=bug%2Bjunit&name=&maxMatches=5&maxBytes=20971520&groupBy=job&search=%5C%5Bsig-apps%5C%5D+StatefulSet+%5C%5Bk8s%5C.io%5C%5D+Basic+StatefulSet+functionality+%5C%5BStatefulSetBasic%5C%5D+should+not+deadlock+when+a+pod%27s+predecessor+fails

On oVirt cluster installs this set of tests fails consistently. A specific job: https://prow.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-ovirt-4.6/1295983476473860096

Link to the oVirt job: https://testgrid.k8s.io/redhat-openshift-ocp-release-4.6-informing#release-openshift-ocp-installer-e2e-ovirt-4.6

Tagging the tests that fail as a group, for sippy:

[sig-apps] StatefulSet [k8s.io] Basic StatefulSet functionality [StatefulSetBasic] should not deadlock when a pod's predecessor fails
[sig-apps] StatefulSet [k8s.io] Basic StatefulSet functionality [StatefulSetBasic] should perform rolling updates and roll backs of template modifications with PVCs
[sig-apps] StatefulSet [k8s.io] Basic StatefulSet functionality [StatefulSetBasic] should adopt matching orphans and release non-matching pods
[sig-apps] StatefulSet [k8s.io] Basic StatefulSet functionality [StatefulSetBasic] should provide basic identity
No action for two days; this is failing the majority of the oVirt CI runs.
Looking at https://testgrid.k8s.io/redhat-openshift-ocp-release-4.6-informing#release-openshift-ocp-installer-e2e-ovirt-4.6 and the first failure of [sig-apps] StatefulSet [k8s.io] Basic StatefulSet functionality [StatefulSetBasic] should adopt matching orphans and release non-matching pods [Suite:openshift/conformance/parallel] [Suite:k8s]:

The test creates StatefulSet e2e-statefulset-9197/ss, which creates PVC e2e-statefulset-9197/datadir-ss-0, for which PV pvc-114509b2-532f-4200-bcff-eef1ca358128 is provisioned.

From test events:

Warning  FailedMount  2m40s (x11 over 8m54s)  kubelet, ovirt11-t77z5-worker-0-r5952  MountVolume.MountDevice failed for volume "pvc-114509b2-532f-4200-bcff-eef1ca358128" : rpc error: code = Unknown desc = exit status 1 mkfs failed with exit status 1

Driver logs on the node do not show much more: https://storage.googleapis.com/origin-ci-test/logs/release-openshift-ocp-installer-e2e-ovirt-4.6/1295983476473860096/artifacts/e2e-ovirt/pods/openshift-cluster-csi-drivers_ovirt-csi-driver-node-c9ch7_csi-driver.log

I0819 08:09:41.320324 1 node.go:41] Staging volume 8e2c54d6-b312-4e3b-b510-8d34048ec98d with volume_id:"8e2c54d6-b312-4e3b-b510-8d34048ec98d" staging_target_path:"/var/lib/kubelet/plugins/kubernetes.io/csi/pv/pvc-114509b2-532f-4200-bcff-eef1ca358128/globalmount" volume_capability:<mount:<fs_type:"ext4" > access_mode:<mode:SINGLE_NODE_WRITER > > volume_context:<key:"storage.kubernetes.io/csiProvisionerIdentity" value:"1597823051721-8081-csi.ovirt.org" >
I0819 08:09:41.576150 1 node.go:160] Extracting pvc volume name 8e2c54d6-b312-4e3b-b510-8d34048ec98d
I0819 08:09:41.634723 1 node.go:166] Extracted disk ID from PVC 8e2c54d6-b312-4e3b-b510-8d34048ec98d
I0819 08:09:41.634837 1 node.go:186] Device path /dev/disk/by-id/scsi-0QEMU_QEMU_HARDDISK_8e2c54d6-b312-4e3b-b exists
I0819 08:09:41.634894 1 node.go:203] lsblk -nro FSTYPE /dev/sdb
I0819 08:09:41.644493 1 node.go:67] Creating FS ext4 on device /dev/disk/by-id/scsi-0QEMU_QEMU_HARDDISK_8e2c54d6-b312-4e3b-b
I0819 08:09:41.644524 1 node.go:223] Mounting device /dev/disk/by-id/scsi-0QEMU_QEMU_HARDDISK_8e2c54d6-b312-4e3b-b, with FS ext4
E0819 08:09:41.650098 1 node.go:70] Could not create filesystem ext4 on /dev/disk/by-id/scsi-0QEMU_QEMU_HARDDISK_8e2c54d6-b312-4e3b-b
E0819 08:09:41.650137 1 server.go:125] /csi.v1.Node/NodeStageVolume returned with error: exit status 1 mkfs failed with exit status 1

Why did mkfs fail?
Seems like mkfs failed because the disk was not readable (yet?):

Aug 19 08:13:32.630195 ovirt11-t77z5-worker-0-cvbfv hyperkube[1609]: E0819 08:13:32.630171 1609 nestedpendingoperations.go:301] Operation for "{volumeName:kubernetes.io/csi/csi.ovirt.org^205d55a0-1b77-4082-8e61-3cace5b07d95 podName: nodeName:}" failed. No retries permitted until 2020-08-19 08:13:33.13014761 +0000 UTC m=+1953.218822405 (durationBeforeRetry 500ms). Error: "MountVolume.MountDevice failed for volume \"pvc-4592db77-1d7f-413b-b993-c0ad08d11192\" (UniqueName: \"kubernetes.io/csi/csi.ovirt.org^205d55a0-1b77-4082-8e61-3cace5b07d95\") pod \"ss-0\" (UID: \"4a524db4-9fb6-4802-8496-69a1dd77163f\") : rpc error: code = Unknown desc = exit status 1 mkfs failed with exit status 1"
Aug 19 08:13:32.630261 ovirt11-t77z5-worker-0-cvbfv hyperkube[1609]: I0819 08:13:32.630234 1609 event.go:291] "Event occurred" object="e2e-statefulset-7400/ss-0" kind="Pod" apiVersion="v1" type="Warning" reason="FailedMount" message="MountVolume.MountDevice failed for volume \"pvc-4592db77-1d7f-413b-b993-c0ad08d11192\" : rpc error: code = Unknown desc = exit status 1 mkfs failed with exit status 1"
Aug 19 08:13:32.639792 ovirt11-t77z5-worker-0-cvbfv kernel: Dev sdc: unable to read RDB block 8
Aug 19 08:13:32.639914 ovirt11-t77z5-worker-0-cvbfv kernel: sdc: unable to read partition table
Aug 19 08:13:32.639938 ovirt11-t77z5-worker-0-cvbfv kernel: sdc: partition table beyond EOD, truncated
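For orientation, the node logs above follow the usual NodeStageVolume pattern: wait for the device, probe it for an existing filesystem, run mkfs if it is empty, then mount. The rough Go sketch below is a generic illustration with placeholder paths and plain exec calls, not the ovirt-csi-driver's actual code; the point is that mkfs.ext4 is the first step that actually writes to the disk, so an undersized or unreadable device fails exactly there with a bare "exit status 1".

package main

// Minimal sketch of the format-then-mount step of NodeStageVolume for a
// generic CSI node plugin. Not the ovirt-csi-driver's actual code; the
// device and target paths in main() are placeholders.

import (
	"fmt"
	"os/exec"
	"strings"
)

// stageVolume formats the device with ext4 if it has no filesystem yet,
// then mounts it at targetPath. If the block device is too small (e.g. a
// few KiB), mkfs.ext4 exits non-zero and the whole stage call fails with
// the "mkfs failed with exit status 1" error seen above.
func stageVolume(device, targetPath string) error {
	out, err := exec.Command("lsblk", "-nro", "FSTYPE", device).CombinedOutput()
	if err != nil {
		return fmt.Errorf("probing %s: %v: %s", device, err, out)
	}
	if strings.TrimSpace(string(out)) == "" { // no filesystem yet
		if out, err := exec.Command("mkfs.ext4", device).CombinedOutput(); err != nil {
			return fmt.Errorf("mkfs failed: %v: %s", err, out)
		}
	}
	if out, err := exec.Command("mount", "-t", "ext4", device, targetPath).CombinedOutput(); err != nil {
		return fmt.Errorf("mount failed: %v: %s", err, out)
	}
	return nil
}

func main() {
	// Placeholder paths for illustration only.
	if err := stageVolume("/dev/disk/by-id/scsi-0QEMU_QEMU_HARDDISK_example", "/mnt/staging"); err != nil {
		fmt.Println(err)
	}
}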
We started looking at this yesterday. The issue, it seems, is that the disk is created as only 4k bytes instead of 1MiB, so there is not enough space for mkfs. I think the problem is that the "1" is passed through as-is rather than being converted to bytes. I am not sure exactly where the issue is, because manually creating a 1MiB PVC for the same storage class worked fine when I tried it.
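For reference, Kubernetes parses the PVC storage request as a resource.Quantity, so a bare "1" really does mean one byte while "1Mi" means 1048576 bytes. A standalone snippet (not driver code) showing the difference:

package main

// Demonstrates how Kubernetes parses PVC storage requests.
// "1" is one byte; "1Mi" is 1048576 bytes. A driver that passes the
// requested byte count straight to the storage backend can end up
// asking for a disk far too small to hold an ext4 filesystem.

import (
	"fmt"

	"k8s.io/apimachinery/pkg/api/resource"
)

func main() {
	one := resource.MustParse("1")
	oneMi := resource.MustParse("1Mi")
	fmt.Println(one.Value())   // 1
	fmt.Println(oneMi.Value()) // 1048576
}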
The issue can be reproduced manually by creating a PVC with size 1 instead of 1Mi:

kind: PersistentVolumeClaim
apiVersion: v1
metadata:
  name: 1m-ovirt-cow-disk
  annotations:
    volume.beta.kubernetes.io/storage-class: ovirt-csi-sc
spec:
  storageClassName: ovirt-csi-sc
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 1

This makes CreateVolumeRequest.CapacityRange#GetRequiredBytes return 1, not 1Mi worth of bytes.
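A common way for a CSI controller plugin to guard against such requests is to round the bytes from GetRequiredBytes up to the backend's allocation granularity (and a sane minimum) before creating the disk. The sketch below only illustrates the idea; the 1 MiB unit and the provisionSize helper are assumptions for this example, not necessarily what the ovirt-csi-driver fix actually does.

package main

// Illustrative sketch: rounding a CSI CreateVolume request up to a usable
// disk size. The 1 MiB allocation unit / minimum is an assumption for this
// example, not necessarily the value used by the ovirt-csi-driver.

import "fmt"

const allocUnitBytes int64 = 1 << 20 // assumed 1 MiB granularity and minimum

// provisionSize converts the bytes reported by
// CreateVolumeRequest.CapacityRange.GetRequiredBytes() into the size that
// is actually requested from the storage backend, rounding up, never down.
func provisionSize(requiredBytes int64) int64 {
	if requiredBytes < allocUnitBytes {
		return allocUnitBytes
	}
	units := (requiredBytes + allocUnitBytes - 1) / allocUnitBytes
	return units * allocUnitBytes
}

func main() {
	fmt.Println(provisionSize(1))       // 1048576 – a "storage: 1" request becomes 1 MiB
	fmt.Println(provisionSize(1<<20+1)) // 2097152 – partial units round up to the next MiB
}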
Hi Benny, a couple of questions here:

1. Are we sure that this bug is related to the installation process? From afar it seems to me like a storage-related operation. If that's indeed the case, please change the Component.
2. Could you please describe what this issue's symptom was before the fix?
3. What are the verification steps here?

Thank you!
(In reply to Jan Zmeskal from comment #8)

> Hi Benny, a couple of questions here:
> 1. Are we sure that this bug is related to the installation process? From
> afar it seems to me like a storage-related operation. If that's indeed the
> case, please change the Component.

Yes, will change.

> 2. Could you please describe what this issue's symptom was before the
> fix?

mkfs would fail; see comment #2 and comment #3.

> 3. What are the verification steps here?

1. Create the following PVC (comment #7):

kind: PersistentVolumeClaim
apiVersion: v1
metadata:
  name: 1m-ovirt-cow-disk
  annotations:
    volume.beta.kubernetes.io/storage-class: ovirt-csi-sc
spec:
  storageClassName: ovirt-csi-sc
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 1

(storage: 1 instead of 1Gi)

2. Try to start a pod using this PVC.
Verified on: openshift-4.6.0-0.nightly-2020-09-28-212756

Steps:

1. Created a PVC with a storage size of 1.

2. $ oc get -n openshift-cluster-csi-drivers pvc
NAME               STATUS   VOLUME                                     CAPACITY   ACCESS MODES   STORAGECLASS   AGE
1-ovirt-cow-disk   Bound    pvc-bc27de69-b211-405a-ae38-16d95fa87156   1Mi        RWO            ovirt-csi-sc   28s

3. Created a pod that uses the created PVC.

4. $ oc -n openshift-cluster-csi-drivers get pods
...
testpodwithcsi 1/1 Running 0 2m37s
...

5. $ oc -n openshift-cluster-csi-drivers logs ovirt-csi-driver-controller-77cbc597c8-2ppr4 csi-driver
...
I1001 15:14:00.169218 1 controller.go:36] Creating disk pvc-bc27de69-b211-405a-ae38-16d95fa87156
...

Results: PVC created with 1MiB and disk created successfully.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (OpenShift Container Platform 4.6 GA Images), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2020:4196