Note: This bug is displayed in read-only format because the product is no longer active in Red Hat Bugzilla.

Bug 1870195

Summary: [sig-apps] StatefulSet [k8s.io] Basic StatefulSet functionality [StatefulSetBasic] should not deadlock when a pod's predecessor fails
Product: OpenShift Container Platform Reporter: David Eads <deads>
Component: Storage    Assignee: Benny Zlotnik <bzlotnik>
Storage sub component: oVirt CSI Driver QA Contact: Guilherme Santos <gdeolive>
Status: CLOSED ERRATA Docs Contact:
Severity: urgent    
Priority: urgent CC: aos-bugs, bzlotnik, gzaidman, jsafrane, tsmetana
Version: 4.6   
Target Milestone: ---   
Target Release: 4.6.0   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of: Environment:
[sig-apps] StatefulSet [k8s.io] Basic StatefulSet functionality [StatefulSetBasic] should not deadlock when a pod's predecessor fails
Last Closed: 2020-10-27 16:29:12 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:

Description David Eads 2020-08-19 14:03:02 UTC
test:
[sig-apps] StatefulSet [k8s.io] Basic StatefulSet functionality [StatefulSetBasic] should not deadlock when a pod's predecessor fails 

is failing frequently in CI, see search results:
https://search.ci.openshift.org/?maxAge=168h&context=1&type=bug%2Bjunit&name=&maxMatches=5&maxBytes=20971520&groupBy=job&search=%5C%5Bsig-apps%5C%5D+StatefulSet+%5C%5Bk8s%5C.io%5C%5D+Basic+StatefulSet+functionality+%5C%5BStatefulSetBasic%5C%5D+should+not+deadlock+when+a+pod%27s+predecessor+fails


On oVirt clusters, this set of tests fails consistently.

A specific failing job: https://prow.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-ovirt-4.6/1295983476473860096

Here is a link to the ovirt job: https://testgrid.k8s.io/redhat-openshift-ocp-release-4.6-informing#release-openshift-ocp-installer-e2e-ovirt-4.6 


Tagging the tests that fail as a group, for sippy:


[sig-apps] StatefulSet [k8s.io] Basic StatefulSet functionality [StatefulSetBasic] should not deadlock when a pod's predecessor fails
[sig-apps] StatefulSet [k8s.io] Basic StatefulSet functionality [StatefulSetBasic] should perform rolling updates and roll backs of template modifications with PVCs
[sig-apps] StatefulSet [k8s.io] Basic StatefulSet functionality [StatefulSetBasic] should adopt matching orphans and release non-matching pods
[sig-apps] StatefulSet [k8s.io] Basic StatefulSet functionality [StatefulSetBasic] should provide basic identity

Comment 1 David Eads 2020-08-21 12:08:09 UTC
No action for two days; this is failing the majority of the oVirt CI runs.

Comment 2 Jan Safranek 2020-08-21 13:21:41 UTC
Looking at https://testgrid.k8s.io/redhat-openshift-ocp-release-4.6-informing#release-openshift-ocp-installer-e2e-ovirt-4.6, starting with the first failed test: [sig-apps] StatefulSet [k8s.io] Basic StatefulSet functionality [StatefulSetBasic] should adopt matching orphans and release non-matching pods [Suite:openshift/conformance/parallel] [Suite:k8s].

The test creates StatefulSet e2e-statefulset-9197/ss, which creates PVC e2e-statefulset-9197/datadir-ss-0, for which PV pvc-114509b2-532f-4200-bcff-eef1ca358128 is provisioned.

From test events:

Warning  FailedMount             2m40s (x11 over 8m54s)  kubelet, ovirt11-t77z5-worker-0-r5952  MountVolume.MountDevice failed for volume "pvc-114509b2-532f-4200-bcff-eef1ca358128" : rpc error: code = Unknown desc = exit status 1 mkfs failed with exit status 1

Driver logs on the node do not show much more:
https://storage.googleapis.com/origin-ci-test/logs/release-openshift-ocp-installer-e2e-ovirt-4.6/1295983476473860096/artifacts/e2e-ovirt/pods/openshift-cluster-csi-drivers_ovirt-csi-driver-node-c9ch7_csi-driver.log

I0819 08:09:41.320324       1 node.go:41] Staging volume 8e2c54d6-b312-4e3b-b510-8d34048ec98d with volume_id:"8e2c54d6-b312-4e3b-b510-8d34048ec98d" staging_target_path:"/var/lib/kubelet/plugins/kubernetes.io/csi/pv/pvc-114509b2-532f-4200-bcff-eef1ca358128/globalmount" volume_capability:<mount:<fs_type:"ext4" > access_mode:<mode:SINGLE_NODE_WRITER > > volume_context:<key:"storage.kubernetes.io/csiProvisionerIdentity" value:"1597823051721-8081-csi.ovirt.org" > 
I0819 08:09:41.576150       1 node.go:160] Extracting pvc volume name 8e2c54d6-b312-4e3b-b510-8d34048ec98d
I0819 08:09:41.634723       1 node.go:166] Extracted disk ID from PVC 8e2c54d6-b312-4e3b-b510-8d34048ec98d
I0819 08:09:41.634837       1 node.go:186] Device path /dev/disk/by-id/scsi-0QEMU_QEMU_HARDDISK_8e2c54d6-b312-4e3b-b exists
I0819 08:09:41.634894       1 node.go:203] lsblk -nro FSTYPE /dev/sdb
I0819 08:09:41.644493       1 node.go:67] Creating FS ext4 on device /dev/disk/by-id/scsi-0QEMU_QEMU_HARDDISK_8e2c54d6-b312-4e3b-b
I0819 08:09:41.644524       1 node.go:223] Mounting device /dev/disk/by-id/scsi-0QEMU_QEMU_HARDDISK_8e2c54d6-b312-4e3b-b, with FS ext4
E0819 08:09:41.650098       1 node.go:70] Could not create filesystem ext4 on /dev/disk/by-id/scsi-0QEMU_QEMU_HARDDISK_8e2c54d6-b312-4e3b-b
E0819 08:09:41.650137       1 server.go:125] /csi.v1.Node/NodeStageVolume returned with error: exit status 1 mkfs failed with exit status 1

Why did mkfs fail?
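
For context, the staging sequence in the log (probe with lsblk, then mkfs, then mount) is a standard format-and-mount step. A minimal sketch of that step using k8s.io/mount-utils and k8s.io/utils/exec, which is not necessarily the driver's actual implementation:

// Minimal sketch of the NodeStageVolume format-and-mount step seen in the
// log above. Illustration only; the oVirt CSI driver's real code may differ.
package main

import (
	"log"

	mountutils "k8s.io/mount-utils"
	utilexec "k8s.io/utils/exec"
)

func main() {
	mounter := &mountutils.SafeFormatAndMount{
		Interface: mountutils.New(""),
		Exec:      utilexec.New(),
	}
	device := "/dev/disk/by-id/scsi-0QEMU_QEMU_HARDDISK_8e2c54d6-b312-4e3b-b"
	target := "/var/lib/kubelet/plugins/kubernetes.io/csi/pv/pvc-114509b2-532f-4200-bcff-eef1ca358128/globalmount"

	// FormatAndMount runs mkfs only when the device has no filesystem yet;
	// a failing mkfs surfaces as the "exit status 1" seen in the events.
	if err := mounter.FormatAndMount(device, target, "ext4", nil); err != nil {
		log.Fatalf("stage failed: %v", err)
	}
}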

Comment 3 Tomas Smetana 2020-08-21 14:00:03 UTC
It seems mkfs failed because the disk was not readable (yet?):

Aug 19 08:13:32.630195 ovirt11-t77z5-worker-0-cvbfv hyperkube[1609]: E0819 08:13:32.630171    1609 nestedpendingoperations.go:301] Operation for "{volumeName:kubernetes.io/csi/csi.ovirt.org^205d55a0-1b77-4082-8e61-3cace5b07d95 podName: nodeName:}" failed. No retries permitted until 2020-08-19 08:13:33.13014761 +0000 UTC m=+1953.218822405 (durationBeforeRetry 500ms). Error: "MountVolume.MountDevice failed for volume \"pvc-4592db77-1d7f-413b-b993-c0ad08d11192\" (UniqueName: \"kubernetes.io/csi/csi.ovirt.org^205d55a0-1b77-4082-8e61-3cace5b07d95\") pod \"ss-0\" (UID: \"4a524db4-9fb6-4802-8496-69a1dd77163f\") : rpc error: code = Unknown desc = exit status 1 mkfs failed with exit status 1"
Aug 19 08:13:32.630261 ovirt11-t77z5-worker-0-cvbfv hyperkube[1609]: I0819 08:13:32.630234    1609 event.go:291] "Event occurred" object="e2e-statefulset-7400/ss-0" kind="Pod" apiVersion="v1" type="Warning" reason="FailedMount" message="MountVolume.MountDevice failed for volume \"pvc-4592db77-1d7f-413b-b993-c0ad08d11192\" : rpc error: code = Unknown desc = exit status 1 mkfs failed with exit status 1"
Aug 19 08:13:32.639792 ovirt11-t77z5-worker-0-cvbfv kernel: Dev sdc: unable to read RDB block 8
Aug 19 08:13:32.639914 ovirt11-t77z5-worker-0-cvbfv kernel:  sdc: unable to read partition table
Aug 19 08:13:32.639938 ovirt11-t77z5-worker-0-cvbfv kernel: sdc: partition table beyond EOD, truncated
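
The kernel messages ("partition table beyond EOD, truncated") suggest the device is much smaller than expected, and mkfs.ext4 fails the same way on an undersized image. A quick local reproduction sketch, assuming mkfs.ext4 is installed; paths and sizes are arbitrary:

// Reproduce the mkfs failure locally on an undersized backing file.
// This only illustrates the failure mode.
package main

import (
	"fmt"
	"os"
	"os/exec"
)

func main() {
	img, err := os.CreateTemp("", "tiny-disk-*.img")
	if err != nil {
		panic(err)
	}
	defer os.Remove(img.Name())

	// A 4 KiB backing file: far too small to hold an ext4 filesystem.
	if err := img.Truncate(4096); err != nil {
		panic(err)
	}
	img.Close()

	// -F forces mke2fs to run against a regular file without prompting.
	out, err := exec.Command("mkfs.ext4", "-F", img.Name()).CombinedOutput()
	fmt.Printf("%s\nexit: %v\n", out, err) // expect a non-zero exit status
}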

Comment 4 Benny Zlotnik 2020-08-21 15:49:36 UTC
We started looking at this yesterday. It seems the disk is created with only 4K bytes instead of 1MiB, so there is insufficient space for mkfs. I think the problem is that the "1" is passed as-is and not converted to bytes. I am not sure exactly where the issue is, because manually creating a 1MiB PVC for the same storage class worked fine when I tried it.
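
The unit handling is easy to demonstrate with the Kubernetes quantity parser: a resource quantity of "1" with no suffix is exactly one byte. A small sketch using k8s.io/apimachinery/pkg/api/resource:

// Show how PVC storage requests parse into bytes. "1" with no suffix is
// one byte; "1Mi" is 1048576 bytes.
package main

import (
	"fmt"

	"k8s.io/apimachinery/pkg/api/resource"
)

func main() {
	bare := resource.MustParse("1")   // no suffix: 1 byte
	mebi := resource.MustParse("1Mi") // 1 MiB
	fmt.Println(bare.Value(), mebi.Value()) // prints: 1 1048576
}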

Comment 7 Benny Zlotnik 2020-08-23 16:16:09 UTC
The issue can be reproduced manually by creating a PVC with size 1 instead of 1Mi:
kind: PersistentVolumeClaim
apiVersion: v1
metadata:
  name: 1m-ovirt-cow-disk
  annotations:
    volume.beta.kubernetes.io/storage-class: ovirt-csi-sc
spec:
  storageClassName: ovirt-csi-sc
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 1

This makes CreateVolumeRequest.CapacityRange#GetRequiredBytes return 1, rather than 1Mi expressed in bytes (1048576).
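
A fix on the driver side is to round the requested capacity up (here, to a MiB boundary) before creating the disk. A sketch of that pattern; roundUpToMiB is an illustrative name, not the oVirt CSI driver's actual helper:

// Round a CSI CreateVolume capacity request up to a MiB boundary so that
// a 1-byte request cannot produce a disk too small for mkfs. Illustrative
// sketch only; not the driver's actual code.
package main

import "fmt"

const mib int64 = 1 << 20

func roundUpToMiB(requiredBytes int64) int64 {
	if requiredBytes < 1 {
		requiredBytes = 1
	}
	return ((requiredBytes + mib - 1) / mib) * mib
}

func main() {
	// A PVC requesting "1" arrives as GetRequiredBytes() == 1.
	fmt.Println(roundUpToMiB(1)) // 1048576 bytes, i.e. 1Mi
}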

Comment 8 Jan Zmeskal 2020-09-29 14:06:03 UTC
Hi Benny, a couple of questions here:
1. Are we sure that this bug is related to the installation process? From afar it seems to me like a storage-related operation. If that's indeed the case, please change the Component.
2. Could you please explain what this issue's symptom was before the fix?
3. What are the verification steps here?

Thank you!

Comment 9 Benny Zlotnik 2020-09-30 14:16:35 UTC
(In reply to Jan Zmeskal from comment #8)
> Hi Benny, a couple of questions here:
> 1. Are we sure that this bug is related to the installation process? From
> afar it seems to me like a storage-related operation. If that's indeed the
> case, please change the Component.
Yes, I will change it.

> 2. Could you please explain what this issue's symptom was before the fix?
mkfs would fail; see comment #2 and comment #3.

> 3. What are the verification steps here?

1. Create the following PVC (comment #7):
kind: PersistentVolumeClaim
apiVersion: v1
metadata:
  name: 1m-ovirt-cow-disk
  annotations:
    volume.beta.kubernetes.io/storage-class: ovirt-csi-sc
spec:
  storageClassName: ovirt-csi-sc
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 1

(a plain 1 with no unit suffix, instead of 1Gi)

2. Try to start a pod using this PVC.

Comment 10 Guilherme Santos 2020-10-01 15:20:04 UTC
Verified on:
openshift-4.6.0-0.nightly-2020-09-28-212756

Steps:
1. Created a PVC with storage size of 1
2. $ oc get -n openshift-cluster-csi-drivers pvc
NAME               STATUS   VOLUME                                     CAPACITY   ACCESS MODES   STORAGECLASS   AGE
1-ovirt-cow-disk   Bound    pvc-bc27de69-b211-405a-ae38-16d95fa87156   1Mi        RWO            ovirt-csi-sc   28s

3. Created a pod that uses the created PVC
4. $ oc -n openshift-cluster-csi-drivers get pods
...
testpodwithcsi                                 1/1     Running   0          2m37s
...

5. $ oc -n openshift-cluster-csi-drivers logs ovirt-csi-driver-controller-77cbc597c8-2ppr4 -c csi-driver
...
I1001 15:14:00.169218       1 controller.go:36] Creating disk pvc-bc27de69-b211-405a-ae38-16d95fa87156
...

Results:
PVC was created with 1Mi capacity and the disk was created successfully.

Comment 12 errata-xmlrpc 2020-10-27 16:29:12 UTC
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (OpenShift Container Platform 4.6 GA Images), and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:4196