Bug 1870195 - [sig-apps] StatefulSet [k8s.io] Basic StatefulSet functionality [StatefulSetBasic] should not deadlock when a pod's predecessor fails
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Storage
Version: 4.6
Hardware: Unspecified
OS: Unspecified
Priority: urgent
Severity: urgent
Target Milestone: ---
Target Release: 4.6.0
Assignee: Benny Zlotnik
QA Contact: Guilherme Santos
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2020-08-19 14:03 UTC by David Eads
Modified: 2020-10-27 16:29 UTC (History)

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
[sig-apps] StatefulSet [k8s.io] Basic StatefulSet functionality [StatefulSetBasic] should not deadlock when a pod's predecessor fails
Last Closed: 2020-10-27 16:29:12 UTC
Target Upstream Version:


Links
System | ID | Private | Priority | Status | Summary | Last Updated
Github | openshift ovirt-csi-driver pull 45 | 0 | None | closed | Bug 1870195: round provisioned size up to 1MiB | 2020-10-06 08:38:53 UTC
Red Hat Product Errata | RHBA-2020:4196 | 0 | None | None | None | 2020-10-27 16:29:27 UTC

Description David Eads 2020-08-19 14:03:02 UTC
test:
[sig-apps] StatefulSet [k8s.io] Basic StatefulSet functionality [StatefulSetBasic] should not deadlock when a pod's predecessor fails 

is failing frequently in CI, see search results:
https://search.ci.openshift.org/?maxAge=168h&context=1&type=bug%2Bjunit&name=&maxMatches=5&maxBytes=20971520&groupBy=job&search=%5C%5Bsig-apps%5C%5D+StatefulSet+%5C%5Bk8s%5C.io%5C%5D+Basic+StatefulSet+functionality+%5C%5BStatefulSetBasic%5C%5D+should+not+deadlock+when+a+pod%27s+predecessor+fails


On oVirt clusters that install successfully, this set of tests fails consistently.

https://prow.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-ovirt-4.6/1295983476473860096 is a specific job

Here is a link to the ovirt job: https://testgrid.k8s.io/redhat-openshift-ocp-release-4.6-informing#release-openshift-ocp-installer-e2e-ovirt-4.6 


Tagging tests for sippy which fail in a group


[sig-apps] StatefulSet [k8s.io] Basic StatefulSet functionality [StatefulSetBasic] should not deadlock when a pod's predecessor fails
[sig-apps] StatefulSet [k8s.io] Basic StatefulSet functionality [StatefulSetBasic] should perform rolling updates and roll backs of template modifications with PVCs
[sig-apps] StatefulSet [k8s.io] Basic StatefulSet functionality [StatefulSetBasic] should adopt matching orphans and release non-matching pods
[sig-apps] StatefulSet [k8s.io] Basic StatefulSet functionality [StatefulSetBasic] should provide basic identity

Comment 1 David Eads 2020-08-21 12:08:09 UTC
No action for two days; this is failing the majority of the oVirt CI runs.

Comment 2 Jan Safranek 2020-08-21 13:21:41 UTC
Looking at https://testgrid.k8s.io/redhat-openshift-ocp-release-4.6-informing#release-openshift-ocp-installer-e2e-ovirt-4.6 and the first failed [sig-apps] StatefulSet [k8s.io] Basic StatefulSet functionality [StatefulSetBasic] should adopt matching orphans and release non-matching pods [Suite:openshift/conformance/parallel] [Suite:k8s]

The test creates StatefulSet e2e-statefulset-9197/ss, which creates PVC e2e-statefulset-9197/datadir-ss-0, for which PV pvc-114509b2-532f-4200-bcff-eef1ca358128 is provisioned.

From test events:

Warning  FailedMount             2m40s (x11 over 8m54s)  kubelet, ovirt11-t77z5-worker-0-r5952  MountVolume.MountDevice failed for volume "pvc-114509b2-532f-4200-bcff-eef1ca358128" : rpc error: code = Unknown desc = exit status 1 mkfs failed with exit status 1

Driver logs on the node do not show much more:
https://storage.googleapis.com/origin-ci-test/logs/release-openshift-ocp-installer-e2e-ovirt-4.6/1295983476473860096/artifacts/e2e-ovirt/pods/openshift-cluster-csi-drivers_ovirt-csi-driver-node-c9ch7_csi-driver.log

I0819 08:09:41.320324       1 node.go:41] Staging volume 8e2c54d6-b312-4e3b-b510-8d34048ec98d with volume_id:"8e2c54d6-b312-4e3b-b510-8d34048ec98d" staging_target_path:"/var/lib/kubelet/plugins/kubernetes.io/csi/pv/pvc-114509b2-532f-4200-bcff-eef1ca358128/globalmount" volume_capability:<mount:<fs_type:"ext4" > access_mode:<mode:SINGLE_NODE_WRITER > > volume_context:<key:"storage.kubernetes.io/csiProvisionerIdentity" value:"1597823051721-8081-csi.ovirt.org" > 
I0819 08:09:41.576150       1 node.go:160] Extracting pvc volume name 8e2c54d6-b312-4e3b-b510-8d34048ec98d
I0819 08:09:41.634723       1 node.go:166] Extracted disk ID from PVC 8e2c54d6-b312-4e3b-b510-8d34048ec98d
I0819 08:09:41.634837       1 node.go:186] Device path /dev/disk/by-id/scsi-0QEMU_QEMU_HARDDISK_8e2c54d6-b312-4e3b-b exists
I0819 08:09:41.634894       1 node.go:203] lsblk -nro FSTYPE /dev/sdb
I0819 08:09:41.644493       1 node.go:67] Creating FS ext4 on device /dev/disk/by-id/scsi-0QEMU_QEMU_HARDDISK_8e2c54d6-b312-4e3b-b
I0819 08:09:41.644524       1 node.go:223] Mounting device /dev/disk/by-id/scsi-0QEMU_QEMU_HARDDISK_8e2c54d6-b312-4e3b-b, with FS ext4
E0819 08:09:41.650098       1 node.go:70] Could not create filesystem ext4 on /dev/disk/by-id/scsi-0QEMU_QEMU_HARDDISK_8e2c54d6-b312-4e3b-b
E0819 08:09:41.650137       1 server.go:125] /csi.v1.Node/NodeStageVolume returned with error: exit status 1 mkfs failed with exit status 1

Why did mkfs fail?

Comment 3 Tomas Smetana 2020-08-21 14:00:03 UTC
Seems like mkfs failed because the disk was not readable (yet?):

Aug 19 08:13:32.630195 ovirt11-t77z5-worker-0-cvbfv hyperkube[1609]: E0819 08:13:32.630171    1609 nestedpendingoperations.go:301] Operation for "{volumeName:kubernetes.io/csi/csi.ovirt.org^205d55a0-1b77-4082-8e61-3cace5b07d95 podName: nodeName:}" failed. No retries permitted until 2020-08-19 08:13:33.13014761 +0000 UTC m=+1953.218822405 (durationBeforeRetry 500ms). Error: "MountVolume.MountDevice failed for volume \"pvc-4592db77-1d7f-413b-b993-c0ad08d11192\" (UniqueName: \"kubernetes.io/csi/csi.ovirt.org^205d55a0-1b77-4082-8e61-3cace5b07d95\") pod \"ss-0\" (UID: \"4a524db4-9fb6-4802-8496-69a1dd77163f\") : rpc error: code = Unknown desc = exit status 1 mkfs failed with exit status 1"
Aug 19 08:13:32.630261 ovirt11-t77z5-worker-0-cvbfv hyperkube[1609]: I0819 08:13:32.630234    1609 event.go:291] "Event occurred" object="e2e-statefulset-7400/ss-0" kind="Pod" apiVersion="v1" type="Warning" reason="FailedMount" message="MountVolume.MountDevice failed for volume \"pvc-4592db77-1d7f-413b-b993-c0ad08d11192\" : rpc error: code = Unknown desc = exit status 1 mkfs failed with exit status 1"
Aug 19 08:13:32.639792 ovirt11-t77z5-worker-0-cvbfv kernel: Dev sdc: unable to read RDB block 8
Aug 19 08:13:32.639914 ovirt11-t77z5-worker-0-cvbfv kernel:  sdc: unable to read partition table
Aug 19 08:13:32.639938 ovirt11-t77z5-worker-0-cvbfv kernel: sdc: partition table beyond EOD, truncated

Comment 4 Benny Zlotnik 2020-08-21 15:49:36 UTC
We started looking at this yesterday. The issue, it seems, is that the disk is created with only 4k bytes instead of 1MiB, so there is insufficient space for mkfs. I think the "1" is passed as-is and not converted to bytes. I am not sure exactly where the issue is, because manually creating a 1MiB PVC for the same storage class worked fine when I tried it.

Comment 7 Benny Zlotnik 2020-08-23 16:16:09 UTC
The issue can be reproduced manually by creating a PVC with size 1 instead of 1Mi:
kind: PersistentVolumeClaim
apiVersion: v1
metadata:
  name: 1m-ovirt-cow-disk
  annotations:
    volume.beta.kubernetes.io/storage-class: ovirt-csi-sc
spec:
  storageClassName: ovirt-csi-sc
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 1

This makes CreateVolumeRequest.CapacityRange#GetRequiredBytes return 1 rather than 1048576 (1Mi expressed in bytes).
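
To illustrate why the two requests differ, here is a minimal Python sketch of how a storage quantity maps to bytes. It is not the Kubernetes parser: the function name quantity_to_bytes is hypothetical, and it assumes only plain integers and the binary suffixes Ki, Mi, Gi, and Ti, whereas the real quantity grammar accepts more forms.

```python
# Simplified parser for Kubernetes-style storage quantities.
# Assumption: only plain integers and the binary suffixes Ki, Mi, Gi, Ti
# are handled; the real Kubernetes quantity grammar accepts more forms.
BINARY_SUFFIXES = {"Ki": 1024, "Mi": 1024**2, "Gi": 1024**3, "Ti": 1024**4}


def quantity_to_bytes(quantity: str) -> int:
    """Convert a quantity such as "1" or "1Mi" to a byte count."""
    text = quantity.strip()
    for suffix, factor in BINARY_SUFFIXES.items():
        if text.endswith(suffix):
            return int(text[: -len(suffix)]) * factor
    # No suffix: the number is already a byte count.
    return int(text)


for request in ("1", "1Mi", "1Gi"):
    print(f"storage: {request} -> {quantity_to_bytes(request)} bytes")
```

A request with no suffix is taken as a plain byte count, so "storage: 1" asks for a single byte.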

Comment 8 Jan Zmeskal 2020-09-29 14:06:03 UTC
Hi Benny, couple of questions here:
1. Are we sure that this bug is related to the installation process? From afar it seems to me like a storage-related operation. If that's indeed the case, please change the Component.
2. Could you please translate for us what was this issue's symptom before the fix?
3. What are the verification steps here?

Thank you!

Comment 9 Benny Zlotnik 2020-09-30 14:16:35 UTC
(In reply to Jan Zmeskal from comment #8)
> Hi Benny, couple of questions here:
> 1. Are we sure that this bug is related to the installation process? From
> afar it seems to me like a storage-related operation. If that's indeed the
> case, please change the Component.
yes, will change

> 2. Could you please translate for us what was this issue's symptom before
> the fix?
mkfs would fail, see comment #2 and comment #3

> 3. What are the verification steps here?

1. Create the following PVC (comment #7):
kind: PersistentVolumeClaim
apiVersion: v1
metadata:
  name: 1m-ovirt-cow-disk
  annotations:
    volume.beta.kubernetes.io/storage-class: ovirt-csi-sc
spec:
  storageClassName: ovirt-csi-sc
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 1

(1 instead of 1Mi)

2. Try to start a pod using this PVC.
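
For step 2, a pod that mounts the claim could look roughly like the manifest built by the sketch below. Only the claim name comes from this bug; the pod name, container name, image, command, and mount path are placeholders chosen for illustration, and the helper pod_manifest is hypothetical.

```python
import json

PVC_NAME = "1m-ovirt-cow-disk"  # claim name taken from the PVC in comment #7


def pod_manifest(pod_name: str, claim_name: str) -> dict:
    """Build a minimal Pod manifest that mounts the given PVC."""
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": pod_name},
        "spec": {
            "containers": [
                {
                    "name": "app",  # placeholder container name
                    "image": "busybox",  # placeholder image
                    "command": ["sleep", "3600"],
                    "volumeMounts": [{"name": "data", "mountPath": "/data"}],
                }
            ],
            "volumes": [
                {
                    "name": "data",
                    "persistentVolumeClaim": {"claimName": claim_name},
                }
            ],
        },
    }


manifest = pod_manifest("testpod", PVC_NAME)  # "testpod" is a placeholder name
print(json.dumps(manifest, indent=2))
```

The printed JSON can be piped to "oc apply -f -" in the namespace that holds the PVC. Before the fix the pod would be expected to stay pending with the FailedMount events from comment #2.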

Comment 10 Guilherme Santos 2020-10-01 15:20:04 UTC
Verified on:
openshift-4.6.0-0.nightly-2020-09-28-212756

Steps:
1. Created a PVC with storage size of 1
2. $ oc get -n openshift-cluster-csi-drivers pvc
NAME               STATUS   VOLUME                                     CAPACITY   ACCESS MODES   STORAGECLASS   AGE
1-ovirt-cow-disk   Bound    pvc-bc27de69-b211-405a-ae38-16d95fa87156   1Mi        RWO            ovirt-csi-sc   28s

3. Created a pod that uses the created PVC
4. $ oc -n openshift-cluster-csi-drivers get pods
...
testpodwithcsi                                 1/1     Running   0          2m37s
...

5. $ oc -n openshift-cluster-csi-drivers logs ovirt-csi-driver-controller-77cbc597c8-2ppr4 -c csi-driver
...
I1001 15:14:00.169218       1 controller.go:36] Creating disk pvc-bc27de69-b211-405a-ae38-16d95fa87156
...

Results:
PVC created with 1Mi capacity and disk created successfully
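
The 1Mi capacity reported for a request of 1 is consistent with the linked pull request title, "round provisioned size up to 1MiB". The sketch below shows one way such rounding can work. It is not the driver's code: it is written in Python for illustration, the name round_up_to_mib is hypothetical, and it assumes the size is rounded up to the next whole multiple of 1 MiB.

```python
MIB = 1024 * 1024  # bytes in one mebibyte


def round_up_to_mib(required_bytes: int) -> int:
    """Round a byte count up to the next whole multiple of 1 MiB.

    Assumption: a request of zero or less is treated as the 1 MiB minimum.
    """
    if required_bytes <= 0:
        return MIB
    # Ceiling division without floating point.
    return ((required_bytes + MIB - 1) // MIB) * MIB


for requested in (1, 4096, MIB, MIB + 1):
    print(requested, "->", round_up_to_mib(requested))
```

Under this assumption, requests of 1 byte and 4096 bytes both become 1048576 bytes, which leaves mkfs a full 1 MiB to work with.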

Comment 12 errata-xmlrpc 2020-10-27 16:29:12 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (OpenShift Container Platform 4.6 GA Images), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:4196

