Created attachment 1729130 [details]
MachineConfig definition for udev work-around

Description of problem:

Using OpenShift 4 on OpenStack 16.1 with StorageClass kubernetes.io/cinder, the volume attachment to the node succeeds, but the kubelet fails to find the attached volume. When a pod attempts to use the volume it becomes stuck in the "ContainerCreating" state.

Investigation in OpenStack shows the volume is attached to the node. Further, using `oc debug node/...` we confirmed the volume is attached to the node, appears as `/dev/sdb`, and has symbolic links `/dev/disk/by-id/scsi-0QEMU_QEMU_HARDDISK_<VOLUME_ID>` and `scsi-SQEMU_QEMU_HARDDISK_<VOLUME_ID>`. We also manually tested by formatting a filesystem on the attached volume and mounting it on the host, with no difficulty.

Further investigation determined that the underlying issue is that the volume detection truncates the volume ID at 20 characters, while the OpenStack volume ID is 36 characters. Comparing to running OpenShift 4 on OpenStack 13, we find the symbolic link at `/dev/disk/by-id/scsi-0QEMU_QEMU_HARDDISK_*` is truncated at 20 characters and everything works, but on OpenStack 16.1 this symbolic link is not truncated.

We tracked this back to the rule in /usr/lib/udev/rules.d/63-scsi-sg3_symlink.rules:

ENV{SCSI_IDENT_LUN_VENDOR}=="?*", ENV{DEVTYPE}=="disk", SYMLINK+="disk/by-id/scsi-0$env{SCSI_VENDOR}_$env{SCSI_MODEL}_$env{SCSI_IDENT_LUN_VENDOR}"

and then used `udevadm info` to confirm that on OpenStack 13 this value is truncated:

  udevadm info /dev/sdb --query=property | grep SCSI_IDENT_LUN_VENDOR
  SCSI_IDENT_LUN_VENDOR=a55fa2e2-a60c-470c-8

but on OpenStack 16 it is not:

  udevadm info /dev/sdb --query=property | grep SCSI_IDENT_LUN_VENDOR
  SCSI_IDENT_LUN_VENDOR=033fa19a-a5e3-445a-8631-3e9349e540e5

The relevant code is here:
https://github.com/openshift/origin/blob/master/vendor/k8s.io/legacy-cloud-providers/openstack/openstack_volumes.go#L494-L501

candidateDeviceNodes := []string{
	// KVM
	fmt.Sprintf("virtio-%s", volumeID[:20]),
	// KVM virtio-scsi
	fmt.Sprintf("scsi-0QEMU_QEMU_HARDDISK_%s", volumeID[:20]),
	// ESXi
	fmt.Sprintf("wwn-0x%s", strings.Replace(volumeID, "-", "", -1)),
}

We recommend a simple fix: add the non-truncated form to the search, so the lookup works whether or not udev truncated the symlink:

candidateDeviceNodes := []string{
	// KVM
	fmt.Sprintf("virtio-%s", volumeID[:20]),
	// KVM virtio-scsi
	fmt.Sprintf("scsi-0QEMU_QEMU_HARDDISK_%s", volumeID),
	fmt.Sprintf("scsi-0QEMU_QEMU_HARDDISK_%s", volumeID[:20]),
	// ESXi
	fmt.Sprintf("wwn-0x%s", strings.Replace(volumeID, "-", "", -1)),
}
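For reference, the following standalone sketch (ours, not kubelet code; the program name and output format are illustrative) mirrors the candidate list from the proposed fix. Run from `oc debug node/...` with the Cinder volume ID as its argument, it reports which of the candidate symlinks actually exist under /dev/disk/by-id, which makes the truncated-vs-full mismatch easy to see on a given node:

package main

import (
	"fmt"
	"os"
	"path/filepath"
	"strings"
)

func main() {
	if len(os.Args) != 2 || len(os.Args[1]) < 20 {
		fmt.Fprintln(os.Stderr, "usage: checkdev <36-char-cinder-volume-id>")
		os.Exit(1)
	}
	volumeID := os.Args[1]

	// Same candidates as the proposed fix: the truncated forms match the
	// udev rules shipped with OSP 13, the full form matches OSP 16.1.
	candidates := []string{
		fmt.Sprintf("virtio-%s", volumeID[:20]),
		fmt.Sprintf("scsi-0QEMU_QEMU_HARDDISK_%s", volumeID),
		fmt.Sprintf("scsi-0QEMU_QEMU_HARDDISK_%s", volumeID[:20]),
		fmt.Sprintf("wwn-0x%s", strings.Replace(volumeID, "-", "", -1)),
	}
	for _, c := range candidates {
		p := filepath.Join("/dev/disk/by-id", c)
		if _, err := os.Stat(p); err == nil {
			fmt.Printf("found:   %s\n", p)
		} else {
			fmt.Printf("missing: %s\n", p)
		}
	}
}

On OSP 16.1 only the full-length scsi-0QEMU_QEMU_HARDDISK_ entry is reported as found, which is exactly the form the current code never searches for.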
Version-Release number of selected component (if applicable):

OpenShift 4.6.3
OpenStack 16.1

How reproducible:

Always

Steps to Reproduce:
1. `oc new-app postgresql-persistent` or any example with a PVC

Actual results:

Pod gets stuck in the "ContainerCreating" state.

Expected results:

Pod should run.

Master Log:

n/a

Node Log (of failed PODs):

Nov 13 15:38:29 cluster-jtk0-dlqhf-worker-lmnqx hyperkube[1838]: I1113 15:38:29.142461 1838 reconciler.go:254] Starting operationExecutor.MountVolume for volume "pvc-0b49a15f-62dd-4c52-9495-c370c55e4dd4" (UniqueName: "kubernetes.io/cinder/033fa19a-a5e3-445a-8631-3e9349e540e5") pod "postgresql-8-b92bh" (UID: "7271748f-d063-4956-8322-cbeb190a869b")
Nov 13 15:38:29 cluster-jtk0-dlqhf-worker-lmnqx hyperkube[1838]: W1113 15:38:29.142541 1838 plugins.go:689] WARNING: kubernetes.io/cinder built-in volume provider is now deprecated. The Cinder volume provider is deprecated and will be removed in a future release
Nov 13 15:38:29 cluster-jtk0-dlqhf-worker-lmnqx hyperkube[1838]: I1113 15:38:29.165148 1838 exec.go:60] Exec probe response: "fail on inspecting path /tmp/healthliveness: stat /tmp/healthliveness: no such file or directoryOK"
Nov 13 15:38:29 cluster-jtk0-dlqhf-worker-lmnqx hyperkube[1838]: I1113 15:38:29.165201 1838 prober.go:126] Liveness probe for "istio-galley-7957f7f6bb-8jgnz_user2-istio-system(c30364ea-28a0-4f47-beb1-04e98e645975):galley" succeeded
Nov 13 15:38:29 cluster-jtk0-dlqhf-worker-lmnqx hyperkube[1838]: I1113 15:38:29.244683 1838 reconciler.go:254] Starting operationExecutor.MountVolume for volume "pvc-0b49a15f-62dd-4c52-9495-c370c55e4dd4" (UniqueName: "kubernetes.io/cinder/033fa19a-a5e3-445a-8631-3e9349e540e5") pod "postgresql-8-b92bh" (UID: "7271748f-d063-4956-8322-cbeb190a869b")
Nov 13 15:38:29 cluster-jtk0-dlqhf-worker-lmnqx hyperkube[1838]: W1113 15:38:29.244763 1838 plugins.go:689] WARNING: kubernetes.io/cinder built-in volume provider is now deprecated. The Cinder volume provider is deprecated and will be removed in a future release
Nov 13 15:38:29 cluster-jtk0-dlqhf-worker-lmnqx hyperkube[1838]: I1113 15:38:29.251924 1838 cinder_util.go:264] Successfully probed all attachments
Nov 13 15:38:29 cluster-jtk0-dlqhf-worker-lmnqx hyperkube[1838]: I1113 15:38:29.252219 1838 openstack_volumes.go:514] Failed to find device for the volumeID: "033fa19a-a5e3-445a-8631-3e9349e540e5" by serial ID
Nov 13 15:38:29 cluster-jtk0-dlqhf-worker-lmnqx hyperkube[1838]: I1113 15:38:29.252274 1838 metadata.go:166] Attempting to fetch metadata from http://169.254.169.254/openstack/2016-06-30/meta_data.json
Nov 13 15:38:29 cluster-jtk0-dlqhf-worker-lmnqx systemd-logind[1233]: Watching system buttons on /dev/input/event0 (Power Button)
Nov 13 15:38:29 cluster-jtk0-dlqhf-worker-lmnqx hyperkube[1838]: I1113 15:38:29.346995 1838 reconciler.go:254] Starting operationExecutor.MountVolume for volume "pvc-0b49a15f-62dd-4c52-9495-c370c55e4dd4" (UniqueName: "kubernetes.io/cinder/033fa19a-a5e3-445a-8631-3e9349e540e5") pod "postgresql-8-b92bh" (UID: "7271748f-d063-4956-8322-cbeb190a869b")
Nov 13 15:38:29 cluster-jtk0-dlqhf-worker-lmnqx hyperkube[1838]: W1113 15:38:29.347068 1838 plugins.go:689] WARNING: kubernetes.io/cinder built-in volume provider is now deprecated. The Cinder volume provider is deprecated and will be removed in a future release
PV Dump:

apiVersion: v1
kind: PersistentVolume
metadata:
  annotations:
    kubernetes.io/createdby: cinder-dynamic-provisioner
    pv.kubernetes.io/bound-by-controller: "yes"
    pv.kubernetes.io/provisioned-by: kubernetes.io/cinder
  creationTimestamp: "2020-11-12T21:20:56Z"
  finalizers:
  - kubernetes.io/pv-protection
  labels:
    failure-domain.beta.kubernetes.io/region: regionOne
    failure-domain.beta.kubernetes.io/zone: nova
  managedFields:
  - apiVersion: v1
    fieldsType: FieldsV1
    fieldsV1:
      f:metadata:
        f:annotations:
          .: {}
          f:kubernetes.io/createdby: {}
          f:pv.kubernetes.io/bound-by-controller: {}
          f:pv.kubernetes.io/provisioned-by: {}
        f:labels:
          .: {}
          f:failure-domain.beta.kubernetes.io/region: {}
          f:failure-domain.beta.kubernetes.io/zone: {}
      f:spec:
        f:accessModes: {}
        f:capacity:
          .: {}
          f:storage: {}
        f:cinder:
          .: {}
          f:fsType: {}
          f:volumeID: {}
        f:claimRef:
          .: {}
          f:apiVersion: {}
          f:kind: {}
          f:name: {}
          f:namespace: {}
          f:resourceVersion: {}
          f:uid: {}
        f:nodeAffinity:
          .: {}
          f:required:
            .: {}
            f:nodeSelectorTerms: {}
        f:persistentVolumeReclaimPolicy: {}
        f:storageClassName: {}
        f:volumeMode: {}
      f:status:
        f:phase: {}
    manager: kube-controller-manager
    operation: Update
    time: "2020-11-12T21:20:56Z"
  name: pvc-0b49a15f-62dd-4c52-9495-c370c55e4dd4
  resourceVersion: "2652551"
  selfLink: /api/v1/persistentvolumes/pvc-0b49a15f-62dd-4c52-9495-c370c55e4dd4
  uid: e43687a6-7dbf-4527-aecc-8c2042f8a989
spec:
  accessModes:
  - ReadWriteOnce
  capacity:
    storage: 1Gi
  cinder:
    fsType: ext4
    volumeID: 033fa19a-a5e3-445a-8631-3e9349e540e5
  claimRef:
    apiVersion: v1
    kind: PersistentVolumeClaim
    name: postgresql
    namespace: test
    resourceVersion: "2652525"
    uid: 0b49a15f-62dd-4c52-9495-c370c55e4dd4
  nodeAffinity:
    required:
      nodeSelectorTerms:
      - matchExpressions:
        - key: failure-domain.beta.kubernetes.io/zone
          operator: In
          values:
          - nova
        - key: failure-domain.beta.kubernetes.io/region
          operator: In
          values:
          - regionOne
  persistentVolumeReclaimPolicy: Delete
  storageClassName: standard
  volumeMode: Filesystem
status:
  phase: Bound

PVC Dump:

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  annotations:
    openshift.io/generated-by: OpenShiftNewApp
    pv.kubernetes.io/bind-completed: "yes"
    pv.kubernetes.io/bound-by-controller: "yes"
    volume.beta.kubernetes.io/storage-provisioner: kubernetes.io/cinder
    volume.kubernetes.io/selected-node: cluster-jtk0-dlqhf-worker-b2wc9
  creationTimestamp: "2020-11-12T21:20:51Z"
  finalizers:
  - kubernetes.io/pvc-protection
  labels:
    app: postgresql-persistent
    app.kubernetes.io/component: postgresql-persistent
    app.kubernetes.io/instance: postgresql-persistent
    template: postgresql-persistent-template
  managedFields:
  - apiVersion: v1
    fieldsType: FieldsV1
    fieldsV1:
      f:metadata:
        f:annotations:
          .: {}
          f:openshift.io/generated-by: {}
        f:labels:
          .: {}
          f:app: {}
          f:app.kubernetes.io/component: {}
          f:app.kubernetes.io/instance: {}
          f:template: {}
      f:spec:
        f:accessModes: {}
        f:resources:
          f:requests:
            .: {}
            f:storage: {}
        f:volumeMode: {}
    manager: oc
    operation: Update
    time: "2020-11-12T21:20:51Z"
  - apiVersion: v1
    fieldsType: FieldsV1
    fieldsV1:
      f:metadata:
        f:annotations:
          f:pv.kubernetes.io/bind-completed: {}
          f:pv.kubernetes.io/bound-by-controller: {}
          f:volume.beta.kubernetes.io/storage-provisioner: {}
      f:spec:
        f:volumeName: {}
      f:status:
        f:accessModes: {}
        f:capacity:
          .: {}
          f:storage: {}
        f:phase: {}
    manager: kube-controller-manager
    operation: Update
    time: "2020-11-12T21:20:56Z"
  - apiVersion: v1
    fieldsType: FieldsV1
    fieldsV1:
      f:metadata:
        f:annotations:
          f:volume.kubernetes.io/selected-node: {}
    manager: kube-scheduler
    operation: Update
    time: "2020-11-12T21:20:56Z"
  name: postgresql
  namespace: test
  resourceVersion: "2652554"
  selfLink: /api/v1/namespaces/test/persistentvolumeclaims/postgresql
  uid: 0b49a15f-62dd-4c52-9495-c370c55e4dd4
spec:
  accessModes:
  - ReadWriteOnce
  resources:
    requests:
      storage: 1Gi
  storageClassName: standard
  volumeMode: Filesystem
  volumeName: pvc-0b49a15f-62dd-4c52-9495-c370c55e4dd4
status:
  accessModes:
  - ReadWriteOnce
  capacity:
    storage: 1Gi
  phase: Bound
StorageClass Dump (if StorageClass used by PV/PVC):

allowVolumeExpansion: true
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  annotations:
    storageclass.kubernetes.io/is-default-class: "true"
  creationTimestamp: "2020-11-11T19:30:58Z"
  managedFields:
  - apiVersion: storage.k8s.io/v1
    fieldsType: FieldsV1
    fieldsV1:
      f:allowVolumeExpansion: {}
      f:metadata:
        f:annotations:
          .: {}
          f:storageclass.kubernetes.io/is-default-class: {}
        f:ownerReferences:
          .: {}
          k:{"uid":"85aa3248-ebd3-4ad3-b37d-8d98435616ac"}:
            .: {}
            f:apiVersion: {}
            f:kind: {}
            f:name: {}
            f:uid: {}
      f:provisioner: {}
      f:reclaimPolicy: {}
      f:volumeBindingMode: {}
    manager: cluster-storage-operator
    operation: Update
    time: "2020-11-11T19:30:58Z"
  name: standard
  ownerReferences:
  - apiVersion: v1
    kind: clusteroperator
    name: storage
    uid: 85aa3248-ebd3-4ad3-b37d-8d98435616ac
  resourceVersion: "14493"
  selfLink: /apis/storage.k8s.io/v1/storageclasses/standard
  uid: 743dd00b-42b3-4c8a-b66c-0bfb94f44531
provisioner: kubernetes.io/cinder
reclaimPolicy: Delete
volumeBindingMode: WaitForFirstConsumer

Additional info:

I was able to validate a work-around by adding a MachineConfig (the attached definition) that configures udev to also create symbolic links with the volume ID truncated at 20 characters:

ACTION=="add", ENV{SCSI_IDENT_LUN_VENDOR}=="?*", ENV{DEVTYPE}=="disk", RUN+="/bin/sh -c 'ID=$env{SCSI_IDENT_LUN_VENDOR}; ln -s ../../$name /dev/disk/by-id/scsi-0$env{SCSI_VENDOR}_$env{SCSI_MODEL}_${ID:0:20}'"
ACTION=="remove", ENV{SCSI_IDENT_LUN_VENDOR}=="?*", ENV{DEVTYPE}=="disk", RUN+="/bin/sh -c 'ID=$env{SCSI_IDENT_LUN_VENDOR}; rm -f /dev/disk/by-id/scsi-0$env{SCSI_VENDOR}_$env{SCSI_MODEL}_${ID:0:20}'"
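The actual MachineConfig is in attachment 1729130 and is not reproduced inline here. Purely to illustrate the shape such a work-around takes, a MachineConfig that delivers a udev rules file via Ignition looks roughly like the following sketch (the object name, rules-file path, and base64 placeholder are ours, not from the attachment):

# Sketch only -- see attachment 1729130 for the real definition.
# Writes the two udev rules above onto every worker node.
apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfig
metadata:
  labels:
    machineconfiguration.openshift.io/role: worker
  name: 99-worker-cinder-udev-workaround
spec:
  config:
    ignition:
      version: 3.1.0
    storage:
      files:
      - path: /etc/udev/rules.d/99-cinder-truncated-symlink.rules
        # 420 decimal == 0644
        mode: 420
        contents:
          # base64 of the two ACTION=="add"/"remove" rules shown above
          source: data:text/plain;charset=utf-8;base64,<BASE64_OF_RULES>

After the MachineConfig rolls out and the nodes reboot, the truncated symlink exists alongside the full-length one, so the existing 20-character lookup in the kubelet succeeds.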
Lowering the severity since there is a workaround.

Created issue upstream: https://github.com/kubernetes/kubernetes/issues/96672
Upstream fix: https://github.com/kubernetes/kubernetes/pull/96673

Thanks for the detailed description + code suggestion!
For the record, the Cinder CSI driver is already fixed: https://github.com/kubernetes/cloud-provider-openstack/pull/853
*** Bug 1902710 has been marked as a duplicate of this bug. ***
Waiting for upstream to un-freeze.
Verified pass on 4.7.0-0.nightly-2021-01-06-094712

1. Set up cluster with virtio-scsi mode
2. Create pod/pvc; volume is mounted successfully.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Container Platform 4.7.0 security, bug fix, and enhancement update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2020:5633