Created attachment 1966128
Scenario2-add capacity

Description of problem (please be detailed as possible and provide log
snippets):
[Stretch cluster] Add capacity is failing with the error: skipping OSD
configuration as no devices matched the storage settings for this node
"ocs-deviceset-thin-csi-3-data-1gzl6b"

Version of all relevant components (if applicable):
OCP 4.13.0-0.nightly-2023-05-19-120832
odf-operator.v4.13.0-203.stable

Does this issue impact your ability to continue to work with the product
(please explain in detail what is the user impact)?

Is there any workaround available to the best of your knowledge?
NA

Rate from 1 - 5 the complexity of the scenario you performed that caused this
bug (1 - very simple, 5 - very complex)?
1

Is this issue reproducible?
Yes

Can this issue be reproduced from the UI?
Yes

If this is a regression, please provide more details to justify this:

Steps to Reproduce:
1. Install an OCP cluster
2. Add disks in vSphere and install the Local Storage Operator
3. Install ODF and create a storage system using local storage with stretch
mode enabled
4. Through the OpenShift UI, try to Add Capacity

Scenario 1: When the thin-csi storage class is used, Add Capacity fails and
the osd prepare job gives the following log output:

2023-05-22 04:30:18.954035 I | cephosd: skipping device "/mnt/ocs-deviceset-thin-csi-3-data-1gzl6b" because it contains a filesystem "ceph_bluestore"
2023-05-22 04:30:18.970266 I | cephosd: configuring osd devices: {"Entries":{}}
2023-05-22 04:30:18.970325 I | cephosd: no new devices to configure. returning devices already configured with ceph-volume.
2023-05-22 04:30:18.970337 D | exec: Running command: pvdisplay -C -o lvpath --noheadings /mnt/ocs-deviceset-thin-csi-3-data-1gzl6b
2023-05-22 04:30:18.999616 W | cephosd: failed to retrieve logical volume path for "/mnt/ocs-deviceset-thin-csi-3-data-1gzl6b". exit status 5
2023-05-22 04:30:18.999683 D | exec: Running command: lsblk /mnt/ocs-deviceset-thin-csi-3-data-1gzl6b --bytes --nodeps --pairs --paths --output SIZE,ROTA,RO,TYPE,PKNAME,NAME,KNAME,MOUNTPOINT,FSTYPE
2023-05-22 04:30:19.002456 D | sys: lsblk output: "SIZE=\"1048576\" ROTA=\"1\" RO=\"0\" TYPE=\"disk\" PKNAME=\"\" NAME=\"/dev/sdc\" KNAME=\"/dev/sdc\" MOUNTPOINT=\"\" FSTYPE=\"ceph_bluestore\""
2023-05-22 04:30:19.002732 D | exec: Running command: stdbuf -oL ceph-volume --log-path /tmp/ceph-log lvm list --format json
2023-05-22 04:30:19.353781 D | cephosd: {}
2023-05-22 04:30:19.353844 I | cephosd: 0 ceph-volume lvm osd devices configured on this node
2023-05-22 04:30:19.353854 D | exec: Running command: cryptsetup luksDump /mnt/ocs-deviceset-thin-csi-3-data-1gzl6b
2023-05-22 04:30:19.360346 E | cephosd: failed to determine if the encrypted block "/mnt/ocs-deviceset-thin-csi-3-data-1gzl6b" is from our cluster. failed to dump LUKS header for disk "/mnt/ocs-deviceset-thin-csi-3-data-1gzl6b".
Device /mnt/ocs-deviceset-thin-csi-3-data-1gzl6b is not a valid LUKS device.: exit status 1
2023-05-22 04:30:19.360398 D | exec: Running command: stdbuf -oL ceph-volume --log-path /tmp/ceph-log raw list /mnt/ocs-deviceset-thin-csi-3-data-1gzl6b --format json
2023-05-22 04:30:19.689066 D | cephosd: {}
2023-05-22 04:30:19.689115 I | cephosd: 0 ceph-volume raw osd devices configured on this node
2023-05-22 04:30:19.689124 W | cephosd: skipping OSD configuration as no devices matched the storage settings for this node "ocs-deviceset-thin-csi-3-data-1gzl6b"

Scenario 2: When the storage class created during the local storage system
deployment is used, it fails with the error "An error occurred" (see the
attachment "Scenario2-add capacity").

Actual results:
Add capacity is failing

Expected results:
Add capacity should be successful

Additional info:
The key error appears to be:

cephosd: skipping device "/mnt/ocs-deviceset-thin-csi-3-data-1gzl6b" because it contains a filesystem "ceph_bluestore"

Was this device used in a previous cluster? It doesn't appear clean.

It's also possible there was a different error in the osd prepare pod and it
restarted to try again. Does the osd prepare job show that there was a
restart? ("oc get pod" would show if there was a restart.) If this repros
consistently, can you capture the osd prepare log before the restart?
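For example, something along these lines (a sketch, not the only way to do
it; in a Rook/ODF deployment the prepare pods should carry the
app=rook-ceph-osd-prepare label, and the pod name below is a placeholder):

  # list the osd prepare pods and check the RESTARTS column
  oc -n openshift-storage get pods -l app=rook-ceph-osd-prepare

  # capture the log from the previous (pre-restart) container instance
  oc -n openshift-storage logs <osd-prepare-pod-name> --previous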
Hi Travis,

Is any workaround available for this issue? Could you please share?

Harish
Hi Travis,

I looked at the errors, and it seems to me it's something to do with the
disks or the way they are processed. As confirmed by Bipul, the UI code for
this hasn't changed recently, and things like "1" & "1Ti" are not an issue.
I checked the ocs-operator code for any change in the way we handle storage
device set creation, and there is no change here whatsoever. So I'm moving
this back to Rook for investigation.
In the bare-metal case (basically, when the selected storage class is a
no-provisioner one) the UI passes "1", and for non-bare-metal it passes the
OSD size with a unit. This logic has been the same for a long time.
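A quick way to see what was actually requested (a sketch; the field paths are
assumed from the ocs-operator StorageCluster CRD, and the CR name may differ
on your cluster):

  # print each device set's name and its requested data PVC size
  oc -n openshift-storage get storagecluster ocs-storagecluster \
    -o jsonpath='{range .spec.storageDeviceSets[*]}{.name}{"\t"}{.dataPVCTemplate.spec.resources.requests.storage}{"\n"}{end}'

If the description above holds, an LSO-backed set should show the placeholder
"1", while a provisioner-backed set should show a real size such as "512Gi".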
Bipul, is VMware with the "thin-csi" storage class considered bare metal?
This seems to be the issue. The no-provisioner sc is an LSO case, not the
thin-csi one.

Joy, could you also test an expansion on a non-stretch cluster? It seems this
issue should affect any cluster in VMware, not just stretch.
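The storage class provisioner is the field that distinguishes the two cases
(a sketch; substitute your LSO class name, and note that the vSphere thin-csi
class is expected to report a CSI driver rather than
kubernetes.io/no-provisioner):

  # an LSO class prints "kubernetes.io/no-provisioner"
  oc get storageclass <lso-storage-class> -o jsonpath='{.provisioner}{"\n"}'

  # the vSphere CSI class prints its CSI driver name instead
  oc get storageclass thin-csi -o jsonpath='{.provisioner}{"\n"}'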
I feel like option 2 would be a good choice. Option 1 is what the user does
when choosing a non-LSO storage class.

Is mixing of the two provisioners not supported?
Joy, were you able to repro this for the non-stretch cluster case? (See
comment 27.) If you can validate that it happens for non-stretch, we can
close this issue, and it will need to be considered with the UI BZs that were
opened separately.

(In reply to Bipul Adhikari from comment #28)
> I feel like option 2 would be a good choice.
> Option 1 is done by the user when the user chooses a non-LSO storage class.
>
> Is mixing of the two provisioners not supported?

Mixing of the provisioners is technically fine. The issue is that the
provisioner storage classes (non-LSO) require the size to be passed, which is
not happening in this case.
Similar behaviour is seen on a non-stretch LSO cluster when trying to add
capacity through the thin-csi storage class. The osd prepare job was run, and
one 1Mi PV was created with the thin-csi storage class.

[jopinto@jopinto wl]$ oc get pods -o wide -n openshift-storage
NAME                                                              READY  STATUS     RESTARTS  AGE  IP           NODE       NOMINATED NODE  READINESS GATES
csi-addons-controller-manager-b4c77bcd4-vrnkz                     2/2    Running    0         68m  10.130.2.13  compute-3  <none>          <none>
...
rook-ceph-mds-ocs-storagecluster-cephfilesystem-a-69bc7965qsqxr   2/2    Running    0         60m  10.130.2.23  compute-3  <none>          <none>
rook-ceph-mds-ocs-storagecluster-cephfilesystem-b-6d599686k57fk   2/2    Running    0         60m  10.131.2.20  compute-5  <none>          <none>
rook-ceph-mgr-a-6698c6695b-kdfbc                                  2/2    Running    0         61m  10.129.2.28  compute-4  <none>          <none>
rook-ceph-mon-a-cb4f65b68-6rtgg                                   2/2    Running    0         62m  10.129.2.27  compute-4  <none>          <none>
rook-ceph-mon-b-887dddd6c-5x7nf                                   2/2    Running    0         61m  10.128.2.23  compute-2  <none>          <none>
rook-ceph-mon-c-7f9857d5cc-x6nd2                                  2/2    Running    0         61m  10.128.4.15  compute-1  <none>          <none>
rook-ceph-operator-5d7748588f-95242                               1/1    Running    0         63m  10.128.2.21  compute-2  <none>          <none>
rook-ceph-osd-0-6c776fb5fb-9lr6s                                  2/2    Running    0         60m  10.129.4.27  compute-0  <none>          <none>
rook-ceph-osd-1-55ddff5966-m6w75                                  2/2    Running    0         60m  10.129.2.32  compute-4  <none>          <none>
rook-ceph-osd-2-6c6ff7d6db-97nfz                                  2/2    Running    0         60m  10.128.4.19  compute-1  <none>          <none>
rook-ceph-osd-3-5f8454f89c-g7fdx                                  2/2    Running    0         60m  10.128.2.27  compute-2  <none>          <none>
rook-ceph-osd-4-79f45b4df5-hg4bs                                  2/2    Running    0         60m  10.131.2.16  compute-5  <none>          <none>
rook-ceph-osd-5-86598db64c-47ctj                                  2/2    Running    0         60m  10.130.2.18  compute-3  <none>          <none>
rook-ceph-osd-prepare-6d484fa476b82f7f25e05ed481ca7bc2-pph9m      0/1    Completed  0         60m  10.128.4.18  compute-1  <none>          <none>
rook-ceph-osd-prepare-71f70c08f1a1fedee7de451281841831-485qz      0/1    Completed  0         60m  10.128.2.26  compute-2  <none>          <none>
rook-ceph-osd-prepare-9c0a4820057e18e89462227450d968d0-zxln8      0/1    Completed  0         60m  10.129.2.31  compute-4  <none>          <none>
rook-ceph-osd-prepare-c5751d8347b95d3ae3e6c0bf76f56f30-phbzx      0/1    Completed  0         60m  10.130.2.17  compute-3  <none>          <none>
rook-ceph-osd-prepare-cdbefc9a962dae5a90a8c8450389f6ac-qhswk      0/1    Completed  0         60m  10.131.2.15  compute-5  <none>          <none>
rook-ceph-osd-prepare-df58b3ddbdd67e7838fd54a3718829f3-j9z9b      0/1    Completed  0         60m  10.129.4.26  compute-0  <none>          <none>
rook-ceph-osd-prepare-f13efd856b7a2eb8df124c8adf61927b-vbrtv      0/1    Completed  0         49m  10.129.4.36  compute-0  <none>          <none>
rook-ceph-rgw-ocs-storagecluster-cephobjectstore-a-d8dcc8cdp7q4   2/2    Running    0         59m  10.130.2.26  compute-3  <none>          <none>
rook-ceph-tools-5845b7c568-t54tt                                  1/1    Running    0         59m  10.128.2.31  compute-2  <none>          <none>

[jopinto@jopinto wl]$ oc get pv
NAME                                       CAPACITY  ACCESS MODES  RECLAIM POLICY  STATUS     CLAIM                                                    STORAGECLASS                 REASON  AGE
local-pv-1410cc33                          100Gi     RWO           Delete          Available                                                           lvmsc                                56m
local-pv-1c939ce5                          100Gi     RWO           Delete          Bound      openshift-storage/ocs-deviceset-lvmsc-0-data-1ld859     lvmsc                                65m
local-pv-1d4a44de                          100Gi     RWO           Delete          Available                                                           lvmsc                                56m
local-pv-27d5f808                          100Gi     RWO           Delete          Bound      openshift-storage/ocs-deviceset-lvmsc-0-data-0qpsxq     lvmsc                                65m
local-pv-45fb1544                          100Gi     RWO           Delete          Available                                                           lvmsc                                57m
local-pv-5daad652                          100Gi     RWO           Delete          Available                                                           lvmsc                                57m
local-pv-6672f275                          100Gi     RWO           Delete          Available                                                           lvmsc                                57m
local-pv-a915d313                          100Gi     RWO           Delete          Bound      openshift-storage/ocs-deviceset-lvmsc-0-data-5wxnpn     lvmsc                                65m
local-pv-b4a012fa                          100Gi     RWO           Delete          Available                                                           lvmsc                                58m
local-pv-ba467c3e                          100Gi     RWO           Delete          Bound      openshift-storage/ocs-deviceset-lvmsc-0-data-4fnlbc     lvmsc                                65m
local-pv-c4964c94                          100Gi     RWO           Delete          Bound      openshift-storage/ocs-deviceset-lvmsc-0-data-29986v     lvmsc                                65m
local-pv-c79b74ef                          100Gi     RWO           Delete          Bound      openshift-storage/ocs-deviceset-lvmsc-0-data-3fg8rx     lvmsc                                65m
pvc-b6fde558-98e4-484d-b94a-78c0edca2c7e   1Mi       RWO           Delete          Bound      openshift-storage/ocs-deviceset-thin-csi-0-data-0jcr8c  thin-csi                             51m
pvc-ca56e69e-5a49-4cd2-b9bc-7cb3a4b3d1b9   50Gi      RWO           Delete          Bound      openshift-storage/db-noobaa-db-pg-0                     ocs-storagecluster-ceph-rbd          60m

osd prepare job log output:
...
2023-06-08 12:17:47.541664 I | cephosd: creating and starting the osds
2023-06-08 12:17:47.541685 D | cephosd: desiredDevices are [{Name:/mnt/ocs-deviceset-thin-csi-0-data-0jcr8c OSDsPerDevice:1 MetadataDevice: DatabaseSizeMB:0 DeviceClass: InitialWeight: IsFilter:false IsDevicePathFilter:false}]
2023-06-08 12:17:47.541688 D | cephosd: context.Devices are:
2023-06-08 12:17:47.541714 D | cephosd: &{Name:/mnt/ocs-deviceset-thin-csi-0-data-0jcr8c Parent: HasChildren:false DevLinks:/dev/disk/by-id/scsi-SVMware_Virtual_disk_6000c296dc6e90646b8c4c0b065e9870 /dev/disk/by-id/scsi-36000c296dc6e90646b8c4c0b065e9870 /dev/disk/by-id/wwn-0x6000c296dc6e90646b8c4c0b065e9870 /dev/disk/by-path/pci-0000:03:00.0-scsi-0:0:3:0 /dev/disk/by-diskseq/23 /dev/disk/by-id/google-6000c296dc6e90646b8c4c0b065e9870 Size:1048576 UUID: Serial:36000c296dc6e90646b8c4c0b065e9870 Type:data Rotational:true Readonly:false Partitions:[] Filesystem:ceph_bluestore Mountpoint: Vendor:VMware Model:Virtual_disk WWN:0x6000c296dc6e9064 WWNVendorExtension:0x6000c296dc6e90646b8c4c0b065e9870 Empty:false CephVolumeData: RealPath:/dev/sdd KernelName:sdd Encrypted:false}
2023-06-08 12:17:47.541721 I | cephosd: skipping device "/mnt/ocs-deviceset-thin-csi-0-data-0jcr8c" because it contains a filesystem "ceph_bluestore"
2023-06-08 12:17:47.554120 I | cephosd: configuring osd devices: {"Entries":{}}
2023-06-08 12:17:47.554154 I | cephosd: no new devices to configure. returning devices already configured with ceph-volume.
2023-06-08 12:17:47.554162 D | exec: Running command: pvdisplay -C -o lvpath --noheadings /mnt/ocs-deviceset-thin-csi-0-data-0jcr8c
2023-06-08 12:17:47.595578 W | cephosd: failed to retrieve logical volume path for "/mnt/ocs-deviceset-thin-csi-0-data-0jcr8c". exit status 5
2023-06-08 12:17:47.595614 D | exec: Running command: lsblk /mnt/ocs-deviceset-thin-csi-0-data-0jcr8c --bytes --nodeps --pairs --paths --output SIZE,ROTA,RO,TYPE,PKNAME,NAME,KNAME,MOUNTPOINT,FSTYPE
2023-06-08 12:17:47.598594 D | sys: lsblk output: "SIZE=\"1048576\" ROTA=\"1\" RO=\"0\" TYPE=\"disk\" PKNAME=\"\" NAME=\"/dev/sdd\" KNAME=\"/dev/sdd\" MOUNTPOINT=\"\" FSTYPE=\"ceph_bluestore\""
2023-06-08 12:17:47.598832 D | exec: Running command: stdbuf -oL ceph-volume --log-path /tmp/ceph-log lvm list --format json
2023-06-08 12:17:48.007590 D | cephosd: {}
2023-06-08 12:17:48.007647 I | cephosd: 0 ceph-volume lvm osd devices configured on this node
2023-06-08 12:17:48.007658 D | exec: Running command: cryptsetup luksDump /mnt/ocs-deviceset-thin-csi-0-data-0jcr8c
2023-06-08 12:17:48.015059 E | cephosd: failed to determine if the encrypted block "/mnt/ocs-deviceset-thin-csi-0-data-0jcr8c" is from our cluster. failed to dump LUKS header for disk "/mnt/ocs-deviceset-thin-csi-0-data-0jcr8c".
Device /mnt/ocs-deviceset-thin-csi-0-data-0jcr8c is not a valid LUKS device.: exit status 1
2023-06-08 12:17:48.015111 D | exec: Running command: stdbuf -oL ceph-volume --log-path /tmp/ceph-log raw list /mnt/ocs-deviceset-thin-csi-0-data-0jcr8c --format json
2023-06-08 12:17:48.382898 D | cephosd: {}
2023-06-08 12:17:48.382933 I | cephosd: 0 ceph-volume raw osd devices configured on this node
2023-06-08 12:17:48.382945 W | cephosd: skipping OSD configuration as no devices matched the storage settings for this node "ocs-deviceset-thin-csi-0-data-0jcr8c"
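Note the 1Mi capacity on the thin-csi PV above. A quick way to confirm the
undersized request on the backing claim (a sketch; the PVC name is taken from
the listing above, and this should surface the raw request, which per the
discussion above is presumably the unitless "1" that the UI passes for
no-provisioner classes, here applied to a provisioner-backed class):

  # print the requested size of the device set's data PVC
  oc -n openshift-storage get pvc ocs-deviceset-thin-csi-0-data-0jcr8c \
    -o jsonpath='{.spec.resources.requests.storage}{"\n"}'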
Joy, thanks for confirming that this issue also happens on non-stretch clusters. As mentioned in Comment 29, I'll close this Rook issue, as separate issues have been opened for the UI to consider this error case and whether to disallow changing the storage class type.