Created attachment 1966128
Scenario2-add capacity

Description of problem (please be detailed as possible and provide log
snippets):
[Stretch cluster] Add capacity is failing with the error: skipping OSD
configuration as no devices matched the storage settings for this node
"ocs-deviceset-thin-csi-3-data-1gzl6b"

Version of all relevant components (if applicable):
OCP 4.13.0-0.nightly-2023-05-19-120832
odf-operator.v4.13.0-203.stable

Does this issue impact your ability to continue to work with the product
(please explain in detail what is the user impact)?

Is there any workaround available to the best of your knowledge?
NA

Rate from 1 - 5 the complexity of the scenario you performed that caused this
bug (1 - very simple, 5 - very complex)?
1

Is this issue reproducible?
Yes

Can this issue be reproduced from the UI?
Yes

If this is a regression, please provide more details to justify this:

Steps to Reproduce:
1. Install an OCP cluster
2. Add disks in vSphere and install the Local Storage Operator
3. Install ODF and create a storage system using local storage with stretch
mode enabled
4. Through the OpenShift UI, try to Add Capacity

Scenario 1: When the thin-csi storage class is used, Add Capacity fails and
the osd prepare job gives the following log output:

2023-05-22 04:30:18.954035 I | cephosd: skipping device "/mnt/ocs-deviceset-thin-csi-3-data-1gzl6b" because it contains a filesystem "ceph_bluestore"
2023-05-22 04:30:18.970266 I | cephosd: configuring osd devices: {"Entries":{}}
2023-05-22 04:30:18.970325 I | cephosd: no new devices to configure. returning devices already configured with ceph-volume.
2023-05-22 04:30:18.970337 D | exec: Running command: pvdisplay -C -o lvpath --noheadings /mnt/ocs-deviceset-thin-csi-3-data-1gzl6b
2023-05-22 04:30:18.999616 W | cephosd: failed to retrieve logical volume path for "/mnt/ocs-deviceset-thin-csi-3-data-1gzl6b". exit status 5
2023-05-22 04:30:18.999683 D | exec: Running command: lsblk /mnt/ocs-deviceset-thin-csi-3-data-1gzl6b --bytes --nodeps --pairs --paths --output SIZE,ROTA,RO,TYPE,PKNAME,NAME,KNAME,MOUNTPOINT,FSTYPE
2023-05-22 04:30:19.002456 D | sys: lsblk output: "SIZE=\"1048576\" ROTA=\"1\" RO=\"0\" TYPE=\"disk\" PKNAME=\"\" NAME=\"/dev/sdc\" KNAME=\"/dev/sdc\" MOUNTPOINT=\"\" FSTYPE=\"ceph_bluestore\""
2023-05-22 04:30:19.002732 D | exec: Running command: stdbuf -oL ceph-volume --log-path /tmp/ceph-log lvm list --format json
2023-05-22 04:30:19.353781 D | cephosd: {}
2023-05-22 04:30:19.353844 I | cephosd: 0 ceph-volume lvm osd devices configured on this node
2023-05-22 04:30:19.353854 D | exec: Running command: cryptsetup luksDump /mnt/ocs-deviceset-thin-csi-3-data-1gzl6b
2023-05-22 04:30:19.360346 E | cephosd: failed to determine if the encrypted block "/mnt/ocs-deviceset-thin-csi-3-data-1gzl6b" is from our cluster. failed to dump LUKS header for disk "/mnt/ocs-deviceset-thin-csi-3-data-1gzl6b".
Device /mnt/ocs-deviceset-thin-csi-3-data-1gzl6b is not a valid LUKS device.: exit status 1
2023-05-22 04:30:19.360398 D | exec: Running command: stdbuf -oL ceph-volume --log-path /tmp/ceph-log raw list /mnt/ocs-deviceset-thin-csi-3-data-1gzl6b --format json
2023-05-22 04:30:19.689066 D | cephosd: {}
2023-05-22 04:30:19.689115 I | cephosd: 0 ceph-volume raw osd devices configured on this node
2023-05-22 04:30:19.689124 W | cephosd: skipping OSD configuration as no devices matched the storage settings for this node "ocs-deviceset-thin-csi-3-data-1gzl6b"

Scenario 2: When the storage class created during the local storage system
deployment is used, it fails with the error "An error occurred" (see the
attachment "Scenario2-add capacity").

Actual results:
Add capacity is failing

Expected results:
Add capacity should be successful

Additional info:
The key error appears to be:

cephosd: skipping device "/mnt/ocs-deviceset-thin-csi-3-data-1gzl6b" because it contains a filesystem "ceph_bluestore"

Was this device used in a previous cluster? It doesn't appear clean.

It's also possible there was a different error in the osd prepare pod and it
restarted to try again. Does the osd prepare job show that there was a
restart? ("oc get pod" would show if there was a restart.) If this repros
consistently, can you capture the osd prepare log before the restart?
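For example, something along these lines (a sketch, not the only way to do
it; in a Rook/ODF deployment the prepare pods should carry the
app=rook-ceph-osd-prepare label, and the pod name below is a placeholder):

  # list the osd prepare pods and check the RESTARTS column
  oc -n openshift-storage get pods -l app=rook-ceph-osd-prepare

  # capture the log from the previous (pre-restart) container instance
  oc -n openshift-storage logs <osd-prepare-pod-name> --previous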
Hi Travis,

Is any workaround available for this issue? Could you please share?

Harish
Hi Travis,

I looked at the errors, and it seems to me it's something to do with the
disks or the way they are processed. As confirmed by Bipul, the UI code for
this hasn't changed recently, and things like "1" & "1Ti" are not an issue.
I checked the ocs-operator code for any change in the way we handle storage
device set creation, and there is no change here whatsoever. So I'm moving
this back to Rook for investigation.
In the bare-metal case (basically, when the selected storage class is a
no-provisioner one) the UI passes "1", and for non-bare-metal it passes the
OSD size with a unit. This logic has been the same for a long time.
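A quick way to see what was actually requested (a sketch; the field paths are
assumed from the ocs-operator StorageCluster CRD, and the CR name may differ
on your cluster):

  # print each device set's name and its requested data PVC size
  oc -n openshift-storage get storagecluster ocs-storagecluster \
    -o jsonpath='{range .spec.storageDeviceSets[*]}{.name}{"\t"}{.dataPVCTemplate.spec.resources.requests.storage}{"\n"}{end}'

If the description above holds, an LSO-backed set should show the placeholder
"1", while a provisioner-backed set should show a real size such as "512Gi".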
Bipul, is VMware with the "thin-csi" storage class considered bare metal?
This seems to be the issue. The no-provisioner sc is an LSO case, not the
thin-csi one.

Joy, could you also test an expansion on a non-stretch cluster? It seems this
issue should affect any cluster in VMware, not just stretch.
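The storage class provisioner is the field that distinguishes the two cases
(a sketch; substitute your LSO class name, and note that the vSphere thin-csi
class is expected to report a CSI driver rather than
kubernetes.io/no-provisioner):

  # an LSO class prints "kubernetes.io/no-provisioner"
  oc get storageclass <lso-storage-class> -o jsonpath='{.provisioner}{"\n"}'

  # the vSphere CSI class prints its CSI driver name instead
  oc get storageclass thin-csi -o jsonpath='{.provisioner}{"\n"}'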
I feel like option 2 would be a good choice. Option 1 is what the user does
when choosing a non-LSO storage class.

Is mixing of the two provisioners not supported?
Joy, were you able to repro this for the non-stretch cluster case? (See
comment 27.) If you can validate that it happens for non-stretch, we can
close this issue, and it will need to be considered with the UI BZs that were
opened separately.

(In reply to Bipul Adhikari from comment #28)
> I feel like option 2 would be a good choice.
> Option 1 is done by the user when the user chooses a non-LSO storage class.
>
> Is mixing of the two provisioners not supported?

Mixing of the provisioners is technically fine. The issue is that the
provisioner storage classes (non-LSO) require the size to be passed, which is
not happening in this case.
Similar behaviour is seen on a non-stretch LSO cluster when trying to add
capacity through the thin-csi storage class. The osd prepare job was run, and
one 1Mi PV was created with the thin-csi storage class.

[jopinto@jopinto wl]$ oc get pods -o wide -n openshift-storage
NAME                                                              READY  STATUS     RESTARTS  AGE  IP           NODE       NOMINATED NODE  READINESS GATES
csi-addons-controller-manager-b4c77bcd4-vrnkz                     2/2    Running    0         68m  10.130.2.13  compute-3  <none>          <none>
...
rook-ceph-mds-ocs-storagecluster-cephfilesystem-a-69bc7965qsqxr   2/2    Running    0         60m  10.130.2.23  compute-3  <none>          <none>
rook-ceph-mds-ocs-storagecluster-cephfilesystem-b-6d599686k57fk   2/2    Running    0         60m  10.131.2.20  compute-5  <none>          <none>
rook-ceph-mgr-a-6698c6695b-kdfbc                                  2/2    Running    0         61m  10.129.2.28  compute-4  <none>          <none>
rook-ceph-mon-a-cb4f65b68-6rtgg                                   2/2    Running    0         62m  10.129.2.27  compute-4  <none>          <none>
rook-ceph-mon-b-887dddd6c-5x7nf                                   2/2    Running    0         61m  10.128.2.23  compute-2  <none>          <none>
rook-ceph-mon-c-7f9857d5cc-x6nd2                                  2/2    Running    0         61m  10.128.4.15  compute-1  <none>          <none>
rook-ceph-operator-5d7748588f-95242                               1/1    Running    0         63m  10.128.2.21  compute-2  <none>          <none>
rook-ceph-osd-0-6c776fb5fb-9lr6s                                  2/2    Running    0         60m  10.129.4.27  compute-0  <none>          <none>
rook-ceph-osd-1-55ddff5966-m6w75                                  2/2    Running    0         60m  10.129.2.32  compute-4  <none>          <none>
rook-ceph-osd-2-6c6ff7d6db-97nfz                                  2/2    Running    0         60m  10.128.4.19  compute-1  <none>          <none>
rook-ceph-osd-3-5f8454f89c-g7fdx                                  2/2    Running    0         60m  10.128.2.27  compute-2  <none>          <none>
rook-ceph-osd-4-79f45b4df5-hg4bs                                  2/2    Running    0         60m  10.131.2.16  compute-5  <none>          <none>
rook-ceph-osd-5-86598db64c-47ctj                                  2/2    Running    0         60m  10.130.2.18  compute-3  <none>          <none>
rook-ceph-osd-prepare-6d484fa476b82f7f25e05ed481ca7bc2-pph9m      0/1    Completed  0         60m  10.128.4.18  compute-1  <none>          <none>
rook-ceph-osd-prepare-71f70c08f1a1fedee7de451281841831-485qz      0/1    Completed  0         60m  10.128.2.26  compute-2  <none>          <none>
rook-ceph-osd-prepare-9c0a4820057e18e89462227450d968d0-zxln8      0/1    Completed  0         60m  10.129.2.31  compute-4  <none>          <none>
rook-ceph-osd-prepare-c5751d8347b95d3ae3e6c0bf76f56f30-phbzx      0/1    Completed  0         60m  10.130.2.17  compute-3  <none>          <none>
rook-ceph-osd-prepare-cdbefc9a962dae5a90a8c8450389f6ac-qhswk      0/1    Completed  0         60m  10.131.2.15  compute-5  <none>          <none>
rook-ceph-osd-prepare-df58b3ddbdd67e7838fd54a3718829f3-j9z9b      0/1    Completed  0         60m  10.129.4.26  compute-0  <none>          <none>
rook-ceph-osd-prepare-f13efd856b7a2eb8df124c8adf61927b-vbrtv      0/1    Completed  0         49m  10.129.4.36  compute-0  <none>          <none>
rook-ceph-rgw-ocs-storagecluster-cephobjectstore-a-d8dcc8cdp7q4   2/2    Running    0         59m  10.130.2.26  compute-3  <none>          <none>
rook-ceph-tools-5845b7c568-t54tt                                  1/1    Running    0         59m  10.128.2.31  compute-2  <none>          <none>

[jopinto@jopinto wl]$ oc get pv
NAME                                       CAPACITY  ACCESS MODES  RECLAIM POLICY  STATUS     CLAIM                                                    STORAGECLASS                 REASON  AGE
local-pv-1410cc33                          100Gi     RWO           Delete          Available                                                           lvmsc                                56m
local-pv-1c939ce5                          100Gi     RWO           Delete          Bound      openshift-storage/ocs-deviceset-lvmsc-0-data-1ld859     lvmsc                                65m
local-pv-1d4a44de                          100Gi     RWO           Delete          Available                                                           lvmsc                                56m
local-pv-27d5f808                          100Gi     RWO           Delete          Bound      openshift-storage/ocs-deviceset-lvmsc-0-data-0qpsxq     lvmsc                                65m
local-pv-45fb1544                          100Gi     RWO           Delete          Available                                                           lvmsc                                57m
local-pv-5daad652                          100Gi     RWO           Delete          Available                                                           lvmsc                                57m
local-pv-6672f275                          100Gi     RWO           Delete          Available                                                           lvmsc                                57m
local-pv-a915d313                          100Gi     RWO           Delete          Bound      openshift-storage/ocs-deviceset-lvmsc-0-data-5wxnpn     lvmsc                                65m
local-pv-b4a012fa                          100Gi     RWO           Delete          Available                                                           lvmsc                                58m
local-pv-ba467c3e                          100Gi     RWO           Delete          Bound      openshift-storage/ocs-deviceset-lvmsc-0-data-4fnlbc     lvmsc                                65m
local-pv-c4964c94                          100Gi     RWO           Delete          Bound      openshift-storage/ocs-deviceset-lvmsc-0-data-29986v     lvmsc                                65m
local-pv-c79b74ef                          100Gi     RWO           Delete          Bound      openshift-storage/ocs-deviceset-lvmsc-0-data-3fg8rx     lvmsc                                65m
pvc-b6fde558-98e4-484d-b94a-78c0edca2c7e   1Mi       RWO           Delete          Bound      openshift-storage/ocs-deviceset-thin-csi-0-data-0jcr8c  thin-csi                             51m
pvc-ca56e69e-5a49-4cd2-b9bc-7cb3a4b3d1b9   50Gi      RWO           Delete          Bound      openshift-storage/db-noobaa-db-pg-0                     ocs-storagecluster-ceph-rbd          60m

osd prepare job log output:
...
2023-06-08 12:17:47.541664 I | cephosd: creating and starting the osds
2023-06-08 12:17:47.541685 D | cephosd: desiredDevices are [{Name:/mnt/ocs-deviceset-thin-csi-0-data-0jcr8c OSDsPerDevice:1 MetadataDevice: DatabaseSizeMB:0 DeviceClass: InitialWeight: IsFilter:false IsDevicePathFilter:false}]
2023-06-08 12:17:47.541688 D | cephosd: context.Devices are:
2023-06-08 12:17:47.541714 D | cephosd: &{Name:/mnt/ocs-deviceset-thin-csi-0-data-0jcr8c Parent: HasChildren:false DevLinks:/dev/disk/by-id/scsi-SVMware_Virtual_disk_6000c296dc6e90646b8c4c0b065e9870 /dev/disk/by-id/scsi-36000c296dc6e90646b8c4c0b065e9870 /dev/disk/by-id/wwn-0x6000c296dc6e90646b8c4c0b065e9870 /dev/disk/by-path/pci-0000:03:00.0-scsi-0:0:3:0 /dev/disk/by-diskseq/23 /dev/disk/by-id/google-6000c296dc6e90646b8c4c0b065e9870 Size:1048576 UUID: Serial:36000c296dc6e90646b8c4c0b065e9870 Type:data Rotational:true Readonly:false Partitions:[] Filesystem:ceph_bluestore Mountpoint: Vendor:VMware Model:Virtual_disk WWN:0x6000c296dc6e9064 WWNVendorExtension:0x6000c296dc6e90646b8c4c0b065e9870 Empty:false CephVolumeData: RealPath:/dev/sdd KernelName:sdd Encrypted:false}
2023-06-08 12:17:47.541721 I | cephosd: skipping device "/mnt/ocs-deviceset-thin-csi-0-data-0jcr8c" because it contains a filesystem "ceph_bluestore"
2023-06-08 12:17:47.554120 I | cephosd: configuring osd devices: {"Entries":{}}
2023-06-08 12:17:47.554154 I | cephosd: no new devices to configure. returning devices already configured with ceph-volume.
2023-06-08 12:17:47.554162 D | exec: Running command: pvdisplay -C -o lvpath --noheadings /mnt/ocs-deviceset-thin-csi-0-data-0jcr8c
2023-06-08 12:17:47.595578 W | cephosd: failed to retrieve logical volume path for "/mnt/ocs-deviceset-thin-csi-0-data-0jcr8c". exit status 5
2023-06-08 12:17:47.595614 D | exec: Running command: lsblk /mnt/ocs-deviceset-thin-csi-0-data-0jcr8c --bytes --nodeps --pairs --paths --output SIZE,ROTA,RO,TYPE,PKNAME,NAME,KNAME,MOUNTPOINT,FSTYPE
2023-06-08 12:17:47.598594 D | sys: lsblk output: "SIZE=\"1048576\" ROTA=\"1\" RO=\"0\" TYPE=\"disk\" PKNAME=\"\" NAME=\"/dev/sdd\" KNAME=\"/dev/sdd\" MOUNTPOINT=\"\" FSTYPE=\"ceph_bluestore\""
2023-06-08 12:17:47.598832 D | exec: Running command: stdbuf -oL ceph-volume --log-path /tmp/ceph-log lvm list --format json
2023-06-08 12:17:48.007590 D | cephosd: {}
2023-06-08 12:17:48.007647 I | cephosd: 0 ceph-volume lvm osd devices configured on this node
2023-06-08 12:17:48.007658 D | exec: Running command: cryptsetup luksDump /mnt/ocs-deviceset-thin-csi-0-data-0jcr8c
2023-06-08 12:17:48.015059 E | cephosd: failed to determine if the encrypted block "/mnt/ocs-deviceset-thin-csi-0-data-0jcr8c" is from our cluster. failed to dump LUKS header for disk "/mnt/ocs-deviceset-thin-csi-0-data-0jcr8c".
Device /mnt/ocs-deviceset-thin-csi-0-data-0jcr8c is not a valid LUKS device.: exit status 1
2023-06-08 12:17:48.015111 D | exec: Running command: stdbuf -oL ceph-volume --log-path /tmp/ceph-log raw list /mnt/ocs-deviceset-thin-csi-0-data-0jcr8c --format json
2023-06-08 12:17:48.382898 D | cephosd: {}
2023-06-08 12:17:48.382933 I | cephosd: 0 ceph-volume raw osd devices configured on this node
2023-06-08 12:17:48.382945 W | cephosd: skipping OSD configuration as no devices matched the storage settings for this node "ocs-deviceset-thin-csi-0-data-0jcr8c"
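Note the 1Mi capacity on the thin-csi PV above. A quick way to confirm the
undersized request on the backing claim (a sketch; the PVC name is taken from
the listing above, and this should surface the raw request, which per the
discussion above is presumably the unitless "1" that the UI passes for
no-provisioner classes, here applied to a provisioner-backed class):

  # print the requested size of the device set's data PVC
  oc -n openshift-storage get pvc ocs-deviceset-thin-csi-0-data-0jcr8c \
    -o jsonpath='{.spec.resources.requests.storage}{"\n"}'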
Joy, thanks for confirming that this issue also happens on non-stretch clusters. As mentioned in Comment 29, I'll close this Rook issue, as separate issues have been opened for the UI to consider this error case and whether to disallow changing the storage class type.