Description of problem (please be as detailed as possible and provide log snippets):

Performed the following steps:
1. Deployed OCP/OCS with 4 nodes.
2. Verified the OCS deployment. Found 3 osd pods, as the storage cluster created by ocs-ci uses 3 storage devices (the deployment was done via ocs-ci, which creates a 3-osd OCS setup).
3. Labelled the fourth node. The other three nodes were already labelled by the ocs-ci code.
4. Checked PVs - found three bound PVs, but PVs existed for all four nodes.
5. Checked PVCs - found three.
6. Checked the storage cluster, cephcluster, and ceph health using the toolbox, etc.
7. Performed oc edit on the storage cluster and increased the storage device count from 3 to 4.
8. Noticed that within a minute the fourth PVC got created and a PV moved to Bound state.
9. A fourth osd pod did NOT get created.
10. Checked ceph health using the toolbox pod; it was HEALTH_OK.

Version of all relevant components (if applicable):
OCS 4.8. While running must-gather, got this output:
ClusterID: 7fbfbff8-6f2d-46f7-b1e9-26b8a641f1e5
ClusterVersion: Stable at "4.8.0-0.nightly-ppc64le-2021-06-13-040854"
ClusterOperators: All healthy and stable

Does this issue impact your ability to continue to work with the product (please explain in detail what is the user impact)?
No.

Is there any workaround available to the best of your knowledge?
No.

Rate from 1 - 5 the complexity of the scenario you performed that caused this bug (1 - very simple, 5 - very complex)?
1.

Is this issue reproducible?
Yes.

Can this issue be reproduced from the UI?
N/A

If this is a regression, please provide more details to justify this:
This procedure worked three days ago in an earlier nightly build of OCS 4.8.

Steps to Reproduce:
1. See above.

Actual results:
Additional osd pod did not get created.

Expected results:
Additional osd pod should get created.

Additional info:
Collected must-gather and the output was posted online. I will add a comment with the URL.
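For reference, a minimal sketch of the commands behind steps 3 and 7 (the node name is illustrative, and the StorageCluster name and storageDeviceSets layout are assumed here, not confirmed from this cluster):

# Label the extra worker so OCS can schedule storage pods on it (step 3):
oc label node worker-3 cluster.ocs.openshift.io/openshift-storage=""

# Increase the device count from 3 to 4, equivalent to the `oc edit` in step 7:
oc -n openshift-storage patch storagecluster ocs-storagecluster --type json \
  -p '[{"op": "replace", "path": "/spec/storageDeviceSets/0/count", "value": 4}]'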
Uploaded must gather file to https://drive.google.com/file/d/1hbA6zUBiVR4OnFK86z4WUieQe13QCLx4/view?usp=sharing
When storage cluster was updated, a new pod rook-ceph-osd-prepare-ocs-deviceset-localblock-0-data-3t8bnhhh5 0/1 Completed 0 21m got created in openshift-storage namespace successfully. But this did not create a new osd pod. Looked at the logs of this pod, and found this: [cpratt@nx122-ahv ~]$ oc logs rook-ceph-osd-prepare-ocs-deviceset-localblock-0-data-3t8bnhhh5 -n openshift-storage 2021-06-29 00:52:36.451100 I | cephcmd: desired devices to configure osds: [{Name:/mnt/ocs-deviceset-localblock-0-data-3t8bfc OSDsPerDevice:1 MetadataDevice: DatabaseSizeMB:0 DeviceClass: InitialWeight: IsFilter:false IsDevicePathFilter:false}] 2021-06-29 00:52:36.454219 I | rookcmd: starting Rook 4.8-154.a39fd5c.release_4.8 with arguments '/rook/rook ceph osd provision' 2021-06-29 00:52:36.454232 I | rookcmd: flag values: --cluster-id=4732e6d0-6358-4709-b09d-51454d5202b1, --cluster-name=ocs-storagecluster-cephcluster, --data-device-filter=, --data-device-path-filter=, --data-devices=[{"id":"/mnt/ocs-deviceset-localblock-0-data-3t8bfc","storeConfig":{"osdsPerDevice":1}}], --encrypted-device=false, --force-format=false, --help=false, --location=, --log-flush-frequency=5s, --log-level=DEBUG, --metadata-device=, --node-name=ocs-deviceset-localblock-0-data-3t8bfc, --operator-image=, --osd-crush-device-class=, --osd-crush-initial-weight=, --osd-database-size=0, --osd-wal-size=576, --osds-per-device=1, --pvc-backed-osd=true, --service-account= 2021-06-29 00:52:36.454238 I | op-mon: parsing mon endpoints: c=172.30.81.35:6789,a=172.30.87.53:6789,b=172.30.222.40:6789 2021-06-29 00:52:37.494710 I | op-osd: CRUSH location=root=default host=worker-3 2021-06-29 00:52:37.494728 I | cephcmd: crush location of osd: root=default host=worker-3 2021-06-29 00:52:37.494779 D | exec: Running command: dmsetup version 2021-06-29 00:52:37.498192 I | cephosd: Library version: 1.02.175-RHEL8 (2021-01-28) Driver version: 4.43.0 2021-06-29 00:52:37.513753 I | cephclient: writing config file /var/lib/rook/openshift-storage/openshift-storage.config 2021-06-29 00:52:37.513885 I | cephclient: generated admin config in /var/lib/rook/openshift-storage 2021-06-29 00:52:37.514016 D | cephclient: config file @ /etc/ceph/ceph.conf: [global] fsid = df03b6fd-8adb-4d45-8fee-e70c5205af54 mon initial members = c a b mon host = [v2:172.30.81.35:3300,v1:172.30.81.35:6789],[v2:172.30.87.53:3300,v1:172.30.87.53:6789],[v2:172.30.222.40:3300,v1:172.30.222.40:6789] public addr = 10.128.2.37 cluster addr = 10.128.2.37 mon_osd_full_ratio = .85 mon_osd_backfillfull_ratio = .8 mon_osd_nearfull_ratio = .75 mon_max_pg_per_osd = 600 [osd] osd_memory_target_cgroup_limit_ratio = 0.5 [client.admin] keyring = /var/lib/rook/openshift-storage/client.admin.keyring 2021-06-29 00:52:37.514024 I | cephosd: discovering hardware 2021-06-29 00:52:37.514034 D | exec: Running command: lsblk /mnt/ocs-deviceset-localblock-0-data-3t8bfc --bytes --nodeps --pairs --paths --output SIZE,ROTA,RO,TYPE,PKNAME,NAME,KNAME 2021-06-29 00:52:37.517238 D | exec: Running command: sgdisk --print /mnt/ocs-deviceset-localblock-0-data-3t8bfc 2021-06-29 00:52:37.519746 D | exec: Running command: udevadm info --query=property /dev/sda 2021-06-29 00:52:37.526529 I | cephosd: creating and starting the osds 2021-06-29 00:52:37.526567 D | cephosd: desiredDevices are [{Name:/mnt/ocs-deviceset-localblock-0-data-3t8bfc OSDsPerDevice:1 MetadataDevice: DatabaseSizeMB:0 DeviceClass: InitialWeight: IsFilter:false IsDevicePathFilter:false}] 2021-06-29 00:52:37.526571 D | cephosd: context.Devices are: 2021-06-29 
00:52:37.526607 D | cephosd: &{Name:/mnt/ocs-deviceset-localblock-0-data-3t8bfc Parent: HasChildren:false DevLinks:/dev/disk/by-id/wwn-0x6005076810810261f800000000000ebc /dev/disk/by-id/scsi-36005076810810261f800000000000ebc /dev/disk/by-id/scsi-SIBM_2145_02042040987eXX00 /dev/disk/by-path/fc-0xc050760b9483588a-0x5005076810214c9f-lun-0 /dev/disk/by-path/fc-0x5005076810214c9f-lun-0 Size:536870912000 UUID:1ddb41a5-0157-43f4-b530-db1e97e75482 Serial:36005076810810261f800000000000ebc Type:data Rotational:true Readonly:false Partitions:[] Filesystem: Vendor:IBM Model:2145 WWN:0x6005076810810261 WWNVendorExtension:0x6005076810810261f800000000000ebc Empty:false CephVolumeData: RealPath:/dev/sda KernelName:sda Encrypted:false} 2021-06-29 00:52:37.526784 D | exec: Running command: stdbuf -oL ceph-volume --log-path /tmp/ceph-log raw list /mnt/ocs-deviceset-localblock-0-data-3t8bfc --format json 2021-06-29 00:52:37.784218 E | cephosd: . Traceback (most recent call last): File "/usr/sbin/ceph-volume", line 11, in <module> load_entry_point('ceph-volume==1.0.0', 'console_scripts', 'ceph-volume')() File "/usr/lib/python3.6/site-packages/ceph_volume/main.py", line 39, in __init__ self.main(self.argv) File "/usr/lib/python3.6/site-packages/ceph_volume/decorators.py", line 59, in newfunc return f(*a, **kw) File "/usr/lib/python3.6/site-packages/ceph_volume/main.py", line 151, in main terminal.dispatch(self.mapper, subcommand_args) File "/usr/lib/python3.6/site-packages/ceph_volume/terminal.py", line 194, in dispatch instance.main() File "/usr/lib/python3.6/site-packages/ceph_volume/devices/raw/main.py", line 32, in main terminal.dispatch(self.mapper, self.argv) File "/usr/lib/python3.6/site-packages/ceph_volume/terminal.py", line 194, in dispatch instance.main() File "/usr/lib/python3.6/site-packages/ceph_volume/devices/raw/list.py", line 136, in main self.list(args) File "/usr/lib/python3.6/site-packages/ceph_volume/decorators.py", line 16, in is_root return func(*a, **kw) File "/usr/lib/python3.6/site-packages/ceph_volume/devices/raw/list.py", line 92, in list report = self.generate(args.device) File "/usr/lib/python3.6/site-packages/ceph_volume/devices/raw/list.py", line 80, in generate whoami = oj[dev]['whoami'] KeyError: 'whoami' 2021-06-29 00:52:37.784255 E | cephosd: [2021-06-29 00:52:37,744][ceph_volume.main][INFO ] Running command: ceph-volume --log-path /tmp/ceph-log raw list /mnt/ocs-deviceset-localblock-0-data-3t8bfc --format json [2021-06-29 00:52:37,745][ceph_volume.devices.raw.list][DEBUG ] Examining /mnt/ocs-deviceset-localblock-0-data-3t8bfc [2021-06-29 00:52:37,745][ceph_volume.process][INFO ] Running command: /usr/bin/ceph-bluestore-tool show-label --dev /mnt/ocs-deviceset-localblock-0-data-3t8bfc [2021-06-29 00:52:37,770][ceph_volume.process][INFO ] stdout { [2021-06-29 00:52:37,771][ceph_volume.process][INFO ] stdout "/mnt/ocs-deviceset-localblock-0-data-3t8bfc": { [2021-06-29 00:52:37,771][ceph_volume.process][INFO ] stdout "osd_uuid": "714fafed-ba18-4bc9-a402-8631d670d8ab", [2021-06-29 00:52:37,771][ceph_volume.process][INFO ] stdout "size": 536870912000, [2021-06-29 00:52:37,771][ceph_volume.process][INFO ] stdout "btime": "2021-06-29 00:52:10.546006", [2021-06-29 00:52:37,771][ceph_volume.process][INFO ] stdout "description": "main" [2021-06-29 00:52:37,771][ceph_volume.process][INFO ] stdout } [2021-06-29 00:52:37,771][ceph_volume.process][INFO ] stdout } [2021-06-29 00:52:37,771][ceph_volume][ERROR ] exception caught by decorator Traceback (most recent call last): File 
"/usr/lib/python3.6/site-packages/ceph_volume/decorators.py", line 59, in newfunc return f(*a, **kw) File "/usr/lib/python3.6/site-packages/ceph_volume/main.py", line 151, in main terminal.dispatch(self.mapper, subcommand_args) File "/usr/lib/python3.6/site-packages/ceph_volume/terminal.py", line 194, in dispatch instance.main() File "/usr/lib/python3.6/site-packages/ceph_volume/devices/raw/main.py", line 32, in main terminal.dispatch(self.mapper, self.argv) File "/usr/lib/python3.6/site-packages/ceph_volume/terminal.py", line 194, in dispatch instance.main() File "/usr/lib/python3.6/site-packages/ceph_volume/devices/raw/list.py", line 136, in main self.list(args) File "/usr/lib/python3.6/site-packages/ceph_volume/decorators.py", line 16, in is_root return func(*a, **kw) File "/usr/lib/python3.6/site-packages/ceph_volume/devices/raw/list.py", line 92, in list report = self.generate(args.device) File "/usr/lib/python3.6/site-packages/ceph_volume/devices/raw/list.py", line 80, in generate whoami = oj[dev]['whoami'] KeyError: 'whoami' 2021-06-29 00:52:37.784333 I | cephosd: skipping device "/mnt/ocs-deviceset-localblock-0-data-3t8bfc": failed to detect if there is already an osd. failed to retrieve ceph-volume raw list results: failed ceph-volume call (see ceph-volume log above for more details): exit status 1. 2021-06-29 00:52:37.799484 I | cephosd: configuring osd devices: {"Entries":{}} 2021-06-29 00:52:37.799503 I | cephosd: no new devices to configure. returning devices already configured with ceph-volume. 2021-06-29 00:52:37.799519 D | exec: Running command: pvdisplay -C -o lvpath --noheadings /mnt/ocs-deviceset-localblock-0-data-3t8bfc 2021-06-29 00:52:38.127250 W | cephosd: failed to retrieve logical volume path for "/mnt/ocs-deviceset-localblock-0-data-3t8bfc". exit status 5 2021-06-29 00:52:38.127285 D | exec: Running command: lsblk /mnt/ocs-deviceset-localblock-0-data-3t8bfc --bytes --nodeps --pairs --paths --output SIZE,ROTA,RO,TYPE,PKNAME,NAME,KNAME 2021-06-29 00:52:38.129940 D | exec: Running command: stdbuf -oL ceph-volume --log-path /tmp/ceph-log lvm list --format json 2021-06-29 00:52:38.719646 D | cephosd: {} 2021-06-29 00:52:38.719694 I | cephosd: 0 ceph-volume lvm osd devices configured on this node 2021-06-29 00:52:38.719735 D | exec: Running command: stdbuf -oL ceph-volume --log-path /tmp/ceph-log raw list /mnt/ocs-deviceset-localblock-0-data-3t8bfc --format json 2021-06-29 00:52:38.973421 E | cephosd: . 
Traceback (most recent call last): File "/usr/sbin/ceph-volume", line 11, in <module> load_entry_point('ceph-volume==1.0.0', 'console_scripts', 'ceph-volume')() File "/usr/lib/python3.6/site-packages/ceph_volume/main.py", line 39, in __init__ self.main(self.argv) File "/usr/lib/python3.6/site-packages/ceph_volume/decorators.py", line 59, in newfunc return f(*a, **kw) File "/usr/lib/python3.6/site-packages/ceph_volume/main.py", line 151, in main terminal.dispatch(self.mapper, subcommand_args) File "/usr/lib/python3.6/site-packages/ceph_volume/terminal.py", line 194, in dispatch instance.main() File "/usr/lib/python3.6/site-packages/ceph_volume/devices/raw/main.py", line 32, in main terminal.dispatch(self.mapper, self.argv) File "/usr/lib/python3.6/site-packages/ceph_volume/terminal.py", line 194, in dispatch instance.main() File "/usr/lib/python3.6/site-packages/ceph_volume/devices/raw/list.py", line 136, in main self.list(args) File "/usr/lib/python3.6/site-packages/ceph_volume/decorators.py", line 16, in is_root return func(*a, **kw) File "/usr/lib/python3.6/site-packages/ceph_volume/devices/raw/list.py", line 92, in list report = self.generate(args.device) File "/usr/lib/python3.6/site-packages/ceph_volume/devices/raw/list.py", line 80, in generate whoami = oj[dev]['whoami'] KeyError: 'whoami' 2021-06-29 00:52:38.973447 E | cephosd: [2021-06-29 00:52:37,744][ceph_volume.main][INFO ] Running command: ceph-volume --log-path /tmp/ceph-log raw list /mnt/ocs-deviceset-localblock-0-data-3t8bfc --format json [2021-06-29 00:52:37,745][ceph_volume.devices.raw.list][DEBUG ] Examining /mnt/ocs-deviceset-localblock-0-data-3t8bfc [2021-06-29 00:52:37,745][ceph_volume.process][INFO ] Running command: /usr/bin/ceph-bluestore-tool show-label --dev /mnt/ocs-deviceset-localblock-0-data-3t8bfc [2021-06-29 00:52:37,770][ceph_volume.process][INFO ] stdout { [2021-06-29 00:52:37,771][ceph_volume.process][INFO ] stdout "/mnt/ocs-deviceset-localblock-0-data-3t8bfc": { [2021-06-29 00:52:37,771][ceph_volume.process][INFO ] stdout "osd_uuid": "714fafed-ba18-4bc9-a402-8631d670d8ab", [2021-06-29 00:52:37,771][ceph_volume.process][INFO ] stdout "size": 536870912000, [2021-06-29 00:52:37,771][ceph_volume.process][INFO ] stdout "btime": "2021-06-29 00:52:10.546006", [2021-06-29 00:52:37,771][ceph_volume.process][INFO ] stdout "description": "main" [2021-06-29 00:52:37,771][ceph_volume.process][INFO ] stdout } [2021-06-29 00:52:37,771][ceph_volume.process][INFO ] stdout } [2021-06-29 00:52:37,771][ceph_volume][ERROR ] exception caught by decorator Traceback (most recent call last): File "/usr/lib/python3.6/site-packages/ceph_volume/decorators.py", line 59, in newfunc return f(*a, **kw) File "/usr/lib/python3.6/site-packages/ceph_volume/main.py", line 151, in main terminal.dispatch(self.mapper, subcommand_args) File "/usr/lib/python3.6/site-packages/ceph_volume/terminal.py", line 194, in dispatch instance.main() File "/usr/lib/python3.6/site-packages/ceph_volume/devices/raw/main.py", line 32, in main terminal.dispatch(self.mapper, self.argv) File "/usr/lib/python3.6/site-packages/ceph_volume/terminal.py", line 194, in dispatch instance.main() File "/usr/lib/python3.6/site-packages/ceph_volume/devices/raw/list.py", line 136, in main self.list(args) File "/usr/lib/python3.6/site-packages/ceph_volume/decorators.py", line 16, in is_root return func(*a, **kw) File "/usr/lib/python3.6/site-packages/ceph_volume/devices/raw/list.py", line 92, in list report = self.generate(args.device) File 
"/usr/lib/python3.6/site-packages/ceph_volume/devices/raw/list.py", line 80, in generate whoami = oj[dev]['whoami'] KeyError: 'whoami' [2021-06-29 00:52:38,345][ceph_volume.main][INFO ] Running command: ceph-volume --log-path /tmp/ceph-log lvm list --format json [2021-06-29 00:52:38,346][ceph_volume.process][INFO ] Running command: /usr/sbin/lvs --noheadings --readonly --separator=";" -a -S -o lv_tags,lv_path,lv_name,vg_name,lv_uuid,lv_size [2021-06-29 00:52:38,707][ceph_volume.process][INFO ] stderr Udev is running and DM_DISABLE_UDEV environment variable is set. Bypassing udev, LVM will manage logical volume symlinks in device directory. [2021-06-29 00:52:38,707][ceph_volume.process][INFO ] stderr Udev is running and DM_DISABLE_UDEV environment variable is set. Bypassing udev, LVM will obtain device list by scanning device directory. [2021-06-29 00:52:38,934][ceph_volume.main][INFO ] Running command: ceph-volume --log-path /tmp/ceph-log raw list /mnt/ocs-deviceset-localblock-0-data-3t8bfc --format json [2021-06-29 00:52:38,935][ceph_volume.devices.raw.list][DEBUG ] Examining /mnt/ocs-deviceset-localblock-0-data-3t8bfc [2021-06-29 00:52:38,935][ceph_volume.process][INFO ] Running command: /usr/bin/ceph-bluestore-tool show-label --dev /mnt/ocs-deviceset-localblock-0-data-3t8bfc [2021-06-29 00:52:38,960][ceph_volume.process][INFO ] stdout { [2021-06-29 00:52:38,960][ceph_volume.process][INFO ] stdout "/mnt/ocs-deviceset-localblock-0-data-3t8bfc": { [2021-06-29 00:52:38,960][ceph_volume.process][INFO ] stdout "osd_uuid": "714fafed-ba18-4bc9-a402-8631d670d8ab", [2021-06-29 00:52:38,960][ceph_volume.process][INFO ] stdout "size": 536870912000, [2021-06-29 00:52:38,960][ceph_volume.process][INFO ] stdout "btime": "2021-06-29 00:52:10.546006", [2021-06-29 00:52:38,960][ceph_volume.process][INFO ] stdout "description": "main" [2021-06-29 00:52:38,960][ceph_volume.process][INFO ] stdout } [2021-06-29 00:52:38,960][ceph_volume.process][INFO ] stdout } [2021-06-29 00:52:38,960][ceph_volume][ERROR ] exception caught by decorator Traceback (most recent call last): File "/usr/lib/python3.6/site-packages/ceph_volume/decorators.py", line 59, in newfunc return f(*a, **kw) File "/usr/lib/python3.6/site-packages/ceph_volume/main.py", line 151, in main terminal.dispatch(self.mapper, subcommand_args) File "/usr/lib/python3.6/site-packages/ceph_volume/terminal.py", line 194, in dispatch instance.main() File "/usr/lib/python3.6/site-packages/ceph_volume/devices/raw/main.py", line 32, in main terminal.dispatch(self.mapper, self.argv) File "/usr/lib/python3.6/site-packages/ceph_volume/terminal.py", line 194, in dispatch instance.main() File "/usr/lib/python3.6/site-packages/ceph_volume/devices/raw/list.py", line 136, in main self.list(args) File "/usr/lib/python3.6/site-packages/ceph_volume/decorators.py", line 16, in is_root return func(*a, **kw) File "/usr/lib/python3.6/site-packages/ceph_volume/devices/raw/list.py", line 92, in list report = self.generate(args.device) File "/usr/lib/python3.6/site-packages/ceph_volume/devices/raw/list.py", line 80, in generate whoami = oj[dev]['whoami'] KeyError: 'whoami' 2021-06-29 00:52:38.973505 I | cephosd: failed to get device already provisioned by ceph-volume raw. 
failed to retrieve ceph-volume raw list results: failed ceph-volume call (see ceph-volume log above for more details): exit status 1 2021-06-29 00:52:38.973515 W | cephosd: skipping OSD configuration as no devices matched the storage settings for this node "ocs-deviceset-localblock-0-data-3t8bfc" [cpratt@nx122-ahv ~]$
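For completeness, the OSD view can be double-checked from the toolbox; a minimal sketch (the rook-ceph-tools pod selector below is the usual Rook toolbox label and may differ in a given deployment):

# Find the toolbox pod and ask Ceph directly how many OSDs it knows about:
TOOLS=$(oc -n openshift-storage get pod -l app=rook-ceph-tools -o name | head -n1)
oc -n openshift-storage rsh "$TOOLS" ceph status
oc -n openshift-storage rsh "$TOOLS" ceph osd tree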
Since this is related to a Ceph command, at least on the surface, I am moving it to Rook so they can take a closer look. However, we still need to determine whether this is a blocker for IBM on OCS 4.8.0. If it is, we should see whether they can take a 4.8.z release or if this is actually a release blocker. Could you at least verify which platform this was tested on (IBM Cloud, Z, etc.)? Depending on our official support statement for that platform we may be able to push it out.

Also, since you only provided the OCS must-gather, we don't have information on the broader cluster state. Could you take an OCP must-gather as well and provide information on the nodes in this BZ?
This was tested on IBM Power (ppc64le). I will collect an OCP must-gather (using oc adm must-gather), upload it, and add its URL here.
ClusterID: 7fbfbff8-6f2d-46f7-b1e9-26b8a641f1e5
ClusterVersion: Stable at "4.8.0-0.nightly-ppc64le-2021-06-13-040854"
ClusterOperators: All healthy and stable

OCP must-gather logs in: https://drive.google.com/file/d/1eQr27QRceeS_f_N7OtSHLIh3xOcTkEmk/view?usp=sharing
The environment used for this test is PowerVS based OCP on IBM Cloud.
Seb, any ideas on this one? ceph-volume doesn't seem to like running on this platform, but it's interesting that the first 3 OSDs were created successfully and this fourth one failed.
This typically happens when the block device is half prepared, meaning the initial ceph-volume call failed (the exclusive lock could not be acquired on the disk) and never recovered. The next instance of the prepare job sees this half-prepared OSD and cannot list it, since some metadata is missing from the block. I believe the env might be underpowered or running under high load. Anyway, we can fix/mitigate this if the ocs-operator adds "bdev_flock_retry = 20" to the rook-config-override config map. Moving back to ocs-operator.
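For completeness, a rough sketch of the shape of that change (hand-applied here purely for illustration; in OCS the rook-config-override ConfigMap is reconciled by the ocs-operator, so the real fix needs to land there):

# Illustrative only: what the override ConfigMap would carry once the option is added.
cat <<'EOF' | oc -n openshift-storage apply -f -
apiVersion: v1
kind: ConfigMap
metadata:
  name: rook-config-override
  namespace: openshift-storage
data:
  config: |
    [global]
    bdev_flock_retry = 20
EOF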
Sridhar, nothing has changed in OCS between 4.8 builds which could have caused it.
As mentioned by Sebastien this should be an environment issue and doesn't look like a blocker for 4.8 to me.
Let me know WDYT?

Sebastien,
Are there any side effects of enabling this option?
(In reply to Mudit Agarwal from comment #10)
> Sridhar, nothing has changed in OCS between 4.8 builds which could have
> caused it.
> As mentioned by Sebastien this should be an environment issue and doesn't
> look like a blocker for 4.8 to me.
> Let me know WDYT?

This does not look like a blocker to me either.

> Sebastien,
> Are there any side effects of enabling this option?

It potentially adds up to 20 milliseconds to the bootstrap time of each OSD, which is negligible.
Mudit, due to this problem we are not able to add an osd to the storage, i.e. expand the cluster storage. So far we have done this manually, and we are now automating it in the ocs-ci code. Before committing the ocs-ci code changes, we wanted to verify this manually, and it does not work. I am not sure what the issue in the environment could be.
(In reply to Mudit Agarwal from comment #10)
> Sridhar, nothing has changed in OCS between 4.8 builds which could have
> caused it.

@sridhar, can you confirm with which version this worked for you? I.e., did it work with some earlier 4.8 builds?

Even if nothing changed in the OCS code, something must have changed if it is a regression:
- Something in the environment
- Something in the kernel
- Something in OCP
- Something in Ceph
- Something in OCS (we ruled that out if it worked in earlier 4.8)
- Something in the deployment / setup

Note that the description says it was deployed using "ocs-ci":
==> Maybe there was a change in ocs-ci that led to this regression?
And I am not a big fan of just adding a random timeout/retry mechanism at some point, unless we think it's conceptually correct and we want to do it anyway.
@madam No, there were no changes in the environment; the method and specs used were the same. We did not update the kernel of RHCOS on the worker nodes (we are planning to update to 4.8 rc3, and I can try to reproduce it on that). OCP 4.8 of course keeps getting modified, as it has not GAed yet. The same goes for Ceph, since we are using OCS 4.8 under development/test now, and for ocs-ci. I will try to reproduce this problem again.
I am able to reproduce this problem with latest OCS and RHCOS 4.8 rc3 image. Here is how I am able to reproduce this problem: 1. Deploy OCP with 4 worker nodes. 2. Use ocs-ci to deploy OCS which limits OCS deploy to 3 nodes. 3. Verify pods in openshift-storage (find three osd pods) and ceph health to be HEALTH_OK. 4. Ensure pv and pvcs are correctly setup. See 3 PVs in bound state. See 4 PVCs in bound state. 5. Label the fourth node cluster.ocs.openshift.io/openshift-storage="". 6. Additional pv get created. 7. using oc edit, patch storagecluster object by incrementing the count under storageDeviceSets. 8. Notice fourth osd-prepare pod created under openshift-storage namespace. but no new osd pod get created: [cpratt@nx122-ahv logs-cron]$ oc get pods -n openshift-storage | grep osd rook-ceph-osd-0-76df7866d-6cv7p 2/2 Running 0 4h58m rook-ceph-osd-1-c79449694-rdrvh 2/2 Running 0 4h58m rook-ceph-osd-2-5dd5f65fb-cbgxq 2/2 Running 0 4h58m rook-ceph-osd-prepare-ocs-deviceset-localblock-0-data-04hxz6n6f 0/1 Completed 0 4h58m rook-ceph-osd-prepare-ocs-deviceset-localblock-0-data-1j9z2rwx4 0/1 Completed 0 4h58m rook-ceph-osd-prepare-ocs-deviceset-localblock-0-data-2l255q4g6 0/1 Completed 0 4h58m rook-ceph-osd-prepare-ocs-deviceset-localblock-0-data-3f2lq74kx 0/1 Completed 0 9m48s [cpratt@nx122-ahv logs-cron]$ Looking at logs of oc logs rook-ceph-osd-prepare-ocs-deviceset-localblock-0-data-3f2lq74kx -n openshift-storage pod (newly created prepare osd pod), see this error message as reported earlier: 2021-07-10 23:55:30,164][ceph_volume.main][INFO ] Running command: ceph-volume --log-path /tmp/ceph-log raw list /mnt/ocs-deviceset-localblock-0-data-3f2ln5 --format json [2021-07-10 23:55:30,165][ceph_volume.devices.raw.list][DEBUG ] Examining /mnt/ocs-deviceset-localblock-0-data-3f2ln5 [2021-07-10 23:55:30,165][ceph_volume.process][INFO ] Running command: /usr/bin/ceph-bluestore-tool show-label --dev /mnt/ocs-deviceset-localblock-0-data-3f2ln5 [2021-07-10 23:55:30,194][ceph_volume.process][INFO ] stdout { [2021-07-10 23:55:30,194][ceph_volume.process][INFO ] stdout "/mnt/ocs-deviceset-localblock-0-data-3f2ln5": { [2021-07-10 23:55:30,194][ceph_volume.process][INFO ] stdout "osd_uuid": "95a095d6-03b7-464b-933b-530320b2ecfe", [2021-07-10 23:55:30,194][ceph_volume.process][INFO ] stdout "size": 536870912000, [2021-07-10 23:55:30,194][ceph_volume.process][INFO ] stdout "btime": "2021-07-10 23:55:00.859653", [2021-07-10 23:55:30,194][ceph_volume.process][INFO ] stdout "description": "main" [2021-07-10 23:55:30,194][ceph_volume.process][INFO ] stdout } [2021-07-10 23:55:30,194][ceph_volume.process][INFO ] stdout } [2021-07-10 23:55:30,195][ceph_volume][ERROR ] exception caught by decorator Traceback (most recent call last): File "/usr/lib/python3.6/site-packages/ceph_volume/decorators.py", line 59, in newfunc return f(*a, **kw) File "/usr/lib/python3.6/site-packages/ceph_volume/main.py", line 151, in main terminal.dispatch(self.mapper, subcommand_args) File "/usr/lib/python3.6/site-packages/ceph_volume/terminal.py", line 194, in dispatch instance.main() File "/usr/lib/python3.6/site-packages/ceph_volume/devices/raw/main.py", line 32, in main terminal.dispatch(self.mapper, self.argv) File "/usr/lib/python3.6/site-packages/ceph_volume/terminal.py", line 194, in dispatch instance.main() File "/usr/lib/python3.6/site-packages/ceph_volume/devices/raw/list.py", line 136, in main self.list(args) File "/usr/lib/python3.6/site-packages/ceph_volume/decorators.py", line 16, in is_root return func(*a, **kw) File 
"/usr/lib/python3.6/site-packages/ceph_volume/devices/raw/list.py", line 92, in list report = self.generate(args.device) File "/usr/lib/python3.6/site-packages/ceph_volume/devices/raw/list.py", line 80, in generate whoami = oj[dev]['whoami'] KeyError: 'whoami' 2021-07-10 23:55:30.209375 I | cephosd: failed to get device already provisioned by ceph-volume raw. failed to retrieve ceph-volume raw list results: failed ceph-volume call (see ceph-volume log above for more details): exit status 1
(In reply to Sridhar Venkat (IBM) from comment #15)
> @madam No, there were no changes in the environment, The method
> and specs used were the same. We did not update the kernel of RHCOS on
> worker node. (we are planning to update to 4.8 rc3 and i can try to
> reproduce it on it). OCP 4.8 of course keeps getting modified as it has not
> GAed yet. And same for ceph as we are using OCS 4.8 under development/test
> now. And same for ocs-ci. I will try to reproduce this problem again.

Thanks Sridhar,

Can you confirm whether this has worked at all with a 4.8 build at some point?

Thanks - Michael
(In reply to Michael Adam from comment #17)
>
> Thanks Sridhar,
>
> Can you confirm whether this has worked at all with a 4.8 build at some
> point?
>
> Thanks - Michael

From the bug description (https://bugzilla.redhat.com/show_bug.cgi?id=1977121#c0):

>> If this is a regression, please provide more details to justify this:
>> This procedure worked three days ago in an earlier nightly build of OCS 4.8.
(In reply to Mudit Agarwal from comment #18)
> (In reply to Michael Adam from comment #17)
> >
> > Thanks Sridhar,
> >
> > Can you confirm whether this has worked at all with a 4.8 build at some
> > point?
> >
> > Thanks - Michael
>
> From the bug description
> (https://bugzilla.redhat.com/show_bug.cgi?id=1977121#c0):
>
> >> If this is a regression, please provide more details to justify this:
> >> This procedure worked three days ago in an earlier nightly build of OCS 4.8.

Thanks Mudit! :-)

Clearing needinfo...
I'd like to clarify that this is not "just a random timeout/retry"; it is effectively the only workaround (short of disabling/removing systemd-udevd, which is not possible since we have no control over the host) that the various teams (Ceph/Rook) have agreed on. Also, looking on the Internet, we can see we are not the only project "suffering" from this. It helps us work around a well-known race between Ceph trying to acquire an exclusive lock on the block device and systemd-udevd having already opened that block device. During the OSD preparation sequence, some operations are performed which effectively trigger systemd-udevd, and those operations are critical for the OSD preparation. More reading: https://systemd.io/BLOCK_DEVICE_LOCKING/.

To conclude:
* it's hard to repro and wasn't introduced as a regression
* it adds up to 20 milliseconds maximum, so it's negligible
* it's an easy workaround that can provide more stability

Thanks!
Thanks for clarifying, Séb! So if this is a change that's generally desirable, and non-risky, we can go ahead and merge it for the next RC.
(In reply to Michael Adam from comment #21)
> Thanks for clarifying, Séb!
>
> So if this is a change that's generally desirable, and non-risky, we can go
> ahead and merge it for the next RC.

Sounds good to me, thanks, Michael!
I see the PRs merged; I assume the changes will make it into tonight's OCS build. I will attempt to reproduce this problem tomorrow.
I tried with latest ocs build 4.8.0-452.ci. And see the same problem. 1. I checked oc get pv, shows new pvs for added node. I see four pvs in Bound state. 2. I checked oc get pvc, see four pvcs in bound state. 3. Looked at localvolumediscovery - auto-discover-devices. In its describe output Events: Type Reason Age From Message ---- ------ ---- ---- ------- Normal CreatedDiscoveryResultObject 34m local-storage-diskmaker worker-0 - successfully created LocalVolumeDiscoveryResult resource Normal UpdatedDiscoveredDeviceList 34m local-storage-diskmaker worker-0 - successfully updated discovered device details in the LocalVolumeDiscoveryResult resource Normal CreatedDiscoveryResultObject 34m local-storage-diskmaker worker-1 - successfully created LocalVolumeDiscoveryResult resource Normal UpdatedDiscoveredDeviceList 34m local-storage-diskmaker worker-1 - successfully updated discovered device details in the LocalVolumeDiscoveryResult resource Normal CreatedDiscoveryResultObject 34m local-storage-diskmaker worker-2 - successfully created LocalVolumeDiscoveryResult resource Normal UpdatedDiscoveredDeviceList 34m local-storage-diskmaker worker-2 - successfully updated discovered device details in the LocalVolumeDiscoveryResult resource Normal CreatedDiscoveryResultObject 34m local-storage-diskmaker worker-3 - successfully created LocalVolumeDiscoveryResult resource Normal UpdatedDiscoveredDeviceList 34m local-storage-diskmaker worker-3 - successfully updated discovered device details in the LocalVolumeDiscoveryResult resource I see worker-3 entries (new node). $ oc get LocalVolumeDiscoveryResult -A -o wide NAMESPACE NAME AGE openshift-local-storage discovery-result-worker-0 35m openshift-local-storage discovery-result-worker-1 35m openshift-local-storage discovery-result-worker-2 35m openshift-local-storage discovery-result-worker-3 34m 4. Looked for osd prepare pods in openshift-storage namespace: rook-ceph-osd-prepare-ocs-deviceset-localblock-0-data-0kbpzz7mp 0/1 Completed 0 31m 10.131.0.51 worker-1 <none> <none> rook-ceph-osd-prepare-ocs-deviceset-localblock-0-data-1dgc7j9dr 0/1 Completed 0 31m 10.128.2.17 worker-2 <none> <none> rook-ceph-osd-prepare-ocs-deviceset-localblock-0-data-255949hqj 0/1 Completed 0 31m 10.129.2.19 worker-0 <none> <none> rook-ceph-osd-prepare-ocs-deviceset-localblock-0-data-36x7hf9z9 0/1 Completed 1 10m 10.130.2.25 worker-3 <none> <none> I see one for new node. 
but see only three osd pods rook-ceph-osd-0-775f4cff5c-gcprz 2/2 Running 0 30m 10.131.0.52 worker-1 <none> <none> rook-ceph-osd-1-cfc878b8b-lqdlh 2/2 Running 0 30m 10.129.2.20 worker-0 <none> <none> rook-ceph-osd-2-6d99cfc6c8-zmdsc 2/2 Running 0 30m 10.128.2.18 worker-2 <none> <none> In the log of rook-ceph-osd-prepare-ocs-deviceset-localblock-0-data-36x7hf9z9 I see the same error message: [2021-07-13 16:16:55,919][ceph_volume.devices.raw.list][DEBUG ] Examining /mnt/ocs-deviceset-localblock-0-data-36x7nt [2021-07-13 16:16:55,919][ceph_volume.process][INFO ] Running command: /usr/bin/ceph-bluestore-tool show-label --dev /mnt/ocs-deviceset-localblock-0-data-36x7nt [2021-07-13 16:16:55,946][ceph_volume.process][INFO ] stdout { [2021-07-13 16:16:55,946][ceph_volume.process][INFO ] stdout "/mnt/ocs-deviceset-localblock-0-data-36x7nt": { [2021-07-13 16:16:55,946][ceph_volume.process][INFO ] stdout "osd_uuid": "dd7a4d02-1246-4d1e-927a-c42a229bba95", [2021-07-13 16:16:55,946][ceph_volume.process][INFO ] stdout "size": 536870912000, [2021-07-13 16:16:55,946][ceph_volume.process][INFO ] stdout "btime": "2021-07-13 16:16:41.917372", [2021-07-13 16:16:55,946][ceph_volume.process][INFO ] stdout "description": "main" [2021-07-13 16:16:55,946][ceph_volume.process][INFO ] stdout } [2021-07-13 16:16:55,946][ceph_volume.process][INFO ] stdout } [2021-07-13 16:16:55,947][ceph_volume][ERROR ] exception caught by decorator Traceback (most recent call last): File "/usr/lib/python3.6/site-packages/ceph_volume/decorators.py", line 59, in newfunc return f(*a, **kw) File "/usr/lib/python3.6/site-packages/ceph_volume/main.py", line 151, in main terminal.dispatch(self.mapper, subcommand_args) File "/usr/lib/python3.6/site-packages/ceph_volume/terminal.py", line 194, in dispatch instance.main() File "/usr/lib/python3.6/site-packages/ceph_volume/devices/raw/main.py", line 32, in main terminal.dispatch(self.mapper, self.argv) File "/usr/lib/python3.6/site-packages/ceph_volume/terminal.py", line 194, in dispatch instance.main() File "/usr/lib/python3.6/site-packages/ceph_volume/devices/raw/list.py", line 136, in main self.list(args) File "/usr/lib/python3.6/site-packages/ceph_volume/decorators.py", line 16, in is_root return func(*a, **kw) File "/usr/lib/python3.6/site-packages/ceph_volume/devices/raw/list.py", line 92, in list report = self.generate(args.device) File "/usr/lib/python3.6/site-packages/ceph_volume/devices/raw/list.py", line 80, in generate whoami = oj[dev]['whoami'] KeyError: 'whoami' 2021-07-13 16:16:55.965316 I | cephosd: failed to get device already provisioned by ceph-volume raw. failed to retrieve ceph-volume raw list results: failed ceph-volume call (see ceph-volume log above for more details): exit status 1 2021-07-13 16:16:55.965325 W | cephosd: skipping OSD configuration as no devices matched the storage settings for this node "ocs-deviceset-localblock-0-data-36x7nt" I see matching pv in bound state. $ oc get pv | grep "36x7nt" local-pv-28b9d3a6 500Gi RWO Delete Bound openshift-storage/ocs-deviceset-localblock-0-data-36x7nt localblock 37m
Sridhar, can you attach the must-gather, please?
Mudit, the must-gather is here: https://drive.google.com/file/d/1mN6bYkVrqCAzOWGefOyn6v0qfcSdwzZ1/view?usp=sharing

[cpratt@nx122-ahv ~]$ oc adm must-gather --image=quay.io/rhceph-dev/ocs-must-gather:latest-4.8 --dest-dir=/home/cpratt/mg
[must-gather ] OUT Using must-gather plug-in image: quay.io/rhceph-dev/ocs-must-gather:latest-4.8
When opening a support case, bugzilla, or issue please include the following summary data along with any other requested information.
ClusterID: 8086c1b2-5afd-4497-896e-59cc55c1d3ca
ClusterVersion: Stable at "4.8.0-0.nightly-ppc64le-2021-06-13-040854"
ClusterOperators: All healthy and stable

We are going to try a few other options: we will deploy the OCS cluster using the UI and a LocalVolume, and try to expand it. This is not using ocs-ci. I will update this BZ with details.
This is the OCP oc adm must-gather: https://drive.google.com/file/d/1qCkE8djVQN-oWWI88QSNDPY4qMxDxoao/view?usp=sharing in case you need to look at OCP-level must-gather logs.
Cluster Expansion using UI went through successfully. I deployed OCS 4.8 on OCP 4.8 on IBM Power Systems(ppc64le) using localVolume in OperatorHub. Then I added a new node, labelled it with cluster.ocs.openshift.io/openshift-storage='' and edited my localvolume yaml file in order to add a newly added node. It created a new PV and corresponding diskmaker-manager pod in openshift-local-storage namespace. After that I clicked on Add Capacity option in StorageCluster tab . It added new osd (ie. 4th one) in OCS Storage Cluster. [root@rdr-aarui-sao01-bastion-0 ~]# oc get csv -A NAMESPACE NAME DISPLAY VERSION REPLACES PHASE openshift-local-storage local-storage-operator.4.8.0-202106291913 Local Storage 4.8.0-202106291913 Succeeded openshift-operator-lifecycle-manager packageserver Package Server 0.17.0 Succeeded openshift-storage ocs-operator.v4.8.0-450.ci OpenShift Container Storage 4.8.0-450.ci Succeeded [root@rdr-aarui-sao01-bastion-0 ~]# oc get nodes NAME STATUS ROLES AGE VERSION rdr-aarui-sao01-master-0 Ready master 105m v1.21.0-rc.0+9bec1e1 rdr-aarui-sao01-master-1 Ready master 105m v1.21.0-rc.0+9bec1e1 rdr-aarui-sao01-master-2 Ready master 105m v1.21.0-rc.0+9bec1e1 rdr-aarui-sao01-worker-0 Ready worker 93m v1.21.0-rc.0+9bec1e1 rdr-aarui-sao01-worker-1 Ready worker 93m v1.21.0-rc.0+9bec1e1 rdr-aarui-sao01-worker-2 Ready worker 89m v1.21.0-rc.0+9bec1e1 rdr-aarui-sao01-worker-3 Ready worker 3m18s v1.21.0-rc.0+9bec1e1 [root@rdr-aarui-sao01-bastion-0 ~]# oc get pods -n openshift-local-storage NAME READY STATUS RESTARTS AGE diskmaker-discovery-q4mf8 1/1 Running 0 52m diskmaker-discovery-qmhzt 1/1 Running 0 52m diskmaker-discovery-tkb9h 1/1 Running 0 52m diskmaker-manager-ghnhb 1/1 Running 0 61s diskmaker-manager-p286p 1/1 Running 0 22s diskmaker-manager-qdgcn 1/1 Running 0 48s diskmaker-manager-tz7g2 1/1 Running 0 35s local-storage-operator-75cc958dcb-ggqrb 1/1 Running 0 55m [root@rdr-aarui-sao01-bastion-0 ~]# oc get pods -n openshift-storage NAME READY STATUS RESTARTS AGE csi-cephfsplugin-75ggd 3/3 Running 0 46m csi-cephfsplugin-ld59m 3/3 Running 0 46m csi-cephfsplugin-provisioner-78d7667cb8-57ztt 6/6 Running 0 46m csi-cephfsplugin-provisioner-78d7667cb8-5shgn 6/6 Running 0 46m csi-cephfsplugin-ptd5l 3/3 Running 0 46m csi-cephfsplugin-zfnm8 3/3 Running 0 10m csi-rbdplugin-5kx4f 3/3 Running 0 46m csi-rbdplugin-5vb7p 3/3 Running 0 46m csi-rbdplugin-l6hvv 3/3 Running 0 46m csi-rbdplugin-provisioner-778cccf7c7-n8q9k 6/6 Running 0 46m csi-rbdplugin-provisioner-778cccf7c7-qc7fp 6/6 Running 0 46m csi-rbdplugin-rf4xj 3/3 Running 0 10m noobaa-core-0 1/1 Running 0 44m noobaa-db-pg-0 1/1 Running 0 44m noobaa-endpoint-95f6ff898-2zm2s 1/1 Running 0 42m noobaa-operator-7c7b57b94d-74495 1/1 Running 0 61m ocs-metrics-exporter-75dc5ccf5f-tgjrf 1/1 Running 0 61m ocs-operator-79f86d998-gzl87 1/1 Running 0 61m rook-ceph-crashcollector-rdr-aarui-sao01-worker-0-68b547d629qd4 1/1 Running 0 44m rook-ceph-crashcollector-rdr-aarui-sao01-worker-1-78dd67d4mgggd 1/1 Running 0 44m rook-ceph-crashcollector-rdr-aarui-sao01-worker-2-6b7b7df8tchpl 1/1 Running 0 44m rook-ceph-crashcollector-rdr-aarui-sao01-worker-3-575bddbfcrk6f 1/1 Running 0 43s rook-ceph-mds-ocs-storagecluster-cephfilesystem-a-649b4fdfg6d52 2/2 Running 0 44m rook-ceph-mds-ocs-storagecluster-cephfilesystem-b-5499ff7c2cjp7 2/2 Running 0 44m rook-ceph-mgr-a-6857cc86dd-m6tpl 2/2 Running 0 44m rook-ceph-mon-a-55cfc4f874-vrlzz 2/2 Running 0 45m rook-ceph-mon-b-79db868b56-tcpsg 2/2 Running 0 45m rook-ceph-mon-c-5f5df546f4-t7vxv 2/2 Running 0 45m 
rook-ceph-operator-7747b9d995-vrl68 1/1 Running 0 61m rook-ceph-osd-0-775bd5f75b-rsvj8 2/2 Running 0 44m rook-ceph-osd-1-6f8d8d6976-f8lgb 2/2 Running 0 44m rook-ceph-osd-2-b7f98cb86-h82hq 2/2 Running 0 44m rook-ceph-osd-3-5dfd86c75f-lt827 2/2 Running 0 43s rook-ceph-osd-prepare-ocs-deviceset-localblock-0-data-0rqhdx5zj 0/1 Completed 0 44m rook-ceph-osd-prepare-ocs-deviceset-localblock-0-data-12pwdbcms 0/1 Completed 0 44m rook-ceph-osd-prepare-ocs-deviceset-localblock-0-data-2hbr88vzd 0/1 Completed 0 44m rook-ceph-osd-prepare-ocs-deviceset-localblock-0-data-3g7v2z9pk 0/1 Completed 0 58s rook-ceph-rgw-ocs-storagecluster-cephobjectstore-a-84bc5fbb92bz 2/2 Running 0 43m [root@rdr-aarui-sao01-bastion-0 ~]# oc get pvc -n openshift-storage NAME STATUS VOLUME CAPACITY ACCESS MODES STORAGECLASS AGE db-noobaa-db-pg-0 Bound pvc-3dd2104f-378a-438f-95a4-ea05ff451752 50Gi RWO ocs-storagecluster-ceph-rbd 45m ocs-deviceset-localblock-0-data-0rqhgs Bound local-pv-571cfc66 500Gi RWO localblock 45m ocs-deviceset-localblock-0-data-12pwhk Bound local-pv-1f98cfa3 500Gi RWO localblock 45m ocs-deviceset-localblock-0-data-2hbrx7 Bound local-pv-6cbcf5e1 500Gi RWO localblock 45m ocs-deviceset-localblock-0-data-3g7vqt Bound local-pv-21b94170 500Gi RWO localblock 83s But we are still unable to expand the storageCluster from ocs-ci which uses LocalVolumeset.
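For reference, a sketch of the LocalVolume edit described above: the new worker is appended to the node selector so LSO creates a localblock PV for its disk. The CR name, device path and hostnames are illustrative (taken from the output above) and should be matched to the actual cluster:

cat <<'EOF' | oc apply -f -
apiVersion: local.storage.openshift.io/v1
kind: LocalVolume
metadata:
  name: localblock
  namespace: openshift-local-storage
spec:
  nodeSelector:
    nodeSelectorTerms:
      - matchExpressions:
          - key: kubernetes.io/hostname
            operator: In
            values:
              - rdr-aarui-sao01-worker-0
              - rdr-aarui-sao01-worker-1
              - rdr-aarui-sao01-worker-2
              - rdr-aarui-sao01-worker-3   # newly added node
  storageClassDevices:
    - storageClassName: localblock
      volumeMode: Block
      devicePaths:
        - /dev/sda                         # illustrative device path
EOF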
Mudit, see above. The LocalVolume-based OCS deployment goes through, so this BZ is no longer a blocking defect. But we are still having an issue with the LocalVolumeSet-based deployment (done by the ocs-ci code). I think it could be due to the multipath setup in our environment; LSO does not support multipath yet, which could be the reason for this problem.
Thanks Sridhar. The fix we did earlier is a generic one which will prevent these kinds of issues unless the environment is really blown up, which looks like what happened in this case. The only thing I wanted to check in the must-gather is whether the newly added option (bdev_flock_retry = 20) is present, and I can see that it is. So I am afraid we can't do much, because it doesn't look like a product issue.

One question: can't we also do a LocalVolumeSet-based deployment without ocs-ci? I don't know whether multipath can cause this issue.
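For future reference, a quick sketch for confirming the option on a live cluster without pulling a full must-gather (assuming the override lives in the rook-config-override ConfigMap):

oc -n openshift-storage get configmap rook-config-override -o jsonpath='{.data.config}' | grep bdev_flock_retry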
Mudit, we have verified cluster expansion using LocalVolume-based OCS deployments, so this BZ can be closed. We will not be able to automate cluster expansion using ocs-ci test cases for now; we can revisit those test cases after LSO supports multipath.