Description of problem (please be as detailed as possible and provide log snippets):

1. Deploy OCP and ODF with 3 storage nodes. Use a LocalVolumeSet to provision storage and enable flexible scaling. In the StorageCluster, set count to 3 and replica to 1. Three PVs and PVCs are created and bound.
2. Add a new storage node with identical storage. Edit the StorageCluster and increase the count to 4.
3. An additional osd-prepare pod is created, but it fails to bring up an additional OSD pod, so cluster expansion fails. The log of the newly added osd-prepare pod shows:

[2021-11-02 15:07:45,222][ceph_volume][ERROR ] exception caught by decorator
Traceback (most recent call last):
  File "/usr/lib/python3.6/site-packages/ceph_volume/decorators.py", line 59, in newfunc
    return f(*a, **kw)
  File "/usr/lib/python3.6/site-packages/ceph_volume/main.py", line 152, in main
    terminal.dispatch(self.mapper, subcommand_args)
  File "/usr/lib/python3.6/site-packages/ceph_volume/terminal.py", line 194, in dispatch
    instance.main()
  File "/usr/lib/python3.6/site-packages/ceph_volume/devices/raw/main.py", line 32, in main
    terminal.dispatch(self.mapper, self.argv)
  File "/usr/lib/python3.6/site-packages/ceph_volume/terminal.py", line 194, in dispatch
    instance.main()
  File "/usr/lib/python3.6/site-packages/ceph_volume/devices/raw/list.py", line 136, in main
    self.list(args)
  File "/usr/lib/python3.6/site-packages/ceph_volume/decorators.py", line 16, in is_root
    return func(*a, **kw)
  File "/usr/lib/python3.6/site-packages/ceph_volume/devices/raw/list.py", line 92, in list
    report = self.generate(args.device)
  File "/usr/lib/python3.6/site-packages/ceph_volume/devices/raw/list.py", line 80, in generate
    whoami = oj[dev]['whoami']
KeyError: 'whoami'

This problem does not appear to happen when I use a LocalVolume instead of a LocalVolumeSet. When I deploy OCP and ODF with four storage nodes from the start and set the count in the StorageCluster to 4, the deployment produces 4 osd-prepare pods and 4 OSD pods, and the cluster is created successfully.

Version of all relevant components (if applicable): 4.9

Does this issue impact your ability to continue to work with the product (please explain in detail what is the user impact)?
No, but it blocks expansion of the storage cluster.

Is there any workaround available to the best of your knowledge? No.

Rate from 1 - 5 the complexity of the scenario you performed that caused this bug (1 - very simple, 5 - very complex)? 1

Can this issue be reproduced? Yes

Can this issue be reproduced from the UI? N/A

If this is a regression, please provide more details to justify this:

Steps to Reproduce:
1. Deploy OCP and ODF with 3 storage nodes using a LocalVolumeSet, flexible scaling enabled, count 3 and replica 1.
2. Add a fourth storage node with identical storage and increase the StorageCluster count to 4.
3. Watch the newly created osd-prepare pod.

Actual results:
The new osd-prepare pod fails with the KeyError: 'whoami' traceback above, no fourth OSD pod is created, and cluster expansion fails.

Expected results:
A fourth OSD pod is created and the storage cluster expands successfully.

Additional info:
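For context, a minimal sketch of where the traceback ends up (based only on the frames above, not the full ceph-volume source): raw list parses the label JSON reported for each device and indexes the 'whoami' key directly, so a device whose bluestore label was written without a 'whoami' entry raises exactly this KeyError. The device path and label values below are illustrative placeholders.

# Illustrative only: mirrors the last frame of the traceback above
# (raw/list.py generate); 'oj' stands for the parsed per-device label JSON.
oj = {
    "/mnt/example-device": {                  # hypothetical device path
        "osd_uuid": "00000000-0000-0000-0000-000000000000",   # placeholder uuid
        "size": 549755813888,
        "kv_backend": "rocksdb",
        # note: no "whoami" key, as happens when an OSD prepare dies partway
    }
}

for dev in oj:
    whoami = oj[dev]['whoami']                # raises KeyError: 'whoami', as in the log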
Sridhar, please provide more information on the repro:

- Do you have a must-gather? It would likely help answer the remaining questions.
- Full log for the OSD prepare pod; it may show more details about the failure
- The Rook operator log is needed
- CephCluster CR details to show the storageClassDeviceSets
- StorageCluster CR details
- ...

I suspect this is an environment issue with the device not being properly cleaned from a previous install, rather than a code bug. Moving out of 4.9 since it is expected to be environmental.
Created attachment 1846103 [details] must-gather from cluster2
I have been experiencing this issue both with new installations and when adding storage after the system is already running. This was a brand new installation with dynamic storage provisioning provided by the IBM Block CSI Driver, attached to an IBM FlashSystem externally via iSCSI. This particular install had an issue with the prepare on deviceset ocs-deviceset-ibm-block-odf-test-2-data-0c2fb9, and also a CrashLoopBackOff/error on osd.0; I had to run the OSD removal procedure (via oc process) to remove osd.0 and recreate the device. I have included a must-gather from this cluster.

ClusterID: 6913171e-04c7-4892-a906-9f73fe056813
ClusterVersion: Stable at "4.9.9"
ClusterOperators: All healthy and stable

----

[2021-12-13 14:52:32,307][ceph_volume.process][INFO ] Running command: /usr/bin/ceph-bluestore-tool show-label --dev /mnt/ocs-deviceset-ibm-block-odf-test-2-data-0c2fb9
[2021-12-13 14:52:32,337][ceph_volume.process][INFO ] stdout {
[2021-12-13 14:52:32,338][ceph_volume.process][INFO ] stdout     "/mnt/ocs-deviceset-ibm-block-odf-test-2-data-0c2fb9": {
[2021-12-13 14:52:32,338][ceph_volume.process][INFO ] stdout         "osd_uuid": "8595f681-df8c-41cf-b1ff-b5c16a64cc85",
[2021-12-13 14:52:32,338][ceph_volume.process][INFO ] stdout         "size": 549755813888,
[2021-12-13 14:52:32,338][ceph_volume.process][INFO ] stdout         "btime": "2021-12-13T14:44:14.463837+0000",
[2021-12-13 14:52:32,338][ceph_volume.process][INFO ] stdout         "description": "main",
[2021-12-13 14:52:32,338][ceph_volume.process][INFO ] stdout         "bfm_blocks": "134217728",
[2021-12-13 14:52:32,338][ceph_volume.process][INFO ] stdout         "bfm_blocks_per_key": "128",
[2021-12-13 14:52:32,338][ceph_volume.process][INFO ] stdout         "bfm_bytes_per_block": "4096",
[2021-12-13 14:52:32,338][ceph_volume.process][INFO ] stdout         "bfm_size": "549755813888",
[2021-12-13 14:52:32,338][ceph_volume.process][INFO ] stdout         "bluefs": "1",
[2021-12-13 14:52:32,338][ceph_volume.process][INFO ] stdout         "kv_backend": "rocksdb"
[2021-12-13 14:52:32,338][ceph_volume.process][INFO ] stdout     }
[2021-12-13 14:52:32,338][ceph_volume.process][INFO ] stdout }
[2021-12-13 14:52:32,339][ceph_volume][ERROR ] exception caught by decorator
Traceback (most recent call last):
  File "/usr/lib/python3.6/site-packages/ceph_volume/decorators.py", line 59, in newfunc
    return f(*a, **kw)
  File "/usr/lib/python3.6/site-packages/ceph_volume/main.py", line 152, in main
    terminal.dispatch(self.mapper, subcommand_args)
  File "/usr/lib/python3.6/site-packages/ceph_volume/terminal.py", line 194, in dispatch
    instance.main()
  File "/usr/lib/python3.6/site-packages/ceph_volume/devices/raw/main.py", line 32, in main
    terminal.dispatch(self.mapper, self.argv)
  File "/usr/lib/python3.6/site-packages/ceph_volume/terminal.py", line 194, in dispatch
    instance.main()
  File "/usr/lib/python3.6/site-packages/ceph_volume/devices/raw/list.py", line 136, in main
    self.list(args)
  File "/usr/lib/python3.6/site-packages/ceph_volume/decorators.py", line 16, in is_root
    return func(*a, **kw)
  File "/usr/lib/python3.6/site-packages/ceph_volume/devices/raw/list.py", line 92, in list
    report = self.generate(args.device)
  File "/usr/lib/python3.6/site-packages/ceph_volume/devices/raw/list.py", line 80, in generate
    whoami = oj[dev]['whoami']
KeyError: 'whoami'

2021-12-13 14:52:32.371842 I | cephosd: failed to get device already provisioned by ceph-volume raw. failed to retrieve ceph-volume raw list results: failed ceph-volume call (see ceph-volume log above for more details): exit status 1
2021-12-13 14:52:32.371850 W | cephosd: skipping OSD configuration as no devices matched the storage settings for this node "ocs-deviceset-ibm-block-odf-test-2-data-0c2fb9"
2021-12-13 14:52:32.371875 I | op-k8sutil: format and nodeName longer than 63 chars, nodeName ocs-deviceset-ibm-block-odf-test-2-data-0c2fb9 will be a5028116587694d818081045740cb30a
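For what it's worth, a minimal sketch (purely illustrative, not the actual ceph-volume or Rook code) of a more defensive read of the show-label output above: a label that carries an osd_uuid but no 'whoami' key can be reported as a partially prepared device instead of crashing the whole raw list call. The helper name is hypothetical.

import json
import subprocess

# Hypothetical helper: classify a device from its bluestore label instead of
# assuming 'whoami' is always present (the assumption behind the KeyError above).
def classify_device(dev_path):
    out = subprocess.check_output(
        ["ceph-bluestore-tool", "show-label", "--dev", dev_path])
    label = json.loads(out).get(dev_path, {})
    if "whoami" not in label:
        # Label exists but no OSD id was ever recorded: the prepare most likely
        # died partway through (see the mkfs failure discussed below).
        return ("partially-prepared", label.get("osd_uuid"))
    return ("provisioned", label["whoami"])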
Guillaume, what would cause ceph-volume to fail with the 'whoami' error?
I think the real error isn't the one you are reporting but the following:

2021-12-13T14:44:16.007615871Z Running command: /usr/bin/ceph-osd --cluster ceph --osd-objectstore bluestore --mkfs -i 0 --monmap /var/lib/ceph/osd/ceph-0/activate.monmap --keyfile - --osd-data /var/lib/ceph/osd/ceph-0/ --osd-uuid 8595f681-df8c-41cf-b1ff-b5c16a64cc85 --setuser ceph --setgroup ceph
2021-12-13T14:44:16.007615871Z  stderr: 2021-12-13T14:44:14.461+0000 7f38bc8b0080 -1 bluestore(/var/lib/ceph/osd/ceph-0/) _read_fsid unparsable uuid
2021-12-13T14:44:16.007615871Z  stderr: 2021-12-13T14:44:14.981+0000 7f38bc8b0080 -1 bluestore(/var/lib/ceph/osd/ceph-0/) _open_db erroring opening db:
2021-12-13T14:44:16.007615871Z  stderr: 2021-12-13T14:44:15.494+0000 7f38bc8b0080 -1 OSD::mkfs: ObjectStore::mkfs failed with error (5) Input/output error
2021-12-13T14:44:16.007615871Z  stderr: 2021-12-13T14:44:15.494+0000 7f38bc8b0080 -1 ** ERROR: error creating empty object store in /var/lib/ceph/osd/ceph-0/: (5) Input/output error
2021-12-13T14:44:16.007615871Z --> Was unable to complete a new OSD, will rollback changes
2021-12-13T14:44:16.007615871Z Running command: /usr/bin/ceph --cluster ceph --name client.bootstrap-osd --keyring /var/lib/ceph/bootstrap-osd/ceph.keyring osd purge-new osd.0 --yes-i-really-mean-it
2021-12-13T14:44:16.007615871Z  stderr: purged osd.0
2021-12-13T14:44:16.007615871Z Traceback (most recent call last):
2021-12-13T14:44:16.007615871Z   File "/usr/sbin/ceph-volume", line 11, in <module>
2021-12-13T14:44:16.007615871Z     load_entry_point('ceph-volume==1.0.0', 'console_scripts', 'ceph-volume')()
2021-12-13T14:44:16.007615871Z   File "/usr/lib/python3.6/site-packages/ceph_volume/main.py", line 40, in __init__
2021-12-13T14:44:16.007615871Z     self.main(self.argv)
2021-12-13T14:44:16.007615871Z   File "/usr/lib/python3.6/site-packages/ceph_volume/decorators.py", line 59, in newfunc
2021-12-13T14:44:16.007615871Z     return f(*a, **kw)
2021-12-13T14:44:16.007615871Z   File "/usr/lib/python3.6/site-packages/ceph_volume/main.py", line 152, in main
2021-12-13T14:44:16.007615871Z     terminal.dispatch(self.mapper, subcommand_args)
2021-12-13T14:44:16.007615871Z   File "/usr/lib/python3.6/site-packages/ceph_volume/terminal.py", line 194, in dispatch
2021-12-13T14:44:16.007615871Z     instance.main()
2021-12-13T14:44:16.007615871Z   File "/usr/lib/python3.6/site-packages/ceph_volume/devices/raw/main.py", line 32, in main
2021-12-13T14:44:16.007615871Z     terminal.dispatch(self.mapper, self.argv)
2021-12-13T14:44:16.007615871Z   File "/usr/lib/python3.6/site-packages/ceph_volume/terminal.py", line 194, in dispatch
2021-12-13T14:44:16.007615871Z     instance.main()
2021-12-13T14:44:16.007615871Z   File "/usr/lib/python3.6/site-packages/ceph_volume/devices/raw/prepare.py", line 169, in main
2021-12-13T14:44:16.007615871Z     self.safe_prepare(self.args)
2021-12-13T14:44:16.007615871Z   File "/usr/lib/python3.6/site-packages/ceph_volume/devices/raw/prepare.py", line 91, in safe_prepare
2021-12-13T14:44:16.007615871Z     self.prepare()
2021-12-13T14:44:16.007615871Z   File "/usr/lib/python3.6/site-packages/ceph_volume/decorators.py", line 16, in is_root
2021-12-13T14:44:16.007615871Z     return func(*a, **kw)
2021-12-13T14:44:16.007615871Z   File "/usr/lib/python3.6/site-packages/ceph_volume/devices/raw/prepare.py", line 134, in prepare
2021-12-13T14:44:16.007615871Z     tmpfs,
2021-12-13T14:44:16.007615871Z   File "/usr/lib/python3.6/site-packages/ceph_volume/devices/raw/prepare.py", line 68, in prepare_bluestore
2021-12-13T14:44:16.007615871Z     db=db
2021-12-13T14:44:16.007615871Z   File "/usr/lib/python3.6/site-packages/ceph_volume/util/prepare.py", line 480, in osd_mkfs_bluestore
2021-12-13T14:44:16.007615871Z     raise RuntimeError('Command failed with exit code %s: %s' % (returncode, ' '.join(command)))
2021-12-13T14:44:16.007615871Z RuntimeError: Command failed with exit code 250: /usr/bin/ceph-osd --cluster ceph --osd-objectstore bluestore --mkfs -i 0 --monmap /var/lib/ceph/osd/ceph-0/activate.monmap --keyfile - --osd-data /var/lib/ceph/osd/ceph-0/ --osd-uuid 8595f681-df8c-41cf-b1ff-b5c16a64cc85 --setuser ceph --setgroup ceph

"(5) Input/output error": the device seems to be unhealthy; I think you have a hardware issue. The first OSD prepare failed in the middle of the workflow, and since the operator keeps retrying the OSD prepare in a loop, the subsequent attempts produce the other error (the one you mentioned).
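To make that sequence concrete, a toy sketch (not Ceph code; it only assumes what the show-label output earlier in this bug already shows, namely a label with an osd_uuid but no 'whoami'): the first prepare writes the bluestore main label, mkfs then dies on the I/O error before an OSD id is recorded, and every later retry that runs raw list trips over the incomplete label.

label = {}

# First attempt: ceph-osd --mkfs writes the main bluestore label early ...
label["osd_uuid"] = "8595f681-df8c-41cf-b1ff-b5c16a64cc85"
# ... then hits "(5) Input/output error" and aborts before recording the OSD id,
# so no "whoami" entry is ever added to the label.

# Later attempts: the operator retries and ceph-volume raw list inspects the label.
try:
    whoami = label["whoami"]
except KeyError:
    print("KeyError: 'whoami'  -- the secondary error reported in this bug")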