Bug 2019606
| Summary: | Could not expand the cluster with flexible scaling enabled | | |
|---|---|---|---|
| Product: | [Red Hat Storage] Red Hat OpenShift Data Foundation | Reporter: | Sridhar Venkat (IBM) <svenkat> |
| Component: | ceph | Assignee: | Guillaume Abrioux <gabrioux> |
| Status: | CLOSED NOTABUG | QA Contact: | Elad <ebenahar> |
| Severity: | medium | Priority: | unspecified |
| Version: | 4.9 | CC: | assingh, bniver, ebenahar, gabrioux, jrivera, madam, matt.levan, mhackett, mmuench, muagarwa, ocs-bugs, odf-bz-bot |
| Target Milestone: | --- | Target Release: | --- |
| Hardware: | ppc64le | OS: | Unspecified |
| Doc Type: | If docs needed, set a value | Story Points: | --- |
| Last Closed: | 2022-02-14 15:46:13 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Attachments: | must-gather from cluster2 (attachment 1846103) | | |
Sridhar, please provide more information on the repro:

- Do you have a must-gather? It would likely help answer the remaining questions.
- The full log for the OSD prepare pod; it may show more details about the failure.
- The Rook operator log is needed.
- CephCluster CR details to show the storageClassDeviceSets.
- StorageCluster CR details.
- ...

I suspect this is an environment issue with the device not being properly cleaned from a previous install, rather than a code bug. Moving out of 4.9 since it is expected to be environmental.
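For reference, a minimal sketch of commands that would collect the items requested above (the openshift-storage namespace, the label selector, and the must-gather image tag are assumptions about a default ODF 4.9 install, not taken from this report):

# ODF must-gather (image tag is an assumption; match it to the installed ODF version)
oc adm must-gather --image=registry.redhat.io/odf4/ocs-must-gather-rhel8:v4.9

# Full logs of the OSD prepare pods and the Rook operator
oc -n openshift-storage logs -l app=rook-ceph-osd-prepare --all-containers
oc -n openshift-storage logs deploy/rook-ceph-operator

# CephCluster CR (shows storageClassDeviceSets) and StorageCluster CR
oc -n openshift-storage get cephcluster -o yaml
oc -n openshift-storage get storagecluster -o yaml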
Created attachment 1846103 [details]
must-gather from cluster2
I have been experiencing this issue both with new installations and when adding storage after the system is already running. This was a brand new installation with dynamic storage provisioning provided by the IBM Block CSI Driver, attached to an IBM FlashSystem externally via iSCSI. This particular install had an issue with the prepare on device set ocs-deviceset-ibm-block-odf-test-2-data-0c2fb9, and osd0 was also in CrashLoopBackOff/Error, so I had to run the OSD removal process (oc process) to remove osd0 and recreate the device (a sketch of that flow follows below). I have included a must-gather from this cluster.
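For context, the removal flow referenced above is roughly the following (a hedged sketch; the ocs-osd-removal template name and the FAILED_OSD_IDS parameter are my assumptions about the documented ODF procedure, not taken from this report):

# Remove the failed osd.0 via the OSD removal job template
oc process -n openshift-storage ocs-osd-removal -p FAILED_OSD_IDS=0 | oc create -f -
# Wait for the removal job to finish before wiping and re-adding the device
oc -n openshift-storage get jobs | grep osd-removal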
ClusterID: 6913171e-04c7-4892-a906-9f73fe056813
ClusterVersion: Stable at "4.9.9"
ClusterOperators:
All healthy and stable
----
[2021-12-13 14:52:32,307][ceph_volume.process][INFO ] Running command: /usr/bin/ceph-bluestore-tool show-label --dev /mnt/ocs-deviceset-ibm-block-odf-test-2-data-0c2fb9
[2021-12-13 14:52:32,337][ceph_volume.process][INFO ] stdout {
[2021-12-13 14:52:32,338][ceph_volume.process][INFO ] stdout "/mnt/ocs-deviceset-ibm-block-odf-test-2-data-0c2fb9": {
[2021-12-13 14:52:32,338][ceph_volume.process][INFO ] stdout "osd_uuid": "8595f681-df8c-41cf-b1ff-b5c16a64cc85",
[2021-12-13 14:52:32,338][ceph_volume.process][INFO ] stdout "size": 549755813888,
[2021-12-13 14:52:32,338][ceph_volume.process][INFO ] stdout "btime": "2021-12-13T14:44:14.463837+0000",
[2021-12-13 14:52:32,338][ceph_volume.process][INFO ] stdout "description": "main",
[2021-12-13 14:52:32,338][ceph_volume.process][INFO ] stdout "bfm_blocks": "134217728",
[2021-12-13 14:52:32,338][ceph_volume.process][INFO ] stdout "bfm_blocks_per_key": "128",
[2021-12-13 14:52:32,338][ceph_volume.process][INFO ] stdout "bfm_bytes_per_block": "4096",
[2021-12-13 14:52:32,338][ceph_volume.process][INFO ] stdout "bfm_size": "549755813888",
[2021-12-13 14:52:32,338][ceph_volume.process][INFO ] stdout "bluefs": "1",
[2021-12-13 14:52:32,338][ceph_volume.process][INFO ] stdout "kv_backend": "rocksdb"
[2021-12-13 14:52:32,338][ceph_volume.process][INFO ] stdout }
[2021-12-13 14:52:32,338][ceph_volume.process][INFO ] stdout }
[2021-12-13 14:52:32,339][ceph_volume][ERROR ] exception caught by decorator
Traceback (most recent call last):
File "/usr/lib/python3.6/site-packages/ceph_volume/decorators.py", line 59, in newfunc
return f(*a, **kw)
File "/usr/lib/python3.6/site-packages/ceph_volume/main.py", line 152, in main
terminal.dispatch(self.mapper, subcommand_args)
File "/usr/lib/python3.6/site-packages/ceph_volume/terminal.py", line 194, in dispatch
instance.main()
File "/usr/lib/python3.6/site-packages/ceph_volume/devices/raw/main.py", line 32, in main
terminal.dispatch(self.mapper, self.argv)
File "/usr/lib/python3.6/site-packages/ceph_volume/terminal.py", line 194, in dispatch
instance.main()
File "/usr/lib/python3.6/site-packages/ceph_volume/devices/raw/list.py", line 136, in main
self.list(args)
File "/usr/lib/python3.6/site-packages/ceph_volume/decorators.py", line 16, in is_root
return func(*a, **kw)
File "/usr/lib/python3.6/site-packages/ceph_volume/devices/raw/list.py", line 92, in list
report = self.generate(args.device)
File "/usr/lib/python3.6/site-packages/ceph_volume/devices/raw/list.py", line 80, in generate
whoami = oj[dev]['whoami']
KeyError: 'whoami'
2021-12-13 14:52:32.371842 I | cephosd: failed to get device already provisioned by ceph-volume raw. failed to retrieve ceph-volume raw list results: failed ceph-volume call (see ceph-volume log above for more details): exit status 1
2021-12-13 14:52:32.371850 W | cephosd: skipping OSD configuration as no devices matched the storage settings for this node "ocs-deviceset-ibm-block-odf-test-2-data-0c2fb9"
2021-12-13 14:52:32.371875 I | op-k8sutil: format and nodeName longer than 63 chars, nodeName ocs-deviceset-ibm-block-odf-test-2-data-0c2fb9 will be a5028116587694d818081045740cb30a
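For context, the partial label can be confirmed manually (a hedged debugging sketch, not part of the original log; the device path is copied from the log above and the commands are assumed to run from the prepare or toolbox pod):

# Show the bluestore label left behind by the aborted mkfs; a fully prepared OSD
# label typically also contains fields such as "whoami" and "ceph_fsid", which
# are missing in the output above
ceph-bluestore-tool show-label --dev /mnt/ocs-deviceset-ibm-block-odf-test-2-data-0c2fb9

# ceph-volume parses that label and fails with KeyError: 'whoami'
ceph-volume raw list /mnt/ocs-deviceset-ibm-block-odf-test-2-data-0c2fb9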
Guillaume, what would cause ceph-volume to fail with the whoami error?

I think the real error isn't the one you are reporting but the following:
2021-12-13T14:44:16.007615871Z Running command: /usr/bin/ceph-osd --cluster ceph --osd-objectstore bluestore --mkfs -i 0 --monmap /var/lib/ceph/osd/ceph-0/activate.monmap --keyfile - --osd-data /var/lib/ceph/osd/ceph-0/ --osd-uuid 8595f681-df8c-41cf-b1ff-b5c16a64cc85 --setuser ceph --setgroup ceph
2021-12-13T14:44:16.007615871Z stderr: 2021-12-13T14:44:14.461+0000 7f38bc8b0080 -1 bluestore(/var/lib/ceph/osd/ceph-0/) _read_fsid unparsable uuid
2021-12-13T14:44:16.007615871Z stderr: 2021-12-13T14:44:14.981+0000 7f38bc8b0080 -1 bluestore(/var/lib/ceph/osd/ceph-0/) _open_db erroring opening db:
2021-12-13T14:44:16.007615871Z stderr: 2021-12-13T14:44:15.494+0000 7f38bc8b0080 -1 OSD::mkfs: ObjectStore::mkfs failed with error (5) Input/output error
2021-12-13T14:44:16.007615871Z stderr: 2021-12-13T14:44:15.494+0000 7f38bc8b0080 -1 ESC[0;31m ** ERROR: error creating empty object store in /var/lib/ceph/osd/ceph-0/: (5) Input/output errorESC[0m
2021-12-13T14:44:16.007615871Z --> Was unable to complete a new OSD, will rollback changes
2021-12-13T14:44:16.007615871Z Running command: /usr/bin/ceph --cluster ceph --name client.bootstrap-osd --keyring /var/lib/ceph/bootstrap-osd/ceph.keyring osd purge-new osd.0 --yes-i-really-mean-it
2021-12-13T14:44:16.007615871Z stderr: purged osd.0
2021-12-13T14:44:16.007615871Z Traceback (most recent call last):
2021-12-13T14:44:16.007615871Z File "/usr/sbin/ceph-volume", line 11, in <module>
2021-12-13T14:44:16.007615871Z load_entry_point('ceph-volume==1.0.0', 'console_scripts', 'ceph-volume')()
2021-12-13T14:44:16.007615871Z File "/usr/lib/python3.6/site-packages/ceph_volume/main.py", line 40, in __init__
2021-12-13T14:44:16.007615871Z self.main(self.argv)
2021-12-13T14:44:16.007615871Z File "/usr/lib/python3.6/site-packages/ceph_volume/decorators.py", line 59, in newfunc
2021-12-13T14:44:16.007615871Z return f(*a, **kw)
2021-12-13T14:44:16.007615871Z File "/usr/lib/python3.6/site-packages/ceph_volume/main.py", line 152, in main
2021-12-13T14:44:16.007615871Z terminal.dispatch(self.mapper, subcommand_args)
2021-12-13T14:44:16.007615871Z File "/usr/lib/python3.6/site-packages/ceph_volume/terminal.py", line 194, in dispatch
2021-12-13T14:44:16.007615871Z instance.main()
2021-12-13T14:44:16.007615871Z File "/usr/lib/python3.6/site-packages/ceph_volume/devices/raw/main.py", line 32, in main
2021-12-13T14:44:16.007615871Z terminal.dispatch(self.mapper, self.argv)
2021-12-13T14:44:16.007615871Z File "/usr/lib/python3.6/site-packages/ceph_volume/terminal.py", line 194, in dispatch
2021-12-13T14:44:16.007615871Z instance.main()
2021-12-13T14:44:16.007615871Z File "/usr/lib/python3.6/site-packages/ceph_volume/devices/raw/prepare.py", line 169, in main
2021-12-13T14:44:16.007615871Z self.safe_prepare(self.args)
2021-12-13T14:44:16.007615871Z File "/usr/lib/python3.6/site-packages/ceph_volume/devices/raw/prepare.py", line 91, in safe_prepare
2021-12-13T14:44:16.007615871Z self.prepare()
2021-12-13T14:44:16.007615871Z File "/usr/lib/python3.6/site-packages/ceph_volume/decorators.py", line 16, in is_root
2021-12-13T14:44:16.007615871Z return func(*a, **kw)
2021-12-13T14:44:16.007615871Z File "/usr/lib/python3.6/site-packages/ceph_volume/devices/raw/prepare.py", line 134, in prepare
2021-12-13T14:44:16.007615871Z tmpfs,
2021-12-13T14:44:16.007615871Z File "/usr/lib/python3.6/site-packages/ceph_volume/devices/raw/prepare.py", line 68, in prepare_bluestore
2021-12-13T14:44:16.007615871Z db=db
2021-12-13T14:44:16.007615871Z File "/usr/lib/python3.6/site-packages/ceph_volume/util/prepare.py", line 480, in osd_mkfs_bluestore
2021-12-13T14:44:16.007615871Z raise RuntimeError('Command failed with exit code %s: %s' % (returncode, ' '.join(command)))
2021-12-13T14:44:16.007615871Z RuntimeError: Command failed with exit code 250: /usr/bin/ceph-osd --cluster ceph --osd-objectstore bluestore --mkfs -i 0 --monmap /var/lib/ceph/osd/ceph-0/activate.monmap --keyfile - --osd-data /var/lib/ceph/osd/ceph-0/ --osd-uuid 8595f681-df8c-41cf-b1ff-b5c16a64cc85 --setuser ceph --setgroup ceph. debug logs below:
"(5) Input/output error" : the device seems to be unhealthy, I think you have an hardware issue.
The first OSD prepare failed in the middle of the workflow; since the operator retries the OSD prepare in a loop, the subsequent attempts produce a different error (the one you mentioned).
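A hedged note on remediation (my addition, not part of the original comment): before the operator retries, the leftover bluestore metadata on the failed device usually has to be wiped, for example:

# Destructive: run only against the device that failed to prepare
# (/dev/mapper/mpatha is a hypothetical example path)
wipefs --all /dev/mapper/mpatha
dd if=/dev/zero of=/dev/mapper/mpatha bs=1M count=100 oflag=direct,dsync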
Description of problem (please be detailed as possible and provide log snippets):

1. Deploy OCP and ODF with 3 storage nodes. Use LocalVolumeSet to provision storage. Enable flexible scaling. In the storage cluster, set count to 3 and replica to 1. Notice three PVs and PVCs are created and bound.
2. Add a new storage node with identical storage. Edit the storage cluster and increase the count to 4 (see the sketch at the end of this report).
3. Notice an additional osd-prepare pod is created, but it fails to create an additional OSD pod. Cluster expansion fails.

I see the following in the log of the newly added osd-prepare pod:

[2021-11-02 15:07:45,222][ceph_volume][ERROR ] exception caught by decorator
Traceback (most recent call last):
  File "/usr/lib/python3.6/site-packages/ceph_volume/decorators.py", line 59, in newfunc
    return f(*a, **kw)
  File "/usr/lib/python3.6/site-packages/ceph_volume/main.py", line 152, in main
    terminal.dispatch(self.mapper, subcommand_args)
  File "/usr/lib/python3.6/site-packages/ceph_volume/terminal.py", line 194, in dispatch
    instance.main()
  File "/usr/lib/python3.6/site-packages/ceph_volume/devices/raw/main.py", line 32, in main
    terminal.dispatch(self.mapper, self.argv)
  File "/usr/lib/python3.6/site-packages/ceph_volume/terminal.py", line 194, in dispatch
    instance.main()
  File "/usr/lib/python3.6/site-packages/ceph_volume/devices/raw/list.py", line 136, in main
    self.list(args)
  File "/usr/lib/python3.6/site-packages/ceph_volume/decorators.py", line 16, in is_root
    return func(*a, **kw)
  File "/usr/lib/python3.6/site-packages/ceph_volume/devices/raw/list.py", line 92, in list
    report = self.generate(args.device)
  File "/usr/lib/python3.6/site-packages/ceph_volume/devices/raw/list.py", line 80, in generate
    whoami = oj[dev]['whoami']
KeyError: 'whoami'

I believe this problem does not happen when I use LocalVolume instead of LocalVolumeSet. When I deploy OCP and ODF with four storage nodes to begin with and set the count in the storage cluster to 4, the ODF deploy produces 4 prepare-osd pods and 4 OSD pods, and the cluster gets created successfully.

Version of all relevant components (if applicable): 4.9

Does this issue impact your ability to continue to work with the product (please explain in detail what is the user impact)?
No, but it impacts the expansion of the storage cluster.

Is there any workaround available to the best of your knowledge?
No.

Rate from 1 - 5 the complexity of the scenario you performed that caused this bug (1 - very simple, 5 - very complex)?
1

Is this issue reproducible? Yes

Can this issue be reproduced from the UI? N/A

If this is a regression, please provide more details to justify this:

Steps to Reproduce:
1.
2.
3.

Actual results:

Expected results:

Additional info:
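For step 2 above, a minimal sketch of how the expansion can be inspected and performed (the StorageCluster name ocs-storagecluster, the openshift-storage namespace, and the field paths are assumptions based on a default ODF install, not taken from this report):

# Check flexible scaling and the current device-set count/replica (names assumed)
oc -n openshift-storage get storagecluster ocs-storagecluster \
  -o jsonpath='{.spec.flexibleScaling} {.spec.storageDeviceSets[0].count} {.spec.storageDeviceSets[0].replica}{"\n"}'

# Expansion is done by raising .spec.storageDeviceSets[0].count (e.g. from 3 to 4)
oc -n openshift-storage edit storagecluster ocs-storagecluster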