Created attachment 1752774 [details]
Describe osd-prepare-deviceset

Description of problem (please be as detailed as possible and provide log snippets):

My cluster has 3 master and 3 worker nodes. With this OCS version, not all required pods in the openshift-storage namespace are deployed; this happens on multiple clusters and also after retrying the installation. In my case, rook-ceph-osd-prepare-ocs-deviceset-0-data-0-nlw7m has been restarting continuously without success, but it can be a different pod on another cluster. The NooBaa pods seem to have no problem. Tier 4a tests are failing because Ceph reports the OSD count as too low:

oc -n openshift-storage exec rook-ceph-tools-6fdd868f75-plkhq -- ceph health
HEALTH_WARN Degraded data redundancy: 435/1305 objects degraded (33.333%), 85 pgs degraded, 176 pgs undersized; OSD count 2 < osd_pool_default_size 3

oc get pods -n openshift-storage
NAME   READY   STATUS   RESTARTS   AGE
csi-cephfsplugin-provisioner-d8ccd695d-59hsc   6/6   Running   0   5d21h
csi-cephfsplugin-provisioner-d8ccd695d-kgt6f   6/6   Running   0   5d21h
csi-cephfsplugin-s7fpl   3/3   Running   0   5d21h
csi-cephfsplugin-smhf8   3/3   Running   0   5d21h
csi-cephfsplugin-srll7   3/3   Running   0   5d21h
csi-rbdplugin-cjjtl   3/3   Running   0   5d21h
csi-rbdplugin-j8ftf   3/3   Running   0   5d21h
csi-rbdplugin-provisioner-76988fbc89-clszq   6/6   Running   0   5d21h
csi-rbdplugin-provisioner-76988fbc89-qhjgd   6/6   Running   0   5d21h
csi-rbdplugin-pvff5   3/3   Running   0   5d21h
noobaa-core-0   1/1   Running   0   5d17h
noobaa-db-0   1/1   Running   0   5d17h
noobaa-endpoint-5d4995f4fc-djtgb   1/1   Running   3   5d21h
noobaa-operator-5cbc75645c-rjr4c   1/1   Running   0   5d21h
ocs-metrics-exporter-cfbdf59f7-hmtsh   1/1   Running   0   5d21h
ocs-operator-7699785c58-h5hlk   1/1   Running   0   5d21h
rook-ceph-crashcollector-worker-001.m1307001ocs.lnxne.boe-l7n8s   1/1   Running   0   5d21h
rook-ceph-crashcollector-worker-002.m1307001ocs.lnxne.boe-4mdq2   1/1   Running   0   5d21h
rook-ceph-crashcollector-worker-003.m1307001ocs.lnxne.boe-szgf2   1/1   Running   0   5d21h
rook-ceph-mds-ocs-storagecluster-cephfilesystem-a-6bd76df4bvt6g   1/1   Running   0   5d21h
rook-ceph-mds-ocs-storagecluster-cephfilesystem-b-5856648cbtqbv   1/1   Running   0   5d21h
rook-ceph-mgr-a-7988f8c79b-tvhhz   1/1   Running   0   5d21h
rook-ceph-mon-a-9b94c9d67-29hvt   1/1   Running   0   5d21h
rook-ceph-mon-b-b44b8db4b-nhp2f   1/1   Running   0   5d21h
rook-ceph-mon-c-9c59cf784-rwtst   1/1   Running   0   5d21h
rook-ceph-operator-767dd7c6b5-fs84m   1/1   Running   0   5d21h
rook-ceph-osd-1-d474bd9-lqftc   1/1   Running   0   5d21h
rook-ceph-osd-2-69d778c845-sj8tt   1/1   Running   0   5d21h
rook-ceph-osd-prepare-ocs-deviceset-1-data-0-fd72v-fgljg   0/1   Completed   0   5d21h
rook-ceph-osd-prepare-ocs-deviceset-2-data-0-gql7d-b6phd   0/1   Completed   0   5d21h
rook-ceph-rgw-ocs-storagecluster-cephobjectstore-a-78c558999wtf   1/1   Running   0   5d21h
rook-ceph-rgw-ocs-storagecluster-cephobjectstore-b-75c4b47mwwqr   1/1   Running   0   5d21h
rook-ceph-tools-6fdd868f75-plkhq   1/1   Running   0   5d15h

oc get pods -n local-storage
NAME   READY   STATUS   RESTARTS   AGE
local-disks-local-diskmaker-djfw2   1/1   Running   0   5d21h
local-disks-local-diskmaker-rpdvt   1/1   Running   0   5d21h
local-disks-local-diskmaker-sjzxq   1/1   Running   0   5d21h
local-disks-local-provisioner-8n9bc   1/1   Running   0   5d21h
local-disks-local-provisioner-dcfs2   1/1   Running   0   5d21h
local-disks-local-provisioner-wtdhk   1/1   Running   0   5d21h
local-storage-operator-6f5c9f9587-fp8mn   1/1   Running   0   5d21h

oc get pvc -n openshift-storage
NAME   STATUS   VOLUME   CAPACITY   ACCESS MODES   STORAGECLASS   AGE
db-noobaa-db-0   Bound   pvc-5f7dc7fc-fb38-485c-be8f-549fb29a77fe   50Gi   RWO   ocs-storagecluster-ceph-rbd   5d21h
ocs-deviceset-0-data-0-nlw7m   Bound   local-pv-9f1d5b9c   8Ti   RWO   localblock-sc   5d21h
ocs-deviceset-1-data-0-fd72v   Bound   local-pv-e46e822f   8Ti   RWO   localblock-sc   5d21h
ocs-deviceset-2-data-0-gql7d   Bound   local-pv-9b8807c6   8Ti   RWO   localblock-sc   5d21h

oc get volumeattachments
NAME   ATTACHER   PV   NODE   ATTACHED   AGE
csi-9ad4c800b9dd49bf3b16cc4d4a1d05c6174e70e2e64ed5b5edbed7c86b66aa23   openshift-storage.rbd.csi.ceph.com   pvc-5f7dc7fc-fb38-485c-be8f-549fb29a77fe   worker-003.m1307001ocs.lnxne.boe   true   5d17h

Error message in the oc logs for this device:

[2021-01-26 14:34:13,188][ceph_volume.process][INFO ] stderr 2021-01-26 14:34:13.170 3ffb26e8d40 -1 ** ERROR: error creating empty object store in /var/lib/ceph/osd/ceph-0/: (2) No such file or directory
[2021-01-26 14:34:13,188][ceph_volume.devices.raw.prepare][ERROR ] raw prepare was unable to complete
Traceback (most recent call last):
  File "/usr/lib/python3.6/site-packages/ceph_volume/devices/raw/prepare.py", line 91, in safe_prepare
    self.prepare()
  File "/usr/lib/python3.6/site-packages/ceph_volume/decorators.py", line 16, in is_root
    return func(*a, **kw)
  File "/usr/lib/python3.6/site-packages/ceph_volume/devices/raw/prepare.py", line 134, in prepare
    tmpfs,
  File "/usr/lib/python3.6/site-packages/ceph_volume/devices/raw/prepare.py", line 68, in prepare_bluestore
    db=db
  File "/usr/lib/python3.6/site-packages/ceph_volume/util/prepare.py", line 456, in osd_mkfs_bluestore
    raise RuntimeError('Command failed with exit code %s: %s' % (returncode, ' '.join(command)))
RuntimeError: Command failed with exit code 250: /usr/bin/ceph-osd --cluster ceph --osd-objectstore bluestore --mkfs -i 0 --monmap /var/lib/ceph/osd/ceph-0/activate.monmap --keyfile - --osd-data /var/lib/ceph/osd/ceph-0/ --osd-uuid d5d2e27e-f4f9-4f50-b53e-34514e17a158 --setuser ceph --setgroup ceph
[2021-01-26 14:34:13,189][ceph_volume.devices.raw.prepare][INFO ] will rollback OSD ID creation
[2021-01-26 14:34:13,189][ceph_volume.process][INFO ] Running command: /usr/bin/ceph --cluster ceph --name client.bootstrap-osd --keyring /var/lib/ceph/bootstrap-osd/ceph.keyring osd purge-new osd.0 --yes-i-really-mean-it
[2021-01-26 14:34:13,769][ceph_volume.process][INFO ] stderr purged osd.0
[2021-01-26 14:34:13,784][ceph_volume][ERROR ] exception caught by decorator
Traceback (most recent call last):
  File "/usr/lib/python3.6/site-packages/ceph_volume/decorators.py", line 59, in newfunc
    return f(*a, **kw)
  File "/usr/lib/python3.6/site-packages/ceph_volume/main.py", line 150, in main
    terminal.dispatch(self.mapper, subcommand_args)
  File "/usr/lib/python3.6/site-packages/ceph_volume/terminal.py", line 194, in dispatch
    instance.main()
  File "/usr/lib/python3.6/site-packages/ceph_volume/devices/raw/main.py", line 32, in main
    terminal.dispatch(self.mapper, self.argv)
  File "/usr/lib/python3.6/site-packages/ceph_volume/terminal.py", line 194, in dispatch
    instance.main()
  File "/usr/lib/python3.6/site-packages/ceph_volume/devices/raw/prepare.py", line 169, in main
    self.safe_prepare(self.args)
  File "/usr/lib/python3.6/site-packages/ceph_volume/devices/raw/prepare.py", line 91, in safe_prepare
    self.prepare()
  File "/usr/lib/python3.6/site-packages/ceph_volume/decorators.py", line 16, in is_root
    return func(*a, **kw)
  File "/usr/lib/python3.6/site-packages/ceph_volume/devices/raw/prepare.py", line 134, in prepare
    tmpfs,
  File "/usr/lib/python3.6/site-packages/ceph_volume/devices/raw/prepare.py", line 68, in prepare_bluestore
    db=db
  File "/usr/lib/python3.6/site-packages/ceph_volume/util/prepare.py", line 456, in osd_mkfs_bluestore
    raise RuntimeError('Command failed with exit code %s: %s' % (returncode, ' '.join(command)))
RuntimeError: Command failed with exit code 250: /usr/bin/ceph-osd --cluster ceph --osd-objectstore bluestore --mkfs -i 0 --monmap /var/lib/ceph/osd/ceph-0/activate.monmap --keyfile - --osd-data /var/lib/ceph/osd/ceph-0/ --osd-uuid d5d2e27e-f4f9-4f50-b53e-34514e17a158 --setuser ceph --setgroup ceph

Version of all relevant components (if applicable):
Client Version: 4.6.13
Server Version: 4.6.13
ocs-operator.v4.6.2-698.ci

Does this issue impact your ability to continue to work with the product (please explain in detail what the user impact is)?
Yes. We cannot run various ocs-ci tests for s390x because of the 3 Ceph health warnings.

Is there any workaround available to the best of your knowledge?
No

Rate from 1 - 5 the complexity of the scenario you performed that caused this bug (1 - very simple, 5 - very complex)?
3

Is this issue reproducible?
Yes. It happens on different clusters on s390x.

Can this issue be reproduced from the UI?

If this is a regression, please provide more details to justify this:

Steps to Reproduce:
1. Install OCP
2. Install OCS
3. Not all pods are available

Actual results:
Not all required OSD pods are available after a fresh installation. On multiple clusters, one osd-prepare-deviceset pod restarts continuously without success. That causes the 3 Ceph health warnings, which in turn make the tests fail.

Expected results:
All OSD pods should run without any problems.

Additional info:
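For anyone hitting the same symptom, the state of the failing prepare pod and the overall Ceph state can be collected with commands along these lines (pod names taken from the listings above; adjust them for your cluster):

# events and scheduling info for the failing prepare pod
oc -n openshift-storage describe pod rook-ceph-osd-prepare-ocs-deviceset-0-data-0-nlw7m
# ceph-volume output from the current and the previous (crashed) container run
oc -n openshift-storage logs rook-ceph-osd-prepare-ocs-deviceset-0-data-0-nlw7m
oc -n openshift-storage logs rook-ceph-osd-prepare-ocs-deviceset-0-data-0-nlw7m --previous
# overall cluster and OSD state from the toolbox
oc -n openshift-storage exec rook-ceph-tools-6fdd868f75-plkhq -- ceph status
oc -n openshift-storage exec rook-ceph-tools-6fdd868f75-plkhq -- ceph osd tree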
Created attachment 1752776 [details] Log osd-prepare-device-log
Created attachment 1754987 [details]
Capacity Resource log by test 4a

Because of this behavior, the Tier 4a test suite contains a failed test that ends in a timeout. The log is attached.
At the moment this looks like a Rook issue, so reassigning this BZ accordingly.
It looks like the prepare pod ran on an existing OSD:

2021-01-26 14:34:09.868118 I | cephosd: skipping osd.0: 56861d4b-8a66-492a-acc8-deb11b9303e5 running on a different ceph cluster "af92dd7e-2c15-4b8b-bf7e-88fb8b4e122a"

So I think there is an environmental issue: did you clean up the drive before deploying? However, Rook should bail out and not proceed with the device, which seems to be the case, so there is a bug too.
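If it helps to confirm the environmental side: whether a disk still carries data from a previous Ceph cluster can be checked from a debug shell on the worker node, and the fsid quoted in the log line above can be compared with the fsid of the running cluster. A rough sketch (replace the node and device names with the ones backing the affected PV):

oc debug node/worker-001.m1307001ocs.lnxne.boe
chroot /host
# list any remaining on-disk signatures without modifying the disk; a freshly wiped disk prints nothing
wipefs -n /dev/sdb

# fsid of the currently deployed cluster, for comparison with the fsid Rook reported on the disk
oc -n openshift-storage exec rook-ceph-tools-6fdd868f75-plkhq -- ceph fsid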
Thank you! I tried that with an OCS cleanup and a manual cleanup of the zFCP disks on all worker nodes with:

sgdisk -Z /dev/sdb

That wipes the disks for reformatting and is not really practical for daily re-setups via CI/CD; our customers would not be happy about such required steps either. The OCS setup is more successful after this completely fresh OCP/OCS setup. One issue I can identify is that pods cannot be removed during tests, which may contribute to this bug, because existing Ceph clusters are then identified on the disks. I also had problems removing these pods manually. The question for IBM and Red Hat is: do we really need to reformat the disks in order to reuse clusters? That would degrade the environment for future use more and more.
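For reference, a sketch of a more thorough manual wipe (not the officially documented cleanup; double-check the device path, since this is destructive and assumes the disk really is meant to be cleared):

sgdisk --zap-all /dev/sdb
# remove any remaining filesystem/RAID/Ceph signatures
wipefs -a /dev/sdb
# zero the start of the device, where the OSD metadata lives
dd if=/dev/zero of=/dev/sdb bs=1M count=100 oflag=direct
# on devices that support it, discard all blocks instead of zeroing
blkdiscard /dev/sdb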
If you cleanly uninstall OCS following the official procedure (even through YAML), it will also sanitize the disk and then you can do a clean re-install.
We have been following the official procedure. It seems that a timeout occurs and the cleanup is then skipped. If I understand it correctly, with the oc delete pv... the LSO PVs should be deleted and the Delete reclaim policy should be enforced. That does not seem to happen on s390x.

I restarted a Tier 4a test with a (manual) disk cleanup before the test run. All pods were available before the test. Many tests failed again because of Pending pods and a timeout. Here is the pod state in the openshift-storage namespace after the Tier 4a test:

# oc get pods -n openshift-storage
NAME   READY   STATUS   RESTARTS   AGE
csi-cephfsplugin-gcrds   3/3   Running   0   3d4h
csi-cephfsplugin-hpbrx   3/3   Running   0   3d4h
csi-cephfsplugin-pqqz6   3/3   Running   0   3d4h
csi-cephfsplugin-provisioner-d8ccd695d-6jjhb   6/6   Running   0   3d4h
csi-cephfsplugin-provisioner-d8ccd695d-jbf2g   6/6   Running   0   3d4h
csi-rbdplugin-8sl6t   3/3   Running   0   3d4h
csi-rbdplugin-provisioner-76988fbc89-55vsk   6/6   Running   0   3d4h
csi-rbdplugin-provisioner-76988fbc89-qvwtc   6/6   Running   0   3d4h
csi-rbdplugin-tx56x   3/3   Running   0   3d4h
csi-rbdplugin-zmt8v   3/3   Running   0   3d4h
noobaa-core-0   1/1   Running   0   3d4h
noobaa-db-0   1/1   Running   0   3d4h
noobaa-endpoint-fbf5f484d-67zj5   1/1   Running   0   3d4h
noobaa-endpoint-fbf5f484d-pnzxw   1/1   Running   0   3d4h
noobaa-operator-55fc95dc4c-7s4rm   1/1   Running   0   3d4h
ocs-metrics-exporter-c5655b599-qhz6x   1/1   Running   0   3d4h
ocs-operator-c946699b4-rwqqq   1/1   Running   0   3d4h
rook-ceph-crashcollector-worker-001.m1307001ocs.lnxne.boe-8wmvg   1/1   Running   0   3d4h
rook-ceph-crashcollector-worker-002.m1307001ocs.lnxne.boe-fhmtk   1/1   Running   0   3d4h
rook-ceph-crashcollector-worker-003.m1307001ocs.lnxne.boe-p9gp4   1/1   Running   0   3d4h
rook-ceph-mds-ocs-storagecluster-cephfilesystem-a-cc49b489mc8g6   1/1   Running   0   3d4h
rook-ceph-mds-ocs-storagecluster-cephfilesystem-b-769b79566r72f   1/1   Running   0   3d4h
rook-ceph-mgr-a-6775c7cbdb-fxs9j   1/1   Running   0   3d4h
rook-ceph-mon-a-5674bf7f7b-m58pm   1/1   Running   0   3d4h
rook-ceph-mon-b-6d56f47956-5t2cg   1/1   Running   0   3d4h
rook-ceph-mon-d-64d76f7b65-dfshr   1/1   Running   0   3d4h
rook-ceph-operator-6c97bf77-t5gvs   1/1   Running   0   3d4h
rook-ceph-osd-0-59546f75d5-999zh   1/1   Running   0   3d4h
rook-ceph-osd-1-7fb98bfd85-62dpc   1/1   Running   0   3d4h
rook-ceph-osd-2-799fc5d489-qnxz5   1/1   Running   0   3d4h
rook-ceph-osd-prepare-ocs-deviceset-0-data-1-dntps-7sz49   0/1   Pending   0   3d4h
rook-ceph-osd-prepare-ocs-deviceset-0-data-2-ktg27-fv5bq   0/1   Pending   0   3d4h
rook-ceph-osd-prepare-ocs-deviceset-0-data-3-w22gx-vn8n2   0/1   Pending   0   3d4h
rook-ceph-osd-prepare-ocs-deviceset-1-data-1-vlhxr-xkfkv   0/1   Pending   0   3d4h
rook-ceph-osd-prepare-ocs-deviceset-1-data-2-v8ktm-qjtfn   0/1   Pending   0   3d4h
rook-ceph-osd-prepare-ocs-deviceset-1-data-3-qhjts-pr859   0/1   Pending   0   3d4h
rook-ceph-osd-prepare-ocs-deviceset-2-data-1-wwww6-znjmd   0/1   Pending   0   3d4h
rook-ceph-osd-prepare-ocs-deviceset-2-data-2-n2sjh-l8qzf   0/1   Pending   0   3d4h
rook-ceph-osd-prepare-ocs-deviceset-2-data-3-6rnqj-98c8h   0/1   Pending   0   3d4h
rook-ceph-rgw-ocs-storagecluster-cephobjectstore-a-68c46c9s57c6   1/1   Running   1   3d4h
rook-ceph-rgw-ocs-storagecluster-cephobjectstore-b-6b84f84mpqgj   1/1   Running   0   3d4h
rook-ceph-tools-6fdd868f75-cw85l   1/1   Running   0   3d4h
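To narrow down whether the skipped cleanup / reclaim policy is the problem, the LSO PVs and the reason the prepare pods stay Pending can be inspected, for example:

# reclaim policy and phase of the local PVs backing the device sets
oc get pv -o custom-columns=NAME:.metadata.name,STORAGECLASS:.spec.storageClassName,RECLAIM:.spec.persistentVolumeReclaimPolicy,PHASE:.status.phase,CLAIM:.spec.claimRef.name | grep localblock

# the Events section at the end of the describe output usually shows why a pod stays unschedulable
oc -n openshift-storage describe pod rook-ceph-osd-prepare-ocs-deviceset-0-data-1-dntps-7sz49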
Created attachment 1755866 [details] More test4a test results
Sarah, this is a different issue now. I fixed the reported issue in the PR attached. To debug your new issue, please open a new BZ and attach the rook-ceph-osd-prepare-ocs-deviceset-* logs. Thanks!
Ok. Thanks!
This bug is pending verification because of cluster unavailability. We are getting help from the IBM Z folks to obtain an s390x cluster for verifying this BZ. Still waiting for the cluster update from Tuan Hoang.
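Once a cluster is available, one possible way to verify the fix is to leave an old Ceph OSD on one of the disks, deploy OCS, and check that the prepare job skips that disk with a clear message instead of crash-looping. Assuming the prepare pods carry the app=rook-ceph-osd-prepare label (otherwise use the pod names directly):

oc -n openshift-storage get pods | grep osd-prepare
oc -n openshift-storage logs -l app=rook-ceph-osd-prepare --tail=-1 | grep -i "different ceph cluster"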
New proposed version: Previously, if a disk belonged to an existing Ceph cluster, Rook would fail abruptly. With this update, Rook detects that the disk belongs to a different cluster and refuses to deploy the OSD on that disk, logging a message instead.
verified
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: Red Hat OpenShift Container Storage 4.7.0 security, bug fix, and enhancement update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2021:2041