Description of problem (please be as detailed as possible and provide log snippets):

The disk replacement procedure fails when using the CLI tool, but succeeds when using the regular procedure ("oc commands"):

$ ./odf-cli purge-osd 2
Error: failed to run ceph command with args [osd safe-to-destroy 2]. Error EAGAIN: OSD(s) 2 have no reported stats, and not all PGs are active+clean; we cannot draw any conclusions.
failed to run command. command terminated with exit code 11

Version of all relevant components (if applicable):
OCP Version: 4.16.0-0.nightly-2024-04-14-063437
Platform: vSphere
ODF Version: odf-operator.v4.16.0-76.stable

Does this issue impact your ability to continue to work with the product (please explain in detail what is the user impact)?

Is there any workaround available to the best of your knowledge?
Yes, working with the regular procedure:
https://access.redhat.com/documentation/en-us/red_hat_openshift_data_foundation/4.15/html-single/replacing_devices/index#replacing-operational-or-failed-storage-devices-on-vmware-infrastructure_rhodf

Rate from 1 - 5 the complexity of the scenario you performed that caused this bug (1 - very simple, 5 - very complex)?

Can this issue be reproduced?

Can this issue be reproduced from the UI?

If this is a regression, please provide more details to justify this:

Steps to Reproduce:

1. Deploy an OCP 4.16 cluster (4.16.0-0.nightly-2024-04-14-063437) on the vSphere platform.
2. Install ODF 4.16 (odf-operator.v4.16.0-76.stable).
3. Detach a disk.
4. Check the OSD status:

$ oc get pods | grep osd
rook-ceph-osd-0-6c4578bd8b-cvwfr                               2/2   Running            0                 2d22h
rook-ceph-osd-1-6bdfd6b476-drbdc                               2/2   Running            0                 2d22h
rook-ceph-osd-2-6bc66dc745-r899j                               1/2   CrashLoopBackOff   758 (4m22s ago)   2d22h
rook-ceph-osd-prepare-8ac2fb68e792d1a8fe8062047269f3d9-zqnpg   0/1   Completed          0                 2d22h
rook-ceph-osd-prepare-b14d60c32479945780398acdb0c5b215-tdv7m   0/1   Completed          0                 2d22h
rook-ceph-osd-prepare-de256a27ffccbb18b8cbc124a137681d-8kwfr   0/1   Completed          0                 2d22h

5. Try to replace the disk via the CLI tool:
   a. Download the binary:
      $ oc image extract --registry-config /home/oviner/OCS-AP/ocs-ci/data/pull-secret quay.io/rhceph-dev/mcg-cli:4.16.0-76 --confirm --path /usr/share/odf/linux/odf-amd64:/home/oviner/OCS-AP/ocs-ci/bin
   b. Run purge-osd via the CLI tool [got error]:
      $ ./odf-cli purge-osd 2
      Error: failed to run ceph command with args [osd safe-to-destroy 2]. Error EAGAIN: OSD(s) 2 have no reported stats, and not all PGs are active+clean; we cannot draw any conclusions.
      failed to run command. command terminated with exit code 11

6. Try to replace the disk via "oc commands" (this succeeds):
   https://access.redhat.com/documentation/en-us/red_hat_openshift_data_foundation/4.15/html-single/replacing_devices/index#replacing-operational-or-failed-storage-devices-on-vmware-infrastructure_rhodf

$ osd_id_to_remove=2
$ oc scale -n openshift-storage deployment rook-ceph-osd-${osd_id_to_remove} --replicas=0
$ oc get -n openshift-storage pods -l ceph-osd-id=${osd_id_to_remove}
$ oc delete -n openshift-storage job ocs-osd-removal-job
$ oc project openshift-storage
$ oc process -n openshift-storage ocs-osd-removal -p FAILED_OSD_IDS=${osd_id_to_remove} FORCE_OSD_REMOVAL=true | oc create -n openshift-storage -f -
$ oc get pod -l job-name=ocs-osd-removal-job -n openshift-storage
$ oc logs -l job-name=ocs-osd-removal-job -n openshift-storage --tail=-1
$ oc get -n openshift-storage pods -l app=rook-ceph-osd
NAME                               READY   STATUS    RESTARTS   AGE
rook-ceph-osd-0-6c4578bd8b-cvwfr   2/2     Running   0          2d22h
rook-ceph-osd-1-6bdfd6b476-drbdc   2/2     Running   0          2d22h
rook-ceph-osd-2-68b7bf8967-wxs4w   1/2     Running   0          11s

Actual results:
The CLI tool fails with "Error EAGAIN ... command terminated with exit code 11" and the OSD is not purged.

Expected results:
The CLI tool purges the OSD, as the documented "oc commands" procedure does.

Additional info:
https://docs.google.com/document/d/1WSyuhbqohTOit7qGasO6dsSYuR8wFIsJQXhu0PyeCTc/edit
I debugged the code locally on my machine.

/.vscode/launch.json:

{
  "version": "0.2.0",
  "configurations": [
    {
      "name": "Launch",
      "type": "go",
      "request": "launch",
      "mode": "debug",
      "program": "${file}",
      "env": {},
      "args": ["purge-osd", "0"]
    }
  ]
}

$ oc get pods | grep osd
rook-ceph-osd-0-6c4578bd8b-cvwfr   1/2   CrashLoopBackOff   11 (73s ago)   2d23h

$ oc rsh rook-ceph-operator-5b4ff776bd-284xf
sh-5.1$ ceph osd safe-to-destroy 0 --connect-timeout=10 --conf=/var/lib/rook/openshift-storage/openshift-storage.config
Error EAGAIN: OSD(s) 0 have no reported stats, and not all PGs are active+clean; we cannot draw any conclusions.

In the SafeToDestroy function (https://github.com/red-hat-storage/odf-cli/blob/main/pkg/rook/osd/osd.go#L55-L56), the command "ceph osd safe-to-destroy 0" returns an error. I tried running it on osd-55 [which does not exist] and it reported "safe to destroy":

sh-5.1$ ceph osd safe-to-destroy 55 --connect-timeout=10 --conf=/var/lib/rook/openshift-storage/openshift-storage.config
OSD(s) 55 are safe to destroy without reducing data durability.

Maybe a ceph issue?
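For reference, a minimal sketch (hypothetical helper, not the actual odf-cli code) of how the exit status of "ceph osd safe-to-destroy" could be interpreted: exit 0 means the OSD is provably safe, while EAGAIN (11, no reported stats) and EBUSY (16, PGs still mapped) both mean it is currently not provably safe and should not simply abort the tool:

```go
package main

import "fmt"

// classifySafeToDestroy interprets the exit code of
// "ceph osd safe-to-destroy <id>". Hypothetical helper for illustration;
// the real logic lives in pkg/rook/osd/osd.go in odf-cli.
func classifySafeToDestroy(exitCode int) (safe bool, reason string) {
	switch exitCode {
	case 0:
		return true, "safe to destroy without reducing data durability"
	case 11: // EAGAIN: OSD has no reported stats / PGs not active+clean
		return false, "no conclusion possible (EAGAIN); OSD may be down"
	case 16: // EBUSY: PGs are still mapped to the OSD
		return false, "PGs still mapped to the OSD (EBUSY)"
	default:
		return false, fmt.Sprintf("unexpected exit code %d", exitCode)
	}
}

func main() {
	for _, code := range []int{0, 11, 16} {
		safe, reason := classifySafeToDestroy(code)
		fmt.Printf("exit %d -> safe=%v (%s)\n", code, safe, reason)
	}
}
```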
I opened a PR to fix the BZ: https://github.com/red-hat-storage/odf-cli/pull/38/files
The PR https://github.com/red-hat-storage/odf-cli/pull/26/files#diff-2ea9ab46da76bae62e31a10c9a0a21554c377272f9c6774e0086adf5f00378d3R27 caused the issue.

oviner~/DEV_REPOS/odf-cli(bz-2275935)$ oc rsh rook-ceph-tools-799db6fc84-29f2l
sh-5.1$ ceph osd safe-to-destroy osd.0
Error EBUSY: OSD(s) 0 have 169 pgs currently mapped to them.
sh-5.1$ ceph osd safe-to-destroy osd.11
OSD(s) 11 are safe to destroy without reducing data durability.

The "ceph osd safe-to-destroy osd.0" command returns an error, and the Fatal function then exits the process [exit code 1]: https://github.com/rook/kubectl-rook-ceph/blob/master/pkg/logging/log.go#L50
Hi Oded, Assigning this BZ to you since you are working on the fix.
The bug reproduced:

oviner~/cli-tool$ ./odf-amd64 purge-osd 0
Error: failed to run ceph command with args [osd safe-to-destroy 0]. Error EAGAIN: OSD(s) 0 have no reported stats, and not all PGs are active+clean; we cannot draw any conclusions.
failed to run command. command terminated with exit code 11

Test procedure:

1. Deploy an OCP 4.16 cluster (4.16.0-0.nightly-2024-05-04-214435) on the vSphere platform.
2. Install ODF 4.16 (odf-operator.v4.16.0-92.stable).
3. Detach a disk via vCenter.
4. Check the OSD status:

$ oc get pods rook-ceph-osd-0-6d9b68df9-jtc4l
NAME                              READY   STATUS             RESTARTS     AGE
rook-ceph-osd-0-6d9b68df9-jtc4l   1/2     CrashLoopBackOff   2 (5s ago)   16h

5. Download the binary:

oviner~/cli-tool$ oc image extract --registry-config /home/oviner/OCS-AP/ocs-ci/data/pull-secret quay.io/rhceph-dev/mcg-cli:4.16.0-92 --confirm --path /usr/share/odf/linux/odf-amd64:/home/oviner/cli-tool
oviner~/cli-tool$ ls
odf-amd64

6. Run purge-osd via the CLI tool:

oviner~/cli-tool$ ./odf-amd64 purge-osd 0
Error: failed to run ceph command with args [osd safe-to-destroy 0]. Error EAGAIN: OSD(s) 0 have no reported stats, and not all PGs are active+clean; we cannot draw any conclusions.
failed to run command. command terminated with exit code 11

Boris, is the PR https://github.com/red-hat-storage/odf-cli/pull/38 included in the quay.io/rhceph-dev/mcg-cli:4.16.0-92 image?
I opened a cherry-pick PR to the release-4.16 branch: https://github.com/red-hat-storage/odf-cli/pull/45
Bug fixed.

Test procedure:

1. Deploy an OCP 4.16 cluster (4.16.0-0.nightly-2024-05-04-214435) on the vSphere platform.
2. Install ODF 4.16 (odf-operator.v4.16.0-92.stable).
3. Detach a disk via vCenter.
4. Check the OSD status:

$ oc get pods rook-ceph-osd-0-9cc796565-9rkml
NAME                              READY   STATUS             RESTARTS      AGE
rook-ceph-osd-0-9cc796565-9rkml   1/2     CrashLoopBackOff   2 (17s ago)   26h

5. Download the binary:

oc image extract --registry-config /home/oviner/OCS-AP/ocs-ci/data/pull-secret quay.io/rhceph-dev/mcg-cli:4.16.0-94 --confirm --path /usr/share/odf/linux/odf-amd64:/home/oviner/cli-tool
oviner~/cli-tool$ chmod -R 777 odf-amd64

6. Run the CLI tool:

oviner~/cli-tool$ ./odf-amd64 purge-osd 0
Warning: Are you sure you want to purge osd.0? The OSD is *not* safe to destroy. This may lead to data loss. If you are sure the OSD should be purged, enter 'yes-force-destroy-osd'
yes-force-destroy-osd
Info: Running purge osd command
2024/05/07 11:27:51 maxprocs: Leaving GOMAXPROCS=12: CPU quota undefined
2024-05-07 11:27:51.032067 W | cephcmd: loaded admin secret from env var ROOK_CEPH_SECRET instead of from file
2024-05-07 11:27:51.032144 I | rookcmd: starting Rook v4.16.0-0.4f297e0b42d1af7a7b3198f9ed979a8526062c2f with arguments 'rook ceph osd remove --osd-ids=0 --force-osd-removal=true'
2024-05-07 11:27:51.032152 I | rookcmd: flag values: --force-osd-removal=true, --help=false, --log-level=INFO, --osd-ids=0, --preserve-pvc=false
2024-05-07 11:27:51.032155 I | ceph-spec: parsing mon endpoints: c=172.30.210.194:3300,a=172.30.40.143:3300,b=172.30.4.250:3300
2024-05-07 11:27:51.048938 I | cephclient: writing config file /var/lib/rook/openshift-storage/openshift-storage.config
2024-05-07 11:27:51.049180 I | cephclient: generated admin config in /var/lib/rook/openshift-storage
2024-05-07 11:27:51.431482 I | cephosd: validating status of osd.0
2024-05-07 11:27:51.431506 I | cephosd: osd.0 is marked 'DOWN'
2024-05-07 11:27:51.811853 I | cephosd: marking osd.0 out
2024-05-07 11:27:53.664526 I | cephosd: osd.0 is NOT ok to destroy but force removal is enabled so proceeding with removal
2024-05-07 11:27:53.672715 I | cephosd: removing the OSD deployment "rook-ceph-osd-0"
2024-05-07 11:27:53.672747 I | op-k8sutil: removing deployment rook-ceph-osd-0 if it exists
2024-05-07 11:27:53.685053 I | op-k8sutil: Removed deployment rook-ceph-osd-0
2024-05-07 11:27:53.693935 I | op-k8sutil: "rook-ceph-osd-0" still found. waiting...
2024-05-07 11:27:55.701264 I | op-k8sutil: confirmed rook-ceph-osd-0 does not exist
2024-05-07 11:27:55.712079 I | cephosd: removing the osd prepare job "rook-ceph-osd-prepare-ocs-deviceset-0-data-0fjsm8"
2024-05-07 11:27:55.735815 I | cephosd: removing the OSD PVC "ocs-deviceset-0-data-0fjsm8"
2024-05-07 11:27:55.745540 I | cephosd: purging osd.0
2024-05-07 11:27:56.121591 I | cephosd: attempting to remove host "ocs-deviceset-0-data-0fjsm8" from crush map if not in use
2024-05-07 11:27:57.143758 I | cephosd: removed CRUSH host "ocs-deviceset-0-data-0fjsm8"
2024-05-07 11:27:57.494974 I | cephosd: no ceph crash to silence
2024-05-07 11:27:57.495006 I | cephosd: completed removal of OSD 0

7. Check the OSD-0 status:

$ oc get pods rook-ceph-osd-0-795ccb5669-wxgm5
NAME                               READY   STATUS    RESTARTS   AGE
rook-ceph-osd-0-795ccb5669-wxgm5   1/2     Running   0          16s
Please update the RDT flag/text appropriately.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Important: Red Hat OpenShift Data Foundation 4.16.0 security, enhancement & bug fix update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2024:4591
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 120 days