Bug 2275935

Summary:	Disk replacement procedure failed with cli tool
Product:	[Red Hat Storage] Red Hat OpenShift Data Foundation	Reporter:	Oded <oviner>
Component:	odf-cli	Assignee:	Oded <oviner>
Status:	CLOSED ERRATA	QA Contact:	Oded <oviner>
Severity:	unspecified	Docs Contact:
Priority:	unspecified
Version:	4.16	CC:	odf-bz-bot, pbalogh, sapillai, sheggodu, srai, tnielsen
Target Milestone:	---	Keywords:	Regression
Target Release:	ODF 4.16.0
Hardware:	Unspecified
OS:	Unspecified
Whiteboard:
Fixed In Version:	4.16.0-94	Doc Type:	No Doc Update
Doc Text:		Story Points:	---
Clone Of:		Environment:
Last Closed:	2024-07-17 13:19:53 UTC	Type:	Bug
Regression:	---	Mount Type:	---
Documentation:	---	CRM:
Verified Versions:		Category:	---
oVirt Team:	---	RHEL 7.3 requirements from Atomic Host:
Cloudforms Team:	---	Target Upstream Version:
Embargoed:

Description Oded 2024-04-18 14:31:17 UTC

Description of problem (please be as detailed as possible and provide log
snippets):
The disk replacement procedure fails when using the odf-cli tool but succeeds with the regular procedure ("oc commands").

$  ./odf-cli purge-osd 2
Error: failed to run ceph command with args [osd safe-to-destroy 2]. Error EAGAIN: OSD(s) 2 have no reported stats, and not all PGs are active+clean; we cannot draw any conclusions. 
. failed to run command. command terminated with exit code 11


Version of all relevant components (if applicable):
OCP Version: 4.16.0-0.nightly-2024-04-14-063437 
Platform: vSphere
ODF Version: odf-operator.v4.16.0-76.stable


Is there any workaround available to the best of your knowledge?
Yes, use the regular procedure:
https://access.redhat.com/documentation/en-us/red_hat_openshift_data_foundation/4.15/html-single/replacing_devices/index#replacing-operational-or-failed-storage-devices-on-vmware-infrastructure_rhodf

Steps to Reproduce:
Test Procedure
1. Deploy OCP4.16 cluster 4.16.0-0.nightly-2024-04-14-063437 on vSphere platform
2. Install ODF4.16 odf-operator.v4.16.0-76.stable
3. Detach a disk
4. Check OSD status
 $ oc get pods | grep osd
rook-ceph-osd-0-6c4578bd8b-cvwfr                                  2/2     Running            0                 2d22h
rook-ceph-osd-1-6bdfd6b476-drbdc                                  2/2     Running            0                 2d22h
rook-ceph-osd-2-6bc66dc745-r899j                                  1/2     CrashLoopBackOff   758 (4m22s ago)   2d22h
rook-ceph-osd-prepare-8ac2fb68e792d1a8fe8062047269f3d9-zqnpg      0/1     Completed          0                 2d22h
rook-ceph-osd-prepare-b14d60c32479945780398acdb0c5b215-tdv7m      0/1     Completed          0                 2d22h
rook-ceph-osd-prepare-de256a27ffccbb18b8cbc124a137681d-8kwfr      0/1     Completed          0                 2d22h

5. Try to replace the disk via the cli tool:
a. Download the binary:
oc image extract --registry-config /home/oviner/OCS-AP/ocs-ci/data/pull-secret quay.io/rhceph-dev/mcg-cli:4.16.0-76 --confirm --path /usr/share/odf/linux/odf-amd64:/home/oviner/OCS-AP/ocs-ci/bin

b. Run purge-osd via the cli tool (fails with an error):
$  ./odf-cli purge-osd 2
Error: failed to run ceph command with args [osd safe-to-destroy 2]. Error EAGAIN: OSD(s) 2 have no reported stats, and not all PGs are active+clean; we cannot draw any conclusions. 
. failed to run command. command terminated with exit code 11


6. Try to replace the disk via "oc commands" (succeeds):
https://access.redhat.com/documentation/en-us/red_hat_openshift_data_foundation/4.15/html-single/replacing_devices/index#replacing-operational-or-failed-storage-devices-on-vmware-infrastructure_rhodf
$ osd_id_to_remove=2
$  oc scale -n openshift-storage deployment rook-ceph-osd-${osd_id_to_remove} --replicas=0
$ oc get -n openshift-storage pods -l ceph-osd-id=${osd_id_to_remove}
$ oc delete -n openshift-storage job ocs-osd-removal-job
$ oc project openshift-storage
$ oc process -n openshift-storage ocs-osd-removal -p FAILED_OSD_IDS=${osd_id_to_remove} FORCE_OSD_REMOVAL=true |oc create -n openshift-storage -f -
$ oc get pod -l job-name=ocs-osd-removal-job -n openshift-storage
$ oc logs -l job-name=ocs-osd-removal-job -n openshift-storage --tail=-1
$ oc get -n openshift-storage pods -l app=rook-ceph-osd 
NAME                               READY   STATUS    RESTARTS   AGE
rook-ceph-osd-0-6c4578bd8b-cvwfr   2/2     Running   0          2d22h
rook-ceph-osd-1-6bdfd6b476-drbdc   2/2     Running   0          2d22h
rook-ceph-osd-2-68b7bf8967-wxs4w   1/2     Running   0          11s
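
The OSD status checks in the steps above (reading the "oc get pods" listing for a pod that is not fully ready) could be automated roughly as follows. This is only a minimal Python sketch: the function name unready_pods is hypothetical, and it assumes the default column layout (NAME, READY, STATUS, ...) shown in this report.

```python
from typing import List, Tuple


def unready_pods(listing: str) -> List[Tuple[str, str, str]]:
    """Return (name, ready, status) for pods whose containers are not all ready.

    Pods in the Completed state (finished jobs such as osd-prepare) are skipped.
    """
    result = []
    for line in listing.splitlines():
        fields = line.split()
        # Skip blank lines, the header row, and anything that is not a pod row.
        if len(fields) < 3 or fields[0] == "NAME" or "/" not in fields[1]:
            continue
        name, ready, status = fields[0], fields[1], fields[2]
        ready_count, _, total = ready.partition("/")
        if not (ready_count.isdigit() and total.isdigit()):
            continue
        if status != "Completed" and int(ready_count) < int(total):
            result.append((name, ready, status))
    return result
```

Fed the listing from step 4, this would report only rook-ceph-osd-2 (1/2, CrashLoopBackOff). Right after the replacement it would still report the new osd-2 pod until it reaches 2/2.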


Additional info:
https://docs.google.com/document/d/1WSyuhbqohTOit7qGasO6dsSYuR8wFIsJQXhu0PyeCTc/edit

Comment 3 Oded 2024-04-18 15:16:27 UTC

I debugged the code locally on my machine

/.vscode/launch.json
{
    "version": "0.2.0",
    "configurations": [
        {
            "name": "Launch",
            "type": "go",
            "request": "launch",
            "mode": "debug",
            "program": "${file}",
            "env": {},
            "args": ["purge-osd", "0"]
        }
    ]
}


$ oc get pods | grep osd
rook-ceph-osd-0-6c4578bd8b-cvwfr                                  1/2     CrashLoopBackOff   11 (73s ago)    2d23h

$ oc rsh rook-ceph-operator-5b4ff776bd-284xf
sh-5.1$ ceph osd safe-to-destroy 0 --connect-timeout=10 --conf=/var/lib/rook/openshift-storage/openshift-storage.config
Error EAGAIN: OSD(s) 0 have no reported stats, and not all PGs are active+clean; we cannot draw any conclusions. 

In the SafeToDestroy function https://github.com/red-hat-storage/odf-cli/blob/main/pkg/rook/osd/osd.go#L55-L56, the command "ceph osd safe-to-destroy 0" returns an error.
I also ran it for osd.55, which does not exist, and it was reported as safe to destroy:

sh-5.1$ ceph osd safe-to-destroy 55 --connect-timeout=10 --conf=/var/lib/rook/openshift-storage/openshift-storage.config
OSD(s) 55 are safe to destroy without reducing data durability.

Maybe this is a Ceph issue?
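
For triage, the kinds of safe-to-destroy output quoted in this report can be told apart by the errno name that ceph prints in front of the message. The following is a minimal Python sketch, not code from odf-cli: the function name and the verdict labels are hypothetical, and the matching relies only on the message text quoted in this bug.

```python
import re

# Matches the errno name in a ceph error line, e.g. "Error EAGAIN: OSD(s) 0 ...".
# It is not anchored to the line start, so it also works when odf-cli wraps the
# message ("Error: failed to run ceph command ... Error EAGAIN: ...").
_ERRNO_RE = re.compile(r"\bError (E[A-Z]+):")


def classify_safe_to_destroy(output: str) -> str:
    """Map `ceph osd safe-to-destroy` output to a coarse verdict.

    The labels "safe", "not-safe", "undetermined" and "unrecognized" are
    hypothetical names chosen for this sketch.
    """
    match = _ERRNO_RE.search(output)
    if match is None:
        if "are safe to destroy" in output:
            return "safe"
        return "unrecognized"
    errno_name = match.group(1)
    if errno_name == "EBUSY":
        # PGs are still mapped to the OSD.
        return "not-safe"
    if errno_name == "EAGAIN":
        # ceph could not draw a conclusion (no stats, PGs not active+clean).
        return "undetermined"
    return "unrecognized"
```

With the outputs above, the osd.0 message would be classified as "undetermined" and the osd.55 message as "safe".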

Comment 4 Oded 2024-04-18 17:37:58 UTC

I opened a PR to fix the bz https://github.com/red-hat-storage/odf-cli/pull/38/files

Comment 6 Oded 2024-04-21 09:20:05 UTC

This PR causes the issue: https://github.com/red-hat-storage/odf-cli/pull/26/files#diff-2ea9ab46da76bae62e31a10c9a0a21554c377272f9c6774e0086adf5f00378d3R27

oviner~/DEV_REPOS/odf-cli(bz-2275935)$ oc rsh rook-ceph-tools-799db6fc84-29f2l
sh-5.1$ ceph osd safe-to-destroy osd.0       
Error EBUSY: OSD(s) 0 have 169 pgs currently mapped to them. 
sh-5.1$ ceph osd safe-to-destroy osd.11
OSD(s) 11 are safe to destroy without reducing data durability.


The "ceph osd safe-to-destroy osd.0" command returns an error, and the Fatal function then exits with exit code 1: https://github.com/rook/kubectl-rook-ceph/blob/master/pkg/logging/log.go#L50
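
The behaviour this analysis points toward can be sketched as follows: a failing safe-to-destroy check is treated as "not safe" and the user is asked to confirm, instead of the tool exiting. This is a minimal Python sketch under stated assumptions, not the odf-cli implementation (which is written in Go). The names plan_purge, check_safe and ask are hypothetical, and the choice to purge a safe OSD without forcing is an assumption. The prompt text and the confirmation string are copied from the output of the fixed build shown later in this bug.

```python
from typing import Callable, Tuple

# Confirmation string printed by the fixed build.
FORCE_ANSWER = "yes-force-destroy-osd"


def plan_purge(
    osd_id: int,
    check_safe: Callable[[int], bool],
    ask: Callable[[str], str],
) -> Tuple[bool, bool]:
    """Decide whether to purge an OSD; returns (proceed, force).

    check_safe and ask are hypothetical hooks: check_safe wraps
    `ceph osd safe-to-destroy` and may raise RuntimeError when the command
    exits non-zero; ask shows a prompt and returns the user's answer.
    """
    try:
        safe = check_safe(osd_id)
    except RuntimeError:
        # A failed check means "not proven safe"; it is not a reason to abort.
        safe = False

    if safe:
        # Assumption: a safe OSD needs no forced removal.
        return True, False

    prompt = (
        f"Are you sure you want to purge osd.{osd_id}? The OSD is *not* safe "
        f"to destroy. This may lead to data loss. If you are sure the OSD "
        f"should be purged, enter '{FORCE_ANSWER}'"
    )
    if ask(prompt).strip() == FORCE_ANSWER:
        return True, True
    return False, False
```

The point of the design is that the error from the check only changes which question the user is asked; it never ends the run on its own.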

Comment 7 Santosh Pillai 2024-04-24 01:49:12 UTC

Hi Oded, 
Assigning this BZ to you since you are working on the fix.

Comment 11 Oded 2024-05-06 08:56:46 UTC

The bug still reproduces.

oviner~/cli-tool$ ./odf-amd64 purge-osd 0
Error: failed to run ceph command with args [osd safe-to-destroy 0]. Error EAGAIN: OSD(s) 0 have no reported stats, and not all PGs are active+clean; we cannot draw any conclusions. 
. failed to run command. command terminated with exit code 11



Test procedure: 
1. Deploy OCP4.16 cluster 4.16.0-0.nightly-2024-05-04-214435 on vSphere platform
2. Install ODF4.16 odf-operator.v4.16.0-92.stable
3. Detach a disk via vCenter
4. Check OSD status
 $ oc get pods rook-ceph-osd-0-6d9b68df9-jtc4l
NAME                              READY   STATUS             RESTARTS     AGE
rook-ceph-osd-0-6d9b68df9-jtc4l   1/2     CrashLoopBackOff   2 (5s ago)   16h
5. Download the binary:
oviner~/cli-tool$ oc image extract --registry-config /home/oviner/OCS-AP/ocs-ci/data/pull-secret quay.io/rhceph-dev/mcg-cli:4.16.0-92 --confirm --path /usr/share/odf/linux/odf-amd64:/home/oviner/cli-tool
oviner~/cli-tool$ ls
odf-amd64
6. Run purge-osd via the cli tool:
oviner~/cli-tool$ ./odf-amd64 purge-osd 0
Error: failed to run ceph command with args [osd safe-to-destroy 0]. Error EAGAIN: OSD(s) 0 have no reported stats, and not all PGs are active+clean; we cannot draw any conclusions. 
. failed to run command. command terminated with exit code 11


Boris, is this PR https://github.com/red-hat-storage/odf-cli/pull/38 included in the quay.io/rhceph-dev/mcg-cli:4.16.0-92 image?

Comment 12 Oded 2024-05-06 09:09:39 UTC

I opened a cherry-pick PR against the release-4.16 branch: https://github.com/red-hat-storage/odf-cli/pull/45

Comment 13 Oded 2024-05-07 11:30:10 UTC

Bug fixed.

Test procedure: 
1. Deploy OCP4.16 cluster 4.16.0-0.nightly-2024-05-04-214435 on vSphere platform
2. Install ODF4.16 odf-operator.v4.16.0-92.stable
3. Detach a disk via vCenter
4. Check OSD status
$ oc get pods rook-ceph-osd-0-9cc796565-9rkml
NAME                              READY   STATUS             RESTARTS      AGE
rook-ceph-osd-0-9cc796565-9rkml   1/2     CrashLoopBackOff   2 (17s ago)   26h

5. Download the binary:
 oc image extract --registry-config /home/oviner/OCS-AP/ocs-ci/data/pull-secret quay.io/rhceph-dev/mcg-cli:4.16.0-94 --confirm --path /usr/share/odf/linux/odf-amd64:/home/oviner/cli-tool
oviner~/cli-tool$ chmod -R 777 odf-amd64 

6. Run the cli tool:
oviner~/cli-tool$ ./odf-amd64 purge-osd 0
Warning: Are you sure you want to purge osd.0? The OSD is *not* safe to destroy. This may lead to data loss. If you are sure the OSD should be purged, enter 'yes-force-destroy-osd'
yes-force-destroy-osd
Info: Running purge osd command
2024/05/07 11:27:51 maxprocs: Leaving GOMAXPROCS=12: CPU quota undefined
2024-05-07 11:27:51.032067 W | cephcmd: loaded admin secret from env var ROOK_CEPH_SECRET instead of from file
2024-05-07 11:27:51.032144 I | rookcmd: starting Rook v4.16.0-0.4f297e0b42d1af7a7b3198f9ed979a8526062c2f with arguments 'rook ceph osd remove --osd-ids=0 --force-osd-removal=true'
2024-05-07 11:27:51.032152 I | rookcmd: flag values: --force-osd-removal=true, --help=false, --log-level=INFO, --osd-ids=0, --preserve-pvc=false
2024-05-07 11:27:51.032155 I | ceph-spec: parsing mon endpoints: c=172.30.210.194:3300,a=172.30.40.143:3300,b=172.30.4.250:3300
2024-05-07 11:27:51.048938 I | cephclient: writing config file /var/lib/rook/openshift-storage/openshift-storage.config
2024-05-07 11:27:51.049180 I | cephclient: generated admin config in /var/lib/rook/openshift-storage
2024-05-07 11:27:51.431482 I | cephosd: validating status of osd.0
2024-05-07 11:27:51.431506 I | cephosd: osd.0 is marked 'DOWN'
2024-05-07 11:27:51.811853 I | cephosd: marking osd.0 out
2024-05-07 11:27:53.664526 I | cephosd: osd.0 is NOT ok to destroy but force removal is enabled so proceeding with removal
2024-05-07 11:27:53.672715 I | cephosd: removing the OSD deployment "rook-ceph-osd-0"
2024-05-07 11:27:53.672747 I | op-k8sutil: removing deployment rook-ceph-osd-0 if it exists
2024-05-07 11:27:53.685053 I | op-k8sutil: Removed deployment rook-ceph-osd-0
2024-05-07 11:27:53.693935 I | op-k8sutil: "rook-ceph-osd-0" still found. waiting...
2024-05-07 11:27:55.701264 I | op-k8sutil: confirmed rook-ceph-osd-0 does not exist
2024-05-07 11:27:55.712079 I | cephosd: removing the osd prepare job "rook-ceph-osd-prepare-ocs-deviceset-0-data-0fjsm8"
2024-05-07 11:27:55.735815 I | cephosd: removing the OSD PVC "ocs-deviceset-0-data-0fjsm8"
2024-05-07 11:27:55.745540 I | cephosd: purging osd.0
2024-05-07 11:27:56.121591 I | cephosd: attempting to remove host "ocs-deviceset-0-data-0fjsm8" from crush map if not in use
2024-05-07 11:27:57.143758 I | cephosd: removed CRUSH host "ocs-deviceset-0-data-0fjsm8"
2024-05-07 11:27:57.494974 I | cephosd: no ceph crash to silence
2024-05-07 11:27:57.495006 I | cephosd: completed removal of OSD 0

7. Check the osd.0 status:
$ oc get pods rook-ceph-osd-0-795ccb5669-wxgm5
NAME                               READY   STATUS    RESTARTS   AGE
rook-ceph-osd-0-795ccb5669-wxgm5   1/2     Running   0          16s
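
The Rook log from step 6 can also be checked programmatically rather than by eye. Below is a minimal Python sketch; summarize_removal is a hypothetical helper, and the two patterns are taken only from the log lines in this comment, so they may not match other Rook versions.

```python
import re
from typing import Dict

# Patterns copied from the cephosd log lines shown in this comment.
_FORCED_RE = re.compile(
    r"cephosd: osd\.(\d+) is NOT ok to destroy but force removal is enabled"
)
_DONE_RE = re.compile(r"cephosd: completed removal of OSD (\d+)")


def summarize_removal(log: str) -> Dict[int, Dict[str, bool]]:
    """Return, per OSD id, whether removal was forced and whether it completed."""
    summary: Dict[int, Dict[str, bool]] = {}
    for line in log.splitlines():
        for pattern, key in ((_FORCED_RE, "forced"), (_DONE_RE, "completed")):
            match = pattern.search(line)
            if match:
                entry = summary.setdefault(
                    int(match.group(1)), {"forced": False, "completed": False}
                )
                entry[key] = True
    return summary
```

For the log above, the summary would contain a single entry for OSD 0 with both "forced" and "completed" set.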

Comment 15 Sunil Kumar Acharya 2024-06-18 06:45:26 UTC

Please update the RDT flag/text appropriately.

Comment 17 errata-xmlrpc 2024-07-17 13:19:53 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Important: Red Hat OpenShift Data Foundation 4.16.0 security, enhancement & bug fix update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2024:4591

Comment 18 Red Hat Bugzilla 2024-11-15 04:25:32 UTC

The needinfo request[s] on this closed bug have been removed as they have been unresolved for 120 days