Bug 2275935
| Summary: | Disk replacement procedure failed with cli tool | | |
|---|---|---|---|
| Product: | [Red Hat Storage] Red Hat OpenShift Data Foundation | Reporter: | Oded <oviner> |
| Component: | odf-cli | Assignee: | Oded <oviner> |
| Status: | CLOSED ERRATA | QA Contact: | Oded <oviner> |
| Severity: | unspecified | Docs Contact: | |
| Priority: | unspecified | | |
| Version: | 4.16 | CC: | odf-bz-bot, pbalogh, sapillai, sheggodu, srai, tnielsen |
| Target Milestone: | --- | Keywords: | Regression |
| Target Release: | ODF 4.16.0 | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | 4.16.0-94 | Doc Type: | No Doc Update |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2024-07-17 13:19:53 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
Description
Oded
2024-04-18 14:31:17 UTC
I debugged the code locally on my machine, using the following debug configuration:
/.vscode/launch.json
{
    "version": "0.2.0",
    "configurations": [
        {
            "name": "Launch",
            "type": "go",
            "request": "launch",
            "mode": "debug",
            "program": "${file}",
            "env": {},
            "args": ["purge-osd", "0"]
        }
    ]
}
$ oc get pods | grep osd
rook-ceph-osd-0-6c4578bd8b-cvwfr 1/2 CrashLoopBackOff 11 (73s ago) 2d23h
$ oc rsh rook-ceph-operator-5b4ff776bd-284xf
sh-5.1$ ceph osd safe-to-destroy 0 --connect-timeout=10 --conf=/var/lib/rook/openshift-storage/openshift-storage.config
Error EAGAIN: OSD(s) 0 have no reported stats, and not all PGs are active+clean; we cannot draw any conclusions.
In the SafeToDestroy function (https://github.com/red-hat-storage/odf-cli/blob/main/pkg/rook/osd/osd.go#L55-L56), the command "ceph osd safe-to-destroy 0" returns an error.
I also ran it against OSD 55, which does not exist, and it was reported as "safe to destroy":
sh-5.1$ ceph osd safe-to-destroy 55 --connect-timeout=10 --conf=/var/lib/rook/openshift-storage/openshift-storage.config
OSD(s) 55 are safe to destroy without reducing data durability.
Maybe this is a Ceph issue?
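The outputs above suggest that a failing "ceph osd safe-to-destroy" call carries information (the OSD is not confirmed safe) rather than being a fatal condition. The following is a minimal Go sketch of that interpretation, not the odf-cli implementation: the function name isSafeToDestroy and the idea of passing the exit code as an int are assumptions made for illustration. Exit code 11 is taken from the failure reported in this bug; 16 is assumed here as the Linux errno value for EBUSY.

```go
package main

import (
	"fmt"
	"strings"
)

// isSafeToDestroy interprets the result of "ceph osd safe-to-destroy <id>".
// Hypothetical helper: a non-zero exit code (for example EAGAIN or EBUSY)
// is treated as "not confirmed safe" instead of as a fatal error.
func isSafeToDestroy(output string, exitCode int) bool {
	if exitCode != 0 {
		return false
	}
	// On success Ceph prints "... are safe to destroy without reducing data durability."
	return strings.Contains(output, "safe to destroy")
}

func main() {
	samples := []struct {
		output   string
		exitCode int
	}{
		{"OSD(s) 55 are safe to destroy without reducing data durability.", 0},
		{"Error EAGAIN: OSD(s) 0 have no reported stats, and not all PGs are active+clean; we cannot draw any conclusions.", 11},
		{"Error EBUSY: OSD(s) 0 have 169 pgs currently mapped to them.", 16}, // 16: assumed errno for EBUSY
	}
	for _, s := range samples {
		fmt.Printf("exit=%d safe=%t\n", s.exitCode, isSafeToDestroy(s.output, s.exitCode))
	}
}
```

With a classification like this, the caller can decide what to do with an unsafe OSD (for example, ask the user for confirmation) instead of exiting on the first error.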
I opened a PR to fix this BZ: https://github.com/red-hat-storage/odf-cli/pull/38/files
The PR https://github.com/red-hat-storage/odf-cli/pull/26/files#diff-2ea9ab46da76bae62e31a10c9a0a21554c377272f9c6774e0086adf5f00378d3R27 caused the issue.
oviner~/DEV_REPOS/odf-cli(bz-2275935)$ oc rsh rook-ceph-tools-799db6fc84-29f2l
sh-5.1$ ceph osd safe-to-destroy osd.0
Error EBUSY: OSD(s) 0 have 169 pgs currently mapped to them.
sh-5.1$ ceph osd safe-to-destroy osd.11
OSD(s) 11 are safe to destroy without reducing data durability.
The "ceph osd safe-to-destroy osd.0" command returns an error, and the Fatal function then terminates the program [exit code 1]: https://github.com/rook/kubectl-rook-ceph/blob/master/pkg/logging/log.go#L50

Hi Oded, Assigning this BZ to you since you are working on the fix.

The bug reproduced:
oviner~/cli-tool$ ./odf-amd64 purge-osd 0
Error: failed to run ceph command with args [osd safe-to-destroy 0]. Error EAGAIN: OSD(s) 0 have no reported stats, and not all PGs are active+clean; we cannot draw any conclusions.
. failed to run command. command terminated with exit code 11
Test procedure:
1. Deploy OCP4.16 cluster 4.16.0-0.nightly-2024-05-04-214435 on vSphere platform
2. Install ODF4.16 odf-operator.v4.16.0-92.stable
3. Detach a disk via vCenter
4. Check OSD status
$ oc get pods rook-ceph-osd-0-6d9b68df9-jtc4l
NAME READY STATUS RESTARTS AGE
rook-ceph-osd-0-6d9b68df9-jtc4l 1/2 CrashLoopBackOff 2 (5s ago) 16h
5. Download bin file:
oviner~/cli-tool$ oc image extract --registry-config /home/oviner/OCS-AP/ocs-ci/data/pull-secret quay.io/rhceph-dev/mcg-cli:4.16.0-92 --confirm --path /usr/share/odf/linux/odf-amd64:/home/oviner/cli-tool
oviner~/cli-tool$ ls
odf-amd64
6. Run purge-osd via cli tool:
oviner~/cli-tool$ ./odf-amd64 purge-osd 0
Error: failed to run ceph command with args [osd safe-to-destroy 0]. Error EAGAIN: OSD(s) 0 have no reported stats, and not all PGs are active+clean; we cannot draw any conclusions.
. failed to run command.
command terminated with exit code 11

Boris, is this PR https://github.com/red-hat-storage/odf-cli/pull/38 included in the quay.io/rhceph-dev/mcg-cli:4.16.0-92 image?

I opened a PR [cherry-pick] https://github.com/red-hat-storage/odf-cli/pull/45 to the release-4.16 branch.

Bug fixed.
Test procedure:
1. Deploy OCP4.16 cluster 4.16.0-0.nightly-2024-05-04-214435 on vSphere platform
2. Install ODF4.16 odf-operator.v4.16.0-92.stable
3. Detach a disk via vCenter
4. Check OSD status
$ oc get pods rook-ceph-osd-0-9cc796565-9rkml
NAME READY STATUS RESTARTS AGE
rook-ceph-osd-0-9cc796565-9rkml 1/2 CrashLoopBackOff 2 (17s ago) 26h
5. Download bin file:
oc image extract --registry-config /home/oviner/OCS-AP/ocs-ci/data/pull-secret quay.io/rhceph-dev/mcg-cli:4.16.0-94 --confirm --path /usr/share/odf/linux/odf-amd64:/home/oviner/cli-tool
oviner~/cli-tool$ chmod -R 777 odf-amd64
6. Run cli tool
oviner~/cli-tool$ ./odf-amd64 purge-osd 0
Warning: Are you sure you want to purge osd.0? The OSD is *not* safe to destroy. This may lead to data loss.
If you are sure the OSD should be purged, enter 'yes-force-destroy-osd'
yes-force-destroy-osd
Info: Running purge osd command
2024/05/07 11:27:51 maxprocs: Leaving GOMAXPROCS=12: CPU quota undefined
2024-05-07 11:27:51.032067 W | cephcmd: loaded admin secret from env var ROOK_CEPH_SECRET instead of from file
2024-05-07 11:27:51.032144 I | rookcmd: starting Rook v4.16.0-0.4f297e0b42d1af7a7b3198f9ed979a8526062c2f with arguments 'rook ceph osd remove --osd-ids=0 --force-osd-removal=true'
2024-05-07 11:27:51.032152 I | rookcmd: flag values: --force-osd-removal=true, --help=false, --log-level=INFO, --osd-ids=0, --preserve-pvc=false
2024-05-07 11:27:51.032155 I | ceph-spec: parsing mon endpoints: c=172.30.210.194:3300,a=172.30.40.143:3300,b=172.30.4.250:3300
2024-05-07 11:27:51.048938 I | cephclient: writing config file /var/lib/rook/openshift-storage/openshift-storage.config
2024-05-07 11:27:51.049180 I | cephclient: generated admin config in /var/lib/rook/openshift-storage
2024-05-07 11:27:51.431482 I | cephosd: validating status of osd.0
2024-05-07 11:27:51.431506 I | cephosd: osd.0 is marked 'DOWN'
2024-05-07 11:27:51.811853 I | cephosd: marking osd.0 out
2024-05-07 11:27:53.664526 I | cephosd: osd.0 is NOT ok to destroy but force removal is enabled so proceeding with removal
2024-05-07 11:27:53.672715 I | cephosd: removing the OSD deployment "rook-ceph-osd-0"
2024-05-07 11:27:53.672747 I | op-k8sutil: removing deployment rook-ceph-osd-0 if it exists
2024-05-07 11:27:53.685053 I | op-k8sutil: Removed deployment rook-ceph-osd-0
2024-05-07 11:27:53.693935 I | op-k8sutil: "rook-ceph-osd-0" still found. waiting...
2024-05-07 11:27:55.701264 I | op-k8sutil: confirmed rook-ceph-osd-0 does not exist
2024-05-07 11:27:55.712079 I | cephosd: removing the osd prepare job "rook-ceph-osd-prepare-ocs-deviceset-0-data-0fjsm8"
2024-05-07 11:27:55.735815 I | cephosd: removing the OSD PVC "ocs-deviceset-0-data-0fjsm8"
2024-05-07 11:27:55.745540 I | cephosd: purging osd.0
2024-05-07 11:27:56.121591 I | cephosd: attempting to remove host "ocs-deviceset-0-data-0fjsm8" from crush map if not in use
2024-05-07 11:27:57.143758 I | cephosd: removed CRUSH host "ocs-deviceset-0-data-0fjsm8"
2024-05-07 11:27:57.494974 I | cephosd: no ceph crash to silence
2024-05-07 11:27:57.495006 I | cephosd: completed removal of OSD 0
7. Check OSD-0 status:
$ oc get pods rook-ceph-osd-0-795ccb5669-wxgm5
NAME READY STATUS RESTARTS AGE
rook-ceph-osd-0-795ccb5669-wxgm5 1/2 Running 0 16s

Please update the RDT flag/text appropriately.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.
For information on the advisory (Important: Red Hat OpenShift Data Foundation 4.16.0 security, enhancement & bug fix update), and where to find the updated files, follow the link below.
If the solution does not work for you, open a new bug report.
https://access.redhat.com/errata/RHSA-2024:4591

The needinfo request[s] on this closed bug have been removed as they have been unresolved for 120 days
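
The verification run in this report shows the fixed tool warning that the OSD is not safe to destroy and continuing only after the user types 'yes-force-destroy-osd'. A minimal Go sketch of that decision follows. It is an illustration under assumptions, not the odf-cli code: the names forceAnswer and shouldPurge are hypothetical, and the sketch assumes that an OSD reported as safe is purged without any prompt.

```go
package main

import (
	"fmt"
	"strings"
)

// forceAnswer is the confirmation string shown in the purge-osd warning.
const forceAnswer = "yes-force-destroy-osd"

// shouldPurge decides whether the purge may proceed (hypothetical helper).
// Assumption: a safe OSD needs no confirmation; an unsafe OSD is purged
// only when the user typed the exact confirmation string.
func shouldPurge(safe bool, answer string) bool {
	if safe {
		return true
	}
	// Ignore surrounding whitespace such as the trailing newline from stdin.
	return strings.TrimSpace(answer) == forceAnswer
}

func main() {
	fmt.Println(shouldPurge(true, ""))                       // safe OSD, no prompt needed
	fmt.Println(shouldPurge(false, "yes-force-destroy-osd\n")) // unsafe OSD, user confirmed
	fmt.Println(shouldPurge(false, "yes"))                   // unsafe OSD, wrong answer
}
```

Requiring the full string rather than a plain "yes" makes an accidental confirmation of a potentially data-losing purge less likely.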