Bug 2248832
| Summary: | Revert The ocs-osd-removal-job should have unique names for each Job | | |
|---|---|---|---|
| Product: | [Red Hat Storage] Red Hat OpenShift Data Foundation | Reporter: | Malay Kumar parida <mparida> |
| Component: | ocs-operator | Assignee: | Malay Kumar parida <mparida> |
| Status: | CLOSED ERRATA | QA Contact: | Itzhak <ikave> |
| Severity: | high | Docs Contact: | |
| Priority: | unspecified | | |
| Version: | unspecified | CC: | kramdoss, nberry, odf-bz-bot |
| Target Milestone: | --- | | |
| Target Release: | ODF 4.15.0 | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | 4.15.0-75 | Doc Type: | No Doc Update |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2024-03-19 15:28:37 UTC | Type: | --- |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
Description
Malay Kumar parida
2023-11-09 08:31:40 UTC
I tested it with a GCP 4.15 cluster.
I did the following steps:
1. Execute the ocs-osd-removal job:
$ osd_id_to_remove=2
$ oc process -n openshift-storage ocs-osd-removal -p FAILED_OSD_IDS=${osd_id_to_remove} -p FORCE_OSD_REMOVAL=false |oc create -n openshift-storage -f -
job.batch/ocs-osd-removal-job created
2. Switch to openshift-storage project:
$ oc project openshift-storage
Now using project "openshift-storage" on server "https://api.ikave-aws414.qe.rh-ocs.com:6443".
3. Check the jobs and see that the ocs-osd-removal job completed:
$ oc get jobs
NAME COMPLETIONS DURATION AGE
ocs-osd-removal-job 1/1 5s 24s
rook-ceph-osd-prepare-ocs-deviceset-0-data-0t75xz 1/1 23s 128m
rook-ceph-osd-prepare-ocs-deviceset-1-data-0k7zm9 1/1 23s 128m
rook-ceph-osd-prepare-ocs-deviceset-2-data-095m2g 1/1 22s 128m
4. Check the pods and see that the new ocs-osd-removal pod was created:
$ oc get pods | grep osd
ocs-osd-removal-job-8lsc5 0/1 Completed 0 39s
rook-ceph-osd-0-fb8558fdb-4htxm 2/2 Running 0 128m
rook-ceph-osd-1-567f8567c7-86s7t 2/2 Running 0 128m
rook-ceph-osd-2-54f5974b4c-ms29x 2/2 Running 0 128m
rook-ceph-osd-prepare-ocs-deviceset-0-data-0t75xz-tk652 0/1 Completed 0 128m
rook-ceph-osd-prepare-ocs-deviceset-1-data-0k7zm9-nkhx2 0/1 Completed 0 128m
rook-ceph-osd-prepare-ocs-deviceset-2-data-095m2g-8kww6 0/1 Completed 0 128m
5. Try to create another ocs-osd-removal job; it fails as expected:
$ osd_id_to_remove=1
$ oc process -n openshift-storage ocs-osd-removal -p FAILED_OSD_IDS=${osd_id_to_remove} -p FORCE_OSD_REMOVAL=false |oc create -n openshift-storage -f -
Error from server (AlreadyExists): error when creating "STDIN": jobs.batch "ocs-osd-removal-job" already exists
6. Delete the old ocs-osd-removal job:
$ oc delete jobs.batch ocs-osd-removal-job
job.batch "ocs-osd-removal-job" deleted
7. Check the pods again:
$ oc get pods | grep osd
rook-ceph-osd-0-fb8558fdb-4htxm 2/2 Running 0 152m
rook-ceph-osd-1-567f8567c7-86s7t 2/2 Running 0 152m
rook-ceph-osd-2-54f5974b4c-ms29x 2/2 Running 0 152m
rook-ceph-osd-prepare-ocs-deviceset-0-data-0t75xz-tk652 0/1 Completed 0 152m
rook-ceph-osd-prepare-ocs-deviceset-1-data-0k7zm9-nkhx2 0/1 Completed 0 152m
rook-ceph-osd-prepare-ocs-deviceset-2-data-095m2g-8kww6 0/1 Completed 0 152m
8. Create a new ocs-osd-removal job, then check the jobs and pods to confirm it was created successfully (a consolidated sketch of this delete-and-recreate flow follows the output below):
$ oc process -n openshift-storage ocs-osd-removal -p FAILED_OSD_IDS=${osd_id_to_remove} -p FORCE_OSD_REMOVAL=false |oc create -n openshift-storage -f -
job.batch/ocs-osd-removal-job created
$ oc get jobs
NAME COMPLETIONS DURATION AGE
ocs-osd-removal-job 1/1 5s 9s
rook-ceph-osd-prepare-ocs-deviceset-0-data-0t75xz 1/1 23s 153m
rook-ceph-osd-prepare-ocs-deviceset-1-data-0k7zm9 1/1 23s 153m
rook-ceph-osd-prepare-ocs-deviceset-2-data-095m2g 1/1 22s 153m
$ oc get pods | grep osd
ocs-osd-removal-job-glzts 0/1 Completed 0 22s
rook-ceph-osd-0-fb8558fdb-4htxm 2/2 Running 0 153m
rook-ceph-osd-1-567f8567c7-86s7t 2/2 Running 0 153m
rook-ceph-osd-2-54f5974b4c-ms29x 2/2 Running 0 153m
rook-ceph-osd-prepare-ocs-deviceset-0-data-0t75xz-tk652 0/1 Completed 0 153m
rook-ceph-osd-prepare-ocs-deviceset-1-data-0k7zm9-nkhx2 0/1 Completed 0 153m
rook-ceph-osd-prepare-ocs-deviceset-2-data-095m2g-8kww6 0/1 Completed 0 153m
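For convenience, the delete-and-recreate flow exercised in steps 5-8 can be chained together as in the sketch below. This is only an illustration built from the commands above; osd_id_to_remove=1 and the 300s timeout are placeholder values, and the oc wait / oc logs calls are optional ways to confirm completion rather than part of the documented procedure.
# Sketch: drop any leftover removal job, recreate it, and wait for completion.
osd_id_to_remove=1
oc delete -n openshift-storage job ocs-osd-removal-job --ignore-not-found
oc process -n openshift-storage ocs-osd-removal -p FAILED_OSD_IDS=${osd_id_to_remove} -p FORCE_OSD_REMOVAL=false | oc create -n openshift-storage -f -
# Wait for the new job to finish, then show the log from its pod.
oc wait -n openshift-storage --for=condition=complete job/ocs-osd-removal-job --timeout=300s
oc logs -n openshift-storage -l job-name=ocs-osd-removal-job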
Additional info:
Link to the Jenkins job: https://ocs4-jenkins-csb-odf-qe.apps.ocp-c1.prod.psi.redhat.com/job/qe-deploy-ocs-cluster/32167/
Versions:
OC version:
Client Version: 4.10.24
Server Version: 4.15.0-0.nightly-2023-12-18-220750
Kubernetes Version: v1.28.4+7aa0a74
OCS version:
ocs-operator.v4.15.0-89.stable OpenShift Container Storage 4.15.0-89.stable Succeeded
Cluster version:
NAME VERSION AVAILABLE PROGRESSING SINCE STATUS
version 4.15.0-0.nightly-2023-12-18-220750 True False 168m Cluster version is 4.15.0-0.nightly-2023-12-18-220750
Rook version:
rook: v4.15.0-0.fcbc808fe7930fc7d937d06c983d5f0b96d952d3
go: go1.20.10
Ceph version:
ceph version 17.2.6-167.el9cp (5ef1496ea3e9daaa9788809a172bd5a1c3192cf7) quincy (stable)
One more thing that I checked is the test for recovery from volume deletion with vSphere. This test executes the ocs-osd-removal job after a volume is deleted from the vSphere side and checks that it completes successfully. You can see here that the test finished successfully: https://reportportal-ocs4.apps.ocp-c1.prod.psi.redhat.com/ui/#ocs/launches/632/17496/851123/851140/851143/log. Based on the two comments above, I am moving the bug to Verified.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Important: Red Hat OpenShift Data Foundation 4.15.0 security, enhancement, & bug fix update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.
https://access.redhat.com/errata/RHSA-2024:1383