Bug 1970939
| Summary: | The ocs-osd-removal-job should have unique names for each Job | ||
|---|---|---|---|
| Product: | [Red Hat Storage] Red Hat OpenShift Data Foundation | Reporter: | Servesha <sdudhgao> |
| Component: | ocs-operator | Assignee: | Malay Kumar parida <mparida> |
| Status: | CLOSED ERRATA | QA Contact: | Itzhak <ikave> |
| Severity: | low | Docs Contact: | |
| Priority: | unspecified | ||
| Version: | 4.9 | CC: | ebenahar, mparida, muagarwa, nberry, odf-bz-bot, rcyriac, sostapov |
| Target Milestone: | --- | ||
| Target Release: | ODF 4.14.0 | ||
| Hardware: | Unspecified | ||
| OS: | Unspecified | ||
| Whiteboard: | |||
| Fixed In Version: | 4.14.0-111 | Doc Type: | No Doc Update |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2023-11-08 18:49:50 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | |||
| Bug Depends On: | |||
| Bug Blocks: | 2244409 | ||
Description
Servesha
2021-06-11 14:10:29 UTC
No RFE in 4.9 at this point in time. Eran, should this be added to our backlog?

Not a blocker, and we have proper documentation for the customers to remove the existing job before creating a new one. Moving it out to 4.14.

Hi Servesha, the BZ is now addressed. Instead of generating unique names every time, we went with deleting the completed jobs so they can be rerun. Can you please look at the doc changes that might be required for this? As Neha is not available now, Itzhak from QE can collaborate with you.

References:
https://bugzilla.redhat.com/show_bug.cgi?id=1970939#c3
https://chat.google.com/room/AAAAK3AjBdo/Jl1UYkoqLFE
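For context, the manual cleanup that this change automates looks roughly like the following. This is a non-authoritative sketch based on the documented workaround of removing the existing job before creating a new one; the template name, job name, and namespace are taken from the transcripts below, and <failed_osd_id> is a placeholder.

# Delete the previous (completed) removal job so a new one can be created with the same name
$ oc delete -n openshift-storage job ocs-osd-removal-job
# Re-create the job from the ocs-osd-removal template for the failed OSD ID
$ oc process -n openshift-storage ocs-osd-removal -p FAILED_OSD_IDS=<failed_osd_id> -p FORCE_OSD_REMOVAL=false | oc create -n openshift-storage -f -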
I tested it with a vSphere UPI 4.14 cluster.
Unfortunately, the ocs-osd-removal job failed to complete successfully.
I did the following steps:
1. Delete the disk from vSphere side.
2. I checked the OSDs:
$ oc get pods -n openshift-storage | grep osd
ocs-osd-removal-job-mwtl8 0/1 Completed 0 2m3s
rook-ceph-osd-0-544bcf44c-5t48x 2/2 Running 0 17m
rook-ceph-osd-1-594976b949-r7ndl 2/2 Running 0 17m
rook-ceph-osd-2-7844589467-6v2j4 1/2 CrashLoopBackOff 1 (20s ago) 17m
rook-ceph-osd-prepare-ocs-deviceset-0-data-0mcv42-688bf 0/1 Completed 0 18m
rook-ceph-osd-prepare-ocs-deviceset-1-data-09c7f6-bgwl2 0/1 Completed 0 18m
rook-ceph-osd-prepare-ocs-deviceset-2-data-0vjk7r-czl82 0/1 Completed 0 18m
3. Scale down the OSD deployment for the OSD to be replaced
$ osd_id_to_remove=2
$ oc scale -n openshift-storage deployment rook-ceph-osd-${osd_id_to_remove} --replicas=0
deployment.apps/rook-ceph-osd-2 scaled
4. Delete the old ocs-osd-removal pod:
$ oc -n openshift-storage delete pod ocs-osd-removal-job-mwtl8
pod "ocs-osd-removal-job-mwtl8" deleted
5. Execute the ocs-osd-removal job:
$ oc project openshift-storage
Now using project "openshift-storage" on server "https://api.ikave-vm414.qe.rh-ocs.com:6443".
$ oc process -n openshift-storage ocs-osd-removal -p FAILED_OSD_IDS=${osd_id_to_remove} -p FORCE_OSD_REMOVAL=false |oc create -n openshift-storage -f -
job.batch/ocs-osd-removal-job created
6. Check the ocs-osd-removal job status and see that it did not complete successfully:
$ oc get pod -l job-name=ocs-osd-removal-job -n openshift-storage
NAME READY STATUS RESTARTS AGE
ocs-osd-removal-job-z68ng 1/1 Running 0 26m
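When the removal job pod stays in Running like this, one way to see where it is stuck is to follow its logs with the same label selector used later in this report; the filter pattern is only an illustrative guess, not an exact message:

# Stream the removal job's logs and look for messages about the OSD being drained or blocked
$ oc logs -f -l job-name=ocs-osd-removal-job -n openshift-storage --tail=-1 | egrep -i 'osd|waiting|error'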
Additional info:
Link to the Jenkins job: https://ocs4-jenkins-csb-odf-qe.apps.ocp-c1.prod.psi.redhat.com/job/qe-deploy-ocs-cluster/31095/
Versions:
OC version:
Client Version: 4.10.24
Server Version: 4.14.0-0.nightly-2023-11-05-194730
Kubernetes Version: v1.27.6+f67aeb3
OCS version:
ocs-operator.v4.14.0-rhodf OpenShift Container Storage 4.14.0-rhodf Succeeded
Cluster version
NAME VERSION AVAILABLE PROGRESSING SINCE STATUS
version 4.14.0-0.nightly-2023-11-05-194730 True False 85m Cluster version is 4.14.0-0.nightly-2023-11-05-194730
Rook version:
rook: v4.14.0-0.103536c37b9f9063f4abe9db7c59150125b75908
go: go1.20.10
Ceph version:
ceph version 17.2.6-148.el9cp (badc1d27cb07762bea48f6554ad4f92b9d3fbb6b) quincy (stable)
Not a 4.14 blocker.

I made a mistake when executing the OSD removal job - I should have passed the param "FORCE_OSD_REMOVAL=true".
Sorry for the confusion.
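As a side note, before deciding whether FORCE_OSD_REMOVAL=true is really needed, one could check the cluster's recovery state first. This assumes the rook-ceph-tools toolbox pod is deployed, which is not shown in the transcripts here:

# Run 'ceph status' from the toolbox pod (assumes the rook-ceph-tools deployment exists)
$ TOOLS_POD=$(oc get pod -n openshift-storage -l app=rook-ceph-tools -o name | head -n 1)
$ oc rsh -n openshift-storage ${TOOLS_POD} ceph status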
I tested it again with the same steps above.
I did the following steps:
1. Delete the disk from the vSphere side.
2. I checked the OSDs:
$ oc get pods -n openshift-storage | grep osd
ocs-osd-removal-job-4stqs 0/1 Completed 0 2m8s
rook-ceph-osd-0-86b7d46f4f-4kbq5 1/2 CrashLoopBackOff 2 (21s ago) 21m
rook-ceph-osd-1-6c449d8498-tndsp 2/2 Running 0 21m
rook-ceph-osd-2-78cf9f4c6c-mqw54 2/2 Running 0 21m
rook-ceph-osd-prepare-ocs-deviceset-0-data-0lskk5-4248w 0/1 Completed 0 21m
rook-ceph-osd-prepare-ocs-deviceset-1-data-0gbg9w-b7l78 0/1 Completed 0 21m
rook-ceph-osd-prepare-ocs-deviceset-2-data-0kznvm-d4rvb 0/1 Completed 0 21m
3. Scale down the OSD deployment for the OSD to be replaced
$ osd_id_to_remove=0
$ oc scale -n openshift-storage deployment rook-ceph-osd-${osd_id_to_remove} --replicas=0
deployment.apps/rook-ceph-osd-0 scaled
4. Delete the old ocs-osd-removal pod:
$ oc -n openshift-storage delete pod ocs-osd-removal-job-4stqs
pod "ocs-osd-removal-job-4stqs" deleted
5. Execute the ocs-osd-removal job with "FORCE_OSD_REMOVAL=true":
$ oc project openshift-storage
Now using project "openshift-storage" on server "https://api.ikave-vm414.qe.rh-ocs.com:6443".
$ oc process -n openshift-storage ocs-osd-removal -p FAILED_OSD_IDS=${osd_id_to_remove} -p FORCE_OSD_REMOVAL=true |oc create -n openshift-storage -f -
job.batch/ocs-osd-removal-job created
6. Check the ocs-osd-removal job status and verify that it completed successfully:
$ oc get pod -l job-name=ocs-osd-removal-job -n openshift-storage
NAME READY STATUS RESTARTS AGE
ocs-osd-removal-job-z9kgv 0/1 Completed 0 13s
$ oc logs -l job-name=ocs-osd-removal-job -n openshift-storage --tail=-1 | egrep -i 'completed removal'
2023-11-07 10:47:48.390469 I | cephosd: completed removal of OSD 0
7. Also, I checked that a new OSD came up successfully:
$ oc get pods -n openshift-storage | grep osd
ocs-osd-removal-job-z9kgv 0/1 Completed 0 3m13s
rook-ceph-osd-0-f4448d86f-kbsbd 2/2 Running 0 2m24s
rook-ceph-osd-1-6c449d8498-tndsp 2/2 Running 0 32m
rook-ceph-osd-2-78cf9f4c6c-mqw54 2/2 Running 0 32m
rook-ceph-osd-prepare-ocs-deviceset-0-data-0lskk5-4248w 0/1 Completed 0 32m
rook-ceph-osd-prepare-ocs-deviceset-1-data-0fbb56-b8njz 0/1 Completed 0 2m52s
rook-ceph-osd-prepare-ocs-deviceset-2-data-0kznvm-d4rvb 0/1 Completed 0 32m
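To double-check that the removed OSD was purged from Ceph and that the new OSD joined the cluster, the OSD tree can be inspected from the toolbox pod; as above, the rook-ceph-tools pod is an assumption and is not part of the recorded steps:

# List the CRUSH tree; the replaced OSD ID should appear once, as up/in on the new device
$ oc rsh -n openshift-storage $(oc get pod -n openshift-storage -l app=rook-ceph-tools -o name | head -n 1) ceph osd tree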
Additional info:
Link to the Jenkins job: https://ocs4-jenkins-csb-odf-qe.apps.ocp-c1.prod.psi.redhat.com/job/qe-deploy-ocs-cluster/31120/.
Versions:
OC version:
Client Version: 4.10.24
Server Version: 4.14.0-0.nightly-2023-11-06-203803
Kubernetes Version: v1.27.6+f67aeb3
OCS version:
ocs-operator.v4.14.0-rhodf OpenShift Container Storage 4.14.0-rhodf Succeeded
Cluster version
NAME VERSION AVAILABLE PROGRESSING SINCE STATUS
version 4.14.0-0.nightly-2023-11-06-203803 True False 56m Cluster version is 4.14.0-0.nightly-2023-11-06-203803
Rook version:
rook: v4.14.0-0.103536c37b9f9063f4abe9db7c59150125b75908
go: go1.20.10
Ceph version:
ceph version 17.2.6-148.el9cp (badc1d27cb07762bea48f6554ad4f92b9d3fbb6b) quincy (stable)
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Important: Red Hat OpenShift Data Foundation 4.14.0 security, enhancement & bug fix update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.
https://access.redhat.com/errata/RHSA-2023:6832

One thing I noticed: when I run the ocs-osd-removal job, I don't see the job created:
$ oc get jobs
NAME                                                COMPLETIONS   DURATION   AGE
rook-ceph-osd-prepare-ocs-deviceset-0-data-0xxp9b   1/1           36s        5h20m
rook-ceph-osd-prepare-ocs-deviceset-1-data-0rmwdb   1/1           30s        112m
rook-ceph-osd-prepare-ocs-deviceset-2-data-0nxxhh   1/1           34s        5h20m
Even though the process completed successfully, as I mentioned in my previous comment https://bugzilla.redhat.com/show_bug.cgi?id=1970939#c25. Let me know whether or not this is a bug.

That is actually expected: by the time you tried to see the job, it had already been deleted because it was successful. But FYI, we are reverting this BZ. It requires doc changes in too many places and doesn't really add any value to the customer's experience. More details here: https://bugzilla.redhat.com/show_bug.cgi?id=2248833

Okay, thanks for the clarification.
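For anyone reproducing the verification above: since the operator removed the successful job, catching it requires watching while the removal is in progress. A minimal, hedged way to do that with plain oc:

# Watch job objects in the namespace while the removal runs; the job appears and is then cleaned up
$ oc get jobs -n openshift-storage -w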