Bug 2248833
| Summary: | [ODF 4.14] Revert The ocs-osd-removal-job should have unique names for each Job | | |
|---|---|---|---|
| Product: | [Red Hat Storage] Red Hat OpenShift Data Foundation | Reporter: | Malay Kumar parida <mparida> |
| Component: | ocs-operator | Assignee: | Malay Kumar parida <mparida> |
| Status: | CLOSED ERRATA | QA Contact: | Itzhak <ikave> |
| Severity: | high | Docs Contact: | |
| Priority: | unspecified | | |
| Version: | 4.14 | CC: | branto, ikave, kramdoss, odf-bz-bot |
| Target Milestone: | --- | | |
| Target Release: | ODF 4.14.1 | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | 4.14.1-8 | Doc Type: | No Doc Update |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2023-12-07 13:21:26 UTC | Type: | --- |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
Description
Malay Kumar parida
2023-11-09 08:34:56 UTC
What is the expectation in the new OCS version? Currently, I have tried to create a new ocs-osd-removal job, and the jobs are automatically deleted, and the ocs-osd-removal pods have unique names:

$ oc get jobs
NAME COMPLETIONS DURATION AGE
rook-ceph-osd-prepare-ocs-deviceset-0-data-0xxn4j 1/1 41s 20m
rook-ceph-osd-prepare-ocs-deviceset-1-data-0ddpt8 1/1 41s 14h
rook-ceph-osd-prepare-ocs-deviceset-2-data-09blqr 1/1 44s 14h

ikave:ocs-ci$ oc get -n openshift-storage pods | grep ocs-osd-removal-job
ocs-osd-removal-job-qbjjf 0/1 Completed 0 20m
ocs-osd-removal-job-vkqrf 0/1 Completed 0 2m9s

Versions:
OC version:
Client Version: 4.10.24
Server Version: 4.14.0-0.nightly-2023-11-17-202520
Kubernetes Version: v1.27.6+d548052
OCS version:
ocs-operator.v4.14.1-rhodf OpenShift Container Storage 4.14.1-rhodf ocs-operator.v4.14.0-rhodf Succeeded
Cluster version
NAME VERSION AVAILABLE PROGRESSING SINCE STATUS
version 4.14.0-0.nightly-2023-11-17-202520 True False 15h Cluster version is 4.14.0-0.nightly-2023-11-17-202520
Rook version:
rook: v4.14.1-0.103536c37b9f9063f4abe9db7c59150125b75908
go: go1.20.10
Ceph version:
ceph version 17.2.6-148.el9cp (badc1d27cb07762bea48f6554ad4f92b9d3fbb6b) quincy (stable)

I think you are testing with ODF 4.14 build 4.14.1-4, but when I checked the build details against the upstream ref commit number, I see that my changes are not part of the 4.14.1-4 build. The expectation is that after the job finishes successfully, the job stays and does not get deleted. Boris, can you please check this and make another build for 4.14.1 with the latest changes in the ocs-operator release-4.14 branch?

Ack. Let me know when it's ready.

I tested it with an AWS 4.14 cluster.
I did the following steps:
1. Execute the ocs-osd-removal job:
$ osd_id_to_remove=2
$ oc process -n openshift-storage ocs-osd-removal -p FAILED_OSD_IDS=${osd_id_to_remove} -p FORCE_OSD_REMOVAL=false |oc create -n openshift-storage -f -
job.batch/ocs-osd-removal-job created
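(Side note, not part of the run above: the parameters accepted by the template, including the FAILED_OSD_IDS and FORCE_OSD_REMOVAL values used here, can be listed with oc process, assuming the template is installed in the openshift-storage namespace as above:)
$ oc process --parameters -n openshift-storage ocs-osd-removal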
2. Switch to openshift-storage project:
$ oc project openshift-storage
Now using project "openshift-storage" on server "https://api.ikave-aws414.qe.rh-ocs.com:6443".
3. Check the jobs and see the ocs-osd-removal job completed:
$ oc get jobs
NAME COMPLETIONS DURATION AGE
ocs-osd-removal-job 1/1 5s 19s
rook-ceph-osd-prepare-ocs-deviceset-0-data-0qz8l5 1/1 30s 18m
rook-ceph-osd-prepare-ocs-deviceset-1-data-0l2nrf 1/1 17s 18m
rook-ceph-osd-prepare-ocs-deviceset-2-data-0tpcf4 1/1 25s 18m
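(Optional check, not performed in this run: since the fix under test is that the job keeps the fixed name ocs-osd-removal-job and is not cleaned up automatically, the job's name and TTL can be inspected directly. The ttlSecondsAfterFinished field is expected to be empty here; this assumes the earlier auto-cleanup was TTL-based, which may not be the only mechanism involved.)
$ oc -n openshift-storage get job ocs-osd-removal-job -o jsonpath='{.metadata.name}{" ttl="}{.spec.ttlSecondsAfterFinished}{"\n"}'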
4. Check the pods, and see the new ocs-osd-removal pod created:
$ oc get pods | grep osd
ocs-osd-removal-job-h28q2 0/1 Completed 0 30s
rook-ceph-osd-0-695f46f6ff-q5j96 2/2 Running 0 18m
rook-ceph-osd-1-768bfcff59-j5h9d 2/2 Running 0 18m
rook-ceph-osd-2-58fc99b489-z77r2 2/2 Running 0 17m
rook-ceph-osd-prepare-ocs-deviceset-0-data-0qz8l5-lktln 0/1 Completed 0 18m
rook-ceph-osd-prepare-ocs-deviceset-1-data-0l2nrf-vq5fh 0/1 Completed 0 18m
rook-ceph-osd-prepare-ocs-deviceset-2-data-0tpcf4-q8qtg 0/1 Completed 0 18m
5. Try to create another ocs-osd-removal job, which fails as expected:
$ osd_id_to_remove=1
$ oc process -n openshift-storage ocs-osd-removal -p FAILED_OSD_IDS=${osd_id_to_remove} -p FORCE_OSD_REMOVAL=false |oc create -n openshift-storage -f -
Error from server (AlreadyExists): error when creating "STDIN": jobs.batch "ocs-osd-removal-job" already exists
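(Note: before deleting the completed job in the next step, its logs can still be collected from the finished pod via the job-name label that the Job controller puts on its pods, for example:)
$ oc -n openshift-storage logs -l job-name=ocs-osd-removal-job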
6. Delete the old ocs-osd-removal job:
$ oc delete jobs.batch ocs-osd-removal-job
job.batch "ocs-osd-removal-job" deleted
7. Check the pods again:
$ oc get pods | grep osd
rook-ceph-osd-0-695f46f6ff-q5j96 2/2 Running 0 22m
rook-ceph-osd-1-768bfcff59-j5h9d 2/2 Running 0 22m
rook-ceph-osd-2-58fc99b489-z77r2 2/2 Running 0 21m
rook-ceph-osd-prepare-ocs-deviceset-0-data-0qz8l5-lktln 0/1 Completed 0 22m
rook-ceph-osd-prepare-ocs-deviceset-1-data-0l2nrf-vq5fh 0/1 Completed 0 22m
rook-ceph-osd-prepare-ocs-deviceset-2-data-0tpcf4-q8qtg 0/1 Completed 0 22m
8. Create a new ocs-osd-removal job, check the jobs and pods, and see that it was created successfully:
$ oc process -n openshift-storage ocs-osd-removal -p FAILED_OSD_IDS=${osd_id_to_remove} -p FORCE_OSD_REMOVAL=false |oc create -n openshift-storage -f -
job.batch/ocs-osd-removal-job created
$ oc get jobs
NAME COMPLETIONS DURATION AGE
ocs-osd-removal-job 1/1 4s 6s
rook-ceph-osd-prepare-ocs-deviceset-0-data-0qz8l5 1/1 30s 22m
rook-ceph-osd-prepare-ocs-deviceset-1-data-0l2nrf 1/1 17s 22m
rook-ceph-osd-prepare-ocs-deviceset-2-data-0tpcf4 1/1 25s 22m
$ oc get pods | grep osd
ocs-osd-removal-job-8ng8w 0/1 Completed 0 12s
rook-ceph-osd-0-695f46f6ff-q5j96 2/2 Running 0 22m
rook-ceph-osd-1-768bfcff59-j5h9d 2/2 Running 0 22m
rook-ceph-osd-2-58fc99b489-z77r2 2/2 Running 0 22m
rook-ceph-osd-prepare-ocs-deviceset-0-data-0qz8l5-lktln 0/1 Completed 0 22m
rook-ceph-osd-prepare-ocs-deviceset-1-data-0l2nrf-vq5fh 0/1 Completed 0 22m
rook-ceph-osd-prepare-ocs-deviceset-2-data-0tpcf4-q8qtg 0/1 Completed 0 22m
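For reference, the manual flow verified above can be collected into a small script. This is only an illustrative sketch built from the commands used in this verification (template name, parameters, and namespace come from the steps above); it is not an official procedure:

osd_id_to_remove=1
oc process -n openshift-storage ocs-osd-removal -p FAILED_OSD_IDS=${osd_id_to_remove} -p FORCE_OSD_REMOVAL=false | oc create -n openshift-storage -f -
# wait for the removal job to finish
oc -n openshift-storage wait --for=condition=complete job/ocs-osd-removal-job --timeout=600s
# collect the job logs before cleaning up
oc -n openshift-storage logs -l job-name=ocs-osd-removal-job
# the job is no longer deleted automatically, so remove it explicitly
oc -n openshift-storage delete job ocs-osd-removal-job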
Versions:
OC version:
Client Version: 4.10.24
Server Version: 4.14.0-0.nightly-2023-11-23-033939
Kubernetes Version: v1.27.6+d548052
OCS version:
ocs-operator.v4.14.1-rhodf OpenShift Container Storage 4.14.1-rhodf Succeeded
Cluster version
NAME VERSION AVAILABLE PROGRESSING SINCE STATUS
version 4.14.0-0.nightly-2023-11-23-033939 True False 46m Cluster version is 4.14.0-0.nightly-2023-11-23-033939
Rook version:
rook: v4.14.1-0.0154538b04cc11cd719d22885f23b4e7ce54a48c
go: go1.20.10
Ceph version:
ceph version 17.2.6-148.el9cp (badc1d27cb07762bea48f6554ad4f92b9d3fbb6b) quincy (stable)
Link to the Jenkins job: https://ocs4-jenkins-csb-odf-qe.apps.ocp-c1.prod.psi.redhat.com/job/qe-deploy-ocs-cluster/31525/
Hi, the results look correct to me.

Yes. I also tested the device replacement with our automation test on a vSphere 4.14 cluster, and it passed successfully.

Versions:
OC version:
Client Version: 4.10.24
Server Version: 4.14.0-0.nightly-2023-11-23-033939
Kubernetes Version: v1.27.6+d548052
OCS version:
ocs-operator.v4.14.1-rhodf OpenShift Container Storage 4.14.1-rhodf Succeeded
Cluster version
NAME VERSION AVAILABLE PROGRESSING SINCE STATUS
version 4.14.0-0.nightly-2023-11-23-033939 True False 105m Cluster version is 4.14.0-0.nightly-2023-11-23-033939
Rook version:
rook: v4.14.1-0.0154538b04cc11cd719d22885f23b4e7ce54a48c
go: go1.20.10
Ceph version:
ceph version 17.2.6-148.el9cp (badc1d27cb07762bea48f6554ad4f92b9d3fbb6b) quincy (stable)

According to the three comments above, I am moving the BZ to Verified.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Red Hat OpenShift Data Foundation 4.14.1 Bug Fix Update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2023:7696