Bug 2248833 - [ODF 4.14] Revert The ocs-osd-removal-job should have unique names for each Job
Summary: [ODF 4.14] Revert The ocs-osd-removal-job should have unique names for each Job
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat OpenShift Data Foundation
Classification: Red Hat Storage
Component: ocs-operator
Version: 4.14
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: high
Target Milestone: ---
Target Release: ODF 4.14.1
Assignee: Malay Kumar parida
QA Contact: Itzhak
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2023-11-09 08:34 UTC by Malay Kumar parida
Modified: 2023-12-07 13:21 UTC
CC List: 4 users

Fixed In Version: 4.14.1-8
Doc Type: No Doc Update
Doc Text:
Clone Of:
Environment:
Last Closed: 2023-12-07 13:21:26 UTC
Embargoed:




Links
System ID Private Priority Status Summary Last Updated
Github red-hat-storage ocs-operator pull 2266 0 None Merged Bug 2248833: [release-4.14] Revert "Delete a template job after the job gets completed" 2023-11-16 06:58:20 UTC
Red Hat Product Errata RHBA-2023:7696 0 None None None 2023-12-07 13:21:29 UTC

Description Malay Kumar parida 2023-11-09 08:34:56 UTC
This bug was initially created as a copy of Bug #2248832

This bug was initially created as a copy of Bug #1970939

We merged a fix for this BZ that deleted any completed OSD removal job once it succeeded, so that creating another OSD removal job would not require first checking for and deleting the earlier completed job. What we did not account for is that our documentation for clusters using encryption includes a step where the customer relies on the logs of the successful job to get the PVC name of the replaced OSD. They use this PVC name to clean up the dm-crypt mapping on the node.

Updating the documentation for so many platforms in so many places feels risky, and the fix does not add much value anyway; it mostly creates new issues. So we should revert the fix and ship the revert with 4.14.1. Considering 4.14.1 is not far away, this is a very small risk: it only affects customers using encryption who are replacing their OSDs. Even if a customer hits a problem and a case reaches support, we can easily find the orphaned dm-crypt mapping by its PVC name manually and delete it. These unused mappings also do not break anything, so they are not a real concern.
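
For reference, the documented encryption flow that depends on the completed job's logs looks roughly like the sketch below (the device set name is only a placeholder, and the exact commands vary by ODF release):

$ oc logs -n openshift-storage -l job-name=ocs-osd-removal-job --tail=-1 | egrep -i 'pvc|deviceset'
# On the node that hosted the removed OSD, find and close the orphaned dm-crypt mapping
# (placeholder name shown; use the PVC name reported in the job logs):
$ dmsetup ls | grep ocs-deviceset-example
$ cryptsetup luksClose --debug --verbose ocs-deviceset-example-block-dmcrypt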

Comment 6 Itzhak 2023-11-20 08:46:46 UTC
What is the expected behavior in the new OCS version?
Currently, when I create a new ocs-osd-removal job, the job is automatically deleted after completion,
and the ocs-osd-removal pods have unique names:

$ oc get jobs
NAME                                                COMPLETIONS   DURATION   AGE
rook-ceph-osd-prepare-ocs-deviceset-0-data-0xxn4j   1/1           41s        20m
rook-ceph-osd-prepare-ocs-deviceset-1-data-0ddpt8   1/1           41s        14h
rook-ceph-osd-prepare-ocs-deviceset-2-data-09blqr   1/1           44s        14h
$ oc get -n openshift-storage pods | grep ocs-osd-removal-job
ocs-osd-removal-job-qbjjf                                         0/1     Completed   0          20m
ocs-osd-removal-job-vkqrf                                         0/1     Completed   0          2m9s


Versions:

OC version:
Client Version: 4.10.24
Server Version: 4.14.0-0.nightly-2023-11-17-202520
Kubernetes Version: v1.27.6+d548052

OCS version:
ocs-operator.v4.14.1-rhodf              OpenShift Container Storage   4.14.1-rhodf   ocs-operator.v4.14.0-rhodf              Succeeded

Cluster version
NAME      VERSION                              AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.14.0-0.nightly-2023-11-17-202520   True        False         15h     Cluster version is 4.14.0-0.nightly-2023-11-17-202520

Rook version:
rook: v4.14.1-0.103536c37b9f9063f4abe9db7c59150125b75908
go: go1.20.10

Ceph version:
ceph version 17.2.6-148.el9cp (badc1d27cb07762bea48f6554ad4f92b9d3fbb6b) quincy (stable)

Comment 7 Malay Kumar parida 2023-11-20 09:54:06 UTC
I think you are testing with the ODF 4.14 build 4.14.1-4,
but when I checked the build details against the upstream ref commit, I see that my changes are not part of the 4.14.1-4 build.
The expectation is that after the job finishes successfully, the job stays and does not get deleted.
Boris, can you please check this and make another build for 4.14.1 with the latest changes from the ocs-operator release-4.14 branch?
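
A quick sanity check for whichever build lands on the cluster (a sketch; the name=ocs-operator label selector is an assumption and may differ between releases):

$ oc get csv -n openshift-storage | grep ocs-operator
$ oc get pods -n openshift-storage -l name=ocs-operator -o jsonpath='{.items[0].spec.containers[0].image}'
# With the revert in place, the job should still be listed after COMPLETIONS shows 1/1:
$ oc get job ocs-osd-removal-job -n openshift-storage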

Comment 8 Itzhak 2023-11-20 10:28:08 UTC
Ack. Let me know when it's ready.

Comment 9 Itzhak 2023-11-23 13:19:53 UTC
I tested it with an AWS 4.14 cluster. 
I performed the following steps:

1. Execute the ocs-osd-removal job:
$ osd_id_to_remove=2
$ oc process -n openshift-storage ocs-osd-removal -p FAILED_OSD_IDS=${osd_id_to_remove} -p FORCE_OSD_REMOVAL=false |oc create -n openshift-storage -f -
job.batch/ocs-osd-removal-job created

2. Switch to openshift-storage project:
$ oc project openshift-storage 
Now using project "openshift-storage" on server "https://api.ikave-aws414.qe.rh-ocs.com:6443".

3. Check the jobs and see the ocs-osd-removal job completed:
$ oc get jobs
NAME                                                COMPLETIONS   DURATION   AGE
ocs-osd-removal-job                                 1/1           5s         19s
rook-ceph-osd-prepare-ocs-deviceset-0-data-0qz8l5   1/1           30s        18m
rook-ceph-osd-prepare-ocs-deviceset-1-data-0l2nrf   1/1           17s        18m
rook-ceph-osd-prepare-ocs-deviceset-2-data-0tpcf4   1/1           25s        18m

4. Check the pods, and see the new ocs-osd-removal pod created: 
$ oc get pods | grep osd
ocs-osd-removal-job-h28q2                                         0/1     Completed   0          30s
rook-ceph-osd-0-695f46f6ff-q5j96                                  2/2     Running     0          18m
rook-ceph-osd-1-768bfcff59-j5h9d                                  2/2     Running     0          18m
rook-ceph-osd-2-58fc99b489-z77r2                                  2/2     Running     0          17m
rook-ceph-osd-prepare-ocs-deviceset-0-data-0qz8l5-lktln           0/1     Completed   0          18m
rook-ceph-osd-prepare-ocs-deviceset-1-data-0l2nrf-vq5fh           0/1     Completed   0          18m
rook-ceph-osd-prepare-ocs-deviceset-2-data-0tpcf4-q8qtg           0/1     Completed   0          18m

5. Try to create another ocs-osd-removal job, which fails as expected:
$ osd_id_to_remove=1
$ oc process -n openshift-storage ocs-osd-removal -p FAILED_OSD_IDS=${osd_id_to_remove} -p FORCE_OSD_REMOVAL=false |oc create -n openshift-storage -f -
Error from server (AlreadyExists): error when creating "STDIN": jobs.batch "ocs-osd-removal-job" already exists

6. Delete the old ocs-osd-removal job (its logs can still be collected first; see the sketch after step 8):
$ oc delete jobs.batch ocs-osd-removal-job 
job.batch "ocs-osd-removal-job" deleted

7. Check the pods again:
$ oc get pods | grep osd
rook-ceph-osd-0-695f46f6ff-q5j96                                  2/2     Running     0          22m
rook-ceph-osd-1-768bfcff59-j5h9d                                  2/2     Running     0          22m
rook-ceph-osd-2-58fc99b489-z77r2                                  2/2     Running     0          21m
rook-ceph-osd-prepare-ocs-deviceset-0-data-0qz8l5-lktln           0/1     Completed   0          22m
rook-ceph-osd-prepare-ocs-deviceset-1-data-0l2nrf-vq5fh           0/1     Completed   0          22m
rook-ceph-osd-prepare-ocs-deviceset-2-data-0tpcf4-q8qtg           0/1     Completed   0          22m

8. Create a new ocs-osd-removal job, then check the jobs and pods and see that it was created successfully:
$ oc process -n openshift-storage ocs-osd-removal -p FAILED_OSD_IDS=${osd_id_to_remove} -p FORCE_OSD_REMOVAL=false |oc create -n openshift-storage -f -
job.batch/ocs-osd-removal-job created
$ oc get jobs
NAME                                                COMPLETIONS   DURATION   AGE
ocs-osd-removal-job                                 1/1           4s         6s
rook-ceph-osd-prepare-ocs-deviceset-0-data-0qz8l5   1/1           30s        22m
rook-ceph-osd-prepare-ocs-deviceset-1-data-0l2nrf   1/1           17s        22m
rook-ceph-osd-prepare-ocs-deviceset-2-data-0tpcf4   1/1           25s        22m
$ oc get pods | grep osd
ocs-osd-removal-job-8ng8w                                         0/1     Completed   0          12s
rook-ceph-osd-0-695f46f6ff-q5j96                                  2/2     Running     0          22m
rook-ceph-osd-1-768bfcff59-j5h9d                                  2/2     Running     0          22m
rook-ceph-osd-2-58fc99b489-z77r2                                  2/2     Running     0          22m
rook-ceph-osd-prepare-ocs-deviceset-0-data-0qz8l5-lktln           0/1     Completed   0          22m
rook-ceph-osd-prepare-ocs-deviceset-1-data-0l2nrf-vq5fh           0/1     Completed   0          22m
rook-ceph-osd-prepare-ocs-deviceset-2-data-0tpcf4-q8qtg           0/1     Completed   0          22m
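
As noted in step 6, the point of the revert is that the completed job, and therefore its logs, stay around until the admin deletes them. A minimal sketch of saving those logs before deleting the job (the grep pattern is an assumption based on the documented encryption flow):

$ oc logs -n openshift-storage -l job-name=ocs-osd-removal-job --tail=-1 > ocs-osd-removal-job.log
$ grep -iE 'pvc|deviceset' ocs-osd-removal-job.log
$ oc delete job ocs-osd-removal-job -n openshift-storage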


Versions: 

OC version:
Client Version: 4.10.24
Server Version: 4.14.0-0.nightly-2023-11-23-033939
Kubernetes Version: v1.27.6+d548052

OCS version:
ocs-operator.v4.14.1-rhodf              OpenShift Container Storage   4.14.1-rhodf              Succeeded

Cluster version
NAME      VERSION                              AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.14.0-0.nightly-2023-11-23-033939   True        False         46m     Cluster version is 4.14.0-0.nightly-2023-11-23-033939

Rook version:
rook: v4.14.1-0.0154538b04cc11cd719d22885f23b4e7ce54a48c
go: go1.20.10

Ceph version:
ceph version 17.2.6-148.el9cp (badc1d27cb07762bea48f6554ad4f92b9d3fbb6b) quincy (stable)

Link to the Jenkins job: https://ocs4-jenkins-csb-odf-qe.apps.ocp-c1.prod.psi.redhat.com/job/qe-deploy-ocs-cluster/31525/

Comment 10 Malay Kumar parida 2023-11-23 14:51:13 UTC
Hi, the results look correct to me.

Comment 11 Itzhak 2023-11-23 15:36:01 UTC
Yes. 
I also tested the device replacement with our automation test on a vSphere 4.14 cluster, and it passed successfully.

Versions: 

OC version:
Client Version: 4.10.24
Server Version: 4.14.0-0.nightly-2023-11-23-033939
Kubernetes Version: v1.27.6+d548052

OCS version:
ocs-operator.v4.14.1-rhodf              OpenShift Container Storage   4.14.1-rhodf              Succeeded

Cluster version
NAME      VERSION                              AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.14.0-0.nightly-2023-11-23-033939   True        False         105m    Cluster version is 4.14.0-0.nightly-2023-11-23-033939

Rook version:
rook: v4.14.1-0.0154538b04cc11cd719d22885f23b4e7ce54a48c
go: go1.20.10

Ceph version:
ceph version 17.2.6-148.el9cp (badc1d27cb07762bea48f6554ad4f92b9d3fbb6b) quincy (stable)

Comment 12 Itzhak 2023-11-23 15:36:45 UTC
According to the three comments above, I am moving the BZ to Verified.

Comment 16 errata-xmlrpc 2023-12-07 13:21:26 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Red Hat OpenShift Data Foundation 4.14.1 Bug Fix Update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2023:7696

