Bug 1970939
| Summary: | The ocs-osd-removal-job should have unique names for each Job | ||
|---|---|---|---|
| Product: | [Red Hat Storage] Red Hat OpenShift Data Foundation | Reporter: | Servesha <sdudhgao> |
| Component: | ocs-operator | Assignee: | Malay Kumar parida <mparida> |
| Status: | CLOSED ERRATA | QA Contact: | Itzhak <ikave> |
| Severity: | low | Docs Contact: | |
| Priority: | unspecified | ||
| Version: | 4.9 | CC: | ebenahar, mparida, muagarwa, nberry, odf-bz-bot, rcyriac, sostapov |
| Target Milestone: | --- | ||
| Target Release: | ODF 4.14.0 | ||
| Hardware: | Unspecified | ||
| OS: | Unspecified | ||
| Whiteboard: | |||
| Fixed In Version: | 4.14.0-111 | Doc Type: | No Doc Update |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2023-11-08 18:49:50 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | |||
| Bug Depends On: | |||
| Bug Blocks: | 2244409 | ||
Description
Servesha
2021-06-11 14:10:29 UTC
No RFE in 4.9 at this point in time. Eran, should this be added to our backlog?

Not a blocker, and we have proper documentation for the customers to remove the existing job before creating a new one. Moving it out to 4.14.

Hi Servesha, the BZ is now addressed. Instead of generating unique names every time, we went with deleting the completed jobs so they can be rerun. Can you please look at the doc changes that might be required for this? As Neha is not available now, Itzhak from QE can collaborate with you.

References:
https://bugzilla.redhat.com/show_bug.cgi?id=1970939#c3
https://chat.google.com/room/AAAAK3AjBdo/Jl1UYkoqLFE
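For context, the manual cleanup that this change automates looks roughly like the following. This is a non-authoritative sketch based on the documented workaround of removing the existing job before creating a new one; the template name, job name, and namespace are taken from the transcripts below, and <failed_osd_id> is a placeholder.

# Delete the previous (completed) removal job so a new one can be created with the same name
$ oc delete -n openshift-storage job ocs-osd-removal-job
# Re-create the job from the ocs-osd-removal template for the failed OSD ID
$ oc process -n openshift-storage ocs-osd-removal -p FAILED_OSD_IDS=<failed_osd_id> -p FORCE_OSD_REMOVAL=false | oc create -n openshift-storage -f -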
I tested it with a vSphere UPI 4.14 cluster.
Unfortunately, the ocs-osd-removal job failed to complete successfully.
I did the following steps:
1. Delete the disk from vSphere side.
2. I checked the OSDs:
$ oc get pods -n openshift-storage | grep osd
ocs-osd-removal-job-mwtl8 0/1 Completed 0 2m3s
rook-ceph-osd-0-544bcf44c-5t48x 2/2 Running 0 17m
rook-ceph-osd-1-594976b949-r7ndl 2/2 Running 0 17m
rook-ceph-osd-2-7844589467-6v2j4 1/2 CrashLoopBackOff 1 (20s ago) 17m
rook-ceph-osd-prepare-ocs-deviceset-0-data-0mcv42-688bf 0/1 Completed 0 18m
rook-ceph-osd-prepare-ocs-deviceset-1-data-09c7f6-bgwl2 0/1 Completed 0 18m
rook-ceph-osd-prepare-ocs-deviceset-2-data-0vjk7r-czl82 0/1 Completed 0 18m
3. Scale down the OSD deployment for the OSD to be replaced
$ osd_id_to_remove=2
$ oc scale -n openshift-storage deployment rook-ceph-osd-${osd_id_to_remove} --replicas=0
deployment.apps/rook-ceph-osd-2 scaled
4. Delete the old ocs-osd-removal pod:
$ oc -n openshift-storage delete pod ocs-osd-removal-job-mwtl8
pod "ocs-osd-removal-job-mwtl8" deleted
5. Execute the ocs-osd-removal job:
$ oc project openshift-storage
Now using project "openshift-storage" on server "https://api.ikave-vm414.qe.rh-ocs.com:6443".
$ oc process -n openshift-storage ocs-osd-removal -p FAILED_OSD_IDS=${osd_id_to_remove} -p FORCE_OSD_REMOVAL=false |oc create -n openshift-storage -f -
job.batch/ocs-osd-removal-job created
6. Check the ocs-osd-removal job status and see that it did not complete successfully:
$ oc get pod -l job-name=ocs-osd-removal-job -n openshift-storage
NAME READY STATUS RESTARTS AGE
ocs-osd-removal-job-z68ng 1/1 Running 0 26m
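When the removal job pod stays in Running like this, one way to see where it is stuck is to follow its logs with the same label selector used later in this report; the filter pattern is only an illustrative guess, not an exact message:

# Stream the removal job's logs and look for messages about the OSD being drained or blocked
$ oc logs -f -l job-name=ocs-osd-removal-job -n openshift-storage --tail=-1 | egrep -i 'osd|waiting|error'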
Additional info:
Link to the Jenkins job: https://ocs4-jenkins-csb-odf-qe.apps.ocp-c1.prod.psi.redhat.com/job/qe-deploy-ocs-cluster/31095/
Versions:
OC version:
Client Version: 4.10.24
Server Version: 4.14.0-0.nightly-2023-11-05-194730
Kubernetes Version: v1.27.6+f67aeb3
OCS version:
ocs-operator.v4.14.0-rhodf OpenShift Container Storage 4.14.0-rhodf Succeeded
Cluster version
NAME VERSION AVAILABLE PROGRESSING SINCE STATUS
version 4.14.0-0.nightly-2023-11-05-194730 True False 85m Cluster version is 4.14.0-0.nightly-2023-11-05-194730
Rook version:
rook: v4.14.0-0.103536c37b9f9063f4abe9db7c59150125b75908
go: go1.20.10
Ceph version:
ceph version 17.2.6-148.el9cp (badc1d27cb07762bea48f6554ad4f92b9d3fbb6b) quincy (stable)
Not a 4.14 blocker.

I made a mistake when executing the OSD removal job - I should have passed the param "FORCE_OSD_REMOVAL=true".
Sorry for the confusion.
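As a side note, before deciding whether FORCE_OSD_REMOVAL=true is really needed, one could check the cluster's recovery state first. This assumes the rook-ceph-tools toolbox pod is deployed, which is not shown in the transcripts here:

# Run 'ceph status' from the toolbox pod (assumes the rook-ceph-tools deployment exists)
$ TOOLS_POD=$(oc get pod -n openshift-storage -l app=rook-ceph-tools -o name | head -n 1)
$ oc rsh -n openshift-storage ${TOOLS_POD} ceph status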
I tested it again with the same steps above.
I did the following steps:
1. Delete the disk from the vSphere side.
2. I checked the OSDs:
$ oc get pods -n openshift-storage | grep osd
ocs-osd-removal-job-4stqs 0/1 Completed 0 2m8s
rook-ceph-osd-0-86b7d46f4f-4kbq5 1/2 CrashLoopBackOff 2 (21s ago) 21m
rook-ceph-osd-1-6c449d8498-tndsp 2/2 Running 0 21m
rook-ceph-osd-2-78cf9f4c6c-mqw54 2/2 Running 0 21m
rook-ceph-osd-prepare-ocs-deviceset-0-data-0lskk5-4248w 0/1 Completed 0 21m
rook-ceph-osd-prepare-ocs-deviceset-1-data-0gbg9w-b7l78 0/1 Completed 0 21m
rook-ceph-osd-prepare-ocs-deviceset-2-data-0kznvm-d4rvb 0/1 Completed 0 21m
3. Scale down the OSD deployment for the OSD to be replaced
$ osd_id_to_remove=0
$ oc scale -n openshift-storage deployment rook-ceph-osd-${osd_id_to_remove} --replicas=0
deployment.apps/rook-ceph-osd-0 scaled
4. Delete the old ocs-osd-removal pod:
$ oc -n openshift-storage delete pod ocs-osd-removal-job-4stqs
pod "ocs-osd-removal-job-4stqs" deleted
5. Execute the ocs-osd-removal job with "FORCE_OSD_REMOVAL=true":
$ oc project openshift-storage
Now using project "openshift-storage" on server "https://api.ikave-vm414.qe.rh-ocs.com:6443".
$ oc process -n openshift-storage ocs-osd-removal -p FAILED_OSD_IDS=${osd_id_to_remove} -p FORCE_OSD_REMOVAL=true |oc create -n openshift-storage -f -
job.batch/ocs-osd-removal-job created
6. Check the ocs-osd-removal job status and verify that it completed successfully:
$ oc get pod -l job-name=ocs-osd-removal-job -n openshift-storage
NAME READY STATUS RESTARTS AGE
ocs-osd-removal-job-z9kgv 0/1 Completed 0 13s
$ oc logs -l job-name=ocs-osd-removal-job -n openshift-storage --tail=-1 | egrep -i 'completed removal'
2023-11-07 10:47:48.390469 I | cephosd: completed removal of OSD 0
7. Also, I checked that a new OSD came up successfully:
$ oc get pods -n openshift-storage | grep osd
ocs-osd-removal-job-z9kgv 0/1 Completed 0 3m13s
rook-ceph-osd-0-f4448d86f-kbsbd 2/2 Running 0 2m24s
rook-ceph-osd-1-6c449d8498-tndsp 2/2 Running 0 32m
rook-ceph-osd-2-78cf9f4c6c-mqw54 2/2 Running 0 32m
rook-ceph-osd-prepare-ocs-deviceset-0-data-0lskk5-4248w 0/1 Completed 0 32m
rook-ceph-osd-prepare-ocs-deviceset-1-data-0fbb56-b8njz 0/1 Completed 0 2m52s
rook-ceph-osd-prepare-ocs-deviceset-2-data-0kznvm-d4rvb 0/1 Completed 0 32m
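To double-check that the removed OSD was purged from Ceph and that the new OSD joined the cluster, the OSD tree can be inspected from the toolbox pod; as above, the rook-ceph-tools pod is an assumption and is not part of the recorded steps:

# List the CRUSH tree; the replaced OSD ID should appear once, as up/in on the new device
$ oc rsh -n openshift-storage $(oc get pod -n openshift-storage -l app=rook-ceph-tools -o name | head -n 1) ceph osd tree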
Additional info:
Link to the Jenkins job: https://ocs4-jenkins-csb-odf-qe.apps.ocp-c1.prod.psi.redhat.com/job/qe-deploy-ocs-cluster/31120/.
Versions:
OC version:
Client Version: 4.10.24
Server Version: 4.14.0-0.nightly-2023-11-06-203803
Kubernetes Version: v1.27.6+f67aeb3
OCS version:
ocs-operator.v4.14.0-rhodf OpenShift Container Storage 4.14.0-rhodf Succeeded
Cluster version
NAME VERSION AVAILABLE PROGRESSING SINCE STATUS
version 4.14.0-0.nightly-2023-11-06-203803 True False 56m Cluster version is 4.14.0-0.nightly-2023-11-06-203803
Rook version:
rook: v4.14.0-0.103536c37b9f9063f4abe9db7c59150125b75908
go: go1.20.10
Ceph version:
ceph version 17.2.6-148.el9cp (badc1d27cb07762bea48f6554ad4f92b9d3fbb6b) quincy (stable)
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Important: Red Hat OpenShift Data Foundation 4.14.0 security, enhancement & bug fix update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.
https://access.redhat.com/errata/RHSA-2023:6832

One thing I noticed: when I run the ocs-osd-removal job, I don't see the job created:
$ oc get jobs
NAME                                                COMPLETIONS   DURATION   AGE
rook-ceph-osd-prepare-ocs-deviceset-0-data-0xxp9b   1/1           36s        5h20m
rook-ceph-osd-prepare-ocs-deviceset-1-data-0rmwdb   1/1           30s        112m
rook-ceph-osd-prepare-ocs-deviceset-2-data-0nxxhh   1/1           34s        5h20m
Even though the process completed successfully, as I mentioned in my previous comment https://bugzilla.redhat.com/show_bug.cgi?id=1970939#c25. Let me know whether or not this is a bug.

That is actually expected: by the time you tried to see the job, it had already been deleted because it was successful. But FYI, we are reverting this BZ. It requires doc changes in too many places and doesn't really add any value to the customer's experience. More details here: https://bugzilla.redhat.com/show_bug.cgi?id=2248833

Okay, thanks for the clarification.
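For anyone reproducing the verification above: since the operator removed the successful job, catching it requires watching while the removal is in progress. A minimal, hedged way to do that with plain oc:

# Watch job objects in the namespace while the removal runs; the job appears and is then cleaned up
$ oc get jobs -n openshift-storage -w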