Description of problem (please be detailed as possible and provide log snippets):

In the job template we use the rook "master" tag for the init container image in the product. We should read the rook version from the environment and not hardcode "master" or any other specific version.
https://github.com/red-hat-storage/ocs-operator/blob/release-4.9/controllers/storagecluster/job_templates.go#L165
This has been present since 4.9, when the template was added.

Version of all relevant components (if applicable):
From 4.9 to the latest main

Does this issue impact your ability to continue to work with the product (please explain in detail what is the user impact)?

Is there any workaround available to the best of your knowledge?

Rate from 1 - 5 the complexity of the scenario you performed that caused this bug (1 - very simple, 5 - very complex)?

Is this issue reproducible?

Can this issue be reproduced from the UI?

If this is a regression, please provide more details to justify this:

Steps to Reproduce:
1.
2.
3.

Actual results:

Expected results:

Additional info:
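For reference, a minimal sketch of the intended direction (not the actual ocs-operator code): take the Rook image for the init container from an environment variable instead of a hardcoded master-tagged image. The variable name ROOK_CEPH_IMAGE, the container name, and the helper names below are assumptions for illustration only.

package main

import (
	"fmt"
	"os"

	corev1 "k8s.io/api/core/v1"
)

// rookImage returns the Rook/Ceph image the job template should use.
// ROOK_CEPH_IMAGE is an assumed variable name; the real operator may
// wire the image in differently.
func rookImage() (string, error) {
	img := os.Getenv("ROOK_CEPH_IMAGE")
	if img == "" {
		return "", fmt.Errorf("ROOK_CEPH_IMAGE is not set")
	}
	return img, nil
}

// newJobInitContainer builds the init container with the image taken
// from the environment rather than a hardcoded "master" tag.
func newJobInitContainer() (corev1.Container, error) {
	img, err := rookImage()
	if err != nil {
		return corev1.Container{}, err
	}
	return corev1.Container{
		Name:  "config-init", // hypothetical name
		Image: img,
	}, nil
}

func main() {
	c, err := newJobInitContainer()
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println("init container image:", c.Image)
}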
*** Bug 2135736 has been marked as a duplicate of this bug. ***
@shubam - 1) what would be the steps to verify this bug? 2) Do we need to add any new tests due to this change?
I tested it with a vSphere OCP 4.12 and ODF 4.12 dynamic cluster. The steps I followed to reproduce the bug:

1. Deleted a disk from vSphere.

2. Checked the OSD status and observed the OSD that is down:

$ oc get pods -o wide | grep osd
rook-ceph-osd-0-76748c9b6-vpwz9    2/2   Running            0             69m   10.130.2.22   compute-1   <none>   <none>
rook-ceph-osd-1-54749698d7-2jp48   1/2   CrashLoopBackOff   3 (41s ago)   69m   10.129.2.20   compute-0   <none>   <none>
rook-ceph-osd-2-99f58954-k42nk     2/2   Running            0             68m   10.128.2.20   compute-2   <none>   <none>

3. Scaled down the osd-1 deployment and deleted the osd-1 pod, as mentioned in the doc.

4. Ran the "ocs-osd-removal" job and saw it completed successfully:

$ oc get jobs
NAME                                                COMPLETIONS   DURATION   AGE
ocs-osd-removal-job                                 1/1           11s        22s
rook-ceph-osd-prepare-ocs-deviceset-0-data-0vbd6h   1/1           72s        76m
rook-ceph-osd-prepare-ocs-deviceset-1-data-0269xm   1/1           40s        76m
rook-ceph-osd-prepare-ocs-deviceset-2-data-0w6njt   0/1           1s         1s

$ oc get pod -l job-name=ocs-osd-removal-job -n openshift-storage
NAME                        READY   STATUS      RESTARTS   AGE
ocs-osd-removal-job-8b449   0/1     Completed   0          59s

5. Checked the logs of the "ocs-osd-removal" job:

$ oc logs ocs-osd-removal-job-8b449
2022-11-22 11:41:25.506021 I | rookcmd: starting Rook v4.12.0-0.e237b7ff0b9225db1a5f8a95dc50f9f8e2d55206 with arguments '/usr/local/bin/rook ceph osd remove --osd-ids=1 --force-osd-removal true'
2022-11-22 11:41:25.506069 I | rookcmd: flag values: --force-osd-removal=true, --help=false, --log-level=DEBUG, --operator-image=, --osd-ids=1, --preserve-pvc=false, --service-account=

We can see in the first line that the rook version is v4.12.0-0, without the master tag.

Additional info:

Link to the Jenkins job: https://ocs4-jenkins-csb-odf-qe.apps.ocp-c1.prod.psi.redhat.com/job/qe-deploy-ocs-cluster/18259/

Versions:

OC version:
Client Version: 4.10.24
Server Version: 4.12.0-0.nightly-2022-11-22-012345
Kubernetes Version: v1.25.2+5533733

OCS version:
ocs-operator.v4.12.0-114.stable   OpenShift Container Storage   4.12.0-114.stable   Succeeded

Cluster version:
NAME      VERSION                              AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.12.0-0.nightly-2022-11-22-012345   True        False         6h44m    Cluster version is 4.12.0-0.nightly-2022-11-22-012345

Rook version:
rook: v4.12.0-0.e237b7ff0b9225db1a5f8a95dc50f9f8e2d55206
go: go1.18.7

Ceph version:
ceph version 16.2.10-72.el8cp (3311949c2d1edf5cabcc20ba0f35b4bfccbf021e) pacific (stable)
According to the two comments above, I am moving the bug to Verified.