Bug 2135626

Summary: Do not use rook master tag in job template [4.12]
Product: [Red Hat Storage] Red Hat OpenShift Data Foundation
Reporter: Subham Rai <srai>
Component: ocs-operator
Assignee: Subham Rai <srai>
Status: CLOSED CURRENTRELEASE
QA Contact: Itzhak <ikave>
Severity: low
Docs Contact:
Priority: unspecified
Version: 4.9
CC: ebenahar, kramdoss, muagarwa, ocs-bugs, odf-bz-bot, sostapov
Target Milestone: ---
Target Release: ODF 4.12.0
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version: 4.12.0-113
Doc Type: No Doc Update
Doc Text:
Story Points: ---
Clone Of:
Clones: 2135631, 2135632, 2135636, 2135736
Environment:
Last Closed: 2023-02-08 14:06:28 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On:
Bug Blocks: 2135631, 2135632, 2135636, 2135736

Description Subham Rai 2022-10-18 06:14:25 UTC
Description of problem (please be as detailed as possible and provide log
snippets):

In the job template, the product uses the rook master tag for the init container. We should read the rook version from the environment and not hardcode master or any other specific version.

https://github.com/red-hat-storage/ocs-operator/blob/release-4.9/controllers/storagecluster/job_templates.go#L165

This has been the case since 4.9, when the template was added.
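
For illustration only, a minimal sketch of the intended change, assuming the operator already receives the Rook image through an environment variable. The variable name ROOK_CEPH_IMAGE, the helper names, and the container name below are assumptions for the sketch, not the actual ocs-operator code:

package storagecluster

import (
	"fmt"
	"os"

	corev1 "k8s.io/api/core/v1"
)

// rookImageFromEnv returns the Rook image the operator was configured with,
// instead of a hard-coded, floating "master" tag in the job template.
// ROOK_CEPH_IMAGE is an assumed env var set on the operator Deployment.
func rookImageFromEnv() (string, error) {
	img := os.Getenv("ROOK_CEPH_IMAGE")
	if img == "" {
		return "", fmt.Errorf("ROOK_CEPH_IMAGE is not set")
	}
	return img, nil
}

// osdRemovalInitContainer builds the init container for the OSD-removal job
// template from the env-provided image (the container name is illustrative).
func osdRemovalInitContainer() (corev1.Container, error) {
	img, err := rookImageFromEnv()
	if err != nil {
		return corev1.Container{}, err
	}
	return corev1.Container{
		Name:  "copy-rook-binaries",
		Image: img, // previously pinned to a ":master" tag
	}, nil
}

Reading the image from the environment keeps the removal job's init container in lockstep with the Rook image the operator itself ships with, instead of pulling whatever the master tag currently points at.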

Version of all relevant components (if applicable):
From 4.9 to the latest main branch.

Does this issue impact your ability to continue to work with the product
(please explain in detail what is the user impact)?


Is there any workaround available to the best of your knowledge?


Rate from 1 - 5 the complexity of the scenario you performed that caused this
bug (1 - very simple, 5 - very complex)?


Is this issue reproducible?


Can this issue be reproduced from the UI?


If this is a regression, please provide more details to justify this:


Steps to Reproduce:
1.
2.
3.


Actual results:


Expected results:


Additional info:

Comment 2 Mudit Agarwal 2022-10-19 03:33:12 UTC
*** Bug 2135736 has been marked as a duplicate of this bug. ***

Comment 3 krishnaram Karthick 2022-10-28 07:22:47 UTC
@subham -
1) What would be the steps to verify this bug?
2) Do we need to add any new tests due to this change?

Comment 8 Itzhak 2022-11-22 17:15:22 UTC
I tested it on a vSphere OCP 4.12 and ODF 4.12 dynamic cluster.

The steps I followed to reproduce the bug:

1. Deleted a disk from vSphere.
2. Checked the osd status and observed the osd that is down:
$ oc get pods -o wide | grep osd
rook-ceph-osd-0-76748c9b6-vpwz9                                   2/2     Running            0             69m   10.130.2.22    compute-1   <none>           <none>
rook-ceph-osd-1-54749698d7-2jp48                                  1/2     CrashLoopBackOff   3 (41s ago)   69m   10.129.2.20    compute-0   <none>           <none>
rook-ceph-osd-2-99f58954-k42nk                                    2/2     Running            0             68m   10.128.2.20    compute-2   <none>           <none>

3. Scaled down the osd-1 deployment and deleted the osd-1 pod, as described in the doc.
4. Ran the "ocs-osd-removal" job and saw it complete successfully:
$ oc get jobs
NAME                                                COMPLETIONS   DURATION   AGE
ocs-osd-removal-job                                 1/1           11s        22s
rook-ceph-osd-prepare-ocs-deviceset-0-data-0vbd6h   1/1           72s        76m
rook-ceph-osd-prepare-ocs-deviceset-1-data-0269xm   1/1           40s        76m
rook-ceph-osd-prepare-ocs-deviceset-2-data-0w6njt   0/1           1s         1s
$ oc get pod -l job-name=ocs-osd-removal-job -n openshift-storage
NAME                        READY   STATUS      RESTARTS   AGE
ocs-osd-removal-job-8b449   0/1     Completed   0          59s 


5. Checked the logs of the "ocs-osd-removal-job":
$ oc logs ocs-osd-removal-job-8b449 
2022-11-22 11:41:25.506021 I | rookcmd: starting Rook v4.12.0-0.e237b7ff0b9225db1a5f8a95dc50f9f8e2d55206 with arguments '/usr/local/bin/rook ceph osd remove --osd-ids=1 --force-osd-removal true'
2022-11-22 11:41:25.506069 I | rookcmd: flag values: --force-osd-removal=true, --help=false, --log-level=DEBUG, --operator-image=, --osd-ids=1, --preserve-pvc=false, --service-account=

We can see in the first line that the rook version is v4.12.0-0, without the master tag.
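
As a side note, a small sketch (not part of the product or the QE automation) of how the check above could be automated: parse the first log line of the removal job and assert that the reported version is not a master tag. The log line is copied from the output above; everything else is an assumption:

package main

import (
	"fmt"
	"regexp"
	"strings"
)

func main() {
	// First log line from the ocs-osd-removal job (copied from the output above).
	logLine := "2022-11-22 11:41:25.506021 I | rookcmd: starting Rook v4.12.0-0.e237b7ff0b9225db1a5f8a95dc50f9f8e2d55206 with arguments '/usr/local/bin/rook ceph osd remove --osd-ids=1 --force-osd-removal true'"

	// Extract the version token that follows "starting Rook".
	re := regexp.MustCompile(`starting Rook (\S+)`)
	m := re.FindStringSubmatch(logLine)
	if m == nil {
		fmt.Println("could not find a Rook version in the log line")
		return
	}

	version := m[1]
	if strings.Contains(version, "master") {
		fmt.Printf("FAIL: removal job still runs a master-tagged image: %s\n", version)
		return
	}
	fmt.Printf("PASS: removal job runs a pinned Rook version: %s\n", version)
}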


Additional info:

Link to the Jenkins job: https://ocs4-jenkins-csb-odf-qe.apps.ocp-c1.prod.psi.redhat.com/job/qe-deploy-ocs-cluster/18259/

Versions:

OC version:
Client Version: 4.10.24
Server Version: 4.12.0-0.nightly-2022-11-22-012345
Kubernetes Version: v1.25.2+5533733

OCS version:
ocs-operator.v4.12.0-114.stable              OpenShift Container Storage   4.12.0-114.stable              Succeeded

Cluster version:
NAME      VERSION                              AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.12.0-0.nightly-2022-11-22-012345   True        False         6h44m   Cluster version is 4.12.0-0.nightly-2022-11-22-012345

Rook version:
rook: v4.12.0-0.e237b7ff0b9225db1a5f8a95dc50f9f8e2d55206
go: go1.18.7

Ceph version:
ceph version 16.2.10-72.el8cp (3311949c2d1edf5cabcc20ba0f35b4bfccbf021e) pacific (stable)

Comment 10 Itzhak 2022-11-22 17:19:29 UTC
According to the two comments above, I am moving the bug to Verified.