Bug 2002852

Summary: ocs-operator update from v4.7.2 to v4.7.3 is in Installing state
Product: [Red Hat Storage] Red Hat OpenShift Data Foundation Reporter: Alvaro Soto <asoto>
Component: rook    Assignee: Sébastien Han <shan>
Status: CLOSED INSUFFICIENT_DATA QA Contact: Raz Tamir <ratamir>
Severity: high Docs Contact:
Priority: unspecified    
Version: 4.7CC: blaine, ccharron, dansmall, hchiramm, hnallurv, jpinto, kelwhite, madam, mrajanna, ocs-bugs, odf-bz-bot, rar, shan, sostapov, tdesala, tnielsen
Target Milestone: ---    Flags: asoto: needinfo-
asoto: needinfo-
jrivera: needinfo?
Target Release: ---   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2021-10-15 06:36:15 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:

Description Alvaro Soto 2021-09-09 20:31:13 UTC
Description of problem (please be as detailed as possible and provide log
snippets):

$ omg get csv
NAME                  DISPLAY                       VERSION   REPLACES              PHASE
ocs-operator.v4.7.3   OpenShift Container Storage   4.7.3     ocs-operator.v4.7.2   Installing

~~~~~

  lastTransitionTime: '2021-09-03T17:10:07Z'
  lastUpdateTime: '2021-09-03T17:10:07Z'
  message: 'installing: waiting for deployment ocs-operator to become ready: Waiting
    for rollout to finish: 0 of 1 updated replicas are available...

    '
  phase: Installing
  reason: InstallWaiting

~~~~~
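
For reference, a minimal way to dig further into a stuck CSV like this (a sketch, assuming the default openshift-storage namespace and a live cluster rather than omg output) is to check the operator deployment rollout directly:

$ oc -n openshift-storage get csv
$ oc -n openshift-storage rollout status deployment/ocs-operator
$ oc -n openshift-storage describe deployment ocs-operator
$ oc -n openshift-storage get pods | grep ocs-operator

The describe output and the pod events usually show why the updated replica never becomes available (image pull, scheduling, SCC admission, etc.).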


Version of all relevant components (if applicable):
ocp 4.7.5

Does this issue impact your ability to continue to work with the product
(please explain in detail what is the user impact)?
ceph storage unusable


Is there any workaround available to the best of your knowledge?
no

Rate from 1 - 5 the complexity of the scenario you performed that caused this
bug (1 - very simple, 5 - very complex)?


Is this issue reproducible?


Can this issue be reproduced from the UI?


If this is a regression, please provide more details to justify this:


Steps to Reproduce:
1.
2.
3.


Actual results:


Expected results:


Additional info:

Comment 2 Alvaro Soto 2021-09-09 20:34:37 UTC
This seems similar to bz #1867024 (different version), but I'd like confirmation.

Comment 3 Alvaro Soto 2021-09-10 20:31:47 UTC
Hello there!
Any update on this issue?

Comment 5 Travis Nielsen 2021-09-10 21:09:14 UTC
Please attach a must-gather from the cluster. This issue does not have enough details to troubleshoot.
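
As a sketch, the logs can typically be collected with the OCS-specific must-gather image alongside the default one (the image tag below is an assumption; use the tag matching the installed 4.7 z-stream):

$ oc adm must-gather --image=registry.redhat.io/ocs4/ocs-must-gather-rhel8:v4.7
$ oc adm must-gather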

Comment 8 Travis Nielsen 2021-09-13 16:06:09 UTC
The issue looks related to the CSI driver; moving to the CSI component to take a look.

Comment 20 Sébastien Han 2021-09-30 12:30:13 UTC
None of the must-gather links work; how can we access the logs?

Comment 21 Sébastien Han 2021-09-30 13:02:37 UTC
Got the logs from Gabriel. Thanks

Comment 22 Sébastien Han 2021-09-30 13:20:25 UTC
OK, after looking, I couldn't find anything wrong with our configuration:

* the Service Accounts are declared in the CSV; however, I couldn't verify they were actually created (it would be good to validate that). Since the cluster was working on the previous version, we can assume they are still present
* the SCC for ceph-csi points to the correct service accounts
* the ceph-csi resources are configured with the correct service account

What's really strange is that both noobaa and rook-ceph have the right annotation for their respective SCCs.
The only thing I can think of is that if the SA does not exist, the admission controller would fall back to the default "restricted" SCC.
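
A quick way to validate both points (a sketch, assuming the default openshift-storage namespace; the pod name below is a placeholder) is to list the service accounts and read the SCC annotation the admission controller stamped on a CSI pod:

$ oc -n openshift-storage get serviceaccounts | grep csi
$ oc -n openshift-storage describe pod <csi-rbdplugin-pod> | grep 'openshift.io/scc'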

Honestly, at this point, it would be good to get the input from the OCP team to see if anything changed.
Also, do you know why this was not caught by QE, and do we have other customer cases, or is it only this customer?

Comment 23 Sébastien Han 2021-09-30 15:02:00 UTC
Can someone from support try the following:

* edit the rook-ceph-csi SCC and remove the rook-csi-rbd-attacher-sa service account from the users list
* remove all the ceph-csi resources (RBD/CephFS plugin and provisioner pods)
* restart the rook-ceph operator

At this point, observe the newly created ceph-csi resources and look up the SCC they use in the annotation (a command sketch follows).
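
A rough command sequence for the steps above (a sketch only; the SCC name, namespace, and pod labels are taken from a standard OCS 4.7 install and may differ on this cluster):

$ oc edit scc rook-ceph-csi    # remove rook-csi-rbd-attacher-sa from the users list
$ oc -n openshift-storage delete pod -l app=csi-rbdplugin
$ oc -n openshift-storage delete pod -l app=csi-rbdplugin-provisioner
$ oc -n openshift-storage delete pod -l app=csi-cephfsplugin
$ oc -n openshift-storage delete pod -l app=csi-cephfsplugin-provisioner
$ oc -n openshift-storage rollout restart deployment/rook-ceph-operator
$ oc -n openshift-storage describe pod -l app=csi-rbdplugin | grep 'openshift.io/scc'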

Thanks.

Comment 24 Alvaro Soto 2021-09-30 21:59:41 UTC
Hey Sébastien,
we followed the instructions, but the pods continue using the privileged SCC.
Talking to the customer, we found this cluster has the auto-update flag enabled, so at the moment ocs-operator is on version 4.7.4.
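
For reference, a sketch of how to confirm the auto-update setting on the subscription (the subscription name varies per install, so list them first):

$ oc -n openshift-storage get subscriptions
$ oc -n openshift-storage get subscription <ocs-subscription-name> -o jsonpath='{.spec.installPlanApproval}'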

New must-gathers in the case.

Cheers!

Comment 26 Sébastien Han 2021-10-07 15:14:23 UTC
Hi Alvaro,

Waiting for the OCS must-gather now.
Thanks

Comment 27 Alvaro Soto 2021-10-07 17:09:06 UTC
Hello Sébastien,
the customer decided to start over by deleting the cluster; what will the next steps be here?

Cheers!

Comment 28 Travis Nielsen 2021-10-11 15:27:17 UTC
Removing needinfo since we are waiting to see whether more details or a reproducer become available.

Comment 29 Sébastien Han 2021-10-15 06:36:15 UTC
The customer agreed to close the BZ. A new installation was done; no more logs or environment are available to troubleshoot further.