Bug 1962278 - Include events in Rook for cephcluster
Summary: Include events in Rook for cephcluster
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat OpenShift Container Storage
Classification: Red Hat Storage
Component: rook
Version: 4.8
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: medium
Target Milestone: ---
Target Release: OCS 4.8.0
Assignee: Nitin Goyal
QA Contact: Mugdha Soni
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2021-05-19 16:32 UTC by Neha Berry
Modified: 2021-08-03 18:16 UTC
CC: 5 users

Fixed In Version: 4.8.0-402.ci
Doc Type: No Doc Update
Doc Text:
Clone Of:
Environment:
Last Closed: 2021-08-03 18:16:11 UTC
Embargoed:


Attachments: None


Links:
  GitHub: openshift/rook pull 241 (closed) - Sync from release-1.6 to downstream release-4.8 - last updated 2021-05-21 00:40:20 UTC
  Red Hat Product Errata: RHBA-2021:3003 - last updated 2021-08-03 18:16:26 UTC

Description Neha Berry 2021-05-19 16:32:15 UTC
Description of problem (please be as detailed as possible and provide log
snippets):
=====================================================================
With Bug 1927338, events were added to the OCS operator, especially for the cases where uninstall was stuck and there were issues.

But users still had to check the Rook logs to understand the cause of issues when there were problems with cephcluster deletion (deletion stuck).

Nitin has already added the events in Rook upstream, and this bug tracks the backport of that code to the OCS 4.8 downstream branch.


For more details, see Bug 1927338#c12 and Bug 1927338#c6
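
For illustration, a minimal sketch of how a controller attaches events to a CR so they appear under oc describe. This assumes client-go; it is not Rook's actual source, and the function names are hypothetical:

package controller

import (
	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/runtime"
	"k8s.io/client-go/kubernetes"
	typedcorev1 "k8s.io/client-go/kubernetes/typed/core/v1"
	"k8s.io/client-go/tools/record"
)

// newRecorder wires an EventBroadcaster to the API server so that
// recorded events are persisted and show up in `oc describe <CR>`.
// (Illustrative sketch, not Rook's exact wiring.)
func newRecorder(clientset kubernetes.Interface, scheme *runtime.Scheme) record.EventRecorder {
	broadcaster := record.NewBroadcaster()
	broadcaster.StartRecordingToSink(&typedcorev1.EventSinkImpl{
		Interface: clientset.CoreV1().Events(""),
	})
	return broadcaster.NewRecorder(scheme, corev1.EventSource{Component: "ClusterController"})
}

// reportReconcileFailed is a hypothetical helper that attaches a Warning
// event to the given object (e.g. a CephCluster CR), similar to the
// ReconcileFailed events verified later in this bug.
func reportReconcileFailed(recorder record.EventRecorder, obj runtime.Object, err error) {
	recorder.Eventf(obj, corev1.EventTypeWarning, "ReconcileFailed",
		"failed to reconcile cluster: %v", err)
}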

Version of all relevant components (if applicable):
==================================================
OCS 4.8

It is not yet clear whether the fix also needs to be backported to 4.7.


Does this issue impact your ability to continue to work with the product
(please explain in detail what is the user impact)?
============================================================
No, but one needs to check the logs to get information about failures.

Is there any workaround available to the best of your knowledge?
===============================================================
Check logs

Rate from 1 - 5 the complexity of the scenario you performed that caused this
bug (1 - very simple, 5 - very complex)?
================================================================
3

Is this issue reproducible?
===============================
Yes

Can this issue be reproduced from the UI?


If this is a regression, please provide more details to justify this:
=================================================
No


Steps to Reproduce:
======================
1. Create PVCs and OBCs
2. With the default mode for the storagecluster (uninstall.ocs.openshift.io/mode: graceful), initiate storagecluster deletion
3. In the ocs-operator logs, we only get an indication that deletion is waiting for the cephcluster to be removed
4. oc describe of the cephcluster also does not show the details

Actual results:
===================
One needs to go through multiple logs to understand what is causing the cephcluster deletion to get stuck.


Expected results:
=====================
We already have events on the important CRs managed by the storagecluster (see Bug 1927338#c11), but we also need events on the cephcluster so that oc describe <CR> itself tells us what is affecting the uninstall.
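
As a hedged sketch of what this enables programmatically (assuming client-go; the namespace and CR name are taken from this report, and the kubeconfig handling is illustrative), the same events shown by oc describe can be listed directly from the API:

package main

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	// Load the local kubeconfig (illustrative; in-cluster config also works).
	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	clientset := kubernetes.NewForConfigOrDie(cfg)

	// Select only events whose involvedObject is the cephcluster CR.
	events, err := clientset.CoreV1().Events("openshift-storage").List(context.TODO(), metav1.ListOptions{
		FieldSelector: "involvedObject.name=ocs-storagecluster-cephcluster,involvedObject.kind=CephCluster",
	})
	if err != nil {
		panic(err)
	}
	for _, e := range events.Items {
		// Count reflects event aggregation, e.g. "x6 over 5h51m" in oc describe.
		fmt.Printf("%s\t%s\t(x%d)\t%s\n", e.Type, e.Reason, e.Count, e.Message)
	}
}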

Comment 4 Travis Nielsen 2021-05-19 16:47:41 UTC
This will be picked up in the next downstream resync to release-4.8.

Comment 5 Travis Nielsen 2021-05-21 00:40:24 UTC
Included in the latest resync to release-4.8

Comment 12 Mugdha Soni 2021-06-25 07:56:12 UTC
Hi 

As mentioned in comment #11, I powered off two storage nodes and observed that there were no events seen through the CLI.

Events:
  Type    Reason              Age                From               Message
  ----    ------              ----               ----               -------
  Normal  ReconcileSucceeded  48m (x2 over 22h)  ClusterController  cluster has been configured successfully

The second scenario, suggested by @nigoyal, was to scale down the ocs-operator and then edit the cephcluster to change the mon count to 10. The cluster was kept in the same state for 5 hours, and I observed that events were generated and that the count of events was correct.
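
For context, an illustrative sketch (not Rook's exact source; the function names are hypothetical) of the kind of pre-creation validation that produces the ReconcileFailed warning events captured in the steps below:

package controller

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/runtime"
	"k8s.io/client-go/tools/record"
)

// validateMonCount rejects even mon counts, mirroring the
// "mon count 10 cannot be even" message observed in this test.
func validateMonCount(count int) error {
	if count%2 == 0 {
		return fmt.Errorf("mon count %d cannot be even, must be odd to support a healthy quorum", count)
	}
	return nil
}

// reconcile is a hypothetical wrapper showing how a validation failure
// becomes a Warning event on the CR instead of only a log line.
func reconcile(recorder record.EventRecorder, cluster runtime.Object, monCount int) error {
	if err := validateMonCount(monCount); err != nil {
		recorder.Event(cluster, corev1.EventTypeWarning, "ReconcileFailed", err.Error())
		return err
	}
	return nil
}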

Steps performed to validate the fix are mentioned below:

1. Deployed an OCS 4.8 cluster. The pods, nodes, and Ceph health were fine.

2. Scaled down the ocs-operator.

[root@localhost ocs4_8_aws]# oc scale deployment ocs-operator --replicas=0 -n openshift-storage
deployment.apps/ocs-operator scaled

3. Edited the cephcluster to change the mon count to 10.

[root@localhost ocs4_8_aws]# oc edit -n openshift-storage cephcluster ocs-storagecluster-cephcluster
cephcluster.ceph.rook.io/ocs-storagecluster-cephcluster edited

4. Observed the events for the next 5 hours using the command "oc describe cephcluster -n openshift-storage".

Events:
  Type     Reason           Age    From               Message
  ----     ------           ----   ----               -------
  Warning  ReconcileFailed  3m32s  ClusterController  failed to reconcile cluster "ocs-storagecluster-cephcluster": failed to configure local ceph cluster: failed to perform validation before cluster creation: mon count 10 cannot be even, must be odd to support a healthy quorum

====================================================================================================================================================================================

Events:
  Type     Reason           Age                From               Message
  ----     ------           ----               ----               -------
  Warning  ReconcileFailed  12m (x2 over 84m)  ClusterController  failed to reconcile cluster "ocs-storagecluster-cephcluster": failed to configure local ceph cluster: failed to perform validation before cluster creation: mon count 10 cannot be even, must be odd to support a healthy quorum

====================================================================================================================================================================================

Events:
  Type     Reason           Age                 From               Message
  ----     ------           ----                ----               -------
  Warning  ReconcileFailed  29m (x3 over 168m)  ClusterController  failed to reconcile cluster "ocs-storagecluster-cephcluster": failed to configure local ceph cluster: failed to perform validation before cluster creation: mon count 10 cannot be even, must be odd to support a healthy quorum

=====================================================================================================================================================================================

Events:
  Type     Reason           Age                  From               Message
  ----     ------           ----                 ----               -------
  Warning  ReconcileFailed  12m (x6 over 5h51m)  ClusterController  failed to reconcile cluster "ocs-storagecluster-cephcluster": failed to configure local ceph cluster: failed to perform validation before cluster creation: mon count 10 cannot be even, must be odd to support a healthy quorum

=====================================================================================================================================================================================

5. Changed the mon count to 3 and then back to 10, and observed that new events were generated and the event count increased.

Events:
  Type     Reason           Age                 From               Message
  ----     ------           ----                ----               -------
  Warning  ReconcileFailed  9m25s               ClusterController  failed to reconcile cluster "ocs-storagecluster-cephcluster": failed to configure local ceph cluster: failed to create cluster: failed to start ceph monitors: failed to assign pods to mons: CANCELLING CURRENT ORCHESTRATION
  Warning  ReconcileFailed  45s (x7 over 6h7m)  ClusterController  failed to reconcile cluster "ocs-storagecluster-cephcluster": failed to configure local ceph cluster: failed to perform validation before cluster creation: mon count 10 cannot be even, must be odd to support a healthy quorum
=====================================================================================================================================================================================


The events were generated, and the event count did not increase more than once per hour. This is consistent with Kubernetes event aggregation, which increments the count on a single Event object for repeated identical failures instead of creating a new event each time.

Hence, moving the bug to the verified state.

Thanks

Comment 14 errata-xmlrpc 2021-08-03 18:16:11 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Red Hat OpenShift Container Storage 4.8.0 container images bug fix and enhancement update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2021:3003

