Bug 1974344 - critical ClusterObjectStoreState alert firing after installation of arbiter storage cluster, likely because ceph object user for cephobjectstore fails to be created, when storagecluster is reinstalled
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat OpenShift Data Foundation
Classification: Red Hat Storage
Component: rook
Version: 4.8
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: high
Target Milestone: ---
Target Release: ODF 4.10.0
Assignee: Blaine Gardner
QA Contact: Martin Bukatovic
URL:
Whiteboard:
Depends On: 2005040 2021068
Blocks: 1966894 1981258 2011326 2029744 2056571
 
Reported: 2021-06-21 12:25 UTC by Martin Bukatovic
Modified: 2023-08-09 17:03 UTC
CC List: 15 users

Fixed In Version: 4.10.0-113
Doc Type: Bug Fix
Doc Text:
CephObjectStore reaches the Ready state. Previously, a CephObjectStore would not attain the Ready state if it was created with the same name as a previously deleted CephObjectStore. With this update, a new CephObjectStore can reach the Ready state even if a previously deleted CephObjectStore had the same name.
Clone Of:
: 1981258 2029744 (view as bug list)
Environment:
Last Closed: 2022-04-13 18:49:40 UTC
Embargoed:


Attachments
- screenshot #2: prometheus query ocs_rgw_health_status{job="ocs-metrics-exporter"} (183.10 KB, image/png), 2021-06-21 13:01 UTC, Martin Bukatovic


Links
- GitHub red-hat-storage/rook pull 318 (Merged): Sync from upstream release-1.8 to downstream release-4.10, last updated 2021-12-17 18:27:11 UTC
- GitHub rook/rook pull 8208 (closed): ceph: enable debug for adminops client if the rook loglevel <= debug, last updated 2021-07-06 12:39:25 UTC
- GitHub rook/rook pull 9417 (open): pool: file: object: clean up health checkers for force deletion, last updated 2021-12-13 23:11:50 UTC
- Red Hat Product Errata RHSA-2022:1372, last updated 2022-04-13 18:50:11 UTC

Description Martin Bukatovic 2021-06-21 12:25:10 UTC
Description of problem
======================

Shortly after installation of a StorageCluster in stretch arbiter mode using LSO
on the vSphere platform, I see the critical alert ClusterObjectStoreState firing.

Version-Release number of selected component
============================================

OCP 4.8.0-0.nightly-2021-06-19-005119
LSO 4.8.0-202106152230
OCS 4.8.0-422.ci

How reproducible
================

100%

Steps to Reproduce
==================

1. Install OCP on vSphere, with 3 master and 6 worker nodes, and with 2 local
   storage devices per worker node (for LSO).

2. Install LSO and OCS operators.

3. Label nodes via `oc label node $NODE topology.kubernetes.io/zone=$ZONE`
   so that you have 3 worker nodes and 1 master in each data zone, and just a
   single master in the arbiter zone:

   ```
   $ oc get nodes -L topology.kubernetes.io/zone
   NAME              STATUS   ROLES    AGE    VERSION                ZONE
   compute-0         Ready    worker   115m   v1.21.0-rc.0+120883f   data-1
   compute-1         Ready    worker   115m   v1.21.0-rc.0+120883f   data-1
   compute-2         Ready    worker   115m   v1.21.0-rc.0+120883f   data-1
   compute-3         Ready    worker   115m   v1.21.0-rc.0+120883f   data-2
   compute-4         Ready    worker   115m   v1.21.0-rc.0+120883f   data-2
   compute-5         Ready    worker   115m   v1.21.0-rc.0+120883f   data-2
   control-plane-0   Ready    master   121m   v1.21.0-rc.0+120883f   data-1
   control-plane-1   Ready    master   121m   v1.21.0-rc.0+120883f   data-2
   control-plane-2   Ready    master   122m   v1.21.0-rc.0+120883f   arbiter-0
   ```
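
   For reference, a minimal labeling sketch that matches the `oc get nodes`
   output above (node names are specific to this cluster; adjust as needed):

   ```
   # one data zone per group of nodes, plus the arbiter zone for one master
   oc label node compute-0 compute-1 compute-2 control-plane-0 topology.kubernetes.io/zone=data-1
   oc label node compute-3 compute-4 compute-5 control-plane-1 topology.kubernetes.io/zone=data-2
   oc label node control-plane-2 topology.kubernetes.io/zone=arbiter-0
   ```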

4. Use "Create Storage Cluster" wizard in OCP Console to start setup of Storage
   Cluster in "Internal - Attached devices" mode.

5. In OCP Console, navigate to Monitoring -> Alerting, and then also to OCS
   Dashboard.

Actual results
==============

I see critical alert ClusterObjectStoreState:
Cluster Object Store is in unhealthy state for more than 15s. Please check Ceph cluster health.

Expected results
================

There is no problem with object storage, ClusterObjectStoreState is not firing.

Additional info
===============

When I reinstalled the StorageCluster on the same OpenShift cluster without
arbiter mode, the problem did not occur.

I don't recall this happening with arbiter clusters on OCS 4.7 => flagging this
as a regression.

Rados gateway pods seem to be running:

```
$ oc get pods -n openshift-storage | grep rgw
rook-ceph-rgw-ocs-storagecluster-cephobjectstore-a-7d797d4d8vpn   2/2     Running     0          64m
rook-ceph-rgw-ocs-storagecluster-cephobjectstore-b-57b78b899hj7   2/2     Running     0          64m
```

Ceph doesn't report any error:

```
$ oc rsh -n openshift-storage rook-ceph-tools-69867749f-xflj9 bash
[root@compute-5 /]# ceph status
  cluster:
    id:     0598f7ad-7271-49dd-8a10-b4d7cb73194e
    health: HEALTH_OK
 
  services:
    mon: 5 daemons, quorum a,b,c,d,e (age 67m)
    mgr: b(active, since 67m), standbys: a
    mds: ocs-storagecluster-cephfilesystem:1 {0=ocs-storagecluster-cephfilesystem-a=up:active} 1 up:standby-replay
    osd: 12 osds: 12 up (since 67m), 12 in (since 67m)
    rgw: 2 daemons active (ocs.storagecluster.cephobjectstore.a, ocs.storagecluster.cephobjectstore.b)
 
  data:
    pools:   10 pools, 464 pgs
    objects: 316 objects, 136 MiB
    usage:   12 GiB used, 180 GiB / 192 GiB avail
    pgs:     464 active+clean
 
  io:
    client:   852 B/s rd, 2.3 KiB/s wr, 1 op/s rd, 0 op/s wr
```

See also must gather tarball referenced in a comment below.

Comment 6 Martin Bukatovic 2021-06-21 13:00:17 UTC
The definition of ClusterObjectStoreState alert is: ocs_rgw_health_status{job="ocs-metrics-exporter"} > 1

See screenshot #2 with a result of ocs_rgw_health_status{job="ocs-metrics-exporter"} prometheus query.
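
For reference, one way to run the same query from the CLI (a sketch, assuming your user is allowed to query cluster monitoring via the thanos-querier route in openshift-monitoring):

```
TOKEN=$(oc whoami -t)
HOST=$(oc -n openshift-monitoring get route thanos-querier -o jsonpath='{.spec.host}')
curl -sk -H "Authorization: Bearer $TOKEN" \
  "https://$HOST/api/v1/query" \
  --data-urlencode 'query=ocs_rgw_health_status{job="ocs-metrics-exporter"}'
```

Per the alert definition above, any returned value greater than 1 corresponds to the alert condition.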

Moving to alerting component.

Comment 7 Martin Bukatovic 2021-06-21 13:01:42 UTC
Created attachment 1792649 [details]
screenshot #2: prometheus query ocs_rgw_health_status{job="ocs-metrics-exporter"}

Comment 8 Martin Bukatovic 2021-06-22 10:00:07 UTC
Anmol noticed that there is an actual problem with the ceph object store:

```
$ oc describe cephobjectstore/ocs-storagecluster-cephobjectstore -n openshift-storage | tail 
Status:
  Bucket Status:
    Details:       failed to create ceph object user "rook-ceph-internal-s3-user-checker-f6587ff0-762e-4158-bf52-8e5a355923db": InvalidAccessKeyId tx000000000000000002886-0060d1b2c9-b017-ocs-storagecluster-cephobjectstore b017-ocs-storagecluster-cephobjectstore-ocs-storagecluster-cephobjectstore
    Health:        Failure
    Last Changed:  2021-06-22T09:52:09Z
    Last Checked:  2021-06-22T09:52:09Z
  Info:
    Endpoint:  http://rook-ceph-rgw-ocs-storagecluster-cephobjectstore.openshift-storage.svc:80
  Phase:       Failure
Events:        <none>
```

Based on discussion with Anmol, moving to rook.

Comment 9 Travis Nielsen 2021-06-22 17:46:25 UTC
@Martin Could I get access to the cluster where you are seeing this issue with the object store user? I don't see the root cause in the must-gather, and I may need to enable debug logging in the rook operator.

Comment 10 Martin Bukatovic 2021-06-22 19:39:46 UTC
(In reply to Travis Nielsen from comment #9)
> @Martin Could I get access to the cluster where you are seeing this issue
> with the object store user? I don't see the root cause in the must-gather,
> and I may need to enable debug logging in the rook operator.

Cluster access provided via direct message.

Comment 11 Travis Nielsen 2021-06-22 20:52:54 UTC
The object store user for the bucket health checker is failing to be created. Here is the debug logging around the health check:

2021-06-22 19:46:19.167874 D | ceph-object-controller: creating s3 user object "rook-ceph-internal-s3-user-checker-f6587ff0-762e-4158-bf52-8e5a355923db" for object store "ocs-storagecluster-cephobjectstore"
2021-06-22 19:46:19.186375 D | ceph-object-controller: object store "openshift-storage/ocs-storagecluster-cephobjectstore" status updated to Failure
2021-06-22 19:46:19.186396 D | ceph-object-controller: failed to check rgw health for object store "ocs-storagecluster-cephobjectstore". failed to create ceph object user "rook-ceph-internal-s3-user-checker-f6587ff0-762e-4158-bf52-8e5a355923db": InvalidAccessKeyId tx000000000000000003a2c-0060d23e0b-b017-ocs-storagecluster-cephobjectstore b017-ocs-storagecluster-cephobjectstore-ocs-storagecluster-cephobjectstore
2021-06-22 19:46:19.186821 D | exec: Running command: ceph status --connect-timeout=15 --cluster=openshift-storage --conf=/var/lib/rook/openshift-storage/openshift-storage.config --name=client.admin --keyring=/var/lib/rook/openshift-storage/client.admin.keyring --format json --out-file /tmp/678101795

@Seb can you take a look? This seems related to the recent rgw api changes. I could not repro locally in minikube, but I don't see why it would be related to stretch clusters where this is being hit.

Comment 12 Sébastien Han 2021-06-23 08:06:43 UTC
Martin, can I access a live cluster for further debugging? Thanks

Comment 13 Martin Bukatovic 2021-06-23 11:09:20 UTC
(In reply to Sébastien Han from comment #12)
> Martin, can I access a live cluster for further debugging? Thanks

Cluster access provided via direct message.

Comment 14 Sébastien Han 2021-06-23 15:24:01 UTC
Some thoughts after debugging:

On the rgw log:

2021-06-21 11:13:49.389 7fafa2639700  1 ====== starting new request req=0x7fafebee9660 =====
2021-06-21 11:13:49.390 7fafa2639700  1 ====== req done req=0x7fafebee9660 op status=0 http_status=403 latency=0.00100009s ======
2021-06-21 11:13:49.390 7fafa2639700  1 beast: 0x7fafebee9660: 10.129.2.15 - - [21/Jun/2021:11:13:49.389 +0000] "PUT /admin/user?display-name=rook-ceph-internal-s3-user-checker-f6587ff0-762e-4158-bf52-8e5a355923db&format=json&uid=rook-ceph-internal-s3-user-checker-f6587ff0-762e-4158-bf52-8e5a355923db HTTP/1.1" 403 204 - "Go-http-client/1.1" - latency=0.00100009s

This does not tell us much, and on the API front we have the response:

InvalidAccessKeyId tx000000000000000002886-0060d1b2c9-b017-ocs-storagecluster-cephobjectstore b017-ocs-storagecluster-cephobjectstore-ocs-storagecluster-cephobjectstore

Which could indicate an invalid key.

I tried to inject a different image with some debugging code and now I'm not able to repro anymore.

Comment 15 Martin Bukatovic 2021-06-23 21:32:14 UTC
When I tried to reproduce this problem on a fresh openshift cluster, I noticed that I did not hit the bug.

Retrying by reinstalling the storagecluster on the same openshift cluster revealed that I'm able to reproduce this problem with both arbiter and non-arbiter setups, as long as it's not the first storagecluster created on a given openshift cluster.

My notes confirm this conclusion. I reported this problem on Monday this week, after seeing it multiple times last week. I used a fresh cluster where:

- I installed a non-arbiter cluster and did not see the problem
- I reinstalled the storage cluster in arbiter mode and saw the issue

Comment 16 Travis Nielsen 2021-07-02 19:43:58 UTC
@Seb Where does the access key come from? From Ceph, right? Since the object store pools would all be wiped out between installs, I don't see how it's possible to retain the same key from a previous install. The access key is not persisted in a secret, right? If that were the case, maybe an ownerreference is missing on the secret and it's not being cleaned up between reinstalls, but I don't see that this is the case.

Comment 17 Sébastien Han 2021-07-06 08:49:49 UTC
The access key comes from the Ceph cluster directly; it's only stored in a Secret for external mode, which is not the case here.
Martin, have you been able to repro recently with the image I provided?

Thanks

Comment 20 Sébastien Han 2021-07-06 15:44:56 UTC
As discussed, Blaine PTAL.

Comment 21 Blaine Gardner 2021-07-06 18:07:55 UTC
@mbukatov I am trying to narrow down an exact definition of the case where this appears and want to ask some follow-up questions. 

From the info in this BZ, it seems that it only occurs when an OCS cluster is re-installed and never on the first install. From Seb I understand that restarting the operator resolves the issue. (This suggests to me that the operator may be caching some old info in memory.)

First question: Is my understanding above correct?

Second question: When uninstalling the OCS cluster (before reinstalling it) are the CephCluster and CephObjectStore resources both deleted, or is only the CephCluster deleted, leaving the CephObjectStore in place?

Third question: Related to the above, what are the specific steps/commands used for uninstall and reinstall?

-----

As a note, we are in the process of backporting this to 4.8 to add extra debug info for this and similar cases. This will help us debug this issue more deeply via logs but likely does not fix the issue. https://github.com/openshift/rook/pull/273

Request one: This will likely not be in an OCS build for another day or two, but once it is available could you provide logs from a cluster showing a repro with the newly enhanced logging?

Request two: From the second question above, if the CephObjectStore isn't deleted between uninstalling and reinstalling, does the issue reproduce if the CephObjectStore *is* deleted? Would you be able to investigate that?

-----

Thanks! I'm going to see if I can reproduce this on a local test cluster in the meantime.

Comment 22 Blaine Gardner 2021-07-06 20:56:30 UTC
@mbukatov could you also verify whether the CephObjectStore resource gets removed by Kubernetes after uninstalling and before reinstalling?

Comment 23 Blaine Gardner 2021-07-07 21:30:14 UTC
I've been able to get a consistent reproduction of this issue using the Rook and Ceph images from the 422 release noted in the original bug.

quay.io/rhceph-dev/rook-ceph:4.8-149.db8366d.release_4.8 and quay.io/rhceph-dev/rhceph:4-55 for reference.

The steps to reproduce are:
1. Install a CephCluster and a CephObjectStore
2. Wait for both to come up
3. Delete the CephCluster, and wait for it to be fully removed
4. Delete the CephObjectStore and wait for it to be fully removed
5. Install the same CephCluster and CephObjectStore again
6. The CephObjectStore will never reach connected state

This does NOT reproduce if the CephObjectStore is deleted and the user waits for it to be removed before removing the CephCluster. This could be a workaround, as is restarting the operator.
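
For reference, a rough shell sketch of the repro steps above (the manifest names cluster.yaml and objectstore.yaml are placeholders for whatever CephCluster/CephObjectStore definitions you use, and the openshift-storage namespace is assumed as elsewhere in this BZ):

```
# 1-2: create the CephCluster and CephObjectStore and wait for both to come up
oc create -n openshift-storage -f cluster.yaml -f objectstore.yaml
oc get -n openshift-storage cephcluster,cephobjectstore   # repeat until Ready/Connected

# 3-4: delete the CephCluster first and wait for it to be gone, then the object store
oc delete -n openshift-storage cephcluster --all --wait=true
oc delete -n openshift-storage cephobjectstore --all --wait=true

# 5-6: re-create the same resources; the CephObjectStore never reaches Connected
oc create -n openshift-storage -f cluster.yaml -f objectstore.yaml
```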

I believe this issue is a result of the health checker goroutine caching access to the radosgw admin ops API.

I believe this issue may be fixed by two PRs that I implemented for 4.9 for this BZ: https://bugzilla.redhat.com/show_bug.cgi?id=1927782
One PR enforces that the CephCluster cannot be deleted until the CephObjectStore is deleted. The second (in addition to fixing the BZ linked) adjusts how some of the health checking code for the CephObjectStore is implemented by making it get new access credentials each time it checks for health.

I will discuss with Seb tomorrow whether we should push to have the two PRs backported to 4.8 (may be considered a breaking change given how the CephCluster deletion is blocked if a CephObjectStore or other Rook-Ceph custom resources exist) or whether we should do new work to fix the issue which may take longer.

Comment 24 Mudit Agarwal 2021-07-08 04:12:13 UTC
Blaine, 
Thanks for root causing this.
Have a couple of questions:
1. Is this really a regression from 4.7?
2. Do we have a workaround which can help us in deferring this to 4.8.z

Moving back to ASSIGNED.

Comment 25 Sébastien Han 2021-07-08 08:24:54 UTC
Answering for Blaine

(In reply to Mudit Agarwal from comment #24)
> Blaine, 
> Thanks for root causing this.
> Have a couple of questions:
> 1. Is this really a regression from 4.7?

It was introduced somewhere in 4.8. So yes.

> 2. Do we have a workaround that can help us in deferring this to 4.8.z

Restarting the operator is the only workaround we have.

> 
> Moving back to ASSIGNED.

Comment 26 Mudit Agarwal 2021-07-08 12:11:25 UTC
In an offline discussion we agreed to move this out of 4.8 as we have a workaround.
Adding it to the known issues.

Blaine, please fill in the doc text.

Also, as you mentioned, this does NOT reproduce if the CephObjectStore is deleted and the user waits for it to be removed before removing the CephCluster.
Is it worth adding this to our uninstall documentation so that we can keep users from ending up in this situation?
Let me know if that makes sense and I will open a doc BZ for it.

Comment 29 Blaine Gardner 2021-07-08 16:40:14 UTC
(In reply to Mudit Agarwal from comment #26)
> In an offline discussion we agree to move this out of 4.8 as we have a
> workaround.
> Adding it to the known issues.
> 
> Blaine, please fill the doc text.
> 
> Also, as you mentioned, this does NOT reproduce if the CephObjectStore is
> deleted and the user waits for it to be removed before removing the
> CephCluster.
> Is it worth adding this to our uninstall documentation so that we can keep
> users from ending up in this situation?
> Let me know if that makes sense and I will open a doc BZ for it.

Thanks, Mudit! 

I updated the doc text as thoroughly as I could.

I believe it is worth adding this to the uninstall documentation for 4.8, yes. In general, I would recommend that users always delete the CephCluster resource last, after any other Rook-related resources, but I think the CephObjectStore is the only one that could have this bug.

In 4.9, Rook will enforce this recommendation, so whatever documentation we add for 4.8 could later be replaced by an explanation that, in graceful mode, the CephCluster will not be deleted until all other Rook-related resources are deleted as well (if this isn't already documented).

Comment 30 Sébastien Han 2021-07-12 08:08:38 UTC
Branch https://github.com/red-hat-data-services/rook/tree/release-4.9 has the fix from a resync.
Moving to MODIFIED.

Comment 31 Mudit Agarwal 2021-07-12 08:23:55 UTC
Thanks Blaine, I have created a doc bug #1981258 for updating the uninstall procedure.

Comment 40 Mudit Agarwal 2021-11-03 04:24:50 UTC
Doc text needs to be changed; earlier it was a known issue, and now it is a Bug Fix.

Comment 41 Blaine Gardner 2021-11-03 17:39:40 UTC
Added a short doc text about the bug. I could include information about the root cause but opted not to since that could be extraneous information for end users. Let me know if you'd like me to add that. It would be something like:

Uninstalling a CephCluster and CephObjectStore could result in "old" connection details stored in the Rook operator not being cleaned up. When the CephCluster and CephObjectStore were re-installed, the "old" connection details would prevent the CephObjectStore health checker routine from validating the store and setting its "Ready" state.

Comment 45 Blaine Gardner 2021-12-01 16:35:10 UTC
@akrai , deleting the cluster forcefully by removing finalizers may actually be a better way to attempt to reproduce the issue than normal "graceful" uninstallation. I think a forced uninstall would be more likely to repro this.

I would suggest removing all resource finalizers *before* deleting the Rook resources. Do not remove the operator or restart the operator pod before reinstalling the CephCluster and other resources (this is a workaround and would invalidate the test).
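
A sketch of that finalizer removal, using the same `oc patch` form as the workaround quoted later in this BZ (extend the kind list to whichever Rook CRs exist in the cluster):

```
for kind in cephobjectstore cephcluster; do
  for res in $(oc get -n openshift-storage "$kind" -o name); do
    oc patch -n openshift-storage "$res" --type=merge -p '{"metadata":{"finalizers":null}}'
  done
done
```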

Comment 46 Martin Bukatovic 2021-12-02 13:39:45 UTC
Verifying on vSphere LSO cluster with:

OCP 4.9.0-0.nightly-2021-12-01-080120
LSO 4.9.0-202111151318
ODF 4.9.0-249.ci

Performing workarounds for:

- BZ 2005040: (as suggested by Blaine) `oc patch -n openshift-storage CephObjectStore/ocs-storagecluster-cephobjectstore --type=merge -p '{"metadata": {"finalizers":null}}'`
- BZ 2028449: using the 4.8 procedure instead of the 4.9 one

This lets me reinstall the StorageSystem as required for the reproducer.

And when I perform the reproducer, I still see the ClusterObjectStoreState alert: Cluster Object Store is in unhealthy state for more than 15s. Please check Ceph cluster health.

If removing the finalizer prevents the fix for this bug from being applied, we need to wait for BZ 2005040 to actually be fixed, instead of using one of the suggested workarounds.

>>> ASSIGNED

Comment 48 Blaine Gardner 2021-12-02 18:19:55 UTC
Even when deleting the finalizer, I don't think we want this issue to happen. I will look into the issue again. I may have missed something for the removed-finalizer case.

In the meantime, there is an easy workaround to restart the `rook-ceph-operator` pod. I think this should be sufficient for 4.9.0.

I would suggest we target this fix for the next 4.9.z release.
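
For completeness, a sketch of that workaround (assuming the default ODF deployment name and the usual `app=rook-ceph-operator` pod label):

```
# either restart the deployment...
oc -n openshift-storage rollout restart deployment/rook-ceph-operator
# ...or just delete the pod and let the deployment recreate it
oc -n openshift-storage delete pod -l app=rook-ceph-operator
```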

Comment 49 Michael Adam 2021-12-02 18:33:26 UTC
This is not a regression in 4.9.

It was a regression in 4.8.0 from 4.7, but was moved out because we have a workaround (restart rook-ceph operator).

==> Removing regression keyword.

We can not and should not block the 4.9.0 release for it, so I suggest moving it to 4.9.z.

Removing the acks, making it a proposed bug. (Not removing the blocker? flag just yet.)

==> Elad?

Comment 50 Elad 2021-12-02 20:13:50 UTC
Hi Michael,

> This is not a regression in 4.9.
> 
> It was a regression in 4.8.0 from 4.7, but was moved out because we have a
> workaround (restart rook-ceph operator).
> 
> ==> Removing regression keyword.

Although this is not a regression between 4.8 and 4.9, it was a regression between 4.7 and 4.8, as confirmed by both QE and dev (comment #0 and comment #25). Ideally, the bug should not have been taken out of 4.8 to 4.9.
(IIRC, the bug was eventually taken out because the stretched cluster with arbiter solution was demoted to tech preview prior to 4.8 GA.)

On top of that, we were blocked from testing the fix in 4.9 due to test blockers of both arbiter deployment (BZ 2021068) and uninstall (BZ 2005040).

 
> We can not and should not block the 4.9.0 release for it, so I suggest
> moving it to 4.9.z.

I tend to agree that this should not block 4.9.0 release, given the tech preview supportability of the stretch cluster in 4.9 and the complexity of the scenario. 
But we need to ensure that there is a way out of the situation before we move the bug out to 4.9.z.

Comment 51 Elad 2021-12-02 20:16:26 UTC
Blaine, could you please provide evidence that the workaround of restarting the rook-ceph operator solves the situation?

Comment 52 Blaine Gardner 2021-12-02 21:51:55 UTC
Elad,

The underlying bug is caused by the rook operator keeping connection details for an uninstalled CephObjectStore in memory, which conflicts with a new object store created with the same name. If the operator is restarted, the old details are no longer in memory, and the issue won't occur.

For a little bit more info about this bug...

Currently in 4.9, the only way of getting the bug to reproduce is to remove the finalizer from the CephObjectStore before deleting it, which is the workaround for BZ 2005040. This also suggests to me that the bug will be reproduced by the "optimized" deletion path (see https://bugzilla.redhat.com/show_bug.cgi?id=2005040#c65) we implemented in Rook, which uses the same deletion code path. It's possible that this "optimized" path will become the solution for overcoming BZ 2005040, so I will want to take another look at this PR.

I believe that revisiting this abandoned backport in Rook upstream (https://github.com/rook/rook/pull/9213) might solve this issue; however, it will take some time to backport upstream and down (2-3 days maybe), and I don't want to presume to delay the 4.9 release when there is such an easy workaround available.

Please advise if this should still be considered a blocker in need of immediate fix.

Comment 53 Blaine Gardner 2021-12-02 21:54:21 UTC
Small clarification to the restart workaround above:

Restarting the rook operator will prevent the issue from occurring if it is restarted after a storage cluster is uninstalled and before the new storage cluster is re-installed. 

Restarting will also fix the issue if it is actively occurring due to a storage cluster being uninstalled and then reinstalled without restarting the operator in between.

Comment 54 Mudit Agarwal 2021-12-06 12:31:28 UTC
In an offline discussion with Elad, we agreed to fix this issue in 4.9.1 and re-target the doc bz #1981258 to 4.9.0

Blaine, please keep working on the backport PR and merge it into 4.9.1 once the branches are open for check-in (I will let you know).
Also, let's add it again as a known issue, so please fill in the doc text accordingly.

Comment 58 Blaine Gardner 2021-12-13 17:37:40 UTC
@

Comment 59 Blaine Gardner 2021-12-13 17:40:35 UTC
@muagarwa the doc text is filled in thanks to Kusuma. I made a very small adjustment to remove a leading dot in the text. Is there anything more I can help with for this BZ other than continuing work to ensure a fix for 4.10 and 4.9.1?

Comment 61 Blaine Gardner 2021-12-13 23:11:50 UTC
Fix that will soon be backported to 4.10 is now present in upstream Rook https://github.com/rook/rook/pull/9417

Comment 62 Mudit Agarwal 2021-12-14 12:55:58 UTC
Thanks Blaine.

You can merge the fix in 4.10, but we have to wait for it to get into 4.9 as there are some other priority items lined up for 4.9.1.
I am hopeful we can take it in 4.9.2; I will let you know when to backport to 4.9.

Comment 63 Blaine Gardner 2021-12-14 19:02:05 UTC
Backported to Rook v1.8 https://github.com/rook/rook/pull/9420

Comment 64 Blaine Gardner 2021-12-17 18:27:12 UTC
This is backported to 4.10 with https://github.com/red-hat-storage/rook/pull/318

For QE:

With recent changes to 4.10, reproducing this bug is slightly different. It was only appearing when the CephObjectStore was removed **forcefully** by removing the finalizer. When I was reproducing this bug to work on this fix, this is the workflow I used:

1. Install a storage cluster that contains a CephObjectStore
2. Wait for the CephObjectStore to be healthy
3. Remove the finalizer from the CephObjectStore
4. Delete the storage cluster, but do not delete the rook-ceph-operator
5. Re-install the storage cluster
6. The CephCluster should start up fine
7. Under failure, CephObjectStore will report `status.bucketStatus.health: Failure` with `status.bucketStatus.details` containing the string "InvalidAccessKeyId" (this is the indicator of the bug)

To verify the fix, follow steps 1-6 above. In step 7, the CephObjectStore **should** report `status.bucketStatus.health: Connected` and `status.Phase: Connected`.
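
A quick way to check the step 7 fields from the CLI (a sketch; the field paths are the ones described above, and the object store name assumes a default ODF install):

```
oc -n openshift-storage get cephobjectstore ocs-storagecluster-cephobjectstore \
  -o jsonpath='{.status.phase}{"  "}{.status.bucketStatus.health}{"\n"}'
# with the fix:  Connected  Connected
# with the bug:  Failure    Failure   (and status.bucketStatus.details mentions InvalidAccessKeyId)
```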

I hope this is helpful!
-Blaine

Comment 66 Martin Bukatovic 2022-01-26 14:56:46 UTC
While BZ 2005040 was expected to be fixed, it is still unresolved, which blocks verification of this bug.

Comment 67 Blaine Gardner 2022-01-27 16:48:16 UTC
@mbukatov, from a technical standpoint, I don't believe the BZ you linked must be resolved before testing this bug. My opinion is that it may be best to test this bug as independently of ocs-operator as possible. However, it is also okay to wait for https://bugzilla.redhat.com/show_bug.cgi?id=2005040 if that is better for the QA process.

I outlined a testing procedure to verify this bug does not persist in this comment: https://bugzilla.redhat.com/show_bug.cgi?id=1974344#c64 
In order to test this bug while avoiding https://bugzilla.redhat.com/show_bug.cgi?id=2005040, I would suggest a new step between 3 and 4 from before. New steps would be as follows:

1. Install a storage cluster that contains a CephObjectStore
2. Wait for the CephObjectStore to be healthy
3. Remove the finalizer from the CephObjectStore
3.a. Delete the CephCluster but do not wait for it to be deleted (this should allow the CephObjectStore to become deleted)
4. Delete the storage cluster, but **do not delete the rook-ceph-operator** (this should remove all other resources that may be created)
5. Re-install the storage cluster
6. The CephCluster should start up fine
7. Under failure, CephObjectStore will report `status.bucketStatus.health: Failure` with `status.bucketStatus.details` containing the string "InvalidAccessKeyId" (this is the indicator of the bug)

To verify the fix, follow steps 1-6 above. In step 7, the CephObjectStore **should** report `status.bucketStatus.health: Connected` and `status.Phase: Connected`.

Comment 72 Martin Bukatovic 2022-02-17 19:34:30 UTC
Based on suggestion from Blaine, retesting on vSphere with:

OCP 4.10.0-0.nightly-2022-02-15-041303
LSO 4.10.0-202202160023
ODF 4.10.0-158

And I see no problem with the cephobjectstore:

```
$ oc describe cephobjectstore/ocs-storagecluster-cephobjectstore -n openshift-storage | tail
    Last Changed:  2022-02-17T19:23:11Z
    Last Checked:  2022-02-17T19:30:12Z
  Info:
    Endpoint:         http://rook-ceph-rgw-ocs-storagecluster-cephobjectstore.openshift-storage.svc:80
    Secure Endpoint:  https://rook-ceph-rgw-ocs-storagecluster-cephobjectstore.openshift-storage.svc:443
  Phase:              Connected
Events:
  Type    Reason              Age                   From                         Message
  ----    ------              ----                  ----                         -------
  Normal  ReconcileSucceeded  8m12s (x16 over 11m)  rook-ceph-object-controller  successfully configured CephObjectStore "openshift-storage/ocs-storagecluster-cephobjectstore"
```

Nor do I see the alert.

>>> VERIFIED

Comment 73 Mudit Agarwal 2022-03-04 12:15:58 UTC
Please add doc text.

Comment 79 errata-xmlrpc 2022-04-13 18:49:40 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Important: Red Hat OpenShift Data Foundation 4.10.0 enhancement, security & bug fix update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:1372

