Description of problem
======================

Shortly after installation of a StorageCluster in stretch arbiter mode using LSO on the vSphere platform, I see the critical alert ClusterObjectStoreState firing.

Version-Release number of selected component
============================================

OCP 4.8.0-0.nightly-2021-06-19-005119
LSO 4.8.0-202106152230
OCS 4.8.0-422.ci

How reproducible
================

100%

Steps to Reproduce
==================

1. Install OCP on vSphere, with 3 master and 6 worker nodes and 2 local storage devices per worker node (for LSO).
2. Install the LSO and OCS operators.
3. Label nodes via `oc label node $NODE topology.kubernetes.io/zone=$ZONE` so that you have 3 worker nodes and 1 master in each data zone, and just a single master in the arbiter zone:

```
$ oc get nodes -L topology.kubernetes.io/zone
NAME              STATUS   ROLES    AGE    VERSION                ZONE
compute-0         Ready    worker   115m   v1.21.0-rc.0+120883f   data-1
compute-1         Ready    worker   115m   v1.21.0-rc.0+120883f   data-1
compute-2         Ready    worker   115m   v1.21.0-rc.0+120883f   data-1
compute-3         Ready    worker   115m   v1.21.0-rc.0+120883f   data-2
compute-4         Ready    worker   115m   v1.21.0-rc.0+120883f   data-2
compute-5         Ready    worker   115m   v1.21.0-rc.0+120883f   data-2
control-plane-0   Ready    master   121m   v1.21.0-rc.0+120883f   data-1
control-plane-1   Ready    master   121m   v1.21.0-rc.0+120883f   data-2
control-plane-2   Ready    master   122m   v1.21.0-rc.0+120883f   arbiter-0
```

4. Use the "Create Storage Cluster" wizard in the OCP Console to start setup of a Storage Cluster in "Internal - Attached devices" mode.
5. In the OCP Console, navigate to Monitoring -> Alerting, and then also to the OCS Dashboard.

Actual results
==============

I see the critical alert ClusterObjectStoreState: "Cluster Object Store is in unhealthy state for more than 15s. Please check Ceph cluster health."

Expected results
================

There is no problem with object storage and ClusterObjectStoreState is not firing.

Additional info
===============

When I reinstalled the StorageCluster on the same OpenShift cluster without arbiter mode, the problem did not occur.

I don't recall this happening with arbiter clusters on OCS 4.7 => flagging this as a regression.

RADOS Gateway pods seem to be running:

```
$ oc get pods -n openshift-storage | grep rgw
rook-ceph-rgw-ocs-storagecluster-cephobjectstore-a-7d797d4d8vpn   2/2   Running   0   64m
rook-ceph-rgw-ocs-storagecluster-cephobjectstore-b-57b78b899hj7   2/2   Running   0   64m
```

Ceph doesn't report any error:

```
$ oc rsh -n openshift-storage rook-ceph-tools-69867749f-xflj9 bash
[root@compute-5 /]# ceph status
  cluster:
    id:     0598f7ad-7271-49dd-8a10-b4d7cb73194e
    health: HEALTH_OK

  services:
    mon: 5 daemons, quorum a,b,c,d,e (age 67m)
    mgr: b(active, since 67m), standbys: a
    mds: ocs-storagecluster-cephfilesystem:1 {0=ocs-storagecluster-cephfilesystem-a=up:active} 1 up:standby-replay
    osd: 12 osds: 12 up (since 67m), 12 in (since 67m)
    rgw: 2 daemons active (ocs.storagecluster.cephobjectstore.a, ocs.storagecluster.cephobjectstore.b)

  data:
    pools:   10 pools, 464 pgs
    objects: 316 objects, 136 MiB
    usage:   12 GiB used, 180 GiB / 192 GiB avail
    pgs:     464 active+clean

  io:
    client:   852 B/s rd, 2.3 KiB/s wr, 1 op/s rd, 0 op/s wr
```

See also the must-gather tarball referenced in a comment below.
The definition of the ClusterObjectStoreState alert is:

    ocs_rgw_health_status{job="ocs-metrics-exporter"} > 1

See screenshot #2 with the result of the ocs_rgw_health_status{job="ocs-metrics-exporter"} Prometheus query.

Moving to the alerting component.
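For reference, the raw value behind this alert can also be checked without the Console by querying Prometheus directly. A minimal sketch; the `prometheus-k8s-0` pod and `prometheus` container names, and the availability of curl inside that container, are assumptions based on a default openshift-monitoring stack:

```
# Query the metric behind the ClusterObjectStoreState alert; a value above 1
# means the alert condition is met.
oc -n openshift-monitoring exec prometheus-k8s-0 -c prometheus -- \
  curl -sG http://localhost:9090/api/v1/query \
  --data-urlencode 'query=ocs_rgw_health_status{job="ocs-metrics-exporter"}'
```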
Created attachment 1792649 [details] screenshot #2: prometheus query ocs_rgw_health_status{job="ocs-metrics-exporter"}
Anmol noticed that there is an actual problem with the Ceph object store:

```
$ oc describe cephobjectstore/ocs-storagecluster-cephobjectstore -n openshift-storage | tail
Status:
  Bucket Status:
    Details:       failed to create ceph object user "rook-ceph-internal-s3-user-checker-f6587ff0-762e-4158-bf52-8e5a355923db": InvalidAccessKeyId tx000000000000000002886-0060d1b2c9-b017-ocs-storagecluster-cephobjectstore b017-ocs-storagecluster-cephobjectstore-ocs-storagecluster-cephobjectstore
    Health:        Failure
    Last Changed:  2021-06-22T09:52:09Z
    Last Checked:  2021-06-22T09:52:09Z
  Info:
    Endpoint:  http://rook-ceph-rgw-ocs-storagecluster-cephobjectstore.openshift-storage.svc:80
  Phase:       Failure
Events:        <none>
```

Based on discussion with Anmol, moving to rook.
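For anyone triaging a similar report, the same health information can be pulled with a jsonpath query instead of `oc describe`; a small sketch, assuming the status fields shown above:

```
# Print the health checker's verdict and the failure details, if any.
oc -n openshift-storage get cephobjectstore ocs-storagecluster-cephobjectstore \
  -o jsonpath='{.status.bucketStatus.health}{"\n"}{.status.bucketStatus.details}{"\n"}'
```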
@Martin Could I get access to the cluster where you are seeing this issue with the object store user? I don't see the root cause in the must-gather, and may need to enable debug logging in the rook operator.
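For reference, a hedged sketch of one way to raise the operator's log level, assuming the `rook-ceph-operator-config` ConfigMap and the `ROOK_LOG_LEVEL` key are present in this build (the exact mechanism may differ, and depending on the version the operator may need a restart to pick the change up):

```
# Raise the Rook operator log level to DEBUG via its operator config ConfigMap.
oc -n openshift-storage patch configmap rook-ceph-operator-config \
  --type merge -p '{"data":{"ROOK_LOG_LEVEL":"DEBUG"}}'
```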
(In reply to Travis Nielsen from comment #9)
> @Martin Could I get access to the cluster where you are seeing this issue
> with the object store user? I don't see the root cause in the must-gather,
> and may need to enable debug logging in the rook operator.

Cluster access provided via direct message.
The object store user for the bucket health checker is failing to be created. Here is the debug logging around the health check:

2021-06-22 19:46:19.167874 D | ceph-object-controller: creating s3 user object "rook-ceph-internal-s3-user-checker-f6587ff0-762e-4158-bf52-8e5a355923db" for object store "ocs-storagecluster-cephobjectstore"
2021-06-22 19:46:19.186375 D | ceph-object-controller: object store "openshift-storage/ocs-storagecluster-cephobjectstore" status updated to Failure
2021-06-22 19:46:19.186396 D | ceph-object-controller: failed to check rgw health for object store "ocs-storagecluster-cephobjectstore". failed to create ceph object user "rook-ceph-internal-s3-user-checker-f6587ff0-762e-4158-bf52-8e5a355923db": InvalidAccessKeyId tx000000000000000003a2c-0060d23e0b-b017-ocs-storagecluster-cephobjectstore b017-ocs-storagecluster-cephobjectstore-ocs-storagecluster-cephobjectstore
2021-06-22 19:46:19.186821 D | exec: Running command: ceph status --connect-timeout=15 --cluster=openshift-storage --conf=/var/lib/rook/openshift-storage/openshift-storage.config --name=client.admin --keyring=/var/lib/rook/openshift-storage/client.admin.keyring --format json --out-file /tmp/678101795

@Seb can you take a look? This seems related to the recent rgw API changes. I could not repro locally in minikube, and I don't see why it would be specific to stretch clusters, where this is being hit.
Martin, can I access a live cluster for further debugging? Thanks
(In reply to Sébastien Han from comment #12)
> Martin, can I access a live cluster for further debugging? Thanks

Cluster access provided via direct message.
Some thoughts after debugging. In the rgw log:

2021-06-21 11:13:49.389 7fafa2639700  1 ====== starting new request req=0x7fafebee9660 =====
2021-06-21 11:13:49.390 7fafa2639700  1 ====== req done req=0x7fafebee9660 op status=0 http_status=403 latency=0.00100009s ======
2021-06-21 11:13:49.390 7fafa2639700  1 beast: 0x7fafebee9660: 10.129.2.15 - - [21/Jun/2021:11:13:49.389 +0000] "PUT /admin/user?display-name=rook-ceph-internal-s3-user-checker-f6587ff0-762e-4158-bf52-8e5a355923db&format=json&uid=rook-ceph-internal-s3-user-checker-f6587ff0-762e-4158-bf52-8e5a355923db HTTP/1.1" 403 204 - "Go-http-client/1.1" - latency=0.00100009s

This does not tell us much. On the API front we get the response:

InvalidAccessKeyId tx000000000000000002886-0060d1b2c9-b017-ocs-storagecluster-cephobjectstore b017-ocs-storagecluster-cephobjectstore-ocs-storagecluster-cephobjectstore

which could indicate an invalid key. I tried to inject a different image with some debugging code, and now I'm not able to repro anymore.
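One way to cross-check the Ceph side is to inspect the RGW admin user and its keys from the toolbox pod, and compare them with what the operator is sending. A minimal sketch; the user name `rgw-admin-ops-user` is an assumption and may differ per release:

```
# From inside the rook-ceph-tools pod: list RGW users and show the admin-ops
# user's keys, so they can be compared with the credentials in use by the operator.
radosgw-admin user list
radosgw-admin user info --uid=rgw-admin-ops-user
```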
When I tried to reproduce this problem on a fresh OpenShift cluster, I noticed that I did not hit the bug. Retrying by reinstalling the StorageCluster on the same OpenShift cluster revealed that I'm able to reproduce this problem with both arbiter and non-arbiter setups, as long as it's not the first StorageCluster created on the given OpenShift cluster.

My notes confirm this conclusion. I reported this problem on Monday this week, after seeing it multiple times last week. I used a fresh cluster where I:

- installed a non-arbiter cluster, not seeing the problem
- reinstalled the storage cluster in arbiter mode, seeing the issue
@Seb Where does the access key come from? From Ceph, right? Since the object store pools would all be wiped out between installs, I don't see how it's possible to retain the same key from a previous install. The access key is not persisted in a secret, right? If that were the case, maybe an owner reference is missing on the secret and it's not being cleaned up between reinstalls, but I don't see that this is the case.
The access key comes from the Ceph cluster directly, it's only in a Secret for external mode. Right, this is not the case here. Martin, have you been able to repro recently with the image I provided? Thanks
As discussed, Blaine PTAL.
@mbukatov I am trying to narrow down an exact definition of the case where this appears and want to ask some follow-up questions. From the info in this BZ, it seems that it only occurs when an OCS cluster is re-installed and never on the first install. From Seb I understand that restarting the operator resolves the issue. (This suggests to me that the operator may be caching some old info in memory.)

First question: Is my understanding above correct?

Second question: When uninstalling the OCS cluster (before reinstalling it), are the CephCluster and CephObjectStore resources both deleted, or is only the CephCluster deleted, leaving the CephObjectStore in place?

Third question: Related to the above, what are the specific steps/commands used for uninstall and reinstall?

-----

As a note, we are in the process of backporting a change to 4.8 to add extra debug info for this and similar cases. This will help us debug this issue more deeply via logs but likely does not fix the issue.
https://github.com/openshift/rook/pull/273

Request one: This will likely not be in an OCS build for another day or two, but once it is available, could you provide logs from a cluster showing a repro with the newly enhanced logging?

Request two: From the second question above, if the CephObjectStore isn't deleted between uninstalling and reinstalling, does the issue reproduce if the CephObjectStore *is* deleted? Would you be able to investigate that?

-----

Thanks! I'm going to see if I can reproduce this on a local test cluster in the meantime.
@mbukatov could you also verify whether the CephObjectStore resource gets removed by Kubernetes after uninstalling and before reinstalling?
I've been able to get a consistent reproduction of this issue using the Rook and Ceph images from the 422 release noted in the original bug: quay.io/rhceph-dev/rook-ceph:4.8-149.db8366d.release_4.8 and quay.io/rhceph-dev/rhceph:4-55, for reference.

The steps to reproduce are:
1. Install a CephCluster and a CephObjectStore
2. Wait for both to come up
3. Delete the CephCluster, and wait for it to be fully removed
4. Delete the CephObjectStore and wait for it to be fully removed
5. Install the same CephCluster and CephObjectStore again
6. The CephObjectStore will never reach the Connected state

(A shell sketch of these steps follows below.)

This does NOT reproduce if the CephObjectStore is deleted and the user waits for it to be removed before removing the CephCluster. This could be a workaround, as is restarting the operator.

I believe this issue is a result of the health checker goroutine caching access to the radosgw admin ops API. I believe this issue may be fixed by two PRs that I implemented for 4.9 for this BZ: https://bugzilla.redhat.com/show_bug.cgi?id=1927782

One PR enforces that the CephCluster cannot be deleted until the CephObjectStore is deleted. The second (in addition to fixing the BZ linked) adjusts how some of the health checking code for the CephObjectStore is implemented by making it get new access credentials each time it checks health.

I will discuss with Seb tomorrow whether we should push to have the two PRs backported to 4.8 (this may be considered a breaking change, given how CephCluster deletion is blocked if a CephObjectStore or other Rook-Ceph custom resources exist) or whether we should do new work to fix the issue, which may take longer.
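A shell sketch of the reproducer above, assuming everything lives in the `openshift-storage` namespace and uses the default OCS resource names (illustrative only):

```
# 1-2. Install the CephCluster and CephObjectStore and wait for them to come up (not shown).

# 3. Delete the CephCluster first and wait until it is fully removed.
oc -n openshift-storage delete cephcluster ocs-storagecluster-cephcluster --wait=true

# 4. Then delete the CephObjectStore and wait for it to be fully removed.
oc -n openshift-storage delete cephobjectstore ocs-storagecluster-cephobjectstore --wait=true

# 5. Re-create the same CephCluster and CephObjectStore (not shown).

# 6. Observe that the CephObjectStore never reaches the Connected state.
oc -n openshift-storage get cephobjectstore ocs-storagecluster-cephobjectstore \
  -o jsonpath='{.status.phase}{"\n"}'
```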
Blaine,
Thanks for root-causing this. I have a couple of questions:
1. Is this really a regression from 4.7?
2. Do we have a workaround that can help us defer this to 4.8.z?

Moving back to ASSIGNED.
Answering for Blaine

(In reply to Mudit Agarwal from comment #24)
> Blaine,
> Thanks for root-causing this. I have a couple of questions:
> 1. Is this really a regression from 4.7?

It was introduced somewhere in 4.8. So yes.

> 2. Do we have a workaround that can help us defer this to 4.8.z?

Restarting the operator is the only workaround we have.

> Moving back to ASSIGNED.
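For completeness, a sketch of that workaround, assuming the default `app=rook-ceph-operator` pod label:

```
# Restart the Rook operator so it drops any stale in-memory object store connection details.
oc -n openshift-storage delete pod -l app=rook-ceph-operator

# Wait for the replacement operator pod to become ready.
oc -n openshift-storage wait pod -l app=rook-ceph-operator --for=condition=Ready --timeout=5m
```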
In an offline discussion we agreed to move this out of 4.8, as we have a workaround. Adding it to the known issues.

Blaine, please fill in the doc text.

Also, you mentioned that this does NOT reproduce if the CephObjectStore is deleted and the user waits for it to be removed before removing the CephCluster. Is it worth adding this to our uninstall documentation so that we can prevent users from ending up in this situation? Let me know if that makes sense and I will open a doc BZ for it.
(In reply to Mudit Agarwal from comment #26)
> In an offline discussion we agreed to move this out of 4.8, as we have a
> workaround. Adding it to the known issues.
>
> Blaine, please fill in the doc text.
>
> Also, you mentioned that this does NOT reproduce if the CephObjectStore is
> deleted and the user waits for it to be removed before removing the
> CephCluster. Is it worth adding this to our uninstall documentation so that
> we can prevent users from ending up in this situation?
> Let me know if that makes sense and I will open a doc BZ for it.

Thanks, Mudit! I updated the doc text as thoroughly as I could.

I believe it is worth adding this to the uninstall documentation for 4.8, yes. In general, I would recommend that users always delete the CephCluster resource last, after any other Rook-related resources, but I think the CephObjectStore is the only one that could have this bug.

In 4.9, Rook will enforce this recommendation, so whatever documentation we add for 4.8 could later be replaced by an explanation that, in graceful mode, the CephCluster will not be deleted until all other Rook-related resources are deleted as well (if this isn't already documented).
Branch https://github.com/red-hat-data-services/rook/tree/release-4.9 has the fix from a resync. Moving to MODIFIED.
Thanks Blaine, I have created a doc bug #1981258 for updating the uninstall procedure.
The doc text needs to be changed; earlier it was a known issue and now it is a bug fix.
Added a short doc text about the bug. I could include information about the root cause but opted not to, since that could be extraneous information for end users. Let me know if you'd like me to add that. It would be something like:

Uninstalling a CephCluster and CephObjectStore could result in "old" connection details stored in the Rook operator not being cleaned up. When the CephCluster and CephObjectStore were re-installed, the "old" connection details would prevent the CephObjectStore health checker routine from validating the store and setting its "Ready" state.
@akrai, deleting the cluster forcefully by removing finalizers may actually be a better way to attempt to reproduce the issue than a normal "graceful" uninstallation. I think a forced uninstall would be more likely to repro this. I would suggest removing all resource finalizers *before* deleting the Rook resources (see the sketch below). Do not remove the operator or restart the operator pod before reinstalling the CephCluster and other resources (this is a workaround and would invalidate the test).
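A hedged sketch of what removing the finalizers before deletion could look like; the resource names are the OCS defaults and are assumptions:

```
# Strip finalizers up front so the subsequent deletion is forced rather than graceful.
oc -n openshift-storage patch cephobjectstore ocs-storagecluster-cephobjectstore \
  --type=merge -p '{"metadata":{"finalizers":null}}'
oc -n openshift-storage patch cephcluster ocs-storagecluster-cephcluster \
  --type=merge -p '{"metadata":{"finalizers":null}}'

# Then delete the Rook resources / storage cluster; do NOT restart the
# rook-ceph-operator pod before reinstalling, since that would invalidate the test.
```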
Verifying on a vSphere LSO cluster with:

OCP 4.9.0-0.nightly-2021-12-01-080120
LSO 4.9.0-202111151318
ODF 4.9.0-249.ci

Performing workarounds for:

- BZ 2005040: (as suggested by Blaine) `oc patch -n openshift-storage CephObjectStore/ocs-storagecluster-cephobjectstore --type=merge -p '{"metadata": {"finalizers":null}}'`
- BZ 2028449: using the 4.8 procedure instead of the 4.9 one

so that I can reinstall the StorageSystem as required for the reproducer.

And when I perform the reproducer, I still see the ClusterObjectStoreState alert: "Cluster Object Store is in unhealthy state for more than 15s. Please check Ceph cluster health."

If removing the finalizer prevents the fix for this bug from being applied, we need to wait for BZ 2005040 to actually be fixed, instead of using one of the suggested workarounds.

>>> ASSIGNED
Even when the finalizer is deleted, we don't want this issue to happen. I will look into the issue again. I may have missed something for the removed-finalizer case.

In the meantime, there is an easy workaround: restart the `rook-ceph-operator` pod. I think this should be sufficient for 4.9.0. I would suggest we target this fix for the next 4.9.z.
This is not a regression in 4.9.

It was a regression in 4.8.0 from 4.7, but was moved out because we have a workaround (restart the rook-ceph operator).

==> Removing the regression keyword.

We cannot and should not block the 4.9.0 release for it, so I suggest moving it to 4.9.z. Removing the acks, making it a proposed bug. (Not removing the blocker? flag just yet.)

==> Elad?
Hi Michael,

> This is not a regression in 4.9.
>
> It was a regression in 4.8.0 from 4.7, but was moved out because we have a
> workaround (restart the rook-ceph operator).
>
> ==> Removing the regression keyword.

Although this is not a regression between 4.8 and 4.9, it was a regression between 4.7 and 4.8, as confirmed by both QE and dev (comment #0 and comment #25). Ideally, the bug was not supposed to be moved out of 4.8 into 4.9. (IIRC, the bug was eventually moved out because the stretched cluster with arbiter solution was demoted to Tech Preview prior to 4.8 GA.)

On top of that, we were blocked from testing the fix in 4.9 due to test blockers for both arbiter deployment (BZ 2021068) and uninstall (BZ 2005040).

> We cannot and should not block the 4.9.0 release for it, so I suggest
> moving it to 4.9.z.

I tend to agree that this should not block the 4.9.0 release, given the Tech Preview supportability of the stretch cluster in 4.9 and the complexity of the scenario. But we need to ensure that there is a way out of the situation before we move the bug out to 4.9.z.
Blaine, could you please provide evidence that the workaround of restarting the rook-ceph operator solves the situation?
Elad,

The underlying bug is caused by the Rook operator keeping connection details for an uninstalled CephObjectStore in memory, which conflict with a newly created object store of the same name. If the operator is restarted, the old details are no longer in memory, and the issue won't occur.

For a little bit more info about this bug... Currently in 4.9, the only way of getting the bug to reproduce is to remove the finalizer from the CephObjectStore before deleting it, which is the workaround for BZ 2005040. This also suggests to me that the bug will be reproduced by the "optimized" deletion path (see https://bugzilla.redhat.com/show_bug.cgi?id=2005040#c65) we implemented in Rook, which uses the same deletion code path. It's possible that this "optimized" path will become the solution for overcoming BZ 2005040, so I will want to re-examine that PR.

I believe that revisiting this abandoned backport in Rook upstream (https://github.com/rook/rook/pull/9213) might solve this issue; however, it will take some time to backport upstream and down (2-3 days maybe), and I don't want to presume to delay the 4.9 release when there is such an easy workaround available. Please advise if this should still be considered a blocker in need of an immediate fix.
Small clarification to the restart workaround above: Restarting the rook operator will prevent the issue from occurring if it is restarted after a storage cluster is uninstalled and before the new storage cluster is re-installed. Restarting will also fix the issue if it is actively occurring due to a storage cluster being uninstalled and then reinstalled without restarting the operator in between.
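A sketch of the "fix while actively occurring" path described above, with the same assumed names as earlier: restart the operator, then confirm the store recovers.

```
# With the ClusterObjectStoreState alert already firing, restart the operator...
oc -n openshift-storage delete pod -l app=rook-ceph-operator

# ...and, once it has reconciled again, confirm the store reports Connected.
oc -n openshift-storage get cephobjectstore ocs-storagecluster-cephobjectstore \
  -o jsonpath='{.status.phase} {.status.bucketStatus.health}{"\n"}'
```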
In an offline discussion with Elad, we agreed to fix this issue in 4.9.1 and re-target the doc BZ #1981258 to 4.9.0.

Blaine, please keep working on the backport PR and merge it in 4.9.1 once the branches are open for check-in (I will let you know).

Also, let's add it again as a known issue, so please fill in the doc text accordingly.
@muagarwa the doc text is filled in thanks to Kusuma. I made a very small adjustment to remove a leading dot in the text. Is there anything more I can help with for this BZ other than continuing work to ensure a fix for 4.10 and 4.9.1?
The fix that will soon be backported to 4.10 is now present in upstream Rook: https://github.com/rook/rook/pull/9417
Thanks Blaine. You can merge the fix in 4.10, but we have to wait for it to get into 4.9, as there are some other priority items lined up for 4.9.1. I am hopeful we can take it in 4.9.2; I will let you know when to backport to 4.9.
Backported to Rook v1.8 https://github.com/rook/rook/pull/9420
This is backported to 4.10 with https://github.com/red-hat-storage/rook/pull/318

For QE: With recent changes to 4.10, reproducing this bug is slightly different. It only appeared when the CephObjectStore was removed **forcefully** by removing the finalizer. When I was reproducing this bug to work on the fix, this is the workflow I used:

1. Install a storage cluster that contains a CephObjectStore
2. Wait for the CephObjectStore to be healthy
3. Remove the finalizer from the CephObjectStore
4. Delete the storage cluster, but do not delete the rook-ceph-operator
5. Re-install the storage cluster
6. The CephCluster should start up fine
7. Under failure, the CephObjectStore will report `status.bucketStatus.health: Failure` with `status.bucketStatus.details` containing the string "InvalidAccessKeyId" (this is the indicator of the bug)

To verify the fix, follow steps 1-6 above. In step 7, the CephObjectStore **should** report `status.bucketStatus.health: Connected` and `status.Phase: Connected`.

I hope this is helpful!
-Blaine
While BZ 2005040 was expected to be fixed, it is still unresolved, which blocks verification of this bug.
@mbukatov, from a technical standpoint, I don't believe the BZ you linked must be resolved before testing this bug. My opinion is that it may be best to test this bug as independently of ocs-operator as possible. However, it is also okay to wait for https://bugzilla.redhat.com/show_bug.cgi?id=2005040 if that is better for the QA process.

I outlined a testing procedure to verify this bug does not persist in this comment: https://bugzilla.redhat.com/show_bug.cgi?id=1974344#c64

In order to test this bug while avoiding https://bugzilla.redhat.com/show_bug.cgi?id=2005040, I would suggest a new step between 3 and 4 from before. The new steps would be as follows (a shell sketch is included after the list):

1. Install a storage cluster that contains a CephObjectStore
2. Wait for the CephObjectStore to be healthy
3. Remove the finalizer from the CephObjectStore
3.a. Delete the CephCluster but do not wait for it to be deleted (this should allow the CephObjectStore to become deleted)
4. Delete the storage cluster, but **do not delete the rook-ceph-operator** (this should remove all other resources that may be created)
5. Re-install the storage cluster
6. The CephCluster should start up fine
7. Under failure, the CephObjectStore will report `status.bucketStatus.health: Failure` with `status.bucketStatus.details` containing the string "InvalidAccessKeyId" (this is the indicator of the bug)

To verify the fix, follow steps 1-6 above. In step 7, the CephObjectStore **should** report `status.bucketStatus.health: Connected` and `status.Phase: Connected`.
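As mentioned, a shell sketch of this flow; the resource names are the OCS/ODF defaults and are assumptions:

```
# 3.  Remove the finalizer from the CephObjectStore.
oc -n openshift-storage patch cephobjectstore ocs-storagecluster-cephobjectstore \
  --type=merge -p '{"metadata":{"finalizers":null}}'

# 3.a Delete the CephCluster, but do not wait for the deletion to finish.
oc -n openshift-storage delete cephcluster ocs-storagecluster-cephcluster --wait=false

# 4.  Delete the storage cluster, leaving the rook-ceph-operator running.
oc -n openshift-storage delete storagecluster ocs-storagecluster

# 5-6. Re-install the storage cluster and wait for the CephCluster to come up (not shown).

# 7.  Check the indicator: Failure + "InvalidAccessKeyId" means the bug is present,
#     Connected means the fix works.
oc -n openshift-storage get cephobjectstore ocs-storagecluster-cephobjectstore \
  -o jsonpath='{.status.phase} {.status.bucketStatus.health}{"\n"}{.status.bucketStatus.details}{"\n"}'
```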
Based on the suggestion from Blaine, retesting on vSphere with:

OCP 4.10.0-0.nightly-2022-02-15-041303
LSO 4.10.0-202202160023
ODF 4.10.0-158

And I see no problem with the cephobjectstore:

```
$ oc describe cephobjectstore/ocs-storagecluster-cephobjectstore -n openshift-storage | tail
    Last Changed:      2022-02-17T19:23:11Z
    Last Checked:      2022-02-17T19:30:12Z
  Info:
    Endpoint:         http://rook-ceph-rgw-ocs-storagecluster-cephobjectstore.openshift-storage.svc:80
    Secure Endpoint:  https://rook-ceph-rgw-ocs-storagecluster-cephobjectstore.openshift-storage.svc:443
  Phase:              Connected
Events:
  Type    Reason              Age                   From                         Message
  ----    ------              ----                  ----                         -------
  Normal  ReconcileSucceeded  8m12s (x16 over 11m)  rook-ceph-object-controller  successfully configured CephObjectStore "openshift-storage/ocs-storagecluster-cephobjectstore"
```

Nor do I see the alert.

>>> VERIFIED
Please add doc text.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Important: Red Hat OpenShift Data Foundation 4.10.0 enhancement, security & bug fix update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2022:1372