Bug 2002220 - Ceph objectstore user "noobaa-ceph-objectstore-user" is not ready - StorageCluster stuck in progressing state
Summary: Ceph objectstore user "noobaa-ceph-objectstore-user" is not ready - StorageCl...
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Ceph Storage
Classification: Red Hat Storage
Component: RGW
Version: 5.0
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: urgent
Target Milestone: ---
Target Release: 5.0z1
Assignee: Mark Kogan
QA Contact: Tejas
URL:
Whiteboard:
Depends On:
Blocks: 2013326
 
Reported: 2021-09-08 09:49 UTC by Petr Balogh
Modified: 2023-09-15 01:14 UTC (History)
17 users

Fixed In Version: ceph-16.2.0-143.el8cp
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
: 2007377 (view as bug list)
Environment:
Last Closed: 2021-11-02 16:39:21 UTC
Embargoed:




Links
System ID Private Priority Status Summary Last Updated
Github rook rook issues 8778 0 None open CephObjectStore debugHTTPClient leaks admin credentials 2021-09-21 18:36:57 UTC
Github rook rook pull 8708 0 None open ceph: retry object health check if creation fails 2021-09-13 23:46:28 UTC
Github rook rook pull 8765 0 None open rgw: fix misleading log line in rgw health checker 2021-09-20 23:35:43 UTC
Red Hat Issue Tracker RHCEPH-1844 0 None None None 2021-09-21 20:37:50 UTC
Red Hat Product Errata RHBA-2021:4105 0 None None None 2021-11-02 16:39:47 UTC

Description Petr Balogh 2021-09-08 09:49:33 UTC
Description of problem:

Resource ocs-storagecluster is in phase: Progressing!
Found in Tier4 execution with 6 worker nodes here:
https://ocs4-jenkins-csb-ocsqe.apps.ocp4.prod.psi.redhat.com/job/qe-deploy-ocs-cluster-prod/1770//console

From here:
http://magna002.ceph.redhat.com/ocsci-jenkins/openshift-clusters/j-077vuf1cs36-t4a/j-077vuf1cs36-t4a_20210908T052821/logs/failed_testcase_ocs_logs_1631080049/test_deployment_ocs_logs/ocs_must_gather/quay-io-rhceph-dev-ocs-must-gather-sha256-a4606322e336d99d92edb5195f1852dde400ee02f738496a38bfb5f441c8d203/noobaa/namespaces/openshift-storage/noobaa.io/noobaas/noobaa.yaml

I see:
 conditions:
  - lastHeartbeatTime: "2021-09-08T06:31:17Z"
    lastTransitionTime: "2021-09-08T06:31:17Z"
    message: Ceph objectstore user "noobaa-ceph-objectstore-user" is not ready
    reason: TemporaryError
    status: "True"
    type: Progressing
 
  phase: Configuring
  readme: "\n\n\tNooBaa operator is still working to reconcile this system.\n\tCheck out the system status.phase, status.conditions, and events with:\n\n\t\tkubectl -n openshift-storage describe noobaa\n\t\tkubectl -n openshift-storage get noobaa -o yaml\n\t\tkubectl -n openshift-storage get events --sort-by=metadata.creationTimestamp\n\n\tYou can wait for a specific condition with:\n\n\t\tkubectl -n openshift-storage wait noobaa/noobaa --for condition=available --timeout -1s\n\n\tNooBaa Core Version:     5.9.0-20210722\n\tNooBaa Operator Version: 5.9.0\n"
  services:



Version of all relevant components (if applicable):
OCP 4.9 nightly
OCS 4.9.0-125.ci


Does this issue impact your ability to continue to work with the product
(please explain in detail what is the user impact)?
Yes, the automation is failing to finish deployment validation


Is there any workaround available to the best of your knowledge?
No


Rate from 1 - 5 the complexity of the scenario you performed that caused this
bug (1 - very simple, 5 - very complex)?
1

Can this issue reproducible?
Not sure yet

Can this issue reproduce from the UI?
Haven't tried

If this is a regression, please provide more details to justify this:
Yes, as it was working on the other deployments.

Steps to Reproduce:
1. Install ODF on top of cluster with 6 worker nodes on vSphere
2.
3.


Actual results:
StorageCluster stuck in progressing state

Expected results:
Have storage cluster in OK state

Additional info:

Full must gather logs:
http://magna002.ceph.redhat.com/ocsci-jenkins/openshift-clusters/j-077vuf1cs36-t4a/j-077vuf1cs36-t4a_20210908T052821/logs/failed_testcase_ocs_logs_1631080049/test_deployment_ocs_logs/

Comment 5 Blaine Gardner 2021-09-13 21:01:16 UTC
It looks like Rook is able to create the CephObjectStore partially: it creates the RGWs and configures the store, but it fails to establish a connection to the admin ops API, with the relevant log dump below. Rook times out the command after 15 seconds of inactivity.


> 2021-09-08T10:15:14.055276348Z 2021-09-08 10:15:14.055187 E | ceph-object-controller: failed to create bucket checker for CephObjectStore "openshift-storage/ocs-storagecluster-cephobjectstore": failed to create or retrieve rgw admin ops user: failed to create object user "rgw-admin-ops-user". error code 1 for object store "ocs-storagecluster-cephobjectstore": failed to create s3 user. 2021-09-08T10:15:09.161+0000 7ff30bc90380  1 int RGWSI_Notify::robust_notify(RGWSI_RADOS::Obj&, const RGWCacheNotifyInfo&, optional_yield):394 Notify failed on object ocs-storagecluster-cephobjectstore.rgw.meta:users.uid:rgw-admin-ops-user: (110) Connection timed out
> 2021-09-08T10:15:14.055276348Z 2021-09-08T10:15:09.161+0000 7ff30bc90380  1 int RGWSI_Notify::robust_notify(RGWSI_RADOS::Obj&, const RGWCacheNotifyInfo&, optional_yield):397 Backtrace: :  ceph version 16.2.0-117.el8cp (0e34bb74700060ebfaa22d99b7d2cdc037b28a57) pacific (stable)
> 2021-09-08T10:15:14.055276348Z  1: (RGWSI_Notify::distribute(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, RGWCacheNotifyInfo const&, optional_yield)+0x3ae) [0x563500c77ebe]
> 2021-09-08T10:15:14.055276348Z  2: (RGWSI_SysObj_Cache::distribute_cache(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, rgw_raw_obj const&, ObjectCacheInfo&, int, optional_yield)+0x31f) [0x563500c7f05f]
> 2021-09-08T10:15:14.055276348Z  3: (RGWSI_SysObj_Cache::write(rgw_raw_obj const&, std::chrono::time_point<ceph::real_clock, std::chrono::duration<unsigned long, std::ratio<1l, 1000000000l> > >*, std::map<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, ceph::buffer::v15_2_0::list, std::less<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, ceph::buffer::v15_2_0::list> > >&, bool, ceph::buffer::v15_2_0::list const&, RGWObjVersionTracker*, std::chrono::time_point<ceph::real_clock, std::chrono::duration<unsigned long, std::ratio<1l, 1000000000l> > >, optional_yield)+0x399) [0x563500c80a09]
> 2021-09-08T10:15:14.055276348Z  4: (RGWSI_SysObj::Obj::WOp::write(ceph::buffer::v15_2_0::list&, optional_yield)+0x37) [0x56350078c4e7]
> 2021-09-08T10:15:14.055276348Z  5: (rgw_put_system_obj(RGWSysObjectCtx&, rgw_pool const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, ceph::buffer::v15_2_0::list&, bool, RGWObjVersionTracker*, std::chrono::time_point<ceph::real_clock, std::chrono::duration<unsigned long, std::ratio<1l, 1000000000l> > >, optional_yield, std::map<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, ceph::buffer::v15_2_0::list, std::less<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, ceph::buffer::v15_2_0::list> > >*)+0x124) [0x563500b502c4]
> 2021-09-08T10:15:14.055276348Z  6: (RGWSI_MetaBackend_SObj::put_entry(RGWSI_MetaBackend::Context*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, RGWSI_MetaBackend::PutParams&, RGWObjVersionTracker*, optional_yield)+0xc0) [0x563500784240]
> 2021-09-08T10:15:14.055276348Z  7: radosgw-admin(+0x8fadc4) [0x563500c72dc4]
> 2021-09-08T10:15:14.055276348Z  8: (RGWSI_MetaBackend::do_mutate(RGWSI_MetaBackend::Context*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::chrono::time_point<ceph::real_clock, std::chrono::duration<unsigned long, std::ratio<1l, 1000000000l> > > const&, RGWObjVersionTracker*, RGWMDLogStatus, optional_yield, std::function<int ()>, bool)+0xc9) [0x563500c73849]
> 2021-09-08T10:15:14.055276348Z  9: (RGWSI_MetaBackend::put(RGWSI_MetaBackend::Context*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, RGWSI_MetaBackend::PutParams&, RGWObjVersionTracker*, optional_yield)+0xe1) [0x563500c73361]
> 2021-09-08T10:15:14.055276348Z  10: (RGWSI_User_RADOS::store_user_info(RGWSI_MetaBackend::Context*, RGWUserInfo const&, RGWUserInfo*, RGWObjVersionTracker*, std::chrono::time_point<ceph::real_clock, std::chrono::duration<unsigned long, std::ratio<1l, 1000000000l> > > const&, bool, std::map<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, ceph::buffer::v15_2_0::list, std::less<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, ceph::buffer::v15_2_0::list> > >*, optional_yield)+0x396) [0x563500c87066]
> 2021-09-08T10:15:14.055276348Z  11: radosgw-admin(+0x7e1114) [0x563500b59114]
> 2021-09-08T10:15:14.055276348Z  12: radosgw-admin(+0x8faece) [0x563500c72ece]
> 2021-09-08T10:15:14.055276348Z  13: (RGWSI_MetaBackend_SObj::call(std::optional<std::variant<RGWSI_MetaBackend_CtxParams_SObj> >, std::function<int (RGWSI_MetaBackend::Context*)>)+0x9e) [0x5635007855fe]
> 2021-09-08T10:15:14.055276348Z  14: (RGWSI_MetaBackend_Handler::call(std::optional<std::variant<RGWSI_MetaBackend_CtxParams_SObj> >, std::function<int (RGWSI_MetaBackend_Handler::Op*)>)+0x5f) [0x563500c72cff]
> 2021-09-08T10:15:14.055276348Z  15: (RGWSI_MetaBackend_Handler::call(std::function<int (RGWSI_MetaBackend_Handler::Op*)>)+0x78) [0x563500b68498]
> 2021-09-08T10:15:14.055276348Z  16: (RGWUserCtl::store_info(RGWUserInfo const&, optional_yield, RGWUserCtl::PutParams const&)+0xb0) [0x563500b5bad0]
> 2021-09-08T10:15:14.055276348Z  17: (rgw_store_user_info(RGWUserCtl*, RGWUserInfo&, RGWUserInfo*, RGWObjVersionTracker*, std::chrono::time_point<ceph::real_clock, std::chrono::duration<unsigned long, std::ratio<1l, 1000000000l> > >, bool, optional_yield, std::map<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, ceph::buffer::v15_2_0::list, std::less<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, ceph::buffer::v15_2_0::list> > >*)+0x47) [0x563500b5bb97]
> 2021-09-08T10:15:14.055276348Z  18: (RGWUser::update(RGWUserAdminOpState&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >*, optional_yield)+0xb9) [0x563500b5fcd9]
> 2021-09-08T10:15:14.055276348Z  19: (RGWUser::execute_add(RGWUserAdminOpState&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >*, optional_yield)+0x733) [0x563500b64993]
> 2021-09-08T10:15:14.055276348Z  20: (RGWUser::add(RGWUserAdminOpState&, optional_yield, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >*)+0x73) [0x563500b64e43]
> 2021-09-08T10:15:14.055276348Z  21: main()
> 2021-09-08T10:15:14.055276348Z  22: __libc_start_main()
> 2021-09-08T10:15:14.055276348Z  23: _start()
> 2021-09-08T10:15:14.055276348Z 2021-09-08T10:15:09.161+0000 7ff30bc90380  1 int RGWSI_Notify::robust_notify(RGWSI_RADOS::Obj&, const RGWCacheNotifyInfo&, optional_yield):413 Invalidating obj=ocs-storagecluster-cephobjectstore.rgw.meta:users.uid:rgw-admin-ops-user tries=0. : signal: interrupt


I don't see any RGW configs that are substantially different from a working install of Rook upstream.


It's possible that there is a networking issue. Petr, do the clusters you are seeing that fail have Multus enabled?

-------

One tangential problem I notice is that Rook does not retry creating the health checker when the creation fails. This is something we will have to fix in Rook. It's possible (but unlikely) that the object store just needs a little more time to get up and running before the health checker is created. Because I'm pretty sure this is not the root cause, I'll create a new BZ for this issue.

Comment 6 Blaine Gardner 2021-09-13 23:46:28 UTC
After speaking with Travis, I opted not to create a new BZ to fix the tangential issue. I attached a link to the GitHub PR instead: https://github.com/rook/rook/pull/8708

Comment 7 Jiffin 2021-09-14 06:07:22 UTC
From the logs, there are invalid access key errors. So either the admin ops user creation failed (but that should have been logged), or the admin ops user is pointing to the wrong info.

rook-operator logs
 I | ceph-object-store-user-controller: creating ceph object user "noobaa-ceph-objectstore-user" in namespace "openshift-storage"
2021-09-08T10:15:32.118051093Z 2021-09-08 10:15:32.118001 E | ceph-object-store-user-controller: failed to reconcile failed to create/update object store user "noobaa-ceph-objectstore-user": failed to get details from ceph object user "noobaa-ceph-objectstore-user": InvalidAccessKeyId tx000000000000000000014-0061388d44-619a-ocs-storagecluster-cephobjectstore 619a-ocs-storagecluster-cephobjectstore-ocs-storagecluster-cephobjectstore
2021-09-08T10:15:32.203845375Z 2021-09-08 10:15:32.203803 I | op-mon: parsing mon endpoints: a=172.30.71.182:6789,b=172.30.32.126:6789,c=172.30.250.57:6789


rgw-logs
2021-09-08T10:15:30.670396214Z debug 2021-09-08T10:15:30.670448899Z 2021-09-08T10:15:30.668+0000 7ff79ab36700  1 ====== starting new request req=0x7ff69cc30460 =====2021-09-08T10:15:30.670476531Z 
2021-09-08T10:15:30.670501039Z debug 2021-09-08T10:15:30.670510651Z 2021-09-08T10:15:30.668+0000 7ff79ab36700  1 op->ERRORHANDLER: err_no=-2028 new_err_no=-20282021-09-08T10:15:30.670519116Z 
2021-09-08T10:15:30.670540944Z debug 2021-09-08T10:15:30.670552697Z 2021-09-08T10:15:30.668+0000 7ff78bb18700  1 ====== req done req=0x7ff69cc30460 op status=0 http_status=403 latency=0.000000000s ======2021-09-08T10:15:30.670561509Z 
2021-09-08T10:15:30.670574088Z debug 2021-09-08T10:15:30.670583034Z 2021-09-08T10:15:30.668+0000 7ff78bb18700  1 beast: 0x7ff69cc30460: 10.129.2.14 - - [08/Sep/2021:10:15:30.668 +0000] "GET /admin/user?display-name=my%20display%20name&format=json&uid=noobaa-ceph-objectstore-user HTTP/1.1" 403 204 - "Go-http-client/1.1" - latency=0.000000000s2021-09-08T10:15:30.670591209Z

Comment 8 Blaine Gardner 2021-09-14 19:58:03 UTC
After looking a bit more this morning, I don't see Multus enabled in the cluster based on the CephCluster config, so I'll have to throw that theory out.
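For reference, one way to double-check this on a live cluster or from a must-gather; the CephCluster name ocs-storagecluster-cephcluster and the NetworkAttachmentDefinition resource below are assumptions based on a standard ODF install, not something verified on this cluster:

# Show the network settings Rook was given; an empty result, or a provider other
# than "multus", means Multus is not in use.
oc -n openshift-storage get cephcluster ocs-storagecluster-cephcluster -o jsonpath='{.spec.network}'

# List any Multus NetworkAttachmentDefinitions in the namespace.
oc -n openshift-storage get network-attachment-definitions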

@pbalogh as the next debugging measure, could you try restarting the Rook operator in an affected cluster to see if the "ceph-object-controller: failed to create bucket checker" log message persists the second time? You can restart the operator simply by deleting the rook-ceph-operator-* pod. It'll get restarted automatically.
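A minimal sketch of that restart, assuming the standard app=rook-ceph-operator pod label:

# Delete the operator pod; the deployment re-creates it automatically.
oc -n openshift-storage delete pod -l app=rook-ceph-operator

# Then watch the new pod's logs for the bucket checker message.
oc -n openshift-storage logs -f deploy/rook-ceph-operator | grep "bucket checker"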

Also, if you have a cluster that is affected, could you give me ssh access to it so I can poke around interactively? We need to figure out why the RGW is hanging when we try to create the admin ops user. 

I see the RGW logs for `noobaa-ceph-objectstore-user` (and `ocs-storagecluster-cephobjectstoreuser`) that Jiffin mentions, but I don't see any logs from when Rook tried to create the `rgw-admin-ops-user`.
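For reference, a couple of greps that should surface those attempts; the app=rook-ceph-rgw label is assumed from standard Rook labeling:

# Operator side: bootstrap attempts for the admin ops user.
oc -n openshift-storage logs deploy/rook-ceph-operator | grep -i rgw-admin-ops-user

# RGW side: admin API requests hitting the gateway.
oc -n openshift-storage logs -l app=rook-ceph-rgw --tail=-1 | grep '/admin/user'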

Comment 9 Petr Balogh 2021-09-20 07:48:01 UTC
Hey Blaine,

Sorry for the late reply.

I am trying to deploy two new clusters here:
https://ocs4-jenkins-csb-ocsqe.apps.ocp4.prod.psi.redhat.com/job/qe-deploy-ocs-cluster-prod/1850/

and here

https://ocs4-jenkins-csb-ocsqe.apps.ocp4.prod.psi.redhat.com/job/qe-deploy-ocs-cluster-prod/1853/

I set PAUSE_BEFORE_TEARDOWN in both executions, so I will update here if I reproduce the issue in one of the clusters above.

Comment 12 Petr Balogh 2021-09-20 10:00:29 UTC
As the issue is 100% reproducible on the FIPS vSphere environment, I am going to destroy the second cluster where I tried the pod restart, keeping only the cluster provided in Comment 10. I will destroy the one mentioned in Comment 11 now so it does not block resources, as I guess one cluster is enough for you.

Comment 13 Petr Balogh 2021-09-20 13:48:31 UTC
Looks like we also hit it on an external cluster deployment here:
https://ocs4-jenkins-csb-ocsqe.apps.ocp4.prod.psi.redhat.com/job/qe-deploy-ocs-cluster-prod/1859/

Must gather:
http://magna002.ceph.redhat.com/ocsci-jenkins/openshift-clusters/j-026vu1ce33-t4an/j-026vu1ce33-t4an_20210920T121102/logs/failed_testcase_ocs_logs_1632140195/test_deployment_ocs_logs/ocs_must_gather/

http://magna002.ceph.redhat.com/ocsci-jenkins/openshift-clusters/j-026vu1ce33-t4an/j-026vu1ce33-t4an_20210920T121102/logs/failed_testcase_ocs_logs_1632140195/test_deployment_ocs_logs/ocs_must_gather/quay-io-rhceph-dev-ocs-must-gather-sha256-def36bdd32393f876c47b4e418b7f33fd2fc647a51ae30162e355bc272fd1bb1/namespaces/openshift-storage/pods/noobaa-operator-78f4d46f9c-72567/noobaa-operator/noobaa-operator/logs/current.log

Here I see the issue with noobaa-ceph-objectstore-user as well:
2021-09-20T12:57:32.226446307Z time="2021-09-20T12:57:32Z" level=info msg="SetPhase: temporary error during phase \"Configuring\"" sys=openshift-storage/noobaa
2021-09-20T12:57:32.226501210Z time="2021-09-20T12:57:32Z" level=warning msg="⏳ Temporary Error: Ceph objectstore user \"noobaa-ceph-objectstore-user\" is not ready" sys=openshift-storage/noobaa
2021-09-20T12:57:32.234035296Z time="2021-09-20T12:57:32Z" level=info msg="UpdateStatus: Done generation 1" sys=openshift-storage/noobaa
2021-09-20T12:57:33.327446183Z time="2021-09-20T12:57:33Z" level=info msg="RPC Handle: {Op: req, API: server_inter_process_api, Method: load_syst

This is not a FIPS deployment.

Comment 14 Blaine Gardner 2021-09-20 23:35:04 UTC
I have verified that restarting the operator does not seem to fix the issue.

I'm investigating whether this upstream issue could be the source of this bug.

https://github.com/rook/rook/pull/8712#issuecomment-919462273

I also created Rook PR https://github.com/rook/rook/pull/8765 to fix a second tangential issue I found while debugging this.

Comment 15 Jiffin 2021-09-21 05:19:40 UTC
I cannot find the "rgw-adminops-user" from the toolbox pod, so IMO its creation failed, and I can see repeated attempts to create the admin-ops-user in the rook-operator logs without any failure messages. It is worth trying the command below from the rook-operator pod to see why the admin ops user creation is failing. I didn't try it out because I don't want to disturb the setup.

radosgw-admin user create --uid rgw-admin-ops-user --display-name RGW Admin Ops User --caps buckets=*;users=*;usage=read;metadata=read;zone=read --rgw-realm=ocs-storagecluster-cephobjectstore --rgw-zonegroup=ocs-storagecluster-cephobjectstore --rgw-zone=ocs-storagecluster-cephobjectstore --cluster=openshift-storage --conf=/var/lib/rook/openshift-storage/openshift-storage.config --name=client.admin --keyring=/var/lib/rook/openshift-storage/client.admin.keyring --debug-rgw 20 --debug-rgw_sync 20 --debug-ms 20
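Note that when this is run interactively in a shell, the `;` and `*` characters in --caps and the spaces in --display-name need quoting or the shell will split the command; the same command, wrapped and quoted but otherwise unchanged:

radosgw-admin user create --uid rgw-admin-ops-user \
  --display-name "RGW Admin Ops User" \
  --caps "buckets=*;users=*;usage=read;metadata=read;zone=read" \
  --rgw-realm=ocs-storagecluster-cephobjectstore \
  --rgw-zonegroup=ocs-storagecluster-cephobjectstore \
  --rgw-zone=ocs-storagecluster-cephobjectstore \
  --cluster=openshift-storage \
  --conf=/var/lib/rook/openshift-storage/openshift-storage.config \
  --name=client.admin \
  --keyring=/var/lib/rook/openshift-storage/client.admin.keyring \
  --debug-rgw 20 --debug-rgw_sync 20 --debug-ms 20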

Comment 16 Jiffin 2021-09-21 06:16:44 UTC
(In reply to Jiffin from comment #15)
> I cannot find the "rgw-adminops-user" from the toolbox pod, so IMO its
> creation got failed and I can see a repeat attempt of creating
> admin-ops-user in rook-operator logs without any failure messages. It is
> worth try the below command from rook-operator pod and see why the adminsops
> user creation is failing. I didn't try out because I don't want to disturb
> the set up
> 
> radosgw-admin user create --uid rgw-admin-ops-user --display-name RGW Admin
> Ops User --caps buckets=*;users=*;usage=read;metadata=read;zone=read
> --rgw-realm=ocs-storagecluster-cephobjectstore
> --rgw-zonegroup=ocs-storagecluster-cephobjectstore
> --rgw-zone=ocs-storagecluster-cephobjectstore --cluster=openshift-storage
> --conf=/var/lib/rook/openshift-storage/openshift-storage.config
> --name=client.admin
> --keyring=/var/lib/rook/openshift-storage/client.admin.keyring --debug-rgw
> 20 --debug-rgw_sync 20--debug-ms 20

Sorry, I want to rectify the above comment. Apparently the rgw-admin-ops-user did get created:

bash-4.4$ radosgw-admin user list --rgw-realm=ocs-storagecluster-cephobjectstore --rgw-zonegroup=ocs-storagecluster-cephobjectstore --rgw-zone=ocs-storagecluster-cephobjectstore --cluster=openshift-storage  --conf=/var/lib/rook/openshift-storage/openshift-storage.config --name=client.admin --keyring=/var/lib/rook/openshift-storage/client.admin.keyring
[
    "rgw-admin-ops-user"
]

But if you don't specify the zone/zonegroup details, the user won't be listed:
 radosgw-admin user list --rgw-realm=ocs-storagecluster-cephobjectstore --cluster=openshift-storage  --conf=/var/lib/rook/openshift-storage/openshift-storage.config --name=client.admin --keyring=/var/lib/rook/openshift-storage/client.admin.keyring
[]


And you can also see that two zones/zonegroups exist on the setup. I am not sure how it ended up like this:
radosgw-admin zonegroup list --cluster=openshift-storage --conf=/var/lib/rook/openshift-storage/openshift-storage.config --name=client.admin --keyring=/var/lib/rook/openshift-storage/client.admin.keyring
{
    "default_info": "",
    "zonegroups": [
        "ocs-storagecluster-cephobjectstore",
        "default"
    ]
}
radosgw-admin zone list --cluster=openshift-storage --conf=/var/lib/rook/openshift-storage/openshift-storage.config --name=client.admin --keyring=/var/lib/rook/openshift-storage/client.admin.keyring
{
    "default_info": "832a8df3-9249-4253-a025-637a73ad87c9",
    "zones": [
        "ocs-storagecluster-cephobjectstore",
        "default"
    ]
}

"ocs-storagecluster-cephobjectstore" is created by the Rook while setting the cephobjectstore/RGW. Even though rgw daemon is configured with proper zone/zonegroup. I can see following message 
debug 2021-09-20T07:52:42.846+0000 7f1510545480  1 mgrc service_daemon_register rgw.15334 metadata {arch=x86_64,ceph_release=pacific,ceph_version=ceph version 16.2.0-117.el8cp (0e34bb7470006
0ebfaa22d99b7d2cdc037b28a57) pacific (stable),ceph_version_short=16.2.0-117.el8cp,container_hostname=rook-ceph-rgw-ocs-storagecluster-cephobjectstore-a-6555bd957jsr,container_image=quay.io/r
hceph-dev/rhceph@sha256:efe1391ab28c3363308093a55f088a62eda06eab49bed707164c8db0d7dcb7dd,cpu=Intel(R) Xeon(R) Gold 6242 CPU @ 2.80GHz,distro=rhel,distro_description=Red Hat Enterprise Linux 
8.4 (Ootpa),distro_version=8.4,frontend_config#0=beast port=8080 ssl_port=443 ssl_certificate=/etc/ceph/private/rgw-cert.pem ssl_private_key=/etc/ceph/private/rgw-key.pem,frontend_type#0=bea
st,hostname=compute-2,id=ocs.storagecluster.cephobjectstore.a,kernel_description=#1 SMP Tue Sep 7 07:07:31 EDT 2021,kernel_version=4.18.0-305.19.1.el8_4.x86_64,mem_cgroup_limit=4294967296,me
m_swap_kb=0,mem_total_kb=65951952,num_handles=1,os=Linux,pid=13,pod_name=rook-ceph-rgw-ocs-storagecluster-cephobjectstore-a-6555bd957jsr,pod_namespace=openshift-storage,zone_id=832a8df3-9249
-4253-a025-637a73ad87c9,zone_name=ocs-storagecluster-cephobjectstore,zonegroup_id=da6170c8-6eb2-4286-824a-ad411932e6c4,zonegroup_name=ocs-storagecluster-cephobjectstore}
debug 2021-09-20T07:52:42.847+0000 7f1510545480 -1 rgw period pusher: The new period does not contain my zonegroup!


I am guessing that, due to the above message, all the admin ops operations are failing, since the default zonegroup (not the one named "default") is not set properly, so the "rgw-admin-ops-user" cannot be validated properly.
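For reference, one way to check whether the committed period actually contains the ocs-storagecluster-cephobjectstore zonegroup, reusing the same conf/keyring flags as the commands above; this is only an inspection sketch, not something run on this cluster:

# Show the current period; its period_map should include the zonegroup.
radosgw-admin period get --rgw-realm=ocs-storagecluster-cephobjectstore \
  --cluster=openshift-storage \
  --conf=/var/lib/rook/openshift-storage/openshift-storage.config \
  --name=client.admin \
  --keyring=/var/lib/rook/openshift-storage/client.admin.keyring

# Show the zonegroup configuration (endpoints, zones, master zone) as stored.
radosgw-admin zonegroup get --rgw-zonegroup=ocs-storagecluster-cephobjectstore \
  --rgw-realm=ocs-storagecluster-cephobjectstore \
  --cluster=openshift-storage \
  --conf=/var/lib/rook/openshift-storage/openshift-storage.config \
  --name=client.admin \
  --keyring=/var/lib/rook/openshift-storage/client.admin.keyring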

Comment 17 Petr Balogh 2021-09-21 08:15:01 UTC
Just confirmed with Vijay that the external cluster deployment mentioned in Comment 13 is most likely related to another TLS bug: https://bugzilla.redhat.com/show_bug.cgi?id=2004003

Comment 18 Blaine Gardner 2021-09-21 14:45:36 UTC
(In reply to Jiffin from comment #16)
> 
> Sorry I want to rectify above comment. Apparently, the rgw-admin-ops user
> got created 
> 
> bash-4.4$ radosgw-admin user list
> --rgw-realm=ocs-storagecluster-cephobjectstore
> --rgw-zonegroup=ocs-storagecluster-cephobjectstore
> --rgw-zone=ocs-storagecluster-cephobjectstore --cluster=openshift-storage 
> --conf=/var/lib/rook/openshift-storage/openshift-storage.config
> --name=client.admin
> --keyring=/var/lib/rook/openshift-storage/client.admin.keyring
> [
>     "rgw-admin-ops-user"
> ]
> 
> But if you don't specify zone/zonegroup details user won't be listed
>  radosgw-admin user list --rgw-realm=ocs-storagecluster-cephobjectstore
> --cluster=openshift-storage 
> --conf=/var/lib/rook/openshift-storage/openshift-storage.config
> --name=client.admin
> --keyring=/var/lib/rook/openshift-storage/client.admin.keyring
> []
> 
> 
> And you can also see there are two zone/zonegroup exists on the setup. I am
> not sure how it endup like this
> radosgw-admin zonegroup list --cluster=openshift-storage
> --conf=/var/lib/rook/openshift-storage/openshift-storage.config
> --name=client.admin
> --keyring=/var/lib/rook/openshift-storage/client.admin.keyring
> {
>     "default_info": "",
>     "zonegroups": [
>         "ocs-storagecluster-cephobjectstore",
>         "default"
>     ]
> }
> radosgw-admin zone list --cluster=openshift-storage
> --conf=/var/lib/rook/openshift-storage/openshift-storage.config
> --name=client.admin
> --keyring=/var/lib/rook/openshift-storage/client.admin.keyring
> {
>     "default_info": "832a8df3-9249-4253-a025-637a73ad87c9",
>     "zones": [
>         "ocs-storagecluster-cephobjectstore",
>         "default"
>     ]
> }

I will investigate the zones/zonegroups to see if there is somewhere where Rook is failing to set the zone/zonegroup.

> 
> "ocs-storagecluster-cephobjectstore" is created by the Rook while setting
> the cephobjectstore/RGW. Even though rgw daemon is configured with proper
> zone/zonegroup. I can see following message 
> debug 2021-09-20T07:52:42.846+0000 7f1510545480  1 mgrc
> service_daemon_register rgw.15334 metadata
> {arch=x86_64,ceph_release=pacific,ceph_version=ceph version 16.2.0-117.el8cp
> (0e34bb7470006
> 0ebfaa22d99b7d2cdc037b28a57) pacific
> (stable),ceph_version_short=16.2.0-117.el8cp,container_hostname=rook-ceph-
> rgw-ocs-storagecluster-cephobjectstore-a-6555bd957jsr,container_image=quay.
> io/r
> hceph-dev/rhceph@sha256:
> efe1391ab28c3363308093a55f088a62eda06eab49bed707164c8db0d7dcb7dd,
> cpu=Intel(R) Xeon(R) Gold 6242 CPU @
> 2.80GHz,distro=rhel,distro_description=Red Hat Enterprise Linux 
> 8.4 (Ootpa),distro_version=8.4,frontend_config#0=beast port=8080
> ssl_port=443 ssl_certificate=/etc/ceph/private/rgw-cert.pem
> ssl_private_key=/etc/ceph/private/rgw-key.pem,frontend_type#0=bea
> st,hostname=compute-2,id=ocs.storagecluster.cephobjectstore.a,
> kernel_description=#1 SMP Tue Sep 7 07:07:31 EDT
> 2021,kernel_version=4.18.0-305.19.1.el8_4.x86_64,mem_cgroup_limit=4294967296,
> me
> m_swap_kb=0,mem_total_kb=65951952,num_handles=1,os=Linux,pid=13,
> pod_name=rook-ceph-rgw-ocs-storagecluster-cephobjectstore-a-6555bd957jsr,
> pod_namespace=openshift-storage,zone_id=832a8df3-9249
> -4253-a025-637a73ad87c9,zone_name=ocs-storagecluster-cephobjectstore,
> zonegroup_id=da6170c8-6eb2-4286-824a-ad411932e6c4,zonegroup_name=ocs-
> storagecluster-cephobjectstore}
> debug 2021-09-20T07:52:42.847+0000 7f1510545480 -1 rgw period pusher: The
> new period does not contain my zonegroup!
> 
> 
> I am guessing due to above message all the operations adminsops is failing
> since the default zonegroup(not the name default) is not set properly, so
> the "rgw-admin-ops-user" cannot be validated properly.

I already investigated the "new period does not contain my zonegroup" message, and it only appears once and corresponds to a one-time operator failure to set up multisite for the object store. The following reconcile succeeds. This may be a symptom of the underlying issue, but it is only a temporary one.

Comment 19 Blaine Gardner 2021-09-21 18:36:57 UTC
Today, after playing with the cluster, I have been seeing an issue where the RGW segfaults. This happened after deleting the admin ops user and restarting the operator to re-create the user. It is a different issue than the one first described, but I believe it may be related. I am collecting more debug info from the RGW so someone from the RGW team can take a look.

While debugging this, I also noticed a credential leak in Rook with an upstream issue here which I will work on fixing: https://github.com/rook/rook/issues/8778

Comment 25 Petr Balogh 2021-09-22 20:44:31 UTC
On AWS it was not reproduced.
http://magna002.ceph.redhat.com/ocsci-jenkins/openshift-clusters/j-003aif3c33-d/j-003aif3c33-d_20210922T163146/logs/failed_testcase_ocs_logs_1632331868/test_deployment_ocs_logs/

RGW is deployed only on vSphere, so this means it is somehow related to RGW together with the FIPS cluster.
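For reference, a quick way to confirm whether FIPS mode is enabled on the nodes of a suspect cluster; just a generic sketch, not specific to these deployments:

# 1 means FIPS mode is enabled on the node.
for node in $(oc get nodes -o name); do
  echo "$node"
  oc debug "$node" -- chroot /host cat /proc/sys/crypto/fips_enabled
done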

Comment 29 Jiffin 2021-09-23 07:08:12 UTC
Hey Petr,
can you please tell me whether any tests were run after setting up the ODF cluster?
I can also see the following pools created : 
default.rgw.log
default.rgw.control
default.rgw.meta
Similar to the zone and zonegroup, I don't have any clue why/how this can happen. There are no traces of this in the Rook logs either.

Comment 32 Jiffin 2021-09-23 11:30:11 UTC
Thanks Petr. In the new cluster, I think I was able to RCA the issue.

"radosgw-admin period update --commit --rgw-realm=ocs-storagecluster-cephobjectstore --rgw-zonegroup=ocs-storagecluster-cephobjectstore --rgw-zone=ocs-storagecluster-cephobjectstore --cluster=openshift-storage --conf=/var/lib/rook/openshift-storage/openshift-storage.config --name=client.admin --keyring=/var/lib/rook/openshift-storage/client.admin.keyring" command result in segfault

"2021-09-23 09:59:54.129858 D | ceph-object-controller: created zone group ocs-storagecluster-cephobjectstore
2021-09-23 09:59:54.129938 D | exec: Running command: radosgw-admin zone get --rgw-realm=ocs-storagecluster-cephobjectstore --rgw-zonegroup=ocs-storagecluster-cephobjectstore --rgw-zone=ocs-storagecluster-cephobjectstore --cluster=openshift-storage --conf=/var/lib/rook/openshift-storage/openshift-storage.config --name=client.admin --keyring=/var/lib/rook/openshift-storage/client.admin.keyring
2021-09-23 09:59:54.186729 D | exec: Running command: radosgw-admin zone create --master --endpoints=https://172.30.10.69:443 --rgw-realm=ocs-storagecluster-cephobjectstore --rgw-zonegroup=ocs-storagecluster-cephobjectstore --rgw-zone=ocs-storagecluster-cephobjectstore --cluster=openshift-storage --conf=/var/lib/rook/openshift-storage/openshift-storage.config --name=client.admin --keyring=/var/lib/rook/openshift-storage/client.admin.keyring
2021-09-23 09:59:54.253637 D | ceph-object-controller: created zone ocs-storagecluster-cephobjectstore
2021-09-23 09:59:54.253715 D | exec: Running command: radosgw-admin period update --commit --rgw-realm=ocs-storagecluster-cephobjectstore --rgw-zonegroup=ocs-storagecluster-cephobjectstore --rgw-zone=ocs-storagecluster-cephobjectstore --cluster=openshift-storage --conf=/var/lib/rook/openshift-storage/openshift-storage.config --name=client.admin --keyring=/var/lib/rook/openshift-storage/client.admin.keyring
2021-09-23 09:59:54.560945 I | ceph-file-controller: start running mdses for filesystem "ocs-storagecluster-cephfilesystem"
2021-09-23 09:59:54.560982 D | exec: Running command: ceph versions --connect-timeout=15 --cluster=openshift-storage --conf=/var/lib/rook/openshift-storage/openshift-storage.config --name=client.admin --keyring=/var/lib/rook/openshift-storage/client.admin.keyring --format json
2021-09-23 09:59:54.904652 D | cephclient: {"mon":{"ceph version 16.2.0-117.el8cp (0e34bb74700060ebfaa22d99b7d2cdc037b28a57) pacific (stable)":3},"mgr":{"ceph version 16.2.0-117.el8cp (0e34bb74700060ebfaa22d99b7d2cdc037b28a57) pacific (stable)":1},"osd":{"ceph version 16.2.0-117.el8cp (0e34bb74700060ebfaa22d99b7d2cdc037b28a57) pacific (stable)":3},"mds":{},"overall":{"ceph version 16.2.0-117.el8cp (0e34bb74700060ebfaa22d99b7d2cdc037b28a57) pacific (stable)":7}}
2021-09-23 09:59:54.904675 D | cephclient: {"mon":{"ceph version 16.2.0-117.el8cp (0e34bb74700060ebfaa22d99b7d2cdc037b28a57) pacific (stable)":3},"mgr":{"ceph version 16.2.0-117.el8cp (0e34bb74700060ebfaa22d99b7d2cdc037b28a57) pacific (stable)":1},"osd":{"ceph version 16.2.0-117.el8cp (0e34bb74700060ebfaa22d99b7d2cdc037b28a57) pacific (stable)":3},"mds":{},"overall":{"ceph version 16.2.0-117.el8cp (0e34bb74700060ebfaa22d99b7d2cdc037b28a57) pacific (stable)":7}}
2021-09-23 09:59:54.904730 I | cephclient: getting or creating ceph auth key "mds.ocs-storagecluster-cephfilesystem-a"
2021-09-23 09:59:54.904741 D | exec: Running command: ceph auth get-or-create-key mds.ocs-storagecluster-cephfilesystem-a osd allow * mds allow mon allow profile mds --connect-timeout=15 --cluster=openshift-storage --conf=/var/lib/rook/openshift-storage/openshift-storage.config --name=client.admin --keyring=/var/lib/rook/openshift-storage/client.admin.keyring --format json
2021-09-23 09:59:55.284545 D | ceph-object-controller: object store "openshift-storage/ocs-storagecluster-cephobjectstore" status updated to "Failure"
2021-09-23 09:59:55.284729 D | ceph-spec: update event from a CR
2021-09-23 09:59:55.284775 D | ceph-spec: update event on CephObjectStore CR
2021-09-23 09:59:55.284888 D | clusterdisruption-controller: reconciling "openshift-storage/"
2021-09-23 09:59:55.284998 D | clusterdisruption-controller: Using default maintenance timeout: 30m0s
2021-09-23 09:59:55.286188 D | op-k8sutil: returning version v1.22.0-rc.0 instead of v1.22.0-rc.0+af080cb
2021-09-23 09:59:55.286219 D | op-k8sutil: kubernetes version fetched 1.22.0-rc.0
2021-09-23 09:59:55.286744 D | op-mds: legacy mds key rook-ceph-mds-ocs-storagecluster-cephfilesystem-a is already removed
2021-09-23 09:59:55.289490 D | op-k8sutil: returning version v1.22.0-rc.0 instead of v1.22.0-rc.0+af080cb
2021-09-23 09:59:55.289510 D | op-k8sutil: kubernetes version fetched 1.22.0-rc.0
2021-09-23 09:59:55.289701 D | exec: Running command: ceph status --connect-timeout=15 --cluster=openshift-storage --conf=/var/lib/rook/openshift-storage/openshift-storage.config --name=client.admin --keyring=/var/lib/rook/openshift-storage/client.admin.keyring --format json
2021-09-23 09:59:55.300136 D | op-cfg-keyring: creating secret for rook-ceph-mds-ocs-storagecluster-cephfilesystem-a-keyring
2021-09-23 09:59:55.325810 D | ceph-object-controller: object store "openshift-storage/ocs-storagecluster-cephobjectstore" status updated to "Failure"
2021-09-23 09:59:55.325902 E | ceph-object-controller: failed to reconcile CephObjectStore "openshift-storage/ocs-storagecluster-cephobjectstore". failed to create object store deployments: failed to configure multisite for object store: failed create ceph multisite for object-store ["ocs-storagecluster-cephobjectstore"]: failed to update period%!(EXTRA []string=[]): signal: segmentation fault (core dumped)
2021-09-23 09:59:55.325921 I | op-k8sutil: Reporting Event openshift-storage:ocs-storagecluster-cephobjectstore Warning:ReconcileFailed:failed to create object store deployments: failed to configure multisite for object store: failed create ceph multisite for object-store ["ocs-storagecluster-cephobjectstore"]: failed to update period%!(EXTRA []string=[]): signal: segmentation fault (core dumped)"

from the operator pod logs, which resulted in the inconsistency. And IMO the default zone/zonegroups/pools were created while debugging the issue, maybe triggered by some radosgw-admin command (I am not quite sure).
I am not sure whether this crash is similar to what Blaine found. How can I fetch the crash details? In the operator pod it says "log_file /var/lib/ceph/crash/2021-09-23T11:10:52.882440Z_f77efce7-3237-4e9e-8a54-c4f987e8e35b/log
--- end dump of recent events ---
Segmentation fault (core dumped)" but I was not able to access that file.

I don't have the full output of the crash, but a partial output is copied above from the terminal.
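For reference, the usual ways to pull crash details in a Rook/ODF cluster; a sketch only, and the radosgw-admin crash from the operator pod may not have been posted to the mgr crash module at all:

# From the rook-ceph-tools (toolbox) pod: list crashes known to the mgr and dump one.
ceph crash ls
ceph crash info <crash-id>

# Or read the crash log straight out of the operator container, if the file exists there
# (use the path from the log_file message above).
oc -n openshift-storage exec deploy/rook-ceph-operator -- cat /var/lib/ceph/crash/<crash-dir>/log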

Comment 58 Scott Ostapovicz 2021-10-14 12:54:17 UTC
I find it highly unlikely that someone changed the encryption from sha256 to md5, so it is equally unlikely that this is a regression. If that is so, then this should be moved to 5.0 z2. However, since you ACKed this, Matt, I will defer to you.

Comment 68 errata-xmlrpc 2021-11-02 16:39:21 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Red Hat Ceph Storage 5.0 Bug Fix update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2021:4105

Comment 69 Red Hat Bugzilla 2023-09-15 01:14:50 UTC
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 500 days

