Description of problem:

Resource ocs-storagecluster is in phase: Progressing!

Found in Tier4 execution with 6 worker nodes here:
https://ocs4-jenkins-csb-ocsqe.apps.ocp4.prod.psi.redhat.com/job/qe-deploy-ocs-cluster-prod/1770//console

From here:
http://magna002.ceph.redhat.com/ocsci-jenkins/openshift-clusters/j-077vuf1cs36-t4a/j-077vuf1cs36-t4a_20210908T052821/logs/failed_testcase_ocs_logs_1631080049/test_deployment_ocs_logs/ocs_must_gather/quay-io-rhceph-dev-ocs-must-gather-sha256-a4606322e336d99d92edb5195f1852dde400ee02f738496a38bfb5f441c8d203/noobaa/namespaces/openshift-storage/noobaa.io/noobaas/noobaa.yaml

I see:

    conditions:
    - lastHeartbeatTime: "2021-09-08T06:31:17Z"
      lastTransitionTime: "2021-09-08T06:31:17Z"
      message: Ceph objectstore user "noobaa-ceph-objectstore-user" is not ready
      reason: TemporaryError
      status: "True"
      type: Progressing
    phase: Configuring
    readme: "\n\n\tNooBaa operator is still working to reconcile this system.\n\tCheck out the system status.phase, status.conditions, and events with:\n\n\t\tkubectl -n openshift-storage describe noobaa\n\t\tkubectl -n openshift-storage get noobaa -o yaml\n\t\tkubectl -n openshift-storage get events --sort-by=metadata.creationTimestamp\n\n\tYou can wait for a specific condition with:\n\n\t\tkubectl -n openshift-storage wait noobaa/noobaa --for condition=available --timeout -1s\n\n\tNooBaa Core Version: 5.9.0-20210722\n\tNooBaa Operator Version: 5.9.0\n"
    services:

Version of all relevant components (if applicable):
OCP 4.9 nightly
OCS 4.9.0-125.ci

Does this issue impact your ability to continue to work with the product (please explain in detail what is the user impact)?
Yes, the automation is failing to finish deployment validation.

Is there any workaround available to the best of your knowledge?
No

Rate from 1 - 5 the complexity of the scenario you performed that caused this bug (1 - very simple, 5 - very complex)?
1

Is this issue reproducible?
Not sure yet

Can this issue be reproduced from the UI?
Haven't tried

If this is a regression, please provide more details to justify this:
Yes, as it was working on the other deployments.

Steps to Reproduce:
1. Install ODF on top of a cluster with 6 worker nodes on vSphere
2.
3.

Actual results:
StorageCluster stuck in the Progressing state

Expected results:
StorageCluster in the OK state

Additional info:
Full must-gather logs: http://magna002.ceph.redhat.com/ocsci-jenkins/openshift-clusters/j-077vuf1cs36-t4a/j-077vuf1cs36-t4a_20210908T052821/logs/failed_testcase_ocs_logs_1631080049/test_deployment_ocs_logs/
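For anyone hitting the same state on a live cluster, the phases can also be checked directly; a minimal sketch, assuming the default ODF resource names (ocs-storagecluster, noobaa, noobaa-ceph-objectstore-user) in openshift-storage:

    # overall StorageCluster and NooBaa phases
    oc -n openshift-storage get storagecluster ocs-storagecluster -o jsonpath='{.status.phase}'
    oc -n openshift-storage get noobaa noobaa -o jsonpath='{.status.phase}'
    # the CephObjectStoreUser that NooBaa is waiting on
    oc -n openshift-storage get cephobjectstoreuser noobaa-ceph-objectstore-user -o yaml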
I have noticed that the previous run is FIPS enabled, and I have another occurrence here with another FIPS run, so maybe it's FIPS/NooBaa related:
https://ocs4-jenkins-csb-ocsqe.apps.ocp4.prod.psi.redhat.com/job/qe-deploy-ocs-cluster-prod/1771//console

Logs:
http://magna002.ceph.redhat.com/ocsci-jenkins/openshift-clusters/j-077vuf1cs33-t1/j-077vuf1cs33-t1_20210908T090046/logs/failed_testcase_ocs_logs_1631093582/test_deployment_ocs_logs/ocs_must_gather/quay-io-rhceph-dev-ocs-must-gather-sha256-a4606322e336d99d92edb5195f1852dde400ee02f738496a38bfb5f441c8d203/noobaa/namespaces/openshift-storage/noobaa.io/noobaas/noobaa.yaml

It shows the same:
Ceph objectstore user "noobaa-ceph-objectstore-user" is not ready

Full must gather:
http://magna002.ceph.redhat.com/ocsci-jenkins/openshift-clusters/j-077vuf1cs33-t1/j-077vuf1cs33-t1_20210908T090046/logs/failed_testcase_ocs_logs_1631093582/test_deployment_ocs_logs/
It looks like Rook is able to create the CephObjectStore successfully in part. It's able to create RGWs and configure the store, but it fails to get a connection to the admin ops API, with the relevant log dump below. Rook times out the command after 15 seconds of inactivity.

> 2021-09-08T10:15:14.055276348Z 2021-09-08 10:15:14.055187 E | ceph-object-controller: failed to create bucket checker for CephObjectStore "openshift-storage/ocs-storagecluster-cephobjectstore": failed to create or retrieve rgw admin ops user: failed to create object user "rgw-admin-ops-user". error code 1 for object store "ocs-storagecluster-cephobjectstore": failed to create s3 user. 2021-09-08T10:15:09.161+0000 7ff30bc90380 1 int RGWSI_Notify::robust_notify(RGWSI_RADOS::Obj&, const RGWCacheNotifyInfo&, optional_yield):394 Notify failed on object ocs-storagecluster-cephobjectstore.rgw.meta:users.uid:rgw-admin-ops-user: (110) Connection timed out
> 2021-09-08T10:15:14.055276348Z 2021-09-08T10:15:09.161+0000 7ff30bc90380 1 int RGWSI_Notify::robust_notify(RGWSI_RADOS::Obj&, const RGWCacheNotifyInfo&, optional_yield):397 Backtrace: : ceph version 16.2.0-117.el8cp (0e34bb74700060ebfaa22d99b7d2cdc037b28a57) pacific (stable)
> 2021-09-08T10:15:14.055276348Z 1: (RGWSI_Notify::distribute(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, RGWCacheNotifyInfo const&, optional_yield)+0x3ae) [0x563500c77ebe]
> 2021-09-08T10:15:14.055276348Z 2: (RGWSI_SysObj_Cache::distribute_cache(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, rgw_raw_obj const&, ObjectCacheInfo&, int, optional_yield)+0x31f) [0x563500c7f05f]
> 2021-09-08T10:15:14.055276348Z 3: (RGWSI_SysObj_Cache::write(rgw_raw_obj const&, std::chrono::time_point<ceph::real_clock, std::chrono::duration<unsigned long, std::ratio<1l, 1000000000l> > >*, std::map<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, ceph::buffer::v15_2_0::list, std::less<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, ceph::buffer::v15_2_0::list> > >&, bool, ceph::buffer::v15_2_0::list const&, RGWObjVersionTracker*, std::chrono::time_point<ceph::real_clock, std::chrono::duration<unsigned long, std::ratio<1l, 1000000000l> > >, optional_yield)+0x399) [0x563500c80a09]
> 2021-09-08T10:15:14.055276348Z 4: (RGWSI_SysObj::Obj::WOp::write(ceph::buffer::v15_2_0::list&, optional_yield)+0x37) [0x56350078c4e7]
> 2021-09-08T10:15:14.055276348Z 5: (rgw_put_system_obj(RGWSysObjectCtx&, rgw_pool const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, ceph::buffer::v15_2_0::list&, bool, RGWObjVersionTracker*, std::chrono::time_point<ceph::real_clock, std::chrono::duration<unsigned long, std::ratio<1l, 1000000000l> > >, optional_yield, std::map<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, ceph::buffer::v15_2_0::list, std::less<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, ceph::buffer::v15_2_0::list> > >*)+0x124) [0x563500b502c4]
> 2021-09-08T10:15:14.055276348Z 6: (RGWSI_MetaBackend_SObj::put_entry(RGWSI_MetaBackend::Context*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, RGWSI_MetaBackend::PutParams&, RGWObjVersionTracker*, optional_yield)+0xc0) [0x563500784240]
> 2021-09-08T10:15:14.055276348Z 7: radosgw-admin(+0x8fadc4) [0x563500c72dc4]
> 2021-09-08T10:15:14.055276348Z 8: (RGWSI_MetaBackend::do_mutate(RGWSI_MetaBackend::Context*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::chrono::time_point<ceph::real_clock, std::chrono::duration<unsigned long, std::ratio<1l, 1000000000l> > > const&, RGWObjVersionTracker*, RGWMDLogStatus, optional_yield, std::function<int ()>, bool)+0xc9) [0x563500c73849]
> 2021-09-08T10:15:14.055276348Z 9: (RGWSI_MetaBackend::put(RGWSI_MetaBackend::Context*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, RGWSI_MetaBackend::PutParams&, RGWObjVersionTracker*, optional_yield)+0xe1) [0x563500c73361]
> 2021-09-08T10:15:14.055276348Z 10: (RGWSI_User_RADOS::store_user_info(RGWSI_MetaBackend::Context*, RGWUserInfo const&, RGWUserInfo*, RGWObjVersionTracker*, std::chrono::time_point<ceph::real_clock, std::chrono::duration<unsigned long, std::ratio<1l, 1000000000l> > > const&, bool, std::map<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, ceph::buffer::v15_2_0::list, std::less<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, ceph::buffer::v15_2_0::list> > >*, optional_yield)+0x396) [0x563500c87066]
> 2021-09-08T10:15:14.055276348Z 11: radosgw-admin(+0x7e1114) [0x563500b59114]
> 2021-09-08T10:15:14.055276348Z 12: radosgw-admin(+0x8faece) [0x563500c72ece]
> 2021-09-08T10:15:14.055276348Z 13: (RGWSI_MetaBackend_SObj::call(std::optional<std::variant<RGWSI_MetaBackend_CtxParams_SObj> >, std::function<int (RGWSI_MetaBackend::Context*)>)+0x9e) [0x5635007855fe]
> 2021-09-08T10:15:14.055276348Z 14: (RGWSI_MetaBackend_Handler::call(std::optional<std::variant<RGWSI_MetaBackend_CtxParams_SObj> >, std::function<int (RGWSI_MetaBackend_Handler::Op*)>)+0x5f) [0x563500c72cff]
> 2021-09-08T10:15:14.055276348Z 15: (RGWSI_MetaBackend_Handler::call(std::function<int (RGWSI_MetaBackend_Handler::Op*)>)+0x78) [0x563500b68498]
> 2021-09-08T10:15:14.055276348Z 16: (RGWUserCtl::store_info(RGWUserInfo const&, optional_yield, RGWUserCtl::PutParams const&)+0xb0) [0x563500b5bad0]
> 2021-09-08T10:15:14.055276348Z 17: (rgw_store_user_info(RGWUserCtl*, RGWUserInfo&, RGWUserInfo*, RGWObjVersionTracker*, std::chrono::time_point<ceph::real_clock, std::chrono::duration<unsigned long, std::ratio<1l, 1000000000l> > >, bool, optional_yield, std::map<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, ceph::buffer::v15_2_0::list, std::less<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, ceph::buffer::v15_2_0::list> > >*)+0x47) [0x563500b5bb97]
> 2021-09-08T10:15:14.055276348Z 18: (RGWUser::update(RGWUserAdminOpState&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >*, optional_yield)+0xb9) [0x563500b5fcd9]
> 2021-09-08T10:15:14.055276348Z 19: (RGWUser::execute_add(RGWUserAdminOpState&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >*, optional_yield)+0x733) [0x563500b64993]
> 2021-09-08T10:15:14.055276348Z 20: (RGWUser::add(RGWUserAdminOpState&, optional_yield, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >*)+0x73) [0x563500b64e43]
> 2021-09-08T10:15:14.055276348Z 21: main()
> 2021-09-08T10:15:14.055276348Z 22: __libc_start_main()
> 2021-09-08T10:15:14.055276348Z 23: _start()
> 2021-09-08T10:15:14.055276348Z 2021-09-08T10:15:09.161+0000 7ff30bc90380 1 int RGWSI_Notify::robust_notify(RGWSI_RADOS::Obj&, const RGWCacheNotifyInfo&, optional_yield):413 Invalidating obj=ocs-storagecluster-cephobjectstore.rgw.meta:users.uid:rgw-admin-ops-user tries=0. : signal: interrupt

I don't see any RGW configs that are substantially different from a working install of Rook upstream. It's possible that there is a networking issue. Petr, do the clusters you are seeing that fail have Multus enabled?

-------

One tangential problem I notice is that Rook is not retrying the health checker when creating it fails. This is something we will have to fix in Rook. It's possible (but unlikely) that the object store just needs a little more time to get up and running before the health checker is created. Because I'm pretty sure this is not the root cause, I'll create a new BZ for this issue.
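To help answer the Multus question without waiting for ssh access, the network configuration should be visible straight from the CephCluster CR; a sketch, assuming the default CR name ocs-storagecluster-cephcluster:

    # no 'provider: multus' and no selectors here would mean Multus is not in use
    oc -n openshift-storage get cephcluster ocs-storagecluster-cephcluster -o jsonpath='{.spec.network}'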
After speaking with Travis, I opted not to create a new BZ for the tangential issue. Instead, I attached a link to the upstream GitHub PR: https://github.com/rook/rook/pull/8708
From the logs, there are invalid access key errors. So either admin ops user creation failed (but that should have been logged) or the admin ops user is pointing to the wrong info.

rook-operator logs:

I | ceph-object-store-user-controller: creating ceph object user "noobaa-ceph-objectstore-user" in namespace "openshift-storage"
2021-09-08T10:15:32.118051093Z 2021-09-08 10:15:32.118001 E | ceph-object-store-user-controller: failed to reconcile failed to create/update object store user "noobaa-ceph-objectstore-user": failed to get details from ceph object user "noobaa-ceph-objectstore-user": InvalidAccessKeyId tx000000000000000000014-0061388d44-619a-ocs-storagecluster-cephobjectstore 619a-ocs-storagecluster-cephobjectstore-ocs-storagecluster-cephobjectstore
2021-09-08T10:15:32.203845375Z 2021-09-08 10:15:32.203803 I | op-mon: parsing mon endpoints: a=172.30.71.182:6789,b=172.30.32.126:6789,c=172.30.250.57:6789

rgw logs:

2021-09-08T10:15:30.670396214Z debug
2021-09-08T10:15:30.670448899Z 2021-09-08T10:15:30.668+0000 7ff79ab36700 1 ====== starting new request req=0x7ff69cc30460 =====
2021-09-08T10:15:30.670476531Z
2021-09-08T10:15:30.670501039Z debug
2021-09-08T10:15:30.670510651Z 2021-09-08T10:15:30.668+0000 7ff79ab36700 1 op->ERRORHANDLER: err_no=-2028 new_err_no=-2028
2021-09-08T10:15:30.670519116Z
2021-09-08T10:15:30.670540944Z debug
2021-09-08T10:15:30.670552697Z 2021-09-08T10:15:30.668+0000 7ff78bb18700 1 ====== req done req=0x7ff69cc30460 op status=0 http_status=403 latency=0.000000000s ======
2021-09-08T10:15:30.670561509Z
2021-09-08T10:15:30.670574088Z debug
2021-09-08T10:15:30.670583034Z 2021-09-08T10:15:30.668+0000 7ff78bb18700 1 beast: 0x7ff69cc30460: 10.129.2.14 - - [08/Sep/2021:10:15:30.668 +0000] "GET /admin/user?display-name=my%20display%20name&format=json&uid=noobaa-ceph-objectstore-user HTTP/1.1" 403 204 - "Go-http-client/1.1" - latency=0.000000000s
2021-09-08T10:15:30.670591209Z
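One way to tell which of those two cases it is (a sketch, reusing the same connection flags as the other radosgw-admin commands in this bug) is to dump the keys the RGW actually has on record for the admin ops user and compare them with what Rook is using:

    radosgw-admin user info --uid=rgw-admin-ops-user --rgw-realm=ocs-storagecluster-cephobjectstore --rgw-zonegroup=ocs-storagecluster-cephobjectstore --rgw-zone=ocs-storagecluster-cephobjectstore --cluster=openshift-storage --conf=/var/lib/rook/openshift-storage/openshift-storage.config --name=client.admin --keyring=/var/lib/rook/openshift-storage/client.admin.keyring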
After looking a bit more this morning, I don't see Multus enabled in the cluster based on the CephCluster config, so I'll have to throw that theory out.

@pbalogh as the next debugging measure, could you try restarting the Rook operator in an affected cluster to see if the "ceph-object-controller: failed to create bucket checker" log message persists the second time? You can restart the operator simply by deleting the rook-ceph-operator-* pod; it will be restarted automatically.

Also, if you have a cluster that is affected, could you give me ssh access to it so I can poke around interactively? We need to figure out why the RGW is hanging when we try to create the admin ops user. I see the RGW logs for `noobaa-ceph-objectstore-user` (and `ocs-storagecluster-cephobjectstoreuser`) that Jiffin mentions, but I don't see any logs from when Rook tried to create the `rgw-admin-ops-user`.
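For the restart, something along these lines should be enough, assuming the usual openshift-storage namespace and the app=rook-ceph-operator label:

    # delete the operator pod; the Deployment recreates it automatically
    oc -n openshift-storage delete pod -l app=rook-ceph-operator
    # then watch the new pod's logs for the bucket checker error
    oc -n openshift-storage logs -l app=rook-ceph-operator -f | grep -i 'bucket checker'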
Hey Blaine,

Sorry for the late reply. I am trying to deploy two new clusters, here:
https://ocs4-jenkins-csb-ocsqe.apps.ocp4.prod.psi.redhat.com/job/qe-deploy-ocs-cluster-prod/1850/
and here:
https://ocs4-jenkins-csb-ocsqe.apps.ocp4.prod.psi.redhat.com/job/qe-deploy-ocs-cluster-prod/1853/

I set PAUSE_BEFORE_TEARDOWN in both of the executions, so I will update here if I reproduce the issue in one of the clusters above.
As the issue is 100% reproducible on the FIPS vSphere env, I am going to destroy the second cluster, where I tried the pod restart, and keep only the cluster provided in Comment 10. The one mentioned in Comment 11 I will destroy now so as not to block resources, as I guess one cluster is enough for you.
Looks like we hit it also on an external cluster deployment here:
https://ocs4-jenkins-csb-ocsqe.apps.ocp4.prod.psi.redhat.com/job/qe-deploy-ocs-cluster-prod/1859/

Must gather:
http://magna002.ceph.redhat.com/ocsci-jenkins/openshift-clusters/j-026vu1ce33-t4an/j-026vu1ce33-t4an_20210920T121102/logs/failed_testcase_ocs_logs_1632140195/test_deployment_ocs_logs/ocs_must_gather/
http://magna002.ceph.redhat.com/ocsci-jenkins/openshift-clusters/j-026vu1ce33-t4an/j-026vu1ce33-t4an_20210920T121102/logs/failed_testcase_ocs_logs_1632140195/test_deployment_ocs_logs/ocs_must_gather/quay-io-rhceph-dev-ocs-must-gather-sha256-def36bdd32393f876c47b4e418b7f33fd2fc647a51ae30162e355bc272fd1bb1/namespaces/openshift-storage/pods/noobaa-operator-78f4d46f9c-72567/noobaa-operator/noobaa-operator/logs/current.log

Here I see the issue with noobaa-ceph-objectstore-user as well:

2021-09-20T12:57:32.226446307Z time="2021-09-20T12:57:32Z" level=info msg="SetPhase: temporary error during phase \"Configuring\"" sys=openshift-storage/noobaa
2021-09-20T12:57:32.226501210Z time="2021-09-20T12:57:32Z" level=warning msg="⏳ Temporary Error: Ceph objectstore user \"noobaa-ceph-objectstore-user\" is not ready" sys=openshift-storage/noobaa
2021-09-20T12:57:32.234035296Z time="2021-09-20T12:57:32Z" level=info msg="UpdateStatus: Done generation 1" sys=openshift-storage/noobaa
2021-09-20T12:57:33.327446183Z time="2021-09-20T12:57:33Z" level=info msg="RPC Handle: {Op: req, API: server_inter_process_api, Method: load_syst

This is not a FIPS deployment.
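The same NooBaa messages can also be followed live instead of via must-gather; a sketch, assuming the default deployment name:

    oc -n openshift-storage logs deploy/noobaa-operator -f | grep -i 'objectstore-user'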
I have verified that restarting the operator does not seem to fix the issue.

I'm investigating whether the issue discussed in this upstream PR comment could be the source of this bug:
https://github.com/rook/rook/pull/8712#issuecomment-919462273

I also created Rook PR https://github.com/rook/rook/pull/8765 to fix a second tangential issue I found while debugging this.
I cannot find the "rgw-adminops-user" from the toolbox pod, so IMO its creation got failed and I can see a repeat attempt of creating admin-ops-user in rook-operator logs without any failure messages. It is worth try the below command from rook-operator pod and see why the adminsops user creation is failing. I didn't try out because I don't want to disturb the set up radosgw-admin user create --uid rgw-admin-ops-user --display-name RGW Admin Ops User --caps buckets=*;users=*;usage=read;metadata=read;zone=read --rgw-realm=ocs-storagecluster-cephobjectstore --rgw-zonegroup=ocs-storagecluster-cephobjectstore --rgw-zone=ocs-storagecluster-cephobjectstore --cluster=openshift-storage --conf=/var/lib/rook/openshift-storage/openshift-storage.config --name=client.admin --keyring=/var/lib/rook/openshift-storage/client.admin.keyring --debug-rgw 20 --debug-rgw_sync 20--debug-ms 20
(In reply to Jiffin from comment #15)
> I cannot find the "rgw-admin-ops-user" from the toolbox pod, so IMO its creation failed, and I can see a repeat attempt at creating the admin-ops-user in the rook-operator logs without any failure messages. It is worth trying the command below from the rook-operator pod to see why the admin-ops-user creation is failing. I didn't try it out because I don't want to disturb the setup.
>
> radosgw-admin user create --uid rgw-admin-ops-user --display-name "RGW Admin Ops User" --caps "buckets=*;users=*;usage=read;metadata=read;zone=read" --rgw-realm=ocs-storagecluster-cephobjectstore --rgw-zonegroup=ocs-storagecluster-cephobjectstore --rgw-zone=ocs-storagecluster-cephobjectstore --cluster=openshift-storage --conf=/var/lib/rook/openshift-storage/openshift-storage.config --name=client.admin --keyring=/var/lib/rook/openshift-storage/client.admin.keyring --debug-rgw 20 --debug-rgw_sync 20 --debug-ms 20

Sorry, I want to rectify the above comment. Apparently, the rgw-admin-ops user did get created:

    bash-4.4$ radosgw-admin user list --rgw-realm=ocs-storagecluster-cephobjectstore --rgw-zonegroup=ocs-storagecluster-cephobjectstore --rgw-zone=ocs-storagecluster-cephobjectstore --cluster=openshift-storage --conf=/var/lib/rook/openshift-storage/openshift-storage.config --name=client.admin --keyring=/var/lib/rook/openshift-storage/client.admin.keyring
    [
        "rgw-admin-ops-user"
    ]

But if you don't specify the zone/zonegroup details, the user won't be listed:

    radosgw-admin user list --rgw-realm=ocs-storagecluster-cephobjectstore --cluster=openshift-storage --conf=/var/lib/rook/openshift-storage/openshift-storage.config --name=client.admin --keyring=/var/lib/rook/openshift-storage/client.admin.keyring
    []

And you can also see that two zones/zonegroups exist on the setup. I am not sure how it ended up like this:

    radosgw-admin zonegroup list --cluster=openshift-storage --conf=/var/lib/rook/openshift-storage/openshift-storage.config --name=client.admin --keyring=/var/lib/rook/openshift-storage/client.admin.keyring
    {
        "default_info": "",
        "zonegroups": [
            "ocs-storagecluster-cephobjectstore",
            "default"
        ]
    }

    radosgw-admin zone list --cluster=openshift-storage --conf=/var/lib/rook/openshift-storage/openshift-storage.config --name=client.admin --keyring=/var/lib/rook/openshift-storage/client.admin.keyring
    {
        "default_info": "832a8df3-9249-4253-a025-637a73ad87c9",
        "zones": [
            "ocs-storagecluster-cephobjectstore",
            "default"
        ]
    }

"ocs-storagecluster-cephobjectstore" is created by Rook while setting up the cephobjectstore/RGW. Even though the rgw daemon is configured with the proper zone/zonegroup, I can see the following message:

debug 2021-09-20T07:52:42.846+0000 7f1510545480 1 mgrc service_daemon_register rgw.15334 metadata {arch=x86_64,ceph_release=pacific,ceph_version=ceph version 16.2.0-117.el8cp (0e34bb74700060ebfaa22d99b7d2cdc037b28a57) pacific (stable),ceph_version_short=16.2.0-117.el8cp,container_hostname=rook-ceph-rgw-ocs-storagecluster-cephobjectstore-a-6555bd957jsr,container_image=quay.io/rhceph-dev/rhceph@sha256:efe1391ab28c3363308093a55f088a62eda06eab49bed707164c8db0d7dcb7dd,cpu=Intel(R) Xeon(R) Gold 6242 CPU @ 2.80GHz,distro=rhel,distro_description=Red Hat Enterprise Linux 8.4 (Ootpa),distro_version=8.4,frontend_config#0=beast port=8080 ssl_port=443 ssl_certificate=/etc/ceph/private/rgw-cert.pem ssl_private_key=/etc/ceph/private/rgw-key.pem,frontend_type#0=beast,hostname=compute-2,id=ocs.storagecluster.cephobjectstore.a,kernel_description=#1 SMP Tue Sep 7 07:07:31 EDT 2021,kernel_version=4.18.0-305.19.1.el8_4.x86_64,mem_cgroup_limit=4294967296,mem_swap_kb=0,mem_total_kb=65951952,num_handles=1,os=Linux,pid=13,pod_name=rook-ceph-rgw-ocs-storagecluster-cephobjectstore-a-6555bd957jsr,pod_namespace=openshift-storage,zone_id=832a8df3-9249-4253-a025-637a73ad87c9,zone_name=ocs-storagecluster-cephobjectstore,zonegroup_id=da6170c8-6eb2-4286-824a-ad411932e6c4,zonegroup_name=ocs-storagecluster-cephobjectstore}
debug 2021-09-20T07:52:42.847+0000 7f1510545480 -1 rgw period pusher: The new period does not contain my zonegroup!

I am guessing that, due to the above message, all the admin ops operations are failing, since the default zonegroup (not the one named "default") is not set properly, so the "rgw-admin-ops-user" cannot be validated properly.
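If the empty "default_info" for zonegroups is indeed the problem, a possible way to confirm it manually (a sketch only, with the same flags as the other commands here; Rook normally handles this itself) would be to mark the Rook-created zonegroup/zone as default and commit a new period:

    radosgw-admin zonegroup default --rgw-zonegroup=ocs-storagecluster-cephobjectstore --cluster=openshift-storage --conf=/var/lib/rook/openshift-storage/openshift-storage.config --name=client.admin --keyring=/var/lib/rook/openshift-storage/client.admin.keyring
    radosgw-admin zone default --rgw-zone=ocs-storagecluster-cephobjectstore --cluster=openshift-storage --conf=/var/lib/rook/openshift-storage/openshift-storage.config --name=client.admin --keyring=/var/lib/rook/openshift-storage/client.admin.keyring
    radosgw-admin period update --commit --rgw-realm=ocs-storagecluster-cephobjectstore --rgw-zonegroup=ocs-storagecluster-cephobjectstore --rgw-zone=ocs-storagecluster-cephobjectstore --cluster=openshift-storage --conf=/var/lib/rook/openshift-storage/openshift-storage.config --name=client.admin --keyring=/var/lib/rook/openshift-storage/client.admin.keyring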
I just confirmed with Vijay that the external cluster deployment mentioned in Comment 13 is most likely related to another TLS bug: https://bugzilla.redhat.com/show_bug.cgi?id=2004003
(In reply to Jiffin from comment #16)
> Sorry I want to rectify above comment. Apparently, the rgw-admin-ops user got created
>
> bash-4.4$ radosgw-admin user list --rgw-realm=ocs-storagecluster-cephobjectstore --rgw-zonegroup=ocs-storagecluster-cephobjectstore --rgw-zone=ocs-storagecluster-cephobjectstore --cluster=openshift-storage --conf=/var/lib/rook/openshift-storage/openshift-storage.config --name=client.admin --keyring=/var/lib/rook/openshift-storage/client.admin.keyring
> [
>     "rgw-admin-ops-user"
> ]
>
> But if you don't specify zone/zonegroup details user won't be listed
> radosgw-admin user list --rgw-realm=ocs-storagecluster-cephobjectstore --cluster=openshift-storage --conf=/var/lib/rook/openshift-storage/openshift-storage.config --name=client.admin --keyring=/var/lib/rook/openshift-storage/client.admin.keyring
> []
>
> And you can also see there are two zone/zonegroup exists on the setup. I am not sure how it endup like this
> radosgw-admin zonegroup list --cluster=openshift-storage --conf=/var/lib/rook/openshift-storage/openshift-storage.config --name=client.admin --keyring=/var/lib/rook/openshift-storage/client.admin.keyring
> {
>     "default_info": "",
>     "zonegroups": [
>         "ocs-storagecluster-cephobjectstore",
>         "default"
>     ]
> }
> radosgw-admin zone list --cluster=openshift-storage --conf=/var/lib/rook/openshift-storage/openshift-storage.config --name=client.admin --keyring=/var/lib/rook/openshift-storage/client.admin.keyring
> {
>     "default_info": "832a8df3-9249-4253-a025-637a73ad87c9",
>     "zones": [
>         "ocs-storagecluster-cephobjectstore",
>         "default"
>     ]
> }

I will investigate the zones/zonegroups to see if there is somewhere where Rook is failing to set the zone/zonegroup.

> "ocs-storagecluster-cephobjectstore" is created by the Rook while setting the cephobjectstore/RGW. Even though rgw daemon is configured with proper zone/zonegroup, I can see following message
> debug 2021-09-20T07:52:42.846+0000 7f1510545480 1 mgrc service_daemon_register rgw.15334 metadata {arch=x86_64,ceph_release=pacific,ceph_version=ceph version 16.2.0-117.el8cp (0e34bb74700060ebfaa22d99b7d2cdc037b28a57) pacific (stable),ceph_version_short=16.2.0-117.el8cp,container_hostname=rook-ceph-rgw-ocs-storagecluster-cephobjectstore-a-6555bd957jsr,container_image=quay.io/rhceph-dev/rhceph@sha256:efe1391ab28c3363308093a55f088a62eda06eab49bed707164c8db0d7dcb7dd,cpu=Intel(R) Xeon(R) Gold 6242 CPU @ 2.80GHz,distro=rhel,distro_description=Red Hat Enterprise Linux 8.4 (Ootpa),distro_version=8.4,frontend_config#0=beast port=8080 ssl_port=443 ssl_certificate=/etc/ceph/private/rgw-cert.pem ssl_private_key=/etc/ceph/private/rgw-key.pem,frontend_type#0=beast,hostname=compute-2,id=ocs.storagecluster.cephobjectstore.a,kernel_description=#1 SMP Tue Sep 7 07:07:31 EDT 2021,kernel_version=4.18.0-305.19.1.el8_4.x86_64,mem_cgroup_limit=4294967296,mem_swap_kb=0,mem_total_kb=65951952,num_handles=1,os=Linux,pid=13,pod_name=rook-ceph-rgw-ocs-storagecluster-cephobjectstore-a-6555bd957jsr,pod_namespace=openshift-storage,zone_id=832a8df3-9249-4253-a025-637a73ad87c9,zone_name=ocs-storagecluster-cephobjectstore,zonegroup_id=da6170c8-6eb2-4286-824a-ad411932e6c4,zonegroup_name=ocs-storagecluster-cephobjectstore}
> debug 2021-09-20T07:52:42.847+0000 7f1510545480 -1 rgw period pusher: The new period does not contain my zonegroup!
>
> I am guessing due to above message all the operations adminsops is failing since the default zonegroup (not the name default) is not set properly, so the "rgw-admin-ops-user" cannot be validated properly.

I already investigated the "new period does not contain my zonegroup" message, and it only appears once and corresponds to a one-time operator failure to set up multisite for the object store. The following reconcile succeeds. This may be a symptom of the underlying issue, but it is only a temporary one.
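For completeness, the period the RGWs are actually running against can be dumped with the same flags used elsewhere in this bug; its period_map should list the ocs-storagecluster-cephobjectstore zonegroup if the multisite setup completed:

    radosgw-admin period get --rgw-realm=ocs-storagecluster-cephobjectstore --cluster=openshift-storage --conf=/var/lib/rook/openshift-storage/openshift-storage.config --name=client.admin --keyring=/var/lib/rook/openshift-storage/client.admin.keyring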
Today, after playing with the cluster, I have been seeing an issue where the RGW segfaults. This was after deleting the admin ops user and restarting the operator to re-create the user. This is a different issue than first described, but I believe it may be related. I am collecting more debug info from the RGW so that someone from the RGW team can take a look.

While debugging this, I also noticed a credential leak in Rook, with an upstream issue here which I will work on fixing: https://github.com/rook/rook/issues/8778
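In case anyone else wants to capture the same data, RGW debug logging can be raised at runtime from the toolbox; a sketch (the client.rgw section applies to all RGW daemons):

    ceph config set client.rgw debug_rgw 20
    ceph config set client.rgw debug_ms 1
    # remove the overrides once the logs are collected
    ceph config rm client.rgw debug_rgw
    ceph config rm client.rgw debug_ms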
On AWS it was not reproduced:
http://magna002.ceph.redhat.com/ocsci-jenkins/openshift-clusters/j-003aif3c33-d/j-003aif3c33-d_20210922T163146/logs/failed_testcase_ocs_logs_1632331868/test_deployment_ocs_logs/

RGW is deployed only on vSphere, so the issue is most likely related to RGW and/or the FIPS cluster.
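For the FIPS correlation, the node-level FIPS state can be double-checked with something like this (compute-0 is just a placeholder node name):

    oc debug node/compute-0 -- chroot /host cat /proc/sys/crypto/fips_enabled
    # prints 1 when FIPS mode is enabled, 0 otherwise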
Hey @pbalogh@redhat.com
Hey Petr, can you please tell me whether any tests were run after setting up the ODF cluster? I can also see the following pools created:

default.rgw.log
default.rgw.control
default.rgw.meta

Similar to the zone and zonegroup, I don't have any clue why/how this can happen. There are no traces of this in the rook logs either.
Thanks Petr. In the new cluster, I guess I was able to RCA the issue: in the operator pod logs below, the

    radosgw-admin period update --commit --rgw-realm=ocs-storagecluster-cephobjectstore --rgw-zonegroup=ocs-storagecluster-cephobjectstore --rgw-zone=ocs-storagecluster-cephobjectstore --cluster=openshift-storage --conf=/var/lib/rook/openshift-storage/openshift-storage.config --name=client.admin --keyring=/var/lib/rook/openshift-storage/client.admin.keyring

command results in a segfault, which led to the inconsistency:

2021-09-23 09:59:54.129858 D | ceph-object-controller: created zone group ocs-storagecluster-cephobjectstore
2021-09-23 09:59:54.129938 D | exec: Running command: radosgw-admin zone get --rgw-realm=ocs-storagecluster-cephobjectstore --rgw-zonegroup=ocs-storagecluster-cephobjectstore --rgw-zone=ocs-storagecluster-cephobjectstore --cluster=openshift-storage --conf=/var/lib/rook/openshift-storage/openshift-storage.config --name=client.admin --keyring=/var/lib/rook/openshift-storage/client.admin.keyring
2021-09-23 09:59:54.186729 D | exec: Running command: radosgw-admin zone create --master --endpoints=https://172.30.10.69:443 --rgw-realm=ocs-storagecluster-cephobjectstore --rgw-zonegroup=ocs-storagecluster-cephobjectstore --rgw-zone=ocs-storagecluster-cephobjectstore --cluster=openshift-storage --conf=/var/lib/rook/openshift-storage/openshift-storage.config --name=client.admin --keyring=/var/lib/rook/openshift-storage/client.admin.keyring
2021-09-23 09:59:54.253637 D | ceph-object-controller: created zone ocs-storagecluster-cephobjectstore
2021-09-23 09:59:54.253715 D | exec: Running command: radosgw-admin period update --commit --rgw-realm=ocs-storagecluster-cephobjectstore --rgw-zonegroup=ocs-storagecluster-cephobjectstore --rgw-zone=ocs-storagecluster-cephobjectstore --cluster=openshift-storage --conf=/var/lib/rook/openshift-storage/openshift-storage.config --name=client.admin --keyring=/var/lib/rook/openshift-storage/client.admin.keyring
2021-09-23 09:59:54.560945 I | ceph-file-controller: start running mdses for filesystem "ocs-storagecluster-cephfilesystem"
2021-09-23 09:59:54.560982 D | exec: Running command: ceph versions --connect-timeout=15 --cluster=openshift-storage --conf=/var/lib/rook/openshift-storage/openshift-storage.config --name=client.admin --keyring=/var/lib/rook/openshift-storage/client.admin.keyring --format json
2021-09-23 09:59:54.904652 D | cephclient: {"mon":{"ceph version 16.2.0-117.el8cp (0e34bb74700060ebfaa22d99b7d2cdc037b28a57) pacific (stable)":3},"mgr":{"ceph version 16.2.0-117.el8cp (0e34bb74700060ebfaa22d99b7d2cdc037b28a57) pacific (stable)":1},"osd":{"ceph version 16.2.0-117.el8cp (0e34bb74700060ebfaa22d99b7d2cdc037b28a57) pacific (stable)":3},"mds":{},"overall":{"ceph version 16.2.0-117.el8cp (0e34bb74700060ebfaa22d99b7d2cdc037b28a57) pacific (stable)":7}}
2021-09-23 09:59:54.904675 D | cephclient: {"mon":{"ceph version 16.2.0-117.el8cp (0e34bb74700060ebfaa22d99b7d2cdc037b28a57) pacific (stable)":3},"mgr":{"ceph version 16.2.0-117.el8cp (0e34bb74700060ebfaa22d99b7d2cdc037b28a57) pacific (stable)":1},"osd":{"ceph version 16.2.0-117.el8cp (0e34bb74700060ebfaa22d99b7d2cdc037b28a57) pacific (stable)":3},"mds":{},"overall":{"ceph version 16.2.0-117.el8cp (0e34bb74700060ebfaa22d99b7d2cdc037b28a57) pacific (stable)":7}}
2021-09-23 09:59:54.904730 I | cephclient: getting or creating ceph auth key "mds.ocs-storagecluster-cephfilesystem-a"
2021-09-23 09:59:54.904741 D | exec: Running command: ceph auth get-or-create-key mds.ocs-storagecluster-cephfilesystem-a osd allow * mds allow mon allow profile mds --connect-timeout=15 --cluster=openshift-storage --conf=/var/lib/rook/openshift-storage/openshift-storage.config --name=client.admin --keyring=/var/lib/rook/openshift-storage/client.admin.keyring --format json
2021-09-23 09:59:55.284545 D | ceph-object-controller: object store "openshift-storage/ocs-storagecluster-cephobjectstore" status updated to "Failure"
2021-09-23 09:59:55.284729 D | ceph-spec: update event from a CR
2021-09-23 09:59:55.284775 D | ceph-spec: update event on CephObjectStore CR
2021-09-23 09:59:55.284888 D | clusterdisruption-controller: reconciling "openshift-storage/"
2021-09-23 09:59:55.284998 D | clusterdisruption-controller: Using default maintenance timeout: 30m0s
2021-09-23 09:59:55.286188 D | op-k8sutil: returning version v1.22.0-rc.0 instead of v1.22.0-rc.0+af080cb
2021-09-23 09:59:55.286219 D | op-k8sutil: kubernetes version fetched 1.22.0-rc.0
2021-09-23 09:59:55.286744 D | op-mds: legacy mds key rook-ceph-mds-ocs-storagecluster-cephfilesystem-a is already removed
2021-09-23 09:59:55.289490 D | op-k8sutil: returning version v1.22.0-rc.0 instead of v1.22.0-rc.0+af080cb
2021-09-23 09:59:55.289510 D | op-k8sutil: kubernetes version fetched 1.22.0-rc.0
2021-09-23 09:59:55.289701 D | exec: Running command: ceph status --connect-timeout=15 --cluster=openshift-storage --conf=/var/lib/rook/openshift-storage/openshift-storage.config --name=client.admin --keyring=/var/lib/rook/openshift-storage/client.admin.keyring --format json
2021-09-23 09:59:55.300136 D | op-cfg-keyring: creating secret for rook-ceph-mds-ocs-storagecluster-cephfilesystem-a-keyring
2021-09-23 09:59:55.325810 D | ceph-object-controller: object store "openshift-storage/ocs-storagecluster-cephobjectstore" status updated to "Failure"
2021-09-23 09:59:55.325902 E | ceph-object-controller: failed to reconcile CephObjectStore "openshift-storage/ocs-storagecluster-cephobjectstore". failed to create object store deployments: failed to configure multisite for object store: failed create ceph multisite for object-store ["ocs-storagecluster-cephobjectstore"]: failed to update period%!(EXTRA []string=[]): signal: segmentation fault (core dumped)
2021-09-23 09:59:55.325921 I | op-k8sutil: Reporting Event openshift-storage:ocs-storagecluster-cephobjectstore Warning:ReconcileFailed:failed to create object store deployments: failed to configure multisite for object store: failed create ceph multisite for object-store ["ocs-storagecluster-cephobjectstore"]: failed to update period%!(EXTRA []string=[]): signal: segmentation fault (core dumped)

And IMO the default zone/zonegroups/pools were created while debugging the issue; maybe some radosgw-admin command triggered it (I am not quite sure).

I am not sure whether this crash is similar to what Blaine found. How can I fetch the crash details? In the operator pod it says

log_file /var/lib/ceph/crash/2021-09-23T11:10:52.882440Z_f77efce7-3237-4e9e-8a54-c4f987e8e35b/log
--- end dump of recent events ---
Segmentation fault (core dumped)

but I was not able to access that file. I don't have the full output of the crash; a partial o/p is copied from the terminal.
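If the crash got posted to the cluster's crash module, its details should also be retrievable from the toolbox instead of that file; a sketch:

    ceph crash ls
    ceph crash info <crash-id>   # use an ID from the 'ceph crash ls' output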
I find it highly unlikely that someone changed the encryption from sha256 to md5, so it is equally unlikely that this is a regression. If that is so, then this should be moved to 5.0 z2. However, since you ACKed this, Matt, I will defer to you.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Red Hat Ceph Storage 5.0 Bug Fix update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2021:4105
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 500 days