Description of problem:

Observed the following crash during an upgrade from 4.2 GA (with SSL configured) to the latest 5.0; the RGW daemon failed with "ERROR: failed initializing frontend". Details at http://magna002.ceph.redhat.com/ceph-qe-logs/madhavi/bz1981682/upgrade_logs/crash_failure

Jul 27 11:06:41 ceph-4-2-ssl-upgrade-28bq6h-node8 conmon[1749418]:
 1: (RGWSI_Notify::distribute(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, RGWCacheNotifyInfo const&>
 2: (RGWSI_SysObj_Cache::distribute_cache(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, rgw_raw_obj c>
 3: (RGWSI_SysObj_Cache::write(rgw_raw_obj const&, std::chrono::time_point<ceph::real_clock, std::chrono::duration<unsigned long, std::ratio<1l>
 4: (RGWSI_SysObj::Obj::WOp::write(ceph::buffer::v15_2_0::list&, optional_yield)+0x37) [0x7f39e65df837]
 5: (rgw_put_system_obj(RGWSysObjectCtx&, rgw_pool const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > cons>
 6: (RGWSI_MDLog::write_history(RGWMetadataLogHistory const&, RGWObjVersionTracker*, optional_yield, bool)+0x16c) [0x7f39e65daa5c]
 7: (RGWSI_MDLog::init_oldest_log_period(optional_yield)+0x5ff) [0x7f39e65dc62f]
 8: (RGWSI_MDLog::do_start(optional_yield)+0x10a) [0x7f39e65dc93a]
 9: (RGWServiceInstance::start(optional_yield)+0x1e) [0x7f39e6601bfe]
 10: (RGWServices_Def::init(ceph::common::CephContext*, bool, bool, bool, optional_yield)+0xaef) [0x7f39e660349f]
 11: (RGWServices::do_init(ceph::common::CephContext*, bool, bool, bool, optional_yield)+0x26) [0x7f39e6605576]
 12: (RGWRados::init_svc(bool)+0x53) [0x7f39e68a7b73]
 13: (RGWRados::initialize()+0x15c) [0x7f39e68e762c]
 14: (RGWStoreManager::init_storage_provider(ceph::common::CephContext*, bool, bool, bool, bool, bool, bool, bool)+0xd1) [0x7f39e698f771]
 15: (radosgw_Main(int, char const**)+0x1528) [0x7f39e65a4828]
 16: __libc_start_main()
 17: _start()

Version-Release number of selected component (if applicable):

How reproducible:

Steps to Reproduce:
1. Ceph cluster deployed with RGW SSL in 4.2 GA.
2. Created a bucket and a few objects in it.
3. Upgraded to the latest 5.0 (upgrade includes: switch_to_container -> rolling upgrade -> cephadm adopt).
4. After cephadm adopt, the RGW service is in a failed state.

Actual results:

Expected results:

Additional info:

(env) [root@ceph-4-2-ssl-upgrade-28bq6h-node8 s3_swift]# ceph health detail
HEALTH_WARN mons are allowing insecure global_id reclaim; 1 failed cephadm daemon(s); insufficient standby MDS daemons available; 1 pools have too many placement groups
[WRN] AUTH_INSECURE_GLOBAL_ID_RECLAIM_ALLOWED: mons are allowing insecure global_id reclaim
    mon.ceph-4-2-ssl-upgrade-28bq6h-node1 has auth_allow_insecure_global_id_reclaim set to true
    mon.ceph-4-2-ssl-upgrade-28bq6h-node9 has auth_allow_insecure_global_id_reclaim set to true
    mon.ceph-4-2-ssl-upgrade-28bq6h-node10 has auth_allow_insecure_global_id_reclaim set to true
[WRN] CEPHADM_FAILED_DAEMON: 1 failed cephadm daemon(s)
    daemon rgw.ceph.ceph-4-2-ssl-upgrade-28bq6h-node8.tssnwo on ceph-4-2-ssl-upgrade-28bq6h-node8 is in error state
[WRN] MDS_INSUFFICIENT_STANDBY: insufficient standby MDS daemons available
    have 0; want 1 more
[WRN] POOL_TOO_MANY_PGS: 1 pools have too many placement groups
    Pool cephfs_metadata has 64 placement groups, should have 16

Setup details:
rgw node - 10.0.209.106; root/passwd
ansible/installer node: 10.0.209.106; root/passwd

PFA the rgw logs, ceph-ansible upgrade logs, and all.yamls at http://magna002.ceph.redhat.com/ceph-qe-logs/madhavi/bz1981682/upgrade_logs/
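For reference, the AUTH_INSECURE_GLOBAL_ID_RECLAIM_ALLOWED warning above is unrelated to the RGW failure; it is the standard post-upgrade warning and can be cleared with the usual mon setting once all clients and daemons are upgraded. A sketch (standard Ceph commands, not specific to this cluster):

```shell
# Only after all clients/daemons run a patched version:
# disallow insecure global_id reclaim to clear the health warning
ceph config set mon auth_allow_insecure_global_id_reclaim false

# confirm the warning is gone
ceph health detail
```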
From https://bugzilla.redhat.com/show_bug.cgi?id=1981682#c29

As for the crash, I am not sure why it happened (it may have occurred during shutdown of RGW rather than during startup); it would be helpful if we could collect the logs at debug level 20. Even after the crash, RGW tries to come up again:

Jul 27 11:06:47 ceph-4-2-ssl-upgrade-28bq6h-node8 conmon[1749418]: debug 2021-07-27T15:06:47.460+0000 7f39e72b7480 0 framework: beast
Jul 27 11:06:47 ceph-4-2-ssl-upgrade-28bq6h-node8 conmon[1749418]: debug 2021-07-27T15:06:47.460+0000 7f39e72b7480 0 framework conf key: ssl_certificate, val: config://rgw/cert/$realm/$zone.crt
Jul 27 11:06:47 ceph-4-2-ssl-upgrade-28bq6h-node8 conmon[1749418]: debug 2021-07-27T15:06:47.460+0000 7f39e72b7480 0 framework conf key: ssl_private_key, val: config://rgw/cert/$realm/$zone.key
Jul 27 11:06:47 ceph-4-2-ssl-upgrade-28bq6h-node8 conmon[1749418]: debug 2021-07-27T15:06:47.460+0000 7f39e72b7480 0 starting handler: beast
Jul 27 11:06:47 ceph-4-2-ssl-upgrade-28bq6h-node8 conmon[1749418]: debug 2021-07-27T15:06:47.462+0000 7f39e72b7480 -1 ssl_private_key was not found: rgw/cert/default/default.key
Jul 27 11:06:47 ceph-4-2-ssl-upgrade-28bq6h-node8 conmon[1749418]: debug 2021-07-27T15:06:47.463+0000 7f39a6e95700 0 RGWReshardLock::lock failed to acquire lock on reshard.0000000000 ret=-16
Jul 27 11:06:47 ceph-4-2-ssl-upgrade-28bq6h-node8 conmon[1749418]: debug 2021-07-27T15:06:47.463+0000 7f39e72b7480 -1 ssl_private_key was not found: rgw/cert/rgw.ceph
Jul 27 11:06:47 ceph-4-2-ssl-upgrade-28bq6h-node8 conmon[1749418]: debug 2021-07-27T15:06:47.463+0000 7f39e72b7480 -1 no ssl_certificate configured for ssl_port
Jul 27 11:06:47 ceph-4-2-ssl-upgrade-28bq6h-node8 conmon[1749418]: debug 2021-07-27T15:06:47.463+0000 7f39e72b7480 -1 ERROR: failed initializing frontend
Jul 27 11:06:47 ceph-4-2-ssl-upgrade-28bq6h-node8 systemd[1]: libpod-22719cf987ae1f4fd2ff2994174441cecf0caf2d4f64f15b731bdc8dfbc4b69b.scope: Succeeded.
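A sketch of how the requested debug-level-20 logs could be collected (standard `ceph config` / `ceph orch` commands; the daemon name below is the failed daemon reported by `ceph health detail`):

```shell
# Raise RGW (and messenger) verbosity for all rgw daemons
ceph config set client.rgw debug_rgw 20
ceph config set client.rgw debug_ms 1

# Restart the failed daemon so its startup is logged at the new level
ceph orch daemon restart rgw.ceph.ceph-4-2-ssl-upgrade-28bq6h-node8.tssnwo

# Capture the daemon's journal on the host (unit name pattern is assumed)
journalctl -u 'ceph-*@rgw*' --since "10 minutes ago" > rgw-debug.log
```

Remember to lower `debug_rgw` back to its default afterwards, since level 20 is very verbose.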
But apparently there are no TLS certs listed in "ceph config-key ls", and when I tried to get the spec details, the TLS cert is missing there as well:

# ceph config-key get mgr/cephadm/spec.rgw.ceph
{"created": "2021-07-27T15:02:58.077394Z", "spec": {"placement": {"count_per_host": 1, "label": "rgws"}, "service_id": "ceph", "service_name": "rgw.ceph", "service_type": "rgw", "spec": {"rgw_frontend_port": 443, "rgw_realm": "default", "rgw_zone": "default", "ssl": true}}}

In the post-upgrade logs the cert file points to "/etc/ssl/certs/server.pem"; I am not sure how that path can be carried over post-upgrade by cephadm. It looks like we are talking about a different bug here than the original one. @sewagner, any idea how this can be done from cephadm for an existing cluster?
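Since the beast frontend in the log above resolves its certificate from config://rgw/cert/$realm/$zone.crt, one way to repair an adopted cluster would be to re-apply the RGW service spec with the certificate embedded, so cephadm stores it where the frontend looks. A sketch matching the spec shown above (the PEM content is a placeholder, and the field layout is the cephadm RGW service-spec format):

```yaml
# rgw.yaml - re-apply the adopted rgw service with the TLS certificate embedded
service_type: rgw
service_id: ceph
placement:
  count_per_host: 1
  label: rgws
spec:
  rgw_realm: default
  rgw_zone: default
  rgw_frontend_port: 443
  ssl: true
  rgw_frontend_ssl_certificate: |
    -----BEGIN CERTIFICATE-----
    ...
    -----END CERTIFICATE-----
```

This would be applied with `ceph orch apply -i rgw.yaml`; whether that is the supported path for clusters adopted from ceph-ansible is exactly the question posed above.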
Can you look into https://bugzilla.redhat.com/show_bug.cgi?id=1987010#c5 ?
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Red Hat Ceph Storage 5.0 bug fix and enhancement), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2021:3294