Bug 1987010

Summary: [upgrade][rgw][ssl]: During upgrade from 4.2 with SSL configured to 5.0, the rgw daemon failed with "ERROR: failed initializing frontend"
Product: [Red Hat Storage] Red Hat Ceph Storage
Reporter: Veera Raghava Reddy <vereddy>
Component: Ceph-Ansible
Assignee: Guillaume Abrioux <gabrioux>
Status: CLOSED ERRATA
QA Contact: Madhavi Kasturi <mkasturi>
Severity: high
Priority: unspecified
Version: 5.0
CC: aoconnor, aschoen, ceph-eng-bugs, dsavinea, gabrioux, gmeno, gsitlani, jthottan, nthomas, sewagner, tserlin, vimishra, ykaul
Target Milestone: ---
Target Release: 5.0
Hardware: Unspecified
OS: Unspecified
Fixed In Version: ceph-ansible-6.0.11.1-1.el8cp
Doc Type: If docs needed, set a value
Last Closed: 2021-08-30 08:31:46 UTC
Type: Bug

Description Veera Raghava Reddy 2021-07-28 17:39:31 UTC
Description of problem:
Observed the following crash during an upgrade from 4.2 GA with SSL configured to the latest 5.0; the rgw daemon failed with "ERROR: failed initializing frontend". Details at http://magna002.ceph.redhat.com/ceph-qe-logs/madhavi/bz1981682/upgrade_logs/crash_failure

Jul 27 11:06:41 ceph-4-2-ssl-upgrade-28bq6h-node8 conmon[1749418]:  1: (RGWSI_Notify::distribute(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, RGWCacheNotifyInfo const&>
Jul 27 11:06:41 ceph-4-2-ssl-upgrade-28bq6h-node8 conmon[1749418]:  2: (RGWSI_SysObj_Cache::distribute_cache(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, rgw_raw_obj c>
Jul 27 11:06:41 ceph-4-2-ssl-upgrade-28bq6h-node8 conmon[1749418]:  3: (RGWSI_SysObj_Cache::write(rgw_raw_obj const&, std::chrono::time_point<ceph::real_clock, std::chrono::duration<unsigned long, std::ratio<1l>
Jul 27 11:06:41 ceph-4-2-ssl-upgrade-28bq6h-node8 conmon[1749418]:  4: (RGWSI_SysObj::Obj::WOp::write(ceph::buffer::v15_2_0::list&, optional_yield)+0x37) [0x7f39e65df837]
Jul 27 11:06:41 ceph-4-2-ssl-upgrade-28bq6h-node8 conmon[1749418]:  5: (rgw_put_system_obj(RGWSysObjectCtx&, rgw_pool const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > cons>
Jul 27 11:06:41 ceph-4-2-ssl-upgrade-28bq6h-node8 conmon[1749418]:  6: (RGWSI_MDLog::write_history(RGWMetadataLogHistory const&, RGWObjVersionTracker*, optional_yield, bool)+0x16c) [0x7f39e65daa5c]
Jul 27 11:06:41 ceph-4-2-ssl-upgrade-28bq6h-node8 conmon[1749418]:  7: (RGWSI_MDLog::init_oldest_log_period(optional_yield)+0x5ff) [0x7f39e65dc62f]
Jul 27 11:06:41 ceph-4-2-ssl-upgrade-28bq6h-node8 conmon[1749418]:  8: (RGWSI_MDLog::do_start(optional_yield)+0x10a) [0x7f39e65dc93a]
Jul 27 11:06:41 ceph-4-2-ssl-upgrade-28bq6h-node8 conmon[1749418]:  9: (RGWServiceInstance::start(optional_yield)+0x1e) [0x7f39e6601bfe]
Jul 27 11:06:41 ceph-4-2-ssl-upgrade-28bq6h-node8 conmon[1749418]:  10: (RGWServices_Def::init(ceph::common::CephContext*, bool, bool, bool, optional_yield)+0xaef) [0x7f39e660349f]
Jul 27 11:06:41 ceph-4-2-ssl-upgrade-28bq6h-node8 conmon[1749418]:  11: (RGWServices::do_init(ceph::common::CephContext*, bool, bool, bool, optional_yield)+0x26) [0x7f39e6605576]
Jul 27 11:06:41 ceph-4-2-ssl-upgrade-28bq6h-node8 conmon[1749418]:  12: (RGWRados::init_svc(bool)+0x53) [0x7f39e68a7b73]
Jul 27 11:06:41 ceph-4-2-ssl-upgrade-28bq6h-node8 conmon[1749418]:  13: (RGWRados::initialize()+0x15c) [0x7f39e68e762c]
Jul 27 11:06:41 ceph-4-2-ssl-upgrade-28bq6h-node8 conmon[1749418]:  14: (RGWStoreManager::init_storage_provider(ceph::common::CephContext*, bool, bool, bool, bool, bool, bool, bool)+0xd1) [0x7f39e698f771]
Jul 27 11:06:41 ceph-4-2-ssl-upgrade-28bq6h-node8 conmon[1749418]:  15: (radosgw_Main(int, char const**)+0x1528) [0x7f39e65a4828]
Jul 27 11:06:41 ceph-4-2-ssl-upgrade-28bq6h-node8 conmon[1749418]:  16: __libc_start_main()
Jul 27 11:06:41 ceph-4-2-ssl-upgrade-28bq6h-node8 conmon[1749418]:  17: _start()


Version-Release number of selected component (if applicable):


How reproducible:


Steps to Reproduce:
1. Deploy a Ceph cluster with RGW SSL on 4.2 GA.
2. Create a bucket and a few objects in it.
3. Upgrade to the latest 5.0 (the upgrade includes switch_to_containers -> rolling upgrade -> cephadm adopt; see the sketch after this list).
4. After cephadm adopt, the rgw service is in a failed state.
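
For reference, the upgrade flow in step 3 roughly maps to the following ceph-ansible playbook runs (a sketch; the inventory path is an assumption):

# Run from the ceph-ansible directory on the installer node.
# 1. Move the bare-metal 4.2 daemons into containers:
ansible-playbook -i hosts infrastructure-playbooks/switch-from-non-containerized-to-containerized-ceph-daemons.yml
# 2. Rolling upgrade to the 5.0 containers:
ansible-playbook -i hosts infrastructure-playbooks/rolling_update.yml
# 3. Hand management of the cluster over to cephadm:
ansible-playbook -i hosts infrastructure-playbooks/cephadm-adopt.yml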


Actual results:


Expected results:


Additional info:

(env) [root@ceph-4-2-ssl-upgrade-28bq6h-node8 s3_swift]# ceph health detail
HEALTH_WARN mons are allowing insecure global_id reclaim; 1 failed cephadm daemon(s); insufficient standby MDS daemons available; 1 pools have too many placement groups
[WRN] AUTH_INSECURE_GLOBAL_ID_RECLAIM_ALLOWED: mons are allowing insecure global_id reclaim
    mon.ceph-4-2-ssl-upgrade-28bq6h-node1 has auth_allow_insecure_global_id_reclaim set to true
    mon.ceph-4-2-ssl-upgrade-28bq6h-node9 has auth_allow_insecure_global_id_reclaim set to true
    mon.ceph-4-2-ssl-upgrade-28bq6h-node10 has auth_allow_insecure_global_id_reclaim set to true
[WRN] CEPHADM_FAILED_DAEMON: 1 failed cephadm daemon(s)
    daemon rgw.ceph.ceph-4-2-ssl-upgrade-28bq6h-node8.tssnwo on ceph-4-2-ssl-upgrade-28bq6h-node8 is in error state
[WRN] MDS_INSUFFICIENT_STANDBY: insufficient standby MDS daemons available
    have 0; want 1 more
[WRN] POOL_TOO_MANY_PGS: 1 pools have too many placement groups
    Pool cephfs_metadata has 64 placement groups, should have 16
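
Note: the AUTH_INSECURE_GLOBAL_ID_RECLAIM_ALLOWED warning is expected right after an upgrade and is unrelated to the rgw failure; once all clients run patched versions it is normally cleared with:

# Disallow insecure global_id reclaim once every client is upgraded;
# the mon warnings above should clear afterwards.
ceph config set mon auth_allow_insecure_global_id_reclaim false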

Setup details:
rgw node: 10.0.209.106; root/passwd
ansible/installer node: 10.0.209.106; root/passwd

Please find attached the rgw logs, ceph-ansible upgrade logs, and all.yml files at http://magna002.ceph.redhat.com/ceph-qe-logs/madhavi/bz1981682/upgrade_logs/

Comment 1 Veera Raghava Reddy 2021-07-28 17:42:00 UTC
From https://bugzilla.redhat.com/show_bug.cgi?id=1981682#c29

As for the crash, I am not sure why it happened (it may have occurred during shutdown of RGW rather than during startup). It would be helpful if we can collect the logs at debug level 20; one way to do that is sketched below.
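
One way to collect those logs (a sketch; the config entity name is inferred from the failed daemon name in the health output above):

# Raise the log level for the failing RGW daemon, then restart it so the
# next startup (or crash) is captured at debug level 20.
ceph config set client.rgw.ceph.ceph-4-2-ssl-upgrade-28bq6h-node8.tssnwo debug_rgw 20
ceph config set client.rgw.ceph.ceph-4-2-ssl-upgrade-28bq6h-node8.tssnwo debug_ms 1
ceph orch daemon restart rgw.ceph.ceph-4-2-ssl-upgrade-28bq6h-node8.tssnwo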

Even after the crash, RGW tries to come up again:
Jul 27 11:06:47 ceph-4-2-ssl-upgrade-28bq6h-node8 conmon[1749418]: debug 2021-07-27T15:06:47.460+0000 7f39e72b7480  0 framework: beast
Jul 27 11:06:47 ceph-4-2-ssl-upgrade-28bq6h-node8 conmon[1749418]: debug 2021-07-27T15:06:47.460+0000 7f39e72b7480  0 framework conf key: ssl_certificate, val: config://rgw/cert/$realm/$zone.crt
Jul 27 11:06:47 ceph-4-2-ssl-upgrade-28bq6h-node8 conmon[1749418]: debug 2021-07-27T15:06:47.460+0000 7f39e72b7480  0 framework conf key: ssl_private_key, val: config://rgw/cert/$realm/$zone.key
Jul 27 11:06:47 ceph-4-2-ssl-upgrade-28bq6h-node8 conmon[1749418]: debug 2021-07-27T15:06:47.460+0000 7f39e72b7480  0 starting handler: beast
Jul 27 11:06:47 ceph-4-2-ssl-upgrade-28bq6h-node8 conmon[1749418]: debug 2021-07-27T15:06:47.462+0000 7f39e72b7480 -1 ssl_private_key was not found: rgw/cert/default/default.key
Jul 27 11:06:47 ceph-4-2-ssl-upgrade-28bq6h-node8 conmon[1749418]: debug 2021-07-27T15:06:47.463+0000 7f39a6e95700  0 RGWReshardLock::lock failed to acquire lock on reshard.0000000000 ret=-16
Jul 27 11:06:47 ceph-4-2-ssl-upgrade-28bq6h-node8 conmon[1749418]: debug 2021-07-27T15:06:47.463+0000 7f39e72b7480 -1 ssl_private_key was not found: rgw/cert/rgw.ceph
Jul 27 11:06:47 ceph-4-2-ssl-upgrade-28bq6h-node8 conmon[1749418]: debug 2021-07-27T15:06:47.463+0000 7f39e72b7480 -1 no ssl_certificate configured for ssl_port
Jul 27 11:06:47 ceph-4-2-ssl-upgrade-28bq6h-node8 conmon[1749418]: debug 2021-07-27T15:06:47.463+0000 7f39e72b7480 -1 ERROR: failed initializing frontend
Jul 27 11:06:47 ceph-4-2-ssl-upgrade-28bq6h-node8 systemd[1]: libpod-22719cf987ae1f4fd2ff2994174441cecf0caf2d4f64f15b731bdc8dfbc4b69b.scope: Succeeded.

But apparently there are no TLS certs listed in the "ceph config-key ls" output, and when I tried to get the spec details, the TLS cert is missing there as well:
ceph config-key get mgr/cephadm/spec.rgw.ceph
{"created": "2021-07-27T15:02:58.077394Z", "spec": {"placement": {"count_per_host": 1, "label": "rgws"}, "service_id": "ceph", "service_name": "rgw.ceph", "service_type": "rgw", "spec": {"rgw_frontend_port": 443, "rgw_realm": "default", "rgw_zone": "default", "ssl": true}}}

In the post-upgrade logs the cert file points to "/etc/ssl/certs/server.pem"; I am not sure how the certificate can be wired up post-upgrade under cephadm.
It looks like we are talking about a different bug here than the original one. @sewagner, any idea how this can be done from cephadm for an existing cluster? One possible approach is sketched below.
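
For an existing cluster, one approach cephadm supports is re-applying the RGW service spec with the certificate embedded via rgw_frontend_ssl_certificate (a sketch; the file path is an assumption and the PEM contents are placeholders):

# Embed the cert+key in the service spec; cephadm then stores it in the
# config-key store for the rgw daemons.
cat > /root/rgw-ssl.yaml <<'EOF'
service_type: rgw
service_id: ceph
placement:
  count_per_host: 1
  label: rgws
spec:
  rgw_realm: default
  rgw_zone: default
  rgw_frontend_port: 443
  ssl: true
  rgw_frontend_ssl_certificate: |
    -----BEGIN CERTIFICATE-----
    (certificate from /etc/ssl/certs/server.pem)
    -----END CERTIFICATE-----
    -----BEGIN RSA PRIVATE KEY-----
    (matching private key)
    -----END RSA PRIVATE KEY-----
EOF
ceph orch apply -i /root/rgw-ssl.yaml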

Comment 6 Veera Raghava Reddy 2021-08-04 11:52:35 UTC
Can you look into https://bugzilla.redhat.com/show_bug.cgi?id=1987010#c5 ?

Comment 14 errata-xmlrpc 2021-08-30 08:31:46 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Red Hat Ceph Storage 5.0 bug fix and enhancement), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2021:3294