Bug 1987010 - [upgrade][rgw][ssl]: During upgrade from 4.2 with ssl configured to 5.0, rgw daemon failed with 'ERROR: failed initializing frontend'
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Ceph Storage
Classification: Red Hat Storage
Component: Ceph-Ansible
Version: 5.0
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: high
Target Milestone: ---
Target Release: 5.0
Assignee: Guillaume Abrioux
QA Contact: Madhavi Kasturi
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2021-07-28 17:39 UTC by Veera Raghava Reddy
Modified: 2021-08-30 08:31 UTC
CC: 13 users

Fixed In Version: ceph-ansible-6.0.11.1-1.el8cp
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2021-08-30 08:31:46 UTC
Embargoed:


Links
System ID Private Priority Status Summary Last Updated
Github ceph ceph-ansible pull 6775 0 None None None 2021-08-04 17:07:41 UTC
Red Hat Bugzilla 1981682 1 unspecified CLOSED [cephadm][rgw][ssl]: error 'failed initializing frontend' seen on configuring beast frontend with ssl 2021-08-30 08:36:56 UTC
Red Hat Issue Tracker RHCEPH-637 0 None None None 2021-08-30 00:25:23 UTC
Red Hat Product Errata RHBA-2021:3294 0 None None None 2021-08-30 08:31:58 UTC

Description Veera Raghava Reddy 2021-07-28 17:39:31 UTC
Description of problem:
Observed the following crash during an upgrade from 4.2 GA with SSL configured to the latest 5.0; the RGW daemon failed with 'ERROR: failed initializing frontend'. Details at http://magna002.ceph.redhat.com/ceph-qe-logs/madhavi/bz1981682/upgrade_logs/crash_failure

Jul 27 11:06:41 ceph-4-2-ssl-upgrade-28bq6h-node8 conmon[1749418]:  1: (RGWSI_Notify::distribute(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, RGWCacheNotifyInfo const&>
Jul 27 11:06:41 ceph-4-2-ssl-upgrade-28bq6h-node8 conmon[1749418]:  2: (RGWSI_SysObj_Cache::distribute_cache(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, rgw_raw_obj c>
Jul 27 11:06:41 ceph-4-2-ssl-upgrade-28bq6h-node8 conmon[1749418]:  3: (RGWSI_SysObj_Cache::write(rgw_raw_obj const&, std::chrono::time_point<ceph::real_clock, std::chrono::duration<unsigned long, std::ratio<1l>
Jul 27 11:06:41 ceph-4-2-ssl-upgrade-28bq6h-node8 conmon[1749418]:  4: (RGWSI_SysObj::Obj::WOp::write(ceph::buffer::v15_2_0::list&, optional_yield)+0x37) [0x7f39e65df837]
Jul 27 11:06:41 ceph-4-2-ssl-upgrade-28bq6h-node8 conmon[1749418]:  5: (rgw_put_system_obj(RGWSysObjectCtx&, rgw_pool const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > cons>
Jul 27 11:06:41 ceph-4-2-ssl-upgrade-28bq6h-node8 conmon[1749418]:  6: (RGWSI_MDLog::write_history(RGWMetadataLogHistory const&, RGWObjVersionTracker*, optional_yield, bool)+0x16c) [0x7f39e65daa5c]
Jul 27 11:06:41 ceph-4-2-ssl-upgrade-28bq6h-node8 conmon[1749418]:  7: (RGWSI_MDLog::init_oldest_log_period(optional_yield)+0x5ff) [0x7f39e65dc62f]
Jul 27 11:06:41 ceph-4-2-ssl-upgrade-28bq6h-node8 conmon[1749418]:  8: (RGWSI_MDLog::do_start(optional_yield)+0x10a) [0x7f39e65dc93a]
Jul 27 11:06:41 ceph-4-2-ssl-upgrade-28bq6h-node8 conmon[1749418]:  9: (RGWServiceInstance::start(optional_yield)+0x1e) [0x7f39e6601bfe]
Jul 27 11:06:41 ceph-4-2-ssl-upgrade-28bq6h-node8 conmon[1749418]:  10: (RGWServices_Def::init(ceph::common::CephContext*, bool, bool, bool, optional_yield)+0xaef) [0x7f39e660349f]
Jul 27 11:06:41 ceph-4-2-ssl-upgrade-28bq6h-node8 conmon[1749418]:  11: (RGWServices::do_init(ceph::common::CephContext*, bool, bool, bool, optional_yield)+0x26) [0x7f39e6605576]
Jul 27 11:06:41 ceph-4-2-ssl-upgrade-28bq6h-node8 conmon[1749418]:  12: (RGWRados::init_svc(bool)+0x53) [0x7f39e68a7b73]
Jul 27 11:06:41 ceph-4-2-ssl-upgrade-28bq6h-node8 conmon[1749418]:  13: (RGWRados::initialize()+0x15c) [0x7f39e68e762c]
Jul 27 11:06:41 ceph-4-2-ssl-upgrade-28bq6h-node8 conmon[1749418]:  14: (RGWStoreManager::init_storage_provider(ceph::common::CephContext*, bool, bool, bool, bool, bool, bool, bool)+0xd1) [0x7f39e698f771]
Jul 27 11:06:41 ceph-4-2-ssl-upgrade-28bq6h-node8 conmon[1749418]:  15: (radosgw_Main(int, char const**)+0x1528) [0x7f39e65a4828]
Jul 27 11:06:41 ceph-4-2-ssl-upgrade-28bq6h-node8 conmon[1749418]:  16: __libc_start_main()
Jul 27 11:06:41 ceph-4-2-ssl-upgrade-28bq6h-node8 conmon[1749418]:  17: _start()


Version-Release number of selected component (if applicable):


How reproducible:


Steps to Reproduce:
1. Deploy a Ceph cluster with RGW SSL in 4.2 GA.
2. Create a bucket and a few objects in it.
3. Upgrade to the latest 5.0 (the upgrade includes: switch_to_container -> rolling upgrade -> cephadm adopt; see the sketch after this list).
4. After cephadm adopt, the RGW service is in a failed state.
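
For reference, a rough sketch of the upgrade flow from step 3, assuming the stock ceph-ansible infrastructure playbooks; the inventory path is illustrative and the exact limits depend on the environment:

# 1. Switch the bare-metal 4.2 daemons to containers:
ansible-playbook -i hosts infrastructure-playbooks/switch-from-non-containerized-to-containerized-ceph-daemons.yml
# 2. Rolling upgrade of the containerized cluster to 5.0:
ansible-playbook -i hosts infrastructure-playbooks/rolling_update.yml
# 3. Hand the cluster over to cephadm:
ansible-playbook -i hosts infrastructure-playbooks/cephadm-adopt.yml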


Actual results:


Expected results:


Additional info:

(env) [root@ceph-4-2-ssl-upgrade-28bq6h-node8 s3_swift]# ceph health detail
HEALTH_WARN mons are allowing insecure global_id reclaim; 1 failed cephadm daemon(s); insufficient standby MDS daemons available; 1 pools have too many placement groups
[WRN] AUTH_INSECURE_GLOBAL_ID_RECLAIM_ALLOWED: mons are allowing insecure global_id reclaim
    mon.ceph-4-2-ssl-upgrade-28bq6h-node1 has auth_allow_insecure_global_id_reclaim set to true
    mon.ceph-4-2-ssl-upgrade-28bq6h-node9 has auth_allow_insecure_global_id_reclaim set to true
    mon.ceph-4-2-ssl-upgrade-28bq6h-node10 has auth_allow_insecure_global_id_reclaim set to true
[WRN] CEPHADM_FAILED_DAEMON: 1 failed cephadm daemon(s)
    daemon rgw.ceph.ceph-4-2-ssl-upgrade-28bq6h-node8.tssnwo on ceph-4-2-ssl-upgrade-28bq6h-node8 is in error state
[WRN] MDS_INSUFFICIENT_STANDBY: insufficient standby MDS daemons available
    have 0; want 1 more
[WRN] POOL_TOO_MANY_PGS: 1 pools have too many placement groups
    Pool cephfs_metadata has 64 placement groups, should have 16
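
For triage, a couple of generic commands to locate the failed daemon and read its logs on the host; the daemon name is taken from the health output above:

# List cephadm-managed daemons and their current state:
ceph orch ps
# On node8, dump the failed RGW daemon's journal via cephadm:
cephadm logs --name rgw.ceph.ceph-4-2-ssl-upgrade-28bq6h-node8.tssnwo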

setup details:
rgw node - 10.0.209.106 ; root/passwd
ansible/installer node: 10.0.209.106; root/passwd

Please find the RGW logs, the ceph-ansible upgrade logs, and the all.yaml files at http://magna002.ceph.redhat.com/ceph-qe-logs/madhavi/bz1981682/upgrade_logs/

Comment 1 Veera Raghava Reddy 2021-07-28 17:42:00 UTC
From https://bugzilla.redhat.com/show_bug.cgi?id=1981682#c29

As for the crash, I am not sure why it happened (it may have happened during the shutdown of RGW rather than during its start); it would be helpful if we could collect the logs at debug level 20.

Even after the crash, RGW tries to come up again:
Jul 27 11:06:47 ceph-4-2-ssl-upgrade-28bq6h-node8 conmon[1749418]: debug 2021-07-27T15:06:47.460+0000 7f39e72b7480  0 framework: beast
Jul 27 11:06:47 ceph-4-2-ssl-upgrade-28bq6h-node8 conmon[1749418]: debug 2021-07-27T15:06:47.460+0000 7f39e72b7480  0 framework conf key: ssl_certificate, val: config://rgw/cert/$realm/$zone.crt
Jul 27 11:06:47 ceph-4-2-ssl-upgrade-28bq6h-node8 conmon[1749418]: debug 2021-07-27T15:06:47.460+0000 7f39e72b7480  0 framework conf key: ssl_private_key, val: config://rgw/cert/$realm/$zone.key
Jul 27 11:06:47 ceph-4-2-ssl-upgrade-28bq6h-node8 conmon[1749418]: debug 2021-07-27T15:06:47.460+0000 7f39e72b7480  0 starting handler: beast
Jul 27 11:06:47 ceph-4-2-ssl-upgrade-28bq6h-node8 conmon[1749418]: debug 2021-07-27T15:06:47.462+0000 7f39e72b7480 -1 ssl_private_key was not found: rgw/cert/default/default.key
Jul 27 11:06:47 ceph-4-2-ssl-upgrade-28bq6h-node8 conmon[1749418]: debug 2021-07-27T15:06:47.463+0000 7f39a6e95700  0 RGWReshardLock::lock failed to acquire lock on reshard.0000000000 ret=-16
Jul 27 11:06:47 ceph-4-2-ssl-upgrade-28bq6h-node8 conmon[1749418]: debug 2021-07-27T15:06:47.463+0000 7f39e72b7480 -1 ssl_private_key was not found: rgw/cert/rgw.ceph
Jul 27 11:06:47 ceph-4-2-ssl-upgrade-28bq6h-node8 conmon[1749418]: debug 2021-07-27T15:06:47.463+0000 7f39e72b7480 -1 no ssl_certificate configured for ssl_port
Jul 27 11:06:47 ceph-4-2-ssl-upgrade-28bq6h-node8 conmon[1749418]: debug 2021-07-27T15:06:47.463+0000 7f39e72b7480 -1 ERROR: failed initializing frontend
Jul 27 11:06:47 ceph-4-2-ssl-upgrade-28bq6h-node8 systemd[1]: libpod-22719cf987ae1f4fd2ff2994174441cecf0caf2d4f64f15b731bdc8dfbc4b69b.scope: Succeeded.
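
The beast frontend resolves config:// values against the cluster's config-key store, so the lookups above fail because nothing is stored under rgw/cert/default/default.crt and rgw/cert/default/default.key. A minimal sketch of checking and, if needed, hand-populating those keys; the source PEM file paths are assumptions based on the pre-upgrade setup:

# See which (if any) rgw cert keys exist in the config-key store:
ceph config-key ls | grep rgw/cert
# Hand-populate the keys the frontend is looking for; the source
# file paths below are assumptions, adjust to the actual cert/key:
ceph config-key set rgw/cert/default/default.crt -i /etc/ssl/certs/server.pem
ceph config-key set rgw/cert/default/default.key -i /etc/ssl/private/server.key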

But apparently there are no TLS certs listed in the "ceph config-key ls" output, and when I tried to get the spec details, the TLS cert is missing there as well:
ceph config-key get mgr/cephadm/spec.rgw.ceph
{"created": "2021-07-27T15:02:58.077394Z", "spec": {"placement": {"count_per_host": 1, "label": "rgws"}, "service_id": "ceph", "service_name": "rgw.ceph", "service_type": "rgw", "spec": {"rgw_frontend_port": 443, "rgw_realm": "default", "rgw_zone": "default", "ssl": true}}}

In the post-upgrade logs the cert file points to "/etc/ssl/certs/server.pem". I am not sure how it can be set post-upgrade for cephadm.
It looks like we are talking about a different bug here than the original one. @sewagner, any idea how this can be done from cephadm for an existing cluster?
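
One cephadm-side approach for an existing cluster is to embed the certificate in the RGW service spec (the rgw_frontend_ssl_certificate field) and re-apply it. A minimal sketch mirroring the spec dumped above; the file name and the PEM contents are placeholders:

# Re-apply the RGW service with the TLS certificate inlined in the
# spec; the PEM block is a placeholder for the real cert (and key).
cat > rgw-ssl-spec.yaml <<'EOF'
service_type: rgw
service_id: ceph
placement:
  label: rgws
  count_per_host: 1
spec:
  rgw_realm: default
  rgw_zone: default
  rgw_frontend_port: 443
  ssl: true
  rgw_frontend_ssl_certificate: |
    -----BEGIN CERTIFICATE-----
    ...
    -----END CERTIFICATE-----
EOF
ceph orch apply -i rgw-ssl-spec.yaml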

Comment 6 Veera Raghava Reddy 2021-08-04 11:52:35 UTC
Can you look into https://bugzilla.redhat.com/show_bug.cgi?id=1987010#c5 ?

Comment 14 errata-xmlrpc 2021-08-30 08:31:46 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Red Hat Ceph Storage 5.0 bug fix and enhancement), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2021:3294

