Description of problem: Currently Ceph only warns of clock skew between monitor nodes, since PAXOS requires synchronized clocks for the monitors to run. But other nodes such as OSDs can be impacted by clock skew as well, which can lead to customer-impacting situations such as large numbers of flapping OSDs. Clock sync is required because cephx rotates keys, and clocks must be within the hour of rotation.

A customer replaced an OSD node and the node had the incorrect time set. After about an hour the customer noticed severe performance degradation and OSDs flapping throughout the cluster. The OSD logs began showing this after an hour:
~~~
2022-03-25 19:53:40.299 7f576e5bd700 0 auth: could not find secret_id=3868
2022-03-25 19:53:40.299 7f576e5bd700 0 cephx: verify_authorizer could not get service secret for service osd secret_id=3868
2022-03-25 19:53:40.500 7f576e5bd700 0 auth: could not find secret_id=3868
2022-03-25 19:53:40.500 7f576e5bd700 0 cephx: verify_authorizer could not get service secret for service osd secret_id=3868
2022-03-25 19:53:40.902 7f576e5bd700 0 auth: could not find secret_id=3868
2022-03-25 19:53:40.902 7f576e5bd700 0 cephx: verify_authorizer could not get service secret for service osd secret_id=3868
~~~
The clock issue on the OSD node had to be fixed before the cluster settled down.

Version-Release number of selected component (if applicable):
4.2

How reproducible:

Steps to Reproduce:
Lab testing results of the below steps:
~~~
1. Create a 4-node OSD cluster
2. Stop the OSDs on one node
3. Change the date/time of that node by 4-5 hours
4. Start the OSDs back up
5. Wait for 2-3 hours
~~~

## Env Details:
- Both OSD and Mon are non-collocated
- Each OSD node has 1 OSD running
- Before stopping the OSDs, set the noout flag to avoid unnecessary marking out of the OSDs
- Version: RHCS 4.2z4 (14.2.11-208.el7cp)

## Results:
- After changing the time on one of the OSD nodes by 4 hours and starting the OSD back, the down OSD does not come up online.
- And in the ceph-osd logs, the below messages are reported:
~~~
Apr 06 04:49:26 osds-1.ceph4k.lab.rdu2.cee.redhat.com ceph-osd[20141]: 2022-04-06 04:49:26.507 7fa360aa3a80 -1 osd.2 36 unable to obtain rotating service keys; retrying
Apr 06 04:49:36 osds-1.ceph4k.lab.rdu2.cee.redhat.com ceph-osd[20141]: 2022-04-06 04:49:36.496 7fa346908700 -1 monclient: _check_auth_rotating possible clock skew, rotating keys expired way too early (before 2022-04-06 03:49:36.497170)
Apr 06 04:49:36 osds-1.ceph4k.lab.rdu2.cee.redhat.com ceph-osd[20141]: 2022-04-06 04:49:36.499 7fa34a3ab700 -1 monclient: _check_auth_rotating possible clock skew, rotating keys expired way too early (before 2022-04-06 03:49:36.500030)
Apr 06 04:49:46 osds-1.ceph4k.lab.rdu2.cee.redhat.com ceph-osd[20141]: 2022-04-06 04:49:46.497 7fa346908700 -1 monclient: _check_auth_rotating possible clock skew, rotating keys expired way too early (before 2022-04-06 03:49:46.497428)
Apr 06 04:49:46 osds-1.ceph4k.lab.rdu2.cee.redhat.com ceph-osd[20141]: 2022-04-06 04:49:46.499 7fa34a3ab700 -1 monclient: _check_auth_rotating possible clock skew, rotating keys expired way too early (before 2022-04-06 03:49:46.500824)
Apr 06 04:49:56 osds-1.ceph4k.lab.rdu2.cee.redhat.com ceph-osd[20141]: 2022-04-06 04:49:56.496 7fa346908700 -1 monclient: _check_auth_rotating possible clock skew, rotating keys expired way too early (before 2022-04-06 03:49:56.497628)
Apr 06 04:49:56 osds-1.ceph4k.lab.rdu2.cee.redhat.com ceph-osd[20141]: 2022-04-06 04:49:56.498 7fa34a3ab700 -1 monclient: _check_auth_rotating possible clock skew, rotating keys expired way too early (before 2022-04-06 03:49:56.499895)
Apr 06 04:49:56 osds-1.ceph4k.lab.rdu2.cee.redhat.com ceph-osd[20141]: 2022-04-06 04:49:56.508 7fa360aa3a80 -1 osd.2 36 unable to obtain rotating service keys; retrying
~~~

Actual results:
OSDs flap

Expected results:
We report the clock skew for the OSD node, as we already do for monitor nodes

Additional info:
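The failure mode can be modeled simply: cephx service keys rotate on a fixed TTL (one hour by default), and a daemon whose clock is skewed by more than roughly one rotation period ends up requesting a secret_id the monitors no longer (or do not yet) serve, producing the "could not find secret_id" errors above. A minimal sketch of that window check, under stated assumptions: the slot arithmetic and the "monitors accept adjacent key slots" rule below are illustrative simplifications, not Ceph's actual implementation.

```python
AUTH_TTL = 3600  # assumed cephx rotating-key TTL in seconds (1 hour)

def current_secret_id(now: int, ttl: int = AUTH_TTL) -> int:
    """Which rotating-key slot a daemon would request at time `now`.
    (Illustrative: Ceph's real secret_id bookkeeping differs.)"""
    return now // ttl

def keys_usable(daemon_now: int, mon_now: int, ttl: int = AUTH_TTL) -> bool:
    """Assume the monitors honor only the previous/current/next key slot,
    so a daemon skewed by much more than one TTL asks for an unknown
    secret_id and fails authentication."""
    return abs(current_secret_id(daemon_now, ttl) - current_secret_id(mon_now, ttl)) <= 1

mon_time = 1_649_217_600          # monitor wall clock (epoch seconds)
ok_osd = mon_time + 600           # 10 minutes of skew: still within a slot
bad_osd = mon_time + 4 * 3600     # 4 hours of skew, as in the lab repro

print(keys_usable(ok_osd, mon_time))   # True  -> authentication keeps working
print(keys_usable(bad_osd, mon_time))  # False -> "could not find secret_id"
```

This is why the repro's 4-5 hour offset reliably breaks authentication after the key rotation window passes, while the monitors' existing skew warning (which only covers mon-to-mon drift) stays silent.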
Curious, why was this closed as won't fix?
see c#22 :)