Bug 2072667

Summary: [RFE] Implement code for Ceph to warn of clock skew on other daemons
Product: [Red Hat Storage] Red Hat Ceph Storage
Reporter: Mike Hackett <mhackett>
Component: RADOS
Assignee: Neha Ojha <nojha>
Status: CLOSED WONTFIX
QA Contact: Pawan <pdhiran>
Severity: high
Docs Contact:
Priority: unspecified
Version: 4.2
CC: akupczyk, amathuri, bhubbard, ceph-eng-bugs, choffman, gjose, kelwhite, ksirivad, lflores, lithomas, milang, mmuench, nojha, pdhange, rfriedma, rzarzyns, sseshasa, vumrao
Target Milestone: ---
Keywords: FutureFeature
Target Release: 7.0
Flags: kelwhite: needinfo? (nojha)
Last Closed: 2023-07-06 17:56:16 UTC
Type: Bug

Description Mike Hackett 2022-04-06 18:02:23 UTC
Description of problem:
Currently Ceph only warns of clock skew between monitor nodes, where synchronized clocks are needed for Paxos to run on the monitors. But other nodes, such as OSD nodes, can be impacted by clock skew as well, and this can lead to customer-impacting situations such as large numbers of flapping OSDs.
Clock sync is also required because cephx rotates keys, and a node's clock must be within the hour of rotation.
A customer replaced an OSD node, and the new node had the incorrect time set. After about an hour the customer noticed severe performance degradation and OSDs flapping throughout the cluster.

The OSD logs began showing this after an hour:
2022-03-25 19:53:40.299 7f576e5bd700  0 auth: could not find secret_id=3868
2022-03-25 19:53:40.299 7f576e5bd700  0 cephx: verify_authorizer could not get service secret for service osd secret_id=3868
2022-03-25 19:53:40.500 7f576e5bd700  0 auth: could not find secret_id=3868
2022-03-25 19:53:40.500 7f576e5bd700  0 cephx: verify_authorizer could not get service secret for service osd secret_id=3868
2022-03-25 19:53:40.902 7f576e5bd700  0 auth: could not find secret_id=3868
2022-03-25 19:53:40.902 7f576e5bd700  0 cephx: verify_authorizer could not get service secret for service osd secret_id=3868

The clock issue on the OSD node had to be fixed before the cluster settled down. 
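
The failure pattern in the OSD log excerpt above can be tallied with a short script when triaging a similar incident. This is only a sketch: the regular expression is derived from the sample lines in this report, and the function name `count_secret_failures` is made up for illustration. It is not part of Ceph.

```python
import re
from collections import Counter

# Pattern derived from the cephx failure lines quoted in this report.
SECRET_RE = re.compile(
    r"cephx: verify_authorizer could not get service secret "
    r"for service (\S+) secret_id=(\d+)"
)


def count_secret_failures(lines):
    """Count cephx failures per (service, secret_id) pair (hypothetical helper)."""
    failures = Counter()
    for line in lines:
        match = SECRET_RE.search(line)
        if match:
            failures[(match.group(1), int(match.group(2)))] += 1
    return failures


sample = [
    "2022-03-25 19:53:40.299 7f576e5bd700  0 auth: could not find secret_id=3868",
    "2022-03-25 19:53:40.299 7f576e5bd700  0 cephx: verify_authorizer could not get service secret for service osd secret_id=3868",
    "2022-03-25 19:53:40.500 7f576e5bd700  0 cephx: verify_authorizer could not get service secret for service osd secret_id=3868",
]
failures = count_secret_failures(sample)
for (service, secret_id), count in failures.items():
    print(f"service={service} secret_id={secret_id} failures={count}")
```

A steadily growing count for one secret_id across many OSD logs may be a hint to check the clocks on recently added or replaced nodes.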

Version-Release number of selected component (if applicable):
4.2

How reproducible:


Steps to Reproduce:
Lab testing results of the below steps:

~~~
1. Create a 4-node OSD cluster
2. Stop the OSDs on one node
3. Change the date/time of that node by 4-5 hours
4. Start the OSDs again
5. Wait for 2-3 hours
~~~

## Env Details:
- OSD and Mon daemons are not collocated
- Each OSD node runs 1 OSD
- Before stopping the OSD, set the noout flag to avoid the OSD being marked out unnecessarily.
- Version: RHCS 4.2z4 (14.2.11-208.el7cp)

## Results:
- After changing the time on one of the OSD nodes by 4 hours and starting the OSD again, the down OSD does not come back up.
- The ceph-osd logs report the messages below:
~~~
Apr 06 04:49:26 osds-1.ceph4k.lab.rdu2.cee.redhat.com ceph-osd[20141]: 2022-04-06 04:49:26.507 7fa360aa3a80 -1 osd.2 36 unable to obtain rotating service keys; retrying
Apr 06 04:49:36 osds-1.ceph4k.lab.rdu2.cee.redhat.com ceph-osd[20141]: 2022-04-06 04:49:36.496 7fa346908700 -1 monclient: _check_auth_rotating possible clock skew, rotating keys expired way too early (before 2022-04-06 03:49:36.497170)
Apr 06 04:49:36 osds-1.ceph4k.lab.rdu2.cee.redhat.com ceph-osd[20141]: 2022-04-06 04:49:36.499 7fa34a3ab700 -1 monclient: _check_auth_rotating possible clock skew, rotating keys expired way too early (before 2022-04-06 03:49:36.500030)
Apr 06 04:49:46 osds-1.ceph4k.lab.rdu2.cee.redhat.com ceph-osd[20141]: 2022-04-06 04:49:46.497 7fa346908700 -1 monclient: _check_auth_rotating possible clock skew, rotating keys expired way too early (before 2022-04-06 03:49:46.497428)
Apr 06 04:49:46 osds-1.ceph4k.lab.rdu2.cee.redhat.com ceph-osd[20141]: 2022-04-06 04:49:46.499 7fa34a3ab700 -1 monclient: _check_auth_rotating possible clock skew, rotating keys expired way too early (before 2022-04-06 03:49:46.500824)
Apr 06 04:49:56 osds-1.ceph4k.lab.rdu2.cee.redhat.com ceph-osd[20141]: 2022-04-06 04:49:56.496 7fa346908700 -1 monclient: _check_auth_rotating possible clock skew, rotating keys expired way too early (before 2022-04-06 03:49:56.497628)
Apr 06 04:49:56 osds-1.ceph4k.lab.rdu2.cee.redhat.com ceph-osd[20141]: 2022-04-06 04:49:56.498 7fa34a3ab700 -1 monclient: _check_auth_rotating possible clock skew, rotating keys expired way too early (before 2022-04-06 03:49:56.499895)
Apr 06 04:49:56 osds-1.ceph4k.lab.rdu2.cee.redhat.com ceph-osd[20141]: 2022-04-06 04:49:56.508 7fa360aa3a80 -1 osd.2 36 unable to obtain rotating service keys; retrying
~~~
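
As a rough aid for reading these messages, the following sketch extracts the log timestamp and the "before" bound from a `_check_auth_rotating` line and prints the gap between them. The regular expression is based only on the sample lines above, and `rotating_key_gap` is a hypothetical helper name. In the sample the gap comes out at roughly one hour, in line with the one-hour rotation window mentioned in the description. The gap is not a measurement of the actual clock offset on the node.

```python
import re
from datetime import datetime

TS = r"\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d+"
# Pattern derived from the monclient lines quoted in this report.
ROTATING_RE = re.compile(
    rf"({TS}) \S+ -1 monclient: _check_auth_rotating possible clock skew, "
    rf"rotating keys expired way too early \(before ({TS})\)"
)
TS_FORMAT = "%Y-%m-%d %H:%M:%S.%f"


def rotating_key_gap(line):
    """Return seconds between the log timestamp and the 'before' bound, or None."""
    match = ROTATING_RE.search(line)
    if match is None:
        return None
    logged = datetime.strptime(match.group(1), TS_FORMAT)
    bound = datetime.strptime(match.group(2), TS_FORMAT)
    return (logged - bound).total_seconds()


line = (
    "Apr 06 04:49:36 osds-1.ceph4k.lab.rdu2.cee.redhat.com ceph-osd[20141]: "
    "2022-04-06 04:49:36.496 7fa346908700 -1 monclient: _check_auth_rotating "
    "possible clock skew, rotating keys expired way too early "
    "(before 2022-04-06 03:49:36.497170)"
)
gap = rotating_key_gap(line)
print(f"gap between log time and bound: {gap:.3f} s")
```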


Actual results:
OSDs flap

Expected results:
Ceph reports the clock skew for the OSD node
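
A minimal sketch of the kind of report requested here, assuming a daemon's current time can be compared against a reference such as a monitor's clock. Everything in it is hypothetical: the function name `check_daemon_skew`, the warning text, and the one-second threshold are placeholders chosen for illustration, not Ceph behavior or Ceph defaults.

```python
from datetime import datetime, timedelta, timezone


def check_daemon_skew(daemon, daemon_time, reference_time,
                      allowed=timedelta(seconds=1)):
    """Return a warning string if the daemon clock differs too much, else None.

    'allowed' is an arbitrary placeholder threshold, not a Ceph setting.
    """
    skew = daemon_time - reference_time
    if abs(skew) <= allowed:
        return None
    return (f"clock skew detected on {daemon}: "
            f"{skew.total_seconds():+.0f}s relative to reference")


# Example mirroring the lab test: one OSD node set 4 hours ahead.
reference = datetime(2022, 4, 6, 0, 49, 26, tzinfo=timezone.utc)
osd_clock = reference + timedelta(hours=4)

warning = check_daemon_skew("osd.2", osd_clock, reference)
in_sync = check_daemon_skew("osd.3", reference, reference)
print(warning)
```

How the daemon's time would actually be collected (for example, piggybacked on existing messages to the monitors) is left open. The sketch only shows the comparison and the resulting warning.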

Additional info:

Comment 22 kelwhite 2024-03-12 23:54:09 UTC
Curious, why was this closed as won't fix?

Comment 23 kelwhite 2024-03-12 23:54:36 UTC
see c#22 :)