2072667 – [RFE] Implement code for Ceph to warn of clock skew on other daemons

Bug 2072667 - [RFE] Implement code for Ceph to warn of clock skew on other daemons [NEEDINFO]

Summary: [RFE] Implement code for Ceph to warn of clock skew on other daemons

Keywords:
Status:	CLOSED WONTFIX
Alias:	None
Product:	Red Hat Ceph Storage
Classification:	Red Hat Storage
Component:	RADOS
Sub Component:
Version:	4.2
Hardware:	Unspecified
OS:	Unspecified
Priority:	unspecified
Severity:	high
Target Milestone:	---
Target Release:	7.0
Assignee:	Neha Ojha
QA Contact:	Pawan
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:

Reported:	2022-04-06 18:02 UTC by Mike Hackett
Modified:	2024-03-12 23:54 UTC
CC List:	18 users
Fixed In Version:
Doc Type:	If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed:	2023-07-06 17:56:16 UTC
Embargoed:
Dependent Products:
Flags:	kelwhite: needinfo? (nojha)

Links
System	ID	Private	Priority	Status	Summary	Last Updated
Ceph Project Bug Tracker	55213	0	None	None	None	2022-04-06 18:33:01 UTC
Red Hat Issue Tracker	RHCEPH-3934	0	None	None	None	2022-04-06 19:15:57 UTC

Description Mike Hackett 2022-04-06 18:02:23 UTC

Description of problem:
Currently Ceph only warns of clock skew between monitor nodes, because Paxos on the monitors needs synchronized clocks to run. Other daemons such as OSDs can be affected by clock skew as well, which can lead to customer-impacting situations such as large numbers of flapping OSDs.
Clock synchronization is also required when cephx rotates keys: the clocks must be within the hour of rotation.
A customer replaced an OSD node, and the new node had the incorrect time set. After about an hour the customer noticed severe performance degradation and OSDs flapping throughout the cluster.

The OSD logs began showing this after an hour:
2022-03-25 19:53:40.299 7f576e5bd700  0 auth: could not find secret_id=3868
2022-03-25 19:53:40.299 7f576e5bd700  0 cephx: verify_authorizer could not get service secret for service osd secret_id=3868
2022-03-25 19:53:40.500 7f576e5bd700  0 auth: could not find secret_id=3868
2022-03-25 19:53:40.500 7f576e5bd700  0 cephx: verify_authorizer could not get service secret for service osd secret_id=3868
2022-03-25 19:53:40.902 7f576e5bd700  0 auth: could not find secret_id=3868
2022-03-25 19:53:40.902 7f576e5bd700  0 cephx: verify_authorizer could not get service secret for service osd secret_id=3868

The clock issue on the OSD node had to be fixed before the cluster settled down. 

Version-Release number of selected component (if applicable):
4.2

How reproducible:


Steps to Reproduce:
Lab testing results of the below steps:

~~~
1. Create a 4-node OSD cluster
2. Stop the OSDs on one node
3. Change the date/time of the node by 4-5 hours.
4. Start the OSDs again
5. Wait for 2-3 hours
~~~

## Env Details:
- OSDs and Mons are non-collocated
- Each OSD node runs 1 OSD
- Before stopping the OSD, set the noout flag to avoid unnecessarily marking the OSD out.
- Version: RHCS 4.2z4 (14.2.11-208.el7cp)
- Version: RHCS 4.2z4 (14.2.11-208.el7cp)

## Results:
- After changing the time on one of the OSD nodes by 4 hours and starting the OSD again, the down OSD does not come back up.
- The ceph-osd logs report the messages below:
~~~
Apr 06 04:49:26 osds-1.ceph4k.lab.rdu2.cee.redhat.com ceph-osd[20141]: 2022-04-06 04:49:26.507 7fa360aa3a80 -1 osd.2 36 unable to obtain rotating service keys; retrying
Apr 06 04:49:36 osds-1.ceph4k.lab.rdu2.cee.redhat.com ceph-osd[20141]: 2022-04-06 04:49:36.496 7fa346908700 -1 monclient: _check_auth_rotating possible clock skew, rotating keys expired way too early (before 2022-04-06 03:49:36.497170)
Apr 06 04:49:36 osds-1.ceph4k.lab.rdu2.cee.redhat.com ceph-osd[20141]: 2022-04-06 04:49:36.499 7fa34a3ab700 -1 monclient: _check_auth_rotating possible clock skew, rotating keys expired way too early (before 2022-04-06 03:49:36.500030)
Apr 06 04:49:46 osds-1.ceph4k.lab.rdu2.cee.redhat.com ceph-osd[20141]: 2022-04-06 04:49:46.497 7fa346908700 -1 monclient: _check_auth_rotating possible clock skew, rotating keys expired way too early (before 2022-04-06 03:49:46.497428)
Apr 06 04:49:46 osds-1.ceph4k.lab.rdu2.cee.redhat.com ceph-osd[20141]: 2022-04-06 04:49:46.499 7fa34a3ab700 -1 monclient: _check_auth_rotating possible clock skew, rotating keys expired way too early (before 2022-04-06 03:49:46.500824)
Apr 06 04:49:56 osds-1.ceph4k.lab.rdu2.cee.redhat.com ceph-osd[20141]: 2022-04-06 04:49:56.496 7fa346908700 -1 monclient: _check_auth_rotating possible clock skew, rotating keys expired way too early (before 2022-04-06 03:49:56.497628)
Apr 06 04:49:56 osds-1.ceph4k.lab.rdu2.cee.redhat.com ceph-osd[20141]: 2022-04-06 04:49:56.498 7fa34a3ab700 -1 monclient: _check_auth_rotating possible clock skew, rotating keys expired way too early (before 2022-04-06 03:49:56.499895)
Apr 06 04:49:56 osds-1.ceph4k.lab.rdu2.cee.redhat.com ceph-osd[20141]: 2022-04-06 04:49:56.508 7fa360aa3a80 -1 osd.2 36 unable to obtain rotating service keys; retrying
~~~
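
For triage, the log signatures quoted in this report can be counted with a short script. This is only a sketch: the function and pattern names (`scan`, `PATTERNS`, `parse_ts`) are made up for illustration, the regular expressions are derived solely from the log lines pasted above, and other Ceph versions may word these messages differently. For the "rotating keys expired" lines it also computes the gap between the log timestamp and the "before" timestamp printed in the message.

```python
import re
from collections import Counter
from datetime import datetime

# Timestamp format used by the ceph-osd lines quoted in this report.
TS = r"\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d+"

# Hypothetical names; patterns are copied from the log excerpts above.
PATTERNS = {
    "rotating_keys_expired": re.compile(
        rf"(?P<logged>{TS}) .*possible clock skew, rotating keys expired "
        rf"way too early \(before (?P<cutoff>{TS})\)"
    ),
    "no_rotating_keys": re.compile(r"unable to obtain rotating service keys"),
    "missing_secret": re.compile(r"could not find secret_id=(?P<secret_id>\d+)"),
}


def parse_ts(text):
    """Parse a 'YYYY-MM-DD HH:MM:SS.fraction' timestamp."""
    return datetime.strptime(text, "%Y-%m-%d %H:%M:%S.%f")


def scan(lines):
    """Count suspicious lines per signature and collect logged-minus-cutoff gaps (seconds)."""
    counts = Counter()
    gaps = []
    for line in lines:
        for name, pattern in PATTERNS.items():
            match = pattern.search(line)
            if not match:
                continue
            counts[name] += 1
            if name == "rotating_keys_expired":
                delta = parse_ts(match["logged"]) - parse_ts(match["cutoff"])
                gaps.append(delta.total_seconds())
            break  # one signature per line is enough
    return counts, gaps


if __name__ == "__main__":
    import sys

    counts, gaps = scan(sys.stdin)
    for name, count in counts.items():
        print(f"{name}: {count}")
    if gaps:
        print(f"largest logged-vs-cutoff gap: {max(gaps):.1f}s")
```

It could be fed with the OSD journal or log file on standard input. A non-zero count for any signature is only a hint to check the node's clock, not proof of skew.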


Actual results:
OSD flap

Expected results:
We report the clock skew for the OSD node
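
As a rough illustration of the requested behaviour, and not Ceph's actual implementation, the check could look like the sketch below. It assumes the reference side (for example a monitor) already has a timestamp reported by the daemon; the function name `clock_skew_warning` and the 1-second default threshold are hypothetical, and network latency is ignored.

```python
from datetime import datetime, timedelta


def clock_skew_warning(daemon, daemon_now, reference_now,
                       allowed=timedelta(seconds=1)):
    """Return a warning string if the daemon clock differs from the
    reference clock by more than `allowed`, otherwise None.

    Hypothetical helper: the name and the default threshold are
    assumptions made for this sketch.
    """
    skew = daemon_now - reference_now
    if abs(skew) <= allowed:
        return None
    direction = "ahead of" if skew > timedelta(0) else "behind"
    seconds = abs(skew).total_seconds()
    return f"clock skew detected on {daemon}: {seconds:.0f}s {direction} reference"


if __name__ == "__main__":
    reference = datetime(2022, 4, 6, 0, 49, 26)
    skewed = reference + timedelta(hours=4)  # node clock set 4 hours ahead
    print(clock_skew_warning("osd.2", skewed, reference))
```

With the 4-hour offset used in the lab test, this reports osd.2 as 14400 seconds ahead of the reference clock.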

Additional info:

Comment 22 kelwhite 2024-03-12 23:54:09 UTC

Curious, why was this closed as won't fix?

Comment 23 kelwhite 2024-03-12 23:54:36 UTC

see c#22 :)
