Back to bug 2142983

Who When What Removed Added
Red Hat One Jira (issues.redhat.com) 2022-11-15 17:40:57 UTC Link ID Red Hat Issue Tracker RHCEPH-5615
Kamoltat (Junior) Sirivadhna 2022-11-16 20:19:34 UTC Link ID Github ceph/ceph/pull/48698
Neha Ojha 2022-12-14 21:53:21 UTC Link ID Github ceph/ceph/pull/49311
Kamoltat (Junior) Sirivadhna 2022-12-15 22:44:19 UTC Status ASSIGNED POST
Ken Dreyer (Red Hat) 2022-12-16 01:59:22 UTC CC kdreyer
Fixed In Version ceph-17.2.5-31.el9cp
Status POST MODIFIED
Ken Dreyer (Red Hat) 2022-12-16 02:35:33 UTC Flags needinfo?(pdhiran)
Veera Raghava Reddy 2022-12-16 04:58:20 UTC CC vereddy
Flags needinfo?(pdhiran)
errata-xmlrpc 2022-12-16 05:37:49 UTC Status MODIFIED ON_QA
Eliska 2022-12-19 12:50:47 UTC CC ekristov
Flags needinfo?(ksirivad)
Kamoltat (Junior) Sirivadhna 2022-12-19 21:36:14 UTC Flags needinfo?(ksirivad)
Doc Text Cause:

The variable `removed_ranks` does not discard its content on every update of the Monmap. Therefore, replacing MONs in a 2-site stretch cluster and failing over one of the sites causes connection scores (including the ranks associated with the scores) to become inconsistent.

Consequence:

Inconsistent connection scores cause deadlock during the MON election period, causing Ceph to become unresponsive. Moreover, once this happens, there is no way for the MON rank associated with the connection score to correct itself.

Fix:

The variable `removed_ranks` is now cleared on every update of the Monmap. Moreover, we added a way to make the connection scores correct themselves by executing the command `ceph daemon mon.{name} connection scores reset`.

Result:

MONs are no longer stuck in the election period and Ceph no longer becomes unresponsive when replacing monitors and failing over a site. Furthermore, we also have a way to manually force the connection scores to correct themselves.
Doc Type If docs needed, set a value Bug Fix
Eliska 2022-12-23 10:34:59 UTC Flags needinfo?(ksirivad)
Doc Text Cause:

The variable `removed_ranks` does not discard its content on every update of the Monmap. Therefore, replacing MONs in a 2-site stretch cluster and failing over one of the sites causes connection scores (including the ranks associated with the scores) to become inconsistent.

Consequence:

Inconsistent connection scores cause deadlock during the MON election period, causing Ceph to become unresponsive. Moreover, once this happens, there is no way for the MON rank associated with the connection score to correct itself.

Fix:

The variable `removed_ranks` is now cleared on every update of the Monmap. Moreover, we added a way to make the connection scores correct themselves by executing the command `ceph daemon mon.{name} connection scores reset`.

Result:

MONs are no longer stuck in the election period and Ceph no longer becomes unresponsive when replacing monitors and failing over a site. Furthermore, we also have a way to manually force the connection scores to correct themselves.
.Ceph Monitors are not stuck during failover of a site

Previously, the `removed_ranks` variable would not discard its content on every update of the monitor map.
Thus, replacing monitors in a 2-site stretch cluster and failing over one of the sites would cause connection scores, including the ranks associated with the scores, to become inconsistent.

Inconsistent connection scores would cause a deadlock during the monitor election period, which would result in Ceph becoming unresponsive.
Once this happened, there was no way for the monitor rank associated with the connection score to correct itself.

With this fix, the `removed_ranks` variable gets cleared with every update of the monitor map.
Monitors are no longer stuck in the election period and Ceph no longer becomes unresponsive when replacing monitors and failing over a site.
Moreover, there is a way to manually force the connection scores to correct themselves with the `ceph daemon mon._NAME_ connection scores reset` command.
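
The failure mode described above can be illustrated with a small toy model. This is only a sketch under stated assumptions, not Ceph's actual code: `ToyMonMap`, `apply_update`, `run`, and the `clear_on_update` flag are hypothetical names, and per-rank connection scores are modelled as a plain list indexed by rank. It shows how a `removed_ranks` set that is never cleared gets re-applied on a later, unrelated map update and drops a score that belongs to a live monitor.

```python
class ToyMonMap:
    """Hypothetical stand-in for a monitor map; not Ceph's real MonMap."""

    def __init__(self, clear_on_update):
        self.removed_ranks = set()
        # Assumption: models the fixed behaviour (True) versus the old one (False).
        self.clear_on_update = clear_on_update

    def remove_rank(self, rank):
        self.removed_ranks.add(rank)

    def finish_update(self):
        # With the fix, the set is emptied once an update has been processed.
        if self.clear_on_update:
            self.removed_ranks.clear()


def apply_update(scores, removed_ranks):
    """Drop scores of removed ranks; the surviving entries shift down."""
    return [s for rank, s in enumerate(scores) if rank not in removed_ranks]


def run(clear_on_update):
    monmap = ToyMonMap(clear_on_update)
    scores = ["a", "b", "c", "d"]  # toy scores for ranks 0..3

    # Update 1: the monitor at rank 1 is removed.
    monmap.remove_rank(1)
    scores = apply_update(scores, monmap.removed_ranks)
    monmap.finish_update()

    # Update 2: an unrelated map update with no removals.
    scores = apply_update(scores, monmap.removed_ranks)
    monmap.finish_update()
    return scores


if __name__ == "__main__":
    print("cleared every update:", run(True))
    print("never cleared:       ", run(False))
```

In this model the stale entry removes rank 1 a second time, so the score that had moved into rank 1 after the first update is lost and the remaining ranks no longer line up with their scores. Clearing the set after each update keeps the second update a no-op.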
Docs Contact ekristov
Eliska 2022-12-23 10:44:21 UTC Blocks 2126050
Kamoltat (Junior) Sirivadhna 2022-12-27 10:17:59 UTC Flags needinfo?(ksirivad)
Red Hat Bugzilla 2022-12-31 19:04:25 UTC CC mashetty
Red Hat Bugzilla 2022-12-31 19:13:39 UTC CC amathuri
Red Hat Bugzilla 2022-12-31 19:32:48 UTC CC pdhiran
QA Contact pdhiran
Red Hat Bugzilla 2022-12-31 20:00:13 UTC CC sseshasa
Red Hat Bugzilla 2022-12-31 22:37:04 UTC CC ebenahar
Red Hat Bugzilla 2022-12-31 22:43:41 UTC CC rfriedma
Red Hat Bugzilla 2022-12-31 23:43:49 UTC CC rzarzyns
Red Hat Bugzilla 2022-12-31 23:46:05 UTC CC akupczyk
Red Hat Bugzilla 2023-01-01 05:35:34 UTC Assignee ksirivad nojha
CC ksirivad
Red Hat Bugzilla 2023-01-01 05:40:01 UTC CC tserlin
Red Hat Bugzilla 2023-01-01 05:47:22 UTC CC flucifre
Red Hat Bugzilla 2023-01-01 06:02:15 UTC CC bniver
Red Hat Bugzilla 2023-01-01 06:03:42 UTC CC kdreyer
Red Hat Bugzilla 2023-01-01 06:27:22 UTC CC lflores
Red Hat Bugzilla 2023-01-01 06:29:13 UTC CC choffman
Red Hat Bugzilla 2023-01-01 07:23:10 UTC CC tnielsen
Red Hat Bugzilla 2023-01-01 08:22:23 UTC CC vkolli
Red Hat Bugzilla 2023-01-01 08:30:02 UTC CC bkunal
Red Hat Bugzilla 2023-01-01 08:39:06 UTC CC nojha
Assignee nojha nobody
Red Hat Bugzilla 2023-01-01 08:40:01 UTC CC pdhange
Red Hat Bugzilla 2023-01-01 08:47:56 UTC CC vereddy
Red Hat Bugzilla 2023-01-01 08:50:24 UTC CC vumrao
Pawan 2023-01-02 16:30:07 UTC QA Contact pdhiran
CC pdhiran
Alasdair Kergon 2023-01-04 04:40:45 UTC CC akupczyk
Alasdair Kergon 2023-01-04 04:43:11 UTC Assignee nobody ksirivad
Alasdair Kergon 2023-01-04 04:43:34 UTC CC amathuri
Alasdair Kergon 2023-01-04 05:03:42 UTC CC kdreyer
Alasdair Kergon 2023-01-04 05:08:58 UTC CC ksirivad
Alasdair Kergon 2023-01-04 05:10:58 UTC CC lflores
Alasdair Kergon 2023-01-04 05:21:38 UTC CC nojha
Alasdair Kergon 2023-01-04 05:28:18 UTC CC pdhange
Alasdair Kergon 2023-01-04 05:34:52 UTC CC rfriedma
Alasdair Kergon 2023-01-04 05:37:37 UTC CC rzarzyns
Alasdair Kergon 2023-01-04 05:49:38 UTC CC tnielsen
Alasdair Kergon 2023-01-04 05:57:35 UTC CC vkolli
Alasdair Kergon 2023-01-04 05:59:30 UTC CC vumrao
Alasdair Kergon 2023-01-04 06:09:44 UTC CC bkunal
Alasdair Kergon 2023-01-04 06:11:25 UTC CC bniver
Alasdair Kergon 2023-01-04 06:13:47 UTC CC choffman
Alasdair Kergon 2023-01-04 06:29:04 UTC CC vereddy
Alasdair Kergon 2023-01-04 06:41:59 UTC CC ebenahar
Alasdair Kergon 2023-01-04 06:43:51 UTC CC flucifre
Alasdair Kergon 2023-01-04 06:50:47 UTC CC mashetty
Alasdair Kergon 2023-01-04 06:56:31 UTC CC sseshasa
Sunil Kumar Nagaraju 2023-01-06 11:49:42 UTC CC sunnagar
Pawan 2023-01-09 08:00:03 UTC Status ON_QA VERIFIED
Red Hat Bugzilla 2023-01-09 08:30:35 UTC CC ceph-eng-bugs
Alasdair Kergon 2023-01-09 19:43:36 UTC CC ceph-eng-bugs
Red Hat Bugzilla 2023-01-31 23:38:09 UTC CC madam
errata-xmlrpc 2023-03-20 18:59:13 UTC CC tserlin
Group private
Resolution --- ERRATA
Status VERIFIED CLOSED
Last Closed 2023-03-20 18:59:13 UTC
errata-xmlrpc 2023-03-20 19:00:16 UTC Link ID Red Hat Product Errata RHBA-2023:1360
