Bug 1791792

Summary:

[RFE] please add ability to gather scheduling pause history via cmap

Product:

Red Hat Enterprise Linux 8

Reporter:

Fabio Massimo Di Nitto <fdinitto>

Component:

corosync

Assignee:

Christine Caulfield <ccaulfie>

Status:

CLOSED ERRATA

QA Contact:

cluster-qe <cluster-qe>

Severity:

unspecified

Docs Contact:

Priority:

unspecified

Version:

8.4

CC:

ccaulfie, cluster-maint, huali, jfriesse, mnovacek, phagara, sbradley

Target Milestone:

Keywords:

FutureFeature

Target Release:

8.0

Flags:

pm-rhel: mirror+

Hardware:

Unspecified

OS:

Unspecified

Whiteboard:

Fixed In Version:

corosync-3.0.3-4.el8

Doc Type:

Enhancement

Doc Text:

Feature: Add ability to gather scheduling pause history via cmap. Reason: Better support for insights. Result: Corosync stores timestamp and delay of scheduling pause to stats part of cmap - keys stats.schedmiss.X.delay and stats.schedmiss.X.timestamp. corosync-cmapctl was enhanced to allow deleting of these keys (-C schedmiss). These keys are intended for insights and shouldn't be used by users. Also message in /var/log/messages (and journal) is changed to contain timestamp (may cause possible incompatibility with automated tools) - New format is "May 19 15:49:03 XXXX corosync[XXXX]: [MAIN ] Corosync main process was not scheduled (@1589896143548) for 2196.2039 ms (threshold is 800.0000 ms). Consider token timeout increase.".

Story Points:

---

Clone Of:

Clones:

1835885 (view as bug list)

Environment:

Last Closed:

2020-11-04 03:25:51 UTC

Type:

Bug

Regression:

---

Mount Type:

---

Documentation:

---

CRM:

Verified Versions:

Category:

---

oVirt Team:

---

RHEL 7.3 requirements from Atomic Host:

Cloudforms Team:

---

Target Upstream Version:

Embargoed:

Bug Depends On:

Bug Blocks:

1835885

Attachments:

Description	Flags
stats: Add stats for scheduler misses	none
stats: Use nanoseconds from epoch for schedmiss	none
main: Add schedmiss timestamp into message	none
man: move cmap_keys man page from section 8 to 7	none

Description Fabio Massimo Di Nitto 2020-01-16 13:38:51 UTC

Description of problem:

Insights needs the ability to detect the scheduling issue without parsing logs. Having those scheduling pause events recorded in cmap would allow better report to customers of the issue.

Version-Release number of selected component (if applicable):


How reproducible:


Steps to Reproduce:
1.
2.
3.

Actual results:


Expected results:


Additional info:

Comment 3 Jan Friesse 2020-01-22 16:41:55 UTC

Created attachment 1654646 [details]
stats: Add stats for scheduler misses

stats: Add stats for scheduler misses

This patch add a stats.schedmiss.* set of entries that
are a record of the last 10 times corosync was not scheduled
in time.

These entries are keypt in reverse order (so stats.schedmiss.0.* is
always the latest one kept) and the values, including the timestamp,
are in milliseconds.

It's also possible to use a cmap tracker to follow these events, which
might be useful.

Signed-off-by: Christine Caulfield <ccaulfie>
Reviewed-by: Jan Friesse <jfriesse>

Comment 4 Jan Friesse 2020-01-23 15:08:44 UTC

For QE (and Shane):
I've tested following scenarios:

# pkill -STOP corosync && sleep 2 && pkill -CONT corosync
# corosync-cmapctl -m stats | grep schedmiss
stats.schedmiss.0.delay (flt) = 2432.800781
stats.schedmiss.0.timestamp (u64) = 5106463408980
# pkill -STOP corosync && sleep 2 && pkill -CONT corosync
# corosync-cmapctl -m stats | grep schedmiss
stats.schedmiss.0.delay (flt) = 2076.766113
stats.schedmiss.0.timestamp (u64) = 5106516822000
stats.schedmiss.1.delay (flt) = 2432.800781
stats.schedmiss.1.timestamp (u64) = 5106463408980
# for i in `seq 1 10`;do pkill -STOP corosync && sleep 2 && pkill -CONT corosync;done
# corosync-cmapctl -m stats | grep schedmiss
stats.schedmiss.0.delay (flt) = 2023.957031
stats.schedmiss.0.timestamp (u64) = 5106558848098
stats.schedmiss.1.delay (flt) = 2023.436279
stats.schedmiss.1.timestamp (u64) = 5106556824141
stats.schedmiss.2.delay (flt) = 2030.076904
stats.schedmiss.2.timestamp (u64) = 5106554800704
stats.schedmiss.3.delay (flt) = 2022.936890
stats.schedmiss.3.timestamp (u64) = 5106552770627
stats.schedmiss.4.delay (flt) = 2024.749023
stats.schedmiss.4.timestamp (u64) = 5106550747691
stats.schedmiss.5.delay (flt) = 2021.841553
stats.schedmiss.5.timestamp (u64) = 5106548722942
stats.schedmiss.6.delay (flt) = 2024.137207
stats.schedmiss.6.timestamp (u64) = 5106546701100
stats.schedmiss.7.delay (flt) = 2021.287476
stats.schedmiss.7.timestamp (u64) = 5106544676963
stats.schedmiss.8.delay (flt) = 2025.073853
stats.schedmiss.8.timestamp (u64) = 5106542655675
stats.schedmiss.9.delay (flt) = 2473.122070
stats.schedmiss.9.timestamp (u64) = 5106540630601
# corosync-cmapctl -C schedmiss
# corosync-cmapctl -m stats | grep schedmiss
# pkill -STOP corosync && sleep 2 && pkill -CONT corosync
# corosync-cmapctl -m stats | grep schedmiss
stats.schedmiss.0.delay (flt) = 2316.379639
stats.schedmiss.0.timestamp (u64) = 5106611834662
# grep /var/log/cluster/corosync.log | grep scheduled
...
Jan 23 16:00:11.920 [6349] XXX corosync warning [MAIN  ] Corosync main process was not scheduled for 2030.0769 ms (threshold is 1600.0000 ms). Consider token timeout increase.
Jan 23 16:00:13.944 [6349] XXX corosync warning [MAIN  ] Corosync main process was not scheduled for 2023.4363 ms (threshold is 1600.0000 ms). Consider token timeout increase.
Jan 23 16:00:15.968 [6349] XXX corosync warning [MAIN  ] Corosync main process was not scheduled for 2023.9570 ms (threshold is 1600.0000 ms). Consider token timeout increase.
Jan 23 16:01:08.953 [6349] XXX corosync warning [MAIN  ] Corosync main process was not scheduled for 2316.3796 ms (threshold is 1600.0000 ms). Consider token timeout increase.

Different terminal: # corosync-cmapctl -m stats -t "stats.schedmiss.0.timestamp"

# pkill -STOP corosync && sleep 2 && pkill -CONT corosync

corosync-cmapctl should output
modify> stats.schedmiss.0.timestamp (u64) = 5106939132219

Different terminal: # corosync-cmapctl -m stats -T "stats.schedmiss"
# pkill -STOP corosync && sleep 2 && pkill -CONT corosync

-> corosync-cmapctl outputs nothing, because prefix tracking of schedmiss events is deliberately ignored (doesn't make sense for ring buffer).

Comment 5 Jan Friesse 2020-01-23 17:01:11 UTC

Created attachment 1654883 [details]
stats: Use nanoseconds from epoch for schedmiss

stats: Use nanoseconds from epoch for schedmiss

Using monotonic time is not working because it doesn't have to match
time from epoch.

Signed-off-by: Jan Friesse <jfriesse>
Reviewed-by: Christine Caulfield <ccaulfie>

Comment 6 Jan Friesse 2020-01-23 17:02:23 UTC

For QE:
Please ignore the values of the stats.schedmiss.X.timestamp in previous example. They were wrong, next patch (https://bugzilla.redhat.com/attachment.cgi?id=1654883&action=diff) makes them correct.

Comment 16 Jan Friesse 2020-03-09 08:28:43 UTC

Created attachment 1668607 [details]
main: Add schedmiss timestamp into message

main: Add schedmiss timestamp into message

This is useful for matching schedmiss event in stats map with logged
event.

Signed-off-by: Jan Friesse <jfriesse>
Reviewed-by: Christine Caulfield <ccaulfie>

Comment 21 Jan Friesse 2020-04-21 06:49:12 UTC

@huanhuan: Not sure I understand question so I would try to answer, but please don't hesitate to ask more questions if my answer doesn't answer your question.

Idea is like comment #7, one small exception is using `corosync-cmapctl -m stats "stats.schedmiss."` instead of grep, because it is safer (filters really only keys beginning with "stats.schedmiss." prefix) and doesn't need grep. These items correlate with messages logged by corosync log mechanism - by default it is:
- syslog - so these messages can be found in journal and when syslog daemon is configured also in /var/log/messages - /var/log/cluster/corosync.log

Which file we decide to parse is really up-to Shane decision.

Comment 24 Patrik Hagara 2020-05-14 14:41:50 UTC

qa_ack+, repro in comment#4 and comment#6

Comment 25 Jan Friesse 2020-05-19 13:57:52 UTC

@QE:
Also please note that log message changed a bit to contain timestamp, so new format is:
"May 19 15:49:03 XXXX corosync[XXXX]:  [MAIN  ] Corosync main process was not scheduled (@1589896143548) for 2196.2039 ms (threshold is 800.0000 ms). Consider token timeout increase."

Comment 26 Jan Friesse 2020-05-26 13:05:18 UTC

Created attachment 1692262 [details]
man: move cmap_keys man page from section 8 to 7

man: move cmap_keys man page from section 8 to 7
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Section 8 is for "System administration commands", 7 is "Miscellaneous".

Signed-off-by: Ferenc Wágner <wferi>
Reviewed-by: Jan Friesse <jfriesse>

Comment 29 michal novacek 2020-09-17 12:17:33 UTC

I have verified that there is new functionality logging scheduler misses to /var/log/cluster/corosync.log corosync-3.0.3-4.el8.x86_64.

----                                           

> [root@virt-245 ~]# corosync-cmapctl -m stats | grep schedmiss
stats.schedmiss.0.delay (flt) = 2213.491455
stats.schedmiss.0.timestamp (u64) = 1600343948932
stats.schedmiss.1.delay (flt) = 2252.461426
stats.schedmiss.1.timestamp (u64) = 1600343939385
stats.schedmiss.2.delay (flt) = 2230.939209
stats.schedmiss.2.timestamp (u64) = 1600343935799

# Delete currently logged events.
> [root@virt-245 ~]# corosync-cmapctl -C schedmiss

> [root@virt-245 ~]# corosync-cmapctl -m stats | grep schedmiss

> [root@virt-245 ~]# pkill -STOP corosync && sleep 2 && pkill -CONT corosync

> [root@virt-245 ~]# corosync-cmapctl -m stats | grep schedmiss
stats.schedmiss.0.delay (flt) = 2243.752197
stats.schedmiss.0.timestamp (u64) = 1600344039511

> [root@virt-245 ~]# grep scheduled /var/log/cluster/corosync.log 
Sep 17 12:10:58 [149363] virt-245 corosync warning [MAIN  ] Corosync main process was not scheduled for 2032.3646 ms (threshold is 800.0000 ms). Consider token timeout increase.
Sep 17 12:12:09 [149363] virt-245 corosync warning [MAIN  ] Corosync main process was not scheduled for 2084.4497 ms (threshold is 800.0000 ms). Consider token timeout increase.
Sep 17 12:13:42 [152383] virt-245 corosync warning [MAIN  ] Corosync main process was not scheduled (@1600337622220) for 2113.2803 ms (threshold is 800.0000 ms). Consider token timeout increase.
Sep 17 13:58:55 [7995] virt-245 corosync warning [MAIN  ] Corosync main process was not scheduled (@1600343935799) for 2230.9392 ms (threshold is 800.0000 ms). Consider token timeout increase.
Sep 17 13:58:59 [7995] virt-245 corosync warning [MAIN  ] Corosync main process was not scheduled (@1600343939385) for 2252.4614 ms (threshold is 800.0000 ms). Consider token timeout increase.
Sep 17 13:59:08 [7995] virt-245 corosync warning [MAIN  ] Corosync main process was not scheduled (@1600343948932) for 2213.4915 ms (threshold is 800.0000 ms). Consider token timeout increase.
Sep 17 14:00:39 [7995] virt-245 corosync warning [MAIN  ] Corosync main process was not scheduled (@1600344039511) for 2243.7522 ms (threshold is 800.0000 ms). Consider token timeout increase.

Comment 31 Jan Friesse 2020-09-21 06:49:07 UTC

@Shane: It was probably you who requested insights rule so can you please update BZ (if I understand process correctly the Links section)?

Comment 36 errata-xmlrpc 2020-11-04 03:25:51 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (corosync bug fix and enhancement update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:4736