RHEL Engineering is moving the tracking of its product development work on RHEL 6 through RHEL 9 to Red Hat Jira (issues.redhat.com). If you're a Red Hat customer, please continue to file support cases via the Red Hat customer portal. If you're not, please head to the "RHEL project" in Red Hat Jira and file new tickets here. Individual Bugzilla bugs in the statuses "NEW", "ASSIGNED", and "POST" are being migrated throughout September 2023. Bugs of Red Hat partners with an assigned Engineering Partner Manager (EPM) are migrated in late September as per pre-agreed dates. Bugs against components "kernel", "kernel-rt", and "kpatch" are only migrated if still in "NEW" or "ASSIGNED". If you cannot log in to RH Jira, please consult article #7032570. That failing, please send an e-mail to the RH Jira admins at rh-issues@redhat.com to troubleshoot your issue as a user management inquiry. The email creates a ServiceNow ticket with Red Hat. Individual Bugzilla bugs that are migrated will be moved to status "CLOSED", resolution "MIGRATED", and set with "MigratedToJIRA" in "Keywords". The link to the successor Jira issue will be found under "Links", have a little "two-footprint" icon next to it, and direct you to the "RHEL project" in Red Hat Jira (issue links are of type "https://issues.redhat.com/browse/RHEL-XXXX", where "X" is a digit). This same link will be available in a blue banner at the top of the page informing you that that bug has been migrated.
Bug 1791792 - [RFE] please add ability to gather scheduling pause history via cmap
Summary: [RFE] please add ability to gather scheduling pause history via cmap
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Linux 8
Classification: Red Hat
Component: corosync
Version: 8.4
Hardware: Unspecified
OS: Unspecified
unspecified
unspecified
Target Milestone: rc
: 8.0
Assignee: Christine Caulfield
QA Contact: cluster-qe@redhat.com
URL:
Whiteboard:
Depends On:
Blocks: 1835885
TreeView+ depends on / blocked
 
Reported: 2020-01-16 13:38 UTC by Fabio Massimo Di Nitto
Modified: 2020-11-04 03:26 UTC (History)
7 users (show)

Fixed In Version: corosync-3.0.3-4.el8
Doc Type: Enhancement
Doc Text:
Feature: Add ability to gather scheduling pause history via cmap. Reason: Better support for insights. Result: Corosync stores timestamp and delay of scheduling pause to stats part of cmap - keys stats.schedmiss.X.delay and stats.schedmiss.X.timestamp. corosync-cmapctl was enhanced to allow deleting of these keys (-C schedmiss). These keys are intended for insights and shouldn't be used by users. Also message in /var/log/messages (and journal) is changed to contain timestamp (may cause possible incompatibility with automated tools) - New format is "May 19 15:49:03 XXXX corosync[XXXX]: [MAIN ] Corosync main process was not scheduled (@1589896143548) for 2196.2039 ms (threshold is 800.0000 ms). Consider token timeout increase.".
Clone Of:
: 1835885 (view as bug list)
Environment:
Last Closed: 2020-11-04 03:25:51 UTC
Type: Bug
Target Upstream Version:
Embargoed:
pm-rhel: mirror+


Attachments (Terms of Use)
stats: Add stats for scheduler misses (10.73 KB, patch)
2020-01-22 16:41 UTC, Jan Friesse
no flags Details | Diff
stats: Use nanoseconds from epoch for schedmiss (1.04 KB, patch)
2020-01-23 17:01 UTC, Jan Friesse
no flags Details | Diff
main: Add schedmiss timestamp into message (1.75 KB, patch)
2020-03-09 08:28 UTC, Jan Friesse
no flags Details | Diff
man: move cmap_keys man page from section 8 to 7 (27.78 KB, patch)
2020-05-26 13:05 UTC, Jan Friesse
no flags Details | Diff


Links
System ID Private Priority Status Summary Last Updated
Red Hat Knowledge Base (Solution) 1217663 0 None None None 2020-01-16 15:17:57 UTC
Red Hat Knowledge Base (Solution) 4340901 0 None None None 2020-01-16 15:16:43 UTC
Red Hat Product Errata RHBA-2020:4736 0 None None None 2020-11-04 03:26:02 UTC

Description Fabio Massimo Di Nitto 2020-01-16 13:38:51 UTC
Description of problem:

Insights needs the ability to detect the scheduling issue without parsing logs. Having those scheduling pause events recorded in cmap would allow better report to customers of the issue.

Version-Release number of selected component (if applicable):


How reproducible:


Steps to Reproduce:
1.
2.
3.

Actual results:


Expected results:


Additional info:

Comment 3 Jan Friesse 2020-01-22 16:41:55 UTC
Created attachment 1654646 [details]
stats: Add stats for scheduler misses

stats: Add stats for scheduler misses

This patch add a stats.schedmiss.* set of entries that
are a record of the last 10 times corosync was not scheduled
in time.

These entries are keypt in reverse order (so stats.schedmiss.0.* is
always the latest one kept) and the values, including the timestamp,
are in milliseconds.

It's also possible to use a cmap tracker to follow these events, which
might be useful.

Signed-off-by: Christine Caulfield <ccaulfie>
Reviewed-by: Jan Friesse <jfriesse>

Comment 4 Jan Friesse 2020-01-23 15:08:44 UTC
For QE (and Shane):
I've tested following scenarios:

# pkill -STOP corosync && sleep 2 && pkill -CONT corosync
# corosync-cmapctl -m stats | grep schedmiss
stats.schedmiss.0.delay (flt) = 2432.800781
stats.schedmiss.0.timestamp (u64) = 5106463408980
# pkill -STOP corosync && sleep 2 && pkill -CONT corosync
# corosync-cmapctl -m stats | grep schedmiss
stats.schedmiss.0.delay (flt) = 2076.766113
stats.schedmiss.0.timestamp (u64) = 5106516822000
stats.schedmiss.1.delay (flt) = 2432.800781
stats.schedmiss.1.timestamp (u64) = 5106463408980
# for i in `seq 1 10`;do pkill -STOP corosync && sleep 2 && pkill -CONT corosync;done
# corosync-cmapctl -m stats | grep schedmiss
stats.schedmiss.0.delay (flt) = 2023.957031
stats.schedmiss.0.timestamp (u64) = 5106558848098
stats.schedmiss.1.delay (flt) = 2023.436279
stats.schedmiss.1.timestamp (u64) = 5106556824141
stats.schedmiss.2.delay (flt) = 2030.076904
stats.schedmiss.2.timestamp (u64) = 5106554800704
stats.schedmiss.3.delay (flt) = 2022.936890
stats.schedmiss.3.timestamp (u64) = 5106552770627
stats.schedmiss.4.delay (flt) = 2024.749023
stats.schedmiss.4.timestamp (u64) = 5106550747691
stats.schedmiss.5.delay (flt) = 2021.841553
stats.schedmiss.5.timestamp (u64) = 5106548722942
stats.schedmiss.6.delay (flt) = 2024.137207
stats.schedmiss.6.timestamp (u64) = 5106546701100
stats.schedmiss.7.delay (flt) = 2021.287476
stats.schedmiss.7.timestamp (u64) = 5106544676963
stats.schedmiss.8.delay (flt) = 2025.073853
stats.schedmiss.8.timestamp (u64) = 5106542655675
stats.schedmiss.9.delay (flt) = 2473.122070
stats.schedmiss.9.timestamp (u64) = 5106540630601
# corosync-cmapctl -C schedmiss
# corosync-cmapctl -m stats | grep schedmiss
# pkill -STOP corosync && sleep 2 && pkill -CONT corosync
# corosync-cmapctl -m stats | grep schedmiss
stats.schedmiss.0.delay (flt) = 2316.379639
stats.schedmiss.0.timestamp (u64) = 5106611834662
# grep /var/log/cluster/corosync.log | grep scheduled
...
Jan 23 16:00:11.920 [6349] XXX corosync warning [MAIN  ] Corosync main process was not scheduled for 2030.0769 ms (threshold is 1600.0000 ms). Consider token timeout increase.
Jan 23 16:00:13.944 [6349] XXX corosync warning [MAIN  ] Corosync main process was not scheduled for 2023.4363 ms (threshold is 1600.0000 ms). Consider token timeout increase.
Jan 23 16:00:15.968 [6349] XXX corosync warning [MAIN  ] Corosync main process was not scheduled for 2023.9570 ms (threshold is 1600.0000 ms). Consider token timeout increase.
Jan 23 16:01:08.953 [6349] XXX corosync warning [MAIN  ] Corosync main process was not scheduled for 2316.3796 ms (threshold is 1600.0000 ms). Consider token timeout increase.

Different terminal: # corosync-cmapctl -m stats -t "stats.schedmiss.0.timestamp"

# pkill -STOP corosync && sleep 2 && pkill -CONT corosync

corosync-cmapctl should output
modify> stats.schedmiss.0.timestamp (u64) = 5106939132219

Different terminal: # corosync-cmapctl -m stats -T "stats.schedmiss"
# pkill -STOP corosync && sleep 2 && pkill -CONT corosync

-> corosync-cmapctl outputs nothing, because prefix tracking of schedmiss events is deliberately ignored (doesn't make sense for ring buffer).

Comment 5 Jan Friesse 2020-01-23 17:01:11 UTC
Created attachment 1654883 [details]
stats: Use nanoseconds from epoch for schedmiss

stats: Use nanoseconds from epoch for schedmiss

Using monotonic time is not working because it doesn't have to match
time from epoch.

Signed-off-by: Jan Friesse <jfriesse>
Reviewed-by: Christine Caulfield <ccaulfie>

Comment 6 Jan Friesse 2020-01-23 17:02:23 UTC
For QE:
Please ignore the values of the stats.schedmiss.X.timestamp in previous example. They were wrong, next patch (https://bugzilla.redhat.com/attachment.cgi?id=1654883&action=diff) makes them correct.

Comment 16 Jan Friesse 2020-03-09 08:28:43 UTC
Created attachment 1668607 [details]
main: Add schedmiss timestamp into message

main: Add schedmiss timestamp into message

This is useful for matching schedmiss event in stats map with logged
event.

Signed-off-by: Jan Friesse <jfriesse>
Reviewed-by: Christine Caulfield <ccaulfie>

Comment 21 Jan Friesse 2020-04-21 06:49:12 UTC
@huanhuan: Not sure I understand question so I would try to answer, but please don't hesitate to ask more questions if my answer doesn't answer your question.

Idea is like comment #7, one small exception is using `corosync-cmapctl -m stats "stats.schedmiss."` instead of grep, because it is safer (filters really only keys beginning with "stats.schedmiss." prefix) and doesn't need grep. These items correlate with messages logged by corosync log mechanism - by default it is:
- syslog - so these messages can be found in journal and when syslog daemon is configured also in /var/log/messages - /var/log/cluster/corosync.log

Which file we decide to parse is really up-to Shane decision.

Comment 24 Patrik Hagara 2020-05-14 14:41:50 UTC
qa_ack+, repro in comment#4 and comment#6

Comment 25 Jan Friesse 2020-05-19 13:57:52 UTC
@QE:
Also please note that log message changed a bit to contain timestamp, so new format is:
"May 19 15:49:03 XXXX corosync[XXXX]:  [MAIN  ] Corosync main process was not scheduled (@1589896143548) for 2196.2039 ms (threshold is 800.0000 ms). Consider token timeout increase."

Comment 26 Jan Friesse 2020-05-26 13:05:18 UTC
Created attachment 1692262 [details]
man: move cmap_keys man page from section 8 to 7

man: move cmap_keys man page from section 8 to 7
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Section 8 is for "System administration commands", 7 is "Miscellaneous".

Signed-off-by: Ferenc Wágner <wferi>
Reviewed-by: Jan Friesse <jfriesse>

Comment 29 michal novacek 2020-09-17 12:17:33 UTC
I have verified that there is new functionality logging scheduler misses to /var/log/cluster/corosync.log corosync-3.0.3-4.el8.x86_64.

----                                           

> [root@virt-245 ~]# corosync-cmapctl -m stats | grep schedmiss
stats.schedmiss.0.delay (flt) = 2213.491455
stats.schedmiss.0.timestamp (u64) = 1600343948932
stats.schedmiss.1.delay (flt) = 2252.461426
stats.schedmiss.1.timestamp (u64) = 1600343939385
stats.schedmiss.2.delay (flt) = 2230.939209
stats.schedmiss.2.timestamp (u64) = 1600343935799

# Delete currently logged events.
> [root@virt-245 ~]# corosync-cmapctl -C schedmiss

> [root@virt-245 ~]# corosync-cmapctl -m stats | grep schedmiss

> [root@virt-245 ~]# pkill -STOP corosync && sleep 2 && pkill -CONT corosync

> [root@virt-245 ~]# corosync-cmapctl -m stats | grep schedmiss
stats.schedmiss.0.delay (flt) = 2243.752197
stats.schedmiss.0.timestamp (u64) = 1600344039511

> [root@virt-245 ~]# grep scheduled /var/log/cluster/corosync.log 
Sep 17 12:10:58 [149363] virt-245 corosync warning [MAIN  ] Corosync main process was not scheduled for 2032.3646 ms (threshold is 800.0000 ms). Consider token timeout increase.
Sep 17 12:12:09 [149363] virt-245 corosync warning [MAIN  ] Corosync main process was not scheduled for 2084.4497 ms (threshold is 800.0000 ms). Consider token timeout increase.
Sep 17 12:13:42 [152383] virt-245 corosync warning [MAIN  ] Corosync main process was not scheduled (@1600337622220) for 2113.2803 ms (threshold is 800.0000 ms). Consider token timeout increase.
Sep 17 13:58:55 [7995] virt-245 corosync warning [MAIN  ] Corosync main process was not scheduled (@1600343935799) for 2230.9392 ms (threshold is 800.0000 ms). Consider token timeout increase.
Sep 17 13:58:59 [7995] virt-245 corosync warning [MAIN  ] Corosync main process was not scheduled (@1600343939385) for 2252.4614 ms (threshold is 800.0000 ms). Consider token timeout increase.
Sep 17 13:59:08 [7995] virt-245 corosync warning [MAIN  ] Corosync main process was not scheduled (@1600343948932) for 2213.4915 ms (threshold is 800.0000 ms). Consider token timeout increase.
Sep 17 14:00:39 [7995] virt-245 corosync warning [MAIN  ] Corosync main process was not scheduled (@1600344039511) for 2243.7522 ms (threshold is 800.0000 ms). Consider token timeout increase.

Comment 31 Jan Friesse 2020-09-21 06:49:07 UTC
@Shane: It was probably you who requested insights rule so can you please update BZ (if I understand process correctly the Links section)?

Comment 36 errata-xmlrpc 2020-11-04 03:25:51 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (corosync bug fix and enhancement update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:4736


Note You need to log in before you can comment on or make changes to this bug.