RHEL Engineering is moving the tracking of its product development work on RHEL 6 through RHEL 9 to Red Hat Jira (issues.redhat.com). If you're a Red Hat customer, please continue to file support cases via the Red Hat customer portal. If you're not, please head to the "RHEL project" in Red Hat Jira and file new tickets here. Individual Bugzilla bugs in the statuses "NEW", "ASSIGNED", and "POST" are being migrated throughout September 2023. Bugs of Red Hat partners with an assigned Engineering Partner Manager (EPM) are migrated in late September as per pre-agreed dates. Bugs against components "kernel", "kernel-rt", and "kpatch" are only migrated if still in "NEW" or "ASSIGNED". If you cannot log in to RH Jira, please consult article #7032570. That failing, please send an e-mail to the RH Jira admins at rh-issues@redhat.com to troubleshoot your issue as a user management inquiry. The email creates a ServiceNow ticket with Red Hat. Individual Bugzilla bugs that are migrated will be moved to status "CLOSED", resolution "MIGRATED", and set with "MigratedToJIRA" in "Keywords". The link to the successor Jira issue will be found under "Links", have a little "two-footprint" icon next to it, and direct you to the "RHEL project" in Red Hat Jira (issue links are of type "https://issues.redhat.com/browse/RHEL-XXXX", where "X" is a digit). This same link will be available in a blue banner at the top of the page informing you that that bug has been migrated.
Bug 1835885 - [RFE] please add ability to gather scheduling pause history via cmap
Summary: [RFE] please add ability to gather scheduling pause history via cmap
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Linux 7
Classification: Red Hat
Component: corosync
Version: 7.8
Hardware: Unspecified
OS: Unspecified
urgent
urgent
Target Milestone: rc
: ---
Assignee: Jan Friesse
QA Contact: cluster-qe@redhat.com
URL:
Whiteboard:
Depends On: 1791792
Blocks:
TreeView+ depends on / blocked
 
Reported: 2020-05-14 16:52 UTC by Shane Bradley
Modified: 2020-09-29 19:55 UTC (History)
11 users (show)

Fixed In Version: corosync-2.4.5-7.el7
Doc Type: Enhancement
Doc Text:
Feature: Add ability to gather scheduling pause history via cmap. Reason: Better support for insights. Result: Corosync stores timestamp and delay of scheduling pause to cmap keys runtime.schedmiss.delay and runtime.schedmiss.timestamp. These keys are intended for insights and shouldn't be used by users. Also message in /var/log/messages is changed to contain timestamp (may cause possible incompatibility with automated tools) - New format is "May 19 15:49:03 XXXX corosync[XXXX]: [MAIN ] Corosync main process was not scheduled (@1589896143548) for 2196.2039 ms (threshold is 800.0000 ms). Consider token timeout increase."
Clone Of: 1791792
Environment:
Last Closed: 2020-09-29 19:55:11 UTC
Target Upstream Version:
Embargoed:


Attachments (Terms of Use)
stats: Add basic schedule-miss stats to needle (2.01 KB, patch)
2020-05-19 13:31 UTC, Jan Friesse
no flags Details | Diff
main: Add schedmiss timestamp into message (1.67 KB, patch)
2020-05-19 13:31 UTC, Jan Friesse
no flags Details | Diff
main: Make schedmiss in cmap and log equal (2.01 KB, patch)
2020-06-05 13:23 UTC, Jan Friesse
no flags Details | Diff
main: Make schedmiss in cmap and log equal (2.08 KB, patch)
2020-06-05 13:43 UTC, Jan Friesse
no flags Details | Diff


Links
System ID Private Priority Status Summary Last Updated
Red Hat Product Errata RHBA-2020:3924 0 None None None 2020-09-29 19:55:20 UTC

Comment 7 Jan Friesse 2020-05-19 13:31:21 UTC
Created attachment 1689890 [details]
stats: Add basic schedule-miss stats to needle

stats: Add basic schedule-miss stats to needle

In camelback (48b6894ef41e9a06ccbb696d062d86ef60dc2c4b) we have
a much more comprehensive system for recording
schedule misses because it has a 'stats' map. This is much more basic
and just writes the last event into cmap. You can still query and track
the value though.

Signed-off-by: Christine Caulfield <ccaulfie>
Reviewed-by: Jan Friesse <jfriesse>

Comment 8 Jan Friesse 2020-05-19 13:31:29 UTC
Created attachment 1689891 [details]
main: Add schedmiss timestamp into message

main: Add schedmiss timestamp into message

This is useful for matching schedmiss event in stats map with logged
event.

(backported from master 35662dd0ec53f456445c30c0ef92892f47b25aa2)

Signed-off-by: Jan Friesse <jfriesse>
Reviewed-by: Christine Caulfield <ccaulfie>

Comment 9 Jan Friesse 2020-05-19 13:53:20 UTC
@QE (and Shane):
# pkill -STOP corosync && sleep 2 && pkill -CONT corosync
# corosync-cmapctl | grep schedmiss
runtime.schedmiss.delay (flt) = 2282.403320
runtime.schedmiss.timestamp (u64) = 1589895874915
# date --date="@$((1589895874915/1000))"
Tue May 19 15:44:34 CEST 2020
# pkill -STOP corosync && sleep 2 && pkill -CONT corosync
# corosync-cmapctl | grep schedmiss
runtime.schedmiss.delay (flt) = 2196.203857
runtime.schedmiss.timestamp (u64) = 1589896143548
# corosync-cmapctl  -d runtime.schedmiss.timestamp && corosync-cmapctl  -d runtime.schedmiss.delay
# corosync-cmapctl | grep schedmiss # Empty output
# pkill -STOP corosync && sleep 2 && pkill -CONT corosync
# corosync-cmapctl | grep schedmiss
runtime.schedmiss.delay (flt) = 2283.874512
runtime.schedmiss.timestamp (u64) = 1589896274200

Also please note log message was slightly changed to contain timestamp:
# grep 'not schedul' /var/log/messages
May 19 15:49:03 XXXX corosync[XXXX]:  [MAIN  ] Corosync main process was not scheduled (@1589896143548) for 2196.2039 ms (threshold is 800.0000 ms). Consider token timeout increase.
May 19 15:51:14 XXXX corosync[XXXX]:  [MAIN  ] Corosync main process was not scheduled (@1589896274200) for 2283.8745 ms (threshold is 800.0000 ms). Consider token timeout increase.

Comment 11 Patrik Hagara 2020-05-19 14:35:33 UTC
qa_ack+, reproducer somewhere in description (or in the cloned rhel8 bz comments https://bugzilla.redhat.com/show_bug.cgi?id=1791792#c4, https://bugzilla.redhat.com/show_bug.cgi?id=1791792#c6) and comment#9

Comment 31 Jan Friesse 2020-06-05 13:23:25 UTC
Created attachment 1695427 [details]
main: Make schedmiss in cmap and log equal

main: Make schedmiss in cmap and log equal

Second call of qb_util_nano_from_epoch_get may differ a bit. Solution is
to use previously stored timestamp (similarly as in master branch).

Also fix man page to follow similar style as other keys.

Signed-off-by: Jan Friesse <jfriesse>
Reviewed-by: Christine Caulfield <ccaulfie>

Comment 32 Jan Friesse 2020-06-05 13:43:59 UTC
Created attachment 1695430 [details]
main: Make schedmiss in cmap and log equal

main: Make schedmiss in cmap and log equal

Second call of qb_util_nano_from_epoch_get may differ a bit. Solution is
to use previously stored timestamp (similarly as in master branch).

Also fix man page to follow similar style as other keys.

Thanks Patrik Hagara <phagara> for reporting the problem.

Signed-off-by: Jan Friesse <jfriesse>
Reviewed-by: Christine Caulfield <ccaulfie>

Comment 34 Patrik Hagara 2020-06-11 16:06:44 UTC
> [root@virt-126 ~]# rpm -q corosync
> corosync-2.4.5-7.el7.x86_64
> [root@virt-126 ~]# man cmap_keys
<snip>
>        runtime.schedmiss.*
>               If corosync is not scheduled after the required period of time it will log this event and also write an entry to cmap under following keys:
> 
>               timestamp The timestamp of the last time when corosync failed to be scheduled for the required amount of time. The time is milli-seconds since the epoch.
> 
>               delay The amount of time (milliseconds as a float) that corosync was delayed.
<snip>

manual page style fixed

> [root@virt-126 ~]# pcs status
> Cluster name: STSRHTS23630
> Stack: corosync
> Current DC: virt-126 (version 1.1.23-1.el7-9acf116022) - partition with quorum
> Last updated: Thu Jun 11 16:18:13 2020
> Last change: Thu Jun 11 15:52:49 2020 by root via cibadmin on virt-126
> 
> 2 nodes configured
> 2 resource instances configured
> 
> Online: [ virt-126 virt-127 ]
> 
> Full list of resources:
> 
>  fence-virt-126	(stonith:fence_xvm):	Started virt-126
>  fence-virt-127	(stonith:fence_xvm):	Started virt-127
> 
> Daemon Status:
>   corosync: active/enabled
>   pacemaker: active/enabled
>   pcsd: active/enabled
> [root@virt-126 ~]# corosync-cmapctl | grep '^runtime\.schedmiss'
> [root@virt-126 ~]# kill -STOP $(pidof corosync) && sleep 2 && kill -CONT $(pidof corosync)
> [root@virt-126 ~]# corosync-cmapctl | grep '^runtime\.schedmiss'
> runtime.schedmiss.delay (flt) = 2025.971436
> runtime.schedmiss.timestamp (u64) = 1591885690834
> [root@virt-126 ~]# date --date @1591885690
> Thu Jun 11 16:28:10 CEST 2020
> [root@virt-126 ~]# date
> Thu Jun 11 16:28:41 CEST 2020
> [root@virt-126 ~]# kill -STOP $(pidof corosync) && sleep 2 && kill -CONT $(pidof corosync)
> [root@virt-126 ~]# corosync-cmapctl | grep '^runtime\.schedmiss'
> runtime.schedmiss.delay (flt) = 2165.545898
> runtime.schedmiss.timestamp (u64) = 1591885731683
> [root@virt-126 ~]# corosync-cmapctl -d runtime.schedmiss.delay
> [root@virt-126 ~]# corosync-cmapctl | grep '^runtime\.schedmiss'
> runtime.schedmiss.timestamp (u64) = 1591885731683
> [root@virt-126 ~]# corosync-cmapctl -d runtime.schedmiss.timestamp
> [root@virt-126 ~]# corosync-cmapctl | grep '^runtime\.schedmiss'
> [root@virt-126 ~]# kill -STOP $(pidof corosync) && sleep 2 && kill -CONT $(pidof corosync)
> [root@virt-126 ~]# corosync-cmapctl | grep '^runtime\.schedmiss'
> runtime.schedmiss.delay (flt) = 2237.884277
> runtime.schedmiss.timestamp (u64) = 1591885769290
> [root@virt-126 ~]# journalctl -u corosync -b | grep 'not scheduled'
> Jun 11 16:28:10 virt-126.cluster-qe.lab.eng.brq.redhat.com corosync[22575]:  [MAIN  ] Corosync main process was not scheduled (@1591885690834) for 2025.9714 ms (threshold is 800.0000 ms). Consider token timeout increase.
> Jun 11 16:28:51 virt-126.cluster-qe.lab.eng.brq.redhat.com corosync[22575]:  [MAIN  ] Corosync main process was not scheduled (@1591885731683) for 2165.5459 ms (threshold is 800.0000 ms). Consider token timeout increase.
> Jun 11 16:29:29 virt-126.cluster-qe.lab.eng.brq.redhat.com corosync[22575]:  [MAIN  ] Corosync main process was not scheduled (@1591885769290) for 2237.8843 ms (threshold is 800.0000 ms). Consider token timeout increase.
> [root@virt-126 ~]# grep 'not scheduled' /var/log/cluster/corosync.log
> [22574] virt-126.cluster-qe.lab.eng.brq.redhat.com corosyncwarning [MAIN  ] Corosync main process was not scheduled (@1591885690834) for 2025.9714 ms (threshold is 800.0000 ms). Consider token timeout increase.
> [22574] virt-126.cluster-qe.lab.eng.brq.redhat.com corosyncwarning [MAIN  ] Corosync main process was not scheduled (@1591885731683) for 2165.5459 ms (threshold is 800.0000 ms). Consider token timeout increase.
> [22574] virt-126.cluster-qe.lab.eng.brq.redhat.com corosyncwarning [MAIN  ] Corosync main process was not scheduled (@1591885769290) for 2237.8843 ms (threshold is 800.0000 ms). Consider token timeout increase.

timestamps match, trying a few more times just to be sure:

> [root@virt-126 ~]# while true; do kill -STOP $(pidof corosync) && sleep 2 && kill -CONT $(pidof corosync); STAMP_CMAP=$(corosync-cmapctl -g runtime.schedmiss.timestamp | cut -d ' ' -f 4); STAMP_LOG=$(journalctl -u corosync -b | grep 'not scheduled' | tail -n 1 | grep -oP '(?<=\(@)\d+(?=\))'); if [ ${STAMP_CMAP} -ne ${STAMP_LOG} ]; then echo "timestamp mismatch: ${STAMP_CMAP} != ${STAMP_LOG}" && break; else echo "ok: ${STAMP_CMAP} == ${STAMP_LOG}"; fi; sleep 30; done
> ok: 1591888846135 == 1591888846135
> ok: 1591888878295 == 1591888878295
> ok: 1591888910418 == 1591888910418
> ok: 1591888942513 == 1591888942513
> ok: 1591888974608 == 1591888974608
> ok: 1591889006709 == 1591889006709
> ok: 1591889038809 == 1591889038809
> ok: 1591889070919 == 1591889070919
> ok: 1591889103015 == 1591889103015
> ok: 1591889135196 == 1591889135196
> ok: 1591889167354 == 1591889167354
> ok: 1591889199465 == 1591889199465
> ok: 1591889231589 == 1591889231589
> ok: 1591889263673 == 1591889263673
> ok: 1591889295762 == 1591889295762
> ok: 1591889327944 == 1591889327944
> ok: 1591889360103 == 1591889360103
> ok: 1591889392214 == 1591889392214
> ok: 1591889424399 == 1591889424399
> ok: 1591889456548 == 1591889456548
> ok: 1591889488652 == 1591889488652
> ok: 1591889520746 == 1591889520746
> ok: 1591889552838 == 1591889552838
> ok: 1591889584925 == 1591889584925
> ok: 1591889617019 == 1591889617019
> ok: 1591889649101 == 1591889649101
> ok: 1591889681192 == 1591889681192
> ok: 1591889713281 == 1591889713281
> ok: 1591889745377 == 1591889745377
> ok: 1591889777464 == 1591889777464
> ok: 1591889809582 == 1591889809582
> ok: 1591889841668 == 1591889841668
> ok: 1591889873735 == 1591889873735
> ok: 1591889905819 == 1591889905819
> ok: 1591889937904 == 1591889937904
> ok: 1591889969971 == 1591889969971
> ok: 1591890002086 == 1591890002086
> ok: 1591890034201 == 1591890034201
> ok: 1591890066308 == 1591890066308
> ok: 1591890098419 == 1591890098419
> ok: 1591890130518 == 1591890130518
> ok: 1591890162624 == 1591890162624
> ok: 1591890194708 == 1591890194708
> ok: 1591890226796 == 1591890226796
> ok: 1591890258880 == 1591890258880
> ok: 1591890290970 == 1591890290970
> ok: 1591890323047 == 1591890323047
> ok: 1591890355125 == 1591890355125
> ok: 1591890387214 == 1591890387214
> ok: 1591890419300 == 1591890419300
> ok: 1591890451390 == 1591890451390
> ok: 1591890483463 == 1591890483463
> ok: 1591890515561 == 1591890515561
> ok: 1591890547653 == 1591890547653
> ok: 1591890579730 == 1591890579730
> ok: 1591890611859 == 1591890611859
> ok: 1591890643978 == 1591890643978
> ok: 1591890676124 == 1591890676124
> ok: 1591890708226 == 1591890708226
> ok: 1591890740303 == 1591890740303
> ok: 1591890772395 == 1591890772395
> ok: 1591890804487 == 1591890804487
> ok: 1591890836575 == 1591890836575
> ok: 1591890868666 == 1591890868666
> ok: 1591890900756 == 1591890900756
> ok: 1591890932883 == 1591890932883
> ok: 1591890965003 == 1591890965003
> ok: 1591890997111 == 1591890997111
> ok: 1591891029212 == 1591891029212
> ok: 1591891061309 == 1591891061309
> ok: 1591891093394 == 1591891093394
> ok: 1591891125516 == 1591891125516
> ok: 1591891157621 == 1591891157621
> ok: 1591891189723 == 1591891189723
> ok: 1591891221845 == 1591891221845
> ok: 1591891253941 == 1591891253941
> ok: 1591891286090 == 1591891286090
> ok: 1591891318212 == 1591891318212
> ok: 1591891350314 == 1591891350314
> ok: 1591891382406 == 1591891382406
> ok: 1591891414515 == 1591891414515
> ok: 1591891446606 == 1591891446606
> ok: 1591891478700 == 1591891478700
> ^C

marking verified in corosync-2.4.5-7.el7

Comment 36 Jan Friesse 2020-09-21 06:48:35 UTC
@Shane: It was probably you who requested insights rule so can you please update BZ (if I understand process correctly the Links section)?

Comment 40 errata-xmlrpc 2020-09-29 19:55:11 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (corosync bug fix and enhancement update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:3924


Note You need to log in before you can comment on or make changes to this bug.