Bug 1835885

Summary: [RFE] please add ability to gather scheduling pause history via cmap
Product: Red Hat Enterprise Linux 7 Reporter: Shane Bradley <sbradley>
Component: corosyncAssignee: Jan Friesse <jfriesse>
Status: CLOSED ERRATA QA Contact: cluster-qe <cluster-qe>
Severity: urgent Docs Contact:
Priority: urgent    
Version: 7.8CC: ccaulfie, cfeist, cluster-maint, cluster-qe, coughlan, fdinitto, huali, jfriesse, phagara, pvlasin, sbradley
Target Milestone: rcKeywords: FutureFeature
Target Release: ---   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: corosync-2.4.5-7.el7 Doc Type: Enhancement
Doc Text:
Feature: Add ability to gather scheduling pause history via cmap. Reason: Better support for insights. Result: Corosync stores timestamp and delay of scheduling pause to cmap keys runtime.schedmiss.delay and runtime.schedmiss.timestamp. These keys are intended for insights and shouldn't be used by users. Also message in /var/log/messages is changed to contain timestamp (may cause possible incompatibility with automated tools) - New format is "May 19 15:49:03 XXXX corosync[XXXX]: [MAIN ] Corosync main process was not scheduled (@1589896143548) for 2196.2039 ms (threshold is 800.0000 ms). Consider token timeout increase."
Story Points: ---
Clone Of: 1791792 Environment:
Last Closed: 2020-09-29 19:55:11 UTC Type: ---
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On: 1791792    
Bug Blocks:    
Attachments:
Description Flags
stats: Add basic schedule-miss stats to needle
none
main: Add schedmiss timestamp into message
none
main: Make schedmiss in cmap and log equal
none
main: Make schedmiss in cmap and log equal none

Comment 7 Jan Friesse 2020-05-19 13:31:21 UTC
Created attachment 1689890 [details]
stats: Add basic schedule-miss stats to needle

stats: Add basic schedule-miss stats to needle

In camelback (48b6894ef41e9a06ccbb696d062d86ef60dc2c4b) we have
a much more comprehensive system for recording
schedule misses because it has a 'stats' map. This is much more basic
and just writes the last event into cmap. You can still query and track
the value though.

Signed-off-by: Christine Caulfield <ccaulfie>
Reviewed-by: Jan Friesse <jfriesse>

Comment 8 Jan Friesse 2020-05-19 13:31:29 UTC
Created attachment 1689891 [details]
main: Add schedmiss timestamp into message

main: Add schedmiss timestamp into message

This is useful for matching schedmiss event in stats map with logged
event.

(backported from master 35662dd0ec53f456445c30c0ef92892f47b25aa2)

Signed-off-by: Jan Friesse <jfriesse>
Reviewed-by: Christine Caulfield <ccaulfie>

Comment 9 Jan Friesse 2020-05-19 13:53:20 UTC
@QE (and Shane):
# pkill -STOP corosync && sleep 2 && pkill -CONT corosync
# corosync-cmapctl | grep schedmiss
runtime.schedmiss.delay (flt) = 2282.403320
runtime.schedmiss.timestamp (u64) = 1589895874915
# date --date="@$((1589895874915/1000))"
Tue May 19 15:44:34 CEST 2020
# pkill -STOP corosync && sleep 2 && pkill -CONT corosync
# corosync-cmapctl | grep schedmiss
runtime.schedmiss.delay (flt) = 2196.203857
runtime.schedmiss.timestamp (u64) = 1589896143548
# corosync-cmapctl  -d runtime.schedmiss.timestamp && corosync-cmapctl  -d runtime.schedmiss.delay
# corosync-cmapctl | grep schedmiss # Empty output
# pkill -STOP corosync && sleep 2 && pkill -CONT corosync
# corosync-cmapctl | grep schedmiss
runtime.schedmiss.delay (flt) = 2283.874512
runtime.schedmiss.timestamp (u64) = 1589896274200

Also please note log message was slightly changed to contain timestamp:
# grep 'not schedul' /var/log/messages
May 19 15:49:03 XXXX corosync[XXXX]:  [MAIN  ] Corosync main process was not scheduled (@1589896143548) for 2196.2039 ms (threshold is 800.0000 ms). Consider token timeout increase.
May 19 15:51:14 XXXX corosync[XXXX]:  [MAIN  ] Corosync main process was not scheduled (@1589896274200) for 2283.8745 ms (threshold is 800.0000 ms). Consider token timeout increase.

Comment 11 Patrik Hagara 2020-05-19 14:35:33 UTC
qa_ack+, reproducer somewhere in description (or in the cloned rhel8 bz comments https://bugzilla.redhat.com/show_bug.cgi?id=1791792#c4, https://bugzilla.redhat.com/show_bug.cgi?id=1791792#c6) and comment#9

Comment 31 Jan Friesse 2020-06-05 13:23:25 UTC
Created attachment 1695427 [details]
main: Make schedmiss in cmap and log equal

main: Make schedmiss in cmap and log equal

Second call of qb_util_nano_from_epoch_get may differ a bit. Solution is
to use previously stored timestamp (similarly as in master branch).

Also fix man page to follow similar style as other keys.

Signed-off-by: Jan Friesse <jfriesse>
Reviewed-by: Christine Caulfield <ccaulfie>

Comment 32 Jan Friesse 2020-06-05 13:43:59 UTC
Created attachment 1695430 [details]
main: Make schedmiss in cmap and log equal

main: Make schedmiss in cmap and log equal

Second call of qb_util_nano_from_epoch_get may differ a bit. Solution is
to use previously stored timestamp (similarly as in master branch).

Also fix man page to follow similar style as other keys.

Thanks Patrik Hagara <phagara> for reporting the problem.

Signed-off-by: Jan Friesse <jfriesse>
Reviewed-by: Christine Caulfield <ccaulfie>

Comment 34 Patrik Hagara 2020-06-11 16:06:44 UTC
> [root@virt-126 ~]# rpm -q corosync
> corosync-2.4.5-7.el7.x86_64
> [root@virt-126 ~]# man cmap_keys
<snip>
>        runtime.schedmiss.*
>               If corosync is not scheduled after the required period of time it will log this event and also write an entry to cmap under following keys:
> 
>               timestamp The timestamp of the last time when corosync failed to be scheduled for the required amount of time. The time is milli-seconds since the epoch.
> 
>               delay The amount of time (milliseconds as a float) that corosync was delayed.
<snip>

manual page style fixed

> [root@virt-126 ~]# pcs status
> Cluster name: STSRHTS23630
> Stack: corosync
> Current DC: virt-126 (version 1.1.23-1.el7-9acf116022) - partition with quorum
> Last updated: Thu Jun 11 16:18:13 2020
> Last change: Thu Jun 11 15:52:49 2020 by root via cibadmin on virt-126
> 
> 2 nodes configured
> 2 resource instances configured
> 
> Online: [ virt-126 virt-127 ]
> 
> Full list of resources:
> 
>  fence-virt-126	(stonith:fence_xvm):	Started virt-126
>  fence-virt-127	(stonith:fence_xvm):	Started virt-127
> 
> Daemon Status:
>   corosync: active/enabled
>   pacemaker: active/enabled
>   pcsd: active/enabled
> [root@virt-126 ~]# corosync-cmapctl | grep '^runtime\.schedmiss'
> [root@virt-126 ~]# kill -STOP $(pidof corosync) && sleep 2 && kill -CONT $(pidof corosync)
> [root@virt-126 ~]# corosync-cmapctl | grep '^runtime\.schedmiss'
> runtime.schedmiss.delay (flt) = 2025.971436
> runtime.schedmiss.timestamp (u64) = 1591885690834
> [root@virt-126 ~]# date --date @1591885690
> Thu Jun 11 16:28:10 CEST 2020
> [root@virt-126 ~]# date
> Thu Jun 11 16:28:41 CEST 2020
> [root@virt-126 ~]# kill -STOP $(pidof corosync) && sleep 2 && kill -CONT $(pidof corosync)
> [root@virt-126 ~]# corosync-cmapctl | grep '^runtime\.schedmiss'
> runtime.schedmiss.delay (flt) = 2165.545898
> runtime.schedmiss.timestamp (u64) = 1591885731683
> [root@virt-126 ~]# corosync-cmapctl -d runtime.schedmiss.delay
> [root@virt-126 ~]# corosync-cmapctl | grep '^runtime\.schedmiss'
> runtime.schedmiss.timestamp (u64) = 1591885731683
> [root@virt-126 ~]# corosync-cmapctl -d runtime.schedmiss.timestamp
> [root@virt-126 ~]# corosync-cmapctl | grep '^runtime\.schedmiss'
> [root@virt-126 ~]# kill -STOP $(pidof corosync) && sleep 2 && kill -CONT $(pidof corosync)
> [root@virt-126 ~]# corosync-cmapctl | grep '^runtime\.schedmiss'
> runtime.schedmiss.delay (flt) = 2237.884277
> runtime.schedmiss.timestamp (u64) = 1591885769290
> [root@virt-126 ~]# journalctl -u corosync -b | grep 'not scheduled'
> Jun 11 16:28:10 virt-126.cluster-qe.lab.eng.brq.redhat.com corosync[22575]:  [MAIN  ] Corosync main process was not scheduled (@1591885690834) for 2025.9714 ms (threshold is 800.0000 ms). Consider token timeout increase.
> Jun 11 16:28:51 virt-126.cluster-qe.lab.eng.brq.redhat.com corosync[22575]:  [MAIN  ] Corosync main process was not scheduled (@1591885731683) for 2165.5459 ms (threshold is 800.0000 ms). Consider token timeout increase.
> Jun 11 16:29:29 virt-126.cluster-qe.lab.eng.brq.redhat.com corosync[22575]:  [MAIN  ] Corosync main process was not scheduled (@1591885769290) for 2237.8843 ms (threshold is 800.0000 ms). Consider token timeout increase.
> [root@virt-126 ~]# grep 'not scheduled' /var/log/cluster/corosync.log
> [22574] virt-126.cluster-qe.lab.eng.brq.redhat.com corosyncwarning [MAIN  ] Corosync main process was not scheduled (@1591885690834) for 2025.9714 ms (threshold is 800.0000 ms). Consider token timeout increase.
> [22574] virt-126.cluster-qe.lab.eng.brq.redhat.com corosyncwarning [MAIN  ] Corosync main process was not scheduled (@1591885731683) for 2165.5459 ms (threshold is 800.0000 ms). Consider token timeout increase.
> [22574] virt-126.cluster-qe.lab.eng.brq.redhat.com corosyncwarning [MAIN  ] Corosync main process was not scheduled (@1591885769290) for 2237.8843 ms (threshold is 800.0000 ms). Consider token timeout increase.

timestamps match, trying a few more times just to be sure:

> [root@virt-126 ~]# while true; do kill -STOP $(pidof corosync) && sleep 2 && kill -CONT $(pidof corosync); STAMP_CMAP=$(corosync-cmapctl -g runtime.schedmiss.timestamp | cut -d ' ' -f 4); STAMP_LOG=$(journalctl -u corosync -b | grep 'not scheduled' | tail -n 1 | grep -oP '(?<=\(@)\d+(?=\))'); if [ ${STAMP_CMAP} -ne ${STAMP_LOG} ]; then echo "timestamp mismatch: ${STAMP_CMAP} != ${STAMP_LOG}" && break; else echo "ok: ${STAMP_CMAP} == ${STAMP_LOG}"; fi; sleep 30; done
> ok: 1591888846135 == 1591888846135
> ok: 1591888878295 == 1591888878295
> ok: 1591888910418 == 1591888910418
> ok: 1591888942513 == 1591888942513
> ok: 1591888974608 == 1591888974608
> ok: 1591889006709 == 1591889006709
> ok: 1591889038809 == 1591889038809
> ok: 1591889070919 == 1591889070919
> ok: 1591889103015 == 1591889103015
> ok: 1591889135196 == 1591889135196
> ok: 1591889167354 == 1591889167354
> ok: 1591889199465 == 1591889199465
> ok: 1591889231589 == 1591889231589
> ok: 1591889263673 == 1591889263673
> ok: 1591889295762 == 1591889295762
> ok: 1591889327944 == 1591889327944
> ok: 1591889360103 == 1591889360103
> ok: 1591889392214 == 1591889392214
> ok: 1591889424399 == 1591889424399
> ok: 1591889456548 == 1591889456548
> ok: 1591889488652 == 1591889488652
> ok: 1591889520746 == 1591889520746
> ok: 1591889552838 == 1591889552838
> ok: 1591889584925 == 1591889584925
> ok: 1591889617019 == 1591889617019
> ok: 1591889649101 == 1591889649101
> ok: 1591889681192 == 1591889681192
> ok: 1591889713281 == 1591889713281
> ok: 1591889745377 == 1591889745377
> ok: 1591889777464 == 1591889777464
> ok: 1591889809582 == 1591889809582
> ok: 1591889841668 == 1591889841668
> ok: 1591889873735 == 1591889873735
> ok: 1591889905819 == 1591889905819
> ok: 1591889937904 == 1591889937904
> ok: 1591889969971 == 1591889969971
> ok: 1591890002086 == 1591890002086
> ok: 1591890034201 == 1591890034201
> ok: 1591890066308 == 1591890066308
> ok: 1591890098419 == 1591890098419
> ok: 1591890130518 == 1591890130518
> ok: 1591890162624 == 1591890162624
> ok: 1591890194708 == 1591890194708
> ok: 1591890226796 == 1591890226796
> ok: 1591890258880 == 1591890258880
> ok: 1591890290970 == 1591890290970
> ok: 1591890323047 == 1591890323047
> ok: 1591890355125 == 1591890355125
> ok: 1591890387214 == 1591890387214
> ok: 1591890419300 == 1591890419300
> ok: 1591890451390 == 1591890451390
> ok: 1591890483463 == 1591890483463
> ok: 1591890515561 == 1591890515561
> ok: 1591890547653 == 1591890547653
> ok: 1591890579730 == 1591890579730
> ok: 1591890611859 == 1591890611859
> ok: 1591890643978 == 1591890643978
> ok: 1591890676124 == 1591890676124
> ok: 1591890708226 == 1591890708226
> ok: 1591890740303 == 1591890740303
> ok: 1591890772395 == 1591890772395
> ok: 1591890804487 == 1591890804487
> ok: 1591890836575 == 1591890836575
> ok: 1591890868666 == 1591890868666
> ok: 1591890900756 == 1591890900756
> ok: 1591890932883 == 1591890932883
> ok: 1591890965003 == 1591890965003
> ok: 1591890997111 == 1591890997111
> ok: 1591891029212 == 1591891029212
> ok: 1591891061309 == 1591891061309
> ok: 1591891093394 == 1591891093394
> ok: 1591891125516 == 1591891125516
> ok: 1591891157621 == 1591891157621
> ok: 1591891189723 == 1591891189723
> ok: 1591891221845 == 1591891221845
> ok: 1591891253941 == 1591891253941
> ok: 1591891286090 == 1591891286090
> ok: 1591891318212 == 1591891318212
> ok: 1591891350314 == 1591891350314
> ok: 1591891382406 == 1591891382406
> ok: 1591891414515 == 1591891414515
> ok: 1591891446606 == 1591891446606
> ok: 1591891478700 == 1591891478700
> ^C

marking verified in corosync-2.4.5-7.el7

Comment 36 Jan Friesse 2020-09-21 06:48:35 UTC
@Shane: It was probably you who requested insights rule so can you please update BZ (if I understand process correctly the Links section)?

Comment 40 errata-xmlrpc 2020-09-29 19:55:11 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (corosync bug fix and enhancement update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:3924