Bug 1835885
Summary: | [RFE] please add ability to gather scheduling pause history via cmap | ||
---|---|---|---|
Product: | Red Hat Enterprise Linux 7 | Reporter: | Shane Bradley <sbradley> |
Component: | corosync | Assignee: | Jan Friesse <jfriesse> |
Status: | CLOSED ERRATA | QA Contact: | cluster-qe <cluster-qe> |
Severity: | urgent | Docs Contact: | |
Priority: | urgent | ||
Version: | 7.8 | CC: | ccaulfie, cfeist, cluster-maint, cluster-qe, coughlan, fdinitto, huali, jfriesse, phagara, pvlasin, sbradley |
Target Milestone: | rc | Keywords: | FutureFeature |
Target Release: | --- | ||
Hardware: | Unspecified | ||
OS: | Unspecified | ||
Whiteboard: | |||
Fixed In Version: | corosync-2.4.5-7.el7 | Doc Type: | Enhancement |
Doc Text: |
Feature:
Add ability to gather scheduling pause history via cmap.
Reason:
Better support for insights.
Result:
Corosync stores timestamp and delay of scheduling pause to cmap keys runtime.schedmiss.delay and runtime.schedmiss.timestamp. These keys are intended for insights and shouldn't be used by users.
Also message in /var/log/messages is changed to contain timestamp (may cause possible incompatibility with automated tools) - New format is "May 19 15:49:03 XXXX corosync[XXXX]: [MAIN ] Corosync main process was not scheduled (@1589896143548) for 2196.2039 ms (threshold is 800.0000 ms). Consider token timeout increase."
|
Story Points: | --- |
Clone Of: | 1791792 | Environment: | |
Last Closed: | 2020-09-29 19:55:11 UTC | Type: | --- |
Regression: | --- | Mount Type: | --- |
Documentation: | --- | CRM: | |
Verified Versions: | Category: | --- | |
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
Cloudforms Team: | --- | Target Upstream Version: | |
Embargoed: | |||
Bug Depends On: | 1791792 | ||
Bug Blocks: | |||
Attachments: |
Created attachment 1689891 [details]
main: Add schedmiss timestamp into message
main: Add schedmiss timestamp into message
This is useful for matching schedmiss event in stats map with logged
event.
(backported from master 35662dd0ec53f456445c30c0ef92892f47b25aa2)
Signed-off-by: Jan Friesse <jfriesse>
Reviewed-by: Christine Caulfield <ccaulfie>
@QE (and Shane): # pkill -STOP corosync && sleep 2 && pkill -CONT corosync # corosync-cmapctl | grep schedmiss runtime.schedmiss.delay (flt) = 2282.403320 runtime.schedmiss.timestamp (u64) = 1589895874915 # date --date="@$((1589895874915/1000))" Tue May 19 15:44:34 CEST 2020 # pkill -STOP corosync && sleep 2 && pkill -CONT corosync # corosync-cmapctl | grep schedmiss runtime.schedmiss.delay (flt) = 2196.203857 runtime.schedmiss.timestamp (u64) = 1589896143548 # corosync-cmapctl -d runtime.schedmiss.timestamp && corosync-cmapctl -d runtime.schedmiss.delay # corosync-cmapctl | grep schedmiss # Empty output # pkill -STOP corosync && sleep 2 && pkill -CONT corosync # corosync-cmapctl | grep schedmiss runtime.schedmiss.delay (flt) = 2283.874512 runtime.schedmiss.timestamp (u64) = 1589896274200 Also please note log message was slightly changed to contain timestamp: # grep 'not schedul' /var/log/messages May 19 15:49:03 XXXX corosync[XXXX]: [MAIN ] Corosync main process was not scheduled (@1589896143548) for 2196.2039 ms (threshold is 800.0000 ms). Consider token timeout increase. May 19 15:51:14 XXXX corosync[XXXX]: [MAIN ] Corosync main process was not scheduled (@1589896274200) for 2283.8745 ms (threshold is 800.0000 ms). Consider token timeout increase. qa_ack+, reproducer somewhere in description (or in the cloned rhel8 bz comments https://bugzilla.redhat.com/show_bug.cgi?id=1791792#c4, https://bugzilla.redhat.com/show_bug.cgi?id=1791792#c6) and comment#9 Created attachment 1695427 [details]
main: Make schedmiss in cmap and log equal
main: Make schedmiss in cmap and log equal
Second call of qb_util_nano_from_epoch_get may differ a bit. Solution is
to use previously stored timestamp (similarly as in master branch).
Also fix man page to follow similar style as other keys.
Signed-off-by: Jan Friesse <jfriesse>
Reviewed-by: Christine Caulfield <ccaulfie>
Created attachment 1695430 [details]
main: Make schedmiss in cmap and log equal
main: Make schedmiss in cmap and log equal
Second call of qb_util_nano_from_epoch_get may differ a bit. Solution is
to use previously stored timestamp (similarly as in master branch).
Also fix man page to follow similar style as other keys.
Thanks Patrik Hagara <phagara> for reporting the problem.
Signed-off-by: Jan Friesse <jfriesse>
Reviewed-by: Christine Caulfield <ccaulfie>
> [root@virt-126 ~]# rpm -q corosync > corosync-2.4.5-7.el7.x86_64 > [root@virt-126 ~]# man cmap_keys <snip> > runtime.schedmiss.* > If corosync is not scheduled after the required period of time it will log this event and also write an entry to cmap under following keys: > > timestamp The timestamp of the last time when corosync failed to be scheduled for the required amount of time. The time is milli-seconds since the epoch. > > delay The amount of time (milliseconds as a float) that corosync was delayed. <snip> manual page style fixed > [root@virt-126 ~]# pcs status > Cluster name: STSRHTS23630 > Stack: corosync > Current DC: virt-126 (version 1.1.23-1.el7-9acf116022) - partition with quorum > Last updated: Thu Jun 11 16:18:13 2020 > Last change: Thu Jun 11 15:52:49 2020 by root via cibadmin on virt-126 > > 2 nodes configured > 2 resource instances configured > > Online: [ virt-126 virt-127 ] > > Full list of resources: > > fence-virt-126 (stonith:fence_xvm): Started virt-126 > fence-virt-127 (stonith:fence_xvm): Started virt-127 > > Daemon Status: > corosync: active/enabled > pacemaker: active/enabled > pcsd: active/enabled > [root@virt-126 ~]# corosync-cmapctl | grep '^runtime\.schedmiss' > [root@virt-126 ~]# kill -STOP $(pidof corosync) && sleep 2 && kill -CONT $(pidof corosync) > [root@virt-126 ~]# corosync-cmapctl | grep '^runtime\.schedmiss' > runtime.schedmiss.delay (flt) = 2025.971436 > runtime.schedmiss.timestamp (u64) = 1591885690834 > [root@virt-126 ~]# date --date @1591885690 > Thu Jun 11 16:28:10 CEST 2020 > [root@virt-126 ~]# date > Thu Jun 11 16:28:41 CEST 2020 > [root@virt-126 ~]# kill -STOP $(pidof corosync) && sleep 2 && kill -CONT $(pidof corosync) > [root@virt-126 ~]# corosync-cmapctl | grep '^runtime\.schedmiss' > runtime.schedmiss.delay (flt) = 2165.545898 > runtime.schedmiss.timestamp (u64) = 1591885731683 > [root@virt-126 ~]# corosync-cmapctl -d runtime.schedmiss.delay > [root@virt-126 ~]# corosync-cmapctl | grep '^runtime\.schedmiss' > runtime.schedmiss.timestamp (u64) = 1591885731683 > [root@virt-126 ~]# corosync-cmapctl -d runtime.schedmiss.timestamp > [root@virt-126 ~]# corosync-cmapctl | grep '^runtime\.schedmiss' > [root@virt-126 ~]# kill -STOP $(pidof corosync) && sleep 2 && kill -CONT $(pidof corosync) > [root@virt-126 ~]# corosync-cmapctl | grep '^runtime\.schedmiss' > runtime.schedmiss.delay (flt) = 2237.884277 > runtime.schedmiss.timestamp (u64) = 1591885769290 > [root@virt-126 ~]# journalctl -u corosync -b | grep 'not scheduled' > Jun 11 16:28:10 virt-126.cluster-qe.lab.eng.brq.redhat.com corosync[22575]: [MAIN ] Corosync main process was not scheduled (@1591885690834) for 2025.9714 ms (threshold is 800.0000 ms). Consider token timeout increase. > Jun 11 16:28:51 virt-126.cluster-qe.lab.eng.brq.redhat.com corosync[22575]: [MAIN ] Corosync main process was not scheduled (@1591885731683) for 2165.5459 ms (threshold is 800.0000 ms). Consider token timeout increase. > Jun 11 16:29:29 virt-126.cluster-qe.lab.eng.brq.redhat.com corosync[22575]: [MAIN ] Corosync main process was not scheduled (@1591885769290) for 2237.8843 ms (threshold is 800.0000 ms). Consider token timeout increase. > [root@virt-126 ~]# grep 'not scheduled' /var/log/cluster/corosync.log > [22574] virt-126.cluster-qe.lab.eng.brq.redhat.com corosyncwarning [MAIN ] Corosync main process was not scheduled (@1591885690834) for 2025.9714 ms (threshold is 800.0000 ms). Consider token timeout increase. > [22574] virt-126.cluster-qe.lab.eng.brq.redhat.com corosyncwarning [MAIN ] Corosync main process was not scheduled (@1591885731683) for 2165.5459 ms (threshold is 800.0000 ms). Consider token timeout increase. > [22574] virt-126.cluster-qe.lab.eng.brq.redhat.com corosyncwarning [MAIN ] Corosync main process was not scheduled (@1591885769290) for 2237.8843 ms (threshold is 800.0000 ms). Consider token timeout increase. timestamps match, trying a few more times just to be sure: > [root@virt-126 ~]# while true; do kill -STOP $(pidof corosync) && sleep 2 && kill -CONT $(pidof corosync); STAMP_CMAP=$(corosync-cmapctl -g runtime.schedmiss.timestamp | cut -d ' ' -f 4); STAMP_LOG=$(journalctl -u corosync -b | grep 'not scheduled' | tail -n 1 | grep -oP '(?<=\(@)\d+(?=\))'); if [ ${STAMP_CMAP} -ne ${STAMP_LOG} ]; then echo "timestamp mismatch: ${STAMP_CMAP} != ${STAMP_LOG}" && break; else echo "ok: ${STAMP_CMAP} == ${STAMP_LOG}"; fi; sleep 30; done > ok: 1591888846135 == 1591888846135 > ok: 1591888878295 == 1591888878295 > ok: 1591888910418 == 1591888910418 > ok: 1591888942513 == 1591888942513 > ok: 1591888974608 == 1591888974608 > ok: 1591889006709 == 1591889006709 > ok: 1591889038809 == 1591889038809 > ok: 1591889070919 == 1591889070919 > ok: 1591889103015 == 1591889103015 > ok: 1591889135196 == 1591889135196 > ok: 1591889167354 == 1591889167354 > ok: 1591889199465 == 1591889199465 > ok: 1591889231589 == 1591889231589 > ok: 1591889263673 == 1591889263673 > ok: 1591889295762 == 1591889295762 > ok: 1591889327944 == 1591889327944 > ok: 1591889360103 == 1591889360103 > ok: 1591889392214 == 1591889392214 > ok: 1591889424399 == 1591889424399 > ok: 1591889456548 == 1591889456548 > ok: 1591889488652 == 1591889488652 > ok: 1591889520746 == 1591889520746 > ok: 1591889552838 == 1591889552838 > ok: 1591889584925 == 1591889584925 > ok: 1591889617019 == 1591889617019 > ok: 1591889649101 == 1591889649101 > ok: 1591889681192 == 1591889681192 > ok: 1591889713281 == 1591889713281 > ok: 1591889745377 == 1591889745377 > ok: 1591889777464 == 1591889777464 > ok: 1591889809582 == 1591889809582 > ok: 1591889841668 == 1591889841668 > ok: 1591889873735 == 1591889873735 > ok: 1591889905819 == 1591889905819 > ok: 1591889937904 == 1591889937904 > ok: 1591889969971 == 1591889969971 > ok: 1591890002086 == 1591890002086 > ok: 1591890034201 == 1591890034201 > ok: 1591890066308 == 1591890066308 > ok: 1591890098419 == 1591890098419 > ok: 1591890130518 == 1591890130518 > ok: 1591890162624 == 1591890162624 > ok: 1591890194708 == 1591890194708 > ok: 1591890226796 == 1591890226796 > ok: 1591890258880 == 1591890258880 > ok: 1591890290970 == 1591890290970 > ok: 1591890323047 == 1591890323047 > ok: 1591890355125 == 1591890355125 > ok: 1591890387214 == 1591890387214 > ok: 1591890419300 == 1591890419300 > ok: 1591890451390 == 1591890451390 > ok: 1591890483463 == 1591890483463 > ok: 1591890515561 == 1591890515561 > ok: 1591890547653 == 1591890547653 > ok: 1591890579730 == 1591890579730 > ok: 1591890611859 == 1591890611859 > ok: 1591890643978 == 1591890643978 > ok: 1591890676124 == 1591890676124 > ok: 1591890708226 == 1591890708226 > ok: 1591890740303 == 1591890740303 > ok: 1591890772395 == 1591890772395 > ok: 1591890804487 == 1591890804487 > ok: 1591890836575 == 1591890836575 > ok: 1591890868666 == 1591890868666 > ok: 1591890900756 == 1591890900756 > ok: 1591890932883 == 1591890932883 > ok: 1591890965003 == 1591890965003 > ok: 1591890997111 == 1591890997111 > ok: 1591891029212 == 1591891029212 > ok: 1591891061309 == 1591891061309 > ok: 1591891093394 == 1591891093394 > ok: 1591891125516 == 1591891125516 > ok: 1591891157621 == 1591891157621 > ok: 1591891189723 == 1591891189723 > ok: 1591891221845 == 1591891221845 > ok: 1591891253941 == 1591891253941 > ok: 1591891286090 == 1591891286090 > ok: 1591891318212 == 1591891318212 > ok: 1591891350314 == 1591891350314 > ok: 1591891382406 == 1591891382406 > ok: 1591891414515 == 1591891414515 > ok: 1591891446606 == 1591891446606 > ok: 1591891478700 == 1591891478700 > ^C marking verified in corosync-2.4.5-7.el7 @Shane: It was probably you who requested insights rule so can you please update BZ (if I understand process correctly the Links section)? Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (corosync bug fix and enhancement update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2020:3924 |
Created attachment 1689890 [details] stats: Add basic schedule-miss stats to needle stats: Add basic schedule-miss stats to needle In camelback (48b6894ef41e9a06ccbb696d062d86ef60dc2c4b) we have a much more comprehensive system for recording schedule misses because it has a 'stats' map. This is much more basic and just writes the last event into cmap. You can still query and track the value though. Signed-off-by: Christine Caulfield <ccaulfie> Reviewed-by: Jan Friesse <jfriesse>