Bug 1791792
Summary: | [RFE] please add ability to gather scheduling pause history via cmap | |||
---|---|---|---|---|
Product: | Red Hat Enterprise Linux 8 | Reporter: | Fabio Massimo Di Nitto <fdinitto> | |
Component: | corosync | Assignee: | Christine Caulfield <ccaulfie> | |
Status: | CLOSED ERRATA | QA Contact: | cluster-qe <cluster-qe> | |
Severity: | unspecified | Docs Contact: | ||
Priority: | unspecified | |||
Version: | 8.4 | CC: | ccaulfie, cluster-maint, huali, jfriesse, mnovacek, phagara, sbradley | |
Target Milestone: | rc | Keywords: | FutureFeature | |
Target Release: | 8.0 | Flags: | pm-rhel:
mirror+
|
|
Hardware: | Unspecified | |||
OS: | Unspecified | |||
Whiteboard: | ||||
Fixed In Version: | corosync-3.0.3-4.el8 | Doc Type: | Enhancement | |
Doc Text: |
Feature:
Add ability to gather scheduling pause history via cmap.
Reason:
Better support for insights.
Result:
Corosync stores timestamp and delay of scheduling pause to stats part of cmap - keys stats.schedmiss.X.delay and stats.schedmiss.X.timestamp. corosync-cmapctl was enhanced to allow deleting of these keys (-C schedmiss). These keys are intended for insights and shouldn't be used by users.
Also message in /var/log/messages (and journal) is changed to contain timestamp (may cause possible incompatibility with automated tools) - New format is "May 19 15:49:03 XXXX corosync[XXXX]: [MAIN ] Corosync main process was not scheduled (@1589896143548) for 2196.2039 ms (threshold is 800.0000 ms). Consider token timeout increase.".
|
Story Points: | --- | |
Clone Of: | ||||
: | 1835885 (view as bug list) | Environment: | ||
Last Closed: | 2020-11-04 03:25:51 UTC | Type: | Bug | |
Regression: | --- | Mount Type: | --- | |
Documentation: | --- | CRM: | ||
Verified Versions: | Category: | --- | ||
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | ||
Cloudforms Team: | --- | Target Upstream Version: | ||
Embargoed: | ||||
Bug Depends On: | ||||
Bug Blocks: | 1835885 | |||
Attachments: |
Description
Fabio Massimo Di Nitto
2020-01-16 13:38:51 UTC
Created attachment 1654646 [details]
stats: Add stats for scheduler misses
stats: Add stats for scheduler misses
This patch add a stats.schedmiss.* set of entries that
are a record of the last 10 times corosync was not scheduled
in time.
These entries are keypt in reverse order (so stats.schedmiss.0.* is
always the latest one kept) and the values, including the timestamp,
are in milliseconds.
It's also possible to use a cmap tracker to follow these events, which
might be useful.
Signed-off-by: Christine Caulfield <ccaulfie>
Reviewed-by: Jan Friesse <jfriesse>
For QE (and Shane): I've tested following scenarios: # pkill -STOP corosync && sleep 2 && pkill -CONT corosync # corosync-cmapctl -m stats | grep schedmiss stats.schedmiss.0.delay (flt) = 2432.800781 stats.schedmiss.0.timestamp (u64) = 5106463408980 # pkill -STOP corosync && sleep 2 && pkill -CONT corosync # corosync-cmapctl -m stats | grep schedmiss stats.schedmiss.0.delay (flt) = 2076.766113 stats.schedmiss.0.timestamp (u64) = 5106516822000 stats.schedmiss.1.delay (flt) = 2432.800781 stats.schedmiss.1.timestamp (u64) = 5106463408980 # for i in `seq 1 10`;do pkill -STOP corosync && sleep 2 && pkill -CONT corosync;done # corosync-cmapctl -m stats | grep schedmiss stats.schedmiss.0.delay (flt) = 2023.957031 stats.schedmiss.0.timestamp (u64) = 5106558848098 stats.schedmiss.1.delay (flt) = 2023.436279 stats.schedmiss.1.timestamp (u64) = 5106556824141 stats.schedmiss.2.delay (flt) = 2030.076904 stats.schedmiss.2.timestamp (u64) = 5106554800704 stats.schedmiss.3.delay (flt) = 2022.936890 stats.schedmiss.3.timestamp (u64) = 5106552770627 stats.schedmiss.4.delay (flt) = 2024.749023 stats.schedmiss.4.timestamp (u64) = 5106550747691 stats.schedmiss.5.delay (flt) = 2021.841553 stats.schedmiss.5.timestamp (u64) = 5106548722942 stats.schedmiss.6.delay (flt) = 2024.137207 stats.schedmiss.6.timestamp (u64) = 5106546701100 stats.schedmiss.7.delay (flt) = 2021.287476 stats.schedmiss.7.timestamp (u64) = 5106544676963 stats.schedmiss.8.delay (flt) = 2025.073853 stats.schedmiss.8.timestamp (u64) = 5106542655675 stats.schedmiss.9.delay (flt) = 2473.122070 stats.schedmiss.9.timestamp (u64) = 5106540630601 # corosync-cmapctl -C schedmiss # corosync-cmapctl -m stats | grep schedmiss # pkill -STOP corosync && sleep 2 && pkill -CONT corosync # corosync-cmapctl -m stats | grep schedmiss stats.schedmiss.0.delay (flt) = 2316.379639 stats.schedmiss.0.timestamp (u64) = 5106611834662 # grep /var/log/cluster/corosync.log | grep scheduled ... Jan 23 16:00:11.920 [6349] XXX corosync warning [MAIN ] Corosync main process was not scheduled for 2030.0769 ms (threshold is 1600.0000 ms). Consider token timeout increase. Jan 23 16:00:13.944 [6349] XXX corosync warning [MAIN ] Corosync main process was not scheduled for 2023.4363 ms (threshold is 1600.0000 ms). Consider token timeout increase. Jan 23 16:00:15.968 [6349] XXX corosync warning [MAIN ] Corosync main process was not scheduled for 2023.9570 ms (threshold is 1600.0000 ms). Consider token timeout increase. Jan 23 16:01:08.953 [6349] XXX corosync warning [MAIN ] Corosync main process was not scheduled for 2316.3796 ms (threshold is 1600.0000 ms). Consider token timeout increase. Different terminal: # corosync-cmapctl -m stats -t "stats.schedmiss.0.timestamp" # pkill -STOP corosync && sleep 2 && pkill -CONT corosync corosync-cmapctl should output modify> stats.schedmiss.0.timestamp (u64) = 5106939132219 Different terminal: # corosync-cmapctl -m stats -T "stats.schedmiss" # pkill -STOP corosync && sleep 2 && pkill -CONT corosync -> corosync-cmapctl outputs nothing, because prefix tracking of schedmiss events is deliberately ignored (doesn't make sense for ring buffer). Created attachment 1654883 [details]
stats: Use nanoseconds from epoch for schedmiss
stats: Use nanoseconds from epoch for schedmiss
Using monotonic time is not working because it doesn't have to match
time from epoch.
Signed-off-by: Jan Friesse <jfriesse>
Reviewed-by: Christine Caulfield <ccaulfie>
For QE: Please ignore the values of the stats.schedmiss.X.timestamp in previous example. They were wrong, next patch (https://bugzilla.redhat.com/attachment.cgi?id=1654883&action=diff) makes them correct. Created attachment 1668607 [details]
main: Add schedmiss timestamp into message
main: Add schedmiss timestamp into message
This is useful for matching schedmiss event in stats map with logged
event.
Signed-off-by: Jan Friesse <jfriesse>
Reviewed-by: Christine Caulfield <ccaulfie>
@huanhuan: Not sure I understand question so I would try to answer, but please don't hesitate to ask more questions if my answer doesn't answer your question. Idea is like comment #7, one small exception is using `corosync-cmapctl -m stats "stats.schedmiss."` instead of grep, because it is safer (filters really only keys beginning with "stats.schedmiss." prefix) and doesn't need grep. These items correlate with messages logged by corosync log mechanism - by default it is: - syslog - so these messages can be found in journal and when syslog daemon is configured also in /var/log/messages - /var/log/cluster/corosync.log Which file we decide to parse is really up-to Shane decision. @QE: Also please note that log message changed a bit to contain timestamp, so new format is: "May 19 15:49:03 XXXX corosync[XXXX]: [MAIN ] Corosync main process was not scheduled (@1589896143548) for 2196.2039 ms (threshold is 800.0000 ms). Consider token timeout increase." Created attachment 1692262 [details]
man: move cmap_keys man page from section 8 to 7
man: move cmap_keys man page from section 8 to 7
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit
Section 8 is for "System administration commands", 7 is "Miscellaneous".
Signed-off-by: Ferenc Wágner <wferi>
Reviewed-by: Jan Friesse <jfriesse>
I have verified that there is new functionality logging scheduler misses to /var/log/cluster/corosync.log corosync-3.0.3-4.el8.x86_64. ---- > [root@virt-245 ~]# corosync-cmapctl -m stats | grep schedmiss stats.schedmiss.0.delay (flt) = 2213.491455 stats.schedmiss.0.timestamp (u64) = 1600343948932 stats.schedmiss.1.delay (flt) = 2252.461426 stats.schedmiss.1.timestamp (u64) = 1600343939385 stats.schedmiss.2.delay (flt) = 2230.939209 stats.schedmiss.2.timestamp (u64) = 1600343935799 # Delete currently logged events. > [root@virt-245 ~]# corosync-cmapctl -C schedmiss > [root@virt-245 ~]# corosync-cmapctl -m stats | grep schedmiss > [root@virt-245 ~]# pkill -STOP corosync && sleep 2 && pkill -CONT corosync > [root@virt-245 ~]# corosync-cmapctl -m stats | grep schedmiss stats.schedmiss.0.delay (flt) = 2243.752197 stats.schedmiss.0.timestamp (u64) = 1600344039511 > [root@virt-245 ~]# grep scheduled /var/log/cluster/corosync.log Sep 17 12:10:58 [149363] virt-245 corosync warning [MAIN ] Corosync main process was not scheduled for 2032.3646 ms (threshold is 800.0000 ms). Consider token timeout increase. Sep 17 12:12:09 [149363] virt-245 corosync warning [MAIN ] Corosync main process was not scheduled for 2084.4497 ms (threshold is 800.0000 ms). Consider token timeout increase. Sep 17 12:13:42 [152383] virt-245 corosync warning [MAIN ] Corosync main process was not scheduled (@1600337622220) for 2113.2803 ms (threshold is 800.0000 ms). Consider token timeout increase. Sep 17 13:58:55 [7995] virt-245 corosync warning [MAIN ] Corosync main process was not scheduled (@1600343935799) for 2230.9392 ms (threshold is 800.0000 ms). Consider token timeout increase. Sep 17 13:58:59 [7995] virt-245 corosync warning [MAIN ] Corosync main process was not scheduled (@1600343939385) for 2252.4614 ms (threshold is 800.0000 ms). Consider token timeout increase. Sep 17 13:59:08 [7995] virt-245 corosync warning [MAIN ] Corosync main process was not scheduled (@1600343948932) for 2213.4915 ms (threshold is 800.0000 ms). Consider token timeout increase. Sep 17 14:00:39 [7995] virt-245 corosync warning [MAIN ] Corosync main process was not scheduled (@1600344039511) for 2243.7522 ms (threshold is 800.0000 ms). Consider token timeout increase. @Shane: It was probably you who requested insights rule so can you please update BZ (if I understand process correctly the Links section)? Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (corosync bug fix and enhancement update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2020:4736 |