Bug 1591045
| Summary: | crm_failcount treats INFINITY as zero | ||
|---|---|---|---|
| Product: | Red Hat Enterprise Linux 7 | Reporter: | Reid Wahl <nwahl> |
| Component: | pacemaker | Assignee: | Ken Gaillot <kgaillot> |
| Status: | CLOSED ERRATA | QA Contact: | Patrik Hagara <phagara> |
| Severity: | urgent | Docs Contact: | |
| Priority: | urgent | ||
| Version: | 7.5 | CC: | abeekhof, cluster-maint, mnovacek, phagara |
| Target Milestone: | rc | Keywords: | Regression |
| Target Release: | 7.6 | ||
| Hardware: | All | ||
| OS: | Linux | ||
| Whiteboard: | |||
| Fixed In Version: | pacemaker-1.1.19-1.el7 | Doc Type: | No Doc Update |
| Doc Text: |
Most users will use pcs resource failcount rather than crm_failcount directly, and this issue does not affect them.
|
Story Points: | --- |
| Clone Of: | Environment: | ||
| Last Closed: | 2018-10-30 07:58:22 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | Category: | --- | |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | |||
The issue begins earlier than the addition loop. The sed expression in line 114 only parses integer values ([0-9][0-9]*) and ignores strings like "INFINITY". Then the addition loop is not equipped to handle "INFINITY" either.
112 # Extract the attribute values (one per line) from the output
113 QAS_VALUE=$(echo "$QAS_ALL" | sed -n -e \
114 's/.*<nvpair.*value="\([0-9][0-9]*\)".*>.*/\1/p')
115
116 # Sum the values
117 QAS_SUM=0
118 for i in 0 $QAS_VALUE; do
119 QAS_SUM=$(($QAS_SUM + $i))
120 done
121 echo $QAS_SUM
We could either add "INFINITY" to the match group in the sed expression, or simply stop restricting it to digits. Is there any reason we should restrict it to digits?
--------------------
Here is a patch that adds "INFINITY", if we take that approach.
# diff -u /usr/sbin/crm_failcount{,.bak}
--- /usr/sbin/crm_failcount 2018-06-13 19:55:09.549196483 -0400
+++ /usr/sbin/crm_failcount.bak 2018-04-23 10:04:27.000000000 -0400
@@ -110,22 +110,13 @@
#fi
# Extract the attribute values (one per line) from the output
- #QAS_VALUE=$(echo "$QAS_ALL" | sed -n -e \
- #'s/.*<nvpair.*value="\([0-9][0-9]*\)".*>.*/\1/p')
QAS_VALUE=$(echo "$QAS_ALL" | sed -n -e \
- 's/.*<nvpair.*value="\([0-9][0-9]*\|INFINITY\)".*>.*/\1/p')
-
- # echo "$QAS_VALUE"
+ 's/.*<nvpair.*value="\([0-9][0-9]*\)".*>.*/\1/p')
# Sum the values
QAS_SUM=0
for i in 0 $QAS_VALUE; do
- if [ "$i" = "INFINITY" ]; then
- QAS_SUM="INFINITY"
- break
- else
- QAS_SUM=$(($QAS_SUM + $i))
- fi
+ QAS_SUM=$(($QAS_SUM + $i))
done
echo $QAS_SUM
}
We could even regex-search for "INFINITY" in "$QAS_VALUE" instead of iterating through the values until we hit "INFINITY", but the above approach seems more straightforward.
--------------------
Using this new version, we get the following results:
# pcs cluster cib | grep fail-count-dummy
<nvpair id="status-1-fail-count-dummy.monitor_10000" name="fail-count-dummy#monitor_10000" value="8"/>
<nvpair id="status-1-fail-count-dummy.start_0" name="fail-count-dummy#start_0" value="INFINITY"/>
# crm_failcount -G -r dummy -n monitor -I 10000ms
scope=status name=fail-count-dummy#monitor_10000 value=8
# crm_failcount -G -r dummy -n start
scope=status name=fail-count-dummy#start_0 value=INFINITY
# crm_failcount -G -r dummy
scope=status name=fail-count-dummy value=INFINITY
Thanks for the discovery, and the patch!
This is a regression in RHEL 7.5 (via upstream release 1.1.17 commit 83236161).
QA: Reproducer is in Description. Instead of making the operations fail in actuality, for these purposes it would be sufficient to simulate failures by adding fail-count attributes directly, e.g.
attrd_updater -U 8 -N ${NODE_NAME} -n fail-count-${RSC_NAME}#monitor_${INTERVAL_MS}
attrd_updater -U INFINITY -N ${NODE_NAME} -n fail-count-${RSC_NAME}#start_0
(Such simulated failures won't show up in status but that is not important for this bz.)
(In reply to Reid Wahl from comment #2) > Here is a patch that adds "INFINITY", if we take that approach. > > # diff -u /usr/sbin/crm_failcount{,.bak} > --- /usr/sbin/crm_failcount 2018-06-13 19:55:09.549196483 -0400 > +++ /usr/sbin/crm_failcount.bak 2018-04-23 10:04:27.000000000 -0400 > @@ -110,22 +110,13 @@ > #fi > > # Extract the attribute values (one per line) from the output > - #QAS_VALUE=$(echo "$QAS_ALL" | sed -n -e \ > - #'s/.*<nvpair.*value="\([0-9][0-9]*\)".*>.*/\1/p') > QAS_VALUE=$(echo "$QAS_ALL" | sed -n -e \ > - 's/.*<nvpair.*value="\([0-9][0-9]*\|INFINITY\)".*>.*/\1/p') > - > - # echo "$QAS_VALUE" > + 's/.*<nvpair.*value="\([0-9][0-9]*\)".*>.*/\1/p') > > # Sum the values > QAS_SUM=0 > for i in 0 $QAS_VALUE; do > - if [ "$i" = "INFINITY" ]; then > - QAS_SUM="INFINITY" > - break > - else > - QAS_SUM=$(($QAS_SUM + $i)) > - fi > + QAS_SUM=$(($QAS_SUM + $i)) > done > echo $QAS_SUM > } > Forgot to remove commented lines before running diff ;) # diff -u /usr/sbin/crm_failcount{,.bak} --- /usr/sbin/crm_failcount 2018-06-14 12:34:06.189587246 -0400 +++ /usr/sbin/crm_failcount.bak 2018-04-23 10:04:27.000000000 -0400 @@ -111,17 +111,12 @@ # Extract the attribute values (one per line) from the output QAS_VALUE=$(echo "$QAS_ALL" | sed -n -e \ - 's/.*<nvpair.*value="\([0-9][0-9]*\|INFINITY\)".*>.*/\1/p') + 's/.*<nvpair.*value="\([0-9][0-9]*\)".*>.*/\1/p') # Sum the values QAS_SUM=0 for i in 0 $QAS_VALUE; do - if [ "$i" = "INFINITY" ]; then - QAS_SUM="INFINITY" - break - else - QAS_SUM=$(($QAS_SUM + $i)) - fi + QAS_SUM=$(($QAS_SUM + $i)) done echo $QAS_SUM } Third time's the charm. Time for coffee. --- /usr/sbin/crm_failcount.bak 2018-04-23 10:04:27.000000000 -0400 +++ /usr/sbin/crm_failcount 2018-06-14 12:34:06.189587246 -0400 @@ -111,12 +111,17 @@ # Extract the attribute values (one per line) from the output QAS_VALUE=$(echo "$QAS_ALL" | sed -n -e \ - 's/.*<nvpair.*value="\([0-9][0-9]*\)".*>.*/\1/p') + 's/.*<nvpair.*value="\([0-9][0-9]*\|INFINITY\)".*>.*/\1/p') # Sum the values QAS_SUM=0 for i in 0 $QAS_VALUE; do - QAS_SUM=$(($QAS_SUM + $i)) + if [ "$i" = "INFINITY" ]; then + QAS_SUM="INFINITY" + break + else + QAS_SUM=$(($QAS_SUM + $i)) + fi done echo $QAS_SUM } Fixed upstream as of commit 8567726d before: ======= > [root@virt-122 ~]# rpm -q pacemaker > pacemaker-1.1.18-11.el7.x86_64 > [root@virt-122 ~]# pcs resource create dummy ocf:pacemaker:Dummy op monitor interval=30s > [root@virt-122 ~]# pcs resource disable dummy --wait > resource 'dummy' is not running on any node > [root@virt-122 ~]# pcs status > Cluster name: bzzt > Stack: corosync > Current DC: virt-122.cluster-qe.lab.eng.brq.redhat.com (version 1.1.18-11.el7-2b07d5c5a9) - partition with quorum > Last updated: Mon Jul 30 14:57:27 2018 > Last change: Mon Jul 30 14:57:23 2018 by root via cibadmin on virt-122.cluster-qe.lab.eng.brq.redhat.com > > 3 nodes configured > 1 resource configured (1 DISABLED) > > Online: [ virt-122.cluster-qe.lab.eng.brq.redhat.com virt-123.cluster-qe.lab.eng.brq.redhat.com virt-127.cluster-qe.lab.eng.brq.redhat.com ] > > Full list of resources: > > dummy (ocf::pacemaker:Dummy): Stopped (disabled) > > Daemon Status: > corosync: active/enabled > pacemaker: active/enabled > pcsd: active/enabled > [root@virt-122 ~]# attrd_updater -U 8 -N virt-127.cluster-qe.lab.eng.brq.redhat.com -n 'fail-count-dummy#monitor_30000' > [root@virt-122 ~]# attrd_updater -U INFINITY -N virt-127.cluster-qe.lab.eng.brq.redhat.com -n 'fail-count-dummy#start_0' > [root@virt-122 ~]# crm_failcount -G -r dummy -N virt-127.cluster-qe.lab.eng.brq.redhat.com -n monitor -I 30000ms > scope=status name=fail-count-dummy#monitor_30000 value=8 > [root@virt-122 ~]# crm_failcount -G -r dummy -N virt-127.cluster-qe.lab.eng.brq.redhat.com -n start > scope=status name=fail-count-dummy#start_0 value=INFINITY > [root@virt-122 ~]# crm_failcount -G -r dummy -N virt-127.cluster-qe.lab.eng.brq.redhat.com > scope=status name=fail-count-dummy value=8 Total failcount (last command) is equal to 8 instead of INFINITY, which is incorrect. after: ====== > [root@virt-148 ~]# rpm -q pacemaker > pacemaker-1.1.19-4.el7.x86_64 > [root@virt-148 ~]# pcs resource create dummy ocf:pacemaker:Dummy op monitor interval=30s > [root@virt-148 ~]# pcs resource disable dummy --wait > resource 'dummy' is not running on any node > [root@virt-148 ~]# pcs status > Cluster name: bzzt > Stack: corosync > Current DC: virt-150.cluster-qe.lab.eng.brq.redhat.com (version 1.1.19-4.el7-c3c624ea3d) - partition with quorum > Last updated: Mon Jul 30 15:56:58 2018 > Last change: Mon Jul 30 15:56:50 2018 by root via cibadmin on virt-148.cluster-qe.lab.eng.brq.redhat.com > > 3 nodes configured > 1 resource configured (1 DISABLED) > > Online: [ virt-148.cluster-qe.lab.eng.brq.redhat.com virt-149.cluster-qe.lab.eng.brq.redhat.com virt-150.cluster-qe.lab.eng.brq.redhat.com ] > > Full list of resources: > > dummy (ocf::pacemaker:Dummy): Stopped (disabled) > > Daemon Status: > corosync: active/enabled > pacemaker: active/enabled > pcsd: active/enabled > [root@virt-148 ~]# attrd_updater -U 8 -N virt-149.cluster-qe.lab.eng.brq.redhat.com -n 'fail-count-dummy#monitor_30000' > [root@virt-148 ~]# attrd_updater -U INFINITY -N virt-149.cluster-qe.lab.eng.brq.redhat.com -n 'fail-count-dummy#start_0' > [root@virt-148 ~]# crm_failcount -G -r dummy -N virt-149.cluster-qe.lab.eng.brq.redhat.com -n monitor -I 30000ms > scope=status name=fail-count-dummy#monitor_30000 value=8 > [root@virt-148 ~]# crm_failcount -G -r dummy -N virt-149.cluster-qe.lab.eng.brq.redhat.com -n start > scope=status name=fail-count-dummy#start_0 value=INFINITY > [root@virt-148 ~]# crm_failcount -G -r dummy -N virt-149.cluster-qe.lab.eng.brq.redhat.com > scope=status name=fail-count-dummy value=INFINITY Total failcount is now correctly evaluated as 8 + INFINITY = INFINITY. Marking verified. Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2018:3055 |
Description of problem: crm_failcount treats INFINITY as zero. The query_failcount function calls the query_attr_sum function to compute the sum of a resource's failcounts from <transient_attributes> in the CIB. When query_attr_sum encounters the string value "INFINITY", it typecasts it to the integer 0 by placing it inside double parentheses. So effectively, INFINITY failcounts are ignored. 89 query_attr_sum() { ... 116 # Sum the values 117 QAS_SUM=0 118 for i in 0 $QAS_VALUE; do 119 QAS_SUM=$(($QAS_SUM + $i)) 120 done 121 echo $QAS_SUM 122 } The ClusterLabs documentation tells us that "Any value + INFINITY = INFINITY". With this in mind, upon encountering an INFINITY value, we should set QAS_SUM to "INFINITY" and exit the loop. -------------------- Version-Release number of selected component (if applicable): pacemaker-1.1.18-11.el7_5.2 -------------------- How reproducible: Always -------------------- Steps to Reproduce: 1. Force a resource to fail its monitor operation several times. 2. With start-failure-is-fatal=true, force the resource to fail its start operation. 3. View the resource's failcount in the CIB, and find that it has a finite value for its monitor failcount and an "INFINITY" value for its start failcount. # pcs cluster cib | grep fail-count-dummy <nvpair id="status-1-fail-count-dummy.monitor_10000" name="fail-count-dummy#monitor_10000" value="8"/> <nvpair id="status-1-fail-count-dummy.start_0" name="fail-count-dummy#start_0" value="INFINITY"/> 4. Use crm_failcount to query the failcounts for the individual operations. # crm_failcount -G -r dummy -n monitor -I 10000ms scope=status name=fail-count-dummy#monitor_10000 value=8 # crm_failcount -G -r dummy -n start scope=status name=fail-count-dummy#start_0 value=INFINITY 5. Use crm_failcount to query the total failcount, and find that it is not INFINITY. # crm_failcount -G -r dummy scope=status name=fail-count-dummy value=8 -------------------- Actual results: # crm_failcount -G -r dummy scope=status name=fail-count-dummy value=8 -------------------- Expected results: # crm_failcount -G -r dummy scope=status name=fail-count-dummy value=INFINITY -------------------- Additional info: See also: Bug 1588667 - The command `pcs resource failcount show <resource id>` always shows a failcount of zero (even when it should not) Bug 1427273 - Support planned changes in pacemaker failure handling