Bug 1591045

Summary: crm_failcount treats INFINITY as zero
Product: Red Hat Enterprise Linux 7 Reporter: Reid Wahl <nwahl>
Component: pacemakerAssignee: Ken Gaillot <kgaillot>
Status: CLOSED ERRATA QA Contact: Patrik Hagara <phagara>
Severity: urgent Docs Contact:
Priority: urgent    
Version: 7.5CC: abeekhof, cluster-maint, mnovacek, phagara
Target Milestone: rcKeywords: Regression
Target Release: 7.6   
Hardware: All   
OS: Linux   
Whiteboard:
Fixed In Version: pacemaker-1.1.19-1.el7 Doc Type: No Doc Update
Doc Text:
Most users will use pcs resource failcount rather than crm_failcount directly, and this issue does not affect them.
Story Points: ---
Clone Of: Environment:
Last Closed: 2018-10-30 07:58:22 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:

Description Reid Wahl 2018-06-13 23:45:35 UTC
Description of problem:

crm_failcount treats INFINITY as zero.

The query_failcount function calls the query_attr_sum function to compute the sum of a resource's failcounts from <transient_attributes> in the CIB. When query_attr_sum encounters the string value "INFINITY", it typecasts it to the integer 0 by placing it inside double parentheses. So effectively, INFINITY failcounts are ignored.

 89 query_attr_sum() {
...
116         # Sum the values
117         QAS_SUM=0
118         for i in 0 $QAS_VALUE; do
119                 QAS_SUM=$(($QAS_SUM + $i))
120         done
121         echo $QAS_SUM
122 }


The ClusterLabs documentation tells us that "Any value + INFINITY = INFINITY". With this in mind, upon encountering an INFINITY value, we should set QAS_SUM to "INFINITY" and exit the loop.


--------------------

Version-Release number of selected component (if applicable):

pacemaker-1.1.18-11.el7_5.2


--------------------

How reproducible:

Always

--------------------

Steps to Reproduce:

1. Force a resource to fail its monitor operation several times.
2. With start-failure-is-fatal=true, force the resource to fail its start operation.
3. View the resource's failcount in the CIB, and find that it has a finite value for its monitor failcount and an "INFINITY" value for its start failcount.

    # pcs cluster cib | grep fail-count-dummy
              <nvpair id="status-1-fail-count-dummy.monitor_10000" name="fail-count-dummy#monitor_10000" value="8"/>
              <nvpair id="status-1-fail-count-dummy.start_0" name="fail-count-dummy#start_0" value="INFINITY"/>

4. Use crm_failcount to query the failcounts for the individual operations.

    # crm_failcount -G -r dummy -n monitor -I 10000ms
    scope=status  name=fail-count-dummy#monitor_10000 value=8

    # crm_failcount -G -r dummy -n start 
    scope=status  name=fail-count-dummy#start_0 value=INFINITY

5. Use crm_failcount to query the total failcount, and find that it is not INFINITY.

    # crm_failcount -G -r dummy
    scope=status  name=fail-count-dummy value=8


--------------------

Actual results:

# crm_failcount -G -r dummy
scope=status  name=fail-count-dummy value=8


--------------------

Expected results:

# crm_failcount -G -r dummy
scope=status  name=fail-count-dummy value=INFINITY


--------------------

Additional info:

See also:
Bug 1588667 - The command `pcs resource failcount show <resource id>` always shows a failcount of zero (even when it should not)
Bug 1427273 - Support planned changes in pacemaker failure handling

Comment 2 Reid Wahl 2018-06-14 00:09:04 UTC
The issue begins earlier than the addition loop. The sed expression in line 114 only parses integer values ([0-9][0-9]*) and ignores strings like "INFINITY". Then the addition loop is not equipped to handle "INFINITY" either.

112         # Extract the attribute values (one per line) from the output
113         QAS_VALUE=$(echo "$QAS_ALL" | sed -n -e \
114                 's/.*<nvpair.*value="\([0-9][0-9]*\)".*>.*/\1/p')
115 
116         # Sum the values
117         QAS_SUM=0
118         for i in 0 $QAS_VALUE; do
119                 QAS_SUM=$(($QAS_SUM + $i))
120         done
121         echo $QAS_SUM


We could either add "INFINITY" to the match group in the sed expression, or simply stop restricting it to digits. Is there any reason we should restrict it to digits?


--------------------

Here is a patch that adds "INFINITY", if we take that approach.

# diff -u /usr/sbin/crm_failcount{,.bak}
--- /usr/sbin/crm_failcount	2018-06-13 19:55:09.549196483 -0400
+++ /usr/sbin/crm_failcount.bak	2018-04-23 10:04:27.000000000 -0400
@@ -110,22 +110,13 @@
 	#fi
 
 	# Extract the attribute values (one per line) from the output
-	#QAS_VALUE=$(echo "$QAS_ALL" | sed -n -e \
-		#'s/.*<nvpair.*value="\([0-9][0-9]*\)".*>.*/\1/p')
 	QAS_VALUE=$(echo "$QAS_ALL" | sed -n -e \
-		's/.*<nvpair.*value="\([0-9][0-9]*\|INFINITY\)".*>.*/\1/p')
-
-	# echo "$QAS_VALUE"
+		's/.*<nvpair.*value="\([0-9][0-9]*\)".*>.*/\1/p')
 
 	# Sum the values
 	QAS_SUM=0
 	for i in 0 $QAS_VALUE; do
-		if [ "$i" = "INFINITY" ]; then
-			QAS_SUM="INFINITY"
-			break
-		else
-			QAS_SUM=$(($QAS_SUM + $i))
-		fi
+		QAS_SUM=$(($QAS_SUM + $i))
 	done
 	echo $QAS_SUM
 }


We could even regex-search for "INFINITY" in "$QAS_VALUE" instead of iterating through the values until we hit "INFINITY", but the above approach seems more straightforward.


--------------------

Using this new version, we get the following results:

# pcs cluster cib | grep fail-count-dummy
          <nvpair id="status-1-fail-count-dummy.monitor_10000" name="fail-count-dummy#monitor_10000" value="8"/>
          <nvpair id="status-1-fail-count-dummy.start_0" name="fail-count-dummy#start_0" value="INFINITY"/>

# crm_failcount -G -r dummy -n monitor -I 10000ms
scope=status  name=fail-count-dummy#monitor_10000 value=8

# crm_failcount -G -r dummy -n start
scope=status  name=fail-count-dummy#start_0 value=INFINITY

# crm_failcount -G -r dummy
scope=status  name=fail-count-dummy value=INFINITY

Comment 3 Ken Gaillot 2018-06-14 16:07:07 UTC
Thanks for the discovery, and the patch!

This is a regression in RHEL 7.5 (via upstream release 1.1.17 commit 83236161). 

QA: Reproducer is in Description. Instead of making the operations fail in actuality, for these purposes it would be sufficient to simulate failures by adding fail-count attributes directly, e.g.

attrd_updater -U 8 -N ${NODE_NAME} -n fail-count-${RSC_NAME}#monitor_${INTERVAL_MS}

attrd_updater -U INFINITY -N ${NODE_NAME} -n fail-count-${RSC_NAME}#start_0

(Such simulated failures won't show up in status but that is not important for this bz.)

Comment 5 Reid Wahl 2018-06-14 16:35:22 UTC
(In reply to Reid Wahl from comment #2)
> Here is a patch that adds "INFINITY", if we take that approach.
> 
> # diff -u /usr/sbin/crm_failcount{,.bak}
> --- /usr/sbin/crm_failcount	2018-06-13 19:55:09.549196483 -0400
> +++ /usr/sbin/crm_failcount.bak	2018-04-23 10:04:27.000000000 -0400
> @@ -110,22 +110,13 @@
>  	#fi
>  
>  	# Extract the attribute values (one per line) from the output
> -	#QAS_VALUE=$(echo "$QAS_ALL" | sed -n -e \
> -		#'s/.*<nvpair.*value="\([0-9][0-9]*\)".*>.*/\1/p')
>  	QAS_VALUE=$(echo "$QAS_ALL" | sed -n -e \
> -		's/.*<nvpair.*value="\([0-9][0-9]*\|INFINITY\)".*>.*/\1/p')
> -
> -	# echo "$QAS_VALUE"
> +		's/.*<nvpair.*value="\([0-9][0-9]*\)".*>.*/\1/p')
>  
>  	# Sum the values
>  	QAS_SUM=0
>  	for i in 0 $QAS_VALUE; do
> -		if [ "$i" = "INFINITY" ]; then
> -			QAS_SUM="INFINITY"
> -			break
> -		else
> -			QAS_SUM=$(($QAS_SUM + $i))
> -		fi
> +		QAS_SUM=$(($QAS_SUM + $i))
>  	done
>  	echo $QAS_SUM
>  }
> 


Forgot to remove commented lines before running diff ;)

# diff -u /usr/sbin/crm_failcount{,.bak}
--- /usr/sbin/crm_failcount	2018-06-14 12:34:06.189587246 -0400
+++ /usr/sbin/crm_failcount.bak	2018-04-23 10:04:27.000000000 -0400
@@ -111,17 +111,12 @@
 
 	# Extract the attribute values (one per line) from the output
 	QAS_VALUE=$(echo "$QAS_ALL" | sed -n -e \
-		's/.*<nvpair.*value="\([0-9][0-9]*\|INFINITY\)".*>.*/\1/p')
+		's/.*<nvpair.*value="\([0-9][0-9]*\)".*>.*/\1/p')
 
 	# Sum the values
 	QAS_SUM=0
 	for i in 0 $QAS_VALUE; do
-		if [ "$i" = "INFINITY" ]; then
-			QAS_SUM="INFINITY"
-			break
-		else
-			QAS_SUM=$(($QAS_SUM + $i))
-		fi
+		QAS_SUM=$(($QAS_SUM + $i))
 	done
 	echo $QAS_SUM
 }

Comment 6 Reid Wahl 2018-06-14 16:36:11 UTC
Third time's the charm. Time for coffee.

--- /usr/sbin/crm_failcount.bak	2018-04-23 10:04:27.000000000 -0400
+++ /usr/sbin/crm_failcount	2018-06-14 12:34:06.189587246 -0400
@@ -111,12 +111,17 @@
 
 	# Extract the attribute values (one per line) from the output
 	QAS_VALUE=$(echo "$QAS_ALL" | sed -n -e \
-		's/.*<nvpair.*value="\([0-9][0-9]*\)".*>.*/\1/p')
+		's/.*<nvpair.*value="\([0-9][0-9]*\|INFINITY\)".*>.*/\1/p')
 
 	# Sum the values
 	QAS_SUM=0
 	for i in 0 $QAS_VALUE; do
-		QAS_SUM=$(($QAS_SUM + $i))
+		if [ "$i" = "INFINITY" ]; then
+			QAS_SUM="INFINITY"
+			break
+		else
+			QAS_SUM=$(($QAS_SUM + $i))
+		fi
 	done
 	echo $QAS_SUM
 }

Comment 9 Ken Gaillot 2018-06-14 21:57:33 UTC
Fixed upstream as of commit 8567726d

Comment 12 Patrik Hagara 2018-07-30 14:08:26 UTC
before:
=======

> [root@virt-122 ~]# rpm -q pacemaker
> pacemaker-1.1.18-11.el7.x86_64
> [root@virt-122 ~]# pcs resource create dummy ocf:pacemaker:Dummy op monitor interval=30s
> [root@virt-122 ~]# pcs resource disable dummy --wait
> resource 'dummy' is not running on any node
> [root@virt-122 ~]# pcs status
> Cluster name: bzzt
> Stack: corosync
> Current DC: virt-122.cluster-qe.lab.eng.brq.redhat.com (version 1.1.18-11.el7-2b07d5c5a9) - partition with quorum
> Last updated: Mon Jul 30 14:57:27 2018
> Last change: Mon Jul 30 14:57:23 2018 by root via cibadmin on virt-122.cluster-qe.lab.eng.brq.redhat.com
> 
> 3 nodes configured
> 1 resource configured (1 DISABLED)
> 
> Online: [ virt-122.cluster-qe.lab.eng.brq.redhat.com virt-123.cluster-qe.lab.eng.brq.redhat.com virt-127.cluster-qe.lab.eng.brq.redhat.com ]
> 
> Full list of resources:
> 
>  dummy	(ocf::pacemaker:Dummy):	Stopped (disabled)
> 
> Daemon Status:
>   corosync: active/enabled
>   pacemaker: active/enabled
>   pcsd: active/enabled
> [root@virt-122 ~]# attrd_updater -U 8 -N virt-127.cluster-qe.lab.eng.brq.redhat.com -n 'fail-count-dummy#monitor_30000'
> [root@virt-122 ~]# attrd_updater -U INFINITY -N virt-127.cluster-qe.lab.eng.brq.redhat.com -n 'fail-count-dummy#start_0'
> [root@virt-122 ~]# crm_failcount -G -r dummy -N virt-127.cluster-qe.lab.eng.brq.redhat.com -n monitor -I 30000ms
> scope=status  name=fail-count-dummy#monitor_30000 value=8
> [root@virt-122 ~]# crm_failcount -G -r dummy -N virt-127.cluster-qe.lab.eng.brq.redhat.com -n start
> scope=status  name=fail-count-dummy#start_0 value=INFINITY
> [root@virt-122 ~]# crm_failcount -G -r dummy -N virt-127.cluster-qe.lab.eng.brq.redhat.com
> scope=status  name=fail-count-dummy value=8

Total failcount (last command) is equal to 8 instead of INFINITY, which is incorrect.


after:
======

> [root@virt-148 ~]# rpm -q pacemaker
> pacemaker-1.1.19-4.el7.x86_64
> [root@virt-148 ~]# pcs resource create dummy ocf:pacemaker:Dummy op monitor interval=30s
> [root@virt-148 ~]# pcs resource disable dummy --wait
> resource 'dummy' is not running on any node
> [root@virt-148 ~]# pcs status
> Cluster name: bzzt
> Stack: corosync
> Current DC: virt-150.cluster-qe.lab.eng.brq.redhat.com (version 1.1.19-4.el7-c3c624ea3d) - partition with quorum
> Last updated: Mon Jul 30 15:56:58 2018
> Last change: Mon Jul 30 15:56:50 2018 by root via cibadmin on virt-148.cluster-qe.lab.eng.brq.redhat.com
> 
> 3 nodes configured
> 1 resource configured (1 DISABLED)
> 
> Online: [ virt-148.cluster-qe.lab.eng.brq.redhat.com virt-149.cluster-qe.lab.eng.brq.redhat.com virt-150.cluster-qe.lab.eng.brq.redhat.com ]
> 
> Full list of resources:
> 
>  dummy	(ocf::pacemaker:Dummy):	Stopped (disabled)
> 
> Daemon Status:
>   corosync: active/enabled
>   pacemaker: active/enabled
>   pcsd: active/enabled
> [root@virt-148 ~]# attrd_updater -U 8 -N virt-149.cluster-qe.lab.eng.brq.redhat.com -n 'fail-count-dummy#monitor_30000'
> [root@virt-148 ~]# attrd_updater -U INFINITY -N virt-149.cluster-qe.lab.eng.brq.redhat.com -n 'fail-count-dummy#start_0'
> [root@virt-148 ~]# crm_failcount -G -r dummy -N virt-149.cluster-qe.lab.eng.brq.redhat.com -n monitor -I 30000ms
> scope=status  name=fail-count-dummy#monitor_30000 value=8
> [root@virt-148 ~]# crm_failcount -G -r dummy -N virt-149.cluster-qe.lab.eng.brq.redhat.com -n start
> scope=status  name=fail-count-dummy#start_0 value=INFINITY
> [root@virt-148 ~]# crm_failcount -G -r dummy -N virt-149.cluster-qe.lab.eng.brq.redhat.com
> scope=status  name=fail-count-dummy value=INFINITY

Total failcount is now correctly evaluated as 8 + INFINITY = INFINITY. Marking verified.

Comment 14 errata-xmlrpc 2018-10-30 07:58:22 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2018:3055