Bug 1414139

Summary: Potentially incorrect/undefined parsing results of time-related configuration parameters such as action intervals/timeouts
Product: Red Hat Enterprise Linux 6 Reporter: Jan Pokorný [poki] <jpokorny>
Component: rgmanagerAssignee: Jan Pokorný [poki] <jpokorny>
Status: CLOSED ERRATA QA Contact: cluster-qe <cluster-qe>
Severity: urgent Docs Contact: Steven J. Levine <slevine>
Priority: urgent    
Version: 6.9CC: cluster-maint, mjuricek, slevine, tlavigne
Target Milestone: rc   
Target Release: ---   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: rgmanager-3.0.12.1-33.el6 Doc Type: Release Note
Doc Text:
Time-related values of cluster resource configuration are now evaluated properly Previously, time-related resource values in actual use could differ from the values configured in the `cluster.conf` file, especially at the initial configuration load. This could cause the `rgmanager` daemon to behave unpredictably. With this fix, `rgmanager` behaves exactly as configured with regards to resources and respective time-related values.
Story Points: ---
Clone Of: Environment:
Last Closed: 2017-03-21 10:41:24 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On:    
Bug Blocks: 1310529    
Attachments:
Description Flags
Proposed fix none

Description Jan Pokorný [poki] 2017-01-17 21:04:55 UTC
Symptoms
========

Using following cluster.conf (original indented with tabs):

<?xml version="1.0"?>
<cluster config_version="5" name="test">
 <clusternodes>
  <clusternode name="virt-122.cluster-qe.lab.eng.brq.redhat.com" nodeid="1"/>
 </clusternodes>
 <totem consensus="200" join="100" token="5000" token_retransmits_before_loss_const="4"/>
 <logging>
  <logging_daemon debug="on" name="corosync" subsys="CONFDB"/>
 </logging>
 <rm>
  <resources>
   <fs device="/dev/vda1" mountpoint="/boot" name="Reference" options="ro,remount">
    <action depth="20" interval="0" name="status"/>
   </fs>
  </resources>
  <service autostart="1" max_restarts="321" name="DummyRef" recovery="relocate" restart_expire_time="654">
   <fs ref="Reference"/>
  </service>
 </rm>
</cluster>

I can reliably, i.e.
- behavior surviving reboot of a given VM
- behavior very likely surviving a bit different deployment of VM
  from scratch on the same archicture architecture (see below)
(- with glibc-2.12-1.208.el6.x86_64, libxml2-2.7.6-21.el6_8.1.x86_64),
reproduce following bad behavior of rgmanager (that I came across in
relation to [bug 1310529 comment 27]):

# service cman start
# rgmanager -f
> [...]
> Building Resource Trees                            
> Replacing action 'status' depth 20: interval: 60->2
> [...]

This does not make any sense as we are overriding status action of
Reference resource at depth 20 because we want to disable (interval=0)
the checks at that level because otherwise they are implicitly (through
metadata of fs.sh) scheduled to happen every 1 minute (60 seconds)
-- this is a use case of referred [bug 1310529].
And apparently, the status action at depth 20 did not get disabled,
but rather happens 30x faster!

The expected message is rather:

> Replacing action 'status' depth 20: interval: 60->0

and subsequent behavior should be really no such depth for a status
based monitoring is ever used.

Note that we are now discussing the startup behavior.  A config-reload
(e.g. ccs --incversion --sync --activate) action does not (again,
reliably) this issue for me, I can observe the expected message above.


Root cause
==========

It was found out that the issue was introduced with upstream commit:
https://git.fedorahosted.org/cgit/cluster.git/commit/?id=9d0ecbac1ff8ac1f045bd19bb73ca063a5351014
and the issue can only happen with a time specification ending with
a digit (as opposed to, e.g., ending the implicit entry in seconds
with explicit 's').  In that case, the logic of expand_time function
skips the nul-terminator after this last digit and continues with
a subsequent byte, which in general has unspecified value, just as any
other in row (this may also serve potential attackers, but because
cluster.conf is supposed to come from trusted sources, this line is
disregarded).

Note that first affected RHEL release is 6.6: [bug 1036652].


Workaround
==========

Always suffix the entry in seconds with explicit 's'.

Comment 1 Jan Pokorný [poki] 2017-01-17 21:58:29 UTC
Created attachment 1241957 [details]
Proposed fix

Affected attributes
===================

Following is hopefully an exhustive list of affected values
as expressed with fitting XPath expressions:

/cluster/rm/*//action/@timeout
/cluster/rm/*//action/@interval
/cluster/rm/*[name() != 'resources']/*//@__restart_expire_time
/cluster/rm/*[name() != 'resources']/@restart_expire_time
/cluster/rm/*[name() != 'resources']//@__failure_expire_time

In my observation, only explicit values in cluster.conf were affected
but let's suppose that even the implicit ones (e.g., encoded in
metadata of particular resource agents) are endangered.


Fix
===

See the attached patch.  Simplest solution is simply not to allow
nul-terminator to slip through the checks as detailed in Root cause
[comment 0].

Comment 9 errata-xmlrpc 2017-03-21 10:41:24 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHBA-2017-0684.html