Time-related values of cluster resource configuration are now evaluated properly
Previously, time-related resource values in actual use could differ from the values configured in the `cluster.conf` file, especially at the initial configuration load. This could cause the `rgmanager` daemon to behave unpredictably. With this fix, `rgmanager` behaves exactly as configured with regard to resources and their respective time-related values.
Description by Jan Pokorný [poki], 2017-01-17 21:04:55 UTC
Symptoms
========
Using the following cluster.conf (original indented with tabs):
<?xml version="1.0"?>
<cluster config_version="5" name="test">
  <clusternodes>
    <clusternode name="virt-122.cluster-qe.lab.eng.brq.redhat.com" nodeid="1"/>
  </clusternodes>
  <totem consensus="200" join="100" token="5000" token_retransmits_before_loss_const="4"/>
  <logging>
    <logging_daemon debug="on" name="corosync" subsys="CONFDB"/>
  </logging>
  <rm>
    <resources>
      <fs device="/dev/vda1" mountpoint="/boot" name="Reference" options="ro,remount">
        <action depth="20" interval="0" name="status"/>
      </fs>
    </resources>
    <service autostart="1" max_restarts="321" name="DummyRef" recovery="relocate" restart_expire_time="654">
      <fs ref="Reference"/>
    </service>
  </rm>
</cluster>
I can reliably reproduce the following bad behavior of rgmanager
(which I came across in relation to [bug 1310529 comment 27]), where
"reliably" means:
- the behavior survives a reboot of the given VM
- the behavior very likely survives a somewhat different deployment
  of the VM from scratch on the same architecture (see below)
(- observed with glibc-2.12-1.208.el6.x86_64,
   libxml2-2.7.6-21.el6_8.1.x86_64)
# service cman start
# rgmanager -f
> [...]
> Building Resource Trees
> Replacing action 'status' depth 20: interval: 60->2
> [...]
This does not make any sense: we override the status action of the
Reference resource at depth 20 precisely to disable (interval=0) the
checks at that level, which are otherwise implicitly scheduled
(through the metadata of fs.sh) to happen every minute (60 seconds)
-- this is the use case of the referred [bug 1310529].
Yet apparently the status action at depth 20 did not get disabled;
instead it now happens 30x faster!
The expected message is rather:
> Replacing action 'status' depth 20: interval: 60->0
and the subsequent behavior should be that no status-based monitoring
at that depth is ever performed.
Note that we are discussing the startup behavior here. A config reload
(e.g. ccs --incversion --sync --activate) does not (again, reliably)
exhibit this issue for me; there I can observe the expected message
above.
Root cause
==========
It was found that the issue was introduced with upstream commit:
https://git.fedorahosted.org/cgit/cluster.git/commit/?id=9d0ecbac1ff8ac1f045bd19bb73ca063a5351014
and that it can only occur when a time specification ends with
a digit (as opposed to, e.g., ending the implicit entry in seconds
with an explicit 's'). In that case, the logic of the expand_time
function skips the nul terminator after this last digit and continues
with the subsequent byte, which in general has an unspecified value,
just like any bytes that follow (this might also serve potential
attackers, but because cluster.conf is supposed to come from trusted
sources, that angle is disregarded here).
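The safe shape of such a parser can be sketched as follows (this is
illustrative code, not the actual cluster.git expand_time): the cursor
is advanced only inside the `while (*p)` loop, so the terminator after
a trailing digit can never be skipped, whether or not the value carries
a unit suffix.

```c
#include <ctype.h>

/* Sketch of a nul-safe time expansion: "60" and "60s" both mean
 * 60 seconds, "1m30s" means 90 seconds, and parsing always stops
 * exactly at the nul terminator. */
static long expand_time_sketch(const char *val)
{
    const char *p = val;
    long total = 0, cur = 0;

    while (*p != '\0') {
        if (isdigit((unsigned char)*p)) {
            cur = cur * 10 + (*p - '0');
        } else {
            switch (*p) {
            case 'm': cur *= 60;    break; /* minutes */
            case 'h': cur *= 3600;  break; /* hours */
            case 'd': cur *= 86400; break; /* days */
            default:  break;               /* 's' or unknown: seconds */
            }
            total += cur;
            cur = 0;
        }
        p++; /* safe: the loop condition guarantees *p != '\0' here */
    }
    return total + cur; /* value ended in a bare digit, e.g. "60" */
}
```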
Note that first affected RHEL release is 6.6: [bug 1036652].
Workaround
==========
Always suffix the entry in seconds with an explicit 's'.
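Applied to the configuration above, the override would then read:

```xml
<action depth="20" interval="0s" name="status"/>
```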
Comment 1 by Jan Pokorný [poki], 2017-01-17 21:58:29 UTC
Created attachment 1241957: Proposed fix
Affected attributes
===================
The following is hopefully an exhaustive list of affected values,
expressed as fitting XPath expressions:
/cluster/rm/*//action/@timeout
/cluster/rm/*//action/@interval
/cluster/rm/*[name() != 'resources']/*//@__restart_expire_time
/cluster/rm/*[name() != 'resources']/@restart_expire_time
/cluster/rm/*[name() != 'resources']//@__failure_expire_time
In my observation, only values stated explicitly in cluster.conf were
affected, but let's suppose that even the implicit ones (e.g., those
encoded in the metadata of particular resource agents) are at risk.
Fix
===
See the attached patch. The simplest solution is not to allow the
nul terminator to slip through the checks, as detailed under Root
cause [comment 0].
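The essence of such a guard can be shown in isolation (the name below
is illustrative and not taken from the attached patch): the cursor is
advanced past a unit character only when one is actually present, never
past the terminator.

```c
#include <stddef.h>

/* Advance past a unit suffix (e.g. 's', 'm'), but never past
 * the nul terminator of the string. */
static const char *skip_suffix(const char *p)
{
    return (*p != '\0') ? p + 1 : p;
}
```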
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.
For information on the advisory, and where to find the updated
files, follow the link below.
If the solution does not work for you, open a new bug report.
https://rhn.redhat.com/errata/RHBA-2017-0684.html