Bug 1414139 - Potentially incorrect/undefined parsing results of time-related configuration parameters such as action intervals/timeouts
Alias: None
Product: Red Hat Enterprise Linux 6
Classification: Red Hat
Component: rgmanager
Version: 6.9
Hardware: Unspecified
OS: Unspecified
Target Milestone: rc
Assignee: Jan Pokorný [poki]
QA Contact: cluster-qe@redhat.com
Steven J. Levine
Depends On:
Blocks: 1310529
Reported: 2017-01-17 21:04 UTC by Jan Pokorný [poki]
Modified: 2017-03-21 10:41 UTC

Fixed In Version: rgmanager-
Doc Type: Release Note
Doc Text:
Time-related values of cluster resource configuration are now evaluated properly

Previously, time-related resource values in actual use could differ from the values configured in the `cluster.conf` file, especially at the initial configuration load. This could cause the `rgmanager` daemon to behave unpredictably. With this fix, `rgmanager` behaves exactly as configured with regard to resources and their time-related values.
Clone Of:
Last Closed: 2017-03-21 10:41:24 UTC
Target Upstream Version:

Attachments
Proposed fix (2.07 KB, patch)
2017-01-17 21:58 UTC, Jan Pokorný [poki]

System | ID | Private | Priority | Status | Summary | Last Updated
Red Hat Bugzilla | 1036652 | 0 | high | CLOSED | Apply upstream patch to prevent rgmanager crash on misconfigured "time_t" values (e.g., __restart_expire_time) in cluste... | 2021-02-22 00:41:40 UTC
Red Hat Product Errata | RHBA-2017:0684 | 0 | normal | SHIPPED_LIVE | rgmanager bug fix update | 2017-03-21 12:37:12 UTC

Internal Links: 1036652

Description Jan Pokorný [poki] 2017-01-17 21:04:55 UTC

Using the following cluster.conf (original indented with tabs):

<?xml version="1.0"?>
<cluster config_version="5" name="test">
  <clusternode name="virt-122.cluster-qe.lab.eng.brq.redhat.com" nodeid="1"/>
 <totem consensus="200" join="100" token="5000" token_retransmits_before_loss_const="4"/>
  <logging_daemon debug="on" name="corosync" subsys="CONFDB"/>
   <fs device="/dev/vda1" mountpoint="/boot" name="Reference" options="ro,remount">
    <action depth="20" interval="0" name="status"/>
  <service autostart="1" max_restarts="321" name="DummyRef" recovery="relocate" restart_expire_time="654">
   <fs ref="Reference"/>

I can reliably, i.e.
- behavior surviving reboot of a given VM
- behavior very likely surviving a slightly different deployment of the VM
  from scratch on the same architecture (see below)
(- with glibc-2.12-1.208.el6.x86_64, libxml2-2.7.6-21.el6_8.1.x86_64),
reproduce the following bad behavior of rgmanager (that I came across in
relation to [bug 1310529 comment 27]):

# service cman start
# rgmanager -f
> [...]
> Building Resource Trees                            
> Replacing action 'status' depth 20: interval: 60->2
> [...]

This does not make any sense: we override the status action of the
Reference resource at depth 20 precisely to disable (interval=0) the
checks at that level, because otherwise they are implicitly (through
the metadata of fs.sh) scheduled to happen every minute (60 seconds);
this is the use case of the referenced [bug 1310529].
Yet the status action at depth 20 apparently did not get disabled,
but rather runs 30x more often!

The expected message is rather:

> Replacing action 'status' depth 20: interval: 60->0

and the subsequent behavior should be that no status-based monitoring
at that depth is ever performed.

Note that we are now discussing the startup behavior.  A config-reload
(e.g. ccs --incversion --sync --activate) action does not (again,
reliably) trigger this issue for me; I can observe the expected message above.

Root cause

It was found out that the issue was introduced with upstream commit:
and the issue can only happen with a time specification ending with
a digit (as opposed to, e.g., ending the implicit entry in seconds
with explicit 's').  In that case, the logic of expand_time function
skips the nul-terminator after this last digit and continues with
a subsequent byte, which in general has unspecified value, just as any
other in row (this may also serve potential attackers, but because
cluster.conf is supposed to come from trusted sources, this line is

Note that the first affected RHEL release is 6.6: [bug 1036652].
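
To illustrate the failure mode described above, here is a minimal C sketch. It is not the actual expand_time code: parse_time_buggy and unit_seconds are hypothetical names, and the unit table is an assumption made only for the illustration. The loop always steps over the byte following a number, even when that byte is the nul-terminator.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/* Hypothetical unit table; the units rgmanager really accepts may differ. */
long unit_seconds(char unit)
{
    switch (unit) {
    case 'm': return 60;
    case 'h': return 3600;
    case 'd': return 86400;
    default:  return 1;    /* 's', or no unit at all */
    }
}

/*
 * Illustrative buggy loop (assumption, not the rgmanager source):
 * after each number the cursor is advanced unconditionally, so a value
 * ending with a digit makes it step past the terminating '\0' and keep
 * parsing whatever bytes follow.
 */
long parse_time_buggy(const char *p)
{
    long total = 0;

    assert(p != NULL);
    while (*p != '\0') {
        long n = 0;

        while (isdigit((unsigned char)*p))
            n = n * 10 + (*p++ - '0');
        total += n * unit_seconds(*p);
        p++;    /* BUG: also executed when *p is the terminator */
    }
    return total;
}
```

To keep the demonstration well defined, feed it an array whose bytes after the terminator are known, e.g. `{'0', '\0', '2', 's', '\0'}`: the sketch returns 2 instead of 0, whereas the string "0s" yields 0. With a real heap or stack string the trailing bytes are arbitrary, which is why the result is unpredictable.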


Workaround

Always suffix the entry in seconds with an explicit 's'.
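
For anyone generating cluster.conf values programmatically, the rule above could be applied with a small helper along these lines. This is only a sketch: suffix_seconds is a hypothetical name, not something rgmanager or its tooling provides.

```c
#include <assert.h>
#include <ctype.h>
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical helper (not part of rgmanager): copy a time value into
 * `out`, appending 's' when the value ends with a digit.
 * Returns 0 on success, -1 if `out` is too small for the result.
 */
int suffix_seconds(const char *value, char *out, size_t outsz)
{
    size_t len;
    int bare, n;

    assert(value != NULL && out != NULL);

    len = strlen(value);
    /* "bare" means the last character is a digit, i.e. no unit given */
    bare = len > 0 && isdigit((unsigned char)value[len - 1]);
    n = snprintf(out, outsz, "%s%s", value, bare ? "s" : "");
    return (n >= 0 && (size_t)n < outsz) ? 0 : -1;
}
```

With this sketch, "0" becomes "0s", while a value that already ends with a non-digit character is copied unchanged.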

Comment 1 Jan Pokorný [poki] 2017-01-17 21:58:29 UTC
Created attachment 1241957 [details]
Proposed fix

Affected attributes

Following is hopefully an exhaustive list of affected values
as expressed with fitting XPath expressions:

/cluster/rm/*[name() != 'resources']/*//@__restart_expire_time
/cluster/rm/*[name() != 'resources']/@restart_expire_time
/cluster/rm/*[name() != 'resources']//@__failure_expire_time

In my observation, only explicit values in cluster.conf were affected
but let's suppose that even the implicit ones (e.g., encoded in
metadata of particular resource agents) are endangered.


See the attached patch.  The simplest solution is not to allow the
nul-terminator to slip through the checks, as detailed in Root cause
[comment 0].
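
The idea of not letting the terminator slip through can be sketched as follows. This is not the attached patch: parse_time_guarded and unit_to_seconds are hypothetical names, and the unit table is an assumption made only for the illustration.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/* Hypothetical unit table; the units rgmanager really accepts may differ. */
long unit_to_seconds(char unit)
{
    switch (unit) {
    case 'm': return 60;
    case 'h': return 3600;
    case 'd': return 86400;
    default:  return 1;    /* 's' */
    }
}

/*
 * Guarded loop (assumption, not the rgmanager source): the terminator
 * is checked before the cursor moves past the unit position, so a value
 * ending with a digit is treated as seconds and parsing stops there.
 */
long parse_time_guarded(const char *p)
{
    long total = 0;

    assert(p != NULL);
    while (*p != '\0') {
        long n = 0;

        while (isdigit((unsigned char)*p))
            n = n * 10 + (*p++ - '0');
        if (*p == '\0') {    /* bare number: count as seconds, stop */
            total += n;
            break;
        }
        total += n * unit_to_seconds(*p);
        p++;
    }
    return total;
}
```

With this guard, "0" yields 0 and "654" yields 654 regardless of what bytes happen to follow the string in memory.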

Comment 9 errata-xmlrpc 2017-03-21 10:41:24 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

