1414139 – Potentially incorrect/undefined parsing results of time-related configuration parameters such as action intervals/timeouts

RHEL Engineering is moving the tracking of its product development work on RHEL 6 through RHEL 9 to Red Hat Jira (issues.redhat.com). If you're a Red Hat customer, please continue to file support cases via the Red Hat customer portal. If you're not, please head to the "RHEL project" in Red Hat Jira and file new tickets here. Individual Bugzilla bugs in the statuses "NEW", "ASSIGNED", and "POST" are being migrated throughout September 2023. Bugs of Red Hat partners with an assigned Engineering Partner Manager (EPM) are migrated in late September as per pre-agreed dates. Bugs against components "kernel", "kernel-rt", and "kpatch" are only migrated if still in "NEW" or "ASSIGNED". If you cannot log in to RH Jira, please consult article #7032570. That failing, please send an e-mail to the RH Jira admins at rh-issues@redhat.com to troubleshoot your issue as a user management inquiry. The email creates a ServiceNow ticket with Red Hat. Individual Bugzilla bugs that are migrated will be moved to status "CLOSED", resolution "MIGRATED", and set with "MigratedToJIRA" in "Keywords". The link to the successor Jira issue will be found under "Links", have a little "two-footprint" icon next to it, and direct you to the "RHEL project" in Red Hat Jira (issue links are of type "https://issues.redhat.com/browse/RHEL-XXXX", where "X" is a digit). This same link will be available in a blue banner at the top of the page informing you that that bug has been migrated.

Bug 1414139 - Potentially incorrect/undefined parsing results of time-related configuration parameters such as action intervals/timeouts

Summary: Potentially incorrect/undefined parsing results of time-related configuration...

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	Red Hat Enterprise Linux 6
Classification:	Red Hat
Component:	rgmanager
Sub Component:
Version:	6.9
Hardware:	Unspecified
OS:	Unspecified
Priority:	urgent
Severity:	urgent
Target Milestone:	rc
Target Release:	---
Assignee:	Jan Pokorný [poki]
QA Contact:	cluster-qe@redhat.com
Docs Contact:	Steven J. Levine
URL:
Whiteboard:
Depends On:
Blocks:	1310529
TreeView+	depends on / blocked

Reported:	2017-01-17 21:04 UTC by Jan Pokorný [poki]
Modified:	2017-03-21 10:41 UTC (History)
CC List:	4 users (show)
Fixed In Version:	rgmanager-3.0.12.1-33.el6
Doc Type:	Release Note
Doc Text:	Time-related values of cluster resource configuration are now evaluated properly Previously, time-related resource values in actual use could differ from the values configured in the `cluster.conf` file, especially at the initial configuration load. This could cause the `rgmanager` daemon to behave unpredictably. With this fix, `rgmanager` behaves exactly as configured with regards to resources and respective time-related values.
Clone Of:
Environment:
Last Closed:	2017-03-21 10:41:24 UTC
Target Upstream Version:
Embargoed:

Attachments	(Terms of Use)
Proposed fix (2.07 KB, patch) 2017-01-17 21:58 UTC, Jan Pokorný [poki]	no flags	Details \| Diff
View All

Links
System	ID	Private	Priority	Status	Summary	Last Updated
Red Hat Bugzilla	1036652	0	high	CLOSED	Apply upstream patch to prevent rgmanager crash on misconfigured "time_t" values (e.g., __restart_expire_time) in cluste...	2021-02-22 00:41:40 UTC
Red Hat Product Errata	RHBA-2017:0684	0	normal	SHIPPED_LIVE	rgmanager bug fix update	2017-03-21 12:37:12 UTC

Internal Links: 1036652

Description Jan Pokorný [poki] 2017-01-17 21:04:55 UTC

Symptoms
========

Using following cluster.conf (original indented with tabs):

<?xml version="1.0"?>
<cluster config_version="5" name="test">
 <clusternodes>
  <clusternode name="virt-122.cluster-qe.lab.eng.brq.redhat.com" nodeid="1"/>
 </clusternodes>
 <totem consensus="200" join="100" token="5000" token_retransmits_before_loss_const="4"/>
 <logging>
  <logging_daemon debug="on" name="corosync" subsys="CONFDB"/>
 </logging>
 <rm>
  <resources>
   <fs device="/dev/vda1" mountpoint="/boot" name="Reference" options="ro,remount">
    <action depth="20" interval="0" name="status"/>
   </fs>
  </resources>
  <service autostart="1" max_restarts="321" name="DummyRef" recovery="relocate" restart_expire_time="654">
   <fs ref="Reference"/>
  </service>
 </rm>
</cluster>

I can reliably, i.e.
- behavior surviving reboot of a given VM
- behavior very likely surviving a bit different deployment of VM
  from scratch on the same archicture architecture (see below)
(- with glibc-2.12-1.208.el6.x86_64, libxml2-2.7.6-21.el6_8.1.x86_64),
reproduce following bad behavior of rgmanager (that I came across in
relation to [bug 1310529 comment 27]):

# service cman start
# rgmanager -f
> [...]
> Building Resource Trees                            
> Replacing action 'status' depth 20: interval: 60->2
> [...]

This does not make any sense as we are overriding status action of
Reference resource at depth 20 because we want to disable (interval=0)
the checks at that level because otherwise they are implicitly (through
metadata of fs.sh) scheduled to happen every 1 minute (60 seconds)
-- this is a use case of referred [bug 1310529].
And apparently, the status action at depth 20 did not get disabled,
but rather happens 30x faster!

The expected message is rather:

> Replacing action 'status' depth 20: interval: 60->0

and subsequent behavior should be really no such depth for a status
based monitoring is ever used.

Note that we are now discussing the startup behavior.  A config-reload
(e.g. ccs --incversion --sync --activate) action does not (again,
reliably) this issue for me, I can observe the expected message above.


Root cause
==========

It was found out that the issue was introduced with upstream commit:
https://git.fedorahosted.org/cgit/cluster.git/commit/?id=9d0ecbac1ff8ac1f045bd19bb73ca063a5351014
and the issue can only happen with a time specification ending with
a digit (as opposed to, e.g., ending the implicit entry in seconds
with explicit 's').  In that case, the logic of expand_time function
skips the nul-terminator after this last digit and continues with
a subsequent byte, which in general has unspecified value, just as any
other in row (this may also serve potential attackers, but because
cluster.conf is supposed to come from trusted sources, this line is
disregarded).

Note that first affected RHEL release is 6.6: [bug 1036652].


Workaround
==========

Always suffix the entry in seconds with explicit 's'.

Comment 1 Jan Pokorný [poki] 2017-01-17 21:58:29 UTC

Created attachment 1241957 [details]
Proposed fix

Affected attributes
===================

Following is hopefully an exhustive list of affected values
as expressed with fitting XPath expressions:

/cluster/rm/*//action/@timeout
/cluster/rm/*//action/@interval
/cluster/rm/*[name() != 'resources']/*//@__restart_expire_time
/cluster/rm/*[name() != 'resources']/@restart_expire_time
/cluster/rm/*[name() != 'resources']//@__failure_expire_time

In my observation, only explicit values in cluster.conf were affected
but let's suppose that even the implicit ones (e.g., encoded in
metadata of particular resource agents) are endangered.


Fix
===

See the attached patch.  Simplest solution is simply not to allow
nul-terminator to slip through the checks as detailed in Root cause
[comment 0].

Comment 9 errata-xmlrpc 2017-03-21 10:41:24 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHBA-2017-0684.html

Note You need to log in before you can comment on or make changes to this bug.