This bug has been migrated to another issue tracking site. It has been closed here and may no longer be being monitored.

If you would like to get updates for this issue, or to participate in it, you may do so at Red Hat Issue Tracker .
Bug 2215190 - Race issue can happen between action.resumeInterval and action.resumeIntervalMax
Summary: Race issue can happen between action.resumeInterval and action.resumeIntervalMax
Keywords:
Status: CLOSED MIGRATED
Alias: None
Product: Red Hat Enterprise Linux 8
Classification: Red Hat
Component: rsyslog
Version: 8.5
Hardware: Unspecified
OS: Unspecified
high
high
Target Milestone: rc
: ---
Assignee: Attila Lakatos
QA Contact: BaseOS QE Security Team
URL:
Whiteboard:
Depends On:
Blocks:
TreeView+ depends on / blocked
 
Reported: 2023-06-15 03:24 UTC by Flos Qi Guo
Modified: 2023-08-03 12:51 UTC (History)
1 user (show)

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2023-08-03 12:50:55 UTC
Type: Bug
Target Upstream Version:
Embargoed:


Attachments (Terms of Use)
strace log for rsyslog after having the upstream's PR5171 and debugging PR5120 applied. (2.09 MB, application/gzip)
2023-07-13 05:02 UTC, Flos Qi Guo
no flags Details


Links
System ID Private Priority Status Summary Last Updated
Github rsyslog rsyslog pull 5171 0 None open action.resumeintervalmax: the parameter was not respected 2023-06-29 10:11:57 UTC
Red Hat Issue Tracker   RHEL-952 0 None Migrated None 2023-08-03 12:50:54 UTC
Red Hat Issue Tracker RHELPLAN-159929 0 None None None 2023-06-15 03:24:55 UTC
Red Hat Issue Tracker SECENGSP-5245 0 None None None 2023-06-15 03:25:02 UTC

Description Flos Qi Guo 2023-06-15 03:24:33 UTC
> Description of problem:
The action.resumeInterval can make the retry scheduled by the action.resumeIntervalMax unable to be executed under some circumstance.

> Version-Release number of selected component (if applicable):
rsyslog-8.2102.0-5.el8.x86_64 and even UPSTREAM.

> How reproducible:
Always.

> Steps to Reproduce:
1. Install latest rsyslog on RHEL8 or RHEL9.

2. Configure rsyslog remote logging with the following settings:
---------------8< ---------------8< ---------------8< ---------------8< ---------------
# cat /etc/rsyslog.conf|grep -v ^# 
module(load="imuxsock" 	  # provides support for local system logging (e.g. via logger command)
       SysSock.Use="off") # Turn off message reception via local log socket; 
			  # local messages are retrieved through imjournal now.
module(load="imjournal" 	    # provides access to the systemd journal
       StateFile="imjournal.state") # File to store the position in the journal
global(workDirectory="/var/lib/rsyslog")
module(load="builtin:omfile" Template="RSYSLOG_TraditionalFileFormat")
include(file="/etc/rsyslog.d/*.conf" mode="optional")
*.info;mail.none;authpriv.none;cron.none                /var/log/messages
authpriv.*                                              /var/log/secure
mail.*                                                  -/var/log/maillog
cron.*                                                  /var/log/cron
*.emerg                                                 :omusrmsg:*
uucp,news.crit                                          /var/log/spooler
local7.*                                                /var/log/boot.log

# cat /etc/rsyslog.d/client.conf  | grep -v ^#
*.* action( type="omfwd"
        target  = "192.168.18.158"
        # Do not use 514 as that is rsh-protocol
        port    = "10514"
        protocol= "tcp"
        # XXX
        RebindInterval="100"
        action.resumeInterval="10"
        action.resumeIntervalMax="11"
        action.resumeRetryCount="1"

        # and queue to disk if needs be
        queue.filename="tcp_fwd"
        queue.type="LinkedList"
        queue.saveonshutdown="on"
        queue.maxdiskspace="1g" )
---------------8< ---------------8< ---------------8< ---------------8< ---------------

3. Start the rsyslog in debug mode manually.
---------------8< ---------------8< ---------------8< ---------------8< ---------------
# export RSYSLOG_DEBUG="Debug NoStdOut"
# export RSYSLOG_DEBUGLOG="/var/log/rsyslog-debug-$(date +%s).log"
# strace -ttTvfyy -s 1024 -o /tmp/rsyslog_$(date +%s).strace rsyslogd -n
---------------8< ---------------8< ---------------8< ---------------8< ---------------

4. Disconnect the network of remote server 192.168.18.158.

5. Generate logs with a simple script on local rsyslog client.
---------------8< ---------------8< ---------------8< ---------------8< ---------------
# for i in {1..10000}; do logger `date`; sleep 1; done`
---------------8< ---------------8< ---------------8< ---------------8< ---------------

6. Monitor the strace output and the debug log.

> Actual results:
The action.resumeIntervalMax may not work correctly because the sleep time specified in action.resumeInterval can time out so that the scheduled retry time is not complied.

> Expected results:
The action.resumeIntervalMax should work correctly.

> Additional info:
Upstream issue:
 - https://github.com/rsyslog/rsyslog/issues/5132

--FIXED TYPO--

Comment 5 Attila Lakatos 2023-06-29 10:11:58 UTC
After hours of debugging I've probably found the root cause of the issue. Here is a summary of my findings.
### Reproduction
I create two rsyslog servers, serverA and serverB. I tried to simulate an environment where the first one is the primary server and second one is the backup, in case the first one is down. Both of them were listening on port 514, protocol=tcp. The rsyslog client had the following conf:
...
$DebugFile /var/log/rsyslogdebug
$DebugLevel 2

  action(type="omfwd" target="serverA" port="514" protocol="tcp" RebindInterval="100" ResendLastMSGOnReconnect="on" action.resumeInterval="10" action.resumeIntervalMax="10" action.resumeRetryCount="1")
    action(type="omfwd" target="serverB" port="514" protocol="tcp" RebindInterval="100" ResendLastMSGOnReconnect="on" action.resumeInterval="10" action.resumeIntervalMax="10" action.resumeRetryCount="1" action.execOnlyWhenPreviousIsSuspended="on")
...

Firstly, I shut down serverA and observed what happens next. It took a couple of minutes to create enough logs to reproduce the issue. In the meanwhile I was sending messages via the logger command to the rsyslog client, which then forwards them to the server side.

### Analysis
In the meanwhile, I reviewed the debug log located under /var/log/rsyslogdebug.
The following and many similar lines were suspicious:
../action.c: action 'action-0-builtin:omfwd' suspended, earliest retry=1688029585 (now 1688029555), iNbrResRtry 20, duration 30
By default, the duration should not exceed the 10seconds interval, because we defined it in the configuration. So I thought that some race condition was causing this. We have a check in the code that compares the current waiting time with the maximum value, so that can not be the case. I scrolled through the debug file(few thousand lines) associated with each line in the action.c source code but it did not give any hint.

### Root cause
I thought that the resumeInterVal parameter must have been parsed correctly, otherwise an error message would be displayed, BUT!! When rsyslog is started, it scans the main configuration file and checks what parameters are defined, it was able to detect the action.resumeintervalmax parameter properly.
---
6038.748955263:main thread    : rainerscript.c: action.resumeintervalmax: 10
6038.748962756:main thread    : rainerscript.c: action.resumeinterval: 10
---
Because that was defined as "action.resumeintervalmax". However, while assigning individual parameters to individual variables there was a comparision bug in the code.
---
} else if(!strcmp(pblk.descr[i].name, "action.resumeintervalMax")) {
    pAction->iResumeIntervalMax = pvals[i].val.d.n;
---
Due to that the iResumeIntervalMax parameter was not respected.

Comment 9 Flos Qi Guo 2023-07-13 05:02:23 UTC
Created attachment 1975536 [details]
strace log for rsyslog after having the upstream's PR5171 and debugging PR5120 applied.

Comment 10 Attila Lakatos 2023-07-14 13:09:28 UTC
*.* action( type="omfwd" target="192.168.18.158" port="10514" protocol="tcp" RebindInterval="100"
        action.resumeInterval="10"
        action.resumeIntervalMax="11"
        action.resumeRetryCount="1"

        queue.filename="tcp_fwd"
        queue.type="LinkedList"
        queue.saveonshutdown="on"
        queue.maxdiskspace="1g" )

Adding a note on the config above, with attention to action.resumeInterval, action.resumeIntervalMax and action.resumeRetryCount parameters.
In case rsyslog is not able to deliver a message to a remote destination it will do the following steps: First, the action will be suspended for a configured interval, which is 10 seconds. After 10 seconds, the action will be retried but the target is still down, so it will increment the counter of the number of retries. As a result, the number of retries will be set to 1, which equals to action.resumeRetryCount. This parameter specifies how often an action is retried before it is considered to have failed. Failed actions discard messages.
In general, after every 10th failed retry, the waiting time is prolonged with the value of action.resumeInterval. In case the waiting time is bigger than the value of action.resumeIntervalMax, then action.resumeIntervalMax is used instead. However, in this configuration we will not reach 10 retries, because the action.resumeRetryCount is set to 1, thus after the 1st retry rsyslog will consider the message as not delivered.


Note You need to log in before you can comment on or make changes to this bug.