Note: This bug is displayed in read-only format because the product is no longer active in Red Hat Bugzilla.
RHEL Engineering is moving the tracking of its product development work on RHEL 6 through RHEL 9 to Red Hat Jira (issues.redhat.com). If you're a Red Hat customer, please continue to file support cases via the Red Hat customer portal. If you're not, please head to the "RHEL project" in Red Hat Jira and file new tickets here. Individual Bugzilla bugs in the statuses "NEW", "ASSIGNED", and "POST" are being migrated throughout September 2023. Bugs of Red Hat partners with an assigned Engineering Partner Manager (EPM) are migrated in late September as per pre-agreed dates. Bugs against components "kernel", "kernel-rt", and "kpatch" are only migrated if still in "NEW" or "ASSIGNED". If you cannot log in to RH Jira, please consult article #7032570. That failing, please send an e-mail to the RH Jira admins at rh-issues@redhat.com to troubleshoot your issue as a user management inquiry. The email creates a ServiceNow ticket with Red Hat. Individual Bugzilla bugs that are migrated will be moved to status "CLOSED", resolution "MIGRATED", and set with "MigratedToJIRA" in "Keywords". The link to the successor Jira issue will be found under "Links", have a little "two-footprint" icon next to it, and direct you to the "RHEL project" in Red Hat Jira (issue links are of type "https://issues.redhat.com/browse/RHEL-XXXX", where "X" is a digit). This same link will be available in a blue banner at the top of the page informing you that that bug has been migrated.

Bug 2215190

Summary: Race issue can happen between action.resumeInterval and action.resumeIntervalMax
Product: Red Hat Enterprise Linux 8 Reporter: Flos Qi Guo <qguo>
Component: rsyslogAssignee: Attila Lakatos <alakatos>
Status: CLOSED MIGRATED QA Contact: BaseOS QE Security Team <qe-baseos-security>
Severity: high Docs Contact:
Priority: high    
Version: 8.5CC: rsroka
Target Milestone: rcKeywords: MigratedToJIRA, Triaged
Target Release: ---Flags: pm-rhel: mirror+
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2023-08-03 12:50:55 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Attachments:
Description Flags
strace log for rsyslog after having the upstream's PR5171 and debugging PR5120 applied. none

Description Flos Qi Guo 2023-06-15 03:24:33 UTC
> Description of problem:
The action.resumeInterval can make the retry scheduled by the action.resumeIntervalMax unable to be executed under some circumstance.

> Version-Release number of selected component (if applicable):
rsyslog-8.2102.0-5.el8.x86_64 and even UPSTREAM.

> How reproducible:
Always.

> Steps to Reproduce:
1. Install latest rsyslog on RHEL8 or RHEL9.

2. Configure rsyslog remote logging with the following settings:
---------------8< ---------------8< ---------------8< ---------------8< ---------------
# cat /etc/rsyslog.conf|grep -v ^# 
module(load="imuxsock" 	  # provides support for local system logging (e.g. via logger command)
       SysSock.Use="off") # Turn off message reception via local log socket; 
			  # local messages are retrieved through imjournal now.
module(load="imjournal" 	    # provides access to the systemd journal
       StateFile="imjournal.state") # File to store the position in the journal
global(workDirectory="/var/lib/rsyslog")
module(load="builtin:omfile" Template="RSYSLOG_TraditionalFileFormat")
include(file="/etc/rsyslog.d/*.conf" mode="optional")
*.info;mail.none;authpriv.none;cron.none                /var/log/messages
authpriv.*                                              /var/log/secure
mail.*                                                  -/var/log/maillog
cron.*                                                  /var/log/cron
*.emerg                                                 :omusrmsg:*
uucp,news.crit                                          /var/log/spooler
local7.*                                                /var/log/boot.log

# cat /etc/rsyslog.d/client.conf  | grep -v ^#
*.* action( type="omfwd"
        target  = "192.168.18.158"
        # Do not use 514 as that is rsh-protocol
        port    = "10514"
        protocol= "tcp"
        # XXX
        RebindInterval="100"
        action.resumeInterval="10"
        action.resumeIntervalMax="11"
        action.resumeRetryCount="1"

        # and queue to disk if needs be
        queue.filename="tcp_fwd"
        queue.type="LinkedList"
        queue.saveonshutdown="on"
        queue.maxdiskspace="1g" )
---------------8< ---------------8< ---------------8< ---------------8< ---------------

3. Start the rsyslog in debug mode manually.
---------------8< ---------------8< ---------------8< ---------------8< ---------------
# export RSYSLOG_DEBUG="Debug NoStdOut"
# export RSYSLOG_DEBUGLOG="/var/log/rsyslog-debug-$(date +%s).log"
# strace -ttTvfyy -s 1024 -o /tmp/rsyslog_$(date +%s).strace rsyslogd -n
---------------8< ---------------8< ---------------8< ---------------8< ---------------

4. Disconnect the network of remote server 192.168.18.158.

5. Generate logs with a simple script on local rsyslog client.
---------------8< ---------------8< ---------------8< ---------------8< ---------------
# for i in {1..10000}; do logger `date`; sleep 1; done`
---------------8< ---------------8< ---------------8< ---------------8< ---------------

6. Monitor the strace output and the debug log.

> Actual results:
The action.resumeIntervalMax may not work correctly because the sleep time specified in action.resumeInterval can time out so that the scheduled retry time is not complied.

> Expected results:
The action.resumeIntervalMax should work correctly.

> Additional info:
Upstream issue:
 - https://github.com/rsyslog/rsyslog/issues/5132

--FIXED TYPO--

Comment 5 Attila Lakatos 2023-06-29 10:11:58 UTC
After hours of debugging I've probably found the root cause of the issue. Here is a summary of my findings.
### Reproduction
I create two rsyslog servers, serverA and serverB. I tried to simulate an environment where the first one is the primary server and second one is the backup, in case the first one is down. Both of them were listening on port 514, protocol=tcp. The rsyslog client had the following conf:
...
$DebugFile /var/log/rsyslogdebug
$DebugLevel 2

  action(type="omfwd" target="serverA" port="514" protocol="tcp" RebindInterval="100" ResendLastMSGOnReconnect="on" action.resumeInterval="10" action.resumeIntervalMax="10" action.resumeRetryCount="1")
    action(type="omfwd" target="serverB" port="514" protocol="tcp" RebindInterval="100" ResendLastMSGOnReconnect="on" action.resumeInterval="10" action.resumeIntervalMax="10" action.resumeRetryCount="1" action.execOnlyWhenPreviousIsSuspended="on")
...

Firstly, I shut down serverA and observed what happens next. It took a couple of minutes to create enough logs to reproduce the issue. In the meanwhile I was sending messages via the logger command to the rsyslog client, which then forwards them to the server side.

### Analysis
In the meanwhile, I reviewed the debug log located under /var/log/rsyslogdebug.
The following and many similar lines were suspicious:
../action.c: action 'action-0-builtin:omfwd' suspended, earliest retry=1688029585 (now 1688029555), iNbrResRtry 20, duration 30
By default, the duration should not exceed the 10seconds interval, because we defined it in the configuration. So I thought that some race condition was causing this. We have a check in the code that compares the current waiting time with the maximum value, so that can not be the case. I scrolled through the debug file(few thousand lines) associated with each line in the action.c source code but it did not give any hint.

### Root cause
I thought that the resumeInterVal parameter must have been parsed correctly, otherwise an error message would be displayed, BUT!! When rsyslog is started, it scans the main configuration file and checks what parameters are defined, it was able to detect the action.resumeintervalmax parameter properly.
---
6038.748955263:main thread    : rainerscript.c: action.resumeintervalmax: 10
6038.748962756:main thread    : rainerscript.c: action.resumeinterval: 10
---
Because that was defined as "action.resumeintervalmax". However, while assigning individual parameters to individual variables there was a comparision bug in the code.
---
} else if(!strcmp(pblk.descr[i].name, "action.resumeintervalMax")) {
    pAction->iResumeIntervalMax = pvals[i].val.d.n;
---
Due to that the iResumeIntervalMax parameter was not respected.

Comment 9 Flos Qi Guo 2023-07-13 05:02:23 UTC
Created attachment 1975536 [details]
strace log for rsyslog after having the upstream's PR5171 and debugging PR5120 applied.

Comment 10 Attila Lakatos 2023-07-14 13:09:28 UTC
*.* action( type="omfwd" target="192.168.18.158" port="10514" protocol="tcp" RebindInterval="100"
        action.resumeInterval="10"
        action.resumeIntervalMax="11"
        action.resumeRetryCount="1"

        queue.filename="tcp_fwd"
        queue.type="LinkedList"
        queue.saveonshutdown="on"
        queue.maxdiskspace="1g" )

Adding a note on the config above, with attention to action.resumeInterval, action.resumeIntervalMax and action.resumeRetryCount parameters.
In case rsyslog is not able to deliver a message to a remote destination it will do the following steps: First, the action will be suspended for a configured interval, which is 10 seconds. After 10 seconds, the action will be retried but the target is still down, so it will increment the counter of the number of retries. As a result, the number of retries will be set to 1, which equals to action.resumeRetryCount. This parameter specifies how often an action is retried before it is considered to have failed. Failed actions discard messages.
In general, after every 10th failed retry, the waiting time is prolonged with the value of action.resumeInterval. In case the waiting time is bigger than the value of action.resumeIntervalMax, then action.resumeIntervalMax is used instead. However, in this configuration we will not reach 10 retries, because the action.resumeRetryCount is set to 1, thus after the 1st retry rsyslog will consider the message as not delivered.