Bug 2215190
| Summary: | Race issue can happen between action.resumeInterval and action.resumeIntervalMax | ||||||
|---|---|---|---|---|---|---|---|
| Product: | Red Hat Enterprise Linux 8 | Reporter: | Flos Qi Guo <qguo> | ||||
| Component: | rsyslog | Assignee: | Attila Lakatos <alakatos> | ||||
| Status: | CLOSED MIGRATED | QA Contact: | BaseOS QE Security Team <qe-baseos-security> | ||||
| Severity: | high | Docs Contact: | |||||
| Priority: | high | ||||||
| Version: | 8.5 | CC: | rsroka | ||||
| Target Milestone: | rc | Keywords: | MigratedToJIRA, Triaged | ||||
| Target Release: | --- | Flags: | pm-rhel:
mirror+
|
||||
| Hardware: | Unspecified | ||||||
| OS: | Unspecified | ||||||
| Whiteboard: | |||||||
| Fixed In Version: | Doc Type: | If docs needed, set a value | |||||
| Doc Text: | Story Points: | --- | |||||
| Clone Of: | Environment: | ||||||
| Last Closed: | 2023-08-03 12:50:55 UTC | Type: | Bug | ||||
| Regression: | --- | Mount Type: | --- | ||||
| Documentation: | --- | CRM: | |||||
| Verified Versions: | Category: | --- | |||||
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |||||
| Cloudforms Team: | --- | Target Upstream Version: | |||||
| Embargoed: | |||||||
| Attachments: |
|
||||||
|
Description
Flos Qi Guo
2023-06-15 03:24:33 UTC
After hours of debugging I've probably found the root cause of the issue. Here is a summary of my findings.
### Reproduction
I create two rsyslog servers, serverA and serverB. I tried to simulate an environment where the first one is the primary server and second one is the backup, in case the first one is down. Both of them were listening on port 514, protocol=tcp. The rsyslog client had the following conf:
...
$DebugFile /var/log/rsyslogdebug
$DebugLevel 2
action(type="omfwd" target="serverA" port="514" protocol="tcp" RebindInterval="100" ResendLastMSGOnReconnect="on" action.resumeInterval="10" action.resumeIntervalMax="10" action.resumeRetryCount="1")
action(type="omfwd" target="serverB" port="514" protocol="tcp" RebindInterval="100" ResendLastMSGOnReconnect="on" action.resumeInterval="10" action.resumeIntervalMax="10" action.resumeRetryCount="1" action.execOnlyWhenPreviousIsSuspended="on")
...
Firstly, I shut down serverA and observed what happens next. It took a couple of minutes to create enough logs to reproduce the issue. In the meanwhile I was sending messages via the logger command to the rsyslog client, which then forwards them to the server side.
### Analysis
In the meanwhile, I reviewed the debug log located under /var/log/rsyslogdebug.
The following and many similar lines were suspicious:
../action.c: action 'action-0-builtin:omfwd' suspended, earliest retry=1688029585 (now 1688029555), iNbrResRtry 20, duration 30
By default, the duration should not exceed the 10seconds interval, because we defined it in the configuration. So I thought that some race condition was causing this. We have a check in the code that compares the current waiting time with the maximum value, so that can not be the case. I scrolled through the debug file(few thousand lines) associated with each line in the action.c source code but it did not give any hint.
### Root cause
I thought that the resumeInterVal parameter must have been parsed correctly, otherwise an error message would be displayed, BUT!! When rsyslog is started, it scans the main configuration file and checks what parameters are defined, it was able to detect the action.resumeintervalmax parameter properly.
---
6038.748955263:main thread : rainerscript.c: action.resumeintervalmax: 10
6038.748962756:main thread : rainerscript.c: action.resumeinterval: 10
---
Because that was defined as "action.resumeintervalmax". However, while assigning individual parameters to individual variables there was a comparision bug in the code.
---
} else if(!strcmp(pblk.descr[i].name, "action.resumeintervalMax")) {
pAction->iResumeIntervalMax = pvals[i].val.d.n;
---
Due to that the iResumeIntervalMax parameter was not respected.
Created attachment 1975536 [details]
strace log for rsyslog after having the upstream's PR5171 and debugging PR5120 applied.
*.* action( type="omfwd" target="192.168.18.158" port="10514" protocol="tcp" RebindInterval="100"
action.resumeInterval="10"
action.resumeIntervalMax="11"
action.resumeRetryCount="1"
queue.filename="tcp_fwd"
queue.type="LinkedList"
queue.saveonshutdown="on"
queue.maxdiskspace="1g" )
Adding a note on the config above, with attention to action.resumeInterval, action.resumeIntervalMax and action.resumeRetryCount parameters.
In case rsyslog is not able to deliver a message to a remote destination it will do the following steps: First, the action will be suspended for a configured interval, which is 10 seconds. After 10 seconds, the action will be retried but the target is still down, so it will increment the counter of the number of retries. As a result, the number of retries will be set to 1, which equals to action.resumeRetryCount. This parameter specifies how often an action is retried before it is considered to have failed. Failed actions discard messages.
In general, after every 10th failed retry, the waiting time is prolonged with the value of action.resumeInterval. In case the waiting time is bigger than the value of action.resumeIntervalMax, then action.resumeIntervalMax is used instead. However, in this configuration we will not reach 10 retries, because the action.resumeRetryCount is set to 1, thus after the 1st retry rsyslog will consider the message as not delivered.
|