Bug 1549706

Summary: rsyslog SEGV with disk-assisted queues, possible corruption
Product: Red Hat Enterprise Linux 7 Reporter: Peter Portante <pportant>
Component: rsyslogAssignee: Jiří Vymazal <jvymazal>
Status: CLOSED ERRATA QA Contact: Dalibor Pospíšil <dapospis>
Severity: urgent Docs Contact:
Priority: high    
Version: 7.4CC: amitkuma, cbesson, cww, dapospis, dkopecek, jvymazal, mgrepl, mmatsuya, nhosoi, pportant, pvrabec, rmetrich, rsroka, sbroz
Target Milestone: rcKeywords: Reopened, Triaged
Target Release: ---   
Hardware: x86_64   
OS: Linux   
Whiteboard:
Fixed In Version: rsyslog-8.24.0-48.el7 Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
: 1738213 (view as bug list) Environment:
Last Closed: 2020-03-31 19:10:09 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On:    
Bug Blocks: 1477664, 1738213    
Attachments:
Description Flags
rsyslogd killed by SIGSEGV none

Description Peter Portante 2018-02-27 16:59:56 UTC
Created attachment 1401462 [details]
rsyslogd killed by SIGSEGV

With rsyslog-8.24.0-12.el7_4.1 (the customer support exception image), we need repeatedly encountered SEGV related to the on-disk queues configured.

See the attached SEGV abrt report.

We put this version of rsyslog onto our 12 production machines and we have encountered this crash on 2 of the nodes.  We don't think it is related to the patch this image was generated for (duplicate logs), as we encounter the same stack track when we use the regular version of rsyslog on RHEL 7.4.

[root@gprfc053 ~]# gdb /usr/sbin/rsyslogd
GNU gdb (GDB) Red Hat Enterprise Linux 7.6.1-100.el7_4.1
Copyright (C) 2013 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.  Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-redhat-linux-gnu".
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>...
Reading symbols from /usr/sbin/rsyslogd...Reading symbols from /usr/lib/debug/usr/sbin/rsyslogd.debug...done.
done.
(gdb) run -n
Starting program: /usr/sbin/rsyslogd -n
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".
[New Thread 0x7ffed86d2700 (LWP 45053)]
[New Thread 0x7ffed7ed1700 (LWP 45054)]
[New Thread 0x7ffed60ca700 (LWP 45055)]
[New Thread 0x7ffed58c9700 (LWP 45056)]

Program received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0x7ffed86d2700 (LWP 45053)]
rsCStrDestruct (ppThis=ppThis@entry=0x7ffed00168f0) at stringbuf.c:216
216		free(pThis->pBuf);
(gdb) list
211
212	void rsCStrDestruct(cstr_t **ppThis)
213	{
214		cstr_t *pThis = *ppThis;
215
216		free(pThis->pBuf);
217		RSFREEOBJ(pThis);
218		*ppThis = NULL;
219	}
220
(gdb) up
#1  0x000055555558a468 in varDestruct (ppThis=0x7ffed86d1bc0) at var.c:72
72				rsCStrDestruct(&pThis->val.pStr);
(gdb) list
67	CODESTARTobjDestruct(var)
68		if(pThis->pcsName != NULL)
69			rsCStrDestruct(&pThis->pcsName);
70		if(pThis->varType == VARTYPE_STR) {
71			if(pThis->val.pStr != NULL)
72				rsCStrDestruct(&pThis->val.pStr);
73		}
74	ENDobjDestruct(var)
75
76
(gdb) up
#2  0x0000555555576314 in MsgDeserialize (pMsg=0x7ffed0019680, pStrm=pStrm@entry=0x555555826ed0) at msg.c:1405
1405			var.Destruct(&pVar);
(gdb) list
1400		if(!isProp("offMSG"))
1401			ABORT_FINALIZE(RS_RET_DS_PROP_SEQ_ERR);
1402		MsgSetMSGoffs(pMsg, pVar->val.num);
1403	finalize_it:
1404		if(pVar != NULL)
1405			var.Destruct(&pVar);
1406		RETiRet;
1407	}
1408	#undef isProp
1409
(gdb) where
#0  rsCStrDestruct (ppThis=ppThis@entry=0x7ffed00168f0) at stringbuf.c:216
#1  0x000055555558a468 in varDestruct (ppThis=0x7ffed86d1bc0) at var.c:72
#2  0x0000555555576314 in MsgDeserialize (pMsg=0x7ffed0019680, pStrm=pStrm@entry=0x555555826ed0) at msg.c:1405
#3  0x00005555555822a9 in objDeserializeWithMethods (ppObj=0x7ffed86d1cf0, pszTypeExpected=pszTypeExpected@entry=0x5555555bbfe6 "msg", lenTypeExpected=lenTypeExpected@entry=3, pStrm=0x555555826ed0, fFixup=fFixup@entry=0x0, pUsr=pUsr@entry=0x0, objConstruct=0x555555574100 <msgConstructForDeserializer>, objConstructFinalize=objConstructFinalize@entry=0x0, objDeserialize=0x5555555762b0 <MsgDeserialize>) at obj.c:915
#4  0x000055555558c8b8 in qDeqDisk (pThis=<optimized out>, ppMsg=<optimized out>) at queue.c:988
#5  0x000055555558e741 in qqueueDeq (ppMsg=0x7ffed86d1cf0, pThis=0x555555823080) at queue.c:1115
#6  DequeueConsumableElements (pSkippedMsgs=<optimized out>, piRemainingQueueSize=<synthetic pointer>, pWti=<optimized out>, pThis=0x555555823080) at queue.c:1713
#7  DequeueConsumable (pThis=pThis@entry=0x555555823080, pWti=pWti@entry=0x5555558235d0, pSkippedMsgs=pSkippedMsgs@entry=0x7ffed86d1d44) at queue.c:1767
#8  0x000055555558ea92 in DequeueForConsumer (pSkippedMsgs=0x7ffed86d1d44, pWti=0x5555558235d0, pThis=0x555555823080) at queue.c:1919
#9  ConsumerReg (pThis=0x555555823080, pWti=0x5555558235d0) at queue.c:1975
#10 0x000055555558b66e in wtiWorker (pThis=pThis@entry=0x5555558235d0) at wti.c:365
#11 0x000055555558a7dd in wtpWorker (arg=0x5555558235d0) at wtp.c:410
#12 0x00007ffff79b0e25 in start_thread (arg=0x7ffed86d2700) at pthread_create.c:308
#13 0x00007ffff6aa934d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:113

Comment 3 Peter Portante 2018-02-27 19:38:28 UTC
Could this be the same as https://access.redhat.com/solutions/882893?  I would think not, but it does match the stack trace.

Comment 4 Peter Portante 2018-02-28 00:00:59 UTC
Perhaps another clue: when I deleted the first queue file, and edited the .qi file to reference the next queue file, it processed many more queue files until is SEGV'd again.  When I then deleted that queue file, it continued until a 3rd one.  Finally is processed the remaining few hundred of them (1 GB max on disk, with 1 MB files) until it caught up.

I saved the state of the /var/lib/rsyslog directory before making these changes.  There are 1,026 files in /var/lib/rsyslog:

[/var/lib/rsyslog.bz1549706]# ls -1 | wc -l
1026
[/var/lib/rsyslog.bz1549706]# ls -l | head -n 3
total 1052060
-rw-------. 1 root root     126 Feb 23 01:27 imjournal.state
-rw-------. 1 root root 1051359 Jan 30 13:10 ViaQ_Collector_01.00055319
[/var/lib/rsyslog.bz1549706]# ls -l | tail -n 3
-rw-------. 1 root root 1051129 Feb 22 19:53 ViaQ_Collector_01.00056341
-rw-------. 1 root root  354308 Feb 23 01:25 ViaQ_Collector_01.00056342
-rw-------. 1 root root     572 Feb 22 19:53 ViaQ_Collector_01.qi
[/var/lib/rsyslog.bz1549706]# cat ViaQ_Collector_01.qi
<OPB:1:qqueue:1:
+iQueueSize:2:6:323723:
+tVars.disk.sizeOnDisk:2:10:1073394353:
>End
.
<Obj:1:strm:1:
+iCurrFNum:2:5:56342:
+pszFName:1:17:ViaQ_Collector_01:
+iMaxFiles:2:8:10000000:
+bDeleteOnClose:2:1:0:
+sType:2:1:1:
+tOperationsMode:2:1:2:
+tOpenMode:2:3:384:
+iCurrOffs:2:4:2447:
+inode:2:1:0:
+bPrevWasNL:2:1:0:
>End
.
<Obj:1:strm:1:
+iCurrFNum:2:5:55319:
+pszFName:1:17:ViaQ_Collector_01:
+iMaxFiles:2:8:10000000:
+bDeleteOnClose:2:1:1:
+sType:2:1:1:
+tOperationsMode:2:1:1:
+tOpenMode:2:3:384:
+iCurrOffs:2:6:432905:
+inode:2:7:1060046:
+bPrevWasNL:2:1:0:
>End
.

Below is our rsyslog configuration:

[/var/lib/rsyslog.bz1549706]# cat /etc/rsyslog.conf
# rsyslog v7 for RHEL 7 configuration file

# For more information see /usr/share/doc/rsyslog-*/rsyslog_conf.html
# If you experience problems, see http://www.rsyslog.com/doc/troubleshoot.html

#### MODULES ####

# The imjournal module below is now used as a message source instead of imuxsock,
# providing access to the systemd journal and /dev/log messages.
$ModLoad imjournal


#### GLOBAL DIRECTIVES ####

# Where to place auxiliary files
$WorkDirectory /var/lib/rsyslog

# Use updated timestamp format for local files
$ActionFileDefaultTemplate RSYSLOG_FileFormat

# We want fully qualified domain names for common logging
$PreserveFQDN on

# File to store the position in the journal
$IMJournalStateFile imjournal.state
$IMJournalPersistStateInterval 10000
$IMJournalRatelimitInterval 1
$IMJournalRatelimitBurst 100000000

# Include all config files in /etc/rsyslog.d/
$IncludeConfig /etc/rsyslog.d/*.conf


#### RULES ####

# Log anything (except mail) of level info or higher.
# Don't log private authentication messages!
*.info;mail.none;authpriv.none;cron.none                /var/log/messages

# The authpriv file has restricted access.
authpriv.*                                              /var/log/secure

# Log all the mail messages in one place.
mail.*                                                  -/var/log/maillog

# Log cron stuff
cron.*                                                  /var/log/cron

# Save news errors of level crit and higher in a special file.
uucp,news.crit                                          /var/log/spooler

# Save boot messages also to boot.log
local7.*                                                /var/log/boot.log
[/var/lib/rsyslog.bz1549706]# ls /etc/rsyslog.d/
00-viaq-tags.conf           01-viaq-forwarding.conf     99-example-forwarding.conf
[/var/lib/rsyslog.bz1549706]# cat /etc/rsyslog.d/00-viaq-tags.conf
# A simple string of "tags" that system administrators can use to identify the
# various roles a system can take on, or categories to consider it under.
#
# E.g.
# set $!tags = "EXAMPLETAG EXAMPLETAGTWO EXAMPLETAGTHREE";
set $!tags = "BAGL perf-dept OSE";
[/var/lib/rsyslog.bz1549706]# cat /etc/rsyslog.d/01-viaq-forwarding.conf
# This is basically the RSYSLOG_SyslogProtocol23Format, which is RFC 5424 on
# the wire, but with the message payload a CEE/Lumberjack JSON document.
template(name="ViaQ_SyslogProtocol23Format" type="string"
    string="<%PRI%>1 %TIMESTAMP:::date-rfc3339% %HOSTNAME% %APP-NAME% %PROCID% %MSGID% %STRUCTURED-DATA% @cee:%$!%\n")

# Some log messages have not been stored in the JSON variables, so check and
# add them here.
if strlen($!msg) < 1 then {
    set $!msg = $msg;
}

# rsyslog instance in the Integration Lab over the 1Gb public lab switch
$ActionQueueType LinkedList             # run asynchronously
$ActionQueueFileName ViaQ_Collector_01  # unique name prefix for spool files
$ActionQueueMaxDiskSpace 1g             # 1gb space limit (use as much as possible)
$ActionQueueSaveOnShutdown on           # save messages to disk on shutdown
$ActionResumeRetryCount -1              # infinite retries if host is down
*.*  @@172.17.40.134:514;ViaQ_SyslogProtocol23Format
[/var/lib/rsyslog.bz1549706]# cat /etc/rsyslog.d/99-example-forwarding.conf
# ### begin forwarding rule ###
# The statement between the begin ... end define a SINGLE forwarding
# rule. They belong together, do NOT split them. If you create multiple
# forwarding rules, duplicate the whole block!
# Remote Logging (we use TCP for reliable delivery)
#
# An on-disk queue is created for this action. If the remote host is
# down, messages are spooled to disk and sent when it is up again.
#$ActionQueueFileName fwdRule1 # unique name prefix for spool files
#$ActionQueueMaxDiskSpace 1g   # 1gb space limit (use as much as possible)
#$ActionQueueSaveOnShutdown on # save messages to disk on shutdown
#$ActionQueueType LinkedList   # run asynchronously
#$ActionResumeRetryCount -1    # infinite retries if host is down
# remote host is: name/ip:port, e.g. 192.168.0.1:514, port optional
#*.* @@remote-host:514
# ### end of the forwarding rule ###

Comment 9 Peter Portante 2018-03-01 20:05:31 UTC
Well, the bug is with the queue files, and it seems to be because they are corrupted.  The previous bug, 1511485, does not seem to be the problem.

Here is an example queue file line:

/var/lib/rsyslog/ViaQ_Collector_01.00055319 (line 5137)
+<Obj:1:msg:1:

Full record:

.
<Obj:1:msg:1:
+iProtocolVersion:2:1:0:
+iSeverity:2:1:6:
+iFacility:2:1:1:
+msgFlags:2:1:0:
+ttGenTime:2:10:1517316316:
+<Obj:1:msg:1:
+iProtocolVersion:2:1:0:
+iSeverity:2:1:6:
+iFacility:2:1:1:
+msgFlags:2:1:0:
+ttGenTime:2:10:1517316316:
+tRcvdAt:3:35:2:2018:1:30:12:45:16:668346:6:+:0:0:
+tTIMESTAMP:3:35:2:2018:1:30:12:45:16:668346:6:+:0:0:
+pszTAG:1:8:journal::
+pszRawMsg:1:1423:... removed for anonymity ...:
+pszHOSTNAME:1:15:foo.example.com:
+pszInputName:1:9:imjournal:
+pszRcvFrom:1:15:foo.example.com:
+pszRcvFromIP:1:9:127.0.0.1:
+json:1:4587:... removed for anonymity ...:
+offMSG:2:1:0:
>End
.

Comment 20 Jiří Vymazal 2019-07-31 13:27:00 UTC
This issue was not selected to be included in Red Hat Enterprise Linux 7.7 because it is seen either as low or moderate impact to a small number of use-cases. This release is in Maintenance Support 1 Phase, which means that qualified Critical and Important Security errata advisories (RHSAs) and Urgent Priority Bug Fix errata advisories (RHBAs) may be released as they become available. We will now close this issue, but if you believe that it qualifies for the Maintenance Support 1 Phase, please re-open; otherwise, we recommend moving the request to Red Hat Enterprise Linux 8 if applicable.

Comment 21 Christophe Besson 2019-09-04 07:43:44 UTC
Finally succeeded to reproduce the issue in my environment with the latest rsyslog package:
rsyslog-8.24.0-41.el7_7

It happens on RHEL 7.6 as well as RHEL 7.7. The backtrace shows exactly the same:

#0  rsCStrDestruct (ppThis=ppThis@entry=0x7fd874004250) at stringbuf.c:216
#1  0x00005579594a7a08 in varDestruct (ppThis=0x7fd87bbf6bc0) at var.c:72
#2  0x00005579594933c4 in MsgDeserialize (pMsg=0x7fd874004590, pStrm=pStrm@entry=0x55795b4e1710) at msg.c:1407
#3  0x000055795949f439 in objDeserializeWithMethods (ppObj=0x7fd87bbf6cf0, pszTypeExpected=pszTypeExpected@entry=0x5579594d962f "msg", lenTypeExpected=lenTypeExpected@entry=3, 
    pStrm=0x55795b4e1710, fFixup=fFixup@entry=0x0, pUsr=pUsr@entry=0x0, objConstruct=0x5579594911b0 <msgConstructForDeserializer>, objConstructFinalize=objConstructFinalize@entry=0x0, 
    objDeserialize=0x557959493360 <MsgDeserialize>) at obj.c:915
#4  0x00005579594a9e58 in qDeqDisk (pThis=<optimized out>, ppMsg=<optimized out>) at queue.c:988
#5  0x00005579594abd21 in qqueueDeq (ppMsg=0x7fd87bbf6cf0, pThis=0x55795b4da900) at queue.c:1115
#6  DequeueConsumableElements (pSkippedMsgs=<optimized out>, piRemainingQueueSize=<synthetic pointer>, pWti=<optimized out>, pThis=0x55795b4da900) at queue.c:1714
#7  DequeueConsumable (pThis=pThis@entry=0x55795b4da900, pWti=pWti@entry=0x55795b4dada0, pSkippedMsgs=pSkippedMsgs@entry=0x7fd87bbf6d44) at queue.c:1770
#8  0x00005579594ac092 in DequeueForConsumer (pSkippedMsgs=0x7fd87bbf6d44, pWti=0x55795b4dada0, pThis=0x55795b4da900) at queue.c:1922
#9  ConsumerReg (pThis=0x55795b4da900, pWti=0x55795b4dada0) at queue.c:1978
#10 0x00005579594a8c0e in wtiWorker (pThis=pThis@entry=0x55795b4dada0) at wti.c:365
#11 0x00005579594a7d7d in wtpWorker (arg=0x55795b4dada0) at wtp.c:410
#12 0x00007fd87de44ea5 in start_thread (arg=0x7fd87bbf7700) at pthread_create.c:307
#13 0x00007fd87cf388cd in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111

I provide the needed data to reproduce this issue in a private attachment (rsyslog.conf and a set of queue files).

Comment 27 Jiří Vymazal 2019-11-04 14:53:23 UTC
Found the fix (see linked upstream PR)

Comment 32 errata-xmlrpc 2020-03-31 19:10:09 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2020:1000