Bug 963942
| Summary: | rsyslog doesn't clear $WorkDirectory for disk-assisted queue when remote rsyslog server comes back online. |
|---|---|
| Product: | Red Hat Enterprise Linux 6 |
| Component: | rsyslog |
| Version: | 6.4 |
| Hardware: | x86_64 |
| OS: | Linux |
| Status: | CLOSED ERRATA |
| Severity: | medium |
| Priority: | medium |
| Reporter: | sean.edge |
| Assignee: | Tomas Heinrich <theinric> |
| QA Contact: | Karel Srot <ksrot> |
| CC: | akarlsso, b.j.smith, ikryten, ksrot, printul77700, pvrabec, rdutta, rpiddapa, rrajaram, theinric |
| Target Milestone: | rc |
| Fixed In Version: | rsyslog-5.8.10-7.el6 |
| Doc Type: | Bug Fix |
| Type: | Bug |
| Last Closed: | 2013-11-21 23:40:41 UTC |

Doc Text:
- Cause: Wrong data type of a variable holding the spool file size limit.
- Consequence: This could have resulted in message loss and the intended size limit not being respected.
- Fix: The package was patched and the data type corrected.
- Result: Correct spool files should be created and messages resent as expected.
QueueIndex file contents:

[root@n5 ~]# cat /var/spool/rsyslog_spool/aqfwd.qi
<OPB:1:qqueue:1: +iQueueSize:2:3:621: +tVars.disk.sizeOnDisk:2:6:348756: +tVars.disk.bytesRead:2:1:0: >End
.
<Obj:1:strm:1: +iCurrFNum:2:1:1: +pszFName:1:5:aqfwd: +iMaxFiles:2:8:10000000: +bDeleteOnClose:2:1:0: +sType:2:1:1: +tOperationsMode:2:1:2: +tOpenMode:2:3:384: +iCurrOffs:2:1:0: >End
.
<Obj:1:strm:1: +iCurrFNum:2:1:1: +pszFName:1:5:aqfwd: +iMaxFiles:2:8:10000000: +bDeleteOnClose:2:1:1: +sType:2:1:1: +tOperationsMode:2:1:1: +tOpenMode:2:3:384: +iCurrOffs:2:1:0: >End
.

Syslog message the QI file points to:

[root@n5 ~]# cat /var/spool/rsyslog_spool/aqfwd.00000001
<Obj:1:msg:1: +iProtocolVersion:2:1:0: +iSeverity:2:1:5: +iFacility:2:2:16: +msgFlags:2:1:0: +ttGenTime:2:10:1368676097: +tRcvdAt:3:34:2:2013:5:16:3:48:17:881464:6:+:0:0: +tTIMESTAMP:3:34:2:2013:5:16:3:48:17:881464:6:+:0:0: +pszTAG:1:5:audit: +pszRawMsg:1:115:94): item=0 name="/tmp/" inode=652801 dev=fd:00 mode=041777 ouid=0 ogid=0 rdev=00:00 obj=system_u:object_r:tmp_t:s0: +pszHOSTNAME:1:2:n5: +pszInputName:1:6:imfile: +pszRcvFrom:1:0:: +pszRcvFromIP:1:0:: +offMSG:2:1:0: >End
.
s0 tclass=tcp_socket: +pszHOSTNAME:1:2:n5: +pszInputName:1:6:imfile: +pszRcvFrom:1:0:: +pszRcvFromIP:1:0:: +offMSG:2:1:0: >End
.

A few things will be needed here, in order of requirement and data collection:
1) Start a loop of the logger command on the client. Something as simple as this:
for i in {1..100}; do logger "$i Test"; done
Increase the upper bound if needed and insert a sleep between iterations.
2) Ideally, the "$i Test" messages will appear on the remote log server.
3) Take down the remote log server by stopping rsyslog there.
4) Now, while the logger loop is still running, the on-disk queue is created on the client and keeps growing.
5) Start a tcpdump at both client and server.
tcpdump -s0 -i ethX -w /tmp/from_client.pcap host <server-ip>
tcpdump -s0 -i ethX -w /tmp/from_server.pcap host <client-ip>
6) Start rsyslog in debug mode on server.
pkill rsyslog
/sbin/rsyslogd -c5 -dn > logfile
Let it run for around 5 minutes. In the problematic circumstance, the "$i Test" messages will not arrive on the remote log server.
7) Run a manual logger "This is manual test" from the client. Now see whether this message reaches the server or not.
8) Ctrl-C the debug-mode rsyslog on the server.
9) Restart rsyslog normally on the server.
We will need to check the tcpdump captures, the debug logfile, and /var/log/messages from the server.
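The logger loop from step 1 can be made a bit more flexible by separating message generation from delivery. This is a minimal sketch; the "N Test" message format comes from the steps above, while the `gen_msgs` helper name and the 5-second sleep are illustrative choices, not from the report.

```shell
#!/usr/bin/env bash
# Emit n numbered test messages ("1 Test", "2 Test", ...) on stdout.
gen_msgs() {
  local n=$1
  for ((i = 1; i <= n; i++)); do
    echo "$i Test"
  done
}

# On a real client, feed each message to logger with a pause so the
# run spans the server outage:
#   gen_msgs 100 | while read -r m; do logger "$m"; sleep 5; done
gen_msgs 3
```

Keeping the generator separate makes it easy to check on the receiving side exactly which sequence numbers arrived.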
I am not able to reproduce the bug. In my case, messages are properly delivered and the queue files are deleted, although I didn't leave the collector down for such a long time.

According to the rsyslog documentation (http://www.rsyslog.com/doc/rsconf1_actionresumeinterval.html) the resume interval is (numRetries / 10 + 1) * $ActionResumeInterval. So after the 10th try it is by default 60 seconds, and after the 100th try it is 330 seconds. Is it possible that the reporter didn't give the client enough time to contact the server? If the server was down for several days, it is possible that the resume interval was even more than ten minutes.

Soham, do you observe a message loss? What rsyslog.conf do you use?

Here is the relevant section of my config again, though I believe I attached the whole thing to this bug:

# Forward messages to central log server.
$ActionQueueType LinkedList
$ActionQueueFileName aqfwd
$ActionQueueSize 1000
$ActionQueueLowWaterMark 200
$ActionQueueHighWaterMark 800
$ActionQueueSaveOnShutdown on
#$ActionQueueDiscardSeverity 7 # Don't discard messages.
$ActionQueueDiscardMark 975
$ActionQueueMaxFileSize 1m
$ActionResumeRetryCount -1
$ActionQueueMaxDiskSpace 1g
*.* @@logserver:10514;RSYSLOG_ForwardFormat

Karel, if you look at the support case you can see that it's been about a week since I did a similar test to what Soham is asking for. The spool files are still sitting there. Soham, I'll try to run your test some time today.
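The back-off formula quoted from the rsyslog docs can be sanity-checked with a small shell function. This is a sketch: the `resume_interval` name is ours, and it assumes the documented default $ActionResumeInterval of 30 seconds.

```shell
# Effective resume interval, per the rsyslog docs cited above:
#   (numRetries / 10 + 1) * $ActionResumeInterval
resume_interval() {
  local retries=$1 base=${2:-30}   # 30 s is rsyslog's documented default
  echo $(( (retries / 10 + 1) * base ))
}

resume_interval 10    # -> 60  (seconds, after the 10th retry)
resume_interval 100   # -> 330 (seconds, after the 100th retry)
```

This is why a collector that has been down for days can take many minutes to be retried after it comes back online.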
>
> 5) Start a tcpdump at both client and server.
>
> tcpdump -s0 -i ethX -w /tmp/from_client.pcap host <server-ip>
> tcpdump -s0 -i ethX -w /tmp/from_server.pcap host <client-ip>
>
I'm pretty sure from_client should have the client ip in this tcpdump line, right?
Created attachment 760707 [details]
from_server.pcap
Created attachment 760708 [details]
from_client.pcap
Soham, I did what you asked. For #7, yes, the manual message reaches the logserver.

My spool directory continues to grow:

-rw-------. 1 root root     392 Jun  7 14:09 aqfwd.00015436
-rw-------. 1 root root 1048690 Jun  7 14:12 aqfwd.00015442
-rw-------. 1 root root 1048856 Jun  7 14:13 aqfwd.00015443
-rw-------. 1 root root     496 Jun 11 14:18 aqfwd.qi
-rw-------. 1 root root 1048807 Jun 13 13:52 aqfwd.00015444
-rw-------. 1 root root 1048887 Jun 13 13:52 aqfwd.00015445
-rw-------. 1 root root 1048942 Jun 13 13:54 aqfwd.00015446
-rw-------. 1 root root  200056 Jun 13 13:54 aqfwd.00015447

This large gap in test messages is where I had rsyslog down on the server side:

2013-06-13T13:49:33.345911+00:00 n4 root: 1 Test
2013-06-13T13:49:38.352978+00:00 n4 root: 2 Test
2013-06-13T13:49:43.360908+00:00 n4 root: 3 Test
2013-06-13T13:49:48.368667+00:00 n4 root: 4 Test
2013-06-13T13:49:53.376028+00:00 n4 root: 5 Test
2013-06-13T13:49:58.383314+00:00 n4 root: 6 Test
2013-06-13T13:50:03.391075+00:00 n4 root: 7 Test
2013-06-13T13:50:08.398741+00:00 n4 root: 8 Test
2013-06-13T13:50:13.406110+00:00 n4 root: 9 Test
2013-06-13T13:53:38.317058+00:00 n4 root: 6786 Test
2013-06-13T13:53:38.337640+00:00 n4 root: 6787 Test
2013-06-13T13:53:38.357610+00:00 n4 root: 6788 Test
2013-06-13T13:53:38.377921+00:00 n4 root: 6789 Test
2013-06-13T13:53:38.398555+00:00 n4 root: 6790 Test
2013-06-13T13:53:38.420186+00:00 n4 root: 6791 Test
2013-06-13T13:53:38.440441+00:00 n4 root: 6792 Test
2013-06-13T13:53:38.460366+00:00 n4 root: 6793 Test
2013-06-13T13:53:38.479349+00:00 n4 root: 6794 Test
2013-06-13T13:53:38.495890+00:00 n4 root: 6795 Test
2013-06-13T13:53:38.518186+00:00 n4 root: 6796 Test
2013-06-13T13:53:38.540254+00:00 n4 root: 6797 Test
2013-06-13T13:53:38.562547+00:00 n4 root: 6798 Test
2013-06-13T13:53:38.584377+00:00 n4 root: 6799 Test
2013-06-13T13:53:38.605304+00:00 n4 root: 6800 Test
2013-06-13T13:53:38.626186+00:00 n4 root: 6801 Test
2013-06-13T13:53:38.647346+00:00 n4 root: 6802 Test
2013-06-13T13:53:38.668363+00:00 n4 root: 6803 Test
2013-06-13T13:53:38.689164+00:00 n4 root: 6804 Test
2013-06-13T13:53:38.709830+00:00 n4 root: 6805 Test
2013-06-13T13:53:38.730565+00:00 n4 root: 6806 Test

I never get "10 Test" if I look at the logserver's side.

[root@mpoam5-logserver 13]# egrep '\b9 Test' messages
2013-06-13T13:50:13.406110+00:00 n4 root: 9 Test
[root@mpoam5-logserver 13]# egrep '\b10 Test' messages
[root@mpoam5-logserver 13]#

(In reply to sean.edge from comment #12)
> > 5) Start a tcpdump at both client and server.
> >
> > tcpdump -s0 -i ethX -w /tmp/from_client.pcap host <server-ip>
> > tcpdump -s0 -i ethX -w /tmp/from_server.pcap host <client-ip>
>
> I'm pretty sure from_client should have the client ip in this tcpdump line,
> right?

Sean, let's take the data to the issue tracker. This is a public bug and we don't want the tcpdump and debug log to be world-visible. Please remove the files here and attach them in the support case. I will look there. If anything on the bz is needed, I will ask accordingly.

I don't see any way to delete them. The only action I have available on them is to view them. Can you remove them?

Sean, Karel has made them private, so only we will be able to view them. Don't worry.

Perfect, thanks! I'll need to use dropbox.redhat.com to upload the rsyslog debug output because it's too large. Keep an eye out for that in the next few minutes.

Nevermind, I compressed it and was able to upload it. It's attached to the support case now.

Tomas, do you think that rsyslog is able to forward messages stored in disk queues even if it used all inodes on the disk? In the description #c0 I do not see whether the logs kept coming even when all the inodes were used. I am curious how important the "use all inodes" part is, and whether the scenario works OK for the reporter if there are free inodes left.
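The egrep spot-check above ("9 Test" present, "10 Test" missing) can be generalized into a small helper that finds the highest delivered sequence number. A sketch; the `last_seq` name is ours and the sample lines are illustrative.

```shell
#!/usr/bin/env bash
# Highest "N Test" sequence number present in a log file -- a quick way
# to locate where delivery stopped.
last_seq() {
  grep -oE '[0-9]+ Test' "$1" | awk '{ if ($1 + 0 > m) m = $1 + 0 } END { print m + 0 }'
}

# Demo against a scratch file standing in for the logserver's messages file:
f=$(mktemp)
printf '%s\n' 'n4 root: 9 Test' 'n4 root: 10 Test' 'unrelated line' > "$f"
last_seq "$f"   # -> 10
rm -f "$f"
```

Comparing this number on the client and the logserver gives the size of the delivery gap at a glance.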
Karel, I tried freeing up some inodes and restarting rsyslog, wondering if perhaps rsyslog needed to write some temporary/work files in order to process the spooled messages. Unfortunately that didn't help.

(In reply to sean.edge from comment #25)
> Karel,
> I tried freeing up some inodes and restarting rsyslog, wondering if perhaps
> rsyslog needed to write some temporary/work files in order to process the
> spooled messages. Unfortunately that didn't help.

And is the inode consumption crucial for the reproducer? Is the scenario (queued log delivery when the server comes back online) working otherwise? Thank you.

Karel,
No, it's not crucial for the reproducer. I was just trying to list that as a side effect. The real issue is that log messages in the spool directory are NEVER forwarded to the remote logging server, even after it comes back online and after various rsyslog service restarts, nor is the spool directory cleared, thus creating a condition where we are running out of inodes in our dev environment. What concerns me most is the messages never being reprocessed. Hopefully that helps.

Neither can I reproduce the problem. I've tried taking down the receiver, sending ~500K messages and bringing it up again. After a while, all the messages are sent to the other side. I've also tried taking down the receiver, sending a couple thousand msgs, stopping rsyslog, starting the receiver and starting rsyslog. Success again. I've used a slightly altered version of the configuration you'd provided.

Tomas,
I was playing around with this a bit yesterday and here's what I saw.

1) 10g as an ActionQueueMaxFileSize does not work as expected. This should create up to an approximately 10GB file for queued syslog messages (when the queue is full) but instead it creates one file per message. Changing this value to 1g, 50m, 10m, or 1m all do what you would expect. This was causing me to run out of inodes in my dev environment (VMs with not much disk space).
2) I was monitoring the work directory with 'watch -n2 ls -ltr' while doing these tests. Once the receiver came back a large number of spool files were cleared (~4K) but the rest were not. I double checked and those messages were not resent to the receiver. I'm not sure if I wasn't patient enough or what, but it seemed to stop processing the spooled syslog messages.

I've happily changed my ActionQueueMaxFileSize to 50m and am content with that as a resolution for the time being. Thanks.

Sean,
Let me try to formalize your last comment. You mean, $ActionQueueMaxFileSize, whatever value is set, doesn't work as the name suggests. That is, when the server is down and client is creating spool disk files (supposed to be sent later), it doesn't allow the spool file(s) to grow to a single 10 GiB. But, creates smaller files per message (like - logger "A" and then logger "B"). This will create two log files of the size of "A" and "B". Is this correct?

And when it happens, the filesystem on the client disk, i.e. /var/spool/rsyslog (in /var to be specific) goes short on inode. And we eventually see loss of spooled messages. Am I correct in postulating the first answer in the correct terms.

Also, about the second answer, I also don't know. Probably your testing is what we can rely. Also, how ActionQueueMaxFileSize to 50 MiB is helping. I mean, if you generate 50 files of 1 MiB each, then we cull 50 inodes. Is this correct. Let me know please.

Soham

(In reply to Soham Chakraborty from comment #31)
> You mean, $ActionQueueMaxFileSize, whatever value is set, doesn't work as
> the name suggests. That is, when the server is down and client is creating
> spool disk files (supposed to be sent later), it doesn't allow the spool
> file(s) to grow to a single 10 GiB. But, creates smaller files per message
> (like - logger "A" and then logger "B"). This will create two log files of
> the size of "A" and "B". Is this correct?

Yes, but only when it's set to 10g.
Other values that I have previously listed work as expected and fill a spool file until it reaches the defined max; then rsyslog creates a new one and continues on.

> And when it happens, the filesystem on the client disk, i.e.
> /var/spool/rsyslog (in /var to be specific) goes short on inode. And we
> eventually see loss of spooled messages.

Presumably, yes. I haven't verified that myself but it's a reasonable assumption. The messages don't have any place to go.

> Am I correct in postulating the first answer in the correct terms.

I think so. With $ActionQueueMaxFileSize=10g rsyslog creates one spool file per syslog message. Using the other values I mentioned, rsyslog creates a spool file, fills it until maxfilesize is met or exceeded, and then it creates a new spool file.

> Also, about the second answer, I also don't know. Probably your testing is
> what we can rely.

I don't know what you mean here.

> Also, how ActionQueueMaxFileSize to 50 MiB is helping. I mean, if you
> generate 50 files of 1 MiB each, then we cull 50 inodes. Is this correct.

I don't completely understand what you wrote here. Since rsyslog can now put many messages in a single file, we are less likely to run out of inodes. Previously, it was creating one spool file per queued message. This means I had hundreds of thousands of spool files in my work directory and rsyslog was never clearing them out.

(In reply to sean.edge from comment #32)
> > Also, how ActionQueueMaxFileSize to 50 MiB is helping. I mean, if you
> > generate 50 files of 1 MiB each, then we cull 50 inodes. Is this correct.
>
> I don't completely understand what you wrote here. <...>

I think there's a misunderstanding on Soham's part that...

(In reply to Soham Chakraborty from comment #31)
> You mean, $ActionQueueMaxFileSize, whatever value is set, doesn't work as
> the name suggests.

The wrong behavior should manifest only with the 10GB argument, as Sean stated, therefore...
(In reply to sean.edge from comment #30)
> I've happily changed my ActionQueueMaxFileSize to 50m and am content with
> that as a resolution for the time being.

(In reply to sean.edge from comment #30)
> Tomas,
> I was playing around with this a bit yesterday and here's what I saw.
> 1) 10g as an ActionQueueMaxFileSize does not work as expected. This should
> create up to an approximately 10GB file for queued syslog messages (when
> queue is full) but instead it creates one file per message. Changing this
> value to 1g,50m,10m,1m all do what you would expect. This was causing me to
> run out of inodes in my dev environment (VMs with not much disk space).

Thanks for letting us know, this is a weird misbehavior.

> 2) I was monitoring the work directory with 'watch -n2 ls -ltr' while doing
> these tests. Once the receiver came back a large number of spool files were
> cleared (~4K) but the rest were not. I double checked and those messages
> were not resent to the receiver. I'm not sure if I wasn't patient enough or
> what, but it seemed to stop processing the spooled syslog messages.

The correct behavior should be to resend all the spooled messages when possible and remove all but one spool file.

> I've happily changed my ActionQueueMaxFileSize to 50m and am content with
> that as a resolution for the time being.

Good that you have a workaround, but thanks to your information, I'm now able to reproduce the symptoms. These may even be two separate issues, I'm not sure yet. The first is the ActionQueueMaxFileSize bug and the second is not attempting to resend the spooled messages.

I came to these rough steps to reproduce the symptoms and wasn't able to reproduce it without using the 10g value. I use netcat as a receiver as it's just a simple tcp data dump.
- have a working setup to forward the msgs to netcat
- bring down nc
- send ~5k msgs
- stop rsyslog
- start rsyslog
- bring nc up
- wait for rsyslog to sync the msgs to nc

At this point, all the msgs should be forwarded.

- bring nc down
- send ~5k msgs
- stop rsyslog
- start rsyslog
- bring nc up

At this point, some msgs are sent but the majority is not and the spool files are not erased.

I've reproduced this even with 10t, so this might be some value overflow somewhere. Will keep you posted.

(In reply to Tomas Heinrich from comment #34)
> Thanks for letting us know, this is a weird misbehavior.

No problem.

> The correct behavior should be to resend all the spooled messages when
> possible and remove all but one spool file.

Agreed, and I see the correct behavior when I use the lower maxfilesize values I listed.

> Good that you have a workaround, but thanks to your information, I'm now
> able to reproduce the symptoms. These may even be two separate issues, I'm
> not sure yet. First is the ActionQueueMaxFileSize bug and the second is not
> attempting to resend the spooled messages.

I agree about the possibility of two separate issues. My manager and I had a similar suspicion when we talked about it the other day.

> I came to these rough steps to reproduce the symptoms and wasn't able to
> reproduce it without using the 10g value. I use netcat as a receiver as it's
> just a simple tcp data dump.
>
> - have a working setup to forward the msgs to netcat
> - bring down nc
> - send ~5k msgs
> - stop rsyslog
> - start rsyslog
> - bring nc up
> - wait for rsyslog to sync the msgs to nc
> at this point, all the msgs should be forwarded
> - bring nc down
> - send ~5k msgs
> - stop rsyslog
> - start rsyslog
> - bring nc up
> at this point, some msgs are sent but majority is not and the spool files
> are not erased
>
> I've reproduced this even with 10t, so this might be some value overflow
> somewhere. Will keep you posted.

Excellent! Looking forward to hearing back about this!
Thanks, Tomas!

Tomas, Sean,
Awesome. It is clear now. Yeah, I understand the inode problem more clearly now. I will see if I can help Tomas with this.
Soham

Created attachment 775306 [details]
correction to the stream obj's interface
Looks like I've found the root cause; bad type in one of the interfaces. Here's a simple patch.
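The patch corrects a data type, and the symptoms fit a classic truncation: a byte limit like 10g (or 10t) no longer fits in a signed 32-bit variable, so the stored limit wraps to a nonsense value and rsyslog rotates after every message. The sketch below only illustrates that class of bug in shell arithmetic; it is not the actual patched code, and the `to_int32` helper is ours.

```shell
#!/usr/bin/env bash
# Simulate storing a 64-bit byte count in a signed 32-bit variable.
to_int32() {
  local v=$(( $1 & 0xFFFFFFFF ))                 # keep only the low 32 bits
  (( v >= 2147483648 )) && v=$(( v - 4294967296 ))  # reinterpret as signed
  echo "$v"
}

to_int32 $(( 1  * 1024 ** 3 ))   # 1g  -> 1073741824  (fits; behaves as expected)
to_int32 $(( 10 * 1024 ** 3 ))   # 10g -> -2147483648 (wraps negative)
to_int32 $(( 10 * 1024 ** 4 ))   # 10t -> 0           (wraps to zero)
```

A negative or zero effective limit would make every message exceed the "maximum" file size, which is consistent with the one-spool-file-per-message behavior reported above, and with the fact that 1g/50m/10m/1m all worked.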
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

http://rhn.redhat.com/errata/RHBA-2013-1716.html

Hello,
I know this is an old discussion, but I just encountered the same problem. Could you tell me please if there is any way of sending those messages/files from the spool to processing so they can get to the receiver/consumer? I expect there to be a workaround that can make rsyslog read the spool, or I hope one exists. The second question is: do I need to have the .qi file? Because on some of my problematic servers the file is there, but on some of them it is not.
Thank you

(In reply to printul77700 from comment #42)
> Hello,
> I know this is an old discussion, but I just encountered the same problem.
> Could you tell me please if there is any way of sending those messages/files
> from the spool to processing so they can get to the receiver/consumer?
> I expect there to be a workaround that can make rsyslog read the spool, or I
> hope one exists.
> The second question is: do I need to have the .qi file? Because on some of my
> problematic servers the file is there, but on some of them it is not.
> Thank you

Bugzilla is not a support channel, so if you are still experiencing problems after applying the errata (http://rhn.redhat.com/errata/RHBA-2013-1716.html) for the problem described in this Bugzilla, please raise a case through the official support channels at https://access.redhat.com/.

Kind regards,
/Anders
Created attachment 749044 [details]
Full /etc/rsyslog.conf of client

Description of problem:
Using a disk-assisted $ActionQueue (type=LinkedList) for forwarding messages to a remote rsyslog server via TCP. If the logserver has been down for a period of time, messages that were backed to disk for various reasons (rsyslog restart, in-memory queue full, etc.) never get reprocessed and sent to the remote rsyslog server.

Version-Release number of selected component (if applicable):
[root@n5 ~]# rpm -q rsyslog
rsyslog-5.8.10-6.el6.x86_64

How reproducible:
Easily

Steps to Reproduce:
1. Configure rsyslog using an in-memory, disk-assisted forwarding queue
2. Take down the remote log server for a period of time. I waited a couple of days for it to use all the inodes on a 15GB system.
3. Start the remote logging server
4. Monitor $WorkDirectory on the original server. Files never get cleaned up.

Actual results:
The configured $WorkDirectory never gets cleaned of old log messages, even if there's a .qi (QueueIndex) file that points to legitimate syslog message files. This can be problematic because if a lot of messages get backed up, it isn't very difficult to exhaust all the available inodes on the filesystem.

Expected results:
When the remote log server comes back online, I would expect rsyslog to replay the disk-backed messages to the remote log server and clean the configured $WorkDirectory.
Additional info:

Relevant bits of /etc/rsyslog.conf on client servers:

$WorkDirectory /var/spool/rsyslog_spool
$ActionQueueType LinkedList
$ActionQueueFileName aqfwd
$ActionQueueSize 150000
$ActionQueueLowWaterMark 30000
$ActionQueueHighWaterMark 120000
$ActionQueueSaveOnShutdown on
$ActionQueueDiscardSeverity 7
$ActionQueueMaxFileSize 10g
$ActionResumeRetryCount -1
*.* @@logserver:10514;RSYSLOG_ForwardFormat

All inodes exhausted with aqfwd msgs in /var/spool/rsyslog_spool:

[root@n5 ~]# df -i
Filesystem               Inodes  IUsed   IFree IUse% Mounted on
/dev/mapper/myvg-rootvol 971040 971040       0  100% /
tmpfs                   1007547      1 1007546    1% /dev/shm
/dev/sda1                 51200     51   51149    1% /boot
[root@n5 ~]# find /var/spool/rsyslog_spool -name 'aqfwd.*' | wc -l
858883
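A quick way to watch for the inode-exhaustion symptom shown above is to track the number of queue fragments in the work directory. A minimal sketch: the aqfwd queue name comes from the config above, while the `spool_count` helper name and the scratch-directory demo are illustrative.

```shell
#!/usr/bin/env bash
# Count disk-queue fragments for the "aqfwd" queue under a work directory
# (equivalent to the find | wc -l check above, stripped of padding).
spool_count() {
  echo $(find "$1" -name 'aqfwd.*' 2>/dev/null | wc -l)
}

# Demo against a scratch directory standing in for /var/spool/rsyslog_spool:
d=$(mktemp -d)
touch "$d"/aqfwd.00000001 "$d"/aqfwd.00000002 "$d"/aqfwd.qi
spool_count "$d"   # -> 3
rm -rf "$d"
```

Run against the real work directory from cron or a monitoring check, a steadily climbing count is an early warning long before df -i hits 100%.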