Bug 971471
| Summary: | rsyslog crashes on start | | |
|---|---|---|---|
| Product: | [Fedora] Fedora | Reporter: | Kevin Fenzi <kevin> |
| Component: | rsyslog | Assignee: | Tomas Heinrich <theinric> |
| Status: | CLOSED ERRATA | QA Contact: | Fedora Extras Quality Assurance <extras-qa> |
| Severity: | unspecified | Priority: | unspecified |
| Version: | rawhide | CC: | loganjerry, mah.darade, nobody+PNT0469646, pvrabec, theinric |
| Hardware: | Unspecified | OS: | Unspecified |
| Doc Type: | Bug Fix | Type: | Bug |
| Last Closed: | 2013-06-13 18:02:00 UTC | | |
Description — Kevin Fenzi, 2013-06-06 15:27:43 UTC
Are there any selinux AVCs? Can you try running:

    $ systemctl stop rsyslog
    $ export RSYSLOG_DEBUG="debug nostdout"
    $ export RSYSLOG_DEBUGLOG=/tmp/rs.dbg
    $ rsyslogd -dn 2> /dev/null

and putting the /tmp/rs.dbg file somewhere?

No AVCs. /tmp/rs.dbg -> http://www.scrye.com/~kevin/fedora/rs.dbg

No luck so far; I'm not able to reproduce it. Can you try whether it still fails when you either:

  - comment out "$ModLoad imuxsock" in /etc/rsyslog.conf, or
  - comment out "$ModLoad imjournal" in /etc/rsyslog.conf?

There will be some warning messages that can be ignored for now. Can you get a coredump or a backtrace?

I am seeing this, too, in Rawhide VMs, both i386 and x86_64.

    Program received signal SIGSEGV, Segmentation fault.
    [Switching to Thread 0x7fb935030700 (LWP 3519)]
    strnlen () at ../sysdeps/x86_64/strlen.S:103
    103         FIND_ZERO
    (gdb) bt
    #0  strnlen () at ../sysdeps/x86_64/strlen.S:103
    #1  0x00007fb93a9b5ace in __GI___strndup (s=0x1 <Address 0x1 out of bounds>,
        n=<optimized out>) at strndup.c:44
    #2  0x00007fb93a2be32c in readjournal () at imjournal.c:317
    #3  runInput (pThrd=<optimized out>) at imjournal.c:494
    #4  0x00007fb93bfd24f0 in thrdStarter (arg=0x7fb93cfc9c60) at ../threads.c:212
    #5  0x00007fb93b932f03 in start_thread (arg=0x7fb935030700) at pthread_create.c:309
    #6  0x00007fb93aa25a9d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111
    (gdb) frame 2
    #2  0x00007fb93a2be32c in readjournal () at imjournal.c:317
    317         data = strndup(get + prefixlen, l - prefixlen);
    (gdb) info locals
    iRet = RS_RET_OK
    json = 0x7fb9285d6a20
    message = 0x7fb9285d55e0 "systemd 201 running in system mode. (+PAM +LIBWRAP +AUDIT +SELINUX +IMA +SYSVINIT +LIBCRYPTSETUP +GCRYPT +ACL +XZ)"
    sys_pid = <optimized out>
    sys_iden = <optimized out>
    get2 = <optimized out>
    r = <optimized out>
    tv = {tv_sec = 1365627365, tv_usec = 858148}
    parse = <optimized out>
    length = 25
    pidlength = 12
    prefixlen = -140433449759255
    timestamp = 1365627365858148
    get = 0x7fb935e33218
    facility = 3
    equal_sign = <optimized out>
    data = <optimized out>
    priority = 6
    sys_iden_help = 0x7fb9285d6ae0 "systemd[1]:"
    pidget = 0x7fb935bfd9b8
    jval = <optimized out>
    name = 0x7fb9285d58f0 "_MACHINE_IDd1bf95a45a60f1e7e8badbed071e291a"
    l = 43

That value of prefixlen is clearly bogus.

Ah, looks like equal_sign, which is "always present", isn't, so equal_sign == NULL. Since this is a non-debug build, this doesn't fire:

    assert (equal_sign != NULL);

I should have mentioned the string with no equal sign:

    (gdb) print (char *)get
    $1 = 0x7fa15aea5218 "_MACHINE_IDd1bf95a45a60f1e7e8badbed071e291a"

Thanks for the analysis and backtrace, Jerry. I think the problem you're seeing is caused by a bug in systemd that was fixed here:
http://cgit.freedesktop.org/systemd/systemd/commit/?id=d378991747d67fff1d4dc39e7fb2bc8f49f1b561
I've done a scratch build, but it seems there's some other problem, as someone else reported it still fails for them. Here's the build:
http://koji.fedoraproject.org/koji/taskinfo?taskID=5477721

Yes, the scratch build also crashes, apparently due to stack overflow caused by infinite recursion in the rate limiter. Here's a backtrace:

    Program received signal SIGSEGV, Segmentation fault.
    [Switching to Thread 0x7fc5d0d37700 (LWP 3054)]
    0x00007fc5d6682307 in _IO_vfprintf_internal (s=s@entry=0x7fc5d05384e0,
        format=format@entry=0x7fc5d7cf8558 "%s: begin to drop messages due to rate-limiting",
        ap=ap@entry=0x7fc5d0538658) at vfprintf.c:257
    257         int save_errno = errno;
    (gdb) bt
    #0  0x00007fc5d6682307 in _IO_vfprintf_internal (s=s@entry=0x7fc5d05384e0,
        format=format@entry=0x7fc5d7cf8558 "%s: begin to drop messages due to rate-limiting",
        ap=ap@entry=0x7fc5d0538658) at vfprintf.c:257
    #1  0x00007fc5d6741245 in ___vsnprintf_chk (s=0x7fc5d0538740 "", maxlen=<optimized out>,
        flags=1, slen=<optimized out>,
        format=0x7fc5d7cf8558 "%s: begin to drop messages due to rate-limiting",
        args=args@entry=0x7fc5d0538658) at vsnprintf_chk.c:63
    #2  0x00007fc5d67411a8 in ___snprintf_chk (s=s@entry=0x7fc5d0538740 "",
        maxlen=maxlen@entry=1024, flags=flags@entry=1, slen=slen@entry=1024,
        format=format@entry=0x7fc5d7cf8558 "%s: begin to drop messages due to rate-limiting")
        at snprintf_chk.c:34
    #3  0x00007fc5d7cd8246 in snprintf (
        __fmt=0x7fc5d7cf8558 "%s: begin to drop messages due to rate-limiting",
        __n=1024, __s=0x7fc5d0538740 "") at /usr/include/bits/stdio2.h:64
    #4  withinRatelimit (ratelimit=0x7fc5d871e340, tt=<optimized out>) at ratelimit.c:171
    #5  0x00007fc5d7cd835c in ratelimitMsg (ratelimit=0x7fc5d871e340, pMsg=0x7fc5ccc523b0,
        ppRepMsg=0x7fc5d0538bb8) at ratelimit.c:208
    #6  0x00007fc5d7cd8595 in ratelimitAddMsg (ratelimit=<optimized out>,
        pMultiSub=pMultiSub@entry=0x0, pMsg=0x7fc5ccc523b0) at ratelimit.c:241
    #7  0x00007fc5d7c9fe8d in logmsgInternal (iErr=iErr@entry=-2177, pri=pri@entry=46,
        msg=msg@entry=0x7fc5d0538c70 "rsyslogd[internal_messages]: begin to drop messages due to rate-limiting",
        flags=<optimized out>, flags@entry=0) at syslogd.c:486
    #8  0x00007fc5d7cd825a in withinRatelimit (ratelimit=0x7fc5d871e340,
        tt=<optimized out>) at ratelimit.c:174
    ... (stack frames 5-8 repeat over and over and over) ...
    #25239 0x00007fc5d7c9fe8d in logmsgInternal (iErr=-3000, pri=<optimized out>,
        msg=0x7fc5d0d36350 "SD_JOURNAL_FOREACH_DATA() returned a malformed field (has no '='): '_SOURC'",
        flags=<optimized out>) at syslogd.c:486
    #25240 0x00007fc5d7ce6141 in LogError (iErrno=<optimized out>, iErrCode=-3000,
        fmt=<optimized out>) at errmsg.c:101
    #25241 0x00007fc5d5fca71f in readjournal () at imjournal.c:250
    #25242 runInput (pThrd=<optimized out>) at imjournal.c:501
    #25243 0x00007fc5d7cde570 in thrdStarter (arg=0x7fc5d872cc60) at ../threads.c:212
    #25244 0x00007fc5d763ef03 in start_thread (arg=0x7fc5d0d37700) at pthread_create.c:309
    #25245 0x00007fc5d6731a9d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111

My crash on start here has:

    #0  strnlen () at ../sysdeps/x86_64/strlen.S:103
    #1  0x00007f9c0248dace in __GI___strndup (s=0x1 <Address 0x1 out of bounds>,
        n=<optimized out>) at strndup.c:44
    #2  0x00007f9c01d9632c in readjournal () at imjournal.c:317
    #3  runInput (pThrd=<optimized out>) at imjournal.c:494
    #4  0x00007f9c03aaa4f0 in thrdStarter (arg=0x7f9c0421dc70) at ../threads.c:212
    #5  0x00007f9c0340af03 in start_thread (arg=0x7f9bf1f9e700) at pthread_create.c:309
    #6  0x00007f9c024fda9d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111

Commenting out

    $ModLoad imjournal

in /etc/rsyslog.conf, it starts up fine.

Thanks for the backtrace, Kevin. I think this is the same as the first one Jerry reported. If you just disable the imjournal plugin, you won't get any system messages. The module that was formerly used to read them, imuxsock, is currently turned off but can be enabled by commenting out this line in /etc/rsyslog.conf:

    $OmitLocalLogging on

and tweaking the service file to use socket activation:

    $ sed -i '/^;Sockets/s/^;//;/^Alias;/s/^;//' /usr/lib/systemd/system/rsyslog.service

After restarting syslog.socket and rsyslog.service, the messages should start flowing in.
I have another scratch build which should correct the last issue Jerry reported and would love to hear your thoughts:
http://koji.fedoraproject.org/koji/taskinfo?taskID=5484881

There's just one more outstanding issue (that I know of), reported by yaneti via IRC, and it seems to be a bug in the journal:
http://lists.freedesktop.org/archives/systemd-devel/2013-June/011261.html

@Jerry
In the trace you've provided, there's something which looks like a malformed journal field name. Specifically this message:

    "SD_JOURNAL_FOREACH_DATA() returned a malformed field (has no '='): '_SOURC'"

I guess the fragment should stand for "_SOURCE_REALTIME_TIMESTAMP". Since it shouldn't be affected by the corrupted stack, I'm curious about how it got there. Would it be possible to get the whole message to which this fragment belongs?

> I have another scratch build which should correct the last issue Jerry has reported and would love to hear your thoughts: http://koji.fedoraproject.org/koji/taskinfo?taskID=5484881
This scratch build works here. Strangely, once it started up it dumped all the logs for about the last month or so into /var/log/messages and /var/log/maillog... but it's running and appears to be ok after that.
I'm happy to hear it doesn't crash, thanks. The behavior you're seeing is expected (unfortunately). The imjournal module stores its position in the journal in a state file. Since there was no such file yet, all the backlog got read in and logged. From now on, only newer messages will be logged. If you erase the state file, everything will be read in again. There's going to be a configuration directive to suppress this behavior and only read in messages created after rsyslog has started, but this can cause message loss.

Makes sense. No problem.

The new scratch build is working for me, too.

(In reply to Tomas Heinrich from comment #9)
> @Jerry
> In the trace you've provided, there's something which looks like a malformed
> journal field name. Specifically this message:
>
> "SD_JOURNAL_FOREACH_DATA() returned a malformed field (has no '='): '_SOURC'"
>
> I guess the fragment should stand for "_SOURCE_REALTIME_TIMESTAMP". Since it
> shouldn't be affected by the corrupted stack, I'm curious about how that got
> there. Would it be possible to get the whole message to which this fragment
> belongs to?

Oops, I did things in the wrong order. I ran your new scratch build first, which processed all of the journal entries. How do I replicate this now? It did log the bad entries successfully:

    Jun 10 11:47:57 rawhide64 rsyslogd-3000: SD_JOURNAL_FOREACH_DATA() returned a malformed field (has no '='): '_MACHINE_IDd1bf95a45a60f1e7e8badbed071e291a'
    Jun 10 11:47:57 rawhide64 rsyslogd-3000: SD_JOURNAL_FOREACH_DATA() returned a malformed field (has no '='): '_SOURC'

Those are the only 2.

Actually, when the daily cron job ran, rsyslog crashed again:

    Core was generated by `/sbin/rsyslogd -n'.
    Program terminated with signal 11, Segmentation fault.
    #0  __find_specmb (
        format=0x7f73c32444c8 "%s: %u messages lost due to rate-limiting")
        at printf-parse.h:108
    108       return (const unsigned char *) __strchrnul ((const char *) format, '%');
    (gdb) bt
    #0  __find_specmb (
        format=0x7f73c32444c8 "%s: %u messages lost due to rate-limiting")
        at printf-parse.h:108
    #1  _IO_vfprintf_internal (s=s@entry=0x7fff2e9cc500,
        format=format@entry=0x7f73c32444c8 "%s: %u messages lost due to rate-limiting",
        ap=ap@entry=0x7fff2e9cc678) at vfprintf.c:1308
    #2  0x00007f73c1c8d245 in ___vsnprintf_chk (s=0x7fff2e9cc760 "",
        maxlen=<optimized out>, flags=1, slen=<optimized out>,
        format=0x7f73c32444c8 "%s: %u messages lost due to rate-limiting",
        args=args@entry=0x7fff2e9cc678) at vsnprintf_chk.c:63
    #3  0x00007f73c1c8d1a8 in ___snprintf_chk (s=s@entry=0x7fff2e9cc760 "",
        maxlen=maxlen@entry=1024, flags=flags@entry=1, slen=slen@entry=1024,
        format=format@entry=0x7f73c32444c8 "%s: %u messages lost due to rate-limiting")
        at snprintf_chk.c:34
    #4  0x00007f73c32241f2 in snprintf (
        __fmt=0x7f73c32444c8 "%s: %u messages lost due to rate-limiting",
        __n=1024, __s=0x7fff2e9cc760 "") at /usr/include/bits/stdio2.h:64
    #5  tellLostCnt (ratelimit=0x7f73c5353340, ratelimit=0x7f73c5353340) at ratelimit.c:128
    #6  withinRatelimit (ratelimit=0x7f73c5353340, tt=<optimized out>) at ratelimit.c:160
    #7  0x00007f73c322435c in ratelimitMsg (ratelimit=0x7f73c5353340, pMsg=0x7f73c570e100,
        ppRepMsg=0x7fff2e9ccbd8) at ratelimit.c:208
    #8  0x00007f73c3224595 in ratelimitAddMsg (ratelimit=<optimized out>,
        pMultiSub=pMultiSub@entry=0x0, pMsg=0x7f73c570e100) at ratelimit.c:241
    #9  0x00007f73c31ebe8d in logmsgInternal (iErr=iErr@entry=-2177, pri=pri@entry=46,
        msg=msg@entry=0x7fff2e9ccc90 "rsyslogd[internal_messages]: 20774 messages lost due to rate-limiting",
        flags=<optimized out>, flags@entry=0) at syslogd.c:486
    #10 0x00007f73c3224206 in tellLostCnt (ratelimit=0x7f73c5353340,
        ratelimit=0x7f73c5353340) at ratelimit.c:131
    ... (frames 6-10 over and over and over) ...
    #31560 0x00007f73c3224206 in tellLostCnt (ratelimit=0x7f73c5353340,
        ratelimit=0x7f73c5353340) at ratelimit.c:131
    #31561 withinRatelimit (ratelimit=0x7f73c5353340, tt=<optimized out>) at ratelimit.c:160
    #31562 0x00007f73c322435c in ratelimitMsg (ratelimit=0x7f73c5353340,
        pMsg=0x7f73c53620c0, ppRepMsg=0x7fff2f1cae28) at ratelimit.c:208
    #31563 0x00007f73c3224595 in ratelimitAddMsg (ratelimit=<optimized out>,
        pMultiSub=pMultiSub@entry=0x0, pMsg=0x7f73c53620c0) at ratelimit.c:241
    #31564 0x00007f73c31ebe8d in logmsgInternal (iErr=iErr@entry=-1, pri=pri@entry=46,
        msg=msg@entry=0x7fff2f1cb010 " [origin software=\"rsyslogd\" swVersion=\"7.3.15\" x-pid=\"4018\" x-info=\"http://www.rsyslog.com\"] rsyslogd was HUPed",
        flags=<optimized out>, flags@entry=0) at syslogd.c:486
    #31565 0x00007f73c31ed8f1 in doHUP () at syslogd.c:1251
    #31566 mainloop () at syslogd.c:1291
    #31567 realMain (argc=<optimized out>, argv=<optimized out>) at syslogd.c:2036
    #31568 0x00007f73c1ba5ff5 in __libc_start_main (main=0x7f73c31eb550 <main>, argc=2,
        argv=0x7fff2f1cb338, init=<optimized out>, fini=<optimized out>,
        rtld_fini=<optimized out>, stack_end=0x7fff2f1cb328) at libc-start.c:269
    #31569 0x00007f73c31eb599 in _start ()

(In reply to Jerry James from comment #14)
> Actually, when the daily cron job ran, rsyslog crashed again:

Should have seen that one coming. That's a twin problem to the one in the ratelimiter and should be equally easy to fix. I'll do another scratch build and rebase it to the new v7.4.0 while I'm at it.

(In reply to Jerry James from comment #13)
> Oops, I did things in the wrong order. I ran your new scratch build first,
> which processed all of the journal entries. How do I replicate this now?
> It did log the bad entries successfully:
>
> Jun 10 11:47:57 rawhide64 rsyslogd-3000: SD_JOURNAL_FOREACH_DATA() returned
> a malformed field (has no '='): '_MACHINE_IDd1bf95a45a60f1e7e8badbed071e291a'
> Jun 10 11:47:57 rawhide64 rsyslogd-3000: SD_JOURNAL_FOREACH_DATA() returned
> a malformed field (has no '='): '_SOURC'
>
> Those are the only 2.

Never mind, I guess it's some quirk in the journal. I've skimmed through its sources but couldn't find the cause. Since we can handle these now, it shouldn't be a big deal.

That works great on my x86_64 Rawhide machine. The i686 Rawhide machine still seems weird. I looked in /var/log/messages, and it is still showing messages from months ago, and /var/log/cron is completely empty. Systemctl claims the service is running, and shows a running time that is consistent with the time that I updated to the build in comment 17. Any ideas on that one?

Ah, and /var/log/messages is growing by the minute, but the tail of the log always shows messages from Mar 25. Does that imply that it is stuck in some kind of loop trying to process the journal?

Yes, it is repeatedly writing messages from 09:50:03 to 09:50:40. It gets up to this point:

    ...
    Mar 25 09:50:40 rawhide32 systemd: Starting Locale Service...
    Mar 25 09:50:40 rawhide32 colord: Device added: xrandr-QXL 1-50332672

and then resets back to 09:50:03 to write the same messages over again:

    Mar 25 09:50:03 rawhide32 systemd[1]: Smack is not enabled in the kernel, not loading access rules.
    Mar 25 09:50:03 rawhide32 systemd[1]: systemd 198 running in system mode. (+PAM +LIBWRAP +AUDIT +SELINUX +IMA +SYSVINIT +LIBCRYPTSETUP +GCRYPT +ACL +XZ)
    ...

I attached gdb to rsyslogd and got a random snapshot of the stack in the thread reading the journal:

    #0  0xb737c6b0 in __x86.get_pc_thunk.bx () from /lib/libsystemd-journal.so.0
    #1  0xb7382999 in test_object_offset (f=0xb81edf40, p=4894096, needle=4682728)
        at src/journal/journal-file.c:1715
    #2  0xb73844d5 in generic_array_bisect (f=f@entry=0xb81edf40, first=3736368,
        n=<optimized out>, needle=4682728,
        test_object=test_object@entry=0xb7382990 <test_object_offset>,
        direction=direction@entry=DIRECTION_DOWN, ret=ret@entry=0x0,
        offset=offset@entry=0x0, idx=idx@entry=0xb33e9ef8)
        at src/journal/journal-file.c:1580
    #3  0xb7385291 in journal_file_next_entry (f=f@entry=0xb81edf40,
        o=o@entry=0xb4bf53e8, p=4682728, direction=direction@entry=DIRECTION_DOWN,
        ret=ret@entry=0xb33e9fdc, offset=offset@entry=0xb33e9fe0)
        at src/journal/journal-file.c:1895
    #4  0xb737d4ac in next_with_matches (j=j@entry=0xb81ed5b8, f=f@entry=0xb81edf40,
        direction=direction@entry=DIRECTION_DOWN, ret=ret@entry=0xb33e9fdc,
        offset=offset@entry=0xb33e9fe0) at src/journal/sd-journal.c:814
    #5  0xb737eefe in next_beyond_location (offset=<synthetic pointer>, ret=0xb33e9fd4,
        direction=DIRECTION_DOWN, f=0xb81edf40, j=0xb81ed5b8)
        at src/journal/sd-journal.c:836
    #6  real_journal_next (j=j@entry=0xb81ed5b8, direction=direction@entry=DIRECTION_DOWN)
        at src/journal/sd-journal.c:893
    #7  0xb737fdfe in sd_journal_next (j=0xb81ed5b8) at src/journal/sd-journal.c:932
    #8  0xb767ab91 in runInput (pThrd=0xb81f68a8) at imjournal.c:485
    #9  0xb76ff39a in thrdStarter (arg=0xb81f68a8) at ../threads.c:212
    #10 0xb7635e5a in start_thread (arg=0xb33eab40) at pthread_create.c:309
    #11 0xb752098e in clone () at ../sysdeps/unix/sysv/linux/i386/clone.S:129

If you can think of anything I can check to see why it is looping, I'm happy to do so.

Jerry, is it possible that you've hit this bug?
http://lists.freedesktop.org/archives/systemd-devel/2013-June/011261.html
tl;dr: duplicate ids in the journal cause it to loop.
I haven't encountered this myself, but one other user reported being hit by it.

I don't know. How would I tell? I know very little about journals. After playing with journalctl a little, I can tell you that the first message in the loop is the first one for Mar 25 marked "systemd[1]". All previous messages for that day are marked "kernel".

Hmm, this is interesting. Here is the journalctl output for the messages immediately following the last one in the loop:

    Mar 25 09:50:40 rawhide32.betterlinux.com colord[962]: Device added: xrandr-QXL 1-50332672
    -- Reboot --
    Mar 21 14:53:11 rawhide32.betterlinux.com gdm[475]: GdmSession: Handling new connection from outside
    -- Reboot --
    Mar 25 09:50:02 rawhide32.betterlinux.com systemd-journal[53]: Allowing runtime journal files to grow to 49.6M.

And now journalctl is showing me the Mar 25 09:50 messages all over again. So, yes, it seems likely that I have hit that bug. How do I recover from it?

Using a newer systemd should help (as noted here: https://bugs.freedesktop.org/show_bug.cgi?id=64566). Perhaps deleting the journal logs could help, but I'm just guessing.

Yes, I deleted the journal logs from March, and now the system appears to have recovered to normalcy. So the build from comment 17 looks good to me.

OK, I'll close this now, since nothing new came in. If anything new springs up, feel free to reopen or open a new bz. Thanks for your patience and all the help.