Bug 971471
| Summary: | rsyslog crashes on start | | |
|---|---|---|---|
| Product: | [Fedora] Fedora | Reporter: | Kevin Fenzi <kevin> |
| Component: | rsyslog | Assignee: | Tomas Heinrich <theinric> |
| Status: | CLOSED ERRATA | QA Contact: | Fedora Extras Quality Assurance <extras-qa> |
| Severity: | unspecified | Priority: | unspecified |
| Version: | rawhide | CC: | loganjerry, mah.darade, nobody+PNT0469646, pvrabec, theinric |
| Hardware: | Unspecified | OS: | Unspecified |
| Doc Type: | Bug Fix | Type: | Bug |
| Last Closed: | 2013-06-13 18:02:00 UTC | | |
Description — Kevin Fenzi, 2013-06-06 15:27:43 UTC
Are there any selinux AVCs? Can you try running:

    $ systemctl stop rsyslog
    $ export RSYSLOG_DEBUG="debug nostdout"
    $ export RSYSLOG_DEBUGLOG=/tmp/rs.dbg
    $ rsyslogd -dn 2> /dev/null

and putting the /tmp/rs.dbg file somewhere?

No AVCs. /tmp/rs.dbg -> http://www.scrye.com/~kevin/fedora/rs.dbg

No luck so far; I'm not able to reproduce it. Can you try whether it still fails when you either:

  - comment out "$ModLoad imuxsock" in /etc/rsyslog.conf, or
  - comment out "$ModLoad imjournal" in /etc/rsyslog.conf?

There will be some warning messages that can be ignored for now. Can you get a coredump or a backtrace?

I am seeing this, too, in Rawhide VMs, both i386 and x86_64.

    Program received signal SIGSEGV, Segmentation fault.
    [Switching to Thread 0x7fb935030700 (LWP 3519)]
    strnlen () at ../sysdeps/x86_64/strlen.S:103
    103         FIND_ZERO
    (gdb) bt
    #0  strnlen () at ../sysdeps/x86_64/strlen.S:103
    #1  0x00007fb93a9b5ace in __GI___strndup (s=0x1 <Address 0x1 out of bounds>,
        n=<optimized out>) at strndup.c:44
    #2  0x00007fb93a2be32c in readjournal () at imjournal.c:317
    #3  runInput (pThrd=<optimized out>) at imjournal.c:494
    #4  0x00007fb93bfd24f0 in thrdStarter (arg=0x7fb93cfc9c60) at ../threads.c:212
    #5  0x00007fb93b932f03 in start_thread (arg=0x7fb935030700) at pthread_create.c:309
    #6  0x00007fb93aa25a9d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111
    (gdb) frame 2
    #2  0x00007fb93a2be32c in readjournal () at imjournal.c:317
    317         data = strndup(get + prefixlen, l - prefixlen);
    (gdb) info locals
    iRet = RS_RET_OK
    json = 0x7fb9285d6a20
    message = 0x7fb9285d55e0 "systemd 201 running in system mode. (+PAM +LIBWRAP +AUDIT +SELINUX +IMA +SYSVINIT +LIBCRYPTSETUP +GCRYPT +ACL +XZ)"
    sys_pid = <optimized out>
    sys_iden = <optimized out>
    get2 = <optimized out>
    r = <optimized out>
    tv = {tv_sec = 1365627365, tv_usec = 858148}
    parse = <optimized out>
    length = 25
    pidlength = 12
    prefixlen = -140433449759255
    timestamp = 1365627365858148
    get = 0x7fb935e33218
    facility = 3
    equal_sign = <optimized out>
    data = <optimized out>
    priority = 6
    sys_iden_help = 0x7fb9285d6ae0 "systemd[1]:"
    pidget = 0x7fb935bfd9b8
    jval = <optimized out>
    name = 0x7fb9285d58f0 "_MACHINE_IDd1bf95a45a60f1e7e8badbed071e291a"
    l = 43

That value of prefixlen is clearly bogus.

Ah, looks like equal_sign, which is "always present", isn't, so equal_sign == NULL. Since this is a non-debug build, this doesn't fire:

    assert (equal_sign != NULL);

I should have mentioned the string with no equal sign:

    (gdb) print (char *)get
    $1 = 0x7fa15aea5218 "_MACHINE_IDd1bf95a45a60f1e7e8badbed071e291a"

Thanks for the analysis and backtrace, Jerry. I think the problem you're seeing is caused by a bug in systemd that was fixed here:
http://cgit.freedesktop.org/systemd/systemd/commit/?id=d378991747d67fff1d4dc39e7fb2bc8f49f1b561
I've done a scratch build, but it seems there's some other problem, as someone else reported it still fails for them. Here's the build:
http://koji.fedoraproject.org/koji/taskinfo?taskID=5477721

Yes, the scratch build also crashes, apparently due to stack overflow caused by infinite recursion in the rate limiter. Here's a backtrace:

    Program received signal SIGSEGV, Segmentation fault.
    [Switching to Thread 0x7fc5d0d37700 (LWP 3054)]
    0x00007fc5d6682307 in _IO_vfprintf_internal (s=s@entry=0x7fc5d05384e0,
        format=format@entry=0x7fc5d7cf8558 "%s: begin to drop messages due to rate-limiting",
        ap=ap@entry=0x7fc5d0538658) at vfprintf.c:257
    257         int save_errno = errno;
    (gdb) bt
    #0  0x00007fc5d6682307 in _IO_vfprintf_internal (s=s@entry=0x7fc5d05384e0,
        format=format@entry=0x7fc5d7cf8558 "%s: begin to drop messages due to rate-limiting",
        ap=ap@entry=0x7fc5d0538658) at vfprintf.c:257
    #1  0x00007fc5d6741245 in ___vsnprintf_chk (s=0x7fc5d0538740 "", maxlen=<optimized out>,
        flags=1, slen=<optimized out>,
        format=0x7fc5d7cf8558 "%s: begin to drop messages due to rate-limiting",
        args=args@entry=0x7fc5d0538658) at vsnprintf_chk.c:63
    #2  0x00007fc5d67411a8 in ___snprintf_chk (s=s@entry=0x7fc5d0538740 "",
        maxlen=maxlen@entry=1024, flags=flags@entry=1, slen=slen@entry=1024,
        format=format@entry=0x7fc5d7cf8558 "%s: begin to drop messages due to rate-limiting")
        at snprintf_chk.c:34
    #3  0x00007fc5d7cd8246 in snprintf (
        __fmt=0x7fc5d7cf8558 "%s: begin to drop messages due to rate-limiting",
        __n=1024, __s=0x7fc5d0538740 "") at /usr/include/bits/stdio2.h:64
    #4  withinRatelimit (ratelimit=0x7fc5d871e340, tt=<optimized out>) at ratelimit.c:171
    #5  0x00007fc5d7cd835c in ratelimitMsg (ratelimit=0x7fc5d871e340, pMsg=0x7fc5ccc523b0,
        ppRepMsg=0x7fc5d0538bb8) at ratelimit.c:208
    #6  0x00007fc5d7cd8595 in ratelimitAddMsg (ratelimit=<optimized out>,
        pMultiSub=pMultiSub@entry=0x0, pMsg=0x7fc5ccc523b0) at ratelimit.c:241
    #7  0x00007fc5d7c9fe8d in logmsgInternal (iErr=iErr@entry=-2177, pri=pri@entry=46,
        msg=msg@entry=0x7fc5d0538c70 "rsyslogd[internal_messages]: begin to drop messages due to rate-limiting",
        flags=<optimized out>, flags@entry=0) at syslogd.c:486
    #8  0x00007fc5d7cd825a in withinRatelimit (ratelimit=0x7fc5d871e340,
        tt=<optimized out>) at ratelimit.c:174
    ... (stack frames 5-8 repeat over and over and over) ...
    #25239 0x00007fc5d7c9fe8d in logmsgInternal (iErr=-3000, pri=<optimized out>,
        msg=0x7fc5d0d36350 "SD_JOURNAL_FOREACH_DATA() returned a malformed field (has no '='): '_SOURC'",
        flags=<optimized out>) at syslogd.c:486
    #25240 0x00007fc5d7ce6141 in LogError (iErrno=<optimized out>, iErrCode=-3000,
        fmt=<optimized out>) at errmsg.c:101
    #25241 0x00007fc5d5fca71f in readjournal () at imjournal.c:250
    #25242 runInput (pThrd=<optimized out>) at imjournal.c:501
    #25243 0x00007fc5d7cde570 in thrdStarter (arg=0x7fc5d872cc60) at ../threads.c:212
    #25244 0x00007fc5d763ef03 in start_thread (arg=0x7fc5d0d37700) at pthread_create.c:309
    #25245 0x00007fc5d6731a9d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111

My crash on start here has:

    #0  strnlen () at ../sysdeps/x86_64/strlen.S:103
    #1  0x00007f9c0248dace in __GI___strndup (s=0x1 <Address 0x1 out of bounds>,
        n=<optimized out>) at strndup.c:44
    #2  0x00007f9c01d9632c in readjournal () at imjournal.c:317
    #3  runInput (pThrd=<optimized out>) at imjournal.c:494
    #4  0x00007f9c03aaa4f0 in thrdStarter (arg=0x7f9c0421dc70) at ../threads.c:212
    #5  0x00007f9c0340af03 in start_thread (arg=0x7f9bf1f9e700) at pthread_create.c:309
    #6  0x00007f9c024fda9d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111

Commenting out

    $ModLoad imjournal

in /etc/rsyslog.conf, it starts up fine.

Thanks for the backtrace, Kevin. I think this is the same as the first one Jerry reported. If you just disable the imjournal plugin, you won't get any system messages. The module that was formerly used to read them, imuxsock, is currently turned off but can be enabled by commenting out this line in /etc/rsyslog.conf:

    $OmitLocalLogging on

and tweaking the service file to use socket activation:

    $ sed -i '/^;Sockets/s/^;//;/^Alias;/s/^;//' /usr/lib/systemd/system/rsyslog.service

After restarting syslog.socket and rsyslog.service, the messages should start flowing in.
I have another scratch build which should correct the last issue Jerry reported and would love to hear your thoughts:
http://koji.fedoraproject.org/koji/taskinfo?taskID=5484881

There's just one more outstanding issue (that I know of), reported by yaneti via IRC, and it seems to be a bug in the journal:
http://lists.freedesktop.org/archives/systemd-devel/2013-June/011261.html

@Jerry
In the trace you've provided, there's something which looks like a malformed journal field name. Specifically this message:

    "SD_JOURNAL_FOREACH_DATA() returned a malformed field (has no '='): '_SOURC'"

I guess the fragment should stand for "_SOURCE_REALTIME_TIMESTAMP". Since it shouldn't be affected by the corrupted stack, I'm curious about how it got there. Would it be possible to get the whole message to which this fragment belongs?

> I have another scratch build which should correct the last issue Jerry has reported and would love to hear your thoughts: http://koji.fedoraproject.org/koji/taskinfo?taskID=5484881
This scratch build works here. Strangely, once it started up it dumped all the logs for about the last month or so into /var/log/messages and /var/log/maillog... but it's running and appears to be ok after that.
I'm happy to hear it doesn't crash, thanks. The behavior you're seeing is expected (unfortunately). The imjournal module stores its position in the journal in a state file. Since there was no such file yet, all the backlog got read in and logged. From now on, only newer messages will be logged. If you erase the state file, everything will be read in again. There's going to be a configuration directive to suppress this behavior and only read in messages created after rsyslog has started, but this can cause message loss.

Makes sense. No problem.

The new scratch build is working for me, too.

(In reply to Tomas Heinrich from comment #9)
> @Jerry
> In the trace you've provided, there's something which looks like a malformed
> journal field name. Specifically this message:
>
> "SD_JOURNAL_FOREACH_DATA() returned a malformed field (has no '='): '_SOURC'"
>
> I guess the fragment should stand for "_SOURCE_REALTIME_TIMESTAMP". Since it
> shouldn't be affected by the corrupted stack, I'm curious about how that got
> there. Would it be possible to get the whole message to which this fragment
> belongs to?

Oops, I did things in the wrong order. I ran your new scratch build first, which processed all of the journal entries. How do I replicate this now? It did log the bad entries successfully:

    Jun 10 11:47:57 rawhide64 rsyslogd-3000: SD_JOURNAL_FOREACH_DATA() returned a malformed field (has no '='): '_MACHINE_IDd1bf95a45a60f1e7e8badbed071e291a'
    Jun 10 11:47:57 rawhide64 rsyslogd-3000: SD_JOURNAL_FOREACH_DATA() returned a malformed field (has no '='): '_SOURC'

Those are the only 2.

Actually, when the daily cron job ran, rsyslog crashed again:

    Core was generated by `/sbin/rsyslogd -n'.
    Program terminated with signal 11, Segmentation fault.
    #0  __find_specmb (
        format=0x7f73c32444c8 "%s: %u messages lost due to rate-limiting")
        at printf-parse.h:108
    108       return (const unsigned char *) __strchrnul ((const char *) format, '%');
    (gdb) bt
    #0  __find_specmb (
        format=0x7f73c32444c8 "%s: %u messages lost due to rate-limiting")
        at printf-parse.h:108
    #1  _IO_vfprintf_internal (s=s@entry=0x7fff2e9cc500,
        format=format@entry=0x7f73c32444c8 "%s: %u messages lost due to rate-limiting",
        ap=ap@entry=0x7fff2e9cc678) at vfprintf.c:1308
    #2  0x00007f73c1c8d245 in ___vsnprintf_chk (s=0x7fff2e9cc760 "",
        maxlen=<optimized out>, flags=1, slen=<optimized out>,
        format=0x7f73c32444c8 "%s: %u messages lost due to rate-limiting",
        args=args@entry=0x7fff2e9cc678) at vsnprintf_chk.c:63
    #3  0x00007f73c1c8d1a8 in ___snprintf_chk (s=s@entry=0x7fff2e9cc760 "",
        maxlen=maxlen@entry=1024, flags=flags@entry=1, slen=slen@entry=1024,
        format=format@entry=0x7f73c32444c8 "%s: %u messages lost due to rate-limiting")
        at snprintf_chk.c:34
    #4  0x00007f73c32241f2 in snprintf (
        __fmt=0x7f73c32444c8 "%s: %u messages lost due to rate-limiting",
        __n=1024, __s=0x7fff2e9cc760 "") at /usr/include/bits/stdio2.h:64
    #5  tellLostCnt (ratelimit=0x7f73c5353340, ratelimit=0x7f73c5353340) at ratelimit.c:128
    #6  withinRatelimit (ratelimit=0x7f73c5353340, tt=<optimized out>) at ratelimit.c:160
    #7  0x00007f73c322435c in ratelimitMsg (ratelimit=0x7f73c5353340, pMsg=0x7f73c570e100,
        ppRepMsg=0x7fff2e9ccbd8) at ratelimit.c:208
    #8  0x00007f73c3224595 in ratelimitAddMsg (ratelimit=<optimized out>,
        pMultiSub=pMultiSub@entry=0x0, pMsg=0x7f73c570e100) at ratelimit.c:241
    #9  0x00007f73c31ebe8d in logmsgInternal (iErr=iErr@entry=-2177, pri=pri@entry=46,
        msg=msg@entry=0x7fff2e9ccc90 "rsyslogd[internal_messages]: 20774 messages lost due to rate-limiting",
        flags=<optimized out>, flags@entry=0) at syslogd.c:486
    #10 0x00007f73c3224206 in tellLostCnt (ratelimit=0x7f73c5353340,
        ratelimit=0x7f73c5353340) at ratelimit.c:131
    ... (frames 6-10 over and over and over) ...
    #31560 0x00007f73c3224206 in tellLostCnt (ratelimit=0x7f73c5353340,
        ratelimit=0x7f73c5353340) at ratelimit.c:131
    #31561 withinRatelimit (ratelimit=0x7f73c5353340, tt=<optimized out>) at ratelimit.c:160
    #31562 0x00007f73c322435c in ratelimitMsg (ratelimit=0x7f73c5353340,
        pMsg=0x7f73c53620c0, ppRepMsg=0x7fff2f1cae28) at ratelimit.c:208
    #31563 0x00007f73c3224595 in ratelimitAddMsg (ratelimit=<optimized out>,
        pMultiSub=pMultiSub@entry=0x0, pMsg=0x7f73c53620c0) at ratelimit.c:241
    #31564 0x00007f73c31ebe8d in logmsgInternal (iErr=iErr@entry=-1, pri=pri@entry=46,
        msg=msg@entry=0x7fff2f1cb010 " [origin software=\"rsyslogd\" swVersion=\"7.3.15\" x-pid=\"4018\" x-info=\"http://www.rsyslog.com\"] rsyslogd was HUPed",
        flags=<optimized out>, flags@entry=0) at syslogd.c:486
    #31565 0x00007f73c31ed8f1 in doHUP () at syslogd.c:1251
    #31566 mainloop () at syslogd.c:1291
    #31567 realMain (argc=<optimized out>, argv=<optimized out>) at syslogd.c:2036
    #31568 0x00007f73c1ba5ff5 in __libc_start_main (main=0x7f73c31eb550 <main>, argc=2,
        argv=0x7fff2f1cb338, init=<optimized out>, fini=<optimized out>,
        rtld_fini=<optimized out>, stack_end=0x7fff2f1cb328) at libc-start.c:269
    #31569 0x00007f73c31eb599 in _start ()

(In reply to Jerry James from comment #14)
> Actually, when the daily cron job ran, rsyslog crashed again:

Should have seen that one coming. That's a twin problem to the one in the ratelimiter and should be equally easy to fix. I'll do another scratch build and rebase it to the new v7.4.0 while I'm at it.

(In reply to Jerry James from comment #13)
> Oops, I did things in the wrong order. I ran your new scratch build first,
> which processed all of the journal entries. How do I replicate this now?
> It did log the bad entries successfully:
>
> Jun 10 11:47:57 rawhide64 rsyslogd-3000: SD_JOURNAL_FOREACH_DATA() returned
> a malformed field (has no '='): '_MACHINE_IDd1bf95a45a60f1e7e8badbed071e291a'
> Jun 10 11:47:57 rawhide64 rsyslogd-3000: SD_JOURNAL_FOREACH_DATA() returned
> a malformed field (has no '='): '_SOURC'
>
> Those are the only 2.

Never mind, I guess it's some quirk in the journal. I've skimmed through its sources but couldn't find the cause. Since we can handle these now, it shouldn't be a big deal.

That works great on my x86_64 Rawhide machine. The i686 Rawhide machine still seems weird. I looked in /var/log/messages, and it is still showing messages from months ago, and /var/log/cron is completely empty. Systemctl claims the service is running, and shows a running time that is consistent with the time that I updated to the build in comment 17. Any ideas on that one?

Ah, and /var/log/messages is growing by the minute, but the tail of the log always shows messages from Mar 25. Does that imply that it is stuck in some kind of loop trying to process the journal?

Yes, it is repeatedly writing messages from 09:50:03 to 09:50:40. It gets up to this point:

    ...
    Mar 25 09:50:40 rawhide32 systemd: Starting Locale Service...
    Mar 25 09:50:40 rawhide32 colord: Device added: xrandr-QXL 1-50332672

and then resets back to 09:50:03 to write the same messages over again:

    Mar 25 09:50:03 rawhide32 systemd[1]: Smack is not enabled in the kernel, not loading access rules.
    Mar 25 09:50:03 rawhide32 systemd[1]: systemd 198 running in system mode. (+PAM +LIBWRAP +AUDIT +SELINUX +IMA +SYSVINIT +LIBCRYPTSETUP +GCRYPT +ACL +XZ)
    ...

I attached gdb to rsyslogd and got a random snapshot of the stack in the thread reading the journal:

    #0  0xb737c6b0 in __x86.get_pc_thunk.bx () from /lib/libsystemd-journal.so.0
    #1  0xb7382999 in test_object_offset (f=0xb81edf40, p=4894096, needle=4682728)
        at src/journal/journal-file.c:1715
    #2  0xb73844d5 in generic_array_bisect (f=f@entry=0xb81edf40, first=3736368,
        n=<optimized out>, needle=4682728,
        test_object=test_object@entry=0xb7382990 <test_object_offset>,
        direction=direction@entry=DIRECTION_DOWN, ret=ret@entry=0x0,
        offset=offset@entry=0x0, idx=idx@entry=0xb33e9ef8)
        at src/journal/journal-file.c:1580
    #3  0xb7385291 in journal_file_next_entry (f=f@entry=0xb81edf40,
        o=o@entry=0xb4bf53e8, p=4682728, direction=direction@entry=DIRECTION_DOWN,
        ret=ret@entry=0xb33e9fdc, offset=offset@entry=0xb33e9fe0)
        at src/journal/journal-file.c:1895
    #4  0xb737d4ac in next_with_matches (j=j@entry=0xb81ed5b8, f=f@entry=0xb81edf40,
        direction=direction@entry=DIRECTION_DOWN, ret=ret@entry=0xb33e9fdc,
        offset=offset@entry=0xb33e9fe0) at src/journal/sd-journal.c:814
    #5  0xb737eefe in next_beyond_location (offset=<synthetic pointer>, ret=0xb33e9fd4,
        direction=DIRECTION_DOWN, f=0xb81edf40, j=0xb81ed5b8)
        at src/journal/sd-journal.c:836
    #6  real_journal_next (j=j@entry=0xb81ed5b8, direction=direction@entry=DIRECTION_DOWN)
        at src/journal/sd-journal.c:893
    #7  0xb737fdfe in sd_journal_next (j=0xb81ed5b8) at src/journal/sd-journal.c:932
    #8  0xb767ab91 in runInput (pThrd=0xb81f68a8) at imjournal.c:485
    #9  0xb76ff39a in thrdStarter (arg=0xb81f68a8) at ../threads.c:212
    #10 0xb7635e5a in start_thread (arg=0xb33eab40) at pthread_create.c:309
    #11 0xb752098e in clone () at ../sysdeps/unix/sysv/linux/i386/clone.S:129

If you can think of anything I can check to see why it is looping, I'm happy to do so.

Jerry, is it possible that you've hit this bug?
http://lists.freedesktop.org/archives/systemd-devel/2013-June/011261.html
tl;dr: duplicate ids in the journal cause it to loop.
I haven't encountered this myself, but one other user reported being hit by it.

I don't know. How would I tell? I know very little about journals. After playing with journalctl a little, I can tell you that the first message in the loop is the first one for Mar 25 marked "systemd[1]". All previous messages for that day are marked "kernel".

Hmm, this is interesting. Here is the journalctl output for the messages immediately following the last one in the loop:

    Mar 25 09:50:40 rawhide32.betterlinux.com colord[962]: Device added: xrandr-QXL 1-50332672
    -- Reboot --
    Mar 21 14:53:11 rawhide32.betterlinux.com gdm[475]: GdmSession: Handling new connection from outside
    -- Reboot --
    Mar 25 09:50:02 rawhide32.betterlinux.com systemd-journal[53]: Allowing runtime journal files to grow to 49.6M.

And now journalctl is showing me the Mar 25 09:50 messages all over again. So, yes, it seems likely that I have hit that bug. How do I recover from it?

Using a newer systemd should help (as noted here: https://bugs.freedesktop.org/show_bug.cgi?id=64566). Perhaps deleting the journal logs could help, but I'm just guessing.

Yes, I deleted the journal logs from March, and now the system appears to have recovered to normalcy. So the build from comment 17 looks good to me.

OK, I'll close this now, since nothing new came in. If anything new springs up, feel free to reopen or open a new bz. Thanks for your patience and all the help.