Bug 1359939

Summary:	Restarting systemd-journald to load new configurations causes other daemons stop working
Product:	Red Hat Enterprise Linux 7	Reporter:	Bryan Totty <btotty>
Component:	systemd	Assignee:	Michal Sekletar <msekleta>
Status:	CLOSED ERRATA	QA Contact:	Frantisek Sumsal <fsumsal>
Severity:	high	Docs Contact:
Priority:	urgent
Version:	7.4	CC:	brubisch, btotty, chorn, dmoessne, dornelas, fcami, fsumsal, jhunt, jolee, jsynacek, kwalker, mdshaikh, mmatsuya, mosvald, msekleta, pdwyer, salmy, systemd-maint-list, systemd-maint
Target Milestone:	rc	Keywords:	PrioBumpGSS, ZStream
Target Release:	---
Hardware:	All
OS:	Linux
Whiteboard:
Fixed In Version:	systemd-219-45.el7	Doc Type:	If docs needed, set a value
Doc Text:		Story Points:	---
Clone Of:
Clones:	1525993 (view as bug list)		Environment:
Last Closed:	2018-04-10 11:16:36 UTC	Type:	Bug
Regression:	---	Mount Type:	---
Documentation:	---	CRM:
Verified Versions:		Category:	---
oVirt Team:	---	RHEL 7.3 requirements from Atomic Host:
Cloudforms Team:	---	Target Upstream Version:
Embargoed:
Bug Depends On:
Bug Blocks:	1186913, 1298243, 1420851, 1465901, 1466365, 1494728, 1525993

Description Bryan Totty 2016-07-25 21:08:50 UTC

Description of problem:

On several occasions, restarting the journal daemon (systemd-journald) will cause some daemon such as mcollective to start failing where it can connect with other processes such as puppet any more. We've also seen that the rsyslog daemon will stop logging certain messages after restarting the the journal.

We believe this is related to an issue with journald and how it closes pipes when it restarts. This has been reported in several places on the Internet and according to Lennart Pottering, this is fixed up stream. From the following link it appears to be fixed in version v228. 

https://github.com/systemd/systemd/issues/2236


Version-Release number of selected component (if applicable):

systemd-219-19.el7_2.11.x86_64


How reproducible:

Frequent in some environments, but without a clear reproducer.


Steps to Reproduce:
1. ?


Actual results:
Restarting the journald service appears to break logging for all other running services until those services are also restarted. 

(journalctl -flu <service_name> shows no new log lines post journald restart).

Expected results:
Support for reloading journald configuration without restarting the process and breaking logging for other systemd services.


Strace Output:

1. For reference, may not be related-

21669 15:40:02.378230 read(8, "# This might be removed if http://bugs.ruby-lang.org/issues/6286 gets accepted.\n\nmodule ABRT\n  module Exception\n    # Provides the exception formated in the same way as Ruby does for standard\n    # error output.\n    def format\n      backtrace = self.backtrace.collect { |line| \"\\tfrom #{line}\" }\n      backtrace[0] = \"#{self.backtrace.first}: #{self.message} (#{self.class.name})\"\n      backtrace\n    end\n\n    # Obtains executable name from backtrace. This should be more reliable then\n    # use of $0 aka $PROGRAM_NAME.\n    def executable\n      backtrace.last[/(.*?):/, 1]\n    end\n  end\nend\n", 8192) = 592 <0.000007>
21669 15:40:02.378322 read(8, "", 8192) = 0 <0.000007>
21669 15:40:02.378356 close(8)          = 0 <0.000008>

2. Broken Pipe-

...
21669 15:40:02.380090 write(2, "\tfrom /usr/share/ruby/vendor_ruby/puppet/application.rb:381:in `run'\n", 69) = -1 EPIPE (Broken pipe) <0.000007>
21669 15:40:02.380114 --- SIGPIPE {si_signo=SIGPIPE, si_code=SI_USER, si_pid=21669, si_uid=0} ---
21669 15:40:02.380129 write(2, "\tfrom /usr/share/ruby/vendor_ruby/puppet/util/command_line.rb:146:in `run'\n", 75) = -1 EPIPE (Broken pipe) <0.000007>
21669 15:40:02.380153 --- SIGPIPE {si_signo=SIGPIPE, si_code=SI_USER, si_pid=21669, si_uid=0} ---
21669 15:40:02.380168 write(2, "\tfrom /usr/share/ruby/vendor_ruby/puppet/util/command_line.rb:92:in `execute'\n", 78) = -1 EPIPE (Broken pipe) <0.000007>
21669 15:40:02.380192 --- SIGPIPE {si_signo=SIGPIPE, si_code=SI_USER, si_pid=21669, si_uid=0} ---
21669 15:40:02.380207 write(2, "\tfrom /usr/bin/puppet:8:in `<main>'\n", 36) = -1 EPIPE (Broken pipe) <0.000006>
21669 15:40:02.380231 --- SIGPIPE {si_signo=SIGPIPE, si_code=SI_USER, si_pid=21669, si_uid=0} ---
21669 15:40:02.380246 rt_sigaction(SIGINT, {SIG_IGN, [], SA_RESTORER|SA_SIGINFO, 0x7f124bfd8100}, {0x7f124c2f7510, [], SA_RESTORER|SA_SIGINFO, 0x7f124bfd8100}, 8) = 0 <0.000006>
21669 15:40:02.380273 rt_sigaction(SIGINT, {SIG_DFL, [], SA_RESTORER|SA_SIGINFO, 0x7f124bfd8100}, {SIG_IGN, [], SA_RESTORER|SA_SIGINFO, 0x7f124bfd8100}, 8) = 0 <0.000006>
10449 15:40:02.380358 <... futex resumed> ) = -1 ETIMEDOUT (Connection timed out) <0.100096>
10449 15:40:02.380390 futex(0x1f016b0, FUTEX_WAKE_PRIVATE, 1) = 0 <0.000049>
10449 15:40:02.380477 futex(0x1f01634, FUTEX_WAIT_BITSET_PRIVATE, 8870901, {1304282, 846059213}, ffffffff <unfinished ...>
21669 15:40:02.400268 write(4, "!", 1 <unfinished ...>
21832 15:40:02.400324 <... poll resumed> ) = 1 ([{fd=3, revents=POLLIN}]) <0.050503>
21832 15:40:02.400346 read(3, "!", 1024) = 1 <0.000007>

3. Resource Temporarily Unavailable-

21832 15:40:02.400369 read(3, 0x7f124c640580, 1024) = -1 EAGAIN (Resource temporarily unavailable) <0.000006>
21832 15:40:02.400392 read(5, 0x7f124c640580, 1024) = -1 EAGAIN (Resource temporarily unavailable) <0.000007>

21832 15:40:02.400429 _exit(0)          = ?
21832 15:40:02.400464 +++ exited with 0 +++
21669 15:40:02.400489 <... write resumed> ) = 1 <0.000180>
21669 15:40:02.400513 munmap(0x7f124c722000, 1052672) = 0 <0.000018>
21669 15:40:02.403969 exit_group(1)     = ?
21669 15:40:02.408829 +++ exited with 1 +++

4. After the exit, we see repeated FUTEX_WAKE/WAIT logs-

10449 15:40:02.480542 <... futex resumed> ) = -1 ETIMEDOUT (Connection timed out) <0.100051>
10449 15:40:02.480582 futex(0x1f016b0, FUTEX_WAKE_PRIVATE, 1) = 0 <0.000007>
10449 15:40:02.480616 futex(0x1f01634, FUTEX_WAIT_BITSET_PRIVATE, 8870903, {1304282, 946203464}, ffffffff) = -1 ETIMEDOUT (Connection timed out) <0.100054>
10449 15:40:02.580708 futex(0x1f016b0, FUTEX_WAKE_PRIVATE, 1) = 0 <0.000007>
10449 15:40:02.580739 futex(0x1f01634, FUTEX_WAIT_BITSET_PRIVATE, 8870905, {1304283, 46326228}, ffffffff) = -1 ETIMEDOUT (Connection timed out) <0.100055>
10449 15:40:02.680831 futex(0x1f016b0, FUTEX_WAKE_PRIVATE, 1) = 0 <0.000007>
10449 15:40:02.680862 futex(0x1f01634, FUTEX_WAIT_BITSET_PRIVATE, 8870907, {1304283, 146449402}, ffffffff) = -1 ETIMEDOUT (Connection timed out) <0.100055>
...
10449 15:40:04.483912 futex(0x1f016b0, FUTEX_WAKE_PRIVATE, 1) = 0 <0.000008>
10449 15:40:04.483981 futex(0x1f01634, FUTEX_WAIT_BITSET_PRIVATE, 8870943, {1304284, 949568209}, ffffffff <unfinished ...>
10460 15:40:04.488944 <... futex resumed> ) = -1 ETIMEDOUT (Connection timed out) <29.500058>
10460 15:40:04.488994 futex(0x7f43fc016d00, FUTEX_WAKE_PRIVATE, 1) = 0 <0.000008>
10460 15:40:04.489046 ioctl(8, FIONREAD, [0]) = 0 <0.000010>
10460 15:40:04.489374 sched_yield()     = 0 <0.000008>
10460 15:40:04.489420 futex(0x7f43fc016c84, FUTEX_WAIT_BITSET_PRIVATE, 30105, {1304314, 354997052}, ffffffff <unfinished ...>
10449 15:40:04.584073 <... futex resumed> ) = -1 ETIMEDOUT (Connection timed out) <0.100070>
10449 15:40:04.584147 futex(0x1f016b0, FUTEX_WAKE_PRIVATE, 1) = 0 <0.000007>
10449 15:40:04.584195 futex(0x1f01634, FUTEX_WAIT_BITSET_PRIVATE, 8870945, {1304285, 49782156}, ffffffff) = -1 ETIMEDOUT (Connection timed out) <0.100069>
10449 15:40:04.684352 futex(0x1f016b0, FUTEX_WAKE_PRIVATE, 1) = 0 <0.000007>
10449 15:40:04.684435 futex(0x1f01634, FUTEX_WAIT_BITSET_PRIVATE, 8870947, {1304285, 150020786}, ffffffff
.) = -1 ETIMEDOUT (Connection timed out) <0.100057>
...

5. The ETIMEOUTs repeat until the strace is manually killed-

...
10449 15:40:05.987435 futex(0x1f016b0, FUTEX_WAKE_PRIVATE, 1) = 0 <0.000008>
10449 15:40:05.987511 futex(0x1f01634, FUTEX_WAIT_BITSET_PRIVATE, 8870973, {1304286, 453097522}, ffffffff) = -1 ETIMEDOUT (Connection timed out) <0.100071>
10449 15:40:06.087666 futex(0x1f016b0, FUTEX_WAKE_PRIVATE, 1) = 0 <0.000007>
10449 15:40:06.087736 futex(0x1f01634, FUTEX_WAIT_BITSET_PRIVATE, 8870975, {1304286, 553323086}, ffffffff <detached ...>

<END OF FILE>


Additional Notes:

I can't find any matches or documentation towards the patching of:

https://github.com/systemd/systemd/issues/2236

Possibly patch-

https://github.com/systemd/systemd/commit/e22aa3d3#diff-8eeb33e68ba5aa7ad71124ed25315657R742


Configuration Reference:

# grep -v ^# /etc/systemd/journald.conf 

[Journal]
Storage=persistent
Compress=no
SplitMode=none
SyncIntervalSec=1m
RateLimitInterval=2s
RateLimitBurst=500
SystemMaxUse=8G
SystemMaxFileSize=1G
RuntimeMaxUse=8G
RuntimeMaxFileSize=1G
MaxRetentionSec=7day
ForwardToSyslog=no
ForwardToWall=no

Comment 3 Jan Synacek 2017-05-05 13:00:33 UTC

(In reply to Bryan Totty from comment #0)
> Additional Notes:
> 
> I can't find any matches or documentation towards the patching of:
> 
> https://github.com/systemd/systemd/issues/2236
> 
> Possibly patch-
> 
> https://github.com/systemd/systemd/commit/e22aa3d3#diff-
> 8eeb33e68ba5aa7ad71124ed25315657R742

Did you link this specific hunk for a reason, or was it a mistake and you want the whole patch to be backported? The entire patch looks kind of scary, but backporting only part of it would be quite dangerous in my opinion.

Comment 5 Michal Sekletar 2017-07-10 13:46:46 UTC

systemd-journald as released in RHEL-7 GA didn't support statefull restart wrt. stdout streams. stdout log aggregation was provided as a convenience mostly for short running one-shot scripts. AFAICT, our (as Red Hat) message never was, "just log to stdout and things will be fine". With RHEL-7 GA our guidelines about logging from daemons didn't change compared to RHEL-6 and traditional syslog API should be used for logging purposes in system services.

Having said that, I think that we should fix this even though we didn't initially provide this option with RHEL-7 GA and not even in RHEL-7.1. In RHEL-7.2 we were rebasing systemd and we reverted code that provided persistence for stdout streams. To get that functionality back we should "unrevert" following revert,

https://github.com/lnykryn/systemd-rhel/commit/91cb89c1b79ef3c475d91319edb0c052cb9f2724

Comment 14 Lukáš Nykrýn 2017-09-12 14:28:13 UTC

fix merged to upstream staging branch -> https://github.com/lnykryn/systemd-rhel/pull/135 -> post

Comment 36 errata-xmlrpc 2018-04-10 11:16:36 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2018:0711