1359939 – Restarting systemd-journald to load new configurations causes other daemons stop working



Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	Red Hat Enterprise Linux 7
Classification:	Red Hat
Component:	systemd
Sub Component:
Version:	7.4
Hardware:	All
OS:	Linux
Priority:	urgent
Severity:	high
Target Milestone:	rc
Target Release:	---
Assignee:	Michal Sekletar
QA Contact:	Frantisek Sumsal
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:	1186913 1298243 1420851 1465901 1466365 1494728 1525993

Reported:	2016-07-25 21:08 UTC by Bryan Totty
Modified:	2021-09-09 11:53 UTC
CC List:	19 users
Fixed In Version:	systemd-219-45.el7
Doc Text:
Clone Of:
Clones:	1525993
Environment:
Last Closed:	2018-04-10 11:16:36 UTC
Target Upstream Version:
Embargoed:


Links
System	ID	Private	Priority	Status	Summary	Last Updated
Red Hat Knowledge Base (Solution)	3717061	0	None	None	None	2018-11-26 20:36:24 UTC
Red Hat Product Errata	RHBA-2018:0711	0	None	None	None	2018-04-10 11:18:53 UTC

Description Bryan Totty 2016-07-25 21:08:50 UTC

Description of problem:

On several occasions, restarting the journal daemon (systemd-journald) has caused daemons such as mcollective to start failing: they can no longer connect to other processes such as puppet. We've also seen the rsyslog daemon stop logging certain messages after the journal is restarted.

We believe this is related to how journald closes pipes when it restarts. This has been reported in several places on the Internet and, according to Lennart Poettering, it is fixed upstream. From the following link, it appears to be fixed in v228.

https://github.com/systemd/systemd/issues/2236
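
The failure mode suspected above (a writer whose reader has gone away) can be illustrated in isolation. The following is a minimal sketch, not a reproducer for this bug: it uses a plain pipe created with os.pipe() as a stand-in for a service's stdout/stderr stream, and the function name is hypothetical. Closing the read end plays the role of journald dropping its side of the stream.

```python
import errno
import os


def write_after_reader_closed() -> int:
    """Return the errno seen when writing to a pipe that has no reader."""
    read_fd, write_fd = os.pipe()
    os.close(read_fd)  # stand-in for the reading side going away
    try:
        os.write(write_fd, b"log line\n")
    except OSError as exc:
        # Python ignores SIGPIPE by default, so the failure surfaces
        # as an exception carrying EPIPE instead of killing the process.
        return exc.errno
    finally:
        os.close(write_fd)
    return 0


if __name__ == "__main__":
    print(errno.errorcode[write_after_reader_closed()])
```

A process that leaves SIGPIPE at its default disposition would be terminated by the signal rather than seeing the error code, so the impact on a real daemon depends on how it handles SIGPIPE and write errors.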


Version-Release number of selected component (if applicable):

systemd-219-19.el7_2.11.x86_64


How reproducible:

Frequent in some environments, but without a clear reproducer.


Steps to Reproduce:
1. ?


Actual results:
Restarting the journald service appears to break logging for all other running services until those services are also restarted. 

(journalctl -flu <service_name> shows no new log lines post journald restart).

Expected results:
Support for reloading journald configuration without restarting the process and breaking logging for other systemd services.


Strace Output:

1. For reference, may not be related-

21669 15:40:02.378230 read(8, "# This might be removed if http://bugs.ruby-lang.org/issues/6286 gets accepted.\n\nmodule ABRT\n  module Exception\n    # Provides the exception formated in the same way as Ruby does for standard\n    # error output.\n    def format\n      backtrace = self.backtrace.collect { |line| \"\\tfrom #{line}\" }\n      backtrace[0] = \"#{self.backtrace.first}: #{self.message} (#{self.class.name})\"\n      backtrace\n    end\n\n    # Obtains executable name from backtrace. This should be more reliable then\n    # use of $0 aka $PROGRAM_NAME.\n    def executable\n      backtrace.last[/(.*?):/, 1]\n    end\n  end\nend\n", 8192) = 592 <0.000007>
21669 15:40:02.378322 read(8, "", 8192) = 0 <0.000007>
21669 15:40:02.378356 close(8)          = 0 <0.000008>

2. Broken Pipe-

...
21669 15:40:02.380090 write(2, "\tfrom /usr/share/ruby/vendor_ruby/puppet/application.rb:381:in `run'\n", 69) = -1 EPIPE (Broken pipe) <0.000007>
21669 15:40:02.380114 --- SIGPIPE {si_signo=SIGPIPE, si_code=SI_USER, si_pid=21669, si_uid=0} ---
21669 15:40:02.380129 write(2, "\tfrom /usr/share/ruby/vendor_ruby/puppet/util/command_line.rb:146:in `run'\n", 75) = -1 EPIPE (Broken pipe) <0.000007>
21669 15:40:02.380153 --- SIGPIPE {si_signo=SIGPIPE, si_code=SI_USER, si_pid=21669, si_uid=0} ---
21669 15:40:02.380168 write(2, "\tfrom /usr/share/ruby/vendor_ruby/puppet/util/command_line.rb:92:in `execute'\n", 78) = -1 EPIPE (Broken pipe) <0.000007>
21669 15:40:02.380192 --- SIGPIPE {si_signo=SIGPIPE, si_code=SI_USER, si_pid=21669, si_uid=0} ---
21669 15:40:02.380207 write(2, "\tfrom /usr/bin/puppet:8:in `<main>'\n", 36) = -1 EPIPE (Broken pipe) <0.000006>
21669 15:40:02.380231 --- SIGPIPE {si_signo=SIGPIPE, si_code=SI_USER, si_pid=21669, si_uid=0} ---
21669 15:40:02.380246 rt_sigaction(SIGINT, {SIG_IGN, [], SA_RESTORER|SA_SIGINFO, 0x7f124bfd8100}, {0x7f124c2f7510, [], SA_RESTORER|SA_SIGINFO, 0x7f124bfd8100}, 8) = 0 <0.000006>
21669 15:40:02.380273 rt_sigaction(SIGINT, {SIG_DFL, [], SA_RESTORER|SA_SIGINFO, 0x7f124bfd8100}, {SIG_IGN, [], SA_RESTORER|SA_SIGINFO, 0x7f124bfd8100}, 8) = 0 <0.000006>
10449 15:40:02.380358 <... futex resumed> ) = -1 ETIMEDOUT (Connection timed out) <0.100096>
10449 15:40:02.380390 futex(0x1f016b0, FUTEX_WAKE_PRIVATE, 1) = 0 <0.000049>
10449 15:40:02.380477 futex(0x1f01634, FUTEX_WAIT_BITSET_PRIVATE, 8870901, {1304282, 846059213}, ffffffff <unfinished ...>
21669 15:40:02.400268 write(4, "!", 1 <unfinished ...>
21832 15:40:02.400324 <... poll resumed> ) = 1 ([{fd=3, revents=POLLIN}]) <0.050503>
21832 15:40:02.400346 read(3, "!", 1024) = 1 <0.000007>

3. Resource Temporarily Unavailable-

21832 15:40:02.400369 read(3, 0x7f124c640580, 1024) = -1 EAGAIN (Resource temporarily unavailable) <0.000006>
21832 15:40:02.400392 read(5, 0x7f124c640580, 1024) = -1 EAGAIN (Resource temporarily unavailable) <0.000007>

21832 15:40:02.400429 _exit(0)          = ?
21832 15:40:02.400464 +++ exited with 0 +++
21669 15:40:02.400489 <... write resumed> ) = 1 <0.000180>
21669 15:40:02.400513 munmap(0x7f124c722000, 1052672) = 0 <0.000018>
21669 15:40:02.403969 exit_group(1)     = ?
21669 15:40:02.408829 +++ exited with 1 +++

4. After the exit, we see repeated FUTEX_WAKE/WAIT logs-

10449 15:40:02.480542 <... futex resumed> ) = -1 ETIMEDOUT (Connection timed out) <0.100051>
10449 15:40:02.480582 futex(0x1f016b0, FUTEX_WAKE_PRIVATE, 1) = 0 <0.000007>
10449 15:40:02.480616 futex(0x1f01634, FUTEX_WAIT_BITSET_PRIVATE, 8870903, {1304282, 946203464}, ffffffff) = -1 ETIMEDOUT (Connection timed out) <0.100054>
10449 15:40:02.580708 futex(0x1f016b0, FUTEX_WAKE_PRIVATE, 1) = 0 <0.000007>
10449 15:40:02.580739 futex(0x1f01634, FUTEX_WAIT_BITSET_PRIVATE, 8870905, {1304283, 46326228}, ffffffff) = -1 ETIMEDOUT (Connection timed out) <0.100055>
10449 15:40:02.680831 futex(0x1f016b0, FUTEX_WAKE_PRIVATE, 1) = 0 <0.000007>
10449 15:40:02.680862 futex(0x1f01634, FUTEX_WAIT_BITSET_PRIVATE, 8870907, {1304283, 146449402}, ffffffff) = -1 ETIMEDOUT (Connection timed out) <0.100055>
...
10449 15:40:04.483912 futex(0x1f016b0, FUTEX_WAKE_PRIVATE, 1) = 0 <0.000008>
10449 15:40:04.483981 futex(0x1f01634, FUTEX_WAIT_BITSET_PRIVATE, 8870943, {1304284, 949568209}, ffffffff <unfinished ...>
10460 15:40:04.488944 <... futex resumed> ) = -1 ETIMEDOUT (Connection timed out) <29.500058>
10460 15:40:04.488994 futex(0x7f43fc016d00, FUTEX_WAKE_PRIVATE, 1) = 0 <0.000008>
10460 15:40:04.489046 ioctl(8, FIONREAD, [0]) = 0 <0.000010>
10460 15:40:04.489374 sched_yield()     = 0 <0.000008>
10460 15:40:04.489420 futex(0x7f43fc016c84, FUTEX_WAIT_BITSET_PRIVATE, 30105, {1304314, 354997052}, ffffffff <unfinished ...>
10449 15:40:04.584073 <... futex resumed> ) = -1 ETIMEDOUT (Connection timed out) <0.100070>
10449 15:40:04.584147 futex(0x1f016b0, FUTEX_WAKE_PRIVATE, 1) = 0 <0.000007>
10449 15:40:04.584195 futex(0x1f01634, FUTEX_WAIT_BITSET_PRIVATE, 8870945, {1304285, 49782156}, ffffffff) = -1 ETIMEDOUT (Connection timed out) <0.100069>
10449 15:40:04.684352 futex(0x1f016b0, FUTEX_WAKE_PRIVATE, 1) = 0 <0.000007>
10449 15:40:04.684435 futex(0x1f01634, FUTEX_WAIT_BITSET_PRIVATE, 8870947, {1304285, 150020786}, ffffffff) = -1 ETIMEDOUT (Connection timed out) <0.100057>
...

5. The ETIMEDOUTs repeat until the strace is manually killed-

...
10449 15:40:05.987435 futex(0x1f016b0, FUTEX_WAKE_PRIVATE, 1) = 0 <0.000008>
10449 15:40:05.987511 futex(0x1f01634, FUTEX_WAIT_BITSET_PRIVATE, 8870973, {1304286, 453097522}, ffffffff) = -1 ETIMEDOUT (Connection timed out) <0.100071>
10449 15:40:06.087666 futex(0x1f016b0, FUTEX_WAKE_PRIVATE, 1) = 0 <0.000007>
10449 15:40:06.087736 futex(0x1f01634, FUTEX_WAIT_BITSET_PRIVATE, 8870975, {1304286, 553323086}, ffffffff <detached ...>

<END OF FILE>
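
For longer captures, the failing calls in a trace like the one above can be tallied per process. This is a rough sketch that assumes the line layout shown in this report (PID, timestamp, syscall, then "= -1 ERRNO"); the regular expression and the function name are illustrative, and the sample lines are copied from the output above.

```python
import re
from collections import Counter

# Matches lines such as "PID HH:MM:SS.us call(...) = -1 ERRNO (...)",
# including "<... call resumed>" continuation lines.
FAILED_CALL = re.compile(
    r"^(?P<pid>\d+) \S+ (?:<\.\.\. )?(?P<call>\w+)\b.*= -1 (?P<err>E[A-Z]+)"
)


def summarize_failures(lines):
    """Count failed syscalls keyed by (pid, syscall, errno name)."""
    counts = Counter()
    for line in lines:
        match = FAILED_CALL.match(line)
        if match:
            key = (int(match["pid"]), match["call"], match["err"])
            counts[key] += 1
    return counts


SAMPLE = r"""
21669 15:40:02.380207 write(2, "\tfrom /usr/bin/puppet:8:in `<main>'\n", 36) = -1 EPIPE (Broken pipe) <0.000006>
21669 15:40:02.380231 --- SIGPIPE {si_signo=SIGPIPE, si_code=SI_USER, si_pid=21669, si_uid=0} ---
21832 15:40:02.400369 read(3, 0x7f124c640580, 1024) = -1 EAGAIN (Resource temporarily unavailable) <0.000006>
10449 15:40:02.480542 <... futex resumed> ) = -1 ETIMEDOUT (Connection timed out) <0.100051>
10449 15:40:02.480582 futex(0x1f016b0, FUTEX_WAKE_PRIVATE, 1) = 0 <0.000007>
"""

if __name__ == "__main__":
    for key, count in sorted(summarize_failures(SAMPLE.splitlines()).items()):
        print(key, count)
```

Signal lines and successful calls are skipped, so the summary separates the EPIPE writes in PID 21669 from the periodic futex timeouts in PID 10449.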


Additional Notes:

I can't find any matches or documentation towards the patching of:

https://github.com/systemd/systemd/issues/2236

Possibly patch-

https://github.com/systemd/systemd/commit/e22aa3d3#diff-8eeb33e68ba5aa7ad71124ed25315657R742


Configuration Reference:

# grep -v ^# /etc/systemd/journald.conf 

[Journal]
Storage=persistent
Compress=no
SplitMode=none
SyncIntervalSec=1m
RateLimitInterval=2s
RateLimitBurst=500
SystemMaxUse=8G
SystemMaxFileSize=1G
RuntimeMaxUse=8G
RuntimeMaxFileSize=1G
MaxRetentionSec=7day
ForwardToSyslog=no
ForwardToWall=no
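
To compare these settings across hosts while triaging, the file can be loaded programmatically. The sketch below assumes the file is plain INI with a single [Journal] section and no repeated keys, which holds for the configuration above; Python's configparser is used as a convenience and is not how systemd itself parses the file. The function name is hypothetical.

```python
import configparser

JOURNALD_CONF = """\
[Journal]
Storage=persistent
Compress=no
SplitMode=none
SyncIntervalSec=1m
RateLimitInterval=2s
RateLimitBurst=500
SystemMaxUse=8G
SystemMaxFileSize=1G
RuntimeMaxUse=8G
RuntimeMaxFileSize=1G
MaxRetentionSec=7day
ForwardToSyslog=no
ForwardToWall=no
"""


def load_journal_settings(text):
    """Return the [Journal] section as a dict of option name to raw value."""
    parser = configparser.ConfigParser(interpolation=None)
    parser.optionxform = str  # keep option names exactly as written
    parser.read_string(text)
    return dict(parser["Journal"])


if __name__ == "__main__":
    for name, value in load_journal_settings(JOURNALD_CONF).items():
        print(f"{name} = {value}")
```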

Comment 3 Jan Synacek 2017-05-05 13:00:33 UTC

(In reply to Bryan Totty from comment #0)
> Additional Notes:
> 
> I can't find any matches or documentation towards the patching of:
> 
> https://github.com/systemd/systemd/issues/2236
> 
> Possibly patch-
> 
> https://github.com/systemd/systemd/commit/e22aa3d3#diff-8eeb33e68ba5aa7ad71124ed25315657R742

Did you link this specific hunk for a reason, or was it a mistake and you want the whole patch to be backported? The entire patch looks kind of scary, but backporting only part of it would be quite dangerous in my opinion.

Comment 5 Michal Sekletar 2017-07-10 13:46:46 UTC

systemd-journald as released in RHEL-7 GA didn't support stateful restart with respect to stdout streams. Stdout log aggregation was provided as a convenience, mostly for short-running one-shot scripts. AFAICT, our message (as Red Hat) was never "just log to stdout and things will be fine". With RHEL-7 GA, our guidelines about logging from daemons didn't change compared to RHEL-6: the traditional syslog API should be used for logging in system services.

Having said that, I think we should fix this, even though we didn't initially provide this option with RHEL-7 GA, or even in RHEL-7.1. In RHEL-7.2 we rebased systemd and reverted the code that provided persistence for stdout streams. To get that functionality back, we should "unrevert" the following revert:

https://github.com/lnykryn/systemd-rhel/commit/91cb89c1b79ef3c475d91319edb0c052cb9f2724
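
For readers unfamiliar with the "traditional syslog API" mentioned in this comment, a minimal sketch using Python's standard syslog module (a thin wrapper over the C library's openlog/syslog calls) might look like the following. The identifier "example-daemon" and the function name are hypothetical, and this only illustrates logging through syslog instead of writing to stdout; it is not taken from the fix.

```python
import syslog


def log_via_syslog(message, ident="example-daemon"):
    """Send one informational message through the syslog API."""
    # LOG_PID adds the process ID to each message; LOG_DAEMON is the
    # facility conventionally used by system services.
    syslog.openlog(ident=ident, logoption=syslog.LOG_PID,
                   facility=syslog.LOG_DAEMON)
    syslog.syslog(syslog.LOG_INFO, message)
    syslog.closelog()


if __name__ == "__main__":
    log_via_syslog("service started")
```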

Comment 14 Lukáš Nykrýn 2017-09-12 14:28:13 UTC

fix merged to upstream staging branch -> https://github.com/lnykryn/systemd-rhel/pull/135 -> post

Comment 36 errata-xmlrpc 2018-04-10 11:16:36 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2018:0711
