Bug 1433687

Summary: some systemd services will fail when using stateless linux mode
Product: Red Hat Enterprise Linux 7
Reporter: Lev Veyde <lveyde>
Component: systemd
Assignee: systemd-maint
Status: CLOSED ERRATA
QA Contact: Frantisek Sumsal <fsumsal>
Severity: urgent
Docs Contact:
Priority: unspecified
Version: 7.3
CC: bblaskov, dfediuck, fsumsal, jsynacek, lveyde, sbonazzo, systemd-maint-list
Target Milestone: rc
Target Release: ---
Hardware: x86_64
OS: Linux
Whiteboard:
Fixed In Version: systemd-219-33.el7
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2017-08-01 09:14:52 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:

Description Lev Veyde 2017-03-19 10:40:27 UTC
Description of problem:
Some systemd services will fail if stateless Linux mode is enabled
(TEMPORARY_STATE=yes in /etc/sysconfig/readonly-root)

The services that are confirmed to fail are systemd-machined, systemd-hostnamed and systemd-localed.
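
For reference, stateless mode here boils down to the following setting (a minimal excerpt; the other options in /etc/sysconfig/readonly-root are left at their defaults):

# /etc/sysconfig/readonly-root (excerpt)
TEMPORARY_STATE=yes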

Version-Release number of selected component (if applicable):
systemd-219-30

How reproducible:
100%

Steps to Reproduce:
1. run system with stateless mode enabled
2. check the status of the aforementioned services (see the example commands below)
3. note that they all fail with 226/NAMESPACE
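
For step 2, something along these lines can be used (a hedged example; the unit names are taken from the description above):

$ systemctl status systemd-machined systemd-hostnamed systemd-localed
$ systemctl show -p Result,ExecMainStatus systemd-hostnamed

Exit status 226 corresponds to EXIT_NAMESPACE, i.e. the service could not set up its mount namespace.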

Actual results:
systemd services fail in the stateless mode

Expected results:
systemd services should not fail in stateless mode

Additional info:
The issue can easily be reproduced with oVirt-Live:
http://jenkins.ovirt.org/job/ovirt-live_4.1-create-iso/130/artifact/output/ovirt-live-4.1.1_rc3.iso

Comment 2 Jan Synacek 2017-03-20 07:37:33 UTC
I cannot reproduce this with systemd-219-30.el7.x86_64 on a RHEL-7.3 machine. Nothing fails when TEMPORARY_STATE is set to 'yes'.

Comment 3 Jan Synacek 2017-03-20 08:02:51 UTC
The problem seems to be that all those services set PrivateDevices=yes. If you comment that line out, they start.
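
One way to confirm this without editing the shipped unit file is a drop-in override (a sketch, not a suggested permanent fix; the drop-in file name is arbitrary):

# /etc/systemd/system/systemd-hostnamed.service.d/override.conf
[Service]
PrivateDevices=no

$ systemctl daemon-reload
$ systemctl restart systemd-hostnamed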

Comment 4 Jan Synacek 2017-03-20 08:41:04 UTC
When stracing pid 1 and its forked processes (systemd-hostnamed in this particular case), this part reveals what failed:

[pid 12578] mount(NULL, "/dev/shm/lldpad.state", NULL, MS_REMOUNT|MS_BIND, NULL) = -1 EINVAL (Invalid argument)
[pid 12578] close(3)                    = 0
[pid 12578] munmap(0x7f03475be000, 4096) = 0
[pid 12578] socket(PF_LOCAL, SOCK_DGRAM|SOCK_CLOEXEC, 0) = 3
[pid 12578] getsockopt(3, SOL_SOCKET, SO_SNDBUF, [212992], [4]) = 0
[pid 12578] setsockopt(3, SOL_SOCKET, 0x20 /* SO_??? */, [8388608], 4) = 0
[pid 12578] setsockopt(3, SOL_SOCKET, SO_SNDTIMEO, "\n\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0", 16) = 0
[pid 12578] connect(3, {sa_family=AF_LOCAL, sun_path="/run/systemd/journal/socket"}, 29) = 0
[pid 12578] sendmsg(3, {msg_name(0)=NULL, msg_iov(11)=[{"PRIORITY=3\nSYSLOG_FACILITY=3\nCOD"..., 124}, {"USER_UNIT=systemd-hostnamed.serv"..., 35}, {"\n", 1}, {"MESSAGE_ID=641257651c1b4ec9a8624"..., 43}, {"\n", 1}, {"EXECUTABLE=/usr/lib/systemd/syst"..., 45}, {"\n", 1}, {"MESSAGE=Failed at step NAMESPACE"..., 94}, {"\n", 1}, {"ERRNO=22", 8}, {"\n", 1}], msg_controllen=0, msg_flags=0}, MSG_NOSIGNAL) = 354
[pid 12578] exit_group(226)             = ?
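
For reference, a trace like the one above can be captured with something along these lines (a rough sketch; requires root and the strace package, and the output file name is arbitrary):

$ strace -f -o /tmp/pid1.trace -p 1 &
$ systemctl restart systemd-hostnamed
$ kill %1
$ grep 'mount(' /tmp/pid1.trace

The EINVAL from the MS_REMOUNT|MS_BIND call is what ends up reported as "Failed at step NAMESPACE" (exit status 226, ERRNO=22).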

Comment 5 Jan Synacek 2017-03-20 08:50:22 UTC
$ mount | grep /dev/shm
tmpfs on /dev/shm type tmpfs (rw,relatime,seclabel)
none on /dev/shm/lldpad.state type tmpfs (rw,relatime,seclabel)

I think that this might be a problem. Why is /dev/shm/lldpad.state mounted like that?
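
A quick way to list everything mounted below /dev/shm, including the source and options of each mount (a hedged example; findmnt is part of util-linux):

$ findmnt -R /dev/shm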

Comment 6 Lev Veyde 2017-03-20 16:21:07 UTC
(In reply to Jan Synacek from comment #5)
> $ mount | grep /dev/shm
> tmpfs on /dev/shm type tmpfs (rw,relatime,seclabel)
> none on /dev/shm/lldpad.state type tmpfs (rw,relatime,seclabel)
> 
> I think that this might be a problem. Why is /dev/shm/lldpad.state mounted
> like that?

No idea, AFAIK we don't modify it in any way.

Comment 7 Jan Synacek 2017-03-21 09:46:41 UTC
(In reply to Lev Veyde from comment #6)
> No idea, AFAIK we don't modify it in any way.

Is it "we, the image creators"? I'm asking because I would like to know whether it has to be mounted like that (or what the reasoning was), and whether the problem disappears when the mount is not there. If so, the solution would be simple. If the mount really has to be there, then we would have to come up with something else. It doesn't look (to me, at least) like there's a bug in systemd in this case.

Comment 8 Lev Veyde 2017-03-21 10:01:37 UTC
(In reply to Jan Synacek from comment #7)
> (In reply to Lev Veyde from comment #6)
> > No idea, AFAIK we don't modify it in any way.
> 
> Is it "we, the image creators"? I'm asking because I would like to know
> whether it has to be mounted like that (or what the reasoning was), and
> whether the problem disappears when the mount is not there. If so, the
> solution would be simple. If the mount really has to be there, then we would
> have to come up with something else. It doesn't look (to me, at least) like
> there's a bug in systemd in this case.

Yes, that's what I meant - we don't change anything in that area as part of the oVirt-Live image creation.

From what I see, it looks like the services start normally if that mount is removed (or rather, they can be restarted and then work fine).

BTW, how did you conclude that the issue is specifically with that mount?
I couldn't find any useful information in the logs pointing in that direction.
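
For reference, the check described above can be done with something like the following (hedged example; run as root, and note that whatever created the mount may simply re-create it later):

$ umount /dev/shm/lldpad.state
$ systemctl restart systemd-machined systemd-hostnamed systemd-localed
$ systemctl --failed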

Comment 9 Jan Synacek 2017-03-21 13:48:29 UTC
A potential fix: https://github.com/systemd/systemd/commit/98df8089bea1b2407c46495b6c2eb76dda46c658
I'm not sure if it will fix this particular case.

If I provide a test build, would it be possible to rebuild the image?

Comment 10 Lev Veyde 2017-03-30 14:49:00 UTC
(In reply to Jan Synacek from comment #9)
> A potential fix:
> https://github.com/systemd/systemd/commit/
> 98df8089bea1b2407c46495b6c2eb76dda46c658
> I'm not sure if it will fix this particular case.
> 
> If I provide a test build, would it be possible to rebuild the image?

Yes, we could try to do a test build using an updated systemd package.

Comment 14 Jan Synacek 2017-04-05 06:45:25 UTC
https://github.com/lnykryn/systemd-rhel/pull/103

Comment 16 Lukáš Nykrýn 2017-04-10 14:58:31 UTC
fix merged to upstream staging branch ->
https://github.com/lnykryn/systemd-rhel/commit/8d166597076d87aae9d5f98144103386c79d6446
-> post

Comment 19 errata-xmlrpc 2017-08-01 09:14:52 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2017:2297