Bug 1838081

Summary: podman healthchecks fail on RHEL 8 when using rootless
Product: Red Hat Enterprise Linux 8
Reporter: Suhaas Bhat <subhat>
Component: systemd
Assignee: David Tardon <dtardon>
Status: CLOSED ERRATA
QA Contact: Frantisek Sumsal <fsumsal>
Severity: high
Docs Contact:
Priority: unspecified
Version: 8.1
CC: bbaude, dtardon, dwalsh, gscrivan, jligon, jnovy, lsm5, mheon, pasik, rmanes, systemd-maint-list
Target Milestone: rc
Flags: pm-rhel: mirror+
Target Release: 8.0
Hardware: x86_64
OS: Linux
Whiteboard:
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2020-11-04 02:02:13 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:

Description Suhaas Bhat 2020-05-20 12:49:23 UTC
Description of problem:
On RHEL 8, podman uses systemd to drive container healthchecks, and these healthchecks fail when podman runs rootless.

Version-Release number of selected component (if applicable):
podman-1.6.4-11.module+el8.2.0+6368+cf16aa14.x86_64

How reproducible:
Fully reproducible

Steps to Reproduce:

[joe@vm249-81 ~]$ podman run -d --name test1 --health-cmd="/bin/sh -c 'echo hello'" --healthcheck-interval=30s --health-timeout=10s ubi8
Trying to pull registry.access.redhat.com/ubi8...
Getting image source signatures
Copying blob 78afc5364ad2 done
Copying blob 58e1deb9693d done
Copying config 8e0b0194b7 done
Writing manifest to image destination
Storing signatures
ERRO[0021] exit status 1                                
ERRO[0021] Unit 3ea606270fe65b703d94f03b710bf2182471cfe8c1f24ebec3130f8e126f40c8.service not found. 
3ea606270fe65b703d94f03b710bf2182471cfe8c1f24ebec3130f8e126f40c8

[joe@vm249-81 ~]$ podman healthcheck run test1
healthy

We have the fix here: https://github.com/containers/libpod/pull/3529/commits/c6c637da0012d54ddce32e06706c8903564b41cc

We need to backport this.

Actual results:
Healthchecks fail for rootless podman on RHEL 8

Expected results:
Healthchecks should succeed for rootless podman on RHEL 8

Additional info:

Comment 1 Matthew Heon 2020-05-20 13:21:57 UTC
That commit is present in Podman 1.5.0, so it's definitely in the 1.6.x versions we build & ship for RHEL 8. How certain are you that it is the cause?

Comment 2 Robb Manes 2020-05-20 14:13:16 UTC
(In reply to Matthew Heon from comment #1)
> That commit is present in Podman 1.5.0, so it's definitely in the 1.6.x
> versions we build & ship for RHEL 8. How certain are you that it is the
> cause?

We're not certain.  I made an error in my checking and apologize.

The behavior we do see on podman 1.6.4 is the following, however:

$ podman -v
podman version 1.6.4

$ podman run -d --name healthyboi --health-cmd="/bin/sh -c 'echo hello'" --healthcheck-interval=30s --health-timeout=10s ubi8
ERRO[0000] exit status 1                                
ERRO[0000] Unit e8cfe5ce36e59858f103ca863956caa1a10ae54a36e9e79af72e9d7014c7abe5.service not found. 
e8cfe5ce36e59858f103ca863956caa1a10ae54a36e9e79af72e9d7014c7abe5

I did the same exact thing on 2.0.0-dev and it succeeds, rootless, so please ignore my mistake regarding the github issue I gave to Suhaas.

Comment 3 Matthew Heon 2020-05-20 14:24:41 UTC
Thanks for the quick response. The 2.0 thing is interesting. We're shipping 1.9.2 in RHEL 8.2.1 - any chance you can try with that and verify if it's fixed there as well?

Comment 4 Robb Manes 2020-05-20 15:02:48 UTC
Disclaimer: previously I tried 2.0 on my Ubuntu 20 dev box and 1.6.4 on a RHEL 8 box, just because they were handy.  This time I tried 2.0, 1.9.2, and 1.6.4 on my Ubuntu 20 box, and all versions worked with no errors.

---

Using strictly the RHEL 8 system that can reliably reproduce the error I posted in comment #2, it still fails on 2.0, so again I muddied the waters there by jumping the gun.

So _all_ versions of podman fail on a brand-new RHEL 8 installation when using rootless healthchecks.

$ ./bin/podman -v
podman version 2.0.0-dev

$ ./bin/podman run -d --name healthyboi --health-cmd="/bin/sh -c 'echo hello'" --healthcheck-interval=30s --health-timeout=10s ubi8
ERRO[0000] exit status 1                                
ERRO[0000] Unit cb300d9a0ecbcd72146fcd3b058fbf48f395f7d6bb64fb03061c2796659512cb.service not found. 
cb300d9a0ecbcd72146fcd3b058fbf48f395f7d6bb64fb03061c2796659512cb

Debugging further on my system...

Comment 5 Robb Manes 2020-05-20 15:26:50 UTC
Debug logging:

$ /usr/bin/podman --log-level=debug run -d --name healthyboi --health-cmd="/bin/sh -c 'echo hello'" --healthcheck-interval=30s --health-timeout=10s ubi8
- - - - 8< - - - -
DEBU[0000] /usr/bin/conmon messages will be logged to syslog 
DEBU[0000] running conmon: /usr/bin/conmon               args="[--api-version 1 -c 54866c61dafaa28ce372465e03f6f9162940252adfa8132c6eed92866c44834c -u 54866c61dafaa28ce372465e03f6f9162940252adfa8132c6eed92866c44834c -r /usr/bin/runc -b /home/rmanes/.local/share/containers/storage/overlay-containers/54866c61dafaa28ce372465e03f6f9162940252adfa8132c6eed92866c44834c/userdata -p /run/user/1000/overlay-containers/54866c61dafaa28ce372465e03f6f9162940252adfa8132c6eed92866c44834c/userdata/pidfile -l k8s-file:/home/rmanes/.local/share/containers/storage/overlay-containers/54866c61dafaa28ce372465e03f6f9162940252adfa8132c6eed92866c44834c/userdata/ctr.log --exit-dir /run/user/1000/libpod/tmp/exits --socket-dir-path /run/user/1000/libpod/tmp/socket --log-level debug --syslog --conmon-pidfile /run/user/1000/overlay-containers/54866c61dafaa28ce372465e03f6f9162940252adfa8132c6eed92866c44834c/userdata/conmon.pid --exit-command /usr/bin/podman --exit-command-arg --root --exit-command-arg /home/rmanes/.local/share/containers/storage --exit-command-arg --runroot --exit-command-arg /run/user/1000 --exit-command-arg --log-level --exit-command-arg debug --exit-command-arg --cgroup-manager --exit-command-arg cgroupfs --exit-command-arg --tmpdir --exit-command-arg /run/user/1000/libpod/tmp --exit-command-arg --runtime --exit-command-arg runc --exit-command-arg --storage-driver --exit-command-arg overlay --exit-command-arg --storage-opt --exit-command-arg overlay.mount_program=/usr/bin/fuse-overlayfs --exit-command-arg --events-backend --exit-command-arg journald --exit-command-arg container --exit-command-arg cleanup --exit-command-arg 54866c61dafaa28ce372465e03f6f9162940252adfa8132c6eed92866c44834c]"
[conmon:d]: failed to write to /proc/self/oom_score_adj: Permission denied

DEBU[0000] Received: 111750                             
INFO[0000] Got Conmon PID as 111739                     
DEBU[0000] Created container 54866c61dafaa28ce372465e03f6f9162940252adfa8132c6eed92866c44834c in OCI runtime 
DEBU[0000] creating systemd-transient files: systemd-run [--user --unit 54866c61dafaa28ce372465e03f6f9162940252adfa8132c6eed92866c44834c --on-unit-inactive=30s --timer-property=AccuracySec=1s /usr/bin/podman healthcheck run 54866c61dafaa28ce372465e03f6f9162940252adfa8132c6eed92866c44834c] 
ERRO[0000] exit status 1                                
DEBU[0000] Starting container 54866c61dafaa28ce372465e03f6f9162940252adfa8132c6eed92866c44834c with command [/bin/bash] 
DEBU[0000] Started container 54866c61dafaa28ce372465e03f6f9162940252adfa8132c6eed92866c44834c 
ERRO[0000] Unit 54866c61dafaa28ce372465e03f6f9162940252adfa8132c6eed92866c44834c.service not found. 
54866c61dafaa28ce372465e03f6f9162940252adfa8132c6eed92866c44834c

Looks like systemd-run is the one having issues:

$ ~/go/bin/dlv exec /usr/bin/podman -- run -d --name healthyboi --health-cmd="/bin/sh -c 'echo hello'" --healthcheck-interval=30s --health-timeout=10s ubi8

(dlv) b libpod.Container.createTimer
Breakpoint 1 set at 0x557e6e6b2faa for github.com/containers/libpod/libpod.(*Container).createTimer() /usr/src/debug/podman-1.6.4-11.module+el8.2.0+6368+cf16aa14.x86_64/_build/src/github.com/containers/libpod/libpod/healthcheck_linux.go:52

(dlv) continue
> github.com/containers/libpod/libpod.(*Container).createTimer() /usr/src/debug/podman-1.6.4-11.module+el8.2.0+6368+cf16aa14.x86_64/_build/src/github.com/containers/libpod/libpod/healthcheck_linux.go:52 (hits goroutine(1):1 total:1) (PC: 0x5555e21f8faa)
Warning: debugging optimized function
    47:		}
    48:		return dbus.NewSystemdConnection()
    49:	}
    50:	
    51:	// createTimer systemd timers for healthchecks of a container
=>  52:	func (c *Container) createTimer() error {
    53:		if c.disableHealthCheckSystemd() {
    54:			return nil
    55:		}
    56:		podman, err := os.Executable()
    57:		if err != nil {

Stepped through a bit...

(dlv) continue
> github.com/containers/libpod/libpod.(*Container).createTimer() /usr/src/debug/podman-1.6.4-11.module+el8.2.0+6368+cf16aa14.x86_64/_build/src/github.com/containers/libpod/libpod/healthcheck_linux.go:67 (hits goroutine(1):1 total:1) (PC: 0x55725d35c2c3)
Warning: debugging optimized function
    62:		if rootless.IsRootless() {
    63:			cmd = append(cmd, "--user")
    64:		}
    65:		cmd = append(cmd, "--unit", c.ID(), fmt.Sprintf("--on-unit-inactive=%s", c.HealthCheckConfig().Interval.String()), "--timer-property=AccuracySec=1s", podman, "healthcheck", "run", c.ID())
    66:	
=>  67:		conn, err := getConnection()
    68:		if err != nil {
    69:			return errors.Wrapf(err, "unable to get systemd connection to add healthchecks")
    70:		}
    71:		conn.Close()
    72:		logrus.Debugf("creating systemd-transient files: %s %s", "systemd-run", cmd)

(dlv) p cmd
[]string len: 9, cap: 9, [
	"--user",
	"--unit",
	"2df5e01ded4a2063532c203d30fa1d1877f3bf7944de467a084ac23580197fe5",
	"--on-unit-inactive=30s",
	"--timer-property=AccuracySec=1s",
	"/usr/bin/podman",
	"healthcheck",
	"run",
	"2df5e01ded4a2063532c203d30fa1d1877f3bf7944de467a084ac23580197fe5",
]

(dlv) b 74
Breakpoint 6 set at 0x55725d35c3dd for github.com/containers/libpod/libpod.(*Container).createTimer() /usr/src/debug/podman-1.6.4-11.module+el8.2.0+6368+cf16aa14.x86_64/_build/src/github.com/containers/libpod/libpod/healthcheck_linux.go:74

(dlv) p systemdRun
*os/exec.Cmd {
	Path: "/usr/bin/systemd-run",
	Args: []string len: 10, cap: 10, [
		"systemd-run",
		"--user",
		"--unit",
		"2df5e01ded4a2063532c203d30fa1d1877f3bf7944de467a084ac23580197fe5",
		"--on-unit-inactive=30s",
		"--timer-property=AccuracySec=1s",
		"/usr/bin/podman",
		"healthcheck",
		"run",
		"2df5e01ded4a2063532c203d30fa1d1877f3bf7944de467a084ac23580197fe5",
	],
	Env: []string len: 0, cap: 0, nil,
	Dir: "",
	Stdin: io.Reader nil,
	Stdout: io.Writer nil,
	Stderr: io.Writer nil,
	ExtraFiles: []*os.File len: 0, cap: 0, nil,
	SysProcAttr: *syscall.SysProcAttr nil,
	Process: *os.Process nil,
	ProcessState: *os.ProcessState nil,
	ctx: context.Context nil,
	lookPathErr: error nil,
	finished: false,
	childFiles: []*os.File len: 0, cap: 0, nil,
	closeAfterStart: []io.Closer len: 0, cap: 0, nil,
	closeAfterWait: []io.Closer len: 0, cap: 0, nil,
	goroutine: []func() error len: 0, cap: 0, nil,
	errch: chan error {},
	waitDone: chan struct {} {},}

At this moment, since this has already run in the debugger, I went to see if the unit `2df5e01ded4a2063532c203d30fa1d1877f3bf7944de467a084ac23580197fe5` existed, and it did not:

$ systemctl status 2df5e01ded4a2063532c203d30fa1d1877f3bf7944de467a084ac23580197fe5.service
Unit 2df5e01ded4a2063532c203d30fa1d1877f3bf7944de467a084ac23580197fe5.service could not be found.

So perhaps the unit isn't being created?  Is it supposed to be created by systemd-run or earlier?  Still looking.

Comment 6 Matthew Heon 2020-05-20 15:32:39 UTC
Can you see if `systemctl --user` works? For rootless Podman, we interact with the user's systemd instance.
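
For reference, the user/system split looks roughly like this with github.com/coreos/go-systemd/dbus; this is a minimal sketch, not the exact libpod code, and the os.Geteuid() check merely stands in for rootless.IsRootless():

package main

import (
	"fmt"
	"os"

	"github.com/coreos/go-systemd/dbus"
)

// getConnection picks which systemd instance to talk to: the per-user
// manager (systemd --user) when running unprivileged, the system
// manager otherwise. Sketch only; libpod's real getConnection differs.
func getConnection() (*dbus.Conn, error) {
	if os.Geteuid() != 0 { // stand-in for rootless.IsRootless()
		return dbus.NewUserConnection() // user session bus -> systemd --user
	}
	return dbus.NewSystemdConnection() // private socket of the system manager
}

func main() {
	conn, err := getConnection()
	if err != nil {
		fmt.Println("cannot reach systemd:", err)
		return
	}
	defer conn.Close()
	fmt.Println("connected")
}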

Comment 7 Robb Manes 2020-05-20 15:38:47 UTC
When I run the systemd-run command manually, it creates the unit on the RHEL 8 system without issues:

$ systemd-run --user --unit myunit --on-unit-inactive=30s --timer-property=AccuracySec=1s echo 'hello'
Running timer as unit: myunit.timer
Will run service as unit: myunit.service

$ systemctl list-units --user | grep my
myunit.timer                         loaded active     elapsed   /usr/bin/echo hello

Even when I run an identical command, it works:

$ systemd-run --user --unit 2df5e01ded4a2063532c203d30fa1d1877f3bf7944de467a084ac23580197fe5 --on-unit-inactive=30s --timer-property=AccuracySec=1s /usr/bin/podman healthcheck run 2df5e01ded4a2063532c203d30fa1d1877f3bf7944de467a084ac23580197fe5
Running timer as unit: 2df5e01ded4a2063532c203d30fa1d1877f3bf7944de467a084ac23580197fe5.timer
Will run service as unit: 2df5e01ded4a2063532c203d30fa1d1877f3bf7944de467a084ac23580197fe5.service

$ systemctl list-units --user | grep 2df5e01ded4a2063532c203d30fa1d1877f3bf7944de467a084ac23580197fe5.timer
2df5e01ded4a2063532c203d30fa1d1877f3bf7944de467a084ac23580197fe5.timer loaded active     elapsed   /usr/bin/podman healthcheck run 2df5e01ded4a2063532c203d30fa1d1877f3bf7944de467a084ac23580197fe5

But when podman runs it, the unit definitely isn't created, or else `podman healthcheck` is removing it far too fast for me to see it after it fires once.

Comment 8 Robb Manes 2020-05-20 16:50:22 UTC
The healthcheck isn't removing the systemd unit; it doesn't appear to be created at all.  I've been trying to debug this a bit more with dlv, and it seems to fail at different points during os/exec.Cmd.Start when we launch systemd-run, but I can't get it to fail reliably in the same place every time.  I'm still not sure why our systemd-run invocation is failing.
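
One thing that would help is keeping systemd-run's combined output instead of dropping it, since its error text goes to stderr. A quick standalone sketch, not the shipped libpod code; the unit name here is illustrative:

package main

import (
	"fmt"
	"os/exec"
	"strings"
)

func main() {
	// Roughly the same arguments podman builds in createTimer().
	args := []string{"--user", "--unit", "probe-healthcheck",
		"--on-unit-inactive=30s", "--timer-property=AccuracySec=1s",
		"/bin/true"}
	// CombinedOutput captures stdout and stderr, so on a failing box this
	// prints systemd-run's real complaint instead of a bare "exit status 1".
	out, err := exec.Command("systemd-run", args...).CombinedOutput()
	if err != nil {
		fmt.Printf("systemd-run failed: %v: %s\n", err, strings.TrimSpace(string(out)))
		return
	}
	fmt.Printf("ok: %s\n", strings.TrimSpace(string(out)))
}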

Comment 9 Robb Manes 2020-05-20 16:59:29 UTC
(dlv) b libpod.Container.createTimer
Breakpoint 1 set at 0x560333295faa for github.com/containers/libpod/libpod.(*Container).createTimer() /usr/src/debug/podman-1.6.4-11.module+el8.2.0+6368+cf16aa14.x86_64/_build/src/github.com/containers/libpod/libpod/healthcheck_linux.go:52

(dlv) c
> github.com/containers/libpod/libpod.(*Container).createTimer() /usr/src/debug/podman-1.6.4-11.module+el8.2.0+6368+cf16aa14.x86_64/_build/src/github.com/containers/libpod/libpod/healthcheck_linux.go:52 (hits goroutine(1):1 total:1) (PC: 0x560333295faa)
Warning: debugging optimized function
    47:		}
    48:		return dbus.NewSystemdConnection()
    49:	}
    50:	
    51:	// createTimer systemd timers for healthchecks of a container
=>  52:	func (c *Container) createTimer() error {
    53:		if c.disableHealthCheckSystemd() {
    54:			return nil
    55:		}
    56:		podman, err := os.Executable()
    57:		if err != nil {

(dlv) b 75
Breakpoint 3 set at 0x5603332963f0 for github.com/containers/libpod/libpod.(*Container).createTimer() /usr/src/debug/podman-1.6.4-11.module+el8.2.0+6368+cf16aa14.x86_64/_build/src/github.com/containers/libpod/libpod/healthcheck_linux.go:75

(dlv) c
> github.com/containers/libpod/libpod.(*Container).createTimer() /usr/src/debug/podman-1.6.4-11.module+el8.2.0+6368+cf16aa14.x86_64/_build/src/github.com/containers/libpod/libpod/healthcheck_linux.go:75 (hits goroutine(1):1 total:1) (PC: 0x5603332963f0)
Warning: debugging optimized function
    70:		}
    71:		conn.Close()
    72:		logrus.Debugf("creating systemd-transient files: %s %s", "systemd-run", cmd)
    73:		systemdRun := exec.Command("systemd-run", cmd...)
    74:		_, err = systemdRun.CombinedOutput()
=>  75:		if err != nil {
    76:			return err
    77:		}
    78:		return nil
    79:	}
    80:	

(dlv) p err.ProcessState.status
256

So I am not sure why we get this returned to us; if I print the systemdRun values, they all look exactly like what I ran at the command line, unless I'm missing something.
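
For what it's worth, that 256 is the raw wait(2) status word, not an exit code; decoded, it just says the child exited with status 1:

package main

import (
	"fmt"
	"syscall"
)

func main() {
	// 256 is the value dlv printed from err.ProcessState.status.
	ws := syscall.WaitStatus(256)
	// Prints "true 1": the process exited normally with code 1.
	fmt.Println(ws.Exited(), ws.ExitStatus())
}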

Comment 10 Giuseppe Scrivano 2020-05-20 19:36:04 UTC
A newer version of systemd than the one available on RHEL 8.1 is required.

The reproducer for the issue is "unshare -r systemd-run --user echo it works", can you verify if it works fine for you?

systemd is probably missing this patch: https://github.com/systemd/systemd/commit/1ed4723d38cd0d1423c8fe650f90fa86007ddf55

I am moving the issue to systemd for further triaging.
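
If it helps, the one-liner can also be scripted as a preflight probe; a small sketch that just automates the reproducer above (nothing like this exists in podman itself):

package main

import (
	"fmt"
	"os/exec"
)

func main() {
	// Automates the reproducer: enter a rootless user namespace (unshare -r)
	// and ask the user manager to start a transient unit.
	out, err := exec.Command("unshare", "-r",
		"systemd-run", "--user", "echo", "it", "works").CombinedOutput()
	if err != nil {
		// On an affected systemd this prints "Failed to start transient
		// service unit: Access denied" (captured via stderr).
		fmt.Printf("affected: %v: %s", err, out)
		return
	}
	fmt.Printf("ok: %s", out)
}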

Comment 11 Robb Manes 2020-05-21 12:16:02 UTC
(In reply to Giuseppe Scrivano from comment #10)
> The reproducer for the issue is "unshare -r systemd-run --user echo it
> works", can you verify if it works fine for you?

It fails on RHEL 8.2:

$ rpm -q systemd
systemd-239-30.el8_2.x86_64

$ unshare -r systemd-run --user echo it works
Failed to start transient service unit: Access denied

And works upstream:

$ unshare -r systemd-run --user echo it works
Running as unit: run-rf9711da4a745461599342f6b1592f3aa.service

So you're probably right, thank you!

Comment 13 Lukáš Nykrýn 2020-06-08 14:01:37 UTC
Fix merged to the GitHub master branch -> https://github.com/systemd-rhel/rhel-8/pull/102

Comment 20 errata-xmlrpc 2020-11-04 02:02:13 UTC
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (Low: systemd security, bug fix, and enhancement update), and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2020:4553