Bug 1878780 - on the nodes there are zombies caused by podman
Summary: on the nodes there are zombies caused by podman
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Containers
Version: 4.6
Hardware: s390x
OS: Unspecified
Priority: unspecified
Severity: high
Target Milestone: ---
Target Release: ---
Assignee: Valentin Rothberg
QA Contact: pmali
URL:
Whiteboard:
Depends On:
Blocks: ocp-46-z-tracker
 
Reported: 2020-09-14 13:52 UTC by wvoesch
Modified: 2021-11-22 15:43 UTC
CC List: 19 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2021-10-20 13:21:32 UTC
Target Upstream Version:
Embargoed:
nagrawal: needinfo-


Attachments
Output of the dbginfo.sh for the clusters s8343xxx (executed on bastion s8343015) (5.82 MB, application/gzip)
2020-09-24 11:30 UTC, wvoesch
Output of the dbginfo.sh for the clusters m3559xxx (executed on bastion m3559001) (8.21 MB, application/gzip)
2020-09-24 11:32 UTC, wvoesch
Output of the dbginfo.sh for the clusters m3558xxx (executed on bastion m3558001) (5.48 MB, application/gzip)
2020-09-24 11:35 UTC, wvoesch
Output of the dbginfo.sh for the clusters t8359xxx (executed on bastion t8359029) (6.21 MB, application/gzip)
2020-09-24 11:38 UTC, wvoesch

Description wvoesch 2020-09-14 13:52:06 UTC
On several nodes, both control plane and compute nodes, I have observed zombie processes caused by podman (one per node). Please see the additional info.

I observed this on the following clusters with these versions:
Cluster a) Z13: Version: 4.6.0-0.nightly-s390x-2020-08-27-080214 RHCOS: 46.82.202008261939-0
Cluster b) Z13: Version: 4.6.0-0.nightly-s390x-2020-09-05-222506 RHCOS: 46.82.202009042339-0
Cluster c) Z14: Version: 4.6.0-0.nightly-s390x-2020-09-05-222506 RHCOS: 46.82.202009042339-0
(these are three different environments) 


Please let me know what information you need for further debugging and I shall provide it happily. 
Thank you. 


Additional info:
core     2564607       1  7 13:38 ?        00:00:00 /usr/lib/systemd/systemd --user
core     2564627 2564607  0 13:38 ?        00:00:00  \_ (sd-pam)
core     2564654 2564607  2 13:38 ?        00:00:00  \_ /usr/bin/podman varlink unix:/run/user/1000/podman/io.podman --timeout=60000
core     2564671 2564654  2 13:38 ?        00:00:00  |   \_ /usr/bin/podman varlink unix:/run/user/1000/podman/io.podman --timeout=60000
core     2564675 2564671  0 13:38 ?        00:00:00  |       \_ [podman] <defunct>
core     2564676 2564607  0 13:38 ?        00:00:00  \_ /usr/bin/podman

Comment 2 Tom Sweeney 2020-09-14 14:45:20 UTC
Wolfgang, thanks for the report.  I've assigned this to Valentin; I'm sure he'll have more questions for you.  I have a couple to start with, though.

Can you do `podman --version` and let us know what's running on the machines?  

Do you know if someone installed Podman separately or did it come with OCP?

It seems that someone or something set up the remote API for Podman so that it could act as a server.  Are you aware of anyone or anything doing the setup for that?

Comment 3 wvoesch 2020-09-14 15:04:43 UTC
Hi Tom, 

thank you.
Here are the answers to your questions:

1. podman's version is: 1.9.3 for all three reported RHCOS versions. 

2. It came with OCP directly. 

3. I'm not aware of anybody setting this up on purpose. Is that perhaps done by OCP automatically?

Comment 4 Tom Sweeney 2020-09-14 18:39:49 UTC
Hi Wolfgang,

Thanks for the info.  I'm not aware of OCP setting anything up automatically to run the Podman varlink server, but I don't have a deep understanding of how OCP is using Podman.  Hopefully Valentin will know when he checks in.

Comment 5 Dan Li 2020-09-16 20:24:51 UTC
Changing reported Version from 4.6.z to 4.6

Comment 6 Daniel Walsh 2020-09-17 09:36:49 UTC
This looks like someone or something is launching a podman-remote client against a podman varlink service running in the home directory. We would need someone with deeper knowledge of the OpenShift setup to figure out what is going on.

Could this be part of the installation?  

Is this the "core" user?

Trevor, do you have an idea what is going on?
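For what it's worth, here is a minimal sketch of checks that could be run as the core user on an affected node to see which systemd user units own the varlink socket and which client process is connected to it; the io.podman.socket and io.podman.service unit names are assumptions based on the journal output reported later in this bug:

$ # show the podman varlink units in the core user's systemd session (unit names assumed)
$ systemctl --user status io.podman.socket io.podman.service
$ # show the unit file to see how the varlink service is started
$ systemctl --user cat io.podman.service
$ # list processes holding the varlink unix socket, to identify the client side
$ ss -xp | grep io.podman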

Comment 7 W. Trevor King 2020-09-21 23:43:01 UTC
I dunno who would be launching podman-remote.  Installer does some stuff with 'podman' (no varlink) on the bootstrap machine during install, and the machine-config folks use 'podman' (I think) during RHCOS pivoting.  I'm not aware of any other podman users.  Are there steps on how to reproduce this from scratch?

Comment 9 Daniel Walsh 2020-09-23 13:20:02 UTC
We have no information on what is causing this.  Because of the lack of information, this is not going to be fixed in 4.6.  These zombies are most likely caused via varlink, which is going away.  Not sure if they would be fixed by the move to APIv2.

Comment 13 Dennis Gilmore 2020-09-23 23:32:22 UTC
We need to get information on how to reproduce this. Wolfgang, can you please provide steps to reproduce this bug?

Comment 14 wvoesch 2020-09-24 09:19:09 UTC
Hi Daniel, hi Dennis, 

unfortunately, I don’t know what is causing this either. 

I found these zombie processes on all our clusters across various 4.6 nightlies, 4.5 nightlies, and 4.5.7. The clusters run on 3 different CECs (z13, z14, z15).  For more details on the clusters, please see the additional information below.

For reproduction:
1.	Install a cluster
2.	ssh to the nodes (worker nodes seem to be more affected)
3.	check with "ps -ef --forest" and/or "ps aux". I found that sometimes the zombies can be seen with "ps aux" but not with "ps -ef --forest" (see the sketch after this list).
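A minimal sketch of the check in step 3, filtering for zombie (defunct) processes only; the awk field positions assume the default column order of "ps aux" and "ps -eo":

$ # zombies have a process state beginning with Z (column 8 of "ps aux")
$ ps aux | awk '$8 ~ /^Z/'
$ # alternative: show pid, parent pid, state and command for zombies only
$ ps -eo pid,ppid,stat,comm | awk '$3 ~ /^Z/'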

I observed that on some of the nodes the zombie processes disappear and reappear after some time. Please see the journalctl output. 

For additional information, please don’t hesitate to contact me. 

----------------------------------------------------------------------------------------------------
cluster: m3558001, z14, fcp, direct attached OSA, 3 masters, 3 workers, zVM 1, age: 9d
Version: 4.6.0-0.nightly-s390x-2020-09-10-082639
Kernel Version:                         4.18.0-193.19.1.el8_2.s390x
OS Image:                               Red Hat Enterprise Linux CoreOS 46.82.202009091339-0 (Ootpa)
Operating System:                       linux
Architecture:                           s390x
Container Runtime Version:              cri-o://1.19.0-11.rhaos4.6.gitf83564f.el8-rc.1
Kubelet Version:                        v1.19.0-rc.2+40d85fc
Kube-Proxy Version:                     v1.19.0-rc.2+40d85fc

+ zombies found on all 3 masters
+ zombies found on all 3 workers

----------------------------------------------------------------------------------------------------
cluster m3558008, z14, fcp, direct attached OSA, 3 masters, 18 workers, zVM 1, age: 61d
Version: 4.5.0-0.nightly-s390x-2020-07-23-201250
Kernel Version:                         4.18.0-193.12.1.el8_2.s390x
OS Image:                               Red Hat Enterprise Linux CoreOS 45.82.202007101157-0 (Ootpa)
Operating System:                       linux
Architecture:                           s390x
Container Runtime Version:              cri-o://1.18.2-18.rhaos4.5.git754d46b.el8
Kubelet Version:                        v1.18.3+8b0a82f
Kube-Proxy Version:                     v1.18.3+8b0a82f

+ zombies found on all workers

----------------------------------------------------------------------------------------------------
cluster s8343015, z13, fcp & dasd, direct attached OSA, 3 masters, 2 workers, zVM 2, age: 19h
Version: 4.5.7
Kernel Version:                          4.18.0-193.14.3.el8_2.s390x
OS Image:                                Red Hat Enterprise Linux CoreOS 45.82.202008150257-0 (Ootpa)
Operating System:                        linux
Architecture:                            s390x
Container Runtime Version:               cri-o://1.18.3-10.rhaos4.5.gitd6f1f19.el8
Kubelet Version:                         v1.18.3+2cf11e2
Kube-Proxy Version:                      v1.18.3+2cf11e2

+ zombies found on both workers

----------------------------------------------------------------------------------------------------
cluster t8359029, z15, minidisk, vswitch, 3 masters, 3 workers, zVM 3, age: 37h
Version: 4.6.0-0.nightly-s390x-2020-09-21-093300
Kernel Version:                         4.18.0-193.23.1.el8_2.s390x
OS Image:                               Red Hat Enterprise Linux CoreOS 46.82.202009182339-0 (Ootpa)
Operating System:                       linux
Architecture:                           s390x
Container Runtime Version:              cri-o://1.19.0-18.rhaos4.6.gitd802e19.el8
Kubelet Version:                        v1.19.0+7f9e863
Kube-Proxy Version:                     v1.19.0+7f9e863

+ zombies found on all 3 workers

----------------------------------------------------------------------------------------------------
journalctl output from a worker of the cluster s8343015, OCP version 4.5.7

journalctl |grep -E "podman|varlink"
Sep 23 13:14:23 worker-002.ocp-s8343015.lnxne.boe machine-config-daemon[1692]: I0923 13:14:23.106855    1692 run.go:16] Running: podman pull -q --authfile /var/lib/kubelet/config.json quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:4508108ee1a99fa12989f05f31f96662f61d6eb8c3c69bd6f855e4c1d012ac13
Sep 23 13:14:23 worker-002.ocp-s8343015.lnxne.boe machine-config-daemon[1662]: I0923 13:14:23.106855    1692 run.go:16] Running: podman pull -q --authfile /var/lib/kubelet/config.json quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:4508108ee1a99fa12989f05f31f96662f61d6eb8c3c69bd6f855e4c1d012ac13
Sep 23 13:14:26 worker-002.ocp-s8343015.lnxne.boe podman[1704]: 2020-09-23 13:14:26.367936725 +0000 UTC m=+2.323657775 system refresh
Sep 23 13:15:07 worker-002.ocp-s8343015.lnxne.boe podman[1704]: 2020-09-23 13:15:07.964738641 +0000 UTC m=+43.920459919 image pull
Sep 23 13:15:08 worker-002.ocp-s8343015.lnxne.boe machine-config-daemon[1692]: I0923 13:15:08.104386    1692 rpm-ostree.go:368] Running captured: podman inspect --type=image quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:4508108ee1a99fa12989f05f31f96662f61d6eb8c3c69bd6f855e4c1d012ac13
Sep 23 13:15:08 worker-002.ocp-s8343015.lnxne.boe machine-config-daemon[1662]: I0923 13:15:08.104386    1692 rpm-ostree.go:368] Running captured: podman inspect --type=image quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:4508108ee1a99fa12989f05f31f96662f61d6eb8c3c69bd6f855e4c1d012ac13
Sep 23 13:15:08 worker-002.ocp-s8343015.lnxne.boe machine-config-daemon[1692]: I0923 13:15:08.653302    1692 rpm-ostree.go:368] Running captured: podman create --net=none --annotation=org.openshift.machineconfigoperator.pivot=true --name ostree-container-pivot-2576ba4f-18b3-49d6-b450-5236fe9dfd86 quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:4508108ee1a99fa12989f05f31f96662f61d6eb8c3c69bd6f855e4c1d012ac13
Sep 23 13:15:08 worker-002.ocp-s8343015.lnxne.boe machine-config-daemon[1662]: I0923 13:15:08.653302    1692 rpm-ostree.go:368] Running captured: podman create --net=none --annotation=org.openshift.machineconfigoperator.pivot=true --name ostree-container-pivot-2576ba4f-18b3-49d6-b450-5236fe9dfd86 quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:4508108ee1a99fa12989f05f31f96662f61d6eb8c3c69bd6f855e4c1d012ac13
Sep 23 13:15:09 worker-002.ocp-s8343015.lnxne.boe podman[1759]: 2020-09-23 13:15:09.035394436 +0000 UTC m=+0.324306191 container create fbe69e6be6bb4883be9be094d8c7a344b998bfe66d76cd726bde1e3d31f7f68f (image=quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:4508108ee1a99fa12989f05f31f96662f61d6eb8c3c69bd6f855e4c1d012ac13, name=ostree-container-pivot-2576ba4f-18b3-49d6-b450-5236fe9dfd86)
Sep 23 13:15:09 worker-002.ocp-s8343015.lnxne.boe machine-config-daemon[1692]: I0923 13:15:09.106078    1692 rpm-ostree.go:368] Running captured: podman mount fbe69e6be6bb4883be9be094d8c7a344b998bfe66d76cd726bde1e3d31f7f68f
Sep 23 13:15:09 worker-002.ocp-s8343015.lnxne.boe machine-config-daemon[1662]: I0923 13:15:09.106078    1692 rpm-ostree.go:368] Running captured: podman mount fbe69e6be6bb4883be9be094d8c7a344b998bfe66d76cd726bde1e3d31f7f68f
Sep 23 13:15:09 worker-002.ocp-s8343015.lnxne.boe podman[1772]: 2020-09-23 13:15:09.439799124 +0000 UTC m=+0.298370820 container mount fbe69e6be6bb4883be9be094d8c7a344b998bfe66d76cd726bde1e3d31f7f68f (image=quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:4508108ee1a99fa12989f05f31f96662f61d6eb8c3c69bd6f855e4c1d012ac13, name=ostree-container-pivot-2576ba4f-18b3-49d6-b450-5236fe9dfd86)
Sep 23 13:15:43 worker-002.ocp-s8343015.lnxne.boe podman[1876]: 2020-09-23 13:15:43.834689764 +0000 UTC m=+0.749070805 container remove fbe69e6be6bb4883be9be094d8c7a344b998bfe66d76cd726bde1e3d31f7f68f (image=quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:4508108ee1a99fa12989f05f31f96662f61d6eb8c3c69bd6f855e4c1d012ac13, name=ostree-container-pivot-2576ba4f-18b3-49d6-b450-5236fe9dfd86)
Sep 24 07:10:12 worker-002.ocp-s8343015.lnxne.boe systemd[2453105]: io.podman.service: Main process exited, code=exited, status=143/n/a
Sep 24 07:10:12 worker-002.ocp-s8343015.lnxne.boe systemd[2453105]: io.podman.service: Failed with result 'exit-code'.
Sep 24 07:10:12 worker-002.ocp-s8343015.lnxne.boe systemd[1]: user: Killing process 2453173 (podman pause) with signal SIGKILL.
Sep 24 08:10:58 worker-002.ocp-s8343015.lnxne.boe systemd[2602537]: io.podman.service: Main process exited, code=exited, status=143/n/a
Sep 24 08:10:58 worker-002.ocp-s8343015.lnxne.boe systemd[2602537]: io.podman.service: Failed with result 'exit-code'.
Sep 24 08:10:58 worker-002.ocp-s8343015.lnxne.boe systemd[1]: user: Killing process 2602568 (podman pause) with signal SIGKILL.
Sep 24 08:10:59 worker-002.ocp-s8343015.lnxne.boe systemd[2602605]: io.podman.service: Main process exited, code=exited, status=143/n/a
Sep 24 08:10:59 worker-002.ocp-s8343015.lnxne.boe systemd[2602605]: io.podman.service: Failed with result 'exit-code'.
Sep 24 08:10:59 worker-002.ocp-s8343015.lnxne.boe systemd[1]: user: Killing process 2602670 (podman pause) with signal SIGKILL.
Sep 24 08:10:59 worker-002.ocp-s8343015.lnxne.boe podman[2602769]: Error: could not get runtime: cannot re-exec process
Sep 24 08:10:59 worker-002.ocp-s8343015.lnxne.boe systemd[2602731]: io.podman.service: Main process exited, code=exited, status=125/n/a
Sep 24 08:10:59 worker-002.ocp-s8343015.lnxne.boe systemd[2602731]: io.podman.service: Failed with result 'exit-code'.
Sep 24 08:11:08 worker-002.ocp-s8343015.lnxne.boe systemd[2603134]: io.podman.service: Main process exited, code=exited, status=143/n/a
Sep 24 08:11:08 worker-002.ocp-s8343015.lnxne.boe systemd[2603134]: io.podman.service: Failed with result 'exit-code'.
Sep 24 08:11:08 worker-002.ocp-s8343015.lnxne.boe systemd[1]: user: Killing process 2603169 (podman pause) with signal SIGKILL.
Sep 24 08:11:09 worker-002.ocp-s8343015.lnxne.boe systemd[2603197]: io.podman.service: Main process exited, code=exited, status=143/n/a
Sep 24 08:11:09 worker-002.ocp-s8343015.lnxne.boe systemd[2603197]: io.podman.service: Failed with result 'exit-code'.
Sep 24 08:11:09 worker-002.ocp-s8343015.lnxne.boe systemd[1]: user: Killing process 2603254 (podman pause) with signal SIGKILL.
Sep 24 08:11:10 worker-002.ocp-s8343015.lnxne.boe systemd[2603322]: io.podman.service: Main process exited, code=exited, status=143/n/a
Sep 24 08:11:10 worker-002.ocp-s8343015.lnxne.boe systemd[2603322]: io.podman.service: Failed with result 'exit-code'.
Sep 24 08:11:10 worker-002.ocp-s8343015.lnxne.boe systemd[1]: user: Killing process 2603374 (podman pause) with signal SIGKILL.
Sep 24 08:25:23 worker-002.ocp-s8343015.lnxne.boe systemd[2637552]: io.podman.service: Main process exited, code=exited, status=143/n/a
Sep 24 08:25:23 worker-002.ocp-s8343015.lnxne.boe systemd[2637552]: io.podman.service: Failed with result 'exit-code'.
Sep 24 08:25:23 worker-002.ocp-s8343015.lnxne.boe systemd[1]: user: Killing process 2637598 (podman pause) with signal SIGKILL.
Sep 24 08:25:23 worker-002.ocp-s8343015.lnxne.boe systemd[2637639]: io.podman.service: Main process exited, code=exited, status=143/n/a
Sep 24 08:25:23 worker-002.ocp-s8343015.lnxne.boe systemd[2637639]: io.podman.service: Failed with result 'exit-code'.
Sep 24 08:25:23 worker-002.ocp-s8343015.lnxne.boe systemd[1]: user: Killing process 2637705 (podman pause) with signal SIGKILL.
Sep 24 08:25:29 worker-002.ocp-s8343015.lnxne.boe systemd[2637896]: io.podman.service: Main process exited, code=exited, status=143/n/a
Sep 24 08:25:29 worker-002.ocp-s8343015.lnxne.boe systemd[2637896]: io.podman.service: Failed with result 'exit-code'.
Sep 24 08:25:29 worker-002.ocp-s8343015.lnxne.boe systemd[1]: user: Killing process 2637921 (podman pause) with signal SIGKILL.
Sep 24 08:25:29 worker-002.ocp-s8343015.lnxne.boe systemd[2638016]: io.podman.service: Main process exited, code=exited, status=143/n/a
Sep 24 08:25:29 worker-002.ocp-s8343015.lnxne.boe systemd[2638016]: io.podman.service: Failed with result 'exit-code'.
Sep 24 08:25:29 worker-002.ocp-s8343015.lnxne.boe systemd[1]: user: Killing process 2638071 (podman pause) with signal SIGKILL.
Sep 24 08:25:35 worker-002.ocp-s8343015.lnxne.boe systemd[2638332]: io.podman.service: Main process exited, code=exited, status=143/n/a
Sep 24 08:25:35 worker-002.ocp-s8343015.lnxne.boe systemd[2638332]: io.podman.service: Failed with result 'exit-code'.
Sep 24 08:25:35 worker-002.ocp-s8343015.lnxne.boe systemd[1]: user: Killing process 2638359 (podman pause) with signal SIGKILL.
Sep 24 08:25:35 worker-002.ocp-s8343015.lnxne.boe systemd[2638392]: io.podman.service: Main process exited, code=exited, status=143/n/a
Sep 24 08:25:35 worker-002.ocp-s8343015.lnxne.boe systemd[2638392]: io.podman.service: Failed with result 'exit-code'.
Sep 24 08:25:35 worker-002.ocp-s8343015.lnxne.boe systemd[1]: user: Killing process 2638418 (podman pause) with signal SIGKILL.

Comment 15 Hendrik Brueckner 2020-09-24 10:30:22 UTC
Wolfgang, for the sake of completeness, could you add a dbginfo? Thanks.

Comment 16 wvoesch 2020-09-24 11:30:34 UTC
Created attachment 1716312 [details]
Output of the dbginfo.sh for the clusters s8343xxx (executed on bastion s8343015)

Comment 17 wvoesch 2020-09-24 11:32:49 UTC
Created attachment 1716313 [details]
Output of the dbginfo.sh for the clusters m3559xxx (executed on bastion m3559001)

Comment 18 wvoesch 2020-09-24 11:35:20 UTC
Created attachment 1716314 [details]
Output of the dbginfo.sh for the clusters m3558xxx (executed on bastion m3558001)

Comment 19 wvoesch 2020-09-24 11:38:16 UTC
Created attachment 1716315 [details]
Output of the dbginfo.sh for the clusters t8359xxx (executed on bastion t8359029)

Comment 20 wvoesch 2020-09-24 11:42:47 UTC
Hi Hendrik,

please find the output of dbginfo.sh attached to the four previous comments. 

For referencing the dbginfo.sh output, these are the actual names of the three clusters from the initial report: 
Cluster a) s8343022 
Cluster b) s8343008 
Cluster c) m3559001

Comment 21 Holger Wolf 2020-09-24 16:20:47 UTC
The key release-blocking bug is the high CPU usage: https://bugzilla.redhat.com/show_bug.cgi?id=1878770
There seems to be a connection between etcd and the crashing API nodes, which I suspect is related to the zombies in this bug.

Comment 22 Daniel Walsh 2020-09-25 18:55:39 UTC
We should disable the varlink socket to stop whoever is communicating with podman from doing it.  That would at least get us more information on what is going on.
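A minimal sketch of that suggestion, run as the core user on one affected node; the unit names are the ones seen in the journal output in this bug, and masking (rather than removing anything) is an assumption so the change stays easy to revert:

$ # stop and mask the varlink socket and service in the core user's session
$ systemctl --user stop io.podman.socket io.podman.service
$ systemctl --user mask io.podman.socket io.podman.service
$ # then follow the user journal to see what, if anything, still tries to start them
$ journalctl --user -f | grep -i podman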

Comment 26 alisha 2020-10-05 12:21:56 UTC
I installed OCP 4.6 on ppc64le (little endian).
Build used: 4.6.0-0.nightly-ppc64le-2020-10-02-231830

Output of "ps -ef --forest" on OCP cluster deployed on ppc64le did not show any zombie processes.

Comment 29 wvoesch 2020-11-19 13:46:26 UTC
Hi all, 

I found this issue again on s390x in two clusters with the versions 4.6.1 and 4.7.0-0.nightly-s390x-2020-11-05-004605.

Below is the journal output of the 4.7 nightly cluster. 

$ journalctl |grep -E "podman|varlink"
Nov 19 13:32:17 worker-02.ocp-m3558030.lnxne.boe systemd[2898561]: io.podman.service: Main process exited, code=exited, status=143/n/a
Nov 19 13:32:17 worker-02.ocp-m3558030.lnxne.boe systemd[2898561]: io.podman.service: Failed with result 'exit-code'.
Nov 19 13:32:17 worker-02.ocp-m3558030.lnxne.boe systemd[1]: user: Killing process 2898604 (podman pause) with signal SIGKILL.
Nov 19 13:32:18 worker-02.ocp-m3558030.lnxne.boe systemd[2898648]: io.podman.service: Main process exited, code=exited, status=143/n/a
Nov 19 13:32:18 worker-02.ocp-m3558030.lnxne.boe systemd[2898648]: io.podman.service: Failed with result 'exit-code'.
Nov 19 13:32:18 worker-02.ocp-m3558030.lnxne.boe systemd[1]: user: Killing process 2898689 (podman pause) with signal SIGKILL.
Nov 19 13:32:18 worker-02.ocp-m3558030.lnxne.boe systemd[2898724]: io.podman.service: Main process exited, code=exited, status=143/n/a
Nov 19 13:32:18 worker-02.ocp-m3558030.lnxne.boe systemd[2898724]: io.podman.service: Failed with result 'exit-code'.
Nov 19 13:32:18 worker-02.ocp-m3558030.lnxne.boe systemd[1]: user: Killing process 2898749 (podman pause) with signal SIGKILL.
Nov 19 13:32:23 worker-02.ocp-m3558030.lnxne.boe systemd[2898821]: io.podman.service: Main process exited, code=exited, status=143/n/a
Nov 19 13:32:23 worker-02.ocp-m3558030.lnxne.boe systemd[2898821]: io.podman.service: Failed with result 'exit-code'.
Nov 19 13:32:24 worker-02.ocp-m3558030.lnxne.boe systemd[1]: user: Killing process 2898848 (podman pause) with signal SIGKILL.

Comment 30 Dan Li 2020-11-30 17:50:21 UTC
Hi Containers team, I'm checking in here. Do you have the information needed for this bug? Or do you need an environment to reproduce this error?

Comment 31 Daniel Walsh 2020-11-30 19:47:26 UTC
No, this bug is caused by a container on the system launching podman containers via varlink.  Remove the varlink service and the problem should probably go away.

Comment 35 wvoesch 2020-12-08 10:55:40 UTC
Hi all, 

the problem still occurs in ocp version 4.7.0-0.nightly-s390x-2020-12-03-121304.


$ ps -ef --forest
....
core     1460776       1  1 10:14 ?        00:00:29 /usr/lib/systemd/systemd --user
core     1460780 1460776  0 10:14 ?        00:00:00  \_ (sd-pam)
core     1460809 1460776  0 10:14 ?        00:00:00  \_ /usr/bin/podman varlink unix:/run/user/1000/podman/io.podman --timeout=60000
core     1460827 1460809  0 10:14 ?        00:00:00  |   \_ /usr/bin/podman varlink unix:/run/user/1000/podman/io.podman --timeout=60000
core     1460830 1460827  0 10:14 ?        00:00:00  |       \_ [podman] <defunct>
core     1460831 1460776  0 10:14 ?        00:00:00  \_ /usr/bin/podman


Please note that the output of the journal is different from the output before. 

$ journalctl |grep -E "podman|varlink"
Dec 08 08:20:20 worker-002.m3558ocp.lnxne.boe podman[1301810]: Command "varlink" is deprecated, Please see 'podman system service' for RESTful APIs
Dec 08 08:20:20 worker-002.m3558ocp.lnxne.boe podman[1301810]: Command "varlink" is deprecated, Please see 'podman system service' for RESTful APIs
Dec 08 08:25:48 worker-002.m3558ocp.lnxne.boe systemd[1301799]: io.podman.service: Main process exited, code=exited, status=143/n/a
Dec 08 08:25:48 worker-002.m3558ocp.lnxne.boe systemd[1301799]: io.podman.service: Failed with result 'exit-code'.
Dec 08 08:25:48 worker-002.m3558ocp.lnxne.boe systemd[1301799]: podman.socket: Succeeded.
Dec 08 08:25:48 worker-002.m3558ocp.lnxne.boe systemd[1301799]: io.podman.socket: Succeeded.
Dec 08 08:25:48 worker-002.m3558ocp.lnxne.boe systemd[1]: user: Killing process 1301837 (podman pause) with signal SIGKILL.
Dec 08 10:14:30 worker-002.m3558ocp.lnxne.boe podman[1460809]: Command "varlink" is deprecated, Please see 'podman system service' for RESTful APIs
Dec 08 10:14:30 worker-002.m3558ocp.lnxne.boe podman[1460809]: Command "varlink" is deprecated, Please see 'podman system service' for RESTful APIs


@Daniel: When should we expect the varlink service to be removed?

Comment 36 Daniel Walsh 2020-12-08 13:26:45 UTC
It will be removed in Podman 3.0, which will arrive with the RHEL 8.4 release.  Not sure when that will arrive for OpenShift.

For now, why don't you just disable the varlink service and socket? Then the problem should go away.
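A minimal sketch of that workaround, run as the core user on each affected node; it mirrors the earlier masking step but disables both units persistently, then re-runs the zombie check from comment 14. The unit names are the ones seen in the journal output in this bug:

$ # stop and disable the varlink socket and service for the core user
$ systemctl --user disable --now io.podman.socket io.podman.service
$ # confirm they are no longer active
$ systemctl --user status io.podman.socket io.podman.service
$ # after some time, verify that no new podman zombies appear
$ ps aux | awk '$8 ~ /^Z/'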

Comment 47 Tom Sweeney 2021-10-20 13:21:32 UTC
As this is now fixed in OCP/RHCOS 4.7, 4.8, and 4.9, I'm closing this as fixed in the current release.  Thanks all for the updates.

