Bug 2119072 - podman gating test issues in RHEL8.7
Summary: podman gating test issues in RHEL8.7
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Linux 8
Classification: Red Hat
Component: conmon
Version: 8.7
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: medium
Target Milestone: rc
Target Release: ---
Assignee: Jindrich Novy
QA Contact: Joy Pu
URL:
Whiteboard:
Duplicates: 2123152 (view as bug list)
Depends On:
Blocks: 2002451 2154403 2154417 2154418
TreeView+ depends on / blocked
 
Reported: 2022-08-17 12:54 UTC by Jindrich Novy
Modified: 2023-01-02 08:28 UTC
CC List: 16 users

Fixed In Version: conmon-2.1.2-2.el8
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Clones: 2154403 2154417 2154418 (view as bug list)
Environment:
Last Closed: 2022-11-08 09:16:44 UTC
Type: Bug
Target Upstream Version:
Embargoed:


Attachments: none


Links
  GitHub: containers/conmon pull 352 (Merged) - "signal: handle SIGUSR1 with signalfd" - last updated 2022-08-26 13:57:09 UTC
  Red Hat Issue Tracker: RHELPLAN-131329 - last updated 2022-08-17 13:01:13 UTC
  Red Hat Product Errata: RHSA-2022:7457 - last updated 2022-11-08 09:17:17 UTC

Description Jindrich Novy 2022-08-17 12:54:06 UTC
Moving an email conversation about failing gating tests to a bug. RHEL8 gating finished but is failing: http://artifacts.osci.redhat.com/baseos-ci/redhat-module/16/32/16321/https___baseos-jenkins.rhev-ci-vms.eng.rdu2.redhat.com-ci-artemis-mbs-sti/1206/work-tests.ymlW7GfnB/tests-va2QkJ/test.podman-nonroot.bats.log

Conversation:

For the failed test cases, I also ran some tests with the podman 4.2 rc3 build last week and the podman 4.2 final build today. Here is some information about the two failed test cases "podman system service containers survive service stop" and "podman generate - systemd - basic".

For the first test, "podman system service containers survive service stop", it passed in my RHEL 8.7 environment with runc for both builds (rc3 and final). I cannot trigger the error seen in the gating system.
For "podman generate - systemd - basic", it failed randomly on RHEL 8.7 (6 out of 100 runs) and RHEL 9.1 (49 out of 100 runs). So I added some debug checks for journald output inside the scripts and ran it again. Below are the differing outputs from a passing run and a failing run. The beginning of the logs is the same, so I only paste the logs after the stop signal is sent out. The failing run contains an extra event starting with "container cleanup xxx", after which the container process is killed, leading to a failed status.
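(For reference, a rough sketch of the kind of journald check meant here; the unit name and the time window are placeholders, not the test's real variables:)

   # journalctl --no-pager --since "-5min" -u container-<CID>.service
   # journalctl --no-pager --since "-5min" _COMM=conmon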

Pass output:
# Aug 16 15:16:03 hp-bl460cg9-6.gsslab.pnq2.redhat.com systemd[1]: Stopping Podman container-97505420458f8025824694983e41bd4d4c64decae3de44946df95a5e3801448b.service...
# Aug 16 15:16:03 hp-bl460cg9-6.gsslab.pnq2.redhat.com 1Hpn5RJ1mW[126803]: Received SIGTERM, finishing
# Aug 16 15:16:03 hp-bl460cg9-6.gsslab.pnq2.redhat.com podman[126878]: 2022-08-16 15:16:03.184407347 +0530 IST m=+0.033040602 container died b3cc94625d0c7f4853cb0b814c67c10dddc851678ee01f5fa42f4c93f81e4396 (image=quay.io/libpod/testimage:20220615, name=1Hpn5RJ1mW, health_status=)
# Aug 16 15:16:03 hp-bl460cg9-6.gsslab.pnq2.redhat.com podman[126878]: 2022-08-16 15:16:03.551901433 +0530 IST m=+0.400534729 container remove b3cc94625d0c7f4853cb0b814c67c10dddc851678ee01f5fa42f4c93f81e4396 (image=quay.io/libpod/testimage:20220615, name=1Hpn5RJ1mW, health_status=, created_at=2022-06-15T11:44:04Z, created_by=test/system/build-testimage, io.buildah.version=1.27.0-dev, io.containers.autoupdate=registry, PODMAN_SYSTEMD_UNIT=podman_test_3TIXgrJE15.service)
# Aug 16 15:16:03 hp-bl460cg9-6.gsslab.pnq2.redhat.com podman[126860]: b3cc94625d0c7f4853cb0b814c67c10dddc851678ee01f5fa42f4c93f81e4396
# Aug 16 15:16:03 hp-bl460cg9-6.gsslab.pnq2.redhat.com systemd[1]: podman_test_3TIXgrJE15.service: Deactivated successfully.
# Aug 16 15:16:03 hp-bl460cg9-6.gsslab.pnq2.redhat.com systemd[1]: Stopped Podman container-97505420458f8025824694983e41bd4d4c64decae3de44946df95a


Failed output:
# Aug 16 15:09:04 hp-bl460cg9-6.gsslab.pnq2.redhat.com systemd[1]: Stopping Podman container-ea7d5723ee43b194c07f8ee620eea3a4e224521be4400acbf1bc403461ae1186.service...
# Aug 16 15:09:05 hp-bl460cg9-6.gsslab.pnq2.redhat.com 5FYDlVSETX[124960]: Received SIGTERM, finishing
# Aug 16 15:09:05 hp-bl460cg9-6.gsslab.pnq2.redhat.com podman[125035]: 2022-08-16 15:09:05.066332309 +0530 IST m=+0.033484528 container died c697a69f511d1d6c5e1e70dcd9bcbe8164c14a59f7a7bf1029ae0b1c84053724 (image=quay.io/libpod/testimage:20220615, name=5FYDlVSETX, health_status=)
# Aug 16 15:09:05 hp-bl460cg9-6.gsslab.pnq2.redhat.com podman[125017]: 2022-08-16 15:09:05.389656486 +0530 IST m=+0.437777883 container cleanup c697a69f511d1d6c5e1e70dcd9bcbe8164c14a59f7a7bf1029ae0b1c84053724 (image=quay.io/libpod/testimage:20220615, name=5FYDlVSETX, health_status=, created_by=test/system/build-testimage, io.buildah.version=1.27.0-dev, io.containers.autoupdate=registry, PODMAN_SYSTEMD_UNIT=podman_test_2xLuHNmJwO.service, created_at=2022-06-15T11:44:04Z)
# Aug 16 15:09:05 hp-bl460cg9-6.gsslab.pnq2.redhat.com podman[125017]: c697a69f511d1d6c5e1e70dcd9bcbe8164c14a59f7a7bf1029ae0b1c84053724
# Aug 16 15:09:05 hp-bl460cg9-6.gsslab.pnq2.redhat.com systemd[1]: podman_test_2xLuHNmJwO.service: Main process exited, code=killed, status=10/USR1
# Aug 16 15:09:05 hp-bl460cg9-6.gsslab.pnq2.redhat.com podman[125113]: 2022-08-16 15:09:05.586198141 +0530 IST m=+0.093228241 container remove c697a69f511d1d6c5e1e70dcd9bcbe8164c14a59f7a7bf1029ae0b1c84053724 (image=quay.io/libpod/testimage:20220615, name=5FYDlVSETX, health_status=, io.buildah.version=1.27.0-dev, io.containers.autoupdate=registry, PODMAN_SYSTEMD_UNIT=podman_test_2xLuHNmJwO.service, created_at=2022-06-15T11:44:04Z, created_by=test/system/build-testimage)
# Aug 16 15:09:05 hp-bl460cg9-6.gsslab.pnq2.redhat.com podman[125113]: c697a69f511d1d6c5e1e70dcd9bcbe8164c14a59f7a7bf1029ae0b1c84053724
# Aug 16 15:09:05 hp-bl460cg9-6.gsslab.pnq2.redhat.com systemd[1]: podman_test_2xLuHNmJwO.service: Failed with result 'signal'.
# Aug 16 15:09:05 hp-bl460cg9-6.gsslab.pnq2.redhat.com systemd[1]: Stopped Podman container-ea7d5723ee43b194c07f8ee620eea3a4e224521be4400acbf1bc403461ae1186.service.


I hope this helps to debug the problem.


Best regards
Yiqiao Pu

On Mon, Aug 15, 2022 at 9:58 PM Ed Santiago <santiago> wrote:

    On Mon, Aug 15, 2022 at 09:34:45AM -0400, Matt Heon wrote:
    > Failures [RHEL8]:
    >
    > 224 podman volume exec/noexec: Error message seems to have changed - maybe
    > a newer runc version. Need to patch the test, but nothing to worry about.

    Agree. This may be the intention behind #14793 [1] but I never fully
    understood that PR, and it was ultra-badly broken, so it has stalled.
    (As background, we lost runc testing some months ago, and haven't
    fully regained it because of Ubuntu.)

     [1] runc changes, https://github.com/containers/podman/pull/14793


    > 270 podman generate - systemd - basic: Unclear what went wrong, I see no
    > relevant error messages. Requires further investigation.

    1minutetip seems down, so I can't investigate. Reading the code
    comments suggests https://github.com/containers/podman/issues/11304

    > 282 podman system service containers survive service stop: This is a
    > significant regression. If this reproduces, we should get Giuseppe to take
    > a look.

    This test should not be running on RHEL8 (runc) at all! This is
    not something implemented in runc. The "skip" was added in #14972
    which was never ported into v4.2. (Once again, see "we lost runc"
    comment above, and weep).

    This is OK to waive.

    > 332 podman with nonexistent labels: Another runc error message changing
    > slightly, not concerning.

    Concur. But this is a test that _does_ run in CI, on Ubuntu, so I'm
    wondering if RHEL-runc is diverging significantly from Ubuntu-runc.
    That would be Very Bad, because it would mean we have no way of testing
    runc until the RHEL gating tests run.

    > The fact that the two significant failures are both in systemd-related
    > operations strikes me as somewhat suspicious, but we test on newer systemd
    > versions on other distros, so I somehow doubt that a systemd update broke
    > us.

    One significant failure, not two.

    Recommendation: anyone with the ability to get a RHEL 8.7 VM should
    drop everything, run the "podman generate - systemd - basic" test,
    and try to reproduce and then understand that failure. All else is ignorable.

Comment 1 Ed Santiago 2022-08-17 14:51:55 UTC
Trivial reproducer (as root on RHEL8.7 with MB 16321 installed):

   # podman create --pull=always --name=foo quay.io/libpod/testimage:20220615 top
   # podman generate systemd --new foo >/run/systemd/system/foo.service
   # podman rm foo
   # systemctl daemon-reload
   # systemctl start foo

So far, so good. Now:

   # systemctl stop foo
   # systemctl status foo
      ● foo.service - Podman container-cd68b1f76c9880d29ca5a0377848f72308d4ea1ee2965cc64924f77b864746a6.service
      Loaded: loaded (/run/systemd/system/foo.service; disabled; vendor preset: disabled)
      Active: failed (Result: signal) since Wed 2022-08-17 10:49:54 EDT; 32s ago
        Docs: man:podman-generate-systemd(1)
     Process: 13978 ExecStopPost=/usr/bin/podman rm -f --ignore --cidfile=/run/foo.service.ctr-id (code=exited, status=0/SUCCESS)
     Process: 13917 ExecStop=/usr/bin/podman stop --ignore --cidfile=/run/foo.service.ctr-id (code=exited, status=0/SUCCESS)
     Process: 13896 ExecStart=/usr/bin/podman run --cidfile=/run/foo.service.ctr-id --cgroups=no-conmon --rm --sdnotify=conmon -d --replace --pull=always --name=foo quay.io/libpod/testimage:20220615 top (code=killed, signal=USR1)
     Process: 13841 ExecStartPre=/bin/rm -f /run/foo.service.ctr-id (code=exited, status=0/SUCCESS)
    Main PID: 13896 (code=killed, signal=USR1)
   
   Aug 17 10:49:49 ci-vm-10-0-137-138.hosted.upshift.rdu2.redhat.com podman[13843]: Copying config sha256:f26aa69bb3f3de470d358b6c4e7f9de8ca165ef7f1689af1cdd446902dc27928
   Aug 17 10:49:49 ci-vm-10-0-137-138.hosted.upshift.rdu2.redhat.com podman[13843]: Writing manifest to image destination
   Aug 17 10:49:49 ci-vm-10-0-137-138.hosted.upshift.rdu2.redhat.com podman[13843]: Storing signatures
   Aug 17 10:49:49 ci-vm-10-0-137-138.hosted.upshift.rdu2.redhat.com systemd[1]: Started Podman container-cd68b1f76c9880d29ca5a0377848f72308d4ea1ee2965cc64924f77b864746a6.service.
   Aug 17 10:49:49 ci-vm-10-0-137-138.hosted.upshift.rdu2.redhat.com podman[13843]: 0804ed1173c14036e9813829e2a2a58ea836f4d964925ab018888eb6de36bd00
   Aug 17 10:49:54 ci-vm-10-0-137-138.hosted.upshift.rdu2.redhat.com systemd[1]: Stopping Podman container-cd68b1f76c9880d29ca5a0377848f72308d4ea1ee2965cc64924f77b864746a6.service...
   Aug 17 10:49:54 ci-vm-10-0-137-138.hosted.upshift.rdu2.redhat.com podman[13917]: 0804ed1173c14036e9813829e2a2a58ea836f4d964925ab018888eb6de36bd00
   Aug 17 10:49:54 ci-vm-10-0-137-138.hosted.upshift.rdu2.redhat.com systemd[1]: foo.service: Main process exited, code=killed, status=10/USR1
   Aug 17 10:49:54 ci-vm-10-0-137-138.hosted.upshift.rdu2.redhat.com systemd[1]: foo.service: Failed with result 'signal'.
   Aug 17 10:49:54 ci-vm-10-0-137-138.hosted.upshift.rdu2.redhat.com systemd[1]: Stopped Podman container-cd68b1f76c9880d29ca5a0377848f72308d4ea1ee2965cc64924f77b864746a6.service.


Reproduce at will via `systemctl start/stop foo`.
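
For anyone else trying to hit the race, a rough sketch of a repeat loop built on the reproducer above (iteration count and sleep are arbitrary; `systemctl is-failed` exits 0 only when the unit ended up failed):

   # for i in $(seq 1 20); do systemctl start foo; sleep 2; systemctl stop foo; systemctl is-failed --quiet foo && echo "failed on run $i"; done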

Comment 2 Tom Sweeney 2022-08-18 21:59:34 UTC
Valentin, please take a look when you return from PTO.

Comment 3 Valentin Rothberg 2022-08-22 09:38:24 UTC
I am unable to reproduce with build 2119676 [1] on RHEL-8.7.0-20220820.0.

What I get is the expected exit code 143 (128 + 15, i.e. the main process terminated by SIGTERM):
> foo.service: Main process exited, code=exited, status=143/n/a

The failing case instead shows the main process killed by SIGUSR1 (signal 10):
> foo.service: Main process exited, code=killed, status=10/USR1

I have not seen this issue before and do not know who is sending the USR1 signal.

[1] https://brewweb.engineering.redhat.com/brew/buildinfo?buildID=2119676

Comment 4 Valentin Rothberg 2022-08-22 14:55:18 UTC
After chatting with Ed, I noticed a difference between the non-working 1minutetip VMs and the one I got from Beaker.

Working: kernel 4.18.0-418.el8.x86_64 and systemd-239-65.el8
Broken: kernel 4.18.0-416.el8.x86_64 and systemd-239-62.el8

I ran killsnoop from the `bcc-tools` package to figure out who's sending the USR1 signal on the broken machine.  What I observed is that systemd is immediately sending SIGTERM on `systemctl stop` but I couldn't spot USR1.
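
For reference, a rough sketch of that check (the path is where bcc-tools normally installs the tool on RHEL 8; adjust if your layout differs). killsnoop prints one line per kill() call with the sender PID and command, the signal number, and the target PID, so signal 10 would point at whoever sends SIGUSR1:

   # /usr/share/bcc/tools/killsnoop     # leave running in one terminal
   # systemctl stop foo                 # in another terminal, trigger the stop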

My conclusion is that something in the stack (outside of Podman) had a temporary hiccup that has been fixed in the meantime.

Tom, I suggest closing this.

Comment 5 Ed Santiago 2022-08-22 17:01:25 UTC
1minutetip has RHEL-8.7.0-20220816.0 available but not 0820 (the one Valentin is using). On 0816 I cannot dnf-upgrade to the systemd or kernel versions that Valentin sees. Until I get those versions and confirm the bug is fixed, I recommend leaving open.

Comment 6 Tom Sweeney 2022-08-22 22:20:04 UTC
Thanks for digging into this quickly folks.  Let's keep this open until Ed gets a chance to dive a bit deeper.

Comment 7 Ed Santiago 2022-08-23 19:40:03 UTC
Sorry Valentin, there's something different about your beaker setup. 1mt has just made 0823 available, with systemd-239-65.el8.x86_64 and kernel 4.18.0-419.el8.x86_64 (both greater than or equal to the ones you report). The bug is present in this setup.

Comment 8 Valentin Rothberg 2022-08-24 07:56:28 UTC
(In reply to Ed Santiago from comment #7)
> Sorry Valentin, there's something different about your beaker setup. 1mt has
> just made 0823 available, with systemd-239-65.el8.x86_64 and kernel
> 4.18.0-419.el8.x86_64 (both greater than or equal to the ones you report).
> The bug is present in this setup.

That is very curious.  Ed, could you spin up a beaker machine and test there?  I will give 1mt a shot later today.

Comment 9 Valentin Rothberg 2022-08-24 11:13:18 UTC
Pulling in Michal from the system team.

@Michal, Ed gave me access to a "broken" system.  I made the following observations that make me believe it's a (temporary?) issue in systemd:

* `killsnoop` revealed that systemd sends TERM and KILL immediately on `systemctl stop` to the conmon process.
* Given `TimeoutStopSec=70`, systemd should not do that and does not on a "working" system.
* Setting `KillMode=none` resolves the issue but it should not be required.

Is there a temporary issue in systemd or am I looking in the wrong direction?
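
To double-check the KillMode observation on a broken machine, a minimal sketch of an override for the generated unit from comment 1 (the drop-in path matches the /run/systemd/system/foo.service unit used there; this is a diagnostic aid, not a proposed fix):

   # mkdir -p /run/systemd/system/foo.service.d
   # printf '[Service]\nKillMode=none\n' > /run/systemd/system/foo.service.d/override.conf
   # systemctl daemon-reload
   # systemctl restart foo && systemctl stop foo && systemctl is-failed foo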

Comment 11 Valentin Rothberg 2022-08-25 12:29:00 UTC
Ed and I ran another debugging session.  I couldn't reproduce at first, as I had only installed the podman{,-catatonit} packages.  Installing the entire module yielded the failure again.

It turns out to be `conmon`.  We bisected and found that `d91cc4321797eaa84dcfb7863e91632d2fe26861` [1] is the first commit introducing the issue.  Note that we ship conmon v2.1.0 in Fedora while we're running conmon v2.1.3 in RHEL.
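
For reference, a rough sketch of the bisect procedure (the good/bad refs and the install step are illustrative assumptions; a standard `make` build producing bin/conmon is assumed):

   # git clone https://github.com/containers/conmon && cd conmon
   # git bisect start
   # git bisect bad v2.1.3       # assumed bad: version running in RHEL
   # git bisect good v2.1.0      # assumed good: version shipped in Fedora
   # make && install -m755 bin/conmon "$(command -v conmon)"
   ... then re-run the systemctl start/stop reproducer from comment 1 and mark each step `git bisect good` or `git bisect bad`.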

Assigning the issue to Giuseppe.

[1] https://github.com/containers/conmon/commit/d91cc4321797eaa84dcfb7863e91632d2fe26861

Comment 12 Giuseppe Scrivano 2022-08-25 14:13:23 UTC
opened a PR:  https://bugzilla.redhat.com/show_bug.cgi?id=2119072

Comment 13 Ed Santiago 2022-08-25 14:16:53 UTC
Quick note that, for purposes of RHEL, it might be safest to revert conmon back to 2.1.2-2

Comment 14 Giuseppe Scrivano 2022-08-25 15:25:12 UTC
sorry, the PR is here: https://github.com/containers/conmon/pull/352

Comment 15 Valentin Rothberg 2022-08-26 07:41:41 UTC
(In reply to Ed Santiago from comment #13)
> Quick note that, for purposes of RHEL, it might be safest to revert conmon
> back to 2.1.2-2

Giuseppe, Tom, Jindrich, what do you think about Ed's suggestion?  I concur with Ed but if the conmon experts feel confident, I will as well.

Comment 16 Jindrich Novy 2022-08-26 14:04:34 UTC
Just reverted conmon to 2.1.2. Hope it's the last change for 8.7.

Comment 21 Valentin Rothberg 2022-09-01 07:08:07 UTC
*** Bug 2123152 has been marked as a duplicate of this bug. ***

Comment 22 Joy Pu 2022-09-05 10:08:08 UTC
Tested with conmon-2.1.2-2.module+el8.7.0+16493+89f82ab8.x86_64.rpm and ran the test case "podman generate - systemd - basic" 200 times. All runs finished successfully as expected, so moving this bug to VERIFIED.
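
A rough sketch of how such a repeated run can be scripted against the podman system tests (the test directory, file name, and bats --filter regex are assumptions; adjust to the local checkout or the podman-tests package layout):

   # cd /usr/share/podman/test/system
   # for i in $(seq 1 200); do bats --filter "generate - systemd - basic" 250-systemd.bats || echo "failed on run $i"; done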

Comment 24 errata-xmlrpc 2022-11-08 09:16:44 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: container-tools:rhel8 security, bug fix, and enhancement update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:7457

