Bug 2119072

Summary: podman gating test issues in RHEL8.7
Product: Red Hat Enterprise Linux 8 Reporter: Jindrich Novy <jnovy>
Component: conmonAssignee: Jindrich Novy <jnovy>
Status: CLOSED ERRATA QA Contact: Joy Pu <ypu>
Severity: medium Docs Contact:
Priority: unspecified    
Version: 8.7CC: bbaude, dwalsh, gscrivan, hshiina, jligon, jnovy, lfriedma, lsm5, mheon, msekleta, pthomas, santiago, tsweeney, umohnani, vrothber, ypu
Target Milestone: rcKeywords: Triaged, ZStream
Target Release: ---Flags: pm-rhel: mirror+
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: conmon-2.1.2-2.el8 Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
: 2154403 2154417 2154418 (view as bug list) Environment:
Last Closed: 2022-11-08 09:16:44 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On:    
Bug Blocks: 2002451, 2154403, 2154417, 2154418    

Description Jindrich Novy 2022-08-17 12:54:06 UTC
Moving email conversation to a bug - failing gating tests: RHEL8 gating finished but is failing: http://artifacts.osci.redhat.com/baseos-ci/redhat-module/16/32/16321/https___baseos-jenkins.rhev-ci-vms.eng.rdu2.redhat.com-ci-artemis-mbs-sti/1206/work-tests.ymlW7GfnB/tests-va2QkJ/test.podman-nonroot.bats.log

Conversation:

For the failed test cases I also do some tests with the podman 4.2 rc3 build last week and podman 4.2 final build today. There are some informations about two failed test cases "podman system service containers survive service stop" and "podman generate - systemd - basic"

For the first test "podman system service containers survive service stop" it passed in my RHEL 8.7 environment with runc for both build(rc3 and final). I can not trigger the error in the gating system.
And for "podman generate - systemd - basic" It failed in RHEL 8.7(6/100) and RHEL 9.1(49/100) randomly. So I added some debug checks for journald output inside the scripts and ran it again. Here is the different output from the pass and failed ones. The beginning of the logs are the same and will only paste the logs after the stop signal is sent out. The failed one will include something starting with "container cleanup xxx" and the container process is killed and lead to a failed status.

Pass output:
# Aug 16 15:16:03 hp-bl460cg9-6.gsslab.pnq2.redhat.com systemd[1]: Stopping Podman container-97505420458f8025824694983e41bd4d4c64decae3de44946df95a5e3801448b.service...
# Aug 16 15:16:03 hp-bl460cg9-6.gsslab.pnq2.redhat.com 1Hpn5RJ1mW[126803]: Received SIGTERM, finishing
# Aug 16 15:16:03 hp-bl460cg9-6.gsslab.pnq2.redhat.com podman[126878]: 2022-08-16 15:16:03.184407347 +0530 IST m=+0.033040602 container died b3cc94625d0c7f4853cb0b814c67c10dddc851678ee01f5fa42f4c93f81e4396 (image=quay.io/libpod/testimage:20220615, name=1Hpn5RJ1mW, health_status=)
# Aug 16 15:16:03 hp-bl460cg9-6.gsslab.pnq2.redhat.com podman[126878]: 2022-08-16 15:16:03.551901433 +0530 IST m=+0.400534729 container remove b3cc94625d0c7f4853cb0b814c67c10dddc851678ee01f5fa42f4c93f81e4396 (image=quay.io/libpod/testimage:20220615, name=1Hpn5RJ1mW, health_status=, created_at=2022-06-15T11:44:04Z, created_by=test/system/build-testimage, io.buildah.version=1.27.0-dev, io.containers.autoupdate=registry, PODMAN_SYSTEMD_UNIT=podman_test_3TIXgrJE15.service)
# Aug 16 15:16:03 hp-bl460cg9-6.gsslab.pnq2.redhat.com podman[126860]: b3cc94625d0c7f4853cb0b814c67c10dddc851678ee01f5fa42f4c93f81e4396
# Aug 16 15:16:03 hp-bl460cg9-6.gsslab.pnq2.redhat.com systemd[1]: podman_test_3TIXgrJE15.service: Deactivated successfully.
# Aug 16 15:16:03 hp-bl460cg9-6.gsslab.pnq2.redhat.com systemd[1]: Stopped Podman container-97505420458f8025824694983e41bd4d4c64decae3de44946df95a


Failed output:
# Aug 16 15:09:04 hp-bl460cg9-6.gsslab.pnq2.redhat.com systemd[1]: Stopping Podman container-ea7d5723ee43b194c07f8ee620eea3a4e224521be4400acbf1bc403461ae1186.service...
# Aug 16 15:09:05 hp-bl460cg9-6.gsslab.pnq2.redhat.com 5FYDlVSETX[124960]: Received SIGTERM, finishing
# Aug 16 15:09:05 hp-bl460cg9-6.gsslab.pnq2.redhat.com podman[125035]: 2022-08-16 15:09:05.066332309 +0530 IST m=+0.033484528 container died c697a69f511d1d6c5e1e70dcd9bcbe8164c14a59f7a7bf1029ae0b1c84053724 (image=quay.io/libpod/testimage:20220615, name=5FYDlVSETX, health_status=)
# Aug 16 15:09:05 hp-bl460cg9-6.gsslab.pnq2.redhat.com podman[125017]: 2022-08-16 15:09:05.389656486 +0530 IST m=+0.437777883 container cleanup c697a69f511d1d6c5e1e70dcd9bcbe8164c14a59f7a7bf1029ae0b1c84053724 (image=quay.io/libpod/testimage:20220615, name=5FYDlVSETX, health_status=, created_by=test/system/build-testimage, io.buildah.version=1.27.0-dev, io.containers.autoupdate=registry, PODMAN_SYSTEMD_UNIT=podman_test_2xLuHNmJwO.service, created_at=2022-06-15T11:44:04Z)
# Aug 16 15:09:05 hp-bl460cg9-6.gsslab.pnq2.redhat.com podman[125017]: c697a69f511d1d6c5e1e70dcd9bcbe8164c14a59f7a7bf1029ae0b1c84053724
# Aug 16 15:09:05 hp-bl460cg9-6.gsslab.pnq2.redhat.com systemd[1]: podman_test_2xLuHNmJwO.service: Main process exited, code=killed, status=10/USR1
# Aug 16 15:09:05 hp-bl460cg9-6.gsslab.pnq2.redhat.com podman[125113]: 2022-08-16 15:09:05.586198141 +0530 IST m=+0.093228241 container remove c697a69f511d1d6c5e1e70dcd9bcbe8164c14a59f7a7bf1029ae0b1c84053724 (image=quay.io/libpod/testimage:20220615, name=5FYDlVSETX, health_status=, io.buildah.version=1.27.0-dev, io.containers.autoupdate=registry, PODMAN_SYSTEMD_UNIT=podman_test_2xLuHNmJwO.service, created_at=2022-06-15T11:44:04Z, created_by=test/system/build-testimage)
# Aug 16 15:09:05 hp-bl460cg9-6.gsslab.pnq2.redhat.com podman[125113]: c697a69f511d1d6c5e1e70dcd9bcbe8164c14a59f7a7bf1029ae0b1c84053724
# Aug 16 15:09:05 hp-bl460cg9-6.gsslab.pnq2.redhat.com systemd[1]: podman_test_2xLuHNmJwO.service: Failed with result 'signal'.
# Aug 16 15:09:05 hp-bl460cg9-6.gsslab.pnq2.redhat.com systemd[1]: Stopped Podman container-ea7d5723ee43b194c07f8ee620eea3a4e224521be4400acbf1bc403461ae1186.service.


Hope this will help to debug the problem.


Best regards
Yiqiao Pu

On Mon, Aug 15, 2022 at 9:58 PM Ed Santiago <santiago> wrote:

    On Mon, Aug 15, 2022 at 09:34:45AM -0400, Matt Heon wrote:
    > Failures [RHEL8]:
    >
    > 224 podman volume exec/noexec: Error message seems to have changed - maybe
    > a newer runc version. Need to patch the test, but nothing to worry about.

    Agree. This may be the intention behind #14793[1] but I never fully
    understood that PR, and it was ultra-badly broken, so it has stalled.
    (As background, we lost runc testing some months ago, and haven't
    fully regained it because Ubuntu).

     [1] runc changes, https://github.com/containers/podman/pull/14793


    > 270 podman generate - systemd - basic: Unclear what went wrong, I see no
    > relevant error messages. Requires further investigation.

    1minutetip seems down, so I can't investigate. Reading the code
    comments suggests https://github.com/containers/podman/issues/11304

    > 282 podman system service containers survive service stop: This is a
    > significant regression. If this reproduces, we should get Giuseppe to take
    > a look.

    This test should not be running on RHEL8 (runc) at all! This is
    not something implemented in runc. The "skip" was added in #14972
    which was never ported into v4.2. (Once again, see "we lost runc"
    comment above, and weep).

    This is OK to waive.

    > 332 podman with nonexistent labels: Another runc error message changing
    > slightly, not concerning.

    Concur. But this is a test that _does_ run in CI, on Ubuntu, so I'm
    wondering if RHEL-runc is diverging significantly from Ubuntu-runc.
    That would be Very Bad because it means we have no way of testing
    runc until RHEL gating tests.

    > The fact that the two significant failures are both in systemd-related
    > operations strikes me as somewhat suspicious, but we test on newer systemd
    > versions on other distros, so I somehow doubt that a systemd update broke
    > us.

    One significant failure, not two.

    Recommendation: anyone with the ability to get a RHEL 8.7 VM somehow,
    drop everything and run the "podman generate - systemd - basic" test,
    try to reproduce then understand that failure. All else is ignorable.

Comment 1 Ed Santiago 2022-08-17 14:51:55 UTC
Trivial reproducer (as root on RHEL8.7 with MB 16321 installed):

   # podman create --pull=always --name=foo quay.io/libpod/testimage:20220615 top
   # podman generate systemd --new foo >/run/systemd/system/foo.service
   # podman rm foo
   # systemctl daemon-reload
   # systemctl start foo

So far, so good. Now:

   # systemctl stop foo
   # systemctl status foo
      ? foo.service - Podman container-cd68b1f76c9880d29ca5a0377848f72308d4ea1ee2965cc64924f77b864746a6.service
      Loaded: loaded (/run/systemd/system/foo.service; disabled; vendor preset: disabled)
      Active: failed (Result: signal) since Wed 2022-08-17 10:49:54 EDT; 32s ago
        Docs: man:podman-generate-systemd(1)
     Process: 13978 ExecStopPost=/usr/bin/podman rm -f --ignore --cidfile=/run/foo.service.ctr-id (code=exited, status=0/SUCCESS)
     Process: 13917 ExecStop=/usr/bin/podman stop --ignore --cidfile=/run/foo.service.ctr-id (code=exited, status=0/SUCCESS)
     Process: 13896 ExecStart=/usr/bin/podman run --cidfile=/run/foo.service.ctr-id --cgroups=no-conmon --rm --sdnotify=conmon -d --replace --pull=always --name=foo quay.io/libpod/testimage:20220615 top (code=killed, signal=USR1)
     Process: 13841 ExecStartPre=/bin/rm -f /run/foo.service.ctr-id (code=exited, status=0/SUCCESS)
    Main PID: 13896 (code=killed, signal=USR1)
   
   Aug 17 10:49:49 ci-vm-10-0-137-138.hosted.upshift.rdu2.redhat.com podman[13843]: Copying config sha256:f26aa69bb3f3de470d358b6c4e7f9de8ca165ef7f1689af1cdd446902dc27928
   Aug 17 10:49:49 ci-vm-10-0-137-138.hosted.upshift.rdu2.redhat.com podman[13843]: Writing manifest to image destination
   Aug 17 10:49:49 ci-vm-10-0-137-138.hosted.upshift.rdu2.redhat.com podman[13843]: Storing signatures
   Aug 17 10:49:49 ci-vm-10-0-137-138.hosted.upshift.rdu2.redhat.com systemd[1]: Started Podman container-cd68b1f76c9880d29ca5a0377848f72308d4ea1ee2965cc64924f77b864746a6.service.
   Aug 17 10:49:49 ci-vm-10-0-137-138.hosted.upshift.rdu2.redhat.com podman[13843]: 0804ed1173c14036e9813829e2a2a58ea836f4d964925ab018888eb6de36bd00
   Aug 17 10:49:54 ci-vm-10-0-137-138.hosted.upshift.rdu2.redhat.com systemd[1]: Stopping Podman container-cd68b1f76c9880d29ca5a0377848f72308d4ea1ee2965cc64924f77b864746a6.service...
   Aug 17 10:49:54 ci-vm-10-0-137-138.hosted.upshift.rdu2.redhat.com podman[13917]: 0804ed1173c14036e9813829e2a2a58ea836f4d964925ab018888eb6de36bd00
   Aug 17 10:49:54 ci-vm-10-0-137-138.hosted.upshift.rdu2.redhat.com systemd[1]: foo.service: Main process exited, code=killed, status=10/USR1
   Aug 17 10:49:54 ci-vm-10-0-137-138.hosted.upshift.rdu2.redhat.com systemd[1]: foo.service: Failed with result 'signal'.
   Aug 17 10:49:54 ci-vm-10-0-137-138.hosted.upshift.rdu2.redhat.com systemd[1]: Stopped Podman container-cd68b1f76c9880d29ca5a0377848f72308d4ea1ee2965cc64924f77b864746a6.service.


Reproduce at will via `podman start/stop foo`.

Comment 2 Tom Sweeney 2022-08-18 21:59:34 UTC
Valentin, please take a look when you return from PTO.

Comment 3 Valentin Rothberg 2022-08-22 09:38:24 UTC
I am unable to reproduce with build 2119676 [1] on RHEL-8.7.0-20220820.0.

What I get is the expected exit code 143.
> foo.service: Main process exited, code=exited, status=143/n/a


> foo.service: Main process exited, code=killed, status=10/USR1

I have never seen this issue so far and do not know who's sending the USR1 signal.

[1] https://brewweb.engineering.redhat.com/brew/buildinfo?buildID=2119676

Comment 4 Valentin Rothberg 2022-08-22 14:55:18 UTC
After chatting with Ed, I noticed a difference between the non-working 1minute VMs and the one I got from beaker.

Working: 4.18.0-418.el8.x86_64 and systemd-239-65.el8
Broken: 4.18.0-416.el8.x86_64 and systemd-239-62.el8

I ran killsnoop from the `bcc-tools` package to figure out who's sending the USR1 signal on the broken machine.  What I observed is that systemd is immediately sending SIGTERM on `systemctl stop` but I couldn't spot USR1.

My conclusion is that something in stack (outside of Podman) had a temporary hiccup that has been fixed in the meantime.

Tom, I suggest to close.

Comment 5 Ed Santiago 2022-08-22 17:01:25 UTC
1minutetip has RHEL-8.7.0-20220816.0 available but not 0820 (the one Valentin is using). On 0816 I cannot dnf-upgrade to the systemd or kernel versions that Valentin sees. Until I get those versions and confirm the bug is fixed, I recommend leaving open.

Comment 6 Tom Sweeney 2022-08-22 22:20:04 UTC
Thanks for digging into this quickly folks.  Let's keep this open until Ed gets a chance to dive a bit deeper.

Comment 7 Ed Santiago 2022-08-23 19:40:03 UTC
Sorry Valentin, there's something different about your beaker setup. 1mt has just made 0823 available, with systemd-239-65.el8.x86_64 and kernel 4.18.0-419.el8.x86_64 (both greater than or equal to the ones you report). The bug is present in this setup.

Comment 8 Valentin Rothberg 2022-08-24 07:56:28 UTC
(In reply to Ed Santiago from comment #7)
> Sorry Valentin, there's something different about your beaker setup. 1mt has
> just made 0823 available, with systemd-239-65.el8.x86_64 and kernel
> 4.18.0-419.el8.x86_64 (both greater than or equal to the ones you report).
> The bug is present in this setup.

That is very curious.  Ed, could you spin up a beaker machine and test there?  I will give 1mt a shot later today.

Comment 9 Valentin Rothberg 2022-08-24 11:13:18 UTC
Pulling in Michal from the system team.

@Michal, Ed gave me access to a "broken" system.  I made the following observations that make be believe it's (temporary?) issue in systemd:

* `killsnoop` revealed that systemd sends TERM and KILL immediately on `systemctl stop` to the conmon process.
* Given `TimeoutStopSec=70`, systemd should not do that and does not on a "working" system.
* Setting `KillMode=none` resolves the issue but it should not be required.

Is there a temporary issue in systemd or am I looking in the wrong direction?

Comment 11 Valentin Rothberg 2022-08-25 12:29:00 UTC
Ed and I ran another debugging session.  I couldn't reproduce as I only installed the podman{-catonit} package.  Installing the entire module yielded the failure again.

It turns it to be `conmon`.  We bisected that `d91cc4321797eaa84dcfb7863e91632d2fe26861` [1] is the first commit introducing the issue.  Note that we ship conmon v2.1.0 in Fedora while we're running conmon v2.1.3 in RHEL.

Assigning the issue to Giuseepe.

[1] https://github.com/containers/conmon/commit/d91cc4321797eaa84dcfb7863e91632d2fe26861

Comment 12 Giuseppe Scrivano 2022-08-25 14:13:23 UTC
opened a PR:  https://bugzilla.redhat.com/show_bug.cgi?id=2119072

Comment 13 Ed Santiago 2022-08-25 14:16:53 UTC
Quick note that, for purposes of RHEL, it might be safest to revert conmon back to 2.1.2-2

Comment 14 Giuseppe Scrivano 2022-08-25 15:25:12 UTC
sorry, the PR is here: https://github.com/containers/conmon/pull/352

Comment 15 Valentin Rothberg 2022-08-26 07:41:41 UTC
(In reply to Ed Santiago from comment #13)
> Quick note that, for purposes of RHEL, it might be safest to revert conmon
> back to 2.1.2-2

Giuseppe, Tom, Jindrich, what do you think about Ed's suggestion?  I concur with Ed but if the conmon experts feel confident, I will as well.

Comment 16 Jindrich Novy 2022-08-26 14:04:34 UTC
Just reverted conmon to 2.1.2. Hope it's the last change for 8.7.

Comment 21 Valentin Rothberg 2022-09-01 07:08:07 UTC
*** Bug 2123152 has been marked as a duplicate of this bug. ***

Comment 22 Joy Pu 2022-09-05 10:08:08 UTC
Test with conmon-2.1.2-2.module+el8.7.0+16493+89f82ab8.x86_64.rpm and run the test case "podman generate - systemd - basic" 200 times. It all finished successfully as expected. So move this bug to verified.

Comment 24 errata-xmlrpc 2022-11-08 09:16:44 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: container-tools:rhel8 security, bug fix, and enhancement update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:7457