Bug 1967128 - OSP16.2 podman failed to launch haproxy-bundle-podman after reboot
Summary: OSP16.2 podman failed to launch haproxy-bundle-podman after reboot
Keywords:
Status: CLOSED DUPLICATE of bug 1973035
Alias: None
Product: Red Hat OpenStack
Classification: Red Hat
Component: openstack-tripleo-common
Version: 16.2 (Train)
Hardware: x86_64
OS: Linux
Priority: medium
Severity: medium
Target Milestone: ---
Target Release: ---
Assignee: Damien Ciabrini
QA Contact: David Rosenfeld
URL:
Whiteboard:
Depends On: 1973035
Blocks:
 
Reported: 2021-06-02 13:14 UTC by David Rosenfeld
Modified: 2022-08-17 15:02 UTC
CC: 7 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2021-06-25 13:03:30 UTC
Target Upstream Version:
Embargoed:




Links:
Red Hat Issue Tracker OSP-4348 (Status: None, Last Updated: 2022-08-17 15:02:19 UTC)

Description David Rosenfeld 2021-06-02 13:14:26 UTC
Description of problem: An overcloud reboot is performed. It fails because haproxy-bundle-podman did not start on one of the three controllers after the reboot. From the logs:


2021-05-28 08:25:22.000 | Failed Resource Actions:
2021-05-28 08:25:22.002 |   * haproxy-bundle-podman-2_start_0 on controller-2 'error' (1): call=106, status='complete', exitreason='podman failed to launch container', last-rc-change='2021-05-28 08:19:02Z', queued=0ms, exec=6606ms
2021-05-28 08:25:22.203 | 
2021-05-28 08:25:22.206 | TASK [Stopped resources found] *************************************************
2021-05-28 08:25:22.208 | task path: /home/rhos-ci/jenkins/workspace/DFG-df-deployment-16.2-virthost-3cont_3comp_3ceph-yes_UC_SSL-yes_OC_SSL-ceph-ipv4-geneve-reboot-overcloud/infrared/plugins/tripleo-overcloud/overcloud_reboot.yml:224
2021-05-28 08:25:22.211 | Friday 28 May 2021  08:25:22 +0000 (0:00:00.709)       0:25:47.835 ************ 
2021-05-28 08:25:22.233 | fatal: [controller-2]: FAILED! => {
2021-05-28 08:25:22.236 |     "changed": false
2021-05-28 08:25:22.238 | }
2021-05-28 08:25:22.241 | 
2021-05-28 08:25:22.244 | MSG:
2021-05-28 08:25:22.246 | 
2021-05-28 08:25:22.248 | Some pacemeker resource failed to come back online after reboot

Version-Release number of selected component (if applicable): RHOS-16.2-RHEL-8-20210525.n.0


How reproducible: It happened in Phase 3 and then took three attempts to recreate on a private server.


Steps to Reproduce:
1. Found with Jenkins job: DFG-df-deployment-16.2-virthost-3cont_3comp_3ceph-yes_UC_SSL-yes_OC_SSL-ceph-ipv4-geneve-reboot-overcloud
2.
3.

Actual results: haproxy-bundle-podman-2 does not restart after an overcloud reboot.


Expected results: haproxy-bundle-podman-2 restarts after an overcloud reboot.


Additional info:

Comment 1 David Rosenfeld 2021-06-02 13:18:31 UTC
This is a link to a failed job; see ir-tripleo-overcloud-reboot.log there:

https://rhos-ci-jenkins.lab.eng.tlv2.redhat.com/view/DFG/view/df/view/deployment/job/DFG-df-deployment-16.2-virthost-3cont_3comp_3ceph-yes_UC_SSL-yes_OC_SSL-ceph-ipv4-geneve-reboot-overcloud/30/

When reproduced on a private server it failed on controller-0. Running journalctl showed this:

Jun 01 23:44:16 controller-0 podman(haproxy-bundle-podman-0)[8338]: INFO: running container haproxy-bundle-podman-0 for the first time
Jun 01 23:44:19 controller-0 podman(haproxy-bundle-podman-0)[8867]: ERROR: Error: OCI runtime error: container_linux.go:370: starting container process caused: process_linux.go:459: container init caused: process_linux.go:422: setting cgroup config for procHooks process caused: Unit libpod-35c4a915bdc87f8a0c9c33eb43cba5f4470b5683882ed34d7a02bad6beb2bb53.scope not found.
Jun 01 23:44:19 controller-0 podman(haproxy-bundle-podman-0)[8896]: ERROR: podman failed to launch container
Jun 01 23:44:19 controller-0 pacemaker-execd[3278]:  notice: haproxy-bundle-podman-0_start_0[7089] error output [ ocf-exit-reason:podman failed to launch container ]
Jun 01 23:44:19 controller-0 pacemaker-controld[3284]:  notice: Result of start operation for haproxy-bundle-podman-0 on controller-0: error
Jun 01 23:44:19 controller-0 pacemaker-controld[3284]:  notice: controller-0-haproxy-bundle-podman-0_start_0:103 [ ocf-exit-reason:podman failed to launch container\n ]
Jun 01 23:44:19 controller-0 pacemaker-attrd[3279]:  notice: Setting fail-count-haproxy-bundle-podman-0#start_0[controller-0]: (unset) -> INFINITY
Jun 01 23:44:19 controller-0 pacemaker-attrd[3279]:  notice: Setting last-failure-haproxy-bundle-podman-0#start_0[controller-0]: (unset) -> 1622591059
Jun 01 23:44:26 controller-0 pacemaker-controld[3284]:  notice: Requesting local execution of stop operation for haproxy-bundle-podman-0 on controller-0
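
For reference, the messages above can be pulled from the journal on the affected controller with something like the following (a hedged example; the grep pattern simply matches the resource name and the OCI error seen in this run):

  journalctl -b | grep -E 'haproxy-bundle-podman-0|OCI runtime error'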

Comment 2 Cédric Jeanneret 2021-06-02 14:47:51 UTC
Some env info:
podman version 3.0.2-dev

[root@controller-0 ~]# podman info
host:
  arch: amd64
  buildahVersion: 1.19.8
  cgroupManager: systemd
  cgroupVersion: v1
  conmon:
    package: conmon-2.0.26-1.module+el8.4.0+10614+dd38312c.x86_64
    path: /usr/bin/conmon
    version: 'conmon version 2.0.26, commit: 973e0cbd4075b559ce0400d805f7b13163ef6cc1'
  cpus: 8
  distribution:
    distribution: '"rhel"'
    version: "8.4"
  eventLogger: file
  hostname: controller-0
  idMappings:
    gidmap: null
    uidmap: null
  kernel: 4.18.0-305.el8.x86_64
  linkmode: dynamic
  memFree: 19228971008
  memTotal: 33511833600
  ociRuntime:
    name: runc
    package: runc-1.0.0-70.rc92.module+el8.4.0+10614+dd38312c.x86_64
    path: /usr/bin/runc
    version: 'runc version spec: 1.0.2-dev'
  os: linux
  remoteSocket:
    path: /run/podman/podman.sock
  security:
    apparmorEnabled: false
    capabilities: CAP_NET_RAW,CAP_CHOWN,CAP_DAC_OVERRIDE,CAP_FOWNER,CAP_FSETID,CAP_KILL,CAP_NET_BIND_SERVICE,CAP_SETFCAP,CAP_SETGID,CAP_SETPCAP,CAP_SETUID,CAP_SYS_CHROOT
    rootless: false
    seccompEnabled: true
    selinuxEnabled: true
  slirp4netns:
    executable: ""
    package: ""
    version: ""
  swapFree: 0
  swapTotal: 0
  uptime: 15h 3m 34.01s (Approximately 0.62 days)
registries:
  search:
  - registry.access.redhat.com
  - registry.redhat.io
  - docker.io
  undercloud-0.ctlplane.redhat.local:8787:
    Blocked: false
    Insecure: true
    Location: undercloud-0.ctlplane.redhat.local:8787
    MirrorByDigestOnly: false
    Mirrors: []
    Prefix: undercloud-0.ctlplane.redhat.local:8787
store:
  configFile: /etc/containers/storage.conf
  containerStore:
    number: 80
    paused: 0
    running: 46
    stopped: 34
  graphDriverName: overlay
  graphOptions:
    overlay.mountopt: nodev,metacopy=on
  graphRoot: /var/lib/containers/storage
  graphStatus:
    Backing Filesystem: xfs
    Native Overlay Diff: "false"
    Supports d_type: "true"
    Using metacopy: "true"
  imageStore:
    number: 29
  runRoot: /run/containers/storage
  volumePath: /var/lib/containers/storage/volumes
version:
  APIVersion: 3.0.0
  Built: 1617800874
  BuiltTime: Wed Apr  7 13:07:54 2021
  GitCommit: ""
  GoVersion: go1.15.7
  OsArch: linux/amd64
  Version: 3.0.2-dev

podman-3.0.1-6.module+el8.4.0+10614+dd38312c.x86_64
podman-catatonit-3.0.1-6.module+el8.4.0+10614+dd38312c.x86_64
conmon-2.0.26-1.module+el8.4.0+10614+dd38312c.x86_64

Comment 3 Damien Ciabrini 2021-06-02 15:59:50 UTC
After a debugging session with Cedric on David's environment, we found that haproxy did not start on controller-0 because podman failed to create the container properly:

Jun 01 23:44:12 controller-0 pacemaker-controld[3284]:  notice: Requesting local execution of start operation for haproxy-bundle-podman-0 on controller-0                                                                             
Jun 01 23:44:16 controller-0 podman(haproxy-bundle-podman-0)[8338]: INFO: running container haproxy-bundle-podman-0 for the first time                                                                                                
Jun 01 23:44:19 controller-0 podman(haproxy-bundle-podman-0)[8867]: ERROR: Error: OCI runtime error: container_linux.go:370: starting container process caused: process_linux.go:459: container init caused: process_linux.go:422: setting cgroup config for procHooks process caused: Unit libpod-35c4a915bdc87f8a0c9c33eb43cba5f4470b5683882ed34d7a02bad6beb2bb53.scope not found.                                                                                         
Jun 01 23:44:19 controller-0 podman(haproxy-bundle-podman-0)[8896]: ERROR: podman failed to launch container
Jun 01 23:44:19 controller-0 pacemaker-execd[3278]:  notice: haproxy-bundle-podman-0_start_0[7089] error output [ ocf-exit-reason:podman failed to launch container ]                                                                 
Jun 01 23:44:19 controller-0 pacemaker-controld[3284]:  notice: Result of start operation for haproxy-bundle-podman-0 on controller-0: error                                                                                          
Jun 01 23:44:19 controller-0 pacemaker-controld[3284]:  notice: controller-0-haproxy-bundle-podman-0_start_0:103 [ ocf-exit-reason:podman failed to launch container\n ]                                                              
Jun 01 23:44:19 controller-0 pacemaker-attrd[3279]:  notice: Setting fail-count-haproxy-bundle-podman-0#start_0[controller-0]: (unset) -> INFINITY                                                                                    
Jun 01 23:44:19 controller-0 pacemaker-attrd[3279]:  notice: Setting last-failure-haproxy-bundle-podman-0#start_0[controller-0]: (unset) -> 1622591059                                                                                

Since a failure during a start operation is fatal for the cluster, the haproxy container never got restarted on controller-0.
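
For context, a hedged example of how that blocked state can be inspected and cleared by hand (resource name taken from this cluster; the exact pcs output varies by version):

  # show the fail-count=INFINITY that pins the resource after the failed start
  pcs resource failcount show haproxy-bundle
  # clear the failure so pacemaker attempts the start again
  pcs resource cleanup haproxy-bundle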

The podman failure can be seen in the output of the podman resource agent [1]:

ERROR: Error: OCI runtime error:
  container_linux.go:370: starting container process caused:
    process_linux.go:459: container init caused: 
      process_linux.go:422: setting cgroup config for procHooks process caused:
        Unit libpod-35c4a915bdc87f8a0c9c33eb43cba5f4470b5683882ed34d7a02bad6beb2bb53.scope not found.

This happened while running the following shell command:

  podman run -d --name=haproxy-bundle-podman-0 --user=root --log-driver=k8s-file --log-opt path=/var/log/containers/stdouts/haproxy-bundle.log -e KOLLA_CONFIG_STRATEGY=COPY_ALWAYS cluster.common.tag/rhosp16-openstack-haproxy:pcmklatest /bin/bash /usr/local/bin/kolla_start

Unfortunately the resource agent didn't log the exit code from the podman command, so we don't know the state the container was in after the creation failure.
We do know, however, that the container was created in the podman db, because we could delete it right afterwards, when pacemaker forcibly issued a stop command:

Jun 01 23:44:26 controller-0 pacemaker-controld[3284]:  notice: Requesting local execution of stop operation for haproxy-bundle-podman-0 on controller-0                                                                              
Jun 01 23:44:26 controller-0 podman(haproxy-bundle-podman-0)[10025]: ERROR: monitor cmd failed (rc=255), output: Error: can only create exec sessions on running containers: container state improper                                 
Jun 01 23:44:27 controller-0 podman(haproxy-bundle-podman-0)[10176]: NOTICE: Cleaning up inactive container, haproxy-bundle-podman-0.                                                                                                 
Jun 01 23:44:28 controller-0 podman(haproxy-bundle-podman-0)[10253]: INFO: 35c4a915bdc87f8a0c9c33eb43cba5f4470b5683882ed34d7a02bad6beb2bb53                                                                                           
Jun 01 23:44:28 controller-0 pacemaker-execd[3278]:  notice: haproxy-bundle-podman-0_stop_0[9787] error output [ ocf-exit-reason:monitor cmd failed (rc=255), output: Error: can only create exec sessions on running containers: container state improper ]
Jun 01 23:44:28 controller-0 pacemaker-controld[3284]:  notice: Result of stop operation for haproxy-bundle-podman-0 on controller-0: ok                                                                                              
 
And no container remained after that.
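
For anyone checking a node in this state before pacemaker runs the stop, the leftover container can be inspected directly with podman (a sketch; container name assumed from this run):

  # the container exists in the podman db but never reached the running state
  podman ps -a --filter name=haproxy-bundle-podman-0 --format '{{.ID}} {{.Status}}'
  podman inspect --format '{{.State.Status}}' haproxy-bundle-podman-0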

Note that we could recreate the container once we manually unblocked the resource with "pcs resource cleanup haproxy-bundle", so a possible way of working around the bug in podman would be for the resource agent to retry creating the container when it detects an incomplete creation.


[1] https://github.com/ClusterLabs/resource-agents/blob/master/heartbeat/podman#L361

Comment 4 Damien Ciabrini 2021-06-07 13:56:43 UTC
Since I suspect that this is a podman/runc/systemd race, I'm working on a fix in the podman resource agent to retry creating the container when an OCI runtime error is raised back to the podman layer.
Once I have something I'll ask David if he can test the patched rpm locally on his env.
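
For illustration only, a rough sketch of that retry idea (this is not the actual resource-agent patch; the function name and the CONTAINER/RUN_OPTS/IMAGE/COMMAND placeholders are made up, and the usual OCF return-code variables are assumed):

  create_container_with_retry() {
      # re-run "podman run" a few times when the OCI runtime reports the
      # transient cgroup scope race, cleaning up the half-created container
      # between attempts
      local tries=3
      local out
      while [ "$tries" -gt 0 ]; do
          if out=$(podman run -d --name="$CONTAINER" $RUN_OPTS "$IMAGE" $COMMAND 2>&1); then
              return $OCF_SUCCESS
          fi
          echo "$out" | grep -q 'OCI runtime error' || break
          podman rm -f "$CONTAINER" >/dev/null 2>&1
          tries=$((tries - 1))
          sleep 1
      done
      return $OCF_ERR_GENERIC
  }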

Comment 5 Damien Ciabrini 2021-06-11 08:55:25 UTC
FTR, I could reproduce it locally in my own environment yesterday, this time with another podman container:

Jun 10 15:50:53 controller-1 podman(rabbitmq-bundle-podman-1)[8826]: INFO: running container rabbitmq-bundle-podman-1 for the first time                                                                                               
Jun 10 15:50:55 controller-1 podman(rabbitmq-bundle-podman-1)[9285]: ERROR: Error: OCI runtime error: container_linux.go:370: starting container process caused: process_linux.go:459: container init caused: process_linux.go:422: setting cgroup config for procHooks process caused: Unit libpod-f2289e0e4117ade0c64ad8c82c8261618bc2e6493f1d407559f530a6fdb65af6.scope not found.
Jun 10 15:50:55 controller-1 podman(rabbitmq-bundle-podman-1)[9302]: ERROR: podman failed to launch container      

I'm trying to understand if I can pinpoint the race as well as come up with a workaround in the podman resource agent.

Comment 6 Damien Ciabrini 2021-06-15 20:43:50 UTC
I filed https://bugzilla.redhat.com/show_bug.cgi?id=1972209 to track the podman race separately in RHEL.

We have a workaround for this podman bug in the podman resource agent: with it, any HA container will retry creating the container if it is hit by the race.
The upstream PR is https://github.com/ClusterLabs/resource-agents/pull/1659; it probably needs a RHEL bz for the backport to 8.4.z.

Comment 7 Damien Ciabrini 2021-06-16 14:39:07 UTC
I just opened the RHEL bz [1] that will be used to track the backport of the PR mentioned in comment #6.

[1] https://bugzilla.redhat.com/show_bug.cgi?id=1972743

Comment 8 Damien Ciabrini 2021-06-18 15:46:16 UTC
The correct dependent bz is the 8.4.z one: https://bugzilla.redhat.com/show_bug.cgi?id=1973035

Comment 9 Damien Ciabrini 2021-06-25 13:03:30 UTC
Alright, I think we can close this one as a duplicate of rhbz#1973035, since the workaround for the podman race is tracked there, and we think this race only impacts pacemaker-managed containers (paunch-managed containers get retried by systemd).

*** This bug has been marked as a duplicate of bug 1973035 ***

