Description of problem:

Version-Release number of selected component (if applicable):

How reproducible:

Steps to Reproduce:
1.
2.
3.

Actual results:

Expected results:

Additional info:
Hi, sorry, somehow the bug was committed before I could type in the description. Here it comes.

During an update of OSP16.1 from any starting version to passed_phase2 (currently RHOS-16.1-RHEL-8-20210421.n.1), the update of the Ceph OSDs fails.

Main update log:

2021-04-30 02:34:03 | "TASK [start the ceph-crash service] ********************************************",
2021-04-30 02:34:03 | "task path: /usr/share/ceph-ansible/roles/ceph-crash/tasks/main.yml:53",
2021-04-30 02:34:03 | "Friday 30 April 2021 02:33:40 +0000 (0:00:01.198) 0:15:13.454 ********** ",
2021-04-30 02:34:03 | "fatal: [ceph-0]: FAILED! => {\"changed\": false, \"msg\": \"Unable to start service ceph-crash@ceph-0: Job for ceph-crash failed because the control process exited with error code.\\nSee \\\"systemctl status ceph-crash\\\" and \\\"journalctl -xe\\\" for details.\\n\"}",
2021-04-30 02:34:03 | "fatal: [ceph-1]: FAILED! => {\"changed\": false, \"msg\": \"Unable to start service ceph-crash@ceph-1: Job for ceph-crash failed because the control process exited with error code.\\nSee \\\"systemctl status ceph-crash\\\" and \\\"journalctl -xe\\\" for details.\\n\"}",
2021-04-30 02:34:03 | "fatal: [ceph-2]: FAILED! => {\"changed\": false, \"msg\": \"Unable to start service ceph-crash@ceph-2: Job for ceph-crash failed because the control process exited with error code.\\nSee \\\"systemctl status ceph-crash\\\" and \\\"journalctl -xe\\\" for details.\\n\"}",
2021-04-30 02:34:03 | "fatal: [controller-0]: FAILED! => {\"changed\": false, \"msg\": \"Unable to start service ceph-crash@controller-0: Job for ceph-crash failed because the control process exited with error code.\\nSee \\\"systemctl status ceph-crash\\\" and \\\"journalctl -xe\\\" for details.\\n\"}",
2021-04-30 02:34:03 | "fatal: [controller-1]: FAILED! => {\"changed\": false, \"msg\": \"Unable to start service ceph-crash@controller-1: Job for ceph-crash failed because the control process exited with error code.\\nSee \\\"systemctl status ceph-crash\\\" and \\\"journalctl -xe\\\" for details.\\n\"}",
2021-04-30 02:34:03 | "fatal: [controller-2]: FAILED! => {\"changed\": false, \"msg\": \"Unable to start service ceph-crash@controller-2: Job for ceph-crash failed because the control process exited with error code.\\nSee \\\"systemctl status ceph-crash\\\" and \\\"journalctl -xe\\\" for details.\\n\"}",

The logs on ceph-0 are:

- First error seen at 20:17:08; it doesn't cause an error in the update process:

Apr 29 20:17:08 ceph-0 systemd[1]: Created slice system-ceph\x2dcrash.slice.
Apr 29 20:17:08 ceph-0 systemd[1]: Starting Ceph crash dump collector...
Apr 29 20:17:08 ceph-0 podman[21931]: ceph-crash-ceph-0
Apr 29 20:17:08 ceph-0 podman[21931]: Error: no container with ID or name "ceph-crash-ceph-0" found: no such container
Apr 29 20:17:08 ceph-0 podman[21942]: Error: Failed to evict container: "": Failed to find container "ceph-crash-ceph-0" in state: no container with name or ID ceph-crash-ceph-0 found: no such container
Apr 29 20:17:08 ceph-0 podman[21951]: 2021-04-29 20:17:08.298546424 +0000 UTC m=+0.066894234 container create 6b30126b186ee6134f3e89c3f355b98dd18b05017727712e75fc00d903bf2575 (image=undercloud-0.ctlplane.redhat.local:8787/rh-osbs/rhceph:4-36, name=ceph-crash-ceph-0)
Apr 29 20:17:08 ceph-0 systemd[1]: Started libpod-conmon-6b30126b186ee6134f3e89c3f355b98dd18b05017727712e75fc00d903bf2575.scope.
Apr 29 20:17:08 ceph-0 systemd[1]: Started libcontainer container 6b30126b186ee6134f3e89c3f355b98dd18b05017727712e75fc00d903bf2575.
Apr 29 20:17:08 ceph-0 kernel: SELinux: mount invalid. Same superblock, different security settings for (dev mqueue, type mqueue)
Apr 29 20:17:08 ceph-0 podman[21951]: 2021-04-29 20:17:08.655801627 +0000 UTC m=+0.424149436 container init 6b30126b186ee6134f3e89c3f355b98dd18b05017727712e75fc00d903bf2575 (image=undercloud-0.ctlplane.redhat.local:8787/rh-osbs/rhceph:4-36, name=ceph-crash-ceph-0)
Apr 29 20:17:08 ceph-0 podman[21951]: 2021-04-29 20:17:08.67600919 +0000 UTC m=+0.444357031 container start 6b30126b186ee6134f3e89c3f355b98dd18b05017727712e75fc00d903bf2575 (image=undercloud-0.ctlplane.redhat.local:8787/rh-osbs/rhceph:4-36, name=ceph-crash-ceph-0)
Apr 29 20:17:08 ceph-0 podman[21951]: 6b30126b186ee6134f3e89c3f355b98dd18b05017727712e75fc00d903bf2575
Apr 29 20:17:08 ceph-0 systemd[1]: Started Ceph crash dump collector.
Apr 29 20:17:08 ceph-0 conmon[21964]: INFO:__main__:monitoring path /var/lib/ceph/crash, delay 600s
Apr 29 20:17:09 ceph-0 ansible-systemd[22006]: Invoked with name= state=restarted enabled=True masked=False daemon_reload=True daemon_reexec=False no_block=False force=None user=None scope=None
Apr 29 20:17:10 ceph-0 systemd[1]: Reloading.
Apr 29 20:17:10 ceph-0 systemd[1]: Stopping Ceph crash dump collector...
Apr 29 20:17:20 ceph-0 systemd[1]: ceph-crash: Stopping timed out. Terminating.
Apr 29 20:17:20 ceph-0 systemd[1]: ceph-crash: Failed with result 'timeout'.
Apr 29 20:17:20 ceph-0 systemd[1]: Stopped Ceph crash dump collector.
Apr 29 20:17:20 ceph-0 systemd[1]: ceph-crash: Found left-over process 22040 (podman) in control group while starting unit. Ignoring.
Apr 29 20:17:20 ceph-0 systemd[1]: This usually indicates unclean termination of a previous run, or service implementation deficiencies.
Apr 29 20:17:20 ceph-0 systemd[1]: Starting Ceph crash dump collector...
Apr 29 20:17:20 ceph-0 systemd[1]: libpod-6b30126b186ee6134f3e89c3f355b98dd18b05017727712e75fc00d903bf2575.scope: Consumed 302ms CPU time
Apr 29 20:17:20 ceph-0 podman[22060]: ceph-crash-ceph-0
Apr 29 20:17:20 ceph-0 podman[22060]: Error: refusing to remove "ceph-crash-ceph-0" as it exists in libpod as container 6b30126b186ee6134f3e89c3f355b98dd18b05017727712e75fc00d903bf2575: container already exists
Apr 29 20:17:20 ceph-0 podman[22040]: 2021-04-29 20:17:20.544936597 +0000 UTC m=+10.214715127 container died 6b30126b186ee6134f3e89c3f355b98dd18b05017727712e75fc00d903bf2575 (image=undercloud-0.ctlplane.redhat.local:8787/rh-osbs/rhceph:4-36, name=ceph-crash-ceph-0)
Apr 29 20:17:20 ceph-0 podman[22040]: 2021-04-29 20:17:20.546804312 +0000 UTC m=+10.216582774 container stop 6b30126b186ee6134f3e89c3f355b98dd18b05017727712e75fc00d903bf2575 (image=undercloud-0.ctlplane.redhat.local:8787/rh-osbs/rhceph:4-36, name=ceph-crash-ceph-0)
Apr 29 20:17:20 ceph-0 podman[22083]: Error: Failed to evict container: "": Failed to find container "ceph-crash-ceph-0" in state: no container with name or ID ceph-crash-ceph-0 found: no such container
Apr 29 20:17:20 ceph-0 systemd[1]: ceph-crash: Found left-over process 22040 (podman) in control group while starting unit. Ignoring.
Apr 29 20:17:20 ceph-0 systemd[1]: This usually indicates unclean termination of a previous run, or service implementation deficiencies.

- At 22:22:38 the server is rebooted:

Apr 29 22:22:38 ceph-0 kernel: Linux version 4.18.0-193.29.1.el8_2.x86_64

- Then, later on, we have the final error, which causes the update to fail and matches the timestamp in the update log:
Apr 30 02:33:52 ceph-0 systemd[1]: libpod-351bad97194299c442fa3131251263e126a58d945c7dfef4fbdd9262e33996a0.scope: Consumed 321ms CPU time
Apr 30 02:33:57 ceph-0 podman[84668]: Error: cannot remove container 351bad97194299c442fa3131251263e126a58d945c7dfef4fbdd9262e33996a0 as it could not be stopped: timed out waiting for file /var/run/libpod/exits/351bad97194299c442fa3131251263e126a58d945c7dfef4fbdd9262e33996a0: internal libpod error
Apr 30 02:33:57 ceph-0 podman[84710]: Error: error creating container storage: the container name "ceph-crash-ceph-0" is already in use by "351bad97194299c442fa3131251263e126a58d945c7dfef4fbdd9262e33996a0". You have to remove that container to be able to reuse that name.: that name is already in use
Apr 30 02:33:57 ceph-0 systemd[1]: ceph-crash: Control process exited, code=exited status=125
Apr 30 02:33:57 ceph-0 systemd[1]: ceph-crash: Failed with result 'exit-code'.
Apr 30 02:33:57 ceph-0 systemd[1]: Failed to start Ceph crash dump collector.
Apr 30 02:34:07 ceph-0 systemd[1]: ceph-crash: Service RestartSec=10s expired, scheduling restart.
Apr 30 02:34:07 ceph-0 systemd[1]: ceph-crash: Scheduled restart job, restart counter is at 1.
Apr 30 02:34:07 ceph-0 systemd[1]: Stopped Ceph crash dump collector.
Apr 30 02:34:07 ceph-0 systemd[1]: Starting Ceph crash dump collector...
Apr 30 02:34:07 ceph-0 podman[84724]: ceph-crash-ceph-0
Apr 30 02:34:07 ceph-0 podman[84724]: Error: refusing to remove "ceph-crash-ceph-0" as it exists in libpod as container 351bad97194299c442fa3131251263e126a58d945c7dfef4fbdd9262e33996a0: container already exists
Apr 30 02:34:07 ceph-0 podman[84732]: 2021-04-30 02:34:07.960757185 +0000 UTC m=+0.100301577 container remove 351bad97194299c442fa3131251263e126a58d945c7dfef4fbdd9262e33996a0 (image=undercloud-0.ctlplane.redhat.local:8787/rh-osbs/rhceph:4-36, name=ceph-crash-ceph-0)

The first notable thing is that, relative to the latest working puddle, ceph-ansible moved from ceph-ansible-4.0.41-1.el8cp.noarch to ceph-ansible-4.0.49.2-1.el8cp.noarch. It includes a fix for https://bugzilla.redhat.com/show_bug.cgi?id=1943471, which makes sure that ceph-crash gets updated during the update, so that seems relevant.

Second, there is that reboot. It happens during a new stage of the update process where we export the images created, for re-use in the next jobs. This happens after the deployment and before the update. It is purely a "CI" stage and shouldn't cause any problem, as it's a clean restart of the VMs:

TASK [Restart any VM's that were running] **************************************
task path: /home/rhos-ci/jenkins/workspace/DFG-upgrades-updates-16.1-from-z3-HA-ipv4/infrared/plugins/virsh/snapshot_export.yml:146

Finally, there is the first error at 20:17.
It happens during the *deployment* of the OSDs (ceph-ansible hasn't been updated yet at that point):

TASK [tripleo-ceph-run-ansible : Notify user about upcoming ceph-ansible execution(s)] ***
Thursday 29 April 2021 20:10:52 +0000 (0:00:00.090) 0:17:48.194 ********
ok: [undercloud] => {
    "msg": "Running 1 ceph-ansible playbook(s) (immediate log at /var/lib/mistral/qe-Cloud-0/ceph-ansible/ceph_ansible_command.log)"
}

TASK [tripleo-ceph-run-ansible : run ceph-ansible] *****************************
Thursday 29 April 2021 20:10:52 +0000 (0:00:00.091) 0:17:48.286 ********
changed: [undercloud] => (item=None) => {"censored": "the output has been hidden due to the fact that 'no_log: true' was specified for this result", "changed": true}
changed: [undercloud] => {"censored": "the output has been hidden due to the fact that 'no_log: true' was specified for this result", "changed": true}

TASK [tripleo-ceph-run-ansible : search output of ceph-ansible run(s) non-zero return codes] ***
Thursday 29 April 2021 20:17:23 +0000 (0:06:31.172) 0:24:19.459 ********
skipping: [undercloud] => (item=None) => {"censored": "the output has been hidden due to the fact that 'no_log: true' was specified for this result", "changed": false}
skipping: [undercloud] => {"censored": "the output has been hidden due to the fact that 'no_log: true' was specified for this result", "changed": false}

TASK [tripleo-ceph-run-ansible : print ceph-ansible output in case of failure] ***
Thursday 29 April 2021 20:17:24 +0000 (0:00:00.602) 0:24:20.062 ********
skipping: [undercloud] => {}

No error is reported by the deployment process, so that first error may be a red herring.

We need support from the Ceph DFG to go further here. I'll try to get a live reproducer environment ready ASAP.
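For anyone hitting this on a live environment: the "name is already in use" errors above suggest the stale container has to be removed by hand before systemd can start the unit again. Below is only a sketch, not a verified procedure — the unit and container names are taken from the ceph-0 excerpts above (adjust per node), and the commands are echoed as a dry run; drop the leading "echo" to actually execute them.

```shell
# Sketch of a manual cleanup for the stale ceph-crash container.
# Assumption: names match the ceph-0 log excerpts above; adjust per node.
# Commands are echoed (dry run); remove "echo" to really run them.
container="ceph-crash-ceph-0"
unit="ceph-crash@ceph-0"

echo "systemctl stop ${unit}"       # stop the failing service first
echo "podman rm -f ${container}"    # force-remove the leftover container
echo "systemctl start ${unit}"      # let systemd recreate it cleanly
```

Untested on the affected puddle; if the container is stuck in libpod state (the "could not be stopped" error above), the rm may still time out and a reboot of the node could be needed.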
*** Bug 1956930 has been marked as a duplicate of this bug. ***
Created attachment 1781995 [details] ceph update
Created attachment 1781996 [details] ceph update log
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Red Hat Ceph Storage 4.2 bug fix update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2021:1554