Bug 1955393 - [update] ceph-crash@ service is unable to start when ceph-osd are updated
Summary: [update] ceph-crash@ service is unable to start when ceph-osd are updated
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Ceph Storage
Classification: Red Hat Storage
Component: Ceph-Ansible
Version: 4.2
Hardware: Unspecified
OS: Unspecified
Priority: urgent
Severity: urgent
Target Milestone: ---
Target Release: 4.2z1
Assignee: Guillaume Abrioux
QA Contact: Ameena Suhani S H
URL:
Whiteboard:
Duplicates: 1956930
Depends On:
Blocks: 1760354 1956930
 
Reported: 2021-04-30 03:20 UTC by Sofer Athlan-Guyot
Modified: 2021-06-09 12:55 UTC (History)
CC List: 17 users

Fixed In Version: ceph-ansible-4.0.49.3-1.el8cp, ceph-ansible-4.0.49.3-1.el7cp
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2021-05-12 12:57:26 UTC
Embargoed:


Attachments
ceph update (345.27 KB, application/gzip), 2021-05-11 11:44 UTC, Sofer Athlan-Guyot
ceph update log (345.27 KB, application/gzip), 2021-05-11 11:54 UTC, Sofer Athlan-Guyot


Links
Github ceph ceph-ansible pull 6507: "update: fix ceph-crash stop task" (open, last updated 2021-05-04 09:47:54 UTC)
Red Hat Product Errata RHBA-2021:1554 (last updated 2021-05-12 12:57:35 UTC)

Internal Links: 1956930

Description Sofer Athlan-Guyot 2021-04-30 03:20:48 UTC
Description of problem:


Version-Release number of selected component (if applicable):


How reproducible:


Steps to Reproduce:
1.
2.
3.

Actual results:


Expected results:


Additional info:

Comment 1 Sofer Athlan-Guyot 2021-04-30 04:08:11 UTC
Hi,

Sorry, somehow the bug was submitted before I could type in the
description. Here it comes.

During the update of OSP16.1 from any starting version to passed_phase2
(currently RHOS-16.1-RHEL-8-20210421.n.1), the update of the Ceph OSDs
fails:

Main Update log:

    2021-04-30 02:34:03 |         "TASK [start the ceph-crash service] ********************************************",
    2021-04-30 02:34:03 |         "task path: /usr/share/ceph-ansible/roles/ceph-crash/tasks/main.yml:53",
    2021-04-30 02:34:03 |         "Friday 30 April 2021  02:33:40 +0000 (0:00:01.198)       0:15:13.454 ********** ",
    2021-04-30 02:34:03 |         "fatal: [ceph-0]: FAILED! => {\"changed\": false, \"msg\": \"Unable to start service ceph-crash@ceph-0: Job for ceph-crash failed because the control process exited with error code.\\nSee \\\"systemctl status ceph-crash\\\" and \\\"journalctl -xe\\\" for details.\\n\"}",
    2021-04-30 02:34:03 |         "fatal: [ceph-1]: FAILED! => {\"changed\": false, \"msg\": \"Unable to start service ceph-crash@ceph-1: Job for ceph-crash failed because the control process exited with error code.\\nSee \\\"systemctl status ceph-crash\\\" and \\\"journalctl -xe\\\" for details.\\n\"}",
    2021-04-30 02:34:03 |         "fatal: [ceph-2]: FAILED! => {\"changed\": false, \"msg\": \"Unable to start service ceph-crash@ceph-2: Job for ceph-crash failed because the control process exited with error code.\\nSee \\\"systemctl status ceph-crash\\\" and \\\"journalctl -xe\\\" for details.\\n\"}",
    2021-04-30 02:34:03 |         "fatal: [controller-0]: FAILED! => {\"changed\": false, \"msg\": \"Unable to start service ceph-crash@controller-0: Job for ceph-crash failed because the control process exited with error code.\\nSee \\\"systemctl status ceph-crash\\\" and \\\"journalctl -xe\\\" for details.\\n\"}",
    2021-04-30 02:34:03 |         "fatal: [controller-1]: FAILED! => {\"changed\": false, \"msg\": \"Unable to start service ceph-crash@controller-1: Job for ceph-crash failed because the control process exited with error code.\\nSee \\\"systemctl status ceph-crash\\\" and \\\"journalctl -xe\\\" for details.\\n\"}",
    2021-04-30 02:34:03 |         "fatal: [controller-2]: FAILED! => {\"changed\": false, \"msg\": \"Unable to start service ceph-crash@controller-2: Job for ceph-crash failed because the control process exited with error code.\\nSee \\\"systemctl status ceph-crash\\\" and \\\"journalctl -xe\\\" for details.\\n\"}",


The logs on ceph-0 show:

 - The first error is seen at 20:17:08, but it does not cause the
   update process to fail.

    Apr 29 20:17:08 ceph-0 systemd[1]: Created slice system-ceph\x2dcrash.slice.
    Apr 29 20:17:08 ceph-0 systemd[1]: Starting Ceph crash dump collector...
    Apr 29 20:17:08 ceph-0 podman[21931]: ceph-crash-ceph-0
    Apr 29 20:17:08 ceph-0 podman[21931]: Error: no container with ID or name "ceph-crash-ceph-0" found: no such container
    Apr 29 20:17:08 ceph-0 podman[21942]: Error: Failed to evict container: "": Failed to find container "ceph-crash-ceph-0" in state: no container with name or ID ceph-crash-ceph-0 found: no such container
    Apr 29 20:17:08 ceph-0 podman[21951]: 2021-04-29 20:17:08.298546424 +0000 UTC m=+0.066894234 container create 6b30126b186ee6134f3e89c3f355b98dd18b05017727712e75fc00d903bf2575 (image=undercloud-0.ctlplane.redhat.local:8787/rh-osbs/rhceph:4-36, name=ceph-crash-ceph-0)
    Apr 29 20:17:08 ceph-0 systemd[1]: Started libpod-conmon-6b30126b186ee6134f3e89c3f355b98dd18b05017727712e75fc00d903bf2575.scope.
    Apr 29 20:17:08 ceph-0 systemd[1]: Started libcontainer container 6b30126b186ee6134f3e89c3f355b98dd18b05017727712e75fc00d903bf2575.
    Apr 29 20:17:08 ceph-0 kernel: SELinux: mount invalid.  Same superblock, different security settings for (dev mqueue, type mqueue)
    Apr 29 20:17:08 ceph-0 podman[21951]: 2021-04-29 20:17:08.655801627 +0000 UTC m=+0.424149436 container init 6b30126b186ee6134f3e89c3f355b98dd18b05017727712e75fc00d903bf2575 (image=undercloud-0.ctlplane.redhat.local:8787/rh-osbs/rhceph:4-36, name=ceph-crash-ceph-0)
    Apr 29 20:17:08 ceph-0 podman[21951]: 2021-04-29 20:17:08.67600919 +0000 UTC m=+0.444357031 container start 6b30126b186ee6134f3e89c3f355b98dd18b05017727712e75fc00d903bf2575 (image=undercloud-0.ctlplane.redhat.local:8787/rh-osbs/rhceph:4-36, name=ceph-crash-ceph-0)
    Apr 29 20:17:08 ceph-0 podman[21951]: 6b30126b186ee6134f3e89c3f355b98dd18b05017727712e75fc00d903bf2575
    Apr 29 20:17:08 ceph-0 systemd[1]: Started Ceph crash dump collector.
    Apr 29 20:17:08 ceph-0 conmon[21964]: INFO:__main__:monitoring path /var/lib/ceph/crash, delay 600s
    Apr 29 20:17:09 ceph-0 ansible-systemd[22006]: Invoked with name= state=restarted enabled=True masked=False daemon_reload=True daemon_reexec=False no_block=False force=None user=None scope=None
    Apr 29 20:17:10 ceph-0 systemd[1]: Reloading.
    Apr 29 20:17:10 ceph-0 systemd[1]: Stopping Ceph crash dump collector...
    Apr 29 20:17:20 ceph-0 systemd[1]: ceph-crash: Stopping timed out. Terminating.
    Apr 29 20:17:20 ceph-0 systemd[1]: ceph-crash: Failed with result 'timeout'.
    Apr 29 20:17:20 ceph-0 systemd[1]: Stopped Ceph crash dump collector.
    Apr 29 20:17:20 ceph-0 systemd[1]: ceph-crash: Found left-over process 22040 (podman) in control group while starting unit. Ignoring.
    Apr 29 20:17:20 ceph-0 systemd[1]: This usually indicates unclean termination of a previous run, or service implementation deficiencies.
    Apr 29 20:17:20 ceph-0 systemd[1]: Starting Ceph crash dump collector...
    Apr 29 20:17:20 ceph-0 systemd[1]: libpod-6b30126b186ee6134f3e89c3f355b98dd18b05017727712e75fc00d903bf2575.scope: Consumed 302ms CPU time
    Apr 29 20:17:20 ceph-0 podman[22060]: ceph-crash-ceph-0
    Apr 29 20:17:20 ceph-0 podman[22060]: Error: refusing to remove "ceph-crash-ceph-0" as it exists in libpod as container 6b30126b186ee6134f3e89c3f355b98dd18b05017727712e75fc00d903bf2575: container already exists
    Apr 29 20:17:20 ceph-0 podman[22040]: 2021-04-29 20:17:20.544936597 +0000 UTC m=+10.214715127 container died 6b30126b186ee6134f3e89c3f355b98dd18b05017727712e75fc00d903bf2575 (image=undercloud-0.ctlplane.redhat.local:8787/rh-osbs/rhceph:4-36, name=ceph-crash-ceph-0)
    Apr 29 20:17:20 ceph-0 podman[22040]: 2021-04-29 20:17:20.546804312 +0000 UTC m=+10.216582774 container stop 6b30126b186ee6134f3e89c3f355b98dd18b05017727712e75fc00d903bf2575 (image=undercloud-0.ctlplane.redhat.local:8787/rh-osbs/rhceph:4-36, name=ceph-crash-ceph-0)
    Apr 29 20:17:20 ceph-0 podman[22083]: Error: Failed to evict container: "": Failed to find container "ceph-crash-ceph-0" in state: no container with name or ID ceph-crash-ceph-0 found: no such container
    Apr 29 20:17:20 ceph-0 systemd[1]: ceph-crash: Found left-over process 22040 (podman) in control group while starting unit. Ignoring.
    Apr 29 20:17:20 ceph-0 systemd[1]: This usually indicates unclean termination of a previous run, or service implementation deficiencies.

 - At 22:22:38 the server is rebooted

    Apr 29 22:22:38 ceph-0 kernel: Linux version 4.18.0-193.29.1.el8_2.x86_64

 - Then, later on, we have the final error that causes the update to
   fail and matches the timestamp of the main update log.

    Apr 30 02:33:52 ceph-0 systemd[1]: libpod-351bad97194299c442fa3131251263e126a58d945c7dfef4fbdd9262e33996a0.scope: Consumed 321ms CPU time
    Apr 30 02:33:57 ceph-0 podman[84668]: Error: cannot remove container 351bad97194299c442fa3131251263e126a58d945c7dfef4fbdd9262e33996a0 as it could not be stopped: timed out waiting for file /var/run/libpod/exits/351bad97194299c442fa3131251263e126a58d945c7dfef4fbdd9262e33996a0: internal libpod error
    Apr 30 02:33:57 ceph-0 podman[84710]: Error: error creating container storage: the container name "ceph-crash-ceph-0" is already in use by "351bad97194299c442fa3131251263e126a58d945c7dfef4fbdd9262e33996a0". You have to remove that container to be able to reuse that name.: that name is already in use
    Apr 30 02:33:57 ceph-0 systemd[1]: ceph-crash: Control process exited, code=exited status=125
    Apr 30 02:33:57 ceph-0 systemd[1]: ceph-crash: Failed with result 'exit-code'.
    Apr 30 02:33:57 ceph-0 systemd[1]: Failed to start Ceph crash dump collector.
    Apr 30 02:34:07 ceph-0 systemd[1]: ceph-crash: Service RestartSec=10s expired, scheduling restart.
    Apr 30 02:34:07 ceph-0 systemd[1]: ceph-crash: Scheduled restart job, restart counter is at 1.
    Apr 30 02:34:07 ceph-0 systemd[1]: Stopped Ceph crash dump collector.
    Apr 30 02:34:07 ceph-0 systemd[1]: Starting Ceph crash dump collector...
    Apr 30 02:34:07 ceph-0 podman[84724]: ceph-crash-ceph-0
    Apr 30 02:34:07 ceph-0 podman[84724]: Error: refusing to remove "ceph-crash-ceph-0" as it exists in libpod as container 351bad97194299c442fa3131251263e126a58d945c7dfef4fbdd9262e33996a0: container already exists
    Apr 30 02:34:07 ceph-0 podman[84732]: 2021-04-30 02:34:07.960757185 +0000 UTC m=+0.100301577 container remove 351bad97194299c442fa3131251263e126a58d945c7dfef4fbdd9262e33996a0 (image=undercloud-0.ctlplane.redhat.local:8787/rh-osbs/rhceph:4-36, name=ceph-crash-ceph-0)
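
As a side note for anyone hitting this on a live node, a possible manual
workaround (a sketch only, not validated here; the container and unit
names are the ones visible in the ceph-0 logs above) is to remove the
stale container by hand and let systemd start the unit again:

    # Confirm the leftover container that is holding the "ceph-crash-<host>" name
    podman ps -a --filter name=ceph-crash-$(hostname -s)

    # Remove it, clear the failed unit state, and start the service again
    podman rm -f ceph-crash-$(hostname -s)
    systemctl reset-failed ceph-crash@$(hostname -s).service
    systemctl restart ceph-crash@$(hostname -s)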

The first notable thing is that, relative to the latest working puddle,
ceph-ansible moved from ceph-ansible-4.0.41-1.el8cp.noarch to
ceph-ansible-4.0.49.2-1.el8cp.noarch.

It includes a fix for
https://bugzilla.redhat.com/show_bug.cgi?id=1943471.

That fix makes sure that ceph-crash gets updated during the update, so
it seems relevant.
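
To double-check which ceph-ansible build is in play and to look at the
task that fails, something like the following should work (a quick
sketch, assuming the layout seen in the logs, i.e. the role installed
under /usr/share/ceph-ansible on the node that runs the playbooks):

    # Which ceph-ansible build is installed
    rpm -q ceph-ansible

    # The failing task lives in the ceph-crash role
    grep -n "ceph-crash" /usr/share/ceph-ansible/roles/ceph-crash/tasks/main.yml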


Second, there is that reboot. It happens during a new stage of the
update process where we export the images created for re-use in
subsequent jobs. This happens after the deployment and before the
update. It is purely a "CI" stage and shouldn't cause any problem, as
it is a clean restart of the VMs:

    TASK [Restart any VM's that were running] **************************************
    task path: /home/rhos-ci/jenkins/workspace/DFG-upgrades-updates-16.1-from-z3-HA-ipv4/infrared/plugins/virsh/snapshot_export.yml:146


Finally, there is the first error at 20:17. It happens during the
*deployment* of the OSDs (ceph-ansible hasn't been updated yet at that
point):

    TASK [tripleo-ceph-run-ansible : Notify user about upcoming ceph-ansible execution(s)] ***
    Thursday 29 April 2021  20:10:52 +0000 (0:00:00.090)       0:17:48.194 ********
    ok: [undercloud] => {
        "msg": "Running 1 ceph-ansible playbook(s) (immediate log at /var/lib/mistral/qe-Cloud-0/ceph-ansible/ceph_ansible_command.log)"
    }
    
    TASK [tripleo-ceph-run-ansible : run ceph-ansible] *****************************
    Thursday 29 April 2021  20:10:52 +0000 (0:00:00.091)       0:17:48.286 ********
    changed: [undercloud] => (item=None) => {"censored": "the output has been hidden due to the fact that 'no_log: true' was specified for this result", "changed": true}
    
    changed: [undercloud] => {"censored": "the output has been hidden due to the fact that 'no_log: true' was specified for this result", "changed": true}
    
    TASK [tripleo-ceph-run-ansible : search output of ceph-ansible run(s) non-zero return codes] ***
    Thursday 29 April 2021  20:17:23 +0000 (0:06:31.172)       0:24:19.459 ********
    skipping: [undercloud] => (item=None)  => {"censored": "the output has been hidden due to the fact that 'no_log: true' was specified for this result", "changed": false}
    skipping: [undercloud] => {"censored": "the output has been hidden due to the fact that 'no_log: true' was specified for this result", "changed": false}
    
    TASK [tripleo-ceph-run-ansible : print ceph-ansible output in case of failure] ***
    Thursday 29 April 2021  20:17:24 +0000 (0:00:00.602)       0:24:20.062 ********
    skipping: [undercloud] => {}

No error is reported by the deployment process, so this may be a red
herring.

We need support from the Ceph DFG to go further. I'll try to get a
live reproducer environment ready ASAP.

Comment 22 Francesco Pantano 2021-05-07 11:49:37 UTC
*** Bug 1956930 has been marked as a duplicate of this bug. ***

Comment 23 Sofer Athlan-Guyot 2021-05-11 11:44:49 UTC
Created attachment 1781995 [details]
ceph update

Comment 24 Sofer Athlan-Guyot 2021-05-11 11:54:09 UTC
Created attachment 1781996 [details]
ceph update log

Comment 27 errata-xmlrpc 2021-05-12 12:57:26 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Red Hat Ceph Storage 4.2 bug fix update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2021:1554

Comment 28 Francesco Pantano 2021-06-09 12:55:44 UTC
*** Bug 1956930 has been marked as a duplicate of this bug. ***

