Bug 1590560

Summary: ceph upgrade/deployment fails with "Error response from daemon: No such container: ceph-create-keys"
Product: Red Hat OpenStack
Component: ceph-ansible
Version: 13.0 (Queens)
Target Release: 13.0 (Queens)
Target Milestone: ga
Hardware: Unspecified
OS: Unspecified
Status: CLOSED ERRATA
Reporter: Marius Cornea <mcornea>
Assignee: Sébastien Han <shan>
QA Contact: Yogev Rabl <yrabl>
Severity: urgent
Priority: urgent
Keywords: Triaged
Flags: scohen: needinfo+
CC: ccamacho, dbecker, gabrioux, gfidente, johfulto, knylande, mburns, morazi, nmorell, sasha, sclewis, scohen, yprokule
Fixed In Version: ceph-ansible-3.1.0-0.1.rc9.el7cp
Doc Type: Known Issue
Doc Text:
The ceph-ansible utility does not always remove the ceph-create-keys container from the same node where it was created. Because of this, the deployment may fail with the message "Error response from daemon: No such container: ceph-create-keys." This may affect any ceph-ansible run, including fresh deployments, that has multiple compute nodes, or a custom role acting as a ceph client while also hosting a service that consumes ceph.
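
To illustrate the mechanism: the removal only succeeds if it runs on the node that actually created the ceph-create-keys container, or if it tolerates the container being absent. A minimal Ansible sketch of that idea (not the actual ceph-ansible task; the "mons" group name and the direct docker call are assumptions):

# Hedged sketch only -- not the real ceph-ansible code.
# Remove ceph-create-keys on the node that created it and do not fail
# if the container is already gone.
- name: clean up the ceph-create-keys container
  command: docker rm -f ceph-create-keys
  delegate_to: "{{ groups['mons'][0] }}"
  run_once: true
  register: rm_keys
  failed_when: false
  changed_when: rm_keys.rc == 0
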
Clone Of: ---
Cloned As: 1590746 (view as bug list)
Last Closed: 2018-06-27 13:58:15 UTC
Type: Bug
Bug Depends On: 1590746

Description Marius Cornea 2018-06-12 21:27:55 UTC
Description of problem:
FFU: ceph upgrade fails during the fast forward process with "Error response from daemon: No such container: ceph-create-keys"

Version-Release number of selected component (if applicable):
ceph-ansible-3.1.0-0.1.rc8.el7cp.noarch

How reproducible:
100%

Steps to Reproduce:
1. Deploy OSP10 with 3 controllers + 2 compute + 3 ceph osd nodes
2. Run through the fast forward upgrade procedure
3. Run the ceph upgrade step:


openstack overcloud ceph-upgrade run \
    --templates /usr/share/openstack-tripleo-heat-templates \
    --stack qe-Cloud-0 \
    -e /usr/share/openstack-tripleo-heat-templates/environments/cinder-backup.yaml \
    -e /usr/share/openstack-tripleo-heat-templates/environments/services/sahara.yaml \
    -e /usr/share/openstack-tripleo-heat-templates/environments/storage-environment.yaml \
    -e /home/stack/virt/internal.yaml \
    -e /usr/share/openstack-tripleo-heat-templates/environments/network-isolation.yaml \
    -e /home/stack/virt/network/network-environment.yaml \
    -e /home/stack/virt/enable-tls.yaml \
    -e /home/stack/virt/inject-trust-anchor.yaml \
    -e /home/stack/virt/public_vip.yaml \
    -e /usr/share/openstack-tripleo-heat-templates/environments/tls-endpoints-public-ip.yaml \
    -e /home/stack/virt/hostnames.yml \
    -e /home/stack/virt/debug.yaml \
    -e /home/stack/cli_opts_params.yaml \
    -e /home/stack/ceph-ansible-env.yaml \
    --ceph-ansible-playbook '/usr/share/ceph-ansible/infrastructure-playbooks/switch-from-non-containerized-to-containerized-ceph-daemons.yml,/usr/share/ceph-ansible/infrastructure-playbooks/rolling_update.yml' \
    --container-registry-file /home/stack/virt/docker-images.yaml

Actual results:
The ceph upgrade step fails with "Error response from daemon: No such container: ceph-create-keys".

Expected results:
The ceph upgrade step completes successfully.

Additional info:
Attaching ceph-install-workflow.log.

Comment 3 Giulio Fidente 2018-06-13 15:33:40 UTC
We believe this can be hit by any ceph-ansible run (including fresh deployments) with more than one compute node, or with a custom role that behaves as a ceph client while also hosting a service consuming ceph.
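
For illustration only, a custom role of the kind described above could look like the following roles_data excerpt (the role name is hypothetical and the service list abbreviated): a role that carries the ceph client alongside a service that consumes ceph.

# Hypothetical roles_data excerpt -- not from this deployment.
- name: ComputeCephClient
  ServicesDefault:
    - OS::TripleO::Services::CephClient    # makes the role a ceph client
    - OS::TripleO::Services::NovaCompute   # service consuming ceph (RBD-backed instances)
    - OS::TripleO::Services::NovaLibvirt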

Comment 12 Yogev Rabl 2018-06-21 14:48:08 UTC
Verified on ceph-ansible-3.1.0-0.1.rc9.el7cp.noarch.

Comment 14 errata-xmlrpc 2018-06-27 13:58:15 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2018:2086

Comment 15 Nathan Morell 2019-02-08 19:12:41 UTC
I've seen a very similar instance of this problem when using custom roles (CephAll and ControllerNoCeph), where it can't find ceph-osd-1: Error response from daemon: No such container: ceph-osd-1

I see this error roughly 50% of the time when trying to do a deploy on an existing overcloud, with ceph-ansible-3.2.0-1.el7cp.noarch.

Another noteworthy part is that it's trying to unmount /dev/sda2, which is where my host is installed. The device list is as follows (a sanity-check sketch follows the list):

parameter_defaults:
  CephAnsibleDisksConfig:
    osd_scenario: lvm
    osd_objectstore: bluestore
    devices:
      - /dev/sdb
      - /dev/sdc
      - /dev/sdd
      - /dev/sde
      - /dev/sdf
      - /dev/sdg
      - /dev/sdh
      - /dev/sdi
      - /dev/sdj
      - /dev/sdk
      - /dev/sdl
      - /dev/sdm
      - /dev/sdn
      - /dev/sdo
      - /dev/sdp
      - /dev/sdq
      - /dev/sdr
      - /dev/sds
      - /dev/sdt
      - /dev/sdu
      - /dev/sdv
      - /dev/sdw
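
Side note: a quick way to guard against the entrypoint ever unmounting the root disk is to check, before deploying, that none of the configured devices backs the root filesystem. A minimal Ansible sketch under assumptions (the inventory group and variable names are hypothetical, and the check only covers a plain-partition root, not LVM):

# Hypothetical pre-deployment sanity check -- not part of ceph-ansible.
- hosts: ceph_osds                            # assumed inventory group
  become: true
  vars:
    osd_devices: ['/dev/sdb', '/dev/sdc']     # subset of the list above
  tasks:
    - name: find the device backing the root filesystem
      command: findmnt -n -o SOURCE /
      register: root_source
      changed_when: false

    - name: fail if any configured OSD device is the root disk
      assert:
        that:
          - item not in root_source.stdout
      loop: "{{ osd_devices }}"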

Full Error Below:

"stdout_lines": [
        "rpc error: code = 2 desc = oci runtime error: exec failed: container_linux.go:247: starting container process caused \"exec: \\\"a95f57a637cc\\\": executable file not found in $PATH\"", 
        "", 
        "rpc error: code = 2 desc = oci runtime error: exec failed: container_linux.go:247: starting container process caused \"exec: \\\"a95f57a637cc\\\": executable file not found in $PATH\"", 
        "", 
        "rpc error: code = 2 desc = oci runtime error: exec failed: container_linux.go:247: starting container process caused \"exec: \\\"a95f57a637cc\\\": executable file not found in $PATH\"", 
        "", 
        "rpc error: code = 2 desc = oci runtime error: exec failed: container_linux.go:247: starting container process caused \"exec: \\\"a95f57a637cc\\\": executable file not found in $PATH\"", 
        "", 
        "rpc error: code = 2 desc = oci runtime error: exec failed: container_linux.go:247: starting container process caused \"exec: \\\"a95f57a637cc\\\": executable file not found in $PATH\"", 
        "", 
        "rpc error: code = 2 desc = oci runtime error: exec failed: container_linux.go:247: starting container process caused \"exec: \\\"a95f57a637cc\\\": executable file not found in $PATH\"", 
        "", 
        "rpc error: code = 2 desc = oci runtime error: exec failed: container_linux.go:247: starting container process caused \"exec: \\\"a95f57a637cc\\\": executable file not found in $PATH\"", 
        "", 
        "rpc error: code = 2 desc = oci runtime error: exec failed: container_linux.go:247: starting container process caused \"exec: \\\"a95f57a637cc\\\": executable file not found in $PATH\"", 
        "", 
        "rpc error: code = 2 desc = oci runtime error: exec failed: container_linux.go:247: starting container process caused \"exec: \\\"a95f57a637cc\\\": executable file not found in $PATH\"", 
        "", 
        "rpc error: code = 2 desc = oci runtime error: exec failed: container_linux.go:247: starting container process caused \"exec: \\\"a95f57a637cc\\\": executable file not found in $PATH\"", 
        "", 
        "Socket file /var/run/ceph/ceph-osd.1.asok could not be found, which means the osd daemon is not running. Showing ceph-osd unit logs now:", 
        "-- Logs begin at Tue 2019-02-05 07:34:21 UTC, end at Thu 2019-02-07 23:02:09 UTC. --", 
        "Feb 05 07:59:53 overcloud-ceph-all-2 systemd[1]: Starting Ceph OSD...", 
        "Feb 05 07:59:54 overcloud-ceph-all-2 docker[51388]: Error response from daemon: No such container: ceph-osd-1", 
        "Feb 05 07:59:54 overcloud-ceph-all-2 docker[51402]: Error response from daemon: No such container: ceph-osd-1", 
        "Feb 05 07:59:54 overcloud-ceph-all-2 systemd[1]: Started Ceph OSD.", 
        "Feb 05 08:00:21 overcloud-ceph-all-2 ceph-osd-run.sh[51416]: Running command: mount -t tmpfs tmpfs /var/lib/ceph/osd/ceph-1", 
        "Feb 05 08:00:21 overcloud-ceph-all-2 ceph-osd-run.sh[51416]: Running command: restorecon /var/lib/ceph/osd/ceph-1", 
        "Feb 05 08:00:21 overcloud-ceph-all-2 ceph-osd-run.sh[51416]: Running command: ceph-bluestore-tool --cluster=ceph prime-osd-dir --dev /dev/ceph-72de8451-9a9c-4462-b5af-59e442f225fa/osd-data-fe82915e-4d28-4744-988a-bd4250caa54a --path /var/lib/ceph/osd/ceph-1", 
        "Feb 05 08:00:21 overcloud-ceph-all-2 ceph-osd-run.sh[51416]: Running command: ln -snf /dev/ceph-72de8451-9a9c-4462-b5af-59e442f225fa/osd-data-fe82915e-4d28-4744-988a-bd4250caa54a /var/lib/ceph/osd/ceph-1/block", 
        "Feb 05 08:00:21 overcloud-ceph-all-2 ceph-osd-run.sh[51416]: Running command: chown -h ceph:ceph /var/lib/ceph/osd/ceph-1/block", 
        "Feb 05 08:00:21 overcloud-ceph-all-2 ceph-osd-run.sh[51416]: Running command: chown -R ceph:ceph /dev/mapper/ceph--72de8451--9a9c--4462--b5af--59e442f225fa-osd--data--fe82915e--4d28--4744--988a--bd4250caa54a", 
        "Feb 05 08:00:21 overcloud-ceph-all-2 ceph-osd-run.sh[51416]: Running command: chown -R ceph:ceph /var/lib/ceph/osd/ceph-1", 
        "Feb 05 08:00:21 overcloud-ceph-all-2 ceph-osd-run.sh[51416]: --> ceph-volume lvm activate successful for osd ID: 1", 
        "Feb 05 08:00:21 overcloud-ceph-all-2 ceph-osd-run.sh[51416]: 2019-02-05 08:00:21  /entrypoint.sh: SUCCESS", 
        "Feb 05 08:00:21 overcloud-ceph-all-2 ceph-osd-run.sh[51416]: exec: PID 57921: spawning /usr/bin/ceph-osd --cluster ceph -f -i 1", 
        "Feb 05 08:00:21 overcloud-ceph-all-2 ceph-osd-run.sh[51416]: exec: Waiting 57921 to quit", 
        "Feb 05 08:00:21 overcloud-ceph-all-2 ceph-osd-run.sh[51416]: starting osd.1 at - osd_data /var/lib/ceph/osd/ceph-1 /var/lib/ceph/osd/ceph-1/journal", 
        "Feb 05 08:00:22 overcloud-ceph-all-2 ceph-osd-run.sh[51416]: 2019-02-05 08:00:22.410304 7f9197bdfd80 -1 osd.1 0 log_to_monitors {default=true}", 
        "Feb 05 08:00:23 overcloud-ceph-all-2 ceph-osd-run.sh[51416]: 2019-02-05 08:00:23.907197 7f917fbe6700 -1 osd.1 0 waiting for initial osdmap", 
        "Feb 07 22:57:06 overcloud-ceph-all-2 systemd[1]: Stopping Ceph OSD...", 
        "Feb 07 22:57:06 overcloud-ceph-all-2 ceph-osd-run.sh[51416]: teardown: managing teardown after SIGTERM", 
        "Feb 07 22:57:06 overcloud-ceph-all-2 ceph-osd-run.sh[51416]: teardown: Sending SIGTERM to PID 57921", 
        "Feb 07 22:57:06 overcloud-ceph-all-2 ceph-osd-run.sh[51416]: teardown: Waiting PID 57921 to terminate .2019-02-07 22:57:06.572019 7f9175bd2700 -1 Fail to read '/proc/309854/cmdline' error = (3) No such process", 
        "Feb 07 22:57:06 overcloud-ceph-all-2 ceph-osd-run.sh[51416]: 2019-02-07 22:57:06.572054 7f9175bd2700 -1 received  signal: Terminated from  PID: 309854 task name: <unknown> UID: 0", 
        "Feb 07 22:57:06 overcloud-ceph-all-2 ceph-osd-run.sh[51416]: 2019-02-07 22:57:06.572071 7f9175bd2700 -1 osd.1 100 *** Got signal Terminated ***", 
        "Feb 07 22:57:06 overcloud-ceph-all-2 ceph-osd-run.sh[51416]: .2019-02-07 22:57:06.694787 7f9175bd2700 -1 osd.1 100 shutdown", 
        "Feb 07 22:57:09 overcloud-ceph-all-2 ceph-osd-run.sh[51416]: ..........................", 
        "Feb 07 22:57:09 overcloud-ceph-all-2 ceph-osd-run.sh[51416]: teardown: Process 57921 is terminated", 
        "Feb 07 22:57:09 overcloud-ceph-all-2 ceph-osd-run.sh[51416]: sigterm_cleanup_post", 
        "Feb 07 22:57:09 overcloud-ceph-all-2 ceph-osd-run.sh[51416]: 2019-02-07 22:57:09  /entrypoint.sh: osd_volume_activate: Unmounting /dev/sda2", 
        "Feb 07 22:57:09 overcloud-ceph-all-2 ceph-osd-run.sh[51416]: umount: /var/lib/ceph: target is busy.", 
        "Feb 07 22:57:09 overcloud-ceph-all-2 ceph-osd-run.sh[51416]: (In some cases useful info about processes that use", 
        "Feb 07 22:57:09 overcloud-ceph-all-2 ceph-osd-run.sh[51416]: the device is found by lsof(8) or fuser(1))", 
        "Feb 07 22:57:09 overcloud-ceph-all-2 ceph-osd-run.sh[51416]: 2019-02-07 22:57:09  /entrypoint.sh: osd_volume_activate: Failed to umount /dev/sda2", 
        "Feb 07 22:57:09 overcloud-ceph-all-2 ceph-osd-run.sh[51416]: osd_volume_activate.sh: line 47: lsof: command not found", 
        "Feb 07 22:57:09 overcloud-ceph-all-2 docker[309842]: ceph-osd-1", 
        "Feb 07 22:57:09 overcloud-ceph-all-2 systemd[1]: Stopped Ceph OSD.", 
        "Feb 07 22:57:09 overcloud-ceph-all-2 systemd[1]: Starting Ceph OSD...", 
        "Feb 07 22:57:09 overcloud-ceph-all-2 docker[309924]: Error response from daemon: No such container: ceph-osd-1", 
        "Feb 07 22:57:09 overcloud-ceph-all-2 docker[309936]: Error response from daemon: No such container: ceph-osd-1", 
        "Feb 07 22:57:09 overcloud-ceph-all-2 systemd[1]: Started Ceph OSD.", 
        "Feb 07 22:57:32 overcloud-ceph-all-2 ceph-osd-run.sh[309949]: Running command: mount -t tmpfs tmpfs /var/lib/ceph/osd/ceph-1", 
        "Feb 07 22:57:32 overcloud-ceph-all-2 ceph-osd-run.sh[309949]: Running command: restorecon /var/lib/ceph/osd/ceph-1", 
        "Feb 07 22:57:32 overcloud-ceph-all-2 ceph-osd-run.sh[309949]: Running command: ceph-bluestore-tool --cluster=ceph prime-osd-dir --dev /dev/ceph-72de8451-9a9c-4462-b5af-59e442f225fa/osd-data-fe82915e-4d28-4744-988a-bd4250caa54a --path /var/lib/ceph/osd/ceph-1", 
        "Feb 07 22:57:32 overcloud-ceph-all-2 ceph-osd-run.sh[309949]: Running command: ln -snf /dev/ceph-72de8451-9a9c-4462-b5af-59e442f225fa/osd-data-fe82915e-4d28-4744-988a-bd4250caa54a /var/lib/ceph/osd/ceph-1/block", 
        "Feb 07 22:57:32 overcloud-ceph-all-2 ceph-osd-run.sh[309949]: Running command: chown -h ceph:ceph /var/lib/ceph/osd/ceph-1/block", 
        "Feb 07 22:57:32 overcloud-ceph-all-2 ceph-osd-run.sh[309949]: Running command: chown -R ceph:ceph /dev/mapper/ceph--72de8451--9a9c--4462--b5af--59e442f225fa-osd--data--fe82915e--4d28--4744--988a--bd4250caa54a", 
        "Feb 07 22:57:32 overcloud-ceph-all-2 ceph-osd-run.sh[309949]: Running command: chown -R ceph:ceph /var/lib/ceph/osd/ceph-1", 
        "Feb 07 22:57:32 overcloud-ceph-all-2 ceph-osd-run.sh[309949]: --> ceph-volume lvm activate successful for osd ID: 1", 
        "Feb 07 22:57:32 overcloud-ceph-all-2 ceph-osd-run.sh[309949]: 2019-02-07 22:57:32  /entrypoint.sh: SUCCESS", 
        "Feb 07 22:57:32 overcloud-ceph-all-2 ceph-osd-run.sh[309949]: exec: PID 310325: spawning /usr/bin/ceph-osd --cluster ceph -f -i 1", 
        "Feb 07 22:57:32 overcloud-ceph-all-2 ceph-osd-run.sh[309949]: exec: Waiting 310325 to quit", 
        "Feb 07 22:57:32 overcloud-ceph-all-2 ceph-osd-run.sh[309949]: starting osd.1 at - osd_data /var/lib/ceph/osd/ceph-1 /var/lib/ceph/osd/ceph-1/journal", 
        "Feb 07 22:57:33 overcloud-ceph-all-2 ceph-osd-run.sh[309949]: 2019-02-07 22:57:33.370792 7fb551040d80 -1 osd.1 100 log_to_monitors {default=true}"
    ]
}