.Initiating the `ceph-ansible` playbook to expand the cluster sometimes fails on nodes with NVMe disks
When `osd_auto_discovery` is set to `true`, running the `ceph-ansible` playbook to expand the cluster fails on nodes with NVMe disks because the playbook tries to reconfigure disks that are already in use by existing OSDs. As a result, it is not possible to add a new daemon collocated with an existing OSD that uses NVMe disks while `osd_auto_discovery` is set to `true`. To work around this issue, configure the new daemon on a new node for which `osd_auto_discovery` is not set to `true`, and use the `--limit` parameter when running the playbook to expand the cluster, as shown in the example below.
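For example, assuming the new node has already been added to the inventory under a placeholder hostname such as `new-node` (without `osd_auto_discovery` set to `true`), the expansion run could be limited to that host. This is a minimal sketch; the hostname is hypothetical:

[source,bash]
----
# Run the expansion only against the newly added node so that existing
# OSD nodes using osd_auto_discovery are not reconfigured.
# "new-node" is a placeholder hostname.
ansible-playbook site-docker.yml --limit new-node
----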
Created attachment 1414135 [details]
File contains contents of ansible-playbook log
Description of problem:
Even though an OSD is already configured on the disk, the playbook fails in the task "automatic prepare ceph containerized osd disk collocated" with: "Error response from daemon: Conflict. The container name \"/ceph-osd-prepare-argo017-nvme0n1\" is already in use by container 22833f0e6fd4892a45c2867e59551c173b57a85decf94c781836e62cbb942967. You have to remove (or rename) that container to be able to reuse that name."
Version-Release number of selected component (if applicable):
ceph-ansible-3.0.28-1.el7cp.noarch
How reproducible:
Always (3/3)
Steps to Reproduce:
1. Configure a containerized cluster with OSDs configured using the osd_auto_discovery feature
2. Initiate ansible-playbook site-docker.yml again once the cluster is up (see the example below)
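For illustration, the second run (the one that fails) is simply the standard deployment playbook invoked again from the administration node with the same inventory. The path below is the usual ceph-ansible install location on RHEL and is an assumption here; adjust as needed:

$ cd /usr/share/ceph-ansible
$ ansible-playbook site-docker.yml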
Actual results:
failed: [argo016] (item=/dev/nvme0n1) => {"changed": true, "cmd": "docker run --net=host --pid=host --privileged=true --name=ceph-osd-prepare-argo016-nvme0n1 -v /etc/ceph:/etc/ceph -v /var/lib/ceph/:/var/lib/ceph/ -v /dev:/dev -v /etc/localtime:/etc/localtime:ro -e DEBUG=verbose -e CLUSTER=abc1 -e CEPH_DAEMON=OSD_CEPH_DISK_PREPARE -e OSD_DEVICE=/dev/nvme0n1 -e OSD_BLUESTORE=0 -e OSD_FILESTORE=1 -e OSD_DMCRYPT=1 -e OSD_JOURNAL_SIZE=10240 <image-name>", "delta": "0:00:00.030059", "end": "2018-03-28 09:25:51.061482", "item": "/dev/nvme0n1", "msg": "non-zero return code", "rc": 125, "start": "2018-03-28 09:25:51.031423", "stderr": "/usr/bin/docker-current: Error response from daemon: Conflict. The container name \"/ceph-osd-prepare-argo016-nvme0n1\" is already in use by container e1a23f4ca78b6f670121be11a29ef43c02db374e13f12115a9321b71c8c1c204. You have to remove (or rename) that container to be able to reuse that name..\nSee '/usr/bin/docker-current run --help'.", "stderr_lines": ["/usr/bin/docker-current: Error response from daemon: Conflict. The container name \"/ceph-osd-prepare-argo016-nvme0n1\" is already in use by container e1a23f4ca78b6f670121be11a29ef43c02db374e13f12115a9321b71c8c1c204. You have to remove (or rename) that container to be able to reuse that name..", "See '/usr/bin/docker-current run --help'."], "stdout": "", "stdout_lines": []}
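The conflict comes from the ceph-osd-prepare-<host>-<device> container left over from the initial deployment: it still exists in an exited state, so re-running the prepare task with the same container name is rejected by Docker. This can be confirmed on the affected node, for example (container name taken from the log above; output will vary):

$ docker ps -a --filter name=ceph-osd-prepare-argo016-nvme0n1
# the old prepare container is listed as Exited, which is what makes the new
# "docker run --name=ceph-osd-prepare-argo016-nvme0n1" invocation fail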
Expected results:
The playbook must not fail; it should skip disks that are already configured as OSDs.
Additional info:
Inventory file snippet -
$ cat /etc/ansible/hosts |grep auto
argo016 osd_auto_discovery='true' dmcrypt="true" osd_scenario="collocated"
argo017 osd_auto_discovery='true' osd_scenario="collocated"
This issue was encountered while trying to add a new RGW node.
**The issue seemed particular to NVMe disks.**
Hi Sebastien,
As I observed, the playbook should have skipped the task, since an existing OSD was already using the disk.
The playbook was initiated to expand the cluster, but it failed while trying to configure a disk that was already in use by an OSD (see the checks below).
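For reference, checks like the following on the node show that the disk is already backing a running OSD, so the prepare task has nothing left to do (commands are illustrative; device name taken from the log above, output will vary):

$ lsblk /dev/nvme0n1          # shows the data/journal partitions already created for the existing OSD
$ docker ps | grep ceph-osd   # lists the running containerized OSD container(s) on this host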
Regards,
Vasishta Shastry
AQE, Ceph
I tried with the latest 3.3 and am still facing the issue. This affects usability.
It also affects scaling up the cluster, so I request that you kindly consider providing a fix for this BZ.