On Thu, Sep 16, 2021 at 11:13 AM Attila Fazekas <afazekas> wrote:
> rdo ceph wallaby failed to install.
> https://rhos-ci-jenkins.lab.eng.tlv2.redhat.com/view/RDO/job/rdo-pcci-wallaby-rhel-8.4-virthost-3cont_2comp_3ceph-ipv4-geneve-ceph/5/

I see that ceph failed to bootstrap in this job:

http://rhos-ci-logs.lab.eng.tlv2.redhat.com/logs/rcj/rdo-pcci-wallaby-rhel-8.4-virthost-3cont_2comp_3ceph-ipv4-geneve-ceph/5/undercloud-0/home/stack/overcloud-deploy/overcloud/config-download/overcloud/cephadm/cephadm_command.log.gz

It failed while trying to connect to controller-0 using the SSH account which should have been added:

"Adding host controller-0...",
"Non-zero exit code 22 from /bin/podman run --rm --ipc=host --no-hosts --net=host --entrypoint /usr/bin/ceph --init -e CONTAINER_IMAGE=quay.io/ceph/daemon:v6.0.4-stable-6.0-pacific-centos-8-x86_64 -e NODE_NAME=controller-0 -e CEPH_USE_RANDOM_NONCE=1 -v /var/log/ceph/16855422-5759-46d4-99bd-0e2da357801b:/var/log/ceph:z -v /tmp/ceph-tmpjhc6i6js:/etc/ceph/ceph.client.admin.keyring:z -v /tmp/ceph-tmp85cuj7vh:/etc/ceph/ceph.conf:z quay.io/ceph/daemon:v6.0.4-stable-6.0-pacific-centos-8-x86_64 orch host add controller-0",
"/usr/bin/ceph: stderr Error EINVAL: Failed to connect to controller-0 (controller-0).",
"/usr/bin/ceph: stderr Please make sure that the host is reachable and accepts connections using the cephadm SSH key"

The account was created:

http://rhos-ci-logs.lab.eng.tlv2.redhat.com/logs/rcj/rdo-pcci-wallaby-rhel-8.4-virthost-3cont_2comp_3ceph-ipv4-geneve-ceph/5/controller-0/etc/passwd.gz

But its SSH keys should have been added by this playbook, though those tasks were skipped:

http://rhos-ci-logs.lab.eng.tlv2.redhat.com/logs/rcj/rdo-pcci-wallaby-rhel-8.4-virthost-3cont_2comp_3ceph-ipv4-geneve-ceph/5/undercloud-0/home/stack/overcloud-deploy/overcloud/config-download/overcloud/cephadm/cephadm_enable_user_key.log.gz
The keys were correctly distributed:

[stack@undercloud-0 ~]$ cd ~/config-download/overcloud/cephadm
[stack@undercloud-0 cephadm]$ ansible -i inventory.yml mons,osds -b -m shell -a "ls -l /home/ceph-admin/.ssh/"
ceph-1 | CHANGED | rc=0 >>
total 4
-rw-------. 1 ceph-admin ceph-admin  554 Sep 21 14:43 authorized_keys
controller-0 | CHANGED | rc=0 >>
total 16
-rw-------. 1 ceph-admin ceph-admin 1108 Sep 21 14:43 authorized_keys
-rw-------. 1 ceph-admin ceph-admin 2622 Sep 21 14:43 id_rsa
-rw-r--r--. 1 ceph-admin ceph-admin  553 Sep 21 14:43 id_rsa.pub
-rw-r--r--. 1 ceph-admin ceph-admin  186 Sep 21 15:14 known_hosts
controller-1 | CHANGED | rc=0 >>
total 12
-rw-------. 1 ceph-admin ceph-admin 1108 Sep 21 14:43 authorized_keys
-rw-------. 1 ceph-admin ceph-admin 2622 Sep 21 14:43 id_rsa
-rw-r--r--. 1 ceph-admin ceph-admin  553 Sep 21 14:43 id_rsa.pub
ceph-0 | CHANGED | rc=0 >>
total 4
-rw-------. 1 ceph-admin ceph-admin  554 Sep 21 14:43 authorized_keys
controller-2 | CHANGED | rc=0 >>
total 12
-rw-------. 1 ceph-admin ceph-admin 1108 Sep 21 14:43 authorized_keys
-rw-------. 1 ceph-admin ceph-admin 2622 Sep 21 14:43 id_rsa
-rw-r--r--. 1 ceph-admin ceph-admin  553 Sep 21 14:43 id_rsa.pub
ceph-2 | CHANGED | rc=0 >>
total 4
-rw-------. 1 ceph-admin ceph-admin  554 Sep 21 14:43 authorized_keys
[stack@undercloud-0 cephadm]$
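The listing above shows the key material is in place, but it doesn't prove that a login with it actually works. A quick direct test, outside of both ansible and cephadm, would be something like the following sketch, run as root on controller-0 (paths match the listing above; the -o options are only there to avoid host-key prompts):

    sudo -u ceph-admin ssh -i /home/ceph-admin/.ssh/id_rsa \
        -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null \
        ceph-admin@controller-0 /bin/true && echo "SSH as ceph-admin: OK"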
The bootstrap was run on controller-0 and it failed to add itself, as per the following from /var/log/ceph/cephadm.log:

2021-09-21 14:45:37,059 INFO Non-zero exit code 22 from /bin/podman run --rm --ipc=host --no-hosts --net=host --entrypoint /usr/bin/ceph --init -e CONTAINER_IMAGE=quay.io/ceph/daemon:v6.0.4-stable-6.0-pacific-centos-8-x86_64 -e NODE_NAME=controller-0 -e CEPH_USE_RANDOM_NONCE=1 -v /var/log/ceph/c50a77c0-b27c-4a89-8042-782508a870dd:/var/log/ceph:z -v /tmp/ceph-tmpfqjce370:/etc/ceph/ceph.client.admin.keyring:z -v /tmp/ceph-tmp2qoitnyh:/etc/ceph/ceph.conf:z quay.io/ceph/daemon:v6.0.4-stable-6.0-pacific-centos-8-x86_64 orch host add controller-0
2021-09-21 14:45:37,059 INFO /usr/bin/ceph: stderr Error EINVAL: Failed to connect to controller-0 (controller-0).
2021-09-21 14:45:37,059 INFO /usr/bin/ceph: stderr Please make sure that the host is reachable and accepts connections using the cephadm SSH key
2021-09-21 14:45:37,059 INFO /usr/bin/ceph: stderr
2021-09-21 14:45:37,059 INFO /usr/bin/ceph: stderr To add the cephadm SSH key to the host:
2021-09-21 14:45:37,059 INFO /usr/bin/ceph: stderr > ceph cephadm get-pub-key > ~/ceph.pub
2021-09-21 14:45:37,059 INFO /usr/bin/ceph: stderr > ssh-copy-id -f -i ~/ceph.pub ceph-admin@controller-0
2021-09-21 14:45:37,059 INFO /usr/bin/ceph: stderr
2021-09-21 14:45:37,060 INFO /usr/bin/ceph: stderr To check that the host is reachable open a new shell with the --no-hosts flag:
2021-09-21 14:45:37,060 INFO /usr/bin/ceph: stderr > cephadm shell --no-hosts
2021-09-21 14:45:37,060 INFO /usr/bin/ceph: stderr
2021-09-21 14:45:37,060 INFO /usr/bin/ceph: stderr Then run the following:
2021-09-21 14:45:37,060 INFO /usr/bin/ceph: stderr > ceph cephadm get-ssh-config > ssh_config
2021-09-21 14:45:37,060 INFO /usr/bin/ceph: stderr > ceph config-key get mgr/cephadm/ssh_identity_key > ~/cephadm_private_key
2021-09-21 14:45:37,060 INFO /usr/bin/ceph: stderr > chmod 0600 ~/cephadm_private_key
2021-09-21 14:45:37,060 INFO /usr/bin/ceph: stderr > ssh -F ssh_config -i ~/cephadm_private_key ceph-admin@controller-0
2021-09-21 14:45:37,060 ERROR ERROR: Failed to add host <controller-0>: Failed command: /bin/podman run --rm --ipc=host --no-hosts --net=host --entrypoint /usr/bin/ceph --init -e CONTAINER_IMAGE=quay.io/ceph/daemon:v6.0.4-stable-6.0-pacific-centos-8-x86_64 -e NODE_NAME=controller-0 -e CEPH_USE_RANDOM_NONCE=1 -v /var/log/ceph/c50a77c0-b27c-4a89-8042-782508a870dd:/var/log/ceph:z -v /tmp/ceph-tmpfqjce370:/etc/ceph/ceph.client.admin.keyring:z -v /tmp/ceph-tmp2qoitnyh:/etc/ceph/ceph.conf:z quay.io/ceph/daemon:v6.0.4-stable-6.0-pacific-centos-8-x86_64 orch host add controller-0
2021-09-21 14:45:37,061 DEBUG Releasing lock 140121488108064 on /run/cephadm/c50a77c0-b27c-4a89-8042-782508a870dd.lock
2021-09-21 14:45:37,061 DEBUG Lock 140121488108064 released on /run/cephadm/c50a77c0-b27c-4a89-8042-782508a870dd.lock
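For reference, the recovery recipe that cephadm prints above can be run as one sequence from inside "cephadm shell" on the bootstrap node. This is nothing new, just the error message's own steps collected together (the ceph-admin user name comes from this deployment):

    ceph cephadm get-pub-key > ~/ceph.pub
    ssh-copy-id -f -i ~/ceph.pub ceph-admin@controller-0
    ceph cephadm get-ssh-config > ssh_config
    ceph config-key get mgr/cephadm/ssh_identity_key > ~/cephadm_private_key
    chmod 0600 ~/cephadm_private_key
    ssh -F ssh_config -i ~/cephadm_private_key ceph-admin@controller-0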
Re-provisioned the virtual baremetal and redeployed the overcloud with the tripleo-ansible ceph bootstrap task [1] modified to a) pass --skip-ssh and b) then fail, in order to stop the tripleo deployment at that point. Then, as the ceph-admin user on controller-0, I manually ran the commands which cephadm would have run had --skip-ssh been omitted [2]. However, this did not reproduce the bug [3].

[1] https://github.com/openstack/tripleo-ansible/blob/master/tripleo_ansible/roles/tripleo_cephadm/tasks/bootstrap.yaml#L43
[2] https://github.com/ceph/ceph/blob/master/src/cephadm/cephadm#L3894
[3] https://paste.opendev.org/show/809539/
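For anyone re-running this experiment: based on my reading of [2] at the time, the SSH setup that --skip-ssh bypasses amounts to roughly the following mgr commands (a sketch; the exact flags and ordering inside cephadm may differ), run wherever the admin keyring and the ceph-admin key files are both readable:

    # Feed the pre-distributed ceph-admin keypair and SSH config to the mgr/cephadm module
    ceph cephadm set-priv-key -i /home/ceph-admin/.ssh/id_rsa
    ceph cephadm set-pub-key -i /home/ceph-admin/.ssh/id_rsa.pub
    ceph cephadm set-user ceph-admin
    ceph cephadm set-ssh-config -i /home/ceph-admin/.ssh/config
    # Then add the bootstrap host itself, which is the step that fails in this bug
    ceph orch host add controller-0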
Re-provisioned the virtual baremetal and redeployed the overcloud with the tripleo-ansible ceph bootstrap task modified [1] to ensure an SSH config option was set. It made no difference and the deployment failed as usual. However, I was able to reproduce the bug simply by running "cephadm shell" and then trying to add the host the same way: it failed the first time, but re-running the same command a second time succeeded [2].

[1] https://paste.opendev.org/show/809544/

[2]
[ceph-admin@controller-0 ~]$ sudo cephadm shell
Inferring fsid ed21f87c-5d79-42d4-b49c-d37a2c228278
Inferring config /var/lib/ceph/ed21f87c-5d79-42d4-b49c-d37a2c228278/mon.controller-0/config
Using recent ceph image quay.io/ceph/daemon@sha256:06c8f7d23a48820eb59f0d977bc03479043512370a18337f200e9c04d7974f5b
[ceph: root@controller-0 /]# ceph orch host ls
[ceph: root@controller-0 /]# ceph orch host add controller-0
Error EINVAL: Failed to connect to controller-0 (controller-0).
Please make sure that the host is reachable and accepts connections using the cephadm SSH key

To add the cephadm SSH key to the host:
> ceph cephadm get-pub-key > ~/ceph.pub
> ssh-copy-id -f -i ~/ceph.pub ceph-admin@controller-0

To check that the host is reachable open a new shell with the --no-hosts flag:
> cephadm shell --no-hosts

Then run the following:
> ceph cephadm get-ssh-config > ssh_config
> ceph config-key get mgr/cephadm/ssh_identity_key > ~/cephadm_private_key
> chmod 0600 ~/cephadm_private_key
> ssh -F ssh_config -i ~/cephadm_private_key ceph-admin@controller-0

[ceph: root@controller-0 /]# ceph orch host add controller-0
Added host 'controller-0' with addr '192.168.24.24'
[ceph: root@controller-0 /]#
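Since the failure is transient (the identical command succeeds on the second try), a crude mitigation pending a root cause would be to retry the add. A sketch, run from inside "cephadm shell"; I'm not proposing this as the fix, it just matches the observed behavior:

    # Retry "orch host add" a few times before giving up
    for i in 1 2 3 4 5; do
        ceph orch host add controller-0 && break
        sleep 10
    done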
I modified the bootstrap task file [1] to add a task which verifies that the bootstrap node can SSH to its own hostname as tripleo_cephadm_ssh_user [2]. I then saw this task succeed while cephadm's "add host" command failed [3].

[1] https://github.com/openstack/tripleo-ansible/blob/master/tripleo_ansible/roles/tripleo_cephadm/tasks/bootstrap.yaml

[2]
- name: Create SSH config file for tripleo_cephadm_ssh_user
  copy:
    dest: "/home/{{ tripleo_cephadm_ssh_user }}/.ssh/config"
    mode: '0644'
    owner: "{{ tripleo_cephadm_ssh_user }}"
    group: "{{ tripleo_cephadm_ssh_user }}"
    content: |
      Host *
        User {{ tripleo_cephadm_ssh_user }}
        StrictHostKeyChecking no
        UserKnownHostsFile /dev/null
        ConnectTimeout=30

- name: Can bootstrap node SSH to its hostname as tripleo_cephadm_ssh_user
  shell: |
    ssh -F {{ conf }} -i {{ priv }} -l {{ user }} {{ host }} /bin/true
  vars:
    conf: "/home/{{ tripleo_cephadm_ssh_user }}/.ssh/config"
    priv: "/home/{{ tripleo_cephadm_ssh_user }}/.ssh/id_rsa"
    user: "{{ tripleo_cephadm_ssh_user }}"
    host: "{{ hostvars[inventory_hostname]['ansible_facts']['hostname'] }}"
  delegate_to: "{{ inventory_hostname }}"
  register: tripleo_cephadm_ssh_test
  until: tripleo_cephadm_ssh_test.rc == 0
  retries: 5
  delay: 10

- name: Bootstrap Ceph if there are no running Ceph Daemons
  block:
    - name: Run cephadm bootstrap
      shell: |
        {{ tripleo_cephadm_bin }} \
        --image {{ tripleo_cephadm_container_ns + '/' + tripleo_cephadm_container_image + ':' + tripleo_cephadm_container_tag }} \
        bootstrap \
        --skip-firewalld \
        --ssh-private-key /home/{{ tripleo_cephadm_ssh_user }}/.ssh/id_rsa \
        --ssh-public-key /home/{{ tripleo_cephadm_ssh_user }}/.ssh/id_rsa.pub \
        --ssh-config /home/{{ tripleo_cephadm_ssh_user }}/.ssh/config \
        ...

[3]
2021-09-24 18:18:49,797 p=471659 u=stack n=ansible | 2021-09-24 18:18:49.797395 | 525400de-00e5-cdbc-0879-00000000004d | TASK | Create SSH config file for tripleo_cephadm_ssh_user
2021-09-24 18:18:50,679 p=471659 u=stack n=ansible | 2021-09-24 18:18:50.677724 | 525400de-00e5-cdbc-0879-00000000004d | CHANGED | Create SSH config file for tripleo_cephadm_ssh_user | controller-0
2021-09-24 18:18:50,773 p=471659 u=stack n=ansible | 2021-09-24 18:18:50.772593 | 525400de-00e5-cdbc-0879-00000000004e | TASK | Can bootstrap node SSH to its hostname as tripleo_cephadm_ssh_user
2021-09-24 18:18:51,298 p=471659 u=stack n=ansible | 2021-09-24 18:18:51.297690 | 525400de-00e5-cdbc-0879-00000000004e | CHANGED | Can bootstrap node SSH to its hostname as tripleo_cephadm_ssh_user | controller-0 -> 192.168.24.40
2021-09-24 18:18:51,299 p=471659 u=stack n=ansible | [WARNING]: ('controller-0 -> 192.168.24.40', '525400de-00e5-cdbc-0879-00000000004e') missing from stats
2021-09-24 18:18:51,310 p=471659 u=stack n=ansible | 2021-09-24 18:18:51.310435 | 525400de-00e5-cdbc-0879-000000000050 | TASK | Run cephadm bootstrap
2021-09-24 18:21:12,229 p=471659 u=stack n=ansible | 2021-09-24 18:21:12.227983 | 525400de-00e5-cdbc-0879-000000000050 | FATAL | Run cephadm bootstrap | controller-0 | error={"changed": true, "cmd": "/usr/sbin/cephadm --image quay.io/ceph/daemon:v6.0.4-stable-6.0-pacific-centos-8-x86_64 bootstrap --skip-firewalld --ssh-private-key /home/ceph-admin/.ssh/id_rsa --ssh-public-key /home/ceph-admin/.ssh/id_rsa.pub --ssh-config /home/ceph-admin/.ssh/config --ssh-user ceph-admin --allow-fqdn-hostname --output-keyring /etc/ceph/ceph.client.admin.keyring --output-config /etc/ceph/ceph.conf --fsid 2bfc75cd-8043-4b4d-9707-9223f8a11026 --config /home/ceph-admin/bootstrap_ceph.conf \\--skip-monitoring-stack --skip-dashboard --mon-ip 
172.17.3.24\n", "delta": "0:02:20.638702", "end": "2021-09-24 18:21:12.177156", "msg": "non-zero return code", "rc": 1, "start": "2021-09-24 18:18:51.538454", "stderr": "Verifying podman|docker is present...\nVerifying lvm2 is present...\nVerifying time synchronization is in place...\nUnit chronyd.service is enabled and running\nRepeating the final host check...\npodman|docker (/bin/podman) is present\nsystemctl is present\nlvcreate is present\nUnit chronyd.service is enabled and running\nHost looks OK\nCluster fsid: 2bfc75cd-8043-4b4d-9707-9223f8a11026\nVerifying IP 172.17.3.24 port 3300 ...\nVerifying IP 172.17.3.24 port 6789 ...\nMon IP 172.17.3.24 is in CIDR network 172.17.3.0/24\n- internal network (--cluster-network) has not been provided, OSD replication will default to the public_network\nPulling container image quay.io/ceph/daemon:v6.0.4-stable-6.0-pacific-centos-8-x86_64...\nCeph version: ceph version 16.2.5 (0883bdea7337b95e4b611c768c0279868462204a) pacific (stable)\nExtracting ceph user uid/gid from container image...\nCreating initial keys...\nCreating initial monmap...\nCreating mon...\nWaiting for mon to start...\nWaiting for mon...\nmon is available\nAssimilating anything we can from ceph.conf...\nGenerating new minimal ceph.conf...\nRestarting the monitor...\nSetting mon public_network to 172.17.3.0/24\nWrote config to /etc/ceph/ceph.conf\nWrote keyring to /etc/ceph/ceph.client.admin.keyring\nCreating mgr...\nVerifying port 9283 ...\nWaiting for mgr to start...\nWaiting for mgr...\nmgr not available, waiting (1/15)...\nmgr not available, waiting (2/15)...\nmgr not available, waiting (3/15)...\nmgr not available, waiting (4/15)...\nmgr is available\nEnabling cephadm module...\nWaiting for the mgr to restart...\nWaiting for mgr epoch 5...\nmgr epoch 5 is available\nSetting orchestrator backend to cephadm...\nUsing provided ssh config...\nUsing provided ssh keys...\nAdding host controller-0...\nNon-zero exit code 22 from /bin/podman run --rm --ipc=host --no-hosts --net=host --entrypoint /usr/bin/ceph --init -e CONTAINER_IMAGE=quay.io/ceph/daemon:v6.0.4-stable-6.0-pacific-centos-8-x86_64 -e NODE_NAME=controller-0 -e CEPH_USE_RANDOM_NONCE=1 -v /var/log/ceph/2bfc75cd-8043-4b4d-9707-9223f8a11026:/var/log/ceph:z -v /tmp/ceph-tmpeq3ld00s:/etc/ceph/ceph.client.admin.keyring:z -v /tmp/ceph-tmpxa9_18ih:/etc/ceph/ceph.conf:z quay.io/ceph/daemon:v6.0.4-stable-6.0-pacific-centos-8-x86_64 orch host add controller-0\n/usr/bin/ceph: stderr Error EINVAL: Failed to connect to controller-0 (controller-0).\n/usr/bin/ceph: stderr Please make sure that the host is reachable and accepts connections using the cephadm SSH key\n/usr/bin/ceph: stderr \n/usr/bin/ceph: stderr To add the cephadm SSH key to the host:\n/usr/bin/ceph: stderr > ceph cephadm get-pub-key > ~/ceph.pub\n/usr/bin/ceph: stderr > ssh-copy-id -f -i ~/ceph.pub ceph-admin@controller-0\n/usr/bin/ceph: stderr \n/usr/bin/ceph: stderr To check that the host is reachable open a new shell with the --no-hosts flag:\n/usr/bin/ceph: stderr > cephadm shell --no-hosts\n/usr/bin/ceph: stderr \n/usr/bin/ceph: stderr Then run the following:\n/usr/bin/ceph: stderr > ceph cephadm get-ssh-config > ssh_config\n/usr/bin/ceph: stderr > ceph config-key get mgr/cephadm/ssh_identity_key > ~/cephadm_private_key\n/usr/bin/ceph: stderr > chmod 0600 ~/cephadm_private_key\n/usr/bin/ceph: stderr > ssh -F ssh_config -i ~/cephadm_private_key ceph-admin@controller-0\nERROR: Failed to add host <controller-0>: Failed command: /bin/podman run --rm --ipc=host --no-hosts 
--net=host --entrypoint /usr/bin/ceph --init -e CONTAINER_IMAGE=quay.io/ceph/daemon:v6.0.4-stable-6.0-pacific-centos-8-x86_64 -e NODE_NAME=controller-0 -e CEPH_USE_RANDOM_NONCE=1 -v /var/log/ceph/2bfc75cd-8043-4b4d-9707-9223f8a11026:/var/log/ceph:z -v /tmp/ceph-tmpeq3ld00s:/etc/ceph/ceph.client.admin.keyring:z -v /tmp/ceph-tmpxa9_18ih:/etc/ceph/ceph.conf:z quay.io/ceph/daemon:v6.0.4-stable-6.0-pacific-centos-8-x86_64 orch host add controller-0", "stderr_lines": ["Verifying podman|docker is present...", "Verifying lvm2 is present...", "Verifying time synchronization is in place...", "Unit chronyd.service is enabled and running", "Repeating the final host check...", "podman|docker (/bin/podman) is present", "systemctl is present", "lvcreate is present", "Unit chronyd.service is enabled and running", "Host looks OK", "Cluster fsid: 2bfc75cd-8043-4b4d-9707-9223f8a11026", "Verifying IP 172.17.3.24 port 3300 ...", "Verifying IP 172.17.3.24 port 6789 ...", "Mon IP 172.17.3.24 is in CIDR network 172.17.3.0/24", "- internal network (--cluster-network) has not been provided, OSD replication will default to the public_network", "Pulling container image quay.io/ceph/daemon:v6.0.4-stable-6.0-pacific-centos-8-x86_64...", "Ceph version: ceph version 16.2.5 (0883bdea7337b95e4b611c768c0279868462204a) pacific (stable)", "Extracting ceph user uid/gid from container image...", "Creating initial keys...", "Creating initial monmap...", "Creating mon...", "Waiting for mon to start...", "Waiting for mon...", "mon is available", "Assimilating anything we can from ceph.conf...", "Generating new minimal ceph.conf...", "Restarting the monitor...", "Setting mon public_network to 172.17.3.0/24", "Wrote config to /etc/ceph/ceph.conf", "Wrote keyring to /etc/ceph/ceph.client.admin.keyring", "Creating mgr...", "Verifying port 9283 ...", "Waiting for mgr to start...", "Waiting for mgr...", "mgr not available, waiting (1/15)...", "mgr not available, waiting (2/15)...", "mgr not available, waiting (3/15)...", "mgr not available, waiting (4/15)...", "mgr is available", "Enabling cephadm module...", "Waiting for the mgr to restart...", "Waiting for mgr epoch 5...", "mgr epoch 5 is available", "Setting orchestrator backend to cephadm...", "Using provided ssh config...", "Using provided ssh keys...", "Adding host controller-0...", "Non-zero exit code 22 from /bin/podman run --rm --ipc=host --no-hosts --net=host --entrypoint /usr/bin/ceph --init -e CONTAINER_IMAGE=quay.io/ceph/daemon:v6.0.4-stable-6.0-pacific-centos-8-x86_64 -e NODE_NAME=controller-0 -e CEPH_USE_RANDOM_NONCE=1 -v /var/log/ceph/2bfc75cd-8043-4b4d-9707-9223f8a11026:/var/log/ceph:z -v /tmp/ceph-tmpeq3ld00s:/etc/ceph/ceph.client.admin.keyring:z -v /tmp/ceph-tmpxa9_18ih:/etc/ceph/ceph.conf:z quay.io/ceph/daemon:v6.0.4-stable-6.0-pacific-centos-8-x86_64 orch host add controller-0", "/usr/bin/ceph: stderr Error EINVAL: Failed to connect to controller-0 (controller-0).", "/usr/bin/ceph: stderr Please make sure that the host is reachable and accepts connections using the cephadm SSH key", "/usr/bin/ceph: stderr ", "/usr/bin/ceph: stderr To add the cephadm SSH key to the host:", "/usr/bin/ceph: stderr > ceph cephadm get-pub-key > ~/ceph.pub", "/usr/bin/ceph: stderr > ssh-copy-id -f -i ~/ceph.pub ceph-admin@controller-0", "/usr/bin/ceph: stderr ", "/usr/bin/ceph: stderr To check that the host is reachable open a new shell with the --no-hosts flag:", "/usr/bin/ceph: stderr > cephadm shell --no-hosts", "/usr/bin/ceph: stderr ", "/usr/bin/ceph: stderr Then run the following:", 
"/usr/bin/ceph: stderr > ceph cephadm get-ssh-config > ssh_config", "/usr/bin/ceph: stderr > ceph config-key get mgr/cephadm/ssh_identity_key > ~/cephadm_private_key", "/usr/bin/ceph: stderr > chmod 0600 ~/cephadm_private_key", "/usr/bin/ceph: stderr > ssh -F ssh_config -i ~/cephadm_private_key ceph-admin@controller-0", "ERROR: Failed to add host <controller-0>: Failed command: /bin/podman run --rm --ipc=host --no-hosts --net=host --entrypoint /usr/bin/ceph --init -e CONTAINER_IMAGE=quay.io/ceph/daemon:v6.0.4-stable-6.0-pacific-centos-8-x86_64 -e NODE_NAME=controller-0 -e CEPH_USE_RANDOM_NONCE=1 -v /var/log/ceph/2bfc75cd-8043-4b4d-9707-9223f8a11026:/var/log/ceph:z -v /tmp/ceph-tmpeq3ld00s:/etc/ceph/ceph.client.admin.keyring:z -v /tmp/ceph-tmpxa9_18ih:/etc/ceph/ceph.conf:z quay.io/ceph/daemon:v6.0.4-stable-6.0-pacific-centos-8-x86_64 orch host add controller-0"], "stdout": "", "stdout_lines": []} 2021-09-24 18:21:12,235 p=471659 u=stack n=ansible | PLAY RECAP ********************************************************************* 2021-09-24 18:21:12,236 p=471659 u=stack n=ansible | controller-0 : ok=12 changed=7 unreachable=0 failed=1 skipped=8 rescued=0 ignored=0 2021-09-24 18:21:12,238 p=471659 u=stack n=ansible | 2021-09-24 18:21:12.237460 | ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ Summary Information ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Re-testing the scenario described in comment #6, but with the SSH test run inside the ceph container, both with and without --no-hosts, as seen in the modified SSH test task below.

- name: Can bootstrap node SSH to its hostname as tripleo_cephadm_ssh_user
  shell: >-
    /bin/podman run --rm --ipc=host --net=host {{ no_hosts }}
    --entrypoint /usr/bin/ssh
    -e CONTAINER_IMAGE={{ image }}
    -e NODE_NAME={{ host }}
    -e CEPH_USE_RANDOM_NONCE=1
    -v {{ conf }}:{{ c_conf }}
    -v {{ priv }}:{{ c_priv }}
    {{ image }}
    -F {{ c_conf }} -i {{ c_priv }} -l {{ user }} {{ host }} /bin/true
  vars:
    no_hosts: "--no-hosts"
    #no_hosts: ""
    conf: "/home/{{ tripleo_cephadm_ssh_user }}/.ssh/config"
    c_conf: "/tmp/config"
    priv: "/home/{{ tripleo_cephadm_ssh_user }}/.ssh/id_rsa"
    c_priv: "/tmp/id_rsa"
    image: "{{ tripleo_cephadm_container_ns + '/' + tripleo_cephadm_container_image + ':' + tripleo_cephadm_container_tag }}"
    user: "{{ tripleo_cephadm_ssh_user }}"
    host: "{{ hostvars[inventory_hostname]['ansible_facts']['hostname'] }}"
  delegate_to: "{{ inventory_hostname }}"
  become: true
  register: tripleo_cephadm_ssh_test
  until: tripleo_cephadm_ssh_test.rc == 0
  retries: 10
  delay: 5
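For quicker iteration outside of ansible, the same comparison can be run directly on the bootstrap node. A sketch with this environment's values substituted in (image tag and paths taken from the logs above):

    img=quay.io/ceph/daemon:v6.0.4-stable-6.0-pacific-centos-8-x86_64
    for flag in "--no-hosts" ""; do
        sudo /bin/podman run --rm --ipc=host --net=host $flag \
            --entrypoint /usr/bin/ssh \
            -v /home/ceph-admin/.ssh/config:/tmp/config \
            -v /home/ceph-admin/.ssh/id_rsa:/tmp/id_rsa \
            "$img" -F /tmp/config -i /tmp/id_rsa -l ceph-admin controller-0 /bin/true \
            && echo "flag='$flag': OK" || echo "flag='$flag': FAILED"
    done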
The Ansible role was modified so it has the following two sequential tasks:

A. SSH test (from comment #7)
B. Bootstrap

When A uses --no-hosts and B uses --no-hosts, then A fails [1].
When A does not use --no-hosts and B uses --no-hosts, then A succeeds [2] and B fails as seen in comment #3.

[1]
2021-09-25 14:08:26,460 p=436212 u=stack n=ansible | 2021-09-25 14:08:26.460565 | 525400de-00e5-0d5d-62e6-00000000004d | TASK | Create SSH config file for tripleo_cephadm_ssh_user
2021-09-25 14:08:27,192 p=436212 u=stack n=ansible | 2021-09-25 14:08:27.191605 | 525400de-00e5-0d5d-62e6-00000000004d | CHANGED | Create SSH config file for tripleo_cephadm_ssh_user | controller-0
2021-09-25 14:08:27,282 p=436212 u=stack n=ansible | 2021-09-25 14:08:27.281953 | 525400de-00e5-0d5d-62e6-00000000004e | TASK | Can bootstrap node SSH to its hostname as tripleo_cephadm_ssh_user
2021-09-25 14:12:02,494 p=436212 u=stack n=ansible | 2021-09-25 14:12:02.493201 | 525400de-00e5-0d5d-62e6-00000000004e | FATAL | Can bootstrap node SSH to its hostname as tripleo_cephadm_ssh_user | controller-0 -> 192.168.24.51 | error={"attempts": 10, "changed": true, "cmd": "/bin/podman run --rm --ipc=host --net=host --no-hosts --entrypoint /usr/bin/ssh -e CONTAINER_IMAGE=quay.io/ceph/daemon:v6.0.4-stable-6.0-pacific-centos-8-x86_64 -e NODE_NAME=controller-0 -e CEPH_USE_RANDOM_NONCE=1 -v /home/ceph-admin/.ssh/config:/tmp/config -v /home/ceph-admin/.ssh/id_rsa:/tmp/id_rsa quay.io/ceph/daemon:v6.0.4-stable-6.0-pacific-centos-8-x86_64 -F /tmp/config -i /tmp/id_rsa -l ceph-admin controller-0 /bin/true", "delta": "0:00:08.557218", "end": "2021-09-25 14:12:02.447699", "msg": "non-zero return code", "rc": 255, "start": "2021-09-25 14:11:53.890481", "stderr": "ssh: connect to host controller-0 port 22: No route to host", "stderr_lines": ["ssh: connect to host controller-0 port 22: No route to host"], "stdout": "", "stdout_lines": []}
2021-09-25 14:12:02,496 p=436212 u=stack n=ansible | [WARNING]: ('controller-0 -> 192.168.24.51', '525400de-00e5-0d5d-62e6-00000000004e') missing from stats

[2]
2021-09-27 01:16:52,698 p=696678 u=stack n=ansible | 2021-09-27 01:16:52.698384 | 525400de-00e5-28b1-713b-00000000004d | TASK | Create SSH config file for tripleo_cephadm_ssh_user
2021-09-27 01:16:53,432 p=696678 u=stack n=ansible | 2021-09-27 01:16:53.431110 | 525400de-00e5-28b1-713b-00000000004d | CHANGED | Create SSH config file for tripleo_cephadm_ssh_user | controller-0
2021-09-27 01:16:53,520 p=696678 u=stack n=ansible | 2021-09-27 01:16:53.520145 | 525400de-00e5-28b1-713b-00000000004e | TASK | Can bootstrap node SSH to its hostname as tripleo_cephadm_ssh_user
2021-09-27 01:18:02,028 p=696678 u=stack n=ansible | 2021-09-27 01:18:02.027077 | 525400de-00e5-28b1-713b-00000000004e | CHANGED | Can bootstrap node SSH to its hostname as tripleo_cephadm_ssh_user | controller-0 -> 192.168.24.34
2021-09-27 01:18:02,030 p=696678 u=stack n=ansible | [WARNING]: ('controller-0 -> 192.168.24.34', '525400de-00e5-28b1-713b-00000000004e') missing from stats
2021-09-27 01:18:02,044 p=696678 u=stack n=ansible | 2021-09-27 01:18:02.043477 | 525400de-00e5-28b1-713b-000000000050 | TASK | Run cephadm bootstrap
2021-09-27 01:19:29,315 p=696678 u=stack n=ansible | 2021-09-27 01:19:29.313644 | 525400de-00e5-28b1-713b-000000000050 | FATAL | Run cephadm bootstrap | controller-0 | error={"changed": <cut>
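The key clue in [1] is "No route to host" only when --no-hosts is in play: that looks like a name-resolution difference, not a key problem. Assuming getent is present in the ceph image (it's CentOS based, so it should be), comparing how the name resolves inside the container with and without --no-hosts would confirm this:

    img=quay.io/ceph/daemon:v6.0.4-stable-6.0-pacific-centos-8-x86_64
    sudo podman run --rm --net=host --no-hosts --entrypoint /usr/bin/getent "$img" hosts controller-0
    sudo podman run --rm --net=host --entrypoint /usr/bin/getent "$img" hosts controller-0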
Modifying the SSH task to use "ssh -v" and logging the output [0] reveals what is happening within the ceph container when --no-hosts is passed:

- Attila's system (which exhibits this bug) falls back to DNS to resolve controller-0 [1]. The DNS server in /etc/resolv.conf returns an IP it cannot reach [2], so the SSH connection fails.
- My system uses the IPv6 autoconfig'd address, which does work [3].
- After some amount of time, Attila's IPv6 autoconfig'd address also works [4].

I don't know why his system's IPv6 autoconfig takes time, but I don't think that's relevant. What's more interesting is that we have been relying on IPv6 autoconfig at all. TripleO manages the /etc/hosts file for all hosts that it deploys [5]. TripleO has the option to update a DNS service, but this isn't required. Always using --no-hosts blocks our access to the hosts file that tripleo configures for host resolution [6]. I can work around it by modifying the SSH config managed by tripleo's cephadm role so that it's effectively like a hosts file (since it can map the name to the IP) [7], but it might be nicer to have an option in cephadm so that a user can simply choose to rely on /etc/hosts in their ceph container at bootstrap if they wish.

[0]
- name: Can bootstrap node SSH to its hostname as tripleo_cephadm_ssh_user
  shell: >-
    /bin/podman run --rm --ipc=host --net=host --no-hosts
    --entrypoint /usr/bin/ssh
    -e CONTAINER_IMAGE={{ image }}
    -e NODE_NAME={{ host }}
    -e CEPH_USE_RANDOM_NONCE=1
    -v {{ conf }}:{{ c_conf }}
    -v {{ priv }}:{{ c_priv }}
    {{ image }}
    -v -F {{ c_conf }} -i {{ c_priv }} -l {{ user }} {{ host }} /bin/true
  vars:
    conf: "/home/{{ tripleo_cephadm_ssh_user }}/.ssh/config"
    c_conf: "/tmp/config"
    priv: "/home/{{ tripleo_cephadm_ssh_user }}/.ssh/id_rsa"
    c_priv: "/tmp/id_rsa"
    image: "{{ tripleo_cephadm_container_ns + '/' + tripleo_cephadm_container_image + ':' + tripleo_cephadm_container_tag }}"
    user: "{{ tripleo_cephadm_ssh_user }}"
    host: "{{ hostvars[inventory_hostname]['ansible_facts']['hostname'] }}"
  delegate_to: "{{ inventory_hostname }}"
  become: true
  register: tripleo_cephadm_ssh_test
  until: tripleo_cephadm_ssh_test.rc == 0
  ignore_errors: true
  failed_when: tripleo_cephadm_ssh_test.rc > 0
  retries: 10
  delay: 5

- debug:
    msg: "{{ tripleo_cephadm_ssh_test }}"

[1]
2021-09-27 18:00:51,886 p=215519 u=stack n=ansible | 2021-09-27 18:00:51.886143 | 525400de-00e5-afff-459e-00000000004f | OK | tripleo_cephadm : debug | controller-0 | result={
    "changed": false,
    "msg": {
        "attempts": 10,
        "changed": true,
        "cmd": "/bin/podman run --rm --ipc=host --net=host --no-hosts --entrypoint /usr/bin/ssh -e CONTAINER_IMAGE=quay.io/ceph/daemon:v6.0.4-stable-6.0-pacific-centos-8-x86_64 -e NODE_NAME=controller-0 -e CEPH_USE_RANDOM_NONCE=1 -v /home/ceph-admin/.ssh/config:/tmp/config -v /home/ceph-admin/.ssh/id_rsa:/tmp/id_rsa quay.io/ceph/daemon:v6.0.4-stable-6.0-pacific-centos-8-x86_64 -v -F /tmp/config -i /tmp/id_rsa -l ceph-admin controller-0 /bin/true",
        "delta": "0:00:08.520137",
        "end": "2021-09-27 18:00:51.793941",
        "failed": true,
        "failed_when_result": true,
        "msg": "non-zero return code",
        "rc": 255,
        "start": "2021-09-27 18:00:43.273804",
        "stderr": "OpenSSH_8.0p1, OpenSSL 1.1.1g FIPS 21 Apr 2020\r\ndebug1: Reading configuration data /tmp/config\r\ndebug1: /tmp/config line 1: Applying options for *\r\ndebug1: Connecting to controller-0 [10.0.0.3] port 22.\r\ndebug1: connect to address 10.0.0.3 port 22: No route to host\r\nssh: connect to host controller-0 port 22: No route to host",
        "stderr_lines": [
            "OpenSSH_8.0p1, OpenSSL 1.1.1g FIPS 21 Apr 2020",
            "debug1: Reading configuration data /tmp/config",
            "debug1: /tmp/config line 1: Applying options for *",
            "debug1: Connecting to controller-0 [10.0.0.3] port 22.",
            "debug1: connect to address 10.0.0.3 port 22: No route to host",
            "ssh: connect to host controller-0 port 22: No route to host"
        ],
        "stdout": "",
        "stdout_lines": []
    }
}

[2]
[root@controller-1 ~]# cat /etc/resolv.conf
# Generated by NetworkManager
nameserver 172.16.0.1
nameserver 10.0.0.1
[root@controller-1 ~]#
[root@controller-1 ~]# dig @10.0.0.1 controller-0 +short
10.0.0.3
[root@controller-1 ~]#
[root@controller-1 ~]# ssh 10.0.0.3
ssh: connect to host 10.0.0.3 port 22: No route to host
[root@controller-1 ~]#

[3]
TASK [Can bootstrap node SSH to its hostname as tripleo_cephadm_ssh_user] ********************
changed: [oc0-controller-0 -> 192.168.24.8]

TASK [tripleo_cephadm : debug] ***************************************************************
ok: [oc0-controller-0] => {
    "msg": {
        ...
        "stderr_lines": [
            "OpenSSH_8.0p1, OpenSSL 1.1.1g FIPS 21 Apr 2020",
            "debug1: Reading configuration data /tmp/config",
            "debug1: /tmp/config line 1: Applying options for *",
            "debug1: Connecting to oc0-controller-0 [fe80::2642:ff:fe79:d558%ens3] port 22.",
            "debug1: fd 3 clearing O_NONBLOCK",
            "debug1: Connection established.",

[4]
"stderr_lines": [
    "OpenSSH_8.0p1, OpenSSL 1.1.1g FIPS 21 Apr 2020",
    "debug1: Reading configuration data /tmp/config",
    "debug1: /tmp/config line 1: Applying options for *",
    "debug1: Connecting to controller-0 [fe80::5054:ff:fe72:5432%ens3] port 22.",
    "debug1: fd 3 clearing O_NONBLOCK",

[5] https://github.com/openstack/tripleo-ansible/blob/stable/wallaby/tripleo_ansible/roles/tripleo_hosts_entries/tasks/main.yml

[6]
[root@controller-0 ~]# /bin/podman run --rm --ipc=host --no-hosts --net=host --entrypoint /usr/bin/cat -e CONTAINER_IMAGE=quay.io/ceph/daemon:v6.0.4-stable-6.0-pacific-centos-8-x86_64 -e NODE_NAME=controller-0 -e CEPH_USE_RANDOM_NONCE=1 -v /home/ceph-admin/.ssh/config:/tmp/config -v /home/ceph-admin/.ssh/id_rsa:/tmp/id_rsa quay.io/ceph/daemon:v6.0.4-stable-6.0-pacific-centos-8-x86_64 /etc/hosts
127.0.0.1   localhost localhost.localdomain localhost4 localhost4.localdomain4
::1         localhost localhost.localdomain localhost6 localhost6.localdomain6
[root@controller-0 ~]#
[root@controller-0 ~]# /bin/podman run --rm --ipc=host --net=host --entrypoint /usr/bin/cat -e CONTAINER_IMAGE=quay.io/ceph/daemon:v6.0.4-stable-6.0-pacific-centos-8-x86_64 -e NODE_NAME=controller-0 -e CEPH_USE_RANDOM_NONCE=1 -v /home/ceph-admin/.ssh/config:/tmp/config -v /home/ceph-admin/.ssh/id_rsa:/tmp/id_rsa quay.io/ceph/daemon:v6.0.4-stable-6.0-pacific-centos-8-x86_64 /etc/hosts
# START_HOST_ENTRIES_FOR_STACK: overcloud
172.17.3.31 ceph-0.redhat.local ceph-0
172.17.3.31 ceph-0.storage.redhat.local ceph-0.storage
172.17.4.110 ceph-0.storagemgmt.redhat.local ceph-0.storagemgmt
192.168.24.13 ceph-0.ctlplane.redhat.local ceph-0.ctlplane
172.17.3.58 ceph-1.redhat.local ceph-1
172.17.3.58 ceph-1.storage.redhat.local ceph-1.storage
172.17.4.109 ceph-1.storagemgmt.redhat.local ceph-1.storagemgmt
192.168.24.24 ceph-1.ctlplane.redhat.local ceph-1.ctlplane
172.17.3.124 ceph-2.redhat.local ceph-2
172.17.3.124 ceph-2.storage.redhat.local ceph-2.storage
172.17.4.38 ceph-2.storagemgmt.redhat.local ceph-2.storagemgmt
192.168.24.12 ceph-2.ctlplane.redhat.local ceph-2.ctlplane
172.17.1.18 compute-0.redhat.local compute-0
172.17.3.100 compute-0.storage.redhat.local compute-0.storage
172.17.1.18 compute-0.internalapi.redhat.local compute-0.internalapi
172.17.2.133 compute-0.tenant.redhat.local compute-0.tenant
192.168.24.35 compute-0.ctlplane.redhat.local compute-0.ctlplane
172.17.1.120 compute-1.redhat.local compute-1
172.17.3.141 compute-1.storage.redhat.local compute-1.storage
172.17.1.120 compute-1.internalapi.redhat.local compute-1.internalapi
172.17.2.31 compute-1.tenant.redhat.local compute-1.tenant
192.168.24.50 compute-1.ctlplane.redhat.local compute-1.ctlplane
172.17.1.76 controller-0.redhat.local controller-0
172.17.3.24 controller-0.storage.redhat.local controller-0.storage
172.17.4.17 controller-0.storagemgmt.redhat.local controller-0.storagemgmt
172.17.1.76 controller-0.internalapi.redhat.local controller-0.internalapi
172.17.2.45 controller-0.tenant.redhat.local controller-0.tenant
10.0.0.120 controller-0.external.redhat.local controller-0.external
192.168.24.34 controller-0.ctlplane.redhat.local controller-0.ctlplane
172.17.1.16 controller-1.redhat.local controller-1
172.17.3.68 controller-1.storage.redhat.local controller-1.storage
172.17.4.72 controller-1.storagemgmt.redhat.local controller-1.storagemgmt
172.17.1.16 controller-1.internalapi.redhat.local controller-1.internalapi
172.17.2.37 controller-1.tenant.redhat.local controller-1.tenant
10.0.0.145 controller-1.external.redhat.local controller-1.external
192.168.24.10 controller-1.ctlplane.redhat.local controller-1.ctlplane
172.17.1.23 controller-2.redhat.local controller-2
172.17.3.134 controller-2.storage.redhat.local controller-2.storage
172.17.4.68 controller-2.storagemgmt.redhat.local controller-2.storagemgmt
172.17.1.23 controller-2.internalapi.redhat.local controller-2.internalapi
172.17.2.75 controller-2.tenant.redhat.local controller-2.tenant
10.0.0.131 controller-2.external.redhat.local controller-2.external
192.168.24.18 controller-2.ctlplane.redhat.local controller-2.ctlplane
192.168.24.1 undercloud-0.ctlplane.redhat.local undercloud-0.ctlplane
192.168.24.28 overcloud.ctlplane.localdomain
172.17.3.145 overcloud.storage.localdomain
172.17.4.95 overcloud.storagemgmt.localdomain
172.17.1.99 overcloud.internalapi.localdomain
10.0.0.102 overcloud.localdomain
# END_HOST_ENTRIES_FOR_STACK: overcloud
127.0.0.1   localhost localhost.localdomain localhost4 localhost4.localdomain4
::1         localhost localhost.localdomain localhost6 localhost6.localdomain6
127.0.1.1 controller-0 controller-0 elastic_moore
[root@controller-0 ~]#

[7]
- name: Create SSH config file for tripleo_cephadm_ssh_user
  copy:
    dest: "/home/{{ tripleo_cephadm_ssh_user }}/.ssh/config"
    mode: '0644'
    owner: "{{ tripleo_cephadm_ssh_user }}"
    group: "{{ tripleo_cephadm_ssh_user }}"
    content: |
      Host {{ bootstrap_host_name }}
        Hostname {{ bootstrap_host_ip }}
        User {{ tripleo_cephadm_ssh_user }}
        StrictHostKeyChecking no
        UserKnownHostsFile /dev/null
        ConnectTimeout=30
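For clarity, here is roughly what [7] renders to on the bootstrap node, assuming bootstrap_host_name and bootstrap_host_ip are set to the node's short hostname and its ctlplane IP (the IP below is taken from the hosts file above; which network to pin is a deployment choice, not something [7] decides):

    [root@controller-0 ~]# cat /home/ceph-admin/.ssh/config
    Host controller-0
      Hostname 192.168.24.34
      User ceph-admin
      StrictHostKeyChecking no
      UserKnownHostsFile /dev/null
      ConnectTimeout=30

With this in place, SSH inside the container no longer needs /etc/hosts or DNS to map the name to a reachable address, which is why it sidesteps the --no-hosts problem.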
cephadm already removed the --no-hosts option [1], so we don't need a new option to avoid it:

- Downstream 5.0 does not have --no-hosts.
- Downstream 5.1 will not have --no-hosts.
- Old upstream versions of pacific have --no-hosts, but it was removed because it caused bugs (like this one).

If you're going to test upstream versions, then please use a modern version, e.g. in TripleO we use the latest version of [2] which passed our CI.

[1] https://github.com/ceph/ceph/commit/d1bb94ba4c4b8401bf7799b7da3a8f5c2fd228c6
[2] https://cbs.centos.org/koji/packageinfo?packageID=8439
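As a quick sanity check of which behavior a given node has, one could grep the installed cephadm script for the flag (a sketch; the path assumes the packaged location used in this deployment, and cephadm being a single script at that path):

    if grep -q -- '--no-hosts' /usr/sbin/cephadm; then
        echo 'this cephadm still passes --no-hosts to podman'
    else
        echo 'OK: --no-hosts has been removed'
    fi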