Bug 2005333
| Summary: | cephadm bootstrap fails to add the host doing the bootstrap | | |
|---|---|---|---|
| Product: | Red Hat OpenStack | Reporter: | John Fulton <johfulto> |
| Component: | tripleo-ansible | Assignee: | John Fulton <johfulto> |
| Status: | CLOSED UPSTREAM | QA Contact: | Yogev Rabl <yrabl> |
| Severity: | unspecified | Docs Contact: | |
| Priority: | unspecified | | |
| Version: | 17.0 (Wallaby) | CC: | fpantano |
| Target Milestone: | --- | | |
| Target Release: | --- | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2021-09-28 12:42:58 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
|
Description
John Fulton
2021-09-17 13:02:28 UTC
The keys were correctly distributed:

```
[stack@undercloud-0 ~]$ cd ~/config-download/overcloud/cephadm
[stack@undercloud-0 cephadm]$ ansible -i inventory.yml mons,osds -b -m shell -a "ls -l /home/ceph-admin/.ssh/"
ceph-1 | CHANGED | rc=0 >>
total 4
-rw-------. 1 ceph-admin ceph-admin 554 Sep 21 14:43 authorized_keys
controller-0 | CHANGED | rc=0 >>
total 16
-rw-------. 1 ceph-admin ceph-admin 1108 Sep 21 14:43 authorized_keys
-rw-------. 1 ceph-admin ceph-admin 2622 Sep 21 14:43 id_rsa
-rw-r--r--. 1 ceph-admin ceph-admin 553 Sep 21 14:43 id_rsa.pub
-rw-r--r--. 1 ceph-admin ceph-admin 186 Sep 21 15:14 known_hosts
controller-1 | CHANGED | rc=0 >>
total 12
-rw-------. 1 ceph-admin ceph-admin 1108 Sep 21 14:43 authorized_keys
-rw-------. 1 ceph-admin ceph-admin 2622 Sep 21 14:43 id_rsa
-rw-r--r--. 1 ceph-admin ceph-admin 553 Sep 21 14:43 id_rsa.pub
ceph-0 | CHANGED | rc=0 >>
total 4
-rw-------. 1 ceph-admin ceph-admin 554 Sep 21 14:43 authorized_keys
controller-2 | CHANGED | rc=0 >>
total 12
-rw-------. 1 ceph-admin ceph-admin 1108 Sep 21 14:43 authorized_keys
-rw-------. 1 ceph-admin ceph-admin 2622 Sep 21 14:43 id_rsa
-rw-r--r--. 1 ceph-admin ceph-admin 553 Sep 21 14:43 id_rsa.pub
ceph-2 | CHANGED | rc=0 >>
total 4
-rw-------. 1 ceph-admin ceph-admin 554 Sep 21 14:43 authorized_keys
[stack@undercloud-0 cephadm]$
```

The bootstrap was run on controller-0 and it failed to add itself, as per the following from /var/log/ceph/cephadm.log:

```
2021-09-21 14:45:37,059 INFO Non-zero exit code 22 from /bin/podman run --rm --ipc=host --no-hosts --net=host --entrypoint /usr/bin/ceph --init -e CONTAINER_IMAGE=quay.io/ceph/daemon:v6.0.4-stable-6.0-pacific-centos-8-x86_64 -e NODE_NAME=controller-0 -e CEPH_USE_RANDOM_NONCE=1 -v /var/log/ceph/c50a77c0-b27c-4a89-8042-782508a870dd:/var/log/ceph:z -v /tmp/ceph-tmpfqjce370:/etc/ceph/ceph.client.admin.keyring:z -v /tmp/ceph-tmp2qoitnyh:/etc/ceph/ceph.conf:z quay.io/ceph/daemon:v6.0.4-stable-6.0-pacific-centos-8-x86_64 orch host add controller-0
2021-09-21 14:45:37,059 INFO /usr/bin/ceph: stderr Error EINVAL: Failed to connect to controller-0 (controller-0).
2021-09-21 14:45:37,059 INFO /usr/bin/ceph: stderr Please make sure that the host is reachable and accepts connections using the cephadm SSH key
2021-09-21 14:45:37,059 INFO /usr/bin/ceph: stderr
2021-09-21 14:45:37,059 INFO /usr/bin/ceph: stderr To add the cephadm SSH key to the host:
2021-09-21 14:45:37,059 INFO /usr/bin/ceph: stderr > ceph cephadm get-pub-key > ~/ceph.pub
2021-09-21 14:45:37,059 INFO /usr/bin/ceph: stderr > ssh-copy-id -f -i ~/ceph.pub ceph-admin@controller-0
2021-09-21 14:45:37,059 INFO /usr/bin/ceph: stderr
2021-09-21 14:45:37,060 INFO /usr/bin/ceph: stderr To check that the host is reachable open a new shell with the --no-hosts flag:
2021-09-21 14:45:37,060 INFO /usr/bin/ceph: stderr > cephadm shell --no-hosts
2021-09-21 14:45:37,060 INFO /usr/bin/ceph: stderr
2021-09-21 14:45:37,060 INFO /usr/bin/ceph: stderr Then run the following:
2021-09-21 14:45:37,060 INFO /usr/bin/ceph: stderr > ceph cephadm get-ssh-config > ssh_config
2021-09-21 14:45:37,060 INFO /usr/bin/ceph: stderr > ceph config-key get mgr/cephadm/ssh_identity_key > ~/cephadm_private_key
2021-09-21 14:45:37,060 INFO /usr/bin/ceph: stderr > chmod 0600 ~/cephadm_private_key
2021-09-21 14:45:37,060 INFO /usr/bin/ceph: stderr > ssh -F ssh_config -i ~/cephadm_private_key ceph-admin@controller-0
2021-09-21 14:45:37,060 ERROR ERROR: Failed to add host <controller-0>: Failed command: /bin/podman run --rm --ipc=host --no-hosts --net=host --entrypoint /usr/bin/ceph --init -e CONTAINER_IMAGE=quay.io/ceph/daemon:v6.0.4-stable-6.0-pacific-centos-8-x86_64 -e NODE_NAME=controller-0 -e CEPH_USE_RANDOM_NONCE=1 -v /var/log/ceph/c50a77c0-b27c-4a89-8042-782508a870dd:/var/log/ceph:z -v /tmp/ceph-tmpfqjce370:/etc/ceph/ceph.client.admin.keyring:z -v /tmp/ceph-tmp2qoitnyh:/etc/ceph/ceph.conf:z quay.io/ceph/daemon:v6.0.4-stable-6.0-pacific-centos-8-x86_64 orch host add controller-0
2021-09-21 14:45:37,061 DEBUG Releasing lock 140121488108064 on /run/cephadm/c50a77c0-b27c-4a89-8042-782508a870dd.lock
2021-09-21 14:45:37,061 DEBUG Lock 140121488108064 released on /run/cephadm/c50a77c0-b27c-4a89-8042-782508a870dd.lock
```

Re-provisioned virtual baremetal and redeployed the overcloud with the tripleo-ansible ceph bootstrap task [1] modified to a) pass --skip-ssh and then b) fail, in order to stop the tripleo deployment. Then, as the ceph-admin user on controller-0, manually ran the CLI arguments that cephadm would have run if --skip-ssh had been omitted [2]. This did not reproduce the bug, however [3].

[1] https://github.com/openstack/tripleo-ansible/blob/master/tripleo_ansible/roles/tripleo_cephadm/tasks/bootstrap.yaml#L43
[2] https://github.com/ceph/ceph/blob/master/src/cephadm/cephadm#L3894
[3] https://paste.opendev.org/show/809539/

Re-provisioned virtual baremetal and redeployed the overcloud with the tripleo-ansible ceph bootstrap task modified [1] to ensure an SSH config option was set. It made no difference and the deployment failed as usual. However, I was able to reproduce the bug simply by running "cephadm shell" and then trying to add the host the same way.
It failed the first time, but re-running the same command a second time did not fail [2].

[1] https://paste.opendev.org/show/809544/

[2]

```
[ceph-admin@controller-0 ~]$ sudo cephadm shell
Inferring fsid ed21f87c-5d79-42d4-b49c-d37a2c228278
Inferring config /var/lib/ceph/ed21f87c-5d79-42d4-b49c-d37a2c228278/mon.controller-0/config
Using recent ceph image quay.io/ceph/daemon@sha256:06c8f7d23a48820eb59f0d977bc03479043512370a18337f200e9c04d7974f5b
[ceph: root@controller-0 /]# ceph orch host ls
[ceph: root@controller-0 /]# ceph orch host add controller-0
Error EINVAL: Failed to connect to controller-0 (controller-0).
Please make sure that the host is reachable and accepts connections using the cephadm SSH key

To add the cephadm SSH key to the host:
> ceph cephadm get-pub-key > ~/ceph.pub
> ssh-copy-id -f -i ~/ceph.pub ceph-admin@controller-0

To check that the host is reachable open a new shell with the --no-hosts flag:
> cephadm shell --no-hosts

Then run the following:
> ceph cephadm get-ssh-config > ssh_config
> ceph config-key get mgr/cephadm/ssh_identity_key > ~/cephadm_private_key
> chmod 0600 ~/cephadm_private_key
> ssh -F ssh_config -i ~/cephadm_private_key ceph-admin@controller-0
[ceph: root@controller-0 /]# ceph orch host add controller-0
Added host 'controller-0' with addr '192.168.24.24'
[ceph: root@controller-0 /]#
```

I modified the bootstrap task file [1] to add a task which verifies that the bootstrap node can SSH to its hostname as tripleo_cephadm_ssh_user [2]. I then saw this task succeed while cephadm's add host command failed [3].

[1] https://github.com/openstack/tripleo-ansible/blob/master/tripleo_ansible/roles/tripleo_cephadm/tasks/bootstrap.yaml

[2]

```yaml
- name: Create SSH config file for tripleo_cephadm_ssh_user
  copy:
    dest: "/home/{{ tripleo_cephadm_ssh_user }}/.ssh/config"
    mode: '0644'
    owner: "{{ tripleo_cephadm_ssh_user }}"
    group: "{{ tripleo_cephadm_ssh_user }}"
    content: |
      Host *
      User {{ tripleo_cephadm_ssh_user }}
      StrictHostKeyChecking no
      UserKnownHostsFile /dev/null
      ConnectTimeout=30

- name: Can bootstrap node SSH to its hostname as tripleo_cephadm_ssh_user
  shell: |
    ssh -F {{ conf }} -i {{ priv }} -l {{ user }} {{ host }} /bin/true
  vars:
    conf: "/home/{{ tripleo_cephadm_ssh_user }}/.ssh/config"
    priv: "/home/{{ tripleo_cephadm_ssh_user }}/.ssh/id_rsa"
    user: "{{ tripleo_cephadm_ssh_user }}"
    host: "{{ hostvars[inventory_hostname]['ansible_facts']['hostname'] }}"
  delegate_to: "{{ inventory_hostname }}"
  register: tripleo_cephadm_ssh_test
  until: tripleo_cephadm_ssh_test.rc == 0
  retries: 5
  delay: 10

- name: Bootstrap Ceph if there are no running Ceph Daemons
  block:
    - name: Run cephadm bootstrap
      shell: |
        {{ tripleo_cephadm_bin }} \
        --image {{ tripleo_cephadm_container_ns + '/' + tripleo_cephadm_container_image + ':' + tripleo_cephadm_container_tag }} \
        bootstrap \
        --skip-firewalld \
        --ssh-private-key /home/{{ tripleo_cephadm_ssh_user }}/.ssh/id_rsa \
        --ssh-public-key /home/{{ tripleo_cephadm_ssh_user }}/.ssh/id_rsa.pub \
        --ssh-config /home/{{ tripleo_cephadm_ssh_user }}/.ssh/config \
        ...
```
[3]

```
2021-09-24 18:18:49,797 p=471659 u=stack n=ansible | 2021-09-24 18:18:49.797395 | 525400de-00e5-cdbc-0879-00000000004d | TASK | Create SSH config file for tripleo_cephadm_ssh_user
2021-09-24 18:18:50,679 p=471659 u=stack n=ansible | 2021-09-24 18:18:50.677724 | 525400de-00e5-cdbc-0879-00000000004d | CHANGED | Create SSH config file for tripleo_cephadm_ssh_user | controller-0
2021-09-24 18:18:50,773 p=471659 u=stack n=ansible | 2021-09-24 18:18:50.772593 | 525400de-00e5-cdbc-0879-00000000004e | TASK | Can bootstrap node SSH to its hostname as tripleo_cephadm_ssh_user
2021-09-24 18:18:51,298 p=471659 u=stack n=ansible | 2021-09-24 18:18:51.297690 | 525400de-00e5-cdbc-0879-00000000004e | CHANGED | Can bootstrap node SSH to its hostname as tripleo_cephadm_ssh_user | controller-0 -> 192.168.24.40
2021-09-24 18:18:51,299 p=471659 u=stack n=ansible | [WARNING]: ('controller-0 -> 192.168.24.40', '525400de-00e5-cdbc-0879-00000000004e') missing from stats
2021-09-24 18:18:51,310 p=471659 u=stack n=ansible | 2021-09-24 18:18:51.310435 | 525400de-00e5-cdbc-0879-000000000050 | TASK | Run cephadm bootstrap
2021-09-24 18:21:12,229 p=471659 u=stack n=ansible | 2021-09-24 18:21:12.227983 | 525400de-00e5-cdbc-0879-000000000050 | FATAL | Run cephadm bootstrap | controller-0 | error={"changed": true, "cmd": "/usr/sbin/cephadm --image quay.io/ceph/daemon:v6.0.4-stable-6.0-pacific-centos-8-x86_64 bootstrap --skip-firewalld --ssh-private-key /home/ceph-admin/.ssh/id_rsa --ssh-public-key /home/ceph-admin/.ssh/id_rsa.pub --ssh-config /home/ceph-admin/.ssh/config --ssh-user ceph-admin --allow-fqdn-hostname --output-keyring /etc/ceph/ceph.client.admin.keyring --output-config /etc/ceph/ceph.conf --fsid 2bfc75cd-8043-4b4d-9707-9223f8a11026 --config /home/ceph-admin/bootstrap_ceph.conf \\--skip-monitoring-stack --skip-dashboard --mon-ip 172.17.3.24\n", "delta": "0:02:20.638702", "end": "2021-09-24 18:21:12.177156", "msg": "non-zero return code", "rc": 1, "start": "2021-09-24 18:18:51.538454", "stderr": "Verifying podman|docker is present...\nVerifying lvm2 is present...\nVerifying time synchronization is in place...\nUnit chronyd.service is enabled and running\nRepeating the final host check...\npodman|docker (/bin/podman) is present\nsystemctl is present\nlvcreate is present\nUnit chronyd.service is enabled and running\nHost looks OK\nCluster fsid: 2bfc75cd-8043-4b4d-9707-9223f8a11026\nVerifying IP 172.17.3.24 port 3300 ...\nVerifying IP 172.17.3.24 port 6789 ...\nMon IP 172.17.3.24 is in CIDR network 172.17.3.0/24\n- internal network (--cluster-network) has not been provided, OSD replication will default to the public_network\nPulling container image quay.io/ceph/daemon:v6.0.4-stable-6.0-pacific-centos-8-x86_64...\nCeph version: ceph version 16.2.5 (0883bdea7337b95e4b611c768c0279868462204a) pacific (stable)\nExtracting ceph user uid/gid from container image...\nCreating initial keys...\nCreating initial monmap...\nCreating mon...\nWaiting for mon to start...\nWaiting for mon...\nmon is available\nAssimilating anything we can from ceph.conf...\nGenerating new minimal ceph.conf...\nRestarting the monitor...\nSetting mon public_network to 172.17.3.0/24\nWrote config to /etc/ceph/ceph.conf\nWrote keyring to /etc/ceph/ceph.client.admin.keyring\nCreating mgr...\nVerifying port 9283 ...\nWaiting for mgr to start...\nWaiting for mgr...\nmgr not available, waiting (1/15)...\nmgr not available, waiting (2/15)...\nmgr not available, waiting (3/15)...\nmgr not available, waiting (4/15)...\nmgr is available\nEnabling cephadm module...\nWaiting for the mgr to restart...\nWaiting for mgr epoch 5...\nmgr epoch 5 is available\nSetting orchestrator backend to cephadm...\nUsing provided ssh config...\nUsing provided ssh keys...\nAdding host controller-0...\nNon-zero exit code 22 from /bin/podman run --rm --ipc=host --no-hosts --net=host --entrypoint /usr/bin/ceph --init -e CONTAINER_IMAGE=quay.io/ceph/daemon:v6.0.4-stable-6.0-pacific-centos-8-x86_64 -e NODE_NAME=controller-0 -e CEPH_USE_RANDOM_NONCE=1 -v /var/log/ceph/2bfc75cd-8043-4b4d-9707-9223f8a11026:/var/log/ceph:z -v /tmp/ceph-tmpeq3ld00s:/etc/ceph/ceph.client.admin.keyring:z -v /tmp/ceph-tmpxa9_18ih:/etc/ceph/ceph.conf:z quay.io/ceph/daemon:v6.0.4-stable-6.0-pacific-centos-8-x86_64 orch host add controller-0\n/usr/bin/ceph: stderr Error EINVAL: Failed to connect to controller-0 (controller-0).\n/usr/bin/ceph: stderr Please make sure that the host is reachable and accepts connections using the cephadm SSH key\n/usr/bin/ceph: stderr \n/usr/bin/ceph: stderr To add the cephadm SSH key to the host:\n/usr/bin/ceph: stderr > ceph cephadm get-pub-key > ~/ceph.pub\n/usr/bin/ceph: stderr > ssh-copy-id -f -i ~/ceph.pub ceph-admin@controller-0\n/usr/bin/ceph: stderr \n/usr/bin/ceph: stderr To check that the host is reachable open a new shell with the --no-hosts flag:\n/usr/bin/ceph: stderr > cephadm shell --no-hosts\n/usr/bin/ceph: stderr \n/usr/bin/ceph: stderr Then run the following:\n/usr/bin/ceph: stderr > ceph cephadm get-ssh-config > ssh_config\n/usr/bin/ceph: stderr > ceph config-key get mgr/cephadm/ssh_identity_key > ~/cephadm_private_key\n/usr/bin/ceph: stderr > chmod 0600 ~/cephadm_private_key\n/usr/bin/ceph: stderr > ssh -F ssh_config -i ~/cephadm_private_key ceph-admin@controller-0\nERROR: Failed to add host <controller-0>: Failed command: /bin/podman run --rm --ipc=host --no-hosts --net=host --entrypoint /usr/bin/ceph --init -e CONTAINER_IMAGE=quay.io/ceph/daemon:v6.0.4-stable-6.0-pacific-centos-8-x86_64 -e NODE_NAME=controller-0 -e CEPH_USE_RANDOM_NONCE=1 -v /var/log/ceph/2bfc75cd-8043-4b4d-9707-9223f8a11026:/var/log/ceph:z -v /tmp/ceph-tmpeq3ld00s:/etc/ceph/ceph.client.admin.keyring:z -v /tmp/ceph-tmpxa9_18ih:/etc/ceph/ceph.conf:z quay.io/ceph/daemon:v6.0.4-stable-6.0-pacific-centos-8-x86_64 orch host add controller-0", "stderr_lines": ["Verifying podman|docker is present...", "Verifying lvm2 is present...", "Verifying time synchronization is in place...", "Unit chronyd.service is enabled and running", "Repeating the final host check...", "podman|docker (/bin/podman) is present", "systemctl is present", "lvcreate is present", "Unit chronyd.service is enabled and running", "Host looks OK", "Cluster fsid: 2bfc75cd-8043-4b4d-9707-9223f8a11026", "Verifying IP 172.17.3.24 port 3300 ...", "Verifying IP 172.17.3.24 port 6789 ...", "Mon IP 172.17.3.24 is in CIDR network 172.17.3.0/24", "- internal network (--cluster-network) has not been provided, OSD replication will default to the public_network", "Pulling container image quay.io/ceph/daemon:v6.0.4-stable-6.0-pacific-centos-8-x86_64...", "Ceph version: ceph version 16.2.5 (0883bdea7337b95e4b611c768c0279868462204a) pacific (stable)", "Extracting ceph user uid/gid from container image...", "Creating initial keys...", "Creating initial monmap...", "Creating mon...", "Waiting for mon to start...", "Waiting for mon...", "mon is available", "Assimilating anything we can from ceph.conf...", "Generating new minimal ceph.conf...", "Restarting the monitor...", "Setting mon public_network to 172.17.3.0/24", "Wrote config to /etc/ceph/ceph.conf", "Wrote keyring to /etc/ceph/ceph.client.admin.keyring", "Creating mgr...", "Verifying port 9283 ...", "Waiting for mgr to start...", "Waiting for mgr...", "mgr not available, waiting (1/15)...", "mgr not available, waiting (2/15)...", "mgr not available, waiting (3/15)...", "mgr not available, waiting (4/15)...", "mgr is available", "Enabling cephadm module...", "Waiting for the mgr to restart...", "Waiting for mgr epoch 5...", "mgr epoch 5 is available", "Setting orchestrator backend to cephadm...", "Using provided ssh config...", "Using provided ssh keys...", "Adding host controller-0...", "Non-zero exit code 22 from /bin/podman run --rm --ipc=host --no-hosts --net=host --entrypoint /usr/bin/ceph --init -e CONTAINER_IMAGE=quay.io/ceph/daemon:v6.0.4-stable-6.0-pacific-centos-8-x86_64 -e NODE_NAME=controller-0 -e CEPH_USE_RANDOM_NONCE=1 -v /var/log/ceph/2bfc75cd-8043-4b4d-9707-9223f8a11026:/var/log/ceph:z -v /tmp/ceph-tmpeq3ld00s:/etc/ceph/ceph.client.admin.keyring:z -v /tmp/ceph-tmpxa9_18ih:/etc/ceph/ceph.conf:z quay.io/ceph/daemon:v6.0.4-stable-6.0-pacific-centos-8-x86_64 orch host add controller-0", "/usr/bin/ceph: stderr Error EINVAL: Failed to connect to controller-0 (controller-0).", "/usr/bin/ceph: stderr Please make sure that the host is reachable and accepts connections using the cephadm SSH key", "/usr/bin/ceph: stderr ", "/usr/bin/ceph: stderr To add the cephadm SSH key to the host:", "/usr/bin/ceph: stderr > ceph cephadm get-pub-key > ~/ceph.pub", "/usr/bin/ceph: stderr > ssh-copy-id -f -i ~/ceph.pub ceph-admin@controller-0", "/usr/bin/ceph: stderr ", "/usr/bin/ceph: stderr To check that the host is reachable open a new shell with the --no-hosts flag:", "/usr/bin/ceph: stderr > cephadm shell --no-hosts", "/usr/bin/ceph: stderr ", "/usr/bin/ceph: stderr Then run the following:", "/usr/bin/ceph: stderr > ceph cephadm get-ssh-config > ssh_config", "/usr/bin/ceph: stderr > ceph config-key get mgr/cephadm/ssh_identity_key > ~/cephadm_private_key", "/usr/bin/ceph: stderr > chmod 0600 ~/cephadm_private_key", "/usr/bin/ceph: stderr > ssh -F ssh_config -i ~/cephadm_private_key ceph-admin@controller-0", "ERROR: Failed to add host <controller-0>: Failed command: /bin/podman run --rm --ipc=host --no-hosts --net=host --entrypoint /usr/bin/ceph --init -e CONTAINER_IMAGE=quay.io/ceph/daemon:v6.0.4-stable-6.0-pacific-centos-8-x86_64 -e NODE_NAME=controller-0 -e CEPH_USE_RANDOM_NONCE=1 -v /var/log/ceph/2bfc75cd-8043-4b4d-9707-9223f8a11026:/var/log/ceph:z -v /tmp/ceph-tmpeq3ld00s:/etc/ceph/ceph.client.admin.keyring:z -v /tmp/ceph-tmpxa9_18ih:/etc/ceph/ceph.conf:z quay.io/ceph/daemon:v6.0.4-stable-6.0-pacific-centos-8-x86_64 orch host add controller-0"], "stdout": "", "stdout_lines": []}
2021-09-24 18:21:12,235 p=471659 u=stack n=ansible | PLAY RECAP *********************************************************************
2021-09-24 18:21:12,236 p=471659 u=stack n=ansible | controller-0 : ok=12 changed=7 unreachable=0 failed=1 skipped=8 rescued=0 ignored=0
2021-09-24 18:21:12,238 p=471659 u=stack n=ansible | 2021-09-24 18:21:12.237460 | ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ Summary Information ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
```

Re-tested the scenario described in comment #6, but with the SSH test run inside the ceph container, both with and without --no-hosts, as seen in the modified SSH test task below.

```yaml
- name: Can bootstrap node SSH to its hostname as tripleo_cephadm_ssh_user
  shell: >-
    /bin/podman run --rm --ipc=host --net=host {{ no_hosts }}
    --entrypoint /usr/bin/ssh
    -e CONTAINER_IMAGE={{ image }}
    -e NODE_NAME={{ host }}
    -e CEPH_USE_RANDOM_NONCE=1
    -v {{ conf }}:{{ c_conf }}
    -v {{ priv }}:{{ c_priv }}
    {{ image }}
    -F {{ c_conf }} -i {{ c_priv }} -l {{ user }} {{ host }} /bin/true
  vars:
    no_hosts: "--no-hosts"
    # no_hosts: ""
    conf: "/home/{{ tripleo_cephadm_ssh_user }}/.ssh/config"
    c_conf: "/tmp/config"
    priv: "/home/{{ tripleo_cephadm_ssh_user }}/.ssh/id_rsa"
    c_priv: "/tmp/id_rsa"
    image: "{{ tripleo_cephadm_container_ns + '/' + tripleo_cephadm_container_image + ':' + tripleo_cephadm_container_tag }}"
    user: "{{ tripleo_cephadm_ssh_user }}"
    host: "{{ hostvars[inventory_hostname]['ansible_facts']['hostname'] }}"
  delegate_to: "{{ inventory_hostname }}"
  become: true
  register: tripleo_cephadm_ssh_test
  until: tripleo_cephadm_ssh_test.rc == 0
  retries: 10
  delay: 5
```

The Ansible role was modified so that it has the following two sequential tasks:

A. the SSH test (from comment #7)
B. bootstrap

When A uses --no-hosts and B uses --no-hosts, A fails [1]. When A does not use --no-hosts and B uses --no-hosts, A succeeds [2] and B fails as seen in comment #3.

[1]

```
2021-09-25 14:08:26,460 p=436212 u=stack n=ansible | 2021-09-25 14:08:26.460565 | 525400de-00e5-0d5d-62e6-00000000004d | TASK | Create SSH config file for tripleo_cephadm_ssh_user
2021-09-25 14:08:27,192 p=436212 u=stack n=ansible | 2021-09-25 14:08:27.191605 | 525400de-00e5-0d5d-62e6-00000000004d | CHANGED | Create SSH config file for tripleo_cephadm_ssh_user | controller-0
2021-09-25 14:08:27,282 p=436212 u=stack n=ansible | 2021-09-25 14:08:27.281953 | 525400de-00e5-0d5d-62e6-00000000004e | TASK | Can bootstrap node SSH to its hostname as tripleo_cephadm_ssh_user
2021-09-25 14:12:02,494 p=436212 u=stack n=ansible | 2021-09-25 14:12:02.493201 | 525400de-00e5-0d5d-62e6-00000000004e | FATAL | Can bootstrap node SSH to its hostname as tripleo_cephadm_ssh_user | controller-0 -> 192.168.24.51 | error={"attempts": 10, "changed": true, "cmd": "/bin/podman run --rm --ipc=host --net=host --no-hosts --entrypoint /usr/bin/ssh -e CONTAINER_IMAGE=quay.io/ceph/daemon:v6.0.4-stable-6.0-pacific-centos-8-x86_64 -e NODE_NAME=controller-0 -e CEPH_USE_RANDOM_NONCE=1 -v /home/ceph-admin/.ssh/config:/tmp/config -v /home/ceph-admin/.ssh/id_rsa:/tmp/id_rsa quay.io/ceph/daemon:v6.0.4-stable-6.0-pacific-centos-8-x86_64 -F /tmp/config -i /tmp/id_rsa -l ceph-admin controller-0 /bin/true", "delta": "0:00:08.557218", "end": "2021-09-25 14:12:02.447699", "msg": "non-zero return code", "rc": 255, "start": "2021-09-25 14:11:53.890481", "stderr": "ssh: connect to host controller-0 port 22: No route to host", "stderr_lines": ["ssh: connect to host controller-0 port 22: No route to host"], "stdout": "", "stdout_lines": []}
2021-09-25 14:12:02,496 p=436212 u=stack n=ansible | [WARNING]: ('controller-0 -> 192.168.24.51', '525400de-00e5-0d5d-62e6-00000000004e') missing from stats
```

[2]

```
2021-09-27 01:16:52,698 p=696678 u=stack n=ansible | 2021-09-27 01:16:52.698384 | 525400de-00e5-28b1-713b-00000000004d | TASK | Create SSH config file for tripleo_cephadm_ssh_user
2021-09-27 01:16:53,432 p=696678 u=stack n=ansible | 2021-09-27 01:16:53.431110 | 525400de-00e5-28b1-713b-00000000004d | CHANGED | Create SSH config file for tripleo_cephadm_ssh_user | controller-0
2021-09-27 01:16:53,520 p=696678 u=stack n=ansible | 2021-09-27 01:16:53.520145 | 525400de-00e5-28b1-713b-00000000004e | TASK | Can bootstrap node SSH to its hostname as tripleo_cephadm_ssh_user
2021-09-27 01:18:02,028 p=696678 u=stack n=ansible | 2021-09-27 01:18:02.027077 | 525400de-00e5-28b1-713b-00000000004e | CHANGED | Can bootstrap node SSH to its hostname as tripleo_cephadm_ssh_user | controller-0 -> 192.168.24.34
2021-09-27 01:18:02,030 p=696678 u=stack n=ansible | [WARNING]: ('controller-0 -> 192.168.24.34', '525400de-00e5-28b1-713b-00000000004e') missing from stats
2021-09-27 01:18:02,044 p=696678 u=stack n=ansible | 2021-09-27 01:18:02.043477 | 525400de-00e5-28b1-713b-000000000050 | TASK | Run cephadm bootstrap
2021-09-27 01:19:29,315 p=696678 u=stack n=ansible | 2021-09-27 01:19:29.313644 | 525400de-00e5-28b1-713b-000000000050 | FATAL | Run cephadm bootstrap | controller-0 | error={"changed": <cut>
```
Modifying the SSH task to use "ssh -v" and logging the output [0] reveals what is happening within the ceph container when --no-hosts is passed.
- Attila's system (which exhibits this bug) falls back to DNS to identify controller-0 [1]. The DNS server in /etc/resolv.conf returns an IP it cannot reach [2], and thus the SSH connection fails.
- My system uses the IPv6 autoconfigured address, which does work [3].
- After some amount of time, Attila's IPv6 autoconfigured address also works [4].
I don't know why his system's IPv6 autoconfig takes time, but I don't think that's relevant. What's more interesting is that we have been relying on IPv6 autoconfig. TripleO manages the /etc/hosts file for all hosts that it deploys [5]. TripleO has the option to update a DNS service, but this isn't required. Always using --no-hosts blocks access to the hosts file that TripleO configures for host resolution [6].
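To make the failure mode concrete, here is a hedged sketch in plain Python (not glibc's actual NSS logic): a lookup that honors a TripleO-style hosts file finds the deployment-managed address, and only a name missing from that file falls through to DNS, which is effectively what bypassing /etc/hosts forces for every name. The `resolve` helper and the sample mapping are illustrative, not part of any tripleo or cephadm code.

```python
import socket

def resolve(name, hosts_text):
    """Prefer a hosts-file style mapping (what TripleO writes into
    /etc/hosts); fall back to the system resolver (DNS) only when the
    name is absent -- roughly what bypassing /etc/hosts changes."""
    for line in hosts_text.splitlines():
        fields = line.split("#", 1)[0].split()
        if len(fields) >= 2 and name in fields[1:]:
            return fields[0]  # hosts-file entry wins
    return socket.gethostbyname(name)  # DNS fallback (the --no-hosts path)

# A fragment in the style of the TripleO-managed /etc/hosts shown in [6]:
hosts = "172.17.1.76 controller-0.redhat.local controller-0"
print(resolve("controller-0", hosts))  # -> 172.17.1.76
```

With the hosts file honored, controller-0 resolves to the internal-network address; with it bypassed, the answer depends entirely on whatever DNS returns.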
I can work around it by modifying the SSH config managed by TripleO's cephadm role so that it's effectively like a hosts file (since it can map the name to the IP) [7], but it might be nicer to have an option in cephadm so that a user can simply choose to rely on /etc/hosts in their ceph container at bootstrap if they wish.
[0]
```yaml
- name: Can bootstrap node SSH to its hostname as tripleo_cephadm_ssh_user
  shell: >-
    /bin/podman run --rm --ipc=host --net=host --no-hosts
    --entrypoint /usr/bin/ssh
    -e CONTAINER_IMAGE={{ image }}
    -e NODE_NAME={{ host }}
    -e CEPH_USE_RANDOM_NONCE=1
    -v {{ conf }}:{{ c_conf }}
    -v {{ priv }}:{{ c_priv }}
    {{ image }}
    -v -F {{ c_conf }} -i {{ c_priv }} -l {{ user }} {{ host }} /bin/true
  vars:
    conf: "/home/{{ tripleo_cephadm_ssh_user }}/.ssh/config"
    c_conf: "/tmp/config"
    priv: "/home/{{ tripleo_cephadm_ssh_user }}/.ssh/id_rsa"
    c_priv: "/tmp/id_rsa"
    image: "{{ tripleo_cephadm_container_ns + '/' + tripleo_cephadm_container_image + ':' + tripleo_cephadm_container_tag }}"
    user: "{{ tripleo_cephadm_ssh_user }}"
    host: "{{ hostvars[inventory_hostname]['ansible_facts']['hostname'] }}"
  delegate_to: "{{ inventory_hostname }}"
  become: true
  register: tripleo_cephadm_ssh_test
  until: tripleo_cephadm_ssh_test.rc == 0
  ignore_errors: true
  failed_when: tripleo_cephadm_ssh_test.rc > 0
  retries: 10
  delay: 5

- debug:
    msg: "{{ tripleo_cephadm_ssh_test }}"
```
[1]
```
2021-09-27 18:00:51,886 p=215519 u=stack n=ansible | 2021-09-27 18:00:51.886143 | 525400de-00e5-afff-459e-00000000004f | OK | tripleo_cephadm : debug | controller-0 | result={
    "changed": false,
    "msg": {
        "attempts": 10,
        "changed": true,
        "cmd": "/bin/podman run --rm --ipc=host --net=host --no-hosts --entrypoint /usr/bin/ssh -e CONTAINER_IMAGE=quay.io/ceph/daemon:v6.0.4-stable-6.0-pacific-centos-8-x86_64 -e NODE_NAME=controller-0 -e CEPH_USE_RANDOM_NONCE=1 -v /home/ceph-admin/.ssh/config:/tmp/config -v /home/ceph-admin/.ssh/id_rsa:/tmp/id_rsa quay.io/ceph/daemon:v6.0.4-stable-6.0-pacific-centos-8-x86_64 -v -F /tmp/config -i /tmp/id_rsa -l ceph-admin controller-0 /bin/true",
        "delta": "0:00:08.520137",
        "end": "2021-09-27 18:00:51.793941",
        "failed": true,
        "failed_when_result": true,
        "msg": "non-zero return code",
        "rc": 255,
        "start": "2021-09-27 18:00:43.273804",
        "stderr": "OpenSSH_8.0p1, OpenSSL 1.1.1g FIPS 21 Apr 2020\r\ndebug1: Reading configuration data /tmp/config\r\ndebug1: /tmp/config line 1: Applying options for *\r\ndebug1: Connecting to controller-0 [10.0.0.3] port 22.\r\ndebug1: connect to address 10.0.0.3 port 22: No route to host\r\nssh: connect to host controller-0 port 22: No route to host",
        "stderr_lines": [
            "OpenSSH_8.0p1, OpenSSL 1.1.1g FIPS 21 Apr 2020",
            "debug1: Reading configuration data /tmp/config",
            "debug1: /tmp/config line 1: Applying options for *",
            "debug1: Connecting to controller-0 [10.0.0.3] port 22.",
            "debug1: connect to address 10.0.0.3 port 22: No route to host",
            "ssh: connect to host controller-0 port 22: No route to host"
        ],
        "stdout": "",
        "stdout_lines": []
    }
}
```
[2]
```
[root@controller-1 ~]# cat /etc/resolv.conf
# Generated by NetworkManager
nameserver 172.16.0.1
nameserver 10.0.0.1
[root@controller-1 ~]#
[root@controller-1 ~]# dig @10.0.0.1 controller-0 +short
10.0.0.3
[root@controller-1 ~]#
[root@controller-1 ~]# ssh 10.0.0.3
ssh: connect to host 10.0.0.3 port 22: No route to host
[root@controller-1 ~]#
```
[3]
```
TASK [Can bootstrap node SSH to its hostname as tripleo_cephadm_ssh_user] ********************
changed: [oc0-controller-0 -> 192.168.24.8]

TASK [tripleo_cephadm : debug] ***************************************************************
ok: [oc0-controller-0] => {
    "msg": {
        ...
        "stderr_lines": [
            "OpenSSH_8.0p1, OpenSSL 1.1.1g FIPS 21 Apr 2020",
            "debug1: Reading configuration data /tmp/config",
            "debug1: /tmp/config line 1: Applying options for *",
            "debug1: Connecting to oc0-controller-0 [fe80::2642:ff:fe79:d558%ens3] port 22.",
            "debug1: fd 3 clearing O_NONBLOCK",
            "debug1: Connection established.",
```
[4]
"stderr_lines": [
"OpenSSH_8.0p1, OpenSSL 1.1.1g FIPS 21 Apr 2020",
"debug1: Reading configuration data /tmp/config",
"debug1: /tmp/config line 1: Applying options for *",
"debug1: Connecting to controller-0 [fe80::5054:ff:fe72:5432%ens3] port 22.",
"debug1: fd 3 clearing O_NONBLOCK",
[5] https://github.com/openstack/tripleo-ansible/blob/stable/wallaby/tripleo_ansible/roles/tripleo_hosts_entries/tasks/main.yml
[6]
```
[root@controller-0 ~]# /bin/podman run --rm --ipc=host --no-hosts --net=host --entrypoint /usr/bin/cat -e CONTAINER_IMAGE=quay.io/ceph/daemon:v6.0.4-stable-6.0-pacific-centos-8-x86_64 -e NODE_NAME=controller-0 -e CEPH_USE_RANDOM_NONCE=1 -v /home/ceph-admin/.ssh/config:/tmp/config -v /home/ceph-admin/.ssh/id_rsa:/tmp/id_rsa quay.io/ceph/daemon:v6.0.4-stable-6.0-pacific-centos-8-x86_64 /etc/hosts
127.0.0.1   localhost localhost.localdomain localhost4 localhost4.localdomain4
::1         localhost localhost.localdomain localhost6 localhost6.localdomain6
[root@controller-0 ~]#
[root@controller-0 ~]# /bin/podman run --rm --ipc=host --net=host --entrypoint /usr/bin/cat -e CONTAINER_IMAGE=quay.io/ceph/daemon:v6.0.4-stable-6.0-pacific-centos-8-x86_64 -e NODE_NAME=controller-0 -e CEPH_USE_RANDOM_NONCE=1 -v /home/ceph-admin/.ssh/config:/tmp/config -v /home/ceph-admin/.ssh/id_rsa:/tmp/id_rsa quay.io/ceph/daemon:v6.0.4-stable-6.0-pacific-centos-8-x86_64 /etc/hosts
# START_HOST_ENTRIES_FOR_STACK: overcloud
172.17.3.31 ceph-0.redhat.local ceph-0
172.17.3.31 ceph-0.storage.redhat.local ceph-0.storage
172.17.4.110 ceph-0.storagemgmt.redhat.local ceph-0.storagemgmt
192.168.24.13 ceph-0.ctlplane.redhat.local ceph-0.ctlplane
172.17.3.58 ceph-1.redhat.local ceph-1
172.17.3.58 ceph-1.storage.redhat.local ceph-1.storage
172.17.4.109 ceph-1.storagemgmt.redhat.local ceph-1.storagemgmt
192.168.24.24 ceph-1.ctlplane.redhat.local ceph-1.ctlplane
172.17.3.124 ceph-2.redhat.local ceph-2
172.17.3.124 ceph-2.storage.redhat.local ceph-2.storage
172.17.4.38 ceph-2.storagemgmt.redhat.local ceph-2.storagemgmt
192.168.24.12 ceph-2.ctlplane.redhat.local ceph-2.ctlplane
172.17.1.18 compute-0.redhat.local compute-0
172.17.3.100 compute-0.storage.redhat.local compute-0.storage
172.17.1.18 compute-0.internalapi.redhat.local compute-0.internalapi
172.17.2.133 compute-0.tenant.redhat.local compute-0.tenant
192.168.24.35 compute-0.ctlplane.redhat.local compute-0.ctlplane
172.17.1.120 compute-1.redhat.local compute-1
172.17.3.141 compute-1.storage.redhat.local compute-1.storage
172.17.1.120 compute-1.internalapi.redhat.local compute-1.internalapi
172.17.2.31 compute-1.tenant.redhat.local compute-1.tenant
192.168.24.50 compute-1.ctlplane.redhat.local compute-1.ctlplane
172.17.1.76 controller-0.redhat.local controller-0
172.17.3.24 controller-0.storage.redhat.local controller-0.storage
172.17.4.17 controller-0.storagemgmt.redhat.local controller-0.storagemgmt
172.17.1.76 controller-0.internalapi.redhat.local controller-0.internalapi
172.17.2.45 controller-0.tenant.redhat.local controller-0.tenant
10.0.0.120 controller-0.external.redhat.local controller-0.external
192.168.24.34 controller-0.ctlplane.redhat.local controller-0.ctlplane
172.17.1.16 controller-1.redhat.local controller-1
172.17.3.68 controller-1.storage.redhat.local controller-1.storage
172.17.4.72 controller-1.storagemgmt.redhat.local controller-1.storagemgmt
172.17.1.16 controller-1.internalapi.redhat.local controller-1.internalapi
172.17.2.37 controller-1.tenant.redhat.local controller-1.tenant
10.0.0.145 controller-1.external.redhat.local controller-1.external
192.168.24.10 controller-1.ctlplane.redhat.local controller-1.ctlplane
172.17.1.23 controller-2.redhat.local controller-2
172.17.3.134 controller-2.storage.redhat.local controller-2.storage
172.17.4.68 controller-2.storagemgmt.redhat.local controller-2.storagemgmt
172.17.1.23 controller-2.internalapi.redhat.local controller-2.internalapi
172.17.2.75 controller-2.tenant.redhat.local controller-2.tenant
10.0.0.131 controller-2.external.redhat.local controller-2.external
192.168.24.18 controller-2.ctlplane.redhat.local controller-2.ctlplane
192.168.24.1 undercloud-0.ctlplane.redhat.local undercloud-0.ctlplane
192.168.24.28 overcloud.ctlplane.localdomain
172.17.3.145 overcloud.storage.localdomain
172.17.4.95 overcloud.storagemgmt.localdomain
172.17.1.99 overcloud.internalapi.localdomain
10.0.0.102 overcloud.localdomain
# END_HOST_ENTRIES_FOR_STACK: overcloud
127.0.0.1   localhost localhost.localdomain localhost4 localhost4.localdomain4
::1         localhost localhost.localdomain localhost6 localhost6.localdomain6
127.0.1.1 controller-0 controller-0 elastic_moore
[root@controller-0 ~]#
```
[7]
```yaml
- name: Create SSH config file for tripleo_cephadm_ssh_user
  copy:
    dest: "/home/{{ tripleo_cephadm_ssh_user }}/.ssh/config"
    mode: '0644'
    owner: "{{ tripleo_cephadm_ssh_user }}"
    group: "{{ tripleo_cephadm_ssh_user }}"
    content: |
      Host {{ bootstrap_host_name }}
      Hostname {{ bootstrap_host_ip }}
      User {{ tripleo_cephadm_ssh_user }}
      StrictHostKeyChecking no
      UserKnownHostsFile /dev/null
      ConnectTimeout=30
```
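The workaround in [7] can be generated mechanically: emit one Host/Hostname block per node so the SSH config resolves names itself, independent of the container's /etc/hosts. The sketch below is illustrative; `ssh_config_from_hosts` and its inputs are hypothetical helpers, not part of the tripleo_cephadm role.

```python
def ssh_config_from_hosts(mapping, user="ceph-admin"):
    """Render SSH config blocks that pin each short hostname to an IP,
    mimicking a hosts file for cephadm's containerized SSH client.
    `mapping` is {hostname: ip}; both arguments are illustrative."""
    blocks = []
    for name, ip in sorted(mapping.items()):
        blocks.append(
            f"Host {name}\n"
            f"Hostname {ip}\n"
            f"User {user}\n"
            "StrictHostKeyChecking no\n"
            "UserKnownHostsFile /dev/null\n"
            "ConnectTimeout=30\n"
        )
    return "\n".join(blocks)

# e.g. pin the bootstrap host to its ctlplane address:
print(ssh_config_from_hosts({"controller-0": "192.168.24.34"}))
```

Because SSH applies the `Hostname` directive before any name resolution, the container never consults /etc/hosts or DNS for the pinned names.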
cephadm already removed the --no-hosts option [1], so we don't need a new option to avoid it.

- Downstream 5.0 does not have --no-hosts.
- Downstream 5.1 will not have --no-hosts.
- Old upstream versions of Pacific have --no-hosts, but it was removed because it caused bugs (like this one).

If you're going to test upstream versions, please use a modern version; e.g. in TripleO we use the latest version of [2] which passed our CI.

[1] https://github.com/ceph/ceph/commit/d1bb94ba4c4b8401bf7799b7da3a8f5c2fd228c6
[2] https://cbs.centos.org/koji/packageinfo?packageID=8439