Description of problem:

Update from 16.0 to 16.1 fails during:

    openstack overcloud external-update run \
        --stack qe-Cloud-0 \
        --tags ceph 2>&1

with:

    [create ceph_ansible_remote_tmp on all nodes with necessary ownership] ****
    Saturday 27 June 2020 17:13:55 +0000 (0:00:00.096) 0:00:00.096 *********
    fatal: [controller-0]: UNREACHABLE! => changed=false
      msg: |-
        Data could not be sent to remote host "192.168.24.10". Make sure this host can be reached over ssh: unix_listener: path "/var/lib/mistral/3251f9f9-b644-4f85-a6be-587f084b7d8b/ceph-ansible/192.168.24.10-tripleo-admin-22.OLVVXeftlbIVXc0Z" too long for Unix domain socket
      unreachable: true
    fatal: [controller-1]: UNREACHABLE! => changed=false
      msg: |-
        Data could not be sent to remote host "192.168.24.54". Make sure this host can be reached over ssh: unix_listener: path "/var/lib/mistral/3251f9f9-b644-4f85-a6be-587f084b7d8b/ceph-ansible/192.168.24.54-tripleo-admin-22.GL3KdARXOD8shdpO" too long for Unix domain socket
      unreachable: true
    fatal: [messaging-1]: UNREACHABLE! => changed=false

Version-Release number of selected component (if applicable):
Hi, so I've just tested with ansible 2.9 after a manual update of ansible:

    rpm -qa | grep '^ansible-2'
    ansible-2.9.10-1.el8ae.noarch

followed up with:

    openstack overcloud external-update run \
        --stack qe-Cloud-0 \
        --tags ceph 2>&1

and ended up with:

    TASK [create ceph_ansible_remote_tmp on all nodes with necessary ownership] ****
    Wednesday 01 July 2020 10:39:16 +0000 (0:00:00.033) 0:00:00.033 ********
    fatal: [controller-2]: UNREACHABLE! => changed=false
      msg: |-
        Data could not be sent to remote host "192.168.24.6". Make sure this host can be reached over ssh: unix_listener: path "/var/lib/mistral/a6d9fd44-100f-4c30-bfce-12cea823fd0f/ceph-ansible/192.168.24.6-tripleo-admin-22.pA5DYd5I6HW6C4yh" too long for Unix domain socket
      unreachable: true
    fatal: [ceph-0]: UNREACHABLE! => changed=false
      msg: |-
        Data could not be sent to remote host "192.168.24.54". Make sure this host can be reached over ssh: unix_listener: path "/var/lib/mistral/a6d9fd44-100f-4c30-bfce-12cea823fd0f/ceph-ansible/192.168.24.54-tripleo-admin-22.F1juQ8Fwz6i9NuWk" too long for Unix domain socket
      unreachable: true
    fatal: [ceph-2]: UNREACHABLE! => changed=false
      msg: |-
        Data could not be sent to remote host "192.168.24.24". Make sure this host can be reached over ssh: unix_listener: path "/var/lib/mistral/a6d9fd44-100f-4c30-bfce-12cea823fd0f/ceph-ansible/192.168.24.24-tripleo-admin-22.I5oy1cYsMLLzOYaB" too long for Unix domain socket
      unreachable: true
    fatal: [ceph-1]: UNREACHABLE! => changed=false
      msg: |-
        Data could not be sent to remote host "192.168.24.9". Make sure this host can be reached over ssh: unix_listener: path "/var/lib/mistral/a6d9fd44-100f-4c30-bfce-12cea823fd0f/ceph-ansible/192.168.24.9-tripleo-admin-22.XuME121mWtW2Q2ZH" too long for Unix domain socket
      unreachable: true
    fatal: [controller-0]: UNREACHABLE! => changed=false
      msg: |-
        Data could not be sent to remote host "192.168.24.48". Make sure this host can be reached over ssh: unix_listener: path "/var/lib/mistral/a6d9fd44-100f-4c30-bfce-12cea823fd0f/ceph-ansible/192.168.24.48-tripleo-admin-22.jFTZ2QQcZ0xgImCr" too long for Unix domain socket
      unreachable: true
    fatal: [controller-1]: UNREACHABLE! => changed=false
      msg: |-
        Data could not be sent to remote host "192.168.24.17". Make sure this host can be reached over ssh: unix_listener: path "/var/lib/mistral/a6d9fd44-100f-4c30-bfce-12cea823fd0f/ceph-ansible/192.168.24.17-tripleo-admin-22.ZEV2kuSzWC2NNTAo" too long for Unix domain socket
      unreachable: true
    fatal: [compute-1]: UNREACHABLE! => changed=false
      msg: |-
        Data could not be sent to remote host "192.168.24.29". Make sure this host can be reached over ssh: unix_listener: path "/var/lib/mistral/a6d9fd44-100f-4c30-bfce-12cea823fd0f/ceph-ansible/192.168.24.29-tripleo-admin-22.oT6j6422O7uEwTGJ" too long for Unix domain socket
      unreachable: true
    fatal: [compute-0]: UNREACHABLE! => changed=false
      msg: |-
        Data could not be sent to remote host "192.168.24.52". Make sure this host can be reached over ssh: unix_listener: path "/var/lib/mistral/a6d9fd44-100f-4c30-bfce-12cea823fd0f/ceph-ansible/192.168.24.52-tripleo-admin-22.rwXD1cNSs3f6ZmXW" too long for Unix domain socket
      unreachable: true
    fatal: [undercloud]: UNREACHABLE! => changed=false
      msg: |-
        Data could not be sent to remote host "localhost". Make sure this host can be reached over ssh: unix_listener: path "/var/lib/mistral/a6d9fd44-100f-4c30-bfce-12cea823fd0f/ceph-ansible/localhost-tripleo-admin-22.EfMPy8s6M7L7RptW" too long for Unix domain socket
      unreachable: true
Moved the patch to https://bugzilla.redhat.com/show_bug.cgi?id=1852801, as it doesn't fix the issue here.
So there is no way around this: the unix socket path is too long and has to be shortened:

    [stack@undercloud-0 ~]$ cat /etc/redhat-release
    Red Hat Enterprise Linux release 8.2 (Ootpa)
    [stack@undercloud-0 ~]$ grep "define UNIX_PATH_MAX" /usr/include/linux/un.h
    #define UNIX_PATH_MAX 108
    [stack@undercloud-0 ~]$ echo '/var/lib/mistral/f6075035-eb84-43ae-b517-8388b44ab148/ceph-ansible/192.168.24.52-tripleo-admin-22.ZDdc6A6Y2bLQQFO9' | wc -c
    115

This is definitely a blocker. Why this passes on phase1 still eludes me though.
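The length check above can also be sketched in Python. This is only an illustration, not code from tripleo or ceph-ansible: the helper name `fits_unix_socket` is made up, the limit of 108 is the `UNIX_PATH_MAX` value shown by the grep above, and the sketch assumes one byte of that buffer is needed for the terminating NUL. Note that `echo ... | wc -c` also counts the trailing newline, so the path itself is 114 bytes.

```python
# Limit taken from the grep of /usr/include/linux/un.h shown above.
UNIX_PATH_MAX = 108


def fits_unix_socket(path: str, limit: int = UNIX_PATH_MAX) -> bool:
    """Hypothetical helper: True if path plus a terminating NUL fits in the limit."""
    return len(path.encode("utf-8")) + 1 <= limit


# Control socket path copied from the wc -c check above.
failing = (
    "/var/lib/mistral/f6075035-eb84-43ae-b517-8388b44ab148/"
    "ceph-ansible/192.168.24.52-tripleo-admin-22.ZDdc6A6Y2bLQQFO9"
)

# Print the byte length of the path and whether it fits.
print(len(failing), fits_unix_socket(failing))
```

Under these assumptions, any socket path longer than 107 bytes is rejected, which is consistent with the "too long for Unix domain socket" errors in the logs.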
See my comment at https://code.engineering.redhat.com/gerrit/#/c/204232/1/tripleo_ansible/roles/tripleo-ceph-run-ansible/tasks/create_ceph_ansible_remote_tmp.yml for what needs to be fixed.
1. This job uses the exact same code and it passes on deployment:
   phase1-16.1_director-rhel-8.2-virthost-1cont_1comp_1ceph-ipv4-geneve-ceph [1]
2. This job uses the exact same code and it fails on upgrade:
   DFG-upgrades-updates-16-to-16.1-from-latest_cdn-composable-ipv6 [2]

So deploy doesn't have the issue but upgrade does. Let's dig more into why.

[1] https://rhos-qe-jenkins.rhev-ci-vms.eng.rdu2.redhat.com/job/phase1-16.1_director-rhel-8.2-virthost-1cont_1comp_1ceph-ipv4-geneve-ceph/
[2] https://rhos-qe-jenkins.rhev-ci-vms.eng.rdu2.redhat.com/view/DFG/view/upgrades/view/update/job/DFG-upgrades-updates-16-to-16.1-from-latest_cdn-composable-ipv6/
So I checked the phase1 job and the main difference is that "tripleo-ceph-run-ansible : build create_ceph_ansible_remote_tmp command as list" uses a different ANSIBLE_SSH_CONTROL_PATH_DIR:

    ANSIBLE_SSH_CONTROL_PATH_DIR=/var/lib/mistral/overcloud/ceph-ansible

while during update we have:

    ANSIBLE_SSH_CONTROL_PATH_DIR=/var/lib/mistral/a6d9fd44-100f-4c30-bfce-12cea823fd0f/ceph-ansible/

Looking at https://code.engineering.redhat.com/gerrit/#/c/204232/1/tripleo_ansible/roles/tripleo-ceph-run-ansible/tasks/create_ceph_ansible_remote_tmp.yml, that means that the "playbook_dir" is different during deployment and update.
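To make the difference between the two directories concrete, here is a rough Python sketch of how many bytes each one leaves for the socket file name. It is not taken from tripleo-ansible: `room_for_socket_name` is a hypothetical helper, the limit of 108 is the Linux `UNIX_PATH_MAX`, one byte is assumed to be reserved for the terminating NUL, and the socket name is simply copied from the compute-0 failure in the log.

```python
UNIX_PATH_MAX = 108  # Linux value of UNIX_PATH_MAX from linux/un.h


def room_for_socket_name(control_dir: str, limit: int = UNIX_PATH_MAX) -> int:
    """Hypothetical helper: bytes left for the file name after dir, '/' and NUL."""
    base = control_dir.rstrip("/")
    return (limit - 1) - (len(base.encode("utf-8")) + 1)


deploy_dir = "/var/lib/mistral/overcloud/ceph-ansible"
update_dir = "/var/lib/mistral/a6d9fd44-100f-4c30-bfce-12cea823fd0f/ceph-ansible/"

# Socket file name as it appears in the compute-0 error message.
name = "192.168.24.52-tripleo-admin-22.rwXD1cNSs3f6ZmXW"

for label, directory in (("deploy", deploy_dir), ("update", update_dir)):
    room = room_for_socket_name(directory)
    # Print the available room, the name length, and whether the name fits.
    print(label, room, len(name), len(name) <= room)
```

With these assumptions the deployment directory leaves 67 bytes for the 47-byte socket name, while the UUID-based update directory leaves only 40, which would explain why the same code passes on deployment and fails on update.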
(In reply to Sofer Athlan-Guyot from comment #14)
> So I checked the phase1 job and the main difference is that
> "tripleo-ceph-run-ansible : build create_ceph_ansible_remote_tmp command as
> list" uses a different ANSIBLE_SSH_CONTROL_PATH_DIR:
>
>     ANSIBLE_SSH_CONTROL_PATH_DIR=/var/lib/mistral/overcloud/ceph-ansible
>
> while during update we have:
>
>     ANSIBLE_SSH_CONTROL_PATH_DIR=/var/lib/mistral/a6d9fd44-100f-4c30-bfce-12cea823fd0f/ceph-ansible/
>
> Looking at
> https://code.engineering.redhat.com/gerrit/#/c/204232/1/tripleo_ansible/roles/tripleo-ceph-run-ansible/tasks/create_ceph_ansible_remote_tmp.yml
> that means that the "playbook_dir" is different during deployment and update.

We can probably work around this at deployment time with an env file:

    parameter_defaults:
      CephAnsibleEnvironmentVariables:
        ANSIBLE_SSH_CONTROL_PATH_DIR: /tmp/ceph_ansible_control_path
This is an effective workaround: add this to the heat parameters:

    parameter_defaults:
      CephAnsibleEnvironmentVariables:
        ANSIBLE_SSH_CONTROL_PATH_DIR: "/tmp/ceph_ansible_control_path"

Then re-run:

    openstack overcloud prepare <extra args>

with the above parameter passed in on the command line.

Then re-run:

    openstack overcloud external-update run \
        --stack qe-Cloud-0 \
        --tags ceph 2>&1

Then:

    Wednesday 01 July 2020 14:18:30 +0000 (0:00:00.192) 0:18:48.582 ********
    skipping: [undercloud] => {"changed": false, "skip_reason": "Conditional result was False"}

    TASK [generate ceph-ansible group vars osds] ***********************************
    Wednesday 01 July 2020 14:18:30 +0000 (0:00:00.191) 0:18:48.774 ********
    skipping: [undercloud] => {"changed": false, "skip_reason": "Conditional result was False"}

    PLAY RECAP *********************************************************************
    ceph-0       : ok=4  changed=1  unreachable=0 failed=0 skipped=0   rescued=0 ignored=0
    ceph-1       : ok=3  changed=1  unreachable=0 failed=0 skipped=0   rescued=0 ignored=0
    ceph-2       : ok=3  changed=1  unreachable=0 failed=0 skipped=0   rescued=0 ignored=0
    compute-0    : ok=3  changed=1  unreachable=0 failed=0 skipped=0   rescued=0 ignored=0
    compute-1    : ok=3  changed=1  unreachable=0 failed=0 skipped=0   rescued=0 ignored=0
    controller-0 : ok=3  changed=1  unreachable=0 failed=0 skipped=0   rescued=0 ignored=0
    controller-1 : ok=3  changed=1  unreachable=0 failed=0 skipped=0   rescued=0 ignored=0
    controller-2 : ok=3  changed=1  unreachable=0 failed=0 skipped=0   rescued=0 ignored=0
    undercloud   : ok=61 changed=17 unreachable=0 failed=0 skipped=163 rescued=0 ignored=0

    Wednesday 01 July 2020 14:18:30 +0000 (0:00:00.059) 0:18:48.833 ********
    ===============================================================================
    Updated nodes - None
    Success
Not only will this affect upGRADEs, it will also affect upDATEs.

As per the "keeping openstack updated" doc for ceph [1], you run the same command:

    $ openstack overcloud external-update run --tags ceph

[1] https://access.redhat.com/documentation/en-us/red_hat_openstack_platform/16.0/html-single/keeping_red_hat_openstack_platform_updated/index#updating_all_ceph_storage_nodes
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory, and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:3148