Bug 1851927 - [osp16.1][update] ansible-ceph update run fail as unix path too long for Unix domain socket
Summary: [osp16.1][update] ansible-ceph update run fail as unix path too long for Unix...
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat OpenStack
Classification: Red Hat
Component: tripleo-ansible
Version: 16.1 (Train)
Hardware: Unspecified
OS: Unspecified
Priority: urgent
Severity: urgent
Target Milestone: ga
Target Release: 16.1 (Train on RHEL 8.2)
Assignee: John Fulton
QA Contact: Yogev Rabl
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2020-06-29 13:14 UTC by Sofer Athlan-Guyot
Modified: 2020-07-29 07:54 UTC (History)

Fixed In Version: tripleo-ansible-0.5.1-0.20200611113656.34b8fcc.el8ost
Doc Type: No Doc Update
Doc Text:
Clone Of:
Environment:
Last Closed: 2020-07-29 07:53:30 UTC
Target Upstream Version:
Embargoed:


Links
System ID Private Priority Status Summary Last Updated
Launchpad 1885917 0 None None None 2020-07-01 12:53:17 UTC
OpenStack gerrit 738843 0 None MERGED Set short ANSIBLE_SSH_CONTROL_PATH_DIR for all ceph roles 2020-08-31 08:21:03 UTC
OpenStack gerrit 739068 0 None MERGED Set short ANSIBLE_SSH_CONTROL_PATH_DIR for all ceph roles 2020-08-31 08:21:03 UTC
OpenStack gerrit 739069 0 None MERGED Set short ANSIBLE_SSH_CONTROL_PATH_DIR for all ceph roles 2020-08-31 08:21:03 UTC
Red Hat Product Errata RHBA-2020:3148 0 None None None 2020-07-29 07:54:24 UTC

Description Sofer Athlan-Guyot 2020-06-29 13:14:18 UTC
Description of problem: Update from 16.0 to 16.1 fails during:

openstack overcloud external-update run \
    --stack qe-Cloud-0 \
    --tags ceph 2>&1


with:

[create ceph_ansible_remote_tmp on all nodes with necessary
ownership] ****\nSaturday 27 June 2020 17:13:55
+0000 (0:00:00.096) 0:00:00.096 ********* \nfatal:
[controller-0]: UNREACHABLE! => changed=false \n msg: |-\n Data
could not be sent to remote host \"192.168.24.10\". Make sure
this host can be reached over ssh: unix_listener: path
\"/var/lib/mistral/3251f9f9-b644-4f85-a6be-587f084b7d8b/ceph-ansible/192.168.24.10-tripleo-admin-22.OLVVXeftlbIVXc0Z\"
too long for Unix domain socket\n unreachable: true\nfatal:
[controller-1]: UNREACHABLE! => changed=false \n msg: |-\n Data
could not be sent to remote host \"192.168.24.54\". Make sure
this host can be reached over ssh: unix_listener: path
\"/var/lib/mistral/3251f9f9-b644-4f85-a6be-587f084b7d8b/ceph-ansible/192.168.24.54-tripleo-admin-22.GL3KdARXOD8shdpO\"
too long for Unix domain socket\n unreachable: true\nfatal:
[messaging-1]: UNREACHABLE! => changed=false

Version-Release number of selected component (if applicable):

Comment 8 Sofer Athlan-Guyot 2020-07-01 10:54:43 UTC
Hi,

So I've just tested with ansible 2.9 after a manual update of ansible:

rpm -qa | grep '^ansible-2'
ansible-2.9.10-1.el8ae.noarch

followed up with:

openstack overcloud external-update run \
    --stack qe-Cloud-0 \
    --tags ceph 2>&1


and ended up with:


TASK [create ceph_ansible_remote_tmp on all nodes with necessary ownership] ****", "Wednesday 01 July 2020  10:39:16 +0000 (0:00:00.033)       0:00:00.033 ******** ", "fatal: [controller-2]: UNREACHABLE! => changed=false ", "  msg: |-", "    Data could not be sent to remote host \"192.168.24.6\". Make sure this host can be reached over ssh: unix_listener: path \"/var/lib/mistral/a6d9fd44-100f-4c30-bfce-12cea823fd0f/ceph-ansible/192.168.24.6-tripleo-admin-22.pA5DYd5I6HW6C4yh\" too long for Unix domain socket", "  unreachable: true", "fatal: [ceph-0]: UNREACHABLE! => changed=false ", "  msg: |-", "    Data could not be sent to remote host \"192.168.24.54\". Make sure this host can be reached over ssh: unix_listener: path \"/var/lib/mistral/a6d9fd44-100f-4c30-bfce-12cea823fd0f/ceph-ansible/192.168.24.54-tripleo-admin-22.F1juQ8Fwz6i9NuWk\" too long for Unix domain socket", "  unreachable: true", "fatal: [ceph-2]: UNREACHABLE! => changed=false ", "  msg: |-", "    Data could not be sent to remote host \"192.168.24.24\". Make sure this host can be reached over ssh: unix_listener: path \"/var/lib/mistral/a6d9fd44-100f-4c30-bfce-12cea823fd0f/ceph-ansible/192.168.24.24-tripleo-admin-22.I5oy1cYsMLLzOYaB\" too long for Unix domain socket", "  unreachable: true", "fatal: [ceph-1]: UNREACHABLE! => changed=false ", "  msg: |-", "    Data could not be sent to remote host \"192.168.24.9\". Make sure this host can be reached over ssh: unix_listener: path \"/var/lib/mistral/a6d9fd44-100f-4c30-bfce-12cea823fd0f/ceph-ansible/192.168.24.9-tripleo-admin-22.XuME121mWtW2Q2ZH\" too long for Unix domain socket", "  unreachable: true", "fatal: [controller-0]: UNREACHABLE! => changed=false ", "  msg: |-", "    Data could not be sent to remote host \"192.168.24.48\". 
Make sure this host can be reached over ssh: unix_listener: path \"/var/lib/mistral/a6d9fd44-100f-4c30-bfce-12cea823fd0f/ceph-ansible/192.168.24.48-tripleo-admin-22.jFTZ2QQcZ0xgImCr\" too long for Unix domain socket", "  unreachable: true", "fatal: [controller-1]: UNREACHABLE! => changed=false ", "  msg: |-", "    Data could not be sent to remote host \"192.168.24.17\". Make sure this host can be reached over ssh: unix_listener: path \"/var/lib/mistral/a6d9fd44-100f-4c30-bfce-12cea823fd0f/ceph-ansible/192.168.24.17-tripleo-admin-22.ZEV2kuSzWC2NNTAo\" too long for Unix domain socket", "  unreachable: true", "fatal: [compute-1]: UNREACHABLE! => changed=false ", "  msg: |-", "    Data could not be sent to remote host \"192.168.24.29\". Make sure this host can be reached over ssh: unix_listener: path \"/var/lib/mistral/a6d9fd44-100f-4c30-bfce-12cea823fd0f/ceph-ansible/192.168.24.29-tripleo-admin-22.oT6j6422O7uEwTGJ\" too long for Unix domain socket", "  unreachable: true", "fatal: [compute-0]: UNREACHABLE! => changed=false ", "  msg: |-", "    Data could not be sent to remote host \"192.168.24.52\". Make sure this host can be reached over ssh: unix_listener: path \"/var/lib/mistral/a6d9fd44-100f-4c30-bfce-12cea823fd0f/ceph-ansible/192.168.24.52-tripleo-admin-22.rwXD1cNSs3f6ZmXW\" too long for Unix domain socket", "  unreachable: true", "fatal: [undercloud]: UNREACHABLE! => changed=false ", "  msg: |-", "    Data could not be sent to remote host \"localhost\". Make sure this host can be reached over ssh: unix_listener: path \"/var/lib/mistral/a6d9fd44-100f-4c30-bfce-12cea823fd0f/ceph-ansible/localhost-tripleo-admin-22.EfMPy8s6M7L7RptW\" too long for Unix domain socket", "  unreachable: true"

Comment 10 Sofer Athlan-Guyot 2020-07-01 10:59:06 UTC
Moved the patch to https://bugzilla.redhat.com/show_bug.cgi?id=1852801, as it doesn't fix the issue here.

Comment 11 Sofer Athlan-Guyot 2020-07-01 11:10:24 UTC
So there is no way around this; the Unix socket path is too long and has to be shortened:


[stack@undercloud-0 ~]$ cat /etc/redhat-release 
Red Hat Enterprise Linux release 8.2 (Ootpa)
[stack@undercloud-0 ~]$ grep "define UNIX_PATH_MAX" /usr/include/linux/un.h
#define UNIX_PATH_MAX   108
[stack@undercloud-0 ~]$ echo '/var/lib/mistral/f6075035-eb84-43ae-b517-8388b44ab148/ceph-ansible/192.168.24.52-tripleo-admin-22.ZDdc6A6Y2bLQQFO9' | wc -c
115

This is definitely a blocker.

Why this passes on phase1 still eludes me, though.
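
Note that the 115 reported by wc -c includes the newline appended by echo, so the path itself is 114 bytes. It is over the limit either way: UNIX_PATH_MAX counts the terminating NUL byte, which leaves at most 107 usable bytes. A minimal sketch of a check that avoids the off-by-one, assuming the limit of 108 shown above (the helper name socket_path_fits is made up for illustration):

```shell
#!/bin/sh
# Limit taken from the un.h output above; it includes the terminating NUL.
UNIX_PATH_MAX=108

# Hypothetical helper: succeed if the path fits in sockaddr_un.sun_path.
socket_path_fits() {
    # ${#1} counts characters; for ASCII-only paths that equals bytes.
    [ "${#1}" -lt "$UNIX_PATH_MAX" ]
}

path='/var/lib/mistral/f6075035-eb84-43ae-b517-8388b44ab148/ceph-ansible/192.168.24.52-tripleo-admin-22.ZDdc6A6Y2bLQQFO9'

if socket_path_fits "$path"; then
    echo "fits: ${#path} bytes"
else
    echo "too long: ${#path} bytes (limit $((UNIX_PATH_MAX - 1)))"
fi
```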

Comment 13 John Fulton 2020-07-01 12:10:06 UTC
1. This job uses the exact same code and it passes on deployment phase1-16.1_director-rhel-8.2-virthost-1cont_1comp_1ceph-ipv4-geneve-ceph [1]
2. This job uses the exact same code and it fails on upgrade DFG-upgrades-updates-16-to-16.1-from-latest_cdn-composable-ipv6 [2]

So deploy doesn't have the issue but upgrade does. Let's dig more into why.

[1] https://rhos-qe-jenkins.rhev-ci-vms.eng.rdu2.redhat.com/job/phase1-16.1_director-rhel-8.2-virthost-1cont_1comp_1ceph-ipv4-geneve-ceph/
[2] https://rhos-qe-jenkins.rhev-ci-vms.eng.rdu2.redhat.com/view/DFG/view/upgrades/view/update/job/DFG-upgrades-updates-16-to-16.1-from-latest_cdn-composable-ipv6/

Comment 14 Sofer Athlan-Guyot 2020-07-01 12:19:39 UTC
So I checked the phase1 job, and the main difference is that "tripleo-ceph-run-ansible : build create_ceph_ansible_remote_tmp command as list" uses a different ANSIBLE_SSH_CONTROL_PATH_DIR:


ANSIBLE_SSH_CONTROL_PATH_DIR=/var/lib/mistral/overcloud/ceph-ansible

while during update we have:

ANSIBLE_SSH_CONTROL_PATH_DIR=/var/lib/mistral/a6d9fd44-100f-4c30-bfce-12cea823fd0f/ceph-ansible/

Looking at https://code.engineering.redhat.com/gerrit/#/c/204232/1/tripleo_ansible/roles/tripleo-ceph-run-ansible/tasks/create_ceph_ansible_remote_tmp.yml, that means that the "playbook_dir" is different during deployment and update.
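
One way to see why deployment passes while the update fails is to measure the full socket path under each directory. A rough sketch: the socket file name mirrors the ones in the error messages, with the 16-character suffix replaced by placeholder X characters, and the helper name socket_path_len is made up for illustration:

```shell
#!/bin/sh
# Socket file name shaped like those in the errors above: host-user-port,
# a dot, then 16 characters (placeholder Xs stand in for the random suffix).
name='192.168.24.52-tripleo-admin-22.XXXXXXXXXXXXXXXX'

# Hypothetical helper: print the length of DIR/name.
socket_path_len() {
    full="$1/$name"
    echo "${#full}"
}

deploy_dir=/var/lib/mistral/overcloud/ceph-ansible
update_dir=/var/lib/mistral/a6d9fd44-100f-4c30-bfce-12cea823fd0f/ceph-ansible

echo "deploy: $(socket_path_len "$deploy_dir")"
echo "update: $(socket_path_len "$update_dir")"
```

With these inputs the deployment path comes to 87 characters and the update path to 114, so only the update path crosses the 108-byte UNIX_PATH_MAX.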

Comment 15 Giulio Fidente 2020-07-01 12:31:29 UTC
(In reply to Sofer Athlan-Guyot from comment #14)
> So I checked the phase1 job, and the main difference is that
> "tripleo-ceph-run-ansible : build create_ceph_ansible_remote_tmp command as
> list" uses a different ANSIBLE_SSH_CONTROL_PATH_DIR:
> 
> 
> ANSIBLE_SSH_CONTROL_PATH_DIR=/var/lib/mistral/overcloud/ceph-ansible
> 
> while during update we have:
> 
> ANSIBLE_SSH_CONTROL_PATH_DIR=/var/lib/mistral/a6d9fd44-100f-4c30-bfce-12cea823fd0f/ceph-ansible/
> 
> Looking at
> https://code.engineering.redhat.com/gerrit/#/c/204232/1/tripleo_ansible/roles/tripleo-ceph-run-ansible/tasks/create_ceph_ansible_remote_tmp.yml,
> that means that the "playbook_dir" is different during deployment and update.

We can probably work around this at deployment time with an env file:

parameter_defaults:
  CephAnsibleEnvironmentVariables:
    ANSIBLE_SSH_CONTROL_PATH_DIR: /tmp/ceph_ansible_control_path
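
As a sketch only, the env file could be written from the shell like this and then passed with -e alongside the other environment files. The file name and location are arbitrary choices for illustration, not something prescribed by tripleo:

```shell
#!/bin/sh
# Hypothetical file location; override with ENV_FILE if preferred.
env_file="${ENV_FILE:-${TMPDIR:-/tmp}/ceph-control-path-env.yaml}"

# Quoted heredoc delimiter so the YAML is written verbatim.
cat > "$env_file" <<'EOF'
parameter_defaults:
  CephAnsibleEnvironmentVariables:
    ANSIBLE_SSH_CONTROL_PATH_DIR: /tmp/ceph_ansible_control_path
EOF

echo "wrote $env_file"
```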

Comment 19 Sofer Athlan-Guyot 2020-07-01 14:43:27 UTC
This is an effective workaround.

Add this to the heat parameters:


parameter_defaults:
    CephAnsibleEnvironmentVariables:
      ANSIBLE_SSH_CONTROL_PATH_DIR: "/tmp/ceph_ansible_control_path"

Then re-run:

openstack overcloud prepare <extra args>

with the above parameter passed in on the command line.

Then re-run:

openstack overcloud external-update run \
    --stack qe-Cloud-0 \
    --tags ceph 2>&1


Then:

Wednesday 01 July 2020  14:18:30 +0000 (0:00:00.192)       0:18:48.582 ******** 
skipping: [undercloud] => {"changed": false, "skip_reason": "Conditional result was False"}

TASK [generate ceph-ansible group vars osds] ***********************************
Wednesday 01 July 2020  14:18:30 +0000 (0:00:00.191)       0:18:48.774 ******** 
skipping: [undercloud] => {"changed": false, "skip_reason": "Conditional result was False"}

PLAY RECAP *********************************************************************
ceph-0                     : ok=4    changed=1    unreachable=0    failed=0    skipped=0    rescued=0    ignored=0   
ceph-1                     : ok=3    changed=1    unreachable=0    failed=0    skipped=0    rescued=0    ignored=0   
ceph-2                     : ok=3    changed=1    unreachable=0    failed=0    skipped=0    rescued=0    ignored=0   
compute-0                  : ok=3    changed=1    unreachable=0    failed=0    skipped=0    rescued=0    ignored=0   
compute-1                  : ok=3    changed=1    unreachable=0    failed=0    skipped=0    rescued=0    ignored=0   
controller-0               : ok=3    changed=1    unreachable=0    failed=0    skipped=0    rescued=0    ignored=0   
controller-1               : ok=3    changed=1    unreachable=0    failed=0    skipped=0    rescued=0    ignored=0   
controller-2               : ok=3    changed=1    unreachable=0    failed=0    skipped=0    rescued=0    ignored=0   
undercloud                 : ok=61   changed=17   unreachable=0    failed=0    skipped=163  rescued=0    ignored=0   

Wednesday 01 July 2020  14:18:30 +0000 (0:00:00.059)       0:18:48.833 ******** 
=============================================================================== 

Updated nodes - None
Success

Comment 22 John Fulton 2020-07-01 19:48:39 UTC
Not only will this affect upGRADEs, it will also affect upDATEs.

As per the "keeping openstack updated" doc for ceph [1], you run the same command:

$ openstack overcloud external-update run --tags ceph

[1] https://access.redhat.com/documentation/en-us/red_hat_openstack_platform/16.0/html-single/keeping_red_hat_openstack_platform_updated/index#updating_all_ceph_storage_nodes

Comment 46 errata-xmlrpc 2020-07-29 07:53:30 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:3148

