Bug 1391805

Summary:

etcd backup fails when upgrading OpenShift 3.2

Product:

OpenShift Container Platform

Reporter:

Anping Li <anli>

Component:

Cluster Version Operator

Assignee:

Jason DeTiberus <jdetiber>

Status:

CLOSED DUPLICATE

QA Contact:

Anping Li <anli>

Severity:

medium

Docs Contact:

Priority:

medium

Version:

3.2.1

CC:

anli, aos-bugs, dgoodwin, jokerman, mmccomas, tobias.genannt

Target Milestone:

---

Keywords:

Reopened

Target Release:

---

Hardware:

Unspecified

OS:

Unspecified

Whiteboard:

Fixed In Version:

Doc Type:

If docs needed, set a value

Doc Text:

Story Points:

---

Clone Of:

Environment:

Last Closed:

2016-11-08 12:33:25 UTC

Type:

Bug

Attachments:

Description	Flags
Upgrade logs	none

Description Anping Li 2016-11-04 05:26:00 UTC

Description of problem:
The etcd backup fails when upgrading an environment that uses embedded etcd.

Version-Release number of selected component (if applicable):
atomic-openshift-utils-3.2.37-1.git.0.8f013d0.el7.noarch
ansible-2.2.0.0-0.100.el7.noarch

How reproducible:
always

Steps to Reproduce:
1. install OCP-3.2
2. upgrade to OCP-3.2
  ansible-playbook /root/openshift-ansible/playbooks/byo/openshift-cluster/upgrades/v3_2/upgrade.yml

Actual results:
2. PLAY [Backup etcd] *************************************************************

TASK [setup] *******************************************************************
Using module file /usr/lib/python2.7/site-packages/ansible/modules/core/system/setup.py
<groups.oo_etcd_to_config if groups.oo_etcd_to_config is defined and groups.oo_etcd_to_config | length > 0 else groups.oo_first_master> ESTABLISH SSH CONNECTION FOR USER: None
<groups.oo_etcd_to_config if groups.oo_etcd_to_config is defined and groups.oo_etcd_to_config | length > 0 else groups.oo_first_master> SSH: EXEC ssh -C -o ControlMaster=auto -o ControlPersist=60s -o KbdInteractiveAuthentication=no -o PreferredAuthentications=gssapi-with-mic,gssapi-keyex,hostbased,publickey -o PasswordAuthentication=no -o ConnectTimeout=10 -o ControlPath=/root/.ansible/cp/ansible-ssh-%h-%p-%r 'groups.oo_etcd_to_config if groups.oo_etcd_to_config is defined and groups.oo_etcd_to_config | length > 0 else groups.oo_first_master' '/bin/sh -c '"'"'( umask 77 && mkdir -p "` echo $HOME/.ansible/tmp/ansible-tmp-1478233944.01-82732812709270 `" && echo ansible-tmp-1478233944.01-82732812709270="` echo $HOME/.ansible/tmp/ansible-tmp-1478233944.01-82732812709270 `" ) && sleep 0'"'"''
fatal: [groups.oo_etcd_to_config if groups.oo_etcd_to_config is defined and groups.oo_etcd_to_config | length > 0 else groups.oo_first_master]: UNREACHABLE! => {
    "changed": false, 
    "msg": "Failed to connect to the host via ssh: ControlPath too long\r\n", 
    "unreachable": true
}
	to retry, use: --limit @/usr/share/ansible/openshift-ansible/playbooks/byo/openshift-cluster/upgrades/v3_2/upgrade.retry

PLAY RECAP *********************************************************************
groups.oo_etcd_to_config if groups.oo_etcd_to_config is defined and groups.oo_etcd_to_config | length > 0 else groups.oo_first_master : ok=0    changed=0    unreachable=1    failed=0   
localhost                  : ok=13   changed=8    unreachable=0    failed=0   
openshift-223.lab.eng.nay.redhat.com : ok=87   changed=1    unreachable=0    failed=0   
openshift-224.lab.eng.nay.redhat.com : ok=77   changed=1    unreachable=0    failed=0   


Expected results:


Additional info:

[OSEv3:children]
masters
nodes
nfs
[OSEv3:vars]

ansible_ssh_user=root
openshift_master_default_subdomain_enable=true
openshift_master_default_subdomain=1104-43x.qe.rhcloud.com
openshift_auth_type=htpasswd
openshift_master_identity_providers=[{'name': 'htpasswd_auth', 'login': 'true', 'challenge': 'true', 'kind': 'HTPasswdPasswordIdentityProvider', 'filename': '/etc/origin/htpasswd'}]
deployment_type=openshift-enterprise
oreg_url=openshift3/ose-${component}:${version}
osm_use_cockpit=false
osm_cockpit_plugins=['cockpit-kubernetes']
openshift_node_kubelet_args={"minimum-container-ttl-duration": ["10s"], "maximum-dead-containers-per-container": ["1"], "maximum-dead-containers": ["20"], "image-gc-high-threshold": ["80"], "image-gc-low-threshold": ["70"]}
openshift_hosted_registry_selector="role=node,registry=enabled"
openshift_hosted_router_selector="role=node,router=enabled"
debug_level=5
openshift_set_hostname=true
openshift_override_hostname_check=true
openshift_hosted_registry_storage_kind=nfs
openshift_hosted_registry_storage_nfs_options="*(rw,root_squash,sync,no_wdelay)"
openshift_hosted_registry_storage_nfs_directory=/var/lib/exports
openshift_hosted_registry_storage_volume_name=regpv
openshift_hosted_registry_storage_access_modes=["ReadWriteMany"]
openshift_hosted_registry_storage_volume_size=17G
openshift_docker_additional_registries=virt-openshift-05.lab.eng.nay.redhat.com:5000
openshift_docker_insecure_registries=virt-openshift-05.lab.eng.nay.redhat.com:5000

[masters]
openshift-223.lab.eng.nay.redhat.com ansible_user=root ansible_ssh_user=root openshift_public_hostname=openshift-223.lab.eng.nay.redhat.com openshift_hostname=openshift-223.lab.eng.nay.redhat.com
[nodes]
openshift-223.lab.eng.nay.redhat.com ansible_user=root ansible_ssh_user=root openshift_public_hostname=openshift-223.lab.eng.nay.redhat.com openshift_hostname=openshift-223.lab.eng.nay.redhat.com openshift_node_labels="{'role': 'node'}"
openshift-224.lab.eng.nay.redhat.com ansible_user=root ansible_ssh_user=root openshift_public_hostname=openshift-224.lab.eng.nay.redhat.com openshift_hostname=openshift-224.lab.eng.nay.redhat.com openshift_node_labels="{'role': 'node','registry': 'enabled','router': 'enabled'}"

[nfs]
openshift-223.lab.eng.nay.redhat.com ansible_user=root ansible_ssh_user=root

Comment 1 Anping Li 2016-11-04 06:14:53 UTC

Hit the same issue when upgrading OpenShift 3.2 with external etcd.

Comment 2 Anping Li 2016-11-04 06:20:04 UTC

Created attachment 1217294 [details]
Upgrade logs

Comment 3 Devan Goodwin 2016-11-04 11:52:06 UTC

I believe this is a known issue with Ansible and hosts with long hostnames. For example, we have to work around this when using AWS by setting this parameter in /etc/ansible/ansible.cfg:

control_path = %(directory)s/ansible-ssh-%%C

More information available here: http://docs.ansible.com/ansible/intro_configuration.html#control-path

Comment 4 Anping Li 2016-11-07 08:03:18 UTC

It seems the control path setting doesn't take effect. I didn't use a long hostname or a long home directory, and the socket name seems to be shorter than 108 characters.

ansible-2.2.0.0-0.100.el7.noarch
openshift-ansible-3.2.37-1.git.0.8f013d0.el7.noarch


PLAY [Backup etcd] *************************************************************

TASK [setup] *******************************************************************
Using module file /usr/lib/python2.7/site-packages/ansible/modules/core/system/setup.py
<groups.oo_etcd_to_config if groups.oo_etcd_to_config is defined and groups.oo_etcd_to_config | length > 0 else groups.oo_first_master> ESTABLISH SSH CONNECTION FOR USER: None
<groups.oo_etcd_to_config if groups.oo_etcd_to_config is defined and groups.oo_etcd_to_config | length > 0 else groups.oo_first_master> SSH: EXEC ssh -vvv -C -o ControlMaster=auto -o ControlPersist=60s -o KbdInteractiveAuthentication=no -o PreferredAuthentications=gssapi-with-mic,gssapi-keyex,hostbased,publickey -o PasswordAuthentication=no -o ConnectTimeout=10 -o ControlPath=/root/.ansible/cp/ansible-ssh-%h-%p-%r 'groups.oo_etcd_to_config if groups.oo_etcd_to_config is defined and groups.oo_etcd_to_config | length > 0 else groups.oo_first_master' '/bin/sh -c '"'"'( umask 77 && mkdir -p "` echo $HOME/.ansible/tmp/ansible-tmp-1478505418.18-152149785284243 `" && echo ansible-tmp-1478505418.18-152149785284243="` echo $HOME/.ansible/tmp/ansible-tmp-1478505418.18-152149785284243 `" ) && sleep 0'"'"''
fatal: [groups.oo_etcd_to_config if groups.oo_etcd_to_config is defined and groups.oo_etcd_to_config | length > 0 else groups.oo_first_master]: UNREACHABLE! => {
    "changed": false, 
    "msg": "Failed to connect to the host via ssh: OpenSSH_6.6.1, OpenSSL 1.0.1e-fips 11 Feb 2013\r\ndebug1: Reading configuration data /etc/ssh/ssh_config\r\ndebug1: /etc/ssh/ssh_config line 57: Applying options for *\r\ndebug1: auto-mux: Trying existing master\r\nControlPath too long\r\n", 
    "unreachable": true
}
	to retry, use: --limit @/usr/share/ansible/openshift-ansible/playbooks/byo/openshift-cluster/upgrades/v3_2/upgrade.retry

PLAY RECAP *********************************************************************
groups.oo_etcd_to_config if groups.oo_etcd_to_config is defined and groups.oo_etcd_to_config | length > 0 else groups.oo_first_master : ok=0    changed=0    unreachable=1    failed=0   
localhost                  : ok=13   changed=8    unreachable=0    failed=0   
openshift-190.lab.eng.nay.redhat.com : ok=86   changed=1    unreachable=0    failed=0
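
A rough way to check the socket-name length from the log above. In the ssh command line, the host name is the literal group expression rather than a machine name, so this sketch expands the logged ControlPath template for both that string and the real host. The helper name, port 22, user root, and the 108-character limit (the figure quoted in this comment) are assumptions for illustration; the effective limit on a given system may be lower.

```python
# Hypothetical helper: expand the ssh ControlPath tokens seen in the log.
SUN_PATH_LIMIT = 108  # assumed limit, taken from the comment above


def expand_control_path(template, host, port, user):
    """Substitute the %h, %p and %r tokens with concrete values."""
    return (template.replace("%h", host)
                    .replace("%p", str(port))
                    .replace("%r", user))


# ControlPath template exactly as it appears in the logged ssh command.
template = "/root/.ansible/cp/ansible-ssh-%h-%p-%r"

# Host name as logged (the unexpanded expression) and the real master host.
logged_host = ("groups.oo_etcd_to_config if groups.oo_etcd_to_config is defined "
               "and groups.oo_etcd_to_config | length > 0 else groups.oo_first_master")
real_host = "openshift-190.lab.eng.nay.redhat.com"

for host in (logged_host, real_host):
    # Port 22 and user root are assumed values, not taken from the log.
    path = expand_control_path(template, host, 22, "root")
    print(len(path), "fits" if len(path) < SUN_PATH_LIMIT else "too long")
```

If the numbers come out as expected, the real hostname fits comfortably and only the unexpanded expression pushes the path over the limit.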

Comment 5 Devan Goodwin 2016-11-07 15:46:01 UTC

In your previous comment we can see that the control path fix is not in effect: "ControlPath=/root/.ansible/cp/ansible-ssh-%h-%p-%r"

It should be using "control_path = %(directory)s/%%h-%%r" per the link above. Also note that it must be in the [ssh_connection] section of ansible.cfg, and it may be ignored if you are using custom ssh_args.

Please attach /etc/ansible/ansible.cfg if the problem still persists.
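
For reference, a small sketch of why the setting is written with doubled percent signs. It assumes the control_path value is expanded with Python %-style formatting, so "%(directory)s" is filled in and "%%" collapses to a single "%" that ssh then sees as its own %h and %r tokens. The directory value is taken from the log in the previous comment.

```python
# Assumption: the control_path template is expanded with Python %-formatting.
template = "%(directory)s/%%h-%%r"

# "%(directory)s" is replaced by the control-path directory;
# each "%%" becomes a literal "%" left for ssh to interpret.
control_path = template % {"directory": "/root/.ansible/cp"}

print(control_path)  # prints /root/.ansible/cp/%h-%r
```

If the ssh command in the log still shows "ansible-ssh-%h-%p-%r", the setting was not picked up.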

Comment 6 Devan Goodwin 2016-11-07 15:51:04 UTC

You may also be able to set it on the command line with the ANSIBLE_SSH_CONTROL_PATH environment variable.

Comment 7 Devan Goodwin 2016-11-07 16:05:21 UTC

Example: ANSIBLE_SSH_CONTROL_PATH=/root/.ansible/cp/%%h-%%r

Comment 8 Devan Goodwin 2016-11-08 12:33:25 UTC

This looks to have surfaced with a customer, and the other Bugzilla has caught something we had not noticed yet. Closing this one as a duplicate; let's resume in 1392169.

*** This bug has been marked as a duplicate of bug 1392169 ***

Comment 9 Jason DeTiberus 2016-11-16 19:33:23 UTC

So, depending on the generated hostname, /root/.ansible/cp/%%h-%%r could still be too long. Switching to something like /tmp/cp/%%h-%%r could solve the problem, as could using shorter hostnames.
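
To put rough numbers on that, the sketch below estimates the longest hostname that fits for each directory. The function name, the 108-character limit (the figure from Comment 4), and the user root are assumptions; ssh may need extra headroom while it creates the socket, so treat the results as upper bounds.

```python
SUN_PATH_LIMIT = 108  # assumed limit, taken from Comment 4


def max_host_length(directory, user="root", limit=SUN_PATH_LIMIT):
    """Longest host name that keeps '<directory>/%h-%r' under the limit."""
    # Fixed part of the path: directory, "/" separator, "-" separator, user.
    fixed = len(directory) + len("/") + len("-") + len(user)
    # The full path must be strictly shorter than the limit.
    return limit - 1 - fixed


for directory in ("/root/.ansible/cp", "/tmp/cp"):
    print(directory, max_host_length(directory))
```

Moving the sockets to /tmp/cp frees ten characters for the hostname, which helps with moderately long names but not with arbitrarily long ones.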