Bug 1578482

Summary: OCP 3.10: etcd scaleup on CRI-O HA cluster fails with dict object has no attribute etcd_ip error
Product: OpenShift Container Platform
Reporter: Walid A. <wabouham>
Component: Installer
Assignee: Russell Teague <rteague>
Status: CLOSED ERRATA
QA Contact: Gaoyun Pei <gpei>
Severity: high
Docs Contact:
Priority: medium
Version: 3.10.0
CC: andreas.kunkel, aos-bugs, dmoessne, gpei, jkaur, jmalde, jokerman, mifiedle, mjahangi, mmccomas, rteague, sdodson, vlaad, wabouham, wmeng
Target Milestone: ---   
Target Release: 3.10.z   
Hardware: x86_64   
OS: Linux   
Whiteboard: aos-scalability-310
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
During etcd scaleup, facts about the etcd cluster are required in order to add new hosts. The necessary tasks have been added to ensure those facts are set before configuring new hosts and therefore allow the scaleup to complete as expected.
Story Points: ---
Clone Of:
Clones: 1628201 (view as bug list)
Environment:
Last Closed: 2018-11-11 16:39:10 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Bug Depends On:    
Bug Blocks: 1628201    

Description Walid A. 2018-05-15 16:55:09 UTC
Description of problem:

When running the openshift-ansible/playbooks/openshift-etcd/scaleup.yml playbook to add 2 etcds to an existing OCP 3.10.0-0.38.0 HA CRI-O cluster (rpm install), we get the following error:

TASK [etcd : Add new etcd members to cluster] ***************************************************************************************************
task path: /root/openshift-ansible/roles/etcd/tasks/add_new_member.yml:5
fatal: [<new_etcd_hostname_fqdn>]: FAILED! => {
    "msg": "The task includes an option with an undefined variable. The error was: 'dict object' has no attribute 'etcd_ip'\n\nThe error appears to have been in '/root/openshift-ansible/roles/etcd/tasks/add_new_member.yml': line 5, column 3, but may\nbe elsewhere in the file depending on the exact syntax problem.\n\nThe offending line appears to be:\n\n\n- name: Add new etcd members to cluster\n  ^ here\n\nexception type: <class 'ansible.errors.AnsibleUndefinedVariable'>\nexception: 'dict object' has no attribute 'etcd_ip'"

The scaleup.yml playbook was executed from a cloned openshift-ansible repo on one of the master/etcd hosts. This is on AWS EC2; the HA cluster has 1 load balancer, 3 master/etcd nodes, 2 infra nodes, 4 compute nodes, and 2 newly added master nodes, which were added successfully with the openshift-master/scaleup.yml playbook. The 2 etcds I am trying to add are on the 2 newly added master node instances.
Version-Release number of selected component (if applicable):
 ~/openshift-ansible # git log --oneline -1
df9f5ed Merge pull request #8299 from mgleung/upgrade-calico-master

# openshift version
openshift v3.10.0-0.38.0
kubernetes v1.10.0+b81c8f8
etcd 3.2.16

Container Runtime Version:          cri-o://1.10.1

# rpm -qva | grep cri-o

# rpm -q openshift-ansible

# rpm -q ansible

# ansible --version
  config file = /etc/ansible/ansible.cfg
  configured module search path = [u'/root/.ansible/plugins/modules', u'/usr/share/ansible/plugins/modules']
  ansible python module location = /usr/lib/python2.7/site-packages/ansible
  executable location = /usr/bin/ansible
  python version = 2.7.5 (default, May  4 2018, 09:38:16) [GCC 4.8.5 20150623 (Red Hat 4.8.5-34)]

How reproducible:

Steps to Reproduce:
1. Install an OCP 3.10.0-0.38.0 HA CRI-O (rpm install) cluster (1 load-balancer, 3 master/etcds nodes, 2 infra nodes, 4 compute nodes) in AWS EC2 with openshift-ansible/ with inventory attached:

    - ansible-playbook -i inv openshift-ansible/playbooks/prerequisites.yml
    - ansible-playbook -i inv openshift-ansible/playbooks/deploy_cluster.yml

2. Create two new instances in AWS EC2

3. Update inventory to add two master nodes (attached "master_scaleup_inventory").

4. On one master/etcd host, run:
   ansible-playbook -i master_scaleup_inv openshift-ansible/playbooks/prerequisites.yml

5. ansible-playbook -i master_scaleup_inv openshift-ansible/playbooks/openshift-master/scaleup.yml

6. Verify cluster is up and 2 newly added masters are running as static pods:
   oc get pods -n kube-system

7. Create etcd_scaleup inventory with the 2 newly added master hostnames

8. Run the etcd scaleup playbook with that inventory:
   ansible-playbook -i etcd_scaleup_inv openshift-ansible/playbooks/openshift-etcd/scaleup.yml

9. I have also tried re-running the prerequisites.yml playbook before the openshift-etcd/scaleup.yml playbook, and I get the same error in the same task.
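For step 7, the etcd scaleup inventory might look roughly like this; hostnames are placeholders, and the key point is that openshift-ansible's etcd scaleup reads the new hosts from a [new_etcd] group that is also listed under [OSEv3:children]:

```ini
; Illustrative etcd_scaleup_inv sketch -- hostnames are placeholders.
[OSEv3:children]
masters
nodes
etcd
new_etcd

[etcd]
master1.example.com
master2.example.com
master3.example.com

[new_etcd]
master4.example.com
master5.example.com
```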

Actual results:

Error in task:
TASK [etcd : Add new etcd members to cluster] 
(see "Description of problem" above)

Expected results:
The playbook completes successfully and the 2 added etcds run as static pods on the two newly added master nodes

Additional info:
Inventory files and logs from ansible-playbook with the -vvv flag are in next comment

Comment 2 Vikas Laad 2018-05-24 19:04:01 UTC
*** Bug 1582230 has been marked as a duplicate of this bug. ***

Comment 5 Vikas Laad 2018-06-06 14:55:26 UTC
*** Bug 1587882 has been marked as a duplicate of this bug. ***

Comment 7 Scott Dodson 2018-08-21 15:16:25 UTC
The attached case is not specific to CRI-O but I believe it's the same root cause.

The only customer case attached to this indicates that they have a workaround or at least it's not a blocker for them. I don't think Urgent is the appropriate severity for this BZ.

Comment 9 Russell Teague 2018-09-11 14:45:53 UTC
master: https://github.com/openshift/openshift-ansible/pull/10002

Comment 10 Russell Teague 2018-09-12 12:57:04 UTC
release-3.10: https://github.com/openshift/openshift-ansible/pull/10016

Comment 12 Russell Teague 2018-09-20 12:16:43 UTC
A fix is proposed for 1628201 which will be backported to 3.10 once it is verified.

Comment 13 Russell Teague 2018-09-20 12:38:17 UTC
Proposed: https://github.com/openshift/openshift-ansible/pull/10167 (release-3.10)

Comment 19 Gaoyun Pei 2018-09-26 07:48:17 UTC
Tested with openshift-ansible-3.10.51-1.git.0.44a646c.el7.noarch.rpm.

1) New etcd collocated with master

When scaling up etcd members on existing master hosts, as in the scenario described in the Description, it works well.

New etcd members were added and run as static pods, and the new etcd URLs were added to etcdClientInfo.urls on all masters.
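As a sketch, the result on each master would look like this excerpt of /etc/origin/master/master-config.yaml; hostnames are placeholders, and the certificate file names are the conventional ones, not taken from this report:

```yaml
# Illustrative excerpt only -- hostnames and file names are assumptions.
etcdClientInfo:
  ca: master.etcd-ca.crt
  certFile: master.etcd-client.crt
  keyFile: master.etcd-client.key
  urls:
    - https://master1.example.com:2379
    - https://master2.example.com:2379
    - https://master3.example.com:2379
    - https://master4.example.com:2379   # newly added member
    - https://master5.example.com:2379   # newly added member
```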

2) New etcd not collocated with master

When trying to scale up etcd members on new standalone hosts, the scale-up playbook fails on the 1st new etcd:

TASK [etcd : Verify cluster is healthy] ****************************************
FAILED - RETRYING: Verify cluster is healthy (1 retries left).
fatal: [ec2-54-211-178-50.compute-1.amazonaws.com]: FAILED! => {"attempts": 30, "changed": false, "cmd": "/usr/local/bin/master-exec etcd etcd etcdctl --cert-file /etc/etcd/peer.crt --key-file /etc/etcd/peer.key --ca-file /etc/etcd/ca.crt --endpoints https://ip-172-18-6-78.ec2.internal:2379 cluster-health", "msg": "[Errno 2] No such file or directory", "rc": 2}

The new etcd member was installed as rpm etcd, so it does not have the static-pod master helper scripts such as /usr/local/bin/master-exec.
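The failure comes down to where etcdctl lives. A hedged shell sketch of the two command forms — the endpoint and certificate paths are taken from the log above, while the variable names are purely illustrative:

```shell
# Common etcdctl arguments, as seen in the failing task's log.
ETCD_ARGS="--cert-file /etc/etcd/peer.crt --key-file /etc/etcd/peer.key \
--ca-file /etc/etcd/ca.crt --endpoints https://ip-172-18-6-78.ec2.internal:2379"

# Static-pod etcd (master host): etcdctl runs inside the etcd container,
# wrapped by the master-exec helper that only static-pod hosts install.
POD_CMD="/usr/local/bin/master-exec etcd etcd etcdctl $ETCD_ARGS cluster-health"

# rpm-installed etcd (standalone host): the same flags, but etcdctl is
# invoked directly on the host -- there is no master-exec wrapper to call.
RPM_CMD="etcdctl $ETCD_ARGS cluster-health"

echo "$POD_CMD"
echo "$RPM_CMD"
```

Running the POD_CMD form on an rpm host yields exactly the "[Errno 2] No such file or directory" seen above, because master-exec does not exist there.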

Attached the full ansible log below.

Comment 21 Russell Teague 2018-09-27 17:33:05 UTC
New 3.10 PR: https://github.com/openshift/openshift-ansible/pull/10255

Comment 22 Andreas Kunkel 2018-09-27 22:03:15 UTC
Running into the same issue as Gaoyun, with 3 standalone etcds and 3 masters. Taking down one etcd and scaling up with a new etcd fails at the same step with the same error (No such file or directory). I was running into the etcd_ip error until I started specifying "openshift_version" and "openshift_image_tag" in addition to "openshift_release" in the vars section of my inventory.
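The workaround variables would sit in the inventory's vars section roughly like this; the version strings are illustrative for a 3.10.z cluster, not taken from this report:

```ini
; Illustrative sketch of the workaround -- version values are assumptions.
[OSEv3:vars]
openshift_release=v3.10
openshift_version=3.10.51
openshift_image_tag=v3.10.51
```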

Comment 23 Russell Teague 2018-10-08 17:23:38 UTC

Comment 24 Gaoyun Pei 2018-10-10 07:35:19 UTC
Verified this bug with openshift-ansible-3.10.53-1.git.0.ba2c2ec.el7.noarch.rpm.

New etcd members could be added successfully for the following scenarios:
* New etcd collocated with masters
* New etcd not collocated with masters

Comment 26 errata-xmlrpc 2018-11-11 16:39:10 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.