Bug 1578482
| Summary: | OCP 3.10: etcd scaleup on CRI-O HA cluster fails with dict object has no attribute etcd_ip error | | |
|---|---|---|---|
| Product: | OpenShift Container Platform | Reporter: | Walid A. <wabouham> |
| Component: | Installer | Assignee: | Russell Teague <rteague> |
| Status: | CLOSED ERRATA | QA Contact: | Gaoyun Pei <gpei> |
| Severity: | high | Docs Contact: | |
| Priority: | medium | | |
| Version: | 3.10.0 | CC: | andreas.kunkel, aos-bugs, dmoessne, gpei, jkaur, jmalde, jokerman, mifiedle, mjahangi, mmccomas, rteague, sdodson, vlaad, wabouham, wmeng |
| Target Milestone: | --- | | |
| Target Release: | 3.10.z | | |
| Hardware: | x86_64 | | |
| OS: | Linux | | |
| Whiteboard: | aos-scalability-310 | | |
| Fixed In Version: | | Doc Type: | Bug Fix |
| Doc Text: | During etcd scaleup, facts about the etcd cluster are required in order to add new hosts. The necessary tasks have been added to ensure those facts are set before configuring new hosts, therefore allowing the scaleup to complete as expected. | Story Points: | --- |
| Clone Of: | | | |
| : | 1628201 (view as bug list) | Environment: | |
| Last Closed: | 2018-11-11 16:39:10 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Bug Depends On: | | | |
| Bug Blocks: | 1628201 | | |
*** Bug 1582230 has been marked as a duplicate of this bug. ***

*** Bug 1587882 has been marked as a duplicate of this bug. ***

The attached case is not specific to CRI-O, but I believe it's the same root cause.

The only customer case attached to this bug indicates that the customer has a workaround, or at least that this is not a blocker for them. I don't think Urgent is the appropriate severity for this BZ.

A fix is proposed for bug 1628201 and will be backported to 3.10 once it is verified.

Proposed: https://github.com/openshift/openshift-ansible/pull/10167 (release-3.10)

Tested with openshift-ansible-3.10.51-1.git.0.44a646c.el7.noarch.rpm.
1) New etcd collocated with master
When scaling up an etcd member on existing master hosts (the scenario described in the Description), the playbook worked well:
new etcd members were added, running as static pods, and the new etcd URLs were added to etcdClientInfo.urls on all masters.
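For reference, a successful scaleup in this first scenario can be spot-checked with commands like the following. This is a sketch: the pod name follows the 3.10 `master-etcd-<hostname>` static-pod convention and the config path assumes a default rpm install, so adjust as needed.

```shell
# Run on a master after the scaleup playbook completes (3.10 conventions assumed).
# New etcd members should be running as static pods in kube-system:
oc get pods -n kube-system | grep master-etcd

# The new member URLs should appear under etcdClientInfo.urls on every master:
grep -A 6 etcdClientInfo /etc/origin/master/master-config.yaml
```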
2) New etcd not collocated with master
When trying to scale up an etcd member on new hosts, the scaleup playbook failed on the first new etcd host:
TASK [etcd : Verify cluster is healthy] ****************************************
...
FAILED - RETRYING: Verify cluster is healthy (1 retries left).
fatal: [ec2-54-211-178-50.compute-1.amazonaws.com]: FAILED! => {"attempts": 30, "changed": false, "cmd": "/usr/local/bin/master-exec etcd etcd etcdctl --cert-file /etc/etcd/peer.crt --key-file /etc/etcd/peer.key --ca-file /etc/etcd/ca.crt --endpoints https://ip-172-18-6-78.ec2.internal:2379 cluster-health", "msg": "[Errno 2] No such file or directory", "rc": 2}
The new etcd member was installed as rpm-based etcd, so the host does not have the static-pod master scripts (such as master-exec).
The full ansible log is attached below.
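The failure is consistent with the host layout: `/usr/local/bin/master-exec` is only laid down on hosts converted to static pods, so the health check's `master-exec etcd etcd etcdctl ...` invocation cannot run on an rpm-installed etcd host. A quick way to tell the two layouts apart (a sketch; paths per a default 3.10 install):

```shell
# Static-pod etcd host: the wrapper script exists and etcd runs as a pod.
ls /usr/local/bin/master-exec

# Rpm-installed etcd host: no wrapper script; etcd runs as a systemd unit,
# which is why the playbook fails with "[Errno 2] No such file or directory".
rpm -q etcd
systemctl status etcd
```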
Running into the same issue as Gaoyun, with 3 standalone etcds and 3 masters. Taking down one etcd and scaling up with a new etcd fails at the same step with the same error (No such file or directory).

I was running into the etcd_ip error until I started specifying "openshift_version" and "openshift_image_tag" in addition to "openshift_release" in the vars section of my inventory.

openshift-ansible-3.10.53-1

Verified this bug with openshift-ansible-3.10.53-1.git.0.ba2c2ec.el7.noarch.rpm. New etcd members could be added successfully for the following scenarios:

* New etcd collocated with masters
* New etcd not collocated with masters

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2018:2709
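The inventory workaround described above amounts to pinning all three version variables rather than only openshift_release. A minimal sketch (the version values here are illustrative, not taken from the reporter's inventory):

```ini
[OSEv3:vars]
# With only openshift_release set, the scaleup hit the etcd_ip error;
# pinning all three variables avoided it (example values):
openshift_release=v3.10
openshift_version=3.10.53
openshift_image_tag=v3.10.53
```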
Description of problem:

When running the openshift-ansible/playbooks/openshift-etcd/scaleup.yml playbook to add 2 etcds to an existing OCP 3.10.0-0.38.0 HA CRI-O cluster (rpm install), we get the following error:

TASK [etcd : Add new etcd members to cluster] ***************************************************************************************************
task path: /root/openshift-ansible/roles/etcd/tasks/add_new_member.yml:5
fatal: [<new_etcd_hostname_fqdn>]: FAILED! => { "msg": "The task includes an option with an undefined variable. The error was: 'dict object' has no attribute 'etcd_ip'\n\nThe error appears to have been in '/root/openshift-ansible/roles/etcd/tasks/add_new_member.yml': line 5, column 3, but may\nbe elsewhere in the file depending on the exact syntax problem.\n\nThe offending line appears to be:\n\n\n- name: Add new etcd members to cluster\n ^ here\n\nexception type: <class 'ansible.errors.AnsibleUndefinedVariable'>\nexception: 'dict object' has no attribute 'etcd_ip'" }

The scaleup.yml playbook was executed from a cloned openshift-ansible repo, on one of the master/etcd hosts. This is on AWS EC2; the HA cluster has 1 load balancer, 3 master/etcd nodes, 2 infra nodes, 4 compute nodes, and 2 newly added master nodes, which were added successfully with the openshift-master/scaleup.yml playbook. The 2 etcds I am trying to add are on the 2 newly added master node instances.
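The undefined variable is a per-host fact: add_new_member.yml templates the member-add arguments from facts such as `etcd_ip` on the existing etcd hosts, and those facts were never gathered during scaleup. The shape of the eventual fix (per the Doc Text above) is roughly the following. This is a simplified sketch, not the literal code from PR 10167, and the `tasks_from` entry point shown is hypothetical:

```yaml
# Sketch: gather etcd facts (including etcd_ip) on the existing cluster
# members before any task on the new hosts dereferences them.
- name: Set etcd facts on current cluster members
  hosts: oo_etcd_to_config
  roles:
    - role: etcd
      tasks_from: set_facts.yml   # hypothetical task file name
```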
Version-Release number of selected component (if applicable):

~/openshift-ansible # git log --oneline -1
df9f5ed Merge pull request #8299 from mgleung/upgrade-calico-master

# openshift version
openshift v3.10.0-0.38.0
kubernetes v1.10.0+b81c8f8
etcd 3.2.16
Container Runtime Version: cri-o://1.10.1

# rpm -qva | grep cri-o
cri-o-1.10.1-2.git728df92.el7.x86_64

# rpm -q openshift-ansible
openshift-ansible-3.10.0-0.38.0.git.7.848b045.el7.noarch

# rpm -q ansible
ansible-2.4.3.0-1.el7ae.noarch

# ansible --version
ansible 2.4.3.0
config file = /etc/ansible/ansible.cfg
configured module search path = [u'/root/.ansible/plugins/modules', u'/usr/share/ansible/plugins/modules']
ansible python module location = /usr/lib/python2.7/site-packages/ansible
executable location = /usr/bin/ansible
python version = 2.7.5 (default, May 4 2018, 09:38:16) [GCC 4.8.5 20150623 (Red Hat 4.8.5-34)]

How reproducible: Always

Steps to Reproduce:
1. Install an OCP 3.10.0-0.38.0 HA CRI-O (rpm install) cluster (1 load balancer, 3 master/etcd nodes, 2 infra nodes, 4 compute nodes) in AWS EC2 with openshift-ansible and the inventory attached. Run:
- ansible-playbook -i inv openshift-ansible/playbooks/prerequisites.yml
- ansible-playbook -i inv openshift-ansible/playbooks/deploy_cluster.yml
2. Create two new instances in AWS EC2.
3. Update the inventory to add two master nodes (attached "master_scaleup_inventory").
4. On one master/etcd host, run: ansible-playbook -i master_scaleup_inv openshift-ansible/playbooks/prerequisites.yml
5. ansible-playbook -i master_scaleup_inv openshift-ansible/playbooks/openshift-master/scaleup.yml
6. Verify the cluster is up and the 2 newly added masters are running as static pods: oc get pods -n kube-system
7. Create an etcd_scaleup inventory with the 2 newly added master hostnames.
8. Run: ansible-playbook -i etcd_scaleup_inv openshift-ansible/playbooks/openshift-etcd/scaleup.yml
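Step 7's etcd scaleup inventory would look roughly like the following. The hostnames are placeholders, and [new_etcd] is the group the openshift-etcd/scaleup.yml playbook reads new members from:

```ini
# Sketch of an etcd scaleup inventory (placeholder hostnames).
[OSEv3:children]
masters
etcd
new_etcd

[masters]
master1.example.com
master2.example.com
master3.example.com

[etcd]
master1.example.com
master2.example.com
master3.example.com

[new_etcd]
# The two newly added master instances that should also run etcd:
master4.example.com
master5.example.com
```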
I have also tried re-running the prerequisites.yml playbook before the openshift-etcd/scaleup.yml playbook, and I get the same error in the same task.

Actual results:
Error in task: TASK [etcd : Add new etcd members to cluster] (see "Description of problem" above)

Expected results:
Playbook completes successfully, with the 2 added etcds running as static pods on the two newly added master nodes.

Additional info:
Inventory files and logs from ansible-playbook with the -vvv flag are in the next comment.