Created attachment 1499043 [details]
Customer Logs

Description of problem:
When trying to scale up etcd while using the calico networking plugin/playbooks, etcd_cert_config_dir is overridden, so it no longer follows our defaults or aligns with what our variables (https://github.com/openshift/openshift-ansible/blob/release-3.11/roles/etcd/defaults/main.yaml#L27-L36) assume. This causes the validation checks in https://github.com/openshift/openshift-ansible/blob/release-3.11/roles/etcd/tasks/certificates/fetch_server_certificates_from_ca.yml#L182-L190 to fail because the certificates are missing.

TASK [etcd : Validate permissions on certificate files] ****************************************************************************************************************
failed: [master1.HOSTNAME] (item=/etc/etcd/ca.crt) => {"changed": false, "item": "/etc/etcd/ca.crt", "msg": "file (/etc/etcd/ca.crt) is absent, cannot continue", "path": "/etc/etcd/ca.crt", "state": "absent"}
failed: [master2.HOSTNAME] (item=/etc/etcd/ca.crt) => {"changed": false, "item": "/etc/etcd/ca.crt", "msg": "file (/etc/etcd/ca.crt) is absent, cannot continue", "path": "/etc/etcd/ca.crt", "state": "absent"}
failed: [master1.HOSTNAME] (item=/etc/etcd/server.crt) => {"changed": false, "item": "/etc/etcd/server.crt", "msg": "file (/etc/etcd/server.crt) is absent, cannot continue", "path": "/etc/etcd/server.crt", "state": "absent"}
failed: [master2.HOSTNAME] (item=/etc/etcd/server.crt) => {"changed": false, "item": "/etc/etcd/server.crt", "msg": "file (/etc/etcd/server.crt) is absent, cannot continue", "path": "/etc/etcd/server.crt", "state": "absent"}
failed: [master1.HOSTNAME] (item=/etc/etcd/server.key) => {"changed": false, "item": "/etc/etcd/server.key", "msg": "file (/etc/etcd/server.key) is absent, cannot continue", "path": "/etc/etcd/server.key", "state": "absent"}
failed: [master2.HOSTNAME] (item=/etc/etcd/server.key) => {"changed": false, "item": "/etc/etcd/server.key", "msg": "file (/etc/etcd/server.key) is absent, cannot continue", "path": "/etc/etcd/server.key", "state": "absent"}

Version-Release number of the following components:
> rpm -qa | grep ansible
openshift-ansible-docs-3.11.16-1.git.0.4ac6f81.el7.noarch
openshift-ansible-playbooks-3.11.16-1.git.0.4ac6f81.el7.noarch
openshift-ansible-3.11.16-1.git.0.4ac6f81.el7.noarch
openshift-ansible-roles-3.11.16-1.git.0.4ac6f81.el7.noarch
ansible-2.6.5-1.el7ae.noarch

How reproducible:
Very

Actual results:
> see above

Expected results:
> the scaleup playbooks should succeed if the calico networking plugins are in place.

Additional info:
I believe the bug is caused by https://github.com/openshift/openshift-ansible/blob/f8a632b77d4ea5f76a6d568be4f7c23a56e9197c/roles/calico/tasks/certs.yml#L24-L34

Please attach logs from ansible-playbook with the -vvv flag
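For context, this is a minimal sketch (not a verbatim excerpt from the role) of the kind of validation task that produces the "is absent, cannot continue" failures above; Ansible's file module with state=file fails with exactly that message when the target path does not exist. The path list and mode are illustrative assumptions.

# Illustrative sketch only; the real task lives in
# roles/etcd/tasks/certificates/fetch_server_certificates_from_ca.yml.
# It checks files under etcd_cert_config_dir (/etc/etcd in the failure above).
- name: Validate permissions on certificate files
  file:
    path: "{{ item }}"
    mode: "0600"
    state: file   # file module errors out if the path is absent
  with_items:
  - "{{ etcd_cert_config_dir }}/ca.crt"
  - "{{ etcd_cert_config_dir }}/server.crt"
  - "{{ etcd_cert_config_dir }}/server.key"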
I think https://github.com/openshift/openshift-ansible/blob/release-3.11/roles/etcd/tasks/certificates/fetch_server_certificates_from_ca.yml#L11-L22 and https://github.com/openshift/openshift-ansible/blob/release-3.11/roles/etcd/tasks/certificates/fetch_server_certificates_from_ca.yml#L136-L148 should use etcd_ca_dir and not etcd_cert_config_dir
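Roughly, the suggested change would look something like the sketch below. This is not a verbatim excerpt of those tasks; the task name, register variable, and the use of etcd_ca_host are assumptions for illustration. The point is that lookups of CA material would reference etcd_ca_dir instead of the etcd_cert_config_dir value that the Calico playbooks override.

# Sketch of the proposed direction, not the actual role task.
- name: Check status of the etcd CA certificate
  stat:
    path: "{{ etcd_ca_dir }}/ca.crt"   # instead of {{ etcd_cert_config_dir }}/ca.crt
  register: etcd_ca_cert_stat
  delegate_to: "{{ etcd_ca_host }}"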
The version of the playbooks in use pre-dates all of the Calico-related fixes that went into 3.11. At least these two pull requests are necessary: https://github.com/openshift/openshift-ansible/pull/10233 and https://github.com/openshift/openshift-ansible/pull/10515. Those are only available in openshift-ansible-3.11.32-1 and later builds.

I'd like to better understand the history and goals here. Is this customer provisioning a new 3.11 cluster with just one master and etcd host and then attempting to scale that up to a second and third master / etcd host? I think we need to walk through their specific workflow internally before we request any further testing from the customer. I'm not confident that the fixes above would actually address the issue until I know what they're doing.
Potential fix: https://github.com/openshift/openshift-ansible/pull/10631

I need to work through another round of fresh testing in order to document the full process, but with that change I think the process will roughly be:

1) On the remaining etcd member, `echo ETCD_FORCE_NEW_CLUSTER=true >> /etc/etcd/etcd.conf && master-restart etcd`, then wait for etcd to come back online and API services to be restored.
2) Remove the appended line, then `master-restart etcd` to return to normal etcd operations.
3) oc delete node master-2 master-3
4) Add master-4 and master-5 to [new_masters] and [new_nodes], run playbooks/openshift-master/scaleup.yml
5) Move master-4 and master-5 to the [masters] and [nodes] groups, add them to [new_etcd], run playbooks/openshift-etcd/scaleup.yml
This is what I've come up with; marking this comment private until I can go through it once again.

1) Provision a cluster with 1 load balancer, 3 masters, and 2 nodes.
2) Terminate master-2 and master-3.
3) Restore etcd quorum by forcing a new cluster.

# etcdctl2 cluster-health
member 27eb9c8faee92f9a is unhealthy: got unhealthy result from https://172.18.10.172:2379
failed to check the health of member c469eee60064c582 on https://172.18.7.111:2379: Get https://172.18.7.111:2379/health: dial tcp 172.18.7.111:2379: i/o timeout
member c469eee60064c582 is unreachable: [https://172.18.7.111:2379] are all unreachable
failed to check the health of member c873828b1cbf2631 on https://172.18.8.183:2379: Get https://172.18.8.183:2379/health: dial tcp 172.18.8.183:2379: i/o timeout
member c873828b1cbf2631 is unreachable: [https://172.18.8.183:2379] are all unreachable
cluster is unhealthy

# echo ETCD_FORCE_NEW_CLUSTER=true >> /etc/etcd/etcd.conf && master-restart etcd
2
# etcdctl2 cluster-health
member 27eb9c8faee92f9a is healthy: got healthy result from https://172.18.10.172:2379
cluster is healthy

4) Delete the node objects for the previous masters; this is critical to prevent the calico daemonsets from being mis-sized. Comment them out in the inventory as well.
5) Add the new masters to [new_masters] and [new_nodes], then run playbooks/openshift-master/scaleup.yml
6) Move the new masters to [masters] and [nodes], add them to [new_etcd], then run playbooks/openshift-etcd/scaleup.yml (see the inventory sketch below). Calico etcd will scale up as the new masters come online because they're tagged calico-etcd.
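For reference, a hypothetical inventory fragment illustrating the group layout going into step 6): the new masters already sit in [masters] and [nodes] and are listed under [new_etcd] for playbooks/openshift-etcd/scaleup.yml. Hostnames are placeholders, and unrelated groups and host variables (node group names, OSEv3 vars, etc.) are omitted.

[OSEv3:children]
masters
nodes
etcd
new_etcd

[masters]
master-1.example.com
master-4.example.com
master-5.example.com

[nodes]
master-1.example.com
master-4.example.com
master-5.example.com
node-1.example.com
node-2.example.com

[etcd]
master-1.example.com

[new_etcd]
master-4.example.com
master-5.example.com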
Customer has requested an update on this issue. Are any updates available at this time?
Customer has requested another update this morning. I advised that we have reached out to RHOSE-PRIO and received a response that the issue is being looked at. Any updates or information that can be shared with the customer would be greatly appreciated at this time.
https://github.com/openshift/openshift-ansible/pull/10789 should address deploy_cluster.yml but I need to re-test the scaleup scenario.
https://github.com/openshift/openshift-ansible/pull/10789 has merged, addressing the issue at initial deployment time.
Verified with:
openshift v3.11.60
openshift-ansible-3.11.60-1.git.0.2fbdcdc.el7.noarch.rpm

# oc get pods
NAME                                               READY     STATUS    RESTARTS   AGE
calico-kube-controllers-7ffbc994bf-plm9r           1/1       Running   0          4h
calico-node-7445z                                  2/2       Running   0          4h
calico-node-9rrx6                                  2/2       Running   0          4h
calico-node-mcx5g                                  2/2       Running   0          4h
complete-upgrade-xzl7g                             1/1       Running   0          4h
master-api-qe-zzhao311-master-etcd-nfs-1           1/1       Running   4          4h
master-controllers-qe-zzhao311-master-etcd-nfs-1   1/1       Running   4          4h
master-etcd-qe-zzhao311-master-etcd-nfs-1          1/1       Running   2          4h

Scale up completed successfully:

# oc rsh master-etcd-qe-zzhao311-master-etcd-nfs-1
sh-4.2# etcdctl --cert-file=/etc/etcd/peer.crt --key-file=/etc/etcd/peer.key --ca-file=/etc/etcd/ca.crt --peers=https://qe-zzhao311-master-etcd-nfs-1.int.1224-8-s.qe.rhcloud.com:2379 member list
b7cebc3c02a4c4a: name=qe-geliu-launchvmrhel-2 peerURLs=https://172.16.122.46:2380 clientURLs=https://172.16.122.46:2379 isLeader=false
200b1cbb34c0b2f6: name=qe-geliu-launchvmrhel-1 peerURLs=https://172.16.122.56:2380 clientURLs=https://172.16.122.56:2379 isLeader=true
a32a5ed8a65b9417: name=qe-zzhao311-master-etcd-nfs-1.int.1224-8-s.qe.rhcloud.com peerURLs=https://172.16.122.55:2380 clientURLs=https://172.16.122.55:2379 isLeader=false
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2019:0024