Bug 1644416

Summary: calico roles overwrite etcd_cert_config_dir causing issues with OpenShift management playbooks
Product: OpenShift Container Platform
Component: Installer
Version: 3.11.0
Target Release: 3.11.z
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Status: CLOSED ERRATA
Type: Bug
Reporter: Eric Rich <erich>
Assignee: Scott Dodson <sdodson>
QA Contact: ge liu <geliu>
CC: aos-bugs, gpei, grodrigu, jokerman, mmccomas, sdodson
Last Closed: 2019-01-10 09:04:10 UTC
Doc Type: Bug Fix
Doc Text:
The scaleup playbooks, when used in conjunction with Calico, did not properly configure the Calico certificate paths, causing them to fail. The playbooks have been updated to ensure that master scaleup with Calico works properly.

Description Eric Rich 2018-10-30 19:14:00 UTC
Created attachment 1499043 [details]
Customer Logs

Description of problem:

When trying to scale up etcd while using the calico networking plugin/playbooks, etcd_cert_config_dir is overridden by the calico roles, so it no longer follows our defaults or aligns with what our variables (https://github.com/openshift/openshift-ansible/blob/release-3.11/roles/etcd/defaults/main.yaml#L27-L36) assume. This causes the validation checks at https://github.com/openshift/openshift-ansible/blob/release-3.11/roles/etcd/tasks/certificates/fetch_server_certificates_from_ca.yml#L182-L190 to fail because the certificates are missing.

TASK [etcd : Validate permissions on certificate files] ****************************************************************************************************************
failed: [master1.HOSTNAME] (item=/etc/etcd/ca.crt) => {"changed": false, "item": "/etc/etcd/ca.crt", "msg": "file (/etc/etcd/ca.crt) is absent, cannot continue", "path": "/etc/etcd/ca.crt", "state": "absent"}
failed: [master2.HOSTNAME] (item=/etc/etcd/ca.crt) => {"changed": false, "item": "/etc/etcd/ca.crt", "msg": "file (/etc/etcd/ca.crt) is absent, cannot continue", "path": "/etc/etcd/ca.crt", "state": "absent"}
failed: [master1.HOSTNAME] (item=/etc/etcd/server.crt) => {"changed": false, "item": "/etc/etcd/server.crt", "msg": "file (/etc/etcd/server.crt) is absent, cannot continue", "path": "/etc/etcd/server.crt", "state": "absent"}
failed: [master2.HOSTNAME] (item=/etc/etcd/server.crt) => {"changed": false, "item": "/etc/etcd/server.crt", "msg": "file (/etc/etcd/server.crt) is absent, cannot continue", "path": "/etc/etcd/server.crt", "state": "absent"}
failed: [master1.HOSTNAME] (item=/etc/etcd/server.key) => {"changed": false, "item": "/etc/etcd/server.key", "msg": "file (/etc/etcd/server.key) is absent, cannot continue", "path": "/etc/etcd/server.key", "state": "absent"}
failed: [master2.HOSTNAME] (item=/etc/etcd/server.key) => {"changed": false, "item": "/etc/etcd/server.key", "msg": "file (/etc/etcd/server.key) is absent, cannot continue", "path": "/etc/etcd/server.key", "state": "absent"}
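
For reference, the etcd role derives every certificate path from etcd_cert_config_dir, so anything that overrides that variable moves all of the files the validation task expects. A paraphrased sketch of the linked defaults (approximate; see roles/etcd/defaults/main.yaml in release-3.11 for the exact values):

  # roles/etcd/defaults/main.yaml (paraphrased, approximate)
  etcd_cert_config_dir: /etc/etcd
  etcd_ca_file: "{{ etcd_cert_config_dir }}/ca.crt"
  etcd_cert_file: "{{ etcd_cert_config_dir }}/server.crt"
  etcd_key_file: "{{ etcd_cert_config_dir }}/server.key"
  etcd_peer_cert_file: "{{ etcd_cert_config_dir }}/peer.crt"
  etcd_peer_key_file: "{{ etcd_cert_config_dir }}/peer.key"

The paths in the failures above are exactly these derived defaults under /etc/etcd.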

Version-Release number of the following components:
> rpm -qa | grep ansible 
openshift-ansible-docs-3.11.16-1.git.0.4ac6f81.el7.noarch
openshift-ansible-playbooks-3.11.16-1.git.0.4ac6f81.el7.noarch
openshift-ansible-3.11.16-1.git.0.4ac6f81.el7.noarch 
openshift-ansible-roles-3.11.16-1.git.0.4ac6f81.el7.noarch
ansible-2.6.5-1.el7ae.noarch 

How reproducible: Very
Actual results:

> see above

Expected results:

> The scaleup playbooks should succeed when the calico networking plugin is in place.

Additional info:

I believe the bug is caused by https://github.com/openshift/openshift-ansible/blob/f8a632b77d4ea5f76a6d568be4f7c23a56e9197c/roles/calico/tasks/certs.yml#L24-L34
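
The linked lines appear to point the etcd certificate variables at Calico-specific locations. An illustrative sketch of that kind of override follows; it is not the verbatim contents of certs.yml, and calico_etcd_cert_dir is a hypothetical variable:

  # roles/calico/tasks/certs.yml (illustrative sketch only)
  - name: Point etcd certificate facts at the Calico certificate location
    set_fact:
      etcd_cert_config_dir: "{{ calico_etcd_cert_dir }}"   # hypothetical, e.g. /etc/calico/certs
      etcd_ca_file: "{{ calico_etcd_cert_dir }}/ca.crt"
      etcd_cert_file: "{{ calico_etcd_cert_dir }}/server.crt"
      etcd_key_file: "{{ calico_etcd_cert_dir }}/server.key"

If facts like these are set on the etcd scaleup hosts, certificate generation and the later validation can disagree about where the files live, which matches the "file is absent" failures above.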

Please attach logs from ansible-playbook with the -vvv flag

Comment 3 Scott Dodson 2018-10-30 20:02:36 UTC
The version of playbooks in use pre-dates all of the Calico related fixes going into 3.11.

At least these two pull requests are necessary
https://github.com/openshift/openshift-ansible/pull/10233
https://github.com/openshift/openshift-ansible/pull/10515

Those are only available in openshift-ansible-3.11.32-1 and later builds.

I'd like to better understand the history and goals here. Is this customer provisioning a new 3.11 cluster with just one master and etcd host, then attempting to scale that up to a second and third master/etcd host?

I think we need to go through the specific workflow they're following internally before we request any further testing from the customer. I'm not confident that the fixes above would actually address the issue until I know what they're doing.

Comment 8 Scott Dodson 2018-11-07 21:38:45 UTC
https://github.com/openshift/openshift-ansible/pull/10631 potential fix

I need to work through another round of fresh testing in order to document the full process, but with that change I think the process will roughly be:

1) On the remaining etcd member, run `echo ETCD_FORCE_NEW_CLUSTER=true >> /etc/etcd/etcd.conf && master-restart etcd`, then wait for etcd to come back online and API services to be restored.
2) Remove the appended line, then run `master-restart etcd` to return to normal etcd operation.
3) oc delete node master-2 master-3
4) Add master-4 and master-5 to [new_masters] and [new_nodes], run playbooks/openshift-master/scaleup.yml
5) Move master-4 and master-5 to the [masters] and [nodes] groups, add them to [new_etcd], run playbooks/openshift-etcd/scaleup.yml (see the example inventory below)
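
A minimal sketch of the inventory change for step 4, assuming an existing [OSEv3:children] section that already lists masters, nodes, and etcd (host names here are placeholders):

  [OSEv3:children]
  masters
  nodes
  etcd
  new_masters
  new_nodes

  [new_masters]
  master-4
  master-5

  [new_nodes]
  master-4
  master-5

Run playbooks/openshift-master/scaleup.yml against that inventory. For step 5, move master-4 and master-5 into [masters] and [nodes], replace [new_masters]/[new_nodes] with a [new_etcd] group (also listed under [OSEv3:children]) containing the same hosts, and run playbooks/openshift-etcd/scaleup.yml.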

Comment 9 Scott Dodson 2018-11-12 13:57:03 UTC
This is what I've come up with; marking this comment private until I can go through it once again.

1) Provision 1LB, 3 M, 2N cluster.
2) Terminate master-2, master-3.
3) Restore etcd quorum by forcing a new cluster.
# etcdctl2 cluster-health
member 27eb9c8faee92f9a is unhealthy: got unhealthy result from https://172.18.10.172:2379
failed to check the health of member c469eee60064c582 on https://172.18.7.111:2379: Get https://172.18.7.111:2379/health: dial tcp 172.18.7.111:2379: i/o timeout
member c469eee60064c582 is unreachable: [https://172.18.7.111:2379] are all unreachable
failed to check the health of member c873828b1cbf2631 on https://172.18.8.183:2379: Get https://172.18.8.183:2379/health: dial tcp 172.18.8.183:2379: i/o timeout
member c873828b1cbf2631 is unreachable: [https://172.18.8.183:2379] are all unreachable
cluster is unhealthy

# echo ETCD_FORCE_NEW_CLUSTER=true >> /etc/etcd/etcd.conf && master-restart etcd
2

# etcdctl2 cluster-health                                                                                                                                                                                            
member 27eb9c8faee92f9a is healthy: got healthy result from https://172.18.10.172:2379
cluster is healthy

4) Delete the node objects for the previous masters; this is critical to prevent the calico daemonsets from being mis-sized (see the verification sketch below). Comment them out in the inventory as well.

5) Add the new masters to [new_masters] and [new_nodes], then run playbooks/openshift-master/scaleup.yml

6) Move the new masters to [masters] and [nodes], add them to [new_etcd], then run playbooks/openshift-etcd/scaleup.yml

Calico etcd will scale up as the new masters come online because they're tagged calico-etcd.
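
One way to sanity-check the node cleanup in step 4 is to confirm that the calico daemonsets are sized to the nodes that actually remain; the exact daemonset names and namespace depend on the install, so this is only a sketch:

  # oc get nodes
  # oc get daemonset --all-namespaces | grep calico
  # oc get pods --all-namespaces -o wide | grep calico

The DESIRED/CURRENT counts for calico-node should match the number of node objects, and if calico-etcd runs as a daemonset in this install it should only target the hosts tagged calico-etcd as noted above.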

Comment 24 Greg Rodriguez II 2018-11-27 22:27:42 UTC
Customer has requested an update on this issue. Are any updates available at this time?

Comment 25 Greg Rodriguez II 2018-11-28 17:17:30 UTC
Customer has requested another update this morning. I advised that we have reached out to RHOSE-PRIO and received a response that the issue is being looked at. Any updates or information that can be shared with the customer would be greatly appreciated at this time.

Comment 27 Scott Dodson 2018-11-29 03:11:42 UTC
https://github.com/openshift/openshift-ansible/pull/10789 should address deploy_cluster.yml but I need to re-test the scaleup scenario.

Comment 34 Scott Dodson 2018-12-03 21:31:19 UTC
https://github.com/openshift/openshift-ansible/pull/10789 merged, addressing the issue at initial deployment time

Comment 36 ge liu 2018-12-24 10:13:36 UTC
Verified with:
openshift v3.11.60
openshift-ansible-3.11.60-1.git.0.2fbdcdc.el7.noarch.rpm

# oc get pods
NAME                                               READY     STATUS    RESTARTS   AGE
calico-kube-controllers-7ffbc994bf-plm9r           1/1       Running   0          4h
calico-node-7445z                                  2/2       Running   0          4h
calico-node-9rrx6                                  2/2       Running   0          4h
calico-node-mcx5g                                  2/2       Running   0          4h
complete-upgrade-xzl7g                             1/1       Running   0          4h
master-api-qe-zzhao311-master-etcd-nfs-1           1/1       Running   4          4h
master-controllers-qe-zzhao311-master-etcd-nfs-1   1/1       Running   4          4h
master-etcd-qe-zzhao311-master-etcd-nfs-1          1/1       Running   2          4h


Scale up completed successfully:
# oc rsh master-etcd-qe-zzhao311-master-etcd-nfs-1
sh-4.2# etcdctl --cert-file=/etc/etcd/peer.crt  --key-file=/etc/etcd/peer.key --ca-file=/etc/etcd/ca.crt --peers=https://qe-zzhao311-master-etcd-nfs-1.int.1224-8-s.qe.rhcloud.com:2379 member list
b7cebc3c02a4c4a: name=qe-geliu-launchvmrhel-2 peerURLs=https://172.16.122.46:2380 clientURLs=https://172.16.122.46:2379 isLeader=false
200b1cbb34c0b2f6: name=qe-geliu-launchvmrhel-1 peerURLs=https://172.16.122.56:2380 clientURLs=https://172.16.122.56:2379 isLeader=true
a32a5ed8a65b9417: name=qe-zzhao311-master-etcd-nfs-1.int.1224-8-s.qe.rhcloud.com peerURLs=https://172.16.122.55:2380 clientURLs=https://172.16.122.55:2379 isLeader=false

Comment 38 errata-xmlrpc 2019-01-10 09:04:10 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2019:0024