Bug 1463494

Summary: oadm migrate etcd-ttl fails when using dedicated etcd clusters
Product: OpenShift Container Platform
Component: Cluster Version Operator
Version: 3.6.0
Status: CLOSED ERRATA
Severity: high
Priority: unspecified
Reporter: Anping Li <anli>
Assignee: Jan Chaloupka <jchaloup>
QA Contact: Anping Li <anli>
CC: aos-bugs, jchaloup, jokerman, mmccomas, smunilla, xtian
Target Milestone: ---   
Target Release: ---   
Hardware: Unspecified   
OS: Unspecified   
Last Closed: 2017-08-10 05:28:56 UTC
Type: Bug
Attachments:
  Description: Migrate logs; Flags: none

Description Anping Li 2017-06-21 06:30:41 UTC
Description of problem:
The migration fails when dedicated etcd clusters are used, because the atomic-openshift packages are not installed on the dedicated etcd hosts, so the oadm command is not available there. I guess we only need to run 'oadm migrate etcd-ttl' on the first master.

Version-Release number of selected component (if applicable):
openshift/openshift-ansible: Pull Request 4492. 

How reproducible:
always

Steps to Reproduce:
1. Install OCP v3.5 with dedicated etcd clusters
2. Upgrade to v3.6
3. Migrate to etcd v3:
   ansible-playbook openshift-ansible/playbooks/byo/openshift-etcd/migrate.yml

Actual results:
TASK [etcd_migrate : Re-introduce leases (as a replacement for key TTLs)] ******
failed: [qe-auto-etcd-1.0621-ktl.qe.rhcloud.com] (item=/kubernetes.io/events) => {
    "cmd": "oadm migrate etcd-ttl --cert /etc/etcd/peer.crt --key /etc/etcd/peer.key --cacert /etc/etcd/ca.crt --etcd-address https://10.240.0.16:2379 --ttl-keys-prefix /kubernetes.io/events --lease-duration 1h", 
    "failed": true, 
    "item": "/kubernetes.io/events", 
    "rc": 2
}

MSG:

[Errno 2] No such file or directory

failed: [qe-auto-etcd-2.0621-ktl.qe.rhcloud.com] (item=/kubernetes.io/events) => {
    "cmd": "oadm migrate etcd-ttl --cert /etc/etcd/peer.crt --key /etc/etcd/peer.key --cacert /etc/etcd/ca.crt --etcd-address https://10.240.0.17:2379 --ttl-keys-prefix /kubernetes.io/events --lease-duration 1h", 
    "failed": true, 
    "item": "/kubernetes.io/events", 
    "rc": 2
}

MSG:

[Errno 2] No such file or directory

failed: [qe-auto-etcd-3.0621-ktl.qe.rhcloud.com] (item=/kubernetes.io/events) => {
    "cmd": "oadm migrate etcd-ttl --cert /etc/etcd/peer.crt --key /etc/etcd/peer.key --cacert /etc/etcd/ca.crt --etcd-address https://10.240.0.18:2379 --ttl-keys-prefix /kubernetes.io/events --lease-duration 1h", 
    "failed": true, 
    "item": "/kubernetes.io/events", 
    "rc": 2
}

MSG:

[Errno 2] No such file or directory

failed: [qe-auto-etcd-1.0621-ktl.qe.rhcloud.com] (item=/kubernetes.io/masterleases) => {
    "cmd": "oadm migrate etcd-ttl --cert /etc/etcd/peer.crt --key /etc/etcd/peer.key --cacert /etc/etcd/ca.crt --etcd-address https://10.240.0.16:2379 --ttl-keys-prefix /kubernetes.io/masterleases --lease-duration 1h", 
    "failed": true, 
    "item": "/kubernetes.io/masterleases", 
    "rc": 2
}

MSG:

[Errno 2] No such file or directory

failed: [qe-auto-etcd-2.0621-ktl.qe.rhcloud.com] (item=/kubernetes.io/masterleases) => {
    "cmd": "oadm migrate etcd-ttl --cert /etc/etcd/peer.crt --key /etc/etcd/peer.key --cacert /etc/etcd/ca.crt --etcd-address https://10.240.0.17:2379 --ttl-keys-prefix /kubernetes.io/masterleases --lease-duration 1h", 
    "failed": true, 
    "item": "/kubernetes.io/masterleases", 
    "rc": 2
}

MSG:

[Errno 2] No such file or directory

failed: [qe-auto-etcd-3.0621-ktl.qe.rhcloud.com] (item=/kubernetes.io/masterleases) => {
    "cmd": "oadm migrate etcd-ttl --cert /etc/etcd/peer.crt --key /etc/etcd/peer.key --cacert /etc/etcd/ca.crt --etcd-address https://10.240.0.18:2379 --ttl-keys-prefix /kubernetes.io/masterleases --lease-duration 1h", 
    "failed": true, 
    "item": "/kubernetes.io/masterleases", 
    "rc": 2
}

MSG:

[Errno 2] No such file or directory

    to retry, use: --limit @/root/openshift-ansible/playbooks/byo/openshift-etcd/migrate.retry

PLAY RECAP *********************************************************************
localhost                  : ok=22   changed=0    unreachable=0    failed=0   
qe-auto-etcd-1.0621-ktl.qe.rhcloud.com : ok=96   changed=9    unreachable=0    failed=1   
qe-auto-etcd-2.0621-ktl.qe.rhcloud.com : ok=93   changed=9    unreachable=0    failed=1   
qe-auto-etcd-3.0621-ktl.qe.rhcloud.com : ok=93   changed=9    unreachable=0    failed=1   
qe-auto-master-1.0621-ktl.qe.rhcloud.com : ok=61   changed=3    unreachable=0    failed=0   
qe-auto-node-registry-router-1.0621-ktl.qe.rhcloud.com : ok=56   changed=2    unreachable=0    failed=0   

Expected results:


Additional info:

Comment 1 Anping Li 2017-06-21 06:34:53 UTC
Example inventory:
[OSEv3:children]
masters
nodes
etcd
[OSEv3:vars]
ansible_ssh_user=root
xxxx
xxxx
[masters]
master.example.com
[nodes]
master.example.com
node.example.com
[etcd]
etcd1.example.com
etcd2.example.com
etcd3.example.com

Comment 2 Anping Li 2017-06-22 06:09:48 UTC
Created attachment 1290503 [details]
Migrate logs

I also hit the 'oadm migrate etcd-ttl' error with clustered etcd (installed on the masters). I think it is the same issue.

[masters]
host-8-174-222.host.centralci.eng.rdu2.redhat.com 
host-8-174-253.host.centralci.eng.rdu2.redhat.com 
host-8-175-112.host.centralci.eng.rdu2.redhat.com

[etcd]
host-8-174-222.host.centralci.eng.rdu2.redhat.com
host-8-174-253.host.centralci.eng.rdu2.redhat.com
host-8-175-112.host.centralci.eng.rdu2.redhat.com

[nodes]
host-8-174-222.host.centralci.eng.rdu2.redhat.com
host-8-174-253.host.centralci.eng.rdu2.redhat.com
host-8-175-112.host.centralci.eng.rdu2.redhat.com
host-8-175-68.host.centralci.eng.rdu2.redhat.com 
host-8-175-73.host.centralci.eng.rdu2.redhat.com
[lb]
host-8-175-186.host.centralci.eng.rdu2.redhat.com
[nfs]
host-8-175-186.host.centralci.eng.rdu2.redhat.com

Comment 3 Scott Dodson 2017-06-22 13:15:44 UTC
Use delegate_to: {{ oo_first_master }} so that the task is run on the first master.
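
A minimal sketch of what such a delegated task might look like, built from the task name and command line shown in the failure output in the description. It is not the actual role code: "oo_first_master" is written here as it appears in this comment, and "etcd_address" is a hypothetical variable standing in for the etcd client URL.

```yaml
# Sketch only, under the assumptions stated above.
- name: Re-introduce leases (as a replacement for key TTLs)
  command: >
    oadm migrate etcd-ttl
    --cert /etc/etcd/peer.crt
    --key /etc/etcd/peer.key
    --cacert /etc/etcd/ca.crt
    --etcd-address {{ etcd_address }}
    --ttl-keys-prefix {{ item }}
    --lease-duration 1h
  with_items:
    - /kubernetes.io/events
    - /kubernetes.io/masterleases
  # Run the command on the first master, where oadm is installed,
  # instead of on the dedicated etcd host.
  delegate_to: "{{ oo_first_master }}"
```

Because the command then executes on the master, the files passed to --cert, --key, and --cacert must exist on that host.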

Comment 4 Jan Chaloupka 2017-06-23 13:06:37 UTC
Fixed as part of https://github.com/openshift/openshift-ansible/pull/4558

Comment 5 Jan Chaloupka 2017-06-28 12:00:27 UTC
More specific upstream PR: https://github.com/openshift/openshift-ansible/pull/4623

PR #4558 can be ignored for this issue.

Comment 6 Jan Chaloupka 2017-06-28 12:14:40 UTC
Merged upstream

Comment 8 Anping Li 2017-07-04 02:41:22 UTC
With the master branch, I get the following error. Should I use the errata puddle?

[root@anli host2]# cat hosts 
[OSEv3:children]
masters
nodes
etcd
nfs

[OSEv3:vars]
deployment_type=openshift-enterprise
ansible_become=true
ansible_user=root
openshift_auth_type=allowall
openshift_master_identity_providers=[{'name': 'allow_all', 'login': 'true', 'challenge': 'true', 'kind': 'AllowAllPasswordIdentityProvider'}]
openshift_image_tag=v3.6.129
containerized=true
enable_excluders=false
openshift_master_cert_expire_days=365
openshift_disable_check=disk_availability,docker_storage,memory_availability

[masters]
openshift-225.lab.eng.nay.redhat.com openshift_public_hostname=openshift-225.lab.eng.nay.redhat.com openshift_hostname=openshift-225.lab.eng.nay.redhat.com
[nodes]
openshift-225.lab.eng.nay.redhat.com openshift_public_hostname=openshift-225.lab.eng.nay.redhat.com openshift_hostname=openshift-225.lab.eng.nay.redhat.com  openshift_node_labels="{'region': 'infra'}"
[etcd]
openshift-208.lab.eng.nay.redhat.com openshift_public_hostname=openshift-208.lab.eng.nay.redhat.com openshift_hostname=openshift-208.lab.eng.nay.redhat.com
[nfs]
openshift-208.lab.eng.nay.redhat.com



TASK [etcd_migrate : Re-introduce leases (as a replacement for key TTLs)] ******
failed: [openshift-208.lab.eng.nay.redhat.com -> openshift-225.lab.eng.nay.redhat.com] (item=/kubernetes.io/events) => {
    "changed": true, 
    "cmd": [
        "oadm", 
        "migrate", 
        "etcd-ttl", 
        "--cert", 
        "/etc/etcd/peer.crt", 
        "--key", 
        "/etc/etcd/peer.key", 
        "--cacert", 
        "/etc/etcd/ca.crt", 
        "--etcd-address", 
        "https://192.168.1.186:2379", 
        "--ttl-keys-prefix", 
        "/kubernetes.io/events", 
        "--lease-duration", 
        "1h"
    ], 
    "delta": "0:00:00.278998", 
    "end": "2017-07-03 22:30:37.631742", 
    "failed": true, 
    "item": "/kubernetes.io/events", 
    "rc": 1, 
    "start": "2017-07-03 22:30:37.352744", 
    "warnings": []
}

STDERR:

error: open /etc/etcd/peer.crt: no such file or directory

failed: [openshift-208.lab.eng.nay.redhat.com -> openshift-225.lab.eng.nay.redhat.com] (item=/kubernetes.io/masterleases) => {
    "changed": true, 
    "cmd": [
        "oadm", 
        "migrate", 
        "etcd-ttl", 
        "--cert", 
        "/etc/etcd/peer.crt", 
        "--key", 
        "/etc/etcd/peer.key", 
        "--cacert", 
        "/etc/etcd/ca.crt", 
        "--etcd-address", 
        "https://192.168.1.186:2379", 
        "--ttl-keys-prefix", 
        "/kubernetes.io/masterleases", 
        "--lease-duration", 
        "1h"
    ], 
    "delta": "0:00:00.271130", 
    "end": "2017-07-03 22:30:38.155144", 
    "failed": true, 
    "item": "/kubernetes.io/masterleases", 
    "rc": 1, 
    "start": "2017-07-03 22:30:37.884014", 
    "warnings": []
}

STDERR:

error: open /etc/etcd/peer.crt: no such file or directory


NO MORE HOSTS LEFT *************************************************************
	to retry, use: --limit @/root/openshift-ansible/playbooks/byo/openshift-etcd/migrate.retry

PLAY RECAP *********************************************************************
localhost                  : ok=15   changed=0    unreachable=0    failed=0   
openshift-208.lab.eng.nay.redhat.com : ok=74   changed=9    unreachable=0    failed=1   
openshift-225.lab.eng.nay.redhat.com : ok=19   changed=1    unreachable=0    failed=0

Comment 9 Anping Li 2017-07-04 03:09:05 UTC
Another question: what should we do to recover from the failure when the playbook stops at 'Re-introduce leases'?

Comment 10 Jan Chaloupka 2017-07-04 08:40:19 UTC
PR setting the proper certificates: https://github.com/openshift/openshift-ansible/pull/4671

Comment 12 Anping Li 2017-07-06 09:32:15 UTC
The external clustered etcd can be migrated with openshift-ansible-3.6.136.

Comment 14 errata-xmlrpc 2017-08-10 05:28:56 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2017:1716