Description of problem: Re-introduce leases failed for "2017-07-06 04:11:58.655766 I | etcdserver/api/v3rpc: grpc: addrConn.resetTransport failed to create client transport: connection error: desc = "transport: dial tcp 192.168.1.196:2379: getsockopt: connection refused"; Reconnecting to {192.168.1.196:2379 <nil>}". I think after etcd_container is started, the etcd member need some time to join cluster. it is better to wait for minute until the etcd cluster is health before Re-introduce leases Version-Release number of the following components: openshift-ansible-3.6.135 How reproducible: sometimes Steps to Reproduce: 1. install HA OCP v3.5 2. upgrade to OCP 3.6 anible-playbook usr/share/ansible/openshift-ansible/playbooks/byo/openshift-cluster/upgrades/v3_6/upgrade.yml Actual results: changed: [openshift-221.lab.eng.nay.redhat.com] => { "changed": true, "name": "etcd_container", "state": "started", "status": { "ActiveEnterTimestamp": "Thu 2017-07-06 03:47:30 EDT", "ActiveEnterTimestampMonotonic": "3957118851", "ActiveExitTimestamp": "Thu 2017-07-06 04:11:46 EDT", "ActiveExitTimestampMonotonic": "5413035829", "ActiveState": "inactive", "After": "docker.service system.slice systemd-journald.socket basic.target", <--snip---> <--snip---> TASK [etcd_migrate : Re-introduce leases (as a replacement for key TTLs)] ****** failed: [openshift-181.lab.eng.nay.redhat.com -> openshift-181.lab.eng.nay.redhat.com] (item=/kubernetes.io/events) => { "changed": true, "cmd": [ "oadm", "migrate", "etcd-ttl", "--cert", "/etc/etcd/peer.crt", "--key", "/etc/etcd/peer.key", "--cacert", "/etc/etcd/ca.crt", "--etcd-address", "https://192.168.1.196:2379", "--ttl-keys-prefix", "/kubernetes.io/events", "--lease-duration", "1h" ], "delta": "0:00:05.262004", "end": "2017-07-06 04:12:03.659000", "failed": true, "item": "/kubernetes.io/events", "rc": 1, "start": "2017-07-06 04:11:58.396996", "warnings": [] } STDERR: 2017-07-06 04:11:58.655766 I | etcdserver/api/v3rpc: grpc: addrConn.resetTransport failed to create client transport: connection error: desc = "transport: dial tcp 192.168.1.196:2379: getsockopt: connection refused"; Reconnecting to {192.168.1.196:2379 <nil>} 2017-07-06 04:11:59.655721 I | etcdserver/api/v3rpc: grpc: addrConn.resetTransport failed to create client transport: connection error: desc = "transport: dial tcp 192.168.1.196:2379: getsockopt: connection refused"; Reconnecting to {192.168.1.196:2379 <nil>} 2017-07-06 04:12:01.462522 I | etcdserver/api/v3rpc: grpc: addrConn.resetTransport failed to create client transport: connection error: desc = "transport: dial tcp 192.168.1.196:2379: getsockopt: connection refused"; Reconnecting to {192.168.1.196:2379 <nil>} 2017-07-06 04:12:03.655353 I | etcdserver/api/v3rpc: Failed to dial 192.168.1.196:2379: context canceled; please retry. error: unable to create etcd client: dial tcp 192.168.1.196:2379: getsockopt: connection refused changed: [openshift-181.lab.eng.nay.redhat.com -> openshift-181.lab.eng.nay.redhat.com] => (item=/kubernetes.io/masterleases) => { "changed": true, "cmd": [ "oadm", "migrate", "etcd-ttl", "--cert", "/etc/etcd/peer.crt", "--key", "/etc/etcd/peer.key", "--cacert", "/etc/etcd/ca.crt", "--etcd-address", "https://192.168.1.196:2379", "--ttl-keys-prefix", "/kubernetes.io/masterleases", "--lease-duration", "1h" ], "delta": "0:00:02.016886", "end": "2017-07-06 04:12:05.927187", "item": "/kubernetes.io/masterleases", "rc": 0, "start": "2017-07-06 04:12:03.910301", "warnings": [] } STDOUT: info: Lease #6295853157263723780 with TTL 3600 created info: Attaching lease to 2 entries NO MORE HOSTS LEFT ************************************************************* to retry, use: --limit @/usr/share/ansible/openshift-ansible/playbooks/byo/openshift-etcd/migrate.retry Expected results: Additional info:
Created attachment 1294905 [details] Add logs
Anping, Can you gather the journal from all etcd services so we understand what went wrong with etcd after it was started successfully? journalctl --no-pager -u etcd
Jan, Lets add a health check after Enable etcd member? that retries for up to a minute waiting on the cluster to become healthy? I wonder if post v2 to v3 migration it takes longer than normal to become healthy again?
Branch in progress here, maybe it'll help once it's fleshed out. Can't test it yet because bringing up a cluster like in your inventory has taken 2 hours so far and I think it's just getting slower with every task.
Created attachment 1295141 [details] Migrate playbook output and etcd_container journal log That is a containerized etcd. It was 21:55:15.141364 to Re-introduce leases[1]. but the etcd_container was started after 21:55:18[2]. [1]. playbook output: 2017-07-06 21:55:15.141364 I | etcdserver/api/v3rpc: grpc: addrConn.resetTransport failed to create client transport: connection error: desc = "transport: dial tcp 192.168.1.196:2379: getsockopt: connection refused"; [2]. etcd_container journal logs Jul 06 21:55:02 host3-ha-1.novalocal systemd[1]: Stopped The Etcd Server container. Jul 06 21:55:12 host3-ha-1.novalocal systemd[1]: Starting The Etcd Server container... Jul 06 21:55:12 host3-ha-1.novalocal etcd_container[19229]: Error response from daemon: No such container: etcd_container Jul 06 21:55:12 host3-ha-1.novalocal systemd[1]: Started The Etcd Server container. Jul 06 21:55:18 host3-ha-1.novalocal etcd_container[19235]: 2017-07-07 01:55:18.739935 I | pkg/flags: recognized and used environment variable ETCD_ADVERTISE_CLIENT_URLS=https://192.168.1.196:2379 Jul 06 21:55:18 host3-ha-1.novalocal etcd_container[19235]: 2017-07-07 01:55:18.777574 I | pkg/flags: recognized and used environment variable ETCD_CA_FILE=/etc/etcd/ca.crt Jul 06 21:55:18 host3-ha-1.novalocal etcd_container[19235]: 2017-07-07 01:55:18.777583 I | pkg/flags: recognized and used environment variable ETCD_CERT_FILE=/etc/etcd/server.crt
https://github.com/openshift/openshift-ansible/pull/4703 adds a cluster health check before we attempt to re-attach leases.
The fix works with master branch.
The fix work well on openshift-ansible-3.6.140
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHEA-2017:1716