Bug 1468187

Summary: Re-introduce leases failed because etcd was not available
Product: OpenShift Container Platform
Reporter: Anping Li <anli>
Component: Cluster Version Operator
Assignee: Tim Bielawa <tbielawa>
Status: CLOSED ERRATA
QA Contact: Anping Li <anli>
Severity: medium
Docs Contact:
Priority: unspecified
Version: 3.6.0
CC: anli, aos-bugs, jokerman, mmccomas
Target Milestone: ---
Target Release: ---
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
  Cause: etcd services were not always up before attempting to re-introduce leases.
  Consequence: Installs/upgrades could fail.
  Fix: Introduced a retrying health check of the etcd service.
  Result: Re-introducing the leases will not happen until the service is detected as up and healthy on each specific node.
Story Points: ---
Clone Of:
Environment:
Last Closed: 2017-08-10 05:29:50 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Attachments:
  Add logs (flags: none)
  Migrate playbook output and etcd_container journal log (flags: none)

Description Anping Li 2017-07-06 08:59:30 UTC
Description of problem:
Re-introduce leases failed with "2017-07-06 04:11:58.655766 I | etcdserver/api/v3rpc: grpc: addrConn.resetTransport failed to create client transport: connection error: desc = "transport: dial tcp 192.168.1.196:2379: getsockopt: connection refused"; Reconnecting to {192.168.1.196:2379 <nil>}".

I think that after etcd_container is started, the etcd member needs some time to join the cluster. It would be better to wait up to a minute, until the etcd cluster is healthy, before re-introducing the leases.


Version-Release number of the following components:
openshift-ansible-3.6.135

How reproducible:
Sometimes

Steps to Reproduce:
1. install HA OCP v3.5
2. upgrade to OCP 3.6
   ansible-playbook /usr/share/ansible/openshift-ansible/playbooks/byo/openshift-cluster/upgrades/v3_6/upgrade.yml



Actual results:
changed: [openshift-221.lab.eng.nay.redhat.com] => {
    "changed": true,
    "name": "etcd_container",
    "state": "started",
    "status": {
        "ActiveEnterTimestamp": "Thu 2017-07-06 03:47:30 EDT",
        "ActiveEnterTimestampMonotonic": "3957118851",
        "ActiveExitTimestamp": "Thu 2017-07-06 04:11:46 EDT",
        "ActiveExitTimestampMonotonic": "5413035829",
        "ActiveState": "inactive",
        "After": "docker.service system.slice systemd-journald.socket basic.target",

<--snip--->
<--snip--->
TASK [etcd_migrate : Re-introduce leases (as a replacement for key TTLs)] ******
failed: [openshift-181.lab.eng.nay.redhat.com -> openshift-181.lab.eng.nay.redhat.com] (item=/kubernetes.io/events) => {
    "changed": true,
    "cmd": [
        "oadm",
        "migrate",
        "etcd-ttl",
        "--cert",
        "/etc/etcd/peer.crt",
        "--key",
        "/etc/etcd/peer.key",
        "--cacert",
        "/etc/etcd/ca.crt",
        "--etcd-address",
        "https://192.168.1.196:2379",
        "--ttl-keys-prefix",
        "/kubernetes.io/events",
        "--lease-duration",
        "1h"
    ],
    "delta": "0:00:05.262004",
    "end": "2017-07-06 04:12:03.659000",
    "failed": true,
    "item": "/kubernetes.io/events",
    "rc": 1,
    "start": "2017-07-06 04:11:58.396996",
    "warnings": []
}

STDERR:

2017-07-06 04:11:58.655766 I | etcdserver/api/v3rpc: grpc: addrConn.resetTransport failed to create client transport: connection error: desc = "transport: dial tcp 192.168.1.196:2379: getsockopt: connection refused"; Reconnecting to {192.168.1.196:2379 <nil>}
2017-07-06 04:11:59.655721 I | etcdserver/api/v3rpc: grpc: addrConn.resetTransport failed to create client transport: connection error: desc = "transport: dial tcp 192.168.1.196:2379: getsockopt: connection refused"; Reconnecting to {192.168.1.196:2379 <nil>}
2017-07-06 04:12:01.462522 I | etcdserver/api/v3rpc: grpc: addrConn.resetTransport failed to create client transport: connection error: desc = "transport: dial tcp 192.168.1.196:2379: getsockopt: connection refused"; Reconnecting to {192.168.1.196:2379 <nil>}
2017-07-06 04:12:03.655353 I | etcdserver/api/v3rpc: Failed to dial 192.168.1.196:2379: context canceled; please retry.
error: unable to create etcd client: dial tcp 192.168.1.196:2379: getsockopt: connection refused
changed: [openshift-181.lab.eng.nay.redhat.com -> openshift-181.lab.eng.nay.redhat.com] => (item=/kubernetes.io/masterleases) => {
    "changed": true,
    "cmd": [
        "oadm",
        "migrate",
        "etcd-ttl",
        "--cert",
        "/etc/etcd/peer.crt",
        "--key",
        "/etc/etcd/peer.key",
        "--cacert",
        "/etc/etcd/ca.crt",
        "--etcd-address",
        "https://192.168.1.196:2379",
        "--ttl-keys-prefix",
        "/kubernetes.io/masterleases",
        "--lease-duration",
        "1h"
    ],
    "delta": "0:00:02.016886",
    "end": "2017-07-06 04:12:05.927187",
    "item": "/kubernetes.io/masterleases",
    "rc": 0,
    "start": "2017-07-06 04:12:03.910301",
    "warnings": []
}

STDOUT:

info: Lease #6295853157263723780 with TTL 3600 created
info: Attaching lease to 2 entries

NO MORE HOSTS LEFT *************************************************************
        to retry, use: --limit @/usr/share/ansible/openshift-ansible/playbooks/byo/openshift-etcd/migrate.retry

Expected results:

Additional info:

Comment 1 Anping Li 2017-07-06 09:02:28 UTC
Created attachment 1294905 [details]
Add logs

Comment 2 Scott Dodson 2017-07-06 12:13:44 UTC
Anping,

Can you gather the journal from all etcd services so we understand what went wrong with etcd after it was started successfully?

journalctl --no-pager -u etcd

Comment 3 Scott Dodson 2017-07-06 12:19:19 UTC
Jan,

Let's add a health check after "Enable etcd member" that retries for up to a minute, waiting for the cluster to become healthy. I wonder if, after the v2-to-v3 migration, it takes longer than normal to become healthy again?
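A minimal sketch of what such a retrying health check could look like; this is illustrative only, not the actual task added by the fix. The task name and retry counts are assumptions; the certificate paths and endpoint are taken from the `oadm migrate etcd-ttl` command in the playbook output above.

```yaml
# Illustrative sketch -- not the actual openshift-ansible change.
# Retries the etcd (v2) health check for up to a minute (12 tries x 5 s)
# so leases are only re-introduced once the cluster reports healthy.
- name: Wait for etcd cluster to become healthy
  command: >
    etcdctl --cert-file /etc/etcd/peer.crt
            --key-file /etc/etcd/peer.key
            --ca-file /etc/etcd/ca.crt
            --endpoints https://192.168.1.196:2379
            cluster-health
  register: etcd_health
  until: "'cluster is healthy' in etcd_health.stdout"
  retries: 12
  delay: 5
  changed_when: false
```

With `until`/`retries`/`delay`, Ansible re-runs the command until the condition passes or the retries are exhausted, which covers the window where the container has started but the member has not yet rejoined the cluster.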

Comment 4 Tim Bielawa 2017-07-06 20:31:47 UTC
Branch in progress here; maybe it'll help once it's fleshed out. I can't test it yet because bringing up a cluster like the one in your inventory has taken 2 hours so far, and I think it's just getting slower with every task.

Comment 5 Anping Li 2017-07-07 02:09:56 UTC
Created attachment 1295141 [details]
Migrate playbook output and etcd_container journal log

This is a containerized etcd. The "Re-introduce leases" task ran at 21:55:15.141364 [1], but etcd_container did not finish starting until after 21:55:18 [2].

[1]. playbook output:
2017-07-06 21:55:15.141364 I | etcdserver/api/v3rpc: grpc: addrConn.resetTransport failed to create client transport: connection error: desc = "transport: dial tcp 192.168.1.196:2379: getsockopt: connection refused"; 


[2]. etcd_container journal logs

Jul 06 21:55:02 host3-ha-1.novalocal systemd[1]: Stopped The Etcd Server container.
Jul 06 21:55:12 host3-ha-1.novalocal systemd[1]: Starting The Etcd Server container...
Jul 06 21:55:12 host3-ha-1.novalocal etcd_container[19229]: Error response from daemon: No such container: etcd_container
Jul 06 21:55:12 host3-ha-1.novalocal systemd[1]: Started The Etcd Server container.
Jul 06 21:55:18 host3-ha-1.novalocal etcd_container[19235]: 2017-07-07 01:55:18.739935 I | pkg/flags: recognized and used environment variable ETCD_ADVERTISE_CLIENT_URLS=https://192.168.1.196:2379
Jul 06 21:55:18 host3-ha-1.novalocal etcd_container[19235]: 2017-07-07 01:55:18.777574 I | pkg/flags: recognized and used environment variable ETCD_CA_FILE=/etc/etcd/ca.crt
Jul 06 21:55:18 host3-ha-1.novalocal etcd_container[19235]: 2017-07-07 01:55:18.777583 I | pkg/flags: recognized and used environment variable ETCD_CERT_FILE=/etc/etcd/server.crt

Comment 6 Scott Dodson 2017-07-08 00:48:38 UTC
https://github.com/openshift/openshift-ansible/pull/4703 adds a cluster health check before we attempt to re-attach leases.

Comment 7 Anping Li 2017-07-10 07:03:44 UTC
The fix works with the master branch.

Comment 9 Anping Li 2017-07-11 01:17:43 UTC
The fix works well on openshift-ansible-3.6.140.

Comment 11 errata-xmlrpc 2017-08-10 05:29:50 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2017:1716