This service will be undergoing maintenance at 00:00 UTC, 2017-10-23 It is expected to last about 30 minutes
Bug 1468187 - Re-introduce leases failed for etcd is not available
Re-introduce leases failed for etcd is not available
Status: CLOSED ERRATA
Product: OpenShift Container Platform
Classification: Red Hat
Component: Upgrade (Show other bugs)
3.6.0
Unspecified Unspecified
unspecified Severity medium
: ---
: ---
Assigned To: Tim Bielawa
Anping Li
:
Depends On:
Blocks:
  Show dependency treegraph
 
Reported: 2017-07-06 04:59 EDT by Anping Li
Modified: 2017-08-16 15 EDT (History)
4 users (show)

See Also:
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Cause: etcd services were not always up before attempting to re-introduce leases Consequence: Installs/upgrades could fail Fix: Introduced a retrying health check of the etcd service Result: Re-introducing the leases will not happen until the service is detected as up and healthy on each specific node.
Story Points: ---
Clone Of:
Environment:
Last Closed: 2017-08-10 01:29:50 EDT
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---


Attachments (Terms of Use)
Add logs (26.15 KB, application/x-gzip)
2017-07-06 05:02 EDT, Anping Li
no flags Details
Migrate playbook output and etcd_container journal log (31.35 KB, application/x-gzip)
2017-07-06 22:09 EDT, Anping Li
no flags Details

  None (edit)
Description Anping Li 2017-07-06 04:59:30 EDT
Description of problem:
Re-introduce leases failed for "2017-07-06 04:11:58.655766 I | etcdserver/api/v3rpc: grpc: addrConn.resetTransport failed to create client transport: connection error: desc = "transport: dial tcp 192.168.1.196:2379: getsockopt: connection refused"; Reconnecting to {192.168.1.196:2379 <nil>}".

I think after etcd_container is started, the etcd member need some time to join cluster. it is better to wait for minute until the etcd cluster is health before Re-introduce leases 


Version-Release number of the following components:
openshift-ansible-3.6.135

How reproducible:
sometimes

Steps to Reproduce:
1. install HA OCP v3.5
2. upgrade to OCP 3.6
   anible-playbook usr/share/ansible/openshift-ansible/playbooks/byo/openshift-cluster/upgrades/v3_6/upgrade.yml



Actual results:
changed: [openshift-221.lab.eng.nay.redhat.com] => {
    "changed": true,
    "name": "etcd_container",
    "state": "started",
    "status": {
        "ActiveEnterTimestamp": "Thu 2017-07-06 03:47:30 EDT",
        "ActiveEnterTimestampMonotonic": "3957118851",
        "ActiveExitTimestamp": "Thu 2017-07-06 04:11:46 EDT",
        "ActiveExitTimestampMonotonic": "5413035829",
        "ActiveState": "inactive",
        "After": "docker.service system.slice systemd-journald.socket basic.target",

<--snip--->
<--snip--->
TASK [etcd_migrate : Re-introduce leases (as a replacement for key TTLs)] ******
failed: [openshift-181.lab.eng.nay.redhat.com -> openshift-181.lab.eng.nay.redhat.com] (item=/kubernetes.io/events) => {
    "changed": true,
    "cmd": [
        "oadm",
        "migrate",
        "etcd-ttl",
        "--cert",
        "/etc/etcd/peer.crt",
        "--key",
        "/etc/etcd/peer.key",
        "--cacert",
        "/etc/etcd/ca.crt",
        "--etcd-address",
        "https://192.168.1.196:2379",
        "--ttl-keys-prefix",
        "/kubernetes.io/events",
        "--lease-duration",
        "1h"
    ],
    "delta": "0:00:05.262004",
    "end": "2017-07-06 04:12:03.659000",
    "failed": true,
    "item": "/kubernetes.io/events",
    "rc": 1,
    "start": "2017-07-06 04:11:58.396996",
    "warnings": []
}

STDERR:

2017-07-06 04:11:58.655766 I | etcdserver/api/v3rpc: grpc: addrConn.resetTransport failed to create client transport: connection error: desc = "transport: dial tcp 192.168.1.196:2379: getsockopt: connection refused"; Reconnecting to {192.168.1.196:2379 <nil>}
2017-07-06 04:11:59.655721 I | etcdserver/api/v3rpc: grpc: addrConn.resetTransport failed to create client transport: connection error: desc = "transport: dial tcp 192.168.1.196:2379: getsockopt: connection refused"; Reconnecting to {192.168.1.196:2379 <nil>}
2017-07-06 04:12:01.462522 I | etcdserver/api/v3rpc: grpc: addrConn.resetTransport failed to create client transport: connection error: desc = "transport: dial tcp 192.168.1.196:2379: getsockopt: connection refused"; Reconnecting to {192.168.1.196:2379 <nil>}
2017-07-06 04:12:03.655353 I | etcdserver/api/v3rpc: Failed to dial 192.168.1.196:2379: context canceled; please retry.
error: unable to create etcd client: dial tcp 192.168.1.196:2379: getsockopt: connection refused
changed: [openshift-181.lab.eng.nay.redhat.com -> openshift-181.lab.eng.nay.redhat.com] => (item=/kubernetes.io/masterleases) => {
    "changed": true,
    "cmd": [
        "oadm",
        "migrate",
        "etcd-ttl",
        "--cert",
        "/etc/etcd/peer.crt",
        "--key",
        "/etc/etcd/peer.key",
        "--cacert",
        "/etc/etcd/ca.crt",
        "--etcd-address",
        "https://192.168.1.196:2379",
        "--ttl-keys-prefix",
        "/kubernetes.io/masterleases",
        "--lease-duration",
        "1h"
    ],
    "delta": "0:00:02.016886",
    "end": "2017-07-06 04:12:05.927187",
    "item": "/kubernetes.io/masterleases",
    "rc": 0,
    "start": "2017-07-06 04:12:03.910301",
    "warnings": []
}

STDOUT:

info: Lease #6295853157263723780 with TTL 3600 created
info: Attaching lease to 2 entries

NO MORE HOSTS LEFT *************************************************************
        to retry, use: --limit @/usr/share/ansible/openshift-ansible/playbooks/byo/openshift-etcd/migrate.retry

Expected results:

Additional info:
Comment 1 Anping Li 2017-07-06 05:02 EDT
Created attachment 1294905 [details]
Add logs
Comment 2 Scott Dodson 2017-07-06 08:13:44 EDT
Anping,

Can you gather the journal from all etcd services so we understand what went wrong with etcd after it was started successfully?

journalctl --no-pager -u etcd
Comment 3 Scott Dodson 2017-07-06 08:19:19 EDT
Jan,

Lets add a health check after Enable etcd member? that retries for up to a minute waiting on the cluster to become healthy? I wonder if post v2 to v3 migration it takes longer than normal to become healthy again?
Comment 4 Tim Bielawa 2017-07-06 16:31:47 EDT
Branch in progress here, maybe it'll help once it's fleshed out. Can't test it yet because bringing up a cluster like in your inventory has taken 2 hours so far and I think it's just getting slower with every task.
Comment 5 Anping Li 2017-07-06 22:09 EDT
Created attachment 1295141 [details]
Migrate playbook output and etcd_container journal log

That is a containerized etcd. It was 21:55:15.141364 to Re-introduce leases[1].  but the etcd_container was started after 21:55:18[2].  

[1]. playbook output:
2017-07-06 21:55:15.141364 I | etcdserver/api/v3rpc: grpc: addrConn.resetTransport failed to create client transport: connection error: desc = "transport: dial tcp 192.168.1.196:2379: getsockopt: connection refused"; 


[2]. etcd_container journal logs

Jul 06 21:55:02 host3-ha-1.novalocal systemd[1]: Stopped The Etcd Server container.
Jul 06 21:55:12 host3-ha-1.novalocal systemd[1]: Starting The Etcd Server container...
Jul 06 21:55:12 host3-ha-1.novalocal etcd_container[19229]: Error response from daemon: No such container: etcd_container
Jul 06 21:55:12 host3-ha-1.novalocal systemd[1]: Started The Etcd Server container.
Jul 06 21:55:18 host3-ha-1.novalocal etcd_container[19235]: 2017-07-07 01:55:18.739935 I | pkg/flags: recognized and used environment variable ETCD_ADVERTISE_CLIENT_URLS=https://192.168.1.196:2379
Jul 06 21:55:18 host3-ha-1.novalocal etcd_container[19235]: 2017-07-07 01:55:18.777574 I | pkg/flags: recognized and used environment variable ETCD_CA_FILE=/etc/etcd/ca.crt
Jul 06 21:55:18 host3-ha-1.novalocal etcd_container[19235]: 2017-07-07 01:55:18.777583 I | pkg/flags: recognized and used environment variable ETCD_CERT_FILE=/etc/etcd/server.crt
Comment 6 Scott Dodson 2017-07-07 20:48:38 EDT
https://github.com/openshift/openshift-ansible/pull/4703 adds a cluster health check before we attempt to re-attach leases.
Comment 7 Anping Li 2017-07-10 03:03:44 EDT
The fix works with master branch.
Comment 9 Anping Li 2017-07-10 21:17:43 EDT
The fix work well on openshift-ansible-3.6.140
Comment 11 errata-xmlrpc 2017-08-10 01:29:50 EDT
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2017:1716

Note You need to log in before you can comment on or make changes to this bug.