Bug 1468187
Summary: Re-introduce leases failed for etcd is not available
Product: OpenShift Container Platform
Component: Cluster Version Operator
Version: 3.6.0
Status: CLOSED ERRATA
Severity: medium
Priority: unspecified
Hardware: Unspecified
OS: Unspecified
Reporter: Anping Li <anli>
Assignee: Tim Bielawa <tbielawa>
QA Contact: Anping Li <anli>
CC: anli, aos-bugs, jokerman, mmccomas
Type: Bug
Last Closed: 2017-08-10 05:29:50 UTC
Doc Type: Bug Fix
Doc Text:
  Cause: etcd services were not always up before attempting to re-introduce leases.
  Consequence: Installs/upgrades could fail.
  Fix: Introduced a retrying health check of the etcd service.
  Result: Re-introducing the leases will not happen until the service is detected as up and healthy on each specific node.

Description
Anping Li
2017-07-06 08:59:30 UTC
Created attachment 1294905
Add logs

Anping, can you gather the journal from all etcd services so we understand what went wrong with etcd after it was started successfully?

    journalctl --no-pager -u etcd

Jan, let's add a health check after "Enable etcd member" that retries for up to a minute, waiting on the cluster to become healthy. I wonder whether it takes longer than normal to become healthy again after the v2 to v3 migration.

Branch in progress here, maybe it'll help once it's fleshed out. I can't test it yet because bringing up a cluster like the one in your inventory has taken 2 hours so far, and I think it's just getting slower with every task.

Created attachment 1295141
Migrate playbook output and etcd_container journal log

That is a containerized etcd. The playbook tried to re-introduce leases at 21:55:15.141364 [1], but the etcd_container was started after 21:55:18 [2].

[1] Playbook output:

    2017-07-06 21:55:15.141364 I | etcdserver/api/v3rpc: grpc: addrConn.resetTransport failed to create client transport: connection error: desc = "transport: dial tcp 192.168.1.196:2379: getsockopt: connection refused";

[2] etcd_container journal logs:

    Jul 06 21:55:02 host3-ha-1.novalocal systemd[1]: Stopped The Etcd Server container.
    Jul 06 21:55:12 host3-ha-1.novalocal systemd[1]: Starting The Etcd Server container...
    Jul 06 21:55:12 host3-ha-1.novalocal etcd_container[19229]: Error response from daemon: No such container: etcd_container
    Jul 06 21:55:12 host3-ha-1.novalocal systemd[1]: Started The Etcd Server container.
    Jul 06 21:55:18 host3-ha-1.novalocal etcd_container[19235]: 2017-07-07 01:55:18.739935 I | pkg/flags: recognized and used environment variable ETCD_ADVERTISE_CLIENT_URLS=https://192.168.1.196:2379
    Jul 06 21:55:18 host3-ha-1.novalocal etcd_container[19235]: 2017-07-07 01:55:18.777574 I | pkg/flags: recognized and used environment variable ETCD_CA_FILE=/etc/etcd/ca.crt
    Jul 06 21:55:18 host3-ha-1.novalocal etcd_container[19235]: 2017-07-07 01:55:18.777583 I | pkg/flags: recognized and used environment variable ETCD_CERT_FILE=/etc/etcd/server.crt

https://github.com/openshift/openshift-ansible/pull/4703 adds a cluster health check before we attempt to re-attach leases.

The fix works with the master branch.

The fix works well on openshift-ansible-3.6.140.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2017:1716
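
For readers who want to see the shape of the fix, here is a minimal Python sketch of a retrying health check of the kind discussed in this bug (retry for roughly a minute before re-introducing leases). It is not the code from the pull request, which lives in the Ansible playbooks. The names `command_succeeds` and `wait_until_healthy`, the 30 x 2 second defaults, and the use of `etcdctl cluster-health` as the probe command are all assumptions made for illustration.

```python
import subprocess
import time
from typing import Callable, Sequence


def command_succeeds(cmd: Sequence[str]) -> bool:
    """Return True if cmd exits with status 0; a missing binary counts as failure."""
    try:
        result = subprocess.run(
            cmd, stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL
        )
    except OSError:
        return False
    return result.returncode == 0


def wait_until_healthy(
    check: Callable[[], bool],
    retries: int = 30,      # hypothetical default: 30 attempts
    delay: float = 2.0,     # hypothetical default: 2 s apart, about a minute total
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Call check() up to `retries` times, sleeping `delay` seconds between attempts.

    Returns True as soon as check() succeeds, False if every attempt fails.
    """
    for attempt in range(1, retries + 1):
        if check():
            return True
        if attempt < retries:
            sleep(delay)
    return False


# Example (assumed probe command; adjust endpoints and certificates as needed):
#   healthy = wait_until_healthy(
#       lambda: command_succeeds(["etcdctl", "cluster-health"]))
#   if not healthy:
#       raise SystemExit("etcd did not become healthy; not re-introducing leases")
```

The check is passed in as a callable so that the retry loop stays independent of how health is probed, which also makes the timing behaviour easy to exercise without a running etcd.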