Bug 1550470
| Summary: | [3.6] Master api hang when 1 of master/etcd down | ||||||
|---|---|---|---|---|---|---|---|
| Product: | OpenShift Container Platform | Reporter: | Takayoshi Kimura <tkimura> | ||||
| Component: | Master | Assignee: | Jordan Liggitt <jliggitt> | ||||
| Status: | CLOSED ERRATA | QA Contact: | Wang Haoran <haowang> | ||||
| Severity: | urgent | Docs Contact: | |||||
| Priority: | urgent | ||||||
| Version: | 3.6.0 | CC: | aos-bugs, bleanhar, dmoessne, fabian, geliu, haowang, jliggitt, jokerman, jorge_martinez, knakayam, maszulik, mfojtik, mifiedle, mmccomas, pdwyer, rbost, rkharwar, rkshirsa, sttts, tkimura, wmeng, yannick.kint | ||||
| Target Milestone: | --- | ||||||
| Target Release: | 3.6.z | ||||||
| Hardware: | Unspecified | ||||||
| OS: | Unspecified | ||||||
| Whiteboard: | |||||||
| Fixed In Version: | Doc Type: | If docs needed, set a value | |||||
| Doc Text: | Story Points: | --- | |||||
| Clone Of: | |||||||
| : | 1561748 1561749 1564978 | Environment: | |||||
| Last Closed: | 2018-04-12 06:03:40 UTC | Type: | Bug | ||||
| Regression: | --- | Mount Type: | --- | ||||
| Documentation: | --- | CRM: | |||||
| Verified Versions: | Category: | --- | |||||
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |||||
| Cloudforms Team: | --- | Target Upstream Version: | |||||
| Embargoed: | |||||||
| Bug Depends On: | |||||||
| Bug Blocks: | 1552332, 1561748, 1561749, 1564978 | ||||||
| Attachments: |
Description
Takayoshi Kimura
2018-03-01 09:17:07 UTC
The issue no longer happens after restarting the atomic-openshift-master-api service on the other 2 master hosts.
For some reason the master-api keeps trying to communicate with the dead host forever:
Mar 01 17:47:56 tkimura-shift-ha1.usersys.redhat.com openshift[5271]: grpc: addrConn.resetTransport failed to create client transport: connection error: desc = "transport: dial tcp 10.72.38.243:2379: i/o timeout"; Reconnecting to {tkimura-shift-ha3.usersys.redhat.com:2379 <nil>}
Mar 01 17:47:56 tkimura-shift-ha1.usersys.redhat.com openshift[5271]: Failed to dial tkimura-shift-ha3.usersys.redhat.com:2379: grpc: the connection is closing; please retry.
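When the journal contains many of these gRPC messages, it can help to summarize which etcd address the master keeps failing to dial. The following is a minimal sketch, assuming the log lines have the same shape as the two shown above; the function name `unreachable_endpoints` is hypothetical and not part of OpenShift or etcd.

```python
import re

# Matches the address in messages such as:
#   transport: dial tcp 10.72.38.243:2379: i/o timeout
# The pattern is an assumption based only on the log lines quoted in this report.
DIAL_TIMEOUT = re.compile(r"dial tcp (\d{1,3}(?:\.\d{1,3}){3}):(\d+): i/o timeout")


def unreachable_endpoints(log_lines):
    """Return a dict mapping "ip:port" to the number of dial timeouts seen."""
    counts = {}
    for line in log_lines:
        for ip, port in DIAL_TIMEOUT.findall(line):
            key = f"{ip}:{port}"
            counts[key] = counts.get(key, 0) + 1
    return counts
```

One way to feed it is to pass the lines from `journalctl -u atomic-openshift-master-api` to the function; an address with a steadily growing count is likely the member the client is stuck on.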
Can you verify whether the reproducer you've described is reproducible on 3.7 as well? I don't have spare for that right now; can we assign QE to it?

Is the API completely stuck? Can you try to get different resources, like "oc get pods" or "oc get secrets", etc.? Will it time out for all of them?

It seems that shooting an etcd member without letting it properly close its connections with clients causes the clients to be stuck for ~15 minutes, until the connection is dropped and then reopened with a different member. This can impact watches as well (the watch stays open for 15 minutes and then reconnects to a different member).

There is a related issue: https://github.com/coreos/etcd/issues/8980 (thanks Stefan!)

starting server... [version: 3.2.15, cluster version: to_be_decided]
oc v3.9.4
kubernetes v1.9.1+a0ce1bc657
features: Basic-Auth GSSAPI Kerberos SPNEGO
(I believe this has the 3.2.16 client already)

What I did (3 masters):
1) I ran poweroff -f on master #1.
2) ssh into master #2: oc get all is stuck, oc get secrets works, oc get pods does not.
3) ssh into master #3: oc get all is stuck, oc get secrets works, oc get pods works as well (?)

On #2 there are no gRPC messages recorded in the journal, just API timeouts (panic). All masters are running with loglevel=5.

Created attachment 1414689
master api log
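The per-resource checks from the reproducer (oc get pods, oc get secrets, oc get all) can be scripted so that a hung request is reported instead of blocking the terminal. This is a rough sketch, assuming `oc` is on the PATH and already logged in; the helper names `check` and `check_resources` and the 10-second default are illustrative choices, not anything prescribed in this report.

```python
import subprocess


def check(cmd, timeout_s=10.0):
    """Run cmd and classify the outcome as 'ok', 'error', or 'timeout'."""
    try:
        result = subprocess.run(
            cmd,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            timeout=timeout_s,  # the child is killed if it runs longer than this
        )
    except subprocess.TimeoutExpired:
        return "timeout"
    except FileNotFoundError:
        # The executable (for example oc) is not installed or not on the PATH.
        return "error"
    return "ok" if result.returncode == 0 else "error"


def check_resources(resources, timeout_s=10.0):
    """Run `oc get <resource>` for each resource and collect the outcomes."""
    return {r: check(["oc", "get", r], timeout_s) for r in resources}
```

Calling `check_resources(["pods", "secrets", "all"])` on each surviving master would give a per-host table comparable to steps 2 and 3 above.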
The API server is configured to use etcd v3:
kubernetesMasterConfig:
  apiServerArguments:
    storage-backend:
    - etcd3
    storage-media-type:
    - application/vnd.kubernetes.protobuf
This fix only applies when running with etcd v2:
kubernetesMasterConfig:
  apiServerArguments:
    storage-backend:
    - etcd2
    storage-media-type:
    - application/json
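Which of the two configurations a master is running can be checked programmatically. The sketch below assumes the master configuration file has already been parsed into a Python dict by a YAML library (not shown); the function names are hypothetical, and only the keys quoted in the two snippets above are used.

```python
def storage_backends(master_config):
    """Return the storage-backend values listed under apiServerArguments."""
    kube = master_config.get("kubernetesMasterConfig") or {}
    args = kube.get("apiServerArguments") or {}
    return list(args.get("storage-backend") or [])


def uses_etcd2(master_config):
    """True when the API server is configured with the etcd2 storage backend."""
    return "etcd2" in storage_backends(master_config)


# Example dict mirroring the etcd v3 configuration quoted above.
example_v3 = {
    "kubernetesMasterConfig": {
        "apiServerArguments": {
            "storage-backend": ["etcd3"],
            "storage-media-type": ["application/vnd.kubernetes.protobuf"],
        }
    }
}
```

For `example_v3`, `uses_etcd2` returns `False`, which matches the statement that the fix does not cover the etcd v3 configuration.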
Regarding comment 48, a separate bug tracks the new problem: https://bugzilla.redhat.com/show_bug.cgi?id=1562331

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.
https://access.redhat.com/errata/RHBA-2018:1106