
Bug 1550470

Summary: [3.6] Master api hang when 1 of master/etcd down
Product: OpenShift Container Platform
Reporter: Takayoshi Kimura <tkimura>
Component: Master
Assignee: Jordan Liggitt <jliggitt>
Status: CLOSED ERRATA
QA Contact: Wang Haoran <haowang>
Severity: urgent
Docs Contact:
Priority: urgent
Version: 3.6.0
CC: aos-bugs, bleanhar, dmoessne, fabian, geliu, haowang, jliggitt, jokerman, jorge_martinez, knakayam, maszulik, mfojtik, mifiedle, mmccomas, pdwyer, rbost, rkharwar, rkshirsa, sttts, tkimura, wmeng, yannick.kint
Target Milestone: ---
Target Release: 3.6.z
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
: 1561748 1561749 1564978
Environment:
Last Closed: 2018-04-12 06:03:40 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On:
Bug Blocks: 1552332, 1561748, 1561749, 1564978
Attachments: master api log (flags: none)

Description Takayoshi Kimura 2018-03-01 09:17:07 UTC
Description of problem:

I ran "echo c > /proc/sysrq-trigger" on one of the hosts while running "oc get all" in a loop. The "oc get all" command hangs because the master-api doesn't respond.

I tested this on my fresh 3.6 HA cluster with HAProxy on RHEV, with 3 hosts that each run master and etcd collocated. After the "echo c > /proc/sysrq-trigger", I also powered the host off from the RHEV console.

Interestingly, it didn't reproduce on the 1st try but did reproduce on the 2nd try, with a different master host down. I ran 3 sets of tests and got the same result each time. I'm not sure whether that's a coincidence.

- Tested ha2 node down several times but "oc get all" recovers in 30 sec or so
- Rebooted ha2; ha3 is the host taken down in the next round
- Before the test, etcd leader was ha1, so the down host is not etcd leader

# etcdctl2 cluster-health; etcdctl2 member list
member 1f5b3374465bbe21 is healthy: got healthy result from https://10.72.38.241:2379
member a9858ea062efda3b is healthy: got healthy result from https://10.72.38.243:2379
member da8b43be10574ba5 is healthy: got healthy result from https://10.72.38.242:2379
cluster is healthy
1f5b3374465bbe21: name=tkimura-shift-ha1.usersys.redhat.com peerURLs=https://10.72.38.241:2380 clientURLs=https://10.72.38.241:2379 isLeader=true
a9858ea062efda3b: name=tkimura-shift-ha3.usersys.redhat.com peerURLs=https://10.72.38.243:2380 clientURLs=https://10.72.38.243:2379 isLeader=false
da8b43be10574ba5: name=tkimura-shift-ha2.usersys.redhat.com peerURLs=https://10.72.38.242:2380 clientURLs=https://10.72.38.242:2379 isLeader=false

- ha3 down at Thu Mar  1 17:38:33 JST 2018
- "oc get all" hangs
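
The probe loop used for this test can be sketched roughly as follows. This is a minimal illustration, not the script actually used: the helper names probe and probe_loop and the timeout values are assumptions made for the example. It runs a command with a deadline and records whether each attempt succeeded, failed, or hung.

```python
import subprocess
import time


def probe(cmd, timeout_s):
    """Run cmd once and classify the attempt as 'ok', 'failed', or 'hang'."""
    try:
        result = subprocess.run(
            cmd,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            timeout=timeout_s,  # the child is killed when this expires
        )
    except subprocess.TimeoutExpired:
        return "hang"
    return "ok" if result.returncode == 0 else "failed"


def probe_loop(cmd, timeout_s, rounds, pause_s=1.0):
    """Probe repeatedly and return a list of (timestamp, status) pairs."""
    results = []
    for _ in range(rounds):
        results.append((time.strftime("%H:%M:%S"), probe(cmd, timeout_s)))
        time.sleep(pause_s)
    return results
```

For this scenario the call would look like probe_loop(["oc", "get", "all"], timeout_s=60, rounds=30), where 60 seconds is an arbitrary cutoff chosen to be longer than the roughly 30 second recovery seen when ha2 was down.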

Version-Release number of selected component (if applicable):

atomic-openshift-3.6.173.0.96-1.git.0.8f6ff22.el7.x86_64

How reproducible:

Always when testing rolling master host down scenario

Steps to Reproduce:

See description

Actual results:

master api hangs when 1 master is down

Expected results:

master api works

Additional info:

There is a similar report in another ticket: https://bugzilla.redhat.com/show_bug.cgi?id=1498456

Based on those reports, it seems this bug exists in both the latest 3.6 and the latest 3.7.

Comment 3 Takayoshi Kimura 2018-03-01 10:01:48 UTC
The issue no longer happens after restarting the atomic-openshift-master-api service on the 2 other master hosts.

For some reason the master-api keeps trying to communicate with the dead host forever:

Mar 01 17:47:56 tkimura-shift-ha1.usersys.redhat.com openshift[5271]: grpc: addrConn.resetTransport failed to create client transport: connection error: desc = "transport: dial tcp 10.72.38.243:2379: i/o timeout"; Reconnecting to {tkimura-shift-ha3.usersys.redhat.com:2379 <nil>}
Mar 01 17:47:56 tkimura-shift-ha1.usersys.redhat.com openshift[5271]: Failed to dial tkimura-shift-ha3.usersys.redhat.com:2379: grpc: the connection is closing; please retry.
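
To see which endpoint the master-api keeps dialing, the journal lines can be tallied with a short script. This is only a sketch: the function name count_dial_timeouts is made up for the example, and the pattern assumes the message format shown in the two lines above.

```python
import re
from collections import Counter

# Matches the "dial tcp HOST:PORT: i/o timeout" fragment seen in the log above.
DIAL_TIMEOUT = re.compile(r"dial tcp (\S+?): i/o timeout")


def count_dial_timeouts(lines):
    """Count dial timeouts per endpoint across an iterable of log lines."""
    counts = Counter()
    for line in lines:
        match = DIAL_TIMEOUT.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts
```

Feeding it the output of journalctl for the master-api unit would show whether the timeouts all point at the host that was taken down.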

Comment 4 Maciej Szulik 2018-03-02 10:46:13 UTC
Can you verify if the reproducer you've described is reproducible on 3.7 as well?

Comment 5 Takayoshi Kimura 2018-03-05 00:37:54 UTC
I don't have spare resources for that right now. Can we assign QE to it?

Comment 18 Michal Fojtik 2018-03-12 13:50:11 UTC
Is the API completely stuck? Can you try to get different resources, like "oc get pods" or "oc get secrets", etc.? Does it time out for all of them?

It seems that killing an etcd member before it can properly close its connections to clients leaves those clients stuck for ~15 minutes, until the connection is dropped and then reopened against a different member.

This can impact the watches as well (the watch will be open for 15 minutes and then reconnect to a different member).

There is a related issue: https://github.com/coreos/etcd/issues/8980 (thanks Stefan!)
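
As a generic illustration of why a peer that disappears without closing its connections can go unnoticed for so long, and of how keepalive probing shortens that window, here is a plain TCP sketch. It is not the etcd client's gRPC keepalive and not the change shipped for this bug; the function names and the default values are assumptions made for the example.

```python
import socket


def enable_keepalive(sock, idle_s=10, interval_s=5, probes=3):
    """Turn on TCP keepalive so a silently dead peer is eventually detected."""
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
    # The per-socket tuning constants are platform-specific (present on
    # Linux); skip any that this platform does not expose.
    tuning = (
        ("TCP_KEEPIDLE", idle_s),
        ("TCP_KEEPINTVL", interval_s),
        ("TCP_KEEPCNT", probes),
    )
    for name, value in tuning:
        option = getattr(socket, name, None)
        if option is not None:
            sock.setsockopt(socket.IPPROTO_TCP, option, value)
    return sock


def detection_bound_s(idle_s=10, interval_s=5, probes=3):
    """Approximate seconds until an idle connection to a dead peer is dropped."""
    return idle_s + interval_s * probes
```

With the example defaults an idle connection to a powered-off host would be torn down after about 25 seconds. Without any probing, nothing tells the client that the other end is gone, which is consistent with the long stall described above.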

Comment 19 Michal Fojtik 2018-03-12 14:10:39 UTC
starting server... [version: 3.2.15, cluster version: to_be_decided]

oc v3.9.4
kubernetes v1.9.1+a0ce1bc657
features: Basic-Auth GSSAPI Kerberos SPNEGO

(I believe this has the 3.2.16 client already)

What I did:

3 masters

1) I ran poweroff -f on master #1
2) ssh into master #2: oc get all is stuck, oc get secrets works, oc get pods does not
3) ssh into master #3: oc get all is stuck, oc get secrets works, oc get pods works as well (?)

On #2 there are no gRPC messages recorded in the journal, just API timeouts (panic).
All masters are running with loglevel=5.

Comment 45 ge liu 2018-03-29 09:44:38 UTC
Created attachment 1414689
master api log

Comment 47 Jordan Liggitt 2018-03-29 23:34:02 UTC
the API server is configured to use etcd v3:

kubernetesMasterConfig:
  apiServerArguments: 
    storage-backend:
    - etcd3
    storage-media-type:
    - application/vnd.kubernetes.protobuf


This fix only applies when running with etcd v2:

kubernetesMasterConfig:
  apiServerArguments: 
    storage-backend:
    - etcd2
    storage-media-type:
    - application/json
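
A quick way to tell which of the two cases a master is in is to read the storage-backend value out of the parsed master config. The sketch below assumes the config has already been loaded into a dict (for example with a YAML parser); the helper names storage_backend and fix_applies are hypothetical. An unset value is reported as unknown rather than guessed.

```python
def storage_backend(master_config):
    """Return the configured storage backend, or None when it is not set."""
    kube = master_config.get("kubernetesMasterConfig") or {}
    args = kube.get("apiServerArguments") or {}
    # Argument values are lists in this config format, e.g. ["etcd3"].
    values = args.get("storage-backend") or []
    return values[0] if values else None


def fix_applies(master_config):
    """True for etcd2, False for another backend, None when unset."""
    backend = storage_backend(master_config)
    if backend is None:
        return None
    return backend == "etcd2"
```

For the first config shown above this returns False, and for the second it returns True.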

Comment 49 ge liu 2018-03-30 08:16:55 UTC
Regarding comment 48, a separate bug tracks the new problem: https://bugzilla.redhat.com/show_bug.cgi?id=1562331

Comment 54 errata-xmlrpc 2018-04-12 06:03:40 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2018:1106