Bug 1720174

Summary: No pod failover when multiple nodes are NotReady
Product: OpenShift Container Platform
Reporter: Sergio G. <sgarciam>
Component: Node
Assignee: Ryan Phillips <rphillips>
Status: CLOSED ERRATA
QA Contact: Weinan Liu <weinliu>
Severity: urgent
Priority: urgent
Version: 3.11.0
CC: acavalla, akaiser, aos-bugs, asolanas, bfurtado, clpereir, gblomqui, jokerman, mfojtik, mmccomas, mnunes, openshift-bugs-escalate, palonsor, pweil, rphillips, rpuccini, rsunog, schoudha, sjenning, skolicha, tnozicka, xtian
Target Milestone: ---   
Target Release: 3.11.z   
Hardware: x86_64   
OS: Linux   
Clones: 1752894, 1753995
Last Closed: 2019-10-18 01:34:36 UTC
Type: Bug
Bug Blocks: 1752894, 1753995    

Description Sergio G. 2019-06-13 10:14:45 UTC
Description of problem:
Pods do not fail over when more than two nodes fail at the same time, for example when one infra node and two app nodes are stopped.


Version-Release number of selected component (if applicable):
3.11


How reproducible:
Not reproducible in my lab, but it happens every time at the customer's site.


Steps to Reproduce:
1. Turn off more than two nodes (no masters involved)
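
A minimal way to watch the transitions during the test (standard oc commands run from a master or any host with a cluster-admin kubeconfig; this is illustrative, not part of the original report):

  # Watch nodes go NotReady after they are powered off
  oc get nodes -w

  # Watch pod status across all namespaces while the nodes are down
  oc get pods --all-namespaces -o wide -w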


Actual results:
2. Nodes are marked as NotReady
3. Wait for 5 minutes
4. Pods remain in Running state; none move to Unknown or NodeLost.
5. Wait for 5 minutes more
6. Turn on nodes
7. Pods are terminated and new ones are created


Expected results:
2. Nodes are marked as NotReady
3. Wait for 5 minutes
4. Pods change to Unknown or NodeLost, depending on whether they belong to a deployment or a DaemonSet. New pods are started to meet the required number of replicas where applicable.
5. Wait for 5 minutes more
6. Turn on nodes
7. Unknown and NodeLost pods are terminated and new ones are created where applicable.
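
The five-minute window in the expected results matches the default kube-controller-manager pod eviction timeout (pod-eviction-timeout=5m, with nodes marked NotReady after node-monitor-grace-period=40s). As a hedged check, not part of the original report, any overrides of these values, or of the flags that pace evictions when several nodes in a zone are unhealthy, would appear under kubernetesMasterConfig.controllerArguments in the master configuration on each master:

  # Look for eviction-related controller argument overrides in the master config
  # (Kubernetes 1.11 defaults: node-monitor-grace-period=40s, pod-eviction-timeout=5m)
  grep -A2 -E 'pod-eviction-timeout|node-monitor-grace-period|unhealthy-zone-threshold|secondary-node-eviction-rate' \
      /etc/origin/master/master-config.yaml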



Additional info:
- When a single node is turned off, the result is as expected. The problem only appears when more than two nodes are turned off.
- Note that no master has been turned off during the test.
- See attached file with master-api and master-controllers logs during the test.
- This is a cluster stretched between two datacenters, with a third site hosting the third master:
DataCenter-1
  dfrsijaspcpm1.example.net  10.240.153.11  (master)
  dfrsijaspcpin1.example.net 10.240.153.13  (infra)
  dfrvijaspcplb1.example.net 10.241.232.92  (loadbalancer: haproxy)
  dfrsijaspcpcn1.example.net 10.240.153.17  (compute)
  dfrsijaspcpcn3.example.net 10.240.153.19  (compute)

DataCenter-2
  dfhsijaspcpm2.example.net  10.240.153.12  (master)
  dfhsijaspcpin2.example.net 10.240.153.14  (infra)
  dfhvijaspcplb2.example.net 10.241.232.93  (loadbalancer: haproxy)
  dfhsijaspcpcn2.example.net 10.240.153.18  (compute)
  dfhsijaspcpcn4.example.net 10.240.153.20  (compute)

DataCenter-3
  dfrsijaspcpm3.example.net  10.241.92.10  (master)
The underlying network is a single VLAN containing two masters and all infra and compute nodes. The exceptions are the third master and both load balancers, which are located in other VLANs.

The networks are low-latency (<1 ms) 10 Gbit connections with multipathing. Ping values between the masters:

root@dfrsijaspcpm1:~# ping -c 3 dfhsijaspcpm2
PING dfhsijaspcpm2.example.net (10.240.153.12) 56(84) bytes of data.
64 bytes from dfhsijaspcpm2.example.net (10.240.153.12): icmp_seq=1 ttl=64 time=0.442 ms
64 bytes from dfhsijaspcpm2.example.net (10.240.153.12): icmp_seq=2 ttl=64 time=0.429 ms
64 bytes from dfhsijaspcpm2.example.net (10.240.153.12): icmp_seq=3 ttl=64 time=0.462 ms

root@dfrsijaspcpm1:~# ping -c 3 dfrsijaspcpm3
PING dfrsijaspcpm3.example.net (10.241.92.10) 56(84) bytes of data.
64 bytes from dfrsijaspcpm3.example.net (10.241.92.10): icmp_seq=1 ttl=60 time=0.588 ms
64 bytes from dfrsijaspcpm3.example.net (10.241.92.10): icmp_seq=2 ttl=60 time=0.598 ms
64 bytes from dfrsijaspcpm3.example.net (10.241.92.10): icmp_seq=3 ttl=60 time=0.614 ms
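
For reference (an assumption about how the attached logs were gathered, not stated in the report): in 3.11 the API server and controllers run as static pods on the masters, so the master-api and master-controllers logs are typically captured on each master host with the master-logs helper:

  # Run on each master host; the log output goes to stderr, hence the redirect
  /usr/local/bin/master-logs api api 2> master-api.log
  /usr/local/bin/master-logs controllers controllers 2> master-controllers.log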

Comment 2 Sergio G. 2019-06-13 10:22:03 UTC
I tend to think this is related to the cluster being stretched, but I can't find a reason why, given the very low latency and the fact that no masters were turned off during the test, so etcd and master-api are fine.

If you need anything else please let me know and I'll get it from customer.

Comment 15 Seth Jennings 2019-07-03 14:13:03 UTC
*** Bug 1722288 has been marked as a duplicate of this bug. ***

Comment 20 Ryan Phillips 2019-07-03 15:59:54 UTC
While going through the logs, I saw that the new pods failed to be scheduled. It's a slightly different issue, but if you could post all the events for the cluster (for all namespaces), that would help.
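
One way to collect the requested events across all namespaces, sorted by timestamp (standard oc command; the output file name is just an example):

  # Dump all cluster events, most recent last
  oc get events --all-namespaces --sort-by='.lastTimestamp' > cluster-events.txt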

Comment 64 Sergio G. 2019-08-21 08:14:41 UTC
For what it's worth, the environment that originated this bugzilla is no longer affected. The customer replaced the bare-metal servers hosting the masters with virtual machines of equivalent specification, and the issue is gone.

It may still be related to networking, if the bare-metal servers were connected differently than the virtual machines.

Comment 85 errata-xmlrpc 2019-10-18 01:34:36 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2019:3139