Bug 1720174 - No pod failover when multiple nodes are NotReady
Summary: No pod failover when multiple nodes are NotReady
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Node
Version: 3.11.0
Hardware: x86_64
OS: Linux
Priority: urgent
Severity: urgent
Target Milestone: ---
Target Release: 3.11.z
Assignee: Ryan Phillips
QA Contact: Weinan Liu
URL:
Whiteboard:
Duplicates: 1722288
Depends On:
Blocks: 1752894 1753995
Reported: 2019-06-13 10:14 UTC by Sergio G.
Modified: 2019-10-26 00:54 UTC (History)

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Clones: 1752894 1753995
Environment:
Last Closed: 2019-10-18 01:34:36 UTC
Target Upstream Version:



Links
System ID Priority Status Summary Last Updated
Github openshift origin pull 23779 None closed [release-3.11] Bug 1720174: upstream: Kubelet status manager sync the status of local pods 2020-07-01 07:42:15 UTC
Red Hat Product Errata RHBA-2019:3139 None None None 2019-10-18 01:34:58 UTC

Description Sergio G. 2019-06-13 10:14:45 UTC
Description of problem:
Pods do not fail over when more than 2 nodes fail at the same time, for example when stopping 1 infra node and 2 app nodes.


Version-Release number of selected component (if applicable):
3.11


How reproducible:
Not reproducible in my lab, but it happens every time in the customer's environment.


Steps to Reproduce:
1. Turn off more than 2 nodes (no master involved)


Actual results:
2. Nodes are marked as NotReady
3. Wait for 5 minutes
4. Pods stay in Running state. No Unknown or NodeLost states appear.
5. Wait for 5 minutes more
6. Turn on nodes
7. Pods are terminated and new ones are created


Expected results:
2. Nodes are marked as NotReady
3. Wait for 5 minutes
4. Pods change to Unknown or NodeLost, depending on whether they are part of a deployment or a daemonset. New pods are started to meet the required number of replicas wherever applicable.
5. Wait for 5 minutes more
6. Turn on nodes
7. Unknown and NodeLost pods are terminated and new ones are created where applicable.
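
To make the difference between the actual and expected results easier to capture during a test run, the check can be scripted. The following is only a minimal sketch, not part of the original report: it assumes a logged-in `oc` client, and it assumes that a pod handled correctly by the node controller shows up in the API with either a deletion timestamp or a status reason of "NodeLost". The function names are hypothetical.

```python
import json
import subprocess


def not_ready_nodes(nodes):
    """Return the names of nodes whose Ready condition is not "True"."""
    names = set()
    for node in nodes.get("items", []):
        conditions = node.get("status", {}).get("conditions", [])
        ready = next((c for c in conditions if c.get("type") == "Ready"), None)
        # A missing Ready condition is treated as NotReady as well.
        if ready is None or ready.get("status") != "True":
            names.add(node["metadata"]["name"])
    return names


def stale_running_pods(nodes, pods):
    """Return (namespace, name, node) for pods still Running on NotReady nodes."""
    down = not_ready_nodes(nodes)
    stale = []
    for pod in pods.get("items", []):
        meta = pod["metadata"]
        status = pod.get("status", {})
        node_name = pod.get("spec", {}).get("nodeName")
        if node_name not in down or status.get("phase") != "Running":
            continue
        # Assumption: pods already marked for eviction carry a deletion
        # timestamp or the "NodeLost" reason, so they are not counted.
        if meta.get("deletionTimestamp") or status.get("reason") == "NodeLost":
            continue
        stale.append((meta["namespace"], meta["name"], node_name))
    return sorted(stale)


def oc_json(*args):
    """Run an oc command and parse its JSON output."""
    out = subprocess.check_output(["oc"] + list(args) + ["-o", "json"])
    return json.loads(out)


def report():
    """Print every pod that is still Running on a NotReady node."""
    nodes = oc_json("get", "nodes")
    pods = oc_json("get", "pods", "--all-namespaces")
    for namespace, name, node in stale_running_pods(nodes, pods):
        print("{}/{} still Running on NotReady node {}".format(namespace, name, node))
```

Calling `report()` after step 3 and again after step 5 would list the pods that match step 4 of the actual results. An empty list after the 5-minute wait would correspond to the expected behaviour.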



Additional info:
- When a single node is turned off, the result is as expected. The problem only occurs when more than 2 nodes are turned off.
- Note that no master has been turned off during the test.
- See attached file with master-api and master-controllers logs during the test.
- This is a stretched cluster between two datacenters:
DataCenter-1
  dfrsijaspcpm1.example.net  10.240.153.11  (master)
  dfrsijaspcpin1.example.net 10.240.153.13  (infra)
  dfrvijaspcplb1.example.net 10.241.232.92  (loadbalancer: haproxy)
  dfrsijaspcpcn1.example.net 10.240.153.17  (compute)
  dfrsijaspcpcn3.example.net 10.240.153.19  (compute)

DataCenter-2
  dfhsijaspcpm2.example.net  10.240.153.12  (master)
  dfhsijaspcpin2.example.net 10.240.153.14  (infra)
  dfhvijaspcplb2.example.net 10.241.232.93  (loadbalancer: haproxy)
  dfhsijaspcpcn2.example.net 10.240.153.18  (compute)
  dfhsijaspcpcn4.example.net 10.240.153.20  (compute)

DataCenter-3
  dfrsijaspcpm3.example.net  10.241.92.10  (master)
The underlying network is a single VLAN containing 2 masters, all infrastructure nodes, and all compute nodes. The exceptions are the third master and both load balancers, which are located in other VLANs.

The networks are low-latency (<1 ms) 10 Gbit connections over multipathing. Here are the ping values between the masters:

root@dfrsijaspcpm1:~# ping -c 3 dfhsijaspcpm2
PING dfhsijaspcpm2.example.net (10.240.153.12) 56(84) bytes of data.
64 bytes from dfhsijaspcpm2.example.net (10.240.153.12): icmp_seq=1 ttl=64 time=0.442 ms
64 bytes from dfhsijaspcpm2.example.net (10.240.153.12): icmp_seq=2 ttl=64 time=0.429 ms
64 bytes from dfhsijaspcpm2.example.net (10.240.153.12): icmp_seq=3 ttl=64 time=0.462 ms

root@dfrsijaspcpm1:~# ping -c 3 dfrsijaspcpm3
PING dfrsijaspcpm3.example.net (10.241.92.10) 56(84) bytes of data.
64 bytes from dfrsijaspcpm3.example.net (10.241.92.10): icmp_seq=1 ttl=60 time=0.588 ms
64 bytes from dfrsijaspcpm3.example.net (10.241.92.10): icmp_seq=2 ttl=60 time=0.598 ms
64 bytes from dfrsijaspcpm3.example.net (10.241.92.10): icmp_seq=3 ttl=60 time=0.614 ms

Comment 2 Sergio G. 2019-06-13 10:22:03 UTC
I suspect this is related to the cluster being stretched across datacenters, but I can't find a reason why: latency is very low, and no masters were turned off during the test, so etcd and master-api are okay.

If you need anything else, please let me know and I'll get it from the customer.

Comment 15 Seth Jennings 2019-07-03 14:13:03 UTC
*** Bug 1722288 has been marked as a duplicate of this bug. ***

Comment 20 Ryan Phillips 2019-07-03 15:59:54 UTC
While going through the logs, I saw that the new pods failed to be scheduled. It's a slightly different issue, but if you could post all the events for the cluster (for all namespaces), that would help.
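
One possible way to gather the requested events in a single chronological listing is sketched below. This is an illustration only and was not part of the original thread: it assumes a logged-in `oc` client with permission to read events in every namespace, and the function names are hypothetical.

```python
import json
import subprocess


def event_time(event):
    """Pick the latest timestamp recorded on an event, or "" if none is set."""
    return event.get("lastTimestamp") or event.get("firstTimestamp") or ""


def format_events(events):
    """Return one line per event, oldest first."""
    lines = []
    # RFC 3339 UTC timestamps sort correctly as plain strings.
    for ev in sorted(events.get("items", []), key=event_time):
        obj = ev.get("involvedObject", {})
        lines.append("{} {} {}/{} {}: {}".format(
            event_time(ev) or "-",
            ev.get("metadata", {}).get("namespace", "-"),
            obj.get("kind", "-"),
            obj.get("name", "-"),
            ev.get("reason", "-"),
            ev.get("message", ""),
        ))
    return lines


def collect_events():
    """Fetch events from all namespaces and return them as formatted lines."""
    out = subprocess.check_output(
        ["oc", "get", "events", "--all-namespaces", "-o", "json"])
    return format_events(json.loads(out))
```

Printing the result of `collect_events()` shortly after the test gives a timeline that can be attached to the bug. Events expire after a retention period, so collecting them soon after the nodes go NotReady matters.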

Comment 64 Sergio G. 2019-08-21 08:14:41 UTC
For what it's worth, the case that originated this bug is no longer affected. The customer replaced the bare-metal servers hosting the masters with virtual machines meeting the same hardware requirements, and the issue is gone.

It may still be related to networking if the bare-metal servers are connected differently than the virtual machines.

Comment 85 errata-xmlrpc 2019-10-18 01:34:36 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2019:3139
