Description of problem: No failover for pods if more than 2 nodes are failing at same time. As an example, stopping 1 infra node and 2 app nodes. Version-Release number of selected component (if applicable): 3.11 How reproducible: Not in my laboratory but it happens all the times in customer's facilities. Steps to Reproduce: 1. Turn off more than 2 nodes (not a master involved) Actual results: 2. Nodes are marked as NotReady 3. Wait for 5 minutes 4. Pods keep in Running state. No Unknown nor NodeLost states. 5. Wait for 5 minutes more 6. Turn on nodes 7. Pods are terminated and new ones are created Expected results: 2. Nodes are marked as NotReady 3. Wait for 5 minutes 4. Pods change to Unknown or NodeLost depending if they are part of a deployment or daemonset. New pods are started up to meet required number of replicas whenver it applies 5. Wait for 5 minutes more 6. Turn on nodes 7. Unknown and NodeLost pods are terminated and new ones are created if apply. Additional info: - When turning off one single node the result is the expected. This only happens when more than 2 nodes are turned off. - Note that no master has been turned off during the test. - See attached file with master-api and master-controllers logs during the test. - This is a stretched cluster between two datacenters: DataCenter-1 dfrsijaspcpm1.example.net 10.240.153.11 (master) dfrsijaspcpin1.example.net 10.240.153.13 (infra) dfrvijaspcplb1.example.net 10.241.232.92 (loadbalancer: haproxy) dfrsijaspcpcn1.example.net 10.240.153.17 (compute) dfrsijaspcpcn3.example.net 10.240.153.19 (compute) DataCenter-2 dfhsijaspcpm2.example.net 10.240.153.12 (master) dfhsijaspcpin2.example.net 10.240.153.14 (infra) dfhvijaspcplb2.example.net 10.241.232.93 (loadbalancer: haproxy) dfhsijaspcpcn2.example.net 10.240.153.18 (compute) dfhsijaspcpcn4.example.net 10.240.153.20 (compute) DataCenter-3 dfrsijaspcpm3.example.net 10.241.92.10 (master) The underlying network is a VLAN with 2 masters, all infrastructure and all compute nodes together. The exception here is the third master and both Loadbalancers, they are located in other VLANs. The networks are low-latency (<1ms) 10GBit network connections over multipathing, here the ping values between the masters: root@dfrsijaspcpm1:~# ping -c 3 dfhsijaspcpm2 PING dfhsijaspcpm2.example.net (10.240.153.12) 56(84) bytes of data. 64 bytes from dfhsijaspcpm2.example.net (10.240.153.12): icmp_seq=1 ttl=64 time=0.442 ms 64 bytes from dfhsijaspcpm2.example.net (10.240.153.12): icmp_seq=2 ttl=64 time=0.429 ms 64 bytes from dfhsijaspcpm2.example.net (10.240.153.12): icmp_seq=3 ttl=64 time=0.462 ms root@dfrsijaspcpm1:~# ping -c 3 dfrsijaspcpm3 PING dfrsijaspcpm3.examplenet (10.241.92.10) 56(84) bytes of data. 64 bytes from dfrsijaspcpm3.example.net (10.241.92.10): icmp_seq=1 ttl=60 time=0.588 ms 64 bytes from dfrsijaspcpm3.example.net (10.241.92.10): icmp_seq=2 ttl=60 time=0.598 ms 64 bytes from dfrsijaspcpm3.example.net (10.241.92.10): icmp_seq=3 ttl=60 time=0.614 ms
I tend to think that this is related with the fact that the cluster is spread but I can't find a reason why due to the very low latency and the fact that no masters have been turned off during the test so etcd and master-api is okay. If you need anything else please let me know and I'll get it from customer.
*** Bug 1722288 has been marked as a duplicate of this bug. ***
While going through the logs, I saw the new pods failed to be schedule. It's a slightly different issue, but if you could post all the events for the cluster (for all namespaces), that would help.
For whatever it's worth, the initial case which originated this bugzilla is no longer being affected. Customer replaced baremetal servers to host the master servers with virtual machines with the same hardware requirements, and the issue is gone. It may still related to networking if the baremetal servers are differently connected than the virtual machines.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2019:3139