Description of problem:

After testing a network partition by dropping all packets between two masters, the majority of API calls fail with "etcdserver: request timed out".

When the command "iptables -t filter -A INPUT -s 10.0.0.x -j DROP" is run on one of the master nodes, dropping packets from one of the other master nodes, we start to see "etcdserver: request timed out" errors.

Originally tested with OVNKubernetes on Azure; it also seems to reproduce with OpenShiftSDN on Azure. The same iptables DROP test on AWS seems to cause no degradation. OVN on GCP seems to experience the same "etcdserver: request timed out" errors.

Version-Release number of selected component (if applicable):
4.4.0-0.nightly-2020-03-31-053841

How reproducible:
Very

Steps to Reproduce:
1. Identify the OVNKubernetes leader, i.e. the node where cluster/status reports "Leader: self":

$ for f in $(oc -n openshift-ovn-kubernetes get pods -l app=ovnkube-master -o jsonpath={.items[*].metadata.name}) ; do echo -e "\n${f}\n" ; oc -n openshift-ovn-kubernetes exec "${f}" -c northd -- ovs-appctl -t /var/run/ovn/ovnnb_db.ctl cluster/status OVN_Northbound ; done

2. On a different master node, drop all traffic from the OVN Northbound leader:

$ oc debug node/master-2 -- chroot /host iptables -t filter -A INPUT -s <LEADER_MASTER_IP> -j DROP

3. Repeatedly list pods and check for "etcdserver: request timed out" errors:

$ oc -n openshift-etcd get pods

4. Let the cluster sit for several hours and see if the errors increase.

Actual results:
Error from server: etcdserver: request timed out

Expected results:
oc get pods succeeds; cluster operators are not degraded.

Additional info:
Dropping only the OVN Raft ports does not seem to cause issues, i.e. this test does not reproduce the problem:

$ oc debug node/master-2 -- chroot /host iptables -t filter -A INPUT -s <LEADER_MASTER_IP> -p tcp --dport 9641:9648 -j DROP
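For anyone re-running this, one way to confirm whether etcd itself is unhealthy while the DROP rule is in place (a rough sketch; it assumes the etcd pods carry the app=etcd label and include an etcdctl container with the client certificates preconfigured, which may vary by release):

$ # pick any running etcd pod
$ ETCD_POD=$(oc -n openshift-etcd get pods -l app=etcd -o jsonpath='{.items[0].metadata.name}')
$ # per-member health check; a partitioned or leaderless member should report an error or a long response time
$ oc -n openshift-etcd exec "${ETCD_POD}" -c etcdctl -- etcdctl endpoint health --cluster -w table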
> Steps to Reproduce:
> 1. identify OVNKubernetes leader

That part seems irrelevant? You're blocking *all* traffic to one of the masters, and then seeing etcd problems. Nothing to do with OVN-Kubernetes...

(But maybe that's why you reproduced the problem on Azure and GCP but not AWS? Maybe the ovn-kube leader happened to be on the same master as the etcd leader when you tested on Azure and GCP, but not when you tested on AWS.)
> Maybe the ovn-kube leader happened to be on the same master as the etcd leader

Hmm, `oc exec <etcd-master-pod> -n openshift-etcd -- etcdctl endpoint status` might shed more light on this.
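Something along these lines should show it (a sketch; the etcdctl container name and the preconfigured client certs are assumptions about the pod layout):

$ oc -n openshift-etcd exec <etcd-master-pod> -c etcdctl -- etcdctl endpoint status --cluster -w table

The IS LEADER column in the table output identifies the current etcd leader; comparing that node against the one reporting "Leader: self" in the OVN_Northbound cluster/status output from step 1 would confirm or rule out the co-location theory.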
This bug hasn't had any activity in the last 30 days. Maybe the problem got resolved, was a duplicate of something else, or became less pressing for some reason - or maybe it's still relevant but just hasn't been looked at yet. As such, we're marking this bug as "LifecycleStale" and decreasing the severity. If you have further information on the current state of the bug, please update it, otherwise this bug will be automatically closed in 7 days. The information can be, for example, that the problem still occurs, that you still want the feature, that more information is needed, or that the bug is (for whatever reason) no longer relevant.
The choice of network plugin has no influence on etcd-to-etcd or apiserver-to-etcd traffic. If you are confident that etcd does not break in a real network partition then I would say to just close this bug; my guess would be that the command the OP was using to simulate a network partition was incorrect and had unexpected additional side effects.
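For what it's worth, a closer approximation of a real partition (a sketch only; the rule placement is an assumption and other simulation methods exist) would drop traffic in both directions between the two masters rather than only inbound from one of them:

$ # on master-2, cut traffic both to and from the leader
$ oc debug node/master-2 -- chroot /host iptables -t filter -A INPUT -s <LEADER_MASTER_IP> -j DROP
$ oc debug node/master-2 -- chroot /host iptables -t filter -A OUTPUT -d <LEADER_MASTER_IP> -j DROP

A single INPUT DROP rule leaves the link asymmetric (master-2 can still send to the leader but never hears back), which is a different, and arguably harder, failure mode than a clean symmetric partition, and could explain the "unexpected additional side effects" suspected above.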
Closing per https://bugzilla.redhat.com/show_bug.cgi?id=1819907#c8. We are working on a periodic network-partition test for the cluster, but etcd is partition-tolerant and tested extensively upstream by etcd CI.