Bug 1819907

Summary:	etcdserver timeouts and degraded cluster after network partition on Azure
Product:	OpenShift Container Platform	Reporter:	Ross Brattain <rbrattai>
Component:	Etcd	Assignee:	Sam Batschelet <sbatsche>
Status:	CLOSED NOTABUG	QA Contact:	ge liu <geliu>
Severity:	medium	Docs Contact:
Priority:	medium
Version:	4.4	CC:	anusaxen, danw, dcbw, sbatsche, skolicha
Target Milestone:	---
Target Release:	4.5.0
Hardware:	Unspecified
OS:	Unspecified
Last Closed:	2020-05-20 14:06:15 UTC	Type:	Bug

Description Ross Brattain 2020-04-01 20:29:23 UTC

Description of problem:

After simulating a network partition by dropping all packets between two masters, the majority of API calls fail with "etcdserver: request timed out".

When the command "iptables -t filter -A INPUT -s 10.0.0.x  -j DROP" is run on one of the master nodes dropping packets from one of the other master nodes, we start to see "etcdserver: request timed out" errors.

Originally tested with OVNKubernetes on Azure; it also seems to reproduce with OpenShiftSDN on Azure.

The same iptables DROP test on AWS seems to cause no degradation.

OVN on GCP seems to experience the same "etcdserver: request timed out".



Version-Release number of selected component (if applicable):
4.4.0-0.nightly-2020-03-31-053841

How reproducible:
Very

Steps to Reproduce:
1. identify OVNKubernetes leader, find the node where "Leader: self"
$ for f in $(oc -n openshift-ovn-kubernetes get pods -l app=ovnkube-master -o jsonpath='{.items[*].metadata.name}')  ; do echo -e "\n${f}\n" ; oc -n openshift-ovn-kubernetes exec "${f}" -c northd -- ovs-appctl -t /var/run/ovn/ovnnb_db.ctl cluster/status OVN_Northbound  ; done

2. on a different master node, drop all traffic from the OVN Northbound leader
oc debug node/master-2 -- chroot /host iptables -t filter -A INPUT -s <LEADER_MASTER_IP>  -j DROP
3. repeatedly get pods and check for "etcdserver: request timed out" errors
oc -n openshift-etcd get pods
4. let the cluster sit for several hours to see if errors increase
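
Step 3 could be scripted roughly as follows. This is only a sketch: the function name count_etcd_timeouts and its attempts/interval arguments are hypothetical and not part of the original test; the command to poll is passed in as the remaining arguments.

```shell
# Sketch: run a command repeatedly and count how many runs report the
# etcd timeout error quoted in this report.
# Usage: count_etcd_timeouts ATTEMPTS INTERVAL_SECONDS COMMAND [ARGS...]
# (count_etcd_timeouts and its arguments are illustrative assumptions.)
count_etcd_timeouts() {
  attempts=$1
  interval=$2
  shift 2
  failures=0
  i=0
  while [ "$i" -lt "$attempts" ]; do
    # Capture stderr too, since the error is reported by the client.
    if "$@" 2>&1 | grep -q 'etcdserver: request timed out'; then
      failures=$((failures + 1))
    fi
    i=$((i + 1))
    sleep "$interval"
  done
  echo "$failures"
}
```

For example, "count_etcd_timeouts 20 5 oc -n openshift-etcd get pods" would poll 20 times, 5 seconds apart, and print the number of polls that hit the timeout.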

Actual results:
Error from server: etcdserver: request timed out


Expected results:
oc get pods succeeds
cluster operators are not degraded

Additional info:


Dropping only the OVN Raft ports does not seem to cause issues.

This test
"oc debug node/master-2 -- chroot /host iptables -t filter -A INPUT -s <LEADER_MASTER_IP> -p tcp --dport 9641:9648 -j DROP"
does not seem to cause issues.
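
To keep the two variants of the rule consistent, and to be able to delete a rule again with -D after a test, the rule text could be generated by a small helper. This is a sketch only: partition_rule is a hypothetical name, the IP in the example is made up, and the helper just prints the iptables command line rather than running it.

```shell
# Sketch: print the iptables command used in this report.
#   $1  -A to append the rule, -D to delete the same rule again
#   $2  source IP of the master whose packets are dropped
#   $3  optional TCP destination port range (e.g. 9641:9648)
# (partition_rule is an illustrative helper, not an existing tool.)
partition_rule() {
  action=$1
  src=$2
  ports=${3:-}
  if [ -n "$ports" ]; then
    echo "iptables -t filter $action INPUT -s $src -p tcp --dport $ports -j DROP"
  else
    echo "iptables -t filter $action INPUT -s $src -j DROP"
  fi
}
```

For example, "partition_rule -D 10.0.0.5" prints the command that removes the full DROP rule for the (hypothetical) address 10.0.0.5; the output can then be run on the node via "oc debug node/master-2 -- chroot /host ...".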

Comment 2 Dan Winship 2020-04-02 13:36:34 UTC

> Steps to Reproduce:
> 1. identify OVNKubernetes leader

That part seems irrelevant? You're blocking *all* traffic to one of the masters, and then seeing etcd problems. Nothing to do with OVN-Kubernetes...

(But maybe that's why you reproduced the problem on Azure and GCP but not AWS? Maybe the ovn-kube leader happened  to be on the same master as the etcd leader when you tested on Azure and GCP, but not when you tested on AWS.)

Comment 3 Anurag saxena 2020-04-02 15:17:41 UTC

>>Maybe the ovn-kube leader happened  to be on the same master as the etcd leader

hmm, `oc exec <etcd-master-pod> -n openshift-etcd -- etcdctl endpoint status` might shed more light on this
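
A rough way to pick the etcd leader out of that output, assuming etcdctl's default comma-separated format in which the fifth field is the is-leader flag (worth confirming against the etcdctl version in the cluster). The function name etcd_leader_endpoints is hypothetical.

```shell
# Sketch: read "etcdctl endpoint status" output on stdin and print the
# endpoint(s) whose is-leader field is "true".
# Assumes the default comma-separated output with the leader flag in
# field 5; etcd_leader_endpoints is an illustrative name.
etcd_leader_endpoints() {
  awk -F', *' '$5 == "true" { print $1 }'
}
```

It would be used as "oc exec <etcd-master-pod> -n openshift-etcd -- etcdctl endpoint status | etcd_leader_endpoints". The result can then be compared with the node that reports "Leader: self" for OVN_Northbound.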

Comment 6 Michal Fojtik 2020-05-12 10:59:38 UTC

This bug hasn't had any activity in the last 30 days. Maybe the problem got resolved, was a duplicate of something else, or became less pressing for some reason - or maybe it's still relevant but just hasn't been looked at yet.

As such, we're marking this bug as "LifecycleStale" and decreasing the severity. 

If you have further information on the current state of the bug, please update it, otherwise this bug will be automatically closed in 7 days. The information can be, for example, that the problem still occurs, that you still want the feature, that more information is needed, or that the bug is (for whatever reason) no longer relevant.

Comment 8 Dan Winship 2020-05-18 14:01:44 UTC

The choice of network plugin has no influence on etcd-to-etcd or apiserver-to-etcd traffic.

If you are confident that etcd does not break in a real network partition then I would say to just close this bug; my guess would be that the command the OP was using to simulate a network partition was incorrect and had unexpected additional side effects.

Comment 9 Sam Batschelet 2020-05-20 14:06:15 UTC

closing per https://bugzilla.redhat.com/show_bug.cgi?id=1819907#c8

We are working on a periodic network partition test for the cluster, but etcd is partition tolerant, and that behavior is tested extensively upstream by etcd CI.