Bug 1872470

Summary: [sig-storage] PersistentVolumes GCEPD should test that deleting the PV before the pod does not cause pod deletion to fail on PD detach - transport is closing
Product: OpenShift Container Platform Reporter: David Eads <deads>
Component: Networking    Assignee: Tim Rozet <trozet>
Networking sub component: ovn-kubernetes QA Contact: Anurag saxena <anusaxen>
Status: CLOSED ERRATA Docs Contact:
Severity: urgent    
Priority: urgent CC: bbennett, cdaley, dceara, nusiddiq, prubenda, trozet, vrutkovs
Version: 4.6   
Target Milestone: ---   
Target Release: 4.6.0   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard: OKDBlocker
Fixed In Version: Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
: 1886675 (view as bug list) Environment:
[sig-storage] PersistentVolumes GCEPD should test that deleting the PV before the pod does not cause pod deletion to fail on PD detach
Last Closed: 2020-10-27 16:33:27 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On: 1876174, 1877128    
Bug Blocks: 1886675    

Description David Eads 2020-08-25 20:42:15 UTC
test:
[sig-storage] PersistentVolumes GCEPD should test that deleting the PV before the pod does not cause pod deletion to fail on PD detach 

is failing frequently in CI, see search results:
https://search.ci.openshift.org/?maxAge=168h&context=1&type=bug%2Bjunit&name=&maxMatches=5&maxBytes=20971520&groupBy=job&search=%5C%5Bsig-storage%5C%5D+PersistentVolumes+GCEPD+should+test+that+deleting+the+PV+before+the+pod+does+not+cause+pod+deletion+to+fail+on+PD+detach


This only appears to be failing on ovn, but it's in the top 10 test failures without a bug.


https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-gcp-ovn-4.6/1298201054550167552 is an example job

Comment 1 Tim Rozet 2020-08-30 20:26:59 UTC
The cause of this failure, among a lot of other failures in CI (especially around volume tests), is that the client ends up getting "transport is closing" during grpc calls, usually to the kapi server. I can also see that other operators and the kapi server itself are getting "transport is closing" errors:

I0828 10:10:28.215401       1 status_controller.go:172] clusteroperator/kube-apiserver diff {"status":{"conditions":[{"lastTransitionTime":"2020-08-28T10:03:04Z","message":"NodeInstallerDegraded: 1 nodes are failing on revision 3:\nNodeInstallerDegraded: static pod of revision 3 has been installed, but is not ready while new revision 4 is pending\nKubeAPIServerStaticResourcesDegraded: \"v4.1.0/kube-apiserver/localhost-recovery-client-crb.yaml\" (string): rpc error: code = Unavailable desc = transport is closing\n

from:
https://storage.googleapis.com/origin-ci-test/pr-logs/pull/openshift_ovn-kubernetes/243/pull-ci-openshift-ovn-kubernetes-master-e2e-ovn-hybrid-step-registry/1299279259998621696/artifacts/e2e-ovn-hybrid-step-registry/gather-extra/pods/openshift-kube-apiserver-operator_kube-apiserver-operator-58478f4d4b-jn98x_kube-apiserver-operator.log
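
A quick way to gauge how widespread the error is in the gathered artifacts is to grep the collected pod logs directly (the path below just mirrors the gather-extra layout of the job above; adjust as needed):

  # count "transport is closing" hits per collected pod log
  grep -rc 'transport is closing' artifacts/e2e-ovn-hybrid-step-registry/gather-extra/pods/ | grep -v ':0$' | sort -t: -k2 -rn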

"transport is closing" means that the underlying TCP connection is getting shut down underneath the grpc connection. This can be pretty difficult to root cause, because we need to see packet captures from when this is happening to understand why the TCP connection is closing, or we can try enabling more logging in the kapi server that might indicate why the transport is closing:

https://github.com/grpc/grpc-go/blob/master/README.md#the-rpc-failed-with-error-code--unavailable-desc--transport-is-closing
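
As described in that README, grpc-go's verbose transport logging can be turned on with environment variables on the affected client process (injecting them into the operator/apiserver container is the harder part; the values shown are the ones the README suggests):

  # turn on verbose grpc-go logging in the client before reproducing
  export GRPC_GO_LOG_SEVERITY_LEVEL=info
  export GRPC_GO_LOG_VERBOSITY_LEVEL=99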

In addition to this, we can see that at some point during our OVN jobs there is a failure to secure leadership for ovnkube-master and it ends up restarting. After looking through more logs, I see other services, including the kapi cluster operator, having the same issue.

On top of all of this, I can also see that some calls are taking a very long time. For example, ovnkube-master is sometimes taking 4 seconds in addLogicalPort, an operation that usually takes a couple hundred ms. I can also see the OVN DBs growing in memory at a fast rate, and ovn-northd poll intervals taking over a second, with warnings coming out. Additionally, OVS is complaining that its packet-in queue has overflowed. tl;dr it would be good to rule out high CPU usage on the masters as a cause of the TCP connections getting dropped.
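
A few checks on a master that could help rule CPU/queue pressure in or out (run inside the ovnkube-master / ovs pods; the socket and log paths are the usual defaults and may differ in this deployment):

  # overall CPU pressure on the master node
  top -b -n 1 | head -n 20
  # memory reported by the OVN NB/SB database servers
  ovs-appctl -t /var/run/ovn/ovnnb_db.ctl memory/show
  ovs-appctl -t /var/run/ovn/ovnsb_db.ctl memory/show
  # long poll intervals logged by ovn-northd
  grep 'Unreasonably long' /var/log/ovn/ovn-northd.log
  # OVS upcall/handler state, where packet-in pressure shows up
  ovs-appctl upcall/show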

There is one obvious error we can start with, which involves connections to the KAPI service. In OVS I can see tons of:

2020-08-28T11:03:35.022Z|00014|dpif(handler10)|WARN|system@ovs-system: execute ct(commit,zone=39,nat(dst=10.0.147.245:6443)),recirc(0x55e) failed (Invalid argument) on packet tcp,vlan_tci=0x0000,dl_src=0a:58:0a:82:00:31,dl_dst=0a:58:0a:82:00:01,nw_src=10.130.0.49,nw_dst=172.30.0.1,nw_tos=0,nw_ecn=0,nw_ttl=64,tp_src=48202,tp_dst=443,tcp_flags=syn tcp_csum:5aac
 with metadata skb_priority(0),dp_hash(0x4b4f0c29),skb_mark(0),ct_state(0x21),ct_zone(0x27),ct_tuple4(src=10.130.0.49,dst=172.30.0.1,proto=6,tp_src=48202,tp_dst=443),in_port(17) mtu 0


2020-08-28T11:06:05.024Z|00015|dpif(handler10)|WARN|system@ovs-system: execute ct(commit,zone=39,nat(dst=10.0.147.245:6443)),recirc(0x55e) failed (Invalid argument) on packet tcp,vlan_tci=0x0000,dl_src=0a:58:0a:82:00:31,dl_dst=0a:58:0a:82:00:01,nw_src=10.130.0.49,nw_dst=172.30.0.1,nw_tos=0,nw_ecn=0,nw_ttl=64,tp_src=56166,tp_dst=443,tcp_flags=syn tcp_csum:ba52
 with metadata skb_priority(0),dp_hash(0xccb750ac),skb_mark(0),ct_state(0x21),ct_zone(0x27),ct_tuple4(src=10.130.0.49,dst=172.30.0.1,proto=6,tp_src=56166,tp_dst=443),in_port(17) mtu 0

10.130.0.49 is the openshift-apiserver trying to contact the kapi service. We should start here, I think, and then work through other possible causes afterwards.
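
To dig into those failed ct(commit ... nat(...)) executions, one option is to look at the conntrack table and datapath flows for the service VIP on the affected node (addresses below are the ones from the log excerpt above; run from the host or the ovs pod):

  # conntrack entries for the kubernetes service VIP and the backend it is DNAT'ed to
  conntrack -L -d 172.30.0.1 | head
  conntrack -L -d 10.0.147.245 | head
  # datapath flows handling traffic to the VIP, including their ct()/nat() actions
  ovs-appctl dpctl/dump-flows | grep 172.30.0.1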

Comment 3 Tim Rozet 2020-09-06 04:07:04 UTC
I was able to reproduce the failure by deploying on AWS and then running 2 infinite loops (sketched below) doing:
1. kubectl get pod -A
2. ./openshift-tests run-test "[sig-storage] In-tree Volumes [Driver: local][LocalVolumeType: dir-link] [Testpattern: Pre-provisioned PV (filesystem volmode)] volumeMode should not mount / map unused volumes in a pod [Suite:openshift/conformance/parallel] [Suite:k8s]"

The test seems to create a privileged pod, then kubectl exec into a pod and run some commands to check volumes. Once the issue occurs in the test, I momentarily lose network connectivity to the cluster and get "transport is closing" from both the kube client and the output of the test. It looks like the traffic ends up causing a temporary network disruption. Afterwards I can connect back to the api server and run the test again. Although we also see "transport is closing" in kapi and etcd logs with both openshift-sdn and ovn-k8s, I'm not convinced that is related to these disconnects on the kube client side.
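
For reference, the two loops were essentially (openshift-tests built from origin, KUBECONFIG pointing at the cluster):

  # loop 1: constant kube API traffic from outside the cluster
  while true; do kubectl get pod -A; done
  # loop 2: repeatedly run the volume test that execs into a privileged pod
  while true; do
    ./openshift-tests run-test "[sig-storage] In-tree Volumes [Driver: local][LocalVolumeType: dir-link] [Testpattern: Pre-provisioned PV (filesystem volmode)] volumeMode should not mount / map unused volumes in a pod [Suite:openshift/conformance/parallel] [Suite:k8s]"
  done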

The issue does not seem to happen in local gateway mode on e2e-aws-ovn or e2e-gcp-ovn jobs:
https://github.com/openshift/cluster-network-operator/pull/775

Therefore the only thing it could be is some issue with shared gateway mode. We do not see this issue upstream with shared gateway mode, but downstream is a heavier deployment with more traffic. In shared gateway mode, we share the same IP and MAC between the host and OVN:

          host (10.0.0.1)
           |
eth0-----br-ex----OVN GR (10.0.0.1)

When IP traffic egresses the node from OVN today, we conntrack it so that when reply traffic comes back into eth0 we can send it only to the OVN GR. However, egress traffic from the host is flooded out both eth0 and the OVN GR. Additionally, ingress traffic coming into eth0, or reply traffic to host-originated egress coming back in eth0, is flooded to both the OVN GR and the host. This causes a lot of unwanted packets to enter OVN, which we assumed would just get dropped. However, I can see that OVN appears to be accidentally sending every IP packet destined to the host to ovn-controller, causing packet_in queue overflows in OVS and potentially inhibiting ovn-controller's ability to actually handle ARP/ICMP/other control traffic. I filed an OVN bug on this here: https://bugzilla.redhat.com/show_bug.cgi?id=1876174

ovn-controller also has the ability to send a TCP RST based on reject ACLs we apply in OVN. It is possible that a TCP packet accidentally reaching ovn-controller could trigger a reset. That, or other network disruption from this, might be causing the connection to drop.
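
For context, this is roughly what a reject ACL looks like when added by hand (the switch name and match are hypothetical and shown only to illustrate the mechanism; when such an ACL matches, it is ovn-controller that generates the RST):

  # hypothetical example of a reject ACL on a logical switch
  ovn-nbctl acl-add <logical-switch> to-lport 1000 'tcp && tcp.dst == 8080' reject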

From the ovn-kubernetes side, we can be a little more intelligent about how we handle our traffic in br-ex. I added a proposal to also conntrack egress host traffic, and to force all egress traffic from the host and all egress traffic from the OVN GR out eth0:
https://github.com/ovn-org/ovn-kubernetes/pull/1666
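
A rough sketch of the kind of br-ex OpenFlow rules this implies (simplified, with a hypothetical conntrack zone, and not the exact flows from the PR; eth0 stands for whatever physical NIC is attached to br-ex):

  # commit host egress traffic (LOCAL port of br-ex) to conntrack and send it only out eth0
  ovs-ofctl add-flow br-ex "table=0, priority=100, in_port=LOCAL, ip, actions=ct(commit,zone=64000),output:eth0"
  # run ingress traffic from eth0 through the same zone and recirculate to table 1
  ovs-ofctl add-flow br-ex "table=0, priority=100, in_port=eth0, ip, actions=ct(zone=64000,table=1)"
  # reply traffic for a host-originated connection goes only to the host, not flooded into OVN
  ovs-ofctl add-flow br-ex "table=1, priority=100, ip, ct_state=+trk+est+rpl, actions=LOCAL"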

This way we can ensure egress traffic is isolated between the host and OVN. It also means that reply traffic for host-originated egress entering eth0 will only go to the host and not be flooded into OVN. This helps cut down on some of the traffic going into OVN. However, unknown ingress traffic is still flooded to both, as we do not know who the recipient should be. I tested my change downstream and there is some improvement: the test cases which always failed twice in CI due to this bug are now passing on a second attempt in AWS:
https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/openshift_ovn-kubernetes/255/pull-ci-openshift-ovn-kubernetes-master-e2e-aws-ovn/1302081685277380608
https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/openshift_ovn-kubernetes/255/pull-ci-openshift-ovn-kubernetes-master-e2e-gcp-ovn/1302305957480501248

This leads me to believe that this does have something to do with the amount of traffic entering OVN and getting packet_in'ed.

Comment 4 Aniket Bhat 2020-09-08 16:38:03 UTC
*** Bug 1862271 has been marked as a duplicate of this bug. ***

Comment 5 Ben Bennett 2020-09-09 13:24:33 UTC
*** Bug 1852977 has been marked as a duplicate of this bug. ***

Comment 6 Tim Rozet 2020-09-12 00:41:49 UTC
We do not know the exact cause of this "transport is closing" issue, but we do know it has something to do with traffic handling in shared gateway mode.

A fix has been proposed upstream that resolves the issue by doing local-gateway-mode traffic processing in a shared gateway environment:
https://github.com/ovn-org/ovn-kubernetes/pull/1692

Testing this downstream shows the error no longer happens and CI passes:
https://github.com/openshift/ovn-kubernetes/pull/266

Comment 9 Anurag saxena 2020-10-09 08:10:09 UTC
Downstream testing shows this error no longer happens and CI appears to be passing. Verifying based on the same observation.

Comment 11 errata-xmlrpc 2020-10-27 16:33:27 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (OpenShift Container Platform 4.6 GA Images), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:4196