Bug 1947712
| Field | Value |
|---|---|
| Summary | [OVN] Many faults and polling interval stuck for 4 seconds at roughly 5-minute intervals |
| Product | OpenShift Container Platform |
| Component | Networking |
| Networking sub component | ovn-kubernetes |
| Status | CLOSED ERRATA |
| Severity | high |
| Priority | medium |
| Version | 4.5 |
| Target Release | 4.8.0 |
| Hardware | Unspecified |
| OS | Unspecified |
| Reporter | Robin Cernin <rcernin> |
| Assignee | Tim Rozet <trozet> |
| QA Contact | Ross Brattain <rbrattai> |
| CC | aconstan, akaris, dblack, fpaoline, gdiotte, memodi, mmethot, pmannidi, rbrattai, trozet |
| Type | Bug |
| Last Closed | 2021-07-27 22:58:16 UTC |
| Bug Blocks | 1950432 |
Description (Robin Cernin, 2021-04-09 02:04:06 UTC)
Note it was also seen that: so far on the connectivity side I found a bunch of these (about as early as logging starts on node 06), about every minute:

```
Apr 08 18:20:24 worker-006 hyperkube[4011520]: F0224 13:59:03.392898 8431 ovnkube.go:129] Error setting OVS external ID 'ovn-nb="ssl:172.20.130.202:9641,ssl:172.20.130.203:9641,ssl:172.20.130.204:9641"': exit status 1
Apr 08 18:21:44 worker-006 hyperkube[4011520]: F0224 13:59:03.392898 8431 ovnkube.go:129] Error setting OVS external ID 'ovn-nb="ssl:172.20.130.202:9641,ssl:172.20.130.203:9641,ssl:172.20.130.204:9641"': exit status 1
```

This is odd because this is just ovnkube setting the value in the local OVSDB. The 5-second interval resonates with https://bugzilla.redhat.com/show_bug.cgi?id=1908921, which was fixed in 4.5.27.

The dropped traffic every 5 minutes was due to stale ports left over from a previous instance of the pod on another node. This caused the ovn-controllers on two different nodes to thrash as they both kept binding the port, which drops traffic. The root cause was that kubelet was not sending the events on the stale node, covered by https://bugzilla.redhat.com/show_bug.cgi?id=1948052. However, we can make this more robust on the SDN side by explicitly telling OVN which chassis the current pod should be bound to. This prevents the thrashing in the event that stale ports are left over for whatever reason.

Additionally, after fixing the above issue by restarting kubelet on the affected worker, TCP connections were still being closed every hour. This was determined to be due to Istio's default behavior with idle connections and not an OVN issue.

Verified on 4.8.0-0.nightly-2021-04-20-032026: requested-chassis is set.
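Both the failing external-ID write in the log above and the robustness fix boil down to a couple of OVN/OVS CLI operations. A dry-run sketch that echoes the commands rather than executing them (they need a live deployment; the port and chassis names here are illustrative, taken from the verification output below):

```shell
# Dry-run sketch: print each command instead of executing it, since the
# real commands require a live OVN/OVS deployment.
run() { printf '+ %s\n' "$*"; }

NB="ssl:172.20.130.202:9641,ssl:172.20.130.203:9641,ssl:172.20.130.204:9641"

# What ovnkube.go:129 was doing when it logged "Error setting OVS external
# ID": writing the NB endpoints into the local Open vSwitch database.
run ovs-vsctl set Open_vSwitch . external_ids:ovn-nb="$NB"

# The robustness fix: pin the logical switch port to the chassis the pod is
# actually scheduled on, so ovn-controller on other nodes ignores stale
# copies of the port instead of fighting over the binding.
run ovn-nbctl set Logical_Switch_Port t1_test-rc-jwq7p options:requested-chassis=compute-2

# Check which chassis the port ended up bound to in the southbound DB.
run ovn-sbctl --columns=chassis find Port_Binding logical_port=t1_test-rc-jwq7p
```

Pinning via `options:requested-chassis` is what the verification output below confirms for each pod's logical switch port.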
```
sh-4.4# ovn-nbctl --no-leader-only --format=csv list Logical_Switch_Port | grep test-rc
124b0b93-9a97-4d83-a1ae-92c3b83a7784,"[""0a:58:0a:81:03:39 10.129.3.57""]",[],[],[],[],"{namespace=t1, pod=""true""}",[],t1_test-rc-jwq7p,{requested-chassis=compute-2},[],"[""0a:58:0a:81:03:39 10.129.3.57""]",[],[],"""""",true
854ef321-b993-4d78-b167-5045f6841bdd,"[""0a:58:0a:80:02:24 10.128.2.36""]",[],[],[],[],"{namespace=t1, pod=""true""}",[],t1_test-rc-vxj2k,{requested-chassis=compute-1},[],"[""0a:58:0a:80:02:24 10.128.2.36""]",[],[],"""""",true
0040de26-dac8-4160-8630-56129d15b290,"[""0a:58:0a:80:02:22 10.128.2.34""]",[],[],[],[],"{namespace=t1, pod=""true""}",[],t1_test-rc-gzm87,{requested-chassis=compute-1},[],"[""0a:58:0a:80:02:22 10.128.2.34""]",[],[],"""""",true
14140fc7-8516-4e56-8890-b6708aa3ca9d,"[""0a:58:0a:80:02:21 10.128.2.33""]",[],[],[],[],"{namespace=t1, pod=""true""}",[],t1_test-rc-vtcch,{requested-chassis=compute-1},[],"[""0a:58:0a:80:02:21 10.128.2.33""]",[],[],"""""",true
8023b0cf-3992-49bf-8521-edb6193835d3,"[""0a:58:0a:81:03:3a 10.129.3.58""]",[],[],[],[],"{namespace=t1, pod=""true""}",[],t1_test-rc-rn2qp,{requested-chassis=compute-2},[],"[""0a:58:0a:81:03:3a 10.129.3.58""]",[],[],"""""",true
7e57a8d7-7f75-4486-9c1b-364855aca405,"[""0a:58:0a:83:00:38 10.131.0.56""]",[],[],[],[],"{namespace=t1, pod=""true""}",[],t1_test-rc-9r45q,{requested-chassis=compute-0},[],"[""0a:58:0a:83:00:38 10.131.0.56""]",[],[],"""""",true
9603d659-dcd2-4996-9a6b-a8107e325985,"[""0a:58:0a:83:00:36 10.131.0.54""]",[],[],[],[],"{namespace=t1, pod=""true""}",[],t1_test-rc-jljjx,{requested-chassis=compute-0},[],"[""0a:58:0a:83:00:36 10.131.0.54""]",[],[],"""""",true
ca7e6f5c-44cd-4aeb-a806-e00fbdb22776,"[""0a:58:0a:81:03:38 10.129.3.56""]",[],[],[],[],"{namespace=t1, pod=""true""}",[],t1_test-rc-dqvqx,{requested-chassis=compute-2},[],"[""0a:58:0a:81:03:38 10.129.3.56""]",[],[],"""""",true
a7d0335c-8bc7-4e1c-851a-4bde4292b907,"[""0a:58:0a:80:02:23 10.128.2.35""]",[],[],[],[],"{namespace=t1, pod=""true""}",[],t1_test-rc-fw296,{requested-chassis=compute-1},[],"[""0a:58:0a:80:02:23 10.128.2.35""]",[],[],"""""",true
cbf227f9-2c77-4d75-8abb-241c5e29f530,"[""0a:58:0a:83:00:37 10.131.0.55""]",[],[],[],[],"{namespace=t1, pod=""true""}",[],t1_test-rc-xn5db,{requested-chassis=compute-0},[],"[""0a:58:0a:83:00:37 10.131.0.55""]",[],[],"""""",true
```

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Container Platform 4.8.2 bug fix and security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:2438

The needinfo request[s] on this closed bug have been removed as they have been unresolved for 500 days