Bug 1893067 - pods keep restarting when network-related errors occur
Summary: pods keep restarting when network-related errors occur
Keywords:
Status: CLOSED INSUFFICIENT_DATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Networking
Version: 4.4
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: medium
Target Milestone: ---
Target Release: 4.8.0
Assignee: Alexander Constantinescu
QA Contact: zhaozhanqi
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2020-10-30 02:15 UTC by yhe
Modified: 2023-12-15 19:58 UTC
CC List: 5 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2021-03-11 13:51:39 UTC
Target Upstream Version:
Embargoed:


Attachments:

Comment 4 Ben Bennett 2020-11-05 14:07:24 UTC
Assigning to the development branch to investigate.  We will consider the backport when the issue is understood.

Comment 24 Alexander Constantinescu 2021-03-05 09:19:48 UTC
Hi

I am going to close this bug as INSUFFICIENT_DATA. The reason is that we don't know exactly what triggered the bug. Our internal analysis of the problem resulted in the following:

We think we've understood the issue here. It seems that the reason nothing worked on node oapprown02.oap-011.oappro.jp was that there were no pod flows configured in OVS. This is taken from the sosreport provided here: https://access.redhat.com/support/cases/#/case/02787947/discussion?attachmentId=a092K00002CpDibQAF

cat sosreport-oapprown02-2020-10-28-qjgjxtq/sos_commands/openvswitch/ovs-ofctl_-O_OpenFlow13_dump-flows_br0
OFPST_FLOW reply (OF1.3) (xid=0x2):
 cookie=0x4d6ac4d9, duration=11830.943s, table=10, n_packets=0, n_bytes=0, priority=100,tun_src=10.3.209.24 actions=goto_table:30
 cookie=0x21876511, duration=11830.905s, table=10, n_packets=0, n_bytes=0, priority=100,tun_src=10.3.209.25 actions=goto_table:30
 cookie=0x3e01f207, duration=11830.867s, table=10, n_packets=0, n_bytes=0, priority=100,tun_src=10.3.209.26 actions=goto_table:30
 cookie=0xffafcc08, duration=11830.829s, table=10, n_packets=0, n_bytes=0, priority=100,tun_src=10.3.209.34 actions=goto_table:30
 cookie=0x51badde4, duration=11830.791s, table=10, n_packets=0, n_bytes=0, priority=100,tun_src=10.3.209.36 actions=goto_table:30
 cookie=0x4d6ac4d9, duration=11830.943s, table=50, n_packets=0, n_bytes=0, priority=100,arp,arp_tpa=10.128.0.0/23 actions=move:NXM_NX_REG0[]->NXM_NX_TUN_ID[0..31],set_field:10.3.209.24->tun_dst,output:1
 cookie=0x21876511, duration=11830.905s, table=50, n_packets=0, n_bytes=0, priority=100,arp,arp_tpa=10.129.0.0/23 actions=move:NXM_NX_REG0[]->NXM_NX_TUN_ID[0..31],set_field:10.3.209.25->tun_dst,output:1
 cookie=0x3e01f207, duration=11830.867s, table=50, n_packets=0, n_bytes=0, priority=100,arp,arp_tpa=10.130.0.0/23 actions=move:NXM_NX_REG0[]->NXM_NX_TUN_ID[0..31],set_field:10.3.209.26->tun_dst,output:1
 cookie=0xffafcc08, duration=11830.829s, table=50, n_packets=0, n_bytes=0, priority=100,arp,arp_tpa=10.128.2.0/23 actions=move:NXM_NX_REG0[]->NXM_NX_TUN_ID[0..31],set_field:10.3.209.34->tun_dst,output:1
 cookie=0x51badde4, duration=11830.791s, table=50, n_packets=0, n_bytes=0, priority=100,arp,arp_tpa=10.129.2.0/23 actions=move:NXM_NX_REG0[]->NXM_NX_TUN_ID[0..31],set_field:10.3.209.36->tun_dst,output:1
 cookie=0x4d6ac4d9, duration=11830.943s, table=90, n_packets=0, n_bytes=0, priority=100,ip,nw_dst=10.128.0.0/23 actions=move:NXM_NX_REG0[]->NXM_NX_TUN_ID[0..31],set_field:10.3.209.24->tun_dst,output:1
 cookie=0x21876511, duration=11830.905s, table=90, n_packets=0, n_bytes=0, priority=100,ip,nw_dst=10.129.0.0/23 actions=move:NXM_NX_REG0[]->NXM_NX_TUN_ID[0..31],set_field:10.3.209.25->tun_dst,output:1
 cookie=0x3e01f207, duration=11830.867s, table=90, n_packets=0, n_bytes=0, priority=100,ip,nw_dst=10.130.0.0/23 actions=move:NXM_NX_REG0[]->NXM_NX_TUN_ID[0..31],set_field:10.3.209.26->tun_dst,output:1
 cookie=0xffafcc08, duration=11830.829s, table=90, n_packets=0, n_bytes=0, priority=100,ip,nw_dst=10.128.2.0/23 actions=move:NXM_NX_REG0[]->NXM_NX_TUN_ID[0..31],set_field:10.3.209.34->tun_dst,output:1
 cookie=0x51badde4, duration=11830.791s, table=90, n_packets=0, n_bytes=0, priority=100,ip,nw_dst=10.129.2.0/23 actions=move:NXM_NX_REG0[]->NXM_NX_TUN_ID[0..31],set_field:10.3.209.36->tun_dst,output:1
 cookie=0x0, duration=11830.772s, table=111, n_packets=0, n_bytes=0, priority=100 actions=move:NXM_NX_REG0[]->NXM_NX_TUN_ID[0..31],set_field:10.3.209.24->tun_dst,output:1,set_field:10.3.209.25->tun_dst,output:1,set_field:10.3.209.26->tun_dst,output:1,set_field:10.3.209.34->tun_dst,output:1,set_field:10.3.209.36->tun_dst,output:1,goto_table:120
 cookie=0x0, duration=11831.211s, table=253, n_packets=0, n_bytes=0, actions=note:02.07.00.00.00.00
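
For reference, the same dump can also be collected from a live node by exec-ing into the SDN pod scheduled on that node (a sketch only: the pod name below is a placeholder, and this assumes the usual openshift-sdn namespace layout of OCP 4.x):

 oc -n openshift-sdn exec <sdn-pod-on-the-node> -- ovs-ofctl -O OpenFlow13 dump-flows br0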

That shows that each node's subnet flows are configured as they should be on oapprown02.oap-011.oappro.jp; however, all pod flows for the existing pods running on this node are missing.
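
For comparison (illustrative only: the IP address, in_port and actions below are hypothetical and not taken from this cluster), a node whose pod flows are intact would additionally carry per-pod entries in the pod-handling tables, roughly of the form:

 table=20, priority=100,arp,in_port=3,arp_spa=10.129.2.15 actions=...      <- traffic from a local pod
 table=70, priority=100,ip,nw_dst=10.129.2.15 actions=...,goto_table:80    <- traffic to a local pod

No per-pod entries of that kind appear anywhere in the dump above.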

In openshift-sdn, however, we do not check whether the pod flows exist when determining if we need to perform a new node setup. In this case, for example, we determined that we did not need to, because according to the conditions we check the node was already set up. We therefore did not proceed to set up the flows for the existing pods on that node.
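
To illustrate the gap, here is a minimal sketch of the kind of check described above (an assumption keyed off the table 253 marker flow seen in the dump; it is not the actual openshift-sdn source):

package main

import (
	"fmt"
	"strings"
)

// nodeAlreadySetUp mirrors the kind of check described above: it only looks
// for the base marker flow (the "note" flow in table 253), never for the
// per-pod flows. Illustrative sketch only.
func nodeAlreadySetUp(dumpedFlows []string) bool {
	for _, flow := range dumpedFlows {
		if strings.Contains(flow, "table=253") && strings.Contains(flow, "note:") {
			return true // base marker found => node considered already set up
		}
	}
	return false
}

func main() {
	// Flows as on oapprown02: subnet and marker flows present, pod flows absent.
	flows := []string{
		"table=10, priority=100,tun_src=10.3.209.24 actions=goto_table:30",
		"table=253, actions=note:02.07.00.00.00.00",
	}
	// Prints "true": the node passes the check even though no pod flows exist,
	// so the per-pod flows are never re-created.
	fmt.Println(nodeAlreadySetUp(flows))
}

Since the "already set up" answer never looks at per-pod state, a node that lost only its pod flows stays in exactly the broken state seen here.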

I have no way of determining why the pod flows are missing while everything else exists, because neither the sosreport nor the must-gather contains any SDN / OVS logs from the previous runs. Something might have ended up deleting the pod flows during the upgrade, but I have no conclusive indication of this.

Given that we don't know the root cause, we can't say for sure that the fix linked in this BZ would fix the problem completely. In fact, it most likely would not; it would instead mask the real problem rather than fix it (next time it occurs we may have more logs to investigate further).

I am thus closing the bug, and the PR as well, since it will not be merged.

Thanks in advance,
Alexander

