Bug 1893067

Summary: pods keep restarting when network-related errors occur
Product: OpenShift Container Platform
Reporter: yhe
Component: Networking
Sub component: openshift-sdn
Assignee: Alexander Constantinescu <aconstan>
QA Contact: zhaozhanqi <zzhao>
Status: CLOSED INSUFFICIENT_DATA
Severity: medium
Priority: high
CC: aconstan, anbhat, bbennett, rh-container, surya
Version: 4.4
Keywords: Reopened
Target Milestone: ---
Target Release: 4.8.0
Hardware: Unspecified
OS: Unspecified
Last Closed: 2021-03-11 13:51:39 UTC
Type: Bug

Comment 4 Ben Bennett 2020-11-05 14:07:24 UTC
Assigning to the development branch to investigate.  We will consider the backport when the issue is understood.

Comment 24 Alexander Constantinescu 2021-03-05 09:19:48 UTC
Hi

I am going to close this bug as INSUFFICIENT_DATA, because we don't know exactly what triggered it. Our internal analysis of the problem resulted in the following:

We think we've understood the issue here. It seems that the reason nothing worked on node oapprown02.oap-011.oappro.jp is that no pod flows were configured in OVS. This is taken from the sosreport provided here: https://access.redhat.com/support/cases/#/case/02787947/discussion?attachmentId=a092K00002CpDibQAF

cat sosreport-oapprown02-2020-10-28-qjgjxtq/sos_commands/openvswitch/ovs-ofctl_-O_OpenFlow13_dump-flows_br0
OFPST_FLOW reply (OF1.3) (xid=0x2):
 cookie=0x4d6ac4d9, duration=11830.943s, table=10, n_packets=0, n_bytes=0, priority=100,tun_src=10.3.209.24 actions=goto_table:30
 cookie=0x21876511, duration=11830.905s, table=10, n_packets=0, n_bytes=0, priority=100,tun_src=10.3.209.25 actions=goto_table:30
 cookie=0x3e01f207, duration=11830.867s, table=10, n_packets=0, n_bytes=0, priority=100,tun_src=10.3.209.26 actions=goto_table:30
 cookie=0xffafcc08, duration=11830.829s, table=10, n_packets=0, n_bytes=0, priority=100,tun_src=10.3.209.34 actions=goto_table:30
 cookie=0x51badde4, duration=11830.791s, table=10, n_packets=0, n_bytes=0, priority=100,tun_src=10.3.209.36 actions=goto_table:30
 cookie=0x4d6ac4d9, duration=11830.943s, table=50, n_packets=0, n_bytes=0, priority=100,arp,arp_tpa=10.128.0.0/23 actions=move:NXM_NX_REG0[]->NXM_NX_TUN_ID[0..31],set_field:10.3.209.24->tun_dst,output:1
 cookie=0x21876511, duration=11830.905s, table=50, n_packets=0, n_bytes=0, priority=100,arp,arp_tpa=10.129.0.0/23 actions=move:NXM_NX_REG0[]->NXM_NX_TUN_ID[0..31],set_field:10.3.209.25->tun_dst,output:1
 cookie=0x3e01f207, duration=11830.867s, table=50, n_packets=0, n_bytes=0, priority=100,arp,arp_tpa=10.130.0.0/23 actions=move:NXM_NX_REG0[]->NXM_NX_TUN_ID[0..31],set_field:10.3.209.26->tun_dst,output:1
 cookie=0xffafcc08, duration=11830.829s, table=50, n_packets=0, n_bytes=0, priority=100,arp,arp_tpa=10.128.2.0/23 actions=move:NXM_NX_REG0[]->NXM_NX_TUN_ID[0..31],set_field:10.3.209.34->tun_dst,output:1
 cookie=0x51badde4, duration=11830.791s, table=50, n_packets=0, n_bytes=0, priority=100,arp,arp_tpa=10.129.2.0/23 actions=move:NXM_NX_REG0[]->NXM_NX_TUN_ID[0..31],set_field:10.3.209.36->tun_dst,output:1
 cookie=0x4d6ac4d9, duration=11830.943s, table=90, n_packets=0, n_bytes=0, priority=100,ip,nw_dst=10.128.0.0/23 actions=move:NXM_NX_REG0[]->NXM_NX_TUN_ID[0..31],set_field:10.3.209.24->tun_dst,output:1
 cookie=0x21876511, duration=11830.905s, table=90, n_packets=0, n_bytes=0, priority=100,ip,nw_dst=10.129.0.0/23 actions=move:NXM_NX_REG0[]->NXM_NX_TUN_ID[0..31],set_field:10.3.209.25->tun_dst,output:1
 cookie=0x3e01f207, duration=11830.867s, table=90, n_packets=0, n_bytes=0, priority=100,ip,nw_dst=10.130.0.0/23 actions=move:NXM_NX_REG0[]->NXM_NX_TUN_ID[0..31],set_field:10.3.209.26->tun_dst,output:1
 cookie=0xffafcc08, duration=11830.829s, table=90, n_packets=0, n_bytes=0, priority=100,ip,nw_dst=10.128.2.0/23 actions=move:NXM_NX_REG0[]->NXM_NX_TUN_ID[0..31],set_field:10.3.209.34->tun_dst,output:1
 cookie=0x51badde4, duration=11830.791s, table=90, n_packets=0, n_bytes=0, priority=100,ip,nw_dst=10.129.2.0/23 actions=move:NXM_NX_REG0[]->NXM_NX_TUN_ID[0..31],set_field:10.3.209.36->tun_dst,output:1
 cookie=0x0, duration=11830.772s, table=111, n_packets=0, n_bytes=0, priority=100 actions=move:NXM_NX_REG0[]->NXM_NX_TUN_ID[0..31],set_field:10.3.209.24->tun_dst,output:1,set_field:10.3.209.25->tun_dst,output:1,set_field:10.3.209.26->tun_dst,output:1,set_field:10.3.209.34->tun_dst,output:1,set_field:10.3.209.36->tun_dst,output:1,goto_table:120
 cookie=0x0, duration=11831.211s, table=253, n_packets=0, n_bytes=0, actions=note:02.07.00.00.00.00

That shows that the per-node subnet flows are configured as they should be on oapprown02.oap-011.oappro.jp; however, all flows for the pods running on this node are missing.
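
For reference, this gap can be checked directly on the node (a sketch, under the assumption that openshift-sdn programs its per-pod flows in OVS tables 20/25 for traffic coming from local pods and table 70 for traffic delivered to local pod IPs; exact table numbers may differ between releases):

ovs-ofctl -O OpenFlow13 dump-flows br0 table=20   # flows admitting traffic from local pod veth ports
ovs-ofctl -O OpenFlow13 dump-flows br0 table=70   # flows delivering traffic to local pod IPs

On a healthy node both commands return at least one entry per running pod; in the dump above neither table has any entries at all.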

However, openshift-sdn does not check whether pod flows exist when determining if it needs to perform a fresh setup. In this case it determined that no setup was needed, because according to the conditions we check the bridge was already set up, and it therefore never re-created the flows for the pods already running on that node.
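
To illustrate the kind of condition that passes here (my assumption about which markers the check relies on, not a quote from the code: essentially bridge-level setup flows such as the version note flow in table 253, rather than any per-pod state):

ovs-ofctl -O OpenFlow13 dump-flows br0 table=253   # version note flow used as the "already set up" marker

The dump above ends with exactly such a flow (actions=note:02.07.00.00.00.00), so the check passes even though the per-pod tables are empty.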

I have no way of determining why the pod flows are missing while everything else is present, because neither the sosreport nor the must-gather contains any further SDN / OVS logs from the previous runs. Something might have deleted the pod flows during the upgrade, but I have no conclusive indication of this.

Given that we don't know the root cause, we can't say for sure that the fix linked in this BZ will solve the problem completely. Most likely it would not; instead it would mask the real problem rather than letting it surface again (next time perhaps with more logs to investigate further).

I am thus closing the bug, as well as the PR, since it will not be merged.

Thanks in advance,
Alexander