Since this can bring down user apps, I consider this a blocker for 4.2 to 4.3 upgrades going out to customers.
Created attachment 1652538 [details]
IP address allocation information from a worker node
@huirwang I saw you verified this bug in 4.3.0-0.nightly-2020-01-14-014012, but https://bugzilla.redhat.com/show_bug.cgi?id=1787635#c5 is using 4.3.0-0.nightly-2020-01-13-223759. Please check whether that build contains the fix, and if so, try to reproduce the issue again and paste the logs from all workers. Thanks.
To get better SDN debugging with current 4.3, you'd need to bump the log level up to 5. If https://github.com/openshift/sdn/pull/78 merges, then you should get more useful pod create/delete logging at the default log level.
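A minimal sketch of one way to bump the level, assuming the sdn DaemonSet surfaces its log level through an environment variable; the variable name OPENSHIFT_SDN_LOG_LEVEL below is an assumption, so inspect the DaemonSet spec first, and note that the cluster-network-operator may revert manual edits:

```
# Assumption: confirm how the sdn DaemonSet wires its log level
oc -n openshift-sdn get ds sdn -o yaml | grep -i loglevel

# If it reads an env var (OPENSHIFT_SDN_LOG_LEVEL is an assumed name),
# bump it to 5; the network operator may roll this change back.
oc -n openshift-sdn set env ds/sdn OPENSHIFT_SDN_LOG_LEVEL=5
```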
One question that was raised was whether Multus requires a network namespace to exist for a pod in order to delegate the CNI DEL call to another CNI plugin (especially openshift-sdn). My findings show that Multus only requires the netns to be present during a CNI ADD call; in particular, it requires the netns before it attempts to add a network interface to the namespace using `LinkByName()` from the netlink library. Multus does look for the netns during a CNI DEL, but its absence is tolerated: Multus continues execution and still passes the CNI DEL to the delegated plugin (in this case, openshift-sdn).

Additionally, another question has been raised about an error which reads:

```
Jan 16 10:26:08 ip-10-0-146-171 crio[596572]: 2020-01-16T10:26:08Z [error] Multus: error unsetting the networks status: SetNetworkStatus: [...snipped...]
```

In this case, Multus just notes the error in the log and continues (multus.go@HEAD:583):

```
// error happen but continue to delete
logging.Errorf("Multus: error unsetting the networks status: %v", err)
```

This improvement was made upstream in https://github.com/intel/multus-cni/pull/311 -- and this code exists in the Multus CNI versions shipped in OpenShift 4.2 and 4.3.
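For illustration, here is a minimal Go sketch (not Multus source; delegateDel is a hypothetical placeholder) of the tolerant DEL pattern described above, built on the standard CNI skel and ns packages: failing to open the pod's netns is logged rather than treated as fatal, and the delegated plugin still receives the DEL.

```
package example

import (
	"log"

	"github.com/containernetworking/cni/pkg/skel"
	"github.com/containernetworking/plugins/pkg/ns"
)

// delegateDel stands in for handing the CNI DEL to the delegated
// plugin (e.g. openshift-sdn). It is a placeholder, not Multus code.
func delegateDel(args *skel.CmdArgs) error { return nil }

// cmdDel sketches the tolerant DEL behavior: a missing netns is
// logged, not fatal, and the DEL is still passed to the delegate.
func cmdDel(args *skel.CmdArgs) error {
	netns, err := ns.GetNS(args.Netns)
	if err != nil {
		// On DEL the netns may already be gone; note it and move on.
		log.Printf("failed to open netns %q: %v", args.Netns, err)
	} else {
		defer netns.Close()
	}
	return delegateDel(args)
}
```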
Can we retest on a new cluster with kubelet logging set to v4 and crio in info mode? You can hop onto a node using oc debug node/<name> and modify /etc/crio/crio.conf: uncomment the log_level line and set the value to info. You can then systemctl reload crio and it will start logging at info level without restarting. If we get these logs, we can check whether kubelet is calling stop pod sandbox and whether crio is getting it and calling into the networking cleanup code.
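A sketch of those crio steps, assuming a standard RHCOS node; <node-name> is a placeholder:

```
# From a workstation with cluster access
oc debug node/<node-name>
# Inside the debug pod, switch into the host filesystem
chroot /host
# Uncomment log_level in the crio config and set it to info
sed -i 's/^#\? *log_level *=.*/log_level = "info"/' /etc/crio/crio.conf
# Reload (not restart) crio so it picks up the new level
systemctl reload crio
```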
There were two cases where we were leaking, and this bug originally covered both. I split https://bugzilla.redhat.com/show_bug.cgi?id=1792533 off to cover the kubelet-restart leak. The case where we would leak on a reboot remains covered by this bug, and that one seems to have been completely resolved.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2020:0391