1940233 – Upgrade failing from 4.5.24 to 4.6.20 (OVN w/IPv6 single-stack configured, dual-stack on the OCP hosts w/ multiple bonds)

Bug 1940233 - Upgrade failing from 4.5.24 to 4.6.20 (OVN w/IPv6 single-stack configured, dual-stack on the OCP hosts w/ multiple bonds)

Keywords:
Status:	CLOSED NOTABUG
Alias:	None
Product:	OpenShift Container Platform
Classification:	Red Hat
Component:	Networking
Sub Component:
Version:	4.5
Hardware:	x86_64
OS:	Linux
Priority:	urgent
Severity:	urgent
Target Milestone:	---
Target Release:	4.8.0
Assignee:	Mohamed Mahmoud
QA Contact:	Anurag saxena
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:

Reported:	2021-03-17 21:09 UTC by Daniel Del Ciancio
Modified:	2024-06-14 00:52 UTC
CC List:	8 users
Fixed In Version:
Doc Type:	If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed:	2021-04-07 12:30:39 UTC
Target Upstream Version:
Embargoed:

Comment 4 Ricardo Carrillo Cruz 2021-03-22 10:19:49 UTC

Upstream PR:

https://github.com/ovn-org/ovn-kubernetes/pull/2123

Comment 5 Daniel Del Ciancio 2021-03-22 13:41:07 UTC

My customer reached out to me for an ETA on the fix.  Any ideas?

Thanks!

Comment 6 Daniel Del Ciancio 2021-03-24 16:07:29 UTC

(In reply to Daniel Del Ciancio from comment #5)
> My customer reached out to me for an ETA on the fix.  Any ideas?
> 
> Thanks!

Hi all,
The customer reached out to me again on this.  Any updates you can share on an ETA?

Thanks!

Comment 8 Daniel Del Ciancio 2021-03-25 12:03:39 UTC

This is a blocker issue preventing them from upgrading to 4.6, and they need to be on 4.6 in order to stabilize OVN and some IPv6 issues they are facing.

So I've increased the severity and priority of this bug to align with the customer's expectations.

Can you provide an approx ETA as to when we could expect the fix?  They have been repeatedly asking me for an update.
I'd appreciate your help in getting this prioritized.  

Thanks!

Comment 9 Mohamed Mahmoud 2021-03-25 12:40:23 UTC

upstream PR https://github.com/ovn-org/ovn-kubernetes/pull/2134 waiting on reviews

Comment 10 Mohamed Mahmoud 2021-03-26 16:52:01 UTC

(In reply to Daniel Del Ciancio from comment #8)
> This is a blocker issue preventing them from upgrading to 4.6, and they need
> to be on 4.6 in order to stabilize OVN and some IPv6 issues they are facing.
> 
> So I've increased the severity and priority of this bug to align with the
> customer's expectations.
> 
> Can you provide an approx ETA as to when we could expect the fix?  They have
> been repeatedly asking me for an update.
> I'd appreciate your help in getting this prioritized.  
> 
> Thanks!

Daniel, I would like to explain how IPv6 pod IP assignment works so we know which IPs to expect.

Valid IPv6 IPs range from base to base + 65536, where base is the IPv6 CIDR + 1.
This is a known limitation of the current pod IP allocation algorithm.

As long as the IP address fits in the above range there should be no issues, and the PR ensures this range limitation is enforced. An IP address outside this range won't work.
So if the CIDR is 2605:b100:283:1::/64, for example,
valid IPv6 IPs will be from 2605:b100:283:1::1 to 2605:b100:283:1::ffff.
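
The range described above can be checked with a short script. This is only a rough sketch, not the ovn-kubernetes allocator itself: the function names and the ALLOC_LAST_OFFSET constant are hypothetical, and the window endpoints are taken from the worked example in this comment (network + 1 through network + 0xffff). Note that ::ffff is base + 65534, slightly tighter than the "base + 65536" wording, so the sketch follows the example endpoints.

```python
import ipaddress

# Assumption: last usable offset from the network address, taken from the
# example range in this comment (2605:b100:283:1::1 .. 2605:b100:283:1::ffff).
ALLOC_LAST_OFFSET = 0xFFFF


def allocatable_range(cidr):
    """Return (first, last) address of the assumed allocation window."""
    net = ipaddress.IPv6Network(cidr)
    first = net.network_address + 1
    # Cap the window at the end of the subnet for prefixes longer than /112.
    last = min(net.network_address + ALLOC_LAST_OFFSET, net[-1])
    return first, last


def is_allocatable(cidr, addr):
    """True if addr falls inside the assumed allocation window for cidr."""
    first, last = allocatable_range(cidr)
    return first <= ipaddress.IPv6Address(addr) <= last


if __name__ == "__main__":
    cidr = "2605:b100:283:1::/64"
    for addr in ("2605:b100:283:1::1", "2605:b100:283:1::ffff", "2605:b100:283:1::1:0"):
        print(addr, is_allocatable(cidr, addr))
```

With the example CIDR, 2605:b100:283:1::1 and 2605:b100:283:1::ffff are reported as inside the window, while 2605:b100:283:1::1:0 is reported as outside it.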

Comment 17 Daniel Del Ciancio 2021-03-31 15:53:06 UTC

UPDATE :

The customer was able to update a dev cluster from 4.5.24 to 4.6.21. When it was failing on the rollout of the network operator update phase, they saw the following issue in the newest ovnkube-node pod, that was in a CrashLoopBackOff state:


F0330 17:30:51.212374 4188378 ovnkube.go:130] failed to add neighbour entry 2605:b100:283:2::1 0a:58:4b:3a:df:53: file exists

First I restarted all the pods in the openshift-ovn-kubernetes namespace. All the ovnkube-node pods then started having the same issue, so they force-rebooted the nodes one by one, which made it go away.


After that, the update reached 100% but showed as failing for the monitoring and image-registry operators. The image-registry failure is due to our egress pod not being able to come up, with the following macvlan or Multus issue.


Pod envoyv4v6-9558b6d4f-j2zcr
Namespace bell-services
4 minutes ago
Generated from kubelet on ocp87-worker-node-2
148 times in the last 42 minutes
(combined from similar events): Failed to create pod sandbox: rpc error: code = Unknown desc = failed to create pod network sandbox k8s_envoyv4v6-9558b6d4f-j2zcr_bell-services_eb356843-f32c-4fe8-a9da-ae5534edd06e_0(5676e80d23c639dcfbbc8c77a9ef8d5fab7e706d7a8eefc1490bd46a35a3eff7): [bell-services/envoyv4v6-9558b6d4f-j2zcr:envoyv4]: error adding container to network "envoyv4": failed to create macvlan: device or resource busy


Any ideas?

Comment 18 Mohamed Mahmoud 2021-03-31 16:00:56 UTC

Let us wait for the fix to the bootstrap issue and then try a fresh cluster with 4.6.
