I was able to confirm the following once the system got into the "bad" state:

- Not able to ping existing VMs from within their ovn-metadata namespace (destination host unreachable)
- Was able to ping existing VMs using floating IPs
- Not able to ping newly created VMs from within their ovn-metadata namespace (destination host unreachable)
- Not able to ping newly created VMs using floating IPs
- VMs on the same network are able to ping each other (on the same compute node or between different compute nodes)
- br-int tap interfaces look fine (everything plugged in as expected)

I performed an OpenFlow dump before and after a recompute and noticed that some new flows were added (files attached). Among the flows added after the ovn-controller recompute were:

cookie=0x7b4860d8, table=0, priority=100,in_port=27 actions=load:0xd->NXM_NX_REG13[],load:0x14->NXM_NX_REG11[],load:0x15->NXM_NX_REG12[],load:0x6->OXM_OF_METADATA[],load:0x1->NXM_NX_REG14[],load:0x1->NXM_NX_REG10[10],resubmit(,8)
cookie=0x916e5c4c, table=0, priority=100,in_port=29 actions=load:0xe->NXM_NX_REG13[],load:0xc->NXM_NX_REG11[],load:0xb->NXM_NX_REG12[],load:0x1->OXM_OF_METADATA[],load:0x2->NXM_NX_REG14[],load:0x1->NXM_NX_REG10[10],resubmit(,8)
cookie=0xe74a6347, table=0, priority=100,in_port=25 actions=load:0x1c->NXM_NX_REG13[],load:0x18->NXM_NX_REG11[],load:0x17->NXM_NX_REG12[],load:0x8->OXM_OF_METADATA[],load:0x1->NXM_NX_REG14[],load:0x1->NXM_NX_REG10[10],resubmit(,8)
cookie=0xf924d9b8, table=0, priority=100,in_port=28 actions=load:0x1e->NXM_NX_REG13[],load:0x6->NXM_NX_REG11[],load:0x7->NXM_NX_REG12[],load:0x3->OXM_OF_METADATA[],load:0x1->NXM_NX_REG14[],load:0x1->NXM_NX_REG10[10],resubmit(,8)
cookie=0x17dfc0ad, table=0, priority=100,in_port=26 actions=load:0x1d->NXM_NX_REG13[],load:0x1->NXM_NX_REG11[],load:0x4->NXM_NX_REG12[],load:0x2->OXM_OF_METADATA[],load:0x1->NXM_NX_REG14[],load:0x1->NXM_NX_REG10[10],resubmit(,8)

The in_ports represent tap interfaces on br-int:
in_port 25 = tapecab191e-e0, where the other veth end tapecab191e-e1 is part of the ovnmeta namespace
in_port 26 = tapddf01192-e0, same
in_port 27 = tap26f8a86e-70, same
in_port 28 = tap3bf3ebe3-40, same
in_port 29 = tap05ed2909-20, same

I guess the next step is to understand what caused these flows to be missing after the upgrade.

@Julia, just curious: what is this "minor" update? Is there an easy way to see which packages got updated?
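Spotting which localport flows only exist after the recompute came down to diffing the two table 0 dumps by in_port. A minimal sketch of that diff (the helper names here are mine, not part of any tool; the in_port-to-tap-name mapping itself comes from ovs-ofctl show br-int):

```python
import re

def flow_key(line):
    """Extract (table, in_port) from one ovs-ofctl dump-flows line, if both are present."""
    table = re.search(r"table=(\d+)", line)
    port = re.search(r"in_port=(\d+)", line)
    if table and port:
        return (int(table.group(1)), int(port.group(1)))
    return None

def missing_in_ports(before_dump, after_dump, table=0):
    """Return in_ports that have a table-N flow only after the recompute."""
    before = {flow_key(l) for l in before_dump.splitlines()}
    after = {flow_key(l) for l in after_dump.splitlines()}
    new = (after - before) - {None}
    return sorted(p for (t, p) in new if t == table)
```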
(In reply to Miro Tomaska from comment #8)
> I guess the next step is to understand what caused these flows to be missing
> after the upgrade.
>
> @Julia, just curious: what is this "minor" update? Is there an easy way
> to see which packages got updated?

Hey Miro. The job updated from latest_cdn (core_puddle: RHOS-16.2-RHEL-8-20220311.n.1) to RHOS-16.2-RHEL-8-20220525.n.2. By "minor" I meant that the OSP version (16.2) hasn't been upgraded. Not sure about the changed packages.
We have similar bugs (missing metadata flows fixed by a recompute) on 17:
https://bugzilla.redhat.com/show_bug.cgi?id=2088454 (fix in core OVN: https://bugzilla.redhat.com/show_bug.cgi?id=2076604)
and on 16.2:
https://bugzilla.redhat.com/show_bug.cgi?id=2069668 (fix in core OVN: https://bugzilla.redhat.com/show_bug.cgi?id=2069783)

It sounds worth checking whether this is the same bug or a related one, depending on the OVN versions used in that update test.
Thanks Bernard for the heads-up! It does look like this is the same problem as BZ 2088454. All the missing ports from my comment #8 are OVS br-int "localports".

The core puddle version on my system is RHOS-16.2-RHEL-8-20220525.n.2, where the OVN RPM is ovn-2021-21.12.0-46.el8fdp.x86_64. The fix was not pulled in until ovn-21.12.0-68 [1]. @Julia, is it possible to rerun this job with a later puddle? It looks like puddle RHOS-16.2-RHEL-8-20220610.n.1 is available.

[1] https://brewweb.engineering.redhat.com/brew/buildinfo?buildID=2032466
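Whether a given build already carries the fix can be read straight off the NVR release number (46 < 68 here). A quick sketch of that comparison, assuming we only care about the 21.12.0 stream (the helper names are mine):

```python
import re

def release_number(nvr):
    # Pull the dist-git release out of an NVR like
    # "ovn-2021-21.12.0-46.el8fdp.x86_64" (21.12.0 stream only).
    m = re.search(r"21\.12\.0-(\d+)", nvr)
    return int(m.group(1)) if m else None

def has_localport_fix(nvr, fixed_in=68):
    # The localport flow fix landed in ovn-21.12.0-68.
    rel = release_number(nvr)
    return rel is not None and rel >= fixed_in
```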
(In reply to Miro Tomaska from comment #13)
> The fix was not pulled until ovn-21.12.0-68 [1]. @Julia, is it possible to
> rerun this job with later puddle? Looks like we have puddle
> RHOS-16.2-RHEL-8-20220610.n.1 available.

It seems that we are still facing the issue. We can see the failure in [1] (update to puddle RHOS-16.2-RHEL-8-20220610.n.1), though not all logs are available there. Now [2] is running, so in a few hours we should be able to get more info.

[1] http://rhos-ci-logs.lab.eng.tlv2.redhat.com/logs/rcj/DFG-pidone-updates-16.2_director-rhel-virthost-3cont_2comp-ipv4-geneve-tobiko-faults/92/infrared/.workspaces/workspace_2022-06-12_20-56-45/tobiko_check-resources-faults/tobiko_check-resources-faults_01_faults_faults.html
[2] https://rhos-ci-jenkins.lab.eng.tlv2.redhat.com/view/Phase3/view/OSP%2016.2/view/PidOne/job/DFG-pidone-updates-16.2_director-rhel-virthost-3cont_2comp-ipv4-geneve-tobiko-faults/96/
Apparently I changed the parameters of this BZ by mistake, so moving it back to the required state.
My understanding is that while this situation can be better handled in OVN (and the OVN patch up for review will take care of it), there's still a slight problem with what the metadata agent is doing when migrating ports to the new namespace. In general, Neutron should not at any point in time keep two or more VIFs assigned to the same port; this is a misconfiguration. It seems to me that the sync() function in the agent should first tear down unnecessary namespaces, then establish new ones. Something along the lines of: https://review.opendev.org/c/openstack/neutron/+/864777

To quote Dumitru from the dev ML: "It's undefined behavior because the CMS is doing something it shouldn't be doing."
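The ordering argument can be sketched roughly like this; FakeAgent and the two-phase sync() below are illustrative only (the real agent's sync() is more involved), but they show why tearing down first means no port ever has two VIFs plugged at once:

```python
def sync(agent, desired_networks, current_namespaces):
    # current_namespaces maps namespace name -> network it serves.
    # Phase 1: tear down namespaces for networks we no longer serve,
    # releasing their VIFs before any new namespace claims the port.
    for ns, net in list(current_namespaces.items()):
        if net not in desired_networks:
            agent.teardown_datapath(net)
    # Phase 2: only then provision namespaces for the desired networks.
    for net in desired_networks:
        agent.provision_datapath(net)

class FakeAgent:
    # Records call order to show teardown always precedes provisioning.
    def __init__(self):
        self.calls = []
    def teardown_datapath(self, net):
        self.calls.append(("down", net))
    def provision_datapath(self, net):
        self.calls.append(("up", net))
```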
To add some more color: even with the patch as discussed in OVN, (1) when two VIFs are assigned to the same port, neither gets configured - both get released; and (2) this situation is considered invalid and triggers a recompute. Neither outcome is desirable, and the CMS shouldn't put OVN in this situation.
(In reply to Ihar Hrachyshka from comment #33)
> It seems to me that sync() function in the agent should first tear down
> unnecessary namespaces, then establish new ones. Something along the lines
> of: https://review.opendev.org/c/openstack/neutron/+/864777

Thanks Ihar for looking into it. Yes, you're right, it's not normal behavior, and it started with the switch from datapath ID to network ID: https://review.opendev.org/c/openstack/networking-ovn/+/785181. I also missed Dumitru's feedback on the OVN patch :(. When I was investigating the issue I considered a similar fix in neutron (clean up namespaces keyed by datapath ID before adding any new namespaces, and clean up unused ones), but after realizing the behavior change in OVN (before 21.09.0 it used to work) I thought it would be better handled on the core OVN side. Considering your rationale and the neutron patch, it seems the OVN metadata agent should also be fixed to handle this case. Thanks.
Tested the fix in devstack with main as follows:

1) install latest with OVN 21.09, which includes the offending OVN commit 3ae8470edc64 ("I-P: Handle runtime data changes for pflow_output engine.")
2) revert the neutron patch that renamed the metadata namespace for the OVN agent: e4fb06b24299a7ecf10b05ef6ddc2d883c40e5a1 "[ovn] Add neutron network to metadata namespace names"
3) restart the metadata agent
4) start a VM
5) curl http://169.254.169.254/latest/meta-data/ - it works
6) unrevert the neutron patch that renamed the metadata namespace
7) restart the metadata agent
8) curl http://169.254.169.254/latest/meta-data/ - it fails
Tested the fix with OSP as follows:

1) deploy https://rhos-ci-jenkins.lab.eng.tlv2.redhat.com/job/DFG-pidone-updates-16.2_director-rhel-virthost-3cont_2comp-ipv4-geneve-tobiko-faults/118/#showFailuresLink
2) start a VM
3) from the VM: curl http://169.254.169.254/latest/meta-data/ - it works
4) update the metadata packages inside the ovn_metadata_agent container to include the upstream fix
5) restart the metadata agents with: sudo systemctl restart tripleo_ovn_metadata_agent.service
6) from the VM: curl http://169.254.169.254/latest/meta-data/ - it still works
Since we have a neutron-only fix for the bug, I don't think this bug should depend on the OVN fix.
Since the bug never happened to customers and affects only the latest OVN from the FDP queue, I don't think there's a reason to document it in the release notes.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Release of components for Red Hat OpenStack Platform 16.2.4), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2022:8794