I was able to confirm the following once the system got into the "bad" state:

- Not able to ping existing VMs from within their ovn-metadata namespace (destination host unreachable)
- Was able to ping existing VMs using floating IPs
- Not able to ping newly created VMs from within their ovn-metadata namespace (destination host unreachable)
- Not able to ping newly created VMs using floating IPs
- VMs on the same network are able to ping each other (on the same compute node or between different compute nodes)
- br-int tap interfaces look fine (everything plugged in as expected)

I performed an OpenFlow dump before and after a recompute and noticed that some new flows were added (files attached). Among the flows added after the ovn-controller recompute were:

cookie=0x7b4860d8, table=0, priority=100,in_port=27 actions=load:0xd->NXM_NX_REG13[],load:0x14->NXM_NX_REG11[],load:0x15->NXM_NX_REG12[],load:0x6->OXM_OF_METADATA[],load:0x1->NXM_NX_REG14[],load:0x1->NXM_NX_REG10[10],resubmit(,8)
cookie=0x916e5c4c, table=0, priority=100,in_port=29 actions=load:0xe->NXM_NX_REG13[],load:0xc->NXM_NX_REG11[],load:0xb->NXM_NX_REG12[],load:0x1->OXM_OF_METADATA[],load:0x2->NXM_NX_REG14[],load:0x1->NXM_NX_REG10[10],resubmit(,8)
cookie=0xe74a6347, table=0, priority=100,in_port=25 actions=load:0x1c->NXM_NX_REG13[],load:0x18->NXM_NX_REG11[],load:0x17->NXM_NX_REG12[],load:0x8->OXM_OF_METADATA[],load:0x1->NXM_NX_REG14[],load:0x1->NXM_NX_REG10[10],resubmit(,8)
cookie=0xf924d9b8, table=0, priority=100,in_port=28 actions=load:0x1e->NXM_NX_REG13[],load:0x6->NXM_NX_REG11[],load:0x7->NXM_NX_REG12[],load:0x3->OXM_OF_METADATA[],load:0x1->NXM_NX_REG14[],load:0x1->NXM_NX_REG10[10],resubmit(,8)
cookie=0x17dfc0ad, table=0, priority=100,in_port=26 actions=load:0x1d->NXM_NX_REG13[],load:0x1->NXM_NX_REG11[],load:0x4->NXM_NX_REG12[],load:0x2->OXM_OF_METADATA[],load:0x1->NXM_NX_REG14[],load:0x1->NXM_NX_REG10[10],resubmit(,8)

The in_ports represent tap interfaces on br-int:
in_port 25 = tapecab191e-e0, where the other veth end tapecab191e-e1 is part of the ovnmeta namespace
in_port 26 = tapddf01192-e0, same
in_port 27 = tap26f8a86e-70, same
in_port 28 = tap3bf3ebe3-40, same
in_port 29 = tap05ed2909-20, same

I guess the next step is to understand what caused these flows to be missing after the upgrade.

@Julia, just curious: what is this "minor" update? Is there an easy way to see which packages got updated?
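Spotting which localport flows only exist after the recompute came down to diffing the two table 0 dumps by in_port. A minimal sketch of that diff (the helper names here are mine, not part of any tool; the in_port-to-tap-name mapping itself comes from ovs-ofctl show br-int):

```python
import re

def flow_key(line):
    """Extract (table, in_port) from one ovs-ofctl dump-flows line, if both are present."""
    table = re.search(r"table=(\d+)", line)
    port = re.search(r"in_port=(\d+)", line)
    if table and port:
        return (int(table.group(1)), int(port.group(1)))
    return None

def missing_in_ports(before_dump, after_dump, table=0):
    """Return in_ports that have a table-N flow only after the recompute."""
    before = {flow_key(l) for l in before_dump.splitlines()}
    after = {flow_key(l) for l in after_dump.splitlines()}
    new = (after - before) - {None}
    return sorted(p for (t, p) in new if t == table)
```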
(In reply to Miro Tomaska from comment #8)
> I guess the next step is to understand what caused these flows to be missing
> after the upgrade.
>
> @Julia, just curious: what is this "minor" update? Is there an easy way
> to see which packages got updated?

Hey Miro. The job updated from latest_cdn (core_puddle: RHOS-16.2-RHEL-8-20220311.n.1) to RHOS-16.2-RHEL-8-20220525.n.2. By "minor" I meant that the OSP version (16.2) hasn't been upgraded. Not sure about the changed packages.
We have similar bugs (missing metadata flows fixed by a recompute) on 17:
https://bugzilla.redhat.com/show_bug.cgi?id=2088454 (fix in core OVN: https://bugzilla.redhat.com/show_bug.cgi?id=2076604)
and on 16.2:
https://bugzilla.redhat.com/show_bug.cgi?id=2069668 (fix in core OVN: https://bugzilla.redhat.com/show_bug.cgi?id=2069783)

It sounds worth checking whether this is the same bug or a related one, depending on the OVN versions used in that update test.
Thanks Bernard for the heads-up! It does look like this is the same problem as BZ 2088454. All the missing ports from my comment #8 are OVS br-int "localports".

The core puddle version on my system is RHOS-16.2-RHEL-8-20220525.n.2, where the OVN RPM is ovn-2021-21.12.0-46.el8fdp.x86_64. The fix was not pulled in until ovn-21.12.0-68 [1]. @Julia, is it possible to rerun this job with a later puddle? It looks like puddle RHOS-16.2-RHEL-8-20220610.n.1 is available.

[1] https://brewweb.engineering.redhat.com/brew/buildinfo?buildID=2032466
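Whether a given build already carries the fix can be read straight off the NVR release number (46 < 68 here). A quick sketch of that comparison, assuming we only care about the 21.12.0 stream (the helper names are mine):

```python
import re

def release_number(nvr):
    # Pull the dist-git release out of an NVR like
    # "ovn-2021-21.12.0-46.el8fdp.x86_64" (21.12.0 stream only).
    m = re.search(r"21\.12\.0-(\d+)", nvr)
    return int(m.group(1)) if m else None

def has_localport_fix(nvr, fixed_in=68):
    # The localport flow fix landed in ovn-21.12.0-68.
    rel = release_number(nvr)
    return rel is not None and rel >= fixed_in
```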
(In reply to Miro Tomaska from comment #13)
> The fix was not pulled until ovn-21.12.0-68 [1]. @Julia, is it possible to
> rerun this job with later puddle? Looks like we have puddle
> RHOS-16.2-RHEL-8-20220610.n.1 available.

It seems that we are still facing the issue. We can see the failure in [1] (update to puddle RHOS-16.2-RHEL-8-20220610.n.1), though not all logs are available there. Now [2] is running, so in a few hours we should be able to get more info.

[1] http://rhos-ci-logs.lab.eng.tlv2.redhat.com/logs/rcj/DFG-pidone-updates-16.2_director-rhel-virthost-3cont_2comp-ipv4-geneve-tobiko-faults/92/infrared/.workspaces/workspace_2022-06-12_20-56-45/tobiko_check-resources-faults/tobiko_check-resources-faults_01_faults_faults.html
[2] https://rhos-ci-jenkins.lab.eng.tlv2.redhat.com/view/Phase3/view/OSP%2016.2/view/PidOne/job/DFG-pidone-updates-16.2_director-rhel-virthost-3cont_2comp-ipv4-geneve-tobiko-faults/96/
Apparently I changed the parameters of this BZ by mistake, so moving it back to the required state.
My understanding is that while this situation can be better handled in OVN (and the OVN patch up for review will take care of it), there's still a slight problem with what the metadata agent is doing when migrating ports to the new namespace. In general, Neutron should not at any point in time keep two or more VIFs assigned to the same port; this is a misconfiguration. It seems to me that the sync() function in the agent should first tear down unnecessary namespaces, then establish new ones. Something along the lines of: https://review.opendev.org/c/openstack/neutron/+/864777

To quote Dumitru from the dev ML: "It's undefined behavior because the CMS is doing something it shouldn't be doing."
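The ordering argument can be sketched roughly like this; FakeAgent and the two-phase sync() below are illustrative only (the real agent's sync() is more involved), but they show why tearing down first means no port ever has two VIFs plugged at once:

```python
def sync(agent, desired_networks, current_namespaces):
    # current_namespaces maps namespace name -> network it serves.
    # Phase 1: tear down namespaces for networks we no longer serve,
    # releasing their VIFs before any new namespace claims the port.
    for ns, net in list(current_namespaces.items()):
        if net not in desired_networks:
            agent.teardown_datapath(net)
    # Phase 2: only then provision namespaces for the desired networks.
    for net in desired_networks:
        agent.provision_datapath(net)

class FakeAgent:
    # Records call order to show teardown always precedes provisioning.
    def __init__(self):
        self.calls = []
    def teardown_datapath(self, net):
        self.calls.append(("down", net))
    def provision_datapath(self, net):
        self.calls.append(("up", net))
```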
To add some more color: even with the patch as discussed in OVN, (1) when two VIFs are assigned to the same port, neither gets configured - both get released; and (2) this situation is considered invalid and triggers a recompute. Neither outcome is desirable, and the CMS shouldn't put OVN in this situation.
(In reply to Ihar Hrachyshka from comment #33)
> It seems to me that sync() function in the agent should first tear down
> unnecessary namespaces, then establish new ones. Something along the lines
> of: https://review.opendev.org/c/openstack/neutron/+/864777

Thanks Ihar for looking into it. Yes, you're right, it's not normal behavior, and it started with the switch from datapath ID to network ID: https://review.opendev.org/c/openstack/networking-ovn/+/785181. I also missed Dumitru's feedback on the OVN patch :(. When I was investigating the issue I considered a similar fix in neutron (clean up namespaces keyed by datapath ID before adding any new namespaces, and clean up unused ones), but after realizing the behavior change in OVN (before 21.09.0 it used to work) I thought it would be better handled on the core OVN side. Considering your rationale and the neutron patch, it seems the OVN metadata agent should also be fixed to handle this case. Thanks.
Tested the fix in devstack with main as follows:

1) install latest with OVN 21.09, which includes the offending OVN commit 3ae8470edc64 ("I-P: Handle runtime data changes for pflow_output engine.")
2) revert the neutron patch that renamed the metadata namespace for the OVN agent: e4fb06b24299a7ecf10b05ef6ddc2d883c40e5a1 "[ovn] Add neutron network to metadata namespace names"
3) restart the metadata agent
4) start a VM
5) curl http://169.254.169.254/latest/meta-data/ - it works
6) unrevert the neutron patch that renamed the metadata namespace
7) restart the metadata agent
8) curl http://169.254.169.254/latest/meta-data/ - it fails
Tested the fix with OSP as follows:

1) deploy https://rhos-ci-jenkins.lab.eng.tlv2.redhat.com/job/DFG-pidone-updates-16.2_director-rhel-virthost-3cont_2comp-ipv4-geneve-tobiko-faults/118/#showFailuresLink
2) start a VM
3) from the VM: curl http://169.254.169.254/latest/meta-data/ - it works
4) update the metadata packages inside the ovn_metadata_agent container to include the upstream fix
5) restart the metadata agents with: sudo systemctl restart tripleo_ovn_metadata_agent.service
6) from the VM: curl http://169.254.169.254/latest/meta-data/ - it still works
Since we have a neutron-only fix for the bug, I don't think this bug should depend on the OVN fix.
Since the bug never happened to customers and affects only the latest OVN from the FDP queue, I don't think there's a reason to document it in the release notes.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Release of components for Red Hat OpenStack Platform 16.2.4), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2022:8794