Bug 1637626 - The cloud is not functional after OVN minor update
Summary: The cloud is not functional after OVN minor update
Keywords:
Status: CLOSED NOTABUG
Alias: None
Product: Red Hat OpenStack
Classification: Red Hat
Component: openstack-tripleo
Version: 14.0 (Rocky)
Hardware: Unspecified
OS: Unspecified
Priority: urgent
Severity: urgent
Target Milestone: beta
Target Release: 14.0 (Rocky)
Assignee: Carlos Camacho
QA Contact: Eran Kuris
URL:
Whiteboard:
Duplicates: 1653622
Depends On: 1656368 1656409
Blocks:
Reported: 2018-10-09 15:33 UTC by Bernard Cafarelli
Modified: 2019-09-09 14:15 UTC (History)

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of: 1614157
Environment:
Last Closed: 2018-12-11 08:23:52 UTC
Target Upstream Version:
Embargoed:



Links
System ID Private Priority Status Summary Last Updated
OpenStack gerrit 620262 0 None MERGED Do not adjust roles data for minor updates. 2020-09-04 23:14:34 UTC
OpenStack gerrit 620580 0 None MERGED Do not adjust roles data for minor updates. 2020-09-04 23:14:34 UTC

Comment 17 Carlos Camacho 2018-11-22 15:26:34 UTC
Hi Lucas, 

This kind of issue usually happens when the environment files are not added correctly, in this case in the upgrade prepare command.

I just ran a minor update in my dev env (OSP13), 3 controllers, 1 compute, and it worked as it should.

Can you share the content of your roles data file? Also, the environment files you are using and the steps you are following?

My vote is that there is something duplicated in the roles data or some skipped env file in the upgrade prepare.
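
One way to check for a skipped environment file is to compare the `-e` arguments of the original deploy command with those of the prepare command. The sketch below is only illustrative: the two command strings and the file paths in them are hypothetical, and the helper names are not part of any TripleO tooling.

```python
import shlex
from typing import List


def environment_files(command: str) -> List[str]:
    """Return the files passed via -e / --environment-file, in order."""
    tokens = shlex.split(command)
    files = []
    i = 0
    while i < len(tokens):
        token = tokens[i]
        if token in ("-e", "--environment-file") and i + 1 < len(tokens):
            files.append(tokens[i + 1])
            i += 2
            continue
        if token.startswith("--environment-file="):
            files.append(token.split("=", 1)[1])
        i += 1
    return files


def missing_environment_files(deploy_cmd: str, prepare_cmd: str) -> List[str]:
    """List environment files used at deploy time but absent from prepare."""
    prepared = set(environment_files(prepare_cmd))
    return [f for f in environment_files(deploy_cmd) if f not in prepared]


# Hypothetical commands; the paths are placeholders, not taken from this bug.
DEPLOY_CMD = (
    "openstack overcloud deploy --templates "
    "-e /home/stack/network.yaml -e /home/stack/ovn.yaml"
)
PREPARE_CMD = (
    "openstack overcloud update prepare --templates "
    "-e /home/stack/network.yaml"
)

print(missing_environment_files(DEPLOY_CMD, PREPARE_CMD))
```

An empty list would suggest that the prepare step received every environment file the deployment used; it says nothing about whether the contents of those files are correct.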

Comment 18 Lucas Alvares Gomes 2018-11-22 16:46:21 UTC
(In reply to Carlos Camacho from comment #17)
> Hi Lucas, 
> 

Hi Carlos,

> This kind of issue usually happens when the environment files are not
> added correctly, in this case in the upgrade prepare command.

Thanks for the prompt reply.

> 
> I just ran a minor update in my dev env (OSP13), 3 controllers, 1 compute,
> and it worked as it should.
> 

Apparently this problem does not happen in OSP 13 (or 12) because the way container preparation works changed in OSP 14 (I found some context in [0], and patch [1] seems related).

> Can you share the content of your roles data file? Also, the environment
> files you are using and the steps you are following?
> 
> My vote is that there is something duplicated in the roles data or some
> skipped env file in the upgrade prepare.

Sure. I'm debugging the issue using the CI logs [2] because, at the moment, I don't have an environment to try it out.

The content of the roles data can be found here: http://pastebin.test.redhat.com/672537

(You can find it at [2], undercloud-0.tar.gz, undercloud-0/home/stack/composable_roles/roles/roles_data.yaml)

As you can see, "OS::TripleO::Services::ContainerImagePrepare" is duplicated in the roles data. I believe that's the root problem.

The environment files used in the failing step are here: http://pastebin.test.redhat.com/672538

It's important to note that this same error is also happening in the generic update CI job [3] (not related to networking-ovn). I inspected the logs there [3] and I can see the exact same errors, including the duplicated "ContainerImagePrepare" item in the roles data.
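
The duplication described above can also be checked mechanically. The following is a minimal sketch, assuming the roles data has already been parsed into a list of role dictionaries (for example with a YAML loader) where each role carries a `name` and a `ServicesDefault` list. The `ROLES` sample is a hypothetical, heavily trimmed stand-in and not the actual file from the CI job.

```python
from collections import Counter


def duplicated_services(roles):
    """Map each role name to the services listed more than once in it."""
    result = {}
    for role in roles:
        counts = Counter(role.get("ServicesDefault") or [])
        dupes = sorted(name for name, seen in counts.items() if seen > 1)
        if dupes:
            result[role["name"]] = dupes
    return result


# Hypothetical sample mimicking parsed roles data with one duplicated entry.
ROLES = [
    {
        "name": "Controller",
        "ServicesDefault": [
            "OS::TripleO::Services::ContainerImagePrepare",
            "OS::TripleO::Services::Docker",
            "OS::TripleO::Services::ContainerImagePrepare",
        ],
    },
    {
        "name": "Compute",
        "ServicesDefault": ["OS::TripleO::Services::Docker"],
    },
]

print(duplicated_services(ROLES))
# prints {'Controller': ['OS::TripleO::Services::ContainerImagePrepare']}
```

Running a check like this on the generated roles data before the update prepare step would flag the duplicate before Puppet does.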

[0] https://bugzilla.redhat.com/show_bug.cgi?id=1648918

[1] https://review.openstack.org/618462

[2] https://rhos-qe-jenkins.rhev-ci-vms.eng.rdu2.redhat.com/view/DFG/view/network/view/networking-ovn/job/DFG-network-networking-ovn-update-14_director-rhel-virthost-3cont_2comp_2net-ipv4-geneve-composable/13/artifact/

[3] https://rhos-qe-jenkins.rhev-ci-vms.eng.rdu2.redhat.com/view/DFG/view/upgrades/view/update/job/DFG-upgrades-updates-14-from-2018-10-25.3-composable-ipv4/1/

Comment 19 Carlos Camacho 2018-11-26 11:58:55 UTC
Based on the IRC chat, there was actually a duplication in the roles data (C17). The duplication apparently comes from the tripleo-upgrade role.

Comment 22 Carlos Camacho 2018-11-28 12:03:15 UTC
Hi Eran,

Would you mind re-running the job testing the minor update for OVN using https://review.openstack.org/#/c/620580/ ?

Or you can just apply https://review.openstack.org/#/c/620580/1/tasks/update/overcloud_update_prepare.yml


That should fix the reported issue in your CI job.

Comment 23 Daniel Alvarez Sanchez 2018-11-28 15:12:56 UTC
Just to recap, this looks like an issue with tripleo-upgrade which is just consumed by infrared. So this should be opened in Jira then?

I've been trying to run a minor upgrade for the last couple of days, hitting several issues (that are already tracked), but the last one is that we need a fix in tripleo-common [0]. So if you guys agree, please open a bug in Jira so that the tripleo-upgrade patch linked by Carlos in C22 is applied on infrared, and leave this bug open to track [0].

Sounds reasonable Carlos/Jose Luis?

[0] https://review.openstack.org/#/c/619759

Comment 24 Daniel Alvarez Sanchez 2018-11-28 15:13:45 UTC
amend from c23: s/minor upgrade/minor update
sorry :)

Comment 25 Jose Luis Franco 2018-11-28 15:17:18 UTC
Agree with Daniel. This issue is not specific to any OpenStack component; it's in a set of Ansible tasks used by infrared to perform the tests, more specifically in the upgrade/update jobs.

So as Daniel mentioned, I also think this should be tracked in its corresponding tool, not in Bugzilla.

Comment 26 Carlos Camacho 2018-11-28 16:46:53 UTC
Hi Daniel, we have a BZ for tracking the patch you just pasted.
This is the BZ in question: https://bugzilla.redhat.com/show_bug.cgi?id=1652924

This BZ is for tracking the issue reported as:

...
"Error: Evaluation Error: The title 'container_image_prepare' has already been used in this resource expression at /etc/puppet/modules/tripleo/manifests/firewall.pp:135:5 on node controller-2.localdomain"
And looking on one controller:
% grep -A5 -B5 container_image_
...

Once we have the fix in place we will move it to post.
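
For anyone scanning job logs for this failure, a small regular expression can pull the duplicated title and the affected node out of the error line. This is only a sketch: the pattern is derived from the single error message quoted above, the second number after the file name is assumed to be a column position, and other Puppet versions may word the message differently.

```python
import re

# Pattern built from the one error line quoted in this comment (assumption:
# the message layout is stable across the logs being scanned).
DUPLICATE_TITLE = re.compile(
    r"The title '(?P<title>[^']+)' has already been used in this resource "
    r"expression at (?P<file>\S+?):(?P<line>\d+):(?P<column>\d+) "
    r"on node (?P<node>[\w.-]+)"
)


def parse_duplicate_title(log_line):
    """Return the matched fields as a dict, or None if the line differs."""
    match = DUPLICATE_TITLE.search(log_line)
    return match.groupdict() if match else None


ERROR_LINE = (
    "Error: Evaluation Error: The title 'container_image_prepare' has "
    "already been used in this resource expression at "
    "/etc/puppet/modules/tripleo/manifests/firewall.pp:135:5 "
    "on node controller-2.localdomain"
)

print(parse_duplicate_title(ERROR_LINE))
```

The extracted title (`container_image_prepare` here) can then be matched against the service names in the roles data to find which entry was listed twice.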

Comment 27 Daniel Alvarez Sanchez 2018-11-29 09:35:16 UTC
Closing this BZ after conversation with Arie, it's CI only and the patch from tripleo-upgrade is going to be used automatically as it's already merged.

Comment 28 Daniel Alvarez Sanchez 2018-11-29 09:41:37 UTC
(In reply to Daniel Alvarez Sanchez from comment #27)
> Closing this BZ after conversation with Arie, it's CI only and the patch
> from tripleo-upgrade is going to be used automatically as it's already
> merged.

Sorry, it's not merged yet! I'll close this once it is.

Comment 29 Jose Luis Franco 2018-11-29 12:12:54 UTC
*** Bug 1653622 has been marked as a duplicate of this bug. ***

Comment 30 Daniel Alvarez Sanchez 2018-12-03 10:53:26 UTC
@Arie was going to try this patch on a one-time job. Do you have any updates?
Thanks a lot!!

Comment 31 Amit Ugol 2018-12-03 12:02:58 UTC
@Nir, is this a release blocker in any way? I set blocker back to "?"

Comment 32 Carlos Camacho 2018-12-04 13:05:12 UTC
Hey Amit, this issue is a CI-only problem due to a duplication in the tripleo-upgrade repo.

The fixes are in place, so it only needs to be verified.

Moving it to ON_QA until we have it verified by the networking folks.

Comment 35 Eran Kuris 2018-12-10 11:46:14 UTC
According to this run, it looks like some tests failed:

https://rhos-qe-jenkins.rhev-ci-vms.eng.rdu2.redhat.com/view/DFG/view/network/view/networking-ovn/job/DFG-network-networking-ovn-update-14_director-rhel-virthost-3cont_2comp_2net-ipv4-geneve-composable/37/testReport/

So I am sending it back for further investigation.

Comment 37 Daniel Alvarez Sanchez 2018-12-10 17:24:46 UTC
Looks like we are hitting: https://bugzilla.redhat.com/show_bug.cgi?id=1656368

Comment 38 Carlos Camacho 2018-12-11 08:23:52 UTC
The issues reported in this BZ are actually fixed. If you hit any other issue, please create a BZ describing the actual error and provide logs and steps to reproduce.

@Eran this bug was reported for the tripleo-upgrade duplicated resource registry error.

Once a fix is available for a BZ, you should not move it back to ASSIGNED. If you hit another issue, please raise another bug with your findings and set its Depends On to the RHEL issue.

