Bug 1791949

Summary:	[B&R] After controller restore using ReaR cloud-init fails with error 'Failed to start Initial cloud-init job (metadata service crawler)' - 'RuntimeError: duplicate mac found! both 'br-ex' and 'ens5' have mac...'
Product:	Red Hat OpenStack	Reporter:	Eliad Cohen <elicohen>
Component:	tripleo-ansible	Assignee:	Toure Dunnon <tdunnon>
Status:	CLOSED DUPLICATE	QA Contact:	Eliad Cohen <elicohen>
Severity:	high	Docs Contact:
Priority:	medium
Version:	13.0 (Queens)	CC:	acanan, apevec, bfournie, ccamacho, hjensas, jbadiapa, jjoyce, jkreger, joflynn, jschluet, lhh, sbaker, tdunnon
Target Milestone:	---	Keywords:	Triaged
Target Release:	---
Hardware:	x86_64
OS:	Linux
Whiteboard:
Fixed In Version:		Doc Type:	If docs needed, set a value
Doc Text:		Story Points:	---
Clone Of:		Environment:
Last Closed:	2020-01-27 22:45:05 UTC	Type:	Bug
Regression:	---	Mount Type:	---
Documentation:	---	CRM:
Verified Versions:		Category:	---
oVirt Team:	---	RHEL 7.3 requirements from Atomic Host:
Cloudforms Team:	---	Target Upstream Version:
Embargoed:
Bug Depends On:	1768770, 1802152
Bug Blocks:

Description Eliad Cohen 2020-01-16 18:50:56 UTC

Description of problem:
After restoring a controller with ReaR following the procedure described at [1], the console of the restored controller shows that cloud-init failed [2]. The stack trace shows "RuntimeError: duplicate mac found! both 'br-ex' and 'ens5' have mac..."

Oddly, pcs status shows nothing wrong.
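
For readers unfamiliar with the error text, here is a minimal sketch of the kind of check that produces it. It is a simplified stand-in written for illustration, not cloud-init's actual code: the function name `interfaces_by_mac`, the input format, and the MAC address used are all hypothetical. The premise is that a tool building a MAC-to-interface map cannot tell which device is meant once two interfaces (here a bridge and its physical port) report the same MAC.

```python
def interfaces_by_mac(interfaces):
    """Build a {mac: name} map, refusing to continue on an ambiguous MAC.

    `interfaces` is a hypothetical list of (name, mac) pairs, e.g. gathered
    from /sys/class/net on a real host.
    """
    by_mac = {}
    for name, mac in interfaces:
        mac = mac.lower()  # compare MACs case-insensitively
        if mac in by_mac:
            # Same wording as the error seen on the restored controller.
            raise RuntimeError(
                "duplicate mac found! both '%s' and '%s' have mac '%s'"
                % (name, by_mac[mac], mac)
            )
        by_mac[mac] = name
    return by_mac


if __name__ == "__main__":
    # Hypothetical MAC: br-ex shares the address of its port ens5.
    shared = "52:54:00:aa:bb:cc"
    try:
        interfaces_by_mac([("ens5", shared), ("br-ex", shared)])
    except RuntimeError as err:
        print(err)
```

On a freshly provisioned node the bridge does not exist yet, so a check like this passes; after the deployment has created br-ex, a second full run sees both devices.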


Version-Release number of selected component (if applicable):


How reproducible:
100% with every systemctl restart cloud-init.service

Steps to Reproduce:
1. [In a virtual monolithic deployment] Use the tripleo-ansible role to create a backup of all controllers on the hypervisor
2. Restore one of the controllers as per [1]
3. Reboot the restored controller and see the failure in the console

Actual results:
cloud-init fails to run

Expected results:
cloud-init should run successfully

Additional info:
[1] https://access.redhat.com/documentation/en-us/red_hat_openstack_platform/13/html-single/undercloud_and_control_plane_back_up_and_restore/index
[2] http://pastebin.test.redhat.com/828099

Comment 1 Bob Fournier 2020-01-17 13:59:01 UTC

Not sure why the component is python-ironic-lib, but this looks more like an upgrade issue. Including Upgrades DFG.

Eliad - is it possible to get an sosreport?

Comment 3 Julia Kreger 2020-01-17 16:17:47 UTC

Greetings,

I suspect I know what is happening (mainly because I had the same bug in another case long ago). Essentially, upon restore the instance-id is different. Because cloud-init sees that the id value differs from the one recorded at its last configuration run, it attempts to reconfigure the machine as if it were a brand new cloned machine. Obviously this can be problematic with a system that has since been configured by another tool set. Ideally, after the initial configuration, we would disable cloud-init so it can never run again. That may not be the actual solution though.

-Julia
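
The instance-id comparison described in Comment 3 can be sketched as follows. This is an illustration of the idea only, under stated assumptions: the function name `is_new_instance` and the id values are hypothetical, and the real cloud-init logic involves more than a string comparison.

```python
def is_new_instance(cached_instance_id, current_instance_id):
    """Return True when the machine should be treated as a new instance.

    `cached_instance_id` is the id recorded at the last configuration run
    (None if nothing was recorded); `current_instance_id` is the id reported
    by the metadata source on this boot. Both are plain strings here.
    """
    if cached_instance_id is None:
        return True  # nothing recorded yet: first boot
    return cached_instance_id.strip() != current_instance_id.strip()


if __name__ == "__main__":
    # Hypothetical ids: the restore leaves the node with a different id.
    print(is_new_instance("i-before-backup", "i-before-backup"))  # same id
    print(is_new_instance("i-before-backup", "i-after-restore"))  # id changed
```

When the result is true, first-boot configuration (including network setup) is attempted again on a host that other tooling has already configured.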

Comment 4 Eliad Cohen 2020-01-17 16:20:47 UTC

Thanks Julia, Bob. To make things more complicated, it looks like all nodes went into maintenance mode. Any idea?

Comment 7 Bob Fournier 2020-01-24 14:54:15 UTC

Removing HardProv as this looks like a B+R issue.

Comment 9 Bob Fournier 2020-01-27 16:26:39 UTC

Note that there is a bug to prevent cloud-init from modifying network config after first boot: https://bugzilla.redhat.com/show_bug.cgi?id=1773642. Not sure if that is relevant here. That bug was created as a result of https://bugzilla.redhat.com/show_bug.cgi?id=1760806.

Comment 11 Steve Baker 2020-01-27 22:45:05 UTC


*** This bug has been marked as a duplicate of bug 1795383 ***