Bug 1791949

Summary: [B&R] After controller restore using ReaR cloud-init fails with error 'Failed to start Initial cloud-init job (metadata service crawler)' - 'RuntimeError: duplicate mac found! both 'br-ex' and 'ens5' have mac...'
Product: Red Hat OpenStack Reporter: Eliad Cohen <elicohen>
Component: tripleo-ansible Assignee: Toure Dunnon <tdunnon>
Status: CLOSED DUPLICATE QA Contact: Eliad Cohen <elicohen>
Severity: high Docs Contact:
Priority: medium    
Version: 13.0 (Queens) CC: acanan, apevec, bfournie, ccamacho, hjensas, jbadiapa, jjoyce, jkreger, joflynn, jschluet, lhh, sbaker, tdunnon
Target Milestone: --- Keywords: Triaged
Target Release: ---   
Hardware: x86_64   
OS: Linux   
Whiteboard:
Fixed In Version: Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2020-01-27 22:45:05 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On: 1768770, 1802152    
Bug Blocks:    

Description Eliad Cohen 2020-01-16 18:50:56 UTC
Description of problem:
After restoring a controller with ReaR following the procedure described at [1], the console of the restored controller shows that cloud-init failed [2]. The stack trace shows "RuntimeError: duplicate mac found! both 'br-ex' and 'ens5' have mac..."
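
For illustration, the error above is what a strict MAC-to-interface-name mapping produces when two devices report the same MAC address. The following is a minimal sketch of that kind of check, not cloud-init's actual code; the function name index_by_mac and the MAC value are hypothetical, and only the interface names and the message wording come from the report.

```python
def index_by_mac(interfaces):
    """Map each MAC address to a single interface name.

    `interfaces` is an iterable of (name, mac) pairs. A second interface
    carrying an already-seen MAC is treated as a fatal error.
    """
    by_mac = {}
    for name, mac in interfaces:
        mac = mac.lower()
        if mac in by_mac:
            raise RuntimeError(
                "duplicate mac found! both '%s' and '%s' have mac '%s'"
                % (name, by_mac[mac], mac)
            )
        by_mac[mac] = name
    return by_mac


# Hypothetical state on the restored controller: the bridge br-ex reports
# the same MAC as the NIC ens5. The MAC value is made up for the example.
restored = [("ens5", "52:54:00:aa:bb:cc"), ("br-ex", "52:54:00:aa:bb:cc")]

try:
    index_by_mac(restored)
except RuntimeError as exc:
    print(exc)
```

A mapping like this has no way to pick a winner between the two names, which is why the run aborts instead of configuring the network.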

Oddly, pcs status shows nothing wrong.


Version-Release number of selected component (if applicable):


How reproducible:
100%; the failure recurs on every systemctl restart cloud-init.service

Steps to Reproduce:
1. [In a virtual monolithic deployment] Use the tripleo-ansible role to create a backup of all controllers on the hypervisor
2. Restore one of the controllers as per [1]
3. Reboot the restored controller and observe the failure in the console

Actual results:
cloud-init fails to run

Expected results:
cloud-init should run successfully

Additional info:
[1] https://access.redhat.com/documentation/en-us/red_hat_openstack_platform/13/html-single/undercloud_and_control_plane_back_up_and_restore/index
[2] http://pastebin.test.redhat.com/828099

Comment 1 Bob Fournier 2020-01-17 13:59:01 UTC
Not sure why the component is python-ironic-lib, but this looks more like an upgrade issue. Including Upgrades DFG.

Eliad - is it possible to get an sosreport?

Comment 3 Julia Kreger 2020-01-17 16:17:47 UTC
Greetings,

I suspect I know what is happening (mainly because I hit the same bug in another case long ago). Essentially, upon restore the instance-id is different. Because cloud-init sees an id value that differs from the one recorded during its last configuration run, it attempts to reconfigure the machine as if it were a brand new cloned machine. Obviously this can be problematic with a system that has since been configured by another tool set. Ideally, after the initial configuration, we would disable cloud-init so it can never run again. That may not be the actual solution though.

-Julia
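
The instance-id comparison described in this comment can be sketched roughly as follows. This is a simplified illustration, not cloud-init's implementation: the helper name is_new_instance and the id values are hypothetical, and the sketch assumes the cached id is a one-line text file (on typical installs cloud-init keeps it at /var/lib/cloud/data/instance-id).

```python
import tempfile
from pathlib import Path


def is_new_instance(cached_id_file, current_id):
    """Return True when the datasource id differs from the cached id.

    A missing cache file is treated as a first boot.
    """
    path = Path(cached_id_file)
    if not path.exists():
        return True
    return path.read_text().strip() != current_id.strip()


# Demo against a temporary file instead of the real cache location.
with tempfile.TemporaryDirectory() as tmp:
    cache = Path(tmp) / "instance-id"
    cache.write_text("i-original\n")           # id recorded at first boot
    print(is_new_instance(cache, "i-original"))  # same id after a plain reboot
    print(is_new_instance(cache, "i-restored"))  # different id after a restore
```

When the check reports a new instance, the per-instance configuration stages run again, which is the reconfiguration attempt suspected here.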

Comment 4 Eliad Cohen 2020-01-17 16:20:47 UTC
Thanks Julia, Bob. To make things more complicated, it looks like all nodes went into maintenance mode. Any idea?

Comment 7 Bob Fournier 2020-01-24 14:54:15 UTC
Removing HardProv as this looks like a B+R issue.

Comment 9 Bob Fournier 2020-01-27 16:26:39 UTC
Note that there is a bug to prevent cloud-init from modifying the network config after first boot, https://bugzilla.redhat.com/show_bug.cgi?id=1773642; not sure if it is relevant here. That bug was created as a result of https://bugzilla.redhat.com/show_bug.cgi?id=1760806.
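
For reference, one documented cloud-init mechanism for keeping it away from network configuration is a drop-in file under /etc/cloud/cloud.cfg.d/ containing "network: {config: disabled}". Whether bug 1773642 uses this mechanism is not stated in this report, so the following is only a sketch under that assumption; the helper name and the drop-in file name are hypothetical.

```python
import tempfile
from pathlib import Path

# Hypothetical drop-in name; any *.cfg file in the directory is read.
DROPIN_NAME = "99-disable-network-config.cfg"
DROPIN_BODY = "network: {config: disabled}\n"


def write_network_disable_dropin(cfg_dir):
    """Write a drop-in telling cloud-init not to manage network config."""
    target = Path(cfg_dir) / DROPIN_NAME
    target.write_text(DROPIN_BODY)
    return target


# Demo in a temporary directory; on a real node the directory would be
# /etc/cloud/cloud.cfg.d and the call would need root privileges.
with tempfile.TemporaryDirectory() as tmp:
    written = write_network_disable_dropin(tmp)
    print(written.name)
    print(written.read_text(), end="")
```

A drop-in like this only stops cloud-init from rendering network configuration; it does not stop the other per-instance stages from running again.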

Comment 11 Steve Baker 2020-01-27 22:45:05 UTC

*** This bug has been marked as a duplicate of bug 1795383 ***