Bug 1465730 - [Regression] cloud-init fails to configure network while EC2 creates instance from AMI
Status: CLOSED ERRATA
Product: Red Hat Enterprise Linux 7
Classification: Red Hat
Component: cloud-init
Version: 7.4
Hardware: x86_64 Linux
Priority: urgent
Severity: urgent
Target Milestone: rc
Target Release: ---
Assigned To: Ryan McCabe
QA Contact: Chen Shi
Keywords: EC2, Regression, Triaged
Depends On:
Blocks: 1451548 1456511
Reported: 2017-06-28 00:27 EDT by Chen Shi
Modified: 2018-04-10 10:05 EDT
Fixed In Version: cloud-init-0.7.9-13.el7
Last Closed: 2018-04-10 10:05:07 EDT
Type: Bug

Attachments
nvidia-installer_on_3.10.0-705.el7.x86_64.log (31.14 KB, text/plain), 2018-02-02 01:54 EST, Chen Shi
nvidia-installer_on_3.10.0-706.el7.x86_64.log (2.39 MB, text/plain), 2018-02-02 01:55 EST, Chen Shi

Description Chen Shi 2017-06-28 00:27:27 EDT
Description of problem:
The network cannot be reached after the instance is created. The instance was created from an AMI that includes the cloud-init-0.7.9-8.el7.x86_64 package.


Version-Release number of selected component (if applicable):
cloud-init-0.7.9-4.el7.x86_64-> PASS
cloud-init-0.7.9-6.el7.x86_64-> PASS
cloud-init-0.7.9-8.el7.x86_64-> FAIL
cloud-init-0.7.9-9.el7.x86_64-> FAIL

How reproducible:
100%

Steps to Reproduce:
1. Log in to AWS in the Tokyo region.
2. Create an instance from RHEL-7.4_HVM_Beta-20170518-x86_64-1-Hourly2-GP2 (ami-31261f56).
3. Upgrade to the cloud-init-0.7.9-8.el7.x86_64 package.
4. Create an AMI from this instance, make sure the AMI contain cloud-init-0.7.9-8.el7.x86_64 package.
5. Create an instance from this AMI.
6. This instance can be booted up but the network can't be reached.

Actual results:
This instance can be booted up but the network can't be reached.

Expected results:
This instance can be booted up and the network can be reached.

Additional info:
1. This problem can be reproduced regardless of whether the AMI was created via the web console or the CLI.
2. This problem can be reproduced in other regions.
3. Neither the public nor the private DNS/IP can be reached.
Comment 3 Chen Shi 2017-06-28 07:30:02 EDT
Some useful information from /var/log/messages:

```
Jun 28 09:53:37 ip-172-31-0-87 NetworkManager[480]: <info>  [1498643617.9597] device (lo): link connected
Jun 28 09:53:37 ip-172-31-0-87 NetworkManager[480]: <info>  [1498643617.9621] manager: (lo): new Generic device (/org/freedesktop/NetworkManager/Devices/1)
Jun 28 09:53:37 ip-172-31-0-87 systemd: Started Hostname Service.
Jun 28 09:53:37 ip-172-31-0-87 NetworkManager[480]: <info>  [1498643617.9680] manager: (eth0): new Ethernet device (/org/freedesktop/NetworkManager/Devices/2)
Jun 28 09:53:37 ip-172-31-0-87 NetworkManager[480]: <info>  [1498643617.9723] device (eth0): state change: unmanaged -> unavailable (reason 'managed') [10 20 2]
Jun 28 09:53:37 ip-172-31-0-87 kernel: IPv6: ADDRCONF(NETDEV_UP): eth0: link is not ready
Jun 28 09:53:37 ip-172-31-0-87 NetworkManager[480]: <info>  [1498643617.9836] device (eth0): link connected
Jun 28 09:53:38 ip-172-31-0-87 NetworkManager[480]: <info>  [1498643618.0186] device (eth0): state change: unavailable -> disconnected (reason 'none') [20 30 0]
Jun 28 09:53:38 ip-172-31-0-87 NetworkManager[480]: <info>  [1498643618.0219] manager: startup complete
Jun 28 09:53:38 ip-172-31-0-87 systemd: Started Network Manager Wait Online.
Jun 28 09:53:38 ip-172-31-0-87 systemd: Starting LSB: Bring up/down networking...
Jun 28 09:53:38 ip-172-31-0-87 network: Bringing up loopback interface:  [  OK  ]
Jun 28 09:53:38 ip-172-31-0-87 NetworkManager[480]: <info>  [1498643618.5053] audit: op="connection-activate" uuid="5fb06bd0-0bb0-7ffb-45f1-d6edd65f3e03" name="System eth0" result="fail" reason="No suitable device found for this connection."
Jun 28 09:53:38 ip-172-31-0-87 network: Bringing up interface eth0:  Error: Connection activation failed: No suitable device found for this connection.
Jun 28 09:53:38 ip-172-31-0-87 network: [FAILED]
Jun 28 09:53:38 ip-172-31-0-87 systemd: network.service: control process exited, code=exited status=1
Jun 28 09:53:38 ip-172-31-0-87 systemd: Failed to start LSB: Bring up/down networking.
Jun 28 09:53:38 ip-172-31-0-87 systemd: Unit network.service entered failed state.
Jun 28 09:53:38 ip-172-31-0-87 systemd: network.service failed.
```

We added scripts to rc.local and collected the following information during system boot:

$ ifconfig -a

eth0: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1500
        ether 06:9e:0f:f7:c4:65  txqueuelen 1000  (Ethernet)
        RX packets 2  bytes 140 (140.0 B)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 0  bytes 0 (0.0 B)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0
......

$ cat /etc/sysconfig/network-scripts/ifcfg-eth0

BOOTPROTO=dhcp
DEVICE=eth0
HWADDR=06:91:96:e1:6e:e7
ONBOOT=yes
TYPE=Ethernet
USERCTL=no

$ cat /etc/udev/rules.d/70-persistent-net.rules

SUBSYSTEM=="net", ACTION=="add", DRIVERS=="?*", ATTR{address}=="06:91:96:e1:6e:e7", NAME="eth0"


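We believe that the mismatched MAC address in the configuration files caused eth0 to fail to come up: the interface reports 06:9e:0f:f7:c4:65, while ifcfg-eth0 and the udev rule both reference 06:91:96:e1:6e:e7. (We also checked with "cloud-init-0.7.9-6.el7.x86_64", where the MAC addresses are the same.)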
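
The comparison described above can be sketched as a small shell check. This is only an illustration: the function name `check_hwaddr` and its two-argument layout are hypothetical and not part of cloud-init or the initscripts; it assumes the kernel exposes the interface MAC under /sys/class/net/<iface>/address.

```shell
# Hypothetical helper: compare the HWADDR pinned in an ifcfg file with the
# MAC address the kernel reports for the interface.
check_hwaddr() {
    ifcfg_file=$1      # e.g. /etc/sysconfig/network-scripts/ifcfg-eth0
    address_file=$2    # e.g. /sys/class/net/eth0/address

    # Take the first HWADDR= value, drop optional quotes, normalise hex case.
    configured=$(sed -n 's/^HWADDR=//p' "$ifcfg_file" | head -n 1 | tr -d '"' | tr 'A-F' 'a-f')
    actual=$(tr 'A-F' 'a-f' < "$address_file" | tr -d '[:space:]')

    if [ -z "$configured" ]; then
        echo "no HWADDR set"
    elif [ "$configured" = "$actual" ]; then
        echo "match"
    else
        echo "mismatch: configured=$configured actual=$actual"
        return 1
    fi
}
```

On an affected instance it could be called as `check_hwaddr /etc/sysconfig/network-scripts/ifcfg-eth0 /sys/class/net/eth0/address`; a non-zero exit status indicates the stale-MAC situation shown in the output above.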
Comment 4 Lars Kellogg-Stedman 2017-06-28 11:49:21 EDT
I believe that the problem is due to the way you are creating the new AMI in step 4:

> 4. Create an AMI from this instance, make sure the AMI contain cloud-init-0.7.9-8.el7.x86_64 package.

It doesn't appear that you are running "virt-sysprep" or otherwise attempting to de-configure the instance before generating the AMI.  Minimally, you will need to remove the HWADDR line from /etc/sysconfig/network-scripts/ifcfg-eth0, and you will need to remove the corresponding udev rule in /etc/udev/rules.d/70-persistent-net.rules.

Running "virt-sysprep" should accomplish both tasks for you.

If you are still experiencing a problem after implementing these steps, please update this bug with details.
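
For cases where "virt-sysprep" cannot be run against the image, the two minimal steps above could be done by hand inside the instance before the AMI is created. The following is a rough sketch, not a replacement for virt-sysprep: the function name `deconfigure_net` and its optional root-prefix argument are hypothetical, it only handles eth0, and it removes the whole 70-persistent-net.rules file rather than a single rule.

```shell
# Hypothetical helper: strip instance-specific NIC identity before imaging.
# $1 is an optional root prefix so the function can be tried on a scratch
# tree; with no argument it acts on the live system (requires root).
deconfigure_net() {
    root=${1:-}
    ifcfg="$root/etc/sysconfig/network-scripts/ifcfg-eth0"
    rules="$root/etc/udev/rules.d/70-persistent-net.rules"

    # Drop the HWADDR line so the config is not pinned to the old MAC.
    # Written back with cat so the file keeps its inode and permissions.
    if [ -f "$ifcfg" ]; then
        sed '/^HWADDR=/d' "$ifcfg" > "$ifcfg.new" &&
            cat "$ifcfg.new" > "$ifcfg" &&
            rm -f "$ifcfg.new"
    fi

    # Remove the persistent-net rules that bind eth0 to the old MAC.
    rm -f "$rules"
}
```

Running `deconfigure_net` with no argument as root would apply the changes to the running instance; the other lines of ifcfg-eth0 (BOOTPROTO, DEVICE, ONBOOT and so on) are left untouched.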
Comment 6 Chen Shi 2017-06-29 05:16:28 EDT
(In reply to Lars Kellogg-Stedman from comment #4)
> I believe that the problem is due to the way you are creating the new AMI in
> step 4:
> 
> > 4. Create an AMI from this instance, make sure the AMI contain cloud-init-0.7.9-8.el7.x86_64 package.
> 
> It doesn't appear that you are running "virt-sysprep" or otherwise
> attempting to de-configure the instance before generating the AMI. 
> Minimally, you will need to remove the HWADDR line from
> /etc/sysconfig/network-scripts/ifcfg-eth0, and you will need to remove the
> corresponding udev rule in /etc/udev/rules.d/70-persistent-net.rules.
> 
> Running "virt-sysprep" should accomplish both tasks for you.
> 
> If you are still experiencing a problem after implementing these steps,
> please update this bug with details.

Hi Lars,

  Thank you for the input. I think this procedure should be done by AWS, and I am not familiar with "virt-sysprep". Is it used to process the AMI? We can't download the AMI, process it, and then upload it back to AWS. So how is "virt-sysprep" meant to be used with AWS EC2?

Thanks,
Charles
Comment 9 yuxisun@redhat.com 2017-07-20 05:49:53 EDT
Hi,

I tried to reproduce it in Azure, and it seems that after I update the cloud-init package, the services are not active during the boot phase. I think this might be the root cause of this issue.

My steps:
1. Update cloud-init from 0.7.9-3 to 0.7.9-9 through rpm -Uvh
2. Ensure the services are enabled:
# systemctl is-enabled cloud-{init-local,init,config,final}
enabled
enabled
enabled
enabled
3. Reboot
4. Check /var/log/cloud-init.log

Actual result:
There are no new entries in cloud-init.log.
/var/log/messages also contains no cloud-init related messages.

Debug:
Comparing the files in 0.7.9-3 and 0.7.9-9, the following two files are not present in 0.7.9-9:
/usr/lib/systemd/system/cloud-init.target
/usr/lib/systemd/system-generators/cloud-init-generator

It seems that the old version (0.7.9-3) put the cloud-*.service links into the /etc/systemd/system/cloud-init.target.wants/ folder rather than multi-user.target.wants/, and then used cloud-init.target to run the cloud-init related services after multi-user.target. The cloud-init.target needs cloud-init-generator to enable it.

# ll /etc/systemd/system/cloud-init.target.wants/
total 0
lrwxrwxrwx. 1 root root 44 Jul 20 09:24 cloud-config.service -> /usr/lib/systemd/system/cloud-config.service
lrwxrwxrwx. 1 root root 43 Jul 20 09:24 cloud-final.service -> /usr/lib/systemd/system/cloud-final.service
lrwxrwxrwx. 1 root root 48 Jul 20 09:24 cloud-init-local.service -> /usr/lib/systemd/system/cloud-init-local.service
lrwxrwxrwx. 1 root root 42 Jul 20 09:24 cloud-init.service -> /usr/lib/systemd/system/cloud-init.service

[root@wala73cloud0793 ~]# ll /usr/lib/systemd/system/cloud-init.target
-rw-r--r--. 1 root root 255 Dec 23  2016 /usr/lib/systemd/system/cloud-init.target
[root@wala73cloud0793 ~]# ll /usr/lib/systemd/system-generators/cloud-init-generator 
-rwxr-xr-x. 1 root root 3972 Dec 23  2016 /usr/lib/systemd/system-generators/cloud-init-generator

# cat /usr/lib/systemd/system/cloud-init.target
# cloud-init target is enabled by cloud-init-generator
# To disable it you can either:
#  a.) boot with kernel cmdline of 'cloudinit=disabled'
#  b.) touch a file /etc/cloud/cloud-init.disabled
[Unit]
Description=Cloud-init target
After=multi-user.target

In v0.7.9-9, on a fresh install, the cloud-*.service links are in multi-user.target.wants/, so cloud-init.target and cloud-init-generator are no longer needed.

# ll /etc/systemd/system/multi-user.target.wants/|grep cloud
lrwxrwxrwx. 1 root root 44 Jul 20 11:26 cloud-config.service -> /usr/lib/systemd/system/cloud-config.service
lrwxrwxrwx. 1 root root 43 Jul 20 11:26 cloud-final.service -> /usr/lib/systemd/system/cloud-final.service
lrwxrwxrwx. 1 root root 48 Jul 20 11:26 cloud-init-local.service -> /usr/lib/systemd/system/cloud-init-local.service
lrwxrwxrwx. 1 root root 42 Jul 20 11:26 cloud-init.service -> /usr/lib/systemd/system/cloud-init.service


But on an upgrade from v0.7.9-3 to 0.7.9-9, the cloud-init related services are not re-enabled, so the .service soft links remain in /etc/systemd/system/cloud-init.target.wants/ while the cloud-init.target and cloud-init-generator files are removed. As a result, systemd cannot run the cloud-init related services.

[root@wala73cloud0793bak ~]# ls /usr/lib/systemd/system/cloud-init.target 
ls: cannot access /usr/lib/systemd/system/cloud-init.target: No such file or directory
[root@wala73cloud0793bak ~]# ls /usr/lib/systemd/system-generators/cloud-init-generator
ls: cannot access /usr/lib/systemd/system-generators/cloud-init-generator: No such file or directory
[root@wala73cloud0793bak ~]# ll /etc/systemd/system/multi-user.target.wants/|grep cloud
[root@wala73cloud0793bak ~]# ll /etc/systemd/system/cloud-init.target.wants/
total 0
lrwxrwxrwx. 1 root root 44 Jul 20 09:24 cloud-config.service -> /usr/lib/systemd/system/cloud-config.service
lrwxrwxrwx. 1 root root 43 Jul 20 09:24 cloud-final.service -> /usr/lib/systemd/system/cloud-final.service
lrwxrwxrwx. 1 root root 48 Jul 20 09:24 cloud-init-local.service -> /usr/lib/systemd/system/cloud-init-local.service
lrwxrwxrwx. 1 root root 42 Jul 20 09:24 cloud-init.service -> /usr/lib/systemd/system/cloud-init.service
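
The broken state shown above (links left under cloud-init.target.wants/ after cloud-init.target itself has been removed) could be detected with a check along these lines. This is a sketch only: the function name `find_orphaned_cloud_units` and its two directory arguments are hypothetical, and it looks only at the symlink layout, not at what systemd has actually loaded.

```shell
# Hypothetical helper: report cloud-*.service links that still hang off
# cloud-init.target.wants/ although cloud-init.target no longer exists and
# the service has not been linked under multi-user.target.wants/.
find_orphaned_cloud_units() {
    etc_dir=${1:-/etc/systemd/system}
    lib_dir=${2:-/usr/lib/systemd/system}

    # While the target unit still ships, the old layout is still valid.
    if [ -e "$lib_dir/cloud-init.target" ]; then
        return 0
    fi

    status=0
    for link in "$etc_dir/cloud-init.target.wants"/cloud-*.service; do
        # -L also matches dangling symlinks and skips an unexpanded glob.
        [ -L "$link" ] || continue
        name=$(basename "$link")
        if [ ! -L "$etc_dir/multi-user.target.wants/$name" ]; then
            echo "orphaned: $name"
            status=1
        fi
    done
    return $status
}
```

Called with no arguments it inspects the default systemd directories; a non-zero exit status means at least one cloud-init service would not be started at boot.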
Comment 10 Chen Shi 2017-08-04 01:30:49 EDT
As verified on AWS with RHEL7.4 RC1, "cloud-init services are not re-enabled" is the root cause of the "Bring up networking failed" issue.

For RHEL-7.4_HVM_GA-20170724-x86_64-1-Hourly2-GP2 (ami-3901e15f), the .service soft links are already in multi-user.target.wants, so upgrading cloud-init on an instance based on RHEL7.4 GA will not hit this issue.

RHEL7.4 GA:
[ec2-user@ip-172-31-9-0 system]$ uname -a
Linux ip-172-31-9-0.ap-northeast-1.compute.internal 3.10.0-693.el7.x86_64 #1 SMP Thu Jul 6 19:56:57 EDT 2017 x86_64 x86_64 x86_64 GNU/Linux
[ec2-user@ip-172-31-9-0 system]$ rpm -qa | grep cloud-init
cloud-init-0.7.9-9.el7.x86_64
[ec2-user@ip-172-31-9-0 system]$ find . | grep cloud
./multi-user.target.wants/cloud-config.service
./multi-user.target.wants/cloud-final.service
./multi-user.target.wants/cloud-init.service
./multi-user.target.wants/cloud-init-local.service
[ec2-user@ip-172-31-9-0 system]$ 


In conclusion, customers will encounter this issue when they:
1. Upgrade cloud-init on an instance based on the RHEL-7.4_HVM_Beta-20170518-x86_64-1-Hourly2-GP2 AMI (or an earlier version); AND
2. Create an AMI from that instance for further use.

Workaround: execute the following commands before making the AMI:
# systemctl disable cloud-{init-local,init,config,final}.service
# systemctl enable cloud-{init-local,init,config,final}.service
Then make sure the .service soft links are located in multi-user.target.wants.
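
The final check of the workaround could be scripted roughly as follows. This is a sketch under assumptions: the function name `verify_cloud_init_enabled` is hypothetical, and the default directory is the /etc/systemd/system/multi-user.target.wants path seen in the listings earlier in this bug.

```shell
# Hypothetical helper: confirm that all four cloud-init services are linked
# under multi-user.target.wants/ after the disable/enable cycle.
verify_cloud_init_enabled() {
    wants_dir=${1:-/etc/systemd/system/multi-user.target.wants}
    missing=0
    for svc in cloud-init-local cloud-init cloud-config cloud-final; do
        if [ ! -L "$wants_dir/$svc.service" ]; then
            echo "missing: $svc.service"
            missing=1
        fi
    done
    return $missing
}
```

If it prints nothing and exits with status 0, the links are where the fresh-install layout expects them and the AMI can be created.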
Comment 15 Chen Shi 2018-02-02 01:54 EST
Created attachment 1389947 [details]
nvidia-installer_on_3.10.0-705.el7.x86_64.log
Comment 16 Chen Shi 2018-02-02 01:55 EST
Created attachment 1389948 [details]
nvidia-installer_on_3.10.0-706.el7.x86_64.log
Comment 17 Chen Shi 2018-02-02 01:57:55 EST
Comment on attachment 1389947 [details]
nvidia-installer_on_3.10.0-705.el7.x86_64.log

the attachment is not for this bug
Comment 18 Chen Shi 2018-02-02 01:58:15 EST
Comment on attachment 1389948 [details]
nvidia-installer_on_3.10.0-706.el7.x86_64.log

the attachment is not for this bug
Comment 23 errata-xmlrpc 2018-04-10 10:05:07 EDT
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2018:0806
