Bug 1200596 - Bare metal overcloud hosts fail cloud-init metadata retrieval
Summary: Bare metal overcloud hosts fail cloud-init metadata retrieval
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat OpenStack
Classification: Red Hat
Component: instack
Version: 7.0 (Kilo)
Hardware: Unspecified
OS: Unspecified
urgent
high
Target Milestone: ga
: Director
Assignee: Jay Dobies
QA Contact: Alexander Chuzhoy
URL:
Whiteboard:
Depends On:
Blocks:
TreeView+ depends on / blocked
 
Reported: 2015-03-10 22:26 UTC by Dan Sneddon
Modified: 2015-08-05 13:50 UTC (History)
9 users (show)

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2015-08-05 13:50:51 UTC
Target Upstream Version:


Attachments (Terms of Use)


Links
System ID Priority Status Summary Last Updated
Red Hat Product Errata RHEA-2015:1549 normal SHIPPED_LIVE Red Hat Enterprise Linux OpenStack Platform director Release 2015-08-05 17:49:10 UTC

Description Dan Sneddon 2015-03-10 22:26:08 UTC
Description of problem:
While deploying overcloud images on instack+delorean the overcloud images boot, but cloud-init tries to contact the metadata server before the network has initialized.

Version-Release number of selected component (if applicable):
cloud-init.x86_64                 0.7.5-6.el7             @epel
instack.noarch                    0.0.6.4-g57c723a.el7.centos
instack-undercloud.noarch         2.0.0-dev1422.g2dd13c6.el7.centos


How reproducible:
100% (in my testing)

Steps to Reproduce:
1. Follow sphinx instructions for bare metal
2. instack-deploy-overcloud
3. 

Actual results:
When the instances boot, cloud-init will fail to get the metadata. The network is not initialized when cloud-init runs, so an exception is given ("Route info failed"). Later (really a matter of seconds), the route to 169.254.169.254 is available after DHCP completes.

Expected results:
The network should be fully initialized before cloud-init tries to run, that way the routes will be available and cloud-init should succeed. If we had a delay in between the curls to 169.254.169.254, or even a delay before cloud-init starts, that would probably fix the issue. I think cloud-init waits for the network service to be up, but not necessarily the network.

Additional info:
Logs, also at http://fpaste.org/196437/42602334/

    ===== grep puppet /var/log/os-apply-config.log =====
    [2015/03/10 05:02:59 PM] [INFO] writing /etc/os-net-config/config.json
    [2015/03/10 05:02:59 PM] [INFO] writing /var/run/heat-config/heat-config
    [2015/03/10 05:02:59 PM] [INFO] writing /etc/puppet/hiera.yaml
    [2015/03/10 05:02:59 PM] [INFO] writing /etc/os-collect-config.conf
    [2015/03/10 05:02:59 PM] [INFO] success
    [2015/03/10 05:03:07 PM] [INFO] writing /etc/os-net-config/config.json
    [2015/03/10 05:03:07 PM] [INFO] writing /var/run/heat-config/heat-config
    [2015/03/10 05:03:07 PM] [INFO] writing /etc/puppet/hiera.yaml
    [2015/03/10 05:03:07 PM] [INFO] writing /etc/os-collect-config.conf
    [2015/03/10 05:03:07 PM] [INFO] success
    [ Repeat endlessly every 20-30 seconds ]
    ===== End grep puppet /var/log/os-apply-config.log =====
     
     
    ===== /var/log/cloud-init.log =====
    Mar 10 17:02:28 localhost cloud-init: Cloud-init v. 0.7.5 running 'init-local' at Tue, 10 Mar 2015 21:02:28 +0000. Up 9.20 seconds.
    Mar 10 17:02:30 localhost cloud-init: Cloud-init v. 0.7.5 running 'init' at Tue, 10 Mar 2015 21:02:29 +0000. Up 10.66 seconds.
    Mar 10 17:02:30 localhost cloud-init: ci-info: ++++++++++++++++++++++++Net device info+++++++++++++++++++++++++
    Mar 10 17:02:30 localhost cloud-init: ci-info: +-----------+------+-----------+-----------+-------------------+
    Mar 10 17:02:30 localhost cloud-init: ci-info: |   Device  |  Up  |  Address  |    Mask   |     Hw-Address    |
    Mar 10 17:02:30 localhost cloud-init: ci-info: +-----------+------+-----------+-----------+-------------------+
    Mar 10 17:02:30 localhost cloud-init: ci-info: |   eno1:   | True |     .     |     .     | b0:83:fe:c6:53:21 |
    Mar 10 17:02:30 localhost cloud-init: ci-info: |    lo:    | True | 127.0.0.1 | 255.0.0.0 |         .         |
    Mar 10 17:02:30 localhost cloud-init: ci-info: | enp8s0f1: | True |     .     |     .     | a0:36:9f:52:7e:d9 |
    Mar 10 17:02:30 localhost cloud-init: ci-info: |   eno2:   | True |     .     |     .     | b0:83:fe:c6:53:22 |
    Mar 10 17:02:30 localhost cloud-init: ci-info: | enp8s0f0: | True |     .     |     .     | a0:36:9f:52:7e:d8 |
    Mar 10 17:02:30 localhost cloud-init: ci-info: +-----------+------+-----------+-----------+-------------------+
    Mar 10 17:02:30 localhost cloud-init: ci-info: !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!Route info failed!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
    Mar 10 17:02:30 localhost cloud-init: 2015-03-10 17:02:30,876 - url_helper.py[WARNING]: Calling 'http://169.254.169.254/2009-04-04/meta-data/instance-id' failed [0/120s
    ]: request error [HTTPConnectionPool(host='169.254.169.254', port=80): Max retries exceeded with url: /2009-04-04/meta-data/instance-id (Caused by <class 'socket.error'
    >: [Errno 101] Network is unreachable)]
    Mar 10 17:02:31 localhost cloud-init: 2015-03-10 17:02:31,883 - url_helper.py[WARNING]: Calling 'http://169.254.169.254/2009-04-04/meta-data/instance-id' failed [1/120s
    ]: request error [HTTPConnectionPool(host='169.254.169.254', port=80): Max retries exceeded with url: /2009-04-04/meta-data/instance-id (Caused by <class 'socket.error'
    >: [Errno 101] Network is unreachable)]
    Mar 10 17:02:32 localhost cloud-init: 2015-03-10 17:02:32,888 - url_helper.py[WARNING]: Calling 'http://169.254.169.254/2009-04-04/meta-data/instance-id' failed [2/120s
    ]: request error [HTTPConnectionPool(host='169.254.169.254', port=80): Max retries exceeded with url: /2009-04-04/meta-data/instance-id (Caused by <class 'socket.error'
    >: [Errno 101] Network is unreachable)]
    Mar 10 17:02:33 localhost cloud-init: 2015-03-10 17:02:33,894 - url_helper.py[WARNING]: Calling 'http://169.254.169.254/2009-04-04/meta-data/instance-id' failed [3/120s
    ]: request error [HTTPConnectionPool(host='169.254.169.254', port=80): Max retries exceeded with url: /2009-04-04/meta-data/instance-id (Caused by <class 'socket.error'
    >: [Errno 101] Network is unreachable)]
    Mar 10 17:02:34 localhost cloud-init: 2015-03-10 17:02:34,898 - url_helper.py[WARNING]: Calling 'http://169.254.169.254/2009-04-04/meta-data/instance-id' failed [4/120s]: request error [HTTPConnectionPool(host='169.254.169.254', port=80): Max retries exceeded with url: /2009-04-04/meta-data/instance-id (Caused by <class 'socket.error'>: [Errno 101] Network is unreachable)]
    Mar 10 17:02:35 localhost cloud-init: 2015-03-10 17:02:35,903 - url_helper.py[WARNING]: Calling 'http://169.254.169.254/2009-04-04/meta-data/instance-id' failed [5/120s]: request error [HTTPConnectionPool(host='169.254.169.254', port=80): Max retries exceeded with url: /2009-04-04/meta-data/instance-id (Caused by <class 'socket.error'>: [Errno 101] Network is unreachable)]
    Mar 10 17:02:54 localhost cloud-init: Cloud-init v. 0.7.5 running 'modules:config' at Tue, 10 Mar 2015 21:02:54 +0000. Up 35.13 seconds.
    Mar 10 17:02:54 localhost cloud-init: Cloud-init v. 0.7.5 running 'modules:final' at Tue, 10 Mar 2015 21:02:54 +0000. Up 35.63 seconds.
    Mar 10 17:02:55 localhost cloud-init: 2015-03-10 17:02:55,005 - util.py[WARNING]: Failed running /var/lib/cloud/instance/scripts/userdata [-]
    Mar 10 17:02:55 localhost cloud-init: 2015-03-10 17:02:55,009 - cc_scripts_user.py[WARNING]: Failed to run module scripts-user (scripts in /var/lib/cloud/instance/scripts)
    Mar 10 17:02:55 localhost cloud-init: 2015-03-10 17:02:55,010 - util.py[WARNING]: Running scripts-user (<module 'cloudinit.config.cc_scripts_user' from '/usr/lib/python2.7/site-packages/cloudinit/config/cc_scripts_user.pyc'>) failed
    Mar 10 17:02:55 localhost cloud-init: Cloud-init v. 0.7.5 finished at Tue, 10 Mar 2015 21:02:55 +0000. Datasource DataSourceEc2.  Up 35.82 seconds
    ===== End /var/log/cloud-init.log =====

Comment 3 Dan Sneddon 2015-03-10 22:36:57 UTC
To confirm this bug, I tried adding this line to the /usr/bin/cloud-init script:

/usr/bin/cloud-init line 200: time.sleep(30)

This gave the network time to come up, and cloud-init functioned.

Comment 4 Emilien Macchi 2015-03-10 23:04:31 UTC
Maybe this patch is helping: https://review.openstack.org/#/c/163219/

Comment 5 Dan Sneddon 2015-03-11 19:39:36 UTC
(In reply to Emilien Macchi from comment #4)
> Maybe this patch is helping: https://review.openstack.org/#/c/163219/

I'm pretty sure that patch won't help. The patch linked is to a Puppet module, in this case Puppet doesn't even run correctly because the cloud-init failed.

Comment 6 Dan Sneddon 2015-03-12 20:52:33 UTC
I did a test which involved mounting the built disk images and making the one-line change at line 200 to introduce a sleep cycle. This allowed cloud-init to get further, but it looks like SSH keys did not transmit, and Puppet still didn't run.

There is another patch that rlandy put in that helps, by making sure that em2 is available: https://review.gerrithub.io/#/c/218559/1/scripts/instack-build-images

I am trying this with a new build that includes that patch, and I'm going to see if cloud-init fails in the same way. If so, I'll add the sleep and try again.

Comment 7 Lon Hohberger 2015-03-12 20:56:26 UTC
cloud-init-0.7.6-2.el7 is available in Red Hat Common.

Comment 8 Lars Kellogg-Stedman 2015-03-12 21:13:26 UTC
Ideally we would take advantage of systemd to make this work reliably.  One option would be to insert a service that runs Before= cloud-init, e.g., something like:

    [Unit]
    Description=Wait for networking before starting cloud-init
    Before=cloud-init-local.service cloud-init.service

    [Service]
    Type=oneshot

    # perform some sort of network availability test
    ExecStart=/bin/sh -c 'while ! curl -sf http://169.254.169.254/; do sleep 1; done'

    [Install]
    RequiredBy=cloud-init-local.service cloud-init.service

But this doesn't make sense as a general solution, because cloud-init does not necessarily require a network-accessible metadata service in order to operate.

Comment 9 chris alfonso 2015-05-26 17:56:45 UTC
Dan, is this still an issue? There are a few ideas to solve or work around the issue in the comments. Is there concensus at this point?

Comment 10 Dan Sneddon 2015-05-26 18:20:23 UTC
(In reply to chris alfonso from comment #9)
> Dan, is this still an issue? There are a few ideas to solve or work around
> the issue in the comments. Is there concensus at this point?

Chris,

This turned out to be a mostly cosmetic issue. The issue is that cloud-init starts immediately after network.target starts, but it takes a while for the network to actually activate and get a DHCP address. After cloud-init reports the errors, it continues to retry the connection and it eventually works once the network is completely up.

The fix to this is to change the systemd startup file for cloud-init to depend on the network being online, not just when the network is first activated.

This is done by making cloud-init depend on one of the following:

[Unit]
Wants=network-online.target
After=network-online.target

or it could be added to the dependencies on network.target that exist already:

[Unit]
Wants=network.target network-online.target
After=network.target network-online.target

I would still like to see this get addressed in a future fix, because it will generate needless support questions when people see the very scary failure.

Comment 11 Dan Sneddon 2015-05-26 18:23:23 UTC
(In reply to Lars Kellogg-Stedman from comment #8)
> Ideally we would take advantage of systemd to make this work reliably.  One
> option would be to insert a service that runs Before= cloud-init, e.g.,
> something like:
> [...]

I tried inserting a service in between network.target and cloud-init, and it didn't make a difference. The actual fix is to make cloud-init depend on network-online.target:

[Unit]
Wants=network-online.target
After=network-online.target

Comment 13 Mike Burns 2015-06-24 13:44:38 UTC
did this get fixed?  do we need a fix?  is this a blocker?

Comment 15 Dan Sneddon 2015-07-14 23:25:46 UTC
I don't think this ever got fixed, but since it's mainly a cosmetic bug I'm not sure that anyone cares enough to fix it.

Comment 16 Alexander Chuzhoy 2015-07-20 14:12:57 UTC
Verified:
Environment: instack-undercloud-2.1.2-21.el7ost.noarch


Able to deploy an overcloud.

Comment 18 errata-xmlrpc 2015-08-05 13:50:51 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2015:1549


Note You need to log in before you can comment on or make changes to this bug.