Bug 970453 - instance does not retrieve DHCP address because of initscript ordering
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat OpenStack
Classification: Red Hat
Component: openstack-quantum
Version: 3.0
Hardware: All
OS: Linux
Priority: unspecified
Severity: urgent
Target Milestone: snapshot2
Target Release: 3.0
Assignee: Gary Kotton
QA Contact: yfried
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2013-06-04 05:15 UTC by Steven Dake
Modified: 2016-04-26 19:34 UTC (History)
9 users

Fixed In Version: openstack-quantum-2013.1.1-10.el6ost
Doc Type: Bug Fix
Doc Text:
Previously, the initialization scripts for the OpenStack networking server and the DHCP agent were configured with the same priority. As a result, the DHCP agent would sometimes complete initialization and be running before the networking server. This resulted in a non-functional DHCP setup, as the DHCP agent relies on the networking server already being operational when it is started. To work around this issue, the initialization priority of the OpenStack networking server has been raised to ensure that it is always started before the DHCP agent.
Clone Of:
Environment:
Last Closed: 2013-06-11 18:56:55 UTC
Target Upstream Version:
Embargoed:




Links
System ID Private Priority Status Summary Last Updated
Red Hat Product Errata RHBA-2013:0933 0 normal SHIPPED_LIVE openstack-quantum bug fix update 2013-06-11 22:48:36 UTC

Description Steven Dake 2013-06-04 05:15:11 UTC
Description of problem:
The DHCP agent starts before quantum-server in the startup sequence.  This results in an error in the logs and a nonfunctional DHCP setup.  Booting any instance results in timeouts on the DHCP discover operation, with errors in the DHCP agent's log.

The root cause is the DHCP agent expecting quantum-server to already be running and not re-attempting (or re-attempting incorrectly) to connect to the quantum server over AMQP.  We had a similar problem in heat long ago :)

Version-Release number of selected component (if applicable):
Puddle 2013.5.31.1

How reproducible:
100%

Steps to Reproduce:
1. install RHEL6.4 in VM - follow quantum install directions
2. when the boot vm step is completed, open virtual machine manager
3. log in to the cirros console as root, password cubswin:)
4. try to ifup/ifdown eth0 - it will not be able to obtain an IP address
5. service quantum-dhcp-agent stop
6. service quantum-dhcp-agent start
7. start new vm
8. open the new vm in the console
9. it will have an ip address 

Actual results:
No DHCP address obtained by instance

Expected results:
DHCP address obtained by instance

Additional info:
The root cause of the problem is that quantum-server starts in the init sequence after the DHCP agent, e.g.:
lrwxrwxrwx. 1 root root 29 Jun  3 15:30 /etc/rc.d/rc3.d/S97quantum-ovs-cleanup -> ../init.d/quantum-ovs-cleanup
lrwxrwxrwx. 1 root root 28 Jun  3 15:21 /etc/rc.d/rc3.d/S98quantum-dhcp-agent -> ../init.d/quantum-dhcp-agent
lrwxrwxrwx. 1 root root 26 Jun  3 15:21 /etc/rc.d/rc3.d/S98quantum-l3-agent -> ../init.d/quantum-l3-agent
lrwxrwxrwx. 1 root root 32 Jun  3 15:21 /etc/rc.d/rc3.d/S98quantum-metadata-agent -> ../init.d/quantum-metadata-agent
lrwxrwxrwx. 1 root root 35 Jun  3 15:21 /etc/rc.d/rc3.d/S98quantum-openvswitch-agent -> ../init.d/quantum-openvswitch-agent
lrwxrwxrwx. 1 root root 24 Jun  3 22:01 /etc/rc.d/rc3.d/S98quantum-server -> ../init.d/quantum-server

Init scripts with the same two-digit priority are started in lexical order of their symlink names, so S98quantum-dhcp-agent runs before S98quantum-server.

A very simple solution, which I tested, is to change the init priority of quantum-server to 97 and quantum-ovs-cleanup to 96.  This resolves the problem.
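The ordering can be checked directly from the symlink names shown above: the two digits after the S prefix are the start priority, and lower numbers start first. A minimal sketch of that check (the `prio` helper and the symlink names are taken from this report, not from any quantum tooling):

```shell
# Extract the start priority from an rc symlink name, e.g. "S97quantum-server" -> 97
prio() { echo "$1" | sed 's/^S\([0-9][0-9]\).*/\1/'; }

server=$(prio S97quantum-server)
dhcp=$(prio S98quantum-dhcp-agent)

# With the fix applied, quantum-server's priority is strictly lower,
# so it is started before the DHCP agent.
if [ "$server" -lt "$dhcp" ]; then
  echo "ordering OK: quantum-server ($server) starts before dhcp-agent ($dhcp)"
else
  echo "ordering BROKEN: dhcp-agent would start first"
fi
```

Before the fix, both symlinks carried priority 98 and the tie was broken alphabetically, which put the DHCP agent first.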

Logs:
2013-06-03 22:04:16    ERROR [quantum.agent.dhcp_agent] Failed reporting state!
Traceback (most recent call last):
  File "/usr/lib/python2.6/site-packages/quantum/agent/dhcp_agent.py", line 700, in _report_state
    self.agent_state)
  File "/usr/lib/python2.6/site-packages/quantum/agent/rpc.py", line 66, in report_state
    topic=self.topic)
  File "/usr/lib/python2.6/site-packages/quantum/openstack/common/rpc/proxy.py", line 80, in call
    return rpc.call(context, self._get_topic(topic), msg, timeout)
  File "/usr/lib/python2.6/site-packages/quantum/openstack/common/rpc/__init__.py", line 140, in call
    return _get_impl().call(CONF, context, topic, msg, timeout)
  File "/usr/lib/python2.6/site-packages/quantum/openstack/common/rpc/impl_qpid.py", line 611, in call
    rpc_amqp.get_connection_pool(conf, Connection))
  File "/usr/lib/python2.6/site-packages/quantum/openstack/common/rpc/amqp.py", line 613, in call
    rv = list(rv)
  File "/usr/lib/python2.6/site-packages/quantum/openstack/common/rpc/amqp.py", line 555, in __iter__
    self.done()
  File "/usr/lib64/python2.6/contextlib.py", line 23, in __exit__
    self.gen.next()
  File "/usr/lib/python2.6/site-packages/quantum/openstack/common/rpc/amqp.py", line 552, in __iter__
    self._iterator.next()
  File "/usr/lib/python2.6/site-packages/quantum/openstack/common/rpc/impl_qpid.py", line 436, in iterconsume
    yield self.ensure(_error_callback, _consume)
  File "/usr/lib/python2.6/site-packages/quantum/openstack/common/rpc/impl_qpid.py", line 380, in ensure
    error_callback(e)
  File "/usr/lib/python2.6/site-packages/quantum/openstack/common/rpc/impl_qpid.py", line 421, in _error_callback
    raise rpc_common.Timeout()
Timeout: Timeout while waiting on RPC response.
2013-06-03 22:04:16  WARNING [quantum.openstack.common.loopingcall] task run outlasted interval by 56.074657 sec
2013-06-03 22:04:16     INFO [quantum.agent.dhcp_agent] Synchronizing state
2013-06-03 22:04:21  WARNING [quantum.openstack.common.loopingcall] task run outlasted interval by 0.879631 sec

With the solution in place, no such backtrace occurs and DHCP operates as expected.

Comment 1 Gary Kotton 2013-06-04 11:50:10 UTC
diff --git a/quantum-ovs-cleanup.init b/quantum-ovs-cleanup.init
index 88f47a5..c150afe 100644
--- a/quantum-ovs-cleanup.init
+++ b/quantum-ovs-cleanup.init
@@ -2,7 +2,7 @@
 #
 # quantum-ovs-cleanup  OpenStack Open vSwitch cleanup utility
 #
-# chkconfig:   - 97 02
+# chkconfig:   - 96 02
 # description: Purge Open vSwitch of the Quantum devices
 ### END INIT INFO
 
diff --git a/quantum-server.init b/quantum-server.init
index 6a33b94..4666bde 100644
--- a/quantum-server.init
+++ b/quantum-server.init
@@ -2,7 +2,7 @@
 #
 # quantum  OpenStack Software Defined Networking Service
 #
-# chkconfig:   - 98 02
+# chkconfig:   - 97 02
 # description: quantum provides an API to  \
 #              * request and configure virtual networks
 ### END INIT INFO
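The first number on the `# chkconfig:` header line is the start priority that becomes the S-prefix of the rc symlink (after the service is re-registered with chkconfig). A minimal illustration of that mapping, parsing the patched header from the diff above (the field-extraction here is an assumption for illustration, not how chkconfig itself parses the line):

```shell
# The patched header line from quantum-server.init:
header='# chkconfig:   - 97 02'

# Fields: '#' 'chkconfig:' '-' '97' '02' -- the 4th field is the start priority.
start=$(echo "$header" | awk '{print $4}')

# That priority becomes the symlink's S-prefix in /etc/rc.d/rcN.d/:
echo "S${start}quantum-server"
```

With priority 97 the server symlink sorts before the S98 agent symlinks, which is the whole fix.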

Comment 3 yfried 2013-06-06 11:09:18 UTC
openstack-quantum-2013.1.1-9.el6ost.noarch

couldn't recreate the problem

allinone setup

cirros image

1. ifdown
2. ifup 

result:
ip is obtained (~1min time)

Comment 4 Rami Vaknin 2013-06-07 20:28:06 UTC
Tested on grizzly running on RHEL 6.4 with openstack-quantum-2013.1.1-10.el6ost (puddle 2013-06-05.2), OpenStack Networking with OVS. It looks like the fix failed, probably because there is no quantum-ovs-cleanup on the machine.

The environment includes two machines: one is a compute node, the other runs quantum-server + l3 + dhcp + all other services, installed with packstack.

The quantum server now starts before quantum-dhcp-agent, but the error still appears in the DHCP agent's log file.

# ll /etc/rc.d/rc`runlevel | awk '{print $NF}'`.d/ | grep quantum
lrwxrwxrwx. 1 root root 24 Jun  7 19:38 S97quantum-server -> ../init.d/quantum-server
lrwxrwxrwx. 1 root root 28 Jun  7 19:37 S98quantum-dhcp-agent -> ../init.d/quantum-dhcp-agent
lrwxrwxrwx. 1 root root 26 Jun  7 19:37 S98quantum-l3-agent -> ../init.d/quantum-l3-agent
lrwxrwxrwx. 1 root root 32 Jun  7 19:37 S98quantum-metadata-agent -> ../init.d/quantum-metadata-agent
lrwxrwxrwx. 1 root root 35 Jun  7 19:38 S98quantum-openvswitch-agent -> ../init.d/quantum-openvswitch-agent


2013-06-07 19:37:41     INFO [quantum.common.config] Logging enabled!
2013-06-07 19:37:42     INFO [quantum.openstack.common.rpc.impl_qpid] Connected to AMQP server on 10.35.160.23:5672
2013-06-07 19:37:42     INFO [quantum.openstack.common.rpc.impl_qpid] Connected to AMQP server on 10.35.160.23:5672
2013-06-07 19:37:42     INFO [quantum.agent.dhcp_agent] DHCP agent started
2013-06-07 19:38:42    ERROR [quantum.agent.dhcp_agent] Failed reporting state!
Traceback (most recent call last):
  File "/usr/lib/python2.6/site-packages/quantum/agent/dhcp_agent.py", line 700, in _report_state
    self.agent_state)
  File "/usr/lib/python2.6/site-packages/quantum/agent/rpc.py", line 66, in report_state
    topic=self.topic)
  File "/usr/lib/python2.6/site-packages/quantum/openstack/common/rpc/proxy.py", line 80, in call
    return rpc.call(context, self._get_topic(topic), msg, timeout)
  File "/usr/lib/python2.6/site-packages/quantum/openstack/common/rpc/__init__.py", line 140, in call
    return _get_impl().call(CONF, context, topic, msg, timeout)
  File "/usr/lib/python2.6/site-packages/quantum/openstack/common/rpc/impl_qpid.py", line 611, in call
    rpc_amqp.get_connection_pool(conf, Connection))
  File "/usr/lib/python2.6/site-packages/quantum/openstack/common/rpc/amqp.py", line 613, in call
    rv = list(rv)
  File "/usr/lib/python2.6/site-packages/quantum/openstack/common/rpc/amqp.py", line 555, in __iter__
    self.done()
  File "/usr/lib64/python2.6/contextlib.py", line 23, in __exit__
    self.gen.next()
  File "/usr/lib/python2.6/site-packages/quantum/openstack/common/rpc/amqp.py", line 552, in __iter__
    self._iterator.next()
  File "/usr/lib/python2.6/site-packages/quantum/openstack/common/rpc/impl_qpid.py", line 436, in iterconsume
    yield self.ensure(_error_callback, _consume)
  File "/usr/lib/python2.6/site-packages/quantum/openstack/common/rpc/impl_qpid.py", line 380, in ensure
    error_callback(e)
  File "/usr/lib/python2.6/site-packages/quantum/openstack/common/rpc/impl_qpid.py", line 421, in _error_callback
    raise rpc_common.Timeout()
Timeout: Timeout while waiting on RPC response.
2013-06-07 19:38:42  WARNING [quantum.openstack.common.loopingcall] task run outlasted interval by 56.10093 sec
2013-06-07 19:38:42     INFO [quantum.agent.dhcp_agent] Synchronizing state
2013-06-07 19:40:31     INFO [quantum.openstack.common.service] Caught SIGTERM, exiting
2013-06-07 19:40:31     INFO [quantum.common.config] Logging enabled!
2013-06-07 19:40:31     INFO [quantum.openstack.common.rpc.impl_qpid] Connected to AMQP server on 10.35.160.23:5672
2013-06-07 19:40:31     INFO [quantum.openstack.common.rpc.impl_qpid] Connected to AMQP server on 10.35.160.23:5672
2013-06-07 19:40:31     INFO [quantum.agent.dhcp_agent] DHCP agent started
2013-06-07 19:40:31     INFO [quantum.agent.dhcp_agent] Synchronizing state
2013-06-07 19:41:39  WARNING [quantum.openstack.common.loopingcall] task run outlasted interval by 0.126473 sec
2013-06-07 20:31:48     INFO [quantum.openstack.common.rpc.impl_qpid] Connected to AMQP server on 10.35.160.23:5672

Comment 5 Rami Vaknin 2013-06-08 21:30:04 UTC
By the way, the same happens in puddle 2013-06-07.1, where quantum-ovs-cleanup is installed; the same error message appears in /var/log/quantum/dhcp-agent.log.

lrwxrwxrwx. 1 root root 29 Jun  8 15:51 S96quantum-ovs-cleanup -> ../init.d/quantum-ovs-cleanup
lrwxrwxrwx. 1 root root 24 Jun  8 15:51 S97quantum-server -> ../init.d/quantum-server
lrwxrwxrwx. 1 root root 28 Jun  8 15:50 S98quantum-dhcp-agent -> ../init.d/quantum-dhcp-agent
lrwxrwxrwx. 1 root root 26 Jun  8 15:50 S98quantum-l3-agent -> ../init.d/quantum-l3-agent
lrwxrwxrwx. 1 root root 32 Jun  8 15:50 S98quantum-metadata-agent -> ../init.d/quantum-metadata-agent
lrwxrwxrwx. 1 root root 35 Jun  8 15:51 S98quantum-openvswitch-agent -> ../init.d/quantum-openvswitch-agent


2013-06-08 15:50:51     INFO [quantum.common.config] Logging enabled!
2013-06-08 15:50:51     INFO [quantum.openstack.common.rpc.impl_qpid] Connected to AMQP server on 10.35.160.23:5672
2013-06-08 15:50:51     INFO [quantum.openstack.common.rpc.impl_qpid] Connected to AMQP server on 10.35.160.23:5672
2013-06-08 15:50:51     INFO [quantum.agent.dhcp_agent] DHCP agent started
2013-06-08 15:51:51    ERROR [quantum.agent.dhcp_agent] Failed reporting state!
Traceback (most recent call last):
  File "/usr/lib/python2.6/site-packages/quantum/agent/dhcp_agent.py", line 700, in _report_state
    self.agent_state)
  File "/usr/lib/python2.6/site-packages/quantum/agent/rpc.py", line 66, in report_state
    topic=self.topic)
  File "/usr/lib/python2.6/site-packages/quantum/openstack/common/rpc/proxy.py", line 80, in call
    return rpc.call(context, self._get_topic(topic), msg, timeout)
  File "/usr/lib/python2.6/site-packages/quantum/openstack/common/rpc/__init__.py", line 140, in call
    return _get_impl().call(CONF, context, topic, msg, timeout)
  File "/usr/lib/python2.6/site-packages/quantum/openstack/common/rpc/impl_qpid.py", line 611, in call
    rpc_amqp.get_connection_pool(conf, Connection))
  File "/usr/lib/python2.6/site-packages/quantum/openstack/common/rpc/amqp.py", line 613, in call
    rv = list(rv)
  File "/usr/lib/python2.6/site-packages/quantum/openstack/common/rpc/amqp.py", line 555, in __iter__
    self.done()
  File "/usr/lib64/python2.6/contextlib.py", line 23, in __exit__
    self.gen.next()
  File "/usr/lib/python2.6/site-packages/quantum/openstack/common/rpc/amqp.py", line 552, in __iter__
    self._iterator.next()
  File "/usr/lib/python2.6/site-packages/quantum/openstack/common/rpc/impl_qpid.py", line 436, in iterconsume
    yield self.ensure(_error_callback, _consume)
  File "/usr/lib/python2.6/site-packages/quantum/openstack/common/rpc/impl_qpid.py", line 380, in ensure
    error_callback(e)
  File "/usr/lib/python2.6/site-packages/quantum/openstack/common/rpc/impl_qpid.py", line 421, in _error_callback
    raise rpc_common.Timeout()
Timeout: Timeout while waiting on RPC response.
2013-06-08 15:51:51  WARNING [quantum.openstack.common.loopingcall] task run outlasted interval by 56.099052 sec
2013-06-08 15:51:51     INFO [quantum.agent.dhcp_agent] Synchronizing state
2013-06-08 15:55:40     INFO [quantum.openstack.common.service] Caught SIGTERM, exiting
2013-06-08 15:55:40     INFO [quantum.common.config] Logging enabled!
2013-06-08 15:55:40     INFO [quantum.openstack.common.rpc.impl_qpid] Connected to AMQP server on 10.35.160.23:5672
2013-06-08 15:55:40     INFO [quantum.openstack.common.rpc.impl_qpid] Connected to AMQP server on 10.35.160.23:5672
2013-06-08 15:55:40     INFO [quantum.agent.dhcp_agent] DHCP agent started
2013-06-08 15:55:40     INFO [quantum.agent.dhcp_agent] Synchronizing state
2013-06-08 15:56:51  WARNING [quantum.openstack.common.loopingcall] task run outlasted interval by 2.732581 sec

Comment 6 Steven Dake 2013-06-08 23:15:32 UTC
Rami,

The main problem is not the error output; it was an inability to obtain a DHCP address in all-in-one mode.  Since you're testing multiple nodes, can you check whether you can (or cannot) obtain a DHCP address?

Comment 7 Rami Vaknin 2013-06-09 05:04:48 UTC
(In reply to Steven Dake from comment #6)
> Rami,
> 
> The main problem is not the error printing, it was an unability to obtain a

Is this traceback part of this bug? Maybe we should separate it into two bugs?

> DHCP address in all in one mode.  Since your testing multiple nodes, can you
> check that you can (or cannot) obtain a DHCP address?

With multiple nodes I do obtain a DHCP address.

Comment 8 Gary Kotton 2013-06-09 11:13:37 UTC
The trace indicates that the DHCP agent is unable to communicate with the Quantum plugin. This may happen when the plugin is restarted or when there is a network error. The qpid messaging layer retries internally until no answer arrives, at which point a Timeout exception is thrown.

"....
  File "/usr/lib/python2.6/site-packages/quantum/openstack/common/rpc/impl_qpid.py", line 380, in ensure
    error_callback(e)
  File "/usr/lib/python2.6/site-packages/quantum/openstack/common/rpc/impl_qpid.py", line 421, in _error_callback
    raise rpc_common.Timeout()
Timeout: Timeout while waiting on RPC response."

This is not related to the issue reported.
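The retry-until-timeout behavior described above can be sketched generically (this is not qpid's actual implementation; `attempt_call` and the attempt limit are stand-ins for illustration):

```shell
# Generic retry-with-deadline sketch (NOT qpid's real internals):
attempt_call() { return 1; }   # stand-in for an RPC attempt that keeps failing
max_attempts=3                 # give up after this many attempts

i=0
while ! attempt_call; do
  i=$((i + 1))
  if [ "$i" -ge "$max_attempts" ]; then
    # Mirrors the message seen in the agent log once retries are exhausted.
    echo "Timeout: Timeout while waiting on RPC response."
    break
  fi
done
```

As long as quantum-server eventually comes up within the retry window, the agent recovers; the log error alone does not imply the ordering bug.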

Comment 9 Rami Vaknin 2013-06-09 11:24:50 UTC
Moving this bug to verified per comments 3 & 8.

Comment 11 errata-xmlrpc 2013-06-11 18:56:55 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

http://rhn.redhat.com/errata/RHBA-2013-0933.html

