Related to #1036523

[root@rhos4-node6 ~]# rpm -q -f /etc/init.d/neutron-openvswitch-agent
openstack-neutron-openvswitch-2013.2-10.el6ost.noarch
[root@rhos4-node6 ~]# /etc/init.d/neutron-openvswitch-agent start
Starting neutron-openvswitch-agent: [ OK ]

2013-12-02 10:08:37.324 16580 ERROR neutron.openstack.common.rpc.amqp [-] Exception during message handling
2013-12-02 10:08:37.324 16580 TRACE neutron.openstack.common.rpc.amqp Traceback (most recent call last):
2013-12-02 10:08:37.324 16580 TRACE neutron.openstack.common.rpc.amqp   File "/usr/lib/python2.6/site-packages/neutron/openstack/common/rpc/amqp.py", line 438, in _process_data
2013-12-02 10:08:37.324 16580 TRACE neutron.openstack.common.rpc.amqp     **args)
2013-12-02 10:08:37.324 16580 TRACE neutron.openstack.common.rpc.amqp   File "/usr/lib/python2.6/site-packages/neutron/common/rpc.py", line 44, in dispatch
2013-12-02 10:08:37.324 16580 TRACE neutron.openstack.common.rpc.amqp     neutron_ctxt, version, method, namespace, **kwargs)
2013-12-02 10:08:37.324 16580 TRACE neutron.openstack.common.rpc.amqp   File "/usr/lib/python2.6/site-packages/neutron/openstack/common/rpc/dispatcher.py", line 172, in dispatch
2013-12-02 10:08:37.324 16580 TRACE neutron.openstack.common.rpc.amqp     result = getattr(proxyobj, method)(ctxt, **kwargs)
2013-12-02 10:08:37.324 16580 TRACE neutron.openstack.common.rpc.amqp   File "/usr/lib/python2.6/site-packages/neutron/db/agents_db.py", line 184, in report_state
2013-12-02 10:08:37.324 16580 TRACE neutron.openstack.common.rpc.amqp     self.plugin.create_or_update_agent(context, agent_state)
2013-12-02 10:08:37.324 16580 TRACE neutron.openstack.common.rpc.amqp   File "/usr/lib/python2.6/site-packages/neutron/db/agents_db.py", line 147, in create_or_update_agent
2013-12-02 10:08:37.324 16580 TRACE neutron.openstack.common.rpc.amqp     context, agent['agent_type'], agent['host'])
2013-12-02 10:08:37.324 16580 TRACE neutron.openstack.common.rpc.amqp   File "/usr/lib/python2.6/site-packages/neutron/db/agents_db.py", line 130, in _get_agent_by_type_and_host
2013-12-02 10:08:37.324 16580 TRACE neutron.openstack.common.rpc.amqp     host=host)
2013-12-02 10:08:37.324 16580 TRACE neutron.openstack.common.rpc.amqp MultipleAgentFoundByTypeHost: Multiple agents with agent_type=Open vSwitch agent and host=rhos4-node6.int.fabbione.net found
2013-12-02 10:08:37.324 16580 TRACE neutron.openstack.common.rpc.amqp
Very simply, the error means there is already an entry in Neutron's agents table with the same agent_type and host fields
mysql> select * from agents;

| id | agent_type | binary | topic | host | admin_state_up | created_at | started_at | heartbeat_timestamp | description | configurations |
|----|------------|--------|-------|------|----------------|------------|------------|---------------------|-------------|----------------|
| 01303100-98cb-459d-912c-ad6a939c87f8 | Open vSwitch agent | neutron-openvswitch-agent | N/A | rhos4-node6.int.fabbione.net | 1 | 2013-11-29 08:52:19 | 2013-11-29 08:52:19 | 2013-11-29 08:52:19 | NULL | {"tunnel_types": [], "tunneling_ip": "", "bridge_mappings": {"eth1": "br-int"}, "l2_population": false, "devices": 0} |
| 2c608971-66ac-488a-9e0f-4074b6e3026b | Open vSwitch agent | neutron-openvswitch-agent | N/A | rhos4-node5.int.fabbione.net | 1 | 2013-11-29 09:09:30 | 2013-11-29 09:09:30 | 2013-11-29 09:09:30 | NULL | {"tunnel_types": [], "tunneling_ip": "", "bridge_mappings": {"eth1": "br-int"}, "l2_population": false, "devices": 0} |
| 3d2c14c7-be01-4a59-86d1-413d3ae4d517 | L3 agent | neutron-l3-agent | l3_agent | rhos4-node6.int.fabbione.net | 1 | 2013-11-29 13:08:50 | 2013-11-29 13:09:37 | 2013-11-29 13:13:34 | NULL | {"router_id": "", "gateway_external_network_id": "", "handle_internal_only_routers": true, "use_namespaces": true, "routers": 1, "interfaces": 0, "floating_ips": 0, "interface_driver": "neutron.agent.linux.interface.OVSInterfaceDriver", "ex_gw_ports": 1} |
| 58021437-7b10-4504-bd74-e5cfb6f76f3f | Open vSwitch agent | neutron-openvswitch-agent | N/A | rhos4-node5.int.fabbione.net | 1 | 2013-11-29 09:09:30 | 2013-11-29 09:09:30 | 2013-11-29 09:09:30 | NULL | {"tunnel_types": [], "tunneling_ip": "", "bridge_mappings": {"eth1": "br-int"}, "l2_population": false, "devices": 0} |
| 6fe9edd7-0b13-49e8-99c7-b9b96ec7ca5b | Open vSwitch agent | neutron-openvswitch-agent | N/A | rhos4-node6.int.fabbione.net | 1 | 2013-11-29 08:52:19 | 2013-11-29 08:52:19 | 2013-11-29 08:52:19 | NULL | {"tunnel_types": [], "tunneling_ip": "", "bridge_mappings": {"eth1": "br-int"}, "l2_population": false, "devices": 0} |
| 73564bf2-293f-4940-bbd5-f741e560661c | DHCP agent | neutron-dhcp-agent | dhcp_agent | rhos4-node5.int.fabbione.net | 1 | 2013-11-28 11:26:44 | 2013-11-29 15:25:57 | 2013-12-02 08:21:44 | NULL | {"subnets": 0, "use_namespaces": true, "dhcp_lease_duration": 86400, "dhcp_driver": "neutron.agent.linux.dhcp.Dnsmasq", "networks": 0, "ports": 0} |
| f43ae6c4-4628-4571-b7a7-c2ee4d86fdba | DHCP agent | neutron-dhcp-agent | dhcp_agent | rhos4-node6.int.fabbione.net | 1 | 2013-11-28 11:26:43 | 2013-12-02 09:12:19 | 2013-12-02 09:49:36 | NULL | {"subnets": 0, "use_namespaces": true, "dhcp_lease_duration": 86400, "dhcp_driver": "neutron.agent.linux.dhcp.Dnsmasq", "networks": 0, "ports": 0} |

7 rows in set (0.00 sec)
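The duplicate rows above are exactly what the failing lookup trips over: `_get_agent_by_type_and_host` expects at most one row per (agent_type, host) pair. A minimal sketch of that logic with sqlite3 (an illustration of the failure mode only, not Neutron's actual code; the table is trimmed to the relevant columns):

```python
import sqlite3

# Miniature stand-in for Neutron's agents table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE agents (id TEXT, agent_type TEXT, host TEXT)")

# Two rows with the same (agent_type, host), as in the broken database above.
conn.execute("INSERT INTO agents VALUES "
             "('01303100', 'Open vSwitch agent', 'rhos4-node6')")
conn.execute("INSERT INTO agents VALUES "
             "('6fe9edd7', 'Open vSwitch agent', 'rhos4-node6')")

def get_agent_by_type_and_host(agent_type, host):
    """Expect zero or one matching row; duplicates are a hard error."""
    rows = conn.execute(
        "SELECT id FROM agents WHERE agent_type = ? AND host = ?",
        (agent_type, host)).fetchall()
    if len(rows) > 1:
        raise RuntimeError("Multiple agents with agent_type=%s and host=%s "
                           "found" % (agent_type, host))
    return rows[0][0] if rows else None

try:
    get_agent_by_type_and_host("Open vSwitch agent", "rhos4-node6")
except RuntimeError as exc:
    print(exc)  # -> Multiple agents with agent_type=Open vSwitch agent and host=rhos4-node6 found
```

Every later RPC call that does this lookup (report_state, sync_routers, ...) then fails the same way until the duplicate rows are removed.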
(In reply to Mark McLoughlin from comment #1)
> Very simply, the error means there is already an entry in Neutron's
> agents table with the same agent_type and host fields

Yes, that's clearly the case, but then the next question is: why have they been added?

The process I am following to set up is here:
http://rhel-ha.etherpad.corp.redhat.com/RHOS-RHEL-HA-how-to

neutron starts around line 650.
I asked Fabio to stop the service, drop the two rows and restart.

The two rows were created with the same timestamp. This could be a race condition, and I'm suspicious of https://review.openstack.org/49439 - could this be the service creating an entry at startup, generating an entry but not saving it, yielding to an RPC call which does the same, context switching back and saving the first entry, then context switching back to the RPC call to save the second entry?
Ok, looks like it is a race, and dropping the entries did regenerate a correct db.

When creating the cluster service, pacemaker will start 2 instances of the dhcp agent in parallel on both machines. It appears that there is no lock around creating db entries (bad), which would lead to a situation like this one.

I'll enable a temporary workaround in the cluster setup, by starting one at a time, but clearly a more general fix is necessary.
(In reply to Fabio Massimo Di Nitto from comment #5)
> Ok, looks like it is a race and dropping the entries did regenerate a
> correct db.
> [...]
> I´ll enable a temporary workaround in the cluster setup, by starting one at
> a time, but clearly a more general fix is necessary.

I can actually reproduce it without the cluster involved:

drop the entries from the mysql table
start the service
check logs
stop the service

and repeat. I can hit it about 90% of the time.
(In reply to Fabio Massimo Di Nitto from comment #6)
> I can actually reproduce it without cluster involved.
> [...]
> and repeat. I can hit it about 90% of the time.

Do I have to have pacemaker configured and start the ovs agent on two machines simultaneously to reproduce this error, or should an all-in-one setup be fine?

Does this link http://rhel-ha.etherpad.corp.redhat.com/RHOS-RHEL-HA-how-to describe the steps to reproduce?
(In reply to Jakub Libosvar from comment #7)
> Must I have configured pacemaker and start ovs agent on two machines
> simultaneously to reproduce this error or allinone setup should be fine?
>
> Does this link http://rhel-ha.etherpad.corp.redhat.com/RHOS-RHEL-HA-how-to
> describe steps to reproduce?

I was able to reproduce it without the cluster. I think the trick is to place mysql on one machine (!= localhost) and neutron somewhere else. Even if you follow the etherpad, which has the exact setup details that I have, you will end up with:

nodeX neutron -> network -> nodeY mysql

and this is a reduced set without the complexity of the cluster. I know for a fact that my mysql is "slow" (I have an I/O perf issue on my system), but it is alive and it works. Probably slow I/O + network latency makes it easier for me to reproduce.

Right now I am running with the patch in comment #4 reverted, but I didn't do a lot of restart tests yet.
It looks like the problem persists even with the patch in comment #4 reverted.

Jan and I exchanged a few emails, but no luck so far.

I have tried all kinds of combinations, but the smallest set to reproduce the problem is:

nodeX neutron -> network -> nodeY mysql

Even starting neutron-server and then neutron-dhcp-agent is sometimes enough to trigger multiple entries in the db.

Now, one question: how does the dhcp-agent update the db? Does it do it directly? Does it talk to the server? Does it use some qpid messaging?
(In reply to Fabio Massimo Di Nitto from comment #9)
> [...]
> now, one question.. how does dhcp-agent update the db? does it do it
> directly? does it talk to server? does it use some qpid messaging?

It communicates with neutron-server via the rpc implementation (qpid in your case); neutron-server is the only service that handles the DB.

Does it mean you have this issue with the dhcp-agent as well? It would make sense, because OVSNeutronAgent and DhcpAgentWithStateReport update the db with their heartbeats the same way.

From looking at the code I suspect the issue is parallel updating of the heartbeat in the neutron-server. To achieve that, there must be two requests to create an agent in the DB for an agent placed on the same host. The agent is not present in the database yet, therefore it will be created by the heartbeat update. I'll try to 'draw' the greenthreads on the server side:

| Request A           | Request B          |
|---------------------|--------------------|
| create_agent        |                    |
| get_agent           |                    |
| AgentNotFound       |                    |
| yield               |                    |
|                     | create_agent       |
|                     | get_agent          |
|                     | AgentNotFound      |
|                     | yield              |
| add_agent_to_db     |                    |
| yield               |                    |
|                     | add_agent_to_db    |
|--------------------------------------------|

What I don't understand so far is where do these rpc calls come from...
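The interleaving above can be sketched deterministically in plain Python: generators stand in for the greenthreads, and each `yield` marks where the real code would yield to the eventlet hub. This is a simulation of the race, not Neutron code.

```python
agents = []  # stand-in for the agents DB table

def report_state(agent_type, host):
    # get_agent: look for an existing row for this (agent_type, host)
    found = [a for a in agents if a == (agent_type, host)]
    yield  # AgentNotFound -> the greenthread yields here (DB/network I/O)
    if not found:
        # add_agent_to_db: no row was seen earlier, so insert one
        agents.append((agent_type, host))
    yield

# Interleave Request A and Request B exactly as in the table above.
a = report_state("Open vSwitch agent", "rhos4-node6")
b = report_state("Open vSwitch agent", "rhos4-node6")
next(a)  # A: get_agent -> AgentNotFound -> yield
next(b)  # B: get_agent -> AgentNotFound -> yield (A's row not inserted yet)
next(a)  # A: add_agent_to_db
next(b)  # B: add_agent_to_db -> duplicate row

print(agents)  # -> two identical ('Open vSwitch agent', 'rhos4-node6') rows
```

Both requests observe an empty table before either one inserts, so both insert, and every subsequent lookup by (agent_type, host) fails with MultipleAgentFoundByTypeHost.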
(In reply to Jakub Libosvar from comment #10)
> [...]
> What I don't understand so far is where do these rpc calls come from...

Ok, I think they might come from qpidd being in cluster mode. I'll try and see if reducing to one instance of qpidd (which runs remote from neutron, on the same host as mysql for me) makes any difference.
I think I found the issue in the neutron-server <-> RPC protocol and have what I think is a faster reproducer.

node A _and_ B are running neutron-server
node C runs both mysql and qpidd (it doesn't matter if they are in cluster mode)

mysql ovs_neutron has no agents defined.

When firing up neutron-dhcp-agent, which is the first to populate the DB, the RPC request is sent to qpidd, and qpidd dispatches the request to both listening neutron-servers. At this point it's a matter of one or two loops and I can see the traceback. In some very rare runs I also saw this:

2013-12-04 14:29:53.008 3259 ERROR neutron.openstack.common.rpc.amqp [-] Exception during message handling
2013-12-04 14:29:53.008 3259 TRACE neutron.openstack.common.rpc.amqp Traceback (most recent call last):
2013-12-04 14:29:53.008 3259 TRACE neutron.openstack.common.rpc.amqp   File "/usr/lib/python2.6/site-packages/neutron/openstack/common/rpc/amqp.py", line 438, in _process_data
2013-12-04 14:29:53.008 3259 TRACE neutron.openstack.common.rpc.amqp     **args)
2013-12-04 14:29:53.008 3259 TRACE neutron.openstack.common.rpc.amqp   File "/usr/lib/python2.6/site-packages/neutron/common/rpc.py", line 44, in dispatch
2013-12-04 14:29:53.008 3259 TRACE neutron.openstack.common.rpc.amqp     neutron_ctxt, version, method, namespace, **kwargs)
2013-12-04 14:29:53.008 3259 TRACE neutron.openstack.common.rpc.amqp   File "/usr/lib/python2.6/site-packages/neutron/openstack/common/rpc/dispatcher.py", line 172, in dispatch
2013-12-04 14:29:53.008 3259 TRACE neutron.openstack.common.rpc.amqp     result = getattr(proxyobj, method)(ctxt, **kwargs)
2013-12-04 14:29:53.008 3259 TRACE neutron.openstack.common.rpc.amqp   File "/usr/lib/python2.6/site-packages/neutron/db/agents_db.py", line 178, in report_state
2013-12-04 14:29:53.008 3259 TRACE neutron.openstack.common.rpc.amqp     self.plugin.create_or_update_agent(context, agent_state)
2013-12-04 14:29:53.008 3259 TRACE neutron.openstack.common.rpc.amqp   File "/usr/lib/python2.6/site-packages/neutron/db/agents_db.py", line 145, in create_or_update_agent
2013-12-04 14:29:53.008 3259 TRACE neutron.openstack.common.rpc.amqp     context, agent['agent_type'], agent['host'])
2013-12-04 14:29:53.008 3259 TRACE neutron.openstack.common.rpc.amqp   File "/usr/lib/python2.6/site-packages/neutron/db/agents_db.py", line 128, in _get_agent_by_type_and_host
2013-12-04 14:29:53.008 3259 TRACE neutron.openstack.common.rpc.amqp     host=host)
2013-12-04 14:29:53.008 3259 TRACE neutron.openstack.common.rpc.amqp MultipleAgentFoundByTypeHost: Multiple agents with agent_type=Open vSwitch agent and host=rhos4-node3.int.fabbione.net found
2013-12-04 14:29:53.008 3259 TRACE neutron.openstack.common.rpc.amqp
2013-12-04 14:29:53.337 3259 ERROR neutron.openstack.common.rpc.amqp [-] Exception during message handling
2013-12-04 14:29:53.337 3259 TRACE neutron.openstack.common.rpc.amqp Traceback (most recent call last):
2013-12-04 14:29:53.337 3259 TRACE neutron.openstack.common.rpc.amqp   File "/usr/lib/python2.6/site-packages/neutron/openstack/common/rpc/amqp.py", line 438, in _process_data
2013-12-04 14:29:53.337 3259 TRACE neutron.openstack.common.rpc.amqp     **args)
2013-12-04 14:29:53.337 3259 TRACE neutron.openstack.common.rpc.amqp   File "/usr/lib/python2.6/site-packages/neutron/common/rpc.py", line 44, in dispatch
2013-12-04 14:29:53.337 3259 TRACE neutron.openstack.common.rpc.amqp     neutron_ctxt, version, method, namespace, **kwargs)
2013-12-04 14:29:53.337 3259 TRACE neutron.openstack.common.rpc.amqp   File "/usr/lib/python2.6/site-packages/neutron/openstack/common/rpc/dispatcher.py", line 172, in dispatch
2013-12-04 14:29:53.337 3259 TRACE neutron.openstack.common.rpc.amqp     result = getattr(proxyobj, method)(ctxt, **kwargs)
2013-12-04 14:29:53.337 3259 TRACE neutron.openstack.common.rpc.amqp   File "/usr/lib/python2.6/site-packages/neutron/db/l3_rpc_base.py", line 56, in sync_routers
2013-12-04 14:29:53.337 3259 TRACE neutron.openstack.common.rpc.amqp     context, host, router_ids)
2013-12-04 14:29:53.337 3259 TRACE neutron.openstack.common.rpc.amqp   File "/usr/lib/python2.6/site-packages/neutron/db/l3_agentschedulers_db.py", line 131, in list_active_sync_routers_on_active_l3_agent
2013-12-04 14:29:53.337 3259 TRACE neutron.openstack.common.rpc.amqp     context, constants.AGENT_TYPE_L3, host)
2013-12-04 14:29:53.337 3259 TRACE neutron.openstack.common.rpc.amqp   File "/usr/lib/python2.6/site-packages/neutron/db/agents_db.py", line 128, in _get_agent_by_type_and_host
2013-12-04 14:29:53.337 3259 TRACE neutron.openstack.common.rpc.amqp     host=host)
2013-12-04 14:29:53.337 3259 TRACE neutron.openstack.common.rpc.amqp MultipleAgentFoundByTypeHost: Multiple agents with agent_type=L3 agent and host=rhos4-node3.int.fabbione.net found
2013-12-04 14:29:53.337 3259 TRACE neutron.openstack.common.rpc.amqp
https://bugs.launchpad.net/ceilometer/+bug/1257293
I'm adding the launchpad bug as an external tracker. Note that there is a neutron review underway (https://review.openstack.org/#/c/58814/) for the bug in question and I've marked the bug as havana-backport-potential in the hopes of making the backport for RHOS 4.0 easier.
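For context on the class of fix involved (I haven't verified the contents of the review linked above): the usual way to close a check-then-insert race like this is a unique constraint on (agent_type, host) plus handling the duplicate-key error as an update, so the loser of the race updates instead of inserting a second row. A sketch with sqlite3, under those assumptions:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# The UNIQUE constraint is what actually closes the race window: the
# database, not the application, rejects the second insert.
conn.execute("""CREATE TABLE agents (
    id INTEGER PRIMARY KEY,
    agent_type TEXT,
    host TEXT,
    heartbeat TEXT,
    UNIQUE (agent_type, host))""")

def create_or_update_agent(agent_type, host, heartbeat):
    """Insert the agent row; if a concurrent request won the race,
    fall back to updating the existing row."""
    try:
        conn.execute(
            "INSERT INTO agents (agent_type, host, heartbeat) "
            "VALUES (?, ?, ?)", (agent_type, host, heartbeat))
    except sqlite3.IntegrityError:
        # Another greenthread/server inserted the row first -- update it.
        conn.execute(
            "UPDATE agents SET heartbeat = ? "
            "WHERE agent_type = ? AND host = ?",
            (heartbeat, agent_type, host))

create_or_update_agent("Open vSwitch agent", "rhos4-node6", "t0")
create_or_update_agent("Open vSwitch agent", "rhos4-node6", "t1")  # racer

count, = conn.execute("SELECT count(*) FROM agents").fetchone()
print(count)  # -> 1 (still exactly one row; heartbeat is now 't1')
```

With the constraint in place it no longer matters how many neutron-servers consume the same report_state RPC: at most one row per (agent_type, host) can ever exist.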
@marun: Since this fix is in Icehouse trunk, should we move this bug to POST? If we want to backport this to 4.0.z (should we?), then we probably need to move the Target Release to 4.0 and target one of our async milestones, no?
Passing this question off to Miguel, since he's already working on it.
Perry, that's partially correct: for RHOS 5.0 there is an upstream fix. But Maru confirmed that this bug can also be reproduced on a single-server instance, making it a general RHOS 4.0 bug. So we may want to clone this bug for 4.0.
I think the correct LP bug id is the one Fabio posted in comment 13. Changing the tracker and setting this bug to ON_QA, since it was merged to the upstream stable/havana branch in December: https://review.openstack.org/#/c/60525/
Verified NVR: openstack-neutron-2013.2.2-5.el6ost.noarch

Followed comment #12 for verification.

1. Used openstack-packstack-2013.2.1-0.28.dev989 to install openstack. Configured neutron-server to be installed on a separate server from mysql and qpid.
2. Manually configured an additional neutron-server on a dedicated node (thanks for the help Jakub!). That server is configured exactly the same as the first neutron-server (with the same DB and RPC IP addresses).
3. Stopped the neutron-dhcp-agent.
4. Deleted the neutron-dhcp-agent from the mysql db:
   # use ovs_neutron
   # delete from agents WHERE agent_type = 'DHCP agent';
5. Re-started the neutron-dhcp-agent.
6. Verified that it was updated in the db:
   # select count(*) from agents WHERE agent_type = 'DHCP agent';
   +----------+
   | count(*) |
   +----------+
   |        1 |
   +----------+
   1 row in set (0.00 sec)
7. Verified that there are no errors in the neutron-dhcp-agent logs or in either neutron-server's logs.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. http://rhn.redhat.com/errata/RHBA-2014-0334.html