Bug 1273680

Summary: HA overcloud with network isolation deployment fails
Product: [Community] RDO Reporter: Alexander Chuzhoy <sasha>
Component: rdo-manager   Assignee: John Trowbridge <jtrowbri>
Status: CLOSED CURRENTRELEASE QA Contact: Shai Revivo <srevivo>
Severity: high Docs Contact:
Priority: high    
Version: Liberty   CC: apevec, chris.brown, ihrachys, jtrowbri, mburns, mcornea, sasha
Target Milestone: ---   
Target Release: Liberty   
Hardware: x86_64   
OS: Linux   
Whiteboard:
Fixed In Version: Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2017-06-18 06:24:14 UTC Type: Bug
Attachments:
Description                   Flags
neutron server log            none
Neutron configuration files   none
Neutron log files             none
sosreport                     none

Description Alexander Chuzhoy 2015-10-21 01:35:16 UTC
Deployment fails: "Resource CREATE failed: resources.ControllerOvercloudServicesDeployment_Step5: resources.ControllerNodesPostDeployment.Error: resources[0]: Deployment to server failed: deploy_status_code: Deployment exited with non-zero status code: 6"



[stack@undercloud ~]$ heat resource-list -n 5 overcloud|grep -v COMPLE
+---------------------------------------------+-----------------------------------------------+---------------------------------------------------+-----------------+---------------------+---------------------------------------------------------------------------------------------------------------+
| resource_name                               | physical_resource_id                          | resource_type                                     | resource_status | updated_time        | stack_name                                                                                                    |
+---------------------------------------------+-----------------------------------------------+---------------------------------------------------+-----------------+---------------------+---------------------------------------------------------------------------------------------------------------+
| ControllerNodesPostDeployment               | abb7659b-008f-4877-940f-d63fb3f1f6f9          | OS::TripleO::ControllerPostDeployment             | CREATE_FAILED   | 2015-10-09T20:44:51 | overcloud                                                                                                     |
| ControllerOvercloudServicesDeployment_Step5 | 5e94bbae-da4f-4748-b7da-378cb248b13e          | OS::Heat::StructuredDeployments                   | CREATE_FAILED   | 2015-10-09T20:57:41 | overcloud-ControllerNodesPostDeployment-77ujcqvdrmro                                                          |
| 0                                           | 35fc43c3-2faf-47dc-b53a-0e18de104a0b          | OS::Heat::StructuredDeployment                    | CREATE_FAILED   | 2015-10-09T21:09:38 | overcloud-ControllerNodesPostDeployment-77ujcqvdrmro-ControllerOvercloudServicesDeployment_Step5-5mjvfzaqqcez |
+---------------------------------------------+-----------------------------------------------+---------------------------------------------------+-----------------+---------------------+---------------------------------------------------------------------------------------------------------------+
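The innermost failed resource (the OS::Heat::StructuredDeployment named 0) is usually where the real error lives. A minimal sketch that pulls the CREATE_FAILED rows out of a table like the one above, assuming the six-column `heat resource-list` layout shown; failed_resources is a hypothetical helper:

```shell
# Read a `heat resource-list` ASCII table on stdin and print
# "resource_name<TAB>physical_resource_id<TAB>stack_name" for each
# row whose resource_status is CREATE_FAILED.
failed_resources() {
  awk -F'|' '
    # Data rows split into 8 fields: empty, name, id, type, status,
    # updated_time, stack_name, empty. Separator lines have no "|".
    NF >= 7 {
      status = $5; gsub(/^ +| +$/, "", status)
      if (status == "CREATE_FAILED") {
        name = $2; id = $3; stack = $7
        gsub(/^ +| +$/, "", name)
        gsub(/^ +| +$/, "", id)
        gsub(/^ +| +$/, "", stack)
        print name "\t" id "\t" stack
      }
    }'
}
```

For example, `heat resource-list -n 5 overcloud | failed_resources`. The physical_resource_id of the last row (35fc43c3-2faf-47dc-b53a-0e18de104a0b here) can then be passed to `heat deployment-show` to look at the deployment's output, assuming your heatclient provides that command.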




Checking /var/log/messages on one of the controllers:


Oct 21 01:28:32 localhost neutron-dhcp-agent: 2015-10-21 01:28:32.839 21103 ERROR neutron.agent.dhcp.agent [req-23b41339-0b69-441f-bc4a-c088161cf920 - - - - -] Failed reporting state!
Oct 21 01:28:32 localhost neutron-dhcp-agent: 2015-10-21 01:28:32.839 21103 ERROR neutron.agent.dhcp.agent Traceback (most recent call last):
Oct 21 01:28:32 localhost neutron-dhcp-agent: 2015-10-21 01:28:32.839 21103 ERROR neutron.agent.dhcp.agent   File "/usr/lib/python2.7/site-packages/neutron/agent/dhcp/agent.py", line 571, in _report_state
Oct 21 01:28:32 localhost neutron-dhcp-agent: 2015-10-21 01:28:32.839 21103 ERROR neutron.agent.dhcp.agent     self.state_rpc.report_state(ctx, self.agent_state, self.use_call)
Oct 21 01:28:32 localhost neutron-dhcp-agent: 2015-10-21 01:28:32.839 21103 ERROR neutron.agent.dhcp.agent   File "/usr/lib/python2.7/site-packages/neutron/agent/rpc.py", line 86, in report_state
Oct 21 01:28:32 localhost neutron-dhcp-agent: 2015-10-21 01:28:32.839 21103 ERROR neutron.agent.dhcp.agent     return method(context, 'report_state', **kwargs)
Oct 21 01:28:32 localhost neutron-dhcp-agent: 2015-10-21 01:28:32.839 21103 ERROR neutron.agent.dhcp.agent   File "/usr/lib/python2.7/site-packages/oslo_messaging/rpc/client.py", line 158, in call
Oct 21 01:28:32 localhost neutron-dhcp-agent: 2015-10-21 01:28:32.839 21103 ERROR neutron.agent.dhcp.agent     retry=self.retry)
Oct 21 01:28:32 localhost neutron-dhcp-agent: 2015-10-21 01:28:32.839 21103 ERROR neutron.agent.dhcp.agent   File "/usr/lib/python2.7/site-packages/oslo_messaging/transport.py", line 90, in _send
Oct 21 01:28:32 localhost neutron-dhcp-agent: 2015-10-21 01:28:32.839 21103 ERROR neutron.agent.dhcp.agent     timeout=timeout, retry=retry)
Oct 21 01:28:32 localhost neutron-dhcp-agent: 2015-10-21 01:28:32.839 21103 ERROR neutron.agent.dhcp.agent   File "/usr/lib/python2.7/site-packages/oslo_messaging/_drivers/amqpdriver.py", line 431, in send
Oct 21 01:28:32 localhost neutron-dhcp-agent: 2015-10-21 01:28:32.839 21103 ERROR neutron.agent.dhcp.agent     retry=retry)
Oct 21 01:28:32 localhost neutron-dhcp-agent: 2015-10-21 01:28:32.839 21103 ERROR neutron.agent.dhcp.agent   File "/usr/lib/python2.7/site-packages/oslo_messaging/_drivers/amqpdriver.py", line 420, in _send
Oct 21 01:28:32 localhost neutron-dhcp-agent: 2015-10-21 01:28:32.839 21103 ERROR neutron.agent.dhcp.agent     result = self._waiter.wait(msg_id, timeout)
Oct 21 01:28:32 localhost neutron-dhcp-agent: 2015-10-21 01:28:32.839 21103 ERROR neutron.agent.dhcp.agent   File "/usr/lib/python2.7/site-packages/oslo_messaging/_drivers/amqpdriver.py", line 318, in wait
Oct 21 01:28:32 localhost neutron-dhcp-agent: 2015-10-21 01:28:32.839 21103 ERROR neutron.agent.dhcp.agent     message = self.waiters.get(msg_id, timeout=timeout)
Oct 21 01:28:32 localhost neutron-dhcp-agent: 2015-10-21 01:28:32.839 21103 ERROR neutron.agent.dhcp.agent   File "/usr/lib/python2.7/site-packages/oslo_messaging/_drivers/amqpdriver.py", line 223, in get
Oct 21 01:28:32 localhost neutron-dhcp-agent: 2015-10-21 01:28:32.839 21103 ERROR neutron.agent.dhcp.agent     'to message ID %s' % msg_id)
Oct 21 01:28:32 localhost neutron-dhcp-agent: 2015-10-21 01:28:32.839 21103 ERROR neutron.agent.dhcp.agent MessagingTimeout: Timed out waiting for a reply to message ID d88ede86381042b3b742cdf9eb03d9b0
Oct 21 01:28:32 localhost neutron-dhcp-agent: 2015-10-21 01:28:32.839 21103 ERROR neutron.agent.dhcp.agent
Oct 21 01:28:32 localhost neutron-dhcp-agent: 2015-10-21 01:28:32.843 21103 WARNING oslo.service.loopingcall [req-23b41339-0b69-441f-bc4a-c088161cf920 - - - - -] Function 'neutron.agent.dhcp.agent.DhcpAgentWithStateReport._report_state' run outlasted interval by 30.01 sec
Oct 21 01:28:33 localhost neutron-openvswitch-agent: 2015-10-21 01:28:33.133 21134 ERROR neutron.plugins.ml2.drivers.openvswitch.agent.ovs_neutron_agent [-] Failed reporting state!
Oct 21 01:28:33 localhost neutron-openvswitch-agent: 2015-10-21 01:28:33.133 21134 ERROR neutron.plugins.ml2.drivers.openvswitch.agent.ovs_neutron_agent Traceback (most recent call last):
Oct 21 01:28:33 localhost neutron-openvswitch-agent: 2015-10-21 01:28:33.133 21134 ERROR neutron.plugins.ml2.drivers.openvswitch.agent.ovs_neutron_agent   File "/usr/lib/python2.7/site-packages/neutron/plugins/ml2/drivers/openvswitch/agent/ovs_neutron_agent.py", line 330, in _report_state
Oct 21 01:28:33 localhost neutron-openvswitch-agent: 2015-10-21 01:28:33.133 21134 ERROR neutron.plugins.ml2.drivers.openvswitch.agent.ovs_neutron_agent     self.use_call)
Oct 21 01:28:33 localhost neutron-openvswitch-agent: 2015-10-21 01:28:33.133 21134 ERROR neutron.plugins.ml2.drivers.openvswitch.agent.ovs_neutron_agent   File "/usr/lib/python2.7/site-packages/neutron/agent/rpc.py", line 86, in report_state
Oct 21 01:28:33 localhost neutron-openvswitch-agent: 2015-10-21 01:28:33.133 21134 ERROR neutron.plugins.ml2.drivers.openvswitch.agent.ovs_neutron_agent     return method(context, 'report_state', **kwargs)
Oct 21 01:28:33 localhost neutron-openvswitch-agent: 2015-10-21 01:28:33.133 21134 ERROR neutron.plugins.ml2.drivers.openvswitch.agent.ovs_neutron_agent   File "/usr/lib/python2.7/site-packages/oslo_messaging/rpc/client.py", line 158, in call
Oct 21 01:28:33 localhost neutron-openvswitch-agent: 2015-10-21 01:28:33.133 21134 ERROR neutron.plugins.ml2.drivers.openvswitch.agent.ovs_neutron_agent     retry=self.retry)
Oct 21 01:28:33 localhost neutron-openvswitch-agent: 2015-10-21 01:28:33.133 21134 ERROR neutron.plugins.ml2.drivers.openvswitch.agent.ovs_neutron_agent   File "/usr/lib/python2.7/site-packages/oslo_messaging/transport.py", line 90, in _send
Oct 21 01:28:33 localhost neutron-openvswitch-agent: 2015-10-21 01:28:33.133 21134 ERROR neutron.plugins.ml2.drivers.openvswitch.agent.ovs_neutron_agent     timeout=timeout, retry=retry)
Oct 21 01:28:33 localhost neutron-openvswitch-agent: 2015-10-21 01:28:33.133 21134 ERROR neutron.plugins.ml2.drivers.openvswitch.agent.ovs_neutron_agent   File "/usr/lib/python2.7/site-packages/oslo_messaging/_drivers/amqpdriver.py", line 431, in send
Oct 21 01:28:33 localhost neutron-openvswitch-agent: 2015-10-21 01:28:33.133 21134 ERROR neutron.plugins.ml2.drivers.openvswitch.agent.ovs_neutron_agent     retry=retry)
Oct 21 01:28:33 localhost neutron-openvswitch-agent: 2015-10-21 01:28:33.133 21134 ERROR neutron.plugins.ml2.drivers.openvswitch.agent.ovs_neutron_agent   File "/usr/lib/python2.7/site-packages/oslo_messaging/_drivers/amqpdriver.py", line 420, in _send
Oct 21 01:28:33 localhost neutron-openvswitch-agent: 2015-10-21 01:28:33.133 21134 ERROR neutron.plugins.ml2.drivers.openvswitch.agent.ovs_neutron_agent     result = self._waiter.wait(msg_id, timeout)
Oct 21 01:28:33 localhost neutron-openvswitch-agent: 2015-10-21 01:28:33.133 21134 ERROR neutron.plugins.ml2.drivers.openvswitch.agent.ovs_neutron_agent   File "/usr/lib/python2.7/site-packages/oslo_messaging/_drivers/amqpdriver.py", line 318, in wait
Oct 21 01:28:33 localhost neutron-openvswitch-agent: 2015-10-21 01:28:33.133 21134 ERROR neutron.plugins.ml2.drivers.openvswitch.agent.ovs_neutron_agent     message = self.waiters.get(msg_id, timeout=timeout)
Oct 21 01:28:33 localhost neutron-openvswitch-agent: 2015-10-21 01:28:33.133 21134 ERROR neutron.plugins.ml2.drivers.openvswitch.agent.ovs_neutron_agent   File "/usr/lib/python2.7/site-packages/oslo_messaging/_drivers/amqpdriver.py", line 223, in get
Oct 21 01:28:33 localhost neutron-openvswitch-agent: 2015-10-21 01:28:33.133 21134 ERROR neutron.plugins.ml2.drivers.openvswitch.agent.ovs_neutron_agent     'to message ID %s' % msg_id)
Oct 21 01:28:33 localhost neutron-openvswitch-agent: 2015-10-21 01:28:33.133 21134 ERROR neutron.plugins.ml2.drivers.openvswitch.agent.ovs_neutron_agent MessagingTimeout: Timed out waiting for a reply to message ID 9fa9942aebb0487c9f79012ead8e147e
Oct 21 01:28:33 localhost neutron-openvswitch-agent: 2015-10-21 01:28:33.133 21134 ERROR neutron.plugins.ml2.drivers.openvswitch.agent.ovs_neutron_agent
Oct 21 01:28:33 localhost neutron-openvswitch-agent: 2015-10-21 01:28:33.135 21134 WARNING oslo.service.loopingcall [-] Function 'neutron.plugins.ml2.drivers.openvswitch.agent.ovs_neutron_agent.OVSNeutronAgent._report_state' run outlasted interval by 30.01 sec
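Both tracebacks have the same shape: the agent's periodic report_state RPC call gets no reply (MessagingTimeout), and the looping call then overruns its interval by about 30 seconds. A rough sketch for pulling every such timeout out of a full syslog, assuming the `<host> <agent>: ... message ID <hex>` line format shown above (rpc_timeouts is a hypothetical helper, not part of neutron):

```shell
# Read syslog lines on stdin and print "<agent> <message-id>" for every
# "Timed out waiting for a reply to message ID ..." line logged by a
# neutron-* process (e.g. neutron-dhcp-agent, neutron-openvswitch-agent).
rpc_timeouts() {
  sed -n 's/^.* \(neutron-[a-z-]*\): .*Timed out waiting for a reply to message ID \([0-9a-f]*\).*$/\1 \2/p'
}
```

For example, `rpc_timeouts < /var/log/messages | awk '{print $1}' | sort | uniq -c` counts timeouts per agent. This only shows that replies are missing; it cannot tell a silent neutron-server from a broken broker.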




Steps to reproduce:
Attempt to deploy an HA overcloud with network isolation.



Result:
Stack failed with status: Resource CREATE failed: resources.ControllerOvercloudServicesDeployment_Step5: resources.ControllerNodesPostDeployment.Error: resources[0]: Deployment to server failed: deploy_status_code: Deployment exited with non-zero status code: 6
Heat Stack create failed.


Expected result:
Successful deployment.

Comment 1 Ihar Hrachyshka 2015-10-21 09:09:56 UTC
It could be, e.g., neutron-server failing to reply, the AMQP broker failing to start, or anything in between. Without the full logs it's hard to say; /var/log/messages snippets are NOT enough to make a clear call.

Comment 2 Marius Cornea 2015-10-21 09:49:08 UTC
Created attachment 1085060 [details]
neutron server log

Attaching the neutron server.log from the first controller in Sasha's environment.

Neutron related packages:
python-neutron-lbaas-7.0.0-1.el7.noarch
openstack-neutron-openvswitch-7.0.0-2.el7.noarch
python-neutron-7.0.0-2.el7.noarch
openstack-neutron-7.0.0-2.el7.noarch
python-neutronclient-3.1.0-1.el7.noarch
openstack-neutron-common-7.0.0-2.el7.noarch
openstack-neutron-lbaas-7.0.0-1.el7.noarch
openstack-neutron-ml2-7.0.0-2.el7.noarch
openstack-neutron-metering-agent-7.0.0-2.el7.noarch

Comment 3 Ihar Hrachyshka 2015-10-21 10:28:26 UTC
OK, the logs suggest a configuration issue in the server. Please show what core_plugin is set to.

Also, please attach all logs, not a single snippet. I don't see debug logs that could show which configuration the server sees.

Overall, let's try to avoid the churn and provide logs and configs in advance without waiting for devs to request them.
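To answer the core_plugin question quickly, a small INI lookup can help. It is a sketch: it ignores continuation lines and assumes core_plugin sits in [DEFAULT], where neutron reads it; if crudini is installed, `crudini --get /etc/neutron/neutron.conf DEFAULT core_plugin` should do the same. ini_get is a hypothetical helper:

```shell
# Print the value of KEY in [SECTION] of an INI-style file such as
# /etc/neutron/neutron.conf. Commented-out lines are skipped and the
# last occurrence wins. Returns 1 if the key is not found.
# Usage: ini_get FILE SECTION KEY
ini_get() {
  awk -v s="[$2]" -v k="$3" '
    /^[ \t]*[#;]/ { next }                          # skip comments
    /^[ \t]*\[/   { gsub(/[ \t]/, ""); in_s = ($0 == s); next }
    in_s && index($0, "=") {
      key = substr($0, 1, index($0, "=") - 1); gsub(/[ \t]/, "", key)
      if (key == k) {
        val = substr($0, index($0, "=") + 1)
        gsub(/^[ \t]+|[ \t]+$/, "", val)
        found = val; have = 1
      }
    }
    END { if (have) print found; exit !have }' "$1"
}
```

For example, `ini_get /etc/neutron/neutron.conf DEFAULT core_plugin`.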

Comment 4 Alan Pevec (Fedora) 2015-10-21 10:32:56 UTC
Was the Cisco networking plugin installed?
The last time we saw "ValueError: Empty module name" while neutron was loading plugins, the Kilo Cisco plugin was installed (it has a duplicate ml2 entry point that confuses stevedore).
The Cisco plugin is NOT Liberty-compatible yet and must not be installed.
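One way to look for the duplicate entry point Alan describes is to scan every installed entry_points.txt for the same name in the same group. The comment doesn't say which group the Cisco plugin duplicates, so the group is a parameter; neutron registers its ml2 core plugin under neutron.core_plugins, which is a reasonable first guess. find_entry_point is a hypothetical helper:

```shell
# List every entry_points.txt under DIR that defines entry point NAME in
# section [GROUP], printing "file: line". More than one hit means two
# installed distributions register the same name.
# Usage: find_entry_point DIR GROUP NAME
find_entry_point() {
  local dir=$1 group=$2 name=$3
  find "$dir" -name entry_points.txt | sort | while read -r f; do
    awk -v g="[$group]" -v n="$name" -v f="$f" '
      /^\[/ { in_group = ($0 == g); next }          # track current section
      in_group {
        key = $0; sub(/=.*/, "", key); gsub(/[ \t]/, "", key)
        if (key == n) print f ": " $0
      }' "$f"
  done
}
```

For example, `find_entry_point /usr/lib/python2.7/site-packages neutron.core_plugins ml2`.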

Comment 5 Marius Cornea 2015-10-21 10:36:02 UTC
Created attachment 1085078 [details]
Neutron configuration files

Attaching the neutron configuration files from /etc/neutron.

Comment 6 Alan Pevec (Fedora) 2015-10-21 10:36:51 UTC
What Ihar said; also attach the rpm -qa output or even a full sosreport (which might be too big for a BZ attachment, so upload it somewhere, e.g. a personal fedorapeople.org page).
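A sketch of gathering what was asked for here (configs, logs, and the package list) into one tarball up front. collect_neutron_debug and its optional ROOT argument are made up for this example, and sosreport remains the more complete option:

```shell
# Bundle /etc/neutron, /var/log/neutron and `rpm -qa` output into OUT.
# ROOT defaults to / and only exists so the function can be tried on a
# scratch tree; it is not part of any RDO tooling.
# Usage: collect_neutron_debug OUT [ROOT]
collect_neutron_debug() {
  local out=$1 root=${2:-/}
  local work
  work=$(mktemp -d) || return 1
  mkdir -p "$work/neutron-debug"
  # Copy configs and logs if present, keeping the directory layout.
  for d in etc/neutron var/log/neutron; do
    if [ -d "$root/$d" ]; then
      mkdir -p "$work/neutron-debug/$(dirname "$d")"
      cp -R "$root/$d" "$work/neutron-debug/$d"
    fi
  done
  # Package list, only when rpm is available.
  if command -v rpm >/dev/null 2>&1; then
    rpm -qa | sort > "$work/neutron-debug/rpm-qa.txt"
  fi
  tar -czf "$out" -C "$work" neutron-debug
  rm -rf "$work"
}
```

Run it as root on the controller, e.g. `collect_neutron_debug /tmp/neutron-debug.tgz`, and attach the result.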

Comment 7 Marius Cornea 2015-10-21 10:37:02 UTC
Created attachment 1085079 [details]
Neutron log files

Attaching the Neutron log files from /var/log/neutron.

Comment 8 Marius Cornea 2015-10-21 10:38:30 UTC
[root@overcloud-controller-0 ~]# rpm -qa | grep cisco
python-networking-cisco-2015.1.0-1.el7.noarch
fence-agents-cisco-ucs-4.0.11-13.el7_1.2.x86_64
fence-agents-cisco-mds-4.0.11-13.el7_1.2.x86_64
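The cisco package's version string already gives it away: Kilo packages used date-based versions (2015.1.x), while Liberty neutron is 7.0.0. A heuristic filter for spotting leftover Kilo packages in `rpm -qa` output; kilo_versioned is hypothetical, and the pattern will not catch every Kilo-era package:

```shell
# Read `rpm -qa` output on stdin and print packages whose version uses
# the Kilo date-based scheme (2015.1 or 2015.1.x). Heuristic only.
kilo_versioned() {
  grep -E -- '-2015\.1(\.[0-9]+)*-[^-]+$'
}
```

For example, `rpm -qa | kilo_versioned` would have printed python-networking-cisco-2015.1.0-1.el7.noarch here.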

Comment 9 Ihar Hrachyshka 2015-10-21 10:38:45 UTC
OK, I see cisco config files in the config tarball. I believe that's the issue Alan mentioned.

Comment 10 Marius Cornea 2015-10-21 10:50:44 UTC
Created attachment 1085084 [details]
sosreport

Attaching the sosreport.

Comment 11 Alan Pevec (Fedora) 2015-10-21 11:13:11 UTC
The overcloud controller has the Kilo repo enabled, which should not be the case!
I'm also not sure where centos-cloud-rdo is coming from; that repo is not defined in either the Cloud SIG or the rdo-release RPM:

base/7/x86_64       CentOS-7 - Base
centos-cloud-rdo    CentOS Cloud RDO
extras/7/x86_64     CentOS-7 - Extras
openstack-kilo      OpenStack Kilo Repository
updates/7/x86_64    CentOS-7 - Updates
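A quick way to script this check on a controller or image, assuming the usual `yum repolist` layout with the repo id in the first column (check_no_kilo_repo is a hypothetical helper):

```shell
# Read `yum repolist` output on stdin and report any repo id that
# mentions kilo. Returns 1 if one is found, 0 otherwise.
check_no_kilo_repo() {
  awk 'tolower($1) ~ /kilo/ { print "unexpected Kilo repo: " $1; bad = 1 }
       END { exit bad }'
}
```

For example, `yum repolist | check_no_kilo_repo` would have flagged openstack-kilo in the listing above.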

Please rebuild the images using only the steps from https://etherpad.openstack.org/p/RDO-Manager_liberty:
yum install -y http://rdoproject.org/repos/openstack-liberty/rdo-release-liberty.rpm
export NODE_DIST=centos7
export RDO_RELEASE='liberty'
openstack overcloud image build --all

Comment 12 John Trowbridge 2015-10-21 12:32:34 UTC
This is at least related to, and I would probably say actually a duplicate of, https://bugzilla.redhat.com/show_bug.cgi?id=1271200

My last comment from the above BZ for completeness:

This is fixed in the liberty-testing repo
http://buildlogs.centos.org/centos/7/cloud/x86_64/openstack-liberty/

with python-tripleoclient-0.0.11-3
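To tell whether a machine already has that build, a comparison along these lines works as a sketch. It leans on GNU `sort -V`, which approximates but is not identical to RPM's own version comparison; has_tripleoclient_fix is a hypothetical name:

```shell
# Succeed if VERSION-RELEASE is at least 0.0.11-3, the
# python-tripleoclient build mentioned above.
# Usage: has_tripleoclient_fix VERSION-RELEASE
has_tripleoclient_fix() {
  local have=$1 want=0.0.11-3
  # The fix is present when "want" sorts first (i.e. have >= want).
  [ "$(printf '%s\n%s\n' "$want" "$have" | sort -V | head -n1)" = "$want" ]
}
```

For example, `has_tripleoclient_fix "$(rpm -q --qf '%{VERSION}-%{RELEASE}' python-tripleoclient)"`.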

However, I am going to leave it assigned until the fix is also in delorean.

In order to use the release repo when building images, it is now necessary to:
`export RDO_RELEASE=<release>`

so for liberty:
`export RDO_RELEASE='liberty'`

By default it will use kilo.
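Since the fallback is kilo, a small guard before building avoids silently producing Kilo images again. This mirrors the behaviour described above rather than the tripleoclient source; both function names are hypothetical:

```shell
# Release the image build would use, per the default described above.
effective_rdo_release() {
  printf '%s\n' "${RDO_RELEASE:-kilo}"
}

# Fail unless RDO_RELEASE is set explicitly.
require_rdo_release() {
  if [ -z "${RDO_RELEASE:-}" ]; then
    echo "RDO_RELEASE is unset; image build would default to kilo" >&2
    return 1
  fi
}
# Example:
#   export RDO_RELEASE='liberty'
#   require_rdo_release && openstack overcloud image build --all
```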

Comment 14 Christopher Brown 2017-06-17 17:21:53 UTC
This was fixed a while back, so it can be closed.