Bug 1629659 - Every new instance gets ERROR state
Summary: Every new instance gets ERROR state
Keywords:
Status: CLOSED DUPLICATE of bug 1631335
Alias: None
Product: Red Hat OpenStack
Classification: Red Hat
Component: openstack-neutron
Version: 14.0 (Rocky)
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: unspecified
Target Milestone: ---
Target Release: ---
Assignee: Assaf Muller
QA Contact: Roee Agiman
URL:
Whiteboard:
Depends On:
Blocks: 1629449
 
Reported: 2018-09-17 09:17 UTC by Sasha Smolyak
Modified: 2022-03-13 15:40 UTC
CC: 16 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2018-10-01 14:06:39 UTC
Target Upstream Version:
Embargoed:




Links:
Red Hat Issue Tracker OSP-13648 (last updated 2022-03-13 15:40:53 UTC)

Description Sasha Smolyak 2018-09-17 09:17:24 UTC
Description of problem:
Trying to launch an instance on the overcloud:
openstack server create --image centos-7-image --flavor m1.extra_tiny MyFirstInstance

where centos-7-image exists in the list of images: 
(overcloud) [stack@undercloud-0 ~]$ openstack image list
+--------------------------------------+-----------------------------------------------------------+--------+
| ID                                   | Name                                                      | Status |
+--------------------------------------+-----------------------------------------------------------+--------+
| 9770eef5-85b7-4a8c-8720-a55c3fa8047e | centos-7-image                                            | active |

and the flavor is in the list of flavors:
(overcloud) [stack@undercloud-0 ~]$ openstack flavor list
+--------------------------------------+---------------+-----+------+-----------+-------+-----------+
| ID                                   | Name          | RAM | Disk | Ephemeral | VCPUs | Is Public |
+--------------------------------------+---------------+-----+------+-----------+-------+-----------+
| 5af18f4b-a5db-4b7f-a7aa-1002b036aa81 | m1.extra_tiny | 256 |    0 |         0 |     1 | True      |

When checking the build progress with nova list, the instance cycles through:
BUILD  | scheduling
then 
BUILD  | spawning 
then 
BUILD  | block_device_mapping
then once again 
BUILD  | spawning
and finally ERROR
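
For reference, the same status / task_state progression can also be followed with the unified client (a minimal sketch; the OS-EXT-STS:task_state column name is assumed to match this openstackclient version):

(overcloud) [stack@undercloud-0 ~]$ watch -n 5 "openstack server show MyFirstInstance -c status -c OS-EXT-STS:task_state"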

From nova-conductor.log on controller:

[heat-admin@controller-0 ~]$ sudo cat /var/log/containers/nova/nova-conductor.log | grep ERROR
2018-09-17 07:21:14.320 25 ERROR nova.scheduler.utils [req-f7ff1c5a-55ed-423e-b201-db38dd24540d 5924a12fc9df4f23a248e2f1e7407a77 9038d8856b76418aa949b54ebe8d8d8e - default default] [instance: d8f7c85e-827d-4432-a03b-f31cd3dab8c0] Error from last host: compute-1.localdomain (node compute-1.localdomain): [u'Traceback (most recent call last):\n', u'  File "/usr/lib/python2.7/site-packages/nova/compute/manager.py", line 1919, in _do_build_and_run_instance\n    filter_properties, request_spec)\n', u'  File "/usr/lib/python2.7/site-packages/nova/compute/manager.py", line 2208, in _build_and_run_instance\n    instance_uuid=instance.uuid, reason=six.text_type(e))\n', u'RescheduledException: Build of instance d8f7c85e-827d-4432-a03b-f31cd3dab8c0 was re-scheduled: Binding failed for port 30213bbe-8304-4f11-bfb2-71ef0da045c3, please check neutron logs for more information.\n']: OSError: [Errno 2] No such file or directory: '/var/log/nova/nova-conductor.log'
2018-09-17 07:34:30.371 27 ERROR nova.scheduler.utils [req-fe239f32-0f1b-43ca-a229-7213fd3cf339 5924a12fc9df4f23a248e2f1e7407a77 9038d8856b76418aa949b54ebe8d8d8e - default default] [instance: fcddfc0e-2e3e-45d3-94a5-ca9f5a3cb2b5] Error from last host: compute-0.localdomain (node compute-0.localdomain): [u'Traceback (most recent call last):\n', u'  File "/usr/lib/python2.7/site-packages/nova/compute/manager.py", line 1919, in _do_build_and_run_instance\n    filter_properties, request_spec)\n', u'  File "/usr/lib/python2.7/site-packages/nova/compute/manager.py", line 2208, in _build_and_run_instance\n    instance_uuid=instance.uuid, reason=six.text_type(e))\n', u'RescheduledException: Build of instance fcddfc0e-2e3e-45d3-94a5-ca9f5a3cb2b5 was re-scheduled: Binding failed for port adac0d63-d495-4da3-b877-1f8e1651be57, please check neutron logs for more information.\n']
2018-09-17 07:34:30.374 27 WARNING nova.scheduler.utils [req-fe239f32-0f1b-43ca-a229-7213fd3cf339 5924a12fc9df4f23a248e2f1e7407a77 9038d8856b76418aa949b54ebe8d8d8e - default default] [instance: fcddfc0e-2e3e-45d3-94a5-ca9f5a3cb2b5] Setting instance to ERROR state.: MaxRetriesExceeded: Exceeded maximum number of retries. Exhausted all hosts available for retrying build failures for instance fcddfc0e-2e3e-45d3-94a5-ca9f5a3cb2b5.
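
For reference, the "Binding failed" for the port named above can be confirmed directly from the overcloud, assuming the port still exists after the failed boot (a minimal sketch; column names such as binding_vif_type are assumed to match this openstackclient version). A port whose binding failed typically shows status DOWN and binding_vif_type binding_failed:

(overcloud) [stack@undercloud-0 ~]$ openstack port show 30213bbe-8304-4f11-bfb2-71ef0da045c3 -c status -c binding_vif_type -c binding_host_id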


Version-Release number of selected component (if applicable):
openstack-nova-common-18.0.1-0.20180830232802.43f6f74.el7ost.noarch
python2-novaclient-11.0.0-0.20180809174649.f1005ce.el7ost.noarch
puppet-nova-13.3.1-0.20180831195236.ce0efbe.el7ost.noarch
python-nova-18.0.1-0.20180830232802.43f6f74.el7ost.noarch
openstack-nova-conductor-18.0.1-0.20180830232802.43f6f74.el7ost.noarch


How reproducible:
100%

Steps to Reproduce:
1. openstack image create --disk-format qcow2 --container-format bare   --public --file CentOS-7-x86_64-GenericCloud-1509.qcow2 centos-7-image
2. openstack flavor create --public m1.extra_tiny --id auto --ram 256 --disk 0 --vcpus 1
3. openstack server create --image centos-7-image --flavor m1.extra_tiny MyFirstInstance

Actual results:
instance gets ERROR state
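
For reference, the underlying fault message can usually be read back once the instance is in ERROR (assuming the caller has admin rights so the fault column is populated):

(overcloud) [stack@undercloud-0 ~]$ openstack server show MyFirstInstance -c status -c fault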

Expected results:
instance is active

Additional info:

Comment 1 Joe H. Rahme 2018-09-17 10:09:24 UTC
Relevant entries from the neutron logs (/var/log/containers/neutron/server.log)


2018-09-17 07:34:22.028 3390 WARNING neutron.plugins.ml2.drivers.mech_agent [req-056e189f-b380-49ef-bb01-15bd4ac53802 2970b99fec0e47ff96cd9595b6a77140 b810f284437040d5ad156d1b7968c581 - default default] Refusing to bind port adac0d63-d495-4da3-b877-1f8e1651be57 to dead agent: {'binary': u'neutron-openvswitch-agent', 'description': None, 'availability_zone': None, 'heartbeat_timestamp': datetime.datetime(2018, 9, 17, 0, 0, 38), 'admin_state_up': True, 'alive': False, 'topic': u'N/A', 'host': u'compute-0.localdomain', 'agent_type': u'Open vSwitch agent', 'resource_versions': {u'Subnet': u'1.0', u'Log': u'1.0', u'PortForwarding': u'1.0', u'SubPort': u'1.0', u'SecurityGroup': u'1.0', u'SecurityGroupRule': u'1.0', u'Trunk': u'1.1', u'QosPolicy': u'1.7', u'Port': u'1.4', u'Network': u'1.0'}, 'created_at': datetime.datetime(2018, 9, 16, 13, 28, 37), 'started_at': datetime.datetime(2018, 9, 16, 13, 28, 37), 'id': '385285f8-c6a1-44ab-bbaa-caed78e5bec5', 'configurations': {u'ovs_hybrid_plug': True, u'in_distributed_mode': False, u'datapath_type': u'system', u'arp_responder_enabled': False, u'tunneling_ip': u'172.17.2.12', u'vhostuser_socket_dir': u'/var/run/openvswitch', u'devices': 0, u'ovs_capabilities': {u'datapath_types': [u'netdev', u'system'], u'iface_types': [u'erspan', u'geneve', u'gre', u'internal', u'ip6erspan', u'ip6gre', u'lisp', u'patch', u'stt', u'system', u'tap', u'vxlan']}, u'extensions': [u'qos'], u'l2_population': False, u'tunnel_types': [u'vxlan'], u'log_agent_heartbeats': False, u'enable_distributed_routing': False, u'bridge_mappings': {u'datacentre': u'br-ex', u'tenant': u'br-isolated'}}}

2018-09-17 07:34:22.028 3390 ERROR neutron.plugins.ml2.managers [req-056e189f-b380-49ef-bb01-15bd4ac53802 2970b99fec0e47ff96cd9595b6a77140 b810f284437040d5ad156d1b7968c581 - default default] Failed to bind port adac0d63-d495-4da3-b877-1f8e1651be57 on host compute-0.localdomain for vnic_type normal using segments [{'network_id': '5cfe4153-3901-4e38-960e-1b18d3a26c9e', 'segmentation_id': None, 'physical_network': u'datacentre', 'id': '62d3dbd9-4fac-425a-97ee-d74d5b583a35', 'network_type': u'flat'}]
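
For reference, the "dead agent" state reported above can be checked from the overcloud with a quick sketch like the following; the Alive column for the Open vSwitch agent on compute-0.localdomain is expected to show XXX (not alive) while the agent misses heartbeats, which matches the refused binding:

(overcloud) [stack@undercloud-0 ~]$ openstack network agent list | grep compute-0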

Comment 2 Sasha Smolyak 2018-09-17 11:20:05 UTC
Sosreport of the controller:
https://drive.google.com/open?id=1CKIzY4NIPSVEQg9d128mQnUyngneD2vH

Comment 3 Joe H. Rahme 2018-09-17 16:13:18 UTC
Looking at the logs I believe there's something wrong with Neutron. It seems like all the agents are stuck in a loop waiting for a reply. Here are some extracts from the logs on the controller:


/var/log/containers/neutron/dhcp-agent.log
------------------------------------------

2018-09-17 00:05:05.580 113737 ERROR neutron.agent.dhcp.agent [req-5bea43f3-d6d0-405d-a225-6aeccdcb5479 - - - - -] Failed reporting state!: MessagingTimeout: Timed out waiting for a reply to message ID 46d155e8c3b74cc7b30d21a2ffd07944
2018-09-17 00:05:05.580 113737 ERROR neutron.agent.dhcp.agent Traceback (most recent call last):
2018-09-17 00:05:05.580 113737 ERROR neutron.agent.dhcp.agent   File "/usr/lib/python2.7/site-packages/neutron/agent/dhcp/agent.py", line 762, in _report_state
2018-09-17 00:05:05.580 113737 ERROR neutron.agent.dhcp.agent     ctx, self.agent_state, True)
2018-09-17 00:05:05.580 113737 ERROR neutron.agent.dhcp.agent   File "/usr/lib/python2.7/site-packages/neutron/agent/rpc.py", line 97, in report_state
2018-09-17 00:05:05.580 113737 ERROR neutron.agent.dhcp.agent     return method(context, 'report_state', **kwargs)
2018-09-17 00:05:05.580 113737 ERROR neutron.agent.dhcp.agent   File "/usr/lib/python2.7/site-packages/oslo_messaging/rpc/client.py", line 179, in call
2018-09-17 00:05:05.580 113737 ERROR neutron.agent.dhcp.agent     retry=self.retry)
2018-09-17 00:05:05.580 113737 ERROR neutron.agent.dhcp.agent   File "/usr/lib/python2.7/site-packages/oslo_messaging/transport.py", line 133, in _send
2018-09-17 00:05:05.580 113737 ERROR neutron.agent.dhcp.agent     retry=retry)
2018-09-17 00:05:05.580 113737 ERROR neutron.agent.dhcp.agent   File "/usr/lib/python2.7/site-packages/oslo_messaging/_drivers/amqpdriver.py", line 584, in send
2018-09-17 00:05:05.580 113737 ERROR neutron.agent.dhcp.agent     call_monitor_timeout, retry=retry)
2018-09-17 00:05:05.580 113737 ERROR neutron.agent.dhcp.agent   File "/usr/lib/python2.7/site-packages/oslo_messaging/_drivers/amqpdriver.py", line 573, in _send
2018-09-17 00:05:05.580 113737 ERROR neutron.agent.dhcp.agent     call_monitor_timeout)
2018-09-17 00:05:05.580 113737 ERROR neutron.agent.dhcp.agent   File "/usr/lib/python2.7/site-packages/oslo_messaging/_drivers/amqpdriver.py", line 459, in wait
2018-09-17 00:05:05.580 113737 ERROR neutron.agent.dhcp.agent     message = self.waiters.get(msg_id, timeout=timeout)
2018-09-17 00:05:05.580 113737 ERROR neutron.agent.dhcp.agent   File "/usr/lib/python2.7/site-packages/oslo_messaging/_drivers/amqpdriver.py", line 336, in get
2018-09-17 00:05:05.580 113737 ERROR neutron.agent.dhcp.agent     'to message ID %s' % msg_id)
2018-09-17 00:05:05.580 113737 ERROR neutron.agent.dhcp.agent MessagingTimeout: Timed out waiting for a reply to message ID 46d155e8c3b74cc7b30d21a2ffd07944
2018-09-17 00:05:05.580 113737 ERROR neutron.agent.dhcp.agent 


/var/log/containers/neutron/l3-agent.log
----------------------------------------

2018-09-17 00:01:04.125 114040 DEBUG neutron.agent.linux.utils [-] Running command (rootwrap daemon): ['ip', 'netns', 'exec', 'qrouter-29992b94-b80b-4fb6-8ee7-cdd7eecd1ce8', 'haproxy', '-f', '/var/lib/neutron/ns-metadata-proxy/29992b94-b80b-4fb6-8ee7-cdd7eecd1ce8.conf'] execute_rootwrap_daemon /usr/lib/python2.7/site-packages/neutron/agent/linux/utils.py:103
2018-09-17 00:01:04.131 114040 ERROR neutron.agent.linux.utils [-] Exit code: 1; Stdin: ; Stdout: ; Stderr: Cannot open network namespace "qrouter-29992b94-b80b-4fb6-8ee7-cdd7eecd1ce8": No such file or directory
2018-09-17 00:01:04.131 114040 DEBUG oslo_concurrency.lockutils [-] Lock "_check_child_processes" released by "neutron.agent.linux.external_process._check_child_processes" :: held 0.009s inner /usr/lib/python2.7/site-packages/oslo_concurrency/lockutils.py:285
2018-09-17 00:01:32.904 114040 DEBUG oslo_concurrency.lockutils [-] Lock "_check_child_processes" acquired by "neutron.agent.linux.external_process._check_child_processes" :: waited 0.000s inner /usr/lib/python2.7/site-packages/oslo_concurrency/lockutils.py:273
2018-09-17 00:01:32.904 114040 DEBUG oslo_concurrency.lockutils [-] Lock "_check_child_processes" released by "neutron.agent.linux.external_process._check_child_processes" :: held 0.001s inner /usr/lib/python2.7/site-packages/oslo_concurrency/lockutils.py:285
2018-09-17 00:01:38.480 114040 DEBUG oslo_service.periodic_task [req-45c59656-09f4-4c39-8bfe-0232bfe0d26b - - - - -] Running periodic task L3NATAgentWithStateReport.periodic_sync_routers_task run_periodic_tasks /usr/lib/python2.7/site-packages/oslo_service/periodic_task.py:215
2018-09-17 00:02:03.770 114040 ERROR neutron.agent.l3.agent [-] Failed reporting state!: MessagingTimeout: Timed out waiting for a reply to message ID 24a3bd47b46746afaafe13e475a36de6
2018-09-17 00:02:03.770 114040 ERROR neutron.agent.l3.agent Traceback (most recent call last):
2018-09-17 00:02:03.770 114040 ERROR neutron.agent.l3.agent   File "/usr/lib/python2.7/site-packages/neutron/agent/l3/agent.py", line 755, in _report_state
2018-09-17 00:02:03.770 114040 ERROR neutron.agent.l3.agent     True)
2018-09-17 00:02:03.770 114040 ERROR neutron.agent.l3.agent   File "/usr/lib/python2.7/site-packages/neutron/agent/rpc.py", line 97, in report_state
2018-09-17 00:02:03.770 114040 ERROR neutron.agent.l3.agent     return method(context, 'report_state', **kwargs)
2018-09-17 00:02:03.770 114040 ERROR neutron.agent.l3.agent   File "/usr/lib/python2.7/site-packages/oslo_messaging/rpc/client.py", line 179, in call
2018-09-17 00:02:03.770 114040 ERROR neutron.agent.l3.agent     retry=self.retry)
2018-09-17 00:02:03.770 114040 ERROR neutron.agent.l3.agent   File "/usr/lib/python2.7/site-packages/oslo_messaging/transport.py", line 133, in _send
2018-09-17 00:02:03.770 114040 ERROR neutron.agent.l3.agent     retry=retry)
2018-09-17 00:02:03.770 114040 ERROR neutron.agent.l3.agent   File "/usr/lib/python2.7/site-packages/oslo_messaging/_drivers/amqpdriver.py", line 584, in send
2018-09-17 00:02:03.770 114040 ERROR neutron.agent.l3.agent     call_monitor_timeout, retry=retry)
2018-09-17 00:02:03.770 114040 ERROR neutron.agent.l3.agent   File "/usr/lib/python2.7/site-packages/oslo_messaging/_drivers/amqpdriver.py", line 573, in _send
2018-09-17 00:02:03.770 114040 ERROR neutron.agent.l3.agent     call_monitor_timeout)
2018-09-17 00:02:03.770 114040 ERROR neutron.agent.l3.agent   File "/usr/lib/python2.7/site-packages/oslo_messaging/_drivers/amqpdriver.py", line 459, in wait
2018-09-17 00:02:03.770 114040 ERROR neutron.agent.l3.agent     message = self.waiters.get(msg_id, timeout=timeout)
2018-09-17 00:02:03.770 114040 ERROR neutron.agent.l3.agent   File "/usr/lib/python2.7/site-packages/oslo_messaging/_drivers/amqpdriver.py", line 336, in get
2018-09-17 00:02:03.770 114040 ERROR neutron.agent.l3.agent     'to message ID %s' % msg_id)
2018-09-17 00:02:03.770 114040 ERROR neutron.agent.l3.agent MessagingTimeout: Timed out waiting for a reply to message ID 24a3bd47b46746afaafe13e475a36de6


/var/log/containers/neutron/openvswitch-agent.log
-------------------------------------------------


2018-09-17 16:01:47.682 114675 ERROR neutron.plugins.ml2.drivers.openvswitch.agent.ovs_neutron_agent [-] Failed reporting state!: MessagingTimeout: Timed out waiting for a reply to message ID 5ba30c2f2752436ebb8651a6f20d3a52
2018-09-17 16:01:47.682 114675 ERROR neutron.plugins.ml2.drivers.openvswitch.agent.ovs_neutron_agent Traceback (most recent call last):
2018-09-17 16:01:47.682 114675 ERROR neutron.plugins.ml2.drivers.openvswitch.agent.ovs_neutron_agent   File "/usr/lib/python2.7/site-packages/neutron/plugins/ml2/drivers/openvswitch/agent/ovs_neutron_agent.py", line 325, in _report_state
2018-09-17 16:01:47.682 114675 ERROR neutron.plugins.ml2.drivers.openvswitch.agent.ovs_neutron_agent     True)
2018-09-17 16:01:47.682 114675 ERROR neutron.plugins.ml2.drivers.openvswitch.agent.ovs_neutron_agent   File "/usr/lib/python2.7/site-packages/neutron/agent/rpc.py", line 97, in report_state
2018-09-17 16:01:47.682 114675 ERROR neutron.plugins.ml2.drivers.openvswitch.agent.ovs_neutron_agent     return method(context, 'report_state', **kwargs)
2018-09-17 16:01:47.682 114675 ERROR neutron.plugins.ml2.drivers.openvswitch.agent.ovs_neutron_agent   File "/usr/lib/python2.7/site-packages/oslo_messaging/rpc/client.py", line 179, in call
2018-09-17 16:01:47.682 114675 ERROR neutron.plugins.ml2.drivers.openvswitch.agent.ovs_neutron_agent     retry=self.retry)
2018-09-17 16:01:47.682 114675 ERROR neutron.plugins.ml2.drivers.openvswitch.agent.ovs_neutron_agent   File "/usr/lib/python2.7/site-packages/oslo_messaging/transport.py", line 133, in _send
2018-09-17 16:01:47.682 114675 ERROR neutron.plugins.ml2.drivers.openvswitch.agent.ovs_neutron_agent     retry=retry)
2018-09-17 16:01:47.682 114675 ERROR neutron.plugins.ml2.drivers.openvswitch.agent.ovs_neutron_agent   File "/usr/lib/python2.7/site-packages/oslo_messaging/_drivers/amqpdriver.py", line 584, in send
2018-09-17 16:01:47.682 114675 ERROR neutron.plugins.ml2.drivers.openvswitch.agent.ovs_neutron_agent     call_monitor_timeout, retry=retry)
2018-09-17 16:01:47.682 114675 ERROR neutron.plugins.ml2.drivers.openvswitch.agent.ovs_neutron_agent   File "/usr/lib/python2.7/site-packages/oslo_messaging/_drivers/amqpdriver.py", line 573, in _send
2018-09-17 16:01:47.682 114675 ERROR neutron.plugins.ml2.drivers.openvswitch.agent.ovs_neutron_agent     call_monitor_timeout)
2018-09-17 16:01:47.682 114675 ERROR neutron.plugins.ml2.drivers.openvswitch.agent.ovs_neutron_agent   File "/usr/lib/python2.7/site-packages/oslo_messaging/_drivers/amqpdriver.py", line 459, in wait
2018-09-17 16:01:47.682 114675 ERROR neutron.plugins.ml2.drivers.openvswitch.agent.ovs_neutron_agent     message = self.waiters.get(msg_id, timeout=timeout)
2018-09-17 16:01:47.682 114675 ERROR neutron.plugins.ml2.drivers.openvswitch.agent.ovs_neutron_agent   File "/usr/lib/python2.7/site-packages/oslo_messaging/_drivers/amqpdriver.py", line 336, in get
2018-09-17 16:01:47.682 114675 ERROR neutron.plugins.ml2.drivers.openvswitch.agent.ovs_neutron_agent     'to message ID %s' % msg_id)
2018-09-17 16:01:47.682 114675 ERROR neutron.plugins.ml2.drivers.openvswitch.agent.ovs_neutron_agent MessagingTimeout: Timed out waiting for a reply to message ID 5ba30c2f2752436ebb8651a6f20d3a52
2018-09-17 16:01:47.682 114675 ERROR neutron.plugins.ml2.drivers.openvswitch.agent.ovs_neutron_agent 
2018-09-17 16:01:47.684 114675 WARNING oslo.service.loopingcall [-] Function 'neutron.plugins.ml2.drivers.openvswitch.agent.ovs_neutron_agent.OVSNeutronAgent._report_state' run outlasted interval by 30.01 sec
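
All three agents above time out on the same report_state RPC call, which points at the RPC layer rather than at the agents themselves. For reference, a minimal way to check broker health from the controller (assuming rabbitmqctl is reachable; in a containerized deployment the same commands may need to be run inside the rabbitmq container, whose exact name is deployment-specific). Growing queues with no consumers would be consistent with the MessagingTimeout errors:

[heat-admin@controller-0 ~]$ sudo rabbitmqctl cluster_status
[heat-admin@controller-0 ~]$ sudo rabbitmqctl list_queues name messages consumers | sort -k2 -n | tail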

Comment 10 Joe H. Rahme 2018-10-01 12:21:59 UTC
I was able to reproduce the issue by completing a deployment on a Friday afternoon and leaving it idle over the weekend. On Monday morning, Neutron was misbehaving in the same way.

Note that I don't know if it's related, but the undercloud is also broken after the weekend like here: https://bugzilla.redhat.com/show_bug.cgi?id=1628319

Comment 12 Bernard Cafarelli 2018-10-01 14:06:39 UTC
Marking as a duplicate of bug 1631335, where the patches are tracked; checking on the system confirms this is indeed the same issue.

*** This bug has been marked as a duplicate of bug 1631335 ***

