Bug 1658151
Summary: | ceilometer HTTP Error 500 after controllers were temporarily disconnected | ||
---|---|---|---|
Product: | Red Hat OpenStack | Reporter: | Noam Manos <nmanos> |
Component: | openstack-nova | Assignee: | OSP DFG:Compute <osp-dfg-compute> |
Status: | CLOSED NOTABUG | QA Contact: | OSP DFG:Compute <osp-dfg-compute> |
Severity: | urgent | Docs Contact: | |
Priority: | unspecified | ||
Version: | 14.0 (Rocky) | CC: | dasmith, eglynn, jdanjou, jhakimra, jruzicka, kchamart, mkolesni, oblaut, sbauza, sgordon, tfreger, vromanso |
Target Milestone: | --- | ||
Target Release: | --- | ||
Hardware: | Unspecified | ||
OS: | Unspecified | ||
Whiteboard: | |||
Fixed In Version: | Doc Type: | If docs needed, set a value | |
Doc Text: | Story Points: | --- | |
Clone Of: | Environment: | ||
Last Closed: | 2018-12-21 15:38:03 UTC | Type: | Bug |
Regression: | --- | Mount Type: | --- |
Documentation: | --- | CRM: | |
Verified Versions: | Category: | --- | |
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
Cloudforms Team: | --- | Target Upstream Version: | |
Embargoed: |
Description
Noam Manos
2018-12-11 11:55:09 UTC
The 500 error is returned by the Nova API, reassigning. I see two important things here:

    2018-12-11 09:36:24.910 144920 ERROR neutron.agent.dhcp.agent [req-d47e84c0-1e3f-4784-8507-fb2dd90d8f27 - - - - -] Failed reporting state!: MessagingTimeout: Timed out waiting for a reply to message ID ae8e9488964b42d2a59756561074988a

and:

    2018-12-11 09:34:34.933 26 ERROR oslo_db.sqlalchemy.engines [req-1ec53a36-a8be-4627-ae38-aaebe6a9e308 - - - - -] Database connection was found disconnected; reconnecting: DBConnectionError: (pymysql.err.OperationalError) (2006, "MySQL server has gone away (error(110, 'Connection timed out'))") [SQL: u'SELECT 1'] (Background on this error at: http://sqlalche.me/e/e3q8)

many times over. This indicates that the database is not in a usable state and probably did not recover correctly from the snapshot process. The same appears to be true for the message queue. Because of this, I don't think this is a bug in Nova (or in any other OSP component). However, I didn't think that creating a snapshot affected the guest OS, so I'm not going to close this NOTABUG just yet; I want to double-check snapshot side effects first, but that is what I'm leaning towards.

> I didn't think that creating a snapshot affected the guest OS. So I'm not
> going to close this NOTABUG just yet to double check snapshot side-effects,
> but this is what I'm leaning towards.

I was wrong. Unless the snapshot is both consistent and live, it will affect the guest OS. There are two ways around this:

1. Take a consistent, live snapshot, so the guest OS is not affected :)
2. Assume any other type of snapshot will affect the guest OS.

We have a guide about rebooting the overcloud [1] that ensures the services are given time to recover. Following this guide, but replacing each reboot action with the snapshot action, should allow the services to recover between snapshots and should avoid the database and/or message queue failures seen here.

Cheers!
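The `[SQL: u'SELECT 1']` in the oslo.db traceback is a connection liveness probe: before handing out a pooled connection, the engine issues a cheap `SELECT 1`, and if that fails it discards the stale handle and reconnects. A minimal sketch of the pattern, using stdlib sqlite3 as a stand-in backend (the class and its names are illustrative, not the actual oslo.db implementation):

```python
import sqlite3

class PingingConnection:
    """Illustrative sketch of a ping-before-use connection wrapper,
    the same pattern behind oslo.db's 'SELECT 1' liveness probe."""

    def __init__(self, connect):
        self._connect = connect   # zero-arg factory returning a DB-API connection
        self._conn = connect()

    def execute(self, sql, params=()):
        try:
            self._conn.execute("SELECT 1")   # liveness probe
        except sqlite3.Error:
            # "server has gone away": drop the dead handle and reconnect
            self._conn = self._connect()
        return self._conn.execute(sql, params)
```

In this bug the probe itself kept failing "many times over", meaning reconnection could not succeed because the database behind it had not actually recovered yet.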
[1] https://access.redhat.com/documentation/en-us/red_hat_openstack_platform/13/html/director_installation_and_usage/sect-rebooting_the_overcloud