Bug 1661784

Summary: aodh_evaluator and ceilometer_agent_central report unhealthy after OSP14 Minor update
Product: Red Hat OpenStack
Reporter: Noam Manos <nmanos>
Component: openstack-ceilometer
Assignee: Eoghan Glynn <eglynn>
Status: CLOSED WORKSFORME
QA Contact: Sasha Smolyak <ssmolyak>
Severity: high
Priority: medium
Version: 14.0 (Rocky)
CC: jruzicka, mmagr, mrunge, nmanos, rafael.marcelino, sinan.polat
Target Milestone: ---
Target Release: ---
Hardware: Unspecified
OS: Unspecified
Last Closed: 2019-01-14 16:37:55 UTC
Type: Bug

Description Noam Manos 2018-12-23 10:00:04 UTC
Description of problem:
OSP14 Minor Update: aodh_evaluator and ceilometer_agent_central report unhealthy after the update completed.

Version-Release number of selected component (if applicable):
OSP14 Minor Update from 2018-12-12.4 to 2018-12-14.1

How reproducible:
Sometimes

Steps to Reproduce:
1. Run an OSP14 Minor Update

Actual results:
After the update completed, ceilometer_agent_central and aodh_evaluator are up but unhealthy on all 3 controllers.

Here are the container status and some traces from controller-2:

2a80d267c8a6        192.168.24.1:8787/rhosp14/openstack-ceilometer-central:2018-12-14.1            "kolla_start"            5 days ago          Up 5 days (unhealthy)                       ceilometer_agent_central
f8e61631581f        192.168.24.1:8787/rhosp14/openstack-aodh-evaluator:2018-12-14.1                "kolla_start"            5 days ago          Up 5 days (unhealthy)                       aodh_evaluator
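
For reference, the reason docker flags a container as unhealthy can be read back from the health-check state itself. This is standard docker tooling rather than output from the sosreport; the container names are the ones listed above:

# Show the recent health-check runs and their output for the failing containers
sudo docker inspect --format '{{json .State.Health}}' ceilometer_agent_central | python -m json.tool
sudo docker inspect --format '{{json .State.Health}}' aodh_evaluator | python -m json.tool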


*** controller-2: /var/log/containers/aodh/aodh-evaluator.log ***

2018-12-23 08:01:26.915 28 ERROR aodh.coordination Traceback (most recent call last):
2018-12-23 08:01:26.915 28 ERROR aodh.coordination   File "/usr/lib64/python2.7/contextlib.py", line 35, in __exit__
2018-12-23 08:01:26.915 28 ERROR aodh.coordination   File "/usr/lib/python2.7/site-packages/aodh/coordination.py", line 152, in heartbeat
2018-12-23 08:01:26.915 28 ERROR aodh.coordination   File "/usr/lib/python2.7/site-packages/six.py", line 718, in raise_from
2018-12-23 08:01:26.915 28 ERROR aodh.coordination   File "/usr/lib/python2.7/site-packages/tooz/utils.py", line 225, in raise_with_cause
2018-12-23 08:01:26.915 28 ERROR aodh.coordination ToozConnectionError: Error while reading from socket: ('Connection closed by server.',)
2018-12-23 08:01:26.915 28 ERROR aodh.coordination [-] Error sending a heartbeat to coordination backend.: ToozConnectionError: Error while reading from socket: ('Connection closed by server.',)
2018-12-23 08:01:26.915 28 ERROR aodh.coordination   File "/usr/lib/python2.7/site-packages/tooz/drivers/redis.py", line 51, in _translate_failures

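The ToozConnectionError above means the aodh coordination heartbeat lost its connection to the Redis backend. A minimal way to cross-check this from the controller, assuming the usual containerized-OSP config locations (the paths below are an assumption; adjust if the layout differs):

# Which coordination backend are the services configured to use?
sudo grep backend_url /var/lib/config-data/puppet-generated/aodh/etc/aodh/aodh.conf \
    /var/lib/config-data/puppet-generated/ceilometer/etc/ceilometer/ceilometer.conf

# Basic TCP reachability of the Redis host/port taken from backend_url
REDIS_VIP=<host from backend_url>   # placeholder, fill in from the grep above
timeout 5 bash -c "cat < /dev/null > /dev/tcp/${REDIS_VIP}/6379" && echo reachable || echo unreachable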

*** controller-2: /var/log/containers/ceilometer/central.log ***

2018-12-23 00:01:25.234 28 ERROR root Traceback (most recent call last):
2018-12-23 00:01:25.234 28 ERROR root   File "/usr/lib/python2.7/site-packages/tooz/coordination.py", line 187, in _beat_forever_until_stopped
2018-12-23 00:01:59.495 28 ERROR futurist.periodics   File "/usr/lib/python2.7/site-packages/futurist/periodics.py", line 64, in __call__
2018-12-23 00:01:59.495 28 ERROR futurist.periodics [-] Failed to call periodic 'ceilometer.polling.manager.task' (it runs every 300.00 seconds): ToozConnectionError: Error while reading from socket: ('Connection closed by server.',)
2018-12-23 00:01:59.495 28 ERROR futurist.periodics   File "/usr/lib/python2.7/site-packages/futurist/periodics.py", line 178, in decorator
2018-12-23 00:01:25.234 28 ERROR root   File "/usr/lib64/python2.7/contextlib.py", line 35, in __exit__
2018-12-23 00:01:59.495 28 ERROR futurist.periodics   File "/usr/lib/python2.7/site-packages/tooz/coordination.py", line 662, in get
2018-12-23 00:01:25.234 28 ERROR root   File "/usr/lib/python2.7/site-packages/tooz/drivers/redis.py", line 513, in heartbeat
2018-12-23 00:01:59.495 28 ERROR futurist.periodics   File "/usr/lib/python2.7/site-packages/ceilometer/polling/manager.py", line 423, in interval_task
2018-12-23 00:01:25.234 28 ERROR root   File "/usr/lib/python2.7/site-packages/six.py", line 718, in raise_from
2018-12-23 00:01:25.234 28 ERROR root   File "/usr/lib/python2.7/site-packages/tooz/utils.py", line 225, in raise_with_cause
2018-12-23 00:01:59.495 28 ERROR futurist.periodics   File "/usr/lib/python2.7/site-packages/futurist/periodics.py", line 290, in run
2018-12-23 00:01:59.495 28 ERROR futurist.periodics   File "/usr/lib/python2.7/site-packages/tooz/drivers/redis.py", line 749, in run_watchers
2018-12-23 00:01:25.234 28 ERROR root ToozConnectionError: Error while reading from socket: ('Connection closed by server.',)
2018-12-23 00:01:59.495 28 ERROR futurist.periodics   File "/usr/lib/python2.7/site-packages/ceilometer/polling/manager.py", line 394, in task
2018-12-23 00:01:25.234 28 ERROR root [-] Unexpected exception occurred 60 time(s)... retrying.: ToozConnectionError: Error while reading from socket: ('Connection closed by server.',)
2018-12-23 00:01:25.234 28 ERROR root   File "/usr/lib/python2.7/site-packages/tooz/drivers/redis.py", line 51, in _translate_failures
2018-12-23 00:01:25.234 28 ERROR root   File "/usr/lib/python2.7/site-packages/oslo_utils/excutils.py", line 250, in wrapper


*** controller-2: /var/log/containers/neutron/server.log ***

2018-12-23 06:03:11.437 38 ERROR networking_odl.common.websocket_client Traceback (most recent call last):
2018-12-23 06:03:11.443 38 DEBUG networking_odl.common.websocket_client [req-3ccb9ec8-7cb4-4462-8107-6df8f891377a - - - - -] closing websocket _close_ws /usr/lib/python2.7/site-packages/networking_odl/common/websocket_client.py:289
2018-12-23 06:03:11.460 38 DEBUG networking_odl.common.client [req-3ccb9ec8-7cb4-4462-8107-6df8f891377a - - - - -] Got response:
2018-12-23 06:03:11.437 38 ERROR networking_odl.common.websocket_client   File "/usr/lib/python2.7/site-packages/websocket/_core.py", line 348, in recv
2018-12-23 06:03:11.437 38 ERROR networking_odl.common.websocket_client   File "/usr/lib/python2.7/site-packages/websocket/_core.py", line 476, in _recv
2018-12-23 06:03:11.437 38 ERROR networking_odl.common.websocket_client   File "/usr/lib/python2.7/site-packages/websocket/_core.py", line 365, in recv_data
2018-12-23 06:03:11.437 38 ERROR networking_odl.common.websocket_client   File "/usr/lib/python2.7/site-packages/websocket/_core.py", line 378, in recv_data_frame
2018-12-23 06:03:11.437 38 ERROR networking_odl.common.websocket_client   File "/usr/lib/python2.7/site-packages/websocket/_core.py", line 410, in recv_frame
2018-12-23 06:03:11.437 38 ERROR networking_odl.common.websocket_client   File "/usr/lib/python2.7/site-packages/websocket/_abnf.py", line 249, in recv_header
2018-12-23 06:03:11.437 38 ERROR networking_odl.common.websocket_client   File "/usr/lib/python2.7/site-packages/websocket/_abnf.py", line 334, in recv_strict
2018-12-23 06:03:11.437 38 ERROR networking_odl.common.websocket_client   File "/usr/lib/python2.7/site-packages/networking_odl/common/websocket_client.py", line 124, in run_websocket_thread
2018-12-23 06:03:11.445 38 DEBUG networking_odl.common.client [req-3ccb9ec8-7cb4-4462-8107-6df8f891377a - - - - -] Sending METHOD (post) URL (http://172.17.1.13:8081/restconf/operations/sal-remote:create-data-change-event-subscription) JSON ({
2018-12-23 06:03:11.062 31 DEBUG neutron.wsgi [-] (31) accepted ('172.17.1.24', 51542) server /usr/lib/python2.7/site-packages/eventlet/wsgi.py:883
2018-12-23 06:03:11.065 31 INFO neutron.wsgi [-] 172.17.1.24 "OPTIONS / HTTP/1.0" status: 200  len: 248 time: 0.0014091
2018-12-23 06:03:11.479 38 DEBUG networking_odl.ml2.port_status_update [-] starting to pull pending statuses... _pull_missed_statuses /usr/lib/python2.7/site-packages/networking_odl/ml2/port_status_update.py:108
2018-12-23 06:03:11.437 38 WARNING networking_odl.common.websocket_client [req-3ccb9ec8-7cb4-4462-8107-6df8f891377a - - - - -] websocket connection closed or IO error: WebSocketConnectionClosedException: Connection is already closed.
2018-12-23 06:03:11.478 38 DEBUG networking_odl.ml2.port_status_update [req-3ccb9ec8-7cb4-4462-8107-6df8f891377a - - - - -] Websocket notification on reconnection _process_websocket_reconnect /usr/lib/python2.7/site-packages/networking_odl/ml2/port_status_update.py:102
2018-12-23 06:03:11.469 38 DEBUG networking_odl.common.websocket_client [req-3ccb9ec8-7cb4-4462-8107-6df8f891377a - - - - -] websocket stream URL: ws://172.17.1.13:8185/data-change-event-subscription/neutron:neutron/neutron:ports/datastore=OPERATIONAL/scope=SUBTREE _subscribe_websocket /usr/lib/python2.7/site-packages/networking_odl/common/websocket_client.py:227
2018-12-23 06:03:11.443 38 INFO networking_odl.common.websocket_client [req-3ccb9ec8-7cb4-4462-8107-6df8f891377a - - - - -] websocket transition to status ODL_WEBSOCKET_DISCONNECTED


*** controller-2: /var/log/containers/nova/nova-scheduler.log ***

2018-12-23 00:02:01.960 27 ERROR oslo.messaging._drivers.impl_rabbit Traceback (most recent call last):
2018-12-23 00:02:01.960 27 ERROR oslo.messaging._drivers.impl_rabbit   File "/usr/lib/python2.7/site-packages/oslo_messaging/_drivers/amqpdriver.py", line 251, in __call__
2018-12-23 00:02:01.960 27 ERROR oslo.messaging._drivers.impl_rabbit   File "/usr/lib/python2.7/site-packages/oslo_messaging/_drivers/impl_rabbit.py", line 361, in _callback
2018-12-23 00:02:01.960 27 ERROR oslo.messaging._drivers.impl_rabbit   File "/usr/lib/python2.7/site-packages/oslo_messaging/_drivers/amqp.py", line 122, in check_duplicate_message
2018-12-23 00:02:01.960 27 ERROR oslo.messaging._drivers.impl_rabbit DuplicateMessageError: Found duplicate message(dd925a5c8fa44b789bb6287ce241f801). Skipping it.
2018-12-23 00:01:24.404 29 DEBUG oslo_service.periodic_task [req-ad621c19-f8d2-4dcb-9a03-46e6860beac2 - - - - -] Running periodic task SchedulerManager._run_periodic_tasks run_periodic_tasks /usr/lib/python2.7/site-packages/oslo_service/periodic_task.py:215
2018-12-23 00:02:01.960 27 ERROR oslo.messaging._drivers.impl_rabbit [-] Failed to process message ... skipping it.: DuplicateMessageError: Found duplicate message(dd925a5c8fa44b789bb6287ce241f801). Skipping it.


*** controller-2: /var/log/containers/opendaylight/karaf/logs/karaf.log ***

	at com.google.common.util.concurrent.AbstractFuture.getDoneValue(AbstractFuture.java:528) [38:com.google.guava:23.6.1.jre]
2018-12-12T20:42:23,444 | INFO  | Blueprint Event Dispatcher: 1 | BlueprintBundleTracker           | 194 - org.opendaylight.controller.blueprint - 0.8.4.redhat-7 | Blueprint container for bundle org.opendaylight.openflowplugin.applications.reconciliation-framework_0.6.4.redhat-7 [372] was successfully created
2018-12-12T20:42:23,440 | INFO  | Blueprint Extender: 1 | ReconciliationManagerImpl        | 372 - org.opendaylight.openflowplugin.applications.reconciliation-framework - 0.6.4.redhat-7 | ReconciliationManager has started successfully.
2018-12-12T20:42:23,612 | ERROR | Blueprint Extender: 2 | UpgradeStateListener             | 406 - org.opendaylight.serviceutils.upgrade - 0.1.4.redhat-7 | Failed to write mdsal config
java.util.concurrent.ExecutionException: OptimisticLockFailedException{message=Optimistic lock failed for path /(urn:opendaylight:serviceutils:upgrade?revision=2018-07-02)upgrade-config, errorList=[RpcError [message=Optimistic lock failed for path /(urn:opendaylight:serviceutils:upgrade?revision=2018-07-02)upgrade-config, severity=ERROR, errorType=APPLICATION, tag=resource-denied, applicationTag=null, info=null, cause=org.opendaylight.yangtools.yang.data.api.schema.tree.ConflictingModificationAppliedException: Node was replaced by other transaction.]]}
2018-12-12T20:42:23,584 | INFO  | awaitility[checkBundleDiagInfos] | SystemReadyImpl                  | 282 - org.opendaylight.infrautils.ready-impl - 1.3.4.redhat-7 | checkBundleDiagInfos: Elapsed time 2s, remaining time 297s, diag: Booting {Installed=0, Resolved=6, Unknown=0, GracePeriod=37, Waiting=0, Starting=0, Active=432, Stopping=0, Failure=0}
2018-12-12T20:42:23,424 | INFO  | Thread-64        | BgpThriftService                 | 338 - org.opendaylight.netvirt.bgpmanager-impl - 0.6.4.redhat-7 | Bgp thrift server pre serve event
2018-12-16T10:18:40,581 | INFO  | nioEventLoopGroup-9-8 | ReconciliationManagerImpl        | 372 - org.opendaylight.openflowplugin.applications.reconciliation-framework - 0.6.4.redhat-7 | Stopping reconciliation for node Uri [_value=openflow:119175419828361]
2018-12-16T10:18:38,790 | INFO  | nioEventLoopGroup-9-7 | SystemNotificationsListenerImpl  | 378 - org.opendaylight.openflowplugin.impl - 0.6.4.redhat-7 | ConnectionEvent: Connection closed by device, Device:/172.17.1.24:50846, NodeId:null
2018-12-16T10:18:38,790 | INFO  | nioEventLoopGroup-11-2 | AbstractConnectionAdapter        | 387 - org.opendaylight.openflowplugin.openflowjava.openflow-protocol-impl - 0.6.4.redhat-7 | The channel outbound queue size:1024
2018-12-16T10:18:40,581 | ERROR | nioEventLoopGroup-5-3 | ExceptionHandler                 | 392 - org.opendaylight.ovsdb.library - 1.6.4.redhat-7 | Exception occurred while processing connection pipeline
2018-12-16T10:18:40,581 | INFO  | nioEventLoopGroup-5-3 | OvsdbConnectionManager           | 396 - org.opendaylight.ovsdb.southbound-impl - 1.6.4.redhat-7 | Library disconnected PASSIVE from /172.17.1.39:59376 to /172.17.1.24:6640. Cleaning up the operational data store



Expected results:
All containers should be up and healthy after the Minor Update.

Additional info:

Here are the sosreports:
http://rhos-release.virt.bos.redhat.com/log/bz1661190/
(please ignore the bz number in the sosreport link name)

Comment 1 Martin Magr 2019-01-07 15:44:19 UTC
Can you please check the health of Redis? It looks like the services cannot connect to Redis, so they are correctly reporting themselves as unhealthy.
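
A minimal sketch of such a check on a controller, assuming the pacemaker-managed Redis bundle that containerized OSP uses by default (the container name is looked up rather than assumed):

# Pacemaker's view of the Redis bundle (Master/Slave state per controller)
sudo pcs status | grep -B1 -A4 redis-bundle

# Does the Redis server answer at all? (a NOAUTH reply still means the process is up)
REDIS_CTR=$(sudo docker ps --format '{{.Names}}' | grep -m1 redis)
sudo docker exec "$REDIS_CTR" redis-cli ping

# Look for restarts or replication errors around the time of the update
sudo docker logs --tail 100 "$REDIS_CTR"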

Comment 2 Matthias Runge 2019-01-14 15:42:34 UTC
ping?

Comment 3 Noam Manos 2019-01-14 16:11:58 UTC
I do not have the environment anymore; once I have it again, I'll be able to check.

Comment 4 Matthias Runge 2019-01-14 16:37:55 UTC
OK, thank you. I'll close it for now. If you're able to reproduce it, feel free to reopen.

Comment 5 Sinan Polat 2019-10-01 14:16:09 UTC
I just deployed OSP13 with 3 Controller nodes and have the same problem.

[root@lab-controller-0 ~]# sudo docker ps|grep -i "unhealthy"
aa64d19595b9        kvm-sat01-am6.stack.example.com:5000/example-osp13_containers-gnocchi-api:latest                 "kolla_start"            About an hour ago   Up About an hour (unhealthy)                       gnocchi_api
9a8bfe45cecd        kvm-sat01-am6.stack.example.com:5000/example-osp13_containers-gnocchi-metricd:latest             "kolla_start"            About an hour ago   Up About an hour (unhealthy)                       gnocchi_metricd
99cc009f6245        kvm-sat01-am6.stack.example.com:5000/example-osp13_containers-aodh-evaluator:latest              "kolla_start"            2 hours ago         Up 2 hours (unhealthy)                             aodh_evaluator
ce7f33b3d59b        kvm-sat01-am6.stack.example.com:5000/example-osp13_containers-ceilometer-central:latest          "kolla_start"            2 hours ago         Up 2 hours (unhealthy)                             ceilometer_agent_central
[root@lab-controller-0 ~]# 

/var/log/containers/aodh/aodh-evaluator.log:
2019-10-01 14:12:34.758 27 ERROR aodh.coordination [-] Error sending a heartbeat to coordination backend.: ToozConnectionError: Error while reading from socket: ('Connection closed by server.',)
2019-10-01 14:12:34.758 27 ERROR aodh.coordination Traceback (most recent call last):
2019-10-01 14:12:34.758 27 ERROR aodh.coordination   File "/usr/lib/python2.7/site-packages/aodh/coordination.py", line 152, in heartbeat
2019-10-01 14:12:34.758 27 ERROR aodh.coordination     self._coordinator.heartbeat()
2019-10-01 14:12:34.758 27 ERROR aodh.coordination   File "/usr/lib/python2.7/site-packages/tooz/drivers/redis.py", line 513, in heartbeat
2019-10-01 14:12:34.758 27 ERROR aodh.coordination     value=self.STILL_ALIVE)
2019-10-01 14:12:34.758 27 ERROR aodh.coordination   File "/usr/lib64/python2.7/contextlib.py", line 35, in __exit__
2019-10-01 14:12:34.758 27 ERROR aodh.coordination     self.gen.throw(type, value, traceback)
2019-10-01 14:12:34.758 27 ERROR aodh.coordination   File "/usr/lib/python2.7/site-packages/tooz/drivers/redis.py", line 51, in _translate_failures
2019-10-01 14:12:34.758 27 ERROR aodh.coordination     cause=e)
2019-10-01 14:12:34.758 27 ERROR aodh.coordination   File "/usr/lib/python2.7/site-packages/tooz/utils.py", line 225, in raise_with_cause
2019-10-01 14:12:34.758 27 ERROR aodh.coordination     excutils.raise_with_cause(exc_cls, message, *args, **kwargs)
2019-10-01 14:12:34.758 27 ERROR aodh.coordination   File "/usr/lib/python2.7/site-packages/oslo_utils/excutils.py", line 143, in raise_with_cause
2019-10-01 14:12:34.758 27 ERROR aodh.coordination     six.raise_from(exc_cls(message, *args, **kwargs), kwargs.get('cause'))
2019-10-01 14:12:34.758 27 ERROR aodh.coordination   File "/usr/lib/python2.7/site-packages/six.py", line 718, in raise_from
2019-10-01 14:12:34.758 27 ERROR aodh.coordination     raise value
2019-10-01 14:12:34.758 27 ERROR aodh.coordination ToozConnectionError: Error while reading from socket: ('Connection closed by server.',)

pcs status (redis bundle):
 Docker container set: redis-bundle [kvm-sat01-am6.stack.example.com:5000/example-osp13_containers-redis:pcmklatest]
   redis-bundle-0	(ocf::heartbeat:redis):	Master lab-controller-0
   redis-bundle-1	(ocf::heartbeat:redis):	Slave lab-controller-1
   redis-bundle-2	(ocf::heartbeat:redis):	Slave lab-controller-2
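
Since pcs reports the Redis bundle itself as Master/Slave across the controllers, one thing worth comparing is what the clients are told to use versus what Redis enforces. The paths below are the usual OSP13 containerized layout and are an assumption, not something confirmed in this report:

# Coordination URL handed to aodh/ceilometer (includes the Redis password, if one is set)
sudo grep backend_url /var/lib/config-data/puppet-generated/aodh/etc/aodh/aodh.conf

# Password Redis itself requires
sudo grep -i '^requirepass' /var/lib/config-data/puppet-generated/redis/etc/redis.conf

# A mismatch here, or a backend_url pointing at the wrong host, would be one possible
# reason for the clients failing while the bundle itself looks healthy.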

Comment 6 Matthias Runge 2019-10-01 19:06:03 UTC
Can we please separate the issues? Could you open a new ticket for this? I'd need a bit more info from your side; this bug here is about OSP14.

Comment 7 Matthias Runge 2019-10-02 07:05:01 UTC
Discussed offline: https://bugzilla.redhat.com/show_bug.cgi?id=1661784#c5 will be moved to a different bz.

Comment 8 Rafael Marcelino 2019-10-31 07:36:39 UTC
(In reply to Matthias Runge from comment #7)
> Discussed offline: https://bugzilla.redhat.com/show_bug.cgi?id=1661784#c5
> will be moved to a different bz.

Hi Matthias,
We are struggling with the same issue reported by Sinan Polat.
Did you open a new bz for this topic?