Bug 1661784
| Summary: | aodh_evaluator and ceilometer_agent_central report unhealthy after OSP14 Minor update | | |
|---|---|---|---|
| Product: | Red Hat OpenStack | Reporter: | Noam Manos <nmanos> |
| Component: | openstack-ceilometer | Assignee: | Eoghan Glynn <eglynn> |
| Status: | CLOSED WORKSFORME | QA Contact: | Sasha Smolyak <ssmolyak> |
| Severity: | high | Docs Contact: | |
| Priority: | medium | | |
| Version: | 14.0 (Rocky) | CC: | jruzicka, mmagr, mrunge, nmanos, rafael.marcelino, sinan.polat |
| Target Milestone: | --- | | |
| Target Release: | --- | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2019-01-14 16:37:55 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
Can you please check the health of Redis? It looks as if the services cannot connect to Redis and therefore cannot report correctly.

ping?

I do not have the environment anymore; once I have it again, I'll be able to check.

OK, thank you. I'll close it for now. If you're able to reproduce, feel free to reopen.

I just deployed OSP13 with 3 Controller nodes and have the same problem:
[root@lab-controller-0 ~]# sudo docker ps|grep -i "unhealthy"
aa64d19595b9 kvm-sat01-am6.stack.example.com:5000/example-osp13_containers-gnocchi-api:latest "kolla_start" About an hour ago Up About an hour (unhealthy) gnocchi_api
9a8bfe45cecd kvm-sat01-am6.stack.example.com:5000/example-osp13_containers-gnocchi-metricd:latest "kolla_start" About an hour ago Up About an hour (unhealthy) gnocchi_metricd
99cc009f6245 kvm-sat01-am6.stack.example.com:5000/example-osp13_containers-aodh-evaluator:latest "kolla_start" 2 hours ago Up 2 hours (unhealthy) aodh_evaluator
ce7f33b3d59b kvm-sat01-am6.stack.example.com:5000/example-osp13_containers-ceilometer-central:latest "kolla_start" 2 hours ago Up 2 hours (unhealthy) ceilometer_agent_central
[root@lab-controller-0 ~]#
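A quick way to see why containers flip to unhealthy is to pull the last healthcheck log entry for each one. A minimal sketch, assuming the Docker CLI and a running daemon (as on these controllers); `unhealthy_names()` just filters `docker ps` output like the listing above, and `last_health_log()` reads the standard `.State.Health` structure via `docker inspect`:

```python
import json
import subprocess

def unhealthy_names(docker_ps_output):
    """Return the names (last column) of containers marked (unhealthy)."""
    return [line.split()[-1]
            for line in docker_ps_output.splitlines()
            if "(unhealthy)" in line]

def last_health_log(name):
    """Fetch the most recent healthcheck output for one container.

    Requires a running Docker daemon; .State.Health with its Log list
    is the standard Docker healthcheck status structure.
    """
    raw = subprocess.check_output(
        ["docker", "inspect", "--format", "{{json .State.Health}}", name])
    return json.loads(raw)["Log"][-1]["Output"]
```

On an affected controller, `last_health_log("aodh_evaluator")` would show the failing healthcheck output directly instead of just the `(unhealthy)` flag.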
/var/log/containers/aodh/aodh-evaluator.log:
2019-10-01 14:12:34.758 27 ERROR aodh.coordination [-] Error sending a heartbeat to coordination backend.: ToozConnectionError: Error while reading from socket: ('Connection closed by server.',)
2019-10-01 14:12:34.758 27 ERROR aodh.coordination Traceback (most recent call last):
2019-10-01 14:12:34.758 27 ERROR aodh.coordination File "/usr/lib/python2.7/site-packages/aodh/coordination.py", line 152, in heartbeat
2019-10-01 14:12:34.758 27 ERROR aodh.coordination self._coordinator.heartbeat()
2019-10-01 14:12:34.758 27 ERROR aodh.coordination File "/usr/lib/python2.7/site-packages/tooz/drivers/redis.py", line 513, in heartbeat
2019-10-01 14:12:34.758 27 ERROR aodh.coordination value=self.STILL_ALIVE)
2019-10-01 14:12:34.758 27 ERROR aodh.coordination File "/usr/lib64/python2.7/contextlib.py", line 35, in __exit__
2019-10-01 14:12:34.758 27 ERROR aodh.coordination self.gen.throw(type, value, traceback)
2019-10-01 14:12:34.758 27 ERROR aodh.coordination File "/usr/lib/python2.7/site-packages/tooz/drivers/redis.py", line 51, in _translate_failures
2019-10-01 14:12:34.758 27 ERROR aodh.coordination cause=e)
2019-10-01 14:12:34.758 27 ERROR aodh.coordination File "/usr/lib/python2.7/site-packages/tooz/utils.py", line 225, in raise_with_cause
2019-10-01 14:12:34.758 27 ERROR aodh.coordination excutils.raise_with_cause(exc_cls, message, *args, **kwargs)
2019-10-01 14:12:34.758 27 ERROR aodh.coordination File "/usr/lib/python2.7/site-packages/oslo_utils/excutils.py", line 143, in raise_with_cause
2019-10-01 14:12:34.758 27 ERROR aodh.coordination six.raise_from(exc_cls(message, *args, **kwargs), kwargs.get('cause'))
2019-10-01 14:12:34.758 27 ERROR aodh.coordination File "/usr/lib/python2.7/site-packages/six.py", line 718, in raise_from
2019-10-01 14:12:34.758 27 ERROR aodh.coordination raise value
2019-10-01 14:12:34.758 27 ERROR aodh.coordination ToozConnectionError: Error while reading from socket: ('Connection closed by server.',)
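The shape of this traceback comes from tooz's Redis driver: the heartbeat's Redis command runs inside a context manager (`_translate_failures` in `tooz/drivers/redis.py`, visible in the frames above) that re-raises any low-level redis/socket failure as a `ToozConnectionError`. A simplified, self-contained sketch of that pattern — not the actual tooz code, and the names here are stand-ins:

```python
import contextlib

class ToozConnectionError(Exception):
    """Simplified stand-in for tooz.coordination.ToozConnectionError."""

@contextlib.contextmanager
def translate_failures():
    # Mirrors the role of tooz.drivers.redis._translate_failures: any
    # low-level connection error is re-raised as a coordination error,
    # which is why a server-side disconnect surfaces as ToozConnectionError.
    try:
        yield
    except (ConnectionError, OSError) as exc:
        raise ToozConnectionError(
            "Error while reading from socket: %r" % (exc.args,)) from exc

def heartbeat(send):
    # The driver's heartbeat() issues its Redis SET inside that context
    # manager; `send` stands in for the Redis round trip.
    with translate_failures():
        send(b"STILL-ALIVE")
```

So the services themselves are likely fine; the repeated `ToozConnectionError` simply reports that Redis closed the connection under them.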
pcs (redis bundle):
Docker container set: redis-bundle [kvm-sat01-am6.stack.example.com:5000/example-osp13_containers-redis:pcmklatest]
redis-bundle-0 (ocf::heartbeat:redis): Master lab-controller-0
redis-bundle-1 (ocf::heartbeat:redis): Slave lab-controller-1
redis-bundle-2 (ocf::heartbeat:redis): Slave lab-controller-2
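Given the pacemaker-managed Redis bundle above, a first triage step is to confirm which replica is Master and whether each controller actually answers on the Redis port. A standard-library sketch (node names taken from the `pcs` output above; port 6379 is the Redis default and an assumption here, since the bundle may bind elsewhere):

```python
import socket

def redis_roles(pcs_output):
    """Map node -> role from `pcs status` lines for ocf::heartbeat:redis."""
    roles = {}
    for line in pcs_output.splitlines():
        parts = line.split()
        # e.g. "redis-bundle-0 (ocf::heartbeat:redis): Master lab-controller-0"
        if len(parts) >= 4 and parts[1] == "(ocf::heartbeat:redis):":
            roles[parts[3]] = parts[2]
    return roles

def redis_ping(host, port=6379, timeout=3.0):
    """Send a raw RESP PING; True iff the server answers +PONG."""
    try:
        with socket.create_connection((host, port), timeout=timeout) as sock:
            sock.sendall(b"*1\r\n$4\r\nPING\r\n")  # RESP encoding of PING
            return sock.recv(16).startswith(b"+PONG")
    except OSError:
        return False
```

Note that an AUTH-protected Redis (common in OSP deployments) answers `-NOAUTH` rather than `+PONG`; that still proves the TCP path works, while a `False` from `redis_ping` points at the connectivity problem suspected in this bug.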
Can we please separate the issues? Could you open a new ticket for this? I'd need a bit more info from your side. This bug here is about OSP14.

Discussed offline (https://bugzilla.redhat.com/show_bug.cgi?id=1661784#c5); this will be moved to a different BZ.

(In reply to Matthias Runge from comment #7)
> discussed offline https://bugzilla.redhat.com/show_bug.cgi?id=1661784#c5,
> will be moved to a different bz.

Hi Matthias,

We are struggling with the same issue reported by Sinan Polat. Did you open a new BZ for this topic?
Description of problem:

OSP14 Minor Update: aodh_evaluator and ceilometer_agent_central report unhealthy after the update completed.

Version-Release number of selected component (if applicable):

OSP14 Minor Update from 2018-12-12.4 to 2018-12-14.1

How reproducible:

Sometimes

Steps to Reproduce:
1. Run OSP Minor Update

Actual results:

After the update completed, ceilometer_agent_central and aodh_evaluator are up but unhealthy on ALL 3 controllers. Here's some trace from controller-2:

2a80d267c8a6 192.168.24.1:8787/rhosp14/openstack-ceilometer-central:2018-12-14.1 "kolla_start" 5 days ago Up 5 days (unhealthy) ceilometer_agent_central
f8e61631581f 192.168.24.1:8787/rhosp14/openstack-aodh-evaluator:2018-12-14.1 "kolla_start" 5 days ago Up 5 days (unhealthy) aodh_evaluator

*** controller-2: /var/log/containers/aodh/aodh-evaluator.log ***

2018-12-23 08:01:26.915 28 ERROR aodh.coordination [-] Error sending a heartbeat to coordination backend.: ToozConnectionError: Error while reading from socket: ('Connection closed by server.',)
2018-12-23 08:01:26.915 28 ERROR aodh.coordination Traceback (most recent call last):
2018-12-23 08:01:26.915 28 ERROR aodh.coordination File "/usr/lib/python2.7/site-packages/aodh/coordination.py", line 152, in heartbeat
2018-12-23 08:01:26.915 28 ERROR aodh.coordination File "/usr/lib64/python2.7/contextlib.py", line 35, in __exit__
2018-12-23 08:01:26.915 28 ERROR aodh.coordination File "/usr/lib/python2.7/site-packages/tooz/drivers/redis.py", line 51, in _translate_failures
2018-12-23 08:01:26.915 28 ERROR aodh.coordination File "/usr/lib/python2.7/site-packages/tooz/utils.py", line 225, in raise_with_cause
2018-12-23 08:01:26.915 28 ERROR aodh.coordination File "/usr/lib/python2.7/site-packages/six.py", line 718, in raise_from
2018-12-23 08:01:26.915 28 ERROR aodh.coordination ToozConnectionError: Error while reading from socket: ('Connection closed by server.',)

*** controller-2: /var/log/containers/ceilometer/central.log ***

2018-12-23 00:01:25.234 28 ERROR root [-] Unexpected exception occurred 60 time(s)... retrying.: ToozConnectionError: Error while reading from socket: ('Connection closed by server.',)
2018-12-23 00:01:25.234 28 ERROR root Traceback (most recent call last):
2018-12-23 00:01:25.234 28 ERROR root File "/usr/lib/python2.7/site-packages/oslo_utils/excutils.py", line 250, in wrapper
2018-12-23 00:01:25.234 28 ERROR root File "/usr/lib/python2.7/site-packages/tooz/coordination.py", line 187, in _beat_forever_until_stopped
2018-12-23 00:01:25.234 28 ERROR root File "/usr/lib/python2.7/site-packages/tooz/drivers/redis.py", line 513, in heartbeat
2018-12-23 00:01:25.234 28 ERROR root File "/usr/lib64/python2.7/contextlib.py", line 35, in __exit__
2018-12-23 00:01:25.234 28 ERROR root File "/usr/lib/python2.7/site-packages/tooz/drivers/redis.py", line 51, in _translate_failures
2018-12-23 00:01:25.234 28 ERROR root File "/usr/lib/python2.7/site-packages/tooz/utils.py", line 225, in raise_with_cause
2018-12-23 00:01:25.234 28 ERROR root File "/usr/lib/python2.7/site-packages/six.py", line 718, in raise_from
2018-12-23 00:01:25.234 28 ERROR root ToozConnectionError: Error while reading from socket: ('Connection closed by server.',)
2018-12-23 00:01:59.495 28 ERROR futurist.periodics [-] Failed to call periodic 'ceilometer.polling.manager.task' (it runs every 300.00 seconds): ToozConnectionError: Error while reading from socket: ('Connection closed by server.',)
2018-12-23 00:01:59.495 28 ERROR futurist.periodics File "/usr/lib/python2.7/site-packages/futurist/periodics.py", line 290, in run
2018-12-23 00:01:59.495 28 ERROR futurist.periodics File "/usr/lib/python2.7/site-packages/futurist/periodics.py", line 64, in __call__
2018-12-23 00:01:59.495 28 ERROR futurist.periodics File "/usr/lib/python2.7/site-packages/futurist/periodics.py", line 178, in decorator
2018-12-23 00:01:59.495 28 ERROR futurist.periodics File "/usr/lib/python2.7/site-packages/ceilometer/polling/manager.py", line 394, in task
2018-12-23 00:01:59.495 28 ERROR futurist.periodics File "/usr/lib/python2.7/site-packages/ceilometer/polling/manager.py", line 423, in interval_task
2018-12-23 00:01:59.495 28 ERROR futurist.periodics File "/usr/lib/python2.7/site-packages/tooz/drivers/redis.py", line 749, in run_watchers
2018-12-23 00:01:59.495 28 ERROR futurist.periodics File "/usr/lib/python2.7/site-packages/tooz/coordination.py", line 662, in get

*** controller-2: /var/log/containers/neutron/server.log ***

2018-12-23 06:03:11.062 31 DEBUG neutron.wsgi [-] (31) accepted ('172.17.1.24', 51542) server /usr/lib/python2.7/site-packages/eventlet/wsgi.py:883
2018-12-23 06:03:11.065 31 INFO neutron.wsgi [-] 172.17.1.24 "OPTIONS / HTTP/1.0" status: 200 len: 248 time: 0.0014091
2018-12-23 06:03:11.437 38 WARNING networking_odl.common.websocket_client [req-3ccb9ec8-7cb4-4462-8107-6df8f891377a - - - - -] websocket connection closed or IO error: WebSocketConnectionClosedException: Connection is already closed.
2018-12-23 06:03:11.437 38 ERROR networking_odl.common.websocket_client Traceback (most recent call last):
2018-12-23 06:03:11.437 38 ERROR networking_odl.common.websocket_client File "/usr/lib/python2.7/site-packages/networking_odl/common/websocket_client.py", line 124, in run_websocket_thread
2018-12-23 06:03:11.437 38 ERROR networking_odl.common.websocket_client File "/usr/lib/python2.7/site-packages/websocket/_core.py", line 348, in recv
2018-12-23 06:03:11.437 38 ERROR networking_odl.common.websocket_client File "/usr/lib/python2.7/site-packages/websocket/_core.py", line 365, in recv_data
2018-12-23 06:03:11.437 38 ERROR networking_odl.common.websocket_client File "/usr/lib/python2.7/site-packages/websocket/_core.py", line 378, in recv_data_frame
2018-12-23 06:03:11.437 38 ERROR networking_odl.common.websocket_client File "/usr/lib/python2.7/site-packages/websocket/_core.py", line 410, in recv_frame
2018-12-23 06:03:11.437 38 ERROR networking_odl.common.websocket_client File "/usr/lib/python2.7/site-packages/websocket/_abnf.py", line 249, in recv_header
2018-12-23 06:03:11.437 38 ERROR networking_odl.common.websocket_client File "/usr/lib/python2.7/site-packages/websocket/_abnf.py", line 334, in recv_strict
2018-12-23 06:03:11.437 38 ERROR networking_odl.common.websocket_client File "/usr/lib/python2.7/site-packages/websocket/_core.py", line 476, in _recv
2018-12-23 06:03:11.443 38 DEBUG networking_odl.common.websocket_client [req-3ccb9ec8-7cb4-4462-8107-6df8f891377a - - - - -] closing websocket _close_ws /usr/lib/python2.7/site-packages/networking_odl/common/websocket_client.py:289
2018-12-23 06:03:11.443 38 INFO networking_odl.common.websocket_client [req-3ccb9ec8-7cb4-4462-8107-6df8f891377a - - - - -] websocket transition to status ODL_WEBSOCKET_DISCONNECTED
2018-12-23 06:03:11.445 38 DEBUG networking_odl.common.client [req-3ccb9ec8-7cb4-4462-8107-6df8f891377a - - - - -] Sending METHOD (post) URL (http://172.17.1.13:8081/restconf/operations/sal-remote:create-data-change-event-subscription) JSON ({
2018-12-23 06:03:11.460 38 DEBUG networking_odl.common.client [req-3ccb9ec8-7cb4-4462-8107-6df8f891377a - - - - -] Got response:
2018-12-23 06:03:11.469 38 DEBUG networking_odl.common.websocket_client [req-3ccb9ec8-7cb4-4462-8107-6df8f891377a - - - - -] websocket stream URL: ws://172.17.1.13:8185/data-change-event-subscription/neutron:neutron/neutron:ports/datastore=OPERATIONAL/scope=SUBTREE _subscribe_websocket /usr/lib/python2.7/site-packages/networking_odl/common/websocket_client.py:227
2018-12-23 06:03:11.478 38 DEBUG networking_odl.ml2.port_status_update [req-3ccb9ec8-7cb4-4462-8107-6df8f891377a - - - - -] Websocket notification on reconnection _process_websocket_reconnect /usr/lib/python2.7/site-packages/networking_odl/ml2/port_status_update.py:102
2018-12-23 06:03:11.479 38 DEBUG networking_odl.ml2.port_status_update [-] starting to pull pending statuses... _pull_missed_statuses /usr/lib/python2.7/site-packages/networking_odl/ml2/port_status_update.py:108

*** controller-2: /var/log/containers/nova/nova-scheduler.log ***

2018-12-23 00:01:24.404 29 DEBUG oslo_service.periodic_task [req-ad621c19-f8d2-4dcb-9a03-46e6860beac2 - - - - -] Running periodic task SchedulerManager._run_periodic_tasks run_periodic_tasks /usr/lib/python2.7/site-packages/oslo_service/periodic_task.py:215
2018-12-23 00:02:01.960 27 ERROR oslo.messaging._drivers.impl_rabbit [-] Failed to process message ... skipping it.: DuplicateMessageError: Found duplicate message(dd925a5c8fa44b789bb6287ce241f801). Skipping it.
2018-12-23 00:02:01.960 27 ERROR oslo.messaging._drivers.impl_rabbit Traceback (most recent call last):
2018-12-23 00:02:01.960 27 ERROR oslo.messaging._drivers.impl_rabbit File "/usr/lib/python2.7/site-packages/oslo_messaging/_drivers/amqpdriver.py", line 251, in __call__
2018-12-23 00:02:01.960 27 ERROR oslo.messaging._drivers.impl_rabbit File "/usr/lib/python2.7/site-packages/oslo_messaging/_drivers/impl_rabbit.py", line 361, in _callback
2018-12-23 00:02:01.960 27 ERROR oslo.messaging._drivers.impl_rabbit File "/usr/lib/python2.7/site-packages/oslo_messaging/_drivers/amqp.py", line 122, in check_duplicate_message
2018-12-23 00:02:01.960 27 ERROR oslo.messaging._drivers.impl_rabbit DuplicateMessageError: Found duplicate message(dd925a5c8fa44b789bb6287ce241f801). Skipping it.

*** controller-2: /var/log/containers/opendaylight/karaf/logs/karaf.log ***

2018-12-12T20:42:23,424 | INFO | Thread-64 | BgpThriftService | 338 - org.opendaylight.netvirt.bgpmanager-impl - 0.6.4.redhat-7 | Bgp thrift server pre serve event
2018-12-12T20:42:23,440 | INFO | Blueprint Extender: 1 | ReconciliationManagerImpl | 372 - org.opendaylight.openflowplugin.applications.reconciliation-framework - 0.6.4.redhat-7 | ReconciliationManager has started successfully.
2018-12-12T20:42:23,444 | INFO | Blueprint Event Dispatcher: 1 | BlueprintBundleTracker | 194 - org.opendaylight.controller.blueprint - 0.8.4.redhat-7 | Blueprint container for bundle org.opendaylight.openflowplugin.applications.reconciliation-framework_0.6.4.redhat-7 [372] was successfully created
2018-12-12T20:42:23,584 | INFO | awaitility[checkBundleDiagInfos] | SystemReadyImpl | 282 - org.opendaylight.infrautils.ready-impl - 1.3.4.redhat-7 | checkBundleDiagInfos: Elapsed time 2s, remaining time 297s, diag: Booting {Installed=0, Resolved=6, Unknown=0, GracePeriod=37, Waiting=0, Starting=0, Active=432, Stopping=0, Failure=0}
2018-12-12T20:42:23,612 | ERROR | Blueprint Extender: 2 | UpgradeStateListener | 406 - org.opendaylight.serviceutils.upgrade - 0.1.4.redhat-7 | Failed to write mdsal config java.util.concurrent.ExecutionException: OptimisticLockFailedException{message=Optimistic lock failed for path /(urn:opendaylight:serviceutils:upgrade?revision=2018-07-02)upgrade-config, errorList=[RpcError [message=Optimistic lock failed for path /(urn:opendaylight:serviceutils:upgrade?revision=2018-07-02)upgrade-config, severity=ERROR, errorType=APPLICATION, tag=resource-denied, applicationTag=null, info=null, cause=org.opendaylight.yangtools.yang.data.api.schema.tree.ConflictingModificationAppliedException: Node was replaced by other transaction.]]}
at com.google.common.util.concurrent.AbstractFuture.getDoneValue(AbstractFuture.java:528) [38:com.google.guava:23.6.1.jre]
2018-12-16T10:18:38,790 | INFO | nioEventLoopGroup-9-7 | SystemNotificationsListenerImpl | 378 - org.opendaylight.openflowplugin.impl - 0.6.4.redhat-7 | ConnectionEvent: Connection closed by device, Device:/172.17.1.24:50846, NodeId:null
2018-12-16T10:18:38,790 | INFO | nioEventLoopGroup-11-2 | AbstractConnectionAdapter | 387 - org.opendaylight.openflowplugin.openflowjava.openflow-protocol-impl - 0.6.4.redhat-7 | The channel outbound queue size:1024
2018-12-16T10:18:40,581 | INFO | nioEventLoopGroup-9-8 | ReconciliationManagerImpl | 372 - org.opendaylight.openflowplugin.applications.reconciliation-framework - 0.6.4.redhat-7 | Stopping reconciliation for node Uri [_value=openflow:119175419828361]
2018-12-16T10:18:40,581 | ERROR | nioEventLoopGroup-5-3 | ExceptionHandler | 392 - org.opendaylight.ovsdb.library - 1.6.4.redhat-7 | Exception occurred while processing connection pipeline
2018-12-16T10:18:40,581 | INFO | nioEventLoopGroup-5-3 | OvsdbConnectionManager | 396 - org.opendaylight.ovsdb.southbound-impl - 1.6.4.redhat-7 | Library disconnected PASSIVE from /172.17.1.39:59376 to /172.17.1.24:6640. Cleaning up the operational data store

Expected results:

All containers should be up and healthy after the Minor Update.

Additional info:

Here are SOS-Reports: http://rhos-release.virt.bos.redhat.com/log/bz1661190/ (please ignore the bz number of this sosreport link name)