Description of problem:
Ceilometer dbsync is failing during HA deployment:

Error: /Stage[main]/Ceilometer::Db::Sync/Exec[ceilometer-dbsync]: Command exceeded timeout

In /var/log/ceilometer/ceilometer-dbsync.log:
CRITICAL ceilometer [-] ServerSelectionTimeoutError: No replica set members available for replica set name ""

Version-Release number of selected component (if applicable):
openstack-heat-templates-0.0.1-dev381.el7.centos.noarch
openstack-tripleo-heat-templates-0.8.7-dev277.el7.centos.noarch

How reproducible:
100%

Steps to Reproduce:
1. openstack overcloud deploy --templates ~/templates/my-overcloud -e ~/templates/my-overcloud/environments/network-isolation.yaml -e ~/templates/network-environment.yaml --control-scale 3 --compute-scale 1 --libvirt-type qemu -e /usr/share/openstack-tripleo-heat-templates/environments/puppet-pacemaker.yaml --ntp-server clock.redhat.com

Actual results:
Deployment fails. os-collect-config logs show:
Error: /Stage[main]/Ceilometer::Db::Sync/Exec[ceilometer-dbsync]: Failed to call refresh: Command exceeded timeout
Error: /Stage[main]/Ceilometer::Db::Sync/Exec[ceilometer-dbsync]: Command exceeded timeout

Expected results:
Deployment succeeds.
Additional info:
[root@overcloud-controller-0 ~]# grep mongo /etc/ceilometer/ceilometer.conf
connection=mongodb://172.16.20.12:27017,172.16.20.15:27017,172.16.20.13:27017/ceilometer?replicaSet=tripleo

[root@overcloud-controller-0 ~]# mongo --host 172.16.20.12 <<<'rs.status()'
MongoDB shell version: 2.6.11
connecting to: 172.16.20.12:27017/test
{
    "set" : "tripleo",
    "date" : ISODate("2015-10-12T21:21:42Z"),
    "myState" : 1,
    "members" : [
        {
            "_id" : 0,
            "name" : "172.16.20.12:27017",
            "health" : 1,
            "state" : 1,
            "stateStr" : "PRIMARY",
            "uptime" : 5273,
            "optime" : Timestamp(1444679633, 1),
            "optimeDate" : ISODate("2015-10-12T19:53:53Z"),
            "electionTime" : Timestamp(1444679641, 1),
            "electionDate" : ISODate("2015-10-12T19:54:01Z"),
            "self" : true
        },
        {
            "_id" : 1,
            "name" : "172.16.20.15:27017",
            "health" : 1,
            "state" : 2,
            "stateStr" : "SECONDARY",
            "uptime" : 5269,
            "optime" : Timestamp(1444679633, 1),
            "optimeDate" : ISODate("2015-10-12T19:53:53Z"),
            "lastHeartbeat" : ISODate("2015-10-12T21:21:41Z"),
            "lastHeartbeatRecv" : ISODate("2015-10-12T21:21:42Z"),
            "pingMs" : 0,
            "syncingTo" : "172.16.20.12:27017"
        },
        {
            "_id" : 2,
            "name" : "172.16.20.13:27017",
            "health" : 1,
            "state" : 2,
            "stateStr" : "SECONDARY",
            "uptime" : 5269,
            "optime" : Timestamp(1444679633, 1),
            "optimeDate" : ISODate("2015-10-12T19:53:53Z"),
            "lastHeartbeat" : ISODate("2015-10-12T21:21:41Z"),
            "lastHeartbeatRecv" : ISODate("2015-10-12T21:21:41Z"),
            "pingMs" : 0,
            "syncingTo" : "172.16.20.12:27017"
        }
    ],
    "ok" : 1
}
bye
I was able to run ceilometer-dbsync after applying this:
https://github.com/openstack/ceilometer/blob/master/ceilometer/storage/mongo/utils.py#L269-L276

Installed ceilometer packages:
python-ceilometer-5.0.0.0-rc1.el7.centos.noarch
openstack-ceilometer-central-5.0.0.0-rc1.el7.centos.noarch
python-ceilometerclient-1.5.1-dev1.el7.centos.noarch
openstack-ceilometer-collector-5.0.0.0-rc1.el7.centos.noarch
openstack-ceilometer-alarm-5.0.0.0-rc1.el7.centos.noarch
openstack-ceilometer-polling-5.0.0.0-rc1.el7.centos.noarch
openstack-ceilometer-api-5.0.0.0-rc1.el7.centos.noarch
openstack-ceilometer-common-5.0.0.0-rc1.el7.centos.noarch
openstack-ceilometer-notification-5.0.0.0-rc1.el7.centos.noarch
openstack-ceilometer-compute-5.0.0.0-rc1.el7.centos.noarch
(In reply to Marius Cornea from comment #1)
> I was able to run ceilometer-dbsync after applying this:

I don't see how the master code could help compared with what's in stable/liberty:
https://github.com/openstack/ceilometer/blob/stable/liberty/ceilometer/storage/mongo/utils.py#L269-L281

Could it be just a timing issue and mongodb was fully ready on the first attempt?
Please try reverting to the original stable/liberty code and try again.

> https://github.com/openstack/ceilometer/blob/master/ceilometer/storage/mongo/
> utils.py#L269-L276
>
> Installed ceilometer packages:
> python-ceilometer-5.0.0.0-rc1.el7.centos.noarch

There was a ceilometer RC2 in the meantime, but _mongo_connect was not changed between rc1 and rc2.
I tried to manually run dbsync after the cluster was up and got the following error:

/usr/bin/python2 /usr/bin/ceilometer-dbsync --config-file=/etc/ceilometer/ceilometer.conf --debug
Unable to reconnect to the primary mongodb: No replica set members available for replica set name "". Trying again in 10 seconds.

After switching to the master chunk in utils.py the dbsync finished:

[root@overcloud-controller-2 ~]# /usr/bin/python2 /usr/bin/ceilometer-dbsync --config-file=/etc/ceilometer/ceilometer.conf --debug
No handlers could be found for logger "oslo_config.cfg"
2015-10-13 11:29:27.270 29659 DEBUG ceilometer.storage [-] looking for 'mongodb' driver in 'ceilometer.metering.storage' get_connection /usr/lib/python2.7/site-packages/ceilometer/storage/__init__.py:149
2015-10-13 11:29:27.329 29659 INFO ceilometer.storage.mongo.utils [-] Connecting to mongodb on [('172.16.20.15', 27017), ('172.16.20.13', 27017), ('172.16.20.14', 27017)]
2015-10-13 11:29:27.343 29659 DEBUG ceilometer.storage [-] looking for 'mongodb' driver in 'ceilometer.alarm.storage' get_connection /usr/lib/python2.7/site-packages/ceilometer/storage/__init__.py:149
2015-10-13 11:29:27.418 29659 INFO ceilometer.storage.mongo.utils [-] Connecting to mongodb on [('172.16.20.15', 27017), ('172.16.20.13', 27017), ('172.16.20.14', 27017)]
2015-10-13 11:29:27.424 29659 WARNING oslo_config.cfg [-] Option "alarm_history_time_to_live" from group "database" is deprecated for removal. Its value may be silently ignored in the future.
2015-10-13 11:29:27.428 29659 DEBUG ceilometer.storage [-] looking for 'mongodb' driver in 'ceilometer.event.storage' get_connection /usr/lib/python2.7/site-packages/ceilometer/storage/__init__.py:149
2015-10-13 11:29:27.429 29659 INFO ceilometer.storage.mongo.utils [-] Connecting to mongodb on [('172.16.20.15', 27017), ('172.16.20.13', 27017), ('172.16.20.14', 27017)]
I have also seen this issue in my HA RDO-Manager deploys. It actually makes sense that master code is the fix, since the upstream tripleoci does not see this issue. Also, if I remove the ceilometer dbsync from the tripleo heat templates, I get a successful deploy. So, this is the only issue blocking working HA for RDO-Manager.
I've had a look at the change in https://review.openstack.org/#/c/227909/ and I don't think we can do a straight backport of that. Instead I think there's been a bug in the liberty code for some time where the code in ceilometer.storage.mongo.utils needs to have a conditional like the one in the test: https://github.com/openstack/ceilometer/commit/a6d608a33235dfa0d4ef91e3a3d69359ceb0263f#diff-0a4e8fdfc30fefb2d0aab976822c386bL3592 Basically, for liberty, if replica_set is set, use it, otherwise, use the URL without passing a replica_set argument. I'll make an upstream bug about this, targeting liberty and see where that gets us (and link it back here).
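To make the proposed conditional concrete, here is a minimal sketch of the idea. It is not the ceilometer code and not the eventual upstream patch: `build_client_kwargs` and `replica_set_from_url` are hypothetical helpers, and the `host` / `replicaSet` keyword names are assumed to match what a pymongo-style client constructor accepts. The point is only that an empty replica set name is never passed explicitly, so it cannot override the name already present in the connection URL.

```python
from urllib.parse import parse_qs


def build_client_kwargs(url, replica_set=None):
    """Build keyword arguments for a MongoDB client (hypothetical helper).

    replicaSet is passed explicitly only when a non-empty name is
    configured; otherwise the URL is used as-is, so an empty name
    never replaces the one embedded in the URL.
    """
    kwargs = {"host": url}
    if replica_set:
        kwargs["replicaSet"] = replica_set
    return kwargs


def replica_set_from_url(url):
    """Return the replicaSet option from a mongodb:// URL, or None."""
    query = url.partition("?")[2]
    values = parse_qs(query).get("replicaSet")
    return values[0] if values else None


if __name__ == "__main__":
    url = ("mongodb://172.16.20.12:27017,172.16.20.15:27017,"
           "172.16.20.13:27017/ceilometer?replicaSet=tripleo")
    # No explicit name configured: only the URL is handed to the client.
    print(build_client_kwargs(url, ""))
    # The name the URL itself carries.
    print(replica_set_from_url(url))
```

With an empty `replica_set` the client would see only the URL, which is the behaviour described above for liberty when the option is unset.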
A potential workaround (proven by mcornea) to this problem is to:
* _not_ use the replica_set parameter in the database connection URL
* set [database]mongodb_replica_set in ceilometer.conf to the name of the replica set
Adding mongodb_replica_set=tripleo to the [database] section of ceilometer.conf (I left the mongodb URL untouched) made dbsync pass.
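For anyone applying the same workaround, the [database] section would look roughly like the `SAMPLE_CONF` string below (connection URL taken from the grep output earlier in this bug, with the new option added). The small helper is a hypothetical check, not part of ceilometer, that reads the option back with Python's standard configparser. It assumes the file parses as plain INI, which may not hold for every oslo.config file.

```python
import configparser

# Assumed shape of the [database] section after the workaround.
SAMPLE_CONF = """\
[database]
connection=mongodb://172.16.20.12:27017,172.16.20.15:27017,172.16.20.13:27017/ceilometer?replicaSet=tripleo
mongodb_replica_set=tripleo
"""


def database_replica_set(conf_text):
    """Return [database] mongodb_replica_set from config text, or None."""
    # interpolation=None keeps any '%' in values literal;
    # strict=False tolerates repeated options.
    parser = configparser.ConfigParser(interpolation=None, strict=False)
    parser.read_string(conf_text)
    return parser.get("database", "mongodb_replica_set", fallback=None)


if __name__ == "__main__":
    print(database_replica_set(SAMPLE_CONF))
```

On a controller the same check could be run against the contents of /etc/ceilometer/ceilometer.conf before retrying ceilometer-dbsync.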
Upstream fix has merged to stable/liberty and been confirmed by trown.