Bug 1299833
| Summary: | ceilometer.coordination ToozError: You can't write against a read only slave | | |
|---|---|---|---|
| Product: | Red Hat OpenStack | Reporter: | Eduard Barrera <ebarrera> |
| Component: | rhel-osp-installer | Assignee: | Mike Burns <mburns> |
| Status: | CLOSED EOL | QA Contact: | Omri Hochman <ohochman> |
| Severity: | unspecified | Docs Contact: | |
| Priority: | unspecified | | |
| Version: | 7.0 (Kilo) | CC: | ebarrera, eglynn, jinqlu, jruzicka, mabaakou, rhos-maint, srevivo, sspreitz |
| Target Milestone: | --- | Keywords: | ZStream |
| Target Release: | 8.0 (Liberty) | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | Bug Fix |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2016-09-29 13:47:00 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
Description
Eduard Barrera
2016-01-19 11:12:47 UTC
I did this troubleshooting. Please be aware that I'm not a developer, so it could be completely wrong:

```
2015-12-16 08:07:54.136 15156 ERROR ceilometer.coordination [-] Error sending a heartbeat to coordination backend.
2015-12-16 08:07:54.136 15156 TRACE ceilometer.coordination Traceback (most recent call last):
2015-12-16 08:07:54.136 15156 TRACE ceilometer.coordination   File "/usr/lib/python2.7/site-packages/ceilometer/coordination.py", line 105, in heartbeat
2015-12-16 08:07:54.136 15156 TRACE ceilometer.coordination     self._coordinator.heartbeat()
2015-12-16 08:07:54.136 15156 TRACE ceilometer.coordination   File "/usr/lib/python2.7/site-packages/tooz/drivers/redis.py", line 426, in heartbeat
2015-12-16 08:07:54.136 15156 TRACE ceilometer.coordination     value=b"Not dead!")
2015-12-16 08:07:54.136 15156 TRACE ceilometer.coordination   File "/usr/lib64/python2.7/contextlib.py", line 35, in __exit__
2015-12-16 08:07:54.136 15156 TRACE ceilometer.coordination     self.gen.throw(type, value, traceback)
2015-12-16 08:07:54.136 15156 TRACE ceilometer.coordination   File "/usr/lib/python2.7/site-packages/tooz/drivers/redis.py", line 84, in _translate_failures
2015-12-16 08:07:54.136 15156 TRACE ceilometer.coordination     raise coordination.ToozError(utils.exception_message(e))
2015-12-16 08:07:54.136 15156 TRACE ceilometer.coordination ToozError: You can't write against a read only slave.
```

File "/usr/lib/python2.7/site-packages/ceilometer/coordination.py", line 105, in heartbeat:

```python
 99     def heartbeat(self):
100         if self._coordinator:
101             if not self._started:
102                 # re-connect
103                 self.start()
104             try:
105                 self._coordinator.heartbeat()    # <=======
106             except tooz.coordination.ToozError:
107                 LOG.exception(_LE('Error sending a heartbeat to coordination '
108                                   'backend.'))
```

File "/usr/lib/python2.7/site-packages/tooz/drivers/redis.py", line 426, in heartbeat:

```python
411     def heartbeat(self):
412         with _translate_failures():
413             beat_id = self._encode_beat_id(self._member_id)
414             # Use milliseconds if we can (which are more accurate than
415             # just seconds); but this PSETEX support was added in 2.6.0 or
416             # newer so we can only use it then...
417             supports_psetex, _version = self._check_fetch_redis_version(
418                 '2.6.0', not_existent=False)
419             if not supports_psetex:
420                 expiry_secs = max(0, int(self.membership_timeout))
421                 self._client.setex(beat_id, time=expiry_secs,
422                                    value=b"Not dead!")
423             else:
424                 expiry_ms = max(0, int(self.membership_timeout * 1000.0))
425                 self._client.psetex(beat_id, time_ms=expiry_ms,
426                                     value=b"Not dead!")    # <==================
427             for lock in self._acquired_locks:
428                 try:
429                     lock.heartbeat()
430                 except coordination.ToozError:
431                     LOG.warning("Unable to heartbeat lock '%s'", lock,
432                                 exc_info=True)
```

From the Redis documentation: PSETEX works exactly like SETEX, with the sole difference that the expire time is specified in milliseconds instead of seconds. SETEX sets the key to hold the string value and sets the key to time out after a given number of seconds; it is equivalent to executing SET followed by EXPIRE. SETEX is atomic, and can be reproduced by using those two commands inside a MULTI / EXEC block. It is provided as a faster alternative to the given sequence of operations, because this operation is very common when Redis is used as a cache. An error is returned when the seconds value is invalid. Example:

```
redis> SETEX mykey 10 "Hello"
OK
redis> TTL mykey
(integer) 10
redis> GET mykey
"Hello"
```

So the above piece of code seems to add a key beat_id, with an expiration of expiry_ms, holding the value "Not dead!".
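For illustration, here is a minimal standalone sketch of that same keepalive pattern using redis-py; the key name, timeout, and server address are made up for the example, and the error noted in the comment is what tooz then wraps into ToozError:

```python
# Minimal sketch of the tooz heartbeat pattern, assuming redis-py is
# installed and a Redis server listens on localhost:6379. The key name
# and timeout below are invented for illustration.
import redis

client = redis.StrictRedis(host='localhost', port=6379)

membership_timeout = 30.0        # hypothetical group timeout, in seconds
beat_id = 'my-group:my-member'   # hypothetical per-member heartbeat key

# PSETEX writes the key with a millisecond expiry; if the member stops
# heartbeating, the key expires and the member is considered dead.
expiry_ms = max(0, int(membership_timeout * 1000.0))
client.psetex(beat_id, time_ms=expiry_ms, value=b'Not dead!')

# Against a read-only slave this write fails with a redis-py
# ResponseError ("You can't write against a read only slave."), which
# tooz's _translate_failures() re-raises as coordination.ToozError.
```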
File "/usr/lib64/python2.7/contextlib.py", line 35, in __exit__:

```python
21     def __exit__(self, type, value, traceback):
22         if type is None:
23             try:
24                 self.gen.next()
25             except StopIteration:
26                 return
27             else:
28                 raise RuntimeError("generator didn't stop")
29         else:
30             if value is None:
31                 # Need to force instantiation so we can reliably
32                 # tell if we get the same exception back
33                 value = type()
34             try:
35                 self.gen.throw(type, value, traceback)    # <=============================
36                 raise RuntimeError("generator didn't stop after throw()")
37             except StopIteration, exc:
38                 # Suppress the exception *unless* it's the same exception that
39                 # was passed to throw(). This prevents a StopIteration
40                 # raised inside the "with" statement from being suppressed
41                 return exc is not value
42             except:
43                 # only re-raise if it's *not* the exception that was
44                 # passed to throw(), because __exit__() must not raise
45                 # an exception unless __exit__() itself failed. But throw()
46                 # has to raise the exception to signal propagation, so this
47                 # fixes the impedance mismatch between the throw() protocol
48                 # and the __exit__() protocol.
49                 #
50                 if sys.exc_info()[1] is not value:
51                     raise
```

File "/usr/lib/python2.7/site-packages/tooz/drivers/redis.py", line 84, in _translate_failures:

```python
77 def _translate_failures():
78     """Translates common redis exceptions into tooz exceptions."""
79     try:
80         yield
81     except (exceptions.ConnectionError, exceptions.TimeoutError) as e:
82         raise coordination.ToozConnectionError(utils.exception_message(e))
83     except exceptions.RedisError as e:
84         raise coordination.ToozError(utils.exception_message(e))    # <=================
```

About the ToozError: this occurs because Ceilometer is connected to a slave Redis node. According to your haproxy configuration, the time to switch to a newly elected master Redis node is 25 minutes (timeout check * fall), and can be more, because the current tcp-check doesn't use a fresh connection for each check. Because you only assert on "tcp-check expect string role:master", haproxy waits 5 (fall) * 300s (timeout check) before retrying the tcp check instead of using the expected retry period of 2s (inter). When "role:slave" is returned instead of "role:master", the tcp-check doesn't return KO and keeps waiting for the string "role:master" for the "timeout check" period. Also, "tcp-check connect" is not used, so there can be races where Redis commands are sent to a closed socket. You also need to ensure that your Pacemaker configuration stops/kills Redis when it is demoted from master; otherwise Ceilometer will stay connected to it even though haproxy marks the node as DOWN.

So I would recommend setting the tcp-check options for Redis like this (the tcp-check sequence is the important part):

```
listen redis
  bind XXX.XXX.XXX.XXX:6379
  balance first
  option tcp-check
  option tcplog
  tcp-check connect
  tcp-check send PING\r\n
  tcp-check expect string +PONG
  tcp-check send info\ replication\r\n
  tcp-check expect string role:master
  tcp-check send QUIT\r\n
  tcp-check expect string +OK
  timeout client 0
  timeout server 0
  server overcloud-controller-0 XXX.XXX.XXX.XXX:6379 check fall 5 inter 2000 rise 2
  server overcloud-controller-1 XXX.XXX.XXX.XXX:6379 check fall 5 inter 2000 rise 2
  server overcloud-controller-2 XXX.XXX.XXX.XXX:6379 check fall 5 inter 2000 rise 2
```
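To illustrate what the tcp-check above asserts, here is a small sketch (host address is a placeholder) that asks a Redis endpoint for its replication role via redis-py; only a node reporting role:master should ever receive writes such as the heartbeat PSETEX:

```python
# Sketch: query the replication role that the haproxy tcp-check looks
# for. The host below stands in for one of the overcloud controllers.
import redis

client = redis.StrictRedis(host='192.0.2.10', port=6379)

# INFO replication reports role:master on the master and role:slave on
# replicas; the tcp-check accepts the backend only for role:master.
role = client.info('replication')['role']
print('role:', role)
if role != 'master':
    print('writes (e.g. the tooz heartbeat PSETEX) would fail here')
```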
About the HTTP timeout: this can simply be because your MongoDB really does hold a huge number of samples, so querying them without filtering will always result in an HTTP timeout. Can you check the number of samples directly in the MongoDB database? When the number of samples collected by Ceilometer is high, it is strongly recommended not to use the dashboard interface, and to always use a filter when the samples database is queried.

Hi Mehdi, thanks for your analysis. The customer configuration is the default osp-director configuration. Should I open an RFE?

Yes, that should be tracked for osp-director, in that case.

Some info about the database:

```
tripleo:PRIMARY> show dbs
admin       (empty)
ceilometer  39.935GB
local       14.071GB
```

I don't know if these sizes are normal for Mongo. I would like to emphasize that the main problem is when we click "Resource Usage" on Horizon; we tried to reproduce it with sample-list. I'm no expert on Ceilometer, but does this mean that every OSP7 installation will hit this problem over time? Perhaps we have to file a bugzilla against Horizon and have the query changed. Additionally, I think it is also worth checking "After upgrade mongodb from 2.4.9 to 2.6.9 our queries on mongo is very slow": https://bugzilla.redhat.com/show_bug.cgi?id=1303347

Yeah, the dashboard is not going to work with 40G of samples. This problem is well known. The Ceilometer storage was designed for post-processing audit/billing, not for UI/realtime/monitoring; even if Horizon improves the query, that will not really fix the issue. In a future OSP version (tech preview in OSP8, I guess), a new Ceilometer storage backend will be available that uses Gnocchi to store samples. That one will be more suitable for the Horizon dashboard. But the Horizon plugin for Gnocchi doesn't exist yet, and the current Horizon plugin for Ceilometer doesn't allow querying the new storage system.

So how can we solve this problem? The platform was not even in use yet, just installed and being looked at, and we already got 40GB of data; how will it be once in production? Things we could do here:

- Remove the Ceilometer query from the dashboard to avoid the timeout
- Take a look at this bugzilla; we could gain 40% performance: https://bugzilla.redhat.com/show_bug.cgi?id=1303347
- Don't store too much data. How?

Can you give us some advice to solve the problem and avoid it happening in the future? If it's a well-known problem, is there a workaround? Thanks.

Removing the Ceilometer query from the dashboard: I have heard that's something planned, but I can't find the details again. The bugzilla about query performance: at first glance, no clue about that one. Not storing too much data: Ceilometer provides /etc/ceilometer/pipeline.yaml, which can be customized to select the meters you care about. By default, it collects all meters sent by all OpenStack applications whose notification payloads it can read or whose APIs it can poll. The supported meter list can be found here: http://docs.openstack.org/admin-guide-cloud/telemetry-measurements.html, and the pipeline.yaml docs here: http://docs.openstack.org/developer/ceilometer/configuration.html#pipelines

Thanks for the links. Additionally, I found with mschuppert that time_to_live was set to -1, so the data is kept in the database forever. I'm going to suggest to the customer to also set it so that data expires after one month.
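As an aside, a quick way to check the sample volume and the TTL state from Python rather than the mongo shell; this is only a sketch, and it assumes the default 'ceilometer' database and the 'meter' collection that the Ceilometer MongoDB backend is commonly described as using for samples (the connection URI is a placeholder):

```python
# Sketch: count Ceilometer samples and look for a TTL index, assuming
# the default 'ceilometer' database and its 'meter' samples collection.
from pymongo import MongoClient

client = MongoClient('mongodb://192.0.2.10:27017')
db = client['ceilometer']

# Number of stored samples; with ~40GB of data this will be very large.
print('samples:', db['meter'].count())

# When time_to_live is set to a positive value, old samples should be
# expired via a TTL index; with -1 nothing is ever deleted.
for name, spec in db['meter'].index_information().items():
    if 'expireAfterSeconds' in spec:
        print('TTL index:', name, '->', spec['expireAfterSeconds'], 's')
```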
What I'm not sure about now is whether the old data will be deleted automatically if we set the TTL to a reasonable value, or whether we have to enable it in Mongo, with something like:

```
mongo> db.getSiblingDB('admin').runCommand({getParameter: 1, ttlMonitorEnabled: 1})
```

In my environment I set the thread running, but it didn't delete anything:

```
mongo> db.serverStatus().metrics.ttl
{ "deletedDocuments" : NumberLong(0), "passes" : NumberLong(90) }
```

It also seems it would be necessary to create an index for the TTL to take effect?

```
mongo> db.log_events.createIndex( { "createdAt": 1 }, { expireAfterSeconds: 3600 } )
```

Oh, it seems restarting Ceilometer with the proper TTL value does the job. So my question is: is it necessary to enable the TTL thread, or will Ceilometer do it?

```
mongo> db.getSiblingDB('admin').runCommand({getParameter: 1, ttlMonitorEnabled: 1})
```

Thanks

What do you call the 'TTL thread'? If this is the 'ceilometer-expirer' cron job: when MongoDB is used as backend, 'ceilometer-expirer' is not needed. If this is the MongoDB background thread that deletes data: Ceilometer doesn't enable it, but it should be enabled by default in MongoDB.

What do we do with this BZ, should we reassign it to the installer? Or has another one already been opened, so we can close this one?

Hi Mehdi, we see the same issue in Cisco OSP7: from the OpenStack GUI login, the "Resource Usage" tab gives "504 Gateway time out" errors. Please give some update on this fix. Thanks, Ben

Closing the list of bugs for RHEL OSP Installer since its support cycle has already ended [0]. If some bug was closed by mistake, feel free to re-open. For new deployments, please use RHOSP director (starting with version 7).

-- Jaromir Coufal
-- Sr. Product Manager
-- Red Hat OpenStack Platform

[0] https://access.redhat.com/support/policy/updates/openstack/platform