Description of problem:

If Gnocchi is unable to create a metric directory (mkdir) on the backend, it leaves its connection to Redis open. In this environment the Gnocchi backend has a limit of 65532 files (we know we can work around this by deleting expired metrics), so gnocchi-carbonara is unable to create the metric directory:

~~~
2018-03-29 14:44:22.443 86371 ERROR gnocchi.storage._carbonara [-] Error processing new measures
2018-03-29 14:44:22.443 86371 ERROR gnocchi.storage._carbonara Traceback (most recent call last):
2018-03-29 14:44:22.443 86371 ERROR gnocchi.storage._carbonara   File "/usr/lib/python2.7/site-packages/gnocchi/storage/_carbonara.py", line 538, in process_new_measures
2018-03-29 14:44:22.443 86371 ERROR gnocchi.storage._carbonara     self._create_metric(metric)
2018-03-29 14:44:22.443 86371 ERROR gnocchi.storage._carbonara   File "/usr/lib/python2.7/site-packages/gnocchi/storage/file.py", line 108, in _create_metric
2018-03-29 14:44:22.443 86371 ERROR gnocchi.storage._carbonara     os.mkdir(path, 0o750)
2018-03-29 14:44:22.443 86371 ERROR gnocchi.storage._carbonara OSError: [Errno 5] Input/output error: '/var/lib/gnocchi/afba56af-9fc7-4778-ab13-c65ff8a24688'
2018-03-29 14:44:22.443 86371 ERROR gnocchi.storage._carbonara
~~~

When this happens, Gnocchi leaves its connection to Redis open instead of closing it, and eventually Redis suffers from resource starvation. As the following example shows, roughly 75% of the established connections through haproxy go to Redis (port 6379). At some point this is going to blow up and impact other services as well.

~~~
# ss -tanp | grep haproxy | grep -oP "^ESTAB .*192.168.1.(11|28):\K([0-9]+)" | sort | uniq -c | sort -nr | head -1
   6131 6379
# ss -tanp | grep haproxy | wc -l
8245
~~~

Version-Release number of selected component (if applicable):
haproxy-1.5.18-6.el7.x86_64
openstack-gnocchi-api-3.0.15-1.el7ost.noarch
openstack-gnocchi-carbonara-3.0.15-1.el7ost.noarch
openstack-gnocchi-common-3.0.15-1.el7ost.noarch
openstack-gnocchi-indexer-sqlalchemy-3.0.15-1.el7ost.noarch
openstack-gnocchi-metricd-3.0.15-1.el7ost.noarch
openstack-gnocchi-statsd-3.0.15-1.el7ost.noarch
puppet-gnocchi-9.5.0-3.el7ost.noarch
puppet-haproxy-1.5.0-3.f8c5f27git.el7ost.noarch
puppet-redis-1.2.3-2.el7ost.noarch
python-gnocchi-3.0.15-1.el7ost.noarch
python-gnocchiclient-2.8.2-2.el7ost.noarch
python-redis-2.10.3-3.el7ost.noarch
redis-3.0.6-2.el7ost.x86_64

How reproducible:
All the time. It takes ~48h in this environment to break.

Steps to Reproduce:
1. Gnocchi uses file storage
2. The storage backend is unreliable
3. Wait ~48h

Actual results:
Connections to Redis pile up.

Expected results:
Connections should get closed.

Additional info:
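For illustration only, a minimal hypothetical sketch of the failure mode being reported. This is not Gnocchi's actual code, and the host/port are placeholders; it just shows how an exception escaping before cleanup strands one Redis socket per failed attempt:

~~~
# Hypothetical minimal sketch of the leak pattern described above -- this is
# NOT Gnocchi's actual code, just an illustration of how an exception escaping
# before cleanup strands a Redis connection on every failed attempt.
import os
import redis  # python-redis, as shipped in this environment


def process_new_measures(path):
    client = redis.StrictRedis(host="127.0.0.1", port=6379)  # placeholder address
    client.ping()  # opens the TCP connection
    try:
        os.mkdir(path, 0o750)  # raises OSError (EIO) on the broken backend
        # ... coordination / aggregation work would happen here ...
    finally:
        # Without this finally block, each OSError leaves an ESTABLISHED
        # socket behind, which is how the haproxy connection count piles up.
        client.connection_pool.disconnect()
~~~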
This does not make any sense to me for now. There should be one connection to Redis per running metricd processor, and one per metricd scheduler, and that's all. The exception you see is caught and handled, and the lock is then released on the Redis side. There's no reconnection or anything to Redis at this point, so I'm struggling to see what the source of this might be. Are you sure that Gnocchi is the source? Redis is also used by e.g. Ceilometer, FWIW.
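One way to check which service actually owns the connections: a quick attribution sketch (assuming psutil is installed and the script runs as root so socket PIDs are visible) that counts established TCP connections to the Redis port per owning process name. That should show whether metricd, Ceilometer, or something else holds the ~6000 connections:

~~~
# Diagnostic sketch (assumptions: psutil installed, run as root so PIDs on
# sockets are visible). Counts ESTABLISHED TCP connections to the Redis port,
# grouped by the name of the owning process.
import collections
import psutil

REDIS_PORT = 6379  # matches the port seen in the ss output above

counts = collections.Counter()
for conn in psutil.net_connections(kind="tcp"):
    if conn.status != psutil.CONN_ESTABLISHED or not conn.raddr:
        continue
    if conn.raddr.port != REDIS_PORT:
        continue
    try:
        name = psutil.Process(conn.pid).name() if conn.pid else "unknown"
    except psutil.NoSuchProcess:
        name = "exited"
    counts[name] += 1

for name, count in counts.most_common():
    print("%6d  %s" % (count, name))
~~~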
3.0.21 released upstream with the fix. Need a rebase. Prad? :)
Hi there,

If this bug requires doc text for errata release, please set the 'Doc Type' and provide draft text according to the template in the 'Doc Text' field. The documentation team will review, edit, and approve the text.

If this bug does not require doc text, please set the 'requires_doc_text' flag to -.

Thanks,
Alex
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory, and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2018:2671