Bug 1924373 - [RHOSP 16.1] Gnocchi throwing errors when haproxy closes TCP connections to Redis
Summary: [RHOSP 16.1] Gnocchi throwing errors when haproxy closes TCP connections to Redis
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat OpenStack
Classification: Red Hat
Component: puppet-tripleo
Version: 16.1 (Train)
Hardware: Unspecified
OS: Linux
Priority: urgent
Severity: high
Target Milestone: z9
Target Release: 16.1 (Train on RHEL 8.2)
Assignee: Martin Magr
QA Contact: Leonid Natapov
Docs Contact: Joanne O'Flynn
URL:
Whiteboard:
Depends On:
Blocks: 2070565
 
Reported: 2021-02-03 02:42 UTC by Michele Valsecchi
Modified: 2022-12-14 15:43 UTC (History)
CC List: 26 users

Fixed In Version: puppet-tripleo-11.5.0-1.20220331172929.f716ef5.el8ost
Doc Type: No Doc Update
Doc Text:
Clone Of:
Clones: 1933491, 2070565
Environment:
Last Closed: 2022-12-07 20:24:45 UTC
Target Upstream Version:
Embargoed:


Links
System ID Private Priority Status Summary Last Updated
OpenStack gerrit 829185 0 None MERGED Increase connection timeouts for Redis 2022-09-21 12:26:24 UTC
Red Hat Issue Tracker OSP-1609 0 None None None 2021-11-18 14:20:59 UTC
Red Hat Knowledge Base (Solution) 6208862 0 None None None 2021-07-22 13:15:13 UTC
Red Hat Product Errata RHBA-2022:8795 0 None None None 2022-12-07 20:25:37 UTC

Description Michele Valsecchi 2021-02-03 02:42:38 UTC
Description of problem:

- /var/log/containers/gnocchi/gnocchi-metricd.log is flooded with errors caused by Redis connections being closed by the server
- See https://bugzilla.redhat.com/show_bug.cgi?id=1791243 for very similar symptoms in RHOSP16.0

Version-Release number of selected component (if applicable):
Red Hat OpenStack Platform release 16.1.3 GA (Train)

Affected containers from my lab (see the privately attached sosreport from a live environment hitting this):
~~~

- rhosp16-openstack-gnocchi-statsd:16.1.3-5.1611701445 => [2]
  gnocchi-common-4.3.5-0.20200310232918.8cbc9ee.el8ost.noarch
  gnocchi-statsd-4.3.5-0.20200310232918.8cbc9ee.el8ost.noarch
  puppet-gnocchi-15.4.1-1.20200818130933.7b0e3c7.el8ost.noarch
  python3-gnocchi-4.3.5-0.20200310232918.8cbc9ee.el8ost.noarch
  python3-gnocchiclient-7.0.4-0.20200310205157.64814b9.el8ost.noarch

- rhosp16-openstack-gnocchi-metricd:16.1.3-5.1611701445  => [3]
  same as above 

- rhosp16-openstack-gnocchi-api:16.1.3-5.1611701445  => [4]
  same as above

- rhosp16-openstack-haproxy: => [5]
  haproxy-1.8.23-3.el8.x86_64
  puppet-haproxy-4.1.0-0.20200311060011.df96ffc.el8ost.noarch
~~~


How reproducible:
100%, even in a freshly deployed environment (another similarity with BZ1791243)

Steps to Reproduce:
1. Deploy the overcloud so that it uses telemetry
2. Monitor /var/log/containers/gnocchi/gnocchi-metricd.log on each controller node

Actual results:

- The following message is logged over and over on all Controller nodes
~~~
 The above exception was the direct cause of the following exception:
 
 Traceback (most recent call last):
   File "/usr/lib/python3.6/site-packages/gnocchi/cli/metricd.py", line 235, in _run_job
     m_count += self.chef.process_new_measures_for_sack(s)
   File "/usr/lib/python3.6/site-packages/gnocchi/chef.py", line 149, in process_new_measures_for_sack
     if not lock.acquire(blocking=blocking):
   File "/usr/lib/python3.6/site-packages/tooz/drivers/redis.py", line 93, in acquire
     return acquired
   File "/usr/lib64/python3.6/contextlib.py", line 99, in __exit__
     self.gen.throw(type, value, traceback)
   File "/usr/lib/python3.6/site-packages/tooz/drivers/redis.py", line 51, in _translate_failures
     cause=e)
   File "/usr/lib/python3.6/site-packages/tooz/utils.py", line 225, in raise_with_cause
     excutils.raise_with_cause(exc_cls, message, *args, **kwargs)
   File "/usr/lib/python3.6/site-packages/oslo_utils/excutils.py", line 143, in raise_with_cause
     six.raise_from(exc_cls(message, *args, **kwargs), kwargs.get('cause'))
   File "<string>", line 3, in raise_from
 tooz.coordination.ToozConnectionError: Error while reading from socket: ('Connection closed by server.',) <=============
~~~

- So far _it seems_ that no actual effect on the data stored in gnocchi has been noticed
- From the logs it seems that haproxy is actually terminating the connections, or rather that gnocchi (or its redis client) throws errors when haproxy terminates the TCP connections

Expected results:

- Errors should not be emitted from gnocchi.
- The TCP connections should be kept alive when going through haproxy

Additional info:

- Used nc to confirm Redis is up and running and can be reached from the controller; because of that, we also know the redis credentials are correct
- Gnocchi containers are marked as healthy the whole time
- Pcs resource shows no issues in redis bundle


- Setting "tcp-alive" in var/lib/config-data/puppet-generated/redis/etc/redis.conf did NOT mitigate the issue
- Increasing "timeout  client" by a factor of 3 and 5 in /var/lib/config-data/puppet-generated/haproxy/etc/haproxy/haproxy.cfg did NOT mitigate the issue
- Increasing "tcp-keepalive" to 300 in var/lib/config-data/puppet-generated/redis/etc/redis.conf and restarting redis bundle did NOT mitigate the issue


- As per the log shown below, we can confirm that many connections (to redis) were closed because of client timeouts, i.e. "cD" [1]
~~~
 $ less /var/log/containers/haproxy/haproxy.log
Jan  8 16:08:30 controller-x <local0.info> haproxy[6397]:10.X.Y.69:54250 [08/Jan/2021:15:41:28.737] redis redis/xxx-internal-api-xxx 1/0/1621322 6885 cD 178/178/177/177/0 0/0
Jan  8 16:08:30 controller-x <local0.info> haproxy[6397]:10.X.Y.71:34092 [08/Jan/2021:16:05:03.511] redis redis/xxx-internal-api-xxx 1/0/206681 625 cD 177/177/176/176/0 0/0
Jan  8 16:09:16 controller-x <local0.info> haproxy[6397]:10.X.Y.71:50766 [08/Jan/2021:15:55:04.515] redis redis/xxx-internal-api-xxx 1/0/851753 4805 cD 176/176/175/175/0 0/0
Jan  8 16:09:18 controller-x <local0.info> haproxy[6397]:10.X.Y.69:54732 [08/Jan/2021:16:06:30.992] redis redis/xxx-internal-api-xxx 1/0/167288 989 cD 175/175/174/174/0 0/0
Jan  8 16:09:18 controller-x <local0.info> haproxy[6397]:10.X.Y.71:34036 [08/Jan/2021:16:05:03.035] redis redis/xxx-internal-api-xxx 1/0/255724 1353 cD 174/174/173/173/0 0/0
~~~
- From the redis MONITOR data gathered in this environment, we can see only 181 connected_clients, so it's unlikely we're hitting the maxclients limitation.
  Also, the redis server won't close idle clients since timeout is 0 in this environment (see the sketch below this list for a quick way to double-check both values).


- We don't see any errors on the gnocchi side in RHOSP13 versions, so it is very likely that the client (gnocchi itself or the redis client it uses)
  has changed in RHOSP16.1 in a way that results in errors being raised when haproxy closes the connection.
  ...Or this is an issue on the haproxy side.
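
For reference, the connected_clients and server-side timeout figures quoted above can be double-checked with a few lines of redis-py. This is only an illustrative sketch; the host, port and password are placeholders, not the values from this environment:

~~~
import redis

# Illustrative sketch only (host/port/password are placeholders): confirm the
# figures quoted above without relying on MONITOR output.
client = redis.Redis(host="172.17.1.119", port=6379,
                     password="REDACTED", socket_timeout=5)

print("connected_clients:", client.info("clients")["connected_clients"])
# 'timeout' == '0' means the Redis server itself never closes idle clients,
# so the disconnects must come from haproxy's "timeout client".
print("server idle timeout:", client.config_get("timeout"))
~~~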


[1] https://cbonte.github.io/haproxy-dconv/1.9/configuration.html#8.5
[2] https://catalog.redhat.com/software/containers/rhosp-rhel8/openstack-gnocchi-statsd/5de6b58abed8bd164a0c17a9?digest=2b02bf7e8a35&container-tabs=packages
[3] https://catalog.redhat.com/software/containers/rhosp-rhel8/openstack-gnocchi-statsd/5de6b58abed8bd164a0c17a9?digest=2b02bf7e8a35&container-tabs=packages
[4] https://catalog.redhat.com/software/containers/rhosp-rhel8/openstack-gnocchi-api/5de6b374d70cc51644a56e7d?digest=5a28b10b6be7&container-tabs=packages
[5] https://catalog.redhat.com/software/containers/rhosp-rhel8/openstack-haproxy/5de6b42cd70cc51644a56ea7?digest=5a09bd8bc3a4&container-tabs=packages

Comment 1 Michele Valsecchi 2021-02-03 02:54:58 UTC
Version from live environment hitting this:
~~~
xxx:8787/rhosp-rhel8/openstack-gnocchi-metricd             16.1-51      42d164496f7c   3 months ago   922 MB
xxx:8787/rhosp-rhel8/openstack-gnocchi-statsd              16.1-51      5897deba26ee   3 months ago   922 MB
xxx:8787/rhosp-rhel8/openstack-gnocchi-api                 16.1-51      b101f6839947   3 months ago   922 MB
  gnocchi-api-4.3.5-0.20200310232918.8cbc9ee.el8ost.noarch
  gnocchi-common-4.3.5-0.20200310232918.8cbc9ee.el8ost.noarch
  puppet-gnocchi-15.4.1-1.20200818130933.7b0e3c7.el8ost.noarch
  python3-gnocchi-4.3.5-0.20200310232918.8cbc9ee.el8ost.noarch
  python3-gnocchiclient-7.0.4-0.20200310205157.64814b9.el8ost.noarch
cluster.common.tag/openstack-haproxy                       pcmklatest   007cc42591fc   3 months ago   523 MB
~~~

Comment 3 Takashi Kajinami 2021-02-03 04:37:57 UTC
So the problem seems to be that haproxy disconnects the connections to redis because of the client timeout.
We see a bunch of "cD" entries in the haproxy log.

~~~
[heat-admin@controller-0 ~]$ sudo cat /var/log/containers/haproxy/haproxy.log
...
Feb  2 14:58:31 controller-2 haproxy[15]: 172.17.1.14:53064 [02/Feb/2021:14:53:56.494] redis redis/controller-0.internalapi.redhat.local 1/0/274926 6214 cD 76/76/75/75/0 0/0
Feb  2 14:58:31 controller-2 haproxy[15]: 172.17.1.14:53128 [02/Feb/2021:14:53:57.433] redis redis/controller-0.internalapi.redhat.local 1/0/274138 6214 cD 75/75/74/74/0 0/0
Feb  2 14:58:31 controller-2 haproxy[15]: 172.17.1.14:52930 [02/Feb/2021:14:53:54.432] redis redis/controller-0.internalapi.redhat.local 1/0/277168 6214 cD 74/74/73/73/0 0/0
Feb  2 14:58:31 controller-2 haproxy[15]: 172.17.1.14:53016 [02/Feb/2021:14:53:55.651] redis redis/controller-0.internalapi.redhat.local 1/0/276082 6214 cD 73/73/72/72/0 0/0
Feb  2 14:59:25 controller-2 haproxy[15]: 172.17.1.40:52866 [02/Feb/2021:14:54:50.475] redis redis/controller-0.internalapi.redhat.local 1/0/274673 6214 cD 72/72/71/71/0 0/0
Feb  2 14:59:25 controller-2 haproxy[15]: 172.17.1.40:52786 [02/Feb/2021:14:54:49.493] redis redis/controller-0.internalapi.redhat.local 1/0/275959 6214 cD 71/71/70/70/0 0/0
Feb  2 14:59:25 controller-2 haproxy[15]: 172.17.1.40:52722 [02/Feb/2021:14:54:48.479] redis redis/controller-0.internalapi.redhat.local 1/0/276999 6214 cD 70/70/69/69/0 0/0
Feb  2 14:59:25 controller-2 haproxy[15]: 172.17.1.40:52926 [02/Feb/2021:14:54:51.463] redis redis/controller-0.internalapi.redhat.local 1/0/274115 6214 cD 69/69/68/68/0 0/0
Feb  2 14:59:43 controller-2 haproxy[15]: 172.17.1.119:34690 [02/Feb/2021:14:55:12.526] redis redis/controller-0.internalapi.redhat.local 1/0/270869 6062 cD 68/68/67/67/0 0/0
Feb  2 14:59:43 controller-2 haproxy[15]: 172.17.1.119:34696 [02/Feb/2021:14:55:12.528] redis redis/controller-0.internalapi.redhat.local 1/0/270874 6038 cD 67/67/66/66/0 0/0
Feb  2 14:59:43 controller-2 haproxy[15]: 172.17.1.119:34738 [02/Feb/2021:14:55:12.974] redis redis/controller-0.internalapi.redhat.local 1/0/270584 6214 cD 66/66/65/65/0 0/0
Feb  2 14:59:43 controller-2 haproxy[15]: 172.17.1.119:34718 [02/Feb/2021:14:55:12.725] redis redis/controller-0.internalapi.redhat.local 1/0/270855 6214 cD 65/65/64/64/0 0/0
Feb  2 15:01:31 controller-2 haproxy[15]: 172.17.1.14:36578 [02/Feb/2021:14:56:56.496] redis redis/controller-0.internalapi.redhat.local 1/0/274988 6214 cD 76/76/75/75/0 0/0
Feb  2 15:01:31 controller-2 haproxy[15]: 172.17.1.14:36444 [02/Feb/2021:14:56:54.438] redis redis/controller-0.internalapi.redhat.local 1/0/277188 6214 cD 75/75/74/74/0 0/0
Feb  2 15:01:31 controller-2 haproxy[15]: 172.17.1.14:36646 [02/Feb/2021:14:56:57.430] redis redis/controller-0.internalapi.redhat.local 1/0/274204 6214 cD 74/74/73/73/0 0/0
Feb  2 15:01:31 controller-2 haproxy[15]: 172.17.1.14:36530 [02/Feb/2021:14:56:55.659] redis redis/controller-0.internalapi.redhat.local 1/0/276137 6214 cD 73/73/72/72/0 0/0
~~~
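
For reference, a quick way to tally these termination states from the haproxy log (an illustrative sketch; the log path matches the one used above, and the regex assumes the TCP log format shown in these excerpts):

~~~
import re
from collections import Counter

# Illustrative sketch: count haproxy termination states for the redis backend.
# "cD" means the client-side timeout expired while waiting for data.
states = Counter()
with open("/var/log/containers/haproxy/haproxy.log") as log:
    for line in log:
        if " redis redis/" not in line:
            continue
        # The two-letter termination state sits just before the five
        # slash-separated connection counters.
        match = re.search(r"\s([A-Za-z-]{2})\s\d+/\d+/\d+/\d+/\d+", line)
        if match:
            states[match.group(1)] += 1

print(states.most_common())
~~~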


Interestingly, we observe the same client disconnections in RHOSP13; however, we don't see any errors
on the client side (gnocchi-metricd and so on), so I suspect something has changed in the client implementation.

~~~
[heat-admin@controller-0 ~]$ cat /etc/rhosp-release 
Red Hat OpenStack Platform release 13.0.12 (Queens)
[heat-admin@controller-0 ~]$ sudo grep redis /var/log/containers/haproxy/haproxy.log | tail -10
Feb  2 16:12:54 controller-0 haproxy[14]: 172.17.1.64:48592 [02/Feb/2021:16:09:56.337] redis redis/controller-0.internalapi.redhat.local 1/0/178529 4392 cD 81/70/70/70/0 0/0
Feb  2 16:12:54 controller-0 haproxy[14]: 172.17.1.62:40022 [02/Feb/2021:16:09:56.337] redis redis/controller-0.internalapi.redhat.local 1/0/178529 4392 cD 80/69/69/69/0 0/0
Feb  2 16:12:54 controller-0 haproxy[14]: 172.17.1.62:40024 [02/Feb/2021:16:09:56.337] redis redis/controller-0.internalapi.redhat.local 1/0/178529 4392 cD 79/68/68/68/0 0/0
Feb  2 16:12:54 controller-0 haproxy[14]: 172.17.1.126:60696 [02/Feb/2021:16:09:56.337] redis redis/controller-0.internalapi.redhat.local 1/0/178529 4392 cD 78/67/67/67/0 0/0
Feb  2 16:12:54 controller-0 haproxy[14]: 172.17.1.64:48600 [02/Feb/2021:16:09:56.338] redis redis/controller-0.internalapi.redhat.local 1/0/178529 4392 cD 77/66/66/66/0 0/0
Feb  2 16:12:54 controller-0 haproxy[14]: 172.17.1.62:40026 [02/Feb/2021:16:09:56.338] redis redis/controller-0.internalapi.redhat.local 1/0/178529 4392 cD 76/65/65/65/0 0/0
Feb  2 16:12:54 controller-0 haproxy[14]: 172.17.1.62:40028 [02/Feb/2021:16:09:56.338] redis redis/controller-0.internalapi.redhat.local 1/0/178528 4392 cD 75/64/64/64/0 0/0
Feb  2 16:12:54 controller-0 haproxy[14]: 172.17.1.64:48614 [02/Feb/2021:16:09:56.339] redis redis/controller-0.internalapi.redhat.local 1/0/178527 4392 cD 74/63/63/63/0 0/0
Feb  2 16:12:54 controller-0 haproxy[14]: 172.17.1.62:59768 [02/Feb/2021:16:10:52.580] redis redis/controller-0.internalapi.redhat.local 1/0/122286 93 cD 73/62/62/62/0 0/0
Feb  2 16:13:28 controller-0 haproxy[14]: 172.17.1.126:54514 [02/Feb/2021:16:03:28.846] redis redis/controller-0.internalapi.redhat.local 1/0/600001 80 cD 73/73/73/73/0 0/0
~~~

We tried enabling keepalive in haproxy as well as on the client side (*1), but it seems that enabling
TCP keepalive doesn't solve the problem: haproxy requires actual data to be sent/received
between the client and haproxy.

(*1)
[heat-admin@controller-0 ~]$ sudo egrep -e coordination_url -e redis_url /var/lib/config-data/puppet-generated/gnocchi/etc/gnocchi/gnocchi.conf
#coordination_url = <None>
#redis_url = ${storage.redis_url}
redis_url=redis://:7KTfKNEbCRtk0PdYpMIsoYW4Y.1.119:6379/?socket_keepalive=1
#redis_url = redis://localhost:6379/
coordination_url=redis://:7KTfKNEbCRtk0PdYpMIsoYW4Y.1.119:6379/?socket_keepalive=1

I guess that enabling health_check_interval in redis-py might be a solution here, but
tooz and the redis driver in gnocchi don't allow us to pass that parameter at the moment (see the sketch below).
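
For illustration, passing the health check directly to redis-py would look roughly like the sketch below. This is a hypothetical example (host and password are placeholders), not something the tooz or gnocchi drivers expose today:

~~~
import redis

# Hypothetical sketch: redis-py (3.3+) can PING connections that have been
# idle for more than health_check_interval seconds before reusing them,
# which also keeps some traffic flowing through haproxy.
client = redis.Redis(
    host="172.17.1.119",        # placeholder for the redis VIP
    port=6379,
    password="REDACTED",        # placeholder
    socket_keepalive=True,
    health_check_interval=30,   # ping connections idle for more than 30s
)
client.ping()
~~~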

Comment 4 Matthias Runge 2021-02-03 18:47:54 UTC
Maybe related? https://bugzilla.redhat.com/show_bug.cgi?id=1910876#c18

Comment 5 Matthias Runge 2021-02-15 13:08:03 UTC
I can reproduce the issue by deploying OSP and enabling legacy-telemetry.yaml.

The issue goes away when configuring gnocchi to use the redis master node directly and running gnocchi-upgrade afterwards.
This hints that the problem is caused by haproxy, which sits in between.

Comment 6 Matthias Runge 2021-02-15 14:30:51 UTC
As a short test, I reverted https://review.opendev.org/c/openstack/puppet-tripleo/+/749882 (which should be the only difference between the osp13 and osp16.1 config wrt redis). It doesn't change anything.

After configuring gnocchi to use only the redis master (and running gnocchi-upgrade), the issue is gone.
Turning this back to the VIP for redis (and running gnocchi-upgrade again), the errors happen again. Reverting https://review.opendev.org/c/openstack/puppet-tripleo/+/749882 doesn't have any influence here.

Comment 7 Takashi Kajinami 2021-02-15 15:14:41 UTC
I found suspicious change in redis-py
 https://github.com/andymccurdy/redis-py/commit/0647659c2f3550063c404a616ba37987c9158149

IIUC, because of this change the redis client no longer retries when it detects a ClientError or Timeout.
 https://github.com/andymccurdy/redis-py/blob/3.1.0/redis/client.py#L778-L780

We should set retry_on_timeout=True in the redis URLs, like:

redis_url=redis://:7KTfKNEbCRtk0PdYpMIsoYW4Y.1.119:6379/?retry_on_timeout=True
...

coordination_url=redis://:7KTfKNEbCRtk0PdYpMIsoYW4Y.1.119:6379/?retry_on_timeout=True

to restore the previous behavior of automatic retries without raising an exception (a client-level sketch of what this flag does follows).
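
For context, the query argument above is meant to end up as the retry_on_timeout keyword of the redis-py client. A minimal sketch of the client-level equivalent (placeholder host and password) would be:

~~~
import redis

# Sketch of what ?retry_on_timeout=True is intended to translate to.
# Note: in redis-py this only retries a command once after a TimeoutError;
# it does not cover ConnectionError, as the next comment confirms.
client = redis.Redis(
    host="172.17.1.119",     # placeholder
    port=6379,
    password="REDACTED",     # placeholder
    retry_on_timeout=True,
)
client.ping()
~~~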

Comment 8 Takashi Kajinami 2021-02-15 15:29:41 UTC
I implemented the change I mentioned in comment:7

~~~
[root@controller-0 ~]# sudo egrep -e coordination_url -e redis_url /var/lib/config-data/puppet-generated/gnocchi/etc/gnocchi/gnocchi.conf
#coordination_url = <None>
#redis_url = ${storage.redis_url}
redis_url=redis://:7KTfKNEbCRtk0PdYpMIsoYW4Y.1.119:6379/?retry_on_timeout=True
#redis_url = redis://localhost:6379/
coordination_url=redis://:7KTfKNEbCRtk0PdYpMIsoYW4Y.1.119:6379/?retry_on_timeout=True
~~~

and restarted gnocchi-metricd at 15:18 UTC.

~~~
[root@controller-0 ~]# sudo systemctl restart tripleo_gnocchi_metricd.service
[root@controller-0 ~]# date
Mon Feb 15 15:18:08 UTC 2021
[root@controller-0 ~]# sudo systemctl status tripleo_gnocchi_metricd.service
● tripleo_gnocchi_metricd.service - gnocchi_metricd container
   Loaded: loaded (/etc/systemd/system/tripleo_gnocchi_metricd.service; enabled; vendor preset: disabled)
   Active: active (running) since Mon 2021-02-15 15:18:07 UTC; 23s ago
  Process: 106094 ExecStopPost=/usr/bin/podman stop -t 10 gnocchi_metricd (code=exited, status=0/SUCCESS)
  Process: 105761 ExecStop=/usr/bin/podman stop -t 10 gnocchi_metricd (code=exited, status=0/SUCCESS)
  Process: 106174 ExecStart=/usr/bin/podman start gnocchi_metricd (code=exited, status=0/SUCCESS)
 Main PID: 106195 (conmon)
    Tasks: 0 (limit: 204317)
   Memory: 1.9M
   CGroup: /system.slice/tripleo_gnocchi_metricd.service
           ‣ 106195 /usr/bin/conmon --api-version 1 -s -c 40b720c574aec22585e2c98a439497011471a04cd309491af52647b0cc7>

Feb 15 15:18:06 controller-0 systemd[1]: Stopped gnocchi_metricd container.
Feb 15 15:18:06 controller-0 systemd[1]: Starting gnocchi_metricd container...
Feb 15 15:18:07 controller-0 podman[106174]: 2021-02-15 15:18:07.062603552 +0000 UTC m=+0.482178126 container init 40>
Feb 15 15:18:07 controller-0 podman[106174]: 2021-02-15 15:18:07.082071754 +0000 UTC m=+0.501646356 container start 4>
Feb 15 15:18:07 controller-0 podman[106174]: gnocchi_metricd
Feb 15 15:18:07 controller-0 systemd[1]: Started gnocchi_metricd container.
~~~

However, it doesn't work.
~~~
[root@controller-0 ~]# grep "ERROR " /var/log/containers/gnocchi/gnocchi-metricd.log  | tail -20
2021-02-15 15:15:38,917 [34] ERROR    gnocchi.cli.metricd: Error while listening for new measures notification, retrying
2021-02-15 15:15:39,513 [31] ERROR    gnocchi.cli.metricd: Error while listening for new measures notification, retrying
2021-02-15 15:15:39,851 [38] ERROR    gnocchi.cli.metricd: Error while listening for new measures notification, retrying
2021-02-15 15:15:40,018 [42] ERROR    gnocchi.cli.metricd: Error while listening for new measures notification, retrying
2021-02-15 15:20:09,740 [38] ERROR    gnocchi.cli.metricd: Error while listening for new measures notification, retrying
2021-02-15 15:20:09,744 [31] ERROR    gnocchi.cli.metricd: Error while listening for new measures notification, retrying
2021-02-15 15:20:09,748 [34] ERROR    gnocchi.cli.metricd: Error while listening for new measures notification, retrying
2021-02-15 15:20:09,750 [42] ERROR    gnocchi.cli.metricd: Error while listening for new measures notification, retrying
2021-02-15 15:22:10,246 [38] ERROR    gnocchi.cli.metricd: Error while listening for new measures notification, retrying
2021-02-15 15:22:10,251 [31] ERROR    gnocchi.cli.metricd: Error while listening for new measures notification, retrying
2021-02-15 15:22:10,255 [34] ERROR    gnocchi.cli.metricd: Error while listening for new measures notification, retrying
2021-02-15 15:22:10,258 [42] ERROR    gnocchi.cli.metricd: Error while listening for new measures notification, retrying
2021-02-15 15:24:11,250 [38] ERROR    gnocchi.cli.metricd: Error while listening for new measures notification, retrying
2021-02-15 15:24:11,256 [31] ERROR    gnocchi.cli.metricd: Error while listening for new measures notification, retrying
2021-02-15 15:24:11,261 [42] ERROR    gnocchi.cli.metricd: Error while listening for new measures notification, retrying
2021-02-15 15:24:11,262 [34] ERROR    gnocchi.cli.metricd: Error while listening for new measures notification, retrying
2021-02-15 15:26:13,259 [38] ERROR    gnocchi.cli.metricd: Error while listening for new measures notification, retrying
2021-02-15 15:26:13,263 [31] ERROR    gnocchi.cli.metricd: Error while listening for new measures notification, retrying
2021-02-15 15:26:13,267 [42] ERROR    gnocchi.cli.metricd: Error while listening for new measures notification, retrying
2021-02-15 15:26:13,271 [34] ERROR    gnocchi.cli.metricd: Error while listening for new measures notification, retrying
~~~

I now noticed that this is expected, because retry_on_timeout only works for TimeoutError
and doesn't affect the behavior with ClientError...
The traceback shows that the exception is still raised from the redis client.

~~~
2021-02-15 15:26:13,267 [38] DEBUG    gnocchi.chef: Processing measures for sack incoming128-67
2021-02-15 15:26:13,271 [34] ERROR    gnocchi.cli.metricd: Error while listening for new measures notification, retrying
Traceback (most recent call last):
  File "/usr/lib/python3.6/site-packages/redis/connection.py", line 184, in _read_from_socket
    raise socket.error(SERVER_CLOSED_CONNECTION_ERROR)
OSError: Connection closed by server.

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/lib/python3.6/site-packages/gnocchi/cli/metricd.py", line 184, in _fill_sacks_to_process
    for sack in self.incoming.iter_on_sacks_to_process():
  File "/usr/lib/python3.6/site-packages/gnocchi/incoming/redis.py", line 186, in iter_on_sacks_to_process
    for message in p.listen():
  File "/usr/lib/python3.6/site-packages/redis/client.py", line 3123, in listen
    response = self.handle_message(self.parse_response(block=True))
  File "/usr/lib/python3.6/site-packages/redis/client.py", line 3036, in parse_response
    return self._execute(connection, connection.read_response)
  File "/usr/lib/python3.6/site-packages/redis/client.py", line 3013, in _execute
    return command(*args)
  File "/usr/lib/python3.6/site-packages/redis/connection.py", line 636, in read_response
    raise e
  File "/usr/lib/python3.6/site-packages/redis/connection.py", line 633, in read_response
    response = self._parser.read_response()
  File "/usr/lib/python3.6/site-packages/redis/connection.py", line 291, in read_response
    response = self._buffer.readline()
  File "/usr/lib/python3.6/site-packages/redis/connection.py", line 223, in readline
    self._read_from_socket()
  File "/usr/lib/python3.6/site-packages/redis/connection.py", line 198, in _read_from_socket
    (e.args,))
redis.exceptions.ConnectionError: Error while reading from socket: ('Connection closed by server.',)
~~~

So we would need some fixes in redis-py, or to implement proper error handling on the client side
(tooz and Gnocchi in this case) to suppress that ClientError; a rough sketch of the latter follows.
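
As a rough illustration of the kind of client-side handling meant here (a sketch only, assuming a plain redis-py pubsub loop; this is not the actual gnocchi or tooz patch):

~~~
import time
import redis

def listen_with_reconnect(client, channel):
    """Sketch only: yield pubsub messages, re-subscribing whenever haproxy
    (or the server) closes the idle connection."""
    while True:
        pubsub = client.pubsub()
        pubsub.subscribe(channel)
        try:
            for message in pubsub.listen():
                yield message
        except redis.exceptions.ConnectionError:
            # "Connection closed by server." lands here; back off briefly and
            # open a fresh connection instead of crashing the worker.
            time.sleep(1)
        finally:
            pubsub.close()
~~~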

Comment 9 Takashi Kajinami 2021-02-15 15:37:27 UTC
One correction.

> I found suspicious change in redis-py
>  https://github.com/andymccurdy/redis-py/commit/0647659c2f3550063c404a616ba37987c9158149
In fact, this didn't change the behavior with ClientError.

The commit which made the actual change in behavior with ClientError is:
 https://github.com/andymccurdy/redis-py/commit/7860995ce4226ee0ca82d2eefa4b8bd5c3943a75

Comment 10 Takashi Kajinami 2021-02-15 15:48:58 UTC
(In reply to Takashi Kajinami from comment #9)
> One correction.
> 
> > I found suspicious change in redis-py
> >  https://github.com/andymccurdy/redis-py/commit/0647659c2f3550063c404a616ba37987c9158149
> In fact this didn't change behavior with ClientError.
> 
> The commit which made actual change in behavior with ClientError is;
>  https://github.com/andymccurdy/redis-py/commit/
> 7860995ce4226ee0ca82d2eefa4b8bd5c3943a75

So that change was introduced as a part of the following pull request.
 https://github.com/andymccurdy/redis-py/pull/1023

And I found the following issue related to the pull request
  https://github.com/andymccurdy/redis-py/issues/1022

According to what was mentioned in that issue, I think we should rebase redis-py
onto 3.2.0 to solve the issue caused by invalid connections in the connection pool?

~~~
I believe this should be resolved. 3.1.0 fixed the faulty retry logic and 3.2.0 improved
the ConnectionPool in that it will only return healthy connections now.
~~~

Comment 12 Matthias Runge 2021-02-15 19:32:50 UTC
Upgrading to python-redis 3.3.8 alone does not fix this issue. It still occurs.

Comment 13 Matthias Runge 2021-02-22 15:22:50 UTC
A workaround/fix was merged recently: https://github.com/gnocchixyz/gnocchi/pull/1119
It still needs a backport.

Comment 15 Takashi Kajinami 2021-02-27 13:49:58 UTC
Hi Matthias,

Thank you for submitting the fix to Gnocchi.

I think we also need the same fix in tooz because it also uses redis.
(IIUC, what you have fixed is the redis storage driver in Gnocchi.)
Could you also check it?

We see the same ConnectionError exception raised from the tooz redis driver
in the error log pasted in the bug description.

I have reproduced the same traceback from tooz with Gnocchi in my lab deployment,
but a customer also reported that they see the same ConnectionError from the tooz
redis driver in aodh-evaluator.log.

If needed, I'll clone this bug so that we track the tooz issue separately.

Thank you,
Takashi

Comment 16 Matthias Runge 2021-02-28 16:17:31 UTC
I haven't seen any errors in tooz during my tests, but I reproduced the issue with gnocchi-metricd (and also with gnocchi-api) quite reliably.
I'll clone this to tooz.

Comment 17 Matthias Runge 2021-03-30 15:33:05 UTC
Only blockers allowed for z5, moving to z6.

Comment 21 Matthias Runge 2021-05-04 16:26:46 UTC
How to test:

Steps to Reproduce:
1. Deploy the overcloud so that it uses telemetry
2. Monitor /var/log/containers/gnocchi/gnocchi-metricd.log on each controller node

You should no longer see any redis errors in the mentioned file.


The tooz fix has been split off into a separate bug; this one is about the fix in gnocchi.

Comment 26 Leonid Natapov 2021-05-19 05:32:46 UTC
I have a clean OSP16.1 z6 environment and none of the described tracebacks are seen in the gnocchi-metricd.log files.

Comment 27 Matthias Runge 2021-05-19 06:32:13 UTC
Somehow I still get the tracebacks in my local env here, moving this back to assigned.

Comment 30 Matthias Runge 2021-05-31 14:51:55 UTC
- This does not happen on devstack.
- This also happens when circumventing haproxy.
- Both devstack and OSP are using redis-5.0.3-2.module_el8; the build on devstack is module_el8.2.0, the one for OSP is module_el8.0.0.
- devstack does not run redis inside a container, OSP does.

Comment 31 Matthias Runge 2021-06-07 05:58:28 UTC
- Using an external redis helps with this issue too.
  - Switch the redis_url and coordination_url in gnocchi.conf to the other redis server.
  - Also make sure to copy over the authentication bits from the OSP-internal redis to the external one.
- Apparently, this also does not happen with all osp16.x deployments.

Comment 35 Matthias Runge 2021-07-12 13:52:34 UTC
Still no root cause, but there is a different way to approach this:

https://github.com/gnocchixyz/gnocchi/pull/1164

Comment 37 Matthias Runge 2021-07-22 06:07:20 UTC
There is still no update here, still no root cause known.

Comment 38 Matthias Runge 2021-07-27 14:10:54 UTC
While testing with 16.2 (same basis): I see this issue happening until I restart ceilometer_agent_notification on the controllers. If that is reproducible in more locations, that may provide a data point. After restarting ceilometer_agent_notification, the issue does not happen.

Comment 39 Matthias Runge 2021-07-29 08:06:54 UTC
In 16.1, issues like

    self._read_from_socket()
  File "/usr/lib/python3.6/site-packages/redis/connection.py", line 198, in _read_from_socket
    (e.args,))
redis.exceptions.ConnectionError: Error while reading from socket: ('Connection closed by server.',)

happen regardless of whether ceilometer is restarted or not.

Comment 47 Matthias Runge 2021-10-01 13:53:27 UTC
There is no clear reproducer here, if you look at the huge number of observations.

It's not even clear this is really harmful (other than the exceptions being logged). There are no real changes between OSP13 and 16 in the code.
The only bigger change was redis-py, which now surfaces connection issues that were silently ignored before (in OSP13).

Comment 49 Matthias Runge 2021-10-15 05:43:10 UTC
See the comments above, e.g. comment 47.

Comment 66 Martin Magr 2022-01-12 14:35:20 UTC
Relevant redis-py issue: https://github.com/redis/redis-py/issues/1140, which explains why we did not see this behaviour before: "Prior to 3.1.0, redis-py would retry any command that raised a ConnectionError exception. This was a bug and never intended. This bug was fixed in 3.1.0."

Comment 67 Martin Magr 2022-09-07 16:40:53 UTC
Need to document the fact that the log records now only notify about a Redis connection reset every 90 minutes.

Comment 71 Leonid Natapov 2022-11-01 11:25:52 UTC
I have had an OSP16.1 z9 environment running for 20 hours now, and no tracebacks, errors, or failure messages appear in the gnocchi-metricd or gnocchi-statsd log files.

haproxy.pp:

~~~
$redis_tcp_check_options = $redis_tcp_check_connect_options + $redis_tcp_check_common_options
    haproxy::listen { 'redis':
      bind             => $redis_bind_opts,
      options          => {
        'balance'        => 'first',
        'timeout client' => '90m',
        'timeout server' => '90m',
        'option'         => [ 'tcp-check', 'tcplog', ],
        'tcp-check'      => $redis_tcp_check_options,
~~~



Verifying this bug.

Comment 79 errata-xmlrpc 2022-12-07 20:24:45 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Red Hat OpenStack Platform 16.1.9 bug fix and enhancement advisory), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2022:8795

