Bug 1730723

Summary: [OSP13] Error running 'docker run --name ceilometer_gnocchi_upgrade' in rhel7.7-based containers phase1 test job
Product: Red Hat OpenStack Reporter: Steve Linabery <slinaber>
Component: puppet-tripleoAssignee: RHOS Maint <rhos-maint>
Status: CLOSED ERRATA QA Contact: nlevinki <nlevinki>
Severity: high Docs Contact:
Priority: urgent    
Version: 13.0 (Queens)CC: afariasa, dhill, fhubik, jbadiapa, jjoyce, jpichon, jruzicka, jschluet, lmiccini, lnatapov, michele, rohara, sasha, slinaber, tvignaud
Target Milestone: ---Keywords: Triaged, ZStream
Target Release: ---   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: puppet-tripleo-8.4.1-20.el7ost Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
: 1732444 1732446 (view as bug list) Environment:
Last Closed: 2019-08-08 19:58:26 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On:    
Bug Blocks: 1732444, 1732446    

Comment 5 Matthias Runge 2019-07-17 15:09:52 UTC
so:
apparently, redis claims to be up and running.
haproxy can not connect to redis.


84:M 16 Jul 19:04:11.583 * The server is now ready to accept connections on port 6379
84:M 16 Jul 19:04:11.583 * The server is now ready to accept connections at /var/run/redis/redis.sock
84:S 16 Jul 19:04:13.352 * SLAVE OF no-such-master:6379 enabled (user request from 'id=4 addr=/var/run/redis/redis.sock:0 fd=6 name= age=0 idle=0 flags=U db=0 sub=0 psub=0 multi=-1 qbuf=0 qbuf-free=32768 obl=0 oll=0 omem=0 events=r cmd=slaveof')
84:S 16 Jul 19:04:13.586 * Connecting to MASTER no-such-master:6379
84:S 16 Jul 19:04:13.589 # Unable to connect to MASTER: No such file or directory
84:S 16 Jul 19:04:14.591 * Connecting to MASTER no-such-master:6379


Jul 16 19:17:14 controller-0 haproxy[12]: 172.17.1.16:37784 [16/Jul/2019:19:17:14.189] redis redis/<NOSRV> -1/-1/0 0 SC 52/0/0/0/0 0/0
Jul 16 19:17:14 controller-0 haproxy[12]: 172.17.1.16:37786 [16/Jul/2019:19:17:14.189] redis redis/<NOSRV> -1/-1/0 0 SC 52/0/0/0/0 0/0
Jul 16 19:17:14 controller-0 haproxy[12]: 172.17.1.23:33529 [16/Jul/2019:19:17:14.185] mysql mysql/controller-0.internalapi.localdomain 1/0/158 8792 -- 51/51/51/51/0 0/0
Jul 16 19:17:14 controller-0 haproxy[12]: 172.17.1.16:37796 [16/Jul/2019:19:17:14.599] redis redis/<NOSRV> -1/-1/0 0 SC 52/0/0/0/0 0/0
Jul 16 19:17:14 controller-0 haproxy[12]: 172.17.1.16:37798 [16/Jul/2019:19:17:14.599] redis redis/<NOSRV> -1/-1/0 0 SC 52/0/0/0/0 0/0
Jul 16 19:17:15 controller-0 haproxy[12]: 172.17.1.16:37816 [16/Jul/2019:19:17:15.047] redis redis/<NOSRV> -1/-1/0 0 SC 53/0/0/0/0 0/0
Jul 16 19:17:15 controller-0 haproxy[12]: 172.17.1.16:37818 [16/Jul/2019:19:17:15.048] redis redis/<NOSRV> -1/-1/0 0 SC 53/0/0/0/0 0/0
Jul 16 19:17:15 controller-0 haproxy[12]: 172.17.1.16:37820 [16/Jul/2019:19:17:15.049] redis redis/<NOSRV> -1/-1/0 0 SC 53/0/0/0/0 0/0
Jul 16 19:17:15 controller-0 haproxy[12]: 172.17.1.16:37822 [16/Jul/2019:19:17:15.049] redis redis/<NOSRV> -1/-1/0 0 SC 53/0/0/0/0 0/0
Jul 16 19:17:15 controller-0 haproxy[12]: 172.17.1.14:41586 [16/Jul/2019:19:17:15.253] redis redis/<NOSRV> -1/-1/0 0 SC 53/0/0/0/0 0/0
Jul 16 19:17:15 controller-0 haproxy[12]: 172.17.1.14:41588 [16/Jul/2019:19:17:15.253] redis redis/<NOSRV> -1/-1/0 0 SC 53/0/0/0/0 0/0
Jul 16 19:17:15 controller-0 haproxy[12]: 172.17.1.16:37826 [16/Jul/2019:19:17:15.432] redis redis/<NOSRV> -1/-1/0 0 SC 53/0/0/0/0 0/0
Jul 16 19:17:15 controller-0 haproxy[12]: 172.17.1.16:37828 [16/Jul/2019:19:17:15.433] redis redis/<NOSRV> -1/-1/0 0 SC 53/0/0/0/0 0/0
Jul 16 19:17:15 controller-0 haproxy[12]: 172.17.1.14:41590 [16/Jul/2019:19:17:15.471] redis redis/<NOSRV> -1/-1/0 0 SC 53/0/0/0/0 0/0
Jul 16 19:17:15 controller-0 haproxy[12]: 172.17.1.14:41592 [16/Jul/2019:19:17:15.471] redis redis/<NOSRV> -1/-1/0 0 SC 53/0/0/0/0 0/0
Jul 16 19:17:15 controller-0 haproxy[12]: 172.17.1.16:37846 [16/Jul/2019:19:17:15.663] redis redis/<NOSRV> -1/-1/0 0 SC 53/0/0/0/0 0/0

Comment 7 Luca Miccini 2019-07-17 18:58:05 UTC
(In reply to Matthias Runge from comment #5)
> so:
> apparently, redis claims to be up and running.
> haproxy can not connect to redis.
> 
> 
> 84:M 16 Jul 19:04:11.583 * The server is now ready to accept connections on
> port 6379
> 84:M 16 Jul 19:04:11.583 * The server is now ready to accept connections at
> /var/run/redis/redis.sock
> 84:S 16 Jul 19:04:13.352 * SLAVE OF no-such-master:6379 enabled (user
> request from 'id=4 addr=/var/run/redis/redis.sock:0 fd=6 name= age=0 idle=0
> flags=U db=0 sub=0 psub=0 multi=-1 qbuf=0 qbuf-free=32768 obl=0 oll=0 omem=0
> events=r cmd=slaveof')
> 84:S 16 Jul 19:04:13.586 * Connecting to MASTER no-such-master:6379
> 84:S 16 Jul 19:04:13.589 # Unable to connect to MASTER: No such file or
> directory
> 84:S 16 Jul 19:04:14.591 * Connecting to MASTER no-such-master:6379
> 
> 
> Jul 16 19:17:14 controller-0 haproxy[12]: 172.17.1.16:37784
> [16/Jul/2019:19:17:14.189] redis redis/<NOSRV> -1/-1/0 0 SC 52/0/0/0/0 0/0
> Jul 16 19:17:14 controller-0 haproxy[12]: 172.17.1.16:37786
> [16/Jul/2019:19:17:14.189] redis redis/<NOSRV> -1/-1/0 0 SC 52/0/0/0/0 0/0
> Jul 16 19:17:14 controller-0 haproxy[12]: 172.17.1.23:33529
> [16/Jul/2019:19:17:14.185] mysql mysql/controller-0.internalapi.localdomain
> 1/0/158 8792 -- 51/51/51/51/0 0/0
> Jul 16 19:17:14 controller-0 haproxy[12]: 172.17.1.16:37796
> [16/Jul/2019:19:17:14.599] redis redis/<NOSRV> -1/-1/0 0 SC 52/0/0/0/0 0/0
> Jul 16 19:17:14 controller-0 haproxy[12]: 172.17.1.16:37798
> [16/Jul/2019:19:17:14.599] redis redis/<NOSRV> -1/-1/0 0 SC 52/0/0/0/0 0/0
> Jul 16 19:17:15 controller-0 haproxy[12]: 172.17.1.16:37816
> [16/Jul/2019:19:17:15.047] redis redis/<NOSRV> -1/-1/0 0 SC 53/0/0/0/0 0/0
> Jul 16 19:17:15 controller-0 haproxy[12]: 172.17.1.16:37818


we tracked this down to be an issue with haproxy 1.5.18-9 shipped in the latest containers, specifically this change:

2019-02-27 Ryan O'Hara <rohara> - 1.5.18-9
- Detect down servers with mutliple tcp-check connect rules (#1677420)

apparently this breaks the multiple tcp-check logic we use to detect if redis is up or not.

listen redis
  bind 172.17.1.23:6379 transparent
  balance first
  option tcp-check
  option tcplog
  tcp-check send AUTH\ <PASSWORD>\r\n
  tcp-check send PING\r\n
  tcp-check expect string +PONG
  tcp-check send info\ replication\r\n
  tcp-check expect string role:master
  tcp-check send QUIT\r\n
  tcp-check expect string +OK
  server controller-0.internalapi.localdomain 172.17.1.17:6379 check fall 5 inter 2000 on-marked-down shutdown-sessions rise 2
  server controller-1.internalapi.localdomain 172.17.1.15:6379 check fall 5 inter 2000 on-marked-down shutdown-sessions rise 2
  server controller-2.internalapi.localdomain 172.17.1.18:6379 check fall 5 inter 2000 on-marked-down shutdown-sessions rise 2

we reverted to the '2019-06-27.1' haproxy container and now it can correctly detect redis to be up:

[root@controller-2 haproxy]# echo "show info;show stat" | nc -U stats  |grep ^redis
redis,FRONTEND,,,9,9,4096,9,0,0,0,0,0,,,,,OPEN,,,,,,,,,1,19,0,,,,0,0,0,8,,,,,,,,,,,0,0,0,,,0,0,0,0,,,,,,,,
redis,controller-0.internalapi.localdomain,0,0,9,9,,9,0,0,,0,,0,0,0,0,UP,1,1,0,0,0,21,0,,1,19,1,,9,,2,0,,8,L7OK,0,0,,,,,,,0,,,,0,0,,,,,19,(tcp-check),,0,0,0,0,
redis,controller-1.internalapi.localdomain,0,0,0,0,,0,0,0,,0,,0,0,0,0,DOWN,1,1,0,1,1,9,9,,1,19,2,,0,,2,0,,0,L7TOUT,,10001,,,,,,,0,,,,0,0,,,,,-1,,,0,0,0,0,
redis,controller-2.internalapi.localdomain,0,0,0,0,,0,0,0,,0,,0,0,0,0,DOWN,1,1,0,1,1,9,9,,1,19,3,,0,,2,0,,0,L7TOUT,,10001,,,,,,,0,,,,0,0,,,,,-1,,,0,0,0,0,
redis,BACKEND,0,0,9,9,410,9,0,0,0,0,,0,0,0,0,UP,1,1,0,,0,21,0,,1,19,0,,9,,1,0,,8,,,,,,,,,,,,,,0,0,0,0,0,0,19,,,0,0,0,0

Comment 8 Ryan O'Hara 2019-07-18 15:25:12 UTC
As state in rhbz#1677420, put a "tcp-check connect port <n>" statement before all the send/expect statements. Alternatively, specify a healthcheck port on each server line. If that does not work, use tcpdump to check that healthcheck are being made to the correct port. I realize the docs are a bit unclear on this -- you have a port on the server line (eg. 172.17.1.17:6379), thus port 6379 should be used for healthchecks unless "port" is used. But you don't used "port" on the server line, so I am not sure your tcp-check knows what port to connect to.

I'm reluctant to blame the fix for rhbz#1677420 because that actually fixed a problem where tcp-check was not failing when it should. Now you're seeing failures with the fix, which makes me wonder if your redis proxy was ever detecing failed healthchecks correctly.

Comment 9 Michele Baldessari 2019-07-18 17:14:13 UTC
The detection has worked fine in the past.

I added the connect stuff here https://review.opendev.org/671526. The problem I see is that it will be hard/impossible to land all this stuff for all OSP releases *before* RHEL7.7 comes out

Comment 10 Steve Linabery 2019-07-18 19:28:06 UTC
(In reply to Michele Baldessari from comment #9)
> The detection has worked fine in the past.
> 
> I added the connect stuff here https://review.opendev.org/671526. The
> problem I see is that it will be hard/impossible to land all this stuff for
> all OSP releases *before* RHEL7.7 comes out

We should try to proactively backport this ASAP for osp13 and osp14, and possibly/probably osp10 although we have not reproduced it there yet. I notice that the upstream code review already has a +2 from emilien.

Comment 23 errata-xmlrpc 2019-08-08 19:58:26 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2019:2421

Comment 24 Alex Schultz 2019-09-12 22:25:36 UTC
*** Bug 1751725 has been marked as a duplicate of this bug. ***