Bug 1475404

Summary: OSP11 -> OSP12 upgrade: redis and gnocchi haproxy backends are down after major-upgrade-composable-steps-docker.yaml
Product: Red Hat OpenStack
Reporter: Marius Cornea <mcornea>
Component: puppet-tripleo
Assignee: RHOS Maint <rhos-maint>
Status: CLOSED ERRATA
QA Contact: Marius Cornea <mcornea>
Severity: urgent
Priority: urgent
Version: 12.0 (Pike)
CC: abeekhof, aherr, bperkins, chjones, dbecker, dciabrin, fdinitto, jjoyce, jschluet, mandreou, mburns, michele, morazi, pkilambi, rhel-osp-director-maint, rscarazz, sasha, slinaber, tvignaud, ushkalim
Target Milestone: rc
Keywords: Triaged
Target Release: 12.0 (Pike)
Hardware: Unspecified
OS: Unspecified
Fixed In Version: puppet-tripleo-7.4.3-0.20171025110205.93a9217.el7ost
Last Closed: 2017-12-13 21:45:43 UTC
Type: Bug
Bug Depends On: 1493915, 1497602, 1503064    
Attachments: redis.log

Description Marius Cornea 2017-07-26 14:59:56 UTC
Created attachment 1304828 [details]
redis.log

Description of problem:
OSP11 -> OSP12 upgrade: the redis and gnocchi haproxy backends are down after major-upgrade-composable-steps-docker.yaml:

[root@controller-0 heat-admin]# docker exec -it haproxy-bundle-docker-0 bash -c 'echo show stat | socat /var/lib/haproxy/stats stdio | grep gnocchi'
gnocchi,FRONTEND,,,0,0,4096,0,0,0,0,0,0,,,,,OPEN,,,,,,,,,1,5,0,,,,0,0,0,0,,,,0,0,0,0,0,0,,0,0,0,,,0,0,0,0,,,,,,,,
gnocchi,controller-0.internalapi.localdomain,0,0,0,0,,0,0,0,,0,,0,0,0,0,DOWN,1,1,0,1,1,8042,8042,,1,5,1,,0,,2,0,,0,L7TOUT,,10002,0,0,0,0,0,0,0,,,,0,0,,,,,-1,,,0,0,0,0,
gnocchi,BACKEND,0,0,0,0,410,0,0,0,0,0,,0,0,0,0,DOWN,0,0,0,,1,8042,8042,,1,5,0,,0,,1,0,,0,,,,0,0,0,0,0,0,,,,,0,0,0,0,0,0,-1,,,0,0,0,0,

()[root@controller-0 /]# echo show stat | socat /var/lib/haproxy/stats stdio | grep redis  
# pxname,svname,qcur,qmax,scur,smax,slim,stot,bin,bout,dreq,dresp,ereq,econ,eresp,wretr,wredis,status,weight,act,bck,chkfail,chkdown,lastchg,downtime,qlimit,pid,iid,sid,throttle,lbtot,tracked,type,rate,rate_lim,rate_max,check_status,check_code,check_duration,hrsp_1xx,hrsp_2xx,hrsp_3xx,hrsp_4xx,hrsp_5xx,hrsp_other,hanafail,req_rate,req_rate_max,req_tot,cli_abrt,srv_abrt,comp_in,comp_out,comp_byp,comp_rsp,lastsess,last_chk,last_agt,qtime,ctime,rtime,ttime,
redis,FRONTEND,,,0,2,4096,61743,2055130,0,0,0,0,,,,,OPEN,,,,,,,,,1,20,0,,,,0,6,0,20,,,,,,,,,,,0,0,0,,,0,0,0,0,,,,,,,,
redis,controller-0.internalapi.localdomain,0,0,0,0,,0,0,0,,0,,0,0,0,0,DOWN,1,1,0,1,1,8995,8995,,1,20,1,,0,,2,0,,0,L7TOUT,,10001,,,,,,,0,,,,0,0,,,,,-1,,,0,0,0,0,
redis,BACKEND,0,0,0,1,410,61743,2055130,0,0,0,,61743,0,0,0,DOWN,0,0,0,,1,8995,8995,,1,20,0,,0,,1,6,,20,,,,,,,,,,,,,,0,0,0,0,0,0,-1,,,0,0,0,0,


Version-Release number of selected component (if applicable):
openstack-tripleo-heat-templates-7.0.0-0.20170718190543.el7ost.noarch

How reproducible:
100%

Steps to Reproduce:
1. Deploy OSP11
2. Upgrade to OSP12
3. Check HAProxy backends

Actual results:
Gnocchi and redis haproxy backends are down

Expected results:
Gnocchi and redis haproxy backends are up.

Additional info:

listen gnocchi
  bind 10.0.0.106:8041 transparent
  bind 172.17.1.13:8041 transparent
  mode http
  http-request set-header X-Forwarded-Proto https if { ssl_fc }
  http-request set-header X-Forwarded-Proto http if !{ ssl_fc }
  option httpchk
  server controller-0.internalapi.localdomain 172.17.1.14:8041 check fall 5 inter 2000 rise 2
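A manual request that roughly matches haproxy's health probe can confirm the timeout (a hedged sketch: with no method or URI configured, "option httpchk" defaults to sending "OPTIONS / HTTP/1.0", which is what the L7TOUT check status in the stats above corresponds to):

# approximate the haproxy httpchk probe against the gnocchi backend
curl -v -X OPTIONS --max-time 10 http://172.17.1.14:8041/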


Trying to reach the backend directly times out:

()[root@controller-0 /]# curl http://172.17.1.14:8041 -v 
* About to connect() to 172.17.1.14 port 8041 (#0)
*   Trying 172.17.1.14...
* Connected to 172.17.1.14 (172.17.1.14) port 8041 (#0)
> GET / HTTP/1.1
> User-Agent: curl/7.29.0
> Host: 172.17.1.14:8041
> Accept: */*
> 

Checking on the host which process is listening on port 8041:

[root@controller-0 ~]# netstat -tupan | grep 8041 | grep LISTEN
tcp      129      0 172.17.1.14:8041        0.0.0.0:*               LISTEN      521504/httpd        
tcp        0      0 172.17.1.13:8041        0.0.0.0:*               LISTEN      491781/haproxy      
tcp        0      0 10.0.0.106:8041         0.0.0.0:*               LISTEN      491781/haproxy   

Checking the httpd processes inside the gnocchi_api container:

[root@controller-0 ~]# docker exec -it gnocchi_api bash -c 'ps axu | grep httpd | wc -l'
259

which is right at the MaxClients limit:

[root@controller-0 ~]# docker exec -it gnocchi_api bash -c 'grep MaxClient /etc/httpd/conf.modules.d/prefork.conf'
  MaxClients          256
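With every worker slot occupied, new connections (including haproxy's health probe) simply queue in the listen backlog, which matches the non-zero Recv-Q on the httpd socket in the netstat output above. A hedged way to watch the pile-up from the host (the httpd listener is visible in the host network namespace, as netstat already showed):

# count connections currently held open on the gnocchi backend port
ss -tan 'sport = :8041' | grep -c ESTAB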


Checking the logs:

[root@controller-0 ~]# tail -5  /var/log/containers/gnocchi/app.log 
2017-07-26 14:48:30.400 18 ERROR gnocchi.utils [-] Unable to start coordinator: Error while reading from socket: ('Connection closed by server.',): ToozConnectionError: Error while reading from socket: ('Connection closed by server.',)
2017-07-26 14:48:42.742 19 ERROR gnocchi.utils [-] Unable to start coordinator: Error while reading from socket: ('Connection closed by server.',): ToozConnectionError: Error while reading from socket: ('Connection closed by server.',)
2017-07-26 14:48:54.878 20 ERROR gnocchi.utils [-] Unable to start coordinator: Error while reading from socket: ('Connection closed by server.',): ToozConnectionError: Error while reading from socket: ('Connection closed by server.',)
2017-07-26 14:49:06.080 21 ERROR gnocchi.utils [-] Unable to start coordinator: Error while reading from socket: ('Connection closed by server.',): ToozConnectionError: Error while reading from socket: ('Connection closed by server.',)
2017-07-26 14:49:30.440 18 ERROR gnocchi.utils [-] Unable to start coordinator: Error while reading from socket: ('Connection closed by server.',): ToozConnectionError: Error while reading from socket: ('Connection closed by server.',)

It looks like Redis is also unreachable even though the container is up and running:

[root@controller-0 ~]# docker ps | grep redis
fcaf65f4162f        192.168.24.1:8787/rhosp12/openstack-redis-docker:2017-07-22.1                       "/bin/bash /usr/local"   2 hours ago         Up 2 hours                                           redis-bundle-docker-0


Checking the redis config in the haproxy container:

listen redis
  bind 172.17.1.20:6379 transparent
  balance first
  option tcp-check
  tcp-check send AUTH\ UdEnQrH6Jfdy6A4W2Rnja7UuZ\r\n
  tcp-check send PING\r\n
  tcp-check expect string +PONG
  tcp-check send info\ replication\r\n
  tcp-check expect string role:master
  tcp-check send QUIT\r\n
  tcp-check expect string +OK
  server controller-0.internalapi.localdomain 172.17.1.14:6379 check fall 5 inter 2000 rise 2


[root@controller-0 heat-admin]# docker exec -it haproxy-bundle-docker-0 bash
tput: No value for $TERM and no -T specified
tput: No value for $TERM and no -T specified
tput: No value for $TERM and no -T specified
tput: No value for $TERM and no -T specified
()[root@controller-0 /]# nc 172.17.1.14 6379
AUTH UdEnQrH6Jfdy6A4W2Rnja7UuZ
+OK
PING
+PONG
info replication
$409
# Replication
role:slave
master_host:no-such-master
master_port:6379
master_link_status:down
master_last_io_seconds_ago:-1
master_sync_in_progress:0
slave_repl_offset:1
master_link_down_since_seconds:1501080865
slave_priority:100
slave_read_only:1
connected_slaves:0
master_repl_offset:0
repl_backlog_active:0
repl_backlog_size:1048576
repl_backlog_first_byte_offset:0
repl_backlog_histlen:0

It looks like the redis backend does not report the master role, so haproxy does not send any requests to it.
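A quick cross-check of the role redis itself reports (a hedged sketch: it assumes redis-cli is available in the redis container image; the password is the one from the AUTH line above):

# confirm the replication role redis reports for the backend
docker exec -it redis-bundle-docker-0 \
  redis-cli -h 172.17.1.14 -a UdEnQrH6Jfdy6A4W2Rnja7UuZ info replication | grep ^role: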


A snippet from redis.log is attached; the active settings from redis.conf are below:

()[root@controller-0 /]# grep -v ^# /etc/redis.conf | grep -v ^$
daemonize yes
pidfile /var/run/redis/redis.pid
port 6379
tcp-backlog 511
bind 172.17.1.14
unixsocket /var/run/redis/redis.sock
unixsocketperm 755
timeout 0
tcp-keepalive 0
loglevel notice
logfile /var/log/redis/redis.log
syslog-enabled no
databases 16
save 300 10 
save 60 10000 
save 900 1 
stop-writes-on-bgsave-error yes
rdbcompression yes
rdbchecksum yes
dbfilename dump.rdb
dir /var/lib/redis
masterauth UdEnQrH6Jfdy6A4W2Rnja7UuZ
slave-serve-stale-data yes
slave-read-only yes
repl-timeout 60
repl-disable-tcp-nodelay no
repl-backlog-size 1mb
repl-backlog-ttl 3600
slave-priority 100
min-slaves-to-write 0
min-slaves-max-lag 10
requirepass UdEnQrH6Jfdy6A4W2Rnja7UuZ
maxclients 10000
appendonly no
appendfilename appendonly.aof
appendfsync everysec
no-appendfsync-on-rewrite no
auto-aof-rewrite-percentage 100
auto-aof-rewrite-min-size   64mb
aof-load-truncated yes
lua-time-limit 5000
slowlog-log-slower-than 10000
slowlog-max-len 1024
latency-monitor-threshold 0
notify-keyspace-events ""
hash-max-ziplist-entries 512
hash-max-ziplist-value 64
list-max-ziplist-entries 512
list-max-ziplist-value 64
set-max-intset-entries 512
zset-max-ziplist-entries 128
zset-max-ziplist-value 64
hll-sparse-max-bytes 3000
activerehashing yes
client-output-buffer-limit normal 0 0 0
client-output-buffer-limit slave 256mb 64mb 60
client-output-buffer-limit pubsub 32mb 8mb 60
hz 10
aof-rewrite-incremental-fsync yes

Comment 5 Michele Baldessari 2017-08-01 07:18:54 UTC
* Cause
The reason for this is that in OSP 11 (redis running on bare metal) the redis
resource agent stores a cluster property recording the last known master:
redis_REPL_INFO: controller-0

After the migration to containers this property prevents the resource from
being promoted to master, because the agent waits for the last known master
(wait_last_known_master) to reappear. In a bundle, however, the node name is
not controller-0 but redis-bundle-0, so the property would have to read:
redis_REPL_INFO: redis-bundle-0

As soon as we remove that global cluster property the resource is promoted to
master correctly.
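For reference, a hedged sketch of that manual workaround (the property name follows the redis_REPL_INFO pattern quoted above; the named set is an assumption taken from the redis resource agent, so verify with the query before deleting anything):

# verify the stale property first, then remove it so the bundled resource can be promoted
crm_attribute --type crm_config --name redis_REPL_INFO --query
crm_attribute --type crm_config --name redis_REPL_INFO --delete
# if the attribute lives in a named set, add: --set-name redis_replication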


* Fixes
We are exploring the best way to fix this. It will likely need some additional pacemaker environment variables that give us the host vs. bundle distinction, plus some changes to the resource agents for rabbitmq and redis (the agents that store hostnames in cluster properties).

Comment 6 Michele Baldessari 2017-08-14 13:53:45 UTC
We're still looking into this. Likely the changes will involve both pacemaker and the resource-agents for a proper clean fix.

Comment 7 Marius Cornea 2017-08-24 19:33:25 UTC
Hi Damien,

Could you please link the patches to this BZ once they can be pulled for testing? This issue is currently blocking upgrades and the only way to bypass it is to disable the telemetry services.

Thanks!

Comment 8 Michele Baldessari 2017-08-30 12:22:38 UTC
Just a drive-by update. We've been working tirelessly on this topic and I think we're slowly settling on an approach that is sufficiently backwards compatible.

Basically we will need three pieces:
1. Pacemaker changes (currently the minimum version needed is 1.1.16-12.12 and
   can be found here http://people.redhat.com/mbaldess/rpms/container-repo/pacemaker-bundle.repo)
2. Resource agents. A first non-final draft of the needed patch is here
   http://acksyn.org/files/tripleo/all-bundles-hostattribute-fixes.patch and a temporary build is in resource-agents-3.9.5-105.pidone.2.el7.centos.x86_64
3. Two additional reviews are needed. Namely:
   https://review.openstack.org/497766
   https://review.openstack.org/495491

The reason for all this work is that pacemaker and resource-agents are shipped
in the common channels, so we need to make sure that every change is fully
backwards compatible and works both on bare metal and in containers.

I'll post more updates once this stuff is fully baked.
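For context, a hedged illustration of the mechanism item 1 points at (an assumption on my part, not spelled out in this comment): pacemaker's container-attribute-target resource meta attribute makes a bundled resource store its node attributes under the physical host name instead of the bundle replica name. Applied to the redis resource it would look roughly like this (the exact resource id is an assumption):

# store the bundled resource's node attributes under the host name
pcs resource update redis meta container-attribute-target=host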

Comment 13 Andrew Beekhof 2017-10-11 12:11:29 UTC
For anyone coming in late, this is a race condition caused by our ansible job(s) sending a resource delete and a cleanup at about the same time.

Comment 15 Marius Cornea 2017-10-24 09:25:09 UTC
*** Bug 1498540 has been marked as a duplicate of this bug. ***

Comment 19 Udi Shkalim 2017-11-21 12:28:13 UTC
Verified on: puppet-tripleo-7.4.3-6.el7ost.noarch

Upgrade passed and the redis backend is reachable.

Comment 22 errata-xmlrpc 2017-12-13 21:45:43 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2017:3462