Bug 1475404 - OSP11 -> OSP12 upgrade: redis and gnocchi haproxy backend are down after major-upgrade-composable-steps-docker.yaml
Status: ON_DEV
Product: Red Hat OpenStack
Classification: Red Hat
Component: rhosp-director
Version: 12.0 (Pike)
Hardware: Unspecified
OS: Unspecified
Priority: urgent
Severity: urgent
Target Milestone: ga
Target Release: 12.0 (Pike)
Assigned To: Michele Baldessari
QA Contact: Amit Ugol
Keywords: Triaged
Depends On:
Blocks:
Reported: 2017-07-26 10:59 EDT by Marius Cornea
Modified: 2017-08-22 11:15 EDT
CC: 13 users

Type: Bug

Attachments
redis.log (20.07 KB, text/plain)
2017-07-26 10:59 EDT, Marius Cornea

Description Marius Cornea 2017-07-26 10:59:56 EDT
Created attachment 1304828 [details]
redis.log

Description of problem:
OSP11 -> OSP12 upgrade: the redis and gnocchi haproxy backends are down after running major-upgrade-composable-steps-docker.yaml:

[root@controller-0 heat-admin]# docker exec -it haproxy-bundle-docker-0 bash -c 'echo show stat | socat /var/lib/haproxy/stats stdio | grep gnocchi'
gnocchi,FRONTEND,,,0,0,4096,0,0,0,0,0,0,,,,,OPEN,,,,,,,,,1,5,0,,,,0,0,0,0,,,,0,0,0,0,0,0,,0,0,0,,,0,0,0,0,,,,,,,,
gnocchi,controller-0.internalapi.localdomain,0,0,0,0,,0,0,0,,0,,0,0,0,0,DOWN,1,1,0,1,1,8042,8042,,1,5,1,,0,,2,0,,0,L7TOUT,,10002,0,0,0,0,0,0,0,,,,0,0,,,,,-1,,,0,0,0,0,
gnocchi,BACKEND,0,0,0,0,410,0,0,0,0,0,,0,0,0,0,DOWN,0,0,0,,1,8042,8042,,1,5,0,,0,,1,0,,0,,,,0,0,0,0,0,0,,,,,0,0,0,0,0,0,-1,,,0,0,0,0,

()[root@controller-0 /]# echo show stat | socat /var/lib/haproxy/stats stdio | grep redis  
# pxname,svname,qcur,qmax,scur,smax,slim,stot,bin,bout,dreq,dresp,ereq,econ,eresp,wretr,wredis,status,weight,act,bck,chkfail,chkdown,lastchg,downtime,qlimit,pid,iid,sid,throttle,lbtot,tracked,type,rate,rate_lim,rate_max,check_status,check_code,check_duration,hrsp_1xx,hrsp_2xx,hrsp_3xx,hrsp_4xx,hrsp_5xx,hrsp_other,hanafail,req_rate,req_rate_max,req_tot,cli_abrt,srv_abrt,comp_in,comp_out,comp_byp,comp_rsp,lastsess,last_chk,last_agt,qtime,ctime,rtime,ttime,
redis,FRONTEND,,,0,2,4096,61743,2055130,0,0,0,0,,,,,OPEN,,,,,,,,,1,20,0,,,,0,6,0,20,,,,,,,,,,,0,0,0,,,0,0,0,0,,,,,,,,
redis,controller-0.internalapi.localdomain,0,0,0,0,,0,0,0,,0,,0,0,0,0,DOWN,1,1,0,1,1,8995,8995,,1,20,1,,0,,2,0,,0,L7TOUT,,10001,,,,,,,0,,,,0,0,,,,,-1,,,0,0,0,0,
redis,BACKEND,0,0,0,1,410,61743,2055130,0,0,0,,61743,0,0,0,DOWN,0,0,0,,1,8995,8995,,1,20,0,,0,,1,6,,20,,,,,,,,,,,,,,0,0,0,0,0,0,-1,,,0,0,0,0,
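
For a quicker read of the same data, the status column (field 18 of haproxy's CSV stats output) can be pulled out directly; run inside the haproxy container, for example:

echo show stat | socat /var/lib/haproxy/stats stdio | awk -F, '{print $1, $2, $18}'

Both the gnocchi and redis server lines report DOWN with an L7TOUT check status.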


Version-Release number of selected component (if applicable):
openstack-tripleo-heat-templates-7.0.0-0.20170718190543.el7ost.noarch

How reproducible:
100%

Steps to Reproduce:
1. Deploy OSP11
2. Upgrade to OSP12
3. Check HAProxy backends

Actual results:
The gnocchi and redis haproxy backends are down.

Expected results:
The gnocchi and redis haproxy backends are up.

Additional info:

listen gnocchi
  bind 10.0.0.106:8041 transparent
  bind 172.17.1.13:8041 transparent
  mode http
  http-request set-header X-Forwarded-Proto https if { ssl_fc }
  http-request set-header X-Forwarded-Proto http if !{ ssl_fc }
  option httpchk
  server controller-0.internalapi.localdomain 172.17.1.14:8041 check fall 5 inter 2000 rise 2


Trying to reach the gnocchi backend directly times out:

()[root@controller-0 /]# curl http://172.17.1.14:8041 -v 
* About to connect() to 172.17.1.14 port 8041 (#0)
*   Trying 172.17.1.14...
* Connected to 172.17.1.14 (172.17.1.14) port 8041 (#0)
> GET / HTTP/1.1
> User-Agent: curl/7.29.0
> Host: 172.17.1.14:8041
> Accept: */*
> 

Checking on the host what is listening on port 8041:

[root@controller-0 ~]# netstat -tupan | grep 8041 | grep LISTEN
tcp      129      0 172.17.1.14:8041        0.0.0.0:*               LISTEN      521504/httpd        
tcp        0      0 172.17.1.13:8041        0.0.0.0:*               LISTEN      491781/haproxy      
tcp        0      0 10.0.0.106:8041         0.0.0.0:*               LISTEN      491781/haproxy   

Checking the httpd processes inside the gnocchi_api container:

[root@controller-0 ~]# docker exec -it gnocchi_api bash -c 'ps axu | grep httpd | wc -l'
259

which is right around the MaxClients limit:

[root@controller-0 ~]# docker exec -it gnocchi_api bash -c 'grep MaxClient /etc/httpd/conf.modules.d/prefork.conf'
  MaxClients          256
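
A quick way to confirm that the worker pool is exhausted is to compare the live worker count with the configured limit in one go (a small sketch; the [h]ttpd pattern keeps the grep itself out of the count, and the parent process adds one):

docker exec gnocchi_api bash -c 'ps axu | grep -c [h]ttpd; grep MaxClients /etc/httpd/conf.modules.d/prefork.conf'

Once the pool is full, new connections (including haproxy's httpchk probes) sit in the listen backlog instead of being accepted, which matches both the hanging curl above and the non-zero Recv-Q on the httpd listen socket in the netstat output.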


Checking the logs:

[root@controller-0 ~]# tail -5  /var/log/containers/gnocchi/app.log 
2017-07-26 14:48:30.400 18 ERROR gnocchi.utils [-] Unable to start coordinator: Error while reading from socket: ('Connection closed by server.',): ToozConnectionError: Error while reading from socket: ('Connection closed by server.',)
2017-07-26 14:48:42.742 19 ERROR gnocchi.utils [-] Unable to start coordinator: Error while reading from socket: ('Connection closed by server.',): ToozConnectionError: Error while reading from socket: ('Connection closed by server.',)
2017-07-26 14:48:54.878 20 ERROR gnocchi.utils [-] Unable to start coordinator: Error while reading from socket: ('Connection closed by server.',): ToozConnectionError: Error while reading from socket: ('Connection closed by server.',)
2017-07-26 14:49:06.080 21 ERROR gnocchi.utils [-] Unable to start coordinator: Error while reading from socket: ('Connection closed by server.',): ToozConnectionError: Error while reading from socket: ('Connection closed by server.',)
2017-07-26 14:49:30.440 18 ERROR gnocchi.utils [-] Unable to start coordinator: Error while reading from socket: ('Connection closed by server.',): ToozConnectionError: Error while reading from socket: ('Connection closed by server.',)
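
The Tooz errors point at the coordination backend. Assuming the containerized gnocchi exposes its config at /etc/gnocchi/gnocchi.conf inside the container, the coordinator URL can be checked with:

docker exec gnocchi_api grep coordination_url /etc/gnocchi/gnocchi.conf

If it points at 172.17.1.20:6379 (the haproxy redis frontend shown below), every API worker would block retrying the coordinator, which would explain the exhausted httpd worker pool above.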

It looks like Redis is also unreachable even though the container is up and running:

[root@controller-0 ~]# docker ps | grep redis
fcaf65f4162f        192.168.24.1:8787/rhosp12/openstack-redis-docker:2017-07-22.1                       "/bin/bash /usr/local"   2 hours ago         Up 2 hours                                           redis-bundle-docker-0


Checking the redis config on the haproxy container:

listen redis
  bind 172.17.1.20:6379 transparent
  balance first
  option tcp-check
  tcp-check send AUTH\ UdEnQrH6Jfdy6A4W2Rnja7UuZ\r\n
  tcp-check send PING\r\n
  tcp-check expect string +PONG
  tcp-check send info\ replication\r\n
  tcp-check expect string role:master
  tcp-check send QUIT\r\n
  tcp-check expect string +OK
  server controller-0.internalapi.localdomain 172.17.1.14:6379 check fall 5 inter 2000 rise 2
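
The same probe that haproxy performs can be reproduced with redis-cli, if the client is available (a sketch of the equivalent check; the nc session below does the same thing by hand):

redis-cli -h 172.17.1.14 -a UdEnQrH6Jfdy6A4W2Rnja7UuZ ping
redis-cli -h 172.17.1.14 -a UdEnQrH6Jfdy6A4W2Rnja7UuZ info replication | grep ^role

haproxy only marks the backend up when the role:master string is present in the reply.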


[root@controller-0 heat-admin]# docker exec -it haproxy-bundle-docker-0 bash
tput: No value for $TERM and no -T specified
tput: No value for $TERM and no -T specified
tput: No value for $TERM and no -T specified
tput: No value for $TERM and no -T specified
()[root@controller-0 /]# nc 172.17.1.14 6379
AUTH UdEnQrH6Jfdy6A4W2Rnja7UuZ
+OK
PING
+PONG
info replication
$409
# Replication
role:slave
master_host:no-such-master
master_port:6379
master_link_status:down
master_last_io_seconds_ago:-1
master_sync_in_progress:0
slave_repl_offset:1
master_link_down_since_seconds:1501080865
slave_priority:100
slave_read_only:1
connected_slaves:0
master_repl_offset:0
repl_backlog_active:0
repl_backlog_size:1048576
repl_backlog_first_byte_offset:0
repl_backlog_histlen:0

It looks like the redis backend does not report the master role, so haproxy does not send any requests to it.
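
To confirm that pacemaker has not promoted any redis instance, the cluster status can be checked on the controller host (assuming pcs is available there, as in a standard deployment):

pcs status | grep -A 4 redis-bundle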


A snippet from redis.log is attached; the redis.conf contents are below:

()[root@controller-0 /]# grep -v ^# /etc/redis.conf | grep -v ^$
daemonize yes
pidfile /var/run/redis/redis.pid
port 6379
tcp-backlog 511
bind 172.17.1.14
unixsocket /var/run/redis/redis.sock
unixsocketperm 755
timeout 0
tcp-keepalive 0
loglevel notice
logfile /var/log/redis/redis.log
syslog-enabled no
databases 16
save 300 10 
save 60 10000 
save 900 1 
stop-writes-on-bgsave-error yes
rdbcompression yes
rdbchecksum yes
dbfilename dump.rdb
dir /var/lib/redis
masterauth UdEnQrH6Jfdy6A4W2Rnja7UuZ
slave-serve-stale-data yes
slave-read-only yes
repl-timeout 60
repl-disable-tcp-nodelay no
repl-backlog-size 1mb
repl-backlog-ttl 3600
slave-priority 100
min-slaves-to-write 0
min-slaves-max-lag 10
requirepass UdEnQrH6Jfdy6A4W2Rnja7UuZ
maxclients 10000
appendonly no
appendfilename appendonly.aof
appendfsync everysec
no-appendfsync-on-rewrite no
auto-aof-rewrite-percentage 100
auto-aof-rewrite-min-size   64mb
aof-load-truncated yes
lua-time-limit 5000
slowlog-log-slower-than 10000
slowlog-max-len 1024
latency-monitor-threshold 0
notify-keyspace-events ""
hash-max-ziplist-entries 512
hash-max-ziplist-value 64
list-max-ziplist-entries 512
list-max-ziplist-value 64
set-max-intset-entries 512
zset-max-ziplist-entries 128
zset-max-ziplist-value 64
hll-sparse-max-bytes 3000
activerehashing yes
client-output-buffer-limit normal 0 0 0
client-output-buffer-limit slave 256mb 64mb 60
client-output-buffer-limit pubsub 32mb 8mb 60
hz 10
aof-rewrite-incremental-fsync yes
Comment 5 Michele Baldessari 2017-08-01 03:18:54 EDT
* Cause
In OSP 11 (redis on bare metal) the redis resource agent stores a cluster
property recording the last known master:
redis_REPL_INFO: controller-0

After the migration to containers this property prevents the resource from
becoming master, because the agent waits for the last known master
(wait_last_known_master) to come back, but in a bundle the node name is no
longer controller-0, it is redis-bundle-0. The property would need to read:
redis_REPL_INFO: redis-bundle-0

As soon as we remove that global cluster property, the resource starts up
in master mode correctly.
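
For reference, the property can be inspected and cleared with crm_attribute; the set name redis_replication follows the upstream heartbeat/redis resource agent and is an assumption here:

crm_attribute --type crm_config --name redis_REPL_INFO -s redis_replication --query
crm_attribute --type crm_config --name redis_REPL_INFO -s redis_replication --delete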


* Fixes
We are exploring the best way to fix this. It will likely need some additional pacemaker environment variables that give us the host vs. bundle distinction, plus changes to the rabbitmq and redis resource agents (the two agents that store hostnames in cluster properties).
Comment 6 Michele Baldessari 2017-08-14 09:53:45 EDT
We're still looking into this. Likely the changes will involve both pacemaker and the resource-agents for a proper clean fix.
