Created attachment 1304828 [details]
redis.log

Description of problem:
OSP11 -> OSP12 upgrade: the redis and gnocchi haproxy backends are down after major-upgrade-composable-steps-docker.yaml:

[root@controller-0 heat-admin]# docker exec -it haproxy-bundle-docker-0 bash -c 'echo show stat | socat /var/lib/haproxy/stats stdio | grep gnocchi'
gnocchi,FRONTEND,,,0,0,4096,0,0,0,0,0,0,,,,,OPEN,,,,,,,,,1,5,0,,,,0,0,0,0,,,,0,0,0,0,0,0,,0,0,0,,,0,0,0,0,,,,,,,,
gnocchi,controller-0.internalapi.localdomain,0,0,0,0,,0,0,0,,0,,0,0,0,0,DOWN,1,1,0,1,1,8042,8042,,1,5,1,,0,,2,0,,0,L7TOUT,,10002,0,0,0,0,0,0,0,,,,0,0,,,,,-1,,,0,0,0,0,
gnocchi,BACKEND,0,0,0,0,410,0,0,0,0,0,,0,0,0,0,DOWN,0,0,0,,1,8042,8042,,1,5,0,,0,,1,0,,0,,,,0,0,0,0,0,0,,,,,0,0,0,0,0,0,-1,,,0,0,0,0,

()[root@controller-0 /]# echo show stat | socat /var/lib/haproxy/stats stdio | grep redis
# pxname,svname,qcur,qmax,scur,smax,slim,stot,bin,bout,dreq,dresp,ereq,econ,eresp,wretr,wredis,status,weight,act,bck,chkfail,chkdown,lastchg,downtime,qlimit,pid,iid,sid,throttle,lbtot,tracked,type,rate,rate_lim,rate_max,check_status,check_code,check_duration,hrsp_1xx,hrsp_2xx,hrsp_3xx,hrsp_4xx,hrsp_5xx,hrsp_other,hanafail,req_rate,req_rate_max,req_tot,cli_abrt,srv_abrt,comp_in,comp_out,comp_byp,comp_rsp,lastsess,last_chk,last_agt,qtime,ctime,rtime,ttime,
redis,FRONTEND,,,0,2,4096,61743,2055130,0,0,0,0,,,,,OPEN,,,,,,,,,1,20,0,,,,0,6,0,20,,,,,,,,,,,0,0,0,,,0,0,0,0,,,,,,,,
redis,controller-0.internalapi.localdomain,0,0,0,0,,0,0,0,,0,,0,0,0,0,DOWN,1,1,0,1,1,8995,8995,,1,20,1,,0,,2,0,,0,L7TOUT,,10001,,,,,,,0,,,,0,0,,,,,-1,,,0,0,0,0,
redis,BACKEND,0,0,0,1,410,61743,2055130,0,0,0,,61743,0,0,0,DOWN,0,0,0,,1,8995,8995,,1,20,0,,0,,1,6,,20,,,,,,,,,,,,,,0,0,0,0,0,0,-1,,,0,0,0,0,
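As a side note, the raw stats CSV above is easier to scan if it is cut down to the proxy name, server name and status columns (a sketch; 'status' is field 18 in the HAProxy stats CSV):

docker exec -it haproxy-bundle-docker-0 bash -c \
  'echo show stat | socat /var/lib/haproxy/stats stdio | cut -d, -f1,2,18'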
Version-Release number of selected component (if applicable):
openstack-tripleo-heat-templates-7.0.0-0.20170718190543.el7ost.noarch

How reproducible:
100%

Steps to Reproduce:
1. Deploy OSP11
2. Upgrade to OSP12
3. Check HAProxy backends

Actual results:
Gnocchi backend is down.

Expected results:
Gnocchi backend is up.

Additional info:

The gnocchi listener in the haproxy config:

listen gnocchi
  bind 10.0.0.106:8041 transparent
  bind 172.17.1.13:8041 transparent
  mode http
  http-request set-header X-Forwarded-Proto https if { ssl_fc }
  http-request set-header X-Forwarded-Proto http if !{ ssl_fc }
  option httpchk
  server controller-0.internalapi.localdomain 172.17.1.14:8041 check fall 5 inter 2000 rise 2

Trying to reach the backend directly times out:

()[root@controller-0 /]# curl http://172.17.1.14:8041 -v
* About to connect() to 172.17.1.14 port 8041 (#0)
*   Trying 172.17.1.14...
* Connected to 172.17.1.14 (172.17.1.14) port 8041 (#0)
> GET / HTTP/1.1
> User-Agent: curl/7.29.0
> Host: 172.17.1.14:8041
> Accept: */*
>

Checking who's listening on 8041 on the host:

[root@controller-0 ~]# netstat -tupan | grep 8041 | grep LISTEN
tcp      129      0 172.17.1.14:8041     0.0.0.0:*     LISTEN     521504/httpd
tcp        0      0 172.17.1.13:8041     0.0.0.0:*     LISTEN     491781/haproxy
tcp        0      0 10.0.0.106:8041      0.0.0.0:*     LISTEN     491781/haproxy

Checking the httpd processes inside the gnocchi_api container:

[root@controller-0 ~]# docker exec -it gnocchi_api bash -c 'ps axu | grep httpd | wc -l'
259

which is right around the MaxClients limit:

[root@controller-0 ~]# docker exec -it gnocchi_api bash -c 'grep MaxClient /etc/httpd/conf.modules.d/prefork.conf'
  MaxClients      256

Checking the logs:

[root@controller-0 ~]# tail -5 /var/log/containers/gnocchi/app.log
2017-07-26 14:48:30.400 18 ERROR gnocchi.utils [-] Unable to start coordinator: Error while reading from socket: ('Connection closed by server.',): ToozConnectionError: Error while reading from socket: ('Connection closed by server.',)
2017-07-26 14:48:42.742 19 ERROR gnocchi.utils [-] Unable to start coordinator: Error while reading from socket: ('Connection closed by server.',): ToozConnectionError: Error while reading from socket: ('Connection closed by server.',)
2017-07-26 14:48:54.878 20 ERROR gnocchi.utils [-] Unable to start coordinator: Error while reading from socket: ('Connection closed by server.',): ToozConnectionError: Error while reading from socket: ('Connection closed by server.',)
2017-07-26 14:49:06.080 21 ERROR gnocchi.utils [-] Unable to start coordinator: Error while reading from socket: ('Connection closed by server.',): ToozConnectionError: Error while reading from socket: ('Connection closed by server.',)
2017-07-26 14:49:30.440 18 ERROR gnocchi.utils [-] Unable to start coordinator: Error while reading from socket: ('Connection closed by server.',): ToozConnectionError: Error while reading from socket: ('Connection closed by server.',)

It looks like Redis is also unreachable even though the container is up and running:

[root@controller-0 ~]# docker ps | grep redis
fcaf65f4162f  192.168.24.1:8787/rhosp12/openstack-redis-docker:2017-07-22.1  "/bin/bash /usr/local"  2 hours ago  Up 2 hours  redis-bundle-docker-0

Checking the redis config in the haproxy container:

listen redis
  bind 172.17.1.20:6379 transparent
  balance first
  option tcp-check
  tcp-check send AUTH\ UdEnQrH6Jfdy6A4W2Rnja7UuZ\r\n
  tcp-check send PING\r\n
  tcp-check expect string +PONG
  tcp-check send info\ replication\r\n
  tcp-check expect string role:master
  tcp-check send QUIT\r\n
  tcp-check expect string +OK
  server controller-0.internalapi.localdomain 172.17.1.14:6379 check fall 5 inter 2000 rise 2

[root@controller-0 heat-admin]# docker exec -it haproxy-bundle-docker-0 bash
tput: No value for $TERM and no -T specified
tput: No value for $TERM and no -T specified
tput: No value for $TERM and no -T specified
tput: No value for $TERM and no -T specified
()[root@controller-0 /]# nc 172.17.1.14 6379
AUTH UdEnQrH6Jfdy6A4W2Rnja7UuZ
+OK
PING
+PONG
info replication
$409
# Replication
role:slave
master_host:no-such-master
master_port:6379
master_link_status:down
master_last_io_seconds_ago:-1
master_sync_in_progress:0
slave_repl_offset:1
master_link_down_since_seconds:1501080865
slave_priority:100
slave_read_only:1
connected_slaves:0
master_repl_offset:0
repl_backlog_active:0
repl_backlog_size:1048576
repl_backlog_first_byte_offset:0
repl_backlog_histlen:0

It looks like the redis backend doesn't return the master role, so haproxy doesn't send any requests to it.
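For comparison, the same replication check can be done with redis-cli instead of a raw nc session (a sketch; it assumes redis-cli is available and reuses the AUTH password from the haproxy tcp-check above):

redis-cli -h 172.17.1.14 -a UdEnQrH6Jfdy6A4W2Rnja7UuZ info replication | grep -E '^role|master_link_status'
# given the state above this should report role:slave and master_link_status:down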
Attaching a snippet from the redis.log, and redis.conf below:

()[root@controller-0 /]# grep -v ^# /etc/redis.conf | grep -v ^$
daemonize yes
pidfile /var/run/redis/redis.pid
port 6379
tcp-backlog 511
bind 172.17.1.14
unixsocket /var/run/redis/redis.sock
unixsocketperm 755
timeout 0
tcp-keepalive 0
loglevel notice
logfile /var/log/redis/redis.log
syslog-enabled no
databases 16
save 300 10
save 60 10000
save 900 1
stop-writes-on-bgsave-error yes
rdbcompression yes
rdbchecksum yes
dbfilename dump.rdb
dir /var/lib/redis
masterauth UdEnQrH6Jfdy6A4W2Rnja7UuZ
slave-serve-stale-data yes
slave-read-only yes
repl-timeout 60
repl-disable-tcp-nodelay no
repl-backlog-size 1mb
repl-backlog-ttl 3600
slave-priority 100
min-slaves-to-write 0
min-slaves-max-lag 10
requirepass UdEnQrH6Jfdy6A4W2Rnja7UuZ
maxclients 10000
appendonly no
appendfilename appendonly.aof
appendfsync everysec
no-appendfsync-on-rewrite no
auto-aof-rewrite-percentage 100
auto-aof-rewrite-min-size 64mb
aof-load-truncated yes
lua-time-limit 5000
slowlog-log-slower-than 10000
slowlog-max-len 1024
latency-monitor-threshold 0
notify-keyspace-events ""
hash-max-ziplist-entries 512
hash-max-ziplist-value 64
list-max-ziplist-entries 512
list-max-ziplist-value 64
set-max-intset-entries 512
zset-max-ziplist-entries 128
zset-max-ziplist-value 64
hll-sparse-max-bytes 3000
activerehashing yes
client-output-buffer-limit normal 0 0 0
client-output-buffer-limit slave 256mb 64mb 60
client-output-buffer-limit pubsub 32mb 8mb 60
hz 10
aof-rewrite-incremental-fsync yes
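Since redis is managed by pacemaker as a bundle here, the cluster's own view of the resource is also worth checking (a sketch; the bundle name is inferred from the redis-bundle-docker-0 container above):

pcs status --full | grep -iA5 redis-bundle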
* Cause

In OSP 11 (redis on baremetal) the redis resource agent stores a cluster property recording the last known master:

redis_REPL_INFO: controller-0

After the migration to containers, this property prevents the resource from being promoted to master, because in wait_last_known_master we look for the node that was last known to be master. In a bundle, however, the node name is not controller-0 but redis-bundle-0, so the property would have to be something like:

redis_REPL_INFO: redis-bundle-0

As soon as we remove that global cluster property, the resource starts up in master mode correctly.

* Fixes

We are exploring the best way to fix this. It will likely need some additional pacemaker env variables that give us the host vs bundle distinction, and some changes to the resource agents for rabbitmq and redis (the agents that store hostnames in cluster properties).
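To confirm and clear the stale property by hand (a sketch of the workaround described above, using standard pacemaker tooling; this assumes redis_REPL_INFO is stored as a crm_config cluster property, as "global cluster property" suggests):

# query the property left over from the baremetal deployment
crm_attribute --type crm_config --name redis_REPL_INFO --query

# remove it so the bundle replica can be promoted again
crm_attribute --type crm_config --name redis_REPL_INFO --delete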
We're still looking into this. The changes will likely involve both pacemaker and the resource-agents for a proper, clean fix.
Hi Damien,

Could you please link the patches to this BZ once they are available for testing? This issue is currently blocking upgrades, and the only workaround is to disable the telemetry services.

Thanks!
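For reference, disabling the telemetry services in TripleO is normally done with an environment file that maps the composable services to OS::Heat::None. The snippet below is only an illustration (service names assumed from tripleo-heat-templates, not taken from this BZ):

resource_registry:
  OS::TripleO::Services::GnocchiApi: OS::Heat::None
  OS::TripleO::Services::GnocchiMetricd: OS::Heat::None
  OS::TripleO::Services::GnocchiStatsd: OS::Heat::None
  OS::TripleO::Services::CeilometerAgentCentral: OS::Heat::None
  OS::TripleO::Services::CeilometerAgentNotification: OS::Heat::None
  OS::TripleO::Services::AodhApi: OS::Heat::None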
Just a drive-by update. We've been working tirelessly on this topic and I think we're slowly settling on an approach that is backwards-compatible enough. Basically we will need three pieces (a quick check of the installed package versions is sketched after this list):

1. Pacemaker changes. The minimum version currently needed is 1.1.16-12.12 and can be found here:
   http://people.redhat.com/mbaldess/rpms/container-repo/pacemaker-bundle.repo
2. Resource-agents changes. A first, non-final draft of the needed patch is here:
   http://acksyn.org/files/tripleo/all-bundles-hostattribute-fixes.patch
   and a temporary build is available as resource-agents-3.9.5-105.pidone.2.el7.centos.x86_64.
3. Two additional reviews:
   https://review.openstack.org/497766
   https://review.openstack.org/495491

The reason for all this work is that pacemaker and resource-agents live in the common channels, so we need to make sure that every change is fully backwards-compatible and works both on baremetal and in containers. I'll post more updates once this stuff is fully baked.
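A quick way to verify that the controllers already carry packages at or above those versions (a sketch; the version strings are taken from the list above):

rpm -q pacemaker resource-agents
# expected at minimum, per the comment above:
#   pacemaker-1.1.16-12.12 (or newer)
#   resource-agents-3.9.5-105.pidone.2 (temporary build)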
For anyone coming in late, this is a race condition caused by our ansible job(s) sending a resource delete and a resource cleanup at about the same time.
*** Bug 1498540 has been marked as a duplicate of this bug. ***
Verified on:
puppet-tripleo-7.4.3-6.el7ost.noarch

Upgrade passed and the redis backend is reachable.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory, and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2017:3462