rhel-osp-director: redis fails to start after a host is fenced in an HA cluster.

Environment:
python-redis-2.10.3-1.el7ost.noarch
redis-2.8.21-1.el7ost.x86_64
instack-undercloud-2.1.2-22.el7ost.noarch

Steps to reproduce:
1. Deploy the overcloud with HA.
2. Configure fencing for all controllers.
3. Fence one controller.

Result: Redis fails to start after the fenced controller boots.
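Step 3 above can be sketched as a shell helper. This is only an illustration: the node name is taken from the logs attached later in this bug, and a configured stonith device plus cluster authentication are assumed.

```shell
#!/bin/sh
# Hypothetical sketch of step 3: fencing one controller with pcs.
# The node name is an assumption based on the attached logs; a working
# stonith device is required for the real command to succeed.
fence_node() {
    # Build (but do not execute) the pcs command, keeping the sketch
    # side-effect free.
    printf 'pcs stonith fence %s\n' "$1"
}

fence_node overcloud-controller-1
```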
Created attachment 1057446 [details] pacemaker and redis logs from the fenced controller.
The pacemaker logs report an attempt to start the redis service at 18:40:13, followed by an immediate failure to connect to it:

redis(redis)[4129]: 2015/07/29_18:40:13 INFO: start: /usr/bin/redis-server --daemonize yes --unixsocket '/var/run/redis/redis.sock' --pidfile '/var/run/redis/redis-server.pid'
Jul 29 18:40:13 [2673] overcloud-controller-1.localdomain lrmd: info: log_finished: finished - rsc:ip-10.19.94.10 action:start call_id:225 pid:4128 exit-code:0 exec-time:142ms queue-time:0ms
redis(redis)[4129]: 2015/07/29_18:40:13 ERROR: demote: Failed to demote, redis not running.
redis(redis)[4129]: 2015/07/29_18:40:13 ERROR: start: Unknown error starting redis. output=
Jul 29 18:40:13 [2673] overcloud-controller-1.localdomain lrmd: notice: operation_finished: redis_start_0:4129:stderr [ Could not connect to Redis at /var/run/redis/redis.sock: No such file or directory ]
Jul 29 18:40:13 [2673] overcloud-controller-1.localdomain lrmd: info: log_finished: finished - rsc:redis action:start call_id:226 pid:4129 exit-code:7 exec-time:243ms queue-time:0ms

This resulted in pacemaker enforcing a resource stop later. The redis logs, however, show that the start attempt at 18:40:13 apparently completed successfully:

[4426] 29 Jul 18:40:13.411 * The server is now ready to accept connections on port 6379
[4426] 29 Jul 18:40:13.411 * The server is now ready to accept connections at /var/run/redis/redis.sock

It is not obvious to me why pacemaker could not find the socket; it probably tried to connect before the socket was created.
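If the agent really is racing the daemonized server, one mitigation would be to poll for the socket instead of connecting immediately after redis-server returns. A minimal sketch of such a wait loop, assuming the socket path from the logs above; the timeout and the idea that the agent should do this are assumptions, not the agent's actual code:

```shell
#!/bin/sh
# Sketch: wait for the redis unix socket to appear instead of connecting
# immediately after "redis-server --daemonize yes" returns. Path and
# timeout are assumptions; the real OCF agent may differ.
wait_for_socket() {
    # $1 = socket path, $2 = max seconds to wait (polled once per second)
    path=$1
    tries=$2
    while [ "$tries" -gt 0 ]; do
        # A real agent could also verify liveness with redis-cli ping here.
        [ -e "$path" ] && return 0
        sleep 1
        tries=$((tries - 1))
    done
    return 1
}
```

With a loop like this, the start action would only report failure after the timeout expires, rather than on the first missed connect.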
The issue is intermittent. When the demote works, the (new) redis slave on the fenced node is synced back from the master within a few minutes. In the attached logs, both the pacemaker and redis logs show a successful rejoin of the cluster at 18:30, in the same environment, where the same node was fenced.
Sasha, how reproducible/intermittent is this?
Just reproduced it; I was trying to see whether the issue is intermittent.
Could this be something to do with the pacemaker resource agent for redis? It does not seem likely to me to be Puppet, unless there is some timeout that is needed but Puppet is not passing. David, any thoughts on this?
Sasha, perhaps you can get the resource-agents version so Fabio can verify it has the fix he refers to, and also check whether the redis db size is 0 bytes?
Sasha, any update on this?
resource-agents-3.9.5-40.el7_1.4.x86_64
Thanks Sasha. Do you have a deployment with this problem where you could check whether the failing redis node has a 0-byte db, as Fabio mentioned above? I'll have to leave it to him to say whether the resource-agents version you have is sufficient; I don't know which version is needed to cover the referenced fix.
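A quick way to perform the 0-byte db check is sketched below. Note the dump.rdb path is the RHEL default and is an assumption for this deployment; `rpm -q resource-agents` covers the version check.

```shell
#!/bin/sh
# Sketch of the db-size check requested above. The dump path is the
# RHEL default for redis and is an assumption for this deployment.
is_empty_db() {
    # Succeed if the given dump file exists but contains 0 bytes.
    [ -f "$1" ] && [ ! -s "$1" ]
}

# Example (hypothetical path):
# is_empty_db /var/lib/redis/dump.rdb && echo "redis db is empty"
```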
This bug did not make the OSP 8.0 release. It is being deferred to OSP 10.
Closing due to the age of the product and the lack of activity on this bug.