Bug 1248224 - rhel-osp-director: redis failed to start after a host was fenced in HA cluster.
Summary: rhel-osp-director: redis failed to start after a host was fenced in HA cluster.
Keywords:
Status: CLOSED WONTFIX
Alias: None
Product: Red Hat OpenStack
Classification: Red Hat
Component: rhosp-director
Version: 7.0 (Kilo)
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: ---
Target Release: 10.0 (Newton)
Assignee: Jason Guiditta
QA Contact: Shai Revivo
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2015-07-29 22:46 UTC by Alexander Chuzhoy
Modified: 2016-05-20 15:12 UTC (History)

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2016-05-20 15:12:13 UTC
Target Upstream Version:
Embargoed:


Attachments
pacemaker and redis logs from the fenced controller. (521.27 KB, application/x-gzip)
2015-07-29 22:49 UTC, Alexander Chuzhoy

Description Alexander Chuzhoy 2015-07-29 22:46:52 UTC
rhel-osp-director: redis failed to start after a host was fenced in HA cluster.

Environment:
python-redis-2.10.3-1.el7ost.noarch
redis-2.8.21-1.el7ost.x86_64
instack-undercloud-2.1.2-22.el7ost.noarch

Steps to reproduce:

1. Deploy the overcloud with HA.
2. Configure fencing for all controllers.
3. Fence one controller.

Result:
Redis fails to start after the controller boots.

Comment 3 Alexander Chuzhoy 2015-07-29 22:49:12 UTC
Created attachment 1057446
pacemaker and redis logs from the fenced controller.

Comment 4 Giulio Fidente 2015-07-30 10:43:42 UTC
pcmk logs seem to report an attempt to start the redis service at 18:40:13 and then an immediate failure to connect to it:


redis(redis)[4129]:	2015/07/29_18:40:13 INFO: start: /usr/bin/redis-server --daemonize yes --unixsocket '/var/run/redis/redis.sock' --pidfile '/var/run/redis/redis-server.pid'
Jul 29 18:40:13 [2673] overcloud-controller-1.localdomain       lrmd:     info: log_finished: 	finished - rsc:ip-10.19.94.10 action:start call_id:225 pid:4128 exit-code:0 exec-time:142ms queue-time:0ms
redis(redis)[4129]:	2015/07/29_18:40:13 ERROR: demote: Failed to demote, redis not running.
redis(redis)[4129]:	2015/07/29_18:40:13 ERROR: start: Unknown error starting redis. output=
Jul 29 18:40:13 [2673] overcloud-controller-1.localdomain       lrmd:   notice: operation_finished: 	redis_start_0:4129:stderr [ Could not connect to Redis at /var/run/redis/redis.sock: No such file or directory ]
Jul 29 18:40:13 [2673] overcloud-controller-1.localdomain       lrmd:     info: log_finished: 	finished - rsc:redis action:start call_id:226 pid:4129 exit-code:7 exec-time:243ms queue-time:0ms

This resulted in pcmk enforcing a resource stop later. However, the redis logs show that a start attempt at 18:40:13 apparently completed successfully:


[4426] 29 Jul 18:40:13.411 * The server is now ready to accept connections on port 6379
[4426] 29 Jul 18:40:13.411 * The server is now ready to accept connections at /var/run/redis/redis.sock


It is not obvious to me why pcmk couldn't find the socket; perhaps it tried to connect too quickly?
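To illustrate the "connected too quickly" hypothesis: a client that checks the socket exactly once, right after the daemonizing start command returns, can see "No such file or directory" even though the server creates the socket a moment later. Below is a minimal Python sketch of the alternative, polling until the Unix socket accepts a connection. It is not the resource agent's actual code; the function name, timeout, and interval are hypothetical values chosen for illustration.

```python
import socket
import time


def wait_for_unix_socket(path, timeout=10.0, interval=0.2):
    """Poll until a Unix socket at `path` accepts a connection.

    Returns True once a connection succeeds, False if `timeout`
    seconds pass first. The defaults are illustrative only.
    """
    deadline = time.monotonic() + timeout
    while True:
        try:
            with socket.socket(socket.AF_UNIX, socket.SOCK_STREAM) as s:
                s.connect(path)
            return True
        except OSError:
            # Covers both "No such file or directory" (socket not created
            # yet) and "Connection refused" (created but not listening).
            if time.monotonic() >= deadline:
                return False
            time.sleep(interval)
```

For the path seen in the logs, the call would be `wait_for_unix_socket('/var/run/redis/redis.sock')`. Whether the agent in this resource-agents build retries in a comparable way has not been checked here.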

Comment 6 Giulio Fidente 2015-07-30 11:47:13 UTC
The issue is intermittent. When the demote works, the new redis slave (the fenced node) is synced back to the master within a few minutes. In the attached logs, both the pacemaker and redis logs show a successful attempt to rejoin the cluster at 18:30, in the same environment, where the same node was fenced.

Comment 7 chris alfonso 2015-08-21 16:20:19 UTC
Sasha, how reproducible/intermittent is this?

Comment 8 Alexander Chuzhoy 2015-08-21 16:56:44 UTC
Just reproduced - tried to see if the issue is intermittent.

Comment 9 Jason Guiditta 2015-08-21 18:00:26 UTC
Could this be something to do with the pacemaker resource agent for redis?  It doesn't seem likely to me to be puppet, unless there is some required timeout that puppet is not passing.  David, any thoughts on this?

Comment 11 Jason Guiditta 2015-08-26 17:07:19 UTC
Sasha, perhaps you can get the resource-agents version so Fabio can verify it has the fix he refers to, as well as checking for redis db size being 0 bytes?

Comment 12 Jason Guiditta 2015-09-03 14:02:52 UTC
Sasha, any update on this?

Comment 13 Alexander Chuzhoy 2015-09-03 16:56:40 UTC
resource-agents-3.9.5-40.el7_1.4.x86_64

Comment 14 Jason Guiditta 2015-09-03 17:10:48 UTC
Thanks Sasha, do you have a deployment with this problem where you would be able to check whether the failing redis node has a 0 byte db, as Fabio mentioned above?  I'll have to leave it to him to reply on whether the resource-agents version you have is sufficient; I don't know what version is needed to cover the referenced fix.
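One way to do the 0 byte check on the failing node is sketched below in Python. The RDB path is an assumption (a common default location); the real location is given by the `dir` and `dbfilename` settings in that node's redis.conf, so adjust it before relying on the result.

```python
import os

# Assumed location of the redis dump file; verify against `dir` and
# `dbfilename` in the node's redis.conf.
RDB_PATH = "/var/lib/redis/dump.rdb"


def is_zero_byte_file(path):
    """Return True only if `path` is an existing regular file of size 0."""
    return os.path.isfile(path) and os.path.getsize(path) == 0


if __name__ == "__main__":
    if not os.path.exists(RDB_PATH):
        print(f"{RDB_PATH}: not found (check the path in redis.conf)")
    elif is_zero_byte_file(RDB_PATH):
        print(f"{RDB_PATH}: 0 bytes")
    else:
        print(f"{RDB_PATH}: {os.path.getsize(RDB_PATH)} bytes")
```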

Comment 16 Mike Burns 2016-04-07 20:47:27 UTC
This bug did not make the OSP 8.0 release.  It is being deferred to OSP 10.

Comment 18 Jason Guiditta 2016-05-20 15:12:13 UTC
Closing due to the age of the product and lack of activity on the bug.

