Bug 1248224 - rhel-osp-director: redis failed to start after a host was fenced in HA cluster.
Status: CLOSED WONTFIX
Product: Red Hat OpenStack
Classification: Red Hat
Component: rhosp-director
Version: 7.0 (Kilo)
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: ---
Target Release: 10.0 (Newton)
Assigned To: Jason Guiditta
QA Contact: Shai Revivo
Keywords: ZStream
Depends On:
Blocks:
Reported: 2015-07-29 18:46 EDT by Alexander Chuzhoy
Modified: 2016-05-20 11:12 EDT

See Also:
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2016-05-20 11:12:13 EDT
Type: Bug

Attachments
pacemaker and redis logs from the fenced controller. (521.27 KB, application/x-gzip)
2015-07-29 18:49 EDT, Alexander Chuzhoy

Description Alexander Chuzhoy 2015-07-29 18:46:52 EDT
rhel-osp-director: redis failed to start after a host was fenced in HA cluster.

Environment:
python-redis-2.10.3-1.el7ost.noarch
redis-2.8.21-1.el7ost.x86_64
instack-undercloud-2.1.2-22.el7ost.noarch

Steps to reproduce:

1. Deploy the overcloud with HA.
2. Configure fencing for all controllers.
3. Fence one controller.

Result:
Redis fails to start after the controller boots.
Comment 3 Alexander Chuzhoy 2015-07-29 18:49:12 EDT
Created attachment 1057446
pacemaker and redis logs from the fenced controller.
Comment 4 Giulio Fidente 2015-07-30 06:43:42 EDT
pcmk logs seem to report an attempt to start the redis service at 18:40:13 and then an immediate failure to connect to it:


redis(redis)[4129]:	2015/07/29_18:40:13 INFO: start: /usr/bin/redis-server --daemonize yes --unixsocket '/var/run/redis/redis.sock' --pidfile '/var/run/redis/redis-server.pid'
Jul 29 18:40:13 [2673] overcloud-controller-1.localdomain       lrmd:     info: log_finished: 	finished - rsc:ip-10.19.94.10 action:start call_id:225 pid:4128 exit-code:0 exec-time:142ms queue-time:0ms
redis(redis)[4129]:	2015/07/29_18:40:13 ERROR: demote: Failed to demote, redis not running.
redis(redis)[4129]:	2015/07/29_18:40:13 ERROR: start: Unknown error starting redis. output=
Jul 29 18:40:13 [2673] overcloud-controller-1.localdomain       lrmd:   notice: operation_finished: 	redis_start_0:4129:stderr [ Could not connect to Redis at /var/run/redis/redis.sock: No such file or directory ]
Jul 29 18:40:13 [2673] overcloud-controller-1.localdomain       lrmd:     info: log_finished: 	finished - rsc:redis action:start call_id:226 pid:4129 exit-code:7 exec-time:243ms queue-time:0ms

This led pcmk to enforce a resource stop later. The redis logs, however, show that a start attempt at 18:40:13 apparently completed successfully:


[4426] 29 Jul 18:40:13.411 * The server is now ready to accept connections on port 6379
[4426] 29 Jul 18:40:13.411 * The server is now ready to accept connections at /var/run/redis/redis.sock


It is not obvious to me why pcmk couldn't find the socket; perhaps it tried to connect too quickly?
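
To illustrate the "connected too quickly" hypothesis, here is a minimal Python sketch of a readiness check that retries until a Unix socket accepts connections, instead of failing on the first "No such file or directory". This is not the redis resource agent's code. The function name `wait_for_unix_socket` and its `timeout` and `interval` parameters are hypothetical; only the socket path comes from the logs above.

```python
import socket
import time


def wait_for_unix_socket(path, timeout=10.0, interval=0.2):
    """Poll until a Unix domain socket at `path` accepts a connection.

    Hypothetical helper for illustration only.
    Returns True on success, False if `timeout` seconds pass first.
    """
    deadline = time.monotonic() + timeout
    while True:
        sock = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
        try:
            sock.connect(path)
            return True
        except OSError:
            # Socket file not created yet, or nothing is listening on it yet.
            pass
        finally:
            sock.close()
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)


if __name__ == "__main__":
    # Path taken from the start command in the pacemaker log.
    ready = wait_for_unix_socket("/var/run/redis/redis.sock", timeout=5.0)
    print("redis socket ready" if ready else "redis socket not ready")
```

A single immediate connection attempt can lose the race against a daemonizing server, whereas a bounded retry loop tolerates the short window between process start and socket creation. Whether that race is the actual cause here is not established in this report.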
Comment 6 Giulio Fidente 2015-07-30 07:47:13 EDT
The issue is intermittent. When the demote works, the new redis slave (the fenced node) is synced back to the master within a few minutes. In the attached logs, both the pacemaker and redis logs show a successful attempt to rejoin the cluster at 18:30, in the same environment, where the same node was fenced.
Comment 7 chris alfonso 2015-08-21 12:20:19 EDT
Sasha, how reproducible/intermittent is this?
Comment 8 Alexander Chuzhoy 2015-08-21 12:56:44 EDT
Just reproduced - tried to see if the issue is intermittent.
Comment 9 Jason Guiditta 2015-08-21 14:00:26 EDT
Could this be something to do with the pacemaker resource agent for redis?  It doesn't seem to me likely to be puppet, unless there is some timeout that puppet is not passing, but is needed.  David, any thoughts on this?
Comment 11 Jason Guiditta 2015-08-26 13:07:19 EDT
Sasha, perhaps you can get the resource-agents version so Fabio can verify it has the fix he refers to, as well as checking for redis db size being 0 bytes?
Comment 12 Jason Guiditta 2015-09-03 10:02:52 EDT
Sasha, any update on this?
Comment 13 Alexander Chuzhoy 2015-09-03 12:56:40 EDT
resource-agents-3.9.5-40.el7_1.4.x86_64
Comment 14 Jason Guiditta 2015-09-03 13:10:48 EDT
Thanks Sasha. Do you have a deployment with this problem where you could check whether the failing redis node has a 0-byte db, as Fabio mentioned above? I'll have to leave it to him to say whether the resource-agents version you have is sufficient; I don't know what version is needed to cover the referenced fix.
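
The 0-byte check requested here can be sketched in a few lines of Python. The path below is an assumption: the actual dump location depends on the `dir` and `dbfilename` settings in redis.conf on the controller, and the helper name `rdb_state` is hypothetical.

```python
import os

# Assumed dump location; verify against `dir` and `dbfilename` in redis.conf.
ASSUMED_RDB_PATH = "/var/lib/redis/dump.rdb"


def rdb_state(path=ASSUMED_RDB_PATH):
    """Return 'missing', 'empty' (0 bytes) or 'non-empty' for the file at `path`."""
    try:
        size = os.stat(path).st_size
    except FileNotFoundError:
        return "missing"
    return "empty" if size == 0 else "non-empty"


if __name__ == "__main__":
    print(ASSUMED_RDB_PATH, rdb_state())
```

Running this on the fenced controller after it boots would distinguish a missing dump file from a present but empty one.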
Comment 16 Mike Burns 2016-04-07 16:47:27 EDT
This bug did not make the OSP 8.0 release.  It is being deferred to OSP 10.
Comment 18 Jason Guiditta 2016-05-20 11:12:13 EDT
Closing due to age of product and lack of activity on the bug
