Bug 1248224

Summary:

rhel-osp-director: redis failed to start after a host was fenced in HA cluster.

Product:

Red Hat OpenStack

Reporter:

Alexander Chuzhoy <sasha>

Component:

rhosp-director

Assignee:

Jason Guiditta <jguiditt>

Status:

CLOSED WONTFIX

QA Contact:

Shai Revivo <srevivo>

Severity:

high

Docs Contact:

Priority:

high

Version:

7.0 (Kilo)

CC:

fdinitto, gfidente, hbrock, mburns, rhel-osp-director-maint, sasha

Target Milestone:

---

Keywords:

ZStream

Target Release:

10.0 (Newton)

Hardware:

Unspecified

OS:

Unspecified

Whiteboard:

Fixed In Version:

Doc Type:

Bug Fix

Doc Text:

Story Points:

---

Clone Of:

Environment:

Last Closed:

2016-05-20 15:12:13 UTC

Type:

Bug

Regression:

---

Mount Type:

---

Documentation:

---

CRM:

Verified Versions:

Category:

---

oVirt Team:

---

RHEL 7.3 requirements from Atomic Host:

Cloudforms Team:

---

Target Upstream Version:

Embargoed:

Attachments:

Description	Flags
pacemaker and redis logs from the fenced controller.	none

Description Alexander Chuzhoy 2015-07-29 22:46:52 UTC

rhel-osp-director: redis failed to start after a host was fenced in HA cluster.

Environment:
python-redis-2.10.3-1.el7ost.noarch
redis-2.8.21-1.el7ost.x86_64
instack-undercloud-2.1.2-22.el7ost.noarch

Steps to reproduce:

1. Deploy the overcloud with HA.
2. Configure fencing for all controllers.
3. Fence one controller.

Result:
Redis fails to start after the controller boots.

Comment 3 Alexander Chuzhoy 2015-07-29 22:49:12 UTC

Created attachment 1057446
pacemaker and redis logs from the fenced controller.

Comment 4 Giulio Fidente 2015-07-30 10:43:42 UTC

pcmk logs seem to report an attempt to start the redis service at 18:40:13 and then an immediate failure to connect to it:


redis(redis)[4129]:	2015/07/29_18:40:13 INFO: start: /usr/bin/redis-server --daemonize yes --unixsocket '/var/run/redis/redis.sock' --pidfile '/var/run/redis/redis-server.pid'
Jul 29 18:40:13 [2673] overcloud-controller-1.localdomain       lrmd:     info: log_finished: 	finished - rsc:ip-10.19.94.10 action:start call_id:225 pid:4128 exit-code:0 exec-time:142ms queue-time:0ms
redis(redis)[4129]:	2015/07/29_18:40:13 ERROR: demote: Failed to demote, redis not running.
redis(redis)[4129]:	2015/07/29_18:40:13 ERROR: start: Unknown error starting redis. output=
Jul 29 18:40:13 [2673] overcloud-controller-1.localdomain       lrmd:   notice: operation_finished: 	redis_start_0:4129:stderr [ Could not connect to Redis at /var/run/redis/redis.sock: No such file or directory ]
Jul 29 18:40:13 [2673] overcloud-controller-1.localdomain       lrmd:     info: log_finished: 	finished - rsc:redis action:start call_id:226 pid:4129 exit-code:7 exec-time:243ms queue-time:0ms

This resulted in pcmk enforcing a resource stop later, yet the redis logs show that a start attempt at 18:40:13 apparently completed successfully:


[4426] 29 Jul 18:40:13.411 * The server is now ready to accept connections on port 6379
[4426] 29 Jul 18:40:13.411 * The server is now ready to accept connections at /var/run/redis/redis.sock


It is not obvious to me why pcmk couldn't find the socket; perhaps it tried to connect too quickly?
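
If the agent did probe the socket before redis had created it, a bounded wait would cover the gap. Below is a minimal sketch of that idea in Python. It is not the resource agent's actual logic, and `wait_for_unix_socket` together with its timeout and interval defaults are hypothetical names and values chosen only for illustration:

```python
import os
import socket
import stat
import time


def wait_for_unix_socket(path, timeout=10.0, interval=0.1):
    """Poll until a Unix socket at `path` accepts a connection.

    Returns True once a connection succeeds, False if `timeout` seconds
    pass first. Hypothetical helper; not part of any resource agent.
    """
    deadline = time.monotonic() + timeout
    while True:
        try:
            # Only try to connect once the path exists and is a socket.
            if stat.S_ISSOCK(os.stat(path).st_mode):
                with socket.socket(socket.AF_UNIX, socket.SOCK_STREAM) as s:
                    s.connect(path)
                return True
        except OSError:
            # Covers "No such file or directory" and "Connection refused".
            pass
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)
```

With the socket path from the logs above, a call would look like `wait_for_unix_socket("/var/run/redis/redis.sock")`. A False result after the timeout would then point to a real start failure rather than a race.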

Comment 6 Giulio Fidente 2015-07-30 11:47:13 UTC

The issue is intermittent. When the demote works, the (new) redis slave (the fenced node) is synced back to the master within a few minutes. In the attached logs, both the pacemaker and redis logs show a successful attempt to rejoin the cluster at 18:30, in the same environment, where the same node was fenced.

Comment 7 chris alfonso 2015-08-21 16:20:19 UTC

Sasha, how reproducible/intermittent is this?

Comment 8 Alexander Chuzhoy 2015-08-21 16:56:44 UTC

Just reproduced - tried to see if the issue is intermittent.

Comment 9 Jason Guiditta 2015-08-21 18:00:26 UTC

Could this be something to do with the pacemaker resource agent for redis?  It doesn't seem to me likely to be puppet, unless there is some timeout that puppet is not passing, but is needed.  David, any thoughts on this?

Comment 11 Jason Guiditta 2015-08-26 17:07:19 UTC

Sasha, perhaps you can get the resource-agents version so Fabio can verify it has the fix he refers to, as well as checking for redis db size being 0 bytes?

Comment 12 Jason Guiditta 2015-09-03 14:02:52 UTC

Sasha, any update on this?

Comment 13 Alexander Chuzhoy 2015-09-03 16:56:40 UTC

resource-agents-3.9.5-40.el7_1.4.x86_64

Comment 14 Jason Guiditta 2015-09-03 17:10:48 UTC

Thanks Sasha, do you have a deployment with this problem where you would be able to check if the failing redis node has a 0 byte db as Fabio mentioned above?  I'll have to leave it to him to reply on whether the resource-agents version you have is sufficient; I don't know what version is needed to cover the referenced fix.
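
The 0-byte check asked for above can be scripted. This is a minimal sketch. The dump path is an assumption (the bug does not state where the RDB file lives on these controllers), so it should be adjusted to match the `dir` and `dbfilename` settings in redis.conf. `rdb_is_empty` is a hypothetical helper name:

```python
import os

# Assumed location of the redis dump; not taken from the bug report.
ASSUMED_DUMP_PATH = "/var/lib/redis/dump.rdb"


def rdb_is_empty(path):
    """Return True if the file at `path` exists and is 0 bytes long.

    Raises FileNotFoundError if the file is missing, so that a missing
    dump is not mistaken for an empty one.
    """
    return os.path.getsize(path) == 0
```

Running `rdb_is_empty(ASSUMED_DUMP_PATH)` on the failing controller would answer the question directly. Any exception means the file could not be inspected at all, which is a different condition from an empty database.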

Comment 16 Mike Burns 2016-04-07 20:47:27 UTC

This bug did not make the OSP 8.0 release.  It is being deferred to OSP 10.

Comment 18 Jason Guiditta 2016-05-20 15:12:13 UTC

Closing due to age of product and lack of activity on the bug