Bug 1283240 - Redis can't demote previous redis master after rebooting the node.
Summary: Redis can't demote previous redis master after rebooting the node.
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat OpenStack
Classification: Red Hat
Component: redis
Version: 7.0 (Kilo)
Hardware: Unspecified
OS: Unspecified
Priority: urgent
Severity: high
Target Milestone: z3
Target Release: 7.0 (Kilo)
Assignee: Lon Hohberger
QA Contact: Leonid Natapov
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2015-11-18 13:51 UTC by Leonid Natapov
Modified: 2015-12-21 17:04 UTC (History)
15 users

Fixed In Version: redis-2.8.21-2.el7ost
Doc Type: Bug Fix
Doc Text:
Previously, redis-server did not wait until the server was ready after daemonizing. As a result, the redis resource agent erroneously failed in some cases, causing Pacemaker to recover the redis resource unnecessarily. With this update, the parent redis-server process waits for a signal from the child process before exiting after daemonizing, so Pacemaker no longer recovers the redis resource unnecessarily.
Clone Of:
Environment:
Last Closed: 2015-12-21 17:04:45 UTC
Target Upstream Version:
Embargoed:


Attachments (Terms of Use)
redis log (479.34 KB, application/octet-stream), 2015-11-18 13:51 UTC, Leonid Natapov
crm report (649.65 KB, application/x-gzip), 2015-12-03 14:22 UTC, Leonid Natapov
Patch (1.99 KB, patch), 2015-12-04 16:44 UTC, Lon Hohberger


Links
Red Hat Product Errata RHBA-2015:2676 (SHIPPED_LIVE): Red Hat Enterprise Linux OpenStack Platform 7 Bug Fix and Enhancement Advisory, last updated 2015-12-21 21:51:22 UTC

Description Leonid Natapov 2015-11-18 13:51:13 UTC
Created attachment 1096099 [details]
redis log

I have an HA deployment with 3 controllers. Redis master is on controller-0.
I have rebooted controller-0 and the redis master switched to controller-2.
When controller-0 came back, redis was unable to start on that controller.

The triggering issue seems to be that it can't demote the previous redis master.

pcs status shows:
redis	(ocf::heartbeat:redis):	FAILED Master overcloud-controller-0
     Masters: [ overcloud-controller-2 ]
     Slaves: [ overcloud-controller-1 ]

and after some time:

Master/Slave Set: redis-master [redis]
     Masters: [ overcloud-controller-2 ]
     Slaves: [ overcloud-controller-1 ]
     Stopped: [ overcloud-controller-0 ]

As a result, the problematic node with the stopped redis is being fenced every few minutes.

redis.log attached

/var/log/messages file shows the following:
-------------------------------------------

Nov 18 05:41:47 localhost crmd[28270]: warning: Action 164 (redis_start_0) on overcloud-controller-0 failed (target: 0 vs. rc: 1): Error
Nov 18 05:41:47 localhost crmd[28270]: warning: Action 164 (redis_start_0) on overcloud-controller-0 failed (target: 0 vs. rc: 1): Error
Nov 18 05:41:47 localhost crmd[28270]:  notice: Initiating action 524: notify redis_post_notify_start_0 on overcloud-controller-2 (local)
Nov 18 05:41:47 localhost crmd[28270]:  notice: Initiating action 526: notify redis_post_notify_start_0 on overcloud-controller-1
Nov 18 05:41:47 localhost crmd[28270]:  notice: Initiating action 527: notify redis_post_notify_start_0 on overcloud-controller-0
Nov 18 05:41:47 localhost crmd[28270]:  notice: Operation redis_notify_0: ok (node=overcloud-controller-2, call=725, rc=0, cib-update=0, confirmed=true)
Nov 18 05:41:47 localhost crmd[28270]:  notice: Transition 132 (Complete=12, Pending=0, Fired=0, Skipped=3, Incomplete=5, Source=/var/lib/pacemaker/pengine/pe-input-79.bz2): Stopped
Nov 18 05:41:47 localhost pengine[28269]: warning: Processing failed op start for redis:2 on overcloud-controller-0: unknown error (1)
Nov 18 05:41:47 localhost pengine[28269]: warning: Processing failed op start for redis:2 on overcloud-controller-0: unknown error (1)
Nov 18 05:41:47 localhost pengine[28269]: warning: Forcing redis-master away from overcloud-controller-0 after 1000000 failures (max=1000000)
Nov 18 05:41:47 localhost pengine[28269]: warning: Forcing redis-master away from overcloud-controller-0 after 1000000 failures (max=1000000)
Nov 18 05:41:47 localhost pengine[28269]: warning: Forcing redis-master away from overcloud-controller-0 after 1000000 failures (max=1000000)
Nov 18 05:41:47 localhost pengine[28269]:   error: Internal error: No action found for demote in redis:1 (then)
Nov 18 05:41:47 localhost pengine[28269]:   error: Internal error: No action found for demote in redis:2 (then)
Nov 18 05:41:47 localhost pengine[28269]:   error: Internal error: No action found for demote in redis:1 (then)
Nov 18 05:41:47 localhost pengine[28269]:   error: Internal error: No action found for demote in redis:2 (then)
Nov 18 05:41:47 localhost pengine[28269]:   error: Internal error: No action found for demote in redis:1 (then)
Nov 18 05:41:47 localhost pengine[28269]:   error: Internal error: No action found for demote in redis:2 (then)
Nov 18 05:41:47 localhost pengine[28269]:   error: Internal error: No action found for demote in redis:1 (then)
Nov 18 05:41:47 localhost pengine[28269]:   error: Internal error: No action found for demote in redis:2 (then)
Nov 18 05:41:47 localhost pengine[28269]:   error: Internal error: No action found for demote in redis:1 (then)
Nov 18 05:41:47 localhost pengine[28269]:   error: Internal error: No action found for demote in redis:2 (then)
Nov 18 05:41:47 localhost pengine[28269]:   error: Internal error: No action found for demote in redis:1 (then)
Nov 18 05:41:47 localhost pengine[28269]:   error: Internal error: No action found for demote in redis:2 (then)
Nov 18 05:41:47 localhost pengine[28269]:   error: Internal error: No action found for demote in redis:1 (then)
Nov 18 05:41:47 localhost pengine[28269]:   error: Internal error: No action found for demote in redis:2 (then)
Nov 18 05:41:47 localhost pengine[28269]:   error: Internal error: No action found for demote in redis:1 (then)
Nov 18 05:41:47 localhost pengine[28269]:   error: Internal error: No action found for demote in redis:2 (then)
Nov 18 05:41:47 localhost pengine[28269]:   error: Internal error: No action found for demote in redis:1 (then)
Nov 18 05:41:47 localhost pengine[28269]:   error: Internal error: No action found for demote in redis:2 (then)
Nov 18 05:41:47 localhost pengine[28269]:   error: Internal error: No action found for demote in redis:1 (then)
Nov 18 05:41:47 localhost pengine[28269]:   error: Internal error: No action found for demote in redis:2 (then)
Nov 18 05:41:47 localhost pengine[28269]:   error: Internal error: No action found for demote in redis:1 (then)
Nov 18 05:41:47 localhost pengine[28269]:   error: Internal error: No action found for demote in redis:2 (then)
Nov 18 05:41:47 localhost pengine[28269]:   error: Internal error: No action found for demote in redis:1 (then)
Nov 18 05:41:47 localhost pengine[28269]:   error: Internal error: No action found for demote in redis:2 (then)
Nov 18 05:41:47 localhost pengine[28269]:  notice: Stop    redis:2#011(overcloud-controller-0)
Nov 18 05:41:47 localhost pengine[28269]:  notice: Calculated Transition 133: /var/lib/pacemaker/pengine/pe-input-80.bz2
Nov 18 05:41:47 localhost crmd[28270]:  notice: Initiating action 525: notify redis_pre_notify_stop_0 on overcloud-controller-2 (local)
Nov 18 05:41:47 localhost crmd[28270]:  notice: Initiating action 527: notify redis_pre_notify_stop_0 on overcloud-controller-1
Nov 18 05:41:47 localhost crmd[28270]:  notice: Initiating action 529: notify redis_pre_notify_stop_0 on overcloud-controller-0
Nov 18 05:41:47 localhost crmd[28270]:  notice: Operation redis_notify_0: ok (node=overcloud-controller-2, call=726, rc=0, cib-update=0, confirmed=true)
Nov 18 05:41:47 localhost crmd[28270]:  notice: Initiating action 111: stop redis_stop_0 on overcloud-controller-0
Nov 18 05:41:48 localhost crmd[28270]: warning: No match for shutdown action on 1
Nov 18 05:41:48 localhost crmd[28270]:  notice: Transition aborted by deletion of nvpair[@id='status-1-master-redis']: Transient attribute change (cib=0.129.11, source=abort_unless_down:333, path=/cib/status/node_state[@id='1']/transient_attributes[@id='1']/instance_attributes[@id='status-1']/nvpair[@id='status-1-master-redis'], 0)
Nov 18 05:41:48 localhost crmd[28270]:  notice: Initiating action 526: notify redis_post_notify_stop_0 on overcloud-controller-2 (local)
Nov 18 05:41:48 localhost crmd[28270]:  notice: Initiating action 528: notify redis_post_notify_stop_0 on overcloud-controller-1
Nov 18 05:41:48 localhost crmd[28270]:  notice: Operation redis_notify_0: ok (node=overcloud-controller-2, call=727, rc=0, cib-update=0, confirmed=true)
Nov 18 05:41:48 localhost crmd[28270]:  notice: Transition 133 (Complete=13, Pending=0, Fired=0, Skipped=3, Incomplete=3, Source=/var/lib/pacemaker/pengine/pe-input-80.bz2): Stopped
Nov 18 05:41:48 localhost pengine[28269]: warning: Processing failed op start for redis:2 on overcloud-controller-0: unknown error (1)
Nov 18 05:41:48 localhost pengine[28269]: warning: Forcing redis-master away from overcloud-controller-0 after 1000000 failures (max=1000000)
Nov 18 05:41:48 localhost pengine[28269]: warning: Forcing redis-master away from overcloud-controller-0 after 1000000 failures (max=1000000)
Nov 18 05:41:48 localhost pengine[28269]: warning: Forcing redis-master away from overcloud-controller-0 after 1000000 failures (max=1000000)
Nov 18 05:41:48 localhost pengine[28269]:   error: Internal error: No action found for demote in redis:1 (then)
Nov 18 05:41:48 localhost pengine[28269]:   error: Internal error: No action found for demote in redis:1 (then)
Nov 18 05:41:48 localhost pengine[28269]:   error: Internal error: No action found for demote in redis:1 (then)
Nov 18 05:41:48 localhost pengine[28269]:   error: Internal error: No action found for demote in redis:1 (then)
Nov 18 05:41:48 localhost pengine[28269]:   error: Internal error: No action found for demote in redis:1 (then)
Nov 18 05:41:48 localhost pengine[28269]:   error: Internal error: No action found for demote in redis:1 (then)
Nov 18 05:41:48 localhost pengine[28269]:   error: Internal error: No action found for demote in redis:1 (then)
Nov 18 05:41:48 localhost pengine[28269]:   error: Internal error: No action found for demote in redis:1 (then)
Nov 18 05:41:48 localhost pengine[28269]:   error: Internal error: No action found for demote in redis:1 (then)
Nov 18 05:41:48 localhost pengine[28269]:   error: Internal error: No action found for demote in redis:1 (then)
Nov 18 05:41:48 localhost pengine[28269]:   error: Internal error: No action found for demote in redis:1 (then)
Nov 18 05:41:48 localhost pengine[28269]:   error: Internal error: No action found for demote in redis:1 (then)
Nov 18 05:41:48 localhost pengine[28269]:  notice: Calculated Transition 134: /var/lib/pacemaker/pengine/pe-input-81.bz2
Nov 18 05:41:48 localhost crmd[28270]:  notice: Initiating action 159: monitor redis_monitor_20000 on overcloud-controller-2 (local)
Nov 18 05:41:48 localhost crmd[28270]:  notice: Initiating action 163: monitor redis_monitor_60000 on overcloud-controller-1
Nov 18 05:41:48 localhost crmd[28270]:  notice: Initiating action 162: monitor redis_monitor_45000 on overcloud-controller-1
Nov 18 05:41:49 localhost crmd[28270]:  notice: Transition 134 (Complete=3, Pending=0, Fired=0, Skipped=0, Incomplete=0, Source=/var/lib/pacemaker/pengine/pe-input-81.bz2): Complete
Nov 18 05:41:49 localhost crmd[28270]:  notice: State transition S_TRANSITION_ENGINE -> S_IDLE [ input=I_TE_SUCCESS cause=C_FSA_INTERNAL origin=notify_crmd ]
Nov 18 05:41:53 localhost kernel: net_ratelimit: 1 callbacks suppressed
Nov 18 05:41:57 localhost su: (to rabbitmq) root on none
Nov 18 05:41:57 localhost systemd: Started Session c11069 of user rabbitmq.
Nov 18 05:41:57 localhost systemd: Starting Session c11069 of user rabbitmq.
Nov 18 05:41:57 localhost su: (to rabbitmq) root on none
Nov 18 05:41:57 localhost systemd: Started Session c11070 of user rabbitmq.
Nov 18 05:41:57 localhost systemd: Starting Session c11070 of user rabbitmq.
Nov 18 05:41:59 localhost Delay(delay)[68135]: INFO: Delay is running OK
Nov 18 05:42:08 localhost su: (to rabbitmq) root on none
Nov 18 05:42:08 localhost systemd: Started Session c11071 of user rabbitmq.
Nov 18 05:42:08 localhost systemd: Starting Session c11071 of user rabbitmq.
Nov 18 05:42:08 localhost su: (to rabbitmq) root on none
Nov 18 05:42:08 localhost systemd: Started Session c11072 of user rabbitmq.
Nov 18 05:42:08 localhost systemd: Starting Session c11072 of user rabbitmq.
Nov 18 05:42:10 localhost os-collect-config: 2015-11-18 05:42:10.299 9043 WARNING os-collect-config [-] Source [request] Unavailable.
Nov 18 05:42:10 localhost os-collect-config: 2015-11-18 05:42:10.301 9043 WARNING os_collect_config.local [-] /var/lib/os-collect-config/local-data not found. Skipping
Nov 18 05:42:10 localhost os-collect-config: 2015-11-18 05:42:10.301 9043 WARNING os_collect_config.local [-] No local metadata found (['/var/lib/os-collect-config/local-data'])
Nov 18 05:42:10 localhost os-collect-config: 2015-11-18 05:42:10.302 9043 WARNING os_collect_config.zaqar [-] No auth_url configured.
Nov 18 05:42:13 localhost kernel: net_ratelimit: 4 callbacks suppressed
Nov 18 05:42:18 localhost su: (to rabbitmq) root on none
Nov 18 05:42:18 localhost systemd: Started Session c11073 of user rabbitmq.
Nov 18 05:42:18 localhost systemd: Starting Session c11073 of user rabbitmq.
Nov 18 05:42:18 localhost su: (to rabbitmq) root on none
Nov 18 05:42:18 localhost systemd: Started Session c11074 of user rabbitmq.
Nov 18 05:42:18 localhost systemd: Starting Session c11074 of user rabbitmq.
Nov 18 05:42:19 localhost Delay(delay)[69092]: INFO: Delay is running OK

Comment 1 Asaf Hirshberg 2015-11-22 14:09:48 UTC
I got the same error on my setup too.
Setup: 3 controllers, 3 computes (bare metal)

1) pcs cluster stop --all
2) pcs cluster start --all

Master/Slave Set: redis-master [redis]
     Masters: [ overcloud-controller-0 ]
     Stopped: [ overcloud-compute-0 overcloud-compute-1 overcloud-compute-2 overcloud-controller-1 overcloud-controller-2 ]

Nov 22 08:15:26 overcloud-controller-0 crmd[21904]: warning: Action 36 (redis:0_monitor_0) on overcloud-controller-0 failed (target: 7 vs. rc: 0): Error
Nov 22 08:15:26 overcloud-controller-0 crmd[21904]: warning: Action 36 (redis:0_monitor_0) on overcloud-controller-0 failed (target: 7 vs. rc: 0): Error
Nov 22 08:17:35 overcloud-controller-0 crmd[21904]: warning: Action 69 (redis_start_0) on overcloud-controller-1 failed (target: 0 vs. rc: 7): Error
Nov 22 08:17:35 overcloud-controller-0 crmd[21904]: warning: Action 69 (redis_start_0) on overcloud-controller-1 failed (target: 0 vs. rc: 7): Error
Nov 22 08:19:53 overcloud-controller-0 ceilometer-polling: ImportError: No module named redis
Nov 22 08:21:52 overcloud-controller-0 crmd[21904]: warning: Action 101 (redis_start_0) on overcloud-controller-1 failed (target: 0 vs. rc: 1): Error
Nov 22 08:21:52 overcloud-controller-0 crmd[21904]: warning: Action 101 (redis_start_0) on overcloud-controller-1 failed (target: 0 vs. rc: 1): Error
Nov 22 08:21:52 overcloud-controller-0 pengine[21903]: warning: Processing failed op start for redis:1 on overcloud-controller-1: unknown error (1)
Nov 22 08:21:52 overcloud-controller-0 pengine[21903]: warning: Processing failed op start for redis:1 on overcloud-controller-1: unknown error (1)

Comment 2 Andrew Beekhof 2015-12-02 03:49:49 UTC
(In reply to Asaf Hirshberg from comment #1)
> Got the same error on my setup too.
> Setup: 3 controller, 3 compute (bm)
> 
> 1) pcs cluster stop --all
> 2) pcs cluster start --all
> 
> Master/Slave Set: redis-master [redis]
>      Masters: [ overcloud-controller-0 ]
>      Stopped: [ overcloud-compute-0 overcloud-compute-1 overcloud-compute-2
> overcloud-controller-1 overcloud-controller-2 ]
> 
> Nov 22 08:15:26 overcloud-controller-0 crmd[21904]: warning: Action 36
> (redis:0_monitor_0) on overcloud-controller-0 failed (target: 7 vs. rc: 0):
> Error
> Nov 22 08:15:26 overcloud-controller-0 crmd[21904]: warning: Action 36
> (redis:0_monitor_0) on overcloud-controller-0 failed (target: 7 vs. rc: 0):
> Error

Is anyone running "pcs resource cleanup [...]"?
If not, this suggests that either the redis agent monitor action is incorrectly reporting redis to be up when it isn't, or the stop action is not actually stopping redis.

Both are particularly bad.

Comment 3 Jaromir Coufal 2015-12-02 11:39:15 UTC
This seems like a Pacemaker-related issue rather than one specific to tripleo-heat-templates; can anybody confirm?

Comment 4 Mike Burns 2015-12-02 12:27:55 UTC
adding needinfo for comment 2

Comment 5 Leonid Natapov 2015-12-02 12:36:42 UTC
Yes, I've tried running pcs resource cleanup.

Comment 6 chris alfonso 2015-12-02 16:59:48 UTC
Andrew, can you give us more guidance on what to fix on our side? We don't have folks familiar enough with the specifics and will need your assistance.

Comment 7 Andrew Beekhof 2015-12-02 23:54:03 UTC
That assessment was based on people /not/ having used pcs cleanup.
Since they were using it, I looked back further.

The first error you pasted was:

Nov 18 05:41:47 localhost crmd[28270]: warning: Action 164 (redis_start_0) on overcloud-controller-0 failed (target: 0 vs. rc: 1): Error

Looking at the redis logs from about that time I see:


[15037] 18 Nov 05:39:47.696 * The server is now ready to accept connections on port 6379
[15037] 18 Nov 05:39:47.696 * The server is now ready to accept connections at /var/run/redis/redis.sock
[15037] 18 Nov 05:39:47.964 * SLAVE OF overcloud-controller-2:6379 enabled (user request)

followed by these two messages over and over:

[15037] 18 Nov 05:39:48.698 * Connecting to MASTER overcloud-controller-2:6379
[15037] 18 Nov 05:39:48.699 # Unable to connect to MASTER: Permission denied

Is the current master really on that host? Under what conditions would you get Permission denied from redis?  Can you get me a crm_report from these nodes starting from when the master was rebooted?

Comment 8 Mike Burns 2015-12-03 13:19:35 UTC
adding needinfo on reporter to provide logs

Comment 9 Leonid Natapov 2015-12-03 14:22:26 UTC
Created attachment 1101814 [details]
crm report

Comment 10 Michele Baldessari 2015-12-03 19:48:49 UTC
overcloud-controller-2 was rebooted (by hand, since stonith is set to false?) at:
Dec  3 09:00:33 overcloud-controller-2 su: (to root) heat-admin on pts/0
Dec  3 09:00:35 overcloud-controller-2 systemd: Stopped target Timers.
Dec  3 09:00:35 overcloud-controller-2 systemd: Stopping Timers.
Dec  3 09:00:35 overcloud-controller-2 systemd: Stopping Session c3 of user redis.
Dec  3 09:00:35 overcloud-controller-2 systemd: Stopping Authorization Manager...
....
Dec  3 09:01:27 overcloud-controller-2 redis(redis)[34514]: ERROR: demote: Failed to demote, redis not running.
Dec  3 09:01:27 overcloud-controller-2 crmd[3425]:  notice: Operation redis_demote_0: not running (node=overcloud-controller-2, call=353, rc=7, cib-update=178, confirmed
Dec  3 09:01:27 overcloud-controller-2 crmd[3425]:  notice: Operation redis_notify_0: ok (node=overcloud-controller-2, call=358, rc=0, cib-update=0, confirmed=true)
....
Dec  3 09:04:09 overcloud-controller-2 kernel: Linux version 3.10.0-327
...

Redis *seems* to fail to start according to the pacemaker resource agent:
Dec  3 09:04:57 overcloud-controller-2 redis(redis)[4350]: INFO: start: /usr/bin/redis-server --daemonize yes --unixsocket '/var/run/redis/redis.sock' --pidfile '/var/run/redis/redis-server.pid'
Dec  3 09:04:57 overcloud-controller-2 redis(redis)[4350]: ERROR: start: Unknown error starting redis. redis-server output= redis-check-dump output===== Processed 7 valid opcodes (in 502 bytes) ==================================; CRC64 checksum is OK
Dec  3 09:04:57 overcloud-controller-2 lrmd[3376]:  notice: redis_start_0:4350:stderr [ Could not connect to Redis at /var/run/redis/redis.sock: No such file or directory ]
Dec  3 09:04:57 overcloud-controller-2 crmd[3379]:  notice: Operation redis_start_0: not running (node=overcloud-controller-2, call=204, rc=7, cib-update=53, confirmed=true)
Dec  3 09:04:57 overcloud-controller-2 crmd[3379]:  notice: overcloud-controller-2-redis_start_0:204 [ Could not connect to Redis at /var/run/redis/redis.sock: No such file or directory\n ]
Dec  3 09:04:57 overcloud-controller-2 galera(galera)[4258]: INFO: Master instances are already up, setting master score so this instance will join galera cluster.
Dec  3 09:04:57 overcloud-controller-2 rabbitmq-cluster(rabbitmq)[4259]: INFO: RabbitMQ server is not running
Dec  3 09:04:57 overcloud-controller-2 crmd[3379]:  notice: Operation redis_notify_0: ok (node=overcloud-controller-2, call=205, rc=0, cib-update=0, confirmed=true)
Dec  3 09:04:57 overcloud-controller-2 crmd[3379]:  notice: Operation galera_start_0: ok (node=overcloud-controller-2, call=198, rc=0, cib-update=54, confirmed=true)


At this point controller-1 tries to force the redis master role away from controller-2:
journal.log:Dec 03 09:05:11 overcloud-controller-1.localdomain crmd[25368]:   notice: Initiating action 1: stop redis_stop_0 on overcloud-controller-2
journal.log:Dec 03 09:05:11 overcloud-controller-1.localdomain crmd[25368]:   notice: Initiating action 489: notify redis_post_notify_stop_0 on overcloud-controller-0
journal.log:Dec 03 09:05:11 overcloud-controller-1.localdomain crmd[25368]:   notice: Initiating action 491: notify redis_post_notify_stop_0 on overcloud-controller-1 (local)
journal.log:Dec 03 09:05:11 overcloud-controller-1.localdomain crmd[25368]:   notice: Operation redis_notify_0: ok (node=overcloud-controller-1, call=1148, rc=0, cib-update=0, confirmed=true)
journal.log:Dec 03 09:05:20 overcloud-controller-1.localdomain haproxy[100006]: Server redis/overcloud-controller-2 is DOWN, reason: Layer4 connection problem, info: "Connection refused at step 1 of tcp-check (send)", check duration: 0ms. 1 active and 0 backup servers left. 0 sessions active, 0 requeued, 0 remaining in queue.
journal.log:Dec 03 09:05:27 overcloud-controller-1.localdomain pengine[25367]:  warning: Processing failed op start for redis:0 on overcloud-controller-2: not running (7)
journal.log:Dec 03 09:05:27 overcloud-controller-1.localdomain pengine[25367]:  warning: Forcing redis-master away from overcloud-controller-2 after 1000000 failures (max=1000000)
journal.log:Dec 03 09:05:27 overcloud-controller-1.localdomain pengine[25367]:  warning: Forcing redis-master away from overcloud-controller-2 after 1000000 failures (max=1000000)
journal.log:Dec 03 09:05:27 overcloud-controller-1.localdomain pengine[25367]:  warning: Forcing redis-master away from overcloud-controller-2 after 1000000 failures (max=1000000)
journal.log:Dec 03 09:05:27 overcloud-controller-1.localdomain pengine[25367]:    error: Internal error: No action found for demote in redis:1 (then)
journal.log:Dec 03 09:05:27 overcloud-controller-1.localdomain pengine[25367]:    error: Internal error: No action found for demote in redis:1 (then)

The status at the time of the report seems consistent with controller-2 being unable to start redis:
Master/Slave Set: redis-master [redis]
	Masters: [ overcloud-controller-0 ]
	Slaves: [ overcloud-controller-1 ]

What is odd is that according to redis.log it did actually start:
[4516] 03 Dec 09:04:57.253 # You requested maxclients of 10000 requiring at least 10032 max file descriptors.
[4516] 03 Dec 09:04:57.272 # Redis can't set maximum open files to 10032 because of OS error: Operation not permitted.
[4516] 03 Dec 09:04:57.272 # Current maximum open files is 4096. maxclients has been reduced to 4064 to compensate for low ulimit. If you need higher maxclients increase 'ulimit -n'.
[4516] 03 Dec 09:04:57.273 # Server started, Redis version 2.8.21
[4516] 03 Dec 09:04:57.273 # WARNING overcommit_memory is set to 0! Background save may fail under low memory condition. To fix this issue add 'vm.overcommit_memory = 1' to /etc/sysctl.conf and then reboot or run the command 'sysctl vm.overcommit_memory=1' for this to take effect.
[4516] 03 Dec 09:04:57.273 # WARNING you have Transparent Huge Pages (THP) support enabled in your kernel. This will create latency and memory usage issues with Redis. To fix this issue run the command 'echo never > /sys/kernel/mm/transparent_hugepage/enabled' as root, and add it to your /etc/rc.local in order to retain the setting after a reboot. Redis must be restarted after THP is disabled.
[4516] 03 Dec 09:04:57.273 # WARNING: The TCP backlog setting of 511 cannot be enforced because /proc/sys/net/core/somaxconn is set to the lower value of 128.
[4516] 03 Dec 09:04:57.278 * DB loaded from disk: 0.005 seconds
[4516] 03 Dec 09:04:57.278 * The server is now ready to accept connections on port 6379
[4516] 03 Dec 09:04:57.278 * The server is now ready to accept connections at /var/run/redis/redis.sock
[4516 | signal handler] (1449151511) Received SIGTERM scheduling shutdown...
[4516] 03 Dec 09:05:11.590 # User requested shutdown...

I wonder if the logic here is at fault:
https://github.com/ClusterLabs/resource-agents/blob/master/heartbeat/redis#L341

I also tried to start it by hand and it worked okay:
su redis -s /bin/sh -c "cd /var/run/redis; exec /usr/bin/redis-server '/etc/redis.conf' --daemonize yes --unixsocket '/var/run/redis/redis.sock' --pidfile '/var/run/redis/redis-server.pid'"
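
A minimal shell sketch of the suspected race (illustrative only; this is not the actual ocf:heartbeat:redis agent code, and the paths and options simply follow the commands quoted above):

REDIS_SERVER=/usr/bin/redis-server
REDIS_CLIENT=/usr/bin/redis-cli
REDIS_SOCKET=/var/run/redis/redis.sock

# redis-server forks, and the parent returns before the child has
# created the unix socket and pid file ...
"$REDIS_SERVER" /etc/redis.conf --daemonize yes \
    --unixsocket "$REDIS_SOCKET" \
    --pidfile /var/run/redis/redis-server.pid

# ... so an immediate client call can fail with
# "Could not connect to Redis at /var/run/redis/redis.sock: No such file or directory"
# even though the server does come up a moment later.
"$REDIS_CLIENT" -s "$REDIS_SOCKET" ping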


Andrew, any thoughts here?

Comment 12 Andrew Beekhof 2015-12-04 03:13:02 UTC
As far as I can see, it comes down to:

redis_start_0:4350:stderr [ Could not connect to Redis at /var/run/redis/redis.sock: No such file or directory ]

which comes from the redis_client() function calling:

   "$REDIS_CLIENT" -s "$REDIS_SOCKET" "$@" | sed 's/\r//'

Elsewhere in the agent is:

    # unknown error, but the process still exists.
    # This check is mainly because redis daemonizes before it starts listening, causing `redis-cli` to fail
    #   See https://github.com/antirez/redis/issues/2368
    # It's possible that the `pidof` will pick up a different redis, but in that case, the start operation will just time out

So, in combination with the successful restart at 8:50, I'd say this looks like a timing issue; maybe redis_client() needs a retry loop, along the lines of the sketch below?
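
One possible shape for such a loop, as a rough sketch only (redis_client_retry is a hypothetical name, not something in the shipped agent; REDIS_CLIENT and REDIS_SOCKET are the variables quoted above):

# Hypothetical retry wrapper around the call quoted from redis_client().
redis_client_retry() {
    local tries=10 out
    while [ "$tries" -gt 0 ]; do
        # Only attempt the call once the socket exists, and retry on failure.
        if [ -S "$REDIS_SOCKET" ] && out=$("$REDIS_CLIENT" -s "$REDIS_SOCKET" "$@" 2>/dev/null); then
            printf '%s\n' "$out" | sed 's/\r//'
            return 0
        fi
        tries=$((tries - 1))
        sleep 1
    done
    return 1
}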

Comment 13 Lon Hohberger 2015-12-04 16:42:59 UTC
Assuming Andrew is right, I've created a patch to redis to make it wait for the child to drop its pid file and create its socket.

This should be considered a temporary fix until upstream has a clean implementation, but should not require any other workarounds in e.g. resource-agents or pacemaker.
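
The actual change is in redis-server's daemonize path, but its effect can be approximated in shell: after launching the daemon, control is not handed back until the child has written its pid file and created its socket. A rough illustration, assuming the paths used elsewhere in this bug:

PIDFILE=/var/run/redis/redis-server.pid
SOCKET=/var/run/redis/redis.sock

/usr/bin/redis-server /etc/redis.conf --daemonize yes \
    --unixsocket "$SOCKET" --pidfile "$PIDFILE"

# Wait up to ~5 seconds for the child to signal readiness by dropping
# its pid file and creating the unix socket before declaring success.
i=0
while [ "$i" -lt 50 ]; do
    [ -f "$PIDFILE" ] && [ -S "$SOCKET" ] && break
    sleep 0.1
    i=$((i + 1))
done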

Comment 14 Lon Hohberger 2015-12-04 16:44:19 UTC
Created attachment 1102325 [details]
Patch

Comment 16 Leonid Natapov 2015-12-08 14:02:58 UTC
Doesn't reproduce on the latest build of OSPd 7.2:
redis-2.8.21-2.el7ost.x86_64

Comment 18 errata-xmlrpc 2015-12-21 17:04:45 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2015:2676

