Bug 1305549
Summary: Redis resource-agent should notify clients of master being demoted

Product: Red Hat Enterprise Linux 7
Component: resource-agents
Version: 7.4
Status: CLOSED ERRATA
Severity: high
Priority: high
Reporter: Eduard Barrera <ebarrera>
Assignee: Oyvind Albrigtsen <oalbrigt>
QA Contact: Asaf Hirshberg <ahirshbe>
CC: agk, apevec, cchen, cfeist, cluster-maint, dbecker, ebarrera, fdinitto, jcoufal, jherrman, k-akatsuka, lhh, mabaakou, mburns, mkrcmari, mnovacek, morazi, oblaut, rhel-osp-director-maint, royoung, yprokule
Target Milestone: rc
Target Release: 7.4
Keywords: ZStream
Hardware: Unspecified
OS: Unspecified
Fixed In Version: resource-agents-3.9.5-87.el7
Doc Type: Bug Fix
Doc Text: Prior to this update, the redis resource agent did not notify clients when a master node was demoted to slave mode. In some cases, this caused temporary unavailability of data on Red Hat OpenStack Platform. This update fixes the underlying code, and the described problem no longer occurs.
Clones: 1400103, 1414967 (view as bug list)
Last Closed: 2017-08-01 14:55:11 UTC
Type: Bug
Bug Blocks: 1369066, 1400103, 1414967
Description
Eduard Barrera
2016-02-08 14:54:24 UTC
*** Bug 1305553 has been marked as a duplicate of this bug. ***

*** Bug 1305550 has been marked as a duplicate of this bug. ***

This bug did not make the OSP 8.0 release. It is being deferred to OSP 10.

It's not clear in this bugzilla whether we are talking about redis or mongodb; there seems to be a mismatch. Can you please clarify?

This sounds to me like an RFE for the Tooz client, which needs to reconnect when a master is demoted to slave. The demotion can also happen with or without pacemaker, and the problem would persist. As we move away from pacemaker (redis is in the pipeline to be removed), clients will need to deal with those conditions by themselves. Moving to Telemetry and reassigning to Tooz or ceilometer.

I don't think this will be accepted upstream in tooz. And this is not only a tooz issue, but also oslo.cache, dogpile, <put any redis based lib>.

The demotion is already handled by the python redis client lib. The Redis client expects the server to send a "CLIENT KILL type normal" message on demotion.

This issue occurs because OSP uses pacemaker instead of redis-sentinel, and the current pacemaker implementation doesn't send this message like redis-sentinel does.

I suggest that the pacemaker script send 'CLIENT KILL type normal' when a redis node becomes a slave; that will fix all HA disconnection issues around redis, not only for tooz users.

Also, upstream recommends using redis-sentinel instead of pacemaker for the redis HA scenario, but that's another thing :p.

(In reply to Mehdi ABAAKOUK from comment #10)
> I don't think this will be accepted upstream in tooz. And this is not only a
> tooz issue, but also oslo.cache, dogpile, <put any redis based lib>.
>
> The demotion is already handled by the python redis client lib. Redis
> client expects the server to send "CLIENT KILL type normal" message on
> demotion.
>
> This issue occurs because the OSP uses pacemaker instead of redis-sentinel.
> And the current pacemaker implementation doesn't send this message like
> redis-sentinel does.

pacemaker doesn't interfere with the application at all. pacemaker only starts / promotes / demotes / stops applications. It has no idea about application internals.

> I suggest that the pacemaker script send 'CLIENT KILL type normal' when a
> redis node becomes a slave; that will fix all HA disconnection issues around
> redis, not only for tooz users.

And is there a cli command to send that CLIENT KILL message? The scripts and pacemaker don't act as a MITM and hijack sockets or protocols. If there is one, we can fix it in the resource-agent easily.

> Also, upstream recommends using redis-sentinel instead of pacemaker for the
> redis HA scenario, but that's another thing :p.

This problem will go away, since the plan is to drop redis from pacemaker and let redis do its work internally. Originally it was necessary to sync the two because multiple cluster brains had to decide in case of network partitions, but that's not an issue any longer.

(In reply to Fabio Massimo Di Nitto from comment #11)
> pacemaker doesn't interfere with the application at all. pacemaker only
> starts / promotes / demotes / stops applications. It has no idea about
> application internals.

There is no need to do that; it can be done externally with the redis-cli command.

> And is there a cli command to send that CLIENT KILL message? The scripts and
> pacemaker don't act as a MITM and hijack sockets or protocols. If there is
> one, we can fix it in the resource-agent easily.

Yeah, you can send it with the redis-cli tool in the resource-agent: just run "redis-cli CLIENT KILL type normal" on the slave node, or remotely with "redis-cli -h <slave-node-ip/hostname> CLIENT KILL type normal".

> This problem will go away, since the plan is to drop redis from pacemaker
> and let redis do its work internally.
>
> Originally it was necessary to sync the two because multiple cluster brains
> had to decide in case of network partitions, but that's not an issue any
> longer.
Good to hear :)

Ok, perfect, reassigning to resource-agents. The logic for the fix is in comment #12 and should be executed during demotion of a master.

*** Bug 1395021 has been marked as a duplicate of this bug. ***

*** Bug 1398324 has been marked as a duplicate of this bug. ***

I performed some heavy testing of this change, and a couple of issues came up which prevented the expected behaviour:

- The patch is not complete: when the client connections are killed at an early stage of node demotion, clients try to reconnect, but no master has been elected yet. The redis recommendation is to PAUSE clients first, to make sure the slaves have processed the latest replication stream from the master, and only then let clients reconnect. Mehdi created a patch which I tested and which helped to get better results.

- The reason why, even with Mehdi's patch applied, I could not observe the expected behaviour was that the redis node demotion, the promotion of another node to master, and the client kill and subsequent reconnection all happened before haproxy gave up on the "dead" (demoted) redis node. Everything started to work as expected when I lowered the health check to 3 tries with a 1s interval instead of the current 5 checks with a 2s interval. Otherwise, redis clients kept connecting to the slave node instead of the new master, since haproxy kept redirecting them to the demoted node.

Tested patch: https://github.com/ClusterLabs/resource-agents/pull/890

Verified.
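The demote-time sequence suggested in comment #12, refined by the testing above (pause clients first, then disconnect them), can be sketched roughly as follows. This is an illustrative outline, not the code from the tested pull request: the `disconnect_clients` function, the `REDIS_CLI` and `DRY_RUN` variables, the example host address, and the 2000 ms pause value are all hypothetical placeholders.

```shell
# Hypothetical sketch of the demote-time client handling discussed above.
# Names and values here are illustrative, not the resource-agent's actual code.
REDIS_CLI="${REDIS_CLI:-redis-cli}"
DRY_RUN="${DRY_RUN:-1}"   # when 1, print commands instead of running them

run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "would run: $*"
    else
        "$@"
    fi
}

disconnect_clients() {
    host="$1"
    # Pause regular clients first, giving the slaves a chance to process the
    # latest replication stream from the master (2000 ms is a placeholder).
    run "$REDIS_CLI" -h "$host" CLIENT PAUSE 2000
    # Then kill the normal (non-replication, non-pubsub) connections so the
    # clients reconnect through the load balancer and reach the new master.
    run "$REDIS_CLI" -h "$host" CLIENT KILL TYPE normal
}

disconnect_clients 192.0.2.10
```

On the haproxy side, the health-check tuning described above (3 tries with a 1s interval) would correspond to something like `check inter 1s fall 3` on the redis backend servers; again, this is an illustrative fragment, not the shipped configuration.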
resource-agents-3.9.5-86.el7

+ /cib: @epoch=38, @num_updates=0
Jan 29 06:56:31 [753790] overcloud-controller-2.localdomain cib: info: cib_perform_op: + /cib/configuration/constraints/rsc_location[@id='cli-prefer-redis-master']: @node=overcloud-controller-0
Jan 29 06:56:31 [753790] overcloud-controller-2.localdomain cib: info: cib_process_request: Completed cib_modify operation for section constraints: OK (rc=0, origin=overcloud-controller-2/crm_resource/4, version=0.38.0)
Jan 29 06:56:31 [753790] overcloud-controller-2.localdomain cib: info: cib_file_backup: Archived previous version as /var/lib/pacemaker/cib/cib-40.raw
Jan 29 06:56:31 [753790] overcloud-controller-2.localdomain cib: info: cib_file_write_with_digest: Wrote version 0.38.0 of the CIB to disk (digest: b932dac9b3d8ec4cda827b5fb56c661e)
Jan 29 06:56:31 [753790] overcloud-controller-2.localdomain cib: info: cib_file_write_with_digest: Reading cluster configuration file /var/lib/pacemaker/cib/cib.rZZ8jQ (digest: /var/lib/pacemaker/cib/cib.lu8w9X)
Jan 29 06:56:31 [753792] overcloud-controller-2.localdomain lrmd: info: cancel_recurring_action: Cancelling ocf operation redis_monitor_20000
Jan 29 06:56:31 [753795] overcloud-controller-2.localdomain crmd: info: process_lrm_event: Result of monitor operation for redis on overcloud-controller-2: Cancelled | call=72 key=redis_monitor_20000 confirmed=true
Jan 29 06:56:31 [753790] overcloud-controller-2.localdomain cib: info: cib_process_request: Forwarding cib_delete operation for section status to all (origin=local/crmd/41)
Jan 29 06:56:31 [753795] overcloud-controller-2.localdomain crmd: info: do_lrm_rsc_op: Performing key=135:6:0:0377ce4a-b51e-4ea3-a02e-781e8b51e69f op=redis_notify_0
Jan 29 06:56:31 [753792] overcloud-controller-2.localdomain lrmd: info: log_execute: executing - rsc:redis action:notify call_id:74
Jan 29 06:56:31 [753790] overcloud-controller-2.localdomain cib: info: cib_perform_op: Diff: --- 0.38.0 2
Jan 29 06:56:31 [753790] overcloud-controller-2.localdomain cib: info: cib_perform_op: Diff: +++ 0.38.1 (null)
Jan 29 06:56:31 [753790] overcloud-controller-2.localdomain cib: info: cib_perform_op: -- /cib/status/node_state[@id='3']/lrm[@id='3']/lrm_resources/lrm_resource[@id='redis']/lrm_rsc_op[@id='redis_monitor_20000']
Jan 29 06:56:31 [753790] overcloud-controller-2.localdomain cib: info: cib_perform_op: + /cib: @num_updates=1
Jan 29 06:56:31 [753790] overcloud-controller-2.localdomain cib: info: cib_process_request: Completed cib_delete operation for section status: OK (rc=0, origin=overcloud-controller-2/crmd/41, version=0.38.1)
Jan 29 06:56:31 [753790] overcloud-controller-2.localdomain cib: info: cib_perform_op: Diff: --- 0.38.1 2
Jan 29 06:56:31 [753790] overcloud-controller-2.localdomain cib: info: cib_perform_op: Diff: +++ 0.38.2 (null)
Jan 29 06:56:31 [753790] overcloud-controller-2.localdomain cib: info: cib_perform_op: -- /cib/status/node_state[@id='1']/lrm[@id='1']/lrm_resources/lrm_resource[@id='redis']/lrm_rsc_op[@id='redis_monitor_45000']
Jan 29 06:56:31 [753790] overcloud-controller-2.localdomain cib: info: cib_perform_op: + /cib: @num_updates=2
Jan 29 06:56:31 [753790] overcloud-controller-2.localdomain cib: info: cib_process_request: Completed cib_delete operation for section status: OK (rc=0, origin=overcloud-controller-0/crmd/44, version=0.38.2)
Jan 29 06:56:31 [753790] overcloud-controller-2.localdomain cib: info: cib_perform_op: Diff: --- 0.38.2 2
Jan 29 06:56:31 [753790] overcloud-controller-2.localdomain cib: info: cib_perform_op: Diff: +++ 0.38.3 (null)
Jan 29 06:56:31 [753790] overcloud-controller-2.localdomain cib: info: cib_perform_op: -- /cib/status/node_state[@id='1']/lrm[@id='1']/lrm_resources/lrm_resource[@id='redis']/lrm_rsc_op[@id='redis_monitor_60000']
Jan 29 06:56:31 [753790] overcloud-controller-2.localdomain cib: info: cib_perform_op: + /cib: @num_updates=3
Jan 29 06:56:31 [753790] overcloud-controller-2.localdomain cib: info: cib_process_request: Completed cib_delete operation for section status: OK (rc=0, origin=overcloud-controller-0/crmd/45, version=0.38.3)
Jan 29 06:56:31 [753792] overcloud-controller-2.localdomain lrmd: info: log_finished: finished - rsc:redis action:notify call_id:74 pid:796601 exit-code:0 exec-time:65ms queue-time:0ms
Jan 29 06:56:31 [753795] overcloud-controller-2.localdomain crmd: notice: process_lrm_event: Result of notify operation for redis on overcloud-controller-2: 0 (ok) | call=74 key=redis_notify_0 confirmed=true cib-update=0
Jan 29 06:56:31 [753795] overcloud-controller-2.localdomain crmd: info: do_lrm_rsc_op: Performing key=78:6:0:0377ce4a-b51e-4ea3-a02e-781e8b51e69f op=redis_demote_0
Jan 29 06:56:31 [753792] overcloud-controller-2.localdomain lrmd: info: log_execute: executing - rsc:redis action:demote call_id:75
redis(redis)[796646]: 2017/01/29_06:56:31 INFO: demote: Setting master to 'no-such-master'
Jan 29 06:56:33 [753792] overcloud-controller-2.localdomain lrmd: info: log_finished: finished - rsc:redis action:demote call_id:75 pid:796646 exit-code:0 exec-time:2318ms queue-time:0ms
Jan 29 06:56:33 [753790] overcloud-controller-2.localdomain cib: info: cib_process_request: Forwarding cib_modify operation for section status to all (origin=local/crmd/42)
Jan 29 06:56:33 [753795] overcloud-controller-2.localdomain crmd: notice: process_lrm_event: Result of demote operation for redis on overcloud-controller-2: 0 (ok) | call=75 key=redis_demote_0 confirmed=true cib-update=42
Jan 29 06:56:33 [753790] overcloud-controller-2.localdomain cib: info: cib_perform_op: Diff: --- 0.38.3 2
Jan 29 06:56:33 [753790] overcloud-controller-2.localdomain cib: info: cib_perform_op: Diff: +++ 0.38.4 (null)
Jan 29 06:56:33 [753790] overcloud-controller-2.localdomain cib: info: cib_perform_op: + /cib: @num_updates=4
Jan 29 06:56:33 [753790] overcloud-controller-2.localdomain cib: info: cib_perform_op: + /cib/status/node_state[@id='3']/lrm[@id='3']/lrm_resources/lrm_resource[@id='redis']/lrm_rsc_op[@id='redis_last_0']: @operation_key=redis_demote_0, @operation=demote, @transition-key=78:6:0:0377ce4a-b51e-4ea3-a02e-781e8b51e69f, @transition-magic=0:0;78:6:0:0377ce4a-b51e-4ea3-a02e-781e8b51e69f, @call-id=75, @last-run=1485672991, @last-rc-change=1485672991, @exec-time=2318
Jan 29 06:56:33 [753790] overcloud-controller-2.localdomain cib: info: cib_process_request: Completed cib_modify operation for section status: OK (rc=0, origin=overcloud-controller-2/crmd/42, version=0.38.4)
Jan 29 06:56:33 [753795] overcloud-controller-2.localdomain crmd: info: do_lrm_rsc_op: Performing key=136:6:0:0377ce4a-b51e-4ea3-a02e-781e8b51e69f op=redis_notify_0
Jan 29 06:56:33 [753792] overcloud-controller-2.localdomain lrmd: info: log_execute: executing - rsc:redis action:notify call_id:76
redis(redis)[796811]: 2017/01/29_06:56:34 INFO: monitor: Slave mode link has not yet been established (link=down)
redis(redis)[796811]: 2017/01/29_06:56:34 INFO: demote: Setting master to 'no-such-master'
Jan 29 06:56:36 [753792] overcloud-controller-2.localdomain lrmd: info: log_finished: finished - rsc:redis action:notify call_id:76 pid:796811 exit-code:0 exec-time:2404ms queue-time:0ms
Jan 29 06:56:36 [753795] overcloud-controller-2.localdomain crmd: notice: process_lrm_event: Result of notify operation for redis on overcloud-controller-2: 0 (ok) | call=76 key=redis_notify_0 confirmed=true cib-update=0
Jan 29 06:56:36 [753795] overcloud-controller-2.localdomain crmd: info: do_lrm_rsc_op: Performing key=129:6:0:0377ce4a-b51e-4ea3-a02e-781e8b51e69f op=redis_notify_0

*** Bug 1425835 has been marked as a duplicate of this bug. ***

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below.
If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2017:1844