Bug 1283240
Summary: Redis can't demote previous redis master after rebooting the node.

Product: Red Hat OpenStack
Reporter: Leonid Natapov <lnatapov>
Component: redis
Assignee: Lon Hohberger <lhh>
Status: CLOSED ERRATA
QA Contact: Leonid Natapov <lnatapov>
Severity: high
Priority: urgent
Version: 7.0 (Kilo)
CC: abeekhof, ahirshbe, apevec, calfonso, dnavale, jcoufal, lhh, lnatapov, mburns, michele, oblaut, rhel-osp-director-maint, sasha, vcojot, yeylon
Target Milestone: z3
Keywords: Regression, ZStream
Target Release: 7.0 (Kilo)
Hardware: Unspecified
OS: Unspecified
Fixed In Version: redis-2.8.21-2.el7ost
Doc Type: Bug Fix
Doc Text:
Previously, the redis-server parent process did not wait until the server was ready before exiting after daemonizing. As a result, the redis resource agent erroneously reported failures in some cases, causing Pacemaker to recover the redis resource unnecessarily.
With this update, the redis-server parent process waits for a signal from the child process before exiting after daemonizing, so Pacemaker no longer recovers the redis resource unnecessarily.
Last Closed: 2015-12-21 17:04:45 UTC
Type: Bug
Description
Leonid Natapov
2015-11-18 13:51:13 UTC
Got the same error on my setup too.
Setup: 3 controller, 3 compute (bm)

1) pcs cluster stop --all
2) pcs cluster start --all

Master/Slave Set: redis-master [redis]
    Masters: [ overcloud-controller-0 ]
    Stopped: [ overcloud-compute-0 overcloud-compute-1 overcloud-compute-2 overcloud-controller-1 overcloud-controller-2 ]

Nov 22 08:15:26 overcloud-controller-0 crmd[21904]: warning: Action 36 (redis:0_monitor_0) on overcloud-controller-0 failed (target: 7 vs. rc: 0): Error
Nov 22 08:15:26 overcloud-controller-0 crmd[21904]: warning: Action 36 (redis:0_monitor_0) on overcloud-controller-0 failed (target: 7 vs. rc: 0): Error
Nov 22 08:17:35 overcloud-controller-0 crmd[21904]: warning: Action 69 (redis_start_0) on overcloud-controller-1 failed (target: 0 vs. rc: 7): Error
Nov 22 08:17:35 overcloud-controller-0 crmd[21904]: warning: Action 69 (redis_start_0) on overcloud-controller-1 failed (target: 0 vs. rc: 7): Error
Nov 22 08:19:53 overcloud-controller-0 ceilometer-polling: ImportError: No module named redis
Nov 22 08:21:52 overcloud-controller-0 crmd[21904]: warning: Action 101 (redis_start_0) on overcloud-controller-1 failed (target: 0 vs. rc: 1): Error
Nov 22 08:21:52 overcloud-controller-0 crmd[21904]: warning: Action 101 (redis_start_0) on overcloud-controller-1 failed (target: 0 vs. rc: 1): Error
Nov 22 08:21:52 overcloud-controller-0 pengine[21903]: warning: Processing failed op start for redis:1 on overcloud-controller-1: unknown error (1)
Nov 22 08:21:52 overcloud-controller-0 pengine[21903]: warning: Processing failed op start for redis:1 on overcloud-controller-1: unknown error (1)

(In reply to Asaf Hirshberg from comment #1)
> Got the same error on my setup too.
> Setup: 3 controller, 3 compute (bm)
>
> 1) pcs cluster stop --all
> 2) pcs cluster start --all
>
> Master/Slave Set: redis-master [redis]
>     Masters: [ overcloud-controller-0 ]
>     Stopped: [ overcloud-compute-0 overcloud-compute-1 overcloud-compute-2 overcloud-controller-1 overcloud-controller-2 ]
>
> Nov 22 08:15:26 overcloud-controller-0 crmd[21904]: warning: Action 36 (redis:0_monitor_0) on overcloud-controller-0 failed (target: 7 vs. rc: 0): Error
> Nov 22 08:15:26 overcloud-controller-0 crmd[21904]: warning: Action 36 (redis:0_monitor_0) on overcloud-controller-0 failed (target: 7 vs. rc: 0): Error

Is anyone running "pcs resource cleanup [...]"? If not, this suggests that either the redis agent's monitor action is incorrectly reporting redis as up when it isn't, or the stop action is not actually stopping redis. Both are particularly bad.

This seems like a Pacemaker-related issue rather than something specific to tripleo-heat-templates; can anybody confirm?

I've tried to run pcs resource cleanup, yes.
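For anyone reproducing this state, the per-node failure history that keeps Pacemaker from retrying can be inspected and cleared with pcs. A minimal sketch, using the resource names from the status output above and assuming the failcount subcommand is available in the installed pcs version:

    # Show how many failures Pacemaker has recorded for the redis resource on each node.
    pcs resource failcount show redis

    # Clear the failure history so Pacemaker will attempt to start/promote redis again.
    pcs resource cleanup redis-master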
Andrew, can you give us more guidance on what to fix on our side? We don't have folks familiar enough with the specifics and will need your assistance.

That assessment was based on people /not/ having used pcs cleanup. Since they were, I looked back further. The first error you pasted was:

Nov 18 05:41:47 localhost crmd[28270]: warning: Action 164 (redis_start_0) on overcloud-controller-0 failed (target: 0 vs. rc: 1): Error

Looking at the redis logs from about that time I see:

[15037] 18 Nov 05:39:47.696 * The server is now ready to accept connections on port 6379
[15037] 18 Nov 05:39:47.696 * The server is now ready to accept connections at /var/run/redis/redis.sock
[15037] 18 Nov 05:39:47.964 * SLAVE OF overcloud-controller-2:6379 enabled (user request)

followed by these two messages over and over:

[15037] 18 Nov 05:39:48.698 * Connecting to MASTER overcloud-controller-2:6379
[15037] 18 Nov 05:39:48.699 # Unable to connect to MASTER: Permission denied

Is the current master really on that host? Under what conditions would you get "Permission denied" from redis?

Can you get me a crm_report from these nodes, starting from when the master was rebooted?

Adding needinfo on reporter to provide logs.

Created attachment 1101814 [details]
crm report
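For reference, a report like the one attached can usually be generated on one of the controllers. This is a sketch only: the time window and destination name are illustrative, and crm_report's -f/-t time-window options are assumed to be present in this Pacemaker build:

    # Collect cluster logs, CIB history and policy-engine inputs from all nodes for the given window.
    # Times are illustrative; start shortly before the master node was rebooted.
    crm_report -f "2015-12-03 08:55:00" -t "2015-12-03 09:30:00" redis-demote-failure
    # Produces redis-demote-failure.tar.bz2, suitable for attaching to the bug.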
overcloud-controller-2 has been rebooted (by hand, as stonith is set to false?) at:

Dec 3 09:00:33 overcloud-controller-2 su: (to root) heat-admin on pts/0
Dec 3 09:00:35 overcloud-controller-2 systemd: Stopped target Timers.
Dec 3 09:00:35 overcloud-controller-2 systemd: Stopping Timers.
Dec 3 09:00:35 overcloud-controller-2 systemd: Stopping Session c3 of user redis.
Dec 3 09:00:35 overcloud-controller-2 systemd: Stopping Authorization Manager...
....
Dec 3 09:01:27 overcloud-controller-2 redis(redis)[34514]: ERROR: demote: Failed to demote, redis not running.
Dec 3 09:01:27 overcloud-controller-2 crmd[3425]: notice: Operation redis_demote_0: not running (node=overcloud-controller-2, call=353, rc=7, cib-update=178, confirmed
Dec 3 09:01:27 overcloud-controller-2 crmd[3425]: notice: Operation redis_notify_0: ok (node=overcloud-controller-2, call=358, rc=0, cib-update=0, confirmed=true)
....
Dec 3 09:04:09 overcloud-controller-2 kernel: Linux version 3.10.0-327 ...

Redis *seems* to fail to start according to pacemaker/the resource agent:

Dec 3 09:04:57 overcloud-controller-2 redis(redis)[4350]: INFO: start: /usr/bin/redis-server --daemonize yes --unixsocket '/var/run/redis/redis.sock' --pidfile '/var/run/redis/redis-server.pid'
Dec 3 09:04:57 overcloud-controller-2 redis(redis)[4350]: ERROR: start: Unknown error starting redis. redis-server output= redis-check-dump output===== Processed 7 valid opcodes (in 502 bytes) ==================================; CRC64 checksum is OK
Dec 3 09:04:57 overcloud-controller-2 lrmd[3376]: notice: redis_start_0:4350:stderr [ Could not connect to Redis at /var/run/redis/redis.sock: No such file or directory ]
Dec 3 09:04:57 overcloud-controller-2 crmd[3379]: notice: Operation redis_start_0: not running (node=overcloud-controller-2, call=204, rc=7, cib-update=53, confirmed=true)
Dec 3 09:04:57 overcloud-controller-2 crmd[3379]: notice: overcloud-controller-2-redis_start_0:204 [ Could not connect to Redis at /var/run/redis/redis.sock: No such file or directory\n ]
Dec 3 09:04:57 overcloud-controller-2 galera(galera)[4258]: INFO: Master instances are already up, setting master score so this instance will join galera cluster.
Dec 3 09:04:57 overcloud-controller-2 rabbitmq-cluster(rabbitmq)[4259]: INFO: RabbitMQ server is not running
Dec 3 09:04:57 overcloud-controller-2 crmd[3379]: notice: Operation redis_notify_0: ok (node=overcloud-controller-2, call=205, rc=0, cib-update=0, confirmed=true)
Dec 3 09:04:57 overcloud-controller-2 crmd[3379]: notice: Operation galera_start_0: ok (node=overcloud-controller-2, call=198, rc=0, cib-update=54, confirmed=true)

At this point controller-1 tries to force the redis master role away from controller-2:

journal.log:Dec 03 09:05:11 overcloud-controller-1.localdomain crmd[25368]: notice: Initiating action 1: stop redis_stop_0 on overcloud-controller-2
journal.log:Dec 03 09:05:11 overcloud-controller-1.localdomain crmd[25368]: notice: Initiating action 489: notify redis_post_notify_stop_0 on overcloud-controller-0
journal.log:Dec 03 09:05:11 overcloud-controller-1.localdomain crmd[25368]: notice: Initiating action 491: notify redis_post_notify_stop_0 on overcloud-controller-1 (local)
journal.log:Dec 03 09:05:11 overcloud-controller-1.localdomain crmd[25368]: notice: Operation redis_notify_0: ok (node=overcloud-controller-1, call=1148, rc=0, cib-update=0, confirmed=true)
journal.log:Dec 03 09:05:20 overcloud-controller-1.localdomain haproxy[100006]: Server redis/overcloud-controller-2 is DOWN, reason: Layer4 connection problem, info: "Connection refused at step 1 of tcp-check (send)", check duration: 0ms. 1 active and 0 backup servers left. 0 sessions active, 0 requeued, 0 remaining in queue.
journal.log:Dec 03 09:05:27 overcloud-controller-1.localdomain pengine[25367]: warning: Processing failed op start for redis:0 on overcloud-controller-2: not running (7)
journal.log:Dec 03 09:05:27 overcloud-controller-1.localdomain pengine[25367]: warning: Forcing redis-master away from overcloud-controller-2 after 1000000 failures (max=1000000)
journal.log:Dec 03 09:05:27 overcloud-controller-1.localdomain pengine[25367]: warning: Forcing redis-master away from overcloud-controller-2 after 1000000 failures (max=1000000)
journal.log:Dec 03 09:05:27 overcloud-controller-1.localdomain pengine[25367]: warning: Forcing redis-master away from overcloud-controller-2 after 1000000 failures (max=1000000)
journal.log:Dec 03 09:05:27 overcloud-controller-1.localdomain pengine[25367]: error: Internal error: No action found for demote in redis:1 (then)
journal.log:Dec 03 09:05:27 overcloud-controller-1.localdomain pengine[25367]: error: Internal error: No action found for demote in redis:1 (then)

The status at the time of the report seems consistent with controller-2 being unable to start redis:

Master/Slave Set: redis-master [redis]
    Masters: [ overcloud-controller-0 ]
    Slaves: [ overcloud-controller-1 ]

What is odd is that according to redis.log it did actually start:

[4516] 03 Dec 09:04:57.253 # You requested maxclients of 10000 requiring at least 10032 max file descriptors.
[4516] 03 Dec 09:04:57.272 # Redis can't set maximum open files to 10032 because of OS error: Operation not permitted.
[4516] 03 Dec 09:04:57.272 # Current maximum open files is 4096. maxclients has been reduced to 4064 to compensate for low ulimit. If you need higher maxclients increase 'ulimit -n'.
[4516] 03 Dec 09:04:57.273 # Server started, Redis version 2.8.21
[4516] 03 Dec 09:04:57.273 # WARNING overcommit_memory is set to 0! Background save may fail under low memory condition. To fix this issue add 'vm.overcommit_memory = 1' to /etc/sysctl.conf and then reboot or run the command 'sysctl vm.overcommit_memory=1' for this to take effect.
[4516] 03 Dec 09:04:57.273 # WARNING you have Transparent Huge Pages (THP) support enabled in your kernel. This will create latency and memory usage issues with Redis. To fix this issue run the command 'echo never > /sys/kernel/mm/transparent_hugepage/enabled' as root, and add it to your /etc/rc.local in order to retain the setting after a reboot. Redis must be restarted after THP is disabled.
[4516] 03 Dec 09:04:57.273 # WARNING: The TCP backlog setting of 511 cannot be enforced because /proc/sys/net/core/somaxconn is set to the lower value of 128.
[4516] 03 Dec 09:04:57.278 * DB loaded from disk: 0.005 seconds
[4516] 03 Dec 09:04:57.278 * The server is now ready to accept connections on port 6379
[4516] 03 Dec 09:04:57.278 * The server is now ready to accept connections at /var/run/redis/redis.sock
[4516 | signal handler] (1449151511) Received SIGTERM scheduling shutdown...
[4516] 03 Dec 09:05:11.590 # User requested shutdown...

I wonder if the logic here is the one at fault?
https://github.com/ClusterLabs/resource-agents/blob/master/heartbeat/redis#L341

I also tried to start it by hand and it worked okay:

su redis -s /bin/sh -c "cd /var/run/redis; exec /usr/bin/redis-server '/etc/redis.conf' --daemonize yes --unixsocket '/var/run/redis/redis.sock' --pidfile '/var/run/redis/redis-server.pid'"

Andrew, any thoughts here?

As far as I can see, it comes down to:

redis_start_0:4350:stderr [ Could not connect to Redis at /var/run/redis/redis.sock: No such file or directory ]

which comes from the redis_client() function calling:

"$REDIS_CLIENT" -s "$REDIS_SOCKET" "$@" | sed 's/\r//'

Elsewhere in the agent is:

# unknown error, but the process still exists.
# This check is mainly because redis daemonizes before it starts listening, causing `redis-cli` to fail
# See https://github.com/antirez/redis/issues/2368
# It's possible that the `pidof` will pick up a different redis, but in that case, the start operation will just time out

So in combination with the successful restart at 8:50, I'd say this looks like a timing issue; maybe redis_client() needs a loop?

Assuming Andrew is right, I've created a patch to redis to make it wait for the child to drop its pid file and create its socket. This should be considered a temporary fix until upstream has a clean implementation, but it should not require any other workarounds in e.g. resource-agents or pacemaker.

Created attachment 1102325 [details]
Patch
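Both the loop Andrew suggested and the downstream redis patch address the same race: with --daemonize yes, the redis-server parent returns before the child has written its pid file and created its socket, so an immediate redis-cli call can fail even though the start actually succeeded. A minimal shell sketch of the polling approach follows; it is illustrative only (neither the shipped resource-agent code nor the attached redis patch), and the paths simply mirror the agent's invocation quoted above:

    #!/bin/sh
    # Illustrative readiness wait, not the shipped fix: poll until the daemonized
    # redis-server has written its pid file and answers PING on its unix socket.
    REDIS_SOCKET=/var/run/redis/redis.sock
    REDIS_PIDFILE=/var/run/redis/redis-server.pid
    timeout=10

    /usr/bin/redis-server /etc/redis.conf --daemonize yes \
        --unixsocket "$REDIS_SOCKET" --pidfile "$REDIS_PIDFILE"

    while [ "$timeout" -gt 0 ]; do
        if [ -s "$REDIS_PIDFILE" ] && \
           [ "$(redis-cli -s "$REDIS_SOCKET" ping 2>/dev/null)" = "PONG" ]; then
            echo "redis is up"
            exit 0
        fi
        sleep 1
        timeout=$((timeout - 1))
    done
    echo "redis did not become ready in time" >&2
    exit 1

The fix that shipped in redis-2.8.21-2.el7ost takes the other route described in the Doc Text: the redis-server parent waits for a signal from the child before exiting, so callers such as the resource agent never see the window in which the socket does not yet exist.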
Doesn't reproduce on the latest build of OSPd 7.2 with redis-2.8.21-2.el7ost.x86_64.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2015:2676
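For reference, a rough verification along the lines of the comment above might look like this; the commands are illustrative, with resource and package names taken from this bug:

    # Confirm the fixed package is installed on every controller.
    rpm -q redis    # expect redis-2.8.21-2.el7ost or newer

    # After rebooting the node that held the redis master role, check that it
    # rejoins as a slave and that no start/demote failures accumulate.
    pcs status | grep -A 3 "redis-master"
    pcs resource failcount show redis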