Bug 1236407
Summary: Redis replication breaks after network partitioning Redis master
Product: Red Hat OpenStack
Component: openstack-tripleo-heat-templates
Status: CLOSED ERRATA
Severity: high
Priority: high
Version: 7.0 (Kilo)
Target Milestone: ga
Target Release: Director
Hardware: Unspecified
OS: Unspecified
Reporter: Marius Cornea <mcornea>
Assignee: Giulio Fidente <gfidente>
QA Contact: Marius Cornea <mcornea>
CC: abeekhof, calfonso, chdent, dmacpher, dvossel, fdinitto, jason.dobies, kbasil, lnatapov, mburns, mcornea, rhel-osp-director-maint, yeylon
Keywords: Triaged
Fixed In Version: openstack-tripleo-heat-templates-0.8.6-33.el7ost
Doc Type: Bug Fix
Doc Text:
On Overclouds with network isolation enabled, Pacemaker advertised the Redis master under a hostname that resolved to a network on which the master was unreachable, so Redis nodes failed to join the cluster. This fix resolves Pacemaker hostnames to the internal_api addresses when deploying with network isolation.
Last Closed: 2015-08-05 13:57:29 UTC
Type: Bug
Description
Marius Cornea
2015-06-28 16:44:57 UTC
Fabio, Andrew, FWIW: currently we do not set 'slaveof' in redis.conf and do not configure redis-sentinel either, leaving control of master selection and promotion/demotion to the resource agent.

The redis agent requires fencing to produce consistent and safe results with regard to split partitions. We determined that fencing was not in use, which will produce nondeterministic results. My advice is to re-test with fencing enabled. If you're using libvirt guests, setting up fence_virsh just for testing is a simple option. After fencing is enabled, if we still hit this issue, please create a crm_report covering the time frame in which the issue occurred. This will help me understand exactly what pacemaker did, in hopes of better understanding why the redis agent behaved a certain way. I also wouldn't be surprised to see this issue completely disappear after enabling fencing.

Looking at the testing procedure, this is a great test. I'm really glad this sort of scenario is being validated. Other important scenarios involve simple things like 'put the pacemaker node that contains a redis master into standby, verify a new master is promoted and all slave instances point to the new master' or 'kill the active master redis daemon, verify the state of both slave and master instances after recovery'. Please let us know if the fencing setup does indeed resolve this issue.

Another issue: after rebooting a slave node, redis didn't start after the controller came up. It's probably trying to reconnect to the master on an IP where the master is not binding:
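The fence_virsh setup suggested above for libvirt-hosted test environments could be sketched roughly as follows. This is a configuration sketch only: the hypervisor address, login, and key path are placeholders, not details from this report, and the libvirt domain name passed as `port` must match `virsh list` on the host.

```shell
# Hypothetical fence_virsh device for one controller (repeat per node).
# hypervisor.example.com, root, and the key path are placeholders.
pcs stonith create fence-virsh-ctrl0 fence_virsh \
  ipaddr=hypervisor.example.com login=root \
  identity_file=/root/.ssh/id_rsa \
  port=overcloud-controller-0 \
  pcmk_host_list=overcloud-controller-0
# Enable fencing cluster-wide once all devices exist.
pcs property set stonith-enabled=true
```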
[4149] 30 Jun 10:09:55.677 # Error condition on socket for SYNC: Connection refused
[4149] 30 Jun 10:09:56.679 * Connecting to MASTER overcloud-controller-0:6379
[4149] 30 Jun 10:09:56.679 * MASTER <-> SLAVE sync started
[4149] 30 Jun 10:09:56.680 # Error condition on socket for SYNC: Connection refused
[4149] 30 Jun 10:09:57.680 * Connecting to MASTER overcloud-controller-0:6379
[4149] 30 Jun 10:09:57.680 * MASTER <-> SLAVE sync started
[4149] 30 Jun 10:09:57.680 # Error condition on socket for SYNC: Connection refused
[... the same Connecting / sync started / Connection refused cycle repeats once per second through 10:10:08 ...]
[4149 | signal handler] (1435673409) Received SIGTERM scheduling shutdown...
[4149] 30 Jun 10:10:09.108 # User requested shutdown...
[4149] 30 Jun 10:10:09.109 * Saving the final RDB snapshot before exiting.
[4149] 30 Jun 10:10:09.121 * DB saved on disk
[4149] 30 Jun 10:10:09.121 * Removing the pid file.
[4149] 30 Jun 10:10:09.121 * Removing the unix socket file.
[4149] 30 Jun 10:10:09.121 # Redis is now ready to exit, bye bye...

Looks like we provide as master a hostname which resolves to a network where redis is not listening.

(In reply to Giulio Fidente from comment #7)
> Looks like we provide as master a hostname which resolves to a network where
> redis is not listening.

The redis agent expects pacemaker node names to be network resolvable. When a redis instance is promoted to master, all the slave redis instances are told to point at the new master instance, which is represented by the pacemaker node name.

Tested this on a baremetal environment with fencing enabled and the issue is not present anymore:
[stack@bldr16cc09 ~]$ nova list
+--------------------------------------+-------------------------+--------+------------+-------------+---------------------+
| ID                                   | Name                    | Status | Task State | Power State | Networks            |
+--------------------------------------+-------------------------+--------+------------+-------------+---------------------+
| c162f9fe-efba-4351-8403-45223d715fd9 | overcloud-cephstorage-0 | ACTIVE | -          | Running     | ctlplane=10.3.58.10 |
| 82b237cd-a9ac-409d-a750-e9d012c704d0 | overcloud-compute-0     | ACTIVE | -          | Running     | ctlplane=10.3.58.11 |
| 0c658847-9faf-4209-bdbf-8acf0f55834f | overcloud-controller-0  | ACTIVE | -          | Running     | ctlplane=10.3.58.12 |
| ae2cc01b-378a-476c-a3a4-20e73cfcc62a | overcloud-controller-1  | ACTIVE | -          | Running     | ctlplane=10.3.58.14 |
| f373d947-cf5b-42d9-beab-d6ce2ba7c916 | overcloud-controller-2  | ACTIVE | -          | Running     | ctlplane=10.3.58.13 |
+--------------------------------------+-------------------------+--------+------------+-------------+---------------------+

[stack@bldr16cc09 ~]$ cat <(echo info replication) - | nc 10.3.58.12 6379
$263
# Replication
role:master
connected_slaves:1
slave0:ip=10.3.58.14,port=6379,state=online,offset=11240430,lag=1
master_repl_offset:11240430
repl_backlog_active:1
repl_backlog_size:1048576
repl_backlog_first_byte_offset:11228839
repl_backlog_histlen:11592
^C

[stack@bldr16cc09 ~]$ cat <(echo info replication) - | nc 10.3.58.14 6379
$378
# Replication
role:slave
master_host:overcloud-controller-0
master_port:6379
master_link_status:up
master_last_io_seconds_ago:0
master_sync_in_progress:0
slave_repl_offset:11241220
slave_priority:100
slave_read_only:1
connected_slaves:0
master_repl_offset:0
repl_backlog_active:0
repl_backlog_size:1048576
repl_backlog_first_byte_offset:0
repl_backlog_histlen:0
^C

[stack@bldr16cc09 ~]$ cat <(echo info replication) - | nc 10.3.58.13 6379
^C

[stack@bldr16cc09 ~]$ ping 10.3.58.13
PING 10.3.58.13 (10.3.58.13) 56(84) bytes of data.
From 10.3.58.1 icmp_seq=1 Destination Host Unreachable
From 10.3.58.1 icmp_seq=2 Destination Host Unreachable
From 10.3.58.1 icmp_seq=3 Destination Host Unreachable
From 10.3.58.1 icmp_seq=4 Destination Host Unreachable
64 bytes from 10.3.58.13: icmp_seq=5 ttl=64 time=1444 ms
64 bytes from 10.3.58.13: icmp_seq=6 ttl=64 time=444 ms
64 bytes from 10.3.58.13: icmp_seq=7 ttl=64 time=0.260 ms
^C
--- 10.3.58.13 ping statistics ---
33 packets transmitted, 29 received, +4 errors, 12% packet loss, time 32001ms
rtt min/avg/max/mdev = 0.150/65.336/1444.062/272.848 ms, pipe 4

[stack@bldr16cc09 ~]$ cat <(echo info replication) - | nc 10.3.58.12 6379
$330
# Replication
role:master
connected_slaves:2
slave0:ip=10.3.58.14,port=6379,state=online,offset=11268018,lag=1
slave1:ip=10.3.58.13,port=6379,state=online,offset=11268018,lag=1
master_repl_offset:11268212
repl_backlog_active:1
repl_backlog_size:1048576
repl_backlog_first_byte_offset:11228839
repl_backlog_histlen:39374
^C

[stack@bldr16cc09 ~]$ cat <(echo info replication) - | nc 10.3.58.13 6379
$378
# Replication
role:slave
master_host:overcloud-controller-0
master_port:6379
master_link_status:up
master_last_io_seconds_ago:0
master_sync_in_progress:0
slave_repl_offset:11268794
slave_priority:100
slave_read_only:1
connected_slaves:0
master_repl_offset:0
repl_backlog_active:0
repl_backlog_size:1048576
repl_backlog_first_byte_offset:0
repl_backlog_histlen:0

[stack@bldr16cc09 ~]$ cat <(echo info replication) - | nc 10.3.58.14 6379
$378
# Replication
role:slave
master_host:overcloud-controller-0
master_port:6379
master_link_status:up
master_last_io_seconds_ago:1
master_sync_in_progress:0
slave_repl_offset:11269196
slave_priority:100
slave_read_only:1
connected_slaves:0
master_repl_offset:0
repl_backlog_active:0
repl_backlog_size:1048576
repl_backlog_first_byte_offset:0
repl_backlog_histlen:0

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2015:1549
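The underlying requirement discussed in this report is that every pacemaker node name resolve, on every controller, to an address on the network where redis actually listens. A quick spot-check along these lines could be run on each node (the node names are the ones from this report's environment; adjust them for other deployments):

```shell
# Check whether each pacemaker node name resolves on this host.
# This only verifies name resolution, not which network the address is on;
# with network isolation, the result should be the internal_api address.
report=""
for node in overcloud-controller-0 overcloud-controller-1 overcloud-controller-2; do
  if getent hosts "$node" >/dev/null 2>&1; then
    report="${report}${node}: resolves\n"
  else
    report="${report}${node}: DOES NOT resolve\n"
  fi
done
printf '%b' "$report"
```

One line is printed per node regardless of outcome, so the output is easy to compare across controllers.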