Bug 1527072
| Summary: | OVN DB Pacemaker bundle: Pacemaker is either promoting a service on the wrong node or not promoting at all when the master node is stopped |
|---|---|
| Product: | Red Hat Enterprise Linux 7 |
| Component: | pacemaker |
| Status: | CLOSED ERRATA |
| Severity: | high |
| Priority: | high |
| Version: | 7.4 |
| Target Milestone: | rc |
| Target Release: | 7.5 |
| Hardware: | Unspecified |
| OS: | Unspecified |
| Reporter: | Numan Siddique <nusiddiq> |
| Assignee: | Ken Gaillot <kgaillot> |
| QA Contact: | pkomarov |
| CC: | abeekhof, aherr, cfeist, chjones, cluster-maint, dalvarez, kgaillot, lmiksik, mnovacek, ushkalim |
| Keywords: | Triaged, ZStream |
| Fixed In Version: | pacemaker-1.1.18-10.el7 |
| Doc Type: | No Doc Update |
| Doc Text: | When placing a resource that was colocated with a particular role of a bundle, Pacemaker did not correctly consider the role. As a consequence, these resources could be started or promoted on a node where the bundle did not have that role. With this update, Pacemaker now properly considers the role of the bundle when placing colocated resources, and the described problem no longer occurs. |
| Cloned As: | 1537557 (view as bug list) |
| Bug Blocks: | 1537557 |
| Last Closed: | 2018-04-10 15:34:42 UTC |
| Type: | Bug |
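The fix described in the Doc Text concerns role-specific colocation constraints. As a minimal illustration (the resource names `my-vip` and `my-bundle` are hypothetical, not from this bug), such a constraint is created with:

```
pcs constraint colocation add my-vip with master my-bundle
```

Before the fix, Pacemaker could place `my-vip` on a node where `my-bundle` held only the Slave role; with the fix, `my-vip` follows the node holding the Master role.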
Description
Numan Siddique
2017-12-18 13:31:15 UTC
Created attachment 1369504 [details]
OVN OCF script 1
Created attachment 1369505 [details]
OVN OCF script 2
Created attachment 1369522 [details]
sosreports
Created attachment 1369523 [details]
sosreports
The issue can be reproduced with redis-bundle as well. I created the redis bundle as shown below and added a colocation constraint, but when the redis bundle is created, the master is promoted on another node.
******************************
#!/bin/bash
rm -f tmp-cib*
pcs resource delete redis-bundle
pcs cluster cib tmp-cib.xml
cp tmp-cib.xml tmp-cib.xml.deltasrc
pcs -f tmp-cib.xml resource bundle create redis-bundle \
container docker image=192.168.122.206:8787/master/centos-binary-redis:pcmklatest masters=1 network=host \
options="--user=root --log-driver=journald -e KOLLA_CONFIG_STRATEGY=COPY_ALWAYS" replicas=3 run-command="/bin/bash /usr/local/bin/kolla_start" \
network control-port=3124 \
storage-map id=redis-cfg-files options=ro source-dir=/var/lib/kolla/config_files/redis.json target-dir=/var/lib/kolla/config_files/config.json \
storage-map id=redis-cfg-data-redis options=ro source-dir=/var/lib/config-data/puppet-generated/redis/ target-dir=/var/lib/kolla/config_files/src \
storage-map id=redis-hosts options=ro source-dir=/etc/hosts target-dir=/etc/hosts \
storage-map id=redis-localtime options=ro source-dir=/etc/localtime target-dir=/etc/localtime \
storage-map id=redis-lib options=rw source-dir=/var/lib/redis target-dir=/var/lib/redis \
storage-map id=redis-log options=rw source-dir=/var/log/redis target-dir=/var/log/redis \
storage-map id=redis-run options=rw source-dir=/var/run/redis target-dir=/var/run/redis \
storage-map id=redis-pki-extracted options=ro source-dir=/etc/pki/ca-trust/extracted target-dir=/etc/pki/ca-trust/extracted \
storage-map id=redis-pki-ca-bundle-crt options=ro source-dir=/etc/pki/tls/certs/ca-bundle.crt target-dir=/etc/pki/tls/certs/ca-bundle.crt \
storage-map id=redis-pki-ca-bundle-trust-crt options=ro source-dir=/etc/pki/tls/certs/ca-bundle.trust.crt target-dir=/etc/pki/tls/certs/ca-bundle.trust.crt \
storage-map id=redis-pki-cert options=ro source-dir=/etc/pki/tls/cert.pem target-dir=/etc/pki/tls/cert.pem \
storage-map id=redis-dev-log options=rw source-dir=/dev/log target-dir=/dev/log
pcs -f tmp-cib.xml resource create redis ocf:heartbeat:redis bundle redis-bundle
pcs -f tmp-cib.xml resource meta redis container-attribute-target=host interleave=true notify=true ordered=true
pcs -f tmp-cib.xml constraint colocation add ip-172.16.2.5 with master redis-bundle
pcs cluster cib-push tmp-cib.xml diff-against=tmp-cib.xml.deltasrc
******************
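For reference, the colocation constraint created by the last pcs command above is stored in the CIB roughly as follows (a sketch; the exact `id` is generated by pcs):

```xml
<rsc_colocation id="colocation-ip-172.16.2.5-redis-bundle-INFINITY"
                rsc="ip-172.16.2.5" rsc-role="Started"
                with-rsc="redis-bundle" with-rsc-role="Master"
                score="INFINITY"/>
```

The `with-rsc-role="Master"` attribute is what makes this a role-specific colocation, matching the `(with-rsc-role:Master)` shown in the `pcs constraint show` output below.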
Below is the output of pcs status and pcs constraint show
[root@overcloud-controller-2 heat-admin]# pcs status
Cluster name: tripleo_cluster
Stack: corosync
Current DC: overcloud-controller-1 (version 1.1.16-12.el7_4.5-94ff4df) - partition with quorum
Last updated: Tue Dec 19 07:45:12 2017
Last change: Tue Dec 19 07:41:34 2017 by redis-bundle-2 via crm_attribute on overcloud-controller-2
12 nodes configured
37 resources configured
Online: [ overcloud-controller-0 overcloud-controller-1 overcloud-controller-2 ]
GuestOnline: [ galera-bundle-0@overcloud-controller-0 galera-bundle-1@overcloud-controller-1 galera-bundle-2@overcloud-controller-2 rabbitmq-bundle-0@overcloud-controller-0 rabbitmq-bundle-1@overcloud-controller-1 rabbitmq-bundle-2@overcloud-controller-2 redis-bundle-0@overcloud-controller-0 redis-bundle-1@overcloud-controller-1 redis-bundle-2@overcloud-controller-2 ]
Full list of resources:
Docker container set: rabbitmq-bundle [192.168.122.206:8787/master/centos-binary-rabbitmq:pcmklatest]
rabbitmq-bundle-0 (ocf::heartbeat:rabbitmq-cluster): Started overcloud-controller-0
rabbitmq-bundle-1 (ocf::heartbeat:rabbitmq-cluster): Started overcloud-controller-1
rabbitmq-bundle-2 (ocf::heartbeat:rabbitmq-cluster): Started overcloud-controller-2
Docker container set: galera-bundle [192.168.122.206:8787/master/centos-binary-mariadb:pcmklatest]
galera-bundle-0 (ocf::heartbeat:galera): Master overcloud-controller-0
galera-bundle-1 (ocf::heartbeat:galera): Master overcloud-controller-1
galera-bundle-2 (ocf::heartbeat:galera): Master overcloud-controller-2
ip-192.168.24.8 (ocf::heartbeat:IPaddr2): Started overcloud-controller-0
ip-10.0.0.6 (ocf::heartbeat:IPaddr2): Started overcloud-controller-1
ip-172.16.2.9 (ocf::heartbeat:IPaddr2): Started overcloud-controller-0
ip-172.16.2.5 (ocf::heartbeat:IPaddr2): Started overcloud-controller-0
ip-172.16.1.9 (ocf::heartbeat:IPaddr2): Started overcloud-controller-1
ip-172.16.3.12 (ocf::heartbeat:IPaddr2): Started overcloud-controller-0
Docker container set: haproxy-bundle [192.168.122.206:8787/master/centos-binary-haproxy:pcmklatest]
haproxy-bundle-docker-0 (ocf::heartbeat:docker): Started overcloud-controller-0
haproxy-bundle-docker-1 (ocf::heartbeat:docker): Started overcloud-controller-1
haproxy-bundle-docker-2 (ocf::heartbeat:docker): Started overcloud-controller-2
Docker container: openstack-cinder-volume [192.168.122.206:8787/master/centos-binary-cinder-volume:pcmklatest]
openstack-cinder-volume-docker-0 (ocf::heartbeat:docker): Started overcloud-controller-1
Docker container set: redis-bundle [192.168.122.206:8787/master/centos-binary-redis:pcmklatest]
redis-bundle-0 (ocf::heartbeat:redis): Slave overcloud-controller-0
redis-bundle-1 (ocf::heartbeat:redis): Slave overcloud-controller-1
redis-bundle-2 (ocf::heartbeat:redis): Master overcloud-controller-2
[root@overcloud-controller-2 heat-admin]# pcs constraint show
....
....
....
Ordering Constraints:
start ip-192.168.24.8 then start haproxy-bundle (kind:Optional)
start ip-10.0.0.6 then start haproxy-bundle (kind:Optional)
start ip-172.16.2.9 then start haproxy-bundle (kind:Optional)
start ip-172.16.2.5 then start haproxy-bundle (kind:Optional)
start ip-172.16.1.9 then start haproxy-bundle (kind:Optional)
start ip-172.16.3.12 then start haproxy-bundle (kind:Optional)
Colocation Constraints:
ip-192.168.24.8 with haproxy-bundle (score:INFINITY)
ip-10.0.0.6 with haproxy-bundle (score:INFINITY)
ip-172.16.2.9 with haproxy-bundle (score:INFINITY)
ip-172.16.2.5 with haproxy-bundle (score:INFINITY)
ip-172.16.1.9 with haproxy-bundle (score:INFINITY)
ip-172.16.3.12 with haproxy-bundle (score:INFINITY)
ip-172.16.2.5 with redis-bundle (score:INFINITY) (rsc-role:Started) (with-rsc-role:Master)
Ticket Constraints:
I'm not sure I understand what the problem is. The "two" sosreports point to the same location and show:

Docker container set: ovn-dbs-bundle [192.168.122.206:8787/master/centos-binary-ovn-northd:latest]
ovn-dbs-bundle-0 (ocf::ovn:ovndb-servers): Master overcloud-controller-1
ovn-dbs-bundle-1 (ocf::ovn:ovndb-servers): Slave overcloud-controller-0
ovn-dbs-bundle-2 (ocf::ovn:ovndb-servers): Slave overcloud-controller-2

which looks fine to me. There are no constraints that suggest anything else should happen, no colocation or ordering constraints with the OVN resource, and there is no indication of any controllers being shut down.

Comment #7 about redis has:

ip-172.16.2.5 with redis-bundle (score:INFINITY) (rsc-role:Started) (with-rsc-role:Master)

Where did that come from? I don't recall it being part of a standard deployment.

(In reply to Andrew Beekhof from comment #8)
> The "two" sosreports point to the same location and show:

That's right. My mistake. I did it twice.

> There are no constraints that suggest anything else should happen, no
> colocation or ordering constraints with ovn resource and there is no
> indication of any controllers being shut down.

The attached sosreports are from the system where I first deployed TripleO with containers and a Pacemaker bundle HA setup without OVN (i.e. using the default neutron ML2/OVS). I then did a stack update to enable the OVN services. Here is the puppet-tripleo bundle code for OVN, which does set the colocation constraints: https://github.com/openstack/puppet-tripleo/blob/master/manifests/profile/pacemaker/ovn_dbs_bundle.pp#L154

I am not really sure why you don't see the colocation information in the logs. I will do a fresh OVN deployment, reproduce the issue, and share the sosreports.

> Comment #7 about redis has:
> ip-172.16.2.5 with redis-bundle (score:INFINITY) (rsc-role:Started) (with-rsc-role:Master)
> where did that come from? I don't recall it being part of a standard deployment

About redis: yes, it's not part of the standard deployment. The reason I added comment 7 was to show that the issue can be reproduced with redis bundles as well. The issue is that when a redis bundle resource is created and a colocation constraint is added, I was expecting the redis master to be running on the node where the colocated resource (i.e. ip-172.16.2.5) is present, but I don't see that happening. Please correct me if my understanding is wrong here. From comment 7, you can see that ip-172.16.2.5 is running on controller-0 whereas the redis master is running on controller-2. We are seeing the same/similar issue with the OVN DB resource. Michele suggested that if the issue can be reproduced without the OVN DB, it would be easier to debug.

Thanks
Numan

(In reply to Numan Siddique from comment #9)
> I am not really sure why you don't see the colocation information in the
> logs.

Not in the logs... in the stored configurations. Maybe the report was just taken "too soon".

> I will do a fresh OVN deployment, reproduce the issue and share the
> sosreports.

Great.

> About redis: yes, it's not part of the standard deployment. The reason I
> added comment 7 was to show that the issue can be reproduced with redis
> bundles as well.

Oh, sure. What is in the bundle is pretty much irrelevant.

> The issue is that when a redis bundle resource is created and a colocation
> constraint is added, I was expecting the redis master to be running on the
> node where the colocated resource (i.e. ip-172.16.2.5) is present.

If it works, it would be the other way around... the IP would be placed where redis is a master. I seem to recall making changes in that area recently, though I don't recall whether I was fixing something or explicitly disallowing it. Either way, it would be worth making sure you have the latest packages. Brew says pacemaker-1.1.16-12.el7_4.7, but that might just be in the queue and not yet released.

> Michele suggested that if the issue can be reproduced without the OVN DB, it
> would be easier to debug.

Nah, I just need the CIB (output from: cibadmin -Q) when the cluster is in that state.

I was able to reproduce the issue with the latest deployment and get the sosreports. The sosreports can be found here: https://github.com/numansiddique/pcs_logs/tree/master/new_ovn_logs/sosreport_ovndbs_bundle_issue

It worked fine and as expected after the TripleO deployment: the ovn-dbs-bundle master and the colocation constraint resource ip-172.16.2.10 were on the same node, controller-0. Then I stopped the cluster on controller-0. After this, ip-172.16.2.10 was started on controller-2 and the ovn-dbs-bundle master on controller-1.

The pacemaker version is:

[root@overcloud-controller-1 heat-admin]# pacemakerd --version
Pacemaker 1.1.16-12.el7_4.5
Written by Andrew Beekhof

It's a CentOS-based TripleO master deployment; the pacemaker packages are coming from RDO.

(In reply to Andrew Beekhof from comment #10)
> Either way it would be worth making sure you have the latest packages.
> Brew says pacemaker-1.1.16-12.el7_4.7 but that might just be in the queue
> and not released.

Some time back, I compiled and generated the pacemaker packages myself (I wanted to add some debug prints) and tested. I could still see the issue. The name of the package was pacemaker-1.1.18-1.3d356c4.git.el7.centos.x86_64.rpm. Before compiling I switched to branch 1.1, and the last commit at the time was:

commit ddc8933dcae58461f71430cc2e237006715d600f
Merge: 2b07d5c 6843891
Author: Ken Gaillot <kgaillot>
Date: Mon Dec 11 20:56:39 2017 -0600

    Merge pull request #1393 from kgaillot/fixes11

    Backport some master/2.0 fixes

All good, I have the reproducer now. Just need to fix it :)

Fixed in https://github.com/beekhof/pacemaker/commit/373c057

Assigning back to HA team for backports and builds

Verified,
Colocation is implemented correctly after the ovndb_servers-master switch:
#Check colocation:
$ ansible controller-0 -m shell -b -a 'pcs config |grep "Colocation Constraints\|with ovndb_servers-master"'
controller-0 | SUCCESS | rc=0 >>
Colocation Constraints:
ip-172.17.0.18 with ovndb_servers-master (score:INFINITY) (rsc-role:Started) (with-rsc-role:Master)
#Before the failover:
$ ansible controller-0 -m shell -b -a 'pcs status'
controller-0 | SUCCESS | rc=0 >>
Cluster name: tripleo_cluster
Stack: corosync
Current DC: controller-1 (version 1.1.16-12.el7_4.8-94ff4df) - partition with
...
ip-172.17.0.18 (ocf::heartbeat:IPaddr2): Started controller-2
...
Master/Slave Set: ovndb_servers-master [ovndb_servers]
Masters: [ controller-2 ]
Slaves: [ controller-0 controller-1 ]
#After the failover:
$ ansible controller-0 -m shell -b -a 'pcs status'
controller-0 | SUCCESS | rc=0 >>
Cluster name: tripleo_cluster
Stack: corosync
Current DC: controller-1 (version 1.1.16-12.el7_4.8-94ff4df)
...
ip-172.17.0.18 (ocf::heartbeat:IPaddr2): Started controller-1
...
Master/Slave Set: ovndb_servers-master [ovndb_servers]
Masters: [ controller-1 ]
Slaves: [ controller-0 ]
Stopped: [ controller-2 ]
Verified,
# pcs status |head -n 3
Cluster name: tripleo_cluster
Stack: corosync
Current DC: controller-1 (version 1.1.18-10.el7-2b07d5c5a9) - partition with quorum
pcs config:
Colocation Constraints:
ip-172.17.0.11 with ovndb_servers-master (score:INFINITY) (rsc-role:Started) (with-rsc-role:Master)
Before ovndb_servers-master standby:
ip-172.17.0.11 (ocf::heartbeat:IPaddr2): Started controller-0
Master/Slave Set: ovndb_servers-master [ovndb_servers]
Masters: [ controller-0 ]
Slaves: [ controller-1 controller-2 ]
After ovndb_servers-master standby:
ip-172.17.0.11 (ocf::heartbeat:IPaddr2): Started controller-2
Master/Slave Set: ovndb_servers-master [ovndb_servers]
Masters: [ controller-2 ]
Slaves: [ controller-1 ]
Stopped: [ controller-0 ]
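The manual check above can be scripted. The sketch below (a hypothetical helper, not part of the verification run) parses `pcs status`-style text and asserts that the colocated VIP runs on the same node as the promoted (Master) instance; a sample of the post-standby output from this bug is embedded so it runs without a live cluster, but in practice `status=$(pcs status)` would be used.

```shell
#!/bin/sh
# Sketch: verify that the colocated VIP runs on the same node as the
# Master instance. Sample 'pcs status' text is embedded for illustration;
# on a real cluster use: status=$(pcs status)
status='ip-172.17.0.11 (ocf::heartbeat:IPaddr2): Started controller-2
Master/Slave Set: ovndb_servers-master [ovndb_servers]
Masters: [ controller-2 ]
Slaves: [ controller-1 ]'

# Node hosting the VIP: last field of the IPaddr2 line
vip_node=$(printf '%s\n' "$status" | awk '/IPaddr2/ {print $NF}')
# Node holding the Master role: the word inside the brackets
master_node=$(printf '%s\n' "$status" | awk '/Masters:/ {print $3}')

if [ "$vip_node" = "$master_node" ]; then
    echo "OK: VIP and master are both on $vip_node"
else
    echo "MISMATCH: VIP on $vip_node, master on $master_node"
fi
```

With the sample data above this prints `OK: VIP and master are both on controller-2`, matching the verified post-standby state.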
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHEA-2018:0860 |