Bug 1527072

Summary: OVN DB Pacemaker bundle: Pacemaker is either promoting a service on a wrong node or not promoting at all when the master node is stopped
Product: Red Hat Enterprise Linux 7
Reporter: Numan Siddique <nusiddiq>
Component: pacemaker
Assignee: Ken Gaillot <kgaillot>
Status: CLOSED ERRATA
QA Contact: pkomarov
Severity: high
Priority: high
Version: 7.4
CC: abeekhof, aherr, cfeist, chjones, cluster-maint, dalvarez, kgaillot, lmiksik, mnovacek, ushkalim
Target Milestone: rc
Keywords: Triaged, ZStream
Target Release: 7.5
Hardware: Unspecified
OS: Unspecified
Fixed In Version: pacemaker-1.1.18-10.el7
Doc Type: No Doc Update
Doc Text:
When placing a resource that was colocated with a particular role of a bundle, Pacemaker did not correctly consider the role. As a consequence, these resources could be started or promoted on a node where the bundle did not have that role. With this update, Pacemaker now properly considers the role of the bundle when placing colocated resources, and the described problem no longer occurs.
Clones: 1537557 (view as bug list)
Last Closed: 2018-04-10 15:34:42 UTC
Type: Bug
Bug Blocks: 1537557    
Attachments:
  OVN OCF script 1 (no flags)
  OVN OCF script 2 (no flags)
  sosreports (no flags)
  sosreports (no flags)

Description Numan Siddique 2017-12-18 13:31:15 UTC
Description of problem:

Deployment Environment
---------
CentOS 7 - tripleo with docker.yaml, docker-ha.yaml, and OVN DBs as a pacemaker bundle resource.


Version-Release number of selected component (if applicable):
pacemaker version - 
[root@overcloud-controller-1 heat-admin]# pacemakerd --version
Pacemaker 1.1.16-12.el7_4.5
Written by Andrew Beekhof


How reproducible:

In a fresh tripleo container deployment with OVN enabled, pacemaker starts the ovn-dbs-bundle and the master is promoted on the expected node, i.e. the node holding the VIP that the colocation constraint is set on. The issue is seen when I run "pcs cluster stop overcloud-controller-0". Once controller-0 is stopped, the ovn-dbs-bundle master is not promoted on the node where the VIP is rescheduled.
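
A minimal sketch of that reproduction flow (node and resource names are from this deployment and will differ elsewhere):

# the ovn-dbs-bundle master and the colocated VIP initially share a node
pcs status
# stop the cluster on the node hosting both
pcs cluster stop overcloud-controller-0
# expected: the VIP and the bundle master land on the same surviving node;
# observed: the VIP moves to one node while the master is promoted on another
pcs status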

In my case the resource ip-172.16.2.12 is running on controller-2 but the master is promoted on controller-1.

The issue is also seen when I update an existing tripleo stack deployed with ML2OVS neutron to ML2OVN neutron. In that case, the issue appears when the stack is updated.

The sosreports are attached. Also attached are two variants of ovndb-servers.ocf; the issue is seen with both of them.

I noticed this issue while working on switching the OVN RA from 'ocf_local_nodename' to 'ocf_attribute_target'.

The same issue is not seen when the OVN pacemaker resource is started as a bare-metal (non-bundle) resource.

Another observation: when controller-0 is stopped and the resource ip-172.16.2.12 starts on controller-2, the OVN DB servers there start acting as master, since the IP is now configured locally. When the next monitor action is called on controller-2, it returns OCF_NOT_RUNNING (the OVN OCF script expects the ovsdb-servers to be running as slaves at that point). Just for testing, I modified the monitor function to return OCF_RUNNING instead of OCF_NOT_RUNNING. After this, pacemaker did not promote the ovn-dbs-bundle on any node, and the ovn-dbs instances were running as slaves on both nodes.
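
For illustration only, the behavior described above amounts to something like this in the monitor action (a simplified sketch, not the actual ovndb-servers.ocf code; instance_is_master and pacemaker_promoted_here are hypothetical helpers, and OCF_SUCCESS/OCF_NOT_RUNNING are the standard return-code variables from ocf-shellfuncs, with OCF_SUCCESS standing in for the OCF_RUNNING mentioned above):

ovsdb_servers_monitor() {
    # The RA expects the servers to run as slaves until pacemaker promotes
    # them, so an instance that promoted itself (because the VIP landed on
    # it) is reported as not running.
    if instance_is_master && ! pacemaker_promoted_here; then
        return "$OCF_NOT_RUNNING"   # the test above changed this to success,
                                    # which left every instance a slave
    fi
    return "$OCF_SUCCESS"
}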

Steps to Reproduce:
1. Deploy tripleo with containers and a pacemaker-bundle HA setup, with OVN enabled (ovn-dbs-bundle plus the colocated VIP).
2. Run "pcs cluster stop" for the node hosting the ovn-dbs-bundle master and the VIP.
3. Observe where the VIP and the ovn-dbs-bundle master end up.

Actual results:
Pacemaker is not promoting the ovn-dbs-bundle master resource on the desired node and is not handling the colocation constraint properly.


Expected results:
The ovn-dbs-bundle master should be promoted on the node where the colocated VIP is running.

Additional info:

Comment 2 Numan Siddique 2017-12-18 13:33:49 UTC
Created attachment 1369504 [details]
OVN OCF script 1

Comment 3 Numan Siddique 2017-12-18 13:34:24 UTC
Created attachment 1369505 [details]
OVN OCF script 2

Comment 4 Numan Siddique 2017-12-18 14:02:28 UTC
Created attachment 1369522 [details]
sosreports

Comment 5 Numan Siddique 2017-12-18 14:04:56 UTC
Created attachment 1369523 [details]
sosreports

Comment 7 Numan Siddique 2017-12-19 07:49:52 UTC
The issue can also be reproduced with redis-bundle. I created the redis bundle as below and added a colocation constraint, but after the bundle is created the master is promoted on a node other than the one holding the VIP.

******************************

#!/bin/bash
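# Work on an offline copy of the CIB and keep a baseline copy, so the
# whole change can later be pushed as a single diff.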

rm -f tmp-cib*
pcs resource delete redis-bundle
pcs cluster cib tmp-cib.xml
cp tmp-cib.xml tmp-cib.xml.deltasrc

pcs -f tmp-cib.xml resource bundle create redis-bundle \
    container docker image=192.168.122.206:8787/master/centos-binary-redis:pcmklatest masters=1 network=host \
    options="--user=root --log-driver=journald -e KOLLA_CONFIG_STRATEGY=COPY_ALWAYS" replicas=3 run-command="/bin/bash /usr/local/bin/kolla_start" \
    network control-port=3124 \
    storage-map id=redis-cfg-files options=ro source-dir=/var/lib/kolla/config_files/redis.json target-dir=/var/lib/kolla/config_files/config.json \
    storage-map id=redis-cfg-data-redis options=ro source-dir=/var/lib/config-data/puppet-generated/redis/ target-dir=/var/lib/kolla/config_files/src  \
    storage-map id=redis-hosts options=ro source-dir=/etc/hosts target-dir=/etc/hosts  \
    storage-map id=redis-localtime options=ro source-dir=/etc/localtime target-dir=/etc/localtime \
    storage-map id=redis-lib options=rw source-dir=/var/lib/redis target-dir=/var/lib/redis \
    storage-map id=redis-log options=rw source-dir=/var/log/redis target-dir=/var/log/redis \
    storage-map id=redis-run options=rw source-dir=/var/run/redis target-dir=/var/run/redis \
    storage-map id=redis-pki-extracted options=ro source-dir=/etc/pki/ca-trust/extracted target-dir=/etc/pki/ca-trust/extracted \
    storage-map id=redis-pki-ca-bundle-crt options=ro source-dir=/etc/pki/tls/certs/ca-bundle.crt target-dir=/etc/pki/tls/certs/ca-bundle.crt \
    storage-map id=redis-pki-ca-bundle-trust-crt options=ro source-dir=/etc/pki/tls/certs/ca-bundle.trust.crt target-dir=/etc/pki/tls/certs/ca-bundle.trust.crt \
    storage-map id=redis-pki-cert options=ro source-dir=/etc/pki/tls/cert.pem target-dir=/etc/pki/tls/cert.pem  \
    storage-map id=redis-dev-log options=rw source-dir=/dev/log target-dir=/dev/log

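# Put a redis resource inside the bundle, track promotion attributes per
# host (container-attribute-target=host), and colocate the VIP with the
# bundle's master role.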
pcs -f tmp-cib.xml resource create redis ocf:heartbeat:redis bundle redis-bundle
pcs -f tmp-cib.xml resource meta redis container-attribute-target=host interleave=true notify=true ordered=true
pcs -f tmp-cib.xml constraint colocation add ip-172.16.2.5 with master redis-bundle
pcs cluster cib-push tmp-cib.xml diff-against=tmp-cib.xml.deltasrc

******************

Below is the output of pcs status and pcs constraint show:

[root@overcloud-controller-2 heat-admin]# pcs status
Cluster name: tripleo_cluster
Stack: corosync
Current DC: overcloud-controller-1 (version 1.1.16-12.el7_4.5-94ff4df) - partition with quorum
Last updated: Tue Dec 19 07:45:12 2017
Last change: Tue Dec 19 07:41:34 2017 by redis-bundle-2 via crm_attribute on overcloud-controller-2

12 nodes configured
37 resources configured

Online: [ overcloud-controller-0 overcloud-controller-1 overcloud-controller-2 ]
GuestOnline: [ galera-bundle-0@overcloud-controller-0 galera-bundle-1@overcloud-controller-1 galera-bundle-2@overcloud-controller-2 rabbitmq-bundle-0@overcloud-controller-0 rabbitmq-bundle-1@overcloud-controller-1 rabbitmq-bundle-2@overcloud-controller-2 redis-bundle-0@overcloud-controller-0 redis-bundle-1@overcloud-controller-1 redis-bundle-2@overcloud-controller-2 ]

Full list of resources:

 Docker container set: rabbitmq-bundle [192.168.122.206:8787/master/centos-binary-rabbitmq:pcmklatest]
   rabbitmq-bundle-0    (ocf::heartbeat:rabbitmq-cluster):      Started overcloud-controller-0
   rabbitmq-bundle-1    (ocf::heartbeat:rabbitmq-cluster):      Started overcloud-controller-1
   rabbitmq-bundle-2    (ocf::heartbeat:rabbitmq-cluster):      Started overcloud-controller-2
 Docker container set: galera-bundle [192.168.122.206:8787/master/centos-binary-mariadb:pcmklatest]
   galera-bundle-0      (ocf::heartbeat:galera):        Master overcloud-controller-0
   galera-bundle-1      (ocf::heartbeat:galera):        Master overcloud-controller-1
   galera-bundle-2      (ocf::heartbeat:galera):        Master overcloud-controller-2
 ip-192.168.24.8        (ocf::heartbeat:IPaddr2):       Started overcloud-controller-0
 ip-10.0.0.6    (ocf::heartbeat:IPaddr2):       Started overcloud-controller-1
 ip-172.16.2.9  (ocf::heartbeat:IPaddr2):       Started overcloud-controller-0
 ip-172.16.2.5  (ocf::heartbeat:IPaddr2):       Started overcloud-controller-0
 ip-172.16.1.9  (ocf::heartbeat:IPaddr2):       Started overcloud-controller-1
 ip-172.16.3.12 (ocf::heartbeat:IPaddr2):       Started overcloud-controller-0
 Docker container set: haproxy-bundle [192.168.122.206:8787/master/centos-binary-haproxy:pcmklatest]
   haproxy-bundle-docker-0      (ocf::heartbeat:docker):        Started overcloud-controller-0
   haproxy-bundle-docker-1      (ocf::heartbeat:docker):        Started overcloud-controller-1
   haproxy-bundle-docker-2      (ocf::heartbeat:docker):        Started overcloud-controller-2
 Docker container: openstack-cinder-volume [192.168.122.206:8787/master/centos-binary-cinder-volume:pcmklatest]
   openstack-cinder-volume-docker-0     (ocf::heartbeat:docker):        Started overcloud-controller-1
 Docker container set: redis-bundle [192.168.122.206:8787/master/centos-binary-redis:pcmklatest]
   redis-bundle-0       (ocf::heartbeat:redis): Slave overcloud-controller-0
   redis-bundle-1       (ocf::heartbeat:redis): Slave overcloud-controller-1
   redis-bundle-2       (ocf::heartbeat:redis): Master overcloud-controller-2

[root@overcloud-controller-2 heat-admin]# pcs constraint show
....
....
....
Ordering Constraints:
  start ip-192.168.24.8 then start haproxy-bundle (kind:Optional)
  start ip-10.0.0.6 then start haproxy-bundle (kind:Optional)
  start ip-172.16.2.9 then start haproxy-bundle (kind:Optional)
  start ip-172.16.2.5 then start haproxy-bundle (kind:Optional)
  start ip-172.16.1.9 then start haproxy-bundle (kind:Optional)
  start ip-172.16.3.12 then start haproxy-bundle (kind:Optional)
Colocation Constraints:
  ip-192.168.24.8 with haproxy-bundle (score:INFINITY)
  ip-10.0.0.6 with haproxy-bundle (score:INFINITY)
  ip-172.16.2.9 with haproxy-bundle (score:INFINITY)
  ip-172.16.2.5 with haproxy-bundle (score:INFINITY)
  ip-172.16.1.9 with haproxy-bundle (score:INFINITY)
  ip-172.16.3.12 with haproxy-bundle (score:INFINITY)
  ip-172.16.2.5 with redis-bundle (score:INFINITY) (rsc-role:Started) (with-rsc-role:Master)
Ticket Constraints:
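
For reference, the role-aware colocation shown above is the one created by this line from the script, which should pin the VIP to whichever node holds the bundle's master role:

pcs constraint colocation add ip-172.16.2.5 with master redis-bundle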

Comment 8 Andrew Beekhof 2018-01-16 06:10:34 UTC
I'm not sure I understand what the problem is.

The "two" sosreports point to the same location and show:

Docker container set: ovn-dbs-bundle [192.168.122.206:8787/master/centos-binary-ovn-northd:latest]
   ovn-dbs-bundle-0	(ocf::ovn:ovndb-servers):	Master overcloud-controller-1
   ovn-dbs-bundle-1	(ocf::ovn:ovndb-servers):	Slave overcloud-controller-0
   ovn-dbs-bundle-2	(ocf::ovn:ovndb-servers):	Slave overcloud-controller-2

which looks fine to me.

There are no constraints that suggest anything else should happen, no colocation or ordering constraints involving the OVN resource, and no indication of any controllers being shut down.


Comment #7 about redis has:

ip-172.16.2.5 with redis-bundle (score:INFINITY) (rsc-role:Started) (with-rsc-role:Master)

Where did that come from? I don't recall it being part of a standard deployment.

Comment 9 Numan Siddique 2018-01-16 07:45:55 UTC
(In reply to Andrew Beekhof from comment #8)
> I'm not sure I understand what the problem is.
> 
> The "two" sosreports point to the same location and show:

That's right. My mistake. I did it twice.


> 
> Docker container set: ovn-dbs-bundle
> [192.168.122.206:8787/master/centos-binary-ovn-northd:latest]
>    ovn-dbs-bundle-0	(ocf::ovn:ovndb-servers):	Master overcloud-controller-1
>    ovn-dbs-bundle-1	(ocf::ovn:ovndb-servers):	Slave overcloud-controller-0
>    ovn-dbs-bundle-2	(ocf::ovn:ovndb-servers):	Slave overcloud-controller-2
> 
> which looks fine to me.
> 
> There are no constraints that suggest anything else should happen, no
> colocation or ordering constraints with ovn resource and there is no
> indication of any controllers being shut down.


The attached sosreports are from a system where I first deployed tripleo with containers and a pacemaker-bundle HA setup without OVN (i.e. using the default neutron ML2OVS), then did a stack update to enable the OVN services.

Here is the puppet-tripleo bundle code for OVN, which does set the colocation constraint - https://github.com/openstack/puppet-tripleo/blob/master/manifests/profile/pacemaker/ovn_dbs_bundle.pp#L154
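
The effect of that puppet code should be roughly equivalent to this pcs command (a sketch; the VIP name is from this deployment):

pcs constraint colocation add ip-172.16.2.10 with master ovn-dbs-bundle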

I am not really sure why you don't see the colocation information in the logs.

I will do a fresh OVN deployment, reproduce the issue, and share the sosreports with you.




> 
> 
> Comment #7 about redis has:
> 
> ip-172.16.2.5 with redis-bundle (score:INFINITY) (rsc-role:Started)
> (with-rsc-role:Master)
> 
> where did that come from? I don't recall it being part of a standard
> deployment

About redis: yes, it's not part of the standard deployment. The reason I added comment 7 was to show that the issue can be reproduced with redis bundles as well. The issue is that when a redis bundle resource is created and a colocation constraint is added, I was expecting the redis master to run on the node where the colocated resource (i.e. ip-172.16.2.5) is present. But I don't see that happening. Please correct me if my understanding is wrong here.
From comment 7, you can see that ip-172.16.2.5 is running on controller-0, whereas the redis master is running on controller-2.
We are seeing the same/similar issue with the OVN DB resource.
Michele suggested that if the issue could be reproduced without the OVN DB, it would be easier to debug.

Thanks
Numan

Comment 10 Andrew Beekhof 2018-01-16 09:09:30 UTC
(In reply to Numan Siddique from comment #9)
> I am not really sure why you don't see the colocation information in the
> logs.

Not in the logs... in the stored configurations.
Maybe the report was just taken "too soon".

> I will do a fresh OVN deployment, reproduce the issue and share you the
> sosreports.

great

> 
> 
> 
> 
> > 
> > 
> > Comment #7 about redis has:
> > 
> > ip-172.16.2.5 with redis-bundle (score:INFINITY) (rsc-role:Started)
> > (with-rsc-role:Master)
> > 
> > where did that come from? I don't recall it being part of a standard
> > deployment
> 
> about redis : Yes, It's not part of the standard deployment. The reason I
> added comment 7 was to show that the issue can be reproduced with redis
> bundles as well.

Oh, sure. What is in the bundle is pretty much irrelevant.

> The issue is when a redis bundle resource is created and a
> colocation constraint is added, I was expecting the redis master to be
> running on the node where the colocation resource (i.e ip-172.16.2.5) is
> present. 

If it works, it would be the other way around... The IP would be placed where redis was a master.

I seem to recall making changes in that area recently.
Though I don't recall if I was fixing something or explicitly disallowing it.

Either way it would be worth making sure you have the latest packages.
Brew says pacemaker-1.1.16-12.el7_4.7 but that might just be in the queue and not released.


> But I don't see that happening. Please correct me if I my
> understand is wrong here.
> From the comment 7, you can see that ip-172.16.2.5 is running in
> controller-0 where as redis-master is running in controller-2.
> We are seeing a same/similar issue with OVN db resource.
> Michele suggested if the issue can be reproduced without OVN db, it would be
> easier to debug. 

Nah, I just need the cib (output from: cibadmin -Q) when the cluster is in that state.
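
For example, run on any cluster node while the problem is visible (the output file name here is arbitrary):

cibadmin -Q > cib-bad-state.xml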


Comment 11 Numan Siddique 2018-01-16 18:10:18 UTC
I was able to reproduce the issue with the latest deployment and get the sosreports.

The sosreports can be found here - https://github.com/numansiddique/pcs_logs/tree/master/new_ovn_logs/sosreport_ovndbs_bundle_issue

It worked fine and as expected right after the tripleo deployment: the ovn-dbs-bundle master and the colocation-constrained VIP resource ip-172.16.2.10 were on the same node, controller-0.

Then I stopped the cluster on controller-0. After this, ip-172.16.2.10 was started on controller-2, but the ovn-dbs-bundle master was promoted on controller-1.

The pacemaker version is - 

[root@overcloud-controller-1 heat-admin]# pacemakerd --version
Pacemaker 1.1.16-12.el7_4.5
Written by Andrew Beekhof

It's a CentOS-based tripleo master deployment; the pacemaker packages come from RDO.

Comment 12 Numan Siddique 2018-01-16 18:17:07 UTC
(In reply to Andrew Beekhof from comment #10)
> (In reply to Numan Siddique from comment #9)
> > I am not really sure why you don't see the colocation information in the
> > logs.
> 
> Not in the logs... in the stored configurations.
> Maybe the report was just taken "too soon"
> 
> > I will do a fresh OVN deployment, reproduce the issue and share you the
> > sosreports.
> 
> great
> 
> > 
> > 
> > 
> > 
> > > 
> > > 
> > > Comment #7 about redis has:
> > > 
> > > ip-172.16.2.5 with redis-bundle (score:INFINITY) (rsc-role:Started)
> > > (with-rsc-role:Master)
> > > 
> > > where did that come from? I don't recall it being part of a standard
> > > deployment
> > 
> > about redis : Yes, It's not part of the standard deployment. The reason I
> > added comment 7 was to show that the issue can be reproduced with redis
> > bundles as well.
> 
> Oh, sure. What is in the bundle is pretty much irrelevant.
> 
> > The issue is when a redis bundle resource is created and a
> > colocation constraint is added, I was expecting the redis master to be
> > running on the node where the colocation resource (i.e ip-172.16.2.5) is
> > present. 
> 
> If it works, it would be the other way around... The IP would be placed
> where redis was a master.
> 
> I seem to recall making changes in that area recently.
> Though I don't recall if I was fixing something or explicitly disallowing it.
> 
> Either way it would be worth making sure you have the latest packages.
> Brew says pacemaker-1.1.16-12.el7_4.7 but that might just be in the queue
> and not released.
> 

Some time back, I compiled pacemaker packages myself (I wanted to add some debug prints) and tested. I could still see the issue.

The name of the package was pacemaker-1.1.18-1.3d356c4.git.el7.centos.x86_64.rpm.

Before compiling, I switched to branch 1.1; the last commit at the time was
commit ddc8933dcae58461f71430cc2e237006715d600f
Merge: 2b07d5c 6843891
Author: Ken Gaillot <kgaillot>
Date:   Mon Dec 11 20:56:39 2017 -0600

    Merge pull request #1393 from kgaillot/fixes11
    
    Backport some master/2.0 fixes




Comment 14 Andrew Beekhof 2018-01-17 06:15:34 UTC
All good, I have the reproducer now.
Just need to fix it :)

Comment 17 Andrew Beekhof 2018-01-22 10:37:40 UTC
Fixed in https://github.com/beekhof/pacemaker/commit/373c057
Assigning back to HA team for backports and builds

Comment 25 pkomarov 2018-02-14 13:35:43 UTC
Verified,

Colocation is implemented correctly after the ovndb_servers-master switch:

#Check colocation: 

$ ansible controller-0 -m shell -b -a 'pcs config |grep "Colocation Constraints\|with ovndb_servers-master"' 

controller-0 | SUCCESS | rc=0 >>
Colocation Constraints:
  ip-172.17.0.18 with ovndb_servers-master (score:INFINITY) (rsc-role:Started) (with-rsc-role:Master)

#Before the failover:

$ ansible controller-0 -m shell -b -a 'pcs status' 

controller-0 | SUCCESS | rc=0 >>
Cluster name: tripleo_cluster
Stack: corosync
Current DC: controller-1 (version 1.1.16-12.el7_4.8-94ff4df) - partition with 
...
 ip-172.17.0.18	(ocf::heartbeat:IPaddr2):	Started controller-2
...
 Master/Slave Set: ovndb_servers-master [ovndb_servers]
     Masters: [ controller-2 ]
     Slaves: [ controller-0 controller-1 ]
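
The failover itself is not shown above; with these names it would have been triggered by stopping the cluster on the node hosting the master, along the lines of the following (a sketch; the exact command used is an assumption):

$ ansible controller-2 -m shell -b -a 'pcs cluster stop'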

#After failover:

$ ansible controller-0 -m shell -b -a 'pcs status' 

controller-0 | SUCCESS | rc=0 >>
Cluster name: tripleo_cluster
Stack: corosync
Current DC: controller-1 (version 1.1.16-12.el7_4.8-94ff4df)
 ...
 ip-172.17.0.18	(ocf::heartbeat:IPaddr2):	Started controller-1
...
 Master/Slave Set: ovndb_servers-master [ovndb_servers]
     Masters: [ controller-1 ]
     Slaves: [ controller-0 ]
     Stopped: [ controller-2 ]

Comment 26 pkomarov 2018-02-15 06:26:59 UTC
Verified, 

# pcs status |head -n 3
Cluster name: tripleo_cluster
Stack: corosync
Current DC: controller-1 (version 1.1.18-10.el7-2b07d5c5a9) - partition with quorum

pcs config:

Colocation Constraints:
  ip-172.17.0.11 with ovndb_servers-master (score:INFINITY) (rsc-role:Started) (with-rsc-role:Master)

Before ovndb_servers-master standby: 

 ip-172.17.0.11	(ocf::heartbeat:IPaddr2):	Started controller-0

 Master/Slave Set: ovndb_servers-master [ovndb_servers]
     Masters: [ controller-0 ]
     Slaves: [ controller-1 controller-2 ]

After ovndb_servers-master standby: 
 ip-172.17.0.11	(ocf::heartbeat:IPaddr2):	Started controller-2

 Master/Slave Set: ovndb_servers-master [ovndb_servers]
     Masters: [ controller-2 ]
     Slaves: [ controller-1 ]
     Stopped: [ controller-0 ]

Comment 31 errata-xmlrpc 2018-04-10 15:34:42 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2018:0860