Bug 1330688 - "pcs cluster stop --all" hangs
Summary: "pcs cluster stop --all" hangs
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat OpenStack
Classification: Red Hat
Component: openstack-tripleo-heat-templates
Version: 8.0 (Liberty)
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: unspecified
Target Milestone: async
Target Release: 8.0 (Liberty)
Assignee: Jiri Stransky
QA Contact: Arik Chernetsky
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2016-04-26 17:29 UTC by Ian Pilcher
Modified: 2017-10-03 15:21 UTC
CC List: 14 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
: 1334429 (view as bug list)
Environment:
Last Closed: 2017-10-03 15:21:39 UTC
Target Upstream Version:
Embargoed:


Attachments
strace output (1 vCPU) (2.93 MB, text/plain), 2016-04-26 17:29 UTC, Ian Pilcher
"pcs status" output before attempting "pcs cluster stop --all" (6.47 KB, text/plain), 2016-04-26 17:32 UTC, Ian Pilcher
output of "pcs --debug cluster stop --all" (4.67 KB, text/plain), 2016-04-26 17:35 UTC, Ian Pilcher


Links
Launchpad 1577570: 2016-05-02 21:06:17 UTC
OpenStack gerrit 311860: MERGED, "Disable VIPs before stopping cluster during version upgrade", 2020-06-04 15:52:29 UTC
OpenStack gerrit 312454: MERGED, "Disable VIPs before stopping cluster during version upgrade", 2020-06-04 15:52:29 UTC
OpenStack gerrit 312455: MERGED, "Disable VIPs before stopping cluster during version upgrade", 2020-06-04 15:52:29 UTC

Description Ian Pilcher 2016-04-26 17:29:34 UTC
Created attachment 1151051 [details]
strace output (1 vCPU)

Running "pcs cluster stop --all" on a 3-node OpenStack controller cluster hangs.  This is close to 100% reproducible on my system.

  4 vCPU controllers --> HANG
  4 vCPU controllers under strace --> COMPLETES
  1 vCPU controllers --> HANG
  1 vCPU controllers under strace --> HANG
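
For reference, the attached strace output was presumably captured with something along these lines (the output path is made up):

  strace -f -o /tmp/pcs-cluster-stop.strace pcs cluster stop --all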

Comment 1 Ian Pilcher 2016-04-26 17:32:14 UTC
Created attachment 1151052 [details]
"pcs status" output before attempting "pcs cluster stop --all"

Comment 2 Ian Pilcher 2016-04-26 17:33:38 UTC
Note that the disabled resources shown in the "pcs status" output are normal.  This is the state that the OSP 7 -> OSP 8 upgrade script puts the cluster into before trying to stop it.

Comment 3 Ian Pilcher 2016-04-26 17:35:16 UTC
Created attachment 1151053 [details]
output of "pcs --debug cluster stop --all"

Comment 5 Tomas Jelinek 2016-04-27 08:20:33 UTC
Can you turn on pcsd debugging on all nodes, run the command again and attach /var/log/pcsd/pcsd.log and /var/log/pacemaker.log from all nodes from around the time when stopping gets stuck?

To enable pcsd debugging change "PCSD_DEBUG=false" to "PCSD_DEBUG=true" in /etc/sysconfig/pcsd and restart pcsd daemon.
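
A minimal sketch of those two steps, assuming the PCSD_DEBUG line already exists in the file and pcsd is managed by systemd (run on every node):

  # flip the debug flag and restart the daemon
  sed -i 's/^PCSD_DEBUG=false$/PCSD_DEBUG=true/' /etc/sysconfig/pcsd
  systemctl restart pcsd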

What version of pcs, pacemaker and corosync do you have?

Thanks.

Comment 6 Fabio Massimo Di Nitto 2016-04-27 09:01:59 UTC
Ian if you can share access to the environment, it might be faster to debug.

Clearly there is something specific to your environment that's blocking, since upgrades are tested in CI and we are not seeing this problem.

Comment 7 Ian Pilcher 2016-04-27 16:24:23 UTC
(In reply to Tomas Jelinek from comment #5)
> Can you turn on pcsd debugging on all nodes, run the command again and
> attach /var/log/pcsd/pcsd.log and /var/log/pacemaker.log from all nodes from
> around the time when stopping gets stuck?
> 
> To enable pcsd debugging change "PCSD_DEBUG=false" to "PCSD_DEBUG=true" in
> /etc/sysconfig/pcsd and restart pcsd daemon.
> 
> What version of pcs, pacemaker and corosync do you have?

pcs-0.9.143-15.el7.x86_64
pacemaker-1.1.13-10.el7_2.2.x86_64
corosync-2.3.4-7.el7_2.1.x86_64

Logs coming ...

Comment 8 Ian Pilcher 2016-04-27 16:25:17 UTC
Created attachment 1151467 [details]
controller-0-pacemaker.log

Comment 9 Ian Pilcher 2016-04-27 16:25:48 UTC
Created attachment 1151468 [details]
controller-1-pacemaker.log

Comment 10 Ian Pilcher 2016-04-27 16:26:19 UTC
Created attachment 1151469 [details]
controller-2-pacemaker.log

Comment 11 Ian Pilcher 2016-04-27 16:26:49 UTC
Created attachment 1151470 [details]
controller-0-pcsd.log

Comment 12 Ian Pilcher 2016-04-27 16:27:23 UTC
Created attachment 1151471 [details]
controller-1-pcsd.log

Comment 13 Ian Pilcher 2016-04-27 16:27:53 UTC
Created attachment 1151472 [details]
controller-2-pcsd.log

Comment 16 Ian Pilcher 2016-04-27 16:46:50 UTC
Created attachment 1151475 [details]
controller-0-corosync.log

Comment 21 Tomas Jelinek 2016-04-28 14:07:14 UTC
I see what is going on:

There are virtual IPs running in the cluster which belong to the same network the cluster nodes communicate over. When pcs connects to the nodes to tell them to stop pacemaker, sometimes a virtual IP is used as the source address for the connection. pcs requests the nodes to stop and waits to see whether it succeeds or not. Each node stops pacemaker and reports success to pcs. However, since pacemaker is no longer running, the virtual IP has been removed from the host pcs is running on, so pcs never gets the responses and waits forever.

The thing is, when the IP is removed, the connection does not get closed, so pcs is not notified about it. When you kill pcs and run it again, pacemaker is already stopped, no IPs get removed, and so everything works.
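
A rough way to see this from a node (addresses below are purely illustrative): ask the kernel which source address it would pick for traffic to another node.

  ip route get 192.0.2.12
  # e.g.  192.0.2.12 dev eth0 src 192.0.2.50
  # if "src" above is a virtual IP managed by pacemaker, it disappears when
  # the cluster stops and the replies to pcs are lost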

Comment 22 Ian Pilcher 2016-04-28 14:52:59 UTC
(In reply to Tomas Jelinek from comment #21)
> I see what is going on:
> 
> There are virtual IPs running in the cluster, which belong to the same
> network cluster nodes communicate over. When pcs connects to the nodes to
> tell them to stop pacemaker, sometimes a virtual IP is used as a source
> address for the connection. pcs requests the nodes to stop and waits to see
> if it succeeds or not. Each node stops pacemaker and reports success to pcs.
> However since pacemaker is not running anymore, the virtual IP has been
> removed from pcs host, so pcs will never get the responses and waits forever.
> 
> The thing is, when the IP is removed, the connection does not get closed, so
> pcs is not notified about it. When you kill pcs and run it again, pacemaker
> is already stopped, no IPs get removed and so everything works.

Makes sense.  Sounds like an upgrade script bug, then.

Changing product/component accordingly.

Thanks!

Comment 24 Fabio Massimo Di Nitto 2016-05-03 06:09:43 UTC
(In reply to Ian Pilcher from comment #22)
> (In reply to Tomas Jelinek from comment #21)
> > I see what is going on:
> > 
> > There are virtual IPs running in the cluster, which belong to the same
> > network cluster nodes communicate over. When pcs connects to the nodes to
> > tell them to stop pacemaker, sometimes a virtual IP is used as a source
> > address for the connection. pcs requests the nodes to stop and waits to see
> > if it succeeds or not. Each node stops pacemaker and reports success to pcs.
> > However since pacemaker is not running anymore, the virtual IP has been
> > removed from pcs host, so pcs will never get the responses and waits forever.
> > 
> > The thing is, when the IP is removed, the connection does not get closed, so
> > pcs is not notified about it. When you kill pcs and run it again, pacemaker
> > is already stopped, no IPs get removed and so everything works.
> 
> Makes sense.  Sounds like an upgrade script bug, then.
> 
> Changing product/component accordingly.
> 
> Thanks!

I sort of disagree that it is an update script bug. This is a combination of problems.

pcs binds to 0.0.0.0 / :: by default for incoming connections, and that's fine. For outgoing sockets, it's the kernel that decides which source IP to use if none is specified.

In this case a VIP gets selected, but it's not a predictable situation.

pcs could use the pcs auth information to bind outgoing connection sockets only to IPs that are available before pacemaker is started and that are configured in a predictable way.

The update script patch you propose can work around this issue, but it's not the correct fix IMHO.
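
For context, the merged tripleo-heat-templates patches take roughly that approach: disable the VIPs before stopping the cluster during the upgrade. A hypothetical sequence (resource name is made up; the actual upgrade script determines the VIP resources from the deployment):

  pcs resource disable ip-192.0.2.10
  pcs cluster stop --all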

Comment 25 Tomas Jelinek 2016-05-03 08:44:58 UTC
The library we are currently using to connect to remote nodes does not allow specifying a source address. We would have to monkey-patch it or switch to a different library / means of sending HTTP requests.
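
As an illustration of what pinning the source address looks like with a lower-level HTTP client (not how pcs does it today; the address and hostname are made up, and 2224 is pcsd's default port):

  # originate the request from the node's static cluster address instead of
  # letting the kernel pick one
  curl --interface 192.0.2.11 -k https://controller-1.example.com:2224/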

We still need to know which IP is the right one to be used as a source IP, though. Can you be more specific about your idea of using pcs auth information to do that?

Comment 26 Fabio Massimo Di Nitto 2016-05-03 09:24:00 UTC
(In reply to Tomas Jelinek from comment #25)
> The library we are currently using to connect to remote nodes does not allow
> to specify a source address. We would have to monkey-patch it or switch to a
> different library / means of sending HTTP requests.
> 
> We still need to know which IP is the right one to be used as a source IP,
> though. Can you be more specific about your idea of using pcs auth
> information to do that?

When you create a cluster, one of the steps is to perform:

pcs cluster auth ... node list ...

those would generally list all nodes in the cluster, including the local node. That IP is known NOT to be a VIP since it's used before cluster is even up. That IP can be used as source address to talk to other nodes.

I don't know what tech you are using, but I guess it can be changed/improved?
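
For reference, with the pcs 0.9.x in this environment that auth step looks roughly like the following (node names and user are illustrative):

  pcs cluster auth controller-0 controller-1 controller-2 -u hacluster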

Comment 27 Radek Steiger 2016-05-03 15:25:44 UTC
We don't want to store IP addresses together with hostnames anywhere outside DNS as it's unmanageable.

> Simple solution (thanks Poki):

Introduce --srcaddr switch for a subset of pcs commands and a similar configuration option for pcsd where the outgoing IP could be specified manually.
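
A hypothetical invocation of the proposed switch, just to make the idea concrete (--srcaddr does not exist in pcs today and the address is made up):

  pcs --srcaddr 192.0.2.11 cluster stop --all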


> More complicated solution:

We could use the information from corosync.conf as the nodes there are represented with valid hostnames (nobody's going to put virtual IPs in there), i.e. resolving the nodenames and finding a match in a list of IPs on all locally configured interfaces (assuming we can get such information easily). From here there are two outcomes:

1) If a match is found, we're going to bind the socket to that IP because we know it belongs to this machine and it is used by the cluster stack;

2) If no match is found or corosync.conf doesn't exist, then fallback to kernel default behavior. This can happen in two cases:

  a) the machine is not a cluster node, therefore no vIP
  b) the machine is running pacemaker-remote, where the use of CLI is limited because of missing corosync.conf

All of this requires rewriting pcs to use different/more low-level network library than what we have now, for both CLI and GUI (or eventually a shared pcs lib in the future).

With this solution however the problem will still exist on pacemaker-remote nodes where GUI/daemon is running or when remote node management is implemented in CLI eventually (a.k.a. pcs -h <host>).
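
A rough shell sketch of the matching logic in option 1 above (the corosync.conf parsing and the resolution step are simplified assumptions, not the eventual pcs implementation):

  # list every IPv4/IPv6 address configured on local interfaces
  local_addrs=$(ip -o addr show | awk '{print $4}' | cut -d/ -f1)
  # resolve each node name found in corosync.conf and look for a local match
  for node in $(awk '/ring0_addr/ {print $2}' /etc/corosync/corosync.conf); do
      for addr in $(getent ahosts "$node" | awk '{print $1}' | sort -u); do
          if echo "$local_addrs" | grep -qxF "$addr"; then
              echo "bind outgoing connections to $addr"
          fi
      done
  done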

Comment 28 Fabio Massimo Di Nitto 2016-05-03 15:56:48 UTC
(In reply to Radek Steiger from comment #27)
> We don't want to store IP adresses together with hostname anywhere outside
> DNS as it's unmanageable.
> 

agreed, but you can use the hostnames stored in the auth info plus the current hostname to perform the bind. No need to hardcode anything at any time.

> > Simple solution (thanks Poki):
> 
> Introduce --srcaddr switch for a subset of pcs commands and a similar
> configuration option for pcsd where the outgoing IP could be specified
> manually.
> 

Simple implementation-wise, but it simply moves the issue to the end user (not ideal).

> 
> > More complicated solution:
> 
> We could use the information from corosync.conf as the nodes there are
> represented with valid hostnames (nobody's going to put virtual IPs in
> there), i.e. resolving the nodenames and finding a match in a list of IPs on
> all locally configured interfaces (assuming we can get such information
> easily). From here there are two outcomes:
> 
> 1) If a match is found, we're going to bind the socket to that IP because we
> know it belongs to this machine and it is used by the cluster stack;
> 
> 2) If no match is found or corosync.conf doesn't exist, then fallback to
> kernel default behavior. This can happen in two cases:
> 
>   a) the machine is not a cluster node, therefore no vIP
>   b) the machine is running pacemaker-remote, where the use of CLI is
> limited because of missing corosync.conf

I don't think we support #b 100% since it's a remote node anyway, so let's start with the simple use case and then extend as necessary.

> 
> All of this requires rewriting pcs to use different/more low-level network
> library than what we have now, for both CLI and GUI (or eventually a shared
> pcs lib in the future).
> 
> With this solution however the problem will still exist on pacemaker-remote
> nodes where GUI/daemon is running or when remote node management is
> implemented in CLI eventually (a.k.a. pcs -h <host>).

Comment 29 Jan Pokorný [poki] 2016-05-09 12:31:53 UTC
I don't see it mentioned, but the following might be an acceptable workaround:

$ cat >/etc/gai.conf <<EOF
label  ::1/128       0
label  ::/0          1
label  2002::/16     2
label  ::/96         3
label  ::ffff:0:0/96 4
label  ::ffff:<ADDRESS-OF-NETWORK-USED-FOR-CLUSTER-MANAGEMENT>/96  5
label  ::ffff:<ADDRESS-OF-THE-LOCAL-HOST-WITHIN-THE-ABOVE-NETWORK>/96  5
precedence  ::1/128       50
precedence  ::/0          40
precedence  2002::/16     30
precedence  ::/96         20
precedence  ::ffff:0:0/96 10
EOF

I.e., we make an association between source and target address through
matching labels (5), to avoid source address getting assigned cluster IP
that is then kept hanging forever in case of inter-VM communication
(IIUIC).

Alternatively, the precedence could be tweaked.

Disclaimer: untested.

Comment 30 Ian Pilcher 2016-05-09 14:48:10 UTC
Should this bug be cloned, so that the pcs work can be tracked separately from the OSP workaround?

Comment 31 Chris Feist 2016-05-09 15:21:23 UTC
Ian,

Yes, I think that makes sense.  I'll clone this bz.

Thanks,
Chris

Comment 33 Jan Pokorný [poki] 2017-04-12 17:04:47 UTC
re [comment 29]:
> to avoid source address getting assigned cluster IP that is then kept
> hanging forever in case of inter-VM communication (IIUIC)

I cannot find any reference saying whether this was only related to the interconnect between VMs (KVM only, correct?) or whether physically connected bare metal machines are affected as well.  I sincerely hope such missing circumstances will be figured out, because:

- they are important for diagnosing possible future relapses

- this may actually be a bug in the particular virtualization
  platform -- can the behavior be tweaked in this regard? etc.

Comment 34 Chris Jones 2017-10-03 15:21:39 UTC
We're closing this bug - all of the related patches have merged, and https://bugzilla.redhat.com/show_bug.cgi?id=1334429 was fixed some time ago, so there shouldn't be any engineering work left to do here. Please re-open if you disagree.

