Created attachment 1151051 [details]
strace output (1 vCPU)

Running "pcs cluster stop --all" on a 3-node OpenStack controller cluster hangs. This is close to 100% reproducible on my system.

4 vCPU controllers --> HANG
4 vCPU controllers under strace --> COMPLETES
1 vCPU controllers --> HANG
1 vCPU controllers under strace --> HANG
Created attachment 1151052 [details] "pcs status" output before attempting "pcs cluster stop --all"
Note that the disabled resources shown in the "pcs status" output are normal. This is the state that the OSP 7 -> OSP 8 upgrade script puts the cluster into before trying to stop it.
Created attachment 1151053 [details] output of "pcs --debug cluster stop --all"
Can you turn on pcsd debugging on all nodes, run the command again, and attach /var/log/pcsd/pcsd.log and /var/log/pacemaker.log from all nodes from around the time when stopping gets stuck?

To enable pcsd debugging, change "PCSD_DEBUG=false" to "PCSD_DEBUG=true" in /etc/sysconfig/pcsd and restart the pcsd daemon.

What version of pcs, pacemaker and corosync do you have?

Thanks.
Ian, if you can share access to the environment, it might be faster to debug. Clearly there is something specific to your environment that's blocking, since upgrades are tested in CI and we are not seeing this problem.
(In reply to Tomas Jelinek from comment #5)
> Can you turn on pcsd debugging on all nodes, run the command again and
> attach /var/log/pcsd/pcsd.log and /var/log/pacemaker.log from all nodes from
> around the time when stopping gets stuck?
> 
> To enable pcsd debugging change "PCSD_DEBUG=false" to "PCSD_DEBUG=true" in
> /etc/sysconfig/pcsd and restart pcsd daemon.
> 
> What version of pcs, pacemaker and corosync do you have?

pcs-0.9.143-15.el7.x86_64
pacemaker-1.1.13-10.el7_2.2.x86_64
corosync-2.3.4-7.el7_2.1.x86_64

Logs coming ...
Created attachment 1151467 [details] controller-0-pacemaker.log
Created attachment 1151468 [details] controller-1-pacemaker.log
Created attachment 1151469 [details] controller-2-pacemaker.log
Created attachment 1151470 [details] controller-0-pcsd.log
Created attachment 1151471 [details] controller-1-pcsd.log
Created attachment 1151472 [details] controller-2-pcsd.log
Created attachment 1151475 [details] controller-0-corosync.log
I see what is going on:

There are virtual IPs running in the cluster which belong to the same network the cluster nodes communicate over. When pcs connects to the nodes to tell them to stop pacemaker, sometimes a virtual IP is used as the source address for the connection. pcs requests the nodes to stop and waits to see whether it succeeds. Each node stops pacemaker and reports success to pcs. However, since pacemaker is no longer running, the virtual IP has been removed from the host pcs is running on, so pcs never gets the responses and waits forever.

The thing is, when the IP is removed, the connection does not get closed, so pcs is not notified about it. When you kill pcs and run it again, pacemaker is already stopped, no IPs get removed, and so everything works.
(In reply to Tomas Jelinek from comment #21)
> I see what is going on:
> 
> There are virtual IPs running in the cluster, which belong to the same
> network cluster nodes communicate over. When pcs connects to the nodes to
> tell them to stop pacemaker, sometimes a virtual IP is used as a source
> address for the connection. pcs requests the nodes to stop and waits to see
> if it succeeds or not. Each node stops pacemaker and reports success to pcs.
> However since pacemaker is not running anymore, the virtual IP has been
> removed from pcs host, so pcs will never get the responses and waits forever.
> 
> The thing is, when the IP is removed, the connection does not get closed, so
> pcs is not notified about it. When you kill pcs and run it again, pacemaker
> is already stopped, no IPs get removed and so everything works.

Makes sense. Sounds like an upgrade script bug, then.

Changing product/component accordingly.

Thanks!
(In reply to Ian Pilcher from comment #22)
> (In reply to Tomas Jelinek from comment #21)
> > I see what is going on:
> > 
> > There are virtual IPs running in the cluster, which belong to the same
> > network cluster nodes communicate over. When pcs connects to the nodes to
> > tell them to stop pacemaker, sometimes a virtual IP is used as a source
> > address for the connection. pcs requests the nodes to stop and waits to see
> > if it succeeds or not. Each node stops pacemaker and reports success to pcs.
> > However since pacemaker is not running anymore, the virtual IP has been
> > removed from pcs host, so pcs will never get the responses and waits forever.
> > 
> > The thing is, when the IP is removed, the connection does not get closed, so
> > pcs is not notified about it. When you kill pcs and run it again, pacemaker
> > is already stopped, no IPs get removed and so everything works.
> 
> Makes sense. Sounds like an upgrade script bug, then.
> 
> Changing product/component accordingly.
> 
> Thanks!

I sort of disagree that it is an update script bug. This is a combination of problems.

pcs binds to 0.0.0.0 / :: by default for incoming connections, and that's fine. On outgoing sockets it's the kernel that decides which IP to use if none is specified. In this case a VIP is selected, but it's an unpredictable situation.

pcs could use the pcs auth information to bind the outgoing connection socket only to IPs that are available before pacemaker is started and that are configured in a predictable way.

The update script patch you propose can work around this issue, but it's not the correct fix IMHO.
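For illustration, a minimal Python sketch (not pcs code) of how to check which source IP the kernel would pick for a given destination when none is specified; the address in the usage comment is a placeholder:

import socket

def kernel_chosen_source_ip(dest_ip, dest_port=2224):
    # Connecting a UDP socket sends no packets; it only makes the kernel
    # select a route, so getsockname() reveals the source address that
    # would be used. If a virtual IP wins the selection, it shows up here.
    s = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    try:
        s.connect((dest_ip, dest_port))
        return s.getsockname()[0]
    finally:
        s.close()

# e.g. print(kernel_chosen_source_ip("192.0.2.11"))  # placeholder address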
The library we are currently using to connect to remote nodes does not allow specifying a source address. We would have to monkey-patch it or switch to a different library / means of sending HTTP requests.

We still need to know which IP is the right one to use as the source IP, though. Can you be more specific about your idea of using the pcs auth information to do that?
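As an illustration of the capability being discussed (Python's standard http.client, not necessarily a library pcs would adopt), pinning the source address of an outgoing request looks roughly like this; the URL path is a placeholder and certificate checking is relaxed only to keep the sketch short:

import http.client
import ssl

def ask_node_to_stop(node, source_ip, port=2224):
    # source_address=(ip, 0) pins the local end of the TCP connection to a
    # fixed, non-virtual IP; port 0 lets the kernel pick any local port.
    conn = http.client.HTTPSConnection(
        node,
        port,
        source_address=(source_ip, 0),
        timeout=30,
        context=ssl._create_unverified_context(),  # sketch only
    )
    try:
        conn.request("GET", "/placeholder/stop")  # placeholder path
        return conn.getresponse().status
    finally:
        conn.close()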
(In reply to Tomas Jelinek from comment #25)
> The library we are currently using to connect to remote nodes does not allow
> to specify a source address. We would have to monkey-patch it or switch to a
> different library / means of sending HTTP requests.
> 
> We still need to know which IP is the right one to be used as a source IP,
> though. Can you be more specific about your idea of using pcs auth
> information to do that?

When you create a cluster, one of the steps is to perform:

pcs cluster auth ... node list ...

Those would generally list all nodes in the cluster, including the local node. The local node's IP is known NOT to be a VIP since it's used before the cluster is even up. That IP can be used as the source address to talk to the other nodes.

I don't know what tech you are using, but I guess it can be changed/improved?
We don't want to store IP addresses together with hostnames anywhere outside DNS as it's unmanageable.

> Simple solution (thanks Poki):

Introduce a --srcaddr switch for a subset of pcs commands and a similar configuration option for pcsd where the outgoing IP could be specified manually.

> More complicated solution:

We could use the information from corosync.conf, as the nodes there are represented with valid hostnames (nobody's going to put virtual IPs in there), i.e. resolving the node names and finding a match in a list of IPs on all locally configured interfaces (assuming we can get such information easily). From here there are two outcomes:

1) If a match is found, we're going to bind the socket to that IP because we know it belongs to this machine and it is used by the cluster stack;

2) If no match is found or corosync.conf doesn't exist, then fall back to the kernel's default behavior. This can happen in two cases:

 a) the machine is not a cluster node, therefore no vIP
 b) the machine is running pacemaker-remote, where the use of the CLI is limited because of the missing corosync.conf

All of this requires rewriting pcs to use a different/more low-level network library than what we have now, for both CLI and GUI (or eventually a shared pcs lib in the future).

With this solution, however, the problem will still exist on pacemaker-remote nodes where the GUI/daemon is running, or when remote node management is implemented in the CLI eventually (a.k.a. pcs -h <host>).
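A rough sketch of the "more complicated solution", assuming the node names from corosync.conf have already been parsed into a list; a bind() test stands in for enumerating locally configured interfaces (illustration only, not pcs code):

import socket

def local_cluster_ip(corosync_node_names):
    # Return the first corosync.conf node address that is configured on
    # this host, or None to fall back to the kernel's default selection.
    for name in corosync_node_names:
        try:
            infos = socket.getaddrinfo(name, None, proto=socket.IPPROTO_TCP)
        except socket.gaierror:
            continue  # unresolvable name, try the next node
        for family, _socktype, _proto, _canon, sockaddr in infos:
            s = socket.socket(family, socket.SOCK_STREAM)
            try:
                # bind() only succeeds for addresses assigned to a local
                # interface, so a successful bind means "this entry is us".
                s.bind((sockaddr[0], 0))
                return sockaddr[0]
            except OSError:
                continue
            finally:
                s.close()
    return None  # not a cluster node or no match: use kernel default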
(In reply to Radek Steiger from comment #27)
> We don't want to store IP adresses together with hostname anywhere outside
> DNS as it's unmanageable.

Agreed, but you can use the hostname stored in the auth info + the current hostname to perform the bind. No need to hardcode anything at any time.

> > Simple solution (thanks Poki):
> 
> Introduce --srcaddr switch for a subset of pcs commands and a similar
> configuration option for pcsd where the outgoing IP could be specified
> manually.

Simple implementation-wise, but it simply moves the issue to the end user (not ideal).

> > More complicated solution:
> 
> We could use the information from corosync.conf as the nodes there are
> represented with valid hostnames (nobody's going to put virtual IPs in
> there), i.e. resolving the nodenames and finding a match in a list of IPs on
> all locally configured interfaces (assuming we can get such information
> easily). From here there are two outcomes:
> 
> 1) If a match is found, we're going to bind the socket to that IP because we
> know it belongs to this machine and it is used by the cluster stack;
> 
> 2) If no match is found or corosync.conf doesn't exist, then fallback to
> kernel default behavior. This can happen in two cases:
> 
> a) the machine is not a cluster node, therefore no vIP
> b) the machine is running pacemaker-remote, where the use of CLI is
> limited because of missing corosync.conf

I don't think we support #b 100% since it's a remote node anyway, so let's start with the simple use case and then extend as necessary.

> All of this requires rewriting pcs to use different/more low-level network
> library than what we have now, for both CLI and GUI (or eventually a shared
> pcs lib in the future).
> 
> With this solution however the problem will still exist on pacemaker-remote
> nodes where GUI/daemon is running or when remote node management is
> implemented in CLI eventually (a.k.a. pcs -h <host>).
I don't see it mentioned, but the following might be an acceptable workaround:

$ >/etc/gai.conf cat <<EOF
label ::1/128       0
label ::/0          1
label 2002::/16     2
label ::/96         3
label ::ffff:0:0/96 4
label ::ffff:<ADDRESS-OF-NETWORK-USED-FOR-CLUSTER-MANAGEMENT>/96      5
label ::ffff:<ADDRESS-OF-THE-LOCAL-HOST-WITHIN-THE-ABOVE-NETWORK>/96  5
precedence ::1/128       50
precedence ::/0          40
precedence 2002::/16     30
precedence ::/96         20
precedence ::ffff:0:0/96 10
EOF

I.e., we make an association between source and target address through matching labels (5), to avoid the source address getting assigned a cluster IP that is then kept hanging forever in case of inter-VM communication (IIUIC). Alternatively, the precedence could be tweaked.

Disclaimer: untested.
Should this bug be cloned, so that the pcs work can be tracked separately from the OSP workaround?
Ian,

Yes, I think that makes sense. I'll clone this bz.

Thanks,
Chris
re [comment 29]:

> to avoid source address getting assigned cluster IP that is then kept
> hanging forever in case of inter-VM communication (IIUIC)

I cannot find any reference to whether this is only related to the interconnect between VMs (KVM only, correct?) or whether physically connected bare metal machines are affected as well. I sincerely hope such missing circumstances will be figured out because:

- they are important for diagnosing possible future relapses
- this may actually be a bug in the particular virtualization platform -- can the behavior be tweaked in this regard? etc.
We're closing this bug - all of the related patches have been merged, and https://bugzilla.redhat.com/show_bug.cgi?id=1334429 was fixed some time ago, so there shouldn't be any engineering work left to do for this bug. Please re-open if you disagree.