Bug 1395959
| Field | Value |
|---|---|
| Summary | Web Interfaces Unusable |
| Product | Red Hat Enterprise Linux 7 |
| Component | pcs |
| Version | 7.2 |
| Hardware | x86_64 |
| OS | Linux |
| Status | CLOSED DUPLICATE |
| Severity | unspecified |
| Priority | medium |
| Reporter | vlad.socaciu |
| Assignee | Tomas Jelinek <tojeline> |
| QA Contact | cluster-qe <cluster-qe> |
| CC | ccaulfie, cfeist, cluster-maint, idevat, kgaillot, omular, rsteiger, tojeline, vlad.socaciu |
| Target Milestone | rc |
| Type | Bug |
| Last Closed | 2017-02-20 13:53:54 UTC |
| Bug Depends On | 1292858 |
| Attachments | pacemaker log - fencing failure (attachment 1239418) |
Description (vlad.socaciu, 2016-11-17 04:04:10 UTC)
Comment 3, Tomas Jelinek:

I am not able to reproduce this with pcs-0.9.143-15.el7. Can you provide the pcs version you are using and your cluster configuration? The output of pcs status from the moments when your cluster was running fine, when one node was down, and when both nodes were up again would also be helpful.

I tried to reproduce the issue like this:

* Set up a two node cluster (node1, node2), both nodes being virtual machines.
* Set up and test fencing.
* Add some resources to the cluster and make them prefer node2.
* Open the web interface from both nodes.
* Force power off node2.
* The web interface loaded from node2 was unusable, which is expected since the node was not running at all.
* The web interface loaded from node1 worked just fine. I was able to see the status of the cluster and change its configuration. Node1 was displayed as running, node2 as offline. Resources were displayed as running on node2 because that was what we got from pacemaker [1].
* After starting node2 everything worked fine.
* The behavior was the same after powering off node2 again.

Let me know if the reproducer is not accurate.

[1] pcs status showed node2 as "UNCLEAN (offline)" and the resource previously running on the node as "Started node2 (UNCLEAN)". Pacemaker tried to fence node2 but did not succeed, see the attachment. My fencing configuration:

    # pcs stonith --full
    Resource: xvmNode1 (class=stonith type=fence_xvm)
     Attributes: port=rh72-node1 pcmk_host_list=rh72-node1
     Operations: monitor interval=60s (xvmNode1-monitor-interval-60s)
    Resource: xvmNode2 (class=stonith type=fence_xvm)
     Attributes: port=rh72-node2 pcmk_host_list=rh72-node2
     Operations: monitor interval=60s (xvmNode2-monitor-interval-60s)

Created attachment 1239418 [details]
pacemaker log - fencing failure
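
For reference, a minimal sketch of the reproducer above using the pcs CLI (pcs 0.9.x syntax), assuming two libvirt guests named rh72-node1 and rh72-node2 with host-side fence_xvm already configured; the cluster name and the Dummy test resource are illustrative:

    # authenticate the nodes and create the cluster
    pcs cluster auth rh72-node1 rh72-node2 -u hacluster
    pcs cluster setup --name webui-test rh72-node1 rh72-node2 --start

    # one fence_xvm device per virtual machine
    pcs stonith create xvmNode1 fence_xvm port=rh72-node1 pcmk_host_list=rh72-node1 op monitor interval=60s
    pcs stonith create xvmNode2 fence_xvm port=rh72-node2 pcmk_host_list=rh72-node2 op monitor interval=60s

    # a test resource constrained to prefer node2
    pcs resource create dummy1 ocf:heartbeat:Dummy
    pcs constraint location dummy1 prefers rh72-node2

    # open https://rh72-node1:2224 and https://rh72-node2:2224 in a browser,
    # then force power off rh72-node2 on the host and watch both web UIs
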
Ken, can you take a look at the fencing issue described above and see whether it is something we already know about? I am able to reproduce it every time, so I can provide more logs if needed. Thanks.

Resources should be reported as running on the old node (and should not be started on the remaining node) until fencing is confirmed, because only then are we sure they are not running. If resources are still reported as running on the old node after fencing is confirmed, something is wrong.

restoring needinfo

(In reply to Tomas Jelinek from comment #3)
> I am not able to reproduce this with pcs-0.9.143-15.el7. Can you provide the pcs
> version you are using and your cluster configuration? The output of pcs status
> from the moments when your cluster was running fine, when one node was down,
> and when both nodes were up again would also be helpful.
>
> I tried to reproduce the issue like this:
> * Set up a two node cluster (node1, node2), both nodes being virtual machines.
> * Set up and test fencing.
> * Add some resources to the cluster and make them prefer node2.
> * Open the web interface from both nodes.
> * Force power off node2.
> * The web interface loaded from node2 was unusable, which is expected since the
>   node was not running at all.
> * The web interface loaded from node1 worked just fine. I was able to see the
>   status of the cluster and change its configuration. Node1 was displayed as
>   running, node2 as offline. Resources were displayed as running on node2
>   because that was what we got from pacemaker [1].
> * After starting node2 everything worked fine.
> * The behavior was the same after powering off node2 again.
>
> Let me know if the reproducer is not accurate.
>
> [1] pcs status showed node2 as "UNCLEAN (offline)" and the resource
> previously running on the node as "Started node2 (UNCLEAN)". Pacemaker tried
> to fence node2 but did not succeed, see the attachment.
>
> My fencing configuration:
> # pcs stonith --full
> Resource: xvmNode1 (class=stonith type=fence_xvm)
>  Attributes: port=rh72-node1 pcmk_host_list=rh72-node1
>  Operations: monitor interval=60s (xvmNode1-monitor-interval-60s)
> Resource: xvmNode2 (class=stonith type=fence_xvm)
>  Attributes: port=rh72-node2 pcmk_host_list=rh72-node2
>  Operations: monitor interval=60s (xvmNode2-monitor-interval-60s)

Obviously, your experience did not match ours back in November. The pcs version is 0.9.143. The fence device is "fence_vmware_soap - Fence agent for VMWare over SOAP API". Also:

    # pcs stonith --full
    Resource: vmfence (class=stonith type=fence_vmware_soap)
     Attributes: pcmk_host_map=uis1ccg1-app:uis1ccg1-vm;uis1ccg2-app:uis1ccg2-vm ipaddr=10.0.0.30 login=administrator passwd=2Securepw pcmk_monitor_timeout=120s pcmk_host_list=uis1ccg1-app,uis1ccg2-app ipport=443 inet4_only=1 ssl_insecure=1 power_wait=3 pcmk_host_check=static-list
     Operations: monitor interval=60s (vmfence-monitor-interval-60s)
    Node: uis1ccg1-app
     Level 1 - vmfence
    Node: uis1ccg2-app
     Level 2 - vmfence

Note that we are using lsb resources, and I had created 100 of them at the time of the incident. The incident happened almost two months ago and it is hard for me to remember all the details, but I re-read the description and it seems accurate, as far as I can recall. Since then we have avoided using pcsd, so I cannot tell how reproducible the problem may be. Our goal is to work around cluster difficulties so that we can get our own work done, not to find circumstances which would make the cluster fail. Sorry I cannot be of more help.
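
For completeness, a minimal sketch of how a fence_vmware_soap device with the attributes listed above would typically be created and exercised with pcs. This is a reconstruction from the pcs stonith --full output, not the reporter's actual commands, and the password is elided:

    pcs stonith create vmfence fence_vmware_soap \
        pcmk_host_map="uis1ccg1-app:uis1ccg1-vm;uis1ccg2-app:uis1ccg2-vm" \
        pcmk_host_list="uis1ccg1-app,uis1ccg2-app" pcmk_host_check=static-list \
        ipaddr=10.0.0.30 ipport=443 login=administrator passwd=<password> \
        ssl_insecure=1 inet4_only=1 power_wait=3 pcmk_monitor_timeout=120s \
        op monitor interval=60s

    # fencing levels matching the output above
    pcs stonith level add 1 uis1ccg1-app vmfence
    pcs stonith level add 2 uis1ccg2-app vmfence

    # verify that fencing actually works before relying on the web UI status
    pcs stonith fence uis1ccg2-app
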
It is expected that the web UI loaded from a node does not work if that node is down. Pcs and pcsd get information about running resources from pacemaker. If pacemaker reports resources running on a node which is not running, it is likely your fencing is not working well (comment 6). You may want to check the pacemaker logs to see what went wrong.

We are still unable to reproduce the issue you described. However, based on your report we found a similar issue: if one node in a cluster has port 2224 blocked but is otherwise running well, pcsd on the remaining nodes does not show the status of the cluster. In this case pcsd should time out when fetching status from the blocked node and provide data from the rest of the nodes. The communication layer is currently being overhauled as we switch from python and ruby libraries to curl. This will give us more options and better handling of error states, including timeouts (bz1292858). Once this is done we will get back to this bz.

The bug described in comment 9 has been fixed by using libcurl and implementing timeout handling (bz1292858). We are still unable to reproduce the originally reported bug and it seems we will not get any more info from the reporter. Therefore I am closing this as a duplicate of bz1292858. Feel free to reopen this bz if the issue occurs again.

*** This bug has been marked as a duplicate of bug 1292858 ***
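
For anyone who wants to try the related blocked-port scenario described above, a minimal sketch, assuming iptables is in use on the node being isolated (pcsd listens on TCP 2224 by default):

    # on the node to isolate: drop incoming pcsd traffic while the cluster keeps running
    iptables -A INPUT -p tcp --dport 2224 -j DROP

    # then load https://<other-node>:2224; with the libcurl-based timeout handling
    # the status view should still load once pcsd times out on the blocked node

    # restore normal traffic afterwards
    iptables -D INPUT -p tcp --dport 2224 -j DROP
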