Bug 1395959 - Web Interfaces Unusable
Summary: Web Interfaces Unusable
Keywords:
Status: CLOSED DUPLICATE of bug 1292858
Alias: None
Product: Red Hat Enterprise Linux 7
Classification: Red Hat
Component: pcs
Version: 7.2
Hardware: x86_64
OS: Linux
Priority: medium
Severity: unspecified
Target Milestone: rc
Assignee: Tomas Jelinek
QA Contact: cluster-qe@redhat.com
URL:
Whiteboard:
Depends On: 1292858
Blocks:
 
Reported: 2016-11-17 04:04 UTC by vlad.socaciu
Modified: 2017-02-20 13:53 UTC (History)
CC List: 9 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2017-02-20 13:53:54 UTC
Target Upstream Version:


Attachments (Terms of Use)
pacemaker log - fencing failure (3.62 KB, text/plain)
2017-01-11 12:40 UTC, Tomas Jelinek
no flags


Links
System ID Priority Status Summary Last Updated
Red Hat Bugzilla 1396462 None None None Never

Internal Links: 1396462

Description vlad.socaciu 2016-11-17 04:04:10 UTC
Description of problem: 

The cluster has two nodes, which are Red Hat virtual machines. After "powering off" one node, the pcsd web interfaces become unusable on both nodes. The web interface of the node which stayed up can no longer access the cluster. The web interface of the node which went down and came back up behaved differently: it first showed both nodes in a faulty condition (red color). After bringing the downed node back up and taking it down again, that same web interface shows the node which is up as being in a faulty condition, and the node which is down as running okay with all the resources on it (the resources are in fact running, but on the other node).

This may actually be a problem of communication between pcsd and the cluster software. We cannot isolate which component is the actual culprit.

But we saw this behavior several times within about a week. The only remedy is to reboot the nodes; restarting the cluster did not help.

Version-Release number of selected component (if applicable):

pacemaker-cli-1.1.13-10.el7_2.4.x86_64
pacemaker-libs-1.1.13-10.el7_2.4.x86_64
pacemaker-cluster-libs-1.1.13-10.el7_2.4.x86_64
pacemaker-1.1.13-10.el7_2.4.x86_64

corosync-2.3.4-7.el7_2.3.x86_64
corosynclib-2.3.4-7.el7_2.3.x86_64

Steps to Reproduce:

On a two node cluster, create about 100 lsb resources, start them constrained to one node, and bring that node down forcefully.
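A minimal sketch of the reproducer above, assuming hypothetical resource names (lsb-res1 ... lsb-res100) and a generic LSB agent; it prints the pcs commands as a dry run rather than executing them:

```shell
#!/bin/sh
# Dry-run sketch of the reproducer. The resource names and the LSB
# agent ("lsb:network") are placeholders, not taken from the actual
# cluster configuration.
NODE=node1
i=1
while [ "$i" -le 100 ]; do
    echo "pcs resource create lsb-res$i lsb:network"
    echo "pcs constraint location lsb-res$i prefers $NODE"
    i=$((i + 1))
done
# Finally the node would be powered off forcefully (e.g. from the
# hypervisor) to trigger the reported behavior.
```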

Comment 3 Tomas Jelinek 2017-01-11 12:39:49 UTC
I am not able to reproduce this with pcs-0.9.143-15.el7. Can you provide the pcs version you are using and your cluster configuration? Also, the output of pcs status from the moments when your cluster was running fine, when one node was down, and when both nodes were up again would be helpful.

I tried to reproduce the issue like this:
* Set up a two node cluster (node1, node2), both nodes being virtual machines.
* Set up and test fencing.
* Add some resources to the cluster and make them prefer node2.
* Open the web interface from both nodes.
* Force power off node2.
* The web interface loaded from node2 was unusable, which is expected since the node was not running at all.
* The web interface loaded from node1 worked just fine. I was able to see the status of the cluster and change its configuration. node1 was displayed as running, node2 as offline. Resources were displayed as running on node2 because that was what we got from pacemaker [1].
* After starting node2, everything worked fine.
* The behavior was the same after powering off node2 again.

Let me know if the reproducer is not accurate.



[1] pcs status showed node2 as "UNCLEAN (offline)" and the resource previously running on the node as "Started node2 (UNCLEAN)". Pacemaker tried to fence node2 but did not succeed; see the attachment.

My fencing configuration:
# pcs stonith --full
 Resource: xvmNode1 (class=stonith type=fence_xvm)
  Attributes: port=rh72-node1 pcmk_host_list=rh72-node1
  Operations: monitor interval=60s (xvmNode1-monitor-interval-60s)
 Resource: xvmNode2 (class=stonith type=fence_xvm)
  Attributes: port=rh72-node2 pcmk_host_list=rh72-node2
  Operations: monitor interval=60s (xvmNode2-monitor-interval-60s)
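To sanity-check that the fence devices answer at all, fence_xvm can be queried with its non-destructive "status" action. Below is a dry-run sketch that only prints the commands, using the port names from the configuration above:

```shell
#!/bin/sh
# Print (rather than execute) a non-destructive status query for each
# fence_xvm port from the stonith configuration above. "status" only
# queries the domain; it does not fence it.
for domain in rh72-node1 rh72-node2; do
    echo "fence_xvm -o status -H $domain"
done
```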

Comment 4 Tomas Jelinek 2017-01-11 12:40:27 UTC
Created attachment 1239418 [details]
pacemaker log - fencing failure

Comment 5 Tomas Jelinek 2017-01-11 12:44:11 UTC
Ken,

Can you take a look at the fencing issue described above and see if it is something we already know about? I am able to reproduce it consistently, so I can provide more logs if needed.

Thanks.

Comment 6 Ken Gaillot 2017-01-11 15:05:29 UTC
Resources should be reported as running on the old node (and should not be started on the remaining node) until fencing is confirmed, because only then are we sure they are not running.

If resources are still reported as running on the old node after fencing is confirmed, something's wrong.

Comment 7 Tomas Jelinek 2017-01-11 15:06:28 UTC
restoring needinfo

Comment 8 vlad.socaciu 2017-01-11 21:02:58 UTC
(In reply to Tomas Jelinek from comment #3)
> I am not able to reproduce this with pcs-0.9.143-15.el7. Can you provide pcs
> version you are using and your cluster configuration? Also output of pcs
> status from the moments when your cluster was running fine, when one node
> was down, and then when both nodes were up again would be helpful.
> 
> I tried to reproduce the issue like this:
> * Setup a two node cluster (node1, node2), both node being virtual machines.
> * Setup and test fencing.
> * Add some resources to the cluster and make them prefer node2.
> * Open the web interface from both nodes.
> * Force power off node2.
> * Web interface loaded from node2 was unusable, which is expected since the
> node was not running at all.
> * Web interface loaded form node1 worked just fine. I was able to see status
> of the cluster and change its configuration. Node A was displayed as
> running, node2 as offline. Resources were displayed as running on node2
> because that was what we got from pacemaker [1].
> * After starting node2 everything worked fine.
> * The behavior was the same after powering off the node2 again.
> 
> Let me know if the reproducer is not accurate.
> 
> 
> 
> [1] pcs status showed node2 as "UNCLEAN (offline)" and the resource
> previously running on the node as "Started node2 (UNCLEAN)". Pacemaker tried
> to fence node2 but did not succeeded, see the attachment.
> 
> My fencing configuration:
> # pcs stonith --full
>  Resource: xvmNode1 (class=stonith type=fence_xvm)
>   Attributes: port=rh72-node1 pcmk_host_list=rh72-node1
>   Operations: monitor interval=60s (xvmNode1-monitor-interval-60s)
>  Resource: xvmNode2 (class=stonith type=fence_xvm)
>   Attributes: port=rh72-node2 pcmk_host_list=rh72-node2
>   Operations: monitor interval=60s (xvmNode2-monitor-interval-60s)

Obviously, your experience did not match ours, back in November.

The pcs version is 0.9.143.

The fence device is "fence_vmware_soap - Fence agent for VMWare over SOAP API". Also:

pcs stonith --full
 Resource: vmfence (class=stonith type=fence_vmware_soap)
  Attributes: pcmk_host_map=uis1ccg1-app:uis1ccg1-vm;uis1ccg2-app:uis1ccg2-vm ipaddr=10.0.0.30 login=administrator passwd=2Securepw pcmk_monitor_timeout=120s pcmk_host_list=uis1ccg1-app,uis1ccg2-app ipport=443 inet4_only=1 ssl_insecure=1 power_wait=3 pcmk_host_check=static-list 
  Operations: monitor interval=60s (vmfence-monitor-interval-60s)
 Node: uis1ccg1-app
  Level 1 - vmfence
 Node: uis1ccg2-app
  Level 2 - vmfence

Note that we are using lsb resources, and I had created 100 of them at the time of the incident. The incident happened almost two months ago and it is hard for me to remember all the details. But I re-read the description and it seems accurate, as much as I can recall. Since then, we avoided using pcsd, so I cannot tell how reproducible the problem may be. Our goal is to work around cluster difficulties so that we can get our own work done, not to find circumstances which would make the cluster fail.

Sorry I cannot be of more help.

Comment 9 Tomas Jelinek 2017-01-23 12:48:53 UTC
It is expected that the web UI loaded from a node does not work while that node is down.

Pcs and pcsd get information about running resources from pacemaker. If pacemaker reports resources as running on a node that is down, it is likely your fencing is not working properly (comment 6). You may want to check the pacemaker logs to see what went wrong.

We are still unable to reproduce the issue you described. However, based on your report we found a similar issue: if one node in a cluster has port 2224 blocked but is otherwise running well, pcsd on the remaining nodes does not show the status of the cluster. In this case pcsd should time out when fetching status from the blocked node and provide data from the rest of the nodes.
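One way to check whether a node's pcsd port is blocked, with a bounded wait so the probe itself cannot hang, is a plain TCP connect with a timeout. A minimal sketch (the host argument is a placeholder, probed against localhost here):

```shell
#!/bin/sh
# Probe pcsd's TCP port 2224 with a 5-second cap, so a firewalled
# node cannot stall the check. Uses bash's /dev/tcp redirection for
# the connect attempt.
probe_pcsd() {
    host=$1
    if timeout 5 bash -c "cat < /dev/null > /dev/tcp/$host/2224" 2>/dev/null; then
        echo "$host: pcsd port reachable"
    else
        echo "$host: pcsd port blocked or down"
    fi
}
probe_pcsd 127.0.0.1
```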

Currently the communication layer is being overhauled as we switch from python and ruby libraries to curl. This will give us more options and better handling of error states, including timeouts (bz1292858). Once this is done, we will get back to this bz.

Comment 11 Tomas Jelinek 2017-02-20 13:53:54 UTC
The bug described in comment 9 has been fixed by using libcurl and implementing timeout handling (bz1292858).

We are still unable to reproduce the originally reported bug and it seems we will not get any more info from the reporter.

Therefore I am closing this as a duplicate of bz1292858. Feel free to reopen this bz if the issue occurs again.

*** This bug has been marked as a duplicate of bug 1292858 ***
