Bug 1395959
| Field | Value |
|---|---|
| Summary | Web Interfaces Unusable |
| Product | Red Hat Enterprise Linux 7 |
| Component | pcs |
| Version | 7.2 |
| Hardware | x86_64 |
| OS | Linux |
| Status | CLOSED DUPLICATE |
| Severity | unspecified |
| Priority | medium |
| Reporter | vlad.socaciu |
| Assignee | Tomas Jelinek <tojeline> |
| QA Contact | cluster-qe <cluster-qe> |
| CC | ccaulfie, cfeist, cluster-maint, idevat, kgaillot, omular, rsteiger, tojeline, vlad.socaciu |
| Target Milestone | rc |
| Type | Bug |
| Last Closed | 2017-02-20 13:53:54 UTC |
| Bug Depends On | 1292858 |
| Attachments | pacemaker log - fencing failure (attachment 1239418) |
Description (vlad.socaciu, 2016-11-17 04:04:10 UTC)
Comment 3, Tomas Jelinek:

I am not able to reproduce this with pcs-0.9.143-15.el7. Can you provide the pcs version you are using and your cluster configuration? The output of pcs status from the moments when your cluster was running fine, when one node was down, and when both nodes were up again would also be helpful.

I tried to reproduce the issue like this:

* Set up a two node cluster (node1, node2), both nodes being virtual machines.
* Set up and test fencing.
* Add some resources to the cluster and make them prefer node2.
* Open the web interface from both nodes.
* Force power off node2.
* The web interface loaded from node2 was unusable, which is expected since the node was not running at all.
* The web interface loaded from node1 worked just fine. I was able to see the status of the cluster and change its configuration. Node1 was displayed as running, node2 as offline. Resources were displayed as running on node2 because that was what we got from pacemaker [1].
* After starting node2 everything worked fine.
* The behavior was the same after powering off node2 again.

Let me know if the reproducer is not accurate.

[1] pcs status showed node2 as "UNCLEAN (offline)" and the resource previously running on the node as "Started node2 (UNCLEAN)". Pacemaker tried to fence node2 but did not succeed, see the attachment. My fencing configuration:

    # pcs stonith --full
    Resource: xvmNode1 (class=stonith type=fence_xvm)
     Attributes: port=rh72-node1 pcmk_host_list=rh72-node1
     Operations: monitor interval=60s (xvmNode1-monitor-interval-60s)
    Resource: xvmNode2 (class=stonith type=fence_xvm)
     Attributes: port=rh72-node2 pcmk_host_list=rh72-node2
     Operations: monitor interval=60s (xvmNode2-monitor-interval-60s)

Created attachment 1239418 [details]
pacemaker log - fencing failure
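
For reference, a minimal sketch of the reproducer above using the pcs CLI (pcs 0.9.x syntax), assuming two libvirt guests named rh72-node1 and rh72-node2 with host-side fence_xvm already configured; the cluster name and the Dummy test resource are illustrative:

    # authenticate the nodes and create the cluster
    pcs cluster auth rh72-node1 rh72-node2 -u hacluster
    pcs cluster setup --name webui-test rh72-node1 rh72-node2 --start

    # one fence_xvm device per virtual machine
    pcs stonith create xvmNode1 fence_xvm port=rh72-node1 pcmk_host_list=rh72-node1 op monitor interval=60s
    pcs stonith create xvmNode2 fence_xvm port=rh72-node2 pcmk_host_list=rh72-node2 op monitor interval=60s

    # a test resource constrained to prefer node2
    pcs resource create dummy1 ocf:heartbeat:Dummy
    pcs constraint location dummy1 prefers rh72-node2

    # open https://rh72-node1:2224 and https://rh72-node2:2224 in a browser,
    # then force power off rh72-node2 on the host and watch both web UIs
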
Ken, can you take a look at the fencing issue described above and see whether it is something we already know about? I am able to reproduce it every time, so I can provide more logs if needed. Thanks.

Resources should be reported as running on the old node (and should not be started on the remaining node) until fencing is confirmed, because only then are we sure they are not running. If resources are still reported as running on the old node after fencing is confirmed, something is wrong.

restoring needinfo

(In reply to Tomas Jelinek from comment #3)
> I am not able to reproduce this with pcs-0.9.143-15.el7. Can you provide the pcs
> version you are using and your cluster configuration? The output of pcs status
> from the moments when your cluster was running fine, when one node was down,
> and when both nodes were up again would also be helpful.
>
> I tried to reproduce the issue like this:
> * Set up a two node cluster (node1, node2), both nodes being virtual machines.
> * Set up and test fencing.
> * Add some resources to the cluster and make them prefer node2.
> * Open the web interface from both nodes.
> * Force power off node2.
> * The web interface loaded from node2 was unusable, which is expected since the
>   node was not running at all.
> * The web interface loaded from node1 worked just fine. I was able to see the
>   status of the cluster and change its configuration. Node1 was displayed as
>   running, node2 as offline. Resources were displayed as running on node2
>   because that was what we got from pacemaker [1].
> * After starting node2 everything worked fine.
> * The behavior was the same after powering off node2 again.
>
> Let me know if the reproducer is not accurate.
>
> [1] pcs status showed node2 as "UNCLEAN (offline)" and the resource
> previously running on the node as "Started node2 (UNCLEAN)". Pacemaker tried
> to fence node2 but did not succeed, see the attachment.
>
> My fencing configuration:
> # pcs stonith --full
> Resource: xvmNode1 (class=stonith type=fence_xvm)
>  Attributes: port=rh72-node1 pcmk_host_list=rh72-node1
>  Operations: monitor interval=60s (xvmNode1-monitor-interval-60s)
> Resource: xvmNode2 (class=stonith type=fence_xvm)
>  Attributes: port=rh72-node2 pcmk_host_list=rh72-node2
>  Operations: monitor interval=60s (xvmNode2-monitor-interval-60s)

Obviously, your experience did not match ours back in November. The pcs version is 0.9.143. The fence device is "fence_vmware_soap - Fence agent for VMWare over SOAP API". Also:

    # pcs stonith --full
    Resource: vmfence (class=stonith type=fence_vmware_soap)
     Attributes: pcmk_host_map=uis1ccg1-app:uis1ccg1-vm;uis1ccg2-app:uis1ccg2-vm ipaddr=10.0.0.30 login=administrator passwd=2Securepw pcmk_monitor_timeout=120s pcmk_host_list=uis1ccg1-app,uis1ccg2-app ipport=443 inet4_only=1 ssl_insecure=1 power_wait=3 pcmk_host_check=static-list
     Operations: monitor interval=60s (vmfence-monitor-interval-60s)
    Node: uis1ccg1-app
     Level 1 - vmfence
    Node: uis1ccg2-app
     Level 2 - vmfence

Note that we are using lsb resources, and I had created 100 of them at the time of the incident. The incident happened almost two months ago and it is hard for me to remember all the details, but I re-read the description and it seems accurate, as far as I can recall. Since then we have avoided using pcsd, so I cannot tell how reproducible the problem may be. Our goal is to work around cluster difficulties so that we can get our own work done, not to find circumstances which would make the cluster fail. Sorry I cannot be of more help.
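
For completeness, a minimal sketch of how a fence_vmware_soap device with the attributes listed above would typically be created and exercised with pcs. This is a reconstruction from the pcs stonith --full output, not the reporter's actual commands, and the password is elided:

    pcs stonith create vmfence fence_vmware_soap \
        pcmk_host_map="uis1ccg1-app:uis1ccg1-vm;uis1ccg2-app:uis1ccg2-vm" \
        pcmk_host_list="uis1ccg1-app,uis1ccg2-app" pcmk_host_check=static-list \
        ipaddr=10.0.0.30 ipport=443 login=administrator passwd=<password> \
        ssl_insecure=1 inet4_only=1 power_wait=3 pcmk_monitor_timeout=120s \
        op monitor interval=60s

    # fencing levels matching the output above
    pcs stonith level add 1 uis1ccg1-app vmfence
    pcs stonith level add 2 uis1ccg2-app vmfence

    # verify that fencing actually works before relying on the web UI status
    pcs stonith fence uis1ccg2-app
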
It is expected that the web UI loaded from a node does not work if that node is down. Pcs and pcsd get information about running resources from pacemaker. If pacemaker reports resources running on a node which is not running, it is likely your fencing is not working well (comment 6). You may want to check the pacemaker logs to see what went wrong.

We are still unable to reproduce the issue you described. However, based on your report we found a similar issue: if one node in a cluster has port 2224 blocked but is otherwise running well, pcsd on the remaining nodes does not show the status of the cluster. In this case pcsd should time out when fetching status from the blocked node and provide data from the rest of the nodes. The communication layer is currently being overhauled as we switch from python and ruby libraries to curl. This will give us more options and better handling of error states, including timeouts (bz1292858). Once this is done we will get back to this bz.

The bug described in comment 9 has been fixed by using libcurl and implementing timeout handling (bz1292858). We are still unable to reproduce the originally reported bug and it seems we will not get any more info from the reporter. Therefore I am closing this as a duplicate of bz1292858. Feel free to reopen this bz if the issue occurs again.

*** This bug has been marked as a duplicate of bug 1292858 ***
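
For anyone who wants to try the related blocked-port scenario described above, a minimal sketch, assuming iptables is in use on the node being isolated (pcsd listens on TCP 2224 by default):

    # on the node to isolate: drop incoming pcsd traffic while the cluster keeps running
    iptables -A INPUT -p tcp --dport 2224 -j DROP

    # then load https://<other-node>:2224; with the libcurl-based timeout handling
    # the status view should still load once pcsd times out on the blocked node

    # restore normal traffic afterwards
    iptables -D INPUT -p tcp --dport 2224 -j DROP
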