Red Hat Bugzilla – Attachment 891211 Details for Bug 1093111: Add a troubleshooting section to the HA documentation
Troubleshooting draft
troubleshoot.md (text/x-markdown), 6.84 KB, created by Alan Conway on 2014-04-30 15:58:28 UTC
# Troubleshooting a cluster

This section applies to clusters that use rgmanager as the cluster manager,
for example clusters using the Red Hat High Availability add-on.

## Authentication failures

If a broker is unable to establish a connection to another broker in the cluster
due to authentication problems, the log will contain SASL errors, for example:

    2012-aug-04 10:17:37 info SASL: Authentication failed: SASL(-13): user not found: Password verification failed

Set the SASL user name and password used to connect to other brokers with the
ha-username and ha-password properties when you start the broker, and set the
SASL mechanism with ha-mechanism. Any mechanism you enable for broker-to-broker
communication can also be used by a client, so do not enable
ha-mechanism=ANONYMOUS in a secure environment. Once the cluster is running,
run qpid-ha to make sure that the brokers are running as one cluster.
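For example, a minimal sketch of the relevant qpidd.conf entries, using
hypothetical credentials and assuming the PLAIN mechanism is acceptable for
broker-to-broker links on your network:

    # Hypothetical values; substitute credentials and a mechanism suitable for your site.
    ha-username=qpid_ha_user
    ha-password=qpid_ha_secret
    ha-mechanism=PLAIN

With the cluster up, qpid-ha status --all should show one active broker and the
remaining brokers ready; a broker that never leaves the joining state may
indicate a connection or authentication problem.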
## Slow recovery times

The following configuration settings affect recovery time. The values shown are
examples that give fast recovery on a lightly loaded system. You should run
tests to determine whether the values are appropriate for your system and load
conditions.

### cluster.conf

    <rm status_poll_interval="1">

status_poll_interval is the interval in seconds at which the resource manager
checks the status of managed services. This affects how quickly the manager
detects failed services.

    <ip address="20.0.20.200" monitor_link="yes" sleeptime="0"/>

This is a virtual IP address for client traffic. monitor_link="yes" means
monitor the health of the NIC used for the VIP. sleeptime="0" means don't delay
when failing over the VIP to a new address.

### qpidd.conf

    link-maintenance-interval=0.1

Interval at which backup brokers check the link to the primary and re-connect
if need be. The default is 2 seconds. It can be set lower for faster fail-over,
but setting it too low will result in excessive link-checking activity on the
broker.

    link-heartbeat-interval=5

Heartbeat interval for federation links. The HA cluster uses federation links
between the primary and each backup. The primary can take up to twice the
heartbeat interval to detect a failed backup. When a sender sends a message,
the primary waits for all backups to acknowledge before acknowledging to the
sender, so a disconnected backup may cause the primary to block senders until
it is detected via heartbeat.

This interval is also used as the timeout for broker status checks by
rgmanager, so it may take up to this interval for rgmanager to detect a hung
broker.

The default of 120 seconds is very high; you will probably want to set this to
a lower value. If it is set too low, a slow-to-respond broker may be restarted
by rgmanager under network congestion or heavy load.

## Total cluster failure

The cluster can only guarantee availability as long as at least one active
primary broker or ready backup broker is left alive. If all the brokers fail
simultaneously, the cluster fails and non-persistent data is lost.

To explain this better, note that a broker is in one of the following states:

- standalone: not part of an HA cluster.
- joining: newly started backup, not yet joined to the cluster.
- catch-up: backup has connected to the primary and is downloading queues, messages etc.
- ready: backup is connected and actively replicating from the primary; it is ready to take over.
- recovering: newly promoted to primary, waiting for backups to catch up before serving clients. Only a single primary broker can be recovering at a time.
- active: serving clients. Only a single primary broker can be active at a time.

While there is an active primary broker, clients can get service. If the active
primary fails, one of the "ready" backup brokers takes over, recovers and
becomes active. Note that a backup can only be promoted to primary if it is in
the "ready" state (with the exception of the first primary in a new cluster,
where all brokers are in the "joining" state).

Given a stable cluster of N brokers with one active primary and N-1 ready
backups, the system can sustain up to N-1 failures in rapid succession. The
surviving broker is promoted to active and continues to give service.

However, at this point the system _cannot_ sustain a failure of the surviving
broker until at least one of the other brokers recovers, catches up and becomes
a ready backup. If the surviving broker fails before that, the cluster fails in
one of two modes, depending on the exact timing of the failures.

### 1. The cluster hangs

All brokers are in the joining or catch-up state. rgmanager tries to promote a
new primary but cannot find any candidates and so gives up. clustat shows that
the qpidd services are running but the qpidd-primary service has stopped,
something like this:

    Service Name                     Owner (Last)    State
    ------------                     ------------    -----
    service:mrg33-qpidd-service      20.0.10.33      started
    service:mrg34-qpidd-service      20.0.10.34      started
    service:mrg35-qpidd-service      20.0.10.35      started
    service:qpidd-primary-service    (20.0.10.33)    stopped

Eventually all brokers become stuck in the "joining" state, as shown by
qpid-ha status --all.

At this point you need to restart the cluster in one of the following ways.

Restart the entire cluster:

- In luci:<your-cluster>:Nodes, click reboot to restart the entire cluster.
- OR stop and restart the cluster with ccs --stopall; ccs --startall

Restart just the Qpid services:

- In luci:<your-cluster>:Service Groups:
    - select all the qpidd (not primary) services and click restart,
    - then select the qpidd-primary service and click restart.
- OR stop the primary and qpidd services with clusvcadm, then restart them
  (primary last), as sketched below.
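A minimal sketch of the clusvcadm approach, using one possible ordering and the
hypothetical service names from the clustat output above:

    # Stop the primary service first, then the qpidd services.
    clusvcadm -d qpidd-primary-service
    clusvcadm -d mrg33-qpidd-service
    clusvcadm -d mrg34-qpidd-service
    clusvcadm -d mrg35-qpidd-service

    # Start the qpidd services again, starting the primary service last.
    clusvcadm -e mrg33-qpidd-service
    clusvcadm -e mrg34-qpidd-service
    clusvcadm -e mrg35-qpidd-service
    clusvcadm -e qpidd-primary-service

Here clusvcadm -d disables (stops) a cluster service and clusvcadm -e enables
(starts) it.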
### 2. The cluster reboots

A new primary is promoted and the cluster is functional, but all non-persistent
data from before the failure is lost.

## Fencing and network partitions

A network partition is a network failure that divides the cluster into two or
more sub-clusters, where each broker can communicate with brokers in its own
sub-cluster but not with brokers in other sub-clusters. This condition is also
referred to as a "split brain".

Nodes in one sub-cluster can't tell whether nodes in other sub-clusters are
dead or are still running but disconnected. We cannot allow each sub-cluster to
independently declare its own qpidd primary and start serving clients, as the
cluster would become inconsistent. We must ensure that only one sub-cluster
continues to provide service.

A _quorum_ determines which sub-cluster continues to operate, and _power
fencing_ ensures that nodes in non-quorate sub-clusters cannot attempt to
provide service inconsistently. For more information see chapters 2 (Quorum)
and 4 (Fencing) of
https://access.redhat.com/site/documentation/en-US/Red_Hat_Enterprise_Linux/6/html-single/High_Availability_Add-On_Overview/index.html
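For orientation, a power-fencing configuration in cluster.conf looks roughly
like the following sketch. It assumes IPMI-based fencing with the fence_ipmilan
agent and uses hypothetical node names, addresses and credentials; consult the
documentation above for the fence agent and options that match your hardware.

    <!-- Hypothetical sketch: one node's fence method and the matching fence device. -->
    <clusternode name="mrg33" nodeid="1">
      <fence>
        <method name="1">
          <device name="ipmi-mrg33"/>
        </method>
      </fence>
    </clusternode>
    <!-- ...repeat for the other cluster nodes... -->
    <fencedevices>
      <fencedevice name="ipmi-mrg33" agent="fence_ipmilan"
                   ipaddr="20.0.10.133" login="admin" passwd="password"/>
    </fencedevices>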