Bug 807398 - Endpoint updating for HA configurations
Status: CLOSED ERRATA
Product: Red Hat Enterprise MRG
Classification: Red Hat
Component: condor-aviary
Version: Development
Hardware: All Linux
Priority: high
Severity: medium
Target Milestone: 2.3
Target Release: ---
Assigned To: Pete MacKinnon
QA Contact: Tomas Rusnak
Depends On: 871080
Blocks:
Reported: 2012-03-27 12:58 EDT by Robert Rati
Modified: 2013-03-06 13:43 EST

See Also:
Fixed In Version: condor-7.8.2-0.1
Doc Type: Bug Fix
Doc Text:
Cause: Aviary Locator behaviour when the Aviary Schedd plug-in and Query Server are deployed in a HA group.
Consequence: Aviary clients would have experienced stale endpoint references for a longer duration than necessary.
Fix: Adjustments were made in the Locator implementation to quickly replace a failed endpoint reference with its new one.
Result: An Aviary client using a Schedd or Query Server endpoint will now be able to retrieve the new endpoint faster.
Last Closed: 2013-03-06 13:43:16 EST
Description Robert Rati 2012-03-27 12:58:20 EDT
Description of problem:
Providing high availability for the Query Server through Red Hat HA requires endpoint updating, so that a Query Server can still be located after a failover. Without this, an Aviary client would likely lose track of which machine is running the Query Server once the daemon has failed over to another machine in the cluster.

Comment 2 Martin Kudlej 2012-04-04 04:38:20 EDT
How can I test this, please?
Comment 6 Pete MacKinnon 2012-04-26 08:49:32 EDT
Verification advice:

2.2 HA process groups will include the Schedd and the Query Server (QS) which can be configured to publish their SOAP endpoints using the new location feature.

1) Do these endpoints correctly re-locate in a failover scenario as evidenced by a SOAP tool such as locator.py found in /usr/share/condor/aviary?

2) Are there multiple redundant entries for a particular endpoint when there should only be one?

3) Is the endpoint listed after failover actually reachable, or is it a stale reference to the old location (host:port)?

4) Are old references from crashed (e.g., kill -9) process endpoints removed or replaced in a timely manner, that is, within (AVIARY_LOCATOR_MISSED_UPDATES+1) * AVIARY_LOCATOR_PRUNE_INTERVAL seconds?
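
Checks 3 and 4 above could be scripted roughly as follows. This is only a sketch: the function names are made up for illustration, the values 2 and 20 are placeholders rather than known defaults (read the real settings from the condor configuration), and the reachability test only confirms that a TCP connection can be opened, not that the SOAP service answers.

```python
import socket
from urllib.parse import urlsplit


def max_stale_seconds(missed_updates, prune_interval):
    """Bound from check 4: (AVIARY_LOCATOR_MISSED_UPDATES + 1) * AVIARY_LOCATOR_PRUNE_INTERVAL."""
    if missed_updates < 0 or prune_interval <= 0:
        raise ValueError("missed_updates must be >= 0 and prune_interval must be > 0")
    return (missed_updates + 1) * prune_interval


def is_reachable(url, timeout=3.0):
    """Check 3: return True if a TCP connection to the endpoint's host:port succeeds."""
    parts = urlsplit(url)
    port = parts.port or 80  # fall back to the HTTP default when the URL has no port
    try:
        with socket.create_connection((parts.hostname, port), timeout=timeout):
            return True
    except OSError:
        return False


# Placeholder values, NOT confirmed defaults.
deadline = max_stale_seconds(missed_updates=2, prune_interval=20)
print("a stale endpoint should be gone within", deadline, "seconds")
```

After killing a daemon, one would wait at most `deadline` seconds, rerun locator.py, and pass each listed URL to `is_reachable`.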
Comment 7 Pete MacKinnon 2012-05-02 15:00:21 EDT
    Technical note added. If any revisions are required, please edit the "Technical Notes" field
    accordingly. All revisions will be proofread by the Engineering Content Services team.
    
    New Contents:
Cause: Aviary Locator behaviour when the Aviary Schedd plug-in and Query Server are deployed in a HA group.
Consequence: Aviary clients would have experienced stale endpoint references for a longer duration than necessary.
Fix: Adjustments were made in the Locator implementation to quickly replace a failed endpoint reference with its new one.
Result: An Aviary client using a Schedd or Query Server endpoint will now be able to retrieve the new endpoint faster.
Comment 10 Tomas Rusnak 2013-01-09 05:58:34 EST
# ./locator.py --type=ANY -s -r=/etc/condor/certs/ca.crt -k=/etc/condor/certs/client.key -c=/etc/condor/certs/client.crt 
CUSTOM | QUERY_SERVER | ha-schedd-HASchedd1@ | http://node2:45039/services/query/
CUSTOM | QUERY_SERVER | ha-schedd-HASchedd2@ | http://node2:37425/services/query/
CUSTOM | QUERY_SERVER | ha-schedd-HASchedd3@ | http://node2:48322/services/query

# clusvcadm -r "HA Schedd HASchedd1" -m node1
Trying to relocate service:HA Schedd HASchedd1 to node1...Success
service:HA Schedd HASchedd1 is now running on node1

# ./locator.py --type=ANY -s -r=/etc/condor/certs/ca.crt -k=/etc/condor/certs/client.key -c=/etc/condor/certs/client.crt 
CUSTOM | QUERY_SERVER | ha-schedd-HASchedd1@ | http://node1:50697/services/query/
CUSTOM | QUERY_SERVER | ha-schedd-HASchedd2@ | http://node2:37425/services/query/
CUSTOM | QUERY_SERVER | ha-schedd-HASchedd3@ | http://node2:48322/services/query/


# ps ax | grep -i aviary
25232 ?        S<     0:00 aviary_query_server -pidfile /var/run/condor/aviary_query_server-HASchedd1_query_server.pid -local-name HASchedd1_query_server

# kill -9 25232

# ./locator.py --type=ANY -s -r=/etc/condor/certs/ca.crt -k=/etc/condor/certs/client.key -c=/etc/condor/certs/client.crt 
CUSTOM | QUERY_SERVER | ha-schedd-HASchedd1@ | http://node1:47936/services/query/
CUSTOM | QUERY_SERVER | ha-schedd-HASchedd2@ | http://node2:37425/services/query/
CUSTOM | QUERY_SERVER | ha-schedd-HASchedd3@ | http://node2:48322/services/query/

The Locator endpoint was updated both after a service relocation and after a crashed (killed) service.
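
The comparison done by eye above can also be sketched in Python. The helper names are hypothetical, and the meaning assumed for the four pipe-separated columns (category, type, name, URL) is inferred from the output shown here, not from locator.py documentation. Trailing slashes are stripped before comparing, because the first listing prints the HASchedd3 URL without one.

```python
from collections import defaultdict

BEFORE = """\
CUSTOM | QUERY_SERVER | ha-schedd-HASchedd1@ | http://node2:45039/services/query/
CUSTOM | QUERY_SERVER | ha-schedd-HASchedd2@ | http://node2:37425/services/query/
CUSTOM | QUERY_SERVER | ha-schedd-HASchedd3@ | http://node2:48322/services/query
"""

AFTER = """\
CUSTOM | QUERY_SERVER | ha-schedd-HASchedd1@ | http://node1:50697/services/query/
CUSTOM | QUERY_SERVER | ha-schedd-HASchedd2@ | http://node2:37425/services/query/
CUSTOM | QUERY_SERVER | ha-schedd-HASchedd3@ | http://node2:48322/services/query/
"""


def parse_locator_output(text):
    """Map each endpoint name to the list of URLs reported for it."""
    endpoints = defaultdict(list)
    for line in text.splitlines():
        fields = [field.strip() for field in line.split("|")]
        if len(fields) != 4:
            continue  # skip blank lines and anything that is not a result row
        _category, _kind, name, url = fields
        endpoints[name].append(url.rstrip("/"))
    return dict(endpoints)


def redundant_entries(endpoints):
    """Names that are listed with more than one URL (there should be none)."""
    return {name: urls for name, urls in endpoints.items() if len(urls) > 1}


def relocated(before, after):
    """Names whose URL list differs between two listings, as (old, new) pairs."""
    return {
        name: (before[name], after[name])
        for name in before
        if name in after and before[name] != after[name]
    }


before = parse_locator_output(BEFORE)
after = parse_locator_output(AFTER)
print("redundant:", redundant_entries(after))
print("relocated:", relocated(before, after))
```

For the two listings pasted above, only ha-schedd-HASchedd1@ is reported as relocated, and no name has redundant entries.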

>>> VERIFIED
Comment 12 errata-xmlrpc 2013-03-06 13:43:16 EST
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

http://rhn.redhat.com/errata/RHSA-2013-0564.html
