Bug 807398 - Endpoint updating for HA configurations
Summary: Endpoint updating for HA configurations
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise MRG
Classification: Red Hat
Component: condor-aviary
Version: Development
Hardware: All
OS: Linux
Priority: high
Severity: medium
Target Milestone: 2.3
Assignee: Pete MacKinnon
QA Contact: Tomas Rusnak
URL:
Whiteboard:
Depends On: 871080
Blocks:
Reported: 2012-03-27 16:58 UTC by Robert Rati
Modified: 2013-03-06 18:43 UTC (History)

Fixed In Version: condor-7.8.2-0.1
Doc Type: Bug Fix
Doc Text:
Cause: Aviary Locator behaviour when the Aviary Schedd plug-in and Query Server are deployed in a HA group.
Consequence: Aviary clients would have experienced stale endpoint references for a longer duration than necessary.
Fix: Adjustments were made in the Locator implementation to quickly replace a failed endpoint reference with its new one.
Result: An Aviary client using a Schedd or Query Server endpoint will now be able to retrieve the new endpoint faster.
Clone Of:
Environment:
Last Closed: 2013-03-06 18:43:16 UTC
Target Upstream Version:
Embargoed:


Links
System | ID | Private | Priority | Status | Summary | Last Updated
Red Hat Product Errata | RHSA-2013:0564 | 0 | normal | SHIPPED_LIVE | Low: Red Hat Enterprise MRG Grid 2.3 security update | 2013-03-06 23:37:09 UTC

Description Robert Rati 2012-03-27 16:58:20 UTC
Description of problem:
Providing high availability for the Query Server through Red Hat HA requires endpoint updating, so that a Query Server can still be located after a failover. Without this, an Aviary client would likely lose track of which machine is running the Query Server once the daemon has failed over to another machine in the cluster.

Version-Release number of selected component (if applicable):


How reproducible:


Steps to Reproduce:
1.
2.
3.
  
Actual results:


Expected results:


Additional info:

Comment 2 Martin Kudlej 2012-04-04 08:38:20 UTC
How can I test this, please?

Comment 6 Pete MacKinnon 2012-04-26 12:49:32 UTC
Verification advice:

2.2 HA process groups will include the Schedd and the Query Server (QS), both of which can be configured to publish their SOAP endpoints using the new location feature.

1) Do these endpoints correctly re-locate in a failover scenario as evidenced by a SOAP tool such as locator.py found in /usr/share/condor/aviary?

2) Are there multiple redundant entries for a particular endpoint when there should only be one?

3) Is the endpoint listed after failover actually reachable, or is it a stale reference to the old location (host:port)?

4) Are old references from crashed (e.g., kill -9) process endpoints removed or replaced in a timely manner, that is, within (AVIARY_LOCATOR_MISSED_UPDATES+1) * AVIARY_LOCATOR_PRUNE_INTERVAL seconds?
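
The time bound in item 4 can be worked out as a quick sanity check before timing a failover. This is a minimal sketch, assuming the two configuration values have been looked up by hand. The function name is hypothetical, and the numbers passed in are placeholders for illustration, not documented defaults.

```python
def max_stale_seconds(missed_updates: int, prune_interval: int) -> int:
    """Upper bound, in seconds, on how long a dead endpoint may stay listed.

    Mirrors (AVIARY_LOCATOR_MISSED_UPDATES+1) * AVIARY_LOCATOR_PRUNE_INTERVAL
    from the verification advice. Both arguments are supplied by the caller.
    """
    if missed_updates < 0 or prune_interval <= 0:
        raise ValueError("missed_updates must be >= 0 and prune_interval > 0")
    return (missed_updates + 1) * prune_interval


# Placeholder values only; substitute the values from the local configuration.
bound = max_stale_seconds(missed_updates=2, prune_interval=20)
print(f"A crashed endpoint should be gone within {bound} seconds")
```

If a killed endpoint is still listed by locator.py after this many seconds, that would count as a failure of item 4.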

Comment 7 Pete MacKinnon 2012-05-02 19:00:21 UTC
    Technical note added. If any revisions are required, please edit the "Technical Notes" field
    accordingly. All revisions will be proofread by the Engineering Content Services team.
    
    New Contents:
Cause: Aviary Locator behaviour when the Aviary Schedd plug-in and Query Server are deployed in a HA group.
Consequence: Aviary clients would have experienced stale endpoint references for a longer duration than necessary.
Fix: Adjustments were made in the Locator implementation to quickly replace a failed endpoint reference with its new one.
Result: An Aviary client using a Schedd or Query Server endpoint will now be able to retrieve the new endpoint faster.

Comment 10 Tomas Rusnak 2013-01-09 10:58:34 UTC
# ./locator.py --type=ANY -s -r=/etc/condor/certs/ca.crt -k=/etc/condor/certs/client.key -c=/etc/condor/certs/client.crt 
CUSTOM | QUERY_SERVER | ha-schedd-HASchedd1@ | http://node2:45039/services/query/
CUSTOM | QUERY_SERVER | ha-schedd-HASchedd2@ | http://node2:37425/services/query/
CUSTOM | QUERY_SERVER | ha-schedd-HASchedd3@ | http://node2:48322/services/query

# clusvcadm -r "HA Schedd HASchedd1" -m node1
Trying to relocate service:HA Schedd HASchedd1 to node1...Success
service:HA Schedd HASchedd1 is now running on node1

# ./locator.py --type=ANY -s -r=/etc/condor/certs/ca.crt -k=/etc/condor/certs/client.key -c=/etc/condor/certs/client.crt 
CUSTOM | QUERY_SERVER | ha-schedd-HASchedd1@ | http://node1:50697/services/query/
CUSTOM | QUERY_SERVER | ha-schedd-HASchedd2@ | http://node2:37425/services/query/
CUSTOM | QUERY_SERVER | ha-schedd-HASchedd3@ | http://node2:48322/services/query/


# ps ax | grep -i aviary
25232 ?        S<     0:00 aviary_query_server -pidfile /var/run/condor/aviary_query_server-HASchedd1_query_server.pid -local-name HASchedd1_query_server

# kill -9 25232

# ./locator.py --type=ANY -s -r=/etc/condor/certs/ca.crt -k=/etc/condor/certs/client.key -c=/etc/condor/certs/client.crt 
CUSTOM | QUERY_SERVER | ha-schedd-HASchedd1@ | http://node1:47936/services/query/
CUSTOM | QUERY_SERVER | ha-schedd-HASchedd2@ | http://node2:37425/services/query/
CUSTOM | QUERY_SERVER | ha-schedd-HASchedd3@ | http://node2:48322/services/query/

The Locator endpoint was updated both after service relocation and after the service crashed (was killed).
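
The manual comparison of locator.py listings above could also be scripted. The following is a rough sketch, assuming the pipe-separated output format shown in this comment; the helper names are hypothetical and not part of locator.py. It checks for duplicate entries per endpoint name and reports which endpoints moved between two captured listings. Trailing slashes are stripped from URLs so that the inconsistent HASchedd3 line in the first listing is not reported as a change.

```python
from collections import Counter


def parse_locator_output(text):
    """Return a list of (name, url) pairs from pipe-separated listing lines."""
    entries = []
    for line in text.splitlines():
        fields = [field.strip() for field in line.split("|")]
        if len(fields) == 4:  # e.g. CUSTOM | QUERY_SERVER | name | url
            entries.append((fields[2], fields[3].rstrip("/")))
    return entries


def duplicate_names(entries):
    """Names listed more than once; each endpoint should appear only once."""
    counts = Counter(name for name, _ in entries)
    return sorted(name for name, count in counts.items() if count > 1)


def changed_endpoints(before, after):
    """Map each name whose URL differs to its (old_url, new_url) pair."""
    old, new = dict(before), dict(after)
    return {name: (old[name], new[name])
            for name in old if name in new and old[name] != new[name]}


# Listings copied from the transcript above (before and after relocation).
BEFORE = """\
CUSTOM | QUERY_SERVER | ha-schedd-HASchedd1@ | http://node2:45039/services/query/
CUSTOM | QUERY_SERVER | ha-schedd-HASchedd2@ | http://node2:37425/services/query/
CUSTOM | QUERY_SERVER | ha-schedd-HASchedd3@ | http://node2:48322/services/query
"""
AFTER = """\
CUSTOM | QUERY_SERVER | ha-schedd-HASchedd1@ | http://node1:50697/services/query/
CUSTOM | QUERY_SERVER | ha-schedd-HASchedd2@ | http://node2:37425/services/query/
CUSTOM | QUERY_SERVER | ha-schedd-HASchedd3@ | http://node2:48322/services/query/
"""

before = parse_locator_output(BEFORE)
after = parse_locator_output(AFTER)
changed = changed_endpoints(before, after)
print("duplicates:", duplicate_names(after))
print("changed:", changed)
```

Only the HASchedd1 entry should show up as changed here, since it was the only service relocated.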

>>> VERIFIED

Comment 12 errata-xmlrpc 2013-03-06 18:43:16 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

http://rhn.redhat.com/errata/RHSA-2013-0564.html

