Bug 807735

Summary: RHHAv2 Query/Job Server failover should be tied to schedd
Product: Red Hat Enterprise MRG
Reporter: Robert Rati <rrati>
Component: condor
Assignee: Robert Rati <rrati>
Status: CLOSED ERRATA
QA Contact: Tomas Rusnak <trusnak>
Severity: unspecified
Priority: medium
Version: Development
CC: matt, mkudlej, trusnak, tstclair
Target Milestone: 2.2
Whiteboard: done
Fixed In Version: condor-7.6.5-0.15
Doc Type: Bug Fix
Last Closed: 2012-09-19 18:26:30 UTC
Hardware: Unspecified
OS: Unspecified
Bug Depends On: 835525
Bug Blocks: 751870
Attachments: cluster.conf

Description Robert Rati 2012-03-28 15:15:15 UTC
Description of problem:
To simplify failover scenarios, the Query/Job Server in RHHA should not be independent, but instead should be tied to the schedd.  They should use the same mount point as the schedd they associate with.

Comment 1 Robert Rati 2012-04-03 16:21:48 UTC
The Red Hat HAv2 tools now add/remove query/job servers to/from schedd configurations, and all daemons fail over as a unit.  The schedd's failure count is limited to a number of restarts within a time limit, but the job/query servers will fail and restart indefinitely without causing a failover of the service.

Tracking upstream on branch:
V7_6-branch

Comment 2 Martin Kudlej 2012-04-04 08:27:45 UTC
What does it mean that query/job servers are added to schedd configurations? Could you please describe the settings required for this and give an example of how it behaves during a failure in RH HA?

Comment 3 Robert Rati 2012-04-04 12:44:20 UTC
The job/query servers are added as resources to the service that contains the specified schedd, and they use the same failover domains as the schedd.  The service consists of the NFS share and the condor daemons (schedd, job server, query server), so relocating the service moves all of them together.  Failures of the job/query servers will not cause a failover of the service; those resources will just be restarted on the node they were running on.  Failure of the schedd beyond the allowed limits will cause the entire service to fail over to another node in the domain.
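As a rough illustration of the layout described above (this is not the attachment from this bug; the resource-agent attributes on the condor resources are assumptions inferred from the rgmanager log output), the service's resource tree in cluster.conf might look roughly like this, with the job/query servers marked as independent subtrees so their failures trigger inline recovery rather than service failover:

```xml
<!-- Hypothetical sketch only; see the attached cluster.conf for the real
     configuration.  Attribute names on the condor resources are illustrative. -->
<service name="HA_Schedd3" domain="schedd3_domain" recovery="relocate">
  <!-- NFS mount shared by the schedd and its job/query servers -->
  <netfs name="schedd3_mount" host="nfs.example.com" export="/ha/schedd3"
         mountpoint="/mnt/schedd3" fstype="nfs">
    <!-- Schedd failure beyond the restart limit fails the whole service over -->
    <condor name="ha_schedd3" type="schedd">
      <!-- __independent_subtree="1": rgmanager restarts these in place on the
           same node instead of relocating the service -->
      <condor name="ha_jobserver3" type="job_server" __independent_subtree="1"/>
      <condor name="ha_query3" type="query_server" __independent_subtree="1"/>
    </condor>
  </netfs>
</service>
```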

Comment 5 Tomas Rusnak 2012-06-20 12:25:23 UTC
When I relocate a service from one HA node to another, only one aviary server is started. The relocated service has no aviary_query_server. The condor_job_server was relocated as expected.

# ps ax | grep condor
 1721 ?        Ssl    0:12 condor_master -pidfile /var/run/condor/condor_master.pid
 1743 ?        Ssl    0:08 condor_collector -f
 1747 ?        Ssl    0:03 condor_startd -f
 1749 ?        Ssl    0:07 condor_negotiator -f
 1750 ?        Ssl    0:37 /usr/bin/python /usr/sbin/condor_configd
 3717 ?        S<l    0:02 condor_schedd -pidfile /var/run/condor/condor_schedd-ha_schedd2.pid -local-name ha_schedd2
 3721 ?        S<     0:01 condor_procd -A /var/run/condor/procd_pipe.ha_schedd2.SCHEDD -R 10000000 -S 60 -C 64
 3761 ?        S<l    0:03 condor_job_server -pidfile /var/run/condor/condor_job_server-ha_jobserver2.pid -local-name ha_jobserver2
 3803 ?        S<     0:00 aviary_query_server -pidfile /var/run/condor/aviary_query_server-ha_query2.pid -local-name ha_query2
13172 ?        S<     0:00 condor_procd -A /var/run/condor/procd_pipe.ha_schedd3.SCHEDD -R 10000000 -S 60 -C 64
13200 ?        S<l    0:00 condor_job_server -pidfile /var/run/condor/condor_job_server-ha_jobserver3.pid -local-name ha_jobserver3
13276 pts/0    S+     0:00 grep condor

rgmanager.log:

Jun 20 12:22:16 rgmanager Some independent resources in service:HA_Schedd3 failed; Attempting inline recovery
Jun 20 12:22:16 rgmanager [condor] Stopping aviary_query_server ha_query3
Jun 20 12:22:17 rgmanager [condor] Stopping condor_job_server ha_jobserver3
Jun 20 12:22:20 rgmanager [condor] Stopping condor_schedd ha_schedd3
Jun 20 12:22:20 rgmanager [condor] Starting condor_schedd ha_schedd3
Jun 20 12:22:21 rgmanager [condor] Starting condor_job_server ha_jobserver3
Jun 20 12:22:21 rgmanager [condor] Starting aviary_query_server ha_query3
Jun 20 12:22:21 rgmanager Inline recovery of service:HA_Schedd3 complete

/var/log/condor/QueryServerLog-ha_query3:

06/20/12 12:23:12 Axis2 HTTP configuration failed
Stack dump for process 18197 at timestamp 1340187792 (20 frames)
aviary_query_server(dprintf_dump_stack+0x63)[0x50f013]
aviary_query_server[0x511102]
/lib64/libpthread.so.0(+0xf500)[0x7ff117282500]
/usr/lib64/libaxis2_engine.so.0(axis2_phase_free+0xe)[0x7ff118f4525e]
/usr/lib64/libaxis2_engine.so.0(axis2_msg_free+0x7e)[0x7ff118f53dae]
/usr/lib64/libaxis2_engine.so.0(axis2_desc_free+0x59)[0x7ff118f4b319]
/usr/lib64/libaxis2_engine.so.0(axis2_op_free+0x1e)[0x7ff118f4d13e]
/usr/lib64/libaxis2_engine.so.0(axis2_svc_free+0x19e)[0x7ff118f4ff9e]
/usr/lib64/libaxis2_engine.so.0(axis2_arch_file_data_free+0xf0)[0x7ff118f5c220]
/usr/lib64/libaxis2_engine.so.0(axis2_dep_engine_free+0xae)[0x7ff118f5b3de]
/usr/lib64/libaxis2_engine.so.0(axis2_conf_free+0xf0)[0x7ff118f44450]
/usr/lib64/libaxis2_engine.so.0(axis2_conf_ctx_free+0x194)[0x7ff118f68bf4]
/usr/lib64/libaxis2_http_receiver.so.0(+0x1b54)[0x7ff1184a5b54]
aviary_query_server(_ZN6aviary4soap17Axis2SoapProviderD2Ev+0x21)[0x45b461]
aviary_query_server(_ZN6aviary4soap17Axis2SoapProviderD0Ev+0x9)[0x45b539]
aviary_query_server(_ZN6aviary9transport21AviaryProviderFactory6createERKSs+0x278)[0x45af68]
aviary_query_server(_Z9main_initiPPc+0x8f)[0x45a8ef]
aviary_query_server(main+0x115f)[0x46bcef]
/lib64/libc.so.6(__libc_start_main+0xfd)[0x7ff116effcdd]
aviary_query_server[0x45a489]

Could you please take a look at why the 2nd aviary server is dying?

Comment 6 Tomas Rusnak 2012-06-20 12:34:52 UTC
Created attachment 593195 [details]
cluster.conf

This is my cluster.conf, and these are the parameters for the query server in wallaby:

# condor_configure_pool -g "Automatically generated High-Availability configuration for Schedd ha_schedd3" -l | grep -i query
  2: BaseQueryServer
  QUERY_SERVER.ha_query3.HISTORY = $(QUERY_SERVER.ha_query3.SPOOL)/history
  QUERY_SERVER.ha_query3.AVIARY_PUBLISH_INTERVAL = 10
  QUERY_SERVER.ha_query3.QUERY_SERVER_ADDRESS_FILE = $(LOG)/.query_server_address-ha_query3
  QUERY_SERVER.ha_query3.QUERY_SERVER_LOG = $(LOG)/QueryServerLog-ha_query3
  QUERY_SERVER.ha_query3.SCHEDD_NAME = ha-schedd-ha_schedd3@
  QUERY_SERVER.ha_query3.AVIARY_PUBLISH_LOCATION = True
  QUERY_SERVER.ha_query3.SPOOL = $(SCHEDD.ha_schedd3.SPOOL)
  ha_query3 = $(QUERY_SERVER)
  QUERY_SERVER.ha_query3.QUERY_SERVER_DAEMON_AD_FILE = $(LOG)/.query_server_classad-ha_query3

Maybe the 2nd aviary server doesn't have a separate port assigned for communication and conflicts with the 1st one.

Comment 7 Robert Rati 2012-06-20 17:42:10 UTC
The crash in the query server is the result of a port collision. There is a fix in the query server to make this more evident.  To run multiple query servers on a single node, you'll need to manually specify the port.  Do so by adding a parameter like the following to the wallaby group for the schedd:

QUERY_SERVER.<query_server_name>.HTTP_PORT = <unique port>
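For instance (the port numbers here are illustrative, not values from this bug), a pool where ha_query2 and ha_query3 can land on the same node might carry:

```
# Hypothetical port assignments; any pair of unique, unused ports will do
QUERY_SERVER.ha_query2.HTTP_PORT = 9092
QUERY_SERVER.ha_query3.HTTP_PORT = 9093
```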

Comment 8 Tomas Rusnak 2012-06-29 12:29:16 UTC
I set up HTTP_PORT for each QS, as you suggested, and it's not dying anymore. Thanks.

# clustat
Cluster Status for HA-schedd @ Fri Jun 29 12:21:25 2012
Member Status: Quorate

 Member Name                                        ID   Status
 ------ ----                                        ---- ------
 rhel-ha-1.hostname                      1 Online, Local, rgmanager
 rhel-ha-2.hostname                      2 Online, rgmanager
 rhel-ha-3.hostname                      3 Online, rgmanager

 Service Name                              Owner (Last)                              State         
 ------- ----                              ----- ------                              -----         
 service:HA_Schedd1                        rhel-ha-1.hostname        started 
 service:HA_Schedd2                        rhel-ha-2.hostname        started
 service:HA_Schedd3                        rhel-ha-3.hostname        started


# fence_node -vvv rhel-ha-3.hostname
fence rhel-ha-3.hostname dev 0.0 agent fence_xvm result: success
agent args: domain=RHEL-HA-3 nodename=rhel-ha-3.hostname agent=fence_xvm 
fence rhel-ha-3.hostname success

Jun 29 12:25:58 rgmanager Evaluating RG service:HA_Schedd2, state started, owner rhel-ha-2.hostname
Jun 29 12:25:58 rgmanager Evaluating RG service:HA_Schedd3, state started, owner rhel-ha-3.hostname
Jun 29 12:25:58 rgmanager Taking over service service:HA_Schedd3 from down member rhel-ha-3.hostname
Jun 29 12:25:58 rgmanager Evaluating RG service:HA_Schedd1, state started, owner rhel-ha-1.hostname
Jun 29 12:25:58 rgmanager Event (0:3:0) Processed
Jun 29 12:25:59 rgmanager [condor] Starting condor_schedd ha_schedd3
Jun 29 12:26:00 rgmanager [condor] Starting condor_job_server ha_jobserver3
Jun 29 12:26:00 rgmanager [condor] Starting aviary_query_server ha_query3
Jun 29 12:26:00 rgmanager Service service:HA_Schedd3 started
Jun 29 12:26:05 rgmanager 2 events processed
Jun 29 12:26:13 rgmanager Membership Change Event
Jun 29 12:26:13 rgmanager Node 3 is not listening

[root@rhel-ha-2 ~]# ps ax | grep condor
 1708 ?        Ssl    0:12 condor_master -pidfile /var/run/condor/condor_master.pid
 1732 ?        Ssl    0:04 condor_startd -f
 1737 ?        Ssl    1:07 /usr/bin/python /usr/sbin/condor_configd -d
 7196 ?        S<l    0:00 condor_schedd -pidfile /var/run/condor/condor_schedd-ha_schedd2.pid -local-name ha_schedd2
 7203 ?        S<     0:00 condor_procd -A /var/run/condor/procd_pipe.ha_schedd2.SCHEDD -R 10000000 -S 60 -C 64
 7242 ?        S<l    0:00 condor_job_server -pidfile /var/run/condor/condor_job_server-ha_jobserver2.pid -local-name ha_jobserver2
 7285 ?        S<     0:00 aviary_query_server -pidfile /var/run/condor/aviary_query_server-ha_query2.pid -local-name ha_query2
 7626 ?        S<     0:00 condor_procd -A /var/run/condor/procd_pipe.ha_schedd3.SCHEDD -R 10000000 -S 60 -C 64
 7667 ?        S<l    0:00 condor_job_server -pidfile /var/run/condor/condor_job_server-ha_jobserver3.pid -local-name ha_jobserver3
 7710 ?        S<     0:00 aviary_query_server -pidfile /var/run/condor/aviary_query_server-ha_query3.pid -local-name ha_query3
 7937 ?        S<     0:00 condor_procd -A /var/run/condor/procd_pipe.ha_schedd3.SCHEDD -R 10000000 -S 60 -C 64
 7970 pts/0    S+     0:00 grep condor

The service from node3 was relocated to node2 with all dependent resources (JS/QS).

>>> VERIFIED

Comment 9 Tomas Rusnak 2012-06-29 12:30:07 UTC
Tested on:
$CondorVersion: 7.6.5 Jun 04 2012 BuildID: RH-7.6.5-0.15.el6 $
$CondorPlatform: X86_64-RedHat_6.2 $