Description of problem:
To simplify failover scenarios, the Query/Job Server in RHHA should not be independent; it should instead be tied to the schedd and use the same mount point as the schedd it is associated with.

Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1.
2.
3.
Actual results:
Expected results:
Additional info:
The Red Hat HAv2 tools now add/remove query/job servers to/from schedd configurations, and all of the daemons fail over as a unit. The schedd's failure count is limited to a number of restarts within a time limit, but the job/query servers will fail and restart indefinitely without causing a failover of the service.

Tracking upstream on branch: V7_6-branch
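For reference, rgmanager expresses the "limited restarts within a time window" part at the service level. The fragment below is only a sketch; the service name, domain, and limit values are illustrative placeholders, not taken from the generated configuration:

  <service name="HA_Schedd3" domain="schedd3_domain" recovery="restart"
           max_restarts="3" restart_expire_time="300">
    ...
  </service>

With recovery="restart", a failure restarts the service in place; once max_restarts is exceeded within restart_expire_time seconds, the next failure relocates the service to another node in its failover domain.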
What does it mean that query/job servers are added to schedd configurations? Could you please describe the settings required for this, and give an example of how it behaves during a failure in RH HA?
The job/query servers are added as resources to the service that contains the specified schedd, and they use the same failover domains as the schedd. The service consists of the NFS share and the condor daemons (schedd, job server, query server), and a relocation relocates the entire service. Failures of the job/query servers will not cause a failover of the service; those resources are simply restarted on the node they were running on. Failure of the schedd beyond the allowed limits will cause the entire service to fail over to another node in the domain.
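For illustration only, the shape of such a service in cluster.conf terms is roughly the sketch below. The resource names and resource-agent attributes here (ref, __independent_subtree, the netfs mount) are assumptions for the example, not the exact output of the HA tools:

  <service name="HA_Schedd3" domain="schedd3_domain" recovery="restart">
    <netfs ref="schedd3_spool">                               <!-- shared NFS spool mount -->
      <condor ref="ha_schedd3"/>                              <!-- schedd: failures count toward relocation -->
      <condor ref="ha_jobserver3" __independent_subtree="1"/> <!-- restarted in place on failure -->
      <condor ref="ha_query3" __independent_subtree="1"/>     <!-- restarted in place on failure -->
    </netfs>
  </service>

Because the job server and query server sit inside the same service as the schedd and the NFS mount, relocating the service moves all of them together, while a failure of only the job/query server triggers rgmanager's inline recovery on the current node.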
When I relocate a service from one HA node to another, only one aviary server is started. The relocated service has no aviary_query_server. The condor_job_server was relocated as expected.

# ps ax | grep condor
 1721 ?        Ssl    0:12 condor_master -pidfile /var/run/condor/condor_master.pid
 1743 ?        Ssl    0:08 condor_collector -f
 1747 ?        Ssl    0:03 condor_startd -f
 1749 ?        Ssl    0:07 condor_negotiator -f
 1750 ?        Ssl    0:37 /usr/bin/python /usr/sbin/condor_configd
 3717 ?        S<l    0:02 condor_schedd -pidfile /var/run/condor/condor_schedd-ha_schedd2.pid -local-name ha_schedd2
 3721 ?        S<     0:01 condor_procd -A /var/run/condor/procd_pipe.ha_schedd2.SCHEDD -R 10000000 -S 60 -C 64
 3761 ?        S<l    0:03 condor_job_server -pidfile /var/run/condor/condor_job_server-ha_jobserver2.pid -local-name ha_jobserver2
 3803 ?        S<     0:00 aviary_query_server -pidfile /var/run/condor/aviary_query_server-ha_query2.pid -local-name ha_query2
13172 ?        S<     0:00 condor_procd -A /var/run/condor/procd_pipe.ha_schedd3.SCHEDD -R 10000000 -S 60 -C 64
13200 ?        S<l    0:00 condor_job_server -pidfile /var/run/condor/condor_job_server-ha_jobserver3.pid -local-name ha_jobserver3
13276 pts/0    S+     0:00 grep condor

rgmanager.log:
Jun 20 12:22:16 rgmanager Some independent resources in service:HA_Schedd3 failed; Attempting inline recovery
Jun 20 12:22:16 rgmanager [condor] Stopping aviary_query_server ha_query3
Jun 20 12:22:17 rgmanager [condor] Stopping condor_job_server ha_jobserver3
Jun 20 12:22:20 rgmanager [condor] Stopping condor_schedd ha_schedd3
Jun 20 12:22:20 rgmanager [condor] Starting condor_schedd ha_schedd3
Jun 20 12:22:21 rgmanager [condor] Starting condor_job_server ha_jobserver3
Jun 20 12:22:21 rgmanager [condor] Starting aviary_query_server ha_query3
Jun 20 12:22:21 rgmanager Inline recovery of service:HA_Schedd3 complete

/var/log/condor/QueryServerLog-ha_query3:
06/20/12 12:23:12 Axis2 HTTP configuration failed
Stack dump for process 18197 at timestamp 1340187792 (20 frames)
aviary_query_server(dprintf_dump_stack+0x63)[0x50f013]
aviary_query_server[0x511102]
/lib64/libpthread.so.0(+0xf500)[0x7ff117282500]
/usr/lib64/libaxis2_engine.so.0(axis2_phase_free+0xe)[0x7ff118f4525e]
/usr/lib64/libaxis2_engine.so.0(axis2_msg_free+0x7e)[0x7ff118f53dae]
/usr/lib64/libaxis2_engine.so.0(axis2_desc_free+0x59)[0x7ff118f4b319]
/usr/lib64/libaxis2_engine.so.0(axis2_op_free+0x1e)[0x7ff118f4d13e]
/usr/lib64/libaxis2_engine.so.0(axis2_svc_free+0x19e)[0x7ff118f4ff9e]
/usr/lib64/libaxis2_engine.so.0(axis2_arch_file_data_free+0xf0)[0x7ff118f5c220]
/usr/lib64/libaxis2_engine.so.0(axis2_dep_engine_free+0xae)[0x7ff118f5b3de]
/usr/lib64/libaxis2_engine.so.0(axis2_conf_free+0xf0)[0x7ff118f44450]
/usr/lib64/libaxis2_engine.so.0(axis2_conf_ctx_free+0x194)[0x7ff118f68bf4]
/usr/lib64/libaxis2_http_receiver.so.0(+0x1b54)[0x7ff1184a5b54]
aviary_query_server(_ZN6aviary4soap17Axis2SoapProviderD2Ev+0x21)[0x45b461]
aviary_query_server(_ZN6aviary4soap17Axis2SoapProviderD0Ev+0x9)[0x45b539]
aviary_query_server(_ZN6aviary9transport21AviaryProviderFactory6createERKSs+0x278)[0x45af68]
aviary_query_server(_Z9main_initiPPc+0x8f)[0x45a8ef]
aviary_query_server(main+0x115f)[0x46bcef]
/lib64/libc.so.6(__libc_start_main+0xfd)[0x7ff116effcdd]
aviary_query_server[0x45a489]

Could you please take a look at why the second aviary query server is dying?
Created attachment 593195 [details]
cluster.conf

This is my cluster.conf and the parameters for the query server in wallaby:

# condor_configure_pool -g "Automatically generated High-Availability configuration for Schedd ha_schedd3" -l | grep -i query
  2: BaseQueryServer
QUERY_SERVER.ha_query3.HISTORY = $(QUERY_SERVER.ha_query3.SPOOL)/history
QUERY_SERVER.ha_query3.AVIARY_PUBLISH_INTERVAL = 10
QUERY_SERVER.ha_query3.QUERY_SERVER_ADDRESS_FILE = $(LOG)/.query_server_address-ha_query3
QUERY_SERVER.ha_query3.QUERY_SERVER_LOG = $(LOG)/QueryServerLog-ha_query3
QUERY_SERVER.ha_query3.SCHEDD_NAME = ha-schedd-ha_schedd3@
QUERY_SERVER.ha_query3.AVIARY_PUBLISH_LOCATION = True
QUERY_SERVER.ha_query3.SPOOL = $(SCHEDD.ha_schedd3.SPOOL)
ha_query3 = $(QUERY_SERVER)
QUERY_SERVER.ha_query3.QUERY_SERVER_DAEMON_AD_FILE = $(LOG)/.query_server_classad-ha_query3

Maybe the 2nd aviary server doesn't have a separate port assigned for communication and it conflicts with the 1st one.
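One quick way to check for such a collision on the node (just a diagnostic sketch; replace <pid> with the pid of the query server that is actually running):

  # show which TCP ports the aviary query servers are listening on
  netstat -tlnp | grep aviary_query_server
  # or inspect a single process
  lsof -a -p <pid> -i TCP -s TCP:LISTEN

If both local names resolve to the same HTTP port, the second instance will fail when Axis2 tries to bind its HTTP listener.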
The crash in the query server is the result of a port collision. There is a fix in the query server to make this more evident. To run multiple query servers on a single node, you'll need to manually specify the port. Do so by adding a parameter like the following to the wallaby group for the schedd:

QUERY_SERVER.<query_server_name>.HTTP_PORT = <unique port>
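For example, with the two query servers in this setup the resulting parameters would look something like this (the port numbers below are arbitrary examples, not defaults):

  QUERY_SERVER.ha_query2.HTTP_PORT = 39091
  QUERY_SERVER.ha_query3.HTTP_PORT = 39092

Any otherwise unused ports will do; the only requirement is that query servers which can land on the same node get distinct values.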
I set up HTTP_PORT for each QS, as you wrote, and it's not dying anymore. Thanks.

# clustat
Cluster Status for HA-schedd @ Fri Jun 29 12:21:25 2012
Member Status: Quorate

 Member Name                ID   Status
 ------ ----                ---- ------
 rhel-ha-1.hostname            1 Online, Local, rgmanager
 rhel-ha-2.hostname            2 Online, rgmanager
 rhel-ha-3.hostname            3 Online, rgmanager

 Service Name               Owner (Last)               State
 ------- ----               ----- ------               -----
 service:HA_Schedd1         rhel-ha-1.hostname         started
 service:HA_Schedd2         rhel-ha-2.hostname         started
 service:HA_Schedd3         rhel-ha-3.hostname         started

# fence_node -vvv rhel-ha-3.hostname
fence rhel-ha-3.hostname dev 0.0 agent fence_xvm result: success
agent args: domain=RHEL-HA-3 nodename=rhel-ha-3.hostname agent=fence_xvm
fence rhel-ha-3.hostname success

Jun 29 12:25:58 rgmanager Evaluating RG service:HA_Schedd2, state started, owner rhel-ha-2.hostname
Jun 29 12:25:58 rgmanager Evaluating RG service:HA_Schedd3, state started, owner rhel-ha-3.hostname
Jun 29 12:25:58 rgmanager Taking over service service:HA_Schedd3 from down member rhel-ha-3.hostname
Jun 29 12:25:58 rgmanager Evaluating RG service:HA_Schedd1, state started, owner rhel-ha-1.hostname
Jun 29 12:25:58 rgmanager Event (0:3:0) Processed
Jun 29 12:25:59 rgmanager [condor] Starting condor_schedd ha_schedd3
Jun 29 12:26:00 rgmanager [condor] Starting condor_job_server ha_jobserver3
Jun 29 12:26:00 rgmanager [condor] Starting aviary_query_server ha_query3
Jun 29 12:26:00 rgmanager Service service:HA_Schedd3 started
Jun 29 12:26:05 rgmanager 2 events processed
Jun 29 12:26:13 rgmanager Membership Change Event
Jun 29 12:26:13 rgmanager Node 3 is not listening

[root@rhel-ha-2 ~]# ps ax | grep condor
 1708 ?        Ssl    0:12 condor_master -pidfile /var/run/condor/condor_master.pid
 1732 ?        Ssl    0:04 condor_startd -f
 1737 ?        Ssl    1:07 /usr/bin/python /usr/sbin/condor_configd -d
 7196 ?        S<l    0:00 condor_schedd -pidfile /var/run/condor/condor_schedd-ha_schedd2.pid -local-name ha_schedd2
 7203 ?        S<     0:00 condor_procd -A /var/run/condor/procd_pipe.ha_schedd2.SCHEDD -R 10000000 -S 60 -C 64
 7242 ?        S<l    0:00 condor_job_server -pidfile /var/run/condor/condor_job_server-ha_jobserver2.pid -local-name ha_jobserver2
 7285 ?        S<     0:00 aviary_query_server -pidfile /var/run/condor/aviary_query_server-ha_query2.pid -local-name ha_query2
 7626 ?        S<     0:00 condor_procd -A /var/run/condor/procd_pipe.ha_schedd3.SCHEDD -R 10000000 -S 60 -C 64
 7667 ?        S<l    0:00 condor_job_server -pidfile /var/run/condor/condor_job_server-ha_jobserver3.pid -local-name ha_jobserver3
 7710 ?        S<     0:00 aviary_query_server -pidfile /var/run/condor/aviary_query_server-ha_query3.pid -local-name ha_query3
 7937 ?        S<     0:00 condor_procd -A /var/run/condor/procd_pipe.ha_schedd3.SCHEDD -R 10000000 -S 60 -C 64
 7970 pts/0    S+     0:00 grep condor

The service from node3 was relocated to node2 with all of its dependent resources (JS/QS).

>>> VERIFIED
Tested on:
$CondorVersion: 7.6.5 Jun 04 2012 BuildID: RH-7.6.5-0.15.el6 $
$CondorPlatform: X86_64-RedHat_6.2 $