| Summary: | RHHAv2 Query/Job Server failover should be tied to schedd | | |
|---|---|---|---|
| Product: | Red Hat Enterprise MRG | Reporter: | Robert Rati <rrati> |
| Component: | condor | Assignee: | Robert Rati <rrati> |
| Status: | CLOSED ERRATA | QA Contact: | Tomas Rusnak <trusnak> |
| Severity: | unspecified | Docs Contact: | |
| Priority: | medium | | |
| Version: | Development | CC: | matt, mkudlej, trusnak, tstclair |
| Target Milestone: | 2.2 | | |
| Target Release: | --- | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | done | | |
| Fixed In Version: | condor-7.6.5-0.15 | Doc Type: | Bug Fix |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2012-09-19 18:26:30 UTC | Type: | --- |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Bug Depends On: | 835525 | | |
| Bug Blocks: | 751870 | | |
| Attachments: | cluster.conf (attachment 593195) | | |
Description (Robert Rati, 2012-03-28 15:15:15 UTC):
The Red Hat HAv2 tools now add/remove query/job servers to/from schedd configurations, and all daemons fail over as a unit. The schedd failure count is limited by the number of restarts within a time limit, but the job/query servers will fail and restart forever without causing a failover of the service.

Tracking upstream on branch: V7_6-branch

What does it mean that query/job servers are added to schedd configurations? Could you please describe the settings required for this and give an example of how it behaves during a failure in RH HA?

The job/query servers are added as resources to the service that contains the specified schedd, and they use the same failover domains as the schedd. The service consists of the NFS share and the condor daemons (schedd, job server, query server), and a relocation will relocate the entire service. Failures of the job/query servers will not cause a failover of the service; those resources will just be restarted on the node they were running on. Failure of the schedd beyond the allowed limits will cause the entire service to fail over to another node in the domain.

When I relocate a service from one HA node to another, only one aviary server is started. The relocated service has no aviary_query_server. The condor_job_server was relocated as expected.

# ps ax | grep condor
1721 ? Ssl 0:12 condor_master -pidfile /var/run/condor/condor_master.pid
1743 ? Ssl 0:08 condor_collector -f
1747 ? Ssl 0:03 condor_startd -f
1749 ? Ssl 0:07 condor_negotiator -f
1750 ? Ssl 0:37 /usr/bin/python /usr/sbin/condor_configd
3717 ? S<l 0:02 condor_schedd -pidfile /var/run/condor/condor_schedd-ha_schedd2.pid -local-name ha_schedd2
3721 ? S< 0:01 condor_procd -A /var/run/condor/procd_pipe.ha_schedd2.SCHEDD -R 10000000 -S 60 -C 64
3761 ? S<l 0:03 condor_job_server -pidfile /var/run/condor/condor_job_server-ha_jobserver2.pid -local-name ha_jobserver2
3803 ? S< 0:00 aviary_query_server -pidfile /var/run/condor/aviary_query_server-ha_query2.pid -local-name ha_query2
13172 ? S< 0:00 condor_procd -A /var/run/condor/procd_pipe.ha_schedd3.SCHEDD -R 10000000 -S 60 -C 64
13200 ? S<l 0:00 condor_job_server -pidfile /var/run/condor/condor_job_server-ha_jobserver3.pid -local-name ha_jobserver3
13276 pts/0 S+ 0:00 grep condor

rgmanager.log:
Jun 20 12:22:16 rgmanager Some independent resources in service:HA_Schedd3 failed; Attempting inline recovery
Jun 20 12:22:16 rgmanager [condor] Stopping aviary_query_server ha_query3
Jun 20 12:22:17 rgmanager [condor] Stopping condor_job_server ha_jobserver3
Jun 20 12:22:20 rgmanager [condor] Stopping condor_schedd ha_schedd3
Jun 20 12:22:20 rgmanager [condor] Starting condor_schedd ha_schedd3
Jun 20 12:22:21 rgmanager [condor] Starting condor_job_server ha_jobserver3
Jun 20 12:22:21 rgmanager [condor] Starting aviary_query_server ha_query3
Jun 20 12:22:21 rgmanager Inline recovery of service:HA_Schedd3 complete

/var/log/condor/QueryServerLog-ha_query3:
06/20/12 12:23:12 Axis2 HTTP configuration failed
Stack dump for process 18197 at timestamp 1340187792 (20 frames)
aviary_query_server(dprintf_dump_stack+0x63)[0x50f013]
aviary_query_server[0x511102]
/lib64/libpthread.so.0(+0xf500)[0x7ff117282500]
/usr/lib64/libaxis2_engine.so.0(axis2_phase_free+0xe)[0x7ff118f4525e]
/usr/lib64/libaxis2_engine.so.0(axis2_msg_free+0x7e)[0x7ff118f53dae]
/usr/lib64/libaxis2_engine.so.0(axis2_desc_free+0x59)[0x7ff118f4b319]
/usr/lib64/libaxis2_engine.so.0(axis2_op_free+0x1e)[0x7ff118f4d13e]
/usr/lib64/libaxis2_engine.so.0(axis2_svc_free+0x19e)[0x7ff118f4ff9e]
/usr/lib64/libaxis2_engine.so.0(axis2_arch_file_data_free+0xf0)[0x7ff118f5c220]
/usr/lib64/libaxis2_engine.so.0(axis2_dep_engine_free+0xae)[0x7ff118f5b3de]
/usr/lib64/libaxis2_engine.so.0(axis2_conf_free+0xf0)[0x7ff118f44450]
/usr/lib64/libaxis2_engine.so.0(axis2_conf_ctx_free+0x194)[0x7ff118f68bf4]
/usr/lib64/libaxis2_http_receiver.so.0(+0x1b54)[0x7ff1184a5b54]
aviary_query_server(_ZN6aviary4soap17Axis2SoapProviderD2Ev+0x21)[0x45b461]
aviary_query_server(_ZN6aviary4soap17Axis2SoapProviderD0Ev+0x9)[0x45b539]
aviary_query_server(_ZN6aviary9transport21AviaryProviderFactory6createERKSs+0x278)[0x45af68]
aviary_query_server(_Z9main_initiPPc+0x8f)[0x45a8ef]
aviary_query_server(main+0x115f)[0x46bcef]
/lib64/libc.so.6(__libc_start_main+0xfd)[0x7ff116effcdd]
aviary_query_server[0x45a489]

Could you please take a look at why the 2nd aviary server is dying?

Created attachment 593195 [details]
cluster.conf
This is my cluster.conf, and these are the query server parameters in wallaby:
# condor_configure_pool -g "Automatically generated High-Availability configuration for Schedd ha_schedd3" -l | grep -i query
2: BaseQueryServer
QUERY_SERVER.ha_query3.HISTORY = $(QUERY_SERVER.ha_query3.SPOOL)/history
QUERY_SERVER.ha_query3.AVIARY_PUBLISH_INTERVAL = 10
QUERY_SERVER.ha_query3.QUERY_SERVER_ADDRESS_FILE = $(LOG)/.query_server_address-ha_query3
QUERY_SERVER.ha_query3.QUERY_SERVER_LOG = $(LOG)/QueryServerLog-ha_query3
QUERY_SERVER.ha_query3.SCHEDD_NAME = ha-schedd-ha_schedd3@
QUERY_SERVER.ha_query3.AVIARY_PUBLISH_LOCATION = True
QUERY_SERVER.ha_query3.SPOOL = $(SCHEDD.ha_schedd3.SPOOL)
ha_query3 = $(QUERY_SERVER)
QUERY_SERVER.ha_query3.QUERY_SERVER_DAEMON_AD_FILE = $(LOG)/.query_server_classad-ha_query3
Maybe the 2nd aviary server doesn't have separate ports assigned for communication and is in conflict with the 1st one.
The crash in the query server is the result of a port collision. There is a fix in the query server to make this more evident. To run multiple query servers on a single node, you'll need to manually specify the port. Do so by adding a parameter like the following to the wallaby group for the schedd:

QUERY_SERVER.<query_server_name>.HTTP_PORT = <unique port>

I set up HTTP_PORT for each QS as you described, and it's not dying anymore. Thanks.
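For reference, a minimal sketch of that change, assuming the two query servers ha_query2 and ha_query3 can end up on the same node after a relocation; the port numbers are illustrative and not taken from this bug:

# Assumed example: give each query server a distinct Axis2 HTTP port
QUERY_SERVER.ha_query2.HTTP_PORT = 39090
QUERY_SERVER.ha_query3.HTTP_PORT = 39091

Any two unused ports will do; the only requirement is that query servers that may run on the same node never share an HTTP_PORT.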
# clustat
Cluster Status for HA-schedd @ Fri Jun 29 12:21:25 2012
Member Status: Quorate
Member Name ID Status
------ ---- ---- ------
rhel-ha-1.hostname 1 Online, Local, rgmanager
rhel-ha-2.hostname 2 Online, rgmanager
rhel-ha-3.hostname 3 Online, rgmanager
Service Name Owner (Last) State
------- ---- ----- ------ -----
service:HA_Schedd1 rhel-ha-1.hostname started
service:HA_Schedd2 rhel-ha-2.hostname started
service:HA_Schedd3 rhel-ha-3.hostname started
# fence_node -vvv rhel-ha-3.hostname
fence rhel-ha-3.hostname dev 0.0 agent fence_xvm result: success
agent args: domain=RHEL-HA-3 nodename=rhel-ha-3.hostname agent=fence_xvm
fence rhel-ha-3.hostname success
Jun 29 12:25:58 rgmanager Evaluating RG service:HA_Schedd2, state started, owner rhel-ha-2.hostname
Jun 29 12:25:58 rgmanager Evaluating RG service:HA_Schedd3, state started, owner rhel-ha-3.hostname
Jun 29 12:25:58 rgmanager Taking over service service:HA_Schedd3 from down member rhel-ha-3.hostname
Jun 29 12:25:58 rgmanager Evaluating RG service:HA_Schedd1, state started, owner rhel-ha-1.hostname
Jun 29 12:25:58 rgmanager Event (0:3:0) Processed
Jun 29 12:25:59 rgmanager [condor] Starting condor_schedd ha_schedd3
Jun 29 12:26:00 rgmanager [condor] Starting condor_job_server ha_jobserver3
Jun 29 12:26:00 rgmanager [condor] Starting aviary_query_server ha_query3
Jun 29 12:26:00 rgmanager Service service:HA_Schedd3 started
Jun 29 12:26:05 rgmanager 2 events processed
Jun 29 12:26:13 rgmanager Membership Change Event
Jun 29 12:26:13 rgmanager Node 3 is not listening
[root@rhel-ha-2 ~]# ps ax | grep condor
1708 ? Ssl 0:12 condor_master -pidfile /var/run/condor/condor_master.pid
1732 ? Ssl 0:04 condor_startd -f
1737 ? Ssl 1:07 /usr/bin/python /usr/sbin/condor_configd -d
7196 ? S<l 0:00 condor_schedd -pidfile /var/run/condor/condor_schedd-ha_schedd2.pid -local-name ha_schedd2
7203 ? S< 0:00 condor_procd -A /var/run/condor/procd_pipe.ha_schedd2.SCHEDD -R 10000000 -S 60 -C 64
7242 ? S<l 0:00 condor_job_server -pidfile /var/run/condor/condor_job_server-ha_jobserver2.pid -local-name ha_jobserver2
7285 ? S< 0:00 aviary_query_server -pidfile /var/run/condor/aviary_query_server-ha_query2.pid -local-name ha_query2
7626 ? S< 0:00 condor_procd -A /var/run/condor/procd_pipe.ha_schedd3.SCHEDD -R 10000000 -S 60 -C 64
7667 ? S<l 0:00 condor_job_server -pidfile /var/run/condor/condor_job_server-ha_jobserver3.pid -local-name ha_jobserver3
7710 ? S< 0:00 aviary_query_server -pidfile /var/run/condor/aviary_query_server-ha_query3.pid -local-name ha_query3
7937 ? S< 0:00 condor_procd -A /var/run/condor/procd_pipe.ha_schedd3.SCHEDD -R 10000000 -S 60 -C 64
7970 pts/0 S+ 0:00 grep condor
The service from node3 was relocated to node2 together with all dependent resources (JS/QS).
>>> VERIFIED
Tested on: $CondorVersion: 7.6.5 Jun 04 2012 BuildID: RH-7.6.5-0.15.el6 $ $CondorPlatform: X86_64-RedHat_6.2 $ |