Bug 1035272

Summary: It can take up to an hour for a new server in HA to appear in an agent's fail-over list
Product: [JBoss] JBoss Operations Network
Reporter: Jeeva Kandasamy <jkandasa>
Component: High Availability
Assignee: RHQ Project Maintainer <rhq-maint>
Status: CLOSED EOL
QA Contact: Mike Foley <mfoley>
Severity: medium
Priority: unspecified
Version: JON 3.3.0
CC: hrupp, jkandasa, jkremser, jshaughn, loleary, myarboro
Target Milestone: ---
Keywords: FutureFeature
Target Release: JON 4.0.0
Hardware: Unspecified
OS: Unspecified
Doc Type: Enhancement
Type: Enhancement
Last Closed: 2019-06-24 14:53:35 UTC
Attachments:
  log-files (flags: none)
  screenshot (flags: none)

Description Jeeva Kandasamy 2013-11-27 12:25:29 UTC
Created attachment 829685 [details]
log-files

Description of problem:
In an HA setup, the initial server's agent does not send data to the other server in the HA deployment while the initial server is down.


Version-Release number of selected component (if applicable):
JBoss Operations Network	
Version : 3.2.0.ER7
Build Number : e8e6401:ff0061d
GWT Version : 2.5.0
SmartGWT Version : 3.0p

How reproducible:
always

Steps to Reproduce:
1. Create an HA setup with at least two JON servers.
2. Install a JON server on host-A (initial server) with a PostgreSQL database and a storage node.
3. Install a JON server on host-B (second server) with a storage node.
4. Create an "Affinity Group" and add both servers and their agents to this group.
5. Shut down the JON server on host-A (initial server).
6. The agent on host-A should now report to the JON server on host-B, but it does not.
7. The host-A agent goes into a not-reachable state and logs the following exception in the agent log file.


-----------snap------------------
2013-11-27 17:29:03,022 INFO  [InventoryManager.availability-1] (rhq.core.pc.inventory.InventoryManager)- Sending availability report to Server...
2013-11-27 17:29:03,023 WARN  [InventoryManager.availability-1] (rhq.core.pc.inventory.InventoryManager)- Could not transmit availability report to server
java.lang.IllegalStateException: The sender object is currently not sending commands now. Command not sent: [Command: type=[remotepojo]; cmd-in-response=[false]; config=[{rhq.send-throttle=true}]; params=[{invocation=NameBasedInvocation[mergeAvailabilityReport], targetInterfaceName=org.rhq.core.clientapi.server.discovery.DiscoveryServerService}]]
        at org.rhq.enterprise.communications.command.client.ClientCommandSender.sendSynch(ClientCommandSender.java:631)
        at org.rhq.enterprise.communications.command.client.ClientRemotePojoFactory$RemotePojoProxyHandler.invoke(ClientRemotePojoFactory.java:418)  
        at com.sun.proxy.$Proxy2.mergeAvailabilityReport(Unknown Source)  
        at org.rhq.core.pc.inventory.InventoryManager.handleReport(InventoryManager.java:1100)
        at org.rhq.core.pc.inventory.AvailabilityExecutor.run(AvailabilityExecutor.java:112)  
        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
        at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:304)  
        at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:178)
        at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)  
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)  
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)  
        at java.lang.Thread.run(Thread.java:724)

-----------snap------------------

Note: If we shut down the host-B JON server, the agent on host-B reports to the host-A JON server. The reverse does not work for the host-A agent (initial server).

Actual results:
The host-A agent does not report to host-B when the host-A JON server is down.

Expected results:
The host-A agent should report to host-B when the host-A JON server is down.

Additional info: Log files and a screenshot are attached.

Comment 1 Jeeva Kandasamy 2013-11-27 12:27:15 UTC
Created attachment 829686 [details]
screenshot

Comment 2 Heiko W. Rupp 2013-12-02 11:31:21 UTC
Jeeva,
did the agent on Host A ever have contact with Server A? Or did you set up the agent to point to Server A when Server A was already down?
In the latter case, this would be expected as the agent first needs to download the failover list from the server before it can know about Server B at all.

Comment 3 Jeeva Kandasamy 2013-12-02 12:19:57 UTC
Heiko,

Agent A connected to Server A and imported resources. I did down Server A after ~30 minutes, once all of the HA environment setup was done. I think Server A should push the failover list to Agent A once Server B joins the HA environment. If we restart Agent A, it works. Whenever a new JON server is added to an HA setup, its details should be pushed to all agents. Currently, existing agents pick up the new server's details on their next restart.

Comment 5 Jirka Kremser 2013-12-02 14:12:28 UTC
Taking it back.. I was able to reproduce it. Agent on A is not able to connect to server B and its failover list now contains only Server A.

Investigating..

Comment 6 Jay Shaughnessy 2013-12-02 15:58:44 UTC
I was going to say, most likely Agent A has only Server A in its failover list.  Given that fact the agent is behaving as expected.  The question is only why it has only Server A in its failover list.

Agent A is connected to Server A when Server B joins as an HA node.  So it would have only Server A in its failover list at that time.  Adding Server B should regenerate all of the failover lists for all agents given that adding a server is a full repartitioning event.

The only question is when we expect Agent A to get a refreshed failover list... Looking as well...

Comment 7 Jay Shaughnessy 2013-12-02 16:12:27 UTC
Looking at: https://docs.jboss.org/author/display/RHQ/Design-High+Availability+-+Agent+Failover#Design-HighAvailability-AgentFailover-CloudRepartition


"A repartition does not push new server lists to connected agents. This prevents large scale fail-over in large environments, potentially spiking a server with connection processing. Instead, agents will intermittently check for updated server lists, and reconnect to new primary assignments, if necessary. This disperses the connection load."


The Agent refreshes the failover list as part of the PrimaryServerSwitchoverThread job, which runs hourly by default (see rhq.agent.primary-server-switchover-check-interval-msecs in agent-configuration.xml).

In Comment 3 Jeeva says, "I did down Server A after ~30 minutes", which seems to leave plenty of time for the update to not have happened.
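
The timing argument above can be sketched in a few lines of Java. This is only an illustration, not RHQ code: the class and method names are hypothetical, and it assumes the switchover check fires at fixed multiples of the interval counted from agent start-up. The "10 minutes after agent start" figure in `main` is invented for the example; only the hourly default and the ~30 minute gap come from this report.

```java
import java.time.Duration;

/** Hypothetical helper for reasoning about the refresh window; not part of RHQ. */
final class FailoverRefreshWindow {
    // Default check interval cited above ("runs hourly by default").
    static final Duration DEFAULT_CHECK_INTERVAL = Duration.ofHours(1);

    /**
     * Time between a topology change and the next periodic check, assuming
     * checks fire at agentStart + k * interval for k = 1, 2, ...
     * A change that lands exactly on a check is treated as having missed it.
     */
    static Duration delayUntilRefresh(Duration changeAfterAgentStart, Duration interval) {
        long intervalMs = interval.toMillis();
        long changeMs = changeAfterAgentStart.toMillis();
        long nextCheckMs = (changeMs / intervalMs + 1) * intervalMs;
        return Duration.ofMillis(nextCheckMs - changeMs);
    }

    /** True if the old server stays up long enough for the agent to refresh its list. */
    static boolean agentKnowsNewServer(Duration changeAfterAgentStart,
                                       Duration shutdownAfterChange,
                                       Duration interval) {
        Duration delay = delayUntilRefresh(changeAfterAgentStart, interval);
        return shutdownAfterChange.compareTo(delay) >= 0;
    }

    public static void main(String[] args) {
        // Assumed scenario: Server B joins 10 minutes after Agent A started,
        // and Server A is shut down ~30 minutes after that.
        Duration joined = Duration.ofMinutes(10);
        Duration shutdown = Duration.ofMinutes(30);
        System.out.println("Delay until refresh: "
                + delayUntilRefresh(joined, DEFAULT_CHECK_INTERVAL).toMinutes() + " min");
        System.out.println("Agent knows Server B at shutdown: "
                + agentKnowsNewServer(joined, shutdown, DEFAULT_CHECK_INTERVAL));
    }
}
```

Under these assumptions the agent would refresh 50 minutes after Server B joined, so a shutdown 30 minutes after the join catches it with a stale list.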

I think this can be closed, or, if you re-test, wait an hour or update the agent failover list manually before the agent shutdown.

Comment 8 Jay Shaughnessy 2013-12-02 16:14:11 UTC
sorry, that last line should say ..."before the server A shutdown"

Comment 9 Jirka Kremser 2013-12-02 18:14:03 UTC
Confirming Jay's thoughts: one hour after the last agent start-up, the fail-over list had been correctly propagated to all agents, and when Server A was shut down, Agent A correctly triggered its failover mechanism and connected to Server B. I would close this bug if Jeeva has no objection.


agent log from agent A (1 hour after Server B + Agent B were started):

2013-12-02 13:07:07,488 INFO  [ClientCommandSenderTask Timer Thread #0] (enterprise.communications.command.client.JBossRemotingRemoteCommunicator)- {JBossRemotingRemoteCommunicator.changing-endpoint}Communicator is changing endpoint from [InvokerLocator [servlet://10.16.23.98:7080/jboss-remoting-servlet-invoker/ServerInvokerServlet]] to [InvokerLocator [servlet://10.16.23.98:7080/jboss-remoting-servlet-invoker/ServerInvokerServlet]]
2013-12-02 13:07:07,490 WARN  [ClientCommandSenderTask Timer Thread #0] (org.rhq.enterprise.agent.AgentMain)- {AgentMain.failover-failed}Failed to failover to another server. Cause: org.jboss.remoting.CannotConnectException: Can not connect http client invoker after 1 attempt(s)
2013-12-02 13:07:07,496 INFO  [ClientCommandSenderTask Timer Thread #0] (enterprise.communications.command.client.JBossRemotingRemoteCommunicator)- {JBossRemotingRemoteCommunicator.changing-endpoint}Communicator is changing endpoint from [InvokerLocator [servlet://10.16.23.98:7080/jboss-remoting-servlet-invoker/ServerInvokerServlet]] to [InvokerLocator [servlet://10.16.23.102:7080/jboss-remoting-servlet-invoker/ServerInvokerServlet]]
2013-12-02 13:07:07,744 INFO  [ClientCommandSenderTask Timer Thread #0] (org.rhq.enterprise.agent.AgentMain)- {AgentMain.failed-over-to-server}The agent has triggered its failover mechanism and switched to server [servlet://10.16.23.102:7080/jboss-remoting-servlet-invoker/ServerInvokerServlet]
2013-12-02 13:07:09,149 INFO  [ClientCommandSenderTask Timer Thread #0] (org.rhq.enterprise.agent.AgentMain)- {AgentMain.not-sending-dup-connect}Not sending another connect message since one was recently sent: [servlet://10.16.23.102:7080/jboss-remoting-servlet-invoker/ServerInvokerServlet@Mon Dec 02 13:07:07 EST 2013]
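
The log above shows the agent first retrying 10.16.23.98 (presumably Server A), failing, and then switching to 10.16.23.102 (presumably Server B). A minimal sketch of that behaviour, assuming the agent simply walks its failover list in order and takes the first endpoint that accepts a connection, might look like the following. The class and method names are hypothetical and this is not the actual RHQ implementation.

```java
import java.util.List;
import java.util.Optional;
import java.util.function.Predicate;

/** Illustrative only; the names here are hypothetical and not RHQ classes. */
final class FailoverListSketch {
    /** Returns the first endpoint, in list order, that accepts a connection. */
    static Optional<String> failover(List<String> failoverList, Predicate<String> canConnect) {
        for (String endpoint : failoverList) {
            if (canConnect.test(endpoint)) {
                return Optional.of(endpoint);
            }
        }
        // No entry is reachable: the agent stays disconnected.
        return Optional.empty();
    }

    public static void main(String[] args) {
        String serverA = "10.16.23.98:7080";
        String serverB = "10.16.23.102:7080";
        // Simulate Server A being down: every endpoint except A is reachable.
        Predicate<String> serverADown = endpoint -> !endpoint.equals(serverA);

        // Stale list containing only Server A: there is nothing to fail over to.
        System.out.println(failover(List.of(serverA), serverADown));
        // Refreshed list containing both servers: the agent lands on Server B.
        System.out.println(failover(List.of(serverA, serverB), serverADown));
    }
}
```

With a stale list the lookup comes back empty, which matches the original symptom in this report; with the refreshed list the same outage resolves to Server B.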

Comment 10 Jeeva Kandasamy 2013-12-03 13:48:30 UTC
I feel this is still too long a time frame in some situations. Suppose a user adds a server to the HA environment and then takes an existing server down less than an hour after the HA repartition. In this case, the existing agents do not yet have the new server in their fail-over lists and fail to report their resource monitoring data.

As HA repartitioning is not a frequent process, it is not a good idea to reduce the fail-over polling interval on the agent side. But we might provide an option on GUI (or) CLI for end user to push fail-over list to all the agents. In this case, only that particular server sends the fail-over list to all the agents. Done this way, the fail-over list gets updated immediately on the agent side.

Comment 11 Jirka Kremser 2013-12-03 14:30:52 UTC
"But we might provide an option on GUI (or) CLI for end user to push fail-over list to all the agents."

It is actually already there: an agent operation called "Download Latest Failover List".

Comment 13 Larry O'Leary 2014-01-07 19:30:59 UTC
I am relabeling this issue as an enhancement request to be reviewed during the next major or minor product planning phase.

Based on comment 7, things are working as designed. If a new server is added to an HA configuration, it may take up to 1 hour for an agent to learn about the new server. This means that the original issue identified by this bug report is not actually a bug but working as designed. It was just that the test case didn't wait long enough for agent A to get the updated fail-over list which would contain server B.

That still leaves us with the remaining point raised by comment 10. If an administrator decides to take server A offline but before doing so, adds server B, the agents may not learn about server B in time before server A goes offline. This would result in inventory back-filling.

Please be aware that although this use-case is plausible, it is not common. In situations where a new server is being added to an HA configuration, there are most likely already two or more servers in the configuration, which allows agents to fail over to one of the existing servers.

To make this a bit clearer to the end user, perhaps the process of adding a new server to an HA configuration should expose an option that asks if all agents should have their fail-over list updated? Or perhaps a warning/reminder could be added to the "add server to HA" process that informs the user that agents may not pick up the new server in their fail-over list until they are restarted or the "update fail-over list" operation is invoked? Additionally, perhaps an operation should be added to the topology administration page that allows the fail-over list to be sent to specific agents or to all agents?

Comment 14 Jay Shaughnessy 2014-09-05 18:33:47 UTC
Re-targeting to JON4, this is a big item with questionable benefit, so it would need to be considered in a greater context of architectural changes.

Comment 15 Filip Brychta 2019-06-24 14:53:35 UTC
JBoss ON is coming to the end of its product life cycle. For more information regarding this transition, see https://access.redhat.com/articles/3827121.
This bug report/request is being closed. If you feel this issue should not be closed or requires further review, please create a new bug report against the latest supported JBoss ON 3.3 version.