Description of problem:
Currently, networking-odl has a default restconf_poll_interval of 30 seconds. In a cluster environment, it is a race as to which networking-odl instance will run the hostconfig task. The timer pops on each controller, but there are logic checks to ensure the task did not just complete within the last interval, that it is not already running, etc. With these checks, in a clustered environment, the hostconfig task may not actually run (with the default 30-second setting) until 55+ seconds have elapsed on a given controller node. Since the dead_agent_timer is 75 seconds, this means that, depending on the timer cycles of the networking-odl hostconfig agent, we may get only one chance to properly fetch the hostconfig from ODL.

Version-Release number of selected component (if applicable):

How reproducible:
Intermittent

Steps to Reproduce:
1. Deploy a cluster
2. Monitor the hostconfig collection interval
3.

Actual results:
The interval between runs of the hostconfig task can exceed 30 seconds per cycle.

Expected results:
To allow at least 2 attempts to fetch the ODL hostconfig, adjust restconf_poll_interval to 15 seconds, so that the maximum time between runs of the hostconfig task in networking-odl is 29 seconds.

Additional info:
We should change the default to 15 seconds, but also provide a way to configure this in production (with proper instructions from support staff, i.e. a hidden config option).
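The timing argument above can be sketched as follows. This is a minimal model, not networking-odl code: it assumes the skip-if-recently-completed check means a run can land almost two full poll intervals after the previous one, which matches the observed 55+ second gaps with a 30-second interval.

```python
# Hedged sketch of the worst-case hostconfig scheduling gap.
# Assumption: a timer pop may be skipped because the task completed
# just inside the previous interval, so consecutive runs can be
# separated by nearly two poll intervals.

def worst_case_gap(poll_interval):
    """Maximum seconds between two hostconfig task runs (assumed model)."""
    return 2 * poll_interval - 1

def guaranteed_attempts(dead_agent_timer, poll_interval):
    """Fetch attempts guaranteed within the dead-agent window."""
    return dead_agent_timer // worst_case_gap(poll_interval)

# Default 30s interval: up to 59s between runs, so only one
# guaranteed attempt within the 75s dead_agent_timer.
assert worst_case_gap(30) == 59
assert guaranteed_attempts(75, 30) == 1

# Proposed 15s interval: at most 29s between runs, giving at
# least two attempts before the agent is declared dead.
assert worst_case_gap(15) == 29
assert guaranteed_attempts(75, 15) == 2
```

In deployment terms the change would amount to setting restconf_poll_interval = 15 in the networking-odl configuration (typically the [ml2_odl] section of the ML2 plugin config file; exact section and file are deployment-dependent).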
Closing as a duplicate of bug 1610546, since we need to solve the bigger picture of what these agent statuses mean and when agents are considered alive or dead. *** This bug has been marked as a duplicate of bug 1610546 ***