Description of problem:

Apparently a "Heartbeat exceeded" during connectStorageServer causes the engine to move perfectly working hosts to Non-Operational, causing mass migrations for nothing.

1. RHV-M loses connection to a host
2. Connects again
3. Among other things, does a connectStorageServer
4. Something disrupts the connection again; connectStorageServer fails with a Heartbeat Exceeded exception
5. The engine understands it as a failure to connect to the Pool and moves the host to Non-Operational
6. Unnecessary mass migrations

I'm not 100% sure on 5, but it's the only thing I can see. Everything else looks fine.

Also:
* VDSM answers are all timely; repoStats before and after the connectStorageServer are the same. No Storage Domain was problematic or unreported.
* In the specific case below, the OS running ovirt-engine had severe problems, causing the Heartbeat exceeded errors. But it can also be caused by a poor network.
* See 3 examples:

2017-08-28 03:42:19,675+05 ERROR [org.ovirt.engine.core.bll.InitVdsOnUpCommand] (org.ovirt.thread.pool-6-thread-41) [] Could not connect host 'dch11bl13' to pool 'DR': Heartbeat exceeded
2017-08-28 03:42:36,867+05 INFO [org.ovirt.engine.core.vdsbroker.SetVdsStatusVDSCommand] (DefaultQuartzScheduler2) [db65f77] START, SetVdsStatusVDSCommand(HostName = dch11bl13, SetVdsStatusVDSCommandParameters:{runAsync='true', hostId='2a02097f-7026-4d63-93d7-3db474b1f889', status='NonOperational', nonOperationalReason='STORAGE_DOMAIN_UNREACHABLE', stopSpmFailureLogged='false', maintenanceReason='null'}), log id: 226305ec
2017-08-28 03:42:39,329+05 WARN [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (DefaultQuartzScheduler2) [db65f77] EVENT_ID: VDS_SET_NONOPERATIONAL_DOMAIN(522), Correlation ID: db65f77, Job ID: 3f389dbf-c73b-45ad-8db2-fa5f6d9b475d, Call Stack: null, Custom Event ID: -1, Message: Host dch11bl13 cannot access the Storage Domain(s) <UNKNOWN> attached to the Data Center DR-CHENNAI. Setting Host state to Non-Operational.

-----------------------

2017-08-28 03:52:13,035+05 ERROR [org.ovirt.engine.core.bll.InitVdsOnUpCommand] (org.ovirt.thread.pool-6-thread-21) [] Could not connect host 'dch12bl01' to pool 'DR-CHENNAI': Heartbeat exceeded
2017-08-28 03:52:17,750+05 INFO [org.ovirt.engine.core.vdsbroker.SetVdsStatusVDSCommand] (DefaultQuartzScheduler10) [17e08558] START, SetVdsStatusVDSCommand(HostName = dch12bl01, SetVdsStatusVDSCommandParameters:{runAsync='true', hostId='d7191d33-899d-49ea-9309-d51da80bc5eb', status='NonOperational', nonOperationalReason='STORAGE_DOMAIN_UNREACHABLE', stopSpmFailureLogged='false', maintenanceReason='null'}), log id: 5cd84c52

-----------------------

2017-08-28 03:54:28,801+05 ERROR [org.ovirt.engine.core.bll.InitVdsOnUpCommand] (org.ovirt.thread.pool-6-thread-43) [] Could not connect host 'dch12bl04' to pool 'DR-CHENNAI': Heartbeat exceeded
2017-08-28 03:54:32,618+05 INFO [org.ovirt.engine.core.vdsbroker.SetVdsStatusVDSCommand] (DefaultQuartzScheduler3) [737e71c3] START, SetVdsStatusVDSCommand(HostName = dch12bl04, SetVdsStatusVDSCommandParameters:{runAsync='true', hostId='289abce1-5898-4be8-b582-dfe82905adf6', status='NonOperational', nonOperationalReason='STORAGE_DOMAIN_UNREACHABLE', stopSpmFailureLogged='false', maintenanceReason='null'}), log id: 444cf9a5

Version-Release number of selected component (if applicable):
ovirt-engine-4.1.2

How reproducible:
Unsure
It seems to be a ConnectStoragePool in initializeStorage() in InitVdsOnUpCommand. Does it really need to do this every single time a connection is re-established? These hosts were already connected to the Pool. A sketch of the suspected flow follows below.
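To make the suspected path concrete, here is a minimal, self-contained sketch of the failure mode as described in this report. It is an illustration only, not the actual ovirt-engine code: apart from InitVdsOnUpCommand/initializeStorage(), ConnectStoragePool and the STORAGE_DOMAIN_UNREACHABLE reason seen in the logs above, every name is a simplified stand-in.

// Illustrative sketch: a single transient "Heartbeat exceeded" during the
// one-shot ConnectStoragePool call is treated like a real storage failure.
public class ConnectOnUpSketch {

    // Stand-in for the engine's VDSNetworkException ("Heartbeat exceeded").
    static class HeartbeatExceededException extends RuntimeException {
        HeartbeatExceededException() { super("Heartbeat exceeded"); }
    }

    enum HostStatus { UP, NON_OPERATIONAL }

    HostStatus status = HostStatus.UP;

    // Stand-in for InitVdsOnUpCommand.initializeStorage(): one
    // ConnectStoragePool attempt, re-issued on every reconnect.
    void initializeStorage(Runnable connectStoragePool) {
        try {
            connectStoragePool.run();
        } catch (HeartbeatExceededException e) {
            // A transient engine<->host hiccup lands here and is handled
            // exactly like STORAGE_DOMAIN_UNREACHABLE: the host goes NonOp,
            // which is what triggers the mass migrations reported above.
            status = HostStatus.NON_OPERATIONAL;
        }
    }

    public static void main(String[] args) {
        ConnectOnUpSketch host = new ConnectOnUpSketch();
        host.initializeStorage(() -> { throw new HeartbeatExceededException(); });
        System.out.println(host.status); // NON_OPERATIONAL
    }
}

At this layer a dropped heartbeat is indistinguishable from a genuine storage connectivity failure, which is the crux of this report.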
The real question is why we got heartbeat exceeded. I suspect it's due to RHV-M's terrible state. I think moving a host to Non-Operational on heartbeat exceeded makes sense, so I'm not sure what we should look at here.
(In reply to Yaniv Kaul from comment #3)
> The real question is why we got heartbeat exceeded. I suspect it's due to
> RHV-M's terrible state.

Yes, this is from that hung (softlockups) RHV-M.

> I think moving a host to Non-Operational on heartbeat exceeded makes
> sense, so I'm not sure what we should look at here.

It doesn't look like a good idea to me. What if only RHV-M has network problems (switch port, remote site WAN, VPN etc.)? It can trigger mass migrations on a perfectly working cluster. IMO Non-Operational should be triggered based on received data/returns, not exceptions. But this is your decision.
Checking the logs I see that "heartbeat exceeded" is very common in the log, starting with:

2017-08-27 00:09:13,644+05 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.GetAllVmStatsVDSCommand] (DefaultQuartzScheduler1) [] Command 'GetAllVmStatsVDSCommand(HostName = dch12bl13.cbec.gov.in, VdsIdVDSCommandParametersBase:{runAsync='true', hostId='e31634bf-d4a0-47e8-ba81-d49c280acaa3'})' execution failed: VDSGenericException: VDSNetworkException: Heartbeat exceeded

I see many hosts being unable to communicate. It could be either a too-small heartbeat timeout value (for busier environments) or another issue that we need to investigate. Please provide one or more failing host logs.
Piotr,

The network and the hosts are all fine here; this is not about any bug in jsonrpc. The problem is not even the engine, but the (virtual) machine the engine is running on. It was completely broken and locked up: Java and postgres processes stuck on CPUs, not running, softlockups everywhere, IO not happening, etc. This triggered some very erratic behavior (corner cases) which ended up with undesirable results. This BZ is about one of them: should a Heartbeat Exceeded during ConnectStoragePool move a host to Non-Operational? During other commands it does not do this.

So suppose a manager<->host connection is dropped because of a network problem (quite common). The manager connects again, and during ConnectStoragePool the communication problem happens again, triggering Heartbeat exceeded. Should the host move to Non-Operational? If yes, feel free to close the BZ.
Please provide steps to reproduce the issue.
Reducing severity to high, as this happens only if the engine is running inside a VM and this VM was frozen (based on the above description).
From an exception handling perspective it looks like it needs to be fixed. I understand that the critical piece is the handling of ConnectStoragePool. Let me ask Allon what he thinks about such changes.
(In reply to Martin Perina from comment #12)
> Reducing severity to high, as this happens only if the engine is running
> inside a VM and this VM was frozen (based on the above description).

Hmm, I think it can happen on network issues too. As long as the Heartbeat exception is raised during a connectStoragePool, the same can happen. One just needs bad luck.

(In reply to Piotr Kliczewski from comment #13)
> From an exception handling perspective it looks like it needs to be fixed.
> I understand that the critical piece is the handling of
> ConnectStoragePool. Let me ask Allon what he thinks about such changes.

Yup, that's why I raised the BZ. Thanks for following up.
(In reply to Piotr Kliczewski from comment #13)
> From an exception handling perspective it looks like it needs to be fixed.
> I understand that the critical piece is the handling of
> ConnectStoragePool. Let me ask Allon what he thinks about such changes.

If a host cannot connect to the storage pool (i.e. it does not start monitoring all the relevant storage domains), it should indeed be NonOp.

Piotr - can you elaborate on what you think should be fixed here? To me, this looks like the correct behavior.
(In reply to Allon Mureinik from comment #15)
> If a host cannot connect to the storage pool (i.e. it does not start
> monitoring all the relevant storage domains), it should indeed be NonOp.
> Piotr - can you elaborate on what you think should be fixed here?
> To me, this looks like the correct behavior.

Allon, as Germano described above, there is a customer running the engine in a VM which gets frozen from time to time. The issue is that it was frozen during the connect storage command. The failure was detected and handled by moving the host to NonOp (the host was healthy), and it triggered unneeded migrations.

What do you think about retrying the call, or performing a simple check whether the host is accessible? This should mitigate the unneeded migrations. A rough sketch of the retry idea follows below.
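A minimal sketch of the retry idea, for illustration only; the helper name, attempt count and backoff parameter are invented here, and nothing in it is the actual engine implementation:

import java.util.concurrent.Callable;

public class RetryingConnectSketch {

    // Retry a call a few times before letting the failure propagate.
    // In practice the caught type would be the engine's VDSNetworkException.
    static <T> T retryOnHeartbeat(Callable<T> call, int attempts, long backoffMs)
            throws Exception {
        Exception last = null;
        for (int i = 0; i < attempts; i++) {
            try {
                return call.call();
            } catch (Exception e) {
                last = e;
                Thread.sleep(backoffMs); // give a transient hiccup time to clear
            }
        }
        throw last; // only a persistent failure reaches the NonOp handling
    }

    public static void main(String[] args) throws Exception {
        // Pretend the first attempt hits "Heartbeat exceeded", the second succeeds.
        final int[] calls = {0};
        String result = retryOnHeartbeat(() -> {
            if (calls[0]++ == 0) throw new RuntimeException("Heartbeat exceeded");
            return "connected";
        }, 3, 100);
        System.out.println(result); // connected -> host stays Up, no migrations
    }
}

With something like this, a transient "Heartbeat exceeded" on the first attempt would be absorbed, and only a persistent failure would move the host to NonOp.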
(In reply to Piotr Kliczewski from comment #16)
> Allon, as Germano described above, there is a customer running the engine
> in a VM which gets frozen from time to time. The issue is that it was
> frozen during the connect storage command.

Well, that's the least of our concerns, right? A hanging RHV-M should not happen that frequently.

What concerns me is that a bad network (an intermittent issue) between just the engine and VDSM can cause the same thing, even if the host is completely healthy and all networks are fine between all hosts - just the manager having network issues like a bad NIC, or a VPN in case it's on a different site. Then we can immediately move hosts to Non-Operational even if the manager reconnects instantly afterwards (the fencing countdown is never reached, so the host is never fenced), right Piotr?
(In reply to Germano Veit Michel from comment #17)
> Well, that's the least of our concerns, right? A hanging RHV-M should not
> happen that frequently.
>
> What concerns me is that a bad network (an intermittent issue) between
> just the engine and VDSM can cause the same thing, even if the host is
> completely healthy and all networks are fine between all hosts - just the
> manager having network issues like a bad NIC, or a VPN in case it's on a
> different site. Then we can immediately move hosts to Non-Operational even
> if the manager reconnects instantly afterwards (the fencing countdown is
> never reached, so the host is never fenced), right Piotr?

Now I understand the question, thanks.

I don't think retrying is the right way to go. Although ConnectStoragePool is relatively innocuous, the same treatment would apply to numerous other commands you won't want to retry. I'd expect the host monitoring subsystem to catch the heartbeat exceeded error and move the host to NonResponding (which it is, frankly).
When there is a network issue we do start the fencing flow, and it seems that ConnectStoragePool is different. If fencing is not triggered it is a bug and we need to fix it, but it should be.

When I rewrote the transport I provided retry logic, but it is disabled in the db. It was not tested in real life, and VDSM calls are not idempotent, so enabling it could cause some issues. We could provide deduplication functionality in VDSM, so we could move our guarantees from "at most once" to "at least once". A sketch of the deduplication idea follows below.
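For illustration only, a generic sketch of the deduplication idea: replay a cached result when a retried request id is seen again, so a non-idempotent verb executes at most once even under "at least once" delivery. VDSM itself is Python and has no such mechanism today; the request-id scheme here is an assumption.

import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Supplier;

public class DedupSketch {

    // Results of already-executed requests, keyed by a client-supplied id.
    private final Map<String, Object> completed = new ConcurrentHashMap<>();

    // Execute the verb at most once per request id; replay the cached
    // result for any retried duplicate instead of re-running the call.
    Object handle(String requestId, Supplier<Object> verb) {
        return completed.computeIfAbsent(requestId, id -> verb.get());
    }

    public static void main(String[] args) {
        DedupSketch server = new DedupSketch();
        // First delivery executes; the retried duplicate is answered from cache.
        System.out.println(server.handle("req-1", () -> "pool connected"));
        System.out.println(server.handle("req-1", () -> "executed twice?!")); // never runs
    }
}

A real implementation would also need result eviction and persistence across restarts; this only shows the core idea.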
(In reply to Piotr Kliczewski from comment #19)
> When there is a network issue we do start the fencing flow, and it seems
> that ConnectStoragePool is different. If fencing is not triggered it is a
> bug and we need to fix it, but it should be.
>
> When I rewrote the transport I provided retry logic, but it is disabled in
> the db. It was not tested in real life, and VDSM calls are not idempotent,
> so enabling it could cause some issues. We could provide deduplication
> functionality in VDSM, so we could move our guarantees from "at most once"
> to "at least once".

So what's the next step here?
According to Allon, the exception handling for connectStoragePool is correct even though it causes the issue. We have 2 ways of handling it:

1. Retry connectStoragePool, which at the moment is a one-time operation, and when there is a networking issue the failure can have significant consequences.
2. Increase the heartbeat timeout to the point where the issues no longer occur.

Solution 2 is the short-term option before we are able to do 1 as part of this BZ. Allon, do you agree?
(In reply to Piotr Kliczewski from comment #21)
> According to Allon, the exception handling for connectStoragePool is
> correct even though it causes the issue. We have 2 ways of handling it:
>
> 1. Retry connectStoragePool, which at the moment is a one-time operation,
> and when there is a networking issue the failure can have significant
> consequences.
> 2. Increase the heartbeat timeout to the point where the issues no longer
> occur.
>
> Solution 2 is the short-term option before we are able to do 1 as part of
> this BZ. Allon, do you agree?

Agreed.
Heartbeat interval change is tracked by BZ #1513360.
We will increase the heartbeat in 4.2 as a part of BZ1513360, but as mentioned above the proper fix for this issue is in the storage part, where connectStoragePool should be retried when connection issues are raised.
(In reply to Martin Perina from comment #24)
> We will increase the heartbeat in 4.2 as a part of BZ1513360, but as
> mentioned above the proper fix for this issue is in the storage part,
> where connectStoragePool should be retried when connection issues are
> raised.

Agreed. With the increased timeout and monitoring improvements, this is not a 4.2 item.
We made improvements to fix the main issue. If there are still bugs around this, please open a new report with logs.