Bug 1593300
| Summary | Migration leads to VM running on 2 Hosts (Split brain) | | |
|---|---|---|---|
| Product | Red Hat Enterprise Virtualization Manager | Reporter | nijin ashok <nashok> |
| Component | ovirt-engine | Assignee | Nobody <nobody> |
| Status | CLOSED CURRENTRELEASE | QA Contact | meital avital <mavital> |
| Severity | high | Docs Contact | |
| Priority | high | | |
| Version | 4.1.11 | CC | aperotti, dfediuck, emesika, gveitmic, lsurette, michal.skrivanek, nashok, Rhev-m-bugs, srevivo |
| Target Milestone | --- | Flags | lsvaty: testing_plan_complete- |
| Target Release | --- | | |
| Hardware | All | | |
| OS | Linux | | |
| Whiteboard | | | |
| Fixed In Version | | Doc Type | If docs needed, set a value |
| Doc Text | | Story Points | --- |
| Clone Of | | Environment | |
| Last Closed | 2018-07-26 12:32:04 UTC | Type | Bug |
| Regression | --- | Mount Type | --- |
| Documentation | --- | CRM | |
| Verified Versions | | Category | --- |
| oVirt Team | Virt | RHEL 7.3 requirements from Atomic Host | |
| Cloudforms Team | --- | Target Upstream Version | |
| Embargoed | | | |
Description: nijin ashok, 2018-06-20 13:49:51 UTC
The problem is that the source host vdsm is restarted during an ongoing migration. The recovery migration detection code was only introduced in 4.2. Can you please try to reproduce the scenario again, this time with a 4.2 vdsm (the engine can still stay 4.1)? We believe it can either fix it completely or help significantly.

Now to those restarts: they are frequent (196 times in about 12 hours) and induced explicitly (SIGTERM) at a fairly fixed interval, which indicates these are soft-fencing actions. That can be caused by the fact that other hosts can still talk to the engine and also to this blocked host, so whenever the engine invokes a fence action on another host it can easily ssh to this host and restart vdsm. It still looks like a fencing problem, though, that it tries to do that over and over again for hours. Eli, can you please take a look?

(In reply to Michal Skrivanek from comment #4)
> The problem is that the source host vdsm is restarted during an ongoing migration. The recovery migration detection code was only introduced in 4.2. Can you please try to reproduce the scenario again, this time with a 4.2 vdsm (the engine can still stay 4.1)? We believe it can either fix it completely or help significantly.

Yes, it's not reproducible in 4.2. Checked with 4.2 hosts with both the 4.1 and 4.2 manager. The VM is getting the correct status:

    2018-06-22 02:29:37,848-04 INFO [org.ovirt.engine.core.vdsbroker.monitoring.VmAnalyzer] (DefaultQuartzScheduler2) [431488e3] VM 'fb7add47-e61a-40cf-9eb7-b9bddf2a2970'(test_vm) moved from 'Unknown' --> 'MigratingFrom'

(In reply to Michal Skrivanek from comment #4)
Is info still needed due to comment 6?

Yes. I believe restarting VDSM 200 times is not a desired behavior. Or did it change in 4.2 as well? The recovery improvements are not covering all the corner cases of vdsm restarts. I think it would help to understand why we restarted it so often.

(In reply to Michal Skrivanek from comment #8)
> Yes. I believe restarting VDSM 200 times is not a desired behavior. Or did it change in 4.2 as well? The recovery improvements are not covering all the corner cases of vdsm restarts. I think it would help to understand why we restarted it so often.

Looking at the log I see that soft-fencing is always done once after the host changes status from NonOperational to NonResponding. The scenario is:

1. Host tries to connect to storage
2. Host fails to connect to storage
3. Host becomes NonOperational
4. A network error on connecting to the host is thrown
5. Host becomes NonResponding
6. Fencing handling begins, starting with soft-fencing

Thanks. The bug description says the connection from engine to host was blocked; I assumed all the time. But I see that the connection was indeed working for a few minutes and only then cut, causing another fencing cycle. How exactly was the blocking done?

(In reply to Michal Skrivanek from comment #10)
> Thanks. The bug description says the connection from engine to host was blocked; I assumed all the time. But I see that the connection was indeed working for a few minutes and only then cut, causing another fencing cycle.
>
> How exactly was the blocking done?

For the customer, one of the storage domains was not available and hence the host was going NonOperational.

The test I have done in my test lab is by blocking port 54321 using iptables.
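(For reference: the exact iptables invocation used in the lab reproduction is not recorded in this bug. A minimal sketch of a rule that blocks the vdsm management port in this way, run on the host, could look like the following; the chain and direction are assumptions.)

    # Assumed sketch, not the reporter's actual commands: drop incoming
    # connections to the vdsm management port (54321/tcp), so the engine
    # loses connectivity to vdsm on this host.
    iptables -I INPUT -p tcp --dport 54321 -j DROP

    # Deleting the rule restores engine-to-vdsm connectivity.
    iptables -D INPUT -p tcp --dport 54321 -j DROP

Repeatedly adding and removing such a rule matches the "connection works for a few minutes, then is cut" pattern described in the surrounding comments.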
(In reply to nijin ashok from comment #11)
> (In reply to Michal Skrivanek from comment #10)
> > Thanks. The bug description says the connection from engine to host was blocked; I assumed all the time. But I see that the connection was indeed working for a few minutes and only then cut, causing another fencing cycle.
> >
> > How exactly was the blocking done?
>
> For the customer, one of the storage domains was not available and hence the host was going NonOperational.
>
> The test I have done in my test lab is by blocking port 54321 using iptables.

Then how could the engine establish the connection periodically, after each soft fencing? I was looking at your test run, and there are 26 restarts in the vdsm log and 21 "host status Up" events in engine.log, and after each one there is communication with the engine going on.

As for the customer behavior, I wasn't able to check it, as the engine.log is from a different period than the host logs; I only see those 198 restarts of vdsm, presumably due to soft fencing. But we need the engine log to understand the reason for the fencing.

(In reply to Michal Skrivanek from comment #12)
> Then how could the engine establish the connection periodically, after each soft fencing? I was looking at your test run, and there are 26 restarts in the vdsm log and 21 "host status Up" events in engine.log, and after each one there is communication with the engine going on.

There were many manual steps done in my test. I manually removed and added the iptables rule to mimic the issue and restarted vdsmd manually. I think it would be better to check the customer logs for the ssh soft-fencing issue.

> As for the customer behavior, I wasn't able to check it, as the engine.log is from a different period than the host logs; I only see those 198 restarts of vdsm, presumably due to soft fencing. But we need the engine log to understand the reason for the fencing.

I can see that the uploaded tar file contains the engine log at the time of the issue. There are 196 restarts of vdsm:

    xzgrep "I am the actual vdsm" customer_environment/src_host_vdsm.log.xz | wc -l
    196

I can see the ssh soft-fencing actions related to these events as well:

    xzgrep "Executing SSH Soft Fencing" customer_environment/engine.log.xz | wc -l
    196

This happened between "2018-06-18 21:01" and "2018-06-19 14:37". The engine log uploaded covers "2018-06-18 03:39" to "2018-06-19 15:05". Please let me know if you are interested in any other logs.
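(For reference: the two counts above can also be lined up as timestamps to confirm that each vdsm restart follows a soft-fencing attempt. A minimal sketch, reusing the grep patterns and file names quoted above; the assumption that the date and time are the first two whitespace-separated fields of each log line may need adjusting for the actual log formats.)

    # Assumed sketch: extract restart and soft-fencing timestamps.
    xzgrep "I am the actual vdsm" customer_environment/src_host_vdsm.log.xz \
        | awk '{print $1, $2}' > vdsm_restarts.txt
    xzgrep "Executing SSH Soft Fencing" customer_environment/engine.log.xz \
        | awk '{print $1, $2}' > soft_fencing.txt
    # With 196 entries in each file, pairing them side by side makes the
    # restart-after-fencing pattern easy to eyeball.
    paste soft_fencing.txt vdsm_restarts.txt | head -n 20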
(In reply to nijin ashok from comment #13)
> There were many manual steps done in my test. I manually removed and added the iptables rule to mimic the issue and restarted vdsmd manually. I think it would be better to check the customer logs for the ssh soft-fencing issue.

Right, indeed.

> I can see that the uploaded tar file contains the engine log at the time of the issue. There are 196 restarts of vdsm:
>
>     xzgrep "I am the actual vdsm" customer_environment/src_host_vdsm.log.xz | wc -l
>     196
>
> I can see the ssh soft-fencing actions related to these events as well:
>
>     xzgrep "Executing SSH Soft Fencing" customer_environment/engine.log.xz | wc -l
>     196

Great, so that explains the restarts. What I'd still like to understand (though a little bit off topic) is the reason for that in their environment. If they executed a connectivity test it shouldn't be flapping like that. I just want to rule out some fencing problem which would behave like that during a permanent connection failure.

Apparently not urgent anymore; demoting.

Without further information, the only thing we have is the theory in comment #4 that it is already fixed in 4.2. Please reopen if you get any more details or think it is not fixed in 4.2, so we can look into it further.