| Summary: | [scale] - NPE for vm monitoring - vm monitor failed | | |
|---|---|---|---|
| Product: | [oVirt] ovirt-engine | Reporter: | Eldad Marciano <emarcian> |
| Component: | Backend.Core | Assignee: | Arik <ahadas> |
| Status: | CLOSED WONTFIX | QA Contact: | eberman |
| Severity: | high | Docs Contact: | |
| Priority: | unspecified | | |
| Version: | 4.0.5 | CC: | ahadas, bugs, michal.skrivanek, rgolan, tjelinek |
| Target Milestone: | --- | Flags: | sbonazzo: ovirt-4.2- |
| Target Release: | --- | | |
| Hardware: | x86_64 | | |
| OS: | Linux | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2017-08-14 11:29:07 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | Virt | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
please add host side logs from that occurrence. Should be b01-h11-r620

(In reply to Michal Skrivanek from comment #2)
> please add host side logs from that occurrence. Should be b01-h11-r620

rgolan and I debugged it together briefly. In the engine we saw running VMs on a host while the host had no running VMs; we checked it via virsh and also via vdsClient. Adding logs and outputs.

(In reply to Eldad Marciano from comment #3)

Yes, and that should never happen without VDSM reporting it as such, hence the need for vdsm.log :)

```
-=>>vdsClient -s 0 getVdsStats | grep vmCount
	vmCount = 0
(root@b01-h30-r620) - (09:52) - (~)
-=>>virsh -r list
setlocale: No such file or directory
 Id    Name                           State
----------------------------------------------------

(root@b01-h30-r620) - (09:52) - (~)
```

Eldad, looking at VM 6a7d47c2-c5be-437e-b608-982a2947ba45 (vm_scale_real_121): I see that it started to migrate at 09:18:07, failed to migrate to the first host, and a migration to another host was then tried, but this time the engine failed to communicate with the second host, so the command failed at 9:19:13. However, the engine detects a transition of this VM from Up to MigratingFrom at 9:19:18 with no prior migrate command. So:
1. Is there a chance that the VM was migrated using virsh or vdsClient? That would explain a lot.
2. The attached VDSM log doesn't contain this VM at all; from which host did you take it? I need the log from b01-h11-r620.rhev.openstack.engineering.redhat.com covering 2016-10-31 9:18:00 - 9:21:00 (engine time).

Please ignore everything but the request for the proper VDSM log in comment 7. Now I understand what was going on:

The second migration attempt was to b02-h13-r620.rhev.openstack.engineering.redhat.com at 9:19:13. The failed communication with VDSM resulted in:

> VDSGenericException: VDSNetworkException: Heartbeat exeeded

In that case the engine doesn't change the VM (so it remained Up with migratingToVds == null), but apparently the migration was actually triggered by VDSM (I know that because the VM eventually appeared on that host at 09:20:46,741).

At 9:20:18 the VM switched to Down on the source as part of the migration process, but migratingToVds was null, so we got the NPE later on. The rest of the NPEs seem to be a consequence of this issue.
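To make the failure sequence in comment 8 concrete, here is a minimal sketch of the code path named in the stack trace in the description below. The class and method names (VmAnalyzer, handOverVm, isVdsNonResponsive) come from the trace; the method bodies, the Vds stand-in class, and the Map-backed host lookup are illustrative assumptions, not the actual engine code:

```java
// Sketch only: shows how a null migratingToVds turns into the reported NPE.
import java.util.HashMap;
import java.util.Map;
import java.util.UUID;

enum VDSStatus { Up, NonResponsive }

// Stand-in for the host entity fetched from the DB (hypothetical shape).
class Vds {
    private final VDSStatus status;
    Vds(VDSStatus status) { this.status = status; }
    VDSStatus getStatus() { return status; }
}

class VmAnalyzerSketch {
    // Destination of the in-flight migration; left null because the engine
    // treated the migrate command as failed (heartbeat timeout), even though
    // VDSM had in fact started the migration (comment 8).
    private final UUID migratingToVds;
    private final Map<UUID, Vds> vdsById; // stand-in for the VDS DAO

    VmAnalyzerSketch(UUID migratingToVds, Map<UUID, Vds> vdsById) {
        this.migratingToVds = migratingToVds;
        this.vdsById = vdsById;
    }

    // Monitoring sees the VM go Down on the source mid-migration
    // (handOverVm) and checks whether the destination is non-responsive.
    boolean isVdsNonResponsive() {
        // A lookup with a null id returns no host; getStatus() then throws
        // the NullPointerException reported at VmAnalyzer.java:760.
        Vds destination = vdsById.get(migratingToVds);
        return destination.getStatus() == VDSStatus.NonResponsive;
    }

    public static void main(String[] args) {
        Map<UUID, Vds> hosts = new HashMap<>();
        hosts.put(UUID.randomUUID(), new Vds(VDSStatus.Up));
        // migratingToVds == null reproduces the crash path from comment 8.
        new VmAnalyzerSketch(null, hosts).isVdsNonResponsive(); // -> NPE
    }
}
```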
(In reply to Arik from comment #8)

The hosts have since been re-provisioned. Either way, we did not migrate any VM manually; it seems the engine migrated them over time. Is it possible to avoid migratingToVds being null? Do you still need the logs?

Isn't it addressed by bug 1391933?

(In reply to Michal Skrivanek from comment #10)
> isn't it addressed by bug 1391933?

The NPE, yes. But the solution should be different in order to properly handle this case of getting a timeout exception while asking VDSM to migrate a VM: we currently produce an audit message that says the migration failed (and lose the destination host, hence the NPE), and then get the VM in a migrating state and conclude that there is a migration in progress. So this is not the right fix. Should we open another bug to improve the handling of this case and move this bug to ON_QA?

Indeed, I would use a different bug or change this one. It is far lower priority, though.

The bug was not addressed in time for 4.1. Postponing to 4.2.

The chance this happens is very low, and the consequence is basically only an incorrect error message. The fix would be pretty hard to do. Consequently, closing due to capacity. If this is happening in your case and is important for you to fix, please feel free to reopen.
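A hedged sketch of the alternative handling the discussion above argues for: on a heartbeat timeout from the migrate verb, record the outcome as unknown and keep the destination host, instead of declaring failure and dropping it. The names, enum values, and result type here are assumptions for illustration, not the engine's MigrateVmCommand code:

```java
// Sketch of the fix direction from comment 11; not a committed patch.
import java.util.UUID;

class MigrationTimeoutHandlingSketch {

    enum MigrateOutcome { FAILED, UNKNOWN_MAYBE_IN_PROGRESS }

    static final class MigrateResult {
        final MigrateOutcome outcome;
        final UUID destinationVds; // preserved so monitoring never sees null
        MigrateResult(MigrateOutcome outcome, UUID destinationVds) {
            this.outcome = outcome;
            this.destinationVds = destinationVds;
        }
    }

    // Current behavior per comment 11 (simplified): a network timeout is
    // reported as a failed migration and the destination is discarded, so a
    // later hand-over finds migratingToVds == null.
    static MigrateResult currentBehaviorOnTimeout(UUID destinationVds) {
        return new MigrateResult(MigrateOutcome.FAILED, null);
    }

    // Sketched alternative: treat the timeout as "outcome unknown", keep the
    // destination recorded, and let the next monitoring cycle confirm whether
    // VDSM actually started the migration (as it did in comment 8).
    static MigrateResult proposedBehaviorOnTimeout(UUID destinationVds) {
        return new MigrateResult(MigrateOutcome.UNKNOWN_MAYBE_IN_PROGRESS, destinationVds);
    }
}
```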
Description of problem:
An NPE was thrown in VM monitoring; it happened on 5 of the 35 hosts.

Lab conditions: 1 cluster, 12 NFS SDs, 35 bare-metal hosts, 3K VMs.

Also, looking at the code with rgolan, it appears that VmAnalyzer.java:760 failed to fetch the host from the DB, even though the host currently appears in the Hosts tab. Moreover, I verified manually that it exists in vds_static (DB).

```
2016-10-31 15:27:28,881 ERROR [org.ovirt.engine.core.vdsbroker.monitoring.VmsMonitoring] (DefaultQuartzScheduler7) [2ddc507] Failed during vms monitoring on host xyz error is: java.lang.NullPointerException
2016-10-31 15:27:28,881 ERROR [org.ovirt.engine.core.vdsbroker.monitoring.VmsMonitoring] (DefaultQuartzScheduler7) [2ddc507] Exception:: java.lang.NullPointerException
	at org.ovirt.engine.core.vdsbroker.monitoring.VmAnalyzer.isVdsNonResponsive(VmAnalyzer.java:760) [vdsbroker.jar:]
	at org.ovirt.engine.core.vdsbroker.monitoring.VmAnalyzer.handOverVm(VmAnalyzer.java:739) [vdsbroker.jar:]
	at org.ovirt.engine.core.vdsbroker.monitoring.VmAnalyzer.proceedDisappearedVm(VmAnalyzer.java:698) [vdsbroker.jar:]
	at org.ovirt.engine.core.vdsbroker.monitoring.VmAnalyzer.analyze(VmAnalyzer.java:144) [vdsbroker.jar:]
	at org.ovirt.engine.core.vdsbroker.monitoring.VmsMonitoring.lambda$analyzeVms$1(VmsMonitoring.java:161) [vdsbroker.jar:]
	at java.util.ArrayList.forEach(ArrayList.java:1249) [rt.jar:1.8.0_102]
	at org.ovirt.engine.core.vdsbroker.monitoring.VmsMonitoring.analyzeVms(VmsMonitoring.java:156) [vdsbroker.jar:]
	at org.ovirt.engine.core.vdsbroker.monitoring.VmsMonitoring.perform(VmsMonitoring.java:119) [vdsbroker.jar:]
	at org.ovirt.engine.core.vdsbroker.monitoring.PollVmStatsRefresher.poll(PollVmStatsRefresher.java:39) [vdsbroker.jar:]
	at sun.reflect.GeneratedMethodAccessor67.invoke(Unknown Source) [:1.8.0_102]
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) [rt.jar:1.8.0_102]
	at java.lang.reflect.Method.invoke(Method.java:498) [rt.jar:1.8.0_102]
	at org.ovirt.engine.core.utils.timer.JobWrapper.invokeMethod(JobWrapper.java:77) [scheduler.jar:]
	at org.ovirt.engine.core.utils.timer.JobWrapper.execute(JobWrapper.java:51) [scheduler.jar:]
	at org.quartz.core.JobRunShell.run(JobRunShell.java:213) [quartz.jar:]
	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) [rt.jar:1.8.0_102]
	at java.util.concurrent.FutureTask.run(FutureTask.java:266) [rt.jar:1.8.0_102]
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) [rt.jar:1.8.0_102]
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) [rt.jar:1.8.0_102]
	at java.lang.Thread.run(Thread.java:745) [rt.jar:1.8.0_102]
```

Version-Release number of selected component (if applicable):
4.0.5.0-1

How reproducible:
Not clear.

Steps to Reproduce:
1. Not clear (some of the VMs were being migrated).

Actual results:
VM monitor failed.

Expected results:
VM monitor passes.

Additional info:
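For completeness, the narrow NPE fix (the part comment 11 says bug 1391933 does address) amounts to a defensive guard in the hand-over check. A minimal sketch, reusing the Vds and VDSStatus stand-ins from the earlier sketch; this is an assumption about the shape of such a guard, not the actual patch:

```java
// Sketch of a defensive guard for the path in the stack trace above.
import java.util.Map;
import java.util.UUID;

class GuardedHandOverSketch {
    static boolean isVdsNonResponsive(UUID migratingToVds, Map<UUID, Vds> vdsById) {
        if (migratingToVds == null) {
            // No recorded destination (e.g., the migrate verb timed out):
            // skip the lookup so the caller falls back to the plain
            // "VM went down on the source" path instead of throwing the NPE.
            return false;
        }
        Vds destination = vdsById.get(migratingToVds);
        return destination != null && destination.getStatus() == VDSStatus.NonResponsive;
    }
}
```

Note that, as comment 11 points out, such a guard only removes the exception; the engine would still mis-report the migration as failed, which is the reason this bug was kept separate from the NPE fix.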