Description of problem:
One of my VMs went to a Not Responding state because a wrong entry had been made in the /etc/fstab file. After correcting the entry, the VM comes up fine, but the RHEV UI still shows the VM as Not Responding.

Version-Release number of selected component (if applicable):
vdsm-4.17.26-0.1.el7ev.noarch

How reproducible:
Always

Steps to Reproduce:
1. Install an HC setup.
2. Create VMs on the setup.
3. Add a wrong entry to /etc/fstab that makes the system drop to maintenance mode.
4. Correct the /etc/fstab file.
5. Stop and start the VM.
6. The VM comes up without any issue.

Actual results:
Even though the VM comes up without any issue, its status is not reflected correctly.

Expected results:
The VM status should be reflected correctly when the VM is up and running fine.

Additional info:
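For step 3 above, a minimal sketch of the kind of bad /etc/fstab entry that drops a systemd host to emergency/maintenance mode at boot — the device path and mount point here are hypothetical, not taken from the affected host:

```
# Bad entry: the device does not exist, and without the 'nofail' option
# systemd treats the failed mount as fatal and drops to emergency mode.
/dev/mapper/does_not_exist  /mnt/data  xfs  defaults        0 0

# Corrected entry (step 4): point at a real device, or add 'nofail' so a
# missing device no longer blocks boot.
/dev/mapper/rhel-data       /mnt/data  xfs  defaults,nofail 0 0
```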
Logs?
Kasturi, on which of the nodes was the VM running? Was the VM named "vm_file_check"?
The node on which the VM was running is tettnang.lab.eng.blr.redhat.com, and the VM name is BootStrom_linux_vm-1.
Ok. I see the following in engine.log:

2016-05-05 10:11:38,222 INFO [org.ovirt.engine.core.vdsbroker.VmAnalyzer] (DefaultQuartzScheduler_Worker-60) [] VM '4fca345e-65e8-4825-babc-55c3f436f459'(BootStrom_linux_vm-1) moved from 'Up' --> 'NotResponding'

and in vdsm.log on sulphur:

/var/log/vdsm/vdsm.log:Thread-381231::WARNING::2016-05-05 14:01:02,424::vm::5161::virt.vm::(_setUnresponsiveIfTimeout) vmId=`4fca345e-65e8-4825-babc-55c3f436f459`::monitor become unresponsive (command timeout, age=13816.53)

Moving this to the virt team. Please provide input on the above error.
There was no recovery in the vdsm sense. The VM stopped working (libvirt stopped responding) at 10:11, and since then it hasn't started working again, nor has there been any vdsm restart. I can't find any libvirt logs in those sosreport files; any chance to reproduce this with libvirt debug logs enabled (and gathered)?
Hi Michal,

I have the /var/lib/libvirt/qemu/<vm-name>.log file. Would that help? Can you help me with enabling debug logs for libvirt so that I can try reproducing the issue?

Thanks,
Kasturi
(In reply to RamaKasturi from comment #7)
> Hi Michal,
>
> I have /var/lib/libvirt/qemu/<vm-name>.log file. Would that help? can

No, that's only qemu's log, which is not helpful in this case. We need to see what went wrong in libvirt, which doesn't seem to be included in the sosreport. I also can't find the journal, so I can't check for any system messages :/

See http://wiki.libvirt.org/page/DebugLogs for how to enable libvirt debug logs.
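For the record, the DebugLogs wiki page linked above boils down to roughly the following in /etc/libvirt/libvirtd.conf — a sketch only; take the exact filter list recommended by the wiki for your libvirt version:

```
# /etc/libvirt/libvirtd.conf -- debug logging sketch
log_filters="1:libvirt 1:qemu 1:conf 1:security 3:event 3:json 3:file 3:object 1:util"
log_outputs="1:file:/var/log/libvirt/libvirtd.log"
```

Then restart libvirtd for the change to take effect, and include /var/log/libvirt/libvirtd.log in the next sosreport.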
Michal, there are journalctl logs in sos_commands/logs/journalctl_--all_--this-boot_--no-pager:

May 05 10:05:21 sulphur..com libvirtd[17241]: Cannot start job (query, none) for domain BootStrom_linux_vm-6; current job is (query, none) owned by (17246 remoteDispatchDomainGetBlockIoTune, 0 <null>) for (32s, 0s)
May 05 10:05:21 sulphur...com libvirtd[17241]: Timed out during operation: cannot acquire state change lock (held by remoteDispatchDomainGetBlockIoTune)

I see errors like this, however not for VM BootStrom_linux_vm-1.
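As a sketch, the lock-contention messages quoted above can be pulled out of a journal dump with a simple grep. The sample file below stands in for the output of `journalctl --all --this-boot --no-pager` on the affected host; the hostname and message bodies are abbreviated stand-ins.

```shell
# Write a small stand-in for the journal dump from the sosreport.
cat > /tmp/journal_sample.txt <<'EOF'
May 05 10:05:21 host libvirtd[17241]: Cannot start job (query, none) for domain BootStrom_linux_vm-6
May 05 10:05:21 host libvirtd[17241]: Timed out during operation: cannot acquire state change lock
May 05 10:05:22 host ovirt-ha-broker[12856]: INFO ConnectionHandler:Connection established
EOF

# Keep only the lines that point at libvirt job/lock problems.
grep -E 'Cannot start job|cannot acquire state change lock' /tmp/journal_sample.txt
```

On a live host you would pipe `journalctl --this-boot --no-pager` into the same grep instead of using a sample file.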
(In reply to Sahina Bose from comment #9)
> Michal, there are journalctl logs in
> sos_commands/logs/journalctl_--all_--this-boot_--no-pager

ah, thanks! That's one ugly file name :)

> May 05 10:05:21 sulphur..com libvirtd[17241]: Cannot start job (query, none)
> for domain BootStrom_linux_vm-6; current job is (query, none) owned by
> (17246 remoteDispatchDomainGetBlockIoTune, 0 <null>) for (32s, 0s)
> May 05 10:05:21 sulphur...com libvirtd[17241]: Timed out during operation:
> cannot acquire state change lock (held by remoteDispatchDomainGetBlockIoTune)
>
> I see errors like this, however not for vm BootStrom_linux_vm-1

These are significant and indicate a libvirt problem. Those debug logs would be helpful.

By the way, there are also tons of the messages below. You may have some misconfiguration, but even then the logs should not be flooded like that. Please open a separate bug on Hosted Engine (SLA):

ovirt-ha-broker[12856]: INFO:ovirt_hosted_engine_ha.broker.listener.ConnectionHandler:Connection established
ovirt-ha-broker[12856]: INFO:ovirt_hosted_engine_ha.broker.listener.ConnectionHandler:Connection closed
kernel: libvirtd: page allocation failure: order:4, mode:0x1040d0

is for sure not good. There are many errors pointing to a stuck libvirt query (failures to acquire the lock), and way too many errors involving hosted engine; that's never good either. There are plenty of vmGetIoTune and vmGetIoTunePolicy calls in vdsm; not sure if that's a result of those hosted engine issues or something else. Martin, thoughts?
Sahina/RamaKasturi - in other words, the system doesn't seem to have recovered. I can't say whether it was like that before the gluster issues or whether those were the trigger, but I would suggest reproducing after making sure the system is in a stable state.
GetIoTune is needed for disk QoS and MOM. We read the values every cycle, which is why you see the calls in the log. But all the mentioned timeouts and failures are probably a sign of something bad happening on the libvirt/qemu side. It will take some time to download the logs.
(In reply to Michal Skrivanek from comment #12)
> Sahina/RamaKasturi - in other words, the system doesn't seem to have
> recovered. I can't say whether it was like that before the gluster issues or
> those were the trigger, but I would suggest to reproduce after making sure
> the system is in stable state

We have bugs filed for the Hosted Engine errors in the logs - see Bug 1331514 and Bug 1331503, where functionality does not appear to be affected as per the comments.
I don't see any way forward without a clear reproduction so it can be looked at by libvirt or qemu. Please clear the needinfo once we get rid of the noise.
Kasturi, can you upgrade to 3.6.6 and the latest glusterfs and retry? Please update the bug with logs and steps to reproduce.
Hi Sahina,

I tried reproducing this bug on 3.6.7 but had no luck. For now, this can be closed as not reproducible. I will reopen this bug in case I hit it again.

Thanks,
Kasturi