Bug 1332841 - VM Monitoring: After vm recovery, status not updated in UI
Summary: VM Monitoring: After vm recovery, status not updated in UI
Keywords:
Status: CLOSED WORKSFORME
Alias: None
Product: vdsm
Classification: oVirt
Component: Core
Version: 4.17.26
Hardware: Unspecified
OS: Unspecified
Priority: medium
Severity: medium
Target Milestone: ovirt-3.6.8
Assignee: Dan Kenigsberg
QA Contact: Aharon Canan
URL:
Whiteboard:
Depends On:
Blocks: Gluster-HC-1
 
Reported: 2016-05-04 07:48 UTC by RamaKasturi
Modified: 2016-06-09 07:18 UTC
CC: 5 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2016-06-09 07:18:29 UTC
oVirt Team: Virt
Embargoed:
sabose: ovirt-3.6.z?
rule-engine: planning_ack?
rule-engine: devel_ack?
rule-engine: testing_ack?



Description RamaKasturi 2016-05-04 07:48:34 UTC
Description of problem:
One of my VMs went to the Not Responding state because a wrong entry had been made in the /etc/fstab file. After correcting the entry, the VM comes up fine, but the RHEV UI still shows the VM as Not Responding.

Version-Release number of selected component (if applicable):
vdsm-4.17.26-0.1.el7ev.noarch

How reproducible:
Always

Steps to Reproduce:
1. Install an HC setup.
2. Create VMs on the setup.
3. Add a wrong entry to /etc/fstab that makes the system go into maintenance mode (an illustrative bad entry is sketched after these steps).
4. Correct the /etc/fstab file.
5. Stop and start the VM.
6. The VM comes up without any issue.
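
A minimal sketch (not taken from this report) of the kind of wrong /etc/fstab entry that drops a host into emergency/maintenance mode at boot; the device path is hypothetical, and because 'nofail' is not set the failed mount halts the boot:

# Hypothetical bad fstab line: the device does not exist and 'nofail' is not
# set, so the mount fails at boot and systemd drops to emergency (maintenance) mode.
/dev/mapper/no_such_vg-no_such_lv  /mnt/data  xfs  defaults  0 0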

Actual results:
Even though the VM comes up without any issue, I see that the status of the VM is not reflected correctly.

Expected results:
The VM status should be reflected correctly when the VM is up and running fine.

Additional info:

Comment 1 Yaniv Kaul 2016-05-05 06:42:05 UTC
Logs?

Comment 3 Sahina Bose 2016-05-05 11:32:10 UTC
Kasturi, on which of the nodes was the VM running? Was the VM named "vm_file_check"?

Comment 4 RamaKasturi 2016-05-05 12:02:29 UTC
The node on which the VM was running is tettnang.lab.eng.blr.redhat.com, and the VM name is BootStrom_linux_vm-1.

Comment 5 Sahina Bose 2016-05-05 13:18:01 UTC
Ok.

I see below in engine.log
2016-05-05 10:11:38,222 INFO  [org.ovirt.engine.core.vdsbroker.VmAnalyzer] (DefaultQuartzScheduler_Worker-60) [] VM '4fca345e-65e8-4825-babc-55c3f436f459'(BootStrom_linux_vm-1) moved from 'Up' --> 'NotResponding'

and in vdsm.log on sulphur
/var/log/vdsm/vdsm.log:Thread-381231::WARNING::2016-05-05 14:01:02,424::vm::5161::virt.vm::(_setUnresponsiveIfTimeout) vmId=`4fca345e-65e8-4825-babc-55c3f436f459`::monitor become unresponsive (command timeout, age=13816.53)


Moving this to virt team. Please provide inputs on above error

Comment 6 Michal Skrivanek 2016-05-05 14:04:57 UTC
There was no recovery in the vdsm sense. The VM stopped working (libvirt stopped responding) at 10:11, and since then it hasn't started working again, nor has there been any vdsm restart.
I fail to find any libvirt logs in those sosreport files. Any chance to reproduce this with libvirt debug logs enabled (and gathered)?

Comment 7 RamaKasturi 2016-05-06 07:51:23 UTC
Hi Michal,

   I have the /var/lib/libvirt/qemu/<vm-name>.log file. Would that help? Can you help me with enabling debug logs for libvirt so that I can try reproducing the issue?

Thanks
kasturi.

Comment 8 Michal Skrivanek 2016-05-06 08:46:36 UTC
(In reply to RamaKasturi from comment #7)
> Hi Michal,
> 
>    I have  /var/lib/libvirt/qemu/<vm-name>.log file. Would that help? can

No, that's only qemu's log, which is not helpful in this case. We need to see what went wrong in libvirt, which doesn't seem to be included in the sosreport. I also can't find the journal, so I can't check for any system messages :/
See http://wiki.libvirt.org/page/DebugLogs for how to enable libvirt debug logs.
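
For reference, a minimal sketch of the kind of settings that page describes, assuming the daemon config lives at /etc/libvirt/libvirtd.conf (the exact filter list may differ per libvirt version):

# Illustrative debug-log settings for /etc/libvirt/libvirtd.conf
log_filters="1:qemu 1:libvirt 4:object 4:json 4:event 1:util"
log_outputs="1:file:/var/log/libvirt/libvirtd.log"
# then restart the daemon so the new settings take effect
systemctl restart libvirtd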

Comment 9 Sahina Bose 2016-05-06 09:47:11 UTC
Michal, there are journalctl logs in sos_commands/logs/journalctl_--all_--this-boot_--no-pager


May 05 10:05:21 sulphur..com libvirtd[17241]: Cannot start job (query, none) for domain BootStrom_linux_vm-6; current job is (query, none) owned by (17246 remoteDispatchDomainGetBlockIoTune, 0 <null>) for (32s, 0s)
May 05 10:05:21 sulphur...com libvirtd[17241]: Timed out during operation: cannot acquire state change lock (held by remoteDispatchDomainGetBlockIoTune)

I see errors like this, however not for the VM BootStrom_linux_vm-1.
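
One quick way (a sketch, not part of the original report) to check the captured journal for the same lock errors and for the affected VM's name, using the journal file path from the sosreport:

# Search the extracted journal for lock timeouts and the affected VM
grep -E 'state change lock|BootStrom_linux_vm-1' sos_commands/logs/journalctl_--all_--this-boot_--no-pager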

Comment 10 Michal Skrivanek 2016-05-06 13:10:39 UTC
(In reply to Sahina Bose from comment #9)
> Michal, there are journalctl logs in
> sos_commands/logs/journalctl_--all_--this-boot_--no-pager

ah, thanks! that's one ugly file name:)

 
> May 05 10:05:21 sulphur..com libvirtd[17241]: Cannot start job (query, none) for domain BootStrom_linux_vm-6; current job is (query, none) owned by (17246 remoteDispatchDomainGetBlockIoTune, 0 <null>) for (32s, 0s)
> May 05 10:05:21 sulphur...com libvirtd[17241]: Timed out during operation: cannot acquire state change lock (held by remoteDispatchDomainGetBlockIoTune)
> 
> I see errors like this, however not for vm BootStrom_linux_vm-1

These are significant and indicate a libvirt problem. Those debug logs would be helpful.

Btw, there are also tons of these. You may have some misconfiguration, but even then the logs should not be flooded like that. Please open a separate bug on Hosted Engine (SLA).

ovirt-ha-broker[12856]: INFO:ovirt_hosted_engine_ha.broker.listener.ConnectionHandler:Connection established
ovirt-ha-broker[12856]: INFO:ovirt_hosted_engine_ha.broker.listener.ConnectionHandler:Connection closed

Comment 11 Michal Skrivanek 2016-05-06 13:23:48 UTC
kernel: libvirtd: page allocation failure: order:4, mode:0x1040d0
is for sure not good.
There are many errors pointing to a stuck libvirt query (failures to acquire the lock), and way too many errors involving hosted engine... that's never good either. There are plenty of vmGetIoTune and vmGetIoTunePolicy calls in vdsm; not sure if that's the result of those hosted engine issues or something else. Martin, thoughts?
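
As an aside, one common way (not something from this report) to gauge the memory fragmentation behind an order:4 allocation failure is to look at the kernel's free-page buddy lists on the host:

# Columns are counts of free blocks of order 0, 1, 2, ...; few or no blocks
# at order >= 4 means no contiguous 64 KiB allocation could be satisfied.
cat /proc/buddyinfo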

Comment 12 Michal Skrivanek 2016-05-06 13:29:35 UTC
Sahina/RamaKasturi - in other words, the system doesn't seem to have recovered. I can't say whether it was like that before the gluster issues or whether those were the trigger, but I would suggest reproducing after making sure the system is in a stable state.

Comment 13 Martin Sivák 2016-05-06 13:53:17 UTC
GetIoTune is needed for disk QoS and MOM. We read the values every cycle, which is why you see the calls in the log.

But all the mentioned timeouts and failures are probably a sign of something bad happening on the libvirt/qemu side. It will take some time to download the logs...

Comment 14 Sahina Bose 2016-05-09 09:38:03 UTC
(In reply to Michal Skrivanek from comment #12)
> Sahina/RamaKasturi - in other words, the system doesn't seem to have
> recovered. I can't say whether it was like that before the gluster issues or
> those were the trigger, but I would suggest to reproduce after making sure
> the system is in stable state

We have bugs filed for the Hosted Engine related errors in the logs - see Bug 1331514 and Bug 1331503, where functionality does not seem to be affected as per the comments.

Comment 15 Michal Skrivanek 2016-05-11 13:33:38 UTC
I don't see any way forward without a clear reproduction so it can be looked at by libvirt or qemu.
Please clear the needinfo once we get rid of the noise.

Comment 16 Sahina Bose 2016-05-13 07:19:05 UTC
Kasturi, can you upgrade to 3.6.6 and the latest glusterfs and retry? Please update the bug with logs and steps to reproduce.

Comment 17 RamaKasturi 2016-06-07 06:11:55 UTC
Hi Sahina,

  I tried reproducing this bug on 3.6.7 but had no luck. For now, this can be closed as not reproducible. I will reopen this bug in case I hit it again.

Thanks
kasturi.

