Bug 1332841 - VM Monitoring: After vm recovery, status not updated in UI
Summary: VM Monitoring: After vm recovery, status not updated in UI
Keywords:
Status: CLOSED WORKSFORME
Alias: None
Product: vdsm
Classification: oVirt
Component: Core
Version: 4.17.26
Hardware: Unspecified
OS: Unspecified
Priority: medium
Severity: medium
Target Milestone: ovirt-3.6.8
Assignee: Dan Kenigsberg
QA Contact: Aharon Canan
URL:
Whiteboard:
Depends On:
Blocks: Gluster-HC-1
 
Reported: 2016-05-04 07:48 UTC by RamaKasturi
Modified: 2016-06-09 07:18 UTC
CC: 5 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2016-06-09 07:18:29 UTC
oVirt Team: Virt
Embargoed:
sabose: ovirt-3.6.z?
rule-engine: planning_ack?
rule-engine: devel_ack?
rule-engine: testing_ack?



Description RamaKasturi 2016-05-04 07:48:34 UTC
Description of problem:
One of my VMs went to the Not Responding state because a wrong entry had been made in the /etc/fstab file. After correcting the entry, the VM comes up fine, but the RHEV UI still shows the VM as Not Responding.

Version-Release number of selected component (if applicable):
vdsm-4.17.26-0.1.el7ev.noarch

How reproducible:
Always

Steps to Reproduce:
1. Install an HC setup.
2. Create VMs on the setup.
3. Add a wrong entry to /etc/fstab that makes the system go into maintenance mode (an illustrative bad entry is sketched after these steps).
4. Correct the /etc/fstab file.
5. Stop and start the VM.
6. The VM comes up without any issue.
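
A minimal sketch (not taken from this report) of the kind of wrong /etc/fstab entry that drops a host into emergency/maintenance mode at boot; the device path is hypothetical, and because 'nofail' is not set the failed mount halts the boot:

# Hypothetical bad fstab line: the device does not exist and 'nofail' is not
# set, so the mount fails at boot and systemd drops to emergency (maintenance) mode.
/dev/mapper/no_such_vg-no_such_lv  /mnt/data  xfs  defaults  0 0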

Actual results:
Even though the VM comes up without any issue, I see that the status of the VM is not reflected correctly.

Expected results:
The VM status should be reflected correctly when the VM is up and running fine.

Additional info:

Comment 1 Yaniv Kaul 2016-05-05 06:42:05 UTC
Logs?

Comment 3 Sahina Bose 2016-05-05 11:32:10 UTC
Kasturi, on which of the nodes was the VM running? Was the VM named "vm_file_check"?

Comment 4 RamaKasturi 2016-05-05 12:02:29 UTC
The node on which the VM was running is tettnang.lab.eng.blr.redhat.com, and the VM name is BootStrom_linux_vm-1.

Comment 5 Sahina Bose 2016-05-05 13:18:01 UTC
Ok.

I see below in engine.log
2016-05-05 10:11:38,222 INFO  [org.ovirt.engine.core.vdsbroker.VmAnalyzer] (DefaultQuartzScheduler_Worker-60) [] VM '4fca345e-65e8-4825-babc-55c3f436f459'(BootStrom_linux_vm-1) moved from 'Up' --> 'NotResponding'

and in vdsm.log on sulphur
/var/log/vdsm/vdsm.log:Thread-381231::WARNING::2016-05-05 14:01:02,424::vm::5161::virt.vm::(_setUnresponsiveIfTimeout) vmId=`4fca345e-65e8-4825-babc-55c3f436f459`::monitor become unresponsive (command timeout, age=13816.53)


Moving this to virt team. Please provide inputs on above error

Comment 6 Michal Skrivanek 2016-05-05 14:04:57 UTC
There was no recovery in the vdsm sense. The VM stopped working (libvirt stopped responding) at 10:11, and since then it hasn't started working again, nor has there been any vdsm restart.
I fail to find any libvirt logs in those sosreport files. Any chance to reproduce this with libvirt debug logs enabled (and gathered)?

Comment 7 RamaKasturi 2016-05-06 07:51:23 UTC
Hi Michal,

   I have the /var/lib/libvirt/qemu/<vm-name>.log file. Would that help? Can you help me with enabling debug logs for libvirt so that I can try reproducing the issue?

Thanks
kasturi.

Comment 8 Michal Skrivanek 2016-05-06 08:46:36 UTC
(In reply to RamaKasturi from comment #7)
> Hi Michal,
> 
>    I have  /var/lib/libvirt/qemu/<vm-name>.log file. Would that help? can

No, that's only qemu's log, which is not helpful in this case. We need to see what went wrong in libvirt, which doesn't seem to be included in the sosreport. I also can't find the journal, so I can't check for any system messages :/
See http://wiki.libvirt.org/page/DebugLogs for how to enable libvirt debug logs.
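
For reference, a minimal sketch of the kind of settings that page describes, assuming the daemon config lives at /etc/libvirt/libvirtd.conf (the exact filter list may differ per libvirt version):

# Illustrative debug-log settings for /etc/libvirt/libvirtd.conf
log_filters="1:qemu 1:libvirt 4:object 4:json 4:event 1:util"
log_outputs="1:file:/var/log/libvirt/libvirtd.log"
# then restart the daemon so the new settings take effect
systemctl restart libvirtd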

Comment 9 Sahina Bose 2016-05-06 09:47:11 UTC
Michal, there are journalctl logs in sos_commands/logs/journalctl_--all_--this-boot_--no-pager


May 05 10:05:21 sulphur..com libvirtd[17241]: Cannot start job (query, none) for domain BootStrom_linux_vm-6; current job is (query, none) owned by (17246 remoteDispatchDomainGetBlockIoTune, 0 <null>) for (32s, 0s)
May 05 10:05:21 sulphur...com libvirtd[17241]: Timed out during operation: cannot acquire state change lock (held by remoteDispatchDomainGetBlockIoTune)

I see errors like this, however not for the VM BootStrom_linux_vm-1.
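
One quick way (a sketch, not part of the original report) to check the captured journal for the same lock errors and for the affected VM's name, using the journal file path from the sosreport:

# Search the extracted journal for lock timeouts and the affected VM
grep -E 'state change lock|BootStrom_linux_vm-1' sos_commands/logs/journalctl_--all_--this-boot_--no-pager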

Comment 10 Michal Skrivanek 2016-05-06 13:10:39 UTC
(In reply to Sahina Bose from comment #9)
> Michal, there are journalctl logs in
> sos_commands/logs/journalctl_--all_--this-boot_--no-pager

ah, thanks! that's one ugly file name:)

 
> May 05 10:05:21 sulphur..com libvirtd[17241]: Cannot start job (query, none) for domain BootStrom_linux_vm-6; current job is (query, none) owned by (17246 remoteDispatchDomainGetBlockIoTune, 0 <null>) for (32s, 0s)
> May 05 10:05:21 sulphur...com libvirtd[17241]: Timed out during operation: cannot acquire state change lock (held by remoteDispatchDomainGetBlockIoTune)
> 
> I see errors like this, however not for vm BootStrom_linux_vm-1

These are significant and indicate a libvirt problem. Those debug logs would be helpful.

Btw, there are also tons of these. You may have some misconfiguration, but even then the logs should not be flooded like that. Please open a separate bug on Hosted Engine (SLA).

ovirt-ha-broker[12856]: INFO:ovirt_hosted_engine_ha.broker.listener.ConnectionHandler:Connection established
ovirt-ha-broker[12856]: INFO:ovirt_hosted_engine_ha.broker.listener.ConnectionHandler:Connection closed

Comment 11 Michal Skrivanek 2016-05-06 13:23:48 UTC
kernel: libvirtd: page allocation failure: order:4, mode:0x1040d0
is for sure not good.
There are many errors pointing to a stuck libvirt query (failures to acquire the lock), and way too many errors involving hosted engine... that's never good either. There are plenty of vmGetIoTune and vmGetIoTunePolicy calls in vdsm; not sure if that's the result of those hosted engine issues or something else. Martin, thoughts?
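
As an aside, one common way (not something from this report) to gauge the memory fragmentation behind an order:4 allocation failure is to look at the kernel's free-page buddy lists on the host:

# Columns are counts of free blocks of order 0, 1, 2, ...; few or no blocks
# at order >= 4 means no contiguous 64 KiB allocation could be satisfied.
cat /proc/buddyinfo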

Comment 12 Michal Skrivanek 2016-05-06 13:29:35 UTC
Sahina/RamaKasturi - in other words, the system doesn't seem to have recovered. I can't say whether it was like that before the gluster issues or whether those were the trigger, but I would suggest reproducing after making sure the system is in a stable state.

Comment 13 Martin Sivák 2016-05-06 13:53:17 UTC
GetIoTune is needed for disk QoS and MOM. We read the values every cycle, which is why you see the calls in the log.

But all the mentioned timeouts and failures are probably a sign of something bad happening on the libvirt/qemu side. It will take some time to download the logs...

Comment 14 Sahina Bose 2016-05-09 09:38:03 UTC
(In reply to Michal Skrivanek from comment #12)
> Sahina/RamaKasturi - in other words, the system doesn't seem to have
> recovered. I can't say whether it was like that before the gluster issues or
> those were the trigger, but I would suggest to reproduce after making sure
> the system is in stable state

We have bugs filed for the Hosted Engine related errors in the logs - see Bug 1331514 and Bug 1331503, where functionality does not seem to be affected as per the comments.

Comment 15 Michal Skrivanek 2016-05-11 13:33:38 UTC
I don't see any way forward without a clear reproduction so it can be looked at by libvirt or qemu.
Please clear the needinfo once we get rid of the noise.

Comment 16 Sahina Bose 2016-05-13 07:19:05 UTC
Kasturi, can you upgrade to 3.6.6 and the latest glusterfs and retry? Please update the bug with logs and steps to reproduce.

Comment 17 RamaKasturi 2016-06-07 06:11:55 UTC
Hi Sahina,

  I tried reproducing this bug on 3.6.7 but had no luck. For now, this can be closed as not reproducible. I will reopen this bug in case I hit it again.

Thanks
kasturi.

