1396183 – [Scale] vms were not recovered, several days, after vdsm restart.

Summary: [Scale] vms were not recovered, several days, after vdsm restart.

Keywords:
Status:	CLOSED INSUFFICIENT_DATA
Alias:	None
Product:	ovirt-engine
Classification:	oVirt
Component:	BLL.Virt
Sub Component:
Version:	4.0.5.5
Hardware:	Unspecified
OS:	Unspecified
Priority:	unspecified
Severity:	medium
Target Milestone:	---
Target Release:	---
Assignee:	Michal Skrivanek
QA Contact:	Ilanit Stein
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:

Reported:	2016-11-17 17:10 UTC by Ilanit Stein
Modified:	2019-04-28 14:30 UTC
CC List:	11 users
Fixed In Version:
Clone Of:
Environment:
Last Closed:	2017-02-13 18:49:52 UTC
oVirt Team:	Virt
Embargoed:
Dependent Products:
Flags:	istein: needinfo- istein: devel_ack?

Attachments
initial vdsm log (578.11 KB, application/x-xz) 2016-11-24 15:45 UTC, Ilanit Stein	no flags
final vdsm log (629.00 KB, application/x-xz) 2016-11-24 15:48 UTC, Ilanit Stein	no flags

Description Ilanit Stein 2016-11-17 17:10:21 UTC

Description of problem:

In an RHV-4.0.5 scale environment (12 storage domains, 2281 VMs),
2 storage domains were not recovered for more than 10 days.

Here are some relevant pieces from the log, found by pkliczew: 

(03:45:25 PM) pkliczew: clientIFinit::INFO::2016-11-15 07:01:02,512::clientIF::545::vds::(_waitForDomainsUp) recovery: waiting for 2 domains to go up

(03:56:51 PM) pkliczew: vdsm was rebooted and started at MainThread::INFO::2016-11-07 09:10:24,776::vdsm::135::vds::(run) (PID: 1758) I am the actual vdsm 4.18.15-1.el7ev b01-h18-r620.rhev.openstack.engineering.redhat.com (3.10.0-514.el7.x86_64)

(03:57:04 PM) pkliczew: from that time it was in recovery mode 
(03:57:18 PM) pkliczew: but till today 2 domains were not recovered
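
To check how long a host has been stuck in recovery, the two log lines quoted above can be extracted from vdsm.log mechanically. The following is only a minimal sketch: the regular expression is inferred from the two lines quoted in this report, and the helper names (`parse_line`, `recovery_summary`) are made up for illustration and are not part of vdsm.

```python
import re
from datetime import datetime

# Layout inferred from the quoted lines (an assumption, not a vdsm spec):
# thread::LEVEL::timestamp::module::lineno::logger::(function) message
LINE_RE = re.compile(
    r"(?P<thread>[^:]+)::(?P<level>[A-Z]+)::"
    r"(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3})::"
    r"(?P<module>[^:]+)::(?P<lineno>\d+)::(?P<logger>[^:]+)::"
    r"\((?P<func>[^)]+)\) (?P<msg>.*)"
)
WAITING_RE = re.compile(r"recovery: waiting for (\d+) domains to go up")


def parse_line(line):
    """Return a dict of fields for one log line, or None if it does not match."""
    m = LINE_RE.search(line)
    if m is None:
        return None
    rec = m.groupdict()
    rec["ts"] = datetime.strptime(rec["ts"], "%Y-%m-%d %H:%M:%S,%f")
    return rec


def recovery_summary(lines):
    """Return (start, last_wait, pending) for the most recent vdsm start.

    start     -- timestamp of the last "I am the actual vdsm" line
    last_wait -- timestamp of the last "recovery: waiting" line after it
    pending   -- number of domains reported in that last waiting line
    """
    start = last_wait = pending = None
    for line in lines:
        rec = parse_line(line)
        if rec is None:
            continue
        if rec["func"] == "run" and "I am the actual vdsm" in rec["msg"]:
            # A new start resets anything seen before it.
            start, last_wait, pending = rec["ts"], None, None
            continue
        m = WAITING_RE.search(rec["msg"])
        if m:
            last_wait, pending = rec["ts"], int(m.group(1))
    return start, last_wait, pending


if __name__ == "__main__":
    sample = [
        "MainThread::INFO::2016-11-07 09:10:24,776::vdsm::135::vds::(run) "
        "(PID: 1758) I am the actual vdsm 4.18.15-1.el7ev "
        "b01-h18-r620.rhev.openstack.engineering.redhat.com (3.10.0-514.el7.x86_64)",
        "clientIFinit::INFO::2016-11-15 07:01:02,512::clientIF::545::vds::"
        "(_waitForDomainsUp) recovery: waiting for 2 domains to go up",
    ]
    start, last_wait, pending = recovery_summary(sample)
    print("started:", start)
    print("still waiting at:", last_wait, "for", pending, "domains")
    print("elapsed:", last_wait - start)
```

In practice the lines would be read from the (decompressed) vdsm log files rather than from a hard-coded list.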


Version-Release number of selected component (if applicable):
RHV-4.0.5.5-0.1.el7ev
vdsm-4.18.15-1.el7ev.x86_64


Additional info:
This RHV setup was added to a CFME appliance, where we saw RHV respond very slowly to CFME requests. This slowness might be related to this bug.

Comment 1 Ilanit Stein 2016-11-17 17:12:08 UTC

This bug might be related to Bug 1393295.

Comment 2 Tal Nisan 2016-11-21 10:11:23 UTC

Liron, please have a look at whether it's indeed related.

Comment 3 Liron Aravot 2016-11-23 13:06:41 UTC

Ilanit, can you please attach the relevant logs?

thanks,
Liron.

Comment 4 Liron Aravot 2016-11-24 14:12:58 UTC

I've checked the vdsm code; the message refers to libvirt domains (VMs), not storage domains.

Moving to virt for further inspection of the issue.

Comment 5 Ilanit Stein 2016-11-24 15:45:01 UTC

Created attachment 1223921 [details]
initial vdsm log

In this log, see that vdsm was restarted and started at

MainThread::INFO::2016-11-07 09:10:24,776::vdsm::135::vds::(run) (PID: 1758) I am the actual vdsm 4.18.15-1.el7ev b01-h18-r620.rhev.openstack.engineering.redhat.com (3.10.0-514.el7.x86_64)

Comment 6 Ilanit Stein 2016-11-24 15:48:25 UTC

Created attachment 1223922 [details]
final vdsm log

See in log

 clientIFinit::INFO::2016-11-15 07:01:02,512::clientIF::545::vds::(_waitForDomainsUp) recovery: waiting for 2 domains to go up

Comment 7 Michal Skrivanek 2016-12-21 11:05:29 UTC

Seems like 2 VMs never responded when they were queried through libvirt during recovery. Can you reproduce the problem or dig out the libvirt logs?
If you didn't have debug enabled in libvirt then it needs to be reproduced, unfortunately.
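
To illustrate the symptom (a recovery wait that never ends because a few VMs never report a state), here is a minimal sketch of a polling loop with a deadline. It is not vdsm's actual code: `wait_for_domains_up`, the `count_launching` callable and the timeout are hypothetical names and parameters chosen only for this example.

```python
import time


def wait_for_domains_up(count_launching, timeout, poll_interval=1.0,
                        sleep=time.sleep, clock=time.monotonic):
    """Poll until no VM is still launching, or until `timeout` seconds pass.

    count_launching -- hypothetical callable returning how many VMs have not
                       reported a state yet
    Returns the number of VMs still pending (0 means recovery completed).
    """
    deadline = clock() + timeout
    while True:
        pending = count_launching()
        if pending == 0:
            return 0
        if clock() >= deadline:
            # Give up instead of waiting forever; the caller decides what
            # to do with the VMs that never answered.
            return pending
        sleep(poll_interval)


if __name__ == "__main__":
    # Two VMs that never come up: the loop returns after the timeout
    # instead of staying in recovery indefinitely.
    left = wait_for_domains_up(lambda: 2, timeout=3, poll_interval=1.0)
    print("still waiting for", left, "domains")
```

Without the deadline check, two unresponsive VMs would keep such a loop running indefinitely, which matches what the logs show.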

Also, did you have fencing enabled? It may be skipped when it's returning "recovery" Martine?
...hmm...not good

Comment 8 Ilanit Stein 2016-12-21 13:08:29 UTC

I do not have the libvirt logs.
Similar scale testing is planned for the coming days. I can track whether this problem reproduces.

Comment 9 Yaniv Kaul 2017-01-03 12:48:38 UTC

Please re-run with libvirt debug logs.

Comment 10 Tomas Jelinek 2017-01-25 08:28:51 UTC

Ilanit, any chance to get this info?

Comment 11 Ilanit Stein 2017-01-25 09:01:46 UTC

I am still waiting to get a scale machine to test it on.

Comment 12 Tomas Jelinek 2017-02-01 08:17:17 UTC

ok, so putting the needinfo back on you to mark that we are waiting for some info.

Comment 13 Yaniv Kaul 2017-02-13 18:49:52 UTC

(In reply to Tomas Jelinek from comment #12)
> ok, so putting the needinfo back on you to mark that we are waiting for some
> info.

I'm closing for the time being, please re-open when reproduced.

Comment 14 Ilanit Stein 2017-05-18 07:15:57 UTC

Removing needinfo, as the problem has not been reproduced so far.
I will reopen the bug if it reproduces.
