Bug 1396183 - [Scale] VMs were not recovered for several days after vdsm restart.
Summary: [Scale] VMs were not recovered for several days after vdsm restart.
Keywords:
Status: CLOSED INSUFFICIENT_DATA
Alias: None
Product: ovirt-engine
Classification: oVirt
Component: BLL.Virt
Version: 4.0.5.5
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: medium
Target Milestone: ---
Target Release: ---
Assignee: Michal Skrivanek
QA Contact: Ilanit Stein
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2016-11-17 17:10 UTC by Ilanit Stein
Modified: 2019-04-28 14:30 UTC
CC: 11 users

Fixed In Version:
Clone Of:
Environment:
Last Closed: 2017-02-13 18:49:52 UTC
oVirt Team: Virt
Embargoed:
istein: needinfo-
istein: devel_ack?


Attachments
initial vdsm log (578.11 KB, application/x-xz), 2016-11-24 15:45 UTC, Ilanit Stein
final vdsm log (629.00 KB, application/x-xz), 2016-11-24 15:48 UTC, Ilanit Stein

Description Ilanit Stein 2016-11-17 17:10:21 UTC
Description of problem:

In an RHV-4.0.5 Scale environment (12 storage domains, 2281 VMs),
2 storage domains were not recovered for more than 10 days.

Here are some relevant pieces from the log, found by pkliczew: 

(03:45:25 PM) pkliczew: clientIFinit::INFO::2016-11-15 07:01:02,512::clientIF::545::vds::(_waitForDomainsUp) recovery: waiting for 2 domains to go up

(03:56:51 PM) pkliczew: vdsm was rebooted and started at MainThread::INFO::2016-11-07 09:10:24,776::vdsm::135::vds::(run) (PID: 1758) I am the actual vdsm 4.18.15-1.el7ev b01-h18-r620.rhev.openstack.engineering.redhat.com (3.10.0-514.el7.x86_64)

(03:57:04 PM) pkliczew: from that time it was in recovery mode 
(03:57:18 PM) pkliczew: but till today 2 domains were not recovered
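
For context, the "recovery: waiting for N domains to go up" message comes from a loop in vdsm that keeps polling until every recovering VM has come back up. A minimal sketch of that kind of loop, with illustrative names only (not the actual clientIF code):

import logging
import time

log = logging.getLogger('vds')

def wait_for_domains_up(vm_container, check_interval=1.0):
    """Block until no VM in the container is still recovering (sketch only)."""
    while True:
        recovering = [vm for vm in vm_container.values() if vm.is_recovering()]
        if not recovering:
            log.info('recovery: completed, all domains are up')
            return
        log.info('recovery: waiting for %d domains to go up', len(recovering))
        # If a VM never answers the libvirt query used during recovery,
        # this loop never finishes, which matches the symptom reported here.
        time.sleep(check_interval)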


Version-Release number of selected component (if applicable):
RHV-4.0.5.5-0.1.el7ev
vdsm-4.18.15-1.el7ev.x86_64


Additional info:
This RHV setup was added to a CFME appliance, where we encountered a very slow reaction by RHV to CFME requests. This slowness might be related to this bug.

Comment 1 Ilanit Stein 2016-11-17 17:12:08 UTC
This bug might be related to Bug 1393295.

Comment 2 Tal Nisan 2016-11-21 10:11:23 UTC
Liron, please have a look and check whether it's indeed related.

Comment 3 Liron Aravot 2016-11-23 13:06:41 UTC
Ilanit, can you please attach the relevant logs?

thanks,
Liron.

Comment 4 Liron Aravot 2016-11-24 14:12:58 UTC
I've checked the vdsm code; the code in question relates to libvirt domains (VMs), not to storage domains.

Moving to virt for further inspection of the issue.
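
To illustrate what "domains" means in that code path: it is the libvirt domains (VMs) that get queried during recovery, roughly along these lines (a rough libvirt-python sketch, not the actual vdsm code):

import libvirt

def list_recovery_candidates(uri='qemu:///system'):
    # Connect to the local libvirt daemon and enumerate its domains (VMs).
    conn = libvirt.open(uri)
    try:
        # listAllDomains() returns virDomain objects; recovery has to query
        # each of them, so a single unresponsive VM can stall the whole process.
        return [dom.name() for dom in conn.listAllDomains()]
    finally:
        conn.close()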

Comment 5 Ilanit Stein 2016-11-24 15:45:01 UTC
Created attachment 1223921 [details]
initial vdsm log

In this log, see that vdsm was rebooted and started at:

MainThread::INFO::2016-11-07 09:10:24,776::vdsm::135::vds::(run) (PID: 1758) I am the actual vdsm 4.18.15-1.el7ev b01-h18-r620.rhev.openstack.engineering.redhat.com (3.10.0-514.el7.x86_64)

Comment 6 Ilanit Stein 2016-11-24 15:48:25 UTC
Created attachment 1223922 [details]
final vdsm log

See in the log:

 clientIFinit::INFO::2016-11-15 07:01:02,512::clientIF::545::vds::(_waitForDomainsUp) recovery: waiting for 2 domains to go up

Comment 7 Michal Skrivanek 2016-12-21 11:05:29 UTC
It seems like 2 VMs never responded when queried in libvirt during recovery. Can you reproduce the problem or dig out the libvirt logs?
If you didn't have debug enabled in libvirt, then it needs to be reproduced, unfortunately.

Also, did you have fencing enabled? It may be skipped when it's returning "recovery", Martine?
...hmm...not good

Comment 8 Ilanit Stein 2016-12-21 13:08:29 UTC
I do not have the libvirt records.
There is similar scale testing planned for the coming days. I can track whether this problem reproduces.

Comment 9 Yaniv Kaul 2017-01-03 12:48:38 UTC
Please re-run with libvirt debug logs.
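
For reference, libvirt debug logging can typically be enabled before the re-run in /etc/libvirt/libvirtd.conf; the filter values below are the commonly suggested ones, not anything specific to this environment:

# /etc/libvirt/libvirtd.conf
log_filters="1:qemu 1:libvirt 4:object 4:json 4:event 1:util"
log_outputs="1:file:/var/log/libvirt/libvirtd.log"
# then restart the daemon: systemctl restart libvirtd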

Comment 10 Tomas Jelinek 2017-01-25 08:28:51 UTC
Ilanit, any chance to get this info?

Comment 11 Ilanit Stein 2017-01-25 09:01:46 UTC
I am still waiting to get a scale machine to test it on.

Comment 12 Tomas Jelinek 2017-02-01 08:17:17 UTC
ok, so putting the needinfo back on you to mark that we are waiting for some info.

Comment 13 Yaniv Kaul 2017-02-13 18:49:52 UTC
(In reply to Tomas Jelinek from comment #12)
> ok, so putting the needinfo back on you to mark that we are waiting for some
> info.

I'm closing for the time being, please re-open when reproduced.

Comment 14 Ilanit Stein 2017-05-18 07:15:57 UTC
Removing needinfo, as the problem has not been reproduced so far.
I shall reopen the bug if it reproduces.

