Description of problem:
In an RHV-4.0.5 scale environment (12 storage domains, 2281 VMs), 2 storage domains were not recovered for more than 10 days.

Here are some relevant pieces from the log, found by pkliczew:

(03:45:25 PM) pkliczew: clientIFinit::INFO::2016-11-15 07:01:02,512::clientIF::545::vds::(_waitForDomainsUp) recovery: waiting for 2 domains to go up
(03:56:51 PM) pkliczew: vdsm was rebooted and started at MainThread::INFO::2016-11-07 09:10:24,776::vdsm::135::vds::(run) (PID: 1758) I am the actual vdsm 4.18.15-1.el7ev b01-h18-r620.rhev.openstack.engineering.redhat.com (3.10.0-514.el7.x86_64)
(03:57:04 PM) pkliczew: from that time it was in recovery mode
(03:57:18 PM) pkliczew: but till today 2 domains were not recovered

Version-Release number of selected component (if applicable):
RHV-4.0.5.5-0.1.el7ev
vdsm-4.18.15-1.el7ev.x86_64

Additional info:
This RHV setup was added to a CFME appliance, where we encountered very slow responses from RHV to CFME requests. This slowness might be related to this bug.
This bug might be related to Bug 1393295.
Liron, please have a look and check whether it's indeed related.
Ilanit, can you please attach the relevant logs? thanks, Liron.
I've checked the vdsm code; the code in question relates to libvirt domains (VMs), not to storage domains. Moving to virt for further inspection of the issue.
Created attachment 1223921
initial vdsm log

In this log, see that vdsm was rebooted and started at:
MainThread::INFO::2016-11-07 09:10:24,776::vdsm::135::vds::(run) (PID: 1758) I am the actual vdsm 4.18.15-1.el7ev b01-h18-r620.rhev.openstack.engineering.redhat.com (3.10.0-514.el7.x86_64)
Created attachment 1223922
final vdsm log

See in this log:
clientIFinit::INFO::2016-11-15 07:01:02,512::clientIF::545::vds::(_waitForDomainsUp) recovery: waiting for 2 domains to go up
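To put a number on how long the host sat in recovery, the two log lines quoted above can be compared directly. The following is a minimal sketch in Python, assuming the `::`-separated field layout visible in those two lines; `parse_vdsm_line` is a hypothetical helper written for this comment, not part of vdsm, and it has only been checked against the two lines shown here.

```python
from datetime import datetime


def parse_vdsm_line(line):
    """Split one vdsm log line into its '::'-separated fields.

    Assumed layout (inferred from the two quoted lines only):
    thread::LEVEL::timestamp::module::lineno::logger::(func) message
    """
    thread, level, stamp, module, lineno, logger, rest = line.split("::", 6)
    # The last field starts with "(func) " followed by the free-text message.
    func, _, message = rest.partition(") ")
    return {
        "thread": thread,
        "level": level,
        "time": datetime.strptime(stamp, "%Y-%m-%d %H:%M:%S,%f"),
        "module": module,
        "line": int(lineno),
        "logger": logger,
        "func": func.lstrip("("),
        "message": message,
    }


# The two lines quoted in the attachments above.
START = ("MainThread::INFO::2016-11-07 09:10:24,776::vdsm::135::vds::(run) "
         "(PID: 1758) I am the actual vdsm 4.18.15-1.el7ev "
         "b01-h18-r620.rhev.openstack.engineering.redhat.com "
         "(3.10.0-514.el7.x86_64)")
LAST = ("clientIFinit::INFO::2016-11-15 07:01:02,512::clientIF::545::vds::"
        "(_waitForDomainsUp) recovery: waiting for 2 domains to go up")

started = parse_vdsm_line(START)
waiting = parse_vdsm_line(LAST)

# Time between the vdsm start and the last "waiting for domains" message.
elapsed = waiting["time"] - started["time"]
print("in recovery for at least:", elapsed)
```

The same helper could be run over the whole final log to list every `_waitForDomainsUp` message, which would show whether the count of pending domains ever changed during the recovery period.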
Seems like 2 VMs never responded when they were queried through libvirt during recovery. Can you reproduce the problem or dig out the libvirt logs? If you didn't have debug enabled in libvirt then it needs to be reproduced, unfortunately. Also, did you have fencing enabled? It may be skipped when it's returning "recovery" Martine? ...hmm...not good
I do not have the libvirt logs. Similar scale testing is planned for the coming days. I can track whether this problem reproduces.
Please re-run with libvirt debug logs.
Ilanit, any chance to get this info?
I am still waiting to get a scale machine to test it on.
ok, so putting the needinfo back on you to mark that we are waiting for some info.
(In reply to Tomas Jelinek from comment #12)
> ok, so putting the needinfo back on you to mark that we are waiting for some
> info.

I'm closing for the time being, please re-open when reproduced.
Removing the needinfo, as the problem has not been reproduced so far. I will reopen the bug if it reproduces.