Bug 1003904 - Lost connection to host during template creation or import/export
Status: CLOSED NOTABUG
Product: Red Hat Enterprise Virtualization Manager
Classification: Red Hat
Component: vdsm
Version: 3.3.0
Hardware: Unspecified OS: Linux
Priority: unspecified Severity: high
Target Milestone: ---
Target Release: 3.4.0
Assigned To: Ayal Baron
QA Contact: Aharon Canan
Whiteboard: storage
Keywords: Triaged
Depends On:
Blocks:
Reported: 2013-09-03 09:15 EDT by Dušan Kajan
Modified: 2016-02-10 15:26 EST
CC: 13 users

See Also:
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2013-10-07 04:12:57 EDT
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: Storage
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---


Attachments
engine log from 3.2 setup (4.32 MB, text/x-log), 2013-09-03 09:15 EDT, Dušan Kajan
engine log from 3.3 setup (1.88 MB, text/x-log), 2013-09-03 09:16 EDT, Dušan Kajan
vdsm log from host from 3.2 setup (13.43 MB, text/x-log), 2013-09-03 09:17 EDT, Dušan Kajan
vdsm log from host from 3.3 setup (11.44 MB, text/x-log), 2013-09-03 09:18 EDT, Dušan Kajan
/var/log/messages from 3.3 setup (8.52 KB, text/plain), 2013-09-10 03:42 EDT, Dušan Kajan
/var/log/messages from 3.3 host (350.72 KB, text/plain), 2013-09-10 03:42 EDT, Dušan Kajan
/var/log/messages from 3.2 host (189.28 KB, text/plain), 2013-09-10 03:44 EDT, Dušan Kajan

Description Dušan Kajan 2013-09-03 09:15:31 EDT
Created attachment 793190 [details]
engine log from 3.2 setup

Description of problem:


Version-Release number of selected component (if applicable):


How reproducible:
Unknown - this happened to me on multiple occasions; it reproduces approximately 1 in 5 times.

Steps to Reproduce:
1. Have a setup with one host and one storage domain.
2. Create a VM with an operating system installed.
3. Make a template from this VM.


Actual results:
The host is set to Connecting, although it is still reachable via ssh and responds to pings. The whole setup is down; as a result, the template creation failed.

Expected results:
The template should be created and the host should stay in state Up the whole time.

Additional info:
This also happened to me during exports and imports of templates to/from the export domain. It occurs on both the 3.2 and 3.3 builds.

Attached are engine and vdsm logs from both the 3.2 and 3.3 hosts and setups.
Comment 1 Dušan Kajan 2013-09-03 09:16:22 EDT
Created attachment 793191 [details]
engine log from 3.3 setup
Comment 2 Dušan Kajan 2013-09-03 09:17:37 EDT
Created attachment 793192 [details]
vdsm log from host from 3.2 setup
Comment 3 Dušan Kajan 2013-09-03 09:18:47 EDT
Created attachment 793193 [details]
vdsm log from host from 3.3 setup
Comment 4 Vinzenz Feenstra [evilissimo] 2013-09-06 02:10:57 EDT
On both hosts I can see storage connection issues and exceptions related to storage:
-> StoragePoolUnknown
-> SpmStatusError
-> StorageDomainDoesNotExist
-> StoragePoolMasterNotFound
And timeouts in the remoteFileHandler.py _recvAll function

Federico: Any explanation for this?
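
For context, here is a minimal sketch of a length-prefixed "receive all" loop with a deadline, in the spirit of _recvAll in remoteFileHandler.py (illustrative only, not vdsm's actual implementation; the length prefix format is assumed): if the out-of-process helper stops answering, for example because it is blocked on an unresponsive NFS mount, the read never completes and a Timeout is raised instead of hanging forever.

    import os
    import select
    import struct
    import time

    # Hypothetical 8-byte length prefix used to frame messages on the pipe.
    LENGTH_STRUCT = struct.Struct("Q")


    class Timeout(Exception):
        pass


    def recv_all(fd, length, timeout):
        """Read exactly `length` bytes from the pipe fd, or raise Timeout."""
        deadline = time.time() + timeout
        chunks = []
        remaining = length
        while remaining > 0:
            left = deadline - time.time()
            if left <= 0:
                raise Timeout()
            # Wait for the fd to become readable, but never past the deadline.
            readable, _, _ = select.select([fd], [], [], left)
            if not readable:
                raise Timeout()
            data = os.read(fd, remaining)
            if not data:
                raise EOFError("pipe closed before the full message arrived")
            chunks.append(data)
            remaining -= len(data)
        return b"".join(chunks)


    def read_message(fd, timeout):
        # Read the length prefix first, then the payload itself.
        raw_length = recv_all(fd, LENGTH_STRUCT.size, timeout)
        (length,) = LENGTH_STRUCT.unpack(raw_length)
        return recv_all(fd, length, timeout)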
Comment 5 Federico Simoncelli 2013-09-06 02:54:27 EDT
Please attach also /var/log/messages (from the relevant day/time)
Comment 6 Dušan Kajan 2013-09-10 03:42:19 EDT
Created attachment 795879 [details]
/var/log/messages from 3.3 setup
Comment 7 Dušan Kajan 2013-09-10 03:42:54 EDT
Created attachment 795880 [details]
/var/log/messages from 3.3 host
Comment 8 Dušan Kajan 2013-09-10 03:44:53 EDT
Created attachment 795881 [details]
/var/log/messages from 3.2 host
Comment 9 Ayal Baron 2013-09-23 07:47:54 EDT
Fede, can you have a look?
Comment 10 Federico Simoncelli 2013-10-06 07:48:53 EDT
The NFS server chokes under heavy load and the resulting timeouts trigger the SPM fence. Particularly relevant are the kernel messages "nfs: server 10.34.63.202 not responding, timed out".

Relevant flow:

Thread-205866::DEBUG::2013-09-03 13:59:58,198::task::579::TaskManager.Task::(_updateState) Task=`c21a9b5c-cf60-4c74-b39b-1bdca125c18d`::moving from state init -> state preparing
Thread-205866::INFO::2013-09-03 13:59:58,198::logUtils::40::dispatcher::(wrapper) Run and protect: copyImage(sdUUID='c3fd43f9-71f4-43a5-862f-56ee74c629fb', spUUID='72c04eac-8652-4962-ad6e-9b614a8ada0e', vmUUID='', srcImgUUID='172b9ff4-d42d-45b3-af9c-13df0242e81c', srcVolUUID='3e0dcb86-c19c-41cd-8a60-d0fbfcfc343e', dstImgUUID='59c935bf-e96f-4184-b86b-102d099e3960', dstVolUUID='294ffdae-066d-43d0-aecb-de46ffd1949f', description='Active VM', dstSdUUID='c3fd43f9-71f4-43a5-862f-56ee74c629fb', volType=6, volFormat=5, preallocate=2, postZero='false', force='false')

c21a9b5c-cf60-4c74-b39b-1bdca125c18d::DEBUG::2013-09-03 14:00:01,064::misc::83::Storage.Misc.excCmd::(<lambda>) '/bin/nice -n 19 /usr/bin/ionice -c 3 /usr/bin/qemu-img convert -t none -f qcow2 /rhev/data-center/72c04eac-8652-4962-ad6e-9b614a8ada0e/c3fd43f9-71f4-43a5-862f-56ee74c629fb/images/172b9ff4-d42d-45b3-af9c-13df0242e81c/3e0dcb86-c19c-41cd-8a60-d0fbfcfc343e -O raw /rhev/data-center/72c04eac-8652-4962-ad6e-9b614a8ada0e/c3fd43f9-71f4-43a5-862f-56ee74c629fb/images/59c935bf-e96f-4184-b86b-102d099e3960/294ffdae-066d-43d0-aecb-de46ffd1949f' (cwd None)

Sep  3 14:11:55 dell-r210ii-09 sanlock[6885]: 2013-09-03 14:11:55+0200 338528 [17870]: b4010867 close_task_aio 1 0x7fdb70000910 busy
Sep  3 14:11:55 dell-r210ii-09 sanlock[6885]: 2013-09-03 14:11:55+0200 338528 [17870]: b4010867 close_task_aio 2 0x7fdb70000960 busy
Sep  3 14:11:55 dell-r210ii-09 sanlock[6885]: 2013-09-03 14:11:55+0200 338528 [17870]: b4010867 close_task_aio 3 0x7fdb700009b0 busy
Sep  3 14:11:56 dell-r210ii-09 kernel: nfs: server 10.34.63.202 not responding, timed out
Sep  3 14:11:56 dell-r210ii-09 kernel: nfs: server 10.34.63.202 not responding, timed out
Sep  3 14:11:56 dell-r210ii-09 sanlock[6885]: 2013-09-03 14:11:56+0200 338529 [17870]: b4010867 aio collect 0x7fdb70000910:0x7fdb70000920:0x7fdb8cf4a000 result -5:0 close free
Sep  3 14:12:06 dell-r210ii-09 sanlock[6885]: 2013-09-03 14:12:06+0200 338539 [17870]: b4010867 close_task_aio 2 0x7fdb70000960 busy
Sep  3 14:12:06 dell-r210ii-09 sanlock[6885]: 2013-09-03 14:12:06+0200 338539 [17870]: b4010867 close_task_aio 3 0x7fdb700009b0 busy
Sep  3 14:12:07 dell-r210ii-09 kernel: nfs: server 10.34.63.202 not responding, timed out
Sep  3 14:12:07 dell-r210ii-09 kernel: nfs: server 10.34.63.202 not responding, timed out
Sep  3 14:12:07 dell-r210ii-09 sanlock[6885]: 2013-09-03 14:12:07+0200 338540 [17870]: b4010867 aio collect 0x7fdb70000960:0x7fdb70000970:0x7fdb8ce48000 result -5:0 close free
Sep  3 14:12:17 dell-r210ii-09 sanlock[6885]: 2013-09-03 14:12:17+0200 338550 [17870]: b4010867 close_task_aio 3 0x7fdb700009b0 busy
Sep  3 14:12:18 dell-r210ii-09 kernel: nfs: server 10.34.63.202 not responding, timed out
Sep  3 14:12:18 dell-r210ii-09 kernel: nfs: server 10.34.63.202 not responding, timed out
Sep  3 14:12:18 dell-r210ii-09 sanlock[6885]: 2013-09-03 14:12:18+0200 338551 [17870]: b4010867 aio collect 0x7fdb700009b0:0x7fdb700009c0:0x7fdb8cd46000 result -5:0 close free
Sep  3 14:12:32 dell-r210ii-09 kernel: nfs: server 10.34.63.202 not responding, timed out
Sep  3 14:12:42 dell-r210ii-09 kernel: nfs: server 10.34.63.202 not responding, timed out

Thread-148::ERROR::2013-09-03 14:13:56,611::domainMonitor::225::Storage.DomainMonitorThread::(_monitorDomain) Error while collecting domain c3fd43f9-71f4-43a5-862f-56ee74c629fb monitoring information
Traceback (most recent call last):
  File "/usr/share/vdsm/storage/domainMonitor.py", line 201, in _monitorDomain
    self.domain.selftest()
  File "/usr/share/vdsm/storage/nfsSD.py", line 108, in selftest
    fileSD.FileStorageDomain.selftest(self)
  File "/usr/share/vdsm/storage/fileSD.py", line 503, in selftest
    self.oop.os.statvfs(self.domaindir)
  File "/usr/share/vdsm/storage/remoteFileHandler.py", line 275, in callCrabRPCFunction
    *args, **kwargs)
  File "/usr/share/vdsm/storage/remoteFileHandler.py", line 180, in callCrabRPCFunction
    rawLength = self._recvAll(LENGTH_STRUCT_LENGTH, timeout)
  File "/usr/share/vdsm/storage/remoteFileHandler.py", line 146, in _recvAll
    raise Timeout()
Timeout

MainThread::INFO::2013-09-03 14:14:02,181::vdsm::88::vds::(run) I am the actual vdsm 4.10.2-25.0.el6ev dell-r210ii-10.rhev.lab.eng.brq.redhat.com (2.6.32-358.14.1.el6.x86_64)
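
To make the failure path above more concrete, here is a simplified model of the flow in the traceback (illustrative only, not vdsm's actual code; the helper names and timeout values are assumptions): the domain monitor's selftest boils down to a statvfs() on the domain directory executed with a timeout, so when the NFS server stops responding the call blocks, the wait times out, and the domain is reported as failing, which is what eventually leads to the host being seen as non-responsive and the SPM being fenced.

    import os
    import threading
    import time


    class Timeout(Exception):
        pass


    def statvfs_with_timeout(path, timeout):
        """Run os.statvfs(path) in a worker thread; raise Timeout if it does not finish."""
        result = {}

        def worker():
            try:
                result["value"] = os.statvfs(path)
            except OSError as e:
                result["error"] = e

        t = threading.Thread(target=worker)
        t.daemon = True  # a statvfs blocked on a dead NFS mount may never return
        t.start()
        t.join(timeout)
        if t.is_alive():
            raise Timeout("statvfs(%s) did not complete in %ss" % (path, timeout))
        if "error" in result:
            raise result["error"]
        return result["value"]


    def monitor_domain(domain_dir, interval=10, timeout=60):
        """Periodically self-test a storage domain and log failures."""
        while True:
            try:
                statvfs_with_timeout(domain_dir, timeout)
                print("domain %s OK" % domain_dir)
            except Timeout as e:
                # Repeated failures here are what the engine ultimately sees as
                # the domain (and the SPM host) becoming non-responsive.
                print("domain %s monitoring failed: %s" % (domain_dir, e))
            time.sleep(interval)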
Comment 11 Ayal Baron 2013-10-06 08:14:45 EDT
Dušan,

Based on comment 10, this looks like a storage server issue; there isn't really anything we can do about it.
Comment 12 Dušan Kajan 2013-10-07 04:12:57 EDT
OK, I already discussed this with Vinzenz, and it really is a storage problem; closing as NOTABUG.
