Created attachment 793190 [details]
engine log from 3.2 setup

Description of problem:
While making a template from a VM, the SPM host becomes non-responsive ("Connecting") and the whole setup goes down, so the template creation fails.

Version-Release number of selected component (if applicable):
Seen on both 3.2 and 3.3 builds.

How reproducible:
Unknown - this happened to me on multiple occasions; it reproduces roughly 1 in 5 times.

Steps to Reproduce:
1. Have a setup with one host and one storage domain.
2. Create a VM with an operating system.
3. Make a template from this VM.

Actual results:
The host is set as connecting, although the host is still reachable via ssh and responds to pings. The whole setup is down, and as a result the creation of the template fails.

Expected results:
The template should be created and the host should stay in state Up the whole time.

Additional info:
This also happened to me during exports and imports of templates from/to the export domain. It occurs on both the 3.2 and 3.3 builds. Attached are engine and vdsm logs from both the 3.2 and 3.3 hosts and setups.
Created attachment 793191 [details] engine log from 3.3 setup
Created attachment 793192 [details] vdsm log from host from 3.2 setup
Created attachment 793193 [details] vdsm log from host from 3.3 setup
On both hosts I can see storage connection issues and exceptions related to storage:
-> StoragePoolUnknown
-> SpmStatusError
-> StorageDomainDoesNotExist
-> StoragePoolMasterNotFound
And timeouts in the remoteFileHandler.py _recvAll function.

Federico: Any explanation for this?
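For reference, the Timeout in question is vdsm's length-prefixed read on the pipe to its out-of-process file handler giving up. A minimal sketch of that receive-with-deadline pattern, purely my own illustration and not vdsm's actual code (recv_all and its fd-based signature are made up for the example):

    import os
    import select
    import time


    class Timeout(Exception):
        """Raised when the peer does not deliver the bytes in time."""


    def recv_all(fd, length, timeout):
        # Keep one absolute deadline across partial reads so several
        # slow chunks cannot add up to more than `timeout` seconds.
        deadline = time.monotonic() + timeout
        chunks = []
        missing = length
        while missing > 0:
            remaining = deadline - time.monotonic()
            if remaining <= 0:
                raise Timeout()
            # select() bounds the wait; a peer stuck on a dead NFS
            # mount never becomes readable, so we bail out instead.
            readable, _, _ = select.select([fd], [], [], remaining)
            if not readable:
                raise Timeout()
            chunk = os.read(fd, missing)
            if not chunk:
                raise Timeout()  # EOF: the handler process died
            chunks.append(chunk)
            missing -= len(chunk)
        return b"".join(chunks)

The point of the absolute deadline is that the caller can never hang on a stuck handler; it gets a Timeout and can escalate, which is exactly what shows up in the logs below.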
Please also attach /var/log/messages (from the relevant day/time).
Created attachment 795879 [details] /var/log/messages from 3.3 setup
Created attachment 795880 [details] /var/log/messages from 3.3 host
Created attachment 795881 [details] /var/log/messages from 3.2 host
Fede, can you have a look?
The NFS server chokes under heavy load and the timeouts are triggering the SPM fence. Particularly important are the kernel messages "nfs: server 10.34.63.202 not responding, timed out".

Relevant flow:

Thread-205866::DEBUG::2013-09-03 13:59:58,198::task::579::TaskManager.Task::(_updateState) Task=`c21a9b5c-cf60-4c74-b39b-1bdca125c18d`::moving from state init -> state preparing
Thread-205866::INFO::2013-09-03 13:59:58,198::logUtils::40::dispatcher::(wrapper) Run and protect: copyImage(sdUUID='c3fd43f9-71f4-43a5-862f-56ee74c629fb', spUUID='72c04eac-8652-4962-ad6e-9b614a8ada0e', vmUUID='', srcImgUUID='172b9ff4-d42d-45b3-af9c-13df0242e81c', srcVolUUID='3e0dcb86-c19c-41cd-8a60-d0fbfcfc343e', dstImgUUID='59c935bf-e96f-4184-b86b-102d099e3960', dstVolUUID='294ffdae-066d-43d0-aecb-de46ffd1949f', description='Active VM', dstSdUUID='c3fd43f9-71f4-43a5-862f-56ee74c629fb', volType=6, volFormat=5, preallocate=2, postZero='false', force='false')
c21a9b5c-cf60-4c74-b39b-1bdca125c18d::DEBUG::2013-09-03 14:00:01,064::misc::83::Storage.Misc.excCmd::(<lambda>) '/bin/nice -n 19 /usr/bin/ionice -c 3 /usr/bin/qemu-img convert -t none -f qcow2 /rhev/data-center/72c04eac-8652-4962-ad6e-9b614a8ada0e/c3fd43f9-71f4-43a5-862f-56ee74c629fb/images/172b9ff4-d42d-45b3-af9c-13df0242e81c/3e0dcb86-c19c-41cd-8a60-d0fbfcfc343e -O raw /rhev/data-center/72c04eac-8652-4962-ad6e-9b614a8ada0e/c3fd43f9-71f4-43a5-862f-56ee74c629fb/images/59c935bf-e96f-4184-b86b-102d099e3960/294ffdae-066d-43d0-aecb-de46ffd1949f' (cwd None)

Sep 3 14:11:55 dell-r210ii-09 sanlock[6885]: 2013-09-03 14:11:55+0200 338528 [17870]: b4010867 close_task_aio 1 0x7fdb70000910 busy
Sep 3 14:11:55 dell-r210ii-09 sanlock[6885]: 2013-09-03 14:11:55+0200 338528 [17870]: b4010867 close_task_aio 2 0x7fdb70000960 busy
Sep 3 14:11:55 dell-r210ii-09 sanlock[6885]: 2013-09-03 14:11:55+0200 338528 [17870]: b4010867 close_task_aio 3 0x7fdb700009b0 busy
Sep 3 14:11:56 dell-r210ii-09 kernel: nfs: server 10.34.63.202 not responding, timed out
Sep 3 14:11:56 dell-r210ii-09 kernel: nfs: server 10.34.63.202 not responding, timed out
Sep 3 14:11:56 dell-r210ii-09 sanlock[6885]: 2013-09-03 14:11:56+0200 338529 [17870]: b4010867 aio collect 0x7fdb70000910:0x7fdb70000920:0x7fdb8cf4a000 result -5:0 close free
Sep 3 14:12:06 dell-r210ii-09 sanlock[6885]: 2013-09-03 14:12:06+0200 338539 [17870]: b4010867 close_task_aio 2 0x7fdb70000960 busy
Sep 3 14:12:06 dell-r210ii-09 sanlock[6885]: 2013-09-03 14:12:06+0200 338539 [17870]: b4010867 close_task_aio 3 0x7fdb700009b0 busy
Sep 3 14:12:07 dell-r210ii-09 kernel: nfs: server 10.34.63.202 not responding, timed out
Sep 3 14:12:07 dell-r210ii-09 kernel: nfs: server 10.34.63.202 not responding, timed out
Sep 3 14:12:07 dell-r210ii-09 sanlock[6885]: 2013-09-03 14:12:07+0200 338540 [17870]: b4010867 aio collect 0x7fdb70000960:0x7fdb70000970:0x7fdb8ce48000 result -5:0 close free
Sep 3 14:12:17 dell-r210ii-09 sanlock[6885]: 2013-09-03 14:12:17+0200 338550 [17870]: b4010867 close_task_aio 3 0x7fdb700009b0 busy
Sep 3 14:12:18 dell-r210ii-09 kernel: nfs: server 10.34.63.202 not responding, timed out
Sep 3 14:12:18 dell-r210ii-09 kernel: nfs: server 10.34.63.202 not responding, timed out
Sep 3 14:12:18 dell-r210ii-09 sanlock[6885]: 2013-09-03 14:12:18+0200 338551 [17870]: b4010867 aio collect 0x7fdb700009b0:0x7fdb700009c0:0x7fdb8cd46000 result -5:0 close free
Sep 3 14:12:32 dell-r210ii-09 kernel: nfs: server 10.34.63.202 not responding, timed out
Sep 3 14:12:42 dell-r210ii-09 kernel: nfs: server 10.34.63.202 not responding, timed out
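To line these kernel-side timeouts up against the vdsm task log, a throwaway helper along these lines works; it assumes nothing beyond the default syslog format quoted above (the script and its names are mine, not part of any tool):

    import re
    import sys

    # Matches the syslog lines quoted above, e.g.:
    #   Sep 3 14:12:18 dell-r210ii-09 kernel: nfs: server 10.34.63.202 not responding, timed out
    NFS_TIMEOUT = re.compile(
        r"^(?P<stamp>\w{3}\s+\d+\s+[\d:]{8}) .*kernel: nfs: server "
        r"(?P<server>\S+) not responding, timed out")


    def nfs_timeouts(path):
        """Yield (timestamp, server) for every NFS timeout in the log."""
        with open(path) as log:
            for line in log:
                match = NFS_TIMEOUT.match(line)
                if match:
                    yield match.group("stamp"), match.group("server")


    if __name__ == "__main__":
        log_path = sys.argv[1] if len(sys.argv) > 1 else "/var/log/messages"
        for stamp, server in nfs_timeouts(log_path):
            print(stamp, server)

The matching vdsm-side failure shows up about a minute after the last kernel timeout: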
Thread-148::ERROR::2013-09-03 14:13:56,611::domainMonitor::225::Storage.DomainMonitorThread::(_monitorDomain) Error while collecting domain c3fd43f9-71f4-43a5-862f-56ee74c629fb monitoring information
Traceback (most recent call last):
  File "/usr/share/vdsm/storage/domainMonitor.py", line 201, in _monitorDomain
    self.domain.selftest()
  File "/usr/share/vdsm/storage/nfsSD.py", line 108, in selftest
    fileSD.FileStorageDomain.selftest(self)
  File "/usr/share/vdsm/storage/fileSD.py", line 503, in selftest
    self.oop.os.statvfs(self.domaindir)
  File "/usr/share/vdsm/storage/remoteFileHandler.py", line 275, in callCrabRPCFunction
    *args, **kwargs)
  File "/usr/share/vdsm/storage/remoteFileHandler.py", line 180, in callCrabRPCFunction
    rawLength = self._recvAll(LENGTH_STRUCT_LENGTH, timeout)
  File "/usr/share/vdsm/storage/remoteFileHandler.py", line 146, in _recvAll
    raise Timeout()
Timeout

MainThread::INFO::2013-09-03 14:14:02,181::vdsm::88::vds::(run) I am the actual vdsm 4.10.2-25.0.el6ev dell-r210ii-10.rhev.lab.eng.brq.redhat.com (2.6.32-358.14.1.el6.x86_64)
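The traceback also shows why this surfaces as a Timeout rather than a hang: the statvfs() runs in a separate handler process, because a blocking syscall on a hard NFS mount that stops responding can sit in the kernel indefinitely, so vdsm bounds the wait on the pipe instead. A minimal sketch of that out-of-process pattern, assuming nothing about vdsm's internals beyond what the traceback shows (statvfs_with_timeout and _statvfs_worker are hypothetical names of mine):

    import multiprocessing
    import os
    import queue


    class Timeout(Exception):
        pass


    def _statvfs_worker(path, results):
        # This call may block in the kernel forever if `path` sits on
        # a dead NFS hard mount; that is exactly why it runs in a
        # disposable child process rather than in the daemon itself.
        stats = os.statvfs(path)
        # Convert the structseq to a plain tuple so it pickles cleanly
        # across the process boundary.
        results.put(tuple(stats))


    def statvfs_with_timeout(path, timeout=10.0):
        """Run os.statvfs(path) out of process, bounded by `timeout`."""
        results = multiprocessing.Queue(maxsize=1)
        worker = multiprocessing.Process(target=_statvfs_worker,
                                         args=(path, results))
        worker.daemon = True
        worker.start()
        try:
            return results.get(timeout=timeout)
        except queue.Empty:
            # Child stuck or too slow: report it the same way the
            # vdsm traceback above does.
            raise Timeout()
        finally:
            if worker.is_alive():
                worker.terminate()  # best effort; D-state tasks may linger

Here that bound fired, the domain monitor reported the domain as failed, and sanlock's renewals were getting EIO (the result -5 aio errors in the messages above), which matches the SPM fence described earlier; the final "I am the actual vdsm" line is vdsm starting up again afterwards.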
Dušan, from comment 10 this looks like a storage server issue; there isn't really anything we can do about it.
OK, I was already talking to Vinzenz about this, and it really is a storage problem. Closing as NOTABUG.