Description of problem:
During scale-test ramp-up, at around ~5,600 VMs, many of the hosts become 'Non-Operational'. It seems that this state is caused by 'No free file handlers in pool' errors (seen in /var/log/messages on the hosts). Findings from a first investigation:

From engine.log:
2014-07-23 10:31:51,494 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.GetStatsVDSCommand] (DefaultQuartzScheduler_Worker-84) [2e1f80b7] Command GetStatsVDSCommand(HostName = host04-rack04.scale.openstack.engineering.redhat.com, HostId = 931346f2-3d6d-4cbb-80c3-50105934a44f, vds=Host[host04-rack04.scale.openstack.engineering.redhat.com,931346f2-3d6d-4cbb-80c3-50105934a44f]) execution failed. Exception: VDSRecoveringException: Recovering from crash or Initializing
2014-07-23 10:31:51,497 ERROR [org.ovirt.engine.core.vdsbroker.VdsUpdateRunTimeInfo] (DefaultQuartzScheduler_Worker-84) [2e1f80b7] vds::refreshVdsStats Failed getVdsStats, vds = 931346f2-3d6d-4cbb-80c3-50105934a44f : host04-rack04.scale.openstack.engineering.redhat.com, error = org.ovirt.engine.core.vdsbroker.vdsbroker.VDSRecoveringException: Recovering from crash or Initializing

From vdsm.log:
clientIFinit::ERROR::2014-07-23 10:26:54,038::task::866::TaskManager.Task::(_setError) Task=`fa2cfef1-b4a8-4db4-8329-3bbd7362d2d9`::Unexpected error
clientIFinit::ERROR::2014-07-23 10:26:54,050::dispatcher::68::Storage.Dispatcher.Protect::(run) raise self.error
clientIFinit::ERROR::2014-07-23 10:26:54,050::clientIF::435::vds::(_recoverExistingVms) Vm a0008c86-e4a0-43b6-8ce4-28eb47547980 recovery failed
raise vm.VolumeError(drive)
VolumeError: Bad volume specification {'address': {u'slot': u'0x06', u'bus': u'0x00', u'domain': u'0x0000', u'type': u'pci', u'function': u'0x0'}, 'volumeInfo': {'domainID': '68957f61-33a4-47ea-9b7d-4e0a84639841', 'volType': 'path', 'leaseOffset': 0, 'volumeID': 'e7d15e12-e566-4bd3-9c9f-149d22dd8310', 'leasePath': '/rhev/data-center/mnt/cmplnt-nas01.scale.openstack.engineering.redhat.com:_rhev__real_data_storage__mount/68957f61-33a4-47ea-9b7d-4e0a84639841/images/0fb0cdd6-aa26-491e-92dc-01bb05a8adc7/e7d15e12-e566-4bd3-9c9f-149d22dd8310.lease', 'imageID': '0fb0cdd6-aa26-491e-92dc-01bb05a8adc7', 'path': '/rhev/data-center/mnt/cmplnt-nas01.scale.openstack.engineering.redhat.com:_rhev__real_data_storage__mount/68957f61-33a4-47ea-9b7d-4e0a84639841/images/0fb0cdd6-aa26-491e-92dc-01bb05a8adc7/e7d15e12-e566-4bd3-9c9f-149d22dd8310'}, 'index': 0, 'iface': 'virtio', 'apparentsize': '346161152', 'alias': u'virtio-disk0', 'imageID': '0fb0cdd6-aa26-491e-92dc-01bb05a8adc7', 'readonly': 'False', 'shared': 'false', 'truesize': '346099712', 'type': 'disk', 'domainID': '68957f61-33a4-47ea-9b7d-4e0a84639841', 'reqsize': '0', 'format': 'cow', 'deviceId': '0fb0cdd6-aa26-491e-92dc-01bb05a8adc7', 'poolID': '5d43076e-f0b2-48ca-9984-c6788e9adb31', 'device': 'disk', 'path': '/rhev/data-center/5d43076e-f0b2-48ca-9984-c6788e9adb31/68957f61-33a4-47ea-9b7d-4e0a84639841/images/0fb0cdd6-aa26-491e-92dc-01bb05a8adc7/e7d15e12-e566-4bd3-9c9f-149d22dd8310', 'propagateErrors': 'off', 'optional': 'false', 'name': u'vda', 'bootOrder': u'1', 'volumeID': 'e7d15e12-e566-4bd3-9c9f-149d22dd8310', 'specParams': {}, 'volumeChain': [{'domainID': '68957f61-33a4-47ea-9b7d-4e0a84639841', 'volType': 'path', 'leaseOffset': 0, 'volumeID': 'c6f2463e-91ed-4d47-8bcd-883a5a61985e', 'leasePath': '/rhev/data-center/mnt/cmplnt-nas01.scale.openstack.engineering.redhat.com:_rhev__real_data_storage__mount/68957f61-33a4-47ea-9b7d-4e0a84639841/images/0fb0cdd6-aa26-491e-92dc-01bb05a8adc7/c6f2463e-91ed-4d47-8bcd-883a5a61985e.lease', 'imageID': '0fb0cdd6-aa26-491e-92dc-01bb05a8adc7', 'path': '/rhev/data-center/mnt/cmplnt-nas01.scale.openstack.engineering.redhat.com:_rhev__real_data_storage__mount/68957f61-33a4-47ea-9b7d-4e0a84639841/images/0fb0cdd6-aa26-491e-92dc-01bb05a8adc7/c6f2463e-91ed-4d47-8bcd-883a5a61985e'}, {'domainID': '68957f61-33a4-47ea-9b7d-4e0a84639841', 'volType': 'path', 'leaseOffset': 0, 'volumeID': 'e7d15e12-e566-4bd3-9c9f-149d22dd8310', 'leasePath': '/rhev/data-center/mnt/cmplnt-nas01.scale.openstack.engineering.redhat.com:_rhev__real_data_storage__mount/68957f61-33a4-47ea-9b7d-4e0a84639841/images/0fb0cdd6-aa26-491e-92dc-01bb05a8adc7/e7d15e12-e566-4bd3-9c9f-149d22dd8310.lease', 'imageID': '0fb0cdd6-aa26-491e-92dc-01bb05a8adc7', 'path': '/rhev/data-center/mnt/cmplnt-nas01.scale.openstack.engineering.redhat.com:_rhev__real_data_storage__mount/68957f61-33a4-47ea-9b7d-4e0a84639841/images/0fb0cdd6-aa26-491e-92dc-01bb05a8adc7/e7d15e12-e566-4bd3-9c9f-149d22dd8310'}]}

From /var/log/messages:
Jul 23 02:27:21 host09-rack06 vdsm TaskManager.Task ERROR Task=`fd70dee4-faae-4c6b-8cf0-832d68a65164`::Unexpected error#012Traceback (most recent call last):#012 File "/usr/share/vdsm/storage/task.py", line 873, in _run#012 return fn(*args, **kargs)#012 File "/usr/share/vdsm/logUtils.py", line 45, in wrapper#012 res = f(*args, **kwargs)#012 File "/usr/share/vdsm/storage/hsm.py", line 3055, in getVolumeSize#012 apparentsize = str(dom.getVSize(imgUUID, volUUID))#012 File "/usr/share/vdsm/storage/fileSD.py", line 314, in getVSize#012 return self.oop.os.stat(volPath).st_size#012 File "/usr/share/vdsm/storage/remoteFileHandler.py", line 312, in callCrabRPCFunction#012 raise Exception("No free file handlers in pool")#012Exception: No free file handlers in pool.
Jul 23 11:30:15 host09-rack06 vdsm vds ERROR Vm 79efb347-222d-4df4-84f9-8b4ef4b10838 recovery failed#012Traceback (most recent call last):#012 File "/usr/share/vdsm/clientIF.py", line 432, in _recoverExistingVms#012 vmObj.buildConfDevices()[vm.DISK_DEVICES])#012 File "/usr/share/vdsm/vm.py", line 2339, in preparePaths#012 drive['path'] = self.cif.prepareVolumePath(drive, self.id)#012 File "/usr/share/vdsm/clientIF.py", line 247, in prepareVolumePath#012 raise vm.VolumeError(drive)#012VolumeError: Bad volume specification {'address': {u'slot': u'0x06', u'bus': u'0x00', u'domain': u'0x0000', u'type': u'pci', u'function': u'0x0'}, 'volumeInfo': {'domainID': '68957f61-33a4-47ea-9b7d-4e0a84639841', 'volType': 'path', 'leaseOffset': 0, 'volumeID': 'fb991d1a-39c2-4463-a504-04e4294856f8', 'leasePath': '/rhev/data-center/mnt/cmplnt-nas01.scale.openstack.engineering.redhat.com:_rhev__real_data_storage__mount/68957f61-33a4-47ea-9b7d-4e0a84639841/images/174d792c-8fbf-4834-af0b-9dc0bf46629b/fb991d1a-39c2-4463-a504-04e4294856f8.lease', 'imageID': '174d792c-8fbf-4834-af0b-9dc0bf46629b', 'path': '/rhev/data-center/mnt/cmplnt-nas01.scale.openstack.engineering.redhat.com:_rhev__real_data_storage__mount/68957f61-33a4-47ea-9b7d-4e0a84639841/images/174d792c-8fbf-4834-af0b-9dc0bf46629b/fb991d1a-39c2-4463-a504-04e4294856f8'}, 'index': 0, 'iface': 'virtio', 'apparentsize': '358162432', 'alias': u'virtio-disk0', 'imageID': '174d792c-8fbf-4834-af0b-9dc0bf46629b', 'readonly': 'False', 'shared': 'false', 'truesize': '358092800', 'type': 'disk', 'domainID': '68957f61-33a4-47ea-9b7d-4e0a84639841', 'reqsize': '0', 'format': 'cow', 'deviceId': '174d792c-8fbf-4834-af0b-9dc0bf46629b', 'poolID': '5d43076e-f0b2-48ca-9984-c6788e9adb31', 'device': 'disk', 'path': '/rhev/data-center/5d43076e-f0b2-48ca-9984-c6788e9adb31/68957f61-33a4-47ea-9b7d-4e0a84639841/images/174d792c-8fbf-4834-af0b-9dc0bf46629b/fb991d1a-39c2-4463-a504-04e4294856f8', 'propagateErrors': 'off', 'optional': 'false', 'name': u'vda', 'bootOrder': u'1', 'volumeID': 'fb991d1a-39c2-4463-a504-04e42948

Version-Release number of selected component (if applicable):
3.4.1-0.30.el6ev

How reproducible:
Unknown

Steps to Reproduce:
1. Build 37 hosts.
2. Ramp up to ~5,600 VMs.
3. Hosts become Non-Operational or Unassigned.

Actual results:
Hosts become Unassigned or Non-Operational on the scale setup.

Expected results:
Hosts should be 'Up'.

Additional info:
Created attachment 920231 [details] vdsm.log
Created attachment 920233 [details] messages.log
Created attachment 920234 [details] engine.log
Which vdsm is this taken from? A real host or a fake host? How many VMs was that vdsm running? It looks like a simple fd limit.
A real host, running ~70 real VMs. Each VM has 4 CPUs, 256 MB RAM and a 20 GB disk (thin-provisioned from a template). Is this a known FD issue?
Another investigation was done while a host became non-operational. It seems to be a different limit. Is there another pool that vdsm manages?

Open file handles:
[root@host06-rack04 ~]# lsof | wc -l
15135

FD limit:
[root@host06-rack04 ~]# cat /proc/sys/fs/file-max
6550426

At this point this is what the host logged:
[root@host06-rack04 ~]# tail -f /var/log/messages
Jul 24 09:04:55 host06-rack04 sanlock[2193]: 2014-07-24 09:04:55+0000 349755 [11315]: s32 renewal error -202 delta_length 11 last_success 349691
Jul 24 09:04:55 host06-rack04 sanlock[2193]: 2014-07-24 09:04:55+0000 349755 [2193]: s32 check_our_lease warning 64 last_success 349691
Jul 24 09:04:56 host06-rack04 sanlock[2193]: 2014-07-24 09:04:56+0000 349756 [2193]: s32 check_our_lease warning 65 last_success 349691
Jul 24 09:04:57 host06-rack04 sanlock[2193]: 2014-07-24 09:04:57+0000 349757 [2193]: s32 check_our_lease warning 66 last_success 349691
Jul 24 09:04:58 host06-rack04 sanlock[2193]: 2014-07-24 09:04:58+0000 349758 [2193]: s32 check_our_lease warning 67 last_success 349691
Jul 24 09:04:59 host06-rack04 sanlock[2193]: 2014-07-24 09:04:59+0000 349759
...and many more.

And from the vdsm log:
VM Channels Listener::DEBUG::2014-07-24 09:14:12,283::vmChannels::95::vds::(_handle_timeouts) Timeout on fileno 46.
VM Channels Listener::DEBUG::2014-07-24 09:14:12,284::vmChannels::95::vds::(_handle_timeouts) Timeout on fileno 54.
VM Channels Listener::DEBUG::2014-07-24 09:14:12,284::vmChannels::95::vds::(_handle_timeouts) Timeout on fileno 63.
VM Channels Listener::DEBUG::2014-07-24 09:14:12,285::vmChannels::95::vds::(_handle_timeouts) Timeout on fileno 66.
VM Channels Listener::DEBUG::2014-07-24 09:14:12,285::vmChannels::95::vds::(_handle_timeouts) Timeout on fileno 70.
VM Channels Listener::DEBUG::2014-07-24 09:14:12,285::vmChannels::95::vds::(_handle_timeouts) Timeout on fileno 81.
VM Channels Listener::DEBUG::2014-07-24 09:14:12,288::vmChannels::95::vds::(_handle_timeouts) Timeout on fileno 85.
VM Channels Listener::DEBUG::2014-07-24 09:14:12,289::vmChannels::95::vds::(_handle_timeouts) Timeout on fileno 88.
VM Channels Listener::DEBUG::2014-07-24 09:14:12,289::vmChannels::95::vds::(_handle_timeouts) Timeout on fileno 90.
VM Channels Listener::DEBUG::2014-07-24 09:14:12,289::vmChannels::95::vds::(_handle_timeouts) Timeout on fileno 93.
VM Channels Listener::DEBUG::2014-07-24 09:14:12,290::vmChannels::95::vds::(_handle_timeouts) Timeout on fileno 94.
VM Channels Listener::DEBUG::2014-07-24 09:14:12,290::vmChannels::95::vds::(_handle_timeouts) Timeout on fileno 95.
VM Channels Listener::DEBUG::2014-07-24 09:14:12,290::vmChannels::95::vds::(_handle_timeouts) Timeout on fileno 102
...and many more.
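Side note, to make the limit question concrete: lsof counts open descriptors across the whole system, while the limit that usually bites a single daemon is its per-process 'Max open files' (RLIMIT_NOFILE), not /proc/sys/fs/file-max. A minimal Python sketch for checking both on a host; the pid argument is whatever vdsm pid you pick (e.g. from pgrep), which is the only assumption here:

import os
import sys

def fd_usage(pid):
    """Return (open_fds, soft_limit, hard_limit) for a given pid.

    Reads /proc/<pid>/fd and /proc/<pid>/limits, so it must run as root
    (or as the owner of the process) on a Linux host.
    """
    open_fds = len(os.listdir('/proc/%d/fd' % pid))
    soft = hard = None
    for line in open('/proc/%d/limits' % pid):
        if line.startswith('Max open files'):
            fields = line.split()
            soft, hard = int(fields[3]), int(fields[4])
    return open_fds, soft, hard

def system_fd_usage():
    """Return (allocated, free, maximum) from the kernel-wide counters."""
    allocated, free, maximum = (int(x) for x in
                                open('/proc/sys/fs/file-nr').read().split())
    return allocated, free, maximum

if __name__ == '__main__':
    pid = int(sys.argv[1])  # e.g. the vdsm pid reported by `pgrep -f vdsm`
    used, soft, hard = fd_usage(pid)
    print('pid %d: %d fds open, soft limit %s, hard limit %s' % (pid, used, soft, hard))
    allocated, _, maximum = system_fd_usage()
    print('system-wide: %d of %d file handles allocated' % (allocated, maximum))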
I have also found this:
Jul 24 11:33:30 host04-rack04 vdsm TaskManager ERROR taskManager: Skipping directory: 59f423b8-f36f-4ba7-b995-9c5e982ba8e4#012Traceback (most recent call last):#012 File "/usr/share/vdsm/storage/taskManager.py", line 213, in loadDumpedTasks#012 t = Task.loadTask(store, taskID)#012 File "/usr/share/vdsm/storage/task.py", line 1149, in loadTask#012 t._load(store, ext)#012 File "/usr/share/vdsm/storage/task.py", line 730, in _load#012 self._loadTaskMetaFile(taskDir)#012 File "/usr/share/vdsm/storage/task.py", line 681, in _loadTaskMetaFile#012 self._loadMetaFile(taskFile, self, Task.fields)#012 File "/usr/share/vdsm/storage/task.py", line 646, in _loadMetaFile#012 raise se.TaskMetaDataLoadError(filename)#012TaskMetaDataLoadError: Can't load Task Metadata: ('/rhev/data-center/5d43076e-f0b2-48ca-9984-c6788e9adb31/mastersd/master/tasks/59f423b8-f36f-4ba7-b995-9c5e982ba8e4/59f423b8-f36f-4ba7-b995-9c5e982ba8e4.task',)
Jul 24 11:33:30 host04-rack04 vdsm TaskManager.Task WARNING Task=`96d6decf-e122-4059-997c-151e3c644352`::Task was autocleaned
Jul 24 11:33:30 host04-rack04 vdsm TaskManager.Task ERROR Unexpected error#012Traceback (most recent call last):#012 File "/usr/share/vdsm/storage/task.py", line 624, in _loadMetaFile#012 for line in getProcPool().readLines(filename):#012 File "/usr/share/vdsm/storage/remoteFileHandler.py", line 297, in callCrabRPCFunction#012 *args, **kwargs)#012 File "/usr/share/vdsm/storage/remoteFileHandler.py", line 199, in callCrabRPCFunction#012 raise err#012IOError: [Errno 2] No such file or directory: '/rhev/data-center/5d43076e-f0b2-48ca-9984-c6788e9adb31/mastersd/master/tasks/36173803-7fe4-45db-8154-0c2476e907e9/36173803-7fe4-45db-8154-0c2476e907e9.task'
Jul 24 11:33:30 host04-rack04 vdsm TaskManager.Task WARNING Task=`59f423b8-f36f-4ba7-b995-9c5e982ba8e4`::Task was autocleaned
This is not related to the file descriptor limit; the error message is about file handlers.

This looks like a storage connectivity issue, which causes file operations to become slow, which causes the remote file handler pool to be overloaded with requests, leading to the "No free file handlers in pool" error. This error means that all 10 file handlers for this domain are busy.

From the sanlock messages, we understand that sanlock cannot access the storage to update its lease:

host06-rack04 sanlock[2193]: 2014-07-24 09:04:55+0000 349755 [11315]: s32 renewal error -202 delta_length 11 last_success 349691

Please upload logs from other hosts that became non-operational, and from hosts that remained active. If some hosts cannot access storage, there is probably a networking issue preventing these hosts from reaching the shared storage. Becoming non-operational is the expected result when a host cannot access storage.

Eldad, can you check how the hosts that became non-operational are connected to the network? Maybe all of them are connected to a faulty switch?
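To make the mechanism above concrete (this is not vdsm code, just an illustrative sketch with invented names): vdsm runs file operations on file-based domains through a small pool of helpers, presumably so that a hung mount cannot block the main process, and when every helper is already busy with a slow operation the next caller fails with a "no free handlers" error instead of blocking forever. Modeled with a semaphore:

import threading

class NoFreeHandlers(Exception):
    pass

class HandlerPool(object):
    """Toy model of a bounded pool of storage helpers (names and sizes invented).

    The real helpers are separate processes; a semaphore is enough to show
    why slow storage exhausts the pool: each slow call holds a slot for its
    whole duration, and once all slots are held, new calls error out.
    """

    def __init__(self, size=10, acquire_timeout=5.0):
        self._slots = threading.BoundedSemaphore(size)
        self._timeout = acquire_timeout

    def run(self, func, *args, **kwargs):
        # Python 3 semantics: acquire() returns False if no slot frees up
        # within the timeout (Python 2, the runtime here, would need a
        # polling loop instead).
        if not self._slots.acquire(timeout=self._timeout):
            raise NoFreeHandlers("No free file handlers in pool")
        try:
            return func(*args, **kwargs)  # e.g. os.stat() on an NFS path
        finally:
            self._slots.release()

With ~70 VMs each needing periodic volume stat calls, a storage slowdown that stretches each call to tens of seconds would pin all 10 slots almost immediately, which matches the intermittent failures reported above.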
I'm not sure what the switch status is.

Can we open more than 10 file handlers?

We should also make sure we are releasing resources; it could also be an application-side issue if the file handlers are not being released.
(In reply to Eldad Marciano from comment #10)
> I'm not sure what the switch status is.

Please check.

> Can we open more than 10 file handlers?

No. Each file handler is a separate process, and when using many storage domains we are already using GBs of memory because of this.

> We should also make sure we are releasing resources; it could also be an
> application-side issue if the file handlers are not being released.

It could be anything, but in this case, according to the sanlock log, we cannot access storage, which explains everything else.

Please provide the info I asked for in my comment, and keep the needinfo flag set until you do, so we can track this bug correctly.
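For a rough sense of why the pool is kept small (back-of-the-envelope only; the per-helper memory figure and the domain count are assumptions, not measurements from this setup):

# All three numbers are illustrative assumptions.
domains = 10                # storage domains attached to the host (assumed)
handlers_per_domain = 10    # the pool size mentioned above
rss_mb_per_handler = 25     # assumed resident memory of one helper process

total_mb = domains * handlers_per_domain * rss_mb_per_handler
print("~%d MB just for idle file-handler helpers" % total_mb)  # ~2500 MB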
Created attachment 921458 [details] non operational host
Ping to the storage looks OK:
PING cmplnt-nas01.scale.openstack.engineering.redhat.com (10.1.10.15) 56(84) bytes of data.
64 bytes from cmplnt-fs8600-01a.scale.openstack.engineering.redhat.com (10.1.10.15): icmp_seq=1 ttl=63 time=0.086 ms
64 bytes from cmplnt-fs8600-01a.scale.openstack.engineering.redhat.com (10.1.10.15): icmp_seq=2 ttl=63 time=1.96 ms
64 bytes from cmplnt-fs8600-01a.scale.openstack.engineering.redhat.com (10.1.10.15): icmp_seq=3 ttl=63 time=2.16 ms
64 bytes from cmplnt-fs8600-01a.scale.openstack.engineering.redhat.com (10.1.10.15): icmp_seq=4 ttl=63 time=2.05 ms
64 bytes from cmplnt-fs8600-01a.scale.openstack.engineering.redhat.com (10.1.10.15): icmp_seq=5 ttl=63 time=2.07 ms
64 bytes from cmplnt-fs8600-01a.scale.openstack.engineering.redhat.com (10.1.10.15): icmp_seq=6 ttl=63 time=2.14 ms
64 bytes from cmplnt-fs8600-01a.scale.openstack.engineering.redhat.com (10.1.10.15): icmp_seq=7 ttl=63 time=2.14
Created attachment 921487 [details] ok host
Created attachment 921720 [details] vdsm.log that shows failed calls

This log shows a lot of updateDriveVolume calls; most of them go through, and they occasionally fail with the 'no free file handler' error, but it seems that connectivity is not lost, or gets restored quickly. Also, the remote file handlers do not get stuck in disk-wait state or anything like that, and they are responding. Maybe 10 file handlers is just not enough to handle the load at all times?
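A quick way to reason about whether 10 handlers is enough is Little's law: handlers busy on average is roughly the request rate times the time each request holds a handler. The rates below are assumptions for illustration, not measurements from this host:

# Handlers needed ~= request rate * time each request holds a handler.
vms = 70                    # VMs on the host (from the comments above)
calls_per_vm_per_sec = 0.1  # assumed rate of stat/updateDriveVolume calls per VM
pool_size = 10

for service_time_sec, label in [(0.05, 'healthy storage'),
                                (5.0, 'storage slowdown')]:
    busy = vms * calls_per_vm_per_sec * service_time_sec
    verdict = 'OK' if busy < pool_size else 'pool exhausted'
    print('%s: ~%.1f handlers busy on average -> %s' % (label, busy, verdict))

Under these assumed numbers the same pool that is comfortably idle at normal latency saturates as soon as per-call latency climbs, which would explain calls mostly succeeding with occasional bursts of 'no free file handlers' failures.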
(In reply to Eldad Marciano from comment #14)
> Created attachment 921487 [details]
> ok host

This file cannot be opened; please upload it again.

You uploaded files from one non-operational host out of 37 hosts. We need more info. Please upload logs from 5 non-operational hosts. For each host, we need:
1. vdsm logs: /var/log/vdsm
2. sanlock log: /var/log/sanlock.log
3. messages: /var/log/messages

When you upload, please do not put zip files inside zip files. I would collect all the files from all hosts and use tar to archive them. If the resulting file is too big for Bugzilla, archive each host's files separately.
And please specify when the system started to become non-operational, so we don't waste time searching the logs.
Created attachment 922176 [details] non-op hosts

The problem began around 14:00.
As discussed with Nir, we decided to look for a correlation by date between the engine.log and the sanlock.log on the host, to rule out any storage bottlenecks.

No correlation was found.

Engine:
[root@host01-rack04 ovirt-engine]# less engine.log-20140727.gz |grep -i 'non-op' |grep -i 'host03-rack04'
2014-07-26 12:49:44,387 INFO [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (DefaultQuartzScheduler_Worker-26) [3732b8a2] Correlation ID: 3732b8a2, Job ID: 0f274cc3-ad94-474a-bd1c-41c5547ace70, Call Stack: null, Custom Event ID: -1, Message: Host host03-rack04.scale.openstack.engineering.redhat.com cannot access one of the Storage Domains attached to the Data Center dc_real. Setting Host state to Non-Operational.

Host:
2014-07-25 19:18:57+0000 473037 [2183]: s34 lockspace 68957f61-33a4-47ea-9b7d-4e0a84639841:1:/rhev/data-center/mnt/cmplnt-nas01.scale.openstack.engineering.redhat.com:_rhev__real_data_storage__mount/68957f61-33a4-47ea-9b7d-4e0a84639841/dom_md/ids:0
2014-07-25 20:24:54+0000 476994 [2178]: cmd 9 target pid 8593 not found
2014-07-25 20:45:41+0000 478241 [2178]: cmd 9 target pid 16373 not found
2014-07-25 20:45:42+0000 478242 [2178]: cmd 9 target pid 16560 not found
2014-07-25 20:51:55+0000 478615 [2178]: cmd 9 target pid 27598 not found
2014-07-25 22:11:39+0000 483399 [2178]: cmd 9 target pid 29367 not found
2014-07-25 22:34:25+0000 484765 [2178]: cmd 9 target pid 12780 not found
2014-07-25 22:34:30+0000 484770 [2178]: cmd 9 target pid 13087 not found
2014-07-25 22:41:52+0000 485212 [2178]: cmd 9 target pid 27689 not found
2014-07-26 00:06:21+0000 490281 [2178]: cmd 9 target pid 14008 not found
2014-07-26 00:27:35+0000 491555 [2178]: cmd 9 target pid 27787 not found
2014-07-26 00:28:31+0000 491611 [2178]: cmd 9 target pid 30630 not found
2014-07-26 00:33:05+0000 491885 [2178]: cmd 9 target pid 6879 not found
2014-07-26 01:55:13+0000 496813 [2178]: cmd 9 target pid 22077 not found
2014-07-26 02:20:23+0000 498323 [2178]: cmd 9 target pid 898 not found
2014-07-26 02:20:39+0000 498339 [2178]: cmd 9 target pid 1477 not found
2014-07-26 02:24:06+0000 498546 [2178]: cmd 9 target pid 7102 not found
2014-07-26 02:38:28+0000 499408 [2178]: cmd 9 target pid 2805 not found
2014-07-26 21:10:01+0000 566101 [2178]: cmd 9 target pid 23959 not found
2014-07-27 10:25:22+0000 613822 [2178]: cmd 9 target pid 1426 not found
2014-07-28 13:36:23+0000 711683 [19604]: 68957f61 aio timeout 0 0x7ff0f00008c0:0x7ff0f00008d0:0x7ff110176000 ioto 10 to_count 1
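A rough way to automate this correlation instead of eyeballing greps (a sketch only; it assumes engine.log lines start with a local 'YYYY-MM-DD HH:MM:SS' timestamp, sanlock.log lines start with a UTC one, and that counting sanlock renewal/lease errors within a few minutes of each Non-Operational event is good enough):

from datetime import datetime, timedelta

ENGINE_UTC_OFFSET = timedelta(hours=0)  # set to the engine host's UTC offset
WINDOW = timedelta(minutes=5)

def non_op_events(engine_log):
    """Yield UTC timestamps of 'Non-Operational' messages in engine.log."""
    for line in open(engine_log):
        if 'Non-Operational' in line:
            ts = datetime.strptime(line[:19], '%Y-%m-%d %H:%M:%S')
            yield ts - ENGINE_UTC_OFFSET

def sanlock_errors(sanlock_log):
    """Yield UTC timestamps of renewal/lease errors in sanlock.log."""
    for line in open(sanlock_log):
        if 'renewal error' in line or 'check_our_lease' in line:
            yield datetime.strptime(line[:19], '%Y-%m-%d %H:%M:%S')

def correlate(engine_log, sanlock_log):
    errors = list(sanlock_errors(sanlock_log))
    for event in non_op_events(engine_log):
        near = sum(1 for e in errors if abs(e - event) <= WINDOW)
        print('%s UTC: %d sanlock errors within %s' % (event, near, WINDOW))

# correlate('engine.log', 'sanlock.log')  # hypothetical, decompressed file names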
(In reply to Eldad Marciano from comment #19)
> As discussed with Nir, we decided to look for a correlation by date between
> the engine.log and the sanlock.log on the host, to rule out any storage
> bottlenecks.
>
> No correlation was found.

Yet, only on one host?

1. Please attach a log showing all the relevant events in the relevant date range (the one for which you opened this bug). We would like to see these events:
- the host becomes non-operational because it cannot access storage
- the host becomes operational again
2. The sanlock log from these hosts covering that time frame.
3. The time zone of the engine machine (the engine log uses local time, and the sanlock log uses UTC).
1. All of the relevant logs are already in comment 18.
2. All of the hosts in our lab use the same date/time settings.
3. It looks like sanlock converts the UTC time to the date format shown; in any case, there is no correlation anywhere in the '2014-07-26' time frame.
(In reply to Eldad Marciano from comment #21)
> 1. All of the relevant logs are already in comment 18.

The bug is called "Hosts become Non-Operational", but you can find only one host that became non-operational? When did the host become operational again? We need the time frame in which the host was non-operational because it did not see the storage.

I suggest separating the bugs: a host becoming Unassigned is not related to a host becoming non-operational because it cannot access a storage domain.

OK, if there are no more hosts with this problem, and we don't see any error in sanlock.log on this day, there was no issue accessing storage on this host.

Since the problem is not "hosts becoming non-operational" but one host that became non-operational, I'm removing the severity of this bug. We don't have enough info to tell what the severity of this issue is yet.

Please split the issue of hosts becoming Unassigned into a separate bug. Please ensure that we have all the relevant logs from host03-rack04.scale.openstack.engineering.redhat.com on this bug.
Created attachment 924524 [details] issue started in the following time frame '2014-08-06 14:20:25'
It looks like restarting the entire setup reproduces the problem.
Created attachment 924526 [details] engine screen shot
Created attachment 924529 [details] engine log
This was also found in the engine logs; it might be related to the problem.

2014-08-06 16:06:25,835 ERROR [org.ovirt.engine.core.vdsbroker.VdsUpdateRunTimeInfo] (DefaultQuartzScheduler_Worker-79) ResourceManager::refreshVdsRunTimeInfo: Error: EJBException: JBAS014516: Failed to acquire a permit within 5 MINUTES, vds = cc7b4c0d-7d6f-4e0d-a172-0181d5d19277 : host21-rack06.scale.openstack.engineering.redhat.com
(In reply to Eldad Marciano from comment #23)
> Created attachment 924524 [details]
> issue started in the following time frame '2014-08-06 14:20:25'

Thanks Eldad! Having logs from one machine, we can start to make progress with this bug.

From the sanlock log, we learn that:
1. Storage became inaccessible at 2014-08-06 14:19:24+0000
2. Storage became accessible again at 2014-08-06 14:19:24+0000

Any error in vdsm during this time frame is expected. See the annotated-sanlock.log.

From /var/log/messages we don't learn anything useful; we see the same messages from sanlock.

Please provide:
1. A filtered engine log, showing when each host that became non-operational entered that state and when it exited it. Please contact Liron or Maor for help with producing such a log.
2. sanlock.log from all hosts that became non-operational.
3. sanlock.log from one host that was still operational at the same time.
Created attachment 924534 [details] Annotated sanlock log on a host that became non operational
This bug reproduces periodically. This time we had just upgraded our scale setup and rebuilt the entire system to 3.4.2. When I connected 37 hosts, after a while many of them became non-operational.

Note: since we had just rebuilt the lab, there were no VMs or any other load on the system.

The engine raises this:
2014-08-19 02:11:31,442 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.HSMGetAllTasksStatusesVDSCommand] (DefaultQuartzScheduler_Worker-22) [19a42edb] Command HSMGetAllTasksStatusesVDSCommand(HostName = host03-rack04.scale.openstack.engineering.redhat.com, HostId = 67fbe0cb-5c94-41fb-a27a-5ada1c235254) execution failed. Exception: IRSNonOperationalException: IRSGenericException: IRSErrorException: IRSNonOperationalException: Not SPM: ()

By the SPM:
Thread-14::ERROR::2014-08-19 02:11:31,265::task::866::TaskManager.Task::(_setError) Task=`4c6ccf20-6937-45d3-93f0-fccf749fbeef`::Unexpected error
Thread-14::ERROR::2014-08-19 02:11:31,275::dispatcher::65::Storage.Dispatcher.Protect::(run) {'status': {'message': "Unknown pool id, pool not connected: ('95c2f257-fc62-4907-bb54-2e099be599e9',)", 'code': 309}}
Thread-14::ERROR::2014-08-19 02:11:31,404::task::866::TaskManager.Task::(_setError) Task=`4022c44b-5cb0-48fc-b442-a1c3437a65e2`::Unexpected error
Thread-14::ERROR::2014-08-19 02:11:31,409::dispatcher::65::Storage.Dispatcher.Protect::(run) {'status': {'message': 'Not SPM: ()', 'code': 654}}
Thread-13::ERROR::2014-08-19 02:12:29,680::sdc::137::Storage.StorageDomainCache::(_findDomain) looking for unfetched domain d4417282-22fb-42f3-86c1-fdea72903640
Thread-13::ERROR::2014-08-19 02:12:29,680::sdc::154::Storage.StorageDomainCache::(_findUnfetchedDomain) looking for domain d4417282-22fb-42f3-86c1-fdea72903640
Thread-13::DEBUG::2014-08-19 02:12:29,683::lvm::296::Storage.Misc.excCmd::(cmd) '/usr/bin/sudo -n /sbin/lvm vgs --config " devices { preferred_names = [\\"^/dev/mapper/\\"] ignore_suspended_devices=1 write_cache_state=0 disable_after_error_count=3 obtain_device_list_from_udev=0 filter = [ \'r|.*|\' ] } global { locking_type=1 prioritise_write_locks=1 wait_for_locks=1 use_lvmetad=0 } backup { retain_min = 50 retain_days = 0 } " --noheadings --units b --nosuffix --separator | -o uuid,name,attr,size,free,extent_size,extent_count,free_count,tags,vg_mda_size,vg_mda_free,lv_count,pv_count,pv_name d4417282-22fb-42f3-86c1-fdea72903640' (cwd None) Th

All the other hosts that became non-operational report these log messages:
Thread-16::ERROR::2014-08-19 00:14:59,835::domainMonitor::239::Storage.DomainMonitorThread::(_monitorDomain) Error while collecting domain d4417282-22fb-42f3-86c1-fdea72903640 monitoring information
Thread-16::ERROR::2014-08-19 00:16:09,876::domainMonitor::239::Storage.DomainMonitorThread::(_monitorDomain) Error while collecting domain d4417282-22fb-42f3-86c1-fdea72903640 monitoring information
Thread-16::ERROR::2014-08-19 00:17:19,890::domainMonitor::239::Storage.DomainMonitorThread::(_monitorDomain) Error while collecting domain d4417282-22fb-42f3-86c1-fdea72903640 monitoring information
Thread-16::ERROR::2014-08-19 00:18:29,907::domainMonitor::239::Storage.DomainMonitorThread::(_monitorDomain) Error while collecting domain d4417282-22fb-42f3-86c1-fdea72903640 monitoring information
Thread-16::ERROR::2014-08-19 00:19:39,918::sdc::143::Storage.StorageDomainCache::(_findDomain) domain d4417282-22fb-42f3-86c1-fdea72903640 not found
Thread-16::ERROR::2014-08-19 00:19:39,919::domainMonitor::239::Storage.DomainMonitorThread::(_monitorDomain) Error while collecting domain d4417282-22fb-42f3-86c1-fdea72903640 monitoring information
Thread-16::ERROR::2014-08-19 00:20:49,932::sdc::143::Storage.StorageDomainCache::(_findDomain) domain d4417282-22fb-42f3-86c1-fdea72903640 not found
Thread-16::ERROR::2014-08-19 00:20:49,933::domainMonitor::239::Storage.DomainMonitorThread::(_monitorDomain) Error while collecting domain d4417282-22fb-42f3-86c1-fdea72903640 monitoring information
Thread-16::ERROR::2014-08-19 00:21:59,946::sdc::143::Storage.StorageDomainCache::(_findDomain) domain d4417282-22fb-42f3-86c1-fdea72903640 not found
Thread-16::ERROR::2014-08-19 00:21:59,947::domainMonitor::239::Storage.DomainMonitorThread::(_monitorDomain) Error while collecting domain d4417282-22fb-42f3-86c1-fdea72903640 monitoring information
Thread-16::ERROR::2014-08-19 00:23:09,961::sdc::143::Storage.StorageDomainCache::(_findDomain) domain d4417282-22fb-42f3-86c1-fdea72903640 not found
Thread-16::ERROR::2014-08-19 00:23:09,962::domainMonitor::239::Storage.DomainMonitorThread::(_monitorDomain) Error while collecting domain d4417282-22fb-42f3-86c1-fdea72903640 monitoring information
Thread-30::DEBUG::2014-08-19 00:25:18,196::lvm::296::Storage.Misc.excCmd::(cmd) '/usr/bin/sudo -n /sbin/lvm vgs --config " devices { preferred_names = [\\"^/dev/mapper/\\"] ignore_suspended_devices=1 write_cache_state=0 disable_after_error_count=3 obtain_device_list_from_udev=0 filter = [ \'r|.*|\' ] } global { locking_type=1 prioritise_write_locks=1 wait_for_locks=1 use_lvmetad=0 } backup { retain_min = 50 retain_days = 0 } " --noheadings --units b --nosuffix --separator | -o uuid,name,attr,size,free,extent_size,extent_count,free_count,tags,vg_mda_size,vg_mda_free,lv_count,pv_count,pv_name' (cwd None)

A live debugging session needs to be scheduled; we cannot continue our scale process in this unstable situation.
Eldad, on 2014-08-06 12:53:15 EDT I asked for more logs (see comment 28). That was 13 days ago, and you still have not provided any of them. Without the requested logs we cannot help you.
Would you prefer to connect to the relevant hosts directly (engine, SPM host, regular host)?
Created attachment 928373 [details] engine, spm host, host