Bug 1122568 - [Scale] - Hosts become Non-Operational
Summary: [Scale] - Hosts become Non-Operational
Keywords:
Status: CLOSED NOTABUG
Alias: None
Product: Red Hat Enterprise Virtualization Manager
Classification: Red Hat
Component: vdsm
Version: 3.4.0
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: urgent
Target Milestone: ---
Target Release: 3.5.0
Assignee: Saggi Mizrahi
QA Contact: Pavel Stehlik
URL:
Whiteboard: infra
Depends On:
Blocks:
 
Reported: 2014-07-23 14:13 UTC by Eldad Marciano
Modified: 2019-06-13 08:05 UTC
CC List: 12 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2014-09-14 14:48:14 UTC
oVirt Team: Infra


Attachments
vdsm.log (272.00 KB, application/zip) - 2014-07-23 14:14 UTC, Eldad Marciano
messages.log (1.46 MB, application/zip) - 2014-07-23 14:15 UTC, Eldad Marciano
engine.log (2.74 MB, application/zip) - 2014-07-23 14:16 UTC, Eldad Marciano
non operational host (1.35 MB, application/zip) - 2014-07-27 11:33 UTC, Eldad Marciano
ok host (176.00 KB, application/zip) - 2014-07-27 15:37 UTC, Eldad Marciano
vdsm.log that shows failed calls (1.60 MB, text/x-log) - 2014-07-28 08:17 UTC, Dima Kuznetsov
non-op hosts the problem begin around 14:00 (1.84 MB, application/zip) - 2014-07-29 14:54 UTC, Eldad Marciano
issue started in the following time frame '2014-08-06 14:20:25' (842.18 KB, application/zip) - 2014-08-06 16:01 UTC, Eldad Marciano
engine screen shot (88.72 KB, image/png) - 2014-08-06 16:04 UTC, Eldad Marciano
engine log (2.95 MB, application/zip) - 2014-08-06 16:22 UTC, Eldad Marciano
Annotated sanlock log on a host that became non operational (3.82 KB, text/x-log) - 2014-08-06 16:58 UTC, Nir Soffer
engine, spm host, host (4.52 MB, application/zip) - 2014-08-19 13:41 UTC, Eldad Marciano

Description Eldad Marciano 2014-07-23 14:13:40 UTC
Description of problem:
During scale-test ramp-up, at around ~5,600 VMs, many of the hosts become 'non-operational'.

It seems that this state happens due to 'No free file handlers in pool' errors (from /var/log/messages on the hosts).

Findings from the initial investigation:

-from engine log:
2014-07-23 10:31:51,494 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.GetStatsVDSCommand] (DefaultQuartzScheduler_Worker-84) [2e1f80b7] Command GetStatsVDSCommand(HostName = host04-rack04.scale.openstack.engineering.redhat.com, HostId = 931346f2-3d6d-4cbb-80c3-50105934a44f, vds=Host[host04-rack04.scale.openstack.engineering.redhat.com,931346f2-3d6d-4cbb-80c3-50105934a44f]) execution failed. Exception: VDSRecoveringException: Recovering from crash or Initializing 
2014-07-23 10:31:51,497 ERROR [org.ovirt.engine.core.vdsbroker.VdsUpdateRunTimeInfo] (DefaultQuartzScheduler_Worker-84) [2e1f80b7] vds::refreshVdsStats Failed getVdsStats, vds = 931346f2-3d6d-4cbb-80c3-50105934a44f : host04-rack04.scale.openstack.engineering.redhat.com, error = org.ovirt.engine.core.vdsbroker.vdsbroker.VDSRecoveringException: Recovering from crash or Initializing


-from vdsm.log:
clientIFinit::ERROR::2014-07-23 10:26:54,038::task::866::TaskManager.Task::(_setError) Task=`fa2cfef1-b4a8-4db4-8329-3bbd7362d2d9`::Unexpected error
clientIFinit::ERROR::2014-07-23 10:26:54,050::dispatcher::68::Storage.Dispatcher.Protect::(run)
raise self.error
clientIFinit::ERROR::2014-07-23 10:26:54,050::clientIF::435::vds::(_recoverExistingVms) Vm a0008c86-e4a0-43b6-8ce4-28eb47547980 recovery failed
raise vm.VolumeError(drive)
VolumeError: Bad volume specification {'address': {u'slot': u'0x06', u'bus': u'0x00', u'domain': u'0x0000', u'type': u'pci', u'function': u'0x0'}, 'volumeInfo': {'domainID': '68957f61-33a4-47ea-9b7d-4e0a84639841', 'volType': 'path', 'leaseOffset': 0, 'volumeID': 'e7d15e12-e566-4bd3-9c9f-149d22dd8310', 'leasePath': '/rhev/data-center/mnt/cmplnt-nas01.scale.openstack.engineering.redhat.com:_rhev__real_data_storage__mount/68957f61-33a4-47ea-9b7d-4e0a84639841/images/0fb0cdd6-aa26-491e-92dc-01bb05a8adc7/e7d15e12-e566-4bd3-9c9f-149d22dd8310.lease', 'imageID': '0fb0cdd6-aa26-491e-92dc-01bb05a8adc7', 'path': '/rhev/data-center/mnt/cmplnt-nas01.scale.openstack.engineering.redhat.com:_rhev__real_data_storage__mount/68957f61-33a4-47ea-9b7d-4e0a84639841/images/0fb0cdd6-aa26-491e-92dc-01bb05a8adc7/e7d15e12-e566-4bd3-9c9f-149d22dd8310'}, 'index': 0, 'iface': 'virtio', 'apparentsize': '346161152', 'alias': u'virtio-disk0', 'imageID': '0fb0cdd6-aa26-491e-92dc-01bb05a8adc7', 'readonly': 'False', 'shared': 'false', 'truesize': '346099712', 'type': 'disk', 'domainID': '68957f61-33a4-47ea-9b7d-4e0a84639841', 'reqsize': '0', 'format': 'cow', 'deviceId': '0fb0cdd6-aa26-491e-92dc-01bb05a8adc7', 'poolID': '5d43076e-f0b2-48ca-9984-c6788e9adb31', 'device': 'disk', 'path': '/rhev/data-center/5d43076e-f0b2-48ca-9984-c6788e9adb31/68957f61-33a4-47ea-9b7d-4e0a84639841/images/0fb0cdd6-aa26-491e-92dc-01bb05a8adc7/e7d15e12-e566-4bd3-9c9f-149d22dd8310', 'propagateErrors': 'off', 'optional': 'false', 'name': u'vda', 'bootOrder': u'1', 'volumeID': 'e7d15e12-e566-4bd3-9c9f-149d22dd8310', 'specParams': {}, 'volumeChain': [{'domainID': '68957f61-33a4-47ea-9b7d-4e0a84639841', 'volType': 'path', 'leaseOffset': 0, 'volumeID': 'c6f2463e-91ed-4d47-8bcd-883a5a61985e', 'leasePath': '/rhev/data-center/mnt/cmplnt-nas01.scale.openstack.engineering.redhat.com:_rhev__real_data_storage__mount/68957f61-33a4-47ea-9b7d-4e0a84639841/images/0fb0cdd6-aa26-491e-92dc-01bb05a8adc7/c6f2463e-91ed-4d47-8bcd-883a5a61985e.lease', 
'imageID': '0fb0cdd6-aa26-491e-92dc-01bb05a8adc7', 'path': '/rhev/data-center/mnt/cmplnt-nas01.scale.openstack.engineering.redhat.com:_rhev__real_data_storage__mount/68957f61-33a4-47ea-9b7d-4e0a84639841/images/0fb0cdd6-aa26-491e-92dc-01bb05a8adc7/c6f2463e-91ed-4d47-8bcd-883a5a61985e'}, {'domainID': '68957f61-33a4-47ea-9b7d-4e0a84639841', 'volType': 'path', 'leaseOffset': 0, 'volumeID': 'e7d15e12-e566-4bd3-9c9f-149d22dd8310', 'leasePath': '/rhev/data-center/mnt/cmplnt-nas01.scale.openstack.engineering.redhat.com:_rhev__real_data_storage__mount/68957f61-33a4-47ea-9b7d-4e0a84639841/images/0fb0cdd6-aa26-491e-92dc-01bb05a8adc7/e7d15e12-e566-4bd3-9c9f-149d22dd8310.lease', 'imageID': '0fb0cdd6-aa26-491e-92dc-01bb05a8adc7', 'path': '/rhev/data-center/mnt/cmplnt-nas01.scale.openstack.engineering.redhat.com:_rhev__real_data_storage__mount/68957f61-33a4-47ea-9b7d-4e0a84639841/images/0fb0cdd6-aa26-491e-92dc-01bb05a8adc7/e7d15e12-e566-4bd3-9c9f-149d22dd8310'}]}



-from /var/log/messages:
Jul 23 02:27:21 host09-rack06 vdsm TaskManager.Task ERROR Task=`fd70dee4-faae-4c6b-8cf0-832d68a65164`::Unexpected error#012Traceback (most recent call last):#012  File "/usr/share/vdsm/storage/task.py", line 873, in _run#012    return fn(*args, **kargs)#012  File "/usr/share/vdsm/logUtils.py", line 45, in wrapper#012    res = f(*args, **kwargs)#012  File "/usr/share/vdsm/storage/hsm.py", line 3055, in getVolumeSize#012    apparentsize = str(dom.getVSize(imgUUID, volUUID))#012  File "/usr/share/vdsm/storage/fileSD.py", line 314, in getVSize#012    return self.oop.os.stat(volPath).st_size#012  File "/usr/share/vdsm/storage/remoteFileHandler.py", line 312, in callCrabRPCFunction#012    raise Exception("No free file handlers in pool")#012Exception: No free file handlers in pool.

Jul 23 11:30:15 host09-rack06 vdsm vds ERROR Vm 79efb347-222d-4df4-84f9-8b4ef4b10838 recovery failed#012Traceback (most recent call last):#012 File "/usr/share/vdsm/clientIF.py", line 432, in _recoverExistingVms#012 vmObj.buildConfDevices()[vm.DISK_DEVICES])#012 File "/usr/share/vdsm/vm.py", line 2339, in preparePaths#012 drive['path'] = self.cif.prepareVolumePath(drive, self.id)#012 File "/usr/share/vdsm/clientIF.py", line 247, in prepareVolumePath#012 raise vm.VolumeError(drive)#012VolumeError: Bad volume specification {'address': {u'slot': u'0x06', u'bus': u'0x00', u'domain': u'0x0000', u'type': u'pci', u'function': u'0x0'}, 'volumeInfo': {'domainID': '68957f61-33a4-47ea-9b7d-4e0a84639841', 'volType': 'path', 'leaseOffset': 0, 'volumeID': 'fb991d1a-39c2-4463-a504-04e4294856f8', 'leasePath': '/rhev/data-center/mnt/cmplnt-nas01.scale.openstack.engineering.redhat.com:_rhev__real_data_storage__mount/68957f61-33a4-47ea-9b7d-4e0a84639841/images/174d792c-8fbf-4834-af0b-9dc0bf46629b/fb991d1a-39c2-4463-a504-04e4294856f8.lease', 'imageID': '174d792c-8fbf-4834-af0b-9dc0bf46629b', 'path': '/rhev/data-center/mnt/cmplnt-nas01.scale.openstack.engineering.redhat.com:_rhev__real_data_storage__mount/68957f61-33a4-47ea-9b7d-4e0a84639841/images/174d792c-8fbf-4834-af0b-9dc0bf46629b/fb991d1a-39c2-4463-a504-04e4294856f8'}, 'index': 0, 'iface': 'virtio', 'apparentsize': '358162432', 'alias': u'virtio-disk0', 'imageID': '174d792c-8fbf-4834-af0b-9dc0bf46629b', 'readonly': 'False', 'shared': 'false', 'truesize': '358092800', 'type': 'disk', 'domainID': '68957f61-33a4-47ea-9b7d-4e0a84639841', 'reqsize': '0', 'format': 'cow', 'deviceId': '174d792c-8fbf-4834-af0b-9dc0bf46629b', 'poolID': '5d43076e-f0b2-48ca-9984-c6788e9adb31', 'device': 'disk', 'path': '/rhev/data-center/5d43076e-f0b2-48ca-9984-c6788e9adb31/68957f61-33a4-47ea-9b7d-4e0a84639841/images/174d792c-8fbf-4834-af0b-9dc0bf46629b/fb991d1a-39c2-4463-a504-04e4294856f8', 'propagateErrors': 'off', 'optional': 'false', 'name': u'vda', 
'bootOrder': u'1', 'volumeID': 'fb991d1a-39c2-4463-a504-04e42948

 

Version-Release number of selected component (if applicable):
 3.4.1-0.30.el6ev

How reproducible:
unknown 

Steps to Reproduce:
1. Build 37 hosts.
2. Ramp up to ~5,600 VMs.
3. Hosts become non-operational or unassigned.

Actual results:
Hosts become unassigned or non-operational on the scale setup.

Expected results:
Hosts should be 'Up'.

Additional info:

Comment 1 Eldad Marciano 2014-07-23 14:14:19 UTC
Created attachment 920231 [details]
vdsm.log

Comment 2 Eldad Marciano 2014-07-23 14:15:12 UTC
Created attachment 920233 [details]
messages.log

Comment 3 Eldad Marciano 2014-07-23 14:16:37 UTC
Created attachment 920234 [details]
engine.log

Comment 5 Barak 2014-07-23 17:33:31 UTC
Which vdsm is this taken from?
A real host or a fake host?
How many VMs was that vdsm running?

It looks like a simple FD limit.

Comment 6 Eldad Marciano 2014-07-24 07:47:16 UTC
A real host, running ~70 real VMs.
Each VM has 4 CPUs, 256 MB RAM, and a 20 GB disk (thin-provisioned from a template).


Is this a known FD issue?

Comment 7 Eldad Marciano 2014-07-24 09:24:06 UTC
Further investigation was done at the time a host became non-operational.
It seems to be a different limit.

Is there another pool that vdsm manages?

Open file handles:
[root@host06-rack04 ~]# lsof |wc -l
15135  

FD limit:
[root@host06-rack04 ~]# cat /proc/sys/fs/file-max
6550426

At this point, this is what the host logged:
[root@host06-rack04 ~]# tail -f /var/log/messages 
Jul 24 09:04:55 host06-rack04 sanlock[2193]: 2014-07-24 09:04:55+0000 349755 [11315]: s32 renewal error -202 delta_length 11 last_success 349691
Jul 24 09:04:55 host06-rack04 sanlock[2193]: 2014-07-24 09:04:55+0000 349755 [2193]: s32 check_our_lease warning 64 last_success 349691         
Jul 24 09:04:56 host06-rack04 sanlock[2193]: 2014-07-24 09:04:56+0000 349756 [2193]: s32 check_our_lease warning 65 last_success 349691         
Jul 24 09:04:57 host06-rack04 sanlock[2193]: 2014-07-24 09:04:57+0000 349757 [2193]: s32 check_our_lease warning 66 last_success 349691         
Jul 24 09:04:58 host06-rack04 sanlock[2193]: 2014-07-24 09:04:58+0000 349758 [2193]: s32 check_our_lease warning 67 last_success 349691         
Jul 24 09:04:59 host06-rack04 sanlock[2193]: 2014-07-24 09:04:59+0000 349759
and many more.


and from vdsm log:
VM Channels Listener::DEBUG::2014-07-24 09:14:12,283::vmChannels::95::vds::(_handle_timeouts) Timeout on fileno 46.                                                                                                  
VM Channels Listener::DEBUG::2014-07-24 09:14:12,284::vmChannels::95::vds::(_handle_timeouts) Timeout on fileno 54.                                                                                                  
VM Channels Listener::DEBUG::2014-07-24 09:14:12,284::vmChannels::95::vds::(_handle_timeouts) Timeout on fileno 63.                                                                                                  
VM Channels Listener::DEBUG::2014-07-24 09:14:12,285::vmChannels::95::vds::(_handle_timeouts) Timeout on fileno 66.                                                                                                  
VM Channels Listener::DEBUG::2014-07-24 09:14:12,285::vmChannels::95::vds::(_handle_timeouts) Timeout on fileno 70.                                                                                                  
VM Channels Listener::DEBUG::2014-07-24 09:14:12,285::vmChannels::95::vds::(_handle_timeouts) Timeout on fileno 81.                                                                                                  
VM Channels Listener::DEBUG::2014-07-24 09:14:12,288::vmChannels::95::vds::(_handle_timeouts) Timeout on fileno 85.                                                                                                  
VM Channels Listener::DEBUG::2014-07-24 09:14:12,289::vmChannels::95::vds::(_handle_timeouts) Timeout on fileno 88.                                                                                                  
VM Channels Listener::DEBUG::2014-07-24 09:14:12,289::vmChannels::95::vds::(_handle_timeouts) Timeout on fileno 90.                                                                                                  
VM Channels Listener::DEBUG::2014-07-24 09:14:12,289::vmChannels::95::vds::(_handle_timeouts) Timeout on fileno 93.                                                                                                  
VM Channels Listener::DEBUG::2014-07-24 09:14:12,290::vmChannels::95::vds::(_handle_timeouts) Timeout on fileno 94.                                                                                                  
VM Channels Listener::DEBUG::2014-07-24 09:14:12,290::vmChannels::95::vds::(_handle_timeouts) Timeout on fileno 95.                                                                                                  
VM Channels Listener::DEBUG::2014-07-24 09:14:12,290::vmChannels::95::vds::(_handle_timeouts) Timeout on fileno 102

and many more

Comment 8 Eldad Marciano 2014-07-24 11:35:05 UTC
I have also found this:

Jul 24 11:33:30 host04-rack04 vdsm TaskManager ERROR taskManager: Skipping directory: 59f423b8-f36f-4ba7-b995-9c5e982ba8e4#012Traceback (most recent call last):#012  File "/usr/share/vdsm/storage/taskManager.py", line 213, in loadDumpedTasks#012    t = Task.loadTask(store, taskID)#012  File "/usr/share/vdsm/storage/task.py", line 1149, in loadTask#012    t._load(store, ext)#012  File "/usr/share/vdsm/storage/task.py", line 730, in _load#012    self._loadTaskMetaFile(taskDir)#012  File "/usr/share/vdsm/storage/task.py", line 681, in _loadTaskMetaFile#012    self._loadMetaFile(taskFile, self, Task.fields)#012  File "/usr/share/vdsm/storage/task.py", line 646, in _loadMetaFile#012    raise se.TaskMetaDataLoadError(filename)#012TaskMetaDataLoadError: Can't load Task Metadata: ('/rhev/data-center/5d43076e-f0b2-48ca-9984-c6788e9adb31/mastersd/master/tasks/59f423b8-f36f-4ba7-b995-9c5e982ba8e4/59f423b8-f36f-4ba7-b995-9c5e982ba8e4.task',)
Jul 24 11:33:30 host04-rack04 vdsm TaskManager.Task WARNING Task=`96d6decf-e122-4059-997c-151e3c644352`::Task was autocleaned
Jul 24 11:33:30 host04-rack04 vdsm TaskManager.Task ERROR Unexpected error#012Traceback (most recent call last):#012  File "/usr/share/vdsm/storage/task.py", line 624, in _loadMetaFile#012    for line in getProcPool().readLines(filename):#012  File "/usr/share/vdsm/storage/remoteFileHandler.py", line 297, in callCrabRPCFunction#012    *args, **kwargs)#012  File "/usr/share/vdsm/storage/remoteFileHandler.py", line 199, in callCrabRPCFunction#012    raise err#012IOError: [Errno 2] No such file or directory: '/rhev/data-center/5d43076e-f0b2-48ca-9984-c6788e9adb31/mastersd/master/tasks/36173803-7fe4-45db-8154-0c2476e907e9/36173803-7fe4-45db-8154-0c2476e907e9.task'
Jul 24 11:33:30 host04-rack04 vdsm TaskManager.Task WARNING Task=`59f423b8-f36f-4ba7-b995-9c5e982ba8e4`::Task was autocleaned

Comment 9 Nir Soffer 2014-07-24 19:18:33 UTC
This is not related to the file descriptor limit. The error message is about file handlers.

This looks like a storage connectivity issue, which causes file operations to become slow,
which causes the remote file handler pool to be overloaded by requests, leading to the
"No free file handlers in pool" error. This error means that all 10 file handlers for
this domain are busy.

From sanlock message, we understand that sanlock cannot access the storage for
updating the lease:

    host06-rack04 sanlock[2193]: 2014-07-24 09:04:55+0000 349755 [11315]:
    s32 renewal error -202 delta_length 11 last_success 349691

Please upload logs from other hosts that became non-operational, and from hosts that
remained active.

If some hosts cannot access storage, there is probably a networking issue preventing
these hosts from accessing the shared storage. Becoming non-operational is the
expected result when some hosts cannot access storage.

Eldad, can you check how the hosts that became non-operational are connected to the
network? Maybe all of them are connected to a faulty switch?
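The pool-exhaustion mechanism described above can be sketched in a few lines of Python. This is a hypothetical simplification, not vdsm's actual code: vdsm's remoteFileHandler pool uses separate helper processes per domain, while here a semaphore stands in for the 10 slots, and the class and method names are made up for illustration.

```python
import threading

POOL_SIZE = 10  # vdsm allows 10 remote file handlers per storage domain

class FileHandlerPool:
    """Toy model of a bounded handler pool: when storage is slow,
    in-flight calls hold their slots longer, new requests find no
    free slot, and the pool fails fast instead of queueing."""

    def __init__(self, size=POOL_SIZE):
        self._slots = threading.Semaphore(size)

    def call(self, func, *args, **kwargs):
        # Non-blocking acquire: fail immediately when all slots are busy.
        if not self._slots.acquire(blocking=False):
            raise Exception("No free file handlers in pool")
        try:
            return func(*args, **kwargs)
        finally:
            self._slots.release()
```

Under this model, sustained slow NFS operations are enough to trigger the error: no slot leaks, the pool is simply saturated by long-running calls.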

Comment 10 Eldad Marciano 2014-07-27 10:45:49 UTC
I'm not sure what the switch status is.

Can we open more than 10 file handlers?

We should also make sure we are releasing resources; it could also be an application-side issue if the file handlers are not being released.

Comment 11 Nir Soffer 2014-07-27 10:57:24 UTC
(In reply to Eldad Marciano from comment #10)
> I'm not sure what the switch status is.

Please check
> 
> Can we open more than 10 file handlers?

No, each file handler is a separate process, and when using many storage domains we are already using GBs of memory because of this.

> We should also make sure we are releasing resources; it could also be an
> application-side issue if the file handlers are not being released.

It could be anything, but in this case, according to the sanlock log, we cannot access storage, which explains everything else.

Please provide the info I asked for in my comment, and keep the needinfo flag until you do so, so we can track this bug correctly.

Comment 12 Eldad Marciano 2014-07-27 11:33:56 UTC
Created attachment 921458 [details]
non operational host

Comment 13 Eldad Marciano 2014-07-27 11:37:16 UTC
Ping to the storage looks OK:

PING cmplnt-nas01.scale.openstack.engineering.redhat.com (10.1.10.15) 56(84) bytes of data.
64 bytes from cmplnt-fs8600-01a.scale.openstack.engineering.redhat.com (10.1.10.15): icmp_seq=1 ttl=63 time=0.086 ms
64 bytes from cmplnt-fs8600-01a.scale.openstack.engineering.redhat.com (10.1.10.15): icmp_seq=2 ttl=63 time=1.96 ms
64 bytes from cmplnt-fs8600-01a.scale.openstack.engineering.redhat.com (10.1.10.15): icmp_seq=3 ttl=63 time=2.16 ms
64 bytes from cmplnt-fs8600-01a.scale.openstack.engineering.redhat.com (10.1.10.15): icmp_seq=4 ttl=63 time=2.05 ms
64 bytes from cmplnt-fs8600-01a.scale.openstack.engineering.redhat.com (10.1.10.15): icmp_seq=5 ttl=63 time=2.07 ms
64 bytes from cmplnt-fs8600-01a.scale.openstack.engineering.redhat.com (10.1.10.15): icmp_seq=6 ttl=63 time=2.14 ms
64 bytes from cmplnt-fs8600-01a.scale.openstack.engineering.redhat.com (10.1.10.15): icmp_seq=7 ttl=63 time=2.14

Comment 14 Eldad Marciano 2014-07-27 15:37:17 UTC
Created attachment 921487 [details]
ok host

Comment 15 Dima Kuznetsov 2014-07-28 08:17:48 UTC
Created attachment 921720 [details]
vdsm.log that shows failed calls

This log shows a lot of updateDriveVolume calls; most of them go through, but they occasionally fail with the 'no free file handlers' error. It seems that connectivity is either not lost or gets restored. Also, the remote file handlers do not get stuck in disk-wait state or anything similar, and are responding.

Maybe 10 file handlers is just not enough to handle the load at all times?
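This question can be given a rough number with Little's law (mean concurrency = arrival rate x mean service time). A back-of-the-envelope sketch follows; the request rate and per-call latency are illustrative assumptions, not measurements from this setup.

```python
import math

def handlers_needed(requests_per_sec, avg_call_seconds):
    """Little's law: the mean number of in-flight calls equals the
    arrival rate times the mean service time. If this exceeds the
    pool size (10 in vdsm), 'No free file handlers in pool' errors
    are expected even with healthy storage."""
    return math.ceil(requests_per_sec * avg_call_seconds)

# Illustrative numbers: ~70 VMs each polled every 10s, with NFS
# stat calls slowing to ~2s under load:
demand = handlers_needed(70 / 10.0, 2.0)  # 14 concurrent calls needed
```

With those assumed figures the steady-state demand (14) already exceeds the 10-slot pool, so intermittent failures would occur without any connectivity loss.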

Comment 16 Nir Soffer 2014-07-28 09:24:59 UTC
(In reply to Eldad Marciano from comment #14)
> Created attachment 921487 [details]
> ok host

This file cannot be opened, please upload again.

You uploaded files from one non-operational host of 37 hosts. We need more info.

Please upload logs from 5 non-operational hosts.

For each host, we need:
1. vdsm logs /var/log/vdsm
2. sanlock log /var/log/sanlock.log
3. message /var/log/messages

When you upload, please do not put zip files inside zip files. I would collect all the files from all hosts and use tar to archive them. If the resulting file is too big for Bugzilla, archive each host's files separately.

Comment 17 Nir Soffer 2014-07-28 10:32:38 UTC
And please specify when the system started to be non-operational, so we don't waste time searching the logs.

Comment 18 Eldad Marciano 2014-07-29 14:54:53 UTC
Created attachment 922176 [details]
non-op hosts the problem begin around 14:00

Comment 19 Eldad Marciano 2014-07-31 12:37:07 UTC
As discussed with Nir, we decided to look for a correlation by date between engine.log and sanlock.log on the host, to rule out any storage bottlenecks.

No correlation found.

engine:
[root@host01-rack04 ovirt-engine]# less engine.log-20140727.gz |grep -i 'non-op' |grep -i 'host03-rack04'
2014-07-26 12:49:44,387 INFO  [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (DefaultQuartzScheduler_Worker-26) [3732b8a2] Correlation ID: 3732b8a2, Job ID: 0f274cc3-ad94-474a-bd1c-41c5547ace70, Call Stack: null, Custom Event ID: -1, Message: Host host03-rack04.scale.openstack.engineering.redhat.com cannot access one of the Storage Domains attached to the Data Center dc_real. Setting Host state to Non-Operational.


host:
2014-07-25 19:18:57+0000 473037 [2183]: s34 lockspace 68957f61-33a4-47ea-9b7d-4e0a84639841:1:/rhev/data-center/mnt/cmplnt-nas01.scale.openstack.engineering.redhat.com:_rhev__real_data_storage__mount/68957f61-33a4-
47ea-9b7d-4e0a84639841/dom_md/ids:0
2014-07-25 20:24:54+0000 476994 [2178]: cmd 9 target pid 8593 not found
2014-07-25 20:45:41+0000 478241 [2178]: cmd 9 target pid 16373 not found
2014-07-25 20:45:42+0000 478242 [2178]: cmd 9 target pid 16560 not found
2014-07-25 20:51:55+0000 478615 [2178]: cmd 9 target pid 27598 not found
2014-07-25 22:11:39+0000 483399 [2178]: cmd 9 target pid 29367 not found
2014-07-25 22:34:25+0000 484765 [2178]: cmd 9 target pid 12780 not found
2014-07-25 22:34:30+0000 484770 [2178]: cmd 9 target pid 13087 not found
2014-07-25 22:41:52+0000 485212 [2178]: cmd 9 target pid 27689 not found
2014-07-26 00:06:21+0000 490281 [2178]: cmd 9 target pid 14008 not found
2014-07-26 00:27:35+0000 491555 [2178]: cmd 9 target pid 27787 not found
2014-07-26 00:28:31+0000 491611 [2178]: cmd 9 target pid 30630 not found
2014-07-26 00:33:05+0000 491885 [2178]: cmd 9 target pid 6879 not found
2014-07-26 01:55:13+0000 496813 [2178]: cmd 9 target pid 22077 not found
2014-07-26 02:20:23+0000 498323 [2178]: cmd 9 target pid 898 not found
2014-07-26 02:20:39+0000 498339 [2178]: cmd 9 target pid 1477 not found
2014-07-26 02:24:06+0000 498546 [2178]: cmd 9 target pid 7102 not found
2014-07-26 02:38:28+0000 499408 [2178]: cmd 9 target pid 2805 not found
2014-07-26 21:10:01+0000 566101 [2178]: cmd 9 target pid 23959 not found
2014-07-27 10:25:22+0000 613822 [2178]: cmd 9 target pid 1426 not found
2014-07-28 13:36:23+0000 711683 [19604]: 68957f61 aio timeout 0 0x7ff0f00008c0:0x7ff0f00008d0:0x7ff110176000 ioto 10 to_count 1

Comment 20 Nir Soffer 2014-07-31 12:48:46 UTC
(In reply to Eldad Marciano from comment #19)
> As i chat with Nir, we decided to find a correlation by date between the
> engine.log to sanlock.log at the host to avoid any storage bottlenecks.
> 
> No correlation found.
Yet, on one host?

1. Please attach a log showing all relevant events on the relevant date (when you opened this bug).

We would like to see these events:
- Host becomes non-op because it cannot access storage
- Host becomes operational again

2. sanlock log on these hosts showing that time frame

3. The time zone of the engine machine - the engine log uses local time, while the sanlock log uses UTC.
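Because the engine log is in local time and sanlock logs in UTC, timestamps must be normalized before the two logs can be correlated. A minimal sketch, assuming the engine host's UTC offset is known; the offset value and function name below are illustrative.

```python
from datetime import datetime, timedelta

def engine_ts_to_utc(ts, utc_offset_hours):
    """Convert an engine.log timestamp (local time, e.g.
    '2014-07-26 12:49:44,387') to UTC so it can be compared
    directly with sanlock.log entries."""
    local = datetime.strptime(ts, "%Y-%m-%d %H:%M:%S,%f")
    return local - timedelta(hours=utc_offset_hours)

# Engine event from comment 19, assuming the engine runs at UTC+3:
utc_time = engine_ts_to_utc("2014-07-26 12:49:44,387", 3)
```

Once both logs are expressed in UTC, a simple sort-and-merge by timestamp shows whether sanlock renewal errors precede the engine's Non-Operational transitions.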

Comment 21 Eldad Marciano 2014-07-31 13:09:42 UTC
1. All of the relevant logs are already in comment 18.

2. All of the hosts in our lab use the same date.

3. It looks like sanlock converts the UTC timestamp to a date format; besides, there is no correlation anywhere in the '2014-07-26' time frame.

Comment 22 Nir Soffer 2014-07-31 14:32:33 UTC
(In reply to Eldad Marciano from comment #21)
> 1. all of the relevant logs already in comment 18.

The bug is called "Hosts become Non-Operational", but you can find only one host that became non-operational?

When did the host become operational again?

We need the time frame in which the host was non-op because it did not see the storage.

I suggest separating the bugs - a host becoming Unassigned is not related to a host becoming non-operational because it cannot access a storage domain.

OK, if there are no more hosts with this problem, and we don't see any errors in sanlock.log on this day, there was no issue accessing storage on this host.

Since the problem is not "hosts becoming non-op" but one host that became non-op, I'm removing the severity of this bug. We don't have enough info to tell what the severity of this issue is yet.

Please split the issue of hosts becoming unassigned into a separate bug.

Please ensure that we have all the relevant logs from host03-rack04.scale.openstack.engineering.redhat.com on this bug.

Comment 23 Eldad Marciano 2014-08-06 16:01:22 UTC
Created attachment 924524 [details]
issue started in the following time frame '2014-08-06 14:20:25'

Comment 24 Eldad Marciano 2014-08-06 16:02:22 UTC
It looks like restarting the entire setup reproduces the problem.

Comment 25 Eldad Marciano 2014-08-06 16:04:17 UTC
Created attachment 924526 [details]
engine screen shot

Comment 26 Eldad Marciano 2014-08-06 16:22:25 UTC
Created attachment 924529 [details]
engine log

Comment 27 Eldad Marciano 2014-08-06 16:23:33 UTC
This was also found in the engine logs; it might be related to the problem.
2014-08-06 16:06:25,835 ERROR [org.ovirt.engine.core.vdsbroker.VdsUpdateRunTimeInfo] (DefaultQuartzScheduler_Worker-79) ResourceManager::refreshVdsRunTimeInfo: Error: EJBException: JBAS014516: Failed to acquire a permit within 5 MINUTES, vds = cc7b4c0d-7d6f-4e0d-a172-0181d5d19277 : host21-rack06.scale.openstack.engineering.redhat.com

Comment 28 Nir Soffer 2014-08-06 16:53:15 UTC
(In reply to Eldad Marciano from comment #23)
> Created attachment 924524 [details]
> issue started in the following time frame '2014-08-06 14:20:25'

Thanks Eldad! Having logs from one machine, we can start to make
progress on this bug.

From sanlock log, we learn that:

1. storage became inaccessible at 2014-08-06 14:19:24+0000
2. storage became accessible again at 2014-08-06 14:19:24+0000

Any error in vdsm during this timeframe is expected.

See the annotated-sanlock.log

From /var/log/messages we don't learn anything useful; we see the same
messages from sanlock.

Please provide:

1. A filtered engine log, showing when each host that became non-operational
   entered that state, and when it exited it.
   Please contact Liron or Maor for help on getting such a log.
2. sanlock.log from all hosts that became non-operational
3. sanlock.log from one host that was still operational at the same time

Comment 29 Nir Soffer 2014-08-06 16:58:32 UTC
Created attachment 924534 [details]
Annotated sanlock log on a host that became non operational

Comment 33 Eldad Marciano 2014-08-19 10:17:17 UTC
This bug reproduced again, periodically.
This time we had just upgraded our scale setup and rebuilt the entire system to 3.4.2.

When I connected 37 hosts, after a while many of the hosts became non-operational.

Note - since we just rebuilt the lab, there are no VMs or any other load on the system.


The engine raised this:
2014-08-19 02:11:31,442 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.HSMGetAllTasksStatusesVDSCommand] (DefaultQuartzScheduler_Worker-22) [19a42edb] Command HSMGetAllTasksStatusesVDSCommand(HostName = host03-rack04.scale.openstack.engineering.redhat.com, HostId = 67fbe0cb-5c94-41fb-a27a-5ada1c235254) execution failed. Exception: IRSNonOperationalException: IRSGenericException: IRSErrorException: IRSNonOperationalException: Not SPM: ()  


From the SPM:
Thread-14::ERROR::2014-08-19 02:11:31,265::task::866::TaskManager.Task::(_setError) Task=`4c6ccf20-6937-45d3-93f0-fccf749fbeef`::Unexpected error                                                                    
Thread-14::ERROR::2014-08-19 02:11:31,275::dispatcher::65::Storage.Dispatcher.Protect::(run) {'status': {'message': "Unknown pool id, pool not connected: ('95c2f257-fc62-4907-bb54-2e099be599e9',)", 'code': 309}}  
Thread-14::ERROR::2014-08-19 02:11:31,404::task::866::TaskManager.Task::(_setError) Task=`4022c44b-5cb0-48fc-b442-a1c3437a65e2`::Unexpected error                                                                    
Thread-14::ERROR::2014-08-19 02:11:31,409::dispatcher::65::Storage.Dispatcher.Protect::(run) {'status': {'message': 'Not SPM: ()', 'code': 654}}                                                                     
Thread-13::ERROR::2014-08-19 02:12:29,680::sdc::137::Storage.StorageDomainCache::(_findDomain) looking for unfetched domain d4417282-22fb-42f3-86c1-fdea72903640                                                     
Thread-13::ERROR::2014-08-19 02:12:29,680::sdc::154::Storage.StorageDomainCache::(_findUnfetchedDomain) looking for domain d4417282-22fb-42f3-86c1-fdea72903640                                                      
Thread-13::DEBUG::2014-08-19 02:12:29,683::lvm::296::Storage.Misc.excCmd::(cmd) '/usr/bin/sudo -n /sbin/lvm vgs --config " devices { preferred_names = [\\"^/dev/mapper/\\"] ignore_suspended_devices=1 write_cache_state=0 disable_after_error_count=3 obtain_device_list_from_udev=0 filter = [ \'r|.*|\' ] }  global {  locking_type=1  prioritise_write_locks=1  wait_for_locks=1  use_lvmetad=0 }  backup {  retain_min = 50  retain_days = 0 } " --noheadings --units b --nosuffix --separator | -o uuid,name,attr,size,free,extent_size,extent_count,free_count,tags,vg_mda_size,vg_mda_free,lv_count,pv_count,pv_name d4417282-22fb-42f3-86c1-fdea72903640' (cwd None)                                                                                                                                                                                                      
Th




All other hosts which became 'non-op' report these log messages:
Thread-16::ERROR::2014-08-19 00:14:59,835::domainMonitor::239::Storage.DomainMonitorThread::(_monitorDomain) Error while collecting domain d4417282-22fb-42f3-86c1-fdea72903640 monitoring information
Thread-16::ERROR::2014-08-19 00:16:09,876::domainMonitor::239::Storage.DomainMonitorThread::(_monitorDomain) Error while collecting domain d4417282-22fb-42f3-86c1-fdea72903640 monitoring information
Thread-16::ERROR::2014-08-19 00:17:19,890::domainMonitor::239::Storage.DomainMonitorThread::(_monitorDomain) Error while collecting domain d4417282-22fb-42f3-86c1-fdea72903640 monitoring information
Thread-16::ERROR::2014-08-19 00:18:29,907::domainMonitor::239::Storage.DomainMonitorThread::(_monitorDomain) Error while collecting domain d4417282-22fb-42f3-86c1-fdea72903640 monitoring information
Thread-16::ERROR::2014-08-19 00:19:39,918::sdc::143::Storage.StorageDomainCache::(_findDomain) domain d4417282-22fb-42f3-86c1-fdea72903640 not found                                                  
Thread-16::ERROR::2014-08-19 00:19:39,919::domainMonitor::239::Storage.DomainMonitorThread::(_monitorDomain) Error while collecting domain d4417282-22fb-42f3-86c1-fdea72903640 monitoring information
Thread-16::ERROR::2014-08-19 00:20:49,932::sdc::143::Storage.StorageDomainCache::(_findDomain) domain d4417282-22fb-42f3-86c1-fdea72903640 not found                                                  
Thread-16::ERROR::2014-08-19 00:20:49,933::domainMonitor::239::Storage.DomainMonitorThread::(_monitorDomain) Error while collecting domain d4417282-22fb-42f3-86c1-fdea72903640 monitoring information
Thread-16::ERROR::2014-08-19 00:21:59,946::sdc::143::Storage.StorageDomainCache::(_findDomain) domain d4417282-22fb-42f3-86c1-fdea72903640 not found                                                  
Thread-16::ERROR::2014-08-19 00:21:59,947::domainMonitor::239::Storage.DomainMonitorThread::(_monitorDomain) Error while collecting domain d4417282-22fb-42f3-86c1-fdea72903640 monitoring information
Thread-16::ERROR::2014-08-19 00:23:09,961::sdc::143::Storage.StorageDomainCache::(_findDomain) domain d4417282-22fb-42f3-86c1-fdea72903640 not found                                                  
Thread-16::ERROR::2014-08-19 00:23:09,962::domainMonitor::239::Storage.DomainMonitorThread::(_monitorDomain) Error while collecting domain d4417282-22fb-42f3-86c1-fdea72903640 monitoring information
Thread-30::DEBUG::2014-08-19 00:25:18,196::lvm::296::Storage.Misc.excCmd::(cmd) '/usr/bin/sudo -n /sbin/lvm vgs --config " devices { preferred_names = [\\"^/dev/mapper/\\"] ignore_suspended_devices=1 write_cache_state=0 disable_after_error_count=3 obtain_device_list_from_udev=0 filter = [ \'r|.*|\' ] }  global {  locking_type=1  prioritise_write_locks=1  wait_for_locks=1  use_lvmetad=0 }  backup {  retain_min = 50  retain_days = 0 } " --noheadings --units b --nosuffix --separator | -o uuid,name,attr,size,free,extent_size,extent_count,free_count,tags,vg_mda_size,vg_mda_free,lv_count,pv_count,pv_name' (cwd None) 


A live debugging session needs to be scheduled.
As it stands, we cannot continue our scale process in this unstable situation.

Comment 34 Nir Soffer 2014-08-19 11:19:25 UTC
Eldad, on 2014-08-06 12:53:15 EDT I asked for more logs (see comment 28). That was 13 days ago, and you still have not provided any of them. Without the requested logs we cannot help you.

Comment 35 Eldad Marciano 2014-08-19 13:18:05 UTC
Would you prefer to connect to the relevant hosts (engine, SPM host, host)?

Comment 36 Eldad Marciano 2014-08-19 13:41:05 UTC
Created attachment 928373 [details]
engine, spm host, host

