Note: This bug is displayed in read-only format because the product is no longer active in Red Hat Bugzilla.

Bug 1603376

Summary: 'SanlockException:(-202, 'Sanlock resource read failure', 'IO timeout')' while trying to attach the non-master SD when disk creation is in progress
Product: [oVirt] vdsm
Component: Core
Version: 4.20.31
Hardware: x86_64
OS: Unspecified
Status: CLOSED INSUFFICIENT_DATA
Severity: high
Priority: unspecified
Reporter: Shir Fishbain <sfishbai>
Assignee: Nir Soffer <nsoffer>
QA Contact: Elad <ebenahar>
CC: ahino, bugs, ebenahar, nsoffer, sfishbai, tnisan
Target Milestone: ---
Target Release: ---
oVirt Team: Storage
Type: Bug
Last Closed: 2018-08-08 10:45:44 UTC
Attachments: Logs (no flags)

Description Shir Fishbain 2018-07-19 14:37:11 UTC
Created attachment 1460409 [details]
Logs

Description of problem:
All storage domains except the master SD were detached. While a 500 GiB
preallocated disk was being created on the master SD, an attempt was made to
put the master SD into maintenance and then to attach the detached storage
domains. A 'Sanlock resource read failure' error appeared in engine.log.

Version-Release number of selected component (if applicable):
4.2.5.2_SNAPSHOT-79.gffafd93.0.scratch.master.el7ev

How reproducible:
Not sure

Steps to Reproduce:
1. Detach all storage domains except the master SD.
2. Create a 500 GiB preallocated disk on the master SD.
3. While the disk creation is in progress, try to put the master SD into maintenance.
4. Try to attach the detached storage domains.

Actual results:
The attempt to attach the non-master storage domains fails with the following errors:

engine log :
2018-07-17 15:33:05,620+03 ERROR [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (EE-ManagedThreadFactory-engineScheduled-Thread-82) [] EVENT_ID: VDS_BROKER_COMMAND_FAILURE(10,802), VDSM host_mixed_1 command SpmStatusVDS failed: (-202, 'Sanlock resource read failure', 'IO timeout')

vdsm log:
2018-07-17 15:16:30,288+0300 ERROR (jsonrpc/4) [storage.Dispatcher] FINISH getSpmStatus error=(-202, 'Unable to read resource owners', 'IO timeout') (dispatcher:86)
Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/vdsm/storage/dispatcher.py", line 73, in wrapper
    result = ctask.prepare(func, *args, **kwargs)
  File "/usr/lib/python2.7/site-packages/vdsm/storage/task.py", line 108, in wrapper
    return m(self, *a, **kw)
  File "/usr/lib/python2.7/site-packages/vdsm/storage/task.py", line 1189, in prepare
    raise self.error

sanlock log:
2018-07-17 15:12:47 174532 [6941]: s21 renewal error -202 delta_length 10 last_success 174501
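For context on the sanlock line above: the two large integers appear to be monotonic timestamps in seconds, so the gap between the current timestamp (174532) and last_success (174501) shows how long lease renewal had been failing when the -202 error was logged. A minimal parser sketch (the field interpretation is assumed from this log line, not taken from sanlock documentation):

```python
import re

def renewal_gap(line):
    """Return seconds since the last successful sanlock lease renewal,
    parsed from a 'renewal error' log line, or None if the line doesn't match."""
    m = re.search(
        r"(\d+) \[\d+\]: \S+ renewal error -?\d+ delta_length \d+ last_success (\d+)",
        line)
    if m is None:
        return None
    return int(m.group(1)) - int(m.group(2))

line = "2018-07-17 15:12:47 174532 [6941]: s21 renewal error -202 delta_length 10 last_success 174501"
print(renewal_gap(line))  # -> 31
```

A 31-second gap is well past the 10-second io timeout (delta_length 10), which is consistent with the storage being too slow to serve lease reads during the preallocation.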

Expected results: All non-master storage domains should be attached successfully.

Additional info: The attempt to put the master SD into maintenance while the disk creation is in progress fails with a proper error message (which is expected).

Comment 1 Shir Fishbain 2018-07-19 14:51:18 UTC
vdsm-4.20.34-1.el7ev.x86_64

Comment 2 Tal Nisan 2018-07-29 07:56:26 UTC
Seems like the expected behavior to me if you attach a few storage domains at once. Nir, what do you think?

Comment 3 Nir Soffer 2018-08-07 12:46:42 UTC
I don't think the number of storage domains should matter. This looks like a QoS
issue: preallocating a big disk causes sanlock timeouts when reading the SPM lease.

Can we get more info about this setup?

For every storage domain:
- storage type?
- what is the target storage?
- how do we connect to the storage? FC/iSCSI/NFS
- if iSCSI, is this a 1G or 10G network?
- if NFS, which NFS version?
- is it the master domain?

For the preallocated disk, which type of storage is this?

Please also include the output of the mount command, showing all mounts and mount options.

Finally, please try to reproduce; it is important to know whether this is reproducible or not.
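The NFS part of the questions above can be answered from /proc/mounts, which lists each mount's filesystem type and options (the negotiated NFS version shows up as vers=). A small sketch; the sample line below is hypothetical, not taken from this setup:

```python
def nfs_mounts(mounts_text):
    """Return (mountpoint, fstype, vers) for NFS entries in /proc/mounts-style text."""
    out = []
    for line in mounts_text.splitlines():
        parts = line.split()
        # /proc/mounts format: device mountpoint fstype options dump pass
        if len(parts) >= 4 and parts[2].startswith("nfs"):
            opts = dict(o.partition("=")[::2] for o in parts[3].split(","))
            out.append((parts[1], parts[2], opts.get("vers", "")))
    return out

sample = "server:/export /rhev/data-center/mnt/server:_export nfs4 rw,vers=4.1,rsize=1048576 0 0"
print(nfs_mounts(sample))  # -> [('/rhev/data-center/mnt/server:_export', 'nfs4', '4.1')]
```

On a live host, pass `open("/proc/mounts").read()` to get the real list.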

Comment 4 Shir Fishbain 2018-08-08 10:41:54 UTC
The bug could not be reproduced again.
It was originally opened because of an NFS environment problem.

Comment 5 Nir Soffer 2018-08-08 10:45:44 UTC
Closing since we don't have enough data to investigate, and we cannot reproduce.

Please reopen when you have the data requested in comment 3, and know how to 
reproduce.