Bug 1603376
| Summary: | 'SanlockException:(-202, 'Sanlock resource read failure', 'IO timeout')' while trying to attach the non-master SD when disk creation is in progress | ||||||
|---|---|---|---|---|---|---|---|
| Product: | [oVirt] vdsm | Reporter: | Shir Fishbain <sfishbai> | ||||
| Component: | Core | Assignee: | Nir Soffer <nsoffer> | ||||
| Status: | CLOSED INSUFFICIENT_DATA | QA Contact: | Elad <ebenahar> | ||||
| Severity: | high | Docs Contact: | |||||
| Priority: | unspecified | ||||||
| Version: | 4.20.31 | CC: | ahino, bugs, ebenahar, nsoffer, sfishbai, tnisan | ||||
| Target Milestone: | --- | ||||||
| Target Release: | --- | ||||||
| Hardware: | x86_64 | ||||||
| OS: | Unspecified | ||||||
| Whiteboard: | |||||||
| Fixed In Version: | Doc Type: | If docs needed, set a value | |||||
| Doc Text: | Story Points: | --- | |||||
| Clone Of: | Environment: | ||||||
| Last Closed: | 2018-08-08 10:45:44 UTC | Type: | Bug | ||||
| Regression: | --- | Mount Type: | --- | ||||
| Documentation: | --- | CRM: | |||||
| Verified Versions: | Category: | --- | |||||
| oVirt Team: | Storage | RHEL 7.3 requirements from Atomic Host: | |||||
| Cloudforms Team: | --- | Target Upstream Version: | |||||
| Embargoed: | |||||||
| Attachments: |
vdsm-4.20.34-1.el7ev.x86_64

Seems like the expected behavior to me if you attach a few storage domains at once. Nir, what do you think?

I don't think the number of storage domains should matter. This looks like a QoS issue: preallocating a big disk causes sanlock timeouts when reading the SPM lease.

Can we get more info about this setup? For every storage domain:
- Storage type?
- What is the target storage?
- How do we connect to the storage? FC/iSCSI/NFS
- If iSCSI, is this a 1G or 10G network?
- If NFS, which NFS version?
- Is it the master?

For the preallocated disk, which type of storage is this?

Please also include the output of the mount command, showing all mounts and mount options.

Finally, try to reproduce; it is important to know whether this is reproducible or not.

The bug wasn't reproduced again. The bug was opened because of an NFS environment problem.
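The setup details requested above can be collected with standard host commands. This is a sketch, not part of the original report; the NIC name `eth0` is a placeholder for the actual storage interface.

```shell
# All mounts with their options (for NFS, look for vers=/nfsvers= in the options).
mount

# Per-mount NFS statistics, if nfs-utils is installed.
if command -v nfsstat >/dev/null 2>&1; then
    nfsstat -m
fi

# Active iSCSI sessions, targets and portals, if iscsi-initiator-utils is installed.
# May report that no sessions exist; that is not an error here.
if command -v iscsiadm >/dev/null 2>&1; then
    iscsiadm -m session -P 3 || true
fi

# Link speed of the storage NIC (1G vs 10G); replace eth0 with the real interface.
if command -v ethtool >/dev/null 2>&1; then
    ethtool eth0 2>/dev/null | grep -i speed || true
fi
```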
Created attachment 1460409 [details]
Logs

Description of problem:
All the storage domains were detached except for the master SD. While creating a 500 GiB preallocated disk, try to put the master SD into maintenance and then try to attach the unattached storage domains. A 'Sanlock resource read failure' error message appeared in engine.log.

Version-Release number of selected component (if applicable):
4.2.5.2_SNAPSHOT-79.gffafd93.0.scratch.master.el7ev

How reproducible:
Not sure

Steps to Reproduce:
1. All storage domains except for the master SD were detached.
2. Create a 500 GiB preallocated disk on the master SD.
3. While the disk creation is in progress, try to put the master SD into maintenance.
4. Try to attach the storage domains.

Actual results:
The attempt to attach the non-master storage domains fails with the following errors.

engine log:
2018-07-17 15:33:05,620+03 ERROR [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (EE-ManagedThreadFactory-engineScheduled-Thread-82) [] EVENT_ID: VDS_BROKER_COMMAND_FAILURE(10,802), VDSM host_mixed_1 command SpmStatusVDS failed: (-202, 'Sanlock resource read failure', 'IO timeout')

vdsm log:
2018-07-17 15:16:30,288+0300 ERROR (jsonrpc/4) [storage.Dispatcher] FINISH getSpmStatus error=(-202, 'Unable to read resource owners', 'IO timeout') (dispatcher:86)
Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/vdsm/storage/dispatcher.py", line 73, in wrapper
    result = ctask.prepare(func, *args, **kwargs)
  File "/usr/lib/python2.7/site-packages/vdsm/storage/task.py", line 108, in wrapper
    return m(self, *a, **kw)
  File "/usr/lib/python2.7/site-packages/vdsm/storage/task.py", line 1189, in prepare
    raise self.error

sanlock log:
2018-07-17 15:12:47 174532 [6941]: s21 renewal error -202 delta_length 10 last_success 174501

Expected results:
All the non-master storage domains should be attached successfully.
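When retesting reproducibility, the renewal timeouts can be spotted directly in the host logs. A minimal sketch, assuming the default RHEL 7 log locations; error -202 indicates the delta-lease renewal I/O did not complete within sanlock's io_timeout.

```shell
# Scan the sanlock log for delta-lease renewal timeouts (-202).
SANLOCK_LOG=/var/log/sanlock.log
VDSM_LOG=/var/log/vdsm/vdsm.log

if [ -r "$SANLOCK_LOG" ]; then
    grep 'renewal error' "$SANLOCK_LOG" | tail -n 20
fi

# Correlate with SPM status failures seen by vdsm around the same timestamps.
if [ -r "$VDSM_LOG" ]; then
    grep 'getSpmStatus' "$VDSM_LOG" | grep 'IO timeout' | tail -n 20
fi
```

If the renewal errors line up with the disk preallocation window, that supports the storage-QoS explanation rather than a problem with the number of attached domains.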
Additional info: Attempting to put the master SD into maintenance while the disk creation is in progress fails with a proper error message, which is expected.