Bug 1832967
| Summary: | Uploading images from glance may delay sanlock I/O and cause sanlock operations to fail | | |
|---|---|---|---|
| Product: | [oVirt] vdsm | Reporter: | Amit Bawer <abawer> |
| Component: | Core | Assignee: | Amit Bawer <abawer> |
| Status: | CLOSED CURRENTRELEASE | QA Contact: | Evelina Shames <eshames> |
| Severity: | high | Docs Contact: | |
| Priority: | high | | |
| Version: | 4.40.0 | CC: | aefrat, bugs, nsoffer, tnisan |
| Target Milestone: | ovirt-4.4.1 | Flags: | sbonazzo: ovirt-4.4?, aefrat: planning_ack?, aefrat: devel_ack?, aefrat: testing_ack? |
| Target Release: | 4.40.17 | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | vdsm-4.40.17 | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2020-07-08 08:25:02 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | Storage | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
Description
Amit Bawer
2020-05-07 14:33:22 UTC
Setting severity and priority to high since this may break basic flows like creating a storage domain, or fail lease renewal, which may cause a lease to expire, killing vdsm or running VMs. We don't know how common this issue is; it probably affects only setups with very slow storage (like OST).

According to the info, I think the issue is with the download image writes and not the upload image reads, so the title might be inaccurate.

Nir Soffer (comment #3):
The title says "Uploading images from glance", which is correct. In the code we call this "download image". Feel free to make the title clearer.

(In reply to Nir Soffer from comment #3)
Okay, as long as it is clear that the "download" part is code-wise.

Avihai (comment #5):
Amit, can you please provide a clear verification scenario?

Amit Bawer (comment #6):
(In reply to Avihai from comment #5)
> Amit can you please provide a clear verification scenario?

The original issue was seen in master 4.4 OST tests, where an NFS domain was being created while glance image uploads were in progress to another NFS domain on the same NFS server.

For testing independently, I suggest the following, first with an unfixed vdsm build:

1) Create NFS SD: nfs1
2) Run Import for glance images from the ovirt-image-repository domain to nfs1 (in OST, 2 ongoing uploads of 1GB each were enough to cause latency, but this may vary depending on the QE environment).
3) While the image uploads are in progress, attempt to create another NFS SD, nfs2, on another export path of the same storage; if there is sufficient latency, it will fail with a sanlock I/O timeout error (-202) in vdsm.log.

An alternative to step (3) is to run on the host, while the uploads are in progress:

# touch /mnt1/ids
# for x in `seq 1000`; do sanlock direct init -s LS:1:/mnt1/ids:0 ; done

where /mnt1 is mounted to another export path of the same NFS server used for domain nfs1, and see whether calls return "init done -202"; in a healthy condition all calls print "init done 0".

Repeating the process with a fixed vdsm build should not fail the nfs2 SD creation, or alternatively should return "init done 0" if tried manually with sanlock direct while uploads are in progress.
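For reference, the alternative check in step (3) of comment #6 can be wrapped in a small script that counts failing initializations. This is a minimal sketch based on the commands above, not part of the original report; the /mnt1 path, the LS lockspace name, and the iteration count are illustrative and should be adapted to the test environment:

    #!/bin/bash
    # Minimal sketch: repeatedly initialize a sanlock lockspace on a test ids
    # file and count attempts that do not report "init done 0".
    # /mnt1 must already be mounted to a second export path on the same NFS
    # server as the nfs1 domain (illustrative path, not from the report).
    ids=/mnt1/ids
    fail=0
    touch "$ids"
    for x in $(seq 1000); do
        # "init done 0" is healthy; "init done -202" indicates a sanlock I/O timeout.
        out=$(sanlock direct init -s "LS:1:$ids:0")
        echo "$out"
        if [ "$out" != "init done 0" ]; then
            fail=$((fail + 1))
        fi
    done
    echo "attempts not reporting 'init done 0': $fail"

Run while the glance image uploads are in progress; a non-zero count with -202 results indicates the same sanlock I/O latency problem described above.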
Evelina Shames (comment #7):
(In reply to Amit Bawer from comment #6)

I tried several times, in both ways, with different image sizes (2G/10G).
- Create a new SD while image uploads are in progress - the operation succeeded without errors.
- Run on the host while the uploads are in progress - not sure whether the operation succeeded or failed: "init done -19" - need your input.

(In reply to Evelina Shames from comment #7)

Sounds like a permission issue for sanlock accessing the ids file you are trying to write into, also mentioned at bz1778485. Please check that the file exists first and that its permissions allow writing for your user (preferably root):

# id
uid=0(root) gid=0(root) groups=0(root) context=unconfined_u:unconfined_r:unconfined_t:s0-s0:c0.c1023
# touch ids
# ls -l ids
-rw-r--r--. 1 root root 1048576 Jun 2 03:21 ids
# sanlock direct init -s LS:1:ids:0
init done 0

For a bad user "a", this will fail with the -19 error code:

$ id
uid=1000(a) gid=1000(a) groups=1000(a) context=unconfined_u:unconfined_r:unconfined_t:s0-s0:c0.c1023
$ sanlock direct init -s LS:1:ids:0
init done -19

Verified in both ways:
- Create a new SD while image uploads are in progress - the operation succeeded without errors.
- Run on the host while the uploads are in progress - the operation succeeded: 'init done 0'.

Version: rhv-4.4.1-2
Moving to 'Verified'.

This bugzilla is included in the oVirt 4.4.1 release, published on July 8th 2020. Since the problem described in this bug report should be resolved in the oVirt 4.4.1 release, it has been closed with a resolution of CURRENT RELEASE. If the solution does not work for you, please open a new bug report.