Bug 1780290
Summary: Host goes non-operational post upgrading that host from RHHI-V 1.6 to RHHI-V 1.7

| Field | Value |
|---|---|
| Product | [oVirt] ovirt-distribution |
| Component | ioprocess |
| Sub component | General |
| Reporter | SATHEESARAN <sasundar> |
| Assignee | Nir Soffer <nsoffer> |
| QA Contact | SATHEESARAN <sasundar> |
| Status | CLOSED CURRENTRELEASE |
| Severity | urgent |
| Priority | urgent |
| CC | bugs, emarcus, godas, kdhananj, lleistne, nsoffer, rhs-bugs, sabose, vjuranek |
| Version | 4.3.6 |
| Flags | pm-rhel: ovirt-4.3+, sasundar: blocker?, sasundar: testing_ack+ |
| Target Milestone | ovirt-4.3.8 |
| Target Release | 4.3.8 |
| Hardware | x86_64 |
| OS | Linux |
| Fixed In Version | vdsm-4.30.40, ioprocess-1.3.1 |
| Doc Type | Bug Fix |
| Clone Of | 1780265 |
| Last Closed | 2020-02-26 16:31:41 UTC |
| oVirt Team | Storage |
| Bug Blocks | 1780265, 1800803 |
| Attachments | vdsm.log (attachment 1642421) |

Doc Text: When sharding was enabled and a file's path had been unlinked, but its descriptor was still available, attempting to perform a read or write operation against the file resulted in the host becoming non-operational. File paths are now unlinked later, avoiding this issue.
Description (SATHEESARAN, 2019-12-05 16:05:10 UTC)

Created attachment 1642421 [details]: vdsm.log
---

Vojtech Juranek (comment 2):

Please provide vdsm and ioprocess versions. Thanks.

---

Vojtech Juranek (comment 3):

Can the host access the storage domain (SD with ID 950148f3-f2f0-4f96-ba6e-b571a7a1a40b)? From the vdsm log, before the exception happens, it seems that the domain is not accessible at all:

```
2019-12-05 21:01:05,968+0530 INFO (jsonrpc/6) [vdsm.api] FINISH getStorageDomainInfo error=[Errno 2] No such file or directory from=::1,46748, task_id=fd63f6dc-35bd-41b1-9098-c9128d1d65b3 (api:52)
```

---

SATHEESARAN (in reply to Vojtech Juranek from comment #2):

> Please provide vdsm and ioprocess versions. Thanks

```
[root@ ~]# rpm -qa | grep vdsm
vdsm-api-4.30.38-1.el7ev.noarch
vdsm-http-4.30.38-1.el7ev.noarch
vdsm-hook-fcoe-4.30.38-1.el7ev.noarch
vdsm-hook-openstacknet-4.30.38-1.el7ev.noarch
vdsm-yajsonrpc-4.30.38-1.el7ev.noarch
vdsm-python-4.30.38-1.el7ev.noarch
vdsm-hook-vmfex-dev-4.30.38-1.el7ev.noarch
vdsm-hook-ethtool-options-4.30.38-1.el7ev.noarch
vdsm-common-4.30.38-1.el7ev.noarch
vdsm-hook-vhostmd-4.30.38-1.el7ev.noarch
vdsm-network-4.30.38-1.el7ev.x86_64
vdsm-jsonrpc-4.30.38-1.el7ev.noarch
vdsm-4.30.38-1.el7ev.x86_64
vdsm-gluster-4.30.38-1.el7ev.x86_64
vdsm-client-4.30.38-1.el7ev.noarch

[root@ ~]# rpm -qa | grep ioprocess
python2-ioprocess-1.3.0-1.el7ev.x86_64
ioprocess-1.3.0-1.el7ev.x86_64
```

---

SATHEESARAN (in reply to Vojtech Juranek from comment #3):

> Can the host access the storage domain (SD with ID
> 950148f3-f2f0-4f96-ba6e-b571a7a1a40b)?
> From the vdsm log, before the exception happens, it seems that the domain
> is not accessible at all.

This host has two networks, one for the logical network 'ovirtmgmt' and another for the logical network 'storage'. Both networks use DHCP, but when the host rebooted, the network corresponding to storage was down, and the BOOTPROTO directive was missing from the network configuration file (/etc/sysconfig/network-scripts/ifcfg-enp3s0f0). This led to the gluster network having no IP; I later fixed that and brought up that particular network alone.

---

Vojtech Juranek (comment 6):

The issue is that unlink() doesn't behave as expected. In ioprocess we use the following pattern to create a temp file:

1. create a file [1]
2. unlink this file [2]
3. return the file descriptor created in 1. to the function which needs the temp file [3]
4. write into the temp file [4]
5. close the file descriptor [5]

I tried to reproduce using Python on your environment. If I comment out the unlink (step 2), it works; with the unlink it fails with:

```
Traceback (most recent call last):
  File "dio.py", line 14, in <module>
    os.write(f, m)
OSError: [Errno 2] No such file or directory
```

So unlink doesn't behave as expected on your environment. It's very likely some issue in the Gluster setup, as we tested several times with Gluster and everything worked. Any idea what can cause this?

[1] https://github.com/oVirt/ioprocess/blob/master/src/exported-functions.c#L866
[2] https://github.com/oVirt/ioprocess/blob/master/src/exported-functions.c#L872
[3] https://github.com/oVirt/ioprocess/blob/master/src/exported-functions.c#L882
[4] https://github.com/oVirt/ioprocess/blob/master/src/exported-functions.c#L922
[5] https://github.com/oVirt/ioprocess/blob/master/src/exported-functions.c#L946
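[Editor's note: a minimal sketch reconstructing the reproducer described in comment 6. The original dio.py is not attached, so the mount path and the use of an mmap buffer for O_DIRECT alignment are assumptions; only the variable names f and m are taken from the traceback above.]

```python
import mmap
import os

# Hypothetical path on the Gluster storage domain mount.
PATH = "/rhev/data-center/mnt/glusterSD/server:_vmstore/.dio-test"

# O_DIRECT requires a block-aligned userspace buffer; an anonymous
# mmap region is page-aligned, which is sufficient.
m = mmap.mmap(-1, 4096)
m.write(b"x" * 4096)

f = os.open(PATH, os.O_WRONLY | os.O_CREAT | os.O_DIRECT, 0o644)
os.unlink(PATH)  # commenting this out makes the write below succeed

# On an affected sharded Gluster volume this raises
# OSError: [Errno 2] No such file or directory
os.write(f, m)
os.close(f)
```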
---

As the previous needinfos are clarified, removing the flags. The question in comment 6 needs investigation from the gluster team.

@Krutika, do you know why this is happening, as described in comment 6?

---

Krutika (in reply to Vojtech Juranek from comment #6):

> The issue is that unlink() doesn't behave as expected. In ioprocess we use
> the following pattern to create a temp file:
> 1. create a file [1]
> 2. unlink this file [2]
> 3. return the file descriptor created in 1. to the function which needs the temp file [3]
> 4. write into the temp file [4]
> 5. close the file descriptor [5]
> [...]

So the file is created with an open fd, then its path is unlinked, and the writes are sent on this fd whose path is unlinked? Yeah, this is a known bug in shard. Shard doesn't behave correctly in these situations: it unlinks all the shards (except the base shard) without taking into account the presence of open fds.

Vojtech, how big can this temp file get?

-Krutika

---

Vojtech Juranek (comment 9):

> How big can this temp file get?

We try to write 1 B, 512 B and 4096 B (see [6]) to find out what the block size is, i.e. the maximum size is 4 KiB.

[6] https://github.com/oVirt/ioprocess/blob/master/src/exported-functions.c#L888

---

Krutika (in reply to Vojtech Juranek from comment #9):

Is that so? Then I don't think the bug in shard should affect this. Let me try and run the reproducer directly on the mount to check where exactly the ENOENT is originating from.

-Krutika

---

Krutika:

Tested this on Sas' setup.

1. There definitely is a bug in shard's treatment of fd-based operations, on multiple levels, in terms of depending on the file path for performing some checks (such as gathering file attributes) when it needn't necessarily exist. In this case, the difference in behavior between RHGS 3.4.4 and RHGS 3.5.x (although the shard bug has been there since day 1) is due to some kind of cache invalidation of file attributes that shard performs in RHGS 3.5.x, for which it operates on the *path*, which returns an ENOENT.

2. Having said that, is there a strong reason to want to unlink the path before performing the 4k-check writes? Or is it possible to simply defer the unlink to a point *after* the checks are performed?

-Krutika
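[Editor's note: an illustrative Python sketch of the block-size probe from comment 9 with the unlink deferred until after the checks, as Krutika suggests. The actual fix landed in ioprocess's C code (ioprocess-1.3.1); the probe path and the assumption that O_DIRECT rejects undersized writes on block-size-enforcing storage are the editor's, not from the thread.]

```python
import mmap
import os

# Hypothetical probe file on the storage domain mount.
PATH = "/rhev/data-center/mnt/glusterSD/server:_vmstore/.prob-tmp"

def detect_block_size(path):
    """Probe with 1, 512 and 4096 byte O_DIRECT writes; the smallest
    write that succeeds gives the block size, since O_DIRECT rejects
    writes smaller than the storage's logical block size."""
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_DIRECT, 0o644)
    try:
        buf = mmap.mmap(-1, mmap.PAGESIZE)  # page-aligned buffer
        for size in (1, 512, 4096):
            try:
                os.write(fd, memoryview(buf)[:size])
                return size
            except OSError:
                continue  # too small for this storage, try the next size
        raise RuntimeError("could not detect block size")
    finally:
        os.close(fd)
        os.unlink(path)  # deferred: path is removed only after the checks

print("block size:", detect_block_size(PATH))
```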
---

Nir Soffer (comment 12):

SATHEESARAN, can you test the proposed fix?

You can use the build for el8 here:
https://jenkins.ovirt.org/job/ioprocess_standard-check-patch/209/artifact/build-artifacts.el8.x86_64/

And for el7 here:
https://jenkins.ovirt.org/job/ioprocess_standard-check-patch/211/artifact/build-artifacts.el7.x86_64/

Both URLs are yum repos.

---

Nir Soffer (comment 13):

SATHEESARAN, based on comment 0 I understand that this happens when upgrading from:

> RHV 4.3.7 (w/ RHGS 3.4.4 - glusterfs-3.12.2-47.5)

to:

> RHV 4.3.8 (w/ RHGS 3.5 - glusterfs-6.0-24.el7rhgs)

The same code in ioprocess has been available since RHV 4.3.6, was shipped in oVirt with gluster 6.5, and has also been tested by users lately with gluster 7.0.

Maybe RHGS 3.5 is missing some fix from gluster 6.5?

---

SATHEESARAN (in reply to Nir Soffer from comment #13):

> Maybe RHGS 3.5 is missing some fix from gluster 6.5?

I am not sure about this. I understand from Krutika that this problem exists in gluster sharding itself and was never fixed. Sometimes the issue stays suppressed, as we never saw it with RHGS 3.4.4, and sometimes it pops up, as we see with RHGS 3.5. Nevertheless, the original issue at the shard layer still persists.

FYI - RHGS 3.5 is using gluster-6.0.

---

Nir Soffer (comment 15):

SATHEESARAN, any update on testing the fix? See comment 12.

---

SATHEESARAN (in reply to Nir Soffer from comment #15):

Hello Nir,

Yes, after upgrading with the upstream rpms (python2-ioprocess-1.3.0-1.201912111243.git87f19e1.el7.x86_64.rpm and ioprocess-1.3.0-1.201912111243.git87f19e1.el7.x86_64.rpm), the issue seems to be fixed. We were also seeing this issue with fresh deployments of RHHI-V 1.7 w/ RHVH 4.3.8; with these rpms, those issues are also fixed.

Verified the upgrade from RHV 4.3.7 to RHV 4.3.8. Steps involved:

1. RHHI-V deployment (self-hosted-engine deployment) with 3 nodes.
2. 10 RHEL 7.7 VMs are created, all continuously running I/O with a kernel untar workload. Note: the kernel untar workload downloads the kernel tarball, untars it, and computes the sha256sum of all the extracted files.
3. Enabled the local repo that contained the RHVH 4.3.8 redhat-virtualization-host-image-update.
4. Enabled global maintenance for the Hosted Engine VM.
5. Updated RHV Manager 4.3.7 to RHV Manager 4.3.8, applied all software package updates, and rebooted.
6. Started the HE VM and moved it out of global maintenance.
7. Once the RHV Manager UI was up, logged in and upgraded the hosts one after the other from the UI.
8. Once all the hosts were upgraded, edited the cluster 'Default', went to 'General' -> 'Compatibility version', and updated it from '4.2' to '4.3'. Existing VMs need to be powered off and restarted after updating the compatibility version.

Known issues:

1. Sometimes, in the 2-network configuration, the gluster network never comes up; one needs to set 'BOOTPROTO=dhcp' and bring the network up.
2. Because of 1, the gluster bricks do not come up, leading to pending self-heals. Once the network is brought up, healing starts.
3. After all the healing is completed, it takes some time (not more than 5 minutes) for the heal status to be reflected in the RHV Manager UI.

---

This bugzilla is included in the oVirt 4.3.8 release, published on January 27th 2020. Since the problem described in this bug report should be resolved in the oVirt 4.3.8 release, it has been closed with a resolution of CURRENT RELEASE. If the solution does not work for you, please open a new bug report.