Bug 1780265

Summary: Host goes non-operational post upgrading that host from RHHI-V 1.6 to RHHI-V 1.7
Product: [Red Hat Storage] Red Hat Gluster Storage
Component: rhhi
Version: rhhiv-1.7
Target Release: RHHI-V 1.7
Hardware: x86_64
OS: Linux
Reporter: SATHEESARAN <sasundar>
Assignee: Gobinda Das <godas>
QA Contact: SATHEESARAN <sasundar>
CC: rhs-bugs, sabose, seamurph, usurse, vjuranek
Status: CLOSED ERRATA
Severity: urgent
Priority: urgent
Type: Bug
Doc Type: Bug Fix
Doc Text:
When sharding was enabled and a file's path had been unlinked, but its descriptor was still available, attempting to perform a read or write operation against the file resulted in the host becoming non-operational. File paths are now unlinked later, avoiding this issue.
Bug Depends On: 1780290
Last Closed: 2020-02-13 15:57:23 UTC
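
The access pattern described in the Doc Text above can be illustrated with a minimal sketch, assuming a placeholder gluster mount path (this is not the actual vdsm or ioprocess code):

<snip>
import os
import tempfile

# Placeholder path; substitute a real gluster storage domain mountpoint.
MOUNTPOINT = "/rhev/data-center/mnt/glusterSD/server:_vmstore"

# Create a scratch file on the mount, drop its path, keep the descriptor.
fd, path = tempfile.mkstemp(prefix=".probe-", dir=MOUNTPOINT)
try:
    os.unlink(path)            # the path is gone, the descriptor is still open
    # On an affected sharded volume, I/O through the unlinked file is what
    # failed; on a healthy mount this write succeeds.
    os.write(fd, b"\0" * 4096)
finally:
    os.close(fd)
</snip>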

Description SATHEESARAN 2019-12-05 15:35:37 UTC
Description of problem:
-----------------------
After updating from RHV 4.3.7 to RHV 4.3.8, the hyperconverged (HC) host goes non-operational

Version-Release number of selected component (if applicable):
-------------------------------------------------------------
RHV 4.3.7 (with RHGS 3.4.4 - glusterfs-3.12.2-47.5)

How reproducible:
-----------------
1/1

Steps to Reproduce:
-------------------
1. Deploy RHHI-V: RHVH 4.3.7 + Hosted Engine setup with 3 HC hosts
2. Create 16 VMs running kernel untar workload
3. Update the engine to RHV 4.3.8
4. Update RHVH node from RHV Manager UI

Actual results:
---------------
The VMs running on that RHVH host were migrated to other hosts and the host was moved to maintenance; after the post-upgrade reboot, the host went into the non-operational state.

Expected results:
-----------------
The host should be in the activated state after the reboot.


Additional info:

Comment 1 SATHEESARAN 2019-12-05 15:36:26 UTC
Version-Release number of selected component (if applicable):
-------------------------------------------------------------
RHV 4.3.7 (w/ RHGS 3.4.4 - glusterfs-3.12.2-47.5 )
RHV 4.3.8 (w/ RHGS 3.5 - glusterfs-6.0-24.el7rhgs )

Comment 2 SATHEESARAN 2019-12-05 15:38:08 UTC
Suspicious messages in the vdsm logs suggest that something went wrong with the 4K native implementation check (block size probing):

<snip>
2019-12-05 20:42:48,670+0530 ERROR (jsonrpc/3) [storage.TaskManager.Task] (Task='a02a9ba0-4fb8-434b-be0a-a6b22ab55de6') Unexpected error (task:875)
Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/vdsm/storage/task.py", line 882, in _run
    return fn(*args, **kargs)
  File "<string>", line 2, in getStorageDomainInfo
  File "/usr/lib/python2.7/site-packages/vdsm/common/api.py", line 50, in method
    ret = func(*args, **kwargs)
  File "/usr/lib/python2.7/site-packages/vdsm/storage/hsm.py", line 2753, in getStorageDomainInfo
    dom = self.validateSdUUID(sdUUID)
  File "/usr/lib/python2.7/site-packages/vdsm/storage/hsm.py", line 305, in validateSdUUID
    sdDom = sdCache.produce(sdUUID=sdUUID)
  File "/usr/lib/python2.7/site-packages/vdsm/storage/sdc.py", line 110, in produce
    domain.getRealDomain()
  File "/usr/lib/python2.7/site-packages/vdsm/storage/sdc.py", line 51, in getRealDomain
    return self._cache._realProduce(self._sdUUID)
  File "/usr/lib/python2.7/site-packages/vdsm/storage/sdc.py", line 134, in _realProduce
    domain = self._findDomain(sdUUID)
  File "/usr/lib/python2.7/site-packages/vdsm/storage/sdc.py", line 151, in _findDomain
    return findMethod(sdUUID)
  File "/usr/lib/python2.7/site-packages/vdsm/storage/glusterSD.py", line 62, in findDomain
    return GlusterStorageDomain(GlusterStorageDomain.findDomainPath(sdUUID))
  File "/usr/lib/python2.7/site-packages/vdsm/storage/fileSD.py", line 378, in __init__
    manifest.sdUUID, manifest.mountpoint)
  File "/usr/lib/python2.7/site-packages/vdsm/storage/fileSD.py", line 853, in _detect_block_size
    block_size = iop.probe_block_size(mountpoint)
  File "/usr/lib/python2.7/site-packages/vdsm/storage/outOfProcess.py", line 384, in probe_block_size
    return self._ioproc.probe_block_size(dir_path)
  File "/usr/lib/python2.7/site-packages/ioprocess/__init__.py", line 602, in probe_block_size
    "probe_block_size", {"dir": dir_path}, self.timeout)
  File "/usr/lib/python2.7/site-packages/ioprocess/__init__.py", line 448, in _sendCommand
    raise OSError(errcode, errstr)
OSError: [Errno 2] No such file or directory
2019-12-05 20:42:48,670+0530 INFO  (jsonrpc/3) [storage.TaskManager.Task] (Task='a02a9ba0-4fb8-434b-be0a-a6b22ab55de6') aborting: Task is aborted: u'[Errno 2] No such file or directory' - code 100 (task:1181)
2019-12-05 20:42:48,670+0530 ERROR (jsonrpc/3) [storage.Dispatcher] FINISH getStorageDomainInfo error=[Errno 2] No such file or directory (dispatcher:87)

</snip>
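
For reference, the same probe_block_size() call that fails in the traceback can be exercised outside of vdsm with a short script. The mountpoint below is a placeholder, and the IOProcess constructor arguments and close() usage are assumptions that may differ between ioprocess versions:

<snip>
from ioprocess import IOProcess

# Placeholder path; substitute the gluster storage domain mountpoint.
MOUNTPOINT = "/rhev/data-center/mnt/glusterSD/server:_vmstore"

# Constructor arguments are an assumption; they may vary between versions.
iop = IOProcess(timeout=60)
try:
    # This is the call that raised OSError: [Errno 2] in the traceback above.
    print("detected block size: %s" % iop.probe_block_size(MOUNTPOINT))
finally:
    iop.close()
</snip>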

Comment 3 Sahina Bose 2019-12-05 15:53:36 UTC
Vojtech, could you check?

Comment 4 Vojtech Juranek 2019-12-06 09:08:28 UTC
the issue is tracked under BZ #1780290

Comment 5 SATHEESARAN 2019-12-09 13:28:27 UTC
(In reply to Vojtech Juranek from comment #4)
> the issue is tracked under BZ #1780290

Thanks Vojta. As per the RHHI-V process, an additional bug has been created to track this issue from the RHHI-V product perspective.

Comment 6 SATHEESARAN 2019-12-17 04:34:29 UTC
The issue is also seen with a fresh installation of RHVH 4.3.8, RHHI-V 1.7 with RHGS 3.5.1.

With the latest upstream version of ioprocess, however, the issue is fixed.
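
To check which ioprocess Python package is installed on a host (for example, when verifying that a build containing the fix has landed), a quick check along these lines can be used; it assumes the package ships the usual distribution metadata (rpm -qa | grep ioprocess answers the same question from the packaging side):

<snip>
# Assumes the installed ioprocess package ships distribution metadata.
import pkg_resources

print(pkg_resources.get_distribution("ioprocess").version)
</snip>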

Comment 8 SATHEESARAN 2020-02-01 05:22:16 UTC
Verified the upgrade from RHV 4.3.7 to RHV 4.3.8.

Steps involved:
1. RHHI-V deployment (self-hosted engine deployment) with 3 nodes.
2. Created 10 RHEL 7.7 VMs, all continuously running I/O with the kernel untar workload.
Note: the kernel untar workload downloads a kernel tarball, untars it, and computes the sha256sum of all the extracted files (a sketch of this workload follows these steps).

3. Enabled the local repo that contains the RHVH 4.3.8 redhat-virtualization-host-image-update package.
4. Enabled global maintenance for the Hosted Engine VM.
5. Updated RHV Manager 4.3.7 to RHV Manager 4.3.8, updated all software packages, and rebooted.
6. Started the HE VM and moved it out of global maintenance.
7. Once the RHV Manager UI was up, logged in and upgraded the hosts one after the other from the UI.
8. Once all the hosts were upgraded, edited the 'Default' cluster, went to 'General' -> 'Compatibility Version', and updated it from '4.2' to '4.3'.
Note: existing VMs need to be powered off and restarted after updating the compatibility version.
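
For completeness, the kernel untar workload from step 2 can be approximated with a short script like the one below; the kernel tarball URL and working directory are placeholders rather than the exact tooling used during verification (urllib2 is used because the hosts in this bug run Python 2.7):

<snip>
import hashlib
import os
import tarfile
import urllib2  # Python 2.7, as on the hosts in this bug

# Placeholders; adjust the tarball URL and working directory as needed.
KERNEL_URL = "https://cdn.kernel.org/pub/linux/kernel/v4.x/linux-4.19.tar.gz"
WORKDIR = "/var/tmp/kernel-untar"

if not os.path.isdir(WORKDIR):
    os.makedirs(WORKDIR)

# Download the kernel tarball.
tarball = os.path.join(WORKDIR, os.path.basename(KERNEL_URL))
with open(tarball, "wb") as out:
    out.write(urllib2.urlopen(KERNEL_URL).read())

# Untar it.
with tarfile.open(tarball) as tar:
    tar.extractall(WORKDIR)

# Compute the sha256sum of every extracted file.
for root, _dirs, files in os.walk(WORKDIR):
    for name in files:
        path = os.path.join(root, name)
        digest = hashlib.sha256()
        with open(path, "rb") as f:
            for chunk in iter(lambda: f.read(1024 * 1024), b""):
                digest.update(chunk)
        print("%s  %s" % (digest.hexdigest(), path))
</snip>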

Known issues:
1. Sometimes, in the two-network configuration, the gluster network never came up; 'BOOTPROTO=dhcp' had to be set and the network brought up manually.
2. Because of issue 1, the gluster bricks do not come up, leading to pending self-heal entries. Once the network is brought up, healing starts.
3. After all healing completes, it takes some time (not more than 5 minutes) for the heal status to be reflected in the RHV Manager UI.

Comment 10 errata-xmlrpc 2020-02-13 15:57:23 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:0508