Bug 1875554

Summary: [RFE] supervdsm is holding unmapped LUNs after storageDevicesList call
Product: Red Hat Enterprise Virtualization Manager Reporter: hhaberma
Component: vdsm    Assignee: Nobody <nobody>
Status: CLOSED DUPLICATE QA Contact: Avihai <aefrat>
Severity: high Docs Contact:
Priority: unspecified    
Version: 4.3.5    CC: cswanson, lsurette, mavital, michal.skrivanek, mkalinin, nashok, srevivo, ycui
Target Milestone: ---    Keywords: FutureFeature
Target Release: ---   
Hardware: x86_64   
OS: Linux   
Whiteboard:
Fixed In Version: Doc Type: Enhancement
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2021-01-05 03:26:22 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: Storage RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On: 1879920    
Bug Blocks:    

Comment 2 Michal Skrivanek 2020-09-04 04:41:39 UTC
Can we get the exact actions on the system so that we can reproduce it? When the volume is detached from VM it should be let go by vdsm. 
Also, you mention lvm filter...how was that supposed to work, if it’s filtered out then it can’t be used as a VM disk, no?

Comment 4 nijin ashok 2020-09-15 11:04:47 UTC
(In reply to Michal Skrivanek from comment #2)
> Can we get the exact actions on the system so that we can reproduce it? When
> the volume is detached from VM it should be let go by vdsm. 

I was able to reproduce the issue in my test environment. supervdsm holds the unmapped device open after the storageDevicesList call. The engine only initiates this call if the cluster has the "gluster service" enabled, so the issue won't appear in a normal setup (the gluster and virt services are not both enabled by default), but it will appear in RHHI.

The issue can be reproduced with the steps below.

[1] Map a LUN to the server. 

[2] Create a partition on this LUN.

[3] Unmap the LUN from the storage.

[4] Flush the cache on the host.

# echo 3 > /proc/sys/vm/drop_caches

[5] Login to RHV-M => click on hosts => Storage devices => sync

Step [5] initiates storageDevicesList in supervdsm, which uses blivet. When blivet reads the MBR of the unmapped device, the read fails with an I/O error as below.

===
MainProcess|jsonrpc/1::WARNING::2020-09-15 10:38:25,870::edd::169::blivet::(collect_mbrs) edd: error reading mbrsig from disk 36001405097fdbe4d6e04e3b9bdc97014: [Errno 5] Input/output error

multipath -ll |grep -A3 36001405097fdbe4d6e04e3b9bdc97014
36001405097fdbe4d6e04e3b9bdc97014 dm-22 LIO-ORG ,sdk             
size=10G features='0' hwhandler='0' wp=rw
`-+- policy='service-time 0' prio=0 status=enabled
  `- 5:0:0:1 sde 8:64 failed faulty running
===
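
For reference, the same read sequence can be mimicked outside of blivet with a few lines of Python (the WWID is the one from this environment; this is only an illustrative sketch, not blivet code):

===
import os

# The multipath device whose LUN was unmapped on the storage side.
dev = "/dev/mapper/36001405097fdbe4d6e04e3b9bdc97014"

fd = os.open(dev, os.O_RDONLY)
try:
    os.lseek(fd, 440, os.SEEK_SET)  # the MBR signature lives at byte offset 440
    data = os.read(fd, 4)           # raises OSError (EIO) once all paths are faulty
finally:
    os.close(fd)                    # without this, the fd leaks exactly as in edd.py
===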

From the strace output, we can see that blivet opens the device, tries to read it, and fails with EIO. However, it never closes the file descriptor.

===
10078 10:38:25.869889 open("/dev/mapper/36001405097fdbe4d6e04e3b9bdc97014", O_RDONLY <unfinished ...>
10078 10:38:25.870015 <... open resumed>) = 25
10078 10:38:25.870415 lseek(25, 440, SEEK_SET <unfinished ...>
10078 10:38:25.870600 read(25,  <unfinished ...>
10078 10:38:25.870759 <... read resumed>0x7f865c0ec984, 4) = -1 EIO (Input/output error)

< -- then jumped to next device without closing the fd 25-->

10078 10:38:25.871192 open("/dev/sda", O_RDONLY <unfinished ...>
===

lsof also shows that supervdsm has not closed fd 25.

===
lsof |grep supervdsm |grep 25r
supervdsm  9592                  root   25r      BLK             253,22     0t440             58811969 /dev/dm-22
===
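
The same check can also be done by walking /proc/<pid>/fd instead of using lsof (a minimal sketch; the helper name is hypothetical and the PID is the supervdsm PID from the lsof output above):

===
import os

def open_block_devices(pid):
    """Print the block-device file descriptors still held by a process."""
    fd_dir = "/proc/%d/fd" % pid
    for fd in os.listdir(fd_dir):
        try:
            target = os.readlink(os.path.join(fd_dir, fd))
        except OSError:
            continue
        if target.startswith("/dev/dm-") or target.startswith("/dev/sd"):
            print(fd, target)

open_block_devices(9592)
===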

blivet does not close the device if an exception is raised while accessing it.

===
blivet/devicelibs/edd.py

153 def collect_mbrs(devices):
154     """ Read MBR signatures from devices.
155 
156         Returns a dict mapping device names to their MBR signatures. It is not
157         guaranteed this will succeed, with a new disk for instance.
158     """
159     mbr_dict = {}
160     for dev in devices:
161         try:
162             fd = os.open(dev.path, os.O_RDONLY)
163             # The signature is the unsigned integer at byte 440:
164             os.lseek(fd, 440, 0)
165             mbrsig = struct.unpack('I', os.read(fd, 4))
166             os.close(fd)                                          
167         except OSError as e:
168             log.warning("edd: error reading mbrsig from disk %s: %s",
169                         dev.name, str(e))
170             continue                                                            ===> Not closing fd if it fails to access the device and continues with next device
===

The latest upstream code also has the same behaviour.
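
The straightforward fix is to close the descriptor on the error path as well, for example with a try/finally around the read. A minimal sketch of the idea (the actual change went through the PR linked below and may differ in detail; the handling of mbrsig after the read is approximated here):

===
import logging
import os
import struct

log = logging.getLogger("blivet")

def collect_mbrs(devices):
    """ Read MBR signatures from devices without leaking fds on I/O errors. """
    mbr_dict = {}
    for dev in devices:
        fd = None
        try:
            fd = os.open(dev.path, os.O_RDONLY)
            # The signature is the unsigned integer at byte 440:
            os.lseek(fd, 440, 0)
            mbrsig = struct.unpack('I', os.read(fd, 4))
        except OSError as e:
            log.warning("edd: error reading mbrsig from disk %s: %s",
                        dev.name, str(e))
            continue
        finally:
            # Runs on the error path too, so the device is always released.
            if fd is not None:
                os.close(fd)
        mbr_dict[dev.name] = mbrsig[0]
    return mbr_dict
===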

So when the customer tries to remove the unmapped device from multipath, it fails with the error "map in use".

Comment 5 nijin ashok 2020-09-15 13:15:25 UTC
I have submitted this PR for blivet https://github.com/storaged-project/blivet/pull/899

Comment 8 nijin ashok 2021-01-05 03:26:22 UTC
The fix is available in python-blivet-0.61.15.76-1.el7_9. See bug 1879920. Closing this as a duplicate.

*** This bug has been marked as a duplicate of bug 1879920 ***