Bug 1069317

Summary: mpath pool fails to refresh due to target_type == NULL
Product: [Community] Virtualization Tools Reporter: Jeremy Kitchen <kitchen>
Component: libvirt Assignee: Libvirt Maintainers <libvirt-maint>
Status: CLOSED UPSTREAM QA Contact:
Severity: medium Docs Contact:
Priority: unspecified    
Version: unspecified CC: crobinso, rbalakri, shyu
Target Milestone: ---   
Target Release: ---   
Hardware: All   
OS: Linux   
Whiteboard:
Fixed In Version: Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2016-04-14 17:29:32 UTC Type: Bug

Description Jeremy Kitchen 2014-02-24 17:36:35 UTC
I have about 170 mpath devices on my VM hypervisors and am using libvirt to manage the VMs on those machines.

Suddenly, the other day, libvirt decided that I only had about 14 mpath devices in my storage pool. Ick.

Long story short, after adding VIR_DEBUG calls everywhere (thanks for this, btw, it made the issue pretty easy to debug), I found that if a device's target_type is NULL, libvirt stops refreshing the mpath storage pool and fails silently.


The problem code is in src/storage/storage_backend_mpath.c, in the virStorageBackendIsMultipath function:

http://libvirt.org/git/?p=libvirt.git;a=blob;f=src/storage/storage_backend_mpath.c;h=1e65a8d3cbbd7f2dc32fd090b0c762638791e100;hb=HEAD#l120

specifically this bit:
http://libvirt.org/git/?p=libvirt.git;a=blob;f=src/storage/storage_backend_mpath.c;h=1e65a8d3cbbd7f2dc32fd090b0c762638791e100;hb=HEAD#l149

where it checks the target type of the next entry in the table, gets NULL, and returns -1. I'm not sure in what other situations target_type might be NULL, so libvirt may have a reason to bail out here, but my vote would be to simply make that check return 0, indicating that it's not a multipath device, rather than -1 to indicate failure. At the very least I think it should report what's happening and why. This was a very troublesome issue to debug, and it was totally non-obvious what was happening. Of course, the moment I figured out which device was causing the problem... :)
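To make the difference between the two return values concrete, here is a minimal sketch of the check described above. It is not the libvirt code: struct dm_entry, is_multipath, scan_pool and the null_is_error flag are hypothetical names made up for illustration, and comparing against the string "multipath" is an assumption about how the target type is matched.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical stand-in for one device-mapper entry; not a libvirt type. */
struct dm_entry {
    const char *name;        /* device name as listed by device-mapper */
    const char *target_type; /* NULL models a device with no usable target */
};

/* Returns 1 for a multipath device, 0 for any other device, -1 on failure.
 * null_is_error != 0 models the behavior reported in this bug;
 * null_is_error == 0 models the proposed change. */
static int is_multipath(const struct dm_entry *dev, int null_is_error)
{
    if (dev->target_type == NULL) {
        /* Name the device either way so the cause is visible. */
        fprintf(stderr, "%s: target_type is NULL\n", dev->name);
        return null_is_error ? -1 : 0;
    }
    /* Assumption: multipath devices carry the target type "multipath". */
    return strcmp(dev->target_type, "multipath") == 0 ? 1 : 0;
}

/* Walks the device list and counts multipath devices in *found.
 * Returns 0 on success, or -1 as soon as one check fails, in which
 * case *found only covers the devices seen before the failure. */
static int scan_pool(const struct dm_entry *devs, size_t n,
                     int null_is_error, size_t *found)
{
    assert(found != NULL);
    assert(devs != NULL || n == 0);

    *found = 0;
    for (size_t i = 0; i < n; i++) {
        int rc = is_multipath(&devs[i], null_is_error);
        if (rc < 0)
            return -1; /* every later device is never looked at */
        *found += (size_t)rc;
    }
    return 0;
}
```

With a list where a NULL-target device sits between two multipath devices, scan_pool with null_is_error set stops at that device and leaves only the first multipath device counted, which matches the truncated pool I was seeing. With the flag cleared it skips the device, names it on stderr, and counts both.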

Here's some info on the device itself:
I created a VM on one of my mpath devices and then, within the VM, used the Ubuntu installer / preseed to lay down a DM logical device on it for installation. Along with all of the -part devices in my /dev/mapper, I think the hypervisor's device mapper simply picked up on this device and added the logical devices to its own mapping. When I later reinstalled the machine and destroyed the device mapper logical device, the hypervisor failed to remove the device. It did, however, set its status to "suspended"; I'm not sure whether that is directly associated with target_type == NULL or whether they're two symptoms of the same issue.

If need be, I can spend some time devising a direct method for reproducing this issue, in case there is other information about the device that can lead to a sort of "target_type == NULL is ok if X, too" type of thing.

Changing the check to return 0 instead of -1 "solved" the issue for me, of course, but I don't know if that's completely correct. Running dmsetup remove on the device also fixes it, but libvirt gave me no clue as to which device I needed to investigate. Other than that, I could probably have gone through dmsetup list and seen where the mpath devices stopped getting added to the pool. Either way, it was a very non-obvious problem and I spent a lot of time digging into the C code to determine the cause!

Comment 1 Cole Robinson 2016-04-13 21:36:39 UTC
Sorry this didn't receive a timely response. Nice job investigating! I agree that it seems reasonable to just ignore target_type=NULL rather than treat it as an error. I've sent a patch to that effect:

http://www.redhat.com/archives/libvir-list/2016-April/msg00769.html

Comment 2 Cole Robinson 2016-04-14 17:29:32 UTC
commit 8f8c0feb113420625f15b7f1e17bfd719c977eeb
Author: Cole Robinson <crobinso>
Date:   Wed Apr 13 17:29:59 2016 -0400

    storage: mpath: Don't error on target_type=NULL
    
    We use device-mapper to enumerate all dm devices, and filter out
    the list of multipath devices by checking the target_type string
    name. The code however cancels all scanning if we encounter
    target_type=NULL
    
    I don't know how to reproduce that situation, but a user was hitting
    it in their setup, and inspecting the lvm2/device-mapper code shows
    many places where !target_type is explicitly ignored and processing
    continues on to the next device. So I think we should do the same
    
    https://bugzilla.redhat.com/show_bug.cgi?id=1069317