I have about 170 mpath devices on my VM hypervisors and use libvirt to manage the VMs on those machines. The other day, libvirt suddenly decided that I only had about 14 mpath devices in my storage pool. Ick.

Long story short, after adding a lot of VIR_DEBUG calls everywhere (thanks for this, by the way; it made the issue actually pretty easy to debug), I found that if a device's target_type is NULL, libvirt stops refreshing the mpath storage pool and fails silently. The problem code is in src/storage/storage_backend_mpath.c, in the virStorageBackendIsMultipath function:

http://libvirt.org/git/?p=libvirt.git;a=blob;f=src/storage/storage_backend_mpath.c;h=1e65a8d3cbbd7f2dc32fd090b0c762638791e100;hb=HEAD#l120

specifically this bit:

http://libvirt.org/git/?p=libvirt.git;a=blob;f=src/storage/storage_backend_mpath.c;h=1e65a8d3cbbd7f2dc32fd090b0c762638791e100;hb=HEAD#l149

where it checks the target type of the next entry in the table, gets NULL, and returns.

I'm not sure in what other situations target_type might be NULL, in which case libvirt might actually want to bail out here, but my vote would be to make that check return 0 (indicating that it's not a multipath device) rather than -1 (indicating failure). At the very least, it should report what's happening and why. This was a very troublesome issue to debug, and it was totally non-obvious what was happening. Of course, the moment I figured out which device was causing the problem... :)

Here's some info on the device itself: I created a VM on one of my mpath devices and then, within the VM, used the Ubuntu installer/preseed to lay down a DM logical device on it for installation. Along with all of the -part devices in my /dev/mapper, I think the hypervisor's device-mapper simply picked up on this device and added the logical devices to its own mapping. When I later reinstalled the machine and destroyed the device-mapper logical device, the hypervisor failed to remove the device. It did, however, set its status to "suspended"; I'm not sure whether that is directly associated with target_type == NULL, or whether they're two symptoms of the same issue.

If need be, I can spend some time devising a direct method for reproducing this issue, in case there is other information about the device that could lead to a sort of "target_type == NULL is OK if X, too" rule. Changing the check to return 0 instead of -1 "solved" the issue for me, of course, but I don't know whether that's completely correct. dmsetup remove'ing the device also fixes it, but libvirt gave me zero clue about which device I needed to investigate. Failing that, I probably could have gone through dmsetup ls and seen where the mpath devices stopped getting added to the pool. Either way, it was a very non-obvious problem, and I spent a lot of time digging into the C code to determine the cause!
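For reference, the failing logic boils down to something like the sketch below. This is a condensed paraphrase, not verbatim libvirt code: the function name is shortened from virStorageBackendIsMultipath, error reporting is trimmed, and a plain strcmp() stands in for libvirt's string macros. The libdevmapper calls themselves are the real API.

    /* Condensed sketch of the failing check (paraphrased from
     * virStorageBackendIsMultipath() in
     * src/storage/storage_backend_mpath.c).  Returns 1 if dev_name
     * is a multipath device, 0 if not, -1 on error. */
    #include <stdint.h>
    #include <string.h>
    #include <libdevmapper.h>

    static int
    is_multipath(const char *dev_name)
    {
        int ret = -1;
        struct dm_task *dmt;
        uint64_t start, length;
        char *target_type = NULL;
        char *params = NULL;

        if (!(dmt = dm_task_create(DM_DEVICE_TABLE)))
            return -1;

        if (!dm_task_set_name(dmt, dev_name) || !dm_task_run(dmt))
            goto out;

        /* Fetch the first target of the device's table. */
        dm_get_next_target(dmt, NULL, &start, &length,
                           &target_type, &params);

        if (target_type == NULL) {
            /* The failure mode described above: a stale, suspended
             * dm device has no target type, and the -1 here silently
             * aborts the entire mpath pool refresh. */
            goto out;          /* ret is still -1 */
        }

        ret = (strcmp(target_type, "multipath") == 0) ? 1 : 0;

    out:
        dm_task_destroy(dmt);
        return ret;
    }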
Sorry this didn't receive a timely response. Nice job investigating! I agree that it seems reasonable to just ignore target_type=NULL rather than treat it as an error; I've sent a patch to that effect: http://www.redhat.com/archives/libvir-list/2016-April/msg00769.html
commit 8f8c0feb113420625f15b7f1e17bfd719c977eeb
Author: Cole Robinson <crobinso>
Date:   Wed Apr 13 17:29:59 2016 -0400

    storage: mpath: Don't error on target_type=NULL

    We use device-mapper to enumerate all dm devices, and filter out
    the list of multipath devices by checking the target_type string
    name. The code however cancels all scanning if we encounter
    target_type=NULL.

    I don't know how to reproduce that situation, but a user was
    hitting it in their setup, and inspecting the lvm2/device-mapper
    code shows many places where !target_type is explicitly ignored
    and processing continues on to the next device. So I think we
    should do the same.

    https://bugzilla.redhat.com/show_bug.cgi?id=1069317
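In other words, the fix treats a NULL target_type as "not a multipath device" and moves on, mirroring what lvm2 itself does with such targets. Paraphrased (the actual commit uses libvirt's string-comparison macros rather than a bare strcmp(), so take this as a sketch of the behavior, not the literal diff):

    /* Post-fix behavior: NULL no longer aborts the pool refresh; it
     * simply classifies the device as "not multipath", and scanning
     * continues with the next dm device. */
    if (target_type != NULL && strcmp(target_type, "multipath") == 0)
        ret = 1;
    else
        ret = 0;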