Bug 1255514
Summary: | SD flapping between inactive and active due to Unknown Device | |
---|---|---|---|
Product: | Red Hat Enterprise Virtualization Manager | Reporter: | Olimp Bockowski <obockows> |
Component: | vdsm | Assignee: | Maor <mlipchuk> |
Status: | CLOSED NOTABUG | QA Contact: | Aharon Canan <acanan> |
Severity: | high | Docs Contact: | |
Priority: | medium | |
Version: | 3.5.0 | CC: | amureini, bazulay, ecohen, gklein, gwatson, laravot, lpeer, lsurette, nsoffer, obockows, pstehlik, rbalakri, Rhev-m-bugs, tnisan, ycui, yeylon, ylavi |
Target Milestone: | --- | |
Target Release: | 3.5.6 | |
Hardware: | x86_64 | |
OS: | Linux | |
Whiteboard: | storage | |
Fixed In Version: | | Doc Type: | Bug Fix
Doc Text: | | Story Points: | ---
Clone Of: | | Environment: |
Last Closed: | 2015-09-24 07:33:31 UTC | Type: | Bug |
Regression: | --- | Mount Type: | --- |
Documentation: | --- | CRM: | |
Verified Versions: | | Category: | ---
oVirt Team: | Storage | RHEL 7.3 requirements from Atomic Host: | |
Cloudforms Team: | --- | Target Upstream Version: | |
Embargoed: | | |
Description
Olimp Bockowski
2015-08-20 19:37:46 UTC
[Tentatively targeting for 3.5.5 until properly scrubbed with PM and QA stakeholders]

If part of a VG is not accessible, the SD based on it should be considered down. It should not, however, flip between active and inactive.

(In reply to Olimp Bockowski from comment #0)
> Description of problem:
>
> BZ 1181678 possibly causing the current problem. This bug provides the "fix"
> for the original problem where "unknown device" PVs should now be ignored by
> VDSM.

Can you explain how the fix for bug 1181678 is "causing" this problem? Without that fix, VDSM's LVM operations would fail because of the unknown device. With the fix, VDSM ignores such devices and logs a warning for each one.

Ignoring the missing device does not fix the problem of a missing device; that must be fixed by the administrator. We do not support removing a LUN from a storage domain.

Please attach vdsm and engine logs.

Hello,

1. The first problem is how LVM handles the VG: if the VG is not completely accessible, the SD should not switch between active and inactive.
2. BZ 1181678 changed how VDSM handles this condition, so if that fix reports a clearer message, we should not see the active/inactive flapping.
3. My reproduction example covers exactly this situation: I remove PVs and the SD still flaps between active and inactive. I think it should not.
4. An additional question: why, after replacing one of the two redundant storage controllers, did we see 2x more PVs?

(In reply to Olimp Bockowski from comment #3)
> Hello,
>
> 1. The first problem is how LVM handles the VG: if the VG is not completely
> accessible, the SD should not switch between active and inactive.

Maybe; we need to investigate why it happens.

> 2. BZ 1181678 changed how VDSM handles this condition, so if that fix
> reports a clearer message, we should not see the active/inactive flapping.

It is not related; the fix for bug 1181678 solves the issue of the entire data center not functioning when one PV is missing. That fix possibly revealed this issue, which was hidden until then.

> 4. An additional question: why, after replacing one of the two redundant
> storage controllers, did we see 2x more PVs?

I don't follow; please explain what you mean by 2x more PVs.

One of the redundant controllers in the storage device was replaced. After that, the SPM hypervisor saw 2x more PVs; that was the reason the SD was flapping.

This was the output:

PV /dev/mapper/360050763008108192000000000000000  VG 52ae1721-814c-47b2-a92e-ba4ae314b816  lvm2 [4.00 TiB / 51.00 GiB free]
PV /dev/mapper/360050763008108192000000000000003  VG 52ae1721-814c-47b2-a92e-ba4ae314b816  lvm2 [4.00 TiB / 1.51 TiB free]
PV unknown device                                 VG 52ae1721-814c-47b2-a92e-ba4ae314b816  lvm2 [4.00 TiB / 4.00 TiB free]
PV unknown device                                 VG 52ae1721-814c-47b2-a92e-ba4ae314b816  lvm2 [4.00 TiB / 4.00 TiB free]
PV unknown device                                 VG 52ae1721-814c-47b2-a92e-ba4ae314b816  lvm2 [4.00 TiB / 4.00 TiB free]
PV /dev/mapper/360050763008108192000000000000005  VG 52ae1721-814c-47b2-a92e-ba4ae314b816  lvm2 [4.00 TiB / 4.00 TiB free]

The problem was resolved simply with vgreduce --removemissing, but we are wondering why it happened.
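For reference, a hypothetical shell session showing the kind of standard LVM commands involved here. The VG name is taken from the output above; the exact commands and output from this case were not captured, so this is only a sketch of how an administrator might confirm the partial VG and clear the missing PVs:

```sh
# Hypothetical session -- standard LVM commands, not the exact ones run in
# this case. The VG name is taken from the pvscan output above.
VG=52ae1721-814c-47b2-a92e-ba4ae314b816

pvs -v                           # missing PVs show up as "unknown device"
vgs -o vg_name,vg_attr "$VG"     # a 'p' in vg_attr marks the VG as partial
vgck "$VG"                       # fails while the VG has missing PVs

# Drop the missing PVs from the VG metadata. No --force is needed here
# because the missing PVs are completely free (no LVs allocated on them).
vgreduce --removemissing "$VG"
vgck "$VG"                       # should succeed again
```

If any LVs had extents on the missing PVs, vgreduce --removemissing would refuse to run without --force, which is why it matters that the unknown devices in the output above are fully free.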
(In reply to Olimp Bockowski from comment #5)
> One of the redundant controllers in the storage device was replaced. After
> that, the SPM hypervisor saw 2x more PVs; that was the reason the SD was
> flapping.
>
> This was the output:
>
> PV /dev/mapper/360050763008108192000000000000000  VG 52ae1721-814c-47b2-a92e-ba4ae314b816  lvm2 [4.00 TiB / 51.00 GiB free]
> PV /dev/mapper/360050763008108192000000000000003  VG 52ae1721-814c-47b2-a92e-ba4ae314b816  lvm2 [4.00 TiB / 1.51 TiB free]
> PV unknown device                                 VG 52ae1721-814c-47b2-a92e-ba4ae314b816  lvm2 [4.00 TiB / 4.00 TiB free]
> PV unknown device                                 VG 52ae1721-814c-47b2-a92e-ba4ae314b816  lvm2 [4.00 TiB / 4.00 TiB free]
> PV unknown device                                 VG 52ae1721-814c-47b2-a92e-ba4ae314b816  lvm2 [4.00 TiB / 4.00 TiB free]
> PV /dev/mapper/360050763008108192000000000000005  VG 52ae1721-814c-47b2-a92e-ba4ae314b816  lvm2 [4.00 TiB / 4.00 TiB free]
>
> The problem was resolved simply with vgreduce --removemissing, but we are
> wondering why it happened.

The unknown devices are probably the missing devices. When you have such devices, the storage domain self-check, implemented using vgck, fails, making the storage domain inactive. Removing the missing devices makes vgck succeed again, fixing the issue.

What do you mean by 2x the devices? Please explain what each PV in this output is, and which command you used to collect this info.

Just pvscan with the output grepped for the affected volume group, and pvs -v.

I am not sure; could it be related to a multipath issue during the controller replacement?

(In reply to Olimp Bockowski from comment #7)
> Just pvscan with the output grepped for the affected volume group, and pvs -v.
> I am not sure; could it be related to a multipath issue during the controller
> replacement?

You did not answer my question; check comment 6 again.

The PVs are LUNs from a SAN over Fibre Channel, with a switched, redundant fabric, provided by an IBM Storwize V7000 storage device. The LUNs presented as physical volumes belong to the volume group backing a storage domain with storage_domain_type 1 (data) and storage_type 2 (FCP).
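As a footnote to the vgck explanation in comment 6, here is a minimal illustration of how that self-check drives the reported state: while the VG has missing PVs the check fails and the domain is reported inactive, and after vgreduce --removemissing it succeeds again. This is not VDSM's actual monitoring code (which is Python and considerably more involved); the VG name and the interval below are illustrative only.

```sh
# Toy loop mimicking the vgck-based self-check described in comment 6.
# Assumptions: the VG name is the one from this case; the 10s interval is
# illustrative only and is not VDSM's real monitoring interval.
VG=52ae1721-814c-47b2-a92e-ba4ae314b816

while true; do
    if vgck "$VG" >/dev/null 2>&1; then
        echo "$(date -u '+%F %T') vgck OK     -> domain would be reported active"
    else
        echo "$(date -u '+%F %T') vgck FAILED -> domain would be reported inactive"
    fi
    sleep 10
done
```

Note that this only shows why the domain goes inactive while PVs are missing and recovers after the cleanup; it does not by itself explain the active/inactive alternation reported in this bug.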