Bug 1255514
Summary: | SD flapping between inactive and active due to Unknown Device | |
---|---|---|---|
Product: | Red Hat Enterprise Virtualization Manager | Reporter: | Olimp Bockowski <obockows> |
Component: | vdsm | Assignee: | Maor <mlipchuk> |
Status: | CLOSED NOTABUG | QA Contact: | Aharon Canan <acanan> |
Severity: | high | Docs Contact: | |
Priority: | medium | |
Version: | 3.5.0 | CC: | amureini, bazulay, ecohen, gklein, gwatson, laravot, lpeer, lsurette, nsoffer, obockows, pstehlik, rbalakri, Rhev-m-bugs, tnisan, ycui, yeylon, ylavi |
Target Milestone: | --- | |
Target Release: | 3.5.6 | |
Hardware: | x86_64 | |
OS: | Linux | |
Whiteboard: | storage | |
Fixed In Version: | | Doc Type: | Bug Fix
Doc Text: | | Story Points: | ---
Clone Of: | | Environment: |
Last Closed: | 2015-09-24 07:33:31 UTC | Type: | Bug |
Regression: | --- | Mount Type: | --- |
Documentation: | --- | CRM: | |
Verified Versions: | | Category: | ---
oVirt Team: | Storage | RHEL 7.3 requirements from Atomic Host: | |
Cloudforms Team: | --- | Target Upstream Version: | |
Embargoed: | | |
Description
Olimp Bockowski
2015-08-20 19:37:46 UTC
[Tentatively targeting for 3.5.5 until properly scrubbed with PM and QA stakeholders]

If part of a VG is not accessible, the SD based on it should be considered down. It should not, however, flip between active and inactive.

(In reply to Olimp Bockowski from comment #0)
> Description of problem:
>
> BZ 1181678 possibly causing the current problem. This bug provides the "fix"
> for the original problem where "unknown device" PVs should now be ignored by
> VDSM.

Can you explain how the fix for bug 1181678 is "causing" this problem? Without that fix, VDSM's LVM operations would fail because of the unknown device. With the fix, VDSM ignores such devices and logs a warning for each one.

Ignoring the missing device does not fix the problem of a missing device; that must be fixed by the administrator. We do not support removing a LUN from a storage domain.

Please attach vdsm and engine logs.

Hello,

1. The first problem is how LVM handles the VG: if the VG is not completely accessible, the SD should not switch between active and inactive.
2. BZ 1181678 changed how VDSM handles this condition, so if that fix reports a clearer message, we should not see the active/inactive flapping.
3. My reproduction example covers exactly this situation: I remove PVs and the SD still flaps between active and inactive. I think it should not.
4. An additional question: why, after replacing one of the two redundant storage controllers, did we see 2x more PVs?

(In reply to Olimp Bockowski from comment #3)
> Hello,
>
> 1. The first problem is how LVM handles the VG: if the VG is not completely
> accessible, the SD should not switch between active and inactive.

Maybe; we need to investigate why it happens.

> 2. BZ 1181678 changed how VDSM handles this condition, so if that fix
> reports a clearer message, we should not see the active/inactive flapping.

It is not related; the fix for bug 1181678 solves the issue of the entire data center not functioning when one PV is missing. That fix possibly revealed this issue, which was hidden until then.

> 4. An additional question: why, after replacing one of the two redundant
> storage controllers, did we see 2x more PVs?

I don't follow; please explain what you mean by 2x more PVs.

One of the redundant controllers in the storage device was replaced. After that, the SPM hypervisor saw 2x more PVs; that was the reason the SD was flapping.

This was the output:

PV /dev/mapper/360050763008108192000000000000000  VG 52ae1721-814c-47b2-a92e-ba4ae314b816  lvm2 [4.00 TiB / 51.00 GiB free]
PV /dev/mapper/360050763008108192000000000000003  VG 52ae1721-814c-47b2-a92e-ba4ae314b816  lvm2 [4.00 TiB / 1.51 TiB free]
PV unknown device                                 VG 52ae1721-814c-47b2-a92e-ba4ae314b816  lvm2 [4.00 TiB / 4.00 TiB free]
PV unknown device                                 VG 52ae1721-814c-47b2-a92e-ba4ae314b816  lvm2 [4.00 TiB / 4.00 TiB free]
PV unknown device                                 VG 52ae1721-814c-47b2-a92e-ba4ae314b816  lvm2 [4.00 TiB / 4.00 TiB free]
PV /dev/mapper/360050763008108192000000000000005  VG 52ae1721-814c-47b2-a92e-ba4ae314b816  lvm2 [4.00 TiB / 4.00 TiB free]

The problem was resolved simply with vgreduce --removemissing, but we are wondering why it happened.
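For reference, a hypothetical shell session showing the kind of standard LVM commands involved here. The VG name is taken from the output above; the exact commands and output from this case were not captured, so this is only a sketch of how an administrator might confirm the partial VG and clear the missing PVs:

```sh
# Hypothetical session -- standard LVM commands, not the exact ones run in
# this case. The VG name is taken from the pvscan output above.
VG=52ae1721-814c-47b2-a92e-ba4ae314b816

pvs -v                           # missing PVs show up as "unknown device"
vgs -o vg_name,vg_attr "$VG"     # a 'p' in vg_attr marks the VG as partial
vgck "$VG"                       # fails while the VG has missing PVs

# Drop the missing PVs from the VG metadata. No --force is needed here
# because the missing PVs are completely free (no LVs allocated on them).
vgreduce --removemissing "$VG"
vgck "$VG"                       # should succeed again
```

If any LVs had extents on the missing PVs, vgreduce --removemissing would refuse to run without --force, which is why it matters that the unknown devices in the output above are fully free.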
(In reply to Olimp Bockowski from comment #5)
> One of the redundant controllers in the storage device was replaced. After
> that, the SPM hypervisor saw 2x more PVs; that was the reason the SD was
> flapping.
>
> This was the output:
>
> PV /dev/mapper/360050763008108192000000000000000  VG 52ae1721-814c-47b2-a92e-ba4ae314b816  lvm2 [4.00 TiB / 51.00 GiB free]
> PV /dev/mapper/360050763008108192000000000000003  VG 52ae1721-814c-47b2-a92e-ba4ae314b816  lvm2 [4.00 TiB / 1.51 TiB free]
> PV unknown device                                 VG 52ae1721-814c-47b2-a92e-ba4ae314b816  lvm2 [4.00 TiB / 4.00 TiB free]
> PV unknown device                                 VG 52ae1721-814c-47b2-a92e-ba4ae314b816  lvm2 [4.00 TiB / 4.00 TiB free]
> PV unknown device                                 VG 52ae1721-814c-47b2-a92e-ba4ae314b816  lvm2 [4.00 TiB / 4.00 TiB free]
> PV /dev/mapper/360050763008108192000000000000005  VG 52ae1721-814c-47b2-a92e-ba4ae314b816  lvm2 [4.00 TiB / 4.00 TiB free]
>
> The problem was resolved simply with vgreduce --removemissing, but we are
> wondering why it happened.

The unknown devices are probably the missing devices. When you have such devices, the storage domain self-check, implemented using vgck, fails, making the storage domain inactive. Removing the missing devices makes vgck succeed again, fixing the issue.

What do you mean by 2x the devices? Please explain what each PV in this output is, and which command you used to collect this info.

Just pvscan with the output grepped for the affected volume group, and pvs -v.

I am not sure; could it be related to a multipath issue during the controller replacement?

(In reply to Olimp Bockowski from comment #7)
> Just pvscan with the output grepped for the affected volume group, and pvs -v.
> I am not sure; could it be related to a multipath issue during the controller
> replacement?

You did not answer my question; check comment 6 again.

The PVs are LUNs from a SAN over Fibre Channel, with a switched, redundant fabric, provided by an IBM Storwize V7000 storage device. The LUNs presented as physical volumes belong to the volume group backing a storage domain with storage_domain_type 1 (data) and storage_type 2 (FCP).
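As a footnote to the vgck explanation in comment 6, here is a minimal illustration of how that self-check drives the reported state: while the VG has missing PVs the check fails and the domain is reported inactive, and after vgreduce --removemissing it succeeds again. This is not VDSM's actual monitoring code (which is Python and considerably more involved); the VG name and the interval below are illustrative only.

```sh
# Toy loop mimicking the vgck-based self-check described in comment 6.
# Assumptions: the VG name is the one from this case; the 10s interval is
# illustrative only and is not VDSM's real monitoring interval.
VG=52ae1721-814c-47b2-a92e-ba4ae314b816

while true; do
    if vgck "$VG" >/dev/null 2>&1; then
        echo "$(date -u '+%F %T') vgck OK     -> domain would be reported active"
    else
        echo "$(date -u '+%F %T') vgck FAILED -> domain would be reported inactive"
    fi
    sleep 10
done
```

Note that this only shows why the domain goes inactive while PVs are missing and recovers after the cleanup; it does not by itself explain the active/inactive alternation reported in this bug.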