Bug 1403836

Summary: pvmove progress polling returns wrongfully prematurely
Product: Red Hat Enterprise Linux 7 Reporter: Liron Aravot <laravot>
Component: lvm2Assignee: LVM and device-mapper development team <lvm-team>
lvm2 sub component: Default / Unclassified QA Contact: cluster-qe <cluster-qe>
Status: CLOSED NOTABUG Docs Contact:
Severity: urgent    
Priority: unspecified CC: agk, amureini, anprice, bmarzins, bmcclain, bmr, heinzm, jbrassow, jentrena, jonathan, lvm-team, msnitzer, nsoffer, okozina, prajnoha, prockai, snagar, thornber, tnisan, ykaul, ylavi, zkabelac
Version: 7.0Flags: okozina: needinfo-
Target Milestone: rc   
Target Release: ---   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2017-01-04 15:22:07 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On:    
Bug Blocks: 1385242, 1403839, 1454287    
Attachments:
Description Flags
execution log
none
lvmpolld test - log none

Description Liron Aravot 2016-12-12 13:05:09 UTC
Created attachment 1230789 [details]
execution log

Description of problem:
We attempt to make use of pvmove/vgreduce for ovirt storage domains.

When attempting to run pvmove (see exact command in the attached execution log) the command ends prematurely and the progress is being polled only one time.
The pvmove continues to run on the background and further execution of the same command will have the same effect - one polling attempt and then determination of successful execution.
The weird thing is that if i clean the caches and execute the same command again - it'll be polled as expected (see in the attached execution log) although we use use_lvmetad=0 on the command (so cache should have no effect).


Version-Release number of selected component (if applicable):
libblockdev-lvm-1.6-1.fc24.x86_64
lvm2-2.02.150-2.fc24.x86_64
lvm2-libs-2.02.150-2.fc24.x86_64
Fedora release 24 (Twenty Four)

How reproducible:
Always

Steps to Reproduce:
1. Issue pvmove as specified above.

Actual results:
The pvmove is polled only one time.

Expected results:
The command should return when after it was actually finished.

Additional info:

When attempting to execute with -vvvv, I've got the following error -
LVM command executed by lvmpolld failed.
Reloading config files    Wiping internal VG cache';

The pvmove was still ongoing on the background on that case as well.

I'll try to reproduce again with lvmetad service disabled (just a reminder - we execute with use_lvmetad=0, but i know with had other issues on those areas).

Comment 1 Yaniv Lavi 2016-12-12 13:24:50 UTC
Can you please escalate? This is urgent for RHV 4.1 release and should be fixed for 7.3.z.

Comment 2 Ondrej Kozina 2016-12-12 14:44:56 UTC
(In reply to Liron Aravot from comment #0)
> (...)we use use_lvmetad=0 on the command (so cache should have no
> effect).

What do you have set for lvmetad in /etc/lvm/lvm.conf?

> 
> Additional info:
> 
> When attempting to execute with -vvvv, I've got the following error -
> LVM command executed by lvmpolld failed.

Long running operations are managed by lvmpolld daemon since RHEL7.2 IIRC. The pvmove monitoring executed on daemon side failed for some reason.

Could you have a look at the daemon's log for more info?
(journalctl -u lvm2-lvmpolld)

If it's not of much help please try to run it in debug mode by:
systemctl stop lvm2-lvmpolld
systemctl stop lvm2-lvmpolld.socket
lvmpolld -f -l all

and capture the output when you issue your pvmove cmd in another terminal.

I think that what happened is in fact this:

you started the pvmove w/ lvmetad dissabled but lvmpolld daemon runs commands in default config therefore probably with use_lvmetad=1 and lvmetad was not aware of changes in mda made by pvmove cmd with use_lvmetad=0 on cmdline (according to your report)

Comment 3 Liron Aravot 2016-12-12 14:57:40 UTC
Thanks Ondrej,
prior to you comment I made a run with lvm2-lvmetad.service lvm2-lvmetad.socket disabled, pvmove/lvmpolld worked as expected from the first execution - so the scenario you specified makes sense.

I'll report back with the additional info you requested.

Comment 5 Liron Aravot 2016-12-14 16:39:20 UTC
Created attachment 1231806 [details]
lvmpolld test - log

Comment 6 Liron Aravot 2016-12-14 16:40:03 UTC
Ondrej, please find the requested log attached.

Thanks, Liron.

Comment 7 Ondrej Kozina 2016-12-22 13:45:09 UTC
Hi Liron,

it's as I suspected earlier. The command initiated on lvmpolld side runs with use_lvmetad=1 set and therefore doesn't see pvmove0 mirror lv (missing in lvmetad cache). My recommendation is to turn off lvmetad for good if you have bad experience with it and most importantly don't mix commands modifyting lvm metadata (with use_lvmetad=0) and commands with use_lvmetad=1 set while lvmetad is running because that's what's causing the mda corruption on lvmetad side.

The only currently supported setup is to stop lvmetad manually after command modifying mda w/ use_lvmetad=0 was run (also clearly stated in logs). There's plan to stop lvmetad automatically by such harmful command, but that's future feature.

Comment 8 Liron Aravot 2016-12-27 17:59:47 UTC
Hi Ondrej,
Thanks for your reply.
Question - when the lvmpolld command is initiated to monitor the pvmove, why doesn't the use_lvmetad=0 flag passed to it? regardless of the option of disabling lvmetad, passing the flag makes sense to me (if pvmove doesn't use the metadata, so should the lvmpolld command monitoring it).


Nir, can you please review the discussion? As we have other issues with having lvmetad enabled - do we feel safe enough to disable it for 4.1? or do we want to do that for 4.2?

Thanks,
Liron

Comment 9 Ondrej Kozina 2017-01-02 10:04:24 UTC
(In reply to Liron Aravot from comment #8)
> Hi Ondrej,
> Thanks for your reply.
> Question - when the lvmpolld command is initiated to monitor the pvmove, why
> doesn't the use_lvmetad=0 flag passed to it? regardless of the option of
> disabling lvmetad, passing the flag makes sense to me (if pvmove doesn't use
> the metadata, so should the lvmpolld command monitoring it).

Sure, we were thinking about passing the option and few others as well. Unfortunately in this case any update on lvmpolld side would make cache in lvmetad outdated. Some commands run in lvmpolld environment may update lvm mda (pvmove case included). Mixing commands w/ use_lvmetad=1 and =0 while lvmetad is actually running leads to problems sooner or later I'm affraid. Turning lvmetad off (or restart it after cmd w/ use_lvmetad=0 set) is the only reliable option here.