Bug 1512631: failing vdo status commands should mention vdoconf.yml as a possible solution

Product: Red Hat Enterprise Linux 7
Component: kmod-kvdo
Version: 7.5
Hardware: x86_64
OS: Unspecified
Status: CLOSED NOTABUG
Severity: high
Priority: unspecified
Target Milestone: rc
Reporter: Corey Marthaler <cmarthal>
Assignee: Corey Marthaler <cmarthal>
QA Contact: vdo-qe
CC: awalsh, bgurney, jkrysl, jshimkus, limershe
Fixed In Version: 6.1.0.85
Doc Type: If docs needed, set a value
Last Closed: 2018-01-03 20:57:49 UTC
Type: Bug
Description (Corey Marthaler, 2017-11-13 17:42:24 UTC)
There may have been a version mismatch here, due to BZ 1510176 causing the old module to not be unloaded after the "yum remove" and "yum install" phase. Also, BZ 1511096 covers the module not displaying its version in modinfo, which would have otherwise identified the old module. Let me know what the remove / create sequence looks like after the reboot.

Corey let me know that the issue survives after the reboot; however, the manual removal was incomplete, because there was still an entry in the /etc/vdoconf.yml file. Since it was the only entry, he removed the /etc/vdoconf.yml file, and the "Failed to make FileLayer" error message for the nonexistent device no longer appeared.

The remaining question: is there a message that the "vdo" command can relay off of the vdodumpconfig "Failed to make FileLayer... No such file or directory" message? It could be something to convey that there could be a configuration entry for a VDO volume stored on a device that no longer exists.

I'm not sure we can do anything special here. vdodumpconfig really has no knowledge of vdo manager or its config file, nor should it.

Could the message describe a probable cause for the error condition? Would that be useful or misleading?

Rereading the question here. Let me rephrase my answer. Is there something we could do? Sure. I'm just not sure it's the best idea. We could parse vdodumpconfig's stderr output for specific error messages and then try to relog them as more VDO-specific messages. But we would have to be very sure about what mappings to create. Also, we don't really do this sort of thing now with the other tools we use, vdoformat for instance; we just let the tool display whatever error it gets. This feels like it should be a PM or CEE decision.

At the minimum, we should make sure that our generic messages provide information about common potential causes of failures. If we can give the customer more direction with a couple of days of engineering effort, I think we should.
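The stderr-relogging idea discussed above could be sketched roughly as follows. This is a hypothetical illustration only, not part of the actual vdo manager code; the mapping table and function names are invented, and a real implementation would need the careful vetting of mappings the comment warns about.

```python
# Hypothetical sketch: intercept known vdodumpconfig stderr patterns
# and re-log them as VDO-manager-level hints. The pattern-to-hint
# mapping is invented for illustration.
import re

# Known stderr patterns paired with friendlier, actionable hints.
ERROR_HINTS = [
    (re.compile(r"Failed to make FileLayer from '(?P<dev>[^']+)' with "
                r"No such file or directory"),
     "Device '{dev}' no longer exists. The configuration file "
     "(/etc/vdoconf.yml) may still reference a VDO volume on a removed "
     "device; 'vdo remove --force' can clear the stale entry."),
]

def relog_stderr(stderr_text):
    """Return a list of hints for any recognized error messages."""
    hints = []
    for pattern, template in ERROR_HINTS:
        match = pattern.search(stderr_text)
        if match:
            hints.append(template.format(**match.groupdict()))
    return hints
```

Unrecognized messages would simply produce no hint, so the tool's original output would still be shown unchanged, matching the current behavior for tools like vdoformat.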
If it's more than that, we should think about putting more planning into doing it for a future release.

Here's another 'vdo status' failure after a successful creation (but with a leftover entry from a failed prior vdo creation) that again survives reboots.

[root@host-116 ~]# vdostats --human-readable
Device                    Size      Used  Available  Use%  Space saving%
/dev/mapper/origin       20.0G      4.0G      16.0G   20%            94%

[root@host-116 ~]# vdo status
vdo: ERROR - VDO volume PV previous operation (create) is incomplete

Nov 29 15:10:28 host-116 vdo: ERROR - VDO volume PV previous operation (create) is incomplete

After removing the invalid entry (caused by a prior failed create), vdo status worked again.

If vdo status is failing, we need to educate users (I'd argue in the failure message itself) about the /etc/vdoconf.yml file, if manually editing/cleaning it is going to be the only way to get the status command working again.

(In reply to Corey Marthaler from comment #9)
> If vdo status is failing we need to educate users (i'd argue in the failure
> message itself) about the /etc/vdoconf.yml file if manually editing/cleaning
> it is going to be the only way in which to have the status command work
> again.

If you have an entry in the config from a failed previous create, you shouldn't need to manually edit the config file (I would never suggest doing this ever).
You should be able to run vdo remove with the --force option to clear it from the config file.

I am not able to hit the error using vdo status. I reproduced it with vdo start by stopping the vdo, removing the lv under it, and starting it again. At this point there is only the /etc/vdoconf.yml entry, which makes vdo think this particular volume still exists.

# vdo start --name vdo
Starting VDO vdo
vdo: ERROR - Could not set up device mapper for vdo
vdo: ERROR - vdodumpconfig: Failed to make FileLayer from '/dev/mapper/vg-lv' with No such file or directory

Using 'vdo remove --name vdo --force' resolves this.
But there is no change to the vdodumpconfig error message, as Louis suggested, to direct the customer to this solution. Is it possible to check whether the underlying device still exists when this error is triggered and, if not, suggest the --force option?

Stepping outside the boundaries of defined management practices, one can create scenarios which are (barring bugs) not possible within those boundaries. For any such scenario we can know what the "correct" (meaning "what we want") response should be. This, though, is only because we know the totality of the specifically crafted scenario and the desired outcome.

This is not to say that such scenarios are impossible in the "real world." Given human fallibility, it is well within the realm of possibility that an error (whether of oversight or deliberate action) can arise. These real-world occurrences do not provide the complete view of the constructed scenarios. As a consequence, determinism as to the correct response is impossible to achieve.

Consider the situation described in Jakub's comment of 2017-12-15. We know what the "correct" response is because the scenario was crafted to evoke that response. In the case of a user erroneously removing the logical volume, that same response is incorrect. The user, hopefully being able to non-destructively reconstruct the logical volume's description, will want the vdo instance to remain.

As far as is possible, we should provide correct, precise information and advice to the user. Unfortunately, not all possible scenarios can be so handled; some require human intervention.

Corey, I'm assigning the bug to you because it's marked ON_QA and you reported it. If it should be assigned to someone else, I would appreciate it if you would do so. Thanks.

I'm moving this back to assigned for now, as the move to MODIFIED appears to have been invalid without an actual fix for this issue. Please correct me if I'm wrong. I think the best bet here is to have devel close this bug as either WONTFIX or NOTABUG.
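The device-existence check suggested earlier in the thread could look roughly like this. A hypothetical sketch only: the function name and advisory wording are invented, and this is not part of the actual vdo manager.

```python
# Hypothetical sketch of the suggestion above: when the
# "Failed to make FileLayer ... No such file or directory" error is
# triggered, check whether the backing device still exists and, if
# not, point the user at 'vdo remove --force'.
import os

def suggest_recovery(device_path):
    """Return an advisory string if the backing device is gone, else None."""
    if os.path.exists(device_path):
        # Device is present; the failure has some other cause, so do not
        # steer the user toward removing the volume's config entry.
        return None
    return ("Backing device '%s' does not exist. If this VDO volume was "
            "removed outside of vdo manager, 'vdo remove --force <name>' "
            "will clear its entry from /etc/vdoconf.yml." % device_path)
```

As the following comment argues, even this check cannot distinguish a deliberately destroyed device from one a user removed by mistake and hopes to reconstruct, which is why wording it as a suggestion rather than an automatic action matters.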
Originally it was thought that editing the vdoconf file was the only way to remedy this situation, but then it was learned that 'vdo remove --force' appears to work for these types of issues as well. If we come across a scenario in the future where the force doesn't work, then we can reopen this bug.

As agreed yesterday (2018-01-02) in #vdo, we're marking this as NOTABUG. In any specific test scenario one should attempt 'vdo remove --force' for the particular vdo, and if that fails, open a new bug (reopen this one only if changing the description). We are not specifically including a recommendation to use 'vdo remove --force' in the face of these scenarios, as differentiating between a test scenario and a failure/mistake in the field is not possible.