Bug 1520326

Summary: [GA] Upgrade step: call cpdb --validate
Product: Red Hat Satellite Reporter: Lukas Zapletal <lzap>
Component: Satellite MaintainAssignee: Kavita <kgaikwad>
Status: CLOSED CURRENTRELEASE QA Contact: Katello QA List <katello-qa-list>
Severity: medium Docs Contact:
Priority: unspecified    
Version: 6.3.0CC: apatel, bbuckingham, brcoca, brubisch, cdonnell, cmarinea, crog, inecas, jsherril, karnsing, kgaikwad, ktordeur, lzap, mmccune, mmithaiw, mstead, pdudley, peter.vreman, sabnave, smane, sthirugn
Target Milestone: UnspecifiedKeywords: FieldEngineering, PrioBumpField, PrioBumpGSS, PrioBumpQA, Triaged
Target Release: Unused   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2018-03-02 09:55:51 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On: 1519904, 1520319, 1530686    
Bug Blocks: 1122832, 1496794, 1533259    

Description Lukas Zapletal 2017-12-04 09:26:29 UTC
Call this step during upgrade:

/usr/share/candlepin/cpdb --validate

When non-zero code, do not continue with upgrade and ask present this message:

Candlepin database pre-2.0 migration validation failed, or is not available. If this is Satellite 6.2.13 or older, please upgrade to 6.2.14 or newer, or visit upgrade docs on how to perform Candlepin validation manually and skip this step. If you see this on 6.2.14+ please do not continue and contact Red Hat support with /var/log/candlepin/cpdb.log contents.

For more info on docs:

https://bugzilla.redhat.com/show_bug.cgi?id=1520319

For more info on cpdb utility:

https://bugzilla.redhat.com/show_bug.cgi?id=1519904

WARNING: On installations pre 6.2.14 the cpdb utility will fail as the command line option will not be present:

[root@older ~]# /usr/share/candlepin/cpdb --validate
Usage: cpdb [options]

cpdb: error: no such option: --validate
[root@older ~]# echo  $?
2

As you can see, in the error message we ask the user to skip this step in this case. Skip option MUST be present for beta users. Thanks.

Comment 1 Kavita 2017-12-05 06:54:13 UTC
*** Bug 1520261 has been marked as a duplicate of this bug. ***

Comment 2 Lukas Zapletal 2017-12-05 10:47:43 UTC
CORRECTION:

Please call this:

/usr/share/candlepin/cpdb --validate --verbose

The extra flag will increase verbosity in logs.

Comment 7 Kavita 2018-01-04 10:38:21 UTC
@Lukas,

I have added check which will run cpdb --validate command only if '--validate' option exists for cpdb otherwise download scripts and run below command as mentioned in the document - https://github.com/candlepin/candlepinproject.org/blob/master/docs/candlepin/database_validation.md

liquibase --classpath=/usr/share/java/postgresql-jdbc.jar:<classpath> --changeLogFile=changelog-validate.xml --driver=org.postgresql.Driver --url=jdbc:postgresql:candlepin --username=<db_username> --password=<db_password> --logLevel=severe migrate

I would like to test failure scenario i.e which returns a non-zero exit code.
For this, do you know what DB changes I will need to add manually for testing purpose.

I have tried to add some dummy record in cp_pool table with invalid product_id on my sat 6.2. In this case also liquibase command gives me zero exit code with message - Liquibase Update Successful.
I was not able to check cpdb --validate command as --validate option is not there.

Comment 8 Ivan Necas 2018-01-04 13:10:42 UTC
I guess the needinfo should go to Michael

Comment 9 Lukas Zapletal 2018-01-08 14:11:13 UTC
Just an idea - modify the script and return non-zero to simulate a failure...

Comment 10 Michael Stead 2018-01-11 13:20:11 UTC
To test the error case you can modify the candlepin database directly to make a pool reference a product that doesn't exist.

Here's how I did it on my test candlepin instance:

# Find a pool record.
candlepin=# select id, productid from cp_pool LIMIT 1;
                id                |            productid            
----------------------------------+---------------------------------
 402882e760e558670160e558bc2220cb | non-stacked-8core4ram-multiattr
(1 row)

# Update the productid to any string that will not match an existing productid. 
candlepin=# update cp_pool set productid='doesnt_exist' where id = '402882e760e558670160e558bc2220cb';
UPDATE 1

$ cpdb --validate --verbose
SEVERE 11/01/18 9:13 AM: liquibase: db/changelog/changelog-validate.xml: db/changelog/20171108140157-perorgproducts-migration-validation.xml::20171108140157-1::crog:   Pool "402882e760e558670160e558bc2220cb" references an unresolvable product: doesnt_exist
SEVERE 11/01/18 9:13 AM: liquibase: db/changelog/changelog-validate.xml: db/changelog/20171108140157-perorgproducts-migration-validation.xml::20171108140157-1::crog: Org snowwhite (402882e760e558670160e55875f20004) failed data validation
INFO 11/01/18 9:13 AM: liquibase: db/changelog/changelog-validate.xml: db/changelog/20171108140157-perorgproducts-migration-validation.xml::20171108140157-1::crog: Validating data for org admin (402882e760e558670160e55875f20002) (2 of 3)
INFO 11/01/18 9:13 AM: liquibase: db/changelog/changelog-validate.xml: db/changelog/20171108140157-perorgproducts-migration-validation.xml::20171108140157-1::crog: Validating data for org donaldduck (402882e760e558670160e55875f20003) (3 of 3)
SEVERE 11/01/18 9:13 AM: liquibase: db/changelog/changelog-validate.xml: db/changelog/20171108140157-perorgproducts-migration-validation.xml::20171108140157-1::crog: Change Set db/changelog/20171108140157-perorgproducts-migration-validation.xml::20171108140157-1::crog failed.  Error: liquibase.exception.CustomChangeException: liquibase.exception.DatabaseException: One or more orgs failed data validation
DEBUG 11/01/18 9:13 AM: liquibase: db/changelog/20171108140157-perorgproducts-migration-validation.xml::20171108140157-1::crog: Release Database Lock
DEBUG 11/01/18 9:13 AM: liquibase: db/changelog/20171108140157-perorgproducts-migration-validation.xml::20171108140157-1::crog: Executing UPDATE database command: UPDATE public.databasechangeloglock SET LOCKED = FALSE, LOCKEDBY = NULL, LOCKGRANTED = NULL WHERE ID = 1
INFO 11/01/18 9:13 AM: liquibase: db/changelog/20171108140157-perorgproducts-migration-validation.xml::20171108140157-1::crog: Successfully released change log lock
Unexpected error running Liquibase: liquibase.exception.CustomChangeException: liquibase.exception.DatabaseException: One or more orgs failed data validation

SEVERE 11/01/18 9:13 AM: liquibase: db/changelog/20171108140157-perorgproducts-migration-validation.xml::20171108140157-1::crog: liquibase.exception.CustomChangeException: liquibase.exception.DatabaseException: One or more orgs failed data validation
liquibase.exception.MigrationFailedException: Migration failed for change set db/changelog/20171108140157-perorgproducts-migration-validation.xml::20171108140157-1::crog:
     Reason: liquibase.exception.UnexpectedLiquibaseException: liquibase.exception.CustomChangeException: liquibase.exception.DatabaseException: One or more orgs failed data validation
	at liquibase.changelog.ChangeSet.execute(ChangeSet.java:619)
	at liquibase.changelog.visitor.UpdateVisitor.visit(UpdateVisitor.java:51)
	at liquibase.changelog.ChangeLogIterator.run(ChangeLogIterator.java:79)
	at liquibase.Liquibase.update(Liquibase.java:214)
	at liquibase.Liquibase.update(Liquibase.java:192)
	at liquibase.integration.commandline.Main.doMigration(Main.java:1130)
	at liquibase.integration.commandline.Main.run(Main.java:188)
	at liquibase.integration.commandline.Main.main(Main.java:103)
Caused by: liquibase.exception.UnexpectedLiquibaseException: liquibase.exception.CustomChangeException: liquibase.exception.DatabaseException: One or more orgs failed data validation
	at liquibase.change.custom.CustomChangeWrapper.generateStatements(CustomChangeWrapper.java:185)
	at liquibase.database.AbstractJdbcDatabase.executeStatements(AbstractJdbcDatabase.java:1257)
	at liquibase.changelog.ChangeSet.execute(ChangeSet.java:582)
	... 7 more
Caused by: liquibase.exception.CustomChangeException: liquibase.exception.DatabaseException: One or more orgs failed data validation
	at org.candlepin.liquibase.LiquibaseCustomTaskWrapper.execute(LiquibaseCustomTaskWrapper.java:74)
	at liquibase.change.custom.CustomChangeWrapper.generateStatements(CustomChangeWrapper.java:180)
	... 9 more
Caused by: liquibase.exception.DatabaseException: One or more orgs failed data validation
	at org.candlepin.liquibase.PerOrgProductsMigrationValidationTask.execute(PerOrgProductsMigrationValidationTask.java:189)
	at org.candlepin.liquibase.LiquibaseCustomTaskWrapper.execute(LiquibaseCustomTaskWrapper.java:71)
	... 10 more

Comment 11 Justin Sherrill 2018-01-11 14:19:45 UTC
*** Bug 1512082 has been marked as a duplicate of this bug. ***

Comment 12 Kavita 2018-01-15 09:22:25 UTC
As per https://github.com/candlepin/candlepinproject.org/blob/master/docs/candlepin/database_validation.md document, following objects are checked:
    Pool => Product
    Subscription => Product
    Environment => Content

Command cpdb --validate runs validation checks and output errors as you shared above.

Are there any commands in cpdb only to fix these orphaned records related to above objects? 
so that we can just run those commands from foreman-maintain to fix candlepin DB.

Ideally, this would be great if we have option in cpdb command only to fix those
invalid records instead of doing it in foreman-maintain.

But if we don't have in cpdb and we need to provide steps to find and delete those orphaned records then
please confirm below details.

1. For bad products,

find -> SELECT cp.id, cp.productid 
        FROM cp_pool cp
        LEFT JOIN cp_product p ON p.id = cp.productid
        WHERE p.id IS NULL

delete -> delete from cp_pool where productid in (above_find_list)

2. For bad content,

find ->  SELECT e.id, ec.contentid
         FROM cp_environment e
         JOIN cp_env_content ec ON e.id = ec.environment_id
         LEFT JOIN cp_content c ON c.id = ec.contentid
         WHERE c.id IS NULL

delete -> delete from cp_env_content where contentid in (above_find_list)

3. For Subscription => Product, please confirm SQL queries.

Comment 13 Michael Stead 2018-01-16 20:32:39 UTC
There is currently no command available in cpdb to do an automatic clean up of these orphaned records.

My biggest concern about automating the deletion of these orphaned records is that something gets deleted when it should not have been (perhaps an orphan from a bug that we currently do not know about). We could end up in a state where data doesn't make sense and we have no idea why.

While I understand the reason why this is wanted, I do have my reservations.

Comment 14 Michael Stead 2018-01-16 20:38:05 UTC
In cases where the validation fails, while not ideal, I feel that we need to at least understand what data is orphaned. If there are patterns, say from an existing bug that has now been fixed, and that we can understand, then I'd be comfortable with perhaps automating that or providing manual clean up scripts.

Comment 15 Chris "Ceiu" Rog 2018-01-17 16:44:37 UTC
Unfortunately, it's not an easy thing to automate. The nature of this issue is that some of these products or content objects that are being referenced could be important (in which case, we have a bigger issue to address), or they could be artifacts from an incomplete cleanup, or they could have been replaced by newer instances, but the reference was not updated accordingly.

Since we don't know the importance of the missing objects, I'm reluctant to recommend automating the cleanup. Whenever this has come up, we've generally advised to:

(a) Record the ID of the missing object (product or content)
(b) Record the ID of referrer (pool, environment, etc.)
(c) Try to identify the importance missing object (does the ID look familiar? If so, is it something the org should still have?)
(d) Manually cleanup/correct the reference (will significantly vary)


Using Michael's example output from the validate task above, we would be dealing with a pool with ID "402882e760e558670160e558bc2220cb" referencing the product "doesnt_exist". Such a product doesn't seem terribly important, so we can just delete the reference... maybe. Now we have more questions: Is the product being referenced as the pool's SKU (cp_pool.product_id)? Is it a derived SKU (cp_pool.derivedproductid)? Or is it an engineering product (cp_pool_prov_prods)?

In the latter case, you can just delete the entries from the cp_pool_prov_prods table. But in the former two cases, now we need to delete the whole pool or update it to point to a valid product. If we're to delete the pool, what does that pool provide? Is it important to the org? What consumers are entitled to it? If we delete it, are we going to make consumers non-compliant?

I could keep going, but I feel I've made the point clear: for a given missing reference, there is no 100% safe instructions that can be blindly followed. Each step needs to be considered before any destructive action (such as a deletion) is performed. As with many of these types of things, I fully expect the majority to be simple, mindless deletes with no side effects to worry about. However, I can pretty much guarantee there will be an edge case or two that will burn us if we aren't careful here.
 


As far as identifying the bad references, the following queries can be used.

-- Environment content references
SELECT e.owner_id, e.id, ec.contentid
  FROM cp_environment e
  JOIN cp_env_content ec ON e.id = ec.environment_id
  LEFT JOIN cp_content c ON c.id = ec.contentid
  WHERE c.id IS NULL;


-- Pool product references
SELECT pool.owner_id, pool.id, pool.productid
  FROM cp_pool pool
  LEFT JOIN cp_product p ON p.id = pool.productid
  WHERE p.id IS NULL;

SELECT pool.owner_id, pool.id, pool.derivedproductid
  FROM cp_pool pool
  LEFT JOIN cp_product p ON p.id = pool.derivedproductid
  WHERE p.id IS NULL;

SELECT pool.owner_id, pool.id, pp.product_id
  FROM cp_pool pool
  JOIN cp_pool_products pp ON pp.pool_id = pool.id
  LEFT JOIN cp_product p ON p.id = pp.product_id
  WHERE p.id IS NULL;


-- Subscription product references
SELECT sub.owner_id, sub.id, sub.product_id
  FROM cp_subscription sub
  LEFT JOIN cp_product p ON p.id = sub.product_id
  WHERE p.id IS NULL;

SELECT sub.owner_id, sub.id, sub.derivedproduct_id
  FROM cp_subscription sub
  LEFT JOIN cp_product p ON p.id = sub.derivedproduct_id
  WHERE p.id IS NULL;

SELECT sub.owner_id, sub.id, sp.product_id
  FROM cp_subscription sub
  JOIN cp_subscription_products sp ON sp.subscription_id = sub.id
  LEFT JOIN cp_product p ON p.id = sp.product_id
  WHERE p.id IS NULL;

SELECT sub.owner_id, sub.id, p.product_id
  FROM cp_subscription sub
  JOIN cp_sub_derivedprods sdp ON sdp.subscription_id = sub.id
  LEFT JOIN cp_product p ON p.id = sdp.product_id
  WHERE p.id IS NULL;

Comment 16 Mike McCune 2018-01-29 18:28:58 UTC
For now we need to hold off on automatic remediation and instead report and direct users to open support tickets when the data is marked as corrupted.

All we need in foreman-maintain for now is to identify failures with the --validate run and show the details so tickets can be opened.

Comment 17 Mike McCune 2018-02-01 18:03:52 UTC
*** Bug 1515614 has been marked as a duplicate of this bug. ***

Comment 18 Satellite Program 2018-02-08 19:28:05 UTC
Moving this bug to POST for triage into Satellite 6 since the upstream issue http://projects.theforeman.org/issues/21927 has been resolved.

Comment 27 Kavita 2018-03-02 09:16:39 UTC
Just for reference, 

This is BZ - https://bugzilla.redhat.com/show_bug.cgi?id=1545660 that opened for better failure message under Candlepin component.

Currently, it is difficult to find out cause of failure from output messages.

Comment 28 Ivan Necas 2018-03-02 09:55:51 UTC
I'm closing this BZ, as the ask of it was to implement the validation check, so that we don't run into the migration issues once the upgrade itself starts. Besides the BZ Kavita mentioned, I've also opened additional BZ for foreman-maintain to enhance the check to provide additional steps to resolve known issues that cause the validation to fail: https://bugzilla.redhat.com/show_bug.cgi?id=1550924