Bug 1520326
Summary: | [GA] Upgrade step: call cpdb --validate | ||
---|---|---|---|
Product: | Red Hat Satellite | Reporter: | Lukas Zapletal <lzap> |
Component: | Satellite Maintain | Assignee: | Kavita <kgaikwad> |
Status: | CLOSED CURRENTRELEASE | QA Contact: | Katello QA List <katello-qa-list> |
Severity: | medium | Docs Contact: | |
Priority: | unspecified | ||
Version: | 6.3.0 | CC: | apatel, bbuckingham, brcoca, brubisch, cdonnell, cmarinea, crog, inecas, jsherril, karnsing, kgaikwad, ktordeur, lzap, mmccune, mmithaiw, mstead, pdudley, peter.vreman, sabnave, smane, sthirugn |
Target Milestone: | Unspecified | Keywords: | FieldEngineering, PrioBumpField, PrioBumpGSS, PrioBumpQA, Triaged |
Target Release: | Unused | ||
Hardware: | Unspecified | ||
OS: | Unspecified | ||
Whiteboard: | |||
Fixed In Version: | Doc Type: | If docs needed, set a value | |
Doc Text: | Story Points: | --- | |
Clone Of: | Environment: | ||
Last Closed: | 2018-03-02 09:55:51 UTC | Type: | Bug |
Regression: | --- | Mount Type: | --- |
Documentation: | --- | CRM: | |
Verified Versions: | Category: | --- | |
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
Cloudforms Team: | --- | Target Upstream Version: | |
Embargoed: | |||
Bug Depends On: | 1519904, 1520319, 1530686 | ||
Bug Blocks: | 1122832, 1496794, 1533259 |
Description
Lukas Zapletal
2017-12-04 09:26:29 UTC
*** Bug 1520261 has been marked as a duplicate of this bug. *** CORRECTION: Please call this: /usr/share/candlepin/cpdb --validate --verbose The extra flag will increase verbosity in logs. @Lukas, I have added check which will run cpdb --validate command only if '--validate' option exists for cpdb otherwise download scripts and run below command as mentioned in the document - https://github.com/candlepin/candlepinproject.org/blob/master/docs/candlepin/database_validation.md liquibase --classpath=/usr/share/java/postgresql-jdbc.jar:<classpath> --changeLogFile=changelog-validate.xml --driver=org.postgresql.Driver --url=jdbc:postgresql:candlepin --username=<db_username> --password=<db_password> --logLevel=severe migrate I would like to test failure scenario i.e which returns a non-zero exit code. For this, do you know what DB changes I will need to add manually for testing purpose. I have tried to add some dummy record in cp_pool table with invalid product_id on my sat 6.2. In this case also liquibase command gives me zero exit code with message - Liquibase Update Successful. I was not able to check cpdb --validate command as --validate option is not there. I guess the needinfo should go to Michael Just an idea - modify the script and return non-zero to simulate a failure... To test the error case you can modify the candlepin database directly to make a pool reference a product that doesn't exist. Here's how I did it on my test candlepin instance: # Find a pool record. candlepin=# select id, productid from cp_pool LIMIT 1; id | productid ----------------------------------+--------------------------------- 402882e760e558670160e558bc2220cb | non-stacked-8core4ram-multiattr (1 row) # Update the productid to any string that will not match an existing productid. candlepin=# update cp_pool set productid='doesnt_exist' where id = '402882e760e558670160e558bc2220cb'; UPDATE 1 $ cpdb --validate --verbose SEVERE 11/01/18 9:13 AM: liquibase: db/changelog/changelog-validate.xml: db/changelog/20171108140157-perorgproducts-migration-validation.xml::20171108140157-1::crog: Pool "402882e760e558670160e558bc2220cb" references an unresolvable product: doesnt_exist SEVERE 11/01/18 9:13 AM: liquibase: db/changelog/changelog-validate.xml: db/changelog/20171108140157-perorgproducts-migration-validation.xml::20171108140157-1::crog: Org snowwhite (402882e760e558670160e55875f20004) failed data validation INFO 11/01/18 9:13 AM: liquibase: db/changelog/changelog-validate.xml: db/changelog/20171108140157-perorgproducts-migration-validation.xml::20171108140157-1::crog: Validating data for org admin (402882e760e558670160e55875f20002) (2 of 3) INFO 11/01/18 9:13 AM: liquibase: db/changelog/changelog-validate.xml: db/changelog/20171108140157-perorgproducts-migration-validation.xml::20171108140157-1::crog: Validating data for org donaldduck (402882e760e558670160e55875f20003) (3 of 3) SEVERE 11/01/18 9:13 AM: liquibase: db/changelog/changelog-validate.xml: db/changelog/20171108140157-perorgproducts-migration-validation.xml::20171108140157-1::crog: Change Set db/changelog/20171108140157-perorgproducts-migration-validation.xml::20171108140157-1::crog failed. Error: liquibase.exception.CustomChangeException: liquibase.exception.DatabaseException: One or more orgs failed data validation DEBUG 11/01/18 9:13 AM: liquibase: db/changelog/20171108140157-perorgproducts-migration-validation.xml::20171108140157-1::crog: Release Database Lock DEBUG 11/01/18 9:13 AM: liquibase: db/changelog/20171108140157-perorgproducts-migration-validation.xml::20171108140157-1::crog: Executing UPDATE database command: UPDATE public.databasechangeloglock SET LOCKED = FALSE, LOCKEDBY = NULL, LOCKGRANTED = NULL WHERE ID = 1 INFO 11/01/18 9:13 AM: liquibase: db/changelog/20171108140157-perorgproducts-migration-validation.xml::20171108140157-1::crog: Successfully released change log lock Unexpected error running Liquibase: liquibase.exception.CustomChangeException: liquibase.exception.DatabaseException: One or more orgs failed data validation SEVERE 11/01/18 9:13 AM: liquibase: db/changelog/20171108140157-perorgproducts-migration-validation.xml::20171108140157-1::crog: liquibase.exception.CustomChangeException: liquibase.exception.DatabaseException: One or more orgs failed data validation liquibase.exception.MigrationFailedException: Migration failed for change set db/changelog/20171108140157-perorgproducts-migration-validation.xml::20171108140157-1::crog: Reason: liquibase.exception.UnexpectedLiquibaseException: liquibase.exception.CustomChangeException: liquibase.exception.DatabaseException: One or more orgs failed data validation at liquibase.changelog.ChangeSet.execute(ChangeSet.java:619) at liquibase.changelog.visitor.UpdateVisitor.visit(UpdateVisitor.java:51) at liquibase.changelog.ChangeLogIterator.run(ChangeLogIterator.java:79) at liquibase.Liquibase.update(Liquibase.java:214) at liquibase.Liquibase.update(Liquibase.java:192) at liquibase.integration.commandline.Main.doMigration(Main.java:1130) at liquibase.integration.commandline.Main.run(Main.java:188) at liquibase.integration.commandline.Main.main(Main.java:103) Caused by: liquibase.exception.UnexpectedLiquibaseException: liquibase.exception.CustomChangeException: liquibase.exception.DatabaseException: One or more orgs failed data validation at liquibase.change.custom.CustomChangeWrapper.generateStatements(CustomChangeWrapper.java:185) at liquibase.database.AbstractJdbcDatabase.executeStatements(AbstractJdbcDatabase.java:1257) at liquibase.changelog.ChangeSet.execute(ChangeSet.java:582) ... 7 more Caused by: liquibase.exception.CustomChangeException: liquibase.exception.DatabaseException: One or more orgs failed data validation at org.candlepin.liquibase.LiquibaseCustomTaskWrapper.execute(LiquibaseCustomTaskWrapper.java:74) at liquibase.change.custom.CustomChangeWrapper.generateStatements(CustomChangeWrapper.java:180) ... 9 more Caused by: liquibase.exception.DatabaseException: One or more orgs failed data validation at org.candlepin.liquibase.PerOrgProductsMigrationValidationTask.execute(PerOrgProductsMigrationValidationTask.java:189) at org.candlepin.liquibase.LiquibaseCustomTaskWrapper.execute(LiquibaseCustomTaskWrapper.java:71) ... 10 more *** Bug 1512082 has been marked as a duplicate of this bug. *** As per https://github.com/candlepin/candlepinproject.org/blob/master/docs/candlepin/database_validation.md document, following objects are checked: Pool => Product Subscription => Product Environment => Content Command cpdb --validate runs validation checks and output errors as you shared above. Are there any commands in cpdb only to fix these orphaned records related to above objects? so that we can just run those commands from foreman-maintain to fix candlepin DB. Ideally, this would be great if we have option in cpdb command only to fix those invalid records instead of doing it in foreman-maintain. But if we don't have in cpdb and we need to provide steps to find and delete those orphaned records then please confirm below details. 1. For bad products, find -> SELECT cp.id, cp.productid FROM cp_pool cp LEFT JOIN cp_product p ON p.id = cp.productid WHERE p.id IS NULL delete -> delete from cp_pool where productid in (above_find_list) 2. For bad content, find -> SELECT e.id, ec.contentid FROM cp_environment e JOIN cp_env_content ec ON e.id = ec.environment_id LEFT JOIN cp_content c ON c.id = ec.contentid WHERE c.id IS NULL delete -> delete from cp_env_content where contentid in (above_find_list) 3. For Subscription => Product, please confirm SQL queries. There is currently no command available in cpdb to do an automatic clean up of these orphaned records. My biggest concern about automating the deletion of these orphaned records is that something gets deleted when it should not have been (perhaps an orphan from a bug that we currently do not know about). We could end up in a state where data doesn't make sense and we have no idea why. While I understand the reason why this is wanted, I do have my reservations. In cases where the validation fails, while not ideal, I feel that we need to at least understand what data is orphaned. If there are patterns, say from an existing bug that has now been fixed, and that we can understand, then I'd be comfortable with perhaps automating that or providing manual clean up scripts. Unfortunately, it's not an easy thing to automate. The nature of this issue is that some of these products or content objects that are being referenced could be important (in which case, we have a bigger issue to address), or they could be artifacts from an incomplete cleanup, or they could have been replaced by newer instances, but the reference was not updated accordingly. Since we don't know the importance of the missing objects, I'm reluctant to recommend automating the cleanup. Whenever this has come up, we've generally advised to: (a) Record the ID of the missing object (product or content) (b) Record the ID of referrer (pool, environment, etc.) (c) Try to identify the importance missing object (does the ID look familiar? If so, is it something the org should still have?) (d) Manually cleanup/correct the reference (will significantly vary) Using Michael's example output from the validate task above, we would be dealing with a pool with ID "402882e760e558670160e558bc2220cb" referencing the product "doesnt_exist". Such a product doesn't seem terribly important, so we can just delete the reference... maybe. Now we have more questions: Is the product being referenced as the pool's SKU (cp_pool.product_id)? Is it a derived SKU (cp_pool.derivedproductid)? Or is it an engineering product (cp_pool_prov_prods)? In the latter case, you can just delete the entries from the cp_pool_prov_prods table. But in the former two cases, now we need to delete the whole pool or update it to point to a valid product. If we're to delete the pool, what does that pool provide? Is it important to the org? What consumers are entitled to it? If we delete it, are we going to make consumers non-compliant? I could keep going, but I feel I've made the point clear: for a given missing reference, there is no 100% safe instructions that can be blindly followed. Each step needs to be considered before any destructive action (such as a deletion) is performed. As with many of these types of things, I fully expect the majority to be simple, mindless deletes with no side effects to worry about. However, I can pretty much guarantee there will be an edge case or two that will burn us if we aren't careful here. As far as identifying the bad references, the following queries can be used. -- Environment content references SELECT e.owner_id, e.id, ec.contentid FROM cp_environment e JOIN cp_env_content ec ON e.id = ec.environment_id LEFT JOIN cp_content c ON c.id = ec.contentid WHERE c.id IS NULL; -- Pool product references SELECT pool.owner_id, pool.id, pool.productid FROM cp_pool pool LEFT JOIN cp_product p ON p.id = pool.productid WHERE p.id IS NULL; SELECT pool.owner_id, pool.id, pool.derivedproductid FROM cp_pool pool LEFT JOIN cp_product p ON p.id = pool.derivedproductid WHERE p.id IS NULL; SELECT pool.owner_id, pool.id, pp.product_id FROM cp_pool pool JOIN cp_pool_products pp ON pp.pool_id = pool.id LEFT JOIN cp_product p ON p.id = pp.product_id WHERE p.id IS NULL; -- Subscription product references SELECT sub.owner_id, sub.id, sub.product_id FROM cp_subscription sub LEFT JOIN cp_product p ON p.id = sub.product_id WHERE p.id IS NULL; SELECT sub.owner_id, sub.id, sub.derivedproduct_id FROM cp_subscription sub LEFT JOIN cp_product p ON p.id = sub.derivedproduct_id WHERE p.id IS NULL; SELECT sub.owner_id, sub.id, sp.product_id FROM cp_subscription sub JOIN cp_subscription_products sp ON sp.subscription_id = sub.id LEFT JOIN cp_product p ON p.id = sp.product_id WHERE p.id IS NULL; SELECT sub.owner_id, sub.id, p.product_id FROM cp_subscription sub JOIN cp_sub_derivedprods sdp ON sdp.subscription_id = sub.id LEFT JOIN cp_product p ON p.id = sdp.product_id WHERE p.id IS NULL; For now we need to hold off on automatic remediation and instead report and direct users to open support tickets when the data is marked as corrupted. All we need in foreman-maintain for now is to identify failures with the --validate run and show the details so tickets can be opened. *** Bug 1515614 has been marked as a duplicate of this bug. *** Moving this bug to POST for triage into Satellite 6 since the upstream issue http://projects.theforeman.org/issues/21927 has been resolved. Just for reference, This is BZ - https://bugzilla.redhat.com/show_bug.cgi?id=1545660 that opened for better failure message under Candlepin component. Currently, it is difficult to find out cause of failure from output messages. I'm closing this BZ, as the ask of it was to implement the validation check, so that we don't run into the migration issues once the upgrade itself starts. Besides the BZ Kavita mentioned, I've also opened additional BZ for foreman-maintain to enhance the check to provide additional steps to resolve known issues that cause the validation to fail: https://bugzilla.redhat.com/show_bug.cgi?id=1550924 |