Red Hat Bugzilla – Bug 812499
Unable to remove "Red Hat Content Provider" if something goes wrong
Last modified: 2015-07-13 00:35:17 EDT
Created attachment 577437
Screenshot showing the error message and missing view contents.
Description of problem:
A CloudForms SE server suffered a power outage while it was adding a RH Content Provider manifest.
After the box restarted (filesystem recovery was ok), the normal contents are missing whenever the user goes to the "Red Hat Content Provider" page. The page shows only a "500 Internal Server Error" instead.
This seems to leave no way, through the web UI, to remove a (likely) broken RH Content Provider definition.
Version-Release number of selected component (if applicable):
It's a recent puddle build:
$ rpm -qa | grep -i katello
Steps to Reproduce:
1. Begin adding a RH Content Provider manifest... whilst in the installation process (waiting for Katello to finish processing the manifest), suffer a power outage.
2. Restart the box.
3. Go to the RH Content Providers tab. It shows "500 Internal Server Error" instead of the normal contents.
Normal RH Content Provider tab contents should be there.
I took a snapshot of the disk for this BZ, so can extract logs or whatever if needed.
Nice bug! The concerning thing with this bug is there is no workaround.
Escalating as a blocker for visibility. If we can identify a workaround, I support resolving this in a future release, and adding a 1.0 release note.
FYI, you cannot delete the Red Hat Provider. It is baked into every org in CFSE and is created at org creation time.
You will have to reset your database using:
or restore from backup.
There is no way to remove the Red Hat Content Provider, even during normal operations. Since it is hard to predict the state of your database after an outage like the one you experienced, it would be hard to know exactly what it would take to correct the situation your DB is in.
In 1.1 we could look into how to handle this type of situation better but there isn't a whole lot we can do for 1.0.*
For 1.1 we should look into better transaction management for long-running jobs, so that they can roll back and recover from situations like this, where there is a power outage or some massive breakage during execution.
Let's investigate how to clean up and recover broken data.
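To illustrate the "roll back and recover" idea, here is a minimal sketch using Python's built-in sqlite3 module. It is not Katello code: the `repos` table, the `import_manifest` function, and the `SimulatedOutage` exception are all hypothetical, and an exception only stands in for a real outage (after a real power loss, the database itself discards the uncommitted transaction on recovery). It also covers only a single local database, not state held in other services.

```python
import sqlite3


class SimulatedOutage(Exception):
    """Stands in for a crash or power loss partway through the job."""


def import_manifest(conn, provider_id, repo_names, fail_after=None):
    """Record every repo from a manifest in one transaction (hypothetical schema)."""
    try:
        # "with conn" commits if the block succeeds and rolls back if it raises.
        with conn:
            for count, name in enumerate(repo_names):
                if fail_after is not None and count >= fail_after:
                    raise SimulatedOutage(name)
                conn.execute(
                    "INSERT INTO repos (provider_id, name) VALUES (?, ?)",
                    (provider_id, name),
                )
        return True
    except SimulatedOutage:
        return False


def repo_count(conn):
    """Number of repo rows currently stored."""
    return conn.execute("SELECT COUNT(*) FROM repos").fetchone()[0]


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE repos (provider_id INTEGER, name TEXT)")

# First attempt is interrupted after one row; nothing should be left behind.
interrupted = import_manifest(conn, 1, ["rhel-6-server", "rhel-6-optional"], fail_after=1)
after_failure = repo_count(conn)

# Because the failed attempt left no partial rows, the import can simply be retried.
completed = import_manifest(conn, 1, ["rhel-6-server", "rhel-6-optional"])
after_retry = repo_count(conn)
```

The point of the design is that an interrupted import leaves the table exactly as it was, so the admin can rerun the manifest import rather than clean up half-written data.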
If there's a way to reset (just) the RH Provider (without everything else), that might make for a decent workaround. The admin would just need to do (in this case) the manifest import again.
Though, it kind of sounds like this would be future work, with Mike's mentioned "more transactional" approach being a more complete (likely better) goal. ;)
There is one simple script/tool which compares the repositories in Katello with those in Pulp and prints what needs to be deleted to put Katello and Pulp back in sync. You can run it like this:
We can extend this script if the output is not helpful so GSS can take actions when this happens. Please note this tool is not documented and it is not intended to be used by users.
Ping me if it does not work or makes no sense to you, and I can investigate the box directly and extend this script to cover this special case.
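The tool itself is not shown in this report, so the following is only a rough sketch of the kind of comparison described, assuming the repository IDs from each side have already been fetched into plain lists. The function name `find_out_of_sync` and the repo IDs are made up for illustration; how the lists are obtained from Katello and Pulp is not covered.

```python
def find_out_of_sync(katello_repo_ids, pulp_repo_ids):
    """Return the repo IDs that exist on one side but not the other."""
    katello = set(katello_repo_ids)
    pulp = set(pulp_repo_ids)
    return {
        "only_in_katello": sorted(katello - pulp),
        "only_in_pulp": sorted(pulp - katello),
    }


# Made-up IDs: "repo-a" is known only to Katello, "repo-d" only to Pulp.
report = find_out_of_sync(
    ["repo-a", "repo-b", "repo-c"],
    ["repo-b", "repo-c", "repo-d"],
)

for side, repo_ids in report.items():
    for repo_id in repo_ids:
        print(f"{side}: {repo_id}")
```

Entries listed under either key are candidates for cleanup; deciding which side is authoritative would still need someone looking at the particular box.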
I can't reproduce. There is no general advice on how to recover: there can be so many states during things like a manifest import, and the result depends on when the power failure hits. It can be data inconsistency in Candlepin or Pulp or both.
I really cannot investigate all the possibilities. We need to improve our orchestration code and totally change our approach to orchestration. If you encounter any data inconsistency, we need access to the box to investigate the particular case.
So the general advice in this case is: back up, and recover from the backup.
Together with org deletion, this is still relevant, but we are changing our orchestration layer and this should be re-evaluated after the migration is done.
Providers have been hidden. This is no longer relevant.