Bug 821219

Summary: prevent executing upgrade when asynchronous tasks are still running
Product: [Retired] oVirt Reporter: Eli Mesika <emesika>
Component: ovirt-engine-installer Assignee: Moran Goldboim <mgoldboi>
Status: CLOSED CURRENTRELEASE QA Contact:
Severity: high Docs Contact:
Priority: high    
Version: unspecified CC: acathrow, alourie, dyasny, hateya, iheim, mgoldboi, ykaul
Target Milestone: ---   
Target Release: ---   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard: integration infra
Fixed In Version: Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2013-01-23 21:38:29 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:

Description Eli Mesika 2012-05-13 08:30:24 UTC
Description of problem:
Since asynchronous task information is persisted to the async_tasks table, and compensation data (used for rollbacks) is persisted to the business_entity_snapshot table, we should avoid upgrading the system while asynchronous tasks are still running.
The reason is that those tables hold binary data representing business entities and command parameters. Since those may change from version to version, any attempt to restore an object from its old binary representation can cause the system to crash.
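As a rough illustration of why stale binary data is fragile, here is a small Python sketch. It is an analogy only: the engine is Java code and its real serialization format is not shown in this report, so both layouts and the helper names below are invented. A record is packed with a hypothetical old layout and then read back with a layout that gained one field:

```python
import struct

# Hypothetical layouts, invented for illustration only.
OLD_LAYOUT = "<IH"   # version N: task id (uint32), status (uint16)
NEW_LAYOUT = "<IHQ"  # version N+1: same fields plus a new uint64 field


def restore(blob: bytes, layout: str) -> tuple:
    """Decode a persisted record using the layout the running code expects."""
    return struct.unpack(layout, blob)


def try_restore(blob: bytes, layout: str):
    """Return the decoded record, or None if the blob does not fit the layout."""
    try:
        return restore(blob, layout)
    except struct.error:
        # The stored bytes were written by older code and no longer match.
        return None


# A record persisted before the upgrade, using the old layout.
persisted = struct.pack(OLD_LAYOUT, 42, 7)
```

Reading `persisted` with `OLD_LAYOUT` succeeds, while reading it with `NEW_LAYOUT` fails because the buffer is shorter than the new layout requires.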
Version-Release number of selected component (if applicable):


How reproducible:
Upgrade the system in the middle of an asynchronous task, when the upgrade changes some of the objects stored in the above tables (for example: a field was added to a business entity).


Steps to Reproduce:
1.
2.
3.
  
Actual results:

Upgrades done while asynchronous tasks are still running may lead to a crash in core code that prevents JBoss from running (since compensation data is checked at application startup).

Expected results:

The installer should check whether there are any records of running asynchronous tasks in the async_tasks or business_entity_snapshot tables, and ask the user to wait for task completion if they try to upgrade the system while asynchronous tasks are still running.
Since a crash may leave stale junk in those tables, the installer should have a check-box labelled "Force delete asynchronous tasks meta-data".
If the user marks this check-box, the database upgrade script will run with a -c flag that forces cleanup of the async_tasks and business_entity_snapshot tables.
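The behaviour requested above can be sketched as a small decision function. This is a sketch under stated assumptions, not the installer's actual code: the `Action` names and the `decide_upgrade` function are hypothetical, and the row counts are assumed to have been read from the two tables beforehand.

```python
from enum import Enum


class Action(Enum):
    PROCEED = "proceed"                        # both tables are empty
    WARN_AND_WAIT = "warn_and_wait"            # ask the admin to wait, or to force
    PROCEED_WITH_CLEANUP = "proceed_cleanup"   # run the DB upgrade script with -c


def decide_upgrade(async_tasks: int, snapshots: int, force_cleanup: bool) -> Action:
    """Pick the pre-upgrade action from the row counts and the check-box state."""
    if async_tasks < 0 or snapshots < 0:
        raise ValueError("row counts cannot be negative")
    if async_tasks == 0 and snapshots == 0:
        return Action.PROCEED
    if force_cleanup:
        # The admin accepted the risk, so the leftover rows are cleaned up.
        return Action.PROCEED_WITH_CLEANUP
    return Action.WARN_AND_WAIT
```

Note that the admin is never blocked in this sketch: marking the check-box always leads to an upgrade path.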

Additional info:
You can check whether the async_tasks and business_entity_snapshot tables are empty with:
> echo "select (select count(*) from async_tasks) + (select count(*) from business_entity_snapshot);" | psql -U <user> --pset=tuples_only=on <database>
This returns 0 only if both tables are empty. (Counting over a plain comma join of the two tables would also return 0 when just one of them is empty.)
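The same check can be sketched in Python. The sketch uses an in-memory SQLite database as a stand-in for the engine's PostgreSQL database, and the column names are made up; only the two table names come from this report. It sums the per-table counts, and also runs a comma-join count for comparison, since a comma join multiplies the row counts of the two tables:

```python
import sqlite3


def pending_rows(conn: sqlite3.Connection) -> int:
    """Total rows across both tables; 0 only when both are empty."""
    query = (
        "SELECT (SELECT count(*) FROM async_tasks)"
        " + (SELECT count(*) FROM business_entity_snapshot)"
    )
    return conn.execute(query).fetchone()[0]


def cross_join_rows(conn: sqlite3.Connection) -> int:
    """Row count of the comma (cross) join: the product of both table sizes."""
    query = "SELECT count(*) FROM business_entity_snapshot, async_tasks"
    return conn.execute(query).fetchone()[0]


# Stand-in schema; the single columns are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE async_tasks (task_id TEXT)")
conn.execute("CREATE TABLE business_entity_snapshot (id TEXT)")

# One running task, no compensation snapshots.
conn.execute("INSERT INTO async_tasks VALUES ('t1')")
```

With one task row and no snapshot rows, the summed count reports the pending row, whereas the cross-join count is 0 because one side of the product is empty.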

Comment 1 Itamar Heim 2012-05-13 11:10:10 UTC
You can't tell an administrator who has a scheduled downtime window for the upgrade that they can't do it for a few hours.
It is fine to warn the user about this, and to cancel the tasks if needed, but we can't have the admin blocked from performing an upgrade.

Comment 2 Eli Mesika 2012-05-13 13:16:46 UTC
(In reply to comment #1)
> you can't tell an administrator, who has a scheduled window of downtime to
> perform the upgrade that they can't do it for a few hours.
> it is fine to warn user about this, cancel them if needed, but we can't have
> the admin blocked from performing an upgrade.

As stated above, it is just a recommendation, and the admin is able to mark the check-box immediately and run the upgrade without any wait.
If the upgrade is done while async tasks are active and business entities or commands have changed, JBoss will not start, because the core code will try to restore objects from the binary representation stored in those tables. This is really dangerous, and I don't even know whether we have a simple recovery procedure for such cases.
So it is suggested as a recommendation only: the admin can force the upgrade to run whenever he likes via the additional check-box described above.

Comment 3 Haim 2012-05-13 13:23:52 UTC
(In reply to comment #2)
> (In reply to comment #1)
> > you can't tell an administrator, who has a scheduled window of downtime to
> > perform the upgrade that they can't do it for a few hours.
> > it is fine to warn user about this, cancel them if needed, but we can't have
> > the admin blocked from performing an upgrade.
> 
> As stated above, it is just a recommendation and the admin is able to check the
> check-box immediately and run the upgrade without any wait.
> If the upgrade is done when async tasks are active and business entities or
> commands are changed , the result will be that the JBoss will not start because
> the core code will try to restore objects from the binary representation stored
> in those table and this is really dangerous and I don't know even if we have
> simple recover procedure for such cases.
> So, it is suggested as a recommendation when admin can force upgrade to occur
> whenever he likes by the additional checkbox described above.

Just had this incident today. Current behavior is not good, as upgrade.sh fails brutally.

Comment 4 Eli Mesika 2012-05-13 14:42:04 UTC
(In reply to comment #3)
> (In reply to comment #2)
> > (In reply to comment #1)
> > > you can't tell an administrator, who has a scheduled window of downtime to
> > > perform the upgrade that they can't do it for a few hours.
> > > it is fine to warn user about this, cancel them if needed, but we can't have
> > > the admin blocked from performing an upgrade.
> > 
> > As stated above, it is just a recommendation and the admin is able to check the
> > check-box immediately and run the upgrade without any wait.
> > If the upgrade is done when async tasks are active and business entities or
> > commands are changed , the result will be that the JBoss will not start because
> > the core code will try to restore objects from the binary representation stored
> > in those table and this is really dangerous and I don't know even if we have
> > simple recover procedure for such cases.
> > So, it is suggested as a recommendation when admin can force upgrade to occur
> > whenever he likes by the additional checkbox described above.
> 
> just had this incident today, current behavior is not good as upgrade.sh fails
> brutally.

I don't think the problem is in upgrade.sh; in most cases it will complete successfully. However, starting JBoss invokes the Backend compensation handling, which assumes that if those tables are not clean we should try to roll back the failing command. Since the binary representation of the objects in the DB will no longer match the actual objects (because a code change modified the structure of some objects), this operation will fail, causing the application not to start at all, even after a successful upgrade.
This has already occurred several times in QE and has a chance of happening at customer sites as well.
The patch does not affect the downtime needed for an upgrade. Rather, in the pre-upgrade step it recommends when an upgrade is relatively safe, and lets the user ignore the warning at his own risk.

Comment 5 Alex Lourie 2013-01-23 21:38:29 UTC
The code that prevents upgrading when there are async tasks in the system was merged to 3.1.