Bug 997769

Summary: oo-admin-upgrade breaks when rerun for gears that failed in a previous upgrade
Product: OpenShift Online
Reporter: Jianwei Hou <jhou>
Component: Containers
Assignee: Dan Mace <dmace>
Status: CLOSED CURRENTRELEASE
QA Contact: libra bugs <libra-bugs>
Severity: medium
Priority: medium
Version: 2.x
CC: dmace, jhou
Doc Type: Bug Fix
Last Closed: 2013-08-29 12:51:47 UTC
Type: Bug
Bug Blocks: 991543    

Description Jianwei Hou 2013-08-16 07:58:11 UTC
Description of problem:
After oo-admin-upgrade is run and some gears fail to upgrade, executing the command again is broken.
The upgrade cannot continue while /tmp/oo-upgrade/node_queue is present; however, if /tmp/oo-upgrade/node_queue is deleted, oo-admin-upgrade upgrades all gears again instead of retrying only the failed ones.

Version-Release number of selected component (if applicable):
On devenv_3660

How reproducible:
Always

Steps to Reproduce:
1. Prepare data, upgrade the instance to the latest release, and upgrade gears with oo-admin-upgrade; some gears fail to upgrade.
oo-admin-upgrade upgrade-node --upgrade-node ip-10-151-21-209 --version 2.0.32 --ignore-cartridge-version
2. Try to run the upgrade program again
oo-admin-upgrade upgrade-node --upgrade-node ip-10-151-21-209 --version 2.0.32 --ignore-cartridge-version



Actual results:
[root@ip-10-151-21-209 oo-upgrade]# oo-admin-upgrade upgrade-node --upgrade-node ip-10-151-21-209 --version 2.0.32 --ignore-cartridge-version
Upgrader started with options: {:version=>"2.0.32", :ignore_cartridge_version=>true, :target_server_identity=>"ip-10-151-21-209", :upgrade_position=>1, :num_upgraders=>1, :max_threads=>12, :gear_whitelist=>[]}
Building new upgrade queues and cluster metadata
Node queue file already exists at /tmp/oo-upgrade/node_queue
/usr/sbin/oo-admin-upgrade:381:in `create_upgrade_queues'
/usr/sbin/oo-admin-upgrade:251:in `upgrade'
/usr/sbin/oo-admin-upgrade:999:in `block in upgrade_node'
/usr/sbin/oo-admin-upgrade:928:in `with_upgrader'
/usr/sbin/oo-admin-upgrade:988:in `upgrade_node'
/opt/rh/ruby193/root/usr/share/gems/gems/thor-0.15.4/lib/thor/task.rb:27:in `run'
/opt/rh/ruby193/root/usr/share/gems/gems/thor-0.15.4/lib/thor/invocation.rb:120:in `invoke_task'
/opt/rh/ruby193/root/usr/share/gems/gems/thor-0.15.4/lib/thor.rb:275:in `dispatch'
/opt/rh/ruby193/root/usr/share/gems/gems/thor-0.15.4/lib/thor/base.rb:425:in `start'
/usr/sbin/oo-admin-upgrade:1004:in `<main>'
/usr/sbin/oo-admin-upgrade:381:in `create_upgrade_queues': Node queue file already exists at /tmp/oo-upgrade/node_queue (RuntimeError)
	from /usr/sbin/oo-admin-upgrade:251:in `upgrade'
	from /usr/sbin/oo-admin-upgrade:999:in `block in upgrade_node'
	from /usr/sbin/oo-admin-upgrade:928:in `with_upgrader'
	from /usr/sbin/oo-admin-upgrade:988:in `upgrade_node'
	from /opt/rh/ruby193/root/usr/share/gems/gems/thor-0.15.4/lib/thor/task.rb:27:in `run'
	from /opt/rh/ruby193/root/usr/share/gems/gems/thor-0.15.4/lib/thor/invocation.rb:120:in `invoke_task'
	from /opt/rh/ruby193/root/usr/share/gems/gems/thor-0.15.4/lib/thor.rb:275:in `dispatch'
	from /opt/rh/ruby193/root/usr/share/gems/gems/thor-0.15.4/lib/thor/base.rb:425:in `start'
	from /usr/sbin/oo-admin-upgrade:1004:in `<main>'

Expected results:
The program should pick up the failed gears and re-run the upgrade against them.

Additional info:

Comment 1 Dan Mace 2013-08-16 18:28:45 UTC
Meng,

With the new oo-admin-upgrade tool, you should never manually manipulate or delete the files in /tmp/oo-upgrade; if you need to start over from scratch, use `oo-admin-upgrade archive`, which archives the contents of /tmp/oo-upgrade to /tmp/oo-upgrade/archive_{timestamp}.

That said, I need a little more information for this test case. Can you try the following:

1. Start over (`oo-admin-upgrade archive`)
2. Run your first upgrade (where errors are expected)
3. Make a tarball of /tmp/oo-upgrade (e.g. upgrade-step-1.tar.gz)
4. Run your second upgrade (that you see failing)
5. Make another tarball of /tmp/oo-upgrade (e.g. upgrade-step-2.tar.gz)

Then attach both tarballs and the stdout of both oo-admin-upgrade runs to this issue so I can inspect the before and after state of the data files.
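
A rough sketch of those steps as shell commands, using only the commands already shown in this report plus standard tar; the stdout capture filenames (upgrade-run-1.stdout, upgrade-run-2.stdout) are placeholder names for illustration:

# 1. Start over: archive the current contents of /tmp/oo-upgrade
oo-admin-upgrade archive
# 2. First upgrade run (errors expected), capturing stdout
oo-admin-upgrade upgrade-node --upgrade-node ip-10-151-21-209 --version 2.0.32 --ignore-cartridge-version | tee upgrade-run-1.stdout
# 3. Snapshot the upgrade data directory
tar czf upgrade-step-1.tar.gz -C /tmp oo-upgrade
# 4. Second upgrade run (the one that fails), capturing stdout
oo-admin-upgrade upgrade-node --upgrade-node ip-10-151-21-209 --version 2.0.32 --ignore-cartridge-version | tee upgrade-run-2.stdout
# 5. Snapshot the upgrade data directory again
tar czf upgrade-step-2.tar.gz -C /tmp oo-upgrade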

Thanks.

Comment 2 Dan Mace 2013-08-16 18:29:20 UTC
Comment #1 was addressed to Hou, sorry!

Comment 3 Jianwei Hou 2013-08-19 11:17:59 UTC
This is not reproducible this time: when the first migration run left some gears with errors, running it a second time processed only the failed gears.

If there is ever any problem again, I'll be sure to attach all logs, thanks!

Comment 4 Jianwei Hou 2013-08-20 07:48:43 UTC
Moving to verified according to comment 3