Bug 1001855

Summary: oo-admin-upgrade isn't upgrading all active gears first and then starting on idle gears...
Product: OpenShift Online
Reporter: Thomas Wiest <twiest>
Component: Containers
Assignee: Dan Mace <dmace>
Status: CLOSED CURRENTRELEASE
QA Contact: libra bugs <libra-bugs>
Severity: medium
Priority: medium
Version: 2.x
CC: bmeng, dmace
Target Milestone: ---
Target Release: ---
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2013-09-19 16:48:22 UTC
Type: Bug
Regression: ---

Description Thomas Wiest 2013-08-28 01:41:46 UTC
Description of problem:
We're running oo-admin-upgrade in PROD, and it's upgrading idle gears before all active gears have completed.

28 hosts on ex-srv1 haven't even started upgrading their active gears (nor their idle gears).

What looks like is happening is that the oo-admin-upgrade queues were scheduled to run a host's active gears, then its inactive gears, before moving on to the next host.

This, of course, means that the next host's active gears won't be scheduled until after the current host's inactive gears are finished, which is not what we want.

To be clear, last release this worked properly. The recent refactoring broke this.


Version-Release number of selected component (if applicable):
openshift-origin-broker-util-1.13.10-1.el6oso.noarch


How reproducible:
Very reproducible in PROD


Steps to Reproduce:
1. unknown


Actual results:
Active gears on later hosts aren't being upgraded until all gears (active and inactive) on the initial set of hosts are done.


Expected results:
All active gears on all hosts should be upgraded first, and only then should the upgrade move on to idle gears.

Comment 1 Thomas Wiest 2013-09-06 20:44:12 UTC
Abhishek, I'm not sure why you think this is a bug for the node team. It's a bug in the gear upgrade scheduling of oo-admin-upgrade.

Maybe I didn't explain it clearly, let me try again.

oo-admin-upgrade has a maximum number of nodes that it'll upgrade at a time. I think that number is 8. It used to be called "THREADS" but I can't find that now in the script.

So let's say we're upgrading 12 nodes: oo-admin-upgrade would chunk that into two groups. The first group would be the first 8 nodes, and the second group would be the final 4 nodes.
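The chunking described above can be sketched like this (illustrative only; the node names and the limit of 8 are assumptions from this comment, not the actual oo-admin-upgrade code):

```ruby
# Hypothetical sketch: 12 nodes chunked by a max concurrency of 8.
MAX_CONCURRENT_NODES = 8  # assumed limit, per the comment above

nodes = (1..12).map { |i| "node#{i}" }
groups = nodes.each_slice(MAX_CONCURRENT_NODES).to_a

puts groups.length     # => 2
puts groups[0].length  # => 8
puts groups[1].length  # => 4
```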

What's happening is that oo-admin-upgrade is upgrading the active gears on the first 8 nodes (which is correct), but then it's upgrading the inactive gears on the first 8 nodes _before_ it upgrades the active gears on the last 4 nodes. This is incorrect.

All active gears on all nodes should be upgraded _before_ inactive gears.
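The desired ordering can be sketched as a two-phase schedule (a minimal illustration, not the actual oo-admin-upgrade implementation; the node/gear names and the flat schedule structure are hypothetical):

```ruby
# Hypothetical two-phase scheduling: queue ALL nodes' active gears
# before ANY node's inactive gears.
nodes = {
  "node1" => { active: %w[a1 a2], inactive: %w[i1] },
  "node2" => { active: %w[a3],    inactive: %w[i2 i3] },
}

schedule = []

# Phase 1: active gears on every node go first.
nodes.each do |node, gears|
  gears[:active].each { |g| schedule << [node, g, :active] }
end

# Phase 2: only then are inactive gears queued, again across every node.
nodes.each do |node, gears|
  gears[:inactive].each { |g| schedule << [node, g, :inactive] }
end

# Verify: no inactive gear is scheduled before the last active gear.
last_active    = schedule.rindex { |_, _, state| state == :active }
first_inactive = schedule.index  { |_, _, state| state == :inactive }
puts last_active < first_inactive  # => true
```

The buggy behavior described in this report corresponds to nesting the two phases per node instead, so each node's inactive gears ran before the next node's active gears.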

This is how oo-admin-upgrade worked before its recent refactor.

Also, this problem is very bad in PROD where we have a lot of gears and nodes.

Moving back to broker.

Comment 2 Dan Mace 2013-09-06 20:53:24 UTC
Thomas,

It's a bug for the runtime team, as we own the oo-admin-upgrade script. The code itself is just poorly located currently (in a place which implies it's owned by the broker). I understand and acknowledge the bug, and will work on getting it fixed.

Thanks!

Comment 3 Thomas Wiest 2013-09-09 13:41:25 UTC
Oh, I see, sorry for the confusion. :)

Comment 4 Dan Mace 2013-09-10 22:35:30 UTC
https://github.com/openshift/origin-server/pull/3610


Please test multi-node setups including failures and re-runs with the same parameters to verify that they are corrected the second time around. If you have any questions about constructing scenarios, please get in touch directly. Thanks!

Comment 5 openshift-github-bot 2013-09-11 01:21:39 UTC
Commit pushed to master at https://github.com/openshift/origin-server

https://github.com/openshift/origin-server/commit/89019725ca61479cc13a7247b21a9b8cb989aa12
Bug 1001855: Process all active gears before inactive

Comment 6 Meng Bo 2013-09-11 08:25:52 UTC
Checked on devenv_3772 with multi-node and about 120 gears on it.

Upgrade with --max-threads=1

# oo-admin-upgrade upgrade-node --version 2.0.33 --ignore-cartridge-version --max-threads=1
Upgrader started with options: {:version=>"2.0.33", :ignore_cartridge_version=>true, :target_server_identity=>nil, :upgrade_position=>1, :num_upgraders=>1, :max_threads=>1, :gear_whitelist=>[]}
Building new upgrade queues and cluster metadata
Getting all active gears...
Getting all logins...
Writing 34 entries to gear queue for node ip-10-184-29-92 at /tmp/oo-upgrade/gear_queue_ip-10-184-29-92
Writing 21 entries to gear queue for node ip-10-184-29-92 at /tmp/oo-upgrade/gear_queue_ip-10-184-29-92
Writing 45 entries to gear queue for node ip-10-164-113-135 at /tmp/oo-upgrade/gear_queue_ip-10-164-113-135
Writing 20 entries to gear queue for node ip-10-164-113-135 at /tmp/oo-upgrade/gear_queue_ip-10-164-113-135


Tailing the upgrade log under /tmp/oo-upgrade shows that the inactive gears only start upgrading once the active gears are finished on all nodes.