Bug 1001855 - oo-admin-upgrade isn't upgrading all active gears first and then starting on idle gears...
Status: CLOSED CURRENTRELEASE
Alias: None
Product: OpenShift Online
Classification: Red Hat
Component: Containers
Version: 2.x
Hardware: Unspecified
OS: Unspecified
Priority: medium
Severity: medium
Target Milestone: ---
Assignee: Dan Mace
QA Contact: libra bugs
Reported: 2013-08-28 01:41 UTC by Thomas Wiest
Modified: 2015-05-14 23:27 UTC

Doc Type: Bug Fix
Last Closed: 2013-09-19 16:48:22 UTC



Description Thomas Wiest 2013-08-28 01:41:46 UTC
Description of problem:
We're running oo-admin-upgrade in PROD, and it's upgrading idle gears before all active gears have completed.

28 hosts on ex-srv1 haven't even started upgrading their active gears (nor their idle gears).

What appears to be happening is that the oo-admin-upgrade queues are scheduled to run the active gears, then the inactive gears, of the same host before moving on to the next host.

This, of course, means that the next host's active gears won't be scheduled until after the current host's inactive gears are finished, which is not what we want.

To be clear, last release this worked properly. The recent refactoring broke this.
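To make the broken ordering concrete, here's a minimal Ruby sketch of the scheduling described above (host names are hypothetical, not from the actual PROD run, and this is an illustration rather than the real oo-admin-upgrade code):

```ruby
# Hypothetical illustration of the observed (broken) scheduling: each
# host's inactive queue runs immediately after its active queue, before
# the next host's active queue has even started.
hosts = %w[ex-std-node1 ex-std-node2 ex-std-node3]

observed_order = hosts.flat_map { |h| [[h, :active], [h, :inactive]] }

observed_order.each { |host, kind| puts "#{host}: #{kind} gears" }
```

Note that ex-std-node1's inactive gears come second in the list, ahead of the active gears on every other host.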


Version-Release number of selected component (if applicable):
openshift-origin-broker-util-1.13.10-1.el6oso.noarch


How reproducible:
Very reproducible in PROD


Steps to Reproduce:
1. unknown


Actual results:
Active gears on later hosts aren't being upgraded until all gears (active and inactive) on the initial set of hosts are done.


Expected results:
All active gears should be upgraded first, then the upgrade should move on to idle gears.

Comment 1 Thomas Wiest 2013-09-06 20:44:12 UTC
Abhishek, Not sure why you think this is a bug for the node team. It's a bug in the gear upgrade scheduling of oo-admin-upgrade.

Maybe I didn't explain it clearly, let me try again.

oo-admin-upgrade has a maximum number of nodes that it'll upgrade at a time. I think that number is 8. It used to be called "THREADS" but I can't find that now in the script.

So let's say that we're upgrading 12 nodes, oo-admin-upgrade would chunk that into two groups, the first group would be the first 8 nodes, the 2nd group would be the final 4 nodes.

What's happening is that oo-admin-upgrade is upgrading the active gears on the first 8 nodes (which is correct), but then it's upgrading the inactive gears on the first 8 nodes _before_ it upgrades the active gears on the last 4 nodes. This is incorrect.

All active gears on all nodes should be upgraded _before_ inactive gears.
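The expected two-phase ordering can be sketched like this (a simplified Ruby illustration with made-up host names, not the actual oo-admin-upgrade implementation):

```ruby
# Hypothetical sketch of the expected scheduling: every host's active
# gears are upgraded before any host's inactive gears are touched.
hosts = %w[ex-std-node1 ex-std-node2 ex-std-node3]

expected_order = hosts.map { |h| [h, :active] } +
                 hosts.map { |h| [h, :inactive] }

expected_order.each { |host, kind| puts "#{host}: #{kind} gears" }
```

With this ordering, the first half of the queue is entirely active gears across all hosts, so node chunking only affects parallelism within each phase, not the active-before-inactive guarantee.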

This is how oo-admin-upgrade worked before its recent refactoring.

Also, this problem is very bad in PROD where we have a lot of gears and nodes.

Moving back to broker.

Comment 2 Dan Mace 2013-09-06 20:53:24 UTC
Thomas,

It's a bug for the runtime team as we own the oo-admin-upgrade script. The code itself is just poorly located currently (in a place which implies it's owned by the broker). I understand and acknowledge the bug, and will work on getting it fixed.

Thanks!

Comment 3 Thomas Wiest 2013-09-09 13:41:25 UTC
Oh, I see, sorry for the confusion. :)

Comment 4 Dan Mace 2013-09-10 22:35:30 UTC
https://github.com/openshift/origin-server/pull/3610


Please test multi-node setups including failures and re-runs with the same parameters to verify that they are corrected the second time around. If you have any questions about constructing scenarios, please get in touch directly. Thanks!

Comment 5 openshift-github-bot 2013-09-11 01:21:39 UTC
Commit pushed to master at https://github.com/openshift/origin-server

https://github.com/openshift/origin-server/commit/89019725ca61479cc13a7247b21a9b8cb989aa12
Bug 1001855: Process all active gears before inactive

Comment 6 Meng Bo 2013-09-11 08:25:52 UTC
Checked on devenv_3772 with multi-node and about 120 gears on it.

Upgrade with --max-threads=1

# oo-admin-upgrade upgrade-node --version 2.0.33 --ignore-cartridge-version --max-threads=1
Upgrader started with options: {:version=>"2.0.33", :ignore_cartridge_version=>true, :target_server_identity=>nil, :upgrade_position=>1, :num_upgraders=>1, :max_threads=>1, :gear_whitelist=>[]}
Building new upgrade queues and cluster metadata
Getting all active gears...
Getting all logins...
Writing 34 entries to gear queue for node ip-10-184-29-92 at /tmp/oo-upgrade/gear_queue_ip-10-184-29-92
Writing 21 entries to gear queue for node ip-10-184-29-92 at /tmp/oo-upgrade/gear_queue_ip-10-184-29-92
Writing 45 entries to gear queue for node ip-10-164-113-135 at /tmp/oo-upgrade/gear_queue_ip-10-164-113-135
Writing 20 entries to gear queue for node ip-10-164-113-135 at /tmp/oo-upgrade/gear_queue_ip-10-164-113-135


Tailing the upgrade log under /tmp/oo-upgrade shows that the inactive gears only start upgrading once the active ones are finished on all nodes.

