Description of problem:
We're running oo-admin-upgrade in PROD, and it's upgrading idle gears before all active gears have completed. 28 hosts on ex-srv1 haven't even started upgrading their active gears (nor their idle gears).

What appears to be happening is that oo-admin-upgrade schedules a host's active queue and then its inactive queue before moving on to the next host. That means the next host's active gears aren't scheduled until the current host's inactive gears have finished, which is not what we want.

To be clear, last release this worked properly; the recent refactoring broke it.

Version-Release number of selected component (if applicable):
openshift-origin-broker-util-1.13.10-1.el6oso.noarch

How reproducible:
Very, in PROD

Steps to Reproduce:
1. unknown

Actual results:
Active gears aren't being upgraded until all gears on the initial set of hosts are done.

Expected results:
All active gears should be upgraded first, then move on to idle gears.
Abhishek,

Not sure why you think this is a bug for the node team; it's a bug in the gear upgrade scheduling of oo-admin-upgrade. Maybe I didn't explain it clearly, so let me try again.

oo-admin-upgrade has a maximum number of nodes that it will upgrade at a time. I think that number is 8; it used to be called "THREADS", but I can't find that in the script now. So say we're upgrading 12 nodes: oo-admin-upgrade would chunk that into two groups, the first group being the first 8 nodes and the second group being the final 4 nodes.

What's happening is that oo-admin-upgrade upgrades the active gears on the first 8 nodes (which is correct), but then it upgrades the inactive gears on those first 8 nodes _before_ it upgrades the active gears on the last 4 nodes. This is incorrect. All active gears on all nodes should be upgraded _before_ any inactive gears. This is how oo-admin-upgrade worked before its recent refactor.

This problem is especially bad in PROD, where we have a lot of gears and nodes.

Moving back to broker.
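To make the intended ordering concrete, here is a minimal sketch in Python. This is not the actual oo-admin-upgrade code; MAX_THREADS, chunks, and upgrade_node_gears are hypothetical names used only to illustrate the two-phase ordering described above.

from itertools import islice

MAX_THREADS = 8  # assumed: maximum number of nodes upgraded at a time

def chunks(items, size):
    # Yield successive groups of at most `size` items.
    it = iter(items)
    while group := list(islice(it, size)):
        yield group

def schedule_upgrades(nodes, upgrade_node_gears):
    # Phase 1: active gears on every node, processed in groups of
    # MAX_THREADS (parallelism within a group omitted for brevity).
    for group in chunks(nodes, MAX_THREADS):
        for node in group:
            upgrade_node_gears(node, active=True)

    # Phase 2: inactive gears start only after phase 1 has finished on
    # all nodes; they are never interleaved per-group as in the buggy run.
    for group in chunks(nodes, MAX_THREADS):
        for node in group:
            upgrade_node_gears(node, active=False)

With 12 nodes and MAX_THREADS=8, this sketch upgrades active gears on nodes 1-8, then on nodes 9-12, and only then returns to nodes 1-8 and 9-12 for the inactive gears.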
Thomas,

It's a bug for the runtime team, as we own the oo-admin-upgrade script; the code is just poorly located at the moment (in a place that implies it's owned by the broker). I understand and acknowledge the bug, and will work on getting it fixed. Thanks!
Oh, I see, sorry for the confusion. :)
https://github.com/openshift/origin-server/pull/3610

Please test multi-node setups, including failures and re-runs with the same parameters, to verify that failed upgrades are corrected on the second pass. If you have any questions about constructing scenarios, please get in touch directly. Thanks!
Commit pushed to master at https://github.com/openshift/origin-server

https://github.com/openshift/origin-server/commit/89019725ca61479cc13a7247b21a9b8cb989aa12
Bug 1001855: Process all active gears before inactive
Checked on devenv_3772 with a multi-node setup and about 120 gears.

Upgrade with --max-threads=1:

# oo-admin-upgrade upgrade-node --version 2.0.33 --ignore-cartridge-version --max-threads=1
Upgrader started with options: {:version=>"2.0.33", :ignore_cartridge_version=>true, :target_server_identity=>nil, :upgrade_position=>1, :num_upgraders=>1, :max_threads=>1, :gear_whitelist=>[]}
Building new upgrade queues and cluster metadata
Getting all active gears...
Getting all logins...
Writing 34 entries to gear queue for node ip-10-184-29-92 at /tmp/oo-upgrade/gear_queue_ip-10-184-29-92
Writing 21 entries to gear queue for node ip-10-184-29-92 at /tmp/oo-upgrade/gear_queue_ip-10-184-29-92
Writing 45 entries to gear queue for node ip-10-164-113-135 at /tmp/oo-upgrade/gear_queue_ip-10-164-113-135
Writing 20 entries to gear queue for node ip-10-164-113-135 at /tmp/oo-upgrade/gear_queue_ip-10-164-113-135

Tailing the upgrade logs under /tmp/oo-upgrade shows that the inactive gears only start upgrading once the active ones have finished on all nodes.