Description of problem:
'Remove the image stream tag' task not idempotent.

How reproducible:
100%

Steps to Reproduce:
1. Run 3.10 -> 3.11 upgrade_control_plane.yml
2. Upgrade fails for some correctable reason
3. Rerun upgrade_control_plane.yml

Actual results:
fatal: [ec2-35-153-179-215.compute-1.amazonaws.com]: FAILED! => {
    "changed": true,
    "cmd": "oc --config=/etc/origin/master/admin.kubeconfig delete -n openshift-sdn istag node:v3.11 --ignore-not-found",
    "delta": "0:00:00.192404",
    "end": "2018-08-24 22:37:00.878107",
    "invocation": {
        "module_args": {
            "_raw_params": "oc --config=/etc/origin/master/admin.kubeconfig delete -n openshift-sdn istag node:v3.11 --ignore-not-found",
            "_uses_shell": true,
            "argv": null,
            "chdir": null,
            "creates": null,
            "executable": null,
            "removes": null,
            "stdin": null,
            "warn": true
        }
    },
    "msg": "non-zero return code",
    "rc": 1,
    "start": "2018-08-24 22:37:00.685703",
    "stderr": "error: the server doesn't have a resource type \"istag\"",
    "stderr_lines": [
        "error: the server doesn't have a resource type \"istag\""
    ],
    "stdout": "",
    "stdout_lines": []
}

Expected results:
Upgrade is able to continue past this step

Additional info:
There are other tasks that use the same task name and use a similar command. We need to fix this or clusters that fail during upgrade will be stuck unable to upgrade.
PR Created: https://github.com/openshift/openshift-ansible/pull/9780
I experienced the problem above and tracked down the changes made to the yml files. Applying those changes allows it to pass this step. However, it now fails on the next step.

TASK [openshift_node_group : create node-config.yaml configmap] *****
ok: [openshift-master1.chem.byu.edu]

TASK [openshift_node_group : create node-config.yaml and volume-config.yaml configmap] *****
skipping: [openshift-master1.chem.byu.edu]

TASK [openshift_node_group : remove templated files] *****
changed: [openshift-master1.chem.byu.edu]

TASK [openshift_node_group : Ensure project exists] *****
ok: [openshift-master1.chem.byu.edu]

TASK [openshift_node_group : Make temp directory for templates] *****
ok: [openshift-master1.chem.byu.edu]

TASK [openshift_node_group : Copy templates to temp directory] *****
changed: [openshift-master1.chem.byu.edu] => (item=/usr/share/ansible/openshift-ansible/roles/openshift_node_group/files/sync-images.yaml)
changed: [openshift-master1.chem.byu.edu] => (item=/usr/share/ansible/openshift-ansible/roles/openshift_node_group/files/sync-policy.yaml)
changed: [openshift-master1.chem.byu.edu] => (item=/usr/share/ansible/openshift-ansible/roles/openshift_node_group/files/sync.yaml)

TASK [openshift_node_group : Update the image tag] *****
changed: [openshift-master1.chem.byu.edu]

TASK [openshift_node_group : Ensure the service account can run privileged] *****
ok: [openshift-master1.chem.byu.edu]

TASK [openshift_node_group : Remove the image stream tag] *****
changed: [openshift-master1.chem.byu.edu]

TASK [openshift_node_group : Apply the config] *****
fatal: [openshift-master1.chem.byu.edu]: FAILED! => {"changed": true, "cmd": "oc --config=/etc/origin/master/admin.kubeconfig apply -f /tmp/ansible-pBI2Hx", "delta": "0:00:21.119941", "end": "2018-08-31 09:59:36.502122", "msg": "non-zero return code", "rc": 1, "start": "2018-08-31 09:59:15.382181", "stderr": "Error from server (AlreadyExists): error when creating \"/tmp/ansible-pBI2Hx/sync-images.yaml\": imagestreamtag.image.openshift.io \"node:v3.10\" already exists", "stderr_lines": ["Error from server (AlreadyExists): error when creating \"/tmp/ansible-pBI2Hx/sync-images.yaml\": imagestreamtag.image.openshift.io \"node:v3.10\" already exists"], "stdout": "serviceaccount \"sync\" unchanged\nrolebinding.authorization.openshift.io \"sync-node-config-reader-binding\" unchanged\ndaemonset.apps \"sync\" configured", "stdout_lines": ["serviceaccount \"sync\" unchanged", "rolebinding.authorization.openshift.io \"sync-node-config-reader-binding\" unchanged", "daemonset.apps \"sync\" configured"]}
        to retry, use: --limit @/usr/share/ansible/openshift-ansible/playbooks/byo/openshift-cluster/upgrades/v3_10/upgrade.retry

Now it is failing because the imagestream tag exists. Any suggestions would be really helpful.
Having looked through this, I believe the issue all along has been that the task attempts to remove the istag from files that have been moved to a completely different directory. sync-images.yaml, sync-policy.yaml, and sync.yaml are all in the /tmp/ansible-xxxxxx directory, which is created when the templates are copied to the temp directory. So when the "Remove the image stream tag" step runs, the files it is looking for are not in the directory it searches; they are in the /tmp directory created three steps earlier. These files do contain the image stream, and this may be the reason the apply config step fails.
Made the following change to the sync.yml in /roles/openshift_node_group/tasks/:

---
- name: Ensure project exists
  oc_project:
    name: openshift-node
    state: present
    node_selector:
    - ""

- name: Make temp directory for templates
  command: mktemp -d /tmp/ansible-XXXXXX
  register: mktemp
  changed_when: False

# TODO: temporary until we fix apply for image stream tags
- name: Remove the image stream tag
  command: >
    {{ openshift_client_binary }}
    --config={{ openshift.common.config_base }}/master/admin.kubeconfig
    delete -n openshift-node istag node:v3.10 --ignore-not-found
  register: l_os_istag_del
  # The istag might not be there, so we want to not fail in that case.
  failed_when:
  - l_os_istag_del.rc != 0
  - "'have a resource type' not in l_os_istag_del.stderr"

- name: Copy templates to temp directory
  copy:
    src: "{{ item }}"
    dest: "{{ mktemp.stdout }}/{{ item | basename }}"
  with_fileglob:
  - "files/*.yaml"

- name: Update the image tag
  yedit:
    src: "{{ mktemp.stdout }}/sync-images.yaml"
    key: 'tag.from.name'
    value: "{{ osn_image }}"

- name: Ensure the service account can run privileged
  oc_adm_policy_user:
    namespace: "openshift-node"
    resource_kind: scc
    resource_name: privileged
    state: present
    user: "system:serviceaccount:openshift-node:sync"

- name: Apply the config
  shell: >
    {{ openshift_client_binary }}
    --config={{ openshift.common.config_base }}/master/admin.kubeconfig
    apply -f {{ mktemp.stdout }}

- name: Remove temp directory

and to roles/openshift_sdn/tasks/main.yml and roles/openshift_bootstrap_autoapprover/tasks/main.yml, and the upgrade ran without issue.
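For anyone reading the failed_when list in the task above: Ansible combines the entries of such a list with a logical AND, so the task is only marked failed when the return code is non-zero and stderr does not contain the "have a resource type" message. A minimal Python sketch of that decision, assuming nothing beyond the two conditions shown (the function name istag_delete_failed is hypothetical and is not part of openshift-ansible):

```python
def istag_delete_failed(rc: int, stderr: str) -> bool:
    """Mirror the two failed_when entries; both must be true for a failure."""
    return rc != 0 and "have a resource type" not in stderr


# stderr text as reported in this bug: tolerated, so the play can continue.
reported = 'error: the server doesn\'t have a resource type "istag"'
print(istag_delete_failed(1, reported))

# A clean run (rc 0, empty stderr) is not a failure either.
print(istag_delete_failed(0, ""))

# Any other non-zero exit is still treated as a failure.
print(istag_delete_failed(1, "error: some unrelated problem"))
```

The sketch only models the condition itself; it says nothing about why the server reports an unknown resource type in the first place.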
(In reply to Randolph Morgan from comment #5)
> Made the following change to the sync.yml in

Randolph, please file a new BZ for your issue. This BZ is specifically about 3.11 upgrades, not 3.10.
My apologies. I made my comments here because it was your Bugzilla report and your solution that allowed me to correct my issues. I have opened a new BZ; see Bugzilla bug 1624493.
Verified on openshift-ansible-3.11.0-0.28.0.git.0.730d4be.el7.noarch

Checked pr merged.

Steps:
1. Run upgrade control_plane
2. Abort the upgrade after task "Remove the image stream tag" pass

TASK [openshift_sdn : Remove the image stream tag] *****************************
task path: /usr/share/ansible/openshift-ansible/roles/openshift_sdn/tasks/main.yml:38
Friday 07 September 2018 08:51:19 +0000 (0:00:02.734) 0:04:10.275 ******
changed: [x] => {"changed": true, "cmd": ["oc", "--config=/etc/origin/master/admin.kubeconfig", "delete", "-n", "openshift-sdn", "istag", "node:v3.11", "--ignore-not-found"], "delta": "0:00:00.294263", "end": "2018-09-07 04:51:30.743725", "failed_when_result": false, "rc": 0, "start": "2018-09-07 04:51:30.449462", "stderr": "", "stderr_lines": [], "stdout": "imagestreamtag.image.openshift.io \"node:v3.11\" deleted", "stdout_lines": ["imagestreamtag.image.openshift.io \"node:v3.11\" deleted"]}

3. re-run upgrade control plane

changed: [x] => {
    "changed": true,
    "cmd": [
        "oc",
        "--config=/etc/origin/master/admin.kubeconfig",
        "delete",
        "-n",
        "openshift-sdn",
        "istag",
        "node:v3.11",
        "--ignore-not-found"
    ],
    "delta": "0:00:00.280488",
    "end": "2018-09-07 04:59:58.919619",
    "failed_when_result": false,
    "invocation": {
        "module_args": {
            "_raw_params": "oc --config=/etc/origin/master/admin.kubeconfig delete -n openshift-sdn istag node:v3.11 --ignore-not-found",
            "_uses_shell": false,
            "argv": null,
            "chdir": null,
            "creates": null,
            "executable": null,
            "removes": null,
            "stdin": null,
            "warn": true
        }
    },
    "rc": 0,
    "start": "2018-09-07 04:59:58.639131",
    "stderr": "",
    "stderr_lines": [],
    "stdout": "",
    "stdout_lines": []
}
Re-opening. For details, see https://bugzilla.redhat.com/show_bug.cgi?id=1624493#c11
Does this still work on 3.10 -> 3.11 upgrades? I don't think we can assume the behavior is the same across releases. Please re-verify this on 3.11 upgrades independently of the other bug.
(In reply to Michael Gugino from comment #10)
> Does still work on 3.10 -> 3.11 upgrades? I don't think we can assume the
> behavior is the same across releases. Please re-verify this on 3.11
> upgrades independently of the other bug.

Hi Michael,

QE cannot reproduce the issue in v3.11 by following your steps. The v3.11 upgrade works well and never hits the issue, whether before or after PR 9780 was merged. The first verification in comment 8 was based only on PR 9780 being merged and the upgrade working well. After digging further into bz1624493, we are worried that the root cause was not identified, so I re-opened this bug to track the issue for further investigation.

Non-idempotence will not cause "error: the server doesn't have a resource type \"istag\"". I've tried running the command manually several times. Even after istag node:v3.11 was deleted, re-running the command produces no output. So I don't think we have really resolved the issue yet.

Changing to "POST" for now to wait for dev's further debugging/fix.
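To make the distinction in the comment above concrete, here is a rough Python sketch that sorts the results of the delete command seen in this thread into separate cases. The function name and the labels are made up for illustration; the matched strings are taken from the logs quoted in this bug, and actual oc output may differ between versions.

```python
def classify_istag_delete(rc: int, stdout: str, stderr: str) -> str:
    """Sort one run of the 'oc delete ... istag ... --ignore-not-found' command."""
    if rc == 0:
        # With --ignore-not-found, a missing istag gives rc 0 and no output,
        # which is why a plain re-run does not fail.
        return "deleted" if "deleted" in stdout else "already absent"
    if "doesn't have a resource type" in stderr:
        # The server did not recognise the resource type at all; this is a
        # different condition from the object simply being gone.
        return "resource type unknown"
    return "other failure"


# First run from the verification log: the tag existed and was removed.
print(classify_istag_delete(
    0, 'imagestreamtag.image.openshift.io "node:v3.11" deleted', ""))

# Re-run from the verification log: nothing left to delete, no output.
print(classify_istag_delete(0, "", ""))

# The failure from the original report.
print(classify_istag_delete(
    1, "", 'error: the server doesn\'t have a resource type "istag"'))
```

Only the third case matches the error in the original report, which is consistent with the point that a repeated delete alone does not explain it.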
If there's another bug tracking another issue we'll go ahead and leave that bug open. Going to go ahead and close this one.