Bug 1622255 - Upgrade not idempotent 'Remove the image stream tag'
Summary: Upgrade not idempotent 'Remove the image stream tag'
Keywords:
Status: CLOSED NOTABUG
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Cluster Version Operator
Version: 3.11.0
Hardware: Unspecified
OS: Unspecified
Priority: urgent
Severity: urgent
Target Milestone: ---
Target Release: 3.11.z
Assignee: Michael Gugino
QA Contact: liujia
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2018-08-24 22:38 UTC by Michael Gugino
Modified: 2019-02-14 21:56 UTC (History)

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2019-02-14 21:56:54 UTC
Target Upstream Version:
Embargoed:



Description Michael Gugino 2018-08-24 22:38:34 UTC
Description of problem: 'Remove the image stream tag' task not idempotent.


How reproducible: 100%


Steps to Reproduce:
1. Run 3.10 -> 3.11 upgrade_control_plane.yml
2. Upgrade fails for some correctable reason
3. Rerun upgrade_control_plane.yml

Actual results:
fatal: [ec2-35-153-179-215.compute-1.amazonaws.com]: FAILED! => {
    "changed": true,
    "cmd": "oc --config=/etc/origin/master/admin.kubeconfig delete -n openshift-sdn istag node:v3.11 --ignore-not-found",
    "delta": "0:00:00.192404",
    "end": "2018-08-24 22:37:00.878107",
    "invocation": {
        "module_args": {
            "_raw_params": "oc --config=/etc/origin/master/admin.kubeconfig delete -n openshift-sdn istag node:v3.11 --ignore-not-found",
            "_uses_shell": true,
            "argv": null,
            "chdir": null,
            "creates": null,
            "executable": null,
            "removes": null,
            "stdin": null,
            "warn": true
        }
    },
    "msg": "non-zero return code",
    "rc": 1,
    "start": "2018-08-24 22:37:00.685703",
    "stderr": "error: the server doesn't have a resource type \"istag\"",
    "stderr_lines": [
        "error: the server doesn't have a resource type \"istag\""
    ],
    "stdout": "",
    "stdout_lines": []
}


Expected results:
Upgrade is able to continue past this step

Additional info:
There are other tasks that use the same task name and use a similar command.  We need to fix this or clusters that fail during upgrade will be stuck unable to upgrade.

Comment 1 Michael Gugino 2018-08-27 15:46:35 UTC
PR Created: https://github.com/openshift/openshift-ansible/pull/9780

Comment 3 Randolph Morgan 2018-08-31 16:08:00 UTC
I experienced the problem above and tracked down the changes made to the yml files.  Applying those changes allows it to pass this step.  However, it now fails on the next step.

TASK [openshift_node_group : create node-config.yaml configmap] *******************************************************************
ok: [openshift-master1.chem.byu.edu]

TASK [openshift_node_group : create node-config.yaml and volume-config.yaml configmap] ********************************************
skipping: [openshift-master1.chem.byu.edu]

TASK [openshift_node_group : remove templated files] ******************************************************************************
changed: [openshift-master1.chem.byu.edu]

TASK [openshift_node_group : Ensure project exists] *******************************************************************************
ok: [openshift-master1.chem.byu.edu]

TASK [openshift_node_group : Make temp directory for templates] *******************************************************************
ok: [openshift-master1.chem.byu.edu]

TASK [openshift_node_group : Copy templates to temp directory] ********************************************************************
changed: [openshift-master1.chem.byu.edu] => (item=/usr/share/ansible/openshift-ansible/roles/openshift_node_group/files/sync-images.yaml)
changed: [openshift-master1.chem.byu.edu] => (item=/usr/share/ansible/openshift-ansible/roles/openshift_node_group/files/sync-policy.yaml)
changed: [openshift-master1.chem.byu.edu] => (item=/usr/share/ansible/openshift-ansible/roles/openshift_node_group/files/sync.yaml)

TASK [openshift_node_group : Update the image tag] ********************************************************************************
changed: [openshift-master1.chem.byu.edu]

TASK [openshift_node_group : Ensure the service account can run privileged] *******************************************************
ok: [openshift-master1.chem.byu.edu]

TASK [openshift_node_group : Remove the image stream tag] *************************************************************************
changed: [openshift-master1.chem.byu.edu]

TASK [openshift_node_group : Apply the config] ************************************************************************************
fatal: [openshift-master1.chem.byu.edu]: FAILED! => {"changed": true, "cmd": "oc --config=/etc/origin/master/admin.kubeconfig apply -f /tmp/ansible-pBI2Hx", "delta": "0:00:21.119941", "end": "2018-08-31 09:59:36.502122", "msg": "non-zero return code", "rc": 1, "start": "2018-08-31 09:59:15.382181", "stderr": "Error from server (AlreadyExists): error when creating \"/tmp/ansible-pBI2Hx/sync-images.yaml\": imagestreamtag.image.openshift.io \"node:v3.10\" already exists", "stderr_lines": ["Error from server (AlreadyExists): error when creating \"/tmp/ansible-pBI2Hx/sync-images.yaml\": imagestreamtag.image.openshift.io \"node:v3.10\" already exists"], "stdout": "serviceaccount \"sync\" unchanged\nrolebinding.authorization.openshift.io \"sync-node-config-reader-binding\" unchanged\ndaemonset.apps \"sync\" configured", "stdout_lines": ["serviceaccount \"sync\" unchanged", "rolebinding.authorization.openshift.io \"sync-node-config-reader-binding\" unchanged", "daemonset.apps \"sync\" configured"]}
        to retry, use: --limit @/usr/share/ansible/openshift-ansible/playbooks/byo/openshift-cluster/upgrades/v3_10/upgrade.retry

Now it is failing because the imagestream tag exists.  Any suggestions would be really helpful.

Comment 4 Randolph Morgan 2018-08-31 17:00:05 UTC
Having looked through this, I believe the issue all along has been that it is attempting to remove the istag from files that have been moved to a completely different directory. The sync-images.yaml, sync-policy.yaml, and sync.yaml files are all in the /tmp/ansible-xxxxxx directory, which is created when the templates are copied to the temp directory. So when the "Remove the image stream tag" step runs, it attempts to remove the image stream, but the files it is looking for are not in the directory it is searching; they are in the /tmp directory created three steps prior. These files do have the image stream in them, and this may be the reason it fails in the "Apply the config" step.

Comment 5 Randolph Morgan 2018-08-31 18:35:26 UTC
Made the following change to the sync.yml in /roles/openshift_node_group/tasks/:
---
- name: Ensure project exists
  oc_project:
    name: openshift-node
    state: present
    node_selector:
      - ""

- name: Make temp directory for templates
  command: mktemp -d /tmp/ansible-XXXXXX
  register: mktemp
  changed_when: False

# TODO: temporary until we fix apply for image stream tags
- name: Remove the image stream tag
  command: >
    {{ openshift_client_binary }}
    --config={{ openshift.common.config_base }}/master/admin.kubeconfig
    delete -n openshift-node istag node:v3.10 --ignore-not-found
  register: l_os_istag_del
  # The istag might not be there, so we want to not fail in that case.
  failed_when:
    - l_os_istag_del.rc != 0
    - "'have a resource type' not in l_os_istag_del.stderr"

- name: Copy templates to temp directory
  copy:
    src: "{{ item }}"
    dest: "{{ mktemp.stdout }}/{{ item | basename }}"
  with_fileglob:
    - "files/*.yaml"

- name: Update the image tag
  yedit:
    src: "{{ mktemp.stdout }}/sync-images.yaml"
    key: 'tag.from.name'
    value: "{{ osn_image }}"

- name: Ensure the service account can run privileged
  oc_adm_policy_user:
    namespace: "openshift-node"
    resource_kind: scc
    resource_name: privileged
    state: present
    user: "system:serviceaccount:openshift-node:sync"

- name: Apply the config
  shell: >
    {{ openshift_client_binary }} --config={{ openshift.common.config_base }}/master/admin.kubeconfig apply -f {{ mktemp.stdout }}

- name: Remove temp directory

I made the same change to roles/openshift_sdn/tasks/main.yml and roles/openshift_bootstrap_autoapprover/tasks/main.yml, and the upgrade ran without issue.
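
The `failed_when` list in the task above reads as a logical AND: Ansible joins the entries of a `failed_when` list with an implicit `and`, so the task is marked failed only when the return code is non-zero and stderr does not mention the missing resource type. A minimal Python sketch of that predicate follows. The function name `should_fail` is hypothetical and exists only for illustration; the error string is the one quoted in the description of this bug.

```python
def should_fail(rc: int, stderr: str) -> bool:
    """Model of the failed_when list: all conditions must hold to fail."""
    return rc != 0 and "have a resource type" not in stderr


# stderr reported in the description of this bug
missing_type = 'error: the server doesn\'t have a resource type "istag"'

print(should_fail(1, missing_type))                 # False: error is tolerated
print(should_fail(0, ""))                           # False: clean run
print(should_fail(1, "error: some other failure"))  # True: still a failure
```

Any other non-zero exit still fails the task, so the condition only masks the one known error rather than ignoring errors in general.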

Comment 6 Michael Gugino 2018-08-31 18:49:59 UTC
(In reply to Randolph Morgan from comment #5)
> Made the following change to the sync.yml in

Randolph, please file a new BZ for your issue.  This BZ is specifically about 3.11 upgrades, not 3.10.

Comment 7 Randolph Morgan 2018-08-31 19:39:39 UTC
My apologies.  I made my comments here because it was your Bugzilla report and your solution that allowed me to correct my issues.  I have opened a new BZ; see Bugzilla bug 1624493.

Comment 8 liujia 2018-09-07 09:36:44 UTC
Verified on openshift-ansible-3.11.0-0.28.0.git.0.730d4be.el7.noarch

Checked that the PR is merged.

Steps:
1. Run the control plane upgrade
2. Abort the upgrade after the task "Remove the image stream tag" passes
TASK [openshift_sdn : Remove the image stream tag] *****************************
task path: /usr/share/ansible/openshift-ansible/roles/openshift_sdn/tasks/main.yml:38
Friday 07 September 2018  08:51:19 +0000 (0:00:02.734)       0:04:10.275 ****** 
changed: [x] => {"changed": true, "cmd": ["oc", "--config=/etc/origin/master/admin.kubeconfig", "delete", "-n", "openshift-sdn", "istag", "node:v3.11", "--ignore-not-found"], "delta": "0:00:00.294263", "end": "2018-09-07 04:51:30.743725", "failed_when_result": false, "rc": 0, "start": "2018-09-07 04:51:30.449462", "stderr": "", "stderr_lines": [], "stdout": "imagestreamtag.image.openshift.io \"node:v3.11\" deleted", "stdout_lines": ["imagestreamtag.image.openshift.io \"node:v3.11\" deleted"]}
3. Re-run the control plane upgrade
changed: [x] => {
    "changed": true, 
    "cmd": [
        "oc", 
        "--config=/etc/origin/master/admin.kubeconfig", 
        "delete", 
        "-n", 
        "openshift-sdn", 
        "istag", 
        "node:v3.11", 
        "--ignore-not-found"
    ], 
    "delta": "0:00:00.280488", 
    "end": "2018-09-07 04:59:58.919619", 
    "failed_when_result": false, 
    "invocation": {
        "module_args": {
            "_raw_params": "oc --config=/etc/origin/master/admin.kubeconfig delete -n openshift-sdn istag node:v3.11 --ignore-not-found", 
            "_uses_shell": false, 
            "argv": null, 
            "chdir": null, 
            "creates": null, 
            "executable": null, 
            "removes": null, 
            "stdin": null, 
            "warn": true
        }
    }, 
    "rc": 0, 
    "start": "2018-09-07 04:59:58.639131", 
    "stderr": "", 
    "stderr_lines": [], 
    "stdout": "", 
    "stdout_lines": []
}

Comment 9 liujia 2018-09-12 03:53:16 UTC
Re-opening. For details, see https://bugzilla.redhat.com/show_bug.cgi?id=1624493#c11

Comment 10 Michael Gugino 2018-09-12 15:29:21 UTC
Does this still work on 3.10 -> 3.11 upgrades?  I don't think we can assume the behavior is the same across releases.  Please re-verify this on 3.11 upgrades independently of the other bug.

Comment 11 liujia 2018-09-13 01:42:59 UTC
(In reply to Michael Gugino from comment #10)
> Does this still work on 3.10 -> 3.11 upgrades?  I don't think we can assume the
> behavior is the same across releases.  Please re-verify this on 3.11
> upgrades independently of the other bug.

Hi Michael

QE cannot reproduce the issue with your steps on v3.11. The v3.11 upgrade works well and never hits the issue with your steps, whether before or after PR 9780 was merged. The first verification in comment 8 was based only on PR 9780 being merged and the upgrade working.

However, after digging further into bz1624493, we are worried that the root cause has not been identified, so I re-opened this bug to track the issue for further investigation.

Non-idempotency would not cause "error: the server doesn't have a resource type \"istag\"". I've tried running the command manually several times. Even after istag node:v3.11 has been deleted, re-running the command just produces no output. So I don't think we have really resolved the issue yet.

Changing the status to "POST" for now, pending further debugging and a fix from dev.
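
The three outcomes of the delete command seen in this report can be told apart by return code and output, which is why a plain re-run does not explain the original error. A small Python sketch, assuming only the strings quoted in this thread; `classify` and its labels are hypothetical names chosen for illustration:

```python
def classify(rc: int, stdout: str, stderr: str) -> str:
    """Label one run of the delete command by return code and output."""
    if rc != 0 and "have a resource type" in stderr:
        return "resource type unknown"
    if rc == 0 and "deleted" in stdout:
        return "deleted"
    if rc == 0 and not stdout and not stderr:
        return "already absent"
    return "other"


# Inputs copied from the runs quoted in this report.
first_run = classify(0, 'imagestreamtag.image.openshift.io "node:v3.11" deleted', "")
rerun = classify(0, "", "")
reported = classify(1, "", 'error: the server doesn\'t have a resource type "istag"')
print(first_run, rerun, reported, sep=" | ")
# prints: deleted | already absent | resource type unknown
```

A second run with `--ignore-not-found` lands in "already absent" with rc 0, whereas the failure in the description is a different case in which the server does not recognize the `istag` resource type at all.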

Comment 12 Scott Dodson 2018-09-27 13:35:22 UTC
If there's another bug tracking another issue we'll go ahead and leave that bug open. Going to go ahead and close this one.

