Bug 1624493 - Upgrade from Openshift Origin 3.9 to 3.10 fails at Remove Image Stream Tag if no Image Stream Tag is present.
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Cluster Version Operator
Version: 3.10.0
Hardware: Unspecified
OS: Linux
Priority: high
Severity: urgent
Target Milestone: ---
Target Release: 3.10.z
Assignee: Scott Dodson
QA Contact: liujia
URL:
Whiteboard:
Depends On:
Blocks: 1686590 1688452
 
Reported: 2018-08-31 19:34 UTC by Randolph Morgan
Modified: 2022-03-13 15:29 UTC (History)
CC List: 16 users

Fixed In Version: openshift-ansible-3.10.66-1.git.0.3c3a83a.el7
Doc Type: Bug Fix
Doc Text:
During the upgrade, aggregated APIs may take a while to become fully functional again after restarts. The upgrade playbooks now wait for them to become available before continuing.
Clone Of:
: 1686590 (view as bug list)
Environment:
Last Closed: 2018-12-05 15:14:40 UTC
Target Upstream Version:
Embargoed:
mrobson: needinfo+


Attachments
output of playbook -vvv (8.00 MB, text/plain) - 2018-09-13 18:15 UTC, Randolph Morgan
log of the oc get command (1.86 KB, text/plain) - 2018-09-14 16:32 UTC, Randolph Morgan
results of oc get command for scc (26.91 KB, application/zip) - 2018-09-14 16:37 UTC, Randolph Morgan
results of oc get command for istag (18.73 KB, application/zip) - 2018-09-14 16:38 UTC, Randolph Morgan
Output of Journalctl -r -u origin-node (1.10 MB, text/x-vhdl) - 2018-09-14 19:55 UTC, Randolph Morgan
Output of Journalctl -r -u origin-node (1.28 MB, text/x-vhdl) - 2018-09-14 19:56 UTC, Randolph Morgan
output of: master-logs etcd etcd &> etcd.log (48.02 KB, text/plain) - 2018-09-26 16:54 UTC, Randolph Morgan
output of: master-logs api api &> api.log (373.06 KB, text/plain) - 2018-09-26 16:54 UTC, Randolph Morgan
output of: master-logs controllers controllers &> controllers.log (27.29 KB, text/plain) - 2018-09-26 16:56 UTC, Randolph Morgan

Description Randolph Morgan 2018-08-31 19:34:16 UTC
Upgrading from Origin v3.9 to v3.10 failed.  After correcting the problem, the upgrade now fails at Remove Image Stream Tag.  It fails with the following error:

TASK [openshift_node_group : Remove the image stream tag] *************************************************************************

fatal: [openshift-master1.chem.byu.edu]: FAILED! => {"changed": true, "cmd": "oc --config=/etc/origin/master/admin.kubeconfig delete -n openshift-node istag node:v3.10 --ignore-not-found", "delta": "0:00:54.354315", "end": "2018-08-23 16:37:35.608204", "msg": "non-zero return code", "rc": 1, "start": "2018-08-23 16:36:41.253889", "stderr": "error: the server doesn't have a resource type \"istag\"", "stderr_lines": ["error: the server doesn't have a resource type \"istag\""], "stdout": "", "stdout_lines": []}

Researching this problem, I found that the same issue exists when upgrading from 3.10 to 3.11; see bugzilla report 1622255.  After looking at the solution reported by Michael Gugino, I tracked down the code on GitHub (https://github.com/openshift/openshift-ansible/commit/dc77308bdffb5f3a69bbdb17466898dd6dc339d2) and modified the following files:

roles/openshift_bootstrap_autoapprover/tasks/main.yml
roles/openshift_node_group/tasks/sync.yml
roles/openshift_sdn/tasks/main.yml

This corrected the fault listed above; however, the upgrade now fails at the next step:

TASK [openshift_node_group : Apply the config] ************************************************************************************
fatal: [openshift-master1.chem.byu.edu]: FAILED! => {"changed": true, "cmd": "oc --config=/etc/origin/master/admin.kubeconfig apply -f /tmp/ansible-pBI2Hx", "delta": "0:00:21.119941", "end": "2018-08-31 09:59:36.502122", "msg": "non-zero return code", "rc": 1, "start": "2018-08-31 09:59:15.382181", "stderr": "Error from server (AlreadyExists): error when creating \"/tmp/ansible-pBI2Hx/sync-images.yaml\": imagestreamtag.image.openshift.io \"node:v3.10\" already exists", "stderr_lines": ["Error from server (AlreadyExists): error when creating \"/tmp/ansible-pBI2Hx/sync-images.yaml\": imagestreamtag.image.openshift.io \"node:v3.10\" already exists"], "stdout": "serviceaccount \"sync\" unchanged\nrolebinding.authorization.openshift.io \"sync-node-config-reader-binding\" unchanged\ndaemonset.apps \"sync\" configured", "stdout_lines": ["serviceaccount \"sync\" unchanged", "rolebinding.authorization.openshift.io \"sync-node-config-reader-binding\" unchanged", "daemonset.apps \"sync\" configured"]}
        to retry, use: --limit @/usr/share/ansible/openshift-ansible/playbooks/byo/openshift-cluster/upgrades/v3_10/upgrade.retry

This was frustrating because it now shows that the istag exists in the files, and after looking at the files, sure enough it does.  After looking through the process, I found that the files the Remove Image Stream Tag task references are actually copied over to /tmp three steps prior to removing the tag.

I modified the three files mentioned above to be as follows:

---
- name: Ensure project exists
  oc_project:
    name: openshift-node
    state: present
    node_selector:
      - ""

- name: Make temp directory for templates
  command: mktemp -d /tmp/ansible-XXXXXX
  register: mktemp
  changed_when: False

# TODO: temporary until we fix apply for image stream tags
- name: Remove the image stream tag
  command: >
    {{ openshift_client_binary }}
    --config={{ openshift.common.config_base }}/master/admin.kubeconfig
    delete -n openshift-node istag node:v3.10 --ignore-not-found
  register: l_os_istag_del
  # The istag might not be there, so we want to not fail in that case.
  failed_when:
    - l_os_istag_del.rc != 0
    - "'have a resource type' not in l_os_istag_del.stderr"

- name: Copy templates to temp directory
  copy:
    src: "{{ item }}"
    dest: "{{ mktemp.stdout }}/{{ item | basename }}"
  with_fileglob:
    - "files/*.yaml"

- name: Update the image tag
  yedit:
    src: "{{ mktemp.stdout }}/sync-images.yaml"
    key: 'tag.from.name'
    value: "{{ osn_image }}"

- name: Ensure the service account can run privileged
  oc_adm_policy_user:
    namespace: "openshift-node"
    resource_kind: scc
    resource_name: privileged
    state: present
    user: "system:serviceaccount:openshift-node:sync"

- name: Apply the config
  shell: >
    {{ openshift_client_binary }} --config={{ openshift.common.config_base }}/master/admin.kubeconfig apply -f {{ mktemp.stdout }}

# (task body assumed from the standard openshift-ansible cleanup pattern; the
# original paste was cut off here)
- name: Remove temp directory
  file:
    state: absent
    name: "{{ mktemp.stdout }}"
  changed_when: False

The upgrade then proceeded without incident.  So changing the order of the tasks in these files allows the upgrade to be re-run successfully.

Comment 2 Michael Gugino 2018-09-04 20:04:20 UTC
Backport created for 3.10: https://github.com/openshift/openshift-ansible/pull/9911

Comment 5 Michael Gugino 2018-09-10 18:02:12 UTC
For any issues, please attach the inventory and variables + full ansible-playbook -vvv output (3 v's.).

Comment 11 liujia 2018-09-12 03:52:54 UTC
@Randolph Morgan @Matthew Robson

You got the error because you have PR 9780 in place; that PR keeps the task [openshift_node_group : Remove the image stream tag] from failing even when it did not actually remove istag node:v3.10 due to some error. So when the play reached the task [Apply the config], it reported "imagestreamtag.image.openshift.io \"node:v3.10\" already exists".

The root cause should be that [openshift_node_group : Remove the image stream tag] did not actually remove istag node:v3.10.

The log in the attachment shows the following:
TASK [openshift_node_group : Remove the image stream tag] **********************************************
..
"rc": 1, 
    "start": "2018-09-10 10:19:22.813434", 
    "stderr": "error: the server doesn't have a resource type \"istag\"", 
TASK [openshift_node_group : Apply the config] *********************************************************
..
"stderr": "Error from server (AlreadyExists): error when creating \"/tmp/ansible-l1G8UE/sync-images.yaml\": imagestreamtag.image.openshift.io \"node:v3.10\" already exists",


Then the question is why the task [openshift_node_group : Remove the image stream tag] failed (refer to bz1622255, which Michael hit in step 3).

I tried the command from the task [openshift_node_group : Remove the image stream tag] several times manually. Even if istag node:v3.10 had already been deleted and the command was run again, it did not produce an error like "error: the server doesn't have a resource type \"istag\""; it simply gave no output, because tag v3.10 had been removed and no longer existed.
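
For reference, this is the exact command the task runs, copied from the log above; on a healthy API server it exits 0 and prints nothing when the tag is already absent:

# exits 0 with no output when istag node:v3.10 does not exist
oc --config=/etc/origin/master/admin.kubeconfig \
  delete -n openshift-node istag node:v3.10 --ignore-not-found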

Then I discussed this with other folks who have tested the command, and found some information in bz1577520, which shows that this error happens when the server cannot be reached at the time the command is run.

I dug through the log again and found that the previous task also hit the same error.
TASK [openshift_node_group : Ensure the service account can run privileged] ****************************
..
 "msg": {
        "cmd": "/bin/oc get scc privileged -o json -n openshift-node", 
        "results": [
            {}
        ], 
        "returncode": 1, 
        "stderr": "error: the server doesn't have a resource type \"scc\"\n", 
        "stdout": ""
...
TASK [openshift_node_group : Remove the image stream tag] **********************************************
  "msg": "non-zero return code", 
    "rc": 1, 
    "start": "2018-09-10 10:19:22.813434", 
    "stderr": "error: the server doesn't have a resource type \"istag\"", 
..
TASK [openshift_node_group : Apply the config] *********************************************************

So, @Michael, I think we now need to find out why the server was not reachable at that time. It seems PR 9780 does not fix the issue in bz1622255, nor this bug, and this bug should be a duplicate of that one. I will re-open that one too.

Anyway, all of the above is just my analysis based on the current info/logs/bugs; please correct me if my guess is wrong.

Comment 12 Johnny Liu 2018-09-12 08:04:34 UTC
Totally agree with comment 11. In the initial log, as in comment 8, the "Remove the image stream tag" task shows oc reporting "the server doesn't have a resource type istag", and then in the following task oc reports "imagestreamtag.image.openshift.io 'node:v3.10' already exists". These two results contradict each other.

In a healthy cluster, whether 3.9 or 3.10, "oc get istag" always succeeds and never says "the server doesn't have a resource type istag".

The only possibility is that the master server is not fully up when the "Ensure the service account can run privileged" and "Remove the image stream tag" tasks run, which is why they report "error: the server doesn't have a resource type 'scc'" and "error: the server doesn't have a resource type 'istag'". So the key point is that, before running these tasks, the upgrade should do a strict pre-check to make sure the server is really ready; at the very least it should not hit the "the server doesn't have a resource type" error. PR #9911 instead tries to ignore such errors, which would make the following tasks even less stable.
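
To illustrate the kind of pre-check meant here (a sketch only; the fix that eventually landed waits on the aggregated APIs, as described later in this bug), the upgrade could retry an aliased lookup until API discovery answers instead of ignoring the error:

# illustrative readiness check, not the actual playbook change:
# retry until alias discovery works again, give up after ~5 minutes
for i in $(seq 1 30); do
  oc --config=/etc/origin/master/admin.kubeconfig get scc privileged >/dev/null 2>&1 && break
  sleep 10
done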

Comment 15 Matthew Robson 2018-09-13 11:49:28 UTC
The issue with getting a resource type has to do with the inability of the API server to correctly discover and return the information for aliases like scc or istag.

    "stderr": "error: the server doesn't have a resource type \"scc\"\n",
    "stderr": "error: the server doesn't have a resource type \"istag\"",

We hit the issue because the service catalog was in a degraded state, causing calls to the API to be throttled and to time out.

On the first run of the 3.10 upgrade, there was a firewall issue when bringing up etcd on master 1 meaning the old service was down and the new static pod could not start.

In 3.9, by default, the SC apiserver only points to one etcd server, in this case the one running on master1. This led to all alias calls failing and regular oc commands taking upwards of 1 minute.

root      43250  126  7.3 1527892 1203308 ?     SNsl Sep04 14332:23 /usr/bin/service-catalog apiserver --storage-type etcd --secure-port 6443 --etcd-servers https://master1:2379 --etcd-cafile /etc/origin/master/master.etcd-ca.crt --etcd-certfile /etc/origin/master/master.etcd-client.crt --etcd-keyfile /etc/origin/master/master.etcd-client.key -v 3 --cors-allowed-origins localhost --admission-control KubernetesNamespaceLifecycle,DefaultServicePlan,ServiceBindingsLifecycle,ServicePlanChangeValidator,BrokerAuthSarCheck --feature-gates OriginatingIdentity=true

We came to this conclusion by running with loglevel 9 on the oc command and seeing over 60 429s on 1 oc get invocation:

oc get scc privileged -o json -n openshift-node --loglevel=9:

I0912 11:11:29.359077   69349 round_trippers.go:405] GET https://masterlbredhat.com:8443/apis/servicecatalog.k8s.io/v1beta1?timeout=32s 429 Too Many Requests in 65 milliseconds

At the end of the request, you would see:

I0912 11:11:30.619148   69349 request.go:897] Response Body: Too many requests, please try again later.
I0912 11:11:30.619338   69349 request.go:1099] body was not decodable (unable to check for Status): couldn't get version/kind; json parse error: json: cannot unmarshal string into Go value of type struct { APIVersion string "json:\"apiVersion,omitempty\""; Kind string "json:\"kind,omitempty\"" }
I0912 11:11:30.619357   69349 cached_discovery.go:77] skipped caching discovery info due to the server has received too many requests and has asked us to try again later
F0912 11:11:30.620980   69349 helpers.go:119] error: the server doesn't have a resource type "scc"

We ran "oc edit -n kube-service-catalog ds/apiserver" and changed it to point to the second etcd server. Once we bounced the pod, all commands started executing quickly again, including aliases.

This allowed us to complete the upgrade without any fixes or rearranging of the playbook execution.

If you are seeing 'the server doesn't have a resource type', there is a good chance you have an issue with the API, and running with log level 9 as above will help you root cause it.

The upgrade to OCP 3.10 modifies the service catalog to point to all 3 etcd servers which would prevent something like this in the future.
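
One way to check that aggregated API's health directly (a suggested check, not quoted from the case itself) is to look at the Available condition on the service catalog APIService:

# prints "True" only when the aggregated service catalog API is reachable
oc get apiservice v1beta1.servicecatalog.k8s.io \
  -o jsonpath='{.status.conditions[?(@.type=="Available")].status}{"\n"}'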

Comment 16 Randolph Morgan 2018-09-13 15:43:04 UTC
@Matthew Robson - your suggestion worked well until the time came to apply the config, and now I am seeing the same problem I was seeing before:

TASK [openshift_node_group : Apply the config] ************************************************************************************
fatal: [openshift-master1.chem.byu.edu]: FAILED! => {"changed": true, "cmd": "oc --config=/etc/origin/master/admin.kubeconfig apply -f /tmp/ansible-aZWqrY", "delta": "0:00:21.153334", "end": "2018-09-13 09:28:53.068766", "msg": "non-zero return code", "rc": 1, "start": "2018-09-13 09:28:31.915432", "stderr": "Error from server (AlreadyExists): error when creating \"/tmp/ansible-aZWqrY/sync-images.yaml\": imagestreamtag.image.openshift.io \"node:v3.10\" already exists", "stderr_lines": ["Error from server (AlreadyExists): error when creating \"/tmp/ansible-aZWqrY/sync-images.yaml\": imagestreamtag.image.openshift.io \"node:v3.10\" already exists"], "stdout": "serviceaccount \"sync\" unchanged\nrolebinding.authorization.openshift.io \"sync-node-config-reader-binding\" unchanged\ndaemonset.apps \"sync\" configured", "stdout_lines": ["serviceaccount \"sync\" unchanged", "rolebinding.authorization.openshift.io \"sync-node-config-reader-binding\" unchanged", "daemonset.apps \"sync\" configured"]}

PLAY RECAP ************************************************************************************************************************
localhost                  : ok=14   changed=0    unreachable=0    failed=0
openshift-infra1.chem.byu.edu : ok=18   changed=0    unreachable=0    failed=0
openshift-infra2.chem.byu.edu : ok=18   changed=0    unreachable=0    failed=0
openshift-master1.chem.byu.edu : ok=260  changed=43   unreachable=0    failed=1
openshift-master2.chem.byu.edu : ok=102  changed=9    unreachable=0    failed=0
openshift-master3.chem.byu.edu : ok=102  changed=9    unreachable=0    failed=0
openshift-node1.chem.byu.edu : ok=18   changed=0    unreachable=0    failed=0
openshift-node2.chem.byu.edu : ok=18   changed=0    unreachable=0    failed=0
openshift-node3.chem.byu.edu : ok=18   changed=0    unreachable=0    failed=0
vpn1.chem.byu.edu          : ok=1    changed=0    unreachable=0    failed=0



Failure summary:


  1. Hosts:    openshift-master1.chem.byu.edu
     Play:     Configure components that must be available prior to upgrade
     Task:     Apply the config
     Message:  non-zero return code

What frustrates me about this is that if the tag already exists, it should accept the existing tag and move on, not fail.  Failing on something that it put there itself seems like a bad idea, or someone missed a step.

Comment 17 Randolph Morgan 2018-09-13 17:48:35 UTC
# Create an OSEv3 group that contains the master, nodes, etcd, and lb groups.
# The lb group lets Ansible configure HAProxy as the load balancing solution.
# Comment lb out if your load balancer is pre-configured.
[OSEv3:children]
masters
nodes
etcd
nfs

# Set variables common for all OSEv3 hosts
[OSEv3:vars]
ansible_ssh_user=root
openshift_deployment_type=origin
openshift_service_type=origin
enable_docker_excluder=false
openshift_disable_check=disk_availability
openshift_enable_service_catalog=false
template_service_broker_install=false
ansible_service_broker_install=false

# Uncomment the following to enable htpasswd authentication; defaults to DenyAllPasswordIdentityProvider.

openshift_master_identity_providers=[{'name': 'htpasswd_auth', 'login': 'true', 'challenge': 'true', 'kind': 'HTPasswdPasswordIdentityProvider', 'filename': '/etc/origin/master/htpasswd'}, {'name': 'IPA', 'challenge': 'true', 'login': 'true', 'mappingMethod': 'claim', 'kind': 'LDAPPasswordIdentityProvider', 'attributes': {'id': ['dn'], 'email': ['mail'], 'name': ['cn'], 'preferredUsername': ['uid']}, 'bindDN': 'uid=user,cn=sysaccounts,cn=etc,dc=xxx,dc=xxx,dc=xxx', 'bindPassword': '0000', 'ca': '/etc/origin/master/ldap_ca.crt', 'insecure': 'false', 'url': 'ldaps://ipa.xxx.xxx:636/cn=users,cn=accounts,dc=xxx,dc=xxx,dc=xxx?uid?sub?(&(uid=*)(memberOf=cn=xxx,cn=groups,cn=accounts,dc=xxx,dc=xxx,dc=xxx))'}]
openshift_master_htpasswd_users={'user': '0000'}
openshift_master_ldap_ca_file=/home/root/ldap_ca.crt

# Native high availability cluster method with optional load balancer.
# If no lb group is defined installer assumes that a load balancer has
# been preconfigured. For installation the value of
# openshift_master_cluster_hostname must resolve to the load balancer
# or to one or all of the masters defined in the inventory if no load
# balancer is present.
openshift_master_cluster_method=native
# set the following line to be the vm that will start as the load balancer
openshift_master_cluster_hostname=openshift-masters.xxx.xxx
openshift_master_cluster_public_hostname=openshift.xxx.xxx
openshift_master_default_subdomain=apps.xxx.xxx
# Router selector (optional)
# Router will only be created if nodes matching this label are present.
# Default value: 'region=infra'
openshift_hosted_router_selector='region=infra'

# default project node selector
osm_default_node_selector='region=primary'

# override the default controller lease ttl
#osm_controller_lease_ttl=30

# Configure the multi-tenant SDN plugin (default is 'redhat/openshift-ovs-subnet')
os_sdn_network_plugin_name='redhat/openshift-ovs-multitenant'

# External NFS Host
# NFS volume must already exist with path "nfs_directory/_volume_name" on
# the storage_host. For example, the remote volume path using these
# options would be "nfs.example.com:/exports/registry"
#It looks like this automatically maps this directory to the nodes
openshift_enable_unsupported_configurations=True
openshift_hosted_registry_storage_kind=nfs
openshift_hosted_registry_storage_access_modes=['ReadWriteMany']
openshift_hosted_registry_storage_host=vpn1.xxx.xxx
openshift_hosted_registry_storage_nfs_directory=/exports
openshift_hosted_registry_storage_volume_name=registry
openshift_hosted_registry_storage_volume_size=322Gi

#Configure custom ca certificate
#openshift_master_ca_certificate={'certfile': '/etc/origin/master/ca.crt', 'keyfile': '/etc/origin/master.key'}

[nfs]
vpn1.xxx.xxx

# host group for masters
# changed in DNS to point openshift-masters.xxx.xxx to multiple places, as Garrett suggested.
[masters]
openshift-master[1:3].xxx.xxx
#openshift-master1.xxx.xxx
#openshift-master3.xxx.xxx

# host group for etcd, hosted on masters
# same as masters group.
[etcd]
openshift-master[1:3].xxx.xxx
#openshift-master1.xxx.xxx
#openshift-master3.xxx.xxx

# host group for nodes, includes region info
[nodes]
openshift-master[1:3].xxx.xxx openshift_node_group_name='node-config-master'
openshift-infra[1:2].xxx.xxx openshift_node_group_name='node-config-infra'
openshift-node[1:3].xxx.xxx openshift_node_group_name='node-config-compute'

Comment 18 Randolph Morgan 2018-09-13 18:15:49 UTC
Created attachment 1483121 [details]
output of playbook -vvv

Comment 19 Randolph Morgan 2018-09-13 18:19:22 UTC
After watching this run repeatedly, I am more convinced that the remove image tag step is looking in the wrong directory.  The files that are copied and used to create the config are found in /tmp, and these files actually do have an istag in them.  The remove image tag step does not remove this tag because it is reading different files that don't have an istag.

Comment 20 Johnny Liu 2018-09-14 02:32:38 UTC
@Randolph, according to comment 15, QE's guess (comments 11 and 12) is right.

This is not related to the "remove image tag" step; it is caused by the inability of the API server. In your log, if you check the task prior to "Remove the image stream tag", it is returning:
    "stderr": "error: the server doesn't have a resource type \"scc\"\n",
    "stderr": "error: the server doesn't have a resource type \"istag\"",

I think you need to do some investigation into why your API server cannot correctly discover and return the information for aliases like scc or istag.

Comment 21 Randolph Morgan 2018-09-14 14:39:59 UTC
So I followed the suggestions in comment 15.  I ran the "oc get scc privileged -o json -n openshift-node --loglevel=9" command and had similar results to what Matthew was describing.  I then ran "oc edit -n kube-service-catalog ds/apiserver" and changed the etcd server to one of the other masters.  There are 3 masters in my cluster.  The upgrade is triggered from master1, and the current etcd server is pointing to master3; I chose master3 because master2 had a communication issue that was resolved.  I am not seeing any instability in the cluster, but I also have a harder time finding the logs.  I was working with a friend who is an expert in OpenShift, and he looked things over for me, but we found no issues with the stability of the cluster.  To date, I have followed the instructions from Matthew; the results in comment 16 show what happened.

I would love to do some deeper investigation, but I am not even sure where to start at this point.  I have looked at the files mentioned in comment 19, both the originals and those in the tmp directory, and what I describe is accurate.  The original files do not have any SCC or ISTAG entries; however, the files in the tmp directory have both.  If the remove istag step of the upgrade is supposed to remove the istag, it should be removing it from the files in the tmp directory and not the originals.  That way, when the files get copied to their permanent location and the config is applied, the istag won't be there to cause problems.

Admittedly I am not a developer, I am a sysadmin, so perhaps I have this wrong, but it just seems logical from my perspective that this would be the case.  I am happy to run this again and capture whatever logs you would like to see.  I do appreciate the assistance I receive and want to contribute to making this product even better than it currently is.  Since my job is to take the vision of the developers and make it work in environments perhaps not envisioned at the time, I am happy to help.

Comment 22 Matthew Robson 2018-09-14 14:47:38 UTC
If you use the full commands instead of the scc or istag aliases, do they work?

oc get securitycontextconstraints
oc get imagestreamtag

Can you post the output from log level 9 if you sanitize it?

Comment 23 Randolph Morgan 2018-09-14 16:32:22 UTC
Created attachment 1483373 [details]
log of the oc get command

oc get securitycontextconstraints privileged -o json -n openshift-node --loglevel=9 > scc.log

Comment 24 Randolph Morgan 2018-09-14 16:37:32 UTC
Created attachment 1483374 [details]
results of oc get command for scc

Comment 25 Randolph Morgan 2018-09-14 16:38:23 UTC
Created attachment 1483375 [details]
results of oc get command for istag

I redirected the output to a log file, but the log file was empty.

Comment 26 Randolph Morgan 2018-09-14 16:39:20 UTC
I ran both commands; the results and associated log files are attached.  The istag.log file was empty, so it would not let me upload it.

Comment 27 Randolph Morgan 2018-09-14 19:55:25 UTC
Created attachment 1483389 [details]
Output of Journalctl -r -u origin-node

These are the logs for the oc get securitycontextconstraints privileged -o json -n openshift-node --loglevel=9 > scc.log

Comment 28 Randolph Morgan 2018-09-14 19:56:21 UTC
Created attachment 1483390 [details]
Output of Journalctl -r -u origin-node

These are the logs for oc get imagestreamtag privileged -o json -n openshift-node --loglevel=9 > istag.log

Comment 30 Randolph Morgan 2018-09-25 21:51:13 UTC
I am still experiencing the problem with the istag removal.  I keep seeing others who have experienced this, but their solution is to just rebuild.  This is a production environment, so rebuilding is not an option.  I need a solution to this, so any ideas would be appreciated.

Comment 31 Randolph Morgan 2018-09-25 22:29:20 UTC
Here is the result of my last attempt.  The pauses were inserted by me to manually remove the istag.  Prior to the latest update yesterday, this worked to get me past this problem, but now it doesn't.  It appears that the file is being held in memory and the changes are no longer visible.  If you can't tell, I am getting desperate and extremely frustrated.  If there is some information you need, please tell me and I will get it for you.

TASK [openshift_node_group : Make temp directory for templates] ********************************************
ok: [openshift-master1.chem.byu.edu]

TASK [openshift_node_group : Copy templates to temp directory] *********************************************
changed: [openshift-master1.chem.byu.edu] => (item=/usr/share/ansible/openshift-ansible/roles/openshift_node_group/files/sync-images.yaml)
changed: [openshift-master1.chem.byu.edu] => (item=/usr/share/ansible/openshift-ansible/roles/openshift_node_group/files/sync-policy.yaml)
changed: [openshift-master1.chem.byu.edu] => (item=/usr/share/ansible/openshift-ansible/roles/openshift_node_group/files/sync.yaml)

TASK [openshift_node_group : pause] ************************************************************************
Pausing for 300 seconds
(ctrl+C then 'C' = continue early, ctrl+C then 'A' = abort)
Press 'C' to continue the play or 'A' to abort
ok: [openshift-master1.chem.byu.edu]

TASK [openshift_node_group : Update the image tag] *********************************************************
changed: [openshift-master1.chem.byu.edu]

TASK [openshift_node_group : Ensure the service account can run privileged] ********************************
ok: [openshift-master1.chem.byu.edu]

TASK [openshift_node_group : Remove the image stream tag] **************************************************
changed: [openshift-master1.chem.byu.edu]

TASK [openshift_node_group : pause] ************************************************************************
Pausing for 300 seconds
(ctrl+C then 'C' = continue early, ctrl+C then 'A' = abort)
Press 'C' to continue the play or 'A' to abort
ok: [openshift-master1.chem.byu.edu]

TASK [openshift_node_group : Apply the config] *************************************************************
fatal: [openshift-master1.chem.byu.edu]: FAILED! => {"changed": true, "cmd": "oc --config=/etc/origin/master/admin.kubeconfig apply -f /tmp/ansible-V4QMRF", "delta": "0:00:19.224360", "end": "2018-09-25 16:21:27.219506", "msg": "non-zero return code", "rc": 1, "start": "2018-09-25 16:21:07.995146", "stderr": "Error from server (AlreadyExists): error when creating \"/tmp/ansible-V4QMRF/sync-images.yaml\": imagestreamtag.image.openshift.io \"node:v3.10\" already exists", "stderr_lines": ["Error from server (AlreadyExists): error when creating \"/tmp/ansible-V4QMRF/sync-images.yaml\": imagestreamtag.image.openshift.io \"node:v3.10\" already exists"], "stdout": "serviceaccount \"sync\" unchanged\nrolebinding.authorization.openshift.io \"sync-node-config-reader-binding\" unchanged\ndaemonset.apps \"sync\" unchanged", "stdout_lines": ["serviceaccount \"sync\" unchanged", "rolebinding.authorization.openshift.io \"sync-node-config-reader-binding\" unchanged", "daemonset.apps \"sync\" unchanged"]}
	to retry, use: --limit @/usr/share/ansible/openshift-ansible/playbooks/byo/openshift-cluster/upgrades/v3_10/upgrade.retry

PLAY RECAP *************************************************************************************************
localhost                  : ok=14   changed=0    unreachable=0    failed=0   
openshift-infra1.chem.byu.edu : ok=18   changed=0    unreachable=0    failed=0   
openshift-infra2.chem.byu.edu : ok=18   changed=0    unreachable=0    failed=0   
openshift-master1.chem.byu.edu : ok=263  changed=43   unreachable=0    failed=1   
openshift-master2.chem.byu.edu : ok=103  changed=9    unreachable=0    failed=0   
openshift-master3.chem.byu.edu : ok=103  changed=9    unreachable=0    failed=0   
openshift-node1.chem.byu.edu : ok=18   changed=0    unreachable=0    failed=0   
openshift-node2.chem.byu.edu : ok=18   changed=0    unreachable=0    failed=0   
openshift-node3.chem.byu.edu : ok=18   changed=0    unreachable=0    failed=0   
vpn1.chem.byu.edu          : ok=1    changed=0    unreachable=0    failed=0   



Failure summary:


  1. Hosts:    openshift-master1.chem.byu.edu
     Play:     Configure components that must be available prior to upgrade
     Task:     Apply the config
     Message:  non-zero return code

Comment 36 Randolph Morgan 2018-09-26 16:52:27 UTC
I do have logs from the last upgrade attempt; I ran the following commands to acquire them:

journalctl --no-pager > node.log
master-logs etcd etcd &> etcd.log
master-logs api api &> api.log
master-logs controllers controllers &> controllers.log

The node.log file is too big to upload; is there another way for me to attach it to this ticket?
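
A common workaround (a suggestion only, nothing prescribed in this bug) is to compress or split the file before attaching it:

# compress (keeps the original thanks to -k), usually enough for journal text
xz -T0 -k node.log            # produces node.log.xz
# ...or split it into chunks and upload the pieces individually
split -b 20m node.log node.log.part.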

Comment 37 Randolph Morgan 2018-09-26 16:54:05 UTC
Created attachment 1487375 [details]
output of:  master-logs etcd etcd &> etcd.log

Comment 38 Randolph Morgan 2018-09-26 16:54:57 UTC
Created attachment 1487376 [details]
output of:  master-logs api api &> api.log

Comment 39 Randolph Morgan 2018-09-26 16:56:15 UTC
Created attachment 1487377 [details]
output of:  master-logs controllers controllers &> controllers.log

Comment 41 Randolph Morgan 2018-09-26 21:55:57 UTC
I ran: "oc describe pods -n kube-service-catalog" and the results follow:
Name:           apiserver-h68gc
Namespace:      kube-service-catalog
Node:           openshift-master1.xxx.xxx/xxx.xxx.xxx.xxx
Start Time:     Thu, 20 Sep 2018 13:40:10 -0600
Labels:         app=apiserver
                controller-revision-hash=2395821796
                pod-template-generation=9
Annotations:    ca_hash=74b331eee9c40a5edb321fea9ccb2f84b9dffafd
                openshift.io/scc=hostmount-anyuid
Status:         Running
IP:             10.128.0.59
Controlled By:  DaemonSet/apiserver
Containers:
  apiserver:
    Container ID:  docker://4c89125a4b8fde921bf7811eb42621bf498fdd38e92f197fabcbf2cef8760aff
    Image:         docker.io/openshift/origin-service-catalog:v3.9.0
    Image ID:      docker-pullable://docker.io/openshift/origin-service-catalog@sha256:d86a5f8f57b18041c379ebfc6ebe3eeec04373060b50b88ebdadb188109cfafb
    Port:          6443/TCP
    Host Port:     0/TCP
    Command:
      /usr/bin/service-catalog
    Args:
      apiserver
      --storage-type
      etcd
      --secure-port
      6443
      --etcd-servers
      https://openshift-master3.xxx.xxx:4001
      --etcd-cafile
      /etc/origin/master/master.etcd-ca.crt
      --etcd-certfile
      /etc/origin/master/master.etcd-client.crt
      --etcd-keyfile
      /etc/origin/master/master.etcd-client.key
      -v
      10
      --cors-allowed-origins
      localhost
      --admission-control
      KubernetesNamespaceLifecycle,DefaultServicePlan,ServiceBindingsLifecycle,ServicePlanChangeValidator,BrokerAuthSarCheck
      --feature-gates
      OriginatingIdentity=true
    State:          Running
      Started:      Wed, 26 Sep 2018 15:28:31 -0600
    Last State:     Terminated
      Reason:       Error
      Exit Code:    1
      Started:      Wed, 26 Sep 2018 15:26:59 -0600
      Finished:     Wed, 26 Sep 2018 15:28:02 -0600
    Ready:          True
    Restart Count:  3
    Environment:    <none>
    Mounts:
      /etc/origin/master from etcd-host-cert (ro)
      /var/run/kubernetes-service-catalog from apiserver-ssl (ro)
      /var/run/secrets/kubernetes.io/serviceaccount from service-catalog-apiserver-token-tt725 (ro)
Conditions:
  Type           Status
  Initialized    True
  Ready          True
  PodScheduled   True
Volumes:
  apiserver-ssl:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  apiserver-ssl
    Optional:    false
  etcd-host-cert:
    Type:          HostPath (bare host directory volume)
    Path:          /etc/origin/master
    HostPathType:
  data-dir:
    Type:    EmptyDir (a temporary directory that shares a pod's lifetime)
    Medium:
  service-catalog-apiserver-token-tt725:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  service-catalog-apiserver-token-tt725
    Optional:    false
QoS Class:       BestEffort
Node-Selectors:  openshift-infra=apiserver
Tolerations:     node.kubernetes.io/disk-pressure:NoSchedule
                 node.kubernetes.io/memory-pressure:NoSchedule
                 node.kubernetes.io/not-ready:NoExecute
                 node.kubernetes.io/unreachable:NoExecute
Events:
  Type     Reason                  Age                From                                           Message
  ----     ------                  ----               ----                                           -------
  Warning  FailedCreatePodSandBox  20m                kubelet, openshift-master1.xxx.xxx        Failed create pod sandbox: rpc error: code = Unknown desc = failed to set up sandbox container "f3c7c498eb04b401b139032b2d38d7493852a64ca0faa852b8047f19bdf5dc3e" network for pod "apiserver-h68gc": NetworkPlugin cni failed to set up pod "apiserver-h68gc_kube-service-catalog" network: OpenShift SDN network process is not (yet?) available
  Normal   SandboxChanged          19m (x2 over 20m)  kubelet, openshift-master1.xxx.xxx        Pod sandbox changed, it will be killed and re-created.
  Warning  NetworkFailed           19m                openshift-sdn, openshift-master1.xxx.xxx  The pod's network interface has been lost and the pod will be stopped.
  Warning  FailedMount             18m                kubelet, openshift-master1.xxx.xxx        MountVolume.SetUp failed for volume "service-catalog-apiserver-token-tt725" : Get https://openshift-masters.xxx.xxx:8443/api/v1/namespaces/kube-service-catalog/secrets/service-catalog-apiserver-token-tt725: read tcp xxx.xxx.xxx.xxx:60936->xxx.xxx.xxx.xxx:8443: use of closed network connection
  Warning  FailedMount             18m                kubelet, openshift-master1.xxx.xxx        MountVolume.SetUp failed for volume "service-catalog-apiserver-token-tt725" : grpc: the client connection is closing
  Warning  FailedMount             17m                kubelet, openshift-master1.xxx.xxx        MountVolume.SetUp failed for volume "service-catalog-apiserver-token-tt725" : Get https://openshift-masters.xxx.xxx:8443/api/v1/namespaces/kube-service-catalog/secrets/service-catalog-apiserver-token-tt725: read tcp xxx.xxx.xxx.xxx:33648->xxx.xxx.xxx.xxx:8443: use of closed network connection
  Warning  FailedMount             17m                kubelet, openshift-master1.xxx.xxx        MountVolume.SetUp failed for volume "service-catalog-apiserver-token-tt725" : Get https://openshift-masters.xxx.xxx:8443/api/v1/namespaces/kube-service-catalog/secrets/service-catalog-apiserver-token-tt725: dial tcp xxx.xxx.xxx.xxx:8443: getsockopt: connection refused
  Warning  BackOff                 17m                kubelet, openshift-master1.xxx.xxx        Back-off restarting failed container
  Warning  FailedMount             17m                kubelet, openshift-master1.xxx.xxx        MountVolume.SetUp failed for volume "service-catalog-apiserver-token-tt725" : Get https://openshift-masters.xxx.xxx:8443/api/v1/namespaces/kube-service-catalog/secrets/service-catalog-apiserver-token-tt725: dial tcp xxx.xxx.xxx.xxx:8443: getsockopt: connection refused
  Warning  FailedMount             17m                kubelet, openshift-master1.xxx.xxx        MountVolume.SetUp failed for volume "service-catalog-apiserver-token-tt725" : Get https://openshift-masters.xxx.xxx:8443/api/v1/namespaces/kube-service-catalog/secrets/service-catalog-apiserver-token-tt725: net/http: TLS handshake timeout
  Normal   Pulled                  17m (x2 over 19m)  kubelet, openshift-master1.xxx.xxx        Container image "docker.io/openshift/origin-service-catalog:v3.9.0" already present on machine
  Normal   Created                 17m (x2 over 18m)  kubelet, openshift-master1.xxx.xxx        Created container
  Normal   Started                 17m (x2 over 18m)  kubelet, openshift-master1.xxx.xxx        Started container


Name:           controller-manager-w4sl6
Namespace:      kube-service-catalog
Node:           openshift-master1.xxx.xxx/xxx.xxx.xxx.xxx
Start Time:     Tue, 25 Sep 2018 09:54:15 -0600
Labels:         app=controller-manager
                controller-revision-hash=1456597385
                pod-template-generation=2
Annotations:    openshift.io/scc=restricted
Status:         Running
IP:             10.128.0.60
Controlled By:  DaemonSet/controller-manager
Containers:
  controller-manager:
    Container ID:  docker://b5b0f91c452abb4fbab93988ffeef65e44de3919f24b2a46a6106b7d2e489776
    Image:         docker.io/openshift/origin-service-catalog:v3.9.0
    Image ID:      docker-pullable://docker.io/openshift/origin-service-catalog@sha256:d86a5f8f57b18041c379ebfc6ebe3eeec04373060b50b88ebdadb188109cfafb
    Port:          8080/TCP
    Host Port:     0/TCP
    Command:
      /usr/bin/service-catalog
    Args:
      controller-manager
      -v
      5
      --leader-election-namespace
      kube-service-catalog
      --broker-relist-interval
      5m
      --feature-gates
      OriginatingIdentity=true
    State:          Running
      Started:      Wed, 26 Sep 2018 15:27:08 -0600
    Last State:     Terminated
      Reason:       Error
      Exit Code:    255
      Started:      Tue, 25 Sep 2018 09:55:08 -0600
      Finished:     Wed, 26 Sep 2018 15:24:52 -0600
    Ready:          True
    Restart Count:  1
    Environment:
      K8S_NAMESPACE:  kube-service-catalog (v1:metadata.namespace)
    Mounts:
      /var/run/kubernetes-service-catalog from service-catalog-ssl (ro)
      /var/run/secrets/kubernetes.io/serviceaccount from service-catalog-controller-token-spmwm (ro)
Conditions:
  Type           Status
  Initialized    True
  Ready          True
  PodScheduled   True
Volumes:
  service-catalog-ssl:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  apiserver-ssl
    Optional:    false
  service-catalog-controller-token-spmwm:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  service-catalog-controller-token-spmwm
    Optional:    false
QoS Class:       BestEffort
Node-Selectors:  openshift-infra=apiserver
Tolerations:     node.kubernetes.io/disk-pressure:NoSchedule
                 node.kubernetes.io/memory-pressure:NoSchedule
                 node.kubernetes.io/not-ready:NoExecute
                 node.kubernetes.io/unreachable:NoExecute
Events:
  Type     Reason                  Age                From                                           Message
  ----     ------                  ----               ----                                           -------
  Warning  FailedCreatePodSandBox  19m                kubelet, openshift-master1.xxx.xxx        Failed create pod sandbox: rpc error: code = Unknown desc = failed to set up sandbox container "d3ce84298d273f0d3a8ab7c9ed7afc1f7c00157dcab7c802c3963fb2673a280a" network for pod "controller-manager-w4sl6": NetworkPlugin cni failed to set up pod "controller-manager-w4sl6_kube-service-catalog" network: OpenShift SDN network process is not (yet?) available
  Warning  NetworkFailed           19m                openshift-sdn, openshift-master1.xxx.xxx  The pod's network interface has been lost and the pod will be stopped.
  Normal   SandboxChanged          19m (x2 over 20m)  kubelet, openshift-master1.xxx.xxx        Pod sandbox changed, it will be killed and re-created.
  Normal   Pulled                  18m                kubelet, openshift-master1.xxx.xxx        Container image "docker.io/openshift/origin-service-catalog:v3.9.0" already present on machine
  Normal   Created                 18m                kubelet, openshift-master1.xxx.xxx        Created container

I followed the recommendations in comment 15.  It appears that the kube-service-catalog is not functioning properly; however, I am not sure how to correct it.  I was considering redeploying the pod, but it shows that it is still v3.9, so I don't know what would happen if I did.  Any suggestions?

Comment 44 Randolph Morgan 2018-10-17 22:39:27 UTC
So I found the problem: it was the service catalog.  The solution was to uninstall it and then run the upgrade.  I was speaking with a friend who is an expert in OpenShift, and he told me that he has had lots of problems with the service catalog and usually does not install it.  It seems to me that something that can cause this many problems should never have been pushed to production code, but should have remained in dev until the bugs had been worked out.

Comment 47 Scott Dodson 2018-10-23 19:11:51 UTC
https://github.com/openshift/openshift-ansible/pull/10497 backports the aggregated API wait changes from release-3.11 to release-3.10
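
As a rough manual equivalent of what that wait verifies (an assumption based on the doc text for this bug, not on the PR diff), one can list any aggregated APIService that is not reporting Available:

# an empty result means API discovery (and aliases like scc/istag) should work
oc get apiservices -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.conditions[?(@.type=="Available")].status}{"\n"}{end}' | grep -v 'True$'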

Comment 56 liujia 2018-11-09 02:41:44 UTC
Cannot reproduce this bug in QE's environment. After going through the comments and discussing with the service catalog folks, we can only verify that PR 10497 is merged and that an upgrade with the service catalog installed works well and passes the tasks changed in the PR. QE suggests that customers who encountered this issue get a pre-release build that includes the PR, to better confirm that it fixes the issue.

Version:
ansible-2.4.6.0-1.el7ae.noarch
openshift-ansible-3.10.66-1.git.0.3c3a83a.el7.noarch

Checked that the PR was merged. Upgrade against OCP with the service catalog installed succeeded.

Comment 59 Scott Dodson 2018-12-05 15:14:40 UTC
This was resolved in openshift-ansible-3.10.66-1.git.0.3c3a83a.el7

