Upgrading from Origin v3.9 to v3.10 failed. After correcting the problem, the upgrade now fails at the "Remove the image stream tag" task with the following error:

TASK [openshift_node_group : Remove the image stream tag] *************************************************************************
fatal: [openshift-master1.chem.byu.edu]: FAILED! => {"changed": true, "cmd": "oc --config=/etc/origin/master/admin.kubeconfig delete -n openshift-node istag node:v3.10 --ignore-not-found", "delta": "0:00:54.354315", "end": "2018-08-23 16:37:35.608204", "msg": "non-zero return code", "rc": 1, "start": "2018-08-23 16:36:41.253889", "stderr": "error: the server doesn't have a resource type \"istag\"", "stderr_lines": ["error: the server doesn't have a resource type \"istag\""], "stdout": "", "stdout_lines": []}

Researching this problem, I found that the same issue exists when upgrading from 3.10 to 3.11; see bugzilla report 1622255. After looking at the solution reported by Michael Gugino, I tracked down the code in github (https://github.com/openshift/openshift-ansible/commit/dc77308bdffb5f3a69bbdb17466898dd6dc339d2) and modified the following files:

roles/openshift_bootstrap_autoapprover/tasks/main.yml
roles/openshift_node_group/tasks/sync.yml
roles/openshift_sdn/tasks/main.yml

This corrected the fault listed above; however, the upgrade now fails at the next step:

TASK [openshift_node_group : Apply the config] ************************************************************************************
fatal: [openshift-master1.chem.byu.edu]: FAILED! => {"changed": true, "cmd": "oc --config=/etc/origin/master/admin.kubeconfig apply -f /tmp/ansible-pBI2Hx", "delta": "0:00:21.119941", "end": "2018-08-31 09:59:36.502122", "msg": "non-zero return code", "rc": 1, "start": "2018-08-31 09:59:15.382181", "stderr": "Error from server (AlreadyExists): error when creating \"/tmp/ansible-pBI2Hx/sync-images.yaml\": imagestreamtag.image.openshift.io \"node:v3.10\" already exists", "stderr_lines": ["Error from server (AlreadyExists): error when creating \"/tmp/ansible-pBI2Hx/sync-images.yaml\": imagestreamtag.image.openshift.io \"node:v3.10\" already exists"], "stdout": "serviceaccount \"sync\" unchanged\nrolebinding.authorization.openshift.io \"sync-node-config-reader-binding\" unchanged\ndaemonset.apps \"sync\" configured", "stdout_lines": ["serviceaccount \"sync\" unchanged", "rolebinding.authorization.openshift.io \"sync-node-config-reader-binding\" unchanged", "daemonset.apps \"sync\" configured"]}
to retry, use: --limit @/usr/share/ansible/openshift-ansible/playbooks/byo/openshift-cluster/upgrades/v3_10/upgrade.retry

This was frustrating because the error now shows that the istag exists, and after looking at the files, sure enough it does. After looking through the process, I found that the files the "Remove the image stream tag" task references are actually copied over to /tmp three steps prior to removing the tag.
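For reference, the tag that the failing task tries to remove can also be checked and deleted by hand before re-running the playbook. This is just a rough sketch, using the full imagestreamtag resource name rather than the istag alias (adjust the kubeconfig path to your environment):

oc --config=/etc/origin/master/admin.kubeconfig get imagestreamtag -n openshift-node
oc --config=/etc/origin/master/admin.kubeconfig delete imagestreamtag node:v3.10 -n openshift-node --ignore-not-found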
I modified the three files mentioned above to be as follows:

---
- name: Ensure project exists
  oc_project:
    name: openshift-node
    state: present
    node_selector:
    - ""

- name: Make temp directory for templates
  command: mktemp -d /tmp/ansible-XXXXXX
  register: mktemp
  changed_when: False

# TODO: temporary until we fix apply for image stream tags
- name: Remove the image stream tag
  command: >
    {{ openshift_client_binary }} --config={{ openshift.common.config_base }}/master/admin.kubeconfig
    delete -n openshift-node istag node:v3.10 --ignore-not-found
  register: l_os_istag_del
  # The istag might not be there, so we want to not fail in that case.
  failed_when:
  - l_os_istag_del.rc != 0
  - "'have a resource type' not in l_os_istag_del.stderr"

- name: Copy templates to temp directory
  copy:
    src: "{{ item }}"
    dest: "{{ mktemp.stdout }}/{{ item | basename }}"
  with_fileglob:
  - "files/*.yaml"

- name: Update the image tag
  yedit:
    src: "{{ mktemp.stdout }}/sync-images.yaml"
    key: 'tag.from.name'
    value: "{{ osn_image }}"

- name: Ensure the service account can run privileged
  oc_adm_policy_user:
    namespace: "openshift-node"
    resource_kind: scc
    resource_name: privileged
    state: present
    user: "system:serviceaccount:openshift-node:sync"

- name: Apply the config
  shell: >
    {{ openshift_client_binary }} --config={{ openshift.common.config_base }}/master/admin.kubeconfig
    apply -f {{ mktemp.stdout }}

- name: Remove temp directory

The upgrade proceeded without incident. So changing the order of the tasks in the files allows the upgrade to be re-run without incident.
Backport created for 3.10: https://github.com/openshift/openshift-ansible/pull/9911
For any issues, please attach the inventory and variables plus the full ansible-playbook -vvv output (three v's).
@Randolph Morgan @Matthew Robson You are seeing this error because you have pr9780 in: that PR stops the task [openshift_node_group : Remove the image stream tag] from failing even when it did not actually remove istag node:v3.10 due to some error. So when the run reaches the task [Apply the config], it reports "imagestreamtag.image.openshift.io \"node:v3.10\" already exists". The root cause is that [openshift_node_group : Remove the image stream tag] did not actually remove istag node:v3.10. The log in the attachment shows this:

TASK [openshift_node_group : Remove the image stream tag] **********************************************
..
"rc": 1, "start": "2018-09-10 10:19:22.813434", "stderr": "error: the server doesn't have a resource type \"istag\"",

TASK [openshift_node_group : Apply the config] *********************************************************
..
"stderr": "Error from server (AlreadyExists): error when creating \"/tmp/ansible-l1G8UE/sync-images.yaml\": imagestreamtag.image.openshift.io \"node:v3.10\" already exists",

Then to the question of why the task [openshift_node_group : Remove the image stream tag] failed (refer to bz1622255, which Michael hit in step 3): I tried the command from that task several times manually. Even if istag node:v3.10 has already been deleted, running the command again should not produce an error like "error: the server doesn't have a resource type \"istag\""; it simply gives no output, because tag v3.10 was removed and no longer exists. I then discussed this with other folks who tested the command, and found bz1577520, which shows that this error happens when the server cannot be reached at the time the command is run. Digging through the log again, I found that the previous task hits the same error:

TASK [openshift_node_group : Ensure the service account can run privileged] ****************************
..
"msg": { "cmd": "/bin/oc get scc privileged -o json -n openshift-node", "results": [ {} ], "returncode": 1, "stderr": "error: the server doesn't have a resource type \"scc\"\n", "stdout": "" ...

TASK [openshift_node_group : Remove the image stream tag] **********************************************
"msg": "non-zero return code", "rc": 1, "start": "2018-09-10 10:19:22.813434", "stderr": "error: the server doesn't have a resource type \"istag\"",
..
TASK [openshift_node_group : Apply the config] *********************************************************

So, @Michael, I think we now need to find out why the server was not reachable at that time. It seems pr9780 does not fix the issue in bz1622255, nor this bug, and this bug should be a duplicate of that one. I will re-open that one too. All of the above is just my analysis based on the current info/log/bug; please correct me if my guess is wrong.
Totally agree with comment 11. In the initial log and in comment 8, in the "Remove the image stream tag" task, oc says "the server doesn't have a resource type istag", and then in the following task oc says "imagestreamtag.image.openshift.io 'node:v3.10' already exists". These contradict each other. In a healthy cluster, whether it is 3.9 or 3.10, "oc get istag" always succeeds and never says "the server doesn't have a resource type istag". The only possibility is that the master server is not fully up when the "Ensure the service account can run privileged" and "Remove the image stream tag" tasks run, which is why it says "error: the server doesn't have a resource type 'scc'" and "error: the server doesn't have a resource type 'istag'". So the key point is that before running these tasks, the upgrade should do a strict pre-check to make sure the server is really ready, and at the very least should not be hitting the "the server doesn't have a resource type" error. PR #9911, by contrast, tries to ignore such errors, which would make the following tasks even more unstable.
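To illustrate, a pre-check of that sort could be a simple retry loop on a cheap discovery call before the scc/istag tasks run. This is only a sketch, not an actual patch; the task and register names are made up for illustration, while the client binary and config variables are the ones the role already uses:

- name: Wait until the API server can resolve the istag alias (sketch only)
  command: >
    {{ openshift_client_binary }}
    --config={{ openshift.common.config_base }}/master/admin.kubeconfig
    get istag -n openshift-node
  register: l_istag_discovery
  until: l_istag_discovery.rc == 0
  retries: 30
  delay: 10
  changed_when: false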
The issue with getting a resource type has to do with the inability of the api server to correctly discover and return the information for an alias like scc or istag:

"stderr": "error: the server doesn't have a resource type \"scc\"\n",
"stderr": "error: the server doesn't have a resource type \"istag\"",

We hit the issue because the service catalog was in a degraded state, causing calls to the API to be throttled and time out. On the first run of the 3.10 upgrade, there was a firewall issue when bringing up etcd on master 1, meaning the old service was down and the new static pod could not start. In 3.9, by default, the service catalog apiserver only points to one etcd server, in this case the one running on master1. This led to all alias calls failing and regular oc commands taking upwards of 1 minute.

root 43250 126 7.3 1527892 1203308 ? SNsl Sep04 14332:23 /usr/bin/service-catalog apiserver --storage-type etcd --secure-port 6443 --etcd-servers https://master1:2379 --etcd-cafile /etc/origin/master/master.etcd-ca.crt --etcd-certfile /etc/origin/master/master.etcd-client.crt --etcd-keyfile /etc/origin/master/master.etcd-client.key -v 3 --cors-allowed-origins localhost --admission-control KubernetesNamespaceLifecycle,DefaultServicePlan,ServiceBindingsLifecycle,ServicePlanChangeValidator,BrokerAuthSarCheck --feature-gates OriginatingIdentity=true

We came to this conclusion by running the oc command with loglevel 9 and seeing over 60 429s on one oc get invocation:

oc get scc privileged -o json -n openshift-node --loglevel=9:
I0912 11:11:29.359077 69349 round_trippers.go:405] GET https://masterlbredhat.com:8443/apis/servicecatalog.k8s.io/v1beta1?timeout=32s 429 Too Many Requests in 65 milliseconds

At the end of the request, you would see:

I0912 11:11:30.619148 69349 request.go:897] Response Body: Too many requests, please try again later.
I0912 11:11:30.619338 69349 request.go:1099] body was not decodable (unable to check for Status): couldn't get version/kind; json parse error: json: cannot unmarshal string into Go value of type struct { APIVersion string "json:\"apiVersion,omitempty\""; Kind string "json:\"kind,omitempty\"" }
I0912 11:11:30.619357 69349 cached_discovery.go:77] skipped caching discovery info due to the server has received too many requests and has asked us to try again later
F0912 11:11:30.620980 69349 helpers.go:119] error: the server doesn't have a resource type "scc"

We did an:

oc edit -n kube-service-catalog ds/apiserver

and changed it to point to the second etcd server. Once we bounced the pod, all commands started executing very quickly again, including aliases. This allowed us to complete the upgrade without any fixes or rearranging of the playbook execution.

If you are seeing 'the server doesn't have a resource type', there is a good chance you have an issue with the API, and running with log level 9 as above will help you root cause it. The upgrade to OCP 3.10 modifies the service catalog to point to all 3 etcd servers, which would prevent something like this in the future.
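For anyone checking this in their own environment, the inspection and fix above boil down to roughly the following. This is only a sketch; the DaemonSet name comes from the default install, and the app=apiserver label selector is an assumption:

# Show which etcd endpoints the service-catalog apiserver is currently pointed at
oc -n kube-service-catalog get ds apiserver -o jsonpath='{.spec.template.spec.containers[0].args}'

# Repoint --etcd-servers at a healthy etcd member, then bounce the pods so the
# DaemonSet recreates them with the new arguments
oc edit -n kube-service-catalog ds/apiserver
oc -n kube-service-catalog delete pod -l app=apiserver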
@Matthew Robson - your suggestion worked well until the time came to apply the config, and now I am seeing the same problem I was seeing before:

TASK [openshift_node_group : Apply the config] ************************************************************************************
fatal: [openshift-master1.chem.byu.edu]: FAILED! => {"changed": true, "cmd": "oc --config=/etc/origin/master/admin.kubeconfig apply -f /tmp/ansible-aZWqrY", "delta": "0:00:21.153334", "end": "2018-09-13 09:28:53.068766", "msg": "non-zero return code", "rc": 1, "start": "2018-09-13 09:28:31.915432", "stderr": "Error from server (AlreadyExists): error when creating \"/tmp/ansible-aZWqrY/sync-images.yaml\": imagestreamtag.image.openshift.io \"node:v3.10\" already exists", "stderr_lines": ["Error from server (AlreadyExists): error when creating \"/tmp/ansible-aZWqrY/sync-images.yaml\": imagestreamtag.image.openshift.io \"node:v3.10\" already exists"], "stdout": "serviceaccount \"sync\" unchanged\nrolebinding.authorization.openshift.io \"sync-node-config-reader-binding\" unchanged\ndaemonset.apps \"sync\" configured", "stdout_lines": ["serviceaccount \"sync\" unchanged", "rolebinding.authorization.openshift.io \"sync-node-config-reader-binding\" unchanged", "daemonset.apps \"sync\" configured"]}

PLAY RECAP ************************************************************************************************************************
localhost : ok=14 changed=0 unreachable=0 failed=0
openshift-infra1.chem.byu.edu : ok=18 changed=0 unreachable=0 failed=0
openshift-infra2.chem.byu.edu : ok=18 changed=0 unreachable=0 failed=0
openshift-master1.chem.byu.edu : ok=260 changed=43 unreachable=0 failed=1
openshift-master2.chem.byu.edu : ok=102 changed=9 unreachable=0 failed=0
openshift-master3.chem.byu.edu : ok=102 changed=9 unreachable=0 failed=0
openshift-node1.chem.byu.edu : ok=18 changed=0 unreachable=0 failed=0
openshift-node2.chem.byu.edu : ok=18 changed=0 unreachable=0 failed=0
openshift-node3.chem.byu.edu : ok=18 changed=0 unreachable=0 failed=0
vpn1.chem.byu.edu : ok=1 changed=0 unreachable=0 failed=0

Failure summary:
1. Hosts: openshift-master1.chem.byu.edu
   Play: Configure components that must be available prior to upgrade
   Task: Apply the config
   Message: non-zero return code

What frustrates me about this is that if the tag already exists, it should accept the existing tag and move on, not fail. Failing on something that it put there itself seems like a bad idea, or someone missed a step.
# Create an OSEv3 group that contains the master, nodes, etcd, and lb groups.
# The lb group lets Ansible configure HAProxy as the load balancing solution.
# Comment lb out if your load balancer is pre-configured.
[OSEv3:children]
masters
nodes
etcd
nfs

# Set variables common for all OSEv3 hosts
[OSEv3:vars]
ansible_ssh_user=root
openshift_deployment_type=origin
openshift_service_type=origin
enable_docker_excluder=false
openshift_disable_check=disk_availability
openshift_enable_service_catalog=false
template_service_broker_install=false
ansible_service_broker_install=false

# Uncomment the following to enable htpasswd authentication; defaults to DenyAllPasswordIdentityProvider.
openshift_master_identity_providers=[{'name': 'htpasswd_auth', 'login': 'true', 'challenge': 'true', 'kind': 'HTPasswdPasswordIdentityProvider', 'filename': '/etc/origin/master/htpasswd'}, {'name': 'IPA', 'challenge': 'true', 'login': 'true', 'mappingMethod': 'claim', 'kind': 'LDAPPasswordIdentityProvider', 'attributes': {'id': ['dn'], 'email': ['mail'], 'name': ['cn'], 'preferredUsername': ['uid']}, 'bindDN': 'uid=user,cn=sysaccounts,cn=etc,dc=xxx,dc=xxx,dc=xxx', 'bindPassword': '0000', 'ca': '/etc/origin/master/ldap_ca.crt', 'insecure': 'false', 'url': 'ldaps://ipa.xxx.xxx:636/cn=users,cn=accounts,dc=xxx,dc=xxx,dc=xxx?uid?sub?(&(uid=*)(memberOf=cn=xxx,cn=groups,cn=accounts,dc=xxx,dc=xxx,dc=xxx))'}]
openshift_master_htpasswd_users={'user': '0000'}
openshift_master_ldap_ca_file=/home/root/ldap_ca.crt

# Native high availability cluster method with optional load balancer.
# If no lb group is defined installer assumes that a load balancer has
# been preconfigured. For installation the value of
# openshift_master_cluster_hostname must resolve to the load balancer
# or to one or all of the masters defined in the inventory if no load
# balancer is present.
openshift_master_cluster_method=native

# set the following line to be the vm that will start as the load balancer
openshift_master_cluster_hostname=openshift-masters.xxx.xxx
openshift_master_cluster_public_hostname=openshift.xxx.xxx
openshift_master_default_subdomain=apps.xxx.xxx

# Router selector (optional)
# Router will only be created if nodes matching this label are present.
# Default value: 'region=infra'
openshift_hosted_router_selector='region=infra'

# default project node selector
osm_default_node_selector='region=primary'

# override the default controller lease ttl
#osm_controller_lease_ttl=30

# Configure the multi-tenant SDN plugin (default is 'redhat/openshift-ovs-subnet')
os_sdn_network_plugin_name='redhat/openshift-ovs-multitenant'

# External NFS Host
# NFS volume must already exist with path "nfs_directory/_volume_name" on
# the storage_host. For example, the remote volume path using these
# options would be "nfs.example.com:/exports/registry"
# It looks like this automatically maps this directory to the nodes
openshift_enable_unsupported_configurations=True
openshift_hosted_registry_storage_kind=nfs
openshift_hosted_registry_storage_access_modes=['ReadWriteMany']
openshift_hosted_registry_storage_host=vpn1.xxx.xxx
openshift_hosted_registry_storage_nfs_directory=/exports
openshift_hosted_registry_storage_volume_name=registry
openshift_hosted_registry_storage_volume_size=322Gi

# Configure custom ca certificate
#openshift_master_ca_certificate={'certfile': '/etc/origin/master/ca.crt', 'keyfile': '/etc/origin/master.key'}

[nfs]
vpn1.xxx.xxx

# host group for masters
# changed in DNS to point openshift-masters.xxx.xxx to multiple places, as Garrett suggested.
[masters]
openshift-master[1:3].xxx.xxx
#openshift-master1.xxx.xxx
#openshift-master3.xxx.xxx

# host group for etcd, hosted on masters
# same as masters group.
[etcd]
openshift-master[1:3].xxx.xxx
#openshift-master1.xxx.xxx
#openshift-master3.xxx.xxx

# host group for nodes, includes region info
[nodes]
openshift-master[1:3].xxx.xxx openshift_node_group_name='node-config-master'
openshift-infra[1:2].xxx.xxx openshift_node_group_name='node-config-infra'
openshift-node[1:3].xxx.xxx openshift_node_group_name='node-config-compute'
Created attachment 1483121 [details] output of playbook -vvv
After watching this run repeatedly, I am increasingly convinced that the remove-image-tag task is looking in the wrong directory. The files that are copied and used to create the config are placed in /tmp, and those files actually do have an istag in them. The remove-image-tag task does not remove this tag because it is reading different files that don't have an istag.
@Randolph, according to comment 15, QE's guess (comments 11 and 12) is right. This is not related to the "remove image tag" step, but to the inability of the api server to respond. In your log, the task prior to "Remove the image stream tag" is returning:

"stderr": "error: the server doesn't have a resource type \"scc\"\n",
"stderr": "error: the server doesn't have a resource type \"istag\"",

I think you need to do some investigation into why your api server cannot correctly discover and return the information for aliases like scc or istag.
So I followed the suggestions in comment 15. I ran the "oc get scc privileged -o json -n openshift-node --loglevel=9" command and had similar results to what Matthew was describing. I then ran "oc edit -n kube-service-catalog ds/apiserver" and changed the etcd server to one of the other masters. There are 3 masters in my cluster; the upgrade is triggered from master1, and the etcd server is now pointing to master3. I chose master3 because master2 had a communication issue that has since been resolved. I am not seeing any instability in the cluster, but I also have a harder time finding the logs. I was working with a friend who is an expert in OpenShift, and he looked things over for me, but we found no issues with the stability of the cluster.

To date, I have followed Matthew's instructions, and the results in comment 16 show what happened. I would love to do some deeper investigation, but I am not even sure where to start at this point. I have looked at the files mentioned in comment 19, both the originals and those in the tmp directory, and what I describe is accurate. The original files do not have any scc or istag entries; however, the files in the tmp directory have both. If the remove-istag step of the upgrade is supposed to remove the istag, it should be removing it from the files in the tmp directory and not the originals. That way, when the files get copied to their permanent location and the config is applied, the istag won't be there to cause problems. Admittedly I am not a developer, I am a sysadmin, so perhaps I have this wrong, but from my perspective it just seems logical that this would be the case.

I am happy to run this again and capture whatever logs you would like to see. I do appreciate the assistance I receive and want to contribute to making this product even better than it currently is. Since my job is to take the vision of the developers and make it work in environments perhaps not envisioned at the time, I am happy to help.
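For what it's worth, the comparison between the shipped files and the ones in the temp directory is essentially a grep over both locations, roughly like this (the /tmp path changes on every run, so the wildcard is only illustrative):

grep -in 'istag\|imagestreamtag\|scc' \
  /usr/share/ansible/openshift-ansible/roles/openshift_node_group/files/*.yaml \
  /tmp/ansible-*/*.yaml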
If you use the full command versus the scc or istag alias, do they work?

oc get securitycontextconstraints
oc get imagestreamtag

Can you post the output from log level 9 if you sanitize it?
Created attachment 1483373 [details] log of the oc get command oc get securitycontextconstraints privileged -o json -n openshift-node --loglevel=9 > scc.log
Created attachment 1483374 [details] results of oc get command for scc
Created attachment 1483375 [details] results of oc get command for istag I sent this to a file for the log, and the log file was empty.
I ran both commands; the results and associated log files are attached. The istag.log file was empty, so it would not let me upload it.
Created attachment 1483389 [details] Output of Journalctl -r -u origin-node These are the logs for the oc get securitycontextconstraints privileged -o json -n openshift-node --loglevel=9 > scc.log
Created attachment 1483390 [details] Output of Journalctl -r -u origin-node These are the logs for oc get imagestreamtag privileged -o json -n openshift-node --loglevel=9 > istag.log
I am still experiencing the problem with the istag removal. I keep seeing others who have experienced this, but their solution is to just rebuild. This is a production environment, so rebuilding is not an option. I need a solution to this, so any ideas would be appreciated.
Here is the result of my last attempt. The pauses were inserted by me so that I could manually remove the istag. Prior to the latest update yesterday, this worked and allowed me to get past this problem, but now it doesn't. It appears that the file is being held in memory and the changes are no longer visible. If you can't tell, I am getting desperate and extremely frustrated. If there is some information you need, please tell me and I will get it for you.

TASK [openshift_node_group : Make temp directory for templates] ********************************************
ok: [openshift-master1.chem.byu.edu]

TASK [openshift_node_group : Copy templates to temp directory] *********************************************
changed: [openshift-master1.chem.byu.edu] => (item=/usr/share/ansible/openshift-ansible/roles/openshift_node_group/files/sync-images.yaml)
changed: [openshift-master1.chem.byu.edu] => (item=/usr/share/ansible/openshift-ansible/roles/openshift_node_group/files/sync-policy.yaml)
changed: [openshift-master1.chem.byu.edu] => (item=/usr/share/ansible/openshift-ansible/roles/openshift_node_group/files/sync.yaml)

TASK [openshift_node_group : pause] ************************************************************************
Pausing for 300 seconds
(ctrl+C then 'C' = continue early, ctrl+C then 'A' = abort)
[Press 'C' to continue the play or 'A' to abort]
ok: [openshift-master1.chem.byu.edu]

TASK [openshift_node_group : Update the image tag] *********************************************************
changed: [openshift-master1.chem.byu.edu]

TASK [openshift_node_group : Ensure the service account can run privileged] ********************************
ok: [openshift-master1.chem.byu.edu]

TASK [openshift_node_group : Remove the image stream tag] **************************************************
changed: [openshift-master1.chem.byu.edu]

TASK [openshift_node_group : pause] ************************************************************************
Pausing for 300 seconds
(ctrl+C then 'C' = continue early, ctrl+C then 'A' = abort)
[Press 'C' to continue the play or 'A' to abort]
ok: [openshift-master1.chem.byu.edu]

TASK [openshift_node_group : Apply the config] *************************************************************
fatal: [openshift-master1.chem.byu.edu]: FAILED! => {"changed": true, "cmd": "oc --config=/etc/origin/master/admin.kubeconfig apply -f /tmp/ansible-V4QMRF", "delta": "0:00:19.224360", "end": "2018-09-25 16:21:27.219506", "msg": "non-zero return code", "rc": 1, "start": "2018-09-25 16:21:07.995146", "stderr": "Error from server (AlreadyExists): error when creating \"/tmp/ansible-V4QMRF/sync-images.yaml\": imagestreamtag.image.openshift.io \"node:v3.10\" already exists", "stderr_lines": ["Error from server (AlreadyExists): error when creating \"/tmp/ansible-V4QMRF/sync-images.yaml\": imagestreamtag.image.openshift.io \"node:v3.10\" already exists"], "stdout": "serviceaccount \"sync\" unchanged\nrolebinding.authorization.openshift.io \"sync-node-config-reader-binding\" unchanged\ndaemonset.apps \"sync\" unchanged", "stdout_lines": ["serviceaccount \"sync\" unchanged", "rolebinding.authorization.openshift.io \"sync-node-config-reader-binding\" unchanged", "daemonset.apps \"sync\" unchanged"]}
to retry, use: --limit @/usr/share/ansible/openshift-ansible/playbooks/byo/openshift-cluster/upgrades/v3_10/upgrade.retry

PLAY RECAP *************************************************************************************************
localhost : ok=14 changed=0 unreachable=0 failed=0
openshift-infra1.chem.byu.edu : ok=18 changed=0 unreachable=0 failed=0
openshift-infra2.chem.byu.edu : ok=18 changed=0 unreachable=0 failed=0
openshift-master1.chem.byu.edu : ok=263 changed=43 unreachable=0 failed=1
openshift-master2.chem.byu.edu : ok=103 changed=9 unreachable=0 failed=0
openshift-master3.chem.byu.edu : ok=103 changed=9 unreachable=0 failed=0
openshift-node1.chem.byu.edu : ok=18 changed=0 unreachable=0 failed=0
openshift-node2.chem.byu.edu : ok=18 changed=0 unreachable=0 failed=0
openshift-node3.chem.byu.edu : ok=18 changed=0 unreachable=0 failed=0
vpn1.chem.byu.edu : ok=1 changed=0 unreachable=0 failed=0

Failure summary:
1. Hosts: openshift-master1.chem.byu.edu
   Play: Configure components that must be available prior to upgrade
   Task: Apply the config
   Message: non-zero return code
I do have logs from the last upgrade attempt. I ran the following commands to acquire them:

journalctl --no-pager > node.log
master-logs etcd etcd &> etcd.log
master-logs api api &> api.log
master-logs controllers controllers &> controllers.log

The node.log file is too big to upload; is there another way for me to attach it to this ticket?
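If compressing it is acceptable, I could shrink it before attaching, for example by limiting it to the days of the upgrade attempt and gzipping the result (the date below is just an example):

journalctl --no-pager --since "2018-09-25" > node.log
gzip -9 node.log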
Created attachment 1487375 [details] output of: master-logs etcd etcd &> etcd.log
Created attachment 1487376 [details] output of: master-logs api api &> api.log
Created attachment 1487377 [details] output of: master-logs controllers controllers &> controllers.log
I ran: "oc describe pods -n kube-service-catalog" and the results follow: Name: apiserver-h68gc Namespace: kube-service-catalog Node: openshift-master1.xxx.xxx/xxx.xxx.xxx.xxx Start Time: Thu, 20 Sep 2018 13:40:10 -0600 Labels: app=apiserver controller-revision-hash=2395821796 pod-template-generation=9 Annotations: ca_hash=74b331eee9c40a5edb321fea9ccb2f84b9dffafd openshift.io/scc=hostmount-anyuid Status: Running IP: 10.128.0.59 Controlled By: DaemonSet/apiserver Containers: apiserver: Container ID: docker://4c89125a4b8fde921bf7811eb42621bf498fdd38e92f197fabcbf2cef8760aff Image: docker.io/openshift/origin-service-catalog:v3.9.0 Image ID: docker-pullable://docker.io/openshift/origin-service-catalog@sha256:d86a5f8f57b18041c379ebfc6ebe3eeec04373060b50b88ebdadb188109cfafb Port: 6443/TCP Host Port: 0/TCP Command: /usr/bin/service-catalog Args: apiserver --storage-type etcd --secure-port 6443 --etcd-servers https://openshift-master3.xxx.xxx:4001 --etcd-cafile /etc/origin/master/master.etcd-ca.crt --etcd-certfile /etc/origin/master/master.etcd-client.crt --etcd-keyfile /etc/origin/master/master.etcd-client.key -v 10 --cors-allowed-origins localhost --admission-control KubernetesNamespaceLifecycle,DefaultServicePlan,ServiceBindingsLifecycle,ServicePlanChangeValidator,BrokerAuthSarCheck --feature-gates OriginatingIdentity=true State: Running Started: Wed, 26 Sep 2018 15:28:31 -0600 Last State: Terminated Reason: Error Exit Code: 1 Started: Wed, 26 Sep 2018 15:26:59 -0600 Finished: Wed, 26 Sep 2018 15:28:02 -0600 Ready: True Restart Count: 3 Environment: <none> Mounts: /etc/origin/master from etcd-host-cert (ro) /var/run/kubernetes-service-catalog from apiserver-ssl (ro) /var/run/secrets/kubernetes.io/serviceaccount from service-catalog-apiserver-token-tt725 (ro) Conditions: Type Status Initialized True Ready True PodScheduled True Volumes: apiserver-ssl: Type: Secret (a volume populated by a Secret) SecretName: apiserver-ssl Optional: false etcd-host-cert: Type: HostPath (bare host directory volume) Path: /etc/origin/master HostPathType: data-dir: Type: EmptyDir (a temporary directory that shares a pod's lifetime) Medium: service-catalog-apiserver-token-tt725: Type: Secret (a volume populated by a Secret) SecretName: service-catalog-apiserver-token-tt725 Optional: false QoS Class: BestEffort Node-Selectors: openshift-infra=apiserver Tolerations: node.kubernetes.io/disk-pressure:NoSchedule node.kubernetes.io/memory-pressure:NoSchedule node.kubernetes.io/not-ready:NoExecute node.kubernetes.io/unreachable:NoExecute Events: Type Reason Age From Message ---- ------ ---- ---- ------- Warning FailedCreatePodSandBox 20m kubelet, openshift-master1.xxx.xxx Failed create pod sandbox: rpc error: code = Unknown desc = failed to set up sandbox container "f3c7c498eb04b401b139032b2d38d7493852a64ca0faa852b8047f19bdf5dc3e" network for pod "apiserver-h68gc": NetworkPlugin cni failed to set up pod "apiserver-h68gc_kube-service-catalog" network: OpenShift SDN network process is not (yet?) available Normal SandboxChanged 19m (x2 over 20m) kubelet, openshift-master1.xxx.xxx Pod sandbox changed, it will be killed and re-created. Warning NetworkFailed 19m openshift-sdn, openshift-master1.xxx.xxx The pod's network interface has been lost and the pod will be stopped. 
Warning FailedMount 18m kubelet, openshift-master1.xxx.xxx MountVolume.SetUp failed for volume "service-catalog-apiserver-token-tt725" : Get https://openshift-masters.xxx.xxx:8443/api/v1/namespaces/kube-service-catalog/secrets/service-catalog-apiserver-token-tt725: read tcp xxx.xxx.xxx.xxx:60936->xxx.xxx.xxx.xxx:8443: use of closed network connection Warning FailedMount 18m kubelet, openshift-master1.xxx.xxx MountVolume.SetUp failed for volume "service-catalog-apiserver-token-tt725" : grpc: the client connection is closing Warning FailedMount 17m kubelet, openshift-master1.xxx.xxx MountVolume.SetUp failed for volume "service-catalog-apiserver-token-tt725" : Get https://openshift-masters.xxx.xxx:8443/api/v1/namespaces/kube-service-catalog/secrets/service-catalog-apiserver-token-tt725: read tcp xxx.xxx.xxx.xxx:33648->xxx.xxx.xxx.xxx:8443: use of closed network connection Warning FailedMount 17m kubelet, openshift-master1.xxx.xxx MountVolume.SetUp failed for volume "service-catalog-apiserver-token-tt725" : Get https://openshift-masters.xxx.xxx:8443/api/v1/namespaces/kube-service-catalog/secrets/service-catalog-apiserver-token-tt725: dial tcp xxx.xxx.xxx.xxx:8443: getsockopt: connection refused Warning BackOff 17m kubelet, openshift-master1.xxx.xxx Back-off restarting failed container Warning FailedMount 17m kubelet, openshift-master1.xxx.xxx MountVolume.SetUp failed for volume "service-catalog-apiserver-token-tt725" : Get https://openshift-masters.xxx.xxx:8443/api/v1/namespaces/kube-service-catalog/secrets/service-catalog-apiserver-token-tt725: dial tcp xxx.xxx.xxx.xxx:8443: getsockopt: connection refused Warning FailedMount 17m kubelet, openshift-master1.xxx.xxx MountVolume.SetUp failed for volume "service-catalog-apiserver-token-tt725" : Get https://openshift-masters.xxx.xxx:8443/api/v1/namespaces/kube-service-catalog/secrets/service-catalog-apiserver-token-tt725: net/http: TLS handshake timeout Normal Pulled 17m (x2 over 19m) kubelet, openshift-master1.xxx.xxx Container image "docker.io/openshift/origin-service-catalog:v3.9.0" already present on machine Normal Created 17m (x2 over 18m) kubelet, openshift-master1.xxx.xxx Created container Normal Started 17m (x2 over 18m) kubelet, openshift-master1.xxx.xxx Started container Name: controller-manager-w4sl6 Namespace: kube-service-catalog Node: openshift-master1.xxx.xxx/xxx.xxx.xxx.xxx Start Time: Tue, 25 Sep 2018 09:54:15 -0600 Labels: app=controller-manager controller-revision-hash=1456597385 pod-template-generation=2 Annotations: openshift.io/scc=restricted Status: Running IP: 10.128.0.60 Controlled By: DaemonSet/controller-manager Containers: controller-manager: Container ID: docker://b5b0f91c452abb4fbab93988ffeef65e44de3919f24b2a46a6106b7d2e489776 Image: docker.io/openshift/origin-service-catalog:v3.9.0 Image ID: docker-pullable://docker.io/openshift/origin-service-catalog@sha256:d86a5f8f57b18041c379ebfc6ebe3eeec04373060b50b88ebdadb188109cfafb Port: 8080/TCP Host Port: 0/TCP Command: /usr/bin/service-catalog Args: controller-manager -v 5 --leader-election-namespace kube-service-catalog --broker-relist-interval 5m --feature-gates OriginatingIdentity=true State: Running Started: Wed, 26 Sep 2018 15:27:08 -0600 Last State: Terminated Reason: Error Exit Code: 255 Started: Tue, 25 Sep 2018 09:55:08 -0600 Finished: Wed, 26 Sep 2018 15:24:52 -0600 Ready: True Restart Count: 1 Environment: K8S_NAMESPACE: kube-service-catalog (v1:metadata.namespace) Mounts: /var/run/kubernetes-service-catalog from service-catalog-ssl (ro) 
/var/run/secrets/kubernetes.io/serviceaccount from service-catalog-controller-token-spmwm (ro) Conditions: Type Status Initialized True Ready True PodScheduled True Volumes: service-catalog-ssl: Type: Secret (a volume populated by a Secret) SecretName: apiserver-ssl Optional: false service-catalog-controller-token-spmwm: Type: Secret (a volume populated by a Secret) SecretName: service-catalog-controller-token-spmwm Optional: false QoS Class: BestEffort Node-Selectors: openshift-infra=apiserver Tolerations: node.kubernetes.io/disk-pressure:NoSchedule node.kubernetes.io/memory-pressure:NoSchedule node.kubernetes.io/not-ready:NoExecute node.kubernetes.io/unreachable:NoExecute Events: Type Reason Age From Message ---- ------ ---- ---- ------- Warning FailedCreatePodSandBox 19m kubelet, openshift-master1.xxx.xxx Failed create pod sandbox: rpc error: code = Unknown desc = failed to set up sandbox container "d3ce84298d273f0d3a8ab7c9ed7afc1f7c00157dcab7c802c3963fb2673a280a" network for pod "controller-manager-w4sl6": NetworkPlugin cni failed to set up pod "controller-manager-w4sl6_kube-service-catalog" network: OpenShift SDN network process is not (yet?) available Warning NetworkFailed 19m openshift-sdn, openshift-master1.xxx.xxx The pod's network interface has been lost and the pod will be stopped. Normal SandboxChanged 19m (x2 over 20m) kubelet, openshift-master1.xxx.xxx Pod sandbox changed, it will be killed and re-created. Normal Pulled 18m kubelet, openshift-master1.xxx.xxx Container image "docker.io/openshift/origin-service-catalog:v3.9.0" already present on machine Normal Created 18m kubelet, openshift-master1.xxx.xxx Created container I followed the recommendations in comment 15. It appears that the kube-catalog is not functioning properly, however, I am not sure how to correct it. I was considering redeploying the pod, but it shows that it is still v3.9 and so I don't know what would happen if I did. Any suggestions?
So I found the problem: it was the service catalog. The solution was to uninstall it and then run the upgrade. I was speaking with a friend who is an expert in OpenShift, and he told me that he has had lots of problems with the service catalog and usually does not install it. It would seem to me that something that can cause this many problems should never have been pushed to production code, but should have remained in dev until the bugs had been worked out.
https://github.com/openshift/openshift-ansible/pull/10497 backports the aggregated API wait changes from release-3.11 to release-3.10
Cannot reproduce this bug in QE's env. We went through the comments and discussed with the service catalog folks; all we can verify is that pr10497 is merged and that an upgrade with the service catalog works well and passes the tasks touched by that PR. QE suggests that customers who encountered the issue get a pre-release build that includes the PR, to better confirm that it fixes the issue.

Version:
ansible-2.4.6.0-1.el7ae.noarch
openshift-ansible-3.10.66-1.git.0.3c3a83a.el7.noarch

Checked that the PR was merged. Upgrade against OCP with the service catalog succeeded.
This was resolved in openshift-ansible-3.10.66-1.git.0.3c3a83a.el7