I don't know if this information helps, but these are the versions after the control plane upgrade has failed.

oc get nodes (versions all the same as before the upgrade started):

MBP00478-2:~ bryn.ellis$ oc get nodes
NAME                              STATUS    ROLES            AGE    VERSION
ip-10-xxx-xxx-10.stage.ice.aws    Ready     compute,master   1y     v1.9.1+a0ce1bc657
ip-10-xxx-xxx-200.stage.ice.aws   Ready     infra            1y     v1.9.1+a0ce1bc657
ip-10-xxx-xxx-201.stage.ice.aws   Ready     infra            1y     v1.9.1+a0ce1bc657
ip-10-xxx-xxx-10.stage.ice.aws    Ready     compute,master   154d   v1.9.1+a0ce1bc657
ip-10-xxx-xxx-200.stage.ice.aws   Ready     infra            1y     v1.9.1+a0ce1bc657
ip-10-xxx-xxx-10.stage.ice.aws    Ready     compute,master   154d   v1.9.1+a0ce1bc657

The web console still works, but it shows the upgraded Kubernetes master version:

OpenShift Master:       v3.9.0+ba7faec-1
Kubernetes Master:      v1.10.0+b81c8f8
OpenShift Web Console:  v3.9.0+b600d46-dirty
I can confirm that by making changes to roles/openshift_facts/defaults/main.yml (change 8443 to 443):

`openshift_master_api_port: "443"`

...AND to roles/openshift_facts/library/openshift_facts.py (change 8443 to 443):

if 'master' in roles:
    defaults['master'] = dict(api_use_ssl=True, api_port='443',
                              controllers_port='8444',
                              console_use_ssl=True,
                              console_path='/console',
                              console_port='443', portal_net='172.30.0.0/16',
                              bind_addr='0.0.0.0',
                              session_max_seconds=3600,
                              session_name='ssn')

...and then running the upgrade again, it gets further.
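Before patching role defaults like this, it may be worth confirming which port the existing masters actually serve the API on, so the hack matches reality. A minimal check, assuming the default 3.x config location /etc/origin/master/master-config.yml:

# on a master: show the configured API bind address/port (servingInfo.bindAddress)
grep -A3 'servingInfo' /etc/origin/master/master-config.yml
# and confirm the API answers locally on 443
curl -sk https://localhost:443/healthz/ready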
Contacted the reporter on Slack:
1. Facts were cleaned.
2. The ansible version used is 2.5.2, which might be the cause here.

Bryn, would you mind trying to reproduce this with ansible 2.4 or 2.6? I'll check whether using ansible 2.5.2 makes it reproducible in my case.
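For reference, "cleaning facts" here usually means removing the cached openshift fact files. A sketch of the usual cleanup, assuming the default fact file location used by openshift-ansible (the exact commands the reporter ran aren't shown):

# remove the cached openshift facts on all hosts
ansible -i inventory_file nodes -m file -a "path=/etc/ansible/facts.d/openshift.fact state=absent"
# if ansible fact caching is enabled locally, clear that cache as well (location depends on ansible.cfg)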
I've downgraded ansible:

ansible --version
ansible 2.4.6.0
  config file = /root/openshift-ansible/ansible.cfg
  configured module search path = [u'/root/.ansible/plugins/modules', u'/usr/share/ansible/plugins/modules']
  ansible python module location = /usr/lib/python2.7/site-packages/ansible
  executable location = /bin/ansible
  python version = 2.7.5 (default, Aug 4 2017, 00:39:18) [GCC 4.8.5 20150623 (Red Hat 4.8.5-16)]

I restored back to the original 3.9, ran openshift_node_group (it ran successfully) and then ran upgrade_control_plane again. I can't decide whether upgrade_control_plane got further or not (I don't think it has), but I have hit a different error now. This error is already reported on Bugzilla - https://bugzilla.redhat.com/show_bug.cgi?id=1656645. I don't really know how to proceed with this new error, because I've commented out all my metrics inventory variables (I want to leave metrics alone and upgrade it later), and running 'ansible -i inventory_file master -a 'oc whoami'' came back with all 3 masters showing 'system:admin', which the bug report seems to suggest is correct. I've done another 'git pull' and am going to restore all instances back to 3.9 and then run again to see if I hit the problem at the same spot.
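For context, these are the two playbook runs referred to above as they are typically invoked from an openshift-ansible checkout; the paths are the standard 3.10 entry points rather than ones copied from this session, so adjust them to the local tree:

# create the node group configmaps, then upgrade the control plane
ansible-playbook -i inventory_file playbooks/openshift-master/openshift_node_group.yml
ansible-playbook -i inventory_file playbooks/byo/openshift-cluster/upgrades/v3_10/upgrade_control_plane.yml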
OK, I don't know what happened with the error related to bug 1656645 mentioned above, but it went through this time. It went through all 30 retries and failed, but it didn't bomb out of the upgrade; it just continued. Then it failed at the usual point, verifying the API, but using the correct port as per Comment 2. Looking back through the log in a bit more detail, it all seems to be related to the failure of the storage upgrade. Maybe the failure of the storage upgrade is what is causing the API verification to fail, because something isn't starting properly. I'm currently losing the will to live with this upgrade! :-( The next step is to restore back to 3.9 again and try one of the suggestions in Comment 6 of https://bugzilla.redhat.com/show_bug.cgi?id=1591053 to see if I can get the storage upgrade to work.
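To narrow down the storage failure: the retried step is essentially an API storage migration, which can also be run by hand against the cluster to see which resource types fail. The flags below are the commonly documented ones, not taken from this upgrade's log:

# re-run the storage migration manually and watch for per-resource errors
oc adm migrate storage --include='*' --confirm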
(In reply to Bryn Ellis from comment #5)
> Can't decide if upgrade_control_plane got further or not ( I don't think it
> has) but I have hit a different error now.

It certainly has progressed further - storage migration doesn't start before we ensure the API is up. Since it has proceeded to that step, it means the API check has passed.

(In reply to Bryn Ellis from comment #6)
> OK, I don't know what happened to get the error related to bug 1656645
> mentioned above but it went through this time. It went through all 30
> retries and failed but didn't bomb out of the upgrade, it just continued.
> Then it failed at the usual point verifying the API but using the correct
> port as per Comment 2..

This might mean the API server (or scheduler) got stuck. Please attach the output of `master-logs api api` and `master-logs controllers controllers` from the master nodes.
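In case it helps, on a 3.10 master (static pod control plane) those logs can be captured to files roughly like this; the /usr/local/bin path for master-logs is an assumption here, it may simply be on the PATH:

/usr/local/bin/master-logs api api &> /tmp/master-api.log
/usr/local/bin/master-logs controllers controllers &> /tmp/master-controllers.log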
Sorry, I've been trying all sorts of other things to get this to work since then, so I don't have those logs anymore. The current position I'm in is:

a) I still have the 'hacks' in to make sure it uses 443 and not 8443 for the API check.
b) I've added an entry to /etc/hosts to point the FQDN it uses for the API check at the first master. I've made this /etc/hosts edit on all 3 masters plus my ansible server. The idea behind this was to bypass my AWS ELB, to make sure it wasn't causing the API check to fail by dropping the masters out of the ELB.

Unfortunately, exactly the same issue has occurred. It is using 443 for the API check, but it still fails after 120 retries.

2019-04-03 14:10:07,639 p=2342 u=root | fatal: [ip-10-160-20-10.stage.ice.aws]: FAILED! => {
    "attempts": 120,
    "changed": false,
    "cmd": [
        "curl",
        "--silent",
        "--tlsv1.2",
        "--max-time",
        "2",
        "--cacert",
        "/etc/origin/master/ca-bundle.crt",
        "https://staging-ocp-cluster.ice-technology.com:443/healthz/ready"
    ],
    "delta": "0:00:00.010035",
    "end": "2019-04-03 14:10:07.619571",
    "failed": true,
    "invocation": {
        "module_args": {
            "_raw_params": "curl --silent --tlsv1.2 --max-time 2 --cacert /etc/origin/master/ca-bundle.crt https://staging-ocp-cluster.ice-technology.com:443/healthz/ready",
            "_uses_shell": false,
            "chdir": null,
            "creates": null,
            "executable": null,
            "removes": null,
            "stdin": null,
            "warn": false
        }
    },
    "msg": "non-zero return code",
    "rc": 7,
    "start": "2019-04-03 14:10:07.609536",
    "stderr": "",
    "stderr_lines": [],
    "stdout": "",
    "stdout_lines": []
}

and...

docker ps
CONTAINER ID   IMAGE                                    COMMAND                  CREATED          STATUS          PORTS   NAMES
7046c175db56   docker.io/openshift/origin-pod:v3.10.0   "/usr/bin/pod"           6 minutes ago    Up 5 minutes            k8s_POD_master-controllers-ip-10-160-20-10.stage.ice.aws_kube-system_7258811294607c245c6d909256a719d5_0
bf5f79f17468   docker.io/openshift/origin-pod:v3.10.0   "/usr/bin/pod"           6 minutes ago    Up 5 minutes            k8s_POD_master-api-ip-10-160-20-10.stage.ice.aws_kube-system_df602223db6f9a5a5492230ccb5ebce9_0
7c6b4ef25a78   96cf7dd047cb                             "/usr/bin/service-..."   20 minutes ago   Up 20 minutes           k8s_controller-manager_controller-manager-vmxvm_kube-service-catalog_46fa2bfa-8f91-11e8-a5f4-0ab1f54c2b38_109
69b641629aac   ff5dd2137a4f                             "/bin/sh -c '#!/bi..."   20 minutes ago   Up 20 minutes           k8s_etcd_master-etcd-ip-10-160-20-10.stage.ice.aws_kube-system_7c4462a08b9f01a3c928e36663e0f1b9_0
88a34b73ed0e   docker.io/openshift/origin-pod:v3.10.0   "/usr/bin/pod"           20 minutes ago   Up 20 minutes           k8s_POD_master-etcd-ip-10-160-20-10.stage.ice.aws_kube-system_7c4462a08b9f01a3c928e36663e0f1b9_0
bb610be0bbf1   96cf7dd047cb                             "/usr/bin/service-..."   22 minutes ago   Up 22 minutes           k8s_apiserver_apiserver-w42pp_kube-service-catalog_44845c61-8f91-11e8-a5f4-0ab1f54c2b38_61
f2c36c3f4041   4f63970098ae                             "sh run.sh"              22 minutes ago   Up 22 minutes           k8s_fluentd-elasticsearch_logging-fluentd-l7bbm_logging_8d0453ec-560e-11e9-b04f-0ab1f54c2b38_2
41ef39cb25f1   docker.io/openshift/origin-pod:v3.10.0   "/usr/bin/pod"           22 minutes ago   Up 22 minutes           k8s_POD_logging-fluentd-l7bbm_logging_8d0453ec-560e-11e9-b04f-0ab1f54c2b38_1
00bb2d7dda92   aa12a2fc57f7                             "/usr/bin/origin-w..."   22 minutes ago   Up 22 minutes           k8s_webconsole_webconsole-5f649b49b5-bfltf_openshift-web-console_e488cc08-ee60-11e8-950d-0ab1f54c2b38_24
a4d3fd8f9c09   docker.io/openshift/origin-pod:v3.10.0   "/usr/bin/pod"           22 minutes ago   Up 22 minutes           k8s_POD_webconsole-5f649b49b5-bfltf_openshift-web-console_e488cc08-ee60-11e8-950d-0ab1f54c2b38_23
39bd01427954   docker.io/openshift/origin-pod:v3.10.0   "/usr/bin/pod"           22 minutes ago   Up 22 minutes           k8s_POD_controller-manager-vmxvm_kube-service-catalog_46fa2bfa-8f91-11e8-a5f4-0ab1f54c2b38_56
eca530e715ca   docker.io/openshift/origin-pod:v3.10.0   "/usr/bin/pod"           22 minutes ago   Up 22 minutes           k8s_POD_apiserver-w42pp_kube-service-catalog_44845c61-8f91-11e8-a5f4-0ab1f54c2b38_56
806667874dbc   262dbb751d6b                             "/bin/bash -c '#!/..."   23 minutes ago   Up 22 minutes           k8s_openvswitch_ovs-57rr2_openshift-sdn_0e1ad7d7-560f-11e9-89b8-02684177ea70_0
7001883698ae   262dbb751d6b                             "/bin/bash -c '#!/..."   23 minutes ago   Up 22 minutes           k8s_sdn_sdn-djk8r_openshift-sdn_0e1ad8a8-560f-11e9-89b8-02684177ea70_0
31a405d5b494   docker.io/openshift/origin-pod:v3.10.0   "/usr/bin/pod"           23 minutes ago   Up 23 minutes           k8s_POD_sdn-djk8r_openshift-sdn_0e1ad8a8-560f-11e9-89b8-02684177ea70_0
9c6663fc7a71   docker.io/openshift/origin-pod:v3.10.0   "/usr/bin/pod"           23 minutes ago   Up 23 minutes           k8s_POD_ovs-57rr2_openshift-sdn_0e1ad7d7-560f-11e9-89b8-02684177ea70_0
ad16320ea2bb   262dbb751d6b                             "/bin/bash -c '#!/..."   24 minutes ago   Up 24 minutes           k8s_sync_sync-mln2b_openshift-node_aa0c3684-560e-11e9-b04f-0ab1f54c2b38_1
7d94632b0f62   openshift/origin-pod:v3.9.0              "/usr/bin/pod"           26 minutes ago   Up 26 minutes           k8s_POD_sync-mln2b_openshift-node_aa0c3684-560e-11e9-b04f-0ab1f54c2b38_0

and...

netstat -nalp | grep 443
unix  2      [ ACC ]     STREAM     LISTENING     28443    1339/master          private/proxywrite

So the pods are running, but nothing is listening on 443, which I assume is why the API check fails.
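Note that the netstat line above actually matched the unix socket inode 28443, not a TCP listener, so it doesn't show much either way. A more targeted check for TCP listeners on the API ports:

netstat -tlnp | grep -E ':(443|8443)\b'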
We're gonna need more verbose logs (`ansible-playbook -vvv` output, not the log file ansible creates) to find out why the API port is not being set properly.
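A sketch of how to capture that in one go when re-running the playbook (the log file name is just an example; <upgrade_playbook> stands for the upgrade_control_plane playbook path used above):

ansible-playbook -vvv -i inventory_file <upgrade_playbook> 2>&1 | tee upgrade-vvv.log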
Created PR for 3.10 - https://github.com/openshift/openshift-ansible/pull/11487
3.11 PR - https://github.com/openshift/openshift-ansible/pull/11491
*** Bug 1699695 has been marked as a duplicate of this bug. ***
I could not reproduce the bug. Could you help with reproduction steps? Thanks.
(In reply to Weihua Meng from comment #21)
> I could not reproduce the bug.
> Could you help with reproduction steps?
> Thanks.

1. Install a 3.9 cluster with api_port customized to 443
2. Clear local and remote facts
3. Upgrade to 3.10

The upgrade worked as expected when existing facts were reused.
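For step 1, the non-default port is set through the usual inventory variables; a minimal fragment, assuming the standard openshift-ansible variable names:

[OSEv3:vars]
openshift_master_api_port=443
openshift_master_console_port=443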
Fixed.
openshift-ansible-3.10.139-1.git.0.02bc5db.el7.noarch
ansible-2.4.6.0-1.el7ae
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2019:0786