Bug 1689243
| Summary: | Upgrade from 3.9 to 3.10 fails on openshift_control_plane: verify API server. Using wrong API port. | | |
|---|---|---|---|
| Product: | OpenShift Container Platform | Reporter: | Bryn Ellis <bryn.ellis> |
| Component: | Installer | Assignee: | Vadim Rutkovsky <vrutkovs> |
| Installer sub component: | openshift-ansible | QA Contact: | Weihua Meng <wmeng> |
| Status: | CLOSED ERRATA | Docs Contact: | |
| Severity: | medium | | |
| Priority: | unspecified | CC: | aos-bugs, ckoep, gpei, mmccomas, mvardhan, shiywang, vrutkovs, wmeng |
| Version: | 3.9.0 | | |
| Target Milestone: | --- | | |
| Target Release: | 3.10.z | | |
| Hardware: | x86_64 | | |
| OS: | Linux | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | Bug Fix |
| Doc Text: | Cause: several openshift vars were not used during upgrade. Consequence: 3.9 -> 3.10 upgrade would fail if custom api port is set and facts were cleared. Fix: api_port and other apiserver-related vars are being read during upgrade. Result: upgrade succeeds. | | |
| Story Points: | --- | | |
| Clone Of: | | | |
| : | 1699695 1699696 (view as bug list) | Environment: | |
| Last Closed: | 2019-06-11 09:30:48 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Bug Depends On: | | | |
| Bug Blocks: | 1699695, 1699696 | | |
Comment 1
Bryn Ellis
2019-03-15 14:11:59 UTC
I can confirm that by making changes to roles/openshift_facts/defaults/main.yml (changing 8443 to 443):
`openshift_master_api_port: "443"`
...AND to roles/openshift_facts/library/openshift_facts.py (changing 8443 to 443):
if 'master' in roles:
    defaults['master'] = dict(api_use_ssl=True, api_port='443',
                              controllers_port='8444',
                              console_use_ssl=True,
                              console_path='/console',
                              console_port='443',
                              portal_net='172.30.0.0/16',
                              bind_addr='0.0.0.0',
                              session_max_seconds=3600,
                              session_name='ssn')
...then running the upgrade again, it gets further.
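As a side note, the non-default port would normally be declared in the inventory rather than by patching role defaults; a minimal sketch, assuming the standard openshift-ansible variables `openshift_master_api_port` and `openshift_master_console_port`:

    # Hedged example inventory snippet - variable names assume stock openshift-ansible
    [OSEv3:vars]
    openshift_master_api_port=443
    openshift_master_console_port=443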
Contacted the reporter on Slack:
1. Facts were cleaned
2. ansible version used is 2.5.2, which might be the cause here.

Bryn, would you mind trying to reproduce that with ansible 2.4 or 2.6? I'll check if using ansible 2.5.2 makes it reproducible in my case.

I've downgraded ansible:

    ansible --version
    ansible 2.4.6.0
      config file = /root/openshift-ansible/ansible.cfg
      configured module search path = [u'/root/.ansible/plugins/modules', u'/usr/share/ansible/plugins/modules']
      ansible python module location = /usr/lib/python2.7/site-packages/ansible
      executable location = /bin/ansible
      python version = 2.7.5 (default, Aug 4 2017, 00:39:18) [GCC 4.8.5 20150623 (Red Hat 4.8.5-16)]

Restored back to original 3.9 and ran openshift_node_group (ran successfully) and upgrade_control_plane again. Can't decide if upgrade_control_plane got further or not (I don't think it has), but I have hit a different error now. This error is already reported on Bugzilla - https://bugzilla.redhat.com/show_bug.cgi?id=1656645. I don't really know how to proceed now: for this new error I've commented out all my metrics inventory variables (I want to leave metrics and upgrade it later), and running `ansible -i inventory_file master -a 'oc whoami'` came back with all 3 masters showing 'system:admin', which the bug report seems to suggest is correct. I've done another 'git pull' and am going to restore all instances back to 3.9 and then run again to see if I hit the problem at the same spot.

OK, I don't know what happened to get the error related to bug 1656645 mentioned above, but it went through this time. It went through all 30 retries and failed, but didn't bomb out of the upgrade, it just continued. Then it failed at the usual point verifying the API, but using the correct port as per Comment 2. Looking back through the log in a bit more detail, it all seems to be related to the failure of the storage upgrade. Maybe the failure of the storage upgrade is what is causing the API verification to fail because something isn't starting properly. I'm currently losing the will to live with this upgrade! :-( Next is to restore back to 3.9 again and try one of the suggestions in Comment 6 of https://bugzilla.redhat.com/show_bug.cgi?id=1591053 to see if I can get the storage upgrade to work.

(In reply to Bryn Ellis from comment #5)
> Can't decide if upgrade_control_plane got further or not (I don't think it
> has) but I have hit a different error now.

It certainly has progressed further - storage migration doesn't start before we ensure the API is up. Since it has proceeded to that step, it means the API check has passed.

(In reply to Bryn Ellis from comment #6)
> OK, I don't know what happened to get the error related to bug 1656645
> mentioned above but it went through this time. It went through all 30
> retries and failed but didn't bomb out of the upgrade, it just continued.
> Then it failed at the usual point verifying the API but using the correct
> port as per Comment 2.

This might mean the API server (or scheduler) got stuck. Please attach the output of `master-logs api api` and `master-logs controllers controllers` from the master nodes.

Sorry, I've been trying all sorts of other things to try to get this to work since then, so I don't have those logs anymore.
The current position I'm in is:
a) I still have the 'hacks' in to make sure it uses 443 and not 8443 for the API check.
b) I've added an entry to /etc/hosts to point the fqdn it uses for the API check to the first master. I've made this /etc/hosts edit on all 3 masters plus my ansible server. The idea behind this was to bypass my AWS ELB, to make sure it wasn't causing the API check to fail by dropping the masters out of the ELB.
Unfortunately, exactly the same issue has occurred. It is using 443 for the API check but it still fails after 120 retries.
2019-04-03 14:10:07,639 p=2342 u=root | fatal: [ip-10-160-20-10.stage.ice.aws]: FAILED! => {
"attempts": 120,
"changed": false,
"cmd": [
"curl",
"--silent",
"--tlsv1.2",
"--max-time",
"2",
"--cacert",
"/etc/origin/master/ca-bundle.crt",
"https://staging-ocp-cluster.ice-technology.com:443/healthz/ready"
],
"delta": "0:00:00.010035",
"end": "2019-04-03 14:10:07.619571",
"failed": true,
"invocation": {
"module_args": {
"_raw_params": "curl --silent --tlsv1.2 --max-time 2 --cacert /etc/origin/master/ca-bundle.crt https://staging-ocp-cluster.ice-technology.com:443/healthz/ready",
"_uses_shell": false,
"chdir": null,
"creates": null,
"executable": null,
"removes": null,
"stdin": null,
"warn": false
}
},
"msg": "non-zero return code",
"rc": 7,
"start": "2019-04-03 14:10:07.609536",
"stderr": "",
"stderr_lines": [],
"stdout": "",
"stdout_lines": []
and...
docker ps
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
7046c175db56 docker.io/openshift/origin-pod:v3.10.0 "/usr/bin/pod" 6 minutes ago Up 5 minutes k8s_POD_master-controllers-ip-10-160-20-10.stage.ice.aws_kube-system_7258811294607c245c6d909256a719d5_0
bf5f79f17468 docker.io/openshift/origin-pod:v3.10.0 "/usr/bin/pod" 6 minutes ago Up 5 minutes k8s_POD_master-api-ip-10-160-20-10.stage.ice.aws_kube-system_df602223db6f9a5a5492230ccb5ebce9_0
7c6b4ef25a78 96cf7dd047cb "/usr/bin/service-..." 20 minutes ago Up 20 minutes k8s_controller-manager_controller-manager-vmxvm_kube-service-catalog_46fa2bfa-8f91-11e8-a5f4-0ab1f54c2b38_109
69b641629aac ff5dd2137a4f "/bin/sh -c '#!/bi..." 20 minutes ago Up 20 minutes k8s_etcd_master-etcd-ip-10-160-20-10.stage.ice.aws_kube-system_7c4462a08b9f01a3c928e36663e0f1b9_0
88a34b73ed0e docker.io/openshift/origin-pod:v3.10.0 "/usr/bin/pod" 20 minutes ago Up 20 minutes k8s_POD_master-etcd-ip-10-160-20-10.stage.ice.aws_kube-system_7c4462a08b9f01a3c928e36663e0f1b9_0
bb610be0bbf1 96cf7dd047cb "/usr/bin/service-..." 22 minutes ago Up 22 minutes k8s_apiserver_apiserver-w42pp_kube-service-catalog_44845c61-8f91-11e8-a5f4-0ab1f54c2b38_61
f2c36c3f4041 4f63970098ae "sh run.sh" 22 minutes ago Up 22 minutes k8s_fluentd-elasticsearch_logging-fluentd-l7bbm_logging_8d0453ec-560e-11e9-b04f-0ab1f54c2b38_2
41ef39cb25f1 docker.io/openshift/origin-pod:v3.10.0 "/usr/bin/pod" 22 minutes ago Up 22 minutes k8s_POD_logging-fluentd-l7bbm_logging_8d0453ec-560e-11e9-b04f-0ab1f54c2b38_1
00bb2d7dda92 aa12a2fc57f7 "/usr/bin/origin-w..." 22 minutes ago Up 22 minutes k8s_webconsole_webconsole-5f649b49b5-bfltf_openshift-web-console_e488cc08-ee60-11e8-950d-0ab1f54c2b38_24
a4d3fd8f9c09 docker.io/openshift/origin-pod:v3.10.0 "/usr/bin/pod" 22 minutes ago Up 22 minutes k8s_POD_webconsole-5f649b49b5-bfltf_openshift-web-console_e488cc08-ee60-11e8-950d-0ab1f54c2b38_23
39bd01427954 docker.io/openshift/origin-pod:v3.10.0 "/usr/bin/pod" 22 minutes ago Up 22 minutes k8s_POD_controller-manager-vmxvm_kube-service-catalog_46fa2bfa-8f91-11e8-a5f4-0ab1f54c2b38_56
eca530e715ca docker.io/openshift/origin-pod:v3.10.0 "/usr/bin/pod" 22 minutes ago Up 22 minutes k8s_POD_apiserver-w42pp_kube-service-catalog_44845c61-8f91-11e8-a5f4-0ab1f54c2b38_56
806667874dbc 262dbb751d6b "/bin/bash -c '#!/..." 23 minutes ago Up 22 minutes k8s_openvswitch_ovs-57rr2_openshift-sdn_0e1ad7d7-560f-11e9-89b8-02684177ea70_0
7001883698ae 262dbb751d6b "/bin/bash -c '#!/..." 23 minutes ago Up 22 minutes k8s_sdn_sdn-djk8r_openshift-sdn_0e1ad8a8-560f-11e9-89b8-02684177ea70_0
31a405d5b494 docker.io/openshift/origin-pod:v3.10.0 "/usr/bin/pod" 23 minutes ago Up 23 minutes k8s_POD_sdn-djk8r_openshift-sdn_0e1ad8a8-560f-11e9-89b8-02684177ea70_0
9c6663fc7a71 docker.io/openshift/origin-pod:v3.10.0 "/usr/bin/pod" 23 minutes ago Up 23 minutes k8s_POD_ovs-57rr2_openshift-sdn_0e1ad7d7-560f-11e9-89b8-02684177ea70_0
ad16320ea2bb 262dbb751d6b "/bin/bash -c '#!/..." 24 minutes ago Up 24 minutes k8s_sync_sync-mln2b_openshift-node_aa0c3684-560e-11e9-b04f-0ab1f54c2b38_1
7d94632b0f62 openshift/origin-pod:v3.9.0 "/usr/bin/pod" 26 minutes ago Up 26 minutes k8s_POD_sync-mln2b_openshift-node_aa0c3684-560e-11e9-b04f-0ab1f54c2b38_0
and...
netstat -nalp | grep 443
unix 2 [ ACC ] STREAM LISTENING 28443 1339/master private/proxywrite
So pods are running but nothing is listening on 443, hence the API check fails, I assume.
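Before the next retry, it may be worth confirming what the API static pod is actually configured to bind to. A minimal sketch (run on a master; assumes the stock 3.10 config path /etc/origin/master/master-config.yaml and the `master-logs` wrapper mentioned earlier):

    # Show the configured API bind address/port (servingInfo.bindAddress, e.g. 0.0.0.0:8443)
    grep -A2 'servingInfo' /etc/origin/master/master-config.yaml
    # Check whether anything is listening on the expected ports at all
    ss -tlnp | grep -E ':(443|8443)\b'
    # If the api container is running, its logs usually explain a failing /healthz/ready
    master-logs api api 2>&1 | tail -n 50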
We're gonna need more verbose logs (`ansible-playbook -vvv` output, not the log file ansible creates) to find out why the API port is not being set properly.

Created PR to 3.10 - https://github.com/openshift/openshift-ansible/pull/11487

3.11 PR - https://github.com/openshift/openshift-ansible/pull/11491

*** Bug 1699695 has been marked as a duplicate of this bug. ***

I could not reproduce the bug.
Could you help with reproduce steps?
Thanks.

(In reply to Weihua Meng from comment #21)
> I could not reproduce the bug.
> Could you help with reproduce steps?
> Thanks.

1. Install 3.9 cluster w/ custom api_port set to 443
2. Clear local and remote facts (one way to do this is sketched below)
3. Upgrade to 3.10

This worked as expected during upgrade when existing facts were reused.

Fixed.
openshift-ansible-3.10.139-1.git.0.02bc5db.el7.noarch
ansible-2.4.6.0-1.el7ae

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory, and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2019:0786
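A minimal sketch of step 2 of the reproduce steps above (clearing facts), assuming the default fact file location used by openshift-ansible 3.x and an inventory group named `nodes`; the controller-side cache path shown is hypothetical and depends on ansible.cfg:

    # Remove cached openshift facts from all cluster hosts (remote facts)
    ansible -i inventory_file nodes -m file -a 'path=/etc/ansible/facts.d/openshift.fact state=absent'
    # Clear the controller-side fact cache, if fact_caching is enabled in ansible.cfg
    rm -rf /tmp/ansible_fact_cache    # hypothetical path - check fact_caching_connection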