Bug 1812431
Summary: | Wrong check for verify API server task | ||
---|---|---|---|
Product: | OpenShift Container Platform | Reporter: | Aditya Deshpande <adeshpan> |
Component: | Installer | Assignee: | Russell Teague <rteague> |
Installer sub component: | openshift-ansible | QA Contact: | Johnny Liu <jialiu> |
Status: | CLOSED ERRATA | Docs Contact: | |
Severity: | medium | ||
Priority: | medium | CC: | bleanhar, rteague, sdodson |
Version: | 3.10.0 | Keywords: | UpcomingSprint |
Target Milestone: | --- | ||
Target Release: | 3.11.z | ||
Hardware: | Unspecified | ||
OS: | Unspecified | ||
Whiteboard: | |||
Fixed In Version: | Doc Type: | Bug Fix | |
Doc Text: |
Cause: API health check checked only load balancer.
Consequence: Health check could return success although the local host is failing after restart.
Fix: Added API health check for local host.
Result: Local API health is confirmed before proceeding.
|
Story Points: | --- |
Clone Of: | Environment: | ||
Last Closed: | 2020-11-12 10:08:21 UTC | Type: | Bug |
Embargoed: |
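The fix summarized in the Doc Text above (confirm the local API server as well as the load balancer before proceeding) can be sketched roughly as follows. This is a minimal Python illustration, not the openshift-ansible implementation, which according to the logs in this report runs curl against /healthz/ready. The function names, the retry delay, and the injectable `probe` and `sleep` parameters are assumptions made for the example. Only the 120 retries and the 2-second timeout mirror values visible in the logs.

```python
import http.client
import ssl
import time
import urllib.request
from typing import Callable, Optional


def probe_healthz(url: str, cafile: Optional[str] = None, timeout: float = 2.0) -> bool:
    """Return True if GET <url> answers HTTP 200 with body 'ok'."""
    try:
        context = ssl.create_default_context(cafile=cafile)
        with urllib.request.urlopen(url, timeout=timeout, context=context) as resp:
            return resp.status == 200 and resp.read().decode().strip() == "ok"
    except (OSError, http.client.HTTPException, ValueError):
        # Connection refused, TLS failure, timeout, non-2xx answer, bad URL, etc.
        return False


def wait_until_healthy(
    url: str,
    probe: Callable[[str], bool],
    retries: int = 120,
    delay: float = 1.0,  # assumed value; the report does not state the delay
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Call probe(url) up to `retries` times, pausing `delay` seconds between tries."""
    for attempt in range(retries):
        if probe(url):
            return True
        if attempt < retries - 1:
            sleep(delay)
    return False


def verify_api(
    local_url: str,
    lb_url: str,
    probe: Callable[[str], bool] = probe_healthz,
    retries: int = 120,
    delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> None:
    """Check the just-restarted local API server first, then the load balancer.

    Raises RuntimeError on the first endpoint that never becomes healthy, so a
    rolling upgrade stops at the first broken master instead of continuing
    while the load balancer is still served by the remaining masters.
    """
    for name, url in (("local", local_url), ("load balancer", lb_url)):
        if not wait_until_healthy(url, probe, retries=retries, delay=delay, sleep=sleep):
            raise RuntimeError(f"{name} API server not healthy: {url}")
```

With hypothetical hostnames, a call such as `verify_api("https://master-1.example.com:443/healthz/ready", "https://lb.example.com:443/healthz/ready")` would raise as soon as master-1 fails to come back, even if the load balancer URL still answers "ok" from the other masters.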
Description
Aditya Deshpande
2020-03-11 10:44:19 UTC
Most components in the cluster communicate with the load balancer rather than the local pod. In my opinion this check is working as intended. Please provide more information regarding the actual problem here.

Suppose three master-api pods are running before the upgrade playbook starts. During the upgrade, the first master-api pod, on master-1, gets updated and for some reason fails. At this point one pod is down and the other two pods, on the other two masters, are still running. The "verify API server" check did not fail, because the curl request checks only the LB, and since two pods were still running the playbook went ahead. Next, while upgrading the second API pod, on master-2, that pod also failed, but the task still did not fail. While upgrading the third API pod, on master-3, if that pod also fails then nothing behind the LB is working, and only then does the task fail, resulting in an outage. If the playbook had stopped at the first point where the API pod on master-1 failed, the outage could have been avoided and the failed API pod on master-1 could have been troubleshot separately. In my view the task should not check the LB URL. It should verify the status of the API pod that was just updated on the master node.

Ok, it seems reasonable to check both. Opened a PR to check API health through both load balancer and local host.

Verified this bug with openshift-ansible-3.11.310-1.git.0.4896d62.el7.noarch, and passed. Upgraded a cluster using openshift-ansible-3.11.310-1.git.0.4896d62.el7.noarch; during the upgrade, the following checks are run:

<---master 1 api check--->
10-26 15:00:38 RUNNING HANDLER [openshift_control_plane : verify Local API server] ************
10-26 15:00:41 FAILED - RETRYING: verify Local API server (120 retries left).
10-26 15:00:45 FAILED - RETRYING: verify Local API server (119 retries left).
10-26 15:00:48 FAILED - RETRYING: verify Local API server (118 retries left).
10-26 15:00:52 FAILED - RETRYING: verify Local API server (117 retries left).
10-26 15:00:53 ok: [ci-vm-10-0-151-248.hosted.upshift.rdu2.redhat.com] => {"attempts": 5, "changed": false, "cmd": ["curl", "--silent", "--tlsv1.2", "--max-time", "2", "--cacert", "/etc/origin/master/ca-bundle.crt", "https://jialiu311master-etcd-1.int.1026-l5y.qe.rhcloud.com:443/healthz/ready"], "delta": "0:00:00.132041", "end": "2020-10-26 03:00:53.037319", "rc": 0, "start": "2020-10-26 03:00:52.905278", "stderr": "", "stderr_lines": [], "stdout": "ok", "stdout_lines": ["ok"]}
10-26 15:00:53
10-26 15:00:53 RUNNING HANDLER [openshift_control_plane : verify Local API server] ************
10-26 15:00:54 ok: [ci-vm-10-0-151-248.hosted.upshift.rdu2.redhat.com] => {"attempts": 1, "changed": false, "cmd": ["curl", "--silent", "--tlsv1.2", "--max-time", "2", "--cacert", "/etc/origin/master/ca-bundle.crt", "https://jialiu311master-etcd-1.int.1026-l5y.qe.rhcloud.com:443/healthz/ready"], "delta": "0:00:00.121582", "end": "2020-10-26 03:00:53.727826", "rc": 0, "start": "2020-10-26 03:00:53.606244", "stderr": "", "stderr_lines": [], "stdout": "ok", "stdout_lines": ["ok"]}
10-26 15:00:54
10-26 15:00:54 RUNNING HANDLER [openshift_control_plane : verify API server] ******************
10-26 15:00:55 ok: [ci-vm-10-0-151-248.hosted.upshift.rdu2.redhat.com] => {"attempts": 1, "changed": false, "cmd": ["curl", "--silent", "--tlsv1.2", "--max-time", "2", "--cacert", "/etc/origin/master/ca-bundle.crt", "https://jialiu311lb-nfs-1:443/healthz/ready"], "delta": "0:00:00.116926", "end": "2020-10-26 03:00:54.421811", "rc": 0, "start": "2020-10-26 03:00:54.304885", "stderr": "", "stderr_lines": [], "stdout": "ok", "stdout_lines": ["ok"]}
10-26 15:00:55
10-26 15:00:55 RUNNING HANDLER [openshift_control_plane : verify Local API server] ************
10-26 15:00:55 ok: [ci-vm-10-0-151-248.hosted.upshift.rdu2.redhat.com] => {"attempts": 1, "changed": false, "cmd": ["curl", "--silent", "--tlsv1.2", "--max-time", "2", "--cacert", "/etc/origin/master/ca-bundle.crt", "https://jialiu311master-etcd-1.int.1026-l5y.qe.rhcloud.com:443/healthz/ready"], "delta": "0:00:00.107246", "end": "2020-10-26 03:00:55.131238", "rc": 0, "start": "2020-10-26 03:00:55.023992", "stderr": "", "stderr_lines": [], "stdout": "ok", "stdout_lines": ["ok"]}

<---master 2 api check--->
10-26 15:01:27 RUNNING HANDLER [openshift_control_plane : verify Local API server] ************
10-26 15:01:30 FAILED - RETRYING: verify Local API server (120 retries left).
10-26 15:01:34 FAILED - RETRYING: verify Local API server (119 retries left).
10-26 15:01:37 FAILED - RETRYING: verify Local API server (118 retries left).
10-26 15:01:41 FAILED - RETRYING: verify Local API server (117 retries left).
10-26 15:01:45 FAILED - RETRYING: verify Local API server (116 retries left).
10-26 15:01:46 ok: [ci-vm-10-0-151-94.hosted.upshift.rdu2.redhat.com] => {"attempts": 6, "changed": false, "cmd": ["curl", "--silent", "--tlsv1.2", "--max-time", "2", "--cacert", "/etc/origin/master/ca-bundle.crt", "https://jialiu311master-etcd-2.int.1026-l5y.qe.rhcloud.com:443/healthz/ready"], "delta": "0:00:00.175943", "end": "2020-10-26 03:01:46.320235", "rc": 0, "start": "2020-10-26 03:01:46.144292", "stderr": "", "stderr_lines": [], "stdout": "ok", "stdout_lines": ["ok"]}
10-26 15:01:46
10-26 15:01:46 RUNNING HANDLER [openshift_control_plane : verify Local API server] ************
10-26 15:01:47 ok: [ci-vm-10-0-151-94.hosted.upshift.rdu2.redhat.com] => {"attempts": 1, "changed": false, "cmd": ["curl", "--silent", "--tlsv1.2", "--max-time", "2", "--cacert", "/etc/origin/master/ca-bundle.crt", "https://jialiu311master-etcd-2.int.1026-l5y.qe.rhcloud.com:443/healthz/ready"], "delta": "0:00:00.184025", "end": "2020-10-26 03:01:47.230324", "rc": 0, "start": "2020-10-26 03:01:47.046299", "stderr": "", "stderr_lines": [], "stdout": "ok", "stdout_lines": ["ok"]}
10-26 15:01:47
10-26 15:01:47 RUNNING HANDLER [openshift_control_plane : verify API server] ******************
10-26 15:01:48 ok: [ci-vm-10-0-151-94.hosted.upshift.rdu2.redhat.com] => {"attempts": 1, "changed": false, "cmd": ["curl", "--silent", "--tlsv1.2", "--max-time", "2", "--cacert", "/etc/origin/master/ca-bundle.crt", "https://jialiu311lb-nfs-1:443/healthz/ready"], "delta": "0:00:00.132616", "end": "2020-10-26 03:01:47.943989", "rc": 0, "start": "2020-10-26 03:01:47.811373", "stderr": "", "stderr_lines": [], "stdout": "ok", "stdout_lines": ["ok"]}
10-26 15:01:48
10-26 15:01:48 RUNNING HANDLER [openshift_control_plane : verify Local API server] ************
10-26 15:01:49 ok: [ci-vm-10-0-151-94.hosted.upshift.rdu2.redhat.com] => {"attempts": 1, "changed": false, "cmd": ["curl", "--silent", "--tlsv1.2", "--max-time", "2", "--cacert", "/etc/origin/master/ca-bundle.crt", "https://jialiu311master-etcd-2.int.1026-l5y.qe.rhcloud.com:443/healthz/ready"], "delta": "0:00:00.127315", "end": "2020-10-26 03:01:48.720747", "rc": 0, "start": "2020-10-26 03:01:48.593432", "stderr": "", "stderr_lines": [], "stdout": "ok", "stdout_lines": ["ok"]}

<---master 3 api check--->
10-26 15:02:18 RUNNING HANDLER [openshift_control_plane : verify Local API server] ************
10-26 15:02:21 FAILED - RETRYING: verify Local API server (120 retries left).
10-26 15:02:25 FAILED - RETRYING: verify Local API server (119 retries left).
10-26 15:02:28 FAILED - RETRYING: verify Local API server (118 retries left).
10-26 15:02:32 FAILED - RETRYING: verify Local API server (117 retries left).
10-26 15:02:36 FAILED - RETRYING: verify Local API server (116 retries left).
10-26 15:02:37 ok: [ci-vm-10-0-149-70.hosted.upshift.rdu2.redhat.com] => {"attempts": 6, "changed": false, "cmd": ["curl", "--silent", "--tlsv1.2", "--max-time", "2", "--cacert", "/etc/origin/master/ca-bundle.crt", "https://jialiu311master-etcd-3.int.1026-l5y.qe.rhcloud.com:443/healthz/ready"], "delta": "0:00:00.141949", "end": "2020-10-26 03:02:36.711618", "rc": 0, "start": "2020-10-26 03:02:36.569669", "stderr": "", "stderr_lines": [], "stdout": "ok", "stdout_lines": ["ok"]}
10-26 15:02:37
10-26 15:02:37 RUNNING HANDLER [openshift_control_plane : verify Local API server] ************
10-26 15:02:38 ok: [ci-vm-10-0-149-70.hosted.upshift.rdu2.redhat.com] => {"attempts": 1, "changed": false, "cmd": ["curl", "--silent", "--tlsv1.2", "--max-time", "2", "--cacert", "/etc/origin/master/ca-bundle.crt", "https://jialiu311master-etcd-3.int.1026-l5y.qe.rhcloud.com:443/healthz/ready"], "delta": "0:00:00.137628", "end": "2020-10-26 03:02:37.508232", "rc": 0, "start": "2020-10-26 03:02:37.370604", "stderr": "", "stderr_lines": [], "stdout": "ok", "stdout_lines": ["ok"]}
10-26 15:02:38
10-26 15:02:38 RUNNING HANDLER [openshift_control_plane : verify API server] ******************
10-26 15:02:39 ok: [ci-vm-10-0-149-70.hosted.upshift.rdu2.redhat.com] => {"attempts": 1, "changed": false, "cmd": ["curl", "--silent", "--tlsv1.2", "--max-time", "2", "--cacert", "/etc/origin/master/ca-bundle.crt", "https://jialiu311lb-nfs-1:443/healthz/ready"], "delta": "0:00:00.190494", "end": "2020-10-26 03:02:38.353075", "rc": 0, "start": "2020-10-26 03:02:38.162581", "stderr": "", "stderr_lines": [], "stdout": "ok", "stdout_lines": ["ok"]}
10-26 15:02:39
10-26 15:02:39 RUNNING HANDLER [openshift_control_plane : verify Local API server] ************
10-26 15:02:40 ok: [ci-vm-10-0-149-70.hosted.upshift.rdu2.redhat.com] => {"attempts": 1, "changed": false, "cmd": ["curl", "--silent", "--tlsv1.2", "--max-time", "2", "--cacert", "/etc/origin/master/ca-bundle.crt", "https://jialiu311master-etcd-3.int.1026-l5y.qe.rhcloud.com:443/healthz/ready"], "delta": "0:00:00.141891", "end": "2020-10-26 03:02:39.388866", "rc": 0, "start": "2020-10-26 03:02:39.246975", "stderr": "", "stderr_lines": [], "stdout": "ok", "stdout_lines": ["ok"]}

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (OpenShift Container Platform 3.11.317 bug fix and enhancement update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:4430