Bug 1812431 - Wrong check for verify API server task
Summary: Wrong check for verify API server task
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Installer
Version: 3.10.0
Hardware: Unspecified
OS: Unspecified
Priority: medium
Severity: medium
Target Milestone: ---
Target Release: 3.11.z
Assignee: Russell Teague
QA Contact: Johnny Liu
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2020-03-11 10:44 UTC by Aditya Deshpande
Modified: 2023-10-06 19:24 UTC (History)

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Cause: API health check checked only load balancer. Consequence: Health check could return success although the local host is failing after restart. Fix: Added API health check for local host. Result: Local API health is confirmed before proceeding.
Clone Of:
Environment:
Last Closed: 2020-11-12 10:08:21 UTC
Target Upstream Version:
Embargoed:


Links
System | ID | Private | Priority | Status | Summary | Last Updated
Github | openshift openshift-ansible pull 12255 | 0 | None | closed | Bug 1812431: roles/openshift_control_plane: Verify local API health | 2020-11-10 14:00:02 UTC
Red Hat Product Errata | RHBA-2020:4430 | 0 | None | None | None | 2020-11-12 10:08:55 UTC

Description Aditya Deshpande 2020-03-11 10:44:19 UTC
Description of problem:

    While upgrading a multi-master environment using openshift-ansible-3.10.181, the playbook verifies API server health with the task shown below.
   
   ## openshift-ansible/roles/openshift_control_plane/handlers/main.yml
    ~~~
    - name: verify API server
      # Using curl here since the uri module requires python-httplib2 and
      # wait_for port doesn't provide health information.
      command: >
        curl --silent --tlsv1.2 --max-time 2
        --cacert {{ openshift.common.config_base }}/master/ca-bundle.crt
        {{ openshift.master.api_url }}/healthz/ready
      args:
        # Disables the following warning:
        # Consider using get_url or uri module rather than running curl
        warn: no
      register: l_api_available_output
      until: l_api_available_output.stdout == 'ok'
      retries: 120
      delay: 1
      changed_when: false
    ~~~
     
     
    The task above checks API health by running curl against {{ openshift.master.api_url }}/healthz/ready, where openshift.master.api_url is the master API URL set as masterPublicURL in the master-config.yaml file. This URL points at a load balancer in front of the master-api pods running on the different master servers.
     
    Master nodes are upgraded one at a time. Suppose the master-api pod on master-1 is upgraded but fails to run. Because the other two pods are still available, the check against masterPublicURL succeeds and the playbook goes on to upgrade the other two API pods. In the end, all three master-api pods are down after the upgrade. Only then does the task fail, and the whole cluster suffers an outage because the API is unavailable.
     
    If the playbook had detected the failed master-api pod on master-1 and stopped there, the outage could have been avoided.
    So the verify API server task should check the status of the master-api pod that was just upgraded to the newer version, not masterPublicURL (openshift.master.api_url), which load-balances across multiple pods.
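
    A minimal sketch of what such a per-host check could look like, reusing the parameters of the task above. The variable l_local_api_url is hypothetical: it stands for the API URL of the master being upgraded and is not an existing openshift-ansible fact. This is an illustration under that assumption, not necessarily how a fix would be implemented.
    ~~~
    - name: verify Local API server
      # Same health probe as "verify API server", but aimed at this master only,
      # so a failed local master-api pod is not hidden by healthy peers behind the LB.
      # l_local_api_url is a hypothetical variable (assumption, not from the role).
      command: >
        curl --silent --tlsv1.2 --max-time 2
        --cacert {{ openshift.common.config_base }}/master/ca-bundle.crt
        {{ l_local_api_url }}/healthz/ready
      args:
        # Disables the curl-vs-uri module warning, as in the original task
        warn: no
      register: l_local_api_available_output
      until: l_local_api_available_output.stdout == 'ok'
      retries: 120
      delay: 1
      changed_when: false
    ~~~
    With a check like this, a master-api pod that never becomes ready on master-1 would exhaust the retries and fail the play on that host, before the remaining masters are touched.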

Version-Release number of the following components:
rpm -q openshift-ansible
3.10.181-1

rpm -q ansible
ansible --version
ansible-2.6.19-1.el7ae.noarch



Actual results:
The verify API server task failed only during the third master upgrade, when all API pods were already down.

Expected results:
The task should be modified so that it checks the health of each master-api pod, one by one, as the upgrade proceeds. That would avoid situations where all API pods are down, causing an outage.

Additional info:
Please attach logs from ansible-playbook with the -vvv flag

Comment 3 Scott Dodson 2020-03-11 13:22:54 UTC
Most components in the cluster communicate with the load balancer rather than the local pod. In my opinion this check is working as intended. Please provide more information regarding the actual problem here.

Comment 4 Aditya Deshpande 2020-03-16 10:13:56 UTC
Suppose three master-api pods are running before the upgrade playbook starts.
During the upgrade, the first master-api pod, on master-1, is updated and for some reason fails.
At this point one pod is down and the two pods on the other two masters are still running.
The verify API server check does not fail, because the curl request only checks the LB, and with two pods still running the playbook continues.

Next, while upgrading the second API pod, on master-2, that pod also fails, but the task still does not fail.
While upgrading the third API pod, on master-3, if that pod fails as well, the LB has no working backend left, and only then does the task fail, resulting in an outage.

So if the playbook had stopped as soon as the API pod on master-1 failed, the outage could have been avoided and the failed API pod on master-1 could have been troubleshot separately.
In my view, the task should not check the LB URL. It should verify the status of the API pod that was most recently updated on the master node.

Comment 5 Scott Dodson 2020-03-17 12:27:44 UTC
Ok, it seems reasonable to check both.

Comment 15 Russell Teague 2020-10-23 15:03:10 UTC
Opened a PR to check API health through both load balancer and local host.
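
For illustration only, one compact way to probe both endpoints is a single looped task, local host first. This is a sketch and not the content of the PR (the verification log in comment 17 shows separate "verify Local API server" and "verify API server" handlers). The variable l_local_api_url is hypothetical and stands for the local master's own API URL.
~~~
- name: verify API server (local host, then load balancer)
  command: >
    curl --silent --tlsv1.2 --max-time 2
    --cacert {{ openshift.common.config_base }}/master/ca-bundle.crt
    {{ item }}/healthz/ready
  args:
    warn: no
  register: l_api_available_output
  # until is evaluated for each item, so each URL is retried separately
  until: l_api_available_output.stdout == 'ok'
  retries: 120
  delay: 1
  changed_when: false
  with_items:
  - "{{ l_local_api_url }}"           # hypothetical: this master's own API URL
  - "{{ openshift.master.api_url }}"  # load balancer URL (masterPublicURL)
~~~
Listing the local URL first means an unhealthy local pod is reported even when the load balancer still answers through the other masters.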

Comment 17 Johnny Liu 2020-10-26 07:45:59 UTC
Verified this bug with openshift-ansible-3.11.310-1.git.0.4896d62.el7.noarch, and passed.


Upgraded a cluster using openshift-ansible-3.11.310-1.git.0.4896d62.el7.noarch. During the upgrade, the following checks were run:
<---master 1 api check--->
10-26 15:00:38  RUNNING HANDLER [openshift_control_plane : verify Local API server] ************

10-26 15:00:41  FAILED - RETRYING: verify Local API server (120 retries left).

10-26 15:00:45  FAILED - RETRYING: verify Local API server (119 retries left).

10-26 15:00:48  FAILED - RETRYING: verify Local API server (118 retries left).

10-26 15:00:52  FAILED - RETRYING: verify Local API server (117 retries left).

10-26 15:00:53  ok: [ci-vm-10-0-151-248.hosted.upshift.rdu2.redhat.com] => {"attempts": 5, "changed": false, "cmd": ["curl", "--silent", "--tlsv1.2", "--max-time", "2", "--cacert", "/etc/origin/master/ca-bundle.crt", "https://jialiu311master-etcd-1.int.1026-l5y.qe.rhcloud.com:443/healthz/ready"], "delta": "0:00:00.132041", "end": "2020-10-26 03:00:53.037319", "rc": 0, "start": "2020-10-26 03:00:52.905278", "stderr": "", "stderr_lines": [], "stdout": "ok", "stdout_lines": ["ok"]}
10-26 15:00:53  
10-26 15:00:53  RUNNING HANDLER [openshift_control_plane : verify Local API server] ************
10-26 15:00:54  ok: [ci-vm-10-0-151-248.hosted.upshift.rdu2.redhat.com] => {"attempts": 1, "changed": false, "cmd": ["curl", "--silent", "--tlsv1.2", "--max-time", "2", "--cacert", "/etc/origin/master/ca-bundle.crt", "https://jialiu311master-etcd-1.int.1026-l5y.qe.rhcloud.com:443/healthz/ready"], "delta": "0:00:00.121582", "end": "2020-10-26 03:00:53.727826", "rc": 0, "start": "2020-10-26 03:00:53.606244", "stderr": "", "stderr_lines": [], "stdout": "ok", "stdout_lines": ["ok"]}
10-26 15:00:54  
10-26 15:00:54  RUNNING HANDLER [openshift_control_plane : verify API server] ******************
10-26 15:00:55  ok: [ci-vm-10-0-151-248.hosted.upshift.rdu2.redhat.com] => {"attempts": 1, "changed": false, "cmd": ["curl", "--silent", "--tlsv1.2", "--max-time", "2", "--cacert", "/etc/origin/master/ca-bundle.crt", "https://jialiu311lb-nfs-1:443/healthz/ready"], "delta": "0:00:00.116926", "end": "2020-10-26 03:00:54.421811", "rc": 0, "start": "2020-10-26 03:00:54.304885", "stderr": "", "stderr_lines": [], "stdout": "ok", "stdout_lines": ["ok"]}
10-26 15:00:55  
10-26 15:00:55  RUNNING HANDLER [openshift_control_plane : verify Local API server] ************

10-26 15:00:55  ok: [ci-vm-10-0-151-248.hosted.upshift.rdu2.redhat.com] => {"attempts": 1, "changed": false, "cmd": ["curl", "--silent", "--tlsv1.2", "--max-time", "2", "--cacert", "/etc/origin/master/ca-bundle.crt", "https://jialiu311master-etcd-1.int.1026-l5y.qe.rhcloud.com:443/healthz/ready"], "delta": "0:00:00.107246", "end": "2020-10-26 03:00:55.131238", "rc": 0, "start": "2020-10-26 03:00:55.023992", "stderr": "", "stderr_lines": [], "stdout": "ok", "stdout_lines": ["ok"]}



<---master 2 api check--->
10-26 15:01:27  RUNNING HANDLER [openshift_control_plane : verify Local API server] ************

10-26 15:01:30  FAILED - RETRYING: verify Local API server (120 retries left).

10-26 15:01:34  FAILED - RETRYING: verify Local API server (119 retries left).

10-26 15:01:37  FAILED - RETRYING: verify Local API server (118 retries left).

10-26 15:01:41  FAILED - RETRYING: verify Local API server (117 retries left).

10-26 15:01:45  FAILED - RETRYING: verify Local API server (116 retries left).

10-26 15:01:46  ok: [ci-vm-10-0-151-94.hosted.upshift.rdu2.redhat.com] => {"attempts": 6, "changed": false, "cmd": ["curl", "--silent", "--tlsv1.2", "--max-time", "2", "--cacert", "/etc/origin/master/ca-bundle.crt", "https://jialiu311master-etcd-2.int.1026-l5y.qe.rhcloud.com:443/healthz/ready"], "delta": "0:00:00.175943", "end": "2020-10-26 03:01:46.320235", "rc": 0, "start": "2020-10-26 03:01:46.144292", "stderr": "", "stderr_lines": [], "stdout": "ok", "stdout_lines": ["ok"]}
10-26 15:01:46  
10-26 15:01:46  RUNNING HANDLER [openshift_control_plane : verify Local API server] ************

10-26 15:01:47  ok: [ci-vm-10-0-151-94.hosted.upshift.rdu2.redhat.com] => {"attempts": 1, "changed": false, "cmd": ["curl", "--silent", "--tlsv1.2", "--max-time", "2", "--cacert", "/etc/origin/master/ca-bundle.crt", "https://jialiu311master-etcd-2.int.1026-l5y.qe.rhcloud.com:443/healthz/ready"], "delta": "0:00:00.184025", "end": "2020-10-26 03:01:47.230324", "rc": 0, "start": "2020-10-26 03:01:47.046299", "stderr": "", "stderr_lines": [], "stdout": "ok", "stdout_lines": ["ok"]}
10-26 15:01:47  
10-26 15:01:47  RUNNING HANDLER [openshift_control_plane : verify API server] ******************
10-26 15:01:48  ok: [ci-vm-10-0-151-94.hosted.upshift.rdu2.redhat.com] => {"attempts": 1, "changed": false, "cmd": ["curl", "--silent", "--tlsv1.2", "--max-time", "2", "--cacert", "/etc/origin/master/ca-bundle.crt", "https://jialiu311lb-nfs-1:443/healthz/ready"], "delta": "0:00:00.132616", "end": "2020-10-26 03:01:47.943989", "rc": 0, "start": "2020-10-26 03:01:47.811373", "stderr": "", "stderr_lines": [], "stdout": "ok", "stdout_lines": ["ok"]}
10-26 15:01:48  
10-26 15:01:48  RUNNING HANDLER [openshift_control_plane : verify Local API server] ************

10-26 15:01:49  ok: [ci-vm-10-0-151-94.hosted.upshift.rdu2.redhat.com] => {"attempts": 1, "changed": false, "cmd": ["curl", "--silent", "--tlsv1.2", "--max-time", "2", "--cacert", "/etc/origin/master/ca-bundle.crt", "https://jialiu311master-etcd-2.int.1026-l5y.qe.rhcloud.com:443/healthz/ready"], "delta": "0:00:00.127315", "end": "2020-10-26 03:01:48.720747", "rc": 0, "start": "2020-10-26 03:01:48.593432", "stderr": "", "stderr_lines": [], "stdout": "ok", "stdout_lines": ["ok"]}


<---master 3 api check--->
10-26 15:02:18  RUNNING HANDLER [openshift_control_plane : verify Local API server] ************

10-26 15:02:21  FAILED - RETRYING: verify Local API server (120 retries left).

10-26 15:02:25  FAILED - RETRYING: verify Local API server (119 retries left).

10-26 15:02:28  FAILED - RETRYING: verify Local API server (118 retries left).

10-26 15:02:32  FAILED - RETRYING: verify Local API server (117 retries left).

10-26 15:02:36  FAILED - RETRYING: verify Local API server (116 retries left).

10-26 15:02:37  ok: [ci-vm-10-0-149-70.hosted.upshift.rdu2.redhat.com] => {"attempts": 6, "changed": false, "cmd": ["curl", "--silent", "--tlsv1.2", "--max-time", "2", "--cacert", "/etc/origin/master/ca-bundle.crt", "https://jialiu311master-etcd-3.int.1026-l5y.qe.rhcloud.com:443/healthz/ready"], "delta": "0:00:00.141949", "end": "2020-10-26 03:02:36.711618", "rc": 0, "start": "2020-10-26 03:02:36.569669", "stderr": "", "stderr_lines": [], "stdout": "ok", "stdout_lines": ["ok"]}
10-26 15:02:37  
10-26 15:02:37  RUNNING HANDLER [openshift_control_plane : verify Local API server] ************
10-26 15:02:38  ok: [ci-vm-10-0-149-70.hosted.upshift.rdu2.redhat.com] => {"attempts": 1, "changed": false, "cmd": ["curl", "--silent", "--tlsv1.2", "--max-time", "2", "--cacert", "/etc/origin/master/ca-bundle.crt", "https://jialiu311master-etcd-3.int.1026-l5y.qe.rhcloud.com:443/healthz/ready"], "delta": "0:00:00.137628", "end": "2020-10-26 03:02:37.508232", "rc": 0, "start": "2020-10-26 03:02:37.370604", "stderr": "", "stderr_lines": [], "stdout": "ok", "stdout_lines": ["ok"]}
10-26 15:02:38  
10-26 15:02:38  RUNNING HANDLER [openshift_control_plane : verify API server] ******************

10-26 15:02:39  ok: [ci-vm-10-0-149-70.hosted.upshift.rdu2.redhat.com] => {"attempts": 1, "changed": false, "cmd": ["curl", "--silent", "--tlsv1.2", "--max-time", "2", "--cacert", "/etc/origin/master/ca-bundle.crt", "https://jialiu311lb-nfs-1:443/healthz/ready"], "delta": "0:00:00.190494", "end": "2020-10-26 03:02:38.353075", "rc": 0, "start": "2020-10-26 03:02:38.162581", "stderr": "", "stderr_lines": [], "stdout": "ok", "stdout_lines": ["ok"]}
10-26 15:02:39  
10-26 15:02:39  RUNNING HANDLER [openshift_control_plane : verify Local API server] ************
10-26 15:02:40  ok: [ci-vm-10-0-149-70.hosted.upshift.rdu2.redhat.com] => {"attempts": 1, "changed": false, "cmd": ["curl", "--silent", "--tlsv1.2", "--max-time", "2", "--cacert", "/etc/origin/master/ca-bundle.crt", "https://jialiu311master-etcd-3.int.1026-l5y.qe.rhcloud.com:443/healthz/ready"], "delta": "0:00:00.141891", "end": "2020-10-26 03:02:39.388866", "rc": 0, "start": "2020-10-26 03:02:39.246975", "stderr": "", "stderr_lines": [], "stdout": "ok", "stdout_lines": ["ok"]}

Comment 20 errata-xmlrpc 2020-11-12 10:08:21 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (OpenShift Container Platform 3.11.317 bug fix and enhancement update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:4430

