Note: This bug is displayed in read-only format because the product is no longer active in Red Hat Bugzilla.

Bug 1588142

Summary: node scaleup fails - Node approval failed
Product: OpenShift Container Platform    Reporter: Vikas Laad <vlaad>
Component: Master    Assignee: Jordan Liggitt <jliggitt>
Status: CLOSED WORKSFORME    QA Contact: Wang Haoran <haowang>
Severity: high
Priority: unspecified
Version: 3.10.0    CC: aos-bugs, hongli, jmencak, jokerman, mfojtik, mifiedle, mmccomas, sdodson, vlaad
Target Milestone: ---
Target Release: 3.10.0
Hardware: Unspecified
OS: Unspecified
Last Closed: 2018-06-12 03:40:08 UTC    Type: Bug
Attachments:
ansible log with -vvv (flags: none)

Description Vikas Laad 2018-06-06 18:52:48 UTC
Description of problem:
Scaleup is failing with the following error:

changed: [ec2-54-149-214-42.us-west-2.compute.amazonaws.com] => {
    "changed": true, 
    "cmd": [
        "oc", 
        "describe", 
        "csr", 
        "--config=/etc/origin/master/admin.kubeconfig"
    ], 
    "delta": "0:00:00.127132", 
    "end": "2018-06-06 18:28:51.602703", 
    "failed": false, 
    "invocation": {
        "module_args": {
            "_raw_params": "oc describe csr --config=/etc/origin/master/admin.kubeconfig", 
            "_uses_shell": false, 
            "chdir": null, 
            "creates": null, 
            "executable": null, 
            "removes": null, 
            "stdin": null, 
            "warn": true
        }
    }, 
    "rc": 0, 
    "start": "2018-06-06 18:28:51.475571", 
    "stderr": "", 
    "stderr_lines": [], 
    "stdout": "", 
    "stdout_lines": []
}

TASK [Report approval errors] ***************************************************************************************************************
task path: /usr/share/ansible/openshift-ansible/playbooks/openshift-node/private/join.yml:41
fatal: [ec2-54-149-214-42.us-west-2.compute.amazonaws.com]: FAILED! => {
    "changed": false, 
    "failed": true, 
    "msg": "Node approval failed"
}

Version-Release number of the following components:
rpm -q openshift-ansible 
openshift-ansible-3.10.0-0.60.0.git.0.bf95bf8.el7.noarch


rpm -q ansible
ansible-2.4.4.0-1.el7ae.noarch

ansible --version
ansible 2.4.4.0
  config file = /etc/ansible/ansible.cfg
  configured module search path = [u'/root/.ansible/plugins/modules', u'/usr/share/ansible/plugins/modules']
  ansible python module location = /usr/lib/python2.7/site-packages/ansible
  executable location = /usr/bin/ansible
  python version = 2.7.5 (default, May  4 2018, 09:38:16) [GCC 4.8.5 20150623 (Red Hat 4.8.5-34)]


Steps to Reproduce:
1. create a 3.10.0-0.60 cluster 
2. run node scaleup playbook /usr/share/ansible/openshift-ansible/playbooks/openshift-node/scaleup.yml

Actual results:
Scaleup fails with "Node approval failed" (full task output above).

Expected results:
Scaleup should succeed

Additional info:
Ansible log with the -vvv flag is attached (comment 2).

Comment 2 Vikas Laad 2018-06-06 18:55:13 UTC
Created attachment 1448412 [details]
ansible log with -vvv

Comment 3 Russell Teague 2018-06-08 13:31:37 UTC
I have been able to successfully scale up nodes on an existing cluster and have not reproduced this bug. Based on the logs, the 'Get CSRs' task should always report something, yet its stdout is blank. This suggests the OpenShift API was not responsive for some reason. The other scaleup attempts shown in the log indicate the cluster may not have been healthy, in which case scaleup would not succeed.
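The failure hinges on the CSR approval step seeing no output from `oc`. As a rough sketch (not the playbook's actual code), the approval flow amounts to filtering Pending entries out of `oc get csr` output; on a live cluster each selected name would then be passed to `oc adm certificate approve <name>`. The sample listing below is illustrative, not taken from this bug's logs:

```shell
# Hypothetical sample of `oc get csr` output (CSR names are made up).
cat <<'EOF' > /tmp/csr_list.txt
NAME        AGE       REQUESTOR                CONDITION
csr-2abc1   2m        system:node:node-1       Pending
csr-7def2   1m        system:node:node-2       Approved,Issued
csr-9ghi3   30s       system:node:node-3       Pending
EOF

# Select the CSRs still awaiting approval (skip the header row,
# match on the CONDITION column). In a real cluster each printed
# name would be approved with: oc adm certificate approve <name>
awk 'NR > 1 && $NF == "Pending" { print $1 }' /tmp/csr_list.txt
```

If the API server is unresponsive, `oc get csr`/`oc describe csr` return nothing at all, which matches the blank stdout in the attached log.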

Comment 4 Scott Dodson 2018-06-08 13:39:40 UTC
Moving to the master team based on the assessment that the API server goes unresponsive while we're attempting to sign the required CSRs. AFAIK it is also known that the time to sign a fixed number of CSRs is affected by the overall number of nodes in the cluster.

https://docs.google.com/spreadsheets/d/1eg4_nLJuBr8Es04gdI-GSHM26eoQ-cVWAPvNOgqSDrA/edit#gid=0

Jiri Mencak has also observed the same behavior and may be able to provide more detail.

Comment 6 Vikas Laad 2018-06-11 19:02:47 UTC
I am not able to reproduce this issue; I tried scaleup a few times. Not sure if something changed recently. Please close if needed.

Comment 7 Jordan Liggitt 2018-06-12 03:40:08 UTC
Nothing changed recently in that area, but I'm not able to reproduce it either.