Bug 1828112
Summary: | [OSP] Got "invalid argument \"\\\"\\\"\" for \"--address\" flag: \"\\\"\\\"\" is not a valid IP address" for kubelet service during rhel scaleup scenario | |
---|---|---|---
Product: | OpenShift Container Platform | Reporter: | weiwei jiang <wjiang>
Component: | Machine Config Operator | Assignee: | Kirsten Garrison <kgarriso>
Status: | CLOSED ERRATA | QA Contact: | weiwei jiang <wjiang>
Severity: | high | Docs Contact: |
Priority: | high | |
Version: | 4.5 | CC: | aos-bugs, dcbw, jokerman, kgarriso
Target Milestone: | --- | Keywords: | TestBlocker
Target Release: | 4.5.0 | |
Hardware: | Unspecified | |
OS: | Unspecified | |
Whiteboard: | | |
Fixed In Version: | | Doc Type: | If docs needed, set a value
Doc Text: | | Story Points: | ---
Clone Of: | | Environment: |
Last Closed: | 2020-07-13 17:31:39 UTC | Type: | Bug
Regression: | --- | Mount Type: | ---
Documentation: | --- | CRM: |
Verified Versions: | | Category: | ---
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: |
Cloudforms Team: | --- | Target Upstream Version: |
Embargoed: | | |
Attachments: | | |

Description
weiwei jiang
2020-04-27 03:05:45 UTC
A few questionable items from the log in the description.

Warning: crio.service changed on disk. Run 'systemctl daemon-reload' to reload units.

crio journal:
Apr 26 06:04:36 wj45uos426b-v48vz-rhel-0 crio[3403]: time="2020-04-26 06:04:36.638359673-04:00" level=info msg="Update default CNI network name to "

hyperkube journal:
Apr 26 06:07:41 wj45uos426b-v48vz-rhel-0 hyperkube[9153]: F0426 06:07:41.766079 9153 server.go:157] invalid argument \ "\\"\\"" for "--address" flag: "\\"\\"" is not a valid IP address

This appears to be a network configuration issue. If this is not an issue with crio, we need more details about what is going on with CNI. Requesting the network team take a look.

@russel this isn't a CNI issue. It's the kubelet systemd service not getting the right IP address arguments on startup. Not sure where it's trying to read the IP from, but whatever it's doing, it's doing it wrong. Over to node team.

Looks like the configuration has an extra set of quotation marks around the node IP addresses. Can you try removing them and report back?

Please attach logs from ansible-playbook with the -vvv flag.

Created attachment 1688707 [details]
OSP rhel scaleup failed log
Created attachment 1688723 [details]
OSP rhel scaleup failed log with vvv log level
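
The failure mode discussed in the comments above (a flag value that still carries literal quote characters) can be reproduced outside the kubelet. The following is a minimal Python sketch, not the kubelet's own Go code: the helper names `is_valid_ip` and `strip_literal_quotes` and the sample values are hypothetical, chosen only to show why a value such as `""` or `"10.0.98.71"` (quotes included) is rejected as an IP address.

```python
import ipaddress


def is_valid_ip(value: str) -> bool:
    """Return True if value parses as an IPv4 or IPv6 address."""
    try:
        ipaddress.ip_address(value)
        return True
    except ValueError:
        return False


def strip_literal_quotes(value: str) -> str:
    """Remove one pair of surrounding double quotes, if present."""
    if len(value) >= 2 and value[0] == '"' and value[-1] == '"':
        return value[1:-1]
    return value


# Hypothetical flag values as a process might receive them when the
# quotes in the unit file are passed through literally.
received = ['""', '"10.0.98.71"', "10.0.98.71"]

for raw in received:
    # Columns: raw value, valid as received, valid after stripping quotes.
    print(repr(raw), is_valid_ip(raw), is_valid_ip(strip_literal_quotes(raw)))
```

Note that stripping the quotes from `""` leaves an empty string, which is still not a valid address. Removing the quotes only helps when the variable behind the flag actually holds an IP.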
It seems like the variable KUBELET_NODE_IP is not handled properly in the ansible playbook.

The openshift-ansible playbooks obtain the config from MCO and run the MCD to lay down the configuration. The extra quotes are in the config [1] provided by MCO. This is the same for baremetal, openstack, and vsphere.

[1] https://github.com/openshift/machine-config-operator/blob/master/templates/worker/01-worker-kubelet/openstack/units/kubelet.yaml#L26-L27

This is potentially due to a difference between how systemd on RHEL8 and RHEL7 interpret unit files.
https://github.com/openshift/machine-config-operator/pull/1439#issuecomment-582093297

Hi @Russell & @weiwei, the PR has merged. Please let me know if you need anything further.

Checked with 4.5.0-0.nightly-2020-05-19-041951, and it's fixed, so moved to verified.

TASK [openshift_node : Reboot the host and wait for it to come back] ***********
Wednesday 20 May 2020 10:26:00 +0800 (0:00:00.534) 0:05:47.456 *********
changed: [10.0.98.71] => {"changed": true, "elapsed": 16, "rebooted": true}

TASK [openshift_node : Approve node CSRs] **************************************
Wednesday 20 May 2020 10:26:18 +0800 (0:00:18.137) 0:06:05.594 *********
changed: [10.0.98.71 -> localhost] => {"changed": true, "client_approve_results": ["Attempt: 1, Node wj45uos520a-vzjzr-rhel-0 not present or CSR not yet available", "Attempt: 2, Node wj45uos520a-vzjzr-rhel-0 not present or CSR not yet available", "wj45uos520a-vzjzr-rhel-0: certificatesigningrequest.certificates.k8s.io/csr-wmvj8 approved\n", "Attempt: 3, Node wj45uos520a-vzjzr-rhel-0 not present or CSR not yet available", "Attempt: 4, Node wj45uos520a-vzjzr-rhel-0 not present or CSR not yet available", "Attempt: 5, Node wj45uos520a-vzjzr-rhel-0 not present or CSR not yet available", "Node wj45uos520a-vzjzr-rhel-0 is present in node list"], "rc": 0, "server_approve_results": ["wj45uos520a-vzjzr-rhel-0: certificatesigningrequest.certificates.k8s.io/csr-9sklb approved\n", "Node wj45uos520a-vzjzr-rhel-0 API is ready"]}

TASK [openshift_node : Wait for node to report ready] **************************
Wednesday 20 May 2020 10:26:47 +0800 (0:00:28.179) 0:06:33.774 *********
FAILED - RETRYING: Wait for node to report ready (36 retries left).
FAILED - RETRYING: Wait for node to report ready (35 retries left).
FAILED - RETRYING: Wait for node to report ready (34 retries left).
FAILED - RETRYING: Wait for node to report ready (33 retries left).
FAILED - RETRYING: Wait for node to report ready (32 retries left).
FAILED - RETRYING: Wait for node to report ready (31 retries left).
FAILED - RETRYING: Wait for node to report ready (30 retries left).
ok: [10.0.98.71 -> localhost] => {"attempts": 8, "changed": false, "cmd": ["oc", "get", "node", "wj45uos520a-vzjzr-rhel-0", "--kubeconfig=/tmp/installer-lvuvga/auth/kubeconfig", "--output=jsonpath={.status.conditions[?(@.type==\"Ready\")].status}"], "delta": "0:00:00.133743", "end": "2020-05-20 10:27:24.284389", "rc": 0, "start": "2020-05-20 10:27:24.150646", "stderr": "", "stderr_lines": [], "stdout": "True", "stdout_lines": ["True"]}

PLAY RECAP *********************************************************************
10.0.98.71 : ok=40 changed=28 unreachable=0 failed=0 skipped=5 rescued=0 ignored=0
localhost : ok=1 changed=1 unreachable=0 failed=0 skipped=3 rescued=0 ignored=0

Wednesday 20 May 2020 10:27:24 +0800 (0:00:37.249) 0:07:11.023 *********
===============================================================================
openshift_node : Install openshift support packages ------------------- 242.90s
openshift_node : Install openshift packages ---------------------------- 60.99s
openshift_node : Wait for node to report ready ------------------------- 37.25s
openshift_node : Approve node CSRs ------------------------------------- 28.18s
openshift_node : Reboot the host and wait for it to come back ---------- 18.14s
openshift_node : Pull release image ------------------------------------- 9.28s
openshift_node : Pull MCD image ----------------------------------------- 8.10s
openshift_node : Wait for bootstrap endpoint to show up ----------------- 3.73s
openshift_node : Fetch bootstrap ignition file locally ------------------ 3.17s
openshift_node : Apply ignition manifest -------------------------------- 2.22s
openshift_node : Get machine controller daemon image from release image --- 1.81s
openshift_node : Get cluster nodes -------------------------------------- 1.20s
openshift_node : Write /etc/containers/registries.conf ------------------ 1.00s
openshift_node : Update CA trust ---------------------------------------- 0.98s
openshift_node : Setting sebool container_manage_cgroup ----------------- 0.95s
openshift_node : Enable the CRI-O service ------------------------------- 0.86s
openshift_node : Restart the CRI-O service ------------------------------ 0.85s
openshift_node : Check for cluster http proxy --------------------------- 0.84s
openshift_node : Check for cluster no proxy ----------------------------- 0.82s
openshift_node : Check for cluster https proxy -------------------------- 0.75s

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.
https://access.redhat.com/errata/RHBA-2020:2409
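
For readers re-checking the fix by hand: the "Wait for node to report ready" task in the verification log selects the status of the node's `Ready` condition with a jsonpath filter. A rough Python equivalent of that selection is sketched below. It assumes the node object is available as JSON (for example from `oc get node <name> -o json`). The function name `ready_status` and the sample document are hypothetical and are not taken from the playbook.

```python
import json


def ready_status(node: dict) -> str:
    """Return the status of the node's Ready condition, or "Unknown"."""
    for condition in node.get("status", {}).get("conditions", []):
        if condition.get("type") == "Ready":
            return condition.get("status", "Unknown")
    return "Unknown"


# Hypothetical, heavily trimmed node object for illustration only.
sample_node = json.loads("""
{
  "metadata": {"name": "example-rhel-0"},
  "status": {
    "conditions": [
      {"type": "MemoryPressure", "status": "False"},
      {"type": "Ready", "status": "True"}
    ]
  }
}
""")

print(ready_status(sample_node))
```

The playbook treats the node as ready once this value is the string `True`. A node whose kubelet never started, as in the original failure, would not report a `Ready` condition with that status.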