Bug 1684353
| Summary: | 3.11.82 upgrade playbook selecting evicted pods | | |
|---|---|---|---|
| Product: | OpenShift Container Platform | Reporter: | Mitchell Rollinson <mirollin> |
| Component: | Management Console | Assignee: | Samuel Padgett <spadgett> |
| Status: | CLOSED INSUFFICIENT_DATA | QA Contact: | Yadan Pei <yapei> |
| Severity: | medium | Docs Contact: | |
| Priority: | medium | | |
| Version: | 3.11.0 | CC: | aos-bugs, jokerman, mirollin, mmccomas, sdodson |
| Target Milestone: | --- | | |
| Target Release: | 3.11.z | | |
| Hardware: | x86_64 | | |
| OS: | Linux | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2019-03-07 22:00:28 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
Version-Release number of the following components:

    rpm -q openshift-ansible
    Name    : openshift-ansible
    Arch    : noarch
    Version : 3.11.82
    Release : 3.git.0.9718d0a.el7

    rpm -q ansible
    ansible 2.7.8

Tidied up all evicted pods:

    eval "$(oc get pods -o json --all-namespaces | jq -r '.items[] | select(.status.phase == "Failed" and .status.reason == "Evicted") | "oc delete pod --namespace " + .metadata.namespace + " " + .metadata.name')"

Then re-ran the upgrade, and no issues were observed. Naturally, ensuring a clean, tidy cluster before upgrading is ideal. However, in the event that a pod is evicted during an upgrade, this could still be an issue.

Do you by chance have the complete logs at the verbosity in the description? These tasks should only have happened if the deployment was marked unsuccessful, and it's not clear to me what specifically triggered that condition. @Scott
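Going back to the cleanup step above, a quick way to confirm it left nothing behind is to list any pods still in the Failed/Evicted state. This is a hypothetical verification one-liner, not part of the original report; an empty result means the cluster is clean:

```sh
# List any remaining evicted pods across all namespaces
# (hypothetical check; mirrors the select() used by the cleanup command above).
oc get pods --all-namespaces -o json \
  | jq -r '.items[] | select(.status.phase == "Failed" and .status.reason == "Evicted")
           | .metadata.namespace + "/" + .metadata.name'
```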
The cu inadvertently overwrote the original 'tee-ed output' log file when re-running the playbook.
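To avoid losing the tee-ed output again on a re-run, one option is to timestamp the log file. This is a hypothetical invocation; the inventory is a placeholder and the playbook path should be adjusted to the environment:

```sh
# Hypothetical re-run that writes each attempt to its own timestamped log,
# so a second run cannot overwrite the first attempt's -vvv output.
ansible-playbook -vvv -i <inventory> \
  playbooks/byo/openshift-cluster/upgrades/v3_11/upgrade.yml \
  2>&1 | tee "upgrade-$(date +%Y%m%d-%H%M%S).log"
```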
They have advised that they will attempt to replicate in their dev environment. I am not sure what the selection process is (which pod the playbook checks) after the pod listing process completes, but it is probably the first pod in the list.
As such, if they can force an eviction of that pod, they should be able to replicate the issue.
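For what it's worth, evicted pods (phase Failed, reason Evicted) are produced by kubelet node-pressure eviction, so replication hinges on putting the node hosting the webconsole pod under memory or disk pressure (or temporarily tightening its eviction thresholds). A hedged sketch of how to watch for and confirm the eviction; the pod name is a placeholder:

```sh
# Watch the namespace while the node is under pressure
# (hypothetical reproduction aid, not a step from the report).
oc get pods -n openshift-web-console -w

# Confirm the eviction reason on an affected pod (<webconsole-pod> is a placeholder).
oc get pod <webconsole-pod> -n openshift-web-console -o jsonpath='{.status.reason}{"\n"}'
```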
Re "These tasks should've only happened if the deployment was marked unsuccessful": my understanding is that the initial deployment was unsuccessful (there were pods in an EVICTED state).
Only after removing the evicted pods did the upgrade succeed.
So, just to clarify:
1) Evicted pods existed when commencing the upgrade. Absolutely, one should verify cluster health prior to upgrading, but the thinking is that a pod 'may' change states during an upgrade.
2) Playbook task generates a list of pods in the namespace.
3) Playbook then performs 'checks' on one of the pods in the namespace. The playbook did not identify that the selected pod was evicted (does the playbook need additional logic to verify pod status? It appears to be missing; see the sketch after this list).
4) Upgrade fails.
5) CU deletes the evicted pods and re-runs the upgrade successfully.
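To illustrate the kind of status check step 3 is asking about, the pod listing could be filtered down to running pods before any per-pod checks are made. This is a hypothetical filter, not the playbook's actual task:

```sh
# Hypothetical filter: only consider webconsole pods that are actually Running,
# so Evicted pods never get selected for follow-up checks.
oc get pods -n openshift-web-console -o json \
  | jq -r '.items[] | select(.status.phase == "Running") | .metadata.name'
```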
Mitch
This task is simply printing out the pods in the namespace after an upgrade fails to help troubleshoot. It's not the cause of the upgrade failure. We *want* to see evicted pods in case it's related to the upgrade failure.

To verify that the upgrade succeeded, we check ready replicas on the console deployment: https://github.com/openshift/openshift-ansible/blob/master/roles/openshift_web_console/tasks/start.yml#L2-L16

This logic should handle evicted pods. It looks like the rollout did not complete inside of the 10 minute timeout, and the installer correctly failed because it could not update the console. Note that in the pod list you have, the running pods are much newer than the evicted ones, which does seem to indicate that they were created sometime after the 10 minute timeout. If you are able to include the full log, it would help confirm.

Thanks for the clarification, Sam. I have requested the full logs. The cu will attempt to provide these.

Hi Sam. The cu has not been able to replicate the issue, and consequently the full set of logs is not available to further this investigation. The cu is happy with the explanations provided, and now believes that the RC was a different issue. Thanks for your help.
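The success check Sam describes above amounts to comparing the console deployment's ready replicas against its desired replicas within a timeout. A rough command-line equivalent, offered as a sketch rather than the actual Ansible task in start.yml:

```sh
# Sketch of the ready-replicas comparison behind the start.yml check:
# the rollout is considered healthy once readyReplicas equals spec.replicas.
oc get deployment webconsole -n openshift-web-console \
  --config=/etc/origin/master/admin.kubeconfig \
  -o jsonpath='{.status.readyReplicas}/{.spec.replicas}{"\n"}'
```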
Description of problem:

When running the 3.11.82 upgrade playbook, the following task in ./openshift-ansible/roles/openshift_web_console/tasks/start.yml executes:

TASK [openshift_web_console : Get pods in the openshift-web-console namespace]

Issue: Playbook TASK checks are possibly made against pods that are in a state other than Running, e.g. the EVICTED state.

Version-Release number of the following components:
rpm -q openshift-ansible
rpm -q ansible
ansible --version

How reproducible:

Steps to Reproduce:
1.
2.
3.

Actual results:
This causes the installation to fail.

Expected results:
The playbook TASK should not continue to 'pursue' a pod that is not in a RUNNING state. Additional logic to do one or all of the following (a sketch appears at the end of this report):
* verify pod status .. IF EVICTED
* wait xxx seconds for the EVICTED state to change
* IF multiple pods, move on to the next pod until a RUNNING pod is identified
* fail if no running pods are located
Successful installation.

Please include the entire output from the last TASK line through the end of output if an error is generated:

<peek-boo> (1, '\r\n\r\n{"changed": true, "end": "2019-02-27 14:31:32.279069", "stdout": "", "cmd": ["oc", "logs", "deployment/webconsole", "--tail=50", "--config=/etc/origin/master/admin.kubeconfig", "-n", "openshift-web-console"][15/9282] ": true, "delta": "0:00:00.282914", "stderr": "Found 9 pods, using pod/webconsole-7659dcd487-4454z\\nError from server (BadRequest): container \\"webconsole\\" in pod \\"webconsole-7659dcd487-4454z\\" is not available", "rc": 1, "invocation": {"module_args": {"warn": true, "executable": null, "_uses_shell": false, "_raw_params": "oc logs deployment/webconsole --tail=50 --config=/etc/origin/master/admin.kubeconfig -n openshift-web-console", "removes": null, "argv": null, "creates": null, "chdir": null, "stdin": null}}, "start": "2019-02-27 14:31:31.996155", "msg": "non-zero return code"}\r\n', 'Shared connection to peeka-boo closed.\r\n')
<peek-boo> Failed to connect to the host via ssh: Shared connection to peek-boo closed.
<peek-boo> ESTABLISH SSH CONNECTION FOR USER: thatsme
<peek-boo> SSH: EXEC sshpass -d7 ssh -C -o ControlMaster=auto -o ControlPersist=60s -o User=thatsme -o ConnectTimeout=10 -o ControlPath=/home/thatsme/.ansible/cp/48bf060d0a peek-boo '/bin/sh -c '"'"'rm -f -r /home/thatsme/.ansible/tmp/ansible-tmp-1551231091.53-224885511603996/ > /dev/null 2>&1 && sleep 0'"'"''
<peek-boo> (0, '', '')
fatal: [peek-boo]: FAILED!
=> {
    "changed": true,
    "cmd": [
        "oc",
        "logs",
        "deployment/webconsole",
        "--tail=50",
        "--config=/etc/origin/master/admin.kubeconfig",
        "-n",
        "openshift-web-console"
    ],
    "delta": "0:00:00.282914",
    "end": "2019-02-27 14:31:32.279069",
    "invocation": {
        "module_args": {
            "_raw_params": "oc logs deployment/webconsole --tail=50 --config=/etc/origin/master/admin.kubeconfig -n openshift-web-console",
            "_uses_shell": false,
            "argv": null,
            "chdir": null,
            "creates": null,
            "executable": null,
            "removes": null,
            "stdin": null,
            "warn": true
        }
    },
    "msg": "non-zero return code",
    "rc": 1,
    "start": "2019-02-27 14:31:31.996155",
    "stderr": "Found 9 pods, using pod/webconsole-7659dcd487-4454z\nError from server (BadRequest): container \"webconsole\" in pod \"webconsole-7659dcd487-4454z\" is not available",
    "stderr_lines": [
        "Found 9 pods, using pod/webconsole-7659dcd487-4454z",
        "Error from server (BadRequest): container \"webconsole\" in pod \"webconsole-7659dcd487-4454z\" is not available"
    ],
    "stdout": "",
    "stdout_lines": []
}
...ignoring

TASK [openshift_web_console : debug] ******************************************************************************************************************************************************************************************************************
task path: /home/thatsme/ansible/tp_openshift_setup/openshift-ansible-3.11.82/roles/openshift_web_console/tasks/start.yml:47
ok: [peek-boo] => {
    "msg": []
}

TASK [openshift_web_console : Report console errors] **************************************************************************************************************************************************************************************************
task path: /home/thatsme/ansible/tp_openshift_setup/openshift-ansible-3.11.82/roles/openshift_web_console/tasks/start.yml:52
fatal: [peek-boo]: FAILED! => {
    "changed": false,
    "msg": "Console install failed."
}
	to retry, use: --limit @/home/thatsme/ansible/tp_openshift_setup/openshift-ansible-3.11.82/playbooks/byo/openshift-cluster/upgrades/v3_11/upgrade.retry

### The part of interest is:
"Found 9 pods, using pod/webconsole-7659dcd487-4454z"

If I check the pods in that namespace:

[root@peek-boo ~]# oc get pods
NAME                          READY     STATUS    RESTARTS   AGE
webconsole-7659dcd487-4454z   0/1       Evicted   0          2d
webconsole-7659dcd487-6j46p   0/1       Evicted   0          2d
webconsole-7659dcd487-87qgz   1/1       Running   0          31m
webconsole-7659dcd487-9hhh2   0/1       Evicted   0          1h
webconsole-7659dcd487-bpdmf   0/1       Evicted   0          1h
webconsole-7659dcd487-bw66h   0/1       Evicted   0          2d
webconsole-7659dcd487-lbh7z   0/1       Evicted   0          1h
webconsole-7659dcd487-rvf44   1/1       Running   0          31m
webconsole-7659dcd487-vvfrz   1/1       Running   0          31m

Additional info:
Please attach logs from ansible-playbook with the -vvv flag
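As a concrete illustration of the logic proposed under "Expected results" above, here is a rough sketch (hypothetical shell, not code from the playbook) that prefers a Running pod, waits for one to appear, and fails only when none does:

```sh
# Hypothetical sketch of the proposed "verify status / wait / move on / fail" logic:
# look for a Running webconsole pod, retrying for up to ~5 minutes.
for attempt in $(seq 1 30); do
  pod=$(oc get pods -n openshift-web-console -o json \
        | jq -r '[.items[] | select(.status.phase == "Running")][0].metadata.name // empty')
  if [ -n "$pod" ]; then
    echo "Using running pod: $pod"
    oc logs "$pod" -n openshift-web-console --tail=50
    exit 0
  fi
  sleep 10
done
echo "No running webconsole pods found" >&2
exit 1
```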