I have been trying to reproduce and verify this bug by arranging for a pod that cannot be found during drain, but have not succeeded.
Can you help with steps to reproduce this bug?
Thanks.
There is no way to reproduce this bug e2e, especially now that it's supposedly fixed. We can only verify that the command tolerates 404s in unit tests.
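For reference, the unit-level check meant here looks roughly like the following. This is only a minimal sketch, not the actual oc/kubectl drain code: evictPod and the injected evict function are hypothetical stand-ins, and the 404 is fabricated with k8s.io/apimachinery, just to show what "tolerates 404s" means in practice.

package drain

import (
	"fmt"
	"testing"

	apierrors "k8s.io/apimachinery/pkg/api/errors"
	"k8s.io/apimachinery/pkg/runtime/schema"
)

// evictPod wraps a single eviction attempt (hypothetical helper). A NotFound
// (404) response means the pod disappeared on its own, e.g. a short-lived
// deployer pod finishing, which is the outcome drain wants anyway, so it is
// not treated as an error.
func evictPod(name string, evict func(name string) error) error {
	if err := evict(name); err != nil {
		if apierrors.IsNotFound(err) {
			return nil // pod already gone; nothing left to evict
		}
		return fmt.Errorf("error when evicting pod %q: %v", name, err)
	}
	return nil
}

// TestEvictPodToleratesNotFound fabricates the 404 the API server returned
// for jenkins-1-deploy and checks that it does not fail the drain helper.
func TestEvictPodToleratesNotFound(t *testing.T) {
	notFound := apierrors.NewNotFound(schema.GroupResource{Resource: "pods"}, "jenkins-1-deploy")
	if err := evictPod("jenkins-1-deploy", func(string) error { return notFound }); err != nil {
		t.Fatalf("expected NotFound to be tolerated, got: %v", err)
	}
}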
Since this fix was introduced, our cluster upgrade process has not encountered the issue again. Based on that data point, I suggest QA move this to VERIFIED.
Description of problem:

During an openshift-ansible upgrade, it appears a short-lived pod caused drain to fail.

PLAY [Drain and upgrade nodes] *************************************************

TASK [setup] *******************************************************************
Using module file /usr/lib/python2.7/site-packages/ansible/modules/core/system/setup.py
<54.197.202.125> ESTABLISH SSH CONNECTION FOR USER: root
<54.197.202.125> SSH: EXEC ssh -o ControlMaster=auto -o ControlPersist=600s -o StrictHostKeyChecking=no -o KbdInteractiveAuthentication=no -o PreferredAuthentications=gssapi-with-mic,gssapi-keyex,hostbased,publickey -o PasswordAuthentication=no -o User=root -o ConnectTimeout=10 -o ControlPath=/home/opsmedic/.ansible/cp/ansible-ssh-%h-%p-%r 54.197.202.125 '/bin/sh -c '"'"'/usr/bin/python && sleep 0'"'"''
ok: [free-int-node-compute-30bae]

TASK [Mark node unschedulable] *************************************************
task path: /home/opsmedic/aos-cd/tmp/tmp.6AiYQkUSVf/openshift-ansible_extract/playbooks/common/openshift-cluster/upgrades/upgrade_nodes.yml:17
Using module file /home/opsmedic/aos-cd/tmp/tmp.6AiYQkUSVf/openshift-ansible_extract/roles/lib_openshift/library/oc_adm_manage_node.py
<54.147.205.250> ESTABLISH SSH CONNECTION FOR USER: root
<54.147.205.250> SSH: EXEC ssh -o ControlMaster=auto -o ControlPersist=600s -o StrictHostKeyChecking=no -o KbdInteractiveAuthentication=no -o PreferredAuthentications=gssapi-with-mic,gssapi-keyex,hostbased,publickey -o PasswordAuthentication=no -o User=root -o ConnectTimeout=10 -o ControlPath=/home/opsmedic/.ansible/cp/ansible-ssh-%h-%p-%r 54.147.205.250 '/bin/sh -c '"'"'/usr/bin/python && sleep 0'"'"''
changed: [free-int-node-compute-30bae -> None] => {
    "attempts": 1,
    "changed": true,
    "invocation": {
        "module_args": {
            "debug": false,
            "dry_run": false,
            "evacuate": false,
            "force": false,
            "grace_period": null,
            "kubeconfig": "/etc/origin/master/admin.kubeconfig",
            "list_pods": false,
            "node": [
                "ip-172-31-56-218.ec2.internal"
            ],
            "pod_selector": null,
            "schedulable": false,
            "selector": null
        },
        "module_name": "oc_adm_manage_node"
    },
    "results": {
        "cmd": "/usr/bin/oc adm manage-node ip-172-31-56-218.ec2.internal --schedulable=False",
        "nodes": [
            {
                "name": "ip-172-31-56-218.ec2.internal",
                "schedulable": false
            }
        ],
        "results": "NAME STATUS AGE VERSION\nip-172-31-56-218.ec2.internal Ready,SchedulingDisabled 66d v1.6.1+5115d708d7\n",
        "returncode": 0
    },
    "state": "present"
}

TASK [Drain Node for Kubelet upgrade] ******************************************
task path: /home/opsmedic/aos-cd/tmp/tmp.6AiYQkUSVf/openshift-ansible_extract/playbooks/common/openshift-cluster/upgrades/upgrade_nodes.yml:27
Using module file /usr/lib/python2.7/site-packages/ansible/modules/core/commands/command.py
<54.147.205.250> ESTABLISH SSH CONNECTION FOR USER: root
<54.147.205.250> SSH: EXEC ssh -o ControlMaster=auto -o ControlPersist=600s -o StrictHostKeyChecking=no -o KbdInteractiveAuthentication=no -o PreferredAuthentications=gssapi-with-mic,gssapi-keyex,hostbased,publickey -o PasswordAuthentication=no -o User=root -o ConnectTimeout=10 -o ControlPath=/home/opsmedic/.ansible/cp/ansible-ssh-%h-%p-%r 54.147.205.250 '/bin/sh -c '"'"'/usr/bin/python && sleep 0'"'"''
=> { "changed": true, "cmd": [ "oadm", "drain", "ip-172-31-56-218.ec2.internal", "--force", "--delete-local-data", "--ignore-daemonsets" ], "delta": "0:00:33.211124", "end": "2017-06-08 22:00:48.947786", "failed": true, "invocation": { "module_args": { "_raw_params": "oadm drain ip-172-31-56-218.ec2.internal --force --delete-local-data --ignore-daemonsets", "_uses_shell": false, "chdir": null, "creates": null, "executable": null, "removes": null, "warn": true }, "module_name": "command" }, "rc": 1, "start": "2017-06-08 22:00:15.736662", "stderr": "WARNING: Deleting pods with local storage: jenkins-docker-2-97crt; Deleting pods not managed by ReplicationController, ReplicaSet, Job, DaemonSet or StatefulSet: jenkins-1-deploy\nWARNING: Deleting pods with local storage: jenkins-docker-2-97crt\nThere are pending pods when an error occurred: error when evicting pod \"jenkins-1-deploy\": pods \"jenkins-1-deploy\" not found\npod/jenkins-1-cpw87\npod/che-1-c6j77\npod/jenkins-docker-2-97crt\npod/j1704251626-1-18mq9\npod/content-repository-1-pbp41\npod/php1-3-18mq7\npod/build-06081700z-ol-1-773g1\npod/pull-04192220z-er-1-05sdg\npod/pull-05022050z-u3-1-z4v6k\npod/pull-05151530z-nw-1-vkd26\npod/my-4-4p9fd\npod/jenkins-1-cs9rz\npod/jenkins-1-m5tlg\npod/che-1-xb6pd\npod/jenkins-1-jf76t\npod/content-repository-1-s0q28\npod/undertow-servlet-3-19zzj\npod/hooks-2-07kph\npod/hooks-2-hjb58\nerror: error when evicting pod \"jenkins-1-deploy\": pods \"jenkins-1-deploy\" not found", "stdout": "node \"ip-172-31-56-218.ec2.internal\" already cordoned\npod \"frontend-1-hook-pre\" evicted\npod \"jenkins-1-jf76t\" evicted\npod \"che-1-xb6pd\" evicted\npod \"hooks-2-07kph\" evicted\npod \"j1704251626-1-18mq9\" evicted\npod \"pull-04192220z-er-1-05sdg\" evicted\npod \"content-repository-1-s0q28\" evicted\npod \"jenkins-1-m5tlg\" evicted\npod \"jenkins-1-cpw87\" evicted\npod \"pull-05151530z-nw-1-vkd26\" evicted\npod \"php1-3-18mq7\" evicted\npod \"build-06081700z-ol-1-773g1\" evicted\npod \"pull-05022050z-u3-1-z4v6k\" evicted\npod \"jenkins-1-cs9rz\" evicted\npod \"my-4-4p9fd\" evicted\npod \"undertow-servlet-3-19zzj\" evicted\npod \"hooks-2-hjb58\" evicted\npod \"jenkins-docker-2-97crt\" evicted\npod \"content-repository-1-pbp41\" evicted", "stdout_lines": [ "node \"ip-172-31-56-218.ec2.internal\" already cordoned", "pod \"frontend-1-hook-pre\" evicted", "pod \"jenkins-1-jf76t\" evicted", "pod \"che-1-xb6pd\" evicted", "pod \"hooks-2-07kph\" evicted", "pod \"j1704251626-1-18mq9\" evicted", "pod \"pull-04192220z-er-1-05sdg\" evicted", "pod \"content-repository-1-s0q28\" evicted", "pod \"jenkins-1-m5tlg\" evicted", "pod \"jenkins-1-cpw87\" evicted", "pod \"pull-05151530z-nw-1-vkd26\" evicted", "pod \"php1-3-18mq7\" evicted", "pod \"build-06081700z-ol-1-773g1\" evicted", "pod \"pull-05022050z-u3-1-z4v6k\" evicted", "pod \"jenkins-1-cs9rz\" evicted", "pod \"my-4-4p9fd\" evicted", "pod \"undertow-servlet-3-19zzj\" evicted", "pod \"hooks-2-hjb58\" evicted", "pod \"jenkins-docker-2-97crt\" evicted", "pod \"content-repository-1-pbp41\" evicted" ], "warnings": [] } Version-Release number of selected component (if applicable): oc v3.6.100 kubernetes v1.6.1+5115d708d7 features: Basic-Auth GSSAPI Kerberos SPNEGO Server https://internal.api.free-int.openshift.com:443 openshift v3.6.100 kubernetes v1.6.1+5115d708d7 How reproducible: Has only occurred once. Expected results: If a pod is legitimately removed drain, it should not cause an error. 
Additional info:
Full openshift-ansible logs: http://file.rdu.redhat.com/~jupierce/share/drain-error-consoleText.txt
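For anyone reading the stderr above, the failure is a race between the pod-list snapshot and the per-pod eviction call: jenkins-1-deploy (a short-lived deployer pod) had already terminated by the time its eviction was issued, and the resulting NotFound was surfaced as a fatal drain error. The following is only a minimal Go sketch of that pattern, with hypothetical drainPods/evict helpers rather than the real oc code, showing the tolerant handling the fix is expected to apply.

package main

import (
	"fmt"

	apierrors "k8s.io/apimachinery/pkg/api/errors"
	"k8s.io/apimachinery/pkg/runtime/schema"
)

// drainPods mirrors the shape of the problem: the pod list is a snapshot,
// so any pod can legitimately disappear before its eviction is issued.
// Tolerating the resulting NotFound keeps one vanished pod from failing
// the whole drain.
func drainPods(pods []string, evict func(name string) error) error {
	for _, name := range pods {
		err := evict(name)
		switch {
		case err == nil:
			fmt.Printf("pod %q evicted\n", name)
		case apierrors.IsNotFound(err):
			// Raced with the pod's own termination (e.g. a completed
			// deployer pod like jenkins-1-deploy); treat it as already gone.
			fmt.Printf("pod %q already gone, skipping\n", name)
		default:
			return fmt.Errorf("error when evicting pod %q: %v", name, err)
		}
	}
	return nil
}

func main() {
	// jenkins-1-deploy vanishes between the snapshot and the eviction call,
	// which is the situation captured in the stderr above (names illustrative).
	gone := apierrors.NewNotFound(schema.GroupResource{Resource: "pods"}, "jenkins-1-deploy")
	evict := func(name string) error {
		if name == "jenkins-1-deploy" {
			return gone
		}
		return nil
	}
	if err := drainPods([]string{"jenkins-1-cpw87", "jenkins-1-deploy", "che-1-c6j77"}, evict); err != nil {
		fmt.Println("drain failed:", err)
	}
}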