Bug 1460067 - Error from adm drain when pod disappears
Status: CLOSED ERRATA
Product: OpenShift Container Platform
Classification: Red Hat
Component: Pod
Version: 3.6.0
Hardware: Unspecified
OS: Linux
Priority: unspecified
Severity: high
Target Milestone: ---
Target Release: 3.7.0
Assigned To: Derek Carr
QA Contact: Weihua Meng
Docs Contact:
Depends On:
Blocks:
Reported: 2017-06-08 20:53 EDT by Justin Pierce
Modified: 2017-08-14 16:58 EDT
CC List: 10 users

See Also:
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2017-08-14 16:58:09 EDT
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---


Attachments: None

Description Justin Pierce 2017-06-08 20:53:04 EDT
Description of problem:

During an openshift-ansible upgrade, it appears a short-lived pod caused oc adm drain to fail.

PLAY [Drain and upgrade nodes] *************************************************

TASK [setup] *******************************************************************
Using module file /usr/lib/python2.7/site-packages/ansible/modules/core/system/setup.py
<54.197.202.125> ESTABLISH SSH CONNECTION FOR USER: root
<54.197.202.125> SSH: EXEC ssh -o ControlMaster=auto -o ControlPersist=600s -o StrictHostKeyChecking=no -o KbdInteractiveAuthentication=no -o PreferredAuthentications=gssapi-with-mic,gssapi-keyex,hostbased,publickey -o PasswordAuthentication=no -o User=root -o ConnectTimeout=10 -o ControlPath=/home/opsmedic/.ansible/cp/ansible-ssh-%h-%p-%r 54.197.202.125 '/bin/sh -c '"'"'/usr/bin/python && sleep 0'"'"''
ok: [free-int-node-compute-30bae]

TASK [Mark node unschedulable] *************************************************
task path: /home/opsmedic/aos-cd/tmp/tmp.6AiYQkUSVf/openshift-ansible_extract/playbooks/common/openshift-cluster/upgrades/upgrade_nodes.yml:17
Using module file /home/opsmedic/aos-cd/tmp/tmp.6AiYQkUSVf/openshift-ansible_extract/roles/lib_openshift/library/oc_adm_manage_node.py
<54.147.205.250> ESTABLISH SSH CONNECTION FOR USER: root
<54.147.205.250> SSH: EXEC ssh -o ControlMaster=auto -o ControlPersist=600s -o StrictHostKeyChecking=no -o KbdInteractiveAuthentication=no -o PreferredAuthentications=gssapi-with-mic,gssapi-keyex,hostbased,publickey -o PasswordAuthentication=no -o User=root -o ConnectTimeout=10 -o ControlPath=/home/opsmedic/.ansible/cp/ansible-ssh-%h-%p-%r 54.147.205.250 '/bin/sh -c '"'"'/usr/bin/python && sleep 0'"'"''
changed: [free-int-node-compute-30bae -> None] => {
    "attempts": 1, 
    "changed": true, 
    "invocation": {
        "module_args": {
            "debug": false, 
            "dry_run": false, 
            "evacuate": false, 
            "force": false, 
            "grace_period": null, 
            "kubeconfig": "/etc/origin/master/admin.kubeconfig", 
            "list_pods": false, 
            "node": [
                "ip-172-31-56-218.ec2.internal"
            ], 
            "pod_selector": null, 
            "schedulable": false, 
            "selector": null
        }, 
        "module_name": "oc_adm_manage_node"
    }, 
    "results": {
        "cmd": "/usr/bin/oc adm manage-node ip-172-31-56-218.ec2.internal --schedulable=False", 
        "nodes": [
            {
                "name": "ip-172-31-56-218.ec2.internal", 
                "schedulable": false
            }
        ], 
        "results": "NAME                            STATUS                     AGE       VERSION\nip-172-31-56-218.ec2.internal   Ready,SchedulingDisabled   66d       v1.6.1+5115d708d7\n", 
        "returncode": 0
    }, 
    "state": "present"
}

TASK [Drain Node for Kubelet upgrade] ******************************************
task path: /home/opsmedic/aos-cd/tmp/tmp.6AiYQkUSVf/openshift-ansible_extract/playbooks/common/openshift-cluster/upgrades/upgrade_nodes.yml:27
Using module file /usr/lib/python2.7/site-packages/ansible/modules/core/commands/command.py
<54.147.205.250> ESTABLISH SSH CONNECTION FOR USER: root
<54.147.205.250> SSH: EXEC ssh -o ControlMaster=auto -o ControlPersist=600s -o StrictHostKeyChecking=no -o KbdInteractiveAuthentication=no -o PreferredAuthentications=gssapi-with-mic,gssapi-keyex,hostbased,publickey -o PasswordAuthentication=no -o User=root -o ConnectTimeout=10 -o ControlPath=/home/opsmedic/.ansible/cp/ansible-ssh-%h-%p-%r 54.147.205.250 '/bin/sh -c '"'"'/usr/bin/python && sleep 0'"'"''
fatal: [free-int-node-compute-30bae -> None]: FAILED! => {
    "changed": true, 
    "cmd": [
        "oadm", 
        "drain", 
        "ip-172-31-56-218.ec2.internal", 
        "--force", 
        "--delete-local-data", 
        "--ignore-daemonsets"
    ], 
    "delta": "0:00:33.211124", 
    "end": "2017-06-08 22:00:48.947786", 
    "failed": true, 
    "invocation": {
        "module_args": {
            "_raw_params": "oadm drain ip-172-31-56-218.ec2.internal --force --delete-local-data --ignore-daemonsets", 
            "_uses_shell": false, 
            "chdir": null, 
            "creates": null, 
            "executable": null, 
            "removes": null, 
            "warn": true
        }, 
        "module_name": "command"
    }, 
    "rc": 1, 
    "start": "2017-06-08 22:00:15.736662", 
    "stderr": "WARNING: Deleting pods with local storage: jenkins-docker-2-97crt; Deleting pods not managed by ReplicationController, ReplicaSet, Job, DaemonSet or StatefulSet: jenkins-1-deploy\nWARNING: Deleting pods with local storage: jenkins-docker-2-97crt\nThere are pending pods when an error occurred: error when evicting pod \"jenkins-1-deploy\": pods \"jenkins-1-deploy\" not found\npod/jenkins-1-cpw87\npod/che-1-c6j77\npod/jenkins-docker-2-97crt\npod/j1704251626-1-18mq9\npod/content-repository-1-pbp41\npod/php1-3-18mq7\npod/build-06081700z-ol-1-773g1\npod/pull-04192220z-er-1-05sdg\npod/pull-05022050z-u3-1-z4v6k\npod/pull-05151530z-nw-1-vkd26\npod/my-4-4p9fd\npod/jenkins-1-cs9rz\npod/jenkins-1-m5tlg\npod/che-1-xb6pd\npod/jenkins-1-jf76t\npod/content-repository-1-s0q28\npod/undertow-servlet-3-19zzj\npod/hooks-2-07kph\npod/hooks-2-hjb58\nerror: error when evicting pod \"jenkins-1-deploy\": pods \"jenkins-1-deploy\" not found", 
    "stdout": "node \"ip-172-31-56-218.ec2.internal\" already cordoned\npod \"frontend-1-hook-pre\" evicted\npod \"jenkins-1-jf76t\" evicted\npod \"che-1-xb6pd\" evicted\npod \"hooks-2-07kph\" evicted\npod \"j1704251626-1-18mq9\" evicted\npod \"pull-04192220z-er-1-05sdg\" evicted\npod \"content-repository-1-s0q28\" evicted\npod \"jenkins-1-m5tlg\" evicted\npod \"jenkins-1-cpw87\" evicted\npod \"pull-05151530z-nw-1-vkd26\" evicted\npod \"php1-3-18mq7\" evicted\npod \"build-06081700z-ol-1-773g1\" evicted\npod \"pull-05022050z-u3-1-z4v6k\" evicted\npod \"jenkins-1-cs9rz\" evicted\npod \"my-4-4p9fd\" evicted\npod \"undertow-servlet-3-19zzj\" evicted\npod \"hooks-2-hjb58\" evicted\npod \"jenkins-docker-2-97crt\" evicted\npod \"content-repository-1-pbp41\" evicted", 
    "stdout_lines": [
        "node \"ip-172-31-56-218.ec2.internal\" already cordoned", 
        "pod \"frontend-1-hook-pre\" evicted", 
        "pod \"jenkins-1-jf76t\" evicted", 
        "pod \"che-1-xb6pd\" evicted", 
        "pod \"hooks-2-07kph\" evicted", 
        "pod \"j1704251626-1-18mq9\" evicted", 
        "pod \"pull-04192220z-er-1-05sdg\" evicted", 
        "pod \"content-repository-1-s0q28\" evicted", 
        "pod \"jenkins-1-m5tlg\" evicted", 
        "pod \"jenkins-1-cpw87\" evicted", 
        "pod \"pull-05151530z-nw-1-vkd26\" evicted", 
        "pod \"php1-3-18mq7\" evicted", 
        "pod \"build-06081700z-ol-1-773g1\" evicted", 
        "pod \"pull-05022050z-u3-1-z4v6k\" evicted", 
        "pod \"jenkins-1-cs9rz\" evicted", 
        "pod \"my-4-4p9fd\" evicted", 
        "pod \"undertow-servlet-3-19zzj\" evicted", 
        "pod \"hooks-2-hjb58\" evicted", 
        "pod \"jenkins-docker-2-97crt\" evicted", 
        "pod \"content-repository-1-pbp41\" evicted"
    ], 
    "warnings": []
}


Version-Release number of selected component (if applicable):
oc v3.6.100
kubernetes v1.6.1+5115d708d7
features: Basic-Auth GSSAPI Kerberos SPNEGO

Server https://internal.api.free-int.openshift.com:443
openshift v3.6.100
kubernetes v1.6.1+5115d708d7


How reproducible:
Has only occurred once. 


Expected results:
If a pod is legitimately removed during a drain, it should not cause an error.
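
For illustration, a minimal sketch of the tolerant handling expected here (this is not the actual oc adm drain code; evictOrSkip and evictPod are hypothetical names standing in for the per-pod eviction step):

package drain

import (
	apierrors "k8s.io/apimachinery/pkg/api/errors"
)

// evictOrSkip wraps the per-pod eviction request a drain issues
// (evictPod stands in for that call). A pod that vanished between
// listing and eviction, e.g. a completed deployer pod such as
// jenkins-1-deploy, makes the request return 404; that is treated
// as "already drained" rather than as a failure.
func evictOrSkip(evictPod func(namespace, name string) error, namespace, name string) error {
	err := evictPod(namespace, name)
	if err == nil || apierrors.IsNotFound(err) {
		return nil // evicted, or already gone: nothing left to drain
	}
	return err // any other error should still fail the drain
}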

Additional info:
Full openshift-ansible logs: http://file.rdu.redhat.com/~jupierce/share/drain-error-consoleText.txt
Comment 1 Scott Dodson 2017-06-08 22:08:08 EDT
I'm pretty sure there's an Origin issue, if not another Bugzilla, about this. It happens in particular with builds.
Comment 2 Derek Carr 2017-06-09 17:03:58 EDT
Opened upstream fix:
https://github.com/kubernetes/kubernetes/pull/47270

Will backport when merged upstream.
Comment 3 Michail Kargakis 2017-06-13 14:32:15 EDT
Also https://github.com/kubernetes/kubernetes/pull/47450
Comment 4 Derek Carr 2017-06-14 20:25:17 EDT
posted:

https://github.com/openshift/origin/pull/14663

Once 47450 merges, I will pick that up as well.
Comment 5 Derek Carr 2017-06-15 21:56:56 EDT
as well as:
https://github.com/openshift/origin/pull/14690
Comment 6 Weihua Meng 2017-07-05 07:02:23 EDT
I have been trying to reproduce and verify this bug by getting a pod to disappear during the drain, but have not been able to.
Can you help with steps to reproduce this bug?
Thanks.
Comment 7 Michail Kargakis 2017-07-05 07:07:35 EDT
There is no way to reproduce this bug e2e, especially now that it's supposedly fixed. We can only verify that the command tolerates 404s in unit tests.
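
For illustration, a self-contained sketch of the kind of unit-level 404-tolerance check meant here (not the actual upstream test; evictionErrTolerable is a hypothetical stand-in for the drain's eviction error handling):

package drain

import (
	"errors"
	"testing"

	apierrors "k8s.io/apimachinery/pkg/api/errors"
	"k8s.io/apimachinery/pkg/runtime/schema"
)

// evictionErrTolerable reports whether a drain should treat an eviction
// error as non-fatal; only "not found" qualifies, since the pod is gone.
func evictionErrTolerable(err error) bool {
	return err == nil || apierrors.IsNotFound(err)
}

// TestDrainToleratesMissingPod builds a synthetic 404 like the
// pods "jenkins-1-deploy" not found error in this report and checks
// that only that class of error is tolerated.
func TestDrainToleratesMissingPod(t *testing.T) {
	gone := apierrors.NewNotFound(schema.GroupResource{Resource: "pods"}, "jenkins-1-deploy")
	if !evictionErrTolerable(gone) {
		t.Error("a pod that already disappeared should not fail the drain")
	}
	if evictionErrTolerable(errors.New("etcdserver: request timed out")) {
		t.Error("other eviction errors should still fail the drain")
	}
}
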
Comment 8 Justin Pierce 2017-07-05 07:41:50 EDT
Since this fix was introduced, our cluster upgrade process has not encountered the issue. With that data point, I suggest QA move this to VERIFIED.
Comment 9 Weihua Meng 2017-07-06 04:36:58 EDT
Verified on openshift v3.6.135
The issue no longer occurs.
Moving to VERIFIED per comments 7 and 8.
Also cc'ing the two QEs who are in charge of upgrade testing.
Comment 11 Scott Dodson 2017-08-14 16:58:09 EDT
This was fixed in 3.6.0.173.5, which is the GA 3.6 release.
