Bug 1460067 - Error from adm drain when pod disappears
Status: CLOSED ERRATA
Product: OpenShift Container Platform
Classification: Red Hat
Component: Pod
Version: 3.6.0
Hardware: Unspecified
OS: Linux
Priority: unspecified
Severity: high
Target Milestone: ---
Target Release: 3.7.0
Assigned To: Derek Carr
QA Contact: Weihua Meng
Docs Contact:
Depends On:
Blocks:
Reported: 2017-06-08 20:53 EDT by Justin Pierce
Modified: 2017-08-14 16:58 EDT
CC List: 10 users

See Also:
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2017-08-14 16:58:09 EDT
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---


Attachments: None

Description Justin Pierce 2017-06-08 20:53:04 EDT
Description of problem:

During an openshift-ansible upgrade, it appears a short-lived pod caused oc adm drain to fail.

PLAY [Drain and upgrade nodes] *************************************************

TASK [setup] *******************************************************************
Using module file /usr/lib/python2.7/site-packages/ansible/modules/core/system/setup.py
<54.197.202.125> ESTABLISH SSH CONNECTION FOR USER: root
<54.197.202.125> SSH: EXEC ssh -o ControlMaster=auto -o ControlPersist=600s -o StrictHostKeyChecking=no -o KbdInteractiveAuthentication=no -o PreferredAuthentications=gssapi-with-mic,gssapi-keyex,hostbased,publickey -o PasswordAuthentication=no -o User=root -o ConnectTimeout=10 -o ControlPath=/home/opsmedic/.ansible/cp/ansible-ssh-%h-%p-%r 54.197.202.125 '/bin/sh -c '"'"'/usr/bin/python && sleep 0'"'"''
ok: [free-int-node-compute-30bae]

TASK [Mark node unschedulable] *************************************************
task path: /home/opsmedic/aos-cd/tmp/tmp.6AiYQkUSVf/openshift-ansible_extract/playbooks/common/openshift-cluster/upgrades/upgrade_nodes.yml:17
Using module file /home/opsmedic/aos-cd/tmp/tmp.6AiYQkUSVf/openshift-ansible_extract/roles/lib_openshift/library/oc_adm_manage_node.py
<54.147.205.250> ESTABLISH SSH CONNECTION FOR USER: root
<54.147.205.250> SSH: EXEC ssh -o ControlMaster=auto -o ControlPersist=600s -o StrictHostKeyChecking=no -o KbdInteractiveAuthentication=no -o PreferredAuthentications=gssapi-with-mic,gssapi-keyex,hostbased,publickey -o PasswordAuthentication=no -o User=root -o ConnectTimeout=10 -o ControlPath=/home/opsmedic/.ansible/cp/ansible-ssh-%h-%p-%r 54.147.205.250 '/bin/sh -c '"'"'/usr/bin/python && sleep 0'"'"''
changed: [free-int-node-compute-30bae -> None] => {
    "attempts": 1, 
    "changed": true, 
    "invocation": {
        "module_args": {
            "debug": false, 
            "dry_run": false, 
            "evacuate": false, 
            "force": false, 
            "grace_period": null, 
            "kubeconfig": "/etc/origin/master/admin.kubeconfig", 
            "list_pods": false, 
            "node": [
                "ip-172-31-56-218.ec2.internal"
            ], 
            "pod_selector": null, 
            "schedulable": false, 
            "selector": null
        }, 
        "module_name": "oc_adm_manage_node"
    }, 
    "results": {
        "cmd": "/usr/bin/oc adm manage-node ip-172-31-56-218.ec2.internal --schedulable=False", 
        "nodes": [
            {
                "name": "ip-172-31-56-218.ec2.internal", 
                "schedulable": false
            }
        ], 
        "results": "NAME                            STATUS                     AGE       VERSION\nip-172-31-56-218.ec2.internal   Ready,SchedulingDisabled   66d       v1.6.1+5115d708d7\n", 
        "returncode": 0
    }, 
    "state": "present"
}

TASK [Drain Node for Kubelet upgrade] ******************************************
task path: /home/opsmedic/aos-cd/tmp/tmp.6AiYQkUSVf/openshift-ansible_extract/playbooks/common/openshift-cluster/upgrades/upgrade_nodes.yml:27
Using module file /usr/lib/python2.7/site-packages/ansible/modules/core/commands/command.py
<54.147.205.250> ESTABLISH SSH CONNECTION FOR USER: root
<54.147.205.250> SSH: EXEC ssh -o ControlMaster=auto -o ControlPersist=600s -o StrictHostKeyChecking=no -o KbdInteractiveAuthentication=no -o PreferredAuthentications=gssapi-with-mic,gssapi-keyex,hostbased,publickey -o PasswordAuthentication=no -o User=root -o ConnectTimeout=10 -o ControlPath=/home/opsmedic/.ansible/cp/ansible-ssh-%h-%p-%r 54.147.205.250 '/bin/sh -c '"'"'/usr/bin/python && sleep 0'"'"''
fatal: [free-int-node-compute-30bae -> None]: FAILED! => {
    "changed": true, 
    "cmd": [
        "oadm", 
        "drain", 
        "ip-172-31-56-218.ec2.internal", 
        "--force", 
        "--delete-local-data", 
        "--ignore-daemonsets"
    ], 
    "delta": "0:00:33.211124", 
    "end": "2017-06-08 22:00:48.947786", 
    "failed": true, 
    "invocation": {
        "module_args": {
            "_raw_params": "oadm drain ip-172-31-56-218.ec2.internal --force --delete-local-data --ignore-daemonsets", 
            "_uses_shell": false, 
            "chdir": null, 
            "creates": null, 
            "executable": null, 
            "removes": null, 
            "warn": true
        }, 
        "module_name": "command"
    }, 
    "rc": 1, 
    "start": "2017-06-08 22:00:15.736662", 
    "stderr": "WARNING: Deleting pods with local storage: jenkins-docker-2-97crt; Deleting pods not managed by ReplicationController, ReplicaSet, Job, DaemonSet or StatefulSet: jenkins-1-deploy\nWARNING: Deleting pods with local storage: jenkins-docker-2-97crt\nThere are pending pods when an error occurred: error when evicting pod \"jenkins-1-deploy\": pods \"jenkins-1-deploy\" not found\npod/jenkins-1-cpw87\npod/che-1-c6j77\npod/jenkins-docker-2-97crt\npod/j1704251626-1-18mq9\npod/content-repository-1-pbp41\npod/php1-3-18mq7\npod/build-06081700z-ol-1-773g1\npod/pull-04192220z-er-1-05sdg\npod/pull-05022050z-u3-1-z4v6k\npod/pull-05151530z-nw-1-vkd26\npod/my-4-4p9fd\npod/jenkins-1-cs9rz\npod/jenkins-1-m5tlg\npod/che-1-xb6pd\npod/jenkins-1-jf76t\npod/content-repository-1-s0q28\npod/undertow-servlet-3-19zzj\npod/hooks-2-07kph\npod/hooks-2-hjb58\nerror: error when evicting pod \"jenkins-1-deploy\": pods \"jenkins-1-deploy\" not found", 
    "stdout": "node \"ip-172-31-56-218.ec2.internal\" already cordoned\npod \"frontend-1-hook-pre\" evicted\npod \"jenkins-1-jf76t\" evicted\npod \"che-1-xb6pd\" evicted\npod \"hooks-2-07kph\" evicted\npod \"j1704251626-1-18mq9\" evicted\npod \"pull-04192220z-er-1-05sdg\" evicted\npod \"content-repository-1-s0q28\" evicted\npod \"jenkins-1-m5tlg\" evicted\npod \"jenkins-1-cpw87\" evicted\npod \"pull-05151530z-nw-1-vkd26\" evicted\npod \"php1-3-18mq7\" evicted\npod \"build-06081700z-ol-1-773g1\" evicted\npod \"pull-05022050z-u3-1-z4v6k\" evicted\npod \"jenkins-1-cs9rz\" evicted\npod \"my-4-4p9fd\" evicted\npod \"undertow-servlet-3-19zzj\" evicted\npod \"hooks-2-hjb58\" evicted\npod \"jenkins-docker-2-97crt\" evicted\npod \"content-repository-1-pbp41\" evicted", 
    "stdout_lines": [
        "node \"ip-172-31-56-218.ec2.internal\" already cordoned", 
        "pod \"frontend-1-hook-pre\" evicted", 
        "pod \"jenkins-1-jf76t\" evicted", 
        "pod \"che-1-xb6pd\" evicted", 
        "pod \"hooks-2-07kph\" evicted", 
        "pod \"j1704251626-1-18mq9\" evicted", 
        "pod \"pull-04192220z-er-1-05sdg\" evicted", 
        "pod \"content-repository-1-s0q28\" evicted", 
        "pod \"jenkins-1-m5tlg\" evicted", 
        "pod \"jenkins-1-cpw87\" evicted", 
        "pod \"pull-05151530z-nw-1-vkd26\" evicted", 
        "pod \"php1-3-18mq7\" evicted", 
        "pod \"build-06081700z-ol-1-773g1\" evicted", 
        "pod \"pull-05022050z-u3-1-z4v6k\" evicted", 
        "pod \"jenkins-1-cs9rz\" evicted", 
        "pod \"my-4-4p9fd\" evicted", 
        "pod \"undertow-servlet-3-19zzj\" evicted", 
        "pod \"hooks-2-hjb58\" evicted", 
        "pod \"jenkins-docker-2-97crt\" evicted", 
        "pod \"content-repository-1-pbp41\" evicted"
    ], 
    "warnings": []
}


Version-Release number of selected component (if applicable):
oc v3.6.100
kubernetes v1.6.1+5115d708d7
features: Basic-Auth GSSAPI Kerberos SPNEGO

Server https://internal.api.free-int.openshift.com:443
openshift v3.6.100
kubernetes v1.6.1+5115d708d7


How reproducible:
Has only occurred once. 


Expected results:
If a pod is legitimately removed during a drain, it should not cause an error.
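
For illustration, a minimal sketch of the tolerant handling expected here (this is not the actual oc adm drain code; evictOrSkip and evictPod are hypothetical names standing in for the per-pod eviction step):

package drain

import (
	apierrors "k8s.io/apimachinery/pkg/api/errors"
)

// evictOrSkip wraps the per-pod eviction request a drain issues
// (evictPod stands in for that call). A pod that vanished between
// listing and eviction, e.g. a completed deployer pod such as
// jenkins-1-deploy, makes the request return 404; that is treated
// as "already drained" rather than as a failure.
func evictOrSkip(evictPod func(namespace, name string) error, namespace, name string) error {
	err := evictPod(namespace, name)
	if err == nil || apierrors.IsNotFound(err) {
		return nil // evicted, or already gone: nothing left to drain
	}
	return err // any other error should still fail the drain
}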

Additional info:
Full openshift-ansible logs: http://file.rdu.redhat.com/~jupierce/share/drain-error-consoleText.txt
Comment 1 Scott Dodson 2017-06-08 22:08:08 EDT
I'm pretty sure there's an Origin issue, if not another Bugzilla, about this. It happens in particular with builds.
Comment 2 Derek Carr 2017-06-09 17:03:58 EDT
Opened upstream fix:
https://github.com/kubernetes/kubernetes/pull/47270

Will backport when merged upstream.
Comment 3 Michail Kargakis 2017-06-13 14:32:15 EDT
Also https://github.com/kubernetes/kubernetes/pull/47450
Comment 4 Derek Carr 2017-06-14 20:25:17 EDT
posted:

https://github.com/openshift/origin/pull/14663

Once 47450 merges, I will pick that up as well.
Comment 5 Derek Carr 2017-06-15 21:56:56 EDT
as well as:
https://github.com/openshift/origin/pull/14690
Comment 6 Weihua Meng 2017-07-05 07:02:23 EDT
I have been trying to reproduce and verify this bug by getting a pod to disappear during the drain, but have not been able to.
Can you help with steps to reproduce this bug?
Thanks.
Comment 7 Michail Kargakis 2017-07-05 07:07:35 EDT
There is no way to reproduce this bug e2e, especially now that it's supposedly fixed. We can only verify that the command tolerates 404s in unit tests.
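
For illustration, a self-contained sketch of the kind of unit-level 404-tolerance check meant here (not the actual upstream test; evictionErrTolerable is a hypothetical stand-in for the drain's eviction error handling):

package drain

import (
	"errors"
	"testing"

	apierrors "k8s.io/apimachinery/pkg/api/errors"
	"k8s.io/apimachinery/pkg/runtime/schema"
)

// evictionErrTolerable reports whether a drain should treat an eviction
// error as non-fatal; only "not found" qualifies, since the pod is gone.
func evictionErrTolerable(err error) bool {
	return err == nil || apierrors.IsNotFound(err)
}

// TestDrainToleratesMissingPod builds a synthetic 404 like the
// pods "jenkins-1-deploy" not found error in this report and checks
// that only that class of error is tolerated.
func TestDrainToleratesMissingPod(t *testing.T) {
	gone := apierrors.NewNotFound(schema.GroupResource{Resource: "pods"}, "jenkins-1-deploy")
	if !evictionErrTolerable(gone) {
		t.Error("a pod that already disappeared should not fail the drain")
	}
	if evictionErrTolerable(errors.New("etcdserver: request timed out")) {
		t.Error("other eviction errors should still fail the drain")
	}
}
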
Comment 8 Justin Pierce 2017-07-05 07:41:50 EDT
Since this fix was introduced, our cluster upgrade process has not encountered the issue. With that data point, I suggest QA move this to VERIFIED.
Comment 9 Weihua Meng 2017-07-06 04:36:58 EDT
Verified on openshift v3.6.135
The issue no longer occurs.
Moving to VERIFIED per comments 7 and 8.
Also cc'ing the two QEs who are in charge of upgrade testing.
Comment 11 Scott Dodson 2017-08-14 16:58:09 EDT
This was fixed in 3.6.0.173.5, which is the GA 3.6 release.
