Description of problem:
During a drain by openshift-ansible on the production cluster starter-us-east-1.

Version-Release number of selected component (if applicable):
oc v3.6.173.0.5
kubernetes v1.6.1+5115d708d7
features: Basic-Auth GSSAPI Kerberos SPNEGO
Server https://internal.api.starter-us-east-1.openshift.com:443
openshift v3.6.173.0.5
kubernetes v1.6.1+5115d708d7

How reproducible:
Low

Steps to Reproduce:
1. Run an openshift-ansible upgrade on a large cluster

Actual results:
Some nodes fail with:

<54.210.132.97> SSH: EXEC ssh -o ControlMaster=auto -o ControlPersist=600s -o StrictHostKeyChecking=no -o KbdInteractiveAuthentication=no -o PreferredAuthentications=gssapi-with-mic,gssapi-keyex,hostbased,publickey -o PasswordAuthentication=no -o User=root -o ConnectTimeout=10 -o ControlPath=/home/opsmedic/.ansible/cp/ansible-ssh-%h-%p-%r 54.210.132.97 '/bin/sh -c '"'"'/usr/bin/python && sleep 0'"'"''
fatal: [starter-us-east-1-node-compute-c98c9 -> None]: FAILED! => {
    "attempts": 60,
    "changed": true,
    "cmd": [
        "oadm",
        "drain",
        "ip-172-31-50-229.ec2.internal",
        "--config=/etc/origin/master/admin.kubeconfig",
        "--force",
        "--delete-local-data",
        "--ignore-daemonsets"
    ],
    "delta": "0:00:11.811169",
    "end": "2017-08-08 03:16:28.756428",
    "failed": true,
    "invocation": {
        "module_args": {
            "_raw_params": "oadm drain ip-172-31-50-229.ec2.internal --config=/etc/origin/master/admin.kubeconfig --force --delete-local-data --ignore-daemonsets",
            "_uses_shell": false,
            "chdir": null,
            "creates": null,
            "executable": null,
            "removes": null,
            "warn": true
        },
        "module_name": "command"
    },
    "rc": 1,
    "start": "2017-08-08 03:16:16.945259",
    "stderr": "WARNING: replicationcontrollers \"nodejsmongodb-3\" not found: nodejsmongodb-3-508xz, nodejsmongodb-3-508xz\nWARNING: replicationcontrollers \"nodejsmongodb-3\" not found: nodejsmongodb-3-508xz, nodejsmongodb-3-508xz\nThere are pending pods when an error occurred: error when evicting pod \"nodejsmongodb-3-508xz\": pods \"nodejsmongodb-3-508xz\" is forbidden: unable to create new content in namespace demo00 because it is being terminated.\npod/nodejsmongodb-3-508xz\npod/postgresql-1-sl9rh\npod/mariadb-3-1g2m7\npod/mysql-3-5b1sh\npod/mysql-3-44fz4\nerror: error when evicting pod \"nodejsmongodb-3-508xz\": pods \"nodejsmongodb-3-508xz\" is forbidden: unable to create new content in namespace demo00 because it is being terminated.",
    "stdout": "node \"ip-172-31-50-229.ec2.internal\" already cordoned",
    "stdout_lines": [
        "node \"ip-172-31-50-229.ec2.internal\" already cordoned"
    ],
    "warnings": []
}
I thought this was a spurious condition that could be recovered from by retrying the operation, but it seems worse than that: openshift-ansible exhausted all 60 retry attempts without recovering from this error.
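A retry loop cannot succeed here, because the eviction subresource is a pod *creation* from the API server's point of view, and creations are rejected for the whole lifetime of a terminating namespace. A minimal sketch (an assumed helper, not part of openshift-ansible) of how one might classify the captured drain stderr before burning retries, keyed on the distinctive error string from the report:

```shell
# Sample stderr taken from the failure above; in a real playbook this would
# be the registered result of the drain task.
stderr='error when evicting pod "nodejsmongodb-3-508xz": pods "nodejsmongodb-3-508xz" is forbidden: unable to create new content in namespace demo00 because it is being terminated.'

# "unable to create new content ... because it is being terminated" means the
# eviction is blocked by namespace termination, not by a transient condition,
# so retrying the drain alone will never help.
if printf '%s\n' "$stderr" | grep -q 'because it is being terminated'; then
    echo "namespace terminating: retrying the drain alone will not help"
else
    echo "transient failure: safe to retry"
fi
```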
The problem reoccurred, this time in the dakinitest20170618 namespace:

    },
    "rc": 1,
    "retries": 61,
    "start": "2017-08-08 16:49:58.313226",
    "stderr": "WARNING: replicationcontrollers \"mysql-1\" not found: mysql-1-p3ttk, mysql-1-p3ttk\nWARNING: replicationcontrollers \"mysql-1\" not found: mysql-1-p3ttk, mysql-1-p3ttk\nThere are pending pods when an error occurred: error when evicting pod \"mysql-1-p3ttk\": pods \"mysql-1-p3ttk\" is forbidden: unable to create new content in namespace dakinitest20170618 because it is being terminated.\npod/mysql-1-p3ttk\npod/jws-app-1-hkvqc\nerror: error when evicting pod \"mysql-1-p3ttk\": pods \"mysql-1-p3ttk\" is forbidden: unable to create new content in namespace dakinitest20170618 because it is being terminated.",
    "stdout": "node \"ip-172-31-53-187.ec2.internal\" already cordoned",
    "stdout_lines": [
        "node \"ip-172-31-53-187.ec2.internal\" already cordoned"
    ],

[root@starter-us-east-1-master-25064 ~]# oc get all -n dakinitest20170618
NAME               READY     STATUS    RESTARTS   AGE
po/mysql-1-p3ttk   0/1       Unknown   1          19d

[root@starter-us-east-1-master-25064 ~]# oc get all -n dakinitest20170618 -o yaml
apiVersion: v1
items:
- apiVersion: v1
  kind: Pod
  metadata:
    annotations:
      kubernetes.io/created-by: |
        {"kind":"SerializedReference","apiVersion":"v1","reference":{"kind":"ReplicationController","namespace":"dakinitest20170618","name":"mysql-1","uid":"e0b97aaf-543b-11e7-835c-12d641ec7610","apiVersion":"v1","resourceVersion":"1067971746"}}
      openshift.io/deployment-config.latest-version: "1"
      openshift.io/deployment-config.name: mysql
      openshift.io/deployment.name: mysql-1
      openshift.io/generated-by: OpenShiftNewApp
      openshift.io/scc: restricted
    creationTimestamp: 2017-07-20T06:37:53Z
    deletionGracePeriodSeconds: 30
    deletionTimestamp: 2017-08-02T16:48:38Z
    generateName: mysql-1-
    labels:
      app: cakephp-mysql-persistent
      deployment: mysql-1
      deploymentconfig: mysql
      name: mysql
    name: mysql-1-p3ttk
    namespace: dakinitest20170618
    ownerReferences:
    - apiVersion: v1
      blockOwnerDeletion: true
      controller: true
      kind: ReplicationController
      name: mysql-1
      uid: e0b97aaf-543b-11e7-835c-12d641ec7610
    resourceVersion: "1258228086"
    selfLink: /api/v1/namespaces/dakinitest20170618/pods/mysql-1-p3ttk
    uid: f533bbeb-6d15-11e7-803a-122631632f42
  spec:
    containers:
    - env:
      - name: MYSQL_USER
        valueFrom:
          secretKeyRef:
            key: database-user
            name: cakephp-mysql-persistent
      - name: MYSQL_PASSWORD
        valueFrom:
          secretKeyRef:
            key: database-password
            name: cakephp-mysql-persistent
      - name: MYSQL_DATABASE
        value: default
      image: registry.access.redhat.com/rhscl/mysql-57-rhel7@sha256:991ef507a4e981531a5601f12ceb65da32605792f1117f15a6001305dd3cfd73
      imagePullPolicy: Always
      livenessProbe:
        failureThreshold: 3
        initialDelaySeconds: 30
        periodSeconds: 10
        successThreshold: 1
        tcpSocket:
          port: 3306
        timeoutSeconds: 1
      name: mysql
      ports:
      - containerPort: 3306
        protocol: TCP
      readinessProbe:
        exec:
          command:
          - /bin/sh
          - -i
          - -c
          - MYSQL_PWD='s48UeOLoL1JQtT3T' mysql -h 127.0.0.1 -u cakephp -D default -e 'SELECT 1'
        failureThreshold: 3
        initialDelaySeconds: 5
        periodSeconds: 10
        successThreshold: 1
        timeoutSeconds: 1
      resources:
        limits:
          cpu: "1"
          memory: 512Mi
        requests:
          cpu: 60m
          memory: 307Mi
      securityContext:
        capabilities:
          drop:
          - KILL
          - MKNOD
          - NET_RAW
          - SETGID
          - SETUID
          - SYS_CHROOT
        privileged: false
        runAsUser: 1124180000
        seLinuxOptions:
          level: s0:c352,c314
      terminationMessagePath: /dev/termination-log
      terminationMessagePolicy: File
      volumeMounts:
      - mountPath: /var/lib/mysql/data
        name: mysql-data
      - mountPath: /var/run/secrets/kubernetes.io/serviceaccount
        name: default-token-tx31j
        readOnly: true
    dnsPolicy: ClusterFirst
    imagePullSecrets:
    - name: default-dockercfg-ckvl7
    nodeName: ip-172-31-53-187.ec2.internal
    nodeSelector:
      type: compute
    restartPolicy: Always
    schedulerName: default-scheduler
    securityContext:
      fsGroup: 1124180000
      seLinuxOptions:
        level: s0:c352,c314
    serviceAccount: default
    serviceAccountName: default
    terminationGracePeriodSeconds: 30
    volumes:
    - name: mysql-data
      persistentVolumeClaim:
        claimName: mysql
    - name: default-token-tx31j
      secret:
        defaultMode: 420
        secretName: default-token-tx31j
  status:
    conditions:
    - lastProbeTime: null
      lastTransitionTime: 2017-07-20T06:37:53Z
      status: "True"
      type: Initialized
    - lastProbeTime: null
      lastTransitionTime: 2017-07-27T16:50:18Z
      message: 'containers with unready status: [mysql]'
      reason: ContainersNotReady
      status: "False"
      type: Ready
    - lastProbeTime: null
      lastTransitionTime: 2017-07-20T06:37:53Z
      status: "True"
      type: PodScheduled
    containerStatuses:
    - containerID: docker://e98609ceb53f32685938b1d5ce13586e9dcfea5220321916eb5b523fa577e967
      image: registry.access.redhat.com/rhscl/mysql-57-rhel7@sha256:991ef507a4e981531a5601f12ceb65da32605792f1117f15a6001305dd3cfd73
      imageID: docker-pullable://registry.access.redhat.com/rhscl/mysql-57-rhel7@sha256:991ef507a4e981531a5601f12ceb65da32605792f1117f15a6001305dd3cfd73
      lastState:
        terminated:
          containerID: docker://3439f952bde86637ef420a02dae289707bd0a827a4ba7376b6b60855aeb37063
          exitCode: 0
          finishedAt: 2017-07-21T23:29:02Z
          reason: Completed
          startedAt: 2017-07-20T06:40:30Z
      name: mysql
      ready: false
      restartCount: 1
      state:
        running:
          startedAt: 2017-07-21T23:35:32Z
    hostIP: 172.31.53.187
    phase: Running
    qosClass: Burstable
    startTime: 2017-07-20T06:37:53Z
kind: List
metadata: {}
resourceVersion: ""
selfLink: ""
[root@starter-us-east-1-master-25064 ~]#
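Note that the pod above has a deletionTimestamp of 2017-08-02 (weeks in the past), a 30-second grace period, and a status of Unknown, i.e. the kubelet never confirmed the deletion. A hedged operator workaround sketch: force-delete the stuck pod so the namespace (and the drain) can finish. The commands are echoed rather than executed here, and the namespace/pod names are the ones from this report; only run the printed command after confirming the container is really gone on the node.

```shell
# Stuck namespace and pod from the report above.
ns=dakinitest20170618
pod=mysql-1-p3ttk

# --grace-period=0 --force removes the API object immediately, without
# waiting for confirmation from the (already out-of-sync) kubelet.
# Echoed instead of executed so this sketch is side-effect free.
echo "oc delete pod $pod -n $ns --grace-period=0 --force"
```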
Created attachment 1310760 [details] metrics output
This looks like https://bugzilla.redhat.com/show_bug.cgi?id=1460729: namespace termination hangs on pods that are stuck because of a docker state mismatch with containerd, and the condition resolves on its own after a few hours. Indicators include "containerd: container not found" in the node log.

*** This bug has been marked as a duplicate of bug 1460729 ***
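To check a node for the duplicate bug's indicator, one can grep the node journal for the "containerd: container not found" string. A small sketch against a saved log excerpt; the sample line below is illustrative, not taken from this cluster, and the journal unit name varies by installation.

```shell
# Illustrative saved node-log excerpt (assumed content, not from this cluster).
cat > /tmp/node.log <<'EOF'
Aug 08 03:16:20 ip-172-31-50-229 atomic-openshift-node: containerd: container not found
EOF

# The presence of this string suggests the docker/containerd state mismatch
# described in bug 1460729.
if grep -q 'containerd: container not found' /tmp/node.log; then
    echo "indicator present: likely the docker/containerd state mismatch"
fi
```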