Description of problem:
During a drain by openshift-ansible on the production cluster starter-us-east-1.

Version-Release number of selected component (if applicable):
oc v3.6.173.0.5
kubernetes v1.6.1+5115d708d7
features: Basic-Auth GSSAPI Kerberos SPNEGO
Server https://internal.api.starter-us-east-1.openshift.com:443
openshift v3.6.173.0.5
kubernetes v1.6.1+5115d708d7

How reproducible:
Low

Steps to Reproduce:
1. Run an openshift-ansible upgrade on a large cluster

Actual results:
Some nodes fail with:

<54.210.132.97> SSH: EXEC ssh -o ControlMaster=auto -o ControlPersist=600s -o StrictHostKeyChecking=no -o KbdInteractiveAuthentication=no -o PreferredAuthentications=gssapi-with-mic,gssapi-keyex,hostbased,publickey -o PasswordAuthentication=no -o User=root -o ConnectTimeout=10 -o ControlPath=/home/opsmedic/.ansible/cp/ansible-ssh-%h-%p-%r 54.210.132.97 '/bin/sh -c '"'"'/usr/bin/python && sleep 0'"'"''
fatal: [starter-us-east-1-node-compute-c98c9 -> None]: FAILED! => {
    "attempts": 60,
    "changed": true,
    "cmd": [
        "oadm",
        "drain",
        "ip-172-31-50-229.ec2.internal",
        "--config=/etc/origin/master/admin.kubeconfig",
        "--force",
        "--delete-local-data",
        "--ignore-daemonsets"
    ],
    "delta": "0:00:11.811169",
    "end": "2017-08-08 03:16:28.756428",
    "failed": true,
    "invocation": {
        "module_args": {
            "_raw_params": "oadm drain ip-172-31-50-229.ec2.internal --config=/etc/origin/master/admin.kubeconfig --force --delete-local-data --ignore-daemonsets",
            "_uses_shell": false,
            "chdir": null,
            "creates": null,
            "executable": null,
            "removes": null,
            "warn": true
        },
        "module_name": "command"
    },
    "rc": 1,
    "start": "2017-08-08 03:16:16.945259",
    "stderr": "WARNING: replicationcontrollers \"nodejsmongodb-3\" not found: nodejsmongodb-3-508xz, nodejsmongodb-3-508xz\nWARNING: replicationcontrollers \"nodejsmongodb-3\" not found: nodejsmongodb-3-508xz, nodejsmongodb-3-508xz\nThere are pending pods when an error occurred: error when evicting pod \"nodejsmongodb-3-508xz\": pods \"nodejsmongodb-3-508xz\" is forbidden: unable to create new content in namespace demo00 because it is being terminated.\npod/nodejsmongodb-3-508xz\npod/postgresql-1-sl9rh\npod/mariadb-3-1g2m7\npod/mysql-3-5b1sh\npod/mysql-3-44fz4\nerror: error when evicting pod \"nodejsmongodb-3-508xz\": pods \"nodejsmongodb-3-508xz\" is forbidden: unable to create new content in namespace demo00 because it is being terminated.",
    "stdout": "node \"ip-172-31-50-229.ec2.internal\" already cordoned",
    "stdout_lines": [
        "node \"ip-172-31-50-229.ec2.internal\" already cordoned"
    ],
    "warnings": []
}
I thought this was a spurious condition that could be recovered from by retrying the operation, but it seems worse than that: openshift-ansible exhausted all 60 retry attempts without recovering from this error.
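A retry loop cannot succeed here, because the eviction subresource is a pod *creation* from the API server's point of view, and creations are rejected for the whole lifetime of a terminating namespace. A minimal sketch (an assumed helper, not part of openshift-ansible) of how one might classify the captured drain stderr before burning retries, keyed on the distinctive error string from the report:

```shell
# Sample stderr taken from the failure above; in a real playbook this would
# be the registered result of the drain task.
stderr='error when evicting pod "nodejsmongodb-3-508xz": pods "nodejsmongodb-3-508xz" is forbidden: unable to create new content in namespace demo00 because it is being terminated.'

# "unable to create new content ... because it is being terminated" means the
# eviction is blocked by namespace termination, not by a transient condition,
# so retrying the drain alone will never help.
if printf '%s\n' "$stderr" | grep -q 'because it is being terminated'; then
    echo "namespace terminating: retrying the drain alone will not help"
else
    echo "transient failure: safe to retry"
fi
```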
The problem reoccurred, this time in the dakinitest20170618 namespace:

    },
    "rc": 1,
    "retries": 61,
    "start": "2017-08-08 16:49:58.313226",
    "stderr": "WARNING: replicationcontrollers \"mysql-1\" not found: mysql-1-p3ttk, mysql-1-p3ttk\nWARNING: replicationcontrollers \"mysql-1\" not found: mysql-1-p3ttk, mysql-1-p3ttk\nThere are pending pods when an error occurred: error when evicting pod \"mysql-1-p3ttk\": pods \"mysql-1-p3ttk\" is forbidden: unable to create new content in namespace dakinitest20170618 because it is being terminated.\npod/mysql-1-p3ttk\npod/jws-app-1-hkvqc\nerror: error when evicting pod \"mysql-1-p3ttk\": pods \"mysql-1-p3ttk\" is forbidden: unable to create new content in namespace dakinitest20170618 because it is being terminated.",
    "stdout": "node \"ip-172-31-53-187.ec2.internal\" already cordoned",
    "stdout_lines": [
        "node \"ip-172-31-53-187.ec2.internal\" already cordoned"
    ],

[root@starter-us-east-1-master-25064 ~]# oc get all -n dakinitest20170618
NAME               READY     STATUS    RESTARTS   AGE
po/mysql-1-p3ttk   0/1       Unknown   1          19d

[root@starter-us-east-1-master-25064 ~]# oc get all -n dakinitest20170618 -o yaml
apiVersion: v1
items:
- apiVersion: v1
  kind: Pod
  metadata:
    annotations:
      kubernetes.io/created-by: |
        {"kind":"SerializedReference","apiVersion":"v1","reference":{"kind":"ReplicationController","namespace":"dakinitest20170618","name":"mysql-1","uid":"e0b97aaf-543b-11e7-835c-12d641ec7610","apiVersion":"v1","resourceVersion":"1067971746"}}
      openshift.io/deployment-config.latest-version: "1"
      openshift.io/deployment-config.name: mysql
      openshift.io/deployment.name: mysql-1
      openshift.io/generated-by: OpenShiftNewApp
      openshift.io/scc: restricted
    creationTimestamp: 2017-07-20T06:37:53Z
    deletionGracePeriodSeconds: 30
    deletionTimestamp: 2017-08-02T16:48:38Z
    generateName: mysql-1-
    labels:
      app: cakephp-mysql-persistent
      deployment: mysql-1
      deploymentconfig: mysql
      name: mysql
    name: mysql-1-p3ttk
    namespace: dakinitest20170618
    ownerReferences:
    - apiVersion: v1
      blockOwnerDeletion: true
      controller: true
      kind: ReplicationController
      name: mysql-1
      uid: e0b97aaf-543b-11e7-835c-12d641ec7610
    resourceVersion: "1258228086"
    selfLink: /api/v1/namespaces/dakinitest20170618/pods/mysql-1-p3ttk
    uid: f533bbeb-6d15-11e7-803a-122631632f42
  spec:
    containers:
    - env:
      - name: MYSQL_USER
        valueFrom:
          secretKeyRef:
            key: database-user
            name: cakephp-mysql-persistent
      - name: MYSQL_PASSWORD
        valueFrom:
          secretKeyRef:
            key: database-password
            name: cakephp-mysql-persistent
      - name: MYSQL_DATABASE
        value: default
      image: registry.access.redhat.com/rhscl/mysql-57-rhel7@sha256:991ef507a4e981531a5601f12ceb65da32605792f1117f15a6001305dd3cfd73
      imagePullPolicy: Always
      livenessProbe:
        failureThreshold: 3
        initialDelaySeconds: 30
        periodSeconds: 10
        successThreshold: 1
        tcpSocket:
          port: 3306
        timeoutSeconds: 1
      name: mysql
      ports:
      - containerPort: 3306
        protocol: TCP
      readinessProbe:
        exec:
          command:
          - /bin/sh
          - -i
          - -c
          - MYSQL_PWD='s48UeOLoL1JQtT3T' mysql -h 127.0.0.1 -u cakephp -D default -e 'SELECT 1'
        failureThreshold: 3
        initialDelaySeconds: 5
        periodSeconds: 10
        successThreshold: 1
        timeoutSeconds: 1
      resources:
        limits:
          cpu: "1"
          memory: 512Mi
        requests:
          cpu: 60m
          memory: 307Mi
      securityContext:
        capabilities:
          drop:
          - KILL
          - MKNOD
          - NET_RAW
          - SETGID
          - SETUID
          - SYS_CHROOT
        privileged: false
        runAsUser: 1124180000
        seLinuxOptions:
          level: s0:c352,c314
      terminationMessagePath: /dev/termination-log
      terminationMessagePolicy: File
      volumeMounts:
      - mountPath: /var/lib/mysql/data
        name: mysql-data
      - mountPath: /var/run/secrets/kubernetes.io/serviceaccount
        name: default-token-tx31j
        readOnly: true
    dnsPolicy: ClusterFirst
    imagePullSecrets:
    - name: default-dockercfg-ckvl7
    nodeName: ip-172-31-53-187.ec2.internal
    nodeSelector:
      type: compute
    restartPolicy: Always
    schedulerName: default-scheduler
    securityContext:
      fsGroup: 1124180000
      seLinuxOptions:
        level: s0:c352,c314
    serviceAccount: default
    serviceAccountName: default
    terminationGracePeriodSeconds: 30
    volumes:
    - name: mysql-data
      persistentVolumeClaim:
        claimName: mysql
    - name: default-token-tx31j
      secret:
        defaultMode: 420
        secretName: default-token-tx31j
  status:
    conditions:
    - lastProbeTime: null
      lastTransitionTime: 2017-07-20T06:37:53Z
      status: "True"
      type: Initialized
    - lastProbeTime: null
      lastTransitionTime: 2017-07-27T16:50:18Z
      message: 'containers with unready status: [mysql]'
      reason: ContainersNotReady
      status: "False"
      type: Ready
    - lastProbeTime: null
      lastTransitionTime: 2017-07-20T06:37:53Z
      status: "True"
      type: PodScheduled
    containerStatuses:
    - containerID: docker://e98609ceb53f32685938b1d5ce13586e9dcfea5220321916eb5b523fa577e967
      image: registry.access.redhat.com/rhscl/mysql-57-rhel7@sha256:991ef507a4e981531a5601f12ceb65da32605792f1117f15a6001305dd3cfd73
      imageID: docker-pullable://registry.access.redhat.com/rhscl/mysql-57-rhel7@sha256:991ef507a4e981531a5601f12ceb65da32605792f1117f15a6001305dd3cfd73
      lastState:
        terminated:
          containerID: docker://3439f952bde86637ef420a02dae289707bd0a827a4ba7376b6b60855aeb37063
          exitCode: 0
          finishedAt: 2017-07-21T23:29:02Z
          reason: Completed
          startedAt: 2017-07-20T06:40:30Z
      name: mysql
      ready: false
      restartCount: 1
      state:
        running:
          startedAt: 2017-07-21T23:35:32Z
    hostIP: 172.31.53.187
    phase: Running
    qosClass: Burstable
    startTime: 2017-07-20T06:37:53Z
kind: List
metadata: {}
resourceVersion: ""
selfLink: ""
[root@starter-us-east-1-master-25064 ~]#
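Note that the pod above has a deletionTimestamp of 2017-08-02 (weeks in the past), a 30-second grace period, and a status of Unknown, i.e. the kubelet never confirmed the deletion. A hedged operator workaround sketch: force-delete the stuck pod so the namespace (and the drain) can finish. The commands are echoed rather than executed here, and the namespace/pod names are the ones from this report; only run the printed command after confirming the container is really gone on the node.

```shell
# Stuck namespace and pod from the report above.
ns=dakinitest20170618
pod=mysql-1-p3ttk

# --grace-period=0 --force removes the API object immediately, without
# waiting for confirmation from the (already out-of-sync) kubelet.
# Echoed instead of executed so this sketch is side-effect free.
echo "oc delete pod $pod -n $ns --grace-period=0 --force"
```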
Created attachment 1310760 [details] metrics output
This looks like https://bugzilla.redhat.com/show_bug.cgi?id=1460729: namespace termination hangs on pods that are stuck because of a docker state mismatch with containerd, and the condition resolves on its own after a few hours. Indicators include "containerd: container not found" in the node log.

*** This bug has been marked as a duplicate of bug 1460729 ***
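To check a node for the duplicate bug's indicator, one can grep the node journal for the "containerd: container not found" string. A small sketch against a saved log excerpt; the sample line below is illustrative, not taken from this cluster, and the journal unit name varies by installation.

```shell
# Illustrative saved node-log excerpt (assumed content, not from this cluster).
cat > /tmp/node.log <<'EOF'
Aug 08 03:16:20 ip-172-31-50-229 atomic-openshift-node: containerd: container not found
EOF

# The presence of this string suggests the docker/containerd state mismatch
# described in bug 1460729.
if grep -q 'containerd: container not found' /tmp/node.log; then
    echo "indicator present: likely the docker/containerd state mismatch"
fi
```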