Bug 1547169

Summary: [free-stg] evicted webconsole pods not being rescheduled after nodefs eviction
Product: OpenShift Container Platform
Reporter: Justin Pierce <jupierce>
Component: Master
Assignee: Tomáš Nožička <tnozicka>
Status: CLOSED CURRENTRELEASE
QA Contact: Wang Haoran <haowang>
Severity: high
Docs Contact:
Priority: unspecified
Version: 3.9.0
CC: aos-bugs, dakini, decarr, jchevret, jokerman, mfojtik, mmccomas, sjenning, smunilla, spadgett, tnozicka, wmeng
Target Milestone: ---
Keywords: DeliveryBlocker
Target Release: 3.9.0
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version: openshift v3.9.1
Doc Type: No Doc Update
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2018-06-18 17:44:06 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:

Description Justin Pierce 2018-02-20 16:29:56 UTC
Description of problem:
Several days ago, webconsole pods were evicted due to some sort of disk pressure. Even though that disk pressure is no longer present, the pods are not being scheduled. This condition persists even after restarting master-controllers. 

[root@free-stg-master-9fec9 ~]# oc get pods -n openshift-web-console
NAME                          READY     STATUS    RESTARTS   AGE
webconsole-6768b679b8-2j4wk   0/1       Evicted   0          5d
webconsole-6768b679b8-6pjl9   0/1       Evicted   0          7d
webconsole-6768b679b8-brv29   0/1       Evicted   0          7d
webconsole-6768b679b8-bt5wv   0/1       Evicted   0          5d
webconsole-6768b679b8-csdwp   0/1       Evicted   0          5d


Version-Release number of selected component (if applicable):
openshift v3.9.0-0.42.0


How reproducible:
Unknown

Steps to Reproduce:
1. Cluster was running acceptably.
2. Pods were evicted due to node disk (nodefs) pressure (see the sketch below).
3. Pods were never rescheduled after the disk pressure abated.
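
A rough reproduction sketch (untested on free-stg; it assumes /var sits on the node's nodefs filesystem and that 20G is enough to cross the eviction threshold): fill the filesystem on a node running a webconsole pod until the kubelet reports DiskPressure and evicts, then free the space and check whether the pods come back.

# on a node running a webconsole pod: consume nodefs until the kubelet
# reports DiskPressure and starts evicting (size needed varies per node)
fallocate -l 20G /var/filler

# watch the node condition and the evictions
oc describe node <node-name> | grep -i pressure
oc get pods -n openshift-web-console -o wide -w

# free the space again; the DiskPressure condition should clear, but per
# this bug the evicted webconsole pods are not replaced
rm -f /var/filler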

Comment 1 Seth Jennings 2018-02-20 17:02:07 UTC
My triage results:

The evicted pods were evicted days ago.  The deployment controller doesn't seem to be attempting to start new ones for some reason.

I started a deployment in a test project on that cluster and the pods were created as expected, so the controller is working at least to some degree. I couldn't find any panics or other errors in the controller logs on the current leader, ip-172-31-78-254.us-east-2.compute.internal.
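
For reference, a quick smoke test along those lines might look like the following (a sketch only; the project name and image are arbitrary, and it assumes the apps/v1 Deployment API is available on the cluster):

oc new-project controller-smoke-test
cat <<EOF | oc create -f -
apiVersion: apps/v1
kind: Deployment
metadata:
  name: smoke-test
spec:
  replicas: 1
  selector:
    matchLabels:
      app: smoke-test
  template:
    metadata:
      labels:
        app: smoke-test
    spec:
      containers:
      - name: sleeper
        image: busybox
        command: ["sleep", "3600"]
EOF
# the deployment controller should create an RS and a pod within seconds
oc get rs,pods -n controller-smoke-test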

Comment 2 Tomáš Nožička 2018-02-21 15:57:48 UTC
I am not sure why this happens: the RS is scaled down to 0, but there is no indication of why, and there are no logs.

free-stg runs with loglevel=2, which is pointless for debugging controller issues.

Justin, can you please raise it to V4? I am afraid the bug will go away when the controllers restart, which is a good reason to run with V4 by default, at least in testing environments.

Comment 4 Tomáš Nožička 2018-02-21 21:32:53 UTC
My current findings:

- A Deployment with the Recreate strategy waits for all pods to be deleted here:
https://github.com/openshift/origin/blob/c4b40b3f401472203fcef1927f3c5ce40da07056/vendor/k8s.io/kubernetes/pkg/controller/deployment/recreate.go#L47-L50

- The Deployment does that by scaling the RS to 0.

- The Deployment then waits until the pods are gone, but the RS leaves them around for some reason.

I need to find out tomorrow why the RS leaves those pods around when scaled to 0.
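
The stuck state can be seen directly on the cluster; a sketch of the checks (namespace and deployment name taken from the report above):

# the existing RS is scaled to 0 and no new RS has been created yet ...
oc get rs -n openshift-web-console
# ... yet the Evicted (Failed) pods it owns are still present
oc get pods -n openshift-web-console | grep Evicted
# so the Recreate deployment never makes progress on a new RS
oc get deployment webconsole -n openshift-web-console -o yaml | grep -A 5 conditions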

Comment 5 Tomáš Nožička 2018-02-21 21:33:38 UTC
*** Bug 1547604 has been marked as a duplicate of this bug. ***

Comment 6 Tomáš Nožička 2018-02-23 12:46:50 UTC
The RS controller ignores failed pods here:
https://github.com/tnozicka/origin/blob/710998e7358752a4c79518804522f1a6cf239838/vendor/k8s.io/kubernetes/pkg/controller/replicaset/replica_set.go#L610
and leaves them lying around.

Upstream now has open issues regarding failed pods:
 - https://github.com/kubernetes/kubernetes/issues/60162
 - https://github.com/kubernetes/kubernetes/issues/54525#issuecomment-367845264

I'll see whether upstream accepts a hotfix before the RS and other controllers are fixed in a generic way.
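
Until a fix lands, a possible manual workaround (offered only as a sketch, not verified on free-stg) is to delete the leftover Evicted pods by hand so the Recreate deployment has nothing left to wait for:

# filter on the STATUS column rather than a field selector, since
# field-selector support differs across client versions
oc get pods -n openshift-web-console | awk '$3 == "Evicted" {print $1}' \
  | xargs -r oc delete pod -n openshift-web-console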

Comment 7 Tomáš Nožička 2018-02-23 17:53:27 UTC
upstream PR for unblocking the Deployment:

  https://github.com/kubernetes/kubernetes/pull/60301

Additional fixes for the leftover pods will follow, depending on which way upstream decides to go, but this should unblock the stuck deployment.

Comment 11 Tomáš Nožička 2018-02-27 10:37:44 UTC
upstream pick:
  https://github.com/openshift/origin/pull/18760


haowang: yes, we have the same issue with DCs, but they also have a timeout, so they will eventually reconcile and it's not as severe. Please file a separate issue to track it.

Comment 12 Michal Fojtik 2018-02-27 11:49:18 UTC
Picked to origin 3.9: https://github.com/openshift/origin/pull/18760

Comment 13 Wang Haoran 2018-02-28 08:02:08 UTC
I confirmed this now works with the new puddle build, but I also found that the new deployment took a long time; see the events:

Events:
  Type    Reason             Age   From                   Message
  ----    ------             ----  ----                   -------
  Normal  ScalingReplicaSet  1h    deployment-controller  Scaled up replica set busybox-5f9469d8dc to 1
  Normal  ScalingReplicaSet  14m   deployment-controller  Scaled down replica set busybox-5f9469d8dc to 0
  Normal  ScalingReplicaSet  5m    deployment-controller  Scaled up replica set busybox-67c646dff5 to 1
[root@host-172-16-120-6 ~]# oc get pod
NAME                       READY     STATUS    RESTARTS   AGE
busybox-5f9469d8dc-m54jl   0/1       Evicted   0          1h
busybox-67c646dff5-n477f   1/1       Running   0          12m
[root@host-172-16-120-6 ~]# oc get rs
NAME                 DESIRED   CURRENT   READY     AGE
busybox-5f9469d8dc   0         0         0         1h
busybox-67c646dff5   1         1         1         12m

Is this expected?

Comment 14 Tomáš Nožička 2018-03-01 14:00:24 UTC
I am not sure what went wrong in your case, but I recall from looking into the env that you tested it by filling up the disk, and at least journald was broken; I am not sure whether something else was broken as well.

Would you mind trying to recreate that state again and leaving the env around if it happens?

I tried it locally, going the route from https://bugzilla.redhat.com/show_bug.cgi?id=1547604 of using pods failed with MatchNodeSelector, and the deployment proceeded without any delay.

(To create such a pod, specify a valid nodeName, to bypass the scheduler, together with a nodeSelector that the node does not match, so the kubelet rejects the pod and fails it.)
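
A minimal sketch of such a pod (the node name and label here are made up; substitute a real node that does not carry the label):

cat <<EOF | oc create -f -
apiVersion: v1
kind: Pod
metadata:
  name: force-failed
spec:
  # a valid node name bypasses the scheduler and goes straight to the kubelet
  nodeName: node1.example.com
  # a selector the node does not satisfy makes kubelet admission reject the
  # pod with reason MatchNodeSelector, leaving it in phase Failed
  nodeSelector:
    does-not-exist: "true"
  containers:
  - name: sleeper
    image: busybox
    command: ["sleep", "3600"]
EOF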

Comment 15 Wang Haoran 2018-03-01 14:43:59 UTC
It works well now, and journald did not break this time. Verified with:
openshift v3.9.1
kubernetes v1.9.1+a0ce1bc657
etcd 3.2.16