Bug 2005347

Summary: Various scheduler tests failing, unscheduled pods, at least sometimes due to a kubelet race
Product: OpenShift Container Platform
Component: kube-scheduler
Version: 4.8
Reporter: Ben Parees <bparees>
Assignee: Jan Chaloupka <jchaloup>
QA Contact: RamaKasturi <knarra>
CC: aos-bugs, mfojtik, wking
Status: CLOSED DUPLICATE
Severity: medium
Priority: medium
Last Closed: 2021-10-14 10:14:06 UTC
Type: Bug

Description Ben Parees 2021-09-17 13:24:21 UTC
Description of problem:

The following tests all failed in the stable canary run:
https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.8-e2e-aws-canary/1438781824678498304

[sig-scheduling] SchedulerPredicates [Serial] validates that NodeSelector is respected if not matching [Conformance] [Suite:openshift/conformance/serial/minimal] [Suite:k8s]

[sig-scheduling] SchedulerPredicates [Serial] validates that NodeAffinity is respected if not matching [Suite:openshift/conformance/serial] [Suite:k8s] 

[sig-scheduling] SchedulerPredicates [Serial] validates pod overhead is considered along with resource limits of pods that are allowed to run verify pod overhead is accounted for [Suite:openshift/conformance/serial] [Suite:k8s]

[sig-scheduling] SchedulerPredicates [Serial] validates resource limits of pods that are allowed to run [Conformance] [Suite:openshift/conformance/serial/minimal] [Suite:k8s]


All 4 tests failed w/ this output:
fail [k8s.io/kubernetes.1/test/e2e/scheduling/predicates.go:436]: Sep 17 09:40:52.455: Timed out after 10m0s waiting for stable cluster.

But they failed over an hour apart and no other tests had issues.

It's not clear what a "stable cluster" means in this context, so improved debugging output would be worth considering.  

Version-Release number of selected component (if applicable):
registry.ci.openshift.org/ocp/release:4.8.0-0.nightly-2021-09-17-032545 

How reproducible:
flake

Comment 1 W. Trevor King 2021-09-17 20:20:52 UTC
4.8's WaitForStableCluster [1].  Looks like it includes "Timed out waiting for the following pods to schedule..." in test-case stdout, but sadly does not include that information in the framework.Failf message.  Seems like a useful upstream fix to at least summarize the unscheduled pods in the failure message itself.
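
As an aside, roughly the same "any unscheduled pods?" check can be replayed offline against the gathered pods.json.  This is only an approximation of what the helper does (it also filters on worker nodes), so treat the jq below as an untested sketch:

$ # rough approximation (untested): count any pod with an empty spec.nodeName as unscheduled
$ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.8-e2e-aws-canary/1438781824678498304/artifacts/e2e-aws-canary/gather-extra/artifacts/pods.json | jq -r '.items[] | select((.spec.nodeName // "") == "") | "\(.metadata.namespace)/\(.metadata.name)"'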

But popping open stdout for one of these failing test cases in comment 0's job:

Sep 17 09:40:52.454: INFO: Timed out waiting for the following pods to schedule
Sep 17 09:40:52.454: INFO: openshift-authentication/oauth-openshift-684f7bffb-tcdfv
Sep 17 09:40:52.455: FAIL: Timed out after 10m0s waiting for stable cluster.

And dropping into gathered assets:

$ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.8-e2e-aws-canary/1438781824678498304/artifacts/e2e-aws-canary/gather-extra/artifacts/pods.json | jq '.items[] | select(.metadata.name == "oauth-openshift-684f7bffb-tcdfv").status'
{
  "conditions": [
    {
      "lastProbeTime": null,
      "lastTransitionTime": "2021-09-17T08:49:30Z",
      "message": "0/6 nodes are available: 3 node(s) didn't match Pod's node affinity/selector, 3 node(s) didn't match pod affinity/anti-affinity rules, 3 node(s) didn't match pod anti-affinity rules.",
      "reason": "Unschedulable",
      "status": "False",
      "type": "PodScheduled"
    }
  ],
  "phase": "Pending",
  "qosClass": "Burstable"
}

So did we break anti-affinity?  Or are the auth pods misconfigured?

$ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.8-e2e-aws-canary/1438781824678498304/artifacts/e2e-aws-canary/gather-extra/artifacts/pods.json | jq '.items[] | select(.metadata.name == "oauth-openshift-684f7bffb-tcdfv").spec.affinity'
{
  "podAntiAffinity": {
    "requiredDuringSchedulingIgnoredDuringExecution": [
      {
        "labelSelector": {
          "matchLabels": {
            "app": "oauth-openshift",
            "oauth-openshift-anti-affinity": "true"
          }
        },
        "topologyKey": "kubernetes.io/hostname"
      }
    ]
  }
}

[1]: https://github.com/openshift/origin/blob/a7992c389e048b27a3dfb1a05e9d2691f9e8f371/vendor/k8s.io/kubernetes/test/e2e/scheduling/framework.go#L43-L61

Comment 2 W. Trevor King 2021-09-17 20:22:40 UTC
Or maybe we're trying (and failing) to surge on top of three replicas, and the surging fourth pod has no fourth node to live on?

$ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.8-e2e-aws-canary/1438781824678498304/artifacts/e2e-aws-canary/gather-extra/artifacts/pods.json | jq -r '.items[].metadata.name | select(startswith("oauth-openshift-"))'
oauth-openshift-5ccc6f4c65-qg6jb
oauth-openshift-5ccc6f4c65-qvd8h
oauth-openshift-684f7bffb-pgcnx
oauth-openshift-684f7bffb-tcdfv

Comment 3 W. Trevor King 2021-09-17 20:30:09 UTC
$ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.8-e2e-aws-canary/1438781824678498304/artifacts/e2e-aws-canary/gather-extra/artifacts/deployments.json | jq '.items[] | select(.metadata.name == "oauth-openshift").spec | {replicas, strategy}'
{
  "replicas": 3,
  "strategy": {
    "rollingUpdate": {
      "maxSurge": 0,
      "maxUnavailable": 1
    },
    "type": "RollingUpdate"
  }
}

So not surging...  Aha:

$ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.8-e2e-aws-canary/1438781824678498304/artifacts/e2e-aws-canary/gather-extra/artifacts/pods.json | jq -r '.items[] | select((.metadata.name | startswith("oauth-openshift-")) and .status.phase == "Pending").metadata | keys'
[
  "annotations",
  "creationTimestamp",
  "deletionGracePeriodSeconds",
  "deletionTimestamp",
  "generateName",
  "labels",
  "name",
  "namespace",
  "ownerReferences",
  "resourceVersion",
  "uid"
]
[
  "annotations",
  "creationTimestamp",
  "generateName",
  "labels",
  "name",
  "namespace",
  "ownerReferences",
  "resourceVersion",
  "uid"
]

So I bet this is a symptom of bug 1952224, and can be closed as a dup (possibly with Sippy metadata sync'ed over?).  The chain is: a sibling pod hit bug 1952224 and hung in termination, anti-affinity then blocked scheduling of its replacement, and WaitForStableCluster then timed out waiting on that unscheduled replacement.
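
For anyone re-running this triage, one query that puts phase, deletionTimestamp, and node side by side makes the stuck-terminating / blocked-pending pair easier to spot (untested sketch against the same pods.json):

$ # untested sketch: name, phase, deletionTimestamp (or "-"), nodeName (or "-") for each oauth pod
$ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.8-e2e-aws-canary/1438781824678498304/artifacts/e2e-aws-canary/gather-extra/artifacts/pods.json | jq -r '.items[] | select(.metadata.name | startswith("oauth-openshift-")) | [.metadata.name, .status.phase, (.metadata.deletionTimestamp // "-"), (.spec.nodeName // "-")] | @tsv'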

Comment 4 W. Trevor King 2021-09-17 20:31:30 UTC
Or maybe we keep this bug open, and use it to track including a summary of unscheduled pods in WaitForStableCluster's framework.Failf?

Comment 5 Jan Chaloupka 2021-10-14 10:14:06 UTC
I prefer to have a bug filed against a specific issue that we can debug and fix. It takes longer to work out what went wrong from the kubelet and scheduler logs, though it's still doable.

*** This bug has been marked as a duplicate of bug 1952224 ***