Bug 2005347 - Various scheduler tests failing, unscheduled pods, at least sometimes due to a kubelet race
Keywords:
Status: CLOSED DUPLICATE of bug 1952224
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: kube-scheduler
Version: 4.8
Hardware: Unspecified
OS: Unspecified
Priority: medium
Severity: medium
Target Milestone: ---
Target Release: ---
Assignee: Jan Chaloupka
QA Contact: RamaKasturi
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2021-09-17 13:24 UTC by Ben Parees
Modified: 2021-10-14 10:14 UTC
CC List: 3 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2021-10-14 10:14:06 UTC
Target Upstream Version:
Embargoed:



Description Ben Parees 2021-09-17 13:24:21 UTC
Description of problem:

The following tests all failed in the stable canary run:
https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.8-e2e-aws-canary/1438781824678498304

[sig-scheduling] SchedulerPredicates [Serial] validates that NodeSelector is respected if not matching [Conformance] [Suite:openshift/conformance/serial/minimal] [Suite:k8s]

[sig-scheduling] SchedulerPredicates [Serial] validates that NodeAffinity is respected if not matching [Suite:openshift/conformance/serial] [Suite:k8s] 

[sig-scheduling] SchedulerPredicates [Serial] validates pod overhead is considered along with resource limits of pods that are allowed to run verify pod overhead is accounted for [Suite:openshift/conformance/serial] [Suite:k8s]

[sig-scheduling] SchedulerPredicates [Serial] validates resource limits of pods that are allowed to run [Conformance] [Suite:openshift/conformance/serial/minimal] [Suite:k8s]


All 4 tests failed w/ this output:
fail [k8s.io/kubernetes.1/test/e2e/scheduling/predicates.go:436]: Sep 17 09:40:52.455: Timed out after 10m0s waiting for stable cluster.

But they failed over an hour apart and no other tests had issues.

It's not clear what a "stable cluster" means in this context, so improved debugging output would be worth considering.  
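
For what it's worth, against a live cluster (not something run here, just a sketch of the check) the quickest way to see what such a wait might be stuck on is probably:

$ oc get pods --all-namespaces --field-selector=status.phase=Pending

which lists any pods the scheduler has not placed yet.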

Version-Release number of selected component (if applicable):
registry.ci.openshift.org/ocp/release:4.8.0-0.nightly-2021-09-17-032545 

How reproducible:
flake

Comment 1 W. Trevor King 2021-09-17 20:20:52 UTC
This is 4.8's WaitForStableCluster [1].  Looks like it includes "Timed out waiting for the following pods to schedule..." in test-case stdout, but sadly does not include that information in the framework.Failf message.  A useful upstream fix would be to at least list the unscheduled pods in the failure message.

But popping open stdout for one of these failing test cases in comment 0's job:

Sep 17 09:40:52.454: INFO: Timed out waiting for the following pods to schedule
Sep 17 09:40:52.454: INFO: openshift-authentication/oauth-openshift-684f7bffb-tcdfv
Sep 17 09:40:52.455: FAIL: Timed out after 10m0s waiting for stable cluster.

And dropping into gathered assets:

$ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.8-e2e-aws-canary/1438781824678498304/artifacts/e2e-aws-canary/gather-extra/artifacts/pods.json | jq '.items[] | select(.metadata.name == "oauth-openshift-684f7bffb-tcdfv").status'
{
  "conditions": [
    {
      "lastProbeTime": null,
      "lastTransitionTime": "2021-09-17T08:49:30Z",
      "message": "0/6 nodes are available: 3 node(s) didn't match Pod's node affinity/selector, 3 node(s) didn't match pod affinity/anti-affinity rules, 3 node(s) didn't match pod anti-affinity rules.",
      "reason": "Unschedulable",
      "status": "False",
      "type": "PodScheduled"
    }
  ],
  "phase": "Pending",
  "qosClass": "Burstable"
}

So did we break anti-affinity?  Or are the auth pods being misconfigured?

$ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.8-e2e-aws-canary/1438781824678498304/artifacts/e2e-aws-canary/gather-extra/artifacts/pods.json | jq '.items[] | select(.metadata.name == "oauth-openshift-684f7bffb-tcdfv").spec.affinity'
{
  "podAntiAffinity": {
    "requiredDuringSchedulingIgnoredDuringExecution": [
      {
        "labelSelector": {
          "matchLabels": {
            "app": "oauth-openshift",
            "oauth-openshift-anti-affinity": "true"
          }
        },
        "topologyKey": "kubernetes.io/hostname"
      }
    ]
  }
}
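
As an aside on the debugging-output point above: a one-liner over the same pods.json that lists every unscheduled pod together with the scheduler's message (untested against these artifacts, but plain jq) would be roughly:

$ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.8-e2e-aws-canary/1438781824678498304/artifacts/e2e-aws-canary/gather-extra/artifacts/pods.json | jq -r '.items[] | . as $p | .status.conditions[]? | select(.type == "PodScheduled" and .status == "False") | "\($p.metadata.namespace)/\($p.metadata.name): \(.message)"'

which is more or less the summary the Failf message could carry.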

[1]: https://github.com/openshift/origin/blob/a7992c389e048b27a3dfb1a05e9d2691f9e8f371/vendor/k8s.io/kubernetes/test/e2e/scheduling/framework.go#L43-L61

Comment 2 W. Trevor King 2021-09-17 20:22:40 UTC
Or maybe we're trying (and failing) to surge on top of three replicas, and the surging fourth pod has no fourth node to live on?

$ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.8-e2e-aws-canary/1438781824678498304/artifacts/e2e-aws-canary/gather-extra/artifacts/pods.json | jq -r '.items[].metadata.name | select(startswith("oauth-openshift-"))'
oauth-openshift-5ccc6f4c65-qg6jb
oauth-openshift-5ccc6f4c65-qvd8h
oauth-openshift-684f7bffb-pgcnx
oauth-openshift-684f7bffb-tcdfv
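
To see where each of those is (or is not) actually running, something like this against the same pods.json should do it (untested):

$ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.8-e2e-aws-canary/1438781824678498304/artifacts/e2e-aws-canary/gather-extra/artifacts/pods.json | jq -r '.items[] | select(.metadata.name | startswith("oauth-openshift-")) | [.metadata.name, .status.phase, .spec.nodeName // "<none>"] | @tsv'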

Comment 3 W. Trevor King 2021-09-17 20:30:09 UTC
$ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.8-e2e-aws-canary/1438781824678498304/artifacts/e2e-aws-canary/gather-extra/artifacts/deployments.json | jq '.items[] | select(.metadata.name == "oauth-openshift").spec | {replicas, strategy}'
{
  "replicas": 3,
  "strategy": {
    "rollingUpdate": {
      "maxSurge": 0,
      "maxUnavailable": 1
    },
    "type": "RollingUpdate"
  }
}

So not surging...  Aha:

$ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.8-e2e-aws-canary/1438781824678498304/artifacts/e2e-aws-canary/gather-extra/artifacts/pods.json | jq -r '.items[] | select((.metadata.name | startswith("oauth-openshift-")) and .status.phase == "Pending").metadata | keys'
[
  "annotations",
  "creationTimestamp",
  "deletionGracePeriodSeconds",
  "deletionTimestamp",
  "generateName",
  "labels",
  "name",
  "namespace",
  "ownerReferences",
  "resourceVersion",
  "uid"
]
[
  "annotations",
  "creationTimestamp",
  "generateName",
  "labels",
  "name",
  "namespace",
  "ownerReferences",
  "resourceVersion",
  "uid"
]

So I bet this is a symptom of bug 1952224, and can be closed as a dup (possibly with Sippy metadata synced over?).  The chain would be: a sibling pod hit bug 1952224 and hung in termination, anti-affinity then blocked scheduling of its replacement, and WaitForStableCluster timed out waiting on that unscheduled replacement.
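
(In hindsight, asking for the deletionTimestamp directly is a bit more readable than diffing metadata keys.  Untested, but something like:

$ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.8-e2e-aws-canary/1438781824678498304/artifacts/e2e-aws-canary/gather-extra/artifacts/pods.json | jq -r '.items[] | select((.metadata.name | startswith("oauth-openshift-")) and .status.phase == "Pending") | [.metadata.name, .metadata.deletionTimestamp // "-"] | @tsv'

should show which of the two Pending pods already has a deletionTimestamp set, i.e. which one is being deleted.)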

Comment 4 W. Trevor King 2021-09-17 20:31:30 UTC
Or maybe we keep this bug open, and use it to track including a summary of unscheduled pods in WaitForStableCluster's framework.Failf?

Comment 5 Jan Chaloupka 2021-10-14 10:14:06 UTC
I prefer to have a bug filed against a specific issue that we can debug and fix. It takes longer to work out what went wrong from the kubelet and scheduler logs, though it's still doable.

*** This bug has been marked as a duplicate of bug 1952224 ***

