Description of problem:
We were trying to deploy a router after upgrading the cluster via a blue/green upgrade. We had the old set of nodes (blue, 3.1) and were upgrading to 3.2.1: we installed the new set of nodes (green, 3.2), then cut over to them. This process happened over 2-3 days and was done by multiple people, so the details are fuzzy (sorry).

We then tried to deploy the router to our infra nodes (using a node selector and matching node labels). When the pod tried to deploy, it went straight to an error state. We ran "oc describe pod router-1-xxxx -n default" and saw this error blocking the pod from scheduling onto the node:

fit failure on node (ip-172-31-51-239.ec2.internal): PodFitsPorts

The node, ip-172-31-51-239.ec2.internal, had nothing running on it. "oc get pods --all-namespaces -o wide" did NOT show any other pods scheduled to this node. We tried multiple things, but what worked was restarting the atomic-openshift-master-controllers service. We believe this is what fixed it, although the API service was also restarted at the same time. Once the controllers were restarted, the router deployed immediately.

Version-Release number of selected component (if applicable):
atomic-openshift-3.2.1.4-1.git.0.9fe156c.el7.x86_64

How reproducible:
N/A

Steps to Reproduce:
See description.

Actual results:
The router did not deploy.

Expected results:
The router should have deployed.

Additional info:
Sorry, the details are fuzzy here. We have collected logs and sent them to engineering. We wanted to file this bug to track the issue for future reference.
Some additional info. One of the things that happened was that we had 4 infra nodes (2 blue on 3.1, 2 green on 3.2). This was accidentally scaled to 6. Something odd may have occurred as we went back and forth between the blue and green nodes (marking them schedulable, then unschedulable). We also suspect the scheduler cache is what was incorrect.
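To illustrate the stale-cache theory: the PodFitsPorts predicate rejects a pod when a hostPort it requests is already claimed by a pod the scheduler's cache places on that node. If the cache still held a phantom entry for an old router pod deleted during the cutover, the predicate would fail even though the API server (and "oc get pods") showed the node empty. This is a minimal sketch of that logic; the function and field names are illustrative, not the actual OpenShift/Kubernetes source, and the hostPorts (80, 443, 1936) are the defaults the 3.x router requests.

```python
def pod_fits_ports(requested_host_ports, cached_pods_on_node):
    """Return True if none of the requested hostPorts collide with ports
    claimed by pods the scheduler's cache believes are on this node."""
    used = {port
            for pod in cached_pods_on_node
            for port in pod["host_ports"]}
    return not (set(requested_host_ports) & used)


# The router pod requests hostPorts 80, 443, and 1936.
router_ports = [80, 443, 1936]

# What the API server reported: nothing running on the node -> pod fits.
print(pod_fits_ports(router_ports, []))            # True

# What a stale scheduler cache could still contain: a phantom entry for an
# old router pod deleted during the blue/green cutover -> fit failure.
stale_cache = [{"name": "router-old", "host_ports": [80, 443, 1936]}]
print(pod_fits_ports(router_ports, stale_cache))   # False
```

Restarting atomic-openshift-master-controllers rebuilds the scheduler's cache from the API server, which would explain why the router deployed immediately afterward.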
It looks like we don't have logs going back far enough to diagnose what started this problem. Hopefully, if it happens again, we'll have increased log retention enough to be able to triage it.
Matt, have you ever run into this again?
Matt says he hasn't seen this anymore. I'm going to close this for now, but if it happens again, please reopen.
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 1000 days