Bug 1354031

Summary: OpenShift master scheduler cache had bad info, scheduler was unable to assign pod with error PodFitsPorts
Product: OpenShift Container Platform Reporter: Matt Woodson <mwoodson>
Component: Node Assignee: Andy Goldstein <agoldste>
Status: CLOSED WORKSFORME QA Contact: DeShuai Ma <dma>
Severity: medium Docs Contact:
Priority: medium    
Version: 3.2.1 CC: agoldste, agrimm, aos-bugs, jokerman, mmccomas, mwoodson
Target Milestone: ---   
Target Release: ---   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2016-08-09 14:22:17 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On:    
Bug Blocks: 1303130    

Description Matt Woodson 2016-07-08 18:29:15 UTC
Description of problem:

We were trying to deploy a router after upgrading a system.  We were doing a blue/green upgrade.  We had an old set of nodes (3.1) and were upgrading to 3.2.1.  We installed the new set of nodes (green, 3.2), then cut over to them.  This process happened over 2-3 days and was done by multiple people.  The details on this are fuzzy (sorry).

We then tried to deploy the router to our infra nodes (using a node selector and node labels).  When the pod tried to deploy, it went right to an error state.  We ran "oc describe pod router-1-xxxx -n default" and saw this error blocking the pod from rolling out to the node:


fit failure on node (ip-172-31-51-239.ec2.internal): PodFitsPorts


The node, ip-172-31-51-239.ec2.internal, had 0 things running on it.  "oc get pods --all-namespaces -o wide" did NOT show any other pods scheduled to this node.  There was nothing running on this node.

We tried multiple things, but what worked was restarting the atomic-openshift-master-controllers service.  We believe this is what fixed it, but the API service was also restarted.  Once the restart was done, the router deployed immediately.
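
For anyone reading this later, here is a rough sketch of how a node with no pods could still fail PodFitsPorts if the scheduler's in-memory view of the node was stale.  This is not the scheduler's actual code.  The Pod and NodeView classes, the pod_fits_ports function, the "router-old" pod and the host ports 80 and 443 are all hypothetical names and values picked for illustration; the only thing taken as given is that PodFitsPorts rejects a pod when a host port it requests is already in use on the node.

```python
from dataclasses import dataclass, field


@dataclass
class Pod:
    name: str
    host_ports: frozenset  # host ports the pod asks to bind on its node


@dataclass
class NodeView:
    """One party's belief about which pods occupy a node."""
    pods: dict = field(default_factory=dict)

    def used_host_ports(self):
        used = set()
        for pod in self.pods.values():
            used |= pod.host_ports
        return used


def pod_fits_ports(pod, node_view):
    """Return True when none of the pod's host ports are already taken."""
    return not (pod.host_ports & node_view.used_host_ports())


# Hypothetical router pod; 80 and 443 are assumed host ports.
router = Pod("router-1-xxxx", frozenset({80, 443}))

# What the API server reported: nothing scheduled on the node.
api_view = NodeView()

# What a stale scheduler cache might hold (assumption): an old router
# pod whose deletion was never applied to the in-memory cache.
stale_cache = NodeView({"router-old": Pod("router-old", frozenset({80, 443}))})

print(pod_fits_ports(router, api_view))     # check against the API's view
print(pod_fits_ports(router, stale_cache))  # check against the stale cache

# A process restart discards the in-memory cache; rebuilding it from the
# API state drops the phantom pod.
rebuilt_cache = NodeView(dict(api_view.pods))
print(pod_fits_ports(router, rebuilt_cache))
```

If something like this happened, it would be consistent with both the empty "oc get pods" output and the restart clearing the problem, but we do not have the logs to confirm it.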

Version-Release number of selected component (if applicable):

atomic-openshift-3.2.1.4-1.git.0.9fe156c.el7.x86_64

How reproducible:

N/A

Steps to Reproduce:

See description

Actual results:

router did not deploy

Expected results:

router should have deployed.


Additional info:

Sorry, the details are fuzzy here.  We have collected logs and sent them to engineering.  We wanted to put the bug together to track this for future use.

Comment 1 Matt Woodson 2016-07-08 18:37:11 UTC
Some additional info.  At one point we had 4 infra nodes (2 blue, 3.1; 2 green, 3.2).  This was scaled to 6 by accident.  Something odd may have happened as we went back and forth between the blue and green nodes (we make them schedulable, and then unschedulable).


We also suspect that the scheduler cache is what was incorrect.

Comment 2 Andy Goldstein 2016-07-08 18:56:47 UTC
It looks like we don't have logs going back far enough to diagnose what started this problem. Hopefully if it does happen again, we'll have increased log retention to be able to triage.

Comment 3 Andy Goldstein 2016-08-04 10:33:01 UTC
Matt, have you ever run into this again?

Comment 4 Andy Goldstein 2016-08-09 14:22:17 UTC
Matt says he hasn't seen this any more. I'm going to close this for now, but if it happens again, please reopen.

Comment 5 Red Hat Bugzilla 2023-09-14 03:27:48 UTC
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 1000 days