Description of problem:
We were trying to deploy a router after upgrading the cluster via a blue/green upgrade. We had the old set of nodes (blue, 3.1) and were upgrading to 3.2.1: we installed the new set of nodes (green, 3.2), then cut over to them. This process happened over 2-3 days and was done by multiple people, so the details are fuzzy (sorry).

We then tried to deploy the router to our infra nodes (using a node selector and matching node labels). When the pod tried to deploy, it went straight to an error state. We ran "oc describe pod router-1-xxxx -n default" and saw this error blocking the pod from scheduling onto the node:

fit failure on node (ip-172-31-51-239.ec2.internal): PodFitsPorts

The node, ip-172-31-51-239.ec2.internal, had nothing running on it. "oc get pods --all-namespaces -o wide" did NOT show any other pods scheduled to this node. We tried multiple things, but what worked was restarting the atomic-openshift-master-controllers service. We believe this is what fixed it, although the API service was also restarted at the same time. Once the controllers were restarted, the router deployed immediately.

Version-Release number of selected component (if applicable):
atomic-openshift-3.2.1.4-1.git.0.9fe156c.el7.x86_64

How reproducible:
N/A

Steps to Reproduce:
See description.

Actual results:
The router did not deploy.

Expected results:
The router should have deployed.

Additional info:
Sorry, the details are fuzzy here. We have collected logs and sent them to engineering. We wanted to file this bug to track the issue for future reference.
Some additional info. One of the things that happened was that we had 4 infra nodes (2 blue on 3.1, 2 green on 3.2). This was accidentally scaled to 6. Something odd may have occurred as we went back and forth between the blue and green nodes (marking them schedulable, then unschedulable). We also suspect the scheduler cache is what was incorrect.
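To illustrate the stale-cache theory: the PodFitsPorts predicate rejects a pod when a hostPort it requests is already claimed by a pod the scheduler's cache places on that node. If the cache still held a phantom entry for an old router pod deleted during the cutover, the predicate would fail even though the API server (and "oc get pods") showed the node empty. This is a minimal sketch of that logic; the function and field names are illustrative, not the actual OpenShift/Kubernetes source, and the hostPorts (80, 443, 1936) are the defaults the 3.x router requests.

```python
def pod_fits_ports(requested_host_ports, cached_pods_on_node):
    """Return True if none of the requested hostPorts collide with ports
    claimed by pods the scheduler's cache believes are on this node."""
    used = {port
            for pod in cached_pods_on_node
            for port in pod["host_ports"]}
    return not (set(requested_host_ports) & used)


# The router pod requests hostPorts 80, 443, and 1936.
router_ports = [80, 443, 1936]

# What the API server reported: nothing running on the node -> pod fits.
print(pod_fits_ports(router_ports, []))            # True

# What a stale scheduler cache could still contain: a phantom entry for an
# old router pod deleted during the blue/green cutover -> fit failure.
stale_cache = [{"name": "router-old", "host_ports": [80, 443, 1936]}]
print(pod_fits_ports(router_ports, stale_cache))   # False
```

Restarting atomic-openshift-master-controllers rebuilds the scheduler's cache from the API server, which would explain why the router deployed immediately afterward.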
It looks like we don't have logs going back far enough to diagnose what started this problem. Hopefully, if it happens again, we'll have increased log retention enough to be able to triage it.
Matt, have you ever run into this again?
Matt says he hasn't seen this anymore. I'm going to close this for now, but if it happens again, please reopen.
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 1000 days