Bug 1401631

Summary: Router container going into CrashLoopBackOff with little log information
Product: OpenShift Container Platform Reporter: Steven Walter <stwalter>
Component: Networking Assignee: Ben Bennett <bbennett>
Networking sub component: router QA Contact: zhaozhanqi <zzhao>
Status: CLOSED INSUFFICIENT_DATA Docs Contact:
Severity: urgent    
Priority: unspecified CC: aos-bugs, bbennett, eparis, pcameron, ramr, stwalter
Version: 3.2.0   
Target Milestone: ---   
Target Release: ---   
Hardware: Unspecified   
OS: Unspecified   
Last Closed: 2017-01-17 16:32:13 UTC Type: Bug

Description Steven Walter 2016-12-05 18:37:20 UTC
Description of problem:
One of two router pods intermittently goes into CrashLoopBackOff.

Version-Release number of selected component (if applicable):
3.2.0 (they customized the image to modify the header size but made no other changes)

How reproducible:
Not verified on our side; consistent on theirs

Actual results:
CrashLoopBackOff

Expected results:
Running pod

Additional info:
Uploading logs and events momentarily

Comment 4 Steven Walter 2016-12-05 18:40:57 UTC
Events show many messages like:

1m          1m         1         router-1-7mkii   Pod       spec.containers{router}   Normal    Killing      {kubelet apsrp6468.example.com}   Killing container with docker id f20ad1d47f84: pod "router-1-7mkii_default(d0306181-b799-11e6-8fc2-0050568704dd)" container "router" is unhealthy, it will be killed and re-created.
19s         19s        1         router-1-7mkii   Pod       spec.containers{router}   Normal    Created      {kubelet apsrp6468.example.com}   Created container with docker id 51e261c75a16
19s         19s        1         router-1-7mkii   Pod       spec.containers{router}   Normal    Started      {kubelet apsrp6468.example.com}   Started container with docker id 51e261c75a16
37m         24s        37        router-1-syzb1   Pod       spec.containers{router}   Warning   Unhealthy    {kubelet apsrp6469.example.com}   Readiness probe failed: Get http://localhost:1936/healthz: net/http: request canceled while waiting for connection
37m         24s        36        router-1-syzb1   Pod       spec.containers{router}   Warning   Unhealthy    {kubelet apsrp6469.example.com}   Liveness probe failed: Get http://localhost:1936/healthz: net/http: request canceled while waiting for connection

However, neither these events nor the logs point to the cause (for example, there is no indication of port conflicts).
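
To see at a glance which pod and which probe account for the failures, the event lines can be tallied. The following is a minimal sketch, not part of the original report: it assumes the whitespace-separated column layout seen in the events above (first seen, last seen, count, name, kind, sub-object, type, reason, source in braces, message), and the names `EVENT_RE`, `tally_events`, and `SAMPLE` are hypothetical helpers introduced only for illustration.

```python
import re
from collections import Counter

# Assumed column layout, inferred from the event lines in this report (not from any spec):
# first-seen, last-seen, count, name, kind, sub-object, type, reason, {source}, message
EVENT_RE = re.compile(
    r"^(?P<first>\S+)\s+(?P<last>\S+)\s+(?P<count>\d+)\s+(?P<name>\S+)\s+"
    r"(?P<kind>\S+)\s+(?P<subobject>\S+)\s+(?P<type>\S+)\s+(?P<reason>\S+)\s+"
    r"\{(?P<source>[^}]*)\}\s+(?P<message>.*)$"
)


def tally_events(lines):
    """Sum the event counts per (pod name, reason, message prefix)."""
    totals = Counter()
    for line in lines:
        m = EVENT_RE.match(line.strip())
        if m is None:
            continue  # skip headers or lines that do not fit the assumed layout
        # The text before the first ':' separates readiness from liveness failures.
        prefix = m.group("message").split(":", 1)[0]
        key = (m.group("name"), m.group("reason"), prefix)
        totals[key] += int(m.group("count"))
    return totals


# Two of the event lines quoted in this comment, used as sample input.
SAMPLE = [
    "37m         24s        37        router-1-syzb1   Pod       spec.containers{router}   Warning   Unhealthy    {kubelet apsrp6469.example.com}   Readiness probe failed: Get http://localhost:1936/healthz: net/http: request canceled while waiting for connection",
    "37m         24s        36        router-1-syzb1   Pod       spec.containers{router}   Warning   Unhealthy    {kubelet apsrp6469.example.com}   Liveness probe failed: Get http://localhost:1936/healthz: net/http: request canceled while waiting for connection",
]

if __name__ == "__main__":
    for (pod, reason, prefix), count in sorted(tally_events(SAMPLE).items()):
        print(f"{pod}\t{reason}\t{prefix}\t{count}")
```

Run against the full event dump, a tally like this would show whether the probe failures are confined to one pod and node or spread across both routers, which may help narrow down where to look next.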