Bug 1626500

Summary: [starter-us-east-1] pod stuck in terminating during node drain
Product: OpenShift Container Platform
Reporter: Justin Pierce <jupierce>
Component: Networking
Assignee: Ricardo Carrillo Cruz <ricarril>
Networking sub component: openshift-sdn
QA Contact: zhaozhanqi <zzhao>
Status: CLOSED INSUFFICIENT_DATA
Docs Contact:
Severity: unspecified
Priority: unspecified
CC: aos-bugs, jokerman, jupierce, mmccomas, ricarril, scuppett
Version: 3.11.0
Target Milestone: ---
Target Release: 3.11.z
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2019-11-20 14:44:03 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:

Description Justin Pierce 2018-09-07 13:46:22 UTC
Description of problem:


Version-Release number of selected component (if applicable):
master: v3.11.0-0.21.0
node at time of hang: v3.10.8

Actual results:
The pod stuck in Terminating was mysql-1-dcgl8 in the mysql-database project.
The node was ip-172-31-62-236.ec2.internal.
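
A minimal sketch of how a pod stuck in Terminating can be inspected, assuming cluster access with oc; the grep filter below is only illustrative:

  # check whether the pod carries a deletionTimestamp and any finalizers holding up deletion
  oc get pod mysql-1-dcgl8 -n mysql-database -o yaml | grep -E -A2 'deletionTimestamp|finalizers'
  # recent events often show why the kubelet cannot tear the pod down
  oc describe pod mysql-1-dcgl8 -n mysql-database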

Additional info:
- The pod failed to terminate over the course of the night.
- Captured a preliminary set of logs.
- Increased the loglevel in master.env to 4 and restarted the controllers/API (see the sketch after this list).
- Captured a second set of logs.
- Noted that the pod finally drained after the master process restart.
- Logs will be attached in comments.
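
A rough sketch of the loglevel change mentioned above, assuming an OpenShift 3.10/3.11 control plane running as static pods; the file path and the location of the master-restart helper can differ by installation:

  # raise the master debug loglevel (default path on a 3.10/3.11 install)
  sed -i 's/^DEBUG_LOGLEVEL=.*/DEBUG_LOGLEVEL=4/' /etc/origin/master/master.env
  # restart the API and controller static pods so the new loglevel takes effect
  /usr/local/bin/master-restart api
  /usr/local/bin/master-restart controllers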

Comment 4 Casey Callendrello 2018-09-11 15:00:05 UTC
This is the last thing the SDN logs:

I0911 14:53:14.825290    9249 node.go:289] Starting openshift-sdn network plugin
F0911 14:53:24.829959    9249 network.go:46] SDN node startup failed: failed to validate network configuration: cannot fetch "default" cluster network: Get https://preserve-jialiu311-auto-jtmk-men-1:8443/apis/network.openshift.io/v1/clusternetworks/default: dial tcp [fe80::f816:3eff:fe1b:6dbc%eth0]:8443: connect: connection refused

Comment 5 Casey Callendrello 2018-09-11 15:01:08 UTC
Ignore that comment - wrong issue!

Comment 6 Casey Callendrello 2018-09-12 13:27:47 UTC
Is it possible to get the logs from the SDN pod?

Comment 7 Justin Pierce 2018-09-12 13:55:08 UTC
The upgrade to 3.11 was resumed after the problem determination procedure inadvertently cleared the issue. Unfortunately, this means that any relevant SDN pod information is gone.

Comment 8 Casey Callendrello 2018-09-12 14:23:54 UTC
Did the node reboot? If not, you can get the SDN pod logs manually (with docker ps -a and docker logs).
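
A minimal sketch of pulling those logs by hand, assuming docker is the container runtime and that the SDN container name contains "sdn" (the filter string is an assumption):

  # list all containers, including exited ones, and locate the SDN container
  docker ps -a | grep sdn
  # dump that container's logs to a file (replace <container-id> with the ID found above)
  docker logs <container-id> > sdn-container.log 2>&1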

Otherwise, no worries. Just update the issue if it happens again.

Comment 9 Ricardo Carrillo Cruz 2019-04-05 10:38:08 UTC
Can you please update the bug and let us know if this is still an issue?

Comment 11 Red Hat Bugzilla 2023-09-14 04:34:27 UTC
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 1000 days