Bug 1913411

Summary: When openshift-sdn segfaults, creation fails with 'error reserving pod name ...: name is reserved'
Product: OpenShift Container Platform
Reporter: Peter Hunt <pehunt>
Component: Networking
Assignee: Surya Seetharaman <surya>
Networking sub component: openshift-sdn
QA Contact: zhaozhanqi <zzhao>
Status: CLOSED DUPLICATE
Docs Contact:
Severity: urgent
Priority: urgent
CC: aconstan, akamra, akashem, anbhat, aos-bugs, apjagtap, bbennett, ckavili, dosmith, dsanzmor, eparis, hongkliu, itbrown, izhang, jcallen, jcrumple, jmalde, jokerman, jsafrane, jtaleric, jtanenba, kramdoss, lbac, llopezmo, lmartinh, mapandey, mwasher, naoto30, nelluri, openshift-bugs-escalate, pehunt, rcernin, rkrawitz, rperiyas, rphillips, rsevilla, sbhavsar, schoudha, scuppett, ssonigra, swasthan, tkatarki, vpagar, wking, xmorano
Version: 4.6.z
Flags: lmartinh: needinfo-
Target Milestone: ---
Target Release: 4.8.0
Hardware: Unspecified
OS: Unspecified
Whiteboard: aos-scalability-43
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of: 1785399
Environment:
Last Closed: 2021-04-27 19:54:28 UTC
Type: ---
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On:
Bug Blocks: 1887744

Comment 12 Abu Kashem 2021-01-21 16:55:26 UTC
> - However, along with the 'netplugin failed with no error message: signal: killed' and 'name is reserved' messages, they observed the apiserver reporting panic.
> - As a workaround, they restart the kube-apiserver pods and everything works as expected for a while.

I would like the following to be tracked:

- grep the kube-apiserver logs and give us a count of the total number of panics seen, along with the time range they cover (see the sketch after this list).
- can you have the customer run the following Prometheus query in the web console (time range: from the time the master was rebooted, going back 48 hours) and share the screenshot with us?
> sum(apiserver_flowcontrol_current_executing_requests) by (flowSchema,priorityLevel)
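
A minimal sketch of how that panic count could be pulled from a must-gather; the directory layout and the plain "panic" search pattern are assumptions here, so adjust them to the actual logs:
> # rough total of panic occurrences across all kube-apiserver log files
> grep -rni "panic" namespaces/openshift-kube-apiserver/* | wc -l
> # per-file counts, to see which instance they cluster in (the log timestamps give the time range)
> grep -ric "panic" namespaces/openshift-kube-apiserver/* | sort -t: -k2 -nr | head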

Comment 15 Abu Kashem 2021-01-25 16:59:28 UTC
apjagtap, here are requests for new data.

Grep the current kube-apiserver logs (all instances):
> grep -rni -E "timeout.go:(132|134)" namespaces/openshift-kube-apiserver/*

please run the following Prometheus queries and share the full screenshots with me (a CLI alternative is sketched after the queries):
> topk(25, sum(apiserver_flowcontrol_current_executing_requests) by (priorityLevel,instance))
> topk(25, sum(apiserver_flowcontrol_request_concurrency_limit) by (priorityLevel,instance))
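
If the console is not convenient, here is a sketch of running the same queries against the in-cluster Prometheus over its HTTP API; the route lookup and token handling below are assumptions about the environment, not something captured in this bug:
> TOKEN=$(oc whoami -t)
> PROM=$(oc -n openshift-monitoring get route prometheus-k8s -o jsonpath='{.spec.host}')
> # POST the PromQL to the instant-query endpoint; repeat with the second query
> curl -sk -H "Authorization: Bearer $TOKEN" \
>   --data-urlencode 'query=topk(25, sum(apiserver_flowcontrol_current_executing_requests) by (priorityLevel,instance))' \
>   "https://$PROM/api/v1/query"
For a graph over a time window (what the console screenshot would show), the query_range endpoint with start/end/step parameters is the equivalent.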

Thanks!

Comment 21 Abu Kashem 2021-02-02 16:48:26 UTC
apjagtap,

> Should I open another bug and share it over 
Sounds good to me, and please follow the instructions for data capture from this comment: https://bugzilla.redhat.com/show_bug.cgi?id=1908383#c19

Comment 35 Surya Seetharaman 2021-04-27 19:54:28 UTC

*** This bug has been marked as a duplicate of bug 1924741 ***