Bug 1729461 - OCP API server panic while communicating with Cloudforms.
Summary: OCP API server panic while communicating with Cloudforms.
Keywords:
Status: CLOSED NOTABUG
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: openshift-apiserver
Version: 3.11.0
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: medium
Target Milestone: ---
Target Release: 3.11.z
Assignee: Stefan Schimanski
QA Contact: Xingxing Xia
URL:
Whiteboard:
Depends On:
Blocks: 1722808
 
Reported: 2019-07-12 11:00 UTC by Shivkumar Ople
Modified: 2020-01-13 13:05 UTC (History)
CC List: 6 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2020-01-13 13:05:37 UTC
Target Upstream Version:
Embargoed:


Attachments
api logs from Jun 25. List /api/v1/pods consistently timing out ("panic") after 2min. (8.71 MB, application/gzip)
2019-07-12 12:21 UTC, Beni Paskin-Cherniavsky
etcd logs from Jun 26 (2.01 MB, application/gzip)
2019-07-12 12:22 UTC, Beni Paskin-Cherniavsky
controllers logs from Jun 26 (4.41 MB, application/gzip)
2019-07-12 12:24 UTC, Beni Paskin-Cherniavsky
new api logs from Jul 8. increased timeout. /api/v1/replicationcontrollers time out after 10min! (175.21 KB, application/gzip)
2019-07-12 12:25 UTC, Beni Paskin-Cherniavsky
master.config from all masters, Jul 9 (2.87 KB, application/gzip)
2019-07-12 12:30 UTC, Beni Paskin-Cherniavsky

Description Shivkumar Ople 2019-07-12 11:00:25 UTC
Description of problem:

The OCP API server panics while communicating with CloudForms.

The following errors were observed in the evm.log from CloudForms:


````````````````````````

[----] E, [2019-06-17T13:54:02.340421 #29275:4a8f58] ERROR -- : MIQ(ManageIQ::Providers::Openshift::ContainerManager::Refresher#refresh) EMS: [prod2-ocp], id: [1000000000005] Refresh failed
[----] E, [2019-06-17T13:54:02.340620 #29275:4a8f58] ERROR -- : [JSON::ParserError]: 765: unexpected token at 'This request caused apiserver to panic. Look in the logs for details.
'  Method:[block (2 levels) in <class:LogProxy>]
[----] E, [2019-06-17T13:54:02.340737 #29275:4a8f58] ERROR -- : /usr/share/gems/gems/json-2.1.0/lib/json/common.rb:156:in `parse'


````````````````````````

- In the openshift-infra project, which CloudForms queries for data, all pods are running.

- Tokens for communication between CloudForms and OpenShift are in place.



Version-Release number of selected component (if applicable):




How reproducible:


Steps to Reproduce:
1.
2.
3.

Actual results:
The API server panics.

Expected results:

The API server should respond correctly and in time, without panicking.


Additional info:

CloudForms fetches data from two OCP clusters. One cluster works well and responds correctly; the other exhibits the panic error described above.

Comment 1 Beni Paskin-Cherniavsky 2019-07-12 12:21:02 UTC
Created attachment 1589903 [details]
api logs from Jun 25.  List /api/v1/pods consistently timing out ("panic") after 2min.

Comment 2 Beni Paskin-Cherniavsky 2019-07-12 12:22:39 UTC
Created attachment 1589904 [details]
etcd logs from Jun 26

Comment 3 Beni Paskin-Cherniavsky 2019-07-12 12:24:26 UTC
Created attachment 1589905 [details]
controllers logs from Jun 26

Comment 4 Beni Paskin-Cherniavsky 2019-07-12 12:25:51 UTC
Created attachment 1589906 [details]
new api logs from Jul 8. increased timeout. /api/v1/replicationcontrollers time out after 10min!

Comment 5 Beni Paskin-Cherniavsky 2019-07-12 12:30:42 UTC
Created attachment 1589908 [details]
master.config from all masters, Jul 9

Comment 6 Beni Paskin-Cherniavsky 2019-07-12 12:58:07 UTC
Attached the OpenShift logs from the case.

On Jun 25, /api/v1/pods requests were consistently timing out after 2min:

I0621 18:03:03.801477       1 trace.go:76] Trace[159392431]: "List /api/v1/pods" (started: 2019-06-21 18:00:53.021276906 +0800 +08 m=+3628123.883312551) (total time: 2m10.780146322s):
Trace[159392431]: [335.105685ms] [335.101451ms] Listing from storage done
Trace[159392431]: [2m10.780142527s] [2m10.442985457s] Writing http response done (1946 items)
E0621 18:03:03.801883       1 runtime.go:66] Observed a panic: &errors.errorString{s:"killing connection/stream because serving request timed out and response had been started"} (killing connection/stream because serving request timed out and response had been started)

Since then, the number of pods in the cluster was reduced, and the config was changed to increase the request-timeout and the burst/QPS limits. (New master.config attached.)

According to the new log, /api/v1/pods requests complete successfully in 10-30 sec (which is still unreasonably slow for ~400 pods and a 2 MB response).

But now /api/v1/replicationcontrollers time out after taking *10min*!

I0708 17:15:34.335206       1 trace.go:76] Trace[1673307968]: "List /api/v1/replicationcontrollers" (started: 2019-07-08 17:05:15.652786439 +0800 +08 m=+1477.517926570) (total time: 10m18.682383848s):
Trace[1673307968]: [1.096050414s] [1.096047559s] Listing from storage done
Trace[1673307968]: [10m18.682380842s] [10m17.571893772s] Writing http response done (15888 items)
E0708 17:15:34.335399       1 runtime.go:66] Observed a panic: &errors.errorString{s:"killing connection/stream because serving request timed out and response had been started"} (killing connection/stream because serving request timed out and response had been started)

If I'm reading these correctly (I'm not an OpenShift expert), etcd ("Listing from storage") is pretty fast (<1 sec), and then minutes are spent in the apiserver?!

---

Is CloudForms significant?
One important point is that unlike oc, CloudForms makes unchunked requests — no `?limit=500` and continue tokens, just one big request for all pods in all namespaces (then a big request for all services, all replicationcontrollers, etc...)
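
For illustration, here is a minimal sketch (in Python, not CF's actual Ruby code; the master URL and token are hypothetical placeholders) of the difference between the single unchunked list CF issues and a chunked list using `?limit=` and the continue token:

# Illustrative sketch only (not CloudForms code). API_URL and TOKEN are
# hypothetical placeholders; the pagination protocol (?limit= plus the
# metadata.continue token) is standard Kubernetes list chunking.
import requests

API_URL = "https://master.example.com:8443"   # hypothetical master URL
TOKEN = "<service-account-token>"             # hypothetical token
HEADERS = {"Authorization": "Bearer " + TOKEN}

def list_pods_unchunked():
    # One big request for all pods in all namespaces (what CF currently does).
    resp = requests.get(API_URL + "/api/v1/pods", headers=HEADERS, verify=False)
    resp.raise_for_status()
    return resp.json()["items"]

def list_pods_chunked(limit=500):
    # Paginated requests, the way oc lists resources.
    items, cont = [], None
    while True:
        params = {"limit": limit}
        if cont:
            params["continue"] = cont
        resp = requests.get(API_URL + "/api/v1/pods", headers=HEADERS,
                            params=params, verify=False)
        resp.raise_for_status()
        body = resp.json()
        items.extend(body["items"])
        cont = body["metadata"].get("continue")
        if not cont:
            return items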

For an unknown reason, we haven't managed to reproduce the timeout with `curl` from the same machine as CloudForms.
So maybe there's still _something_ special about the way CF makes requests, but I have no leads — all we can observe from the CF side is that the apiserver takes a long time and returns an error.

The way forward is investigating the timeouts inside OpenShift (hence this BZ). To be clear, I'm familiar with the CF side but not with OpenShift.

Comment 8 Stefan Schimanski 2019-07-23 15:54:08 UTC
Are we sure that CF is reading the data from the response in time? A client can also slow down connections. If an equivalent curl call works fine, the chance is high that it has to do with the transport level, not the API server logic.

Have you checked the CPU load during the request? If it is the API server logic, the chance is high that the CPU load of at least one core is at 100%. If the client does not read the data quickly enough, you won't see 100%.
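
A rough way to tell the two cases apart from the client side (a sketch only; API_URL, TOKEN, and HEADERS are the same hypothetical placeholders as in the sketch in comment 6): time how long the response headers take versus how long draining the full body takes while reading as fast as possible. If the total is dominated by the body drain even with an eager reader, the server side is the likelier bottleneck.

# Rough sketch to separate "apiserver is slow to produce/write the response"
# from "the client reads it too slowly". Placeholders are hypothetical;
# verify=False is only for the sketch.
import time
import requests

API_URL = "https://master.example.com:8443"                     # hypothetical
HEADERS = {"Authorization": "Bearer <service-account-token>"}   # hypothetical

def time_list(path="/api/v1/pods"):
    t0 = time.monotonic()
    # stream=True: return as soon as the headers arrive; drain the body below.
    resp = requests.get(API_URL + path, headers=HEADERS, stream=True, verify=False)
    ttfb = time.monotonic() - t0   # time until the server started responding
    nbytes = 0
    for chunk in resp.iter_content(chunk_size=64 * 1024):
        nbytes += len(chunk)       # read as fast as possible, discard the data
    total = time.monotonic() - t0
    print("status=%s bytes=%d ttfb=%.2fs total=%.2fs"
          % (resp.status_code, nbytes, ttfb, total))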

Comment 9 Stefan Schimanski 2019-07-23 15:57:52 UTC
Reading "Trace[1673307968]: [10m18.682380842s] [10m17.571893772s] Writing http response done (15888 items)" doesn't sound healthy. There are 15k+ ReplicationControllers in the system? I would not expect such a request to finish within 60 seconds. This requires chunking.

Comment 13 Beni Paskin-Cherniavsky 2019-07-24 13:23:39 UTC
> Are we sure that CF is reading the data from the response in time? A client can also slow down connections.

I also had this thought, but I was thinking of network bandwidth and, say, hundreds of megabytes, and decided it's unlikely: (1) we're talking about a couple of megabytes taking up to half a minute, and (2) curl showed the network *can* transfer it fast.

> Have you checked the CPU load during the request?  If it is the API server logic, the chance is high that the CPU load of at least one core is at 100%. 

Ooh, I hadn't considered CF itself being slow! It's Ruby, but it's supposed to just slurp the data off the wire; all the JSON parsing etc. happens later. (I'll find the exact code soon.)

But we have a `top` output dump every minute from the CF machine (I'll attach it), and while overall usage stays <30%, one of the openshift refresh workers is frequently >90%, even 99% of its core!
Cool, I'll look into profiling the CF code!

--

A question about interpreting the apiserver logs:
"Writing http response done (15888 items)"
Does this mean it successfully wrote a full response and then timed out later?
Or is that the number of items it managed to write before the timeout stopped it?


--

> This requires chunking.

Ack, we're aware that unchunked requests are bad for OpenShift too, but the work on chunking in CF was shelved, and I'm not sure we'll revive it :(.

Comment 14 Beni Paskin-Cherniavsky 2019-07-24 13:52:40 UTC
> one of the openshift refresh workers is frequently >90%, even 99% of its core!

Wait, that seems to be a worker that does refresh for another OpenShift env; the failing CF worker is at ~3% CPU.
I'll correlate the logs better to confirm...

Comment 15 Beni Paskin-Cherniavsky 2020-01-13 13:05:37 UTC
This specific customer's case has long been closed, and we've implemented chunking on the CloudForms side (details in bz 1722808), which should improve things a lot.
Nothing to do here.

