Bug 1468195 - http_proxy errors in the ssl_error log
Summary: http_proxy errors in the ssl_error log
Keywords:
Status: CLOSED INSUFFICIENT_DATA
Alias: None
Product: Red Hat CloudForms Management Engine
Classification: Red Hat
Component: Appliance
Version: 5.8.0
Hardware: All
OS: All
Priority: medium
Severity: medium
Target Milestone: GA
Target Release: cfme-future
Assignee: Joe Rafaniello
QA Contact: Dave Johnson
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2017-07-06 09:29 UTC by Felix Dewaleyne
Modified: 2021-06-10 12:33 UTC
CC: 10 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2018-04-09 18:35:15 UTC
Category: ---
Cloudforms Team: ---
Target Upstream Version:
Embargoed:


Attachments
top_output grep (289.03 KB, text/plain), 2018-01-03 16:07 UTC, myoder
top_output-11-18 (293.97 KB, text/plain), 2018-01-03 16:11 UTC, myoder

Description Felix Dewaleyne 2017-07-06 09:29:05 UTC
Description of problem:
In cfme-5.8.0.17, the ssl_error logs get filled with proxy and proxy_http errors.

Version-Release number of selected component (if applicable):
5.8.0.17

How reproducible:
all the time

Steps to Reproduce:
1. update a 4.2 appliance to 4.5
2.
3.

Actual results:
[Wed Jul 05 07:06:40.803674 2017] [proxy:error] [pid 12987] (111)Connection refused: AH00957: HTTP: attempt to connect to 0.0.0.0:3001 (0.0.0.0) failed
[Wed Jul 05 07:06:40.803719 2017] [proxy:error] [pid 12987] AH00959: ap_proxy_connect_backend disabling worker for (0.0.0.0) for 60s
[Wed Jul 05 07:06:40.803733 2017] [proxy_http:error] [pid 12987] [client 10.241.63.158:58811] AH01114: HTTP: failed to make connection to backend: 0.0.0.0, referer: https://example.cloudforms.com/
[Wed Jul 05 07:06:40.803869 2017] [proxy:error] [pid 12987] (111)Connection refused: AH00957: HTTP: attempt to connect to 0.0.0.0:3002 (0.0.0.0) failed
[Wed Jul 05 07:06:40.803877 2017] [proxy:error] [pid 12987] AH00959: ap_proxy_connect_backend disabling worker for (0.0.0.0) for 60s

Expected results:
no error in the logs

Additional info:
This doesn't seem to have much functional impact at first glance.

Comment 2 Nick Carboni 2017-09-22 13:35:16 UTC
These messages just indicate that we don't have a puma worker answering on that particular port; they are not indicative of an error.

I would also rather not have these messages logged so frequently, but I'm not sure there is anything we can do.

This was introduced as a fix for https://bugzilla.redhat.com/show_bug.cgi?id=1422988 in https://github.com/ManageIQ/manageiq/pull/14007

Joe, is there anything we can do to reduce (or eliminate) these messages from polluting the logs?

If not, I would say we can close this as NOTABUG.

Comment 3 Joe Rafaniello 2017-09-22 21:17:54 UTC
I believe we can provide a larger retry value than the default of 60 seconds.

https://httpd.apache.org/docs/2.4/mod/mod_proxy.html

But that would mean apache wouldn't know about newly started workers for more than 60 seconds.
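
For illustration, retry is a per-member setting in the httpd proxy balancer configuration; a sketch might look like the following (the balancer name, ports, and file are assumptions, not the actual appliance-generated config):

<Proxy balancer://evmcluster/>
  BalancerMember http://0.0.0.0:3000 retry=300
  BalancerMember http://0.0.0.0:3001 retry=300
  BalancerMember http://0.0.0.0:3002 retry=300
</Proxy>

With a longer retry, apache waits longer before re-trying a member that failed to connect, so the AH00957/AH00959 pairs would appear less often; the tradeoff, as noted, is that a puma worker started on that port could go unused for up to the retry interval.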

The alternative is going back to managing a dynamic config file with all the various puma worker ports where we continually add and remove ports.

If there is measurable overhead to continually checking balancer members in apache, we can reduce the number of members and disable configuring more than that number of workers.

None of these solutions is great.

Comment 4 Nick Carboni 2017-09-25 18:53:02 UTC
Based on the bug that this change was meant to fix, I feel that trying to maintain the config file with exactly the workers we want on demand is more error-prone than having httpd continually unable to reach configured workers.

Personally, I'm willing to overlook the errors in favor of a more reliable and maintainable web server setup.

I'm going to close this as WONTFIX: there are options to reduce this log output, but we decided that enduring the excessive logging is the least harmful option when the alternatives are taken into account.

If this is causing some other application level issue or "measurable overhead" as Joe mentioned in comment 3, feel free to reopen the bug with the details.

Comment 7 Joe Rafaniello 2018-01-02 22:22:03 UTC
Hi Yoder,

These errors are noisy but should be benign:

[Wed Jul 05 07:06:40.803674 2017] [proxy:error] [pid 12987] (111)Connection refused: AH00957: HTTP: attempt to connect to 0.0.0.0:3001 (0.0.0.0) failed
[Wed Jul 05 07:06:40.803719 2017] [proxy:error] [pid 12987] AH00959: ap_proxy_connect_backend disabling worker for (0.0.0.0) for 60s

They should not cause problems visible to a user, since apache routinely tries all members in the balancer pool in the background, even when there is no rails process listening on that port.

Can you go through the apache logs and see if there are other errors such as timeouts:
grep timeout log/apache/*

Also, can you look for any errors other than AH00957 and AH00959 in the apache logs and see whether they point at a specific UI worker port (for example, the entries above point to port 3001)?
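
Something like this sketch (paths per the appliance's log/apache directory) would pull out anything other than those two codes:

grep ":error]" log/apache/ssl_error.log* | grep -vE "AH00957|AH00959"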

In another issue[1], we found that users were getting 503 errors in the UI that manifested as timeouts in the apache logs on a specific UI worker port.  When we went through top_output.log for that port at that time, we located the UI worker process using the port, and production.log showed that this pid had accepted a UI request but, for whatever reason, never completed it.  Those users would get somewhat random failures, which only occurred when the apache balancer sent traffic to this bad UI worker.  Restarting that worker fixed the issue, but we're still trying to determine whether a larger problem is causing the worker to go bad.  Note that the worker never actually died or was killed; it kept heartbeating and "passing" the apache load balancer 'ping', so it continued to receive requests that would ultimately time out.  It was quite strange.

Sorry for writing a book.

[1] https://bugzilla.redhat.com/show_bug.cgi?id=1526146

Comment 9 Joe Rafaniello 2018-01-03 15:26:58 UTC
These errors:

apache/ssl_error.log-20171118:[Fri Nov 17 08:57:48.256364 2017] [proxy_http:error] [pid 7515] (70007)The timeout specified has expired: [client REDACTED:64356] AH01102: error reading status line from remote server 0.0.0.0:4000

indicate that there isn't a web service worker running on that appliance, or that it's not responding to requests.  The next step is to review top_output.log at that timestamp, looking for a worker on port 4000:

Maybe something like this:

grep -E "timesync|\bup\b|4000" top_output.log

Each UI appliance needs to have the web services role enabled and active.

Comment 10 myoder 2018-01-03 16:07:55 UTC
Created attachment 1376509 [details]
top_output grep

Comment 11 myoder 2018-01-03 16:11:01 UTC
Created attachment 1376510 [details]
top_output-11-18

Comment 13 Joe Rafaniello 2018-01-03 16:14:04 UTC
Hi Yoder,

Gregg and I were looking at the timeout logs again and saw that there were also a few timeouts on port 3000, a UI worker:

apache/ssl_error.log-20171122:[Tue Nov 21 18:22:33.518000 2017] [proxy_http:error] [pid 8684] (70007)The timeout specified has expired: [client REDACTED:56779] AH01102: error reading status line from remote server 0.0.0.0:3000
apache/ssl_error.log-20171122:[Tue Nov 21 18:42:43.745723 2017] [proxy_http:error] [pid 31690] (70007)The timeout specified has expired: [client REDACTED:56978] AH01102: error reading status line from remote server 0.0.0.0:3000
apache/ssl_error.log-20171122:[Tue Nov 21 18:44:44.046524 2017] [proxy_http:error] [pid 31690] (70007)The timeout specified has expired: [client REDACTED:56978] AH01102: error reading status line from remote server 0.0.0.0:3000
apache/ssl_error.log-20171122:[Tue Nov 21 18:45:53.509980 2017] [proxy_http:error] [pid 7588] (70007)The timeout specified has expired: [client REDACTED:56994] AH01102: error reading status line from remote server 0.0.0.0:3000
apache/ssl_error.log-20171122:[Tue Nov 21 18:47:03.546118 2017] [proxy_http:error] [pid 8684] (70007)The timeout specified has expired: [client REDACTED:57001] AH01102: error reading status line from remote server 0.0.0.0:3000
apache/ssl_error.log-20171122:[Tue Nov 21 18:49:03.750962 2017] [proxy_http:error] [pid 8684] (70007)The timeout specified has expired: [client REDACTED:57001] AH01102: error reading status line from remote server 0.0.0.0:3000
apache/ssl_error.log-20171122:[Tue Nov 21 18:50:08.126264 2017] [proxy_http:error] [pid 7281] (70007)The timeout specified has expired: [client REDACTED:57026] AH01102: error reading status line from remote server 0.0.0.0:3000


So, you'd have to look at top_output.log for that timeframe as well and see which pid was running on port 3000.  Then grep production.log and evm.log for that pid to see what the last request it handled was and whether it's hung.
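
A rough sketch of that workflow (the pid below is a placeholder; use whatever top_output.log shows for port 3000 at that time):

# find the top snapshots for the timeframe and the process command line containing port 3000
grep -E "timesync|\bup\b|3000" top_output.log
# then trace that pid through the Rails logs (replace <pid> with the value found above)
grep "#<pid>:" production.log
grep "#<pid>:" evm.log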

Comment 15 myoder 2018-01-03 16:30:22 UTC
Joe,

I've uploaded the evm and production logs from 2017-11-18 as well as the grepped search you provided in comment 9.

I see your comment 13 and will work on finding the last task that hung the UI worker process.

Just to answer comment 9: web_services and user_interface were enabled on all of the appliances in the region, and from their testing, only the database appliance seems to handle this role.  All other appliances are constantly crashing.

Cheers,
Yoder

Comment 17 Joe Rafaniello 2018-01-03 20:39:59 UTC
Hi Yoder,

We saw that error too and we're not sure that's related to the problem.

I've reviewed this with Gregg and Jillian and we have some findings below.  Do we have the full set of logs, including the api.log?  Please provide that if possible.

Thanks!

The customer is using 5.8.1.5:

["cfme", "5.8.1.5-1.el7cf"]

We're seeing apache timeouts reaching the web service worker on port 4000.

We grepped pid 23423 and found requests that come into the web service worker and never complete:


[----] I, [2017-11-17T08:45:02.877534 #23423:113c828]  INFO -- : Started POST "/api/service_orders/cart" for 127.0.0.1 at 2017-11-17 08:45:02 -0700
[----] I, [2017-11-17T08:45:02.881722 #23423:113c828]  INFO -- : Processing by Api::ServiceOrdersController#update as JSON
[----] I, [2017-11-17T08:45:02.881848 #23423:113c828]  INFO -- :   Parameters: {"c_id"=>"cart", "service_order"=>{}}

[----] I, [2017-11-17T08:45:15.087712 #23423:113c29c]  INFO -- : Started POST "/api/service_orders/cart" for 127.0.0.1 at 2017-11-17 08:45:15 -0700
[----] I, [2017-11-17T08:45:15.091924 #23423:113c29c]  INFO -- : Processing by Api::ServiceOrdersController#update as JSON
[----] I, [2017-11-17T08:45:15.092049 #23423:113c29c]  INFO -- :   Parameters: {"c_id"=>"cart", "service_order"=>{}}

These two requests then hit the 2-minute timeout in the apache logs:

apache/ssl_error.log-20171118:[Fri Nov 17 08:47:02.945382 2017] [proxy_http:error] [pid 7424] (70007)The timeout specified has expired: [client x.x.x.x:63930] AH01102: error reading status line from remote server 0.0.0.0:4000, referer: https://redacted.com/ui/service/catalogs/155000000000016
apache/ssl_error.log-20171118:[Fri Nov 17 08:47:15.092346 2017] [proxy_http:error] [pid 16121] (70007)The timeout specified has expired: [client x.x.x.x:63934] AH01102: error reading status line from remote server 0.0.0.0:4000, referer: https://redacted.com/ui/service/catalogs/155000000000016


We then have 3 more requests come in and never complete:

[----] I, [2017-11-17T08:55:48.455610 #23423:113d2c8]  INFO -- : Started POST "/api/service_orders/cart" for 127.0.0.1 at 2017-11-17 08:55:48 -0700
[----] I, [2017-11-17T08:55:48.459837 #23423:113d2c8]  INFO -- : Processing by Api::ServiceOrdersController#update as JSON
[----] I, [2017-11-17T08:55:48.460026 #23423:113d2c8]  INFO -- :   Parameters: {"c_id"=>"cart", "service_order"=>{}}

[----] I, [2017-11-17T08:55:47.499663 #23423:113cc4c]  INFO -- : Started POST "/api/service_orders/cart" for 127.0.0.1 at 2017-11-17 08:55:47 -0700
[----] I, [2017-11-17T08:55:47.504335 #23423:113cc4c]  INFO -- : Processing by Api::ServiceOrdersController#update as JSON
[----] I, [2017-11-17T08:55:47.504460 #23423:113cc4c]  INFO -- :   Parameters: {"c_id"=>"cart", "service_order"=>{}}

[----] I, [2017-11-17T08:55:48.569286 #23423:113c01c]  INFO -- : Started POST "/api/service_orders/cart" for 127.0.0.1 at 2017-11-17 08:55:48 -0700
[----] I, [2017-11-17T08:55:48.573251 #23423:113c01c]  INFO -- : Processing by Api::ServiceOrdersController#update as JSON
[----] I, [2017-11-17T08:55:48.573394 #23423:113c01c]  INFO -- :   Parameters: {"c_id"=>"cart", "service_order"=>{}}

We see several timeouts in the apache logs, and at this point the UI appears to be dead:

apache/ssl_error.log-20171118:[Fri Nov 17 08:57:47.557394 2017] [proxy_http:error] [pid 7580] (70007)The timeout specified has expired: [client x.x.x.x:64353] AH01102: error reading status line from remote server 0.0.0.0:4000, referer: https://redacted.com/ui/service/catalogs/155000000000016
apache/ssl_error.log-20171118:[Fri Nov 17 08:57:48.256364 2017] [proxy_http:error] [pid 7515] (70007)The timeout specified has expired: [client x.x.x.x:64356] AH01102: error reading status line from remote server 0.0.0.0:4000, referer: https://redacted.com/ui/service/catalogs/155000000000016
apache/ssl_error.log-20171118:[Fri Nov 17 08:57:48.455047 2017] [proxy_http:error] [pid 12356] (70007)The timeout specified has expired: [client x.x.x.x:64354] AH01102: error reading status line from remote server 0.0.0.0:4000, referer: https://redacted.com/ui/service/catalogs/155000000000016
apache/ssl_error.log-20171118:[Fri Nov 17 08:57:48.668524 2017] [proxy_http:error] [pid 13327] (70007)The timeout specified has expired: [client x.x.x.x:64355] AH01102: error reading status line from remote server 0.0.0.0:4000, referer: https://redacted.com/ui/service/catalogs/155000000000016
apache/ssl_error.log-20171118:[Fri Nov 17 08:57:54.625379 2017] [proxy_http:error] [pid 7017] (70007)The timeout specified has expired: [client x.x.x.x:64357] AH01102: error reading status line from remote server 0.0.0.0:4000, referer: https://redacted.com/ui/service/catalogs/155000000000016
apache/ssl_error.log-20171118:[Fri Nov 17 08:57:58.243324 2017] [proxy_http:error] [pid 13654] (70007)The timeout specified has expired: [client x.x.x.x:64359] AH01102: error reading status line from remote server 0.0.0.0:4000, referer: https://redacted.com/ui/service/catalogs/155000000000016

Finally, the web service worker exits due to a DRb connection error:


[----] E, [2017-11-17T09:31:01.113555 #23423:170857c] ERROR -- : MIQ(MiqWebServiceWorker::Runner) ID [155000004724201] PID [23423] GUID [6abfb160-c9f2-11e7-b7fc-0050568a6857] Error heartbeating to MiqServer because DRb::DRbConnError: Connection reset by peer Worker exiting.
[----] I, [2017-11-17T09:31:02.152551 #19938:deb14c]  INFO -- : MIQ(MiqWebServiceWorker#kill) Killing worker: ID [155000004724201], PID [23423], GUID [6abfb160-c9f2-11e7-b7fc-0050568a6857], status [started]

Comment 21 Jillian Tullo 2018-01-10 16:10:55 UTC
Hi Yoder--

I have taken a look at the appliance and have tried numerous methods of recreating the issue, but have been unable to do so. Would it be possible to get another appliance to talk to this one, in order to more closely mimic the environment in which the issue occurred? It is possible that the environment's complexity is a factor we will need to add in order to recreate the issue and find the root cause.

Also, have they triggered the issue again by re-enabling any of their UI appliances?

Thank you!

Comment 26 Jack Ottofaro 2018-02-28 17:26:31 UTC
I have a call with the customer this afternoon. Are there any updates regarding this issue?
Thanks,
Jack

Comment 28 Joe Rafaniello 2018-03-01 16:58:17 UTC
I'll put the most useful questions/comments first, with more detail below.  Some things that might help us understand what is happening:
1) How many notifications are in their notifications table?  If they don't use notifications in the service UI, truncating notifications and notification_recipients for now may help.  We have ongoing work to properly purge these tables on a schedule.
2) How many services, service_orders, service_templates, and service_template_catalogs rows are there? (Example count queries are sketched after this list.)
3) Postgresql logs from around 11/17 or, if it has happened again, more recent logs including evm.log, top_output.log, production.log, postgresql.log, the log/apache directory, etc.
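
For reference, one way to pull those counts on the database appliance might be the following (a sketch only; vmdb_production is the usual appliance database name, and psql may need the appropriate user/password):

psql -d vmdb_production -c "SELECT count(*) FROM notifications"
psql -d vmdb_production -c "SELECT count(*) FROM notification_recipients"
psql -d vmdb_production -c "SELECT count(*) FROM services"
psql -d vmdb_production -c "SELECT count(*) FROM service_orders"
psql -d vmdb_production -c "SELECT count(*) FROM service_templates"
psql -d vmdb_production -c "SELECT count(*) FROM service_template_catalogs"
psql -d vmdb_production -c "SELECT role, count(*) FROM miq_queue GROUP BY role"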


Note, the prior comment mentions the 183,000 queue messages for the event role in a specific zone. I wanted to include the other items we found when looking at the full set of logs again.

* 183k event queue messages as of 11/17/2017 (it was at 300k at one point on the 11th).  We don't have appliance logs for the event role to see what these events are, but perhaps an event storm occurred and caused a general slowdown.

* When service order checkout from the cart occurs and works, it looks like this in the logs: "Service Order checkout for service..." followed shortly by 1 or more MiqQueue.put messages:

[----] I, [2017-11-16T09:27:16.121187 #23423:113c01c]  INFO -- : MIQ(ServiceOrder#checkout) Service Order checkout for service: Order # 155000000000735
[----] I, [2017-11-16T09:27:16.163360 #23423:113c01c]  INFO -- : MIQ(MiqQueue.put) Message id: [155000094184397],  id: [], Zone: [], Role: [], Server: [], Ident: [generic], Target id: [], Instance id: [155000000000753], Task id: [], Command: [ServiceTemplateProvisionRequest.call_automate_event], Timeout: [3600], Priority: [100], State: [ready], Deliver On: [], Data: [], Args: ["request_created"]
[----] I, [2017-11-16T10:15:14.542356 #24526:d3d13c]  INFO -- : MIQ(MiqQueue.put) Message id: [155000094194266],  id: [], Zone: [UI Zone], Role: [], Server: [], Ident: [generic], Target id: [], Instance id: [155000000000754], Task id: [], Command: [AutomationRequest.call_automate_event], Timeout: [3600], Priority: [100], State: [ready], Deliver On: [], Data: [], Args: ["request_created"]
[----] I, [2017-11-16T10:15:14.581606 #24526:d3d13c]  INFO -- : MIQ(MiqQueue.put) Message id: [155000094194267],  id: [], Zone: [UI Zone], Role: [], Server: [], Ident: [generic], Target id: [], Instance id: [155000000000754], Task id: [], Command: [AutomationRequest.call_automate_event], Timeout: [3600], Priority: [100], State: [ready], Deliver On: [], Data: [], Args: ["request_approved"]


In the logs provided when it fails, we see the "Service Order checkout for service" message but not the MiqQueue.put that should follow it, as above.
This is why I'm curious whether the 180k+ MiqQueue event messages sitting in the queue are causing an issue queueing these messages.
A locked table, a general delay in inserting or removing rows from this table, or something similar could be in play here.

Anyway, this is what we see when the queue has 183k+ event messages sitting in it...

Many service checkouts but only a single queue put:

[----] I, [2017-11-17T08:45:02.926445 #23423:113c828]  INFO -- : MIQ(ServiceOrder#checkout) Service Order checkout for service: Order # 155000000000508
[----] I, [2017-11-17T08:45:15.133501 #23423:113c29c]  INFO -- : MIQ(ServiceOrder#checkout) Service Order checkout for service: Order # 155000000000508
[----] I, [2017-11-17T08:55:47.550245 #23423:113cc4c]  INFO -- : MIQ(ServiceOrder#checkout) Service Order checkout for service: Order # 155000000000508
[----] I, [2017-11-17T08:55:48.507024 #23423:113d2c8]  INFO -- : MIQ(ServiceOrder#checkout) Service Order checkout for service: Order # 155000000000508
[----] I, [2017-11-17T08:55:48.615038 #23423:113c01c]  INFO -- : MIQ(ServiceOrder#checkout) Service Order checkout for service: Order # 155000000000508
[----] I, [2017-11-17T10:43:50.972209 #7665:f56518]  INFO -- : MIQ(MiqQueue.put) Message id: [155000094401776],  id: [], Zone: [], Role: [], Server: [], Ident: [generic], Target id: [], Instance id: [155000000000759], Task id: [], Command: [ServiceTemplateProvisionRequest.call_automate_event], Timeout: [3600], Priority: [100], State: [ready], Deliver On: [], Data: [], Args: ["request_approved"]
[----] I, [2017-11-17T12:06:26.896477 #7675:3b5b4d8]  INFO -- : MIQ(ServiceOrder#checkout) Service Order checkout for service: Order # 155000000000718
[----] I, [2017-11-17T12:06:40.789237 #7675:3b5aed4]  INFO -- : MIQ(ServiceOrder#checkout) Service Order checkout for service: Order # 155000000000718
[----] I, [2017-11-17T12:07:06.611869 #7675:3b5a880]  INFO -- : MIQ(ServiceOrder#checkout) Service Order checkout for service: Order # 155000000000718
[----] I, [2017-11-17T12:07:29.905368 #7675:3b5abf0]  INFO -- : MIQ(ServiceOrder#checkout) Service Order checkout for service: Order # 155000000000718
[----] I, [2017-11-17T13:03:28.811575 #7675:3b5b294]  INFO -- : MIQ(ServiceOrder#checkout) Service Order checkout for service: Order # 155000000000718

Finally, we're seeing somewhat slow api requests for notifications at around this time:

[----] I, [2017-11-17T10:39:24.147438 #7675:3b5b294]  INFO -- : Started GET "/api/notifications?expand=resources&attributes=details&sort_by=id&sort_order=desc&limit=100" for 127.0.0.1 at 2017-11-17 10:39:24 -0700
[----] I, [2017-11-17T10:39:26.054640 #7675:3b5b294]  INFO -- : Completed 200 OK in 1892ms (Views: 0.3ms | ActiveRecord: 580.0ms)

[----] I, [2017-11-17T10:40:04.420282 #7675:3b5abf0]  INFO -- : Started GET "/api/notifications?expand=resources&attributes=details&sort_by=id&sort_order=desc&limit=100" for 127.0.0.1 at 2017-11-17 10:40:04 -0700
[----] I, [2017-11-17T10:40:05.696475 #7675:3b5abf0]  INFO -- : Completed 200 OK in 1272ms (Views: 0.3ms | ActiveRecord: 272.8ms)

[----] I, [2017-11-17T10:40:20.130879 #7675:3b5b4d8]  INFO -- : Started GET "/api/notifications?expand=resources&attributes=details&sort_by=id&sort_order=desc&limit=100" for 127.0.0.1 at 2017-11-17 10:40:20 -0700
[----] I, [2017-11-17T10:40:21.401474 #7675:3b5b4d8]  INFO -- : Completed 200 OK in 1266ms (Views: 0.3ms | ActiveRecord: 255.7ms)

Comment 29 Joe Rafaniello 2018-03-08 20:10:38 UTC
Hi Ryan,

Is there an update on this issue?

Comment #28 at the top has the most important questions...

The slow notifications and 187k-300k miq_queue rows are either merely "interesting" or may be helping to cause the problems they're having with the service order cart in the API.  Do they have new logs and answers to the questions in comment #28?

Comment 30 Jack Ottofaro 2018-03-12 12:10:44 UTC
Hi Joe,

For case 01982606, the customer decided to close the case, as they have not encountered the issue in "months".

Comment 31 Joe Rafaniello 2018-04-09 18:35:15 UTC
Closing for now; if this recurs, we can reopen the bug.

