Bug 1318250 - Horizon Overview page is not accessible - 504 Gateway Time-out when stack has 5000 instances/vms
Summary: Horizon Overview page is not accessible - 504 Gateway Time-out when stack has 5000 instances/vms
Keywords:
Status: CLOSED WONTFIX
Alias: None
Product: Red Hat OpenStack
Classification: Red Hat
Component: python-django-horizon
Version: 7.0 (Kilo)
Hardware: x86_64
OS: Linux
Priority: unspecified
Severity: high
Target Milestone: ---
Target Release: 7.0 (Kilo)
Assignee: Jason E. Rist
QA Contact: Ido Ovadia
URL:
Whiteboard:
Depends On: 1388171
Blocks:
 
Reported: 2016-03-16 11:08 UTC by Yuri Obshansky
Modified: 2017-05-04 14:30 UTC
CC List: 8 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2017-05-04 14:30:22 UTC
Target Upstream Version:
Embargoed:


Attachments
Horizon good screenshot (70.04 KB, image/png), 2016-03-16 11:08 UTC, Yuri Obshansky
Horizon timeout screenshot (15.69 KB, image/png), 2016-03-16 11:09 UTC, Yuri Obshansky
openstack-status output (29.26 KB, text/plain), 2016-03-16 11:12 UTC, Yuri Obshansky
horizon controller log file (1.38 MB, application/x-gzip), 2016-03-16 12:12 UTC, Yuri Obshansky
journalctl -u haproxy >haproxy.log (1.38 MB, application/x-gzip), 2016-03-16 12:13 UTC, Yuri Obshansky
correct horizon controller log file (310 bytes, application/x-gzip), 2016-03-16 12:20 UTC, Yuri Obshansky
httpd logs (7.28 KB, application/x-gzip), 2016-03-16 13:51 UTC, Yuri Obshansky
correct httpd logs (7.24 KB, application/x-gzip), 2016-03-16 13:59 UTC, Yuri Obshansky

Description Yuri Obshansky 2016-03-16 11:08:15 UTC
Created attachment 1136980 [details]
Horizon good screenshot

Description of problem:
I successfully created 5000 instances on a small-scale environment on 10.03.
Horizon was fine and showed all instances in the UI
(see attached screenshot).
I left it running over the weekend in idle mode (without load),
and now Horizon is not accessible.
It raises the error "504 Gateway Time-out"
(see screenshot).
I checked the environment using the command $ openstack-status
and everything looks OK
(see attached output).
1. I updated the httpd timeout and restarted it - didn't help
/etc/httpd/conf/httpd.conf
#Timeout 120
Timeout 600
2. I updated the HAProxy timeouts and restarted it - didn't help
/etc/haproxy/haproxy.cfg
defaults
  log  global
  maxconn  4096
  mode  tcp
  retries  3
#  timeout  http-request 10s
  timeout  http-request 1m
  timeout  queue 1m
#  timeout  connect 10s
  timeout  connect 1m
  timeout  client 1m
  timeout  server 1m
#  timeout  check 10s
  timeout  check 1m

Version-Release number of selected component (if applicable):
rhos-release 7 -p 2016-02-24.1

How reproducible:
Create 5000 instances/VMs across 50 tenants (see the sketch below)
using
- image: cirros-0.3.2-sc (12.6 MB)
- flavor: m1.nano (1 VCPU, 64 MB RAM, 1 GB Disk)
Try to log in to Horizon.
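
A minimal sketch of the instance-creation loop, assuming python-novaclient and per-tenant credentials (all names, counts, and endpoints below are placeholders, not taken from the test environment):

from novaclient import client as nova_client

# hypothetical credentials for one of the 50 tenants
nova = nova_client.Client('2', 'tenant_user', 'PASSWORD', 'tenant_01',
                          'http://192.0.2.10:5000/v2.0')
image = nova.images.find(name='cirros-0.3.2-sc')
flavor = nova.flavors.find(name='m1.nano')
# 100 instances per tenant x 50 tenants = 5000 instances
for i in range(100):
    nova.servers.create(name='scale-vm-%03d' % i, image=image, flavor=flavor)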

Steps to Reproduce:
1.
2.
3.

Actual results:


Expected results:


Additional info:

Comment 1 Yuri Obshansky 2016-03-16 11:09:18 UTC
Created attachment 1136981 [details]
Horizon timeout screenshot

Comment 2 Yuri Obshansky 2016-03-16 11:12:45 UTC
Created attachment 1136982 [details]
openstack-status output

Comment 3 Yuri Obshansky 2016-03-16 11:28:13 UTC
Test result here:
https://mojo.redhat.com/docs/DOC-1071620

Comment 4 Matthias Runge 2016-03-16 11:53:58 UTC
Could you please provide logs, such as the error log from Horizon? I'd also be curious to see the haproxy logs.

Comment 5 Yuri Obshansky 2016-03-16 12:12:42 UTC
Created attachment 1136990 [details]
horizon controller log file

Comment 6 Yuri Obshansky 2016-03-16 12:13:31 UTC
Created attachment 1136991 [details]
journalctl -u haproxy >haproxy.log

Comment 7 Yuri Obshansky 2016-03-16 12:20:33 UTC
Created attachment 1136997 [details]
correct horizon controller log file

Comment 8 Yuri Obshansky 2016-03-16 13:51:28 UTC
Created attachment 1137035 [details]
httpd logs

Comment 9 Yuri Obshansky 2016-03-16 13:59:27 UTC
Created attachment 1137051 [details]
correct httpd logs

Comment 11 Matthias Runge 2016-03-17 07:23:55 UTC
After digging around in that installation and finding many stopped neutron services, I'd tend to close this.


The neutron logs show lots of errors like this:
2016-03-17 00:06:11.767 15007 WARNING keystonemiddleware.auth_token [-] Authorization failed for token
2016-03-17 00:06:11.770 15007 WARNING keystonemiddleware.auth_token [-] Identity response: {"error": {"message": "Could not find token: 447bdac7a94548cc9e7ec6e29e4bd942", "code": 404, "title": "Not Found"}}
2016-03-17 00:06:11.771 15007 WARNING keystonemiddleware.auth_token [-] Authorization failed for token
2016-03-17 00:06:14.526 15011 WARNING keystonemiddleware.auth_token [-] Authorization failed for token
2016-03-17 00:06:14.528 15011 WARNING keystonemiddleware.auth_token [-] Identity response: {"error": {"message": "Could not find token: 99dbb2f98e654ee08acc16502a100ccc", "code": 404, "title": "Not Found"}}
2016-03-17 00:06:14.529 15011 WARNING keystonemiddleware.auth_token [-] Authorization failed for token
2016-03-17 00:06:18.295 15002 WARNING keystonemiddleware.auth_token [-] Authorization failed for token
2016-03-17 00:06:18.296 15002 WARNING keystonemiddleware.auth_token [-] Identity response: {"error": {"message": "Could not find token: 1f5fa24792ec4f7ab741a9d600b41695", "code": 404, "title": "Not Found"}}
2016-03-17 00:06:18.296 15002 WARNING keystonemiddleware.auth_token [-] Authorization failed for token
2016-03-17 00:06:18.303 15002 WARNING keystonemiddleware.auth_token [-] Authorization failed for token
2016-03-17 00:06:30.417 15003 WARNING keystonemiddleware.auth_token [-] Authorization failed for token
2016-03-17 00:06:30.419 15003 WARNING keystonemiddleware.auth_token [-] Identity response: {"error": {"message": "Could not find token: 70dd139424894c37bae880f8c8ae0add", "code": 404, "title": "Not Found"}}
2016-03-17 00:06:30.419 15003 WARNING keystonemiddleware.auth_token [-] Authorization failed for token
2016-03-17 00:16:20.058 15009 WARNING keystonemiddleware.auth_token [-] Authorization failed for token
2016-03-17 00:16:20.059 15009 WARNING keystonemiddleware.auth_token [-] Identity response: {"error": {"message": "Could not find token: 8974d93d849445dc967504b295e16e1e", "code": 404, "title": "Not Found"}}
2016-03-17 00:16:20.059 15009 WARNING keystonemiddleware.auth_token [-] Authorization failed for token
2016-03-17 00:16:22.496 15021 WARNING keystonemiddleware.auth_token [-] Authorization failed for token
2016-03-17 00:16:22.497 15021 WARNING keystonemiddleware.auth_token [-] Identity response: {"error": {"message": "Could not find token: 55ea136f54574b14b948fd4928083958", "code": 404, "title": "Not Found"}}
2016-03-17 00:16:22.498 15021 WARNING keystonemiddleware.auth_token [-] Authorization failed for token

and also on another node:
2016-03-16 15:56:39.237 14714 ERROR neutron.plugins.ml2.managers [req-7b4ea77c-fe69-4f77-b78d-10517dbeaca5 ] Failed to bind port 48d926c7-c5e9-4c79-a7b9-ab06f28841eb on host overcloud-controller-1.localdomain
2016-03-16 15:56:39.238 14714 ERROR neutron.plugins.ml2.managers [req-7b4ea77c-fe69-4f77-b78d-10517dbeaca5 ] Failed to bind port 48d926c7-c5e9-4c79-a7b9-ab06f28841eb on host overcloud-controller-1.localdomain
2016-03-16 15:56:39.265 14714 WARNING neutron.plugins.ml2.plugin [req-7b4ea77c-fe69-4f77-b78d-10517dbeaca5 ] In _notify_port_updated(), no bound segment for port 48d926c7-c5e9-4c79-a7b9-ab06f28841eb on network adfe1297-56ec-4b68-ae4c-da48483c40de
2016-03-16 15:56:41.224 14729 WARNING neutron.plugins.ml2.rpc [req-d03b4f72-c9bd-4caa-9115-cc38de8ea7aa ] Device 48d926c7-c5e9-4c79-a7b9-ab06f28841eb requested by agent ovs-agent-overcloud-controller-0.localdomain on network adfe1297-56ec-4b68-ae4c-da48483c40de not bound, vif_type: binding_failed
2016-03-16 15:56:41.290 14729 WARNING neutron.plugins.ml2.rpc [req-d03b4f72-c9bd-4caa-9115-cc38de8ea7aa ] Device b0e449b3-05ab-4b6a-a54e-758c142d8886 requested by agent ovs-agent-overcloud-controller-0.localdomain on network 653efbd9-4213-4f20-a93b-94ce5f2b3548 not bound, vif_type: binding_failed
2016-03-16 15:56:42.749 14734 WARNING neutron.plugins.ml2.drivers.type_tunnel [req-74c3026c-c431-4e8e-abfd-c001cbe297b2 ] Endpoint with ip 172.16.0.10 already exists
2016-03-16 15:56:44.028 14737 WARNING neutron.plugins.ml2.rpc [req-74c3026c-c431-4e8e-abfd-c001cbe297b2 ] Device 48d926c7-c5e9-4c79-a7b9-ab06f28841eb requested by agent ovs-agent-overcloud-controller-1.localdomain on network adfe1297-56ec-4b68-ae4c-da48483c40de not bound, vif_type: binding_failed

Comment 12 Yuri Obshansky 2016-03-17 07:35:28 UTC
I don't think this is a valid verification.
The bug happened when the stack had 5000 instances,
and you are digging around in a stack with no instances at all.

Comment 13 Matthias Runge 2016-03-17 07:54:06 UTC
Yuri, the bug still exists, even on an empty machine without running instances.

Comment 14 Itxaka 2016-04-22 11:18:29 UTC
Yuri,

Any update on this, or can I close it?

Thanks

Comment 15 Yuri Obshansky 2016-05-03 13:14:36 UTC
Hi,
I tried to reproduce the bug on rhos-release 8 -p 2016-03-24.2.
I created 4,849 VMs.
Horizon is accessible right now,
but I'd like to let it run idle for 24 hours.
I'll update with the result.

Thanks

Comment 16 Matthias Runge 2016-05-10 07:23:16 UTC
After deleting the instances, it seems like the admin/overview page is not available any more.


While looking at the APIs used, I found Horizon is calling os-simple-tenant-usage, which was involved in an older bug:
https://bugzilla.redhat.com/show_bug.cgi?id=1243301

Unfortunately, the proposed solution doesn't solve *this* situation here.
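
For reference, a minimal sketch of the kind of query the admin Overview page triggers through os-simple-tenant-usage, using python-novaclient (credentials, endpoint, and time window below are placeholders, not from this environment):

from datetime import datetime, timedelta
from novaclient import client as nova_client

# hypothetical admin credentials and Keystone endpoint
nova = nova_client.Client('2', 'admin', 'ADMIN_PASSWORD', 'admin',
                          'http://192.0.2.10:5000/v2.0')
end = datetime.utcnow()
start = end - timedelta(days=30)
# detailed=True returns per-instance rows for every tenant; with thousands
# of instances this single call can exceed the httpd/haproxy timeouts
usages = nova.usage.list(start, end, detailed=True)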

Comment 17 Matthias Runge 2016-05-10 07:28:04 UTC
In addition to comment 16,
I would like to add
OPENSTACK_NOVA_EXTENSIONS_BLACKLIST = ["SimpleTenantUsage"]

to /etc/openstack-dashboard/local_settings
and restart httpd.

I cannot verify that solution from here.
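
A sketch of the change described above, assuming a standard OSP 7 Horizon install (local_settings is a Python/Django settings file, so the line goes in as-is):

# /etc/openstack-dashboard/local_settings
# Blacklisting the SimpleTenantUsage extension stops Horizon from calling
# os-simple-tenant-usage when it builds the Overview page.
OPENSTACK_NOVA_EXTENSIONS_BLACKLIST = ["SimpleTenantUsage"]

followed by systemctl restart httpd so the new settings are loaded.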

Comment 18 Itxaka 2016-05-10 10:08:13 UTC
For the terminated instances killing the page, it may help to do this:

--- a/openstack_dashboard/dashboards/admin/overview/views.py
+++ b/openstack_dashboard/dashboards/admin/overview/views.py
@@ -44,6 +44,7 @@ class GlobalUsageCsvRenderer(csvbase.BaseCsvResponse):
 
 
 class GlobalOverview(usage.UsageView):
+    show_terminated = False
     table_class = usage.GlobalUsageTable
     usage_class = usage.GlobalUsage
     template_name = 'admin/overview/usage.html'



The problem with this is that, by default, the admin Overview page queries usage for all instances, including the deleted ones, which can put a heavy load on the system when the number of instances is very high.
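
A rough way to gauge how many terminated instances are still feeding that query, sketched with python-novaclient and admin credentials (placeholders throughout; 'all_tenants' and 'deleted' are admin-only filters):

from novaclient import client as nova_client

# hypothetical admin credentials and Keystone endpoint
nova = nova_client.Client('2', 'admin', 'ADMIN_PASSWORD', 'admin',
                          'http://192.0.2.10:5000/v2.0')
# the result is capped by the API's default page size, so treat this
# as a lower bound rather than an exact count
deleted = nova.servers.list(search_opts={'all_tenants': 1, 'deleted': 1})
print('%d terminated instances still show up in usage queries' % len(deleted))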

Would it be possible to get the env details so I can ssh into it and see it myself?

Thanks!

Comment 19 Yuri Obshansky 2016-05-11 09:54:16 UTC
Provided env details to Itxaka.

Comment 21 Radomir Dopieralski 2016-11-28 14:45:28 UTC
This problem should be greatly alleviated (and in most cases completely fixed) by implementing https://bugzilla.redhat.com/show_bug.cgi?id=1388171

