Description of problem
======================

The number of skyring to mongodb connections grows linearly in a cycle of about 17 hours, which may be related to additional problems (it's not possible to log in to the web interface, and the web interface is not available when the number of connections is near its peak).

Version-Release
===============

On RHSC 2.0 machine:

rhscon-core-0.0.19-1.el7scon.x86_64
rhscon-ceph-0.0.18-1.el7scon.x86_64
rhscon-ui-0.0.34-1.el7scon.noarch
ceph-installer-1.0.11-1.el7scon.noarch
ceph-ansible-1.0.5-15.el7scon.noarch
mongodb-server-2.6.5-4.1.el7.x86_64
mongodb-2.6.5-4.1.el7.x86_64

On Ceph node machines:

rhscon-agent-0.0.8-1.el7scon.noarch
ceph-base-10.2.1-6.el7cp.x86_64

How reproducible
================

100 %

Steps to Reproduce
==================

1. Install RHSC 2.0 following the documentation, and make sure you have a few nodes ready to be accepted later.
2. Accept all nodes.
3. Start the Create Cluster task.
4. Monitor the number of skyring to mongodb connections and let the cluster idle for a few days.

To "quickly" reproduce the issue, one needs about 2 or 3 days to observe a few full cycles (as described below).

One can run a script like this to monitor the number of connections:

~~~
#!/bin/bash
# Usage: pass the PID of the skyring process as the first argument.
skyring_pid=$1
mongodb_port=27017
while true; do
  # Count lsof entries of the skyring process that mention the mongodb port.
  con_num=$(lsof -p "${skyring_pid}" | grep -c "${mongodb_port}")
  echo "$(date +"%Y-%m-%dT%H:%M") ${con_num}"
  sleep 60
done
~~~

Actual results
==============

Skyring maintains a "low" number of connections to mongodb at first (about 20 in my case), but after a few hours (about 6 hours in my case), the number of connections grows linearly until it reaches its peak of 1002 connections, at which point it plummets to just 4 connections. From there the cycle of linear growth starts again. One such cycle takes about 17 hours.

There are additional/related problems:

1) It's not possible to log in to the web interface as the admin user. The login page states:

> The username or password is incorrect.

Meanwhile, the skyring.log file contains just these 2 lines for the event:

~~~
2016-05-29T15:08:35.229+02:00 ERROR auth.go:163 Login] Error saving the session for user: admin. error: Closed explicitly
2016-05-29T15:08:35.229+02:00 ERROR auth.go:70 login] Unable to login User:Closed explicitly
~~~

2) When the number of connections is near its peak, it's not possible to reach the skyring daemon via the web interface. One needs to wait until the number of connections plummets again to be able to use the web interface.

Expected results
================

Skyring maintains a reasonable number of mongodb connections the whole time.
Created attachment 1163238
number of connections log

Attaching the log file generated by the script from the "Steps to Reproduce" section. Each line contains a timestamp and the number of connections.
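
To measure the cycle from such a log without reading it by eye, the points where the connection count plummets can be listed with a small helper. This is only a sketch: the function name `find_conn_drops` and the default threshold of 500 are assumptions made for illustration, and the input is assumed to have the same "timestamp count" layout that the monitoring script prints.

```shell
#!/bin/bash
# Hypothetical helper (not part of skyring): list every sample where the
# connection count fell by at least min_drop compared to the previous sample.
# Input format: "<timestamp> <count>" per line, as printed by the monitoring script.
find_conn_drops() {
    local log_file=$1
    local min_drop=${2:-500}   # assumed threshold, tune it to the data
    awk -v min_drop="$min_drop" '
        # Compare each sample with the previous one (skip the first line).
        NR > 1 && (prev - $2) >= min_drop {
            print $1, prev, "->", $2
        }
        { prev = $2 + 0 }
    ' "$log_file"
}
```

Running `find_conn_drops connections.log` (with whatever file name the log was saved under) prints one line per drop; the time between two consecutive lines gives the length of one cycle.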
Created attachment 1163239
graph based on number of connections log

Graph representation of the "number of connections log". See the "Actual results" section of this BZ for the explanation.
Additional information: restarting skyring fixes the issue with web login.
When we discussed this earlier, these connections were getting garbage collected when the Go garbage collector kicked in. Are you seeing a steep increase in the number of connections, or is it getting cleared up after some time?
(In reply to Nishanth Thomas from comment #4)
> When we discussed earlier, these connection were getting garbage collected
> when the go garbage collector kicks in.

Yes, that's true. On the other hand, the way it grows, the sheer number of open connections, and the related issues (web interface not reachable when the number of mongodb connections is near its peak) are hardly reasonable.

> Are you seeing steep increase in the
> no of connections or is iut getting cleared up after some time?

Look at the graph attached to this BZ. I see unreasonable linear growth of mongodb connections. The period of this process is about 17 hours. While linear growth is not steep (for some definition of steep), it's definitely not fine. Why does skyring open new connections like this over and over again?
I tried to recheck this issue with a new skyring build (I was asked to do this during a meeting with Nishanth this week), and at first sight I don't see linear growth of skyring/mongodb connections. However, I'm unable to collect enough data to make a solid case, because skyring crashes/stops after a few hours (BZ 1343104) and I need to let it run for at least 4 or 5 days.
With build rhscon-core-0.0.30-1.el7scon, the mongodb connections are well under control and there is no longer any linear growth in the number of connections.
Checking with
=============

On RHSC 2.0 server machine:

rhscon-ui-0.0.51-1.el7scon.noarch
rhscon-core-0.0.38-1.el7scon.x86_64
rhscon-ceph-0.0.38-1.el7scon.x86_64
rhscon-core-selinux-0.0.38-1.el7scon.noarch

On Ceph 2.0 machines:

rhscon-core-selinux-0.0.38-1.el7scon.noarch
rhscon-agent-0.0.16-1.el7scon.noarch

Verification
============

While observing the number of skyring to mongodb connections for about 3.5 days, I didn't notice the issue at all: the number of connections remained constant (at 17 connections) during the whole period.

~~~
$ head -2 skyring_mongod_conn.log
2016-08-04T18:12 17
2016-08-04T18:13 17
$ tail -2 skyring_mongod_conn.log
2016-08-08T10:39 17
2016-08-08T10:40 17
$ cut -d' ' -f2 skyring_mongod_conn.log | sort | uniq
17
~~~

>> VERIFIED
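
When the count is not perfectly constant, the same check can be done in a single pass that reports the number of samples together with the minimum and maximum. The sketch below assumes the log keeps the "timestamp count" layout shown above; the function name `conn_log_summary` is made up for this example.

```shell
#!/bin/bash
# Hypothetical helper: summarize a connection log in one pass.
# Input format: "<timestamp> <count>" per line.
conn_log_summary() {
    awk '
        NR == 1      { min = max = $2 + 0 }   # seed with the first sample
        $2 + 0 < min { min = $2 + 0 }
        $2 + 0 > max { max = $2 + 0 }
        END {
            # Print nothing for an empty log.
            if (NR > 0) printf "samples=%d min=%d max=%d\n", NR, min, max
        }
    ' "$1"
}
```

For a log like the one above, where every sample is 17, `conn_log_summary skyring_mongod_conn.log` would report equal `min` and `max` values; a large gap between the two would point back to the growth described in this BZ.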
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHEA-2016:1754