Bug 1343104

Summary: skyring crashes without relevant info in logs
Product: [Red Hat Storage] Red Hat Storage Console Reporter: Martin Kudlej <mkudlej>
Component: coreAssignee: Nishanth Thomas <nthomas>
core sub component: provisioning QA Contact: Daniel Horák <dahorak>
Status: CLOSED ERRATA Docs Contact:
Severity: high    
Priority: medium CC: dahorak, mbukatov, shtripat, vsarmila
Version: 2   
Target Milestone: ---   
Target Release: 2   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: rhscon-core-0.0.30-1.el7scon Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2016-08-23 19:55:03 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On:    
Bug Blocks: 1344195    
Attachments:
Description Flags
failed Create Cluster task with "Last update time" before "Start time". none

Description Martin Kudlej 2016-06-06 13:49:53 UTC
Description of problem:
I've installed skyring cluster 2 days ago. Today I've accepted all available nodes and when I've tried to create cluster I've realized during testing create cluster wizard that skyring crashed. I haven't submited create cluster task.

Version-Release number of selected component (if applicable):
ceph-ansible-1.0.5-19.el7scon.noarch
ceph-installer-1.0.11-1.el7scon.noarch
rhscon-ceph-0.0.20-1.el7scon.x86_64
rhscon-core-0.0.21-1.el7scon.x86_64
rhscon-ui-0.0.34-1.el7scon.noarch

How reproducible:
probably not easily reproducible but skyring should not crash ever

Expected results:
Skyring should not crash ever.

Comment 1 Martin Kudlej 2016-06-06 14:01:01 UTC
Created attachment 1165217 [details]
logs from server

Comment 2 Daniel Horák 2016-06-07 11:52:31 UTC
Created attachment 1165583 [details]
failed Create Cluster task with "Last update time" before "Start time".

Comment 3 Daniel Horák 2016-06-07 11:57:50 UTC
I probably hit the same issue on larger cluster (3 monitors and 40 OSD nodes) during the Cluster creation. 
This leads to quite strange situation - Last update time on the task is before Start time (as you can see on attachment 1165583 [details]).

In the logs seems to be nothing really related to the crash, similarly as described in Comment 0.

Comment 5 Martin Bukatovic 2016-06-07 12:29:57 UTC
I noticed this issue today on my cluster which I let running idle (after
creating a cluster there) to track mongodb connections.

1) when I was tracking mongodb connections for BZ 1341225 with older version
   of skyring (rhscon-core-0.0.19-1.el7scon.x86_64) I didn't hit this
   issue (skyring didn't crash) even though skyring was running for about
   4 days
2) with new verion of skyring, there are multiple occurences of this BZ
   (mkudlej's, dahorak's and mine) while skyring uptime was smaller than 4 days 

Based on these facts: I think that the issue is real and it may got worse or be
introduced in the current build rhscon-core-0.0.21-1.el7scon.x86_64

That said, to reproduce it would require to let the skyring run for few days.

Steps To Reproduce
==================

Here are the exact steps I did with the cluster before the crash. Unfortunately 
because of the sheer time needed, I don't know yet for sure how reproducible it
actually is.

1) Install RHSC 2.0 following the documentation, make sure you have few nodes
   ready to be accepted later.
2) Let the cluster just run idle for 2 days.
3) Accept all nodes.
4) Start Create Cluster task to create cluster.
5) Let the cluster run idle for another day or 2.

Note that this may not be the most simple reproducer, as mkudlej hit this issue
without even creating a cluster.

Actual Results
==============

Skyring service is not running:

~~~
# systemctl status skyring
● skyring.service - Skyring Daemon
   Loaded: loaded (/usr/lib/systemd/system/skyring.service; enabled; vendor preset: disabled)
   Active: activating (auto-restart) (Result: exit-code) since Tue 2016-06-07 13:18:47 CEST; 8s ago
     Docs: man:skyring(8)
  Process: 21003 ExecStart=/usr/bin/skyring (code=exited, status=1/FAILURE)
 Main PID: 21003 (code=exited, status=1/FAILURE)

Jun 07 13:18:47 mbukatov-usm1-server.os1.phx2.redhat.com systemd[1]: Unit skyring.service entered failed state.
Jun 07 13:18:47 mbukatov-usm1-server.os1.phx2.redhat.com systemd[1]: skyring.service failed.
~~~

Based on my script which was monitoring the number of skyring/mongodb
connections, the crash happened between 22:07 and 22:08:

~~~
2016-06-06T22:05 28
2016-06-06T22:06 28
2016-06-06T22:07 28
2016-06-06T22:08 0
2016-06-06T22:09 0
2016-06-06T22:10 0
~~~

But there is *nothing* in skyring.log file from this period:

~~~
2016-06-06T22:03:01+0000 INFO     saltwrapper.py:30 saltwrapper.wrapper] args=(<salt.client.LocalClient object at 0x3d21590>, ('mbukatov-usm1-node3.os1.phx2.redhat.com', 'mbukatov-usm1-mon1.os1.phx2.redhat.com', 'mbukatov-usm1-node1.os1.phx2.redhat.com', 'mbukatov-usm1-node2.os1.phx2.redhat.com', 'mbukatov-usm1-node4.os1.phx2.redhat.com', 'mbukatov-usm2-mon3.os1.phx2.redhat.com', 'mbukatov-usm2-mon1.os1.phx2.redhat.com', 'mbukatov-usm2-node5.os1.phx2.redhat.com', 'mbukatov-usm2-node1.os1.phx2.redhat.com', 'mbukatov-usm2-node6.os1.phx2.redhat.com', 'mbukatov-usm2-mon2.os1.phx2.redhat.com', 'mbukatov-usm1-mon2.os1.phx2.redhat.com', 'mbukatov-usm2-node2.os1.phx2.redhat.com', 'mbukatov-usm2-node3.os1.phx2.redhat.com', 'mbukatov-usm2-node4.os1.phx2.redhat.com', 'mbukatov-usm1-mon3.os1.phx2.redhat.com'), 'collectd.getSingleValuedMetricsFromCollectd', ['network']), kwargs={'expr_form': 'list'}
2016-06-06T22:06:19.616+02:00 WARNING  monitoring.go:262 FetchStatFromGraphite] skyring:b0ad4a1a-b868-42af-b808-9d0d94cc9c0f - Error The value 0 is not a number.Err strconv.ParseFloat: parsing "": invalid syntax
2016-06-06T22:06:19.654+02:00 WARNING  monitoring.go:262 FetchStatFromGraphite] skyring:b0ad4a1a-b868-42af-b808-9d0d94cc9c0f - Error Failed to get the instant stat of disk-read resource of alpha
2016-06-06T22:06:19.664+02:00 WARNING  monitoring.go:262 FetchStatFromGraphite] skyring:b0ad4a1a-b868-42af-b808-9d0d94cc9c0f - Error Failed to get the instant stat of disk-write resource of alpha
2016-06-06T22:08:00.84+02:00 WARNING  sync.go:258 SyncNodeUtilizations] Failed to sync network utilization.Error Request timed out
2016-06-06T22:08:00.878+02:00 WARNING  sync.go:356 SyncNodeUtilizations] Failed to get swap usage percentage from node mbukatov-usm1-node3.os1.phx2.redhat.com.err value=nan
~~~

In my case, the timeline is:

 * 2016-06-03T19:15 skyring were installed and started
   (by our jenkins managed deployment automation)
 * 2016-06-06T15:26 started task to create cluster named 'aplha', which were
   created with success some time later
 * 2016-06-06T22:08 skyring service crashed (and is not running since then)

Comment 6 Martin Bukatovic 2016-06-08 09:14:31 UTC
After checking my cluster today, I noticed skyring is not running again:

This time:

 * 2016-06-07T19:00 started skyring service again
 * 2016-06-07T22:44 skyring service crashed (and is not running since then)

in other words: this issue is clearly reproducible in our OS1 QE environment.

Comment 7 Martin Bukatovic 2016-06-09 14:01:42 UTC
(In reply to Martin Bukatovic from comment #6)
> After checking my cluster today, I noticed skyring is not running again:
> 
> This time:
> 
>  * 2016-06-07T19:00 started skyring service again
>  * 2016-06-07T22:44 skyring service crashed (and is not running since then)
> 
> in other words: this issue is clearly reproducible in our OS1 QE environment.

That said, I'm not able to reproduce this on our libvirt machines.

Comment 8 Shubhendu Tripathi 2016-07-04 12:10:29 UTC
Moving this to ON_QA as with new build and various generic fixes gone in for nil checks etc, we dont see any crash..

I happens again, feel free to re-open :)

Comment 10 Daniel Horák 2016-07-20 08:27:13 UTC
We didn't experienced this issue for the last weeks neither during  manual or automated testing, so closing this Bug as VERIFIED.

Last time tested on:
  rhscon-ceph-0.0.33-1.el7scon.x86_64
  rhscon-core-selinux-0.0.34-1.el7scon.noarch
  rhscon-core-0.0.34-1.el7scon.x86_64
  rhscon-ui-0.0.48-1.el7scon.noarch

>> VERIFIED

Comment 12 errata-xmlrpc 2016-08-23 19:55:03 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2016:1754