Bug 1343104
Summary: skyring crashes without relevant info in logs
Product: [Red Hat Storage] Red Hat Storage Console
Component: core
Core sub component: provisioning
Reporter: Martin Kudlej <mkudlej>
Assignee: Nishanth Thomas <nthomas>
QA Contact: Daniel Horák <dahorak>
CC: dahorak, mbukatov, shtripat, vsarmila
Status: CLOSED ERRATA
Severity: high
Priority: medium
Version: 2
Target Milestone: ---
Target Release: 2
Hardware: Unspecified
OS: Unspecified
Fixed In Version: rhscon-core-0.0.30-1.el7scon
Doc Type: If docs needed, set a value
Story Points: ---
Last Closed: 2016-08-23 19:55:03 UTC
Type: Bug
Regression: ---
Bug Blocks: 1344195
Description
===========

Martin Kudlej, 2016-06-06 13:49:53 UTC

Created attachment 1165217 [details]
logs from server
Created attachment 1165583 [details]
failed Create Cluster task with "Last update time" before "Start time".
I probably hit the same issue on a larger cluster (3 monitors and 40 OSD nodes) during cluster creation. This leads to a quite strange situation: the task's "Last update time" is before its "Start time" (as you can see in attachment 1165583 [details]). There seems to be nothing in the logs related to the crash, similar to what is described in Comment 0.

I noticed this issue today on a cluster which I left running idle (after creating a cluster there) to track mongodb connections:

1) When I was tracking mongodb connections for BZ 1341225 with an older version of skyring (rhscon-core-0.0.19-1.el7scon.x86_64), I didn't hit this issue (skyring didn't crash), even though skyring had been running for about 4 days.
2) With the new version of skyring, there are multiple occurrences of this BZ (mkudlej's, dahorak's and mine), while skyring's uptime was shorter than 4 days.

Based on these facts, I think the issue is real and may have gotten worse, or been introduced, in the current build rhscon-core-0.0.21-1.el7scon.x86_64. That said, reproducing it would require letting skyring run for a few days.

Steps To Reproduce
==================

Here are the exact steps I took with the cluster before the crash. Unfortunately, because of the sheer time needed, I don't yet know for sure how reproducible it actually is.

1) Install RHSC 2.0 following the documentation; make sure you have a few nodes ready to be accepted later.
2) Let the cluster just run idle for 2 days.
3) Accept all nodes.
4) Start the Create Cluster task to create a cluster.
5) Let the cluster run idle for another day or two.

Note that this may not be the simplest reproducer, as mkudlej hit this issue without even creating a cluster.
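Since the crash window otherwise has to be inferred after the fact, a small watcher can pin it down while the cluster idles. This is a hypothetical sketch (not the script QE actually used): it polls the unit's systemd state once a minute and prints a timestamp the moment skyring is no longer active, so the window can be correlated with skyring.log later.

```shell
#!/bin/sh
# Hypothetical watcher sketch: poll a unit's systemd state once a minute
# and print a timestamped line as soon as it is no longer active.
watch_unit() {
    unit="$1"
    # Loop for as long as systemd reports the unit as active.
    while systemctl is-active --quiet "$unit" 2>/dev/null; do
        sleep 60
    done
    # Record when the unit stopped being active (minute resolution).
    printf '%s %s no longer active\n' "$(date '+%Y-%m-%dT%H:%M')" "$unit"
}

watch_unit skyring
```

On a machine where the unit is running, this blocks until the crash; run it under `nohup` or in a tmux session so it survives the SSH session.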
Actual Results
==============

Skyring service is not running:

~~~
# systemctl status skyring
● skyring.service - Skyring Daemon
   Loaded: loaded (/usr/lib/systemd/system/skyring.service; enabled; vendor preset: disabled)
   Active: activating (auto-restart) (Result: exit-code) since Tue 2016-06-07 13:18:47 CEST; 8s ago
     Docs: man:skyring(8)
  Process: 21003 ExecStart=/usr/bin/skyring (code=exited, status=1/FAILURE)
 Main PID: 21003 (code=exited, status=1/FAILURE)

Jun 07 13:18:47 mbukatov-usm1-server.os1.phx2.redhat.com systemd[1]: Unit skyring.service entered failed state.
Jun 07 13:18:47 mbukatov-usm1-server.os1.phx2.redhat.com systemd[1]: skyring.service failed.
~~~

Based on my script, which was monitoring the number of skyring/mongodb connections, the crash happened between 22:07 and 22:08:

~~~
2016-06-06T22:05 28
2016-06-06T22:06 28
2016-06-06T22:07 28
2016-06-06T22:08 0
2016-06-06T22:09 0
2016-06-06T22:10 0
~~~

But there is *nothing* in the skyring.log file from this period:

~~~
2016-06-06T22:03:01+0000 INFO saltwrapper.py:30 saltwrapper.wrapper] args=(<salt.client.LocalClient object at 0x3d21590>, ('mbukatov-usm1-node3.os1.phx2.redhat.com', 'mbukatov-usm1-mon1.os1.phx2.redhat.com', 'mbukatov-usm1-node1.os1.phx2.redhat.com', 'mbukatov-usm1-node2.os1.phx2.redhat.com', 'mbukatov-usm1-node4.os1.phx2.redhat.com', 'mbukatov-usm2-mon3.os1.phx2.redhat.com', 'mbukatov-usm2-mon1.os1.phx2.redhat.com', 'mbukatov-usm2-node5.os1.phx2.redhat.com', 'mbukatov-usm2-node1.os1.phx2.redhat.com', 'mbukatov-usm2-node6.os1.phx2.redhat.com', 'mbukatov-usm2-mon2.os1.phx2.redhat.com', 'mbukatov-usm1-mon2.os1.phx2.redhat.com', 'mbukatov-usm2-node2.os1.phx2.redhat.com', 'mbukatov-usm2-node3.os1.phx2.redhat.com', 'mbukatov-usm2-node4.os1.phx2.redhat.com', 'mbukatov-usm1-mon3.os1.phx2.redhat.com'), 'collectd.getSingleValuedMetricsFromCollectd', ['network']), kwargs={'expr_form': 'list'}
2016-06-06T22:06:19.616+02:00 WARNING monitoring.go:262 FetchStatFromGraphite] skyring:b0ad4a1a-b868-42af-b808-9d0d94cc9c0f - Error The value 0 is not a number.Err strconv.ParseFloat: parsing "": invalid syntax
2016-06-06T22:06:19.654+02:00 WARNING monitoring.go:262 FetchStatFromGraphite] skyring:b0ad4a1a-b868-42af-b808-9d0d94cc9c0f - Error Failed to get the instant stat of disk-read resource of alpha
2016-06-06T22:06:19.664+02:00 WARNING monitoring.go:262 FetchStatFromGraphite] skyring:b0ad4a1a-b868-42af-b808-9d0d94cc9c0f - Error Failed to get the instant stat of disk-write resource of alpha
2016-06-06T22:08:00.84+02:00 WARNING sync.go:258 SyncNodeUtilizations] Failed to sync network utilization.Error Request timed out
2016-06-06T22:08:00.878+02:00 WARNING sync.go:356 SyncNodeUtilizations] Failed to get swap usage percentage from node mbukatov-usm1-node3.os1.phx2.redhat.com.err value=nan
~~~

In my case, the timeline is:

* 2016-06-03T19:15 skyring was installed and started (by our Jenkins-managed deployment automation)
* 2016-06-06T15:26 started the task to create a cluster named 'alpha', which was created successfully some time later
* 2016-06-06T22:08 skyring service crashed (and has not been running since)

After checking my cluster today, I noticed skyring is not running again. This time:

* 2016-06-07T19:00 started the skyring service again
* 2016-06-07T22:44 skyring service crashed (and has not been running since)

In other words: this issue is clearly reproducible in our OS1 QE environment.

(In reply to Martin Bukatovic from comment #6)
> After checking my cluster today, I noticed skyring is not running again.
>
> This time:
>
> * 2016-06-07T19:00 started the skyring service again
> * 2016-06-07T22:44 skyring service crashed (and has not been running since)
>
> In other words: this issue is clearly reproducible in our OS1 QE environment.

That said, I'm not able to reproduce this on our libvirt machines.

Moving this to ON_QA: with the new build and various generic fixes (nil checks etc.) gone in, we don't see any crash.
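The per-minute "timestamp count" lines quoted above could be produced by a monitor along these lines. This is only a sketch, assuming mongod listens on its default port 27017; the actual QE script is not attached to this bug.

```shell
#!/bin/sh
# Sketch of a connection monitor emitting "TIMESTAMP COUNT" lines like the
# samples quoted in this bug. Assumes mongod on the default port 27017.
count_mongo_conns() {
    # Count established TCP connections to/from the mongod port;
    # tail drops the ss header line, tr strips any padding from wc.
    ss -tn state established '( dport = :27017 or sport = :27017 )' 2>/dev/null \
        | tail -n +2 | wc -l | tr -d ' '
}

log_sample() {
    printf '%s %s\n' "$(date '+%Y-%m-%dT%H:%M')" "$(count_mongo_conns)"
}

# Run it once a minute, e.g.:
#   while true; do log_sample >> connections.log; sleep 60; done
log_sample
```

A sudden drop to 0 in the resulting log, as seen above, marks the crash window even when skyring.log itself is silent.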
If it happens again, feel free to re-open :)

We haven't experienced this issue during the last weeks, in either manual or automated testing, so I'm closing this bug as VERIFIED.
Last time tested on:
rhscon-ceph-0.0.33-1.el7scon.x86_64
rhscon-core-selinux-0.0.34-1.el7scon.noarch
rhscon-core-0.0.34-1.el7scon.x86_64
rhscon-ui-0.0.48-1.el7scon.noarch
>> VERIFIED
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2016:1754