1299005 – skyring crashes when /tmp/.skyring-event socket exists

Bug 1299005 - skyring crashes when /tmp/.skyring-event socket exists

Summary: skyring crashes when /tmp/.skyring-event socket exists

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	Red Hat Storage Console
Classification:	Red Hat Storage
Component:	core
Sub Component:
Version:	2
Hardware:	Unspecified
OS:	Unspecified
Priority:	urgent
Severity:	high
Target Milestone:	beta
Target Release:	2
Assignee:	Shubhendu Tripathi
QA Contact:	Martin Bukatovic
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:
TreeView+	depends on / blocked

Reported:	2016-01-15 16:38 UTC by Martin Bukatovic
Modified:	2016-08-23 19:45 UTC (History)
CC List:	5 users (show)
Fixed In Version:	rhscon-ceph-0.0.23-1.el7scon.x86_64, rhscon-core-0.0.24-1.el7scon.x86_64, rhscon-ui-0.0.39-1.el7scon.noarch
Doc Type:	Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed:	2016-08-23 19:45:54 UTC
Embargoed:

Attachments	(Terms of Use)
full skyring.log file (3.26 KB, text/plain) 2016-01-15 16:38 UTC, Martin Bukatovic	no flags	Details
View All

Links
System	ID	Priority	Status	Summary	Last Updated
Gerrithub.io	257240	None	None	None	2016-01-21 18:39:56 UTC
Gerrithub.io	260541	None	None	None	2016-01-21 18:41:27 UTC
Red Hat Bugzilla	1296123	unspecified	CLOSED	skyring binary is not service	2021-02-22 00:41:40 UTC
Red Hat Bugzilla	1336357	unspecified	CLOSED	Skyring stops unexpectedly	2021-02-22 00:41:40 UTC
Red Hat Product Errata	RHEA-2016:1754	normal	SHIPPED_LIVE	New packages: Red Hat Storage Console 2.0	2017-04-18 19:09:06 UTC

Internal Links: 1296123 1336357

Description Martin Bukatovic 2016-01-15 16:38:04 UTC

Created attachment 1115191 [details]
full skyring.log file

Description of problem
======================

Skyring crashes when /tmp/.skyring-event unix socket file exists.

Null pointer crash of skyring process is not a good way to handle this
kind of error state.

Moreover socket handling would need to be enhanced/redesigned to meet
systemd requirements anyway, see BZ 1296123.

Version-Release number of selected component
============================================

# rpm -qa | grep rhscon
rhscon-ui-0.0.6-0.1.alpha1.el7.noarch
rhscon-core-0.0.6-0.1.alpha1.el7.x86_64
rhscon-ceph-0.0.4-0.1.alpha1.el7.x86_64

How reproducible
================

100%

Steps to Reproduce
==================

1. Install skyring on the server
2. Make sure unix socket file /tmp/.skyring-event exists,
   this could happen when previous process crashes, or one can create the
   socket by hand like this:

~~~
# socat UNIX-LISTEN:/tmp/.skyring-event -
^C
# ls -l /tmp/.skyring-event
srwxr-xr-x. 1 root root 0 Jan 15 17:23 /tmp/.skyring-event
# lsof /tmp/.skyring-event
#
~~~

3. Make sure skyring doesn't have other reason to crash
   (eg. check that mongodb is running).
4. Start skyring: `setsid skyring > ~/skyring.log 2>&1`
5. After a while, check /root/skyring.log file

note: in step 4, I start skyring daemon process manually because of BZ 1296123

Actual results
==============

Skyring process crashes. Also see this excerpt from skyring.log:

~~~
2016-01-15T16:41:56.055+01:00 INFO     main.go:177 start] Starting event listener
2016-01-15T16:41:56.057+01:00 CRITICAL listener.go:134 StartListener] listen error:listen unix /tmp/.skyring-event: bind: address already in use
panic: runtime error: invalid memory address or nil pointer dereference
[signal 0xb code=0x1 addr=0x0 pc=0x4849b4]
~~~

(full skyring.log file is attached to this BZ)

Expected results
================

Skyring process reports the issue in the log properly without crashing 
(and then stops if there is no other way).

Comment 2 Martin Bukatovic 2016-05-17 15:04:01 UTC

This is still a valid issue.

The socket has been moved into /var/run/.skyring-event and skyring service
has a proper systemd unit file now, but the issue remains:

~~~
# rpm -q rhscon-core
rhscon-core-0.0.17-1.el7scon.x86_64
# ls -l /var/run/.skyring-event
srwxr-xr-x. 1 root root 0 May 16 07:21 /var/run/.skyring-event
# journalctl -u skyring --since "11:22:50"
-- Logs begin at Tue 2016-05-10 10:36:54 CEST, end at Tue 2016-05-17 17:01:01 CEST. --
May 17 11:22:50 dhcp46-190.example.com systemd[1]: skyring.service holdoff time over, scheduling restart.
May 17 11:22:50 dhcp46-190.example.com systemd[1]: Started Skyring Daemon.
May 17 11:22:50 dhcp46-190.example.com systemd[1]: Starting Skyring Daemon...
May 17 11:22:51 dhcp46-190.example.com skyring[27755]: 2016-05-17T11:22:51.341+02:00 ERROR    auth.go:197 AddUser] Username: admin already exists
May 17 11:22:51 dhcp46-190.example.com skyring[27755]: 2016-05-17T11:22:51.341+02:00 ERROR    auth.go:41 AddDefaultUser] Unable to create default User:user already exists
May 17 11:22:51 dhcp46-190.example.com skyring[27755]: 2016-05-17T11:22:51.344+02:00 INFO     main.go:214 start] Starting the providers
May 17 11:22:51 dhcp46-190.example.com skyring[27755]: 2016-05-17T11:22:51.345+02:00 DEBUG    provider-conf.go:57 LoadProviderConfig] File Name: ceph.conf
May 17 11:22:51 dhcp46-190.example.com skyring[27755]: 2016-05-17T11:22:51.345+02:00 DEBUG    provider-conf.go:57 LoadProviderConfig] File Name: ceph.dat
May 17 11:22:51 dhcp46-190.example.com skyring[27755]: 2016-05-17T11:22:51.345+02:00 DEBUG    provider-conf.go:57 LoadProviderConfig] File Name: ceph.evt
May 17 11:22:51 dhcp46-190.example.com skyring[27755]: 2016-05-17T11:22:51.345+02:00 DEBUG    provider-conf.go:86 LoadProviderConfig] Collection: [{{ceph bigfin 10.1} [] {ceph-installer  true false}}]
May 17 11:22:51 dhcp46-190.example.com skyring[27755]: 2016-05-17T11:22:51.345+02:00 INFO     skyring.go:98 StartProviders] Config:%!(EXTRA []conf.ProviderInfo=[{{ceph bigfin 10.1} [] {ceph-installer  true false}}])
May 17 11:22:51 dhcp46-190.example.com skyring[27755]: 2016-05-17T11:22:51.345+02:00 DEBUG    skyring.go:101 StartProviders] Config:%!(EXTRA conf.ProviderInfo={{ceph bigfin 10.1} [] {ceph-installer  true false}})
May 17 11:22:51 dhcp46-190.example.com skyring[27755]: 2016-05-17T11:22:51.348+02:00 INFO     main.go:221 start] Starting event listener
May 17 11:22:51 dhcp46-190.example.com skyring[27755]: 2016-05-17T11:22:51.357+02:00 CRITICAL listener.go:146 StartListener] listen error:listen unix /var/run/.skyring-event: bind: address already in use
May 17 11:22:51 dhcp46-190.example.com systemd[1]: skyring.service: main process exited, code=exited, status=1/FAILURE
May 17 11:22:51 dhcp46-190.example.com systemd[1]: Unit skyring.service entered failed state.
May 17 11:22:51 dhcp46-190.example.com systemd[1]: skyring.service failed.
May 17 11:22:56 dhcp46-190.example.com systemd[1]: Stopped Skyring Daemon.
~~~

Comment 3 Martin Bukatovic 2016-05-17 15:17:06 UTC

I'm sorry, actually, it seems that skyring didn't crash, but just stopped
as it should in this case.

That said, we may need to check if it makes sense to use systemd to track
the socket so that the socket would not be forgotten in a case of crash
(which is usually the reason why there is some socket file left) and create
another BZ for it.

Moving the BZ state back.

Comment 4 Martin Bukatovic 2016-07-01 17:09:07 UTC

Using rhscon-core-0.0.28-1.el7scon.x86_64

~~~
# systemctl stop skyring
# socat UNIX-LISTEN:/var/run/.skyring-event -
^C
# ls -l /var/run/.skyring-event
srwxr-xr-x. 1 root root 0 Jul  1 18:59 /var/run/.skyring-event
# systemctl start skyring
# systemctl status skyring
● skyring.service - Skyring Daemon
   Loaded: loaded (/usr/lib/systemd/system/skyring.service; enabled; vendor preset: disabled)
   Active: failed (Result: exit-code) since Fri 2016-07-01 19:01:48 CEST; 2s ago
     Docs: man:skyring(8)
  Process: 327 ExecStart=/usr/bin/skyring $OPTIONS (code=exited, status=1/FAILURE)
 Main PID: 327 (code=exited, status=1/FAILURE)

Jul 01 19:01:46 dhcp-126-78.lab.eng.brq.redhat.com systemd[1]: Started Skyring Daemon.
Jul 01 19:01:46 dhcp-126-78.lab.eng.brq.redhat.com systemd[1]: Starting Skyring Daemon...
Jul 01 19:01:48 dhcp-126-78.lab.eng.brq.redhat.com skyring[327]: 2016-07-01T19:01:48.211+02:00 INFO     main.go:214 start] Starting the providers
Jul 01 19:01:48 dhcp-126-78.lab.eng.brq.redhat.com skyring[327]: 2016-07-01T19:01:48.212+02:00 INFO     skyring.go:98 StartProviders] Config:%!(EXTR...r:3]}])
Jul 01 19:01:48 dhcp-126-78.lab.eng.brq.redhat.com skyring[327]: 2016-07-01T19:01:48.216+02:00 INFO     main.go:221 start] Starting event listener
Jul 01 19:01:48 dhcp-126-78.lab.eng.brq.redhat.com skyring[327]: 2016-07-01T19:01:48.217+02:00 CRITICAL listener.go:146 StartListener] listen error:... in use
Jul 01 19:01:48 dhcp-126-78.lab.eng.brq.redhat.com systemd[1]: skyring.service: main process exited, code=exited, status=1/FAILURE
Jul 01 19:01:48 dhcp-126-78.lab.eng.brq.redhat.com systemd[1]: Unit skyring.service entered failed state.
Jul 01 19:01:48 dhcp-126-78.lab.eng.brq.redhat.com systemd[1]: skyring.service failed.
Hint: Some lines were ellipsized, use -l to show in full.
~~~

To sum this up: skyring fails to start, but no longer crashes.

Checking the logs in more details:

~~~
2016-07-01T19:01:48.211+02:00 INFO     main.go:214 start] Starting the providers
2016-07-01T19:01:48.212+02:00 INFO     skyring.go:98 StartProviders] Config:%!(EXTRA []conf.ProviderInfo=[{{ceph bigfin 10.1} [] {ceph-installer  true false} map[max_metadata_on_ssd:4 min_monitors_in_cluster:3]}])
2016-07-01T19:01:48.216+02:00 INFO     main.go:221 start] Starting event listener
2016-07-01T19:01:48.217+02:00 CRITICAL listener.go:146 StartListener] listen error:listen unix /var/run/.skyring-event: bind: address already in use
~~~

We see that error about existing socket file is reported properly.

That said, the start/stop of the daemon is not logged (but there is another
bugzilla for that: BZ 1343651)

>> VERIFIED

Comment 7 errata-xmlrpc 2016-08-23 19:45:54 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2016:1754

Note You need to log in before you can comment on or make changes to this bug.