Bug 1321324

Summary: Launching Calamari web UI after a restart of admin/calamari node produces Server Error (500)
Product: Red Hat Ceph Storage
Reporter: Mike Hackett <mhackett>
Component: Calamari
Assignee: Boris Ranto <branto>
Calamari sub component: Web UI
QA Contact: ceph-qe-bugs <ceph-qe-bugs>
Status: CLOSED WONTFIX
Docs Contact:
Severity: low
Priority: unspecified
CC: anharris, ceph-eng-bugs, kdreyer, vumrao
Version: 1.3.2
Target Milestone: rc
Target Release: 1.3.4
Hardware: Unspecified
OS: Linux
Whiteboard:
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2018-02-20 20:56:57 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:

Comment 2 Christina Meno 2016-04-08 18:10:12 UTC
Mike,

This recovery is not required. But then calamari won't be able to get updates from the cluster till the network is up since it'll need salt to get them.

I suppose the correct solution is for calamari to wait for network.

Comment 4 Mike Hackett 2016-07-13 19:22:05 UTC
Red Hat KCS created: https://access.redhat.com/solutions/2442901

Comment 5 Mike Hackett 2016-07-13 20:08:39 UTC
Updated Description to make bug public

Description of problem:

Launching Calamari after a restart of admin/calamari node produces Server Error (500). 

SELinux = disabled
All required firewall ports are open.

It was found that cthulhu was actually in a failed state, and restarting cthulhu resolves the connection issue to Calamari. Preliminary analysis suggests that cthulhu starts before the database is ready and
enters a failed state.
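A minimal sketch of the kind of startup guard that would avoid this race (hypothetical, not part of cthulhu itself): retry the TCP connection to the Postgres port with exponential backoff before proceeding, instead of failing permanently after a few quick attempts.

```python
import socket
import time

def wait_for_port(host, port, attempts=5, first_delay=1.0):
    """Retry a TCP connection with exponential backoff.

    Returns True once the port accepts connections, False if all
    attempts fail (mirroring cthulhu's give-up-after-a-few-tries
    behaviour, but with a configurable overall window).
    """
    delay = first_delay
    for attempt in range(attempts):
        try:
            # Succeeds as soon as something is listening on host:port.
            with socket.create_connection((host, port), timeout=2):
                return True
        except OSError:
            if attempt < attempts - 1:
                time.sleep(delay)
                delay *= 2  # 1s, 2s, 4s, ... roughly the intervals seen in the log
    return False
```

With the defaults this waits roughly 15 seconds in total; the numbers are illustrative and would need to be larger than the ~20-second gap observed between cthulhu and postgresql start times.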

/var/log/calamari/cthulhu.log:

OperationalError: (OperationalError) could not connect to server: Connection refused
        Is the server running on host "localhost" (127.0.0.1) and accepting
        TCP/IP connections on port 5432?
could not create socket: Address family not supported by protocol -> this is caused by ipv6_disabled
 None None
2016-03-08 09:05:24,720 - ERROR - cthulhu Recovery failed

[root@#### ~]# supervisorctl 
carbon-cache                     RUNNING    pid 1010, uptime 0:06:14
cthulhu                          FATAL      Exited too quickly (process log may have details)

Cthulhu seems to retry 4 times within 8 seconds:

[root@####]# cat /var/log/calamari/cthulhu.log|grep ERROR
2016-03-08 08:49:40,252 - ERROR - cthulhu Recovery failed
2016-03-08 08:49:41,681 - ERROR - cthulhu Recovery failed
2016-03-08 08:49:44,116 - ERROR - cthulhu Recovery failed
2016-03-08 08:49:48,421 - ERROR - cthulhu Recovery failed
2016-03-08 09:05:17,430 - ERROR - cthulhu Recovery failed
2016-03-08 09:05:18,860 - ERROR - cthulhu Recovery failed
2016-03-08 09:05:21,288 - ERROR - cthulhu Recovery failed
2016-03-08 09:05:24,720 - ERROR - cthulhu Recovery failed
2016-03-08 09:17:22,238 - ERROR - cthulhu Recovery failed
2016-03-08 09:17:23,665 - ERROR - cthulhu Recovery failed
2016-03-08 09:17:26,090 - ERROR - cthulhu Recovery failed
2016-03-08 09:17:29,635 - ERROR - cthulhu Recovery failed


I believe postgresql started only at 09:17:49, which is 20 seconds later than cthulhu.

[root@evecm01 etc]# ps -ef|grep post
postgres  1471     1  0 09:17 ?        00:00:00 /usr/bin/postgres -D /var/lib/pgsql/data -p 5432
postgres  1808  1471  0 09:17 ?        00:00:00 postgres: logger process   
postgres  1894  1471  0 09:17 ?        00:00:00 postgres: checkpointer process   
postgres  1896  1471  0 09:17 ?        00:00:00 postgres: writer process   
postgres  1897  1471  0 09:17 ?        00:00:00 postgres: wal writer process   
postgres  1898  1471  0 09:17 ?        00:00:00 postgres: autovacuum launcher process   
postgres  1899  1471  0 09:17 ?        00:00:00 postgres: stats collector process   
root      2172     1  0 09:17 ?        00:00:00 /usr/libexec/postfix/master -w
postfix   2265  2172  0 09:17 ?        00:00:00 pickup -l -t unix -u
postfix   2266  2172  0 09:17 ?        00:00:00 qmgr -l -t unix -u
root      7447  3336  0 09:27 pts/0    00:00:00 grep --color=auto post
[root@####c]# stat /proc/1471/sta
stat: cannot stat ‘/proc/1471/sta’: No such file or directory
[root@####]# stat /proc/1471/stat
  File: ‘/proc/1471/stat’
  Size: 0               Blocks: 0          IO Block: 1024   regular empty file
Device: 3h/3d   Inode: 24741       Links: 1
Access: (0444/-r--r--r--)  Uid: (   26/postgres)   Gid: (   26/postgres)
Access: 2016-03-08 09:17:49.484641330 +0100
Modify: 2016-03-08 09:17:49.484641330 +0100
Change: 2016-03-08 09:17:49.484641330 +0100

As the postgresql service waits for the network, network bring-up time might be a factor here...
[Unit]
Description=PostgreSQL database server
After=network.target 

Perhaps cthulhu should also wait for network bring-up,
but it is managed by supervisord, so I am not sure whether that would have side effects for the other supervisor applications (besides Calamari).
[root@evecm01 etc]# cat ./systemd/system/multi-user.target.wants/supervisord.service 
[Unit]
Description=Process Monitoring and Control Daemon
After=rc-local.service
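One possible workaround (an assumption on my part, not a shipped fix) would be a systemd drop-in that delays supervisord until the network and the local PostgreSQL instance are up, e.g. a hypothetical /etc/systemd/system/supervisord.service.d/override.conf:

```ini
# Hypothetical drop-in: delay supervisord (and therefore cthulhu)
# until the network is fully up and local PostgreSQL has started.
[Unit]
Wants=network-online.target
After=network-online.target postgresql.service
```

After creating the drop-in, `systemctl daemon-reload` picks it up. Note this delays every program supervisord manages, which is exactly the side effect concern raised above.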

Version-Release number of selected component (if applicable):
RHCS 1.3.1
RHEL 7.1

How reproducible:
Customer can reproduce with every reboot.
I can reproduce intermittently, not with every reboot.