Bug 1321324 - Launching Calamari web UI after a restart of admin/calamari node produces Server Error (500)
Keywords:
Status: CLOSED WONTFIX
Alias: None
Product: Red Hat Ceph Storage
Classification: Red Hat
Component: Calamari
Version: 1.3.2
Hardware: Unspecified
OS: Linux
Priority: unspecified
Severity: low
Target Milestone: rc
: 1.3.4
Assignee: Boris Ranto
QA Contact: ceph-qe-bugs
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2016-03-25 13:39 UTC by Mike Hackett
Modified: 2019-12-16 05:34 UTC
4 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2018-02-20 20:56:57 UTC
Target Upstream Version:


Attachments


Links
System ID Private Priority Status Summary Last Updated
Red Hat Knowledge Base (Article) 2442901 0 None None None 2016-07-13 19:50:52 UTC

Comment 2 Christina Meno 2016-04-08 18:10:12 UTC
Mike,

This recovery is not required. But calamari won't be able to get updates from the cluster until the network is up, since it needs salt to fetch them.

I suppose the correct solution is for calamari to wait for network.

Comment 4 Mike Hackett 2016-07-13 19:22:05 UTC
Red Hat KCS created: https://access.redhat.com/solutions/2442901

Comment 5 Mike Hackett 2016-07-13 20:08:39 UTC
Updated Description to make bug public

Description of problem:

Launching Calamari after a restart of admin/calamari node produces Server Error (500). 

SElinux = disabled
All required firewall ports are open.

It was found that cthulhu was actually in a failed state, and a restart of cthulhu restores the connection to Calamari. Preliminary analysis suggests that cthulhu starts before the database is ready and enters a failed state.

/var/log/calamari/cthulhu.log:

OperationalError: (OperationalError) could not connect to server: Connection refused
        Is the server running on host "localhost" (127.0.0.1) and accepting
        TCP/IP connections on port 5432?
could not create socket: Address family not supported by protocol -> this is caused by IPv6 being disabled
2016-03-08 09:05:24,720 - ERROR - cthulhu Recovery failed
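The "Connection refused" in the traceback is simply a failed TCP connect to PostgreSQL's port before the server is listening. A minimal sketch (a hypothetical helper, not cthulhu's actual code) of the check that fails here:

```python
import socket

def postgres_reachable(host="127.0.0.1", port=5432, timeout=2.0):
    """Return True if something is accepting TCP connections on host:port.

    This mimics what the database driver underneath cthulhu does at
    connect time; before postgres is up, connect() fails with
    ECONNREFUSED, which surfaces as the OperationalError above.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:  # refused, unreachable, or timed out
        return False
```

Running this in a loop at boot would show exactly the window during which cthulhu's startup attempts are doomed to fail.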

[root@#### ~]# supervisorctl 
carbon-cache                     RUNNING    pid 1010, uptime 0:06:14
cthulhu                          FATAL      Exited too quickly (process log may have details)

Cthulhu seems to retry 4 times within 8 seconds:

[root@####]# cat /var/log/calamari/cthulhu.log|grep ERROR
2016-03-08 08:49:40,252 - ERROR - cthulhu Recovery failed
2016-03-08 08:49:41,681 - ERROR - cthulhu Recovery failed
2016-03-08 08:49:44,116 - ERROR - cthulhu Recovery failed
2016-03-08 08:49:48,421 - ERROR - cthulhu Recovery failed
2016-03-08 09:05:17,430 - ERROR - cthulhu Recovery failed
2016-03-08 09:05:18,860 - ERROR - cthulhu Recovery failed
2016-03-08 09:05:21,288 - ERROR - cthulhu Recovery failed
2016-03-08 09:05:24,720 - ERROR - cthulhu Recovery failed
2016-03-08 09:17:22,238 - ERROR - cthulhu Recovery failed
2016-03-08 09:17:23,665 - ERROR - cthulhu Recovery failed
2016-03-08 09:17:26,090 - ERROR - cthulhu Recovery failed
2016-03-08 09:17:29,635 - ERROR - cthulhu Recovery failed
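The timestamps show roughly exponential gaps (~1.4 s, 2.4 s, 4.3 s) before cthulhu gives up and supervisord marks it FATAL. A hypothetical retry wrapper with a longer backoff schedule, sketching what would be needed to ride out a slow postgres start (names and parameters are illustrative, not cthulhu's API):

```python
import time

def retry_with_backoff(fn, attempts=8, first_delay=1.5, factor=1.7):
    """Call fn() until it succeeds or attempts are exhausted.

    Hypothetical sketch: delays grow geometrically (1.5 s, 2.55 s, ...),
    so eight attempts span well over a minute -- enough to cover the
    ~20 s gap between cthulhu and postgres seen in this report.
    """
    delay = first_delay
    for attempt in range(1, attempts + 1):
        try:
            return fn()
        except Exception:
            if attempt == attempts:
                raise  # out of attempts; let the caller see the error
            time.sleep(delay)
            delay *= factor
```

With the observed 4-tries-in-8-seconds schedule, any database start more than ~8 seconds behind cthulhu guarantees the FATAL state.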


I believe postgresql started only at 09:17:49, which is about 20 seconds later than cthulhu.

[root@evecm01 etc]# ps -ef|grep post
postgres  1471     1  0 09:17 ?        00:00:00 /usr/bin/postgres -D /var/lib/pgsql/data -p 5432
postgres  1808  1471  0 09:17 ?        00:00:00 postgres: logger process   
postgres  1894  1471  0 09:17 ?        00:00:00 postgres: checkpointer process   
postgres  1896  1471  0 09:17 ?        00:00:00 postgres: writer process   
postgres  1897  1471  0 09:17 ?        00:00:00 postgres: wal writer process   
postgres  1898  1471  0 09:17 ?        00:00:00 postgres: autovacuum launcher process   
postgres  1899  1471  0 09:17 ?        00:00:00 postgres: stats collector process   
root      2172     1  0 09:17 ?        00:00:00 /usr/libexec/postfix/master -w
postfix   2265  2172  0 09:17 ?        00:00:00 pickup -l -t unix -u
postfix   2266  2172  0 09:17 ?        00:00:00 qmgr -l -t unix -u
root      7447  3336  0 09:27 pts/0    00:00:00 grep --color=auto post
[root@####]# stat /proc/1471/stat
  File: ‘/proc/1471/stat’
  Size: 0               Blocks: 0          IO Block: 1024   regular empty file
Device: 3h/3d   Inode: 24741       Links: 1
Access: (0444/-r--r--r--)  Uid: (   26/postgres)   Gid: (   26/postgres)
Access: 2016-03-08 09:17:49.484641330 +0100
Modify: 2016-03-08 09:17:49.484641330 +0100
Change: 2016-03-08 09:17:49.484641330 +0100

As the postgresql service waits only for the network, network bringup time might be a factor here...
[Unit]
Description=PostgreSQL database server
After=network.target 

Perhaps cthulhu should also wait for network bringup, but it is managed by supervisord, so I am not sure whether that would have side effects for other supervisor applications (besides calamari).
[root@evecm01 etc]# cat ./systemd/system/multi-user.target.wants/supervisord.service 
[Unit]
Description=Process Monitoring and Control Daemon
After=rc-local.service
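Note that network.target does not guarantee the network is actually configured; systemd's convention for services that need connectivity is network-online.target plus a matching Wants=. Since cthulhu runs under supervisord, one option (a sketch only, not a shipped fix; paths as on RHEL 7) would be a drop-in for supervisord.service:

```ini
# /etc/systemd/system/supervisord.service.d/wait-online.conf
# Hypothetical drop-in: delay supervisord (and thus cthulhu) until the
# network is up and postgresql has started, closing the race described above.
[Unit]
Wants=network-online.target
After=network-online.target postgresql.service
```

After `systemctl daemon-reload`, supervisord would only start once the network is online and postgresql.service has been started, so cthulhu's first connection attempt should find the database listening.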

Version-Release number of selected component (if applicable):
RHCS 1.3.1
RHEL 7.1

How reproducible:
Customer can reproduce with every reboot.
I can reproduce intermittently, not with every reboot.

