Mike, this recovery is not required. But then Calamari won't be able to get updates from the cluster until the network is up, since it needs salt to fetch them. I suppose the correct solution is for Calamari to wait for the network.
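As an illustration of "waiting" rather than failing fast, here is a minimal sketch (not cthulhu's actual code; the function and parameter names are hypothetical) of retrying a database connection with bounded exponential backoff instead of exiting after a handful of attempts in ~8 seconds:

```python
import time

def wait_for(connect, attempts=10, initial_delay=1.0, max_delay=30.0):
    """Retry `connect` with exponential backoff instead of failing fast.

    `connect` is any callable that raises on failure (e.g. opening a
    PostgreSQL connection). This is an illustrative sketch, not cthulhu's
    real API.
    """
    delay = initial_delay
    for attempt in range(1, attempts + 1):
        try:
            return connect()
        except Exception as exc:
            if attempt == attempts:
                # Out of attempts: surface the last error to the caller.
                raise
            print("attempt %d failed (%s); retrying in %.1fs"
                  % (attempt, exc, delay))
            time.sleep(delay)
            delay = min(delay * 2, max_delay)
```

With the delays in the reproducer (postgresql up ~20 seconds after cthulhu), a backoff like this would have kept cthulhu alive long enough for the database to appear, without changing any systemd ordering.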
Red Hat KCS created: https://access.redhat.com/solutions/2442901
Updated Description to make bug public.

Description of problem:
Launching Calamari after a restart of the admin/calamari node produces Server Error (500).
SELinux = disabled
All required firewall ports are open.

It was found that cthulhu was actually in a failed state, and a restart of cthulhu restores the connection to Calamari.

Preliminary analysis:
cthulhu starts before the database is ready and enters a failed state.

/var/log/calamari/cthulhu.log:

OperationalError: (OperationalError) could not connect to server: Connection refused
        Is the server running on host "localhost" (127.0.0.1) and accepting
        TCP/IP connections on port 5432?
could not create socket: Address family not supported by protocol  -> this is caused by ipv6 being disabled
 None None
2016-03-08 09:05:24,720 - ERROR - cthulhu Recovery failed

[root@#### ~]# supervisorctl
carbon-cache                     RUNNING   pid 1010, uptime 0:06:14
cthulhu                          FATAL     Exited too quickly (process log may have details)

cthulhu appears to retry 4 times within 8 seconds:

[root@####]# cat /var/log/calamari/cthulhu.log | grep ERROR
2016-03-08 08:49:40,252 - ERROR - cthulhu Recovery failed
2016-03-08 08:49:41,681 - ERROR - cthulhu Recovery failed
2016-03-08 08:49:44,116 - ERROR - cthulhu Recovery failed
2016-03-08 08:49:48,421 - ERROR - cthulhu Recovery failed
2016-03-08 09:05:17,430 - ERROR - cthulhu Recovery failed
2016-03-08 09:05:18,860 - ERROR - cthulhu Recovery failed
2016-03-08 09:05:21,288 - ERROR - cthulhu Recovery failed
2016-03-08 09:05:24,720 - ERROR - cthulhu Recovery failed
2016-03-08 09:17:22,238 - ERROR - cthulhu Recovery failed
2016-03-08 09:17:23,665 - ERROR - cthulhu Recovery failed
2016-03-08 09:17:26,090 - ERROR - cthulhu Recovery failed
2016-03-08 09:17:29,635 - ERROR - cthulhu Recovery failed

I believe postgresql started only at 09:17:49, about 20 seconds later than cthulhu:

[root@evecm01 etc]# ps -ef | grep post
postgres  1471     1  0 09:17 ?        00:00:00 /usr/bin/postgres -D /var/lib/pgsql/data -p 5432
postgres  1808  1471  0 09:17 ?        00:00:00 postgres: logger process
postgres  1894  1471  0 09:17 ?        00:00:00 postgres: checkpointer process
postgres  1896  1471  0 09:17 ?        00:00:00 postgres: writer process
postgres  1897  1471  0 09:17 ?        00:00:00 postgres: wal writer process
postgres  1898  1471  0 09:17 ?        00:00:00 postgres: autovacuum launcher process
postgres  1899  1471  0 09:17 ?        00:00:00 postgres: stats collector process
root      2172     1  0 09:17 ?        00:00:00 /usr/libexec/postfix/master -w
postfix   2265  2172  0 09:17 ?        00:00:00 pickup -l -t unix -u
postfix   2266  2172  0 09:17 ?        00:00:00 qmgr -l -t unix -u
root      7447  3336  0 09:27 pts/0    00:00:00 grep --color=auto post

[root@####]# stat /proc/1471/stat
  File: ‘/proc/1471/stat’
  Size: 0           Blocks: 0          IO Block: 1024   regular empty file
Device: 3h/3d       Inode: 24741      Links: 1
Access: (0444/-r--r--r--)  Uid: (   26/postgres)   Gid: (   26/postgres)
Access: 2016-03-08 09:17:49.484641330 +0100
Modify: 2016-03-08 09:17:49.484641330 +0100
Change: 2016-03-08 09:17:49.484641330 +0100

As the postgresql service waits for the network, network bringup time might be related here:

[Unit]
Description=PostgreSQL database server
After=network.target

Perhaps cthulhu should also wait for network bringup, but it is managed by supervisord, so I am not sure whether that would have side effects for the other supervisor-managed applications (besides calamari):

[root@evecm01 etc]# cat ./systemd/system/multi-user.target.wants/supervisord.service
[Unit]
Description=Process Monitoring and Control Daemon
After=rc-local.service

Version-Release number of selected component (if applicable):
RHCS 1.3.1
RHEL 7.1

How reproducible:
Customer can reproduce with every reboot. I can reproduce intermittently, not with every reboot.
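For reference, one possible ordering workaround (an assumption on my part, not a shipped or supported fix) would be a systemd drop-in that delays supervisord, and therefore cthulhu, until the network is online and PostgreSQL has started. This changes startup ordering for everything supervisord manages, which is the side effect mentioned above:

```ini
# /etc/systemd/system/supervisord.service.d/wait-for-postgres.conf
# Hypothetical drop-in (untested suggestion): order supervisord after
# network bringup and the postgresql unit, then `systemctl daemon-reload`.
[Unit]
After=network-online.target postgresql.service
Wants=network-online.target
```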