| Summary: | Launching Calamari web UI after a restart of admin/calamari node produces Server Error (500) | | |
|---|---|---|---|
| Product: | Red Hat Ceph Storage | Reporter: | Mike Hackett <mhackett> |
| Component: | Calamari | Assignee: | Boris Ranto <branto> |
| Calamari sub component: | Web UI | QA Contact: | ceph-qe-bugs <ceph-qe-bugs> |
| Status: | CLOSED WONTFIX | Docs Contact: | |
| Severity: | low | | |
| Priority: | unspecified | CC: | anharris, ceph-eng-bugs, kdreyer, vumrao |
| Version: | 1.3.2 | | |
| Target Milestone: | rc | | |
| Target Release: | 1.3.4 | | |
| Hardware: | Unspecified | | |
| OS: | Linux | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | Bug Fix |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2018-02-20 20:56:57 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
Comment 2
Christina Meno
2016-04-08 18:10:12 UTC
Red Hat KCS created: https://access.redhat.com/solutions/2442901
Updated description to make bug public.
Description of problem:
Launching Calamari after a restart of admin/calamari node produces Server Error (500).
SELinux = disabled
All required firewall ports are open.
It was found that cthulhu was actually in a failed state, and restarting cthulhu restores the connection to Calamari. Preliminary analysis: cthulhu starts before the database is ready and enters a failed state.
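One way to avoid this kind of startup race is to have the dependent service wait until the database port accepts TCP connections before it starts. The following is a minimal Python 3 sketch of that idea, not Calamari's or cthulhu's actual startup code. The helper name `wait_for_port` and the timeout and interval values are hypothetical; only the host and port (localhost, 5432) come from this report.

```python
import socket
import time


def wait_for_port(host, port, timeout=60.0, interval=1.0):
    """Poll until a TCP connection to (host, port) succeeds.

    Returns True as soon as the port accepts a connection, or False if
    `timeout` seconds pass without a successful connection.
    """
    deadline = time.monotonic() + timeout
    while True:
        try:
            # create_connection raises OSError (for example
            # ConnectionRefusedError) while nothing is listening yet.
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError:
            if time.monotonic() >= deadline:
                return False
            time.sleep(interval)


# Hypothetical usage before launching the dependent service:
#     if not wait_for_port("127.0.0.1", 5432, timeout=60.0):
#         raise SystemExit("database port did not come up in time")
```

A successful TCP connection only shows that something is listening on the port. It does not prove that PostgreSQL is ready to serve queries, so a real check would probably also attempt a database login.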
/var/log/calamari/cthulhu.log:
OperationalError: (OperationalError) could not connect to server: Connection refused
Is the server running on host "localhost" (127.0.0.1) and accepting
TCP/IP connections on port 5432?
could not create socket: Address family not supported by protocol -> this is caused by ipv6_disabled
None None
2016-03-08 09:05:24,720 - ERROR - cthulhu Recovery failed
[root@#### ~]# supervisorctl
carbon-cache RUNNING pid 1010, uptime 0:06:14
cthulhu FATAL Exited too quickly (process log may have details)
Cthulhu seems to retry 4 times within about 8 seconds:
[root@####]# cat /var/log/calamari/cthulhu.log|grep ERROR
2016-03-08 08:49:40,252 - ERROR - cthulhu Recovery failed
2016-03-08 08:49:41,681 - ERROR - cthulhu Recovery failed
2016-03-08 08:49:44,116 - ERROR - cthulhu Recovery failed
2016-03-08 08:49:48,421 - ERROR - cthulhu Recovery failed
2016-03-08 09:05:17,430 - ERROR - cthulhu Recovery failed
2016-03-08 09:05:18,860 - ERROR - cthulhu Recovery failed
2016-03-08 09:05:21,288 - ERROR - cthulhu Recovery failed
2016-03-08 09:05:24,720 - ERROR - cthulhu Recovery failed
2016-03-08 09:17:22,238 - ERROR - cthulhu Recovery failed
2016-03-08 09:17:23,665 - ERROR - cthulhu Recovery failed
2016-03-08 09:17:26,090 - ERROR - cthulhu Recovery failed
2016-03-08 09:17:29,635 - ERROR - cthulhu Recovery failed
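The spacing of the retries can be checked directly from the timestamps. The Python 3 sketch below parses the last burst of log lines shown above and computes the delay between consecutive attempts. The function names `parse_timestamp` and `gaps_seconds` are hypothetical helpers written for this illustration.

```python
from datetime import datetime

# Last burst of errors, copied from the cthulhu.log excerpt above.
LOG_LINES = """\
2016-03-08 09:17:22,238 - ERROR - cthulhu Recovery failed
2016-03-08 09:17:23,665 - ERROR - cthulhu Recovery failed
2016-03-08 09:17:26,090 - ERROR - cthulhu Recovery failed
2016-03-08 09:17:29,635 - ERROR - cthulhu Recovery failed
""".splitlines()


def parse_timestamp(line):
    """Parse the leading 'YYYY-MM-DD HH:MM:SS,mmm' stamp of a log line."""
    return datetime.strptime(line[:23], "%Y-%m-%d %H:%M:%S,%f")


def gaps_seconds(lines):
    """Return the delay in seconds between each pair of consecutive lines."""
    stamps = [parse_timestamp(line) for line in lines]
    return [(b - a).total_seconds() for a, b in zip(stamps, stamps[1:])]


gaps = gaps_seconds(LOG_LINES)
total = sum(gaps)
print([round(g, 3) for g in gaps], round(total, 3))
```

For this burst the gaps are roughly 1.4, 2.4 and 3.5 seconds, about 7.4 seconds in total. The growing delays suggest a backoff between restart attempts, after which the process is marked FATAL.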
I believe PostgreSQL started only at 09:17:49, which is about 20 seconds after cthulhu's last attempt.
[root@evecm01 etc]# ps -ef|grep post
postgres 1471 1 0 09:17 ? 00:00:00 /usr/bin/postgres -D /var/lib/pgsql/data -p 5432
postgres 1808 1471 0 09:17 ? 00:00:00 postgres: logger process
postgres 1894 1471 0 09:17 ? 00:00:00 postgres: checkpointer process
postgres 1896 1471 0 09:17 ? 00:00:00 postgres: writer process
postgres 1897 1471 0 09:17 ? 00:00:00 postgres: wal writer process
postgres 1898 1471 0 09:17 ? 00:00:00 postgres: autovacuum launcher process
postgres 1899 1471 0 09:17 ? 00:00:00 postgres: stats collector process
root 2172 1 0 09:17 ? 00:00:00 /usr/libexec/postfix/master -w
postfix 2265 2172 0 09:17 ? 00:00:00 pickup -l -t unix -u
postfix 2266 2172 0 09:17 ? 00:00:00 qmgr -l -t unix -u
root 7447 3336 0 09:27 pts/0 00:00:00 grep --color=auto post
[root@####c]# stat /proc/1471/sta
stat: cannot stat ‘/proc/1471/sta’: No such file or directory
[root@####]# stat /proc/1471/stat
File: ‘/proc/1471/stat’
Size: 0 Blocks: 0 IO Block: 1024 regular empty file
Device: 3h/3d Inode: 24741 Links: 1
Access: (0444/-r--r--r--) Uid: ( 26/postgres) Gid: ( 26/postgres)
Access: 2016-03-08 09:17:49.484641330 +0100
Modify: 2016-03-08 09:17:49.484641330 +0100
Change: 2016-03-08 09:17:49.484641330 +0100
Because the postgresql service is ordered after network.target, the network bring-up time might be a factor here:
[Unit]
Description=PostgreSQL database server
After=network.target
Perhaps cthulhu should also wait for network bring-up, but it is managed by supervisord, so I am not sure whether that would have side effects for other supervisor applications (besides Calamari).
[root@evecm01 etc]# cat ./systemd/system/multi-user.target.wants/supervisord.service
[Unit]
Description=Process Monitoring and Control Daemon
After=rc-local.service
Version-Release number of selected component (if applicable):
RHCS 1.3.1
RHEL 7.1
How reproducible:
Customer can reproduce with every reboot.
I can reproduce intermittently, not with every reboot.