In order to make any useful comments, I'm going to need logs from the controllers and the bad compute node. If sos won't work on some of those locations, please grab /var/log/pacemaker.log and /var/log/corosync.log manually.
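A minimal sketch of the manual log grab, assuming the default Pacemaker/Corosync log locations named above. Demo paths are staged here so the commands run anywhere; on the actual node you would point LOG_DIR at /var/log instead.

```shell
# Stand-in log directory for the demo; use LOG_DIR=/var/log on the real node.
LOG_DIR=$(mktemp -d)
printf 'pacemaker demo\n' > "$LOG_DIR/pacemaker.log"   # stand-in for /var/log/pacemaker.log
printf 'corosync demo\n'  > "$LOG_DIR/corosync.log"    # stand-in for /var/log/corosync.log

# Bundle both logs into one archive for upload to the case.
OUT=$(mktemp -u /tmp/pcmk-logs-XXXXXX).tar.gz
tar -czf "$OUT" -C "$LOG_DIR" pacemaker.log corosync.log
tar -tzf "$OUT"    # lists the two bundled log files
```

The -C flag keeps the archive paths relative, so the tarball extracts cleanly wherever it is unpacked.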
The problematic compute node has been rebooted, which corrected all issues with it. The following problems were magically solved by the reboot:
- Failure to take snapshots of instances running on it.
- Failure to schedule new instances to that compute node.
- Live migration to and from this compute node started working.
It is still not known why these operations were failing. Logs are on collabshell if you would like to take a look. Compute-1 /var/log: /cases/01618921/logs_community-cmpt01.localdomain.tar.gz/var/log -- this only covers up to 20 April, while the problems were still occurring. All logs in that location dated before the 24th are from before the reboot.
Ok, I can see that the errors started at Apr 19 05:51:57:

Apr 18 12:50:17 [78627] community-cmpt01.localdomain pacemaker_remoted: info: crm_compress_string: Compressed 342635 bytes into 14345 (ratio 23:1) in 106ms
Apr 19 05:51:57 [78627] community-cmpt01.localdomain pacemaker_remoted: error: crm_send_tls: Connection terminated rc = -53
Apr 19 05:51:57 [78627] community-cmpt01.localdomain pacemaker_remoted: error: crm_send_tls: Connection terminated rc = -10
Apr 19 05:51:57 [78627] community-cmpt01.localdomain pacemaker_remoted: error: crm_remote_send: Failed to send remote msg, rc = -10
Apr 19 05:51:57 [78627] community-cmpt01.localdomain pacemaker_remoted: error: lrmd_tls_send_msg: Failed to send remote lrmd tls msg, rc = -10
Apr 19 05:51:57 [78627] community-cmpt01.localdomain pacemaker_remoted: warning: send_client_notify: Notification of client remote-lrmd-community-cmpt01:3121/29014a03-c5e0-47af-8ec4-23da75b63cec failed
Apr 19 05:51:57 [78627] community-cmpt01.localdomain pacemaker_remoted: info: lrmd_remote_client_msg: Client disconnect detected in tls msg dispatcher.
Apr 19 05:51:57 [78627] community-cmpt01.localdomain pacemaker_remoted: info: cancel_recurring_action: Cancelling ocf operation nova-compute_monitor_10000

But without the logs from the controllers I can't tell whether this correlates with anything the rest of the system was doing.
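A hedged sketch of how the remote-connection errors above could be pulled out of pacemaker.log for correlation with the controller logs once those arrive. A sample log is staged here so the command is runnable anywhere; on the node you would grep /var/log/pacemaker.log directly.

```shell
# Stage a small sample of the log lines quoted above (demo only).
LOG=$(mktemp)
cat > "$LOG" <<'EOF'
Apr 19 05:51:57 [78627] community-cmpt01.localdomain pacemaker_remoted: error: crm_send_tls: Connection terminated rc = -53
Apr 19 05:51:57 [78627] community-cmpt01.localdomain pacemaker_remoted: error: crm_remote_send: Failed to send remote msg, rc = -10
Apr 19 05:51:57 [78627] community-cmpt01.localdomain pacemaker_remoted: info: cancel_recurring_action: Cancelling ocf operation nova-compute_monitor_10000
EOF

# Count the remote-send failures; extend the pattern as needed.
grep -cE 'crm_send_tls|crm_remote_send|lrmd_tls_send_msg' "$LOG"   # -> 2
```

Grepping both the compute node's and the controllers' logs for the same 05:51:57 window is the quickest way to check whether the disconnect was driven by something cluster-side.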
*** This bug has been marked as a duplicate of bug 1354601 ***
Dup -- QE will decide about automating the original