Description of problem:
When corosync runs over a lossy network it can enter recovery mode repeatedly. It can also fail to recover properly, re-entering the gather/commit/recovery cycle from recovery mode. After entering recovery a few hundred times, corosync has consumed a large amount of memory. If recovery is entered again (without reaching OPERATIONAL) before all of retrans_message_queue has been transmitted, memory is lost when the queues are reinitialized in memb_state_recovery_enter(). In addition, deliver_messages_from_recovery_to_regular() has a small memory leak when a message is not added to the regular_sort_queue.

Version-Release number of selected component (if applicable):
1.3.4

How reproducible:
Reliably, on a lossy network.

Steps to Reproduce:
1. Set up a 4-node cluster.
2. Make the network it operates over lossy (either via corosync parameters or some other mechanism).
3. Wait roughly 8 hours (e.g. overnight).

Actual results:
Corosync memory usage has increased substantially.

Expected results:
Corosync memory usage should remain roughly static.

Additional info:
I have sent a patch to the mailing list that resolves both of these problems for us.