Bug 706881

Summary: Unstable loadbalancer (piranha)
Product: Red Hat Enterprise Linux 6
Reporter: Henrik Johansson <henrik.l.johansson>
Component: piranha
Assignee: Ryan O'Hara <rohara>
Status: CLOSED ERRATA
QA Contact: Cluster QE <mspqa-list>
Severity: high
Docs Contact:
Priority: medium
Version: 6.2
CC: benjamin.girard, cluster-maint, djansa, jkortus, lhh, mfuruta
Target Milestone: rc   
Target Release: ---   
Hardware: i686   
OS: Linux   
Whiteboard:
Fixed In Version: piranha-0.8.5-9.el6
Doc Type: Bug Fix
Doc Text:
Prior to this update, terminating a nanny or an lvs daemon did not trigger a failover to the backup server. As a consequence, the load balancer stopped working. With this update, the pulse daemon shuts down if either the nanny daemon or the lvs daemon terminates. Now, the load balancer works as expected.
Story Points: ---
Clone Of:
Environment:
Last Closed: 2011-12-06 17:57:30 UTC
Type: ---
Attachments:
An extract from /var/log/messages, from killing a nanny until restart of pulse. (Flags: none)

Description Henrik Johansson 2011-05-23 10:51:47 UTC
Created attachment 500392 [details]
An extract from /var/log/messages, from killing a nanny until restart of pulse.

Description of problem:
The load balancer stops working when one of the nanny processes dies or the lvsd process receives a TERM signal.
lvsd stops the remaining nannies and goes into defunct status, and the pulse processes (on MASTER and BACKUP) do not notice the problem.

Manual fix: 'service pulse restart'.

Version-Release number of selected component (if applicable):
[root@lvs2 ~]# cat /etc/redhat-release 
Red Hat Enterprise Linux Server release 6.0 (Santiago)
[root@lvs2 ~]# rpm -qa | egrep piranha
piranha-0.8.5-7.el6.i686


How reproducible:
kill <nanny-process> or kill <lvsd-process>


Steps to Reproduce:
1. service pulse start
2. kill <nanny-process>
  
Actual results:
The load balancer stops and the backup does not notice the problem.
Extract from ps:
[root@lvs2 ~]# ps -ef | egrep "piranha|pulse|lvsd|nanny"
root     15526     1  0 09:10 ?        00:00:00 pulse -v
root     15533 15526  0 09:10 ?        00:00:00 [lvsd] <defunct>
root     15579  1614  0 09:11 pts/0    00:00:00 egrep piranha|pulse|lvsd|nanny
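
The defunct state shown in the ps extract above could be detected automatically on a system that still runs the affected package. The following is only a minimal sketch, not part of piranha: lvsd_is_defunct is a hypothetical helper name, and the pattern assumes the ps output format shown in this report.

```shell
#!/bin/sh
# Hypothetical helper (not shipped with piranha): reads a `ps -ef` style
# listing on stdin and succeeds if it contains a defunct lvsd process.
# The pattern assumes the "[lvsd] <defunct>" form seen in this report.
lvsd_is_defunct() {
    grep -q '\[lvsd\] <defunct>'
}

# Possible use (assumption: run periodically as root, e.g. from cron),
# applying the manual fix from this report when the zombie is seen:
#   ps -ef | lvsd_is_defunct && service pulse restart
```

The function only inspects text, so it can be tried against a saved ps listing before wiring it to the restart command.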


Expected results:
Alternatives:
- Restarting the missing nanny. 
- Restarting the service pulse.
- Restarting the service pulse, with a timeout.
- Stopping the service pulse.


Additional info:
See attachment 500392: an extract from /var/log/messages, from killing a nanny until restart of pulse.

Comment 6 Ryan O'Hara 2011-08-11 14:59:29 UTC
With patch:

# service pulse start
# kill <nanny-pid>
# tail /var/log/messages

Aug 11 09:46:55 mobil-virt-09 nanny[14634]: Terminating due to signal 15
Aug 11 09:46:55 mobil-virt-09 lvs[14625]: nanny died! shutting down lvs
Aug 11 09:46:55 mobil-virt-09 lvs[14625]: shutting down virtual service HTTP
Aug 11 09:46:55 mobil-virt-09 nanny[14635]: Terminating due to signal 15
Aug 11 09:46:55 mobil-virt-09 nanny[14636]: Terminating due to signal 15
Aug 11 09:46:55 mobil-virt-09 pulse[14622]: Terminating due to signal 15

# service pulse start
# kill <lvsd-pid>
# tail /var/log/messages

Aug 11 09:58:09 mobil-virt-09 lvs[14675]: shutting down due to signal 15
Aug 11 09:58:09 mobil-virt-09 lvs[14675]: shutting down virtual service HTTP
Aug 11 09:58:09 mobil-virt-09 nanny[14684]: Terminating due to signal 15
Aug 11 09:58:09 mobil-virt-09 nanny[14685]: Terminating due to signal 15
Aug 11 09:58:09 mobil-virt-09 nanny[14686]: Terminating due to signal 15
Aug 11 09:58:09 mobil-virt-09 pulse[14671]: Terminating due to signal 15

Killing (with SIGTERM) either nanny or lvsd will cause all pulse/nanny/lvsd processes to exit.
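
The sequence in the two log extracts above can also be checked mechanically when verifying the patched package. This is a rough sketch under stated assumptions: pulse_followed_lvs_shutdown is a hypothetical function name, and the patterns assume the exact message wording shown in the extracts above.

```shell
#!/bin/sh
# Hypothetical check (not part of piranha): reads syslog lines on stdin and
# succeeds only if an lvs shutdown message is later followed by a pulse
# termination message. Message wording is taken from the extracts in this
# comment and may differ in other versions.
pulse_followed_lvs_shutdown() {
    awk '
        / lvs\[[0-9]+\]: (nanny died! shutting down lvs|shutting down due to signal)/ { lvs_down = 1 }
        lvs_down && / pulse\[[0-9]+\]: Terminating due to signal/ { found = 1 }
        END { exit(found ? 0 : 1) }
    '
}

# Possible use after killing a nanny or lvsd on a test machine:
#   tail -n 50 /var/log/messages | pulse_followed_lvs_shutdown && echo "pulse exited"
```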

Comment 9 Eliska Slobodova 2011-10-25 09:50:41 UTC
Technical note added. If any revisions are required, please edit the "Technical Notes" field accordingly. All revisions will be proofread by the Engineering Content Services team.

New Contents:
Prior to this update, terminating a nanny or an lvs daemon did not trigger a failover to the backup server. As a consequence, the load balancer stopped working. With this update, the pulse daemon shuts down if either the nanny daemon or the lvs daemon terminates. Now, the load balancer works as expected.

Comment 10 Henrik Johansson 2011-10-25 12:18:19 UTC
Which version of piranha has the update?

Any suggestions for restart of 'service pulse'?

Comment 11 Henrik Johansson 2011-10-25 12:20:15 UTC
Oops, piranha-0.8.5-9.el6

Comment 12 benjamin.girard 2011-12-02 17:27:11 UTC
Where can I find this update?

Comment 13 errata-xmlrpc 2011-12-06 17:57:30 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

http://rhn.redhat.com/errata/RHBA-2011-1716.html