Bug 176913 - nanny and lvsd experiencing a segfault at start of LVS
Summary: nanny and lvsd experiencing a segfault at start of LVS
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Cluster Suite
Classification: Retired
Component: piranha
Version: 4
Hardware: All
OS: Linux
Priority: medium
Severity: medium
Target Milestone: ---
Assignee: Stanko Kupcevic
QA Contact: Cluster QE
URL:
Whiteboard:
Depends On:
Blocks: 180185
Reported: 2006-01-04 10:23 UTC by Kim Forsberg
Modified: 2009-04-16 20:13 UTC (History)

Fixed In Version: RHBA-2006-0538
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2006-08-10 21:19:12 UTC
Embargoed:


Attachments
lvs.cf (816 bytes, text/plain)
2006-01-11 06:46 UTC, Kim Forsberg
Output of lsmod (1.15 KB, text/plain)
2006-01-11 06:48 UTC, Kim Forsberg
Output of lspci-vv (15.30 KB, text/plain)
2006-01-11 06:50 UTC, Kim Forsberg
Syslog (33.62 KB, text/plain)
2006-01-11 06:51 UTC, Kim Forsberg
Output of pulse -nv (1.18 KB, text/plain)
2006-01-12 06:31 UTC, Kim Forsberg


Links
System: Red Hat Product Errata
ID: RHBA-2006:0538
Private: 0
Priority: normal
Status: SHIPPED_LIVE
Summary: piranha bug fix update
Last Updated: 2006-08-10 04:00:00 UTC

Description Kim Forsberg 2006-01-04 10:23:34 UTC
From Bugzilla Helper:
User-Agent: Mozilla/5.0 (Windows; U; Windows NT 5.1; fi; rv:1.8) Gecko/20051111 Firefox/1.5

Description of problem:
Initially pulse starts normally. However, when it activates LVS and creates a monitor for the virtual service, both nanny and lvsd segfault. Startup then continues and eventually a message is logged that the gratuitous lvs arps have finished.

kernel: nanny[10515]: segfault at 0000000000000000 rip 0000003188e6fd20 rsp 0000007fbffef0c8 error 4
kernel: lvsd[10510]: segfault at 0000000000000480 rip 000000000040314f rsp 0000007fbffff970 error 4

These two segfaults leave the LVS routing table only half-configured: the virtual service is present but no real servers are listed. The real servers can be added manually with ipvsadm, after which everything works (until pulse is restarted and resets the table).

This happens straight out of the box on two identical Dell PowerEdge 1425 servers (one primary, the other backup), essentially rendering the whole service useless.

If you disable all real servers, for instance using Piranha, no segfaults are produced and the services start up nicely. Once you enable at least one real server, the problem recurs.

Version-Release number of selected component (if applicable):


How reproducible:
Always

Steps to Reproduce:
1. Add a real server using, for instance, Piranha
2. /sbin/service pulse start


Additional info:

CURRENT LVS ROUTING TABLE

IP Virtual Server version 1.2.0 (size=4096)
Prot LocalAddress:Port Scheduler Flags
 -> RemoteAddress:Port Forward Weight ActiveConn InActConn
TCP 10.58.12.246:80 wlc

CURRENT LVS PROCESSES

root 10507 0.0 0.0 5176 548 ? Ss 12:01 0:00 pulse
root 10510 0.0 0.0 0 0 ? Zs 12:01 0:00 [lvsd] <defunct>

pulse[10507]: STARTING PULSE AS MASTER
pulse: pulse startup succeeded
pulse[10507]: partner dead: activating lvs
lvs[10510]: starting virtual service rek active: 80
nanny[10515]: starting LVS client monitor for 10.58.12.246:80
kernel: nanny[10515]: segfault at 0000000000000000 rip 0000003188e6fd20 rsp 0000007fbffef0c8 error 4
lvs[10510]: create_monitor for rek/rek2 running as pid 10515
kernel: lvsd[10510]: segfault at 0000000000000480 rip 000000000040314f rsp 0000007fbffff970 error 4
lada pulse[10512]: gratuitous lvs arps finished
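
The two kernel segfault lines in the log above can be decoded mechanically. The following is a minimal sketch, not part of piranha: the function name `decode_segfault` is made up for illustration, and it assumes the x86-64 kernel message layout shown in this report, with the trailing error value printed in hexadecimal and interpreted as the standard x86 page-fault error code.

```python
import re

# Layout of the kernel message as it appears in this report (x86-64).
SEGFAULT_RE = re.compile(
    r"(?P<prog>\w+)\[(?P<pid>\d+)\]: segfault at (?P<addr>[0-9a-f]+) "
    r"rip (?P<rip>[0-9a-f]+) rsp (?P<rsp>[0-9a-f]+) error (?P<err>[0-9a-f]+)"
)


def decode_segfault(line):
    """Return a dict describing one kernel segfault line, or None if it does not match."""
    m = SEGFAULT_RE.search(line)
    if m is None:
        return None
    err = int(m.group("err"), 16)
    return {
        "program": m.group("prog"),
        "pid": int(m.group("pid")),
        "fault_address": int(m.group("addr"), 16),
        # x86 page-fault error code bits:
        "protection_violation": bool(err & 1),  # clear means the page was not present
        "write_access": bool(err & 2),          # clear means the access was a read
        "user_mode": bool(err & 4),
    }


if __name__ == "__main__":
    log = [
        "kernel: nanny[10515]: segfault at 0000000000000000 rip 0000003188e6fd20 rsp 0000007fbffef0c8 error 4",
        "kernel: lvsd[10510]: segfault at 0000000000000480 rip 000000000040314f rsp 0000007fbffff970 error 4",
    ]
    for entry in log:
        print(decode_segfault(entry))
```

Read this way, both faults are user-mode reads of an unmapped page at or near address zero, which is consistent with a NULL pointer being dereferenced (directly in nanny, and at a small offset in lvsd).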

The machine has 4 network interfaces bonded into one external and one internal bonding interface. I have tried running pulse without any bonded interfaces but this did nothing to solve the problem.

Comment 1 Kim Forsberg 2006-01-04 10:26:29 UTC
All available updates have been downloaded and installed from Red Hat Network.

Comment 2 Kim Forsberg 2006-01-11 06:46:40 UTC
Created attachment 123037
lvs.cf

lvs.cf is identical on both servers. The segfault persists even if the backup
service is disabled in this file.

Comment 3 Kim Forsberg 2006-01-11 06:48:33 UTC
Created attachment 123038
Output of lsmod

Comment 4 Kim Forsberg 2006-01-11 06:50:45 UTC
Created attachment 123039
Output of lspci-vv

Comment 5 Kim Forsberg 2006-01-11 06:51:31 UTC
Created attachment 123040
Syslog

Comment 6 Stanko Kupcevic 2006-01-11 23:56:33 UTC
Could you start pulse with 'pulse -nv' instead of 'service pulse start' and
capture output until lvsd dies? 


Comment 7 Kim Forsberg 2006-01-12 06:31:58 UTC
Created attachment 123103
Output of pulse -nv

The segfaults are visible in /var/log/messages.

Comment 8 Stanko Kupcevic 2006-01-12 21:15:12 UTC
There is a bug in nanny that triggers a series of unfortunate events:

1. nanny segfaults if regular expression matching is enabled and there is no
"expect string" to match against
2. lvsd segfaults along with nanny, leaving the other nannies alive
3. pulse doesn't monitor lvsd, so it is unaware of the problem
4. the system is left half configured, without failing over
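
Step 1 of the chain above comes down to a missing guard before the pattern is used. A minimal sketch of such a guard follows; it is not the nanny source (piranha is written in C), the name `response_matches` is hypothetical, and both the "no expect string means any reply is accepted" policy and the substring comparison in the non-regex branch are assumptions made for illustration.

```python
import re


def response_matches(response, expect, use_regex):
    """Decide whether a real server's reply satisfies the monitor's expectation."""
    if not expect:
        # Assumption: with nothing to match against, accept any reply
        # rather than handing a missing pattern to the regex engine.
        return True
    if use_regex:
        return re.search(expect, response) is not None
    # Assumption: without regex matching, a plain substring test is used.
    return expect in response
```

The point is only the ordering: the empty or missing expect string is handled before either matching branch can touch it.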


Comment 9 Stanko Kupcevic 2006-01-12 21:19:21 UTC
Quick fix:

1. Disable regular expression matching in "virtual servers->monitoring scripts"
if "expect" is blank.
2. Since "edit monitoring scripts" page removes escape characters on "accept" (a
bug), no newline characters can be specified. So, replace "rnrn" with "\r\n\r\n"
in /etc/sysconfig/ha/lvs.cf.

or

Recreate a virtual server, without making changes to "edit monitoring scripts"
page. Defaults work just fine with http.
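
The two quick-fix steps can also be applied to the file text with a small script. This is a rough sketch under stated assumptions: the attached lvs.cf is not reproduced in this report, so the key names `send`, `expect` and `use_regex` and the `virtual <name> {` block syntax are assumed, and `apply_workaround` is a hypothetical helper, not a piranha tool. Work on a backup copy of /etc/sysconfig/ha/lvs.cf.

```python
import re

VIRTUAL_RE = re.compile(r"\s*virtual\s+(\S+)\s*\{")


def apply_workaround(cf_text):
    """Return lvs.cf text with both quick-fix steps applied."""
    lines = cf_text.splitlines()

    # Pass 1: note which virtual blocks have a non-empty expect string.
    has_expect = {}
    current = None
    for line in lines:
        m = VIRTUAL_RE.match(line)
        if m:
            current = m.group(1)
            has_expect[current] = False
        elif current and re.match(r'\s*expect\s*=\s*"[^"]+"', line):
            has_expect[current] = True

    # Pass 2: restore the stripped escapes and disable regex where expect is blank.
    fixed = []
    current = None
    for line in lines:
        m = VIRTUAL_RE.match(line)
        if m:
            current = m.group(1)
        elif re.match(r"\s*send\s*=", line):
            # The web page dropped the backslashes, leaving a literal "rnrn".
            line = line.replace("rnrn", r"\r\n\r\n")
        elif current and re.match(r"\s*use_regex\s*=", line) and not has_expect[current]:
            line = re.sub(r"=\s*\d+", "= 0", line)
        fixed.append(line)
    return "\n".join(fixed) + "\n"
```

A block that already has a non-empty expect string keeps its use_regex setting; only the send string is repaired there.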


Comment 10 Kim Forsberg 2006-01-13 08:29:59 UTC
Problem solved! HTTP is just for testing purposes so the defaults wouldn't have
been applicable for the custom applications we are using in production. Thank
you for the quick fix!

/Kim

Comment 12 Stanko Kupcevic 2006-05-15 22:27:08 UTC
Fixed in 0.8.3

Fixed nanny and lvsd segfaults

To test:

 1. enable regex matching and leave the expect string empty; nanny shouldn't segfault

 2. start lvsd and kill a nanny; lvsd should exit gracefully, terminating the
other nannies
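
The behaviour checked by test 2 can be pictured with a small supervisor sketch. It is not the lvsd implementation: `supervise` and `spawn_sleeper` are hypothetical names, the children are plain sleeping Python processes standing in for nannies, and the polling loop is a simplification chosen for brevity.

```python
import subprocess
import sys
import time


def spawn_sleeper():
    """Start a stand-in child process that just sleeps (a placeholder for a nanny)."""
    return subprocess.Popen([sys.executable, "-c", "import time; time.sleep(60)"])


def supervise(children, poll_interval=0.1):
    """Wait until any child exits, terminate the rest, and return that child's exit code."""
    while True:
        for child in children:
            if child.poll() is not None:
                # One monitor is gone: stop the survivors instead of leaving them orphaned.
                for other in children:
                    if other is not child and other.poll() is None:
                        other.terminate()
                for other in children:
                    other.wait()
                return child.returncode
        time.sleep(poll_interval)


if __name__ == "__main__":
    kids = [spawn_sleeper() for _ in range(3)]
    kids[0].kill()  # simulate one nanny dying
    print("first child exited with", supervise(kids))
```

After `supervise` returns, no child is left running, which is the property the test step asks for.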


Comment 15 Red Hat Bugzilla 2006-08-10 21:19:12 UTC
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHBA-2006-0538.html


