Bug 176913 - nanny and lvsd experiencing a segfault at start of LVS
Status: CLOSED ERRATA
Product: Red Hat Cluster Suite
Classification: Red Hat
Component: piranha
Version: 4
Hardware: All Linux
Priority: medium Severity: medium
Assigned To: Stanko Kupcevic
QA Contact: Cluster QE
Depends On:
Blocks: 180185
Reported: 2006-01-04 05:23 EST by Kim Forsberg
Modified: 2009-04-16 16:13 EDT

See Also:
Fixed In Version: RHBA-2006-0538
Doc Type: Bug Fix
Last Closed: 2006-08-10 17:19:12 EDT

Attachments
lvs.cf (816 bytes, text/plain)
2006-01-11 01:46 EST, Kim Forsberg
Output of lsmod (1.15 KB, text/plain)
2006-01-11 01:48 EST, Kim Forsberg
Output of lspci -vv (15.30 KB, text/plain)
2006-01-11 01:50 EST, Kim Forsberg
Syslog (33.62 KB, text/plain)
2006-01-11 01:51 EST, Kim Forsberg
Output of pulse -nv (1.18 KB, text/plain)
2006-01-12 01:31 EST, Kim Forsberg

Description Kim Forsberg 2006-01-04 05:23:34 EST
From Bugzilla Helper:
User-Agent: Mozilla/5.0 (Windows; U; Windows NT 5.1; fi; rv:1.8) Gecko/20051111 Firefox/1.5

Description of problem:
Initially pulse starts normally. However, when it activates the lvs and creates a monitor for the virtual service, both nanny and lvsd segfault. Startup then continues, and pulse eventually logs that the gratuitous lvs arps have finished.

kernel: nanny[10515]: segfault at 0000000000000000 rip 0000003188e6fd20 rsp 0000007fbffef0c8 error 4
kernel: lvsd[10510]: segfault at 0000000000000480 rip 000000000040314f rsp 0000007fbffff970 error 4

These two segfaults leave the LVS routing table only half-configured: the virtual service is there, but no real servers are listed. The real servers can be added manually with ipvsadm, and then everything works (until pulse is restarted and resets the table).
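
For reference, the manual workaround might look like the sketch below. The real server addresses (10.0.0.11 and 10.0.0.12), the NAT forwarding flag (-m) and the weight of 1 are assumptions, not values from this report; substitute the addresses and forwarding method from your own lvs.cf. The function only prints the ipvsadm commands so they can be reviewed before being run as root.

```shell
# Print (do not run) the ipvsadm commands that would add real servers
# to an existing TCP virtual service.
# $1 = virtual service as address:port, remaining args = real servers.
# -m (NAT) and -w 1 are assumed defaults for this sketch.
print_real_server_cmds() {
    vip="$1"
    shift
    for rip in "$@"; do
        echo "ipvsadm -a -t $vip -r $rip -m -w 1"
    done
}

# Hypothetical real servers behind the virtual service from this report:
print_real_server_cmds 10.58.12.246:80 10.0.0.11:80 10.0.0.12:80
```

Anything added this way is lost again as soon as pulse is restarted, so it is only a stopgap.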

This happens straight out of the box on two identical Dell PowerEdge 1425 servers (one primary, one backup), essentially rendering the whole service useless.

If you disable all real servers, for instance using Piranha, no segfaults are produced and the services start up nicely. Once you enable at least one real server, the problem recurs.

Version-Release number of selected component (if applicable):


How reproducible:
Always

Steps to Reproduce:
1. Add a real server using, for instance, Piranha
2. /sbin/service pulse start


Additional info:

CURRENT LVS ROUTING TABLE

IP Virtual Server version 1.2.0 (size=4096)
Prot LocalAddress:Port Scheduler Flags
 -> RemoteAddress:Port Forward Weight ActiveConn InActConn
TCP 10.58.12.246:80 wlc

CURRENT LVS PROCESSES

root 10507 0.0 0.0 5176 548 ? Ss 12:01 0:00 pulse
root 10510 0.0 0.0 0 0 ? Zs 12:01 0:00 [lvsd] <defunct>
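
A defunct lvsd like the one above can be spotted with a small filter over `ps aux` output. This is only a sketch: it assumes the standard `ps aux` column layout (state in column 8, command in column 11), and the function name is made up for illustration.

```shell
# Read `ps aux` output on stdin and print the PID of every lvsd
# process that is in zombie (Z) state.
list_defunct_lvsd() {
    awk '$8 ~ /^Z/ && $11 ~ /lvsd/ { print $2 }'
}

# Example:
#   ps aux | list_defunct_lvsd
```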

pulse[10507]: STARTING PULSE AS MASTER
pulse: pulse startup succeeded
pulse[10507]: partner dead: activating lvs
lvs[10510]: starting virtual service rek active: 80
nanny[10515]: starting LVS client monitor for 10.58.12.246:80
kernel: nanny[10515]: segfault at 0000000000000000 rip 0000003188e6fd20 rsp 0000007fbffef0c8 error 4
lvs[10510]: create_monitor for rek/rek2 running as pid 10515
kernel: lvsd[10510]: segfault at 0000000000000480 rip 000000000040314f rsp 0000007fbffff970 error 4
lada pulse[10512]: gratuitous lvs arps finished

The machine has 4 network interfaces bonded into one external and one internal bonding interface. I have tried running pulse without any bonded interfaces but this did nothing to solve the problem.
Comment 1 Kim Forsberg 2006-01-04 05:26:29 EST
All available updates have been downloaded and installed from Red Hat Network.
Comment 2 Kim Forsberg 2006-01-11 01:46:40 EST
Created attachment 123037
lvs.cf

lvs.cf is identical on both servers. The segfault persists even if the backup
service is disabled in this file.
Comment 3 Kim Forsberg 2006-01-11 01:48:33 EST
Created attachment 123038
Output of lsmod
Comment 4 Kim Forsberg 2006-01-11 01:50:45 EST
Created attachment 123039
Output of lspci -vv
Comment 5 Kim Forsberg 2006-01-11 01:51:31 EST
Created attachment 123040
Syslog
Comment 6 Stanko Kupcevic 2006-01-11 18:56:33 EST
Could you start pulse with 'pulse -nv' instead of 'service pulse start' and
capture output until lvsd dies? 
Comment 7 Kim Forsberg 2006-01-12 01:31:58 EST
Created attachment 123103
Output of pulse -nv

The segfaults are visible in /var/log/messages.
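
To pull just those lines out of the log, something like the following should do. It is a sketch that assumes the kernel messages keep the format quoted in the description; the function name is made up for illustration.

```shell
# Print kernel segfault messages for nanny or lvsd from a syslog file.
# $1 = log file (defaults to /var/log/messages).
find_lvs_segfaults() {
    grep -E 'kernel: (nanny|lvsd)\[[0-9]+]: segfault at' "${1:-/var/log/messages}"
}

# Example:
#   find_lvs_segfaults /var/log/messages
```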
Comment 8 Stanko Kupcevic 2006-01-12 16:15:12 EST
There is a bug in nanny that triggers a series of unfortunate events: 

1. nanny segfaults if regular expression matching is enabled and there is no
“expect string” to match against
2. lvsd segfaults when that nanny dies, leaving the other nannies alive
3. pulse doesn't monitor lvsd, so it is unaware of the problem
4. the system is left half configured, without failing over
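
A configuration that would hit step 1 can be detected before starting pulse. The sketch below assumes the lvs.cf keys are spelled `use_regex` and `expect` and that each virtual service starts with a `virtual NAME {` line; check this against your own file. The function name is made up for illustration.

```shell
# Print the name of every virtual service in an lvs.cf-style file that
# has regex matching enabled but an empty or missing expect string.
# $1 = path to the configuration file.
check_lvs_cf() {
    awk '
        function report() {
            if (name != "" && regex && !has_expect)
                print name
        }
        # A new virtual service starts: report the previous one, reset state.
        /^[ \t]*virtual[ \t]+/ { report(); name = $2; regex = 0; has_expect = 0 }
        /^[ \t]*use_regex[ \t]*=[ \t]*1/ { regex = 1 }
        /^[ \t]*expect[ \t]*=/ {
            value = $0
            sub(/^[^=]*=[ \t]*/, "", value)   # drop everything up to "="
            gsub(/"/, "", value)              # drop the quotes
            gsub(/[ \t]+$/, "", value)        # drop trailing whitespace
            if (value != "") has_expect = 1
        }
        END { report() }
    ' "$1"
}

# Example:
#   check_lvs_cf /etc/sysconfig/ha/lvs.cf
```

Any name it prints is a virtual service that should either get an expect string or have regex matching turned off.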
Comment 9 Stanko Kupcevic 2006-01-12 16:19:21 EST
Quick fix: 

1. Disable regular expression matching in “virtual servers->monitoring scripts”
if “expect” is blank. 
2. Since the “edit monitoring scripts” page strips escape characters on “accept” (a
bug), no newline characters can be specified through it. So, replace “rnrn” with
“\r\n\r\n” in /etc/sysconfig/ha/lvs.cf.

or

Recreate the virtual server without making changes on the “edit monitoring scripts”
page. The defaults work just fine with http.
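
The replacement in step 2 of the first option can be done with sed. This is a sketch: it rewrites every “rnrn” in the file, so it prints the result to stdout for review instead of editing in place. The function name is made up for illustration.

```shell
# Print the given file with every stripped "rnrn" sequence restored
# to the literal escape text \r\n\r\n.
# $1 = path to the configuration file.
restore_crlf_escapes() {
    sed 's/rnrn/\\r\\n\\r\\n/g' "$1"
}

# Example (compare before replacing the original):
#   restore_crlf_escapes /etc/sysconfig/ha/lvs.cf > /tmp/lvs.cf.fixed
#   diff /etc/sysconfig/ha/lvs.cf /tmp/lvs.cf.fixed
```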
Comment 10 Kim Forsberg 2006-01-13 03:29:59 EST
Problem solved! HTTP is just for testing purposes so the defaults wouldn't have
been applicable for the custom applications we are using in production. Thank
you for the quick fix!

/Kim
Comment 12 Stanko Kupcevic 2006-05-15 18:27:08 EDT
Fixed in 0.8.3

Fixed nanny and lvsd segfaults

To test:

 1. enable regex matching and leave expect string empty; nanny shouldn't segfault
 
 2. start lvsd and kill a nanny; lvsd should gracefully exit terminating other
nannies
Comment 15 Red Hat Bugzilla 2006-08-10 17:19:12 EDT
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHBA-2006-0538.html
