Bug 179734 - CMAN failed send message and leave cluster
Status: CLOSED INSUFFICIENT_DATA
Alias: None
Product: Fedora
Classification: Fedora
Component: cman
Version: rawhide
Hardware: All
OS: Linux
Priority: medium
Severity: medium
Target Milestone: ---
Assignee: Chris Feist
Whiteboard: bzcl34nup
Reported: 2006-02-02 15:46 UTC by Andrew Okhmat
Modified: 2008-05-07 00:20 UTC (History)
Doc Type: Bug Fix
Last Closed: 2008-05-07 00:20:54 UTC

Description Andrew Okhmat 2006-02-02 15:46:49 UTC
We installed a new version of the cluster software (cman-1.0.4) and got the
following errors in dmesg when the cman service tried to start.

=== message from CMAN ===
CMAN 2.6.14.1-20051219.162641.FC5.10 (built Feb  2 2006 08:45:18) installed
NET: Registered protocol family 30
CMAN: Waiting to join or form a Linux-cluster
CMAN: sending membership request
CMAN: sending membership request
CMAN: got node z3
CMAN: sendmsg failed: -13
CMAN: sendmsg failed: -13
CMAN: quorum regained, resuming activity
CMAN: sendmsg failed: -13
CMAN: send_queued_message failed, error -13
CMAN: sendmsg failed: -13
CMAN: send_queued_message failed, error -13
CMAN: sendmsg failed: -13
DLM 2.6.14.1-20051219.162641.FC5.10 (built Feb  2 2006 08:46:32) installed
CMAN: sendmsg failed: -13
CMAN: sendmsg failed: -13
CMAN: No functional network interfaces, leaving cluster
CMAN: sendmsg failed: -13
CMAN: sendmsg failed: -13
CMAN: we are leaving the cluster. Shutdown
WARNING: dlm_emergency_shutdown
WARNING: dlm_emergency_shutdown
===========

This is probably an error in the new code for sending cluster messages. I found
the commit message "Say something if sendmsg fails." at this URL:
http://sources.redhat.com/cgi-bin/cvsweb.cgi/cluster/cman-kernel/src/cnxman.c?rev=1.55&content-type=text/x-cvsweb-markup&cvsroot=cluster
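
For reference, the -13 in the CMAN messages looks like a negated kernel errno
value. The following is only a minimal sketch, assuming the trailing number in
each message is a standard Linux errno; the helper name decode_kernel_error is
made up for illustration and is not part of cman.

```python
import errno
import os
import re


def decode_kernel_error(line):
    """Decode a trailing negated errno (e.g. "-13") from a log line.

    Returns (number, symbolic name, description), or None when the line
    does not end in a negative number. Assumption: the number is a
    standard errno value, as is usual for kernel return codes.
    """
    match = re.search(r"-(\d+)\s*$", line)
    if match is None:
        return None
    code = int(match.group(1))
    name = errno.errorcode.get(code, "UNKNOWN")
    return code, name, os.strerror(code)


# Two of the messages quoted in this report.
for message in (
    "CMAN: sendmsg failed: -13",
    "CMAN: send_queued_message failed, error -13",
):
    print(message, "=>", decode_kernel_error(message))
```

On Linux, errno 13 is EACCES (permission denied), which may point at something
rejecting the send rather than at a missing link.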

One characteristic of our servers is that we use a bonding interface:

=== ifconfig ===
bond0     Link encap:Ethernet  HWaddr 00:0B:CD:EF:F1:D7
          inet addr:x.x.x.x  Bcast:x.x.x.x  Mask:255.255.255.224
          UP BROADCAST RUNNING MASTER MULTICAST  MTU:1500  Metric:1
          RX packets:186012691 errors:0 dropped:0 overruns:0 frame:0
          TX packets:48577497 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:0
          RX bytes:1141297473 (1.0 GiB)  TX bytes:3641909279 (3.3 GiB)

bond0:0   Link encap:Ethernet  HWaddr 00:0B:CD:EF:F1:D7
          inet addr:192.168.0.165  Bcast:192.168.0.255  Mask:255.255.255.0
          UP BROADCAST RUNNING MASTER MULTICAST  MTU:1500  Metric:1

eth0      Link encap:Ethernet  HWaddr 00:0B:CD:EF:F1:D7
          UP BROADCAST RUNNING SLAVE MULTICAST  MTU:1500  Metric:1
          RX packets:185935375 errors:0 dropped:0 overruns:0 frame:0
          TX packets:48577497 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000
          RX bytes:1135633654 (1.0 GiB)  TX bytes:3641909279 (3.3 GiB)
          Interrupt:185

eth1      Link encap:Ethernet  HWaddr 00:0B:CD:EF:F1:D7
          UP BROADCAST RUNNING NOARP SLAVE MULTICAST  MTU:1500  Metric:1
          RX packets:77316 errors:0 dropped:0 overruns:0 frame:0
          TX packets:0 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000
          RX bytes:5663819 (5.4 MiB)  TX bytes:0 (0.0 b)
          Interrupt:193

eth2      Link encap:Ethernet  HWaddr 00:04:23:AA:E0:60
          inet addr:10.65.73.31  Bcast:10.65.73.255  Mask:255.255.255.0
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:336794267 errors:0 dropped:0 overruns:0 frame:0
          TX packets:110800 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000
          RX bytes:3615713202 (3.3 GiB)  TX bytes:7091200 (6.7 MiB)
          Base address:0x3000 Memory:f7de0000-f7e00000

lo        Link encap:Local Loopback
          inet addr:127.0.0.1  Mask:255.0.0.0
          UP LOOPBACK RUNNING  MTU:16436  Metric:1
          RX packets:30059 errors:0 dropped:0 overruns:0 frame:0
          TX packets:30059 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:0
          RX bytes:2694690 (2.5 MiB)  TX bytes:2694690 (2.5 MiB)
=======



Dmesg:
========
Linux version 2.6.15-prep (root@z4) (gcc version 4.0.2 20051125 (Red Hat
4.0.2-8)) #3 SMP Sun Jan 29 21:14:49 EST 2006
BIOS-provided physical RAM map:
 BIOS-e820: 0000000000000000 - 000000000009f000 (usable)
 BIOS-e820: 000000000009f000 - 00000000000a0000 (reserved)
 BIOS-e820: 00000000000f0000 - 0000000000100000 (reserved)
 BIOS-e820: 0000000000100000 - 00000000f3ffa000 (usable)
 BIOS-e820: 00000000f3ffa000 - 00000000f4000000 (ACPI data)
 BIOS-e820: 00000000fec00000 - 00000000fec10000 (reserved)
 BIOS-e820: 00000000fee00000 - 00000000fee10000 (reserved)
 BIOS-e820: 00000000ffc00000 - 0000000100000000 (reserved)
3007MB HIGHMEM available.
896MB LOWMEM available.
found SMP MP-table at 000f4fd0
Using x86 segment limits to approximate NX protection
On node 0 totalpages: 999418
  DMA zone: 4096 pages, LIFO batch:0
  DMA32 zone: 0 pages, LIFO batch:0
  Normal zone: 225280 pages, LIFO batch:31
  HighMem zone: 770042 pages, LIFO batch:31
DMI 2.3 present.
ACPI: RSDP (v000 COMPAQ                                ) @ 0x000f4f70
ACPI: RSDT (v001 COMPAQ P29      0x00000002 <D2>^D 0x0000162e) @ 0xf3ffa000
ACPI: FADT (v001 COMPAQ P29      0x00000002 <D2>^D 0x0000162e) @ 0xf3ffa040
ACPI: MADT (v001 COMPAQ 00000083 0x00000002  0x00000000) @ 0xf3ffa100
ACPI: SPCR (v001 COMPAQ SPCRRBSU 0x00000001 <D2>^D 0x0000162e) @ 0xf3ffa1c0
ACPI: DSDT (v001 COMPAQ     DSDT 0x00000001 MSFT 0x0100000b) @ 0x00000000
ACPI: Local APIC enabled (0).
ACPI: Local APIC address 0xfee00000
ACPI: LAPIC (acpi_id[0x00] lapic_id[0x00] enabled)
Processor #0 15:2 APIC version 20
ACPI: LAPIC (acpi_id[0x02] lapic_id[0x02] disabled)
ACPI: LAPIC (acpi_id[0x04] lapic_id[0x04] disabled)
ACPI: LAPIC (acpi_id[0x06] lapic_id[0x06] enabled)
Processor #6 15:2 APIC version 20
ACPI: LAPIC (acpi_id[0x01] lapic_id[0x01] enabled)
Processor #1 15:2 APIC version 20
ACPI: LAPIC (acpi_id[0x03] lapic_id[0x03] disabled)
ACPI: LAPIC (acpi_id[0x05] lapic_id[0x05] disabled)
ACPI: LAPIC (acpi_id[0x07] lapic_id[0x07] enabled)
Processor #7 15:2 APIC version 20
ACPI: LAPIC_NMI (acpi_id[0xff] dfl dfl lint[0x1])
ACPI: IOAPIC (id[0x02] address[0xfec00000] gsi_base[0])
IOAPIC[0]: apic_id 2, version 17, address 0xfec00000, GSI 0-15
ACPI: IOAPIC (id[0x03] address[0xfec01000] gsi_base[16])
IOAPIC[1]: apic_id 3, version 17, address 0xfec01000, GSI 16-31
ACPI: IOAPIC (id[0x04] address[0xfec02000] gsi_base[32])
IOAPIC[2]: apic_id 4, version 17, address 0xfec02000, GSI 32-47
ACPI: IOAPIC (id[0x05] address[0xfec03000] gsi_base[48])
IOAPIC[3]: apic_id 5, version 17, address 0xfec03000, GSI 48-63
ACPI: INT_SRC_OVR (bus 0 bus_irq 0 global_irq 2 high edge)
ACPI: IRQ0 used by override.
ACPI: IRQ2 used by override.
ACPI: IRQ9 used by override.
Enabling APIC mode:  Flat.  Using 4 I/O APICs
LAPIC enabled (0), calling get_smp_config
Using ACPI (MADT) for SMP configuration information
Allocating PCI resources starting at f5000000 (gap: f4000000:0ac00000)
Built 1 zonelists
Kernel command line: ro root=/dev/md1
mapped APIC to ffffd000 (fee00000)
mapped IOAPIC to ffffc000 (fec00000)
mapped IOAPIC to ffffb000 (fec01000)
mapped IOAPIC to ffffa000 (fec02000)
mapped IOAPIC to ffff9000 (fec03000)
Initializing CPU#0
PID hash table entries: 4096 (order: 12, 65536 bytes)
Detected 3186.683 MHz processor.
Using tsc for high-res timesource
Console: colour VGA+ 80x25
Dentry cache hash table entries: 131072 (order: 7, 524288 bytes)
Inode-cache hash table entries: 65536 (order: 6, 262144 bytes)
Memory: 3959780k/3997672k available (2257k kernel code, 36640k reserved, 649k
data, 196k init, 3080168k highmem)
Checking if this processor honours the WP bit even in supervisor mode... Ok.
Calibrating delay using timer specific routine.. 6382.33 BogoMIPS (lpj=12764671)
Mount-cache hash table entries: 512
CPU: After generic identify, caps: bfebfbff 00000000 00000000 00000000 00004400
00000000 00000000
CPU: After vendor identify, caps: bfebfbff 00000000 00000000 00000000 00004400
00000000 00000000
CPU: Trace cache: 12K uops, L1 D cache: 8K
CPU: L2 cache: 512K
CPU: L3 cache: 1024K
CPU: Physical Processor ID: 0
CPU: After all inits, caps: bfebf3ff 00000000 00000000 00000080 00004400
00000000 00000000
Intel machine check architecture supported.
Intel machine check reporting enabled on CPU#0.
CPU0: Intel P4/Xeon Extended MCE MSRs (12) available
CPU0: Thermal monitoring enabled
mtrr: v2.0 (20020519)
Enabling fast FPU save and restore... done.
Enabling unmasked SIMD FPU exception support... done.
Checking 'hlt' instruction... OK.
CPU0: Intel(R) Xeon(TM) CPU 3.20GHz stepping 05
Booting processor 1/1 eip 2000
Initializing CPU#1
Calibrating delay using timer specific routine.. 6373.16 BogoMIPS (lpj=12746338)
CPU: After generic identify, caps: bfebfbff 00000000 00000000 00000000 00004400
00000000 00000000
CPU: After vendor identify, caps: bfebfbff 00000000 00000000 00000000 00004400
00000000 00000000
CPU: Trace cache: 12K uops, L1 D cache: 8K
CPU: L2 cache: 512K
CPU: L3 cache: 1024K
CPU: Physical Processor ID: 0
CPU: After all inits, caps: bfebf3ff 00000000 00000000 00000080 00004400
00000000 00000000
Intel machine check architecture supported.
Intel machine check reporting enabled on CPU#1.
CPU1: Intel P4/Xeon Extended MCE MSRs (12) available
CPU1: Thermal monitoring enabled
CPU1: Intel(R) Xeon(TM) CPU 3.20GHz stepping 05
Booting processor 2/6 eip 2000
Initializing CPU#2
Calibrating delay using timer specific routine.. 6373.22 BogoMIPS (lpj=12746440)
CPU: After generic identify, caps: bfebfbff 00000000 00000000 00000000 00004400
00000000 00000000
CPU: After vendor identify, caps: bfebfbff 00000000 00000000 00000000 00004400
00000000 00000000
CPU: Trace cache: 12K uops, L1 D cache: 8K
CPU: L2 cache: 512K
CPU: L3 cache: 1024K
CPU: Physical Processor ID: 3
CPU: After all inits, caps: bfebf3ff 00000000 00000000 00000080 00004400
00000000 00000000
Intel machine check architecture supported.
Intel machine check reporting enabled on CPU#2.
CPU2: Intel P4/Xeon Extended MCE MSRs (12) available
CPU2: Thermal monitoring enabled
CPU2: Intel(R) Xeon(TM) CPU 3.20GHz stepping 05
Booting processor 3/7 eip 2000
Initializing CPU#3
Calibrating delay using timer specific routine.. 6372.92 BogoMIPS (lpj=12745858)
CPU: After generic identify, caps: bfebfbff 00000000 00000000 00000000 00004400
00000000 00000000
CPU: After vendor identify, caps: bfebfbff 00000000 00000000 00000000 00004400
00000000 00000000
CPU: Trace cache: 12K uops, L1 D cache: 8K
CPU: L2 cache: 512K
CPU: L3 cache: 1024K
CPU: Physical Processor ID: 3
CPU: After all inits, caps: bfebf3ff 00000000 00000000 00000080 00004400
00000000 00000000
Intel machine check architecture supported.
Intel machine check reporting enabled on CPU#3.
CPU3: Intel P4/Xeon Extended MCE MSRs (12) available
CPU3: Thermal monitoring enabled
CPU3: Intel(R) Xeon(TM) CPU 3.20GHz stepping 05
Total of 4 processors activated (25501.65 BogoMIPS).
ENABLING IO-APIC IRQs
..TIMER: vector=0x31 apic1=0 pin1=2 apic2=-1 pin2=-1
checking TSC synchronization across 4 CPUs: passed.
Brought up 4 CPUs
checking if image is initramfs... it is
Freeing initrd memory: 1032k freed
NET: Registered protocol family 16
ACPI: bus type pci registered
PCI: PCI BIOS revision 2.10 entry at 0xf0094, last bus=9
PCI: Using configuration type 1
mtrr: your CPUs had inconsistent fixed MTRR settings
mtrr: probably your BIOS does not setup all CPUs.
mtrr: corrected configuration.
ACPI: Subsystem revision 20050902
ACPI: Interpreter enabled
ACPI: Using IOAPIC for interrupt routing
ACPI: PCI Root Bridge [PCI0] (0000:00)
PCI: Probing PCI hardware (bus 00)
Boot video device is 0000:00:03.0
PCI: Ignoring BAR0-3 of IDE controller 0000:00:0f.1
ACPI: PCI Interrupt Routing Table [\_SB_.PCI0._PRT]
ACPI: PCI Root Bridge [PCI1] (0000:01)
PCI: Probing PCI hardware (bus 01)
ACPI: PCI Interrupt Routing Table [\_SB_.PCI1._PRT]
ACPI: PCI Root Bridge [PCI2] (0000:02)
PCI: Probing PCI hardware (bus 02)
ACPI: PCI Interrupt Routing Table [\_SB_.PCI2._PRT]
ACPI: PCI Root Bridge [PCI3] (0000:03)
PCI: Probing PCI hardware (bus 03)
ACPI: PCI Interrupt Routing Table [\_SB_.PCI3._PRT]
ACPI: PCI Root Bridge [PCI4] (0000:06)
PCI: Probing PCI hardware (bus 06)
ACPI: PCI Interrupt Routing Table [\_SB_.PCI4._PRT]
ACPI: PCI Interrupt Link [IUSB] (IRQs 4 5 7 10 *11 15)
ACPI: PCI Interrupt Link [IN16] (IRQs 4 5 7 10 11 15) *3
ACPI: PCI Interrupt Link [IN17] (IRQs 4 *5 7 10 11 15)
ACPI: PCI Interrupt Link [IN18] (IRQs 4 5 *7 10 11 15)
ACPI: PCI Interrupt Link [IN19] (IRQs 4 5 7 10 11 15) *0, disabled.
ACPI: PCI Interrupt Link [IN20] (IRQs 4 5 *7 10 11 15)
ACPI: PCI Interrupt Link [IN21] (IRQs 4 5 7 10 11 15) *0, disabled.
ACPI: PCI Interrupt Link [IN22] (IRQs 4 5 7 10 11 15) *0, disabled.
ACPI: PCI Interrupt Link [IN23] (IRQs 4 5 7 10 11 15) *0, disabled.
ACPI: PCI Interrupt Link [IN24] (IRQs 4 5 7 10 11 15) *0, disabled.
ACPI: PCI Interrupt Link [IN25] (IRQs 4 5 7 10 11 15) *0, disabled.
ACPI: PCI Interrupt Link [IN26] (IRQs 4 5 7 *10 11 15)
ACPI: PCI Interrupt Link [IN27] (IRQs 4 5 7 10 11 *15)
ACPI: PCI Interrupt Link [IN28] (IRQs 4 5 7 10 11 15) *0, disabled.
ACPI: PCI Interrupt Link [IN29] (IRQs 4 5 7 10 *11 15)
ACPI: PCI Interrupt Link [IN30] (IRQs 4 5 7 10 11 15) *0, disabled.
ACPI: PCI Interrupt Link [IN31] (IRQs 4 5 7 10 *11 15)
ACPI: PCI Interrupt Link [IN32] (IRQs 4 5 7 10 11 15) *0, disabled.
ACPI: PCI Interrupt Link [IN33] (IRQs 4 5 7 10 11 15) *0, disabled.
ACPI: PCI Interrupt Link [IN34] (IRQs 4 5 7 10 11 15) *0, disabled.
Linux Plug and Play Support v0.97 (c) Adam Belay
pnp: PnP ACPI init
pnp: PnP ACPI: found 8 devices
SCSI subsystem initialized
PCI: Using ACPI for IRQ routing
PCI: If a device doesn't work, try "pci=routeirq".  If it helps, post a report
PCI: Device 0000:00:00.0 not found by BIOS
PCI: Device 0000:00:00.1 not found by BIOS
PCI: Device 0000:00:00.2 not found by BIOS
PCI: Device 0000:00:0f.0 not found by BIOS
PCI: Device 0000:00:0f.3 not found by BIOS
PCI: Device 0000:00:10.0 not found by BIOS
PCI: Device 0000:00:10.2 not found by BIOS
PCI: Device 0000:00:11.0 not found by BIOS
PCI: Device 0000:00:11.2 not found by BIOS
pnp: 00:00: ioport range 0xf50-0xf58 has been reserved
pnp: 00:00: ioport range 0x408-0x40f has been reserved
pnp: 00:00: ioport range 0x900-0x903 could not be reserved
pnp: 00:00: ioport range 0x910-0x911 could not be reserved
pnp: 00:00: ioport range 0x920-0x923 could not be reserved
pnp: 00:00: ioport range 0x930-0x937 has been reserved
pnp: 00:00: ioport range 0x940-0x947 has been reserved
Machine check exception polling timer started.
audit: initializing netlink socket (disabled)
audit(1138616034.080:1): initialized
highmem bounce pool size: 64 pages
Total HugeTLB memory allocated, 0
VFS: Disk quotas dquot_6.5.1
Dquot-cache hash table entries: 1024 (order 0, 4096 bytes)
SGI XFS with ACLs, large block numbers, no debug enabled
SGI XFS Quota Management subsystem
Initializing Cryptographic API
ksign: Installing public key data
Loading keyring
io scheduler noop registered
io scheduler anticipatory registered
io scheduler deadline registered
io scheduler cfq registered
ACPI: Thermal Zone [THM0] (8 C)
Real Time Clock Driver v1.12
PNP: PS/2 Controller [PNP0303:KBD,PNP0f0e:PS2M] at 0x60,0x64 irq 1,12
serio: i8042 AUX port at 0x60,0x64 irq 12
serio: i8042 KBD port at 0x60,0x64 irq 1
RAMDISK driver initialized: 16 RAM disks of 16384K size 1024 blocksize
Fusion MPT base driver 3.03.04
Copyright (c) 1999-2005 LSI Logic Corporation
Fusion MPT SPI Host driver 3.03.04
ACPI: PCI Interrupt 0000:06:02.0[A] -> GSI 26 (level, low) -> IRQ 169
mptbase: Initiating ioc0 bringup
ioc0: 53C1030: Capabilities={Initiator,Target}
scsi0 : ioc0: LSI53C1030, FwRev=01032700h, Ports=1, MaxQ=255, IRQ=169
  Vendor: COMPAQ    Model: BD14685A26        Rev: HPB6
  Type:   Direct-Access                      ANSI SCSI revision: 03
SCSI device sda: 286749488 512-byte hdwr sectors (146816 MB)
SCSI device sda: drive cache: write through
SCSI device sda: 286749488 512-byte hdwr sectors (146816 MB)
SCSI device sda: drive cache: write through
 sda: sda1 sda2 sda3 sda4 < sda5 sda6 sda7 >
sd 0:0:0:0: Attached scsi disk sda
  Vendor: COMPAQ    Model: BD14686225        Rev: HPB6
  Type:   Direct-Access                      ANSI SCSI revision: 03
SCSI device sdb: 286749488 512-byte hdwr sectors (146816 MB)
SCSI device sdb: drive cache: write through
SCSI device sdb: 286749488 512-byte hdwr sectors (146816 MB)
SCSI device sdb: drive cache: write through
 sdb: sdb1 sdb2 sdb3 sdb4 < sdb5 sdb6 sdb7 >
sd 0:0:1:0: Attached scsi disk sdb
  Vendor: COMPAQ    Model: BD14686225        Rev: HPB6
  Type:   Direct-Access                      ANSI SCSI revision: 03
SCSI device sdc: 286749488 512-byte hdwr sectors (146816 MB)
SCSI device sdc: drive cache: write through
SCSI device sdc: 286749488 512-byte hdwr sectors (146816 MB)
SCSI device sdc: drive cache: write through
 sdc: sdc1 sdc2 sdc3 sdc4 < sdc5 sdc6 sdc7 >
sd 0:0:2:0: Attached scsi disk sdc
  Vendor: COMPAQ    Model: BD14687B52        Rev: HPB5
  Type:   Direct-Access                      ANSI SCSI revision: 03
SCSI device sdd: 286749488 512-byte hdwr sectors (146816 MB)
SCSI device sdd: drive cache: write through
SCSI device sdd: 286749488 512-byte hdwr sectors (146816 MB)
SCSI device sdd: drive cache: write through
 sdd: sdd1 sdd2 sdd3 sdd4 < sdd5 sdd6 sdd7 >
sd 0:0:3:0: Attached scsi disk sdd
  Vendor: COMPAQ    Model: BD14686225        Rev: HPB6
  Type:   Direct-Access                      ANSI SCSI revision: 03
SCSI device sde: 286749488 512-byte hdwr sectors (146816 MB)
SCSI device sde: drive cache: write through
SCSI device sde: 286749488 512-byte hdwr sectors (146816 MB)
SCSI device sde: drive cache: write through
 sde: sde1 sde2 sde3 sde4 < sde5 sde6 sde7 >
sd 0:0:4:0: Attached scsi disk sde
  Vendor: COMPAQ    Model: BD14686225        Rev: HPB6
  Type:   Direct-Access                      ANSI SCSI revision: 03
SCSI device sdf: 286749488 512-byte hdwr sectors (146816 MB)
SCSI device sdf: drive cache: write through
SCSI device sdf: 286749488 512-byte hdwr sectors (146816 MB)
SCSI device sdf: drive cache: write through
 sdf: sdf1 sdf2 sdf3 sdf4 < sdf5 sdf6 sdf7 >
sd 0:0:5:0: Attached scsi disk sdf
  Vendor: COMPAQ    Model: PROLIANT 4LCI     Rev: 1.84
  Type:   Processor                          ANSI SCSI revision: 02
ACPI: PCI Interrupt 0000:06:02.1[B] -> GSI 27 (level, low) -> IRQ 177
mptbase: Initiating ioc1 bringup
ioc1: 53C1030: Capabilities={Initiator,Target}
scsi1 : ioc1: LSI53C1030, FwRev=01032700h, Ports=1, MaxQ=255, IRQ=177
  Vendor: Promise   Model: 12 Disk RAID5     Rev: V0.0
  Type:   Direct-Access                      ANSI SCSI revision: 04
SCSI device sdg: 2674804352 1024-byte hdwr sectors (2739000 MB)
SCSI device sdg: drive cache: write through
SCSI device sdg: 2674804352 1024-byte hdwr sectors (2739000 MB)
SCSI device sdg: drive cache: write through
 sdg: unknown partition table
sd 1:0:1:0: Attached scsi disk sdg
  Vendor: Promise   Model: 12 Disk RAID5     Rev: V0.0
  Type:   Direct-Access                      ANSI SCSI revision: 04
SCSI device sdh: 2674804352 1024-byte hdwr sectors (2739000 MB)
SCSI device sdh: drive cache: write through
SCSI device sdh: 2674804352 1024-byte hdwr sectors (2739000 MB)
SCSI device sdh: drive cache: write through
 sdh: unknown partition table
sd 1:0:3:0: Attached scsi disk sdh
Fusion MPT misc device (ioctl) driver 3.03.04
mptctl: Registered with Fusion MPT base driver
mptctl: /dev/mptctl @ (major,minor=10,220)
mice: PS/2 mouse device common for all mice
md: raid1 personality registered as nr 3
md: raid5 personality registered as nr 4
raid5: automatically using best checksumming function: pIII_sse
   pIII_sse  :  3459.000 MB/sec
raid5: using function: pIII_sse (3459.000 MB/sec)
md: md driver 0.90.3 MAX_MD_DEVS=256, MD_SB_DISKS=27
md: bitmap version 4.39
device-mapper: 4.4.0-ioctl (2005-01-12) initialised: dm-devel
NET: Registered protocol family 2
input: AT Translated Set 2 keyboard as /class/input/input0
IP route cache hash table entries: 131072 (order: 7, 524288 bytes)
TCP established hash table entries: 524288 (order: 10, 4194304 bytes)
TCP bind hash table entries: 65536 (order: 7, 524288 bytes)
TCP: Hash tables configured (established 524288 bind 65536)
TCP reno registered
TCP bic registered
Initializing IPsec netlink socket
NET: Registered protocol family 1
NET: Registered protocol family 17
Starting balanced_irq
Using IPI Shortcut mode
Freeing unused kernel memory: 196k freed
Uniform Multi-Platform E-IDE driver Revision: 7.00alpha2
ide: Assuming 33MHz system bus speed for PIO modes; override with idebus=xx
md: Autodetecting RAID arrays.
md: autorun ...
md: considering sdf7 ...
md:  adding sdf7 ...
md: sdf5 has different UUID to sdf7
md: sdf3 has different UUID to sdf7
md: sdf2 has different UUID to sdf7
md: sdf1 has different UUID to sdf7
md:  adding sde7 ...
md: sde5 has different UUID to sdf7
md: sde3 has different UUID to sdf7
md: sde2 has different UUID to sdf7
md: sde1 has different UUID to sdf7
md:  adding sdd7 ...
md: sdd5 has different UUID to sdf7
md: sdd3 has different UUID to sdf7
md: sdd2 has different UUID to sdf7
md: sdd1 has different UUID to sdf7
md:  adding sdc7 ...
md: sdc5 has different UUID to sdf7
md: sdc3 has different UUID to sdf7
md: sdc2 has different UUID to sdf7
md: sdc1 has different UUID to sdf7
md:  adding sdb7 ...
md: sdb5 has different UUID to sdf7
md: sdb3 has different UUID to sdf7
md: sdb2 has different UUID to sdf7
md: sdb1 has different UUID to sdf7
md:  adding sda7 ...
md: sda5 has different UUID to sdf7
md: sda3 has different UUID to sdf7
md: sda2 has different UUID to sdf7
md: sda1 has different UUID to sdf7
md: created md4
md: bind<sda7>
md: bind<sdb7>
md: bind<sdc7>
md: bind<sdd7>
md: bind<sde7>
md: bind<sdf7>
md: running: <sdf7><sde7><sdd7><sdc7><sdb7><sda7>
raid5: device sdf7 operational as raid disk 5
raid5: device sde7 operational as raid disk 4
raid5: device sdd7 operational as raid disk 3
raid5: device sdc7 operational as raid disk 2
raid5: device sdb7 operational as raid disk 1
raid5: device sda7 operational as raid disk 0
raid5: allocated 6285kB for md4
raid5: raid level 5 set md4 active with 6 out of 6 devices, algorithm 2
RAID5 conf printout:
 --- rd:6 wd:6 fd:0
 disk 0, o:1, dev:sda7
 disk 1, o:1, dev:sdb7
 disk 2, o:1, dev:sdc7
 disk 3, o:1, dev:sdd7
 disk 4, o:1, dev:sde7
 disk 5, o:1, dev:sdf7
md: considering sdf5 ...
md:  adding sdf5 ...
md: sdf3 has different UUID to sdf5
md: sdf2 has different UUID to sdf5
md: sdf1 has different UUID to sdf5
md:  adding sde5 ...
md: sde3 has different UUID to sdf5
md: sde2 has different UUID to sdf5
md: sde1 has different UUID to sdf5
md:  adding sdd5 ...
md: sdd3 has different UUID to sdf5
md: sdd2 has different UUID to sdf5
md: sdd1 has different UUID to sdf5
md:  adding sdc5 ...
md: sdc3 has different UUID to sdf5
md: sdc2 has different UUID to sdf5
md: sdc1 has different UUID to sdf5
md:  adding sdb5 ...
md: sdb3 has different UUID to sdf5
md: sdb2 has different UUID to sdf5
md: sdb1 has different UUID to sdf5
md:  adding sda5 ...
md: sda3 has different UUID to sdf5
md: sda2 has different UUID to sdf5
md: sda1 has different UUID to sdf5
md: created md3
md: bind<sda5>
md: bind<sdb5>
md: bind<sdc5>
md: bind<sdd5>
md: bind<sde5>
md: bind<sdf5>
md: running: <sdf5><sde5><sdd5><sdc5><sdb5><sda5>
md: md3: raid array is not clean -- starting background reconstruction
raid5: device sdf5 operational as raid disk 5
raid5: device sde5 operational as raid disk 4
raid5: device sdd5 operational as raid disk 3
raid5: device sdc5 operational as raid disk 2
raid5: device sdb5 operational as raid disk 1
raid5: device sda5 operational as raid disk 0
raid5: allocated 6285kB for md3
raid5: raid level 5 set md3 active with 6 out of 6 devices, algorithm 2
RAID5 conf printout:
 --- rd:6 wd:6 fd:0
 disk 0, o:1, dev:sda5
 disk 1, o:1, dev:sdb5
 disk 2, o:1, dev:sdc5
 disk 3, o:1, dev:sdd5
 disk 4, o:1, dev:sde5
 disk 5, o:1, dev:sdf5
md: considering sdf3 ...
md:  adding sdf3 ...
md: sdf2 has different UUID to sdf3
md: sdf1 has different UUID to sdf3
md:  adding sde3 ...
md: sde2 has different UUID to sdf3
md: sde1 has different UUID to sdf3
md:  adding sdd3 ...
md: sdd2 has different UUID to sdf3
md: sdd1 has different UUID to sdf3
md:  adding sdc3 ...
md: sdc2 has different UUID to sdf3
md: sdc1 has different UUID to sdf3
md:  adding sdb3 ...
md: sdb2 has different UUID to sdf3
md: sdb1 has different UUID to sdf3
md:  adding sda3 ...
md: sda2 has different UUID to sdf3
md: sda1 has different UUID to sdf3
md: syncing RAID array md3
md: minimum _guaranteed_ reconstruction speed: 1000 KB/sec/disc.
md: using maximum available idle IO bandwidth (but not more than 200000 KB/sec)
for reconstruction.
md: using 128k window, over a total of 1742912 blocks.
md: created md2
md: bind<sda3>
md: bind<sdb3>
md: bind<sdc3>
md: bind<sdd3>
md: bind<sde3>
md: bind<sdf3>
md: running: <sdf3><sde3><sdd3><sdc3><sdb3><sda3>
raid5: device sdf3 operational as raid disk 5
raid5: device sde3 operational as raid disk 4
raid5: device sdd3 operational as raid disk 3
raid5: device sdc3 operational as raid disk 2
raid5: device sdb3 operational as raid disk 1
raid5: device sda3 operational as raid disk 0
raid5: allocated 6285kB for md2
raid5: raid level 5 set md2 active with 6 out of 6 devices, algorithm 2
RAID5 conf printout:
 --- rd:6 wd:6 fd:0
 disk 0, o:1, dev:sda3
 disk 1, o:1, dev:sdb3
 disk 2, o:1, dev:sdc3
 disk 3, o:1, dev:sdd3
 disk 4, o:1, dev:sde3
 disk 5, o:1, dev:sdf3
md: considering sdf2 ...
md:  adding sdf2 ...
md: sdf1 has different UUID to sdf2
md:  adding sde2 ...
md: sde1 has different UUID to sdf2
md:  adding sdd2 ...
md: sdd1 has different UUID to sdf2
md:  adding sdc2 ...
md: sdc1 has different UUID to sdf2
md:  adding sdb2 ...
md: sdb1 has different UUID to sdf2
md:  adding sda2 ...
md: sda1 has different UUID to sdf2
md: created md1
md: bind<sda2>
md: bind<sdb2>
md: bind<sdc2>
md: bind<sdd2>
md: bind<sde2>
md: bind<sdf2>
md: running: <sdf2><sde2><sdd2><sdc2><sdb2><sda2>
raid5: device sdf2 operational as raid disk 5
raid5: device sde2 operational as raid disk 4
raid5: device sdd2 operational as raid disk 3
raid5: device sdc2 operational as raid disk 2
raid5: device sdb2 operational as raid disk 1
raid5: device sda2 operational as raid disk 0
raid5: allocated 6285kB for md1
raid5: raid level 5 set md1 active with 6 out of 6 devices, algorithm 2
RAID5 conf printout:
 --- rd:6 wd:6 fd:0
 disk 0, o:1, dev:sda2
 disk 1, o:1, dev:sdb2
 disk 2, o:1, dev:sdc2
 disk 3, o:1, dev:sdd2
 disk 4, o:1, dev:sde2
 disk 5, o:1, dev:sdf2
md: considering sdf1 ...
md:  adding sdf1 ...
md:  adding sde1 ...
md:  adding sdd1 ...
md:  adding sdc1 ...
md:  adding sdb1 ...
md:  adding sda1 ...
md: created md0
md: bind<sda1>
md: bind<sdb1>
md: bind<sdc1>
md: bind<sdd1>
md: bind<sde1>
md: bind<sdf1>
md: running: <sdf1><sde1><sdd1><sdc1><sdb1><sda1>
raid1: raid set md0 active with 6 out of 6 mirrors
md: ... autorun DONE.
md: Autodetecting RAID arrays.
md: autorun ...
md: ... autorun DONE.
md: Autodetecting RAID arrays.
md: autorun ...
md: ... autorun DONE.
md: Autodetecting RAID arrays.
md: autorun ...
md: ... autorun DONE.
md: Autodetecting RAID arrays.
md: autorun ...
md: ... autorun DONE.
EXT3-fs: INFO: recovery required on readonly filesystem.
EXT3-fs: write access will be enabled during recovery.
kjournald starting.  Commit interval 5 seconds
EXT3-fs: md1: orphan cleanup on readonly fs
ext3_orphan_cleanup: deleting unreferenced inode 84028
EXT3-fs: md1: 1 orphan inode deleted
EXT3-fs: recovery complete.
EXT3-fs: mounted filesystem with ordered data mode.
usbcore: registered new driver usbfs
usbcore: registered new driver hub
sd 0:0:0:0: Attached scsi generic sg0 type 0
sd 0:0:1:0: Attached scsi generic sg1 type 0
sd 0:0:2:0: Attached scsi generic sg2 type 0
sd 0:0:3:0: Attached scsi generic sg3 type 0
sd 0:0:4:0: Attached scsi generic sg4 type 0
sd 0:0:5:0: Attached scsi generic sg5 type 0
 0:0:15:0: Attached scsi generic sg6 type 3
sd 1:0:1:0: Attached scsi generic sg7 type 0
sd 1:0:3:0: Attached scsi generic sg8 type 0
SvrWks CSB5: IDE controller at PCI slot 0000:00:0f.1
SvrWks CSB5: chipset revision 147
SvrWks CSB5: not 100% native mode: will probe irqs later
SvrWks CSB5: simplex device: DMA forced
    ide0: BM-DMA at 0x2000-0x2007, BIOS settings: hda:pio, hdb:pio
SvrWks CSB5: simplex device: DMA forced
    ide1: BM-DMA at 0x2008-0x200f, BIOS settings: hdc:pio, hdd:pio
Probing IDE interface ide0...
hda: COMPAQ CD-ROM SN-124, ATAPI CD/DVD-ROM drive
ide0 at 0x1f0-0x1f7,0x3f6 on irq 14
Probing IDE interface ide1...
Ethernet Channel Bonding Driver: v2.6.5 (November 4, 2005)
bonding: MII link monitoring set to 100 ms
tg3.c:v3.47 (Dec 28, 2005)
ACPI: PCI Interrupt 0000:02:01.0[A] -> GSI 29 (level, low) -> IRQ 185
eth0: Tigon3 [partno(NA) rev 1002 PHY(5703)] (PCIX:100MHz:64-bit)
10/100/1000BaseT Ethernet 00:0b:cd:ef:f1:d7
eth0: RXcsums[1] LinkChgREG[0] MIirq[0] ASF[0] Split[0] WireSpeed[1] TSOcap[1]
eth0: dma_rwctrl[769f4000]
ACPI: PCI Interrupt 0000:02:02.0[A] -> GSI 31 (level, low) -> IRQ 193
eth1: Tigon3 [partno(NA) rev 1002 PHY(5703)] (PCIX:100MHz:64-bit)
10/100/1000BaseT Ethernet 00:0b:cd:ef:f1:d6
eth1: RXcsums[1] LinkChgREG[0] MIirq[0] ASF[0] Split[0] WireSpeed[1] TSOcap[1]
eth1: dma_rwctrl[769f4000]
Intel(R) PRO/1000 Network Driver - version 6.1.16-k2-NAPI
Copyright (c) 1999-2005 Intel Corporation.
ACPI: PCI Interrupt 0000:03:01.0[A] -> GSI 20 (level, low) -> IRQ 201
e1000: eth2: e1000_probe: Intel(R) PRO/1000 Network Connection
pci_hotplug: PCI Hot Plug PCI Core version: 0.5
cpqphp: Compaq Hot Plug PCI Controller Driver version: 0.9.8
ACPI: PCI Interrupt 0000:06:1e.0[A] -> GSI 18 (level, low) -> IRQ 209
cpqphp: Hot Plug Subsystem Device ID: a2fe
cpqphp: Initializing the PCI hot plug controller residing on PCI bus 6
PCI: Using BIOS Interrupt Routing Table
PCI: Using BIOS Interrupt Routing Table
piix4_smbus 0000:00:0f.0: Found 0000:00:0f.0 device
piix4_smbus 0000:00:0f.0: Working around buggy BIOS (I2C)
ohci_hcd: 2005 April 22 USB 1.1 'Open' Host Controller (OHCI) Driver (PCI)
ACPI: PCI Interrupt Link [IUSB] enabled at IRQ 11
ACPI: PCI Interrupt 0000:00:0f.2[A] -> Link [IUSB] -> GSI 11 (level, low) -> IRQ 11
ohci_hcd 0000:00:0f.2: OHCI Host Controller
ohci_hcd 0000:00:0f.2: new USB bus registered, assigned bus number 1
ohci_hcd 0000:00:0f.2: irq 11, io mem 0xf5ef0000
hub 1-0:1.0: USB hub found
hub 1-0:1.0: 4 ports detected
md: Autodetecting RAID arrays.
md: autorun ...
md: ... autorun DONE.
EXT3 FS on md1, internal journal
kjournald starting.  Commit interval 5 seconds
EXT3 FS on md0, internal journal
EXT3-fs: mounted filesystem with ordered data mode.
XFS mounting filesystem md2
Starting XFS recovery on filesystem: md2 (logdev: internal)
Ending XFS recovery on filesystem: md2 (logdev: internal)
XFS mounting filesystem md3
Starting XFS recovery on filesystem: md3 (logdev: internal)
Ending XFS recovery on filesystem: md3 (logdev: internal)
XFS mounting filesystem md4
Starting XFS recovery on filesystem: md4 (logdev: internal)
Ending XFS recovery on filesystem: md4 (logdev: internal)
Adding 1469908k swap on /dev/sda6.  Priority:1 extents:1 across:1469908k
Adding 1469908k swap on /dev/sdb6.  Priority:1 extents:1 across:1469908k
Adding 1469908k swap on /dev/sdc6.  Priority:1 extents:1 across:1469908k
Adding 1469908k swap on /dev/sdd6.  Priority:1 extents:1 across:1469908k
Adding 1469908k swap on /dev/sde6.  Priority:1 extents:1 across:1469908k
Adding 1469908k swap on /dev/sdf6.  Priority:1 extents:1 across:1469908k
=======

Installed rpm packages:
=====
[root@z4 ~]# rpm -qa | egrep -i "gfs|cman|dlm|magma|ccsd|gnbd|gulm" | sort
cman-1.0.4-0.FC5.1
cman-devel-1.0.4-0.FC5.1
cman-kernel-2.6.14.1-20051219.162641.FC5.10
cman-kernel-smp-2.6.14.1-20051219.162641.FC5.10
cman-kernheaders-2.6.14.1-20051219.162641.FC5.10
dlm-1.0.0-9.FC5
dlm-devel-1.0.0-9.FC5
dlm-kernel-2.6.14.1-20051219.162641.FC5.8
dlm-kernel-smp-2.6.14.1-20051219.162641.FC5.8
dlm-kernheaders-2.6.14.1-20051219.162641.FC5.8
GFS-6.1.4-0.FC5.1
GFS-kernel-2.6.14.1-20051219.162641.FC5.9
GFS-kernel-smp-2.6.14.1-20051219.162641.FC5.9
GFS-kernheaders-2.6.14.1-20051219.162641.FC5.9
gnbd-1.0.2-0.2
gnbd-kernel-2.6.14.0-20051108.134753.FC5.14
gnbd-kernel-smp-2.6.14.0-20051108.134753.FC5.14
gnbd-kernheaders-2.6.14.0-20051108.134753.FC5.14
gulm-1.0.5-0.FC5.1
gulm-devel-1.0.5-0.FC5.1
magma-1.0.3-3.1
magma-devel-1.0.3-3.1
magma-plugins-1.0.5-0.FC5.1
=========

=== cluster.conf ===
<?xml version="1.0"?>
<cluster name="test" config_version="1">
    <cman two_node="1" expected_votes="1">
    </cman>
    <clusternodes>
      <clusternode name="z3" votes="1">
       <fence>
        <method name="1">
         <device name="HPiLO_z3"/>
        </method>
       </fence>
      </clusternode>
     <clusternode name="z4" votes="1">
      <fence>
       <method name="1">
        <device name="HPiLO_z4"/>
       </method>
      </fence>
    </clusternode>
   </clusternodes>
  <fence_devices>
   <fencedevice agent="fence_ilo" hostname="10.2.3.4" name="HPiLO_z4"
login="claman" passwd="6cdgyvkBWblhhor9"/>
   <fencedevice agent="fence_ilo" hostname="10.2.3.3" name="HPiLO_z3"
login="claman" passwd="6SpdmkmwttwgKv"/>
  </fence_devices>
 </cluster>
=======

With cman-1.0.0 everything works fine.

Comment 1 Christine Caulfield 2006-02-16 08:55:33 UTC
Sounds like you've got your kernel & userspace out of step. Make sure you are
using the latest cman_tool with that kernel.

Comment 2 Christine Caulfield 2006-02-20 14:22:43 UTC
Yep, the userland packages are definitely out of date, but the kernel packages
are more recent.


Comment 3 Andrew Okhmat 2006-02-20 15:34:01 UTC
With the new packages I got the same error.
I tried different versions; the last working one is 1.01.00.
The problem is probably in the following changes: diff -i cnxman.c cnxman.c.new

--- cnxman.c	2005-10-03 16:01:13.000000000 -0400
+++ cnxman.c.new	2006-02-20 10:32:27.000000000 -0500
@@ -32,7 +32,7 @@
 #include "sm_user.h"
 #include "config.h"
 
-#define CMAN_RELEASE_NAME "1.01.00"
+#define CMAN_RELEASE_NAME "2.6.15.0-20051219.162641.FC5.11.7"
 
 static void process_incoming_packet(struct cl_comms_socket *csock,
 				    struct msghdr *msg, struct kvec *vec, int veclen, int len);
@@ -55,6 +55,7 @@
 static int send_or_queue_message(struct socket *sock, void *buf, int len, struct sockaddr_cl *caddr,
 				 unsigned int flags);
 static struct cl_comms_socket *get_next_interface(struct cl_comms_socket *cur);
+static struct cl_comms_socket *get_peer_interface(int if_num, int mcast);
 static void check_for_unacked_nodes(void);
 static void free_cluster_sockets(void);
 static uint16_t generate_cluster_id(char *name);
@@ -859,7 +860,7 @@
         /* Have we received this message before ? If so just ignore it, it's a
 	 * resend for someone else's benefit */
 	if (!(flags & MSG_NOACK) &&
-	    rem_node && le16_to_cpu(header->seq) == rem_node->last_seq_recv) {
+	    rem_node && ((short)le16_to_cpu(header->seq) <= (short)rem_node->last_seq_recv)) {
 		P_COMMS
 		    ("Discarding message - Already seen this sequence number %d\n",
 		     rem_node->last_seq_recv);
@@ -1168,6 +1169,7 @@
 static int add_clsock(int broadcast, int number, struct socket *sock,
 		      struct file *file)
 {
+	struct cl_comms_socket *peer;
 	struct cl_comms_socket *newsock =
 	    kmalloc(sizeof (struct cl_comms_socket), GFP_KERNEL);
 	if (!newsock)
@@ -1198,9 +1200,17 @@
 				    &newsock->addr_len, 0);
 
 	num_interfaces = max(num_interfaces, newsock->number);
-	if (!current_interface && newsock->broadcast)
+	if (!current_interface && newsock->recv_only)
 		current_interface = newsock;
 
+	/* Get peer, if this fails because we're the first socket with this
+	   number then that's fine. The subsequent call will fill in both */
+	peer = get_peer_interface(number, !broadcast);
+	if (peer) {
+		peer->peer = newsock;
+		newsock->peer = peer;
+	}
+
 	/* Hook data_ready */
 	newsock->sock->sk->sk_data_ready = cnxman_data_ready;
 
@@ -1754,14 +1764,14 @@
 	if (!atomic_read(&cnxman_running))
 		return -ENOTCONN;
 
-	if (!we_are_a_cluster_member)
-		return -ENOENT;
+	/* FORCE overrides several checks */
+	if (!(leave_flags & CLUSTER_LEAVEFLAG_FORCE)) {
+		if (!we_are_a_cluster_member)
+			return -ENOENT;
 
-	if (in_transition())
-		return -EBUSY;
+		if (in_transition())
+			return -EBUSY;
 
-	/* Ignore the use count if FORCE is set */
-	if (!(leave_flags & CLUSTER_LEAVEFLAG_FORCE)) {
 		if (atomic_read(&use_count))
 			return -ENOTCONN;
 	}
@@ -2018,8 +2028,8 @@
 	vec[0].iov_len = saved_msg_len;
 
 	memset(&msg, 0, sizeof (msg));
-	msg.msg_name = &current_interface->saddr;
-	msg.msg_namelen = current_interface->addr_len;
+	msg.msg_name = &current_interface->peer->saddr;
+	msg.msg_namelen = current_interface->peer->addr_len;
 
 	result = kernel_sendmsg(current_interface->sock, &msg, vec, 1, saved_msg_len);
 
@@ -2126,22 +2136,22 @@
 	struct sockaddr_in6 daddr;
 	struct cl_comms_socket *clsock;
 	int result = 0;
-	int errors = 0;
+	static int errors = 0;
 
 	our_msg->msg_name = &daddr;
 
 	list_for_each_entry(clsock, &socket_list, list) {
 
-		/* Don't send out a recv-only socket */
-		if (!clsock->recv_only) {
+		/* Don't send out of a broadcast socket */
+		if (clsock->recv_only) {
 
 			/* For temporary node IDs send to the node's real IP address */
 			if (nodeid < 0) {
 				get_addr_from_temp_nodeid(nodeid, (char *)&daddr, &our_msg->msg_namelen);
 			}
 			else {
-				memcpy(&daddr, &clsock->saddr, clsock->addr_len);
-				our_msg->msg_namelen = clsock->addr_len;
+				memcpy(&daddr, &clsock->peer->saddr, clsock->peer->addr_len);
+				our_msg->msg_namelen = clsock->peer->addr_len;
 			}
 
 			result = __send_and_save(clsock, our_msg, vec, veclen,
@@ -2149,11 +2159,13 @@
 						 !(flags & MSG_NOACK));
 			if (result < 0)
 				errors++;
+			else
+				errors = 0;
 		}
 	}
 
 	/* If all the interfaces error then die */
-	if (errors == num_interfaces) {
+	if (errors >= num_interfaces * cman_config.max_retries) {
 		printk(KERN_ERR CMAN_NAME ": No functional network interfaces, leaving cluster\n");
 		quit_threads = 1;
 		wake_up_interruptible(&cnxman_waitq);
@@ -2347,8 +2359,8 @@
 	else {
 		/* Send to only the current socket - resends will use the
 		 * others if necessary */
-		our_msg.msg_name = &current_interface->saddr;
-		our_msg.msg_namelen = current_interface->addr_len;
+		our_msg.msg_name = &current_interface->peer->saddr;
+		our_msg.msg_namelen = current_interface->peer->addr_len;
 
 		result =
 		    __send_and_save(current_interface, &our_msg,
@@ -3092,7 +3104,7 @@
 		struct cl_comms_socket *sock;
 		sock = list_entry(socklist, struct cl_comms_socket, list);
 
-		if (!sock->recv_only && sock->number == next)
+		if (sock->recv_only && sock->number == next)
 			return sock;
 	}
 
@@ -3100,6 +3112,22 @@
 	return NULL;
 }
 
+static struct cl_comms_socket *get_peer_interface(int if_num, int mcast)
+{
+	struct list_head *socklist;
+
+	list_for_each(socklist, &socket_list) {
+		struct cl_comms_socket *sock;
+		sock = list_entry(socklist, struct cl_comms_socket, list);
+
+		if (sock->broadcast == mcast && sock->number == if_num)
+			return sock;
+	}
+
+	return NULL;
+}
+
+
 /* MUST be called with the barrier list lock held */
 static struct cl_barrier *find_barrier(char *name)
 {


Comment 4 Christine Caulfield 2006-02-21 16:25:25 UTC
Yes, that's a fix for bz#166752

The real problem here is that the kernel & userland are built from different CVS
branches.
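
For readers following the diff in comment 3, one behavioural change is the "leave the cluster" threshold in the send loop: the error counter becomes static, a successful send resets it, and the limit becomes num_interfaces * cman_config.max_retries. A minimal user-space sketch of the two rules follows. The struct and function names are hypothetical, the retry count passed in is an arbitrary illustration value, and the recv_only filtering in the real loop is omitted.

```c
#include <assert.h>

/* Hypothetical stand-in for the function-static "errors" counter in the patch. */
struct send_state {
	int errors;
};

/* Old rule: failures are counted within a single pass only; the node
 * leaves as soon as every interface fails in that one pass. */
static int old_should_leave(const int *results, int num_interfaces)
{
	int errors = 0;
	int i;

	assert(num_interfaces > 0);
	for (i = 0; i < num_interfaces; i++)
		if (results[i] < 0)
			errors++;
	return errors == num_interfaces;
}

/* New rule: the counter survives across passes, any successful send
 * resets it, and the node leaves only once it reaches
 * num_interfaces * max_retries. */
static int new_should_leave(struct send_state *st, const int *results,
			    int num_interfaces, int max_retries)
{
	int i;

	assert(num_interfaces > 0 && max_retries > 0);
	for (i = 0; i < num_interfaces; i++) {
		if (results[i] < 0)
			st->errors++;
		else
			st->errors = 0;
	}
	return st->errors >= num_interfaces * max_retries;
}
```

With one interface and a retry limit of 3, the old rule gives up on the first failed send, while the new rule needs three failures in a row with no success in between.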

Comment 5 Andrew Okhmat 2006-02-21 21:35:47 UTC
I tried the latest binary rpm from
http://download.fedora.redhat.com/pub/fedora/linux/core/test/4.92/i386/os/Fedora/RPMS/
and still got the error :(

CMAN: sendmsg failed: -13
CMAN: sendmsg failed: -13
CMAN: No functional network interfaces, leaving cluster
CMAN: sendmsg failed: -13
CMAN: sendmsg failed: -13
CMAN: we are leaving the cluster. Shutdown
WARNING: dlm_emergency_shutdown
WARNING: dlm_emergency_shutdown
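
As a reading aid for the log above: kernel code reports failures as negative errno values, so "-13" corresponds to errno 13, which is EACCES ("Permission denied") on Linux. A small sketch for decoding such values in user space, with a hypothetical helper name:

```c
#include <errno.h>
#include <stdio.h>
#include <string.h>

/* Translate a kernel-style return code (negative errno) into text.
 * Non-negative values are not errors. */
static const char *describe_kernel_error(int kernel_rc)
{
	if (kernel_rc >= 0)
		return "no error";
	return strerror(-kernel_rc);
}

/* Print one log-style line for a given return code. */
static void print_kernel_error(int kernel_rc)
{
	printf("rc=%d: %s\n", kernel_rc, describe_kernel_error(kernel_rc));
}
```

Calling print_kernel_error(-13) on a Linux system with the default C locale prints the EACCES description.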

Comment 6 Chris Feist 2006-03-02 16:26:49 UTC
Can you try this rpm and let me know if it works for you?

http://download.fedora.redhat.com/pub/fedora/linux/core/development/i386/Fedora/RPMS/cman-1.0.5-0.FC5.0.i386.rpm

Comment 7 max vakulenko 2006-03-03 14:58:42 UTC
No luck.
We tested this userspace tool **with the latest development version of cman_kernel**:
http://download.fedora.redhat.com/pub/fedora/linux/core/development/SRPMS/cman-kernel-2.6.15.0-20051219.162641.FC5.11.7.src.rpm

But this cman_tool, like the previous one, works well with the old cman kernel
module (without the changes to cnxman.c described above), so we think the
problem is actually not in the userspace tool but in the kernel module.

Should this bug be moved to "cman_kernel" component?

Comment 8 Christine Caulfield 2006-03-03 15:31:34 UTC
Err, Chris, the source RPM has an old cman_tool binary in it!

So "make" isn't building the new one from source!

jeltz:~/dev/rpms/cman/devel$ rm -rf cman-1.0.5
jeltz:~/dev/rpms/cman/devel$ tar -xzf cman-1.0.5.tar.gz 
jeltz:~/dev/rpms/cman/devel$ cd cman-1.0.5
jeltz:~/dev/rpms/cman/devel/cman-1.0.5$ cd cman_tool/
jeltz:~/dev/rpms/cman/devel/cman-1.0.5/cman_tool$ make
make: Nothing to be done for `all'.
jeltz:~/dev/rpms/cman/devel/cman-1.0.5/cman_tool$ ls -l 
total 180
-rwxrwxr-x  1 patrick patrick 49304 Mar  1 20:14 cman_tool
-rw-r--r--  1 patrick patrick  2328 Mar  1 20:14 cman_tool.h
drwxrwxr-x  2 patrick patrick  4096 Mar  1 20:14 CVS
-rw-r--r--  1 patrick patrick 12703 Mar  1 20:14 join.c
-rw-r--r--  1 patrick patrick 13214 Mar  1 20:14 join_ccs.c
-rw-rw-r--  1 patrick patrick 10388 Mar  1 20:14 join_ccs.o
-rw-rw-r--  1 patrick patrick 18020 Mar  1 20:14 join.o
-rw-r--r--  1 patrick patrick 17126 Mar  1 20:14 main.c
-rw-rw-r--  1 patrick patrick 25652 Mar  1 20:14 main.o
-rw-r--r--  1 patrick patrick  1567 Mar  1 20:14 Makefile


Comment 9 Chris Feist 2006-03-03 16:47:36 UTC
I've found the mistake. I've built a new rpm and will update the bug when it has
been put into fc5.

Comment 10 Chris Feist 2006-03-03 17:14:08 UTC
The new package has been built and tested. It is cman-1.0.5-0.FC5.1 and should be
appearing online soon.  Please let me know if the updated packages work for you.

Comment 11 max vakulenko 2006-03-05 14:09:23 UTC
Same results - cman ok, cman_kernel bad:

We've tested this version of cman_tool and the init script, and it works with our
custom 2.6.15 kernel with modified cman_kernel sources (cnxman.c rolled back to
1.01.00, as described above in
https://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=179734#c3)

However, with the available cman_kernel 2.6.15.0-20051219.162641.FC5.11.7
(http://download.fedora.redhat.com/pub/fedora/linux/core/development/SRPMS/cman-kernel-2.6.15.0-20051219.162641.FC5.11.7.src.rpm),
which is the same as 2.6.14.1-20051219.162641.FC5.10 except for the kernel
requirements in the .spec file, the cluster doesn't work: cman complains about
missing interfaces, as described at
https://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=179734#c5

Maybe this bug should be moved/crossposted to cman_kernel?

Comment 12 Christine Caulfield 2006-03-06 09:08:31 UTC
Can you double-check that you have the right cman_tool binary? 

If you have the updated one then running 'strace cman_tool join' should show two
calls to setsockopt(<n>, SOL_SOCKET, SO_BROADCAST, [1], 4). If you're only
seeing one then it's an old binary. If you are seeing two then it should work.

The binary in the latest package certainly does 2 on my machine.

If you're seeing 2 such calls and it isn't working, can you attach the strace to
this bugzilla please ?
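
For reference, the setsockopt line that strace prints comes from a call like the one in this sketch. It is a stand-alone illustration with a hypothetical function name, not code taken from cman_tool. As background, and as an assumption about why the check matters here: on Linux, a UDP send to a broadcast address from a socket without SO_BROADCAST set typically fails with EACCES (errno 13), which would be consistent with the "sendmsg failed: -13" messages.

```c
#include <netinet/in.h>
#include <sys/socket.h>
#include <unistd.h>

/* Open a UDP socket, enable SO_BROADCAST, and read the option back.
 * Returns 1 if the option is enabled, 0 if it is not, -1 on any error. */
static int open_broadcast_socket_and_check(void)
{
	int one = 1;
	int value = 0;
	socklen_t len = sizeof(value);
	int fd = socket(AF_INET, SOCK_DGRAM, 0);

	if (fd < 0)
		return -1;

	/* strace shows this as setsockopt(fd, SOL_SOCKET, SO_BROADCAST, [1], 4) */
	if (setsockopt(fd, SOL_SOCKET, SO_BROADCAST, &one, sizeof(one)) < 0 ||
	    getsockopt(fd, SOL_SOCKET, SO_BROADCAST, &value, &len) < 0) {
		close(fd);
		return -1;
	}

	close(fd);
	return value != 0;
}
```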

Comment 13 Chris Feist 2006-03-06 19:39:30 UTC
Can you provide the output of the following two commands:

rpm -q cman
rpm -q cman-kernel

Thanks!

Comment 14 max vakulenko 2006-03-07 10:59:50 UTC
(In reply to comment #12)
> Can you double-check that you have the right cman_tool binary? 

I'll try with strace later. Binary is correct, compiled from src.rpm 

Comment 15 max vakulenko 2006-03-07 11:07:12 UTC
(In reply to comment #13)
> Can you provide the output of the following two commands:

> rpm -q cman
cman-1.0.5-0.FC5.1 since fix 
https://bugzilla.redhat.com/bugzilla/process_bug.cgi#c10

> rpm -q cman-kernel
cman-kernel-2.6.14.1-20051219.162641.FC5.10
This is the original version with a patched spec file, so that it compiles
against the custom 2.6.15-1.1826.2.10_FC5 kernel

Comment 16 Christine Caulfield 2006-03-07 16:27:26 UTC
It works fine for me on FC5t3 with those latest RPMS:

$ rpm -qa|grep cman
cman-kernel-2.6.15.0-20051219.162641.FC5.11.7
cman-1.0.5-0.FC5.1


Comment 17 Bug Zapper 2008-04-03 16:51:42 UTC
Based on the date this bug was created, it appears to have been reported
against rawhide during the development of a Fedora release that is no
longer maintained. In order to refocus our efforts as a project we are
flagging all of the open bugs for releases which are no longer
maintained. If this bug remains in NEEDINFO thirty (30) days from now,
we will automatically close it.

If you can reproduce this bug in a maintained Fedora version (7, 8, or
rawhide), please change this bug to the respective version and change
the status to ASSIGNED. (If you're unable to change the bug's version
or status, add a comment to the bug and someone will change it for you.)

Thanks for your help, and we apologize again that we haven't handled
these issues to this point.

The process we're following is outlined here:
http://fedoraproject.org/wiki/BugZappers/F9CleanUp

We will be following the process here:
http://fedoraproject.org/wiki/BugZappers/HouseKeeping to ensure this
doesn't happen again.

Comment 18 Bug Zapper 2008-05-07 00:20:53 UTC
This bug has been in NEEDINFO for more than 30 days since feedback was
first requested. As a result we are closing it.

If you can reproduce this bug in the future against a maintained Fedora
version please feel free to reopen it against that version.

The process we're following is outlined here:
http://fedoraproject.org/wiki/BugZappers/F9CleanUp

