Description of problem:
Two nodes are running RHCS and serving GFS. When a rebooted node is booting up, clvmd appears to start before the external disks have been recognized and added to the cluster. GFS then tries to mount the filesystems, but fails:

Mounting GFS filesystems: /sbin/mount.gfs: invalid device path "/dev/lvm_vg/lvol0"
/sbin/mount.gfs: invalid device path "/dev/lvm_vg/lvol1"
/sbin/mount.gfs: invalid device path "/dev/lvm_vg/lvol2"
/sbin/mount.gfs: invalid device path "/dev/lvm_vg/lvol3"
/sbin/mount.gfs: invalid device path "/dev/lvm_vg/lvol4"
/sbin/mount.gfs: invalid device path "/dev/lvm_vg/lvol5"
/sbin/mount.gfs: invalid device path "/dev/lvm_vg/lvol6"
/sbin/mount.gfs: invalid device path "/dev/lvm_vg/lvol7"
[FAILED]

The cluster works properly if we restart clvmd after boot and then restart gfs.

Version-Release number of selected component (if applicable):
Kernel Release: 2.6.18-194.el5
RHEL Release: Red Hat Enterprise Linux Server release 5.5 (Tikanga)
Version: Linux version 2.6.18-194.el5 (mockbuild.bos.redhat.com) (gcc version 4.1.2 20080704 (Red Hat 4.1.2-48)) #1 SMP Tue Mar 16 22:03:12 EDT 2010
Platform: ppc64

How reproducible:
Always

Steps to Reproduce:
1. Two RHEL 5.5 PPC servers.
2. Connect them to two storage arrays.
3. Create 4 volumes on each array and map them to the cluster group of the two servers.
4. Install RHCS on both servers.
5. Configure RHCS, including GFS.
6. Power off one of the nodes and bring it back up.

Actual results:
When the rebooted node is booting up, clvmd starts before the external disks have been recognized and added to the cluster, so the GFS mounts fail.

Expected results:
The node boots back up and remounts all the GFS filesystems.
Additional info:

[root@tsunami ~]# chkconfig --list | egrep "clvmd|gfs |rgmanage|cman"
clvmd           0:off   1:off   2:on    3:on    4:on    5:on    6:off
cman            0:off   1:off   2:on    3:on    4:on    5:on    6:off
gfs             0:off   1:off   2:on    3:on    4:on    5:on    6:off
rgmanager       0:off   1:off   2:on    3:on    4:on    5:on    6:off

[root@tsunami ~]# service gfs status
Configured GFS mountpoints:
/home/smashmnt0
/home/smashmnt1
/home/smashmnt2
/home/smashmnt3
/home/smashmnt4
/home/smashmnt5
/home/smashmnt6
/home/smashmnt7
Active GFS mountpoints:
/home/smashmnt0
/home/smashmnt1
/home/smashmnt2
/home/smashmnt3
/home/smashmnt4
/home/smashmnt5
/home/smashmnt6
/home/smashmnt7

[root@tsunami ~]# service cman status
cman is running.

[root@washuu testutils]# chkconfig --list | egrep "clvmd|gfs |rgmanage|cman"
clvmd           0:off   1:off   2:on    3:on    4:on    5:on    6:off
cman            0:off   1:off   2:on    3:on    4:on    5:on    6:off
gfs             0:off   1:off   2:on    3:on    4:on    5:on    6:off
rgmanager       0:off   1:off   2:on    3:on    4:on    5:on    6:off

[root@washuu testutils]# service gfs status
Configured GFS mountpoints:
/home/smashmnt0
/home/smashmnt1
/home/smashmnt2
/home/smashmnt3
/home/smashmnt4
/home/smashmnt5
/home/smashmnt6
/home/smashmnt7
Active GFS mountpoints:
/home/smashmnt0
/home/smashmnt1
/home/smashmnt2
/home/smashmnt3
/home/smashmnt4
/home/smashmnt5
/home/smashmnt6
/home/smashmnt7

[root@washuu testutils]# service cman status
cman is running.
[root@tsunami ~]# cat /etc/cluster/cluster.conf
<?xml version="1.0"?>
<cluster alias="washuu-tsunami" config_version="4" name="washuu-tsunami">
        <fence_daemon post_fail_delay="0" post_join_delay="3"/>
        <clusternodes>
                <clusternode name="washuu" nodeid="1" votes="1">
                        <fence>
                                <method name="1">
                                        <device name="Persistent_Reserve" node="washuu"/>
                                </method>
                        </fence>
                </clusternode>
                <clusternode name="tsunami" nodeid="2" votes="1">
                        <fence>
                                <method name="1">
                                        <device name="Persistent_Reserve" node="tsunami"/>
                                </method>
                        </fence>
                </clusternode>
        </clusternodes>
        <cman expected_votes="1" two_node="1"/>
        <fencedevices>
                <fencedevice agent="fence_scsi" name="Persistent_Reserve"/>
        </fencedevices>
        <rm>
                <failoverdomains>
                        <failoverdomain name="tsunami1" ordered="1" restricted="0">
                                <failoverdomainnode name="washuu" priority="2"/>
                                <failoverdomainnode name="tsunami" priority="1"/>
                        </failoverdomain>
                        <failoverdomain name="washuu1" ordered="1">
                                <failoverdomainnode name="washuu" priority="1"/>
                                <failoverdomainnode name="tsunami" priority="2"/>
                        </failoverdomain>
                </failoverdomains>
                <resources>
                        <ip address="172.22.229.160" monitor_link="1"/>
                        <ip address="172.22.229.165" monitor_link="1"/>
                </resources>
                <service autostart="1" domain="tsunami1" exclusive="0" name="service-172.22.229.160" recovery="relocate">
                        <ip ref="172.22.229.160"/>
                </service>
                <service autostart="1" domain="washuu1" exclusive="0" name="service-172.22.229.165" recovery="relocate">
                        <ip ref="172.22.229.165"/>
                </service>
        </rm>
</cluster>

Console output during bootup:

Reading all physical volumes.  This may take a while...
Found volume group "VolGroup00" using metadata type lvm2
2 logical volume(s) in volume group "VolGroup00" now active
Welcome to Red Hat Enterprise Linux Server
Press 'I' to enter interactive startup.
Setting clock  (utc): Thu May 13 16:52:39 CDT 2010 [  OK  ]
Starting udev: [  OK  ]
Loading default keymap (us): [  OK  ]
Setting hostname washuu: [  OK  ]
Setting up Logical Volume Management:   connect() failed on local socket: Connection refused
  WARNING: Falling back to local file-based locking.
  Volume Groups with the clustered attribute will be inaccessible.
  2 logical volume(s) in volume group "VolGroup00" now active
[  OK  ]
Checking filesystems
Checking all file systems.
[/sbin/fsck.ext3 (1) -- /] fsck.ext3 -a /dev/VolGroup00/LogVol00
/dev/VolGroup00/LogVol00: clean, 205945/14647296 files, 1951418/14639104 blocks
[/sbin/fsck.ext3 (1) -- /boot] fsck.ext3 -a /dev/sda2
/boot: clean, 23/25688 files, 29588/102400 blocks
[  OK  ]
Remounting root filesystem in read-write mode:  [  OK  ]
Mounting local filesystems:  [  OK  ]
Enabling local filesystem quotas:  [  OK  ]
Enabling /etc/fstab swaps:  [  OK  ]
INIT: Entering runlevel: 3
Entering non-interactive startup
Starting monitoring for VG VolGroup00:   connect() failed on local socket: Connection refused
  WARNING: Falling back to local file-based locking.
  Volume Groups with the clustered attribute will be inaccessible.
  2 logical volume(s) in volume group "VolGroup00" monitored
[  OK  ]
Starting background readahead: [  OK  ]
Checking for hardware changes [  OK  ]
Bringing up loopback interface:  [  OK  ]
Bringing up interface eth0:  [  OK  ]
Starting auditd: [  OK  ]
Starting system logger: [  OK  ]
Starting kernel logger: [  OK  ]
Starting irqbalance: [  OK  ]
Starting portmap: [  OK  ]
Starting NFS statd: [  OK  ]
Starting RPC idmapd: [  OK  ]
Starting fcauthd: FC Authentication Daemon: 1.20 [  OK  ]
Starting cluster:
   Loading modules... DLM (built Mar 16 2010 22:04:41) installed
   GFS2 (built Mar 16 2010 22:06:10) installed
   done
   Mounting configfs... done
   Starting ccsd... done
   Starting cman... done
   Starting daemons... done
   Starting fencing...
   done
[  OK  ]
Loading clustered mirror log module: [  OK  ]
Starting clustered mirror log server: [  OK  ]
Starting system message bus: [  OK  ]
[  OK  ]
Bluetooth services: [  OK  ]
Mounting NFS filesystems:  FS-Cache: Loaded
[  OK  ]
Mounting other filesystems:  [  OK  ]
Starting PC/SC smart card daemon (pcscd): [  OK  ]
Starting scsi_reserve: [FAILED]
Starting clvmd: dlm: Using TCP for communications
dlm: connecting to 2
[  OK  ]
Activating VGs:   2 logical volume(s) in volume group "VolGroup00" now active
[  OK  ]
Mounting GFS filesystems: /sbin/mount.gfs: invalid device path "/dev/lvm_vg/lvol0"
/sbin/mount.gfs: invalid device path "/dev/lvm_vg/lvol1"
/sbin/mount.gfs: invalid device path "/dev/lvm_vg/lvol2"
/sbin/mount.gfs: invalid device path "/dev/lvm_vg/lvol3"
/sbin/mount.gfs: invalid device path "/dev/lvm_vg/lvol4"
/sbin/mount.gfs: invalid device path "/dev/lvm_vg/lvol5"
/sbin/mount.gfs: invalid device path "/dev/lvm_vg/lvol6"
/sbin/mount.gfs: invalid device path "/dev/lvm_vg/lvol7"
[FAILED]
Starting HAL daemon: [  OK  ]
Starting hidd: [  OK  ]
Starting autofs:  Loading autofs4: [  OK  ]
Starting automount: [  OK  ]
[  OK  ]
Starting hpiod: [  OK  ]
Starting hpssd: [  OK  ]
Starting ibmvscsisd: /etc/ibmvscsis.conf file does not exist. [FAILED]
Starting iprinit: Starting ipr initialization daemon [  OK  ]
[  OK  ]
Starting iprupdate: Checking ipr microcode levels
Completed ipr microcode updates [  OK  ]
[  OK  ]
Starting iprdump: Starting ipr dump daemon [  OK  ]
[  OK  ]
Starting sshd: [  OK  ]
Starting cups: [  OK  ]
Starting xinetd: [  OK  ]
Starting NFS services:  [  OK  ]
Starting NFS quotas: [  OK  ]
Starting NFS daemon: NFSD: Using /var/lib/nfs/v4recovery as the NFSv4 state recovery directory
NFSD: starting 90-second grace period
[  OK  ]
Starting NFS mountd: [  OK  ]
Starting vsftpd for vsftpd: [  OK  ]
Starting vpdupdate:
One possible cause is that the clustered disks take some time to appear on the system, so clvmd starts up before they are all available. One way around this would be to put an init script ahead of clvmd that waits for the drives to become available. clvmd itself cannot wait for the disks, because it does not know in advance which disks it needs.
Can you provide steps on how to implement the init script?
Just something as simple as a sleep command in a script should be enough to prove the point. Create a script such as:

#!/bin/sh
sleep 30

Then put it in a file in the right directory so that it runs at startup, e.g.:

/etc/rc.d/rc3.d/S25waitfordisks

and make it executable:

chmod +x /etc/rc.d/rc3.d/S25waitfordisks

This is just a quick proof of concept, and you might have to tweak the delay a few times to get a good compromise between the disks appearing and a longer startup time.

Chrissie
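If the fixed sleep turns out to be fragile, a slightly more robust variant of the same idea is to poll for one of the shared devices with a timeout, instead of sleeping unconditionally. This is only a sketch, not something from this bug: the device path below is an assumption (substitute whichever PV actually backs lvm_vg), and the script deliberately exits 0 on timeout so that boot continues either way.

```shell
#!/bin/sh
# Proof-of-concept S25waitfordisks variant: poll for a shared disk instead
# of a fixed sleep. DEVICE is an assumed example path; replace it with the
# real block device that backs the clustered volume group.
DEVICE=${1:-/dev/mapper/mpath0}
TIMEOUT=${2:-30}   # seconds to wait before giving up

elapsed=0
while [ ! -b "$DEVICE" ] && [ "$elapsed" -lt "$TIMEOUT" ]; do
    sleep 1
    elapsed=$((elapsed + 1))
done

if [ -b "$DEVICE" ]; then
    echo "waitfordisks: $DEVICE present after ${elapsed}s"
else
    echo "waitfordisks: timed out after ${TIMEOUT}s waiting for $DEVICE" >&2
fi
# Always return success so the rest of boot proceeds even on timeout.
exit 0
```

The advantage over a bare `sleep 30` is that on a fast day the script returns as soon as the disk shows up, and on a slow day it waits up to the full timeout rather than a fixed guess.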
We tried the suggested script, and so far the nodes are consistently remounting the GFS filesystems.
Will there be a fix in RHEL 5.6?
Abdel: We are still discussing this internally. Stay tuned. The 'needinfo' is not on LSI.
Deferring to RHEL 5.7 since we don't have a solution in place and a change of this size/complexity would need to make the Beta development window, which we've already passed for RHEL 5.6.
Sorry for the delay in responding. Is the storage iSCSI? If so, have you specified _netdev in /etc/fstab? If you would, please re-create the problem (remove the start-up delay), and then post the resulting /var/log/messages output. Let us know which "sd" devices comprise the GFS clvmd volume, and are apparently the ones being configured too late. Tom
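For reference, the _netdev mount option being asked about tells the init scripts to defer the mount until networking is up (such entries are handled by the netfs service rather than the early local-filesystem mount). A sketch of what such an /etc/fstab entry might look like, reusing the device and mountpoint names from this report purely for illustration:

```
# /etc/fstab entry (illustrative): _netdev defers the mount until the
# network is available, which matters for iSCSI-backed devices.
/dev/lvm_vg/lvol0   /home/smashmnt0   gfs   defaults,_netdev   0 0
```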
Closing as per comments #15 and #17. If there is still an unresolved issue, please reopen this bug.