Red Hat Bugzilla – Attachment 149344 Details for Bug 230972: Qdisk daemon unable to be started on all members in Xen cluster setup
[patch]
update patch for RHEL5.0 qdisk -> RHEL5 stable development
cman-qdisk-20070306.patch (text/plain), 51.12 KB, created by Lon Hohberger on 2007-03-06 15:52:12 UTC
>Index: init.d/qdiskd >=================================================================== >RCS file: /cvs/cluster/cluster/cman/init.d/qdiskd,v >retrieving revision 1.2 >diff -u -r1.2 qdiskd >--- init.d/qdiskd 19 May 2006 14:41:35 -0000 1.2 >+++ init.d/qdiskd 6 Mar 2007 15:49:37 -0000 >@@ -19,7 +19,7 @@ > # See how we were called. > case "$1" in > start) >- action "Starting the Quorum Disk Daemon:" qdiskd >+ action "Starting the Quorum Disk Daemon:" qdiskd -Q > rtrn=$? > [ $rtrn = 0 ] && touch $LOCK_FILE > ;; >Index: man/mkqdisk.8 >=================================================================== >RCS file: /cvs/cluster/cluster/cman/man/mkqdisk.8,v >retrieving revision 1.2 >diff -u -r1.2 mkqdisk.8 >--- man/mkqdisk.8 21 Jul 2006 17:55:04 -0000 1.2 >+++ man/mkqdisk.8 6 Mar 2007 15:49:37 -0000 >@@ -13,11 +13,16 @@ > .IP "\-c device \-l label" > Initialize a new cluster quorum disk. This will destroy all data on the given > device. If a cluster is currently using that device as a quorum disk, the >-entire cluster will malfunction. Do not ru >+entire cluster will malfunction. Do not run this on an active cluster when >+qdiskd is running. Only one device on the SAN should ever have the given >+label; using multiple different devices is currently not supported (it is >+expected a RAID array is used for quorum disk redundancy). The label can be >+any textual string up to 127 characters - and is therefore enough space to hold >+a UUID created with uuidgen(1). > .IP "\-f label" >-Find the cluster quorum disk with the given label and display information about it.. >+Find the cluster quorum disk with the given label and display information about it. > .IP "\-L" > Display information on all accessible cluster quorum disks. 
> > .SH "SEE ALSO" >-qdisk(5) qdiskd(8) >+qdisk(5), qdiskd(8), uuidgen(1) >Index: man/qdisk.5 >=================================================================== >RCS file: /cvs/cluster/cluster/cman/man/qdisk.5,v >retrieving revision 1.3 >diff -u -r1.3 qdisk.5 >--- man/qdisk.5 3 Oct 2006 18:07:58 -0000 1.3 >+++ man/qdisk.5 6 Mar 2007 15:49:37 -0000 >@@ -1,6 +1,6 @@ >-.TH "QDisk" "8" "July 2006" "" "Cluster Quorum Disk" >+.TH "QDisk" "5" "20 Feb 2007" "" "Cluster Quorum Disk" > .SH "NAME" >-QDisk 1.0 \- a disk-based quorum daemon for CMAN / Linux-Cluster >+QDisk 1.2 \- a disk-based quorum daemon for CMAN / Linux-Cluster > .SH "1. Overview" > .SH "1.1 Problem" > In some situations, it may be necessary or desirable to sustain >@@ -75,16 +75,24 @@ > > * Cluster node votes should be more or less equal. > >-* CMAN must be running before the qdisk program can start. >+* CMAN must be running before the qdisk program can operate in full >+capacity. If CMAN is not running, qdisk will wait for it. > > * CMAN's eviction timeout should be at least 2x the quorum daemon's > to give the quorum daemon adequate time to converge on a master during a > failure + load spike situation. > >-* The total number of votes assigned to the quorum device should be >-equal to or greater than the total number of node-votes in the cluster. >-While it is possible to assign only one (or a few) votes to the quorum >-device, the effects of doing so have not been explored. >+* For 'all-but-one' failure operation, the total number of votes assigned >+to the quorum device should be equal to or greater than the total number >+of node-votes in the cluster. While it is possible to assign only one >+(or a few) votes to the quorum device, the effects of doing so have not >+been explored. >+ >+* For 'tiebreaker' operation in a two-node cluster, unset CMAN's two_node >+flag (or set it to 0), set CMAN's expected votes to '3', set each node's >+vote to '1', and set qdisk's vote count to '1' as well. 
This will allow >+the cluster to operate if either both nodes are online, or a single node & >+the heuristics. > > * Currently, the quorum disk daemon is difficult to use with CLVM if > the quorum disk resides on a CLVM logical volume. CLVM requires a >@@ -197,7 +205,7 @@ > .in 9 > \fIinterval\fP\fB="\fP1\fB"\fP > .in 12 >-This is the frequency of read/write cycles >+This is the frequency of read/write cycles, in seconds. > > .in 9 > \fItko\fP\fB="\fP10\fB"\fP >@@ -205,6 +213,26 @@ > This is the number of cycles a node must miss in order to be declared dead. > > .in 9 >+\fItko_up\fP\fB="\fPX\fB"\fP >+.in 12 >+This is the number of cycles a node must be seen in order to be declared >+online. Default is \fBfloor(tko/2)\fP. >+ >+.in 9 >+\fIupgrade_wait\fP\fB="\fP2\fB"\fP >+.in 12 >+This is the number of cycles a node must wait before initiating a bid >+for master status after heuristic scoring becomes sufficient. The >+default is 2. This can not be set to 0, and should not exceed \fBtko\fP. >+ >+.in 9 >+\fImaster_wait\fP\fB="\fPX\fB"\fP >+.in 12 >+This is the number of cycles a node must wait for votes before declaring >+itself master after making a bid. Default is \fBfloor(tko/3)\fP. >+This can not be less than 2 and should not exceed \fBtko\fP. >+ >+.in 9 > \fIvotes\fP\fB="\fP3\fB"\fP > .in 12 > This is the number of votes the quorum daemon advertises to CMAN when it >@@ -217,23 +245,27 @@ > 0 = emergencies; 7 = debug. > > .in 9 >-\fIlog_facility\fP\fB="\fPlocal4\fB"\fP >+\fIlog_facility\fP\fB="\fPdaemon\fB"\fP > .in 12 > This controls the syslog facility used by the quorum daemon when logging. > For a complete list of available facilities, see \fBsyslog.conf(5)\fP. >+The default value for this is 'daemon'. > > .in 9 > \fIstatus_file\fP\fB="\fP/foo\fB"\fP > .in 12 > Write internal states out to this file periodically ("-" = use stdout). >-This is primarily used for debugging. >+This is primarily used for debugging. 
The default value for this >+attribute is undefined. > > .in 9 > \fImin_score\fP\fB="\fP3\fB"\fP > .in 12 > Absolute minimum score to be consider one's self "alive". If omitted, > or set to 0, the default function "floor((n+1)/2)" is used, where \fIn\fP >-is the sum-total of all of defined heuristics' \fIscore\fP attribute. >+is the total of all of defined heuristics' \fIscore\fP attribute. This >+must never exceed the sum of the heuristic scores, or else the quorum >+disk will never be available. > > .in 9 > \fIreboot\fP\fB="\fP1\fB"\fP >@@ -243,6 +275,55 @@ > this value is 1 (on). > > .in 9 >+\fIallow_kill\fP\fB="\fP1\fB"\fP >+.in 12 >+If set to 0 (off), qdiskd will *not* instruct to kill nodes it thinks >+are dead (as a result of not writing to the quorum disk). The default >+for this value is 1 (on). >+ >+.in 9 >+\fIparanoid\fP\fB="\fP0\fB"\fP >+.in 12 >+If set to 1 (on), qdiskd will watch internal timers and reboot the node >+if it takes more than (interval * tko) seconds to complete a quorum disk >+pass. The default for this value is 0 (off). >+ >+.in 9 >+\fIscheduler\fP\fB="\fPrr\fB"\fP >+.in 12 >+Valid values are 'rr', 'fifo', and 'other'. Selects the scheduling queue >+in the Linux kernel for operation of the main & score threads (does not >+affect the heuristics; they are always run in the 'other' queue). Default >+is 'rr'. See sched_setscheduler(2) for more details. >+ >+.in 9 >+\fIpriority\fP\fB="\fP1\fB"\fP >+.in 12 >+Valid values for 'rr' and 'fifo' are 1..100 inclusive. Valid values >+for 'other' are -20..20 inclusive. Sets the priority of the main & score >+threads. The default value is 1 (in the RR and FIFO queues, higher numbers >+denote higher priority; in OTHER, lower values denote higher priority). >+ >+.in 9 >+\fIstop_cman\fP\fB="\fP0\fB"\fP >+.in 12 >+Ordinarily, cluster membership is left up to CMAN, not qdisk. 
>+If this parameter is set to 1 (on), qdiskd will tell CMAN to leave the >+cluster if it is unable to initialize the quorum disk during startup. This >+can be used to prevent cluster participation by a node which has been >+disconnected from the SAN. The default for this value is 0 (off). >+ >+.in 9 >+\fIuse_uptime\fP\fB="\fP1\fB"\fP >+.in 12 >+If this parameter is set to 1 (on), qdiskd will use values from >+/proc/uptime for internal timings. This is a bit less precise >+than \fBgettimeofday(2)\fP, but the benefit is that changing the >+system clock will not affect qdiskd's behavior - even if \fBparanoid\fP >+is enabled. If set to 0, qdiskd will use \fBgettimeofday(2)\fP, which >+is more precise. The default for this value is 1 (on / use uptime). >+ >+.in 9 > \fIdevice\fP\fB="\fP/dev/sda1\fB"\fP > .in 12 > This is the device the quorum daemon will use. This device must be the >@@ -256,6 +337,8 @@ > on every block device found, comparing the label against the specified > label. This is useful in configurations where the block device name > differs on a per-node basis. >+.in 8 >+\fB...>\fP > .in 0 > > .SH "3.2. The <heuristic> tag" >@@ -268,34 +351,80 @@ > .in 12 > This is the program used to determine if this heuristic is alive. This > can be anything which may be executed by \fI/bin/sh -c\fP. A return >-value of zero indicates success; anything else indicates failure. >+value of zero indicates success; anything else indicates failure. This >+is required. > > .in 9 > \fIscore\fP\fB="\fP1\fB"\fP > .in 12 > This is the weight of this heuristic. Be careful when determining scores >-for heuristics. >+for heuristics. The default score for each heuristic is 1. > > .in 9 > \fIinterval\fP\fB="\fP2\fB"/>\fP > .in 12 >-This is the frequency at which we poll the heuristic. >+This is the frequency (in seconds) at which we poll the heuristic. The >+default interval for every heuristic is 2 seconds. 
>+.in 0 >+ >+.in 9 >+\fItko\fP\fB="\fP1\fB"/>\fP >+.in 12 >+After this many failed attempts to run the heuristic, it is considered DOWN, >+and its score is removed. The default tko for each heuristic is 1, which >+may be inadequate for things such as 'ping'. >+.in 8 >+\fB/>\fP > .in 0 > >-.SH "3.3. Example" >+ >+.SH "3.3. Examples" >+.SH "3.3.1. 3 cluster nodes & 3 routers" >+.in 8 >+<cman expected_votes="6" .../> >+.br >+<clusternodes> >+.in 12 >+<clusternode name="node1" votes="1" ... /> >+.br >+<clusternode name="node2" votes="1" ... /> >+.br >+<clusternode name="node3" votes="1" ... /> > .in 8 >+</clusternodes> >+.br > <quorumd interval="1" tko="10" votes="3" label="testing"> > .in 12 >-<heuristic program="ping A -c1 -t1" score="1" interval="2"/> >+<heuristic program="ping A -c1 -t1" score="1" interval="2" tko="3"/> > .br >-<heuristic program="ping B -c1 -t1" score="1" interval="2"/> >+<heuristic program="ping B -c1 -t1" score="1" interval="2" tko="3"/> > .br >-<heuristic program="ping C -c1 -t1" score="1" interval="2"/> >+<heuristic program="ping C -c1 -t1" score="1" interval="2" tko="3"/> >+.br >+.in 8 >+</quorumd> >+ >+.SH "3.3.2. 2 cluster nodes & 1 IP tiebreaker" >+.in 8 >+<cman two_node="0" expected_votes="3" .../> >+.br >+<clusternodes> >+.in 12 >+<clusternode name="node1" votes="1" ... /> >+.br >+<clusternode name="node2" votes="1" ... /> >+.in 8 >+</clusternodes> >+.br >+<quorumd interval="1" tko="10" votes="1" label="testing"> >+.in 12 >+<heuristic program="ping A -c1 -t1" score="1" interval="2" tko="3"/> > .br > .in 8 > </quorumd> > .in 0 > >+ > .SH "3.4. Heuristic score considerations" > * Heuristic timeouts should be set high enough to allow the previous run > of a given heuristic to complete. >@@ -313,4 +442,4 @@ > for more details. 
> > .SH "SEE ALSO" >-mkqdisk(8), qdiskd(8), cman(5), syslog.conf(5) >+mkqdisk(8), qdiskd(8), cman(5), syslog.conf(5), gettimeofday(2) >Index: man/qdiskd.8 >=================================================================== >RCS file: /cvs/cluster/cluster/cman/man/qdiskd.8,v >retrieving revision 1.2 >diff -u -r1.2 qdiskd.8 >--- man/qdiskd.8 21 Jul 2006 17:55:04 -0000 1.2 >+++ man/qdiskd.8 6 Mar 2007 15:49:37 -0000 >@@ -15,6 +15,11 @@ > Run in the foreground (do not fork / daemonize). > .IP "\-d" > Enable debug output. >+.IP "\-Q" >+Close stdin/out/err immediately before doing validations. This >+is primarily for use when being called from an init script. Using >+this option will stop all output, and can not be used with the -d >+option. > > .SH "SEE ALSO" > mkqdisk(8), qdisk(5), cman(5) >Index: qdisk/Makefile >=================================================================== >RCS file: /cvs/cluster/cluster/cman/qdisk/Makefile,v >retrieving revision 1.6 >diff -u -r1.6 Makefile >--- qdisk/Makefile 11 Aug 2006 15:18:05 -0000 1.6 >+++ qdisk/Makefile 6 Mar 2007 15:49:37 -0000 >@@ -28,7 +28,7 @@ > install ${TARGET} ${sbindir} > > qdiskd: disk.o crc32.o disk_util.o main.o score.o bitmap.o clulog.o \ >- gettid.o proc.o ../lib/libcman.a >+ gettid.o proc.o daemon_init.o ../lib/libcman.a > gcc -o $@ $^ -lpthread -L../lib -L${ccslibdir} -lccs > > mkqdisk: disk.o crc32.o disk_util.o \ >Index: qdisk/clulog.c >=================================================================== >RCS file: /cvs/cluster/cluster/cman/qdisk/clulog.c,v >retrieving revision 1.2 >diff -u -r1.2 clulog.c >--- qdisk/clulog.c 19 May 2006 14:41:35 -0000 1.2 >+++ qdisk/clulog.c 6 Mar 2007 15:49:37 -0000 >@@ -20,8 +20,6 @@ > /** @file > * Library routines for communicating with the logging daemon. 
> * >- * $Id: clulog.c,v 1.2 2006/05/19 14:41:35 lhh Exp $ >- * > * Author: Jeff Moyer <moyer@missioncriticallinux.com> > */ > #include <stdio.h> >@@ -50,8 +48,6 @@ > #include <string.h> > > >-static const char *version __attribute__ ((unused)) = "$Revision: 1.2 $"; >- > #ifdef DEBUG > #include <assert.h> > #define Dprintf(fmt,args...) printf(fmt,##args) >@@ -135,7 +131,7 @@ > } > > pthread_mutex_unlock(&log_mutex); >- return "local4"; >+ return "daemon"; > } > > >@@ -156,7 +152,6 @@ > for (; facilitynames[x].c_name; x++) { > if (strcmp(facilityname, facilitynames[x].c_name)) > continue; >- > syslog_facility = facilitynames[x].c_val; > break; > } >Index: qdisk/daemon_init.c >=================================================================== >RCS file: qdisk/daemon_init.c >diff -N qdisk/daemon_init.c >--- /dev/null 1 Jan 1970 00:00:00 -0000 >+++ qdisk/daemon_init.c 6 Mar 2007 15:49:37 -0000 >@@ -0,0 +1,238 @@ >+/* >+ Copyright Red Hat, Inc. 2002, 2007 >+ Copyright Mission Critical Linux, 2000 >+ >+ This program is free software; you can redistribute it and/or modify it >+ under the terms of the GNU General Public License as published by the >+ Free Software Foundation; either version 2, or (at your option) any >+ later version. >+ >+ This program is distributed in the hope that it will be useful, but >+ WITHOUT ANY WARRANTY; without even the implied warranty of >+ MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU >+ General Public License for more details. >+ >+ You should have received a copy of the GNU General Public License >+ along with this program; see the file COPYING. If not, write to the >+ Free Software Foundation, Inc., 675 Mass Ave, Cambridge, >+ MA 02139, USA. >+*/ >+/** @file >+ * daemon_init function, does sanity checks and calls daemon(). 
>+ * >+ * Author: Jeff Moyer <jmoyer@redhat.com> >+ */ >+/* >+ * TODO: Clean this up so that only one function constructs the >+ * pidfile /var/run/loggerd.PID, and perhaps only one function >+ * forms the /proc/PID/ path. >+ * >+ * Also need to add file locking for the pid file. >+ */ >+#include <stdio.h> >+#include <stdlib.h> >+#include <unistd.h> >+#include <string.h> >+#include <sys/types.h> >+#include <sys/stat.h> >+#include <sys/param.h> >+#include <fcntl.h> >+#include <dirent.h> >+#include <sys/mman.h> >+#include <sys/errno.h> >+#include <libgen.h> >+#include <signal.h> >+ >+/* >+ * This should ultimately go in a header file. >+ */ >+void daemon_init(char *prog); >+int check_pid_valid(pid_t pid, char *prog); >+int check_process_running(char *prog, pid_t * pid); >+ >+/* >+ * Local prototypes. >+ */ >+static void update_pidfile(char *prog); >+static int setup_sigmask(void); >+ >+ >+int >+check_pid_valid(pid_t pid, char *prog) >+{ >+ FILE *fp; >+ DIR *dir; >+ char filename[PATH_MAX]; >+ char dirpath[PATH_MAX]; >+ char proc_cmdline[64]; /* yank this from kernel somewhere */ >+ char *s = NULL; >+ >+ memset(filename, 0, PATH_MAX); >+ memset(dirpath, 0, PATH_MAX); >+ >+ snprintf(dirpath, sizeof (dirpath), "/proc/%d", pid); >+ if ((dir = opendir(dirpath)) == NULL) { >+ closedir(dir); >+ return 0; /* Pid has gone away. */ >+ } >+ closedir(dir); >+ >+ /* >+ * proc-pid directory exists. Now check to see if this >+ * PID corresponds to the daemon we want to start. >+ */ >+ snprintf(filename, sizeof (filename), "/proc/%d/cmdline", pid); >+ fp = fopen(filename, "r"); >+ if (fp == NULL) { >+ perror("check_pid_valid"); >+ return 0; /* Who cares.... Let's boogy on. */ >+ } >+ >+ if (!fgets(proc_cmdline, sizeof (proc_cmdline) - 1, fp)) { >+ /* >+ * Okay, we've seen processes keep a reference to a >+ * /proc/PID/stat file and not let go. Then when >+ * you try to read /proc/PID/cmline, you get either >+ * \000 or -1. 
In either case, we can safely assume >+ * the process has gone away. >+ */ >+ fclose(fp); >+ return 0; >+ } >+ fclose(fp); >+ >+ s = &(proc_cmdline[strlen(proc_cmdline)]); >+ if (*s == '\n') >+ *s = 0; >+ >+ /* >+ * Check to see if this is the same executable. >+ */ >+ if ((s = strstr(proc_cmdline, prog)) == NULL) { >+ return 0; >+ } else { >+ return 1; >+ } >+} >+ >+ >+int >+check_process_running(char *prog, pid_t * pid) >+{ >+ pid_t oldpid; >+ FILE *fp; >+ char filename[PATH_MAX]; >+ char *cmd; >+ int ret; >+ struct stat st; >+ >+ *pid = -1; >+ >+ /* >+ * Now see if there is a pidfile associated with this cmd in /var/run >+ */ >+ fp = NULL; >+ memset(filename, 0, PATH_MAX); >+ >+ cmd = basename(prog); >+ snprintf(filename, sizeof (filename), "/var/run/%s.pid", cmd); >+ >+ ret = stat(filename, &st); >+ if ((ret < 0) || (!st.st_size)) >+ return 0; >+ >+ /* >+ * Read the pid from the file. >+ */ >+ fp = fopen(filename, "r"); >+ if (fp == NULL) { /* error */ >+ return 0; >+ } >+ fscanf(fp, "%d\n", &oldpid); >+ fclose(fp); >+ if (check_pid_valid(oldpid, cmd)) { >+ *pid = oldpid; >+ return 1; >+ } >+ return 0; >+} >+ >+ >+static void >+update_pidfile(char *prog) >+{ >+ FILE *fp = NULL; >+ char *cmd; >+ char filename[PATH_MAX]; >+ >+ memset(filename, 0, PATH_MAX); >+ >+ cmd = basename(prog); >+ snprintf(filename, sizeof (filename), "/var/run/%s.pid", cmd); >+ >+ fp = fopen(filename, "w"); >+ if (fp == NULL) { >+ exit(1); >+ } >+ >+ fprintf(fp, "%d", getpid()); >+ fclose(fp); >+} >+ >+ >+static int >+setup_sigmask(void) >+{ >+ sigset_t set; >+ >+ sigfillset(&set); >+ >+ /* >+ * Dont't block signals which would cause us to dump core. 
>+ */ >+ sigdelset(&set, SIGQUIT); >+ sigdelset(&set, SIGILL); >+ sigdelset(&set, SIGTRAP); >+ sigdelset(&set, SIGABRT); >+ sigdelset(&set, SIGFPE); >+ sigdelset(&set, SIGSEGV); >+ sigdelset(&set, SIGBUS); >+ >+ /* >+ * Don't block SIGTERM or SIGCHLD >+ */ >+ sigdelset(&set, SIGTERM); >+ sigdelset(&set, SIGCHLD); >+ >+ return (sigprocmask(SIG_BLOCK, &set, NULL)); >+} >+ >+ >+void >+daemon_init(char *prog) >+{ >+ uid_t uid; >+ pid_t pid; >+ >+ uid = getuid(); >+ if (uid) { >+ fprintf(stderr, >+ "daemon_init: Sorry, only root wants to run this.\n"); >+ exit(1); >+ } >+ >+ if (check_process_running(prog, &pid) && (pid != getpid())) { >+ fprintf(stderr, >+ "daemon_init: Process \"%s\" already running.\n", >+ prog); >+ exit(1); >+ } >+ if (setup_sigmask() < 0) { >+ fprintf(stderr, "daemon_init: Unable to set signal mask.\n"); >+ exit(1); >+ } >+ >+ daemon(0, 0); >+ >+ update_pidfile(prog); >+} >Index: qdisk/disk.h >=================================================================== >RCS file: /cvs/cluster/cluster/cman/qdisk/disk.h,v >retrieving revision 1.4 >diff -u -r1.4 disk.h >--- qdisk/disk.h 3 Oct 2006 18:06:40 -0000 1.4 >+++ qdisk/disk.h 6 Mar 2007 15:49:37 -0000 >@@ -67,7 +67,12 @@ > > > typedef enum { >- RF_REBOOT = 0x1 /* Reboot if we go from master->none */ >+ RF_REBOOT = 0x1, /* Reboot if we go from master->none */ >+ RF_STOP_CMAN = 0x2, >+ RF_DEBUG = 0x4, >+ RF_PARANOID = 0x8, >+ RF_ALLOW_KILL = 0x10, >+ RF_UPTIME = 0x20 > } run_flag_t; > > >@@ -235,11 +240,17 @@ > int qc_writes; > int qc_interval; > int qc_tko; >+ int qc_tko_up; >+ int qc_upgrade_wait; >+ int qc_master_wait; > int qc_votes; > int qc_scoremin; >+ int qc_sched; >+ int qc_sched_prio; > disk_node_state_t qc_disk_status; > disk_node_state_t qc_status; > int qc_master; /* Master?! 
*/ >+ int _pad_; > run_flag_t qc_flags; > cman_handle_t qc_ch; > char *qc_device; >Index: qdisk/disk_util.c >=================================================================== >RCS file: /cvs/cluster/cluster/cman/qdisk/disk_util.c,v >retrieving revision 1.2 >diff -u -r1.2 disk_util.c >--- qdisk/disk_util.c 19 May 2006 14:41:35 -0000 1.2 >+++ qdisk/disk_util.c 6 Mar 2007 15:49:37 -0000 >@@ -37,20 +37,71 @@ > #include <time.h> > > >-static inline void >+inline void > _diff_tv(struct timeval *dest, struct timeval *start, struct timeval *end) > { >- dest->tv_sec = end->tv_sec - start->tv_sec; >- dest->tv_usec = end->tv_usec - start->tv_usec; >+ dest->tv_sec = end->tv_sec - start->tv_sec; >+ dest->tv_usec = end->tv_usec - start->tv_usec; > >- if (dest->tv_usec < 0) { >- dest->tv_usec += 1000000; >- dest->tv_sec--; >- } >+ if (dest->tv_usec < 0) { >+ dest->tv_usec += 1000000; >+ dest->tv_sec--; >+ } > } > > > /** >+ * >+ * Grab the uptime from /proc/uptime. >+ * >+ * @param tv Timeval struct to store time in. The sec >+ * field contains seconds, the usec field >+ * contains the hundredths-of-seconds (converted >+ * to micro-seconds) >+ * @return -1 on failure, 0 on success. 
>+ */ >+static inline int >+getuptime(struct timeval *tv) >+{ >+ FILE *fp; >+ struct timeval junk; >+ int rv; >+ >+ fp = fopen("/proc/uptime","r"); >+ if (!fp) >+ return -1; >+ >+#if defined(__sparc__) || defined(__hppa__) || defined(__sparc64__) || defined (__hppa64__) >+ rv = fscanf(fp,"%ld.%d %ld.%d\n", &tv->tv_sec, &tv->tv_usec, >+ &junk.tv_sec, &junk.tv_usec); >+#else >+ rv = fscanf(fp,"%ld.%ld %ld.%ld\n", &tv->tv_sec, &tv->tv_usec, >+ &junk.tv_sec, &junk.tv_usec); >+#endif >+ fclose(fp); >+ >+ if (rv != 4) { >+ return -1; >+ } >+ >+ tv->tv_usec *= 10000; >+ >+ return 0; >+} >+ >+ >+inline int >+get_time(struct timeval *tv, int use_uptime) >+{ >+ if (use_uptime) { >+ return getuptime(tv); >+ } else { >+ return gettimeofday(tv, NULL); >+ } >+} >+ >+ >+/** > Update write times and calculate a new average time > */ > void >@@ -147,7 +198,7 @@ > ps.ps_arg = 0; > } > >- if (gettimeofday(&start, NULL) < 0) >+ if (get_time(&start, ctx->qc_flags&RF_UPTIME) < 0) > utime_ok = 0; > swab_status_block_t(&ps); > if (qdisk_write(ctx->qc_fd, qdisk_nodeid_offset(nid), &ps, >@@ -155,7 +206,7 @@ > printf("Error writing node ID block %d\n", nid); > return -1; > } >- if (utime_ok && (gettimeofday(&end, NULL) < 0)) >+ if (utime_ok && (get_time(&end, ctx->qc_flags&RF_UPTIME) < 0)) > utime_ok = 0; > > if (utime_ok) { >Index: qdisk/main.c >=================================================================== >RCS file: /cvs/cluster/cluster/cman/qdisk/main.c,v >retrieving revision 1.4.4.1 >diff -u -r1.4.4.1 main.c >--- qdisk/main.c 16 Jan 2007 16:20:48 -0000 1.4.4.1 >+++ qdisk/main.c 6 Mar 2007 15:49:38 -0000 >@@ -35,11 +35,21 @@ > #include <unistd.h> > #include <time.h> > #include <sys/reboot.h> >+#include <sys/time.h> > #include <linux/reboot.h> >+#include <sched.h> > #include <signal.h> > #include <ccs.h> > #include "score.h" > #include "clulog.h" >+#if (!defined(LIBCMAN_VERSION) || \ >+ (defined(LIBCMAN_VERSION) && LIBCMAN_VERSION < 2)) >+#include <cluster/cnxman-socket.h> >+#endif 
>+ >+int daemon_init(char *); >+int check_process_running(char *, pid_t *); >+ > /* > TODO: > 1) Take into account timings to gracefully extend node timeouts during >@@ -52,7 +62,13 @@ > int clear_bit(uint8_t *mask, uint32_t bitidx, uint32_t masklen); > int set_bit(uint8_t *mask, uint32_t bitidx, uint32_t masklen); > int is_bit_set(uint8_t *mask, uint32_t bitidx, uint32_t masklen); >-static int _running = 0; >+inline int get_time(struct timeval *tv, int use_uptime); >+inline void _diff_tv(struct timeval *dest, struct timeval *start, >+ struct timeval *end); >+ >+static int _running = 1; >+void update_local_status(qd_ctx *ctx, node_info_t *ni, int max, int score, >+ int score_req, int score_max); > > > static void >@@ -144,6 +160,8 @@ > continue; > } > /* message. */ >+ memcpy(&(ni[x].ni_last_msg), &(ni[x].ni_msg), >+ sizeof(ni[x].ni_last_msg)); > ni[x].ni_msg.m_arg = sb->ps_arg; > ni[x].ni_msg.m_msg = sb->ps_msg; > ni[x].ni_msg.m_seq = sb->ps_seq; >@@ -155,6 +173,11 @@ > if (sb->ps_timestamp == ni[x].ni_last_seen) { > /* XXX check for average + allow grace */ > ni[x].ni_misses++; >+ if (ni[x].ni_misses > 1) { >+ clulog(LOG_DEBUG, >+ "Node %d missed an update (%d/%d)\n", >+ x+1, ni[x].ni_misses, ctx->qc_tko); >+ } > continue; > } > >@@ -208,6 +231,11 @@ > ni[x].ni_misses = 0; > ni[x].ni_state = S_NONE; > >+ /* Clear our master mask for the node after eviction >+ * or shutdown */ >+ if (mask) >+ clear_bit(mask, (ni[x].ni_status.ps_nodeid-1), >+ sizeof(memb_mask_t)); > continue; > } > >@@ -227,15 +255,17 @@ > Write eviction notice if we're the master. 
> */ > if (ctx->qc_status == S_MASTER) { >- clulog(LOG_DEBUG, >+ clulog(LOG_NOTICE, > "Writing eviction notice for node %d\n", > ni[x].ni_status.ps_nodeid); > qd_write_status(ctx, ni[x].ni_status.ps_nodeid, > S_EVICT, NULL, NULL, NULL); >- clulog(LOG_DEBUG, >- "Telling CMAN to kill the node\n"); >- cman_kill_node(ctx->qc_ch, >- ni[x].ni_status.ps_nodeid); >+ if (ctx->qc_flags & RF_ALLOW_KILL) { >+ clulog(LOG_DEBUG, "Telling CMAN to " >+ "kill the node\n"); >+ cman_kill_node(ctx->qc_ch, >+ ni[x].ni_status.ps_nodeid); >+ } > } > > /* >@@ -255,6 +285,10 @@ > ni[x].ni_evil_incarnation = > ni[x].ni_status.ps_incarnation; > >+ /* Clear our master mask for the node after eviction */ >+ if (mask) >+ clear_bit(mask, (ni[x].ni_status.ps_nodeid-1), >+ sizeof(memb_mask_t)); > continue; > } > >@@ -279,9 +313,12 @@ > ni[x].ni_status.ps_state = S_EVICT; > > /* XXX Need to fence it again */ >- clulog(LOG_DEBUG, "Telling CMAN to kill the node\n"); >- cman_kill_node(ctx->qc_ch, >- ni[x].ni_status.ps_nodeid); >+ if (ctx->qc_flags & RF_ALLOW_KILL) { >+ clulog(LOG_DEBUG, "Telling CMAN to " >+ "kill the node\n"); >+ cman_kill_node(ctx->qc_ch, >+ ni[x].ni_status.ps_nodeid); >+ } > continue; > } > >@@ -292,7 +329,7 @@ > > Transition from Offline -> Online > */ >- if (ni[x].ni_seen > (ctx->qc_tko / 2) && >+ if (ni[x].ni_seen > ctx->qc_tko_up && > !state_run(ni[x].ni_state)) { > /* > Node-join - everyone just kind of "agrees" >@@ -413,9 +450,13 @@ > int > quorum_init(qd_ctx *ctx, node_info_t *ni, int max, struct h_data *h, int maxh) > { >- int x = 0, score, maxscore; >+ int x = 0, score, maxscore, score_req; > > clulog(LOG_INFO, "Quorum Daemon Initializing\n"); >+ >+ if (mlockall(MCL_CURRENT|MCL_FUTURE) != 0) { >+ clulog(LOG_ERR, "Unable to mlockall()\n"); >+ } > > if (qdisk_validate(ctx->qc_device) < 0) > return -1; >@@ -427,16 +468,22 @@ > return -1; > } > >- start_score_thread(h, maxh); >+ if (h && maxh) { >+ start_score_thread(ctx, h, maxh); >+ } else { >+ clulog(LOG_DEBUG, "Permanently 
setting score to 1/1\n"); >+ fudge_scoring(); >+ } > > node_info_init(ni, max); >+ ctx->qc_status = S_INIT; > if (qd_write_status(ctx, ctx->qc_my_id, > S_INIT, NULL, NULL, NULL) != 0) { > clulog(LOG_CRIT, "Could not initialize status block!\n"); > return -1; > } > >- while (++x <= ctx->qc_tko) { >+ while (++x <= ctx->qc_tko && _running) { > read_node_blocks(ctx, ni, max); > check_transitions(ctx, ni, max, NULL); > >@@ -446,11 +493,16 @@ > return -1; > } > >- sleep(ctx->qc_interval); >+ get_my_score(&score, &maxscore); >+ score_req = ctx->qc_scoremin; >+ if (score_req <= 0) >+ score_req = (maxscore/2 + 1); >+ update_local_status(ctx, ni, max, score, score_req, maxscore); > >+ sleep(ctx->qc_interval); > } > >- get_my_score(&score,&maxscore); >+ get_my_score(&score, &maxscore); > clulog(LOG_INFO, "Initial score %d/%d\n", score, maxscore); > clulog(LOG_INFO, "Initialization complete\n"); > >@@ -500,12 +552,16 @@ > return; > > memset(master_mask, 0, sizeof(master_mask)); >- > for (x = 0; x < retnodes; x++) { > if (is_bit_set(mask, nodes[x].cn_nodeid-1, sizeof(mask)) && >- nodes[x].cn_member) >+ nodes[x].cn_member) { > set_bit(master_mask, nodes[x].cn_nodeid-1, > sizeof(master_mask)); >+ } else { >+ /* Not in CMAN output = not allowed */ >+ clear_bit(master_mask, (nodes[x].cn_nodeid-1), >+ sizeof(memb_mask_t)); >+ } > } > } > >@@ -585,11 +641,41 @@ > > > void >+print_node_info(FILE *fp, node_info_t *ni) >+{ >+ fprintf(fp, "node_info_t [node %d] {\n", ni->ni_status.ps_nodeid); >+ fprintf(fp, " ni_incarnation = 0x%08x%08x\n", >+ ((int)(ni->ni_incarnation>>32))&0xffffffff, >+ ((int)(ni->ni_incarnation)&0xffffffff)); >+ fprintf(fp, " ni_evil_incarnation = 0x%08x%08x\n", >+ ((int)(ni->ni_evil_incarnation>>32))&0xffffffff, >+ ((int)(ni->ni_evil_incarnation)&0xffffffff)); >+ fprintf(fp, " ni_last_seen = %s", ctime(&ni->ni_last_seen)); >+ fprintf(fp, " ni_misses = %d\n", ni->ni_misses); >+ fprintf(fp, " ni_seen = %d\n", ni->ni_seen); >+ fprintf(fp, " ni_msg = {\n"); >+ 
fprintf(fp, " m_msg = 0x%08x\n", ni->ni_msg.m_msg); >+ fprintf(fp, " m_arg = %d\n", ni->ni_msg.m_arg); >+ fprintf(fp, " m_seq = %d\n", ni->ni_msg.m_seq); >+ fprintf(fp, " }\n"); >+ fprintf(fp, " ni_last_msg = {\n"); >+ fprintf(fp, " m_msg = 0x%08x\n", ni->ni_last_msg.m_msg); >+ fprintf(fp, " m_arg = %d\n", ni->ni_last_msg.m_arg); >+ fprintf(fp, " m_seq = %d\n", ni->ni_last_msg.m_seq); >+ fprintf(fp, " }\n"); >+ fprintf(fp, " ni_state = 0x%08x (%s)\n", ni->ni_state, >+ state_str(ni->ni_state)); >+ fprintf(fp, "}\n\n"); >+} >+ >+ >+void > update_local_status(qd_ctx *ctx, node_info_t *ni, int max, int score, > int score_req, int score_max) > { > FILE *fp; > int x, need_close = 0; >+ time_t now; > > if (!ctx->qc_status_file) > return; >@@ -603,12 +689,24 @@ > need_close = 1; > } > >+ now = time(NULL); >+ fprintf(fp, "Time Stamp: %s", ctime(&now)); > fprintf(fp, "Node ID: %d\n", ctx->qc_my_id); >- fprintf(fp, "Score (current / min req. / max allowed): %d / %d / %d\n", >- score, score_req, score_max); >+ >+ fprintf(fp, "Score: %d/%d (Minimum required = %d)\n", >+ score, score_max, score_req); > fprintf(fp, "Current state: %s\n", state_str(ctx->qc_status)); >+ >+ /* > fprintf(fp, "Current disk state: %s\n", > state_str(ctx->qc_disk_status)); >+ */ >+ fprintf(fp, "Initializing Set: {"); >+ for (x=0; x<max; x++) { >+ if (ni[x].ni_status.ps_state == S_INIT && ni[x].ni_seen) >+ fprintf(fp," %d", ni[x].ni_status.ps_nodeid); >+ } >+ fprintf(fp, " }\n"); > > fprintf(fp, "Visible Set: {"); > for (x=0; x<max; x++) { >@@ -617,13 +715,18 @@ > fprintf(fp," %d", ni[x].ni_status.ps_nodeid); > } > fprintf(fp, " }\n"); >+ >+ if (ctx->qc_status == S_INIT) >+ goto out; >+ >+ if (ctx->qc_master) >+ fprintf(fp, "Master Node ID: %d\n", ctx->qc_master); >+ else >+ fprintf(fp, "Master Node ID: (none)\n"); > >- if (!ctx->qc_master) { >- fprintf(fp, "No master node\n"); >+ if (!ctx->qc_master) > goto out; >- } > >- fprintf(fp, "Master Node ID: %d\n", ctx->qc_master); > fprintf(fp, "Quorate Set: 
{"); > for (x=0; x<max; x++) { > if (is_bit_set(ni[ctx->qc_master-1].ni_status.ps_master_mask, >@@ -636,24 +739,140 @@ > fprintf(fp, " }\n"); > > out: >+ if (ctx->qc_flags & RF_DEBUG) { >+ for (x = 0; x < max; x++) >+ print_node_info(fp, &ni[x]); >+ } >+ > fprintf(fp, "\n"); > if (need_close) > fclose(fp); > } > > >+/* Timeval functions from clumanager */ >+/** >+ * Scale a (struct timeval). >+ * >+ * @param tv The timeval to scale. >+ * @param scale Positive multiplier. >+ * @return tv >+ */ >+struct timeval * >+_scale_tv(struct timeval *tv, int scale) >+{ >+ tv->tv_sec *= scale; >+ tv->tv_usec *= scale; >+ >+ if (tv->tv_usec > 1000000) { >+ tv->tv_sec += (tv->tv_usec / 1000000); >+ tv->tv_usec = (tv->tv_usec % 1000000); >+ } >+ >+ return tv; >+} >+ >+ >+ >+#define _print_tv(val) \ >+ printf("%s: %d.%06d\n", #val, (int)((val)->tv_sec), \ >+ (int)((val)->tv_usec)) >+ >+ >+static inline int >+_cmp_tv(struct timeval *left, struct timeval *right) >+{ >+ if (left->tv_sec > right->tv_sec) >+ return -1; >+ >+ if (left->tv_sec < right->tv_sec) >+ return 1; >+ >+ if (left->tv_usec > right->tv_usec) >+ return -1; >+ >+ if (left->tv_usec < right->tv_usec) >+ return 1; >+ >+ return 0; >+} >+ >+ >+void >+set_priority(int queue, int prio) >+{ >+ struct sched_param s; >+ int ret; >+ char *func = "nice"; >+ >+ if (queue == SCHED_OTHER) { >+ s.sched_priority = 0; >+ ret = sched_setscheduler(0, queue, &s); >+ errno = 0; >+ ret = nice(prio); >+ } else { >+ memset(&s,0,sizeof(s)); >+ s.sched_priority = prio; >+ ret = sched_setscheduler(0, queue, &s); >+ func = "sched_setscheduler"; >+ } >+ >+ if (ret < 0 && errno) { >+ clulog(LOG_WARNING, "set_priority [%s] failed: %s\n", func, >+ strerror(errno)); >+ } >+} >+ >+ >+int >+cman_alive(cman_handle_t ch) >+{ >+ fd_set rfds; >+ int fd = cman_get_fd(ch); >+ struct timeval tv = {0, 0}; >+ >+ FD_ZERO(&rfds); >+ FD_SET(fd, &rfds); >+ if (select(fd + 1, &rfds, NULL, NULL, &tv) == 1) { >+ if (cman_dispatch(ch, CMAN_DISPATCH_ALL) < 0) { >+ if 
(errno == EAGAIN) >+ return 0; >+ return -1; >+ } >+ } >+ return 0; >+} >+ > > int > quorum_loop(qd_ctx *ctx, node_info_t *ni, int max) > { > disk_msg_t msg = {0, 0, 0}; >- int low_id, bid_pending = 0, score, score_max, score_req; >+ int low_id, bid_pending = 0, score, score_max, score_req, >+ upgrade = 0; > memb_mask_t mask, master_mask; >+ struct timeval maxtime, oldtime, newtime, diff, sleeptime, interval; > >- ctx->qc_status = S_RUN; >+ ctx->qc_status = S_NONE; >+ >+ maxtime.tv_usec = 0; >+ maxtime.tv_sec = ctx->qc_interval * ctx->qc_tko; >+ >+ interval.tv_usec = 0; >+ interval.tv_sec = ctx->qc_interval; >+ >+ get_my_score(&score, &score_max); >+ if (score_max < ctx->qc_scoremin) { >+ clulog(LOG_WARNING, "Minimum score (%d) is impossible to " >+ "achieve (heuristic total = %d)\n", >+ ctx->qc_scoremin, score_max); >+ } > > _running = 1; > while (_running) { >+ /* XXX this was getuptime() in clumanager */ >+ get_time(&oldtime, (ctx->qc_flags&RF_UPTIME)); >+ > /* Read everyone else's status */ > read_node_blocks(ctx, ni, max); > >@@ -663,6 +882,10 @@ > /* Check heuristics and remove ourself if necessary */ > get_my_score(&score, &score_max); > >+ /* If we recently upgraded, decrement our wait time */ >+ if (upgrade > 0) >+ --upgrade; >+ > score_req = ctx->qc_scoremin; > if (score_req <= 0) > score_req = (score_max/2 + 1); >@@ -672,14 +895,19 @@ > if (ctx->qc_status > S_NONE) { > clulog(LOG_NOTICE, > "Score insufficient for master " >- "operation (%d/%d; max=%d); " >+ "operation (%d/%d; required=%d); " > "downgrading\n", >- score, score_req, score_max); >+ score, score_max, score_req); > ctx->qc_status = S_NONE; > msg.m_msg = M_NONE; > ++msg.m_seq; > bid_pending = 0; >- cman_poll_quorum_device(ctx->qc_ch, 0); >+ if (cman_alive(ctx->qc_ch) < 0) { >+ clulog(LOG_ERR, "cman: %s\n", >+ strerror(errno)); >+ } else { >+ cman_poll_quorum_device(ctx->qc_ch, 0); >+ } > if (ctx->qc_flags & RF_REBOOT) > reboot(RB_AUTOBOOT); > } >@@ -688,10 +916,11 @@ > if (ctx->qc_status == 
S_NONE) { > clulog(LOG_NOTICE, > "Score sufficient for master " >- "operation (%d/%d; max=%d); " >+ "operation (%d/%d; required=%d); " > "upgrading\n", >- score, score_req, score_max); >+ score, score_max, score_req); > ctx->qc_status = S_RUN; >+ upgrade = ctx->qc_upgrade_wait; > } > } > >@@ -702,11 +931,13 @@ > if (!ctx->qc_master && > low_id == ctx->qc_my_id && > ctx->qc_status == S_RUN && >- !bid_pending ) { >+ !bid_pending && >+ !upgrade) { > /* > If there's no master, and we are the lowest node > ID, make a bid to become master if we're not >- already bidding. >+ already bidding. We can't do this if we've just >+ upgraded. > */ > > clulog(LOG_DEBUG,"Making bid for master\n"); >@@ -724,10 +955,18 @@ > /* We're currently bidding for master. > See if anyone's voted, or if we should > rescind our bid */ >+ ++bid_pending; > > /* Yes, those are all deliberate fallthroughs */ > switch (check_votes(ctx, ni, max, &msg)) { > case 3: >+ /* >+ * Give ample time to become aware of other >+ * nodes >+ */ >+ if (bid_pending < (ctx->qc_master_wait)) >+ break; >+ > clulog(LOG_INFO, > "Assuming master role\n"); > ctx->qc_status = S_MASTER; >@@ -755,6 +994,13 @@ > /* We are the master. Poll the quorum device. > We can't be the master unless we score high > enough on our heuristics. */ >+ if (cman_alive(ctx->qc_ch) < 0) { >+ clulog(LOG_ERR, "cman_dispatch: %s\n", >+ strerror(errno)); >+ clulog(LOG_ERR, >+ "Halting qdisk operations\n"); >+ return -1; >+ } > check_cman(ctx, mask, master_mask); > cman_poll_quorum_device(ctx->qc_ch, 1); > >@@ -768,6 +1014,13 @@ > ni[ctx->qc_master-1].ni_status.ps_master_mask, > ctx->qc_my_id-1, > sizeof(memb_mask_t))) { >+ if (cman_alive(ctx->qc_ch) < 0) { >+ clulog(LOG_ERR, "cman_dispatch: %s\n", >+ strerror(errno)); >+ clulog(LOG_ERR, >+ "Halting qdisk operations\n"); >+ return -1; >+ } > cman_poll_quorum_device(ctx->qc_ch, 1); > } > } >@@ -783,8 +1036,43 @@ > > /* Cycle. 
We could time the loop and sleep > usleep(interval-looptime), but this is fine for now.*/ >+ get_time(&newtime, ctx->qc_flags&RF_UPTIME); >+ _diff_tv(&diff, &oldtime, &newtime); >+ >+ /* >+ * Reboot if we didn't send a heartbeat in interval*TKO_COUNT >+ */ >+ if (_cmp_tv(&maxtime, &diff) == 1 && >+ ctx->qc_flags & RF_PARANOID) { >+ clulog(LOG_EMERG, "Failed to complete a cycle within " >+ "%d second%s (%d.%06d) - REBOOTING\n", >+ (int)maxtime.tv_sec, >+ maxtime.tv_sec==1?"":"s", >+ (int)diff.tv_sec, >+ (int)diff.tv_usec); >+ if (!(ctx->qc_flags & RF_DEBUG)) >+ reboot(RB_AUTOBOOT); >+ } >+ >+ /* >+ * If the amount we took to complete a loop is greater or less >+ * than our interval, we adjust by the difference each round. >+ * >+ * It's not really "realtime", but it helps! >+ */ >+ if (_cmp_tv(&diff, &interval) == 1) { >+ _diff_tv(&sleeptime, &diff, &interval); >+ } else { >+ clulog(LOG_WARNING, "qdisk cycle took more " >+ "than %d second%s to complete (%d.%06d)\n", >+ ctx->qc_interval, ctx->qc_interval==1?"":"s", >+ (int)diff.tv_sec, (int)diff.tv_usec); >+ memcpy(&sleeptime, &interval, sizeof(sleeptime)); >+ } >+ >+ /* Could hit a watchdog timer here if we wanted to */ > if (_running) >- sleep(ctx->qc_interval); >+ select(0, NULL, NULL, NULL, &sleeptime); > } > > return 0; >@@ -829,12 +1117,18 @@ > ctx->qc_interval = 1; > ctx->qc_tko = 10; > ctx->qc_scoremin = 0; >- ctx->qc_flags = RF_REBOOT; >+ ctx->qc_flags = RF_REBOOT | RF_ALLOW_KILL | RF_UPTIME; >+ /* | RF_STOP_CMAN;*/ >+ if (debug) >+ ctx->qc_flags |= RF_DEBUG; >+ ctx->qc_sched = SCHED_RR; >+ ctx->qc_sched_prio = 1; > > /* Get log log_facility */ > snprintf(query, sizeof(query), "/cluster/quorumd/@log_facility"); > if (ccs_get(ccsfd, query, &val) == 0) { > clu_set_facility(val); >+ clulog(LOG_DEBUG, "Log facility: %s\n", val); > free(val); > } > >@@ -867,6 +1161,38 @@ > if (ctx->qc_tko < 3) > ctx->qc_tko = 3; > } >+ >+ /* Get up-tko (transition off->online) */ >+ ctx->qc_tko_up = (ctx->qc_tko / 2); >+ 
snprintf(query, sizeof(query), "/cluster/quorumd/@tko_up"); >+ if (ccs_get(ccsfd, query, &val) == 0) { >+ ctx->qc_tko_up = atoi(val); >+ free(val); >+ } >+ if (ctx->qc_tko_up < 2) >+ ctx->qc_tko_up = 2; >+ >+ /* After coming online, wait this many intervals before >+ being allowed to bid for master. */ >+ ctx->qc_upgrade_wait = 2; /* (ctx->qc_tko / 3); */ >+ snprintf(query, sizeof(query), "/cluster/quorumd/@upgrade_wait"); >+ if (ccs_get(ccsfd, query, &val) == 0) { >+ ctx->qc_upgrade_wait = atoi(val); >+ free(val); >+ } >+ if (ctx->qc_upgrade_wait < 1) >+ ctx->qc_upgrade_wait = 1; >+ >+ /* wait this many intervals after bidding for master before >+ becoming Caesar */ >+ ctx->qc_master_wait = (ctx->qc_tko / 3); >+ snprintf(query, sizeof(query), "/cluster/quorumd/@master_wait"); >+ if (ccs_get(ccsfd, query, &val) == 0) { >+ ctx->qc_master_wait = atoi(val); >+ free(val); >+ } >+ if (ctx->qc_master_wait < 2) >+ ctx->qc_master_wait = 2; > > /* Get votes */ > snprintf(query, sizeof(query), "/cluster/quorumd/@votes"); >@@ -903,6 +1229,37 @@ > if (ctx->qc_scoremin < 0) > ctx->qc_scoremin = 0; > } >+ >+ /* Get scheduling queue */ >+ snprintf(query, sizeof(query), "/cluster/quorumd/@scheduler"); >+ if (ccs_get(ccsfd, query, &val) == 0) { >+ switch(val[0]) { >+ case 'r': >+ case 'R': >+ ctx->qc_sched = SCHED_RR; >+ break; >+ case 'f': >+ case 'F': >+ ctx->qc_sched = SCHED_FIFO; >+ break; >+ case 'o': >+ case 'O': >+ ctx->qc_sched = SCHED_OTHER; >+ break; >+ default: >+ clulog(LOG_WARNING, "Invalid scheduling queue '%s'\n", >+ val); >+ break; >+ } >+ free(val); >+ } >+ >+ /* Get priority */ >+ snprintf(query, sizeof(query), "/cluster/quorumd/@priority"); >+ if (ccs_get(ccsfd, query, &val) == 0) { >+ ctx->qc_sched_prio = atoi(val); >+ free(val); >+ } > > /* Get reboot flag for when we transition -> offline */ > /* default = on, so, 0 to turn off */ >@@ -912,12 +1269,71 @@ > ctx->qc_flags &= ~RF_REBOOT; > free(val); > } >+ >+ /* >+ * Get flag to see if we're supposed to kill 
cman if qdisk is not >+ * available. >+ */ >+ /* default = off, so, 1 to turn on */ >+ snprintf(query, sizeof(query), "/cluster/quorumd/@stop_cman"); >+ if (ccs_get(ccsfd, query, &val) == 0) { >+ if (!atoi(val)) >+ ctx->qc_flags &= ~RF_STOP_CMAN; >+ else >+ ctx->qc_flags |= RF_STOP_CMAN; >+ free(val); >+ } >+ >+ >+ /* >+ * Get flag to see if we're supposed to reboot if we can't complete >+ * a pass in failure time >+ */ >+ /* default = off, so, 1 to turn on */ >+ snprintf(query, sizeof(query), "/cluster/quorumd/@paranoid"); >+ if (ccs_get(ccsfd, query, &val) == 0) { >+ if (!atoi(val)) >+ ctx->qc_flags &= ~RF_PARANOID; >+ else >+ ctx->qc_flags |= RF_PARANOID; >+ free(val); >+ } >+ >+ >+ /* >+ * Get flag to see if we're supposed to reboot if we can't complete >+ * a pass in failure time >+ */ >+ /* default = off, so, 1 to turn on */ >+ snprintf(query, sizeof(query), "/cluster/quorumd/@allow_kill"); >+ if (ccs_get(ccsfd, query, &val) == 0) { >+ if (!atoi(val)) >+ ctx->qc_flags &= ~RF_ALLOW_KILL; >+ else >+ ctx->qc_flags |= RF_ALLOW_KILL; >+ free(val); >+ } >+ >+ /* >+ * Get flag to see if we're supposed to use /proc/uptime instead of >+ * gettimeofday(2) >+ */ >+ /* default = off, so, 1 to turn on */ >+ snprintf(query, sizeof(query), "/cluster/quorumd/@use_uptime"); >+ if (ccs_get(ccsfd, query, &val) == 0) { >+ if (!atoi(val)) >+ ctx->qc_flags &= ~RF_UPTIME; >+ else >+ ctx->qc_flags |= RF_UPTIME; >+ free(val); >+ } > > *cfh = configure_heuristics(ccsfd, h, maxh); > > clulog(LOG_DEBUG, > "Quorum Daemon: %d heuristics, %d interval, %d tko, %d votes\n", > *cfh, ctx->qc_interval, ctx->qc_tko, ctx->qc_votes); >+ clulog(LOG_DEBUG, "Run Flags: %08x\n", ctx->qc_flags); > > ccs_disconnect(ccsfd); > >@@ -925,58 +1341,124 @@ > } > > >+void >+check_stop_cman(qd_ctx *ctx) >+{ >+ if (!(ctx->qc_flags & RF_STOP_CMAN)) >+ return; >+ >+ clulog(LOG_WARNING, "Telling CMAN to leave the cluster; qdisk is not" >+ " available\n"); >+#if (defined(LIBCMAN_VERSION) && LIBCMAN_VERSION >= 2) >+ 
if (cman_shutdown(ctx->qc_ch, 0) < 0) { >+#else >+ int x = 0; >+ if (ioctl(cman_get_fd(ctx->qc_ch), SIOCCLUSTER_LEAVE_CLUSTER, &x) < 0) { >+#endif >+ clulog(LOG_CRIT, "Could not leave the cluster - rebooting\n"); >+ sleep(5); >+ if (ctx->qc_flags & RF_DEBUG) >+ return; >+ reboot(RB_AUTOBOOT); >+ } >+} >+ >+ > int > main(int argc, char **argv) > { > cman_node_t me; >- int cfh, rv; >+ int cfh, rv, forked = 0, nfd = -1; > qd_ctx ctx; > cman_handle_t ch; > node_info_t ni[MAX_NODES_DISK]; > struct h_data h[10]; > char debug = 0, foreground = 0; > char device[128]; >+ pid_t pid; > >- while ((rv = getopt(argc, argv, "fd")) != EOF) { >+ if (check_process_running(argv[0], &pid) && pid !=getpid()) { >+ printf("QDisk services already running\n"); >+ return 0; >+ } >+ >+ while ((rv = getopt(argc, argv, "fdQ")) != EOF) { > switch (rv) { > case 'd': > debug = 1; > break; > case 'f': > foreground = 1; >+ clu_log_console(1); >+ break; >+ case 'Q': >+ /* Make qdisk very quiet */ >+ nfd = open("/dev/null", O_RDWR); >+ close(0); >+ close(1); >+ close(2); >+ dup2(nfd, 0); >+ dup2(nfd, 1); >+ dup2(nfd, 2); >+ close(nfd); >+ break; > default: > break; > } > } >+ > #if (defined(LIBCMAN_VERSION) && LIBCMAN_VERSION >= 2) > ch = cman_admin_init(NULL); > #else > ch = cman_init(NULL); > #endif > if (!ch) { >- printf("Could not connect to cluster (CMAN not running?)\n"); >- return -1; >+ if (!foreground && !forked) { >+ if (daemon_init(argv[0]) < 0) >+ return -1; >+ else >+ forked = 1; >+ } >+ >+ clulog(LOG_INFO, "Waiting for CMAN to start\n"); >+ >+ do { >+ sleep(5); >+#if (defined(LIBCMAN_VERSION) && LIBCMAN_VERSION >= 2) >+ ch = cman_admin_init(NULL); >+#else >+ ch = cman_init(NULL); >+#endif >+ } while (!ch); > } > >- if (cman_get_node(ch, CMAN_NODEID_US, &me) < 0) { >- printf("Could not determine local node ID; cannot start\n"); >- return -1; >+ memset(&me, 0, sizeof(me)); >+ while (cman_get_node(ch, CMAN_NODEID_US, &me) < 0) { >+ if (!foreground && !forked) { >+ if (daemon_init(argv[0]) 
< 0) >+ return -1; >+ else >+ forked = 1; >+ } >+ sleep(5); > } > > qd_init(&ctx, ch, me.cn_nodeid); > > signal(SIGINT, int_handler); >+ signal(SIGTERM, int_handler); > >- if (debug) >+ if (debug) { > clu_set_loglevel(LOG_DEBUG); >- if (foreground) >- clu_log_console(1); >+ ctx.qc_flags |= RF_DEBUG; >+ } > > if (get_config_data(NULL, &ctx, h, 10, &cfh, debug) < 0) { > clulog_and_print(LOG_CRIT, "Configuration failed\n"); >+ check_stop_cman(&ctx); > return -1; > } >- >+ > if (ctx.qc_label) { > if (find_partitions("/proc/partitions", > ctx.qc_label, device, >@@ -984,6 +1466,7 @@ > clulog_and_print(LOG_CRIT, "Unable to match label" > " '%s' to any device\n", > ctx.qc_label); >+ check_stop_cman(&ctx); > return -1; > } > >@@ -999,17 +1482,26 @@ > clulog(LOG_CRIT, > "Specified partition %s does not have a " > "qdisk label\n", ctx.qc_device); >+ check_stop_cman(&ctx); > return -1; > } > } > >- if (!foreground) >- daemon(0,0); >+ if (!foreground && !forked) { >+ if (daemon_init(argv[0]) < 0) >+ return -1; >+ } >+ >+ set_priority(ctx.qc_sched, ctx.qc_sched_prio); > > if (quorum_init(&ctx, ni, MAX_NODES_DISK, h, cfh) < 0) { > clulog_and_print(LOG_CRIT, "Initialization failed\n"); >+ check_stop_cman(&ctx); > return -1; > } >+ >+ if (!_running) >+ return 0; > > cman_register_quorum_device(ctx.qc_ch, ctx.qc_device, ctx.qc_votes); > /* >@@ -1025,14 +1517,12 @@ > } > */ > >- quorum_loop(&ctx, ni, MAX_NODES_DISK); >- cman_unregister_quorum_device(ctx.qc_ch); >+ if (quorum_loop(&ctx, ni, MAX_NODES_DISK) == 0) >+ cman_unregister_quorum_device(ctx.qc_ch); > > quorum_logout(&ctx); >- > qd_destroy(&ctx); > > return 0; >- > } > >Index: qdisk/score.c >=================================================================== >RCS file: /cvs/cluster/cluster/cman/qdisk/score.c,v >retrieving revision 1.2 >diff -u -r1.2 score.c >--- qdisk/score.c 19 May 2006 14:41:35 -0000 1.2 >+++ qdisk/score.c 6 Mar 2007 15:49:38 -0000 >@@ -32,14 +32,20 @@ > #include <string.h> > #include <ccs.h> > #include 
<clulog.h> >+#include <sched.h> >+#include <sys/mman.h> >+#include "disk.h" > #include "score.h" > > static pthread_mutex_t sc_lock = PTHREAD_MUTEX_INITIALIZER; > static int _score = 0, _maxscore = 0, _score_thread_running = 0; > static pthread_t score_thread = (pthread_t)0; >+void set_priority(int, int); > > struct h_arg { > struct h_data *h; >+ int sched_queue; >+ int sched_prio; > int count; > }; > >@@ -97,6 +103,20 @@ > h->childpid = pid; > return 0; > } >+ >+ /* >+ * always use SCHED_OTHER for the child processes >+ * nice -1 is fine; but we don't know what the child process >+ * might do, so leaving it (potentially) in SCHED_RR or SCHED_FIFO >+ * is out of the question >+ * >+ * XXX if you set SCHED_OTHER in the conf file and nice 20, the below >+ * will make the heuristics a higher prio than qdiskd. This should be >+ * fine in practice, because running qdiskd at nice 20 will cause all >+ * sorts of problems on a busy system. >+ */ >+ set_priority(SCHED_OTHER, -1); >+ munlockall(); > > argv[0] = "/bin/sh"; > argv[1] = "-c"; >@@ -122,6 +142,13 @@ > > *score = 0; > *maxscore = 0; >+ >+ printf("max = %d\n", max); >+ /* Allow operation w/o any heuristics */ >+ if (!max) { >+ *score = *maxscore = 1; >+ return; >+ } > > for (x = 0; x < max; x++) { > *maxscore += h[x].score; >@@ -141,22 +168,51 @@ > int status; > > if (h->childpid == 0) >+ /* No child to check */ > return 0; > > ret = waitpid(h->childpid, &status, block?0:WNOHANG); > if (!block && ret == 0) >+ /* No children exited */ > return 0; > > h->childpid = 0; >- h->available = 0; > if (ret < 0 && errno == ECHILD) >- return -1; >- if (!WIFEXITED(status)) >- return 0; >- if (WEXITSTATUS(status) != 0) >- return 0; >- h->available = 1; >+ /* wrong child? 
*/ >+ goto miss; >+ if (!WIFEXITED(status)) { >+ ret = 0; >+ goto miss; >+ } >+ if (WEXITSTATUS(status) != 0) { >+ ret = 0; >+ goto miss; >+ } >+ >+ /* Returned 0 and was not killed */ >+ if (!h->available) { >+ h->available = 1; >+ clulog(LOG_INFO, "Heuristic: '%s' UP\n", h->program); >+ } >+ h->misses = 0; > return 0; >+ >+miss: >+ if (h->available) { >+ h->misses++; >+ if (h->misses >= h->tko) { >+ clulog(LOG_INFO, >+ "Heuristic: '%s' DOWN (%d/%d)\n", >+ h->program, h->misses, h->tko); >+ h->available = 0; >+ } else { >+ clulog(LOG_DEBUG, >+ "Heuristic: '%s' missed (%d/%d)\n", >+ h->program, h->misses, h->tko); >+ } >+ } >+ >+ return ret; > } > > >@@ -204,7 +260,9 @@ > do { > h[x].program = NULL; > h[x].available = 0; >+ h[x].misses = 0; > h[x].interval = 2; >+ h[x].tko = 1; > h[x].score = 1; > h[x].childpid = 0; > h[x].nextrun = 0; >@@ -236,9 +294,20 @@ > if (h[x].interval <= 0) > h[x].interval = 2; > } >+ >+ /* Get tko for this heuristic */ >+ snprintf(query, sizeof(query), >+ "/cluster/quorumd/heuristic[%d]/@tko", x+1); >+ if (ccs_get(ccsfd, query, &val) == 0) { >+ h[x].tko= atoi(val); >+ free(val); >+ if (h[x].tko <= 0) >+ h[x].tko = 1; >+ } > >- clulog(LOG_DEBUG, "Heuristic: '%s' score=%d interval=%d\n", >- h[x].program, h[x].score, h[x].interval); >+ clulog(LOG_DEBUG, >+ "Heuristic: '%s' score=%d interval=%d tko=%d\n", >+ h[x].program, h[x].score, h[x].interval, h[x].tko); > > } while (++x < max); > >@@ -264,6 +333,20 @@ > > > /** >+ Call this if no heuristics are set to run in master-wins mode >+ */ >+int >+fudge_scoring(void) >+{ >+ pthread_mutex_lock(&sc_lock); >+ _score = _maxscore = 1; >+ pthread_mutex_unlock(&sc_lock); >+ >+ return 0; >+} >+ >+ >+/** > Loop for the scoring thread. 
> */ > void * >@@ -271,6 +354,8 @@ > { > struct h_arg *args = (struct h_arg *)arg; > int score, maxscore; >+ >+ set_priority(args->sched_queue, args->sched_prio); > > while (_score_thread_running) { > fork_heuristics(args->h, args->count); >@@ -317,7 +402,7 @@ > to pass in h if it was allocated on the stack. > */ > int >-start_score_thread(struct h_data *h, int count) >+start_score_thread(qd_ctx *ctx, struct h_data *h, int count) > { > pthread_attr_t attrs; > struct h_arg *args; >@@ -337,8 +422,11 @@ > > memcpy(args->h, h, (sizeof(struct h_data) * count)); > args->count = count; >+ args->sched_queue = ctx->qc_sched; >+ args->sched_prio = ctx->qc_sched_prio; > > _score_thread_running = 1; >+ > pthread_attr_init(&attrs); > pthread_attr_setinheritsched(&attrs, PTHREAD_INHERIT_SCHED); > pthread_create(&score_thread, &attrs, score_thread_main, args); >Index: qdisk/score.h >=================================================================== >RCS file: /cvs/cluster/cluster/cman/qdisk/score.h,v >retrieving revision 1.2 >diff -u -r1.2 score.h >--- qdisk/score.h 19 May 2006 14:41:35 -0000 1.2 >+++ qdisk/score.h 6 Mar 2007 15:49:38 -0000 >@@ -32,7 +32,9 @@ > char * program; > int score; > int available; >+ int tko; > int interval; >+ int misses; > pid_t childpid; > time_t nextrun; > }; >@@ -50,11 +52,18 @@ > /* > Start the thread which runs the scoring applets > */ >-int start_score_thread(struct h_data *h, int count); >+int start_score_thread(qd_ctx *ctx, struct h_data *h, int count); > > /* > Get our score + maxscore > */ > int get_my_score(int *score, int *maxscore); > >+/* >+ Set score + maxscore to 1. Call if no heuristics are present >+ to enable master-wins mode >+ */ >+int fudge_scoring(void); >+ >+ > #endif
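The cycle-timing change in `quorum_loop()` above (measure how long one pass took, then sleep only for the remainder of the interval) can be sketched in isolation. This is a minimal sketch, not the patch's code: `tv_compare`, `tv_subtract`, and `cycle_sleep_time` are hypothetical names introduced here for illustration, and `tv_compare` uses the conventional `strcmp`-style sign, whereas the patch's `_cmp_tv` returns 1 when the left operand is smaller. The branch structure follows the patch: a pass that takes at least a full interval is treated as an overrun and the next sleep falls back to the full interval.

```c
#include <sys/time.h>

/* Hypothetical helper (not from the patch): returns <0, 0 or >0 in the
 * style of strcmp. The patch's _cmp_tv uses the opposite sign. */
int tv_compare(const struct timeval *a, const struct timeval *b)
{
	if (a->tv_sec != b->tv_sec)
		return (a->tv_sec < b->tv_sec) ? -1 : 1;
	if (a->tv_usec != b->tv_usec)
		return (a->tv_usec < b->tv_usec) ? -1 : 1;
	return 0;
}

/* Hypothetical helper: out = later - earlier.
 * Assumes later >= earlier and both values are normalized
 * (0 <= tv_usec < 1000000). */
void tv_subtract(struct timeval *out, const struct timeval *later,
		 const struct timeval *earlier)
{
	out->tv_sec = later->tv_sec - earlier->tv_sec;
	out->tv_usec = later->tv_usec - earlier->tv_usec;
	if (out->tv_usec < 0) {
		/* Borrow one second to keep tv_usec non-negative */
		out->tv_sec -= 1;
		out->tv_usec += 1000000;
	}
}

/* Decide how long to sleep after one pass of the loop.
 *
 * elapsed  - time the pass took
 * interval - configured cycle length
 *
 * Returns 0 if the pass finished early; sleeptime is then the
 * remainder of the interval. Returns 1 if the pass took at least a
 * full interval; sleeptime is then the full interval, matching the
 * fallback branch in the patch (where a warning is also logged). */
int cycle_sleep_time(struct timeval *sleeptime,
		     const struct timeval *elapsed,
		     const struct timeval *interval)
{
	if (tv_compare(elapsed, interval) < 0) {
		tv_subtract(sleeptime, interval, elapsed);
		return 0;
	}

	*sleeptime = *interval;
	return 1;
}
```

With a 1 second interval, a pass that took 0.25 seconds yields a sleep of 0.75 seconds; a pass that took 1.5 seconds is reported as an overrun and the sleep is the full 1 second. The resulting `sleeptime` could then be handed to `select(0, NULL, NULL, NULL, &sleeptime)` as the patch does.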