Bug 152548 - cyrus-murder problems with database corruption in the frontend/master
cyrus-murder problems with database corruption in the frontend/master
Status: CLOSED INSUFFICIENT_DATA
Product: Fedora
Classification: Fedora
Component: cyrus-imapd (Show other bugs)
rawhide
i686 Linux
medium Severity high
: ---
: ---
Assigned To: Tomas Janousek
Brian Brock
bzcl34nup
:
Depends On:
Blocks:
  Show dependency treegraph
 
Reported: 2005-03-30 03:45 EST by João Assad
Modified: 2008-05-06 20:07 EDT (History)
3 users (show)

See Also:
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2008-05-06 20:07:36 EDT
Type: ---
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---


Attachments (Terms of Use)
patch to replace munmap followed by mmap for a mremap (2.06 KB, patch)
2005-05-03 22:55 EDT, João Assad
no flags Details | Diff

  None (edit)
Description João Assad 2005-03-30 03:45:00 EST
From Bugzilla Helper:
User-Agent: Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.7.6) Gecko/20050225 Firefox/1.0.1

Description of problem:

We currently have 1.200.000+ mailboxes split in 3 backend servers using 1
frontend / master server (both services running on the same server) for a
grand total of 3.400.000+ mailboxes and subfolders.

the servers are used for imap access only. no pop3, no sieve scripts,
no nntp, nothing else besides imap.

Now, the problem in question.

after some days running smoothly I start getting errors on the  master/frontend
server. It has happened a few times now..

So far I've seen it happen only on the master/frontend server.. not on the backend servers.

the errors:

  cyrus/mupdate[1433]: DBERROR: skiplist recovery
  /var/lib/imap/mailboxes.db: 141BAD94 should be ADD or DELETE
  cyrus/mupdate[1433]: DBERROR: error updating database user.1215407: cyrusdb error

then after some time I start getting these:

  cyrus/lmtp[1276]: authentication to remote mupdate server failed: EOF from server
  cyrus/lmtp[1276]: couldn't connect to 10.1.5.101: no authentication to server

after a little more time

  cyrus/imap[30969]: kick_mupdate: can't connect to target: Connection refused
  cyrus/lmtp[2240]: mupdate-client: connection to server closed: end of file reached
  cyrus/lmtp[2240]: couldn't connect to 10.1.5.101: no connection to server

then this is what I get when trying to restart cyrus frontend/master

  cyrus/ctl_cyrusdb[30607]: recovering cyrus databases
  cyrus/ctl_cyrusdb[30607]: DBERROR: skiplist recovery /var/lib/imap/mailboxes.db: 141BAD94 should be ADD or DELETE
  cyrus/ctl_cyrusdb[30607]: DBERROR: opening /var/lib/imap/mailboxes.db: cyrusdb error


what I need to do next is completely delete the dbs from the master/frontend
and reimport all the mailboxes, but that takes hours. 

Thank you very much,

Jo�o Assad

------------------
Below is my frontend/master configuration

##### cyrus.conf

  START {
   # do not delete this entry!
   recover       cmd="ctl_cyrusdb -r"
  }

  # UNIX sockets start with a slash and are put into /var/lib/imap/sockets
  SERVICES {
   # add or remove based on preferences
   mupdate       cmd="mupdate -m" listen=3905 prefork=1
   lmtp          cmd="lmtpproxyd" listen="lmtp" prefork=0
   imap          cmd="proxyd" listen="imap" prefork=0
  }

  EVENTS {
   # this is required
   checkpoint    cmd="ctl_cyrusdb -c" period=240
  }

###### imapd.conf

  configdirectory: /var/lib/imap
  partition-default: /tmp
  admins: cyrus
  sievedir: /var/lib/imap/sieve
  sendmail: /usr/sbin/sendmail
  hashimapspool: true
  sasl_pwcheck_method: saslauthd
  sasl_mech_list: PLAIN
  tls_cert_file: /usr/share/ssl/certs/crt.crt
  tls_key_file: /usr/share/ssl/certs/key.key
  tls_ca_file: /usr/share/ssl/certs/ca.ca

  allowusermoves: 1

  #Backend servers
  cyrus-be1_password: ***
  cyrus-be2_password: ***
  cyrus-be3_password: ***
  proxy_authname: cyrus


  #Mupdate server
  mupdate_server:10.1.5.101
  mupdate_authname:cyrus
  mupdate_password: ***

  maxmessagesize: 2097152
  syslog_prefix:cyrus
  lmtp_over_quota_perm_failure:1
  quotawarn: 110
  imapidlepoll: 0
  fulldirhash: 1
  munge8bit: 0
  timeout: 10

  tls_session_timeout: 0
  mupdate_connections_max: 1024
  berkeley_cachesize:102400
  berkeley_txns_max:500

Version-Release number of selected component (if applicable):
cyrus-imapd-2.2.10-3.fc2 / cyrus-imapd-murder-2.2.10-3.fc2

How reproducible:
Couldn't Reproduce

Steps to Reproduce:


Additional info:

You cant reproduce the problem by following steps. it happens after a random period of time, veries from a couple of days to a week sometimes
Comment 1 John Dennis 2005-03-30 12:29:42 EST
Thank you for your bug report. In this instance I'm going to recommend you open
a bug report with the authors of cyrus imap. There is a much greater probability
you will get timely information from the folks who distribute and maintain cyrus
imap for an installation of this complexity which is exhibiting random failures.
We will certainly look at the issue too, but the authors and maintainers are
probably the best resource for this particular issue.

You can file the bug at the cyrus bugzilla which is here:
https://bugzilla.andrew.cmu.edu. If you do please cite the bug numbers in both
reports so they can be cross referenced.
Comment 2 João Assad 2005-03-30 12:57:57 EST
Thank you for your quick reply
I've just filed the bug on cyrus bugzilla as requested

http://bugzilla.andrew.cmu.edu/show_bug.cgi?id=2640
Comment 3 João Assad 2005-03-31 17:45:42 EST
New info

It seems before the corruption , I always get the following error

cyrus/mupdate[12614]: IOERROR: mapping /var/lib/imap/mailboxes.db file: Cannot
allocate memory
cyrus/mupdate[12614]: failed to mmap /var/lib/imap/mailboxes.db file
cyrus/master[12580]: service mupdate pid 12614 in READY state: terminated
abnormally 

This error is given by the map_refresh method ... there is 2 declarations of the
method. one in lib/map_stupidshared.c and another in lib/map_shared.c .

I dont know which one cyrus use when compiled under fedora.

Regards
Comment 4 John Dennis 2005-04-01 08:15:36 EST
hmm... I'm not sure if this is relevant or not, but there are a few potential
problems that could crop up with berkley db, which cyrus-imapd links against.

One is the processor, there was a problem with AMD K6 cpu's, by any chance does
your system have this processor?

The second is that there were a few minor changes to the db API recently (the
rpm is "db4") I wonder if cyrus-imapd and db4 have gotten out of sync somehow.
Have you upgraded db4 recently? What is the vesion of db4 that is installed?
Comment 5 João Assad 2005-04-01 10:51:35 EST
The db thats getting corrupted is not berkley db, its skiplist.

It looks like Im running out of resources to mmap the mailbox file , cyrus isn't
handling this behaviour properly and the db is ending up corrupted.

But why am I running out of resources ? Why is it complaining "Cannot allocate
memory".

Im nowhere near running out of RAM on this box.

Thats how my ulimits is configured:
core file size        (blocks, -c) 0
data seg size         (kbytes, -d) unlimited
file size             (blocks, -f) unlimited
max locked memory     (kbytes, -l) unlimited
max memory size       (kbytes, -m) unlimited
open files                    (-n) 200000
pipe size          (512 bytes, -p) 8
stack size            (kbytes, -s) unlimited
cpu time             (seconds, -t) unlimited
max user processes            (-u) unlimited
virtual memory        (kbytes, -v) unlimited
Comment 6 João Assad 2005-04-06 14:29:21 EDT
Our system is P4, not K6
db4-4.2.52-3.1
Comment 7 João Assad 2005-04-08 13:35:22 EDT
I got 2 gdb backtraces from the production server.

----------------------------
#18988 0x0804dcd3 in fatal (
   s=0x8d52f070 "Internal error: assertion failed: mupdate.c: 586: 0",
   code=75) at mupdate.c:586
#18989 0x08082622 in assertionfailed (file=0x8082ab9 "mupdate.c", line=586,
   expr=0x8082ce1 "0") at assert.c:61
#18990 0x0804dcd3 in fatal (
   s=0x8d52f4c0 "failed to mmap /var/lib/imap/mailboxes.db file", code=75)
   at mupdate.c:586
#18991 0x080755f5 in map_refresh (fd=8, onceonly=0, base=0x8ad56e0,
   len=0x8ad56e4, newlen=366993408,
   name=0x8ad5710 "/var/lib/imap/mailboxes.db", mboxname=0x0)
   at map_shared.c:105
#18992 0x08078178 in update_lock (db=0x8ad56d8, txn=0x8d52f6e0)
   at cyrusdb_skiplist.c:572
#18993 0x0807a2f9 in mycommit (db=0x8ad56d8, tid=0x8d52f6e0)
   at cyrusdb_skiplist.c:1345
#18994 0x08079d64 in mystore (db=0x8ad56d8, key=0x8b41ba8 "user.1440034",
   keylen=12,
   data=0x8c5a248 "1 cyrus-be2.gazzag.com!default 1440034\tlrswipcda\t",
   datalen=49, tid=0x0, overwrite=1) at cyrusdb_skiplist.c:1225
#18995 0x08079de5 in store (db=0x8ad56d8, key=0x8b41ba8 "user.1440034",
   keylen=12,
   data=0x8c5a248 "1 cyrus-be2.gazzag.com!default 1440034\tlrswipcda\t",
   datalen=49, tid=0x0) at cyrusdb_skiplist.c:1244
#18996 0x080571a9 in mboxlist_insertremote (name=0x8b41ba8 "user.1440034",
   mbtype=0, host=0x8b144f0 "cyrus-be2.gazzag.com!default",
   acl=0x8c4dab8 "1440034\tlrswipcda\t", tid=0x0) at mboxlist.c:801
#18997 0x0804f869 in database_log (mb=0x8c4daa8, mytid=0x0) at mupdate.c:1300
#18998 0x0804ffe3 in cmd_set (C=0x8bdd778, tag=0x8be8918 "X1",
   mailbox=0x8bb6ec8 "user.1440034",
   server=0x8bb54b8 "cyrus-be2.gazzag.com!default",
   acl=0x8c36e28 "1440034\tlrswipcda\t", t=SET_ACTIVE) at mupdate.c:1527
#18999 0x0804e1f0 in docmd (c=0x8bdd778) at mupdate.c:685
#19000 0x0804f669 in thread_main (rock=0x0) at mupdate.c:1228
#19001 0x0069b98c in start_thread () from /lib/tls/libpthread.so.0
#19002 0x005897da in clone () from /lib/tls/libc.so.6

----------------------------

#18988 0x0804dcd3 in fatal (
   s=0x9fcd6070 "Internal error: assertion failed: mupdate.c: 586: 0",
   code=75) at mupdate.c:586
#18989 0x08082622 in assertionfailed (file=0x8082ab9 "mupdate.c", line=586,
   expr=0x8082ce1 "0") at assert.c:61
#18990 0x0804dcd3 in fatal (
   s=0x9fcd64c0 "failed to mmap /var/lib/imap/mailboxes.db file", code=75)
   at mupdate.c:586
#18991 0x080755f5 in map_refresh (fd=8, onceonly=0, base=0x85856e0,
   len=0x85856e4, newlen=366583808,
   name=0x8585710 "/var/lib/imap/mailboxes.db", mboxname=0x0)
   at map_shared.c:105
#18992 0x08078178 in update_lock (db=0x85856d8, txn=0x9fcd66e0)
   at cyrusdb_skiplist.c:572
#18993 0x0807a2f9 in mycommit (db=0x85856d8, tid=0x9fcd66e0)
   at cyrusdb_skiplist.c:1345
#18994 0x08079d64 in mystore (db=0x85856d8,
   key=0x88046a78 "user.950836._TRASH", keylen=18,
   data=0x88056cc0 "1 cyrus-be3.gazzag.com!default 950836\tlrswipcda\t",
   datalen=48, tid=0x0, overwrite=1) at cyrusdb_skiplist.c:1225
#18995 0x08079de5 in store (db=0x85856d8, key=0x88046a78 "user.950836._TRASH",
   keylen=18,
   data=0x88056cc0 "1 cyrus-be3.gazzag.com!default 950836\tlrswipcda\t",
   datalen=48, tid=0x0) at cyrusdb_skiplist.c:1244
#18996 0x080571a9 in mboxlist_insertremote (
   name=0x88046a78 "user.950836._TRASH", mbtype=0,
   host=0x88019a70 "cyrus-be3.gazzag.com!default",
   acl=0x8804b7c0 "950836\tlrswipcda\t", tid=0x0) at mboxlist.c:801
#18997 0x0804f869 in database_log (mb=0x8804b7b0, mytid=0x0) at mupdate.c:1300
#18998 0x0804ffe3 in cmd_set (C=0x9dd0b9b8, tag=0x9dd9b010 "X1",
   mailbox=0x9ddb1140 "user.950836._TRASH",
   server=0x9ddb5360 "cyrus-be3.gazzag.com!default",
   acl=0x88016f70 "950836\tlrswipcda\t", t=SET_ACTIVE) at mupdate.c:1527
#18999 0x0804e1f0 in docmd (c=0x9dd0b9b8) at mupdate.c:685
#19000 0x0804f669 in thread_main (rock=0x0) at mupdate.c:1228
#19001 0x0069b98c in start_thread () from /lib/tls/libpthread.so.0
#19002 0x005897da in clone () from /lib/tls/libc.so.6
Comment 8 João Assad 2005-04-08 15:15:20 EDT
To generate this dump I had to change the fatal inside mupdate.c as instructed
by Derrick (cyrus developer)

void fatal(const char *s, int code)
{
  chdir("/tmp");
  assert(0);
  exit(code); /* shut up GCC */
}
Comment 9 João Assad 2005-04-08 16:01:38 EDT
and now a strace from the production server.. Im sending just the last few lines
of it. its really big.

16:18:02.399469 accept(4, 0, NULL)      = 104
16:18:02.470386 getpeername(104, {sa_family=AF_INET, sin_port=htons(33978),
sin_addr=inet_addr("10.1.5.202")}, [16]) = 0
16:18:02.470523 getpeername(104, {sa_family=AF_INET, sin_port=htons(33978),
sin_addr=inet_addr("10.1.5.202")}, [16]) = 0
16:18:02.470658 getsockname(104, {sa_family=AF_INET, sin_port=htons(3905),
sin_addr=inet_addr("10.1.5.101")}, [16]) = 0
16:18:02.470782 open("/etc/hosts.allow", O_RDONLY) = 112
16:18:02.470850 fstat64(112, {st_mode=S_IFREG|0644, st_size=2, ...}) = 0
16:18:02.470958 mmap2(NULL, 4096, PROT_READ|PROT_WRITE,
MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0xb691d000
16:18:02.471015 read(112, "#\n", 4096)  = 2
16:18:02.471079 read(112, "", 4096)     = 0
16:18:02.471135 close(112)              = 0
16:18:02.471185 munmap(0xb691d000, 4096) = 0
16:18:02.471244 open("/etc/hosts.deny", O_RDONLY) = 112
16:18:02.471307 fstat64(112, {st_mode=S_IFREG|0644, st_size=2, ...}) = 0
16:18:02.471411 mmap2(NULL, 4096, PROT_READ|PROT_WRITE,
MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0xb691d000
16:18:02.471466 read(112, "#\n", 4096)  = 2
16:18:02.471526 read(112, "", 4096)     = 0
16:18:02.471577 close(112)              = 0
16:18:02.471625 munmap(0xb691d000, 4096) = 0
16:18:02.471683 time([1112987882])      = 1112987882
16:18:02.471750 rt_sigaction(SIGPIPE, {0x585aa0, [], SA_RESTORER, 0x4f4f48},
{SIG_IGN}, 8) = 0
16:18:02.471838 send(5, "<23>Apr  8 16:18:02 cyrus/mupdat"..., 61, 0) = 61
16:18:02.471927 rt_sigaction(SIGPIPE, {SIG_IGN}, NULL, 8) = 0
16:18:02.472012 write(3, "\4\0\0\0\24b\0\0", 8) = 8
16:18:02.472086 write(7, "h\0\0\0", 4)  = 4
16:18:02.472588 accept(4, 0, NULL)      = 106
16:18:02.498519 getpeername(106, {sa_family=AF_INET, sin_port=htons(33979),
sin_addr=inet_addr("10.1.5.202")}, [16]) = 0
16:18:02.498661 getpeername(106, {sa_family=AF_INET, sin_port=htons(33979),
sin_addr=inet_addr("10.1.5.202")}, [16]) = 0
16:18:02.498793 getsockname(106, {sa_family=AF_INET, sin_port=htons(3905),
sin_addr=inet_addr("10.1.5.101")}, [16]) = 0
16:18:02.498918 open("/etc/hosts.allow", O_RDONLY) = 112
16:18:02.498987 fstat64(112, {st_mode=S_IFREG|0644, st_size=2, ...}) = 0
16:18:02.499125 mmap2(NULL, 4096, PROT_READ|PROT_WRITE,
MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0xb691d000
16:18:02.499183 read(112, "#\n", 4096)  = 2
16:18:02.499250 read(112, "", 4096)     = 0
16:18:02.499301 close(112)              = 0
16:18:02.499351 munmap(0xb691d000, 4096) = 0
16:18:02.499408 open("/etc/hosts.deny", O_RDONLY) = 112
16:18:02.499473 fstat64(112, {st_mode=S_IFREG|0644, st_size=2, ...}) = 0
16:18:02.499585 mmap2(NULL, 4096, PROT_READ|PROT_WRITE,
MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0xb691d000
16:18:02.499639 read(112, "#\n", 4096)  = 2
16:18:02.499700 read(112, "", 4096)     = 0
16:18:02.499749 close(112)              = 0
16:18:02.499799 munmap(0xb691d000, 4096) = 0
16:18:02.499856 time([1112987882])      = 1112987882
16:18:02.499920 rt_sigaction(SIGPIPE, {0x585aa0, [], SA_RESTORER, 0x4f4f48},
{SIG_IGN}, 8) = 0
16:18:02.500004 send(5, "<23>Apr  8 16:18:02 cyrus/mupdat"..., 61, 0) = 61
16:18:02.500090 rt_sigaction(SIGPIPE, {SIG_IGN}, NULL, 8) = 0
16:18:02.500180 write(3, "\4\0\0\0\24b\0\0", 8) = 8
16:18:02.500253 write(7, "j\0\0\0", 4)  = 4
16:18:02.500326 accept(4, 0, NULL)      = 104
16:18:02.595693 getpeername(104, {sa_family=AF_INET, sin_port=htons(33980),
sin_addr=inet_addr("10.1.5.202")}, [16]) = 0
16:18:02.595830 getpeername(104, {sa_family=AF_INET, sin_port=htons(33980),
sin_addr=inet_addr("10.1.5.202")}, [16]) = 0
16:18:02.595964 getsockname(104, {sa_family=AF_INET, sin_port=htons(3905),
sin_addr=inet_addr("10.1.5.101")}, [16]) = 0
16:18:02.596087 open("/etc/hosts.allow", O_RDONLY) = 106
16:18:02.596161 fstat64(106, {st_mode=S_IFREG|0644, st_size=2, ...}) = 0
16:18:02.596266 mmap2(NULL, 4096, PROT_READ|PROT_WRITE,
MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0xb691d000
16:18:02.596324 read(106, "#\n", 4096)  = 2
16:18:02.596389 read(106, "", 4096)     = 0
16:18:02.596439 close(106)              = 0
16:18:02.596489 munmap(0xb691d000, 4096) = 0
16:18:02.596548 open("/etc/hosts.deny", O_RDONLY) = 106
16:18:02.596615 fstat64(106, {st_mode=S_IFREG|0644, st_size=2, ...}) = 0
16:18:02.596718 mmap2(NULL, 4096, PROT_READ|PROT_WRITE,
MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0xb691d000
16:18:02.596772 read(106, "#\n", 4096)  = 2
16:18:02.596834 read(106, "", 4096)     = 0
16:18:02.596884 close(106)              = 0
16:18:02.596932 munmap(0xb691d000, 4096) = 0
16:18:02.596993 time([1112987882])      = 1112987882
16:18:02.597057 rt_sigaction(SIGPIPE, {0x585aa0, [], SA_RESTORER, 0x4f4f48},
{SIG_IGN}, 8) = 0
16:18:02.597151 send(5, "<23>Apr  8 16:18:02 cyrus/mupdat"..., 61, 0) = 61
16:18:02.597236 rt_sigaction(SIGPIPE, {SIG_IGN}, NULL, 8) = 0
16:18:02.597320 write(3, "\4\0\0\0\24b\0\0", 8) = 8
16:18:02.597394 write(7, "h\0\0\0", 4)  = 4
16:18:02.597840 accept(4, 0, NULL)      = ? ERESTARTSYS (To be restarted)
16:18:02.797127 +++ killed by SIGSEGV (core dumped) +++

why did accept return ERESTARTSYS ?
shouldnt the kernel transfor it to EINTR ?

Comment 10 John Dennis 2005-04-12 12:50:53 EDT
FYI, I'm in the process of trying to get answers for this. One data point that
may be of interest is that ERESTARTSYS is indeed kernel only, however the one
time its is legal to show up in user space is in strace (just learned that myself).

Comment 11 João Assad 2005-04-12 17:48:32 EDT
great thanks.

I have a full gdb trace now 

you can get it at

http://www.gazzag.com/gdb.output.gz
Comment 12 Matthew Miller 2005-04-26 11:49:37 EDT
Fedora Core 2 is now maintained by the Fedora Legacy project for
security updates only. If this problem is a security issue, please
reopen and reassign to the Fedora Legacy product. If it is not a
security issue and hasn't been resolved in the current FC3 updates or
in the FC4 test release, reopen and change the version to match.
Comment 13 John Dennis 2005-04-26 15:00:31 EDT
I've looked at this a bit more. Unfortunately there isn't enough information in
the traces or log files to diagnose the problem. Here is what I see as the issues:

In the file map_shared.c is a function

map_refresh(fd, onceonly, base, len, newlen, name, mboxname)

map_refresh is calling mmap and its failing but we don't know why. The
parameters base and len are important, but we don't know their values because
they are passed by reference and the gdb trace only shows the addess of these
parameters, not their values. Also we don't know the mmap error code which is
returned in the global variable errno. It would help if you could rebuild the
code (I know you already did this to fix fatal) so these values are printed out. 

At the top of the function change char buf[80] to char buf[2048] so there is
enough room to format the error message then at the bottom of the function
change the code to:

    if (*base == (char *)-1) {
	int err = errno;
	syslog(LOG_ERR, "IOERROR: mapping %s file%s%s: %m", name,
	       mboxname ? " for " : "", mboxname ? mboxname : "");
	snprintf(buf, sizeof(buf), "failed to mmap %s file base=%x len=%d newlen=%d
errno=%d", name, *base, *len, newlen, err);
	fatal(buf, EC_IOERR);
    }

if for some reason errno is undefined add #include <errno.h> at the top of the
file. 

This should help tell us why mmap failed. Here are some other things you can do
or check:

When the failure occurs it would help to know what mappings were in effect, this
info can be obtained by

cat /proc/XXXX/map

where XXXX is replaced by the process id.

Also, while it would be nice to make sure skiplist works correctly it would be
informative if this problem only occurs with skiplist. If you reconfigured to
run with berkley instead of skiplist does the problem go away? (this might be a
workaround for you till the problem is fixed).

Also, how much freespace do you have in the filesystem where the database file
lives?
Comment 14 John Dennis 2005-04-26 15:03:27 EDT
Oh, one more request, could you please when you provide long information use
bugzilla attachments (at the bottom of this page) rather than pasting text
inline or referring to your own URL's. Thanks!
Comment 15 João Assad 2005-05-03 22:54:30 EDT
I fixed the problem changing the way map_refresh works.
Instead of calling munmap when necessary  and then mmap, I changed it so it
calls mremap.

I can't recompile cyrus to add your code right now, I would need to do it on my
production servers and everytime the problem happens my whole imap system goes
down for over 1 hour.

here is an actual log from when I was testing last week.
cyrus/mupdate[4324]: MUNMAP at map_refresh: munmap(6a3e8000, 411041792) = 0
cyrus/mupdate[4324]: MMAP at map_refresh: mmap(0, 411058176, PROT_READ,
MAP_SHARED, 8, 0) = -1

Some interesting observations from when I was debugging the whole thing.

switching from flex-mmap to the legacy mmap mode only makes the problem happen
faster.

disabling exec-shield completely doesnt help, and mmap still returns a
quasirandom address everytime it's called. It still use exec-shield VM
randomization despite the fact that exec-shield is disabled.

disabling exec-shield VM randomization makes it take longer for the problem to
happen but doesnt fix it. Every now and then mmap will return a different
address for no apparent reason, when it could just allocate a bigger range in
the same start address as before. after changing the start address a few times
mmap fails returning ENOMEM.. looking at the /proc/XXX/maps file just before the
problem happens, it's clear that mmap had more then enough continuous address
space to map the whole thing.

using my mremap patch and disabling exec-shield VM randomization, mremap has
been using the same start address for days now, since I started cyrus the last
time. Why when doing a munmap followed by a mmap the address returned sometimes
change ? mremap doesnt seem to replicate this behavior.

almost forgot. I have around 50G of free space on the disk.

Im attaching my mremap patch. Im not a very good C programmer, so if you spot
any problems with my code, please let me know.

Comment 16 João Assad 2005-05-03 22:55:52 EDT
Created attachment 113994 [details]
patch to replace munmap followed by mmap for a mremap
Comment 17 Matthew Miller 2005-05-03 22:58:16 EDT
This isn't targeted for FC2, right? Could you please change the Version? Thanks!
Comment 18 João Assad 2005-05-03 23:07:23 EDT
If you mean the patch. yes it is, I changed my src.rpm so it uses cyrus-2.2.12
if thats what got you wondering. map_shared.c is the same on cyrus-2.2.10(FC2)
so this patch can be used. All the problems described above happens both on
cyrus-2.2.10 and 2.2.12
Comment 19 Matthew Miller 2005-05-03 23:14:51 EDT
Hopefully this will be fixed upstream as per comment #1 and comment #2. But in
the meantime, this is unlikely to get fixed for FC2, since it isn't a direct
security exposure. Unless I missed something...?
Comment 20 João Assad 2005-05-03 23:19:56 EDT
I think you are right. I just posted my findings as requested by John Dennis on
comment #13 . Im not expecting it to get fixed on FC2
Comment 21 Matthew Miller 2005-05-03 23:24:12 EDT
Okay, thanks. I'm going to move this to -devel.
Comment 22 Bug Zapper 2008-04-03 11:54:30 EDT
Based on the date this bug was created, it appears to have been reported
against rawhide during the development of a Fedora release that is no
longer maintained. In order to refocus our efforts as a project we are
flagging all of the open bugs for releases which are no longer
maintained. If this bug remains in NEEDINFO thirty (30) days from now,
we will automatically close it.

If you can reproduce this bug in a maintained Fedora version (7, 8, or
rawhide), please change this bug to the respective version and change
the status to ASSIGNED. (If you're unable to change the bug's version
or status, add a comment to the bug and someone will change it for you.)

Thanks for your help, and we apologize again that we haven't handled
these issues to this point.

The process we're following is outlined here:
http://fedoraproject.org/wiki/BugZappers/F9CleanUp

We will be following the process here:
http://fedoraproject.org/wiki/BugZappers/HouseKeeping to ensure this
doesn't happen again.
Comment 23 Bug Zapper 2008-05-06 20:07:35 EDT
This bug has been in NEEDINFO for more than 30 days since feedback was
first requested. As a result we are closing it.

If you can reproduce this bug in the future against a maintained Fedora
version please feel free to reopen it against that version.

The process we're following is outlined here:
http://fedoraproject.org/wiki/BugZappers/F9CleanUp

Note You need to log in before you can comment on or make changes to this bug.