Bug 152548
Summary: | cyrus-murder problems with database corruption in the frontend/master | ||||||
---|---|---|---|---|---|---|---|
Product: | [Fedora] Fedora | Reporter: | João Assad <jfassad> | ||||
Component: | cyrus-imapd | Assignee: | Tomas Janousek <tjanouse> | ||||
Status: | CLOSED INSUFFICIENT_DATA | QA Contact: | Brian Brock <bbrock> | ||||
Severity: | high | Docs Contact: | |||||
Priority: | medium | ||||||
Version: | rawhide | CC: | mattdm, triage, zing | ||||
Target Milestone: | --- | ||||||
Target Release: | --- | ||||||
Hardware: | i686 | ||||||
OS: | Linux | ||||||
Whiteboard: | bzcl34nup | ||||||
Fixed In Version: | Doc Type: | Bug Fix | |||||
Doc Text: | Story Points: | --- | |||||
Clone Of: | Environment: | ||||||
Last Closed: | 2008-05-07 00:07:36 UTC | Type: | --- | ||||
Regression: | --- | Mount Type: | --- | ||||
Documentation: | --- | CRM: | |||||
Verified Versions: | Category: | --- | |||||
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |||||
Cloudforms Team: | --- | Target Upstream Version: | |||||
Embargoed: | |||||||
Attachments: |
|
Description
João Assad
2005-03-30 08:45:00 UTC
Thank you for your bug report. In this instance I'm going to recommend you open a bug report with the authors of cyrus imap. There is a much greater probability you will get timely information from the folks who distribute and maintain cyrus imap for an installation of this complexity which is exhibiting random failures. We will certainly look at the issue too, but the authors and maintainers are probably the best resource for this particular issue. You can file the bug at the cyrus bugzilla which is here: https://bugzilla.andrew.cmu.edu. If you do please cite the bug numbers in both reports so they can be cross referenced. Thank you for your quick reply I've just filed the bug on cyrus bugzilla as requested http://bugzilla.andrew.cmu.edu/show_bug.cgi?id=2640 New info It seems before the corruption , I always get the following error cyrus/mupdate[12614]: IOERROR: mapping /var/lib/imap/mailboxes.db file: Cannot allocate memory cyrus/mupdate[12614]: failed to mmap /var/lib/imap/mailboxes.db file cyrus/master[12580]: service mupdate pid 12614 in READY state: terminated abnormally This error is given by the map_refresh method ... there is 2 declarations of the method. one in lib/map_stupidshared.c and another in lib/map_shared.c . I dont know which one cyrus use when compiled under fedora. Regards hmm... I'm not sure if this is relevant or not, but there are a few potential problems that could crop up with berkley db, which cyrus-imapd links against. One is the processor, there was a problem with AMD K6 cpu's, by any chance does your system have this processor? The second is that there were a few minor changes to the db API recently (the rpm is "db4") I wonder if cyrus-imapd and db4 have gotten out of sync somehow. Have you upgraded db4 recently? What is the vesion of db4 that is installed? The db thats getting corrupted is not berkley db, its skiplist. It looks like Im running out of resources to mmap the mailbox file , cyrus isn't handling this behaviour properly and the db is ending up corrupted. But why am I running out of resources ? Why is it complaining "Cannot allocate memory". Im nowhere near running out of RAM on this box. Thats how my ulimits is configured: core file size (blocks, -c) 0 data seg size (kbytes, -d) unlimited file size (blocks, -f) unlimited max locked memory (kbytes, -l) unlimited max memory size (kbytes, -m) unlimited open files (-n) 200000 pipe size (512 bytes, -p) 8 stack size (kbytes, -s) unlimited cpu time (seconds, -t) unlimited max user processes (-u) unlimited virtual memory (kbytes, -v) unlimited Our system is P4, not K6 db4-4.2.52-3.1 I got 2 gdb backtraces from the production server. ---------------------------- #18988 0x0804dcd3 in fatal ( s=0x8d52f070 "Internal error: assertion failed: mupdate.c: 586: 0", code=75) at mupdate.c:586 #18989 0x08082622 in assertionfailed (file=0x8082ab9 "mupdate.c", line=586, expr=0x8082ce1 "0") at assert.c:61 #18990 0x0804dcd3 in fatal ( s=0x8d52f4c0 "failed to mmap /var/lib/imap/mailboxes.db file", code=75) at mupdate.c:586 #18991 0x080755f5 in map_refresh (fd=8, onceonly=0, base=0x8ad56e0, len=0x8ad56e4, newlen=366993408, name=0x8ad5710 "/var/lib/imap/mailboxes.db", mboxname=0x0) at map_shared.c:105 #18992 0x08078178 in update_lock (db=0x8ad56d8, txn=0x8d52f6e0) at cyrusdb_skiplist.c:572 #18993 0x0807a2f9 in mycommit (db=0x8ad56d8, tid=0x8d52f6e0) at cyrusdb_skiplist.c:1345 #18994 0x08079d64 in mystore (db=0x8ad56d8, key=0x8b41ba8 "user.1440034", keylen=12, data=0x8c5a248 "1 cyrus-be2.gazzag.com!default 1440034\tlrswipcda\t", datalen=49, tid=0x0, overwrite=1) at cyrusdb_skiplist.c:1225 #18995 0x08079de5 in store (db=0x8ad56d8, key=0x8b41ba8 "user.1440034", keylen=12, data=0x8c5a248 "1 cyrus-be2.gazzag.com!default 1440034\tlrswipcda\t", datalen=49, tid=0x0) at cyrusdb_skiplist.c:1244 #18996 0x080571a9 in mboxlist_insertremote (name=0x8b41ba8 "user.1440034", mbtype=0, host=0x8b144f0 "cyrus-be2.gazzag.com!default", acl=0x8c4dab8 "1440034\tlrswipcda\t", tid=0x0) at mboxlist.c:801 #18997 0x0804f869 in database_log (mb=0x8c4daa8, mytid=0x0) at mupdate.c:1300 #18998 0x0804ffe3 in cmd_set (C=0x8bdd778, tag=0x8be8918 "X1", mailbox=0x8bb6ec8 "user.1440034", server=0x8bb54b8 "cyrus-be2.gazzag.com!default", acl=0x8c36e28 "1440034\tlrswipcda\t", t=SET_ACTIVE) at mupdate.c:1527 #18999 0x0804e1f0 in docmd (c=0x8bdd778) at mupdate.c:685 #19000 0x0804f669 in thread_main (rock=0x0) at mupdate.c:1228 #19001 0x0069b98c in start_thread () from /lib/tls/libpthread.so.0 #19002 0x005897da in clone () from /lib/tls/libc.so.6 ---------------------------- #18988 0x0804dcd3 in fatal ( s=0x9fcd6070 "Internal error: assertion failed: mupdate.c: 586: 0", code=75) at mupdate.c:586 #18989 0x08082622 in assertionfailed (file=0x8082ab9 "mupdate.c", line=586, expr=0x8082ce1 "0") at assert.c:61 #18990 0x0804dcd3 in fatal ( s=0x9fcd64c0 "failed to mmap /var/lib/imap/mailboxes.db file", code=75) at mupdate.c:586 #18991 0x080755f5 in map_refresh (fd=8, onceonly=0, base=0x85856e0, len=0x85856e4, newlen=366583808, name=0x8585710 "/var/lib/imap/mailboxes.db", mboxname=0x0) at map_shared.c:105 #18992 0x08078178 in update_lock (db=0x85856d8, txn=0x9fcd66e0) at cyrusdb_skiplist.c:572 #18993 0x0807a2f9 in mycommit (db=0x85856d8, tid=0x9fcd66e0) at cyrusdb_skiplist.c:1345 #18994 0x08079d64 in mystore (db=0x85856d8, key=0x88046a78 "user.950836._TRASH", keylen=18, data=0x88056cc0 "1 cyrus-be3.gazzag.com!default 950836\tlrswipcda\t", datalen=48, tid=0x0, overwrite=1) at cyrusdb_skiplist.c:1225 #18995 0x08079de5 in store (db=0x85856d8, key=0x88046a78 "user.950836._TRASH", keylen=18, data=0x88056cc0 "1 cyrus-be3.gazzag.com!default 950836\tlrswipcda\t", datalen=48, tid=0x0) at cyrusdb_skiplist.c:1244 #18996 0x080571a9 in mboxlist_insertremote ( name=0x88046a78 "user.950836._TRASH", mbtype=0, host=0x88019a70 "cyrus-be3.gazzag.com!default", acl=0x8804b7c0 "950836\tlrswipcda\t", tid=0x0) at mboxlist.c:801 #18997 0x0804f869 in database_log (mb=0x8804b7b0, mytid=0x0) at mupdate.c:1300 #18998 0x0804ffe3 in cmd_set (C=0x9dd0b9b8, tag=0x9dd9b010 "X1", mailbox=0x9ddb1140 "user.950836._TRASH", server=0x9ddb5360 "cyrus-be3.gazzag.com!default", acl=0x88016f70 "950836\tlrswipcda\t", t=SET_ACTIVE) at mupdate.c:1527 #18999 0x0804e1f0 in docmd (c=0x9dd0b9b8) at mupdate.c:685 #19000 0x0804f669 in thread_main (rock=0x0) at mupdate.c:1228 #19001 0x0069b98c in start_thread () from /lib/tls/libpthread.so.0 #19002 0x005897da in clone () from /lib/tls/libc.so.6 To generate this dump I had to change the fatal inside mupdate.c as instructed by Derrick (cyrus developer) void fatal(const char *s, int code) { chdir("/tmp"); assert(0); exit(code); /* shut up GCC */ } and now a strace from the production server.. Im sending just the last few lines of it. its really big. 16:18:02.399469 accept(4, 0, NULL) = 104 16:18:02.470386 getpeername(104, {sa_family=AF_INET, sin_port=htons(33978), sin_addr=inet_addr("10.1.5.202")}, [16]) = 0 16:18:02.470523 getpeername(104, {sa_family=AF_INET, sin_port=htons(33978), sin_addr=inet_addr("10.1.5.202")}, [16]) = 0 16:18:02.470658 getsockname(104, {sa_family=AF_INET, sin_port=htons(3905), sin_addr=inet_addr("10.1.5.101")}, [16]) = 0 16:18:02.470782 open("/etc/hosts.allow", O_RDONLY) = 112 16:18:02.470850 fstat64(112, {st_mode=S_IFREG|0644, st_size=2, ...}) = 0 16:18:02.470958 mmap2(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0xb691d000 16:18:02.471015 read(112, "#\n", 4096) = 2 16:18:02.471079 read(112, "", 4096) = 0 16:18:02.471135 close(112) = 0 16:18:02.471185 munmap(0xb691d000, 4096) = 0 16:18:02.471244 open("/etc/hosts.deny", O_RDONLY) = 112 16:18:02.471307 fstat64(112, {st_mode=S_IFREG|0644, st_size=2, ...}) = 0 16:18:02.471411 mmap2(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0xb691d000 16:18:02.471466 read(112, "#\n", 4096) = 2 16:18:02.471526 read(112, "", 4096) = 0 16:18:02.471577 close(112) = 0 16:18:02.471625 munmap(0xb691d000, 4096) = 0 16:18:02.471683 time([1112987882]) = 1112987882 16:18:02.471750 rt_sigaction(SIGPIPE, {0x585aa0, [], SA_RESTORER, 0x4f4f48}, {SIG_IGN}, 8) = 0 16:18:02.471838 send(5, "<23>Apr 8 16:18:02 cyrus/mupdat"..., 61, 0) = 61 16:18:02.471927 rt_sigaction(SIGPIPE, {SIG_IGN}, NULL, 8) = 0 16:18:02.472012 write(3, "\4\0\0\0\24b\0\0", 8) = 8 16:18:02.472086 write(7, "h\0\0\0", 4) = 4 16:18:02.472588 accept(4, 0, NULL) = 106 16:18:02.498519 getpeername(106, {sa_family=AF_INET, sin_port=htons(33979), sin_addr=inet_addr("10.1.5.202")}, [16]) = 0 16:18:02.498661 getpeername(106, {sa_family=AF_INET, sin_port=htons(33979), sin_addr=inet_addr("10.1.5.202")}, [16]) = 0 16:18:02.498793 getsockname(106, {sa_family=AF_INET, sin_port=htons(3905), sin_addr=inet_addr("10.1.5.101")}, [16]) = 0 16:18:02.498918 open("/etc/hosts.allow", O_RDONLY) = 112 16:18:02.498987 fstat64(112, {st_mode=S_IFREG|0644, st_size=2, ...}) = 0 16:18:02.499125 mmap2(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0xb691d000 16:18:02.499183 read(112, "#\n", 4096) = 2 16:18:02.499250 read(112, "", 4096) = 0 16:18:02.499301 close(112) = 0 16:18:02.499351 munmap(0xb691d000, 4096) = 0 16:18:02.499408 open("/etc/hosts.deny", O_RDONLY) = 112 16:18:02.499473 fstat64(112, {st_mode=S_IFREG|0644, st_size=2, ...}) = 0 16:18:02.499585 mmap2(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0xb691d000 16:18:02.499639 read(112, "#\n", 4096) = 2 16:18:02.499700 read(112, "", 4096) = 0 16:18:02.499749 close(112) = 0 16:18:02.499799 munmap(0xb691d000, 4096) = 0 16:18:02.499856 time([1112987882]) = 1112987882 16:18:02.499920 rt_sigaction(SIGPIPE, {0x585aa0, [], SA_RESTORER, 0x4f4f48}, {SIG_IGN}, 8) = 0 16:18:02.500004 send(5, "<23>Apr 8 16:18:02 cyrus/mupdat"..., 61, 0) = 61 16:18:02.500090 rt_sigaction(SIGPIPE, {SIG_IGN}, NULL, 8) = 0 16:18:02.500180 write(3, "\4\0\0\0\24b\0\0", 8) = 8 16:18:02.500253 write(7, "j\0\0\0", 4) = 4 16:18:02.500326 accept(4, 0, NULL) = 104 16:18:02.595693 getpeername(104, {sa_family=AF_INET, sin_port=htons(33980), sin_addr=inet_addr("10.1.5.202")}, [16]) = 0 16:18:02.595830 getpeername(104, {sa_family=AF_INET, sin_port=htons(33980), sin_addr=inet_addr("10.1.5.202")}, [16]) = 0 16:18:02.595964 getsockname(104, {sa_family=AF_INET, sin_port=htons(3905), sin_addr=inet_addr("10.1.5.101")}, [16]) = 0 16:18:02.596087 open("/etc/hosts.allow", O_RDONLY) = 106 16:18:02.596161 fstat64(106, {st_mode=S_IFREG|0644, st_size=2, ...}) = 0 16:18:02.596266 mmap2(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0xb691d000 16:18:02.596324 read(106, "#\n", 4096) = 2 16:18:02.596389 read(106, "", 4096) = 0 16:18:02.596439 close(106) = 0 16:18:02.596489 munmap(0xb691d000, 4096) = 0 16:18:02.596548 open("/etc/hosts.deny", O_RDONLY) = 106 16:18:02.596615 fstat64(106, {st_mode=S_IFREG|0644, st_size=2, ...}) = 0 16:18:02.596718 mmap2(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0xb691d000 16:18:02.596772 read(106, "#\n", 4096) = 2 16:18:02.596834 read(106, "", 4096) = 0 16:18:02.596884 close(106) = 0 16:18:02.596932 munmap(0xb691d000, 4096) = 0 16:18:02.596993 time([1112987882]) = 1112987882 16:18:02.597057 rt_sigaction(SIGPIPE, {0x585aa0, [], SA_RESTORER, 0x4f4f48}, {SIG_IGN}, 8) = 0 16:18:02.597151 send(5, "<23>Apr 8 16:18:02 cyrus/mupdat"..., 61, 0) = 61 16:18:02.597236 rt_sigaction(SIGPIPE, {SIG_IGN}, NULL, 8) = 0 16:18:02.597320 write(3, "\4\0\0\0\24b\0\0", 8) = 8 16:18:02.597394 write(7, "h\0\0\0", 4) = 4 16:18:02.597840 accept(4, 0, NULL) = ? ERESTARTSYS (To be restarted) 16:18:02.797127 +++ killed by SIGSEGV (core dumped) +++ why did accept return ERESTARTSYS ? shouldnt the kernel transfor it to EINTR ? FYI, I'm in the process of trying to get answers for this. One data point that may be of interest is that ERESTARTSYS is indeed kernel only, however the one time its is legal to show up in user space is in strace (just learned that myself). great thanks. I have a full gdb trace now you can get it at http://www.gazzag.com/gdb.output.gz Fedora Core 2 is now maintained by the Fedora Legacy project for security updates only. If this problem is a security issue, please reopen and reassign to the Fedora Legacy product. If it is not a security issue and hasn't been resolved in the current FC3 updates or in the FC4 test release, reopen and change the version to match. I've looked at this a bit more. Unfortunately there isn't enough information in the traces or log files to diagnose the problem. Here is what I see as the issues: In the file map_shared.c is a function map_refresh(fd, onceonly, base, len, newlen, name, mboxname) map_refresh is calling mmap and its failing but we don't know why. The parameters base and len are important, but we don't know their values because they are passed by reference and the gdb trace only shows the addess of these parameters, not their values. Also we don't know the mmap error code which is returned in the global variable errno. It would help if you could rebuild the code (I know you already did this to fix fatal) so these values are printed out. At the top of the function change char buf[80] to char buf[2048] so there is enough room to format the error message then at the bottom of the function change the code to: if (*base == (char *)-1) { int err = errno; syslog(LOG_ERR, "IOERROR: mapping %s file%s%s: %m", name, mboxname ? " for " : "", mboxname ? mboxname : ""); snprintf(buf, sizeof(buf), "failed to mmap %s file base=%x len=%d newlen=%d errno=%d", name, *base, *len, newlen, err); fatal(buf, EC_IOERR); } if for some reason errno is undefined add #include <errno.h> at the top of the file. This should help tell us why mmap failed. Here are some other things you can do or check: When the failure occurs it would help to know what mappings were in effect, this info can be obtained by cat /proc/XXXX/map where XXXX is replaced by the process id. Also, while it would be nice to make sure skiplist works correctly it would be informative if this problem only occurs with skiplist. If you reconfigured to run with berkley instead of skiplist does the problem go away? (this might be a workaround for you till the problem is fixed). Also, how much freespace do you have in the filesystem where the database file lives? Oh, one more request, could you please when you provide long information use bugzilla attachments (at the bottom of this page) rather than pasting text inline or referring to your own URL's. Thanks! I fixed the problem changing the way map_refresh works. Instead of calling munmap when necessary and then mmap, I changed it so it calls mremap. I can't recompile cyrus to add your code right now, I would need to do it on my production servers and everytime the problem happens my whole imap system goes down for over 1 hour. here is an actual log from when I was testing last week. cyrus/mupdate[4324]: MUNMAP at map_refresh: munmap(6a3e8000, 411041792) = 0 cyrus/mupdate[4324]: MMAP at map_refresh: mmap(0, 411058176, PROT_READ, MAP_SHARED, 8, 0) = -1 Some interesting observations from when I was debugging the whole thing. switching from flex-mmap to the legacy mmap mode only makes the problem happen faster. disabling exec-shield completely doesnt help, and mmap still returns a quasirandom address everytime it's called. It still use exec-shield VM randomization despite the fact that exec-shield is disabled. disabling exec-shield VM randomization makes it take longer for the problem to happen but doesnt fix it. Every now and then mmap will return a different address for no apparent reason, when it could just allocate a bigger range in the same start address as before. after changing the start address a few times mmap fails returning ENOMEM.. looking at the /proc/XXX/maps file just before the problem happens, it's clear that mmap had more then enough continuous address space to map the whole thing. using my mremap patch and disabling exec-shield VM randomization, mremap has been using the same start address for days now, since I started cyrus the last time. Why when doing a munmap followed by a mmap the address returned sometimes change ? mremap doesnt seem to replicate this behavior. almost forgot. I have around 50G of free space on the disk. Im attaching my mremap patch. Im not a very good C programmer, so if you spot any problems with my code, please let me know. Created attachment 113994 [details]
patch to replace munmap followed by mmap for a mremap
This isn't targeted for FC2, right? Could you please change the Version? Thanks! If you mean the patch. yes it is, I changed my src.rpm so it uses cyrus-2.2.12 if thats what got you wondering. map_shared.c is the same on cyrus-2.2.10(FC2) so this patch can be used. All the problems described above happens both on cyrus-2.2.10 and 2.2.12 Hopefully this will be fixed upstream as per comment #1 and comment #2. But in the meantime, this is unlikely to get fixed for FC2, since it isn't a direct security exposure. Unless I missed something...? I think you are right. I just posted my findings as requested by John Dennis on comment #13 . Im not expecting it to get fixed on FC2 Okay, thanks. I'm going to move this to -devel. Based on the date this bug was created, it appears to have been reported against rawhide during the development of a Fedora release that is no longer maintained. In order to refocus our efforts as a project we are flagging all of the open bugs for releases which are no longer maintained. If this bug remains in NEEDINFO thirty (30) days from now, we will automatically close it. If you can reproduce this bug in a maintained Fedora version (7, 8, or rawhide), please change this bug to the respective version and change the status to ASSIGNED. (If you're unable to change the bug's version or status, add a comment to the bug and someone will change it for you.) Thanks for your help, and we apologize again that we haven't handled these issues to this point. The process we're following is outlined here: http://fedoraproject.org/wiki/BugZappers/F9CleanUp We will be following the process here: http://fedoraproject.org/wiki/BugZappers/HouseKeeping to ensure this doesn't happen again. This bug has been in NEEDINFO for more than 30 days since feedback was first requested. As a result we are closing it. If you can reproduce this bug in the future against a maintained Fedora version please feel free to reopen it against that version. The process we're following is outlined here: http://fedoraproject.org/wiki/BugZappers/F9CleanUp |