Created attachment 415503 [details]
kernel log

Description of problem:
Upgraded to 5.5 from 5.4. All of the configuration was made on 5.3, upgraded to 5.4, and now upgraded to 5.5. When installing some large software (e.g. comsol40) on a gfs2 filesystem, the installation hangs and we get the attached messages from the kernel. Access to the gfs2 filesystem itself does not hang, but the system load increases and the machine becomes non-responsive as the number of D-state processes grows after the first crash.

Version-Release number of selected component (if applicable):
5.5
2.6.18-194.3.1.el5
gfs2-utils-0.1.62-20.el5

How reproducible:
100%

Steps to Reproduce:
1. Start an installation of e.g. comsol40.
2. Watch the kernel messages.

Actual results:
The gfs2 file system crashes with the attached messages.

Expected results:
The installation should finish.

Additional info:
Maybe I am wrong that the problem is with gfs2!
Created attachment 415513 [details]
Full crash details
The message: BUG: soft lockup - CPU#1 stuck for 10s! [aisexec:2430] may indicate there is a communications problem or an openais problem. If openais is stuck, dlm can't pass cluster traffic across the network, and that could be why GFS2 got stuck waiting for all its glocks. Just a theory. I'm going to cc some cluster folks to get their input. In cases like this, where lots of processes seem stuck in glock_wait, I'd also recommend that you download, compile and run this tool: http://people.redhat.com/rpeterso/Experimental/RHEL5.x/gfs2/gfs2_hangalyzer.c The program has instructions in comments at the top, and should be run only with rsa keys set up so it doesn't ask for passwords a hundred times.
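For reference, a rough sketch of building and running it (assuming it compiles with plain gcc; the node name "tweety1" is just an example here, and the comments at the top of the source are the authoritative instructions):

  # build the analyzer from the downloaded source
  gcc -o gfs2_hangalyzer gfs2_hangalyzer.c

  # set up passwordless ssh to each cluster node so the tool doesn't prompt
  ssh-keygen -t rsa
  ssh-copy-id root@tweety1

  # run it, pointing at the node to analyze
  ./gfs2_hangalyzer -n tweety1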
Created attachment 415657 [details]
new crash log

New crash log. This time it starts with "CPU locked!"
Hi Bob. I ran the program (with the necessary modifications, like ln -s /mounts /gfs2) and the results (until the monitored node crashed) are as follows:

[root@tweety-2 ~]# ./gfs2_hangalyzer -n tweety1
exec: /usr/bin/ssh ssh -l root tweety1 "/bin/grep 'clusternode name' /etc/cluster/cluste
ssh: : Name or service not known
/bin/cat: /gfs2/bin/glocks: No such file or directory
/bin/cat: /gfs2/BOINC/glocks: No such file or directory
/bin/cat: /gfs2/conf/glocks: No such file or directory
/bin/cat: /gfs2/DBs/glocks: No such file or directory
/bin/cat: /gfs2/Doms/glocks: No such file or directory
/bin/cat: /gfs2/FileServer/glocks: No such file or directory
/bin/cat: /gfs2/fileservice/glocks: No such file or directory
/bin/cat: /gfs2/hs_err_pid497.log/glocks: Not a directory
/bin/cat: /gfs2/html/glocks: No such file or directory
/bin/cat: /gfs2/jboss-4.2.3.GA/glocks: No such file or directory
/bin/cat: /gfs2/log/glocks: No such file or directory
/bin/cat: /gfs2/lost+found/glocks: No such file or directory
/bin/cat: /gfs2/opt/glocks: No such file or directory
/bin/cat: /gfs2/tars/glocks: No such file or directory
/bin/cat: /gfs2/test/glocks: Not a directory
/bin/cat: /gfs2/tmp/glocks: No such file or directory
/bin/cat: /gfs2/var/glocks: No such file or directory
bash: -c: line 0: syntax error near unexpected token `('
bash: -c: line 0: `/bin/cat /dlm/(null)'
(the pair of lines above repeats about ten more times)
Write failed: Broken pipe
Looks like you need to mount debugfs at /sys/kernel/debug. You can do that manually, or add this line to /etc/fstab and then run "mount -a":

debugfs    /sys/kernel/debug    debugfs    defaults    0 0

Then re-run the tool.
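For example, the manual mount on each node would be something like this (the ls at the end is just a sanity check that GFS2's debugfs files are present):

  # mount debugfs by hand
  mount -t debugfs none /sys/kernel/debug

  # or, after adding the fstab line above
  mount -a

  # there should now be one directory per mounted GFS2 filesystem,
  # each containing a glocks file
  ls /sys/kernel/debug/gfs2/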
Ok, here we are. 1. I started again a new installation of the software that triggered the event. It ran all the way to the end without any kernel logs. 2. I did a fourth installation. This time the kernel logged messages, however I saw that a) the node got frozen but did not hang/blow up, b) the installation continued after some time, c) a number of other applications based on this file system crashed when they tried to write to it. I attach "crash_without_hang" with the kernel logs from this case, and "gfs2_hangalyzer_output_crash_no_hang" with the screen output of the tool. All this was done prior to your comment above. I will also try with the debugfs option.
Created attachment 415679 [details]
Kernel logs - the server did not crash... it just froze

In this case the node did not crash. The installation continued after a couple of minutes; however, any application trying to write to this filesystem during the freeze crashed completely.
Created attachment 415681 [details] the results in the -NO SERVER CRASH- case.
I think I managed to get it. Installation on tweety1, monitoring from tweety2. In "locks" I have included: i) the output of a random run of gfs2_hangalyzer from tweety2, ii) below it, a run of gfs2_hangalyzer from when tweety1 got into the "frozen" condition, and iii) below that, a big block of kernel messages (this time it looks like drbd played a role). Until now I could not replicate the "CPU stuck" message; I only got kernel messages related to gfs (as in the previous attachments). This is the only time that drbd also output messages, but it was also the only time that I could somehow replicate the problem and catch some logs with gfs2_hangalyzer. Thank you.
Created attachment 415722 [details]
Messages during the somehow-replicated freeze

Included are the kernel messages and the results of gfs2_hangalyzer.
Also, tweety1 no longer has access to the gfs2 filesystem. Any action (like 'ls') causes more glocks and frozen processes.
Thanks for the information. Some of the call trace information leads me to believe that GFS2 might simply be waiting for DRBD. Other information may indicate GFS2 is at fault. It's hard to tell from the output unless I can see what all the processes are doing.

So bottom line: it's still not clear to me what is going on here. I'd like to see a complete picture of the state of both nodes at the time of failure. Therefore I recommend that you do the following:

1. Modify your /etc/cluster/cluster.conf file to temporarily add a large post_fail_delay value so the nodes don't fence (see the sketch at the end of this comment).
2. Make sure your /etc/sysctl.conf on both nodes includes the following:

   kernel.sysrq = 1
   kernel.printk = 7
   kernel.panic_on_oops = 0

3. Reboot both nodes and restart the cluster.
4. Start with a fresh file system, so mkfs.gfs2 a new one (saving off anything you need, of course).
5. Recreate the failure.
6. Just after the failure, but before either machine is rebooted, collect the following information:
   a. Run the gfs2_hangalyzer tool again and save its output.
   b. Capture a sysrq-t output by doing this command on both nodes:
      echo t > /proc/sysrq-trigger
   c. Run the "dmesg" command on both nodes and save both. For example:
      dmesg > /tmp/dmesg.tweety1

I'd like to see what all the processes are doing, not just a few, so I'd like the sysrq-t information from both nodes right after the failure. I'd like to see what the tool has to say about the hang from the same time. If possible, I'd like to get instructions on how to reproduce the failure in our labs.
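For step 1, something along these lines should work; the delay value here is just an arbitrary large example, and your cluster.conf may already contain a fence_daemon line with other attributes that you would edit instead:

  <!-- in /etc/cluster/cluster.conf, inside the <cluster> element -->
  <fence_daemon post_join_delay="3" post_fail_delay="600"/>

After editing, bump the config_version attribute on the <cluster> element and propagate the change to both nodes as you normally do. Note that the sysrq-t output in step 6b goes to the kernel log, so run the dmesg capture in step 6c after it.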
Hi Bob, I think it would be better if I did everything except point 4. In case the problem has anything to do with one upgrade after the other (this gfs2 was created on 5.2), then with a fresh file system we might not see the problem. I will run one sequence without point 4 (until trapping the error), and one with point 4 (again until, hopefully, trapping the error). If it has anything to do with an "old" created gfs2, it looks like this way we will catch the error. Thank you, and I will come back ASAP. Theophanis
Adding Linbit contacts as it looks like it might be drbd related.
Theophanis, do you have a support agreement with Red Hat? If so, then we need to get support in the loop on this. Can you confirm that you are using RHEL and not CentOS?
Hello all, I could not replicate the issue ever again, neither with the existing GFS2 nor with a new one. I suggest we close this report; if I ever encounter it again and more data is available, I will file a new bug report. Thank you
Hi Theophanis, Thanks for your input. I'll close this for now.