Description of problem:
Several hosts went non-responsive and vdsm restarted repeatedly with the following traceback:

Thread-68333::ERROR::2013-09-17 23:09:45,214::misc::173::Storage.Misc::(panic) Panic: Couldn't connect to supervdsm
Traceback (most recent call last):
  File "/usr/share/vdsm/supervdsm.py", line 195, in launch
    utils.retry(self._connect, Exception, timeout=60, tries=3)
  File "/usr/lib64/python2.6/site-packages/vdsm/utils.py", line 934, in retry
    return func()
  File "/usr/share/vdsm/supervdsm.py", line 181, in _connect
    self._manager.connect()
  File "/usr/lib64/python2.6/multiprocessing/managers.py", line 474, in connect
    conn = Client(self._address, authkey=self._authkey)
  File "/usr/lib64/python2.6/multiprocessing/connection.py", line 149, in Client
    answer_challenge(c, authkey)
  File "/usr/lib64/python2.6/multiprocessing/connection.py", line 390, in answer_challenge
    raise AuthenticationError('digest sent was rejected')
AuthenticationError: digest sent was rejected

Version-Release number of selected component (if applicable):
vdsm-4.10.2-1.13.el6ev.x86_64

How reproducible: unknown

Steps to Reproduce: unknown

Actual results: unusable error message "digest sent was rejected"

Expected results: an error message that indicates what the problem actually is

Additional info: this issue appeared to occur on two different RHEV environments in the same company simultaneously.
There _isn't_ a corresponding error logged in supervdsm.log. My working theory is that supervdsm is blowing up somewhere in the python multiprocessing module. vdsm regards any response other than WELCOME as an authentication failure, even if it's actually a #TRACEBACK response because the server thread died with an exception. The multiprocessing module does have logging functions, but it looks like supervdsm is not enabling them. I've filed bug 1011767 to cover that problem.
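For reference, the client side of multiprocessing's handshake collapses every non-WELCOME reply into that same error. A minimal sketch (not vdsm's actual code, just the decision that `answer_challenge` makes after sending the HMAC digest):

```python
from multiprocessing import AuthenticationError

WELCOME = b'#WELCOME#'

def classify_reply(reply):
    # Simplified from Python's multiprocessing/connection.py
    # answer_challenge(): after the client sends its HMAC digest,
    # ANY reply other than WELCOME raises AuthenticationError --
    # including a '#TRACEBACK#...' payload produced by a server
    # thread that died with an exception, which is why the real
    # failure never shows up on the client side.
    if reply != WELCOME:
        raise AuthenticationError('digest sent was rejected')
    return True
```

Enabling the module's own logging (e.g. `multiprocessing.util.log_to_stderr(logging.DEBUG)`) would surface the server-side failure; that's the gap bug 1011767 covers.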
Hey, let's separate the issues. We had BZ 908068, where multiple instances were created due to a wrong check for whether parallel instances are running (commit 19c1303f should fix all those cases). This caused KeyError/RemoteError while trying to communicate with the new instance while the old instance still existed. It's easy to catch this case in vdsm.log, and it appears only in vdsm 3.2, before we started to use an external service for supervdsm. Some of those versions still don't provide supervdsm.log, so vdsm.log has all the relevant info: when vdsm initiated the supervdsm instance and why it crashed. The AuthenticationError might be related, or raised by another issue as you stated; that case will be investigated as part of this BZ.

1. Can you attach /var/log/vdsm/vdsm.log and /var/log/messages? If supervdsm.log exists, attach it too, even if it lacks information.
2. Can you describe the steps that lead to the AuthenticationError, if possible? Is it reproducible?
3. Can you also attach the supervdsm.py file, just to confirm the version?

Thanks.
Created attachment 804624 [details] Vdsm log at the time of the issue - vdsm panic
I attached one of the vdsm.logs from the time the original issue happened. Please let me know if you need more. The whole logcollector output is unfortunately too large to be pushed to BZ.
It looks quite responsive: after 2 tries, vdsm starts properly and operates. We assume that sometimes an old instance can still exist while a new one is being initiated, and in those cases you might see this exception. That's one of the ricochets we had while working with fork instead of two services. What about attaching supervdsm.py, and supervdsm.log if it exists? Or any steps that lead to the exception? Back to the log: the host returns to responsive in this case, so it doesn't reflect the bug description. Can you determine the time when the host became non-responsive?
Created attachment 804686 [details] Supervdsm.log
supervdsm.log added to the attachments, from the time of this particular failure. The customer used the RHN-installed version of vdsm (so I guess we can take supervdsm.py from there):

Aug 14 11:57:26 Updated: vdsm-4.10.2-23.0.el6ev.x86_64
Yes, there are fast restarts in supervdsm.log at 2013-09-19 11:08:57, but as before it looks like supervdsm recovers afterwards and continues to operate and respond. At least until 2013-09-20 11:24:33, when another restart occurred; then it worked for 4 hours until 2013-09-20 15:18:28,867, when another 3 restarts occurred, but it still seems to work. I can't say what leads to the restarts without the vdsm.log from those times (2013-09-20 15:18:28 and 2013-09-20 11:24:33); maybe it can give us more information about the case. Please attach those vdsm.logs too if you can (I guess it's vdsm.log.18 and 17). What were the results of that exception? Is the host responsive afterwards? Is it only about the exception print and the restarts, and why? If yes, those logs will help us understand what went wrong.
From what I saw while investigating it with Greg Procunier:

1. Hosts become non-responsive, which leads to fencing.
2. With fencing disabled, the hosts stay stuck in non-responsive.
3. Each host that becomes non-responsive tries to migrate all of its VMs.

From some of the logs that were sent to me I can see that, while trying to connect to the storage pool over and over again, we get:

Thread-910::ERROR::2013-08-30 17:00:14,382::task::833::TaskManager.Task::(_setError) Task=`b0fbe27e-b036-43b2-a273-848340668b12`::Unexpected error
Traceback (most recent call last):
  File "/usr/share/vdsm/storage/task.py", line 840, in _run
    return fn(*args, **kargs)
  File "/usr/share/vdsm/logUtils.py", line 38, in wrapper
    res = f(*args, **kwargs)
  File "/usr/share/vdsm/storage/hsm.py", line 924, in connectStoragePool
    masterVersion, options)
  File "/usr/share/vdsm/storage/hsm.py", line 971, in _connectStoragePool
    res = pool.connect(hostID, scsiKey, msdUUID, masterVersion)
  File "/usr/share/vdsm/storage/sp.py", line 644, in connect
    self.__rebuild(msdUUID=msdUUID, masterVersion=masterVersion)
  File "/usr/share/vdsm/storage/sp.py", line 1177, in __rebuild
    fileUtils.createdir(self.poolPath)
  File "/usr/share/vdsm/storage/fileUtils.py", line 179, in createdir
    os.makedirs(*params)
  File "/usr/lib64/python2.6/os.py", line 157, in makedirs
    mkdir(name, mode)
OSError: [Errno 28] No space left on device: '/rhev/data-center/ec759bb4-834b-11e1-943e-52540027efbf'

Thread-910::ERROR::2013-08-30 17:00:14,450::dispatcher::69::Storage.Dispatcher.Protect::(run) [Errno 28] No space left on device: '/rhev/data-center/ec759bb4-834b-11e1-943e-52540027efbf'
Traceback (most recent call last):
  File "/usr/share/vdsm/storage/dispatcher.py", line 61, in run
    result = ctask.prepare(self.func, *args, **kwargs)
  File "/usr/share/vdsm/storage/task.py", line 1142, in prepare
    raise self.error
OSError: [Errno 28] No space left on device: '/rhev/data-center/ec759bb4-834b-11e1-943e-52540027efbf'

libvirt errors in the middle:
libvirtEventLoop::ERROR::2013-09-10 14:44:04,778::libvirtconnection::106::vds::(wrapper) Unknown libvirterror: ecode: 42 edom: 10
Traceback (most recent call last):
  File "/usr/lib64/python2.6/site-packages/vdsm/libvirtconnection.py", line 83, in wrapper
    ret = f(*args, **kwargs)
  File "/usr/lib64/python2.6/site-packages/libvirt.py", line 381, in XMLDesc
    if ret is None: raise libvirtError ('virDomainGetXMLDesc() failed', dom=self)
libvirtError: Domain not found: no domain with matching uuid 'b4f4337e-290b-4813-99d9-84b15bd708a5'

supervdsm dies and reconnection fails at 2013-09-16 13:12:46 (as you noted in the description) for quite a long time (Panic is called every 3 retries, meaning vdsm is restarted but fails to reconnect) - that's the main issue I'm trying to investigate.

Then, the following is repeated until the end of the log:

Thread-13634::ERROR::2013-09-29 10:23:33,607::BindingXMLRPC::922::vds::(wrapper) unexpected error
Traceback (most recent call last):
  File "/usr/share/vdsm/BindingXMLRPC.py", line 908, in wrapper
    res = f(*args, **kwargs)
  File "/usr/share/vdsm/BindingXMLRPC.py", line 194, in vmDestroy
    return vm.destroy()
  File "/usr/share/vdsm/API.py", line 313, in destroy
    res = v.destroy()
  File "/usr/share/vdsm/libvirtvm.py", line 2430, in destroy
    response = self.releaseVm()
  File "/usr/share/vdsm/libvirtvm.py", line 2382, in releaseVm
    self.guestAgent.stop()
  File "/usr/share/vdsm/guestIF.py", line 192, in stop
    self._channelListener.unregister(self._sock.fileno())
AttributeError: GuestAgent instance has no attribute '_sock'

Still checking what led to that and why the restart didn't solve it.
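That AttributeError means stop() assumes `_sock` was set, but it only exists once the guest channel socket was created. A hypothetical defensive variant (class and attribute names mirror the traceback; the guard itself is my assumption, not vdsm's actual fix):

```python
class GuestAgentSketch(object):
    def __init__(self, channel_listener):
        self._channelListener = channel_listener
        # NOTE: _sock is deliberately NOT set here; in the traceback
        # above it only appears once the guest channel socket is
        # created, which is why stop() can hit AttributeError on a
        # half-initialized agent.

    def stop(self):
        # Guard: unregister only if the socket was ever created.
        sock = getattr(self, '_sock', None)
        if sock is not None:
            self._channelListener.unregister(sock.fileno())
```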
Thanks to your debug prints I found the supervdsm error. This issue was raised in the past (BZ 966133), but only in a QA environment where the hosts ran with fake KVM (the hosts ran as VMs to test load issues). There we encountered an issue where the ctime of the process was not equal to its creation time, as you can see from your log (carhtst06, 1.129.93.36):

Thread-22::DEBUG::2013-09-29 09:02:44,685::supervdsm::155::SuperVdsmProxy::(isRunning) isRunning: createdTime=1380459755.63
Thread-22::DEBUG::2013-09-29 09:02:44,685::supervdsm::167::SuperVdsmProxy::(isRunning) isRunning: pTime=1380459764.68

When this occurred, we initiated a new instance of supervdsm while the old one was still running, and started to get the AuthenticationError forever. This can lead to other errors afterwards, which may be related to the VM migrations and so on. I'm attaching a patch over tag v4.10.2-23.0 with that change; please try to patch your current code. With that, we might still see storage errors that lead to migrations, but it will be much clearer why it happened and where the flow started to change.
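The isRunning lines above compare a createdTime recorded at launch against a pTime taken from the process. A sketch of that flawed comparison, under the assumption that it roughly mirrors the pre-patch logic (simplified; the real code lives in supervdsm.py):

```python
import os

def is_running_sketch(pid, recorded_created_time):
    # Pre-patch style check (simplified): compare the timestamp saved
    # when supervdsm was launched against the ctime of /proc/<pid>.
    # ctime is NOT guaranteed to be the creation time and can change
    # during the run, so a live supervdsm can be misjudged as dead.
    # vdsm then spawns a second instance alongside the first one, and
    # the mismatched authkey yields AuthenticationError forever.
    try:
        p_time = os.stat('/proc/%d' % pid).st_ctime
    except OSError:
        return False  # no such process
    return p_time == recorded_created_time
```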
Created attachment 805064 [details] Patch above tag: v4.10.2-23.0, tag: sf18 with http://gerrit.ovirt.org/#/c/14998 changes
We need to understand why this issue occurred in this specific environment: Bug 966133 was opened due to it happening on a VM (which makes sense), while here we are talking about physical hosts.
I don't have much news about it, but as stated in Bug 966133, ctime is not guaranteed to be the creation time (http://en.wikipedia.org/wiki/Stat_(system_call)#ctime); it can change during the run. Using pidStat([pid])[21] means (from http://linux.die.net/man/5/proc):

(22) The time the process started after system boot. In kernels before Linux 2.6, this value was expressed in jiffies. Since Linux 2.6, the value is expressed in clock ticks (divide by sysconf(_SC_CLK_TCK)).

That is more accurate and cannot change in any case. I still need to get more details about the environment used here, so I'm leaving the needinfo on for more updates.
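Field (22) quoted above can be read directly from /proc; a sketch of the fix's approach (the parsing here is mine, the idea is the pidStat field 22 the comment describes):

```python
import os

def process_start_time(pid):
    # Field 22 of /proc/<pid>/stat is "starttime": clock ticks since
    # boot at which the process started. Unlike ctime it never changes
    # for the life of the process, so it is a reliable identity check.
    with open('/proc/%d/stat' % pid) as f:
        data = f.read()
    # Field 2 (the command name) is parenthesized and may itself
    # contain spaces, so split after the last ')' before indexing.
    rest = data.rsplit(')', 1)[1].split()
    return int(rest[19])  # fields 3 onward; 22 - 3 = index 19
```

Divide by `os.sysconf('SC_CLK_TCK')` to convert the result to seconds since boot.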
I've built packages based on the changes against both 23.0 and 1.13. See https://brewweb.devel.redhat.com/taskinfo?taskID=6351883 and https://brewweb.devel.redhat.com/taskinfo?taskID=6351922
The packages referenced in Comment 22 were broken due to a mistake in the backport to the customer's package versions. While on the phone with the customer I fixed those up, and the new test packages against 23.0 are at: https://brewweb.devel.redhat.com/taskinfo?taskID=6353014 In the meantime, the customer has updated their systems so that the vdsm versions are in sync across all environments, so packages against the 1.13 version are no longer necessary.
Not relevant for 3.3: in 3.3 supervdsm is an external service and this code was changed. Barak, please update the flags.
Tested on vdsm-4.10.2-27.0.el6ev.x86_64. Verified.
couldn't reproduce bug
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. http://rhn.redhat.com/errata/RHBA-2013-1462.html