Bug 1011661
| Field | Value |
|---|---|
| Summary | Supervdsm "AuthenticationError: digest sent was rejected" tracebacks on several hosts at the same time |
| Product | Red Hat Enterprise Virtualization Manager |
| Component | vdsm |
| Version | 3.1.0 |
| Status | CLOSED ERRATA |
| Severity | urgent |
| Priority | urgent |
| Reporter | Allie DeVolder <adevolder> |
| Assignee | Yaniv Bronhaim <ybronhei> |
| QA Contact | Tareq Alayan <talayan> |
| CC | abaron, acathrow, bazulay, dgibson, gwatson, hateya, iheim, lpeer, lyarwood, mkalinin, pstehlik, rhev-integ, tdosek, ybronhei, yeylon |
| Target Milestone | --- |
| Target Release | 3.2.4 |
| Keywords | Triaged, ZStream |
| Hardware | All |
| OS | Linux |
| Whiteboard | infra |
| oVirt Team | Infra |
| Doc Type | Bug Fix |
| Doc Text | Previously, supervdsm used ctime (the time of the last status change) to determine the start time of each process. The ctime value could change while the process was running, causing hosts to become non-responsive with the error "AuthenticationError: digest sent was rejected". Now the process start time (field 22 of /proc/[pid]/stat) is used instead, so this error no longer occurs. |
| Story Points | --- |
| Type | Bug |
| Regression | --- |
| Mount Type | --- |
| Documentation | --- |
| Category | --- |
| Cloudforms Team | --- |
| Last Closed | 2013-10-25 18:12:16 UTC |
Description (Allie DeVolder, 2013-09-24 18:08:24 UTC)

There _isn't_ a corresponding error logged in supervdsm.log. My working theory is that supervdsm is blowing up somewhere in the Python multiprocessing module. vdsm regards any response other than WELCOME as an authentication failure, even if it's actually a #TRACEBACK response because the server thread died with an exception. The multiprocessing module does have logging functions, but it looks like supervdsm is not enabling them. I've filed bug 1011767 to cover that problem.

Hey, let's separate the issues. We had BZ 908068, where multiple instances were created due to a faulty check for whether parallel instances were already running (commit 19c1303f should fix all those cases). This caused KeyError/RemoteError while trying to communicate with the new instance while the old instance still existed. It's easy to catch this case in vdsm.log, and it appears only in vdsm 3.2, before supervdsm became an external service. Some of those versions still don't provide supervdsm.log, so vdsm.log carries all the relevant info: when vdsm initiates a supervdsm instance and why it crashed. The AuthenticationError might be related, or might be raised by another issue as you stated; that case will be investigated as part of this BZ.

1. Can you attach /var/log/vdsm/vdsm.log and /var/log/messages? If supervdsm.log exists, attach it as well, even if it lacks information.
2. Can you describe the steps that lead to the AuthenticationError, if possible? Is it reproducible?
3. Can you also attach the supervdsm.py file, just to confirm the version?

Thanks.

Created attachment 804624 [details]
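As noted above, the multiprocessing module has its own logging hooks that supervdsm reportedly does not enable. A minimal sketch of turning them on is shown below; the handler, format, and level are illustrative choices, not vdsm's actual configuration:

```python
import logging
import multiprocessing

# multiprocessing keeps an internal logger (named "multiprocessing") that
# records events such as server thread start-up, shutdown, and errors.
logger = multiprocessing.get_logger()

handler = logging.StreamHandler()
handler.setFormatter(logging.Formatter(
    '[%(levelname)s/%(processName)s] %(message)s'))
logger.addHandler(handler)
# DEBUG surfaces internal events, e.g. a server thread dying with an
# exception, which would otherwise only appear as a failed handshake.
logger.setLevel(logging.DEBUG)
```

With this enabled, a server-side crash should leave a trace in the log instead of surfacing only as an opaque authentication failure on the client.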
Vdsm log at the time of the issue - vdsm panic
I attached one of the vdsm.logs from the time of the original issue. Please let me know if you need more; the whole log collector output is unfortunately too large to push to BZ.

It looks quite responsive after 2 tries; vdsm starts properly and operates. We assume that sometimes an old instance can still exist while a new one is being initiated, and in those cases you might see this exception. That's one of the side effects we had while working with a fork instead of two services. What about attaching supervdsm.py and supervdsm.log, if they exist? Or any steps that lead to the exception? Back in the log, the host becomes responsive again in this case, so it doesn't reflect the bug description. Can you determine the time when the host became non-responsive?

Created attachment 804686 [details]
Supervdsm.log
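For context on the error in the bug title: multiprocessing authenticates connections with an HMAC challenge-response, and "digest sent was rejected" is raised by the client whenever the server replies with anything other than WELCOME. The sketch below is simplified from the stdlib's multiprocessing.connection (the message constants match CPython's, but this is an illustration, not vdsm code; RuntimeError stands in for the stdlib's AuthenticationError):

```python
import hashlib
import hmac
import os

CHALLENGE, WELCOME, FAILURE = b'#CHALLENGE#', b'#WELCOME#', b'#FAILURE#'

def deliver_challenge(send, recv, authkey):
    # Server side: send a random nonce, expect HMAC(authkey, nonce) back.
    message = os.urandom(20)
    send(CHALLENGE + message)
    digest = hmac.new(authkey, message, hashlib.md5).digest()
    if recv() == digest:
        send(WELCOME)
    else:
        send(FAILURE)
        raise RuntimeError('digest received was wrong')

def answer_challenge(send, recv, authkey):
    # Client side: compute the digest and send it. If anything but WELCOME
    # comes back -- a FAILURE, or a traceback because the server thread
    # died -- the client raises "digest sent was rejected", exactly the
    # error seen on these hosts.
    message = recv()
    assert message[:len(CHALLENGE)] == CHALLENGE
    digest = hmac.new(authkey, message[len(CHALLENGE):], hashlib.md5).digest()
    send(digest)
    if recv() != WELCOME:
        raise RuntimeError('digest sent was rejected')
```

This is why a crashed or mismatched supervdsm instance shows up as an authentication error even though the authkey itself is correct.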
Supervdsm.log added to attachments, covering the time of this particular failure.

The customer used the RHN-installed version of vdsm (so I guess we can take supervdsm.py from there):

Aug 14 11:57:26 Updated: vdsm-4.10.2-23.0.el6ev.x86_64

Yes, there are fast restarts in supervdsm.log at 2013-09-19 11:08:57, but as before it looks like supervdsm recovers afterwards and continues to operate and respond. At least until 2013-09-20 11:24:33, when another restart occurred; then it worked for 4 hours until 2013-09-20 15:18:28,867, when another 3 restarts occurred, but it still seems to work. I can't say what led to the restarts without vdsm.log from those times (2013-09-20 15:18:28 and 2013-09-20 11:24:33); maybe it can give us more information about the case. Please attach those vdsm.logs too if you can (I guess it's vdsm.log.18 and 17). What were the results of that exception? Is the host responsive afterwards? Is it only about the exception print and the restarts? If so, those logs will help us understand what went wrong.

From what I saw while investigating it with Greg Procunier:

1. Hosts become non-responsive, which leads to fencing.
2. Even with fencing disabled, the hosts stay non-responsive.
3. Each host that becomes non-responsive tries to migrate all of its VMs.

From some of the logs that were sent to me I can see that while trying to connect to the storage pool over and over again we get:

Thread-910::ERROR::2013-08-30 17:00:14,382::task::833::TaskManager.Task::(_setError) Task=`b0fbe27e-b036-43b2-a273-848340668b12`::Unexpected error
Traceback (most recent call last):
  File "/usr/share/vdsm/storage/task.py", line 840, in _run
    return fn(*args, **kargs)
  File "/usr/share/vdsm/logUtils.py", line 38, in wrapper
    res = f(*args, **kwargs)
  File "/usr/share/vdsm/storage/hsm.py", line 924, in connectStoragePool
    masterVersion, options)
  File "/usr/share/vdsm/storage/hsm.py", line 971, in _connectStoragePool
    res = pool.connect(hostID, scsiKey, msdUUID, masterVersion)
  File "/usr/share/vdsm/storage/sp.py", line 644, in connect
    self.__rebuild(msdUUID=msdUUID, masterVersion=masterVersion)
  File "/usr/share/vdsm/storage/sp.py", line 1177, in __rebuild
    fileUtils.createdir(self.poolPath)
  File "/usr/share/vdsm/storage/fileUtils.py", line 179, in createdir
    os.makedirs(*params)
  File "/usr/lib64/python2.6/os.py", line 157, in makedirs
    mkdir(name, mode)
OSError: [Errno 28] No space left on device: '/rhev/data-center/ec759bb4-834b-11e1-943e-52540027efbf'

Thread-910::ERROR::2013-08-30 17:00:14,450::dispatcher::69::Storage.Dispatcher.Protect::(run) [Errno 28] No space left on device: '/rhev/data-center/ec759bb4-834b-11e1-943e-52540027efbf'
Traceback (most recent call last):
  File "/usr/share/vdsm/storage/dispatcher.py", line 61, in run
    result = ctask.prepare(self.func, *args, **kwargs)
  File "/usr/share/vdsm/storage/task.py", line 1142, in prepare
    raise self.error
OSError: [Errno 28] No space left on device: '/rhev/data-center/ec759bb4-834b-11e1-943e-52540027efbf'

libvirt errors in the middle:

libvirtEventLoop::ERROR::2013-09-10 14:44:04,778::libvirtconnection::106::vds::(wrapper) Unknown libvirterror: ecode: 42 edom: 10
Traceback (most recent call last):
  File "/usr/lib64/python2.6/site-packages/vdsm/libvirtconnection.py", line 83, in wrapper
    ret = f(*args, **kwargs)
  File "/usr/lib64/python2.6/site-packages/libvirt.py", line 381, in XMLDesc
    if ret is None: raise libvirtError('virDomainGetXMLDesc() failed', dom=self)
libvirtError: Domain not found: no domain with matching uuid 'b4f4337e-290b-4813-99d9-84b15bd708a5'

supervdsm dies and reconnection fails at 2013-09-16 13:12:46 (as you noted in the description) for quite a long time (Panic is called every 3 retries, which means vdsm is restarted but fails to reconnect) - that's the main issue I'm trying to investigate. Then the following is repeated until the end of the log:

Thread-13634::ERROR::2013-09-29 10:23:33,607::BindingXMLRPC::922::vds::(wrapper) unexpected error
Traceback (most recent call last):
  File "/usr/share/vdsm/BindingXMLRPC.py", line 908, in wrapper
    res = f(*args, **kwargs)
  File "/usr/share/vdsm/BindingXMLRPC.py", line 194, in vmDestroy
    return vm.destroy()
  File "/usr/share/vdsm/API.py", line 313, in destroy
    res = v.destroy()
  File "/usr/share/vdsm/libvirtvm.py", line 2430, in destroy
    response = self.releaseVm()
  File "/usr/share/vdsm/libvirtvm.py", line 2382, in releaseVm
    self.guestAgent.stop()
  File "/usr/share/vdsm/guestIF.py", line 192, in stop
    self._channelListener.unregister(self._sock.fileno())
AttributeError: GuestAgent instance has no attribute '_sock'

Still checking what led to that and why the restart didn't solve it.

Thanks to your debug prints I found the supervdsm error. This issue was raised in the past (BZ 966133), but only in a QA environment where the hosts were run with fake KVM (the hosts ran as VMs to test load issues).
There we encountered an issue where the ctime of the process was not equal to the creation time, as you can see from your log (carhtst06, 1.129.93.36):

Thread-22::DEBUG::2013-09-29 09:02:44,685::supervdsm::155::SuperVdsmProxy::(isRunning) isRunning: createdTime=1380459755.63
Thread-22::DEBUG::2013-09-29 09:02:44,685::supervdsm::167::SuperVdsmProxy::(isRunning) isRunning: pTime=1380459764.68

When this occurred, a new instance of supervdsm was initiated while the old one was still running, and we started to get the AuthenticationError forever. This can lead to errors afterwards that may be related to the VM migrations and so on. I'm attaching a patch over tag v4.10.2-23.0 with that change; please try to patch your current code. With that we might still see storage errors that lead to migrations, but it will be much clearer why it happened and where the flow started to change.

Created attachment 805064 [details]
Patch above tag: v4.10.2-23.0, tag: sf18, with http://gerrit.ovirt.org/#/c/14998 changes

We need to understand why this issue occurred in this specific environment. Bug 966133 was opened due to it happening on a VM (which makes sense), while here we are talking about physical hosts.

I don't have much news about it, but as stated in Bug 966133, ctime is not meant to be used as a creation time (http://en.wikipedia.org/wiki/Stat_(system_call)#ctime); it can change during the run. Using pidStat([pid])[21] means (from http://linux.die.net/man/5/proc): "(22) The time the process started after system boot. In kernels before Linux 2.6, this value was expressed in jiffies. Since Linux 2.6, the value is expressed in clock ticks (divide by sysconf(_SC_CLK_TCK))." This is more accurate and cannot change in any case. I still need to get more details about the environment used here, so I leave the needinfo on for more updates.

I've built packages based on the changes against both 23.0 and 1.13. See https://brewweb.devel.redhat.com/taskinfo?taskID=6351883 and https://brewweb.devel.redhat.com/taskinfo?taskID=6351922

The packages referenced in Comment 22 were broken, due to a mistake in backporting to the customer's package versions. While on the phone with the customer I fixed those up, and the new test packages against 23.0 are at https://brewweb.devel.redhat.com/taskinfo?taskID=6353014. In the meantime, the customer has updated their systems so that the vdsm versions are in sync across all environments, so packages against the 1.13 version are no longer necessary.

Not relevant for 3.3; in 3.3 supervdsm is an external service and this code was changed.

Barak, please update the flags.

Tested on vdsm-4.10.2-27.0.el6ev.x86_64. Verified; couldn't reproduce the bug.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. http://rhn.redhat.com/errata/RHBA-2013-1462.html
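The fix discussed above - reading the process start time from /proc/[pid]/stat (field 22 in proc(5) numbering) instead of trusting ctime - can be sketched roughly as follows. This is a Linux-specific illustration, not vdsm's actual pidStat code; the helper name is hypothetical:

```python
import os

def proc_start_time(pid):
    """Absolute start time (epoch seconds) of a process.

    Unlike the ctime of /proc/<pid>, this value is fixed for the
    lifetime of the process, so comparing it against a recorded value
    reliably tells whether a PID still refers to the same process.
    """
    with open('/proc/%d/stat' % pid) as f:
        data = f.read()
    # comm (field 2) may contain spaces or parens, so split after the
    # last ')'; fields[0] is then field 3 ("state") of proc(5).
    fields = data.rsplit(')', 1)[1].split()
    starttime_ticks = int(fields[19])  # field 22: start time after boot

    # Convert clock ticks to seconds, per proc(5).
    seconds_after_boot = starttime_ticks / float(os.sysconf('SC_CLK_TCK'))

    # Add the system boot time (btime from /proc/stat) to get an
    # absolute timestamp comparable across reboots of the checker.
    with open('/proc/stat') as f:
        for line in f:
            if line.startswith('btime'):
                btime = int(line.split()[1])
                break
    return btime + seconds_after_boot
```

Because the returned value never changes while the process lives, an isRunning-style check built on it cannot hit the createdTime/pTime mismatch shown in the log above.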