Bug 1011661

Summary: Supervdsm "AuthenticationError: digest sent was rejected" Tracebacks on several hosts at the same time
Product: Red Hat Enterprise Virtualization Manager
Reporter: Allie DeVolder <adevolder>
Component: vdsm
Assignee: Yaniv Bronhaim <ybronhei>
Status: CLOSED ERRATA
QA Contact: Tareq Alayan <talayan>
Severity: urgent
Docs Contact:
Priority: urgent
Version: 3.1.0
CC: abaron, acathrow, bazulay, dgibson, gwatson, hateya, iheim, lpeer, lyarwood, mkalinin, pstehlik, rhev-integ, tdosek, ybronhei, yeylon
Target Milestone: ---
Keywords: Triaged, ZStream
Target Release: 3.2.4
Hardware: All
OS: Linux
Whiteboard: infra
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Previously, supervdsm used ctime (the time of the last status change) to determine the start time of each process. The ctime value can change while the process is running, resulting in hosts becoming non-responsive with the error "AuthenticationError: digest sent was rejected". Now, the process start time (field 22 of /proc/[pid]/stat, accessed as index 21 in the code) is used instead, so this error no longer occurs.
Story Points: ---
Clone Of:
Environment:
Last Closed: 2013-10-25 18:12:16 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: Infra
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Attachments:
  Vdsm log at the time of the issue - vdsm panic (flags: none)
  Supervdsm.log (flags: none)
  Patch above tag: v4.10.2-23.0, tag: sf18 with http://gerrit.ovirt.org/#/c/14998 changes (flags: none)

Description Allie DeVolder 2013-09-24 18:08:24 UTC
Description of problem:
Several hosts went non-responsive and vdsm restarted repeatedly with the following Traceback:
Thread-68333::ERROR::2013-09-17 23:09:45,214::misc::173::Storage.Misc::(panic) Panic: Couldn't connect to supervdsm
Traceback (most recent call last):
  File "/usr/share/vdsm/supervdsm.py", line 195, in launch
    utils.retry(self._connect, Exception, timeout=60, tries=3)
  File "/usr/lib64/python2.6/site-packages/vdsm/utils.py", line 934, in retry
    return func()
  File "/usr/share/vdsm/supervdsm.py", line 181, in _connect
    self._manager.connect()
  File "/usr/lib64/python2.6/multiprocessing/managers.py", line 474, in connect
    conn = Client(self._address, authkey=self._authkey)
  File "/usr/lib64/python2.6/multiprocessing/connection.py", line 149, in Client
    answer_challenge(c, authkey)
  File "/usr/lib64/python2.6/multiprocessing/connection.py", line 390, in answer_challenge
    raise AuthenticationError('digest sent was rejected')
AuthenticationError: digest sent was rejected
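
For context on the restart loop: the Panic line above comes from launch() giving up after utils.retry exhausts its attempts. Below is a minimal stand-in for that retry helper, assuming roughly these semantics (an illustrative sketch, not the vdsm source):

import time

def retry(func, expected_exception=Exception, timeout=60, tries=3, sleep=1):
    # Keep calling func() until it succeeds; give up and re-raise the
    # last error once the attempts are used up or the timeout expires.
    start = time.time()
    while True:
        tries -= 1
        try:
            return func()
        except expected_exception:
            if tries <= 0 or (time.time() - start) > timeout:
                raise
            time.sleep(sleep)

When every attempt ends in the AuthenticationError above, launch() panics ("Couldn't connect to supervdsm") and vdsm is restarted, which is why the hosts cycle instead of recovering.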

Version-Release number of selected component (if applicable):
vdsm-4.10.2-1.13.el6ev.x86_64

How reproducible:
unknown

Steps to Reproduce:
Unknown

Actual results:
Unusable error message "digest sent was rejected"

Expected results:
Error message that indicates what the problem is

Additional info:
This issue appeared to occur in two different RHEV environments at the same company simultaneously.

Comment 1 David Gibson 2013-09-25 06:07:23 UTC
There _isn't_ a corresponding error logged in supervdsm.log.

My working theory is that supervdsm is blowing up somewhere in the python multiprocessing module.  vdsm regards any response other than WELCOME as an authentication failure, even if it's actually a #TRACEBACK response because the server thread died with an exception.

The multiprocessing module does have logging functions, but it looks like supervdsm is not enabling them.  I've filed bug 1011767 to cover that problem.
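
For reference, the handshake that produces this message looks roughly like the following paraphrase of Python 2.6's multiprocessing.connection.answer_challenge (simplified, not the exact stdlib source; the constants below mirror the stdlib values):

import hmac
from multiprocessing import AuthenticationError

CHALLENGE = '#CHALLENGE#'   # mirrors the stdlib constants
WELCOME = '#WELCOME#'

def answer_challenge_sketch(connection, authkey):
    # Server sends '#CHALLENGE#' plus random bytes; the client answers
    # with an HMAC digest of those bytes keyed on the shared authkey.
    message = connection.recv_bytes(256)
    assert message[:len(CHALLENGE)] == CHALLENGE
    digest = hmac.new(authkey, message[len(CHALLENGE):]).digest()
    connection.send_bytes(digest)
    # Any reply other than the literal WELCOME bytes - including a
    # traceback-style failure reply from a server thread that died -
    # is reported as a rejected digest.
    response = connection.recv_bytes(256)
    if response != WELCOME:
        raise AuthenticationError('digest sent was rejected')

So the message does not necessarily mean the key was wrong; it only means the server did not answer with WELCOME.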

Comment 5 Yaniv Bronhaim 2013-09-29 06:23:31 UTC
Hey,
Let's separate the issues.
We had BZ 908068, where multiple instances were created due to a faulty check for whether parallel instances were already running (commit 19c1303f should fix all those cases). This caused a KeyError/RemoteError when trying to communicate with the new instance while the old instance still existed.
It's easy to catch this case in vdsm.log, and it appears only in vdsm 3.2, before supervdsm started running as an external service.

Some of those versions still don't provide supervdsm.log, so vdsm.log should contain all the relevant info - when vdsm initiated the supervdsm instance and why it crashed. The AuthenticationError might be related, or might be raised by another issue as you stated; that case will be investigated as part of this BZ.

1. Can you attach /var/log/vdsm/vdsm.log and /var/log/messages? If supervdsm.log exists, attach it as well, even if it lacks information.
2. Can you describe the steps that lead to the AuthenticationError, if possible? Is it reproducible?
3. Can you also attach the supervdsm.py file, just to confirm the version?

Thanks.

Comment 6 Tomas Dosek 2013-09-29 06:33:17 UTC
Created attachment 804624 [details]
Vdsm log at the time of the issue - vdsm panic

Comment 7 Tomas Dosek 2013-09-29 06:34:44 UTC
I attached one of the vdsm.logs from the time the original issue happened. Please let me know if you need more. The whole logcollector archive is unfortunately too large to be pushed to BZ.

Comment 8 Yaniv Bronhaim 2013-09-29 12:38:29 UTC
It looks quite responsive: after 2 tries, vdsm starts properly and operates. We assume that sometimes an old instance can still exist while a new one is being initiated, and in those cases you might see this exception. That's one of the side effects we had while working with fork instead of two separate services.

Can you attach supervdsm.py and supervdsm.log, if they exist, or describe any steps that lead to the exception?

Back to the log: the host becomes responsive again in this case, so it doesn't reflect the bug description. Can you determine the time when the host became non-responsive?

Comment 9 Tomas Dosek 2013-09-29 13:14:20 UTC
Created attachment 804686 [details]
Supervdsm.log

Comment 10 Tomas Dosek 2013-09-29 13:16:30 UTC
Supervdsm.log added to the attachments. At the time of this particular failure the customer used the RHN-installed version of vdsm (so I guess we can take the supervdsm.py from there):
Aug 14 11:57:26 Updated: vdsm-4.10.2-23.0.el6ev.x86_64

Comment 11 Yaniv Bronhaim 2013-09-29 13:41:34 UTC
Yes, there are fast restarts in supervdsm.log at 2013-09-19 11:08:57, but as before it looks like supervdsm recovers afterwards and continues to operate and respond.
At least until 2013-09-20 11:24:33, when another restart occurred; then it worked for 4 hours until 2013-09-20 15:18:28,867, when another 3 restarts occurred, but it still seems to work.

I can't say what leads to the restarts without vdsm.log from that time (2013-09-20 15:18:28 and 2013-09-20 11:24:33); maybe it can give us more information about the case.

Please attach those vdsm.logs too if you can (I guess it's vdsm.log.18 and 17).

What were the results of that exception? Is the host responsive afterwards? Is it only about the exception print and the restarts, and why? If yes, those logs will help us understand what went wrong.

Comment 13 Yaniv Bronhaim 2013-09-29 16:04:47 UTC
From what I saw while investigating it with Greg Procunier:
1. Hosts become non-responsive, which leads to fencing
2. Even with fencing disabled, the hosts stay stuck in the non-responsive state
3. Each host that becomes non-responsive tries to migrate all of its VMs

From some of the logs that were sent to me I can see:
While trying to connect to the storage pool over and over again we get:
Thread-910::ERROR::2013-08-30 17:00:14,382::task::833::TaskManager.Task::(_setError) Task=`b0fbe27e-b036-43b2-a273-848340668b12`::Unexpected error
Traceback (most recent call last):
  File "/usr/share/vdsm/storage/task.py", line 840, in _run
    return fn(*args, **kargs)
  File "/usr/share/vdsm/logUtils.py", line 38, in wrapper
    res = f(*args, **kwargs)
  File "/usr/share/vdsm/storage/hsm.py", line 924, in connectStoragePool
    masterVersion, options)
  File "/usr/share/vdsm/storage/hsm.py", line 971, in _connectStoragePool
    res = pool.connect(hostID, scsiKey, msdUUID, masterVersion)
  File "/usr/share/vdsm/storage/sp.py", line 644, in connect
    self.__rebuild(msdUUID=msdUUID, masterVersion=masterVersion)
  File "/usr/share/vdsm/storage/sp.py", line 1177, in __rebuild
    fileUtils.createdir(self.poolPath)
  File "/usr/share/vdsm/storage/fileUtils.py", line 179, in createdir
    os.makedirs(*params)
  File "/usr/lib64/python2.6/os.py", line 157, in makedirs
    mkdir(name, mode)
OSError: [Errno 28] No space left on device: '/rhev/data-center/ec759bb4-834b-11e1-943e-52540027efbf'
Thread-910::ERROR::2013-08-30 17:00:14,450::dispatcher::69::Storage.Dispatcher.Protect::(run) [Errno 28] No space left on device: '/rhev/data-center/ec759bb4-834b-11e1-943e-52540027efbf'
Traceback (most recent call last):
  File "/usr/share/vdsm/storage/dispatcher.py", line 61, in run
    result = ctask.prepare(self.func, *args, **kwargs)
  File "/usr/share/vdsm/storage/task.py", line 1142, in prepare
    raise self.error
OSError: [Errno 28] No space left on device: '/rhev/data-center/ec759bb4-834b-11e1-943e-52540027efbf

libvirt errors in the middle:
libvirtEventLoop::ERROR::2013-09-10 14:44:04,778::libvirtconnection::106::vds::(wrapper) Unknown libvirterror: ecode: 42 edom: 10
Traceback (most recent call last):
  File "/usr/lib64/python2.6/site-packages/vdsm/libvirtconnection.py", line 83, in wrapper
    ret = f(*args, **kwargs)
  File "/usr/lib64/python2.6/site-packages/libvirt.py", line 381, in XMLDesc
    if ret is None: raise libvirtError ('virDomainGetXMLDesc() failed', dom=self)
libvirtError: Domain not found: no domain with matching uuid 'b4f4337e-290b-4813-99d9-84b15bd708a5'

supervdsm dies and reconnection fails at 2013-09-16 13:12:46 (as you noted in the description) for quite a long time (Panic is called after every 3 failed retries, which means vdsm is restarted but never manages to reconnect) - that's the main issue that I'm trying to investigate.

Then, the following is repeated until the end of the log:

Thread-13634::ERROR::2013-09-29 10:23:33,607::BindingXMLRPC::922::vds::(wrapper) unexpected error
Traceback (most recent call last):
  File "/usr/share/vdsm/BindingXMLRPC.py", line 908, in wrapper
    res = f(*args, **kwargs)
  File "/usr/share/vdsm/BindingXMLRPC.py", line 194, in vmDestroy
    return vm.destroy()
  File "/usr/share/vdsm/API.py", line 313, in destroy
    res = v.destroy()
  File "/usr/share/vdsm/libvirtvm.py", line 2430, in destroy
    response = self.releaseVm()
  File "/usr/share/vdsm/libvirtvm.py", line 2382, in releaseVm
    self.guestAgent.stop()
  File "/usr/share/vdsm/guestIF.py", line 192, in stop
    self._channelListener.unregister(self._sock.fileno())
AttributeError: GuestAgent instance has no attribute '_sock'


Still checking what led to that and why the restart didn't solve it.

Comment 15 Yaniv Bronhaim 2013-09-30 09:51:20 UTC
Thanks to your debug prints I found the supervdsm error. This issue was raised in the past (BZ 966133), but only in a QA environment where the hosts ran with fake KVM (the hosts ran as VMs to test load issues).

There we encountered an issue where the ctime of the process was not equal to its creation time, as you can see from your log (carhtst06, 1.129.93.36):
Thread-22::DEBUG::2013-09-29 09:02:44,685::supervdsm::155::SuperVdsmProxy::(isRunning) isRunning: createdTime=1380459755.63
Thread-22::DEBUG::2013-09-29 09:02:44,685::supervdsm::167::SuperVdsmProxy::(isRunning) isRunning: pTime=1380459764.68

When this occurred, we initiated a new instance of supervdsm while the old one was still running, and started to get the AuthenticationError forever.

This can lead to further errors afterwards that may be related to the VM migrations and so on.

I'm attaching a patch over tag v4.10.2-23.0 with that change; please try to patch your current code. With it we might still see storage errors that lead to migrations, but it will be much clearer why they happened and where the flow started to change.
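
To make the failure mode concrete, here is a sketch of the kind of liveness check involved (simplified, with illustrative file paths and helper names, not the literal vdsm code): vdsm records the supervdsm process's creation time next to its pid and later compares the saved value against what /proc reports.

import os

PIDFILE = "/var/run/vdsm/svdsm.pid"    # illustrative paths
TIMEFILE = "/var/run/vdsm/svdsm.time"

def proc_ctime(pid):
    # ctime of /proc/<pid> - what the buggy check relied on.  It can
    # change while the process is running, so it is not a stable
    # identifier for "the same process I started earlier".
    return str(os.stat("/proc/%s" % pid).st_ctime)

def is_running():
    try:
        with open(PIDFILE) as f:
            pid = f.read().strip()
        with open(TIMEFILE) as f:
            created_time = f.read().strip()
    except IOError:
        return False
    try:
        p_time = proc_ctime(pid)
    except OSError:
        return False  # no such process
    # If the recorded time no longer matches, vdsm concludes supervdsm
    # is dead and forks a new instance - even though the old one may
    # still be alive, which is what the createdTime/pTime lines above show.
    return p_time == created_time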

Comment 16 Yaniv Bronhaim 2013-09-30 10:05:39 UTC
Created attachment 805064 [details]
Patch above tag: v4.10.2-23.0, tag: sf18 with http://gerrit.ovirt.org/#/c/14998 changes

Comment 18 Barak 2013-09-30 11:35:07 UTC
We need to understand why this issue occurred in this specific environment.
Bug 966133 was opened because it happened on a VM (which makes sense), while here we are talking about physical hosts.

Comment 19 Yaniv Bronhaim 2013-09-30 14:42:33 UTC
I don't have much news about it, but as stated in Bug 966133, ctime is not meant to be used as a creation time (http://en.wikipedia.org/wiki/Stat_(system_call)#ctime); it can change during the run.

Using pidStat([pid])[21] means:
(from http://linux.die.net/man/5/proc)
(22) The time the process started after system boot. In kernels before Linux 2.6, this value was expressed in jiffies. Since Linux 2.6, the value is expressed in clock ticks (divide by sysconf(_SC_CLK_TCK)).

This is more accurate and cannot change in any case.

I still need to get more details about the environment used here, so I'm leaving the needinfo flag on for more updates.
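
To illustrate the replacement check (a minimal sketch, not vdsm's pidStat implementation): field 22 of /proc/[pid]/stat - index 21 when the full field list is indexed from zero, which is what pidStat([pid])[21] refers to - holds the start time in clock ticks since boot and never changes for the lifetime of the process.

import os

def proc_start_time(pid):
    # Field 22 of /proc/<pid>/stat is the process start time in clock
    # ticks since boot (see proc(5)); unlike the ctime of /proc/<pid>,
    # it never changes while the process is alive.
    with open("/proc/%d/stat" % pid) as f:
        data = f.read()
    # The second field (comm) may contain spaces, so split after the
    # closing parenthesis before counting the remaining fields.
    rest = data.rsplit(")", 1)[1].split()
    starttime_ticks = int(rest[19])   # field 22 overall, field 20 after comm
    return starttime_ticks / float(os.sysconf("SC_CLK_TCK"))  # seconds since boot

For deciding whether a pid still refers to the same process, the raw tick value is already enough; dividing by SC_CLK_TCK just converts it to seconds since boot.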

Comment 22 David Gibson 2013-10-01 01:07:27 UTC
I've built packages based on the changes against both 23.0 and 1.13.  See

https://brewweb.devel.redhat.com/taskinfo?taskID=6351883

and

https://brewweb.devel.redhat.com/taskinfo?taskID=6351922

Comment 25 David Gibson 2013-10-02 04:13:28 UTC
The packages referenced in Comment 22 were broken, due to a mistake in the backport to the customer's package versions. While on the phone with the customer I fixed those up, and the new test packages against 23.0 are at:
    https://brewweb.devel.redhat.com/taskinfo?taskID=6353014

In the meantime, the customer has updated their systems so that the vdsm versions are in sync across all environments, so packages against the 1.13 version are no longer necessary.

Comment 26 Yaniv Bronhaim 2013-10-06 15:18:31 UTC
Not relevant for 3.3: in 3.3 supervdsm is an external service and this code was changed.
Barak, please update the flags.

Comment 34 Tareq Alayan 2013-10-16 12:44:44 UTC
Tested on vdsm-4.10.2-27.0.el6ev.x86_64. Verified.

Comment 35 Tareq Alayan 2013-10-16 13:15:54 UTC
Couldn't reproduce the bug.

Comment 37 errata-xmlrpc 2013-10-25 18:12:16 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

http://rhn.redhat.com/errata/RHBA-2013-1462.html