1844578 – anaconda very rarely crashes at startup with a pygobject traceback

RHEL Engineering is moving the tracking of its product development work on RHEL 6 through RHEL 9 to Red Hat Jira (issues.redhat.com). If you're a Red Hat customer, please continue to file support cases via the Red Hat customer portal. If you're not, please head to the "RHEL project" in Red Hat Jira and file new tickets here. Individual Bugzilla bugs in the statuses "NEW", "ASSIGNED", and "POST" are being migrated throughout September 2023. Bugs of Red Hat partners with an assigned Engineering Partner Manager (EPM) are migrated in late September as per pre-agreed dates. Bugs against components "kernel", "kernel-rt", and "kpatch" are only migrated if still in "NEW" or "ASSIGNED". If you cannot log in to RH Jira, please consult article #7032570. That failing, please send an e-mail to the RH Jira admins at rh-issues@redhat.com to troubleshoot your issue as a user management inquiry. The email creates a ServiceNow ticket with Red Hat. Individual Bugzilla bugs that are migrated will be moved to status "CLOSED", resolution "MIGRATED", and set with "MigratedToJIRA" in "Keywords". The link to the successor Jira issue will be found under "Links", have a little "two-footprint" icon next to it, and direct you to the "RHEL project" in Red Hat Jira (issue links are of type "https://issues.redhat.com/browse/RHEL-XXXX", where "X" is a digit). This same link will be available in a blue banner at the top of the page informing you that that bug has been migrated.

Bug 1844578 - anaconda very rarely crashes at startup with a pygobject traceback

Summary: anaconda very rarely crashes at startup with a pygobject traceback

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	Red Hat Enterprise Linux 8
Classification:	Red Hat
Component:	pygobject3
Sub Component:
Version:	8.3
Hardware:	Unspecified
OS:	Unspecified
Priority:	urgent
Severity:	urgent
Target Milestone:	rc
Target Release:	8.0
Assignee:	Ray Strode [halfline]
QA Contact:	Desktop QE
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:	1812825
TreeView+	depends on / blocked

Reported:	2020-06-05 17:14 UTC by Jan Stodola
Modified:	2020-11-04 01:37 UTC (History)
CC List:	13 users (show)
Fixed In Version:	pygobject3-3.28.3-2.el8
Doc Type:	If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed:	2020-11-04 01:35:45 UTC
Type:	Bug
Target Upstream Version:
Embargoed:
Dependent Products:

Attachments	(Terms of Use)
Log from anaconda kickstart test with the Traceback (39.71 KB, text/plain) 2020-06-08 10:05 UTC, Radek Vykydal	no flags	Details
/usr/share/gir-1.0/GLib-2.0.gir from gobject-introspection-devel-1.56.1-1.el8.s390x.rpm (2.17 MB, text/plain) 2020-06-08 13:46 UTC, Tomas Popela	no flags	Details
journal (278.57 KB, text/plain) 2020-06-09 12:30 UTC, Radek Vykydal	no flags	Details
journal (289.63 KB, text/plain) 2020-06-09 12:50 UTC, Radek Vykydal	no flags	Details
journal for scratch build https://brewweb.engineering.redhat.com/brew/taskinfo?taskID=29217469 (600.27 KB, text/plain) 2020-06-10 05:45 UTC, Radek Vykydal	no flags	Details
another log for scratch build https://brewweb.engineering.redhat.com/brew/taskinfo?taskID=29217469 (.debug) (121.33 KB, text/plain) 2020-06-10 08:52 UTC, Radek Vykydal	no flags	Details
log with Traceback (.debug.1 scratch build) (456.49 KB, text/plain) 2020-06-10 13:41 UTC, Radek Vykydal	no flags	Details
Patch that I think will fix the problem. (5.68 KB, patch) 2020-06-10 22:15 UTC, Ray Strode [halfline]	no flags	Details \| Diff
Patch that I think will fix the problem, now with 100% more patch (13.22 KB, patch) 2020-06-10 23:37 UTC, Ray Strode [halfline]	no flags	Details \| Diff
test case (659 bytes, text/plain) 2020-06-11 14:24 UTC, Ray Strode [halfline]	no flags	Details
Show Obsolete (1) View All

Links
System	ID	Private	Priority	Status	Summary	Last Updated
GNOME Gitlab	GNOME pygobject merge_requests 149	0	None	None	None	2020-06-12 14:03:55 UTC
Red Hat Product Errata	RHSA-2020:4451	0	None	None	None	2020-11-04 01:36:01 UTC

Description Jan Stodola 2020-06-05 17:14:06 UTC

Description of problem:
Building of qcow2 image for RHEL-8.3 Alpha compose failed on s390x with the following tracebacks in the log:

14:52:22,959 WARNING org.fedoraproject.Anaconda.Modules.Subscription:ERROR:anaconda.modules.common.task.task:Thread AnaTaskThread-StartRHSMTask-1 has failed: Traceback (most recent call last):
14:52:22,959 WARNING org.fedoraproject.Anaconda.Modules.Subscription:  File "/usr/lib64/python3.6/site-packages/pyanaconda/threading.py", line 280, in run
14:52:22,959 WARNING org.fedoraproject.Anaconda.Modules.Subscription:    threading.Thread.run(self)
14:52:22,959 WARNING org.fedoraproject.Anaconda.Modules.Subscription:  File "/usr/lib64/python3.6/threading.py", line 864, in run
14:52:22,959 WARNING org.fedoraproject.Anaconda.Modules.Subscription:    self._target(*self._args, **self._kwargs)
14:52:22,959 WARNING org.fedoraproject.Anaconda.Modules.Subscription:  File "/usr/lib64/python3.6/site-packages/pyanaconda/modules/common/task/task.py", line 97, in _task_run_callback
14:52:22,959 WARNING org.fedoraproject.Anaconda.Modules.Subscription:    self._set_result(self.run())
14:52:22,959 WARNING org.fedoraproject.Anaconda.Modules.Subscription:  File "/usr/lib64/python3.6/site-packages/pyanaconda/modules/subscription/initialization.py", line 74, in run
14:52:22,959 WARNING org.fedoraproject.Anaconda.Modules.Subscription:    rhsm_config_proxy.Set("logging.default_log_level", get_variant(Str, "DEBUG"), "")
14:52:22,959 WARNING org.fedoraproject.Anaconda.Modules.Subscription:  File "/usr/lib/python3.6/site-packages/dasbus/client/handler.py", line 419, in _call_method
14:52:22,959 WARNING org.fedoraproject.Anaconda.Modules.Subscription:    parameters = get_variant(in_type, parameters)
14:52:22,959 WARNING org.fedoraproject.Anaconda.Modules.Subscription:  File "/usr/lib/python3.6/site-packages/dasbus/typing.py", line 124, in get_variant
14:52:22,959 WARNING org.fedoraproject.Anaconda.Modules.Subscription:    return Variant(type_string, value)
14:52:22,959 WARNING org.fedoraproject.Anaconda.Modules.Subscription:  File "/usr/lib64/python3.6/site-packages/gi/overrides/GLib.py", line 172, in __new__
14:52:22,959 WARNING org.fedoraproject.Anaconda.Modules.Subscription:    v = creator._create(format_string, value)
14:52:22,959 WARNING org.fedoraproject.Anaconda.Modules.Subscription:  File "/usr/lib64/python3.6/site-packages/gi/overrides/GLib.py", line 150, in _create
14:52:22,959 WARNING org.fedoraproject.Anaconda.Modules.Subscription:    builder.add_value(self._create(dup, i))
14:52:22,959 WARNING org.fedoraproject.Anaconda.Modules.Subscription:TypeError: argument self: Expected GLib.VariantBuilder, but got gi.repository.GLib.VariantBuilder

and

15:03:26,986 WARNING org.fedoraproject.Anaconda.Modules.Subscription:ERROR:anaconda.modules.subscription.initialization:subscription: got no result from StartRHSMTask
15:03:27,039 WARNING org.fedoraproject.Anaconda.Modules.Subscription:WARNING:dasbus.server.handler:The call org.fedoraproject.Anaconda.Modules.InstallWithTasks has failed with an exception:
15:03:27,039 WARNING org.fedoraproject.Anaconda.Modules.Subscription:Traceback (most recent call last):
15:03:27,039 WARNING org.fedoraproject.Anaconda.Modules.Subscription:  File "/usr/lib/python3.6/site-packages/dasbus/server/handler.py", line 421, in _method_callback
15:03:27,039 WARNING org.fedoraproject.Anaconda.Modules.Subscription:    *unwrap_variant(parameters)
15:03:27,039 WARNING org.fedoraproject.Anaconda.Modules.Subscription:  File "/usr/lib/python3.6/site-packages/dasbus/server/handler.py", line 234, in _handle_call
15:03:27,039 WARNING org.fedoraproject.Anaconda.Modules.Subscription:    return handler(*parameters)
15:03:27,039 WARNING org.fedoraproject.Anaconda.Modules.Subscription:  File "/usr/lib64/python3.6/site-packages/pyanaconda/modules/common/base/base_interface.py", line 154, in InstallWithTasks
15:03:27,039 WARNING org.fedoraproject.Anaconda.Modules.Subscription:    self.implementation.install_with_tasks()
15:03:27,039 WARNING org.fedoraproject.Anaconda.Modules.Subscription:  File "/usr/lib64/python3.6/site-packages/pyanaconda/modules/subscription/subscription.py", line 524, in install_with_tasks
15:03:27,039 WARNING org.fedoraproject.Anaconda.Modules.Subscription:    rhsm_config_proxy=self.rhsm_observer.get_proxy(RHSM_CONFIG)
15:03:27,039 WARNING org.fedoraproject.Anaconda.Modules.Subscription:  File "/usr/lib64/python3.6/site-packages/pyanaconda/modules/subscription/rhsm_observer.py", line 79, in get_proxy
15:03:27,039 WARNING org.fedoraproject.Anaconda.Modules.Subscription:    raise DBusObserverError("The RHSM DBus API is not available.")
15:03:27,039 WARNING org.fedoraproject.Anaconda.Modules.Subscription:dasbus.client.observer.DBusObserverError: The RHSM DBus API is not available.

Version-Release number of selected component (if applicable):
anaconda-33.16.3.5-1.el8

How reproducible:
Unclear, happened during building of RHEL-8.3 Alpha compose

Steps to Reproduce:
1.


Actual results:
Anaconda traceback

Expected results:
Installation finishes successfully

Comment 7 Vendula Poncova 2020-06-08 09:54:06 UTC

It seems to be an issue in the Python bindings for GLib. Reassigning.

Comment 8 Radek Vykydal 2020-06-08 10:05:15 UTC

Created attachment 1696028 [details]
Log from anaconda kickstart test with the Traceback

We've seen similar problem lately (June 05, python3-gobject-3.28.3-1.el8.x86_64) in the same call from Anaconda in our nightly kickstart tests:

01:29:54,773 WARNING org.fedoraproject.Anaconda.Modules.Subscription:ERROR:anaconda.modules.common.task.task:Thread AnaTaskThread-StartRHSMTask-1 has failed: Traceback (most recent call last):
01:29:54,773 WARNING org.fedoraproject.Anaconda.Modules.Subscription:  File "/usr/lib64/python3.6/site-packages/pyanaconda/threading.py", line 280, in run
01:29:54,773 WARNING org.fedoraproject.Anaconda.Modules.Subscription:    threading.Thread.run(self)
01:29:54,773 WARNING org.fedoraproject.Anaconda.Modules.Subscription:  File "/usr/lib64/python3.6/threading.py", line 864, in run
01:29:54,773 WARNING org.fedoraproject.Anaconda.Modules.Subscription:    self._target(*self._args, **self._kwargs)
01:29:54,773 WARNING org.fedoraproject.Anaconda.Modules.Subscription:  File "/usr/lib64/python3.6/site-packages/pyanaconda/modules/common/task/task.py", line 97, in _task_run_callback
01:29:54,773 WARNING org.fedoraproject.Anaconda.Modules.Subscription:    self._set_result(self.run())
01:29:54,773 WARNING org.fedoraproject.Anaconda.Modules.Subscription:  File "/usr/lib64/python3.6/site-packages/pyanaconda/modules/subscription/initialization.py", line 74, in run
01:29:54,773 WARNING org.fedoraproject.Anaconda.Modules.Subscription:    rhsm_config_proxy.Set("logging.default_log_level", get_variant(Str, "DEBUG"), "")
01:29:54,773 WARNING org.fedoraproject.Anaconda.Modules.Subscription:  File "/usr/lib/python3.6/site-packages/dasbus/client/handler.py", line 419, in _call_method
01:29:54,773 WARNING org.fedoraproject.Anaconda.Modules.Subscription:    parameters = get_variant(in_type, parameters)
01:29:54,773 WARNING org.fedoraproject.Anaconda.Modules.Subscription:  File "/usr/lib/python3.6/site-packages/dasbus/typing.py", line 124, in get_variant
01:29:54,773 WARNING org.fedoraproject.Anaconda.Modules.Subscription:    return Variant(type_string, value)
01:29:54,773 WARNING org.fedoraproject.Anaconda.Modules.Subscription:  File "/usr/lib64/python3.6/site-packages/gi/overrides/GLib.py", line 172, in __new__
01:29:54,773 WARNING org.fedoraproject.Anaconda.Modules.Subscription:    v = creator._create(format_string, value)
01:29:54,773 WARNING org.fedoraproject.Anaconda.Modules.Subscription:  File "/usr/lib64/python3.6/site-packages/gi/overrides/GLib.py", line 121, in _create
01:29:54,773 WARNING org.fedoraproject.Anaconda.Modules.Subscription:    builder = GLib.VariantBuilder.new(gvtype)
01:29:54,773 WARNING org.fedoraproject.Anaconda.Modules.Subscription:AttributeError: type object 'VariantBuilder' has no attribute 'new'
01:29:54,779 WARNING org.fedoraproject.Anaconda.Modules.Subscription:INFO:anaconda.threading:Thread Done: AnaTaskThread-StartRHSMTask-1 (140054874896128)

The problem happened only for one of ~190 tests in one of 6 nightly runs.

Comment 9 Dan Horák 2020-06-08 12:16:30 UTC

FWIW pygobject master brach is passing all upstream tests on s390x - http://multiarch-ci.khw1.lab.eng.bos.redhat.com:8080/job/pygobject/label=s390x/

Comment 10 Dan Horák 2020-06-08 12:19:48 UTC

and https://gitlab.gnome.org/GNOME/pygobject/-/issues/247 is the last/only s390x (endianness) issue I recall for pygobject

Comment 11 Radek Vykydal 2020-06-08 12:40:41 UTC

For the record, the traceback in comment #8 is x86_64 nested virtualization in PSI OpenStack (but we don't know yet if the root cause is the same as in the image build fail).

Comment 12 Dan Horák 2020-06-08 12:51:10 UTC

I believe the original s390x is also nested virt situation, the builder is a VM and runs image installation in nested VM.

Comment 13 Matthias Clasen 2020-06-08 13:31:59 UTC

> 01:29:54,773 WARNING org.fedoraproject.Anaconda.Modules.Subscription:AttributeError: type object 'VariantBuilder' has no attribute 'new'

This looks like a problem with gobject-introspection to me, not like a problem with pygobject.

What does the xml for VariantBuilder look like ?

Comment 14 Matthias Clasen 2020-06-08 13:40:40 UTC

The xml I'm referring to is /usr/share/gir-1.0/GLib-2.0.gir

Comment 15 Tomas Popela 2020-06-08 13:46:10 UTC

Created attachment 1696118 [details]
/usr/share/gir-1.0/GLib-2.0.gir from gobject-introspection-devel-1.56.1-1.el8.s390x.rpm

I'm attaching /usr/share/gir-1.0/GLib-2.0.gir from gobject-introspection-devel-1.56.1-1.el8.s390x.rpm

Comment 16 Ray Strode [halfline] 2020-06-08 14:25:19 UTC

So I don't know if it's related, but looking at a different of the build logs for gobject-introspection between x86_64 and s390x I see this:

╎❯ diff -u ../../../build.log ../../../build.log.s390x | egrep -3 '^\+.*warning:'
   CC       girepository/cmph/libcmph_la-brz.lo
 make[2]: Leaving directory '/builddir/build/BUILD/gobject-introspection-1.56.1'
+girepository/cmph/brz.c: In function 'brz_pack':
+girepository/cmph/brz.c:813:18: warning: cast from pointer to integer of different size [-Wpointer-to-int-cast]
+    *g_is_ptr++ = (cmph_uint32)g_i;
+                  ^
+girepository/cmph/brz.c: In function 'brz_bmz8_search_packed':
+girepository/cmph/brz.c:934:11: warning: cast to pointer from integer of different size [-Wint-to-pointer-cast]
+  h1_ptr = (cmph_uint8 *) g_is_ptr[h0];
+           ^
+girepository/cmph/brz.c: In function 'brz_fch_search_packed':
+girepository/cmph/brz.c:998:11: warning: cast to pointer from integer of different size [-Wint-to-pointer-cast]
+  h1_ptr = (cmph_uint8 *) g_is_ptr[h0];
+           ^
 make[2]: Entering directory '/builddir/build/BUILD/gobject-introspection-1.56.1'

I hadn't heard of cmph before, but a quick google search shows it's library for producing perfect hashes:

https://sourceforge.net/p/cmph/git/ci/master/tree/

Looking the code around the warnings is a little bleak:

→       →       #if defined (__ia64) || defined (__x86_64__)•
→       →       →       *g_is_ptr++ = (cmph_uint64)g_i;•
→       →       #else•
→       →       →       *g_is_ptr++ = (cmph_uint32)g_i;•
→       →       #endif•

...

→       #if defined (__ia64) || defined (__x86_64__)•
→       →       g_is_ptr = (cmph_uint64 *)packed_mphf;•
→       #else•
→       →       g_is_ptr = packed_mphf;•
→       #endif•
→       •
→       h1_ptr = (cmph_uint8 *) g_is_ptr[h0];•

...
→       #if defined (__ia64) || defined (__x86_64__)•
→       →       g_is_ptr = (cmph_uint64 *)packed_mphf;•
→       #else•
→       →       g_is_ptr = packed_mphf;•
→       #endif•
→       •
→       h1_ptr = (cmph_uint8 *) g_is_ptr[h0];•
...


So, apparently it thinks the only 64-bit arches are intel ones!  Might be a red herring, of course.

Looking at the latest upstream version of cmph shows that hasn't really changed since it was bundled.

Comment 17 Ray Strode [halfline] 2020-06-08 14:43:41 UTC

I think the warnings a red herring. looking in the source I see:

gthash.c:  cmph_config_set_algo (config, CMPH_BDZ);

and the warnings are in brz.c, so I don't think the code is actually getting used.

Comment 18 Ray Strode [halfline] 2020-06-08 14:49:52 UTC

( I mean it's choosing the algorithm in bdz.c not the algorithm in brz.c )

Comment 19 Ray Strode [halfline] 2020-06-08 15:09:29 UTC

so comment 0 and comment 8 do seem related.  In comment 0 there's this code:

        # Since we discarded all leaf types, this must be a container•
        builder = GLib.VariantBuilder.new(gvtype)•
...
            for i in value:•
                dup = variant_type_from_string(remainer_format).dup_string()•
                builder.add_value(self._create(dup, i))•

leading to the quizical error: 

TypeError: argument self: Expected GLib.VariantBuilder, but got gi.repository.GLib.VariantBuilder

comment 8 is failing in the first line above where builder is assigned, but the error message is different:

AttributeError: type object 'VariantBuilder' has no attribute 'new'

I do see the word threading in the traceback. Does python3 do way with the global interpreter lock ? If so, I wonder if we're hitting some sort of concurrency issue.  That might explain why it's intermittently failing.

Comment 20 Matthias Clasen 2020-06-08 15:38:15 UTC

Might want to drop brz.c then

Comment 21 Ray Strode [halfline] 2020-06-08 16:43:42 UTC

Apparently, the GIL is still a thing in python3. So I don't think it's a threading issue.

The weird error message from comment 0 comes from here:

static gboolean
_pygi_marshal_from_py_interface_object(...)
...
{•
    if (PyObject_IsInstance (py_arg, iface_cache->py_type) ||•
            (PyObject_TypeCheck(py_arg, &PyGObject_Type) &&•
             g_type_is_a (G_OBJECT_TYPE (pygobject_get (py_arg)), iface_cache->g_type))) {
...
    } else {
        PyObject *module = PyObject_GetAttrString(py_arg, "__module__");•
•
        PyErr_Format (PyExc_TypeError, "argument %s: Expected %s, but got %s%s%s",•
                      arg_cache->arg_name ? arg_cache->arg_name : "self",•
                      ( (PyGIInterfaceCache *)arg_cache)->type_name,•
                      module ? PYGLIB_PyUnicode_AsString(module) : "",•
                      module ? "." : "",•
...
                      py_arg->ob_type->tp_name);•
    }•
}•

I don't know under what circumstances the interface cache is populated, but you could imagine g_type_is_a failing if the gobject associated with the python module got disposed inadvertently.

Comment 22 Ray Strode [halfline] 2020-06-08 17:30:54 UTC Comment hidden (obsolete, wrong_analysis)

So this error message:

AttributeError: type object 'VariantBuilder' has no attribute 'new'

is coming from here:

from ._gi import \•
    Repository, \•
...
repository = Repository.get_default()
...
class IntrospectionModule(object):•
...
    def __getattr__(self, name):•
        info = repository.find_by_name(self._namespace, name)•
        if not info:•
            raise AttributeError("%r object has no attribute %r" % (•
                                 self.__name__, name))•
•
The C half of this returns NULL (not info) here:

GIBaseInfo *•
g_irepository_find_by_name (GIRepository *repository,•
→       →       →           const gchar  *namespace,•
→       →       →           const gchar  *name)•
{•
...
  entry = g_typelib_get_dir_entry_by_name (typelib, name);•
  if (entry == NULL)•
    return NULL;•
...
}•

g_typelib_get_dir_entry_by_name calls g_typelib_get_dir_entry over and over again with different indices, and is just:

  return (DirEntry *)&typelib->data[header->directory + (index - 1) * header->entry_blob_size];•

typelib->data is the mmaped typelib file.  Maybe the typelib file is getting truncated?

Comment 23 Ray Strode [halfline] 2020-06-08 17:43:04 UTC

Radek, do you happen to have the full journal output?

Comment 24 Ray Strode [halfline] 2020-06-08 18:19:39 UTC

(In reply to Ray Strode [halfline] from comment #22)
> So this error message:
> 
> AttributeError: type object 'VariantBuilder' has no attribute 'new'
> 
> is coming from here:
...
>             raise AttributeError("%r object has no attribute %r" % (•
>                                  self.__name__, name))•

This was wrong.  The error messages are close, but subtly different. There would need to be a '%r' or so after the word object.

Doing a grep, it's probably coming from python directly:

╎❯ egrep " object '%.*has no attribute" /usr/* -R
Binary file /usr/lib64/libpython3.8.so matches
Binary file /usr/lib64/libpython3.8.so.1.0 matches

╎❯ strings /usr/lib64/libpython3.8.so |egrep "object '.*has no attribute"
type object '%.50s' has no attribute '%U'

╎❯

Comment 25 Ray Strode [halfline] 2020-06-08 20:28:46 UTC

If we look at GLib.VariantBuilder with python, we see it's implemented with a metaclass:

>>> type(GLib.VariantBuilder)
<class 'gi.types.StructMeta'>

implementation is here:

class StructMeta(type, MetaClassHelper):•
    def __init__(cls, name, bases, dict_):•
...
        cls._setup_fields()•
        cls._setup_methods()•
...

_setup_methods mentioned above is just:
    def _setup_methods(cls):•
        for method_info in cls.__info__.get_methods():•
            setattr(cls, method_info.__name__, method_info)•

get_methods() calls g_struct_info_get_method in a loop g_struct_info_get_n_methods() times, so 'new' should be getting setattr'd to the class here.

Indeed, running

>>> delattr(GLib.VariantBuilder, 'new')
>>> GLib.VariantBuilder.new(GLib.VariantType("(b)"))
...
AttributeError: type object 'VariantBuilder' has no attribute 'new'

Leads to the same error message, so perhaps the method is something getting deleted, or perhaps _setup_methods is never getting called.

One side note, next in the code after _setup_methods is this bit:

        for method_info in cls.__info__.get_methods():•
            if method_info.is_constructor() and \•
                    method_info.__name__ == 'new' and \•
                    (not method_info.get_arguments() or•
                     cls.__info__.get_size() == 0):•
                cls.__new__ = staticmethod(method_info)•
                # Boxed will raise an exception•
                # if arguments are given to __init__•
                cls.__init__ = nothing•
                break•

So interestingly enough, GLib.VariantBuilder.new() isn't even necessary! __new__ gets set up so, the line that was failing:

builder = GLib.VariantBuilder.new(gvtype)

could have been written

builder = GLib.VariantBuilder(gvtype)

If this version works reliably that would suggest possibly delattr is somehow getting called (versus setup_methods not getting called in the first place), I suppose.

Comment 26 Radek Vykydal 2020-06-09 06:27:29 UTC

(In reply to Ray Strode [halfline] from comment #23)
> Radek, do you happen to have the full journal output?

Unfortunately in this case of failure the journal is not collected from the system.  I could try to run some batches of kickstart tests with Anaconda updated so we are able to collect the journal but it will take some time and the chances we hit the condition are uncertain.

Comment 27 Radek Vykydal 2020-06-09 06:48:44 UTC

(In reply to Radek Vykydal from comment #26)
> (In reply to Ray Strode [halfline] from comment #23)
> > Radek, do you happen to have the full journal output?
> 
> Unfortunately in this case of failure the journal is not collected from the
> system.  I could try to run some batches of kickstart tests with Anaconda
> updated so we are able to collect the journal but it will take some time and
> the chances we hit the condition are uncertain.

We could update Anaconda so that it shuts down when we pass the get_variant place (so that the test that is not interesting is finished quickly and the update testing tooling so that the test is not stopped when hitting a traceback (and allows ssh access to the machine).

Comment 28 Radek Vykydal 2020-06-09 07:14:58 UTC

(In reply to Radek Vykydal from comment #27)
> (In reply to Radek Vykydal from comment #26)
> > (In reply to Ray Strode [halfline] from comment #23)
> > > Radek, do you happen to have the full journal output?
> > 
> > Unfortunately in this case of failure the journal is not collected from the
> > system.  I could try to run some batches of kickstart tests with Anaconda
> > updated so we are able to collect the journal but it will take some time and
> > the chances we hit the condition are uncertain.
> 
> We could update Anaconda so that it shuts down when we pass the get_variant
> place (so that the test that is not interesting is finished quickly and the
> update testing tooling so that the test is not stopped when hitting a
> traceback (and allows ssh access to the machine).

Or better, update just the kickstart tests tooling to filter the tests quickly so that we don't modify the condition by using updates image.

Comment 29 Jiri Konecny 2020-06-09 07:16:39 UTC

Clearing needinfo flag.


(In reply to Radek Vykydal from comment #27)
> (In reply to Radek Vykydal from comment #26)
> > (In reply to Ray Strode [halfline] from comment #23)
> > > Radek, do you happen to have the full journal output?
> > 
> > Unfortunately in this case of failure the journal is not collected from the
> > system.  I could try to run some batches of kickstart tests with Anaconda
> > updated so we are able to collect the journal but it will take some time and
> > the chances we hit the condition are uncertain.
> 
> We could update Anaconda so that it shuts down when we pass the get_variant
> place (so that the test that is not interesting is finished quickly and the
> update testing tooling so that the test is not stopped when hitting a
> traceback (and allows ssh access to the machine).

That is worth trying. I'm even thinking if we can't change kickstart tests to extend/disable timeout when this error is detected.

We will look on the possibilities.

Comment 30 Jan Stodola 2020-06-09 07:52:15 UTC

Clearing needinfo, GLib-2.0.gir was provided in comment 15.

Comment 31 Radek Vykydal 2020-06-09 08:38:43 UTC

I am now running batches of nightly rhel8 kickstart tests (~180 tests/installations) killing tests passing get_variant line. One batch is taking about 30 minutes. No luck with reproducing yet. I'll try to get more of our PSI OpenStack resources involved to run more batches.

Comment 32 Ray Strode [halfline] 2020-06-09 11:45:14 UTC

is there any way you could somehow run the tests with this scratch build?

Task info: https://brewweb.engineering.redhat.com/brew/taskinfo?taskID=29194774


it adds some debug spew that may help characterize the problem.

Comment 33 Radek Vykydal 2020-06-09 11:56:12 UTC

(In reply to Ray Strode [halfline] from comment #32)
> is there any way you could somehow run the tests with this scratch build?
> 
> Task info:
> https://brewweb.engineering.redhat.com/brew/taskinfo?taskID=29194774
> 
> 
> it adds some debug spew that may help characterize the problem.

We'd probably need to use updates image which can modify the condition, or maybe we could build iso with the scratch build.
Bad news is that so far I've run some 9 batches without any hit.

Comment 34 Radek Vykydal 2020-06-09 12:30:42 UTC

Created attachment 1696310 [details]
journal

So I've succeeded to reproduce the issue from comment #8 (AttributeError: type object 'VariantBuilder' has no attribute 'new').
Attaching the journal.
The VM does not have bridged networking so to access it you'd need
ssh to hypervisor (I'd provide access) and from hypervisor do
ssh root.122.55 to installer VM.

Comment 35 Radek Vykydal 2020-06-09 12:44:10 UTC

(In reply to Radek Vykydal from comment #34)
> Created attachment 1696310 [details]
> journal
> 
> So I've succeeded to reproduce the issue from comment #8 (AttributeError:
> type object 'VariantBuilder' has no attribute 'new').

Err, actually it is the case from the description: TypeError: argument self: Expected GLib.VariantBuilder, but got gi.repository.GLib.VariantBuilder

Comment 36 Radek Vykydal 2020-06-09 12:50:16 UTC

Created attachment 1696314 [details]
journal

Actually reproduced also the case from comment #8:
(AttributeError: type object 'VariantBuilder' has no attribute 'new').
Attaching the journal.
Access from the hypervisor would be ssh root.122.182

Comment 37 Radek Vykydal 2020-06-09 12:57:19 UTC

(In reply to Radek Vykydal from comment #34)
> Created attachment 1696310 [details]
> journal
> 
> So I've succeeded to reproduce the issue from comment #8 (AttributeError:
> type object 'VariantBuilder' has no attribute 'new').
> Attaching the journal.
> The VM does not have bridged networking so to access it you'd need
> ssh to hypervisor (I'd provide access) and from hypervisor do
> ssh root.122.55 to installer VM.

The access (password is fedora):
[rvykydal@prnot3 ~]$ ssh fedora.136.57
fedora.136.57's password: 
Last login: Tue Jun  9 12:55:51 2020 from 10.43.26.115
[fedora@host-f32-2ff2ef-0 ~]$ ssh root.122.55
Last login: Tue Jun  9 12:55:12 2020 from 192.168.122.1
[anaconda root@localhost ~]#

Comment 38 Radek Vykydal 2020-06-09 13:02:44 UTC

(In reply to Radek Vykydal from comment #36)
> Created attachment 1696314 [details]
> journal
> 
> Actually reproduced also the case from comment #8:
> (AttributeError: type object 'VariantBuilder' has no attribute 'new').
> Attaching the journal.
> Access from the hypervisor would be ssh root.122.182

The access (password is fedora):
[rvykydal@prnot3 ~]$ ssh fedora.138.28
fedora.138.28's password: 
Last login: Tue Jun  9 12:59:15 2020 from 10.43.26.115
[fedora@host-f32-2ff2ef-1 ~]$ ssh root.122.182
Last login: Tue Jun  9 13:01:22 2020 from 192.168.122.1
[anaconda root@localhost ~]#

Comment 39 Radek Vykydal 2020-06-09 13:06:35 UTC

I'll try to hit the issue with updates image containing the scratch build.

Comment 40 Ray Strode [halfline] 2020-06-09 14:59:00 UTC

So I had one theory that perhaps the typelib was inadvertently getting unmapped out from under pygobject.

I tried to simulate that, by running:

[anaconda root@localhost ~]# /usr/libexec/platform-python
Python 3.6.8 (default, Apr  3 2020, 16:09:51)
[GCC 8.3.1 20191121 (Red Hat 8.3.1-5)] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import gi
>>> from gi.repository import GLib
>>> builder = GLib.VariantBuilder.new(GLib.VariantType("(su)"))                                                                                                                                                                              

[anaconda root@localhost ~]# cat /proc/$(pidof platform-python3.6)/maps |grep typelib
7fd829d58000-7fd829d86000 r--p 00000000 fd:00 12761                      /usr/lib64/girepository-1.0/GLib-2.0.typelib                             

Then in gdb running:

(gdb) call (void) munmap(0x7fd829d58000,0x7fd829d86000-0x7fd829d58000)

and went back to python and ran:

>>> builder2 = GLib.VariantBuilder.new(GLib.VariantType("(su)"))


This lead to different symptoms, a crash:

Program received signal SIGSEGV, Segmentation fault.                                                                                              
0x00007fd82d55b10f in g_function_info_get_flags () from /lib64/libgirepository-1.0.so.1                                                           
(gdb) up                                                                                                        
#1  0x00007fd82d7aaa4a in _function_info_descr_get () from /usr/lib64/python3.6/site-packages/gi/_gi.cpython-36m-x86_64-linux-gnu.so                                                                                                         
(gdb)                                                                                                                                           
#2  0x00007fd83c363091 in type_getattro () from /lib64/libpython3.6m.so.1.0                                                                     
(gdb)                                                                                                                                           
#3  0x00007fd83c361d52 in PyObject_GetAttr () from /lib64/libpython3.6m.so.1.0                                  
(gdb)                                                                                                           
#4  0x00007fd83c3b6005 in _PyEval_EvalFrameDefault () from /lib64/libpython3.6m.so.1.0             

but note frame 2 is type_getattro, which is the same function where the error was

AttributeError: type object 'VariantBuilder' has no attribute 'new'

is raised. So maybe we're in the right ballpark. The crash is likely from here (though I lack debug symbols to be sure):

    /* Look for the attribute in the metatype */•
    meta_attribute = _PyType_Lookup(metatype, name);•
•
    if (meta_attribute != NULL) {•
        meta_get = Py_TYPE(meta_attribute)->tp_descr_get;•
which calls:

_function_info_descr_get (PyGICallableInfo *self, PyObject *obj, PyObject *type) {•
...
    flags = g_function_info_get_flags ( (GIFunctionInfo*) self->base.info);•

which dereferences:

g_function_info_get_flags (GIFunctionInfo *info)•
{•
...
  blob = (FunctionBlob *)&rinfo->typelib->data[rinfo->offset];•


after my gdb shenangins, data is unmapped, thus crash.

So does this help us with the "real failure"? Maybe, a little.
Looking at the code again:

    /* Look for the attribute in the metatype */•
    meta_attribute = _PyType_Lookup(metatype, name);•
•
    if (meta_attribute != NULL) {•
        meta_get = Py_TYPE(meta_attribute)->tp_descr_get;•
•
        if (meta_get != NULL && PyDescr_IsData(meta_attribute)) {•
            /* Data descriptors implement tp_descr_set to intercept•
             * writes. Assume the attribute is not overridden in•
             * type's tp_dict (and bases): call the descriptor now.•
             */•
            return meta_get(meta_attribute, (PyObject *)type,•
                            (PyObject *)metatype);•
        }•
        Py_INCREF(meta_attribute);•
    }•
•
We see if we get to tp_descr_get (as we apparently should in working conditions), we return from the function.  tp_descr_get can't return NULL in our case, so this suggests the real failure is causing

     meta_attribute = _PyType_Lookup(metatype, name); 

to return NULL.

Comment 41 Ray Strode [halfline] 2020-06-09 15:23:34 UTC

(because the error message is raised at the very bottom of that function, so it can't be returning early in our failure case.)

Comment 42 Ray Strode [halfline] 2020-06-09 19:18:48 UTC

so while i'm waiting on the latest debug output from QE, I decided to try to toy around in a local VM.

I didn't expect to be able to reproduce the problem (since it takes many iterations during automated testing to hit the bug), but I was hopeful I might be able to spot some sort of otherwise silent heap corrupting bug with valgrind.

So I booted into the installer, killed anaconda, installed valgrind, edited /usr/libexec/anaconda/start-module to run the module through valgrind, then started anaconda.

No luck:

[anaconda root@localhost tmp]# journalctl |grep 2239
Jun 09 18:58:26 localhost.localdomain org.fedoraproject.Anaconda.Modules.Subscription[2199]: ==2239== Memcheck, a memory error detector
Jun 09 18:58:26 localhost.localdomain org.fedoraproject.Anaconda.Modules.Subscription[2199]: ==2239== Copyright (C) 2002-2017, and GNU GPL'd, by Julian Seward et al.
Jun 09 18:58:26 localhost.localdomain org.fedoraproject.Anaconda.Modules.Subscription[2199]: ==2239== Using Valgrind-3.15.0 and LibVEX; rerun with -h for copyright info
Jun 09 18:58:26 localhost.localdomain org.fedoraproject.Anaconda.Modules.Subscription[2199]: ==2239== Command: /usr/libexec/platform-python -m pyanaconda.modules.subscription
Jun 09 18:58:26 localhost.localdomain org.fedoraproject.Anaconda.Modules.Subscription[2199]: ==2239==
Jun 09 19:01:58 ibm-p8-kvm-03-guest-02.virt.pnr.lab.eng.rdu2.redhat.com org.fedoraproject.Anaconda.Modules.Subscription[2199]: ==2239==
Jun 09 19:01:58 ibm-p8-kvm-03-guest-02.virt.pnr.lab.eng.rdu2.redhat.com org.fedoraproject.Anaconda.Modules.Subscription[2199]: ==2239== Process terminating with default action of signal 15 (SIGTERM)
Jun 09 19:01:58 ibm-p8-kvm-03-guest-02.virt.pnr.lab.eng.rdu2.redhat.com org.fedoraproject.Anaconda.Modules.Subscription[2199]: ==2239==    at 0x6300EA1: poll (in /usr/lib64/libc-2.28.so)
Jun 09 19:01:58 ibm-p8-kvm-03-guest-02.virt.pnr.lab.eng.rdu2.redhat.com org.fedoraproject.Anaconda.Modules.Subscription[2199]: ==2239==    by 0x1939B9B5: ??? (in /usr/lib64/libglib-2.0.so.0.5600.4)
ct_FastCallKeywords (in /usr/lib64/libpython3.6m.so.1.0)
Jun 09 19:01:58 ibm-p8-kvm-03-guest-02.virt.pnr.lab.eng.rdu2.redhat.com org.fedoraproject.Anaconda.Modules.Subscription[2199]: ==2239==    by 0x54C03A5: ??? (in /usr/lib64/libpython3.6m.so.1.0)
Jun 09 19:01:58 ibm-p8-kvm-03-guest-02.virt.pnr.lab.eng.rdu2.redhat.com org.fedoraproject.Anaconda.Modules.Subscription[2199]: ==2239==    by 0x54C0EA7: _PyEval_EvalFrameDefault (in /usr/lib64/libpython3.6m.so.1.0)
Jun 09 19:01:58 ibm-p8-kvm-03-guest-02.virt.pnr.lab.eng.rdu2.redhat.com org.fedoraproject.Anaconda.Modules.Subscription[2199]: ==2239==    by 0x541D5F5: ??? (in /usr/lib64/libpython3.6m.so.1.0)
Jun 09 19:01:58 ibm-p8-kvm-03-guest-02.virt.pnr.lab.eng.rdu2.redhat.com org.fedoraproject.Anaconda.Modules.Subscription[2199]: ==2239==
Jun 09 19:01:58 ibm-p8-kvm-03-guest-02.virt.pnr.lab.eng.rdu2.redhat.com org.fedoraproject.Anaconda.Modules.Subscription[2199]: ==2239== HEAP SUMMARY:
Jun 09 19:01:58 ibm-p8-kvm-03-guest-02.virt.pnr.lab.eng.rdu2.redhat.com org.fedoraproject.Anaconda.Modules.Subscription[2199]: ==2239==     in use at exit: 23,309,258 bytes in 208,281 blocks
Jun 09 19:01:58 ibm-p8-kvm-03-guest-02.virt.pnr.lab.eng.rdu2.redhat.com org.fedoraproject.Anaconda.Modules.Subscription[2199]: ==2239==   total heap usage: 657,309 allocs, 449,028 frees, 85,971,811 bytes allocated
Jun 09 19:01:58 ibm-p8-kvm-03-guest-02.virt.pnr.lab.eng.rdu2.redhat.com org.fedoraproject.Anaconda.Modules.Subscription[2199]: ==2239==
raproject.Anaconda.Modules.Subscription[2199]: ==2239==                       of which reachable via heuristic:
Jun 09 19:01:58 ibm-p8-kvm-03-guest-02.virt.pnr.lab.eng.rdu2.redhat.com org.fedoraproject.Anaconda.Modules.Subscription[2199]: ==2239==                         length64           : 792 bytes in 15 blocks
Jun 09 19:01:58 ibm-p8-kvm-03-guest-02.virt.pnr.lab.eng.rdu2.redhat.com org.fedoraproject.Anaconda.Modules.Subscription[2199]: ==2239==                         newarray           : 1,826 bytes in 31 blocks
Jun 09 19:01:58 ibm-p8-kvm-03-guest-02.virt.pnr.lab.eng.rdu2.redhat.com org.fedoraproject.Anaconda.Modules.Subscription[2199]: ==2239==         suppressed: 0 bytes in 0 blocks
Jun 09 19:01:58 ibm-p8-kvm-03-guest-02.virt.pnr.lab.eng.rdu2.redhat.com org.fedoraproject.Anaconda.Modules.Subscription[2199]: ==2239== Rerun with --leak-check=full to see details of leaked memory
Jun 09 19:01:58 ibm-p8-kvm-03-guest-02.virt.pnr.lab.eng.rdu2.redhat.com org.fedoraproject.Anaconda.Modules.Subscription[2199]: ==2239==
Jun 09 19:01:58 ibm-p8-kvm-03-guest-02.virt.pnr.lab.eng.rdu2.redhat.com org.fedoraproject.Anaconda.Modules.Subscription[2199]: ==2239== For lists of detected and suppressed errors, rerun with: -s
Jun 09 19:01:58 ibm-p8-kvm-03-guest-02.virt.pnr.lab.eng.rdu2.redhat.com org.fedoraproject.Anaconda.Modules.Subscription[2199]: ==2239== ERROR SUMMARY: 0 errors from 0 contexts (suppressed: 0 from 0)

Comment 43 Ray Strode [halfline] 2020-06-09 20:01:40 UTC

So it's worth pointing out that every time it's crashed, it's crashed while trying to resolve the paramters in a dbus RPC call.

In both comment 0 and comment 8 and comment 36 it's crashing on

rhsm_config_proxy.Set()

Set is a remote procedure call, not a native method.

In comment 34, it's crashing when trying to request a name on the bus:

self.proxy.RequestName

(also a remote procedure call).

Comment 44 Ray Strode [halfline] 2020-06-09 20:26:59 UTC

So, I've started to think about what sorts of things could make this non-deterministic, and one thought I had was "Maybe it's Garbage Collection kicking in?".

So I tried manually adding:

gc.collect()

throughout the problematic areas, but it didn't help in creating a reliable reproducer, unfortunately.

We could also try the opposite, do a debug build with gc.disable() and see if that makes the problem go away completely.
If it does have an effect, it might give us a hint on where the problem is.

Comment 45 Ray Strode [halfline] 2020-06-09 21:08:58 UTC

I want to revisit comment 21 since i've looked at it a bit closer now:

The failing if statement is:

    if (PyObject_IsInstance (py_arg, iface_cache->py_type) ||•
            (PyObject_TypeCheck(py_arg, &PyGObject_Type) &&•
             g_type_is_a (G_OBJECT_TYPE (pygobject_get (py_arg)), iface_cache->g_type))) {

The object at hand is not a gobject, so the second half is completely irrelevant.

That means PyObject_IsInstance (py_arg, iface_cache->py_type) is failing.  the builder instance is no longer a PyObject.

Comment 46 Ray Strode [halfline] 2020-06-09 21:20:32 UTC

i meant to say "of the type expected in the cache" in the last comment.  Maybe, iface_cache->py_type is getting overwritten with a new type?

Comment 47 Ray Strode [halfline] 2020-06-10 04:18:33 UTC

I've been doing a bit of code reading, so I'm just going to dump my findings here.

iface_cache->py_type is apparently set up in pygi_arg_interface_setup:

    iface_cache->py_type = pygi_type_import_by_gi_info ( (GIBaseInfo *) iface_info);•

which is roughly implemented as:

    module_name = g_strconcat ("gi.repository.", namespace_, NULL);• 
    py_module = PyImport_ImportModule (module_name);•                
    py_object = PyObject_GetAttrString (py_module, name);•           

so it's basically doing the equivalent of:

from gi.repository.GLib as GLib
py_type = getattr(GLib, "VariantBuilder")

the /chain for setting up py_type for the self argument of this failing builder.add_value method call is something like:

 pygi_arg_interface_setup ->
 pygi_arg_interface_new_from_info -> 
 pygi_arg_struct_new_from_info -> 
 _arg_cache_new_for_interface ->
 _function_with_instance_cache_generate_args_cache_real ->
 _callable_cache_init ->
 _function_with_instance_cache_init ->
 pygi_method_cache_new ->
 _wrap_g_callable_info_invoke

_wrap_g_callable_info_invoke is notable:

PyObject *•                                                          
_wrap_g_callable_info_invoke (PyGIBaseInfo *self, PyObject *py_args,•
                              PyObject *kwargs)•                     
{•                                                                   
...
    if (self->cache == NULL) {•                    
...
           function_cache = pygi_method_cache_new (self->info);• 
...
        self->cache = (PyGICallableCache *)function_cache;•
...
    }•
...
    return pygi_callable_info_invoke (self->info, py_args, kwargs, self->cache, NULL);
}•

So if the cache isn't already set up, it sets it up then it invokes the method which marshals the args and uses ffi.
The full chain for the marshalling of the self argument I think should be something like:

 pygi_callable_info_invoke ->
 pygi_function_cache_invoke ->
 _function_cache_invoke_real ->
 pygi_invoke_c_callable ->
 _invoke_marshal_in_args ->
 arg_type_class_from_py_marshal

Note our error message comes from _pygi_marshal_from_py_interface_object which is weird for this type of call.
It's an analog to arg_type_class_from_py_marshal but for a different type of argument!

arg_type_class_from_py_marshal is for boxed structs (like GVariantBuilder), _pygi_marshal_from_py_interface_object is for GObject arguments.

It's like it thinks the builder variable is a GObject instead of GBoxed. They way it decides is from the cache generated in
_wrap_g_callable_info_invoke. Part of that cache generation involves calling _arg_cache_new_for_interface which does this:

{•                                                                   
...
    info_type = g_base_info_get_type ( (GIBaseInfo *)iface_info);•   
...
    switch (info_type) {•                                            
...
        case GI_INFO_TYPE_OBJECT:•                                   
        case GI_INFO_TYPE_INTERFACE:•                                
            return pygi_arg_gobject_new_from_info (type_info,•       
                                                   arg_info,•        
                                                   transfer,•        
                                                   direction,•       
                                                   iface_info,•      
                                                   callable_cache);• 
        case GI_INFO_TYPE_BOXED:•                                    
...
            return pygi_arg_struct_new_from_info (type_info,•        
                                                  arg_info,•         
                                                  transfer,•         
                                                  direction,•        
                                                  iface_info);•      
...
}• 

So apparently info_type is the wrong value.  iface_info comes from one frame higher _function_with_instance_cache_generate_args_cache_real, which does:

    interface_info = g_base_info_get_container ((GIBaseInfo *) callable_info);

callable_info should be a GIFunctionInfo and it's container should be a GIStructInfo, but the container is apparently a GIObjectInfo or GIInterfaceInfo.

The GIFunctionInfo is supposed to come from the stuff i talked about in comment 25:


class StructMeta(type, MetaClassHelper):•
    def __init__(cls, name, bases, dict_):•
...
        cls._setup_fields()•
        cls._setup_methods()•
...

_setup_methods mentioned above is just:
    def _setup_methods(cls):•
        for method_info in cls.__info__.get_methods():•
            setattr(cls, method_info.__name__, method_info)•

So at some point the above code should have done setattr(GLib.VariantBuilder, "add_value", method_info), where method_info was a GIFunctionInfo who's container was GIStructInfo.

No answers yet, but at least I understand a little better how it's supposed to work.  I'll try to a build that disables Garbage Collection to see if that makes the problem go away.

Comment 48 Ray Strode [halfline] 2020-06-10 04:26:59 UTC

this build (assuming it succeeds) should disable garbage collection.  If garbage collection is implicated, this build will never hit the bug:

https://brewweb.engineering.redhat.com/brew/taskinfo?taskID=29247637

this build (assuming it succeeds) should make it garbage collect much more frequently.  If garbage collection is implicated, this build may help it hit the bug more often:

https://brewweb.engineering.redhat.com/brew/taskinfo?taskID=29247646

Comment 49 Radek Vykydal 2020-06-10 05:45:49 UTC

Created attachment 1696433 [details]
journal for scratch build https://brewweb.engineering.redhat.com/brew/taskinfo?taskID=29217469

Log from kstest run hitting a traceback using the iso containing this scratch build:
https://brewweb.engineering.redhat.com/brew/taskinfo?taskID=29217469
(updated version of build from comment #32)

Comment 50 Radek Vykydal 2020-06-10 06:44:09 UTC

Just to be clear about instrumenting of the kickstart tests to pick the reproducer fast, I am killing the test if it reaches:
https://github.com/rhinstaller/anaconda/blob/8bea2c61f155a2fd5e17f53b7a6dd5656951c8e6/pyanaconda/modules/subscription/initialization.py#L76
by looking for the message in the log.
The reproducer happens a line before:
https://github.com/rhinstaller/anaconda/blob/8bea2c61f155a2fd5e17f53b7a6dd5656951c8e6/pyanaconda/modules/subscription/initialization.py#L74
I can modify this to your needs.

Comment 51 Radek Vykydal 2020-06-10 06:48:09 UTC

(In reply to Ray Strode [halfline] from comment #48)
> this build (assuming it succeeds) should disable garbage collection.  If
> garbage collection is implicated, this build will never hit the bug:
> 
> https://brewweb.engineering.redhat.com/brew/taskinfo?taskID=29247637
> 
> this build (assuming it succeeds) should make it garbage collect much more
> frequently.  If garbage collection is implicated, this build may help it hit
> the bug more often:
> 
> https://brewweb.engineering.redhat.com/brew/taskinfo?taskID=29247646

Jirka, could you please create boot.isos for both? I'll run the test batch with them.

Comment 52 Radek Vykydal 2020-06-10 08:11:41 UTC

(In reply to Radek Vykydal from comment #49)
> Created attachment 1696433 [details]
> journal for scratch build
> https://brewweb.engineering.redhat.com/brew/taskinfo?taskID=29217469
> 
> Log from kstest run hitting a traceback using the iso containing this
> scratch build:
> https://brewweb.engineering.redhat.com/brew/taskinfo?taskID=29217469
> (updated version of build from comment #32)

Access to the vm (password is fedora):
[rvykydal@prnot3 ~]$ ssh fedora.136.57
fedora.136.57's password: 
Last login: Wed Jun 10 08:08:36 2020 from 10.43.26.115
[fedora@host-f32-2ff2ef-0 ~]$ ssh root.122.114
Last login: Wed Jun 10 08:10:11 2020 from 192.168.122.1
[anaconda root@localhost ~]#

Comment 53 Radek Vykydal 2020-06-10 08:52:29 UTC

Created attachment 1696466 [details]
another log for scratch build https://brewweb.engineering.redhat.com/brew/taskinfo?taskID=29217469 (.debug)

(In reply to Radek Vykydal from comment #49)
> Created attachment 1696433 [details]
> journal for scratch build
> https://brewweb.engineering.redhat.com/brew/taskinfo?taskID=29217469
> 
> Log from kstest run hitting a traceback using the iso containing this
> scratch build:
> https://brewweb.engineering.redhat.com/brew/taskinfo?taskID=29217469
> (updated version of build from comment #32)

Another instance of the reproducer for the same pygobject scratch build (.debug)

Comment 54 Radek Vykydal 2020-06-10 10:19:26 UTC

(In reply to Ray Strode [halfline] from comment #48)
> this build (assuming it succeeds) should disable garbage collection.  If
> garbage collection is implicated, this build will never hit the bug:
> 
> https://brewweb.engineering.redhat.com/brew/taskinfo?taskID=29247637
> 
> this build (assuming it succeeds) should make it garbage collect much more
> frequently.  If garbage collection is implicated, this build may help it hit
> the bug more often:
> 
> https://brewweb.engineering.redhat.com/brew/taskinfo?taskID=29247646

Thank you Jirka for building the isos.
Hunts for reproducer for both builds are running.

Comment 55 Radek Vykydal 2020-06-10 13:41:42 UTC

Created attachment 1696491 [details]
log with Traceback (.debug.1 scratch build)

(In reply to Ray Strode [halfline] from comment #48)
> this build (assuming it succeeds) should disable garbage collection.  If
> garbage collection is implicated, this build will never hit the bug:
> 
> https://brewweb.engineering.redhat.com/brew/taskinfo?taskID=29247637

Attaching log with Traceback hit after a few batches (normally we hit Traceback after ~7-10 batches)

Access to the VM:
[rvykydal@prnot3 ~]$ ssh fedora.137.248
fedora.137.248's password: 
Last login: Wed Jun 10 13:34:22 2020 from 10.43.26.115
[fedora@race-hunter1-a052d7-1 ~]$ ssh root.122.28
Last login: Wed Jun 10 13:36:00 2020 from 192.168.122.1
[anaconda root@localhost ~]#

Comment 56 Ray Strode [halfline] 2020-06-10 14:16:04 UTC

We do see the typelib is still mapped:

Jun 09 20:38:05 localhost org.fedoraproject.Anaconda.Modules.Subscription[1897]: 7f8cba7e4000-7f8cba812000 r--p 00000000 fd:00 141820                     /usr/lib64/girepository-1.0/GLib-2.0.typelib•

So that definitively rules out spurious munmap calls being the problem.

This is interesting:

Jun 09 20:38:05 localhost org.fedoraproject.Anaconda.Modules.Subscription[1897]: GLib.VariantBuilder: ['__class__', '__del__', '__delattr__', '__dict__', '__dir__', '__doc__', '__eq__', '__format__', '__ge__', '__getattribute__', '__gt__', '__gtype__', '__hash__', '__info__', '__init__', '__init_subclass__', '__le__', '__lt__', '__module__', '__ne__', '__new__', '__reduce__', '__reduce_ex__', '__repr__', '__setattr__', '__sizeof__', '__str__', '__subclasshook__', '__weakref__', '_free_on_dealloc', 'copy']

It's supposed to be:

['__class__',  '__del__', '__delattr__', '__dict__', '__dir__', '__doc__', '__eq__', '__format__', '__ge__', '__getattribute__', '__gt__', '__gtype__', '__hash__', '__info__', '__init__', '__init_subclass__', '__le__', '__lt__', '__module__', '__ne__', '__new__', '__reduce__', '__reduce_ex__', '__repr__', '__setattr__', '__sizeof__', '__str__', '__subclasshook__', '__weakref__', '_free_on_dealloc', 'add_value', 'close', 'copy', 'end', 'new', 'open', 'ref', 'unref']

So, 'add_value', 'close', ''end', 'new', 'open', 'ref', 'unref' are missing, these come from the typelib.
'copy' is present.  This comes from:

pygboxed.c:    { "copy", (PyCFunction) gboxed_copy, METH_NOARGS },

Indeed, if I repeat the same experiment from comment 25, but try to get rid of "copy" instead of "new" it fails:

>>> delattr(GLib.VariantBuilder, 'copy')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AttributeError: copy
>>>

So it does seem like perhaps all the attributes are getting deleted on the object, or aren't getting added in the first place.
Maybe we should be printing out GLib.VariantBuilder.__dict__ as well. I suspect it's pretty bare.

As an experiment, I decided changing the code to use GObjectMeta instead of StructMeta, to see if it would fail in the same way.
Unfortunately, it seems to still call _setup_methods so all the methods get property added.

Looking in StructMeta, I do see an early return:

class StructMeta(type, MetaClassHelper):•                        
     def __init__(cls, name, bases, dict_):•
...       
        # Avoid touching anything else than the base class.•     
        g_type = cls.__info__.get_g_type()•
        if g_type != TYPE_INVALID and g_type.pytype is not None:•
            return•                                              
...
        cls._setup_methods()• 

So perhaps we're hitting that.  Indeed, as an experiment, if I add:

if g_type.name == 'GVariantBuilder':
  return

before cls._setup_methods() the problem reproduces.  So maybe g_type.pytype is somehow inadvertently getting set?

Comment 57 Ray Strode [halfline] 2020-06-10 14:46:57 UTC

There's also this chunk of code that's interesting:

            # Check if there is already a Python wrapper that is not a parent class•
            # of the wrapper being created. If it is a parent, it is ok to clobber•
            # g_type.pytype with a new child class wrapper of the existing parent.•
            # Note that the return here never occurs under normal circumstances due•
            # to caching on the __dict__ itself.•
            if g_type != TYPE_NONE:•
                type_ = g_type.pytype•
                if type_ is not None and type_ not in bases:•
                    self.__dict__[name] = type_•
                    return type_•
...
            wrapper = metaclass(name, bases, dict_)


So the code is expecting, that if the pytype already set on the object is a parent, that the pytype will
get clobbered when running through the metaclass.  As mentioned in the previous comment that doesn't happen
though:

        # Avoid touching anything else than the base class.•     
        g_type = cls.__info__.get_g_type()•
        if g_type != TYPE_INVALID and g_type.pytype is not None:•
            return• 

So if pytype is set "Boxed" it's never going to get "VariantBuilder"'s methods and pytype.  I don't know why
pytype would be Boxed, though (or if it is)

Comment 58 Ray Strode [halfline] 2020-06-10 15:17:09 UTC

(In reply to Radek Vykydal from comment #55)
> (In reply to Ray Strode [halfline] from comment #48)
> > this build (assuming it succeeds) should disable garbage collection.  If
> > garbage collection is implicated, this build will never hit the bug:
> > 
> > https://brewweb.engineering.redhat.com/brew/taskinfo?taskID=29247637
> 
> Attaching log with Traceback hit after a few batches (normally we hit
> Traceback after ~7-10 batches)
Okay, so the GC is in the clear.

I think the next move should be to instrument the code to print when pytype is set.
That why we can see if it's the reason the methods aren't getting added to the object.

Comment 59 Radek Vykydal 2020-06-10 15:34:29 UTC

(In reply to Ray Strode [halfline] from comment #48)
 
> this build (assuming it succeeds) should make it garbage collect much more
> frequently.  If garbage collection is implicated, this build may help it hit
> the bug more often:
> 
> https://brewweb.engineering.redhat.com/brew/taskinfo?taskID=29247646

Interestingly, after 8 batches when we are usually close to hitting a reproducer, no reproducer here. I'll wait a bit more before shutting down this series.

Comment 60 Ray Strode [halfline] 2020-06-10 18:51:48 UTC

(In reply to Radek Vykydal from comment #59)
> (In reply to Ray Strode [halfline] from comment #48)
>  
> > this build (assuming it succeeds) should make it garbage collect much more
> > frequently.  If garbage collection is implicated, this build may help it hit
> > the bug more often:
> > 
> > https://brewweb.engineering.redhat.com/brew/taskinfo?taskID=29247646
> 
> Interestingly, after 8 batches when we are usually close to hitting a
> reproducer, no reproducer here. I'll wait a bit more before shutting down
> this series.

Maybe it's a timing thing, the GCs slow it down enough that it never fails?

I've put a new build here:

https://brewweb.engineering.redhat.com/brew/taskinfo?taskID=29279557

This one prints stacktrace any time someone tries to acces pytype on GVariantBuilder

Comment 61 Ray Strode [halfline] 2020-06-10 20:37:50 UTC

So at this point since we've ruled out the GC, and very likely ruled out heap corruption, the main contender, in my mind has again turned back to some sort concurrency race involving multiple threads.

Looking at the subscription module it does have two threads running:

class SubscriptionService(KickstartService):•
...
    def __init__(self):•
...
        self._rhsm_startup_task = StartRHSMTask()•
...
    def run(self):•
        """Initiate RHSM service startup before starting the main loop.•
•
        This way RHSM service can startup in parallel without blocking•
        startup of the Subscription module.
        """•
        self._rhsm_startup_task.start()•
        super().run()•

_rhsm_startup_task.start() starts the thread where the crash happens, and
super().run() is this:

    def run(self):•
        """Run the loop."""•
        log.debug("Publish the service.")•
        self.publish()•
        log.debug("Start the loop.")•
        self._loop.run()•

We can see the first of those debug messages in the log shortly before the crash.

Jun 10 12:41:58 localhost org.fedoraproject.Anaconda.Modules.Subscription[2042]: DEBUG:anaconda.modules.common.base.base:Publish the service.

self.publish() does two things:

DBus.publish_object(SUBSCRIPTION.object_path, SubscriptionInterface(self))•   
DBus.register_service(SUBSCRIPTION.service_name)•                             

The first shows up right away in the log:
Jun 10 12:41:58 localhost org.fedoraproject.Anaconda.Modules.Subscription[2042]: DEBUG:dasbus.connection:Publishing an object at /org/fedoraproject/Anaconda/Modules/Subscription.

But then we jump over to the other thread
Jun 10 12:41:58 localhost org.fedoraproject.Anaconda.Modules.Subscription[2042]: INFO:anaconda.core.dbus:Connecting to the Anaconda bus at unix:abstract=/tmp/dbus-rmvcQJefAZ,guid=ff26b875e7c66ca5840fc0565ee0d50d.
Jun 10 12:41:58 localhost org.fedoraproject.Anaconda.Modules.Subscription[2042]: DEBUG:anaconda.modules.subscription.initialization:subscription: creating /etc/yum.repos.d
Jun 10 12:41:58 localhost org.fedoraproject.Anaconda.Modules.Subscription[2042]: INFO:program:Running... systemctl start rhsm.service

and jump back, where we see DBus.register_service run:
Jun 10 12:41:59 localhost org.fedoraproject.Anaconda.Modules.Subscription[2042]: DEBUG:dasbus.connection:Registering a service name org.fedoraproject.Anaconda.Modules.Subscription.

and back to the second thread (the one that ultimately tanks):
Jun 10 12:41:59 localhost org.fedoraproject.Anaconda.Modules.Subscription[2042]: DEBUG:program:Return code: 0
Jun 10 12:41:59 localhost org.fedoraproject.Anaconda.Modules.Subscription[2042]: DEBUG:anaconda.modules.subscription.initialization:subscription: setting RHSM log level to DEBUG

Looks like we try to get two bus connects on the system bus at the same time. 

Jun 10 12:41:59 localhost org.fedoraproject.Anaconda.Modules.Subscription[2042]: INFO:dasbus.connection:Connecting to the system bus.
Jun 10 12:41:59 localhost org.fedoraproject.Anaconda.Modules.Subscription[2042]: INFO:dasbus.connection:Connecting to the system bus.

Both threads are trying to do bus calls at the same time.  One thread is doing:

        result = self.proxy.RequestName(...)

and the other thread is doing:

        rhsm_config_proxy.Set(...)

The DBus proxy object code implicitly gets a connection on the bus during this call.  The crash is shortly after the connection is established, it's here:

        if in_type is not None:•                           
            parameters = get_variant(in_type, parameters)• 

get_variant is just a wrapper around GLib.Variant provided by pygobject.

Since the other thread is exercising pretty much the same code path, it's likely at around the same place in the code.
So there is a strong possibility we're hitting a thread safety issue here.

Comment 62 Ray Strode [halfline] 2020-06-10 21:49:44 UTC

So I have a theory that's similar to comment 57, but tweaked to take into account interactions that may be caused by multiple threads.

Both threads are essentially accessing

GLib.VariantBuilder

at the same time. This means this code is running in two threads concurrently:

class IntrospectionModule(object):•
    def __getattr__(self, name):•
       info = repository.find_by_name(self._namespace, name)•
...
        elif isinstance(info, RegisteredTypeInfo):•
            g_type = info.get_g_type()•
...
            # Create a wrapper.
            elif isinstance(info, (StructInfo, UnionInfo)):•
                if g_type.is_a(TYPE_BOXED):•
                    bases = (Boxed,)•
...
                metaclass = StructMeta•
...
                type_ = g_type.pytype•
                if type_ is not None and type_ not in bases:•
                    self.__dict__[name] = type_•
                    return type_•
...
            wrapper = metaclass(name, bases, dict_)
...
                g_type.pytype = wrapper

As mentioned in comment 57, it's possible for the wrapper to be created incompletely without fully setting up the object, if
pytype is set.  If that happens, the incomplete wrapper will get used as the new type object from then on out. But there's a
check if pytype is already set, and an early return. Normally, if pytype is set to anything but "Boxed", it will just get 
returned and used, avoiding the problematic incomplete wrapper.

Of course, in the scenario where two threads are running currently, pytype won't be "Boxed", it will be either None or one
of the wrappers.  So, in theory the early return should fire, and there should be no issues. But what if pytype gets set 
from the other thread right after the check in this thread?  It would have the very symptoms we're seeing!

One niggle is, that I know, thanks to the global interpreter lock, that two threads don't exactly run simultaneously. 
They take turns, one thread waiting on the lock and the other executing code until it drops the lock.  The lock gets
dropped at certain preemption points (like when calling into glib code).  I don't know for sure that there is such a
preemption point after the 

                if type_ is not None and type_ not in bases:•

check, and before the         

        if g_type != TYPE_INVALID and g_type.pytype is not None:

check.  But there might be, and if there is then it seems to fit the bug pretty well.

Comment 63 Ray Strode [halfline] 2020-06-10 22:15:34 UTC

Created attachment 1696587 [details]
Patch that I think will fix the problem.

Assuming my theory is correct, I believe the above patch should fix the problem.

Comment 64 Ray Strode [halfline] 2020-06-10 22:17:12 UTC

I putting a scratch build here:

Task info: https://brewweb.engineering.redhat.com/brew/taskinfo?taskID=29285014

Moving to POST. If it checks out after a day of testing tomorrow I'll do an official build.  Radek, can you qa ack?

Comment 65 Ray Strode [halfline] 2020-06-10 23:37:59 UTC

Created attachment 1696598 [details]
Patch that I think will fix the problem, now with 100% more patch

attachment 1696587 [details] only had the spec file changes. It lacked the actual patch.

Comment 66 Radek Vykydal 2020-06-11 07:16:12 UTC

Thank you Ray, to me the fix seems very promising.

I am not sure I can formally give qa_ack being a devel, but I'll definitely test the build running it on as much resources I can get (in PSI OpenStack). Normally the condition is hit sooner than ~10 batches are run [1] (each 170 installations, taking ~35 mins to run). So about 1/1500 installations. I'll run several series of batches with patched pygobject and one or two series without the patch in parallel.

[1] we had only 2 exceptions:
- using updates image we needed 14 batches
- using gcextra build we didn't hit the condition even after 31 batches
but I think we had enough cases hitting the condition as expected that it is very likely that if we don't hit it with the patch it is because the cause is actually fixed.

Comment 67 Jiri Konecny 2020-06-11 08:32:20 UTC

(In reply to Ray Strode [halfline] from comment #64)
> I putting a scratch build here:
> 
> Task info:
> https://brewweb.engineering.redhat.com/brew/taskinfo?taskID=29285014
> 
> Moving to POST. If it checks out after a day of testing tomorrow I'll do an
> official build.  Radek, can you qa ack?

New ISO is created for this build:

http://10.43.136.2/trees/rhel-8-scratch/bug1844578_task29285014_boot.iso

Comment 68 Radek Vykydal 2020-06-11 08:44:31 UTC

Starting the tests.

Comment 69 Ray Strode [halfline] 2020-06-11 12:19:24 UTC

(In reply to Radek Vykydal from comment #66)
> I am not sure I can formally give qa_ack being a devel, but I'll definitely
> test the build running it on as much resources I can get (in PSI OpenStack).

ah, thanks for the work you put in getting a rare bug to reproduce many times over the last days.  I assumed only someone with QE chops would be able to pull that off...

tpelka, is this something you can qa ack?

Comment 70 Radek Vykydal 2020-06-11 12:27:29 UTC

Current results of series of batches run in parallel:

The issue was hit in both non-patched instances.
- original1: reproduced in 2nd batch
- original2: reproduced in 5th batch
With patched pygobject the issue has not been hit after
- patched1: 7 batches
- patched2: 5 batches
- patched3: 5 batches

Seems good so far.

Comment 71 Ray Strode [halfline] 2020-06-11 14:24:06 UTC

Created attachment 1696798 [details]
test case

So this test, I think simulates, the bug. If I run it locally within a minute or so I see:

╎❯ python3 threadtest.py
type object 'VariantBuilder' has no attribute 'new'
'VariantBuilder' object has no attribute 'unref'

If I use a version of pygobject3 with my patch, I don't see the problem after several minutes.

I think the delattr stuff i'm doing in the test isn't really kosher though and it's leading to a fairly fast memory leak, unfortunately.

Comment 72 Radek Vykydal 2020-06-11 15:02:32 UTC

(In reply to Radek Vykydal from comment #70)

> With patched pygobject the issue has not been hit after
> - patched1: 7 batches
> - patched2: 5 batches
> - patched3: 5 batches

Still not hit after 10, 8, 8 batches. I'll do one more update in the evening.
I'll also run regular complete run of kickstart tests (without instrumentation looking for the reproducer) with the iso.

Comment 73 Radek Vykydal 2020-06-11 18:34:09 UTC

(In reply to Radek Vykydal from comment #72)
> (In reply to Radek Vykydal from comment #70)
> 
> > With patched pygobject the issue has not been hit after
> > - patched1: 7 batches
> > - patched2: 5 batches
> > - patched3: 5 batches
> 
> Still not hit after 10, 8, 8 batches. I'll do one more update in the evening.
> I'll also run regular complete run of kickstart tests (without
> instrumentation looking for the reproducer) with the iso.

17,13,12 batches without hitting the problem. I am releasing the resources considering the test PASSED.

Comment 74 Ray Strode [halfline] 2020-06-11 19:51:58 UTC

great, i'll do a build now then.

Comment 75 Jiri Konecny 2020-06-11 20:31:01 UTC

Great work Ray with the debugging and Radek with testing! You guys did an amazing job here!

Comment 76 Radek Vykydal 2020-06-11 21:03:44 UTC

(In reply to Radek Vykydal from comment #72)

> I'll also run regular complete run of kickstart tests (without
> instrumentation looking for the reproducer) with the iso.

There are no regressions in complete kickstart tests run using the iso with the patched pygobject3.

Comment 79 Martin Kolman 2020-06-12 01:56:55 UTC

(In reply to Jiri Konecny from comment #75)
> Great work Ray with the debugging and Radek with testing! You guys did an
> amazing job here!

Indeed, totally amazing! :)

Comment 83 errata-xmlrpc 2020-11-04 01:35:45 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: GNOME security, bug fix, and enhancement update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2020:4451

Note You need to log in before you can comment on or make changes to this bug.