Description
Some segfaults arose in scikit-tree during the implementation of this PR: neurodata/treeple#249.
Some actionable items came to mind to improve what we have:
Documentation: the following routine builds `initial_roots` and needs a clear docstring and inline comments:
`scikit-learn/sklearn/tree/_tree.pyx`, lines 211 to 273 at `f360749`:

```cython
cpdef initialize_node_queue(
    self,
    Tree tree,
    object X,
    const float64_t[:, ::1] y,
    const float64_t[:] sample_weight=None,
    const unsigned char[::1] missing_values_in_feature_mask=None,
):
    """Initialize a list of roots"""
    X, y, sample_weight = self._check_input(X, y, sample_weight)

    # organize samples by decision paths
    paths = tree.decision_path(X)

    cdef intp_t PARENT
    cdef intp_t CHILD
    cdef intp_t i
    false_roots = {}
    X_copy = {}
    y_copy = {}
    for i in range(X.shape[0]):
        # collect depths from the node paths
        depth_i = paths[i].indices.shape[0] - 1
        PARENT = depth_i - 1
        CHILD = depth_i

        # find leaf node's & their parent node's IDs
        if PARENT < 0:
            parent_i = 0
        else:
            parent_i = paths[i].indices[PARENT]
        child_i = paths[i].indices[CHILD]
        left = 0
        if tree.children_left[parent_i] == child_i:
            left = 1  # leaf node is left child

        # organize samples by the leaf they fall into (false root)
        # leaf nodes are marked by parent node and
        # their relative position (left or right child)
        if (parent_i, left) in false_roots:
            false_roots[(parent_i, left)][0] += 1
            X_copy[(parent_i, left)].append(X[i])
            y_copy[(parent_i, left)].append(y[i])
        else:
            false_roots[(parent_i, left)] = [1, depth_i]
            X_copy[(parent_i, left)] = [X[i]]
            y_copy[(parent_i, left)] = [y[i]]

    X_list = []
    y_list = []

    # reorder the samples according to parent node IDs
    for key, value in reversed(sorted(X_copy.items())):
        X_list = X_list + value
        y_list = y_list + y_copy[key]
    cdef object X_new = np.array(X_list)
    cdef cnp.ndarray y_new = np.array(y_list)

    # initialize the splitter using sorted samples
    cdef Splitter splitter = self.splitter
    splitter.init(X_new, y_new, sample_weight, missing_values_in_feature_mask)

    # convert dict to numpy array and store value
    self.initial_roots = np.array(list(false_roots.items()))
```
Since this is part of the Cython codebase, it is a critical piece: segfaults there are time-consuming and difficult to chase down.
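To make the grouping concrete, here is a minimal pure-Python sketch of the same idea, written against the public scikit-learn API rather than the Cython internals; the `(parent, is_left)` key mirrors the `(parent_i, left)` key above, and everything else (the estimator, data, and variable names) is illustrative only:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=100, random_state=0)
clf = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)

# CSR indicator matrix: rows are samples, columns are node ids
paths = clf.decision_path(X)

groups = {}
for i in range(X.shape[0]):
    node_ids = paths[i].indices  # nodes on sample i's root-to-leaf path
    leaf = node_ids[-1]
    parent = node_ids[-2] if node_ids.shape[0] > 1 else 0
    is_left = int(clf.tree_.children_left[parent] == leaf)
    # group sample indices by the leaf ("false root") they fall into
    groups.setdefault((parent, is_left), []).append(i)

print({key: len(idx) for key, idx in groups.items()})
```

A docstring on `initialize_node_queue` spelling out exactly this invariant (the key layout, the `[count, depth]` values, and the sample reordering) would go a long way.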
Next, we probably want to include a clear description for developers of how the two branches here differ:
`scikit-learn/sklearn/tree/_tree.pyx`, lines 302 to 320 at `f360749`:

```cython
if initial_roots is None:
    # Recursive partition (without actual recursion)
    splitter.init(X, y, sample_weight, missing_values_in_feature_mask)

    if tree.max_depth <= 10:
        init_capacity = <intp_t> (2 ** (tree.max_depth + 1)) - 1
    else:
        init_capacity = 2047

    tree._resize(init_capacity)
    first = 1
else:
    # convert numpy array back to dict
    false_roots = {}
    for key_value_pair in initial_roots:
        false_roots[tuple(key_value_pair[0])] = key_value_pair[1]

    # reset the root array
    self.initial_roots = None
```
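For instance, the `else` branch simply undoes the array conversion done at the end of `initialize_node_queue`. A minimal sketch of that round-trip, assuming keys of the form `(parent_id, is_left)` and values `[sample_count, depth]` as in the snippets above (the concrete numbers are made up):

```python
import numpy as np

false_roots = {(3, 1): [5, 2], (4, 0): [7, 2]}

# stored at the end of initialize_node_queue:
initial_roots = np.array(list(false_roots.items()))  # int array, shape (n_roots, 2, 2)

# restored at the start of the next build, as in the else branch above:
restored = {tuple(kv[0]): list(kv[1]) for kv in initial_roots}
assert restored == false_roots
```

Note that the real code keeps `key_value_pair[1]` as a NumPy array slice rather than a list; that kind of subtle type change after the round-trip is exactly what a developer-facing comment should call out.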
Features, or documentation:
Part of sklearn handles monotonic constraints and `n_constant_features` tracking. At first glance it is not clear that these are actually carried over between calls. That is: if we call `fit` and then `partial_fit` on two subsets of the data, do the monotonic-constraint state and `n_constant_features` differ from what a single `fit` on the entire dataset would produce? If they differ, what does that imply?
In an ideal world the state is the same.
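A concrete way to pin this down would be an equivalence test. Below is a hedged sketch: `partial_fit` does not exist on upstream scikit-learn's `DecisionTreeClassifier`, so that call assumes the fork's API from the linked PR, and the final assertion encodes the ideal-world expectation rather than a verified property:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=200, n_features=5, random_state=0)
cst = np.zeros(X.shape[1], dtype=np.int8)
cst[0] = 1  # feature 0 constrained to a monotonically increasing effect

# one-shot fit on the entire dataset
full = DecisionTreeClassifier(monotonic_cst=cst, random_state=0).fit(X, y)

# fit on the first half, then continue on the second half
streamed = DecisionTreeClassifier(monotonic_cst=cst, random_state=0)
streamed.fit(X[:100], y[:100])
streamed.partial_fit(X[100:], y[100:])  # fork-only API, signature assumed

# if monotonic-constraint bookkeeping and n_constant_features tracking carry
# over correctly, the two trees should agree; any mismatch would at minimum
# need documenting
np.testing.assert_array_equal(full.tree_.feature, streamed.tree_.feature)
```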