Description
Some segfaults arose in scikit-tree during the implementation of this PR: neurodata/treeple#249.
Some actionable items came to mind to improve what we have:
Documentation: the following routine builds `initial_roots` and needs a clear docstring and inline comments:
`scikit-learn/sklearn/tree/_tree.pyx`, lines 211 to 273 at `f360749`:

```cython
cpdef initialize_node_queue(
    self,
    Tree tree,
    object X,
    const float64_t[:, ::1] y,
    const float64_t[:] sample_weight=None,
    const unsigned char[::1] missing_values_in_feature_mask=None,
):
    """Initialize a list of roots"""
    X, y, sample_weight = self._check_input(X, y, sample_weight)

    # organize samples by decision paths
    paths = tree.decision_path(X)

    cdef intp_t PARENT
    cdef intp_t CHILD
    cdef intp_t i
    false_roots = {}
    X_copy = {}
    y_copy = {}
    for i in range(X.shape[0]):
        # collect depths from the node paths
        depth_i = paths[i].indices.shape[0] - 1
        PARENT = depth_i - 1
        CHILD = depth_i

        # find leaf node's & their parent node's IDs
        if PARENT < 0:
            parent_i = 0
        else:
            parent_i = paths[i].indices[PARENT]
        child_i = paths[i].indices[CHILD]
        left = 0
        if tree.children_left[parent_i] == child_i:
            left = 1  # leaf node is left child

        # organize samples by the leaf they fall into (false root)
        # leaf nodes are marked by parent node and
        # their relative position (left or right child)
        if (parent_i, left) in false_roots:
            false_roots[(parent_i, left)][0] += 1
            X_copy[(parent_i, left)].append(X[i])
            y_copy[(parent_i, left)].append(y[i])
        else:
            false_roots[(parent_i, left)] = [1, depth_i]
            X_copy[(parent_i, left)] = [X[i]]
            y_copy[(parent_i, left)] = [y[i]]

    X_list = []
    y_list = []

    # reorder the samples according to parent node IDs
    for key, value in reversed(sorted(X_copy.items())):
        X_list = X_list + value
        y_list = y_list + y_copy[key]
    cdef object X_new = np.array(X_list)
    cdef cnp.ndarray y_new = np.array(y_list)

    # initialize the splitter using sorted samples
    cdef Splitter splitter = self.splitter
    splitter.init(X_new, y_new, sample_weight, missing_values_in_feature_mask)

    # convert dict to numpy array and store value
    self.initial_roots = np.array(list(false_roots.items()))
```
Since this is part of the Cython codebase, it is a critical piece: segfaults there are time-consuming and difficult to chase down.
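To make the grouping concrete, here is a minimal pure-Python sketch of the same idea, written against the public scikit-learn API rather than the Cython internals; the `(parent, is_left)` key mirrors the `(parent_i, left)` key above, and everything else (the estimator, data, and variable names) is illustrative only:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=100, random_state=0)
clf = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)

# CSR indicator matrix: rows are samples, columns are node ids
paths = clf.decision_path(X)

groups = {}
for i in range(X.shape[0]):
    node_ids = paths[i].indices  # nodes on sample i's root-to-leaf path
    leaf = node_ids[-1]
    parent = node_ids[-2] if node_ids.shape[0] > 1 else 0
    is_left = int(clf.tree_.children_left[parent] == leaf)
    # group sample indices by the leaf ("false root") they fall into
    groups.setdefault((parent, is_left), []).append(i)

print({key: len(idx) for key, idx in groups.items()})
```

A docstring on `initialize_node_queue` spelling out exactly this invariant (the key layout, the `[count, depth]` values, and the sample reordering) would go a long way.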
Next, we probably want to include a clear description for developers of how the two branches here differ:
`scikit-learn/sklearn/tree/_tree.pyx`, lines 302 to 320 at `f360749`:

```cython
if initial_roots is None:
    # Recursive partition (without actual recursion)
    splitter.init(X, y, sample_weight, missing_values_in_feature_mask)

    if tree.max_depth <= 10:
        init_capacity = <intp_t> (2 ** (tree.max_depth + 1)) - 1
    else:
        init_capacity = 2047

    tree._resize(init_capacity)
    first = 1
else:
    # convert numpy array back to dict
    false_roots = {}
    for key_value_pair in initial_roots:
        false_roots[tuple(key_value_pair[0])] = key_value_pair[1]

    # reset the root array
    self.initial_roots = None
```
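For instance, the `else` branch simply undoes the array conversion done at the end of `initialize_node_queue`. A minimal sketch of that round-trip, assuming keys of the form `(parent_id, is_left)` and values `[sample_count, depth]` as in the snippets above (the concrete numbers are made up):

```python
import numpy as np

false_roots = {(3, 1): [5, 2], (4, 0): [7, 2]}

# stored at the end of initialize_node_queue:
initial_roots = np.array(list(false_roots.items()))  # int array, shape (n_roots, 2, 2)

# restored at the start of the next build, as in the else branch above:
restored = {tuple(kv[0]): list(kv[1]) for kv in initial_roots}
assert restored == false_roots
```

Note that the real code keeps `key_value_pair[1]` as a NumPy array slice rather than a list; that kind of subtle type change after the round-trip is exactly what a developer-facing comment should call out.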
Features, or documentation:
Part of sklearn handles monotonic constraints and `n_constant_features` tracking. At first glance it is not clear that these are actually carried over between calls. That is: if we call `fit` and then `partial_fit` on two subsets of the data, do the monotonic-constraint state and `n_constant_features` differ from what a single `fit` on the entire dataset would produce? If they differ, what does that imply?
In an ideal world the state is the same.
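A concrete way to pin this down would be an equivalence test. Below is a hedged sketch: `partial_fit` does not exist on upstream scikit-learn's `DecisionTreeClassifier`, so that call assumes the fork's API from the linked PR, and the final assertion encodes the ideal-world expectation rather than a verified property:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=200, n_features=5, random_state=0)
cst = np.zeros(X.shape[1], dtype=np.int8)
cst[0] = 1  # feature 0 constrained to a monotonically increasing effect

# one-shot fit on the entire dataset
full = DecisionTreeClassifier(monotonic_cst=cst, random_state=0).fit(X, y)

# fit on the first half, then continue on the second half
streamed = DecisionTreeClassifier(monotonic_cst=cst, random_state=0)
streamed.fit(X[:100], y[:100])
streamed.partial_fit(X[100:], y[100:])  # fork-only API, signature assumed

# if monotonic-constraint bookkeeping and n_constant_features tracking carry
# over correctly, the two trees should agree; any mismatch would at minimum
# need documenting
np.testing.assert_array_equal(full.tree_.feature, streamed.tree_.feature)
```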