Tree Models

This section documents the decision tree model components of the nextmv-scikit-learn package.

Model

model

Defines sklearn.tree models interoperability.

FUNCTION DESCRIPTION
DecisionTreeRegressor

Creates a scikit-learn DecisionTreeRegressor from the provided options.

DecisionTreeRegressor

DecisionTreeRegressor(
    options: Options,
) -> DecisionTreeRegressor

Creates a sklearn.tree.DecisionTreeRegressor from the provided options.

You can import the DecisionTreeRegressor function directly from tree:

from nextmv_sklearn.tree import DecisionTreeRegressor

This function uses the options to create a scikit-learn DecisionTreeRegressor model with the specified parameters. It extracts parameter values from the Nextmv options object and passes them to the scikit-learn constructor.
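The extraction step described above can be sketched with plain dictionaries. This is a minimal illustration, not the package's code: `KNOWN_NAMES` and `filter_options` are hypothetical stand-ins for the names taken from `DECISION_TREE_REGRESSOR_PARAMETERS` and for the filtering done inside the function.

```python
# Hypothetical subset of the parameter names the regressor understands.
KNOWN_NAMES = {"criterion", "splitter", "max_depth", "min_samples_split"}


def filter_options(raw: dict) -> dict:
    """Keep only known parameter names whose value is not None."""
    return {k: v for k, v in raw.items() if k in KNOWN_NAMES and v is not None}


# Options as they might look after parsing: some unset, some unrelated.
raw_options = {
    "criterion": "squared_error",
    "splitter": "best",
    "max_depth": None,        # unset, so the constructor default applies
    "min_samples_split": 2,
    "duration": "30s",        # not a regressor parameter, dropped
}

kwargs = filter_options(raw_options)
print(kwargs)
```

Dropping `None` values means unset options fall back to the defaults of the scikit-learn constructor instead of overriding them.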

PARAMETER DESCRIPTION

options

Options for the DecisionTreeRegressor. Can contain the following parameters:

- criterion : str, default='squared_error'
    The function to measure the quality of a split.
- splitter : str, default='best'
    The strategy used to choose the split at each node.
- max_depth : int, optional
    The maximum depth of the tree.
- min_samples_split : int, optional
    The minimum number of samples required to split an internal node.
- min_samples_leaf : int, optional
    The minimum number of samples required to be at a leaf node.
- min_weight_fraction_leaf : float, optional
    The minimum weighted fraction of the sum total of weights required to be at a leaf node.
- max_features : int, optional
    The number of features to consider when looking for the best split.
- random_state : int, optional
    Controls the randomness of the estimator.
- max_leaf_nodes : int, optional
    Grow a tree with max_leaf_nodes in best-first fashion.
- min_impurity_decrease : float, optional
    A node will be split if this split induces a decrease of the impurity.
- ccp_alpha : float, optional
    Complexity parameter used for Minimal Cost-Complexity Pruning.

TYPE: Options

RETURNS DESCRIPTION
DecisionTreeRegressor

A sklearn.tree.DecisionTreeRegressor instance.

Examples:

>>> from nextmv_sklearn.tree import DecisionTreeRegressorOptions
>>> from nextmv_sklearn.tree import DecisionTreeRegressor
>>>
>>> # Create options for the regressor
>>> options = DecisionTreeRegressorOptions().to_nextmv()
>>>
>>> # Set specific parameters if needed
>>> options.set("max_depth", 5)
>>> options.set("min_samples_split", 2)
>>>
>>> # Create the regressor model
>>> regressor = DecisionTreeRegressor(options)
>>>
>>> # Use the regressor with scikit-learn API
>>> X = [[0, 0], [1, 1], [2, 2], [3, 3]]
>>> y = [0, 1, 2, 3]
>>> regressor.fit(X, y)
>>> regressor.predict([[4, 4]])
Source code in nextmv-scikit-learn/nextmv_sklearn/tree/model.py
def DecisionTreeRegressor(options: nextmv.Options) -> tree.DecisionTreeRegressor:
    """
    Creates a `sklearn.tree.DecisionTreeRegressor` from the provided options.

    You can import the `DecisionTreeRegressor` function directly from `tree`:

    ```python
    from nextmv_sklearn.tree import DecisionTreeRegressor
    ```

    This function uses the options to create a scikit-learn DecisionTreeRegressor
    model with the specified parameters. It extracts parameter values from the
    Nextmv options object and passes them to the scikit-learn constructor.

    Parameters
    ----------
    options : nextmv.Options
        Options for the DecisionTreeRegressor. Can contain the following parameters:
        - criterion : str, default='squared_error'
            The function to measure the quality of a split.
        - splitter : str, default='best'
            The strategy used to choose the split at each node.
        - max_depth : int, optional
            The maximum depth of the tree.
        - min_samples_split : int, optional
            The minimum number of samples required to split an internal node.
        - min_samples_leaf : int, optional
            The minimum number of samples required to be at a leaf node.
        - min_weight_fraction_leaf : float, optional
            The minimum weighted fraction of the sum total of weights required
            to be at a leaf node.
        - max_features : int, optional
            The number of features to consider when looking for the best split.
        - random_state : int, optional
            Controls the randomness of the estimator.
        - max_leaf_nodes : int, optional
            Grow a tree with max_leaf_nodes in best-first fashion.
        - min_impurity_decrease : float, optional
            A node will be split if this split induces a decrease of the impurity.
        - ccp_alpha : float, optional
            Complexity parameter used for Minimal Cost-Complexity Pruning.

    Returns
    -------
    DecisionTreeRegressor
        A sklearn.tree.DecisionTreeRegressor instance.

    Examples
    --------
    >>> from nextmv_sklearn.tree import DecisionTreeRegressorOptions
    >>> from nextmv_sklearn.tree import DecisionTreeRegressor
    >>>
    >>> # Create options for the regressor
    >>> options = DecisionTreeRegressorOptions().to_nextmv()
    >>>
    >>> # Set specific parameters if needed
    >>> options.set("max_depth", 5)
    >>> options.set("min_samples_split", 2)
    >>>
    >>> # Create the regressor model
    >>> regressor = DecisionTreeRegressor(options)
    >>>
    >>> # Use the regressor with scikit-learn API
    >>> X = [[0, 0], [1, 1], [2, 2], [3, 3]]
    >>> y = [0, 1, 2, 3]
    >>> regressor.fit(X, y)
    >>> regressor.predict([[4, 4]])
    """

    names = {p.name for p in DECISION_TREE_REGRESSOR_PARAMETERS}
    opt_dict = {k: v for k, v in options.to_dict().items() if k in names and v is not None}

    return tree.DecisionTreeRegressor(**opt_dict)

Options

options

Defines sklearn.tree options interoperability.

This module provides functionality for interfacing with scikit-learn's tree-based algorithms within the Nextmv framework. It includes classes for configuring decision tree regressors.

CLASS DESCRIPTION
DecisionTreeRegressorOptions

Options wrapper for scikit-learn's DecisionTreeRegressor.

DECISION_TREE_REGRESSOR_PARAMETERS module-attribute

DECISION_TREE_REGRESSOR_PARAMETERS = [
    Option(
        name="criterion",
        option_type=str,
        choices=[
            "squared_error",
            "friedman_mse",
            "absolute_error",
            "poisson",
        ],
        description="The function to measure the quality of a split.",
        default="squared_error",
    ),
    Option(
        name="splitter",
        option_type=str,
        choices=["best", "random"],
        description="The strategy used to choose the split at each node.",
        default="best",
    ),
    Option(
        name="max_depth",
        option_type=int,
        description="The maximum depth of the tree.",
    ),
    Option(
        name="min_samples_split",
        option_type=int,
        description="The minimum number of samples required to split an internal node.",
    ),
    Option(
        name="min_samples_leaf",
        option_type=int,
        description="The minimum number of samples required to be at a leaf node.",
    ),
    Option(
        name="min_weight_fraction_leaf",
        option_type=float,
        description="The minimum weighted fraction of the sum total of weights required to be at a leaf node.",
    ),
    Option(
        name="max_features",
        option_type=int,
        description="The number of features to consider when looking for the best split.",
    ),
    Option(
        name="random_state",
        option_type=int,
        description="Controls the randomness of the estimator.",
    ),
    Option(
        name="max_leaf_nodes",
        option_type=int,
        description="Grow a tree with max_leaf_nodes in best-first fashion.",
    ),
    Option(
        name="min_impurity_decrease",
        option_type=float,
        description="A node will be split if this split induces a decrease of the impurity.",
    ),
    Option(
        name="ccp_alpha",
        option_type=float,
        description="Complexity parameter used for Minimal Cost-Complexity Pruning.",
    ),
]

List of Nextmv Option objects for configuring a DecisionTreeRegressor.

Each option corresponds to a hyperparameter of the scikit-learn DecisionTreeRegressor, providing a consistent interface for setting up decision tree regression models within the Nextmv ecosystem.

You can import the DECISION_TREE_REGRESSOR_PARAMETERS directly from tree:

from nextmv_sklearn.tree import DECISION_TREE_REGRESSOR_PARAMETERS

DecisionTreeRegressorOptions

DecisionTreeRegressorOptions()

Options for the sklearn.tree.DecisionTreeRegressor.

You can import the DecisionTreeRegressorOptions class directly from tree:

from nextmv_sklearn.tree import DecisionTreeRegressorOptions

A wrapper class for scikit-learn's DecisionTreeRegressor hyperparameters, providing a consistent interface for configuring decision tree regression models within the Nextmv ecosystem.

ATTRIBUTE DESCRIPTION
params

List of Nextmv Option objects corresponding to DecisionTreeRegressor parameters.

TYPE: list

Examples:

>>> from nextmv_sklearn.tree import DecisionTreeRegressorOptions
>>> options = DecisionTreeRegressorOptions()
>>> nextmv_options = options.to_nextmv()

Initialize a DecisionTreeRegressorOptions instance.

Configures the default parameters for a decision tree regressor.

Source code in nextmv-scikit-learn/nextmv_sklearn/tree/options.py
def __init__(self):
    """Initialize a DecisionTreeRegressorOptions instance.

    Configures the default parameters for a decision tree regressor.
    """
    self.params = DECISION_TREE_REGRESSOR_PARAMETERS
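Because `__init__` assigns the module-level list directly, every instance refers to the same list object. The sketch below illustrates that Python behavior with hypothetical stand-in classes (`SharedOptions`, `CopiedOptions`) and a hypothetical `DEFAULT_PARAMS` list. It is not part of the package, and it only matters if the list is mutated in place.

```python
# Hypothetical module-level list standing in for the parameter definitions.
DEFAULT_PARAMS = ["criterion", "splitter"]


class SharedOptions:
    """Mirrors the pattern above: the instance aliases the module list."""

    def __init__(self):
        self.params = DEFAULT_PARAMS


class CopiedOptions:
    """Alternative: each instance gets its own shallow copy."""

    def __init__(self):
        self.params = list(DEFAULT_PARAMS)


a, b = SharedOptions(), SharedOptions()
a.params.append("max_depth")            # mutates the shared list
shared_leak = "max_depth" in b.params   # the change is visible through b

c, d = CopiedOptions(), CopiedOptions()
c.params.append("ccp_alpha")            # mutates only c's copy
copied_leak = "ccp_alpha" in d.params   # d is unaffected
```

Read-only use, such as passing the list to an options constructor, is unaffected by the aliasing.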

params instance-attribute

to_nextmv

to_nextmv() -> Options

Converts the options to a Nextmv options object.

Creates a Nextmv Options instance from the configured decision tree regressor parameters.

RETURNS DESCRIPTION
Options

A Nextmv options object containing all decision tree regressor parameters.

Examples:

>>> options = DecisionTreeRegressorOptions()
>>> nextmv_options = options.to_nextmv()
>>> # Access options as CLI arguments
>>> # python script.py --criterion squared_error --max_depth 5
Source code in nextmv-scikit-learn/nextmv_sklearn/tree/options.py
def to_nextmv(self) -> nextmv.Options:
    """Converts the options to a Nextmv options object.

    Creates a Nextmv Options instance from the configured decision tree
    regressor parameters.

    Returns
    -------
    nextmv.Options
        A Nextmv options object containing all decision tree regressor parameters.

    Examples
    --------
    >>> options = DecisionTreeRegressorOptions()
    >>> nextmv_options = options.to_nextmv()
    >>> # Access options as CLI arguments
    >>> # python script.py --criterion squared_error --max_depth 5
    """

    return nextmv.Options(*self.params)
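To show how option definitions of this shape can end up as command-line arguments like `--criterion squared_error --max_depth 5`, here is a rough sketch built on the standard library's `argparse`. `OptionSpec` and `build_parser` are hypothetical stand-ins. How `nextmv.Options` actually parses arguments is not shown in this page, so treat the sketch as an analogy rather than the library's mechanism.

```python
import argparse
from dataclasses import dataclass
from typing import Any, Optional


@dataclass
class OptionSpec:
    """Hypothetical stand-in for an option definition."""

    name: str
    option_type: type
    default: Any = None
    choices: Optional[list] = None


SPECS = [
    OptionSpec(
        "criterion",
        str,
        default="squared_error",
        choices=["squared_error", "friedman_mse", "absolute_error", "poisson"],
    ),
    OptionSpec("max_depth", int),
]


def build_parser(specs: list) -> argparse.ArgumentParser:
    """Register one --<name> flag per option definition."""
    parser = argparse.ArgumentParser()
    for spec in specs:
        parser.add_argument(
            f"--{spec.name}",
            type=spec.option_type,
            default=spec.default,
            choices=spec.choices,
        )
    return parser


# Equivalent to: python script.py --criterion squared_error --max_depth 5
args = build_parser(SPECS).parse_args(
    ["--criterion", "squared_error", "--max_depth", "5"]
)
print(args.criterion, args.max_depth)
```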

Solution

solution

Defines sklearn.tree solution interoperability.

This module provides classes for working with scikit-learn tree models.

CLASS DESCRIPTION
DecisionTreeRegressorSolution

Represents a scikit-learn DecisionTreeRegressor model, allowing conversion to and from a serializable format.

DecisionTreeRegressorSolution

Bases: BaseModel

Decision Tree Regressor scikit-learn model representation.

You can import the DecisionTreeRegressorSolution class directly from tree:

from nextmv_sklearn.tree import DecisionTreeRegressorSolution

This class provides functionality to convert between scikit-learn's DecisionTreeRegressor model and a serializable format. It enables saving and loading trained models through dictionaries or JSON.

PARAMETER DESCRIPTION

max_features_

The inferred value of max_features.

TYPE: int DEFAULT: 0

n_features_in_

Number of features seen during fit.

TYPE: int DEFAULT: 0

feature_names_in_

Names of features seen during fit.

TYPE: ndarray DEFAULT: None

n_outputs_

The number of outputs when fit is performed.

TYPE: int DEFAULT: 0

tree_

The underlying Tree object.

TYPE: Tree DEFAULT: None

Examples:

>>> from sklearn.datasets import load_diabetes
>>> from sklearn.tree import DecisionTreeRegressor
>>> from nextmv_sklearn.tree import DecisionTreeRegressorSolution
>>>
>>> # Train a scikit-learn model
>>> X, y = load_diabetes(return_X_y=True)
>>> model = DecisionTreeRegressor().fit(X, y)
>>>
>>> # Convert to solution object
>>> solution = DecisionTreeRegressorSolution.from_model(model)
>>>
>>> # Convert to dictionary for serialization
>>> model_dict = solution.to_dict()
>>>
>>> # Recreate solution from dictionary
>>> restored = DecisionTreeRegressorSolution.from_dict(model_dict["attributes"])
>>>
>>> # Convert back to scikit-learn model
>>> restored_model = restored.to_model()

feature_names_in_ class-attribute instance-attribute

feature_names_in_: ndarray = None

Names of features seen during fit. Defined only when X has feature names that are all strings.

from_dict classmethod

from_dict(
    data: dict[str, Any],
) -> DecisionTreeRegressorSolution

Creates a DecisionTreeRegressorSolution instance from a dictionary.

PARAMETER DESCRIPTION
data

Dictionary containing the model attributes.

TYPE: dict[str, Any]

RETURNS DESCRIPTION
DecisionTreeRegressorSolution

Instance of DecisionTreeRegressorSolution.

Examples:

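>>> # "tree_" must hold a base64-encoded pickle, as produced by to_dict().
>>> # It is omitted here, so the solution keeps the default tree_ of None.
>>> solution_dict = {
...     "max_features_": 10,
...     "n_features_in_": 10,
...     "n_outputs_": 1,
... }
>>> solution = DecisionTreeRegressorSolution.from_dict(solution_dict)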
Source code in nextmv-scikit-learn/nextmv_sklearn/tree/solution.py
@classmethod
def from_dict(cls, data: dict[str, Any]) -> "DecisionTreeRegressorSolution":
    """
    Creates a DecisionTreeRegressorSolution instance from a dictionary.

    Parameters
    ----------
    data : dict[str, Any]
        Dictionary containing the model attributes.

    Returns
    -------
    DecisionTreeRegressorSolution
        Instance of DecisionTreeRegressorSolution.

    Examples
    --------
    >>> solution_dict = {
    ...     "max_features_": 10,
    ...     "n_features_in_": 10,
    ...     "n_outputs_": 1,
    ...     "tree_": "base64encodedtreedata"
    ... }
    >>> solution = DecisionTreeRegressorSolution.from_dict(solution_dict)
    """

    if "tree_" in data:
        data["tree_"] = pickle.loads(base64.b64decode(data["tree_"]))

    for key, value in cls.__annotations__.items():
        if key in data and value is ndarray:
            data[key] = np.array(data[key])

    return cls(**data)

from_model classmethod

from_model(
    model: DecisionTreeRegressor,
) -> DecisionTreeRegressorSolution

Creates a DecisionTreeRegressorSolution instance from a scikit-learn DecisionTreeRegressor model.

PARAMETER DESCRIPTION
model

scikit-learn DecisionTreeRegressor model.

TYPE: DecisionTreeRegressor

RETURNS DESCRIPTION
DecisionTreeRegressorSolution

Instance of DecisionTreeRegressorSolution.

Examples:

>>> from sklearn.datasets import load_diabetes
>>> from sklearn.tree import DecisionTreeRegressor
>>> X, y = load_diabetes(return_X_y=True)
>>> model = DecisionTreeRegressor().fit(X, y)
>>> solution = DecisionTreeRegressorSolution.from_model(model)
Source code in nextmv-scikit-learn/nextmv_sklearn/tree/solution.py
@classmethod
def from_model(cls, model: tree.DecisionTreeRegressor) -> "DecisionTreeRegressorSolution":
    """
    Creates a DecisionTreeRegressorSolution instance from a scikit-learn
    DecisionTreeRegressor model.

    Parameters
    ----------
    model : tree.DecisionTreeRegressor
        scikit-learn DecisionTreeRegressor model.

    Returns
    -------
    DecisionTreeRegressorSolution
        Instance of DecisionTreeRegressorSolution.

    Examples
    --------
    >>> from sklearn.datasets import load_diabetes
    >>> from sklearn.tree import DecisionTreeRegressor
    >>> X, y = load_diabetes(return_X_y=True)
    >>> model = DecisionTreeRegressor().fit(X, y)
    >>> solution = DecisionTreeRegressorSolution.from_model(model)
    """

    data = {}
    for key in cls.__annotations__:
        try:
            data[key] = getattr(model, key)
        except AttributeError:
            pass

    return cls(**data)
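The attribute-copying loop above skips anything the fitted model does not define, which matters for attributes such as `feature_names_in_` that exist only in some cases. A small standard-library sketch of the same pattern follows. `FittedStub`, `FIELDS`, and `collect` are hypothetical names used for illustration.

```python
# Hypothetical field names mirroring the solution's annotated attributes.
FIELDS = ("max_features_", "n_features_in_", "feature_names_in_", "n_outputs_")


class FittedStub:
    """Stand-in for a fitted estimator that lacks feature_names_in_."""

    max_features_ = 10
    n_features_in_ = 10
    n_outputs_ = 1


def collect(obj, names) -> dict:
    """Copy the attributes that exist on obj and silently skip the rest."""
    data = {}
    for name in names:
        try:
            data[name] = getattr(obj, name)
        except AttributeError:
            continue
    return data


collected = collect(FittedStub(), FIELDS)
print(collected)
```

Fields that are skipped this way keep the default declared on the solution class.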

max_features_ class-attribute instance-attribute

max_features_: int = 0

The inferred value of max_features.

model_config class-attribute instance-attribute

model_config = ConfigDict(arbitrary_types_allowed=True)

n_features_in_ class-attribute instance-attribute

n_features_in_: int = 0

Number of features seen during fit.

n_outputs_ class-attribute instance-attribute

n_outputs_: int = 0

The number of outputs when fit is performed.

to_dict

to_dict()

Convert a data model instance to a dict with associated class info.

RETURNS DESCRIPTION
dict

Dictionary with class information and model attributes. The dictionary has two main keys:

- 'class': Contains module and class name information
- 'attributes': Contains the serialized model attributes

Examples:

>>> solution = DecisionTreeRegressorSolution(max_features_=10)
>>> solution_dict = solution.to_dict()
>>> print(solution_dict['class']['name'])
'DecisionTreeRegressorSolution'
Source code in nextmv-scikit-learn/nextmv_sklearn/tree/solution.py
def to_dict(self):
    """
    Convert a data model instance to a dict with associated class info.

    Returns
    -------
    dict
        Dictionary with class information and model attributes.
        The dictionary has two main keys:
        - 'class': Contains module and class name information
        - 'attributes': Contains the serialized model attributes

    Examples
    --------
    >>> solution = DecisionTreeRegressorSolution(max_features_=10)
    >>> solution_dict = solution.to_dict()
    >>> print(solution_dict['class']['name'])
    'DecisionTreeRegressorSolution'
    """

    t = type(self)
    return {
        "class": {
            "module": t.__module__,
            "name": t.__name__,
        },
        "attributes": self.model_dump(mode="json", exclude_none=True, by_alias=True),
    }
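One plausible use for the `class` entry is to find the right class again when loading a stored dictionary. The sketch below shows that idea with the standard library only. `describe` and `resolve_class` are hypothetical helpers, `Fraction` is just a convenient stand-in class, and this page does not show whether the package resolves classes this way.

```python
import importlib
from fractions import Fraction


def describe(obj, attributes: dict) -> dict:
    """Wrap attributes with the module and class name of obj."""
    cls = type(obj)
    return {
        "class": {"module": cls.__module__, "name": cls.__name__},
        "attributes": attributes,
    }


def resolve_class(class_info: dict) -> type:
    """Import the named module and look up the named class in it."""
    module = importlib.import_module(class_info["module"])
    return getattr(module, class_info["name"])


value = Fraction(3, 4)
document = describe(
    value,
    {"numerator": value.numerator, "denominator": value.denominator},
)

# Rebuild an equal object from the stored class info and attributes.
restored = resolve_class(document["class"])(**document["attributes"])
print(document["class"], restored)
```

Importing a module named in a stored document should only be done for documents from a trusted source.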

to_model

to_model() -> DecisionTreeRegressor

Transforms the DecisionTreeRegressorSolution instance into a scikit-learn DecisionTreeRegressor model.

RETURNS DESCRIPTION
DecisionTreeRegressor

scikit-learn DecisionTreeRegressor model.

Examples:

>>> from sklearn import tree
>>> from nextmv_sklearn.tree import DecisionTreeRegressorSolution
>>>
>>> solution = DecisionTreeRegressorSolution(max_features_=10, n_features_in_=10)
>>> model = solution.to_model()
>>> isinstance(model, tree.DecisionTreeRegressor)
True
Source code in nextmv-scikit-learn/nextmv_sklearn/tree/solution.py
def to_model(self) -> tree.DecisionTreeRegressor:
    """
    Transforms the DecisionTreeRegressorSolution instance into a scikit-learn
    DecisionTreeRegressor model.

    Returns
    -------
    tree.DecisionTreeRegressor
        scikit-learn DecisionTreeRegressor model.

    Examples
    --------
    >>> solution = DecisionTreeRegressorSolution(max_features_=10, n_features_in_=10)
    >>> model = solution.to_model()
    >>> isinstance(model, tree.DecisionTreeRegressor)
    True
    """
    m = tree.DecisionTreeRegressor()
    for key in self.model_fields:
        setattr(m, key, self.__dict__[key])

    return m

tree_ class-attribute instance-attribute

tree_: Tree = None

The underlying Tree object.

Tree module-attribute

Tree = Annotated[
    Tree,
    BeforeValidator(lambda x: x),
    PlainSerializer(lambda x: b64encode(dumps(x))),
]

Type annotation for handling scikit-learn Tree objects.

This type is annotated with Pydantic validators and serializers to handle the conversion between scikit-learn Tree objects and base64-encoded strings for JSON serialization.
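The encoding round trip can be sketched with the standard library alone. The example pickles a small stand-in object instead of a real scikit-learn Tree (the `payload` dictionary is hypothetical), encodes it as base64 text so it fits in JSON, and then reverses the steps.

```python
import base64
import json
import pickle

# Hypothetical stand-in for a fitted Tree object.
payload = {"node_count": 3, "thresholds": [0.5, 1.5]}

# Serialize: pickle to bytes, then base64 so the bytes fit in a JSON string.
encoded = base64.b64encode(pickle.dumps(payload)).decode("ascii")
document = json.dumps({"tree_": encoded})

# Deserialize: read the JSON string, decode base64, then unpickle.
loaded = json.loads(document)
restored = pickle.loads(base64.b64decode(loaded["tree_"]))

print(restored == payload)
```

Unpickling can execute arbitrary code, so a stored `tree_` value should only be loaded when it comes from a trusted source.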

Statistics

statistics

Scikit-learn tree module statistics interoperability for Nextmv.

This module provides functionality to integrate scikit-learn tree-based models with Nextmv statistics tracking.

FUNCTION DESCRIPTION
DecisionTreeRegressorStatistics

Convert a DecisionTreeRegressor model to Nextmv statistics format.

DecisionTreeRegressorStatistics

DecisionTreeRegressorStatistics(
    model: DecisionTreeRegressor,
    X: Iterable,
    y: Iterable,
    sample_weight: float = None,
    run_duration_start: Optional[float] = None,
) -> Statistics

Create a Nextmv statistics object from a scikit-learn DecisionTreeRegressor model.

You can import the DecisionTreeRegressorStatistics function directly from tree:

from nextmv_sklearn.tree import DecisionTreeRegressorStatistics

Converts a trained scikit-learn DecisionTreeRegressor model into Nextmv statistics format. The statistics include model depth, feature importances, number of leaves, and model score. Additional custom metrics can be added by the user after this function returns. The optional run_duration_start parameter can be used to track the total runtime of the modeling process.
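The run duration is the difference between the current `time.time()` and the supplied start timestamp, and it is left unset when no start is given. A minimal sketch of that calculation, where `elapsed_since` is a hypothetical helper name:

```python
import time
from typing import Optional


def elapsed_since(start: Optional[float]) -> Optional[float]:
    """Return seconds elapsed since start, or None when no start is given."""
    if start is None:
        return None
    return time.time() - start


start_time = time.time()   # take this before loading data and fitting
time.sleep(0.05)           # placeholder for the actual modeling work
duration = elapsed_since(start_time)
print(duration)
```

Since the end of the interval is read from `time.time()`, the start timestamp should come from the same clock and not from `time.perf_counter()` or `time.monotonic()`.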

PARAMETER DESCRIPTION

model

The trained scikit-learn DecisionTreeRegressor model.

TYPE: DecisionTreeRegressor

X

The input features used for scoring the model.

TYPE: Iterable

y

The target values used for scoring the model.

TYPE: Iterable

sample_weight

The sample weights used for scoring, by default None.

TYPE: float DEFAULT: None

run_duration_start

The timestamp when the model run started, typically from time.time(), by default None.

TYPE: float DEFAULT: None

RETURNS DESCRIPTION
Statistics

A Nextmv statistics object containing model performance metrics.
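The `score` entry comes from `model.score(X, y, sample_weight)`, which for scikit-learn regressors is the coefficient of determination R². The unweighted form of that metric can be written out by hand as below. `r2_score` here is a local illustrative function rather than an import, and sample weights are left out.

```python
def r2_score(y_true, y_pred):
    """Unweighted coefficient of determination: 1 - SS_res / SS_tot."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


y = [0, 1, 2, 3]

perfect = r2_score(y, [0, 1, 2, 3])           # every prediction is exact
baseline = r2_score(y, [1.5, 1.5, 1.5, 1.5])  # always predicting the mean
print(perfect, baseline)
```

A score of 1.0 means every prediction is exact, 0.0 matches always predicting the mean of `y`, and negative values indicate a fit worse than that baseline.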

Examples:

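>>> import time
>>>
>>> from sklearn.datasets import load_diabetes
>>> from sklearn.model_selection import train_test_split
>>> from sklearn.tree import DecisionTreeRegressor
>>> from nextmv_sklearn.tree import DecisionTreeRegressorStatistics
>>>
>>> # Record start time
>>> start_time = time.time()
>>>
>>> # Load data and hold out a test set for scoring
>>> X, y = load_diabetes(return_X_y=True)
>>> X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
>>>
>>> # Train model
>>> model = DecisionTreeRegressor(max_depth=5).fit(X_train, y_train)
>>>
>>> # Create statistics
>>> stats = DecisionTreeRegressorStatistics(
...     model, X_test, y_test, run_duration_start=start_time
... )
>>>
>>> # Add additional metrics
>>> stats.result.custom["my_custom_metric"] = 42.0  # placeholder value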
Source code in nextmv-scikit-learn/nextmv_sklearn/tree/statistics.py
def DecisionTreeRegressorStatistics(
    model: tree.DecisionTreeRegressor,
    X: Iterable,
    y: Iterable,
    sample_weight: float = None,
    run_duration_start: Optional[float] = None,
) -> nextmv.Statistics:
    """Create a Nextmv statistics object from a scikit-learn DecisionTreeRegressor model.

    You can import the `DecisionTreeRegressorStatistics` function directly from `tree`:

    ```python
    from nextmv_sklearn.tree import DecisionTreeRegressorStatistics
    ```

    Converts a trained scikit-learn DecisionTreeRegressor model into Nextmv statistics
    format. The statistics include model depth, feature importances, number of leaves,
    and model score. Additional custom metrics can be added by the user after this
    function returns. The optional `run_duration_start` parameter can be used to track
    the total runtime of the modeling process.

    Parameters
    ----------
    model : tree.DecisionTreeRegressor
        The trained scikit-learn DecisionTreeRegressor model.
    X : Iterable
        The input features used for scoring the model.
    y : Iterable
        The target values used for scoring the model.
    sample_weight : float, optional
        The sample weights used for scoring, by default None.
    run_duration_start : float, optional
        The timestamp when the model run started, typically from time.time(),
        by default None.

    Returns
    -------
    nextmv.Statistics
        A Nextmv statistics object containing model performance metrics.

    Examples
    --------
    >>> from sklearn.tree import DecisionTreeRegressor
    >>> from nextmv_sklearn.tree import DecisionTreeRegressorStatistics
    >>> import time
    >>>
    >>> # Record start time
    >>> start_time = time.time()
    >>>
    >>> # Train model
    >>> model = DecisionTreeRegressor(max_depth=5)
    >>> model.fit(X_train, y_train)
    >>>
    >>> # Create statistics
    >>> stats = DecisionTreeRegressorStatistics(
    ...     model, X_test, y_test, run_duration_start=start_time
    ... )
    >>>
    >>> # Add additional metrics
    >>> stats.result.custom["my_custom_metric"] = custom_value
    """

    run = nextmv.RunStatistics()
    if run_duration_start is not None:
        run.duration = time.time() - run_duration_start

    statistics = nextmv.Statistics(
        run=run,
        result=nextmv.ResultStatistics(
            custom={
                "depth": model.get_depth(),
                "feature_importances_": model.feature_importances_.tolist(),
                "n_leaves": int(model.get_n_leaves()),
                "score": model.score(X, y, sample_weight),
            },
        ),
        series_data=nextmv.SeriesData(),
    )

    if sample_weight is not None:
        statistics.result.custom["sample_weight"] = sample_weight

    return statistics