Changes from all commits
83 commits
7ae25b5
Do experiments on large dataset (not clean)
vxbrandon Jul 28, 2022
50d03a4
Handling a few extreme cases!
vxbrandon Jul 29, 2022
f3f31ee
Run experiments on various real dataset.
vxbrandon Jul 29, 2022
6aa2e82
Large refactor. Moving sklearn and mnist datasets to data_loader in c…
Jul 29, 2022
333f0f5
Refactoring datasets in compare_budgets.py
Jul 29, 2022
a446934
Large cleanup: compare_runtimes.py and compare_budget.py now avoid du…
Jul 29, 2022
9031020
Passing dataset name to store profiles with dataset names and not ove…
Jul 29, 2022
47e84f6
Cleaning (deleting) logfiles
Jul 29, 2022
4234d4a
Changing num_seeds back to 5 for runtime and budget
Jul 29, 2022
8445992
Adding HOUSING dataset, where our approach doesnt work, and renaming …
Jul 29, 2022
be22771
Adding forest covertype dataset -- we outperform, 10s to 600s
Jul 30, 2022
ca2b07a
Trying to fix messed up merge
Jul 30, 2022
0a7d534
Adding npy file to gitignore
Jul 30, 2022
db64749
Preparing for experiments: deleting old logs, writing new runtime pro…
Jul 30, 2022
c1b1530
Storing data up to APS
Jul 30, 2022
9102cfc
Refactor codes and fix errors on scaling
vxbrandon Jul 30, 2022
f2551c7
Merge remote-tracking branch 'origin/dataset_experiments' into datase…
vxbrandon Jul 30, 2022
e1dfe41
Classification results look good -- can use these for Table 1. Unfort…
Jul 30, 2022
4dcebf8
Uncomment runtime comparision for regression and fix setting
vxbrandon Jul 30, 2022
56d54e0
results on AIR work with max_depth=3
Jul 30, 2022
b82fbdb
Running regression exps with max_depth=3 helps sklearn. Blog still do…
Jul 30, 2022
c35e903
Producing Tables 1 and 2 properly with produce_tables.py -- everythin…
Jul 30, 2022
373ddaa
Fixing errors with model names
Jul 30, 2022
6281d94
Doubling budgets and only running 1 seed to ensure baselines can trai…
Jul 30, 2022
51c36b3
Adding ability to form predictions from forests when no full trees we…
Jul 31, 2022
d42e14d
Figured out reasonable budgets for all exps. Running 5 seeds now. Als…
Jul 31, 2022
a4815ac
Addressing my own comments from review
Jul 31, 2022
4d152bc
Merge branch 'dataset_experiments' into new_scaling_exps
motiwari Jul 31, 2022
3afdeb6
Merge pull request #226 from ThrunGroup/new_scaling_exps
motiwari Jul 31, 2022
c2312f1
Add error bars on scaling exps. Add gpu dataset. Add vectorize_impuri…
vxbrandon Jul 31, 2022
f027bec
Run black
vxbrandon Jul 31, 2022
de9fea8
Delete data file to upload
vxbrandon Jul 31, 2022
df57afb
Delete data file to upload
vxbrandon Jul 31, 2022
c79f5aa
Turn of flag of preprocessing gpu file
vxbrandon Jul 31, 2022
b26514f
Adding budget experiment results
Jul 31, 2022
5bf6ff8
TAG: EXP_CLOSE. Merging Jeys data_experiments with mine. All runtime …
Jul 31, 2022
22c978a
TAG: EXP_OPEN, about to run runtime and budget exps for GPU dataset. …
Jul 31, 2022
7a1ed4d
Adding ALL experimental results including for new regression dataset.…
Jul 31, 2022
b218b06
Nits
Jul 31, 2022
57d8672
TAG: EXP_OPEN, attempting to improve Table 4 results by tweaking budg…
Jul 31, 2022
10c58ef
TAG: EXP_CLOSE for 1-seed improvements to Table 4. EXP_OPEN for 5-see…
Jul 31, 2022
a41119e
TAG: EXP_CLOSE, did 5 seeds for Table 4, looks much better now
Jul 31, 2022
5302bf6
Made necessary cosmetic changes to produce_tables.py. Use all of its …
Aug 1, 2022
bd23865
Nit
Aug 1, 2022
9f75e27
TAG: EXP_OPEN. Table 3 experiments, budget for classification, look e…
Aug 1, 2022
7111696
Adding small dataset exps for R3
Aug 1, 2022
edc32e1
Something is still wrong with Table 3. About to debug
Aug 1, 2022
3c06edc
TAG: EXP_OPEN. Fixing critical bug in classification budget to NOT ju…
Aug 1, 2022
4c7e5be
TAG: EXP_OPEN. Reducing budget by factor of 5 in FLIGHT
Aug 2, 2022
10afe2b
TAG: EXP_OPEN. Reducing budget by factor of 5 in COVTYPE
Aug 2, 2022
a569986
Rerunning exps with different max_depth with one seed to see perf diffs
Aug 2, 2022
cefa861
TAG: EXP_OPEN. Running final table 3
Aug 2, 2022
6cc0e27
Adding proper Table 3 results
Aug 2, 2022
940cf47
After an hour, get .8088 with sklearn RFC and .5964 with Hoeffding Tree
Aug 2, 2022
baed34f
Don't plot negative value for num queries
vxbrandon Aug 2, 2022
3591185
Merge remote-tracking branch 'origin/dataset_experiments' into datase…
vxbrandon Aug 2, 2022
da6745b
Recreating final Table 3
Aug 2, 2022
e8ab8ec
- Randomly rank the items that have same equal importance score
vxbrandon Sep 10, 2022
c4afa72
Clean up codes for reproducing the experiments
vxbrandon Sep 10, 2022
693319f
Merge branch 'reproducec_experiments' of github.com:ThrunGroup/FastFo…
Sep 13, 2022
db08c77
ran all experiments except gpu
Sep 14, 2022
2ea691b
ran all experiments except gpu
Sep 14, 2022
6852e88
Change hyperparams to reproduce experiments on Sklearn regression and…
vxbrandon Sep 15, 2022
b11563a
1. Change hyperparameters of budget experiments: increased max depth …
vxbrandon Sep 18, 2022
2021a91
Update log files by running bash repro_script.sh
vxbrandon Sep 18, 2022
6752c81
Uncomment comments for debugging/experiments
vxbrandon Sep 18, 2022
eacec31
Merge branch 'dataset_experiments' into reproduce_experiments
motiwari Sep 18, 2022
be8e2c7
Merge pull request #235 from ThrunGroup/reproduce_experiments
motiwari Sep 18, 2022
e7ba44f
1. Change back our algorithm to find the best arm not the 5th best arm
vxbrandon Sep 19, 2022
94cfad8
Changing mab solver to len(candidates) > 1
Oct 9, 2022
444bd83
Fixing repro script
Oct 9, 2022
aff2db8
Adding results, choosing final datasets
Oct 10, 2022
fcd420a
Fixing table-making code
Oct 10, 2022
c723f69
Adding tables
Oct 10, 2022
b4f87a5
Nit todo
Nov 8, 2022
8b232c8
Created results that show for depth=5, pull_each_arm=333, batch_size=…
Nov 8, 2022
da736ea
Need to sample randomly 1000 times for roughly equal performances, th…
Nov 9, 2022
d9723e7
Undoing some random solver changes
Dec 12, 2022
9bcba82
Merge pull request #242 from ThrunGroup/random_baseline
motiwari Dec 12, 2022
975e3f3
Small style changes to sklearn comparison
Dec 12, 2022
23e34e1
Fixing ERFR problem for sklearn by allowing it to look at all features
Dec 12, 2022
541a283
Making sure candidate conditions are fixed. About to rerun repro script
Dec 12, 2022
abac8c4
All results recreated. Table 1 multipliers are a little more modest w…
Dec 13, 2022
4 changes: 4 additions & 0 deletions .gitignore
@@ -1,7 +1,11 @@
experiments/datasets/kdd98.npz.npy
fastforest.egg-info/
.DS_Store
.idea/
__pycache__/
experiments/.ipynb_checkpoints/*
experiments/mnist
experiments/.DS_Store
experiments/datasets/cup98*
experiments/datasets/gpu*
experiments/datasets/new_gpu*
62 changes: 42 additions & 20 deletions data_structures/forest_base.py
@@ -16,7 +16,7 @@
DEFAULT_MIN_IMPURITY_DECREASE,
BATCH_SIZE,
)
from utils.utils import data_to_discrete, set_seed, get_subset_2d
from utils.utils import data_to_discrete, set_seed, get_subset_2d, class_to_idx, counts_of_labels
from utils.boosting import get_next_targets
from data_structures.tree_classifier import TreeClassifier
from data_structures.tree_regressor import TreeRegressor
@@ -65,7 +65,7 @@ def __init__(
self.new_targets = labels

# self.curr_data and self.curr_targets are the data, targets that are used to fit the current tree.
# These attributes may be smaller than the original dataset size if self.bootstrap is true.
# These attributes may be smaller than the original datasets size if self.bootstrap is true.
self.curr_data = None
self.curr_targets = None
self.trees = []
@@ -169,12 +169,18 @@ def fit(self, data: np.ndarray = None, labels: np.ndarray = None) -> None:
self.data = data
self.org_targets = labels
self.new_targets = labels

N = len(self.data)
F = len(self.data[0])

if self.is_classification:
self.org_targets = self.org_targets.astype(np.int32)
self.new_targets = self.new_targets.astype(np.int32)
if self.classes is None:
self.classes: dict = class_to_idx(
np.unique(self.new_targets)
) # a dictionary that maps class name to class index
self.n_classes = len(self.classes)

if self.make_discrete:
self.discrete_features: DefaultDict = data_to_discrete(self.data, n=10)
@@ -289,6 +295,9 @@ def fit(self, data: np.ndarray = None, labels: np.ndarray = None) -> None:
):
self.trees.append(tree)
else:
# We have been unable to fit the tree within the budget.
# In this case, we do not add the tree to the list
# In the case of the very first tree not being able to be fit, this implies that self.trees = []
break

if self.boosting:
@@ -336,27 +345,40 @@ def predict(self, datapoint: np.ndarray) -> Union[Tuple[int, np.ndarray], float]
(Regressor) the averaged mean value of labels in each tree
"""
T = len(self.trees)
if self.is_classification:
agg_preds = np.empty((T, self.n_classes))
for tree_idx, tree in enumerate(self.trees):
# Average over predicted probabilities, not just hard labels
agg_preds[tree_idx] = tree.predict(datapoint)[1]
avg_preds = agg_preds.mean(axis=0)
label_pred = list(self.classes.keys())[avg_preds.argmax()]
return label_pred, avg_preds
if T == 0:
# We handle the case where no full trees were able to be split specially. In this case, we instantiate
# a stump (single root node) and predict the datapoint as the average (regression) or priors (classification).
N = len(self.data)
if self.is_classification:
# In the case of classification, we use the priors as the prediction
# WARNING: This may not work when the labels are not contiguous integers starting at 0. Fix after ddl.
# TODO(@motiwari): See warning above.
avg_preds = np.bincount(self.new_targets) / N
return np.argmax(avg_preds), avg_preds
else:
return np.mean(self.org_targets)
else:
if self.boosting:
if self.is_classification:
agg_preds = np.empty((T, self.n_classes))
for tree_idx, tree in enumerate(self.trees):
if tree_idx == 0:
agg_pred = tree.predict(datapoint)
else:
agg_pred += self.boosting_lr * tree.predict(datapoint)
return agg_pred
# Average over predicted probabilities, not just hard labels
agg_preds[tree_idx] = tree.predict(datapoint)[1]
avg_preds = agg_preds.mean(axis=0)
label_pred = list(self.classes.keys())[avg_preds.argmax()]
return label_pred, avg_preds
else:
agg_pred = np.empty(T)
for tree_idx, tree in enumerate(self.trees):
agg_pred[tree_idx] = tree.predict(datapoint)
return float(agg_pred.mean())
if self.boosting:
for tree_idx, tree in enumerate(self.trees):
if tree_idx == 0:
agg_pred = tree.predict(datapoint)
else:
agg_pred += self.boosting_lr * tree.predict(datapoint)
return agg_pred
else:
agg_pred = np.empty(T)
for tree_idx, tree in enumerate(self.trees):
agg_pred[tree_idx] = tree.predict(datapoint)
return float(agg_pred.mean())

def get_oob_score(self, data=None) -> float:
"""
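Note on the classification fallback above: `np.bincount(self.new_targets) / N` only yields valid class priors when the labels are contiguous integers starting at 0, as the in-code WARNING/TODO already flags. A minimal standalone sketch of the failure mode and one possible remapping fix (illustrative only, not the repository's code):

```python
import numpy as np

# Contiguous labels starting at 0: bincount length equals the number of classes.
labels = np.array([0, 1, 1, 2])
print(np.bincount(labels) / len(labels))        # [0.25 0.5  0.25]

# Non-contiguous labels: bincount pads with zeros up to max(labels) + 1,
# so the "priors" vector no longer lines up with self.classes.
labels = np.array([5, 7, 7])
print(np.bincount(labels) / len(labels))        # length 8, mostly zeros

# One possible fix, in the spirit of class_to_idx: remap labels to
# 0..n_classes-1 before counting.
classes, remapped = np.unique(labels, return_inverse=True)
priors = np.bincount(remapped, minlength=len(classes)) / len(labels)
print(dict(zip(classes.tolist(), priors)))      # {5: 0.333..., 7: 0.666...}
```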
9 changes: 2 additions & 7 deletions data_structures/forest_classifier.py
@@ -42,13 +42,8 @@ def __init__(
alpha_F: float = 1.0,
alpha_N: float = 1.0,
) -> None:
if classes is None:
self.classes: dict = class_to_idx(
np.unique(labels)
) # a dictionary that maps class name to class index
else:
self.classes = classes
self.n_classes = len(self.classes)

self.classes = classes
super().__init__(
data=data,
labels=labels,
8 changes: 4 additions & 4 deletions data_structures/histogram.py
@@ -3,7 +3,7 @@
import math
from typing import Any, Tuple

from utils.constants import LINEAR, DISCRETE, IDENTITY, RANDOM, DEFAULT_NUM_BINS, VECTORIZE
from utils.constants import LINEAR, DISCRETE, IDENTITY, RANDOM, DEFAULT_NUM_BINS, VECTORIZE_HISTOGRAM
from utils.utils_histogram import welford_variance_calc


@@ -109,8 +109,8 @@ def empty_samples(self, bin_idcs: np.ndarray, is_curr_empty: bool = True) -> None:

def add(self, X: np.ndarray, Y: np.ndarray):
"""
Given dataset X , add all the points in the dataset to the histogram.
:param X: dataset to be histogrammed (subset of original X, although could be the same size)
Given datasets X , add all the points in the datasets to the histogram.
:param X: datasets to be histogrammed (subset of original X, although could be the same size)
:return: None, but modify the histogram to include the relevant feature values
"""
feature_values = X[:, self.feature_idx]
@@ -126,7 +126,7 @@ def add(self, X: np.ndarray, Y: np.ndarray):
Y
), "Error: sample sizes and label sizes must be the same"
insert_idcs = self.get_bin(feature_values, self.bin_edges).astype("int64")
if VECTORIZE:
if VECTORIZE_HISTOGRAM:
new_Y = self.replace_array(Y, self.class_to_idx)
hist = np.zeros(
(self.left.shape[0] + 1, self.left.shape[1]), dtype=np.int64
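For context on the `VECTORIZE_HISTOGRAM` branch above: the idea is to accumulate per-bin, per-class counts in a single indexed addition rather than looping over samples. A minimal sketch of that pattern with illustrative names and shapes (the class's actual attributes and its `replace_array` helper may differ):

```python
import numpy as np

num_bins, num_classes = 4, 3
insert_idcs = np.array([0, 0, 2, 3, 3, 3])  # bin index of each sample
class_idcs = np.array([1, 1, 0, 2, 2, 1])   # integer class index of each sample

# Loop version: one increment per sample.
hist_loop = np.zeros((num_bins, num_classes), dtype=np.int64)
for b, c in zip(insert_idcs, class_idcs):
    hist_loop[b, c] += 1

# Vectorized version: accumulate all (bin, class) pairs at once.
# np.add.at is unbuffered, so repeated index pairs are counted correctly.
hist_vec = np.zeros((num_bins, num_classes), dtype=np.int64)
np.add.at(hist_vec, (insert_idcs, class_idcs), 1)

assert np.array_equal(hist_loop, hist_vec)
```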
19 changes: 17 additions & 2 deletions data_structures/node.py
@@ -7,7 +7,7 @@
from collections import defaultdict
from utils.utils import get_subset_2d

from utils.solvers import solve_mab, solve_exactly
from utils.solvers import solve_mab, solve_exactly, solve_randomly
from utils.utils import (
type_check,
counts_of_labels,
@@ -17,6 +17,7 @@
from utils.constants import (
MAB,
EXACT,
RANDOM_SOLVER,
GINI,
LINEAR,
DEFAULT_NUM_BINS,
@@ -49,9 +50,12 @@ def __init__(
# To decrease memory usage and cost for making a copy of large array, we don't pass data array to child node
# but indices
# The features aren't global to the tree, so we should be resampling the features at every node

# Note: To compare with random solver, change self.feature_subsampling to None
self.feature_idcs = choose_features(
self.tree.feature_idcs, self.feature_subsampling, self.tree.rng
)

if self.tree.discrete_features is not None:
self.discrete_features = remap_discrete_features(
self.feature_idcs, self.tree.discrete_features
@@ -149,6 +153,17 @@ def calculate_best_split(self, budget: int = None) -> Union[float, int]:
impurity_measure=self.criterion,
# NOTE: not implemented with budget yet
)
elif self.solver == RANDOM_SOLVER:
results = solve_randomly(
data=self.data,
labels=self.labels,
minmax=self.minmax,
discrete_bins_dict=self.discrete_features,
binning_type=self.bin_type,
num_bins=self.num_bins,
is_classification=self.is_classification,
impurity_measure=self.criterion,
)
else:
raise Exception("Invalid solver specified, must be MAB or EXACT")

@@ -165,7 +180,7 @@ def calculate_best_split(self, budget: int = None) -> Union[float, int]:
self.prev_split_feature = self.split_feature
self.split_feature = self.feature_idcs[
self.split_feature
] # Feature index of original dataset
] # Feature index of original datasets
self.split_reduction *= self.proportion # Normalize by number of datapoints
if self.verbose:
print("Calculated split with", self.num_queries, "queries")
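The new `RANDOM_SOLVER` branch delegates to `solve_randomly`, which is not shown in this diff. As a rough mental model only, a random-split baseline evaluates a single uniformly chosen (feature, threshold) pair instead of searching over all candidates; the sketch below assumes Gini impurity and is not the actual `utils/solvers.py` implementation:

```python
import numpy as np

def gini(labels: np.ndarray) -> float:
    """Gini impurity of a label vector."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - float(np.sum(p ** 2))

def random_split(data: np.ndarray, labels: np.ndarray, rng: np.random.Generator):
    """Evaluate one uniformly random (feature, threshold) split as a baseline."""
    n, num_features = data.shape
    feature = int(rng.integers(num_features))
    threshold = float(rng.choice(data[:, feature]))
    mask = data[:, feature] <= threshold
    left, right = labels[mask], labels[~mask]
    if len(left) == 0 or len(right) == 0:
        return None  # degenerate split; caller may retry or make the node a leaf
    reduction = (
        gini(labels)
        - (len(left) / n) * gini(left)
        - (len(right) / n) * gini(right)
    )
    return feature, threshold, reduction
```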
14 changes: 8 additions & 6 deletions data_structures/permutation.py
@@ -144,7 +144,7 @@ def get_importance_array(self) -> np.ndarray:
Trains all of its forests and computes the importance vector for each of the trained forests.
This is the main function that an object of this class will call.

:return: an array of importance vectors.
:return: an array of importance vectors. (i, j) = ith random forest, jth feature's importance score.
"""
self.is_train = True
self.train_forests()
@@ -229,11 +229,13 @@ def get_stability_freq(imp_data: np.ndarray, best_k_features: int) -> float:
), "Feature subset size should be less than feature dimension"

# preprocess data
best_idcs = np.argsort(-imp_data)[:, :best_k_features]
for i in range(N):
top_k_imps = imp_data[i][best_idcs[i]]
# zero-out the top k importances that aren't below threshold (default 0.0)
best_idcs[i] = np.where(top_k_imps >= MIN_IMPORTANCE, best_idcs[i], -1)
random_noise = np.random.random(imp_data.shape)
sort_idcs = np.lexsort((random_noise, -imp_data), axis=1) # Why use lexsort: to randomly sort the indices that
# have equal values, see https://bit.ly/3DeAnFY
best_idcs = sort_idcs[:, :best_k_features]

# zero-out the top k importances that aren't below threshold (default 0.0)
best_idcs = np.where(np.take_along_axis(imp_data, best_idcs, axis=1) >= MIN_IMPORTANCE, best_idcs, -1)

c_var = 0
for i in range(F):
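The lexsort-plus-noise change above ranks features by importance while breaking exact ties at random (per the commit message, items with equal importance scores are ranked randomly). A small self-contained illustration of the trick, with made-up importance values:

```python
import numpy as np

imp_data = np.array([[0.4, 0.4, 0.2],   # forest 0: features 0 and 1 are tied
                     [0.1, 0.6, 0.3]])  # forest 1
rng = np.random.default_rng(0)
random_noise = rng.random(imp_data.shape)

# np.lexsort sorts by the last key first, so -imp_data (descending importance)
# is the primary key and the noise only decides ties.
sort_idcs = np.lexsort((random_noise, -imp_data), axis=1)
best_idcs = sort_idcs[:, :2]  # top-2 features per forest
print(best_idcs)  # row 0: features 0 and 1 in a random order; row 1: [1, 2]
```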
@@ -52,7 +52,7 @@ def __init__(
random_state=random_state,
with_replacement=with_replacement,
verbose=verbose,
is_precomputed_minmax=True,
is_precomputed_minmax=False,
use_logarithmic_split=False,
epsilon=0.01,
)
6 changes: 4 additions & 2 deletions data_structures/wrappers/histogram_random_forest_regressor.py
@@ -1,7 +1,7 @@
import numpy as np

from data_structures.forest_regressor import ForestRegressor
from utils.constants import SQRT, LINEAR, DEFAULT_NUM_BINS, BEST, EXACT, MSE
from utils.constants import BATCH_SIZE, SQRT, LINEAR, DEFAULT_NUM_BINS, BEST, EXACT, MSE


class HistogramRandomForestRegressor(ForestRegressor):
@@ -32,6 +32,7 @@ def __init__(
random_state: int = 0,
with_replacement: bool = False,
verbose: bool = False,
batch_size: int = BATCH_SIZE,
) -> None:
super().__init__(
data=data,
@@ -52,7 +53,8 @@ def __init__(
random_state=random_state,
with_replacement=with_replacement,
verbose=verbose,
is_precomputed_minmax=True,
is_precomputed_minmax=False,
use_logarithmic_split=False,
epsilon=0.01,
batch_size=batch_size,
)
85 changes: 85 additions & 0 deletions experiments/HT_exps/hoeffding_tree_exp.py
@@ -0,0 +1,85 @@
from skmultiflow.trees import HoeffdingTreeClassifier, HoeffdingTreeRegressor

from sklearn.ensemble import RandomForestClassifier as RFC_sklearn
from sklearn.ensemble import RandomForestRegressor as RFR_sklearn

from experiments.datasets import data_loader

from utils.constants import (
FLIGHT,
AIR,
APS,
BLOG,
SKLEARN_REGRESSION,
MNIST_STR,
HOUSING,
COVTYPE,
KDD,
GPU,
BATCH_SIZE,
)


def main():
SUBSAMPLE_SIZE = 60000
X_train, y_train, X_test, y_test = data_loader.fetch_data(MNIST_STR)
X_train = X_train[:SUBSAMPLE_SIZE]
y_train = y_train[:SUBSAMPLE_SIZE]

sklearn_RFC = RFC_sklearn(
n_estimators=1,
# defaults
criterion="gini", # default
max_depth=None, # default
min_samples_split=2, # default
min_samples_leaf=1, # default
random_state=0,
min_weight_fraction_leaf=0.0,
max_features="sqrt",
max_leaf_nodes=None,
min_impurity_decrease=0.0,
bootstrap=True,
oob_score=False,
n_jobs=None,
verbose=0,
warm_start=False,
class_weight=None,
ccp_alpha=0.0,
max_samples=None,
)
sklearn_RFC.fit(X_train, y_train)
sklearn_RFC.predict(X_test)
print(sklearn_RFC.score(X_test, y_test))


# Hoeffding Tree for classification
ht_classifier = HoeffdingTreeClassifier(
max_byte_size=float("inf"),
memory_estimate_period=float("inf"),
grace_period=10,
split_criterion="gini",
split_confidence=0.1,
# defaults
tie_threshold=0.05, # default
binary_split=True, # default
stop_mem_management=False, # default
remove_poor_atts=False, # default
no_preprune=False, # default
leaf_prediction="nba", # default
nb_threshold=0, # default
)
ht_classifier.fit(X_train, y_train)
ht_classifier.predict(X_test)
print(ht_classifier.score(X_test, y_test))
print(ht_classifier.get_rules_description())

# # Hoeffding Tree for regression
# ht_regressor = HoeffdingTreeRegressor()
# ht_regressor.fit(X, y)
# ht_regressor.predict(X)
# ht_regressor.predict_proba(X)
# ht_regressor.score(X, y)


if __name__ == "__main__":
main()
1 change: 0 additions & 1 deletion experiments/budget_exps/ERFC_dict

This file was deleted.

1 change: 0 additions & 1 deletion experiments/budget_exps/ERFR_dict

This file was deleted.

1 change: 0 additions & 1 deletion experiments/budget_exps/HRFC_dict

This file was deleted.

1 change: 0 additions & 1 deletion experiments/budget_exps/HRFR_dict

This file was deleted.

1 change: 0 additions & 1 deletion experiments/budget_exps/HRPC_dict

This file was deleted.
