milliontrees.datasets package¶

Submodules¶

milliontrees.datasets.TreeBoxes module¶

class milliontrees.datasets.TreeBoxes.TreeBoxesDataset(version=None, root_dir='data', download=False, split_scheme='within-distribution', geometry_name='y', eval_score_threshold=0.0, remove_incomplete=False, image_size=448, include_sources=None, exclude_sources=None, mini=False, small=False, verbose=True, include_unsupervised=False)[source]¶

Bases: MillionTreesDataset

A dataset of tree annotations with bounding box coordinates from multiple global sources.

The dataset contains aerial imagery of trees with their corresponding bounding box annotations. Each tree is annotated with a bounding box given by four coordinates (x_min, y_min, x_max, y_max).

Dataset Splits:

within-distribution: For each source, a portion of images is in train and a portion in test.
crossgeometry: Boxes and Points are used to predict polygons.
out-of-distribution: Selected sources are entirely held out for testing.

Data Format:

Input (x): RGB aerial imagery
Labels (y): Nx4 array of bounding box coordinates
Metadata: Location identifiers for each image

Parameters:

version (str) – The version of the dataset to load.
root_dir (str) – The root directory to store the dataset.
download (bool) – Whether to download the dataset if it is not already present.
split_scheme (str) – The split scheme to use.
geometry_name (str) – The name of the geometry to use.
eval_score_threshold (float) – The threshold for the evaluation score.
remove_incomplete (bool) – Drop incomplete (not exhaustively annotated) sources from the TRAIN split only. Validation/test are never filtered, so the evaluation set matches a full-train run.
image_size (int) – The size of the image to use.
include_sources (list) – The sources to include.
exclude_sources (list) – The sources to exclude.
include_unsupervised (bool) – If True, include unsupervised data in addition to any other selected sources (unless explicitly excluded).
mini (bool) – If True, download mini versions of datasets for development. Mini datasets are smaller subsets that maintain the same structure.
small (bool) – If True, download small releases (up to 50 images per source).
unsupervised_args (dict) – The arguments to pass to the unsupervised download pipeline.

References

Website: https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1009180

Citation:

@article{Weinstein2020,
  title={A benchmark dataset for canopy crown detection and delineation in co-registered airborne RGB, LiDAR and hyperspectral imagery from the National Ecological Observation Network.},
  author={Weinstein BG, Graves SJ, Marconi S, Singh A, Zare A, Stewart D, et al.},
  journal={PLoS Comput Biol},
  year={2021},
  doi={10.1371/journal.pcbi.1009180}
}

License: Creative Commons Attribution License

eval(y_pred, y_true, metadata, *, viz_dir=None, viz_n_per_source=10)[source]¶

Performs evaluation on the given predictions.

The main evaluation metric, detection_acc_avg_dom, measures the simple average of the detection accuracies of each domain.

If viz_dir is set, writes overlay PNGs (purple = ground truth, orange = predictions above the eval score threshold), up to viz_n_per_source images per source, in subfolders named by source.
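To illustrate what a simple average over domains means, the sketch below averages per-image accuracies within each source and then averages those per-source values with equal weight. The function name `average_over_domains` and the source labels are hypothetical; this is an assumption-based illustration, not the library's implementation of detection_acc_avg_dom.

```python
from collections import defaultdict


def average_over_domains(per_image_acc, source_ids):
    """Average per-image accuracies within each source, then across sources."""
    by_source = defaultdict(list)
    for acc, source in zip(per_image_acc, source_ids):
        by_source[source].append(acc)
    # Each source contributes equally, regardless of how many images it has.
    per_source = {s: sum(v) / len(v) for s, v in by_source.items()}
    overall = sum(per_source.values()) / len(per_source)
    return overall, per_source


# Source "a" has three perfect images, source "b" has one missed image.
overall, per_source = average_over_domains(
    [1.0, 1.0, 1.0, 0.0], ["a", "a", "a", "b"]
)
```

With this weighting a source with few images counts as much as a large one, so the result here differs from the plain per-image mean of the same four values.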

get_input(idx)[source]¶

Retrieves the input features (image) for a given data point.

Parameters: idx (int) – Index of a data point
Returns: Input features of the idx-th data point (image) as a normalized numpy array.
Return type: np.ndarray

milliontrees.datasets.TreePoints module¶

class milliontrees.datasets.TreePoints.TreePointsDataset(version=None, root_dir='data', download=False, split_scheme='within-distribution', geometry_name='y', remove_incomplete=False, distance_threshold=0.02, include_sources=None, exclude_sources=None, mini=False, small=False, image_size=448, verbose=True, include_unsupervised=False, eval_score_threshold=0.0, real_world_threshold_m=4.0)[source]¶

Bases: MillionTreesDataset

The TreePoints dataset is a collection of tree annotations stored as (x, y) point locations.

Dataset Splits:

within-distribution: For each source, a portion of images is in train and a portion in test.
crossgeometry: Boxes and Points are used to predict polygons.
out-of-distribution: Selected sources are entirely held out for testing.

Input (x):

RGB aerial images

Label (y):

y is an n x 2 matrix where each row represents a keypoint (x, y)

Metadata:

Each image is annotated with the following metadata

location (int): location id

License:

This dataset is distributed under Creative Commons Attribution License

SOURCE_GSD = {'Amirkolaee et al. 2023': 0.2, 'Beery et al. 2022': 0.05, 'Bohlman et al. 2008': 0.3, 'Chen & Shang (2022)': 0.12, 'Dubrovin et al. 2024': 0.07, 'NEON MultiTemporal': 0.1, 'NEON_points': 0.1, 'OFO field 2025': 0.05, 'OSBS megaplot 2025': 0.2, 'Ventura et al. 2022': 0.6, 'Young et al. 2025 unsupervised': 0.1}¶

TRAIN_ONLY_SOURCES = {'Beery et al. 2022'}¶

eval(y_pred, y_true, metadata, *, viz_dir=None, viz_n_per_source=10)[source]¶

Evaluate predictions.

KeypointAccuracy (recall) uses a per-source distance threshold derived from each source’s GSD so that the matching radius is always real_world_threshold_m metres regardless of image resolution. All other metrics use the dataset-level distance_threshold.

Optional viz_dir / viz_n_per_source write qualitative overlays.
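The relationship between GSD and a fixed real-world matching radius can be sketched as follows. This assumes the SOURCE_GSD values are metres per pixel and expresses the radius in pixels; the library may use other units internally (for example, a radius normalized by image size), and `matching_radius_px` is a hypothetical helper.

```python
# Illustrative subset of the SOURCE_GSD values listed above
# (assumed here to be metres per pixel).
SOURCE_GSD = {
    "NEON_points": 0.1,
    "Beery et al. 2022": 0.05,
    "Ventura et al. 2022": 0.6,
}


def matching_radius_px(source, real_world_threshold_m=4.0, gsd_table=SOURCE_GSD):
    """Convert a ground distance in metres to a pixel radius for one source."""
    return real_world_threshold_m / gsd_table[source]


radii = {name: matching_radius_px(name) for name in SOURCE_GSD}
```

Finer imagery (smaller GSD) gets a larger pixel radius, so the same ground distance is covered at every resolution.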

get_annotation_from_filename(filename)[source]¶

get_input(idx)[source]¶

Parameters: idx (int) – Index of a data point

Output:

x (np.ndarray): Input features of the idx-th data point

milliontrees.datasets.TreePolygons module¶

class milliontrees.datasets.TreePolygons.TreePolygonsDataset(version=None, root_dir='data', download=False, split_scheme='within-distribution', geometry_name='y', eval_score_threshold=0.0, image_size=448, remove_incomplete=False, include_sources=None, exclude_sources=None, mini=False, small=False, verbose=True, include_unsupervised=False)[source]¶

Bases: MillionTreesDataset

The TreePolygons dataset is a collection of trees annotated as multi-point polygons.

The dataset is composed of many sources from across the world.

Dataset Splits:

within-distribution: For each source, 80% of the data is used for training and 20% for testing.
crossgeometry: Boxes and Points are used to predict polygons.
out-of-distribution: Selected sources are entirely held out for testing.

Input (x):

RGB aerial images.

Label (y):

y is an n x 2 array in which each row represents a point coordinate (x, y).

Metadata:

Each image is annotated with the following metadata:

location (int): location id
source (int): source id

License:

This dataset is distributed under the Creative Commons Attribution License.

build_metrics(score_threshold)[source]¶

Construct the evaluation metric objects at a given score threshold.

Each metric filters predictions by scores >= score_threshold, so the threshold is baked in at construction. Factored out so callers (e.g. a threshold sweep) can build independent metric sets per threshold without reconstructing the whole dataset.
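The filtering rule described above (keep predictions with scores >= score_threshold) and a sweep over several thresholds can be sketched in plain Python. The dictionary layout of a prediction and the helper `filter_by_score` are assumptions for illustration; the real metric objects are not shown here.

```python
def filter_by_score(predictions, score_threshold):
    """Keep predictions whose score is at or above the threshold."""
    return [p for p in predictions if p["score"] >= score_threshold]


predictions = [
    {"label": "tree", "score": 0.9},
    {"label": "tree", "score": 0.5},
    {"label": "tree", "score": 0.1},
]

# Sweep thresholds; each pass filters independently, in the same way that
# separate metric sets built per threshold would not share state.
kept_counts = {t: len(filter_by_score(predictions, t)) for t in (0.0, 0.5, 0.95)}
```

Because the comparison is inclusive, a prediction scoring exactly 0.5 survives a threshold of 0.5.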

create_polygon_mask(width, height, vertices, scale_x=1.0, scale_y=1.0)[source]¶

Rasterize a shapely polygon to a binary mask at the given (width, height).

Vertex coordinates are multiplied by (scale_x, scale_y) so a polygon defined on the original image can be drawn directly at a downscaled target size, avoiding allocation of a full-resolution mask.
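The idea of scaling vertices before rasterizing can be shown with a small dependency-free sketch. It takes a plain list of (x, y) vertices rather than a shapely polygon, tests pixel centres with an even-odd crossing rule, and returns nested lists; all of these choices, and the name `rasterize_polygon`, are assumptions made for illustration and may differ from the library's method.

```python
def rasterize_polygon(width, height, vertices, scale_x=1.0, scale_y=1.0):
    """Return a height x width nested-list mask (1 inside the polygon, else 0)."""
    # Scale vertices first so the mask is only ever allocated at the target size.
    pts = [(x * scale_x, y * scale_y) for x, y in vertices]
    mask = [[0] * width for _ in range(height)]
    for row in range(height):
        for col in range(width):
            px, py = col + 0.5, row + 0.5  # test the pixel centre
            inside = False
            # Even-odd rule: count edges crossed by a ray going right from the point.
            for (x1, y1), (x2, y2) in zip(pts, pts[1:] + pts[:1]):
                if (y1 > py) != (y2 > py):
                    x_cross = x1 + (py - y1) * (x2 - x1) / (y2 - y1)
                    if px < x_cross:
                        inside = not inside
            mask[row][col] = int(inside)
    return mask


# A square defined on an 8x8 original, drawn directly at a 4x4 target (scale 0.5).
square = [(2, 2), (6, 2), (6, 6), (2, 6)]
mask = rasterize_polygon(4, 4, square, scale_x=0.5, scale_y=0.5)
```

Only a 4x4 mask is allocated here, even though the polygon was defined in the coordinates of the larger image.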

eval(y_pred, y_true, metadata, *, viz_dir=None, viz_n_per_source=10)[source]¶

The main evaluation metric, detection_acc_avg_dom, measures the simple average of the detection accuracies of each domain.

Optional viz_dir / viz_n_per_source write qualitative overlays (purple = GT masks, orange = predicted masks above the eval score threshold).

get_input(idx)[source]¶

Parameters: idx (int) – Index of a data point

Output:

x (np.ndarray): Input features of the idx-th data point

milliontrees.datasets.download_utils module¶

This file contains utility functions for downloading datasets. The code in this file is taken from the torchvision package, specifically, https://github.com/pytorch/vision/blob/master/torchvision/datasets/utils.py. We package it here to avoid users having to install the rest of torchvision. It is licensed under the following license:

BSD 3-Clause License

Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met:

Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer.
Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following disclaimer in the documentation and/or other materials provided with the distribution.
Neither the name of the copyright holder nor the names of its contributors may be used to endorse or promote products derived from this software without specific prior written permission.

THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS “AS IS” AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.

milliontrees.datasets.download_utils.calculate_md5(fpath: str, chunk_size: int = 1048576) → str[source]¶

milliontrees.datasets.download_utils.check_integrity(fpath: str, md5: str | None = None) → bool[source]¶

milliontrees.datasets.download_utils.check_md5(fpath: str, md5: str, **kwargs: Any) → bool[source]¶
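The checksum helpers above carry no description, so here is a minimal sketch of chunked MD5 hashing and comparison using only the standard library. The names `md5_of_file` and `md5_matches` are hypothetical stand-ins, and the code shows the general technique rather than the exact packaged implementation.

```python
import hashlib
import os
import tempfile


def md5_of_file(path, chunk_size=1024 * 1024):
    """Hash a file in fixed-size chunks so large archives never sit fully in memory."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def md5_matches(path, expected_md5):
    """Return True when the file's MD5 hex digest equals the expected value."""
    return md5_of_file(path) == expected_md5


# Demonstrate on a small temporary file.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"hello")
    tmp_path = tmp.name
hello_ok = md5_matches(tmp_path, hashlib.md5(b"hello").hexdigest())
wrong_ok = md5_matches(tmp_path, "0" * 32)
os.remove(tmp_path)
```

The default chunk size of 1048576 bytes mirrors the chunk_size default in the calculate_md5 signature.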

milliontrees.datasets.download_utils.download_and_extract_archive(url: str, download_root: str, extract_root: str | None = None, filename: str | None = None, md5: str | None = None, remove_finished: bool = False, size: int | None = None) → None[source]¶

milliontrees.datasets.download_utils.download_file_from_google_drive(file_id: str, root: str, filename: str | None = None, md5: str | None = None)[source]¶

Download a file from Google Drive and place it in root.

Parameters:

file_id (str) – id of file to be downloaded
root (str) – Directory to place downloaded file in
filename (str, optional) – Name to save the file under. If None, use the id of the file.
md5 (str, optional) – MD5 checksum of the download. If None, do not check

milliontrees.datasets.download_utils.download_url(url: str, root: str, filename: str | None = None, md5: str | None = None, size: int | None = None) → None[source]¶

Download a file from a url and place it in root.

Parameters:

url (str) – URL to download file from
root (str) – Directory to place downloaded file in
filename (str, optional) – Name to save the file under. If None, use the basename of the URL
md5 (str, optional) – MD5 checksum of the download. If None, do not check

milliontrees.datasets.download_utils.extract_archive(from_path: str, to_path: str | None = None, remove_finished: bool = False) → None[source]¶

milliontrees.datasets.download_utils.gen_bar_updater(total) → Callable[[int, int, int], None][source]¶

milliontrees.datasets.download_utils.iterable_to_str(iterable: Iterable) → str[source]¶

milliontrees.datasets.download_utils.list_dir(root: str, prefix: bool = False) → List[str][source]¶

List all directories at a given root.

Parameters:

root (str) – Path to directory whose folders need to be listed
prefix (bool, optional) – If true, prepends the path to each result, otherwise only returns the name of the directories found

milliontrees.datasets.download_utils.list_files(root: str, suffix: str, prefix: bool = False) → List[str][source]¶

List all files ending with a suffix at a given root.

Parameters:

root (str) – Path to directory whose files need to be listed
suffix (str or tuple) – Suffix of the files to match, e.g. ‘.png’ or (‘.jpg’, ‘.png’). It uses the Python “str.endswith” method and is passed directly
prefix (bool, optional) – If true, prepends the path to each result, otherwise only returns the name of the files found
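The suffix and prefix behaviour described for list_files can be sketched with the standard library. `list_files_with_suffix` is a hypothetical stand-in written for this example, and the sorting of results is an added assumption that makes the output deterministic.

```python
import os
import tempfile


def list_files_with_suffix(root, suffix, prefix=False):
    """List files directly under root whose names end with suffix (str or tuple)."""
    names = sorted(
        name for name in os.listdir(root)
        if os.path.isfile(os.path.join(root, name)) and name.endswith(suffix)
    )
    # With prefix=True, return full paths instead of bare file names.
    return [os.path.join(root, n) for n in names] if prefix else names


with tempfile.TemporaryDirectory() as root:
    for name in ("a.png", "b.jpg", "notes.txt"):
        open(os.path.join(root, name), "w").close()
    images = list_files_with_suffix(root, (".jpg", ".png"))
    with_paths = list_files_with_suffix(root, ".png", prefix=True)
    expected_path = os.path.join(root, "a.png")
```

Passing a tuple works because str.endswith accepts either a single suffix or a tuple of suffixes.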

milliontrees.datasets.download_utils.verify_str_arg(value: T, arg: str | None = None, valid_values: Iterable[T] = None, custom_msg: str | None = None) → T[source]¶

milliontrees.datasets.milliontrees_dataset module¶

class milliontrees.datasets.milliontrees_dataset.MillionTreesDataset(root_dir, download, split_scheme)[source]¶

Bases: object

Shared dataset class for all MillionTrees datasets.

Each data point in the dataset is a tuple (x, y, metadata), where:

x: The input features
y: The target
metadata: A vector of relevant information (e.g., domain). For convenience, metadata also contains y.

DEFAULT_SOURCE_DOMAIN_SPLITS = [0]¶

DEFAULT_SPLITS = {'train': 0, 'val': 1}¶

DEFAULT_SPLIT_NAMES = {'train': 'Train', 'val': 'Validation'}¶

check_init()[source]¶: Convenience function to check that the MillionTreesDataset is properly configured.

check_version()[source]¶

property collate¶

Torch function to collate items in a batch.

By default returns None -> uses default torch collate.

property data_dir¶: The full path to the folder in which the dataset is stored.

dataset_exists_locally(data_dir, version_file)[source]¶

property dataset_name¶: A string that identifies the dataset, e.g., ‘amazon’, ‘camelyon17’.

download_dataset(data_dir, download_flag)[source]¶

eval(y_pred, y_true, metadata, *, viz_dir=None, viz_n_per_source=10)[source]¶

Parameters:

y_pred – Predicted targets per image
y_true – True targets per image
metadata – Metadata rows aligned with predictions
viz_dir – If set, write up to viz_n_per_source overlay PNGs per source_id under this directory (see eval_visualization).
viz_n_per_source – Max images to save per source when viz_dir is set. Pass None to write all images.

Output:

results (dict): Dictionary of results (may include eval_visualization_paths)
results_str (str): Pretty print version of the results

get_input(idx)[source]¶

Parameters: idx (int) – Index of a data point

Output:

x (Tensor): Input features of the idx-th data point

get_subset(split, frac=1.0, transform=None)[source]¶

Parameters:

split – Split identifier, e.g., ‘train’, ‘val’, ‘test’. Must be in self.split_dict.
frac – What fraction of the split to randomly sample. Used for fast development on a small dataset.
transform – Any data transformations to be applied to the input x.

Output:

subset (MillionTreesSubset): A (potentially subsampled) subset of the MillionTreesDataset.
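How a split identifier and frac could select indices is sketched below in plain Python. The helper `sample_split_indices`, its seed argument, and the rounding-down of the sample size are assumptions for illustration; the actual sampling inside get_subset may differ.

```python
import random


def sample_split_indices(split_array, split_id, frac=1.0, seed=0):
    """Pick the indices belonging to one split, optionally subsampled by frac."""
    indices = [i for i, s in enumerate(split_array) if s == split_id]
    if frac < 1.0:
        n_keep = int(len(indices) * frac)  # round down to a whole number of items
        indices = sorted(random.Random(seed).sample(indices, n_keep))
    return indices


split_array = [0, 0, 1, 0, 1, 0, 0, 1, 0, 0]  # 0 = train, 1 = val
train_all = sample_split_indices(split_array, 0)
train_half = sample_split_indices(split_array, 0, frac=0.5)
```

A fixed seed keeps the subsample reproducible between runs, which helps when comparing quick development experiments.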

get_tree_coverage_mask(idx, image_shape)[source]¶: Load a precomputed tree/no-tree mask for an image if available.

initialize_data_dir(root_dir, download)[source]¶

Helper function for downloading/updating the dataset if required.

Note that we only do a version check for datasets where the download_url is set. Currently, this includes all datasets except Yelp. Datasets for which we don’t control the download, like Yelp, might not handle versions similarly.

property is_detection¶

Boolean.

True if the task is detection, and false otherwise.

property latest_version¶

property metadata_array¶

A Tensor of metadata, with the i-th row representing the metadata associated with the i-th data point.

The columns correspond to the metadata_fields defined above.

property metadata_fields¶

A list of strings naming each column of the metadata table, e.g., [‘hospital’, ‘y’].

Must include ‘y’.

property metadata_map¶

An optional dictionary that, for each metadata field, contains a list that maps from integers (in metadata_array) to a string representing what that integer means.

This is only used for logging, so that we print out more intelligible metadata values. Each key must be in metadata_fields. For example, if we have metadata_fields = [‘hospital’, ‘y’] metadata_map = {‘hospital’: [‘East’, ‘West’]} then if metadata_array[i, 0] == 0, the i-th data point belongs to the ‘East’ hospital while if metadata_array[i, 0] == 1, it belongs to the ‘West’ hospital.
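The hospital example above can be written out as a short sketch. The `decode_row` helper and the sample metadata_array values are hypothetical; only the field names and the mapping come from the description.

```python
metadata_fields = ["hospital", "y"]
metadata_map = {"hospital": ["East", "West"]}
metadata_array = [[0, 1], [1, 0], [1, 1]]  # made-up rows for illustration


def decode_row(row):
    """Replace integer codes with readable names where a mapping exists."""
    decoded = {}
    for field, value in zip(metadata_fields, row):
        names = metadata_map.get(field)
        # Fields without an entry in metadata_map keep their raw value.
        decoded[field] = names[value] if names is not None else value
    return decoded


decoded_rows = [decode_row(row) for row in metadata_array]
```

Fields such as ‘y’ that have no entry in metadata_map are passed through unchanged.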

property n_classes¶

Number of classes for single-task classification datasets.

Used for logging and to configure models to produce appropriately-sized output. None by default. Leave as None if not applicable (e.g., regression or multi-task classification).

property original_resolution¶: Original image resolution for image datasets.

property source_domain_splits¶: List of split IDs that are from the source domain.

property split_array¶: An array of integers, with split_array[i] representing what split the i-th data point belongs to.

property split_dict¶

A dictionary mapping splits to integer identifiers (used in split_array), e.g., {‘train’: 0, ‘val’: 1, ‘test’: 2}.

Keys should match up with split_names.

property split_names¶

A dictionary mapping splits to their pretty names, e.g., {‘train’: ‘Train’, ‘val’: ‘Validation’, ‘test’: ‘Test’}.

Keys should match up with split_dict.

property split_scheme¶: A string identifier of how the split is constructed, e.g., ‘standard’, ‘mixed-to-test’, ‘user’, etc.

static standard_eval(metric, y_pred, y_true)[source]¶

Parameters:

metric – Metric to use for eval
y_pred – Predicted targets
y_true – True targets

Output:

results (dict): Dictionary of results
results_str (str): Pretty print version of the results

static standard_group_eval(metric, grouper, y_pred, y_true, metadata, aggregate=True)[source]¶

Parameters:

metric – Metric to use for eval
grouper – Grouper object that converts metadata into groups
y_pred – Predicted targets
y_true – True targets
metadata – Metadata

Output:

results (dict): Dictionary of results
results_str (str): Pretty print version of the results

property version¶: A string that identifies the dataset version, e.g., ‘1.0’.

property versions_dict¶

A dictionary where each key is a version string (e.g., ‘1.0’) and each value is a dictionary containing the ‘download_url’ and ‘compressed_size’ keys.

‘download_url’ is the URL for downloading the dataset archive. If None, the dataset cannot be downloaded automatically (e.g., because it first requires accepting a usage agreement).

‘compressed_size’ is the approximate size of the compressed dataset in bytes.
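The shape of such a dictionary can be sketched as follows. The version strings, sizes, and the example.com URL are placeholders invented for illustration, and `can_auto_download` is a hypothetical helper rather than part of the package.

```python
# Placeholder entries: these versions, URLs, and sizes are not real releases.
versions_dict = {
    "0.1": {
        "download_url": "https://example.com/dataset_v0.1.zip",
        "compressed_size": 1_000_000,
    },
    "0.2": {
        "download_url": None,  # e.g. requires accepting a usage agreement first
        "compressed_size": 2_500_000,
    },
}


def can_auto_download(version):
    """A version is downloadable automatically only when its URL is set."""
    return versions_dict[version]["download_url"] is not None


downloadable = [v for v in versions_dict if can_auto_download(v)]
```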

property y_array¶

A Tensor of targets (e.g., labels for classification tasks), with y_array[i] representing the target of the i-th data point.

y_array[i] can contain multiple elements.

property y_size¶

The number of dimensions/elements in the target, i.e., len(y_array[i]).

For standard classification/regression tasks, y_size = 1. For multi-task or structured prediction settings, y_size > 1. Used for logging and to configure models to produce appropriately-sized output.

class milliontrees.datasets.milliontrees_dataset.MillionTreesSubset(dataset, indices, transform=None, geometry_name='y')[source]¶

Bases: MillionTreesDataset

eval(y_pred, y_true, metadata, *, viz_dir=None, viz_n_per_source=10)[source]¶

Parameters:

y_pred – Predicted targets per image
y_true – True targets per image
metadata – Metadata rows aligned with predictions
viz_dir – If set, write up to viz_n_per_source overlay PNGs per source_id under this directory (see eval_visualization).
viz_n_per_source – Max images to save per source when viz_dir is set. Pass None to write all images.

Output:

results (dict): Dictionary of results (may include eval_visualization_paths)
results_str (str): Pretty print version of the results

property metadata_array¶

A Tensor of metadata, with the i-th row representing the metadata associated with the i-th data point.

The columns correspond to the metadata_fields defined above.

property split_array¶: An array of integers, with split_array[i] representing what split the i-th data point belongs to.

property y_array¶

A Tensor of targets (e.g., labels for classification tasks), with y_array[i] representing the target of the i-th data point.

y_array[i] can contain multiple elements.
