milliontrees.datasets package

Submodules

milliontrees.datasets.TreeBoxes module

class milliontrees.datasets.TreeBoxes.TreeBoxesDataset(version=None, root_dir='data', download=False, split_scheme='random', geometry_name='y', eval_score_threshold=0.1, remove_incomplete=False, image_size=448, include_sources=None, exclude_sources=None, mini=False, verbose=True, include_unsupervised=False)[source]

Bases: MillionTreesDataset

A dataset of tree annotations with bounding box coordinates from multiple global sources.

The dataset contains aerial imagery of trees with their corresponding bounding box annotations. Each tree is annotated with an axis-aligned bounding box defined by four coordinates (x_min, y_min, x_max, y_max).

Dataset Splits:
  • random: For each source, 80% of the data is used for training and 20% for testing.

  • crossgeometry: Boxes and Points are used to predict polygons.

  • zeroshot: Selected sources are entirely held out for testing.
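The random and zeroshot schemes above can be illustrated with a small sketch. The helper names (random_split, zeroshot_split), the seed, and the toy source names are assumptions made for illustration; this is not the package's actual splitting code.

```python
import random
from collections import defaultdict


def random_split(image_sources, train_frac=0.8, seed=0):
    """Assign each image to 'train' or 'test', splitting every source separately.

    image_sources maps an image name to its source name (toy data, assumed).
    """
    rng = random.Random(seed)
    by_source = defaultdict(list)
    for image, source in sorted(image_sources.items()):
        by_source[source].append(image)

    assignment = {}
    for source, images in by_source.items():
        rng.shuffle(images)
        n_train = round(train_frac * len(images))
        for i, image in enumerate(images):
            assignment[image] = "train" if i < n_train else "test"
    return assignment


def zeroshot_split(image_sources, held_out_sources):
    """Send every image from a held-out source to 'test', everything else to 'train'."""
    return {
        image: "test" if source in held_out_sources else "train"
        for image, source in image_sources.items()
    }


# Toy example: two sources with ten images each.
image_sources = {f"{src}_{i}.png": src for src in ("source_a", "source_b") for i in range(10)}
random_assignment = random_split(image_sources)
zeroshot_assignment = zeroshot_split(image_sources, held_out_sources={"source_b"})
```

In the random scheme every source appears in both train and test; in the zeroshot scheme a held-out source never appears in training.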

Data Format:

Input (x): RGB aerial imagery

Labels (y): Nx4 array of bounding box coordinates

Metadata: Location identifiers for each image

Parameters:
  • version (str) – The version of the dataset to load.

  • root_dir (str) – The root directory to store the dataset.

  • download (bool) – Whether to download the dataset if it is not already present.

  • split_scheme (str) – The split scheme to use.

  • geometry_name (str) – The name of the geometry to use.

  • eval_score_threshold (float) – The threshold for the evaluation score.

  • remove_incomplete (bool) – Whether to remove incomplete data.

  • image_size (int) – The size of the image to use.

  • include_sources (list) – The sources to include.

  • exclude_sources (list) – The sources to exclude.

  • include_unsupervised (bool) – If True, include unsupervised data in addition to any other selected sources (unless explicitly excluded).

  • mini (bool) – If True, download mini versions of datasets for development. Mini datasets are smaller subsets that maintain the same structure.

  • unsupervised_args (dict) – The arguments to pass to the unsupervised download pipeline.
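As a minimal sketch of how a score threshold such as eval_score_threshold is commonly applied, the snippet below drops predicted boxes whose confidence is not above the threshold before scoring. The prediction layout (a dict with "y" and "scores" lists) and the helper name filter_by_score are assumptions for illustration, not the package's confirmed format.

```python
def filter_by_score(prediction, threshold=0.1):
    """Keep only the boxes whose confidence score is above `threshold`."""
    kept = [
        (box, score)
        for box, score in zip(prediction["y"], prediction["scores"])
        if score > threshold
    ]
    return {
        "y": [box for box, _ in kept],
        "scores": [score for _, score in kept],
    }


# Hypothetical prediction for one image: boxes are (x_min, y_min, x_max, y_max).
prediction = {
    "y": [(10, 10, 50, 60), (30, 40, 80, 90), (5, 5, 12, 14)],
    "scores": [0.92, 0.35, 0.04],
}
filtered = filter_by_score(prediction, threshold=0.1)
```

A low threshold keeps more candidate boxes, which tends to favour recall over precision.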

References

Website: https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1009180

Citation:

@article{Weinstein2020,
  title={A benchmark dataset for canopy crown detection and delineation in co-registered airborne RGB, LiDAR and hyperspectral imagery from the National Ecological Observation Network.},
  author={Weinstein BG, Graves SJ, Marconi S, Singh A, Zare A, Stewart D, et al.},
  journal={PLoS Comput Biol},
  year={2021},
  doi={10.1371/journal.pcbi.1009180}
}

License: Creative Commons Attribution License

eval(y_pred, y_true, metadata, *, viz_dir=None, viz_n_per_source=4)[source]

Performs evaluation on the given predictions.

The main evaluation metric, detection_acc_avg_dom, measures the simple average of the detection accuracies of each domain.

If viz_dir is set, writes overlay PNGs (purple = ground truth, orange = predictions above the eval score threshold), up to viz_n_per_source images per source, in subfolders named by source.
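The aggregation behind detection_acc_avg_dom, a simple (unweighted) average of per-domain accuracies, can be sketched as follows. How the per-image accuracy itself is computed is not specified here, so the accuracy values and the function name average_over_domains are placeholders.

```python
from collections import defaultdict


def average_over_domains(per_image_accuracy, domains):
    """Average accuracies within each domain, then average the domain means.

    Every domain contributes equally, regardless of how many images it has.
    """
    grouped = defaultdict(list)
    for acc, domain in zip(per_image_accuracy, domains):
        grouped[domain].append(acc)
    domain_means = {d: sum(v) / len(v) for d, v in grouped.items()}
    overall = sum(domain_means.values()) / len(domain_means)
    return overall, domain_means


# Placeholder accuracies: three images from source 0, one from source 1.
overall, domain_means = average_over_domains([1.0, 0.5, 0.0, 1.0], [0, 0, 0, 1])
```

Averaging over domains rather than over images keeps a source with many images from dominating the score.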

get_input(idx)[source]

Retrieves the input features (image) for a given data point.

Parameters:

idx (int) – Index of a data point

Returns:

Input features of the idx-th data point (image) as a normalized numpy array.

Return type:

np.ndarray

milliontrees.datasets.TreePoints module

class milliontrees.datasets.TreePoints.TreePointsDataset(version=None, root_dir='data', download=False, split_scheme='random', geometry_name='y', remove_incomplete=False, distance_threshold=0.02, include_sources=None, exclude_sources=None, mini=False, image_size=448, verbose=True, include_unsupervised=False)[source]

Bases: MillionTreesDataset

The TreePoints dataset is a collection of trees annotated as (x, y) point locations.

Dataset Splits:
  • random: For each source, 80% of the data is used for training and 20% for testing.

  • crossgeometry: Boxes and Points are used to predict polygons.

  • zeroshot: Selected sources are entirely held out for testing.

Input (x):

RGB aerial images

Label (y):

y is an n x 2 matrix where each row represents a keypoint (x, y)
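To relate this label format to the box format used by TreeBoxes, the sketch below reduces each (x_min, y_min, x_max, y_max) box to its centre, giving one (x, y) row per tree. The conversion and the name boxes_to_points are illustrative assumptions, not necessarily how the package derives points.

```python
def boxes_to_points(boxes):
    """Convert rows of (x_min, y_min, x_max, y_max) into rows of (x, y) box centres."""
    return [
        ((x_min + x_max) / 2, (y_min + y_max) / 2)
        for x_min, y_min, x_max, y_max in boxes
    ]


boxes = [(10, 20, 30, 60), (0, 0, 8, 4)]
points = boxes_to_points(boxes)  # n x 2: one keypoint per tree
```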

Metadata:
Each image is annotated with the following metadata
  • location (int): location id

License:

This dataset is distributed under Creative Commons Attribution License

eval(y_pred, y_true, metadata, *, viz_dir=None, viz_n_per_source=4)[source]

The main evaluation metric, detection_acc_avg_dom, measures the simple average of the detection accuracies of each domain.

Optional viz_dir / viz_n_per_source write qualitative overlays (see TreeBoxes.eval).

get_annotation_from_filename(filename)[source]
get_input(idx)[source]
Parameters:

idx (int) – Index of a data point

Output:
  • x (np.ndarray): Input features of the idx-th data point

milliontrees.datasets.TreePolygons module

class milliontrees.datasets.TreePolygons.TreePolygonsDataset(version=None, root_dir='data', download=False, split_scheme='random', geometry_name='y', eval_score_threshold=0.5, image_size=448, remove_incomplete=False, include_sources=None, exclude_sources=None, mini=False, verbose=True, include_unsupervised=False)[source]

Bases: MillionTreesDataset

The TreePolygons dataset is a collection of trees annotated as multi-point polygons.

The dataset comprises many sources from across the world.

Dataset Splits:
  • random: For each source, 80% of the data is used for training and 20% for testing.

  • crossgeometry: Boxes and Points are used to predict polygons.

  • zeroshot: Selected sources are entirely held out for testing.

Input (x):

RGB aerial images.

Label (y):

y is an n x 2 array where each row represents a point coordinate (x, y).

Metadata:
Each image is annotated with the following metadata:
  • location (int): location id

  • source (int): source id

License:

This dataset is distributed under the Creative Commons Attribution License.

create_polygon_mask(width, height, vertices)[source]

Create a grayscale image with a white polygonal area on a black background.

Parameters:
  • width (int) – Width of the output image.

  • height (int) – Height of the output image.

  • vertices (shapely.geometry.Polygon) – A shapely Polygon object representing the polygon.

Returns:

mask_img (np.ndarray) – A numpy array representing the image with the drawn polygon.
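The idea of drawing a white polygon on a black background can be sketched without any imaging library. This is not the package's implementation: here vertices is a plain list of (x, y) tuples rather than a shapely Polygon, the result is a nested list rather than a numpy array, and the function name polygon_mask is hypothetical.

```python
def polygon_mask(width, height, vertices):
    """Rasterise a polygon into a height x width grid: 255 inside, 0 outside.

    Pixel centres are tested with the even-odd ray-casting rule.
    """
    mask = [[0] * width for _ in range(height)]
    n = len(vertices)
    for row in range(height):
        py = row + 0.5
        for col in range(width):
            px = col + 0.5
            inside = False
            for i in range(n):
                x1, y1 = vertices[i]
                x2, y2 = vertices[(i + 1) % n]
                # Does a horizontal ray from the pixel centre cross this edge?
                if (y1 > py) != (y2 > py):
                    x_cross = x1 + (py - y1) * (x2 - x1) / (y2 - y1)
                    if px < x_cross:
                        inside = not inside
            if inside:
                mask[row][col] = 255
    return mask


# A 4 x 4 square inside a 6 x 6 image.
mask = polygon_mask(6, 6, [(1, 1), (5, 1), (5, 5), (1, 5)])
```

The per-pixel loop is quadratic in image size and only meant to show the principle; a real implementation would use a library rasteriser.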

eval(y_pred, y_true, metadata, *, viz_dir=None, viz_n_per_source=4)[source]

The main evaluation metric, detection_acc_avg_dom, measures the simple average of the detection accuracies of each domain.

Optional viz_dir / viz_n_per_source write qualitative overlays (purple = GT masks, orange = predicted masks above the eval score threshold).

get_input(idx)[source]
Parameters:

idx (int) – Index of a data point

Output:
  • x (np.ndarray): Input features of the idx-th data point

milliontrees.datasets.download_utils module

This file contains utility functions for downloading datasets. The code in this file is taken from the torchvision package, specifically, https://github.com/pytorch/vision/blob/master/torchvision/datasets/utils.py. We package it here to avoid users having to install the rest of torchvision. It is licensed under the following license:

BSD 3-Clause License

Copyright (c) Soumith Chintala 2016, All rights reserved.

Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met:

  • Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer.

  • Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following disclaimer in the documentation and/or other materials provided with the distribution.

  • Neither the name of the copyright holder nor the names of its contributors may be used to endorse or promote products derived from this software without specific prior written permission.

THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS “AS IS” AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.

milliontrees.datasets.download_utils.calculate_md5(fpath: str, chunk_size: int = 1048576) → str[source]
milliontrees.datasets.download_utils.check_integrity(fpath: str, md5: str | None = None) → bool[source]
milliontrees.datasets.download_utils.check_md5(fpath: str, md5: str, **kwargs: Any) → bool[source]
milliontrees.datasets.download_utils.download_and_extract_archive(url: str, download_root: str, extract_root: str | None = None, filename: str | None = None, md5: str | None = None, remove_finished: bool = False, size: int | None = None) → None[source]
milliontrees.datasets.download_utils.download_file_from_google_drive(file_id: str, root: str, filename: str | None = None, md5: str | None = None)[source]

Download a file from Google Drive and place it in root.

Parameters:
  • file_id (str) – id of file to be downloaded

  • root (str) – Directory to place downloaded file in

  • filename (str, optional) – Name to save the file under. If None, use the id of the file.

  • md5 (str, optional) – MD5 checksum of the download. If None, do not check

milliontrees.datasets.download_utils.download_url(url: str, root: str, filename: str | None = None, md5: str | None = None, size: int | None = None) → None[source]

Download a file from a url and place it in root.

Parameters:
  • url (str) – URL to download file from

  • root (str) – Directory to place downloaded file in

  • filename (str, optional) – Name to save the file under. If None, use the basename of the URL

  • md5 (str, optional) – MD5 checksum of the download. If None, do not check
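The MD5 check referred to by the md5 parameter can be sketched with the standard library. The two functions below mirror the signatures of calculate_md5 and check_integrity listed in this module, but they are written from those signatures as an illustration and are not copied from the package source.

```python
import hashlib
import os
import tempfile


def calculate_md5(fpath, chunk_size=1024 * 1024):
    """Hash a file in fixed-size chunks so large archives never sit fully in memory."""
    md5 = hashlib.md5()
    with open(fpath, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            md5.update(chunk)
    return md5.hexdigest()


def check_integrity(fpath, md5=None):
    """Return False for a missing file; skip the checksum when md5 is None."""
    if not os.path.isfile(fpath):
        return False
    if md5 is None:
        return True
    return calculate_md5(fpath) == md5


# Usage on a throwaway file.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "example.bin")
    with open(path, "wb") as f:
        f.write(b"hello")
    digest = calculate_md5(path)
    ok = check_integrity(path, digest)
    missing = check_integrity(os.path.join(tmp, "absent.bin"))
```

Passing md5=None only confirms that the file exists, which matches the "If None, do not check" wording above.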

milliontrees.datasets.download_utils.extract_archive(from_path: str, to_path: str | None = None, remove_finished: bool = False) → None[source]
milliontrees.datasets.download_utils.gen_bar_updater(total) → Callable[[int, int, int], None][source]
milliontrees.datasets.download_utils.iterable_to_str(iterable: Iterable) → str[source]
milliontrees.datasets.download_utils.list_dir(root: str, prefix: bool = False) → List[str][source]

List all directories at a given root.

Parameters:
  • root (str) – Path to directory whose folders need to be listed

  • prefix (bool, optional) – If true, prepends the path to each result, otherwise only returns the name of the directories found

milliontrees.datasets.download_utils.list_files(root: str, suffix: str, prefix: bool = False) → List[str][source]

List all files ending with a suffix at a given root.

Parameters:
  • root (str) – Path to directory whose files need to be listed

  • suffix (str or tuple) – Suffix of the files to match, e.g. ‘.png’ or (‘.jpg’, ‘.png’). The value is passed directly to Python’s “str.endswith” method.

  • prefix (bool, optional) – If true, prepends the path to each result, otherwise only returns the name of the files found
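The behaviour described by these parameters can be sketched as below. It is an illustration written from the documented parameters, not the package source; in particular, sorting the result is an addition made here so the output is deterministic.

```python
import os
import tempfile


def list_files(root, suffix, prefix=False):
    """List files directly under `root` whose names end with `suffix`."""
    root = os.path.expanduser(root)
    files = sorted(
        name
        for name in os.listdir(root)
        if os.path.isfile(os.path.join(root, name)) and name.endswith(suffix)
    )
    if prefix:
        files = [os.path.join(root, name) for name in files]
    return files


with tempfile.TemporaryDirectory() as tmp:
    for name in ("a.png", "b.jpg", "notes.txt"):
        open(os.path.join(tmp, name), "w").close()
    os.mkdir(os.path.join(tmp, "sub.png"))  # a directory, so it is skipped
    images = list_files(tmp, (".jpg", ".png"))
    full_paths = list_files(tmp, ".png", prefix=True)
    expected_full = [os.path.join(tmp, "a.png")]
```

Because suffix goes straight to str.endswith, a tuple of suffixes matches any of them.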

milliontrees.datasets.download_utils.verify_str_arg(value: T, arg: str | None = None, valid_values: Iterable[T] = None, custom_msg: str | None = None) → T[source]

milliontrees.datasets.milliontrees_dataset module

class milliontrees.datasets.milliontrees_dataset.MillionTreesDataset(root_dir, download, split_scheme)[source]

Bases: object

Shared dataset class for all MillionTrees datasets.

Each data point in the dataset is a tuple (x, y, metadata), where:
  • x: The input features

  • y: The target

  • metadata: A vector of relevant information (e.g., domain). For convenience, metadata also contains y.

DEFAULT_SOURCE_DOMAIN_SPLITS = [0]
DEFAULT_SPLITS = {'train': 0, 'val': 1}
DEFAULT_SPLIT_NAMES = {'train': 'Train', 'val': 'Validation'}
check_init()[source]

Convenience function to check that the dataset is properly configured.

check_version()[source]
property collate

Torch function to collate items in a batch.

By default returns None -> uses default torch collate.

property data_dir

The full path to the folder in which the dataset is stored.

dataset_exists_locally(data_dir, version_file)[source]
property dataset_name

A string that identifies the dataset, e.g., ‘amazon’, ‘camelyon17’.

download_dataset(data_dir, download_flag)[source]
eval(y_pred, y_true, metadata, *, viz_dir=None, viz_n_per_source=4)[source]
Parameters:
  • y_pred – Predicted targets per image

  • y_true – True targets per image

  • metadata – Metadata rows aligned with predictions

  • viz_dir – If set, write up to viz_n_per_source overlay PNGs per source_id under this directory (see eval_visualization).

  • viz_n_per_source – Max images to save per source when viz_dir is set.

Output:
  • results (dict): Dictionary of results (may include eval_visualization_paths)

  • results_str (str): Pretty print version of the results

get_input(idx)[source]
Parameters:

idx (int) – Index of a data point

Output:
  • x (Tensor): Input features of the idx-th data point

get_subset(split, frac=1.0, transform=None)[source]
Parameters:
  • split – Split identifier, e.g., ‘train’, ‘val’, ‘test’. Must be in self.split_dict.

  • frac – What fraction of the split to randomly sample. Used for fast development on a small dataset.

  • transform – Any data transformations to be applied to the input x.

Output:
  • subset (MillionTreesSubset): A (potentially subsampled) subset of the dataset.
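How split and frac interact with split_array and split_dict can be sketched in plain Python. The function name subset_indices, the seed argument, and the rounding rule are assumptions for illustration; the real method returns a MillionTreesSubset rather than a list of indices.

```python
import random


def subset_indices(split_array, split_dict, split, frac=1.0, seed=0):
    """Return the sorted indices of `split`, optionally subsampled to `frac`."""
    if split not in split_dict:
        raise ValueError(f"Split {split} not found in split_dict.")
    indices = [i for i, s in enumerate(split_array) if s == split_dict[split]]
    if frac < 1.0:
        n_keep = round(len(indices) * frac)
        indices = sorted(random.Random(seed).sample(indices, n_keep))
    return indices


split_dict = {"train": 0, "val": 1, "test": 2}
split_array = [0, 0, 1, 0, 2, 0, 1, 0, 0, 2]

train_idx = subset_indices(split_array, split_dict, "train")
half_train_idx = subset_indices(split_array, split_dict, "train", frac=0.5)
```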

get_tree_coverage_mask(idx, image_shape)[source]

Load a precomputed tree/no-tree mask for an image if available.

initialize_data_dir(root_dir, download)[source]

Helper function for downloading/updating the dataset if required.

Note that the version check is only performed for datasets where the download_url is set. Datasets whose download is not controlled by this package might not handle versions similarly.

property is_detection

Boolean.

True if the task is detection, and false otherwise.

property latest_version
property metadata_array

A Tensor of metadata, with the i-th row representing the metadata associated with the i-th data point.

The columns correspond to metadata_fields.

property metadata_fields

A list of strings naming each column of the metadata table, e.g., [‘hospital’, ‘y’].

Must include ‘y’.

property metadata_map

An optional dictionary that, for each metadata field, contains a list that maps from integers (in metadata_array) to a string representing what that integer means.

This is only used for logging, so that we print out more intelligible metadata values. Each key must be in metadata_fields. For example, if we have

metadata_fields = [‘hospital’, ‘y’]
metadata_map = {‘hospital’: [‘East’, ‘West’]}

then if metadata_array[i, 0] == 0, the i-th data point belongs to the ‘East’ hospital, while if metadata_array[i, 0] == 1, it belongs to the ‘West’ hospital.
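The lookup in this example can be written out as a short sketch. The helper decode_row and the plain nested list standing in for metadata_array are illustrative assumptions; the field names are the ones used in the example above.

```python
metadata_fields = ["hospital", "y"]
metadata_map = {"hospital": ["East", "West"]}
metadata_array = [[0, 1], [1, 0], [1, 1]]  # one row per data point


def decode_row(row, fields, mapping):
    """Replace integer codes with readable names for fields present in `mapping`."""
    return {
        field: mapping[field][value] if field in mapping else value
        for field, value in zip(fields, row)
    }


decoded = [decode_row(row, metadata_fields, metadata_map) for row in metadata_array]
```

Fields without an entry in metadata_map, such as ‘y’ here, keep their integer values.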

property n_classes

Number of classes for single-task classification datasets.

Used for logging and to configure models to produce appropriately-sized output. None by default. Leave as None if not applicable (e.g., regression or multi-task classification).

property original_resolution

Original image resolution for image datasets.

property source_domain_splits

List of split IDs that are from the source domain.

property split_array

An array of integers, with split_array[i] representing what split the i-th data point belongs to.

property split_dict

A dictionary mapping splits to integer identifiers (used in split_array), e.g., {‘train’: 0, ‘val’: 1, ‘test’: 2}.

Keys should match up with split_names.

property split_names

A dictionary mapping splits to their pretty names, e.g., {‘train’: ‘Train’, ‘val’: ‘Validation’, ‘test’: ‘Test’}.

Keys should match up with split_dict.

property split_scheme

A string identifier of how the split is constructed, e.g., ‘standard’, ‘mixed-to-test’, ‘user’, etc.

static standard_eval(metric, y_pred, y_true)[source]
Parameters:
  • metric – Metric to use for eval

  • y_pred – Predicted targets

  • y_true – True targets

Output:
  • results (dict): Dictionary of results

  • results_str (str): Pretty print version of the results

static standard_group_eval(metric, grouper, y_pred, y_true, metadata, aggregate=True)[source]
Parameters:
  • metric – Metric to use for eval

  • grouper – Grouper object that converts metadata into groups

  • y_pred – Predicted targets

  • y_true – True targets

  • metadata – Metadata

Output:
  • results (dict): Dictionary of results

  • results_str (str): Pretty print version of the results

property version

A string that identifies the dataset version, e.g., ‘1.0’.

property versions_dict

A dictionary where each key is a version string (e.g., ‘1.0’) and each value is a dictionary containing the ‘download_url’ and ‘compressed_size’ keys.

‘download_url’ is the URL for downloading the dataset archive. If None, the dataset cannot be downloaded automatically (e.g., because it first requires accepting a usage agreement).

‘compressed_size’ is the approximate size of the compressed dataset in bytes.
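The structure described above can be sketched with a hypothetical versions_dict. The URLs and sizes are invented placeholders, and comparing versions numerically as major.minor is an assumption; how the package resolves latest_version is not documented here.

```python
# Hypothetical contents; real URLs and sizes are defined by each dataset class.
versions_dict = {
    "0.1": {"download_url": "https://example.com/dataset_v0.1.zip", "compressed_size": 105_000_000},
    "0.10": {"download_url": None, "compressed_size": 2_300_000_000},
    "0.2": {"download_url": "https://example.com/dataset_v0.2.zip", "compressed_size": 1_200_000_000},
}


def version_key(version):
    """Turn 'major.minor' into a tuple of ints so '0.10' sorts after '0.2'."""
    return tuple(int(part) for part in version.split("."))


latest = max(versions_dict, key=version_key)
downloadable = [v for v, info in versions_dict.items() if info["download_url"] is not None]
```

Comparing version strings as text would rank ‘0.2’ above ‘0.10’, which is why the sketch converts each part to an integer first.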

property y_array

A Tensor of targets (e.g., labels for classification tasks), with y_array[i] representing the target of the i-th data point.

y_array[i] can contain multiple elements.

property y_size

The number of dimensions/elements in the target, i.e., len(y_array[i]).

For standard classification/regression tasks, y_size = 1. For multi-task or structured prediction settings, y_size > 1. Used for logging and to configure models to produce appropriately-sized output.

class milliontrees.datasets.milliontrees_dataset.MillionTreesSubset(dataset, indices, transform=None, geometry_name='y')[source]

Bases: MillionTreesDataset

eval(y_pred, y_true, metadata, *, viz_dir=None, viz_n_per_source=4)[source]
Parameters:
  • y_pred – Predicted targets per image

  • y_true – True targets per image

  • metadata – Metadata rows aligned with predictions

  • viz_dir – If set, write up to viz_n_per_source overlay PNGs per source_id under this directory (see eval_visualization).

  • viz_n_per_source – Max images to save per source when viz_dir is set.

Output:
  • results (dict): Dictionary of results (may include eval_visualization_paths)

  • results_str (str): Pretty print version of the results

property metadata_array

A Tensor of metadata, with the i-th row representing the metadata associated with the i-th data point.

The columns correspond to metadata_fields.

property split_array

An array of integers, with split_array[i] representing what split the i-th data point belongs to.

property y_array

A Tensor of targets (e.g., labels for classification tasks), with y_array[i] representing the target of the i-th data point.

y_array[i] can contain multiple elements.

Module contents