milliontrees.datasets package¶
Submodules¶
milliontrees.datasets.TreeBoxes module¶
- class milliontrees.datasets.TreeBoxes.TreeBoxesDataset(version=None, root_dir='data', download=False, split_scheme='random', geometry_name='y', eval_score_threshold=0.1, remove_incomplete=False, image_size=448, include_sources=None, exclude_sources=None, mini=False, verbose=True, include_unsupervised=False)[source]¶
Bases: MillionTreesDataset
A dataset of tree annotations with bounding box coordinates from multiple global sources.
The dataset contains aerial imagery of trees with their corresponding bounding box annotations. Each tree is annotated with a bounding box given by four coordinates (x_min, y_min, x_max, y_max).
- Dataset Splits:
random: For each source, 80% of the data is used for training and 20% for testing.
crossgeometry: Boxes and Points are used to predict polygons.
zeroshot: Selected sources are entirely held out for testing.
- Data Format:
Input (x): RGB aerial imagery
Labels (y): Nx4 array of bounding box coordinates
Metadata: Location identifiers for each image
- Parameters:
version (str) – The version of the dataset to load.
root_dir (str) – The root directory to store the dataset.
download (bool) – Whether to download the dataset if it is not already present.
split_scheme (str) – The split scheme to use.
geometry_name (str) – The name of the geometry to use.
eval_score_threshold (float) – The threshold for the evaluation score.
remove_incomplete (bool) – Whether to remove incomplete data.
image_size (int) – The size of the image to use.
include_sources (list) – The sources to include.
exclude_sources (list) – The sources to exclude.
include_unsupervised (bool) – If True, include unsupervised data in addition to any other selected sources (unless explicitly excluded).
mini (bool) – If True, download mini versions of datasets for development. Mini datasets are smaller subsets that maintain the same structure.
unsupervised_args (dict) – The arguments to pass to the unsupervised download pipeline.
References
Website: https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1009180
- Citation:
@article{Weinstein2020,
  title={A benchmark dataset for canopy crown detection and delineation in co-registered airborne RGB, LiDAR and hyperspectral imagery from the National Ecological Observation Network.},
  author={Weinstein BG, Graves SJ, Marconi S, Singh A, Zare A, Stewart D, et al.},
  journal={PLoS Comput Biol},
  year={2021},
  doi={10.1371/journal.pcbi.1009180}
}
License: Creative Commons Attribution License
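The random and zeroshot split schemes listed under Dataset Splits can be illustrated with a minimal sketch. It assumes that the split is assigned per image and that each image carries a source label. The function names, the seed argument, and the source names are hypothetical and are not part of the milliontrees API.

```python
import random
from collections import defaultdict


def random_split(sources, train_frac=0.8, seed=0):
    """Assign each image to 'train' or 'test' independently within each source."""
    by_source = defaultdict(list)
    for idx, source in enumerate(sources):
        by_source[source].append(idx)

    rng = random.Random(seed)
    assignment = [None] * len(sources)
    for indices in by_source.values():
        rng.shuffle(indices)
        n_train = int(round(train_frac * len(indices)))
        for position, idx in enumerate(indices):
            assignment[idx] = "train" if position < n_train else "test"
    return assignment


def zeroshot_split(sources, held_out):
    """Send every image from a held-out source to 'test', all others to 'train'."""
    return ["test" if source in held_out else "train" for source in sources]


# Hypothetical source names, ten images each.
sources = ["source_a"] * 10 + ["source_b"] * 10
random_assignment = random_split(sources)
zeroshot_assignment = zeroshot_split(sources, held_out={"source_b"})
```

In the random scheme every source contributes to both partitions, whereas in the zeroshot scheme a held-out source never appears in training.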
- eval(y_pred, y_true, metadata, *, viz_dir=None, viz_n_per_source=4)[source]¶
Performs evaluation on the given predictions.
The main evaluation metric, detection_acc_avg_dom, measures the simple average of the detection accuracies of each domain.
If viz_dir is set, writes overlay PNGs (purple = ground truth, orange = predictions above the eval score threshold), up to viz_n_per_source images per source, in subfolders named by source.
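The "simple average of the detection accuracies of each domain" can be sketched as follows. How the per-image detection accuracy is computed is not shown in this reference, so the sketch takes those values as given. The function name and the integer domain ids are assumptions made for illustration.

```python
from collections import defaultdict


def average_over_domains(per_image_accuracy, domains):
    """Average accuracies within each domain, then average the domain means."""
    grouped = defaultdict(list)
    for accuracy, domain in zip(per_image_accuracy, domains):
        grouped[domain].append(accuracy)
    per_domain = {d: sum(v) / len(v) for d, v in grouped.items()}
    overall = sum(per_domain.values()) / len(per_domain)
    return overall, per_domain


# Hypothetical per-image detection accuracies and domain ids.
accuracies = [1.0, 0.5, 0.0, 1.0, 1.0, 1.0]
domains = [0, 0, 0, 0, 1, 1]
overall, per_domain = average_over_domains(accuracies, domains)
pooled = sum(accuracies) / len(accuracies)
```

Because each domain counts equally, the result (0.8125 here) can differ from the pooled mean over all images (0.75 here) when domains have different sizes.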
milliontrees.datasets.TreePoints module¶
- class milliontrees.datasets.TreePoints.TreePointsDataset(version=None, root_dir='data', download=False, split_scheme='random', geometry_name='y', remove_incomplete=False, distance_threshold=0.02, include_sources=None, exclude_sources=None, mini=False, image_size=448, verbose=True, include_unsupervised=False)[source]¶
Bases: MillionTreesDataset
The TreePoints dataset is a collection of tree annotations recorded as (x, y) point locations.
- Dataset Splits:
random: For each source, 80% of the data is used for training and 20% for testing.
crossgeometry: Boxes and Points are used to predict polygons.
zeroshot: Selected sources are entirely held out for testing.
- Input (x):
RGB aerial images
- Label (y):
y is an n x 2 matrix where each row represents a keypoint (x, y)
- Metadata:
- Each image is annotated with the following metadata
location (int): location id
- License:
This dataset is distributed under Creative Commons Attribution License
milliontrees.datasets.TreePolygons module¶
- class milliontrees.datasets.TreePolygons.TreePolygonsDataset(version=None, root_dir='data', download=False, split_scheme='random', geometry_name='y', eval_score_threshold=0.5, image_size=448, remove_incomplete=False, include_sources=None, exclude_sources=None, mini=False, verbose=True, include_unsupervised=False)[source]¶
Bases: MillionTreesDataset
The TreePolygons dataset is a collection of tree annotations recorded as multi-point polygons.
The dataset comprises many sources from across the world.
- Dataset Splits:
random: For each source, 80% of the data is used for training and 20% for testing.
crossgeometry: Boxes and Points are used to predict polygons.
zeroshot: Selected sources are entirely held out for testing.
- Input (x):
RGB aerial images.
- Label (y):
y is an n x 2 array where each row represents a point coordinate (x, y).
- Metadata:
- Each image is annotated with the following metadata:
location (int): location id
source (int): source id
- License:
This dataset is distributed under the Creative Commons Attribution License.
- create_polygon_mask(width, height, vertices)[source]¶
Create a grayscale image with a white polygonal area on a black background.
- Parameters:
width (int) – Width of the output image.
height (int) – Height of the output image.
vertices (shapely.geometry.Polygon) – A shapely Polygon object representing the polygon.
- Returns:
mask_img (np.ndarray) – A numpy array representing the image with the drawn polygon.
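The idea behind a polygon mask can be shown with a dependency-free sketch. It is not the library's implementation: the documented method takes a shapely Polygon and returns a numpy array, while this hypothetical polygon_mask function takes a plain list of (x, y) vertices and returns nested lists. The rule of filling a pixel when its centre lies inside the polygon is also an assumption.

```python
def polygon_mask(width, height, vertices):
    """Return a height x width grid with 255 inside the polygon and 0 outside.

    `vertices` is a list of (x, y) pixel coordinates. A pixel is filled when
    its centre lies inside the polygon (even-odd rule).
    """
    mask = [[0] * width for _ in range(height)]
    n = len(vertices)
    for row in range(height):
        py = row + 0.5
        for col in range(width):
            px = col + 0.5
            inside = False
            for i in range(n):
                x1, y1 = vertices[i]
                x2, y2 = vertices[(i + 1) % n]
                # Does a horizontal ray from the pixel centre cross this edge?
                if (y1 > py) != (y2 > py):
                    x_cross = x1 + (py - y1) * (x2 - x1) / (y2 - y1)
                    if px < x_cross:
                        inside = not inside
            if inside:
                mask[row][col] = 255
    return mask


# A 3 x 3 square inside a 6 x 5 image.
square = [(1, 1), (4, 1), (4, 4), (1, 4)]
mask = polygon_mask(width=6, height=5, vertices=square)
```

The white (255) region is the polygon interior and the black (0) region is the background, matching the grayscale convention described above.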
- eval(y_pred, y_true, metadata, *, viz_dir=None, viz_n_per_source=4)[source]¶
The main evaluation metric, detection_acc_avg_dom, measures the simple average of the detection accuracies of each domain.
The optional viz_dir and viz_n_per_source arguments write qualitative overlays (purple = ground-truth masks, orange = predicted masks above the eval score threshold).
milliontrees.datasets.download_utils module¶
This file contains utility functions for downloading datasets. The code in this file is taken from the torchvision package, specifically, https://github.com/pytorch/vision/blob/master/torchvision/datasets/utils.py. We package it here to avoid users having to install the rest of torchvision. It is licensed under the following license:
BSD 3-Clause License
Copyright (c) Soumith Chintala 2016, All rights reserved.
Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met:
Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer.
Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following disclaimer in the documentation and/or other materials provided with the distribution.
Neither the name of the copyright holder nor the names of its contributors may be used to endorse or promote products derived from this software without specific prior written permission.
THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS “AS IS” AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
- milliontrees.datasets.download_utils.calculate_md5(fpath: str, chunk_size: int = 1048576) → str[source]¶
- milliontrees.datasets.download_utils.check_integrity(fpath: str, md5: str | None = None) → bool[source]¶
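Neither function carries a description here, so the following is a sketch of what a chunked MD5 calculation and an integrity check with these signatures typically look like. The helper names md5_of_file and file_is_intact are hypothetical stand-ins, and the behaviour when md5 is None (existence alone counts as intact) is an assumption rather than a documented guarantee.

```python
import hashlib
import os
import tempfile


def md5_of_file(fpath, chunk_size=1024 * 1024):
    """Hash a file in fixed-size chunks so large archives never sit fully in memory."""
    digest = hashlib.md5()
    with open(fpath, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def file_is_intact(fpath, md5=None):
    """Return False for a missing file; with no checksum given, existence is enough."""
    if not os.path.isfile(fpath):
        return False
    if md5 is None:
        return True
    return md5_of_file(fpath) == md5


# Write a small temporary file and verify it against its own checksum.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "archive.bin")
    with open(path, "wb") as handle:
        handle.write(b"hello")
    checksum = md5_of_file(path, chunk_size=2)
    intact = file_is_intact(path, checksum)
    corrupted = file_is_intact(path, "0" * 32)
    missing = file_is_intact(os.path.join(tmp, "absent.bin"))
```

The chunk size affects only memory use, not the resulting digest.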
- milliontrees.datasets.download_utils.download_and_extract_archive(url: str, download_root: str, extract_root: str | None = None, filename: str | None = None, md5: str | None = None, remove_finished: bool = False, size: int | None = None) → None[source]¶
- milliontrees.datasets.download_utils.download_file_from_google_drive(file_id: str, root: str, filename: str | None = None, md5: str | None = None)[source]¶
Download a file from Google Drive and place it in root.
- Parameters:
file_id (str) – id of file to be downloaded
root (str) – Directory to place downloaded file in
filename (str, optional) – Name to save the file under. If None, use the id of the file.
md5 (str, optional) – MD5 checksum of the download. If None, do not check
- milliontrees.datasets.download_utils.download_url(url: str, root: str, filename: str | None = None, md5: str | None = None, size: int | None = None) → None[source]¶
Download a file from a url and place it in root.
- Parameters:
url (str) – URL to download file from
root (str) – Directory to place downloaded file in
filename (str, optional) – Name to save the file under. If None, use the basename of the URL
md5 (str, optional) – MD5 checksum of the download. If None, do not check
- milliontrees.datasets.download_utils.extract_archive(from_path: str, to_path: str | None = None, remove_finished: bool = False) → None[source]¶
- milliontrees.datasets.download_utils.gen_bar_updater(total) → Callable[[int, int, int], None][source]¶
- milliontrees.datasets.download_utils.list_dir(root: str, prefix: bool = False) → List[str][source]¶
List all directories at a given root.
- Parameters:
root (str) – Path to directory whose folders need to be listed
prefix (bool, optional) – If true, prepends the path to each result, otherwise only returns the name of the directories found
- milliontrees.datasets.download_utils.list_files(root: str, suffix: str, prefix: bool = False) → List[str][source]¶
List all files ending with a suffix at a given root.
- Parameters:
root (str) – Path to directory whose files need to be listed
suffix (str or tuple) – Suffix of the files to match, e.g. ‘.png’ or (‘.jpg’, ‘.png’). The value is passed directly to the Python “str.endswith” method
prefix (bool, optional) – If true, prepends the path to each result, otherwise only returns the name of the files found
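The behaviour of the prefix and suffix parameters described for list_dir and list_files can be sketched with the standard library. The helper names below are hypothetical, and sorting the results is an addition made here for deterministic output, not something the reference states.

```python
import os
import tempfile


def list_subdirectories(root, prefix=False):
    """Names (or full paths when prefix=True) of the directories directly under root."""
    names = sorted(
        n for n in os.listdir(root) if os.path.isdir(os.path.join(root, n))
    )
    return [os.path.join(root, n) for n in names] if prefix else names


def list_matching_files(root, suffix, prefix=False):
    """Names (or full paths) of files under root whose name ends with suffix."""
    names = sorted(
        n for n in os.listdir(root)
        if os.path.isfile(os.path.join(root, n)) and n.endswith(suffix)
    )
    return [os.path.join(root, n) for n in names] if prefix else names


with tempfile.TemporaryDirectory() as tmp:
    os.mkdir(os.path.join(tmp, "images"))
    for name in ("a.png", "b.jpg", "notes.txt"):
        open(os.path.join(tmp, name), "w").close()
    dirs = list_subdirectories(tmp)
    pngs = list_matching_files(tmp, ".png")
    images = list_matching_files(tmp, (".jpg", ".png"))
    full_paths = list_matching_files(tmp, ".png", prefix=True)
```

Passing a tuple as suffix works because str.endswith accepts either a single string or a tuple of strings.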
milliontrees.datasets.milliontrees_dataset module¶
- class milliontrees.datasets.milliontrees_dataset.MillionTreesDataset(root_dir, download, split_scheme)[source]¶
Bases: object
Shared dataset class for all MillionTrees datasets.
- Each data point in the dataset is a tuple (x, y, metadata), where:
x: The input features
y: The target
metadata: A vector of relevant information (e.g., domain). For convenience, metadata also contains y.
- DEFAULT_SOURCE_DOMAIN_SPLITS = [0]¶
- DEFAULT_SPLITS = {'train': 0, 'val': 1}¶
- DEFAULT_SPLIT_NAMES = {'train': 'Train', 'val': 'Validation'}¶
- property collate¶
Torch function to collate items in a batch.
By default returns None -> uses default torch collate.
- property data_dir¶
The full path to the folder in which the dataset is stored.
- property dataset_name¶
A string that identifies the dataset, e.g., ‘amazon’, ‘camelyon17’.
- eval(y_pred, y_true, metadata, *, viz_dir=None, viz_n_per_source=4)[source]¶
- Parameters:
y_pred (-) – Predicted targets per image
y_true (-) – True targets per image
metadata (-) – Metadata rows aligned with predictions
viz_dir (-) – If set, write up to viz_n_per_source overlay PNGs per source_id under this directory (see eval_visualization).
viz_n_per_source (-) – Max images to save per source when viz_dir is set.
- Output:
results (dict): Dictionary of results (may include eval_visualization_paths)
results_str (str): Pretty print version of the results
- get_input(idx)[source]¶
- Parameters:
idx (-) – Index of a data point
- Output:
x (Tensor): Input features of the idx-th data point
- get_subset(split, frac=1.0, transform=None)[source]¶
- Parameters:
split (-) – Split identifier, e.g., ‘train’, ‘val’, ‘test’. Must be in self.split_dict.
frac (-) – What fraction of the split to randomly sample. Used for fast development on a small dataset.
transform (-) – Any data transformations to be applied to the input x.
- Output:
subset (MillionTreesSubset): A (potentially subsampled) subset of the MillionTreesDataset.
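The selection that get_subset describes, looking up a split identifier and optionally sampling a fraction of it, can be sketched on plain Python lists. This is an illustration of the logic only: the function name subset_indices and its seed argument are hypothetical, and the real method returns a MillionTreesSubset rather than a list of indices.

```python
import random


def subset_indices(split_array, split_dict, split, frac=1.0, seed=0):
    """Indices of the data points in `split`, optionally subsampled to a fraction."""
    if split not in split_dict:
        raise ValueError(f"Split {split!r} not found in split_dict")
    split_id = split_dict[split]
    indices = [i for i, value in enumerate(split_array) if value == split_id]
    if frac < 1.0:
        n_keep = int(round(len(indices) * frac))
        indices = sorted(random.Random(seed).sample(indices, n_keep))
    return indices


split_dict = {"train": 0, "val": 1, "test": 2}
split_array = [0, 0, 1, 0, 2, 0, 1, 0, 2, 0]
train = subset_indices(split_array, split_dict, "train")
half_train = subset_indices(split_array, split_dict, "train", frac=0.5)
```

A small frac is convenient for quick development runs, since only a random sample of the chosen split is kept.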
- get_tree_coverage_mask(idx, image_shape)[source]¶
Load a precomputed tree/no-tree mask for an image if available.
- initialize_data_dir(root_dir, download)[source]¶
Helper function for downloading/updating the dataset if required.
Note that we only do a version check for datasets where the download_url is set. Currently, this includes all datasets except Yelp. Datasets for which we don’t control the download, like Yelp, might not handle versions similarly.
- property is_detection¶
Boolean.
True if the task is detection, and false otherwise.
- property latest_version¶
- property metadata_array¶
A Tensor of metadata, with the i-th row representing the metadata associated with the i-th data point.
The columns correspond to the metadata_fields defined above.
- property metadata_fields¶
A list of strings naming each column of the metadata table, e.g., [‘hospital’, ‘y’].
Must include ‘y’.
- property metadata_map¶
An optional dictionary that, for each metadata field, contains a list that maps from integers (in metadata_array) to a string representing what that integer means.
This is only used for logging, so that we print out more intelligible metadata values. Each key must be in metadata_fields. For example, if we have metadata_fields = [‘hospital’, ‘y’] metadata_map = {‘hospital’: [‘East’, ‘West’]} then if metadata_array[i, 0] == 0, the i-th data point belongs to the ‘East’ hospital while if metadata_array[i, 0] == 1, it belongs to the ‘West’ hospital.
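The hospital example above can be written out as a short sketch. It uses nested lists in place of the metadata Tensor, and the decode_metadata helper is hypothetical: it only illustrates how metadata_map turns integer codes into readable names for logging.

```python
def decode_metadata(metadata_array, metadata_fields, metadata_map):
    """Turn each integer metadata row into a dict, using metadata_map where available."""
    decoded = []
    for row in metadata_array:
        record = {}
        for field, value in zip(metadata_fields, row):
            names = metadata_map.get(field)
            # Fields without an entry in metadata_map keep their integer value.
            record[field] = names[value] if names is not None else value
        decoded.append(record)
    return decoded


metadata_fields = ["hospital", "y"]
metadata_map = {"hospital": ["East", "West"]}
metadata_array = [[0, 1], [1, 0]]
records = decode_metadata(metadata_array, metadata_fields, metadata_map)
```

The first row decodes to the ‘East’ hospital and the second to the ‘West’ hospital, while the ‘y’ column stays numeric because it has no entry in metadata_map.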
- property n_classes¶
Number of classes for single-task classification datasets.
Used for logging and to configure models to produce appropriately-sized output. None by default. Leave as None if not applicable (e.g., regression or multi-task classification).
- property original_resolution¶
Original image resolution for image datasets.
- property source_domain_splits¶
List of split IDs that are from the source domain.
- property split_array¶
An array of integers, with split_array[i] representing what split the i-th data point belongs to.
- property split_dict¶
A dictionary mapping splits to integer identifiers (used in split_array), e.g., {‘train’: 0, ‘val’: 1, ‘test’: 2}.
Keys should match up with split_names.
- property split_names¶
A dictionary mapping splits to their pretty names, e.g., {‘train’: ‘Train’, ‘val’: ‘Validation’, ‘test’: ‘Test’}.
Keys should match up with split_dict.
- property split_scheme¶
A string identifier of how the split is constructed, e.g., ‘standard’, ‘mixed-to-test’, ‘user’, etc.
- static standard_eval(metric, y_pred, y_true)[source]¶
- Parameters:
metric (-) – Metric to use for eval
y_pred (-) – Predicted targets
y_true (-) – True targets
- Output:
results (dict): Dictionary of results
results_str (str): Pretty print version of the results
- static standard_group_eval(metric, grouper, y_pred, y_true, metadata, aggregate=True)[source]¶
- Parameters:
metric (-) – Metric to use for eval
grouper (-) – Grouper object that converts metadata into groups
y_pred (-) – Predicted targets
y_true (-) – True targets
metadata (-) – Metadata
- Output:
results (dict): Dictionary of results
results_str (str): Pretty print version of the results
- property version¶
A string that identifies the dataset version, e.g., ‘1.0’.
- property versions_dict¶
A dictionary where each key is a version string (e.g., ‘1.0’) and each value is a dictionary containing the ‘download_url’ and ‘compressed_size’ keys.
‘download_url’ is the URL for downloading the dataset archive. If None, the dataset cannot be downloaded automatically (e.g., because it first requires accepting a usage agreement).
‘compressed_size’ is the approximate size of the compressed dataset in bytes.
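The structure of versions_dict can be illustrated with placeholder entries, together with one plausible way to pick the newest version. How the library derives latest_version is not documented here, so the numeric comparison below is an assumption, and the URL, sizes, and function name are illustrative rather than real.

```python
def pick_latest_version(versions_dict):
    """Return the highest version key, comparing dotted components numerically."""
    def version_key(version):
        return tuple(int(part) for part in version.split("."))
    return max(versions_dict, key=version_key)


# Placeholder entries; the URL and sizes are illustrative, not real releases.
versions_dict = {
    "0.9": {"download_url": None, "compressed_size": 1_000_000},
    "0.10": {
        "download_url": "https://example.com/dataset_v0.10.zip",
        "compressed_size": 2_000_000,
    },
}
latest = pick_latest_version(versions_dict)
downloadable = versions_dict[latest]["download_url"] is not None
string_max = max(versions_dict)
```

Comparing the components as integers matters: a plain string comparison would rank ‘0.9’ above ‘0.10’.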
- property y_array¶
A Tensor of targets (e.g., labels for classification tasks), with y_array[i] representing the target of the i-th data point.
y_array[i] can contain multiple elements.
- property y_size¶
The number of dimensions/elements in the target, i.e., len(y_array[i]).
For standard classification/regression tasks, y_size = 1. For multi-task or structured prediction settings, y_size > 1. Used for logging and to configure models to produce appropriately-sized output.
- class milliontrees.datasets.milliontrees_dataset.MillionTreesSubset(dataset, indices, transform=None, geometry_name='y')[source]¶
Bases: MillionTreesDataset
- eval(y_pred, y_true, metadata, *, viz_dir=None, viz_n_per_source=4)[source]¶
- Parameters:
y_pred (-) – Predicted targets per image
y_true (-) – True targets per image
metadata (-) – Metadata rows aligned with predictions
viz_dir (-) – If set, write up to viz_n_per_source overlay PNGs per source_id under this directory (see eval_visualization).
viz_n_per_source (-) – Max images to save per source when viz_dir is set.
- Output:
results (dict): Dictionary of results (may include eval_visualization_paths)
results_str (str): Pretty print version of the results
- property metadata_array¶
A Tensor of metadata, with the i-th row representing the metadata associated with the i-th data point.
The columns correspond to the metadata_fields defined above.
- property split_array¶
An array of integers, with split_array[i] representing what split the i-th data point belongs to.
- property y_array¶
A Tensor of targets (e.g., labels for classification tasks), with y_array[i] representing the target of the i-th data point.
y_array[i] can contain multiple elements.