dacapo.experiments.tasks.evaluators

Submodules

Classes

`DummyEvaluationScores`	The evaluation scores for the dummy task. The scores include the frizz level and blipp score. A higher frizz level indicates more frizz, while a higher blipp score indicates better performance.
`DummyEvaluator`	A class representing a dummy evaluator. This evaluator is used for testing purposes.
`EvaluationScores`	Base class for evaluation scores. This class is used to store the evaluation scores for a task.
`Evaluator`	Base class of all evaluators: An abstract class representing an evaluator that compares and evaluates the output array against the evaluation array.
`MultiChannelBinarySegmentationEvaluationScores`	Class representing evaluation scores for multi-channel binary segmentation tasks.
`BinarySegmentationEvaluationScores`	Class representing evaluation scores for binary segmentation tasks.
`BinarySegmentationEvaluator`	Given a binary segmentation, compute various metrics to determine their similarity.
`InstanceEvaluationScores`	The evaluation scores for the instance segmentation task. The scores include the variation of information (VOI) split, VOI merge, and VOI.
`InstanceEvaluator`	A class representing an evaluator for instance segmentation tasks.

Package Contents

class dacapo.experiments.tasks.evaluators.DummyEvaluationScores

The evaluation scores for the dummy task. The scores include the frizz level and blipp score. A higher frizz level indicates more frizz, while a higher blipp score indicates better performance.

frizz_level: float the frizz level

blipp_score: float the blipp score

higher_is_better(criterion): Return whether higher is better for the given criterion.

bounds(criterion): Return the bounds for the given criterion.

store_best(criterion): Return whether to store the best score for the given criterion.

Note

The DummyEvaluationScores class is used to store the evaluation scores for the dummy task. The class also provides methods to determine whether higher is better for a given criterion, the bounds for a given criterion, and whether to store the best score for a given criterion.

criteria = ['frizz_level', 'blipp_score']

The evaluation criteria.

Returns:

List[str]: the evaluation criteria

Raises:

NotImplementedError – if the function is not implemented

Examples

>>> DummyEvaluationScores.criteria
['frizz_level', 'blipp_score']

Note

This function is used to return the evaluation criteria.

frizz_level: float

blipp_score: float

static higher_is_better(criterion: str) → bool

Return whether higher is better for the given criterion.

Parameters:

criterion – str the evaluation criterion

Returns:

bool: whether higher is better for this criterion

Raises:

NotImplementedError – if the function is not implemented

Examples

>>> DummyEvaluationScores.higher_is_better("frizz_level")
True

Note

This function is used to determine whether higher is better for the given criterion.

static bounds(criterion: str) → Tuple[int | float | None, int | float | None]

Return the bounds for the given criterion.

Parameters:

criterion – str the evaluation criterion

Returns:

Tuple[Union[int, float, None], Union[int, float, None]]: the bounds for the given criterion

Raises:

NotImplementedError – if the function is not implemented

Examples

>>> DummyEvaluationScores.bounds("frizz_level")
(0.0, 1.0)

Note

This function is used to return the bounds for the given criterion.

static store_best(criterion: str) → bool

Return whether to store the best score for the given criterion.

Parameters:

criterion – str the evaluation criterion

Returns:

bool: whether to store the best score for the given criterion

Raises:

NotImplementedError – if the function is not implemented

Examples

>>> DummyEvaluationScores.store_best("frizz_level")
True

Note

This function is used to determine whether to store the best score for the given criterion.

class dacapo.experiments.tasks.evaluators.DummyEvaluator

A class representing a dummy evaluator. This evaluator is used for testing purposes.

criteria: List[str] the evaluation criteria

evaluate(output_array_identifier, evaluation_dataset): Evaluate the output array against the evaluation dataset.

score(): Return the evaluation scores.

Note

The DummyEvaluator class is used to evaluate the performance of a dummy task.

criteria = ['frizz_level', 'blipp_score']

A list of all criteria for which a model might be “best”, e.g. “precision”, “recall”, and “jaccard”. The best iteration and post-processing parameters are unlikely to be the same for all three of these criteria.

Returns:

List[str]: the evaluation criteria

Raises:

NotImplementedError – if the function is not implemented

Examples

>>> DummyEvaluator.criteria
['frizz_level', 'blipp_score']

Note

This function is used to return the evaluation criteria.

evaluate(output_array_identifier, evaluation_dataset)

Evaluate the given output array against the evaluation dataset and return the scores for the predefined criteria.

Parameters:

output_array_identifier – The output array to be evaluated.
evaluation_dataset – The dataset to be used for evaluation.

Returns:

An object of DummyEvaluationScores class, with the evaluation scores.

Return type:

DummyEvaluationScores

Raises:

ValueError – if the output array identifier is not valid

Examples

>>> dummy_evaluator = DummyEvaluator()
>>> output_array_identifier = "output_array"
>>> evaluation_dataset = "evaluation_dataset"
>>> dummy_evaluator.evaluate(output_array_identifier, evaluation_dataset)
DummyEvaluationScores(frizz_level=0.0, blipp_score=0.0)

Note

This function is used to evaluate the output array against the evaluation dataset.

property score: dacapo.experiments.tasks.evaluators.dummy_evaluation_scores.DummyEvaluationScores

Return the evaluation scores.

Returns: An object of DummyEvaluationScores class, with the evaluation scores.
Return type: DummyEvaluationScores

Examples

>>> dummy_evaluator = DummyEvaluator()
>>> dummy_evaluator.score
DummyEvaluationScores(frizz_level=0.0, blipp_score=0.0)

Note

This function is used to return the evaluation scores.

class dacapo.experiments.tasks.evaluators.EvaluationScores

Base class for evaluation scores. This class is used to store the evaluation scores for a task. The scores include the evaluation criteria. The class also provides methods to determine whether higher is better for a given criterion, the bounds for a given criterion, and whether to store the best score for a given criterion.

criteria: List[str] the evaluation criteria

higher_is_better(criterion): Return whether higher is better for the given criterion.

bounds(criterion): Return the bounds for the given criterion.

store_best(criterion): Return whether to store the best score for the given criterion.

Note

The EvaluationScores class is used to store the evaluation scores for a task. All evaluation scores should inherit from this class.

property criteria: List[str]

Abstractmethod:

The evaluation criteria.

Returns:

List[str]: the evaluation criteria

Raises:

NotImplementedError – if the function is not implemented

Examples

>>> evaluation_scores = EvaluationScores()
>>> evaluation_scores.criteria
["criterion1", "criterion2"]

Note

This function is used to return the evaluation criteria.

static higher_is_better(criterion: str) → bool

Abstractmethod:

Whether or not higher is better for this criterion.

Parameters:

criterion – str the evaluation criterion

Returns:

bool: whether higher is better for this criterion

Raises:

NotImplementedError – if the function is not implemented

Examples

>>> evaluation_scores = EvaluationScores()
>>> criterion = "criterion1"
>>> evaluation_scores.higher_is_better(criterion)
True

Note

This function is used to determine whether higher is better for a given criterion.

static bounds(criterion: str) → Tuple[int | float | None, int | float | None]

Abstractmethod:

The bounds for this criterion.

Parameters:

criterion – str the evaluation criterion

Returns:

Tuple[Union[int, float, None], Union[int, float, None]]: the bounds for this criterion

Raises:

NotImplementedError – if the function is not implemented

Examples

>>> evaluation_scores = EvaluationScores()
>>> criterion = "criterion1"
>>> evaluation_scores.bounds(criterion)
(0, 1)

Note

This function is used to return the bounds for the given criterion.

static store_best(criterion: str) → bool

Abstractmethod:

Whether or not to save the best validation block and model weights for this criterion.

Parameters:

criterion – str the evaluation criterion

Returns:

bool: whether to store the best score for this criterion

Raises:

NotImplementedError – if the function is not implemented

Examples

>>> evaluation_scores = EvaluationScores()
>>> criterion = "criterion1"
>>> evaluation_scores.store_best(criterion)
True

Note

This function is used to return whether to store the best score for the given criterion.
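The interface described above can be sketched as a small stand-alone example. This is an illustration only, not DaCapo's implementation: the names `ScoresBase` and `ToyScores`, the criteria `accuracy` and `error`, and all values are hypothetical and merely mirror the documented methods.

```python
from abc import ABC, abstractmethod
from typing import List, Optional, Tuple, Union

Bound = Optional[Union[int, float]]


class ScoresBase(ABC):
    """Hypothetical stand-in mirroring the documented EvaluationScores interface."""

    @property
    @abstractmethod
    def criteria(self) -> List[str]:
        ...

    @staticmethod
    @abstractmethod
    def higher_is_better(criterion: str) -> bool:
        ...

    @staticmethod
    @abstractmethod
    def bounds(criterion: str) -> Tuple[Bound, Bound]:
        ...

    @staticmethod
    @abstractmethod
    def store_best(criterion: str) -> bool:
        ...


class ToyScores(ScoresBase):
    """Concrete subclass with two made-up criteria."""

    def __init__(self, accuracy: float, error: float):
        self.accuracy = accuracy
        self.error = error

    @property
    def criteria(self) -> List[str]:
        return ["accuracy", "error"]

    @staticmethod
    def higher_is_better(criterion: str) -> bool:
        return {"accuracy": True, "error": False}[criterion]

    @staticmethod
    def bounds(criterion: str) -> Tuple[Bound, Bound]:
        # None marks a side without a finite bound.
        return {"accuracy": (0, 1), "error": (0, None)}[criterion]

    @staticmethod
    def store_best(criterion: str) -> bool:
        # Only keep weights/validation blocks for the criterion we care about.
        return criterion == "accuracy"


scores = ToyScores(accuracy=0.9, error=0.1)
print(scores.criteria)                      # ['accuracy', 'error']
print(ToyScores.higher_is_better("error"))  # False
print(ToyScores.bounds("error"))            # (0, None)
```

In this sketch the three lookups are static methods, so they can be called on the class without constructing a scores object, which matches how the class-level examples on this page call them.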

class dacapo.experiments.tasks.evaluators.Evaluator

Base class of all evaluators: An abstract class representing an evaluator that compares and evaluates the output array against the evaluation array.

An evaluator takes a post-processor’s output and compares it against ground-truth. It then returns a set of scores that can be used to determine the quality of the post-processor’s output.

best_scores: Dict[OutputIdentifier, BestScore] the best scores for each dataset/post-processing parameter/criterion combination

evaluate(output_array_identifier, evaluation_array): Compare and evaluate the output array against the evaluation array.

is_best(dataset, parameter, criterion, score): Check if the provided score is the best for this dataset/parameter/criterion combo.

get_overall_best(dataset, criterion): Return the best score for the given dataset and criterion.

get_overall_best_parameters(dataset, criterion): Return the best parameters for the given dataset and criterion.

compare(score_1, score_2, criterion): Compare two scores for the given criterion.

set_best(validation_scores): Find the best iteration for each dataset/post_processing_parameter/criterion.

higher_is_better(criterion): Return whether higher is better for the given criterion.

bounds(criterion): Return the bounds for the given criterion.

store_best(criterion): Return whether to store the best score for the given criterion.

Note

The Evaluator class is used to compare and evaluate the output array against the evaluation array.

abstract evaluate(output_array_identifier: dacapo.store.local_array_store.LocalArrayIdentifier, evaluation_array: dacapo.experiments.datasplits.datasets.arrays.Array) → dacapo.experiments.tasks.evaluators.evaluation_scores.EvaluationScores

Compares and evaluates the output array against the evaluation array.

Parameters:

output_array_identifier – LocalArrayIdentifier The identifier of the output array.
evaluation_array – Array The evaluation array.

Returns:

EvaluationScores: The evaluation scores.

Raises:

NotImplementedError – if the function is not implemented

Examples

>>> evaluator = Evaluator()
>>> output_array_identifier = LocalArrayIdentifier("output_array")
>>> evaluation_array = Array()
>>> evaluator.evaluate(output_array_identifier, evaluation_array)
EvaluationScores()

Note

This function is used to compare and evaluate the output array against the evaluation array.

property best_scores: Dict[OutputIdentifier, BestScore]

The best scores for each dataset/post-processing parameter/criterion combination.

Returns:

Dict[OutputIdentifier, BestScore]: the best scores for each dataset/post-processing parameter/criterion combination

Raises:

AttributeError – if the best scores are not set

Examples

>>> evaluator = Evaluator()
>>> evaluator.best_scores
{}

Note

This function is used to return the best scores for each dataset/post-processing parameter/criterion combination.

is_best(dataset: dacapo.experiments.datasplits.datasets.Dataset, parameter: dacapo.experiments.tasks.post_processors.PostProcessorParameters, criterion: str, score: dacapo.experiments.tasks.evaluators.evaluation_scores.EvaluationScores) → bool

Check if the provided score is the best for this dataset/parameter/criterion combo.

Parameters:

dataset – Dataset the dataset
parameter – PostProcessorParameters the post-processor parameters
criterion – str the criterion
score – EvaluationScores the evaluation scores

Returns:

bool: whether the provided score is the best for this dataset/parameter/criterion combo

Raises:

NotImplementedError – if the function is not implemented

Examples

>>> evaluator = Evaluator()
>>> dataset = Dataset()
>>> parameter = PostProcessorParameters()
>>> criterion = "criterion"
>>> score = EvaluationScores()
>>> evaluator.is_best(dataset, parameter, criterion, score)
False

Note

This function is used to check if the provided score is the best for this dataset/parameter/criterion combo.

get_overall_best(dataset: dacapo.experiments.datasplits.datasets.Dataset, criterion: str)

Return the best score for the given dataset and criterion.

Parameters:

dataset – Dataset the dataset
criterion – str the criterion

Returns:

Optional[float]: the best score for the given dataset and criterion

Raises:

NotImplementedError – if the function is not implemented

Examples

>>> evaluator = Evaluator()
>>> dataset = Dataset()
>>> criterion = "criterion"
>>> evaluator.get_overall_best(dataset, criterion)
None

Note

This function is used to return the best score for the given dataset and criterion.

get_overall_best_parameters(dataset: dacapo.experiments.datasplits.datasets.Dataset, criterion: str)

Return the best parameters for the given dataset and criterion.

Parameters:

dataset – Dataset the dataset
criterion – str the criterion

Returns:

Optional[PostProcessorParameters]: the best parameters for the given dataset and criterion

Raises:

NotImplementedError – if the function is not implemented

Examples

>>> evaluator = Evaluator()
>>> dataset = Dataset()
>>> criterion = "criterion"
>>> evaluator.get_overall_best_parameters(dataset, criterion)
None

Note

This function is used to return the best parameters for the given dataset and criterion.

compare(score_1, score_2, criterion)

Compare two scores for the given criterion.

Parameters:

score_1 – float the first score
score_2 – float the second score
criterion – str the criterion

Returns:

bool: whether the first score is better than the second score for the given criterion

Raises:

NotImplementedError – if the function is not implemented

Examples

>>> evaluator = Evaluator()
>>> score_1 = 0.0
>>> score_2 = 0.0
>>> criterion = "criterion"
>>> evaluator.compare(score_1, score_2, criterion)
False

Note

This function is used to compare two scores for the given criterion.
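One plausible reading of `compare` is that it defers to `higher_is_better` and uses a strict comparison, which would be consistent with the example above where two equal scores give `False`. The sketch below assumes exactly that; the free function and the `HIGHER_IS_BETTER` table are hypothetical stand-ins, not DaCapo code.

```python
# Hypothetical direction table; the criterion names are for illustration only.
HIGHER_IS_BETTER = {"f1_score": True, "voi": False}


def compare(score_1: float, score_2: float, criterion: str) -> bool:
    """Return True if score_1 is strictly better than score_2."""
    if HIGHER_IS_BETTER[criterion]:
        return score_1 > score_2
    return score_1 < score_2


print(compare(0.8, 0.6, "f1_score"))  # True: higher wins
print(compare(0.8, 0.6, "voi"))       # False: lower wins
print(compare(0.5, 0.5, "voi"))       # False: a tie is not "better"
```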

set_best(validation_scores: dacapo.experiments.validation_scores.ValidationScores) → None

Find the best iteration for each dataset/post_processing_parameter/criterion.

Parameters: validation_scores – ValidationScores the validation scores
Raises: NotImplementedError – if the function is not implemented

Examples

>>> evaluator = Evaluator()
>>> validation_scores = ValidationScores()
>>> evaluator.set_best(validation_scores)
None

Note

This function is used to find the best iteration for each dataset/post_processing_parameter/criterion. Typically, this function is called after the validation scores have been computed.
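The idea of finding a best iteration per criterion can be sketched as follows. This is a simplified illustration under stated assumptions: it ignores datasets and post-processing parameters, and the `history` layout, the `HIGHER_IS_BETTER` table, and the `best_iterations` helper are all hypothetical rather than part of DaCapo.

```python
from typing import Dict, Tuple

# Hypothetical validation history: iteration -> {criterion: score}.
history: Dict[int, Dict[str, float]] = {
    1000: {"f1_score": 0.61, "voi": 1.9},
    2000: {"f1_score": 0.74, "voi": 1.2},
    3000: {"f1_score": 0.70, "voi": 0.9},
}
HIGHER_IS_BETTER = {"f1_score": True, "voi": False}


def best_iterations(
    history: Dict[int, Dict[str, float]],
) -> Dict[str, Tuple[int, float]]:
    """Map each criterion to its (best iteration, best score)."""
    best: Dict[str, Tuple[int, float]] = {}
    for iteration in sorted(history):
        for criterion, score in history[iteration].items():
            current = best.get(criterion)
            if current is None:
                best[criterion] = (iteration, score)
                continue
            higher = HIGHER_IS_BETTER[criterion]
            improved = score > current[1] if higher else score < current[1]
            if improved:
                best[criterion] = (iteration, score)
    return best


print(best_iterations(history))
# {'f1_score': (2000, 0.74), 'voi': (3000, 0.9)}
```

Note that the two criteria pick different iterations here, which is why best scores are tracked per criterion rather than once per run.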

property criteria: List[str]

Abstractmethod:

A list of all criteria for which a model might be “best”, e.g. “precision”, “recall”, and “jaccard”. The best iteration and post-processing parameters are unlikely to be the same for all three of these criteria.

Returns:

List[str]: the evaluation criteria

Raises:

NotImplementedError – if the function is not implemented

Examples

>>> evaluator = Evaluator()
>>> evaluator.criteria
[]

Note

This function is used to return the evaluation criteria.

higher_is_better(criterion: str) → bool

Whether or not higher is better for this criterion.

Parameters:

criterion – str the criterion

Returns:

bool: whether higher is better for the given criterion

Raises:

NotImplementedError – if the function is not implemented

Examples

>>> evaluator = Evaluator()
>>> criterion = "criterion"
>>> evaluator.higher_is_better(criterion)
False

Note

This function is used to determine whether higher is better for the given criterion.

bounds(criterion: str) → Tuple[int | float | None, int | float | None]

The bounds for this criterion.

Parameters:

criterion – str the criterion

Returns:

Tuple[Union[int, float, None], Union[int, float, None]]: the bounds for the given criterion

Raises:

NotImplementedError – if the function is not implemented

Examples

>>> evaluator = Evaluator()
>>> criterion = "criterion"
>>> evaluator.bounds(criterion)
(0, 1)

Note

This function is used to return the bounds for the given criterion.

store_best(criterion: str) → bool

Whether or not to save the best validation block and model weights for this criterion.

Parameters:

criterion – str the criterion

Returns:

bool: whether to store the best score for the given criterion

Raises:

NotImplementedError – if the function is not implemented

Examples

>>> evaluator = Evaluator()
>>> criterion = "criterion"
>>> evaluator.store_best(criterion)
False

Note

This function is used to return whether to store the best score for the given criterion.

property score: dacapo.experiments.tasks.evaluators.evaluation_scores.EvaluationScores

Abstractmethod:

The evaluation scores.

Returns:

EvaluationScores: the evaluation scores

Raises:

NotImplementedError – if the function is not implemented

Examples

>>> evaluator = Evaluator()
>>> evaluator.score
EvaluationScores()

Note

This function is used to return the evaluation scores.

class dacapo.experiments.tasks.evaluators.MultiChannelBinarySegmentationEvaluationScores

Class representing evaluation scores for multi-channel binary segmentation tasks.

channel_scores

The list of channel scores.

Type:: List[Tuple[str, BinarySegmentationEvaluationScores]]

higher_is_better(criterion: str) -> bool: Determines whether a higher value is better for a given criterion.

store_best(criterion: str) -> bool: Whether or not to store the best weights/validation blocks for this criterion.

bounds(criterion: str) -> Tuple[Union[int, float, None], Union[int, float, None]]: Determines the bounds for a given criterion.

Notes

The evaluation scores are stored as attributes of the class. The class also contains methods to determine whether a higher value is better for a given criterion, whether or not to store the best weights/validation blocks for a given criterion, and the bounds for a given criterion.

channel_scores: List[Tuple[str, BinarySegmentationEvaluationScores]]

property criteria

Returns a list of all criteria for all channels.

Returns: The list of criteria.
Return type: List[str]
Raises: ValueError – If the criterion is not recognized.

Examples

>>> channel_scores = [("channel1", BinarySegmentationEvaluationScores()), ("channel2", BinarySegmentationEvaluationScores())]
>>> MultiChannelBinarySegmentationEvaluationScores(channel_scores).criteria

Notes

The method returns a list of all criteria for all channels. The criteria are stored as attributes of the class.

static higher_is_better(criterion: str) → bool

Determines whether a higher value is better for a given criterion.

Parameters: criterion (str) – The evaluation criterion.
Returns: True if a higher value is better, False otherwise.
Return type: bool
Raises: ValueError – If the criterion is not recognized.

Examples

>>> MultiChannelBinarySegmentationEvaluationScores.higher_is_better("channel1__dice")
True
>>> MultiChannelBinarySegmentationEvaluationScores.higher_is_better("channel1__f1_score")
True

Notes

The method returns True if a higher value is better for the criterion and False otherwise. Whether a higher value is better for a given criterion is determined by the mapping dictionary.

static store_best(criterion: str) → bool

Determines whether or not to store the best weights/validation blocks for a given criterion.

Parameters: criterion (str) – The evaluation criterion.
Returns: True if the best weights/validation blocks should be stored, False otherwise.
Return type: bool
Raises: ValueError – If the criterion is not recognized.

Examples

>>> MultiChannelBinarySegmentationEvaluationScores.store_best("channel1__dice")
False
>>> MultiChannelBinarySegmentationEvaluationScores.store_best("channel1__f1_score")
True

Notes

The method returns True if the best weights/validation blocks should be stored for the criterion and False otherwise. Whether or not to store the best weights/validation blocks for a given criterion is determined by the mapping dictionary.

static bounds(criterion: str) → Tuple[int | float | None, int | float | None]

Determines the bounds for a given criterion. The bounds are used to determine the best value for a given criterion.

Parameters: criterion (str) – The evaluation criterion.
Returns: The lower and upper bounds for the criterion.
Return type: Tuple[Union[int, float, None], Union[int, float, None]]
Raises: ValueError – If the criterion is not recognized.

Examples

>>> MultiChannelBinarySegmentationEvaluationScores.bounds("channel1__dice")
(0, 1)
>>> MultiChannelBinarySegmentationEvaluationScores.bounds("channel1__hausdorff")
(0, nan)

Notes

The method returns the lower and upper bounds for the criterion. The bounds are determined by the mapping dictionary.
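The examples for this class use criterion names such as `channel1__dice`. A minimal sketch of how such a name could be resolved, assuming the format is `<channel>__<criterion>` and that the lookup is delegated to a single-channel table, is shown below. The helper names and the table contents are hypothetical, not taken from DaCapo.

```python
from typing import Tuple

# Hypothetical single-channel direction table, for illustration only.
SINGLE_CHANNEL_HIGHER_IS_BETTER = {
    "dice": True,
    "f1_score": True,
    "hausdorff": False,
}


def split_criterion(criterion: str) -> Tuple[str, str]:
    """Split "<channel>__<criterion>" at the first double underscore."""
    channel, sep, name = criterion.partition("__")
    if not sep or not channel or not name:
        raise ValueError(f"unrecognized criterion: {criterion!r}")
    return channel, name


def higher_is_better(criterion: str) -> bool:
    """Look up the direction of the per-channel criterion."""
    _, name = split_criterion(criterion)
    if name not in SINGLE_CHANNEL_HIGHER_IS_BETTER:
        raise ValueError(f"unrecognized criterion: {criterion!r}")
    return SINGLE_CHANNEL_HIGHER_IS_BETTER[name]


print(split_criterion("channel1__f1_score"))   # ('channel1', 'f1_score')
print(higher_is_better("channel1__dice"))      # True
print(higher_is_better("channel2__hausdorff")) # False
```

Splitting only at the first `__` keeps criterion names that themselves contain single underscores, such as `f1_score`, intact.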

class dacapo.experiments.tasks.evaluators.BinarySegmentationEvaluationScores

Class representing evaluation scores for binary segmentation tasks.

The metrics include:

- Dice coefficient: 2 * |A ∩ B| / (|A| + |B|), where A and B are the binary segmentations
- Jaccard coefficient: |A ∩ B| / |A ∪ B|, where A and B are the binary segmentations
- Hausdorff distance: max(h(A, B), h(B, A)), where h(A, B) is the directed Hausdorff distance from A to B
- False negative rate: |A - B| / |A|, where A and B are the binary segmentations
- False positive rate: |B - A| / |B|, where A and B are the binary segmentations
- False discovery rate: |B - A| / |A|, where A and B are the binary segmentations
- VOI: variation of information; split and merge errors combined into a single measure of segmentation quality
- Mean false distance: 0.5 * (mean false positive distance + mean false negative distance)
- Mean false negative distance: mean distance of false negatives
- Mean false positive distance: mean distance of false positives
- Mean false distance clipped: 0.5 * (mean false positive distance clipped + mean false negative distance clipped), with distances clipped to a maximum distance
- Mean false negative distance clipped: mean distance of false negatives, clipped to a maximum distance
- Mean false positive distance clipped: mean distance of false positives, clipped to a maximum distance
- Precision with tolerance: TP / (TP + FP), where TP and FP are the true and false positives within a tolerance distance
- Recall with tolerance: TP / (TP + FN), where TP and FN are the true positives and false negatives within a tolerance distance
- F1 score with tolerance: 2 * (Recall * Precision) / (Recall + Precision), where Recall and Precision are computed with tolerance
- Precision: TP / (TP + FP), where TP and FP are the true and false positives
- Recall: TP / (TP + FN), where TP and FN are the true positives and false negatives
- F1 score: 2 * (Recall * Precision) / (Recall + Precision)
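The overlap-based metrics in this list can be illustrated with a short pure-Python sketch on flattened binary masks. It uses the standard textbook definitions and treats `truth` as A and `pred` as B; the function name `overlap_metrics` is hypothetical, and this is not how DaCapo computes the scores internally.

```python
from typing import Dict, Sequence


def overlap_metrics(truth: Sequence[int], pred: Sequence[int]) -> Dict[str, float]:
    """Compute overlap metrics for two equal-length flattened binary masks."""
    if len(truth) != len(pred):
        raise ValueError("masks must have the same number of elements")
    tp = sum(1 for t, p in zip(truth, pred) if t and p)
    fp = sum(1 for t, p in zip(truth, pred) if not t and p)
    fn = sum(1 for t, p in zip(truth, pred) if t and not p)
    # No guard against empty masks here; a real evaluator needs one
    # to avoid division by zero.
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return {
        "dice": 2 * tp / (2 * tp + fp + fn),  # 2|A ∩ B| / (|A| + |B|)
        "jaccard": tp / (tp + fp + fn),       # |A ∩ B| / |A ∪ B|
        "precision": precision,
        "recall": recall,
        "f1_score": 2 * precision * recall / (precision + recall),
    }


truth = [1, 1, 1, 0, 0, 0]
pred = [1, 1, 0, 1, 0, 0]
metrics = overlap_metrics(truth, pred)
print(metrics["jaccard"])  # 0.5
```

For binary masks the Dice coefficient and the F1 score are the same quantity, which the example reproduces (both are 2/3 up to rounding).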

dice

The Dice coefficient.

Type:: float

jaccard

The Jaccard index.

Type:: float

hausdorff

The Hausdorff distance.

Type:: float

false_negative_rate

The false negative rate.

Type:: float

false_negative_rate_with_tolerance

The false negative rate with tolerance.

Type:: float

false_positive_rate

The false positive rate.

Type:: float

false_discovery_rate

The false discovery rate.

Type:: float

false_positive_rate_with_tolerance

The false positive rate with tolerance.

Type:: float

voi

The variation of information.

Type:: float

mean_false_distance

The mean false distance.

Type:: float

mean_false_negative_distance

The mean false negative distance.

Type:: float

mean_false_positive_distance

The mean false positive distance.

Type:: float

mean_false_distance_clipped

The mean false distance clipped.

Type:: float

mean_false_negative_distance_clipped

The mean false negative distance clipped.

Type:: float

mean_false_positive_distance_clipped

The mean false positive distance clipped.

Type:: float

precision_with_tolerance

The precision with tolerance.

Type:: float

recall_with_tolerance

The recall with tolerance.

Type:: float

f1_score_with_tolerance

The F1 score with tolerance.

Type:: float

precision

The precision.

Type:: float

recall

The recall.

Type:: float

f1_score

The F1 score.

Type:: float

store_best(criterion: str) -> bool: Whether or not to store the best weights/validation blocks for this criterion.

higher_is_better(criterion: str) -> bool: Determines whether a higher value is better for a given criterion.

bounds(criterion: str) -> Tuple[Union[int, float, None], Union[int, float, None]]: Determines the bounds for a given criterion.

Notes

The evaluation scores are stored as attributes of the class. The class also contains methods to determine whether a higher value is better for a given criterion, whether or not to store the best weights/validation blocks for a given criterion, and the bounds for a given criterion.

dice: float

jaccard: float

hausdorff: float

false_negative_rate: float

false_negative_rate_with_tolerance: float

false_positive_rate: float

false_discovery_rate: float

false_positive_rate_with_tolerance: float

voi: float

mean_false_distance: float

mean_false_negative_distance: float

mean_false_positive_distance: float

mean_false_distance_clipped: float

mean_false_negative_distance_clipped: float

mean_false_positive_distance_clipped: float

precision_with_tolerance: float

recall_with_tolerance: float

f1_score_with_tolerance: float

precision: float

recall: float

f1_score: float

criteria = ['dice', 'jaccard', 'hausdorff', 'false_negative_rate', 'false_negative_rate_with_tolerance',...

The evaluation criteria.

Returns:

List[str]: the evaluation criteria

Raises:

NotImplementedError – if the function is not implemented

Examples

>>> evaluation_scores = EvaluationScores()
>>> evaluation_scores.criteria
["criterion1", "criterion2"]

Note

This function is used to return the evaluation criteria.

static store_best(criterion: str) → bool

Determines whether or not to store the best weights/validation blocks for a given criterion.

Parameters: criterion (str) – The evaluation criterion.
Returns: True if the best weights/validation blocks should be stored, False otherwise.
Return type: bool
Raises: ValueError – If the criterion is not recognized.

Examples

>>> BinarySegmentationEvaluationScores.store_best("dice")
False
>>> BinarySegmentationEvaluationScores.store_best("f1_score")
True

Notes

The method returns True if the best weights/validation blocks should be stored for the criterion and False otherwise. Whether or not to store the best weights/validation blocks for a given criterion is determined by the mapping dictionary.

static higher_is_better(criterion: str) → bool

Determines whether a higher value is better for a given criterion.

Parameters: criterion (str) – The evaluation criterion.
Returns: True if a higher value is better, False otherwise.
Return type: bool
Raises: ValueError – If the criterion is not recognized.

Examples

>>> BinarySegmentationEvaluationScores.higher_is_better("dice")
True
>>> BinarySegmentationEvaluationScores.higher_is_better("f1_score")
True

Notes

The method returns True if a higher value is better for the criterion and False otherwise. Whether a higher value is better for a given criterion is determined by the mapping dictionary.

static bounds(criterion: str) → Tuple[int | float | None, int | float | None]

Determines the bounds for a given criterion. The bounds are used to determine the best value for a given criterion.

Parameters: criterion (str) – The evaluation criterion.
Returns: The lower and upper bounds for the criterion.
Return type: Tuple[Union[int, float, None], Union[int, float, None]]
Raises: ValueError – If the criterion is not recognized.

Examples

>>> BinarySegmentationEvaluationScores.bounds("dice")
(0, 1)
>>> BinarySegmentationEvaluationScores.bounds("hausdorff")
(0, nan)

Notes

The method returns the lower and upper bounds for the criterion. The bounds are determined by the mapping dictionary.

class dacapo.experiments.tasks.evaluators.BinarySegmentationEvaluator(clip_distance: float, tol_distance: float, channels: List[str])

Given a binary segmentation, compute various metrics to determine their similarity. The metrics include:

- Dice coefficient: 2 * |A ∩ B| / (|A| + |B|), where A and B are the binary segmentations
- Jaccard coefficient: |A ∩ B| / |A ∪ B|, where A and B are the binary segmentations
- Hausdorff distance: max(h(A, B), h(B, A)), where h(A, B) is the directed Hausdorff distance from A to B
- False negative rate: |A - B| / |A|, where A and B are the binary segmentations
- False positive rate: |B - A| / |B|, where A and B are the binary segmentations
- False discovery rate: |B - A| / |A|, where A and B are the binary segmentations
- VOI: variation of information; split and merge errors combined into a single measure of segmentation quality
- Mean false distance: 0.5 * (mean false positive distance + mean false negative distance)
- Mean false negative distance: mean distance of false negatives
- Mean false positive distance: mean distance of false positives
- Mean false distance clipped: 0.5 * (mean false positive distance clipped + mean false negative distance clipped), with distances clipped to a maximum distance
- Mean false negative distance clipped: mean distance of false negatives, clipped to a maximum distance
- Mean false positive distance clipped: mean distance of false positives, clipped to a maximum distance
- Precision with tolerance: TP / (TP + FP), where TP and FP are the true and false positives within a tolerance distance
- Recall with tolerance: TP / (TP + FN), where TP and FN are the true positives and false negatives within a tolerance distance
- F1 score with tolerance: 2 * (Recall * Precision) / (Recall + Precision), where Recall and Precision are computed with tolerance
- Precision: TP / (TP + FP), where TP and FP are the true and false positives
- Recall: TP / (TP + FN), where TP and FN are the true positives and false negatives
- F1 score: 2 * (Recall * Precision) / (Recall + Precision)
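The Hausdorff distance named in the list above can be sketched by brute force on small point sets. The code follows the standard definition, with h(A, B) as the directed distance from A to B; the function names are hypothetical, and the quadratic-time approach is only meant for illustration, not as a description of DaCapo's implementation.

```python
import math
from typing import Sequence, Tuple

Point = Tuple[float, float]


def directed_hausdorff(a: Sequence[Point], b: Sequence[Point]) -> float:
    """h(A, B): largest distance from a point of A to its nearest point of B."""
    return max(min(math.dist(p, q) for q in b) for p in a)


def hausdorff(a: Sequence[Point], b: Sequence[Point]) -> float:
    """H(A, B) = max(h(A, B), h(B, A)); symmetric in its arguments."""
    return max(directed_hausdorff(a, b), directed_hausdorff(b, a))


A = [(0, 0), (0, 1)]
B = [(0, 0), (3, 0)]
print(directed_hausdorff(A, B))  # 1.0
print(directed_hausdorff(B, A))  # 3.0
print(hausdorff(A, B))           # 3.0
```

The two directed distances differ here, which is why the symmetric distance takes the maximum of both. Empty point sets are not handled and raise a `ValueError` from `max`.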

clip_distance: float the clip distance

tol_distance: float the tolerance distance

channels: List[str] the channels

criteria: List[str] the evaluation criteria

evaluate(output_array_identifier, evaluation_array): Evaluate the output array against the evaluation array.

score(): Return the evaluation scores.

Note

The BinarySegmentationEvaluator class is used to evaluate the performance of a binary segmentation task. The class provides methods to evaluate the output array against the evaluation array and return the evaluation scores.

Clip distance is the maximum distance between the ground truth and the predicted segmentation for a pixel to be considered a false positive. Tolerance distance is the maximum distance between the ground truth and the predicted segmentation for a pixel to be considered a true positive. Channels are the channels of the binary segmentation. Criteria are the evaluation criteria.
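To make the roles of the two distances concrete, here is a minimal one-dimensional sketch. It assumes one plausible reading: a predicted position counts as a hit if it lies within `tol_distance` of the ground truth, and false-positive distances are capped at `clip_distance` before averaging. The helper names and the data are hypothetical, and DaCapo's actual computation may differ.

```python
from typing import List


def nearest_distance(position: int, others: List[int]) -> int:
    """Distance from a position to the closest element of others (1-D)."""
    return min(abs(position - o) for o in others)


def precision_with_tolerance(
    truth: List[int], pred: List[int], tol_distance: float
) -> float:
    """Fraction of predicted positions lying within tol_distance of the truth."""
    hits = sum(1 for p in pred if nearest_distance(p, truth) <= tol_distance)
    return hits / len(pred)


def mean_false_positive_distance_clipped(
    truth: List[int], pred: List[int], clip_distance: float
) -> float:
    """Mean distance of false positives to the truth, each capped at clip_distance."""
    false_positives = [p for p in pred if p not in truth]
    distances = [
        min(nearest_distance(p, truth), clip_distance) for p in false_positives
    ]
    return sum(distances) / len(distances)


truth = [10, 11, 12]
pred = [11, 12, 14, 40]
print(precision_with_tolerance(truth, pred, tol_distance=0))   # 0.5
print(precision_with_tolerance(truth, pred, tol_distance=2))   # 0.75
print(mean_false_positive_distance_clipped(truth, pred, clip_distance=5))  # 3.5
```

The tolerance forgives the near miss at position 14, while clipping keeps the far outlier at position 40 from dominating the mean. The sketch has no guards for empty inputs.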

criteria = ['jaccard', 'voi']

A list of all criteria for which a model might be “best”, e.g. “precision”, “recall”, and “jaccard”. The best iteration and post-processing parameters are unlikely to be the same for all three of these criteria.

Returns:

List[str]: the evaluation criteria

Raises:

NotImplementedError – if the function is not implemented

Examples

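>>> evaluator = BinarySegmentationEvaluator(clip_distance=200, tol_distance=40, channels=["channel1", "channel2"])
>>> evaluator.criteria
['jaccard', 'voi']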

Note

This function is used to return the evaluation criteria.

clip_distance

tol_distance

channels

evaluate(output_array_identifier, evaluation_array)

Evaluate the output array against the evaluation array.

Parameters:

output_array_identifier – str the identifier of the output array
evaluation_array – ZarrArray the evaluation array

Returns:

BinarySegmentationEvaluationScores or MultiChannelBinarySegmentationEvaluationScores: the evaluation scores

Raises:

ValueError – if the output array identifier is not valid

Examples

>>> binary_segmentation_evaluator = BinarySegmentationEvaluator(clip_distance=200, tol_distance=40, channels=["channel1", "channel2"])
>>> output_array_identifier = "output_array"
>>> evaluation_array = ZarrArray.open_from_array_identifier("evaluation_array")
>>> binary_segmentation_evaluator.evaluate(output_array_identifier, evaluation_array)
BinarySegmentationEvaluationScores(dice=0.0, jaccard=0.0, hausdorff=0.0, false_negative_rate=0.0, false_positive_rate=0.0, false_discovery_rate=0.0, voi=0.0, mean_false_distance=0.0, mean_false_negative_distance=0.0, mean_false_positive_distance=0.0, mean_false_distance_clipped=0.0, mean_false_negative_distance_clipped=0.0, mean_false_positive_distance_clipped=0.0, precision_with_tolerance=0.0, recall_with_tolerance=0.0, f1_score_with_tolerance=0.0, precision=0.0, recall=0.0, f1_score=0.0)

Note

This function is used to evaluate the output array against the evaluation array.

property score

Return the evaluation scores.

Returns:

BinarySegmentationEvaluationScores or MultiChannelBinarySegmentationEvaluationScores: the evaluation scores

Raises:

NotImplementedError – if the function is not implemented

Examples

>>> binary_segmentation_evaluator = BinarySegmentationEvaluator(clip_distance=200, tol_distance=40, channels=["channel1", "channel2"])
>>> binary_segmentation_evaluator.score
BinarySegmentationEvaluationScores(dice=0.0, jaccard=0.0, hausdorff=0.0, false_negative_rate=0.0, false_positive_rate=0.0, false_discovery_rate=0.0, voi=0.0, mean_false_distance=0.0, mean_false_negative_distance=0.0, mean_false_positive_distance=0.0, mean_false_distance_clipped=0.0, mean_false_negative_distance_clipped=0.0, mean_false_positive_distance_clipped=0.0, precision_with_tolerance=0.0, recall_with_tolerance=0.0, f1_score_with_tolerance=0.0, precision=0.0, recall=0.0, f1_score=0.0)

Note

This function is used to return the evaluation scores.

class dacapo.experiments.tasks.evaluators.InstanceEvaluationScores

The evaluation scores for the instance segmentation task. The scores include the variation of information (VOI) split, VOI merge, and VOI.

voi_split: float the variation of information (VOI) split

voi_merge: float the variation of information (VOI) merge

voi: float the variation of information (VOI)

higher_is_better(criterion): Return whether higher is better for the given criterion.

bounds(criterion): Return the bounds for the given criterion.

store_best(criterion): Return whether to store the best score for the given criterion.

Note

The InstanceEvaluationScores class is used to store the evaluation scores for the instance segmentation task.
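For readers unfamiliar with VOI, the sketch below computes split and merge terms as conditional entropies between two label sequences. It is not the library's implementation: the function names are hypothetical, entropies are measured in bits, and the convention used here (split = H(segmentation | ground truth), merge = H(ground truth | segmentation)) is a common one that is assumed rather than confirmed for DaCapo.

```python
import math
from collections import Counter


def conditional_entropy(labels, given):
    """H(labels | given) in bits for two equally long label sequences."""
    n = len(labels)
    joint = Counter(zip(labels, given))
    marginal = Counter(given)
    entropy = 0.0
    for (_, g), count in joint.items():
        # p(l, g) * log2(p(l | g)), with p(l | g) = count / marginal[g]
        entropy -= (count / n) * math.log2(count / marginal[g])
    return entropy


def voi_scores(truth, seg):
    """Return (voi_split, voi_merge) under the assumed convention."""
    voi_split = conditional_entropy(seg, truth)  # truth objects split apart
    voi_merge = conditional_entropy(truth, seg)  # truth objects merged together
    return voi_split, voi_merge


# Two ground-truth objects merged into one predicted segment.
merge_case = voi_scores(truth=[1, 1, 2, 2], seg=[7, 7, 7, 7])

# One ground-truth object split into two predicted segments.
split_case = voi_scores(truth=[1, 1, 1, 1], seg=[1, 1, 2, 2])

print(merge_case, split_case)
```

A pure merge error shows up only in the merge term and a pure split error only in the split term; the `voi` property documented in this section combines the two by averaging them.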

criteria = ['voi_split', 'voi_merge', 'voi']

The evaluation criteria.

Returns:

List[str]: the evaluation criteria

Raises:

NotImplementedError – if the function is not implemented

Examples

>>> evaluation_scores = InstanceEvaluationScores(voi_split=0.1, voi_merge=0.2)
>>> evaluation_scores.criteria
['voi_split', 'voi_merge', 'voi']

Note

This function is used to return the evaluation criteria.

voi_split: float

voi_merge: float

property voi

Return the average of the VOI split and VOI merge.

Returns:

float: the average of the VOI split and VOI merge

Raises:

NotImplementedError – if the function is not implemented

Examples

>>> instance_evaluation_scores = InstanceEvaluationScores(voi_split=0.25, voi_merge=0.75)
>>> instance_evaluation_scores.voi
0.5

Note

This function is used to calculate the average of the VOI split and VOI merge.

static higher_is_better(criterion: str) → bool

Return whether higher is better for the given criterion.

Parameters:

criterion – str the evaluation criterion

Returns:

bool: whether higher is better for the given criterion

Raises:

NotImplementedError – if the function is not implemented

Examples

>>> InstanceEvaluationScores.higher_is_better("voi_split")
False

Note

This function is used to determine whether higher is better for the given criterion.

static bounds(criterion: str) → Tuple[int | float | None, int | float | None]

Return the bounds for the given criterion.

Parameters:

criterion – str the evaluation criterion

Returns:

Tuple[Union[int, float, None], Union[int, float, None]]: the bounds for the given criterion

Raises:

NotImplementedError – if the function is not implemented

Examples

>>> InstanceEvaluationScores.bounds("voi_split")
(0, 1)

Note

This function is used to return the bounds for the given criterion.

static store_best(criterion: str) → bool

Return whether to store the best score for the given criterion.

Parameters:

criterion – str the evaluation criterion

Returns:

bool: whether to store the best score for the given criterion

Raises:

NotImplementedError – if the function is not implemented

Examples

>>> InstanceEvaluationScores.store_best("voi_split")
True

Note

This function is used to determine whether to store the best score for the given criterion.
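The static methods above are typically consulted together when deciding whether a new score should replace the stored best. The sketch below uses a hypothetical stand-in class, `StandInScores`, whose return values copy the documented examples for "voi_split" and are assumed to hold for the other criteria; `update_best` is likewise an illustrative helper, not part of DaCapo.

```python
class StandInScores:
    """Hypothetical stand-in mirroring the documented static methods."""

    criteria = ["voi_split", "voi_merge", "voi"]

    @staticmethod
    def higher_is_better(criterion):
        # Assumed: lower is better for every VOI-based criterion.
        return False

    @staticmethod
    def store_best(criterion):
        # Assumed: every criterion is tracked.
        return True


def update_best(scores_cls, best_so_far, new_scores):
    """Update best_so_far in place with any improved criterion values."""
    for criterion in scores_cls.criteria:
        if not scores_cls.store_best(criterion):
            continue
        new = new_scores[criterion]
        old = best_so_far.get(criterion)
        if scores_cls.higher_is_better(criterion):
            improved = old is None or new > old
        else:
            improved = old is None or new < old
        if improved:
            best_so_far[criterion] = new
    return best_so_far


best = {}
update_best(StandInScores, best, {"voi_split": 0.4, "voi_merge": 0.3, "voi": 0.35})
update_best(StandInScores, best, {"voi_split": 0.5, "voi_merge": 0.2, "voi": 0.35})
print(best)
```

Only `voi_merge` improves in the second update, so the other stored values stay unchanged.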

class dacapo.experiments.tasks.evaluators.InstanceEvaluator

A class representing an evaluator for instance segmentation tasks.

criteria: List[str] the evaluation criteria

evaluate(output_array_identifier, evaluation_array): Evaluate the output array against the evaluation array.

score(): Return the evaluation scores.

Note

The InstanceEvaluator class is used to evaluate the performance of an instance segmentation task.

criteria: List[str] = ['voi_merge', 'voi_split', 'voi']

A list of all criteria for which a model might be “best”. For example, your criteria might be “precision”, “recall”, and “jaccard”. It is unlikely that the best iteration and post-processing parameters will be the same for all three of these criteria.

Returns:

List[str]: the evaluation criteria

Raises:

NotImplementedError – if the function is not implemented

Examples

>>> evaluator = InstanceEvaluator()
>>> evaluator.criteria
['voi_merge', 'voi_split', 'voi']

Note

This function is used to return the evaluation criteria.

evaluate(output_array_identifier, evaluation_array)

Evaluate the output array against the evaluation array.

Parameters:

output_array_identifier – str the identifier of the output array
evaluation_array – ZarrArray the evaluation array

Returns:

InstanceEvaluationScores: the evaluation scores

Raises:

ValueError – if the output array identifier is not valid

Examples

>>> instance_evaluator = InstanceEvaluator()
>>> output_array_identifier = "output_array"
>>> evaluation_array = ZarrArray.open_from_array_identifier("evaluation_array")
>>> instance_evaluator.evaluate(output_array_identifier, evaluation_array)
InstanceEvaluationScores(voi_merge=0.0, voi_split=0.0)

Note

This function is used to evaluate the output array against the evaluation array.

property score: dacapo.experiments.tasks.evaluators.instance_evaluation_scores.InstanceEvaluationScores

Return the evaluation scores.

Returns:

InstanceEvaluationScores: the evaluation scores

Raises:

NotImplementedError – if the function is not implemented

Examples

>>> instance_evaluator = InstanceEvaluator()
>>> instance_evaluator.score
InstanceEvaluationScores(voi_merge=0.0, voi_split=0.0)

Note

This function is used to return the evaluation scores.