Raw results table
We could not include the raw result table in the paper. We show it here, and include some additional metrics (Expected Calibration Error using different binning schemes and bin counts). It is pretty clear that the bins used to compute the ECE does not have a huge impact on the metric.
Do note that the DER given here is the local diarization error rate. It can not be compared to DERs usually reported in the litterature ! Since the powerset speaker diarization model works on local windows of a few seconds (5 seconds in our case), we compute compute and sum the DER component on each of these windows. There is no clustering involved here (or in any DER we provide) and it cannot be interpreted as the final pipeline DER.
Dataset | DER (%) | Accuracy (%) | ECE uniform 10 bins (%) | ECE uniform 20 bins (%) | ECE adaptive 10 bins (%) | ECE adaptive 20 bins (%) |
---|---|---|---|---|---|---|
AISHELL | 11.86 | 89.10 | 0.39 | 0.48 | 0.50 | 0.50 |
AMI-SDM | 19.49 | 82.79 | 3.98 | 3.98 | 3.98 | 3.98 |
AMI | 17.50 | 84.55 | 3.53 | 3.53 | 3.53 | 3.53 |
AVA-AVD | 34.85 | 81.87 | 4.30 | 4.31 | 4.30 | 4.30 |
AliMeeting | 19.59 | 79.46 | 3.04 | 3.04 | 3.04 | 3.04 |
CALLHOME | 22.49 | 77.07 | 2.57 | 2.57 | 2.57 | 2.57 |
MSDWILD | 20.03 | 80.52 | 2.89 | 2.89 | 2.89 | 2.89 |
RAMC | 10.69 | 91.12 | 1.67 | 1.67 | 1.67 | 1.67 |
REPERE | 7.67 | 92.48 | 1.83 | 1.83 | 1.83 | 1.83 |
VoxConverse | 9.94 | 91.05 | 0.70 | 0.70 | 0.69 | 0.69 |
audiobooks | 12.22 | 90.44 | 3.22 | 3.26 | 3.28 | 3.28 |
broadcast interview | 16.77 | 86.90 | 6.50 | 6.50 | 6.44 | 6.44 |
clinical | 32.15 | 79.48 | 3.94 | 3.98 | 3.93 | 3.94 |
court | 16.46 | 86.00 | 8.19 | 8.19 | 8.17 | 8.17 |
cts | 16.47 | 83.68 | 1.37 | 1.38 | 1.37 | 1.37 |
maptask | 28.15 | 81.25 | 8.20 | 8.20 | 7.97 | 8.08 |
meeting | 39.70 | 64.26 | 16.63 | 16.63 | 16.63 | 16.63 |
restaurant | 45.82 | 54.11 | 14.31 | 14.31 | 14.31 | 14.31 |
socio field | 21.74 | 82.45 | 2.65 | 2.65 | 2.65 | 2.65 |
socio lab | 22.06 | 82.60 | 4.36 | 4.36 | 4.24 | 4.33 |
webvideo | 40.01 | 69.75 | 10.52 | 10.52 | 10.52 | 10.52 |
DER / ECE scatter plot
The paper contains two scatter plots for DER / ECE. Here we grouped all datasets and domains into one unique scatter plot. Feel free to zoom in and filter out in/out-domain datasets.
Reliability diagrams
Here are reliability diagrams for all 11 DIHARD 3 domains. The paper only shows uniform binning, but we also propose diagrams for adaptive binning. We put the figures under foldable sections since they take a lot of vertical space.
Uniform binning with 10 bins
Adapative binning with 10 bins
Note that the X axis is not linear at all. Since most predictions are confident, the higher bins contain very similar confidence values.
Analysis of low-confidence regions
We sample low-confidence data (left column) and random regions of data (right column), and compare the composition of the data as well as the model performance. As usual we provide the figures of all DIHARD domains instead of a select few.
Data composition
Model performance (DER)
Reproducibility
Pretrained model checkpoint downloads:
Composition of the training dataset:
Parquet inference files, containing model probabilities and targets for all of the datasets: