Raw results table

We could not include the raw results table in the paper, so we show it here along with some additional metrics (Expected Calibration Error, ECE, computed with different binning schemes and bin counts). The choice of bins used to compute the ECE has little impact on the metric: across the four ECE columns, the largest gap for any dataset is 0.23 percentage points (maptask).
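
To make the two binning schemes concrete, here is a minimal sketch of an ECE computation with either uniform (equal-width) or adaptive (equal-mass) bins. It follows the standard ECE definition and is not necessarily the exact code used for the paper; the function name `expected_calibration_error` and its parameters are illustrative assumptions.

```python
import numpy as np

def expected_calibration_error(confidences, correct, n_bins=10, adaptive=False):
    """ECE as the weighted mean |accuracy - confidence| gap over bins.

    Hypothetical helper: `confidences` are top-class probabilities in [0, 1],
    `correct` is 1 where the prediction matches the target, else 0.
    """
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)

    if adaptive:
        # Equal-mass bins: edges are quantiles of the observed confidences.
        edges = np.quantile(confidences, np.linspace(0.0, 1.0, n_bins + 1))
    else:
        # Equal-width bins over [0, 1].
        edges = np.linspace(0.0, 1.0, n_bins + 1)

    # Interior edges map every sample to a bin index in [0, n_bins - 1].
    bin_ids = np.digitize(confidences, edges[1:-1], right=True)

    ece = 0.0
    for b in range(n_bins):
        mask = bin_ids == b
        if not mask.any():
            continue  # empty bins contribute nothing
        gap = abs(correct[mask].mean() - confidences[mask].mean())
        ece += mask.mean() * gap  # weight by the fraction of samples in the bin
    return ece
```

Calling it with `adaptive=False` and `adaptive=True` on the same predictions gives the kind of comparison reported in the ECE columns of the table.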

About the reported DERs

Note that the DER given here is the local diarization error rate. It cannot be compared to the DERs usually reported in the literature! Since the powerset speaker diarization model works on local windows of a few seconds (5 seconds in our case), we compute and sum the DER components on each of these windows. There is no clustering involved here (or in any DER we provide), so it cannot be interpreted as the final pipeline DER.
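
As a rough illustration of this aggregation, the sketch below sums the usual DER components (false alarm, missed detection, speaker confusion) over windows and divides by the total reference speech duration. The `WindowErrors` container and `local_der` function are hypothetical names, and the per-window components are assumed to be already computed, in seconds.

```python
from dataclasses import dataclass
from typing import Iterable

@dataclass
class WindowErrors:
    """DER components for one local window, in seconds (hypothetical container)."""
    false_alarm: float
    missed_detection: float
    confusion: float
    total: float  # reference speech duration in the window

def local_der(windows: Iterable[WindowErrors]) -> float:
    """Sum each component over all windows, then form a single error rate."""
    windows = list(windows)
    errors = sum(w.false_alarm + w.missed_detection + w.confusion for w in windows)
    total = sum(w.total for w in windows)
    if total == 0:
        raise ValueError("no reference speech in any window")
    return errors / total

# Two 5-second windows with 4 seconds of reference speech each.
example = [
    WindowErrors(false_alarm=0.5, missed_detection=0.25, confusion=0.25, total=4.0),
    WindowErrors(false_alarm=0.0, missed_detection=0.5, confusion=0.5, total=4.0),
]
print(f"local DER = {100 * local_der(example):.2f}%")
```

Summing the components before dividing means windows with more speech weigh more than windows with little speech, which differs from averaging per-window DERs.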

Dataset	DER (%)	Accuracy (%)	ECE uniform 10 bins (%)	ECE uniform 20 bins (%)	ECE adaptive 10 bins (%)	ECE adaptive 20 bins (%)
AISHELL	11.86	89.10	0.39	0.48	0.50	0.50
AMI-SDM	19.49	82.79	3.98	3.98	3.98	3.98
AMI	17.50	84.55	3.53	3.53	3.53	3.53
AVA-AVD	34.85	81.87	4.30	4.31	4.30	4.30
AliMeeting	19.59	79.46	3.04	3.04	3.04	3.04
CALLHOME	22.49	77.07	2.57	2.57	2.57	2.57
MSDWILD	20.03	80.52	2.89	2.89	2.89	2.89
RAMC	10.69	91.12	1.67	1.67	1.67	1.67
REPERE	7.67	92.48	1.83	1.83	1.83	1.83
VoxConverse	9.94	91.05	0.70	0.70	0.69	0.69

audiobooks	12.22	90.44	3.22	3.26	3.28	3.28
broadcast interview	16.77	86.90	6.50	6.50	6.44	6.44
clinical	32.15	79.48	3.94	3.98	3.93	3.94
court	16.46	86.00	8.19	8.19	8.17	8.17
cts	16.47	83.68	1.37	1.38	1.37	1.37
maptask	28.15	81.25	8.20	8.20	7.97	8.08
meeting	39.70	64.26	16.63	16.63	16.63	16.63
restaurant	45.82	54.11	14.31	14.31	14.31	14.31
socio field	21.74	82.45	2.65	2.65	2.65	2.65
socio lab	22.06	82.60	4.36	4.36	4.24	4.33
webvideo	40.01	69.75	10.52	10.52	10.52	10.52

DER / ECE scatter plot

The paper contains two scatter plots for DER / ECE. Here we group all datasets and domains into a single scatter plot. Feel free to zoom in and to filter out in-domain or out-of-domain datasets.

Reliability diagrams

Here are reliability diagrams for all 11 DIHARD 3 domains. The paper only shows uniform binning, but we also provide diagrams for adaptive binning. The figures are placed in collapsible sections since they take up a lot of vertical space.

Uniform binning with 10 bins

Using uniform binning with 10 bins

Adaptive binning with 10 bins

Note that the X axis is far from linear: since most predictions are confident, the higher bins contain very similar confidence values.
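
The effect can be reproduced on synthetic data. The sketch below builds hypothetical confidences skewed towards 1.0 (not taken from our model) and prints the edges of 10 equal-mass bins; the upper bins come out much narrower than the lower ones.

```python
import numpy as np

# Synthetic, hypothetical confidences skewed towards 1.0 (range 0.5 to 1.0).
u = np.linspace(0.0, 1.0, 1001)
confidences = 1.0 - 0.5 * u**4

# Adaptive (equal-mass) binning: edges are the deciles of the confidences.
edges = np.quantile(confidences, np.linspace(0.0, 1.0, 11))
widths = np.diff(edges)

# Each bin holds about the same number of samples, but widths shrink near 1.0.
for low, high in zip(edges[:-1], edges[1:]):
    print(f"[{low:.5f}, {high:.5f}]  width={high - low:.5f}")
```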

Using adaptive binning with 10 bins

Analysis of low-confidence regions

We sample low-confidence regions (left column) and random regions (right column) of the data, and compare the data composition as well as the model performance. As usual, we provide figures for all DIHARD domains rather than a select few.
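
One simple way to set up such a comparison is sketched below. It is only an illustration under stated assumptions, not the procedure used for the figures: the 0.7 confidence threshold, the frame-level selection, and the names `composition` and `compare_regions` are all hypothetical.

```python
import numpy as np

def composition(labels, n_classes):
    """Fraction of frames per class (all zeros if there are no frames)."""
    counts = np.bincount(labels, minlength=n_classes)
    total = counts.sum()
    return counts / total if total > 0 else np.zeros(n_classes)

def compare_regions(confidences, labels, n_classes, threshold=0.7, seed=0):
    """Class composition of low-confidence frames vs. a random sample of equal size."""
    confidences = np.asarray(confidences, dtype=float)
    labels = np.asarray(labels, dtype=int)

    low = confidences < threshold  # hypothetical low-confidence criterion
    rng = np.random.default_rng(seed)
    random_idx = rng.choice(len(labels), size=int(low.sum()), replace=False)

    return composition(labels[low], n_classes), composition(labels[random_idx], n_classes)

# Toy example with 3 classes (e.g. 0 = one speaker, 1 = overlap, 2 = non-speech).
conf = [0.9, 0.5, 0.95, 0.6, 0.99, 0.4]
lab = [0, 1, 0, 2, 0, 1]
low_comp, rand_comp = compare_regions(conf, lab, n_classes=3)
print("low-confidence:", low_comp)
print("random:        ", rand_comp)
```

Drawing the random sample with the same size as the low-confidence set keeps the two compositions directly comparable.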

Data composition

Data composition of low-confidence regions

Model performance (DER)

DER on low-confidence regions

Reproducibility

Pretrained model checkpoint downloads:

Composition of the training dataset:

pyannote.database protocol specifications

Parquet inference files, containing model probabilities and targets for all of the datasets:

.parquet inference files