8 Commits

Author SHA1 Message Date
  Quentin Lhoest 39a3e98405 style 1 month ago
  mkhi238 9e0a446827 Add FEVER Metric for Fact Verification Evaluation (#710) 1 month ago
  Christopher L. b4d710804b Add warning about METEOR version and flag-induced score variance (#677) 1 month ago
  Mathias Müller 56af7abbb1 fix bleurt docs (#708) 1 month ago
  engineerlisa c07bbe78a3 Fix dependency hints on ImportError (#717) 1 month ago
  Yacklin Wong c33f40c2af corrected problematic pip commands generated by _download_additional_modules (#715) 1 month ago
  Yacklin Wong 91453fdfbb Remove the comments that pollute library_import_path in _download_additional_modules and correct import_library_name for "absl" (#716) 1 month ago
  Yacklin Wong 143a05c984 corrected syntax errors and typos found in README.md of bleu metric (#718) 1 month ago
10 changed files with 501 additions and 28 deletions
  1. metrics/bleu/README.md (+12, -12)
  2. metrics/bleurt/README.md (+11, -5)
  3. metrics/bleurt/bleurt.py (+2, -2)
  4. metrics/fever/README.md (+183, -0)
  5. metrics/fever/app.py (+6, -0)
  6. metrics/fever/fever.py (+140, -0)
  7. metrics/fever/test_fever.py (+134, -0)
  8. metrics/meteor/README.md (+3, -0)
  9. metrics/rouge/rouge.py (+2, -2)
  10. src/evaluate/loading.py (+8, -7)

metrics/bleu/README.md (+12, -12)

@@ -48,9 +48,9 @@ This metric takes as input a list of predicted sentences and a list of lists of
```

### Inputs
- **predictions** (`list` of `str`s): Translations to score.
- **references** (`list` of `list`s of `str`s): references for each translation.
- ** tokenizer** : approach used for standardizing `predictions` and `references`.
- **predictions** (`list[str]`): Translations to score.
- **references** (`Union[list[str], list[list[str]]]`): references for each translation.
- **tokenizer** : approach used for standardizing `predictions` and `references`.
The default tokenizer is `tokenizer_13a`, a relatively minimal tokenization approach that is however equivalent to `mteval-v13a`, used by WMT.
This can be replaced by another tokenizer from a source such as [SacreBLEU](https://github.com/mjpost/sacrebleu/tree/master/sacrebleu/tokenizers).

@@ -93,15 +93,15 @@ Example where each prediction has 1 reference:
{'bleu': 1.0, 'precisions': [1.0, 1.0, 1.0, 1.0], 'brevity_penalty': 1.0, 'length_ratio': 1.0, 'translation_length': 7, 'reference_length': 7}
```

Example where the second prediction has 2 references:
Example where the first prediction has 2 references:
```python
>>> predictions = [
... ["hello there general kenobi",
... ["foo bar foobar"]
... "hello there general kenobi",
... "foo bar foobar"
... ]
>>> references = [
... [["hello there general kenobi"], ["hello there!"]],
... [["foo bar foobar"]]
... ["hello there general kenobi", "hello there!"],
... ["foo bar foobar"]
... ]
>>> bleu = evaluate.load("bleu")
>>> results = bleu.compute(predictions=predictions, references=references)
@@ -114,12 +114,12 @@ Example with the word tokenizer from NLTK:
>>> bleu = evaluate.load("bleu")
>>> from nltk.tokenize import word_tokenize
>>> predictions = [
... ["hello there general kenobi",
... ["foo bar foobar"]
... "hello there general kenobi",
... "foo bar foobar"
... ]
>>> references = [
... [["hello there general kenobi"], ["hello there!"]],
... [["foo bar foobar"]]
... ["hello there general kenobi", "hello there!"],
... ["foo bar foobar"]
... ]
>>> results = bleu.compute(predictions=predictions, references=references, tokenizer=word_tokenize)
>>> print(results)


metrics/bleurt/README.md (+11, -5)

@@ -42,9 +42,15 @@ This metric takes as input lists of predicted sentences and reference sentences:
```

### Inputs

For the `load` function:

- **config_name** (`str`): BLEURT checkpoint. Will default to `"bleurt-base-128"` if not specified. Other models that can be chosen are: `"bleurt-tiny-128"`, `"bleurt-tiny-512"`, `"bleurt-base-128"`, `"bleurt-base-512"`, `"bleurt-large-128"`, `"bleurt-large-512"`, `"BLEURT-20-D3"`, `"BLEURT-20-D6"`, `"BLEURT-20-D12"` and `"BLEURT-20"`.

For the `compute` function:

- **predictions** (`list` of `str`s): List of generated sentences to score.
- **references** (`list` of `str`s): List of references to compare to.
- **checkpoint** (`str`): BLEURT checkpoint. Will default to `BLEURT-tiny` if not specified. Other models that can be chosen are: `"bleurt-tiny-128"`, `"bleurt-tiny-512"`, `"bleurt-base-128"`, `"bleurt-base-512"`, `"bleurt-large-128"`, `"bleurt-large-512"`, `"BLEURT-20-D3"`, `"BLEURT-20-D6"`, `"BLEURT-20-D12"` and `"BLEURT-20"`.

### Output Values
- **scores** : a `list` of scores, one per prediction.
@@ -65,7 +71,7 @@ BLEURT is used to compare models across different asks (e.g. (Table to text gene

### Examples

Example with the default model:
Example with the default model (`"bleurt-base-128"`):
```python
>>> predictions = ["hello there", "general kenobi"]
>>> references = ["hello there", "general kenobi"]
@@ -75,14 +81,14 @@ Example with the default model:
{'scores': [1.0295498371124268, 1.0445425510406494]}
```

Example with the `"bleurt-base-128"` model checkpoint:
Example with the full `"BLEURT-20"` model checkpoint:
```python
>>> predictions = ["hello there", "general kenobi"]
>>> references = ["hello there", "general kenobi"]
>>> bleurt = load("bleurt", module_type="metric", checkpoint="bleurt-base-128")
>>> bleurt = load("bleurt", module_type="metric", config_name="BLEURT-20")
>>> results = bleurt.compute(predictions=predictions, references=references)
>>> print(results)
{'scores': [1.0295498371124268, 1.0445425510406494]}
{'scores': [1.015415906906128, 0.9985226988792419]}
```

## Limitations and Bias


metrics/bleurt/bleurt.py (+2, -2)

@@ -100,8 +100,8 @@ class BLEURT(evaluate.Metric):
        # check that config name specifies a valid BLEURT model
        if self.config_name == "default":
            logger.warning(
                "Using default BLEURT-Base checkpoint for sequence maximum length 128. "
                "You can use a bigger model for better results with e.g.: evaluate.load('bleurt', 'bleurt-large-512')."
                "Using default checkpoint 'bleurt-base-128' for sequence maximum length 128. "
                "You can use a bigger model for better results with e.g.: evaluate.load('bleurt', config_name='bleurt-large-512')."
            )
            self.config_name = "bleurt-base-128"



metrics/fever/README.md (+183, -0)

@@ -0,0 +1,183 @@
---
title: FEVER
emoji: 🔥
colorFrom: orange
colorTo: red
sdk: gradio
sdk_version: 3.19.1
app_file: app.py
pinned: false
tags:
- evaluate
- metric
description: >-
  The FEVER (Fact Extraction and VERification) metric evaluates the performance of systems that verify factual claims against evidence retrieved from Wikipedia.

  It consists of three main components: Label accuracy (measures how often the predicted claim label matches the gold label), FEVER score (considers a prediction correct only if the label is correct and at least one complete gold evidence set is retrieved), and Evidence F1 (computes the micro-averaged precision, recall, and F1 between predicted and gold evidence sentences).

  The FEVER score is the official leaderboard metric used in the FEVER shared tasks. All metrics range from 0 to 1, with higher values indicating better performance.
---

# Metric Card for FEVER

## Metric description

The FEVER (Fact Extraction and VERification) metric evaluates the performance of systems that verify factual claims against evidence retrieved from Wikipedia. It was introduced in the FEVER shared task and has become a standard benchmark for fact verification systems.

FEVER consists of three main evaluation components:

1. **Label accuracy**: measures how often the predicted claim label (SUPPORTED, REFUTED, or NOT ENOUGH INFO) matches the gold label
2. **FEVER score**: considers a prediction correct only if the label is correct _and_ at least one complete gold evidence set is retrieved
3. **Evidence F1**: computes the micro-averaged precision, recall, and F1 between predicted and gold evidence sentences

## How to use

The metric takes two inputs: predictions (a list of dictionaries containing predicted labels and evidence) and references (a list of dictionaries containing gold labels and evidence sets).

```python
from evaluate import load
fever = load("fever")
predictions = [{"label": "SUPPORTED", "evidence": ["E1", "E2"]}]
references = [{"label": "SUPPORTED", "evidence_sets": [["E1", "E2"]]}]
results = fever.compute(predictions=predictions, references=references)
```

## Output values

This metric outputs a dictionary containing five float values:

```python
print(results)
{
    'label_accuracy': 1.0,
    'fever_score': 1.0,
    'evidence_precision': 1.0,
    'evidence_recall': 1.0,
    'evidence_f1': 1.0
}
```

- **label_accuracy**: Proportion of claims with correctly predicted labels (0-1, higher is better)
- **fever_score**: Proportion of claims where both the label and at least one full gold evidence set are correct (0-1, higher is better). This is the **official FEVER leaderboard metric**
- **evidence_precision**: Micro-averaged precision of evidence retrieval (0-1, higher is better)
- **evidence_recall**: Micro-averaged recall of evidence retrieval (0-1, higher is better)
- **evidence_f1**: Micro-averaged F1 of evidence retrieval (0-1, higher is better)

All values range from 0 to 1, with **1.0 representing perfect performance**.
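
Concretely, the evidence scores pool the predicted and gold evidence sentences over all examples (the gold set for claim $i$ being the union of its gold evidence sets) and then take the usual precision, recall, and harmonic mean, matching the computation in `fever.py`:

$$
P = \frac{\sum_i |E_i^{\text{pred}} \cap E_i^{\text{gold}}|}{\sum_i |E_i^{\text{pred}}|}, \qquad
R = \frac{\sum_i |E_i^{\text{pred}} \cap E_i^{\text{gold}}|}{\sum_i |E_i^{\text{gold}}|}, \qquad
F_1 = \frac{2PR}{P + R}
$$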

### Values from popular papers

The FEVER shared task has established performance benchmarks on the FEVER dataset:

- Human performance: FEVER score of ~0.92
- Top systems (2018-2019): FEVER scores ranging from 0.64 to 0.70
- State-of-the-art models (2020+): FEVER scores above 0.75

Performance varies significantly based on:

- Model architecture (retrieval + verification pipeline vs. end-to-end)
- Pre-training (BERT, RoBERTa, etc.)
- Evidence retrieval quality

## Examples

Perfect prediction (label and evidence both correct):

```python
from evaluate import load
fever = load("fever")
predictions = [{"label": "SUPPORTED", "evidence": ["E1", "E2"]}]
references = [{"label": "SUPPORTED", "evidence_sets": [["E1", "E2"]]}]
results = fever.compute(predictions=predictions, references=references)
print(results)
{
    'label_accuracy': 1.0,
    'fever_score': 1.0,
    'evidence_precision': 1.0,
    'evidence_recall': 1.0,
    'evidence_f1': 1.0
}
```

Correct label but incomplete evidence:

```python
from evaluate import load
fever = load("fever")
predictions = [{"label": "SUPPORTED", "evidence": ["E1"]}]
references = [{"label": "SUPPORTED", "evidence_sets": [["E1", "E2"]]}]
results = fever.compute(predictions=predictions, references=references)
print(results)
{
    'label_accuracy': 1.0,
    'fever_score': 0.0,
    'evidence_precision': 1.0,
    'evidence_recall': 0.5,
    'evidence_f1': 0.6666666666666666
}
```

Incorrect label (FEVER score is 0):

```python
from evaluate import load
fever = load("fever")
predictions = [{"label": "REFUTED", "evidence": ["E1", "E2"]}]
references = [{"label": "SUPPORTED", "evidence_sets": [["E1", "E2"]]}]
results = fever.compute(predictions=predictions, references=references)
print(results)
{
    'label_accuracy': 0.0,
    'fever_score': 0.0,
    'evidence_precision': 1.0,
    'evidence_recall': 1.0,
    'evidence_f1': 1.0
}
```

Multiple valid evidence sets:

```python
from evaluate import load
fever = load("fever")
predictions = [{"label": "SUPPORTED", "evidence": ["E3", "E4"]}]
references = [{"label": "SUPPORTED", "evidence_sets": [["E1", "E2"], ["E3", "E4"]]}]
results = fever.compute(predictions=predictions, references=references)
print(results)
{
    'label_accuracy': 1.0,
    'fever_score': 1.0,
    'evidence_precision': 1.0,
    'evidence_recall': 0.5,
    'evidence_f1': 0.6666666666666666
}
```

## Limitations and bias

The FEVER metric has several important considerations:

1. **Evidence set completeness**: The FEVER score requires retrieving _all_ sentences in at least one gold evidence set. Partial evidence retrieval (even if sufficient for verification) results in a score of 0.
2. **Multiple valid evidence sets**: Some claims can be verified using different sets of evidence. The metric gives credit if any one complete set is retrieved.
3. **Micro-averaging**: Evidence precision, recall, and F1 are micro-averaged across all examples, which means performance on longer evidence sets has more influence on the final metrics (see the sketch after this list).
4. **Label dependency**: The FEVER score requires both correct labeling _and_ correct evidence retrieval, making it a strict metric that penalizes systems for either type of error.
5. **Wikipedia-specific**: The metric was designed for Wikipedia-based fact verification and may not generalize directly to other knowledge sources or domains.

## Citation

```bibtex
@inproceedings{thorne2018fever,
    title={FEVER: a Large-scale Dataset for Fact Extraction and VERification},
    author={Thorne, James and Vlachos, Andreas and Christodoulopoulos, Christos and Mittal, Arpit},
    booktitle={Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers)},
    pages={809--819},
    year={2018}
}
```

## Further References

- [FEVER Dataset Website](https://fever.ai/dataset/)
- [FEVER Paper on arXiv](https://arxiv.org/abs/1803.05355)
- [Hugging Face Tasks -- Fact Checking](https://huggingface.co/tasks/text-classification)
- [FEVER Shared Task Overview](https://fever.ai/task.html)

metrics/fever/app.py (+6, -0)

@@ -0,0 +1,6 @@
import evaluate
from evaluate.utils import launch_gradio_widget


module = evaluate.load("fever")
launch_gradio_widget(module)

metrics/fever/fever.py (+140, -0)

@@ -0,0 +1,140 @@
# Copyright 2021 The HuggingFace Evaluate Authors.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

"""FEVER (Fact Extraction and VERification) metric."""

import datasets

import evaluate


_CITATION = """\
@inproceedings{thorne2018fever,
    title={FEVER: a Large-scale Dataset for Fact Extraction and VERification},
    author={Thorne, James and Vlachos, Andreas and Christodoulopoulos, Christos and Mittal, Arpit},
    booktitle={Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers)},
    pages={809--819},
    year={2018}
}
"""
_DESCRIPTION = """\
The FEVER (Fact Extraction and VERification) metric evaluates the performance of systems that verify factual claims against evidence retrieved from Wikipedia.

It consists of three main components:
- **Label accuracy**: measures how often the predicted claim label (SUPPORTED, REFUTED, or NOT ENOUGH INFO) matches the gold label.
- **FEVER score**: considers a prediction correct only if the label is correct *and* at least one complete gold evidence set is retrieved.
- **Evidence F1**: computes the micro-averaged precision, recall, and F1 between predicted and gold evidence sentences.

The FEVER score is the official leaderboard metric used in the FEVER shared tasks.
"""
_KWARGS_DESCRIPTION = """
Computes the FEVER evaluation metrics.

Args:
    predictions (list of dict): Each prediction should be a dictionary with:
        - "label" (str): the predicted claim label.
        - "evidence" (list of str): the predicted evidence sentences.
    references (list of dict): Each reference should be a dictionary with:
        - "label" (str): the gold claim label.
        - "evidence_sets" (list of list of str): all possible gold evidence sets.

Returns:
    A dictionary containing:
        - 'label_accuracy': proportion of claims with correctly predicted labels.
        - 'fever_score': proportion of claims where both the label and at least one full gold evidence set are correct.
        - 'evidence_precision': micro-averaged precision of evidence retrieval.
        - 'evidence_recall': micro-averaged recall of evidence retrieval.
        - 'evidence_f1': micro-averaged F1 of evidence retrieval.

Example:
    >>> predictions = [{"label": "SUPPORTED", "evidence": ["E1", "E2"]}]
    >>> references = [{"label": "SUPPORTED", "evidence_sets": [["E1", "E2"], ["E3", "E4"]]}]
    >>> fever = evaluate.load("fever")
    >>> results = fever.compute(predictions=predictions, references=references)
    >>> print(results)
    {'label_accuracy': 1.0, 'fever_score': 1.0, 'evidence_precision': 1.0, 'evidence_recall': 0.5, 'evidence_f1': 0.6666666666666666}
"""


@evaluate.utils.file_utils.add_start_docstrings(_DESCRIPTION, _KWARGS_DESCRIPTION)
class FEVER(evaluate.Metric):
    def _info(self):
        return evaluate.MetricInfo(
            description=_DESCRIPTION,
            citation=_CITATION,
            inputs_description=_KWARGS_DESCRIPTION,
            features=datasets.Features(
                {
                    "predictions": {
                        "label": datasets.Value("string"),
                        "evidence": datasets.Sequence(datasets.Value("string")),
                    },
                    "references": {
                        "label": datasets.Value("string"),
                        "evidence_sets": datasets.Sequence(datasets.Sequence(datasets.Value("string"))),
                    },
                }
            ),
            reference_urls=[
                "https://fever.ai/dataset/",
                "https://arxiv.org/abs/1803.05355",
            ],
        )

    def _compute(self, predictions, references):
        """
        Computes FEVER metrics:
        - Label accuracy
        - FEVER score (label + complete evidence set)
        - Evidence precision, recall, and F1 (micro-averaged)
        """
        total = len(predictions)
        label_correct, fever_correct = 0, 0
        total_overlap, total_pred, total_gold = 0, 0, 0

        for pred, ref in zip(predictions, references):
            pred_label = pred["label"]
            pred_evidence = set(e.strip().lower() for e in pred["evidence"])
            gold_label = ref["label"]
            gold_sets = []
            for s in ref["evidence_sets"]:
                gold_sets.append([e.strip().lower() for e in s])

            if pred_label == gold_label:
                label_correct += 1
                for g_set in gold_sets:
                    if set(g_set).issubset(pred_evidence):
                        fever_correct += 1
                        break

            gold_evidence = set().union(*gold_sets) if gold_sets else set()
            overlap = len(gold_evidence.intersection(pred_evidence))
            total_overlap += overlap
            total_pred += len(pred_evidence)
            total_gold += len(gold_evidence)

        precision = (total_overlap / total_pred) if total_pred else 0
        recall = (total_overlap / total_gold) if total_gold else 0
        evidence_f1 = 2 * precision * recall / (precision + recall) if (precision + recall) > 0 else 0

        fever_score = fever_correct / total if total else 0
        label_accuracy = label_correct / total if total else 0

        return {
            "label_accuracy": label_accuracy,
            "fever_score": fever_score,
            "evidence_precision": precision,
            "evidence_recall": recall,
            "evidence_f1": evidence_f1,
        }

metrics/fever/test_fever.py (+134, -0)

@@ -0,0 +1,134 @@
# Copyright 2025 The HuggingFace Evaluate Authors.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

"""Tests for the FEVER (Fact Extraction and VERification) metric."""

import unittest

from fever import FEVER # assuming your metric file is named fever.py


fever = FEVER()


class TestFEVER(unittest.TestCase):
    def test_perfect_prediction(self):
        preds = [{"label": "SUPPORTED", "evidence": ["E1", "E2"]}]
        refs = [{"label": "SUPPORTED", "evidence_sets": [["E1", "E2"]]}]
        result = fever.compute(predictions=preds, references=refs)
        self.assertAlmostEqual(result["label_accuracy"], 1.0)
        self.assertAlmostEqual(result["fever_score"], 1.0)
        self.assertAlmostEqual(result["evidence_precision"], 1.0)
        self.assertAlmostEqual(result["evidence_recall"], 1.0)
        self.assertAlmostEqual(result["evidence_f1"], 1.0)

    def test_label_only_correct(self):
        preds = [{"label": "SUPPORTED", "evidence": ["X1", "X2"]}]
        refs = [{"label": "SUPPORTED", "evidence_sets": [["E1", "E2"]]}]
        result = fever.compute(predictions=preds, references=refs)
        self.assertAlmostEqual(result["label_accuracy"], 1.0)
        self.assertAlmostEqual(result["fever_score"], 0.0)
        self.assertTrue(result["evidence_f1"] < 1.0)

    def test_label_incorrect(self):
        preds = [{"label": "REFUTED", "evidence": ["E1", "E2"]}]
        refs = [{"label": "SUPPORTED", "evidence_sets": [["E1", "E2"]]}]
        result = fever.compute(predictions=preds, references=refs)
        self.assertAlmostEqual(result["label_accuracy"], 0.0)
        self.assertAlmostEqual(result["fever_score"], 0.0)

    def test_partial_evidence_overlap(self):
        preds = [{"label": "SUPPORTED", "evidence": ["E1"]}]
        refs = [{"label": "SUPPORTED", "evidence_sets": [["E1", "E2"]]}]
        result = fever.compute(predictions=preds, references=refs)
        self.assertAlmostEqual(result["label_accuracy"], 1.0)
        self.assertAlmostEqual(result["fever_score"], 0.0)
        self.assertAlmostEqual(result["evidence_precision"], 1.0)
        self.assertAlmostEqual(result["evidence_recall"], 0.5)
        self.assertTrue(0 < result["evidence_f1"] < 1.0)

    def test_extra_evidence_still_correct(self):
        preds = [{"label": "SUPPORTED", "evidence": ["E1", "E2", "X1"]}]
        refs = [{"label": "SUPPORTED", "evidence_sets": [["E1", "E2"]]}]
        result = fever.compute(predictions=preds, references=refs)
        self.assertAlmostEqual(result["fever_score"], 1.0)
        self.assertTrue(result["evidence_precision"] < 1.0)
        self.assertAlmostEqual(result["evidence_recall"], 1.0)

    def test_multiple_gold_sets(self):
        preds = [{"label": "SUPPORTED", "evidence": ["E3", "E4"]}]
        refs = [{"label": "SUPPORTED", "evidence_sets": [["E1", "E2"], ["E3", "E4"]]}]
        result = fever.compute(predictions=preds, references=refs)
        self.assertAlmostEqual(result["fever_score"], 1.0)
        self.assertAlmostEqual(result["label_accuracy"], 1.0)

    def test_mixed_examples(self):
        preds = [
            {"label": "SUPPORTED", "evidence": ["A1", "A2"]},
            {"label": "SUPPORTED", "evidence": ["B1"]},
            {"label": "REFUTED", "evidence": ["C1", "C2"]},
        ]
        refs = [
            {"label": "SUPPORTED", "evidence_sets": [["A1", "A2"]]},
            {"label": "SUPPORTED", "evidence_sets": [["B1", "B2"]]},
            {"label": "SUPPORTED", "evidence_sets": [["C1", "C2"]]},
        ]
        result = fever.compute(predictions=preds, references=refs)
        self.assertTrue(0 < result["label_accuracy"] < 1.0)
        self.assertTrue(0 <= result["fever_score"] < 1.0)
        self.assertTrue(0 <= result["evidence_f1"] <= 1.0)

    def test_empty_evidence_prediction(self):
        preds = [{"label": "SUPPORTED", "evidence": []}]
        refs = [{"label": "SUPPORTED", "evidence_sets": [["E1", "E2"]]}]
        result = fever.compute(predictions=preds, references=refs)
        self.assertEqual(result["evidence_precision"], 0.0)
        self.assertEqual(result["evidence_recall"], 0.0)
        self.assertEqual(result["evidence_f1"], 0.0)

    def test_empty_gold_evidence(self):
        preds = [{"label": "SUPPORTED", "evidence": ["E1", "E2"]}]
        refs = [{"label": "SUPPORTED", "evidence_sets": [[]]}]
        result = fever.compute(predictions=preds, references=refs)
        self.assertEqual(result["evidence_recall"], 0.0)

    def test_multiple_examples_micro_averaging(self):
        preds = [
            {"label": "SUPPORTED", "evidence": ["E1"]},
            {"label": "SUPPORTED", "evidence": ["F1", "F2"]},
        ]
        refs = [
            {"label": "SUPPORTED", "evidence_sets": [["E1", "E2"]]},
            {"label": "SUPPORTED", "evidence_sets": [["F1", "F2"]]},
        ]
        result = fever.compute(predictions=preds, references=refs)
        self.assertTrue(result["evidence_f1"] < 1.0)
        self.assertAlmostEqual(result["label_accuracy"], 1.0)

    def test_fever_score_requires_label_match(self):
        preds = [{"label": "REFUTED", "evidence": ["E1", "E2"]}]
        refs = [{"label": "SUPPORTED", "evidence_sets": [["E1", "E2"]]}]
        result = fever.compute(predictions=preds, references=refs)
        self.assertEqual(result["fever_score"], 0.0)
        self.assertEqual(result["label_accuracy"], 0.0)

    def test_empty_input_list(self):
        preds, refs = [], []
        result = fever.compute(predictions=preds, references=refs)
        for k in result:
            self.assertEqual(result[k], 0.0)


if __name__ == "__main__":
    unittest.main()

metrics/meteor/README.md (+3, -0)

@@ -116,6 +116,9 @@ While the correlation between METEOR and human judgments was measured for Chines

Furthermore, while the alignment and matching done in METEOR is based on unigrams, using multiple word entities (e.g. bigrams) could contribute to improving its accuracy -- this has been proposed in [more recent publications](https://www.cs.cmu.edu/~alavie/METEOR/pdf/meteor-naacl-2010.pdf) on the subject.

METEOR scores can differ by up to **±10 points** between versions (v1.0 vs. v1.5) and across flag combinations (`-l`, `-norm`, `-vOut`) of the original Java implementation ([Lübbers, 2024](https://github.com/cluebbers/Reproducibility-METEOR-NLP)).
If you score with the Java package, pin its version and document the flags you used. This metric uses the NLTK implementation, which corresponds to METEOR v1.0.
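
A minimal sketch of one way to keep reported METEOR numbers reproducible (the fields recorded here are illustrative, not a required format): log the implementation and its version alongside the score.

```python
# Illustrative: report the scoring environment together with the METEOR number itself.
import importlib.metadata

import evaluate

meteor = evaluate.load("meteor")  # NLTK-based implementation (METEOR v1.0 behaviour)
results = meteor.compute(predictions=["hello there"], references=["hello there"])

print({
    "meteor": results["meteor"],
    "nltk_version": importlib.metadata.version("nltk"),        # package providing the scorer
    "evaluate_version": importlib.metadata.version("evaluate"),
})
```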

## Citation



metrics/rouge/rouge.py (+2, -2)

@@ -14,9 +14,9 @@
""" ROUGE metric from Google Research github repo. """

# The dependencies in https://github.com/google-research/google-research/blob/master/rouge/requirements.txt
import absl # Here to have a nice missing dependency error message early on
import absl
import datasets
import nltk # Here to have a nice missing dependency error message early on
import nltk
import numpy # Here to have a nice missing dependency error message early on
import six # Here to have a nice missing dependency error message early on
from rouge_score import rouge_scorer, scoring


src/evaluate/loading.py (+8, -7)

@@ -243,7 +243,7 @@ def _download_additional_modules(
        elif import_type == "external":
            url_or_filename = import_path
        else:
            raise ValueError("Wrong import_type")
            raise ValueError(f"Wrong import_type: {import_type!r}. Expected 'library', 'internal', or 'external'.")

        local_import_path = cached_path(
            url_or_filename,
@@ -255,17 +255,18 @@ def _download_additional_modules(

    # Check library imports
    needs_to_be_installed = set()
    for library_import_name, library_import_path in library_imports:
    for library_import_name, _ in library_imports:
        try:
            lib = importlib.import_module(library_import_name)  # noqa F841
            importlib.import_module(library_import_name)  # noqa F841
        except ImportError:
            library_import_name = "scikit-learn" if library_import_name == "sklearn" else library_import_name
            needs_to_be_installed.add((library_import_name, library_import_path))
            library_import_name = "absl-py" if library_import_name == "absl" else library_import_name
            needs_to_be_installed.add(library_import_name)
    if needs_to_be_installed:
        missing = sorted(needs_to_be_installed)
        raise ImportError(
            f"To be able to use {name}, you need to install the following dependencies"
            f"{[lib_name for lib_name, lib_path in needs_to_be_installed]} using 'pip install "
            f"{' '.join([lib_path for lib_name, lib_path in needs_to_be_installed])}' for instance'"
            f"To be able to use {name}, you need to install these dependencies: "
            f"{', '.join(missing)} using the command 'pip install {' '.join(missing)}'."
        )
    return local_imports
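
For context, a minimal standalone sketch of the corrected behaviour (illustrative only; `PIP_NAME` and `missing_dependency_hint` are hypothetical names, not part of `evaluate.loading`): import names are mapped to their PyPI package names before the pip command is assembled, so the resulting hint is copy-pasteable.

```python
# Map import names to PyPI package names where they differ (illustrative table).
PIP_NAME = {"absl": "absl-py", "sklearn": "scikit-learn"}


def missing_dependency_hint(metric_name, missing_imports):
    """Build an installation hint for the imports that failed."""
    packages = sorted(PIP_NAME.get(name, name) for name in missing_imports)
    return (
        f"To be able to use {metric_name}, you need to install these dependencies: "
        f"{', '.join(packages)} using the command 'pip install {' '.join(packages)}'."
    )


print(missing_dependency_hint("rouge", ["absl", "nltk"]))
# To be able to use rouge, you need to install these dependencies: absl-py, nltk
# using the command 'pip install absl-py nltk'.
```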


