pair classification inconsistencies #582

Open
SaitejaUtpala opened this issue Apr 26, 2024 · 1 comment
Labels: enhancement, good first issue

Comments

@SaitejaUtpala
Contributor

class AbsTaskPairClassification(AbsTask):
    """Abstract class for PairClassificationTasks
    The similarity is computed between pairs and the results are ranked. Average precision
    is computed to measure how well the methods can be used for pairwise pair classification.

    self.load_data() must generate a huggingface dataset with a split matching self.metadata_dict["eval_splits"], and assign it to self.dataset. It must contain the following columns:
        sent1: list[str]
        sent2: list[str]
        labels: list[int]
    """

    def __init__(self, **kwargs):
        super().__init__(**kwargs)

    def _evaluate_monolingual(self, model, dataset, split="test", **kwargs):
        data_split = dataset[split][0]  # This looks like an error: it takes only the first row of the split
        logging.getLogger(
            "sentence_transformers.evaluation.PairClassificationEvaluator"
        ).setLevel(logging.WARN)
        evaluator = PairClassificationEvaluator(
            data_split["sent1"], data_split["sent2"], data_split["labels"], **kwargs
        )

While working on the dataset for #581, I found a couple of potential issues in 'AbsTaskPairClassification':

  • data_split = dataset[split][0] in _evaluate_monolingual: this looks like an error, because it takes only the first row of the split rather than the whole dataset.
  • PairClassificationEvaluator(
    data_split["sent1"], data_split["sent2"], data_split["labels"], **kwargs
    )
    expects the columns 'sent1', 'sent2', and 'labels' instead of 'sentence1', 'sentence2', and 'label' (the convention followed in the STS and bitext mining tasks).
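
To make the two points concrete, here is a minimal sketch of what indexing with [0] returns under each layout. It assumes plain Python lists of dicts as stand-ins for a Hugging Face split (where indexing with an integer returns one row as a dict); the example sentences are made up for illustration.

```python
# Plain Python stand-ins for a dataset split: a list of row dicts,
# so split[0] returns the first row, as integer indexing does on a split.

# Legacy "packed" layout: one row whose columns are whole lists.
packed_split = [
    {
        "sent1": ["a cat sits", "it rains"],
        "sent2": ["a cat is sitting", "the sun shines"],
        "labels": [1, 0],
    }
]

# Per-pair layout using the column names found in other tasks.
flat_split = [
    {"sentence1": "a cat sits", "sentence2": "a cat is sitting", "label": 1},
    {"sentence1": "it rains", "sentence2": "the sun shines", "label": 0},
]

packed_first = packed_split[0]
flat_first = flat_split[0]

# With the packed layout, [0] yields every pair at once.
n_pairs_packed = len(packed_first["sent1"])

# With the per-pair layout, [0] yields a single pair and has no "sent1" key,
# so code written for the packed layout would fail on it.
has_sent1 = "sent1" in flat_first

print(n_pairs_packed, has_sent1)
```

So a dataset uploaded one pair per row, with 'sentence1'/'sentence2'/'label' columns, would not work with the current code path.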
@loicmagne
Member

It's not an error; it's a legacy of how the initial pair classification datasets were formatted. For example, TwitterSemEval2015:

>>> from datasets import load_dataset
>>> d = load_dataset('mteb/twittersemeval2015-pairclassification')
Downloading data: 100%|██████████████████████████████████████████████████████████████████████████| 313k/313k [00:00<00:00, 1.35MB/s]
Generating test split: 100%|███████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00,  7.39 examples/s]
>>> d
DatasetDict({
    test: Dataset({
        features: ['sent1', 'sent2', 'labels'],
        num_rows: 1
    })
})

There is a single row, and each of its columns holds a list: all first sentences, all second sentences, and all labels. I agree this isn't a very good format, and the naming is inconsistent with other tasks, so it might make sense to change it.
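
If the format were changed, the conversion between the two layouts could look roughly like the sketch below. The helper names unpack_pairs and pack_pairs are hypothetical (they are not part of mteb), and the target column names 'sentence1', 'sentence2', 'label' are taken from the suggestion above.

```python
def unpack_pairs(packed_row):
    """Turn one packed row into a list of per-pair rows (hypothetical helper)."""
    sent1 = packed_row["sent1"]
    sent2 = packed_row["sent2"]
    labels = packed_row["labels"]
    if not (len(sent1) == len(sent2) == len(labels)):
        raise ValueError("sent1, sent2 and labels must have the same length")
    return [
        {"sentence1": a, "sentence2": b, "label": y}
        for a, b, y in zip(sent1, sent2, labels)
    ]


def pack_pairs(rows):
    """Inverse of unpack_pairs: collect per-pair rows into one packed row."""
    return {
        "sent1": [r["sentence1"] for r in rows],
        "sent2": [r["sentence2"] for r in rows],
        "labels": [r["label"] for r in rows],
    }


# Made-up example data in the legacy single-row layout.
packed = {
    "sent1": ["a cat sits", "it rains"],
    "sent2": ["a cat is sitting", "the sun shines"],
    "labels": [1, 0],
}
rows = unpack_pairs(packed)
print(rows[0])
```

Keeping a pack_pairs step around would let the existing evaluator call stay unchanged while datasets migrate to one pair per row.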

@KennethEnevoldsen changed the title from "pair classification inconsistencies and bugs" to "pair classification inconsistencies" on Apr 29, 2024
@KennethEnevoldsen added the enhancement label on Apr 29, 2024
@loicmagne added the good first issue and help wanted labels, then removed the help wanted label, on May 1, 2024