Add partial ratio #10

Open
wants to merge 14 commits into
base: main
1 change: 1 addition & 0 deletions src/distance.rs
@@ -1,3 +1,4 @@
pub mod common;
pub mod damerau_levenshtein;
pub mod hamming;
pub mod indel;
15 changes: 15 additions & 0 deletions src/distance/common.rs
@@ -0,0 +1,15 @@
/**
Tuple like object describing the position of the compared strings in
src and dest.

It indicates that the score has been calculated between
src[src_start:src_end] and dest[dest_start:dest_end]
*/
#[derive(PartialEq, Debug)]
pub struct ScoreAlignment {
pub score: f64,
pub src_start: usize,
pub src_end: usize,
pub dest_start: usize,
pub dest_end: usize,
}
347 changes: 347 additions & 0 deletions src/fuzz.rs
@@ -1,7 +1,9 @@
use crate::common::{NoScoreCutoff, SimilarityCutoff, WithScoreCutoff};
use crate::details::distance::MetricUsize;
use crate::distance::common::ScoreAlignment;
use crate::distance::indel;
use crate::HashableChar;
use std::collections::HashSet;

#[must_use]
#[derive(Clone, Copy, Debug)]
@@ -149,6 +151,263 @@ where
}
}

/// Searches for the optimal alignment of the shorter string in the
/// longer string and returns the fuzz.ratio for this alignment.
///
/// # Example
/// ```
/// use rapidfuzz::fuzz;
/// // score is 1.0
/// let score = fuzz::partial_ratio("this is a test".chars(), "this is a test!".chars());
/// ```
///
pub fn partial_ratio<Iter1, Iter2>(s1: Iter1, s2: Iter2) -> f64
where
Iter1: IntoIterator,
Iter1::IntoIter: DoubleEndedIterator + Clone,
Iter2: IntoIterator,
Iter2::IntoIter: DoubleEndedIterator + Clone,
Iter1::Item: PartialEq<Iter2::Item> + HashableChar + Copy,
Iter2::Item: PartialEq<Iter1::Item> + HashableChar + Copy,
{
partial_ratio_with_args(s1, s2, &Args::default())
}

pub fn partial_ratio_with_args<Iter1, Iter2, CutoffType>(
s1: Iter1,
s2: Iter2,
args: &Args<f64, CutoffType>,
) -> CutoffType::Output
where
Iter1: IntoIterator,
Iter1::IntoIter: DoubleEndedIterator + Clone,
Iter2: IntoIterator,
Iter2::IntoIter: DoubleEndedIterator + Clone,
Iter1::Item: PartialEq<Iter2::Item> + HashableChar + Copy,
Iter2::Item: PartialEq<Iter1::Item> + HashableChar + Copy,
CutoffType: SimilarityCutoff<f64, Output = f64>,
{
let s1_iter = s1.into_iter();
let s2_iter = s2.into_iter();

let alignment = partial_ratio_alignment(
s1_iter.clone(),
s1_iter.count(),
s2_iter.clone(),
s2_iter.count(),
args,
);

match alignment {
Some(alignment) => alignment.score,
None => 0.0,
}
Member commented:

This will not return None if a score below score_cutoff was passed.
For now this could be:

let score = match alignment {
    Some(alignment) => alignment.score,
    None => 0.0,
};
args.score_cutoff.score(score)

This will likely need to be changed when making the return type of partial_ratio_alignment dependent on the presence of a score_cutoff.

Please add a test for this as well. fn issue206 might be a good place. For the other languages I managed to find ways to iterate over functions to reduce the boilerplate for some of these tests. I haven't looked into ways to achieve the same in Rust so far.
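To make the requested behaviour concrete, here is a minimal standalone sketch of applying an optional cutoff to a score. `apply_cutoff` and `score_from_alignment` are hypothetical helpers written for illustration; they are not the crate's `SimilarityCutoff` trait, and whether a rejected score should surface as `None` or as `0.0` depends on the cutoff type used in the real code.

```rust
/// Hypothetical stand-in for applying an optional score cutoff.
/// It only models the behaviour asked for in the comment above.
fn apply_cutoff(score: f64, cutoff: Option<f64>) -> Option<f64> {
    match cutoff {
        // No cutoff requested: every score is reported.
        None => Some(score),
        // Cutoff requested: scores below it are rejected.
        Some(min) => (score >= min).then_some(score),
    }
}

/// Hypothetical wrapper mirroring the `match` in the suggestion: a missing
/// alignment counts as score 0.0 before the cutoff is applied.
fn score_from_alignment(alignment_score: Option<f64>, cutoff: Option<f64>) -> Option<f64> {
    let score = alignment_score.unwrap_or(0.0);
    apply_cutoff(score, cutoff)
}
```

A test along these lines would check one score above the cutoff, one below it, and the case without any cutoff.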

}

pub fn partial_ratio_alignment<Iter1, Iter2, CutoffType>(
s1: Iter1,
len1: usize,
s2: Iter2,
len2: usize,
args: &Args<f64, CutoffType>,
) -> Option<ScoreAlignment>
where
Iter1: IntoIterator,
Iter1::IntoIter: DoubleEndedIterator + Clone,
Iter2: IntoIterator,
Iter2::IntoIter: DoubleEndedIterator + Clone,
Iter1::Item: PartialEq<Iter2::Item> + HashableChar + Copy,
Iter2::Item: PartialEq<Iter1::Item> + HashableChar + Copy,
CutoffType: SimilarityCutoff<f64, Output = f64>,
{
let s1_iter = s1.into_iter();
let s2_iter = s2.into_iter();
let mut score_cutoff = args.score_cutoff.cutoff().unwrap_or(0.0);

let mut res = if len1 <= len2 {
partial_ratio_impl(
s1_iter.clone(),
len1,
s2_iter.clone(),
len2,
score_cutoff,
args.score_hint,
)
} else {
partial_ratio_impl(
s2_iter.clone(),
len2,
s1_iter.clone(),
len1,
score_cutoff,
args.score_hint,
)
};
Member @maxbachmann commented on Dec 11, 2024:

If the two strings are swapped for the comparison we have to swap the start and end positions to fix them up. I guess that would come down to something along the lines of:

if len1 > len2 {
    let mut res = partial_ratio_alignment(s2, len2, s1, len1, args);
    std::mem::swap(&mut res.src_start, &mut res.dest_start);
    std::mem::swap(&mut res.src_end, &mut res.dest_end);
    return res;
}

Alternatively you can construct a new ScoreAlignment and return that.

A test would make sense for this as well. I should add one to the C++ tests too.
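The fix-up described above can be sketched in isolation as follows. The struct is a local copy of the `ScoreAlignment` from this PR so the snippet compiles on its own; `swap_sides` and `swap_sides_opt` are hypothetical helper names, and the `Option` variant is there because `partial_ratio_alignment` returns `Option<ScoreAlignment>`.

```rust
/// Local copy of the struct added in src/distance/common.rs,
/// repeated here only to keep the sketch self-contained.
#[derive(Debug, Clone, PartialEq)]
struct ScoreAlignment {
    score: f64,
    src_start: usize,
    src_end: usize,
    dest_start: usize,
    dest_end: usize,
}

/// Exchanges the src and dest ranges; the score is left untouched.
fn swap_sides(mut alignment: ScoreAlignment) -> ScoreAlignment {
    std::mem::swap(&mut alignment.src_start, &mut alignment.dest_start);
    std::mem::swap(&mut alignment.src_end, &mut alignment.dest_end);
    alignment
}

/// Same fix-up for an optional result; `None` stays `None`.
fn swap_sides_opt(alignment: Option<ScoreAlignment>) -> Option<ScoreAlignment> {
    alignment.map(swap_sides)
}
```

Swapping twice gives back the original alignment, which makes for a cheap property to assert in a test.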

if (res.score != 1.0) && (len1 == len2) {
score_cutoff = f64::max(score_cutoff, res.score);
let res2 = if len1 <= len2 {
partial_ratio_impl(
s2_iter.clone(),
len2,
s1_iter.clone(),
len1,
score_cutoff,
args.score_hint,
)
} else {
partial_ratio_impl(
s1_iter.clone(),
len1,
s2_iter.clone(),
len2,
score_cutoff,
args.score_hint,
)
};
if res2.score > res.score {
res = ScoreAlignment {
score: res2.score,
src_start: res2.dest_start,
src_end: res2.dest_end,
dest_start: res2.src_start,
dest_end: res2.src_end,
};
}
}

(res.score >= score_cutoff).then_some(res)
}

/**
implementation of partial_ratio for needles <= 64. assumes s1 is already the
shorter string
*/
Comment on lines +267 to +270
Member @maxbachmann commented on Dec 12, 2024:

It assumes len(s1) <= len(s2) not len(s1) < len(s2). I fixed the comment in the Python version as well.

The implementation should be based on the C++ implementation, which you can find here: https://github.com/rapidfuzz/rapidfuzz-cpp/blob/c6a3ac87c42ddf52f502dc3ed7001c8c2cefb900/rapidfuzz/fuzz_impl.hpp#L68

The Python implementation only skips windows if they can't exist since a character is unique. The C++ implementation makes use of the fact that a score can only change by a certain distance per shift of the window to skip windows. An example would be:

"aaa" <-> "bcdef"

Here the alignments with the full length are:

  • "aaa" <-> "bcd"
  • "aaa" <-> "cde"
  • "aaa" <-> "def"

Now if we first calculate the distance for alignments 1 and 3, both of them have an indel distance of 6. That way we know that alignment 2 can at best have an indel distance of 4.
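The numbers in this example can be reproduced with a small brute-force sketch. It uses a plain LCS-based indel distance rather than the crate's `indel` module, and `lcs_len`, `indel_distance` and `window_distances` are hypothetical names. Shifting a full-length window by one position drops one character and adds one, so the LCS changes by at most 1 and the indel distance by at most 2, which is where the bound of 4 comes from.

```rust
/// Length of the longest common subsequence (O(n*m) DP with two rolling rows).
fn lcs_len(a: &[char], b: &[char]) -> usize {
    let mut prev = vec![0usize; b.len() + 1];
    for &ca in a {
        let mut cur = vec![0usize; b.len() + 1];
        for (j, &cb) in b.iter().enumerate() {
            cur[j + 1] = if ca == cb {
                prev[j] + 1
            } else {
                prev[j + 1].max(cur[j])
            };
        }
        prev = cur;
    }
    prev[b.len()]
}

/// Indel distance: insertions and deletions only, no substitutions.
fn indel_distance(a: &[char], b: &[char]) -> usize {
    a.len() + b.len() - 2 * lcs_len(a, b)
}

/// Indel distance of `needle` against every full-length window of `haystack`.
fn window_distances(needle: &str, haystack: &str) -> Vec<usize> {
    let n: Vec<char> = needle.chars().collect();
    let h: Vec<char> = haystack.chars().collect();
    // `windows(0)` would panic, and a longer needle has no full-length window.
    if n.is_empty() || n.len() > h.len() {
        return Vec::new();
    }
    h.windows(n.len()).map(|w| indel_distance(&n, w)).collect()
}
```

For "aaa" against "bcdef" this gives a distance of 6 for all three windows, so knowing the outer two already rules out anything better than 4 for the middle one.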

Looking at the C++ implementation, I feel like the actual min_score calculation in there is incorrect. I believe that instead of:

/* half of the cells that are not needed for known_edits can lead to a better score */
ptrdiff_t min_score =
    static_cast<ptrdiff_t>(std::min(scores[window.first], scores[window.second])) -
    static_cast<ptrdiff_t>(cell_diff + known_edits / 2);

this should be

/* half of the cells that are not needed for known_edits can lead to a better score */
size_t max_score_improvement = (cell_diff - known_edits / 2) / 2 * 2;
ptrdiff_t min_score =
    static_cast<ptrdiff_t>(std::min(scores[window.first], scores[window.second])) -
    static_cast<ptrdiff_t>(max_score_improvement);

The current implementation doesn't lead to incorrect results, but it skips fewer cells than it could.
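For comparison, the two bounds can be transcribed into Rust side by side. This is only an arithmetic transcription of the two C++ snippets above: the meaning of `cell_diff` and `known_edits` is whatever the linked C++ code defines, the function names are hypothetical, and `saturating_sub` is this sketch's own choice to avoid unsigned underflow.

```rust
/// Lower bound as written in the first (current) snippet.
fn min_score_current(
    score_first: usize,
    score_second: usize,
    cell_diff: usize,
    known_edits: usize,
) -> isize {
    score_first.min(score_second) as isize - (cell_diff + known_edits / 2) as isize
}

/// Lower bound as written in the second (proposed) snippet.
fn min_score_proposed(
    score_first: usize,
    score_second: usize,
    cell_diff: usize,
    known_edits: usize,
) -> isize {
    // Assumption of this sketch: clamp at zero instead of wrapping around.
    let max_score_improvement = cell_diff.saturating_sub(known_edits / 2) / 2 * 2;
    score_first.min(score_second) as isize - max_score_improvement as isize
}
```

With arbitrary illustrative inputs such as scores 6 and 4, `cell_diff = 4` and `known_edits = 2`, the first form yields -1 and the second yields 2. A higher lower bound is a tighter one, so more windows can be ruled out.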

Member commented:

I improved the calculation of min_score in rapidfuzz-cpp to skip more alignments.

fn partial_ratio_impl<Iter1, Iter2>(
s1: Iter1,
len1: usize,
s2: Iter2,
len2: usize,
mut score_cutoff: f64,
score_hint: Option<f64>,
) -> ScoreAlignment
where
Iter1: IntoIterator,
Iter1::IntoIter: DoubleEndedIterator + Clone,
Iter2: IntoIterator,
Iter2::IntoIter: DoubleEndedIterator + Clone,
Iter1::Item: PartialEq<Iter2::Item> + HashableChar + Copy,
Iter2::Item: PartialEq<Iter1::Item> + HashableChar + Copy,
{
if len1 == 0 {
return ScoreAlignment {
score: 0.0,
src_start: 0,
src_end: 0,
dest_start: 0,
dest_end: 0,
};
}

let s1_iter = s1.into_iter();
let s2_vec = s2.into_iter().collect::<Vec<_>>();
Member @maxbachmann commented on Dec 11, 2024:

I am still a bit unsure about this. In C++ we use random access directly, but that only works because we assume random access is fast. In Rust, for example, a string is UTF-8 encoded and so has no fast random access by character.
I believe we could write this so that random access isn't required. However, we would still iterate over the string multiple times, so for those types it may still be faster to just make the copy.

I would leave it like this for now, since partial_ratio is never super fast anyway. If we were to change it we should probably benchmark it first.
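The trade-off can be illustrated with a small sketch that copies the characters once and then slices windows by index. `char_windows` is a hypothetical helper, not part of the PR, and it makes no claim about which approach benchmarks faster.

```rust
/// Collects the characters once, then takes every window of `window_len`
/// characters by slicing the `Vec<char>`, which is O(1) per window start.
fn char_windows(haystack: &str, window_len: usize) -> Vec<String> {
    let chars: Vec<char> = haystack.chars().collect();
    // `windows(0)` would panic; a too-long window simply has no matches.
    if window_len == 0 || window_len > chars.len() {
        return Vec::new();
    }
    chars
        .windows(window_len)
        .map(|w| w.iter().collect::<String>())
        .collect()
}
```

Indexing the original `&str` by byte offsets would not work the same way for input such as "maßnahmen", where the byte length and the character count differ.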


let s1_char_set = s1_iter
.clone()
.map(|c| c.hash_char())
.collect::<HashSet<_>>();
Comment on lines +300 to +303
Member commented:

Things like the BatchComparator and the char set should be passed as arguments. This is required so that the same implementation can be used from PartialRatioBatchComparator. PartialRatioBatchComparator should be implemented as well.
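A rough sketch of what passing the char set in could look like, using plain `char` and `HashSet<char>` instead of the crate's `HashableChar` machinery. `candidate_starts` is a hypothetical function; it only covers full-length windows and applies the same last-character filter as the middle loop of this PR.

```rust
use std::collections::HashSet;

/// Returns the start indices of all full-length windows whose last character
/// occurs in the needle. The set is built once by the caller and reused.
fn candidate_starts(
    needle_chars: &HashSet<char>,
    haystack: &[char],
    window_len: usize,
) -> Vec<usize> {
    if window_len == 0 || window_len > haystack.len() {
        return Vec::new();
    }
    (0..=haystack.len() - window_len)
        .filter(|&start| needle_chars.contains(&haystack[start + window_len - 1]))
        .collect()
}
```

A caller comparing one needle against many haystacks would build the `HashSet` a single time and hand a reference to each call, which is the reuse the comment is asking for.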


let mut res = ScoreAlignment {
score: 0.0,
src_start: 0,
src_end: len1,
dest_start: 0,
dest_end: len1,
};

let indel_comp = indel::BatchComparator::new(s1_iter.clone());

for i in 1..len1 {
let substr_last = &s2_vec[i - 1];
if !s1_char_set.contains(&substr_last.hash_char()) {
continue;
}

let ls_ratio = indel_comp
.normalized_similarity_with_args(
s2_vec[..i].iter().cloned(),
&indel::Args {
score_cutoff: WithScoreCutoff(score_cutoff),
score_hint,
},
)
.unwrap_or(0.0);
if ls_ratio > res.score {
score_cutoff = ls_ratio;
res.score = ls_ratio;
res.dest_start = 0;
res.dest_end = i;
if res.score == 1.0 {
return res;
}
}
}

let window_end = len2 - len1;
for i in 0..window_end {
let substr_last = &s2_vec[i + len1 - 1];
if !s1_char_set.contains(&substr_last.hash_char()) {
continue;
}

let ls_ratio = indel_comp
.normalized_similarity_with_args(
s2_vec[i..i + len1].iter().cloned(),
&indel::Args {
score_cutoff: WithScoreCutoff(score_cutoff),
score_hint,
},
)
.unwrap_or(0.0);
if ls_ratio > res.score {
score_cutoff = ls_ratio;
res.score = ls_ratio;
res.dest_start = i;
res.dest_end = i + len1;
if res.score == 1.0 {
return res;
}
}
}

for i in window_end..len2 {
let substr_first = &s2_vec[i];
if !s1_char_set.contains(&substr_first.hash_char()) {
continue;
}

let ls_ratio = indel_comp
.normalized_similarity_with_args(
s2_vec[i..].iter().cloned(),
&indel::Args {
score_cutoff: WithScoreCutoff(score_cutoff),
score_hint,
},
)
.unwrap_or(0.0);
if ls_ratio > res.score {
score_cutoff = ls_ratio;
res.score = ls_ratio;
res.dest_start = i;
res.dest_end = len2;
if res.score == 1.0 {
return res;
}
}
}

res
}

#[cfg(test)]
mod tests {
use super::*;
@@ -299,4 +558,92 @@ mod tests {
);
}
}

#[test]
fn test_partial_ratio2() {
let s1 = "this is a test";
let s2 = "this is a test!";
let result = partial_ratio(s1.chars(), s2.chars());
assert_eq!(result, 1.0, "Expected 1.0");
}

#[test]
fn test_partial_ratio_issue138() {
let s1 = &"a".repeat(65);
let s2 = &format!("a{}{}", char::from_u32(256).unwrap(), "a".repeat(63));
let result = partial_ratio(s1.chars(), s2.chars());
assert!(
(result - 0.9922481).abs() < 1e-5,
"Expected approximately 0.9922481, got {}",
result
);
}

#[test]
fn test_partial_ratio_alignment() {
let str1 = "er merkantilismus förderte handle und verkehr mit teils marktkonformen, teils dirigistischen maßnahmen.";
let str2 = "ils marktkonformen, teils dirigistischen maßnahmen. an der schwelle zum 19. jahrhundert entstand ein neu";

let alignment = partial_ratio_alignment(
str1.chars(),
str1.chars().count(),
str2.chars(),
str2.chars().count(),
&Args::default(),
);

dbg!(&alignment);

assert!(
(alignment.as_ref().unwrap().score - 0.662337662).abs() < 1e-5,
"Expected 0.662337662, got {}",
alignment.unwrap().score
);
assert_eq!(alignment.as_ref().unwrap().src_start, 0);
assert_eq!(alignment.as_ref().unwrap().src_end, 103);
assert_eq!(alignment.as_ref().unwrap().dest_start, 0);
assert_eq!(alignment.as_ref().unwrap().dest_end, 51);
}

#[test]
fn test_partial_ratio_impl_identical() {
let s1 = "abcd";
let s2 = "abcd";

let result = partial_ratio_impl(
s1.chars(),
s1.chars().count(),
s2.chars(),
s2.chars().count(),
0.0,
None,
);

assert_eq!(result.score, 1.0);
assert_eq!(result.src_start, 0);
assert_eq!(result.src_end, 4);
assert_eq!(result.dest_start, 0);
assert_eq!(result.dest_end, 4);
}

#[test]
fn test_partial_ratio_impl_substring() {
let s1 = "bcd";
let s2 = "abcde";

let result = partial_ratio_impl(
s1.chars(),
s1.chars().count(),
s2.chars(),
s2.chars().count(),
0.0,
None,
);

assert_eq!(result.score, 1.0);
assert_eq!(result.src_start, 0);
assert_eq!(result.src_end, 3);
assert_eq!(result.dest_start, 1);
assert_eq!(result.dest_end, 4);
}
Member commented:

I think there are still a couple of other tests in https://github.com/rapidfuzz/RapidFuzz/blob/main/tests/test_fuzz.py and https://github.com/rapidfuzz/rapidfuzz-cpp/blob/main/test/tests-fuzz.cpp that we could port over, but I haven't checked this in depth so far.

}
2 changes: 1 addition & 1 deletion src/lib.rs
@@ -100,7 +100,7 @@ pub mod distance;
pub mod fuzz;

/// Hash value in the range `i64::MIN` - `u64::MAX`
-#[derive(Debug, Copy, Clone)]
+#[derive(Debug, Copy, Clone, PartialEq, Eq, Hash)]
pub enum Hash {
UNSIGNED(u64),
SIGNED(i64),