Improve quotation generation and confirmation (#505)

* Add polyfill-based text fragment quotation extraction
* Return to dom-anchor-text-quote-based quotation extraction
* Replace dom-anchor-text-quote with approx-string-match for quotation confirmation
* Disable polyfill-based quotation extraction test because it adds ~3m to our GH action test run

---------

Signed-off-by: Carl Gieringer <[email protected]>
1 parent 445883b, commit fca5fc6
Showing 45 changed files with 8,560 additions and 212 deletions.
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -1,2 +1,2 @@ | ||
**/testData/** -linguist-detectable | ||
howdju-text-fragment-generation/dist/** -linguist-detectable | ||
howdju-text-fragments/dist/** -linguist-detectable |
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -17,3 +17,5 @@ | |
# macOS file system metadata. These were showing up in act's Github Workflow | ||
# runner. | ||
.DS_Store | ||
|
||
*.cpuprofile |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,62 @@ | ||
import { readFileSync } from "fs"; | ||
import { JSDOM } from "jsdom"; | ||
import stripIndent from "strip-indent"; | ||
import * as textPosition from "dom-anchor-text-position"; | ||
|
||
import { approximateMatch } from "./approximateStringMatch"; | ||
import { toPlainTextContent } from "./domCommon"; | ||
|
||
describe("approximateMatch", () => { | ||
test("matches", () => { | ||
const html = readFileSync( | ||
"lib/testData/urlTextFragments/lexfridman.html", | ||
"utf8" | ||
); | ||
const dom = new JSDOM(html); | ||
const doc = dom.window.document; | ||
const quotation = stripIndent(` | ||
Robert F. Kennedy Jr | ||
(00:09:49) I suppose the way that Camus viewed the world and the way that the Stoics did and a lot of the existentialists, it was that it was so absurd and that the problems and the tasks that were given just to live a life are so insurmountable that the only way that we can get back the gods for giving us this impossible task of living life was to embrace it and to enjoy it and to do our best at it. To me, I read Camus, and particularly in The Myth of Sisyphus as a parable that… And it’s the same lesson that I think he writes about in The Plague, where we’re all given these insurmountable tasks in our lives, but that by doing our duty, by being of service to others, we can bring meaning to a meaningless chaos and we can bring order to the universe. | ||
`).trim(); | ||
|
||
const matches = approximateMatch(doc.body.textContent ?? "", quotation); | ||
|
||
expect(matches).toEqual([{ end: 10800, errors: 25, start: 9995 }]); | ||
const [{ start, end }] = matches; | ||
const range = textPosition.toRange(doc.body, { start, end }); | ||
const foundQuotation = toPlainTextContent(range); | ||
// TODO(507) it should be possible to match the quotation exactly. | ||
const expectedFoundQuotation = `Robert F. Kennedy Jr (00:09:49) I suppose the way that Camus viewed the world and the way that the Stoics did and a lot of the existentialists, it was that it was so absurd and that the problems and the tasks that were given just to live a life are so insurmountable that the only way that we can get back the gods for giving us this impossible task of living life was to embrace it and to enjoy it and to do our best at it. To me, I read Camus, and particularly in The Myth of Sisyphus as a parable that… And it’s the same lesson that I think he writes about in The Plague, where we’re all given these insurmountable tasks in our lives, but that by doing our duty, by being of service to others, we can bring meaning to a meaningless chaos and we can bring order to the universe.`; | ||
expect(foundQuotation).toEqual(expectedFoundQuotation); | ||
}); | ||
|
||
test("matches non-optimally", () => { | ||
const html = readFileSync( | ||
"lib/testData/urlTextFragments/lexfridman.html", | ||
"utf8" | ||
); | ||
const dom = new JSDOM(html); | ||
const doc = dom.window.document; | ||
const quotation = stripIndent(` | ||
Lex Fridman | ||
(00:21:33) And you think that kind of empathy that you referred to, that requires moral courage? | ||
`).trim(); | ||
|
||
const [{ start, end, errors }] = approximateMatch( | ||
doc.body.textContent ?? "", | ||
quotation | ||
); | ||
|
||
expect({ start, end, errors }).toEqual({ | ||
start: 19933, | ||
end: 20035, | ||
errors: 21, | ||
}); | ||
const range = textPosition.toRange(doc.body, { start, end }); | ||
const foundQuotation = toPlainTextContent(range); | ||
// TODO(507) it should be possible to match the quotation exactly. | ||
expect(foundQuotation).toEqual(quotation.substring(20)); | ||
}); | ||
}); |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,7 @@ | ||
import search from "approx-string-match"; | ||
|
||
export const MAX_ACCEPTABLE_ERRORS = 50; | ||
|
||
export function approximateMatch(document: string, query: string) { | ||
return search(document, query, MAX_ACCEPTABLE_ERRORS); | ||
} |
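
To make the `start`, `end`, and `errors` fields in the tests easier to interpret, here is a minimal, self-contained sketch of approximate substring matching by edit distance. It uses a plain dynamic-programming formulation (Sellers' algorithm), chosen for readability; it is not the implementation used by `approx-string-match`, and `bestApproximateMatch` is a hypothetical name introduced only for this example. Unlike `search`, which returns an array of matches, this sketch returns only the single lowest-error match.

```typescript
interface Match {
  start: number; // offset of the first matched character in the text
  end: number; // offset just past the last matched character
  errors: number; // edit distance between the pattern and text.slice(start, end)
}

// Hypothetical helper: returns the lowest-error occurrence of `pattern` in
// `text`, or undefined if every occurrence needs more than `maxErrors` edits.
function bestApproximateMatch(
  text: string,
  pattern: string,
  maxErrors: number
): Match | undefined {
  const m = pattern.length;
  // Column for "no text consumed": matching pattern[0..i) costs i deletions.
  let prevCost: number[] = Array.from({ length: m + 1 }, (_, i) => i);
  let prevStart: number[] = new Array<number>(m + 1).fill(0);
  let best: Match | undefined;

  for (let j = 1; j <= text.length; j++) {
    const cost: number[] = new Array<number>(m + 1).fill(0);
    const start: number[] = new Array<number>(m + 1).fill(0);
    // A match may begin at any text offset at no cost.
    cost[0] = 0;
    start[0] = j;
    for (let i = 1; i <= m; i++) {
      // Align pattern[i-1] with text[j-1] (free if they are equal).
      let c = prevCost[i - 1] + (pattern[i - 1] === text[j - 1] ? 0 : 1);
      let s = prevStart[i - 1];
      // Treat text[j-1] as an extra character not present in the pattern.
      if (prevCost[i] + 1 < c) {
        c = prevCost[i] + 1;
        s = prevStart[i];
      }
      // Treat pattern[i-1] as missing from the text.
      if (cost[i - 1] + 1 < c) {
        c = cost[i - 1] + 1;
        s = start[i - 1];
      }
      cost[i] = c;
      start[i] = s;
    }
    // cost[m] is the cheapest way to match the whole pattern ending at j.
    if (cost[m] <= maxErrors && (best === undefined || cost[m] < best.errors)) {
      best = { start: start[m], end: j, errors: cost[m] };
    }
    prevCost = cost;
    prevStart = start;
  }
  return best;
}

const text = "the quick brown fox";
const match = bestApproximateMatch(text, "quick brwn", 2);
console.log(match); // { start: 4, end: 15, errors: 1 }
if (match) {
  console.log(text.slice(match.start, match.end)); // "quick brown"
}
```

The sketch runs in time proportional to the pattern length times the text length, which is acceptable for short examples but slow for a whole page of text.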