Text Search PDF content #685

Open
edeleastar opened this issue Jan 29, 2024 · 3 comments

Comments

@edeleastar
Contributor

Devise a strategy to extend Tutors Search to PDF content.

Possible approach: consider a modification to the generator project:

When generating a project, produce a JSON representation of the text within each PDF. Use this companion content as the basis for the search algorithm.

@lgriffin
Contributor

@edeleastar I played around with this and made a small tool, https://github.com/lgriffin/pdf-to-json, that converts PDF to JSON. A few decisions need to be made here before I integrate something into Tutors:

This isn't a big uplift to implement; once you give me clarity on the above, I can quickly get this into a PR.

@edeleastar
Contributor Author

edeleastar commented Jun 23, 2024

hi @lgriffin

Fascinating work. I would be inclined to consider an experiment whereby the JSON for PDFs is included in tutors.json (just like lab text is). My sense is that it would not be as verbose as the labs, and with compression (when we enable it) it should have negligible impact on performance.

Thus, the roadmap could be:

(1) Create a new npm package, "tutors-publish-s", a superset of "tutors-publish", which generates a tutors.json (as currently) but also contains the contents of each PDF as JSON, embedded in the talk object. This could be typed like this:

type Talk = { // type name assumed; the opening line is missing from the post
  type: "talk";
  pdf: string; // route to pdf for the lo
  pdfFile: string; // pdf file name
  contentJson: any; // textual contents of pdf
};

See

(2) Update the search module in the Tutors Reader to also search this object (currently it only searches notes and labs)

See

This would have to identify the slide number the search text occurs on

(3) Update the Talk component to include a slide number in its URL. This is not implemented yet, but should be relatively straightforward. It is also a useful feature in its own right: being able to bookmark or send a link to an individual slide. The URL would update as the user navigates the slides.
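
As a rough illustration of steps (2) and (3) together, the sketch below searches per-slide text and builds a link to each matching slide. It assumes `contentJson` is an array with one string per slide (the proposal above leaves it as `any`) and that the PDF route is an absolute URL. The `TalkText` type, the `slide` query parameter, and both function names are hypothetical, not part of the current Tutors Reader.

```typescript
// Hypothetical shape: one string of extracted text per slide (index 0 = slide 1).
type TalkText = {
  type: "talk";
  pdf: string; // route to pdf for the lo (assumed absolute here)
  pdfFile: string; // pdf file name
  contentJson: string[]; // assumption: text of each slide, in order
};

type SlideHit = { pdfFile: string; slide: number; url: string };

// Build a link to a single slide; the "slide" parameter name is an assumption.
function slideUrl(pdfRoute: string, slide: number): string {
  const url = new URL(pdfRoute);
  url.searchParams.set("slide", String(slide));
  return url.toString();
}

// Case-insensitive substring search that reports 1-based slide numbers.
function searchTalk(talk: TalkText, term: string): SlideHit[] {
  const needle = term.trim().toLowerCase();
  if (needle === "") return [];
  const hits: SlideHit[] = [];
  talk.contentJson.forEach((text, index) => {
    if (text.toLowerCase().includes(needle)) {
      const slide = index + 1;
      hits.push({ pdfFile: talk.pdfFile, slide, url: slideUrl(talk.pdf, slide) });
    }
  });
  return hits;
}

// Made-up sample data, for illustration only.
const demoTalk: TalkText = {
  type: "talk",
  pdf: "https://example.org/course/topic-01/talk-1/intro.pdf",
  pdfFile: "intro.pdf",
  contentJson: ["Welcome to the module", "Closures in JavaScript", "More on closures"],
};

const slideHits = searchTalk(demoTalk, "closures");
```

Keeping the slide number in the hit itself means the search results list and the Talk component could share one URL convention, whatever form it finally takes.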

Additionally, it might be a good time to roll the encryption/compression work you did earlier into the "tutors-publish-s" npm package. If all goes well, we could replace "tutors-publish" with this new version once it is stable (and we have verified backward compatibility).

The addition of the above would mean we have a full textual model of a complete course's contents.

Great stuff

@lgriffin
Contributor

lgriffin commented Jul 2, 2024

That makes sense. If I can offer a suggestion for this (and other future changes): make a dedicated project Kanban board and populate it, as I see several key steps in this.

The issue type doesn't really give the depth that you are expressing in your response, especially with a fundamental change to the flow. You can then disconnect the actions and quickly get the new module published. I don't mind if you bring it in as a dependency or shift the whole logic in; it's only a small experimental app.

I would see the search being a two-step process. The first step is a general search that text-matches against the JSON (trivial) to tell you that a term appears in a certain PDF. We should check whether that alone is useful for students before building the second step, a more expressive search that references where in the PDF the term is. The latter would require a bit of thought, and we could be over-engineering a solution that nobody wants.
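
To show how small the first step could be, here is a minimal sketch. It assumes each PDF's extracted text has been flattened into a single string; the `PdfText` type and the `pdfsMentioning` function are hypothetical names, and the sample data is made up.

```typescript
// Hypothetical input: each PDF's extracted text flattened to one string.
type PdfText = { pdfFile: string; text: string };

// Step one only: report which PDFs mention the term, without locating it inside the PDF.
function pdfsMentioning(pdfs: PdfText[], term: string): string[] {
  const needle = term.trim().toLowerCase();
  if (needle === "") return [];
  return pdfs
    .filter((pdf) => pdf.text.toLowerCase().includes(needle))
    .map((pdf) => pdf.pdfFile);
}

const samplePdfs: PdfText[] = [
  { pdfFile: "intro.pdf", text: "Welcome to the module. Assessment overview." },
  { pdfFile: "closures.pdf", text: "Closures capture variables from the enclosing scope." },
  { pdfFile: "async.pdf", text: "Promises, async/await, and closures in callbacks." },
];

const matchingPdfs = pdfsMentioning(samplePdfs, "Closures");
```

If students find a PDF-level answer sufficient, the page-level lookup could be deferred or dropped.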

You then have a possible task to check the performance, then integrate the compression and compare. There are a number of compression approaches out there. I picked one of the more popular ones, but the JSON files were already small enough that compression would not be noticeable. With a dozen PDFs the size will grow for sure and give a better testbed for picking the right compression tool.
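
A comparison like this could start from a small measurement script. The sketch below uses gzip from Node's built-in `zlib` purely as a stand-in, since the thread does not name the compression library that was tried. The `measureSizes` function and the synthetic course data are assumptions for illustration.

```typescript
import { gzipSync } from "node:zlib";

// Compare the serialized size of a JSON value with its gzip-compressed size.
function measureSizes(value: unknown): { rawBytes: number; gzipBytes: number; ratio: number } {
  const json = JSON.stringify(value);
  const rawBytes = new TextEncoder().encode(json).length;
  const gzipBytes = gzipSync(json).length;
  return { rawBytes, gzipBytes, ratio: gzipBytes / rawBytes };
}

// Synthetic stand-in for a dozen PDFs of extracted slide text (not real course data).
const syntheticCourse = Array.from({ length: 12 }, (_, i) => ({
  pdfFile: `talk-${i + 1}.pdf`,
  pages: Array.from({ length: 30 }, (_, p) => `Slide ${p + 1}: sample lecture text about topic ${i + 1}`),
}));

const sizes = measureSizes(syntheticCourse);
console.log(`raw=${sizes.rawBytes}B gzip=${sizes.gzipBytes}B ratio=${sizes.ratio.toFixed(2)}`);
```

Synthetic text like this is far more repetitive than real slides, so the ratio it reports is optimistic. Running the same function over a real tutors.json would give the numbers that matter.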

All in all, this looks great, and if you can break out the steps I can pick up some of the parts.
