Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feature/onboarding #3881

Closed
wants to merge 14 commits into from
Closed

feature/onboarding #3881

wants to merge 14 commits into from

Conversation

sywhb
Copy link
Contributor

@sywhb sywhb commented May 1, 2024

No description provided.

Copy link

vercel bot commented May 1, 2024

The latest updates on your projects. Learn more about Vercel for Git ↗︎

Name Status Preview Comments Updated (UTC)
omnivore-demo ✅ Ready (Inspect) Visit Preview 💬 Add feedback May 7, 2024 7:11am
omnivore-prod ✅ Ready (Inspect) Visit Preview 💬 Add feedback May 7, 2024 7:11am

@sywhb sywhb marked this pull request as ready for review May 2, 2024 08:53
@sywhb sywhb requested a review from satindar as a code owner May 2, 2024 08:53
// embedding?: number[]
// typeorm does not support vector type, so we store it as a string
@Column('text')
embedding?: string
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you think we should use a separate table for these? I know we talked about it with some other attributes.

@@ -204,4 +204,7 @@ export class LibraryItem {

@Column('text')
highlightAnnotations?: string[]

@Column('timestamptz')
sharedAt?: Date
Copy link
Contributor

@jacksonh jacksonh May 7, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thinking out loud, but also wonder if this could be is_discover_item not as flexible for future use, but maybe more explicit. I'm not sure what's better, just a thought.

Copy link
Contributor Author

@sywhb sywhb May 7, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this and above are similar question: Should we have a join table for things like discover_item and item_embedding.

There are several considerations:

  1. search performance for inner joins:
    Probably not so bad for one-to-one relations like discover_item and item_embedding. I think we can benchmark the query.

  2. frequently update column:
    I agree with your thought and it has been a big concern for us. Worth to move these columns to a separate table.

  3. optimization:
    Instead of running cosine similarity query on the fly, we could pre-generate the topics or some other features and store them in another table or indexing.

  4. naming:
    I think if we want to be more explicit, we should better go with a separate table discover_item which links to the libraryItem table for normalization and reducing the size of the table.

Having considered all these concerns, I think we should

  1. Test and benchmark the search query against join tables (probably only one-to-one relations) in cloned DB
  2. If (1) result is not bad, we could create two additional separate tables: discover_item and item_embedding
  3. Create another job to generate topic when the embedding is updated and stored in the table

@jacksonh

Copy link
Contributor

@Podginator Podginator May 8, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @sywhb I also mentioned this on Discord, but a lot of what you've talked about in here is already done as part of my Discover update.

The optimisation part is already done as part of the import in discover, see:
https://github.com/omnivore-app/omnivore/blob/main/packages/db/migrations/0168.do.add_discover_feeds.sql

The discover feeds contain already the topics and embeddings. There's also a separate table for discover items, which have their embeddings stored, whether a user has saved a story - which is used to calculate the popularity score in discover too.

The calculations of the embeddings is done via https://github.com/omnivore-app/omnivore/blob/main/packages/discover/src/index.ts - This deliberately slows down the embedding process so that you never exceeds the RPS that the API supports.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you @Podginator ! Taking a look now


// We allow the backend to use the ID instead of a slug to fetch the article
// query against id if slug is a uuid
slug.match(/^[0-9a-f]{8}-([0-9a-f]{4}-){3}[0-9a-f]{12}$/i)
? qb.andWhere('libraryItem.id = :id', { id: slug })
: qb.andWhere('libraryItem.slug = :slug', { slug })
? qb.where('libraryItem.id = :id', { id: slug })
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What does this change do?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What does this change do?

oh i see, we are removing the uid check above. Its a little scary to fully rely on RLS for that.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The motive is to use the slug index to improve query performance. I think RLS should be quite reliable but alternatively we could update the index to use both user_id and slug columns too

@sywhb sywhb requested a review from jacksonh May 7, 2024 10:47
@sywhb sywhb closed this May 21, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants