feature/onboarding #3881
Conversation
```ts
// embedding?: number[]
// typeorm does not support vector type, so we store it as a string
@Column('text')
embedding?: string
```
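Since the column is plain text, the vector presumably has to be round-tripped through serialization on every read and write. A minimal sketch of that, assuming JSON encoding (the helper names are hypothetical, not from this PR):

```typescript
// Hypothetical helpers: TypeORM has no vector column type, so the
// embedding vector is serialized to text on write and parsed on read.
export const serializeEmbedding = (v: number[]): string => JSON.stringify(v)

export const deserializeEmbedding = (s: string): number[] =>
  JSON.parse(s) as number[]
```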
Do you think we should use a separate table for these? I know we talked about it with some other attributes.
```diff
@@ -204,4 +204,7 @@ export class LibraryItem {

  @Column('text')
  highlightAnnotations?: string[]

  @Column('timestamptz')
  sharedAt?: Date
```
Thinking out loud, but I also wonder if this could be `is_discover_item`. Not as flexible for future use, but maybe more explicit. I'm not sure which is better, just a thought.
I think this and the above are similar questions: should we have a join table for things like `discover_item` and `item_embedding`?

There are several considerations:

- search performance for inner joins: probably not so bad for one-to-one relations like `discover_item` and `item_embedding`. I think we can benchmark the query.
- frequently updated columns: I agree with your thought and it has been a big concern for us. It's worth moving these columns to a separate table.
- optimization: instead of running the cosine similarity query on the fly, we could pre-generate the topics or some other features and store them in another table or index.
- naming: if we want to be more explicit, we'd better go with a separate `discover_item` table which links to the `libraryItem` table, for normalization and to reduce the size of the table.

Having considered all these concerns, I think we should:

- Test and benchmark the search query against join tables (probably only one-to-one relations) in a cloned DB
- If the result of (1) is not bad, create two additional separate tables: `discover_item` and `item_embedding`
- Create another job to generate topics when the `embedding` is updated and stored in the table
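For the optimization point, this is roughly the score a pre-generation job would compute once per item instead of on the fly. A hedged sketch, assuming embeddings are stored as JSON text as in the entity above (function names are hypothetical):

```typescript
// Parse a text-stored embedding back into a numeric vector.
export const parseEmbedding = (stored: string): number[] =>
  JSON.parse(stored) as number[]

// Plain cosine similarity between two equal-length vectors:
// dot(a, b) / (|a| * |b|).
export const cosineSimilarity = (a: number[], b: number[]): number => {
  let dot = 0
  let normA = 0
  let normB = 0
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i]
    normA += a[i] * a[i]
    normB += b[i] * b[i]
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB))
}
```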
Hi @sywhb I also mentioned this on Discord, but a lot of what you've talked about in here is already done as part of my Discover update.
The optimisation part is already done as part of the import in discover, see:
https://github.com/omnivore-app/omnivore/blob/main/packages/db/migrations/0168.do.add_discover_feeds.sql
The discover feeds already contain the topics and embeddings. There's also a separate table for discover items, which stores their embeddings and whether a user has saved a story - that's used to calculate the popularity score in discover too.
The calculation of the embeddings is done via https://github.com/omnivore-app/omnivore/blob/main/packages/discover/src/index.ts - this deliberately slows down the embedding process so that it never exceeds the RPS that the API supports.
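The throttling idea can be sketched as a sequential loop with a fixed inter-request delay. This is a simplified illustration of the technique, not the actual discover code; `embedAll` and its parameters are hypothetical:

```typescript
const sleep = (ms: number) => new Promise((resolve) => setTimeout(resolve, ms))

// Process items one at a time, waiting at least 1000 / maxRps ms between
// calls so the embedding API's requests-per-second limit is never exceeded.
export async function embedAll<T, R>(
  items: T[],
  embed: (item: T) => Promise<R>,
  maxRps: number
): Promise<R[]> {
  const minIntervalMs = 1000 / maxRps
  const results: R[] = []
  for (const item of items) {
    results.push(await embed(item))
    await sleep(minIntervalMs)
  }
  return results
}
```

A real implementation would also want retries and backoff on 429 responses; this only shows the pacing.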
Thank you @Podginator ! Taking a look now
```diff
  // We allow the backend to use the ID instead of a slug to fetch the article
  // query against id if slug is a uuid
  slug.match(/^[0-9a-f]{8}-([0-9a-f]{4}-){3}[0-9a-f]{12}$/i)
-   ? qb.andWhere('libraryItem.id = :id', { id: slug })
-   : qb.andWhere('libraryItem.slug = :slug', { slug })
+   ? qb.where('libraryItem.id = :id', { id: slug })
```
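The UUID check in the diff can be pulled out into a small helper, which makes the id-vs-slug branch easier to read. The regex is taken verbatim from the diff; the helper name is hypothetical:

```typescript
// An incoming "slug" that matches the UUID shape is treated as an item id.
const UUID_RE = /^[0-9a-f]{8}-([0-9a-f]{4}-){3}[0-9a-f]{12}$/i

export const isUuid = (slug: string): boolean => UUID_RE.test(slug)
```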
What does this change do?
Oh I see, we are removing the uid check above. It's a little scary to fully rely on RLS for that.
The motive is to use the `slug` index to improve query performance. I think RLS should be quite reliable, but alternatively we could update the index to use both the `user_id` and `slug` columns too.
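A minimal sketch of that alternative: build the lookup condition on both columns so a composite (user_id, slug) index could serve it, rather than relying on RLS alone. `slugLookup` is a hypothetical helper, not code from this PR:

```typescript
type Lookup = { clause: string; params: Record<string, string> }

// Scope the slug lookup by user explicitly; a composite index on
// (user_id, slug) can then answer the query without an RLS filter.
export const slugLookup = (userId: string, slug: string): Lookup => ({
  clause: 'libraryItem.user_id = :userId AND libraryItem.slug = :slug',
  params: { userId, slug },
})
```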