
feat(webhook): Implement webhook functionality for bookmark events #852

Merged
merged 5 commits into hoarder-app:main
Jan 19, 2025

Conversation

hanguofeng
Contributor

  • Added WebhookWorker to handle webhook requests.
  • Integrated webhook triggering in crawlerWorker after video processing.
  • Updated main worker initialization to include WebhookWorker.
  • Enhanced configuration to support webhook URLs, token, and timeout.
  • Documented webhook configuration options in the documentation.
  • Introduced zWebhookRequestSchema for validating webhook requests.

@MohamedBassem
Collaborator

Hey, thanks a lot for the PR. I'll be able to review this later today!

Collaborator

@MohamedBassem left a comment

Overall, looks good to me! Left a couple of small comments, and then we can merge it!

Great job navigating the codebase without much assistance :)

{
concurrency: 1,
pollIntervalMs: 1000,
timeoutSecs: serverConfig.inference.jobTimeoutSec,
Collaborator

Reusing the inference timeout here is probably not intentional; let's maybe use the webhook.timeout / 1000 server setting instead?

Contributor Author

I changed it to:

timeoutSecs: serverConfig.webhook.timeout / 1000 * (serverConfig.webhook.retryTimes + 1) * serverConfig.webhook.urls.length + 1,

This accounts for the retry count, the number of URLs, and the per-request timeout, plus 1 extra second for other work.
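The calculation above can be sketched as a small helper (the function name and parameters are illustrative, not the PR's actual code; they mirror the hypothetical serverConfig.webhook settings):

```typescript
// Sketch of the worker-timeout computation discussed above.
// Each URL may be attempted (retryTimes + 1) times, each attempt bounded
// by timeoutMs; add 1 second of headroom for everything else.
function computeWebhookTimeoutSecs(
  timeoutMs: number,
  retryTimes: number,
  urlCount: number,
): number {
  return (timeoutMs / 1000) * (retryTimes + 1) * urlCount + 1;
}
```

For example, a 5-second timeout with 3 retries and 2 URLs yields a 41-second worker timeout.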


for (const url of webhookUrls) {
try {
const response = await Promise.race([
Collaborator

I think instead of Promise.race, we can use AbortSignal.timeout here so that the fetch itself gets cancelled as well.
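The suggested pattern looks roughly like this (a sketch with placeholder names; AbortSignal.timeout requires Node 17.3+):

```typescript
// Cancel the fetch itself when the deadline passes, instead of racing a
// separate timer promise. AbortSignal.timeout() aborts the underlying
// request after the given number of milliseconds.
async function callWebhook(url: string, timeoutMs: number): Promise<Response> {
  return fetch(url, {
    method: "POST",
    signal: AbortSignal.timeout(timeoutMs),
  });
}
```

Unlike Promise.race, this actually tears down the in-flight HTTP request rather than just ignoring its result.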

Contributor Author

Good suggestion, I improved this.


if (!response.ok) {
logger.error(`Webhook call to ${url} failed with status: ${response.status}`);
}
Collaborator

What should we do if the webhook response is a non-success status code? In that case, do we want to retry or not? I think we probably should?

Contributor Author

I added retry logic and also a config, WEBHOOK_RETRY_TIMES.
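Retry logic of this shape can be sketched as follows (the sender is injected purely so the sketch is self-contained and testable without a network; this is not the PR's actual code):

```typescript
// Retry a webhook call up to `retryTimes` extra attempts, treating both
// thrown errors (network failure, timeout) and non-2xx responses as
// failures, as discussed in the review.
async function postWithRetry(
  send: () => Promise<{ ok: boolean }>,
  retryTimes: number,
): Promise<boolean> {
  for (let attempt = 0; attempt <= retryTimes; attempt++) {
    try {
      const res = await send();
      if (res.ok) return true;
    } catch {
      // network error or abort: fall through to the next attempt
    }
  }
  return false;
}
```

With WEBHOOK_RETRY_TIMES set to 3, each URL gets at most 4 attempts before the failure is logged and the worker moves on.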

```
Authorization: Bearer <WEBHOOK_TOKEN>
```
- The webhook will be triggered with the job id (used for idempotent), bookmark id, the user id, the url and the operation in JSON format in the body.
Collaborator

Suggested change
- The webhook will be triggered with the job id (used for idempotent), bookmark id, the user id, the url and the operation in JSON format in the body.
- The webhook will be triggered with the job id (used for idempotence), bookmark id, the user id, the url and the operation in JSON format in the body.

Contributor Author

Good, changed.

Contributor Author

I also made some other doc improvements.

Comment on lines 59 to 61
WEBHOOK_URLS: z.preprocess(
(val) => typeof val === "string" ? val.split(",") : val,
z.array(z.string().url())
Collaborator

Hmmm, not sure if the use of preprocess here is right. It seems more correct to do:

z.string().transform(val => val.split(",")).pipe(z.array(z.string().url()))
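The idea is split first, then validate each piece. A plain-TypeScript sketch of the same split-then-validate behaviour (without zod, so it stands alone; the function name is hypothetical):

```typescript
// Mirrors z.string().transform(s => s.split(",")).pipe(z.array(z.string().url())):
// split a comma-separated string, then require every piece to be a valid URL.
function parseWebhookUrls(raw: string): string[] {
  const urls = raw.split(",").map((s) => s.trim());
  for (const u of urls) {
    // The URL constructor throws a TypeError on an invalid URL, playing the
    // role of z.string().url() rejecting a bad entry.
    new URL(u);
  }
  return urls;
}
```

The advantage of transform/pipe over preprocess is that the input is statically known to be a string before splitting, rather than an unknown that happens to be coerced.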

Contributor Author

Yes, I changed it and added a default value.

// Webhook worker
export const zWebhookRequestSchema = z.object({
bookmarkId: z.string(),
url: z.string(),
Collaborator

I don't think the url should be included here. We have bookmarks of other types, and you're already fetching the bookmark in the worker anyways.

Contributor Author

Yes, I removed it and also removed it from triggerWebhookWorker.

queueDB,
{
defaultJobArgs: {
numRetries: 1,
Collaborator

I feel like we probably should retry on failure? Maybe 3 retries?

Contributor Author

I changed it to 3

Comment on lines 750 to 752
// Trigger a webhook
await triggerWebhookWorker(bookmarkId, url, "create");

Collaborator

This handles the case for URLs though. I think we should also trigger the webhook for non URL bookmarks which don't go through the crawling workflows. You'll need to trigger those from packages/trpc/routers/bookmarks.ts (at the end of the createBookmark handler).

Contributor Author

Considering that the question of which events to support needs more debate, I've constrained this to "crawled"; for now we only support the event that signals crawl completion. This leaves room to grow and keeps the current change from becoming unnecessarily complex, as @mkarliner is suggesting.

@mkarliner Jan 17, 2025

I've been thinking about this the last few days.
May I suggest an event of 'edited' or 'changed'?
Here is the use case:
I intend to use these events to post a stream of my bookmarks to social media, in particular the Fediverse/Mastodon, through the ActivityPub protocol. I don't actually want an unfiltered stream to be posted; I want to curate it at some level.
So, with a 'changed' event, I can manually change a bookmark in some way that indicates to my bot that this is a bookmark I want to share, and it will pick up the change and publish it.

It's worth noting that there are now Fediverse-Threads-Bluesky bridges, so this solution would get Hoarder bookmarks to a potentially large audience.

Alternatively, if the event specified which list the bookmark was added to, that would also work.

"Authorization": `Bearer ${webhookToken}`,
},
body: JSON.stringify({
jobId, bookmarkId, userId: bookmark.userId, url: bookmark.link.url, operation: job.data.operation
Collaborator

Let's include bookmark type here as well.

Contributor Author

Yes, added
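After that change the request body looks roughly like this (a hypothetical sketch: field names are inferred from the excerpt above, not verified against the final code, and the literal example values are made up):

```typescript
// Hypothetical shape of the webhook payload after the review comments:
// jobId for idempotence, the bookmark and user ids, the newly added
// bookmark type, and the operation that fired the webhook.
interface WebhookPayload {
  jobId: string;
  bookmarkId: string;
  userId: string;
  type: string; // e.g. "link"
  operation: string; // e.g. "crawled"
}

const example: WebhookPayload = {
  jobId: "job-1",
  bookmarkId: "bm-1",
  userId: "user-1",
  type: "link",
  operation: "crawled",
};

// The payload is serialized as JSON for the request body.
const body = JSON.stringify(example);
```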

@MohamedBassem
Collaborator

A more high level question. Do we want to only trigger the webhook after the crawling? Or should we trigger it once on create, and once after it gets crawled? We can also add a configuration setting for the events people want to subscribe to.

@mkarliner

A more high level question. Do we want to only trigger the webhook after the crawling? Or should we trigger it once on create, and once after it gets crawled? We can also add a configuration setting for the events people want to subscribe to.

To further open the can of worms, what are the events in the life cycle of a bookmark, and should tags or lists also have life cycles? I'm not suggesting these should be done now, but at least asking the question now leaves further expansion options open.

- Changed webhook operation type from "create" to "crawled" in crawlerWorker and documentation.
- Enhanced webhook retry logic in WebhookWorker to support multiple attempts.
- Updated Docker configuration to include new webhook environment variables.
- Improved validation for webhook configuration in shared config.
- Adjusted zWebhookRequestSchema to reflect the new operation type.
- Updated documentation to clarify webhook configuration options and usage.
@hanguofeng
Contributor Author

Thank you for your professional review. I have revised the code and updated the branch in the PR accordingly.

Regarding your question about "what are the events in the life cycle of a bookmark, and should tags or lists also have life cycles?", here are my thoughts:

First, let me clarify the background of this PR. It stems from my own use case. I use Hoarder as a read-it-later tool and Obsidian as my note-taking tool. Since Obsidian isn't always convenient as a web clipper, I use Hoarder as an alternative. However, I also want to sync the content of my saved URLs to Obsidian. That's why I designed this webhook mechanism to receive signals and sync to Obsidian using specific methods.
I'm looking forward to the day when Hoarder implements features like "offline reading" and "vector search". These would largely replace the functionalities I currently use Obsidian for in my workflow.
Ultimately, processing collected notes, webpages, etc., with large language models for intelligent search and content creation is a common use case. If Hoarder's roadmap includes this, then the webhook capability doesn't need to be too robust. However, if the roadmap doesn't include this, using this mechanism to open up a community ecosystem would require further development.

@MohamedBassem
Collaborator

Sorry that this is taking long to review; I'll be able to have a look at it this weekend. @mkarliner your suggestion about more events makes a lot of sense now that we have the infra to do it. I think we can land this PR with the current events and follow up with more PRs to add more events. Shouldn't be too hard.

@mkarliner

Sorry that this is taking long to review; I'll be able to have a look at it this weekend. @mkarliner your suggestion about more events makes a lot of sense now that we have the infra to do it. I think we can land this PR with the current events and follow up with more PRs to add more events. Shouldn't be too hard.

Excellent. Looking forward to it.

@MohamedBassem MohamedBassem merged commit b9cce5d into hoarder-app:main Jan 19, 2025
4 checks passed
@MohamedBassem
Collaborator

Thanks a lot @hanguofeng for the high-quality PR! I did some minor modifications before landing.

  • If I get time before the next release (especially with the ongoing trademark drama), I might make this functionality configurable per user, such that every user can configure their own webhooks from the UI.
  • Regarding obsidian, someone actually just submitted an obsidian sync plugin to the obsidian plugin store (Export/sync to Obsidian #622).
  • Vector search and offline reading are both planned. In fact I already have a non-polished PR for embeddings and using a vector database (WIP: Bookmark embeddings #834).
  • @mkarliner I'll follow up with another PR to add created/edited events!

@MohamedBassem
Collaborator

Ended up pushing the changes to make this configurable by user, and added the created/edited lifecycle hooks.

@erik-nilcoast

This is great, I'll integrate this into my RSS aggregator! Thanks.
