-
Notifications
You must be signed in to change notification settings - Fork 1.6k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Links are now extracted after applying excludeTags #828
base: main
Are you sure you want to change the base?
Conversation
@@ -447,7 +447,7 @@ export async function scrapSingleUrl( | |||
let linksOnPage: string[] | undefined; | |||
|
|||
if (pageOptions.includeLinks) { | |||
linksOnPage = extractLinks(rawHtml, urlToScrap); | |||
linksOnPage = extractLinks(html, urlToScrap); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the problem with this is that we rely on this function for our /crawl, which will end up failling to grab all the links if we don't pass the raw version
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@mogery lmk if im wrong
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A global code search did not reveal any usage of the linksOnPage
field anywhere other than api return
I believe the crawler uses a separate extractLinksFromHTML function
crawler.extractLinksFromHTML(rawHtml ?? "", sc.originUrl), |
links.push(...this.extractLinksFromHTML(content, url).map(url => ({ url, html: content, pageStatusCode, pageError }))); |
defined as
firecrawl/apps/api/src/scraper/WebScraper/crawler.ts
Lines 322 to 337 in 8a4f4cb
public extractLinksFromHTML(html: string, url: string) { | |
let links: string[] = []; | |
const $ = load(html); | |
$("a").each((_, element) => { | |
const href = $(element).attr("href"); | |
if (href) { | |
const u = this.filterURL(href, url); | |
if (u !== null) { | |
links.push(u); | |
} | |
} | |
}); | |
return links; | |
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Right makes sense.
fixes #701