
Implement web sources crawling and scraping features #263

Open
Mascode-Dev wants to merge 44 commits into main from web-sources

Conversation

@Mascode-Dev
Collaborator

Frontend changes

  • Add a Sources dropdown menu (splitting sources into Documents and Websites)
  • Two dedicated pages for sources
  • Frontend feature flag
  • Keep the UI clean for long links
  • Re-crawl option

Backend changes

  • Add a web sources document type
  • New workers
  • Endpoints for the crawler/scraper and for re-crawling

Mascode-Dev added 30 commits May 6, 2026 13:05
}

@CheckPolicy((policy) => policy.canList())
@Sse(DocumentsRoutes.streamCrawlProgress.path, { method: 0 /* GET */ })
Member


Strange: `method: 0`

Member


Use `RequestMethod.GET` (this is already the convention elsewhere in the codebase)
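For context on why the literal `0` happens to work: NestJS's `RequestMethod` enum assigns `GET = 0`. A minimal sketch of that mapping, reproduced here for illustration rather than imported from `@nestjs/common`:

```typescript
// Sketch of NestJS's RequestMethod enum (defined in @nestjs/common);
// reproduced only to show why a literal `0` happens to mean GET.
enum RequestMethod {
  GET = 0,
  POST = 1,
  PUT = 2,
  DELETE = 3,
  PATCH = 4,
  ALL = 5,
  OPTIONS = 6,
  HEAD = 7,
}

// Using the named member keeps the route declaration readable;
// it compiles to the same 0 but documents the intent.
const sseOptions = { method: RequestMethod.GET };
```

Passing `RequestMethod.GET` in the `@Sse` options produces identical runtime behavior to `0` while making the intent explicit.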

return (
await this.documentConnectRepository.find(connectScope, {
where: { sourceType: "project", uploadStatus: "uploaded" },
where: [
Member


Add a `sourceType` param instead of hardcoding it in the `where` clause
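One way to read this suggestion, as a sketch: thread `sourceType` through as a parameter so the same query serves both documents and web sources. The function name, option shape, and type union below are assumptions for illustration, not the project's actual API:

```typescript
// Hypothetical source types; the real union lives in the project's codebase.
type DocumentSourceType = "project" | "webCrawl";

interface FindOptions {
  where: { sourceType: DocumentSourceType; uploadStatus: string };
}

// Instead of hardcoding sourceType in the where clause, take it as a
// parameter so callers decide which kind of source they want.
function buildFindOptions(sourceType: DocumentSourceType): FindOptions {
  return { where: { sourceType, uploadStatus: "uploaded" } };
}
```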

children: React.ReactNode
}) {
const { hasFeature } = useFeatureFlags()
const { hasFeature, isLoading } = useFeatureFlags()
Member


Why? Was there a bug?

if (project) {
return {
hasFeature: (feature: FeatureFlagKey): boolean => check(project.featureFlags || [], feature),
isLoading: false,
Member


Was there a bug?

Member


Remove your change and it should work.

documentId: z.string(),
documentTitle: z.string(),
documentFileName: z.string().nullable(),
documentSourceType: z.string(),
Member


Use an enum here instead of `z.string()`?
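With zod the idiomatic fix would be `z.enum([...])` so invalid values are rejected at parse time. The same idea in plain TypeScript, as a sketch (the list of source types is an assumption):

```typescript
// Assumed source types. With zod, the schema field would become
//   documentSourceType: z.enum(["project", "webCrawl"])
// instead of z.string(), rejecting unknown values at parse time.
const DOCUMENT_SOURCE_TYPES = ["project", "webCrawl"] as const;
type DocumentSourceType = (typeof DOCUMENT_SOURCE_TYPES)[number];

// Type guard: narrows an arbitrary string to the known source types.
function isDocumentSourceType(value: string): value is DocumentSourceType {
  return (DOCUMENT_SOURCE_TYPES as readonly string[]).includes(value);
}
```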

effect: async (action, listenerApi) => {
syncDocumentEmbeddingStatusStreamWithDocuments(listenerApi)

// Refetch documents when a webCrawl document finishes embedding
Member


Create an issue to refactor / clean this

files: File[]
sourceType: DocumentSourceType
tagIds?: string[]
name?: string
Member


Why?

return parsed
}
} catch {
// not JSON, not a crawl document
Member


Maybe throw?
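A sketch of the two options under discussion, lenient versus throwing; the function names and page shape are assumptions, not the PR's actual code:

```typescript
interface CrawledPage {
  url: string;
  title: string;
}

// Lenient: non-JSON content is simply "not a crawl document".
function parseCrawledPagesOrNull(content: string): CrawledPage[] | null {
  try {
    return JSON.parse(content) as CrawledPage[];
  } catch {
    return null;
  }
}

// Strict: the reviewer's "maybe throw?" — surface malformed crawl
// content to the caller instead of silently swallowing it.
function parseCrawledPagesOrThrow(content: string): CrawledPage[] {
  const parsed = parseCrawledPagesOrNull(content);
  if (parsed === null) {
    throw new Error("Expected crawl document content to be valid JSON");
  }
  return parsed;
}
```

The trade-off: returning `null` is fine if non-JSON content is expected for other document types, while throwing is better if reaching this parser with malformed content indicates a bug upstream.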

Member


Why parse crawled pages in the frontend?

}) {
const date = buildSince(document.updatedAt)
const isWebCrawl = document.sourceType === "webCrawl"
const crawledPages = isWebCrawl ? parseCrawledPages(document.content) : null
Member


Why? Isn't the `webCrawl` sourceType alone enough?

Comment thread CHANGELOG.md
### Added
- Display evaluation extraction run metrics in Bull Board
- Retry failed evaluation extraction runs from the UI
- (beta) Documents sidebar entry replaced by a Sources dropdown with separate Documents and Websites sections
Member


Simplify to: "(beta) we can scrape web sources"
