
Implement web sources crawling and scraping features #263

Open
Mascode-Dev wants to merge 44 commits into main from web-sources

Conversation

@Mascode-Dev
Collaborator

Frontend changes

  • Add a Sources dropdown menu (splitting sources into Documents and Websites)
  • Two dedicated pages for sources
  • Frontend feature flag
  • Keep the UI clean for long links
  • Re-crawl option

Backend changes

  • Add a web sources document type
  • New workers
  • Endpoints for the crawler/scraper and for re-crawling

Mascode-Dev added 30 commits May 6, 2026 13:05
}

@CheckPolicy((policy) => policy.canList())
@Sse(DocumentsRoutes.streamCrawlProgress.path, { method: 0 /* GET */ })
Member


Strange: `method: 0`

Member


Use `RequestMethod.GET` (this is already the convention elsewhere in the codebase)
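For context on why the literal `0` happens to work: NestJS's `RequestMethod` enum assigns `GET = 0`. A minimal sketch of that mapping, reproduced here for illustration rather than imported from `@nestjs/common`:

```typescript
// Sketch of NestJS's RequestMethod enum (defined in @nestjs/common);
// reproduced only to show why a literal `0` happens to mean GET.
enum RequestMethod {
  GET = 0,
  POST = 1,
  PUT = 2,
  DELETE = 3,
  PATCH = 4,
  ALL = 5,
  OPTIONS = 6,
  HEAD = 7,
}

// Using the named member keeps the route declaration readable;
// it compiles to the same 0 but documents the intent.
const sseOptions = { method: RequestMethod.GET };
```

Passing `RequestMethod.GET` in the `@Sse` options produces identical runtime behavior to `0` while making the intent explicit.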

return (
await this.documentConnectRepository.find(connectScope, {
where: { sourceType: "project", uploadStatus: "uploaded" },
where: [
Member


Add a `sourceType` param instead of hardcoding it in the `where` clause
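One way to read this suggestion, as a sketch: thread `sourceType` through as a parameter so the same query serves both documents and web sources. The function name, option shape, and type union below are assumptions for illustration, not the project's actual API:

```typescript
// Hypothetical source types; the real union lives in the project's codebase.
type DocumentSourceType = "project" | "webCrawl";

interface FindOptions {
  where: { sourceType: DocumentSourceType; uploadStatus: string };
}

// Instead of hardcoding sourceType in the where clause, take it as a
// parameter so callers decide which kind of source they want.
function buildFindOptions(sourceType: DocumentSourceType): FindOptions {
  return { where: { sourceType, uploadStatus: "uploaded" } };
}
```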

children: React.ReactNode
}) {
const { hasFeature } = useFeatureFlags()
const { hasFeature, isLoading } = useFeatureFlags()
Member


Why? Was there a bug?

if (project) {
return {
hasFeature: (feature: FeatureFlagKey): boolean => check(project.featureFlags || [], feature),
isLoading: false,
Member


Was there a bug?

Member


Remove your change and it should work.

documentId: z.string(),
documentTitle: z.string(),
documentFileName: z.string().nullable(),
documentSourceType: z.string(),
Member


Use an enum here instead of `z.string()`?
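With zod the idiomatic fix would be `z.enum([...])` so invalid values are rejected at parse time. The same idea in plain TypeScript, as a sketch (the list of source types is an assumption):

```typescript
// Assumed source types. With zod, the schema field would become
//   documentSourceType: z.enum(["project", "webCrawl"])
// instead of z.string(), rejecting unknown values at parse time.
const DOCUMENT_SOURCE_TYPES = ["project", "webCrawl"] as const;
type DocumentSourceType = (typeof DOCUMENT_SOURCE_TYPES)[number];

// Type guard: narrows an arbitrary string to the known source types.
function isDocumentSourceType(value: string): value is DocumentSourceType {
  return (DOCUMENT_SOURCE_TYPES as readonly string[]).includes(value);
}
```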

effect: async (action, listenerApi) => {
syncDocumentEmbeddingStatusStreamWithDocuments(listenerApi)

// Refetch documents when a webCrawl document finishes embedding
Member


Create an issue to refactor / clean this

files: File[]
sourceType: DocumentSourceType
tagIds?: string[]
name?: string
Member


Why?

return parsed
}
} catch {
// not JSON, not a crawl document
Member


Maybe throw?
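A sketch of the two options under discussion, lenient versus throwing; the function names and page shape are assumptions, not the PR's actual code:

```typescript
interface CrawledPage {
  url: string;
  title: string;
}

// Lenient: non-JSON content is simply "not a crawl document".
function parseCrawledPagesOrNull(content: string): CrawledPage[] | null {
  try {
    return JSON.parse(content) as CrawledPage[];
  } catch {
    return null;
  }
}

// Strict: the reviewer's "maybe throw?" — surface malformed crawl
// content to the caller instead of silently swallowing it.
function parseCrawledPagesOrThrow(content: string): CrawledPage[] {
  const parsed = parseCrawledPagesOrNull(content);
  if (parsed === null) {
    throw new Error("Expected crawl document content to be valid JSON");
  }
  return parsed;
}
```

The trade-off: returning `null` is fine if non-JSON content is expected for other document types, while throwing is better if reaching this parser with malformed content indicates a bug upstream.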

Member


Why parse crawled pages in the frontend?

}) {
const date = buildSince(document.updatedAt)
const isWebCrawl = document.sourceType === "webCrawl"
const crawledPages = isWebCrawl ? parseCrawledPages(document.content) : null
Member


Why? Isn't the `webCrawl` sourceType alone enough?

Comment thread CHANGELOG.md
### Added
- Display evaluation extraction run metrics in Bull Board
- Retry failed evaluation extraction runs from the UI
- (beta) Documents sidebar entry replaced by a Sources dropdown with separate Documents and Websites sections
Member


Simplify to: "(beta) we can scrape web sources"
