
Crawler Stuck with "Direct fetch of page URL timed out" Errors #832

@MCSeekeri

Description


When using browsertrix-crawler to crawl a specific website, the crawler appears to hang after reaching a certain number of pages (around 7K-8K in my case).

The process keeps printing crawl statistics and repeated Direct fetch of page URL timed out messages, but no new pages are crawled and the "crawled" count in the statistics stops increasing, so the entire crawl is effectively stuck at that point.

This issue seems similar to #780, but I have encountered it in both version 1.5.7 and the latest version 1.6.1.
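Since the crawler keeps logging Direct fetch of page URL timed out, the forward proxy may be a factor. As a basic sanity check (assuming the proxy at 100.100.2.2:19999 from the compose file below is reachable from the host running the crawler), the seed can be fetched through it directly:

curl -x http://100.100.2.2:19999 -sS -o /dev/null --max-time 30 -w '%{http_code}\n' https://scp-wiki-cn.wikidot.com/

A 200 here only confirms the proxy still answers a single request; it does not rule out the proxy stalling under the load of 32 concurrent workers.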

Docker Compose:

services:
  browsertrix-crawler:
    environment:
      - HTTP_PROXY=http://100.100.2.2:19999
      - HTTPS_PROXY=http://100.100.2.2:19999
    command:
      - crawl
      - --seeds=https://scp-wiki-cn.wikidot.com
      - --generateWACZ
      - --workers=32
      - --blockAds
      #- --waitUntil=networkidle2
      #- --proxyServer=http://100.100.2.2:19999
      - --scopeType=prefix
    image: webrecorder/browsertrix-crawler:1.6.1
    volumes:
      - ./crawls:/crawls/
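For reference, the crawl is started with the standard compose workflow (assuming the file above is saved as docker-compose.yml in the working directory):

docker compose up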

tail.log
