docs: additional tweaks to docs for 'list of pages'
- link to 'list-of-pages' anchor to explain the direct entry and seed list upload option
- tweak the explanation under list of pages to cover the two options
- fix link to docs to include trailing slash before anchor to avoid redirect
- follow-up to #2792
frontend/docs/docs/user-guide/workflow-setup.md
14 additions & 4 deletions
@@ -38,12 +38,20 @@ _Site Crawl_
 `Single Page`
 : Crawls a single URL and does not include any linked pages.

-`List of Pages`
-: Crawls only specified URLs and does not include any linked pages (unless [_Include Any Linked Page_](#include-any-linked-page) is enabled). Each URL must be entered on its own line. URLs can be entered directly into the designated text area or uploaded as a text file. These options cannot be combined in a single workflow.
+`List of Pages` <a name="list-of-pages"></a>
+: Crawls a list of specified URLs.

-    Up to 100 URLs can be entered into the text area. If you paste a list of over 100 URLs, Browsertrix will automatically convert the list into a text file and attach it to the workflow. Text files can be viewed and deleted from within the workflow, but cannot be edited in place.
+    Select one of two options to provide a list of URLs:
+
+    *Enter URLs* - If the list is small enough (100 URLs or fewer), the URLs can be entered directly into the text area. If a larger list is pasted into the text box, it will be converted into an uploaded URL list and attached to the workflow.
+
+    *Upload URL List* - A longer list of URLs can be provided as a text file containing one URL per line. The text file may not exceed 25 MB, but there is no limit on the number of URLs in the file. Once a file is added, a link is provided to view the file (but not edit it). To change the file, upload a new file in its place.

-    Ensure each URL is on its own line so the crawler can queue all provided URLs for crawling. It will continue queuing until it reaches either the organization's pages per crawl limit or the crawl workflow's page limit. Once one of these limits is hit, it will stop queuing additional URLs. Duplicate URLs will be queued only once, while invalid URLs will be skipped and not queued at all. The crawl will fail if the list contains no valid URLs or if there is a file formatting error.
+    For both options, each line should contain a valid URL (starting with https:// or http://). Invalid or duplicate URLs will be skipped. The crawl will fail if the list contains no valid URLs or if the file is not a list of URLs.
+
+    While the uploaded text file can contain an unlimited number of URLs, the crawl will still be limited by the [page limit](#max-pages) for the workflow or organization; URLs beyond the limit will not be crawled.
+
+    If both an entered list and an uploaded file are provided, the currently selected option will be used.

 `In-Page Links`
 : Crawls only the specified URL and treats linked sections of the page as distinct pages.
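The list-handling rules described in the diff (one URL per line, an `http://` or `https://` scheme required, duplicates queued only once, invalid entries skipped) can be sketched in Python. `parse_seed_list` is a hypothetical helper for illustration, not part of Browsertrix:

```python
def parse_seed_list(lines):
    """Return crawlable URLs from a seed list, mirroring the documented behavior:
    blank lines and entries without an http(s) scheme are skipped, and
    duplicate URLs are kept only once, in first-seen order."""
    seen = set()
    urls = []
    for line in lines:
        url = line.strip()
        if not url.startswith(("http://", "https://")):
            continue  # invalid URLs are skipped, not queued
        if url in seen:
            continue  # duplicates are queued only once
        seen.add(url)
        urls.append(url)
    return urls
```

A crawl would fail if this returned an empty list, matching the "no valid URLs" failure case in the docs.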
@@ -70,6 +78,8 @@ _Site Crawl_
 One or more URLs of the page to crawl. URLs must follow [valid URL syntax](https://www.w3.org/Addressing/URL/url-spec.html). For example, if you're crawling a page that can be accessed on the public internet, your URL should start with `http://` or `https://`.
+
+    See [List of Pages](#list-of-pages) for additional info when providing a list of URLs.

 ??? example "Crawling with HTTP basic auth"

     All crawl scopes support [HTTP Basic Auth](https://developer.mozilla.org/en-US/docs/Web/HTTP/Authentication) which can be provided as part of the URL, for example: `https://username:password@example.com`.
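Building a `https://username:password@example.com`-style URL by hand is error-prone when credentials contain reserved characters such as `@` or `:`. A minimal Python sketch using the standard library (the helper name `with_basic_auth` is an assumption for illustration, not a Browsertrix API):

```python
from urllib.parse import quote, urlsplit


def with_basic_auth(url, username, password):
    """Embed HTTP Basic Auth credentials into a URL,
    percent-encoding reserved characters in the username and password."""
    parts = urlsplit(url)
    cred = f"{quote(username, safe='')}:{quote(password, safe='')}"
    # Prepend "user:pass@" to the existing host portion of the URL.
    return parts._replace(netloc=f"{cred}@{parts.netloc}").geturl()
```

For example, a password containing `@` is encoded as `%40` so the host part of the URL stays unambiguous.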