Commit 9de9bef

Update the Crawler Used in New Paper
This updated version of the crawler was used for the paper. We added length restrictions on rootUrl, requestUrl, and snippet before they are written to the SQL database, in order to prevent errors. We also updated the crawl lists to the versions used for this crawl.
1 parent e73bd17 commit 9de9bef

15 files changed (+6325, -1589 lines)

README.md

Lines changed: 1 addition & 1 deletion
@@ -86,7 +86,7 @@ To install the browser and crawler do the following:
 
 1. Install [Firefox Nightly](http://ftp.mozilla.org/pub/firefox/nightly/2024/01/2024-01-01-23-15-40-mozilla-central/).
 
-**Important Note**: While downloading the [latest version](https://www.mozilla.org/en-US/firefox/channel/desktop/) of Nightly does work, testing the crawler has revealed that certain versions of Firefox Nightly break the ability to add monetization labels. We recommend downloading the version we have linked above and [disabling automatic updates](https://winaero.com/disable-updates-firefox-63-above/). This will also help achieve more consistent results across different runs.
+**Important Note**: While downloading the [latest version](https://www.mozilla.org/en-US/firefox/channel/desktop/) of Nightly does work, testing the crawler has revealed that certain versions of Firefox Nightly break the ability to add monetization labels (mostly version 130+). Therefore, we recommend downloading the version we have linked above and [disabling automatic updates](https://winaero.com/disable-updates-firefox-63-above/). This will also help achieve more consistent results across different runs.
 
 **Note**: In addition to using a specific version of Firefox Nightly, we will also be disabling the [Enhanced Tracking Protection](https://support.mozilla.org/en-US/kb/enhanced-tracking-protection-firefox-desktop) that Firefox provides us with. Besides just providing us with additional data, this will also help ensure that Privacy Pioneer is operating as expected.
 

rest-api/index.js

Lines changed: 11 additions & 0 deletions
@@ -82,6 +82,17 @@ async function rest(table) {
     var extraDetail = e.extraDetail;
     var cookie = e.cookie;
     var loc = e.loc;
+
+    if (rootUrl && rootUrl.length >= 255) {
+      rootUrl = rootUrl.substring(0, 254);
+    }
+    if (requestUrl && requestUrl.length >= 4000) {
+      requestUrl = requestUrl.substring(0, 3999);
+    }
+    if (snippet && snippet.length >= 4000) {
+      snippet = snippet.substring(0, 3999);
+    }
+
     // console.log("posting to analysis...");
     connection.query(
       "INSERT INTO ??.?? (timestp, permission, rootUrl, snippet, requestUrl, typ, ind, firstPartyRoot, parentCompany, watchlistHash, extraDetail, cookie, loc) VALUES (?,?,?,?,?,?,?,?,?,?,?,?,?)",

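The three guards added in rest-api/index.js repeat the same pattern, so they could be expressed as one helper. The sketch below is illustrative only and is not part of this commit: `truncateField`, `truncateEntry`, and `FIELD_LIMITS` are hypothetical names, and the limits simply mirror the lengths the commit truncates to (254 and 3999 characters). The actual column definitions are not shown in the diff, so these values are an assumption about the schema.

```javascript
// Hypothetical helper (not in the commit): cap a string at maxLength characters.
function truncateField(value, maxLength) {
  // Non-strings (null, undefined, numbers) pass through unchanged,
  // matching the commit's guards, which skip falsy values.
  if (typeof value !== "string") {
    return value;
  }
  return value.length > maxLength ? value.slice(0, maxLength) : value;
}

// Assumed limits, taken from the lengths the commit truncates to.
const FIELD_LIMITS = { rootUrl: 254, requestUrl: 3999, snippet: 3999 };

// Return a copy of an evidence entry with the length-limited fields capped.
function truncateEntry(entry) {
  const result = { ...entry };
  for (const [field, limit] of Object.entries(FIELD_LIMITS)) {
    if (field in result) {
      result[field] = truncateField(result[field], limit);
    }
  }
  return result;
}
```

Calling `truncateEntry(e)` before building the `INSERT` parameters would have the same effect as the three `if` blocks, and a new length-limited column would only need one more entry in `FIELD_LIMITS`.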