The 2021 Domain Crawl missed quite a lot of items because it treats CDNs like normal hosts and is far too 'polite' with them, which means their queues never get caught up. We should add a sheet to make them go faster, but this needs a bit of research first to see how fast it is safe for us to go.
Known CDNs include the following. (This is just from scanning the sample of 2000 retired queues from DC 2021 that the Frontier Report shows; there were many more sites that hit the cap.)
com,shopify,cdn,
com,wixstatic,static,
com,squarespace-cdn,images,
com,amazonaws,s3,primarysite-prod-sorted,
com,bigcommerce,cdn11,
com,squarespace,static1,
com,wp,i0,
com,wp,i1,
com,wp,i2,
jp,imgz,c,
me,rocketcdn,
com,rs-cdn,uk,
uk,co,sykesassets,property-images-cdn,
cymru,cyfoethnaturiol,cdn,
net,ekm,cdn,
com,packhelp,cdn,static,
com,lw-cdn,
io,statically,cdn,
net,b-cdn,
com,rackcdn,
com,rackcdn,cf3,ssl,24a04536d882ca0087a3-289132c7eabba70668e526ce8cd83a46, [???]
com,myportfolio,pro2-bar-s3-cdn-cf4, [???]
com,smushcdn,664305,
com,productserve,images2,
com,stackpathcdn,
uk,co,foodism,cdn,
io,accentuate,cdn,
uk,co,love4lighting,cdn,
net,lightgalleries,cdn,
com,jimcdn,image,
com,tildacdn,static,
uk,co,ednology,marketplace,cdn,
uk,co,bargainmax,cdn,
com,tripadvisor,dynamic-media-cdn,
uk,co,express,images,cdn,
net,sz-cdn,uk,
com,shgcdn,i,
com,schooljotter2,cdn,img2,
com,uenicdn,img77,
com,ucarecdn, [??? brings in full site?]
com,dvipcdn,f,
com,kajabi-cdn,kajabi-storefronts-production,
com,sqspcdn,1,static1,
net,website-editor,le-cdn,
com,aiircdn,mmo,
com,schooljotter2,cdn,img,
events,asp,cdn,
com,cdn-website,irp,
com,cdn-website,lirp,
net,nccdn,0501,
net,create-cdn,sites,
com,simplesite,cdn,
net,secureservercdn,
uk,co,atcdn,m,
com,googleapis,storage,
com,editmysite,cdn2,
com,multiscreensite,lirp-cdn,
com,amazonaws,s3-eu-west-1 [???]
https://s3-eu-west-1.amazonaws.com/cdn.webfactore.co.uk/sr_274624.png?1537558108
uk,co,tropicalsky,cdn1,
uk,co,tropicalsky,cdn2, [???]
uk,co,memiah,cdn,
And from Slack (not sure if they want tagging here) "not a CDN, but I need to special-case domains like doi.org (and variants dx.doi.org etc) for scholarly crawling", so:
org,doi,
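
For checking whether a given URL would fall under one of the prefixes above, here is a minimal sketch in Python. It assumes the entries are host-only SURT-style prefixes (host labels reversed, comma-separated, with a trailing comma), which is how they appear in this list; the function names `host_to_surt_prefix` and `matches_any_prefix` are hypothetical helpers for illustration, not part of the crawler. It ignores scheme, port, and path, so it is not a full SURT implementation.

```python
from typing import Iterable
from urllib.parse import urlsplit


def host_to_surt_prefix(host: str) -> str:
    """Reverse the dot-separated host labels and join them with commas.

    A trailing comma is always added, so that 'com,wp,i0,' cannot
    accidentally match a different label such as 'i01'.
    """
    labels = [label for label in host.lower().rstrip(".").split(".") if label]
    return ",".join(reversed(labels)) + ","


def matches_any_prefix(url: str, prefixes: Iterable[str]) -> bool:
    """Return True if the URL's host falls under any of the given prefixes."""
    host = urlsplit(url).hostname or ""
    surt = host_to_surt_prefix(host)
    return any(surt.startswith(prefix) for prefix in prefixes)


# A few entries copied from the list above (assumed to end in a comma).
SAMPLE_PREFIXES = ["com,shopify,cdn,", "com,wp,i0,", "net,b-cdn,", "org,doi,"]
```

For example, `matches_any_prefix("https://dx.doi.org/", SAMPLE_PREFIXES)` is true because `org,doi,dx,` starts with `org,doi,`, which is the behaviour the Slack comment asks for when it mentions variants such as dx.doi.org.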