Add faster/parallel queues for known CDNs #79

@anjackson

The 2021 Domain Crawl missed quite a lot of items because it treated CDNs like normal hosts and was far too 'polite' with them, so those queues never caught up. We should add a sheet to make them go faster, but this needs a bit of research to see how fast it is safe for us to go.
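
To get a feel for why a polite queue may never catch up, here is a rough back-of-envelope sketch in Python. It assumes a politeness model where the wait between fetches is a multiple of the last fetch time, clamped between a minimum and a maximum. The function names, parameter names, and all numbers are hypothetical; none of them are the actual crawl settings.

```python
def politeness_delay_ms(fetch_ms, delay_factor, min_delay_ms, max_delay_ms):
    """Wait before the next fetch from the same queue: a multiple of the
    last fetch duration, clamped to [min_delay_ms, max_delay_ms]."""
    return max(min_delay_ms, min(max_delay_ms, delay_factor * fetch_ms))


def days_to_drain(queue_length, fetch_ms, delay_ms):
    """Days needed to fetch every URI in one queue, one request at a time."""
    total_ms = queue_length * (fetch_ms + delay_ms)
    return total_ms / (1000 * 60 * 60 * 24)


# Hypothetical settings, for illustration only (not the real crawl profile).
polite = politeness_delay_ms(fetch_ms=200, delay_factor=5.0,
                             min_delay_ms=3000, max_delay_ms=30000)
fast = politeness_delay_ms(fetch_ms=200, delay_factor=1.0,
                           min_delay_ms=250, max_delay_ms=5000)

# A queue of one million URIs under each setting.
polite_days = days_to_drain(1_000_000, 200, polite)  # roughly 37 days
fast_days = days_to_drain(1_000_000, 200, fast)      # roughly 5 days
print(f"polite: {polite_days:.1f} days, fast: {fast_days:.1f} days")
```

With these made-up numbers the minimum delay dominates, which suggests the minimum is the setting to research first when deciding how fast is safe.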

Known CDNs include the following. This is just from scanning the sample of 2000 retired queues from DC 2021 that the Frontier Report shows; there were many more sites that hit the cap.

com,shopify,cdn,
com,wixstatic,static,
com,squarespace-cdn,images,
com,amazonaws,s3,primarysite-prod-sorted,
com,bigcommerce,cdn11,
com,squarespace,static1,
com,wp,i0,
com,wp,i1,
com,wp,i2,
jp,imgz,c,
me,rocketcdn,
com,rs-cdn,uk, 
uk,co,sykesassets,property-images-cdn, 
cymru,cyfoethnaturiol,cdn, 
net,ekm,cdn,
com,packhelp,cdn,static,
com,lw-cdn, 
io,statically,cdn, 
net,b-cdn,
com,rackcdn,
com,rackcdn,cf3,ssl,24a04536d882ca0087a3-289132c7eabba70668e526ce8cd83a46, [???]
com,myportfolio,pro2-bar-s3-cdn-cf4, [???]
com,smushcdn,664305,
com,productserve,images2, 
com,stackpathcdn,
uk,co,foodism,cdn,
io,accentuate,cdn, 
uk,co,love4lighting,cdn, 
net,lightgalleries,cdn,
com,jimcdn,image, 
com,tildacdn,static, 
uk,co,ednology,marketplace,cdn, 
uk,co,bargainmax,cdn,
com,tripadvisor,dynamic-media-cdn, 
uk,co,express,images,cdn, 
net,sz-cdn,uk, 
com,shgcdn,i, 
com,schooljotter2,cdn,img2, 
com,uenicdn,img77, 
com,ucarecdn, [??? brings in full site?]
com,dvipcdn,f, 
com,kajabi-cdn,kajabi-storefronts-production, 
com,sqspcdn,1,static1, 
net,website-editor,le-cdn, 
com,aiircdn,mmo, 
com,schooljotter2,cdn,img, 
events,asp,cdn, 
com,cdn-website,irp,
com,cdn-website,lirp,
net,nccdn,0501,
net,create-cdn,sites, 
com,simplesite,cdn, 
net,secureservercdn, 
uk,co,atcdn,m, 
com,googleapis,storage,
com,editmysite,cdn2, 
com,multiscreensite,lirp-cdn, 
com,amazonaws,s3-eu-west-1 [???]
https://s3-eu-west-1.amazonaws.com/cdn.webfactore.co.uk/sr_274624.png?1537558108
uk,co,tropicalsky,cdn1, 
uk,co,tropicalsky,cdn2, [???] 
uk,co,memiah,cdn, 

And from Slack (not sure if they want tagging here): "not a CDN, but I need to special-case domains like doi.org (and variants dx.doi.org etc) for scholarly crawling", so:

org,doi,
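
As a minimal sketch of how these prefixes could be checked against a URL, the Python below reverses a host's labels into the same comma-separated form and tests it against a short sample of the entries above. The names `FAST_PREFIXES`, `host_to_prefix`, and `is_fast_queue` are hypothetical, the example URLs are made up, and this is plain string matching rather than the crawler's own sheet-association logic.

```python
from urllib.parse import urlsplit

# A few prefixes copied from the list above; the rest would be added the same way.
FAST_PREFIXES = [
    "com,shopify,cdn,",
    "com,wp,i0,",
    "net,b-cdn,",
    "org,doi,",
]


def host_to_prefix(host):
    """Reverse the host's labels and join them with commas,
    e.g. 'cdn.shopify.com' -> 'com,shopify,cdn,'."""
    labels = host.lower().rstrip(".").split(".")
    return ",".join(reversed(labels)) + ","


def is_fast_queue(url, prefixes=FAST_PREFIXES):
    """True if the URL's host equals, or is a subdomain of, a listed prefix."""
    host = urlsplit(url).hostname
    if host is None:
        return False
    key = host_to_prefix(host)
    return any(key.startswith(p) for p in prefixes)


print(is_fast_queue("https://cdn.shopify.com/s/files/example.png"))  # True
print(is_fast_queue("https://www.shopify.com/"))                     # False
print(is_fast_queue("https://dx.doi.org/10.1234/example"))           # True
```

The trailing comma matters in this scheme: `org,doi,` covers `doi.org` and `dx.doi.org` but not an unrelated host whose name merely starts with `doi`.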
