
Commit 15d1daf

feat: tool and scripts to interactively explore webgraphs

- the class GraphExplorer allows to explore webgraphs using the JShell
- the class Graph holds all webgraph-related data as memory-mapped data: the graph, its transpose and the map to translate between vertex labels and IDs. It provides methods to access successors and predecessors, etc.
- the script graph_explore_download_webgraph.sh downloads all files required for exploring a graph
- the script graph_explore_build_vertex_map.sh builds a map of vertex labels to vertex ID and verifies that all graph files required for graph exploration are downloaded.
- utility methods
  - get a common subset (intersection) or the union of the successors or predecessors of a list of vertices
  - class CountingMergedIntIterator to count occurrences of integers given a list of int iterators as input
  - print list of vertices
  - load and save vertex lists from/to files
  - count top-level domains in lists of vertices
- JShell script to load a graph
- tutorial / quick start graph exploration
1 parent 15917a1 commit 15d1daf

File tree

10 files changed: +1559, -2 lines changed

README.md

Lines changed: 14 additions & 2 deletions
@@ -13,7 +13,19 @@ java -cp target/cc-webgraph-0.1-SNAPSHOT-jar-with-dependencies.jar <classname> <

The assembly jar file includes also the [WebGraph](https://webgraph.di.unimi.it/) and [LAW](https://law.di.unimi.it/software.php) packages required to compute [PageRank](https://en.wikipedia.org/wiki/PageRank) and [Harmonic Centrality](https://en.wikipedia.org/wiki/Centrality#Harmonic_centrality).

- Note that the webgraphs are usually multiple Gigabytes in size and require a sufficient Java heap size ([Java option](https://docs.oracle.com/en/java/javase/14/docs/specs/man/java.html#extra-options-for-java) `-Xmx`) for processing.
+ ### Javadocs
+
+ The Javadocs are created by `mvn javadoc:javadoc`. Then open the file `target/site/apidocs/index.html` in a browser.
+
+ ## Memory and Disk Requirements
+
+ Note that the webgraphs are usually multiple Gigabytes in size and require for processing
+ - a sufficient Java heap size ([Java option](https://docs.oracle.com/en/java/javase/21/docs/specs/man/java.html#extra-options-for-java) `-Xmx`)
+ - enough disk space to store the graphs and temporary data.
+
+ The exact requirements depend on the graph size and the task – graph exploration or ranking, etc.

## Construction and Ranking of Host- and Domain-Level Web Graphs
@@ -49,7 +61,7 @@ The shell script is easily adapted to your needs. Please refer to the [LAW datas

The Common Crawl webgraph data sets are announced on the [Common Crawl web site](https://commoncrawl.org/tag/webgraph/).

- Instructions how to explore the webgraphs are given in the [cc-notebooks project](//github.com/commoncrawl/cc-notebooks/tree/master/cc-webgraph-statistics).
+ For instructions on how to explore the webgraphs using the JShell, please see the tutorial [Interactive Graph Exploration](./graph-exploration-README.md). For an older approach using [Jython](https://www.jython.org/) and [pyWebGraph](https://github.com/mapio/py-web-graph), see the [cc-notebooks project](//github.com/commoncrawl/cc-notebooks/tree/master/cc-webgraph-statistics).

## Credits

graph-exploration-README.md

Lines changed: 285 additions & 0 deletions
@@ -0,0 +1,285 @@

# Interactive Graph Exploration

A tutorial on how to interactively explore the Common Crawl webgraphs – or other graphs using the webgraph format – with the [JShell](https://docs.oracle.com/en/java/javase/21/jshell/index.html) and the [GraphExplorer](src/main/java/org/commoncrawl/webgraph/explore/GraphExplorer.java) class.


## Quick Start

1. Change into the "cc-webgraph" project directory, [build the cc-webgraph jar](README.md#compiling-and-packaging-java-tools) and remember the project directory in an environment variable:

   ```
   $> cd .../cc-webgraph

   $> mvn clean package

   $> CC_WEBGRAPH=$PWD
   ```

2. Select a web graph you want to explore, choose a download directory and download the web graph:

   ```
   $> GRAPH=cc-main-2024-feb-apr-may-domain

   $> mkdir .../my-webgraphs/$GRAPH
   $> cd .../my-webgraphs/$GRAPH
   ```

   About 15 GiB of disk space are needed to hold all files of a domain-level webgraph.

   ```
   $> $CC_WEBGRAPH/src/script/webgraph_ranking/graph_explore_download_webgraph.sh $GRAPH
   ```

3. Build the map from vertex label to vertex ID and vice versa. This allows you to look up a reverse domain name (e.g. "org.commoncrawl") and get the corresponding vertex ID:

   ```
   $> $CC_WEBGRAPH/src/script/webgraph_ranking/graph_explore_build_vertex_map.sh $GRAPH $GRAPH-vertices.txt.gz
   ```

4. Launch the [JShell](https://docs.oracle.com/en/java/javase/21/jshell/index.html):

   ```
   $> jshell --class-path $CC_WEBGRAPH/target/cc-webgraph-*-jar-with-dependencies.jar
   | Welcome to JShell -- Version 21.0.3
   | For an introduction type: /help intro

   jshell>
   ```

   Now you may play around with the JShell or load the GraphExplorer class and your graph:

   ```
   jshell> import org.commoncrawl.webgraph.explore.GraphExplorer

   jshell> GraphExplorer e = new GraphExplorer("cc-main-2024-feb-apr-may-domain")
   2024-06-23 13:38:51:084 +0200 [main] INFO Graph - Loading graph cc-main-2024-feb-apr-may-domain.graph
   2024-06-23 13:38:51:193 +0200 [main] INFO Graph - Loading transpose of the graph cc-main-2024-feb-apr-may-domain-t.graph
   2024-06-23 13:38:51:279 +0200 [main] INFO Graph - Loading vertex map cc-main-2024-feb-apr-may-domain.iepm (ImmutableExternalPrefixMap)
   2024-06-23 13:38:52:356 +0200 [main] INFO Graph - Loaded graph cc-main-2024-feb-apr-may-domain.graph
   e ==> org.commoncrawl.webgraph.explore.GraphExplorer@4cc0edeb
   ```

   But for now, exit the JShell:

   ```
   jshell> /exit
   | Goodbye
   ```

To make the loading easier, you may use the load script [graph_explore_load_graph.jsh](src/script/webgraph_ranking/graph_explore_load_graph.jsh) and pass the graph name as a Java property to the JShell via the command-line option `-R-Dgraph=$GRAPH`:

```
$> jshell --class-path $CC_WEBGRAPH/target/cc-webgraph-*-jar-with-dependencies.jar \
     -R-Dgraph=$GRAPH \
     $CC_WEBGRAPH/src/script/webgraph_ranking/graph_explore_load_graph.jsh
Loading graph cc-main-2024-feb-apr-may-domain
2024-06-23 13:30:14:134 +0200 [main] INFO Graph - Loading graph cc-main-2024-feb-apr-may-domain.graph
2024-06-23 13:30:14:340 +0200 [main] INFO Graph - Loading transpose of the graph cc-main-2024-feb-apr-may-domain-t.graph
2024-06-23 13:30:14:439 +0200 [main] INFO Graph - Loading vertex map cc-main-2024-feb-apr-may-domain.iepm (ImmutableExternalPrefixMap)
2024-06-23 13:30:15:595 +0200 [main] INFO Graph - Loaded graph cc-main-2024-feb-apr-may-domain.graph

Graph cc-main-2024-feb-apr-may-domain loaded into GraphExplorer *e*
Type "e." and press <TAB> to list the public methods of the class GraphExplorer
... or "g." for the graph loaded for exploration

... or use one of the predefined methods:
void cn(String)
void cn(long)
void pwn()
void ls()
void ls(long)
void ls(String)
void sl()
void sl(long)
void sl(String)

| Welcome to JShell -- Version 21.0.3
| For an introduction type: /help intro

jshell>
```

The predefined methods are those provided by [pyWebGraph](https://github.com/mapio/py-web-graph):

```
jshell> cn("org.commoncrawl")
#111997321 org.commoncrawl

jshell> pwn()
#111997321 org.commoncrawl

jshell> ls()    // list successors (vertices linked from the domain commoncrawl.org or one of its subdomains)

jshell> sl()    // list predecessors (vertices connected via incoming links)
```
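
One practical note before moving on: the webgraphs are several Gigabytes in size (see [Memory and Disk Requirements](README.md#memory-and-disk-requirements)), so loading a graph may exhaust the default Java heap. If that happens, you can pass a larger heap size to the JShell's execution environment via the `-R` option, the same mechanism used above for `-R-Dgraph`. The value below is only an illustration; adjust it to your graph and machine:

```
$> jshell -R-Xmx16g --class-path $CC_WEBGRAPH/target/cc-webgraph-*-jar-with-dependencies.jar
```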

## Using the Java Classes

The Java classes "GraphExplorer" and "Graph" bundle a set of methods which help to explore the graphs:
- load the webgraph, its transpose and the vertex map
- access the vertices and their successors or predecessors
- utilities to import or export a list of vertices or counts from or into a file

The methods are bundled in the classes of the Java package `org.commoncrawl.webgraph.explore`. To get an overview of all provided methods, inspect the source code or see the section [Javadocs](README.md#javadocs) in the main README for how to read the Javadocs. Here only a few examples are presented.

We start again by launching the JShell and loading a webgraph:

```
$> jshell --class-path $CC_WEBGRAPH/target/cc-webgraph-*-jar-with-dependencies.jar \
     -R-Dgraph=$GRAPH \
     $CC_WEBGRAPH/src/script/webgraph_ranking/graph_explore_load_graph.jsh
jshell>
```

Two classes are already instantiated – the *GraphExplorer* `e` and the *Graph* `g`; the former holds a reference to the latter:

```
jshell> /vars
| String graph = "cc-main-2024-feb-apr-may-domain"
| GraphExplorer e = org.commoncrawl.webgraph.explore.GraphExplorer@7dc7cbad
| Graph g = org.commoncrawl.webgraph.explore.Graph@4f933fd1

jshell> e.getGraph()
$45 ==> org.commoncrawl.webgraph.explore.Graph@4f933fd1
```

First, the vertices in the webgraphs are represented by numbers, so we need to translate between vertex labels and IDs:

```
jshell> g.vertexLabelToId("org.wikipedia")
$46 ==> 115107569

jshell> g.vertexIdToLabel(115107569)
$47 ==> "org.wikipedia"
```

One important note: Common Crawl's webgraphs list the host or domain names in [reverse domain name notation](https://en.wikipedia.org/wiki/Reverse_domain_name_notation). The vertex lists are sorted by the reversed names in lexicographic order and then numbered continuously. This gives a close-to-perfect compression of the webgraphs themselves: most of the arcs are close in terms of locality, because subdomains or sites of the same region (by country-code top-level domain) are listed in one continuous block. Cf. the paper [The WebGraph Framework I: Compression Techniques](https://vigna.di.unimi.it/ftp/papers/WebGraphI.pdf) by Paolo Boldi and Sebastiano Vigna.
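
As a small illustration of the notation (plain Java, not a method of the explore package), a hostname can be turned into such a label directly in the JShell, which imports `java.util.*` by default:

```
// Illustration only: convert a hostname into the reverse domain name notation
// used for the vertex labels. Assumes a bare hostname without port or path.
String host = "www.commoncrawl.org";
List<String> parts = new ArrayList<>(Arrays.asList(host.split("\\.")));
Collections.reverse(parts);
String label = String.join(".", parts);   // "org.commoncrawl.www" (host-level label)
```

In the domain-level graphs the vertex labels are registered domains in this notation, e.g. "org.commoncrawl", without the host part.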

Now, let's look at how many other domains are linked from Wikipedia:

```
jshell> g.outdegree("org.wikipedia")
$46 ==> 2106338
```

Another note: Common Crawl's webgraphs are based on sample crawls of the web. Like the crawls, the webgraphs are not complete, and Wikipedia may in reality link to far more domains. But 2 million linked domains is already not a small sample.

The Graph class also gives you access to the successors of a vertex, as an array or stream of integers, but also as a stream of strings (vertex labels):

```
jshell> g.successors("org.wikipedia").length
$48 ==> 2106338

jshell> g.successorIntStream("org.wikipedia").count()
$49 ==> 2106338

jshell> g.successorStream("org.wikipedia").limit(10).forEach(System.out::println)
abb.global
abb.nic
abbott.cardiovascular
abbott.globalpointofcare
abbott.molecular
abbott.pk
abc.www
abudhabi.gov
abudhabi.mediaoffice
abudhabi.tamm
```

Using Java streams, it's easy to translate between the two representations:

```
jshell> g.successorIntStream("org.wikipedia").limit(5).mapToObj(i -> g.vertexIdToLabel(i)).forEach(System.out::println)
abb.global
abb.nic
abbott.cardiovascular
abbott.globalpointofcare
abbott.molecular
```
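
The opposite direction, from labels back to IDs, works the same way; a minimal sketch using the same accessors (output omitted):

```
jshell> g.successorStream("org.wikipedia").limit(5).map(g::vertexLabelToId).forEach(System.out::println)
```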

Successors represent outgoing links to other domains. We can do the same for predecessors, that is, incoming links from other domains:

```
jshell> g.indegree("org.wikipedia")
$50 ==> 2752391

jshell> g.predecessorIntStream("org.wikipedia").count()
$51 ==> 2752391

jshell> g.predecessorStream("org.wikipedia").limit(5).forEach(System.out::println)
abogado.fabiobalbuena
abogado.jacksonville
abogado.jaskot
abogado.super
ac.789bet
```
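
The explore package also provides utilities to get the intersection or union of the successors or predecessors of a list of vertices (see the commit message). Without going into those helpers, the idea can be sketched with the plain stream accessors shown above; "org.wikidata" is just an assumed second example vertex:

```
// Sketch only: count the domains linked from both org.wikipedia and org.wikidata,
// using plain Java streams instead of the dedicated helper methods.
Set<Integer> fromWikidata = g.successorIntStream("org.wikidata")
        .boxed()
        .collect(Collectors.toSet());
long linkedFromBoth = g.successorIntStream("org.wikipedia")
        .filter(fromWikidata::contains)
        .count();
System.out.println(linkedFromBoth + " domains are linked from both");
```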

Technically, webgraphs only store successor lists, but the Graph class holds two graphs: the "original" one and its transpose. In the transposed graph "successors" are "predecessors", and "outdegree" means "indegree". Some lower-level methods take one of the two webgraphs as an argument; there it makes a difference whether you pass `g.graph` or `g.graphT`. Here they are passed to a method which translates vertex IDs to labels and extracts the top-level domain:

```
jshell> g.successorTopLevelDomainStream(g.graph, g.vertexLabelToId("org.wikipedia")).limit(5).forEach(System.out::println)
abb
abb
abbott
abbott
abbott

jshell> g.successorTopLevelDomainStream(g.graphT, g.vertexLabelToId("org.wikipedia")).limit(5).forEach(System.out::println)
abogado
abogado
abogado
abogado
ac
```

The top-level domains repeat, and you may want to count the occurrences and create a frequency list. There is a predefined method for this:

```
jshell> g.successorTopLevelDomainCounts("org.wikipedia").filter(e -> e.getKey().startsWith("abb")).forEach(e -> System.out.printf("%8d\t%s\n", e.getValue(), e.getKey()))
       4    abbott
       2    abb

jshell> g.successorTopLevelDomainCounts("org.wikipedia").limit(10).forEach(e -> System.out.printf("%8d\t%s\n", e.getValue(), e.getKey()))
  706707    com
  213406    org
  117042    de
   86684    net
   65906    ru
   55914    fr
   53628    uk
   52828    it
   51622    jp
   33729    br
```

The same can be done for predecessors using the method `Graph::predecessorTopLevelDomainCounts`.
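
For example, mirroring the successor variant above (output not shown here):

```
jshell> g.predecessorTopLevelDomainCounts("org.wikipedia").limit(10).forEach(e -> System.out.printf("%8d\t%s\n", e.getValue(), e.getKey()))
```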

Dealing with large successor or predecessor lists can be painful, and viewing them in a terminal window is practically impossible. We've already discussed how to compress the list to top-level domain counts. Alternatively, you could select the labels by prefix...

```
jshell> g.successorStream("org.wikipedia", "za.org.").limit(10).forEach(System.out::println)
za.org.61mech
za.org.aadp
za.org.aag
za.org.abc
za.org.acaparty
za.org.acbio
za.org.accord
za.org.acd
za.org.acdp
za.org.acjr
```

... but even then the list may be huge. In that case the best option is to write the stream output (vertex labels or top-level domain frequencies) into a file and view it later in a file viewer, or use any other tool for further processing:

```
jshell> e.saveVerticesToFile(g.successors("org.wikipedia"), "org-wikipedia-successors.txt")

jshell> e.saveCountsToFile(g.successorTopLevelDomainCounts("org.wikipedia"), "org-wikipedia-successors-tld-counts.txt")
```
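
Vertex lists can also be loaded back from files (see the commit message). Independent of that, the saved files can be inspected with standard Java directly in the JShell, which imports `java.nio.file.*` by default; the file name is taken from the example above, and the check assumes one entry per line:

```
jshell> Files.lines(Path.of("org-wikipedia-successors.txt")).limit(5).forEach(System.out::println)

jshell> Files.lines(Path.of("org-wikipedia-successors.txt")).count()   // expected to match g.outdegree("org.wikipedia")
```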

## Final Remarks

We hope these few examples help you have fun exploring the graphs, or serve as a starting point for your own pipeline to extract insights from the webgraphs.

Finally, thanks to the authors of the [WebGraph framework](https://webgraph.di.unimi.it/) and of [pyWebGraph](https://github.com/mapio/py-web-graph) for their work on these powerful tools, and for the inspiration taken from them for these examples.
