Commit 8fffff2

feat: tool and scripts to interactively explore webgraph
Add examples showing how to use the graph exploration Java classes. Add information on how to build the Javadocs.
1 parent 24bc96a commit 8fffff2

File tree

2 files changed

+184
-1
lines changed


README.md

Lines changed: 13 additions & 1 deletion
@@ -13,7 +13,19 @@ java -cp target/cc-webgraph-0.1-SNAPSHOT-jar-with-dependencies.jar <classname> <
The assembly jar file also includes the [WebGraph](https://webgraph.di.unimi.it/) and [LAW](https://law.di.unimi.it/software.php) packages required to compute [PageRank](https://en.wikipedia.org/wiki/PageRank) and [Harmonic Centrality](https://en.wikipedia.org/wiki/Centrality#Harmonic_centrality).

### Javadocs

The Javadocs are created by `mvn javadoc:javadoc`. Then open the file `target/site/apidocs/index.html` in a browser.

## Memory and Disk Requirements

Note that the webgraphs are usually multiple gigabytes in size. Processing them requires
- a sufficient Java heap size ([Java option](https://docs.oracle.com/en/java/javase/21/docs/specs/man/java.html#extra-options-for-java) `-Xmx`)
- enough disk space to store the graphs and temporary data.

The exact requirements depend on the graph size and the task – graph exploration, ranking, etc.

## Construction and Ranking of Host- and Domain-Level Web Graphs

graph-exploration-README.md

Lines changed: 171 additions & 0 deletions

@@ -112,3 +112,174 @@ A tutorial how to interactively explore the Common Crawl webgraphs – or other
jshell> sl() // list predecessors (vertices connected via incoming links)
```

## Using the Java Classes

The Java classes `GraphExplorer` and `Graph` bundle a set of methods which help to explore the graphs:
- load the webgraph, its transpose and the vertex map
- access the vertices and their successors or predecessors
- utilities to import or export a list of vertices or counts from or into a file

The methods are bundled in the classes of the Java package `org.commoncrawl.webgraph.explore`. To get an overview of all provided methods, inspect the source code or see the section [Javadocs](README.md#javadocs) in the main README for how to read the Javadocs. Only a few examples are presented here.

We start again by launching the JShell and loading a webgraph:

```
$> jshell --class-path $CC_WEBGRAPH/target/cc-webgraph-*-jar-with-dependencies.jar \
     -R-Dgraph=$GRAPH \
     $CC_WEBGRAPH/src/script/webgraph_ranking/graph_explore_load_graph.jsh
jshell>
```

Two classes are already instantiated – the *GraphExplorer* `e` and the *Graph* `g`; the former holds a reference to the latter:

```
jshell> /vars
|    String graph = "cc-main-2024-feb-apr-may-domain"
|    GraphExplorer e = org.commoncrawl.webgraph.explore.GraphExplorer@7dc7cbad
|    Graph g = org.commoncrawl.webgraph.explore.Graph@4f933fd1

jshell> e.getGraph()
$45 ==> org.commoncrawl.webgraph.explore.Graph@4f933fd1
```

First, note that the vertices in the webgraphs are represented by numbers, so we need to translate between vertex label and ID:

```
jshell> g.vertexLabelToId("org.wikipedia")
$46 ==> 115107569

jshell> g.vertexIdToLabel(115107569)
$47 ==> "org.wikipedia"
```

One important note: Common Crawl's webgraphs list the host or domain names in [reverse domain name notation](https://en.wikipedia.org/wiki/Reverse_domain_name_notation). The vertex lists are sorted by the reversed names in lexicographic order and then numbered consecutively. This gives a close-to-perfect compression of the webgraphs themselves. Most of the arcs are close in terms of locality because subdomains or sites of the same region (by country-code top-level domain) are listed in one continuous block. Cf. the paper [The WebGraph Framework I: Compression Techniques](https://vigna.di.unimi.it/ftp/papers/WebGraphI.pdf) by Paolo Boldi and Sebastiano Vigna.
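The reversal itself is easy to reproduce. Here is a minimal sketch in plain Java, independent of the cc-webgraph API (the class and helper names are made up for the illustration):

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

public class ReverseHost {
    /** Reverses the dot-separated labels of a host name,
     *  e.g. "www.wikipedia.org" -> "org.wikipedia.www". */
    static String reverseHost(String host) {
        List<String> parts = Arrays.asList(host.split("\\."));
        Collections.reverse(parts);
        return String.join(".", parts);
    }

    public static void main(String[] args) {
        System.out.println(reverseHost("www.wikipedia.org")); // org.wikipedia.www
        System.out.println(reverseHost("wikipedia.org"));     // org.wikipedia
    }
}
```

Sorting such reversed names lexicographically is what keeps related vertices – e.g. all subdomains of one domain – next to each other in the vertex list.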

Now, let's see how many other domains are linked from Wikipedia:

```
jshell> g.outdegree("org.wikipedia")
$46 ==> 2106338
```

Another note: Common Crawl's webgraphs are based on sample crawls of the web. Like the crawls, the webgraphs are not complete, and Wikipedia may in reality link to far more domains. But 2 million linked domains is already not a small sample.

The Graph class also gives you access to the successors of a vertex, as an array or stream of integers, but also as a stream of strings (vertex labels):

```
jshell> g.successors("org.wikipedia").length
$48 ==> 2106338

jshell> g.successorIntStream("org.wikipedia").count()
$49 ==> 2106338

jshell> g.successorStream("org.wikipedia").limit(10).forEach(System.out::println)
abb.global
abb.nic
abbott.cardiovascular
abbott.globalpointofcare
abbott.molecular
abbott.pk
abc.www
abudhabi.gov
abudhabi.mediaoffice
abudhabi.tamm
```

Using Java streams it's easy to translate between the two representations:

```
jshell> g.successorIntStream("org.wikipedia").limit(5).mapToObj(i -> g.vertexIdToLabel(i)).forEach(System.out::println)
abb.global
abb.nic
abbott.cardiovascular
abbott.globalpointofcare
abbott.molecular
```

Successors represent outgoing links to other domains. We can do the same for predecessors, that is, incoming links from other domains:

```
jshell> g.indegree("org.wikipedia")
$50 ==> 2752391

jshell> g.predecessorIntStream("org.wikipedia").count()
$51 ==> 2752391

jshell> g.predecessorStream("org.wikipedia").limit(5).forEach(System.out::println)
abogado.fabiobalbuena
abogado.jacksonville
abogado.jaskot
abogado.super
ac.789bet
```

Technically, webgraphs only store successor lists. But the Graph class holds two graphs: the "original" one and its transpose. In the transposed graph "successors" are "predecessors", and "outdegree" means "indegree". Some lower-level methods take one of the two webgraphs as an argument, and then it makes a difference whether you pass `g.graph` or `g.graphT` – for example, to a method which translates vertex IDs to labels and extracts the top-level domain:

```
jshell> g.successorTopLevelDomainStream(g.graph, g.vertexLabelToId("org.wikipedia")).limit(5).forEach(System.out::println)
abb
abb
abbott
abbott
abbott

jshell> g.successorTopLevelDomainStream(g.graphT, g.vertexLabelToId("org.wikipedia")).limit(5).forEach(System.out::println)
abogado
abogado
abogado
abogado
ac
```
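This successor/predecessor duality can be illustrated without the WebGraph API at all. In the small sketch below (class and method names are made up for the illustration), transposing a plain adjacency-list graph flips every arc, so the successors of a vertex in the transpose are exactly its predecessors in the original graph:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class TransposeDemo {
    /** Flips every arc u -> v into v -> u. */
    static Map<Integer, List<Integer>> transpose(Map<Integer, List<Integer>> g) {
        Map<Integer, List<Integer>> t = new HashMap<>();
        g.forEach((u, succs) -> {
            for (int v : succs) {
                t.computeIfAbsent(v, k -> new ArrayList<>()).add(u);
            }
        });
        return t;
    }

    public static void main(String[] args) {
        // A tiny graph: 0 -> 1, 0 -> 2, 1 -> 2
        Map<Integer, List<Integer>> g = Map.of(
            0, List.of(1, 2),
            1, List.of(2),
            2, List.of());
        Map<Integer, List<Integer>> t = transpose(g);
        // The successors of 2 in the transpose are its predecessors in g:
        System.out.println(t.get(2)); // contains 0 and 1
    }
}
```

The same holds for the degrees: the outdegree of a vertex in the transpose equals its indegree in the original graph.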

The top-level domains repeat, and you may want to count the occurrences and create a frequency list. There is a predefined method for this:

```
jshell> g.successorTopLevelDomainCounts("org.wikipedia").filter(e -> e.getKey().startsWith("abb")).forEach(e -> System.out.printf("%8d\t%s\n", e.getValue(), e.getKey()))
       4	abbott
       2	abb

jshell> g.successorTopLevelDomainCounts("org.wikipedia").limit(10).forEach(e -> System.out.printf("%8d\t%s\n", e.getValue(), e.getKey()))
  706707	com
  213406	org
  117042	de
   86684	net
   65906	ru
   55914	fr
   53628	uk
   52828	it
   51622	jp
   33729	br
```
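The counting itself is a standard stream reduction. Below is a sketch of how such a frequency list could be computed over a stream of reversed labels – plain Java, not the actual cc-webgraph implementation – where the top-level domain is the first dot-separated component of a reversed name:

```java
import java.util.Comparator;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.stream.Collectors;
import java.util.stream.Stream;

public class TldCounts {
    /** Counts top-level domains in a stream of reversed host names
     *  ("org.wikipedia" -> TLD "org"), sorted by descending count. */
    static Map<String, Long> tldCounts(Stream<String> reversedLabels) {
        return reversedLabels
            .map(label -> label.split("\\.", 2)[0])          // take the TLD component
            .collect(Collectors.groupingBy(tld -> tld, Collectors.counting()))
            .entrySet().stream()
            .sorted(Map.Entry.<String, Long>comparingByValue(Comparator.reverseOrder()))
            .collect(Collectors.toMap(Map.Entry::getKey, Map.Entry::getValue,
                                      (a, b) -> a, LinkedHashMap::new));
    }

    public static void main(String[] args) {
        Stream<String> labels = Stream.of("com.example", "com.example.www",
                                          "org.wikipedia", "com.github");
        tldCounts(labels).forEach((tld, n) -> System.out.printf("%8d\t%s%n", n, tld));
    }
}
```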

The same can be done for predecessors using the method `Graph::predecessorTopLevelDomainCounts`.

Dealing with large successor or predecessor lists can be painful, and viewing them in a terminal window is practically impossible. We've already discussed how to compress the list to top-level domain counts. Alternatively, you could select the labels by prefix...

```
jshell> g.successorStream("org.wikipedia", "za.org.").limit(10).forEach(System.out::println)
za.org.61mech
za.org.aadp
za.org.aag
za.org.abc
za.org.acaparty
za.org.acbio
za.org.accord
za.org.acd
za.org.acdp
za.org.acjr
```

... but even then the list may be huge. Then the best option is to write the stream output (vertex labels or top-level domain frequencies) into a file and view it later in a file viewer, or pass it to any other tool for further processing:

```
jshell> e.saveVerticesToFile(g.successors("org.wikipedia"), "org-wikipedia-successors.txt")

jshell> e.saveCountsToFile(g.successorTopLevelDomainCounts("org.wikipedia"), "org-wikipedia-successors-tld-counts.txt")
```
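If you need more control over the output format, the same effect can be achieved with plain `java.nio.file.Files`; the class, method, and file names below are made up for the illustration:

```java
import java.io.IOException;
import java.io.PrintWriter;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.stream.Stream;

public class SaveStream {
    /** Writes one stream element per line into a text file. */
    static void saveLines(Stream<String> lines, Path file) throws IOException {
        try (PrintWriter out = new PrintWriter(Files.newBufferedWriter(file))) {
            lines.forEach(out::println);
        }
    }

    public static void main(String[] args) throws IOException {
        // A stand-in for a label stream such as g.successorStream("org.wikipedia"):
        Stream<String> labels = Stream.of("abb.global", "abb.nic", "abc.www");
        saveLines(labels, Path.of("org-wikipedia-successors.txt"));
    }
}
```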

## Final Remarks

We hope these few examples help you have fun exploring the graphs, or support you in developing your own pipeline to extract insights from them.

Finally, thanks to the authors of the [WebGraph framework](https://webgraph.di.unimi.it/) and of [pyWebGraph](https://github.com/mapio/py-web-graph) for their work on these powerful tools and for the inspiration these examples took from them.
