These are PHP bindings for HTML Inspector.
<?php
function extract_anchors(string $html_utf8, string $document_uri): void
{
    $doc = new HtmlInspector\HtmlDocument($html_utf8);

    // Look up the first <base> element in <head>; iterate() yields -1 if there is none.
    $base_node = $doc->select(0)->child()->name('html')->child()->name('head')->child()
        ->name('base')->iterate();

    // Resolve the base href against the document URI, falling back to the document URI itself.
    $base = HtmlInspector\resolve_iri($doc->get_attribute($base_node, 'href'), $document_uri);
    $base ??= $document_uri;

    // Visit <a> elements, excluding hrefs that start with '#', and print each resolved URI.
    $selector = $doc->select(0)->descendant()->name('a')->attribute_starts_with('href', '#')->not();
    while (($node_a = $selector->iterate()) !== -1) {
        $href = $doc->get_attribute($node_a, 'href');
        $uri = HtmlInspector\resolve_iri($href, $base);
        print("$uri\n");
    }
}

I have gone back and forth on whether to implement PHP iterators for looping through nodes. The way
PHP implements iterators is awkward. First, two redundant implementations are needed: one to support
looping with foreach and one to implement the Iterator interface. Second, the interface requires two
methods, next (with no return value) and current, instead of just one. As a result, we have to cache
both the current value and the validity state of the iterator, and current conditionally has to
perform one implicit iteration. Python is an example where iteration is implemented more elegantly,
using a single __next__ method that both advances the iteration and returns the next value. Another
complication is how to encode the non-existence of a node. With PHP iterators, we need to use the
value false, and the get_* methods need union type hints and a corresponding check to enable a
concise syntax. Without iterators, we can use the value -1 and pass it to the C functions without
further checks.
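
To make the trade-off concrete, here is a minimal userland sketch of what adapting a single-step iterate() call to PHP's Iterator interface involves. It is not code from the bindings: ArraySelector is a hypothetical stand-in for a selector, and SelectorIterator is an assumed adapter name. It only illustrates the cached value, the cached validity state, and the implicit first iteration described above.

```php
<?php
declare(strict_types=1);

/**
 * Hypothetical stand-in for a selector: iterate() advances and returns
 * the next node index, or -1 once the nodes are exhausted.
 */
final class ArraySelector
{
    private int $pos = 0;

    /** @param int[] $nodes */
    public function __construct(private array $nodes) {}

    public function iterate(): int
    {
        if ($this->pos >= count($this->nodes)) {
            return -1;
        }
        return $this->nodes[$this->pos++];
    }
}

/** Assumed adapter from the single-step source to PHP's Iterator interface. */
final class SelectorIterator implements Iterator
{
    private int $node = -1;       // cached current value (-1 means "no node")
    private int $key = -1;        // cached position
    private bool $primed = false; // whether the first implicit step has happened

    public function __construct(private ArraySelector $selector) {}

    private function step(): void
    {
        $this->node = $this->selector->iterate();
        $this->key++;
        $this->primed = true;
    }

    // current(), key() and valid() may be called before next(), so each of
    // them has to make sure one implicit iteration has taken place.
    private function prime(): void
    {
        if (!$this->primed) {
            $this->step();
        }
    }

    public function rewind(): void
    {
        // The underlying source is single-pass, so there is nothing to reset.
    }

    public function valid(): bool
    {
        $this->prime();
        return $this->node !== -1;
    }

    public function current(): int|false
    {
        $this->prime();
        return $this->node === -1 ? false : $this->node;
    }

    public function key(): int
    {
        $this->prime();
        return $this->key;
    }

    public function next(): void
    {
        $this->prime();
        $this->step();
    }
}

// Plain loop with the -1 sentinel.
$plain = [];
$selector = new ArraySelector([3, 7, 9]);
while (($node = $selector->iterate()) !== -1) {
    $plain[] = $node;
}

// The same traversal through the adapter.
$via_iterator = [];
foreach (new SelectorIterator(new ArraySelector([3, 7, 9])) as $node) {
    $via_iterator[] = $node;
}
```

Both loops collect the same nodes, but the adapter needs three pieces of cached state and a priming check in four methods, where the plain loop needs a single comparison against -1. In a C extension, the foreach support would additionally have to be implemented separately from these methods.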