Scalpel's Select
Intro
Last time, we reviewed the main abstraction Scalpel lets you use, simply by implementing some common Haskell typeclasses. However, all these things are mostly composition, depending more or less on delayed evaluation. In other words, we’re yet to see any actual “stuff being done”, we’ve mostly seen “how we will compose the computed stuff”.
Today, we face the music and dance. I won’t lie : this is going to be a long and often quite boring post.
Let’s go back to the scalpel-core folder, and open the aptly named Select.hs.
A quick look-through will let you discover two rather intimidating functions, tagsToVector and the smaller but still rather terrifying vectorToTree . We’ll mostly work on these two beasts today, and study how they are used.
However, our cursory reading should also have noted:
The much shorter
select
function…… and another big function,
selectNodes
These two are, to a point, the heart of this library. To give you a general idea of how things work : tagsToVector
and vectorToTree
build an internal data structure the allow fast access to the XML document you want to parse. select
and selectNodes
leverage this structure to focus on subtrees in the document.
A quick look upstream
You might remember we briefly quoted the main way to manipulate scrappers, internally at least, the scrape
function from Scrape.hs.
scrape :: (Ord str, TagSoup.StringLike str)
=> Scraper str a -> [TagSoup.Tag str] -> Maybe a
scrape s = scrapeTagSpec s . tagsToSpec . TagSoup.canonicalizeTags
Remember, our goal is to provide a TagSpec str
datatype that will eventually be the input for our scrapers. The second parameter, [TagSoup.Tag str]
is the produce of the TagSoup library, namely a view of an XML document given as a list of string-like things. This parameter is eta-reduced in scrape
, but we can easily see that there will be a normalization step (canonicalizeTags
simply lowercases tags and make sure DOCTYPE instructions are uppercase); and we will then turn them to this mysterious TagSpec
datatype.
The tagsToSpec
function is defined in Select.hs, and this one is easy to read:
-- | Creates a TagSpec from a list of tags parsed by TagSoup.
tagsToSpec :: forall str. (Ord str, TagSoup.StringLike str)
=> [TagSoup.Tag str] -> TagSpec str
tagsToSpec tags = (vector, tree, ctx)
where
vector = tagsToVector tags
tree = vectorToTree vector
ctx = SelectContext 0
We can easily guess from this that TagSpec
is simply a type synony for a ternary tuple. The actual alias is defined a few lines before:
-- | A structured representation of the parsed tags that provides fast element
-- look up via a vector of tags, and fast traversal via a rose tree of tags.
type TagSpec str = (TagVector str, TagForest, SelectContext)
Thank you !, Will Coster (the author behind Scalpel), for commenting your code. It lets us understand that this structure is here to provide “fast element look up”, which makes sense, since this library is mostly there to look stuff up in an XML.
So, even though we don’t really know right now what TagVector
, TagForest
or SelectContext
are, we have a basic idea of why the scary tagsToVector
and vectorToTree
functions exist. And we can easily imagine that most Scalpel’s primitive will one way or the other make use of this structure.
A word on TagSoup model
Since most of what we’re going to see is going to make heavy use of TagSoup, it might be a good idea to study, not the implementation this time !, but at the very least the basic types and features of TagSoup.
TagSoup mostly transforms an XML document into a series of “instructions”, the various constructors of the Tag
type. The Tag
type is actually parametric over a str
type, and it expects a StringLike
to be able to handle the various ways of encoding string-like things in Haskell, but in the rest of this post, I will rather outrageously ignore this, so don’t be surprised if I sometimes ellide the str
bit.
This rather crude XML:
<students>
<student house="Gryffindor">Harry Potter</student>
<student house="Ravenclaw">Cho Chang</student>
<student house="Slytherin">Lucius Malfoy</student>
<student house="Hufflepuff">Heidi Macavoy</student>
</students>
… given as input to the basic TagSoup function parseTags
, becomes:
[ TagOpen "students" [], TagOpen "student" [("house", "Gryffindor")
, TagText "Harry Potter", TagClose "student", ...]
I did not write the whole list, but you get the gist. Two interesting things:
Incorrect XML can be parsed; e.g., we could have a
TagOpen
without its equivalentTagClose
instruction.This result does not give us a tree, but a simple
List
; whereas it’s probably more convient to think of a correct XML document as a tree.
There is a way to get a Tree
of Tag
, through TagSoup, using its Tree module, but the author of TagSoup warns that the current implementation is likely to change. Which is why Scalpel rolls its own structure, optimized for quick access as we’re going to see.
Let us keep in mind that we’re going to process a list of tag-related instructions, and move forward.
An Indexed Tag List through tagsToVector
Warning : this function is actually commented, but I’ll ignore these comments. The idea of this blog is to train our haskell-reading abilities. We should be able to read this even without comments.
Since the first step in transforming the list of tags into our TagSpec
internal structure is to build a TagVector
through the tagsToVector
function, we might as well start with this. First, let us look at its whole definition. (Don’t worry, we’ll decompose it in smaller bits as we explain it !)
tagsToVector :: forall str. (Ord str, TagSoup.StringLike str)
=> [TagSoup.Tag str] -> TagVector str
tagsToVector tags = let indexed = zip tags [0..]
total = length indexed
unsorted = go indexed Map.empty
emptyVec = Vector.replicate total undefined
in emptyVec Vector.// unsorted
where
go :: [(TagSoup.Tag str, Index)]
-> Map.Map T.Text [(TagSoup.Tag str, Index)]
-> [(Index, TagInfo str)]
go [] state = map (\(t, i) -> (i, TagInfo t (maybeName t) i Nothing))
$ concat
$ Map.elems state
where
maybeName t | TagSoup.isTagOpen t = Just $ getTagName t
| TagSoup.isTagClose t = Just $ getTagName t
| otherwise = Nothing
go (x@(tag, index) : xs) state
| TagSoup.isTagClose tag =
let maybeOpen = head <$> Map.lookup tagName state
state' = Map.alter popTag tagName state
info = TagInfo tag (Just tagName) index Nothing
res = catMaybes [
Just (index, info)
, calcOffset <$> maybeOpen
]
in res ++ go xs state'
| TagSoup.isTagOpen tag = go xs (Map.alter appendTag tagName state)
| otherwise =
let info = TagInfo tag Nothing index Nothing
in (index, info) : go xs state
where
tagName = getTagName tag
appendTag :: Maybe [(TagSoup.Tag str, Index)]
-> Maybe [(TagSoup.Tag str, Index)]
appendTag m = (x :) <$> (m <|> Just [])
calcOffset :: (TagSoup.Tag str, Int) -> (Index, TagInfo str)
calcOffset (t, i) =
let offset = index - i
info = TagInfo t (Just tagName) i (Just offset)
in offset `seq` (i, info)
popTag :: Maybe [a] -> Maybe [a]
popTag (Just (_ : y : xs)) = let s = y : xs in s `seq` Just s
popTag _ = Nothing
Faced with this wall of code, we should probably start by reading the signature.
We take the result from TagSoup
, we shall return a TagVector
. TagVector
is simply a specialization of the Vector
class from the Vector
library. Vector
is a list-like structure, but with easier access to indexes; whereas accessing the nth element of a default List
is O(n), access through indexes are O(1) for vectors
. So they behave like the arrays in other language. The TagVector
does not exactly contains Tag
from TagSoup, but rather TagInfo
, an intermediary type defined in Select.hs. We’ll understand this type as we read the rest of the code.
Initialization step
OK, our mission is to return a bunch of easy-to-access TagInfo
. Let us look at the main function itself.
tagsToVector tags = let indexed = zip tags [0..]
total = length indexed
unsorted = go indexed Map.empty
emptyVec = Vector.replicate total undefined
in emptyVec Vector.// unsorted
When I find code like this, I jump to the in
part first; the element listed in the let
block should be the part composed during the in
part. A quick check to the (//)
function defined by Vector let us understand that it’s “bulk update” for vectors. We have a vector (on the left of the //
sign), a list of tuple containing index and new values (on the right), we obtain a vector updated for each indexes listed.
Let us try this in the interpreter (there are several ways of creating a vector, we’ll use the replicate
version of vectors - it fills an array with the same value).
> import qualified Data.Vector as V
> let hellos = V.replicate 3 "Hello"
> hellos
["Hello, "Hello", "Hello]
> :t hellos
hello :: Vector [Char]
> hellos V.// [(1, " "), (2, "world !")]
["Hello", " ", "world !"]
When printed in ghci, vectors looks suspiciously like lists, but checking their types show that they’re indeed vectors (trying to use list-functions from the Prelude won’t work).
OK ! So we’re bulk updating a vector. What vector are we bulk updating ? emptyVec
! Let’s look at its definition: it’s Vector.replicate
that we’ve just used in ghci. It will replicate total
times, total
being the length of our input list. And we fill it with undefined
. So, oddly enough, we build an array full of uncomputable things, that we will bulk-update later, hopefully completely. Note that this won’t break when running, because our vector of undefined will never be evaluated.
So, all we need now is the list of updates, namely a list of pairs of indexes and values that will replace all these undefined
. The rest is a common way of indexing a list, indexed
that will just zip a list of Tag
into pairs. This will produce the [(Tag, Index)]
where Index
is a simple alias for Int
.
We feed this list of pairs to a go
subfunction - with a classic Map
from Data.Map. Only it’s an empty map, so for now we dont know much about this.
Unless you’ve never seen any “real Haskell” code in your life, you should be familiar with this “go-function” pattern, but just in case you’ve never seen it : the idea is to make sort of a “setup step” in your main function, and delegate the hard work to a go
subfunction. This is particularly often the case when there will be some recursion implied.
OK, we have a clearer idea of what this function does: we will build something through “go”, a list of indexes and values that we will use to replace our empty vector, to achiever our goal : an array of quickly accessible tags.
Time to read go
. Let’s start with its signature, because Types Are Our Friends.
go :: [(TagSoup.Tag str, Index)]
-> Map.Map T.Text [(TagSoup.Tag str, Index)]
-> [(Index, TagInfo str)]
OK, we have the list of pairs we’ve just talked about and a Map
of Text
to (Tag, Index)
, and we shall return a [(Index, TagInfo)]
- what we need for the bulk update. The only new thing we learned is the exact signature of the Map
we’ll work with - we did not know much about it, since we’ve passed an empty one.
I don’t often read recursive functions, but when I do, I typically start by reading the termination case. Unfortunately, here, it would be rather confusing, so we’ll read “in the order it would typically be executed”.
Figuring out the main case
Most calls will be handled by this bit :
go (x@(tag, index) : xs) state
| TagSoup.isTagClose tag =
let maybeOpen = head <$> Map.lookup tagName state
state' = Map.alter popTag tagName state
info = TagInfo tag (Just tagName) index Nothing
res = catMaybes [
Just (index, info)
, calcOffset <$> maybeOpen
]
in res ++ go xs state'
| TagSoup.isTagOpen tag = go xs (Map.alter appendTag tagName state)
| otherwise =
let info = TagInfo tag Nothing index Nothing
in (index, info) : go xs state
As the guards show, we have to handle tag opening, tag closing, and text in between. It might be a good idea to read this “in the order it is going to be parsed”. Typically, you’d have something like :
[TagOpen "MyTag", TagText "SomeContent", TagClose "MyTag"]
Well, to be closer to our real input type we would have:
[(TagOpen "MyTag", 0), (TagText "SomeContent", 1), (TagClose "MyTag", 2)]
Though, of course, this might be much more complex when tags contain other tags - but with this default, simplistic example in mind, we should start with a TagOpen
. First, we’ll make a stylistic digression.
On style
Rather than using pattern-matching, the author decided to use guards plus some functions from TagSoup
that lets you test against various Tag
constructor. I have a feeling this is not very elegant, but there might some good reasons for this. This is nagging me though, because the author used the intermediary syntactic solution, Pattern Guard in the Applicative
and Monad
implementation for Scraper
so I don’t really understand this rather unpleasant way of writing things here.
While I’m shamelessly criticizing code I’m utterly unable to write myself, I’d also like to note that manual recursion like this one if best avoided. Also, rather than an accumulator like the one we’re going to observe, things would be clearer had the author used MonadState
. I suppose the author did not because of (a) potential performance issues and (b) it would add mtl to the dependencies.
Finally, mixing where and let is often frowned upon; to put it in a nutshell, what we are going to read in this post is not the most idiomatic Haskell ever.
End of the digression. Let’s move on.
At tag opening
Let us start with the TagOpen
, the second guard, which will evaluate to:
go xs (Map.alter appendTag tagName state)
We simply recurse on the rest of the list of tags (xs
) without prepending anything to the list we intend to return at the end of our recursive calls.
Though, we will modify the state
parameter before recursing. Map.alter
is a smart update : it takes a Maybe a -> Maybe a
function, a key and a Map
. It will return the result of the function “over” the key, even if there is no current value with this key. Stupid example below : we’ll use a function that will always return “Just 1” and the key “hello” to an empty Map
.
> import qualified Data.Map as M
> M.alter (const Just 1) "Hello" M.empty
fromList [("Hello", 1)]
So. The function used for our Map.alter
is appendTag
; though the name shoud be rather transparent name, we’ll check its definition.
appendTag :: Maybe [(TagSoup.Tag str, Index)]
-> Maybe [(TagSoup.Tag str, Index)]
appendTag m = (x :) <$> (m <|> Just [])
OK, this is not as scary as it might look like. Remember we will take as input the result of a lookup
in our Map
- which will most certainbly be Nothing
when we handle the very first opened Tag
, considering that we start with an empty Map
.
We fmap
over (:)
, which is another way of getting rid of any Maybe
. x
, here, is the tag and its index (we are in a where block so we keep the scope of the caller). We will append it to m
(the result of the lookup) if it exists, or Just []
if there was no value attached to this key.
In other words : the state
maintains a list of tag definition and indexes for each tag name. When we lookup a tag name, it gives us the list of every encountered opened tags in reverse order of appearance: the first item of the list will always be the “last encountered opening tag”.
Tag content
Unless we are handling empty tags, after opening a tag, we should evaluate the otherwise
guard of go
. It’s rather short:
let info = TagInfo tag Nothing index Nothing
in (index, info) : go xs state
This is simple : we don’t touch our state; we build a TagInfo
, and we will prepend this tag and its index to the result of a new recursive call over the rest of the list of tags we’re processing.
We still don’t know a lot about TagInfo
. We know it will at very least contain Tag
data itself and its index.
Now that we have opened a tag and processed its content, we should meet a EndTag
.
At tag closing
OK, this one is not as simple as the last one…
let maybeOpen = head <$> Map.lookup tagName state
state' = Map.alter popTag tagName state
info = TagInfo tag (Just tagName) index Nothing
res = catMaybes [
Just (index, info)
, calcOffset <$> maybeOpen
]
in res ++ go xs state'
Reading the in
block show us we are going to add several things to the end result of our recursion (hence the (++)
and not a (:)
); also, you note we’re going to modify our state.
Let’s read the let
statement in their order of writing:
maybeOpen
gets the latest opened tag with the name of the closing one.state'
is the result of yet another call toMap.alter
, this time through the functionpopTag
that we will see later, but spoiler alert !, it will remove the openedTag from our state.info
build aTagInfo
(with the same information we’ve seen before plus the name of our current tag, e.g. “p”, “a”, “div”, etc.).res
will filter out theNothing
value out of a list containing the info and the current index, as we’ve seen, and callcalcOffset
on the opening tag we might have extracted from the state.
So, we have two helper functions to check up : popTag
and calcOffset
.
popTag :: Maybe [a] -> Maybe [a]
popTag (Just (_ : y : xs)) = let s = y : xs in s `seq` Just s
popTag _ = Nothing
Remember this is called with alter
, and it is most likely here to remove the corresponding opening tag we’ve added before meeting its closing partner. Once again we see what the state is used for: to handle the tag hierarchy and keep track of currently opened tags. popTag
will remove the first occurrence of the list of (tag, index)
values mapped for this tag name.
I am going to cheat here and ignore seq
the same way I ignored most things associated to strictness in this module, which tends to be as strict as possible (by using bang notations and importing the strict version of Data.Map). This is a wide topic, but to keep things short, let’s just say it’s a way to disregard Haskell’s normally lazy behaviour, mostly to prevent unncessary overhead on heap and improve performance. The seq
is a way of asking for immediate evaluation. The “normal”, lazy version whould be:
popTag (Just (_ : y : xs)) = let s = y : xs in Just s
… or, of course, even simpler…
popTag (Just (_ : y : xs)) = Just (y:xs)
For those who care about this kind of things: using Scalpel’s benchmark, I could indeed see this stictness improves performances (without the seq
, on my machine, average worst case scenario goes from 3.2ms to 3.5ms).
Note that the patterns are cleverly written: we only handle the case where there is at least one element left in our list (xs would then be []). In any other case, we simply return Nothing, because if our list contained only one Tag left, it’s the one we wanted to pop out anyway.
And finally the calcOffset
helper function:
calcOffset :: (TagSoup.Tag str, Int) -> (Index, TagInfo str)
calcOffset (t, i) =
let offset = index - i
info = TagInfo t (Just tagName) i (Just offset)
in offset `seq` (i, info)
This one is made rather painful to read because of scope and context. Remember: the pair we pass as parameter is the opening tag (t
) and its index (i
). The index
references the index extracted in go
during pattern-matching, so its the closing tag index
. We then build a complete TagInfo, containing the Tag
data, its name, the index it started at, and “how many indexes” it spans over. We see another use of seq
, and we will bravely ignore it once again.
This time we have a complete idea of what TagInfo
is for. It lets us track “how many tags away” is the closing tag of an opened tag - a piece of information that will prove precious when we’re trying to build a tree.
The Go terminal case
When the list is empty, this final pattern will be avaluated:
go [] state = map (\(t, i) -> (i, TagInfo t (maybeName t) i Nothing))
$ concat
$ Map.elems state
where
maybeName t | TagSoup.isTagOpen t = Just $ getTagName t
| TagSoup.isTagClose t = Just $ getTagName t
| otherwise = Nothing
The same way I read the in
first in a let ... in
block, I generally keep the where
for later: it contains “building blocks” for the main expression. And since we have a big line, I find it often better to read it “from right to left”, or in this case, “from bottom to top”. So, first, we Map.elems
the state; the signature of Map.elems
makes what it does rather obvious:
elems :: Map k a -> [a]
So this will get us every values stored in the map. Since individual values are list of pairs, we’ll get a list of list of pairs, namely [[(Tag, Index)]]
. We immediately flatten this through concat
, to get a [(Tag, Index)]
.
The maybeName
implementation is rather surprising, since we should only have TagOpen
stored in our state, but ok. And it’s boring, so I’ll ignore it.
Basically, we’ll pick every TagOpen
left “dangling” in the state, in case we’re dealing with an invalid XML where someone forgot to close its tags. For each of these, we’ll make a TagInfo, keeping its original index, and without the possibility of setting an offset
.
Putting it all together
Going into details like we did doesn’t help having a clear picture of what the function does. So a good way to process is to simply simulate manually the execution, though it’s sometimes a bit painful, but it’s probably the best way to be sure of what the function actually does.
So let us simulate its execution over a simple xml like this one:
<div>
<div>
<p>I'm deep.</p>
<p>I'm deep too.</p>
</div>
</div>
With tag soup, we shall get something like this:
[ TagOpen "div" [], TagOpen "div" [], TagOpen "p" [], TagText "I'm deep"
, TagClose "p" [], TagOpen "p" [], TagText "I'm deep too", TagClose "p" []
, TagClose "div" [], TagClose "div" []]
Remember that we will zip this with [0..] though, so we’ll get this list with each value paired with its indexes when we enter go
.
At first evaluation of go
, we alter our state
- our Map
now contains [(TagOpen "div", 0)]
for the key “div”. We will evaluate directly the result of the next recursion over go
.
Second evaluation of go
will have us modify our state
again, prepending the list associated to the key div
. It should now be: [(TagOpen "div", 1), (TagOpen "div", 0)]
.
Third evaluation have us, once again, modify our state
; we now put a value for a different key, p
, with [(TagOpen "p", 2)]
.
It’s only at our fourth evaluation that we are actually going to start building a result for go
. We return : TagInfo (TagText "I'm deep.") Nothing 3 Nothing
, which will be prepended on the result of the next evaluation of go
.
Fifth evaluation, we meet with the TagClose
for p
. We try to find the first item of the list associated to p
in our state
- and we pop it out. Since our state
contained a list of only one element, there is no value anymore associated to p
in our state
. We build a small list :
[ (4, TagInfo (TagClose "p" []) (Just "p") 4 Nothing)
, (2, TagInfo (TagOpen "p" []) (Just "p") 2 (Just 2))]
This is what the expression res
evaluates to. In other words; we will only return TagOpen
data when we meet their TagClose
, or, if the XML is not well formed, at the very end of the recursion. We store the offset
when returning a TagInfo
about a TagOpen
; so we know that the tag “p” that starts at index 2 will be close “2 indexes later”, at index 4. This res
list will be concatenated to the result of our next recursions.
I won’t detail the next recursions, but we should know return something that looks like :
[ (3, TagInfo (TagText "I'm deep") Nothing 3 Nothing
(4, TagInfo (TagClose "p" []) (Just "p") 4 Nothing)
, (2, TagInfo (TagOpen "p" []) (Just "p") 2 (Just 2))
, (6, TagInfo (TagText "I'm deep too"") Nothing 6 Nothing
, (7, TagInfo (TagClose "p" []) (Just "p") 7 Nothing)
, (5, TagInfo (TagOpen "p" []) (Just "p") 5 (Just 2))
, (8, TagInfo (TagClose "div" []) (Just "div") 8 Nothing)
, (1, TagInfo (TagOpen "div" []) (Just "div") 1 (Just 7))
, (9, TagInfo (TagClose "div" []) (Just "div") 9 Nothing)
, (0, TagInfo (TagOpen "div" []) (Just "div") 0 (Just 9)) ]
A list of indexes and values. Since our XML was well formed, the final case won’t have anything to do (our state
is empty, so Mep.elems
will return an empty list), but return what was accumulated during recursion.
This list of indexes will then be inserted into our vector through the bulk update abilities of (Vector.//)
. And we can now access any item of the list through its index without problematic complexity.
The coolest thing here is that while processing the tagList
, we can return our TagInfo
in the order we want; they will be put in the proper order through the index when (Vector.//)
does its magic.
A possible State rewrite
Just for kicks, here’s a potential rewrite using a StateMonad from mtl.
type CurrentTags str = Map.Map T.Text [(TagSoup.Tag str, Index)]
type CurrentTagsState str = State (CurrentTags str) [(Index, TagInfo str)]
-- ...
-- you'd call this go through: evalState (go indexed) Map.empty
go :: [(TagSoup.Tag str, Index)]
-> CurrentTagsState str
go [] = fmap
(map (\(t, i) -> (i, TagInfo t (Just $ getTagName t) i Nothing)) . concat . Map.elems)
get
go (x@(tag, index) : xs)
| TagSoup.isTagClose tag = do
st <- get
let opened = head <$> Map.lookup tagName st
modify (Map.alter popTag tagName)
let info = TagInfo tag (Just tagName) index Nothing
res = catMaybes [ Just (index, info), calcOffset <$> opened]
fmap (res ++) (go xs)
| TagSoup.isTagOpen tag = modify (Map.alter appendTag tagName) >> go xs
| otherwise = fmap ((:) (index, TagInfo tag Nothing index Nothing)) (go xs)
I omitted the end where
block - it doesn’t change. I’m pretty sure the TagClose can be done more elegantly. Performance wise, I’m very unsure of the result, I ran the benchmark and results were unconclusive (sometimes better, sometimes worse than the original version). But to be honest, I’m rather a noob at benchmarking, so I won’t hazard a guess.
While we’re at it, you can also completely rewrite this as a foldM
over a State
, if you delegate the “pour over what’s left in the State
” part to another function, like this:
type CurrentTags str = Map.Map T.Text [(TagSoup.Tag str, Index)]
type IndexedTags str = [(Index, TagInfo str)]
type CurrentTagsState str = State (CurrentTags str) (IndexedTags str)
tagsToVector tags = let indexed = zip tags [0..]
total = length indexed
unsorted = evalState (foldM go [] indexed >>= final) Map.empty
emptyVec = Vector.replicate total undefined
in emptyVec Vector.// unsorted
where
final :: IndexedTags str -> CurrentTagsState str
final ind = do finalState <- get
let fS = closed . concat . Map.elems $ finalState
pure $ ind ++ fS
where
closed :: [(TagSoup.Tag str, Index)] -> IndexedTags str
closed = map (\(t, i) -> (i, TagInfo t (Just . getTagName $ t) i Nothing))
go :: IndexedTags str -> (TagSoup.Tag str, Index) -> CurrentTagsState str
go ind x@(tag, index)
| TagSoup.isTagClose tag = do
st <- get
let opened = head <$> Map.lookup tagName st
modify (Map.alter popTag tagName)
let info = TagInfo tag (Just tagName) index Nothing
res = catMaybes [ Just (index, info), calcOffset <$> opened]
return $ res ++ ind
| TagSoup.isTagOpen tag = modify (Map.alter appendTag tagName) >> pure ind
| otherwise = pure $ (index, TagInfo tag Nothing index Nothing):ind
(Once again, the final where
block doesn’t change, and my final implementation is ugly and could probably be improved). But I find the intention behind the code is much clearer with State
and foldM
.
VectorToTree
OK, this one is also a screenful, though it’s a bit shorter. Once again, I’ll put the full definition first before commenting it bit by bit.
vectorToTree :: TagSoup.StringLike str => TagVector str -> TagForest
vectorToTree tags = fixup $ forestWithin 0 (Vector.length tags)
where
forestWithin :: Int -> Int -> TagForest
forestWithin !lo !hi
| hi <= lo = []
| not isOpen = forestWithin (lo + 1) hi
| otherwise = Tree.Node (Span lo closeIndex) subForest
: forestWithin (closeIndex + 1) hi
where
info = tags Vector.! lo
isOpen = TagSoup.isTagOpen $ infoTag info
closeIndex = lo + fromMaybe 0 (infoOffset info)
subForest = forestWithin (lo + 1) closeIndex
-- Lifts nodes whose closing tags lay outside their parent tags up to
-- within a parent node that encompasses the node's entire span.
fixup :: TagForest -> TagForest
fixup [] = []
fixup (Tree.Node (Span lo hi) subForest : siblings)
= Tree.Node (Span lo hi) ok : bad
where
(ok, bad) = malformed (fixup siblings) $ fixup subForest
malformed :: TagForest -- Forest to prepend bad trees on.
-> TagForest -- Remaining trees to examine.
-> (TagForest, TagForest)
malformed preBad [] = ([], preBad)
malformed preBad (n@(Tree.Node (Span _ nHi) _) : ns)
| hi < nHi = (ok, n : bad)
| otherwise = (n : ok, bad)
where (ok, bad) = malformed preBad ns
From the Vector
we just build, this function will build a TagForest
.
A TagForest
is a specialization of the Forest
type from Data.Tree
, defined at the beginning of the module we’re studying:
type TagForest = Tree.Forest Span
Where a Span is a simple datatype that contains two Int
(though a better solution would be to use Nat
, since a Span won’t contain negative values, but I think that once again, the author went for performances).
If you’re not familiar with Data.Tree, a Forest
is a list of Tree
. And a Tree
contains a value (nodeLabel
) and a subForest
of zero to many children. So, all type aliases aside, at the end of the day, we want to return a [Tree Span]
.
Sidenote : it might make more sense to return a single Tree to represent a XML. But (a) that would make our signatures tougher to write when we want to recurse and (b) we have to take into account potentially poorly written XML that do not have a single root (though, to be honest, people who write this kind of XML really want to see the world burn, don’t they ?).
Now that we have a general idea of the types we’re dealing with, let’s see the implementation of the main function:
vectorToTree tags = fixup $ forestWithin 0 (Vector.length tags)
OK, we compose two subfunction. Pretty simple; first we build a Forest
with forestWithin
, then we “fix it up” (if the name was not explicit enough, a glance at the comments from fixup
will let us know that it’s to handle improper XML, we’ll come back to it).
Pay attention to the parameters we sent to forestWithin
: 0 and the length of our vector. Basically, this would be the first Span
, the one that covers the whole document, which would normally be enclosed in one big opening tag, e.g. <html>
, the last tag being the same one closed, e.g. </html>
.
Building a forest
The grunt work is done in these lines. There is no special syntax here, and it’s short enough for us to simply simulate execution to understand how this function work - so let’s do it !
forestWithin :: Int -> Int -> TagForest
forestWithin !lo !hi
| hi <= lo = []
| not isOpen = forestWithin (lo + 1) hi
| otherwise = Tree.Node (Span lo closeIndex) subForest
: forestWithin (closeIndex + 1) hi
where
info = tags Vector.! lo
isOpen = TagSoup.isTagOpen $ infoTag info
closeIndex = lo + fromMaybe 0 (infoOffset info)
subForest = forestWithin (lo + 1) closeIndex
So, we have two Int parameters (let’s ignore the bangs, once again, it’s a strict vs. lazy issue). Obviously, these are bounds, a lower one and a higher one. Let’s take our previous example; we should get in vectorToTree
with this vector:
[TagInfo (TagOpen "div" []) (Just "div") 0 (Just 9)
TagInfo (TagOpen "div" []) (Just "div") 1 (Just 7)
TagInfo (TagOpen "p" []) (Just "p") 2 (Just 2))
TagInfo (TagText "I'm deep") Nothing 3 Nothing
TagInfo (TagClose "p" []) (Just "p") 4 Nothing)
TagInfo (TagOpen "p" []) (Just "p") 5 (Just 2))
TagInfo (TagText "I'm deep too"") Nothing 6 Nothing
TagInfo (TagClose "p" []) (Just "p") 7 Nothing)
TagInfo (TagClose "div" []) (Just "div") 8 Nothing)
TagInfo (TagClose "div" []) (Just "div") 9 Nothing)]
And enter forestWithin
with 0 and 10 as parameters. The first guard will be ignored, 10 is greater than 0. For the second guard, we need to check isOpen
; isOpen
uses the lo
parameter as an index in our vector, and checks if this is an opening tag. It is (and in any proper XML, for the first execution, it should be !). So the second guard of forestWithin
will be ignored too, since we want only not isOpen
.
It stands to reason that we should enter, at fist, the otherwise
guard. Now pay attention, because there is confusing stuff incoming. Read these lines of code very carefully.
| otherwise = Tree.Node (Span lo closeIndex) subForest
: forestWithin (closeIndex + 1) hi
The confusing bit is the (:)
. This does not preprend subForest
to forestWithin
. This prepends a Tree
, with its Span
and its subForest
, to the rest of the Forest
. Quick recap:
forestWithin
must return aForest
.a
Forest
is a list ofTree
.a
Tree
contains a value and asubForest
, i.e. its own enclosed list ofTree
.
So the otherwise
prepends a whole tree to the rest of its Forest
. In any sane XML, in the first case, there should be no “rest of the Forest
”, the document should be a single Tree, so we should return a Forest
containing only one Tree
.
Which means that otherwise
evaluates to the constructor for Tree
, Node
, containing:
our bounds (so, at first evaluation, in our example,
Span 0 10
)the subForest of the Tree
With two recursive calls : one for any siblings of our tree, this is the forestWithin
in otherwise; one for every children of our tree, this is the subForest
call (which, in turn, calls forestWithin
).
Both recursions depends on closeIndex
, a value we obtain from the Tag
at the index of lo
. It is simply the “index of the closing tag for the currently opened tag”. So for our “div” at 0, it will be the div at index 9 (check the vector if you’re not sure !). You should now see why we went through the pain of computing the offsets in tagsToVector
. Finally, observe that :
When computing the
subForest
of our current tree, we want to encompass everything between “the next tag after the one we’re investigating”,(lo + 1)
till “our closing tag”.When cmputing the
siblings
of our trees, we want to take “everything after our own closing tag”,(closeIndex + 1)
, till the hi bound we got as a parameter (which should be the higher bound of our own parent, or the span of the document in the initial call).
So in the first call, with the example vector, we will:
Build a
Node (Span 0 9)
;Compute its
subForest
with lower bound1
, higher bound9
.Check if it has a sibling (spoiler : it doesn’t) with lower bound
10
, higher bound10
. (This will meet the first guard offorestWithin
and return[]
).
Simulating the evaluation
Let us evaluate the subForest
of our first <div>
. We enter forestWithing
with 1 and 9.
We build a
Node (Span 1 8)
. This is our inner<div>
.The subforest of this
Node
spans from 2 to 8.Any siblings would span from 9 to 9 - so, no siblings.
We evaluate the new subForest
; this time, we’re at the first <p>
. Once again, we meet otherwise
.
We build a
Node (Span 2 4)
.The subforest of this
Node
spans from 3 to 4.The siblings for this
Node
span from 5 to 8.
Ok, we will simulate YET ANOTHER subforest, the content of our first p
. This time, finally, as the tag we investiga is not a TagOpen
, we evaluate the second guard :
| not isOpen = forestWithin (lo + 1) hi
Which means that we won’t add a Node, this time : we will simply delegate to yet another recursion. We should evaluate forestWithin
between 4
and 4
; and meet the first guard, so we’ll return []
. In other words: we only care about opening tags and their spans, we’ll mostly ignore anything else.
However, the sibling
of the previous <p>
will get evaluated. I’ll spare us the manual evaluation, and spoil you the end. We will get this - rather simple - Forest as a final result:
[Node (Span 0 9) [Node (Span 1 8) [Node (Span 2 4) [], Node (Span 5 7)]]]
That “lower-bound” / “higher bound” recursion pattern is rather interesting. I don’t know the first thing about recursions strategy, which is why I strive, whenever I can, to use canonical abstractions for recursion instead of writing my own. I’d wager it’s used here to reduce our traversal and keep things close to O(n), but it will take some more algorithmic-minded people than me to identify the name of this trick.
OK, massive warning here : the fixup
subfunction is not particularly interesting, explaining how it works is painful, and the result will be disappointing anyway. If you still want to follow me, read the following section. If not, I advise you to directly jump to “How select work”, down below, after the swamp of misery that is going to be the fixup
analysis.
Fixing the Forest
Man, you’re still there even despite my warnings à la Lemony Snicket. I admire your courage, I blame your foolishness.
A first malformed example
The issue when writing a scraper is that you’re likely to end up having to parse improper XML. Particularly, you’re likely to end up with some opened tags left dangling without their closing counterpart, as in:
<todos>
<todo>Learn XML</todo>
<todo>Fix the mistake at this line
<todo>Explain how Scalpel will handle this</todo>
</todos>
The mistake should be rather easy to spot. In this example, tagToVector
would return this vector:
[ TagInfo (TagOpen "todos" []) (Just "todos") 0 (Just 9)
, TagInfo (TagOpen "todo" []), (Just "todo") 1 (Just 2)
, TagInfo (TagText "Learn XML") Nothing 2
, TagInfo (TagClose "todo") (Just "todo") 3 Nothing
, TagInfo (TagOpen "todo" (Just "todo") 4 Nothing
, TagInfo (TagText "Fix the mistake at this line") Nothing 5 Nothing
, TagInfo (TagOpen "todo" []) (Just "todo") 6 (Just 2)
, TagInfo (TagText "Explain how Scalpel will handle this") Nothing 7 Nothing
, TagInfo (TagClose "todo" []) (Just "todo") 8, Nothing
, TagInfo {TagClose "todos" []) (Just "todos") 9 Nothing]
Which leads us to a TagOpen
with Nothing
as its offset since there was no associated closing tag. forestWithin
will return the following forest:
[Node (Span 0 9) [
Node (Span 1 3) []
, Node (Span 4 4) []
, Node (Span 6 8) []
]
]
Our problematic node is the Node (Span 4 4)
. See the way closeIndex
is defined:
closeIndex = lo + fromMaybe 0 (infoOffset info)
This makes sure that a TagOpen
with its offset set as Nothing
will get a Span
set with the same bounds. In other words : we cannot decide where the tag will close, so we’ll just make it a meaningless span, it only contains itself. This probably mean we’re losing the ability to access the text it should have contained but well, the author of this XML shoud have been more careful.
The fixup
function won’t help for this case. It will kind of help in another, classic XML mistake though, as we are going to see.
A second example of malformed XML
What about this one:
<todos>
<todo>Learn XML</todo>
<todo>Fix the mistake at this line
<todo>Explain how Scalpel will handle this</todo>
</todos></todo>
Ah, the infamous children tag that gets closed after its parent. Not only in this ugly, it would really mess up our structure.
[ TagInfo (TagOpen "todos" []) (Just "todos") 0 (Just 9)
, TagInfo (TagOpen "todo" []), (Just "todo") 1 (Just 2)
, TagInfo (TagText "Learn XML") Nothing 2
, TagInfo (TagClose "todo") (Just "todo") 3 Nothing
, TagInfo (TagOpen "todo" (Just "todo") 4 Nothing
, TagInfo (TagText "Fix the mistake at this line") Nothing 5 (Just 6)
, TagInfo (TagOpen "todo" []) (Just "todo") 6 (Just 2)
, TagInfo (TagText "Explain how Scalpel will handle this") Nothing 7 Nothing
, TagInfo (TagClose "todo" []) (Just "todo") 8, Nothing
, TagInfo {TagClose "todos" []) (Just "todos") 9 Nothing
, TagInfo (TagClose "todo" []) (Just "todo") 10 Nothing
Which would produce the following Forest
:
[Node (Span 0 9) [
Node (Span 1 3) []
, Node (Span 4 10) [
Node (Span 6 8) []
]
]
]
Oh dear. This will not do. The third “todo” is registered as a child of the second one. Careful examination of this result shows that the span of our second node (Span 4 10
) gets outside of its parent (Span 0 9
). This gets us a way of identifying malformed tags. This is exactly what fixup
is going to use to see what parts of the tree need to be fixed.
Fixing it up
I find fixup
particularly difficult to read, because you constantly have to “jump” to find definitions. And caution: massive recursion incoming.
fixup (Tree.Node (Span lo hi) subForest : siblings)
= Tree.Node (Span lo hi) ok : bad
We start by immediately rebuilding from the branch we’re considering : subForest and siblings will be rewritten by ok
and bad
. Once again, beware while reading the signature !
fixup (Tree.Node (Span lo hi) subForest : siblings)
It’s the same potential confusion I warned you about when we studied forestWithin
. As the label indicates, we are not deconstructing the subForest
, we are deconstructing the containing Forest
. So, in our previous example, the first subForest
will be all the children of our root node, and the siblings
bit is nothing but our spline, []
.
You can use additionnal parenthesis to make this more obvious:
fixup ((Tree.Node (Span lo hi) subForest) : siblings)
And the same way we deconstructed a Forest
, we’ll rebuild one with ok
for our current Node, and bad
for the siblings. What are ok
and bad
?
(ok, bad) = malformed (fixup siblings) $ fixup subForest
We will call malformed
, a function that takes two TagForest
, with the recursively applied fixup
on them. This recursions should let us go as deep in the tree as possible. Now for the malformed
, where Real Stuff Happens.
malformed :: TagForest -- Forest to prepend bad trees on.
-> TagForest -- Remaining trees to examine.
-> (TagForest, TagForest)
malformed preBad [] = ([], preBad)
malformed preBad (n@(Tree.Node (Span _ nHi) _) : ns)
| hi < nHi = (ok, n : bad)
| otherwise = (n : ok, bad)
where (ok, bad) = malformed preBad ns
Once again, we are going to need to evaluate “manually” this to get how it works - this is the issue with manual recursions… abstractions are not here to guide you to understand what is happening.
Fixup in action
So, we enter fixup, with our root Node. lo
is 0, hi
is 9. We evalute (ok, bad)
:
malformed (fixup siblings) $ fixup subForest
Since we’re in the root node, fixup siblings
will evaluate to []
. What about fixup subForest
though ?
First recursion
Stack up ! We need to call fixup on the subForest
. This call fixup
with our first child, Node (Span 1 3) []
. So, back to fixup
, lo
is 1, hi
is 3. subForest
is empty because this Node
has no children. siblings
contains the rest of the children of the root node, namely the malformed little brother of our current node. We evaluate (ok, bad)
. fixup subForest
will evaluate to []
; we still need to evaluate fixup siblings
.
Second recursion
Stack up again ! We’re entering fixup
with the malformed Node. lo
is now 4, hi
is 10, subForest
contains the Node that should not be here, and there are no siblings
anymore. We evaluate ok, bad
. We know that fixup siblings
will evaluate to []
. But we need to fixup subForest
again.
Third recursion
Stack up !, but I swear, it’s the last time. We receive the third item on our todo-list improperly identified as the child of the previous one. lo
is now 6, hi
is 8. siblings
is empty, ̀subforestis empty. We evaluate
(ok, bad). Since both
subForestand
siblingsare empty, we can FINALLY stop to recurse and enter
malformed`. We’ll match the first pattern:
malformed preBad [] = ([], preBad)
preBad
is the result of fixup siblings
. So it’s []
. (ok, bad)
will finally evaluate ([], []). This fixup
call returns (Node (Span 6 8) []):[]
.
Back to the second recursion
Stack down ! Where were we ? Ah yes, at the Node (Span 4 10)
, the tricky one. We wanted to evalute fixup subForest
and fixup siblings
. We have no siblings, and subForest
returned, as we just saw (Node (Span 6 8) []):[]
. We can evaluate malformed, and enter the second pattern: we HAVE the result of fixup subForest
.
malformed preBad (n@(Tree.Node (Span _ nHi) _) : ns)
| hi < nHi = (ok, n : bad)
| otherwise = (n : ok, bad)
where (ok, bad) = malformed preBad ns
Curses ! More recursion. Let us first understand our parameters:
preBad
is the result of our previousfixup siblings
, and we have no more siblings, so it is[]
.n
is the first child of our subforest, actually our only child here.nHi
should be8
.ns i
[]
Since our current hi
is 10, we should enter the otherwise
guard. Which means we need, sadly, to evaluate the overwritten values of ok
and bad
in our where
block (sidenote: don’t do this, please; don’t overwrite labels, you’re making everyone sad). Fortunately, since preBad is []
, and ns is []
, it is pretty easy to guess what the recursive malformed will return: ([], [])
. Our initial malformed
will then return (Node (Span 6 8): [], [])
. Which means that the third recursion of fixup
can now return : Node (Span 4 10) [Node (6 8) []]:[]
. So far, it’s not very different from our very first input value. Bear with me.
Back to the first recursion
Stack down ! We were evaluating malformed
. We had the fixup subForest
, it was []
. We needed fixup siblings
, that’s the return of the third recursion.
Back to malformed
, but we fall back on the first pattern, our second parameter is simply an empty list. So, this fixup
call will return this Forest
:
(Node (Span 1 3) []):(Node (Span 4 10) [Node (Span 6 8) []]):[]
Back to the initial call
So. We were, as usual, trying to evaluate a malformed
. We had no siblings
since we were in our root, so we had []
as a first parameter. And our subForest
is the result of our previous recursive calls. We enter malformed
.
What are our initial parameters ?
preBad
is[]
.n
is the first item of thesubForest
, the(Node (Span 1 3) [])
.meaning
nHi
is 3.our current
hi
, defined at the top level, was 9.ns
is the rest of our children.
Buckle up, we’re going to do some magic. Is hi
lower than nHi
? Nope. 9 is greater than 3. So we know we will return at the very least (Node (Span 1 3))
. Prepended to… something depending on the result of (ok, bad)
. But to know what (ok, bad)
is, we need to evaluate it. Boom, recursing all over again, this time with these parameters:
preBad
is still[]
.n
is the current item of thesubForest
, the(Node (Span 4 10) [Node (Span 6 8) []])
.meaning
nHi
is now 10.our current
hi
, defined at the top level, doesn’t change, it’s 9.ns
is the rest of our children, this time[]
.
And finally, finally, we meet this first guard
; the one where the world has been broken, the one where we have a high bound superior to our parent high bound, where nHi
(10) is above hi
(9). We need to know what the current ok
and bad
will be : and since ns
is []
, it will once again be ([], [])
. So we return :
([], (Node (Span 4 10) [Node (Span 6 8) []]):[])
Tired ? Come on, we’re almost done ! We were in the otherwise
case of the previous recursion, dealing with our first child Node
. And we needed to prepend this first Node
to ok
, and return, in the rest of our pair
, the bad
. Well, ok
is []
, the first item of the pair we’ve just returned. And bad
is our Node. So, for our final result !, we will have the following Forest
:
[ Node (Span 0 9) [Node (Span 1 3) []]
, Node (Span 4 10) [Node (Span 6 8) []]]
Which would be the stupid following XML:
<todos>
<todo>Learn XML</todo>
</todos>
<todo>Fix the mistake at this line
<todo>Explain how Scalpel will handle this</todo>
</todo>
ALL THAT HARD WORK FOR THIS ?! Yup. It’s not great, is it ? But at the very least, the Tree
makes more sense. Consider the fact that with a malformed XML, it is fairly difficult to know “where” nodes should be. Was the node mistakenly “inside” its parent ? Or is it just that the closing tag was misplaced ? There is no perfect solution to this problem, because there many different possible mistakes. But the advantage of this solution is that it preserves a sane tree-like structure, with containers “bigger” than their contained set.
I strongly encourage you to reread the code if you’ve survived this. It’s actually a clever way to “move out” a subTree
as sibling of a Forest
. How sane people come up with recursions tricks like these, I have no idea.
How selection works
Well, you now know the most important stuff there are to know about the inner structures of Scalpel
. Every Scraper
will work with this base structure:
A vector of tags
A tree that only gives the span of each tag.
And a thing we did not comment yet, which is the
SelectContext
, an index defaulting to 0.
Put this in a tuple of three element and you get the famous TagSpec
that will serve as the input for all scraping computation we’ve seen in our previous chapter.
Now to see how is it used. We’ve seen previously that Scalpel revolves around the notion of Selector
, which is nothing but a datatype encapsulating “what you are looking for”. You run a Selector
over the select
function. I call it “the heart of Scalpel”, because Scalpel is but a gigantic wrapper to get stuff out of this function.
As I wrote when I gave a brief overview of Scalpel, Selectors are defined in the internal/Select/Types module of scalpel-core.
Selector
is a newtype
around a list of SelectNode
, a datatype defined as:
data SelectNode = SelectNode !T.Text [AttributePredicate]
| SelectAny [AttributePredicate]
It should be rather easy to understand, but just in case : the SelectNode
constructor requires a tag name, and a list of predicates (expected classes or attributes). The SelectAny
don’t care about tag names. The fact that the newtype Selector
amounts to a list is what allow us to have the neat :
"div" // "p" // "a"
syntax, which would be rendered as a :
MkSelector [ SelectNode "div" []
, SelectNode "p" []
, SelectNode "a" []]
With this in mind, we can finally briefly read select
and its bigger brother selectNode
.
The select method
select :: (Ord str, TagSoup.StringLike str)
=> Selector -> TagSpec str -> [TagSpec str]
select s tagSpec = newSpecs
where
(MkSelector nodes) = s
newSpecs = zipWith applyPosition [0..] (selectNodes nodes tagSpec [])
applyPosition p (tags, f, ctx) = (tags, f, SelectContext p)
The whole select process can be seen as a read-only Lens
. In other words: it focuses on a “small subpart” (well, “small subparts”, in this instance) of a “big part”. This is made clear by its signature : we have one tagSpec
(typically, the one encompassing all of our document). We have a set of navigating instructions (the Selector
). We shall return a list of destination in our initial TagSpec
, “possible destinations for the given instructions”.
Which is why the definition of select
is simply to “rebuild” a list of TagSpec
objects. Look at the newSpecs
definition; it uses selectNodes
, a Depth First Search implementation over TagSpec
that we will study in a minute, and zips an index over its result. This index allows us to fill the famous last element of the TagSpec
tuple. It’s only used by the position
primitive offered by Scalpel. It lets you know the order in which it found elements. Note that most Scalpel primitives build a Scraper
using select
. You then simply extract the information you’re after (be it text, html, a given attribute) over the result. The singular primitives (text
, html
, etc.) only apply to the first result of select
. The pluralized forms (texts
, htmls
, etc.) will be mapped over every result.
Now for the final funtion we shall study : the selectNodes
.
Scalpel’s version of Depth First Search
You probably know there are two basic algorithm when you’re looking for something in a tree: Depth First Search (DFS for short) and Breadth First Search (BFS). The first one goes from branches to branches, the second one from depth levels to depth levels.
There are, of course, Haskell libraries that cover this, notably the famous FGL, a.k.a Functional Graphs Library. However, since Scalpel uses its own mix of Tree and Vector, using an external tool would probably not have been very convenient.
Now summon whatever courage you have left, because we have yet another BIG function to read. Here it is, without its helpful comments:
selectNodes :: TagSoup.StringLike str
=> [SelectNode] -> TagSpec str -> [TagSpec str] -> [TagSpec str]
selectNodes [] _ acc = acc
selectNodes [_] (_, [], _) acc = acc
selectNodes [n] (tags, f : fs, ctx) acc
| nodeMatches n info = (shrunkSpec :)
$ selectNodes [n] (tags, fs, ctx)
$ selectNodes [n] (tags, Tree.subForest f, ctx) acc
| otherwise = selectNodes [n] (tags, fs, ctx)
$ selectNodes [n] (tags, Tree.subForest f, ctx) acc
where
Span lo hi = Tree.rootLabel f
shrunkSpec = (
Vector.slice lo (hi - lo + 1) tags
, [fmap recenter f]
, ctx
)
recenter (Span nLo nHi) = Span (nLo - lo) (nHi - lo)
info = tags Vector.! lo
selectNodes (_ : _) (_, [], _) acc = acc
selectNodes (n : ns) (tags, f : fs, ctx) acc
| nodeMatches n info = selectNodes ns (tags, Tree.subForest f, ctx)
$ selectNodes (n : ns) (tags, fs, ctx) acc
| otherwise = selectNodes (n : ns) (tags, Tree.subForest f, ctx)
$ selectNodes (n : ns) (tags, fs, ctx) acc
where
Span lo _ = Tree.rootLabel f
info = tags Vector.! lo
OK, we can easily identify the main features of this function. Our first parameter is the list of selectors, our second parameter is the subpart of the TagSpec
we are currently traversing. And finally, we have an acc
parameter, which is typical of a manual recursions that build an accumulator; the two first patterns, namely the stop cases, will return this acc
.
The trick here is that there are actually two level of recursions. We are recursing over our selectors AND recursing over the tree. Let us first study the recursion over the selectors.
Recursion over selectors
So, say you want to identify every <a>
tags contained inside a <div>
tag. We first need to keep only the <div>
tags. We would typically start in the final pattern :
selectNodes (n : ns) (tags, f : fs, ctx) acc
| nodeMatches n info = selectNodes ns (tags, Tree.subForest f, ctx)
$ selectNodes (n : ns) (tags, fs, ctx) acc
| otherwise = selectNodes (n : ns) (tags, Tree.subForest f, ctx)
$ selectNodes (n : ns) (tags, fs, ctx) acc
where
Span lo _ = Tree.rootLabel f
info = tags Vector.! lo
Lots of things are being deconstructed here (I call this kind of patterns a “Derrida paradise”):
The list of selectors. We pick
n
, the first one.The
tagSpec
tuple: we isolatetags
, theVector
that contains all the tags, theForest
of spans, the context.The tree itself gets deconstructed; we pick the first item out of the forest,
f
, isolated from its siblings,fs
.
The where
block contains a mechanism to extract the Tag
we need to inspect. rootLabel
from Data.Tree extracts the value stored - so a Span
. We only care about the lower
bound of the Span
, because it tells us “where the TagOpen
bit is in our TagVector
; info
is this very Tag
, selected through the index-access .!
function over Vector
.
The nodeMatches
function will check for us if the Selector
n match.
If nodeMatches
evalutes to False
, there is this odd looking call to selectNodes
over the subforest, applied to selectNodes
over the siblings. It means that we will use the result of the function over the siblings as the accumulator for the result of the function over the forest. Note that we will explore the subForest
and the siblings : Scalpel always look for tags however deep they might be.
If there IS a Match, we also enter the subForest
, but this time with the remaining selectors. So, to come back to the example I gave at the beginning, if we were looking for "div" // "a"
, we found a div
, now we want to see if it contains an a
- anywhere among children, grand-children, an so on. That would let us enter the other pattern. (We will, of course, also continue recursing over siblings, with the full list of selectors).
Recursion over Tree
And finally, we arrive to a lone selector over a part of a TagSpec
and enter this pattern:
selectNodes [n] (tags, f : fs, ctx) acc
| nodeMatches n info = (shrunkSpec :)
$ selectNodes [n] (tags, fs, ctx)
$ selectNodes [n] (tags, Tree.subForest f, ctx) acc
| otherwise = selectNodes [n] (tags, fs, ctx)
$ selectNodes [n] (tags, Tree.subForest f, ctx) acc
where
Span lo hi = Tree.rootLabel f
shrunkSpec = (
Vector.slice lo (hi - lo + 1) tags
, [fmap recenter f]
, ctx
)
recenter (Span nLo nHi) = Span (nLo - lo) (nHi - lo)
info = tags Vector.! lo
The only difference here (baaah, this code is not very DRY) is the matching case. If we match, we will recurse after appending the result of shrunkSpec
, i.e. the “smaller part of the bigger part” that we want to retain. But what is “a smaller part” of a tuple that contains a tag vector, a list of trees and an arbitrary index ? Well, it’s a smaller vector, an updated forest and as for the index, it will be handled by select
as we saw earlier.
Let us start with the smaller vector : we cut it from its lower bound to the length of the span, plus one because vectors indexes from 0. So we only take the opening tag, what it encompasses, and its closing tag. As for the Tree
, we put it in a List
since we have to carry over a Forest
, but we map over each Span
inside to make them “local”. Let’s clarify this : all our Spans
contains indexes from the initial vector, so we need to recompute them, by substracting the index of the opening tag to every span or our indexes won’t match our new Vector
. And we’re done. All these smaller TagSpec
will be accumulated over by recursion and returned.
I won’t add any comment to this function, because this post is already incredibly long. Furthemore, I suspect there is a bug in this implementation. I might explain it in a next post - an ultimate, and I hope shorter !, post about Scalpel, because after that, we’ll try to read some other libraries.
What did we learn ?
Not that much, but still:
At the very least we learned not to be afraid by these massive chunks of codes.
We’ve seen that behind the handy API we’ve studied previously lies an internal datatype and how it worked.
This has also been an occasion to see how basic Haskell types and libraries are powerful enough to build data structures with very good read-access performances:
Data.Map
andData.Vector
are common and easy to understand.We’ve also seen that building your custom, performant, datatypes comes with a cost: it’s harder to use common recursive abstrations, though as we saw we’ve our own rewriting of
tagsToVector
, it’s probably almost achievable.Finally, we’ve seen long examples of pure, often recursive code, that should give you good examples of “how to achieve things” in Haskell, without needing too many fancy stuff or scary concepts, proving that even without the weapon of mass abstractions (Comonads ! Bifunctors ! FRP ! What have you !) we already have considerable power.