Scalpel's Select

Posted on April 20, 2017

Intro

Last time, we reviewed the main abstraction Scalpel lets you use, simply by implementing some common Haskell typeclasses. However, all these things are mostly composition, depending more or less on delayed evaluation. In other words, we’re yet to see any actual “stuff being done”, we’ve mostly seen “how we will compose the computed stuff”.

Today, we face the music and dance. I won’t lie : this is going to be a long and often quite boring post.

Let’s go back to the scalpel-core folder, and open the aptly named Select.hs.

A quick look-through will let you discover two rather intimidating functions, tagsToVector and the smaller but still rather terrifying vectorToTree . We’ll mostly work on these two beasts today, and study how they are used.

However, our cursory reading should also have noted:

These two are, to a point, the heart of this library. To give you a general idea of how things work : tagsToVector and vectorToTree build an internal data structure the allow fast access to the XML document you want to parse. select and selectNodes leverage this structure to focus on subtrees in the document.

A quick look upstream

You might remember we briefly quoted the main way to manipulate scrappers, internally at least, the scrape function from Scrape.hs.

scrape :: (Ord str, TagSoup.StringLike str)
       => Scraper str a -> [TagSoup.Tag str] -> Maybe a
scrape s = scrapeTagSpec s . tagsToSpec . TagSoup.canonicalizeTags

Remember, our goal is to provide a TagSpec str datatype that will eventually be the input for our scrapers. The second parameter, [TagSoup.Tag str] is the produce of the TagSoup library, namely a view of an XML document given as a list of string-like things. This parameter is eta-reduced in scrape, but we can easily see that there will be a normalization step (canonicalizeTags simply lowercases tags and make sure DOCTYPE instructions are uppercase); and we will then turn them to this mysterious TagSpec datatype.

The tagsToSpec function is defined in Select.hs, and this one is easy to read:

-- | Creates a TagSpec from a list of tags parsed by TagSoup.
tagsToSpec :: forall str. (Ord str, TagSoup.StringLike str)
           => [TagSoup.Tag str] -> TagSpec str
tagsToSpec tags = (vector, tree, ctx)
    where
        vector = tagsToVector tags
        tree   = vectorToTree vector
        ctx    = SelectContext 0

We can easily guess from this that TagSpec is simply a type synony for a ternary tuple. The actual alias is defined a few lines before:

-- | A structured representation of the parsed tags that provides fast element
-- look up via a vector of tags, and fast traversal via a rose tree of tags.
type TagSpec str = (TagVector str, TagForest, SelectContext)

Thank you !, Will Coster (the author behind Scalpel), for commenting your code. It lets us understand that this structure is here to provide “fast element look up”, which makes sense, since this library is mostly there to look stuff up in an XML.

So, even though we don’t really know right now what TagVector, TagForest or SelectContext are, we have a basic idea of why the scary tagsToVector and vectorToTree functions exist. And we can easily imagine that most Scalpel’s primitive will one way or the other make use of this structure.

A word on TagSoup model

Since most of what we’re going to see is going to make heavy use of TagSoup, it might be a good idea to study, not the implementation this time !, but at the very least the basic types and features of TagSoup.

TagSoup mostly transforms an XML document into a series of “instructions”, the various constructors of the Tag type. The Tag type is actually parametric over a str type, and it expects a StringLike to be able to handle the various ways of encoding string-like things in Haskell, but in the rest of this post, I will rather outrageously ignore this, so don’t be surprised if I sometimes ellide the str bit.

This rather crude XML:

<students>
  <student house="Gryffindor">Harry Potter</student>
  <student house="Ravenclaw">Cho Chang</student>
  <student house="Slytherin">Lucius Malfoy</student>
  <student house="Hufflepuff">Heidi Macavoy</student>
</students>

… given as input to the basic TagSoup function parseTags, becomes:

[ TagOpen "students" [], TagOpen "student" [("house", "Gryffindor")
, TagText "Harry Potter", TagClose "student", ...]

I did not write the whole list, but you get the gist. Two interesting things:

There is a way to get a Tree of Tag, through TagSoup, using its Tree module, but the author of TagSoup warns that the current implementation is likely to change. Which is why Scalpel rolls its own structure, optimized for quick access as we’re going to see.

Let us keep in mind that we’re going to process a list of tag-related instructions, and move forward.

An Indexed Tag List through tagsToVector

Warning : this function is actually commented, but I’ll ignore these comments. The idea of this blog is to train our haskell-reading abilities. We should be able to read this even without comments.

Since the first step in transforming the list of tags into our TagSpec internal structure is to build a TagVector through the tagsToVector function, we might as well start with this. First, let us look at its whole definition. (Don’t worry, we’ll decompose it in smaller bits as we explain it !)

tagsToVector :: forall str. (Ord str, TagSoup.StringLike str)
             => [TagSoup.Tag str] -> TagVector str
tagsToVector tags = let indexed  = zip tags [0..]
                        total    = length indexed
                        unsorted = go indexed Map.empty
                        emptyVec = Vector.replicate total undefined
                     in emptyVec Vector.// unsorted
    where
        go :: [(TagSoup.Tag str, Index)]
           -> Map.Map T.Text [(TagSoup.Tag str, Index)]
           -> [(Index, TagInfo str)]
        go [] state = map (\(t, i) -> (i, TagInfo t (maybeName t) i Nothing))
                                    $ concat
                                    $ Map.elems state
            where
                maybeName t | TagSoup.isTagOpen t  = Just $ getTagName t
                            | TagSoup.isTagClose t = Just $ getTagName t
                            | otherwise            = Nothing
        go (x@(tag, index) : xs) state
            | TagSoup.isTagClose tag =
                let maybeOpen = head <$> Map.lookup tagName state
                    state'    = Map.alter popTag tagName state
                    info      = TagInfo tag (Just tagName) index Nothing
                    res       = catMaybes [
                                  Just (index, info)
                              ,   calcOffset <$> maybeOpen
                              ]
                 in res ++ go xs state'
            | TagSoup.isTagOpen tag  = go xs (Map.alter appendTag tagName state)
            | otherwise              =
                let info = TagInfo tag Nothing index Nothing
                in (index, info) : go xs state
            where
                tagName = getTagName tag

                appendTag :: Maybe [(TagSoup.Tag str, Index)]
                          -> Maybe [(TagSoup.Tag str, Index)]
                appendTag m = (x :) <$> (m <|> Just [])

                calcOffset :: (TagSoup.Tag str, Int) -> (Index, TagInfo str)
                calcOffset (t, i) =
                    let offset = index - i
                        info   = TagInfo t (Just tagName) i (Just offset)
                     in offset `seq` (i, info)

                popTag :: Maybe [a] -> Maybe [a]
                popTag (Just (_ : y : xs)) = let s = y : xs in s `seq` Just s
                popTag _                   = Nothing

Faced with this wall of code, we should probably start by reading the signature.

We take the result from TagSoup, we shall return a TagVector. TagVector is simply a specialization of the Vector class from the Vector library. Vector is a list-like structure, but with easier access to indexes; whereas accessing the nth element of a default List is O(n), access through indexes are O(1) for vectors. So they behave like the arrays in other language. The TagVector does not exactly contains Tag from TagSoup, but rather TagInfo, an intermediary type defined in Select.hs. We’ll understand this type as we read the rest of the code.

Initialization step

OK, our mission is to return a bunch of easy-to-access TagInfo. Let us look at the main function itself.

tagsToVector tags = let indexed  = zip tags [0..]
                        total    = length indexed
                        unsorted = go indexed Map.empty
                        emptyVec = Vector.replicate total undefined
                     in emptyVec Vector.// unsorted

When I find code like this, I jump to the in part first; the element listed in the let block should be the part composed during the in part. A quick check to the (//) function defined by Vector let us understand that it’s “bulk update” for vectors. We have a vector (on the left of the // sign), a list of tuple containing index and new values (on the right), we obtain a vector updated for each indexes listed.

Let us try this in the interpreter (there are several ways of creating a vector, we’ll use the replicate version of vectors - it fills an array with the same value).

> import qualified Data.Vector as V
> let hellos = V.replicate 3 "Hello"
> hellos
["Hello, "Hello", "Hello]
> :t hellos
hello :: Vector [Char]
> hellos V.// [(1, " "), (2, "world !")]
["Hello", " ", "world !"]

When printed in ghci, vectors looks suspiciously like lists, but checking their types show that they’re indeed vectors (trying to use list-functions from the Prelude won’t work).

OK ! So we’re bulk updating a vector. What vector are we bulk updating ? emptyVec ! Let’s look at its definition: it’s Vector.replicate that we’ve just used in ghci. It will replicate total times, total being the length of our input list. And we fill it with undefined. So, oddly enough, we build an array full of uncomputable things, that we will bulk-update later, hopefully completely. Note that this won’t break when running, because our vector of undefined will never be evaluated.

So, all we need now is the list of updates, namely a list of pairs of indexes and values that will replace all these undefined. The rest is a common way of indexing a list, indexed that will just zip a list of Tag into pairs. This will produce the [(Tag, Index)] where Index is a simple alias for Int.

We feed this list of pairs to a go subfunction - with a classic Map from Data.Map. Only it’s an empty map, so for now we dont know much about this.

Unless you’ve never seen any “real Haskell” code in your life, you should be familiar with this “go-function” pattern, but just in case you’ve never seen it : the idea is to make sort of a “setup step” in your main function, and delegate the hard work to a go subfunction. This is particularly often the case when there will be some recursion implied.

OK, we have a clearer idea of what this function does: we will build something through “go”, a list of indexes and values that we will use to replace our empty vector, to achiever our goal : an array of quickly accessible tags.

Time to read go. Let’s start with its signature, because Types Are Our Friends.

go :: [(TagSoup.Tag str, Index)]
    -> Map.Map T.Text [(TagSoup.Tag str, Index)]
    -> [(Index, TagInfo str)]

OK, we have the list of pairs we’ve just talked about and a Map of Text to (Tag, Index), and we shall return a [(Index, TagInfo)] - what we need for the bulk update. The only new thing we learned is the exact signature of the Map we’ll work with - we did not know much about it, since we’ve passed an empty one.

I don’t often read recursive functions, but when I do, I typically start by reading the termination case. Unfortunately, here, it would be rather confusing, so we’ll read “in the order it would typically be executed”.

Figuring out the main case

Most calls will be handled by this bit :

go (x@(tag, index) : xs) state
    | TagSoup.isTagClose tag =
        let maybeOpen = head <$> Map.lookup tagName state
            state'    = Map.alter popTag tagName state
            info      = TagInfo tag (Just tagName) index Nothing
            res       = catMaybes [
                          Just (index, info)
                      ,   calcOffset <$> maybeOpen
                      ]
          in res ++ go xs state'
    | TagSoup.isTagOpen tag  = go xs (Map.alter appendTag tagName state)
    | otherwise              =
        let info = TagInfo tag Nothing index Nothing
        in (index, info) : go xs state

As the guards show, we have to handle tag opening, tag closing, and text in between. It might be a good idea to read this “in the order it is going to be parsed”. Typically, you’d have something like :

[TagOpen "MyTag", TagText "SomeContent", TagClose "MyTag"]

Well, to be closer to our real input type we would have:

[(TagOpen "MyTag", 0), (TagText "SomeContent", 1), (TagClose "MyTag", 2)]

Though, of course, this might be much more complex when tags contain other tags - but with this default, simplistic example in mind, we should start with a TagOpen. First, we’ll make a stylistic digression.

On style

Rather than using pattern-matching, the author decided to use guards plus some functions from TagSoup that lets you test against various Tag constructor. I have a feeling this is not very elegant, but there might some good reasons for this. This is nagging me though, because the author used the intermediary syntactic solution, Pattern Guard in the Applicative and Monad implementation for Scraper so I don’t really understand this rather unpleasant way of writing things here.

While I’m shamelessly criticizing code I’m utterly unable to write myself, I’d also like to note that manual recursion like this one if best avoided. Also, rather than an accumulator like the one we’re going to observe, things would be clearer had the author used MonadState. I suppose the author did not because of (a) potential performance issues and (b) it would add mtl to the dependencies.

Finally, mixing where and let is often frowned upon; to put it in a nutshell, what we are going to read in this post is not the most idiomatic Haskell ever.

End of the digression. Let’s move on.

At tag opening

Let us start with the TagOpen, the second guard, which will evaluate to:

go xs (Map.alter appendTag tagName state)

We simply recurse on the rest of the list of tags (xs) without prepending anything to the list we intend to return at the end of our recursive calls.

Though, we will modify the state parameter before recursing. Map.alter is a smart update : it takes a Maybe a -> Maybe a function, a key and a Map. It will return the result of the function “over” the key, even if there is no current value with this key. Stupid example below : we’ll use a function that will always return “Just 1” and the key “hello” to an empty Map.

> import qualified Data.Map as M
> M.alter (const Just 1) "Hello" M.empty 
fromList [("Hello", 1)]

So. The function used for our Map.alter is appendTag; though the name shoud be rather transparent name, we’ll check its definition.

appendTag :: Maybe [(TagSoup.Tag str, Index)]
          -> Maybe [(TagSoup.Tag str, Index)]
appendTag m = (x :) <$> (m <|> Just [])

OK, this is not as scary as it might look like. Remember we will take as input the result of a lookup in our Map - which will most certainbly be Nothing when we handle the very first opened Tag, considering that we start with an empty Map.

We fmap over (:), which is another way of getting rid of any Maybe. x, here, is the tag and its index (we are in a where block so we keep the scope of the caller). We will append it to m (the result of the lookup) if it exists, or Just [] if there was no value attached to this key.

In other words : the state maintains a list of tag definition and indexes for each tag name. When we lookup a tag name, it gives us the list of every encountered opened tags in reverse order of appearance: the first item of the list will always be the “last encountered opening tag”.

Tag content

Unless we are handling empty tags, after opening a tag, we should evaluate the otherwise guard of go. It’s rather short:

let info = TagInfo tag Nothing index Nothing
in (index, info) : go xs state

This is simple : we don’t touch our state; we build a TagInfo, and we will prepend this tag and its index to the result of a new recursive call over the rest of the list of tags we’re processing.

We still don’t know a lot about TagInfo. We know it will at very least contain Tag data itself and its index.

Now that we have opened a tag and processed its content, we should meet a EndTag.

At tag closing

OK, this one is not as simple as the last one…

let maybeOpen = head <$> Map.lookup tagName state
    state'    = Map.alter popTag tagName state
    info      = TagInfo tag (Just tagName) index Nothing
    res       = catMaybes [
                  Just (index, info)
              ,   calcOffset <$> maybeOpen
              ]
  in res ++ go xs state'

Reading the in block show us we are going to add several things to the end result of our recursion (hence the (++) and not a (:)); also, you note we’re going to modify our state.

Let’s read the let statement in their order of writing:

So, we have two helper functions to check up : popTag and calcOffset.

popTag :: Maybe [a] -> Maybe [a]
popTag (Just (_ : y : xs)) = let s = y : xs in s `seq` Just s
popTag _                   = Nothing

Remember this is called with alter, and it is most likely here to remove the corresponding opening tag we’ve added before meeting its closing partner. Once again we see what the state is used for: to handle the tag hierarchy and keep track of currently opened tags. popTag will remove the first occurrence of the list of (tag, index) values mapped for this tag name.

I am going to cheat here and ignore seq the same way I ignored most things associated to strictness in this module, which tends to be as strict as possible (by using bang notations and importing the strict version of Data.Map). This is a wide topic, but to keep things short, let’s just say it’s a way to disregard Haskell’s normally lazy behaviour, mostly to prevent unncessary overhead on heap and improve performance. The seq is a way of asking for immediate evaluation. The “normal”, lazy version whould be:

popTag (Just (_ : y : xs)) = let s = y : xs in Just s

… or, of course, even simpler…

popTag (Just (_ : y : xs)) = Just (y:xs)

For those who care about this kind of things: using Scalpel’s benchmark, I could indeed see this stictness improves performances (without the seq, on my machine, average worst case scenario goes from 3.2ms to 3.5ms).

Note that the patterns are cleverly written: we only handle the case where there is at least one element left in our list (xs would then be []). In any other case, we simply return Nothing, because if our list contained only one Tag left, it’s the one we wanted to pop out anyway.

And finally the calcOffset helper function:

calcOffset :: (TagSoup.Tag str, Int) -> (Index, TagInfo str)
calcOffset (t, i) =
    let offset = index - i
        info   = TagInfo t (Just tagName) i (Just offset)
      in offset `seq` (i, info)

This one is made rather painful to read because of scope and context. Remember: the pair we pass as parameter is the opening tag (t) and its index (i). The index references the index extracted in go during pattern-matching, so its the closing tag index. We then build a complete TagInfo, containing the Tag data, its name, the index it started at, and “how many indexes” it spans over. We see another use of seq, and we will bravely ignore it once again.

This time we have a complete idea of what TagInfo is for. It lets us track “how many tags away” is the closing tag of an opened tag - a piece of information that will prove precious when we’re trying to build a tree.

The Go terminal case

When the list is empty, this final pattern will be avaluated:

        go [] state = map (\(t, i) -> (i, TagInfo t (maybeName t) i Nothing))
                                    $ concat
                                    $ Map.elems state
            where
                maybeName t | TagSoup.isTagOpen t  = Just $ getTagName t
                            | TagSoup.isTagClose t = Just $ getTagName t
                            | otherwise            = Nothing

The same way I read the in first in a let ... in block, I generally keep the where for later: it contains “building blocks” for the main expression. And since we have a big line, I find it often better to read it “from right to left”, or in this case, “from bottom to top”. So, first, we Map.elems the state; the signature of Map.elems makes what it does rather obvious:

elems :: Map k a -> [a]

So this will get us every values stored in the map. Since individual values are list of pairs, we’ll get a list of list of pairs, namely [[(Tag, Index)]]. We immediately flatten this through concat, to get a [(Tag, Index)].

The maybeName implementation is rather surprising, since we should only have TagOpen stored in our state, but ok. And it’s boring, so I’ll ignore it.

Basically, we’ll pick every TagOpen left “dangling” in the state, in case we’re dealing with an invalid XML where someone forgot to close its tags. For each of these, we’ll make a TagInfo, keeping its original index, and without the possibility of setting an offset.

Putting it all together

Going into details like we did doesn’t help having a clear picture of what the function does. So a good way to process is to simply simulate manually the execution, though it’s sometimes a bit painful, but it’s probably the best way to be sure of what the function actually does.

So let us simulate its execution over a simple xml like this one:

<div>
  <div>
      <p>I'm deep.</p>
      <p>I'm deep too.</p>
  </div>
</div>

With tag soup, we shall get something like this:

[ TagOpen "div" [], TagOpen "div" [], TagOpen "p" [], TagText "I'm deep"
, TagClose "p" [], TagOpen "p" [], TagText "I'm deep too", TagClose "p" []
, TagClose "div" [], TagClose "div" []]

Remember that we will zip this with [0..] though, so we’ll get this list with each value paired with its indexes when we enter go.

At first evaluation of go, we alter our state - our Map now contains [(TagOpen "div", 0)] for the key “div”. We will evaluate directly the result of the next recursion over go.

Second evaluation of go will have us modify our state again, prepending the list associated to the key div. It should now be: [(TagOpen "div", 1), (TagOpen "div", 0)].

Third evaluation have us, once again, modify our state; we now put a value for a different key, p, with [(TagOpen "p", 2)].

It’s only at our fourth evaluation that we are actually going to start building a result for go. We return : TagInfo (TagText "I'm deep.") Nothing 3 Nothing, which will be prepended on the result of the next evaluation of go.

Fifth evaluation, we meet with the TagClose for p. We try to find the first item of the list associated to p in our state - and we pop it out. Since our state contained a list of only one element, there is no value anymore associated to p in our state. We build a small list :

[ (4, TagInfo (TagClose "p" []) (Just "p") 4 Nothing)
, (2, TagInfo  (TagOpen "p" []) (Just "p") 2 (Just 2))]

This is what the expression res evaluates to. In other words; we will only return TagOpen data when we meet their TagClose, or, if the XML is not well formed, at the very end of the recursion. We store the offset when returning a TagInfo about a TagOpen; so we know that the tag “p” that starts at index 2 will be close “2 indexes later”, at index 4. This res list will be concatenated to the result of our next recursions.

I won’t detail the next recursions, but we should know return something that looks like :

[ (3, TagInfo (TagText "I'm deep") Nothing 3 Nothing
  (4, TagInfo (TagClose "p" []) (Just "p") 4 Nothing)
, (2, TagInfo  (TagOpen "p" []) (Just "p") 2 (Just 2))
, (6, TagInfo (TagText "I'm deep too"") Nothing 6 Nothing
, (7, TagInfo (TagClose "p" []) (Just "p") 7 Nothing)
, (5, TagInfo (TagOpen "p" []) (Just "p") 5 (Just 2))
, (8, TagInfo (TagClose "div" []) (Just "div") 8 Nothing)
, (1, TagInfo (TagOpen "div" []) (Just "div") 1 (Just 7))
, (9, TagInfo (TagClose "div" []) (Just "div") 9 Nothing)
, (0, TagInfo (TagOpen "div" []) (Just "div") 0 (Just 9)) ]

A list of indexes and values. Since our XML was well formed, the final case won’t have anything to do (our state is empty, so Mep.elems will return an empty list), but return what was accumulated during recursion.

This list of indexes will then be inserted into our vector through the bulk update abilities of (Vector.//). And we can now access any item of the list through its index without problematic complexity.

The coolest thing here is that while processing the tagList, we can return our TagInfo in the order we want; they will be put in the proper order through the index when (Vector.//) does its magic.

A possible State rewrite

Just for kicks, here’s a potential rewrite using a StateMonad from mtl.


type CurrentTags str = Map.Map T.Text [(TagSoup.Tag str, Index)]
type CurrentTagsState str = State (CurrentTags str) [(Index, TagInfo str)]

-- ...
-- you'd call this go through: evalState (go indexed) Map.empty
go :: [(TagSoup.Tag str, Index)]
    -> CurrentTagsState str
go [] = fmap
    (map (\(t, i) -> (i, TagInfo t (Just $ getTagName t) i Nothing)) . concat . Map.elems)
    get
go (x@(tag, index) : xs)
    | TagSoup.isTagClose tag = do
        st <- get
        let opened = head <$> Map.lookup tagName st
        modify (Map.alter popTag tagName)
        let info = TagInfo tag (Just tagName) index Nothing
            res  = catMaybes [ Just (index, info), calcOffset <$> opened]
        fmap (res ++) (go xs)
    | TagSoup.isTagOpen tag  = modify (Map.alter appendTag tagName) >> go xs
    | otherwise              = fmap ((:) (index, TagInfo tag Nothing index Nothing)) (go xs)

I omitted the end where block - it doesn’t change. I’m pretty sure the TagClose can be done more elegantly. Performance wise, I’m very unsure of the result, I ran the benchmark and results were unconclusive (sometimes better, sometimes worse than the original version). But to be honest, I’m rather a noob at benchmarking, so I won’t hazard a guess.

While we’re at it, you can also completely rewrite this as a foldM over a State, if you delegate the “pour over what’s left in the State” part to another function, like this:

type CurrentTags str = Map.Map T.Text [(TagSoup.Tag str, Index)]
type IndexedTags str = [(Index, TagInfo str)]
type CurrentTagsState str = State (CurrentTags str) (IndexedTags str)

tagsToVector tags = let indexed  = zip tags [0..]
                        total    = length indexed
                        unsorted = evalState (foldM go [] indexed >>= final) Map.empty
                        emptyVec = Vector.replicate total undefined
                     in emptyVec Vector.// unsorted
    where
        final :: IndexedTags str -> CurrentTagsState str
        final ind = do finalState <- get
                       let fS = closed . concat . Map.elems $ finalState
                       pure $ ind ++ fS
          where
            closed :: [(TagSoup.Tag str, Index)] -> IndexedTags str
            closed = map (\(t, i) -> (i, TagInfo t (Just . getTagName $ t) i Nothing))
        go :: IndexedTags str -> (TagSoup.Tag str, Index) -> CurrentTagsState str
        go ind x@(tag, index)
            | TagSoup.isTagClose tag = do
                st <- get
                let opened = head <$> Map.lookup tagName st
                modify (Map.alter popTag tagName)
                let info = TagInfo tag (Just tagName) index Nothing
                    res  = catMaybes [ Just (index, info), calcOffset <$> opened]
                return $ res ++ ind
            | TagSoup.isTagOpen tag  = modify (Map.alter appendTag tagName) >> pure ind
            | otherwise              = pure $ (index, TagInfo tag Nothing index Nothing):ind

(Once again, the final where block doesn’t change, and my final implementation is ugly and could probably be improved). But I find the intention behind the code is much clearer with State and foldM.

VectorToTree

OK, this one is also a screenful, though it’s a bit shorter. Once again, I’ll put the full definition first before commenting it bit by bit.

vectorToTree :: TagSoup.StringLike str => TagVector str -> TagForest
vectorToTree tags = fixup $ forestWithin 0 (Vector.length tags)
    where
        forestWithin :: Int -> Int -> TagForest
        forestWithin !lo !hi
            | hi <= lo   = []
            | not isOpen = forestWithin (lo + 1) hi
            | otherwise  = Tree.Node (Span lo closeIndex) subForest
                         : forestWithin (closeIndex + 1) hi
            where
                info       = tags Vector.! lo
                isOpen     = TagSoup.isTagOpen $ infoTag info
                closeIndex = lo + fromMaybe 0 (infoOffset info)
                subForest  = forestWithin (lo + 1) closeIndex

        -- Lifts nodes whose closing tags lay outside their parent tags up to
        -- within a parent node that encompasses the node's entire span.
        fixup :: TagForest -> TagForest
        fixup [] = []
        fixup (Tree.Node (Span lo hi) subForest : siblings)
            = Tree.Node (Span lo hi) ok : bad
            where
                (ok, bad) = malformed (fixup siblings) $ fixup subForest

                malformed :: TagForest -- Forest to prepend bad trees on.
                          -> TagForest  -- Remaining trees to examine.
                          -> (TagForest, TagForest)
                malformed preBad [] = ([], preBad)
                malformed preBad (n@(Tree.Node (Span _ nHi) _) : ns)
                    | hi < nHi  = (ok, n : bad)
                    | otherwise = (n : ok, bad)
                    where (ok, bad) = malformed preBad ns

From the Vector we just build, this function will build a TagForest.

A TagForest is a specialization of the Forest type from Data.Tree, defined at the beginning of the module we’re studying:

type TagForest = Tree.Forest Span

Where a Span is a simple datatype that contains two Int (though a better solution would be to use Nat, since a Span won’t contain negative values, but I think that once again, the author went for performances).

If you’re not familiar with Data.Tree, a Forest is a list of Tree. And a Tree contains a value (nodeLabel) and a subForest of zero to many children. So, all type aliases aside, at the end of the day, we want to return a [Tree Span].

Sidenote : it might make more sense to return a single Tree to represent a XML. But (a) that would make our signatures tougher to write when we want to recurse and (b) we have to take into account potentially poorly written XML that do not have a single root (though, to be honest, people who write this kind of XML really want to see the world burn, don’t they ?).

Now that we have a general idea of the types we’re dealing with, let’s see the implementation of the main function:

vectorToTree tags = fixup $ forestWithin 0 (Vector.length tags)

OK, we compose two subfunction. Pretty simple; first we build a Forest with forestWithin, then we “fix it up” (if the name was not explicit enough, a glance at the comments from fixup will let us know that it’s to handle improper XML, we’ll come back to it).

Pay attention to the parameters we sent to forestWithin : 0 and the length of our vector. Basically, this would be the first Span, the one that covers the whole document, which would normally be enclosed in one big opening tag, e.g. <html>, the last tag being the same one closed, e.g. </html>.

Building a forest

The grunt work is done in these lines. There is no special syntax here, and it’s short enough for us to simply simulate execution to understand how this function work - so let’s do it !

forestWithin :: Int -> Int -> TagForest
forestWithin !lo !hi
    | hi <= lo   = []
    | not isOpen = forestWithin (lo + 1) hi
    | otherwise  = Tree.Node (Span lo closeIndex) subForest
                  : forestWithin (closeIndex + 1) hi
    where
        info       = tags Vector.! lo
        isOpen     = TagSoup.isTagOpen $ infoTag info
        closeIndex = lo + fromMaybe 0 (infoOffset info)
        subForest  = forestWithin (lo + 1) closeIndex

So, we have two Int parameters (let’s ignore the bangs, once again, it’s a strict vs. lazy issue). Obviously, these are bounds, a lower one and a higher one. Let’s take our previous example; we should get in vectorToTree with this vector:

[TagInfo (TagOpen "div" []) (Just "div") 0 (Just 9)
TagInfo (TagOpen "div" []) (Just "div") 1 (Just 7)
TagInfo  (TagOpen "p" []) (Just "p") 2 (Just 2))
TagInfo (TagText "I'm deep") Nothing 3 Nothing
TagInfo (TagClose "p" []) (Just "p") 4 Nothing)
TagInfo (TagOpen "p" []) (Just "p") 5 (Just 2))
TagInfo (TagText "I'm deep too"") Nothing 6 Nothing
TagInfo (TagClose "p" []) (Just "p") 7 Nothing)
TagInfo (TagClose "div" []) (Just "div") 8 Nothing)
TagInfo (TagClose "div" []) (Just "div") 9 Nothing)]

And enter forestWithin with 0 and 10 as parameters. The first guard will be ignored, 10 is greater than 0. For the second guard, we need to check isOpen; isOpen uses the lo parameter as an index in our vector, and checks if this is an opening tag. It is (and in any proper XML, for the first execution, it should be !). So the second guard of forestWithin will be ignored too, since we want only not isOpen.

It stands to reason that we should enter, at fist, the otherwise guard. Now pay attention, because there is confusing stuff incoming. Read these lines of code very carefully.

| otherwise  = Tree.Node (Span lo closeIndex) subForest
             : forestWithin (closeIndex + 1) hi

The confusing bit is the (:). This does not preprend subForest to forestWithin. This prepends a Tree, with its Span and its subForest, to the rest of the Forest. Quick recap:

So the otherwise prepends a whole tree to the rest of its Forest. In any sane XML, in the first case, there should be no “rest of the Forest”, the document should be a single Tree, so we should return a Forest containing only one Tree.

Which means that otherwise evaluates to the constructor for Tree, Node, containing:

With two recursive calls : one for any siblings of our tree, this is the forestWithin in otherwise; one for every children of our tree, this is the subForest call (which, in turn, calls forestWithin).

Both recursions depends on closeIndex, a value we obtain from the Tag at the index of lo. It is simply the “index of the closing tag for the currently opened tag”. So for our “div” at 0, it will be the div at index 9 (check the vector if you’re not sure !). You should now see why we went through the pain of computing the offsets in tagsToVector. Finally, observe that :

So in the first call, with the example vector, we will:

Simulating the evaluation

Let us evaluate the subForest of our first <div>. We enter forestWithing with 1 and 9.

We evaluate the new subForest; this time, we’re at the first <p>. Once again, we meet otherwise.

Ok, we will simulate YET ANOTHER subforest, the content of our first p. This time, finally, as the tag we investiga is not a TagOpen, we evaluate the second guard :

| not isOpen = forestWithin (lo + 1) hi

Which means that we won’t add a Node, this time : we will simply delegate to yet another recursion. We should evaluate forestWithin between 4 and 4; and meet the first guard, so we’ll return []. In other words: we only care about opening tags and their spans, we’ll mostly ignore anything else.

However, the sibling of the previous <p> will get evaluated. I’ll spare us the manual evaluation, and spoil you the end. We will get this - rather simple - Forest as a final result:

[Node (Span 0 9) [Node (Span 1 8) [Node (Span 2 4) [], Node (Span 5 7)]]]

That “lower-bound” / “higher bound” recursion pattern is rather interesting. I don’t know the first thing about recursions strategy, which is why I strive, whenever I can, to use canonical abstractions for recursion instead of writing my own. I’d wager it’s used here to reduce our traversal and keep things close to O(n), but it will take some more algorithmic-minded people than me to identify the name of this trick.

OK, massive warning here : the fixup subfunction is not particularly interesting, explaining how it works is painful, and the result will be disappointing anyway. If you still want to follow me, read the following section. If not, I advise you to directly jump to “How select work”, down below, after the swamp of misery that is going to be the fixup analysis.

Fixing the Forest

Man, you’re still there even despite my warnings à la Lemony Snicket. I admire your courage, I blame your foolishness.

A first malformed example

The issue when writing a scraper is that you’re likely to end up having to parse improper XML. Particularly, you’re likely to end up with some opened tags left dangling without their closing counterpart, as in:

<todos>
  <todo>Learn XML</todo>
  <todo>Fix the mistake at this line
  <todo>Explain how Scalpel will handle this</todo>
</todos>

The mistake should be rather easy to spot. In this example, tagToVector would return this vector:

[ TagInfo (TagOpen "todos" []) (Just "todos") 0 (Just 9)
, TagInfo (TagOpen "todo" []), (Just "todo") 1 (Just 2)
, TagInfo (TagText "Learn XML") Nothing 2
, TagInfo (TagClose "todo") (Just "todo") 3 Nothing
, TagInfo (TagOpen "todo" (Just "todo") 4 Nothing
, TagInfo (TagText "Fix the mistake at this line") Nothing 5 Nothing
, TagInfo (TagOpen "todo" []) (Just "todo") 6 (Just 2)
, TagInfo (TagText "Explain how Scalpel will handle this") Nothing 7 Nothing
, TagInfo (TagClose "todo" []) (Just "todo") 8, Nothing
, TagInfo {TagClose "todos" []) (Just "todos") 9 Nothing]

Which leads us to a TagOpen with Nothing as its offset since there was no associated closing tag. forestWithin will return the following forest:

[Node (Span 0 9) [ 
                   Node (Span 1 3) []
                 , Node (Span 4 4) []
                 , Node (Span 6 8) []
                 ]
]

Our problematic node is the Node (Span 4 4). See the way closeIndex is defined:

closeIndex = lo + fromMaybe 0 (infoOffset info)

This makes sure that a TagOpen with its offset set as Nothing will get a Span set with the same bounds. In other words : we cannot decide where the tag will close, so we’ll just make it a meaningless span, it only contains itself. This probably mean we’re losing the ability to access the text it should have contained but well, the author of this XML shoud have been more careful.

The fixup function won’t help for this case. It will kind of help in another, classic XML mistake though, as we are going to see.

A second example of malformed XML

What about this one:

<todos>
  <todo>Learn XML</todo>
  <todo>Fix the mistake at this line
  <todo>Explain how Scalpel will handle this</todo>
</todos></todo>

Ah, the infamous children tag that gets closed after its parent. Not only in this ugly, it would really mess up our structure.

[ TagInfo (TagOpen "todos" []) (Just "todos") 0 (Just 9)
, TagInfo (TagOpen "todo" []), (Just "todo") 1 (Just 2)
, TagInfo (TagText "Learn XML") Nothing 2
, TagInfo (TagClose "todo") (Just "todo") 3 Nothing
, TagInfo (TagOpen "todo" (Just "todo") 4 Nothing
, TagInfo (TagText "Fix the mistake at this line") Nothing 5 (Just 6)
, TagInfo (TagOpen "todo" []) (Just "todo") 6 (Just 2)
, TagInfo (TagText "Explain how Scalpel will handle this") Nothing 7 Nothing
, TagInfo (TagClose "todo" []) (Just "todo") 8, Nothing
, TagInfo {TagClose "todos" []) (Just "todos") 9 Nothing
, TagInfo (TagClose "todo" []) (Just "todo") 10 Nothing

Which would produce the following Forest:

[Node (Span 0 9) [
                   Node (Span 1 3) []
                 , Node (Span 4 10) [
                                      Node (Span 6 8) []
                                    ]
                 ]
]

Oh dear. This will not do. The third “todo” is registered as a child of the second one. Careful examination of this result shows that the span of our second node (Span 4 10) gets outside of its parent (Span 0 9). This gets us a way of identifying malformed tags. This is exactly what fixup is going to use to see what parts of the tree need to be fixed.

Fixing it up

I find fixup particularly difficult to read, because you constantly have to “jump” to find definitions. And caution: massive recursion incoming.

fixup (Tree.Node (Span lo hi) subForest : siblings)
  = Tree.Node (Span lo hi) ok : bad

We start by immediately rebuilding from the branch we’re considering : subForest and siblings will be rewritten by ok and bad. Once again, beware while reading the signature !

fixup (Tree.Node (Span lo hi) subForest : siblings)

It’s the same potential confusion I warned you about when we studied forestWithin. As the label indicates, we are not deconstructing the subForest, we are deconstructing the containing Forest. So, in our previous example, the first subForest will be all the children of our root node, and the siblings bit is nothing but our spline, [].

You can use additionnal parenthesis to make this more obvious:

fixup ((Tree.Node (Span lo hi) subForest) : siblings)

And the same way we deconstructed a Forest, we’ll rebuild one with ok for our current Node, and bad for the siblings. What are ok and bad ?

(ok, bad) = malformed (fixup siblings) $ fixup subForest

We will call malformed, a function that takes two TagForest, with the recursively applied fixup on them. This recursions should let us go as deep in the tree as possible. Now for the malformed, where Real Stuff Happens.

malformed :: TagForest -- Forest to prepend bad trees on.
          -> TagForest  -- Remaining trees to examine.
          -> (TagForest, TagForest)
malformed preBad [] = ([], preBad)
malformed preBad (n@(Tree.Node (Span _ nHi) _) : ns)
    | hi < nHi  = (ok, n : bad)
    | otherwise = (n : ok, bad)
    where (ok, bad) = malformed preBad ns

Once again, we are going to need to evaluate “manually” this to get how it works - this is the issue with manual recursions… abstractions are not here to guide you to understand what is happening.

Fixup in action

So, we enter fixup, with our root Node. lo is 0, hi is 9. We evalute (ok, bad):

malformed (fixup siblings) $ fixup subForest

Since we’re in the root node, fixup siblings will evaluate to []. What about fixup subForest though ?

First recursion

Stack up ! We need to call fixup on the subForest. This call fixup with our first child, Node (Span 1 3) []. So, back to fixup, lo is 1, hi is 3. subForest is empty because this Node has no children. siblings contains the rest of the children of the root node, namely the malformed little brother of our current node. We evaluate (ok, bad). fixup subForest will evaluate to []; we still need to evaluate fixup siblings.

Second recursion

Stack up again ! We’re entering fixup with the malformed Node. lo is now 4, hi is 10, subForest contains the Node that should not be here, and there are no siblings anymore. We evaluate ok, bad. We know that fixup siblings will evaluate to []. But we need to fixup subForest again.

Third recursion

Stack up !, but I swear, it’s the last time. We receive the third item on our todo-list improperly identified as the child of the previous one. lo is now 6, hi is 8. siblings is empty, ̀subforestis empty. We evaluate(ok, bad). Since bothsubForestandsiblingsare empty, we can FINALLY stop to recurse and entermalformed`. We’ll match the first pattern:

malformed preBad [] = ([], preBad)

preBad is the result of fixup siblings. So it’s []. (ok, bad) will finally evaluate ([], []). This fixup call returns (Node (Span 6 8) []):[].

Back to the second recursion

Stack down ! Where were we ? Ah yes, at the Node (Span 4 10), the tricky one. We wanted to evalute fixup subForest and fixup siblings. We have no siblings, and subForest returned, as we just saw (Node (Span 6 8) []):[]. We can evaluate malformed, and enter the second pattern: we HAVE the result of fixup subForest.

malformed preBad (n@(Tree.Node (Span _ nHi) _) : ns)
  | hi < nHi  = (ok, n : bad)
  | otherwise = (n : ok, bad)
  where (ok, bad) = malformed preBad ns

Curses ! More recursion. Let us first understand our parameters:

Since our current hi is 10, we should enter the otherwise guard. Which means we need, sadly, to evaluate the overwritten values of ok and bad in our where block (sidenote: don’t do this, please; don’t overwrite labels, you’re making everyone sad). Fortunately, since preBad is [], and ns is [], it is pretty easy to guess what the recursive malformed will return: ([], []). Our initial malformed will then return (Node (Span 6 8): [], []). Which means that the third recursion of fixup can now return : Node (Span 4 10) [Node (6 8) []]:[]. So far, it’s not very different from our very first input value. Bear with me.

Back to the first recursion

Stack down ! We were evaluating malformed. We had the fixup subForest, it was []. We needed fixup siblings, that’s the return of the third recursion.

Back to malformed, but we fall back on the first pattern, our second parameter is simply an empty list. So, this fixup call will return this Forest:

(Node (Span 1 3) []):(Node (Span 4 10) [Node (Span 6 8) []]):[]
Back to the initial call

So. We were, as usual, trying to evaluate a malformed. We had no siblings since we were in our root, so we had [] as a first parameter. And our subForest is the result of our previous recursive calls. We enter malformed.

What are our initial parameters ?

Buckle up, we’re going to do some magic. Is hi lower than nHi ? Nope. 9 is greater than 3. So we know we will return at the very least (Node (Span 1 3)). Prepended to… something depending on the result of (ok, bad). But to know what (ok, bad) is, we need to evaluate it. Boom, recursing all over again, this time with these parameters:

And finally, finally, we meet this first guard; the one where the world has been broken, the one where we have a high bound superior to our parent high bound, where nHi (10) is above hi (9). We need to know what the current ok and bad will be : and since ns is [], it will once again be ([], []). So we return :

([], (Node (Span 4 10) [Node (Span 6 8) []]):[])

Tired ? Come on, we’re almost done ! We were in the otherwise case of the previous recursion, dealing with our first child Node. And we needed to prepend this first Node to ok, and return, in the rest of our pair, the bad. Well, ok is [], the first item of the pair we’ve just returned. And bad is our Node. So, for our final result !, we will have the following Forest:

[ Node (Span 0 9) [Node (Span 1 3) []]
, Node (Span 4 10) [Node (Span 6 8) []]]

Which would be the stupid following XML:

<todos>
  <todo>Learn XML</todo>
</todos>
<todo>Fix the mistake at this line
  <todo>Explain how Scalpel will handle this</todo>
</todo>

ALL THAT HARD WORK FOR THIS ?! Yup. It’s not great, is it ? But at the very least, the Tree makes more sense. Consider the fact that with a malformed XML, it is fairly difficult to know “where” nodes should be. Was the node mistakenly “inside” its parent ? Or is it just that the closing tag was misplaced ? There is no perfect solution to this problem, because there many different possible mistakes. But the advantage of this solution is that it preserves a sane tree-like structure, with containers “bigger” than their contained set.

I strongly encourage you to reread the code if you’ve survived this. It’s actually a clever way to “move out” a subTree as sibling of a Forest. How sane people come up with recursions tricks like these, I have no idea.

How selection works

Well, you now know the most important stuff there are to know about the inner structures of Scalpel. Every Scraper will work with this base structure:

Put this in a tuple of three element and you get the famous TagSpec that will serve as the input for all scraping computation we’ve seen in our previous chapter.

Now to see how is it used. We’ve seen previously that Scalpel revolves around the notion of Selector, which is nothing but a datatype encapsulating “what you are looking for”. You run a Selector over the select function. I call it “the heart of Scalpel”, because Scalpel is but a gigantic wrapper to get stuff out of this function.

As I wrote when I gave a brief overview of Scalpel, Selectors are defined in the internal/Select/Types module of scalpel-core.

Selector is a newtype around a list of SelectNode, a datatype defined as:

data SelectNode = SelectNode !T.Text [AttributePredicate]
                | SelectAny [AttributePredicate]

It should be rather easy to understand, but just in case : the SelectNode constructor requires a tag name, and a list of predicates (expected classes or attributes). The SelectAny don’t care about tag names. The fact that the newtype Selector amounts to a list is what allow us to have the neat :

"div" // "p" // "a"

syntax, which would be rendered as a :

MkSelector [ SelectNode "div" []
           , SelectNode "p" []
           , SelectNode "a" []]

With this in mind, we can finally briefly read select and its bigger brother selectNode.

The select method

select :: (Ord str, TagSoup.StringLike str)
       => Selector -> TagSpec str -> [TagSpec str]
select s tagSpec = newSpecs
    where
        (MkSelector nodes) = s
        newSpecs = zipWith applyPosition [0..] (selectNodes nodes tagSpec [])
        applyPosition p (tags, f, ctx) = (tags, f, SelectContext p)

The whole select process can be seen as a read-only Lens. In other words: it focuses on a “small subpart” (well, “small subparts”, in this instance) of a “big part”. This is made clear by its signature : we have one tagSpec (typically, the one encompassing all of our document). We have a set of navigating instructions (the Selector). We shall return a list of destination in our initial TagSpec, “possible destinations for the given instructions”.

Which is why the definition of select is simply to “rebuild” a list of TagSpec objects. Look at the newSpecs definition; it uses selectNodes, a Depth First Search implementation over TagSpec that we will study in a minute, and zips an index over its result. This index allows us to fill the famous last element of the TagSpec tuple. It’s only used by the position primitive offered by Scalpel. It lets you know the order in which it found elements. Note that most Scalpel primitives build a Scraper using select. You then simply extract the information you’re after (be it text, html, a given attribute) over the result. The singular primitives (text, html, etc.) only apply to the first result of select. The pluralized forms (texts, htmls, etc.) will be mapped over every result.

Now for the final funtion we shall study : the selectNodes.

You probably know there are two basic algorithm when you’re looking for something in a tree: Depth First Search (DFS for short) and Breadth First Search (BFS). The first one goes from branches to branches, the second one from depth levels to depth levels.

There are, of course, Haskell libraries that cover this, notably the famous FGL, a.k.a Functional Graphs Library. However, since Scalpel uses its own mix of Tree and Vector, using an external tool would probably not have been very convenient.

Now summon whatever courage you have left, because we have yet another BIG function to read. Here it is, without its helpful comments:

selectNodes :: TagSoup.StringLike str
            => [SelectNode] -> TagSpec str -> [TagSpec str] -> [TagSpec str]
selectNodes []  _          acc = acc
selectNodes [_] (_, [], _) acc = acc
selectNodes [n] (tags, f : fs, ctx) acc
    | nodeMatches n info = (shrunkSpec :)
                         $ selectNodes [n] (tags, fs, ctx)
                         $ selectNodes [n] (tags, Tree.subForest f, ctx) acc
    | otherwise          = selectNodes [n] (tags, fs, ctx)
                         $ selectNodes [n] (tags, Tree.subForest f, ctx) acc
    where
        Span lo hi = Tree.rootLabel f
        shrunkSpec = (
                       Vector.slice lo (hi - lo + 1) tags
                     , [fmap recenter f]
                     , ctx
                     )
        recenter (Span nLo nHi) = Span (nLo - lo) (nHi - lo)
        info = tags Vector.! lo
selectNodes (_ : _) (_, [], _) acc = acc
selectNodes (n : ns) (tags, f : fs, ctx) acc
    | nodeMatches n info = selectNodes ns       (tags, Tree.subForest f, ctx)
                         $ selectNodes (n : ns) (tags, fs, ctx) acc
    | otherwise          = selectNodes (n : ns) (tags, Tree.subForest f, ctx)
                         $ selectNodes (n : ns) (tags, fs, ctx) acc
    where
        Span lo _ = Tree.rootLabel f
        info = tags Vector.! lo

OK, we can easily identify the main features of this function. Our first parameter is the list of selectors, our second parameter is the subpart of the TagSpec we are currently traversing. And finally, we have an acc parameter, which is typical of a manual recursions that build an accumulator; the two first patterns, namely the stop cases, will return this acc.

The trick here is that there are actually two level of recursions. We are recursing over our selectors AND recursing over the tree. Let us first study the recursion over the selectors.

Recursion over selectors

So, say you want to identify every <a> tags contained inside a <div> tag. We first need to keep only the <div> tags. We would typically start in the final pattern :

selectNodes (n : ns) (tags, f : fs, ctx) acc
    | nodeMatches n info = selectNodes ns       (tags, Tree.subForest f, ctx)
                         $ selectNodes (n : ns) (tags, fs, ctx) acc
    | otherwise          = selectNodes (n : ns) (tags, Tree.subForest f, ctx)
                         $ selectNodes (n : ns) (tags, fs, ctx) acc
    where
        Span lo _ = Tree.rootLabel f
        info = tags Vector.! lo

Lots of things are being deconstructed here (I call this kind of patterns a “Derrida paradise”):

The where block contains a mechanism to extract the Tag we need to inspect. rootLabel from Data.Tree extracts the value stored - so a Span. We only care about the lower bound of the Span, because it tells us “where the TagOpen bit is in our TagVector; info is this very Tag, selected through the index-access .! function over Vector.

The nodeMatches function will check for us if the Selector n match.

If nodeMatches evalutes to False, there is this odd looking call to selectNodes over the subforest, applied to selectNodes over the siblings. It means that we will use the result of the function over the siblings as the accumulator for the result of the function over the forest. Note that we will explore the subForest and the siblings : Scalpel always look for tags however deep they might be.

If there IS a Match, we also enter the subForest, but this time with the remaining selectors. So, to come back to the example I gave at the beginning, if we were looking for "div" // "a", we found a div, now we want to see if it contains an a - anywhere among children, grand-children, an so on. That would let us enter the other pattern. (We will, of course, also continue recursing over siblings, with the full list of selectors).

Recursion over Tree

And finally, we arrive to a lone selector over a part of a TagSpec and enter this pattern:

selectNodes [n] (tags, f : fs, ctx) acc
    | nodeMatches n info = (shrunkSpec :)
                         $ selectNodes [n] (tags, fs, ctx)
                         $ selectNodes [n] (tags, Tree.subForest f, ctx) acc
    | otherwise          = selectNodes [n] (tags, fs, ctx)
                         $ selectNodes [n] (tags, Tree.subForest f, ctx) acc
    where
        Span lo hi = Tree.rootLabel f
        shrunkSpec = (
                       Vector.slice lo (hi - lo + 1) tags
                     , [fmap recenter f]
                     , ctx
                     )
        recenter (Span nLo nHi) = Span (nLo - lo) (nHi - lo)
        info = tags Vector.! lo

The only difference here (baaah, this code is not very DRY) is the matching case. If we match, we will recurse after appending the result of shrunkSpec, i.e. the “smaller part of the bigger part” that we want to retain. But what is “a smaller part” of a tuple that contains a tag vector, a list of trees and an arbitrary index ? Well, it’s a smaller vector, an updated forest and as for the index, it will be handled by select as we saw earlier.

Let us start with the smaller vector : we cut it from its lower bound to the length of the span, plus one because vectors indexes from 0. So we only take the opening tag, what it encompasses, and its closing tag. As for the Tree, we put it in a List since we have to carry over a Forest, but we map over each Span inside to make them “local”. Let’s clarify this : all our Spans contains indexes from the initial vector, so we need to recompute them, by substracting the index of the opening tag to every span or our indexes won’t match our new Vector. And we’re done. All these smaller TagSpec will be accumulated over by recursion and returned.

I won’t add any comment to this function, because this post is already incredibly long. Furthemore, I suspect there is a bug in this implementation. I might explain it in a next post - an ultimate, and I hope shorter !, post about Scalpel, because after that, we’ll try to read some other libraries.

What did we learn ?

Not that much, but still: