Abstract
Hierarchical clustering has been shown to be valuable in many scenarios.Despite its usefulness to many situations, there is no agreed methodology onhow to properly evaluate the hierarchies produced from different techniques,particularly in the case where groundtruth labels are unavailable. Thismotivates us to propose a framework for assessing the quality of hierarchicalclustering allocations which covers the case of no groundtruth information.This measurement is useful, e.g., to assess the hierarchical structures used byonline retailer websites to display their product catalogues. Our framework isone of the few attempts for the hierarchy evaluation from a decisiontheoreticperspective. We model the process as a bot searching stochastically for itemsin the hierarchy and establish a measure representing the degree to which thehierarchy supports this search. We employ Partially Observable Markov DecisionProcesses (POMDP) to model the uncertainty, the decision making, and thecognitive return for searchers in such a scenario.
Quick Read (beta)
POMDP Modelling for Assessing Hierarchies
Abstract
Hierarchical clustering has been shown to be valuable in many scenarios. Despite its usefulness to many situations, there is no agreed methodology on how to properly evaluate the hierarchies produced from different techniques, particularly in the case where groundtruth labels are unavailable. This motivates us to propose a framework for assessing the quality of hierarchical clustering allocations which covers the case of no groundtruth information. This measurement is useful, e.g., to assess the hierarchical structures used by online retailer websites to display their product catalogues. Our framework is one of the few attempts for the hierarchy evaluation from a decision theoretic perspective. We model the process as a bot searching stochastically for items in the hierarchy and establish a measure representing the degree to which the hierarchy supports this search. We employ Partially Observable Markov Decision Processes (POMDP) to model the uncertainty, the decision making, and the cognitive return for searchers in such a scenario.
trees,bayesnet,arrows,automata,positioning \tcolorboxenvironmentlemmaenhanced jigsaw,colback=gray!20,drop shadow \tcolorboxenvironmentpropositionenhanced jigsaw,colback=gray!20,drop shadow \tcolorboxenvironmentremarkenhanced jigsaw,colback=gray!20,drop shadow \tcolorboxenvironmentdefinitionenhanced jigsaw,colback=white,drop shadow
1 Introduction
Hierarchical clustering (HC) analysis has been applied in many Ecommerce and scientific applications. The process generates a collection of nested clusters that group the data in a connected structure (balcan2014), forming a hierarchy. The hierarchy is represented by a tree data structure where each node contains a number of data items. Such a structured organisation of the items is useful for tasks such as efficient search. Searchers can navigate through the catalogue of items by choosing a path through the hierarchy and can thus avoid the cost of a linear search through the entire catalogue of items. If the hierarchy is well organised into coherent clusters, then finding the correct path to the required item is easy. Clearly, there are multiple ways of clustering the same items and organising their hierarchies. However, evaluating the quality of a hierarchy still needs more substantial study, especially for data that lacks a groundtruth hierarchy. The evaluation of hierarchical structures is the focus of this paper.
We consider online retailers and the customers who access them, as an example. When looking for specific items, customers are in essence navigating the hierarchies hosted by the website. Different clusters and hierarchies will probably provide divergent levels of user experience with respect to searching and navigating, even for the same users. For example, imagine that a female jeans is mistakenly placed in a branch of a certain hierarchy, labelled “clothes $\to $ trousers $\to $ jeans $\to $ males” . The user will expect that the jeans are contained along the branch “clothes $\to $ trousers $\to $ jeans $\to $ females”, and thus will surely fail to retrieve the required item in that hierarchy. As another example, creftype 1 shows that a searcher may be confused and possibly search for a Harry Potter DVD in the wrong route. Is it more so an action movie than a drama movie?
In this second example, it may well be the case that each cluster contains a coherent set of items, but their hierarchical organisation makes it difficult for a searcher to choose the correct path. Such observations motivate an evaluation function that accounts for the structure as much as the item to cluster assignments.
In the scenario that we consider here, all items stored in the subtree rooted at a node in the hierarchy are available for consideration by a searcher who stops at that node. For instance, a searcher stopping at the drama node could consider in turn all the DVDs, “Harry Potter”, “Lord of the Rings”, “Star Wars”, and so on. The further the searcher descends in the hierarchy, the fewer items that need to be considered once the searcher stops to search, thus increasing search efficiency, provided that a correct path in the hierarchy is taken.
There are many evaluation functions already proposed to measure the quality of regular (nonhierarchical) clustering results, with or without groundtruth data. Any of these measures can be applied to a HC, once an appropriate cut of the hierarchy is chosen. Such measures do give feedback on the coherence of the clusters and/or their agreement with a ground truth. However, there is relatively little research that has proposed measures scoring the structural organisation of the hierarchy. This is critical in order to properly evaluate hierarchies in the context that we have in mind. In particular, one can extend it to perform the hyperparameter tuning for a number of HC algorithms. As demonstrated in (kobren2017hierarchical), many HC algorithms depend on hyperparameters and might possibly be further improved with our function.
Problem Statement
Let us consider a set of items $X={\{{x}_{n}\}}_{n=1}^{N}$. HC is a set of nested clusters of the dataset arranged in a tree $\mathscr{H}$. Each node of the tree has associated with it a subset or cluster of the collection, and the children of any node are associated with a partition of the parent’s cluster. The root node is associated with the entire collection, and if an item belongs to any particular node in $\mathscr{H}$, it also belongs to all of its ancestor nodes. The goal is to seek a function mapping $\mathscr{H}$ to a real value, which reflects the quality of the hierarchical arrangement, from the point of view of supporting efficient search for a target item in the collection. In summary, this work aims to arrive at a quantitative quality measure for a given hierarchical arrangement of a catalogue of items, where in a typical usecase, the items are products in a large catalogue offered by an online retailer. Our scenario is that of a search bot seeking a specific target item and so we develop the measure by modelling a simplified search process but one which is sufficient to capture the important features of the hierarchy that determine its suitability to support efficient search.
It is intuitive to model this process of decision making under uncertainty as a Markov Decision Process (MDP). MDPs are concerned with the problem of determining a decision policy that optimises the cumulative reward obtained when applying this policy over some time horizon. Moreno2017 proposed such an MDP to tackle the problem of evaluating hierarchies. In that model, the quality of the hierarchy corresponds to the expected cumulative reward that a search bot will obtain when searching for target items, using a particular search policy, where the reward is a function of whether and how efficiently the target is found. Notice that, this quality measure depends not only on the hierarchy itself, but also on the policy used to determine the action at each decision point.
Contribution
We propose a novel framework extending the MDP model by proposing a reasonable search policy, which we posit provides a solid model for measuring the efficiency of item retrieval in the hierarchy. Our frameworks explicitly models the searcher’s lack of knowledge about the environment and this leads us to extend from an MDP framework, to a Partially Observed MDP (POMDP) framework. We coin the measure Hierarchy Quality for Search (HQS).
2 Related Works
When assessing general clustering results, one can appeal to measuring the results with or without the groundtruths (Liu13; Steinbach00). With respect to hierarchies, a groundtruth hierarchy should contain multiple layers assigning nodes to each level of the hierarchy. Thus, we do not consider the Dendrogram Purity (e.g., used in (heller2005bhc; kobren2017hierarchical)) as it evaluates a hierarchy with regard to only one layer of cluster assignment. There are many tools for evaluating the quality of the clustering but they lack the capability to evaluate the hierarchical arrangement (Johnson2013).
cigarran2005evaluating proposed a prototype considering the content of the cluster, the hierarchical arrangement and the navigation cost. Unfortunately, it is hard to develop this idea as neither detailed procedures nor experiments were presented in this short paper. More recently, Johnson2013 proposed the Hierarchical Agreement Index (HAI), which borrows its concept from the Rand Index, to compare the structure to a groundtruth hierarchy. Moreno2017 introduced the concept of using MDPs to model the evaluation. They observed that HC measures have hardly addressed the need to understand the convenience for the search and navigation efficiency provided by a hierarchy, accounting for the cognitive cost of choosing a correct path at each branch of the hierarchy. However, it is a prototype rather than a finished product. Our work extends these ideas to a POMDP (Kaelbling1998) model, which improves the previous work by 1) modelling the uncertainty in the search process with belief states; 2) specifying a policy to solve the problem so that the strong assumptions (the bot knows the position of the items universally) in the previous work can be relaxed. We consider this work an important contribution to the topic of evaluating hierarchical structures with no help from groundtruths.
There exist abundant research addressing the issue of acting optimally in POMDPs, such as (cassandra1994; Boutilier1996; meuleau1999solve; Ross2008), to name but a few. Seeking a better policy is not the main focus in the paper; hence, we adopt the existing (online) policies. On the other hand, we find the works (shani05; zheng2018drn; Eugene2019) employ Reinforcement Learning (RL) to model the user behaviours for improving the performance in recommender systems, which are inspiring. These are somewhat close to our work while in a different application domain.
3 POMDP Specification
We develop a POMPD model of a bot searching a hierarchy, where, at each node, it must make the decision to search or to descend further. The bot cannot backtrack, once a descent step has been performed. Simply put, a POMDP is an MDP with additional observations, belief state and belief estimator (cassandra1994; Kaelbling1998). Fixing the search target of the bot as item $x$, we model the search of the hierarchy as the POMDP ${\mathcal{P}}_{x}=\u27e8S,A,T,R,Z,O,\gamma \u27e9$, such that

•
$S$ is the finite state space.

•
$A$ is the set of possible actions.

•
$T:S\times A\times S\mapsto [0,1]$ is the transition function where $T(s,a,{s}^{\prime})\triangleq p({s}^{\prime}\mid s,a)$, represents the probability of moving to state ${s}^{\prime}$ from state $s$ using action $a$.

•
$R:S\times A\times S\mapsto \mathbb{R}$ assigns the expected immediate reward, $R(s,a,{s}^{\prime})$, to the resulting state ${s}^{\prime}$ after the bot selects action $a$ in state $s$.

•
$O$ is the observation set.

•
$Z:S\times A\times O\mapsto [0,1]$ is the probability function for the observations. We have $Z(s,a,o)=p(o\mid a,s)$ which reads as the probability of observing $o\in O$ when reaching the state $s$ through action $a$.

•
$\gamma $ is the discount factor. In our model, it is sufficient to set $\gamma =1$ and hence we will omit further explanation about this part.
States
POMDPs use states to represent the status of the search at a particular point during the search process. Our model defines that each state is a joint event consisting of 1) the physical location (node) $c$ of the bot, and 2) a boolean variable indicating if the bot is in the right path:
$S=\{\u27e8\mathrm{\varnothing},0\u27e9,\u27e8\mathrm{\varnothing},1\u27e9\}\cup {\displaystyle \bigcup _{c\in \mathscr{H}}}\{\u27e8c,0\u27e9,\u27e8c,1\u27e9\},$  (1) 
where $\mathrm{\varnothing}$ is the terminal point reached after the bot chooses to search. Accordingly, $\u27e8\mathrm{\varnothing},1\u27e9$ is the state representing that the bot stops and the stopping node contains the item, likewise $\u27e8\mathrm{\varnothing},0\u27e9$ depicts the opposite case.
Actions
We first discuss the guidance function concept, which is a core component in our model. Then, we specify the action set.
Guidance Function:
A guidance function $\eta :{2}^{X}\times {2}^{X}\times X\to [0,1]$ provides the search bot with evidence as to the correct path on which to search for an item $x\in X$. Specifically, $\eta (c,{c}^{\prime},x)$ represents the bot’s estimate that the target $x$ is contained in a cluster ${c}^{\prime}$, given that it is in cluster $c$, for $(c,{c}^{\prime})\in \mathscr{H}$. We have
$\forall {c}^{\prime}\notin \mathcal{C}(c):\eta (c,{c}^{\prime},x)=0\mathit{\hspace{1em}\hspace{1em}}\text{and}\mathit{\hspace{1em}\hspace{1em}}{\displaystyle \sum _{{c}^{\prime}\in \mathcal{C}(c)}}\eta (c,{c}^{\prime},x)=1$ 
where $\mathcal{C}(c)$ returns the children of $c$. The $\eta $ function summarises the information (prior domain knowledge) available in the hierarchy which will guide the bot’s behaviour. It is represented as a discrete probability density function. In particular, we write for any pair $(c,{c}^{\prime})\in \mathscr{H}$
$$\eta (c,{c}^{\prime},x)=p(x\in {c}^{\prime}\mid x\in c)\triangleq \frac{\mathrm{exp}\{\mathcal{S}(x,{c}^{\prime})/\delta \}}{{\sum}_{{c}^{\prime \prime}\in \mathcal{C}(c)}\mathrm{exp}\{\mathcal{S}(x,{c}^{\prime \prime})/\delta \}}.$$  (2) 
Here, $\mathcal{S}(x,c)$ is the similarity function between the item $x$ and the cluster $c$.^{1}^{1} 1 In the rest of the paper, we omit $x$ in the guidance function, since the discussions about $\eta $ will always concentrate on one specific item. Leaving the choice of similarity function open adds flexibility to the measure for adjusting to diverse datasets and applications. The $\eta $ function should return a higher probability score for the node ${c}^{\prime}$ that is “closest” to $x$ among the siblings. The parameter $\delta $ is a temperature parameter in the Boltzmann function. Due to the space limit, we disclose more insights about this function in the supplemental materials about how to control the function as the path goes deeper.
Action Set:
As our measure should represent the quality of the hierarchy for a typical searcher, we capture the possible search behaviours by modelling that the bot searches stochastically. Nevertheless, our measure focuses on rational search behaviour, rather than on fully random search. The proposed framework allows us to capture this rationality by exposing part of the search behaviour to rational decision making. Hence, we impose randomness in the manner in which children are selected by a searcher and expose only the decision of whether or not to explore the structure of the hierarchy to the bot’s decisionmaking logic. We design the bot to move randomly according to the guidance function $\eta (\cdot )$. Specifically, the bot moves to a child ${c}^{\prime}$ with probability $\eta (c,{c}^{\prime})$, i.e. based on the probability of the item being stored at the target location. With the above rationale, the action set contains only two actions: descend ${\mathrm{a}}_{\mathrm{d}}$ and search ${\mathrm{a}}_{\mathrm{s}}$, corresponding to the choices of descending the hierarchy to another level, or stopping and searching for the target at the current node.
Transitions
The transition is fixed and the bot can fully observe the outcome, when the bot searches. That is, the bot knows whether it is in $\u27e8\mathrm{\varnothing},0\u27e9$ or $\u27e8\mathrm{\varnothing},1\u27e9$ reaching the state $\mathrm{\varnothing}$ when applying action ${\mathrm{a}}_{\mathrm{s}}$.
For the descent action, the bot estimates the transition i.e. it computes $\widehat{T}$ such that $\widehat{T}(s,a,{s}^{\prime})=\widehat{p}({s}^{\prime}\mid s,a)$. In regular MDP and POMDP problems, the unknown transitions are handled by the RL methodology, such that the bot can explore by choosing certain actions and can approximate the transition probabilities by the ratio of the number of ending states over all the attempts (duff2002optimal; ng2012bayes). In our task, we have to design for bot’s transition distribution rather than to learn them from the data. Thus, the “domain knowledge” should play the role for driving the bot to succeed or fail in its task.
Let ${g}_{c}$ be the boolean variable associated with a node $c$ encapsulated in a state $s$. We decompose the transition probability of applying ${\mathrm{a}}_{\mathrm{d}}$ for the various cases. For ${c}^{\prime}\in \mathcal{C}(c)$, $s=\u27e8c,{g}_{c}\u27e9$ and ${s}^{\prime}=\u27e8{c}^{\prime},{g}_{{c}^{\prime}}\u27e9$,
$\widehat{p}({s}^{\prime}\mid s,{\mathrm{a}}_{\mathrm{d}})$  $=p\left(\u27e8{c}^{\prime},{g}_{{c}^{\prime}}\u27e9\mid \u27e8c,{g}_{c}\u27e9,{\mathrm{a}}_{\mathrm{d}}\right)=p\left({c}^{\prime}\mid c,{\mathrm{a}}_{\mathrm{d}}\right)\widehat{p}\left({g}_{{c}^{\prime}}\mid {g}_{c}\right)=\eta (c,{c}^{\prime})\widehat{p}\left({g}_{{c}^{\prime}}\mid {g}_{c}\right)$ 
where the last step follows from the stochastic descent process described in the previous section. The $\eta $ function is also used for the estimate $\widehat{p}({g}_{{c}^{\prime}}\mid {g}_{c})$ such that, $\widehat{p}({g}_{{c}^{\prime}}=1\mid {g}_{c}=1)=\eta (c,{c}^{\prime})$. Overall, for ${\mathrm{a}}_{\mathrm{d}}$, we get, for all $c\in \mathscr{H}$, ${c}^{\prime}\in \mathcal{C}(c)$:
$p(\u27e8{c}^{\prime},1\u27e9\mid \u27e8c,1\u27e9,{\mathrm{a}}_{\mathrm{d}})$  $=\eta (c,{c}^{\prime})\widehat{p}({g}_{{c}^{\prime}}=1\mid {g}_{c}=1)=\eta {(c,{c}^{\prime})}^{2}$  
$p(\u27e8{c}^{\prime},0\u27e9\mid \u27e8c,1\u27e9,{\mathrm{a}}_{\mathrm{d}})$  $=\eta (c,{c}^{\prime})\widehat{p}({g}_{{c}^{\prime}}=0\mid {g}_{c}=1)=\eta (c,{c}^{\prime})(1\eta (c,{c}^{\prime}))$  
$p(\u27e8{c}^{\prime},1\u27e9\mid \u27e8c,0\u27e9,{\mathrm{a}}_{\mathrm{d}})$  $=\eta (c,{c}^{\prime})\widehat{p}({g}_{{c}^{\prime}}=1\mid {g}_{c}=0)=0$  
$p(\u27e8{c}^{\prime},0\u27e9\mid \u27e8c,0\u27e9,{\mathrm{a}}_{\mathrm{d}})$  $=\eta (c,{c}^{\prime})\widehat{p}({g}_{{c}^{\prime}}=0\mid {g}_{c}=0)=\eta (c,{c}^{\prime}).$  (3) 
Rewards
In the POMDP, only when the bot stops and searches is a nonzero reward obtained; navigating earns $0$ reward. Moreover, searching in a node with fewer items gives the bot a higher cognitive reward which motivates the bot to traverse down the tree. In particular, we have
$R(s,a,{s}^{\prime})=\{\begin{array}{cc}0\hfill & a={\mathrm{a}}_{\mathrm{d}}\hfill \\ r(c)\hfill & {s}^{\prime}=\u27e8\mathrm{\varnothing},1\u27e9\hfill \\ 1\hfill & {s}^{\prime}=\u27e8\mathrm{\varnothing},0\u27e9\hfill \end{array}\mathit{\hspace{1em}}and\mathit{\hspace{1em}}r(c)$  $=1\left({e}^{c/N}1\right)/(e1)$ 
where $c$ is the location encapsulated in $s$. We assign $1$ to searching at a wrong node. One can customise $r(\cdot )$ as long as it is monotonically decreasing with the size of the input cluster to be searched. The selected $r(c)$ approximates the ease of searching within a set of items where $c$ is the number of items at $c$.
Observations and Observation Probabilities
The observations after the bot descends are its own location in the hierarchy, the children, and some abstract summaries from the children, etc. This information is fixed and unique for each location that the bot arrives at. Hence one can integrate it into one indexed observation that directly affects our belief update function. Recall that $Z(s,a,o)=p(o\mid s,a)$. It is trivial to see $p(o\mid \u27e8{c}^{\prime},{g}_{{c}^{\prime}}\u27e9,{\mathrm{a}}_{\mathrm{d}})=\mathrm{\U0001d7d9}\{{c}^{\prime}=o\}$, i.e. the bot can detect its exact location. After performing ${\mathrm{a}}_{\mathrm{s}}$, the observations are whether or not the target is found in the current node.
Belief Update
As states are not fully observable, the bot maintains its belief state $b$ as a probability function at any specific time. Belief updates can be written using the belief update function, $\tau $, where ${b}^{\prime}=\tau (b,a,o)$, is the updated belief when observation $o$ is made after action $a$ is applied in belief state $b$. By Bayes’ Theorem,
${b}^{\prime}({s}^{\prime})$  $=p({s}^{\prime}\mid a,o,b)={\displaystyle \frac{p(o\mid {s}^{\prime},a){\sum}_{s}p({s}^{\prime}\mid s,a)b(s)}{p(o\mid a,b)}}.$  (4) 
Considering that ${\mathrm{a}}_{\mathrm{d}}$ is performed at $c$, then for each possible child ${c}^{\prime}\in \mathcal{C}(c)$, such that the new state is ${s}^{\prime}=\u27e8{c}^{\prime},0\u27e9$ or ${s}^{\prime}=\u27e8{c}^{\prime},1\u27e9$, we have
$${b}^{\prime}\left({s}^{\prime}\right)\propto \mathrm{\U0001d7d9}\{{c}^{\prime}=o\}\sum _{s}p({s}^{\prime}\mid s,{\mathrm{a}}_{\mathrm{d}})b(s).$$ 
To be more specific,
${b}^{\prime}\left(\u27e8{c}^{\prime},0\u27e9\right)$  $\propto \mathrm{\U0001d7d9}\{{c}^{\prime}=o\}{\displaystyle \sum _{s}}p(\u27e8{c}^{\prime},0\u27e9\mid s,{\mathrm{a}}_{\mathrm{d}})b(s)$  
$=p(\u27e8{c}^{\prime},0\u27e9\mid \u27e8c,1\u27e9)b(\u27e8c,1\u27e9)+p(\u27e8{c}^{\prime},0\u27e9\mid \u27e8c,0\u27e9)b(\u27e8c,0\u27e9)$  
$=\eta (c,{c}^{\prime})[1\eta (c,{c}^{\prime})b(\u27e8c,1\u27e9)].$  (5)  
${b}^{\prime}\left(\u27e8{c}^{\prime},1\u27e9\right)$  $\propto \eta {(c,{c}^{\prime})}^{2}b\left(\u27e8c,1\u27e9\right),$  (6) 
where creftype 6 is obtained in a similar way to creftype 5. As discussed earlier, for other states, ${s}^{\prime \prime}$, not related to ${c}^{\prime}$, ${b}^{\prime}({s}^{\prime \prime})=0$. To compute the normalising constant $p(o\mid b,{\mathrm{a}}_{\mathrm{d}})$, when $a={\mathrm{a}}_{\mathrm{d}}$, we have that
$p(o={c}^{\prime}\mid b,{\mathrm{a}}_{\mathrm{d}})=\eta (c,{c}^{\prime})[1\eta (c,{c}^{\prime})b(\u27e8c,1\u27e9)+\eta (c,{c}^{\prime})b(\u27e8c,1\u27e9)]$  $=\eta (c,{c}^{\prime}),$ 
which gives a final belief update rule of
${b}^{\prime}\left(\u27e8{c}^{\prime},1\u27e9\right)$  $=\eta (c,{c}^{\prime})b\left(\u27e8c,1\u27e9\right)\mathit{\hspace{1em}\hspace{1em}}\text{and}\mathit{\hspace{1em}\hspace{1em}}{b}^{\prime}\left(\u27e8{c}^{\prime},0\u27e9\right)=1\eta (c,{c}^{\prime})b\left(\u27e8c,1\u27e9\right).$  (7) 
Note that $p(o\mid b,{\mathrm{a}}_{\mathrm{s}})=1$ for reaching the terminal state, such that $o=\mathrm{\varnothing}$.
Let us denote by $s={s}_{t}$ the state reached after $t$ update steps, with similar subscripting for $b$ and $c$. The root node is ${c}_{0}$. Throughout our analysis, we assume that the target $x$ is certainly contained inside the hierarchy and hence, the bot always starts in state $\u27e8{c}_{0},1\u27e9$. It follows that $b(\u27e8{c}_{0},1\u27e9)=1$.
By induction, we arrive at a simple expression for the belief state when the search has reached node ${c}_{T}$ at time $T$ where $T>0$, ${b}_{T}(\u27e8{c}_{T},1\u27e9)={\prod}_{t=1}^{T}\eta ({c}_{t1},{c}_{t})$ and ${b}_{T}(\u27e8{c}_{T},0\u27e9)=1{\prod}_{t=1}^{T}\eta ({c}_{t1},{c}_{t})$ while ${b}_{T}(s)=0$ for all other states $s$. That is, once the bot reaches a node $c$, the belief state of $\u27e8c,1\u27e9$ is simply the probability of reaching this node from the root, and that of $\u27e8c,0\u27e9$ is the residual, $1{b}_{T}(\u27e8c,1\u27e9)$. Furthermore, since there is no backtracking, there is exactly one belief state associated with each node $c\in \mathscr{H}$, which, if $\{{c}_{0},\mathrm{\dots},{c}_{T}=c\}$ is the unique path, is fully determined by the value ${b}_{c}\triangleq {b}_{T}(\u27e8{c}_{T},1\u27e9)$ and we can simply write, for ${c}^{\prime}\in \mathcal{C}(c)$, the belief update as ${b}_{{c}^{\prime}}=\eta (c,{c}^{\prime}){b}_{c}$.
Value Function
The most commonly used objective that drives policy selection in an MDP is the value $V$ (discounted long term reward), s.t. $V\triangleq {\sum}_{t=0}{\gamma}^{t}{R}_{t}$ where ${R}_{t}$ is the reward obtained at time stamp $t$. Let use denote a policy by $\pi $. Solving a POMDP aims to maximise the value function, ${V}^{\pi}(b)$, corresponding to the cumulative expected reward over the beliefs (singh1994learning). In particular,
${R}_{B}(b,a)={\displaystyle \sum _{s}}b(s){\displaystyle \sum _{{s}^{\prime}}}p({s}^{\prime}s,a)R(s,a,{s}^{\prime}).$ 
Then, the belief value function of a policy $\pi $ is given by
${V}^{\pi}(b)={\displaystyle \sum _{a}}\pi (b,a)\left[{R}_{B}(b,a)+\gamma {\displaystyle \sum _{s}}b(s){\displaystyle \sum _{{s}^{\prime}}}p({s}^{\prime}\mid s,a){\displaystyle \sum _{o}}p(o\mid {s}^{\prime},a){V}^{\pi}(\tau (b,a,o))\right]$ 
where $\pi (b,a)$ is the probability that action $a$ is selected in belief state $b$. The optimal value ${V}^{*}(b)$ with a corresponding deterministic policy to choose action ${a}^{*}$ can be written as a Bellman equation:
${V}^{*}(b)$  $=\underset{a\in A}{\mathrm{max}}Q(b,a)\mathit{\hspace{1em}\hspace{1em}}{a}^{*}=\underset{a\in A}{argmax}Q(b,a)$  
$\text{where}\mathit{\hspace{1em}\hspace{1em}}Q(b,a)$  $={R}_{B}(b,a)+\gamma {\displaystyle \sum _{o}}p(o\mid b,a){V}^{*}(\tau (b,a,o)).$ 
This shows that the POMDP may be interpreted as an MDP over belief states with $p(o\mid b,a)$ the transition probability for moving from belief state $b$ to belief state $\tau (b,a,o)$.
Let $T$ be the step at which a terminal state $\u27e8\mathrm{\varnothing},0\u27e9$ or $\u27e8\mathrm{\varnothing},1\u27e9$ is reached. The cumulative reward is given by
$\sum _{t=0}^{T1}}{R}_{t}=\{\begin{array}{cc}1\hfill & {s}_{T}=\u27e8\mathrm{\varnothing},0\u27e9\hfill \\ r({c}_{T1})\hfill & {s}_{T}=\u27e8\mathrm{\varnothing},1\u27e9\hfill \end{array}.$ 
It is natural to examine the value ${V}^{\pi}$ achieved by the policy $\pi $ learned over the POMDP in the underlying MDP (singh1994learning), that arises from the POMDP when the states are fully known. ${V}^{\pi}$ may be interpreted as the value that an oracle that knows the target’s location would assign to the bot’s policy. We base the HQS measure on this oracle value, noting that the bot can only choose the policy $\pi $ to try to maximise the belief value ${V}^{\pi}(b)$.
3.1 Example of the POMDP
Due to the space limit, we provide a walkthrough example in the supplemental document (Appendix A.) which shows how a POMDP will be constructed given a simple hierarchy.
3.2 Hierarchy Quality for Search
We measure the quality of the hierarchy as the oracle value, i.e. the value of this policy in the underlying MDP. Since the reward function always outputs $1$ whenever the policy leads to a wrong path, to compute this oracle value, we only need to focus on the correct path. In particular, let $T$ denote the depth at which the bot chooses ${\mathrm{a}}_{\mathrm{s}}$ and let $\{{c}_{0},{c}_{1},\mathrm{\dots},{c}_{T}\}$ now denote the correct path containing the target item $x$ to depth $T$. Then
${V}_{x}$  $=r({c}_{T}){\displaystyle \prod _{t=1}^{T}}\eta ({c}_{t1},{c}_{t})+(1)\cdot \left(1{\displaystyle \prod _{t=1}^{T}}\eta ({c}_{t1},{c}_{t})\right)=(r({c}_{T})+1)\left({\displaystyle \prod _{t=1}^{T}}\eta ({c}_{t1},{c}_{t})\right)1.$ 
Intuitively, the bot moves randomly according to $\eta (\cdot )$, and when it stops at depth $T$, the probability of obtaining a positive reward is simply the probability that the guidance function led it along the right path. Given $0\le r(\cdot )\le 1$, we obtain $1\le {V}_{x}\le 1$ which matches our intuition that if there are fewer items stored in the node at which the bot stops and the uncertainty of reaching this node is smaller, the bot achieves a higher value. The stopping point $T$, is the key output of the beliefbased policy determining the oracle value.
Eventually, we define HQS as follows.
Definition 1.
The HQS is a function over the set of data $X\mathrm{=}{\mathrm{\{}{x}_{n}\mathrm{\}}}_{n\mathrm{=}\mathrm{1}}^{N}$ and the hierarchy $\mathrm{H}$, such that $\mathrm{HQS}\mathit{}\mathrm{(}X\mathrm{,}\mathrm{H}\mathrm{)}\mathrm{=}\frac{\mathrm{1}}{N}\mathit{}{\mathrm{\sum}}_{n\mathrm{=}\mathrm{1}}^{N}{V}_{{x}_{n}}^{\pi}$ where ${V}_{{x}_{n}}^{\pi}$ is the long term expected reward for the bot to search item ${x}_{n}$, obtained by taking policy $\pi $.
4 Solving the POMDP
One can categorise POMDP solvers to offline and online policies. Offline methods either learn the whole environment prior to determining the policy or learn it through RL. Online planners solve the POMDP in a realtime decision making context, focusing on the information in the current states (Ross2008). We appeal to online planners as they better simulate the situation of a bot making decisions as the search proceeds. Also, offline methods would make the process less tractable and thus unsuitable for the evaluation task.
One notable online MonteCarlo based planner, POMCP (silver2010monte), is able to earn a good underlying MDP value regardless of the quality of the belief states. This contradicts what we expect from the role of belief in our model. If the guidance function is weak, then we expect that this should be reflected in a belief function that leads to a poor reward for the underlying MDP, as it should tend to lead the bot on a path not containing the target. Hence, the bot is coded with the RealTime Belief Space Search (RTBSS) planner to seek the policy (paquet2005online; paquet2005real; Ross2008). In RTBSS, at each decision point, the bot is restricted to learn the required information only of the immediate children, limiting it to only one layer down in the hierarchy. This corresponds to two lookaheads for an online planner, where the bot can compare stopping and searching at the current node, with that searching at the next layer after one descent.
4.1 RTBSS
We leave the details of the original RTBSS procedure as presented in (Ross2008) to the supplemental document. RTBSS is a greedy algorithm that explores a lowerbound on the optimal $V(b)$ using a diversity of actions within a limited number of lookaheads. It selects the policy that maximises this lowerbound.
We present the specialisation of the RTBSS planner to our setting. At each decision point $t$, the algorithm calculates $Q({b}_{t},{\mathrm{a}}_{\mathrm{s}})$, the Qvalue of stopping and searching at the current node and an estimate for descending, $\widehat{Q}({b}_{t},{\mathrm{a}}_{\mathrm{d}})$, obtained by assuming that the child nodes are leaf nodes, or equivalently, by taking the value of a second descent step, $Q({b}_{t+1},{\mathrm{a}}_{\mathrm{d}})$ to be 0. In particular,
$Q({b}_{t},{\mathrm{a}}_{\mathrm{s}})$  $={b}_{{c}_{t}}r({c}_{t})+(1{b}_{{c}_{t}})(1)={b}_{{c}_{t}}(r({c}_{t})+1)1$  
$\widehat{Q}({b}_{t},{\mathrm{a}}_{\mathrm{d}})$  $\triangleq {\displaystyle \sum _{{c}^{\prime}\in \mathcal{C}({c}_{t})}}p(o={c}^{\prime}\mid {b}_{t},{\mathrm{a}}_{\mathrm{d}})Q(\tau ({b}_{t},{\mathrm{a}}_{\mathrm{d}},o),{\mathrm{a}}_{\mathrm{s}})$  
$={\displaystyle \sum _{{c}^{\prime}\in \mathcal{C}({c}_{t})}}\eta ({c}_{t},{c}^{\prime})({b}_{{c}^{\prime}}(r({c}^{\prime})+1)1).$ 
\[email protected]@algorithmic[1] \STATEInitialise $b$ such that $b(\u27e8{c}_{0},1\u27e9)\leftarrow 1$ \WHILEIsNotLeaf($b$) and $\pi (b,{\mathrm{a}}_{\mathrm{s}})\ne 0$ \STATECompute $Q(b,{\mathrm{a}}_{\mathrm{s}})$ and $\widehat{Q}(b,{\mathrm{a}}_{\mathrm{d}})$ \IF$\widehat{Q}(b,{\mathrm{a}}_{\mathrm{d}})>Q(b,{\mathrm{a}}_{\mathrm{s}})$ \STATE${a}^{*}\leftarrow {\mathrm{a}}_{\mathrm{d}}$ \ELSE\STATE${a}^{*}\leftarrow {\mathrm{a}}_{\mathrm{s}}$ \ENDIF\STATE$\pi (b,{a}^{*})\leftarrow 1$ \STATE$b\leftarrow \tau (b,a,o)$ \COMMENTIt focuses on the $o$ in the right path only \ENDWHILE\STATE$\pi (b,{\mathrm{a}}_{\mathrm{s}})\leftarrow 1$ \COMMENTThe bot can only choose to search at a leaf node Our simplified algorithm (in creftype 4.1) provides a beliefbased policy for descending through the hierarchy, which at any given node determines whether to descend further or to stop. The simplified algorithm helps HQS become less computationally demanding and achieves a polynomial time complexity.
Proposition 1.
The worst case complexity of computing HQS for $N$ items is $O\mathit{}\mathrm{(}{N}^{\mathrm{3}}\mathit{}\mathrm{F}\mathit{}\mathrm{(}\mathrm{S}\mathrm{)}\mathrm{)}$ where $\mathrm{F}\mathit{}\mathrm{(}\mathrm{S}\mathrm{)}$ is the complexity of the similarity function.
The sketch of the proof can be found in the supplemental document.
5 Experimental Study
We show how HQS works through a case study using a small subset of Amazon data data^{2}^{2} 2 http://jmcauley.ucsd.edu/data/amazon (mcauley2015image) and compare it with an existing approach. Then, we will discuss on scaling the HQS.
5.1 Case Study on Amazon
This study uses the textual information for the items from the Amazon, and bases the guidance function on a TFIDF similarity score. We select commonly seen daily life products and manually construct a number of hierarchies, varying their quality. To ensure that the hierarchies can be intuitively assessed by inspection, we restrict to just 12 items. This approach can help us quickly assess HQS procedure by assessing if it orders the hierarchies as expected.
The Selected Items
Some TFIDF details about the items are displayed in Table S1 and S2 (in the supplemental document). We refer to items by their indices as specified in that table. More details are also revealed in the supplemental document.
Since the data is represented by TFIDF vectors, which we write as ${\bm{v}}_{x}$ for item $x$, we define the similarity function $\mathcal{S}(x,c)$ as follows
$\mathcal{S}(x,c)=\{\begin{array}{cc}1\hfill & c=\{x\}\hfill \\ \frac{1}{c1}{\sum}_{{x}^{\prime}\in c\backslash \{x\}}\mathrm{cos}({\bm{v}}_{x},{\bm{v}}_{{x}^{\prime}})\hfill & \text{otherwise}\hfill \end{array}.$ 
This similarity is in spirit similar to the average link of Agglomerative Clustering (AC) (day1984efficient).^{3}^{3} 3 It is important to exclude $x$ itself from computing the similarities over a nonsingleton cluster, since the maximum cosine similarity between any $(x,y)$ pair is only around $0.11$ in the selected items, and hence $\mathrm{cos}({\bm{v}}_{x},{\bm{v}}_{x})$ would dominate the similarity score.
For the guidance function in our experiments, we set $\delta =.01$ which helps us increase the weight of the best cluster and ensure that the cluster with largest similarities to the item, has a dominant $\eta $ value.
Tested Hierarchies
We analyse the HQS scores for five hierarchies in creftype 2, where HierA is the ground truth obtained from the labels provided in the Amazon dataset. HierB and HierC are randomly generated hierarchies. HierD is constructed to be deliberately poor. All items in each leaf cluster have different groundtruth labels. For example, “Whitener”, “Biker Boot Straps”, and “Ultrasonic Cleaner” in cluster 0 of HierD are all in separate clusters according to the Amazon organisation. Finally, HierE is constructed on purpose to show that a good hierarchy is achievable even different from the groundtruth.
HQS  0  1  2  3  4  5  6  7  8  9  10  11  

A  .572  .620  .747  .827  .811  .600  ..518  .793  .679  .562  .432  .812  .817 
B  .280  .0  .991  .989  .863  .955  .736  .760  .659  .995  .0  .749  .0 
C  .310  .752  .999  .0  .0  .0  .889  .857  .933  .879  .0  .965  .979 
D  .848  .668  .740  .708  .996  .652  1.00  .791  .607  .995  .925  .990  .901 
E  .327  .611  .797  .673  .0  .533  .0  .834  .0  .423  .0  .835  .808 
Result Analysis
The second column of creftype 1 shows the HQS results of the five different hierarchies. The ground truth hierarchy, A, provides the highest HQS score (0.5715), followed by a relatively high score for HierE as expected. HierD provides the worst HQS score (0.8477). In this example, the HQS scoring scheme agrees with our intuition and we believe this ranking is reasonable. creftype 1 also displays the values for each item obtained by RTBSS in the five hierarchies. Next, we specify how the policy propagates stepbystep for those hierarchies.
HierA
This hierarchy is generated given the ground truth labels of the items. For instance, node 4 refers to “Clothing, Shoes & Jewelry”, node 7 refers to “Women”, and 5 refers to “Shoe Care & Accessories” etc., as shown in Table S1 and S2.
For item ${x}_{0}$, “Women Boots”, staying and searching at the root earns an estimate $Q({b}_{0},{\mathrm{a}}_{\mathrm{s}})=0$. Now, the similarities ${x}_{0}$ with the child nodes are $\mathcal{S}({x}_{0},{c}_{7})=0.061$ and $\mathcal{S}({x}_{0},{c}_{8})=0.0277$. It follows that $\eta ({c}_{9},{c}_{7})=0.9987$ and $\eta ({c}_{9},{c}_{8})=0.0013$. Clearly, both nodes have 6 items and so, the rewards for staying at both nodes are equal, $0.6226$. It follows that $\widehat{Q}({b}_{0},{\mathrm{a}}_{\mathrm{d}})={\sum}_{o\in \{{c}_{7},{c}_{8}\}}p(o\mid b,{\mathrm{a}}_{\mathrm{d}})Q(\tau ({b}_{0},{\mathrm{a}}_{\mathrm{d}}o))=0.6203$. Since $$, the bot performs action ${\mathrm{a}}_{\mathrm{d}}$, and it randomly moves to the right node with probability $0.9986$. From ${c}_{7}$ to ${c}_{4}$, it is a straight move as the estimates will be equal and the bot should explore a better chance for a higher value. Let us review that cluster ${c}_{0}$ refers to “Boots” and ${c}_{3}$ refers to “Fashion Sneakers” under the category chain “All $\to $ Clothing, Shoes & Jewelry $\to $ Women” (${c}_{9},{c}_{7},{c}_{4}$). It shows $\mathcal{S}({x}_{0},{c}_{0})=0.06,\mathcal{S}({x}_{0},{c}_{3})=0.061$ and we obtain $\eta ({c}_{4},{c}_{0})=0.4158$ and $\eta ({c}_{4},{c}_{3})=0.5829$. This suggests that ${x}_{0}$ more strongly belongs to cluster ${c}_{3}$ than to ${c}_{0}$, the cluster it is assigned to. Regarding ${c}_{3}$, its items “Women Runner”, “Women Sneaker” and “Women Trainer” are highly close to “Women Boots” in some sense. Upon computing the Qvalues, the bot decides to stop at node ${c}_{4}$, since it is not confident that another descent step will lead to greater reward. Finally, the oracle value ${V}_{{x}_{0}}$ is $0.6203$, a good score, but short of the maximum reward which captures the lack of clarity between the clusters at the lower levels of the hierarchy for this item. This suggests that our measurement is rather reasonable in this case.
Another interesting example is ${x}_{7}$, “Dye Kit”. It starts navigating wrong at the first level. The similarities $\mathcal{S}({x}_{7},{c}_{7})$ and $\mathcal{S}({x}_{7},{c}_{8})$ return $0.0367$ and $0.0297$ respectively. It belongs to ${c}_{8}$ while the measurement prefers ${c}_{7}$. In fact, “Dye Kit” is close to the shoe items in cluster ${c}_{7}$, but also close to the items “Shoe Cleaner” and “Whitener” in cluster ${c}_{8}$, as they are also cleaning tools (for jewellery). As it stands, the bot will be guided to the wrong node and hence receives a negative reward. It could well be the case that the situation would be improved with more data items.
The bot is not confident to descend to the bottom of the hierarchy for every case which is reflected in the associated score. In summary, HierA is sufficiently good to help the bot descend and acquires a good HQS of $0.5715$.
HierB
In this case, the final HQS is $0.2801$. creftype 1 demonstrates that there are 6 items earning a negative reward as the hierarchy guides them to the “wrong” node. Three more items earn a score of 0, as the bot gets confused at the root level and so it prefers searching at the root. Interestingly, a random hierarchy does not necessarily achieve an absolute HQS of zero. The HQS measurement is apparently not linear.
HierC
Its HQS is $0.3097$. Similarly to HierB, there are 6 items earning a negative reward as the hierarchy has only two items achieving positive scores which are the two in the small cluster, “Biker Boot Straps” and “Silver Cloth” obtain $0.8892$ and $0.8789$ respectively. The similarities are diluted compared with the small node, since the sibling clusters are too diversified. However, HQS unveils the fact that the hierarchy is poor for item retrieval. Although we compute the value for the bot searching for each item, HQS considers the hierarchy as a whole. Thus, even a poor hierarchy might enable efficient search for a few items, while there are many more poorly scoring items.
HierD
For this hierarchy, the overall HQS score of $0.8477$, reflects the fact that this is a hierarchy within which the bot has difficulty in searching. HQS successfully reveals this by returning a score close to the minimum, the worst performing one among the exhibited hierarchies. This implies that HQS correctly downgrades a random hierarchy.
HierE
Finally, we construct a “good” hierarchy (creftype (e)e) while its style is very different to the groundtruth. It separates the items about jewellery and the items associated with shoes at the first level. Overall, there is only one item, “Shoe Cream” which receives a negative value $0.7972$. At the first decision point for this item, $\eta ({c}_{8},{c}_{7})$ is $0.8894$ which leads the bot to the wrong node with a dominant probability. The item’s textual representation tends to be better associated with the cleaning toolkit. Interestingly, “Biker Boot Straps” receives the similarities $0.0379$ and $0.0375$ for ${c}_{6}$ and ${c}_{7}$ respectively; thus the bot stops at the root node. It implies that the HQS acknowledges hierarchies with different structures to the groundtruth as long as the hierarchy is efficient for search.
5.1.1 Comparison to Existing Approach
We show that HQS outperforms HAI (Johnson2013) on our example hierarchy.
HAI defines a distance function $d(x,y,\mathscr{H})={c}_{x,y}/N$ where ${c}_{x,y}$ is the closest ancestor node of items $x$ and $y$ when such an ancestor is not a leaf node; otherwise the distance returns $0$. The HAI metric requires a groundtruth. Let ${\mathscr{H}}_{gt}$ be such a groundtruth tree. Then HAI is formulated as
$\mathrm{HAI}(\mathscr{H},{\mathscr{H}}_{gt})=1{\displaystyle \frac{1}{{N}^{2}}}{\displaystyle \sum _{i=1}^{N}}{\displaystyle \sum _{j=1}^{N}}d({x}_{i},{x}_{j},\mathscr{H})d({x}_{i},{x}_{j},{\mathscr{H}}_{gt}).$ 
We conduct the evaluation over the hierarchies B–E with HierA as ground truth. Their HAI scores are $0.64,0.46,0.58,0.85$ respectively. Both HQS and HAI rank HierE and HierB as first and second, excluding the groundtruth. However, the HAI score of 0.58 for HierD, which was generated to serve as the worst example, is better than the score of HierC (which should not happen) and also numerically close to that of HierB.
5.2 Discussion of Scaling HQS
HQS can easily be run in parallel, since the POMDPs for the items are completely independent. In this section, we demonstrate that sampling can further help with the runtime efficiency–with sufficient samples, one can achieve an approximated HQS close to the true HQS. With the number of items sampled, the discrepancy between the sampled HQS and the true HQS decreases exponentially and the runtime speeds up linearly.
Again, we use Amazon but keep $37,826$ samples and $100$ dimensions with PCA. The hierarchy is simply generated using AC and contains $75,651$ nodes with a maximum height $65$. The prototype of HQS is implemented with Python.^{4}^{4} 4 The experiments are conducted in a computer equipped with AMD Ryzen 7 [email protected] 8Core Processor and 64GB of RAM, using 16 threads in parallel. Tree is a relatively memoryintense data structure and the data size is also a consideration. Hence, RAM prevents us from examining larger (in another order of magnitude) trees.
Now, we downsample the items for computing the oracle values. creftype 5 displays the empirical densities of the HQS for the $37,826$ Amazon items, fitted using StatsPlots^{5}^{5} 5 https://github.com/JuliaPlots/StatsPlots.jl. The black dashed line is the true HQS with all items computed, while the others have sampled the corresponding percentages of items. One can see that the density curve with more samples gets closer to the one of the entire set of data. The L1 errors, which are divided by the absolute HQS, between the true HQS and the sampled HQS are exhibited in creftype 5. It is clear that the errors decreases drastically as more items are sampled. After $30\%$ sampling, the errors are less than $5\%$ of the absolute value of the true HQS. As expected, the HQS of the cases with fewer samples are less predictable. Finally, the runtime is plotted in creftype 5, and is shown to follow approximately a linear relation.
Overall, HQS can benefit from that it is completely parallelisable. In addition, sampling techniques can also help improve the efficiency. We would like to emphasise that, although the runtime is not quick enough, it is improvable with better code and machines.
6 Conclusion
We have proposed HQS, an approach for assessing the quality of hierarchical clusters which can be used without groundtruth information. It employs POMDP for modelling the uncertainty and the decision making for searchers with regard to search and navigation in a hierarchy, and extends the model to measure the quality of the hierarchies. The experiments show that HQS can order hierarchies reasonably based on their qualities without groundtruth information, and perform better than a stateoftheart approach which requires the groundtruth hierarchy. For future work, we will theoretically analyse the properties of the measure. Additionally, we will continue exploring better models and policies.