POMDP Modelling for Assessing Hierarchies

  • 2020-06-29 17:20:36
  • Weipeng Huang, Guangyuan Piao, Raul Moreno, Neil J. Hurley
  • 0


Hierarchical clustering has been shown to be valuable in many scenarios.Despite its usefulness to many situations, there is no agreed methodology onhow to properly evaluate the hierarchies produced from different techniques,particularly in the case where ground-truth labels are unavailable. Thismotivates us to propose a framework for assessing the quality of hierarchicalclustering allocations which covers the case of no ground-truth information.This measurement is useful, e.g., to assess the hierarchical structures used byonline retailer websites to display their product catalogues. Our framework isone of the few attempts for the hierarchy evaluation from a decision-theoreticperspective. We model the process as a bot searching stochastically for itemsin the hierarchy and establish a measure representing the degree to which thehierarchy supports this search. We employ Partially Observable Markov DecisionProcesses (POMDP) to model the uncertainty, the decision making, and thecognitive return for searchers in such a scenario.


Quick Read (beta)

POMDP Modelling for Assessing Hierarchies

Weipeng Huang [email protected] Insight Centre for Data Analytics, School of Computer Science, University College Dublin, Dublin, Ireland Guangyuan Piao [email protected] Insight Centre for Data Analytics, Data Science Institute, IDA Business Park, Galway, Ireland Raul Moreno [email protected] Insight Centre for Data Analytics, School of Computer Science, University College Dublin, Dublin, Ireland Neil Hurley [email protected] Insight Centre for Data Analytics, School of Computer Science, University College Dublin, Dublin, Ireland

Hierarchical clustering has been shown to be valuable in many scenarios. Despite its usefulness to many situations, there is no agreed methodology on how to properly evaluate the hierarchies produced from different techniques, particularly in the case where ground-truth labels are unavailable. This motivates us to propose a framework for assessing the quality of hierarchical clustering allocations which covers the case of no ground-truth information. This measurement is useful, e.g., to assess the hierarchical structures used by online retailer websites to display their product catalogues. Our framework is one of the few attempts for the hierarchy evaluation from a decision theoretic perspective. We model the process as a bot searching stochastically for items in the hierarchy and establish a measure representing the degree to which the hierarchy supports this search. We employ Partially Observable Markov Decision Processes (POMDP) to model the uncertainty, the decision making, and the cognitive return for searchers in such a scenario.


trees,bayesnet,arrows,automata,positioning \tcolorboxenvironmentlemmaenhanced jigsaw,colback=gray!20,drop shadow \tcolorboxenvironmentpropositionenhanced jigsaw,colback=gray!20,drop shadow \tcolorboxenvironmentremarkenhanced jigsaw,colback=gray!20,drop shadow \tcolorboxenvironmentdefinitionenhanced jigsaw,colback=white,drop shadow

1 Introduction

Hierarchical clustering (HC) analysis has been applied in many E-commerce and scientific applications. The process generates a collection of nested clusters that group the data in a connected structure (balcan2014), forming a hierarchy. The hierarchy is represented by a tree data structure where each node contains a number of data items. Such a structured organisation of the items is useful for tasks such as efficient search. Searchers can navigate through the catalogue of items by choosing a path through the hierarchy and can thus avoid the cost of a linear search through the entire catalogue of items. If the hierarchy is well organised into coherent clusters, then finding the correct path to the required item is easy. Clearly, there are multiple ways of clustering the same items and organising their hierarchies. However, evaluating the quality of a hierarchy still needs more substantial study, especially for data that lacks a ground-truth hierarchy. The evaluation of hierarchical structures is the focus of this paper.

We consider on-line retailers and the customers who access them, as an example. When looking for specific items, customers are in essence navigating the hierarchies hosted by the website. Different clusters and hierarchies will probably provide divergent levels of user experience with respect to searching and navigating, even for the same users. For example, imagine that a female jeans is mistakenly placed in a branch of a certain hierarchy, labelled “clothes trousers jeans males” . The user will expect that the jeans are contained along the branch “clothes trousers jeans females”, and thus will surely fail to retrieve the required item in that hierarchy. As another example, creftype 1 shows that a searcher may be confused and possibly search for a Harry Potter DVD in the wrong route. Is it more so an action movie than a drama movie?


for tree= fit=band,line width=.6pt,edge=line width=.6pt, [DVD, draw, ellipse [drama, [fantasy, [Harry Potter] [Lord of the Rings] ] [science fiction [Star Wars] [] ] ] [action,draw, ellipse [love story []] [kids []] [war-time []] ] ]

Figure 1: An example which illustrates a hierarchy that confuses the searchers given that the target is a Harry Porter DVD

In this second example, it may well be the case that each cluster contains a coherent set of items, but their hierarchical organisation makes it difficult for a searcher to choose the correct path. Such observations motivate an evaluation function that accounts for the structure as much as the item to cluster assignments.

In the scenario that we consider here, all items stored in the sub-tree rooted at a node in the hierarchy are available for consideration by a searcher who stops at that node. For instance, a searcher stopping at the drama node could consider in turn all the DVDs, “Harry Potter”, “Lord of the Rings”, “Star Wars”, and so on. The further the searcher descends in the hierarchy, the fewer items that need to be considered once the searcher stops to search, thus increasing search efficiency, provided that a correct path in the hierarchy is taken.

There are many evaluation functions already proposed to measure the quality of regular (non-hierarchical) clustering results, with or without ground-truth data. Any of these measures can be applied to a HC, once an appropriate cut of the hierarchy is chosen. Such measures do give feedback on the coherence of the clusters and/or their agreement with a ground truth. However, there is relatively little research that has proposed measures scoring the structural organisation of the hierarchy. This is critical in order to properly evaluate hierarchies in the context that we have in mind. In particular, one can extend it to perform the hyper-parameter tuning for a number of HC algorithms. As demonstrated in (kobren2017hierarchical), many HC algorithms depend on hyper-parameters and might possibly be further improved with our function.

Problem Statement

Let us consider a set of items X={xn}n=1N. HC is a set of nested clusters of the dataset arranged in a tree . Each node of the tree has associated with it a sub-set or cluster of the collection, and the children of any node are associated with a partition of the parent’s cluster. The root node is associated with the entire collection, and if an item belongs to any particular node in , it also belongs to all of its ancestor nodes. The goal is to seek a function mapping to a real value, which reflects the quality of the hierarchical arrangement, from the point of view of supporting efficient search for a target item in the collection. In summary, this work aims to arrive at a quantitative quality measure for a given hierarchical arrangement of a catalogue of items, where in a typical use-case, the items are products in a large catalogue offered by an on-line retailer. Our scenario is that of a search bot seeking a specific target item and so we develop the measure by modelling a simplified search process but one which is sufficient to capture the important features of the hierarchy that determine its suitability to support efficient search.

It is intuitive to model this process of decision making under uncertainty as a Markov Decision Process (MDP). MDPs are concerned with the problem of determining a decision policy that optimises the cumulative reward obtained when applying this policy over some time horizon. Moreno2017 proposed such an MDP to tackle the problem of evaluating hierarchies. In that model, the quality of the hierarchy corresponds to the expected cumulative reward that a search bot will obtain when searching for target items, using a particular search policy, where the reward is a function of whether and how efficiently the target is found. Notice that, this quality measure depends not only on the hierarchy itself, but also on the policy used to determine the action at each decision point.


We propose a novel framework extending the MDP model by proposing a reasonable search policy, which we posit provides a solid model for measuring the efficiency of item retrieval in the hierarchy. Our frameworks explicitly models the searcher’s lack of knowledge about the environment and this leads us to extend from an MDP framework, to a Partially Observed MDP (POMDP) framework. We coin the measure Hierarchy Quality for Search (HQS).

2 Related Works

When assessing general clustering results, one can appeal to measuring the results with or without the ground-truths (Liu13; Steinbach00). With respect to hierarchies, a ground-truth hierarchy should contain multiple layers assigning nodes to each level of the hierarchy. Thus, we do not consider the Dendrogram Purity (e.g., used in (heller2005bhc; kobren2017hierarchical)) as it evaluates a hierarchy with regard to only one layer of cluster assignment. There are many tools for evaluating the quality of the clustering but they lack the capability to evaluate the hierarchical arrangement (Johnson2013).

cigarran2005evaluating proposed a prototype considering the content of the cluster, the hierarchical arrangement and the navigation cost. Unfortunately, it is hard to develop this idea as neither detailed procedures nor experiments were presented in this short paper. More recently, Johnson2013 proposed the Hierarchical Agreement Index (HAI), which borrows its concept from the Rand Index, to compare the structure to a ground-truth hierarchy. Moreno2017 introduced the concept of using MDPs to model the evaluation. They observed that HC measures have hardly addressed the need to understand the convenience for the search and navigation efficiency provided by a hierarchy, accounting for the cognitive cost of choosing a correct path at each branch of the hierarchy. However, it is a prototype rather than a finished product. Our work extends these ideas to a POMDP (Kaelbling1998) model, which improves the previous work by 1) modelling the uncertainty in the search process with belief states; 2) specifying a policy to solve the problem so that the strong assumptions (the bot knows the position of the items universally) in the previous work can be relaxed. We consider this work an important contribution to the topic of evaluating hierarchical structures with no help from ground-truths.

There exist abundant research addressing the issue of acting optimally in POMDPs, such as (cassandra1994; Boutilier1996; meuleau1999solve; Ross2008), to name but a few. Seeking a better policy is not the main focus in the paper; hence, we adopt the existing (on-line) policies. On the other hand, we find the works (shani05; zheng2018drn; Eugene2019) employ Reinforcement Learning (RL) to model the user behaviours for improving the performance in recommender systems, which are inspiring. These are somewhat close to our work while in a different application domain.

3 POMDP Specification

We develop a POMPD model of a bot searching a hierarchy, where, at each node, it must make the decision to search or to descend further. The bot cannot backtrack, once a descent step has been performed. Simply put, a POMDP is an MDP with additional observations, belief state and belief estimator (cassandra1994; Kaelbling1998). Fixing the search target of the bot as item x, we model the search of the hierarchy as the POMDP 𝒫x=S,A,T,R,Z,O,γ, such that

  • S is the finite state space.

  • A is the set of possible actions.

  • T:S×A×S[0,1] is the transition function where T(s,a,s)p(ss,a), represents the probability of moving to state s from state s using action a.

  • R:S×A×S assigns the expected immediate reward, R(s,a,s), to the resulting state s after the bot selects action a in state s.

  • O is the observation set.

  • Z:S×A×O[0,1] is the probability function for the observations. We have Z(s,a,o)=p(oa,s) which reads as the probability of observing oO when reaching the state s through action a.

  • γ is the discount factor. In our model, it is sufficient to set γ=1 and hence we will omit further explanation about this part.


POMDPs use states to represent the status of the search at a particular point during the search process. Our model defines that each state is a joint event consisting of 1) the physical location (node) c of the bot, and 2) a boolean variable indicating if the bot is in the right path:

S={,0,,1}c{c,0,c,1}, (1)

where is the terminal point reached after the bot chooses to search. Accordingly, ,1 is the state representing that the bot stops and the stopping node contains the item, likewise ,0 depicts the opposite case.


We first discuss the guidance function concept, which is a core component in our model. Then, we specify the action set.

Guidance Function:

A guidance function η:2X×2X×X[0,1] provides the search bot with evidence as to the correct path on which to search for an item xX. Specifically, η(c,c,x) represents the bot’s estimate that the target x is contained in a cluster c, given that it is in cluster c, for (c,c). We have

c𝒞(c):η(c,c,x)=0  and  c𝒞(c)η(c,c,x)=1

where 𝒞(c) returns the children of c. The η function summarises the information (prior domain knowledge) available in the hierarchy which will guide the bot’s behaviour. It is represented as a discrete probability density function. In particular, we write for any pair (c,c)

η(c,c,x)=p(xcxc)exp{𝒮(x,c)/δ}c′′𝒞(c)exp{𝒮(x,c′′)/δ}. (2)

Here, 𝒮(x,c) is the similarity function between the item x and the cluster c.11 1 In the rest of the paper, we omit x in the guidance function, since the discussions about η will always concentrate on one specific item. Leaving the choice of similarity function open adds flexibility to the measure for adjusting to diverse datasets and applications. The η function should return a higher probability score for the node c that is “closest” to x among the siblings. The parameter δ is a temperature parameter in the Boltzmann function. Due to the space limit, we disclose more insights about this function in the supplemental materials about how to control the function as the path goes deeper.

Action Set:

As our measure should represent the quality of the hierarchy for a typical searcher, we capture the possible search behaviours by modelling that the bot searches stochastically. Nevertheless, our measure focuses on rational search behaviour, rather than on fully random search. The proposed framework allows us to capture this rationality by exposing part of the search behaviour to rational decision making. Hence, we impose randomness in the manner in which children are selected by a searcher and expose only the decision of whether or not to explore the structure of the hierarchy to the bot’s decision-making logic. We design the bot to move randomly according to the guidance function η(). Specifically, the bot moves to a child c with probability η(c,c), i.e. based on the probability of the item being stored at the target location. With the above rationale, the action set contains only two actions: descend ad and search as, corresponding to the choices of descending the hierarchy to another level, or stopping and searching for the target at the current node.


The transition is fixed and the bot can fully observe the outcome, when the bot searches. That is, the bot knows whether it is in ,0 or ,1 reaching the state when applying action as.

For the descent action, the bot estimates the transition i.e. it computes T^ such that T^(s,a,s)=p^(ss,a). In regular MDP and POMDP problems, the unknown transitions are handled by the RL methodology, such that the bot can explore by choosing certain actions and can approximate the transition probabilities by the ratio of the number of ending states over all the attempts (duff2002optimal; ng2012bayes). In our task, we have to design for bot’s transition distribution rather than to learn them from the data. Thus, the “domain knowledge” should play the role for driving the bot to succeed or fail in its task.

Let gc be the boolean variable associated with a node c encapsulated in a state s. We decompose the transition probability of applying ad for the various cases. For c𝒞(c), s=c,gc and s=c,gc,

p^(ss,ad) =p(c,gcc,gc,ad)=p(cc,ad)p^(gcgc)=η(c,c)p^(gcgc)

where the last step follows from the stochastic descent process described in the previous section. The η function is also used for the estimate p^(gcgc) such that, p^(gc=1gc=1)=η(c,c). Overall, for ad, we get, for all c, c𝒞(c):

p(c,1c,1,ad) =η(c,c)p^(gc=1gc=1)=η(c,c)2
p(c,0c,1,ad) =η(c,c)p^(gc=0gc=1)=η(c,c)(1-η(c,c))
p(c,1c,0,ad) =η(c,c)p^(gc=1gc=0)=0
p(c,0c,0,ad) =η(c,c)p^(gc=0gc=0)=η(c,c). (3)


In the POMDP, only when the bot stops and searches is a non-zero reward obtained; navigating earns 0 reward. Moreover, searching in a node with fewer items gives the bot a higher cognitive reward which motivates the bot to traverse down the tree. In particular, we have

R(s,a,s)={0a=adr(c)s=,1-1s=,0andr(c) =1-(e|c|/N-1)/(e-1)

where c is the location encapsulated in s. We assign -1 to searching at a wrong node. One can customise r() as long as it is monotonically decreasing with the size of the input cluster to be searched. The selected r(c) approximates the ease of searching within a set of items where |c| is the number of items at c.

Observations and Observation Probabilities

The observations after the bot descends are its own location in the hierarchy, the children, and some abstract summaries from the children, etc. This information is fixed and unique for each location that the bot arrives at. Hence one can integrate it into one indexed observation that directly affects our belief update function. Recall that Z(s,a,o)=p(os,a). It is trivial to see p(oc,gc,ad)=𝟙{c=o}, i.e. the bot can detect its exact location. After performing as, the observations are whether or not the target is found in the current node.

Belief Update

As states are not fully observable, the bot maintains its belief state b as a probability function at any specific time. Belief updates can be written using the belief update function, τ, where b=τ(b,a,o), is the updated belief when observation o is made after action a is applied in belief state b. By Bayes’ Theorem,

b(s) =p(sa,o,b)=p(os,a)sp(ss,a)b(s)p(oa,b). (4)

Considering that ad is performed at c, then for each possible child c𝒞(c), such that the new state is s=c,0 or s=c,1, we have


To be more specific,

b(c,0) 𝟙{c=o}sp(c,0s,ad)b(s)
=η(c,c)[1-η(c,c)b(c,1)]. (5)
b(c,1) η(c,c)2b(c,1), (6)

where creftype 6 is obtained in a similar way to creftype 5. As discussed earlier, for other states, s′′, not related to c, b(s′′)=0. To compute the normalising constant p(ob,ad), when a=ad, we have that

p(o=cb,ad)=η(c,c)[1-η(c,c)b(c,1)+η(c,c)b(c,1)] =η(c,c),

which gives a final belief update rule of

b(c,1) =η(c,c)b(c,1)  and  b(c,0)=1-η(c,c)b(c,1). (7)

Note that p(ob,as)=1 for reaching the terminal state, such that o=.

Let us denote by s=st the state reached after t update steps, with similar sub-scripting for b and c. The root node is c0. Throughout our analysis, we assume that the target x is certainly contained inside the hierarchy and hence, the bot always starts in state c0,1. It follows that b(c0,1)=1.

By induction, we arrive at a simple expression for the belief state when the search has reached node cT at time T where T>0, bT(cT,1)=t=1Tη(ct-1,ct) and bT(cT,0)=1-t=1Tη(ct-1,ct) while bT(s)=0 for all other states s. That is, once the bot reaches a node c, the belief state of c,1 is simply the probability of reaching this node from the root, and that of c,0 is the residual, 1-bT(c,1). Furthermore, since there is no backtracking, there is exactly one belief state associated with each node c, which, if {c0,,cT=c} is the unique path, is fully determined by the value bcbT(cT,1) and we can simply write, for c𝒞(c), the belief update as bc=η(c,c)bc.

Value Function

The most commonly used objective that drives policy selection in an MDP is the value V (discounted long term reward), s.t. Vt=0γtRt where Rt is the reward obtained at time stamp t. Let use denote a policy by π. Solving a POMDP aims to maximise the value function, Vπ(b), corresponding to the cumulative expected reward over the beliefs (singh1994learning). In particular,


Then, the belief value function of a policy π is given by


where π(b,a) is the probability that action a is selected in belief state b. The optimal value V*(b) with a corresponding deterministic policy to choose action a* can be written as a Bellman equation:

V*(b) =maxaAQ(b,a)  a*=argmaxaAQ(b,a)
where  Q(b,a) =RB(b,a)+γop(ob,a)V*(τ(b,a,o)).

This shows that the POMDP may be interpreted as an MDP over belief states with p(ob,a) the transition probability for moving from belief state b to belief state τ(b,a,o).

Let T be the step at which a terminal state ,0 or ,1 is reached. The cumulative reward is given by


It is natural to examine the value Vπ achieved by the policy π learned over the POMDP in the underlying MDP (singh1994learning), that arises from the POMDP when the states are fully known. Vπ may be interpreted as the value that an oracle that knows the target’s location would assign to the bot’s policy. We base the HQS measure on this oracle value, noting that the bot can only choose the policy π to try to maximise the belief value Vπ(b).

3.1 Example of the POMDP

Due to the space limit, we provide a walk-through example in the supplemental document (Appendix A.) which shows how a POMDP will be constructed given a simple hierarchy.

3.2 Hierarchy Quality for Search

We measure the quality of the hierarchy as the oracle value, i.e. the value of this policy in the underlying MDP. Since the reward function always outputs -1 whenever the policy leads to a wrong path, to compute this oracle value, we only need to focus on the correct path. In particular, let T denote the depth at which the bot chooses as and let {c0,c1,,cT} now denote the correct path containing the target item x to depth T. Then

Vx =r(cT)t=1Tη(ct-1,ct)+(-1)(1-t=1Tη(ct-1,ct))=(r(cT)+1)(t=1Tη(ct-1,ct))-1.

Intuitively, the bot moves randomly according to η(), and when it stops at depth T, the probability of obtaining a positive reward is simply the probability that the guidance function led it along the right path. Given 0r()1, we obtain -1Vx1 which matches our intuition that if there are fewer items stored in the node at which the bot stops and the uncertainty of reaching this node is smaller, the bot achieves a higher value. The stopping point T, is the key output of the belief-based policy determining the oracle value.

Eventually, we define HQS as follows.

Definition 1.

The HQS is a function over the set of data X={xn}n=1N and the hierarchy H, such that HQS(X,H)=1Nn=1NVxnπ where Vxnπ is the long term expected reward for the bot to search item xn, obtained by taking policy π.

4 Solving the POMDP

One can categorise POMDP solvers to off-line and on-line policies. Off-line methods either learn the whole environment prior to determining the policy or learn it through RL. On-line planners solve the POMDP in a real-time decision making context, focusing on the information in the current states (Ross2008). We appeal to on-line planners as they better simulate the situation of a bot making decisions as the search proceeds. Also, off-line methods would make the process less tractable and thus unsuitable for the evaluation task.

One notable on-line Monte-Carlo based planner, POMCP (silver2010monte), is able to earn a good underlying MDP value regardless of the quality of the belief states. This contradicts what we expect from the role of belief in our model. If the guidance function is weak, then we expect that this should be reflected in a belief function that leads to a poor reward for the underlying MDP, as it should tend to lead the bot on a path not containing the target. Hence, the bot is coded with the Real-Time Belief Space Search (RTBSS) planner to seek the policy (paquet2005online; paquet2005real; Ross2008). In RTBSS, at each decision point, the bot is restricted to learn the required information only of the immediate children, limiting it to only one layer down in the hierarchy. This corresponds to two look-aheads for an on-line planner, where the bot can compare stopping and searching at the current node, with that searching at the next layer after one descent.


We leave the details of the original RTBSS procedure as presented in (Ross2008) to the supplemental document. RTBSS is a greedy algorithm that explores a lower-bound on the optimal V(b) using a diversity of actions within a limited number of look-aheads. It selects the policy that maximises this lower-bound.

We present the specialisation of the RTBSS planner to our setting. At each decision point t, the algorithm calculates Q(bt,as), the Q-value of stopping and searching at the current node and an estimate for descending, Q^(bt,ad), obtained by assuming that the child nodes are leaf nodes, or equivalently, by taking the value of a second descent step, Q(bt+1,ad) to be 0. In particular,

Q(bt,as) =bctr(ct)+(1-bct)(-1)=bct(r(ct)+1)-1
Q^(bt,ad) c𝒞(ct)p(o=cbt,ad)Q(τ(bt,ad,o),as)

Simplified RTBSS Policy specified for HQS \[email protected]@algorithmic[1] \STATEInitialise b such that b(c0,1)1 \WHILEIsNotLeaf(b) and π(b,as)0 \STATECompute Q(b,as) and Q^(b,ad) \IFQ^(b,ad)>Q(b,as) \STATEa*ad \ELSE\STATEa*as \ENDIF\STATEπ(b,a*)1 \STATEbτ(b,a,o) \COMMENTIt focuses on the o in the right path only \ENDWHILE\STATEπ(b,as)1 \COMMENTThe bot can only choose to search at a leaf node Our simplified algorithm (in creftype 4.1) provides a belief-based policy for descending through the hierarchy, which at any given node determines whether to descend further or to stop. The simplified algorithm helps HQS become less computationally demanding and achieves a polynomial time complexity.

Proposition 1.

The worst case complexity of computing HQS for N items is O(N3F(S)) where F(S) is the complexity of the similarity function.

The sketch of the proof can be found in the supplemental document.

5 Experimental Study

We show how HQS works through a case study using a small sub-set of Amazon data data22 2 http://jmcauley.ucsd.edu/data/amazon (mcauley2015image) and compare it with an existing approach. Then, we will discuss on scaling the HQS.

5.1 Case Study on Amazon

This study uses the textual information for the items from the Amazon, and bases the guidance function on a TF-IDF similarity score. We select commonly seen daily life products and manually construct a number of hierarchies, varying their quality. To ensure that the hierarchies can be intuitively assessed by inspection, we restrict to just 12 items. This approach can help us quickly assess HQS procedure by assessing if it orders the hierarchies as expected.

The Selected Items

Some TF-IDF details about the items are displayed in Table S1 and S2 (in the supplemental document). We refer to items by their indices as specified in that table. More details are also revealed in the supplemental document.

Since the data is represented by TF-IDF vectors, which we write as 𝒗x for item x, we define the similarity function 𝒮(x,c) as follows


This similarity is in spirit similar to the average link of Agglomerative Clustering (AC) (day1984efficient).33 3 It is important to exclude x itself from computing the similarities over a non-singleton cluster, since the maximum cosine similarity between any (x,y) pair is only around 0.11 in the selected items, and hence cos(𝒗x,𝒗x) would dominate the similarity score.

For the guidance function in our experiments, we set δ=.01 which helps us increase the weight of the best cluster and ensure that the cluster with largest similarities to the item, has a dominant η value.

Tested Hierarchies

We analyse the HQS scores for five hierarchies in creftype 2, where Hier-A is the ground truth obtained from the labels provided in the Amazon dataset. Hier-B and Hier-C are randomly generated hierarchies. Hier-D is constructed to be deliberately poor. All items in each leaf cluster have different ground-truth labels. For example, “Whitener”, “Biker Boot Straps”, and “Ultrasonic Cleaner” in cluster 0 of Hier-D are all in separate clusters according to the Amazon organisation. Finally, Hier-E is constructed on purpose to show that a good hierarchy is achievable even different from the ground-truth.

(a) Hier-A
(b) Hier-B
(c) Hier-C
(d) Hier-D
(e) Hier-E
Figure 2: Five different hierarchies
Table 1: HQS results of the five hierarchies (A–E in Fig. 2) and the value V for the twelve items (indexed from 0–11) in each hierarchy.
HQS 0 1 2 3 4 5 6 7 8 9 10 11
A .572 .620 .747 .827 .811 .600 ..518 .793 -.679 .562 .432 .812 .817
B -.280 .0 -.991 -.989 .863 -.955 -.736 .760 .659 -.995 .0 .749 .0
C -.310 -.752 -.999 .0 .0 .0 .889 .857 -.933 .879 .0 -.965 -.979
D -.848 -.668 -.740 -.708 -.996 -.652 -1.00 -.791 -.607 -.995 -.925 -.990 -.901
E .327 .611 -.797 .673 .0 .533 .0 .834 .0 .423 .0 .835 .808
Result Analysis

The second column of creftype 1 shows the HQS results of the five different hierarchies. The ground truth hierarchy, A, provides the highest HQS score (0.5715), followed by a relatively high score for Hier-E as expected. Hier-D provides the worst HQS score (-0.8477). In this example, the HQS scoring scheme agrees with our intuition and we believe this ranking is reasonable. creftype 1 also displays the values for each item obtained by RTBSS in the five hierarchies. Next, we specify how the policy propagates step-by-step for those hierarchies.


This hierarchy is generated given the ground truth labels of the items. For instance, node 4 refers to “Clothing, Shoes & Jewelry”, node 7 refers to “Women”, and 5 refers to “Shoe Care & Accessories” etc., as shown in Table S1 and S2.

For item x0, “Women Boots”, staying and searching at the root earns an estimate Q(b0,as)=0. Now, the similarities x0 with the child nodes are 𝒮(x0,c7)=0.061 and 𝒮(x0,c8)=0.0277. It follows that η(c9,c7)=0.9987 and η(c9,c8)=0.0013. Clearly, both nodes have 6 items and so, the rewards for staying at both nodes are equal, 0.6226. It follows that Q^(b0,ad)=o{c7,c8}p(ob,ad)Q(τ(b0,ado))=0.6203. Since Q(b0,as)<Q^(b0,ad), the bot performs action ad, and it randomly moves to the right node with probability 0.9986. From c7 to c4, it is a straight move as the estimates will be equal and the bot should explore a better chance for a higher value. Let us review that cluster c0 refers to “Boots” and c3 refers to “Fashion Sneakers” under the category chain “All Clothing, Shoes & Jewelry Women” (c9,c7,c4). It shows 𝒮(x0,c0)=0.06,𝒮(x0,c3)=0.061 and we obtain η(c4,c0)=0.4158 and η(c4,c3)=0.5829. This suggests that x0 more strongly belongs to cluster c3 than to c0, the cluster it is assigned to. Regarding c3, its items “Women Runner”, “Women Sneaker” and “Women Trainer” are highly close to “Women Boots” in some sense. Upon computing the Q-values, the bot decides to stop at node c4, since it is not confident that another descent step will lead to greater reward. Finally, the oracle value Vx0 is 0.6203, a good score, but short of the maximum reward which captures the lack of clarity between the clusters at the lower levels of the hierarchy for this item. This suggests that our measurement is rather reasonable in this case.

Another interesting example is x7, “Dye Kit”. It starts navigating wrong at the first level. The similarities 𝒮(x7,c7) and 𝒮(x7,c8) return 0.0367 and 0.0297 respectively. It belongs to c8 while the measurement prefers c7. In fact, “Dye Kit” is close to the shoe items in cluster c7, but also close to the items “Shoe Cleaner” and “Whitener” in cluster c8, as they are also cleaning tools (for jewellery). As it stands, the bot will be guided to the wrong node and hence receives a negative reward. It could well be the case that the situation would be improved with more data items.

The bot is not confident to descend to the bottom of the hierarchy for every case which is reflected in the associated score. In summary, Hier-A is sufficiently good to help the bot descend and acquires a good HQS of 0.5715.


In this case, the final HQS is -0.2801. creftype 1 demonstrates that there are 6 items earning a negative reward as the hierarchy guides them to the “wrong” node. Three more items earn a score of 0, as the bot gets confused at the root level and so it prefers searching at the root. Interestingly, a random hierarchy does not necessarily achieve an absolute HQS of zero. The HQS measurement is apparently not linear.


Its HQS is -0.3097. Similarly to Hier-B, there are 6 items earning a negative reward as the hierarchy has only two items achieving positive scores which are the two in the small cluster, “Biker Boot Straps” and “Silver Cloth” obtain 0.8892 and 0.8789 respectively. The similarities are diluted compared with the small node, since the sibling clusters are too diversified. However, HQS unveils the fact that the hierarchy is poor for item retrieval. Although we compute the value for the bot searching for each item, HQS considers the hierarchy as a whole. Thus, even a poor hierarchy might enable efficient search for a few items, while there are many more poorly scoring items.


For this hierarchy, the overall HQS score of -0.8477, reflects the fact that this is a hierarchy within which the bot has difficulty in searching. HQS successfully reveals this by returning a score close to the minimum, the worst performing one among the exhibited hierarchies. This implies that HQS correctly downgrades a random hierarchy.


Finally, we construct a “good” hierarchy (creftype (e)e) while its style is very different to the ground-truth. It separates the items about jewellery and the items associated with shoes at the first level. Overall, there is only one item, “Shoe Cream” which receives a negative value -0.7972. At the first decision point for this item, η(c8,c7) is 0.8894 which leads the bot to the wrong node with a dominant probability. The item’s textual representation tends to be better associated with the cleaning toolkit. Interestingly, “Biker Boot Straps” receives the similarities 0.0379 and 0.0375 for c6 and c7 respectively; thus the bot stops at the root node. It implies that the HQS acknowledges hierarchies with different structures to the ground-truth as long as the hierarchy is efficient for search.

5.1.1 Comparison to Existing Approach

We show that HQS outperforms HAI (Johnson2013) on our example hierarchy.

HAI defines a distance function d(x,y,)=|cx,y|/N where cx,y is the closest ancestor node of items x and y when such an ancestor is not a leaf node; otherwise the distance returns 0. The HAI metric requires a ground-truth. Let gt be such a ground-truth tree. Then HAI is formulated as


We conduct the evaluation over the hierarchies B–E with Hier-A as ground truth. Their HAI scores are 0.64,0.46,0.58,0.85 respectively. Both HQS and HAI rank Hier-E and Hier-B as first and second, excluding the ground-truth. However, the HAI score of 0.58 for Hier-D, which was generated to serve as the worst example, is better than the score of Hier-C (which should not happen) and also numerically close to that of Hier-B.

5.2 Discussion of Scaling HQS

HQS can easily be run in parallel, since the POMDPs for the items are completely independent. In this section, we demonstrate that sampling can further help with the runtime efficiency–with sufficient samples, one can achieve an approximated HQS close to the true HQS. With the number of items sampled, the discrepancy between the sampled HQS and the true HQS decreases exponentially and the runtime speeds up linearly.

Again, we use Amazon but keep 37,826 samples and 100 dimensions with PCA. The hierarchy is simply generated using AC and contains 75,651 nodes with a maximum height 65. The prototype of HQS is implemented with Python.44 4 The experiments are conducted in a computer equipped with AMD Ryzen 7 [email protected] 8-Core Processor and 64GB of RAM, using 16 threads in parallel. Tree is a relatively memory-intense data structure and the data size is also a consideration. Hence, RAM prevents us from examining larger (in another order of magnitude) trees.

Figure 3: Empirical density
Figure 4: Normalised L1 errors
Figure 5: Runtime of the sampled cases

Now, we down-sample the items for computing the oracle values. creftype 5 displays the empirical densities of the HQS for the 37,826 Amazon items, fitted using StatsPlots55 5 https://github.com/JuliaPlots/StatsPlots.jl. The black dashed line is the true HQS with all items computed, while the others have sampled the corresponding percentages of items. One can see that the density curve with more samples gets closer to the one of the entire set of data. The L1 errors, which are divided by the absolute HQS, between the true HQS and the sampled HQS are exhibited in creftype 5. It is clear that the errors decreases drastically as more items are sampled. After 30% sampling, the errors are less than 5% of the absolute value of the true HQS. As expected, the HQS of the cases with fewer samples are less predictable. Finally, the runtime is plotted in creftype 5, and is shown to follow approximately a linear relation.

Overall, HQS can benefit from that it is completely parallelisable. In addition, sampling techniques can also help improve the efficiency. We would like to emphasise that, although the runtime is not quick enough, it is improvable with better code and machines.

6 Conclusion

We have proposed HQS, an approach for assessing the quality of hierarchical clusters which can be used without ground-truth information. It employs POMDP for modelling the uncertainty and the decision making for searchers with regard to search and navigation in a hierarchy, and extends the model to measure the quality of the hierarchies. The experiments show that HQS can order hierarchies reasonably based on their qualities without ground-truth information, and perform better than a state-of-the-art approach which requires the ground-truth hierarchy. For future work, we will theoretically analyse the properties of the measure. Additionally, we will continue exploring better models and policies.