Learning event representations in image sequences by dynamic graph embedding

  • 2019-10-08 15:48:50
  • Mariella Dimiccoli, Herwig Wendt
  • 1

Abstract

Recently, self-supervised learning has proved to be effective to learnrepresentations of events in image sequences, where events are understood assets of temporally adjacent images that are semantically perceived as a whole.However, although this approach does not require expensive manual annotations,it is data hungry and suffers from domain adaptation problems. As analternative, in this work, we propose a novel approach for learning eventrepresentations named Dynamic Graph Embedding (DGE). The assumption underlyingour model is that a sequence of images can be represented by a graph thatencodes both semantic and temporal similarity. The key novelty of DGE is tolearn jointly the graph and its graph embedding. At its core, DGE works byiterating over two steps: 1) updating the graph representing the semantic andtemporal structure of the data based on the current data representation, and 2)updating the data representation to take into account the current data graphstructure. The main advantage of DGE over state-of-the-art self-supervisedapproaches is that it does not require any training set, but instead learnsiteratively from the data itself a low-dimensional embedding that reflectstheir temporal and semantic structure. Experimental results on two benchmarkdatasets of real image sequences captured at regular intervals demonstrate thatthe proposed DGE leads to effective event representations. In particular, itachieves robust temporal segmentation on the EDUBSeg and EDUBSeg-Desc benchmarkdatasets, outperforming the state of the art.

 

Quick Read (beta)

Learning event representations in image sequences
by dynamic graph embedding

Mariella Dimiccoli and Herwig Wendt M. Dimiccoli is with the Institut de Robòtica i Informàtica Industrial, CSIC-UPC, Barselona, Spain ([email protected]). H. Wendt is with the Institut de Recherche en Informatique de Toulouse, CNRS, University of Toulouse, France ([email protected]).
August 10, 2019
Abstract

Recently, self-supervised learning has proved to be effective to learn representations of events in image sequences, where events are understood as sets of temporally adjacent images that are semantically perceived as a whole. However, although this approach does not require expensive manual annotations, it is data hungry and suffers from domain adaptation problems. As an alternative, in this work, we propose a novel approach for learning event representations named Dynamic Graph Embedding (DGE). The assumption underlying our model is that a sequence of images can be represented by a graph that encodes both semantic and temporal similarity. The key novelty of DGE is to learn jointly the graph and its graph embedding. At its core, DGE works by iterating over two steps: 1) updating the graph representing the semantic and temporal structure of the data based on the current data representation, and 2) updating the data representation to take into account the current data graph structure. The main advantage of DGE over state-of-the-art self-supervised approaches is that it does not require any training set, but instead learns iteratively from the data itself a low-dimensional embedding that reflects their temporal and semantic structure. Experimental results on two benchmark datasets of real image sequences captured at regular intervals demonstrate that the proposed DGE leads to effective event representations. In particular, it achieves robust temporal segmentation on the EDUBSeg and EDUBSeg-Desc benchmark datasets, outperforming the state of the art. Index Terms: clustering, graph embedding, geometric learning, temporal context prediction, temporal segmentation, event representations

1 Introduction

Figure 1: Temporally adjacent frames in two events in first person image sequences.

Temporal segmentation of videos and image sequences has a long story of research since it is crucial not only to video understanding but also to video browsing, indexing and summarization [koprinska2001temporal, money2008video, del2017summarization]. With the proliferation of wearable cameras in recent years, the field is facing new challenges. Indeed, wearable cameras allow to capture, from a first-person (egocentric) perspective, and “in the wild”, long unconstrained videos (35fps) and image sequences (aka photostreams, 2fpm). Due to their low temporal resolution, the segmentation of first person image sequences is particularly challenging. Indeed, abrupt changes in appearance may arise within an event due to sudden camera movements, and event transitions that are smooth at higher sampling rate due to continuous recording are reduced to a few frames that are difficult to detect. While for a human observer it is relatively easy to segment egocentric image sequences into discrete units, this poses serious difficulties for automated temporal segmentation (see Fig. 1 for an illustration).

Given the limited amount of annotated data, current state-of-the-art approaches for the temporal segmentation of first-person image sequences aim at obtaining event representations by encoding the temporal context of each frame of a sequence in an unsupervised fashion [dias2019learning, del2018predicting]. These methods rely on neural or recurrent neural networks and are generally based on the idea of learning event representations by training the network to predict past and future frame representations. Recurrent neural networks have proved to be more efficient than simple neural networks for the temporal prediction task. The main limitation of these approaches is that they must rely on large training datasets to yield state-of-the-art performance. Even if, in this case, training data do not require manual annotations, they can always introduce a bias and suffer from the domain adaptation problem. For instance, in the case of temporal segmentation of image sequences, the results will be difficult to generalize to data acquired with a camera with different field of view or for people having different lifestyles.

In this paper, we aim at overcoming this limitation with a novel approach that is able to unveil a representation encoding the temporal structure of an image sequence from the single sequence itself. With this goal in mind, we propose to learn event representations as an embedding on a graph. Our model is based on the assumption that each event belongs to a community structure of semantically similar events on an unknown underlying graph, where communities are understood here as sets of nodes that are interconnected by edges with large weights. Moreover, the graph weights reflect not only temporal proximity, but also semantic similarity. This is motivated by neuroscientific findings showing that neural representations of events arise from temporal community structures [schapiro2013neural]. In other words, frames that share the temporal context are grouped together in the representational space. In Fig. 2 we illustrate the idea by means of an egocentric image sequence capturing the full day of a person: going from home to work using public transports, having a lunch break in a restaurant and going back to home after doing some shopping, etc. Each point cloud corresponds with images similar in appearance and most of them are visited multiple times. This means that every pair of images in a point cloud is related semantically, but they could or could not be related at temporal level.

Figure 2: The assumption underlying our learning approach is that image sequences captured at regular intervals (2fpm) can be organized into a graph structure, where each community structure in the graph corresponds to a particular temporal context, that captures both temporal and semantic relatedness. Points of the same color in the figure are related semantically and may be more or less related at temporal level. The arrows indicate temporal transitions between semantic clusters. They have only a visualization purpose, since temporal transition are between pairs of points.

Based on this model, the proposed solution consists in learning simultaneously the graph structure (encoded by its weights) and the data representation. This is achieved by iterating over two alternate steps: 1) the updating of the graph structure as a function of the current data representation, where the graph structure is assumed to encode a finite number of community structures, and 2) the updating of the data representation as a function of the current graph structure in a low-dimensional embedding space. We term this solution dynamic graph embedding (DGE). We provide illustrative experiments on synthetic data, and we validate the proposed approach on two real world benchmark datasets for first person image sequence temporal segmentation. Our framework is the first attempt to learn simultaneously graph structure and data representations for temporal segmentation of image sequences.

Our main contributions are: (i) we re-frame the event learning problem as the problem of learning a graph embedding, (ii) we introduce an original graph initialization approach based on the concept of temporal self-similarity, (iii) we propose a novel technical approach to solve the graph embedding problem when the underlying graph structure is unknown, (iv) we demonstrate that the learnt graph embedding is suitable for the task of temporal segmentation, achieving state-of-the-art results on two challenging reference benchmark datasets [dimiccoli2017sr, bolanos2018egocentric], without relying on any training set for learning the temporal segmentation model.

The structure of the paper is as follows. Section 2 highlights related work on data representation learning on graphs and on the temporal segmentation of videos and image sequences. In Section 3.1 we introduce our problem formulation while in Sections 3.2 to 3.5 we detail the proposed graph embedding model. The performance of our algorithm on real world data are evaluated in Section 4. In Section 5, we conclude on our contributions and results.

2 Related work

2.1 Geometric learning

The proposed approach lies in the field of geometric learning, which is an umbrella term for those techniques that work in non-Euclidean domains such as graphs and manifolds. Following [bronstein2017geometric], geometric learning approaches either address the problem of characterizing the structure of the data, or deal with analyzing functions defined on a given non-Euclidean domain. In the former case, which is more closely related with the method proposed in this paper, the goal is to learn an embedding of the data in a low dimensional space, such that the geometric relations in the embedding space reflect the graph structure. These methods are commonly referred to as node-embedding and can be understood from an encoder-decoder perspective [hamilton2017representation]. Given a graph 𝒢=(𝒱,), where 𝒱 and represents the set of nodes and edges of the graph respectively, the encoder maps each node v𝒱 of 𝒢 in a low dimensional space. The decoder is a function defined in the embedding space that acts on node pairs to compute a similarity S between the nodes. Therefore the graph embedding problem can be formulated as the problem of optimizing decoder and encoder mappings to minimize the discrepancy between similarity values in the embedding and original feature space. Within the general encoder-decoder framework, node embedding algorithms can be roughly classified into two classes: shallow embeddings [ahmed2013distributed, ou2016asymmetric, perozzi2014deepwalk, grover2016node2vec, tang2015line, chen2018harp, cao2015grarep] and generalized encoder-decoder architectures [cao2016deep, wang2016structural, hinton2006reducing, hamilton2017inductive]. In shallow embedding approaches, the encoder function acts simply as a lookup function and the input nodes vi𝒱 are represented as one-hot vectors, so that they cannot leverage node attributes during encoding.

Instead, in generalized encoder-decoder architectures [cao2016deep, wang2016structural, hinton2006reducing], the encoders depend more generally on the structure and attributes of the graph. In particular, convolutional encoders [hamilton2017inductive] generate embeddings for a node by aggregating information from its local neighborhood, in a manner similar to the receptive field of a convolutional kernel in computer vision. They rely on node features to generate embeddings and, as the process iterates, the node embedding contains information aggregated from further and further reaches of the graph. Closely related to convolutional encoders are Graph Neural Networks (GNNs). The main difference is that GNNs capture the dependence of graphs via message passing between the nodes of graphs and utilize node attributes and node labels to train model parameters end-to-end for a specific task in a semi-supervised fashion [henaff2015deep, garcia2017few, defferrard2016convolutional, niepert2016learning].

In all these methods, the graph structure is assumed to be given by the problem domain. For instance in social networks, the structure of the graph is given by the connection between people. However, in the case of temporal segmentation considered here, the problem is non-structural since the graph structure is not given by the problem domain but needs to be determined together with the node embedding.

2.2 Event segmentation

Extensive research has been conducted to temporally segment videos and image sequences into events. Early approaches aimed at segmenting edited videos such as TV programs and movies [kijak2006audiovisual, yuan2007formal, chasanis2009movie, liu2011exploiting, liu2013learning] into commercial, news-related or movie events. More recently, with the advent of wearable cameras and camera equipped smartphones, there has been an increasing interest in segmenting untrimmed videos or image sequences captured by nonprofessionals into semantically homogeneous units [hu2011survey, tang2012learning, Iwan2017]. In particular, videos or image sequences captured by a wearable camera are typically long and unconstrained [bolanos2017toward], therefore it is important for the user to have them segmented into semantically meaningful chapters. In addition to appearance-based features [bettadapura2016discovering, xu2015gaze], motion features have been extensively used for temporal segmentation of both third-person videos [koprinska2001temporal, tang2012learning] and first-person videos [huang2017egocentric, spriggs2009temporal, lee2015predicting, poleg2014temporal]. In [spriggs2009temporal], motion cues from a wearable inertial sensor are leveraged for temporal segmentation of human motion into actions. Lee and Grauman [lee2015predicting] used temporally constrained clustering of motion and visual features to determine whether the differences in appearance correspond to event boundaries or just to abrupt head movements. Poleg et al. [poleg2014temporal] proposed to use integrated motion vectors to segment egocentric videos into a hierarchy of long-term activities, whose first level corresponds to static/transit activities.

However, motion information is not available in first person image sequences, that are the main focus of this paper. In addition, given the limited amount of annotated data, event segmentation is very often performed by using a clustering algorithm relying on hand-crafted visual features [doherty2008investigating, doherty2008combining, lin2017vci2r, yamamoto2017pbg, lin2006structuring]. Tavalera et al. [talavera2015r] proposed to combine agglomerative clustering with a change detection method within a graph-cut energy minimization framework. Later on, [dimiccoli2017sr] extended this framework by improving the feature representations through building a vocabulary of concepts for representing each frame. Paci et al. [paci2016context] proposed a Siamese ConvNets based approach that aims at learning a similarity function between low-resolution egocentric images. Recently, del Molino et al. [del2018predicting] proposed to learn event representations by training in an unsupervised fashion a LSTM-based model to predict the temporal context of each frame, that is initially represented by the output of the pre-pooling layer of InceptionV3. This simple approach has shown outstanding results on reference benchmark datasets [dimiccoli2017sr, bolanos2018egocentric], as long as the model is trained on a large training dataset (over 1.2 million images). The method in [del2018predicting] is similar to the chronologically earlier approach reported in [dias2019learning], that proposed to train the model in a fully self-supervised fashion instead. Specifically, starting from a concept vector representation of each frame, the authors proposed a neural network based model and an LSTM-based model performing a self-supervised pretext task consisting in predicting the concept vectors of neighbor frames given the concept vector of the current frame. Consequently, unlike [del2018predicting], the single image sequence itself is used to learn the features encoding the temporal context, without need for a training dataset. The performance achieved in [dias2019learning] were less impressive than in [del2018predicting].

Here, we propose a new model that as [dias2019learning] does not make use of any training set for learning the temporal event representation, but achieves state-of-the-art results. In particular, it outperforms [del2018predicting] on the EDUBSeg and EDUBSeg-Desc benchmarks [dimiccoli2017sr, bolanos2018egocentric].

3 Dynamic Graph Embedding (DGE)

3.1 Problem formulation and proposed model

We formulate the event learning problem as a geometric learning problem. More specifically, given a set of data points (the frames of the image sequence) embedded into a high-dimensional Euclidean space (the initial data representation), we assume that these data points are organized in an underlying low dimensional graph structure. Our a priori on the structure of the graph in the underlying low dimensional space is that it consists of a finite number of community structures corresponding to different temporal contexts. Since along an image sequence, a same temporal context can be visited several times at different time intervals, we assume that the underlying graph consists of a finite number of community structures and that edges between nodes belonging to different communities correspond to transitions between different temporal contexts, whereas edges between nodes belonging to the same community correspond to transitions between nodes sharing the same temporal context, being them temporally adjacent or not. This structure implicitly assumes that the graph topology models jointly temporal and semantic similarity relations.

More formally, given a sequence of N images and their feature vectors XN×n, we aim at finding a fully connected, weighted graph 𝒢~=(X~,,𝒲~) with node embedding X~N×d in a low dimensional space, dn, and edge weights 𝒲~ given by the entries of an affinity matrix G~=S(X~), where S is a similarity kernel, such that the similarity G~kj between any pair k,j of nodes of the graph reflects both semantic relatedness and temporal adjacency between the images. Semantic relatedness is captured by a similarity function between semantic image descriptors, whereas temporal adjacency is imposed through temporal constraints injected in the graph structure. The above constraints lead to easily grouping the graph nodes in a finite number of clusters, each corresponding to a different temporal context. As seen in the previous section, in classical node embedding the low dimensional representation of each node encodes information about the position and the structure of local neighborhood in the graph. Since all these methods incorporate graph structure in some way, the construction of the underlying graph is extremely important, but relatively little explored. In our problem at hand the graph structure is initially unknown since it arises from events, that are unknown. Therefore, we aim at learning jointly the structure of the underlying graph and the node embedding.

3.2 Graph initialization by nonlocal self-similarity

Temporal nonlocal self-similarity.  Let XN×n be the set of N given data points in a high-dimensional space n, normalized in the interval [-1,1]. To obtain a first coarse estimate of the graph, we apply a nonlocal self-similarity algorithm in the temporal domain to the initial data X [dimiccoli2019enhancing]. The nonlocal self-similarity filtering creates temporal neighborhoods of frames that are likely to be in the same event. Let X(k)n denote the k-th row of X, that is, the vector of n image features at time k, k=1,,N. Further, let 𝒩kM={k-M,,k-1,k+1,,k+M} and 𝒩kL={k-L,,k-1,k+1,,k+L} denote the indices of the 2M and 2L neighboring feature vectors of X(k), respectively, with L>M. In analogy with 2D data (images) [buades2005non], the self-similarity function of X(k) in a temporal sequence, conditioned to its temporal neighborhood j𝒩kM, is given by the quantity [dimiccoli2019enhancing]

SNL(k,j)=1𝒵(k)exp(-dist(X(𝒩kM),X(𝒩jM))h), (1)

where 𝒵(k) is a normalizing factor such that j𝒩kLSkjNL=1, ensuring that SkjNL can be interpreted as a conditional probability of X(j) given X(𝒩k), as detailed in [buades2005non], dist(X(𝒩k),X(𝒩j))=i=12M||X(𝒩k(i))-X(𝒩j(i))||1 is the sum of the 1 distances of the vectors in the neighborhoods of k and j, and h is the parameter that tunes the decay of the exponential function. The key idea of our graph initialization is to model each frame k by its denoised version, obtained as

X^(k) = NLmeans1D(X(k))=j𝒩kLSkjNLX(j),
X^0 = (X^(1)TX^(2)TX^(N)T)T. (2)

A numerical illustration on real data is provided in Fig. 4.

Initial graph and initial embedding.  An initial graph 𝒢0 is obtained by computing the N×N affinity matrix G0=Sl^(X^0) of X^0, defined elementwise as the pairwise similarity

(Sl(X))kj=exp(-cdist(X(j),X(k))l) (3)

where cdist(,) is the cosine distance and l the filtering parameter of the exponential function. In the following, we will not distinguish any longer between a graph 𝒢 and its representation by its affinity matrix G and make use both symbols synonymously. In our model, 𝒢0 represents the initial data structure in the original high dimensional space as a fully connected graph, from which we would like to learn the graph in the embedding space, say 𝒢~, that better encodes temporal constraints. To obtain an initial embedding X~0 for the graph, we apply PCA on X^0, keep the d major principal components X~, and minimize the cross-entropy loss between the affinity matrices Sl^(X^0) and Sl~(X~0)

X~0=argminX~(Sl~(X~),G0)

where the different filtering parameters l^ and l~ account for the different dimensionality of X^0 and X~0. Even if PCA is a linear operator and for small sets of high-dimensional vectors dual PCA could be more appropriate [giannakis2018topology], we found it sufficient here for initializing the algorithm. The initial graph 𝒢~0 in the embedding space is then given by G~0=Sl(X~0).

3.3 DGE core alternating steps

Given the initial embedding X~0 and graphs 𝒢0 and 𝒢~0, the main loop of our DGE alternates over the following two steps:

  1. 1.

    Assuming that 𝒢~i-1 is fixed, update the node representations X~i.

  2. 2.

    Assuming that X~i is fixed, update the graph 𝒢~i.

\SetKwDataLeftleft\SetKwDataThisthis\SetKwDataUpup \SetKwFunctionUnionUnion\SetKwFunctionFindCompressFindCompress N- length of the image sequence
n- original feature dimension
d- embedding feature dimension (dn)
\SetKwInOutInputInput\SetKwInOutOutputOutput \Input XN×n initial feature matrix \Output X~N×dgraph embedded feature matrix \BlankLine\tcc Reference graph initialization Eq. (1-3) X^0=NLmeans1D(X)N×n denoise initial features
G0=Sl^(X^0)N×N initialize graph in original space
\tcc Graph embedding initialization Eq. (3) X~=PCAd(X^0)N×d
X~0=argminX~(Sl(X~),G0) initialize embedding features
G~0=Sl~(X~0)N×N initialize graph in embedding space\tccDGE core loop Eqs. (4-6) \Fori1 \KwToK --- graph embedding update
X~i=argminX~(1-α)1(Sl~(X~),G~i-1)+α2(Sl~(X~),G0)
update embedding features given current graph 𝒢~i-1
--- graph structure update: temporal prior
G~iSl~(X~i)𝒦p local average of weights of graph of X~i
G~ifη(G~i) strengthen temporally adjacent edges
--- graph structure update: semantic prior
𝒞=kmeansNC(X~i;NC) estimate semantic communities
G~i=gμ(G~i,𝒞;μ) encode semantic similarity in graph
ALGORITHM 1 Dynamic Graph Embedding

Step (1) is inspired from graph embedding methods, such as the ones reviewed in Section 2.1, that have proved to be very good at encoding a given graph structure. Step (2) aims at enforcing temporal constraints and at fostering semantic similarity in the graph structure.

Graph embedding update.  To estimate the graph embedding X~i at iteration i assuming that 𝒢~i-1 is given, we solve

X~i=argminX~(1-α)1(Sl~(X~),G~i-1)+α2(Sl~(X~),G0). (4)

Here 1 and 2 are cross-entropy losses and Sl(,) is the cosine-based similarity defined in (3). The first loss term controls the fit of the representation X~ with the learnt graph 𝒢~i in low dimensional embedding space. The second loss term quantifies the fit of the representation X~ with the fixed initial graph 𝒢0 in high dimensional space and is reminiscent of shallow graph embedding; α[0,1] is a regularization parameter that controls the weight of each loss. Standard gradient descent can be used to solve (4).

Graph structure update.  To obtain an estimate of the graph structure at the i-th iteration, say 𝒢~i, assuming that X~i is given, we start from an initial estimate for 𝒢~i as G~i=Sl~(X~i), and then make use of the model assumptions described in Section 3.1 to modify the graph: temporal adjacency, and semantic similarity.

i) To foster similarity for temporally adjacent nodes, we apply two operations. First, local averaging of the edge weights, G~i=G~i𝒦p, where is the 2D convolution operator and 𝒦p a p×p kernel that is here simply the normalized bump function. Second, and more importantly, we apply a non-linear operation to 𝒢~i that scales down by a factor η, 0<η<1, the weights of edges between nodes k and j that are not direct temporal neighbors (i.e., for which |k-j|>1) while leaving similarities of directly temporally adjacent nodes unchanged

(G~i)kj(fη(G~i))kj={(1-η)(G~i)kjif |k-j|>1(1-η)(G~i)kjotherwise, (5)

thus strengthening the temporal adjacency of the graph.

ii) To reinforce the semantic similarity of X~i, we first obtain a coarse estimate of the community structure of the graph 𝒢~i. To this end, we apply a clustering algorithm on X~i, which yields estimated cluster labels 𝒞=(cj)j=1NC, cj{1,,NC}, for each frame. Then we update 𝒢~i by applying to it a non-linear function gμ that reduces the similarity between nodes k and j that do not belong to the same cluster, cjck, by a factor μ, 0<μ<1, and does not change similarities within clusters, i.e.,

(G~i)kj(gμ(G~i,𝒞))kj={(1-μ)(G~i)kjif cjck(1-μ)(G~i)kjotherwise. (6)

For μ<1, this operation hence reinforces within-event semantic similarity.

DGE aims revealing at the temporal and semantic relatedness for each pair of data vectors, and therefore the estimated graphs 𝒢~i, i=0,,K are fully connected at each stage. A high-level overview of our DGE approach can be found on ALGORITHM 1.

3.4 Graph post-processing: Event boundary detection

Depending on the problem, applicative context and objective, different standard and graph signal processing tools can be applied to the estimated graph 𝒢~K in order to extract the desired information, or transform 𝒢~K [wang2015global, ortega2018graph]. In this work, the focus is on event detection in image sequences and we thus directly use the learnt representation X~K corresponding to 𝒢~K in a contextual event boundary detector.

Boundary detector.  Since the focus of this paper is on how to improve event representation and not on the temporal segmentation algorithm itself, we used the same boundary detector as in [del2018predicting]. Such a boundary detector is based on the idea that the larger the distance between the predicted visual contexts for a frame k, once computed forward (from past frames j<k) and once backward (from future frames j>k), is, the more likely this frame will correspond to an event boundary. Hence, the boundary prediction function is defined as the distance between the past and future contexts of frame k: pred(k)=dist(ctxP,ctxF), where dist(,) is the cosine distance and the temporal contexts ctxF and ctxP are defined as the average of the representation vectors X~ of a small number of the previous (or past) frames. Those frames for which the values of the boundary prediction function exceed a threshold

Figure 3: Illustration of DGE on synthetic data (Gaussian mixture with 4 components, d=2): (a) scatter plot of features X~i for iterations i=(0,1,3,10) (left to right column, respectively); mixture components are indicated with color, solid lines indicate temporal adjacency; (b) the features X~i plotted as time series, with estimated segment boundaries (vertical dashed lines) and component number ground truth (solid black); (c) the corresponding graphs 𝒢~i and (d) 𝒢~i after pruning edges with weight smaller than 0.7 times that of the strongest edge (yellow and blue color correspond with strong and weak similarity, respectively); (e) F-score, precision and recall for estimated segmentation as a function of iteration number i.

are the detected event boundaries, see [del2018predicting] for details. We use the same parameter values and thresholds as in [del2018predicting].

Hereafter we call our temporal segmentation model relying on the features learnt by using the proposed DGE approach CES-DGE, in analogy with CES-VCP in [del2018predicting] (where CES stands for contextual event segmentation and VCP for visual context prediction).

3.5 Numerical illustration of CES-DGE on synthetic data

To illustrate the main idea behind our modeling, in Fig. 3 we show with a synthetic example in n=2 dimensions how the original data X0 and the associated initial graph 𝒢~0 change over 10 iteration of the DGE algorithm (here, d=n=2, and we assume X~0=X^0=X0, i.e., no preprocessing). The data consist of a temporal sequence of N=350 feature vectors that are drawn from a Gaussian mixture with 4 equi-probable components with mean vectors (-7.8 5.1),(-1.4 2.8),(-2.9 12.1),(2.5 3.4) and diagonal covariance matrix σ𝐈 with σ=3.7. To model the community structures in the data, after each switch to a new state the state is maintained with probability that decreases from 1 at an exponential rate with time. In Fig. 3, panel (a) shows the learnt representations X~i as scatter plots for iterations i=(0,1,3,10), where colors indicate the true cluster labels and connections between data points temporal adjacency; panel (b) plots X~i as time series, together with event boundary detection as dashed bars and ground truth labels as solid bars; panel (c) shows the corresponding graph 𝒢~i and panel (e) summarizes boundary detection performance as a function of iteration number i; in addition, to highlight the estimated event graph structure, panel (d) plots a pruned version of 𝒢~i in which all edges with weight below 70% the weight of the strongest edge are removed. Clearly, it is difficult to obtain a good segmentation for the initial, cleaned data (iteration i=0, left column). It can be appreciated how, with a few DGE iterations, the embedded data points X~i belonging to the four different components are pushed apart, and the distance between temporally adjacent points belonging to the same components is reduced, increasing the similarity within each temporal context and revealing the underlying data generating mechanism (Fig. 3 (a-b), columns 2 to 4, respectively). Moreover, the learnt representation remains faithful to the temporal structure within each segment (e.g., the position and relative size of local maxima and minima is not changed). An alternative view is provided by the corresponding learnt graphs 𝒢~i in Fig. 3 (c-d), for which increasingly homogeneous diagonal and off-diagonal blocks with sharp boundaries emerge with increasing number of iterations i, reflecting the temporal community structure underlying the data. This improved representation leads in turn to significantly better event segmentation results, with F-score increasing from initially 0.59 to 0.62, 0.64, 0.67 for iterations i=1,3,4-10, respectively.

Parameter F-score
K 1 2 3 4 5
(DGE iterations) 0.655 0.698 0.682 0.671 0.662
α 0.05 0.1 0.2 0.3 0.4
(initial graph memory) 0.679 0.698 0.690 0.685 0.682
p 2 3 4 5 6
(2D local average size) 0.691 0.698 0.691 0.687 0.679
η 0.01 0.1 0.3 0.4 0.6
(graph temporal regularization) 0.685 0.688 0.698 0.686 0.672
μ 0.03 0.05 0.1 0.2 0.3
(extra-cluster penalty) 0.675 0.697 0.698 0.687 0.670
NCL 3 4 6 8 10
0.650 0.656 0.666 0.682 0.698
(cluster number) 12 14 16 18 20
0.686 0.685 0.677 0.674 0.676
Table 1: Robustness of CES-DGE with respect to hyperparameter values (EDUBSeg, tolerance τ=5): F-scores obtained when varying the DGE hyperparameter indicated in the first column, with all others held fixed (best results in bold).

3.6 Comparative results for EDUBSeg dataset

Method F-score Rec Prec
k-Means smoothed 0.51 0.39 0.82
AC-color 0.38 0.25 0.90
SR-ClusteringCNN 0.53 0.68 0.49
KTS 0.53 0.40 0.87
CES-VCP 0.69 0.77 0.66
CES-DGE 0.70 0.70 0.72
Manual segmentation 0.72 0.68 0.80
Table 2: Comparison of CES-DGE with state-of-the-art methods & manual segmentation for EDUBSeg and tolerance τ=5.
Method CES-VCP CES-DGE
Tolerance F-score Rec Prec F-score Rec Prec
τ=5 0.69 0.77 0.66 0.70 0.70 0.72
τ=4 0.67 0.75 0.63 0.68 0.69 0.70
τ=3 0.64 0.62 0.71 0.65 0.64 0.68
τ=2 0.59 0.67 0.56 0.59 0.59 0.61
τ=1 0.44 0.44 0.49 0.48 0.48 0.50
Table 3: Comparison of CES-DGE with state-of-the-art CES-VCP for different values of tolerance for EDUBSeg.

4 Performance evaluation

4.1 Datasets and experimental setup

Datasets. 

We used two temporal segmentation benchmark datasets for performance evaluation. The EDUBSeg dataset, introduced in [dimiccoli2017sr], consists of 20 first person image sequences acquired by seven people with a total of 18,735 images. The dataset has been used to validate image sequence temporal segmentation in several recent works [dimiccoli2017sr, dias2019learning, paci2016context, del2018predicting]. For each image sequence, 3 manual segmentations have been obtained by 3 different persons; in line with previous works, the first segmentation was used here as ground truth. The second dataset is the larger and more recent EDUBSeg-Desc dataset [bolanos2018egocentric], with 46 sequences (42,947 images) acquired by eight people. Every frame of the egocentric image sequences was described here using the output of the pre-pooling layer of InceptionV3 [szegedy2016rethinking] pretrained on ImageNet as in [del2018predicting], resulting in n=2048 features.

On average, the sequences of both datasets contain roughly the same number of ground truth event boundaries (28 for the former vs. 26 for the latter), but those of EDUBSeg-Desc dataset consist of 25% longer continuously recorded segments than EDUBSeg (3h46m25s vs. 3h1m29s continuous “camera on” time, and 3.0 vs. 3.55 continuously recorded segments per sequence). Because of this increased continuity with a larger number of harder to detect continuous event transitions, EDUBSeg-Desc is considered more difficult.

Other publicly available datasets of egocentric image sequences, such as CLEF [dang2017overview], NTCIR [gurrin2017overview] and the more recent R3 [del2018predicting], do not have ground truth event segmentations. They can therefore not be used for performance evaluation. In [del2018predicting], these datasets with more than 1.2 million images were used as training sets to learn the temporal context representation. We emphasize that in contrast, our algorithm operates fully self-supervised instead, without training dataset.

Performance evaluation.  Following previous work, we use F-score, Precision (Prec) and Recall (Rec) to evaluate the performance of our approach. In previous works [paci2016context, dimiccoli2017sr, del2018predicting, dias2019learning], results have been reported for a single level of tolerance τ=5 (i.e., time interval of length 2τ within which a detected event boundary is considered correct). Here, we report results for several values for the level of tolerance τ[1,T], where T=5.

DGE hyperparameters.  The hyperparameter values for the graph initialization and embedding (i.e., for the non-local self-similarity kernel (1) and for the similarity (3)) have been chosen a priori based on visual inspection of the similarity matrices of X^0 and X~0 for a couple of sequences of EDUBSeg. The values are fixed to L=3, M=1, h=0.25 for Eq. (1), and l^=0.001, l~=0.3 for Eq. (3). The embedding dimension is set to d=15, which is found sufficient for the representation to faithfully reproduce the graph topology underlying the data (larger values for d were tested but did not yield performance gains). The DGE core loop hyperparameters are set to α=0.1 (graph regularization), p=3, η=0.3 (temporal prior), and μ=0.1, NC=10 (semantic prior; a k-means algorithm is used to estimate cluster labels); robustness of this choice is investigated in the next section.

4.2 Robustness to changes in hyperparameter values

Tab. 1 reports F-score values obtained on the EDUBSeg dataset when the DGE iterations K and the DGE core parameters α, p, η, μ and NCL are varied individually. It can be appreciated that the performance of CES-DGE is overall very robust w.r.t. precise hyperparameter values. The highest sensitivity is observed for the DGE iteration number K, which should be chosen as 1<K<5 so that F-score values are at most 3% below the best observed F-score. Results are very robust w.r.t. temporal regularization for reasonably small parameter values of p5 and η0.4, for larger values the learnt representation is over-smoothed. Similarly, F-score values vary little when changing the semantic similarity parameter as long as 0<μ0.2 (note that these variations are significant since η,μ(0,1)). Finally, the number of clusters used to estimate semantic similarity can be chosen within a reasonable range (F-score values drop by less than 3% for NCL(6,,20)). Overall, these results suggest that CES-DGE is quite insensitive to hyperparameter tuning and yields robust segmentation for a wide range of parameter values.

Tab. 2 reports comparisons with five state-of-the-art methods for a fixed value of tolerance (τ=5, results reproduced from [del2018predicting]). The first four (k-Means smoothed with k=30, AC-color [lee2015predicting], SR-Clustering [dimiccoli2017sr], KTS [potapov2014category]) are standard/generic approaches and achieve modest performance, with F-scores no better than 0.53. CES-VCP of [del2018predicting] yields significantly better F-score of 0.69, thanks to the use of a large training set for learning the event representation. The proposed CES-DGE approach further improves on this state-of-the-art result and yields F-score of 0.70. Interestingly, CES-DGE also achieves more balanced Rec and Prec values of 0.70 and 0.72, as compared to 0.77 and 0.66 for CES-VCP. Finally, even when compared to average manual segmentation performance, estimated by averaging the performance of the two remaining manual segmentations for EDUBSeg against the selected ground truth, our results are only 2% below.

In Table 3, we provide comparisons between our proposed approach and CES-VCP for different values of tolerance. We can observe that CES-DGE achieves systematically better results than CES-VCP in terms of F-score for all values of tolerance. These improvements with respect to the state of the art reach up to 4%. Besides, Rec/Prec values for CES-DGE are more balanced and within ±3% of the values for F-score for all tolerance levels (±8% of F-score for CES-VCP).

Overall, this leads to conclude that the proposed CES-DGE approach is effective in learning event representations for image sequences and yields robust segmentation results. These results are even more remarkable considering that CES-DGE learns feature representations from the sequence itself, without relying on any training dataset.

4.3 Comparative results for EDUBSeg-Desc dataset

Tab. 4 summarizes the event boundary detection performance of the proposed CES-DGE approach and of CES-VCP for the larger EDUBSeg-Desc dataset; since CES-VCP reported state-of-the-art results (with F-score values 16% above other methods, cf., Tab. 2), we omit comparison with other methods in what follows for space reasons. The same (hyper)parameter values as for EDUBSeg are used. It can be observed that the performance for both CES-VCP and CES-DGE are inferior to those reported for the EDUBSeg dataset in the previous section, for all tolerance values, indicating that EDUBSeg-Desc contains more difficult image sequences. Interestingly, while the F-scores achieved by CES-VCP are up to 12% (and more than 11% on average) below that reported for EDUBSeg, the F-scores of the proposed CES-DGE approach are at worst 5% smaller. In other words, CES-DGE yields up to 8% (on average 6%) better F-score values than the state of the art for the EDUBSeg-Desc dataset. Our CES-DGE also yields systematically better Rec and Prec values, for all levels of tolerance. Overall, these findings corroborate those obtained for the EDUBSeg dataset and confirm the excellent practical performance of the proposed approach. The results further suggest that CES-DGE, by relying only on the image sequence itself for learning event representations, can benefit from improved adaptability and robustness as compared to approaches that leverage on training datasets.

Method CES-VCP CES-DGE
Tolerance F-score Rec Prec F-score Rec Prec
τ=5 0.57 0.59 0.60 0.65 0.67 0.65
τ=4 0.56 0.58 0.58 0.63 0.66 0.63
τ=3 0.52 0.54 0.54 0.60 0.62 0.60
τ=2 0.49 0.50 0.50 0.54 0.56 0.54
τ=1 0.43 0.44 0.45 0.45 0.46 0.45
Table 4: Comparison of CES-DGE with state-of-the-art CES-VCP for different values of tolerance for EDUBSeg-Desc.

4.4 Ablation study

Graph initialization.  We investigate performance obtained by applying the boundary detector to features obtained at different stages of our method: the original features X (denoted CES-raw) and the features X^0 obtained by applying NLmeans on the temporal dimension (CES-NLmeans 1D), both of dimension n=2048; the initial embedded features X~0 (CES-Embedding) and the features obtained after running the DGE main loop for K=2 iterations (CES-DGE), both of dimension d=15. The results, obtained for the EDUBSeg dataset, are reported in Tab. 5. They indicate that CES-NLmeans 1D increases F-score by 8% w.r.t. CES-raw, and CES-Embedding adds another 1% in F-score. CES-DGE gains an additional 9%, hence significantly improves upon this initial embedding. This confirms that the graph initialization and the reduction of the dimension of the graph representation is beneficial.

An illustration of the effect of the graph initialization and of the DGE steps for EDUBSeg Subject 1 Day 3 is provided in Fig. 4. It can be observed how boundaries between temporally adjacent frames along the diagonal in the graph are successively enhanced when the original features X0 (column 1) are replaced with the denoised version X^0 (column 2), with the embedded features X~0 (column 3), and finally with the DGE representation estimates (columns 4 & 5 for DGE iterations i=1,2, respectively). Moreover, the boundaries of off-diagonal blocks that correspond to segments of frames that ressemble segments at other temporal locations, presumably of the same event, are sharpened, revealing the community structures.

DGE core operations.  In Tab. 6, we report the performance that is obtained on the EDUBSeg dataset when the different operations in the DGE core iterations are removed one-by-one by setting the respective parameter to zero: graph regularization (α), edge local averaging (p), temporal edge weighting (η), and extra-cluster penalization (μ). It is observed that the overall DGE F-score drops by 0.4% to 3.2% when one single of these operations is deactivated (versus a drop of 8.6% from 0.698 to 0.612 when no DGE operation is performed at all, as discussed above). The fact that removing any of the operations individually does not lead to a knock-out of the DGE loop suggests that the associated individual (temporal & semantic) model assumptions are all and independently important. Among all operations, the largest individual F-score-drop (3.2%) corresponds with deactivating the extra-cluster penalization. This points to the essential role of semantic similarity in the graph model. The smallest F-score difference is associated with edge local averaging; to improve this temporal regularization step, future work could use nonlocal instead of local averaging, or learnt kernels. However, the graph temporal edge weighting is effective in encoding the temporal prior (1.5% F-score drop if deactivated).

Method F-score Rec Prec
CES-raw 0.52 0.56 0.56
CES-NLmeans 1D 0.60 0.63 0.61
CES-Embedding 0.61 0.61 0.65
CES-DGE 0.70 0.70 0.72
Table 5: CES-DGE ablation study for EDUBSeg and tolerance τ=5.
Deactivated DGE parameter α=0 p=0 η=0 μ=0
F-score -0.685 -0.694 -0.683 -0.666
difference with full DGE (0.698) -0.013 -0.004 -0.015 -0.032
Table 6: F-scores obtained when single core steps are removed from DGE (indicated by a zero value for the respective parameter).
Figure 4: Illustration of DEG on Subject 1 Day 3 of EDUBSeg. Panel (a): Data and learnt representations with boundary estimates (red dashed vertical bars), ground truth (blue solid vertical bars) and resulting F-score values; from left to right initial features X0 (1st column), denoised features X^0 (2nd column), initial embedded features X~i=0 (3rd column), learnt representation X~i after a few iterations i (4th & 5th column). The panel (b) shows the corresponding graphs 𝒢0, 𝒢^0 and 𝒢~i, and panel (c) the graph after pruning edges with weight smaller than 0.3 times the strongest edge (dark blue corresponds with small weights and yellow corresponds with large weights, respectively).

5 Conclusion

This paper proposed a novel approach to learn representations of events in low temporal resolution image sequences, named Dynamic Graph Embedding (DGE). Unlike state of the art work, which requires (large) datasets for training the model, DGE operates in a fully self-supervised way and learns the temporal event representation for an image sequence directly from the sequence itself. To this end, we introduced an original model based on the assumption that the sequence can be represented as a graph that captures both the temporal and the semantic similarity of the images. The key novelty of our DGE approach is then to learn the structure of this unknown underlying data generating graph, jointly with a low dimensional graph embedding. This is, to the best of our knowledge, one of the first attempts to learn simultaneously graph structure and data representations for image sequences. Experimental results have shown that DGE yields robust and effective event representations and outperforms the state of the art in terms of event boundary detection precision, improving F-score values by 1% and 8% on the EDUBSeg and EDUBSeg-Desc event segmentation benchmark datasets, respectively. Future work will include exploring the use of more sophisticated methods than the k-means algorithm in the semantic similarity estimation step, and other applications in the field of Multimedia, such as the temporal segmentation of wearable sensor data streams.

References