Abstract
Recently, selfsupervised learning has proved to be effective to learnrepresentations of events in image sequences, where events are understood assets of temporally adjacent images that are semantically perceived as a whole.However, although this approach does not require expensive manual annotations,it is data hungry and suffers from domain adaptation problems. As analternative, in this work, we propose a novel approach for learning eventrepresentations named Dynamic Graph Embedding (DGE). The assumption underlyingour model is that a sequence of images can be represented by a graph thatencodes both semantic and temporal similarity. The key novelty of DGE is tolearn jointly the graph and its graph embedding. At its core, DGE works byiterating over two steps: 1) updating the graph representing the semantic andtemporal structure of the data based on the current data representation, and 2)updating the data representation to take into account the current data graphstructure. The main advantage of DGE over stateoftheart selfsupervisedapproaches is that it does not require any training set, but instead learnsiteratively from the data itself a lowdimensional embedding that reflectstheir temporal and semantic structure. Experimental results on two benchmarkdatasets of real image sequences captured at regular intervals demonstrate thatthe proposed DGE leads to effective event representations. In particular, itachieves robust temporal segmentation on the EDUBSeg and EDUBSegDesc benchmarkdatasets, outperforming the state of the art.
Quick Read (beta)
Learning event representations in image sequences
by dynamic graph embedding
Abstract
Recently, selfsupervised learning has proved to be effective to learn representations of events in image sequences, where events are understood as sets of temporally adjacent images that are semantically perceived as a whole. However, although this approach does not require expensive manual annotations, it is data hungry and suffers from domain adaptation problems. As an alternative, in this work, we propose a novel approach for learning event representations named Dynamic Graph Embedding (DGE). The assumption underlying our model is that a sequence of images can be represented by a graph that encodes both semantic and temporal similarity. The key novelty of DGE is to learn jointly the graph and its graph embedding. At its core, DGE works by iterating over two steps: 1) updating the graph representing the semantic and temporal structure of the data based on the current data representation, and 2) updating the data representation to take into account the current data graph structure. The main advantage of DGE over stateoftheart selfsupervised approaches is that it does not require any training set, but instead learns iteratively from the data itself a lowdimensional embedding that reflects their temporal and semantic structure. Experimental results on two benchmark datasets of real image sequences captured at regular intervals demonstrate that the proposed DGE leads to effective event representations. In particular, it achieves robust temporal segmentation on the EDUBSeg and EDUBSegDesc benchmark datasets, outperforming the state of the art. Index Terms: clustering, graph embedding, geometric learning, temporal context prediction, temporal segmentation, event representations
1 Introduction
Temporal segmentation of videos and image sequences has a long story of research since it is crucial not only to video understanding but also to video browsing, indexing and summarization [koprinska2001temporal, money2008video, del2017summarization]. With the proliferation of wearable cameras in recent years, the field is facing new challenges. Indeed, wearable cameras allow to capture, from a firstperson (egocentric) perspective, and “in the wild”, long unconstrained videos ($\approx $35fps) and image sequences (aka photostreams, $\approx $2fpm). Due to their low temporal resolution, the segmentation of first person image sequences is particularly challenging. Indeed, abrupt changes in appearance may arise within an event due to sudden camera movements, and event transitions that are smooth at higher sampling rate due to continuous recording are reduced to a few frames that are difficult to detect. While for a human observer it is relatively easy to segment egocentric image sequences into discrete units, this poses serious difficulties for automated temporal segmentation (see Fig. 1 for an illustration).
Given the limited amount of annotated data, current stateoftheart approaches for the temporal segmentation of firstperson image sequences aim at obtaining event representations by encoding the temporal context of each frame of a sequence in an unsupervised fashion [dias2019learning, del2018predicting]. These methods rely on neural or recurrent neural networks and are generally based on the idea of learning event representations by training the network to predict past and future frame representations. Recurrent neural networks have proved to be more efficient than simple neural networks for the temporal prediction task. The main limitation of these approaches is that they must rely on large training datasets to yield stateoftheart performance. Even if, in this case, training data do not require manual annotations, they can always introduce a bias and suffer from the domain adaptation problem. For instance, in the case of temporal segmentation of image sequences, the results will be difficult to generalize to data acquired with a camera with different field of view or for people having different lifestyles.
In this paper, we aim at overcoming this limitation with a novel approach that is able to unveil a representation encoding the temporal structure of an image sequence from the single sequence itself. With this goal in mind, we propose to learn event representations as an embedding on a graph. Our model is based on the assumption that each event belongs to a community structure of semantically similar events on an unknown underlying graph, where communities are understood here as sets of nodes that are interconnected by edges with large weights. Moreover, the graph weights reflect not only temporal proximity, but also semantic similarity. This is motivated by neuroscientific findings showing that neural representations of events arise from temporal community structures [schapiro2013neural]. In other words, frames that share the temporal context are grouped together in the representational space. In Fig. 2 we illustrate the idea by means of an egocentric image sequence capturing the full day of a person: going from home to work using public transports, having a lunch break in a restaurant and going back to home after doing some shopping, etc. Each point cloud corresponds with images similar in appearance and most of them are visited multiple times. This means that every pair of images in a point cloud is related semantically, but they could or could not be related at temporal level.
Based on this model, the proposed solution consists in learning simultaneously the graph structure (encoded by its weights) and the data representation. This is achieved by iterating over two alternate steps: 1) the updating of the graph structure as a function of the current data representation, where the graph structure is assumed to encode a finite number of community structures, and 2) the updating of the data representation as a function of the current graph structure in a lowdimensional embedding space. We term this solution dynamic graph embedding (DGE). We provide illustrative experiments on synthetic data, and we validate the proposed approach on two real world benchmark datasets for first person image sequence temporal segmentation. Our framework is the first attempt to learn simultaneously graph structure and data representations for temporal segmentation of image sequences.
Our main contributions are: (i) we reframe the event learning problem as the problem of learning a graph embedding, (ii) we introduce an original graph initialization approach based on the concept of temporal selfsimilarity, (iii) we propose a novel technical approach to solve the graph embedding problem when the underlying graph structure is unknown, (iv) we demonstrate that the learnt graph embedding is suitable for the task of temporal segmentation, achieving stateoftheart results on two challenging reference benchmark datasets [dimiccoli2017sr, bolanos2018egocentric], without relying on any training set for learning the temporal segmentation model.
The structure of the paper is as follows. Section 2 highlights related work on data representation learning on graphs and on the temporal segmentation of videos and image sequences. In Section 3.1 we introduce our problem formulation while in Sections 3.2 to 3.5 we detail the proposed graph embedding model. The performance of our algorithm on real world data are evaluated in Section 4. In Section 5, we conclude on our contributions and results.
2 Related work
2.1 Geometric learning
The proposed approach lies in the field of geometric learning, which is an umbrella term for those techniques that work in nonEuclidean domains such as graphs and manifolds. Following [bronstein2017geometric], geometric learning approaches either address the problem of characterizing the structure of the data, or deal with analyzing functions defined on a given nonEuclidean domain. In the former case, which is more closely related with the method proposed in this paper, the goal is to learn an embedding of the data in a low dimensional space, such that the geometric relations in the embedding space reflect the graph structure. These methods are commonly referred to as nodeembedding and can be understood from an encoderdecoder perspective [hamilton2017representation]. Given a graph $\mathcal{G}=(\mathcal{V},\mathcal{E})$, where $\mathcal{V}$ and $\mathcal{E}$ represents the set of nodes and edges of the graph respectively, the encoder maps each node $v\in \mathcal{V}$ of $\mathcal{G}$ in a low dimensional space. The decoder is a function defined in the embedding space that acts on node pairs to compute a similarity $S$ between the nodes. Therefore the graph embedding problem can be formulated as the problem of optimizing decoder and encoder mappings to minimize the discrepancy between similarity values in the embedding and original feature space. Within the general encoderdecoder framework, node embedding algorithms can be roughly classified into two classes: shallow embeddings [ahmed2013distributed, ou2016asymmetric, perozzi2014deepwalk, grover2016node2vec, tang2015line, chen2018harp, cao2015grarep] and generalized encoderdecoder architectures [cao2016deep, wang2016structural, hinton2006reducing, hamilton2017inductive]. In shallow embedding approaches, the encoder function acts simply as a lookup function and the input nodes ${v}_{i}\in \mathcal{V}$ are represented as onehot vectors, so that they cannot leverage node attributes during encoding.
Instead, in generalized encoderdecoder architectures [cao2016deep, wang2016structural, hinton2006reducing], the encoders depend more generally on the structure and attributes of the graph. In particular, convolutional encoders [hamilton2017inductive] generate embeddings for a node by aggregating information from its local neighborhood, in a manner similar to the receptive field of a convolutional kernel in computer vision. They rely on node features to generate embeddings and, as the process iterates, the node embedding contains information aggregated from further and further reaches of the graph. Closely related to convolutional encoders are Graph Neural Networks (GNNs). The main difference is that GNNs capture the dependence of graphs via message passing between the nodes of graphs and utilize node attributes and node labels to train model parameters endtoend for a specific task in a semisupervised fashion [henaff2015deep, garcia2017few, defferrard2016convolutional, niepert2016learning].
In all these methods, the graph structure is assumed to be given by the problem domain. For instance in social networks, the structure of the graph is given by the connection between people. However, in the case of temporal segmentation considered here, the problem is nonstructural since the graph structure is not given by the problem domain but needs to be determined together with the node embedding.
2.2 Event segmentation
Extensive research has been conducted to temporally segment videos and image sequences into events. Early approaches aimed at segmenting edited videos such as TV programs and movies [kijak2006audiovisual, yuan2007formal, chasanis2009movie, liu2011exploiting, liu2013learning] into commercial, newsrelated or movie events. More recently, with the advent of wearable cameras and camera equipped smartphones, there has been an increasing interest in segmenting untrimmed videos or image sequences captured by nonprofessionals into semantically homogeneous units [hu2011survey, tang2012learning, Iwan2017]. In particular, videos or image sequences captured by a wearable camera are typically long and unconstrained [bolanos2017toward], therefore it is important for the user to have them segmented into semantically meaningful chapters. In addition to appearancebased features [bettadapura2016discovering, xu2015gaze], motion features have been extensively used for temporal segmentation of both thirdperson videos [koprinska2001temporal, tang2012learning] and firstperson videos [huang2017egocentric, spriggs2009temporal, lee2015predicting, poleg2014temporal]. In [spriggs2009temporal], motion cues from a wearable inertial sensor are leveraged for temporal segmentation of human motion into actions. Lee and Grauman [lee2015predicting] used temporally constrained clustering of motion and visual features to determine whether the differences in appearance correspond to event boundaries or just to abrupt head movements. Poleg et al. [poleg2014temporal] proposed to use integrated motion vectors to segment egocentric videos into a hierarchy of longterm activities, whose first level corresponds to static/transit activities.
However, motion information is not available in first person image sequences, that are the main focus of this paper. In addition, given the limited amount of annotated data, event segmentation is very often performed by using a clustering algorithm relying on handcrafted visual features [doherty2008investigating, doherty2008combining, lin2017vci2r, yamamoto2017pbg, lin2006structuring]. Tavalera et al. [talavera2015r] proposed to combine agglomerative clustering with a change detection method within a graphcut energy minimization framework. Later on, [dimiccoli2017sr] extended this framework by improving the feature representations through building a vocabulary of concepts for representing each frame. Paci et al. [paci2016context] proposed a Siamese ConvNets based approach that aims at learning a similarity function between lowresolution egocentric images. Recently, del Molino et al. [del2018predicting] proposed to learn event representations by training in an unsupervised fashion a LSTMbased model to predict the temporal context of each frame, that is initially represented by the output of the prepooling layer of InceptionV3. This simple approach has shown outstanding results on reference benchmark datasets [dimiccoli2017sr, bolanos2018egocentric], as long as the model is trained on a large training dataset (over 1.2 million images). The method in [del2018predicting] is similar to the chronologically earlier approach reported in [dias2019learning], that proposed to train the model in a fully selfsupervised fashion instead. Specifically, starting from a concept vector representation of each frame, the authors proposed a neural network based model and an LSTMbased model performing a selfsupervised pretext task consisting in predicting the concept vectors of neighbor frames given the concept vector of the current frame. Consequently, unlike [del2018predicting], the single image sequence itself is used to learn the features encoding the temporal context, without need for a training dataset. The performance achieved in [dias2019learning] were less impressive than in [del2018predicting].
Here, we propose a new model that as [dias2019learning] does not make use of any training set for learning the temporal event representation, but achieves stateoftheart results. In particular, it outperforms [del2018predicting] on the EDUBSeg and EDUBSegDesc benchmarks [dimiccoli2017sr, bolanos2018egocentric].
3 Dynamic Graph Embedding (DGE)
3.1 Problem formulation and proposed model
We formulate the event learning problem as a geometric learning problem. More specifically, given a set of data points (the frames of the image sequence) embedded into a highdimensional Euclidean space (the initial data representation), we assume that these data points are organized in an underlying low dimensional graph structure. Our a priori on the structure of the graph in the underlying low dimensional space is that it consists of a finite number of community structures corresponding to different temporal contexts. Since along an image sequence, a same temporal context can be visited several times at different time intervals, we assume that the underlying graph consists of a finite number of community structures and that edges between nodes belonging to different communities correspond to transitions between different temporal contexts, whereas edges between nodes belonging to the same community correspond to transitions between nodes sharing the same temporal context, being them temporally adjacent or not. This structure implicitly assumes that the graph topology models jointly temporal and semantic similarity relations.
More formally, given a sequence of $N$ images and their feature vectors $X\in {\mathbb{R}}^{N}\times {\mathbb{R}}^{n}$, we aim at finding a fully connected, weighted graph $\stackrel{~}{\mathcal{G}}=(\stackrel{~}{X},\mathcal{E},\stackrel{~}{\mathcal{W}})$ with node embedding $\stackrel{~}{X}\in {\mathbb{R}}^{N}\times {\mathbb{R}}^{d}$ in a low dimensional space, $d\ll n$, and edge weights $\stackrel{~}{\mathcal{W}}$ given by the entries of an affinity matrix $\stackrel{~}{\text{G}}=S(\stackrel{~}{X})$, where $S$ is a similarity kernel, such that the similarity ${\stackrel{~}{\text{G}}}_{kj}$ between any pair $k,j$ of nodes of the graph reflects both semantic relatedness and temporal adjacency between the images. Semantic relatedness is captured by a similarity function between semantic image descriptors, whereas temporal adjacency is imposed through temporal constraints injected in the graph structure. The above constraints lead to easily grouping the graph nodes in a finite number of clusters, each corresponding to a different temporal context. As seen in the previous section, in classical node embedding the low dimensional representation of each node encodes information about the position and the structure of local neighborhood in the graph. Since all these methods incorporate graph structure in some way, the construction of the underlying graph is extremely important, but relatively little explored. In our problem at hand the graph structure is initially unknown since it arises from events, that are unknown. Therefore, we aim at learning jointly the structure of the underlying graph and the node embedding.
3.2 Graph initialization by nonlocal selfsimilarity
Temporal nonlocal selfsimilarity. Let $X\in {\mathbb{R}}^{N}\times {\mathbb{R}}^{n}$ be the set of $N$ given data points in a highdimensional space ${\mathbb{R}}^{n}$, normalized in the interval $[1,1]$. To obtain a first coarse estimate of the graph, we apply a nonlocal selfsimilarity algorithm in the temporal domain to the initial data $X$ [dimiccoli2019enhancing]. The nonlocal selfsimilarity filtering creates temporal neighborhoods of frames that are likely to be in the same event. Let $X(k)\in {\mathbb{R}}^{n}$ denote the $k$th row of $X$, that is, the vector of $n$ image features at time $k$, $k=1,\mathrm{\dots},N$. Further, let ${\mathcal{N}}_{k}^{M}=\{kM,\mathrm{\dots},k1,k+1,\mathrm{\dots},k+M\}$ and ${\mathcal{N}}_{k}^{L}=\{kL,\mathrm{\dots},k1,k+1,\mathrm{\dots},k+L\}$ denote the indices of the $2M$ and $2L$ neighboring feature vectors of $X(k)$, respectively, with $L>M$. In analogy with 2D data (images) [buades2005non], the selfsimilarity function of $X(k)$ in a temporal sequence, conditioned to its temporal neighborhood $j\in {\mathcal{N}}_{k}^{M}$, is given by the quantity [dimiccoli2019enhancing]
$${S}^{NL}(k,j)=\frac{1}{\mathcal{Z}(k)}\mathrm{exp}\left(\frac{dist(X({\mathcal{N}}_{k}^{M}),X({\mathcal{N}}_{j}^{M}))}{h}\right),$$  (1) 
where $\mathcal{Z}(k)$ is a normalizing factor such that ${\sum}_{j\in {\mathcal{N}}_{k}^{L}}{S}_{kj}^{NL}=1$, ensuring that ${S}_{kj}^{NL}$ can be interpreted as a conditional probability of $X(j)$ given $X({\mathcal{N}}_{k})$, as detailed in [buades2005non], $dist(X({\mathcal{N}}_{k}),X({\mathcal{N}}_{j}))={\sum}_{i=1}^{2M}{X({\mathcal{N}}_{k}(i))X({\mathcal{N}}_{j}(i))}_{{\mathrm{\ell}}_{1}}$ is the sum of the ${\mathrm{\ell}}_{1}$ distances of the vectors in the neighborhoods of $k$ and $j$, and $h$ is the parameter that tunes the decay of the exponential function. The key idea of our graph initialization is to model each frame $k$ by its denoised version, obtained as
$\widehat{X}(k)$  $=$  ${\text{NLmeans}}^{1D}(X(k))={\displaystyle \sum _{j\in {\mathcal{N}}_{k}^{L}}}{S}_{kj}^{NL}\cdot X(j),$  
${\widehat{X}}_{0}$  $=$  ${(\widehat{X}{(1)}^{T}\widehat{X}{(2)}^{T}\mathrm{\dots}\widehat{X}{(N)}^{T})}^{T}.$  (2) 
A numerical illustration on real data is provided in Fig. 4.
Initial graph and initial embedding. An initial graph ${\mathcal{G}}_{0}$ is obtained by computing the ${\mathbb{R}}^{N}\times {\mathbb{R}}^{N}$ affinity matrix ${\text{G}}_{0}={S}_{\widehat{l}}({\widehat{X}}_{0})$ of ${\widehat{X}}_{0}$, defined elementwise as the pairwise similarity
$${({S}_{l}(X))}_{kj}=\mathrm{exp}\left(\frac{cdist(X(j),X(k))}{l}\right)$$  (3) 
where $cdist(\cdot ,\cdot )$ is the cosine distance and $l$ the filtering parameter of the exponential function. In the following, we will not distinguish any longer between a graph $\mathcal{G}$ and its representation by its affinity matrix G and make use both symbols synonymously. In our model, ${\mathcal{G}}_{0}$ represents the initial data structure in the original high dimensional space as a fully connected graph, from which we would like to learn the graph in the embedding space, say $\stackrel{~}{\mathcal{G}}$, that better encodes temporal constraints. To obtain an initial embedding ${\stackrel{~}{X}}_{0}$ for the graph, we apply PCA on ${\widehat{X}}_{0}$, keep the $d$ major principal components $\stackrel{~}{X}$, and minimize the crossentropy loss $\mathcal{L}$ between the affinity matrices ${S}_{\widehat{l}}({\widehat{X}}_{0})$ and ${S}_{\stackrel{~}{l}}({\stackrel{~}{X}}_{0})$
$${\stackrel{~}{X}}_{0}=\underset{\stackrel{~}{X}}{\mathrm{arg}\mathrm{min}}\mathcal{L}({S}_{\stackrel{~}{l}}(\stackrel{~}{X}),{\text{G}}_{0})$$ 
where the different filtering parameters $\widehat{l}$ and $\stackrel{~}{l}$ account for the different dimensionality of ${\widehat{X}}_{0}$ and ${\stackrel{~}{X}}_{0}$. Even if PCA is a linear operator and for small sets of highdimensional vectors dual PCA could be more appropriate [giannakis2018topology], we found it sufficient here for initializing the algorithm. The initial graph ${\stackrel{~}{\mathcal{G}}}_{0}$ in the embedding space is then given by ${\stackrel{~}{\text{G}}}_{0}={S}_{l}({\stackrel{~}{X}}_{0})$.
3.3 DGE core alternating steps
Given the initial embedding ${\stackrel{~}{X}}_{0}$ and graphs ${\mathcal{G}}_{0}$ and ${\stackrel{~}{\mathcal{G}}}_{0}$, the main loop of our DGE alternates over the following two steps:

1.
Assuming that ${\stackrel{~}{\mathcal{G}}}_{i1}$ is fixed, update the node representations ${\stackrel{~}{X}}_{i}$.

2.
Assuming that ${\stackrel{~}{X}}_{i}$ is fixed, update the graph ${\stackrel{~}{\mathcal{G}}}_{i}$.
Step (1) is inspired from graph embedding methods, such as the ones reviewed in Section 2.1, that have proved to be very good at encoding a given graph structure. Step (2) aims at enforcing temporal constraints and at fostering semantic similarity in the graph structure.
Graph embedding update. To estimate the graph embedding ${\stackrel{~}{X}}_{i}$ at iteration $i$ assuming that ${\stackrel{~}{\mathcal{G}}}_{i1}$ is given, we solve
$${\stackrel{~}{X}}_{i}=\underset{\stackrel{~}{X}}{\mathrm{arg}\mathrm{min}}(1\alpha ){\mathcal{L}}_{1}({S}_{\stackrel{~}{l}}(\stackrel{~}{X}),{\stackrel{~}{\text{G}}}_{i1})+\alpha {\mathcal{L}}_{2}({S}_{\stackrel{~}{l}}(\stackrel{~}{X}),{\text{G}}_{0}).$$  (4) 
Here ${\mathcal{L}}_{1}$ and ${\mathcal{L}}_{2}$ are crossentropy losses and ${S}_{l}(\cdot ,\cdot )$ is the cosinebased similarity defined in (3). The first loss term controls the fit of the representation $\stackrel{~}{X}$ with the learnt graph ${\stackrel{~}{\mathcal{G}}}_{i}$ in low dimensional embedding space. The second loss term quantifies the fit of the representation $\stackrel{~}{X}$ with the fixed initial graph ${\mathcal{G}}_{0}$ in high dimensional space and is reminiscent of shallow graph embedding; $\alpha \in [0,1]$ is a regularization parameter that controls the weight of each loss. Standard gradient descent can be used to solve (4).
Graph structure update. To obtain an estimate of the graph structure at the $i$th iteration, say ${\stackrel{~}{\mathcal{G}}}_{i}$, assuming that ${\stackrel{~}{X}}_{i}$ is given, we start from an initial estimate for ${\stackrel{~}{\mathcal{G}}}_{i}$ as ${\stackrel{~}{\text{G}}}_{i}={S}_{\stackrel{~}{l}}({\stackrel{~}{X}}_{i})$, and then make use of the model assumptions described in Section 3.1 to modify the graph: temporal adjacency, and semantic similarity.
i) To foster similarity for temporally adjacent nodes, we apply two operations. First, local averaging of the edge weights, ${\stackrel{~}{\text{G}}}_{i}=\leftarrow {\stackrel{~}{\text{G}}}_{i}\ast {\mathcal{K}}_{p}$, where $\ast $ is the 2D convolution operator and ${\mathcal{K}}_{p}$ a $p\times p$ kernel that is here simply the normalized bump function. Second, and more importantly, we apply a nonlinear operation to ${\stackrel{~}{\mathcal{G}}}_{i}$ that scales down by a factor $\eta $, $$, the weights of edges between nodes $k$ and $j$ that are not direct temporal neighbors (i.e., for which $kj>1$) while leaving similarities of directly temporally adjacent nodes unchanged
$${({\stackrel{~}{\text{G}}}_{i})}_{kj}\leftarrow {({f}_{\eta}({\stackrel{~}{\text{G}}}_{i}))}_{kj}=\{\begin{array}{cc}(1\eta ){({\stackrel{~}{\text{G}}}_{i})}_{kj}\text{if}kj1\hfill & \\ {(}{1}{}{\eta}{)}{({\stackrel{~}{\text{G}}}_{i})}_{kj}\text{otherwise,}\hfill & \end{array}$$  (5) 
thus strengthening the temporal adjacency of the graph.
ii) To reinforce the semantic similarity of ${\stackrel{~}{X}}_{i}$, we first obtain a coarse estimate of the community structure of the graph ${\stackrel{~}{\mathcal{G}}}_{i}$. To this end, we apply a clustering algorithm on ${\stackrel{~}{X}}_{i}$, which yields estimated cluster labels $\mathcal{C}={({c}_{j})}_{j=1}^{{N}_{C}}$, ${c}_{j}\in \{1,\mathrm{\dots},{N}_{C}\}$, for each frame. Then we update ${\stackrel{~}{\mathcal{G}}}_{i}$ by applying to it a nonlinear function ${g}_{\mu}$ that reduces the similarity between nodes $k$ and $j$ that do not belong to the same cluster, ${c}_{j}\ne {c}_{k}$, by a factor $\mu $, $$, and does not change similarities within clusters, i.e.,
$${({\stackrel{~}{\text{G}}}_{i})}_{kj}\leftarrow {({g}_{\mu}({\stackrel{~}{\text{G}}}_{i},\mathcal{C}))}_{kj}=\{\begin{array}{cc}(1\mu ){({\stackrel{~}{\text{G}}}_{i})}_{kj}\text{if}{c}_{j}\ne {c}_{k}\hfill & \\ {(}{1}{}{\mu}{)}{({\stackrel{~}{\text{G}}}_{i})}_{kj}\text{otherwise.}\hfill & \end{array}$$  (6) 
For $$, this operation hence reinforces withinevent semantic similarity.
DGE aims revealing at the temporal and semantic relatedness for each pair of data vectors, and therefore the estimated graphs ${\stackrel{~}{\mathcal{G}}}_{i}$, $i=0,\mathrm{\dots},K$ are fully connected at each stage. A highlevel overview of our DGE approach can be found on ALGORITHM 1.
3.4 Graph postprocessing: Event boundary detection
Depending on the problem, applicative context and objective, different standard and graph signal processing tools can be applied to the estimated graph ${\stackrel{~}{\mathcal{G}}}_{K}$ in order to extract the desired information, or transform ${\stackrel{~}{\mathcal{G}}}_{K}$ [wang2015global, ortega2018graph]. In this work, the focus is on event detection in image sequences and we thus directly use the learnt representation ${\stackrel{~}{X}}_{K}$ corresponding to ${\stackrel{~}{\mathcal{G}}}_{K}$ in a contextual event boundary detector.
Boundary detector. Since the focus of this paper is on how to improve event representation and not on the temporal segmentation algorithm itself, we used the same boundary detector as in [del2018predicting]. Such a boundary detector is based on the idea that the larger the distance between the predicted visual contexts for a frame $k$, once computed forward (from past frames $$) and once backward (from future frames $j>k$), is, the more likely this frame will correspond to an event boundary. Hence, the boundary prediction function is defined as the distance between the past and future contexts of frame $k$: $pred(k)=dist({\text{ctx}}^{P},{\text{ctx}}^{F})$, where $dist(\cdot ,\cdot )$ is the cosine distance and the temporal contexts ${\text{ctx}}^{F}$ and ${\text{ctx}}^{P}$ are defined as the average of the representation vectors $\stackrel{~}{X}$ of a small number of the previous (or past) frames. Those frames for which the values of the boundary prediction function exceed a threshold
are the detected event boundaries, see [del2018predicting] for details. We use the same parameter values and thresholds as in [del2018predicting].
Hereafter we call our temporal segmentation model relying on the features learnt by using the proposed DGE approach CESDGE, in analogy with CESVCP in [del2018predicting] (where CES stands for contextual event segmentation and VCP for visual context prediction).
3.5 Numerical illustration of CESDGE on synthetic data
To illustrate the main idea behind our modeling, in Fig. 3 we show with a synthetic example in $n=2$ dimensions how the original data ${X}_{0}$ and the associated initial graph ${\stackrel{~}{\mathcal{G}}}_{0}$ change over $10$ iteration of the DGE algorithm (here, $d=n=2$, and we assume ${\stackrel{~}{X}}_{0}={\widehat{X}}_{0}={X}_{0}$, i.e., no preprocessing). The data consist of a temporal sequence of $N=350$ feature vectors that are drawn from a Gaussian mixture with $4$ equiprobable components with mean vectors $(\mathrm{7.8\hspace{0.33em}5.1}),(\mathrm{1.4\hspace{0.33em}2.8}),(\mathrm{2.9\hspace{0.33em}12.1}),(\mathrm{2.5\hspace{0.33em}3.4})$ and diagonal covariance matrix $\sigma \mathbf{I}$ with $\sigma =3.7$. To model the community structures in the data, after each switch to a new state the state is maintained with probability that decreases from $1$ at an exponential rate with time. In Fig. 3, panel (a) shows the learnt representations ${\stackrel{~}{X}}_{i}$ as scatter plots for iterations $i=(0,1,3,10)$, where colors indicate the true cluster labels and connections between data points temporal adjacency; panel (b) plots ${\stackrel{~}{X}}_{i}$ as time series, together with event boundary detection as dashed bars and ground truth labels as solid bars; panel (c) shows the corresponding graph ${\stackrel{~}{\mathcal{G}}}_{i}$ and panel (e) summarizes boundary detection performance as a function of iteration number $i$; in addition, to highlight the estimated event graph structure, panel (d) plots a pruned version of ${\stackrel{~}{\mathcal{G}}}_{i}$ in which all edges with weight below 70% the weight of the strongest edge are removed. Clearly, it is difficult to obtain a good segmentation for the initial, cleaned data (iteration $i=0$, left column). It can be appreciated how, with a few DGE iterations, the embedded data points ${\stackrel{~}{X}}_{i}$ belonging to the four different components are pushed apart, and the distance between temporally adjacent points belonging to the same components is reduced, increasing the similarity within each temporal context and revealing the underlying data generating mechanism (Fig. 3 (ab), columns 2 to 4, respectively). Moreover, the learnt representation remains faithful to the temporal structure within each segment (e.g., the position and relative size of local maxima and minima is not changed). An alternative view is provided by the corresponding learnt graphs ${\stackrel{~}{\mathcal{G}}}_{i}$ in Fig. 3 (cd), for which increasingly homogeneous diagonal and offdiagonal blocks with sharp boundaries emerge with increasing number of iterations $i$, reflecting the temporal community structure underlying the data. This improved representation leads in turn to significantly better event segmentation results, with Fscore increasing from initially $0.59$ to $0.62$, $0.64$, $0.67$ for iterations $i=1,3,410$, respectively.
Parameter  Fscore  

$K$  1  2  3  4  5 
(DGE iterations)  0.655  0.698  0.682  0.671  0.662 
$\alpha $  0.05  0.1  0.2  0.3  0.4 
(initial graph memory)  0.679  0.698  0.690  0.685  0.682 
$p$  2  3  4  5  6 
(2D local average size)  0.691  0.698  0.691  0.687  0.679 
$\eta $  0.01  0.1  0.3  0.4  0.6 
(graph temporal regularization)  0.685  0.688  0.698  0.686  0.672 
$\mu $  0.03  0.05  0.1  0.2  0.3 
(extracluster penalty)  0.675  0.697  0.698  0.687  0.670 
${N}_{CL}$  3  4  6  8  10 
0.650  0.656  0.666  0.682  0.698  
(cluster number)  12  14  16  18  20 
0.686  0.685  0.677  0.674  0.676 
3.6 Comparative results for EDUBSeg dataset
Method  Fscore  Rec  Prec 
kMeans smoothed  0.51  0.39  0.82 
ACcolor  0.38  0.25  0.90 
SRClusteringCNN  0.53  0.68  0.49 
KTS  0.53  0.40  0.87 
CESVCP  0.69  0.77  0.66 
CESDGE  0.70  0.70  0.72 
Manual segmentation  0.72  0.68  0.80 
Method  CESVCP  CESDGE  

Tolerance  Fscore  Rec  Prec  Fscore  Rec  Prec 
$\tau =5$  0.69  0.77  0.66  0.70  0.70  0.72 
$\tau =4$  0.67  0.75  0.63  0.68  0.69  0.70 
$\tau =3$  0.64  0.62  0.71  0.65  0.64  0.68 
$\tau =2$  0.59  0.67  0.56  0.59  0.59  0.61 
$\tau =1$  0.44  0.44  0.49  0.48  0.48  0.50 
4 Performance evaluation
4.1 Datasets and experimental setup
Datasets.
We used two temporal segmentation benchmark datasets for performance evaluation. The EDUBSeg dataset, introduced in [dimiccoli2017sr], consists of 20 first person image sequences acquired by seven people with a total of 18,735 images. The dataset has been used to validate image sequence temporal segmentation in several recent works [dimiccoli2017sr, dias2019learning, paci2016context, del2018predicting]. For each image sequence, $3$ manual segmentations have been obtained by $3$ different persons; in line with previous works, the first segmentation was used here as ground truth. The second dataset is the larger and more recent EDUBSegDesc dataset [bolanos2018egocentric], with 46 sequences (42,947 images) acquired by eight people. Every frame of the egocentric image sequences was described here using the output of the prepooling layer of InceptionV3 [szegedy2016rethinking] pretrained on ImageNet as in [del2018predicting], resulting in $n=2048$ features.
On average, the sequences of both datasets contain roughly the same number of ground truth event boundaries (28 for the former vs. 26 for the latter), but those of EDUBSegDesc dataset consist of 25% longer continuously recorded segments than EDUBSeg (3h46m25s vs. 3h1m29s continuous “camera on” time, and 3.0 vs. 3.55 continuously recorded segments per sequence). Because of this increased continuity with a larger number of harder to detect continuous event transitions, EDUBSegDesc is considered more difficult.
Other publicly available datasets of egocentric image sequences, such as CLEF [dang2017overview], NTCIR [gurrin2017overview] and the more recent R3 [del2018predicting], do not have ground truth event segmentations. They can therefore not be used for performance evaluation. In [del2018predicting], these datasets with more than 1.2 million images were used as training sets to learn the temporal context representation. We emphasize that in contrast, our algorithm operates fully selfsupervised instead, without training dataset.
Performance evaluation. Following previous work, we use Fscore, Precision (Prec) and Recall (Rec) to evaluate the performance of our approach. In previous works [paci2016context, dimiccoli2017sr, del2018predicting, dias2019learning], results have been reported for a single level of tolerance $\tau =5$ (i.e., time interval of length $2\tau $ within which a detected event boundary is considered correct). Here, we report results for several values for the level of tolerance $\tau \in [1,T]$, where $T=5$.
DGE hyperparameters. The hyperparameter values for the graph initialization and embedding (i.e., for the nonlocal selfsimilarity kernel (1) and for the similarity (3)) have been chosen a priori based on visual inspection of the similarity matrices of ${\widehat{X}}_{0}$ and ${\stackrel{~}{X}}_{0}$ for a couple of sequences of EDUBSeg. The values are fixed to $L=3$, $M=1$, $h=0.25$ for Eq. (1), and $\widehat{l}=0.001$, $\stackrel{~}{l}=0.3$ for Eq. (3). The embedding dimension is set to $d=15$, which is found sufficient for the representation to faithfully reproduce the graph topology underlying the data (larger values for $d$ were tested but did not yield performance gains). The DGE core loop hyperparameters are set to $\alpha =0.1$ (graph regularization), $p=3$, $\eta =0.3$ (temporal prior), and $\mu =0.1$, ${N}_{C}=10$ (semantic prior; a kmeans algorithm is used to estimate cluster labels); robustness of this choice is investigated in the next section.
4.2 Robustness to changes in hyperparameter values
Tab. 1 reports Fscore values obtained on the EDUBSeg dataset when the DGE iterations $K$ and the DGE core parameters $\alpha $, $p$, $\eta $, $\mu $ and ${N}_{CL}$ are varied individually. It can be appreciated that the performance of CESDGE is overall very robust w.r.t. precise hyperparameter values. The highest sensitivity is observed for the DGE iteration number $K$, which should be chosen as $$ so that Fscore values are at most 3% below the best observed Fscore. Results are very robust w.r.t. temporal regularization for reasonably small parameter values of $p\le 5$ and $\eta \le 0.4$, for larger values the learnt representation is oversmoothed. Similarly, Fscore values vary little when changing the semantic similarity parameter as long as $$ (note that these variations are significant since $\eta ,\mu \in (0,1)$). Finally, the number of clusters used to estimate semantic similarity can be chosen within a reasonable range (Fscore values drop by less than 3% for ${N}_{CL}\in (6,\mathrm{\dots},20)$). Overall, these results suggest that CESDGE is quite insensitive to hyperparameter tuning and yields robust segmentation for a wide range of parameter values.
Tab. 2 reports comparisons with five stateoftheart methods for a fixed value of tolerance ($\tau =5$, results reproduced from [del2018predicting]). The first four (kMeans smoothed with k$=30$, ACcolor [lee2015predicting], SRClustering [dimiccoli2017sr], KTS [potapov2014category]) are standard/generic approaches and achieve modest performance, with Fscores no better than 0.53. CESVCP of [del2018predicting] yields significantly better Fscore of 0.69, thanks to the use of a large training set for learning the event representation. The proposed CESDGE approach further improves on this stateoftheart result and yields Fscore of 0.70. Interestingly, CESDGE also achieves more balanced Rec and Prec values of 0.70 and 0.72, as compared to 0.77 and 0.66 for CESVCP. Finally, even when compared to average manual segmentation performance, estimated by averaging the performance of the two remaining manual segmentations for EDUBSeg against the selected ground truth, our results are only 2$\%$ below.
In Table 3, we provide comparisons between our proposed approach and CESVCP for different values of tolerance. We can observe that CESDGE achieves systematically better results than CESVCP in terms of Fscore for all values of tolerance. These improvements with respect to the state of the art reach up to 4$\%$. Besides, Rec/Prec values for CESDGE are more balanced and within $\pm $3$\%$ of the values for Fscore for all tolerance levels ($\pm $8$\%$ of Fscore for CESVCP).
Overall, this leads to conclude that the proposed CESDGE approach is effective in learning event representations for image sequences and yields robust segmentation results. These results are even more remarkable considering that CESDGE learns feature representations from the sequence itself, without relying on any training dataset.
4.3 Comparative results for EDUBSegDesc dataset
Tab. 4 summarizes the event boundary detection performance of the proposed CESDGE approach and of CESVCP for the larger EDUBSegDesc dataset; since CESVCP reported stateoftheart results (with Fscore values 16% above other methods, cf., Tab. 2), we omit comparison with other methods in what follows for space reasons. The same (hyper)parameter values as for EDUBSeg are used. It can be observed that the performance for both CESVCP and CESDGE are inferior to those reported for the EDUBSeg dataset in the previous section, for all tolerance values, indicating that EDUBSegDesc contains more difficult image sequences. Interestingly, while the Fscores achieved by CESVCP are up to 12% (and more than 11% on average) below that reported for EDUBSeg, the Fscores of the proposed CESDGE approach are at worst 5% smaller. In other words, CESDGE yields up to 8% (on average 6%) better Fscore values than the state of the art for the EDUBSegDesc dataset. Our CESDGE also yields systematically better Rec and Prec values, for all levels of tolerance. Overall, these findings corroborate those obtained for the EDUBSeg dataset and confirm the excellent practical performance of the proposed approach. The results further suggest that CESDGE, by relying only on the image sequence itself for learning event representations, can benefit from improved adaptability and robustness as compared to approaches that leverage on training datasets.
Method  CESVCP  CESDGE  

Tolerance  Fscore  Rec  Prec  Fscore  Rec  Prec 
$\tau =5$  0.57  0.59  0.60  0.65  0.67  0.65 
$\tau =4$  0.56  0.58  0.58  0.63  0.66  0.63 
$\tau =3$  0.52  0.54  0.54  0.60  0.62  0.60 
$\tau =2$  0.49  0.50  0.50  0.54  0.56  0.54 
$\tau =1$  0.43  0.44  0.45  0.45  0.46  0.45 
4.4 Ablation study
Graph initialization. We investigate performance obtained by applying the boundary detector to features obtained at different stages of our method: the original features $X$ (denoted CESraw) and the features ${\widehat{X}}_{0}$ obtained by applying NLmeans on the temporal dimension (CESNLmeans 1D), both of dimension $n=2048$; the initial embedded features ${\stackrel{~}{X}}_{0}$ (CESEmbedding) and the features obtained after running the DGE main loop for $K=2$ iterations (CESDGE), both of dimension $d=15$. The results, obtained for the EDUBSeg dataset, are reported in Tab. 5. They indicate that CESNLmeans 1D increases Fscore by 8% w.r.t. CESraw, and CESEmbedding adds another 1% in Fscore. CESDGE gains an additional 9%, hence significantly improves upon this initial embedding. This confirms that the graph initialization and the reduction of the dimension of the graph representation is beneficial.
An illustration of the effect of the graph initialization and of the DGE steps for EDUBSeg Subject 1 Day 3 is provided in Fig. 4. It can be observed how boundaries between temporally adjacent frames along the diagonal in the graph are successively enhanced when the original features ${X}_{0}$ (column 1) are replaced with the denoised version ${\widehat{X}}_{0}$ (column 2), with the embedded features ${\stackrel{~}{X}}_{0}$ (column 3), and finally with the DGE representation estimates (columns 4 & 5 for DGE iterations $i=1,2$, respectively). Moreover, the boundaries of offdiagonal blocks that correspond to segments of frames that ressemble segments at other temporal locations, presumably of the same event, are sharpened, revealing the community structures.
DGE core operations. In Tab. 6, we report the performance that is obtained on the EDUBSeg dataset when the different operations in the DGE core iterations are removed onebyone by setting the respective parameter to zero: graph regularization ($\alpha $), edge local averaging ($p$), temporal edge weighting ($\eta $), and extracluster penalization ($\mu $). It is observed that the overall DGE Fscore drops by 0.4% to 3.2% when one single of these operations is deactivated (versus a drop of 8.6% from 0.698 to 0.612 when no DGE operation is performed at all, as discussed above). The fact that removing any of the operations individually does not lead to a knockout of the DGE loop suggests that the associated individual (temporal & semantic) model assumptions are all and independently important. Among all operations, the largest individual Fscoredrop (3.2%) corresponds with deactivating the extracluster penalization. This points to the essential role of semantic similarity in the graph model. The smallest Fscore difference is associated with edge local averaging; to improve this temporal regularization step, future work could use nonlocal instead of local averaging, or learnt kernels. However, the graph temporal edge weighting is effective in encoding the temporal prior (1.5% Fscore drop if deactivated).
Method  Fscore  Rec  Prec 

CESraw  0.52  0.56  0.56 
CESNLmeans 1D  0.60  0.63  0.61 
CESEmbedding  0.61  0.61  0.65 
CESDGE  0.70  0.70  0.72 
Deactivated DGE parameter  $\alpha =0$  $p=0$  $\eta =0$  $\mu =0$ 

Fscore  ${}$0.685  ${}$0.694  ${}$0.683  ${}$0.666 
difference with full DGE (0.698)  $$0.013  $$0.004  $$0.015  $$0.032 
5 Conclusion
This paper proposed a novel approach to learn representations of events in low temporal resolution image sequences, named Dynamic Graph Embedding (DGE). Unlike state of the art work, which requires (large) datasets for training the model, DGE operates in a fully selfsupervised way and learns the temporal event representation for an image sequence directly from the sequence itself. To this end, we introduced an original model based on the assumption that the sequence can be represented as a graph that captures both the temporal and the semantic similarity of the images. The key novelty of our DGE approach is then to learn the structure of this unknown underlying data generating graph, jointly with a low dimensional graph embedding. This is, to the best of our knowledge, one of the first attempts to learn simultaneously graph structure and data representations for image sequences. Experimental results have shown that DGE yields robust and effective event representations and outperforms the state of the art in terms of event boundary detection precision, improving Fscore values by 1% and 8% on the EDUBSeg and EDUBSegDesc event segmentation benchmark datasets, respectively. Future work will include exploring the use of more sophisticated methods than the kmeans algorithm in the semantic similarity estimation step, and other applications in the field of Multimedia, such as the temporal segmentation of wearable sensor data streams.