SparseSense: Human Activity Recognition from Highly Sparse Sensor Datastreams Using Set-based Neural Networks
Abstract
Batteryless or so-called passive wearables are providing new and innovative methods for human activity recognition (HAR), especially in healthcare applications for older people. Passive sensors are low cost, lightweight, unobtrusive and desirably disposable; attractive attributes for healthcare applications in hospitals and nursing homes. Despite the compelling propositions for sensing applications, the data streams from these sensors are characterised by high sparsity: the time intervals between sensor readings are irregular, while the number of readings per unit time is often limited. In this paper, we rigorously explore the problem of learning activity recognition models from temporally sparse data. We describe how to learn directly from sparse data using a deep learning paradigm in an end-to-end manner. We demonstrate significant classification performance improvements on real-world passive sensor datasets from older people over the state-of-the-art deep learning human activity recognition models. Further, we provide insights into the model's behaviour through complementary experiments on a benchmark dataset and visualisation of the learned activity feature spaces.
Alireza Abedin, S. Hamid Rezatofighi, Qinfeng Shi, Damith C. Ranasinghe
School of Computer Science, The University of Adelaide, Australia
{alireza.abedinvaramin, hamid.rezatofighi, javen.shi, damith.ranasinghe}@adelaide.edu.au
1 Introduction
Understanding human activities using wearables is the basis for an increasing number of healthcare applications such as rehabilitation, gait analysis, falls detection and falls prevention [?]. In particular, older people have expressed a preference for unobtrusive and wearable sensing modalities [?; ?]. While traditional wearables employ battery-powered devices, new opportunities for human activity recognition applications, especially in healthcare, are being created by batteryless or passive wearables operating on harvested energy [?; ?]. In contrast to often bulky and obtrusive battery-powered wearables, passive sensing modalities provide maintenance-free, often disposable, unobtrusive and lightweight devices that are highly desirable to both older people and healthcare providers. However, the very nature of these sensors leads to new challenges.
Problem. The process of operating a batteryless sensor and transmitting the captured data is reliant on harvested power. Because the time needed to harvest adequate energy to operate the sensor varies, the generated datastreams are highly sparse with variable inter-sample times. We illustrate the problem in Fig. 1 for a data stream captured by a body-worn passive sensor. We can see two key artefacts: i) the variable time intervals between sensor data reporting times; and ii) the relatively low average sampling rate. In this paper, we consider the problem of learning human activity recognition (HAR) models from sparse datastreams using a deep learning paradigm in an end-to-end manner.
Current Approaches. Wearable sensors generate time-series data. Consequently, the dominant human activity recognition pipeline uses fixed-duration sliding window partitioning to feed neural networks during both training and inference stages [?; ?; ?; ?; ?; ?]. When dealing with sparse data partitions, a common remedy is to rely on interpolation techniques as a preprocessing step to synthesise sensor observations and obtain a fixed-size representation from time-series partitions, as illustrated in Fig. 1 [?; ?]. However, we recognize two key issues with an interpolated sparse datastream:

• Interpolating between sensor readings that are temporally distant can lead to poor approximations of missing measurements and contextual activity information. Accordingly, adopting convolutional filters or recurrent layers to extract temporal patterns from the poorly approximated measurements may propagate the estimation errors to the activity recognition model; we substantiate this through extensive experiments in Section 3.3.
• Interpolation is an intermediate processing step that prevents end-to-end learning of activity recognition models directly from raw data and introduces real-time prediction delays in time-critical applications; we demonstrate the time overheads imposed on inference in Section 3.3.
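The first issue can be illustrated numerically. The following is a minimal sketch, not taken from the paper: it assumes a hypothetical 1 Hz sinusoidal signal on a 40 Hz grid and uses NumPy's `np.interp` for linear interpolation, then compares the reconstruction error when the gaps between received samples are small versus large.

```python
import numpy as np

# Hypothetical "true" signal: a 1 Hz sinusoid on a 40 Hz grid covering 2 s (80 samples).
f = 40.0
t_dense = np.arange(80) / f
x_dense = np.sin(2.0 * np.pi * 1.0 * t_dense)

def interpolation_rmse(keep_idx):
    """Linearly interpolate from the kept samples back onto the dense grid
    and return the root-mean-square error against the true signal."""
    x_hat = np.interp(t_dense, t_dense[keep_idx], x_dense[keep_idx])
    return float(np.sqrt(np.mean((x_hat - x_dense) ** 2)))

# Mildly sparse: every 4th sample is received (0.1 s gaps).
rmse_mild = interpolation_rmse(np.arange(0, 80, 4))
# Severely sparse: every 30th sample is received (0.75 s gaps).
rmse_severe = interpolation_rmse(np.arange(0, 80, 30))

print(f"RMSE with 0.1 s gaps:  {rmse_mild:.3f}")
print(f"RMSE with 0.75 s gaps: {rmse_severe:.3f}")
```

With gaps comparable to the period of the underlying motion, the interpolant misses entire oscillations, and any temporal model consuming it is trained on a signal that was never observed.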
Our Approach. Instead of relying on the naturally poor temporal correlations between consecutively received samples in sparse datastreams, we incentivize the activity recognition model to uncover discriminative representations from input sensory data partitions of various sizes in order to distinguish different activity categories. Our intuition is that a few information-bearing sensor samples, although not temporally consistent, can capture an adequate amount of information. Therefore, we propose learning HAR models directly from sparse datastreams. An illustrative summary of our proposed methodology for sparse datastream classification in comparison with the conventional treatment is presented in Fig. 1.
In this paper, we describe how human activity recognition with sparse datastreams can be elegantly handled using deep neural networks in an end-to-end learning process. Given that we no longer rely on often poor temporal information, we represent sparse data stream partitions as unordered sets with various cardinalities from which embeddings capable of discriminating activities can be learned. Our approach is inspired by recent research efforts to investigate set-based deep learning paradigms that address a new family of problems where the inputs of the task are naturally expressed as sets with unknown and unfixed cardinalities [?; ?]. Therefore, our approach here is to develop activity recognition models that can learn and predict from incomplete sets of sensor observations, without requiring any extra interpolation efforts.
Contribution. In particular: i) We solve a new problem with a deep neural network formulation: learning from sparse sensor datastreams in an end-to-end manner; ii) We show that set learning can tolerate missing information, which would otherwise not be possible with conventional DNNs; and iii) We demonstrate that our novel treatment of the problem yields recognition models that significantly outperform the state of the art, with lower inference delays, on naturally sparse public datasets (over 4% improvement in the best case). We further evaluate on a benchmark HAR dataset and provide deeper insights into the performance improvements obtained from our proposed approach.
2 Methodology
We first present a formal description of the human activity recognition problem with sparse datastreams and introduce the notation used throughout this paper, before elaborating on our proposed activity recognition framework to learn directly from sparse datastreams in an end-to-end manner.
2.1 Problem Formulation
Consider a collected datastream of raw time-series samples from body-worn sensors of the form $\mathbf{S}=(\mathbf{x}_{1},\mathbf{x}_{2},\ldots,\mathbf{x}_{\mathrm{T}})$, where $\mathbf{x}_{t}\in\mathbb{R}^{\mathrm{d}}$ is a multi-dimensional vector that contains sample measurements over $\mathrm{d}$ distinct sensor channels at time step $t$ and $\mathrm{T}$ is the total length of the sequence. Without loss of generality, we assume a hardware-specific sampling rate for the wearable sensors, denoted by $\mathrm{f}$.
HAR with Uniform Timeseries Data
In an ideally controlled laboratory setup, sensor samples are constantly taken at regular intervals of $\frac{1}{\mathrm{f}}$ seconds. In such a case, applying the commonly adopted time-series segmentation technique with a sliding window of fixed temporal context $\delta t$ yields the labeled dataset
$$\mathcal{D}_{\text{uniform}}=\{(\mathbf{X}_{1},\mathbf{y}_{1}),(\mathbf{X}_{2},\mathbf{y}_{2}),\ldots,(\mathbf{X}_{n},\mathbf{y}_{n})\}, \tag{1}$$
where $\mathbf{X}_{i}=[\mathbf{x}_{i},\ldots,\mathbf{x}_{i+\mathrm{m}-1}]\in\mathbb{R}^{\mathrm{d}\times\mathrm{m}}$ is a fixed-size segment of captured sensor readings, $\mathrm{m}=\mathrm{f}\,\delta t$ is the constant number of received samples, and $\mathbf{y}_{i}$ denotes the corresponding one-hot encoded ground-truth from the predefined activity space $\mathcal{A}=\{a_{1},\ldots,a_{c}\}$. The acquired dataset can then be utilized to train activity recognition models using out-of-the-box machine learning techniques.
HAR with Sparse Timeseries Data
Unfortunately, the sparse time-series data often found in real-world deployment settings, especially with passive sensors, have variable intervals between sensor observations. In this case, utilising a fixed-time sliding window approach to segment the sparse datastream results in the labeled dataset
$$\mathcal{D}_{\text{sparse}}=\{(\mathcal{X}_{1}^{m_{1}},\mathbf{y}_{1}),(\mathcal{X}_{2}^{m_{2}},\mathbf{y}_{2}),\ldots,(\mathcal{X}_{n}^{m_{n}},\mathbf{y}_{n})\}, \tag{2}$$
where $\mathcal{X}_{i}^{m_{i}}=\{\mathbf{x}_{i},\ldots,\mathbf{x}_{i+m_{i}-1}\}\in\overbrace{\mathbb{R}^{\mathrm{d}}\times\ldots\times\mathbb{R}^{\mathrm{d}}}^{m_{i}}$ is a set of sparse sensor observations during a timed window, $m_{i}\in\mathbb{N}$ is the cardinality of the obtained observation set, and $\mathbf{y}_{i}$ denotes the corresponding activity class. We emphasise that the number of received sensor readings in the time interval $\delta t$ is unfixed for different sensory segments and upper bounded by the sensor sampling rate; i.e., for any given sensory segment $\mathcal{X}_{i}^{m_{i}}$, we have $m_{i}\le\mathrm{f}\,\delta t$.
In this paper, having acquired the training dataset of sparse sensory segments $\mathcal{D}_{\text{sparse}}=\{(\mathcal{X}_{i}^{m_{i}},\mathbf{y}_{i})\}_{i=1}^{n}$, we intend to directly learn a mapping function $\mathcal{F}_{\Theta^{*}}:2^{\mathbb{R}^{\mathrm{d}}}\to\mathcal{A}$ that operates on input sensory sets with unfixed cardinalities and accurately predicts the underlying activity classes,
$$\mathbf{y}_{i}=\mathcal{F}_{\Theta^{*}}(\mathcal{X}_{i}^{m_{i}})=\mathcal{F}_{\Theta^{*}}(\{\mathbf{x}_{i},\ldots,\mathbf{x}_{i+m_{i}-1}\}),\quad\forall i\in\{1,\ldots,n\}. \tag{3}$$
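As an illustration of how a fixed-duration window produces sets of varying cardinality, here is a minimal sketch. The function name `segment_sparse_stream`, the non-overlapping stride and the toy timestamps are assumptions made for this example; the paper does not specify its window stride.

```python
from typing import List, Tuple

Reading = Tuple[float, ...]          # one d-channel sensor reading
Sample = Tuple[float, Reading]       # (timestamp in seconds, reading)

def segment_sparse_stream(stream: List[Sample], delta_t: float,
                          stride: float) -> List[List[Reading]]:
    """Slide a window of duration delta_t over a timestamped stream.

    Each segment holds the readings whose timestamps fall in
    [start, start + delta_t); windows with no readings are skipped.
    """
    if not stream:
        return []
    segments = []
    start, t_end = stream[0][0], stream[-1][0]
    while start <= t_end:
        window = [x for (t, x) in stream if start <= t < start + delta_t]
        if window:
            segments.append(window)
        start += stride
    return segments

# Toy sparse stream: irregular reporting times, three acceleration channels.
stream = [
    (0.00, (0.1, 0.2, 0.9)), (0.30, (0.1, 0.3, 0.9)), (0.35, (0.2, 0.3, 0.8)),
    (1.90, (0.4, 0.1, 0.7)), (2.60, (0.5, 0.1, 0.6)), (5.10, (0.0, 0.9, 0.1)),
]
segments = segment_sparse_stream(stream, delta_t=2.0, stride=2.0)
cardinalities = [len(s) for s in segments]
print(cardinalities)  # -> [4, 1, 1]

# Every cardinality respects the bound m_i <= f * delta_t (here f = 40 Hz).
assert all(m <= 40 * 2.0 for m in cardinalities)
```

The same two-second window yields sets of four, one and one readings, which is exactly the variable-cardinality input that Eq. (3) is defined over.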
2.2 SparseSense Framework
Dataset (clinical room)  HAR Model  Interpolant (acceleration)  Input  Window Size ($\delta t$)  $\text{Precision}_{m}$ (mean$\pm$std)  $\text{Recall}_{m}$ (mean$\pm$std)  $\text{F-score}_{m}$ (mean$\pm$std)
Roomset1  $\text{SVM}^{lin*}$  Cubic  Handcrafted features  4 seconds  87.87$\pm$2.55  83.44$\pm$1.72  84.96$\pm$1.23
Roomset1  $\text{SVM}^{rbf*}$  None  Handcrafted features  8 seconds  90.39$\pm$2.70  87.42$\pm$1.42  88.45$\pm$1.68
Roomset1  $\text{CRF}^{*}$  Linear  Handcrafted features  2 seconds  85.97$\pm$2.43  82.35$\pm$3.08  83.73$\pm$2.40
Roomset1  BiLSTM  Linear  Raw sensor readings  2 seconds  89.97$\pm$0.78  85.11$\pm$0.99  86.96$\pm$1.06
Roomset1  DeepCNN  Quadratic  Raw sensor readings  4 seconds  92.43$\pm$1.21  87.93$\pm$1.74  89.73$\pm$1.55
Roomset1  DeepConvLSTM  Linear  Raw sensor readings  4 seconds  91.87$\pm$1.43  88.88$\pm$1.79  90.42$\pm$1.54
Roomset1  (Ours) SparseSense  None  Raw sensor readings  2 seconds  95.0$\pm$0.75  94.08$\pm$0.78  94.51$\pm$0.62
Roomset2  $\text{SVM}^{lin*}$  Cubic  Handcrafted features  2 seconds  87.06$\pm$4.10  84.00$\pm$2.90  84.97$\pm$3.74
Roomset2  $\text{SVM}^{rbf*}$  None  Handcrafted features  8 seconds  90.97$\pm$4.11  83.88$\pm$2.04  85.53$\pm$2.86
Roomset2  $\text{CRF}^{*}$  None  Handcrafted features  16 seconds  83.68$\pm$6.50  78.29$\pm$3.58  79.99$\pm$4.76
Roomset2  BiLSTM  Previous  Raw sensor readings  2 seconds  92.38$\pm$0.91  91.4$\pm$0.62  91.78$\pm$0.58
Roomset2  DeepCNN  Linear  Raw sensor readings  4 seconds  93.11$\pm$0.94  91.7$\pm$1.18  92.36$\pm$0.99
Roomset2  DeepConvLSTM  Previous  Raw sensor readings  4 seconds  94.16$\pm$0.52  93.05$\pm$0.78  93.77$\pm$0.63
Roomset2  (Ours) SparseSense  None  Raw sensor readings  2 seconds  97.07$\pm$0.52  96.88$\pm$0.34  96.97$\pm$0.37
Our work is built upon the insight that incorporating interpolation techniques to recover the missing measurements across large temporal gaps between received sensor observations in sparse datastreams leads to poor estimations and therefore, significant interpolation errors. As we demonstrate in Section 3.3, the adoption of convolutional filters or recurrent layers to extract temporal patterns from the poorly approximated measurements can potentially propagate the estimation errors to the activity recognition model.
Instead of forcing the network to exploit the potentially weak temporal correlations in sparse datastreams, we propose learning global embeddings from sets that encode aggregated information related to an activity. To this end, we formulate sparse segments as unordered sets with an unfixed and unknown number of sensor readings. We design SparseSense as a set-based activity recognition framework for the HAR task that directly operates on sets of received sensor readings with irregular inter-sample observation intervals and outputs the corresponding activity membership distributions. Our approach provides a complete end-to-end learning method that incentivizes the activity recognition model to uncover globally discriminative representations for input sparse segments with a variable number of samples, and to distinguish different activity categories accordingly.
Network Architecture. The overall architecture of our proposed SparseSense network is illustrated in Fig. 2. Essentially, we approximate the optimal mapping function ${\mathcal{F}}_{{\mathrm{\Theta}}^{*}}$ in Eq. (3) through training of a deep neural network parameterized by $\mathrm{\Theta}$. The primary task for integrating set learning into deep neural networks is employing a shared network to map each set element independently into a higher dimensional embedding space (to facilitate class separability) and adopting a symmetric operation across the element embeddings to generate a global representation for the entire set that does not rely on the set element orderings. We incorporate this pipeline into the building blocks of our network as elucidated in what follows:
Input. Adopting sliding window segmentation over the sparse datastream yields sets of sparsely received sensor observations $\mathcal{X}$ in the predefined temporal window $\delta t$, with potentially varying cardinalities.
The shared sample embedding network. The embedding network $\varphi_{\theta_{1}}:\mathbb{R}^{\mathrm{d}}\to\mathbb{R}^{\mathrm{z}}$, parameterized by $\theta_{1}$, operates identically and independently on each sample measurement $\mathbf{x}$ within the received observation set $\mathcal{X}$ and learns a corresponding higher-dimensional projection $\mathbf{z}_{\mathbf{x}}\in\mathbb{R}^{\mathrm{z}}$ to facilitate separability of activity features in the new embedding space; i.e., $\mathbf{z}_{\mathbf{x}}=\varphi_{\theta_{1}}(\mathbf{x}),\ \forall\mathbf{x}\in\mathcal{X}$. Technically, $\varphi_{\theta_{1}}$ is a standard multi-layer perceptron (MLP) whose parameters are shared between the sensor sample readings; i.e., all samples undergo the same layer operations and are therefore processed identically through a copy of the MLP.
The aggregation layer. Described by $h:\mathbb{R}^{\mathrm{z}}\times\ldots\times\mathbb{R}^{\mathrm{z}}\to\mathbb{R}^{\mathrm{z}}$, the aggregation layer applies a symmetric operation across the latent representations of individual sensor samples and extracts a fixed-size global embedding $\mathbf{z}_{\mathcal{X}}\in\mathbb{R}^{\mathrm{z}}$ to represent the sensory segment as a whole. Thus, for a given sensory segment $\mathcal{X}_{i}$, we have
$$\mathbf{z}_{\mathcal{X}_{i}}=h(\{\mathbf{z}_{\mathbf{x}_{i}},\ldots,\mathbf{z}_{\mathbf{x}_{i+m_{i}-1}}\}). \tag{4}$$
Notably, the shared sample embedding network coupled with the symmetric aggregation layer allows summarizing sparse segments with effective high-dimensional projections that i) do not rely on the weak temporal ordering of the sparse samples, and ii) ensure fixed-size tensor representations independent of the number of received readings. Inspired by [?], in this paper we set $h$ to a feature-wise maximum pooling across sample embeddings, which promises robustness against set element perturbations.
The segment embedding classifier. Described by $\rho_{\theta_{2}}:\mathbb{R}^{\mathrm{z}}\to\mathcal{A}$ and parameterized by $\theta_{2}$, the classifier is trained to exploit the segment embeddings $\mathbf{z}_{\mathcal{X}}$ through multiple layers of non-linearity and predict the corresponding activity class probability distributions $\widehat{\mathbf{y}}$; i.e., $\widehat{\mathbf{y}}=\rho_{\theta_{2}}(\mathbf{z}_{\mathcal{X}})$. Here, a softmax activation function governs the output of our network to yield posterior probability distributions over the activity space $\mathcal{A}$.
Summary. Now, we can express the mathematical operations constituting the forward pass of our proposed activity recognition model for a given sparse sensory segment ${\mathcal{X}}_{i}$ as:
$$\mathcal{F}_{\Theta}(\mathcal{X}_{i}^{m_{i}})=\rho_{\theta_{2}}\left(h(\{\varphi_{\theta_{1}}(\mathbf{x}_{i}),\ldots,\varphi_{\theta_{1}}(\mathbf{x}_{i+m_{i}-1})\})\right), \tag{5}$$
where $\mathrm{\Theta}$ denotes the collection of all network parameters; i.e., $\mathrm{\Theta}=({\theta}_{1},{\theta}_{2})$.
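The composition in Eq. (5) can be sketched in a few lines of NumPy. This is an illustrative forward pass only, under simplifying assumptions: the layer widths, the single-layer $\varphi$ and $\rho$, and the random weights are placeholders rather than the architecture used in the paper. It shows the two properties that matter here: the output does not depend on the order of the readings, and its size does not depend on how many readings were received.

```python
import numpy as np

rng = np.random.default_rng(0)
d, z, c = 3, 16, 4  # channels, embedding width, activity classes (illustrative)

# Randomly initialised placeholder parameters (assumed, not the paper's).
W1, b1 = rng.normal(size=(d, z)), np.zeros(z)   # shared sample embedding phi
W2, b2 = rng.normal(size=(z, c)), np.zeros(c)   # segment classifier rho

def phi(X):
    """Shared embedding: the same weights are applied to every row (sample)."""
    return np.maximum(X @ W1 + b1, 0.0)

def h(Z):
    """Symmetric aggregation: feature-wise max pooling over the set."""
    return Z.max(axis=0)

def rho(z_seg):
    """Classifier head ending in a numerically stable softmax."""
    logits = z_seg @ W2 + b2
    e = np.exp(logits - logits.max())
    return e / e.sum()

def forward(X):
    """F(X) = rho(h({phi(x) for x in X})) for a set stored as an (m, d) array."""
    return rho(h(phi(X)))

X = rng.normal(size=(5, d))              # a set with m = 5 readings
y_full = forward(X)
y_shuffled = forward(X[rng.permutation(5)])
y_small = forward(X[:2])                 # a set with only m = 2 readings

print(np.allclose(y_full, y_shuffled))   # order of readings is irrelevant
print(y_full.shape, y_small.shape)       # same output size for any cardinality
```

Any symmetric reduction (sum, mean) would also give order invariance; max pooling is the choice stated in the text.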
Network Training and Activity Inference. During the training process, the goal is to learn the network parameters $\Theta$ such that the disagreement between the network outputs and the corresponding ground-truth activities is minimised for the training dataset. We can precisely express this discrepancy minimisation by adopting an end-to-end optimisation of the negative log-likelihood loss function $\mathcal{L}_{\text{NLL}}$ on the training dataset $\mathcal{D}_{\text{sparse}}$; i.e.,
$$\Theta^{*}=\arg\min_{\Theta}\sum_{i=1}^{n}\mathcal{L}_{\text{NLL}}(\mathcal{F}_{\Theta}(\mathcal{X}_{i}^{m_{i}}),\mathbf{y}_{i}). \tag{6}$$
As the training process progresses and the corresponding objective function is minimised, the SparseSense network uncovers highly discriminative embeddings for sparse segments that allow effective separation of classes in the activity space.
Once the training procedure converges and the optimal network parameters ${\mathrm{\Theta}}^{*}$ are learned from the training dataset, we adopt a maximum a posteriori (MAP) inference to promote the most probable activity category for any given set of sparse sensor readings; i.e., the highest scoring class in the softmax output of the network is chosen to be the final prediction.
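A single optimisation step and the subsequent MAP prediction might look as follows in PyTorch. This is a hedged sketch rather than the authors' implementation: the class name `SetClassifier`, the layer widths, and the padding-plus-mask batching of variable-size sets are assumptions made for illustration. The optimiser settings mirror the experiment setup described later, reading the learning rate and weight decay as $10^{-4}$.

```python
import torch
import torch.nn as nn

class SetClassifier(nn.Module):
    """phi (shared MLP) -> feature-wise max pooling -> rho (log-softmax head)."""

    def __init__(self, d=3, z=64, c=4):  # widths are illustrative assumptions
        super().__init__()
        self.phi = nn.Sequential(nn.Linear(d, z), nn.ReLU(),
                                 nn.Linear(z, z), nn.ReLU())
        self.rho = nn.Sequential(nn.Linear(z, z), nn.ReLU(), nn.Linear(z, c))

    def forward(self, x, mask):
        # x: (batch, max_m, d) padded sets; mask: (batch, max_m), True = real reading
        emb = self.phi(x)
        # Padded slots are set to -inf so they can never win the max.
        emb = emb.masked_fill(~mask.unsqueeze(-1), float("-inf"))
        pooled = emb.max(dim=1).values
        return torch.log_softmax(self.rho(pooled), dim=-1)

torch.manual_seed(0)
model = SetClassifier()
optimiser = torch.optim.RMSprop(model.parameters(), lr=1e-4, weight_decay=1e-4)
scheduler = torch.optim.lr_scheduler.StepLR(optimiser, step_size=100, gamma=0.1)
loss_fn = nn.NLLLoss()  # expects log-probabilities, as in Eq. (6)

# Toy mini-batch: two sets with cardinalities 3 and 1, padded to max_m = 3.
x = torch.randn(2, 3, 3)
x[1, 1:] = 0.0
mask = torch.tensor([[True, True, True], [True, False, False]])
y = torch.tensor([0, 2])

log_probs = model(x, mask)
loss = loss_fn(log_probs, y)
optimiser.zero_grad()
loss.backward()
optimiser.step()
scheduler.step()  # in a full run this would be called once per epoch

# MAP inference: pick the highest-scoring class for each set.
prediction = model(x, mask).argmax(dim=-1)
print(loss.item(), prediction.tolist())
```

Masking with negative infinity before pooling is one common way to batch sets of different sizes; processing one set at a time would avoid padding altogether at the cost of throughput.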
3 Experiments and Results
3.1 Datasets
To ground our study, we evaluate our proposed framework on two naturally sparse public datasets collected in clinical rooms with older people using a body-worn batteryless sensor intended for ambulatory monitoring in hospital settings. For further insights, we also present extensive empirical analysis of our approach on a HAR benchmark dataset with synthesized sparsification and provide comparisons against the state-of-the-art deep learning based HAR models.
Clinical Room Datasets [?]: The data were collected from fourteen older volunteers, with a mean age of 78 years, performing a set of broadly scripted activities while wearing a $\text{W}^{2}\text{ISP}$ over their attire at the sternum level (see Fig. 1). The $\text{W}^{2}\text{ISP}$ is a passive sensor-enabled RFID (Radio Frequency Identification) device that operates on harvested electromagnetic energy emitted from nearby RFID antennas to send data with an upper-bound sampling rate of 40 Hz. Data collection was carried out in two clinical rooms with two different antenna deployment configurations to power the sensor and capture data, resulting in the Roomset1 and Roomset2 datasets. Each sensor observation in the obtained datasets records tri-axial acceleration measurements as well as contextual information from the RFID platform indicating the antenna identifier and the strength of the signal received from the sensor. These recordings were manually annotated with lying on bed, sitting on bed, ambulating and sitting on chair to closely simulate hospitalized patients' actions. Consecutive samples in the sparse datastreams from Roomset1 and Roomset2 exhibit high mean time differences of $0.37$ s and $0.72$ s, respectively.
WISDM Benchmark Dataset [?]: This dataset contains acceleration measurements from 36 volunteers collected under controlled laboratory conditions while performing a specific set of activities. The sensing device used for data acquisition is an Android mobile phone with a constant sampling rate of 20 Hz, placed in the subject's front pants pocket. The sensor samples carry annotations from walking, jogging, climbing up stairs, climbing down stairs, sitting and standing. The collected dataset delivers high quality data and has frequently been used in HAR studies for benchmarking purposes. Accordingly, we find this dataset a suitable choice for thorough investigation of our SparseSense network under different levels of synthesized data sparsification.
3.2 Experiment Setup
In this study, we initially perform per-feature normalization to scale real-valued observation attributes to the $[0,1]$ interval. We consider a fixed temporal context $\delta t$ and obtain sensory partitions by sliding a window over the recorded datastreams. The acquired segments are assumed to reflect adequate information related to a wearer's current activity and are thus assigned a categorical activity label based on the most frequently observed sample annotation in the timespan of the sliding window.
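The two preprocessing steps described above, min-max scaling per feature and majority-vote labelling per window, can be sketched as below. The helper names are hypothetical, and the sketch computes the scaling statistics on whatever array it is given; whether the paper derives them from the training folds only is not stated.

```python
from collections import Counter

import numpy as np

def min_max_normalise(X):
    """Scale each column (feature) of an (n, d) array to the [0, 1] interval."""
    X = np.asarray(X, dtype=float)
    lo, hi = X.min(axis=0), X.max(axis=0)
    span = np.where(hi > lo, hi - lo, 1.0)  # guard against constant features
    return (X - lo) / span

def majority_label(sample_labels):
    """Label a window with the most frequently observed sample annotation."""
    return Counter(sample_labels).most_common(1)[0][0]

readings = [[1.0, 10.0], [3.0, 10.0], [2.0, 10.0]]
scaled = min_max_normalise(readings)
print(scaled[:, 0])   # first feature scaled to 0, 1 and 0.5
print(majority_label(["sitting on bed", "sitting on bed", "lying on bed"]))
```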
We implement the experiments in the PyTorch [?] deep learning framework on a machine with an NVIDIA GeForce GTX 1060 GPU. The SparseSense deep human activity recognition model is trained in a fully-supervised fashion by back-propagating the gradients of the loss function in mini-batches of size 128; i.e., the network parameters are iteratively adjusted according to the RMSProp [?] update rule in order to minimise the negative log-likelihood loss using mini-batch gradient descent. The optimiser learning rate is initialised with $10^{-4}$, reduced by a factor of $0.1$ after 100 epochs, and the optimisation is ceased after 150 epochs. Further, a weight decay of $10^{-4}$ is imposed as an $L_{2}$ penalty for regularisation. Following previous studies, we employ 7-fold stratified cross-validation on the datasets and preserve activity class distributions across all folds. Each constructed fold is in turn utilized once for validation while the remaining six folds constitute the training data.
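For readers unfamiliar with stratified cross-validation, the following standard-library sketch shows one simple way to build folds that preserve class proportions. The function `stratified_folds` and the toy label counts are assumptions for illustration; the paper does not describe how its folds were constructed beyond being stratified.

```python
import random
from collections import Counter, defaultdict

def stratified_folds(labels, k=7, seed=0):
    """Assign every index to one of k folds, dealing each class round-robin
    so that class proportions are approximately preserved in every fold."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    folds = [[] for _ in range(k)]
    for indices in by_class.values():
        rng.shuffle(indices)
        for j, idx in enumerate(indices):
            folds[j % k].append(idx)
    return folds

# Toy labels: 14 lying, 21 sitting and 7 ambulating segments.
labels = ["lying"] * 14 + ["sitting"] * 21 + ["ambulating"] * 7
folds = stratified_folds(labels, k=7)

for held_out in range(7):
    val_idx = folds[held_out]
    train_idx = [i for f, fold in enumerate(folds) if f != held_out for i in fold]
    # Each fold is used once for validation; the other six form the training data.
    assert len(val_idx) + len(train_idx) == len(labels)

per_fold = [Counter(labels[i] for i in fold) for fold in folds]
print(per_fold[0])
```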
3.3 Baselines and Results
Clinical Room Experiments. In Table 1, we report the mean F-measure ($\text{F-score}_{m}$) as the widely adopted evaluation metric and compare SparseSense with the highest performing activity recognition models previously studied for the naturally sparse clinical room datasets as well as the state-of-the-art deep learning based HAR models. Previous studies have explored shallow models including support vector machines ($\text{SVM}^{lin}$ and $\text{SVM}^{rbf}$) and conditional random fields (CRF) trained on top of handcrafted features extracted from either raw or interpolated sparse segments. In addition, we investigate the effectiveness of BiLSTM [?], DeepCNN and DeepConvLSTM [?] as solid deep learning baselines representing the state of the art for HAR applications.
BiLSTM leverages bidirectional LSTM recurrent layers to directly learn the temporal dependencies of samples within the sensory segments. Both DeepCNN and DeepConvLSTM adopt four layers of 1D convolutional filters along the temporal dimension of the fixed-size segmented data to automatically extract feature representations. However, DeepCNN then uses two fully connected layers to aggregate the feature representations, while DeepConvLSTM utilizes a two-layer LSTM to model the temporal dynamics of feature activations prior to the final softmax layer. We refer interested readers to the original papers introducing the HAR models for further details and network specifications. Following [?], for each baseline we explore progressively increasing window durations, i.e. $\delta t\in\{2,4,8,16\}$ seconds, adopt per-channel interpolation schemes (linear, quadratic, cubic and previous) to compensate for the missing acceleration data and report the highest achieving configurations in Table 1. In this regard, the quadratic and cubic interpolation schemes respectively refer to spline interpolation of second and third order, and the previous scheme fills missed values with the previously received sensor readings.
From the outlined results, we observe that the SparseSense network outperforms all the baseline models by a large margin in the task of sparse datastream classification. Notably, the baselines are: i) well-engineered shallow models that require a large pool of handcrafted features designed by domain experts; and ii) state-of-the-art deep learning HAR models that demand interpolation techniques to synthesize regular sensor sampling rates. In contrast, SparseSense seamlessly operates on sparse sets of sensory observations without requiring any extra interpolation efforts or manually designed features, and automatically extracts highly discriminative embeddings for the classification task in an end-to-end framework.
WISDM Benchmark Experiments. To provide additional insights into the model's behavior, we conduct experiments on the WISDM benchmark dataset and analyze the network's classification performance under different levels of synthesized data sparsification. Taking into account the superior performance of DeepConvLSTM among the baselines in Table 1, here we only present comparisons with this model. Following [?; ?], we partition the datastreams into fixed-size sensory segments using a sliding window of 10 seconds duration (corresponding to 200 sensor readings) and train the HAR models on the acquired segmented data. Subsequently, at test time, we drop sensor readings at random timesteps in order to generate synthetic sparse segments.
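The test-time sparsification can be sketched as follows. The helper `sparsify` and the choice to always keep at least one reading are assumptions for this example; the paper only states that readings are dropped at random timesteps.

```python
import numpy as np

def sparsify(segment, missing_ratio, rng):
    """Randomly drop a fraction of the readings in an (m, d) segment.

    Returns the surviving readings (in their original order) and their
    indices; at least one reading is always kept.
    """
    m = segment.shape[0]
    n_keep = max(1, int(round(m * (1.0 - missing_ratio))))
    keep = np.sort(rng.choice(m, size=n_keep, replace=False))
    return segment[keep], keep

rng = np.random.default_rng(0)
segment = rng.normal(size=(200, 3))      # a 10 s WISDM-style window at 20 Hz
sparse, kept = sparsify(segment, missing_ratio=0.9, rng=rng)
print(sparse.shape)                      # 20 of the 200 readings survive
```

A set-based model can consume `sparse` directly, whereas a fixed-size temporal model would first need the dropped positions re-synthesised by interpolation.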
Tolerance to Data Sparsity and Delays. In Fig. 3, the obtained evaluation measures are plotted for both HAR models under different sparsification settings. When data segments are received in full, DeepConvLSTM performs better than SparseSense due to its ability to capture temporal dependencies between consecutive sensor readings. However, as the data sparsity increases and the temporal correlation weakens, we observe a significant drop in the classification performance of DeepConvLSTM. Notably, with large temporal gaps between sensor observations, interpolation techniques cannot produce good estimations of the missing samples and fail to recover the original acceleration measurements, which in turn impacts the classification decisions of DeepConvLSTM. In contrast, not only does SparseSense achieve comparable classification results for completely received sensor data segments, but it also displays great robustness to data sparsity by making accurate decisions for incomplete segments of sensor data. In addition, we show in the bar plot the mean processing time required by the HAR models to make predictions on a mini-batch of 128 sensory segments. Clearly, our framework demonstrates a significant advantage over other HAR models for real-time activity recognition using sparse datastreams by removing the need for prior interpolation preprocessing.
SparseSense Model Behaviour. We visualize the high-dimensional feature space for both models in 2D space using t-distributed stochastic neighbor embedding (t-SNE) [?] in Fig. 4. In the absence of significant data sparsity, the segment embeddings for each activity are clustered together while different activities are separated in the feature space for both models. However, while SparseSense is able to maintain this cluster separation for severely missed sample ratios and incomplete observation sets, DeepConvLSTM clearly struggles to discriminate between the interpolated segments. Technically, the symmetric max pooling operation in the aggregation layer of SparseSense incentivises our HAR model to summarise sensory segments using only the most informative sensor readings in the segment.
In Fig. 5, we provide density plots for the number of sensor readings that ultimately contribute to the aggregated segment embeddings for each activity category of the WISDM dataset. We observe that SparseSense summarizes the segments by discarding potentially redundant information in neighboring samples when complete sensor data sets are presented to the network (see the density plots, where the tails towards 200 contributing samples have a probability of zero). More interestingly, the network displays a clear distinction in its behavior when learning embeddings for static activities (i.e., sitting and standing) as opposed to dynamic activities (i.e., walking, jogging and climbing stairs): it exploits far fewer sensor observations in the window for the former. This can be intuitively understood, as static activities reflect signal patterns with small changes in sensor measurements over a timed window compared with dynamic activities and can thus be summarised with a smaller number of observations.
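One plausible way to count the readings that contribute to a max-pooled segment embedding is to record, for every feature, which sample attains the maximum and then count the distinct winners. The sketch below follows that reading; the function name and the toy embedding matrix are assumptions, and the paper does not spell out its exact counting procedure.

```python
import numpy as np

def contributing_samples(Z):
    """Z: (m, z) matrix of sample embeddings for one segment.

    Returns the sorted indices of samples that attain the feature-wise
    maximum in at least one of the z embedding dimensions.
    """
    winners = Z.argmax(axis=0)   # winning sample index for each feature
    return np.unique(winners)

# Toy segment: 4 readings embedded into 3 features.
Z = np.array([[0.9, 0.1, 0.0],
              [0.2, 0.8, 0.7],
              [0.1, 0.3, 0.2],
              [0.5, 0.5, 0.5]])
idx = contributing_samples(Z)
print(idx.tolist())  # -> [0, 1]
```

Under this definition the count can never exceed either the set cardinality or the embedding width, and readings that win no feature have no influence on the prediction.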
4 Conclusions
In this study, we present an end-to-end human activity recognition framework to learn directly from temporally sparse datastreams using set-based deep neural networks. In contrast to previous studies that rely on interpolation preprocessing to synthesise sensory partitions with fixed temporal context, our proposed SparseSense network seamlessly operates on sparse segments with a potentially varying number of sensor readings and delivers highly accurate predictions in the presence of missing sensor observations. Through extensive experiments on publicly available HAR datasets, we substantiate how our novel treatment of sparse datastream classification results in recognition models that significantly outperform state-of-the-art deep learning based HAR models while incurring notably lower real-time prediction delays.
References
[Alsheikh et al., 2016] Mohammad Abu Alsheikh, Ahmed Selim, Dusit Niyato, Linda Doyle, Shaowei Lin, and Hwee-Pink Tan. Deep activity recognition models with triaxial accelerometers. In Workshops at the Thirtieth AAAI Conference on Artificial Intelligence, 2016.
[Bulling et al., 2014] Andreas Bulling, Ulf Blanke, and Bernt Schiele. A tutorial on human activity recognition using body-worn inertial sensors. ACM Computing Surveys, 46(3):33, 2014.
[Chen et al., 2015] Shengjian Jammy Chen, Christophe Fumeaux, Damith Chinthana Ranasinghe, and Thomas Kaufmann. Paired snap-on buttons connections for balanced antennas in wearable systems. IEEE Antennas and Wireless Propagation Letters, 14:1498–1501, 2015.
[Gövercin et al., 2010] Mehmet Gövercin, Y Költzsch, M Meis, S Wegel, M Gietzelt, J Spehr, S Winkelbach, M Marschollek, and E Steinhagen-Thiessen. Defining the user requirements for wearable and optical fall prediction and fall detection devices for home use. Informatics for Health and Social Care, 35(3-4):177–187, 2010.
[Gu et al., 2018] Fuqiang Gu, Kourosh Khoshelham, Shahrokh Valaee, Jianga Shang, and Rui Zhang. Locomotion activity recognition using stacked denoising autoencoders. IEEE Internet of Things Journal, 5(3):2085–2093, 2018.
[Guan and Plötz, 2017] Yu Guan and Thomas Plötz. Ensembles of deep LSTM learners for activity recognition using wearables. Proceedings of the ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies, 1(2):11, 2017.
[Hammerla et al., 2016] Nils Y. Hammerla, Shane Halloran, and Thomas Plötz. Deep, convolutional, and recurrent models for human activity recognition using wearables. In Proceedings of the Twenty-Fifth International Joint Conference on Artificial Intelligence, pages 1533–1540, 2016.
[Kwapisz et al., 2011] Jennifer R Kwapisz, Gary M Weiss, and Samuel A Moore. Activity recognition using cell phone accelerometers. ACM SigKDD Explorations Newsletter, 12(2):74–82, 2011.
[Lemey et al., 2016] Sam Lemey, Sam Agneessens, Patrick Van Torre, Kristof Baes, Jan Vanfleteren, and Hendrik Rogier. Wearable flexible lightweight modular RFID tag with integrated energy harvester. IEEE Transactions on Microwave Theory and Techniques, 64(7):2304–2314, 2016.
[Maaten and Hinton, 2008] Laurens van der Maaten and Geoffrey Hinton. Visualizing data using t-SNE. Journal of Machine Learning Research, 9:2579–2605, 2008.
[Ordóñez and Roggen, 2016] Francisco Ordóñez and Daniel Roggen. Deep convolutional and LSTM recurrent neural networks for multimodal wearable activity recognition. Sensors, 16(1):115, 2016.
[Paszke et al., 2017] Adam Paszke, Sam Gross, Soumith Chintala, Gregory Chanan, Edward Yang, Zachary DeVito, Zeming Lin, Alban Desmaison, Luca Antiga, and Adam Lerer. Automatic differentiation in PyTorch. 2017.
[Qi et al., 2017] Charles R Qi, Hao Su, Kaichun Mo, and Leonidas J Guibas. PointNet: Deep learning on point sets for 3D classification and segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 652–660, 2017.
[Tieleman and Hinton, 2012] Tijmen Tieleman and Geoffrey Hinton. Lecture 6.5-rmsprop: Divide the gradient by a running average of its recent magnitude. COURSERA: Neural Networks for Machine Learning, 4(2):26–31, 2012.
[Torres et al., 2013] Roberto L Shinmoto Torres, Damith C Ranasinghe, Qinfeng Shi, and Alanson P Sample. Sensor enabled wearable RFID technology for mitigating the risk of falls near beds. In IEEE International Conference on RFID, pages 191–198, 2013.
[Torres et al., 2017] Roberto L Shinmoto Torres, Renuka Visvanathan, Derek Abbott, Keith D Hill, and Damith C Ranasinghe. A batteryless and wireless wearable sensor system for identifying bed and chair exits in a pilot trial in hospitalized older people. PloS one, 12(10):1–25, 2017.
[Wang et al., 2019] Jindong Wang, Yiqiang Chen, Shuji Hao, Xiaohui Peng, and Lisha Hu. Deep learning for sensor-based activity recognition: A survey. Pattern Recognition Letters, 119:3–11, 2019.
[Wickramasinghe and Ranasinghe, 2015] Asanga Wickramasinghe and Damith Ranasinghe. Recognising activities in real time using body worn passive sensors with sparse data streams: To interpolate or not to interpolate? In Proceedings of the 12th EAI International Conference on Mobile and Ubiquitous Systems: Computing, Networking and Services, pages 21–30, 2015.
[Yang et al., 2015] Jian Bo Yang, Minh Nhut Nguyen, Phyo Phyo San, Xiao Li Li, and Shonali Krishnaswamy. Deep convolutional neural networks on multichannel time series for human activity recognition. In Proceedings of the 24th International Conference on Artificial Intelligence, pages 3995–4001, 2015.
[Zaheer et al., 2017] Manzil Zaheer, Satwik Kottur, Siamak Ravanbakhsh, Barnabas Poczos, Ruslan R Salakhutdinov, and Alexander J Smola. Deep sets. In Advances in Neural Information Processing Systems, pages 3391–3401, 2017.
[Zeng et al., 2014] Ming Zeng, Le T Nguyen, Bo Yu, Ole J Mengshoel, Jiang Zhu, Pang Wu, and Joy Zhang. Convolutional neural networks for human activity recognition using mobile sensors. In 6th International Conference on Mobile Computing, Applications and Services, pages 197–205, 2014.