Visual Physics: Discovering Physical Laws from Videos
Abstract
In this paper, we teach a machine to discover the laws of physics from video streams. We assume no prior knowledge of physics, beyond a temporal stream of bounding boxes. The problem is very difficult because a machine must learn not only a governing equation (e.g. projectile motion) but also the existence of governing parameters (e.g. velocities). We evaluate our ability to discover physical laws on videos of elementary physical phenomena, such as projectile motion or circular motion. These elementary tasks have textbook governing equations and enable ground truth verification of our approach.
1 Introduction
This paper aims to teach a machine to discover the laws of physics from video streams. In the apocryphal story, Isaac Newton’s observation of a falling apple was a catalyst for deriving his physical laws. In like fashion, our machine aims to observe the dynamics of a moving object as a means to infer physical laws. We refer to this as discovering physics from video, as shown in Figure 1.
The discovery problem is very difficult because a machine must derive not only the governing equations of a physical model but also governing parameters like velocity. We emphasize that a discovery algorithm like ours does not know a priori what “velocity” means—it must learn the existence of velocity. In order to handle the underdetermined nature of recovering both governing equations and governing parameters, we make a few assumptions. Section 3 expands on our assumptions, which we believe are the most relaxed to date.
Our work is powered by methods from representation learning and evolutionary algorithms. The discovery of underlying governing parameters is achieved using a modified $\beta$-variational autoencoder ($\beta$-VAE) to obtain latent representations. These are then used in an equation discovery step, driven by genetic programming approaches. Our approach is able to learn equations that symbolically match ground truth, and have governing parameters that correspond to human-interpretable constructs (e.g. velocity, angular frequency).
Contributions:
Our key contribution is a first attempt at an algorithm that is able to rediscover both governing equations and governing parameters from video. Previous work can either discover governing equations or the parameters, but not both. We test the algorithm on both synthetic data (with and without noise), as well as real data. Our performance analysis shows that the proposed method results in symbolically accurate expressions, and interpretable governing parameter discovery for a variety of simple, yet fundamental physics tasks. The method is also found to be robust to large amounts of positional noise and effective under a range of input data sizes. To lay a foundation for future work, we release the Visual Physics dataset, consisting of both real and synthetic videos of dynamic physical phenomena.
2 Related Work
Although our goals are different, we are inspired by work in physics-based computer vision, physical representation learning, and symbolic equation derivation.
Physics-based computer vision
encompasses the use of known physical models to either directly solve or inspire computer vision techniques. Techniques like shape from shading [ikeuchi1981numerical, horn1989shape] and photometric stereo [woodham1980photometric] use known models of optical physics to estimate shape. Along this theme, recent work in the area of computational light transport has advanced the field to see around corners [ramesh20085d, velten2012recovering, o2018confocal, xin2019theory] or infer material properties [tanaka2018material]. (For an overview of the physics of light transport, the reader is directed to an ACM SIGGRAPH course by O’Toole and Wetzstein [o2014computational].) Known physical models can also be used to inspire the design of vision algorithms. Examples include deformable parts models [felzenszwalb2008discriminatively, felzenszwalb2009object] or snakes [kass1988snakes], which use the physics of springs to design computer vision cost functions. The recent popularity of data-driven techniques has spawned a family of work that combines a known physical model with pattern recognition. For example, [gregor2010learning, diamond2017unrolled] unfold existing physical models as the backbone of the network architecture; [chen2018reblur2deblur, stewart2017label] use physical information to supervise the training process; [fei2019geo] relies on gravity cues to improve depth estimation; and [davis2015visual, jin2017deep, kang2017deep, ba2019physics, li2019restoration, Halder_2019_ICCV, zeng2019tossingbot] introduce physics-based learning to set a new state of the art in a range of vision problem domains. These approaches are powered by knowledge of a physical model, whereas our work has the complementary aim of learning the underlying model.
Learning physical parameters from visual inputs
has been a topic of interest in recent years. For instance, [JiajunWu2015Gallileo, Brubaker2009, Bhat2002, Mottaghi15Newton, purushwalkam2019bounce, Wu2017Deanimation] estimate parameters or equivalent information for well-characterized physical equations with visual inputs. These can be incorporated into realistic physical engines to infer complex system behavior. Fragkiadaki et al. [Fragidaki16Billiards] integrate the model of external dynamics within the agent to play simulated billiards games. More recently, [Battalgia2016IN, Watters2017VisualInteractionNetworks] deploy interaction networks with graph inputs to encode the interactions among objects in complex environments, and estimate other invariant quantities of the phenomenon using deep learning. In the field of controls, Shi et al. [shi2019neural] learn the near-ground dynamics to achieve stable trajectory control. While these prior attempts are capable of predicting the system dynamics precisely, they also require a well-characterized physical model.
Symbolic regression
aims to generate symbolic equations from a space of mathematical expressions to fit the distributions of input samples. Genetic programming [GeneticProgramming] is one of the prevalent methods in this field, with previous applications in discovering Lagrangians [hills2015algorithm] and nonlinear model structure identification [winkler2005new]. Additional features from the input variables [Kaizen, GPRVM] and partial-derivative pairs [schmidt2009distilling] can also be introduced into genetic programming for more reliable regression. Other evolutionary methods can also be used to derive partial differential equations (PDEs) [maslyaev2019data]. Sparse regression [Brunton2016] and dimensional function synthesis [wang2019deriving] are two other alternatives for conducting symbolic regression. Recently, deep neural networks (DNNs) have also been utilized to perform symbolic regression [EQL, EQL_extented, NeuralSymbolicRegression2019]. These existing methods usually require predetermined terms or prior knowledge from physics.
3 Defining Discovery and its Assumptions
Assumptions:
This paper represents only a first attempt to discover the laws of physics from video. As such, we make certain assumptions. First, we restrict our focus to the dynamics of single objects (rather than groups of objects). Second, it is assumed that we know the object for which we would like to derive the physical equations. Third, we assume that videos are in sequence. We believe these assumptions are sufficiently general to allow us to characterize our technique as “discovering physics”. For example, the apocryphal story of Isaac Newton observing the apple falling aligns with the three assumptions outlined above. In the story, Newton was watching a temporal sequence of a single object in motion and was able to inductively reason about the laws of physics.
Defining “discovery of physics”:
We define discovery of physics as discovering both the governing parameters and governing equations. Given the assumptions from the previous paragraph, we must therefore discover all parameters except for the object location and time. As compared to Huang et al. [huang2018NIPSworkshop], where the parameters of the governing equations are used as prior knowledge, our attempt at discovery is more general. Concretely, for a task like trajectory estimation, our framework has to tackle the challenging task of learning both the projectile equation, as well as the existence of a “velocity” term, from video input. Refer to Figure 2 for details.
4 Algorithm Architecture for Discovery
Having defined “discovery” in Section 3, we now describe a framework that enables discovery of physics from video. There are three interconnected modules that handle position detection, latent physics discovery, and equation discovery, respectively. Figure 3 summarizes this framework.
Position detection module:
We build the Visual Physics framework based on the assumption that the underlying physical equations are reflected in the dynamics of an object across different time steps. Therefore, a robust object detection algorithm is required at the first stage to localize the moving object accurately across diverse object categories. We deploy a pretrained Mask R-CNN [he2017mask] to extract the bounding box of the object in each frame, and the centroid of the detected bounding box is taken as the object location in that frame.
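The step from bounding boxes to object locations can be sketched as follows. This is a minimal illustration, not the authors' code: the `(x_min, y_min, x_max, y_max)` box layout and the function names are assumptions, and the detector itself is not shown.

```python
from typing import List, Tuple

# Assumed box layout: (x_min, y_min, x_max, y_max) in pixels.
Box = Tuple[float, float, float, float]


def box_centroid(box: Box) -> Tuple[float, float]:
    """Return the centre of an axis-aligned bounding box."""
    x_min, y_min, x_max, y_max = box
    return ((x_min + x_max) / 2.0, (y_min + y_max) / 2.0)


def trajectory_from_boxes(boxes: List[Box]) -> List[Tuple[float, float]]:
    """Map one bounding box per frame to one (x, y) location per frame."""
    return [box_centroid(b) for b in boxes]


# Two hypothetical detections from consecutive frames.
boxes = [(10, 20, 30, 40), (12, 26, 32, 46)]
trajectory = trajectory_from_boxes(boxes)
```

The resulting list of per-frame locations is the only information passed on to the later modules.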
Latent physics module:
The objective of the Visual Physics framework is to derive the governing physical laws without prior knowledge. To achieve this goal, we need to infer the associated latent governing parameters from positional observations. VAEs [kingma2013auto] have been widely deployed to extract latent representations with applications in physics, such as SciNet [iten2018discovering]. We adopt a modified $\beta$-VAE architecture for our latent physics module as well. The encoder takes a vector corresponding to the object trajectory at uniformly sampled time instants as the input, and condenses it into a limited number of latent parameters. The decoder tries to reconstruct the object location $({x}_{q},{y}_{q})$ at an unseen time instant with these latent parameters ${[{l}_{1}\ {l}_{2}\ {l}_{3}]}^{T}$ and the time instant ${t}_{q}$ as inputs. This module is supervised by the object locations without other prior physical knowledge. Once the network converges, the locations obtained from the position detection module and the corresponding learned hidden representations from the latent physics module are paired as the input to the equation discovery module.
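To make the input and output of this module concrete, the sketch below builds one training example from a trajectory: a flattened vector of observed locations for the encoder, plus a query time and target location for the decoder. The choice of the first `n_observed` frames as the observed segment, the random choice of query frame, and all names are assumptions made for illustration; the network itself is omitted.

```python
import random
from typing import List, Tuple

Point = Tuple[float, float]


def make_training_example(
    positions: List[Point],
    times: List[float],
    n_observed: int,
    rng: random.Random,
) -> Tuple[List[float], float, Point]:
    """Split one trajectory into (encoder input, query time, decoder target)."""
    if not 0 < n_observed < len(positions):
        raise ValueError("need at least one observed and one unseen frame")
    # Encoder input: flattened (x, y) locations of the observed frames.
    encoder_input = [c for p in positions[:n_observed] for c in p]
    # Decoder query: a time instant that is not part of the encoder input.
    q = rng.randrange(n_observed, len(positions))
    return encoder_input, times[q], positions[q]


rng = random.Random(0)
times = [k / 240.0 for k in range(10)]
positions = [(float(k), float(k * k)) for k in range(10)]
encoder_input, t_q, target = make_training_example(positions, times, 6, rng)
```

Because the decoder only receives the latent vector and `t_q`, it can reconstruct `target` well only if the latent vector summarizes the trajectory.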
Equation discovery module:
We concatenate the latent parameters and positional observations, and use this as input to a symbolic regression approach. Vanilla genetic programming approaches are usually subject to convergence issues, and may lead to trivial equations that are not descriptive of the physics associated with the data. Schmidt et al. [schmidt2009distilling] alleviate this problem by introducing partial derivative pairs between the input variables as a search criterion. We follow this strategy to design an equation discovery module, capable of generating multiple equations with a range of equation complexity and fit accuracy. The final output is a symbolic equation that is Pareto-optimal.
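Pareto optimality over complexity and fit error can be illustrated with a short sketch. The candidate expressions, complexity scores, and error values below are invented for illustration and are not results from the paper; the symbolic regression search itself is not reproduced.

```python
from typing import List, NamedTuple


class Candidate(NamedTuple):
    expression: str
    complexity: int   # e.g. number of terms and operations
    error: float      # fit error, lower is better


def pareto_front(candidates: List[Candidate]) -> List[Candidate]:
    """Keep candidates that no other candidate beats on both criteria."""
    front = []
    for c in candidates:
        dominated = any(
            o.complexity <= c.complexity
            and o.error <= c.error
            and (o.complexity < c.complexity or o.error < c.error)
            for o in candidates
        )
        if not dominated:
            front.append(c)
    return sorted(front, key=lambda c: c.complexity)


# Hypothetical candidates for the vertical coordinate of a falling object.
candidates = [
    Candidate("y = c0", 1, 0.90),
    Candidate("y = c0 + l2*t", 5, 0.40),
    Candidate("y = c0 + l2*t + c1*t*t", 9, 0.01),
    Candidate("y = c0*t + sin(l1)", 10, 0.25),
    Candidate("y = c0 + l2*t + c1*t*t + c2*sin(t)", 14, 0.009),
]
front = pareto_front(candidates)
```

Here the expression with complexity 10 is dropped because a simpler expression fits better; the remaining four form the trade-off curve from which one equation is selected.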
5 Implementation
Visual Physics dataset:
To evaluate the proposed framework, we generate both a real and a synthetic dataset of videos covering physical phenomena. Table 1 shows three simulated phenomena: Free Fall, Constant Acceleration Motion and Uniform Circular Motion. Each synthetic task includes 600 videos with randomly sampled physical parameters. We additionally include real video clips for Free Fall (411 videos). For all scenes, the physical phenomenon is known in closed form, enabling us to compare our proposed approach to ground truth. While the physics may seem elementary, we test in real-world conditions and add noise to make the task harder. Please see the supplement for additional scenes with a wider range of complexity.
| Dataset | Description |
| --- | --- |
| Free fall | This dataset consists of 600 videos of 150 frames each at a frame rate of 240 frames per second. The frame size is chosen to be 720$\times$720 pixels. The object of interest is released with random initial velocities, from random points across different videos. The positions are selected from a uniform distribution, such that the initial position is in the bottom-left quadrant of the image. Initial velocities are also selected from a uniform distribution such that the object stays in the frame for the duration of the video. The object is acted upon by earth’s gravity ($9.8\,m/{s}^{2}$ at a scale of 300 pixels per meter), which is the only active external agent. |
| Constant acceleration motion | This dataset consists of 600 videos of 200 frames each, at a frame rate of 40 frames per second and a frame size of 720$\times$720 pixels. Here, the object of interest is released horizontally with a fixed initial velocity of $5\,m/s$ (at a scale of 8 pixels per meter), and is acted upon by a uniformly random sampled external force, leading to an acceleration $a\in [0,4]$ $m/{s}^{2}$. |
| Uniform circular motion | This dataset consists of 600 videos of 200 frames each, at a frame rate of 20 frames per second and a frame size of 720$\times$720 pixels. In this scenario, the object of interest is in uniform circular motion at a fixed radius of 5 m (at a scale of 50 pixels per meter), with angular velocity $\omega \in [1,2]$ rad/s. The center of rotation is kept fixed across all dataset videos. The initial position of the object is kept fixed, and no additional external force affects this motion (that is, the motion is assumed to be in the horizontal plane). |
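As a rough illustration of how trajectories with these parameters could be generated, the sketch below computes per-frame positions for the free fall and uniform circular motion settings. It produces coordinates only, not rendered video. The upward-positive $y$ axis, the zero initial phase, the example start point, and the function names are assumptions; the frame counts, frame rates, scales, and $g$ come from the table.

```python
import math
from typing import List, Tuple

Point = Tuple[float, float]


def free_fall_pixels(
    x0: float, y0: float, v0x: float, v0y: float,
    n_frames: int = 150, fps: float = 240.0,
    px_per_m: float = 300.0, g: float = 9.8,
) -> List[Point]:
    """Positions in pixels; x0, y0 in pixels, velocities in m/s, y upward."""
    out = []
    for k in range(n_frames):
        t = k / fps
        x = x0 + px_per_m * v0x * t
        y = y0 + px_per_m * (v0y * t - 0.5 * g * t * t)
        out.append((x, y))
    return out


def circular_pixels(
    cx: float, cy: float, omega: float,
    n_frames: int = 200, fps: float = 20.0,
    radius_m: float = 5.0, px_per_m: float = 50.0,
) -> List[Point]:
    """Uniform circular motion about (cx, cy), starting at phase zero."""
    r = radius_m * px_per_m
    return [
        (cx + r * math.cos(omega * k / fps), cy + r * math.sin(omega * k / fps))
        for k in range(n_frames)
    ]


fall = free_fall_pixels(100.0, 200.0, v0x=1.0, v0y=2.0)
circle = circular_pixels(360.0, 360.0, omega=1.5)
```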
Software implementation and training details:
For the position detection module, we deploy a Mask R-CNN [he2017mask] pretrained on the COCO dataset [lin2014microsoft]. For the latent physics module, both the encoder and the decoder consist of six fully-connected layers, and the size of the latent parameters is set to three. We use the mean squared error (MSE) of the reconstructed locations and the $\beta$-VAE loss [higgins2017betaVAE] to supervise the training process. The $\beta$-VAE penalty is introduced to encourage the disentanglement of latent representations, so that independent physical parameters are inferred in separate latent nodes. The entire loss function $L$ of the latent physics network can be written as follows:
$$L={L}_{mse}({Y}_{{t}_{q}},{\widehat{Y}}_{{t}_{q}})+\beta {L}_{kl}(Z), \tag{1}$$
where ${Y}_{{t}_{q}}$ is the ground-truth location at time step ${t}_{q}$, ${\widehat{Y}}_{{t}_{q}}$ is the estimated location from the network, ${L}_{mse}(\cdot )$ is the MSE loss, $Z$ denotes the extracted latent representations, ${L}_{kl}(\cdot )$ denotes the Kullback–Leibler divergence between the distribution of the latent representations and a Gaussian prior, and $\beta $ is the balance factor for the $\beta$-VAE loss as described in [higgins2017betaVAE]. We use the Adam optimizer [kingma2014adam] with an initial learning rate of 0.001, and this learning rate is decayed exponentially with a factor of 0.99 every 200 epochs. All the networks are implemented in the PyTorch framework [paszke2017automatic]. We construct the equation discovery module by using the widely available Eureqa package [EureqaSoftware]. The candidate operation set includes all the basic operations, such as addition, multiplication, and the sine function. We search for two equations, one each for the horizontal and vertical directions, and the R-squared value is used to measure the goodness of fit during the search. Please refer to Appendix D for additional implementation details.
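The loss in Equation (1) and the learning-rate schedule can be written out in a few lines of plain Python. This is a sketch under stated assumptions rather than the authors' implementation: the prior is taken to be a standard normal, the encoder is assumed to output a mean and log-variance per latent node, the MSE is averaged over coordinates, and the decay is applied as a step every 200 epochs. The value of $\beta$ is not given in this section, so the one used below is arbitrary.

```python
import math
from typing import Sequence


def mse(target: Sequence[float], estimate: Sequence[float]) -> float:
    """Mean squared error between the true and reconstructed location."""
    return sum((a - b) ** 2 for a, b in zip(target, estimate)) / len(target)


def kl_to_standard_normal(mu: Sequence[float], log_var: Sequence[float]) -> float:
    """KL( N(mu, diag(exp(log_var))) || N(0, I) ), summed over latent nodes."""
    return 0.5 * sum(m * m + math.exp(lv) - lv - 1.0 for m, lv in zip(mu, log_var))


def beta_vae_loss(target, estimate, mu, log_var, beta: float) -> float:
    """L = L_mse + beta * L_kl, following Equation (1)."""
    return mse(target, estimate) + beta * kl_to_standard_normal(mu, log_var)


def learning_rate(epoch: int, initial: float = 1e-3,
                  decay: float = 0.99, step: int = 200) -> float:
    """Multiply the initial rate by `decay` once every `step` epochs."""
    return initial * decay ** (epoch // step)


# Three latent nodes, as in the paper; all numbers are illustrative.
loss = beta_vae_loss(
    target=[1.0, 2.0], estimate=[1.0, 4.0],
    mu=[1.0, 0.0, 0.0], log_var=[0.0, 0.0, 0.0], beta=2.0,
)
```

A latent node whose mean stays at zero with unit variance contributes nothing to the KL term, which is why a larger $\beta$ pushes unneeded nodes toward carrying no information.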
6 Evaluation
Section 6.1 evaluates our results on discovering equations from synthetic videos. Section 6.2 shows that the method generalizes to real data. Finally, Section 6.3 tests the robustness of our technique by introducing noise and other confounding factors.
6.1 Synthetic Data Evaluation
Figure 4 illustrates various results from our framework, tested on the synthetically generated data described in Table 1. With free fall, we assess the ability of our system to perform with parameters that affect the discovery linearly (as coefficients to a term linear in time). With constant acceleration, we observe the performance for a nonlinear (quadratic) parameter effect. Finally, circular motion provides insight into performance for sinusoidal dependence. Results for two additional tasks, helical motion and damped oscillation, may be found in Appendix B.
Free Fall (synthetic):
In this scene, all possible trajectories are completely parameterized by the initial velocities ${v}_{0x}$ and ${v}_{0y}$ along the $x$ and $y$ directions. Figure 4(a) displays the output of our method for free fall, including both embeddings as well as the discovered equation. The embedding trends show that our latent physics model successfully learns to separate the horizontal and vertical velocities into two separate nodes. The correlation of the three latent nodes with the two governing (ground-truth) parameters demonstrates that the nodes learn an affine transform of the ground-truth velocities. It is important to note that the third node does not show dependence on the input, assuming a constant value. This reconciles with human intuition in the sense that free fall is determined only by two parameters. In evaluating the final output, we observe that the discovered governing equation matches the form of the familiar kinematic equations. The value of the acceleration due to gravity is learnt exactly and the parametric dependence of the equation on the initial velocities is accurate up to an affine transform.
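For reference, the textbook kinematic equations that the discovered expressions are compared against can be written as follows. The notation is assumed for illustration: $({x}_{0},{y}_{0})$ is the initial position, $g$ is the acceleration due to gravity in pixel units, $y$ is taken as positive upward, and the coefficients ${a}_{i},{b}_{i}$ are hypothetical names for the affine map. The assignment of ${l}_{1}$ and ${l}_{2}$ to the two velocities is likewise illustrative.

$$x(t)={x}_{0}+{v}_{0x}t,\qquad y(t)={y}_{0}+{v}_{0y}t-\tfrac{1}{2}g{t}^{2}$$

If the two active latent nodes satisfy ${l}_{1}={a}_{1}{v}_{0x}+{b}_{1}$ and ${l}_{2}={a}_{2}{v}_{0y}+{b}_{2}$ with ${a}_{1},{a}_{2}\ne 0$, then substituting ${v}_{0x}=({l}_{1}-{b}_{1})/{a}_{1}$ and ${v}_{0y}=({l}_{2}-{b}_{2})/{a}_{2}$ gives

$$x(t)={x}_{0}+\frac{{l}_{1}-{b}_{1}}{{a}_{1}}t,\qquad y(t)={y}_{0}+\frac{{l}_{2}-{b}_{2}}{{a}_{2}}t-\tfrac{1}{2}g{t}^{2}$$

The equation keeps the same form in $t$: the latent nodes enter the linear term through scaled and shifted coefficients, while the coefficient of ${t}^{2}$ does not depend on the latent nodes at all.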
Constant Acceleration Motion (synthetic):
In this task, the trajectory is governed by a single parameter: the acceleration $a$ acting on the object. The results are displayed in Figure 4(b). As we expect, since only one of the nodes is required to describe the phenomenon, the embedding trends show that two nodes are invariant to the input and learn an almost constant, low-magnitude value. The other node, which is correlated to the input, learns acceleration. Turning to the output equations, we find that our method discovers the correct form and that the active latent variable maps to an interpretation of $a$. Also note that the value of the $y$ coordinate, which is expected to be constant, is discovered accurately.
Uniform Circular Motion (synthetic):
This task has a sinusoidal, rather than polynomial form. For a fixed radius of revolution, the governing parameter we seek to discover is the angular frequency $\omega $ of the rotating object. Hence, this task also depends on a single governing parameter. Figure 4(c) highlights that one of the latent parameters is correlated with angular frequency, while the other two are uncorrelated to the input. Based on the learned parameters and observed positions, the proposed method correctly identifies a sinusoidal dependence for both the $x$ and the $y$ coordinates.
6.2 Real Data Evaluation
Free Fall (real experiment):
We replicate free fall in the real world in a relatively uncontrolled manner. As shown in Figure 5, the test set is a video sequence of a human tossing a ball with varying spins and uncontrolled air resistance. The motion may also not be perpendicular to the camera, leading to scale inconsistencies. 411 videos are collected, where each video represents a toss. To obtain ground truth initial velocities, we fit the kinematic equations to the observed videos, using the appropriately scaled value of the acceleration due to gravity $g$. The proposed latent discovery module does not have the luxury of this information. We report results in two conditions. In Figure 5(a), we train on real data and test on real data. Diversity in the dataset occurs due to different types of spins and tosses. To show that our method is not overfitting, Figure 5(b) displays results when we train on synthetic data and test on real data. Both cases achieve successful discovery of the ground-truth governing equation. In particular, two latent nodes show strong affine correlations with the ground truth horizontal and vertical velocities. In contrast, the third node, as we would expect, is uncorrelated (since only two parameters, ${v}_{0x},{v}_{0y}$ govern the system). The symbolic form of the equation we learn reconciles with the known physics model up to an affine transform in the governing parameters. It is important to note that a slight error is observed when testing on real data. In both Figure 5(a) and 5(b), the learned value of the acceleration due to gravity is off by about $7\%$. We believe the following reasons account for a part of this inconsistency: (i) noise due to the greater Mask R-CNN error on the real videos, as compared to the simulations; and (ii) physical non-idealities such as air resistance and drag. We successfully test our method on an additional real task, uniform circular motion. Please refer to Appendix A for details and results.
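One way to obtain ground-truth initial velocities by fitting the kinematic equations with a known $g$ is an ordinary least-squares line fit, sketched below for the vertical coordinate. The paper does not describe its fitting procedure in this section, so this is an assumed approach; it takes $y$ as positive upward and `g_scaled` in pixel units, and the function names are hypothetical.

```python
from typing import List, Tuple


def fit_line(ts: List[float], zs: List[float]) -> Tuple[float, float]:
    """Ordinary least squares for z = slope * t + intercept."""
    n = len(ts)
    mt, mz = sum(ts) / n, sum(zs) / n
    stt = sum((t - mt) ** 2 for t in ts)
    stz = sum((t - mt) * (z - mz) for t, z in zip(ts, zs))
    slope = stz / stt
    return slope, mz - slope * mt


def fit_initial_velocity(ts: List[float], ys: List[float],
                         g_scaled: float) -> Tuple[float, float]:
    """Fit y(t) = y0 + v0*t - 0.5*g*t^2 for (v0, y0) with g known."""
    # Moving the known gravity term to the left leaves a straight line in t.
    zs = [y + 0.5 * g_scaled * t * t for t, y in zip(ts, ys)]
    return fit_line(ts, zs)


# Synthetic check with known parameters (units are arbitrary here).
ts = [k / 240.0 for k in range(50)]
ys = [1.0 + 3.0 * t - 0.5 * 9.8 * t * t for t in ts]
v0_hat, y0_hat = fit_initial_velocity(ts, ys, g_scaled=9.8)
```

The horizontal velocity follows from the same `fit_line` call applied directly to the $x$ coordinates, since no acceleration term needs to be removed there.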
6.3 Performance Analysis
We now look at analyzing, in reasonable detail, the characteristics and performance of the proposed approach. These factors hold special importance towards the function of the pipeline as a physics discovery unit, in a future application domain (e.g. biomedical, astrophysics).
Latent nodes an affine transform of ground truth:
Figure 4 and Figure 5 explicitly show that the latent nodes are an affine transformation of the ground-truth governing parameters. This reinforces our claim that the latent parameters we learn are human interpretable. Due to the use of a $\beta$-VAE, the latent physics module is constrained to learn sparse representations, subject to a Pareto fit. Adding additional latent nodes therefore results in representations for these superfluous nodes either being entirely uncorrelated to the governing parameters, or of extremely low magnitude. The affine transform is important, not only for interpretability, but also because linear least squares can be used to tune the parameters once the governing equation has been identified.
Robustness against noise:
To assess performance in the presence of noise, we use the synthetic free fall task and add noise of varying strengths to the output of the position detection module. This corrupted data is then used to train the latent physics module and serves as the input to the equation discovery module. The plots of governing parameters in Figure 6 show that with increasingly noisy input trajectories, the representations remain relatively robust. However, the variance in representations is found to increase as the input corruption level increases. We are satisfied with the quality of these representations. Using even noisy (yet correlated) representations in the equation discovery step still enables us to recover output equations that are symbolically accurate. The method eventually fails for corruption with noise of standard deviation 128. At this very high noise level, even the direction of the trajectory is changing (i.e. the ball appears to travel backward). We can observe this in the last column of Figure 6.
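The corruption step can be sketched as adding independent noise to each detected coordinate. Zero-mean Gaussian noise applied independently to $x$ and $y$ is an assumption here, as is the function name; the text only specifies the standard deviation.

```python
import random
from typing import List, Tuple

Point = Tuple[float, float]


def add_position_noise(trajectory: List[Point], sigma: float,
                       rng: random.Random) -> List[Point]:
    """Add independent zero-mean Gaussian noise to every x and y value."""
    return [
        (x + rng.gauss(0.0, sigma), y + rng.gauss(0.0, sigma))
        for x, y in trajectory
    ]


rng = random.Random(0)
clean = [(float(k), 2.0 * k) for k in range(2000)]
noisy = add_position_noise(clean, sigma=8.0, rng=rng)
```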
Equation complexity versus accuracy:
Here we discuss how the proposed framework is able to recover the correct equation by balancing equation sparsity against goodness of fit. The equation discovery module results in a set of possible equations, of varying complexity (a function of the number of terms and operations in the equation). In order to choose an appropriate tradeoff between fitting accuracy and complexity, we use plots such as those shown in Figure 7. The knee point of the tradeoff curve is chosen as the expression of interest, since it marks the point of maximum gain in error performance with minimal increase in complexity. Such a selection ensures that the genetic programming algorithm refrains from overfitting on the relevant data, which is essential for interpretability. This is analogous to observations from representation learning, where there is an understood tradeoff between the extent of disentanglement of latent embeddings and downstream prediction accuracy [higgins2017betaVAE].
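The text does not state how the knee point is located, so the sketch below uses one common heuristic as an assumption: normalize both axes, draw the chord between the simplest and the most complex candidate, and pick the point farthest from that chord. The complexity and error values are invented for illustration.

```python
import math
from typing import List


def knee_index(complexity: List[float], error: List[float]) -> int:
    """Index of the point farthest from the chord joining the two ends.

    Inputs must be sorted by increasing complexity.
    """
    if len(complexity) < 3:
        raise ValueError("need at least three points to define a knee")
    c_span = complexity[-1] - complexity[0]
    e_min, e_max = min(error), max(error)
    if c_span == 0 or e_max == e_min:
        raise ValueError("degenerate trade-off curve")
    # Normalize both axes to [0, 1] so neither dominates the distance.
    xs = [(c - complexity[0]) / c_span for c in complexity]
    ys = [(e - e_min) / (e_max - e_min) for e in error]
    dx, dy = xs[-1] - xs[0], ys[-1] - ys[0]
    norm = math.hypot(dx, dy)
    dists = [abs(dy * (x - xs[0]) - dx * (y - ys[0])) / norm
             for x, y in zip(xs, ys)]
    return max(range(len(dists)), key=dists.__getitem__)


# Hypothetical trade-off curve: error drops sharply at complexity 9.
complexity = [1, 5, 9, 14]
error = [0.90, 0.40, 0.01, 0.009]
knee = knee_index(complexity, error)
```

In this made-up curve the knee falls on the third candidate: going from complexity 5 to 9 removes almost all of the error, while going from 9 to 14 changes it very little.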
Effect of training data size:
Finally, we analyze the performance of our proposed method with respect to varying amounts of training data. This holds relevance in terms of the possible application of the pipeline (or others inspired by it) toward tasks with varying data availability. Figure 8 shows the results of this analysis on the synthetic free fall task. We evaluate performance based on: (a) the normalized cross-correlation coefficient between the learnt active latent node and the ground-truth governing parameters, and (b) the trajectory prediction accuracy based on the latent values predicted by the latent physics module on test data, used on the discovered equations. Please refer to Appendix C for a detailed description of these metrics. The general trend of increasing correlation and reducing prediction error with increasing training samples is clearly visible in the plots. It is also of interest that even the scenario with the lowest number of input samples (200 samples) reaches a high correlation of 0.95. This highlights the versatility and robustness of the proposed approach across a range of possible tasks.
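As a hedged illustration of metric (a), the sketch below computes a correlation coefficient between a latent node and a governing parameter. It assumes that the normalized cross-correlation coefficient is the Pearson coefficient of the two mean-removed sequences; the paper's exact definition is in its Appendix C and is not reproduced in this text. The data are synthetic.

```python
import math
from typing import Sequence


def pearson(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return sxy / math.sqrt(sxx * syy)


# A latent node that is an exact affine transform of the true velocity.
velocity = [0.5 * k for k in range(20)]
latent = [0.3 * v - 1.2 for v in velocity]
r = pearson(latent, velocity)
```

Under this definition an exact affine relation gives a coefficient of magnitude 1 regardless of the scale and offset, which is why the metric suits latent nodes that match the governing parameters only up to an affine transform.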
7 Discussion
In summary, we have demonstrated the ability to discover physics from video streams. Our method is unique in that it is able to discover both the governing equations and physical parameters. Our results are powered by an encoder-decoder framework that learns latent representations. We show that, even in cases of significant noise, the latent representations are physically interpretable.
Beyond 2D phenomena:
The Visual Physics dataset consists of 2-dimensional scenarios. For example, the tossing ball is viewed from the side, such that the ball does not change in its axial depth. For engineering reasons, we assume that the physical phenomenon is observed in the 2D camera space of a video camera. If dynamics occur in 3 dimensions (e.g. motion in $x,y,z$), then our algorithmic pipeline is still valid, but we must use a 3D camera to capture these 3D dynamics. In general, the Visual Physics framework can apply to higher-dimensional scenarios, potentially outside of video, provided that the measurement space is able to capture the phenomena.
Applications:
For reader accessibility and experimental reproducibility, we have chosen simple problems (like projectile motion and circular motion). However, we could envision future applications of this framework to domains like high-energy astrophysics, optical scattering, and medical imaging where the governing equations are unknown or partially known. In medical imaging, for example, it is important to find latent embeddings that are both discriminative and physically interpretable.
Open problems:
Analogous to the apocryphal story of Newton’s apple, we have considered the dynamics of a single object. This work is therefore a stepping stone to understanding the dynamics of multiple objects. Another open problem is to extend the pipeline beyond the three modules we have proposed. Concretely, we could also see adding a fourth module where the equation and embeddings we discover are used as input to another inference framework. For example, it might be possible to improve object detection given the velocities of objects, or create computational imaging pipelines that learn to classify scenes based on scattering properties. In conclusion, this paper is scratching the surface of the possibilities at the seamline of computer vision, physics, and artificial intelligence. We are excited to see these fields continue to merge.
References
Supplementary Results
This supplement is organized as follows:

1. Appendix A includes a real scene with a sinusoidal, rather than polynomial, physical form.
2. Appendix B shows that the method generalizes to more difficult physical problems, in terms of mathematical form (e.g. exponential decay, helical motion).
3. Appendix C discusses the quantitative metrics for performance evaluation.
4. Appendix D describes specific implementation details and includes source code snippets for key portions of the paper.
Appendix A Circular Motion (real experiment)
Having discovered the equations for circular motion from synthetic data, experiments on this task are now extended to real data. Through this, we aim to further demonstrate the applicability of our method on real scenes. The dataset consists of 80 videos of an object rotating at fixed angular velocity. The rotation radius is kept constant across the dataset, and the angular velocity $\omega $ is varied in the range [$1.2\pi $, $3\pi $] radians/s. Videos with $\omega $ below this range are excluded from the dataset in order to avoid nonlinear effects of the motor at low frequencies. The first 200 frames of every video are used as input to the position detection module. The positions obtained are corrected for initial phase, so that all input trajectories have the same (zero) phase, by appropriate rotation of coordinates. The ground-truth $\omega $ for each video is calculated numerically based on these detected locations, from zero-crossing frequencies. These are used for verification of the learned representations, and are not used as part of the discovery process. Figure 9(a) shows a graphical description of the setup for data collection.
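Estimating $\omega $ from zero crossings can be sketched as follows. The details are assumptions made for illustration: the $x$ coordinate is measured relative to a known centre of rotation, crossing times are refined by linear interpolation between frames, and the 60 frames per second used in the example is arbitrary since the real frame rate is not stated here. Consecutive zero crossings of a sinusoid are half a period apart, which gives the estimate $\omega =\pi (n-1)/({t}_{last}-{t}_{first})$ for $n$ crossings.

```python
import math
from typing import List


def zero_crossing_times(ts: List[float], xs: List[float],
                        center: float = 0.0) -> List[float]:
    """Times at which xs crosses `center`, linearly interpolated."""
    cs = [x - center for x in xs]
    out = []
    for i in range(len(cs) - 1):
        a, b = cs[i], cs[i + 1]
        if a == 0.0:
            out.append(ts[i])
        elif a * b < 0.0:
            # The line through the two samples hits zero at fraction a/(a-b).
            out.append(ts[i] + (ts[i + 1] - ts[i]) * a / (a - b))
    return out


def omega_from_zero_crossings(ts: List[float], xs: List[float],
                              center: float = 0.0) -> float:
    """Angular frequency in rad/s from the spacing of zero crossings."""
    crossings = zero_crossing_times(ts, xs, center)
    if len(crossings) < 2:
        raise ValueError("need at least two zero crossings")
    half_periods = len(crossings) - 1
    return math.pi * half_periods / (crossings[-1] - crossings[0])


# Synthetic check: 200 frames of x(t) = cos(5 t) sampled at 60 frames/s.
ts = [k / 60.0 for k in range(200)]
xs = [math.cos(5.0 * t) for t in ts]
omega_hat = omega_from_zero_crossings(ts, xs)
```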
The latent physics module is trained with synthetic data, which is generated so as to match the parameters of the real dataset (frame rate, angular velocity range). We then use the real data on this trained model, in order to obtain the latent representations and the inputs for the equation discovery module. It may be observed from Figure 9(b) that the first latent embedding ${l}_{1}$ obtained for the real data is well-correlated with $\omega $. The other two nodes are close to zero in magnitude. This reconciles with the fact that there exists only one primary governing parameter for this setup. Additionally, the trend between the learnt embedding ${l}_{1}$ and $\omega $ suggests a quadratic relation. Hence, in Figure 9(d), we verify that the discovered angular velocity ${\omega}^{net}$ (mentioned in Figure 9(c)) corresponds to ground-truth $\omega $ with high accuracy.
Here, it is important to emphasize the correlation of latent nodes with ground truth parameters, as shown in Figure 8. The interpretability of the discovered equations is directly related to the value of this correlation coefficient. This is evident in the affine mapping obtained between the latent parameters and underlying physics concepts, for the results in the main paper. However, we impose no such explicit linearity constraint in the pipeline, since that may be construed as prior human knowledge. As long as the proposed method learns representations that are strongly correlated with the underlying physics parameters, the discovered equations will be interpretable and will embody the physics parameters. Therefore, even if the latent embeddings have a quadratic (or any nonlinear one-to-one) relationship with the ground truth, as observed for the uniform circular motion task for real scenes, interpretability is still maintained.
Appendix B Helical Motion and Damped Oscillation (synthetic scenes)
The evaluation on two additional synthetic tasks of greater difficulty in terms of functional form (helical motion and damped oscillation) is presented in this section. The corresponding results are illustrated in Figure 10.
Helical Motion (synthetic):
We demonstrate the discovery of translational and rotational motion in the main paper through the free fall and uniform circular motion datasets. To increase the complexity of the physics task, we now evaluate the proposed framework for 2-dimensional helical motion, where both translational and rotational motion act together. The synthetic videos are generated with different angular velocities $\omega $ and horizontal translational velocities ${v}_{0x}$. There is no translational motion along the $y$ direction, and the radius of the rotational motion is held constant for all the videos. Of the 600 videos in this synthetic dataset, 500 are used for training. Figure 10(i) shows the learnt representations and equations along the $x$ and $y$ directions. It may be observed that two of the latent representations are affine transforms of the governing physical parameters, ${v}_{0x}$ and $\omega $, and the derived equations are of the same functional form as the true equations. This emphasizes the performance of our framework on scenarios with multiple physical phenomena in action.
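A trajectory of this kind can be sketched as below. The functional form (uniform translation along $x$ superposed on uniform rotation) is the textbook one and is assumed here, since the text does not print the equations; the function name, default radius, frame count and frame rate are illustrative assumptions.

```python
import numpy as np

def helical_trajectory(v0x, omega, radius=1.0, n_frames=200, frame_rate=60.0):
    """2-D 'helical' motion: translation along x plus rotation of fixed radius.

    All defaults are hypothetical; only v0x and omega vary across videos.
    """
    t = np.arange(n_frames) / frame_rate
    x = v0x * t + radius * np.cos(omega * t)  # translation + rotation
    y = radius * np.sin(omega * t)            # rotation only, no drift in y
    return t, x, y
```

Removing the drift term `v0x * t` from `x` leaves a circle of the given radius, which is a quick sanity check on generated data.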
Damped Oscillation (synthetic):
Damping is a general energy loss mechanism for various systems, and one of its common forms is exponential decay. In this experiment, we simulate videos of damped oscillation, where the oscillation amplitude decays exponentially with time. We aim to test the capability of the proposed method to discover physical laws of more complex forms. We vary only the damping factor $b$ and the angular frequency $\omega $ along the $x$ direction, while the object location along the $y$ direction is fixed. The dataset consists of 600 videos generated with random initial conditions. Among these, 500 are used to train the proposed architecture, and the remaining constitute the test set. As shown in Figure 10(ii), the latent physics module is able to discover the notion of $\omega $ and $b$ in two different nodes, and the equation discovery module can generate equations to describe the combination of periodic and damped motions accurately.
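An exponentially damped oscillation along $x$ can be sketched as follows. The parametrization $x(t)=A{e}^{-bt}\mathrm{cos}(\omega t+\varphi )$ is an assumption (the text does not state how $b$ enters the exponent), as are the function name and all default values.

```python
import numpy as np

def damped_oscillation(b, omega, amplitude=1.0, phase=0.0, y0=0.0,
                       n_frames=200, frame_rate=60.0):
    """Damped oscillation along x with a fixed y location.

    Assumed form: x(t) = A * exp(-b t) * cos(omega t + phase).
    """
    t = np.arange(n_frames) / frame_rate
    x = amplitude * np.exp(-b * t) * np.cos(omega * t + phase)
    y = np.full_like(t, y0)  # the object does not move along y
    return t, x, y
```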
Appendix C Quantitative Performance Evaluation
The performance of the proposed Visual Physics framework may be measured along two fronts: (i) the mean error between the ground-truth trajectories and the trajectories from discovered equations, and (ii) the normalized cross-correlation coefficient between the latent representations and the corresponding ground-truth governing parameters. The analysis on the effect of training data size, from onward in the main paper, utilizes these metrics for evaluation. Here, we describe these metrics in more detail.
Let the ground-truth trajectory coordinates be denoted by $({x}^{(t)},{y}^{(t)})$ at a given time instant $t$. Based on the Visual Physics framework, let the learnt equations for $x$ and $y$ be given by $x={f}_{x}(t,{l}_{1},{l}_{2},\mathrm{\dots},{l}_{n})$ and $y={f}_{y}(t,{l}_{1},{l}_{2},\mathrm{\dots},{l}_{n})$, where ${l}_{1},{l}_{2},\mathrm{\dots},{l}_{n}$ are the latent node values. Then, the mean error between trajectories ($\epsilon$) can be computed as
$$\epsilon=\sqrt{\frac{{\mathrm{\Sigma}}_{t}{({x}^{(t)}-{f}_{x}(t,{l}_{1},{l}_{2},\mathrm{\dots},{l}_{n}))}^{2}}{S}+\frac{{\mathrm{\Sigma}}_{t}{({y}^{(t)}-{f}_{y}(t,{l}_{1},{l}_{2},\mathrm{\dots},{l}_{n}))}^{2}}{S}},$$  (2) 
where $S$ is the total number of time samples in the trajectory under consideration. Additionally, the values for ${l}_{1},{l}_{2},\mathrm{\dots},{l}_{n}$ are estimated through least squares. Some values of $\epsilon$ evaluated on trajectories for the free fall case may be found in Figure 8. A test set of unseen trajectories was evaluated using these metrics. A low value of the error implies that the model (equation) learnt is sufficiently parametrized to characterize the observed trajectory, and that the time evolution of the predicted trajectory matches that of the observed trajectory.
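The error of Equation (2) amounts to a root-mean-square distance between the observed and predicted trajectories. A minimal sketch, assuming the predictions from the discovered equations have already been evaluated at the same $S$ time samples (the function name is hypothetical):

```python
import numpy as np

def trajectory_error(x_true, y_true, x_pred, y_pred):
    """Mean trajectory error of Equation (2) for a single trajectory."""
    x_true, y_true, x_pred, y_pred = (
        np.asarray(a, dtype=float) for a in (x_true, y_true, x_pred, y_pred)
    )
    n_samples = x_true.size  # S in Equation (2)
    sq_x = np.sum((x_true - x_pred) ** 2) / n_samples
    sq_y = np.sum((y_true - y_pred) ** 2) / n_samples
    return np.sqrt(sq_x + sq_y)
```

For example, a prediction that is offset by 3 units in $x$ and 4 units in $y$ at every sample gives an error of 5.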
Let the ground-truth governing parameters be represented by ${g}_{1},{g}_{2},\mathrm{\dots},{g}_{m}$, $m\le n$. On successful discovery, the hidden nodes of the latent physics module are expected to show strong correlations with the governing parameters. Hence, the normalized cross-correlation between corresponding latent nodes and governing parameters is given by
$${C}_{i,j}=\frac{{\mathrm{\Sigma}}_{k=1}^{K}{g}_{i}^{(k)}{l}_{j}^{(k)}}{K{\sigma}_{{g}_{i}}{\sigma}_{{l}_{j}}},$$  (3) 
where $K$ is the number of test trajectories, and ${\sigma}_{u}$ is the standard deviation of the variable $u$. We consider the magnitude of ${C}_{i,j}$ for the strongly correlated pairs of hidden node and governing parameter, and use it as an indicator of ‘goodness of latent representations’. Figure 8 also reports these values for the free fall task. It may be observed that the values of the correlation metric are acceptably high. An additional metric for the goodness of latent representations and complexity evaluation can be the number of latent nodes required for the task. For instance, it would be interesting to apply this framework to multidimensional physics tasks with many more than 3 governing parameters, which would require a larger number of latent nodes.
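The correlation metric of Equation (3) can be computed as sketched below. One assumption is made explicit: both variables are mean-centred before the product is taken, which coincides with Equation (3) as printed when the variables are already zero-mean and otherwise gives the usual Pearson coefficient. The function name is hypothetical.

```python
import numpy as np

def normalized_cross_correlation(g, l):
    """Correlation between a governing parameter g and a latent node l.

    g, l : length-K arrays, one value per test trajectory.
    Assumption: values are mean-centred here before applying Equation (3).
    """
    g = np.asarray(g, dtype=float)
    l = np.asarray(l, dtype=float)
    g = g - g.mean()
    l = l - l.mean()
    k = g.size
    # np.std uses the population form (divide by K), matching the K in Eq. (3).
    return np.sum(g * l) / (k * g.std() * l.std())
```

Any affine map between a latent node and a governing parameter yields a magnitude of 1, which is why the metric suits the affine relationships reported in the main paper.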
Appendix D Software Implementation Details
This section highlights the synthetic dataset generation and pipeline implementation. We provide reproducible code snippets for one of the synthetic tasks, free fall.
Dataset Generation (synthetic data):
The synthetic dataset consists of videos of an object undergoing motions governed by a range of physical laws. We use Python and associated toolkits for simulating these phenomena. Specifically, we use NumPy (np) and OpenCV (cv2). Each scene consists of a spherical object of fixed size. The code for generating the object is shown below.
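The original snippet is not reproduced in this text, so the following is only an illustrative sketch of rendering a fixed-size ball onto a constant background. It uses NumPy alone rather than OpenCV to keep the example dependency-light; the function name, grayscale format and default intensities are assumptions.

```python
import numpy as np

def render_ball(frame_size, center, radius, background=0, intensity=255):
    """Draw a filled disc (the 2-D projection of the ball) on a constant frame.

    frame_size : (height, width) in pixels
    center     : (cx, cy) pixel coordinates of the ball centre
    """
    height, width = frame_size
    frame = np.full((height, width), background, dtype=np.uint8)
    yy, xx = np.ogrid[:height, :width]
    cx, cy = center
    # Pixels within `radius` of the centre belong to the ball.
    mask = (xx - cx) ** 2 + (yy - cy) ** 2 <= radius ** 2
    frame[mask] = intensity
    return frame
```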
The background is chosen to be a constant frame, independent of the video. Frame rate, video duration and frame size are the tunable parameters for this setup. The trajectory of the ball is then calculated, based on initial positions, initial velocities and time. Specifically, the initial velocity range is chosen so that for a given initial position, the object always stays in the frame at all times. The code snippet for the same is as follows.
Based on these parameters, the object location at each time instant is determined using the kinematic equations, and the corresponding frame is created. Code for the same is below.
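Since the original snippets are not reproduced in this text, the sketch below illustrates the two steps just described for the free fall task: choosing an initial velocity that keeps the object in the frame, and evaluating the kinematic equations at each frame time. Rejection sampling is an assumed strategy, not necessarily the authors' one, and the function names, world-coordinate convention ($y$ upward, origin at the bottom-left corner) and the value of `G` are illustrative.

```python
import numpy as np

G = 9.8  # gravitational acceleration in m/s^2 (assumed units)

def free_fall_trajectory(x0, y0, v0x, v0y, t):
    """Kinematic equations for constant horizontal velocity and gravity along y."""
    x = x0 + v0x * t
    y = y0 + v0y * t - 0.5 * G * t ** 2
    return x, y

def sample_in_frame_velocity(x0, y0, t, width, height, v_max, rng, max_tries=1000):
    """Draw (v0x, v0y) uniformly and keep the first draw that stays in the frame."""
    for _ in range(max_tries):
        v0x, v0y = rng.uniform(-v_max, v_max, size=2)
        x, y = free_fall_trajectory(x0, y0, v0x, v0y, t)
        inside = (x.min() >= 0 and x.max() <= width
                  and y.min() >= 0 and y.max() <= height)
        if inside:
            return v0x, v0y
    raise RuntimeError("no admissible initial velocity found")
```

Each accepted trajectory would then be mapped to pixel coordinates and rendered frame by frame.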
These sets of frames are then stored as the respective videos. Note that for the train-on-simulated, test-on-real regime for the uniform circular motion and free fall tasks, the frame size, frame rate and scale were chosen so as to be consistent with the real data.
Position Detection Module:
To process the videos, we developed a Mask R-CNN [he2017mask] based pipeline to convert the videos into position vectors which can be processed by the latent physics module. The input to the Mask R-CNN is a video with $N$ frames ($N=200$ for synthetic data). The frames are sampled alternately: the even-numbered frames are processed as inputs, and the odd-numbered frames are used as the query input set. The output of the module is hence a vector of length $N+1$, where the first $\frac{N}{2}$ elements correspond to $y$ coordinates, the next $\frac{N}{2}$ elements correspond to $x$ coordinates, and the last element of the vector corresponds to the frame number which is the query. The code snippet below illustrates the function which comprises the core of the position detection module.
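The original function is not reproduced in this text. The sketch below illustrates only the vector layout described above, assuming the per-frame object positions have already been detected; the function name and the 0-based frame numbering are assumptions.

```python
import numpy as np

def build_position_vector(xs, ys, query_frame):
    """Assemble the position detection output from per-frame detections.

    xs, ys      : detected coordinates for all N frames (N even)
    query_frame : index of an odd-numbered frame used as the query
    """
    xs = np.asarray(xs, dtype=float)
    ys = np.asarray(ys, dtype=float)
    n = xs.size
    if ys.size != n or n % 2 != 0:
        raise ValueError("xs and ys must have the same even length N")
    if query_frame % 2 == 0:
        raise ValueError("query must be an odd-numbered frame")
    even = slice(0, n, 2)  # even-numbered frames form the input set
    # Layout: y coordinates, then x coordinates, then the query frame number.
    return np.concatenate([ys[even], xs[even], [float(query_frame)]])
```

For $N=200$ this gives 100 $y$ values, 100 $x$ values and one query index.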
For handling real data, the positions in the video were mapped from pixel coordinates to real world coordinates. In case of the uniform circular motion task, the position detection module was modified slightly to avoid unwanted detections by the Mask R-CNN in the video frames. The modification is to convolve each frame of the video with a Gaussian blur kernel (using OpenCV), so that other irrelevant stationary components of the video frame are partially abstracted out, and the Mask R-CNN detects only the object of interest in the frame. Since we deal with a single object setting in our work, the blurring technique is useful to improve the robustness of the Mask R-CNN for detecting the object of interest in a variety of real scenes.
Latent Physics Module:
This module uses the position outputs from the previous step to identify governing parameters in the latent nodes. We use a feedforward neural network for this purpose, specifically a modified $\beta $-Variational Autoencoder ($\beta $-VAE) architecture [higgins2017betaVAE, Eslami1204GenerativeQueryNet]. The inputs of length $N+1$ are obtained from the position detection module ($N$ is the number of input video frames to the position detection module). The encoder and decoder each consist of 6 fully-connected layers. The dimension of all of the hidden layers is fixed at 256, and we use three latent nodes. For training, we concatenate a randomly chosen time query with the latent nodes, as input to the decoder. The output of the decoder is the position of the object at the specified time query. We use the mean squared error (MSE) loss on the predicted locations, regularized by the $\beta $-VAE disentanglement loss.
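The training objective can be sketched in NumPy as below. This assumes the standard $\beta $-VAE form, in which the disentanglement term is $\beta $ times the KL divergence between a diagonal Gaussian posterior and a unit Gaussian prior; the modified architecture used here may differ in detail, and the function name and the value of `beta` are hypothetical since the text does not state them.

```python
import numpy as np

def beta_vae_loss(pred_xy, true_xy, mu, log_var, beta=1.0):
    """MSE on predicted positions plus a beta-weighted KL regularizer.

    pred_xy, true_xy : (batch, 2) predicted and observed positions at the query time
    mu, log_var      : (batch, n_latent) encoder outputs (n_latent = 3 in the text)
    """
    pred_xy = np.asarray(pred_xy, dtype=float)
    true_xy = np.asarray(true_xy, dtype=float)
    mu = np.asarray(mu, dtype=float)
    log_var = np.asarray(log_var, dtype=float)
    mse = np.mean((pred_xy - true_xy) ** 2)
    # KL( N(mu, sigma^2) || N(0, 1) ), summed over latent nodes, averaged over batch.
    kl = -0.5 * np.mean(np.sum(1.0 + log_var - mu ** 2 - np.exp(log_var), axis=1))
    return mse + beta * kl
```

Larger values of `beta` push unused latent nodes towards the prior, which is consistent with the near-zero nodes observed when a task has fewer governing parameters than latent nodes.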
Equation Discovery Module:
We use the genetic programming toolkit Eureqa [schmidt2009distilling] to drive the equation discovery module. For our experiments, we use a fixed configuration setup. Inputs for the genetic programming are the positions along $x$ and $y$, the time instants $t$ (evaluated using the frame index and the frame rate of the video) and the latent node information for each trajectory ${l}_{1},{l}_{2},\mathrm{\dots},{l}_{n}$, where $n$ is the number of latent nodes used. For our experiments we use $n=3$. For $M$ training trajectories and $K$ samples per trajectory, we therefore have $M\times K$ sets of $(x,y,t,{l}_{1},{l}_{2},\mathrm{\dots},{l}_{n})$ as inputs.
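Assembling the $M\times K$ input rows can be sketched as follows. The function name and array shapes are assumptions, and the sketch further assumes that the $K$ samples of each trajectory are consecutive frames starting at index 0, so that $t$ is the frame index divided by the frame rate.

```python
import numpy as np

def build_regression_table(x, y, latents, frame_rate):
    """Stack (x, y, t, l_1, ..., l_n) rows for all trajectories.

    x, y    : (M, K) positions, one row per trajectory
    latents : (M, n) latent node values, one row per trajectory
    Returns an (M * K, 3 + n) array.
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    latents = np.asarray(latents, dtype=float)
    m, k = x.shape
    t = np.arange(k) / frame_rate          # time from frame index and frame rate
    t_col = np.tile(t, m)                  # same time axis for every trajectory
    l_cols = np.repeat(latents, k, axis=0) # latents are constant along a trajectory
    return np.column_stack([x.ravel(), y.ravel(), t_col, l_cols])
```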
The error metric is chosen to be the R-squared goodness of fit. The candidate functions are chosen to be: (i) constant, (ii) input variable, (iii) addition, (iv) subtraction, (v) multiplication, (vi) division, (vii) sine, (viii) cosine and (ix) exponential. The complexity for each of the candidate functions is kept at the default value. No other configuration parameters for the toolkit are changed. Since the toolkit output includes several equations with varying complexities, the final equation is chosen based on Pareto optimality in the fit-complexity space.
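The Pareto-optimal candidates in the fit-complexity space can be identified as sketched below. This is a generic illustration rather than Eureqa's own procedure: the function name is hypothetical, the error could be taken as one minus R-squared, and how a single equation is picked from the front is not specified in the text and is left out.

```python
def pareto_front(candidates):
    """Return indices of non-dominated (complexity, error) pairs.

    candidates : list of (complexity, error) tuples; lower is better for both.
    A candidate is dominated if another one is no worse on both axes
    and strictly better on at least one.
    """
    front = []
    for i, (c_i, e_i) in enumerate(candidates):
        dominated = any(
            c_j <= c_i and e_j <= e_i and (c_j < c_i or e_j < e_i)
            for j, (c_j, e_j) in enumerate(candidates)
            if j != i
        )
        if not dominated:
            front.append(i)
    return front
```

An equation that is both more complex and less accurate than some other candidate never appears on the front.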
Runtime Analysis:
Experiments were performed using a Linux (Ubuntu 18.04 LTS) machine with an Intel i5-8400 CPU (6 cores, 2.80 GHz), 16 GB of RAM, and an NVIDIA GeForce RTX 2070 GPU (8 GB of GPU RAM). Table 2 shows the runtime analysis for the helical motion task. As the table indicates, the overall runtime for this task is approximately 1.6 hours (5680 s). The position detection module is the primary bottleneck in our pipeline, largely due to the size of the dataset. Depending on the complexity of the equation, the time required by the equation discovery module to converge to a plausible equation ranges from 60 s to 1800 s, for equations along two dimensions.
Module | Runtime per unit | Number of units
--- | --- | ---
Position Detection Module | 11 s per video | 500 videos in a training set
Latent Physics Module | 60 s per 1000 epochs | 2000 epochs required for convergence
Equation Discovery Module | 30 s per equation | 2 equations ($x$ and $y$ directions)
Overall Time | 5680 s |
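The overall time follows directly from the per-unit runtimes and unit counts in the table, as the short check below shows. The dictionary layout is illustrative; the latent physics module is counted as two units of 1000 epochs each.

```python
# (seconds per unit, number of units), taken from the runtime table
stages = {
    "position detection": (11, 500),  # 11 s per video, 500 videos
    "latent physics": (60, 2),        # 60 s per 1000 epochs, 2000 epochs
    "equation discovery": (30, 2),    # 30 s per equation, 2 equations
}

total_seconds = sum(per_unit * units for per_unit, units in stages.values())
total_hours = total_seconds / 3600.0
share_detection = stages["position detection"][0] * stages["position detection"][1] / total_seconds
```

Position detection accounts for 5500 s of the 5680 s total, which is why it is the bottleneck of the pipeline.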