Itron Idea Labs

Deep Learning: Get Ready for Another Revolution

May 06, 2021

Deep learning—the use of neural networks with many layers to solve a learning problem—is hugely popular, but still a toddler. Almost everyone is familiar with the term deep learning, partly due to a string of remarkable and widely publicized successes in domains such as image processing and speech recognition. If you’ve conversed with Alexa, scanned a check for deposit or searched online for images, you’ve used deep learning. While most advances have occurred just since 2000, and large-scale impacts on the industry are only about 10 years old, age is not the only reason that deep learning qualifies as a toddler. The more remarkable reason is that even though the algorithms and mechanics of deep learning are well understood, the theory is not. Science is just beginning to understand how deep learning works, and why it works so well. And that means there is a lot of room to improve it.

Our ignorance is fading fast. The last several years have witnessed an intense cross-fertilization of ideas from numerous mathematical fields, aimed not only at extending deep learning to new problems and new kinds of data, but also at advancing the basic science of what happens when a deep network learns and, more importantly, what could be done to make learning more efficient, faster and foolproof. Due to this cross-fertilization, the next generation of deep learning models promises to be even more remarkable. Hence the subtitle: get ready for another revolution.

Many recent papers on deep learning use terms and concepts from optimal transport and dynamical systems, including control theory. I single out these two fields only because of the limited space available here for discussion; there are other influences as well. For example, ideas from information geometry are helping to improve deep learning on graphs. In the brief narrative that follows, I highlight a few ways in which dynamical systems and optimal transport are impacting the development of deep learning.

First, it might help to understand how deep learning is structured. Here we focus on the supervised learning setting, where the goal is to find a map that takes inputs x and produces outputs y. If all goes well, outputs of the map closely match observed values, and if so, we say that the map has low error. A deep learning model is a map that is, in essence, a composition of functions. If f1, f2, and f3 are functions, deep learning is structured as y = f3(f2(f1(x))). It’s a nesting of functions, like a Matryoshka (Russian) doll. The result of f1(x) is used as input to f2, and that result is used as input to f3. Each of the functions is parameterized, and “learning” means to find those parameters that produce output y with the least error.

The whole thing is more complicated, of course, and there can be many more than three functions in the composition. This is just a bird’s-eye view.
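To make the nesting concrete, here is a minimal sketch in Python of a three-layer composition, y = f3(f2(f1(x))). The layer sizes, the tanh nonlinearity, and helper names such as make_layer are illustrative assumptions, not anything prescribed above, and the parameters are random rather than learned.

```python
import numpy as np

# A minimal sketch of deep learning as composed functions: three parameterized
# layers applied in sequence, y = f3(f2(f1(x))). Sizes and the tanh
# nonlinearity are illustrative choices only; no training happens here.

rng = np.random.default_rng(0)

def make_layer(n_in, n_out):
    # Each function f_i is parameterized by a weight matrix and a bias vector.
    return {"W": rng.normal(size=(n_out, n_in)) * 0.1,
            "b": np.zeros(n_out)}

def apply_layer(params, x):
    return np.tanh(params["W"] @ x + params["b"])

layers = [make_layer(4, 8), make_layer(8, 8), make_layer(8, 1)]

x = rng.normal(size=4)      # input
h = x
for layer in layers:        # f1, then f2, then f3: a nesting of functions
    h = apply_layer(layer, h)
y = h                       # output of the composed map
print(y)
```

“Learning” would mean adjusting the W and b values in each layer to drive the error of y downward; here they are simply frozen at their random starting values.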

A deep neural network can be structured in many ways, one of which is called a residual network, or resnet for short. This is just like the composition of functions mentioned above, only now each layer adds its input back to its output (a skip connection). If h is the output at each layer, then starting from the inside, h1 = x + f1(x); using this result, h2 = h1 + f2(h1); and using this result, h3 = h2 + f3(h2). Dropping the subscripts for clarity, the pattern at each layer is h + f(h).
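As a small sketch of that pattern, the snippet below applies three residual updates h ← h + f(h), where each f is a random, untrained tanh layer. The fixed width of 8 is an arbitrary illustrative choice; the skip connection requires input and output dimensions to match.

```python
import numpy as np

# A sketch of the residual pattern h_{k+1} = h_k + f_k(h_k). The skip
# connection simply adds each layer's input back to its output, so every
# layer here maps R^8 -> R^8 to keep the addition well defined.

rng = np.random.default_rng(1)

def residual_block(W, b, h):
    return h + np.tanh(W @ h + b)   # the pattern h + f(h)

h = rng.normal(size=8)              # h0 = x, the input
for _ in range(3):                  # three residual layers
    W = rng.normal(size=(8, 8)) * 0.1
    b = np.zeros(8)
    h = residual_block(W, b, h)
print(h)                            # h3, the network output
```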

In a milestone paper from 2018, Chen et al. recognized that this same pattern appears in the numerical solution of an ordinary differential equation (ODE): xt+1 = xt + f(xt), where t is time. They developed a method to conduct learning using solvers from the field of dynamical systems. Rearranging terms to be a bit more formal, (xt+1 – xt) = f(xt), where the left-hand side is a finite difference that plays the role of the derivative dx/dt for small time steps, and on the right-hand side f is a neural network. As such, the deep learning problem can be solved by passing the function f, and some initial conditions, into a standard ODE solver.
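As a rough illustration (not the actual implementation of Chen et al.), the sketch below defines a small random, untrained network f and hands it to SciPy’s general-purpose solve_ivp solver as the right-hand side of dx/dt = f(x). The network sizes and the integration interval are arbitrary assumptions.

```python
import numpy as np
from scipy.integrate import solve_ivp

# A sketch of the neural-ODE idea: treat a small network f as the right-hand
# side of dx/dt = f(x) and hand it, with an initial condition, to an
# off-the-shelf ODE solver. In a real neural ODE the parameters of f would be
# trained; here they are random, untrained values used only for illustration.

rng = np.random.default_rng(2)
W1, b1 = rng.normal(size=(16, 2)) * 0.5, np.zeros(16)
W2, b2 = rng.normal(size=(2, 16)) * 0.5, np.zeros(2)

def f(t, x):
    # "f is a neural network": a single hidden layer with a tanh nonlinearity.
    return W2 @ np.tanh(W1 @ x + b1) + b2

x0 = np.array([1.0, 0.0])                     # initial condition
sol = solve_ivp(f, t_span=(0.0, 1.0), y0=x0)  # integrate the dynamics
print(sol.y[:, -1])                           # state at t = 1, the "output"
```

Changing the solver or its tolerances is just a matter of passing different arguments to solve_ivp, which hints at the appeal: mature numerical machinery becomes available to the learning problem.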

The upshot is that tools for dynamical systems can now be applied to deep learning. Not only does this suggest new possibilities for solving deep learning problems, but it also means that the mature field of dynamical systems analysis can be applied toward understanding how learning occurs in resnets, recurrent networks, and other similar deep neural network architectures. Liu and Theodorou, and Li et al., provide reviews.

As an aside, a development over the last few years that at first seems unrelated is called normalizing flows. The idea is to transform a simple distribution, such as a Gaussian, into one that is more complex and better matched to the data at hand. This is done by repeatedly applying a simple transform to a base input. But, as you might already suspect, this process is very similar to the residual pattern discussed above. It turns out that in many cases neural ODEs can be understood as continuous normalizing flows, and resnets can be understood as discretized versions of normalizing flows. Again, a cross-fertilization of ideas is occurring.
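Here is a toy, one-dimensional illustration of the flow idea: start with Gaussian samples, push them through a few simple invertible transforms, and accumulate the change-of-variables correction so the density of the result remains known. The affine transforms are placeholders for the richer, learned transforms used in practice.

```python
import numpy as np

# A toy, one-dimensional normalizing flow: draw samples from a standard
# Gaussian, apply a few simple invertible (affine) transforms, and track the
# change-of-variables correction so the log-density of each transformed
# sample is still known exactly.

rng = np.random.default_rng(3)
z0 = rng.normal(size=5)                         # samples from the base Gaussian
log_p = -0.5 * z0**2 - 0.5 * np.log(2 * np.pi)  # log-density under the base

x = z0.copy()
for scale, shift in [(1.5, 0.5), (0.8, -1.0), (2.0, 0.3)]:
    x = scale * x + shift                       # one simple invertible step
    log_p -= np.log(abs(scale))                 # Jacobian (change-of-variables) term

print(x)       # transformed samples
print(log_p)   # their exact log-density under the composed flow
```

In a trained flow, the scale and shift values (or far richer transforms) are themselves outputs of neural networks fit to data, which is what lets a simple base distribution be reshaped to match the data at hand.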

Concepts from optimal transport appear in numerous deep learning papers. One interesting application has to do with how the weights of deep neural networks evolve during training. At the start of training they are randomly initialized, then in each round of training (or with each minibatch of training), they are adjusted slightly to improve model fit. If all goes well, errors fall during training rounds, and at some point, training can stop.

Optimal transport is a field focused on the problem of transforming one distribution into another at the least cost, via the shortest path, or while doing the least work. It turns out that deep learning networks tend to perform best if weights do not vary far from initial values. This prevents some weights from growing wildly large, for example, or following erratic paths. An ideal training path alters weights as little as possible to achieve desired model accuracy. Optimal transport to the rescue. Karkar et al. describe how ideas from optimal transport can be used to train neural networks. Similarly, Onken et al. describe how ideas from optimal transport can be used to improve the dynamics and training of neural ODEs.
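As a loose, hypothetical illustration of the “move the weights as little as possible” idea (and not the actual algorithms of Karkar et al. or Onken et al.), the sketch below adds a penalty on the distance between current and initial weights to an ordinary least-squares loss. The penalty weight lam, the learning rate and the toy data are arbitrary assumptions.

```python
import numpy as np

# Toy gradient-descent training with a "stay close to initialization" penalty.
# This is a simple proximity regularizer in the spirit of the transport-cost
# discussion above, not an implementation of the cited optimal-transport methods.

rng = np.random.default_rng(4)
X = rng.normal(size=(100, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=100)

w0 = rng.normal(size=3) * 0.1       # random initialization
w = w0.copy()
lam, lr = 0.1, 0.01                 # penalty weight and learning rate

for _ in range(500):
    grad_fit = 2 * X.T @ (X @ w - y) / len(y)   # gradient of the data loss
    grad_move = 2 * lam * (w - w0)              # gradient of the proximity penalty
    w -= lr * (grad_fit + grad_move)

print(w, np.linalg.norm(w - w0))    # fitted weights and how far they traveled
```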

This is just a quick foray into the cross-fertilization of fields that is not only driving the maturation of deep learning, but also expanding the domain of deep learning to new kinds of problems and new kinds of data (like graphs). As Itron Idea Labs explores opportunities in the world of machine learning and artificial intelligence, we welcome conversations with our utility customers and others regarding challenges that require advanced approaches. If you would like to schedule a conversation with the Idea Labs team on these topics, please feel free to contact us at Itronidealabs@itron.com.

By John Boik


Senior Principal Data Scientist


John Boik received his PhD in biomedical sciences from the University of Texas, Health Sciences Center, Houston, where he studied cancer biology. He completed postdoctoral work at Stanford University, in the Department of Statistics, and is currently courtesy faculty at Oregon State University, Environmental Sciences Graduate Program. His BS is in civil engineering, from the University of Colorado, Boulder. He has broad experience modeling biological and societal processes, including utility processes. His professional interests include Bayesian statistical methods, scientific machine learning (merging dynamical systems theory with machine learning), graph representations of data, machine learning on graphs, and Bayesian approaches to artificial intelligence, in particular, active inference. Active inference has potential to model cooperation and communication between intelligent agents, such as intelligent IoT devices. He is a Senior Principal Data Scientist at Itron Idea Labs, where he constructs machine learning models for use by Idea Labs and others at Itron, and assists in the evaluation of data science proposals, strategies, and approaches.