The classical picture of gradient descent treats it as hill-descending: from any starting position in parameter space, follow the negative gradient until reaching a minimum. Early analyses worried that the non-convexity of neural network loss functions would trap gradient descent in poor local minima. Empirically this concern has not materialized at scale — large overparameterized networks reliably find solutions that generalize well, a puzzle that has driven much of the theoretical work in deep learning.
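To make the classical picture concrete, here is a minimal sketch of plain gradient descent on a small, made-up non-convex loss. The function, step size, and step count are illustrative choices of ours, not anything from the literature; the loop simply follows the negative gradient from a random start.

```python
import numpy as np

def loss(theta):
    # Toy non-convex loss (illustrative only): a bumpy bowl in two parameters.
    return 0.5 * np.sum(theta**2) + 0.3 * np.sin(5 * theta[0]) * np.cos(5 * theta[1])

def grad(theta):
    # Analytic gradient of the toy loss above.
    g = theta.copy()
    g[0] += 1.5 * np.cos(5 * theta[0]) * np.cos(5 * theta[1])
    g[1] -= 1.5 * np.sin(5 * theta[0]) * np.sin(5 * theta[1])
    return g

theta = np.random.randn(2)        # starting position in parameter space
lr = 0.05                         # step size (learning rate)
for step in range(500):
    theta -= lr * grad(theta)     # follow the negative gradient downhill
print(loss(theta), theta)
```

Depending on where the random start lands, the loop settles into one of several nearby basins, which is exactly the behavior that raised the early worries about poor local minima.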
The resolution involves the interaction between gradient descent and the topology of the loss landscape. In high-dimensional parameter spaces, poor local minima are rare; critical points are overwhelmingly saddle points, and low-loss solutions tend to sit in broad flat regions. Mode connectivity research has shown that distinct optima found by different training runs are connected by continuous low-loss paths, and gradient descent with stochastic noise (from mini-batch sampling) can traverse these paths or converge on flat minima with rich adjacent-configuration structure.
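A crude way to probe mode connectivity in code is to evaluate the loss along the straight line between two trained solutions. The sketch below is a minimal probe of ours: it assumes the two parameter sets have already been flattened into vectors and that a `loss_fn` closure is available. The published results (Garipov et al. 2018; Draxler et al. 2018) use learned curved paths, so a barrier on the straight line does not rule out connectivity.

```python
import numpy as np

def interpolation_losses(theta_a, theta_b, loss_fn, n_points=21):
    """Evaluate loss along the straight line between two solutions.

    A flat, low profile suggests the two optima sit in one connected
    low-loss region; a pronounced bump indicates a barrier on this
    particular (linear) path.
    """
    alphas = np.linspace(0.0, 1.0, n_points)
    return [loss_fn((1 - a) * theta_a + a * theta_b) for a in alphas]
```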
The directedness of gradient descent represents the sharpest disanalogy with Wagner's biological framework, where undirected mutation makes topology the crucial variable. But the disanalogy is subtler than it first appears. Gradient descent is directed locally, following the gradient at each step, yet the overall trajectory through parameter space is shaped by random initialization and batch ordering, which inject stochasticity, and by optimizer choice, which sets how the local signal is used. The result is exploratory dynamics distinct from pure optimization. Different initializations converge to different points on the connected network of equivalent configurations, sampling that network much as biological populations sample their genotype networks through undirected drift.
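The sampling analogy can be illustrated with a deliberately overparameterized toy problem in which many weight settings fit the data equally well. The construction below is a hypothetical example of ours, not a model from the text: because the same feature appears twice, every weight pair summing to one is loss-equivalent, and seed-dependent initialization and batch order pick out different points on that set.

```python
import numpy as np

def train_sgd(seed, n_steps=2000, lr=0.1):
    rng = np.random.default_rng(seed)
    # Overparameterized toy problem: the same input feature appears twice,
    # so every weight pair with w[0] + w[1] == 1 fits the data perfectly.
    x = rng.normal(size=(256, 1))
    X = np.hstack([x, x])                   # duplicated feature
    y = x[:, 0]                             # target uses the feature once
    w = rng.normal(size=2)                  # seed-dependent initialization
    for _ in range(n_steps):
        idx = rng.integers(0, len(X), 32)   # seed-dependent mini-batch sampling
        err = X[idx] @ w - y[idx]
        w -= lr * X[idx].T @ err / len(idx)
    return w, float(np.mean((X @ w - y) ** 2))

for seed in range(3):
    w, mse = train_sgd(seed)
    print(seed, np.round(w, 3), f"mse={mse:.2e}")
# Each run lands on a different point of the line w[0] + w[1] = 1,
# i.e. a different position on the same set of loss-equivalent solutions.
```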
The interaction of directed search with topological structure raises questions that Wagner's framework alone cannot answer. How does the quality of the optimizer affect which innovations are accessible? Can more sophisticated optimizers find capabilities that less sophisticated ones miss in the same space? What role does the training data distribution play in shaping the topology that gradient descent traverses? These questions take Wagner's framework as a starting point but extend it in directions specific to engineered learning systems — a reminder that biological insight illuminates AI without determining it.
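One way to make the optimizer-quality question concrete is to run two optimizers for the same step budget in the same landscape. The sketch below compares plain gradient descent with Adam on a standard ill-conditioned test function (Rosenbrock); the learning rates and budget are arbitrary choices of ours, and the point is only that the two methods reach different places in the same space, not that either is universally better.

```python
import numpy as np

def rosenbrock(theta):
    # Ill-conditioned valley: easy to reach the valley floor, slow to walk along it.
    x, y = theta
    return (1 - x) ** 2 + 100 * (y - x ** 2) ** 2

def rosenbrock_grad(theta):
    x, y = theta
    return np.array([-2 * (1 - x) - 400 * x * (y - x ** 2),
                     200 * (y - x ** 2)])

def run_gd(lr=1e-3, steps=5000):
    theta = np.array([-1.5, 1.5])
    for _ in range(steps):
        theta -= lr * rosenbrock_grad(theta)   # fixed step along the raw gradient
    return rosenbrock(theta)

def run_adam(lr=1e-2, steps=5000, b1=0.9, b2=0.999, eps=1e-8):
    theta = np.array([-1.5, 1.5])
    m = np.zeros_like(theta)
    v = np.zeros_like(theta)
    for t in range(1, steps + 1):
        g = rosenbrock_grad(theta)
        m = b1 * m + (1 - b1) * g              # first-moment estimate
        v = b2 * v + (1 - b2) * g ** 2         # second-moment estimate
        m_hat = m / (1 - b1 ** t)              # bias correction
        v_hat = v / (1 - b2 ** t)
        theta -= lr * m_hat / (np.sqrt(v_hat) + eps)
    return rosenbrock(theta)

print("plain GD final loss:", run_gd())
print("Adam final loss:    ", run_adam())
```

With these settings the adaptive method typically makes more progress along the narrow valley within the fixed budget, illustrating how the choice of optimizer changes which regions of the same landscape are actually reached.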
Gradient descent as an optimization method dates to Cauchy's 1847 paper on solving systems of simultaneous equations. Its application to neural networks through backpropagation was developed across multiple independent formulations in the 1960s-1980s and crystallized in the 1986 paper by Rumelhart, Hinton, and Williams. The modern variants — stochastic gradient descent, Adam, AdamW, and others — form the computational backbone of deep learning.
Gradient descent is directed optimization. Unlike biological mutation, it follows a signal — the loss gradient — that pulls parameters toward reduced error.
Directedness is a disanalogy with biology. Wagner's framework rests on undirected exploration; gradient descent introduces a mechanism that has no biological equivalent.
Stochastic elements introduce exploration. Batch sampling, initialization, and optimizer choices produce trajectory diversity that partly resembles biological drift.
Topology still matters. The architectural features of loss landscapes — flat minima, mode connectivity, diverse adjacency — shape what gradient descent can find.
Optimizer quality affects outcomes. Better optimizers can find innovations that worse ones miss in the same space — a degree of freedom that biological systems do not possess.