1 Introduction
I will first quickly summarize a few relevant concepts discussed in much more detail in previous reports [76, 77]. The reader might profit from being familiar with some of our earlier work on algorithmic transfer learning [69, 75, 77], recurrent neural networks (RNNs) for control and planning [55, 56, 60, 79], and hierarchical chunking [61].

To become a general problem solver that is able to run arbitrary problem-solving programs, the controller of a robot or an artificial agent must be a general-purpose computer [15, 7, 92, 44]. Artificial RNNs fit this bill. A typical RNN consists of many simple, connected processors called neurons, each producing a sequence of real-valued activations. Input neurons get activated through sensors perceiving the environment, other neurons get activated through weighted connections or wires from previously active neurons, and some neurons may affect the environment by triggering actions. Learning or credit assignment is about finding real-valued weights that make the RNN exhibit desired behavior, such as driving a car. The weight matrix of an RNN is its program.
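As a toy illustration of the description above (weighted connections propagating real-valued activations, with the weight matrices acting as the RNN's "program"), here is a minimal Elman-style recurrent step in NumPy; the sizes, names, and random weights are arbitrary assumptions, not anything prescribed by the paper:

```python
import numpy as np

def rnn_step(W_in, W_rec, h, x):
    """One step of a simple recurrent net: new activations from current
    input x and previous hidden state h. W_in and W_rec are the 'program'."""
    return np.tanh(W_in @ x + W_rec @ h)

rng = np.random.default_rng(0)
n_in, n_hidden = 3, 5
W_in = rng.standard_normal((n_hidden, n_in)) * 0.1    # input weights
W_rec = rng.standard_normal((n_hidden, n_hidden)) * 0.1  # recurrent weights
h = np.zeros(n_hidden)                                 # initial hidden state
for x in rng.standard_normal((4, n_in)):               # a short input sequence
    h = rnn_step(W_in, W_rec, h, x)
```

Changing the entries of `W_in` and `W_rec` changes the behavior; learning searches this weight space.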
Many RNN-like models can be used to build general computers, e.g., RNNs controlling pushdown automata [9, 41] or other types of differentiable memory [20], including differentiable fast weights [63, 65], as well as closely related RNN-based metalearners [66, 25, 51]. Using sloppy but convenient terminology, we refer to all of them as RNNs [77]. In practical applications, most RNNs are Long Short-Term Memory (LSTM) networks [24, 12, 19, 76], now used billions of times per day for automatic translation [106, 43], speech recognition [50], and many other tasks [76]. If there are large 2-dimensional inputs such as video images, the LSTM may have a front end [95] in the form of a convolutional neural net (CNN) [11, 33, 97, 4, 45, 52, 8, 76] implemented on fast GPUs [8, 76]. Such a CNN-LSTM combination is still an RNN.
Without a teacher, reward-maximizing programs of an RNN must be learned through repeated trial and error, e.g., through artificial evolution [39, 107, 83, 40, 16, 18, 103, 14, 89, 88] [76, Sec. 6.6], or reinforcement learning [29, 90, 101, 76] through policy gradients [104, 91, 3, 1, 13, 30, 103, 48, 82, 22, 102, 42] [76, Sec. 6]. The search space can often be reduced dramatically by evolving compact encodings of RNNs, e.g., [67, 87, 32, 94] [76, Sec. 6.7]. Nevertheless, this is often much harder than imitating teachers through gradient-based supervised learning [99, 105, 47] [76] for LSTM [24, 12, 19].

2 One Big RNN For Everything: Basic Ideas and Related Work
I will focus on the incremental training of an increasingly general problem solver interacting with an environment, continually [46] learning to solve new tasks (possibly without a supervisor) and without forgetting any previous, still valuable skills. The problem solver is a single RNN called ONE.
Unlike previous RNNs, ONE or copies thereof or parts thereof are trained in various ways, in particular, by (1) black box optimization / reinforcement learning / artificial evolution without a teacher, or (2) gradient descent-based supervised or unsupervised learning (Sec. 1). (1) is usually much harder than (2). Here I combine (1) and (2) in a way that leaves much if not most of the work to (2), building on several ideas from previous work:


Extra goal-defining input patterns to encode user-given tasks. A reinforcement learning neural controller of 1990 learned to control a fovea through sequences of saccades to find particular objects in visual scenes, thus learning sequential attention [79]. User-defined goals were provided to the system by special “goal input vectors” that remained constant [79, Sec. 3.2] while the system shaped its incoming stream of standard visual inputs through its fovea-shifting actions. Also in 1990, gradient-based recurrent subgoal generators [57, 58, 80] used special start- and goal-defining input vectors, also for an evaluator network predicting the costs and rewards associated with moving from starts to goals. The later PowerPlay system (2011) [75] also used such task-defining special inputs, actually selecting on its own new goals and tasks, to become a more and more general problem solver in an active but unsupervised fashion. In the present paper, variants of ONE will also adopt this concept of extra goal-defining inputs to distinguish between numerous different tasks.

Incremental black box optimization of reward-maximizing RNN controllers. If ONE already knows how to solve several tasks, then a copy of ONE may profit from this prior knowledge, learning a new task through additional weight changes more quickly than learning the task from scratch, e.g., [17, 100, 13], ideally through optimal algorithmic transfer learning, like in the at least asymptotically Optimal Ordered Problem Solver [69], where new solution candidates in the form of programs may exploit older ones in arbitrary computable fashion.

Unsupervised prediction and compression of all data of all trials. An RNN-based “world model” M of 1990 [55, 56] learned to predict (and thus compress [61]) future inputs, including vector-valued reward signals [56], from the environment of an agent controlled by another RNN called C through environment-changing actions. This was also done in more recent, more sophisticated C-M systems [77]. Here we collapse both M and C into ONE, very much like in Sec. 5.3 of the previous paper [77], where C and M were bidirectionally connected such that they effectively became one big net that “learns to think” [77]. In the present paper, however, we do not make any explicit difference any more between C and M.

Compressing all behaviors so far into ONE. The chunker-automatizer system of the neural history compressor of 1991 [61, 64] used gradient descent to compress the learned behavior of a so-called “conscious” chunker RNN into a separate “subconscious” automatizer RNN, which not only learned to imitate the chunker network, but also was continually retrained on its own previous tasks, namely, (1) to predict teacher-given targets through supervised learning, and (2) to compress through unsupervised learning all sequences of observations by predicting them (what is predictable does not have to be stored extra). It was shown that this type of unsupervised pretraining for deep learning networks can greatly facilitate the learning of additional user-defined tasks [61, 64].

Here we apply the basic idea to the incremental skill training of ONE. Both the predictive skills acquired by gradient descent and the task-specific control skills acquired by black box optimization can be collapsed into one single network (namely, ONE itself) through pure gradient descent, by retraining ONE on all input-output traces of all previously learned behaviors that are still deemed useful [75]. Towards this end, we simply retrain ONE to reproduce control behaviors of successful past versions of ONE, but without really executing the behaviors in the environment (usually the expensive part). Simultaneously, all input-output traces ever observed (including those of failed trials) can be used to train ONE to become a better predictor of future inputs, given previous inputs and actions. Of course, this requires storing input-output traces of all trials [70, 72, 77].
That is, once a new skill has been learned by a copy of ONE (or even by another machine learning device), e.g., through slow trial-and-error-based evolution or reinforcement learning, ONE is simply retrained in PowerPlay style [75] through well-known, feasible, gradient-based methods on stored input/output traces [75, Sec. 3.1.2] of all previously learned control and prediction skills still considered worth memorizing, similar to the chunker-automatizer system of the neural history compressor of 1991 [61]. In particular, standard gradient descent through backpropagation in discrete graphs of nodes with differentiable activation functions [38, 98] [76, Sec. 5.5] can be used to squeeze many expensively evolved skills into the limited computational resources of ONE. Compare recent work on incremental skill learning [5]. Well-known regularizers [76, Sec. 5.6.3] can be used to further compress ONE, possibly shrinking it by pruning neurons and connections, as proposed already in 1965 for deep learning multilayer perceptrons [27, 26, 77]. This forces ONE even more to relate partially analogous skills (with shared algorithmic information [84, 31, 6, 34, 85, 36, 69]) to each other, creating common subprograms in the form of shared subnetworks of ONE. This may greatly speed up subsequent learning of novel but algorithmically related skills, through reuse of such subroutines created as byproducts of data compression, where the data are actually programs encoded in ONE’s previous weight matrices.

So ONE continually collapses more and more skills and predictive knowledge into itself, compactly encoding shared algorithmic information in reusable form, to learn new problem-solving programs more quickly.
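The retraining idea above (reproducing stored input/output traces of successful past behaviors by pure gradient descent, without re-executing them in the environment) can be sketched as follows. The linear "policy", the data, and all names are illustrative assumptions standing in for ONE, not the paper's actual architecture:

```python
import numpy as np

# A successful past version of the controller, whose behavior we want to
# compress into a fresh network via its stored input/output traces.
rng = np.random.default_rng(1)
W_teacher = rng.standard_normal((2, 4))   # weights of the superseded controller
inputs = rng.standard_normal((100, 4))    # stored inputs of past trials
targets = inputs @ W_teacher.T            # stored action outputs to reproduce

# Retrain offline by plain gradient descent on the mean squared imitation error;
# no interaction with the environment is needed.
W = np.zeros((2, 4))
for _ in range(500):
    grad = (inputs @ W.T - targets).T @ inputs / len(inputs)
    W -= 0.1 * grad
```

After retraining, `W` reproduces the stored behavior; in the paper's setting, many such behaviors (plus prediction targets) would be compressed into the same weights.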
3 More Formally: ONE and its Self-Acquired Data
The notation below is similar but not identical to the one in previous work on an RNN-based C-M system called the RNNAI [77].
Let $m, n, o$ denote positive integer constants, and $i, t, \tau$ positive integer variables assuming ranges implicit in the given contexts. The $i$-th component of any real-valued vector, $v$, is denoted by $v_i$. For convenience, let us assume that ONE's life span can be partitioned into trials $T_1, T_2, \ldots$ In each trial, ONE attempts to solve a particular task, trying to manipulate some unknown environment through a sequence of actions to achieve some goal. Let us consider one particular trial and its discrete sequence of time steps, $t = 1, 2, \ldots, t_{end}$.
At the beginning of a given time step, $t$, ONE receives a “normal” sensory input vector, $in(t) \in \mathbb{R}^m$, and a reward input vector, $r(t) \in \mathbb{R}^n$. For example, parts of $in(t)$ may represent the pixel intensities of an incoming video frame, while components of $r(t)$ may reflect external positive rewards, or negative values produced by pain sensors whenever they measure excessive temperature or pressure or low battery load (hunger). Inputs may also encode user-given goals or tasks, e.g., through commands spoken by a user. Often, however, it is convenient to use an extra input vector $goal(t) \in \mathbb{R}^o$ to uniquely encode user-given goals, as we have done since 1990, e.g., [79, 75]. Let $all(t)$ denote the concatenation of the vectors $in(t)$, $r(t)$ and $goal(t)$. The total reward at time $t$ is $R(t) = \sum_{i=1}^{n} r_i(t)$. The total cumulative reward up to time $t$ is $CR(t) = \sum_{\tau=1}^{t} R(\tau)$. During time step $t$, ONE computes during several micro steps (e.g., [77, Sec. 3.1]) an output action vector, $out(t)$, which may influence the environment and thus future $in(\tau), r(\tau)$ for $\tau > t$.
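A worked example of this reward bookkeeping, writing $R(t)$ for the total reward at step $t$ (the sum of the reward vector's components) and $CR(t)$ for the cumulative reward up to $t$; the numbers are purely illustrative:

```python
import numpy as np

# Vector-valued rewards r(t) for a 5-step trial; rows are time steps,
# columns are reward components (e.g., external reward and a pain sensor).
r = np.array([[0.0, -0.1],
              [0.0,  0.0],
              [1.0, -0.1],
              [0.0,  0.0],
              [2.0,  0.0]])

R = r.sum(axis=1)    # total reward R(t) = sum_i r_i(t) at each step
CR = np.cumsum(R)    # cumulative reward CR(t) = sum_{tau<=t} R(tau)
```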
3.1 Training a Copy of ONE on New Control Tasks Without a Teacher
One of ONE’s goals is to maximize the cumulative reward $CR(t_{end})$. Towards this end, copies of successive instances of ONE are trained in a series of trials through a black box optimization method in Step 3 of Algorithm 1, e.g., through incremental neuroevolution [17], hierarchical neuroevolution [100, 93], hierarchical policy gradient algorithms [13], or asymptotically optimal ways of algorithmic transfer learning [69]. Given a new task and a ONE trained on several previous tasks, such hierarchical/incremental methods may create a copy of the current ONE, freeze its current weights, then enlarge the copy of ONE by adding a few new units and connections [26] which are trained until the new task is satisfactorily solved. This process can reduce the size of the search space for the new task, while giving the new weights the opportunity to learn to somehow use certain frozen parts of ONE’s copy as subroutines. (Of course, it is also possible to simply retrain all weights of the entire copy to solve the new task.) Compare a recent study of incremental skill learning with feedforward networks [5].
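The "freeze and grow" scheme just described (freeze the copy's current weights, add a few new trainable units for the new task) might be sketched as follows; the class, shapes, and gradient values are illustrative assumptions, not the paper's exact architecture:

```python
import numpy as np

class GrowingNet:
    """Toy net whose units can be frozen and whose capacity can grow."""
    def __init__(self, n_in, n_hidden, rng):
        self.W = rng.standard_normal((n_hidden, n_in)) * 0.1
        self.frozen = np.zeros(n_hidden, dtype=bool)  # per-unit freeze flags

    def freeze_all(self):
        self.frozen[:] = True

    def grow(self, n_new, rng):
        """Add n_new trainable units; previously frozen units stay frozen."""
        new_rows = rng.standard_normal((n_new, self.W.shape[1])) * 0.1
        self.W = np.vstack([self.W, new_rows])
        self.frozen = np.concatenate([self.frozen, np.zeros(n_new, dtype=bool)])

    def apply_grad(self, grad, lr=0.1):
        """Gradient step that leaves frozen units untouched."""
        grad = np.where(self.frozen[:, None], 0.0, grad)
        self.W -= lr * grad

rng = np.random.default_rng(2)
net = GrowingNet(n_in=4, n_hidden=3, rng=rng)
net.freeze_all()                       # old skills are locked in
net.grow(2, rng)                       # new capacity for the new task
old_rows = net.W[:3].copy()
new_rows = net.W[3:].copy()
net.apply_grad(np.ones_like(net.W))    # only the two new units move
```

The new units may learn to call the frozen part as a subroutine, shrinking the search space for the new task.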
In nondeterministic or noisy environments, by definition the task is considered solved once the latest version of the RNN has performed satisfactorily on a statistically significant number of trials according to a user-given criterion, which also implies that the input-output traces of these trials (Sec. 3.7) are sufficient to retrain ONE in Step 4 of Algorithm 1 without further interaction with the environment.
3.2 Unsupervised ONE Learning to Predict/Compress Observations
ONE may further profit from unsupervised learning that compresses the observed data [61] into a compact representation that may make subsequent learning of externally posed tasks easier [61, 77]. Hence, another goal of ONE can be to compress ONE's entire growing interaction history of all failed and successful trials [70, 73], e.g., through neural predictive coding [61, 78]. For this purpose, ONE has special output units to produce for $t < t_{end}$ a prediction $pred(t)$ of $all(t+1)$ [55, 56, 59, 54, 60] from ONE's previous observations and actions, which are in principle accessible to ONE through (recurrent) connections. In one of the simplest cases, this contributes $\|pred(t) - all(t+1)\|^2$ to the error function to be minimized by gradient descent in ONE's weights in Step 4 of Algorithm 1. This will train $pred(t)$ to become more like the expected value of $all(t+1)$, given the past. See previous papers [78, 70, 77] for ways of translating such neural predictions into compression performance. (Similar prediction tasks could also be specified through particular prediction task-specific goal inputs, like with other tasks.)
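A minimal sketch of such predictive training on a stored trace, with $pred(t)$ denoting the prediction of the next observation $all(t+1)$, and a linear predictor standing in for ONE's prediction units (all names and sizes are assumptions):

```python
import numpy as np

rng = np.random.default_rng(3)
T, d = 50, 6
trace = rng.standard_normal((T, d))   # stored observations all(1..T)

# Linear predictor pred(t) = W @ all(t), trained by gradient descent on the
# squared next-step prediction error over the whole stored trace.
W = np.zeros((d, d))
for _ in range(200):
    pred = trace[:-1] @ W.T           # predictions for t = 1..T-1
    err = pred - trace[1:]            # pred(t) - all(t+1)
    W -= 0.05 * err.T @ trace[:-1] / (T - 1)

loss = float((err ** 2).mean())       # remaining prediction error
```

Lower prediction error corresponds to better compression of the history: what is predictable need not be stored.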
3.3 Training ONE to Predict Cumulative Rewards
We may give ONE yet another set of special output units to produce for $t < t_{end}$ another prediction, of $R(t+1)$ and of the total remaining reward $\sum_{\tau > t} R(\tau)$ [55]. Unlike in the present paper, predictions of expected cumulative rewards are actually essential in traditional reinforcement learning [29, 90, 101, 76], where they are usually limited to the case of scalar rewards (while ONE's rewards may be vector-valued like in old work of 1990 [55, 56]). Of course, in principle, such cumulative knowledge is already implicitly present in a ONE that has learned to predict only next-step rewards $r(t+1)$. However, explicit predictions of expected cumulative rewards may represent redundant but useful derived secondary features that further facilitate black box optimization in later incarnations of Step 3 of Algorithm 1, which may discover useful subprograms of the RNN making good use of those features.
3.4 Adding Other Reasonable Objectives to ONE’s Goals
We can add additional objectives to ONE's goals. For example, we may give ONE another set of special output units and train them through unsupervised learning [62] to produce at each time step a code vector that represents an ideal factorial code [2] of the observed history so far, or that encodes the data in related ways generally considered useful, e.g., [23, 28, 81, 68, 21].
3.5 No Fundamental Problem with Bad Predictions of Inputs and Rewards
Note that like in work of 2015 [77], but unlike in earlier work on learning to plan of 1990 [55, 56], it is not that important that ONE becomes a good predictor of inputs (Sec. 3.2), including cumulative rewards (Sec. 3.3). In fact, in noisy environments, perfect prediction is impossible. The learning of solutions of control tasks in Step 3 of Algorithm 1, however, does not essentially depend on good predictions, although it might profit from internal subroutines of ONE (learned in Step 4) that at least occasionally yield good predictions of expected future observations (Sec. 3.2) or of expected cumulative rewards (Sec. 3.3).
Likewise, control learning may profit from, but does not existentially depend on, near-optimal codes according to Sec. 3.4.
To summarize, ONE’s subroutines for making codes and predictions may or may not help to solve control problems during Step 3, where it is ONE’s task to figure out when to use or ignore those subroutines.
3.6 Store Behavioral Traces
Like in previous work since 2006 [70, 72, 77], to be able to retrain ONE on all observations ever made, we should store ONE's entire, growing, lifelong sensory-motor interaction history, including all inputs and goals and actions and reward signals observed during all successful and failed trials [70, 72, 77], including what initially looks like noise but later may turn out to be regular. This is normally not done, but feasible today. Remarkably, as pointed out in 2009, even human brains may have enough storage capacity to store 100 years of sensory input at a reasonable resolution [72].
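A rough back-of-envelope check of the feasibility claim, under an assumed stream rate (the ~100 KB/s figure below is an illustrative assumption for a heavily compressed audio/video sensory stream, not a number from the paper):

```python
# Storage needed for a century-long sensory stream at an assumed rate.
SECONDS_PER_YEAR = 365 * 24 * 3600
rate_bytes_per_s = 100_000                    # assumed ~100 KB/s after compression
total_bytes = 100 * SECONDS_PER_YEAR * rate_bytes_per_s
total_petabytes = total_bytes / 1e15          # roughly a third of a petabyte
```

Under this assumption, 100 years of experience fits in well under a petabyte, which is within reach of present-day storage.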
On the other hand, in some applications, storage space is limited, and we might want to store (and retrain on) only some of the previous observations (or low-resolution variants thereof), selected according to certain user-given criteria. This does not fundamentally change the basic setup: ONE may still profit from subroutines that encode such limited previous experiences, as long as they convey algorithmic information about solutions for new tasks to be learned.
3.7 Incrementally Collapse All Previously Learned Skills into ONE
Let $trace(t)$ denote the concatenation of ONE's inputs, actions and reward signals at time $t$ (and possibly its predictions and codes, if any). Let $H(t)$ denote the sequence $trace(1), trace(2), \ldots, trace(t)$. To combine the objectives of the previous, very general papers [75, 77], we can use simple, well-understood, rather efficient, gradient-based learning to compress [61] all relevant aspects of $H(t)$ into ONE, and thus compress all control [49] and prediction [61] skills learned so far by previous instances of ONE (or even by separate machine learning methods), not only preventing ONE from forgetting previous knowledge, but also making ONE discover new relations and analogies and other types of mutual algorithmic information among the subroutines implementing previous skills. Typically, given a ONE that already knows many skills, traces of a new skill learned by a copy of ONE are added to the relevant traces and compressed into ONE, which is also retrained on traces of the previous skills. See Step 4 of Algorithm 1.
Note that PowerPlay (2011) [75, 86] also uses environment-independent replay of behavioral traces (or functionally equivalent but more efficient methods) to avoid forgetting and to compress or speed up previously found, suboptimal solutions. At any given time, an acceptable (possibly self-invented) task is to solve a previously solved task with fewer computational resources such as time, space, or energy, as long as this does not worsen performance on other tasks. In the present paper, we focus on pure gradient descent for ONE (which may have an LSTM-like architecture) to implement the PowerPlay principle.
3.8 Learning Goal Input-Dependence Through Compression
After Step 3 of Algorithm 1, a copy of ONE may have been modified and may have learned to control an agent in a video game such that it reaches a given goal in a maze, indicated through a particular goal input, e.g., one that looks a bit like the goal [79, Sec. 3.2]. However, the weight changes of ONE's copy may be insufficient to perform this behavior exclusively when the corresponding goal input is on. And it may have forgotten previous skills for finding other goals, given other goal inputs. Nevertheless, the gradient-based [49] dreaming phase of Step 4 can correct and fine-tune all those behaviors, making them goal input-dependent in a way that would be hard for typical black box optimizers such as neuroevolution.
The setup is also sufficient for high-dimensional spoken commands arriving as input vector sequences at certain standard input units connected to a microphone. The nontrivial pattern recognition required to recognize commands such as “go to the northeast corner of the maze” will require a substantial subnetwork of ONE and many weights. We cannot expect neuroevolution to learn such speech recognition within reasonable time. However, a copy of ONE may rather easily learn by neuroevolution during Step 3 of Algorithm 1 to always go to the northeast corner of the maze, ignoring speech inputs. In a later incarnation of Step 3, a copy of another instance of ONE may rather easily learn to always go to the northwest corner of the maze, again ignoring corresponding spoken commands such as “go to the northwest corner of the maze.” In the consolidation phase of Step 4, ONE then may rather easily learn [10, 50] the speech command-dependence of these behaviors through gradient-based learning, without having to interact with the environment again. Compare the concept of input injection [5].

3.9 Discarding Sub-Optimal Previous Behaviors
Once ONE has learned to solve some control task in suboptimal fashion, it may later learn to solve it faster, or with fewer computational resources. That’s why Step 4 of Algorithm 1 does not retrain ONE to generate action outputs in replays [37] of formerly relevant traces of trials of superseded controllers implemented by earlier versions of ONE. However, replays of unsuccessful trials can still be used to retrain ONE to become a better predictor or world model [77], given past observations and actions (Sec. 3.2).
3.10 Algorithmic Information Theory (AIT) Argument
As discussed in earlier work [77], according to the Theory of Algorithmic Information (AIT) or Kolmogorov Complexity [84, 31, 6, 34, 85, 36], given some universal computer, $U$, whose programs are encoded as bit strings, the mutual information between two programs $p$ and $q$ is expressed as $K(q \mid p)$, the length of the shortest program $\bar{w}$ that computes $q$, given $p$, ignoring an additive constant of $O(1)$ depending on $U$ (in practical applications the computation will be time-bounded [36]). That is, if $q$ is a solution to problem $Q$, and $p$ is a fast (say, linear time) solution to problem $P$, and if $K(q \mid p)$ is small, and $\bar{w}$ is both fast and much shorter than $q$, then asymptotically optimal universal search [35, 69] for a solution to $Q$, given $p$, will generally find $\bar{w}$ first (to compute $q$ and solve $Q$), and thus solve $Q$ much faster than search for $q$ from scratch [69].
In the style of the previous report [77], we can directly apply this AIT argument to ONE. For example, suppose that ONE has learned to represent (e.g., through predictive coding [61, 78]) videos of people placing toys in boxes, or to summarize such videos through textual outputs. Now suppose ONE's next task is to learn to control a robot that places toys in boxes. Although the robot's actuators may be quite different from human arms and hands, and although videos and video-describing texts are quite different from desirable trajectories of robot movements, ONE's knowledge about videos is expected to convey algorithmic information about solutions to ONE's new control task, perhaps in the form of connected high-level spatio-temporal feature detectors representing typical movements of hands and elbows independent of arm size. Training ONE to address this information in its own subroutines and partially reuse them to solve the robot's task may be much faster than learning to solve the task from scratch with a fresh network.
3.11 Gaining Efficiency by Selective Replays
Instead of retraining ONE in a sleep phase (Step 4 of Algorithm 1) on all input-output traces of all trials ever, we may also retrain it on parts thereof, by selecting trials randomly or otherwise, and replaying [37] them to retrain ONE in standard fashion [77]. Generally speaking, we cannot expect perfect compression of previously learned skills and knowledge within the limited retraining time spent in a particular invocation of Step 4. Nevertheless, repeated incarnations of Step 4 will over time improve ONE's performance on all tasks so far.
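A minimal sketch of such selective replay; uniform sampling without replacement is just one possible selection rule (any user-given criterion could be substituted), and all names are illustrative:

```python
import random

def select_replays(traces, k, seed=None):
    """Pick k stored traces (without replacement) for one retraining phase."""
    rng = random.Random(seed)
    return rng.sample(traces, min(k, len(traces)))

# Identifiers of stored behavioral traces (illustrative).
stored = [f"trial_{i}" for i in range(1000)]
batch = select_replays(stored, 32, seed=0)   # traces replayed this sleep phase
```

Repeated sleep phases with different batches gradually cover the whole history, approximating retraining on all traces.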
3.12 Heuristics: Gaining Efficiency by Tracking Weight Variance
As a heuristic, we may track the variance of each weight's value at the ends of all trials. Frequently used weights with low variance can be suspected to be important for many tasks, and may get small or zero learning rates during Step 3 of Algorithm 1, thus making them even more stable, such that the system does not easily forget them during the learning of new tasks. Weights with high variance, however, may get high learning rates in Step 3, and thus participate easily in the learning of new skills. Similar heuristics go back to the early days of neural network research. They can protect ONE's earlier acquired skills and knowledge to a certain extent, to facilitate retraining in Step 4.

3.13 Gaining Efficiency by Tracking Which Weights Are Used for Which Tasks
To avoid forgetting previous skills, instead of replaying all previous traces of still relevant trials (the simplest option to achieve the PowerPlay criterion [75]), one can also implement ONE as a self-modularizing, computation cost-minimizing, winner-take-all RNN [53, 74, 86]. Then we can keep track of which weights of ONE are used for which tasks. That is, to test whether ONE has forgotten something in the wake of recent modifications of some of its weights, only input-output traces in the union of affected tasks have to be retested [75, Sec. 3.3.2]. First implementations of this simple principle were described in previous work on PowerPlay [75, 86].
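The bookkeeping just described (record which weights each task uses, then retest only the tasks affected by a weight change) might be sketched as follows; the data structures and task names are illustrative assumptions:

```python
from collections import defaultdict

# Map each weight id to the set of tasks whose solutions use it.
used_by = defaultdict(set)

def record_usage(task, active_weights):
    """After a trial, note which weights the winner-take-all net activated."""
    for w in active_weights:
        used_by[w].add(task)

def tasks_to_retest(changed_weights):
    """Only tasks touching a changed weight need their traces retested."""
    affected = set()
    for w in changed_weights:
        affected |= used_by[w]
    return affected

record_usage("maze_NE", {1, 2, 3})
record_usage("maze_NW", {3, 4})
record_usage("speech", {5})
affected = tasks_to_retest({3})   # weight 3 changed
```

Here modifying weight 3 requires retesting only the two maze tasks, not the speech task.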
3.14 Ordering Tasks Automatically
So far the present paper has focused on user-given sequences of tasks. In general, however, given a set of tasks, no teacher knows the best sequential ordering of tasks to make ONE learn to solve all tasks as quickly as possible.
The PowerPlay framework (2011) [75] offers a general solution to the automatic task ordering problem. Given is a set of tasks, which may actually be the set of all tasks with computable task descriptions, or a more limited set of tasks, some of them possibly given by a user. In unsupervised mode, one PowerPlay variant systematically searches the space of possible pairs of new tasks and modifications of the current problem solver, until it finds a more powerful problem solver that solves all previously learned tasks plus the new one, while the unmodified predecessor does not. The greedy search of typical PowerPlay variants uses time-optimal program search to order candidate pairs of tasks and solver modifications by their conditional computational (time and space) complexity, given the stored experience so far. The new task and its corresponding task-solving skill are those first found and validated. This biases the search toward pairs that can be described compactly and validated quickly. The computational costs of validating new tasks need not grow with task repertoire size.
3.14.1 Simple automatic ordering of ONE’s tasks
A related, more naive, but easy-to-implement strategy is given by Algorithm 2, which temporarily skips tasks that it currently cannot solve within a given time budget, trying to solve them again later after it has learned other skills, eventually doubling the time budget if any unsolved tasks are left.
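A toy sketch in the spirit of the strategy just described: sweep the task set, skip tasks unsolved within the current budget, and double the budget when a full sweep makes no progress. The solver and its difficulty model are purely illustrative stand-ins for training a copy of ONE, and termination assumes every task eventually becomes solvable:

```python
def order_tasks(tasks, try_solve, budget=1):
    """Return the order in which tasks get solved under a doubling time budget."""
    solved, order = set(), []
    while len(solved) < len(tasks):
        progress = False
        for task in tasks:
            if task not in solved and try_solve(task, budget, solved):
                solved.add(task)
                order.append(task)
                progress = True
        if not progress:
            budget *= 2   # give currently unsolvable tasks more time later

    return order

# Toy solver: a task becomes solvable once the budget covers its difficulty,
# reduced by transfer from already-solved tasks (illustrative assumption).
difficulty = {"A": 1, "B": 3, "C": 6}
def try_solve(task, budget, solved):
    return budget >= difficulty[task] - len(solved)

order = order_tasks(list(difficulty), try_solve)
```

With these toy numbers, the easy task is solved first and its "skill" makes the harder ones solvable under later, larger budgets.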
4 Conclusion
Supervised learning in large LSTMs works so well that it has become highly commercial, e.g., [50, 106, 96, 43]. True AI, however, must continually learn to solve more and more complex control problems in partially observable environments without a teacher. In principle, this could be achieved by black box optimization through neuroevolution or related techniques. Such approaches, however, are currently feasible only for networks much smaller than large commercial supervised LSTMs. Here we combine the best of both worlds, and apply the AIT argument to show how a single recurrent neural network called ONE can incrementally absorb more and more control and prediction skills through rather efficient and well-understood gradient descent-based compression of desirable behaviors, including behaviors of control policies learned by past instances of ONE through neuroevolution or similar general but slow techniques. Ideally, none of the “holy data” from all trials is ever discarded; all can be used to incrementally make ONE an increasingly general problem solver, able to solve more and more tasks.
Essentially, during ONE’s dreams, gradientbased compression of policies and data streams simplifies ONE, squeezing the essence of ONE’s previously learned skills and knowledge into the code implemented within the recurrent weight matrix of ONE itself. This can improve ONE’s ability to generalize and quickly learn new, related tasks when it is awake.
References

 [1] D. Aberdeen. Policy-Gradient Algorithms for Partially Observable Markov Decision Processes. PhD thesis, Australian National University, 2003.
 [2] H. B. Barlow. Unsupervised learning. Neural Computation, 1(3):295–311, 1989.

 [3] J. Baxter and P. L. Bartlett. Infinite-horizon policy-gradient estimation. J. Artif. Int. Res., 15(1):319–350, 2001.
 [4] S. Behnke. Hierarchical Neural Networks for Image Interpretation, volume LNCS 2766 of Lecture Notes in Computer Science. Springer, 2003.
 [5] G. Berseth, C. Xie, P. Cernek, and M. V. de Panne. Progressive reinforcement learning with distillation for multi-skilled motion control. In Proc. International Conference on Learning Representations (ICLR); Preprint arXiv:1802.04765v1, 2018.
 [6] G. J. Chaitin. On the length of programs for computing finite binary sequences. Journal of the ACM, 13:547–569, 1966.
 [7] A. Church. An unsolvable problem of elementary number theory. American Journal of Mathematics, 58:345–363, 1936.
 [8] D. C. Ciresan, U. Meier, and J. Schmidhuber. Multi-column deep neural networks for image classification. In IEEE Conference on Computer Vision and Pattern Recognition CVPR 2012, June 2012. Long preprint arXiv:1202.2745v1 [cs.CV], Feb 2012.
 [9] S. Das, C. Giles, and G. Sun. Learning context-free grammars: Capabilities and limitations of a neural network with an external stack memory. In Proceedings of the Fourteenth Annual Conference of the Cognitive Science Society, Bloomington, 1992.
 [10] S. Fernández, A. Graves, and J. Schmidhuber. An application of recurrent neural networks to discriminative keyword spotting. In Proceedings of the 17th International Conference on Artificial Neural Networks, September 2007.
 [11] K. Fukushima. Neural network model for a mechanism of pattern recognition unaffected by shift in position - Neocognitron. Trans. IECE, J62-A(10):658–665, 1979.
 [12] F. A. Gers, J. Schmidhuber, and F. Cummins. Learning to forget: Continual prediction with LSTM. Neural Computation, 12(10):2451–2471, 2000.
 [13] M. Ghavamzadeh and S. Mahadevan. Hierarchical policy gradient algorithms. In Proceedings of the Twentieth Conference on Machine Learning (ICML2003), pages 226–233, 2003.

 [14] T. Glasmachers, T. Schaul, Y. Sun, D. Wierstra, and J. Schmidhuber. Exponential natural evolution strategies. In Proceedings of the Genetic and Evolutionary Computation Conference (GECCO), pages 393–400. ACM, 2010.
 [15] K. Gödel. Über formal unentscheidbare Sätze der Principia Mathematica und verwandter Systeme I. Monatshefte für Mathematik und Physik, 38:173–198, 1931.
 [16] F. J. Gomez. Robust Nonlinear Control through Neuroevolution. PhD thesis, Department of Computer Sciences, University of Texas at Austin, 2003.
 [17] F. J. Gomez and R. Miikkulainen. Incremental evolution of complex general behavior. Adaptive Behavior, 5:317–342, 1997.
 [18] F. J. Gomez and R. Miikkulainen. Active guidance for a finless rocket using neuroevolution. In Proc. GECCO 2003, Chicago, 2003.
 [19] A. Graves, M. Liwicki, S. Fernandez, R. Bertolami, H. Bunke, and J. Schmidhuber. A novel connectionist system for improved unconstrained handwriting recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 31(5), 2009.
 [20] A. Graves, G. Wayne, M. Reynolds, T. Harley, I. Danihelka, A. Grabska-Barwinska, S. G. Colmenarejo, E. Grefenstette, T. Ramalho, J. Agapiou, A. P. Badia, K. M. Hermann, Y. Zwols, G. Ostrovski, A. Cain, H. King, C. Summerfield, P. Blunsom, K. Kavukcuoglu, and D. Hassabis. Hybrid computing using a neural network with dynamic external memory. Nature, 538(7626):471–476, 2016.

 [21] K. Greff, S. van Steenkiste, and J. Schmidhuber. Neural expectation maximization. In I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, editors, Advances in Neural Information Processing Systems 30, pages 6673–6685. Curran Associates, Inc., 2017.
 [22] M. Grüttner, F. Sehnke, T. Schaul, and J. Schmidhuber. Multi-Dimensional Deep Memory Atari-Go Players for Parameter Exploring Policy Gradients. In Proceedings of the International Conference on Artificial Neural Networks ICANN, pages 114–123. Springer, 2010.

 [23] J. Hérault and B. Ans. Réseau de neurones à synapses modifiables: Décodage de messages sensoriels composites par apprentissage non supervisé et permanent. (Networks of neurons with modifiable synapses: decoding of composite sensory messages through unsupervised and permanent learning.) Comptes rendus des séances de l’Académie des sciences. Série 3, Sciences de la vie, 299(13):525–528, 1984.
 [24] S. Hochreiter and J. Schmidhuber. Long Short-Term Memory. Neural Computation, 9(8):1735–1780, 1997. Based on TR FKI-207-95, TUM (1995).
 [25] S. Hochreiter, A. S. Younger, and P. R. Conwell. Learning to learn using gradient descent. In Lecture Notes on Comp. Sci. 2130, Proc. Intl. Conf. on Artificial Neural Networks (ICANN 2001), pages 87–94. Springer: Berlin, Heidelberg, 2001.
 [26] A. G. Ivakhnenko. Polynomial theory of complex systems. IEEE Transactions on Systems, Man and Cybernetics, (4):364–378, 1971.
 [27] A. G. Ivakhnenko and V. G. Lapa. Cybernetic Predicting Devices. CCM Information Corporation, 1965.
 [28] C. Jutten and J. Herault. Blind separation of sources, part I: An adaptive algorithm based on neuromimetic architecture. Signal Processing, 24(1):1–10, 1991.
 [29] L. P. Kaelbling, M. L. Littman, and A. W. Moore. Reinforcement learning: a survey. Journal of AI research, 4:237–285, 1996.
 [30] N. Kohl and P. Stone. Policy gradient reinforcement learning for fast quadrupedal locomotion. In Robotics and Automation, 2004. Proceedings. ICRA’04. 2004 IEEE International Conference on, volume 3, pages 2619–2624. IEEE, 2004.
 [31] A. N. Kolmogorov. Three approaches to the quantitative definition of information. Problems of Information Transmission, 1:1–11, 1965.
 [32] J. Koutník, G. Cuccu, J. Schmidhuber, and F. Gomez. Evolving largescale neural networks for visionbased reinforcement learning. In Proceedings of the Genetic and Evolutionary Computation Conference (GECCO), pages 1061–1068, Amsterdam, July 2013. ACM.
 [33] Y. LeCun, B. Boser, J. S. Denker, D. Henderson, R. E. Howard, W. Hubbard, and L. D. Jackel. Backpropagation applied to handwritten zip code recognition. Neural Computation, 1(4):541–551, 1989.
 [34] L. A. Levin. On the notion of a random sequence. Soviet Math. Dokl., 14(5):1413–1416, 1973.
 [35] L. A. Levin. Universal sequential search problems. Problems of Information Transmission, 9(3):265–266, 1973.
 [36] M. Li and P. M. B. Vitányi. An Introduction to Kolmogorov Complexity and its Applications (2nd edition). Springer, 1997.

 [37] L.-J. Lin. Programming robots using reinforcement learning and teaching. In Proceedings of the Ninth National Conference on Artificial Intelligence - Volume 2, AAAI’91, pages 781–786. AAAI Press, 1991.
 [38] S. Linnainmaa. The representation of the cumulative rounding error of an algorithm as a Taylor expansion of the local rounding errors. Master’s thesis, Univ. Helsinki, 1970.

 [39] G. Miller, P. Todd, and S. Hedge. Designing neural networks using genetic algorithms. In Proceedings of the 3rd International Conference on Genetic Algorithms, pages 379–384. Morgan Kauffman, 1989.
 [40] D. E. Moriarty. Symbiotic Evolution of Neural Networks in Sequential Decision Tasks. PhD thesis, Department of Computer Sciences, The University of Texas at Austin, 1997.
 [41] M. C. Mozer and S. Das. A connectionist symbol manipulator that discovers the structure of context-free languages. Advances in Neural Information Processing Systems (NIPS), pages 863–863, 1993.
 [42] J. Peters. Policy gradient methods. Scholarpedia, 5(11):3698, 2010.

 [43] J. Pino, A. Sidorov, and N. Ayan. Transitioning entirely to neural machine translation. Facebook Research Blog, 2017, https://code.facebook.com/posts/289921871474277/transitioning-entirely-to-neural-machine-translation/.
 [44] E. L. Post. Finite combinatory processes - formulation 1. The Journal of Symbolic Logic, 1(3):103–105, 1936.

 [45] M. A. Ranzato, F. Huang, Y. Boureau, and Y. LeCun. Unsupervised learning of invariant feature hierarchies with applications to object recognition. In Proc. Computer Vision and Pattern Recognition Conference (CVPR’07), pages 1–8. IEEE Press, 2007.
 [46] M. B. Ring. Continual Learning in Reinforcement Environments. PhD thesis, University of Texas at Austin, Austin, Texas 78712, August 1994.
 [47] A. J. Robinson and F. Fallside. The utility driven dynamic error propagation network. Technical Report CUED/F-INFENG/TR.1, Cambridge University Engineering Department, 1987.
 [48] T. Rückstieß, M. Felder, and J. Schmidhuber. State-Dependent Exploration for policy gradient methods. In W. D. et al., editor, European Conference on Machine Learning (ECML) and Principles and Practice of Knowledge Discovery in Databases 2008, Part II, LNAI 5212, pages 234–249, 2008.
 [49] A. A. Rusu, N. C. Rabinowitz, G. Desjardins, H. Soyer, J. Kirkpatrick, K. Kavukcuoglu, R. Pascanu, and R. Hadsell. Progressive neural networks. Preprint arXiv:1606.04671, 2016.
 [50] H. Sak, A. Senior, K. Rao, F. Beaufays, and J. Schalkwyk. Google voice search: faster and more accurate. Google Research Blog, 2015, http://googleresearch.blogspot.ch/2015/09/google-voice-search-faster-and-more.html.
 [51] T. Schaul and J. Schmidhuber. Metalearning. Scholarpedia, 6(5):4650, 2010.
 [52] D. Scherer, A. Müller, and S. Behnke. Evaluation of pooling operations in convolutional architectures for object recognition. In Proc. International Conference on Artificial Neural Networks (ICANN), pages 92–101, 2010.
 [53] J. Schmidhuber. A local learning algorithm for dynamic feedforward and recurrent networks. Connection Science, 1(4):403–412, 1989.
 [54] J. Schmidhuber. Learning algorithms for networks with internal and external feedback. In D. S. Touretzky, J. L. Elman, T. J. Sejnowski, and G. E. Hinton, editors, Proc. of the 1990 Connectionist Models Summer School, pages 52–61. Morgan Kaufmann, 1990.
 [55] J. Schmidhuber. Making the world differentiable: On using fully recurrent self-supervised neural networks for dynamic reinforcement learning and planning in non-stationary environments. Technical Report FKI-126-90 (revised), Institut für Informatik, Technische Universität München, November 1990. (Revised and extended version of an earlier report from February.).
 [56] J. Schmidhuber. An on-line algorithm for dynamic reinforcement learning and planning in reactive environments. In Proc. IEEE/INNS International Joint Conference on Neural Networks, San Diego, volume 2, pages 253–258, 1990.
 [57] J. Schmidhuber. Towards compositional learning with dynamic neural networks. Technical Report FKI-129-90, Institut für Informatik, Technische Universität München, 1990.
 [58] J. Schmidhuber. Learning to generate subgoals for action sequences. In T. Kohonen, K. Mäkisara, O. Simula, and J. Kangas, editors, Artificial Neural Networks, pages 967–972. Elsevier Science Publishers B.V., NorthHolland, 1991.
 [59] J. Schmidhuber. A possibility for implementing curiosity and boredom in modelbuilding neural controllers. In J. A. Meyer and S. W. Wilson, editors, Proc. of the International Conference on Simulation of Adaptive Behavior: From Animals to Animats, pages 222–227. MIT Press/Bradford Books, 1991.
 [60] J. Schmidhuber. Reinforcement learning in Markovian and non-Markovian environments. In D. S. Lippman, J. E. Moody, and D. S. Touretzky, editors, Advances in Neural Information Processing Systems 3 (NIPS 3), pages 500–506. Morgan Kaufmann, 1991.
 [61] J. Schmidhuber. Learning complex, extended sequences using the principle of history compression. Neural Computation, 4(2):234–242, 1992. (Based on TR FKI-148-91, TUM, 1991).
 [62] J. Schmidhuber. Learning factorial codes by predictability minimization. Neural Computation, 4(6):863–879, 1992.
 [63] J. Schmidhuber. Learning to control fast-weight memories: An alternative to recurrent nets. Neural Computation, 4(1):131–139, 1992.

 [64] J. Schmidhuber. Netzwerkarchitekturen, Zielfunktionen und Kettenregel. (Network architectures, objective functions, and chain rule.) Habilitation thesis, Inst. f. Inf., Tech. Univ. Munich, 1993.
 [65] J. Schmidhuber. On decreasing the ratio between learning complexity and number of time-varying variables in fully recurrent nets. In Proceedings of the International Conference on Artificial Neural Networks, Amsterdam, pages 460–463. Springer, 1993.
 [66] J. Schmidhuber. A self-referential weight matrix. In Proceedings of the International Conference on Artificial Neural Networks, Amsterdam, pages 446–451. Springer, 1993.
 [67] J. Schmidhuber. Discovering neural nets with low Kolmogorov complexity and high generalization capability. Neural Networks, 10(5):857–873, 1997.
 [68] J. Schmidhuber. Neural predictors for detecting and removing redundant information. In H. Cruse, J. Dean, and H. Ritter, editors, Adaptive Behavior and Learning. Kluwer, 1999.
 [69] J. Schmidhuber. Optimal ordered problem solver. Machine Learning, 54:211–254, 2004.
 [70] J. Schmidhuber. Developmental robotics, optimal artificial curiosity, creativity, music, and the fine arts. Connection Science, 18(2):173–187, 2006.
 [71] J. Schmidhuber. Driven by compression progress: A simple principle explains essential aspects of subjective beauty, novelty, surprise, interestingness, attention, curiosity, creativity, art, science, music, jokes. In G. Pezzulo, M. V. Butz, O. Sigaud, and G. Baldassarre, editors, Anticipatory Behavior in Adaptive Learning Systems. From Psychological Theories to Artificial Cognitive Systems, volume 5499 of LNCS, pages 48–76. Springer, 2009.
 [72] J. Schmidhuber. Simple algorithmic theory of subjective beauty, novelty, surprise, interestingness, attention, curiosity, creativity, art, science, music, jokes. SICE Journal of the Society of Instrument and Control Engineers, 48(1):21–32, 2009.
 [73] J. Schmidhuber. Formal theory of creativity, fun, and intrinsic motivation (1990-2010). IEEE Transactions on Autonomous Mental Development, 2(3):230–247, 2010.
 [74] J. Schmidhuber. Self-delimiting neural networks. Technical Report IDSIA-08-12, arXiv:1210.0118v1 [cs.NE], The Swiss AI Lab IDSIA, 2012.
 [75] J. Schmidhuber. PowerPlay: Training an Increasingly General Problem Solver by Continually Searching for the Simplest Still Unsolvable Problem. Frontiers in Psychology, 2013. (Based on arXiv:1112.5309v1 [cs.AI], 2011).
 [76] J. Schmidhuber. Deep learning in neural networks: An overview. Neural Networks, 61:85–117, 2015. Published online 2014; 888 references; based on TR arXiv:1404.7828 [cs.NE].
 [77] J. Schmidhuber. On learning to think: Algorithmic information theory for novel combinations of reinforcement learning controllers and recurrent neural world models. Preprint arXiv:1511.09249, 2015.
 [78] J. Schmidhuber and S. Heil. Sequential neural text compression. IEEE Transactions on Neural Networks, 7(1):142–146, 1996.
 [79] J. Schmidhuber and R. Huber. Learning to generate artificial fovea trajectories for target detection. International Journal of Neural Systems, 2(1 & 2):135–141, 1991. (Based on TR FKI-128-90, TUM, 1990).
 [80] J. Schmidhuber and R. Wahnsiedler. Planning simple trajectories using neural subgoal generators. In J. A. Meyer, H. L. Roitblat, and S. W. Wilson, editors, Proc. of the 2nd International Conference on Simulation of Adaptive Behavior, pages 196–202. MIT Press, 1992.
 [81] H. G. Schuster. Learning by maximizing the information transfer through nonlinear noisy neurons and “noise breakdown”. Phys. Rev. A, 46(4):2131–2138, 1992.
 [82] F. Sehnke, C. Osendorfer, T. Rückstieß, A. Graves, J. Peters, and J. Schmidhuber. Parameter-exploring policy gradients. Neural Networks, 23(4):551–559, 2010.
 [83] K. Sims. Evolving virtual creatures. In A. Glassner, editor, Proceedings of SIGGRAPH ’94 (Orlando, Florida, July 1994), Computer Graphics Proceedings, Annual Conference, pages 15–22. ACM SIGGRAPH, ACM Press, July 1994. ISBN 0-89791-667-0.
 [84] R. J. Solomonoff. A formal theory of inductive inference. Part I. Information and Control, 7:1–22, 1964.
 [85] R. J. Solomonoff. Complexitybased induction systems. IEEE Transactions on Information Theory, IT24(5):422–432, 1978.
 [86] R. K. Srivastava, B. R. Steunebrink, and J. Schmidhuber. First experiments with PowerPlay. Neural Networks, 41(0):130 – 136, 2013. Special Issue on Autonomous Learning.
 [87] K. O. Stanley, D. B. D’Ambrosio, and J. Gauci. A hypercubebased encoding for evolving largescale neural networks. Artificial Life, 15(2):185–212, 2009.
 [88] Y. Sun, F. Gomez, T. Schaul, and J. Schmidhuber. A Linear Time Natural Evolution Strategy for Non-Separable Functions. In Proceedings of the Genetic and Evolutionary Computation Conference, page 61, Amsterdam, NL, July 2013. ACM.
 [89] Y. Sun, D. Wierstra, T. Schaul, and J. Schmidhuber. Efficient natural evolution strategies. In Proc. 11th Genetic and Evolutionary Computation Conference (GECCO), pages 539–546, 2009.
 [90] R. Sutton and A. Barto. Reinforcement learning: An introduction. Cambridge, MA, MIT Press, 1998.
 [91] R. S. Sutton, D. A. McAllester, S. P. Singh, and Y. Mansour. Policy gradient methods for reinforcement learning with function approximation. In Advances in Neural Information Processing Systems (NIPS) 12, pages 1057–1063, 1999.
 [92] A. M. Turing. On computable numbers, with an application to the Entscheidungsproblem. Proceedings of the London Mathematical Society, Series 2, 41:230–267, 1936.
 [93] N. van Hoorn, J. Togelius, and J. Schmidhuber. Hierarchical controller learning in a first-person shooter. In Proceedings of the IEEE Symposium on Computational Intelligence and Games, 2009.
 [94] S. van Steenkiste, J. Koutník, K. Driessens, and J. Schmidhuber. A wavelet-based encoding for neuroevolution. In Proceedings of the Genetic and Evolutionary Computation Conference 2016, GECCO ’16, pages 517–524, New York, NY, USA, 2016. ACM.
 [95] O. Vinyals, A. Toshev, S. Bengio, and D. Erhan. Show and tell: A neural image caption generator. Preprint arXiv:1411.4555, 2014.
 [96] W. Vogels. Bringing the Magic of Amazon AI and Alexa to Apps on AWS. All Things Distributed, 2016, http://www.allthingsdistributed.com/2016/11/amazon-ai-and-alexa-for-all-aws-apps.html.
 [97] J. Weng, N. Ahuja, and T. S. Huang. Cresceptron: a self-organizing neural network which grows adaptively. In International Joint Conference on Neural Networks (IJCNN), volume 1, pages 576–581. IEEE, 1992.
 [98] P. J. Werbos. Applications of advances in nonlinear sensitivity analysis. In Proceedings of the 10th IFIP Conference, 31.8 - 4.9, NYC, pages 762–770, 1981.
 [99] P. J. Werbos. Generalization of backpropagation with application to a recurrent gas market model. Neural Networks, 1, 1988.
 [100] S. Whiteson, N. Kohl, R. Miikkulainen, and P. Stone. Evolving keepaway soccer players through task decomposition. Machine Learning, 59(1):5–30, May 2005.
 [101] M. Wiering and M. van Otterlo. Reinforcement Learning. Springer, 2012.
 [102] D. Wierstra, A. Foerster, J. Peters, and J. Schmidhuber. Recurrent policy gradients. Logic Journal of IGPL, 18(2):620–634, 2010.
 [103] D. Wierstra, T. Schaul, J. Peters, and J. Schmidhuber. Natural evolution strategies. In Congress of Evolutionary Computation (CEC 2008), 2008.
 [104] R. J. Williams. Reinforcement-learning in connectionist networks: A mathematical analysis. Technical Report 8605, Institute for Cognitive Science, University of California, San Diego, 1986.
 [105] R. J. Williams and D. Zipser. Gradientbased learning algorithms for recurrent networks and their computational complexity. In Backpropagation: Theory, Architectures and Applications. Hillsdale, NJ: Erlbaum, 1994.
 [106] Y. Wu, M. Schuster, Z. Chen, Q. V. Le, M. Norouzi, W. Macherey, M. Krikun, Y. Cao, Q. Gao, K. Macherey, J. Klingner, A. Shah, M. Johnson, X. Liu, L. Kaiser, S. Gouws, Y. Kato, T. Kudo, H. Kazawa, K. Stevens, G. Kurian, N. Patil, W. Wang, C. Young, J. Smith, J. Riesa, A. Rudnick, O. Vinyals, G. Corrado, M. Hughes, and J. Dean. Google’s neural machine translation system: Bridging the gap between human and machine translation. Preprint arXiv:1609.08144, 2016.
 [107] X. Yao. A review of evolutionary artificial neural networks. International Journal of Intelligent Systems, 4:203–222, 1993.