Replay in minds and machines

Experience-related brain activity patterns reactivate during sleep, wakeful rest, and brief pauses from active behavior. In parallel, machine learning research has found that experience replay can lead to substantial performance improvements in artificial agents. Together, these lines of research suggest that replay has a variety of computational benefits for decision-making and learning. Here, we provide an overview of putative computational functions of replay as suggested by machine learning and neuroscientific research. We show that replay can lead to faster learning, less forgetting, and reorganization or augmentation of experiences, and that it can support planning and generalization. In addition, we highlight the benefits of reactivating abstracted internal representations rather than veridical memories, and discuss how replay could provide a mechanism to build internal representations that improve learning and decision-making.
Memory, planning and imagination are important aspects of intelligent behavior; they allow the mind to go beyond merely observing and reacting to its surroundings. But how the brain implements these functions, and how they could help to improve artificial intelligent agents, is not yet fully understood. In this review, we will provide an overview of one important candidate mechanism involved in memory, imagination and planning: replay. The term replay is used to refer to a wide variety of mechanisms that relate to the reactivation of past memories. This reactivation has […]
[…] Russek […] replay of $(s_t, a_t, s_{t+1})$-tuples that are prioritized by recency, […] update the successor matrix. Using this approach, […] shown that learning SRs with offline replay gives an agent unique benefits compared to agents without replay. […] the fast updating of successor representations through replay allows the agent to quickly infer policy updates needed to adapt to changes in the task environment, such as a new barrier, that affect the state transition structure.


Replay in the brain
Before we discuss the benefits of replay, we will give a brief historical overview of the subject from a neuroscientific point of view. In neuroscience, much research on memory, planning and imagination has focused on the hippocampus (e.g., Squire, 1992; Buckner, 2010). Early indications that the hippocampus gives rise to memory functioning came from studies of lesion patients (Scoville and Milner, 1957), and studies of rodent spatial navigation (O'Keefe and Nadel, 1978). Of particular importance were rodent recordings from hippocampal pyramidal neurons, known as place cells, that demonstrated spatial firing selectivity when animals navigated in a spatial environment (O'Keefe and Dostrovsky, 1971). These place cells have since been regarded as a core neural substrate for a cognitive map (Tolman, 1948) of physical space that supports spatial navigation (Nadel, 1974, 1978; Moser et al., 2008), as well as memory (Cohen and Eichenbaum, 1993; Redish and Touretzky, 1998).
It soon became clear that hippocampal place cells are also active when an animal is not engaged in a particular task, in line with theoretical proposals that a reactivation mechanism could support consolidation of recent memory traces into an aggregated memory store (Marr, 1971). Following early empirical support for this idea (Buzsaki, 1989; Pavlides and Winson, 1989), multi-unit recordings in rodents led to the discovery of replay: the finding that during periods of rest and sleep, hippocampal cells reactivate sequentially in fast bursts, as if retracing paths the animal had taken during wakefulness (Wilson and McNaughton, 1994; Skaggs and McNaughton, 1996; Kudrimoti et al., 1999; Nadasdy et al., 1999; Gerrard et al., 2001; Lee and Wilson, 2002; Louie and Wilson, 2001; for reviews of these earlier findings, see Redish, 1999; Sutherland and McNaughton, 2000). These observations were followed by a wealth of findings that established the now classic neuroscientific view of replay: replay is sequential, occurs during sleep or rest, reflects previous experience in spatial navigation and memory tasks, and happens on a temporally compressed timescale (for review, see e.g., Foster, 2017).
Over the following decades, much more became known about the biological aspects of replay, and many findings supported the idea that replay is important for memory. First, replay is commonly detected during brief, high-frequency oscillations called sharp wave-ripples (SWRs) (for review, see e.g., Buzsaki, 2015; Joo and Frank, 2018), which have also been found in human medial temporal lobe (MTL) (Bragin et al., 1999; Staba et al., 2002) and can be linked to memory consolidation during sleep, rest, and awake episodic memory retrieval (Axmacher et al., 2008; Staresina et al., 2015; Zhang et al., 2018; Helfrich et al., 2019; Norman et al., 2019; Vaz et al., 2019, 2020). A link to memory is also supported by findings showing that the selective disruption of SWRs during post-task rest slows learning in hippocampus-dependent spatial memory tasks (Girardeau et al., 2009; Ego-Stengel and Wilson, 2010; Jadhav et al., 2012), and that memory can be influenced by playing sounds during SWR events while an animal sleeps (Bendor and Wilson, 2012; Rothschild et al., 2016). Second, replay is much faster than wakeful experience, and this temporal compression is believed to induce the conditions that drive learning and the strengthening of memory traces through synaptic plasticity (Bliss and Collingridge, 1993; Magee and Johnston, 1997; King et al., 1999). Third, interactions between the hippocampus and prefrontal cortex (PFC) during replay events support the idea of consolidating reactivated memories in the brain (for reviews, see e.g., Tang and Jadhav, 2019; Zielinski et al., 2020).

Fig. 1.
In each case, we show an agent (black dot / robot) that stores a policy π, a value function Q and a model M (see Box 1). Depicted are also the closest goal / reward (grey square), the relevant episode (blue bar), whether the episode is internally transformed (blue striped bar) and which aspect of the agent is updated through replay (green arrow). (A) A case in which encountering a goal triggers reverse replay. Reverse replay is then used to update the agent's value function, similar to Lin (1992). (B) Interleaved replay in which episodes from a previous task are replayed to prevent catastrophic forgetting, see McClelland et al. (1995). (C) Replay of uniformly / randomly selected individual transitions (Mnih et al., 2015). (D) An agent can also learn a model through online updating, and replay from the model during offline periods to update its value function (Sutton, 1991). (E) Episodes can be selected for replay based on the magnitude of prediction errors or other reward-related signals experienced during task performance (Schaul et al., 2015). (F) Instead of replaying previously experienced episodes, an agent can simulate possible episodes based on its model and policy in order to update the agent's value function ("offline policy evaluation", Sutton and Barto, 2018), or to plan the next actions (the policy π) at a choice point, without updating values. (G) Previously experienced episodes can also be abstracted before they are replayed, as done for instance when internal representations instead of observations are reactivated (Kapturowski et al., 2019). This can be used, e.g., to update the agent's model and / or value function. (H) Agents can also insert imagined sub-goals into replayed episodes, in particular in order to leverage information from episodes in which the agent never reached the final goal. This is done in hindsight replay (Andrychowicz et al., 2017). (I) Replay occurring during sleep, i.e., while the agent is not engaging in any task at all. This is commonly observed in animals (Klinzing et al., 2019), but analogues in the ML literature are lacking because artificial agents do not sleep. Note that this figure is not meant to be complete and merely illustrates some but not all aspects of the referenced algorithms.
Due to the fast and anatomically localized nature of the replay phenomenon, these insights were almost exclusively gained from invasive recordings in rodents and human patient populations. But existing studies focusing on non-invasive detection of replay in humans point to similar conclusions. Memory benefits of nonsequential reactivation during rest or sleep are well documented in humans (Staresina et al., 2013; Deuker et al., 2013; Tambini and Davachi, 2013; Tambini et al., 2010; Gruber et al., 2016). Memory consolidation in humans can also be biased by presenting learning-associated sensory cues during replay-associated sleep phases, a technique known as targeted memory reactivation (TMR) (Oudiette and Paller, 2013; Lewis and Bendor, 2019). Recent advances in neuroimaging analyses have also made it possible to capture the sequentiality of fast replay events using magnetoencephalography (MEG) (Kurth-Nelson et al., 2016; Liu et al., 2021a) and functional magnetic resonance imaging (fMRI) (Schuck and Niv, 2019; Wittkuhn and Schuck, 2021). In combination, these findings demonstrate that replay exists in a variety of species and support the idea that it reflects a consolidation process that strengthens memory associations (for reviews, see Sutherland and McNaughton, 2000; Rasch and Born, 2007; O'Neill et al., 2010; Diekelmann and Born, 2010; Carr et al., 2011; Zhang et al., 2017; Tambini and Davachi, 2019).
While the above findings have established a foundational knowledge of replay, our understanding of this phenomenon has undergone significant and continued change that has sometimes challenged the classic picture of replay (for review, see e.g., Foster, 2017). For instance, replay seems to be significantly more frequent than initially thought, happening not only during sleep or rest but also during brief wakeful pauses from active behavior (Csicsvari et al., 2007; Davidson et al., 2009; Diba and Buzsaki, 2007; Foster and Wilson, 2006; Karlsson and Frank, 2009; for reviews of awake replay, see e.g., Carr et al., 2011; Tambini and Davachi, 2019). Replay-like sequential reactivation patterns occur at various speeds, from highly accelerated to much slower behavioral timescales (Deng et al., 2020; Denovellis et al., 2020; Tang et al., 2021). Recordings outside of the hippocampus have identified replay-like phenomena in a large number of other brain areas, including entorhinal (Ólafsdóttir et al., 2016; Ólafsdóttir et al., 2017; O'Neill et al., 2017; Trettel et al., 2019), prefrontal (Euston et al., 2007; Peyrache et al., 2009; Jadhav et al., 2016; Yu et al., 2018; Shin et al., 2019; Kaefer et al., 2020; Tang et al., 2021), visual and auditory sensory cortices (Ji and Wilson, 2006; Rothschild et al., 2016; Wittkuhn and Schuck, 2021), parietal cortex (Qin et al., 1997; Hoffman and McNaughton, 2002; Harvey et al., 2012), motor cortex (Ramanathan et al., 2015; Gulati et al., 2017), and ventral striatum (Lansink et al., 2008, 2009; Pennartz, 2004; Gomperts et al., 2015). Moreover, replay is not necessarily a faithful replication of previous behavioral sequences but can also reverse the order of experiences (Csicsvari et al., 2007; Davidson et al., 2009; Diba and Buzsaki, 2007; Foster and Wilson, 2006; Karlsson and Frank, 2009) or change the order of actual experiences according to a learned task rule (Liu et al., 2019a).
It can also represent remote, non-local and never-experienced locations (Karlsson and Frank, 2009;Gupta et al., 2010;Ólafsdóttir et al., 2015), reflect non-spatial and partially observable task features (Schuck and Niv, 2019), and occur even after tasks without explicit memory requirements (Wittkuhn and Schuck, 2021). Collectively, these results suggest that replay (1) occurs during a variety of behavioral states, including rest, sleep and pausing, (2) occurs on a variety of time scales, (3) occurs in a variety of brain areas, and (4) does not only reflect previous experience, but is involved in a much broader range of cognitive functions than memory consolidation and spatial navigation alone.
Indeed, our understanding of hippocampal place cells, and the neural architecture underlying memory and spatial navigation more generally, has also evolved considerably. The "places" represented by hippocampal neurons are not exclusively determined by location in physical space, but can also incorporate other task-relevant aspects, such as sounds (Aronov et al., 2017) and time (MacDonald et al., 2011), but see also O'Keefe and Krupic (2021). Other studies have pointed out that the hippocampus may learn and predict transitions between states in the environment (Gaussier et al., 2002) or encode representations that are predictive of future locations, so-called successor representations (SRs), that can be used for reinforcement learning (RL) (Stachenfeld et al., 2017), and that grid-like patterns in the entorhinal cortex and ventromedial PFC may represent coordinates of a non-spatial space (Constantinescu et al., 2016). Thus, today the cognitive map in the hippocampal-entorhinal system is often thought to represent relationships of locations and events beyond physical space, from conceptual knowledge to social cognition (for reviews and perspectives, see Khamassi and Humphries, 2012; Kaplan et al., 2017; Epstein et al., 2017; Schafer and Schiller, 2018; Behrens et al., 2018; Bellmund et al., 2018; Peer et al., 2020; Bottini and Doeller, 2020; Spiers, 2020). Map-like representations also exist beyond the hippocampus, most notably in the medial entorhinal cortex (Hafting et al., 2005; Fyhn et al., 2007; Høydal et al., 2019), and in prefrontal and orbitofrontal cortex (OFC) (Wilson et al., 2014; Schuck et al., 2016; Constantinescu et al., 2016; for a review, see e.g., Schuck et al., 2018). These findings may have important implications for our understanding of the nature of replayed representations, and suggest a mechanism that is much broader than a mere recapitulation of past observations in the hippocampus.
How can such a diverse set of findings about replay in the hippocampus and the rest of the brain be integrated? We argue that insights into this question can be gained by considering the machine learning (ML) literature, where "experience replay" was introduced in the early 1990s (Lin, 1991). More recently, experience replay has become particularly popular after its importance for training deep neural networks (DNNs) to play Atari video games became clear (e.g., Mnih et al., 2013, 2015; Hessel et al., 2018). This led experience replay to rise to prominence as a crucial ingredient in building human-level intelligence in artificial agents (Kumaran et al., 2016). Despite the conceptual similarity of biological and artificial replay, research on this subject in neuroscience and ML has progressed largely in parallel. Here, we aim to connect insights from both research fields and review computational perspectives taken in ML on the replay phenomenon. Our goal is to highlight the diversity of possible computational and cognitive functions that might be served by a replay mechanism and attempt to answer the question of why agents would replay in the first place. Fig. 1 provides a non-exhaustive overview of different forms of replay, which differ in which experiences are selected for replay and in how replayed information affects the subsequent behavior of an agent. We will discuss these different instantiations of replay below.

Box 1
What is reinforcement learning?
Reinforcement learning (RL) theory provides a formal framework to describe how agents learn to optimize their behavior through interactions with an environment that yields rewards or punishments (Sutton and Barto, 2018). The agent-environment interaction is modelled as a Markov decision process (MDP), which consists of (1) an environment, described by a set of states $S$, (2) a set of actions $A$ available to the agent, (3) a state transition function $M(s_t, a_t, s_{t+1})$, also called a model, reflecting the probabilities of moving from state $s_t$ to the next state $s_{t+1}$ after taking action $a_t$, and (4) a reward function $R(s_t, a_t, s_{t+1})$ that maps each [state, action, next-state]-triplet to a scalar reinforcement signal $r$. MDPs can have continuous state or action spaces, although most applications consider finite and discrete cases.
In MDPs, the agent-environment interaction is assumed to be Markovian with respect to reward and state, which means that the state and reward at the next time point $t + 1$ depend only on the state and action of the current time point $t$, but not on any states or actions before. The current state therefore contains all relevant information from the previous history to determine the next state after an action has been performed. In brief, this means that we can think of learning from trial-and-error as the following process: the agent represents the current state of the environment $s_t$, and then performs an action $a_t$. The action will affect the environment, changing the agent's state from $s_t$ to $s_{t+1}$ as described by the state transition function $M$, and potentially yield a reward, as described by the reward function $R$. The agent's goal is to always perform the actions that maximize the return, i.e., the expected (discounted) sum of rewards over the course of its interaction with the environment. In value-based approaches, the agent learns values that estimate the return, and then implements a policy π that maximizes the values. One popular approach is to estimate values with so-called temporal difference (TD) learning, using a Q-learning algorithm (Watkins and Dayan, 1992):
$$Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha \left[ r_{t+1} + \gamma \max_{a} Q(s_{t+1}, a) - Q(s_t, a_t) \right] \qquad (1)$$
where $r_{t+1} = R(s_t, a_t, s_{t+1})$ is the reward obtained on the transition, γ ∈ [0, 1], the discount factor, attenuates the influence of distal rewards, and α ∈ [0, 1] is a learning rate. Based on the Bellman equation, Q-learning estimates the discounted sum of future rewards in an iterative bootstrapping process that involves the current and future Q-value. Notably, because the algorithm uses the value of the best action on the next step, rather than the value of the action that was actually performed, it is a so-called off-policy approach. Q-learning does not require a transition model of the environment, and hence is a model-free method.
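The Q-learning rule of Watkins and Dayan (1992) described above can be made concrete with a short sketch. The following is a minimal tabular version and not the implementation used for the simulations in this review; the function name `q_learning_update`, the list-of-lists Q-table and the toy transitions are assumptions made purely for illustration.

```python
def q_learning_update(Q, s, a, r, s_next, alpha=0.3, gamma=0.99):
    """Apply one tabular Q-learning update in place and return the TD error.

    Q is a list of rows, one row of action values per state (illustrative layout).
    """
    # Off-policy target: bootstrap from the best action in the next state,
    # regardless of which action the agent will actually take there.
    td_target = r + gamma * max(Q[s_next])
    td_error = td_target - Q[s][a]
    # Move the old estimate a fraction alpha towards the target.
    Q[s][a] += alpha * td_error
    return td_error


# Toy example (hypothetical): 3 states, 2 actions, all values start at zero.
Q = [[0.0, 0.0] for _ in range(3)]
first_error = q_learning_update(Q, s=0, a=1, r=1.0, s_next=2)
# A later, unrewarded transition into state 0 now inherits part of its value.
second_error = q_learning_update(Q, s=1, a=0, r=0.0, s_next=0)
```

Note that the second update assigns value to a state-action pair that was never rewarded itself, purely by bootstrapping from the value of its successor state.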
One important question that is not addressed by the framework of RL is what information about the environment is encoded in an agent's internal states. It is important to realize that agents mostly do not have a way to simply know the objective, "true" state of the environment, but rather must infer that state from their observations. An agent's internal state representation therefore may not be equivalent to the true state, which, as we will discuss in Section 2.6, has many repercussions. The agent could simply take its sensory data to be the states, but this is not sufficient for many tasks; rather the agent needs to supplement its internal state representations with non-observable information (e.g., Wilson et al., 2014;Schuck et al., 2018).

Computational benefits of replay
Replay has become a highly studied aspect of artificial agents. But why do machines need replay, and do animals and machines have the same reasons to employ this process? In the following sections, we will compare the roles of replay in both biological and artificial agents, and distill the most significant benefits of replay.
Before we begin, we would like to point out some significant aspects in which the concept of replay differs between ML and neuroscience. First, neuroscience emphasizes the sequential and often accelerated nature of replay (Genzel et al., 2020). In ML, in contrast, some methods focus on replaying sets of individual transitions (e.g., Mnih et al., 2015), rather than sequences (but see e.g., Hausknecht and Stone, 2015). The issue of replay speed has not been a major consideration in ML, as artificial agents are not bound to physical interaction with the environment and the timescales of biology. A second difference is that understanding the distinction between sleep replay and replay during wakeful pauses from active behavior has a prominence in neuroscience (and is covered extensively in previous reviews; see e.g., Findlay et al., 2021;Klinzing et al., 2019) that is not equivalently mirrored in ML research. While the contrast between sleep and wakefulness is a theme that has inspired ML research conceptually (see e.g., Hinton et al., 1995), the mere fact that artificial agents do not "sleep" in the way that biological agents do, makes it practically impossible to investigate those differences in artificial agents. While biological agents have several "modes" in which they are "off-policy" (sleeping, resting, pausing, mind-wandering, etc.), to our knowledge no comparable distinctions have been made for artificial agents. Third, ML researchers often distinguish between experience replay, which corresponds to sampling experiences from a memory buffer, and model-based methods, in which the agent internally generates new experiences from a learned model of the environment. While these model-based methods involve an offline reactivation process, they are not always called replay in the ML literature, but are often referred to as planning instead. 
In neuroscience, in contrast, many sequential reactivation phenomena are universally referred to as replay, whereas planning is considered to be one of the cognitive processes that might be supported by a replay mechanism.
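The ML distinction between experience replay and model-based generation of experience can be illustrated with a small sketch. The class names `ReplayBuffer` and `TabularModel`, the deterministic "last observed outcome" model and the toy transitions are assumptions chosen for illustration; they are not taken from any of the cited algorithms.

```python
import random
from collections import deque


class ReplayBuffer:
    """Fixed-capacity store of experienced (s, a, r, s_next) transitions."""

    def __init__(self, capacity, seed=None):
        self.memory = deque(maxlen=capacity)  # oldest transitions are dropped first
        self.rng = random.Random(seed)

    def add(self, s, a, r, s_next):
        self.memory.append((s, a, r, s_next))

    def sample(self, batch_size):
        """Experience replay: draw stored transitions uniformly at random."""
        k = min(batch_size, len(self.memory))
        return self.rng.sample(list(self.memory), k)


class TabularModel:
    """Deterministic one-step model learned from experience (a simplification)."""

    def __init__(self, seed=None):
        self.outcomes = {}  # (s, a) -> (r, s_next), the last observed outcome
        self.rng = random.Random(seed)

    def update(self, s, a, r, s_next):
        self.outcomes[(s, a)] = (r, s_next)

    def simulate(self, n):
        """Model-based generation: query the model for n known (s, a) pairs."""
        pairs = self.rng.choices(list(self.outcomes), k=n)
        return [(s, a, *self.outcomes[(s, a)]) for (s, a) in pairs]


# Hypothetical experience: four transitions, buffer capacity of three.
experience = [(0, 1, 0.0, 1), (1, 1, 0.0, 2), (2, 0, 1.0, 3), (3, 0, 0.0, 0)]
buffer = ReplayBuffer(capacity=3, seed=0)
model = TabularModel(seed=0)
for transition in experience:
    buffer.add(*transition)
    model.update(*transition)

batch = buffer.sample(2)       # replayed from memory
simulated = model.simulate(5)  # generated from the learned model
```

In this sketch the buffer can only return what it still holds, whereas the model can be queried arbitrarily often for any state-action pair it has learned about, which is one reason the latter is often described as planning rather than replay.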
Similar to previous work (e.g., Foster and Knierim, 2012;Cazé et al., 2018;Momennejad, 2020), our review focuses on the frameworks of RL (Sutton and Barto, 2018) and neural networks. The formalism of RL allows parallels to be drawn between reactivation of neural patterns in biological agents and replay of task states in artificial agents. The RL framework considers agents that learn from interactions with their environment and thereby gather experiences one at a time. RL techniques are designed to learn from experience gradually, through trial-and-error, using every new experience immediately to adjust the agent's knowledge about the task. This has the benefit of accruing knowledge without delay, while integrating information over all experiences gained so far, rather than using just the most recent experience to make decisions. Typically, small adjustments are made to the agent's knowledge with each new experience because large updates risk overwriting the effects of earlier learning and can limit generalization. Box 1 describes the fundamental aspects of RL. Next to RL, we will also draw on insights from (supervised) deep learning (for overviews, see e.g., LeCun et al., 2015;McClelland and Botvinick, 2020) and the successful combination of the two approaches, deep RL (Mnih et al., 2015, see Tesauro, 1995 for an earlier integration).
Which benefits can an agent obtain from using replay? In the next sections, we will discuss five potential computational functions of replay: increasing speed and data efficiency of learning, reducing forgetting, reorganizing experiences, planning, and generalization. We do not consider these functions to be entirely separable. We distinguish them because they each offer a unique perspective on what an intelligent agent, biological or artificial, stands to gain from replaying past experiences. This perspective also sheds light on why replay can have different properties in different study contexts, which have found replay to be sometimes backward and sometimes forward, or in some cases to occur immediately and in other cases long after the experience was acquired.
In addition to the topics above, we will consider one underexplored aspect of replay: whether replay reflects sensory memories or past internal representations, and whether replay may also be involved in shaping internal representations. We hypothesize that the content and function of replay is determined by its interplay with the agent's current representation of the task and the representational demands of the task at hand, a notion which has recently received some computational (Caselles-Dupre et al., 2019; Momennejad, 2020) as well as empirical support (Schuck and Niv, 2019). In this view, replay can be understood not only as a phenomenon that retrieves relational information stored in a cognitive map, but also as a process that changes relational information and internal state representations of an agent (see Box 1 for a definition of state representations).
Each of the sections will be organized as follows. First, we will state a computational problem that any learning agent will be faced with. Then we discuss how this problem has been approached in ML using replay, highlighting both theoretical and empirical results. Finally, we will discuss empirical findings from the neuroscience literature that support a particular ML proposal, or suggest alternative mechanisms.

Faster learning and data efficiency
The gradual approach to learning in RL has many benefits, but it results in very slow learning that may need thousands of iterations to achieve the optimal policy. Even worse, the number of states, and with it the amount of experience needed for learning, grows exponentially with the number of variables that describe the task environment, a phenomenon known as the "curse of dimensionality" (Bellman, 1957). To be a feasible approach to learning in complex and changing environments, gradual methods must therefore be complemented by mechanisms that will speed up learning without sacrificing the benefits of immediate knowledge acquisition and stable long-term memory. In this light, the idea of recapitulating previous experiences seems particularly appealing for machines, because it is easy and cheap for artificial agents to relearn from past experience that is retrieved from a memory buffer. The brain arguably faces a similar computational challenge. Humans, and other animals, often have to learn directly from the outcomes of their decisions. Yet, repeating errors can pose actual risks, which limits the usefulness of exclusively relying on a slow, trial-and-error-based learning mechanism. More generally, the number of experiences that is acquired with a particular situation in a lifetime is quite limited in relation to the complexity of the environments and the brain, which contains approximately $10^{14}$ synapses (Tang et al., 2001). In order to make thousands of gradual adjustments to each of these synapses, the ability to reuse experience efficiently is paramount. Replay might be one mechanism to do just that.

Replay can speed-up gradual learning from experience and support temporal credit assignment
In the RL literature, "experience replay" was initially introduced to address the issues of slow learning and data inefficiency (Lin, 1991, 1992, 1993). In his seminal paper, Lin (1992) wrote that "[…] Q-learning algorithms […] are inefficient in that experiences obtained by trial-and-error are utilized to adjust the networks only once and then thrown away. […] Experiences should be reused in an effective way." (p. 299). Lin (1992) proposed that experiences can be used to update knowledge in a dual fashion: (1) immediately when experiences are acquired, and (2) at later time points, after the experience itself may have long passed. Specifically, Lin (1992) proposed replaying full sequences of experiences, starting from an initial state to a final state, in backward order, and learning from these experiences, as if they were real. Lin (1992) then showed that this is a more efficient use of data that accelerates learning of an RL agent. In line with these ideas, many others have since emphasized the computational benefit of replay for maximizing data efficiency and the speed of learning (for reviews, see e.g., Hassabis et al., 2017; Kumaran et al., 2016).
There are several reasons why replay can help learning. In the real world, outcomes are often only obtained after a long sequence of events and actions, but agents still need to know how to behave at the start of the sequence, as for instance in a chess game. This problem is known in RL as the temporal credit assignment problem (Minsky, 1961) and replay may help to solve it. The early work by Lin (1991, 1992, 1993) pointed out that replay could help an agent to remember the sequence of previous states and actions that led to a given outcome, and assign credit for the reward to the sequence of states and actions that preceded it. This also explains why sequential replay may proceed in backward order (Lin, 1992). Another aspect is that as the agent's knowledge of the rewards becomes better with time, outcomes in the past should be re-evaluated in light of this updated knowledge (van Seijen and Sutton, 2015). Replay could serve this function by retrieving past rewards which can then be compared to current value estimates.
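Why backward order helps with credit assignment can be seen in a small worked example. The sketch below is a toy illustration under stated assumptions and not the procedure of Lin (1992): it uses a hypothetical five-state corridor with a single action, a tabular Q-learning rule with α = 0.3 and γ = 0.99, and the helper names `q_update`, `replay` and `new_q` are invented for this example.

```python
def q_update(Q, s, a, r, s_next, done, alpha=0.3, gamma=0.99):
    """One tabular Q-learning update; `done` marks a transition into a terminal state."""
    target = r if done else r + gamma * max(Q[s_next])
    Q[s][a] += alpha * (target - Q[s][a])


def replay(Q, episode, backward=True):
    """Re-apply the learning rule to a stored episode, as if it were experienced again."""
    ordered = reversed(episode) if backward else episode
    for s, a, r, s_next, done in ordered:
        q_update(Q, s, a, r, s_next, done)


def new_q(n_states, n_actions=1):
    return [[0.0] * n_actions for _ in range(n_states)]


# Hypothetical corridor: states 0..4, one action (0 = "step right"),
# reward 1 on reaching the terminal state 4, no reward otherwise.
N_STATES = 5
GOAL = N_STATES - 1
episode = [(s, 0, float(s + 1 == GOAL), s + 1, s + 1 == GOAL) for s in range(GOAL)]

# Online learning only: a single pass through the episode in experienced order.
Q_online = new_q(N_STATES)
replay(Q_online, episode, backward=False)

# Online learning followed by one forward replay of the same episode.
Q_forward = new_q(N_STATES)
replay(Q_forward, episode, backward=False)
replay(Q_forward, episode, backward=False)

# Online learning followed by one backward replay of the same episode.
Q_backward = new_q(N_STATES)
replay(Q_backward, episode, backward=False)
replay(Q_backward, episode, backward=True)

start_value_online = Q_online[0][0]
start_value_forward = Q_forward[0][0]
start_value_backward = Q_backward[0][0]
```

After the online pass, only the state next to the goal has a non-zero value. One forward replay moves value back by a single additional state, whereas one backward replay propagates value all the way to the start state, because each update can bootstrap from a successor value that was refreshed a moment earlier.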
Several neuroscientific studies suggest an important role of replay in speeding up learning in biological agents too. First, studies in rodents reported increases in SWR-associated reactivation following initial learning in novel environments (Cheng and Frank, 2008;Eschenko et al., 2008;O'Neill et al., 2008;van de Ven et al., 2016;Tang et al., 2017), when an acceleration of learning from replay might be most beneficial. Second, several studies reported that it requires only a few experiences in a novel environment for replay to occur, and that it can be detected already during the awake state immediately after behavior (Foster and Wilson, 2006), but see Jackson et al. (2006). Third, disrupting replay-related SWRs during awake rest in rodents slows learning in a spatial navigation task (Jadhav et al., 2012).
Previous research has also suggested that backward replay reflects learning through temporal credit assignment in the brain. First, awake backward replay has indeed been frequently observed, where rewarded spatial trajectories of an animal are replayed in reverse order (Diba and Buzsaki, 2007; Foster and Wilson, 2006; Singer and Frank, 2009), and the frequency of awake backward (but not forward) replay is modulated by the change in reward magnitude (Ambrose et al., 2016; Liu et al., 2019a). Second, backward replay was observed to be more frequent in novel compared to familiar environments (Foster and Wilson, 2006; Singer and Frank, 2009) and to decrease its bias to reflect previous paths to the goal location as a function of learning (Shin et al., 2019). This could suggest that the relevant trajectory has been learned and does not need to be reinforced through replay anymore (Foster and Knierim, 2012). Interestingly, Cazé et al. (2018) have shown that in particular model-based replay will also decrease its tendency to replay paths to the goal with learning, while changes in forward planning (Johnson and Redish, 2007) might stem from a model-free process. In a task setting with a stable goal, the replay buffer of a model-free learner will increasingly accumulate rewarded episodes while a model-based learner draws on a learned model to sample episodes in a more balanced fashion. The learning-related changes discussed here might therefore reflect a shift from a model-free to a model-based process with learning, although further data will be needed, and model-based and model-free replay might be difficult to disentangle experimentally (Khamassi and Girard, 2020).
The third line of support comes from computational work that shows how backward replay can strengthen forward synaptic pathways through spike timing dependent plasticity (STDP) (Haga and Fukai, 2018) and thus support forward replay during sleep and active behavior (Johnson and Redish, 2007; Pfeiffer and Foster, 2013; Redish, 2015b, 2013). Fourth, further evidence for the role of replay in assigning credit is provided by findings that show replay is coordinated with subcortical activation of brain areas related to processing reward (Lansink et al., 2009; Pennartz, 2004; Gomperts et al., 2015), which could convey reward signals to other brain regions like the hippocampus. Finally, in a recent MEG study in humans, backward replay following reward receipt was found to be related to non-local learning of task sequences leading to the reward (Liu et al., 2021b). In summary, existing empirical studies support the idea that awake backward replay supports temporal credit assignment by retrieving states that led to the outcome, accelerating learning for cases in which a long delay between rewards and actions must be encoded.
Fig. 2 provides an illustration of how backward replay of full sequences works in the context of RL. We consider an RL agent navigating in a square environment with 20 × 20 tiles that contains several walls and one goal location with a reward (see Fig. 2A). The agent can move into one of the four cardinal directions (up, down, left, right). A small negative reward is given for bumping into a wall (−0.1), and a reward of 1 when arriving at the goal location. Otherwise no rewards are provided. The best policy in this case is to navigate to the reward with as few steps as possible, avoiding the walls. This is a well-known "grid world" problem that can be solved using RL, but learning might be painfully slow without replay. For illustrative purposes, we use the off-policy, model-free Q-learning algorithm described in Equation 1 in Box 1.
The learning rate α, temperature τ and discounting factor γ were arbitrarily set to 0.3, 1 and 0.99 for the purposes of this illustration.

Fig. 2.
Replay speeds up learning to navigate to a goal in a grid world. (A) Square environment ("grid world") with 20 × 20 tiles (shown in gray) that contains several walls (black tiles) and one goal location (white tile labelled "G") that contains a reward. At the beginning of each episode, an RL agent is placed in a random location and can move into one of four cardinal directions (up, down, left, right). The agent receives no reward for moving, a small negative reward for bumping into a wall (−0.1), and a reward of 1 when arriving at the goal location. An episode is terminated once the agent reaches the goal location or after a maximum of 1000 steps. (B) Illustration of the learned value function after the first episode of experience (top left), following replay of the first episode (top right), and after the 50th and 250th episode (bottom). Colors indicate the values of locations from smallest (blue) to highest (red). Values are shown under the best possible policy, that is, assuming that the agent would perform the value-maximizing action in each location. The increasing prevalence of red tiles after 250 episodes therefore reflects that after training the agent has learned a policy for most locations that will avoid any collisions with the wall and reach the goal within the maximum number of allowed steps. Color mapping is scaled for each plot and values smaller than 0.1 are shown as gray tiles. (C) Number of steps (y-axis) needed by the RL agent in each consecutive episode (x-axis) to reach the goal location when using no replay (blue line) between episodes, or when replaying the previous episode in backward order once (brown line) or five times (yellow line). (D) Mean reward (y-axis) achieved by the RL agent in each consecutive episode (x-axis). Colors as in (C). The computer code for the simulations is publicly available at https://github.com/nschuck/replaysim-wittkuhn-etal2021. © Wittkuhn et al., https://doi.org/10.6084/m9.figshare.14261636.v4, CC-BY 4.0 license (https://creativecommons.org/licenses/by/4.0/).
The algorithm is described in Algorithm 1. Briefly, in each episode, the agent starts in a random position and navigates until it has found the reward or the maximum search time has elapsed. The starting locations varied randomly, although they were constrained to lie at least 10 tiles away from the goal location (in order to avoid episodes that were too simple). If the agent found the reward, it internally traversed backwards through the sequence of states, actions and rewards until the beginning of the episode, updating its Q-value at each step.
The blue lines in Fig. 2C-D show the number of steps the agent needs to navigate to the goal location; about 250 episodes are needed before the agent quickly finds the goal location from a new start position. But the speed of learning increases when we supply the agent with a simple replay mechanism described in Table 1, as can be seen in Fig. 2C-D (brown and yellow lines). Adding replay reduced the number of interactions needed to achieve ceiling performance to less than half of what was observed without replay. Note that the choice to replay the full sequence of states, actions and rewards between the start location and the goal location is not without consequence, and a variety of different definitions of what constitutes an episode are common in RL and neuroscience (see Box 2). There are multiple ways to instantiate replay in an RL agent, and the illustration in Fig. 2 only serves as a basic introduction to computational replay (see Fig. 1).
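The effect of a single backward pass can be made concrete with a minimal sketch. It is not the simulation code linked in the caption of Fig. 2: the five-state corridor, the function names `q_update` and `backward_replay`, and the hand-written episode are assumptions made for illustration, while α = 0.3 and γ = 0.99 follow the values given above.

```python
import numpy as np

def q_update(Q, s, a, r, s_next, done, alpha=0.3, gamma=0.99):
    """One tabular Q-learning update for a single transition."""
    target = r if done else r + gamma * np.max(Q[s_next])
    Q[s, a] += alpha * (target - Q[s, a])

def backward_replay(Q, episode, n_replays=1):
    """Replay a stored episode from its last transition back to its first."""
    for _ in range(n_replays):
        for s, a, r, s_next, done in reversed(episode):
            q_update(Q, s, a, r, s_next, done)

# Hypothetical toy episode: a corridor of 5 states, action 1 = "right",
# and a reward of 1 on entering the terminal state 4.
episode = [(0, 1, 0.0, 1, False), (1, 1, 0.0, 2, False),
           (2, 1, 0.0, 3, False), (3, 1, 1.0, 4, True)]

# Online learning in the experienced (forward) order: only the transition
# that directly precedes the reward acquires a non-zero value.
Q_online = np.zeros((5, 2))
for transition in episode:
    q_update(Q_online, *transition)

# One backward replay of the same episode carries value back to the start.
Q_replay = Q_online.copy()
backward_replay(Q_replay, episode, n_replays=1)
```

In this toy case the start state has a value of zero after online learning and a positive value after one backward replay, because each replayed update can build on the value that was refreshed one step earlier.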

Less forgetting
Increasing the speed of learning is an important computational benefit of replay, but not the only one. Replay may also help to reduce forgetting. The problem of forgetting arises because many statistical learning mechanisms were built under the assumption that the agent encounters its environment entirely at random, and therefore can learn from examples that are independent and identically distributed (i.i.d.). Yet, experiences in real life are often not "i.i.d.". First, we typically experience the world as a sequence of related events, chunked in time. Second, some events are much rarer than others, partly because of the way we interact with the environment. These temporal auto-correlations and uneven distributions of events can be an important obstacle for learning. Why does this pose a computational challenge for gradual learning algorithms? Gradual learning mechanisms are designed to integrate experiences over longer periods, but they emphasize the most recent experience. This can cause the agent to forget about important past experiences that were not re-experienced for a long time. This problem is particularly apparent in a supervised learning setting in which neural networks that rely on stochastic gradient descent (SGD) for learning engage in two tasks in a blocked manner. RL-based networks also struggle with this phenomenon (Atkinson et al., 2021). If a DNN, for instance, is first trained to perform a task A and subsequently trained with another task B, performance on task A drops dramatically, as if the network forgot how to solve A. In other words: learning task B interfered with what was learned about task A. This problem is known as catastrophic forgetting, or catastrophic interference, and has long been recognized as a major problem in the ML field (McCloskey and Cohen, 1989;Ratcliff, 1990;French, 1999;Hassabis and Maguire, 2007;Kumaran et al., 2016;Parisi et al., 2019). 
Catastrophic forgetting is one of the main reasons why artificial agents can usually learn a single task quite well but subsequent training on a different task results in poor performance on the previously learned task. This prevents the agent from achieving competencies across multiple tasks, which comes relatively easily to humans. Catastrophic interference can also be understood as an issue threatening the stability of a cognitive map representation (Gupta et al., 2010;McClelland et al., 1995;O'Reilly and McClelland, 1994).

Replay can prevent overwriting of previous experiences
A potential solution to the computational problem of catastrophic interference is interleaved learning, where new experiences are interleaved with existing knowledge to reconcile competing memory representations (McClelland et al., 1995). This influential idea, rooted in the complementary learning systems (CLS) theory (McClelland et al., 1995; O'Reilly et al., 2014; Schapiro et al., 2017), also suggests that replay may be the mechanism that agents can use to "mentally" interleave past with present experience. While the DNN tunes its connection weights to solve task A, the experienced episodes are stored in a memory buffer. During learning of task B, previous experience with task A is integrated during offline periods via a replay-like mechanism, preventing forgetting, and allowing the agent to perform well on both tasks. Shin et al. (2017), for instance, proposed an approach that learns a generative model based on experience with one classification task A. When switching to an independent classification task B, the system is retrained using a combination of new task data and fictitious sequences from the generative model, resulting in rapid generalization to the new task with little performance loss. Similarly, implementing replay in this way in DNNs can help to overcome performance deficits in incremental task learning scenarios and continuous task environments (van de Ven and Tolias, 2018; van de Ven et al., 2020). Of note, non-sequential replay has been shown to become necessary in an artificial neural network (ANN) when an internal model of a continuous task environment has to be learned.
Despite its benefits, interleaved replay can result in problems if the agent's current policy is very different from the behavioral policy when the experiences were collected. To account for this, several authors have argued that replay needs to be corrected for such "off-policyness" using importance sampling (Meuleau et al., 2010) or other off-policy correction methods such as Retrace or V-trace (Espeholt et al., 2018). These approaches essentially weight updates that result from replay in proportion to the mismatch between the policy used to generate the replay and the agent's current policy. This issue is particularly pressing in distributed replay approaches (e.g., Horgan et al., 2018), where virtual experiences are simulated in parallel and are then used for learning only with some time gap.
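The weighting idea can be illustrated with a minimal one-step sketch for state values. It is a simplified stand-in and implements neither Retrace nor V-trace: the function name `weighted_td_update`, the tabular policies `pi` and `mu`, and the truncation level `rho_max` are assumptions made for illustration.

```python
import numpy as np

def weighted_td_update(V, transition, pi, mu, alpha=0.3, gamma=0.99, rho_max=1.0):
    """TD(0) update on a replayed transition, scaled by a truncated
    importance ratio between the current policy pi and the behavioral
    policy mu that generated the experience."""
    s, a, r, s_next = transition
    rho = min(rho_max, pi[s, a] / mu[s, a])
    td_error = r + gamma * V[s_next] - V[s]
    V[s] += alpha * rho * td_error
    return rho

# Hypothetical two-state, two-action example.
mu = np.full((2, 2), 0.5)                  # policy that collected the data
pi = np.array([[0.9, 0.1], [0.5, 0.5]])    # agent's current policy
V = np.zeros(2)

# Action 0 is now preferred: the ratio 0.9 / 0.5 is truncated to 1.
rho_likely = weighted_td_update(V, (0, 0, 1.0, 1), pi, mu)
# Action 1 is now rarely chosen: the update is scaled down by 0.1 / 0.5.
rho_unlikely = weighted_td_update(V, (0, 1, 1.0, 1), pi, mu)
```

Truncating the ratio keeps single replayed transitions from dominating learning when the current policy assigns them a much higher probability than the behavioral policy did, at the cost of some bias.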
Given that humans and other animals do not necessarily seem to suffer from the computational problem of catastrophic interference, the question arises how the brain has apparently solved this issue and, for the purpose of this review, whether replay plays a role in the solution. Humans and animals can solve a wide set of tasks throughout their lifetime, despite temporal autocorrelation of experience and even learn well from blocked experience which troubles DNNs (Flesch et al., 2018). The idea that this ability might be related to replay (Antony and Schapiro, 2019) is supported by several studies. Karlsson and Frank (2009) for instance have observed replay of episodes from a remote spatial context. In humans, reactivation of previously learned events in the hippocampus that overlap with newly encoded memories leads to better retention (Kuhl et al., 2010).

Box 2
What is replayed?
Replay is generally thought to represent previous experience. How is this experience stored in artificial and biological agents? In artificial agents, an experience at time $t$, $e_t$, is commonly defined as a quadruple consisting of the state $s_t$, the taken action $a_t$, the reward $r_t$ received after taking action $a_t$ in state $s_t$, and the next state $s_{t+1}$, together $e_t = (s_t, a_t, r_t, s_{t+1})$, effectively describing a single transition between two states as the atomic unit of an artificial replay event. Although in some cases individual transitions are replayed, such as in the Deep Q-Network (DQN) approach by Mnih et al. (2015) where the states $S$ consisted of preprocessed versions of Atari pixel frames, other work uses sequential replay of past states (e.g., the early version of experience replay by Lin (1992), see Fig. 2, or replay in recurrent neural networks (RNNs), see e.g., Hausknecht and Stone, 2015; Kapturowski et al., 2019). Interestingly, replay techniques in ML increasingly reactivate internal state representations, rather than observations like pixel values (Hayes et al., 2021). We will discuss this aspect in more detail in Section 2.6 on representation learning.
What constitutes a replayed experience is more difficult to answer for biological agents. Unlike in artificial agents, replay in biological agents is thought to be sequential (Genzel et al., 2020), and typically involves hippocampal place cells that represent locations in a spatial environment, akin to previously experienced trajectories of locations. However, hippocampal cells appear to be quite flexible in encoding task-relevant information other than physical space, for instance sounds (Aronov et al., 2017), trial history (Wood et al., 2000;Sun et al., 2020) or abstract task states (Schuck and Niv, 2019). Moreover, a prominent theme in neuroscience emphasizes that the brain segments continuous experience into representations of distinct neural states that transition at event boundaries or shifts in context (for reviews, see e.g., Bird, 2020;Brunec et al., 2018;Maurer and Nadel, 2021;Richmond and Zacks, 2017;Shin and DuBrow, 2020). To complicate matters, this process might also happen retroactively, i.e., after experiences have been obtained (Clewett et al., 2019). This formation of segmented memory traces is thought to be driven by various factors, including inferred changes in the environment (DuBrow et al., 2017), prediction error signals elicited by reward outcomes (Rouhani et al., 2020) or discontinuities in the statistical structure of the environment (Gershman et al., 2014). We suggest that a practical approach for human research therefore seems to be to define events as "meaningful" units of experience (Bird, 2020) within the current experimental paradigm, and to potentially formalize them as states in an MDP, as for instance in Schuck et al. (2016). Finally, in understanding memory as a constructive process, it is important to note that neural task representations may change from perception to reactivation (Favila et al., 2020). 
We argue that this aspect is particularly crucial for the study of replay in humans, because activity patterns that are expected to reactivate are commonly determined based on simple localizer tasks that do not involve mnemonic task components (see e.g., Wittkuhn and Schuck, 2021). The brain might have already transformed its input data to a representation that is different from what the researcher was hoping to see re-merge from replayed activity patterns.
How many experiences are stored and for how long? In DNNs using replay, the newest experiences are stored at each time step in a memory buffer $D = \{e_1, \ldots, e_N\}$ with finite size $N$ (e.g., Mnih et al., 2013, 2015; Zhang and Sutton, 2017). Since the success of the DNN by Mnih et al. (2015), the memory buffer is typically set to a size of $N = 10^6$ newest experiences, which continuously replace the oldest experiences (Fedus et al., 2020; Zhang and Sutton, 2017). Recently, Fedus et al. (2020) investigated the relationship between the number and age of experiences stored in the memory buffer. First, they found that increased memory capacity improved learning performance, likely due to a larger coverage of state-action pairs (Fedus et al., 2020). Second, decreasing the age of the oldest experience in the memory buffer also improved performance, likely because older experiences resulted from policies that are inconsistent with the current on-policy decision strategy, which is in line with earlier findings noting that experience replay is only beneficial if it is consistent with the current decision policy (Lin, 1991, 1993). An exception to this are certain Atari games that are characterized by sparse rewards and require high levels of exploration. In such tasks, sampling from older off-policy experiences is still beneficial (Fedus et al., 2020). These considerations about the size and age of the memory buffer in artificial agents point to an intriguing trade-off between the utilities of old and new memories: On the one hand, a youthful memory buffer storing only recent experiences can effectively drive the current decision policy and quickly abandon outdated and potentially inefficient behavior. On the other hand, keeping older experiences and integrating them with recent ones may foster generalization and prevent an agent from becoming stuck in a decision policy that is suboptimal.
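A finite buffer in which the newest experiences replace the oldest can be sketched in a few lines. This is a minimal illustration rather than the implementation used in any of the cited studies: the class name `ReplayBuffer`, its methods, and the tiny capacity are assumptions chosen for readability.

```python
import random
from collections import deque

class ReplayBuffer:
    """First-in, first-out memory of single transitions with a fixed capacity."""

    def __init__(self, capacity):
        # A deque with maxlen silently drops the oldest entry when it is full.
        self.memory = deque(maxlen=capacity)

    def store(self, s, a, r, s_next):
        self.memory.append((s, a, r, s_next))

    def sample(self, batch_size):
        # Uniform sampling without replacement, ignoring temporal order.
        return random.sample(list(self.memory), batch_size)

    def __len__(self):
        return len(self.memory)

# Hypothetical usage: a capacity of 3 instead of the usual 10**6.
buffer = ReplayBuffer(capacity=3)
for t in range(5):
    buffer.store(t, 0, 0.0, t + 1)
batch = buffer.sample(2)
```

After five stored transitions only the three most recent ones remain, so the capacity directly sets the age of the oldest experience that can still be replayed.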
While the size and content of a memory buffer in artificial agents can be crafted by ML researchers, determining the number and nature of memories in brains is a topic of ongoing debate among neuroscientists. The human brain is known to have a very large storage capacity, owing to the large number of modifiable synapses (Bartol Jr. et al., 2015). But forgetting is a common phenomenon. Although decay plays some role in forgetting (Hardt et al., 2013), other factors, such as interference and usage, seem to be important as well (Feld and Born, 2017). Indeed, forgetting might also be an important aspect of sleep, even while replay processes lead to consolidation (Feld and Born, 2017). Moreover, even if biological agents had unlimited memory storage, selecting memories for replay from that storage would become challenging with a large number of experiences, and, from a decision-making perspective, memory representations are only useful insofar as they have utility for behavior.

Replay can amplify the influence of rare events on learning
Another challenge arises when learning must occur in environments where some events happen rarely, but are nevertheless of great significance for the agent's success or well-being. Naive DNNs will, for instance, often forget about dangerous states and revisit them (García and Fernandez, 2015). This can be mitigated by replay when separate replay buffers for safe and dangerous states are maintained, such that the model cannot forget, and will frequently be reminded about dangers (Meuleau et al., 2010; Lipton et al., 2016). More generally, importance sampling techniques have been used to ensure that those experiences are sampled which are most important for the current policy of the agent, rather than those that occurred most frequently.
Evidence that replay might be used to mitigate this problem in animals comes from studies showing that actions which should be avoided will be reactivated, like paths to a shock zone (Wu et al., 2017) or paths to devalued outcomes (Carey et al., 2019). In addition to learning about events that should be avoided, replaying rare events that are only weakly encoded could allow the agent to form a stable representation of the entire environment even if only a smaller subset is experienced frequently. In Gupta et al. (2010) non-local replay was stronger for remote sequences if they were experienced less frequently. Using MEG in humans, Jafarpour et al. (2017) showed that stronger reactivation of one of three previously encoded stimuli was determined by how weakly the stimulus was attended to during encoding. These findings are supported by an fMRI study by Schapiro et al. (2018), who demonstrated that older, less well remembered task stimuli were selectively reactivated during a subsequent rest period resulting in memory improvement, an effect that was particularly strong in participants who slept in the 12-hr interval between test sessions, which likely offered opportunity for additional consolidation through replay. In another study, the benefits of targeted memory reactivation (TMR) were stronger for weakly learned information (Tambini et al., 2017). Further, replay-associated electroencephalography (EEG) sleep spindles during a nap following difficult (potentially weaker) but not easy (potentially stronger) memory encoding were related to improved subsequent memory performance (Schmidt et al., 2006). Together, we suggest that replay liberates an agent from needing to consider transitions only in proportion to how many times they were experienced. Instead, replay can flexibly increase or decrease the number of opportunities for learning from single episodes.

Re-inventing the past
In our introductory example (see Fig. 2), replayed content was a close reflection of past experience. Replay occurred immediately after an episode was experienced and reflected past trajectories from start to finish, albeit in reverse order. This setup stands in contrast to the ideas discussed in Section 2.2 on forgetting, which imply that replay need not necessarily respect the structure of experiences, but could, for instance, change the order and frequency of events. Beyond dealing with unevenly distributed events, replay could in fact be used to arbitrarily alter the distribution of events upon which memory is built.
Such a reorganization of experience also requires a different understanding of what constitutes an episode. In our simulation, we had assumed that the minimal unit of replayed content is one entire sequence of states, actions and rewards that occurred between a random start position and the encounter of a goal. This meant that episodes were often quite long, involving several hundreds of steps particularly early in learning (see Fig. 2C), and that the transitions between locations had to be replayed in the order in which they were experienced. But for replay to be able to reorganize experience, an episode could be divided into a much smaller unit of experience, a simple sequence of just one state, one action, one reward and the next state, known as a $(s_t, a_t, r_t, s_{t+1})$-tuple. Arguably, replaying such minimal experiences risks losing the benefits of temporal credit assignment, because values will not necessarily propagate along the trajectory to starting positions. But it does offer important advantages, discussed in Subsections 2.3.1 to 2.3.4. In consequence, the question of what constitutes an atomic unit of experience from the perspective of replay has important implications and is therefore actively debated (see Box 2).

Replay can reactivate experiences randomly
Using minimal transitions, there is a large variety of ways in which replay may alter the structure of experiences that have been discussed in the ML and neuroscience literature. One possibility is to reactivate $(s_t, a_t, r_t, s_{t+1})$-tuples in a random order, which artificially creates conditions similar to those of supervised learning, in which ANNs trained with stochastic gradient descent (SGD) excel. Such uniformly sampled $(s_t, a_t, r_t, s_{t+1})$-tuples have therefore played an important role in adapting DNNs to RL problems, such as the famous DQN (Mnih et al., 2015). Random replay has also been found to be useful when updates are done incrementally (learning from each example as it arrives), rather than in a batch-wise manner (learning from groups of examples gathered over time), as is common in ML (Chaudhry et al., 2018). Interestingly, some animal studies have also found replay of seemingly random trajectories following exploration of a familiar open-field arena (Stella et al., 2019). Note however, that Stella et al. (2019) still observed replay of sequentially organized transitions that reflected the spatial constraints of the environment, whereas random replay used in ML can involve sets of single transitions that do not form sequential trajectories. This highlights the different understanding of replay content in ML and neuroscience. Additionally, most animal studies impose the assumption of sequentiality during data analysis, and would discard fully random activation of transitions as noise. In both ML and neuroscience, however, random replay refers to sequential reactivation that is unrelated to previously experienced action sequences, and can be seen at one extreme of a continuum describing how closely replay matches actual behavioral sequences (see Swanson et al., 2020, their Figure 2).

Replay can prioritize rewarding experiences
Another particularly important idea from the ML literature is to prioritize replay of transitions that led to large surprises, which often prove to be more informative than others and result in more efficient learning (Schaul et al., 2015;Horgan et al., 2018). Such prioritized replay records a prediction error (PE), the difference between the expected and actual reward, for every encountered transition and uses this signal to select experiences for replay later. This method is very similar to, and inspired by, an earlier algorithm in model-based planning known as prioritized sweeping, which selects the state to be updated according to the magnitude of the change in value upon the execution of the update (Andre et al., 1998;Moore and Atkeson, 1993;Peng and Williams, 1993). Based on the success of the prioritized replay approach, more frequent sampling of transitions with a high absolute TD error is now a common approach to train DNNs (Fedus et al., 2020). Using RL models, Mattar and Daw (2018) extended previous approaches by focusing prioritization on behaviorally relevant states that are likely to be encountered again in the future and those transitions where a policy change would yield the largest net increase in discounted future reward. Note that although prioritization algorithms assume selection of replay content on the level of individual transitions, they can, under some circumstances, still lead to sequential replay. This is true, for example, in the model from Mattar and Daw (2018), because expectations about increases in future reward are themselves often auto-correlated.
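How a stored error signal can bias the selection of replay content is shown in the following minimal sketch. It follows the general spirit of proportional prioritization but omits components of published methods, such as correcting for the sampling bias that prioritization introduces. The function names, the `exponent` value and the constant `eps` are assumptions made for illustration.

```python
import numpy as np

def sampling_probabilities(td_errors, exponent=0.6, eps=1e-3):
    """Turn stored absolute TD errors into replay probabilities.

    eps keeps transitions with zero error from never being replayed;
    exponent = 0 would recover uniform random replay.
    """
    priorities = (np.abs(td_errors) + eps) ** exponent
    return priorities / priorities.sum()

def sample_prioritized(td_errors, batch_size, rng, exponent=0.6):
    """Draw indices of stored transitions in proportion to their priority."""
    probs = sampling_probabilities(td_errors, exponent)
    return rng.choice(len(td_errors), size=batch_size, p=probs)

# Hypothetical buffer of four transitions with their last recorded TD errors.
td_errors = np.array([0.05, -2.0, 0.5, 0.0])
probs = sampling_probabilities(td_errors)
rng = np.random.default_rng(0)
replayed = sample_prioritized(td_errors, batch_size=32, rng=rng)
```

The transition with the largest absolute error (index 1) receives the highest replay probability regardless of the sign of the error, while the fully predicted transition (index 3) is replayed least often.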
The idea that replay should be influenced by reward and surprise is in line with several animal studies. Place cell sequences associated with reward are replayed more often (Ólafsdóttir et al., 2015; Foster and Wilson, 2006; Bhattarai et al., 2019), in particular those with a high PE (Singer and Frank, 2009; Michon et al., 2019; Roscow et al., 2019), and the rate of SWRs is also influenced by reward (Ambrose et al., 2016; Singer and Frank, 2009). These results highlight replay's role in credit assignment, as discussed in Section 2.1. In human neuroimaging studies, hippocampal activity is modulated by reward magnitude (Wolosin et al., 2012; Igloi et al., 2015). This is in line with the link between backward replay and selection of transitions based on changes in value, proposed by Mattar and Daw (2018) and Cazé et al. (2018). Replay is also more likely to contain behaviorally significant locations, such as the current goal (Gupta et al., 2010; Pfeiffer and Foster, 2013; Ólafsdóttir et al., 2015) and is biased by novelty (Cheng and Frank, 2008; Foster and Wilson, 2006). It has also been observed that optogenetic manipulation of dopaminergic input neurons, thought to signal PEs, increases replay during subsequent sleep (McNamara et al., 2014). Note that reward prediction errors might be accompanied by state prediction errors, which in the brain might both be conveyed by dopaminergic signals (see e.g., Sharpe et al., 2017; Gardner et al., 2018).

Replay can connect experiences in novel ways
Reactivating reordered sequences can also be used to connect experiences in novel ways or strengthen weakly learned relationships. Replay can for instance correspond more closely to sampling from an internal model of the environment, rather than a veridical recapitulation of past experiences (Sutton, 1991). Among the earliest ideas about replay, the Dyna architecture (Sutton, 1991) used an internal model to generate experiences that were then used to train a model-free agent. Indeed, replay can be seen as a way to blur the lines between model-free RL, such as the Q-learning method introduced in Box 1, and model-based RL, during which the agent stores an explicit model of the environment and can use it for planning (van Seijen and Sutton, 2015; Russek et al., 2017). Neuroscientific evidence for reorganized experience has also been reported. During behavior, replay events can switch between reflecting immediately preceding, upcoming or more remote episodes, depending on the behavioral state of the animal at the time of replay (Pfeiffer and Foster, 2013; Ólafsdóttir et al., 2017). Even single replay events can depict more than one trajectory, such as the next one and the path the animal will take after reaching the goal location (Pfeiffer and Foster, 2013), as if representing a multi-step planning process (Foster, 2017; Miller and Venditto, 2021). Note, however, that another reason why experiences from a more distant past might be replayed could simply be that the agent is using a period during which it does not have to engage with the environment to optimize memory. This is particularly apparent for replay during sleep, when the brain has idle time to process experiences while not being actively engaged with any task. Sleep replay has frequently been observed in animals and humans, and has been linked in particular to memory consolidation. Following sleep, memory interference is reduced (Baran et al., 2010; McDevitt et al., 2015) and memory integration or differentiation has been found in fMRI patterns after a delay period with sleep (Tompary and Davachi, 2017).

Replay can reuse past experiences to learn about new goals
A final aspect of reorganization relates to re-considering the usefulness of past experiences in light of one's knowledge about a goal. The RL framework presented thus far is aimed at the pursuit of a single goal (e.g., the single reward location in our grid world, see Fig. 2). However, in many real-world applications, such as the movements of a robotic arm that needs to pick and place objects, an RL strategy incorporating multiple goals would be far more beneficial. Consider again the grid world example in Fig. 2, but this time the agent can only move a finite number of steps. Since there is only one goal state that returns a reward, most of the transitions do not land in the goal state and therefore receive no reward. In such a sparse binary reward situation, where success only results from those sequences of transitions ending in the goal state, most sequences of transitions end in uninformative failures, often related to early termination without reward ("giving up"). For instance, when an agent gives up because a goal was not found after a particular amount of time, it can not know how close it was to the goal. Humans, however, can learn from failure as well as success. Inspired by this idea, an ML technique known as hindsight experience replay (Andrychowicz et al., 2017) is used to relabel the unsuccessful transitions by simply changing the goal state, such that the transitions would now be considered as successful under the new goal, thus contributing to the agent's learning. To the best of our knowledge, no directly equivalent observation has been made in the brain so far.
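The relabeling step can be sketched as follows. This is a minimal illustration that substitutes the last state reached in an episode for the original goal, which is only one of several possible ways to choose a replacement goal. The function names, the tuple layout and the toy episode are assumptions made for illustration.

```python
def sparse_reward(state, goal):
    """Binary reward: 1 only when the goal state is reached."""
    return 1.0 if state == goal else 0.0

def relabel_with_hindsight(episode, reward_fn=sparse_reward):
    """Rewrite a stored episode as if its final state had been the goal.

    episode: list of (state, action, next_state, goal) tuples.
    Returns (state, action, reward, next_state, goal) tuples whose rewards
    are recomputed under the substituted goal.
    """
    achieved_goal = episode[-1][2]  # the state in which the episode ended
    relabeled = []
    for state, action, next_state, _ in episode:
        reward = reward_fn(next_state, achieved_goal)
        relabeled.append((state, action, reward, next_state, achieved_goal))
    return relabeled

# Hypothetical failed episode: the agent aimed for tile (9, 9) but gave up
# after two steps at tile (0, 2), so every original reward was 0.
failed_episode = [((0, 0), "right", (0, 1), (9, 9)),
                  ((0, 1), "right", (0, 2), (9, 9))]
relabeled = relabel_with_hindsight(failed_episode)
```

Under the substituted goal the final transition now carries a reward of 1, so replaying the relabeled episode teaches the agent how to reach tile (0, 2) even though the original attempt failed.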

Planning for a better future
So far, we have mainly focused on the various ways in which replay serves learning and memory. Yet, psychological, neuroscientific and ML research has pointed out the importance of another mechanism that is crucial for goal-directed behavior: planning. A core aspect of this process is the prospective evaluation during which an agent deliberates which of the available sequences of actions and states leads to the best among several potential outcomes. In most cases, planning requires a mental model, or cognitive map (Tolman, 1938, 1948), of the environment, that describes the agent's knowledge about the transition structure of events, including the outcomes at each potential location (e.g., Moerland et al., 2020). Knowledge about the causal structure of the environment allows an agent to predict and compare the outcomes of sequences of states and actions and to choose the one that yields most reward. Yet, as we will see below, a replay buffer can be used instead of a model in order to perform planning functions too. Fig. 3 provides an illustration of how planning differs from the other two aspects of cognition we have considered so far, acting and learning. Within the RL framework, the difference between acting based on learned (cached) values versus acting based on an internal planning process is embodied by the distinction between model-free and model-based systems (Sutton and Barto, 2018; Daw et al., 2005). In model-based RL, the agent uses experience to learn a model of the environment that is described by a function that relates the current state $s_t$ and action $a_t$ to the next state $s_{t+1}$ and the reward $r_t$ (see Box 1). The agent can use this model at decision time or during offline periods to simulate experience (Sutton, 1990). Simulated replay can be used to update cached values or to determine which action would be best to execute next, considering the rewards obtained and how the environment would change if a particular action was taken.
This deliberation process has two advantages. First, planning allows the agent to remain in the safety of mental imagination and avoid the risk of suffering from potentially harmful consequences. Second, planning can be used to decide between never-experienced, entirely hypothetical courses of action (Liu et al., 2019a), a feat which would not be possible with purely experience-based replay.
Despite these differences between planning and learning, much work in RL has emphasized their similarities (Sutton, 1990, 1991; Sutton et al., 2012; van Seijen and Sutton, 2015). This research points to a function of planning that goes beyond deliberation and has shown that planning functions can be achieved without an explicit model (van Seijen and Sutton, 2015; van Hasselt et al., 2019).
As we have seen in our discussion of Lin (1992), the same learning mechanisms can be applied to real or simulated experience. Planning can thus not only be used to determine immediate behavior, but also to shape value functions, a process referred to as background planning (as opposed to decision-time planning and deliberation, see Pezzulo et al., 2019). This can be illustrated by the Dyna architecture (Sutton, 1990, 1991). Just like any model-free RL agent, a Dyna agent selects actions according to learned Q-values, and uses experiences to update these Q-values. But it also uses experiences to observe which states and rewards follow the current action, using this information to update its internal model of the world. Importantly, in Dyna the model is then used to train the model-free agent by replaying simulated episodes, and updating the agent's Q-values based on prediction errors, just as for real experiences. A second aspect is that replaying experiences stored in a memory buffer in some sense replaces functions that would otherwise be subserved by a model (van Seijen and Sutton, 2015; Hessel et al., 2018; van Hasselt et al., 2019), or at least enhances model-based planning functions (Eysenbach et al., 2019). van Seijen and Sutton (2015), for instance, have shown that learning value functions by a model-free method with replay can be equivalent to learning value functions with a model-based method. Empirically, van Hasselt et al. (2019) have shown that state-of-the-art replay methods, involving prioritization based on a Kullback-Leibler (KL) loss, can outperform model-based methods on Atari games (Kaiser et al., 2019), in part because an inaccurate model can lead to unstable learning. Moreover, Eysenbach et al. (2019) have shown that replay can be used to infer a graph representation of the current task that provides insights into subgoals, which in turn can be used for planning (cf. Pong et al., 2018).
This is reminiscent of hindsight replay, which retroactively inserts rewards into stored replay sequences in order to facilitate learning about hierarchical subgoals (Andrychowicz et al., 2017). We note, however, that model-based planning methods remain popular in ML (Pan et al., 2018;Kaiser et al., 2019;Moerland et al., 2020). Planning methods provide the flexibility needed to generate unseen but possible transitions, and planning over long horizons can be achieved using algorithms such as tree search (Guo et al., 2014;Silver et al., 2016;Anthony et al., 2017).
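The Dyna scheme described above can be made concrete with a minimal tabular sketch. It is an illustration under stated assumptions, not the implementation used in any of the cited studies: the corridor environment, the function name `dyna_q`, and all parameter values are hypothetical. Each real step triggers one Q-learning update, one model update, and a handful of updates on transitions replayed from the learned model.

```python
import random
from collections import defaultdict


def dyna_q(n_states=6, episodes=30, planning_steps=10,
           alpha=0.5, gamma=0.95, epsilon=0.1, seed=0):
    """Tabular Dyna-Q on a hypothetical 1-D corridor with reward 1 at the right end."""
    rng = random.Random(seed)
    actions = (-1, +1)
    goal = n_states - 1
    q = defaultdict(float)   # (state, action) -> cached value
    model = {}               # (state, action) -> (reward, next_state, done)

    def step(s, a):
        # Toy environment: deterministic moves, walls at both ends.
        s2 = min(max(s + a, 0), goal)
        done = s2 == goal
        return (1.0 if done else 0.0), s2, done

    def greedy(s):
        best = max(q[(s, a)] for a in actions)
        return rng.choice([a for a in actions if q[(s, a)] == best])

    def update(s, a, r, s2, done):
        # The same Q-learning rule is applied to real and simulated transitions.
        target = r if done else r + gamma * max(q[(s2, b)] for b in actions)
        q[(s, a)] += alpha * (target - q[(s, a)])

    total_steps = 0
    for _ in range(episodes):
        s, done = 0, False
        while not done:
            a = rng.choice(actions) if rng.random() < epsilon else greedy(s)
            r, s2, done = step(s, a)
            update(s, a, r, s2, done)        # learn from real experience
            model[(s, a)] = (r, s2, done)    # update the internal model
            for _ in range(planning_steps):  # replay simulated experience
                ps, pa = rng.choice(list(model))
                pr, ps2, pdone = model[(ps, pa)]
                update(ps, pa, pr, ps2, pdone)
            s = s2
            total_steps += 1
    return q, total_steps


q, total_steps = dyna_q()
```

Setting `planning_steps=0` in this sketch reduces the agent to plain model-free Q-learning, which makes the contribution of simulated replay easy to isolate.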
While the potential benefits of replay for planning have been recognized early on in RL (Sutton, 1990), consideration of this aspect in neuroscience only appeared later, when studies demonstrated replay events in the awake state, often during short pauses from active behavior (e.g., Csicsvari et al., 2007; Diba and Buzsaki, 2007; Eldar et al., 2020; Foster and Wilson, 2006; Kudrimoti et al., 1999; Kurth-Nelson et al., 2016). This allowed researchers to draw closer correspondence between the replayed and the behavioral trajectories, and has resulted in a wealth of findings supporting the idea that replay supports model-based planning in animals as well as humans (for reviews, see e.g., Yu and Frank, 2015; Pezzulo et al., 2019; Wang et al., 2020; Tambini and Davachi, 2019; Carr et al., 2011; Ólafsdóttir et al., 2018). Disruption of awake hippocampal SWRs during a spatial alternation task specifically impaired the ability to decide between two trajectories to alternating goal locations, whereas place field representations, reactivation during rest, and other navigation behavior remained intact (Jadhav et al., 2012). Replay events in the awake state predominantly co-occur with SWRs during short pauses from ongoing exploratory behavior. Forward replay trajectories during awake SWRs often start at the current location of the animal (a well-known "initiation bias", Ambrose et al., 2016; Davidson et al., 2009; Diba and Buzsaki, 2007; Karlsson and Frank, 2009; Pfeiffer and Foster, 2013; Singer et al., 2013), and end at the goal location (Pfeiffer and Foster, 2013), but not always (see e.g., Johnson and Redish, 2007).
A behavioral correlate of deliberation was already described in the 1930s in rodents (Tolman, 1926; Muenzinger and Fletcher, 1936), which tend to pause at a decision point to look back and forth between possible paths, a behavior called vicarious trial and error (VTE). Later studies found that during VTE events, hippocampal place cells associated with theta sequences sweep ahead from the animal's current location (Johnson and Redish, 2007; Wikenheiser and Redish, 2015b; Amemiya and Redish, 2016; Papale et al., 2016). It was also found that during VTE-like behavior, place cell activity influenced the formation of place fields thought to stabilize the cognitive map (Monaco et al., 2014). Note that VTE-associated replay is often accompanied by theta sequences, which differ from SWRs in their neurophysiology. Nonetheless, both can be described as sequential activation of hippocampal cell populations, a simplifying assumption that is helpful from a computational perspective (Foster, 2017; Pezzulo et al., 2019). Recently, theta sequences have been shown to quickly cycle between possible future trajectories (Kay et al., 2020), and increases in theta power in the MTL have been observed in humans in a spatial planning task (Kaplan et al., 2020). In human fMRI, blood-oxygen-level dependent (BOLD) activity in the hippocampus has been shown to increase with deliberation time when deciding between two food items with similar value (Bakkour et al., 2019) and hippocampal activity patterns reflect routes to navigational goal locations (Brown et al., 2016). Another study has found that when humans re-learn outcomes associated with choices at lower levels of a decision tree, the extent to which higher levels of the decision tree are reactivated during rest correlates with how much their decisions change to reach the new downstream reward states (Momennejad et al., 2018).

Replay can influence behavior directly or indirectly
It should be noted that although awake replay during deliberation of future choices is often related to improved task performance, the replayed trajectories do not necessarily correspond to the behavioral trajectory the animal will subsequently take, and sometimes do not end in the goal location (Johnson and Redish, 2007;Singer et al., 2013). In the study by Singer et al. (2013), hippocampal replay during SWRs that preceded correct choices reflected trajectories for the correct and incorrect option in a two-alternative W-maze. Once correct performance became stable (at 85% correct), replayed trajectories shifted to represent the correct future choice more frequently than the incorrect one (Singer et al., 2013). One interpretation of these findings is that the hippocampus uses replay to evaluate all potential trajectories and the behaviorally relevant trajectory is instantiated in a different brain region. Furthermore, backward replay, which backpropagates value information from the goal location, and forward replay, which samples possible trajectories ahead of the animal, might connect their trajectories as proposed by models of bidirectional planning (Khamassi and Girard, 2020). Forward replay events have been shown to end at or close to the goal location (Pfeiffer and Foster, 2013) and might efficiently stop in states where value estimates have already been updated by a backward replay mechanism, as could be instantiated by prioritized sweeping (see Khamassi and Girard, 2020). In sum, these findings support the idea that deliberation and learning may interact. Changes in a familiar environment might increase deliberation, while the need for model updating and deliberation could diminish with learning, e.g., because decisions become more habitual and less deliberate (Dolan and Dayan, 2013). 
Thus, replayed trajectories in the hippocampus that evaluate all potential trajectories might be only predictive of behavior during earlier phases of learning (Singer et al., 2013) or vanish from the hippocampus (e.g., Wimmer and Büchel, 2019) when behavior becomes stereotyped. Findings by Papale et al. (2016) also demonstrate an inverse relationship between SWRs at reward sites and deliberation at choice points.
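To illustrate how a backward-propagating update scheme of the kind mentioned above could operate, the following is a minimal sketch of prioritized sweeping on a deterministic toy corridor. It is not the model of Khamassi and Girard (2020); the environment, the function name `prioritized_sweeping`, and the parameter values are assumptions made for illustration. Updates start at the transition that produced reward and spread to predecessor state-action pairs in order of how much their values would change.

```python
import heapq


def prioritized_sweeping(model, start_pair, gamma=0.9, theta=1e-8, max_updates=10_000):
    """model maps (state, action) -> (reward, next_state); states without actions are terminal."""
    q = {sa: 0.0 for sa in model}
    predecessors = {}
    for (s, a), (_r, s2) in model.items():
        predecessors.setdefault(s2, set()).add((s, a))

    def state_value(s):
        # Best cached value available in state s (0 for terminal states).
        return max((v for (s_, _a), v in q.items() if s_ == s), default=0.0)

    def backup(sa):
        r, s2 = model[sa]
        return r + gamma * state_value(s2)

    # heapq is a min-heap, so priorities are stored with a negative sign.
    queue = [(-abs(backup(start_pair) - q[start_pair]), start_pair)]
    order = []
    while queue and len(order) < max_updates:
        _, sa = heapq.heappop(queue)
        q[sa] = backup(sa)  # full backup is exact for a deterministic model
        order.append(sa)
        for pred in predecessors.get(sa[0], ()):
            priority = abs(backup(pred) - q[pred])
            if priority > theta:
                heapq.heappush(queue, (-priority, pred))
    return q, order


# Hypothetical corridor: states 0..5, goal (terminal) at state 5.
goal = 5
model = {}
for s in range(goal):
    for a in (-1, +1):
        s2 = min(max(s + a, 0), goal)
        model[(s, a)] = (1.0 if s2 == goal else 0.0, s2)

q, order = prioritized_sweeping(model, start_pair=(goal - 1, +1))
```

In this toy case the first entries of `order` step backward from the goal, which loosely mirrors the idea of backward replay carrying value information away from the reward site.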
One additional complicating factor regarding the relationship between replay and subsequent behavior concerns the task setting and motivational state of the animal (e.g., Carey et al., 2019; Wu et al., 2017). Take the example of replayed place cell sequences representing the trajectory into a shock zone that is subsequently avoided (Wu et al., 2017). Such replay might serve the purpose of learning strongly from, and not forgetting, significant outcomes; in this circumstance, replay is thus related to avoiding rather than initiating trajectories. In line with this idea, a growing literature on computational psychiatry posits that replay could underlie symptoms like avoidance and rumination that characterize psychiatric disorders like anxiety (Gagne et al., 2018; Heller and Bagot, 2020; Mobbs et al., 2020).
If replay is related to planning, but the ultimate determination of behavior also depends on other brain areas, then replayed trajectories might be influenced by concurrent reactivation outside the hippocampus, such as the amygdala in the case of aversive outcomes (Girardeau et al., 2017). A number of studies shed light on how replay in the hippocampus is coordinated with other brain regions to instantiate behavior. Replay is known to be coordinated with PFC (Pezzulo et al., 2014; Peyrache et al., 2009; Tang et al., 2017), and some work has placed particular focus on the interaction of hippocampal replay and the OFC (Schuck and Niv, 2019; Steiner and Redish, 2012). Indeed, disruption of nearby medial PFC attenuated components of hippocampal theta sequences representing the current location of the animal (Schmidt et al., 2019) and suppression of hippocampal input impaired the integration of task state structure in the OFC (Wikenheiser et al., 2017). Similarly, a recent study in humans has found that hippocampal replay at rest was not directly linked to behavior during a task (Schuck and Niv, 2019). Rather, replay at rest was linked to how well the different task-states were represented in the OFC, which in turn were linked to behavior (Schuck and Niv, 2019). Outside of the PFC, entorhinal grid cells that are thought to enable vector-based spatial navigation likely contribute to planning, as implicated in computational work (see e.g., Erdem and Hasselmo, 2012; Bush et al., 2015).

Preplay can help planning in unknown environments
While replay is mostly thought to occur after experiences have been made, some ideas have been put forth that assume a "preplay" mechanism, in which experiences are mapped out before they are encountered. Few models related to this idea have been proposed in ML, but perhaps the closest concepts are related to attractor dynamics or reservoir computing (for a review, see e.g., Lukoševičius and Jaeger, 2009). Indeed, recent computational work has suggested a link between preplay and efficient learning, arguing that attractor dynamics can account for replay (Corneil and Gerstner, 2015) or preexisting internal sequences could be used as a dynamical reservoir (Leibold, 2020). In work by Cazin et al. (2019), the framework of reservoir computing is used to model the PFC that is shown to integrate replayed sequences into larger sequence assemblies that can be recalled.
Preplay has also been observed in neuroscience, with some studies reporting apparent "preplay" of place cell sequences before the environment was ever experienced Tonegawa, 2011, 2013). While preplay seems reminiscent of a planning process, most findings highlight that apparent sequentiality can also reflect hippocampal cell assemblies that are connected in a way that constrains sequential firing, even prior to experience of a new maze. Nevertheless, other findings indicate that previous experience is required for such spontaneous sequential activation to occur (Silva et al., 2015). The extent to which the hippocampus is able to seemingly preplay novel experiences could depend on the similarity between pre-existing hippocampal representations and new memories about to be formed (Eichenbaum, 2015). Methodologically, this highlights the necessity of comparing pre- versus post-task replay (e.g., Buhry et al., 2011), as shown by recent research reporting that pre- versus post-task changes in replay can indeed be explained by cell activation and firing rate correlations during experience (Farooq et al., 2019).

Inference and generalization
Although past information provides a glimpse into what we might expect in the future, every new experience is different from the past in some form or another. In order to use experiences effectively, agents must therefore know how to abstract from their details and store, and replay, information which could generalize best to future challenges. Past experiences should also be used to perform inferences that give novel insights that go beyond what has been observed.

Replay can reflect generalizable information and transition structure
Apart from its role in learning and planning, recent developments in ML and neuroscience research suggest that replay also contributes to inference and generalization (for previous reviews and perspectives, see Kumaran, 2012; Kumaran and McClelland, 2012; Cazé et al., 2018; Herszage and Censor, 2018; Lewis et al., 2018; Momennejad, 2020). One theme in this domain has been to build artificial agents that learn generative models from experience, which can then be used to infer new connections based on latent structural rules (Evans and Burgess, 2019), infer the correct context when given new data (Stoianov et al., 2020) and generalize information to new tasks to mitigate performance losses. In the model proposed by Stoianov et al. (2020), for instance, trajectories through a maze are used to learn a generative model, which can produce new trajectories consistent with the current maze structure during offline periods. As new mazes are learned, novel trajectories continue to be generated offline, but from all the mazes that have been experienced, preventing information about any one maze from being lost (similar to our considerations about forgetting in Section 2.2). The hierarchical structure of the model results in trajectories being clustered into distinct maze contexts, which allows maze categories to be inferred when presented with new data. Unlike replay used in other contexts, the model by Stoianov et al. (2020) does not suggest that prioritized replay helps to improve behavioral outcomes. Generative replay that was prioritized based on how surprising observations were under the generative model increased the number of reactivation events that contained important goal locations but did not further improve inference performance (Stoianov et al., 2020).
The clustering of trajectories seen in the model above is related to a broader theoretical view, which has emphasized that separate encoding of transition information and sensory information during learning will allow knowledge about transitions to be reused across situations with structural similarities but new sensory specifics (Behrens et al., 2018; Baram et al., 2020; Whittington et al., 2020). Because replay provides a strong candidate mechanism for learning about transition structure (Stoianov et al., 2020), replay of abstract (sensory-independent) transition information could help to build representations of task structure that can be generalized and used to guide behaviour in new sensory environments (Liu et al., 2019b) or combined with sensory observations to make inferences about the current environment (Evans and Burgess, 2019; Stoianov et al., 2020).
Another major computational approach has focused on replay as a mechanism to learn successor representations, a predictive representation that reflects the expected future visitation of states, given the current state (Dayan, 1993). Unlike the one-step transition matrices that are known as models in model-based RL, the successor matrix can reflect non-adjacent dependencies. This allows the agent to understand relationships between a state and multiple successor states, knowledge which can be used to solve inference problems, such as finding the shortest path to a new reward location (Momennejad et al., 2017). The eigenvectors of a successor matrix can also partition the environment into clusters that help planning (Stachenfeld et al., 2017). Critically, replay of past experiences could be used to update the successor matrix during offline periods (e.g., Russek et al., 2017), resonating with the general theme of using replay for model updating. In recent work, Russek et al. (2017) proposed replay of $(s_t, a_t, s_{t+1})$-tuples that are prioritized by recency, in which rewards are not needed to update the successor matrix. Using this approach, it was shown that learning SRs with offline replay gives an agent unique benefits compared to agents without replay. In particular, the fast updating of successor states through replay allowed the agent to quickly infer policy updates needed to adapt to changes in the task environment, like a new barrier, that affect the state transition structure.
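As a rough illustration of how a successor matrix could be learned from replayed state-action-next-state tuples without any reward signal, consider the sketch below. It is a simplified stand-in rather than the algorithm of Russek et al. (2017): the ring environment, the exponential recency weighting, the function name `replay_sr`, and all parameter values are assumptions. Each replayed tuple triggers a temporal-difference update of one row of the matrix, and the result is compared with the closed-form solution $(I - \gamma T)^{-1}$ for a known transition matrix $T$.

```python
import numpy as np


def replay_sr(buffer, n_states, gamma=0.9, alpha=0.5,
              n_replays=5000, decay=0.9, seed=0):
    """Learn a successor matrix from replayed (s, a, s_next) tuples; no rewards needed."""
    rng = np.random.default_rng(seed)
    M = np.eye(n_states)
    # Recency prioritization (an assumed form): the newest tuple has age 0.
    ages = np.arange(len(buffer))[::-1]
    p = decay ** ages
    p = p / p.sum()
    for i in rng.choice(len(buffer), size=n_replays, p=p):
        s, _a, s_next = buffer[i]
        onehot = np.eye(n_states)[s]
        # TD update: occupancy of s now, plus discounted occupancies from s_next.
        M[s] += alpha * (onehot + gamma * M[s_next] - M[s])
    return M


n = 5
# Hypothetical experience: four clockwise laps around a 5-state ring (single action 0).
buffer = [(t % n, 0, (t + 1) % n) for t in range(4 * n)]
M = replay_sr(buffer, n)

# Closed-form successor matrix for the same deterministic transition structure.
T = np.roll(np.eye(n), 1, axis=1)  # T[s, (s + 1) % n] = 1
M_exact = np.linalg.inv(np.eye(n) - 0.9 * T)

# Values for a newly introduced reward location follow without further learning.
reward = np.zeros(n)
reward[3] = 1.0
values = M @ reward
```

Because the reward vector enters only in the final matrix-vector product, moving the reward to a different state changes `values` immediately, whereas a change in transition structure would require new replayed tuples to update `M`.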
A potential role of replay in generalization and inference has also been suggested by neuroscientific studies. In a recent study from Barron et al. (2020), hippocampal cells selective to cues and rewarding outcomes that have not been directly experienced together, but whose relationship can be inferred based on sensory pre-conditioning (Brogden, 1939), were found to be co-active during SWR events (Barron et al., 2020). These cells also tended to be reactivated during SWR events in a specific order, with reward selective cells reactivated prior to cue selective cells akin to backward replay. Using MEG, researchers have discovered that visual stimuli are reactivated in a non-experienced order that was based on prior learning of a rule about how items should be reported (Liu et al., 2019a). This indicates that sequential reactivation is able to combine prior learning with new sensory inputs to produce behavior relevant to new environments. Recordings in the hippocampus and PFC have also shown that hippocampal place cells are reactivated with subsets of prefrontal cells that encode generalizable task elements (Yu et al., 2018) and that at least some medial PFC neurons involved in replay have generalized firing fields that cover multiple starting locations or multiple goal locations within a maze (Kaefer et al., 2020), suggesting that replay could also contribute to generalization through coordinating the appropriate reactivation of PFC neurons. Finally, neuroscientific studies have found supporting evidence for SRs, which have similar properties to place fields, skewing in the opposite direction of travel and over-representing goal locations, while the eigenvectors of SRs can account for entorhinal grid cells in some spatial contexts (Stachenfeld et al., 2017). 
Hippocampal-entorhinal fMRI signals have also been shown to reflect relationships between successive non-spatial objects organized in a graph (Schapiro et al., 2013; Garvert et al., 2017), consistent with the SR (Stachenfeld et al., 2017).

Representation learning
A final important aspect for understanding replay in minds and machines concerns states, the internal representations agents use to describe their environment. In this last section, we highlight findings indicating that replay is not specific to spatial locations or sensory observations, but might instead involve task-dependent state representations. We argue that replaying states has unique benefits as opposed to replaying only observations. Moreover, we speculate that replay might also have a role in learning the representations that guide behavior. As such, replay could offer a window into the operations the brain performs to craft useful representations of the possible task states.

Replay can reflect state representations
The internal states of an agent are a major determinant of its success (see Box 1). In most environments, sensory input alone will be neither fully necessary nor fully sufficient for predicting outcomes. It contains too much task-irrelevant information, and what is needed to determine the best action can often not be observed (a property called "partial observability"). Even when doing a mundane task such as crossing the street, there will be many perceived aspects you can safely ignore (the color of the cars, the behavior of passers-by, etc.), but also factors that are very important for your decision that might not be in your current sensory input, such as your expectation that cars can appear quickly from behind a sharp bend. Hence, representing sensory input alone is often insufficient as a state representation. As Dayan (1993) has put it: "difficult problems can be rendered trivial if looked at in the correct way" (p. 613).
Moreover, since the agent does not know what the true states of the environment are, learning useful state representations constitutes a major challenge (Bengio et al., 2013; Niv, 2019). This learning involves focusing attention on task-relevant dimensions (e.g., Niv et al., 2015; Leong et al., 2017), representing non-observable context, such as past events, in combination with current observations (e.g., Wilson et al., 2014; Schuck et al., 2016, 2018), and leveraging similarity among the states to determine which experiences might reflect the same hidden causes and which information can be generalized (e.g., Gershman and Niv, 2010).
Many replay algorithms for DNNs store past observations, and during replay internally convert observations into a suitable feature space using a previously learned transformation, such as a convolutional network (e.g., Mnih et al., 2015). But the benefits of directly storing internal representations for replay are increasingly acknowledged (Kapturowski et al., 2019; Iscen et al., 2020; Caccia et al., 2019; Hayes et al., 2019, 2021; van de Ven et al., 2020; Pellegrini et al., 2019). Among other benefits, storing internal representations is often more memory efficient (Iscen et al., 2020; Hayes et al., 2019), while observations can still be recreated from compressed internal representations if they are needed (van de Ven et al., 2020). Moreover, representational replay can capture unobservable context that was necessary to process a given observation when it was made (Kapturowski et al., 2019).
However, representational replay, also called state replay, comes with its own set of challenges, in particular in the context of recurrent networks. Specifically, the problem of partial observability is often addressed by combining long short-term memory (LSTM) (Hochreiter and Schmidhuber, 1997) with DNNs, an architecture known as Deep Recurrent Q-Networks (Hausknecht and Stone, 2015). Because the agent's internal state representations in these networks depend on the history of previous observations, replay of past observations risks being out of context, and replay of internal states becomes necessary. Yet, as agents continue to learn from their experience, the way external inputs are mapped onto internal states changes too; in consequence, the way observations were represented internally in the past might be outdated, a phenomenon known as representational drift. It has therefore been suggested that while replaying internal representations, offline learning should be regularized in a way that captures the amount of representational drift since the replay episode (Pomponi et al., 2020; Balaji et al., 2020). The problem of representational drift will be most severe for RNNs, where observational replay will not lead to useful updates if recreated internal states do not match the internal states when the observation was made originally (e.g., Kapturowski et al., 2019). Although initial research has suggested that simply "zeroing" the agent's internal state at the start of a replay event is useful in some circumstances (Hausknecht and Stone, 2015), this makes learning longer temporal dependencies more difficult. Accordingly, Kapturowski et al. (2019) have shown that it is beneficial for an agent to store its own past internal states and re-initialize the appropriate state at the start of a replay event.
To account for representational drift, a part of the stored state sequences can first be replayed without updating, to reach a more appropriate internal state, and only the remainder of the sequence is then used for offline learning. In sum, replay can be beneficial for learning if it involves not only past observations but also past states.
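The burn-in idea described above can be sketched with a small recurrent network. This is a toy illustration under strong assumptions, not the agent of Kapturowski et al. (2019): the tanh network, its sizes, the rescaling of the recurrent weights (which makes the recurrence contractive so that the effect is easy to see), and the helper names `unroll` and `scaled` are all hypothetical. A hidden state stored under old weights is unrolled through a burn-in segment with the current weights, without any learning, and the result is compared with a stale stored state and with a zeroed state.

```python
import numpy as np


def unroll(W, U, inputs, h0):
    """Run a simple tanh RNN over `inputs`, starting from hidden state `h0`."""
    h = h0
    for x in inputs:
        h = np.tanh(W @ h + U @ x)
    return h


def scaled(A, norm=0.5):
    """Rescale A to a fixed spectral norm (an assumption that keeps the RNN contractive)."""
    return norm * A / np.linalg.norm(A, 2)


rng = np.random.default_rng(0)
d, T, t0, burn_in = 8, 60, 30, 20
U = rng.normal(size=(d, 3))
W_old = scaled(rng.normal(size=(d, d)))                # weights when the episode was stored
W_new = scaled(W_old + 0.3 * rng.normal(size=(d, d)))  # weights after further learning
xs = rng.normal(size=(T, 3))                           # the stored observation sequence

# Hidden states that were stored at the time of the original experience (old weights).
h_stored_t0 = unroll(W_old, U, xs[:t0], np.zeros(d))
h_stored_start = unroll(W_old, U, xs[:t0 + burn_in], np.zeros(d))

# Reference: the state the current network would have at the start of the learning segment.
h_reference = unroll(W_new, U, xs[:t0 + burn_in], np.zeros(d))

# Burn-in: start from the stored state, replay `burn_in` steps with current weights, no updates.
h_burned = unroll(W_new, U, xs[t0:t0 + burn_in], h_stored_t0)

err_stale = np.linalg.norm(h_stored_start - h_reference)  # use the stored state as-is
err_zero = np.linalg.norm(np.zeros(d) - h_reference)      # zero the state at replay start
err_burn_in = np.linalg.norm(h_burned - h_reference)      # stored state plus burn-in
```

In this contractive toy setting the burn-in error shrinks geometrically with the length of the burn-in segment; for networks that are not contractive, no such guarantee holds and the benefit would have to be checked empirically.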
In animals, several findings indicate that a large variety of representations, including non-spatial sensory as well as state-like representations, might be replayed. First, the firing of hippocampal "place" cells can reflect a number of non-spatial aspects of the environment, if they are task-relevant, such as sounds (Aronov et al., 2017), time (MacDonald et al., 2011), accumulated evidence for a choice (Nieh et al., 2021), or successor representations (Stachenfeld et al., 2017; but see O'Keefe and Krupic, 2021). In fact, findings by Cabral et al. (2014) show that hippocampal neurons in mice flexibly switch between representations of spatial or temporal aspects of a task, depending on which strategy was needed to solve it. More directly, one fMRI study by Schuck and Niv (2019) found that sequential hippocampal replay during post-task rest reflected the non-spatial states of a sequential decision-making task. Importantly, observed transitions between decoded replay events were best explained by replay of states that include non-observable task aspects, such as information from the previous trial, rather than by replay of sensory features of the task stimuli alone. This study therefore provides direct evidence for the idea that replay involves state representations that are optimized for the operation of RL algorithms. In an MEG study by Liu et al. (2019b), human participants first learned an abstract rule governing how objects should be ordered in a sequence and later replayed a novel set of objects according to the learned rule rather than in order of experience. Replayed sequences consisted of factorized representations of sensory objects, the identity of the sequence they belonged to, as well as the position within that sequence, supporting the notion that replay is not limited to one kind of information. Moreover, Jadhav et al. (2012) showed that disruption of SWRs in a spatial alternation task impaired navigation when it required unobservable knowledge of the previous trial, thus hinting at the activation of state representations rather than observations during replay.
Interestingly, much evidence in neuroscience indicates that replay involves multiple representations which are reactivated in parallel, possibly suggesting that observations might be recreated at the time of replay (van de Ven et al., 2020). These representations reflect visual (Ji and Wilson, 2006; Wittkuhn and Schuck, 2021), auditory or grid-like (Ólafsdóttir et al., 2016; Ólafsdóttir et al., 2017; O'Neill et al., 2017) information. These reactivated offline and online representations might interact, as has been observed for the case of hippocampus and OFC (Schuck and Niv, 2019). This interaction between the OFC and the hippocampus (for reviews, see Wikenheiser and Redish, 2015a; Wikenheiser and Schoenbaum, 2016) is particularly interesting given that the OFC might store an agent's task state representations (Schuck et al., 2016, see also Kaplan et al., 2017). Disruption of the medial PFC particularly attenuated components of hippocampal theta sequences representing the current location of the animal (Schmidt et al., 2019) and suppression of hippocampal input to the OFC impaired the integration of task state structure (Wikenheiser et al., 2017). Conversely, disruption of SWRs during sleep impaired the integrity of hippocampal maps but they re-emerged following re-learning (Gridchyn et al., 2020), suggesting that relevant maps are stored in brain areas other than the hippocampus (Niethard and Born, 2020). Despite these interactions, it should also be noted that a number of investigations have shown replay events outside the hippocampus need not be coordinated with hippocampal activity (O'Neill et al., 2017; Kaefer et al., 2020; Wittkuhn and Schuck, 2021).
In sum, hippocampal "place cell" firing can reflect a variety of non-spatial but task-relevant aspects (e.g., Aronov et al., 2017), replay occurs in a wide variety of interacting brain areas that reflect an animal's understanding of what is task-relevant, and replay has also been found to directly reflect partially observable task states (Schuck and Niv, 2019).

Can replay support learning useful representations?
Simultaneous replay on different levels of representation, including states and sensory observations, might convey benefits beyond those discussed so far; it might help to build better state representations. One interesting instance concerns successor representation (SR), a form of state representation that provides an efficient way to incorporate knowledge about the transitions between states into the state definition. Computational work by Russek et al. (2017) has shown that SRs can also be learned and updated through replay. More generally, information about state transitions can give rise to further graph analytical insights that are known to provide a good basis for state representations (Mahadevan and Maggioni, 2007). Sequential replay is a natural match as a mechanism to learn states that encode transitional information. Possibly, it could also be used to extract graph properties from experienced transition structures, such as bottleneck states, which then become integrated into state representations. A similar approach has been proposed by Eysenbach et al. (2019), who used replay to infer graph representations that can be used for planning. Other work has highlighted that representations which predict latent embeddings of future observations are particularly useful (Guo et al., 2020). An evaluation of predictiveness could therefore be an important contribution of replay to state representation learning.
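One way in which graph properties such as bottleneck states could in principle be extracted from replayed transitions is a spectral analysis of the experienced transition graph. The sketch below is purely illustrative and is not taken from any of the cited studies: the two-room environment, the function name `spectral_partition`, and the use of the second eigenvector of the graph Laplacian (the Fiedler vector) are assumptions chosen for simplicity.

```python
import numpy as np
from itertools import combinations


def spectral_partition(transitions, n_states):
    """Split a transition graph in two and list the edges that cross the split."""
    A = np.zeros((n_states, n_states))
    for s, s_next in transitions:
        A[s, s_next] = A[s_next, s] = 1.0  # treat experience as an undirected graph
    L = np.diag(A.sum(axis=1)) - A         # graph Laplacian
    _, vecs = np.linalg.eigh(L)            # eigenvalues are returned in ascending order
    side = vecs[:, 1] > 0                  # sign of the Fiedler vector assigns a cluster
    bottlenecks = [(int(i), int(j))
                   for i, j in zip(*np.nonzero(np.triu(A)))
                   if side[i] != side[j]]
    return side, bottlenecks


# Hypothetical replayed transitions: two fully explored rooms joined by one doorway (3, 4).
room_a, room_b = range(0, 4), range(4, 8)
transitions = (list(combinations(room_a, 2))
               + list(combinations(room_b, 2))
               + [(3, 4)])

side, bottlenecks = spectral_partition(transitions, 8)
```

The cluster labels in `side` and the doorway edge in `bottlenecks` are the kind of structural features that could be appended to a state representation, although whether biological replay computes anything similar remains an open question.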
More speculatively, coordinated replay across several levels might serve as a mechanism to identify which aspects of sensory observations exhibit transitions that are uncorrelated with state transitions of stored outcomes. Note that RL models benefit from transition information between states, but they could be affected adversely if transitions of task-irrelevant aspects influenced the agent's internal model. For example, representing states as specific locations in physical space will result in a transition matrix that is different from a transition matrix of more abstract task states, but if spatial position is irrelevant to the task at hand, then transitions between locations could be harmful for learning and planning. If replay can be used to find unattended aspects of sensory observations that correlate with reward or relevant transitions, in turn it might be used to determine which dimensions of observed input are task-irrelevant (see e.g., Schuck et al., 2015, for an example of how recognizing correlations in the environment could lead to changes in state representations). To the best of our knowledge, this idea has not been evaluated yet.
Other evidence suggests that the role of replay for state learning could go beyond information about observed transitions. SRs, for instance, can be extended to deal with partially observable task environments (Vértes et al., 2019). Caselles-Dupré and colleagues proposed another interesting account that involves variational autoencoders (VAEs) (Caselles-Dupré et al., 2018; Caselles-Dupré et al., 2019). Building on earlier work that used generative models to circumvent the memory requirements of observational replay, Caselles-Dupré et al. (2019) proposed storing latent representations rather than observations, and using past experiences in this form to continually train a VAE that acts as a state model. Importantly, only by replaying past episodes can the VAE learn to form a state representation that allows the agent to act efficiently across more than one environment.
In the brain, only some evidence so far suggests that replay can change state representations. Schuck and Niv, 2019 observed that replay in the hippocampus during rest was related to better decodability of partially observable state representations from the OFC during the task. Moreover, decoding of state representations in the OFC increased over time, suggesting that representation learning continued during the task, and perhaps was related to replay. Although this evidence is correlational, it hints at a relationship between replay and state representation learning. Yet, much work is left to do, and uncovering representational changes during or following replay will require new analytical approaches which for instance do not use localizer tasks. It is therefore still unclear, whether state learning mechanisms provide a realistic account for biological replay.

Goal-directed behavior without replay?
In this review, we have outlined the myriad ways in which replay-like mechanisms can support intelligent behavior. But can the function of replay really be so broad, or has replay simply become a scientific bandwagon? One part of the problem is that replay has no universally agreed definition, and given the popularity of the topic, this has led to subsuming a vast variety of phenomena under the same term (but see Genzel et al., 2020, for a consensus statement). Moreover, the search for an inclusive understanding across the ML and neuroscience communities has probably led to further broadening of the concept. Against this backdrop, the concept of replay has occupied a large share of the scientific study of memory, and memory undoubtedly is a fundamental building block of intelligent behavior in minds and machines alike.
Although our review covers the broad range of functions associated with replay, it is meant as an attempt to differentiate the debate about the topic. We believe that the field needs to be careful not to overburden a single concept. In this spirit, we hope to have elucidated how, for instance, interleaved replay of past task experiences differs from replay observed during planning, or from coordinated replay during offline periods. These differences can be both computational and implementational in nature: replay in these scenarios presumably serves different functions, and it is implemented in the brain, and in computers, in different ways.
In addition, we would like to underline that despite replay's continued popularity in ML, many state-of-the-art techniques exist which do not use replay. Efficient learning can be achieved without replay, for instance using Asynchronous Advantage Actor-Critic (A3C; Mnih et al., 2016) or on-policy policy gradient optimization (V-MPO; Song et al., 2019). Moreover, novel transformer models (Vaswani et al., 2017), which dispense with the need for convolutional and recurrent computations, have emerged as a powerful framework for solving complex tasks such as language processing (Dai et al., 2019) or RL (Parisotto et al., 2019). Some transformer models incorporate replay (Wu et al., 2020), but many powerful transformers have been proposed which do not require replay, including for RL problems (Parisotto et al., 2019).
Where replay ends and other forms of memory access start is often unclear as well. Consider, for instance, approaches in which agents rely directly on specific single episodes for behavioral control, such as in the context of episodic RL (for recent reviews, see e.g., Gershman and Daw, 2017; Botvinick et al., 2019). During episodic RL, specific single episodes are stored in memory and retrieved to directly determine behavior when the same or a similar situation is encountered again (Lengyel and Dayan, 2007; Gershman and Daw, 2017; Botvinick et al., 2019). In humans, the retrieval of single experiences in decision-making is associated with the hippocampus (Bornstein and Norman, 2017; Lee et al., 2015; Wimmer and Büchel, 2020), and reinstatement of information from past choice trials at decision time biases present choices towards decisions made previously in the reinstated context (Duncan and Shohamy, 2016; Bornstein and Norman, 2017). To what extent these retrievals of single episodes are supported by sequential replay remains an open question.
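The store-and-retrieve logic of episodic RL described above can be sketched as follows. This is a toy illustration under stated assumptions, not the algorithm of any cited paper: the class `EpisodicMemory`, the use of Euclidean distance as the similarity measure, the nearest-neighbour averaging, and the example states are all hypothetical choices made for brevity.

```python
import math

class EpisodicMemory:
    """Toy episodic controller: stores single episodes and acts by
    looking up the most similar stored situations (assumed design)."""

    def __init__(self, k=1):
        self.k = k            # number of nearest episodes to average over
        self.episodes = []    # entries: (state vector, action, observed return)

    def store(self, state, action, ret):
        self.episodes.append((tuple(state), action, float(ret)))

    def estimate(self, state, action):
        # Value estimate: mean return of the k closest episodes with this action.
        matches = [(math.dist(state, s), r)
                   for s, a, r in self.episodes if a == action]
        if not matches:
            return 0.0        # assumed default for actions never tried
        matches.sort(key=lambda m: m[0])
        nearest = matches[:self.k]
        return sum(r for _, r in nearest) / len(nearest)

    def act(self, state, actions):
        # Choose the action whose retrieved episodes had the highest return.
        return max(actions, key=lambda a: self.estimate(state, a))

memory = EpisodicMemory(k=1)
memory.store((0.0, 0.0), "left", 1.0)
memory.store((0.0, 0.0), "right", 0.0)
memory.store((5.0, 5.0), "left", 0.0)
memory.store((5.0, 5.0), "right", 2.0)

print(memory.act((0.1, -0.1), ["left", "right"]))
print(memory.act((4.9, 5.2), ["left", "right"]))
```

Note that behavior here is determined by a direct lookup at decision time; nothing in the sketch replays a sequence of states, which is why the relation between such retrieval and sequential replay remains open.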
Moreover, even if complex memory computations are needed to solve a task, external memory architectures, such as MERLIN (Wayne et al., 2018), can store past experiences in a memory buffer and learn how to read out only relevant experiences when needed. Memory storage in this model can still be efficient, as the model can learn to store lower-dimensional state representations instead of raw observations, and memory access is targeted to only the currently needed past information. MERLIN has been shown to outperform the LSTM architectures discussed above. But is MERLIN a replay mechanism? In some ways yes, and in others no. A more important question is which predictions the algorithm makes, and whether they might fit neuroscientific observations. This can be nicely illustrated in the case of the MERLIN algorithm: although this model is computationally distinct from the "traditional" replay-based architectures, MERLIN predicts sophisticated reactivation phenomena. In a task in which the agent had to navigate to a goal location, for instance, the agent's memory read-out alternated between the subgoals along the way to the goal. In our opinion, asking whether this prediction holds in the brain, and in which environments such a mechanism could be helpful, rather than labeling it as replay or not, would be the most fruitful way forward. This also illustrates that replay is not a single testable theory, but rather a framework within which memory, planning and imagination-related functions, as well as their relationships, can be understood.
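To make the idea of targeted memory read-out concrete, the following sketch shows a generic content-based read from an external memory. It is not MERLIN's implementation: the functions `cosine` and `read_memory`, the `sharpness` parameter, and the three-slot memory are assumptions introduced for illustration, and in a trained architecture the query and the stored vectors would be learned rather than hand-written.

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def read_memory(query, memory, sharpness=10.0):
    """Soft read: weight each stored vector by its similarity to the query."""
    scores = [sharpness * cosine(query, row) for row in memory]
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]   # numerically stable softmax
    total = sum(exps)
    weights = [e / total for e in exps]
    dim = len(memory[0])
    readout = [sum(w * row[i] for w, row in zip(weights, memory))
               for i in range(dim)]
    return weights, readout

# Hypothetical memory holding three low-dimensional state codes.
memory = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
weights, readout = read_memory([0.9, 0.1, 0.0], memory)
print(weights)   # most of the weight falls on the first stored code
```

A query that changes over the course of an episode would shift the read weights from one stored item to another, which is the kind of alternating read-out that can be compared against reactivation data.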

Conclusion and outlook
In this review, we have summarized the literature on replay in neuroscience and ML to showcase which computational benefits biological and artificial agents can gain from replaying previous experience. We have discussed five main computational benefits that, although overlapping, provide useful categories for thinking about what might motivate an agent to employ replay: faster learning and increased data efficiency, less forgetting, the reorganization of experience, planning, and generalization. In addition, we have argued that replayed content is much richer than a sequence of locations, and could reflect the agent's current state representation. State representations are often task- and context-dependent, being influenced by a range of factors, including the goal-relevant aspects of the agent's observations, the transition structure of states, the location, number and value of goal locations, and the motivational and metabolic state of the animal. We have argued that RL theory provides useful guidance to understand which form state representations might take in a given task, and which implications a particular state representation would have for an agent's behavior. Finally, we have discussed how replay might not only reflect those states but could also help the agent to learn them to begin with. While many questions remain, in particular regarding the latter idea, considering these factors will greatly help to determine what replayed representations represent and how replay updates the decision-making policies that are used to control behavior.

Declaration of competing interest
The authors declare no competing interests.