Abstract
Recent experiments have shown that animals and humans have a remarkable ability to adapt their learning rate according to the volatility of the environment. Yet the neural mechanism responsible for such adaptive learning has remained unclear. To fill this gap, we investigated a biophysically inspired, metaplastic synaptic model within the context of a well-studied decision-making network, in which synapses can change their rate of plasticity in addition to their efficacy according to a reward-based learning rule. We found that our model, which assumes that synaptic plasticity is guided by a novel surprise detection system, captures a wide range of key experimental findings and performs as well as a Bayes optimal model, with remarkably little parameter tuning. Our results further demonstrate the computational power of synaptic plasticity, and provide insights into the circuit-level computation which underlies adaptive decision-making.
https://doi.org/10.7554/eLife.18073.001
eLife digest
Humans and other animals have a remarkable ability to adapt their decision making to changes in their environment. An experiment called the “multi-armed bandit task” shows this process in action. The individual’s role in this task is to choose between multiple targets. One of these has a higher probability of reward than the other three, and individuals soon begin to favor this target over the others. If the identity of the most rewarded target changes, individuals adjust their responses accordingly. Crucially, however, individuals learn more quickly when the identity of the most rewarded target changes frequently. In other words, they learn faster in an uncertain world.
Changes in the strength of connections between neurons – called synapses – are thought to underlie such learning processes. Receiving a reward strengthens synapses in a process referred to as synaptic plasticity. However, the standard model of synaptic plasticity – in which synapses change from weak to strong or vice versa at a constant rate – struggles to explain why individuals learn more quickly under variable conditions.
An alternative model of learning is the cascade model, which incorporates ‘metaplasticity’. This assumes that the rate of synaptic plasticity can also vary; that is, synapses change their strength at different speeds. The cascade model is based on the observation that multiple biochemical signaling cascades contribute to synaptic plasticity, and some of these are faster than others. Kiyohito Iigaya therefore decided to test whether the cascade model could explain data from experiments such as the four-armed bandit task. While the cascade model was indeed more flexible than the standard model of synaptic plasticity, it still could not fully explain the observed results.
Iigaya solved the problem by introducing an external “surprise detection system” into the model. Doing so enabled the model to detect a sudden change in the environment and to rapidly increase the rate of learning, just as individuals do in real life. The surprise detection system allowed synapses to quickly forget what they had learned before, which in turn made it easier for them to engage in new learning. The next step is to identify the circuit behind the surprise detection system: this will require further theoretical and experimental studies.
https://doi.org/10.7554/eLife.18073.002
Introduction
From neurons to behavior, evidence shows that adaptation takes place over a wide range of timescales, with temporal dynamics often captured by a power law, or a collection of multiple exponents, rather than a single exponent (Thorson and Biederman-Thorson, 1974; Ulanovsky et al., 2004; Corrado et al., 2005; Fusi et al., 2007; Kording et al., 2007; Wark et al., 2009; Lundstrom et al., 2010; Rauch et al., 2003; Pozzorini et al., 2013). On the other hand, single-exponent model analysis of behavioral data showed that the time constant of exponents (or learning rate) changed across trials (Behrens et al., 2007; Rushworth and Behrens, 2008; Soltani et al., 2006; Nassar et al., 2010; Nassar et al., 2012; Neiman and Loewenstein, 2013; McGuire et al., 2014). While theoretical and experimental studies strongly suggest that activity-dependent synaptic plasticity plays a crucial role in learning and adaptation in general (Martin et al., 2000; Kandel et al., 2000; Dayan and Abbott, 2001), the neural mechanisms behind flexible learning, especially in the case of decision making under uncertainty, have remained unclear. To address this issue, here we investigate the roles of synaptic plasticity within an established decision-making neural circuit model, and propose a model that can account for empirical data.
Standard learning models which use a single learning rate, $\alpha $, fail to capture multiple timescales of adaptation, including those described by a power law, since these models can only store and update memory on a single timescale of $\tau \sim 1/\alpha $. This includes a well studied switch-like synaptic model of memory (Amit and Fusi, 1994; Fusi and Abbott, 2007) in which synapses make transitions between weak- and strong-efficacy states at a rate $\alpha $. It has been shown that its transition rate $\alpha $ effectively functions as the learning rate of systems with populations of such synapses in a decision making network (Soltani and Wang, 2006; Fusi et al., 2007; Iigaya and Fusi, 2013). Unlike the classical unbounded synapses model, this switch-like model incorporates a biologically relevant assumption of bounded synaptic weights. However, by itself, the plausible assumption of bounded synapses fails to capture key phenomena of adaptive learning, including well-documented multiple timescales of adaptation (Thorson and Biederman-Thorson, 1974; Ulanovsky et al., 2004; Corrado et al., 2005; Fusi et al., 2007; Kording et al., 2007; Wark et al., 2009; Lundstrom et al., 2010; Rauch et al., 2003; Pozzorini et al., 2013).
It is known, however, that there are various chemical cascade processes taking place in synapses that affect synaptic plasticity (Citri and Malenka, 2008; Kotaleski and Blackwell, 2010). Those processes, in general, operate on a wide range of timescales (Zhang et al., 2012; Kramar et al., 2012). To capture this complex, multi-timescale synaptic plasticity in a minimal form, a complex – but still switch-like – synaptic model, the cascade model of synapses, has been proposed (Fusi et al., 2005). In the cascade model, synapses are still bounded in their strengths but assumed to be metaplastic, meaning that, in addition to the usual case of adaptable synaptic strengths, synapses are also permitted to change their rates of plasticity $\alpha $. The resulting model can efficiently capture the widely observed power-law forgetting curve (Wixted and Ebbesen, 1991). However, application has been limited to studies of the general memory storage problem (Fusi et al., 2005; Savin et al., 2014), where synapses passively undergo transitions in response to uncorrelated learning events.
Indeed, recent experiments show that humans and other animals have a remarkable ability to actively adapt themselves to changing environments. For instance, animals can react rapidly to abrupt step-like changes in environments (Behrens et al., 2007; Rushworth and Behrens, 2008; Soltani et al., 2006; Nassar et al., 2010; Nassar et al., 2012; Neiman and Loewenstein, 2013; McGuire et al., 2014), or change their strategies dynamically (Summerfield et al., 2011; Donoso et al., 2014). While the original cascade model (Fusi et al., 2005) is likely to be able to naturally encode multiple timescales of reward information (Corrado et al., 2005; Fusi et al., 2007; Bernacchia et al., 2011; Iigaya et al., 2013; Iigaya, 2013), such active adaptation may also require external guidance, such as in the form of a surprise signal (Hayden et al., 2011; Garvert et al., 2015).
So far, computational studies of such changes in learning rates have largely been limited to optimal Bayesian inference models (e.g. Behrens et al., 2007). While those models can account for normative aspects of animals’ inference and learning, they provide limited insight into how probabilistic inference can be implemented in neural circuits.
To address these issues, in this paper we apply the cascade model of synapses to a well studied decision-making network. Our primary finding is that the cascade model of synapses can indeed capture the remarkable flexibility shown by animals in changing environments, but under the condition that synaptic plasticity is guided by a novel surprise detection system with simple, non-cascade type synapses. In particular, we show that while the cascade model of synapses is able to consolidate reward information in a stable environment, it is severely limited in its ability to adapt to a sudden change in the environment. The addition of a surprise detection system, which is able to detect such abrupt changes, facilitates adaptation by enhancing the synaptic plasticity of the decision-making network. We also show that our model can capture other aspects of learning, such as spontaneous recovery of preference (Mazur, 1996; Gallistel et al., 2001).
Results
The tradeoff in the rate of synaptic plasticity under uncertainty in decision making tasks
In this paper, we analyze our model in stochastically rewarding choice tasks in two slightly different reward schedules. One is a concurrent variable interval (VI) schedule, where rewards are given stochastically according to fixed contingencies. Although the optimal behavior is to repeat a deterministic choice sequence according to the contingencies, animals instead show probabilistic choices described by the matching law (Herrnstein, 1961; Sugrue et al., 2004; Lau and Glimcher, 2005) in which the fraction of choices is proportional to the fraction of rewards obtained from the choice. In fact, the best probabilistic behavior under this schedule is to roll a die with a bias given by the matching law (Sakai and Fukai, 2008; Iigaya and Fusi, 2013). We therefore assume that the goal of subjects in this case is to implement the matching law, which has previously been shown to be produced by the model under study (Soltani and Wang, 2006; Fusi et al., 2007; Wang, 2008; Iigaya and Fusi, 2013). The other schedule is a variable rate (VR) schedule, also known as a multi-armed bandit task, where the probability of obtaining a reward is fixed for each choice. In this case, subjects need to figure out which choice currently has the highest probability of rewards. In both tasks, subjects are required to make adaptive decisions according to the changing values of options in order to collect more rewards.
We study the role of synaptic plasticity in a well-studied decision making network (Soltani and Wang, 2006; Fusi et al., 2007; Wang, 2008; Iigaya and Fusi, 2013) illustrated in Figure 1A. The network has three types of neural populations: (1) an input population, which we assume to be uniformly active throughout each trial; (2) action selection populations, through which choices are made; and (3) an inhibitory population, through which different action selection populations compete. It has been shown that this network shows attractor dynamics with bistability, corresponding to a winner-take-all process acting between action selection populations. We assume that choice corresponds to the winning action selection population, as determined by the synaptic strength projecting from input to action selection populations. It has been shown that the decision probability can be well approximated by a sigmoid of the difference between the strength of two synaptic populations ${E}_{A}$ and ${E}_{B}$ (Soltani and Wang, 2006):
$${P}_{A}=\frac{1}{1+{e}^{-\left({E}_{A}-{E}_{B}\right)/T}}$$
where ${P}_{A}$ is the probability of choosing target $A$, and the temperature $T$ is a free parameter describing the noise in the network.
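As a minimal illustration of this choice rule, the sketch below assumes the standard logistic form of the sigmoid, $P_A = 1/(1+e^{-(E_A-E_B)/T})$. The function name `choice_probability` and the example values are hypothetical and are not taken from the original simulations.

```python
import math


def choice_probability(e_a: float, e_b: float, temperature: float) -> float:
    """Probability of choosing target A from the strength difference E_A - E_B.

    Assumes a logistic sigmoid; `temperature` plays the role of T (network noise).
    """
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    return 1.0 / (1.0 + math.exp(-(e_a - e_b) / temperature))
```

With equal strengths the choice is at chance, and a smaller $T$ makes the same strength difference translate into a more deterministic choice.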
This model can show adaptive probabilistic choice behaviors when assuming simple reward-based Hebbian learning (Soltani and Wang, 2006, 2010; Iigaya and Fusi, 2013). We assume that the synaptic efficacy is bounded, since this has been shown to be an important biologically relevant assumption (Amit and Fusi, 1994; Fusi and Abbott, 2007). As the simplest case, we assume binary synapses, and will call states ‘depressed’ and ‘potentiated’, with associated strengths 0 (weak) and 1 (strong), respectively. We previously showed that the addition of intermediate synaptic efficacy states does not alter the model’s performance (Iigaya and Fusi, 2013). At the end of each trial, synapses are modified stochastically depending on the activity of the pre- and postsynaptic neurons and on the outcome (i.e. whether the subject receives a reward or not). The synapses projecting from the input population to the winning target population are potentiated stochastically with probability ${\alpha}_{r}$ in case of a reward, while they are depressed stochastically with probability ${\alpha}_{nr}$ in case of no reward (for simplicity we assume ${\alpha}_{r}={\alpha}_{nr}=\alpha $, unless otherwise noted). These transition probabilities are closely related to the plasticity of synapses, as a synapse with a larger transition probability is more vulnerable to changes in strength. Thus, we call $\alpha $’s the rate of plasticity. The total synaptic strength projecting to each action selection population encodes the reward probability over the timescale of $1/\alpha $ (Soltani and Wang, 2006; Soltani and Wang, 2010; Iigaya and Fusi, 2013) (For more detailed learning rules, see the Materials and methods section).
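The learning rule above can be sketched for a single population of binary synapses. This is a simplified illustration only: it assumes one target that is chosen on every trial, $\alpha_r = \alpha_{nr} = \alpha$, and hypothetical function names and parameter values. The detailed rule is given in the Materials and methods section.

```python
import random


def update_synapses(states, rewarded, alpha, rng):
    """Apply one trial of reward-based stochastic learning to binary synapses.

    states: list of 0 (depressed) or 1 (potentiated).
    Each synapse moves to the outcome-dependent state with probability alpha.
    """
    target = 1 if rewarded else 0
    return [target if rng.random() < alpha else s for s in states]


def simulate(p_reward, alpha, n_synapses=500, n_trials=2000, seed=0):
    """Return the final fraction of potentiated synapses (the total strength)."""
    rng = random.Random(seed)
    states = [0] * n_synapses
    for _ in range(n_trials):
        rewarded = rng.random() < p_reward
        states = update_synapses(states, rewarded, alpha, rng)
    return sum(states) / n_synapses
```

Under these assumptions the fraction of potentiated synapses fluctuates around the reward probability, with fluctuations that grow with $\alpha$, which is the trade-off discussed next.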
It has also been shown, however, that this model exhibits limited flexibility in the face of abrupt changes of timescales in the environment (Soltani and Wang, 2006; Iigaya and Fusi, 2013). This is due to the tradeoff: a high rate of synaptic plasticity is necessary to react to a sudden change, but at the cost of very noisy estimation (as the synapses inevitably track local noise). This is illustrated in Figure 1B,C, where we simulated our model with a fixed rate of synaptic plasticity in a VI reward schedule in which reward contingencies change abruptly (Sugrue et al., 2004; Corrado et al., 2005). As seen in Figure 1B,C, the choice probability is reliable only if the rate of plasticity is set to be very small ($\alpha =0.002$); however, then the system cannot adjust to a rapid unexpected change in the environment (Figure 1B). On the other hand, highly plastic synapses ($\alpha =0.2$) can react to a rapid change, but at the price of a noisy estimate afterwards (Figure 1C).
Changing plasticity according to the environment: the cascade model of synapses and the surprise detection system
How can animals solve this tradeoff? Experimental studies suggest that they integrate reward history on multiple timescales rather than a single timescale (Corrado et al., 2005; Fusi et al., 2007; Bernacchia et al., 2011). Other studies show that animals can change the integration timescale, or the learning rate, depending on the environment (Behrens et al., 2007; Nassar et al., 2010; Nassar et al., 2012). To incorporate these findings into our model, we use a synaptic model that can change the rate of plasticity $\alpha $ itself, in addition to the strength (weak or strong), depending on the environment. The best known and successful model is the cascade model of synapses, originally proposed to incorporate biochemical cascade processes taking place over a wide range of timescales (Fusi et al., 2005). In the cascade model, illustrated in Figure 2A, the degree of synaptic strength is still assumed to be binary (weak or strong); however, there are $m$ states with different levels of plasticity ${\alpha}_{1}$, ${\alpha}_{2}$, …, ${\alpha}_{m}$, where ${\alpha}_{1}>{\alpha}_{2}>\mathrm{\dots}>{\alpha}_{m}$. The model also allows transitions from one level of plasticity to another with a metaplastic transition probability ${p}_{i}$ ($i=1,2,\mathrm{\dots},m-1$) that is fixed depending on the depth. Following Fusi et al. (2005), we assume ${p}_{1}>{p}_{2}>\mathrm{\dots}>{p}_{m-1}$, meaning that entering less plastic states becomes less likely to occur with increasing depth. All the transitions follow the same reward-based learning rule with corresponding probabilities, where the probabilities are separated logarithmically (e.g. ${\alpha}_{i}={\left(\frac{1}{2}\right)}^{i}$ and ${p}_{i}={\left(\frac{1}{2}\right)}^{i}$), following Fusi et al. (2005) (see Materials and methods section for more details).
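To make the state structure concrete, the following sketch implements one cascade synapse with the example rates $\alpha_i = (1/2)^i$ and $p_i = (1/2)^i$. Two conventions are assumptions borrowed from the general cascade model of Fusi et al. (2005) rather than details confirmed in this section: a switch in efficacy returns the synapse to the most plastic level, and an event that agrees with the current efficacy moves the synapse one level deeper. The names `CascadeSynapse`, `cascade_rates` and `step` are hypothetical.

```python
import random
from dataclasses import dataclass


@dataclass
class CascadeSynapse:
    strong: bool  # efficacy: False = weak (0), True = strong (1)
    depth: int    # 1 = most plastic level, m = least plastic level


def cascade_rates(m):
    """Logarithmically separated example rates, alpha_i = p_i = (1/2)**i."""
    alpha = [0.5 ** i for i in range(1, m + 1)]  # alpha_1 > ... > alpha_m
    p = [0.5 ** i for i in range(1, m)]          # p_1 > ... > p_{m-1}
    return alpha, p


def step(syn, potentiate, alpha, p, rng):
    """Apply one potentiating (True) or depressing (False) event to a synapse."""
    i = syn.depth
    if syn.strong != potentiate:
        # Event opposes the current efficacy: switch with probability alpha_i
        # and restart at the most plastic level (assumed convention).
        if rng.random() < alpha[i - 1]:
            return CascadeSynapse(potentiate, 1)
    elif i < len(alpha):
        # Event agrees with the current efficacy: consolidate one level deeper
        # with metaplastic probability p_i.
        if rng.random() < p[i - 1]:
            return CascadeSynapse(syn.strong, i + 1)
    return syn
```

A population of such synapses that receives consistent outcomes drifts toward deeper levels, which is the consolidation effect described in the following paragraphs.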
We found that the cascade model of synapses can encode reward history on a wide, variable range of timescales. The wide range of transition probabilities in the model allows the system to encode values on multiple timescales, while the metaplastic transitions allow the model to vary the range of timescales. These features allow the model to consolidate the value information in a steady environment, as the synapses can become less plastic (Figure 2B–D). As seen in Figure 2C, the fluctuation of choice probability with the cascade model synapses becomes smaller as the model stays in the stable environment, where we artificially set all synapses to be initially in the most plastic states (top states). Because of the reward-based metaplastic transitions, more and more synapses gradually occupy less plastic states in the stationary environment. Since the strength of synapses in less plastic states is hard to modify, the fluctuations in the synaptic strength become smaller.
We also found, however, that this desirable property of memory consolidation also leads to a problem of resetting memory. In other words, the cascade model fails to respond to a sudden, step-like change in the environment (Figure 2B,D). This is because after staying in a stable environment, many of the synapses are already in deeper, less plastic, states of cascade. In fact, as seen in Figure 2D, the time required to adapt to a new environment increases proportionally to the duration of the previous stable environment. In other words, what is missing in the original cascade model is the ability to reset the memory, or to increase the rate of plasticity in response to an unexpected change in the environment. Indeed, recent human experiments suggest that humans can react to such sudden changes by increasing their learning rates (Nassar et al., 2010).
To overcome this problem, we introduce a novel surprise detection system with plastic synapses that can accumulate reward information and monitor the performance of the decision-making network over multiple (discrete) timescales. The main idea is to compare the reward information of multiple timescales that are stored in plastic (but not metaplastic) synapses in order to detect changes on a trial-by-trial basis. More precisely, the system compares the current difference in reward rates between a pair of timescales to the expected difference; once the former significantly exceeds the latter, a surprise signal is sent to the decision making network to increase the rate of synaptic plasticity in the cascade models.
The mechanism is illustrated in Figure 2E–H. The synapses in this system follow the same reward based learning rules as in the decision making network. The important difference, however, is that unlike the cascade model, the rate of plasticity is fixed, and each group of synapses takes one of the logarithmically segregated rates of plasticity ${\alpha}_{i}$’s (Figure 2E). Also, the learning takes place independent of selected actions in order to monitor the overall performance. While the same computation is performed on various pairs of timescales, for illustrative purposes only the synapses belonging to two timescales are shown in Figure 2G, where they learn the reward rates on two different timescales by two different rates of plasticity (say, ${\alpha}_{i}$ and ${\alpha}_{j}$ and ${\alpha}_{i}\gg {\alpha}_{j}$ ). As can be seen, when the environment and incoming reward rate is stable, the estimate of the more plastic population fluctuates around the estimate of the less plastic population within a certain range. This fluctuation is expected from the past, since the rewards were delivered stochastically, but the probability was well estimated. This expected range of fluctuation is learned by the system by simply integrating the difference between the two estimates with a learning rate ${\alpha}_{j}$, which we call expected uncertainty, inspired by (Yu and Dayan, 2005) (the shaded area in Figure 2G). Similarly, we call the current difference in the two estimates unexpected uncertainty (Yu and Dayan, 2005). Updating unexpected uncertainty involves a prediction error signal, which is the difference between the unexpected uncertainty and the current expected uncertainty.
If the unexpected uncertainty significantly exceeds the expected uncertainty (indicated by yellow in Figure 2G), a surprise signal is sent to the decision making network, resulting in an increase in the plasticity of the cascade model synapses; thus, the synapses increase their transition rates between depressed and potentiated states. We allow this to take place in the states higher (or more plastic) than $j$ ($k\le j$). This selective modification is not crucial in a simple task but may become important in more complex tasks in order to retain information on longer timescales that is still useful, such as task structures or cue identities. As encoding this information is in fact beyond the limit of our simple decision making network, we leave this study for future work. The surprise signal is transmitted as long as the unexpected uncertainty significantly exceeds the expected uncertainty, during which the synapses that received the surprise signal keep enhanced plasticity rates so that they reset the memory (Figure 2H). Ultimately, expected uncertainty catches up with unexpected uncertainty so that synapses can start consolidating the memory again with the original cascade model transition rates.
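The comparison of two timescales can be sketched as follows. This is a deliberately simplified illustration, not the implementation described in the Materials and methods section: the reward-rate estimates are continuous running averages rather than populations of stochastic binary synapses, and "significantly exceeds" is replaced by a fixed additive `threshold`. The function name, the default values, and that criterion are all assumptions.

```python
def detect_surprise(rewards, alpha_fast=0.5, alpha_slow=0.02,
                    threshold=0.2, initial=0.5):
    """Return one surprise flag per trial for a pair of timescales.

    rewards: sequence of 0/1 outcomes, learned independently of the chosen action.
    alpha_fast >> alpha_slow are the two fixed rates of plasticity.
    """
    fast = slow = initial   # reward-rate estimates on the two timescales
    expected = 0.0          # expected uncertainty (learned range of fluctuation)
    flags = []
    for r in rewards:
        fast += alpha_fast * (r - fast)
        slow += alpha_slow * (r - slow)
        unexpected = abs(fast - slow)  # unexpected uncertainty (current difference)
        flags.append(unexpected - expected > threshold)
        # Expected uncertainty integrates the difference at the slower rate,
        # so it eventually catches up and the surprise signal switches off.
        expected += alpha_slow * (unexpected - expected)
    return flags
```

For example, `detect_surprise([1] * 300 + [0] * 500, initial=1.0)` stays silent during the stable block, signals surprise on the first unrewarded trial, and is silent again by the end of the second block.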
Thanks to the surprise detection system, the decision making network with cascade model synapses can now adapt to an unexpected change. As seen in Figure 2C,D,F, it can successfully achieve both consolidation (i.e. accurate estimation of probabilities before the change point) and the quick adaptation to unpredicted changes in the environment. This is because the synapses can gradually consolidate the values by becoming less plastic as long as the environment is stationary, while plasticity can be boosted when there is a surprise signal so that memory can be reset. This can be seen prominently in Figure 2H, where the distribution of synaptic plasticity decreases over time before the change point, but increases afterwards due to the surprise signal.
For more details of the implementation of our model, including how the two systems work as a whole, please see the Materials and methods section and Figure 8 therein.
Our model self-tunes the learning rate and captures key experimental findings
Experimental evidence shows that humans have a remarkable ability to change their learning rates depending on the volatility of their environment (Behrens et al., 2007; Nassar et al., 2010). Here we show that our model can capture this key experimental finding. We note that single learning rates have been usually reported in most of the past analyses of experimental data. This was simply because single timescale models were assumed when fitting data. Our model, however, has no specific timescale, since it has a wide range of timescales in metaplastic states. Thus, merely for the purpose of comparison of our results with previous findings from single timescale models, we define the effective learning rate of our system as the average transition rates ${\alpha}_{i}$’s weighted by the synaptic populations that fill corresponding states. Changes in learning rate were therefore characterized by changes in the distribution in synaptic plasticity states in our model.
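The effective learning rate defined above is a population-weighted average, $\alpha_{\mathrm{eff}} = \sum_i \alpha_i n_i / \sum_i n_i$, where $n_i$ is the number of synapses occupying the level with rate $\alpha_i$. A minimal sketch follows; the symbol $n_i$ and the function name are introduced here for illustration only.

```python
def effective_learning_rate(alphas, occupancy):
    """Average of the transition rates alpha_i weighted by state occupancy n_i."""
    if len(alphas) != len(occupancy):
        raise ValueError("alphas and occupancy must have the same length")
    total = sum(occupancy)
    if total <= 0:
        raise ValueError("occupancy must contain at least one synapse")
    return sum(a * n for a, n in zip(alphas, occupancy)) / total
```

As synapses consolidate into deeper levels the value falls toward $\alpha_m$, and a surprise-driven reset moves it back toward $\alpha_1$.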
In Figure 3A, we simulated our model in a four-armed bandit task, where one target has a higher probability of obtaining reward than the other targets, while the identity of the most rewarding target is switched at the change points indicated by vertical lines. We found that the effective learning rate is on average significantly larger when the environment is rapidly changing (those trials in shorter blocks) than when the environment is more stable (those trials in longer blocks). This is consistent with the experimental finding in (Behrens et al., 2007) that the learning rate was higher in a smaller block (volatile) condition than in a larger block (stable) condition. Also, within each block of trials, we found that the learning rate is largest after the change point, decaying slowly over subsequent trials. This is consistent with both experimental findings and the predictions of optimal Bayesian models (Nassar et al., 2010; Dayan et al., 2000).
It should be noted that our model does not assume any a priori timescale of the environment. Rather, the distribution of the rates of synaptic plasticity is dynamically self-tuned to a given environment. To see how well the tuning is achieved, in Figure 3B, we contrasted the effective learning rate of our model (red line) under a fixed block size condition (the size was varied over the x-axis), to the harvesting efficiency of a single timescale model with different rates of plasticity (varied over the y-axis, which we simply call here the learning rate). The background colour shows the normalized harvesting efficiency of single rate of plasticity models, which is defined by the amount of rewards that the model collected, divided by the maximum amount of rewards that the best model for each block size collected, so that the maximum is always equal to one. The effective learning rate of our full model is again defined by the average potentiation/depression rate weighted by the synaptic population on each state, and the median of the effective learning rate in each block is shown by the red trace. (Note that the effective learning rate constantly changes over trials. The error bars indicate the 25th and 70th percentiles of the effective learning rates.) As can be seen, the cascade model’s effective learning rate is automatically tuned to the learning rate expected from the hand-tuned non-cascade plasticity model. This agreement is remarkable, as we did not assume any specific timescales in our cascade model of plasticity nor any optimisation technique; rather, we assumed a wide range of timescales ($1/{\alpha}_{i}$’s) and that synapses make reward-based plastic and metaplastic transitions by themselves, guided by surprise signals.
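The normalization of harvesting efficiency described above can be written in a few lines. The nested-dictionary layout and the function name are hypothetical choices for illustration; only the normalization itself (divide by the best single-rate model at each block size) comes from the text.

```python
def normalized_efficiency(rewards_by_block):
    """Normalize collected rewards by the best single-rate model per block size.

    rewards_by_block: {block_size: {learning_rate: total_reward_collected}}
    Returns the same layout with values scaled so the maximum per block size is 1.
    """
    normalized = {}
    for block_size, by_rate in rewards_by_block.items():
        best = max(by_rate.values())
        if best <= 0:
            raise ValueError("at least one model must collect a positive reward")
        normalized[block_size] = {rate: r / best for rate, r in by_rate.items()}
    return normalized
```

Because each block size is normalized separately, the learning rate with efficiency one traces the hand-tuned optimum against which the effective learning rate is compared.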
Moreover, we found that our cascade model of metaplastic synapses can significantly outperform the model with fixed learning rates when the environment changes on multiple timescales, which is a very realistic situation but has yet to be explored experimentally. We simulated a four-armed bandit task with two different sizes of blocks with fixed reward contingencies, which is similar to the example in Figure 3A. As seen in Figure 3C, our model of cascade synapses combined with the surprise detection system can collect significantly more rewards than any model with fixed single synaptic plasticity. This is because the synaptic plasticity distribution of the cascade model is self-tuned on a trial-by-trial basis, rather than on average over a long timescale, as shown in Figure 3A. We also found that this is true with a very wide range of threshold values for the surprise detection network, indicating that tuning of the threshold is not required.
In order to further investigate the optimality of our neural model, we compared our model with a previously proposed Bayesian learner model (Behrens et al., 2007). This Bayesian model has been proposed to perform an optimal inference of changing reward probabilities and the volatility of the environment. While human behavioral data has been shown to be consistent with what the optimal model predicted (Behrens et al., 2007), this model itself does not account for how such adaptive learning can be achieved neurally. Since our model is focused on an implementation of adaptive learning, a comparison of our model and the Bayes optimal model can address this issue.
For this purpose, we simulated the Bayesian model (Behrens et al., 2007), and compared the results with our model’s results. Remarkably, as seen in Figure 4, we found that our neural model (red) performed as well as the Bayesian learner model (black). Figure 4A contrasts the fluctuation of choice probability of our model to the Bayesian learner model under a fixed reward contingency. As seen, the reduction of fluctuations over trials in our model is strikingly similar to what the Bayesian model predicts. Figure 4B, on the other hand, shows the adaptation time as a function of the previous block size. Again, our model performed as well as the Bayesian model across conditions, though our model was marginally slower than the Bayesian model when the block was longer. (Whether this small difference in the longer block size actually reflects biological adaptation or not should be tested in future experiments, as there have been limited studies with a block size in this range.)
So far we have focused on changes in learning rate; however, our model has a range of potential applications to other experimental data. For example, here we briefly illustrate how our model can account for a well-documented phenomenon that is often referred to as the spontaneous recovery of preference (Mazur, 1996; Gallistel et al., 2001; Rescorla, 2004; Lloyd and Leslie, 2013). In one example of animal experiments (Mazur, 1996), pigeons performed an alternative choice task on a variable interval schedule. In the first session, two targets had the same probability of rewards. In the following sessions, one of the targets was always associated with a higher reward probability than the other. In these sessions, subjects showed a bias from the first session persistently over multiple sessions, most pertinently in the beginning of each session. Crucially, this bias was modulated by the length of inter-session intervals (ISIs). When birds had long ISIs, the bias effect was smaller and the adaptation was faster. One idea is that subjects ‘forget’ recent reward contingencies during long ISIs.
We simulated our model in this experimental setting, and found that our model can account for this phenomenon (Figure 5). The task consists of four sessions, the first of which had the same probability of rewards for two targets (3000 trials). In the following sessions, one of the targets (target A) was always associated with a higher reward probability than the other (the reward ratio is 9 to 1; 200 trials per session). We simulated our model in a task with short (Figure 5A) and long (Figure 5B) ISIs. We assumed that the cascade model synapses ‘forget’ during the ISI, simulated by random transitions with probabilities that depend on each synaptic state (see Materials and methods and Figure 7).
As seen in Figure 5, the model shows a bias from the first session persistently over multiple sessions (Sessions 2–4), most pertinently in the beginning of each session. Also, learning was slower with shorter ISIs, which is consistent with findings in Mazur (1996). This is because the cascade model makes metaplastic transitions to deeper states (memory consolidation) during stable session 1, and those synapses are less likely to be modified in later sessions, remaining as a bias. However, they could be reset during each ISI due to forgetting transitions (Figure 7), the chance of which is higher with a longer ISI.
We also found that the surprise system played little role in this spontaneous recovery, because forgetting during the ISI allowed many synapses to become plastic, a function essentially similar to what the surprise system does at a block change in block-designed experiments. Crucially, however, not all synapses become plastic during the ISIs, leading to a persistent bias toward the previous preference. Our model in fact predicts such a bias can develop over multiple sessions, and this is supported by experimental data (Iigaya et al., 2013). We plan to present this formally elsewhere. Also, we note that our model echoes the idea that animals carry over memory of contexts of the first session to later sessions (Lloyd and Leslie, 2013).
Discussion
Humans and other animals have a remarkable ability to adapt to a changing environment. The neural circuit mechanism underlying such behavioral adaptation has remained, however, largely unknown. While one might imagine that the circuits underlying such remarkable flexibility must be very complex, the current work suggests that a relatively simple, well-studied decision-making network, when combined with a relatively simple model of synaptic plasticity guided by a surprise detection system, can capture a wide range of existing data.
We should stress that there have been extensive studies of modulation of learning in conditioning tasks in psychology, inspired by two very influential proposals. The first was by Mackintosh (Mackintosh, 1975), in which he proposed that learning should be enhanced if a stimulus predicts rewards. In other words, a reward-irrelevant stimulus should be ignored, while a reward-predictive stimulus should continue to be attended to. This can be interpreted in our model in terms of the formation of stimulus-selective neural populations in the decision making circuit. In other words, such a process would be equated with a shaping of the network architecture itself. This modification is beyond the scope of the current work, and we leave it as future work. The other influential proposal was made by Pearce and Hall (Pearce and Hall, 1980). They proposed that learning rates should be increased when an outcome is unexpected. This indeed is at the heart of the model proposed here, where unexpected uncertainty enhances synaptic plasticity and hence the learning rate. Since the Pearce-Hall model focused on the algorithmic level of computation while our work focuses more on the neural implementation level, our work complements the classical model of Pearce and Hall (Pearce and Hall, 1980). We should, however, stress again that how our surprise detection system can be implemented remains to be determined in the future.
In relation to surprise, the problem of change-point detection has long been studied in relation to the modulation of learning rates in reinforcement learning theory and Bayesian optimal learning theory (Pearce and Hall, 1980; Adams and MacKay, 2007; Dayan et al., 2000; Gallistel et al., 2001; Courville et al., 2006; Yu and Dayan, 2005; Behrens et al., 2007; Summerfield et al., 2011; Pearson and Platt, 2013; Wilson et al., 2013). These models, however, provided limited insight into how the algorithms can be implemented in neural circuits. To fill this gap, we proposed a computation which is partially performed by bounded synapses, and we found that our model performs as well as a Bayesian learner model (Behrens et al., 2007). We should, however, note that we did not specify a network architecture for our surprise detection system. A detailed architecture for this, including connectivity between neuronal populations, requires more experimental evidence. For example, how the difference in reward rates (subtraction) is computed in the network needs to be further explored theoretically and experimentally. One possibility is a network that includes two neural populations (X and Y), each of whose activity is proportional to its synaptic weights. Then one way to perform subtraction between these populations would be to have a readout population that receives an inhibitory projection from one population (X) and an excitatory projection from the other population (Y). The activity of the readout neurons would then reflect the subtraction of signals that are proportional to synaptic weights (Y–X).
Nonetheless, the surprise detection algorithm that we propose was previously hinted at by Aston-Jones and Cohen (Aston-Jones and Cohen, 2005), who suggested that task-relevant values computed in the anterior cingulate cortex (ACC) and the orbitofrontal cortex (OFC) are somehow integrated on multiple timescales and combined at the locus coeruleus (LC), and who proposed that the phasic and tonic release of norepinephrine (NE) controls the exploitation-exploration tradeoff. Here we showed that this computation can be carried out mainly by synaptic plasticity. We also related our computation to the notions of unexpected and expected uncertainties, which have been suggested to be correlated with NE and acetylcholine (Ach) release, respectively (Yu and Dayan, 2005). In fact, there is increasing evidence that the activity of ACC relates to the volatility of the environment (Behrens et al., 2007) or to a surprise signal (Hayden et al., 2011). Also, there is a large amount of experimental evidence that Ach can enhance synaptic plasticity (Gordon et al., 2005; Mitsushima et al., 2013). This could imply that our surprise signal could be expressed as the balance between Ach and NE. On the other hand, in relation to encoding reward history over multiple timescales, it is well known that the phasic activity of dopaminergic neurons reflects a reward prediction error (Schultz et al., 1997), while tonic dopamine levels may reflect reward rates (Niv et al., 2007); these signals could also play crucial roles in our process of reward integration over multiple timescales. We also note that a similar algorithm for surprise detection was recently suggested in a reduced Bayesian framework (Wilson et al., 2013).
In this paper, we assume that the surprise signals are sent when the incoming reward rate decreases unexpectedly, so that the cascade model synapses can increase the rate of plasticity and reset memory. However, there are other cases where surprise signals could be sent to modify the rates of plasticity. For example, when the incoming reward rate is dramatically increased, surprise signals could enhance the metaplastic transitions so that the memory of recent action values is rapidly consolidated. Also, in response to an unexpected punishment rather than reward, surprise signals could be sent to enhance the metaplastic transitions to achieve a one-shot memory (Schafe et al., 2001). Furthermore, the effect of the surprise signal may not be limited to reward-based learning. An unexpected recall of episodic memory could itself also trigger a surprise signal. This may explain some aspects of memory reconsolidation (Schafe et al., 2001).
Our model has some limitations. First, we mainly focused on a relatively simple decision making task, where one of the targets is more rewarding than the other and the reward rates for targets change at the same time. In reality, however, it is also possible that reward rates of different targets change independently. In this case it would be preferable to selectively change learning rates for different targets, which might be solved by incorporating an additional mechanism such as synaptic tagging (Clopath et al., 2008; Barrett et al., 2009). Second, although we assumed that the surprise signal would reset most of the accumulated evidence when reward-harvesting performance deteriorates, in many cases it would be better to keep accumulated evidence, such as to form distinct ‘contexts’ (Gershman et al., 2010; Lloyd and Leslie, 2013). This would allow subjects to access it later. This type of operation may require further neural populations to be added to the decision making circuit that we studied. In fact, it has been shown that introducing neurons that are randomly connected to neurons in the decision making network can solve context-dependent decision-making tasks (Rigotti et al., 2010; Barak et al., 2013). Such randomly connected neurons were reported in the prefrontal cortex (PFC) as ‘mixed-selective’ neurons (Rigotti et al., 2013). It would be interesting to introduce such neuronal populations to our model to study more complex tasks.
Distributing memory among different brain areas may also allow flexible access to memory on different timescales, or a hierarchical structure of contexts, if the rates of synaptic plasticity are similarly distributed amongst different brain areas, with memory information being transferred from one area to another (Squire and Wixted, 2011). Indeed, it has recently been shown that such a partitioning could also be advantageous for general memory performance (Roxin and Fusi, 2013), and this could be incorporated with relative ease into our model. One possibility is that the value signals computed by the cascade model synapses with a different range of timescales in distinct brain areas are combined to make decisions, so that the surprise signal is sent to the appropriate brain areas with the targeted rates of plasticity and contexts.
Some of the key features of our model remain to be tested as predictions. One is that the synapses encoding action values in the decision making network should change the level of plasticity itself. In other words, those synapses that reach the boundary of synaptic strength should become more resilient to change. For example, if rewards are given on every trial from the same target, the strength of the synapses targeting that target would reach the boundary, say after 100 trials. This means that the synaptic strength would remain the same, even after 1000 trials. However, the synapses after 1000 trials should be more resilient to change than the synapses after 100 trials. Equally, the synapses that encode overall reward rates, or the subject’s performance, in a surprise detection system should not make metaplastic transitions. Thus, studying the nature of synaptic plasticity may allow us to dissociate the functions of circuits.
While we found that our model is robust to parameter changes, the effect of extreme parameter values may give insights into psychiatric and personality disorders. For example, if the threshold of the surprise signal, $h$, is extremely low, the model can become inflexible in the face of changes in the environment. On the other hand, if the threshold is extremely high, the model cannot consolidate the values of actions, leading to unstable behavior. As these sorts of maladaptive behaviors are common across different psychiatric and personality disorders, our model could potentially provide insights into the circuit level dynamics underlying aspects of these disorders (Deisseroth, 2014).
Materials and methods
Our model consists of two systems: (1) the decision making network, which makes decisions according to the action values stored in plastic synapses; and (2) the surprise detection system, which computes expected and unexpected uncertainties on multiple timescales and sends a surprise signal to the decision making network when the unexpected uncertainty exceeds the expected uncertainty.
The decision making network with cascade type synapses
The decision making network (Soltani and Wang, 2006; Fusi et al., 2007; Wang, 2008; Iigaya and Fusi, 2013) is illustrated in Figure 1A. In this neural circuit, two groups of excitatory neurons (decision populations), each of which is selective to the action of choosing one of the target stimuli (A or B), receive inputs from sensory neurons on each trial. Each of the excitatory populations is recurrently connected to sustain its activity during each trial. In addition, they inhibit each other through an inhibitory neuronal population.
As a result of the inhibitory interaction, the firing rate of one population of excitatory neurons becomes much larger than that of the other population (winner-take-all process) (Wang, 2002). This is a stable state of this attractor network, and we assume that the subject’s action is determined by the winning population (selecting A or B).
Soltani and Wang (Soltani and Wang, 2006) showed in simulations of such a network with spiking neurons that the decision of the attractor network is stochastic, but the probability of choosing a particular target can be well fitted by a sigmoid function of the difference between the synaptic input currents ${I}_{A}-{I}_{B}$ from the sensory neurons to the action-selective populations A and B:
$${P}_{A}=\frac{1}{1+{e}^{-({I}_{A}-{I}_{B})/T}}$$
where ${P}_{A}$ is the probability of choosing target $A$ and the temperature $T$ is a free parameter determined by the amount of noise in the network.
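As a minimal sketch of this choice rule, the snippet below evaluates the sigmoid for given input currents. The function name `choice_probability` and all numerical values are illustrative assumptions, not values taken from the simulations reported here.

```python
import math


def choice_probability(i_a, i_b, temperature):
    """Probability of choosing target A: a sigmoid of the current difference I_A - I_B."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    return 1.0 / (1.0 + math.exp(-(i_a - i_b) / temperature))


# Equal currents give an unbiased choice.
p_equal = choice_probability(0.5, 0.5, temperature=0.1)
# The same current difference yields a more deterministic choice at lower temperature.
p_warm = choice_probability(0.6, 0.4, temperature=0.5)
p_cold = choice_probability(0.6, 0.4, temperature=0.1)
```

A smaller $T$ corresponds to less noise in the network, so the same difference in currents translates into a stronger preference.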
The afferent currents ${I}_{A}$ and ${I}_{B}$ are proportional to the synaptic weights between the input population of neurons and the two decision populations of neurons. The current to a neuron that belongs to the decision population selecting target A can be expressed as:
$${I}_{A}=\sum _{j=1}^{N}{w}_{j}^{A}{\nu}_{j}$$
where ${\nu}_{j}$ is the firing rate of the $j$th neuron (of the total of $N$ neurons) in the input population and ${w}_{j}^{A}$ is the synaptic weight to the population selective to A. An analogous expression holds for ${I}_{B}$, and we assume that $N$ is the same for both populations. Assuming that the firing rates of the input population are approximately uniform, ${\nu}_{j}=\nu $, we can simplify the expression of the current:
$${I}_{A}=\nu \sum _{j=1}^{N}{w}_{j}^{A}=\nu N{\langle w\rangle}_{A}$$
where ${\langle w\rangle}_{A}$ is the average synaptic weight to the population selective to A. Here we can assume $\nu N=1$ without any loss of generality, as we can rescale $T$ as $T/\nu N\to T$. Also, any overlap of selectivity or any other noise in those two decision making populations can be incorporated into the temperature parameter $T$ in our model.
Following (Fusi et al., 2005), the cascade model of synapses assumes that each synaptic strength is binary – either depressed or potentiated, with the value of 0 or 1, respectively. This follows the important constraint of bounded synapses (Amit and Fusi, 1994; Fusi and Abbott, 2007), and it has been shown that having intermediate strengths between 0 and 1 does not significantly improve the model’s memory performance (Fusi and Abbott, 2007) or decision-making behavior (Iigaya and Fusi, 2013). In addition, the cascade model of synapses (Fusi et al., 2005; Soltani and Wang, 2006; Iigaya and Fusi, 2013) assumes that synapses can take different levels of plasticity. Following (Iigaya and Fusi, 2013), we assume there are $m$ states in this dimension.
Instead of simulating the dynamics of every individual synapse, it is more convenient to keep track of the distribution of synapses over the synaptic state space:
where ${F}_{i}^{A-}$ (${F}_{i}^{A+}$) is the fraction of synapses occupying the depressed (potentiated) state at the $i$’th level of the plasticity state in the population targeting the action of choosing $A$. The same can be written for the synapses targeting the neural population selective to target B. As we assume that the synaptic strength is 0 for the depressed states and 1 for the potentiated states, the total (normalized) synaptic strength can be expressed as
$${\langle w\rangle}_{A}=\sum _{i=1}^{m}{F}_{i}^{A+}$$
Again, an analogous relation holds for the synaptic population between the input neurons and the neurons selective to choosing target B.
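Hence the action of choosing A or B is determined by the decision making network as:
$${P}_{A}=\frac{1}{1+{e}^{-\left(\sum _{i=1}^{m}{F}_{i}^{A+}-\sum _{i=1}^{m}{F}_{i}^{B+}\right)/T}}$$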
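To make the state-space bookkeeping concrete, the following sketch computes the choice probability directly from the distributions of synapses over depressed and potentiated states. The dictionary layout, the function names, the choice of $m=3$ and the value of $T$ are all assumptions made for illustration.

```python
import math


def mean_weight(fractions):
    """Normalized synaptic strength: total fraction of synapses in potentiated states.

    `fractions` holds two lists of length m, "minus" (depressed, strength 0)
    and "plus" (potentiated, strength 1); all entries together sum to 1.
    """
    return sum(fractions["plus"])


def prob_choose_a(f_a, f_b, temperature):
    """Sigmoid of the difference in normalized strengths (with nu * N rescaled to 1)."""
    diff = mean_weight(f_a) - mean_weight(f_b)
    return 1.0 / (1.0 + math.exp(-diff / temperature))


# Hypothetical distributions over m = 3 plasticity levels (index 0 = most plastic).
f_a = {"minus": [0.1, 0.05, 0.05], "plus": [0.2, 0.3, 0.3]}  # 80% potentiated
f_b = {"minus": [0.3, 0.3, 0.2], "plus": [0.1, 0.05, 0.05]}  # 20% potentiated
p_a = prob_choose_a(f_a, f_b, temperature=0.1)
```

Note that only the potentiated fractions enter the decision; how the synapses are spread over plasticity levels affects future learning but not the current choice.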
Thus the decision is biased by the synapses occupying the potentiated states, which reflect the memory of past rewards that is updated according to a learning rule. Here we apply the standard activity-dependent reward-based learning rule (Fusi et al., 2007; Soltani and Wang, 2006; Soltani et al., 2006; Iigaya and Fusi, 2013) to the cascade model. This is schematically shown in Figure 6. When the network receives a reward after choosing target A, the synapses between the input population and the action-selective population targeting the just-rewarded action A (note that these neurons have higher firing rates than the other population) make transitions as follows.
where ${\alpha}_{r}^{i}$ is the transition probability to modify synaptic strength (between depressed, 0, and potentiated, 1) from the $i$’th level to the first level after rewards, and ${p}_{r}^{i}$ is the metaplastic transition probability from the $i$’th (upper) level to the $i+1$’th (lower) level after a reward. In other words, the synapses at depressed states make stochastic transitions to the most plastic potentiated state, while the synapses that were already at potentiated states make stochastic transitions to deeper, or less plastic, states (see Figure 6).
For the synapses targeting the unchosen population, we assume the opposite learning:
where $\gamma <1$ is the factor determining the probability of changing states of synapses targeting an unchosen action at a given trial. In other words, the synapses at potentiated states make stochastic transitions to the most plastic depressed state, while the synapses that were already at depressed states make stochastic transitions to deeper, or less plastic, states (see Figure 6).
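The verbal description of the rewarded-trial transitions can be sketched as a synchronous update of the state fractions. This is a reconstruction from the description above rather than the code used for the reported simulations; the helper names `cascade_step` and `reward_update`, the geometric choice of transition probabilities and the value of $\gamma$ are assumptions.

```python
def cascade_step(toward, away, alpha, p, scale=1.0):
    """One synchronous update that pushes synapses toward one strength.

    toward, away : fractions of synapses at each plasticity level
        (index 0 = most plastic) for the strength being reinforced and
        for the opposite strength.
    alpha[i] : probability of switching strength from level i.
    p[i]     : probability of moving from level i to the deeper level i + 1.
    scale    : common factor on all probabilities (gamma for the unchosen target).
    Returns new (toward, away) lists; the total fraction is conserved.
    """
    m = len(toward)
    switched = [scale * alpha[i] * away[i] for i in range(m)]
    deepened = [scale * p[i] * toward[i] for i in range(m - 1)] + [0.0]
    new_away = [away[i] - switched[i] for i in range(m)]
    new_toward = [toward[i] - deepened[i] for i in range(m)]
    for i in range(1, m):
        new_toward[i] += deepened[i - 1]
    new_toward[0] += sum(switched)  # switched synapses enter the most plastic level
    return new_toward, new_away


def reward_update(chosen, unchosen, alpha, p, gamma):
    """Rewarded trial: potentiate the chosen population, depress the unchosen one."""
    plus, minus = cascade_step(chosen["plus"], chosen["minus"], alpha, p)
    u_minus, u_plus = cascade_step(
        unchosen["minus"], unchosen["plus"], alpha, p, scale=gamma
    )
    return {"plus": plus, "minus": minus}, {"plus": u_plus, "minus": u_minus}


m = 3
alpha = [0.2 ** (i + 1) for i in range(m)]  # assumed values
p = [0.2 ** (i + 1) for i in range(m)]      # assumed values
uniform = {"plus": [1 / 6] * m, "minus": [1 / 6] * m}
chosen, unchosen = reward_update(uniform, uniform, alpha, p, gamma=0.5)
```

Under these assumptions, a single reward raises the potentiated fraction of the chosen population and lowers that of the unchosen population, while some already-potentiated synapses move to deeper, less plastic levels.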
Similarly, when the network received no reward after choosing target A, synapses change their states as:
and
where ${\alpha}_{nr}^{i}$ is the transition probability from the $i$’th state to the first state in case of no reward, and ${p}_{nr}^{i}$ is the metaplastic transition probability from the $i$’th (upper) level to the $i+1$’th (lower) level after no reward. Unless otherwise noted, in this paper we set ${\alpha}_{r}^{i}={\alpha}_{nr}^{i}(={\alpha}_{i})$ and ${p}_{r}^{i}={p}_{nr}^{i}(={p}_{i})$.
In Figure 5, we also simulated the effect of the inter-session interval (ISI). To do this, we simply assumed that random noisy events drive forgetting during the ISIs. This was simulated by letting synapses undergo what we define as forgetting transitions (Figure 7):
and
In Figure 5, we assume the unit of ISI, ${T}_{\text{ISI}}$, is 100 repetitions of these transitions. We found that our qualitative finding is robust against the setting of the threshold value $h$. We did not allow metaplastic (downward) transitions during forgetting, since we focused on the forgetting aspect of the ISI, which was sufficient to account for the data (Mazur, 1996).
The surprise detection system
Here we describe our surprise detection system. We do not intend to specify the detailed circuit architecture of the surprise detection system. Rather, we propose a simple computational algorithm that can be partially implemented by well-studied bounded synaptic plasticity. As detailed circuits of a surprise detection system have yet to be shown either theoretically or experimentally, we leave the problem of specifying the architecture of the system to future studies.
In summary, this system (1) computes reward rates on different timescales; (2) computes expected differences between the reward rates of different timescales (we call this expected uncertainty); (3) compares the expected uncertainty with the current actual difference between reward rates (we call this unexpected uncertainty); and (4) sends a surprise signal to the decision making network if the unexpected uncertainty exceeds the expected uncertainty. As a result, the system receives an input of a reward or no reward on every trial, and sends an output of surprise or no surprise to the decision making network.
It has been shown that a population of binary synapses can encode the rate of rewards on a timescale of $\tau =1/\alpha $, where $\alpha $ is the rate of synaptic plasticity (Rosenthal et al., 2001; Iigaya and Fusi, 2013). Here we use this property to monitor reward rates on multiple timescales, by introducing populations of synapses with different rates of plasticity. Since the goal of this system is to monitor incoming reward rates on the timescales on which the cascade model synapses in the decision making network operate, we assume a total of $m$ populations of synapses, where $m$ is the same as the number of metaplastic states of the cascade model synapses. Accordingly, synapses in population $i$ have the plasticity rate of ${\alpha}_{r}^{i}$, which is the same rate as the cascade model’s transition rate at the $i$’th level. Crucially, we assume these synapses are not metaplastic. They simply undergo reward-dependent stochastic learning; but importantly, this time they do so independent of the chosen action so that the system can keep track of overall performance.
It is again convenient to keep track of the distribution of synapses in the state space. We write the fraction of synapses at the depressed state as ${G}_{i}^{-}$, and the fraction of synapses at the potentiated state as ${G}_{i}^{+}$:
Assuming that the synaptic strength is either 0 (depressed) or 1 (potentiated), the total synaptic strength ${Z}_{i}$ of population $i$ is simply
$${Z}_{i}=n{G}_{i}^{+}$$
where $n$ is the total number of synapses. For simplicity, we assume each population has the same number of synapses. While ${Z}_{i}$ is the value that should be read out by a readout, without loss of generality, we keep track of the normalized weight ${v}_{i}={Z}_{i}/n={G}_{i}^{+}$ as the synaptic strength.
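The distribution changes according to a simple reward-based plasticity rule (Iigaya and Fusi, 2013). When the network receives a reward,
$${G}_{i}^{+}\to {G}_{i}^{+}+{\alpha}_{r}^{i}{G}_{i}^{-},\qquad {G}_{i}^{-}\to {G}_{i}^{-}-{\alpha}_{r}^{i}{G}_{i}^{-}$$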
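which means that the synapses at the depressed state make transitions to the potentiated state with a probability of ${\alpha}_{r}^{i}$. When the network receives no reward, on the other hand,
$${G}_{i}^{-}\to {G}_{i}^{-}+{\alpha}_{nr}^{i}{G}_{i}^{+},\qquad {G}_{i}^{+}\to {G}_{i}^{+}-{\alpha}_{nr}^{i}{G}_{i}^{+}$$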
which means that the synapses at the potentiated state make transitions to the depressed state with a probability of ${\alpha}_{nr}^{i}$. The transition rate ${\alpha}_{nr}^{i}$ is designed to match the transition rate of the cascade model in case of no reward. (In this paper we set ${\alpha}_{r}^{i}={\alpha}_{nr}^{i}(={\alpha}^{i})$, as is also the case in the cascade model synapses in the decision making network.) These transitions take place independent of the taken action, and the synaptic strength ${v}_{i}={Z}_{i}/n={G}_{i}^{+}$ is a low-pass-filtered (by bounded synapses) estimate of reward rates on a timescale ${\tau}_{i}=1/{\alpha}^{i}$.
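The reward-rate tracking described above can be sketched in a few lines. The function name `update_reward_rates` and the three plasticity rates are assumptions chosen for illustration; the update itself follows the stated rule that depressed synapses potentiate after a reward and potentiated synapses depress after no reward.

```python
def update_reward_rates(v, alphas, rewarded):
    """Update the potentiated fraction v[i] of each non-metaplastic population.

    Reward: the depressed fraction (1 - v[i]) potentiates with probability alphas[i].
    No reward: the potentiated fraction v[i] depresses with probability alphas[i].
    """
    if rewarded:
        return [vi + a * (1.0 - vi) for vi, a in zip(v, alphas)]
    return [vi - a * vi for vi, a in zip(v, alphas)]


alphas = [0.2, 0.04, 0.008]  # assumed rates; timescales of 5, 25 and 125 trials
v = [0.0, 0.0, 0.0]
for _ in range(5):  # five consecutive rewarded trials
    v = update_reward_rates(v, alphas, rewarded=True)
```

After a short run of rewards, the most plastic population has moved most of the way toward 1, while the slowest population has barely changed, which is what allows the populations to report reward rates on different timescales.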
On each trial, the system also computes the expected uncertainty ${u}_{i,j}$ of reward rates between different timescales of synaptic populations. Note that for this we focus on the computational algorithm, and we do not specify the architecture of neural circuits responsible for this computation. As detailed circuits of a surprise detection system have yet to be shown either theoretically or experimentally, we leave the problem of specifying the architecture of the system to future studies. The system learns the absolute value of the difference between the approximated reward rates ${v}_{i}$ and ${v}_{j}$ at a rate of $\min ({\alpha}^{j},{\alpha}^{i})$:
$${u}_{i,j}\to {u}_{i,j}+\min ({\alpha}^{i},{\alpha}^{j})\left(|{v}_{i}-{v}_{j}|-{u}_{i,j}\right)$$
where we assume that the learning rate is the smaller of the rates of plasticity in the two populations. We call ${u}_{i,j}$ the expected uncertainty between $i$ and $j$ (Yu and Dayan, 2005), representing how different the reward rates of different timescales are expected to be. We also call the actual current difference ${v}_{i}-{v}_{j}$ the unexpected uncertainty between $i$ and $j$. Hence the expected uncertainty is the low-pass-filtered unexpected uncertainty, both of which dynamically change over trials.
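A minimal sketch of this running estimate is given below. The function name, the nested-list storage of ${u}_{i,j}$ (only entries with $i>j$ are used) and the numerical values are assumptions for illustration.

```python
def update_expected_uncertainty(u, v, alphas):
    """Move each u[i][j] (i > j) toward |v[i] - v[j]| at rate min(alphas[i], alphas[j])."""
    m = len(v)
    new_u = [row[:] for row in u]  # copy so the caller's table is not modified
    for i in range(m):
        for j in range(i):
            rate = min(alphas[i], alphas[j])
            new_u[i][j] += rate * (abs(v[i] - v[j]) - new_u[i][j])
    return new_u


alphas = [0.2, 0.04, 0.008]  # assumed plasticity rates
v = [0.8, 0.5, 0.4]          # hypothetical reward-rate estimates
u0 = [[0.0] * 3 for _ in range(3)]
u1 = update_expected_uncertainty(u0, v, alphas)
```

Because the rate is set by the slower of the two populations, each ${u}_{i,j}$ changes on the longer of the two timescales being compared.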
On each trial, the system also compares the expected uncertainty ${u}_{i,j}$ and unexpected uncertainty ${v}_{i}-{v}_{j}$ for each pair of $i$ and $j$. If the latter significantly exceeds the former, ${v}_{i}-{v}_{j}\gg {u}_{i,j}$, then the system sends an output of a surprise signal to the decision making network. For simplicity, we set the threshold $h$ as $\mathrm{erf}\left(\frac{{v}_{i}-{v}_{j}}{\sqrt{2}{u}_{i,j}}\right)=h$ when $i>j$, where $\mathrm{erf}(\cdot )$ is the error function. Note that the error function is sign sensitive. Thus when ${v}_{i}>{v}_{j}$, or when the reward rate is increasing locally in time, a surprise signal is not sent if the threshold is set to be $h<0.5$. This threshold $h$ is a free parameter, but we confirmed that the system is robust over a wide range of $h$.
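One possible reading of this comparison is sketched below. It is an interpretation, not the original implementation: we assume that the expected uncertainty plays the role of a standard deviation, that the signed error function is converted into a Gaussian cumulative probability, and that a surprise is signalled when this probability falls below $h$. The function name, argument names and the handling of a zero expected uncertainty are likewise assumptions.

```python
import math


def surprise_detected(v_fast, v_slow, u, h):
    """Return True if the fast reward-rate estimate has dropped surprisingly far.

    Assumed reading: u acts like a standard deviation, and the signed error
    function gives the Gaussian probability of a difference at least this negative.
    """
    if u <= 0:
        return False  # assumption: no surprise before any uncertainty is learned
    z = (v_fast - v_slow) / (math.sqrt(2.0) * u)
    tail = 0.5 * (1.0 + math.erf(z))
    return tail < h


big_drop = surprise_detected(v_fast=0.2, v_slow=0.6, u=0.1, h=0.05)
small_drop = surprise_detected(v_fast=0.55, v_slow=0.6, u=0.1, h=0.05)
rising = surprise_detected(v_fast=0.8, v_slow=0.6, u=0.1, h=0.05)
```

Under this reading, an increasing reward rate gives a probability above 0.5 and therefore never triggers a surprise for $h<0.5$, which agrees with the sign sensitivity noted above.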
If a surprise signal is sent because of the discrepancy between two timescales $i$ and $j$, ${v}_{i}-{v}_{j}\gg {u}_{i,j}$, the decision making network (cascade synapses) increases the rates of plasticity. Importantly, this is done only for the levels of synapses at which the surprise is detected (the lower levels do not change their rates of plasticity). This allows the decision-making network to keep information on different timescales as long as it is useful. For example, when a surprise was detected between the $i$’th and $j$’th levels, we set the cascade model transition rates
for $k\le j$ of the cascade model synapses. This allows the decision making network to reset the memory and adapt to a new environment. Note that this change of the rate of plasticity applies only to the cascade model synapses. The synapses in the surprise detection system do not change their rate of plasticity.
Figure 8 illustrates how the decision making network and the surprise detection system work together. We simulated our model in a two-choice VI schedule task with a total baiting probability of $0.4$. The reward contingency was reversed every 100 trials. The mean synaptic strength of each population ${v}_{i}$ is shown in Figure 8D, while each pair was compared separately in Figure 8E–G. Surprises were detected mostly between ${v}_{2}$ and ${v}_{3}$, or between ${v}_{1}$ and ${v}_{3}$ (Figure 8I), but not between ${v}_{1}$ and ${v}_{2}$. This makes sense because the timescale of block change was 100 trials, which is similar to the timescale of ${v}_{3}$: $1/{\alpha}_{3}=25$ trials. Thus the timescale of ${v}_{2}$ was too short to detect this change: $1/{\alpha}_{3}=25$ trials. Thanks to the surprise signals, the cascade model synapses were able to adapt to the sudden changes in contingency (Figure 8B,C). As a result, the choice probability also adapts to the environment (Figure 8A).
Bayesian model (Behrens et al., 2007)
We also compared our model with a previously proposed Bayesian inference model (Behrens et al., 2007). Details of the model can be found in Behrens et al. (2007); thus, here we briefly summarize the formalism. In this model, the probability ${R}_{i}^{A}$ of obtaining a reward from target A at time $t=i$ is assumed to change according to the volatility ${v}_{i}^{A}$.
where ${R}_{i}^{A}=1/\left(1+{e}^{{r}_{i}^{A}}\right)$, ${V}_{i}^{A}={e}^{{v}_{i}^{A}}$, and $N(,)$ is a Gaussian. Variables are transformed for a computational convenience. The volatility also changes according to the equation:
where ${K}^{A}={e}^{{k}^{A}}$ determines the rate of change in volatility. Using the Bayes rule, the posterior probability of the joint distribution given data ${y}^{A}$ can be written as
Following (Behrens et al., 2007), we performed a numerical integration over grids without assuming an explicit functional form of the joint distribution, where at $t=0$ we assumed a uniform distribution. Inference was performed for each target independently. For simplicity, we assumed that the model’s policy follows the matching law on a concurrent VI schedule, as it has been shown to be the optimal probabilistic decision policy (Sakai and Fukai, 2008; Iigaya and Fusi, 2013).
All the analyses and simulations in this paper were conducted in MATLAB (MathWorks Inc.) and Mathematica (Wolfram Research).
References

Learning in neural networks with material synapses. Neural Computation 6:957–982. https://doi.org/10.1162/neco.1994.6.5.957

An integrative theory of locus coeruleus-norepinephrine function: adaptive gain and optimal performance. Annual Review of Neuroscience 28:403–450. https://doi.org/10.1146/annurev.neuro.28.061604.135709

The sparseness of mixed selectivity neurons controls the generalization-discrimination trade-off. Journal of Neuroscience 33:3844–3856. https://doi.org/10.1523/JNEUROSCI.275312.2013

State based model of long-term potentiation and synaptic tagging and capture. PLoS Computational Biology 5:e1000259. https://doi.org/10.1371/journal.pcbi.1000259

Learning the value of information in an uncertain world. Nature Neuroscience 10:1214–1221. https://doi.org/10.1038/nn1954

A reservoir of time constants for memory traces in cortical neurons. Nature Neuroscience 14:366–372. https://doi.org/10.1038/nn.2752

Synaptic plasticity: multiple forms, functions, and mechanisms. Neuropsychopharmacology 33:18–41. https://doi.org/10.1038/sj.npp.1301559

Tag-trigger-consolidation: a model of early and late long-term-potentiation and depression. PLoS Computational Biology 4:e1000248. https://doi.org/10.1371/journal.pcbi.1000248

Linear-nonlinear-Poisson models of primate choice dynamics. Journal of the Experimental Analysis of Behavior 84:581–617. https://doi.org/10.1901/jeab.2005.2305

Bayesian theories of conditioning in a changing world. Trends in Cognitive Sciences 10:294–300. https://doi.org/10.1016/j.tics.2006.05.004

Book: Theoretical Neuroscience: Computational and Mathematical Modeling of Neural Systems. Cambridge, Mass: Massachusetts Institute of Technology Press.

Limits on the memory storage capacity of bounded synapses. Nature Neuroscience 10:485–493. https://doi.org/10.1038/nn1859

The rat approximates an ideal detector of changes in rates of reward: Implications for the law of effect. Journal of Experimental Psychology 27:354–372. https://doi.org/10.1037/00977403.27.4.354

Context, learning, and extinction. Psychological Review 117:197–209. https://doi.org/10.1037/a0017808

Norepinephrine triggers release of glial ATP to increase postsynaptic efficacy. Nature Neuroscience 8:1078–1086. https://doi.org/10.1038/nn1498

Relative and absolute strength of response as a function of frequency of reinforcement. Journal of the Experimental Analysis of Behavior 4:267–272. https://doi.org/10.1901/jeab.1961.4267

Neural network models of decision making with learning on multiple timescales. Ph.D. Thesis, Columbia University (New York, NY, USA).

Dynamical regimes in neural network models of matching behavior. Neural Computation 25:1–20. https://doi.org/10.1162/NECO_a_00522

Deviations from the matching law reflect reward integration over multiple timescales. Cosyne Abstract.

The dynamics of memory as a consequence of optimal adaptation to a changing body. Nature Neuroscience 10:779–786. https://doi.org/10.1038/nn1901

Modelling the molecular mechanisms of synaptic plasticity using systems biology approaches. Nature Reviews Neuroscience 11:239–251. https://doi.org/10.1038/nrn2807

Dynamic response-by-response models of matching behavior in rhesus monkeys. Journal of the Experimental Analysis of Behavior 84:555–579. https://doi.org/10.1901/jeab.2005.11004

Context-dependent decision-making: a simple Bayesian model. Journal of the Royal Society Interface 10:20130069. https://doi.org/10.1098/rsif.2013.0069

A theory of attention: Variations in the associability of stimuli with reinforcement. Psychological Review 82:276–298. https://doi.org/10.1037/h0076778

Synaptic plasticity and memory: an evaluation of the hypothesis. Annual Review of Neuroscience 23:649–711. https://doi.org/10.1146/annurev.neuro.23.1.649

Past experience, recency, and spontaneous recovery in choice behavior. Animal Learning & Behavior 24:1–10. https://doi.org/10.3758/BF03198948

A cholinergic trigger drives learning-induced plasticity at hippocampal synapses. Nature Communications 4:2760. https://doi.org/10.1038/ncomms3760

Rational regulation of learning dynamics by pupil-linked arousal systems. Nature Neuroscience 15:1040–1046. https://doi.org/10.1038/nn.3130

Tonic dopamine: opportunity costs and the control of response vigor. Psychopharmacology 191:507–520. https://doi.org/10.1007/s0021300605024

Change detection, multiple controllers, and dynamic environments: insights from the brain. Journal of the Experimental Analysis of Behavior 99:74–84. https://doi.org/10.1002/jeab.5

Temporal whitening by power-law adaptation in neocortical neurons. Nature Neuroscience 16:942–948. https://doi.org/10.1038/nn.3431

Neocortical pyramidal cells respond as integrate-and-fire neurons to in vivo-like input currents. Journal of Neurophysiology 90:1598–1612. https://doi.org/10.1152/jn.00293.2003

Internal representation of task rules by recurrent dynamics: the importance of the diversity of neural responses. Frontiers in Computational Neuroscience 4:24. https://doi.org/10.3389/fncom.2010.00024

Efficient partitioning of memory systems and its importance for memory consolidation. PLoS Computational Biology 9:e1003146. https://doi.org/10.1371/journal.pcbi.1003146

Choice, uncertainty and value in prefrontal and cingulate cortex. Nature Neuroscience 11:389–397. https://doi.org/10.1038/nn2066

Optimal recall from bounded metaplastic synapses: predicting functional adaptations in hippocampal area CA3. PLoS Computational Biology 10:e1003489. https://doi.org/10.1371/journal.pcbi.1003489

Memory consolidation of Pavlovian fear conditioning: a cellular and molecular perspective. Trends in Neurosciences 24:540–546. https://doi.org/10.1016/S01662236(00)01969X

Neural mechanism for stochastic behaviour during a competitive game. Neural Networks 19:1075–1090. https://doi.org/10.1016/j.neunet.2006.05.044

A biophysically based neural model of matching law behavior: melioration by stochastic synapses. Journal of Neuroscience 26:3731–3744. https://doi.org/10.1523/JNEUROSCI.515905.2006

Synaptic computation underlying probabilistic inference. Nature Neuroscience 13:112–119. https://doi.org/10.1038/nn.2450

The cognitive neuroscience of human memory since H.M. Annual Review of Neuroscience 34:259–288. https://doi.org/10.1146/annurevneuro061010113720

Multiple time scales of adaptation in auditory cortex neurons. Journal of Neuroscience 24:10440–10453. https://doi.org/10.1523/JNEUROSCI.190504.2004

A mixture of delta-rules approximation to Bayesian inference in change-point problems. PLoS Computational Biology 9:e1003150. https://doi.org/10.1371/journal.pcbi.1003150

On the form of forgetting. Psychological Science 2:409–415. https://doi.org/10.1111/j.14679280.1991.tb00175.x

Computational design of enhanced learning protocols. Nature Neuroscience 15:294–297. https://doi.org/10.1038/nn.2990
Decision letter

Naoshige Uchida, Reviewing Editor; Harvard University, United States
In the interests of transparency, eLife includes the editorial decision letter and accompanying author responses. A lightly edited version of the letter sent to the authors after peer review is shown, indicating the most substantive concerns; minor comments are not usually included.
Thank you for resubmitting your work entitled "Adaptive learning and decision-making under uncertainty by metaplastic synapses guided by a surprise detection system" for further consideration at eLife. Your revised article has been favorably evaluated by Eve Marder as the Senior Editor, a Reviewing Editor, and three reviewers.
The author has performed additional simulations and revised the manuscript extensively. All the referees agreed that the manuscript has greatly improved. However, there are some remaining issues to which we would like to see your response.
1) The reviewers pointed out that it is unclear whether the author's model is biologically plausible as proposed. During discussion, however, the reviewers noted that "biophysiological plausibility" is often difficult to define or relative, and that abstract models are often useful. Nevertheless, because the author now emphasizes biological plausibility in order to contrast with existing models (e.g. Bayesian models; Mackintosh; Pearce-Hall), the reviewers thought a little more clarification or toning down of this point would be required.
We do appreciate that the proposed model is an important step toward a mechanistic investigation of the interesting question; yet, it appears very difficult to implement some of the key components of the model. Specifically, one important proposal is the "surprise detection system", which takes the difference between the current and expected uncertainty, with uncertainty defined as the range of fluctuation (Figure 2G). To compute this, the author proposes to calculate the difference in synaptic weights of two groups. This is a very interesting idea, yet it is unclear how a neural circuit computes the difference in synaptic weights. One reviewer thought that precisely computing the difference of synaptic weights is beyond the ability of neural circuits (or "out of biological constraints"). We would like you to address this point either by showing how such a computation can be performed or approximated while obeying biological constraints, or by simply further de-emphasizing the claim for implementation on specific parts. We note that you already state explicitly that the network architecture of the surprise detection system is not specified in the present study, and that the effort toward biophysical implementation is an important aspect of the present study overall.
2) Please make sure that you do not say that the model "implements" the Bayes-optimal solution.
3) One reviewer suggested two additional considerations (Reviewer 1's point #2 and #3). Although we do not see these as essential for revision, they might improve the manuscript. So we would like to see your response.
4) During discussion, all the reviewers agreed that we should not raise the concern of biological plausibility of the cascade model.
Below please find the reviewers' original comments, which contains additional comments for your reference.
Reviewer #1:
The author has mostly addressed my comments. Some lingering issues:
1) I don't think it's correct to say that the model implements the Bayes-optimal solution. There's nothing showing that this is true mathematically. What was shown is that it achieves comparable performance. The discussion should be modified to reflect this.
2) The model accounts for the findings of Mazur's second experiment; can it account for the findings of Mazur's first experiment, namely that spontaneous recovery is towards roughly the average of recent sessions? I think it can, which would be a compelling demonstration.
3) While it is nice to see a further application of the model, this seems like a rather random choice of application. Since the author is emphasizing the neural implementation perspective, what one would really like to see is a simulation of specific neural phenomena. Note that the (small number of) phenomena modeled here are all behavioral results. Are there really no neural data bearing on the neural predictions of the model?
Reviewer #2:
The manuscript has been significantly improved and also contains new simulation data. I appreciate all these efforts made for improving the clarity of the manuscript. This work shows an interesting idea in computation and will be highly appreciated by computational journals. However, I still doubt whether the model is biologically plausible enough for publication in eLife.
The author claims that the model is biologically plausible as it is based on a previously published work of the "cascade synapse model". In fact, I doubt the biological plausibility of the cascade model itself even though the cascade model is unique and provides interesting computational functions. The cascade model assumes binary states to avoid unbounded growth of synaptic strength. However, results from various cortical areas have revealed long-tailed or skewed distributions for the strength of cortical synapses (e.g., Song et al., PLoS Biol 2005; Buzsaki and Mizuseki, Nat Rev Neurosci 2014). These results do not seem to be consistent with binary synapses having only a depressed and a potentiated state. Though the long-tailed distributions contain very strong synapses, these synapses only constitute a small fraction of the several thousand synapses a cortical neuron receives, meaning that the fraction of synapses in the potentiated states should be much smaller than that of synapses in the depressed states. However, it is unclear whether the cascade model, or multi-timescale plasticity, also works under such a constraint.
Another concern is that there are plenty of different ways to implement a surprise detection system. For example, the detection system may be realized within the framework of reinforcement learning as a system that simply monitors the expected amount of instantaneous reward. Though the author claimed that the previous models of surprise detection did not provide much insight into biological implementation (e.g., in the Discussion), neither does the present model. This is my honest impression. I feel that the surprise detection system was proposed in this study just to save the specific cascade model.
Reviewer #3:
In the revised paper, several things have been improved.
First, the model by Iigaya is now compared to the Bayesian model by Behrens et al. (2007), and it is shown that the model essentially yields similar results. Second, the model is applied to another type of behavior, and the model can successfully account for this behavior, as well. Third, the method section has been improved and more details to the underpinning of the model have been provided.
In my original review, I had specifically addressed the lack of a clear biophysical implementation of the model. With respect to these points, the author has now more clearly specified the network model, the location of the synapses, and the way they are being modeled. In these respects, I find that the paper has been improved. However, the surprise detection system is still modeled on a purely phenomenological level. This would in principle be fine, except that the author really emphasizes how this model is about a circuit implementation (Marr's third level) of the observed behaviors, and I don’t find that this is really the case.
In fact, my main problem is not even that the surprise detection system is not explicitly modeled as a circuit/ network. Rather, it is that some of the key computations required – taking differences of synaptic strength – seem to rule out any halfway realistic circuit computation. How would information about synaptic strength be propagated to reach a location where the subtraction can then be carried out? Apart from wildly speculative ideas, this is not clear to me. The author addressed this by saying that it is left for future work, but the problem is that it looks like this type of computation cannot be implemented biophysically. There may be other ways of performing the relevant computations, but the current set of computations really seem to rule out that this could work biophysically.
[Editors’ note: a previous version of this study was rejected after peer review, but the author submitted for reconsideration. The first decision letter after peer review is shown below.]
Thank you for submitting your work entitled "Adaptive learning and decision-making under uncertainty by metaplastic synapses guided by a surprise detection system" for consideration by eLife. Your article has been reviewed by three peer reviewers, and the evaluation has been overseen by Naoshige Uchida as Reviewing Editor and Eve Marder as the Senior Editor. Our decision has been reached after consultation between the reviewers. Based on these discussions and the individual reviews below, we regret to inform you that your work will not be considered further for publication in eLife.
All the reviewers thought that this work addresses an important question of how the brain adjusts its learning rates in the face of changing volatility of the environment. The author introduces a surprise detection system to a "cascade model" that was previously proposed by Fusi and colleagues. The manuscript is clearly written although it would benefit from better explanations of modeling (see below). Overall, all the reviewers thought that the idea and the results are promising. On the other hand, the reviewers raised a number of concerns that would require substantial revisions. Addressing these concerns would require a substantial amount of simulations and rewriting. It is eLife's policy to not invite revisions that require substantial new scientific work. For that reason, we are forced to reject the manuscript in its current form.
The detailed comments from each referee are attached below. After discussion, the referees thought that the following four points are especially important. First, previous work (e.g. Behrens et al. 2007) has addressed a similar question and presented computational models. The author should compare different models and make the novelty of the current model more explicit. Second, this study only addresses one empirical finding and it is unclear whether this model can explain other phenomena. Applying the current model to other data that demonstrated changes in learning rates would be illuminating. Third, it is argued that the current model is biophysically inspired but some reviewers thought that the model is still very phenomenological, although this argument could be strengthened by further simulations. Fourth, the methods section requires more work to fully explain the model, and the simulation code should be made available.
Reviewer #1:
This paper presents a new computational model of metaplasticity, building on ideas from the cascade model, which allows synapses to rapidly adapt to changing volatility. This is an important question for biological decision-making systems. The article is clearly written and the theory is elegantly simple. However, I have several fundamental concerns that prevent me from recommending this paper for publication.
1) The model only explains a single empirical finding (adaptation of learning rate to reward volatility). This finding is already explained by a number of other models (for example, see Behrens et al. 2007). So it's not clear to me what this new model is adding.
2) While the model is discussed in terms of synapses, no specific biological evidence is presented that directly supports the assumptions of the model.
3) There's a huge literature on the effects of various experimental manipulations on learning rate. Much of this research was inspired by the seminal models of Mackintosh (1975) and Pearce & Hall (1980). Addressing at least some of this literature is important for demonstrating the explanatory power of the model.
Reviewer #2:
In this work, Iigaya investigates how organisms can adjust their learning rates to the time scales of a randomly varying and somewhat unpredictable environment. The author studies this problem in the context of models of synaptic plasticity. In these 'cascade' models, learning operates on many different time scales. Iigaya shows that an organism can rapidly switch to the right time scale if it has access to a 'surprise' system that detects any changes in an agent's ability to predict outcomes in the environment. The results are illustrated through various simulations.
Overall, I found the paper quite well written and a pleasure to read. I also think it addresses an interesting and important topic. The only quibble I have is that the model, despite being announced as mechanistic and biophysical, is actually rather phenomenological. It would be nice if the author could find a way to better tie the 'synaptic' plasticity to the underlying neurobiology. For instance, if I were to run an experimental lab and was really interested in these learning questions, what exactly should I measure to test this theory? I elaborate a bit more on this below.
Comments:
1) Biophysical realism: Iigaya emphasizes that this is a model of 'synaptic' plasticity. However, the synapses seem to be considered completely in isolation, and their embedding within a network is only hinted at in words. For instance, no neuron model is specified in the method section, and a (somewhat unspecific) network model is only referenced in the main text. I'd be completely fine with a learning model on a purely phenomenological level. However, if the author wants to emphasize that this type of learning occurs at the level of synapses, he should make the model more biophysical, e.g., by introducing a specific neuron and network model etc. The biophysical plausibility is particularly stretched in equation (24) which learns 'differences' between synaptic weights. I am fine with the learning rules per se, but talking about them in terms of networks and synapses seems a stretch. So either really show that this works within a network, or de-emphasize the biophysical interpretation.
2) One simplification of the whole model seems to be that, if an animal has learnt a particular environment quite well, and synapses are fairly stable with slow plasticity, then a change in environment and the concomitant set of 'surprise signals' would essentially erase everything that had been learnt, and start things from scratch (at least for all learning rates faster than the detected surprise). The author states that longer time scales could remain stable, but it seems to me that this does not exactly solve the problem of switching between environments, each of which changes on a faster time scale. This kind of context dependence may be worth discussing.
Reviewer #3:
Behavioral learning by humans and other animals occurs at multiple timescales. Some years ago, the cascade synapse model successfully modeled the multi-timescale dynamics of synaptic plasticity for decision-making. However, as the overall learning performance gradually shifts to slower timescales in a stationary environment, the cascade synapse model has difficulty adapting to sudden changes in the environment. To overcome this difficulty, the author proposes a "surprise" detection system for decision-making. The basic idea is to compare the reward information stored in plastic synapses on multiple timescales to detect change points in the environment. Since pieces of evidence suggest that such a signal exists in the brain, the idea and results are of potential interest. However, I feel that the current manuscript is not unambiguously written and is hard to follow for readers unfamiliar with the cascade model. Some improvement is necessary.
Major comments:
1) Figure 2A and E explain the cascade model and surprise detection system, respectively. While reading the manuscript, I wondered whether the two systems work in harmony or work independently without interactions. Though now I find that the former should be the case, how the two systems interact with one another, or how a surprise signal is conveyed to the cascade synapses, during decision-making is not perfectly clear to me. The Methods also do not resolve my doubt. Please explain more about this point. In Methods, mathematical descriptions also require some revisions. For instance, the definition of R^{B+()} remains unclear in equations (20–23). Are there also quantities like R^{A+()} as in equations 5–19 of the cascade model? Should the cascade model and surprise detection system have the same depth of multi-timescales? The parameters α_{r} and α_{nr} in the r.h.s. of equations 25 and 26 are not defined, and the meaning of these operations is also unclear.
2) Related to the above point, I want to see in Figure 2F how multiple state variables in the cascade model and surprise detector simultaneously evolve on multiple timescales during decision-making. Showing synaptic strength only for two timescales in Figure 2G is not sufficient to understand the entire decision-making system. For example, is a surprise signal detected only at a pair of some timescales or at multiple pairs of different timescales slower than a critical timescale? Does the complex entire system (cascade model + surprise system) always work consistently on all timescales?
3) Results section, subsection “C. Our model self-tunes the learning rate and captures key experimental findings”: The author mentioned that the optimal Bayesian model and the proposed model show a similar behavior of the learning rate in each block of trials. Given this information, the readers may wonder what the advantage of the proposed model over the optimal Bayesian model is. Please comment on this point.
https://doi.org/10.7554/eLife.18073.011
Author response
The author has performed additional simulations and revised the manuscript extensively. All the referees agreed that the manuscript has greatly improved. However, there are some remaining issues to which we would like to see your response.
1) The reviewers pointed out that it is unclear whether the author's model is biologically plausible as proposed. During discussion, however, the reviewers noted that "biophysiological plausibility" is often difficult to define or relative, and that abstract models are often useful. Nevertheless, because the author now emphasizes biological plausibility in order to contrast with existing models (e.g. Bayesian models; Mackintosh; Pearce-Hall), the reviewers thought a little more clarification or toning down of this point would be required.
We do appreciate that the proposed model is an important step toward a mechanistic investigation of the interesting question; yet, it appears very difficult to implement some of the key components of the model. Specifically, one important proposal is the "surprise detection system", which takes the difference between the current and expected uncertainty, with uncertainty defined as the range of fluctuation (Figure 2G). To compute this, the author proposes to calculate the difference in synaptic weights of two groups. This is a very interesting idea, yet it is unclear how a neural circuit computes the difference in synaptic weights. One reviewer thought that precisely computing the difference of synaptic weights is beyond the ability of neural circuits (or "out of biological constraints"). We would like you to address this point either by showing how such a computation can be performed or approximated while obeying biological constraints, or by simply further de-emphasizing the claim for implementation on specific parts. We note that you already state explicitly that the network architecture of the surprise detection system is not specified in the present study, and that the effort toward biophysical implementation is an important aspect of the present study overall.
Thank you very much for the comments and suggestions. As stated in our original manuscript, we do not intend to propose a network architecture that implements the whole surprise detection algorithm. Specifying the entire architecture will require more experimental evidence and theoretical analysis.
As for the ‘subtraction’, we agree that it is implausible that the system can read out the synaptic strength per se. We sincerely apologize if this caused confusion. We have now omitted "synaptic strength" from the following sentence:
In Materials and Methods, in the subsection 'The surprise detection system': “As detailed circuits of a surprise detection system have yet to be shown either theoretically or experimentally, we leave a problem of specifying the architecture of system to future studies. The system learns the absolute value of the difference between the approximated reward rates v_{i} and v_{j} at a rate of…”.
We believe, however, that the difference of weights between two synaptic populations can be approximated by reading out from relevant neural populations. For example, imagine a network that includes two neural populations (A and B), each of whose activity is proportional to its total synaptic weights. Then one way to perform subtraction between these populations would be to have a readout population that receives an inhibitory projection from one population (A) and an excitatory projection from the other population (B). The activity of the readout neurons would then reflect the subtraction of signals that are proportional to synaptic weights (B – A). Now we further emphasize the limitation of the model and mention this possibility:
Second paragraph of Discussion: “We should, however, stress again that how our surprise detection system can be implemented should still be determined in the future.”
“To fill this gap, we proposed a more biophysically implementable computation which is partially performed by bounded synapses, and we found that our model performs as well as a Bayesian learner model”
“We should, however, note that we did not specify a network architecture for our surprise detection system. […] The activity of the readout neurons would then reflect the subtraction of signals that are proportional to synaptic weights (B – A).”
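The readout scheme described above can be illustrated with a toy rate model. This is only a minimal sketch under stated assumptions, not a circuit proposed in the manuscript: the function names, the gain, the example weights, and the rectification of the readout rate are all hypothetical choices made for illustration. Because firing rates cannot be negative, a single rectified readout only reports B – A when B is larger; the sketch therefore assumes a second readout with the opposite wiring, so that the sum of the two approximates the absolute difference.

```python
def population_rate(weights, gain=1.0):
    """Population activity, assumed proportional to its total synaptic weight."""
    return gain * sum(weights)


def readout_rate(rate_inhibitory, rate_excitatory, w_inh=1.0, w_exc=1.0):
    """Rectified readout: excitation from one population, inhibition from the other."""
    return max(0.0, w_exc * rate_excitatory - w_inh * rate_inhibitory)


# Hypothetical synaptic weights for the two populations.
weights_a = [0.2, 0.4, 0.1]
weights_b = [0.5, 0.6, 0.3]

rate_a = population_rate(weights_a)
rate_b = population_rate(weights_b)

# Readout 1: inhibited by A, excited by B, so it reports (B - A) when positive.
diff_b_minus_a = readout_rate(rate_a, rate_b)
# Readout 2 (assumed): the opposite wiring, reporting (A - B) when positive.
diff_a_minus_b = readout_rate(rate_b, rate_a)
# Summing the two rectified readouts approximates |B - A|.
abs_diff = diff_b_minus_a + diff_a_minus_b

print(diff_b_minus_a, diff_a_minus_b, abs_diff)
```

In this sketch the subtraction is carried out on population activity rather than on synaptic strength itself, which is the distinction the response draws.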
2) Please make sure that you do not say that the model "implements" the Bayes-optimal solution.
We sincerely apologize for this. We had no intention to claim that our model implemented the Bayes-optimal solution. We corrected our manuscript to avoid such confusion.
3) One reviewer suggested two additional considerations (Reviewer 1's point #2 and #3). Although we do not see these as essential for revision, they might improve the manuscript. So we would like to see your response.
Point #2:
We appreciate this suggestion. We agree that our model would be consistent with the data that the spontaneous recovery was towards the average of recent sessions. In order to further investigate other aspects of spontaneous recovery, including this one, we plan to conduct a more systematic analysis in future studies. Thank you for the suggestion.
Point #3:
Thank you again for the suggestion. Unfortunately, experimental studies into the circuit dynamics of adaptive learning rates are very limited (though some studies are discussed in the Discussion). As a result, it is currently very difficult to test our model against specific neural data. We hope that our study will stimulate further experimental and computational studies.
[Editors’ note: the author responses to the first round of peer review follow.]
Reviewer #1:
This paper presents a new computational model of metaplasticity, building on ideas from the cascade model, which allows synapses to rapidly adapt to changing volatility. This is an important question for biological decision-making systems. The article is clearly written and the theory is elegantly simple. However, I have several fundamental concerns that prevent me from recommending this paper for publication.
1) The model only explains a single empirical finding (adaptation of learning rate to reward volatility). This finding is already explained by a number of other models (for example, see Behrens et al. 2007). So it's not clear to me what this new model is adding.
We are sorry that this was not clear. We are aware that there are models that show changes in learning rates, including the one by Behrens et al. (2007). However, as we noted above, most computational studies have been limited to Bayesian inference models, which focus on optimal probabilistic inference according to Bayes' rule. Those models do not, by design, specify any biological implementation of such computation. Thus we aimed to provide a more biologically implementable computation in this manuscript, by combining a previously proposed neural circuit model and the cascade model of synaptic plasticity.
We agree that we should compare our model with such optimal computation models. In the current version we simulated the model of Behrens et al. and compared it with our model. We found that our neural model performs as well as the Bayes optimal model. Our results thus now provide a unique insight into how the optimal adaptation of learning rates can be implemented in neural circuits with plastic synapses.
Also, in the current version we account for a different phenomenon with the same model, which is spontaneous recovery of preference.
2) While the model is discussed in terms of synapses, no specific biological evidence is presented that directly supports the assumptions of the model.
We apologize that we failed to provide biological support for the synaptic model. Experimental and computational studies have shown that long-term modification of synaptic strengths accounts for memory. It has been recognized, however, that the remarkable memory performance of classical memory circuit models relied on the assumption of unbounded synaptic weights. Bounding synaptic weights has been shown to have catastrophic consequences for memory performance, because synapses ‘forget’ very quickly by overwriting [Amit and Fusi, 1995; Fusi and Abbott, 2007]. However, human memory does not seem to suffer from such catastrophic forgetting. To account for this, the cascade model of synapses [Fusi et al., 2005] was proposed. This model was based on the biochemical cascades that are ubiquitous in biological systems and, in particular, are associated with synaptic plasticity. Those processes take place over a wide range of timescales. Fusi et al. showed that the cascade model could significantly improve memory maintenance performance.
Adaptive decision-making has been studied in a neural circuit model with binary synapses (Soltani and Wang, 2006; Iigaya and Fusi, 2013). The decision-making network was originally proposed by X.J. Wang (2002). It is a biophysically based model because it has an “anatomically plausible architecture in which not only single spiking neurons are described biophysically with a reasonable level of accuracy but also synaptic interactions are calibrated by quantitative neurophysiology (which turned out to be critically important) [Wang, 2008]”. It has been shown that the circuit model can account for features of experimental data.
However, it has been recognized that the model has a severe limitation due to the simple synaptic model, namely a speed-accuracy tradeoff of adaptation. To address this issue, we applied the cascade model of synapses to a well-studied decision-making network.
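The speed-accuracy tradeoff mentioned above can be made concrete with a short calculation. The following is a hedged sketch rather than the author's simulation: it assumes that a population of simple binary synapses can be summarized by a single delta-rule estimate v ← v + α(r − v), where α stands in for the synaptic transition probability and r is a Bernoulli reward with probability p. Under that assumption the mean estimate closes a step change as (1 − α)^t, while its stationary variance is αp(1 − p)/(2 − α). The function names and the two example rates are hypothetical.

```python
import math


def trials_to_adapt(alpha, fraction=0.9):
    """Trials needed for the mean estimate to close `fraction` of the gap
    after a step change in reward probability: (1 - alpha)**t <= 1 - fraction."""
    return math.ceil(math.log(1.0 - fraction) / math.log(1.0 - alpha))


def steady_state_std(alpha, p):
    """Standard deviation of the estimate under stationary Bernoulli(p) rewards,
    from Var = alpha * p * (1 - p) / (2 - alpha)."""
    return math.sqrt(alpha * p * (1.0 - p) / (2.0 - alpha))


p_reward = 0.5                    # illustrative reward probability
alpha_fast, alpha_slow = 0.5, 0.05  # illustrative learning rates

fast_trials = trials_to_adapt(alpha_fast)
slow_trials = trials_to_adapt(alpha_slow)
fast_std = steady_state_std(alpha_fast, p_reward)
slow_std = steady_state_std(alpha_slow, p_reward)

print(f"fast: {fast_trials} trials to adapt, std {fast_std:.3f}")
print(f"slow: {slow_trials} trials to adapt, std {slow_std:.3f}")
```

With a single fixed rate, the fast learner adapts in a handful of trials but its estimate fluctuates widely, whereas the slow learner is precise but takes many more trials to follow a change. A synapse that can change its own rate of plasticity is one way to escape this constraint.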
We did not intend to specify the detailed circuit architecture of the surprise detection system. Rather, we proposed a simple computational algorithm that can be partially implemented by simple binary synaptic plasticity.
As detailed circuits of a surprise detection system have yet to be shown either theoretically or experimentally, we leave the problem of specifying the architecture of the system to future studies.
3) There's a huge literature on the effects of various experimental manipulations on learning rate. Much of this research was inspired by the seminal models of Mackintosh (1975) and Pearce & Hall (1980). Addressing at least some of this literature is important for demonstrating the explanatory power of the model.
We apologize that we failed to stress the important past research. We now discuss these works and their relationship to our work in the current manuscript:
“We should stress that there have been extensive studies of modulation of learning in conditioning tasks in psychology, inspired by two very influential proposals. […] Since the Pearce-Hall model focused on the algorithmic level of computation while our work focusing on neural implementation level of computation, our work complements the classical model of Pearce and Hall.”
We have now also applied our model to a phenomenon called spontaneous recovery, and showed that our model can account for it. We should, however, stress that our work is complementary to both the Mackintosh and the Pearce & Hall models, because those models do not specify a neural implementation of the algorithm. They sit at David Marr’s second level, the algorithm of computation (Marr, 1982). Our approach is at the third level, the neural implementation of computation.
As Marr stressed, these levels should be studied in parallel.
Reviewer #2:
In this work, Iigaya investigates how organisms can adjust their learning rates to the time scales of a randomly varying and somewhat unpredictable environment. The author studies this problem in the context of models of synaptic plasticity. In these 'cascade' models, learning operates on many different time scales. Iigaya shows that an organism can rapidly switch to the right time scale if it has access to a 'surprise' system that detects any changes in an agent's ability to predict outcomes in the environment. The results are illustrated through various simulations.
Overall, I found the paper quite well written and a pleasure to read. I also think it addresses an interesting and important topic. The only quibble I have is that the model, despite being announced as mechanistic and biophysical, is actually rather phenomenological. It would be nice if the author could find a way to better tie the 'synaptic' plasticity to the underlying neurobiology. For instance, if I were to run an experimental lab and was really interested in these learning questions, what exactly should I measure to test this theory? I elaborate a bit more on this below.
Comments:
1) Biophysical realism: Iigaya emphasizes that this is a model of 'synaptic' plasticity. However, the synapses seem to be considered completely in isolation, and their embedding within a network is only hinted at in words. For instance, no neuron model is specified in the method section, and a (somewhat unspecific) network model is only referenced in the main text. I'd be completely fine with a learning model on a purely phenomenological level. However, if the author wants to emphasize that this type of learning occurs at the level of synapses, he should make the model more biophysical, e.g., by introducing a specific neuron and network model etc. The biophysical plausibility is particularly stretched in equation (24) which learns 'differences' between synaptic weights. I am fine with the learning rules per se, but talking about them in terms of networks and synapses seems a stretch. So either really show that this works within a network, or deemphasize the biophysical interpretation.
We apologize for this confusion and thank you very much for pointing this out. We now detail this in the new version of the manuscript.
The cascade models of synapses are embedded in X.J. Wang's decision-making network. In this network, it has been shown previously that the firing rates of neurons that are responsible for making decisions are largely determined by the strengths of synaptic weights. Hence most of our focus was on the strengths of such synapses. This is now explained in more detail in the Methods section.
As the reviewer pointed out, however, we did not intend to specify the actual architecture of the other system: the surprise detection system.
This is because there is little experimental or theoretical evidence for specifying the architecture. Hence, for the surprise detection system, we proposed a computational algorithm that can partially operate on bounded synapses, without specifying the circuits. We agree that the part of the model that learns the difference in the synaptic weights is abstract, and we had no intention of specifying its biological implementation. We apologize that we did not make this clear. We leave the problem of specifying the architecture of this system to future studies. We stress this in the current manuscript:
“Note that for this we focus on the computational algorithm, and we do not specify the architecture of neural circuits responsible for this computation. As detailed circuits of a surprise detection system have yet to be shown either theoretically or experimentally, we leave the problem of specifying the architecture of system to future studies.”
“To fill this gap, we proposed a more biophysically implementable computation, partially performed by bounded synapses, and we found that our model performs as well as a Bayesian learner model (Behrens et al., 2007). We should, however, note that we did not specify network architecture for our surprise detection system. A detailed architecture for this, including connectivity between neuronal populations, requires more experimental evidence.”
2) One simplification of the whole model seems to be that, if an animal has learnt a particular environment quite well, and synapses are fairly stable with slow plasticity, then a change in environment and the concomitant set of 'surprise signals' would essentially erase everything that had been learnt, and start things from scratch (at least for all learning rates faster than the detected surprise). The author states that longer time scales could remain stable, but it seems to me that this does not exactly solve the problem of switching between environments each of which changes on a faster time scale. This kind of context-dependence may be worth discussing.
Thank you very much for pointing this out. This is indeed a limitation of our model. Our model needs modification before it can be applied to more complex situations (for example, those addressed by Gershman et al., 2010). We discuss this explicitly in the current manuscript:
“Our model has some limitations. First, we mainly focused on a relatively simple decision-making task, where one of the targets is more rewarding than the other and the reward rates for targets change at the same time. […] Those randomly connected neurons were reported in PFC as ‘mixed selective’ neurons [50]. It would be interesting to introduce such neuronal populations to our model to study more complex tasks.”
Reviewer #3:
Behavioral learning by humans and other animals occurs at multiple timescales. Some years ago, the cascade synapse model successfully modeled the multi-timescale dynamics of synaptic plasticity for decision-making. However, as the overall learning performance gradually shifts to slower timescales in a stationary environment, the cascade synapse model has difficulty adapting to sudden changes in the environment. To overcome this difficulty, the author proposes a "surprise" detection system for decision-making. The basic idea is to compare the reward information stored in plastic synapses on multiple timescales to detect change points in the environment. Since several pieces of evidence suggest that such a signal exists in the brain, the idea and results are of potential interest. However, I feel that the current manuscript is not unambiguously written and is hard to follow for readers unfamiliar with the cascade model. Some improvement is necessary.
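The basic idea summarized above, comparing reward estimates held on different timescales to detect a change point, can be illustrated with a minimal sketch. This is not the author's model: the paper's system stores reward rates in bounded cascade synapses, whereas the sketch below assumes two simple exponential running averages, a fixed threshold, and hypothetical parameter names (`fast_rate`, `slow_rate`, `threshold`) chosen only for illustration.

```python
def detect_surprise(rewards, fast_rate=0.3, slow_rate=0.03, threshold=0.5):
    """Return the trial indices at which a 'surprise' is flagged.

    Illustrative assumption: reward is tracked by two exponential
    running averages, one fast and one slow. A surprise is flagged
    when they disagree by more than a fixed threshold.
    """
    fast = slow = 0.5              # both traces start at the same neutral estimate
    current_slow_rate = slow_rate  # plasticity of the slow trace on this trial
    surprise_trials = []
    for t, r in enumerate(rewards):
        # Update both traces toward the observed reward.
        fast += fast_rate * (r - fast)
        slow += current_slow_rate * (r - slow)
        # Compare the two timescales: a large gap suggests the
        # environment has changed since the slow trace was learned.
        if abs(fast - slow) > threshold:
            surprise_trials.append(t)
            # Surprise transiently raises the slow trace's plasticity,
            # so it can catch up on the next trial.
            current_slow_rate = fast_rate
        else:
            current_slow_rate = slow_rate
    return surprise_trials


if __name__ == "__main__":
    # Hypothetical reward sequence: always rewarded for 100 trials, then never.
    rewards = [1.0] * 100 + [0.0] * 100
    print(detect_surprise(rewards))
```

In this toy setting, the two traces agree while the environment is stable and diverge shortly after the reward contingency changes, which is when the slower trace is made temporarily more plastic. A fixed threshold is a simplification made here for brevity.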
Major comments:
1) Figures 2A and 2E explain the cascade model and the surprise detection system, respectively. While reading the manuscript, I wondered whether the two systems work in harmony or work independently without interactions. Though I now find that the former should be the case, how the two systems interact with one another, or how a surprise signal is conveyed to the cascade synapses, during decision-making is not perfectly clear to me. The Methods also do not resolve this. Please explain more about this point. In the Methods, the mathematical descriptions also require some revisions. For instance, the definition of R^{B+} remains unclear in equations (20–23). Are there also quantities like R^{A+}, as in equations 5–19 of the cascade model? Should the cascade model and the surprise detection system have the same depth of multiple timescales? The parameters α_{r} and α_{nr} on the r.h.s. of equations 25 and 26 are not defined, and the meaning of these operations is also unclear.
Thank you very much for pointing this out. We apologize for the lack of clarity; we now detail this in the Methods section.
2) Related to the above point, I want to see in Figure 2F how multiple state variables in the cascade model and surprise detector simultaneously evolve on multiple timescales during decision-making. Showing synaptic strength only for two timescales in Figure 2G is not sufficient to understand the entire decision-making system. For example, is a surprise signal detected only at a pair of some timescales or at multiple pairs of different timescales slower than a critical timescale? Does the complex entire system (cascade model + surprise system) always work consistently on all timescales?
We really appreciate this comment. The whole system is designed to work on all timescales. The cascade model, however, can become biased toward the task-relevant timescales. This sometimes leads to maladaptive behavior when the environment suddenly changes. To correct for this, the surprise system must operate on all timescales.
It is very important to illustrate how our model works as a whole. We now illustrate this in detail with the new Figure 8, and we have extended the Methods section.
3) Results section, subsection “C. Our model self-tunes the learning rate and captures key experimental findings”: The author mentions that the optimal Bayesian model and the proposed model show similar learning-rate behavior in each block of trials. Given this information, readers may wonder what advantage the proposed model has over the optimal Bayesian model. Please comment on this point.
Thank you very much for pointing this out. As we explained above, we now stress the difference and have conducted an explicit model comparison.
https://doi.org/10.7554/eLife.18073.012
Article and author information
Author details
Funding
Swartz Foundation
 Kiyohito Iigaya
Gatsby Charitable Foundation
 Kiyohito Iigaya
The funders had no role in study design, data collection and interpretation, or the decision to submit the work for publication.
Acknowledgements
I especially thank Stefano Fusi for fruitful discussions. I also thank Larry Abbott, Peter Dayan, Kevin Lloyd, Anthony Decostanzo for critical reading of the manuscript; Ken Miller, Yashar Ahmadian, Yonatan Loewenstein, Mattia Rigotti, Wittawat Jitkrittum, Angus Chadwick, and Carlos Stein N Brito for most helpful discussions. I thank the Swartz Foundation and Gatsby Charitable Foundation for generous support.
Reviewing Editor
 Naoshige Uchida, Harvard University, United States
Publication history
 Received: May 23, 2016
 Accepted: August 8, 2016
 Accepted Manuscript published: August 9, 2016 (version 1)
 Version of Record published: September 1, 2016 (version 2)
Copyright
© 2016, Iigaya et al.
This article is distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use and redistribution provided that the original author and source are credited.