Main

Animals and humans perceive the external world through their sensory systems and make decisions accordingly1,2. Generally, they cannot make optimal decisions because of environmental uncertainty, the limited computational capacity of the brain and the time constraints associated with decision-making3. As a result, they sometimes perform irrational actions. For example, people play lotteries and gamble despite low reward expectations. In this case, they face a dilemma between a low expected reward and curiosity about whether a reward will be acquired. Thus, understanding how animals control the balance between reward and curiosity is important for clarifying the whole decision-making process. However, a method for quantifying the reward–curiosity balance has yet to be established. In this study, we developed a machine learning method to decode the time series of the reward–curiosity balance from animal behavioral data.

Some irrational behaviors emerge because of the strength of curiosity4,5. For example, conservative individuals avoid uncertainty and prefer to select actions that lead to predictable outcomes. Conversely, inquisitive individuals desire to know the environment more strongly than they desire rewards and prefer to select actions that lead to unpredictable outcomes. Excessively conservative and excessively inquisitive natures can be related to autism spectrum disorder and attention deficit hyperactivity disorder, respectively; patients with these disorders are known to strongly avoid and strongly seek novel information, respectively6,7,8,9,10,11,12,13,14. Rational individuals fall midway between these two extremes: in an ambiguous environment, they select actions to efficiently understand the environment, and once the environment becomes clear, they select actions to efficiently exploit the rewards. Therefore, curiosity has a major impact on behavioral patterns, and animals are believed to control the balance between reward and curiosity in a context-dependent manner.

Decision-making has been modeled primarily by reinforcement learning (RL), a theory describing reward-seeking adaptive behavior in which animals not only exploit rewards but also explore the environment. In RL, explorative behavior is typically addressed by a passive, random choice of action15. However, animals actively explore the environment by selecting actions that minimize environmental uncertainty according to their curiosity.

Recently, the free energy principle (FEP) was proposed by Karl Friston under the Bayesian brain hypothesis, in which the brain optimally recognizes the outside world according to Bayesian estimation16,17,18. The FEP addresses not only the recognition of the external world but also information-seeking action selection, which minimizes the uncertainty of that recognition, known as “active inference”19,20,21. Furthermore, the FEP provides a score for actions, called the expected free energy, which combines the expected reward and curiosity in the same units22,23,24,25,26,27. Thus, action selection can be formulated as maximizing both reward and curiosity. Note that curiosity can be regarded as information gain, that is, the extent to which we expect our recognition to be updated by a new observation obtained through the action. However, the FEP assumes that the weighting of reward and curiosity is always even and constant. Although a previous FEP study modeled active inference in a two-choice task, it assumed a constant intensity of curiosity and thus could not treat actual animal behaviors, in which the weights of reward and curiosity are expected to change over time28. Hence, conventional theories such as RL and the FEP are limited in describing the conflict between reward and curiosity.

Identifying the temporal variability of curiosity is important for future work clarifying the neural mechanisms of the reward–curiosity conflict in decision-making. Many FEP studies have been devoted to constructing the theory under the assumption that the decision-making processes of animals are Bayes optimal; the idea that animals make decisions irrationally, depending on the conflict between reward and curiosity, has therefore not even been considered. For this reason, a method to decode the temporal balance between reward and curiosity from behavioral data is yet to be established. Such a method would enable us to analyze neural correlates of the temporal variability of curiosity and, consequently, would help us clarify how the brain controls the balance of reward and curiosity in a context-dependent manner.

In this study, we extended FEP by incorporating a meta-parameter that controls the conflict dynamics between reward and curiosity, called the reward–curiosity decision-making (ReCU) model. The ReCU model can exhibit various behavioral patterns, such as greedy behavior toward reward, information-seeking behaviors with high curiosity and conservative behaviors avoiding uncertainty. Moreover, we developed a machine learning method called the inverse FEP (iFEP) method to estimate the internal variables of decision-making information processing. Applying the iFEP method to a behavioral time series in a two-choice task, we successfully estimated the internal variables, such as variations in curiosity, recognition of reward availability and its confidence.

Results

Decision-making with the reward–curiosity dilemma

Animals perceive the environment by inferring causes, such as reward availability, from observations, and then they make decisions based on their own inferences. In this study, we developed the ReCU model of a decision-making agent facing a dilemma between reward and curiosity in a two-choice task, wherein the agent selects either of two options associated with the same reward but with different reward probabilities (Fig. 1a). If the agent aims to maximize the cumulative reward, it must select the option with the higher reward probability. However, in animal behavioral experiments, even after animals learned which option was more strongly associated with a reward, they did not exclusively select the best option but often selected the option with the smaller reward probability, which seems unreasonable.

Fig. 1: Decision-making model for the two-choice task with reward–curiosity dilemma.

a, Decision-making in the two-choice task. Reward is provided with a different probability for each option. The agent does not know these probabilities. Through repeated trial and error, the agent recognizes the world by inferring the latent reward probability of each option, and decides its next action, that is, which option to choose, based on its own inference. b, Sequential Bayesian estimation as the recognition process. The agent assumes that the reward probabilities change over time owing to the fluctuation of the latent variable controlling the reward probability. c, Belief updating. The agent recognizes the latent variable as a probability distribution. d, The update rules for the mean and variance of the estimation distribution for each option. α, Kt and f(μt) indicate the learning rate, the Kalman gain and the prediction of the reward probability, respectively. The second term in both equations disappears if the option is not selected. e, The action selection process of the agent. The agent evaluates the expected net utility \(U_t(a_{t + 1})\) of each action as the weighted sum of the expected reward and information gain, as shown in the equation. The agent compares the expected net utilities of both actions and prefers the option with the larger expected net utility. f, Time-dependent curiosity. The intensity of curiosity changes over time owing to the fluctuation of \(c_t\).

Here, we hypothesized the following: animals assume that the reward probability for each option may fluctuate over time; therefore, continuously selecting one option decreases the confidence of the reward probability estimate for the other option. Thus, they become curious about the ambiguous option even if it has a smaller reward probability, and selecting the ambiguous option is reasonable because it increases the confidence of the estimates for both options. We therefore considered that the agent should make decisions driven by both reward and curiosity in a situation-dependent manner.

ReCU model

In the ReCU model, we divided information processing in the brain into two processes. In the first process, the agent updates the recognition of the reward probability of each option (Fig. 1a, process 1). In the second process, the agent selects an action based on the current recognition and curiosity (Fig. 1a, process 2). The agent repeats these two processes in the two-choice task.

We modeled the first process by sequential Bayesian estimation, under the assumption that the reward probabilities fluctuate latently over time (Fig. 1b). The agent updates its belief about the reward probabilities, expressed as estimation distributions, in response to actions and the resulting reward observations (Fig. 1c). We derived the equations of the belief update (Fig. 1d and Methods). We modeled the second process by action selection based on two kinds of motivation: the desire to maximize reward and the desire to gain information from the environment (Fig. 1e). Their weighted sum, called the ‘expected net utility’ in this study, can be expressed by

$$U_t\left( {a_{t + 1}} \right) = E\left[ {\mathrm{Reward}_{t + 1}} \right] + c_t \cdot E\left[ {\mathrm{Info}_{t + 1}} \right],$$
(1)

where a and t indicate the action and trial index, respectively, and E[x] denotes the expectation of x based on the current recognition. The first and second terms represent the expected reward and the expected information gained from a new observation, respectively, for the next action at+1 based on the current recognition. ct denotes a meta-parameter describing the intensity of curiosity, which weights the expected information gain (see Methods for details). We assumed that the mental conflict is irrational in the sense that ct varies over time (Fig. 1f). In decision-making, the agent prefers to select the action at+1 with the higher expected net utility; the action is selected probabilistically following a sigmoidal function

$$P\left( {a_{t + 1}} \right) = \frac{1}{{1 + \exp \left( { - \beta {\Delta}U_t} \right)}},$$
(2)

where \({\Delta}U_t = U_t\left( {a_{t + 1}} \right) - U_t\left( {\overline {a_{t + 1}} } \right)\), in which \(\overline {a_{t + 1}} \) denotes the alternative (unchosen) action, and β denotes the inverse temperature controlling the randomness of action selection22,29,30.
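To make equations (1) and (2) concrete, the following Python sketch computes the expected net utility of each option and samples an action. It is a minimal illustration under assumed values; all function and variable names (for example, expected_net_utility) are ours, not from the original implementation.

```python
import numpy as np

def expected_net_utility(expected_reward, expected_info, c_t):
    """Equation (1): expected reward plus curiosity-weighted
    expected information gain."""
    return expected_reward + c_t * expected_info

def choose(utility_a, utility_b, beta, rng):
    """Equation (2): sigmoidal choice between two actions;
    returns 0 for the first action and 1 for the second."""
    p_first = 1.0 / (1.0 + np.exp(-beta * (utility_a - utility_b)))
    return 0 if rng.random() < p_first else 1

# Hypothetical single-trial values: the left option promises more reward,
# the right option promises more information (it is more uncertain).
rng = np.random.default_rng(0)
u_left = expected_net_utility(expected_reward=0.8, expected_info=0.1, c_t=1.0)
u_right = expected_net_utility(expected_reward=0.4, expected_info=0.6, c_t=1.0)
print(u_left, u_right, choose(u_left, u_right, beta=2.0, rng=rng))
```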

Recognition and decision-making in the simulation

To validate our model, we performed simulations with constant curiosity ct = 1 for two cases. In the first case, where the reward probabilities were constant and differed between the two options (Fig. 2a), the agent preferred to select the option with the higher reward probability (Fig. 2b). The recognized reward probabilities converged to the ground truths, indicating that the agent accurately recognized the reward probabilities (Fig. 2c). The recognition confidence changed over time depending on the behavior in each trial: the confidence of the selected option increased with information from the observation, whereas that of the unselected option decreased because of the agent’s assumption that the reward probabilities fluctuate (Fig. 2d). Similarly, the expected information gain of an option increased when that option was selected and decreased when it was not. Thus, the expected information gain was lower for the option with higher confidence (Fig. 2e). The expected reward followed the recognized reward probability (Fig. 2f). The initially decreasing expected information gain and the increasing expected reward eventually cross after some number of trials, which corresponds to a switch from information exploration to reward acquisition (Supplementary Fig. 1a). These two factors are negatively correlated, indicating a trade-off relationship (Supplementary Fig. 1b). The expected net utility, which is the sum of the expected information gain and the expected reward, represents the value of each option (Fig. 2g), so the agent preferentially selected the option with the higher expected net utility.

Fig. 2: Simulations of the decision-making model.

a, The two-choice task with constant reward probabilities. b, The selection probabilities for the left and right options, plotted as a moving average with a window width of 101. c, The recognized reward probabilities for the left and right options compared with the ground truths depicted by dashed lines. d, The confidences of the reward probability recognitions for the left and right options. e–g, The expected belief updates (e), expected reward (f) and expected net utility (g) for the left and right options. h, The two-choice task with constant and temporally varying reward probabilities for the left and right options. i–n, The same as b–g with parameter values of \(c = 1\), \(P_o = 0.8\), \(\alpha = 0.05\), \(\beta = 2\) and \(\sigma _w = 0.63\). o–q, The selected options in a space of left–right differences of the expected reward and information gain. The ReCU model was simulated with dynamically changing reward probabilities for different intensities of curiosity: \(c = - 1\) (o), \(c = 0\) (p) and \(c = 3\) (q). The reward probabilities were generated by the Ornstein–Uhlenbeck process of w for 1,000 trials: \(w_{i,t} = w_{i,t - 1} - 0.01w_{i,t - 1} + 0.15\xi _t\), where \(\xi _t\) indicates standard Gaussian noise. The heatmap represents the probability of action selection in the space (equation (2)).


In the second case, we assumed a dynamic environment with time-dependent reward probabilities (Fig. 2h). In the simulation, the agent adaptively changed its recognition of the reward probability following the change in the true reward probability and selected the option with the higher estimated reward probability (Fig. 2i,j). The confidence was affected by the uncertainty of the reward probability at each time: the confidence was high where the reward probabilities were near-deterministic, around 1 and 0, but low where the reward probabilities were uncertain, that is, approximately 0.5 (Fig. 2k). The expected information gain of an option was negatively correlated with the confidence of that option (Fig. 2l), suggesting that the agent was curious about the uncertain option. The expected reward simply varied with the recognized reward probability (Fig. 2m). In this case, we also observed a switch between information exploration and reward acquisition in the initial phase (Supplementary Fig. 1c). In contrast to the first case, the expected reward and expected information gain did not show a clear linear correlation because of the dynamic environment (Supplementary Fig. 1d). The expected net utility changed similarly to the expected reward; however, the difference between the left and right options was less pronounced because of curiosity (Fig. 2n). These two demonstrations indicate that our model can represent the process of recognition and decision-making based on reward and curiosity.

Discrimination of passive and curiosity-dependent behaviors

The behavioral differences induced by curiosity are of particular interest. The difference in the expected net utility between the two options can be written as

$${\Delta}U_t = {\Delta}E\left[ {\mathrm{Reward}_{t + 1}} \right] + c_t \cdot {\Delta}E\left[ {\mathrm{Info}_{t + 1}} \right],$$
(3)

where the first and second terms represent the differences in the expected reward and the expected information gain between the two options, respectively. Thus, the agent decides based on the balance between \({\Delta}E\left[ {\mathrm{Reward}_{t + 1}} \right]\) and \({\Delta}E\left[ {\mathrm{Info}_{t + 1}} \right]\). Here, we created a diagram visualizing the selected actions in the space of \({\Delta}E\left[ {\mathrm{Reward}_{t + 1}} \right]\) and \({\Delta}E\left[ {\mathrm{Info}_{t + 1}} \right]\). Depending on the intensity of curiosity ct, the left and right actions can be separated by the boundary \({\Delta}E\left[ {\mathrm{Reward}_{t + 1}} \right] + c_t \cdot {\Delta}E\left[ {\mathrm{Info}_{t + 1}} \right] = 0\) (Fig. 2o–q). The agent with ct = 0 selected an option based only on \({\Delta}E\left[ {\mathrm{Reward}_{t + 1}} \right]\) (Fig. 2p). In contrast, if ct was nonzero, the boundary leaned in different directions depending on whether ct was positive or negative (Fig. 2o,q). These results indicate that passive choice (ct = 0) and curiosity-dependent choice (\(c_t \ne 0\)) can be discriminated based on the distribution of selected actions in the space of \({\Delta}E\left[ {\mathrm{Reward}_{t + 1}} \right]\) and \({\Delta}E\left[ {\mathrm{Info}_{t + 1}} \right]\).
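This discrimination can be performed directly from choice data. The sketch below is a toy example with synthetic trials (all names and values are ours): it fits the logistic model also used later in Fig. 6, P(left) = f(wRΔE[Reward] + wIΔE[Info]), and recovers an effective curiosity as the weight ratio wI/wR.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic per-trial differences (left minus right) and choices generated
# with a known curiosity, to illustrate recovering it from behavior.
rng = np.random.default_rng(1)
d_reward = rng.normal(0.0, 1.0, 1000)
d_info = rng.normal(0.0, 1.0, 1000)
c_true, beta = 2.0, 2.0
p_left = 1.0 / (1.0 + np.exp(-beta * (d_reward + c_true * d_info)))
left_chosen = (rng.random(1000) < p_left).astype(int)

# Fit P(left) = f(w_R * dE[Reward] + w_I * dE[Info]); the decision boundary
# is w_R * dE[Reward] + w_I * dE[Info] = 0, and w_I / w_R estimates c.
X = np.column_stack([d_reward, d_info])
model = LogisticRegression(fit_intercept=False).fit(X, left_chosen)
w_r, w_i = model.coef_[0]
print("effective curiosity (w_I / w_R):", w_i / w_r)
```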

Curiosity-dependent irrational behaviors

We examined how behavioral patterns are regulated by the intensity of curiosity and the degree of reward seeking (Fig. 3). In a scenario where the reward probabilities were zero for the left option and 0.5 for the right option (Fig. 3a), we simulated the model by varying the meta-parameters c and Po, the latter of which controls the reward amount (Fig. 3b and Methods). When the agent strongly desired the reward (\(P_o = 0.99\)) and had no curiosity (c = 0), the agent preferred the right option with the higher reward probability (Fig. 3c, point a). When the agent had no desire for a reward (\(P_o = 0.5\)) but high curiosity (c = 10), the agent also preferred the option with the higher reward probability (Fig. 3c, point b). Although this behavior seems rational at first glance, the agent did not seek the reward but rather sought information (that is, belief updates) driven by curiosity, which resulted in a preference for the uncertain option. When the agent had negative curiosity (c = −10), the agent continuously selected one of the two options depending on its first selection (Fig. 3c, point c). In this behavior, the agent conservatively selected the more certain option, just as patients with autism spectrum disorder irrationally avoid new information and repeat the same choices6,7,8,9,10,11.

Fig. 3: Curiosity-dependent irrational behaviors.

a, The two-choice task with different, constant reward probabilities: 0% for the left and 50% for the right option. b, The heatmap of the selection probability of the right option as a function of the curiosity and reward-intensity parameters. The probability was obtained empirically by running 1,000 simulations for each pair of curiosity and reward values. Three representative conditions are indicated by black dots: reward-seeking (point a; \(c = 0,P_o = 1\)), information-seeking (point b; \(c = 10,P_o = 0.5\)) and information-avoiding (point c; \(c = - 10,P_o = 0.75\)). c, Box plots showing the selection ratios of the right option for the conditions at points a–c, where the central line indicates the median, the edges are the lower and upper quartiles and the upper and lower whiskers represent the highest and lowest values after excluding outliers, respectively. At point c, the model agent dominantly selected either the right or the left option, with left- and right-dominant simulations occurring 505 and 495 times, respectively. The box plots for point c appear crushed because the data points are too densely packed. d–f, The same as a–c for the two-choice task with different, constant reward probabilities: 50% for the left and 100% for the right option. At point c in f, the left- and right-dominant simulations occurred 489 and 511 times, respectively.


In addition, we obtained a nontrivial result in another scenario, where the reward probabilities were 0.5 for the left option and 1 for the right option (Fig. 3d–f). As in the previous scenario, the agent with a strong desire for the reward (Po = 0.99, nearly equal to 1) preferred the right option with the higher reward probability (Fig. 3f, point a). The agent with no desire for reward (Po = 0.5) and high curiosity (c = 10) preferred the left option with the lower reward probability (Fig. 3f, point b). This seemingly irrational behavior was the outcome of focusing on satisfying curiosity rather than seeking rewards, which is reminiscent of patients with attention deficit hyperactivity disorder irrationally exploring new information12,13,14. In addition, as in the previous scenario (Fig. 3c, point c), agents with negative curiosity (c = −10), irrespective of their desire for the reward, exhibited conservative selection (Fig. 3f, point c). In combination, these results clearly indicate that behavioral patterns largely depend on the degree of conflict between reward and curiosity.

Inverse FEP: Bayesian estimation of the internal state

In the above cases, we assumed a constant balance between reward and curiosity. However, our feelings swing in a context-dependent manner. Although it is important for neuroscience and psychology to decipher the temporal swings of the conflict between reward and curiosity, quantifying the conflict is difficult precisely because of its temporal dynamics. Here, we addressed the inverse problem of estimating the internal states of the agent from behavioral data, known as computational phenotyping or meta-Bayesian inference31,32,33,34,35. To this end, we developed a machine learning method, iFEP, to quantitatively decipher the temporal dynamics of the internal state, including the curiosity meta-parameter, from behavioral data.

To develop iFEP, we needed to switch the viewpoint from the agent to an observer of the agent, that is, from the animal to us. In the state-space model (SSM) from the viewpoint of the agent, we described the agent's sequential recognition of the reward probabilities (Figs. 1b and 4a,b). Conversely, we developed a state-space model from the observer’s eye (the observer-SSM) to determine the internal state of the agent, for example, the intensity of curiosity ct, the recognition μi,t and its confidence pi,t (that is, the inverse of the variance of the estimation distribution) (Fig. 4c). In the observer-SSM, the intensity of curiosity is assumed to change continuously over time, and the agent’s recognition of the reward probability is updated using the equations shown in Fig. 1d following the FEP; however, these quantities are unknown to the observer. In addition, the agent’s actions are assumed to be generated depending on the intensity of its curiosity, its recognition and its confidence, as described in equation (2), but the observer can only monitor the agent’s actions and the presence of a reward. In iFEP, based on the observer-SSM, we estimate the latent internal state of the agent, z, from the observations x in a Bayesian manner as

$$\begin{array}{*{20}{c}} {P\left( {z_{1:T}|x_{1:T}} \right) \propto P\left( {x_{1:T}|z_{1:T}} \right)P\left( {z_{1:T}} \right),} \end{array}$$
(4)

where \(z_{1:T} = \left\{ {\mu _{i,1:T},p_{i,1:T},c_{1:T}} \right\}\), \(x_{1:T} = \left\{ {a_{1:T},o_{1:T}} \right\}\) and the subscript 1:T indicates steps 1 to T. In this Bayesian estimation, the posterior distribution \(P\left( {z_{1:T}|x_{1:T}} \right)\) represents the observer’s recognition of the estimated \(z_{1:T}\), with its uncertainty, given the observations x1:T. The prior distribution \(P\left( {z_{1:T}} \right)\) represents our belief, expressed as the ReCU model with random motion of the curiosity meta-parameter c as

$$\begin{array}{*{20}{c}} {c_t = c_{t - 1} + {\it{\epsilon }}\zeta _t,} \end{array}$$
(5)

where \(\zeta _t\) indicates standard Gaussian white noise and \({\it{\epsilon }}\) indicates its intensity. The likelihood P(x1:T|z1:T) represents the probability that x1:T was observed given z1:T, which also follows the ReCU model. This Bayesian estimation, namely iFEP, was conducted using a particle filter and a Kalman backward algorithm (Methods).
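A minimal sketch of the iFEP idea is given below in Python: each particle carries a curiosity value together with belief means and precisions, the curiosity follows the random walk of equation (5), and particles are weighted by the likelihood of the observed action under equation (2). To keep the sketch short, the expected information gain is replaced by a crude proxy (the inverse precision) rather than the exact mutual-information term, and there is no Kalman backward pass; all function names and parameter values are ours, not the authors' implementation.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def update_belief(mu, p, action, outcome, alpha, sigma_w2):
    """Simplified belief update of the chosen option (Fig. 1d);
    the unchosen option only loses precision through sigma_w2."""
    k = 1.0 / p + sigma_w2                       # predictive variance per option
    mu_new, p_new = mu.copy(), 1.0 / k
    mu_new[action] = mu[action] + alpha * k[action] * (outcome - sigmoid(mu[action]))
    p_new[action] += sigmoid(mu_new[action]) * (1.0 - sigmoid(mu_new[action]))
    return mu_new, p_new

def ifep_particle_filter(actions, outcomes, n_particles=1000, alpha=0.05,
                         beta=2.0, sigma_w2=0.4, eps=0.05, p_o=0.8, seed=0):
    """Bootstrap particle filter over the curiosity trajectory c_t.
    actions, outcomes: arrays of 0/1 per trial (0 = left chosen)."""
    rng = np.random.default_rng(seed)
    T = len(actions)
    c = rng.normal(0.0, 1.0, n_particles)        # curiosity particles
    mu = np.zeros((n_particles, 2))              # belief means
    p = np.ones((n_particles, 2))                # belief precisions
    reward_size = np.log(p_o / (1.0 - p_o))      # eq. (9)
    c_est = np.zeros(T)
    for t in range(T):
        c = c + eps * rng.standard_normal(n_particles)        # eq. (5)
        e_reward = sigmoid(mu) * reward_size                   # expected reward
        e_info = 1.0 / p                                       # crude proxy for E[Info]
        du = (e_reward[:, 0] - e_reward[:, 1]) + c * (e_info[:, 0] - e_info[:, 1])
        p_left = sigmoid(beta * du)                            # eq. (2)
        lik = p_left if actions[t] == 0 else 1.0 - p_left
        w = lik / lik.sum()
        c_est[t] = np.sum(w * c)                               # filtered estimate of c_t
        idx = rng.choice(n_particles, n_particles, p=w)        # resample
        c, mu, p = c[idx], mu[idx], p[idx]
        for n in range(n_particles):                           # propagate beliefs
            mu[n], p[n] = update_belief(mu[n], p[n], actions[t],
                                        outcomes[t], alpha, sigma_w2)
    return c_est

# Toy usage with random behavior; real data would supply actions and outcomes.
rng = np.random.default_rng(1)
print(ifep_particle_filter(rng.integers(0, 2, 100), rng.integers(0, 2, 100))[:5])
```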

Fig. 4: The scheme of iFEP by an observer of a decision-making agent.

a, An agent performing the two-choice task from the observer’s perspective. b, The observer’s assumption about the agent’s decision-making. The agent is assumed to follow the decision-making model described in Fig. 1. c, The SSM from the observer’s eye. For the observer, the agent’s reward–curiosity conflict, recognized reward probabilities and their uncertainties are temporally varying latent variables, whereas the agent’s action and the presence/absence of a reward are observable. The observer estimates the latent internal states of the agent.

Validation of iFEP with artificial data

We tested the validity of the iFEP method by applying it to artificial data generated by the ReCU model. We simulated a model agent with nonconstant curiosity in the two-choice task, in which the reward probabilities varied over time. We then demonstrated that iFEP recovered the ground truth of the internal state of the simulated agent, that is, the agent’s intensity of curiosity, recognition and confidence (Supplementary Fig. 2). We also confirmed that the estimation performance is robust to the value of \({\it{\epsilon }}\) (Supplementary Fig. 3). Therefore, iFEP can provide reliable estimates of the belief-updating variables, clarifying decision-making processing and the accompanying temporal swings in the conflict between reward and curiosity.

iFEP-decoded internal state behind rat behaviors

We applied iFEP to actual rat behavioral data from a two-choice task experiment with temporally varying reward probabilities36 (Fig. 5a). In this experiment, once the reward probabilities suddenly changed in a discrete manner, the rat slowly adapted to select the option with the higher reward probability (Fig. 5b), suggesting that the rat sequentially updated its recognition of the reward probabilities. Based on these behavioral data, iFEP estimated the internal state, that is, the intensity of curiosity, the recognized reward probabilities and their confidence levels (Fig. 5c–e). We found that the rat was not perfectly aware of the true reward probabilities but was able to recognize increases and decreases in reward probability (Fig. 5d,e). We also found that the confidence in an option increased when that option was chosen and decreased when it was not (Fig. 5f,g).

Fig. 5: Estimation of the rat’s internal state by iFEP.

a, A rat in the two-choice task with temporally switching reward probabilities. b, Rat behaviors from a public dataset at https://groups.oist.jp/ja/ncu/data ref. 5. Vertical lines indicate the selections of the left (L) and right (R) options, respectively. The time series indicates the moving average of the selection probability of the left option with a backward window of width 25. c, The moving average of the selection probabilities for the left and right options. d,e, The estimations, driven by the rat behavioral data, of the recognized reward probabilities for the left (d) and right (e) options. f–h, The estimations, driven by the rat behavioral data, of the rat's confidence in the recognized reward probabilities for the left (f) and right (g) options, and of the rat's curiosity (h). The estimated parameter values were \(\alpha = 0.058\), \(\beta = 6.991\) and \(\sigma _w^2 = 0.560\). The number of particles in the particle filter was 100,000. Continuous shaded ranges represent the s.d. over all particles in d–h.


With iFEP, we can examine whether the rat subjectively assumed a fluctuating or a constant environment, that is, fluctuating or constant reward probabilities. From the rat behavioral data, we estimated the degree of fluctuation of the reward probabilities that the rat assumes to be \(p_w = 1.785\) (that is, \(\sigma _w^2 = 0.560\)). This estimated value implies that the latent variable controlling the reward probabilities follows a random walk whose s.d. grows with the number of trials as \(\sqrt {\sigma _w^2t}\). Because the reward probability is the sigmoid of this latent variable, the estimated reward probability can change substantially, from 0.5 to 0.5 ± 0.4, within only ten trials (Supplementary Fig. 4). Therefore, the rat appears to have assumed a fluctuating environment and, thus, easily forgot its recognition and lost its confidence.
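This magnitude can be checked with a few lines of Python under the estimated value σw² = 0.560: after ten trials the s.d. of the latent random walk is √(0.560 × 10) ≈ 2.4, and applying the sigmoid at ±1 s.d. moves the reward probability from 0.5 to roughly 0.1 or 0.9.

```python
import numpy as np

sigma_w2, trials = 0.560, 10
sd = np.sqrt(sigma_w2 * trials)           # s.d. of the latent random walk after 10 trials
sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))
print(sd)                                  # ~2.37
print(sigmoid(-sd), sigmoid(sd))           # ~0.09 and ~0.91, i.e. 0.5 -/+ ~0.4
```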

Negative curiosity and its dynamics decoded by iFEP

Interestingly, the curiosity held by the rat was estimated to be negative in almost all trials (Fig. 5h). In other words, the rat conservatively preferred certain choices and did not explore uncertain choices. This negative curiosity is reasonable for starved animals, which desire to obtain more reward with higher confidence. To validate the negative curiosity, we visualized the selected actions in the space of \({\Delta}E\left[ {\mathrm{Reward}_{t + 1}} \right]\) and \({\Delta}E\left[ {\mathrm{Info}_{t + 1}} \right]\), as in Fig. 2o–q (Fig. 6a–c). When the estimated ct was negative, the rat dominantly selected the left action in a region with positive \({\Delta}E\left[ {\mathrm{Reward}_{t + 1}} \right]\) and negative \({\Delta}E\left[ {\mathrm{Info}_{t + 1}} \right]\) (Fig. 6a for ct < −1.1; Fig. 6b for \(- 1.1 \le c_t < - 0.7\)), clearly indicating that the rat had a positive subjective reward and negative curiosity. When the estimated ct was close to 0, the rat selected both actions based on \({\Delta}E\left[ {\mathrm{Reward}_{t + 1}} \right]\), independent of \({\Delta}E\left[ {\mathrm{Info}_{t + 1}} \right]\) (Fig. 6c for \(- 0.7 \le c_t\)). These results clearly support an expected net utility with a positive weight on the expected reward and a time-dependent weight on the expected information gain. In addition, we statistically confirmed the negative curiosity (P < 0.01 for Monte Carlo testing in Supplementary Fig. 5).

Fig. 6: Negative curiosity and its dynamics.

a–c, The selected options in a space of left–right differences of the expected reward and information gain. All actions were separated into three groups: estimated curiosity negatively large (\(c \le - 1.1\), \(n = 110\)) (a), negatively medium (\(- 1.1 < c \le - 0.7\), n = 93) (b) and close to 0 (\(- 0.7 < c\), \(n = 187\)) (c). The left and right options were discriminated by a logistic regression with \(P\left( {a_t = 1} \right) = f\left( {w_R{\Delta}E\left[ {\mathrm{Reward}_{t + 1}} \right] + w_I{\Delta}E\left[ {\mathrm{Info}_{t + 1}} \right]} \right)\), where wR and wI indicate the weight parameters. The two lines indicate the discrimination boundaries obtained using the wR and wI estimated from these scattered data and using \(w_R = 1\) and \(w_I = {\sum} {c_t/} N\), the average of the estimated curiosity, respectively. The heatmap represents the probability of selecting the left option based on the estimated wR and wI. d,e, Time series of the intensity of curiosity and the sum of the expected information gains for both options (d), and their cross-correlation (e). f,g, Time series of the temporal derivative of curiosity and the sum of the expected information gains for both options (f), and their cross-correlation (g). The temporal derivative was computed by linear regression within a time window of seven trials.


Further, we noticed an increase in the estimated curiosity around the trials at which the reward probabilities changed suddenly (Fig. 5h). These curiosity dynamics can be interpreted as the rat recognizing the rule change and adaptively controlling the extent to which it sought new information. We further examined how curiosity is regulated by the recognized environmental information. We did not detect a correlation between the estimated curiosity itself and the expected information gains (Fig. 6d,e). However, we found that the temporal derivative of the estimated curiosity was highly correlated with the sum of the expected information gains for both options (Fig. 6f,g). These results imply that the rat actively upregulated its curiosity level when the uncertainty in the recognized reward probabilities increased, that is,

$$\begin{array}{*{20}{c}} {\frac{{\mathrm{d}c}}{{\mathrm{d}t}} \propto \mathop {\sum}\limits_i {E\left[ {\mathrm{Info}_i} \right]} .} \end{array}$$
(6)
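The analysis behind Fig. 6f,g and equation (6) can be sketched as follows, assuming two hypothetical arrays, c_est (the iFEP-estimated curiosity) and info_sum (the per-trial sum of expected information gains); the synthetic inputs below are placeholders.

```python
import numpy as np

def windowed_slope(x, width=7):
    """Temporal derivative of x via linear regression in a sliding
    window of `width` trials (as in Fig. 6f,g)."""
    half = width // 2
    t = np.arange(width) - half
    slopes = np.full(len(x), np.nan)
    for i in range(half, len(x) - half):
        slopes[i] = np.polyfit(t, x[i - half:i + half + 1], 1)[0]
    return slopes

# Placeholder inputs; replace with the iFEP estimates.
rng = np.random.default_rng(0)
info_sum = np.abs(rng.normal(0.5, 0.2, 500))          # sum_i E[Info_i] per trial
c_est = np.cumsum(0.1 * info_sum + 0.01 * rng.standard_normal(500))

dc = windowed_slope(c_est)
mask = ~np.isnan(dc)
print(np.corrcoef(dc[mask], info_sum[mask])[0, 1])    # correlation as in Fig. 6g
```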

Evaluations of alternative models from rat behaviors

Finally, to further confirm the validity of the ReCU model, we compared it with other decision-making models based on the rat behavioral data. As an alternative version of the expected net utility, we introduced the time-dependent desire for reward as

$$\begin{array}{*{20}{c}} {U_t\left( {a_{t + 1}} \right) = d_t \cdot E\left[ {\mathrm{Reward}_{t + 1}} \right] + E\left[ {\mathrm{Info}_{t + 1}} \right],} \end{array}$$
(7)

where dt denotes a meta-parameter describing the subjective reward (see Methods for details). With this alternative model, we estimated the time series of dt by iFEP from the rat behavioral data. We found that the estimated subjective reward meta-parameter dt changed dynamically and sometimes came close to zero when the rat encountered drastic changes in the reward probabilities (Supplementary Fig. 6), which would indicate that the rat suddenly no longer needed rewards. This is unnatural for animals that were food-deprived before the experimental task to motivate them to obtain food. Thus, the alternative model is not suitable for describing the rat behavior.

Another possible model is Q-learning in the framework of RL, which has been widely used to model decision-making tasks. Following previous studies36,37,38, we introduced a time-dependent inverse temperature Bt, which controls the randomness of action selection; unlike the curiosity term in the ReCU model, however, it does not lead to directed information-seeking behavior. With the Q-learning model, we estimated the time series of Bt (Methods). We found that Bt decreased when the reward probabilities suddenly changed (Supplementary Fig. 7), meaning that the rat tended to select actions more randomly in response to the environmental rule change. Although this dynamic behavior of the inverse temperature seems reasonable, it is unknown how it would be regulated in the Q-learning model. Here, we hypothesized that Bt could be regulated by the uncertainty of the recognition. To probe this, we compared Bt, estimated from Q-learning, with the expected information gain, estimated by iFEP in the ReCU model. We found that they are positively correlated (Supplementary Fig. 7) as

$$\begin{array}{*{20}{c}} {B_t \propto \mathop {\sum}\limits_i {E\left[ {\mathrm{Info}_{t,i}} \right]} .} \end{array}$$
(8)

Therefore, Q-learning requires a cue regarding the uncertainty of the recognition, that is, the expected information gain, which supports the ReCU model as an explanation of curiosity-driven behavior.
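For reference, a minimal Q-learning comparison agent of this kind, with softmax action selection governed by an inverse temperature (held fixed here; in the comparison above Bt is estimated per trial), might look like the sketch below; the learning rate, reward probabilities and names are illustrative, not the authors' exact specification.

```python
import numpy as np

def q_update(q, action, reward, lr=0.1):
    """Standard Q-value update for the chosen option."""
    q = q.copy()
    q[action] += lr * (reward - q[action])
    return q

def softmax_choice(q, beta_t, rng):
    """Softmax action selection; beta_t is the inverse temperature
    controlling the randomness of the choice."""
    logits = beta_t * q
    p = np.exp(logits - logits.max())
    p /= p.sum()
    return rng.choice(len(q), p=p)

rng = np.random.default_rng(0)
q = np.zeros(2)
true_p = np.array([0.2, 0.8])   # hypothetical reward probabilities
beta_t = 3.0                    # held fixed here; time-varying in the comparison
for t in range(200):
    a = softmax_choice(q, beta_t, rng)
    r = float(rng.random() < true_p[a])
    q = q_update(q, a, r)
print(q)                        # Q-values approach the reward probabilities
```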

Discussion

The advancement realized in this study is the modeling and decoding of the mental conflict between reward and curiosity, which had not previously been quantified. The proposed approach can potentially improve our understanding of how this mental conflict is regulated and, in the future, highlight its neural mechanisms by incorporating neural recording data.

Our decoding approach has some limitations. iFEP requires behavioral data with many trials in the two-choice task because the particle filter needs a certain number of trials for the estimation to converge. Thus, iFEP is not applicable to short behavioral data. In addition, iFEP assumes gradual changes in curiosity over time and cannot follow pathologically rapid curiosity dynamics in the estimation process of the particle filter.

Comparing the ReCU model with other models, including Q-learning, is difficult. In all these models, action selection is commonly formulated with a sigmoidal function. Thus, any of these models can be made to fit the observed behaviors, that is, to increase the likelihood, by adjusting the time-dependent meta-parameters. Therefore, model selection based solely on likelihood is not very helpful, and determining whether animals use the ReCU model or other models to make decisions seems inherently challenging. However, a potential advantage of the ReCU model is its ability to capture the dynamics of curiosity and the interplay between curiosity and reward-based learning.

There is a related theoretical model that differs from RL and the FEP. Ortega and Braun formulated a free-energy framework that describes irrational decision-making39,40. Interestingly, their formulation was based on microscopic thermodynamics, and a temperature parameter controls this irrationality. However, their thermodynamics-based free energy did not treat the sequential updating of recognition from observations.

Finally, it is worth discussing future perspectives of our iFEP approach in medicine. In general, psychiatric diagnosis relies on medical interviews and is not evaluated quantitatively. Our iFEP method can quantitatively estimate the psychological state of patients based on their behavioral data. For example, patients with social withdrawal, also known as ‘Hikikomori,’ show little interest in anything. In this case, social withdrawal would be characterized by a negative value of curiosity in our FEP model. Therefore, the iFEP method could be an effective aid for diagnosing mental disorders.

Methods

Amount of reward

In the two-choice task, the reward is given in an all-or-none manner; however, its subjective intensity depends on the agent’s own preference as

$$\begin{array}{*{20}{c}} {R = 0\,{\rm{or}}\ln \frac{{P_o}}{{1 - P_o}},} \end{array}$$
(9)

where Po represents the desired probability describing how much the agent wants the reward and controls the reward intensity felt by the agent (Supplementary Fig. 8). In Friston’s formulation, the presence and absence of a reward are given by log preferences in natural units as ln Po and ln(1 − Po), which take negative values22,23. In this equation, the reward is the difference between the log preferences, ensuring that the reward intensity is positive.

State space model for reward probability recognition

The agent assumes that the reward is generated probabilistically from a latent cause w, with the reward probabilities λi represented by

$$\begin{array}{*{20}{c}} {\lambda _i = f\left( {w_i} \right),} \end{array}$$
(10)

where i indexes the options and \(f\left( x \right) = 1/\left( {1 + {\mathrm{e}}^{ - x}} \right)\). In addition, the agent assumes an ambiguous environment in which the reward probabilities fluctuate according to a random walk as

$$w_{i,t} = w_{i,t - 1} + \sigma _w\xi _{i,t},$$
(11)

where t, \(\xi _{i,t}\) and \(\sigma _w\) denote the trial index of the two-choice task, standard Gaussian noise and the noise intensity, respectively. Thus, the agent assumes an environment expressed by an SSM as

$$\begin{array}{*{20}{c}} {P\left( {{{{\mathbf{w}}}}_t|{{{\mathbf{w}}}}_{t - 1}} \right) = {{{\mathcal{N}}}}\left( {{{{\mathbf{w}}}}_t|{{{\mathbf{w}}}}_{t - 1},\sigma _w^2{{{\mathbf{I}}}}} \right),} \end{array}$$
(12)
$$P\left( {o_t|{{{\mathbf{w}}}}_t,{{{\mathbf{a}}}}_t} \right) = \mathop {\prod }\limits_i \left[ {f\left( {w_{i,t}} \right)^{o_t}\left\{ {1 - f\left( {w_{i,t}} \right)} \right\}^{1 - o_t}} \right]^{a_{i,t}},$$
(13)

where wt and ot denote the latent variables controlling the reward probabilities of both options at step t (\({{{\mathbf{w}}}}_t = \left( {w_{1,t},w_{2,t}} \right)^{\mathrm{T}}\)) and the observation of the presence of a reward \(\left( {o_t \in \left\{ {0,1} \right\}} \right)\), respectively. Meanwhile, at denotes the agent’s action at step t, which is represented by a one-hot vector \(\left( {{{{\mathbf{a}}}}_t \in \left\{ {\left( {1,0} \right)^{\mathrm{T}},\left( {0,1} \right)^{\mathrm{T}}} \right\}} \right)\); \({{{\mathcal{N}}}}\left( {{{{\mathbf{x}}}}|{{{\mathbf{\upmu }}}},{\varSigma}} \right)\) denotes the Gaussian distribution with mean μ and covariance Σ; \(\sigma _w^2\) denotes the variance of the transition probability of w; I denotes the identity matrix; and \(f(w_{i,t}) = 1/(1 + {\mathrm{e}}^{ - w_{i,t}})\) represents the reward probability of option i at step t. The initial distribution of w1 is given by \(P\left( {{{{\mathbf{w}}}}_1} \right) = {{{\mathcal{N}}}}\left( {{{{\mathbf{w}}}}_1|0,\kappa {{{\mathbf{I}}}}} \right)\), where κ denotes the variance.
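Under equations (10)–(13), the assumed environment can be simulated in a few lines; the sketch below (with illustrative names and values) draws the latent random walk and samples a Bernoulli reward for a chosen option.

```python
import numpy as np

def simulate_environment(T=1000, sigma_w=0.15, seed=0):
    """Latent causes w follow a random walk (eq. (11)); the reward
    probability of each option is its sigmoid (eq. (10))."""
    rng = np.random.default_rng(seed)
    w = np.zeros((T, 2))
    for t in range(1, T):
        w[t] = w[t - 1] + sigma_w * rng.standard_normal(2)
    reward_prob = 1.0 / (1.0 + np.exp(-w))     # lambda_{i,t} = f(w_{i,t})
    return w, reward_prob

def sample_reward(reward_prob_t, action, rng):
    """Bernoulli observation o_t for the chosen option (eq. (13))."""
    return int(rng.random() < reward_prob_t[action])

rng = np.random.default_rng(1)
w, lam = simulate_environment()
print(lam[10], sample_reward(lam[10], action=0, rng=rng))
```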

FEP for reward probability recognition

We modeled the agent’s recognition process of reward probability using sequential Bayesian updating as

$$P\left( {{{{\mathbf{w}}}}_t|o_{1:t},{{{\mathbf{a}}}}_{1:t}} \right) \propto P\left( {o_t|{{{\mathbf{w}}}}_t,{{{\mathbf{a}}}}_t} \right){\int} {P\left( {{{{\mathbf{w}}}}_t|{{{\mathbf{w}}}}_{t - 1}} \right)P\left( {{{{\mathbf{w}}}}_{t - 1}|{{{{o}}}}_{1:t - 1},{{{\mathbf{a}}}}_{1:t - 1}} \right)\mathrm{d}{{{\mathbf{w}}}}_{t - 1}} .$$
(14)

Because of the non-Gaussian P(ot|wt, at), the posterior distribution of wt, \(P\left( {{{{\mathbf{w}}}}_t|o_{1:t},{{{\mathbf{a}}}}_{1:t}} \right)\), becomes non-Gaussian and cannot be calculated analytically. To avoid this problem, we introduced a simple posterior distribution approximated by a Gaussian distribution:

$$Q\left( {{{{\mathbf{w}}}}_t|\varphi _t} \right) = {{{\mathcal{N}}}}\left( {{{{\mathbf{w}}}}_t|{{{\mathbf{\upmu }}}}_t,{{{\mathbf{{\Lambda}}}}}_t^{ - 1}} \right) {\fallingdotseq} P\left( {{{{\mathbf{w}}}}_t|o_{1:t},{{{\mathbf{a}}}}_{1:t}} \right),$$
(15)

where φt = {μt, Λt}, and μt and Λt denote the mean and precision, respectively \(( {{{{\mathbf{\upmu }}}}_t = \left( {\mu _{1,t},\mu _{2,t}} \right)^{\mathrm{T}};\,{{{\mathbf{{\Lambda}}}}}_{{{{t}}}} = \text{diag}\left( {p_{1,t},p_{2,t}} \right)})\). \(Q\left( {{{{\mathbf{w}}}}_t|\varphi _t} \right)\) denotes the recognition distribution. The model agent aims to update the recognition distribution through φt at each time step by minimizing the surprise, which is defined by \(- \ln P\left( {o_t|o_{1:t - 1}} \right)\). The surprise can be decomposed as follows:

$$\begin{array}{rcl} - \ln P\left( {o_t|o_{1:t - 1}} \right) & = & {\displaystyle{\int}} {Q\left( {{{{\mathbf{w}}}}_t|\varphi _t} \right)\ln \frac{{Q\left( {{{{\mathbf{w}}}}_t|\varphi _t} \right)}}{{P\left( {o_t,{{{\mathbf{w}}}}_t|o_{1:t - 1},{{{\mathbf{a}}}}_{1:t}} \right)}}\mathrm{d}{{{\mathbf{w}}}}_t} \cr \\ && - {{{\mathrm{KL}}}}\left[ {Q\left( {{{{\mathbf{w}}}}_t|\varphi _t} \right)||P\left( {{{{\mathbf{w}}}}_t|o_{1:t},{{{\mathbf{a}}}}_{1:t}} \right)} \right], \end{array}$$
(16)

where \({{{\mathrm{KL}}}}\left[ {q\left( {{{\mathbf{x}}}} \right)||p\left( {{{\mathbf{x}}}} \right)} \right]\) denotes the Kullback–Leibler (KL) divergence between the probability distributions q(x) and p(x). Because of the nonnegativity of KL divergence, the first term is the upper bound of the surprise:

$$\begin{array}{*{20}{c}} {F\left( {o_t,\varphi _t} \right) = {\displaystyle{\int}} {Q\left( {{{{\mathbf{w}}}}_t|\varphi _t} \right)\ln Q\left( {{{{\mathbf{w}}}}_t|\varphi _t} \right)\mathrm{d}{{{\mathbf{w}}}}_t} + {\displaystyle{\int}} {Q\left( {{{{\mathbf{w}}}}_{{{\boldsymbol{t}}}}|\varphi _t} \right)J\left( {o_t,{{{\mathbf{w}}}}_t} \right)\mathrm{d}{{{\mathbf{w}}}}_t,} } \end{array}$$
(17)

which is called the free energy, where \(J\left( {o_t,{{{\mathbf{w}}}}_t} \right) = - \ln P\left( {o_t,{{{\mathbf{w}}}}_t|o_{1:t - 1},{{{\mathbf{a}}}}_{1:t}} \right)\). The first term of the free energy corresponds to the negative entropy of a Gaussian distribution:

$$\begin{array}{*{20}{c}} {F_1 = {\displaystyle{\int}} {Q\left( {{{{\mathbf{w}}}}_{{{{t}}}}|\varphi _t} \right)\ln Q\left( {{{{\mathbf{w}}}}_{{{\boldsymbol{t}}}}|\varphi _t} \right)\mathrm{d}{{{\mathbf{w}}}}_t.} } \end{array}$$
(18)

The second term is approximated as

$$\begin{array}{rcl} F_2 & = & {\displaystyle{\int}} {Q\left( {{{{\mathbf{w}}}}_{{{\boldsymbol{t}}}}|\varphi _t} \right)J\left( {o_t,{{{\mathbf{w}}}}_t} \right)\mathrm{d}{{{\mathbf{w}}}}_t} \cr \\ & \cong & {\displaystyle{\int}} {Q\left( {{{{\mathbf{w}}}}_{{{{t}}}}|\varphi _t} \right)\left\{ {J\left( {o_t,{{{\mathbf{\upmu }}}}_t} \right) + \frac{{\mathrm{d}J}}{{\mathrm{d}w}}\left( {{{{\mathbf{w}}}}_t - {{{\mathbf{\upmu }}}}_t} \right) + \frac{1}{2}\frac{{\mathrm{d}^2J}}{{\mathrm{d}w^2}}\left( {{{{\mathbf{w}}}}_t - {{{\mathbf{\upmu }}}}_t} \right)^2} \right\}\mathrm{d}{{{\mathbf{w}}}}_t} \cr \\ & = & \left. {J\left( {o_t,{{{\mathbf{w}}}}_t} \right)} \right|_{w_t = \mu _t} + \frac{1}{2}\left. {\frac{{\mathrm{d}^2J}}{{\mathrm{d}w^2}}} \right|_{w_t = \mu _t}{{{\mathbf{{\Lambda}}}}}_t^{ - 1}. \end{array}$$
(19)

Note that \(J\left( {o_t,{{{\mathbf{w}}}}_t} \right)\) is expanded by a second-order Taylor series around \({{{\mathbf{\upmu }}}}_t\). At each time step, the agent updates \(\varphi _t\) by minimizing \(F\left( {o_t,\varphi _t} \right)\).

Calculation of free energy

The free energy is derived as follows: F1 simply becomes

$$\begin{array}{*{20}{c}} {F_1 = \frac{1}{2}\ln 2\uppi p_{1,t}^{ - 1} + \frac{1}{2}\ln 2\uppi p_{2,t}^{ - 1} + \text{const.}} \end{array}$$
(20)

For computing F2,

$$\begin{array}{*{20}{c}}{P\left( {o_t,{{{\mathbf{w}}}}_t|o_{1:t - 1},{{{\mathbf{a}}}}_{1:t}} \right) = P\left( {o_t|{{{\mathbf{w}}}}_t,{{{\mathbf{a}}}}_t} \right){\displaystyle{\int}} {P\left( {{{{\mathbf{w}}}}_t|{{{\mathbf{w}}}}_{t - 1}} \right)P\left( {{{{\mathbf{w}}}}_{t - 1}|o_{1:t - 1},{{{\mathbf{a}}}}_{1:t - 1}} \right)\mathrm{d}{{{\mathbf{w}}}}_{t - 1}} }\\ { \cong P\left( {o_t|{{{\mathbf{w}}}}_t,{{{\mathbf{a}}}}_t} \right){\displaystyle{\int}} {P\left( {{{{\mathbf{w}}}}_t|{{{\mathbf{w}}}}_{t - 1}} \right)N\left( {{{{\mathbf{w}}}}_{t - 1}|{{{\mathbf{\upmu }}}}_{t - 1},{{{\mathbf{{\Lambda}}}}}_{{{{\mathrm{t}}}} - 1}^{ - 1}} \right)\mathrm{d}{{{\mathbf{w}}}}_{t - 1}.} }\end{array}$$
(21)

In the second line of this equation, we use the approximated recognition distribution as the previous posterior \(P\left( {{{{\mathbf{w}}}}_{t - 1}|o_{1:t - 1},{{{\mathbf{a}}}}_{1:t - 1}} \right)\) . This equation can be written as

$$P\left( {o_t,{{{\mathbf{w}}}}_t|o_{1:t - 1},{{{\mathbf{a}}}}_{1:t}} \right) \cong \mathop {\prod}\limits_i {\left[ {f\left( {w_{i,t}} \right)^{o_t}\left\{ {1 - f\left( {w_{i,t}} \right)} \right\}^{1 - o_t}} \right]^{a_{i,t}}N\left( {w_{i,t}|\mu _{i,\,t - 1},p_{i,\,t - 1}^{ - 1} + \sigma _w^2} \right).}$$
(22)

Then

$$\begin{array}{*{20}{c}} {\left. {J\left( {o_t,{{{\mathbf{w}}}}_t} \right)} \right|_{w_t = \mu _t} = J_1\left( {o_t,\mu _{1,t}} \right) + J_2\left( {o_t,\mu _{2,t}} \right) + \text{const.},} \end{array}$$
(23)

where

$$\begin{array}{*{20}{c}} \begin{array}{ccccc}\\ J_i\left( {o_t,\,\mu _{i,t}} \right) = & a_{i,t}\left[ {o_t\ln f\left( {\mu _{i,t}} \right) + \left( {1 - o_t} \right)\ln \left\{ {1 - f\left( {\mu _{i,t}} \right)} \right\}} \right]\cr \\ & - \frac{1}{2}\frac{{\left( {\mu _{i,t} - \mu _{i,t - 1}} \right)^2}}{{p_{i,t - 1}^{ - 1} + \sigma _w^2}} - \frac{1}{2}\ln \left( {p_{i,t - 1}^{ - 1} + \sigma _w^2} \right).\\ \end{array} \end{array}$$
(24)

Thus, F2 is calculated by substituting this equation into equation (19). Taken together,

$$\begin{array}{*{20}{c}} {F\left( {o_t,\varphi _t} \right) = \mathop {\sum}\limits_i {\left\{ {J_i\left( {o_t,\mu _{i,t}} \right) + \frac{1}{2}\left. {\frac{{\mathrm{d}^2J_i}}{{\mathrm{d}w_{i,t}^2}}} \right|_{w_{i,t} = \mu _{i,t}}p_{i,t}^{ - 1} + \frac{1}{2}\ln 2\uppi p_{i,t}^{ - 1}} \right\}.} } \end{array}$$
(25)

Sequential updating of the agent’s recognition

The updating rule for φt was derived by minimizing the free energy. The optimized pi,t can be computed by \(\partial F/\partial p_{i,t}^{ - 1} = 0\), which leads to

$$\begin{array}{*{20}{c}} {p_{i,t} = \left. {\frac{{\mathrm{d}^2J_i}}{{\mathrm{d}w_{i,t}^2}}} \right|_{w_{i,t} = \mu _{i,t}}.} \end{array}$$
(26)

By substituting pi,t into equation (25), the second term in the summation becomes constant, irrespective of μi,t. Thus, μi,t is updated by minimizing only the first term as

$$\begin{array}{*{20}{c}} {\mu _{i,t} = \mu _{i,t - 1} - \alpha \delta _{i,a}\left. {\frac{{\partial J_i}}{{\partial \mu _{i,t}}}} \right|_{\mu _{i,t} = \mu _{i,t - 1}},} \end{array}$$
(27)

where α is the learning rate and \(\delta _{i,a}\) equals 1 if option i was selected at step t and 0 otherwise. These two equations finally lead to

$$\begin{array}{*{20}{c}} {\mu _{i,t} = \mu _{i,t - 1} + \alpha K_{i,t}\left( {o_t - f\left( {\mu _{i,t - 1}} \right)} \right),} \end{array}$$
(28)
$$\begin{array}{*{20}{c}} {p_{i,t} = K_{i,t}^{ - 1} + f\left( {\mu _{i,t}} \right)\left( {1 - f\left( {\mu _{i,t}} \right)} \right),} \end{array}$$
(29)

where \(K_{i,t} = \left( {p_{i,t - 1} + \sigma _w^{ - 2}} \right)/\left( {p_{i,t - 1}\sigma _w^{ - 2}} \right)\), which is called the Kalman gain. If option i was not selected, the second terms in both equations vanish, so the belief \(\mu _{i,t}\) stays the same while its precision decreases (that is, \(p_{i,t} < p_{i,t - 1}\)). If it is selected, the belief is updated by the prediction error (that is, \(o_t - f\left( {\mu _{i,t - 1}} \right)\)) and its precision is improved. The confidence of the recognized reward probability should be evaluated not in \(w_{i,t}\) space but in \(\lambda _{i,t}\) space; hence, the confidence is defined by \(\gamma _{i,t} = p_{i,t}/f^\prime \left( {\mu _{i,t}} \right)^2\).
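A direct transcription of equations (28) and (29), together with the confidence γ in reward-probability space, is sketched below in Python; it is our reading of the update rules, with illustrative parameter values, not the authors' code.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def update_recognition(mu, p, action, outcome, alpha, sigma_w):
    """One recognition step (eqs (28)-(29)).
    mu, p: length-2 arrays of belief means and precisions.
    action: index of the chosen option; outcome: 0 or 1."""
    mu, p = mu.copy(), p.copy()
    for i in range(2):
        k = (p[i] + sigma_w ** -2) / (p[i] * sigma_w ** -2)   # Kalman gain K_{i,t}
        if i == action:
            mu[i] = mu[i] + alpha * k * (outcome - sigmoid(mu[i]))
            p[i] = 1.0 / k + sigmoid(mu[i]) * (1.0 - sigmoid(mu[i]))
        else:
            p[i] = 1.0 / k          # unselected: mean unchanged, precision decays
    return mu, p

def confidence(mu, p):
    """Confidence in reward-probability space: gamma = p / f'(mu)^2."""
    fprime = sigmoid(mu) * (1.0 - sigmoid(mu))
    return p / fprime ** 2

mu, p = np.zeros(2), np.ones(2)
mu, p = update_recognition(mu, p, action=0, outcome=1, alpha=0.05, sigma_w=0.63)
print(mu, p, confidence(mu, p))
```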

Expected net utility

The expected net utility is described by

$$\begin{array}{*{20}{c}}{U_t\left( {{{{\mathbf{a}}}}_{t + 1}} \right) = c_t \cdot E_{P\left( {o_{t + 1}|{{{\mathbf{a}}}}_{t + 1}} \right)}\left[ {{{{\mathrm{KL}}}}\left[ {Q\left( {{{{\mathbf{w}}}}_{t + 1}|o_{t + 1},{{{\mathbf{a}}}}_{t + 1}} \right)||Q\left( {{{{\mathbf{w}}}}_{t + 1}|{{{\mathbf{a}}}}_{t + 1}} \right)} \right]} \right]}\\ { + E_{P\left( {o_{t + 1}|{{{\mathbf{a}}}}_{t + 1}} \right)}\left[ {R\left( {o_{t + 1}} \right)} \right],}\end{array}$$
(30)

where \(R\left( {o_{t + 1}} \right) = o_{t + 1}\ln \left( {P_o/\left( {1 - P_o} \right)} \right)\) ; the first and second terms represent the expected information gain and expected reward, respectively; and ct denotes the intensity of curiosity at time t. The heuristic idea of introducing the curiosity meta-parameter ct was also proposed in the RL field30.

We briefly show how to derive it based on ref. 41. The free energy at current time t is described by

$$\begin{array}{*{20}{c}} {F\left( {o_t,\varphi _t} \right) = E_{Q\left( {{{{\mathbf{w}}}}_t|\varphi _t} \right)}\left[ {\ln Q\left( {{{{\mathbf{w}}}}_t|\varphi _t} \right) - \ln P\left( {o_t,{{{\mathbf{w}}}}_t|o_{1:t - 1},{{{\mathbf{a}}}}_{1:t}} \right)} \right],} \end{array}$$
(31)

which is a rewriting of equation (17). Here, we attempt to express the future free energy at time t + 1, conditioned on action at+1, as

$$\begin{array}{*{20}{c}} {F\left( {o_{t + 1},{{{\mathbf{a}}}}_{t + 1}} \right) = E_{Q\left( {{{{\mathbf{w}}}}_{t + 1}|\varphi _t} \right)}\left[ {\ln Q\left( {{{{\mathbf{w}}}}_{t + 1}|\varphi _t} \right) - \ln P\left( {o_{t + 1},{{{\mathbf{w}}}}_{t + 1}|o_{1:t},{{{\mathbf{a}}}}_{1:t + 1}} \right)} \right].} \end{array}$$
(32)

However, there is an apparent problem in this equation: ot+1 has not yet been observed because it is a future observation. To resolve this issue, we take an expectation with respect to ot+1 as

$$\begin{array}{l} F\left( {{{{\mathbf{a}}}}_{t + 1}} \right) = E_{Q\left( {o_{{{{\boldsymbol{t}}}} + 1}|{{{\mathbf{w}}}}_{t + 1},{{{\mathbf{a}}}}_{t + 1}} \right)Q\left( {{{{\mathbf{w}}}}_{{{{\boldsymbol{t}}}} + 1}|\varphi _t} \right)} \\ \qquad \qquad\left[ {\ln Q\left( {{{{\mathbf{w}}}}_{t + 1}|\varphi _t} \right) - \ln P\left( {o_{t + 1},{{{\mathbf{w}}}}_{t + 1}|o_{1:t},{{{\mathbf{a}}}}_{1:t + 1}} \right)} \right]. \end{array}$$
(33)

This can be rewritten as

$$\begin{array}{lll} F\left( {{{{\mathbf{a}}}}_{t + 1}} \right) = E_{Q\left( {o_{t + 1},{{{\mathbf{w}}}}_{t + 1}|o_{1:t},{{{\mathbf{a}}}}_{t + 1}} \right)}\left[ \ln Q\left( {{{{\mathbf{w}}}}_{t + 1}|\varphi _t} \right)\right.\\ \qquad\qquad\quad\left.{ - \ln P\left( {{{{\mathbf{w}}}}_{t + 1}|o_{1:t + 1},{{{\mathbf{a}}}}_{1:t + 1}} \right) - \ln P\left( {o_{t + 1}|o_{1:t},{{{\mathbf{a}}}}_{1:t + 1}} \right)} \right]. \end{array}$$
(34)

Although \(P\left( {o_{t + 1}|o_{1:t},{{{\mathbf{a}}}}_{1:t + 1}} \right)\) can be computed by the generative model as

$$\begin{array}{*{20}{c}}{P\left( {o_{t + 1}|o_{1:t},{{{\mathbf{a}}}}_{1:t + 1}} \right) = {\displaystyle{\int}} {P\left( {o_{t + 1}|{{{\mathbf{w}}}}_{t + 1},{{{\mathbf{a}}}}_{1:t + 1}} \right)P\left( {{{{\mathbf{w}}}}_{t + 1}|o_{1:t}} \right){\mathrm{d}}{{{\mathbf{w}}}}_{t + 1}} }\\ { \cong {\displaystyle{\int}} {P\left( {o_{t + 1}|{{{\mathbf{w}}}}_{t + 1},{{{\mathbf{a}}}}_{1:t + 1}} \right)Q\left( {{{{\mathbf{w}}}}_{t + 1}|\varphi _t} \right){\mathrm{d}}{{{\mathbf{w}}}}_{t + 1},} }\end{array}$$
(35)

it is heuristically replaced with \(P\left( {o_{t + 1}} \right)\), the agent's prior preference of observing \(o_{t + 1}\), so that \(\ln P\left( {o_{t + 1}} \right)\) can be treated as an instrumental reward. Equation (34) is then further transformed into

$$\begin{array}{lll} F\left( {{{{\mathbf{a}}}}_{t + 1}} \right) = E_{Q\left( {o_{t + 1}|{{{\mathbf{a}}}}_{t + 1}} \right)Q\left( {{{{\mathbf{w}}}}_{t + 1}|o_{1:t + 1},{{{\mathbf{a}}}}_{t + 1}} \right)}\left[ \ln Q\left( {{{{\mathbf{w}}}}_{t + 1}|\varphi _t} \right)\right.\\ \qquad\qquad\quad \left.{ - \ln Q\left( {{{{\mathbf{w}}}}_{t + 1}|o_{1:t + 1},{{{\mathbf{a}}}}_{1:t + 1}} \right) - \ln P\left( {o_{t + 1}} \right)} \right]. \end{array}$$
(36)

Finally, we obtained the so-called expected free energy as

$$\begin{array}{*{20}{c}} {F\left( {{{{\mathbf{a}}}}_{t + 1}} \right) = E_{Q\left( {o_{t + 1}|{{{\mathbf{a}}}}_{t + 1}} \right)}\left[ { - {{{\mathrm{KL}}}}\left[ {Q\left( {{{{\mathbf{w}}}}_{t + 1}|o_{t + 1},{{{\mathbf{a}}}}_{t + 1}} \right)||Q\left( {{{{\mathbf{w}}}}_{t + 1}} \right)} \right] - \ln P\left( {o_{t + 1}} \right)} \right].} \end{array}$$
(37)

where the KL divergence is called the Bayesian surprise, representing the extent to which the agent’s beliefs are updated by the observation, whereas the second term \(\ln P\left( {o_{t + 1}} \right)\) can be treated as the agent’s prior preference of observing \(o_{t + 1}\), which can be interpreted as an instrumental reward. Therefore, the expected free energy is derived in an ad hoc manner.

In the first term of equation (30), the posterior and prior distributions of \({{{\mathbf{w}}}}_{t + 1}\) in the KL divergence are derived as

$$\begin{array}{*{20}{c}} {Q\left( {{{{\mathbf{w}}}}_{t + 1}} \right) = {\displaystyle{\int}} {P\left( {{{{\mathbf{w}}}}_{t + 1}|{{{\mathbf{w}}}}_t} \right)Q\left( {{{{\mathbf{w}}}}_t} \right){\mathrm{d}}{{{\mathbf{w}}}}_t} ,} \end{array}$$
(38)
$$\begin{array}{*{20}{c}} {Q\left( {{{{\mathbf{w}}}}_{t + 1}|o_{t + 1,}{{{\mathbf{a}}}}_{t + 1}} \right) = \frac{{P\left( {o_{t + 1}|{{{\mathbf{w}}}}_{t + 1},{{{\mathbf{a}}}}_{t + 1}} \right)Q\left( {{{{\mathbf{w}}}}_{{{{\boldsymbol{t}}}} + 1}|{{{\mathbf{a}}}}_{t + 1}} \right)}}{{P\left( {o_{t + 1}|{{{\mathbf{a}}}}_{t + 1}} \right)}},} \end{array}$$
(39)

respectively. \(o_{t + 1}\) is a future observation; therefore, the first term is averaged over \(P\left( {o_{t + 1}|{{{\mathbf{a}}}}_{t + 1}} \right)\), which can be calculated as

$$\begin{array}{*{20}{c}} {P\left( {o_{t + 1}|{{{\mathbf{a}}}}_{t + 1}} \right) = {\displaystyle{\int}} {P\left( {o_{t + 1}|{{{\mathbf{w}}}}_{t + 1},{{{\mathbf{a}}}}_{t + 1}} \right)Q\left( {{{{\mathbf{w}}}}_{t + 1}|{{{\mathbf{a}}}}_{t + 1}} \right){\mathrm{d}}{{{\mathbf{w}}}}_{t + 1}.} } \end{array}$$
(40)

In the second term, the reward is quantitatively interpreted as the desired probability of \(o_{t + 1}\). For the two-choice task, we use

$$\begin{array}{*{20}{c}} {P\left( {o_{t + 1}} \right) = P_o^{o_{t + 1}}\left( {1 - P_o} \right)^{\left( {1 - o_{t + 1}} \right)},} \end{array}$$
(41)

where Po indicates the desired probability of the presence of a reward. According to the probabilistic interpretation of reward22,23, the presence and absence of a reward can be evaluated by ln Po and ln(1 − Po), respectively.

In this study, we changed the sign of the expected free energy to define the expected net utility, because we modeled decision-making as the maximization of the expected net utility rather than the minimization of the expected free energy. We then introduced the curiosity meta-parameter ct to express irrational decision-making. Because rewards are relative, in this study we set \({{{\mathrm{ln}}}}\left\{ {P_o/\left( {1 - P_o} \right)} \right\}\) and 0 for the presence and absence of a reward, respectively. Thus, our expected net utility is described by equation (30), and there is therefore a gap between the original expected free energy and our expected net utility.

Model for action selection

The agent probabilistically selects a choice with higher expected net utility as

$$\begin{array}{*{20}{c}} {P\left( {{{{\mathbf{a}}}}_{t + 1}} \right) = \frac{{\exp \left( {\beta U\left( {{{{\mathbf{a}}}}_{t + 1}} \right)} \right)}}{{\mathop {\sum}\nolimits_{{{\mathbf{a}}}} {\exp \left( {\beta U\left( {{{\mathbf{a}}}} \right)} \right)} }},} \end{array}$$
(42)

where \(U\left( {{{{\mathbf{a}}}}_{t + 1}} \right)\) indicates the expected net utility of action at+1. Equation (42) is equivalent to equation (2). To derive equation (42), we considered the expectation of the expected net utility with respect to probabilistic action as

$$\begin{array}{*{20}{c}} {E\left[ U \right] = E_{Q\left( {{{{\mathbf{a}}}}_{t + 1}} \right)}\left[ {U\left( {{{{\mathbf{a}}}}_{t + 1}} \right) - \beta ^{ - 1}\ln Q\left( {{{{\mathbf{a}}}}_{t + 1}} \right)} \right],} \end{array}$$
(43)

where β indicates an inverse temperature, and the entropic constraint of action probability is introduced in the second term. This equation can be rewritten as

$$\begin{array}{*{20}{c}} {E\left[ U \right] = - \beta ^{ - 1}\text{KL}\left[ {Q\left( {{{{\mathbf{a}}}}_{t + 1}} \right)\Vert\exp \left( {\beta U\left( {{{{\mathbf{a}}}}_{t + 1}} \right)} \right)/Z} \right] + \beta ^{ - 1}\ln Z,} \end{array}$$
(44)

where Z indicates a normalization constant. Thus, maximization of equation (44) with respect to \(Q\left( {{{{\mathbf{a}}}}_{t + 1}} \right)\) leads to the optimal action probability shown in equation (42).
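
As a concrete illustration of equation (42), the softmax action selection can be sketched in a few lines of Python. This is a minimal sketch rather than the implementation used in this study; the expected net utilities and the inverse temperature β are simply supplied as numbers.

```python
import numpy as np

def action_probabilities(utilities, beta):
    """Softmax over expected net utilities (equation (42))."""
    logits = beta * np.asarray(utilities, dtype=float)
    logits -= logits.max()                  # subtract the maximum for numerical stability
    p = np.exp(logits)
    return p / p.sum()

# Example: two options whose expected net utilities differ by 0.5
print(action_probabilities([1.0, 0.5], beta=2.0))  # approximately [0.73, 0.27]
```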

Alternative expected net utility

For comparison, we consider an alternative expected net utility by introducing a time-dependent meta-parameter in the second term as follows:

$$\begin{array}{*{20}{c}}{U\left( {{{{\mathbf{a}}}}_{t + 1}} \right) = E_{P\left( {o_{t + 1}|{{{\mathbf{a}}}}_{t + 1}} \right)}\left[ {{{{\mathrm{KL}}}}\left[ {Q\left( {{{{\mathbf{w}}}}_{t + 1}|o_{t + 1},{{{\mathbf{a}}}}_{t + 1}} \right)||Q\left( {{{{\mathbf{w}}}}_{t + 1}|{{{\mathbf{a}}}}_{t + 1}} \right)} \right]} \right]}\\ { +\, d_t \cdot E_{P\left( {o_{t + 1}|{{{\mathbf{a}}}}_{t + 1}} \right)}\left[ {R\left( {o_{t + 1}} \right)} \right],}\end{array}$$
(45)

where dt denotes the subjective intensity of reward at time t. In this case, an agent with a high dt shows more exploitative behavior, whereas an agent with dt = 0 shows explorative behavior driven only by the expected information gain.

Calculation of expected net utility

Here, we present the calculation of the expected net utility. The KL divergence in the first term of equation (30) can be transformed into

$$\begin{array}{*{20}{c}} {E_{P\left( {o_{t + 1}|{{{\mathbf{a}}}}_{t + 1}} \right)}\left[ {\text{KL}\left[ {Q\left( {{{{\mathbf{w}}}}_{t + 1}|o_{t + 1},{{{\mathbf{a}}}}_{t + 1}} \right)||Q\left( {{{{\mathbf{w}}}}_{t + 1}|{{{\mathbf{a}}}}_{t + 1}} \right)} \right]} \right] = H\left( {o_{t + 1}} \right) - H\left( {o_{t + 1}|{{{\mathbf{w}}}}_{t + 1}} \right),} \end{array}$$
(46)

where \(H\left( {o_{t + 1}} \right)\) and \(H\left( {o_{t + 1}|{{{\mathbf{w}}}}_{t + 1}} \right)\) denote the marginal and conditional entropies, respectively:

$$\begin{array}{*{20}{c}} {H\left( {o_{t + 1}|{{{\mathbf{w}}}}_{t + 1}} \right) = {{{\mathrm{E}}}}_{P\left( {o_{t + 1}|{{{\mathbf{w}}}}_{t + 1},{{{\mathbf{a}}}}_{t + 1}} \right)Q\left( {{{{\mathbf{w}}}}_{t + 1}|{{{\boldsymbol{a}}}}_{t + 1}} \right)}\left[ { - \ln P\left( {o_{t + 1}|{{{\mathbf{w}}}}_{t + 1},{{{\mathbf{a}}}}_{t + 1}} \right)} \right],} \end{array}$$
(47)
$$\begin{array}{*{20}{c}} {H\left( {o_{t + 1}} \right) = {{{\mathrm{E}}}}_{P\left( {o_{t + 1}|{{{\mathbf{a}}}}_{t + 1}} \right)}\left[ { - \ln P\left( {o_{t + 1}|{{{\mathbf{a}}}}_{t + 1}} \right)} \right].} \end{array}$$
(48)

The conditional entropy \({{{\mathrm{H}}}}\left( {o_{t + 1}|{{{\mathbf{w}}}}_{t + 1}} \right)\) can be calculated by substituting equation (13) into equation (47) as

$$H\left( {o_{t + 1}|{{{\mathbf{w}}}}_{t + 1}} \right) = - E_{Q\left( {{{{\mathbf{w}}}}_{t + 1}|{{{\mathbf{a}}}}_{t + 1}} \right)}\left[ {\mathop {\sum}\nolimits_i {a_{i,t + 1}g\left( {w_{i,t + 1}} \right)} } \right],$$
(49)

where

$$\begin{array}{*{20}{c}} {g\left( w \right) = f\left( w \right)\ln f\left( w \right) + \left( {1 - f\left( w \right)} \right)\ln \left( {1 - f\left( w \right)} \right).} \end{array}$$
(50)

Here, we approximately calculate this expectation using a second-order Taylor expansion of \(g\left( {w_{i,t + 1}} \right)\) around \(\mu _{i,t + 1}\):

$$\begin{array}{*{20}{c}} {{{{\mathrm{H}}}}\left( {o_{t + 1}|{{{\mathbf{w}}}}_{t + 1}} \right) \cong - E_{Q\left( {{{{\mathbf{w}}}}_{t + 1}{{{\mathrm{|}}}}{{{\mathbf{a}}}}_{t + 1}} \right)}\left[ {\mathop {\sum}\limits_i {a_{i,t + 1}\left\{ {\begin{array}{*{20}{c}} {g\left( {\mu _{i,t + 1}} \right) + \frac{{\partial g}}{{\partial w_{i,t + 1}}}\left( {w_{i,t + 1} - \mu _{i,t + 1}} \right)} \\ { + \frac{1}{2}\frac{{\partial ^2g}}{{\partial w_{i,t + 1}^2}}\left( {w_{i,t + 1} - \mu _{i,t + 1}} \right)^2} \end{array}} \right\}} } \right],} \end{array}$$
(51)

which leads to

$$\begin{array}{*{20}{c}} {{{\mathrm{H}}}}\left( {o_{t + 1}|{{{\mathbf{w}}}}_{t + 1}} \right) = \\- \mathop {\sum}\limits_i {a_{i,t + 1}\left[ {\begin{array}{*{20}{c}} {f\left( {\mu _{i,t + 1}} \right){\mathrm{ln}}f\left( {\mu _{i,t + 1}} \right) + \left( {1 - f\left( {\mu _{i,t + 1}} \right)} \right){\mathrm{ln}}\left( {1 - f\left( {\mu _{i,t + 1}} \right)} \right)} \\ { + \frac{1}{2}\left\{ {f\left( {\mu _{i,t + 1}} \right)\left( {1 - f\left( {\mu _{i,t + 1}} \right)} \right)\left( {1 + \left( {1 - 2f\left( {\mu _{i,t + 1}} \right)} \right){{{\mathrm{ln}}}}\frac{{f\left( {\mu _{i,t + 1}} \right)}}{{1 - f\left( {\mu _{i,t + 1}} \right)}}} \right)} \right\}} \\ {\left( {p_{i,t}^{ - 1} + p_w^{ - 1}} \right)} \end{array}} \right]} . \end{array}$$
(52)

The marginal entropy \({{{\mathrm{H}}}}\left( {o_{t + 1}} \right)\) can be calculated as

$$\begin{array}{*{20}{c}}{{{{\mathrm{H}}}}\left( {o_{t + 1}} \right) = - \mathop {\sum}\limits_i {a_{i,t + 1}\left\{ {P\left( {o_{t + 1} = 0|{{{\mathbf{a}}}}_{t + 1}} \right)\ln P\left( {o_{t + 1} = 0|{{{\mathbf{a}}}}_{t + 1}} \right)} \right.} }\\ {\left. { + P\left( {o_{t + 1} = 1|{{{\mathbf{a}}}}_{t + 1}} \right)\ln P\left( {o_{t + 1} = 1|{{{\mathbf{a}}}}_{t + 1}} \right)} \right\},}\end{array}$$
(53)

where

$$\begin{array}{ccc}\\ & P\left( {o_{t + 1}|{{{\mathbf{a}}}}_{t + 1}} \right) = {\displaystyle{\int}} {P\left( {o_{t + 1}|{{{\mathbf{w}}}}_{t + 1},{{{\mathbf{a}}}}_{t + 1}} \right)Q\left( {{{{\mathbf{w}}}}_{t + 1}|{{{\mathbf{a}}}}_{t + 1}} \right){\mathrm{d}}{{{\mathbf{w}}}}_{t + 1}} \cr \\ & = {\displaystyle{\int}} {\mathop {\prod }\limits_i \left\{ {f\left( {w_{i,t + 1}} \right)^{o_{t + 1}}\left( {1 - f\left( {w_{i,t + 1}} \right)} \right)^{1 - o_{t + 1}}} \right\}^{a_{i,t + 1}}Q\left( {{{{\mathbf{w}}}}_{t + 1}|{{{\mathbf{a}}}}_{t + 1}} \right){\mathrm{d}}{{{\mathbf{w}}}}_{t + 1}} \cr \\ & = \mathop {\prod }\limits_i \left[ {\begin{array}{*{20}{c}} {f\left( {\mu _{i,t + 1}} \right)^{o_{t + 1}}\left( {1 - f\left( {\mu _{i,t + 1}} \right)} \right)^{1 - o_{t + 1}}} \\ { + 1^{o_{t + 1}}\left( { - 1} \right)^{1 - o_{t + 1}}\frac{1}{2}f\left( {\mu _{i,t + 1}} \right)\left\{ {1 - f\left( {\mu _{i,t + 1}} \right)} \right\}\left\{ {1 - 2f\left( {\mu _{i,t + 1}} \right)} \right\}\left( {p_{i,t}^{ - 1} + p_w^{ - 1}} \right)} \end{array}} \right]^{a_{i,t + 1}}.\\ \end{array}$$
(54)

The second term of the expected net utility (equation (36)) is calculated as

$$\begin{array}{ccccc}\\ & E_{P\left( {o_{t + 1}|{{{\mathbf{a}}}}_{t + 1}} \right)}\left[ {{{{\mathrm{ln}}}}P\left( {o_{t + 1}} \right)} \right]= E_{P\left( {o_{t + 1}{{{\mathrm{|}}}}{{{\mathbf{a}}}}_{t + 1}} \right)}\left[ {o_{t + 1}{{{\mathrm{ln}}}}P_o + \left( {1 - o_{t + 1}} \right)\ln \left( {1 - P_o} \right)} \right]\cr \\ & = P\left( {o_{t + 1} = 0|{{{\mathbf{a}}}}_{t + 1}} \right)\ln \left( {1 - P_o} \right) + P\left( {o_{t + 1} = 1|{{{\mathbf{a}}}}_{t + 1}} \right)\ln P_o.\\ \end{array}$$
(55)
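
The quantities in equations (46)–(55) can be assembled into a short numerical sketch. The following Python code is illustrative only: it assumes that the observation model f of equation (13) is a logistic sigmoid (equation (13) is not reproduced in this section), uses the unit reward intensity described above and places the curiosity meta-parameter on the information-gain term in line with equation (76); the helper names are hypothetical.

```python
import numpy as np

def f(w):
    """Assumed logistic observation model P(o = 1 | w); equation (13) is not shown here."""
    return 1.0 / (1.0 + np.exp(-w))

def conditional_entropy(mu, p, p_w, a):
    """Second-order approximation of H(o_{t+1} | w_{t+1}), equation (52)."""
    fm = f(mu)
    g0 = fm * np.log(fm) + (1 - fm) * np.log(1 - fm)
    curv = 0.5 * fm * (1 - fm) * (1 + (1 - 2 * fm) * np.log(fm / (1 - fm)))
    return -np.sum(a * (g0 + curv * (1.0 / p + 1.0 / p_w)))

def predictive_prob(mu, p, p_w, a):
    """P(o_{t+1} = 1 | a_{t+1}) from equation (54), for a one-hot action a."""
    fm = f(mu)
    corr = 0.5 * fm * (1 - fm) * (1 - 2 * fm) * (1.0 / p + 1.0 / p_w)
    return np.prod((fm + corr) ** a)

def expected_net_utility(mu, p, p_w, a, c_t):
    """U(a) = E[reward] + c_t * E[information gain], with unit reward intensity."""
    p1 = predictive_prob(mu, p, p_w, a)
    p0 = 1.0 - p1
    marginal_H = -(p0 * np.log(p0) + p1 * np.log(p1))            # equation (53)
    info_gain = marginal_H - conditional_entropy(mu, p, p_w, a)  # equation (46)
    expected_reward = p1 * 1.0                                   # ln{Po/(1-Po)} = 1 for presence, 0 for absence
    return expected_reward + c_t * info_gain

# Example: two options, both with recognized mean 0, precision 5 and precision parameter p_w = 100
mu, p, p_w = np.array([0.0, 0.0]), np.array([5.0, 5.0]), 100.0
print(expected_net_utility(mu, p, p_w, a=np.array([1.0, 0.0]), c_t=2.0))
```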

Observer-SSM

We constructed the observer-SSM, which describes the temporal transition of the latent internal state z of the agent and the generation of actions, from the viewpoint of an observer of the agent. This is depicted graphically in Fig. 4. As prior information, we assumed that the agent acts based on its internal state, that is, the intensity of curiosity, the recognized reward probabilities and their confidence levels. The intensity of curiosity was assumed to change over time as a random walk:

$$\begin{array}{*{20}{c}} {c_t = c_{t - 1} + {\it{\epsilon }}\zeta _t,} \end{array}$$
(56)

where \(\zeta _t\) denotes white noise with zero mean and unit variance, and \({\it{\epsilon }}\) denotes its noise intensity. The other internal states, that is, \(\mu _i\) and \(p_i\), were assumed to be updated according to equations (28) and (29). The transition of the internal state is expressed by the probability distribution

$$\begin{array}{*{20}{c}} {P\left( {{{{\mathbf{z}}}}_t|{{{\mathbf{z}}}}_{t - 1}} \right) = {{{\mathcal{N}}}}\left( {{{{\mathbf{z}}}}_t|{{{\mathbf{F}}}}\left( {{{{\mathbf{z}}}}_{t - 1},{{{\mathbf{a}}}}_{t - 1}} \right),{{{{{\varGamma}}}}}} \right),} \end{array}$$
(57)
$${{{\mathbf{F}}}}\left( {{{{\mathbf{z}}}}_{t - 1},{{{\mathbf{a}}}}_{t - 1}} \right) = \left[ {\begin{array}{*{20}{c}} 0 \\ {h\left( {\mu _{1,t - 1},p_{1,t - 1},o_t,a_1} \right)} \\ {h\left( {\mu _{2,t - 1},p_{2,t - 1},o_t,a_2} \right)} \\ {k\left( {\mu _{1,t - 1},p_{1,t - 1},a_1} \right)} \\ {k\left( {\mu _{2,t - 1},p_{2,t - 1},a_2} \right)} \end{array}} \right],$$
(58)

where \({{{\mathbf{z}}}}_t = \left( {c_t,{{{\mathbf{\mu }}}}_t^{\mathrm{T}},{{{\mathbf{p}}}}_t^{\mathrm{T}}} \right)^{{{\mathrm{T}}}}\) and \({{{{{\varGamma}}}}} = {\it{\epsilon }}^2\text{diag}(1,0,0,0,0)\). \(h\left( {\mu _{i,t - 1},p_{i,t - 1},o_t,a_i} \right)\) and \(k\left( {\mu _{i,t - 1},p_{i,t - 1},a_i} \right)\) represent the right-hand sides of equations (28) and (29), respectively; \({{{{{\varGamma}}}}}\) denotes the variance–covariance matrix, and \(\text{diag}\left( {{{\mathbf{x}}}} \right)\) denotes the square matrix whose diagonal components are given by \({{{\mathbf{x}}}}\). In addition, the agent was assumed to select an action \({{{\mathbf{a}}}}_{t + 1}\) based on the expected net utilities, as follows:

$$\begin{array}{*{20}{c}} {P\left( {{{{\mathbf{a}}}}_{t + 1}} \right) = \frac{{{{{\mathrm{exp}}}}(\beta U({{{\mathbf{a}}}}_{t + 1}))}}{{\mathop {\sum}\nolimits_{{{\mathbf{a}}}} {\exp \left( {\beta U\left( {{{\mathbf{a}}}} \right)} \right)} }},} \end{array}$$
(59)

and the reward was obtained by the following probability distribution:

$$\begin{array}{*{20}{c}} {P\left( {o_t|{{{\mathbf{a}}}}_t} \right) = \mathop {\prod }\limits_i \left\{ {\lambda _{i,t}^{o_t}(1 - \lambda _{i,t})^{1 - o_t}} \right\}^{a_{i,t}}.} \end{array}$$
(60)
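
The observer-SSM transition of equations (56)–(58) can be sketched as a single update function. In this hypothetical sketch, h and k are placeholder callables standing in for the right-hand sides of equations (28) and (29), which are not reproduced in this section, and rng is a NumPy random generator.

```python
import numpy as np

def observer_ssm_step(z, a, o, eps, h, k, rng):
    """One transition of the observer-SSM (equations (56)-(58)).

    z = (c, mu1, mu2, p1, p2); a is the one-hot action and o the observed outcome.
    h and k are placeholders for the right-hand sides of equations (28) and (29).
    """
    c, mu1, mu2, p1, p2 = z
    c_next = c + eps * rng.standard_normal()              # random walk on curiosity, equation (56)
    mu_next = (h(mu1, p1, o, a[0]), h(mu2, p2, o, a[1]))  # recognized reward probabilities
    p_next = (k(mu1, p1, a[0]), k(mu2, p2, a[1]))         # confidence (precision) updates
    return np.array([c_next, *mu_next, *p_next])
```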

Q-learning in the two-choice task and its observer-SSM

The decision-making in the two-choice task was also modeled by Q-learning. Reward prediction for the ith option Qi,t is updated as

$$\begin{array}{*{20}{c}} {Q_{i,t} = Q_{i,t - 1} + \alpha _{t - 1}\left( {r_ta_{i,t - 1} - Q_{i,t - 1}} \right),} \end{array}$$
(61)

where \(\alpha _t\) indicates the learning rate at trial t. The agent selects an action following a softmax function:

$$\begin{array}{*{20}{c}} {P\left( {a_{i,t} = 1} \right) = \frac{{\exp \left( {B_tQ_{i,t}} \right)}}{{\mathop {\sum}\nolimits_i {\exp \left( {B_tQ_{i,t}} \right)} }},} \end{array}$$
(62)

where Bt indicates the inverse temperature at trial t, which controls the randomness of the action selection.
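
A minimal sketch of one Q-learning trial, combining the value update of equation (61) with the softmax choice of equation (62), is given below. Read literally, equation (61) also decays the value of the unchosen option toward zero, which the vectorized update reproduces; the function signature is hypothetical.

```python
import numpy as np

def q_learning_trial(Q, a_prev, r, alpha_t, B_t, rng):
    """Value update (equation (61)) followed by softmax action selection (equation (62))."""
    Q = Q + alpha_t * (r * a_prev - Q)          # vectorized over the two options
    logits = B_t * Q
    p = np.exp(logits - logits.max())
    p /= p.sum()
    choice = rng.choice(len(Q), p=p)
    a = np.eye(len(Q))[choice]                  # one-hot action for the current trial
    return Q, a, p

rng = np.random.default_rng(0)
Q, a, p = q_learning_trial(np.zeros(2), np.array([1.0, 0.0]), r=1, alpha_t=0.05, B_t=3.0, rng=rng)
```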

The time-dependent parameters αt and Bt in Q-learning were estimated from behavioral data37. These parameters were assumed to change temporally as a random walk:

$$\begin{array}{*{20}{c}} {\theta _t = \theta _{t - 1} + {\it{\epsilon }}_\theta \zeta _{\theta ,t},} \end{array}$$
(63)

where \(\theta \in \left\{ {\alpha ,B} \right\}\), \(\zeta _{\theta ,t}\) denotes white noise with zero mean and unit variance, and \({\it{\epsilon }}_\theta\) denotes its noise intensity. Thus, the transition of the internal state is expressed by the probability distribution

$$\begin{array}{*{20}{c}} {P\left( {{{{\mathbf{z}}}}_t|{{{\mathbf{z}}}}_{t - 1}} \right) = {{{\mathcal{N}}}}\left( {{{{\mathbf{z}}}}_t|{{{\mathbf{F}}}}\left( {{{{\mathbf{z}}}}_{t - 1},{{{\mathbf{a}}}}_{t - 1}} \right),{{{{{\Gamma}}}}}} \right),} \end{array}$$
(64)
$$\begin{array}{*{20}{c}} {\mathbf{F}\left( {{{{\mathbf{z}}}}_{t - 1},{{{\mathbf{a}}}}_{t - 1}} \right) = \left[ {\begin{array}{*{20}{c}} 0 \\ 0 \\ {h_1\left( {\alpha _t,Q_{1,t - 1},{{{\mathbf{a}}}}_{t - 1}} \right)} \\ {h_2\left( {\alpha _t,Q_{2,t - 1},{{{\mathbf{a}}}}_{t - 1}} \right)} \end{array}} \right],} \end{array}$$
(65)

where \({{{\mathbf{z}}}}_t = \left( {\alpha _t,B_t,Q_{1,t},Q_{2,t}} \right)^{{{\mathrm{T}}}}\) and \({{{{{\Gamma}}}}} = {\it{\epsilon }}^2{\mathrm{diag}}\left( {1,1,0,0} \right)\). \(h_i\left( {\alpha _t,Q_{i,t - 1},{{{\mathbf{a}}}}_{t - 1}} \right)\) represents the right-hand side of equation (61); \({{{{{\Gamma}}}}}\) denotes the variance–covariance matrix, and \(\text{diag}\left( {{{\mathbf{x}}}} \right)\) denotes the square matrix whose diagonal components are given by \({{{\mathbf{x}}}}\).
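
For completeness, the corresponding Q-learning observer-SSM transition (equations (63)–(65)) can be sketched as follows. The handling of the learning-rate index inside the Q update is one plausible reading of equations (61) and (65), and the per-parameter noise intensities are written out explicitly; the function is a sketch, not the authors' implementation.

```python
import numpy as np

def q_observer_ssm_step(z, a_prev, r, eps_alpha, eps_B, rng):
    """One transition of the Q-learning observer-SSM (equations (63)-(65)).

    z = (alpha, B, Q1, Q2): alpha and B follow independent random walks, whereas
    the Q values are updated deterministically through equation (61).
    """
    alpha, B, Q1, Q2 = z
    alpha_next = alpha + eps_alpha * rng.standard_normal()   # equation (63) for alpha
    B_next = B + eps_B * rng.standard_normal()               # equation (63) for B
    Q1_next = Q1 + alpha * (r * a_prev[0] - Q1)              # equation (61)
    Q2_next = Q2 + alpha * (r * a_prev[1] - Q2)
    return np.array([alpha_next, B_next, Q1_next, Q2_next])
```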

iFEP by particle filter and Kalman backward algorithm

Based on the observer-SSM, we estimated the posterior distribution of the latent internal state of the agent, \(z_t\), given all observations from 1 to T (\(x_{1:T}\)) in a Bayesian manner, that is, \(P\left( {z_t|x_{1:T}} \right)\). This estimation was performed by forward and backward algorithms, called filtering and smoothing, respectively.

In filtering, the posterior distribution of \(z_t\) given observations until t (\(x_{1:t}\)) is sequentially updated in a forward direction as

$$\begin{array}{*{20}{c}} {P\left( {{{{\mathbf{z}}}}_t|{{{\mathbf{x}}}}_{1:t}} \right) \propto P\left( {{{{\mathbf{x}}}}_t|{{{\mathbf{z}}}}_t,\theta } \right){\displaystyle{\int}} {P\left( {{{{\mathbf{z}}}}_t|{{{\mathbf{z}}}}_{t - 1},\theta } \right)P\left( {{{{\mathbf{z}}}}_{t - 1}|{{{\mathbf{x}}}}_{1:t - 1}} \right){\mathrm{d}}{{{\mathbf{z}}}}_{t - 1},} } \end{array}$$
(66)

where \({{{\mathbf{x}}}}_t = \left( {{{{\mathbf{a}}}}_t^T,o_t} \right)^{{{\mathrm{T}}}}\) and \(\theta = \left\{ {\sigma ^2,\alpha ,P_o} \right\}\). The prior distribution of \({{{\mathbf{z}}}}_1\) is

$$\begin{array}{*{20}{c}} {P\left( {{{{\mathbf{z}}}}_1} \right) = \left[ {\mathop {\prod }\limits_i {{{\mathcal{N}}}}(\mu _{i,1}|\mu _0,\sigma _\mu ^2)\text{Gam}(p_{i,1}|a_g,b_g)} \right]\text{Uni}(c_1|a_u,b_u),} \end{array}$$
(67)

where \(\mu _0\) and \(\sigma _\mu ^2\) denote the prior mean and variance, \(\text{Gam}(x|a_g,b_g)\) indicates the Gamma distribution with shape parameter \(a_g\) and scale parameter \(b_g\), and \(\text{Uni}\left( {x|a_u,b_u} \right)\) indicates the uniform distribution from \(a_u\) to \(b_u\). We used a particle filter42 to sequentially calculate the posterior \(P\left( {{{{\mathbf{z}}}}_t|{{{\mathbf{x}}}}_{1:t}} \right)\), which cannot be derived analytically because of the nonlinear transition probability.
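
A generic bootstrap particle filter for equation (66) might look as follows. This is a sketch under stated assumptions rather than the authors' implementation: the model-specific densities (the transition of equation (57), the prior of equation (67) and the likelihood of equations (59) and (60)) are passed in as placeholder callables, and multinomial resampling at every step is assumed.

```python
import numpy as np

def particle_filter(x_obs, sample_prior, sample_transition, likelihood,
                    n_particles=100_000, seed=0):
    """Bootstrap particle filter for equation (66).

    sample_prior(n) draws z_1 from equation (67); sample_transition(z, x_prev)
    draws z_t ~ P(z_t | z_{t-1}); likelihood(x_t, z) returns P(x_t | z_t).
    Returns the filtered sample means m_t and covariances V_t used in the backward pass.
    """
    rng = np.random.default_rng(seed)
    particles = sample_prior(n_particles)
    means, covs = [], []
    for t, x_t in enumerate(x_obs):
        if t > 0:
            particles = sample_transition(particles, x_obs[t - 1])
        w = likelihood(x_t, particles)
        w = w / w.sum()                                       # normalize importance weights
        idx = rng.choice(n_particles, size=n_particles, p=w)  # multinomial resampling
        particles = particles[idx]
        means.append(particles.mean(axis=0))
        covs.append(np.cov(particles, rowvar=False))
    return np.array(means), np.array(covs)
```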

After the particle filter, the posterior distribution of \({{{\mathbf{z}}}}_t\) given all observations (\({{{\mathbf{x}}}}_{1:T}\)) is sequentially updated in a backward direction as

$$\begin{array}{ccccc}\\ & P\left( {{{{\mathbf{z}}}}_t|{{{\mathbf{x}}}}_{1:T}} \right) = {\displaystyle{\int}} {P\left( {{{{\mathbf{z}}}}_{t + 1}|{{{\mathbf{x}}}}_{1:T}} \right)P\left( {{{{\mathbf{z}}}}_t|{{{\mathbf{z}}}}_{t + 1},{{{\mathbf{x}}}}_{1:t},\theta } \right){\mathrm{d}}{{{\mathbf{z}}}}_{t + 1}} \cr \\ & = {\displaystyle{\int}} {P\left( {{{{\mathbf{z}}}}_{t + 1}|{{{\mathbf{x}}}}_{1:T}} \right)\frac{{P\left( {{{{\mathbf{z}}}}_{t + 1}|{{{\mathbf{z}}}}_t,\theta } \right)P\left( {{{{\mathbf{z}}}}_t|{{{\mathbf{x}}}}_{1:t},\theta } \right)}}{{{\displaystyle{\int}} {P\left( {{{{\mathbf{z}}}}_{t + 1}|{{{\mathbf{z}}}}_t,\theta } \right)P\left( {{{{\mathbf{z}}}}_t|{{{\mathbf{x}}}}_{1:t},\theta } \right){\mathrm{d}}{{{\mathbf{z}}}}_t} }}{\mathrm{d}}{{{\mathbf{z}}}}_{t + 1}} .\\ \end{array}$$
(68)

However, this backward integration is intractable because of the non-Gaussian \(P\left( {{{{\mathbf{z}}}}_T|{{{\mathbf{x}}}}_{1:T}} \right)\), which was represented by the particle ensemble in the particle filter, and the nonlinear relationship between \({{{\mathbf{z}}}}_t\) and \({{{\mathbf{z}}}}_{t + 1}\) in \(P\left( {{{{\mathbf{z}}}}_{t + 1}|{{{\mathbf{z}}}}_t,\theta } \right)\) (equation (57)). Thus, we approximated \(P\left( {{{{\mathbf{z}}}}_t|{{{\mathbf{x}}}}_{1:t}} \right)\) as \({{{\mathcal{N}}}}({{{\mathbf{z}}}}_t|{{{\mathbf{m}}}}_t,{{{\mathbf{V}}}}_t)\), where \({{{\mathbf{m}}}}_t\) and \({{{\mathbf{V}}}}_t\) denote the sample mean and sample covariance of the particles at time t, and we linearized \(P\left( {{{{\mathbf{z}}}}_t|{{{\mathbf{z}}}}_{t - 1},\theta } \right)\) as

$$\begin{array}{*{20}{c}} {P\left( {{{{\mathbf{z}}}}_t|{{{\mathbf{z}}}}_{t - 1},\theta } \right) \cong {{{\mathcal{N}}}}\left( {{{{\mathbf{z}}}}_t|{{{\mathbf{Az}}}}_{t - 1} + {{{\mathbf{b}}}},{{{{{\Gamma}}}}}} \right),} \end{array}$$
(69)
$$\begin{array}{*{20}{c}} {{{{\mathbf{A}}}} = \left. {\frac{{\partial {{{\mathbf{F}}}}\left( {{{{\mathbf{z}}}}_{t - 1},{{{\mathbf{a}}}}_{t - 1}} \right)}}{{\partial {{{\mathbf{z}}}}_{t - 1}}}} \right|_{{{{\mathbf{m}}}}_t},} \end{array}$$
(70)
$$\begin{array}{*{20}{c}} {{{{\mathbf{b}}}} = F\left( {{{{\mathbf{m}}}}_t,{{{\mathbf{a}}}}_{t - 1}} \right) - {{{A\mathbf{m}}}}_t,} \end{array}$$
(71)

where A denotes a Jacobian matrix. Because these approximations make the integration of equation (68) tractable, the posterior distribution \(P\left( {{{{\mathbf{z}}}}_t|{{{\mathbf{x}}}}_{1:T}} \right)\) can be computed by a Gaussian distribution as

$$\begin{array}{*{20}{c}} {P\left( {{{{\mathbf{z}}}}_t|{{{\mathbf{x}}}}_{1:T}} \right) = {{{\mathcal{N}}}}\left( {{{{\mathbf{z}}}}_t|\widehat {{{\mathbf{m}}}}_t,\widehat {{{\mathbf{V}}}}_t} \right)} \end{array}$$
(72)

whose mean and variance were analytically updated by the Kalman backward algorithm as43

$$\begin{array}{*{20}{c}} {\widehat {{{\mathbf{m}}}}_t = {{{\mathbf{m}}}}_t + {{{\mathbf{J}}}}_t\left\{ {\widehat {{{\mathbf{m}}}}_{t + 1} - ({{{\mathbf{Am}}}}_t + {{{\mathbf{b}}}})} \right\},} \end{array}$$
(73)
$$\begin{array}{*{20}{c}} {\widehat {{{\mathbf{V}}}}_t = {{{\mathbf{V}}}}_t + {{{\mathbf{J}}}}_t\left\{ {\widehat {{{\mathbf{V}}}}_{t + 1} - ({{{\mathbf{AV}}}}_t{{{\mathbf{A}}}}^{{{\mathrm{T}}}} + {{{\mathbf{{\Gamma}}}}})} \right\}{{{\mathbf{J}}}}_t^{{{\mathrm{T}}}},} \end{array}$$
(74)

where

$$\begin{array}{*{20}{c}} {{{{\mathbf{J}}}}_t = {{{\mathbf{V}}}}_t{{{\mathbf{A}}}}^{{{\mathrm{T}}}}\left( {{{{\mathbf{AV}}}}_t{{{\mathbf{A}}}}^{{{\mathrm{T}}}} + {{{{{\Gamma}}}}}} \right)^{ - 1}.} \end{array}$$
(75)
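
The backward recursion of equations (73)–(75) can be sketched directly. The code below assumes the filtered means m_t and covariances V_t from the particle filter and the linearizations A and b of equations (70) and (71); the indexing convention for the linearization sequence is one plausible choice for this sketch.

```python
import numpy as np

def kalman_backward(m, V, A_seq, b_seq, Gamma):
    """Kalman backward (smoothing) pass of equations (73)-(75)."""
    T = len(m)
    m_hat, V_hat = [None] * T, [None] * T
    m_hat[-1], V_hat[-1] = m[-1], V[-1]
    for t in range(T - 2, -1, -1):
        A, b = A_seq[t], b_seq[t]
        P_pred = A @ V[t] @ A.T + Gamma                        # predicted covariance
        J = V[t] @ A.T @ np.linalg.inv(P_pred)                 # equation (75)
        m_hat[t] = m[t] + J @ (m_hat[t + 1] - (A @ m[t] + b))  # equation (73)
        V_hat[t] = V[t] + J @ (V_hat[t + 1] - P_pred) @ J.T    # equation (74)
    return m_hat, V_hat
```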

Impossibility of model discrimination

In the ReCU and Q-learning models, the action selection was formulated with softmax functions of the same form:

$$\begin{array}{*{20}{c}} {P\left( {a_{i,t} = 1} \right) = \frac{{{{{\mathrm{exp}}}}\left( {\beta \left( {E[\mathrm{Reward}_{i,t}] + c_tE[\text{Info}_{i,t}]} \right)} \right)}}{{\mathop {\sum}\nolimits_j {{{{\mathrm{exp}}}}\left( {\beta \left( {E[\mathrm{Reward}_{j,t}] + c_tE[\text{Info}_{j,t}]} \right)} \right)} }}.} \end{array}$$
(76)
$$\begin{array}{*{20}{c}} {P\left( {a_{i,t} = 1} \right) = \frac{{\exp \left( {\beta _tQ_{i,t}} \right)}}{{\mathop {\sum}\nolimits_i {\exp \left( {\beta _tQ_{i,t}} \right)} }},} \end{array}$$
(77)

which correspond to the ReCU and Q-learning models, respectively. These equations contain time-dependent meta-parameters, namely ct and βt. For both models, the goodness of fit (that is, the likelihood) to the actual behavioral data can be improved arbitrarily by tuning the time-dependent meta-parameters. Thus, discrimination between the ReCU and Q-learning models is essentially impossible.

Estimation of parameters in iFEP

The ReCU model has several parameters: \(\sigma _w^2\), α, β, Po and \({\it{\epsilon }}\). In the estimation, we set \({\it{\epsilon }}\) to 1, which was the optimal value for estimation on the artificial data (Supplementary Fig. 3). We assumed a unit intensity of reward, that is, \(\ln \left\{ {P_o/\left( {1 - P_o} \right)} \right\} = 1\), because Po and β cannot be estimated separately: they enter the expected net utility only through the product of β and \(\ln \left\{ {P_o/\left( {1 - P_o} \right)} \right\}\) (equations (30) and (42)). This treatment is suitable for a relative comparison between the curiosity meta-parameter and the reward. In addition, because β multiplies the expected net utility (equations (30) and (42)), we treated the product βct as a single latent variable, \(\hat c_t = \beta c_t\); an estimate of ct is then obtained by dividing the estimated \(\hat c_t\) by the estimated β. Therefore, the hyperparameters to be estimated were \(\sigma _w^2\), α and β.

To estimate these parameters, \(\theta = \left\{ {\sigma _w^2,\alpha ,\beta } \right\}\), we extended the observer-SSM to a self-organizing SSM44 in which θ was treated as a set of constant latent variables:

$$\begin{array}{*{20}{c}} {P\left( {{{{\mathbf{z}}}}_t,\theta |{{{\mathbf{x}}}}_{1:t}} \right) \propto P\left( {{{{\mathbf{x}}}}_t|{{{\mathbf{z}}}}_t} \right){\int} {P\left( {{{{\mathbf{z}}}}_t|{{{\mathbf{z}}}}_{t - 1},\theta } \right)P\left( {{{{\mathbf{z}}}}_{t - 1},\theta |{{{\mathbf{x}}}}_{1:t - 1}} \right)\mathrm{d}{{{\mathbf{z}}}}_{t - 1},} } \end{array}$$
(78)

where \(P\left( \theta \right) = \text{Uni}\left( {\sigma ^2|a_\sigma ,b_\sigma } \right)\text{Uni}(\alpha |a_\alpha ,b_\alpha ){{{\mathcal{N}}}}(\beta |m_\beta ,v_\beta )\). To sequentially calculate the posterior \(P\left( {{{{\mathbf{z}}}}_t,\theta |{{{\mathbf{x}}}}_{1:t}} \right)\) with the particle filter, we used 100,000 particles and augmented the state vector of each particle with the parameter θ, whose values were drawn at random for initialization and were not updated thereafter.

The hyperparameter values used in this estimation were μ0 = 0, \(\sigma _\mu ^2 = 0.01^2\), ag = 10, bg = 0.001, au = −15, bu = 15, aσ = 0.2, bσ = 0.7, aα = 0.04, bα = 0.06, aβ = 0 and bβ = 50, which were heuristically given as parameters correctly estimated using the artificial data (Supplementary Fig. 2).
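
The augmentation used in the self-organizing SSM (equation (78)) amounts to attaching a fixed draw of θ to each particle. The following sketch is illustrative; sample_z1 and sample_theta are hypothetical placeholders for draws from equation (67) and from P(θ), respectively.

```python
import numpy as np

def initialize_augmented_particles(n_particles, sample_z1, sample_theta):
    """Self-organizing SSM (equation (78)): each particle carries both the latent
    state z_1 and a draw of the static parameters theta = (sigma_w^2, alpha, beta).
    The theta block is left untouched by the transition step, so it is only
    reweighted and resampled by the particle filter."""
    z = sample_z1(n_particles)                 # shape (n_particles, dim_z)
    theta = sample_theta(n_particles)          # shape (n_particles, 3), fixed per particle
    return np.concatenate([z, theta], axis=1)  # augmented state carried through the filter
```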

Statistical testing with Monte Carlo simulations

Supplementary Fig. 5 shows the statistical testing of the negative curiosity estimated in Fig. 5. The null hypothesis is that an agent without curiosity (that is, ct = 0) decides on a choice depending only on its recognition of the reward probability. Under the null hypothesis, model simulations were repeated 1,000 times under the same experimental conditions as in Fig. 5, and the curiosity was estimated for each simulation using iFEP. We adopted the temporal average of the estimated curiosity as the test statistic and plotted its null distribution. Comparing this null distribution with the test statistic of the curiosity estimated from the rat behavior, we computed the P value of a one-sided (left-tailed) test.
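
The Monte Carlo test reduces to comparing the observed test statistic with the simulated null distribution; a minimal sketch, assuming the 1,000 null statistics are already available, is given below.

```python
import numpy as np

def left_tailed_p_value(null_statistics, observed_statistic):
    """Monte Carlo P value for the one-sided (left-tailed) test: the fraction of
    null test statistics (temporally averaged curiosity under c_t = 0) that are
    at or below the statistic estimated from the rat behavior."""
    null_statistics = np.asarray(null_statistics)
    return float(np.mean(null_statistics <= observed_statistic))
```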

Reporting summary

Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.