Main

Animals and humans perceive the external world through their sensory systems and make decisions accordingly1,2. Generally, they cannot make optimal decisions because of environmental uncertainty, the limited computational capacity of the brain and the time constraints associated with decision-making3. As a result, they sometimes perform irrational actions. For example, people play lotteries and gamble despite low reward expectations. In this case, they face a dilemma between a low expected reward and curiosity about whether a reward will be acquired. Thus, understanding how animals control the balance between reward and curiosity is important for clarifying the whole decision-making process. However, a method for quantifying the reward–curiosity balance has yet to be established. In this study, we developed a machine learning method to decode the time series of the reward–curiosity balance from animal behavioral data.

Some irrational behaviors emerge because of the strength of curiosity4,5. For example, conservative individuals avoid uncertainty and prefer to select actions that lead to predictable outcomes. Conversely, inquisitive individuals desire to know the environment more strongly than they desire rewards and prefer to select actions that lead to unpredictable outcomes. Excessively conservative and excessively inquisitive natures can be related to autism spectrum disorder and attention deficit hyperactivity disorder, respectively; patients with these disorders are known to strongly avoid and strongly seek novel information, respectively6,7,8,9,10,11,12,13,14. Rational individuals fall midway between these two extremes: in an ambiguous environment, they select actions to efficiently understand the environment, and once the environment becomes clear, they select actions to efficiently exploit the rewards. Therefore, curiosity has a major impact on behavioral patterns, and animals are believed to control the balance between reward and curiosity in a context-dependent manner.

Decision-making has been modeled primarily by reinforcement learning (RL), a theory describing reward-seeking adaptive behavior in which animals not only exploit rewards but also explore the environment. In RL, explorative behavior is typically addressed by a passive, random choice of action15. However, animals actively explore the environment by selecting actions that minimize environmental uncertainty according to their curiosity.

Recently, the free energy principle (FEP) was proposed by Karl Friston under the Bayesian brain hypothesis, in which the brain optimally recognizes the outside world according to Bayesian estimation16,17,18. The FEP addresses not only the recognition of the external world but also information-seeking action selection, which minimizes the uncertainty of that recognition, known as “active inference”19,20,21. Furthermore, the FEP provides a score for actions, called the expected free energy, which combines the expected reward and curiosity in the same units22,23,24,25,26,27. Thus, action selection can be formulated as maximizing both reward and curiosity. Note that curiosity can be regarded as information gain, that is, the extent to which we expect our recognition to be updated by a new observation obtained through the action. However, the FEP assumes that the weighting of reward and curiosity is always even and constant. Although a previous FEP study modeled active inference in a two-choice task, it assumed a constant intensity of curiosity and thus could not treat actual animal behaviors, in which the weights of reward and curiosity are expected to change over time28. Hence, conventional theories such as RL and the FEP are limited in describing the conflict between reward and curiosity.

Identifying the temporal variability of curiosity is important for future work clarifying the neural mechanisms of the reward–curiosity conflict in decision-making. Many FEP studies have been devoted to constructing the theory under the assumption that the decision-making processes of animals are Bayes optimal; the idea that animals make decisions irrationally, depending on the conflict between reward and curiosity, has therefore not even been considered. For this reason, a method to decode the temporal balance between reward and curiosity from behavioral data is yet to be established. Such a method would enable us to analyze neural correlates of the temporal variability of curiosity and, consequently, would help us clarify how the brain controls the balance of reward and curiosity in a context-dependent manner.

In this study, we extended FEP by incorporating a meta-parameter that controls the conflict dynamics between reward and curiosity, called the reward–curiosity decision-making (ReCU) model. The ReCU model can exhibit various behavioral patterns, such as greedy behavior toward reward, information-seeking behaviors with high curiosity and conservative behaviors avoiding uncertainty. Moreover, we developed a machine learning method called the inverse FEP (iFEP) method to estimate the internal variables of decision-making information processing. Applying the iFEP method to a behavioral time series in a two-choice task, we successfully estimated the internal variables, such as variations in curiosity, recognition of reward availability and its confidence.

Results

Decision-making with the reward–curiosity dilemma

Animals perceive the environment by inferring causes, such as reward availability, from observations, and then they make decisions based on their own inferences. In this study, we developed the ReCU model of a decision-making agent facing a dilemma between reward and curiosity in a two-choice task, wherein the agent selects either of two options associated with the same reward but with different reward probabilities (Fig. 1a). If the agent aims to maximize the cumulative reward, it must select the option with the higher reward probability. However, in animal behavioral experiments, even after animals learned which option was more strongly associated with a reward, they did not exclusively select the best option but often selected the option with the smaller reward probability, which seems unreasonable.

Fig. 1: Decision-making model for the two-choice task with reward–curiosity dilemma.

a, Decision-making in the two-choice task. Reward is provided with a different probability for each option. The agent does not know these probabilities. Through repeated trial and error, the agent recognizes the world by inferring the latent reward probability of each option, and decides its next action, that is, which option to choose, based on its own inference. b, Sequential Bayesian estimation as the recognition process. The agent assumes that the reward probabilities change over time owing to the fluctuation of the latent variable controlling the reward probability. c, Belief updating. The agent recognizes the latent variable as a probability distribution. d, The update rules for the mean and variance of the estimation distribution for each option. α, Kt and f(μt) indicate the learning rate, the Kalman gain and the prediction of the reward probability, respectively. The second term in both equations disappears if the option is not selected. e, The action selection process of the agent. The agent evaluates the expected net utility \(U_t(a_{t + 1})\) of each action as the weighted sum of the expected reward and information gain, as shown in the equation. The agent compares the expected net utilities of both actions and prefers the option with the larger expected net utility. f, Time-dependent curiosity. The intensity of curiosity changes over time owing to the fluctuation of \(c_t\).

Here, we hypothesized the following: animals assume that the reward probability for each option may fluctuate over time; therefore, continuously selecting one option decreases the confidence of the reward probability estimate for the other option. Thus, they become curious about the ambiguous option even if it has a smaller reward probability, and selecting the ambiguous option is reasonable because it increases the confidence of the estimates for both options. We therefore considered that the agent should make decisions driven by both reward and curiosity in a situation-dependent manner.

ReCU model

In the ReCU model, we divided information processing in the brain into two processes. In the first process, the agent updates the recognition of the reward probability of each option (Fig. 1a, process 1). In the second process, the agent selects an action based on the current recognition and curiosity (Fig. 1a, process 2). The agent repeats these two processes in the two-choice task.

We modeled the first process by sequential Bayesian estimation, under the assumption that the reward probabilities fluctuate latently over time (Fig. 1b). The agent updates its belief about the reward probabilities, expressed as estimation distributions, in response to actions and the resulting reward observations (Fig. 1c). We derived the equations of the belief update (Fig. 1d and Methods). We modeled the second process by action selection based on two kinds of motivation: the desire to maximize reward and the desire to gain information from the environment (Fig. 1e). Their weighted sum, called the ‘expected net utility’ in this study, can be expressed by

$$U_t\left( {a_{t + 1}} \right) = E\left[ {\mathrm{Reward}_{t + 1}} \right] + c_t \cdot E\left[ {\mathrm{Info}_{t + 1}} \right],$$
(1)

where a and t indicate the action and trial index, respectively, and E[x] denotes the expectation of x based on the current recognition. The first and second terms represent the expected reward and the expected information gained from a new observation, respectively, for the next action at+1 based on the current recognition. ct denotes a meta-parameter describing the intensity of curiosity, which weights the expected information gain (see Methods for details). We assumed that the mental conflict is irrational in the sense that ct varies over time (Fig. 1f). In decision-making, the agent prefers to select the action at+1 with the higher expected net utility; the action is selected probabilistically following a sigmoidal function

$$P\left( {a_{t + 1}} \right) = \frac{1}{{1 + \exp \left( { - \beta {\Delta}U_t} \right)}},$$
(2)

where \({\Delta}U_t = U_t\left( {a_{t + 1}} \right) - U_t\left( {\overline {a_{t + 1}} } \right)\), in which \(\overline {a_{t + 1}} \) denotes the alternative (unchosen) action, and β denotes the inverse temperature controlling the randomness of action selection22,29,30.
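To make equations (1) and (2) concrete, the following Python sketch computes the expected net utility of each option and samples an action. It is a minimal illustration under assumed values; all function and variable names (for example, expected_net_utility) are ours, not from the original implementation.

```python
import numpy as np

def expected_net_utility(expected_reward, expected_info, c_t):
    """Equation (1): expected reward plus curiosity-weighted
    expected information gain."""
    return expected_reward + c_t * expected_info

def choose(utility_a, utility_b, beta, rng):
    """Equation (2): sigmoidal choice between two actions;
    returns 0 for the first action and 1 for the second."""
    p_first = 1.0 / (1.0 + np.exp(-beta * (utility_a - utility_b)))
    return 0 if rng.random() < p_first else 1

# Hypothetical single-trial values: the left option promises more reward,
# the right option promises more information (it is more uncertain).
rng = np.random.default_rng(0)
u_left = expected_net_utility(expected_reward=0.8, expected_info=0.1, c_t=1.0)
u_right = expected_net_utility(expected_reward=0.4, expected_info=0.6, c_t=1.0)
print(u_left, u_right, choose(u_left, u_right, beta=2.0, rng=rng))
```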

Recognition and decision-making in the simulation

To validate our model, we performed simulations with constant curiosity ct = 1 for two cases. In the first case, where the reward probabilities were constant and differed between the two options (Fig. 2a), the agent preferred to select the option with the higher reward probability (Fig. 2b). The recognized reward probabilities converged to the ground truths, indicating that the agent accurately recognized the reward probabilities (Fig. 2c). The recognition confidence changed over time depending on the behavior in each trial: the confidence of the selected option increased with information from the observation, whereas that of the unselected option decreased because of the agent’s assumption that the reward probabilities fluctuate (Fig. 2d). Similarly, the expected information gain of an option increased when that option was selected and decreased when it was not. Thus, the expected information gain was lower for the option with higher confidence (Fig. 2e). The expected reward followed the recognized reward probability (Fig. 2f). The initially decreasing expected information gain and the increasing expected reward eventually cross after some number of trials, which corresponds to a switch from information exploration to reward acquisition (Supplementary Fig. 1a). These two factors are negatively correlated, indicating a trade-off relationship (Supplementary Fig. 1b). The expected net utility, which is the sum of the expected information gain and the expected reward, represents the value of each option (Fig. 2g), so the agent preferentially selected the option with the higher expected net utility.

Fig. 2: Simulations of the decision-making model.

a, The two-choice task with constant reward probabilities. b, The selection probabilities for the left and right options, plotted as a moving average with a window width of 101. c, The recognized reward probabilities for the left and right options compared with the ground truths depicted by dashed lines. d, The confidences of the reward probability recognitions for the left and right options. e–g, The expected belief updates (e), expected reward (f) and expected net utility (g) for the left and right options. h, The two-choice task with constant and temporally varying reward probabilities for the left and right options. i–n, The same as b–g with parameter values of \(c = 1\), \(P_o = 0.8\), \(\alpha = 0.05\), \(\beta = 2\) and \(\sigma _w = 0.63\). o–q, The selected options in a space of left–right differences of the expected reward and information gain. The ReCU model was simulated with dynamically changing reward probabilities for different intensities of curiosity: \(c = - 1\) (o), \(c = 0\) (p) and \(c = 3\) (q). The reward probabilities were generated by the Ornstein–Uhlenbeck process of w for 1,000 trials: \(w_{i,t} = w_{i,t - 1} - 0.01w_{i,t - 1} + 0.15\xi _t\), where \(\xi _t\) indicates standard Gaussian noise. The heatmap represents the probability of action selection in the space (equation (2)).


In the second case, we assumed a dynamic environment with time-dependent reward probabilities (Fig. 2h). In the simulation, the agent adaptively changed its recognition of the reward probability following the change in the true reward probability and selected the option with the higher estimated reward probability (Fig. 2i,j). The confidence was affected by the uncertainty of the reward probability at each time: the confidence was high where the reward probabilities were near-deterministic, around 1 and 0, but low where the reward probabilities were uncertain, that is, approximately 0.5 (Fig. 2k). The expected information gain of an option was negatively correlated with the confidence of that option (Fig. 2l), suggesting that the agent was curious about the uncertain option. The expected reward simply varied with the recognized reward probability (Fig. 2m). In this case, we also observed a switch between information exploration and reward acquisition in the initial phase (Supplementary Fig. 1c). In contrast to the first case, the expected reward and expected information gain did not show a clear linear correlation because of the dynamic environment (Supplementary Fig. 1d). The expected net utility changed similarly to the expected reward; however, the difference between the left and right options was less pronounced because of curiosity (Fig. 2n). These two demonstrations indicate that our model can represent the process of recognition and decision-making based on reward and curiosity.

Discrimination of passive and curiosity-dependent behaviors

The behavioral differences induced by curiosity are of particular interest. The difference in the expected net utility between the two options can be written as

$${\Delta}U_t = {\Delta}E\left[ {\mathrm{Reward}_{t + 1}} \right] + c_t \cdot {\Delta}E\left[ {\mathrm{Info}_{t + 1}} \right],$$
(3)

where the first and second terms represent the differences in the expected reward and the expected information gain between the two options, respectively. Thus, the agent decides based on the balance between \({\Delta}E\left[ {\mathrm{Reward}_{t + 1}} \right]\) and \({\Delta}E\left[ {\mathrm{Info}_{t + 1}} \right]\). Here, we created a diagram visualizing the selected actions in the space of \({\Delta}E\left[ {\mathrm{Reward}_{t + 1}} \right]\) and \({\Delta}E\left[ {\mathrm{Info}_{t + 1}} \right]\). Depending on the intensity of curiosity ct, the left and right actions can be separated by the boundary \({\Delta}E\left[ {\mathrm{Reward}_{t + 1}} \right] + c_t \cdot {\Delta}E\left[ {\mathrm{Info}_{t + 1}} \right] = 0\) (Fig. 2o–q). The agent with ct = 0 selected an option based only on \({\Delta}E\left[ {\mathrm{Reward}_{t + 1}} \right]\) (Fig. 2p). In contrast, if ct was nonzero, the boundary leaned in different directions depending on whether ct was positive or negative (Fig. 2o,q). These results indicate that passive choice (ct = 0) and curiosity-dependent choice (\(c_t \ne 0\)) can be discriminated based on the distribution of selected actions in the space of \({\Delta}E\left[ {\mathrm{Reward}_{t + 1}} \right]\) and \({\Delta}E\left[ {\mathrm{Info}_{t + 1}} \right]\).
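This discrimination can be performed directly from choice data. The sketch below is a toy example with synthetic trials (all names and values are ours): it fits the logistic model also used later in Fig. 6, P(left) = f(wRΔE[Reward] + wIΔE[Info]), and recovers an effective curiosity as the weight ratio wI/wR.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic per-trial differences (left minus right) and choices generated
# with a known curiosity, to illustrate recovering it from behavior.
rng = np.random.default_rng(1)
d_reward = rng.normal(0.0, 1.0, 1000)
d_info = rng.normal(0.0, 1.0, 1000)
c_true, beta = 2.0, 2.0
p_left = 1.0 / (1.0 + np.exp(-beta * (d_reward + c_true * d_info)))
left_chosen = (rng.random(1000) < p_left).astype(int)

# Fit P(left) = f(w_R * dE[Reward] + w_I * dE[Info]); the decision boundary
# is w_R * dE[Reward] + w_I * dE[Info] = 0, and w_I / w_R estimates c.
X = np.column_stack([d_reward, d_info])
model = LogisticRegression(fit_intercept=False).fit(X, left_chosen)
w_r, w_i = model.coef_[0]
print("effective curiosity (w_I / w_R):", w_i / w_r)
```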

Curiosity-dependent irrational behaviors

We examined how behavioral patterns are regulated by the intensity of curiosity and the degree of reward seeking (Fig. 3). In a scenario where the reward probabilities were zero for the left option and 0.5 for the right option (Fig. 3a), we simulated the model by varying the meta-parameters c and Po, the latter of which controls the reward amount (Fig. 3b and Methods). When the agent strongly desired the reward (\(P_o = 0.99\)) and had no curiosity (c = 0), the agent preferred the right option with the higher reward probability (Fig. 3c, point a). When the agent had no desire for a reward (\(P_o = 0.5\)) but high curiosity (c = 10), the agent also preferred the option with the higher reward probability (Fig. 3c, point b). Although this behavior seems rational at first glance, the agent did not seek the reward but rather sought information (that is, belief updates) driven by curiosity, which resulted in a preference for the uncertain option. When the agent had negative curiosity (c = −10), the agent continuously selected one of the two options depending on its first selection (Fig. 3c, point c). In this behavior, the agent conservatively selected the more certain option, just as patients with autism spectrum disorder irrationally avoid new information and repeat the same choices6,7,8,9,10,11.

Fig. 3: Curiosity-dependent irrational behaviors.

a, The two-choice task with different, constant reward probabilities: 0% for the left and 50% for the right option. b, The heatmap of the selection probability of the right option as a function of the curiosity and reward-intensity parameters. The probability was obtained empirically by running 1,000 simulations for each pair of curiosity and reward values. Three representative conditions are indicated by black dots: reward-seeking (point a; \(c = 0,P_o = 1\)), information-seeking (point b; \(c = 10,P_o = 0.5\)) and information-avoiding (point c; \(c = - 10,P_o = 0.75\)). c, Box plots showing the selection ratios of the right option for the conditions at points a–c, where the central line indicates the median, the edges are the lower and upper quartiles and the upper and lower whiskers represent the highest and lowest values after excluding outliers, respectively. At point c, the model agent dominantly selected either the right or the left option, with left- and right-dominant simulations occurring 505 and 495 times, respectively. The box plots for point c appear crushed because the data points are too densely packed. d–f, The same as a–c for the two-choice task with different, constant reward probabilities: 50% for the left and 100% for the right option. At point c in f, the left- and right-dominant simulations occurred 489 and 511 times, respectively.


In addition, we obtained a nontrivial result in another scenario, where the reward probabilities were 0.5 for the left option and 1 for the right option (Fig. 3d–f). As in the previous scenario, the agent with a strong desire for the reward (Po = 0.99, nearly equal to 1) preferred the right option with the higher reward probability (Fig. 3f, point a). The agent with no desire for reward (Po = 0.5) and high curiosity (c = 10) preferred the left option with the lower reward probability (Fig. 3f, point b). This seemingly irrational behavior was the outcome of focusing on satisfying curiosity rather than seeking rewards, which is reminiscent of patients with attention deficit hyperactivity disorder irrationally exploring new information12,13,14. In addition, as in the previous scenario (Fig. 3c, point c), agents with negative curiosity (c = −10), irrespective of their desire for the reward, exhibited conservative selection (Fig. 3f, point c). In combination, these results clearly indicate that behavioral patterns largely depend on the degree of conflict between reward and curiosity.

Inverse FEP: Bayesian estimation of the internal state

In the above cases, we assumed a constant balance between reward and curiosity. However, our feelings swing in a context-dependent manner. Although it is important for neuroscience and psychology to decipher the temporal swings of the conflict between reward and curiosity, quantifying the conflict is difficult precisely because of its temporal dynamics. Here, we addressed the inverse problem of estimating the internal states of the agent from behavioral data, known as computational phenotyping or meta-Bayesian inference31,32,33,34,35. To this end, we developed a machine learning method, iFEP, to quantitatively decipher the temporal dynamics of the internal state, including the curiosity meta-parameter, from behavioral data.

To develop iFEP, we needed to switch the viewpoint from the agent to an observer of the agent, that is, from the animal to us. In the state-space model (SSM) from the viewpoint of the agent, we described the agent's sequential recognition of the reward probabilities (Figs. 1b and 4a,b). Conversely, we developed a state-space model from the observer’s eye (the observer-SSM) to determine the internal state of the agent, for example, the intensity of curiosity ct, the recognition μi,t and its confidence pi,t (that is, the inverse of the variance of the estimation distribution) (Fig. 4c). In the observer-SSM, the intensity of curiosity is assumed to change continuously over time, and the agent’s recognition of the reward probability is updated using the equations shown in Fig. 1d following the FEP; however, these quantities are unknown to the observer. In addition, the agent’s actions are assumed to be generated depending on the intensity of its curiosity, its recognition and its confidence, as described in equation (2), but the observer can only monitor the agent’s actions and the presence of a reward. In iFEP, based on the observer-SSM, we estimate the latent internal state of the agent, z, from the observations x in a Bayesian manner as

$$\begin{array}{*{20}{c}} {P\left( {z_{1:T}|x_{1:T}} \right) \propto P\left( {x_{1:T}|z_{1:T}} \right)P\left( {z_{1:T}} \right),} \end{array}$$
(4)

where \(z_{1:T} = \left\{ {\mu _{i,1:T},p_{i,1:T},c_{1:T}} \right\}\), \(x_{1:T} = \left\{ {a_{1:T},o_{1:T}} \right\}\) and the subscript 1:T indicates steps 1 to T. In this Bayesian estimation, the posterior distribution \(P\left( {z_{1:T}|x_{1:T}} \right)\) represents the observer’s recognition of the estimated \(z_{1:T}\), with its uncertainty, given the observations x1:T. The prior distribution \(P\left( {z_{1:T}} \right)\) represents our belief, expressed as the ReCU model with random motion of the curiosity meta-parameter c as

$$\begin{array}{*{20}{c}} {c_t = c_{t - 1} + {\it{\epsilon }}\zeta _t,} \end{array}$$
(5)

where \(\zeta _t\) indicates standard Gaussian white noise and \({\it{\epsilon }}\) indicates its intensity. The likelihood P(x1:T|z1:T) represents the probability that x1:T was observed given z1:T, which also follows the ReCU model. This Bayesian estimation, namely iFEP, was conducted using a particle filter and a Kalman backward algorithm (Methods).
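A minimal sketch of the iFEP idea is given below in Python: each particle carries a curiosity value together with belief means and precisions, the curiosity follows the random walk of equation (5), and particles are weighted by the likelihood of the observed action under equation (2). To keep the sketch short, the expected information gain is replaced by a crude proxy (the inverse precision) rather than the exact mutual-information term, and there is no Kalman backward pass; all function names and parameter values are ours, not the authors' implementation.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def update_belief(mu, p, action, outcome, alpha, sigma_w2):
    """Simplified belief update of the chosen option (Fig. 1d);
    the unchosen option only loses precision through sigma_w2."""
    k = 1.0 / p + sigma_w2                       # predictive variance per option
    mu_new, p_new = mu.copy(), 1.0 / k
    mu_new[action] = mu[action] + alpha * k[action] * (outcome - sigmoid(mu[action]))
    p_new[action] += sigmoid(mu_new[action]) * (1.0 - sigmoid(mu_new[action]))
    return mu_new, p_new

def ifep_particle_filter(actions, outcomes, n_particles=1000, alpha=0.05,
                         beta=2.0, sigma_w2=0.4, eps=0.05, p_o=0.8, seed=0):
    """Bootstrap particle filter over the curiosity trajectory c_t.
    actions, outcomes: arrays of 0/1 per trial (0 = left chosen)."""
    rng = np.random.default_rng(seed)
    T = len(actions)
    c = rng.normal(0.0, 1.0, n_particles)        # curiosity particles
    mu = np.zeros((n_particles, 2))              # belief means
    p = np.ones((n_particles, 2))                # belief precisions
    reward_size = np.log(p_o / (1.0 - p_o))      # eq. (9)
    c_est = np.zeros(T)
    for t in range(T):
        c = c + eps * rng.standard_normal(n_particles)        # eq. (5)
        e_reward = sigmoid(mu) * reward_size                   # expected reward
        e_info = 1.0 / p                                       # crude proxy for E[Info]
        du = (e_reward[:, 0] - e_reward[:, 1]) + c * (e_info[:, 0] - e_info[:, 1])
        p_left = sigmoid(beta * du)                            # eq. (2)
        lik = p_left if actions[t] == 0 else 1.0 - p_left
        w = lik / lik.sum()
        c_est[t] = np.sum(w * c)                               # filtered estimate of c_t
        idx = rng.choice(n_particles, n_particles, p=w)        # resample
        c, mu, p = c[idx], mu[idx], p[idx]
        for n in range(n_particles):                           # propagate beliefs
            mu[n], p[n] = update_belief(mu[n], p[n], actions[t],
                                        outcomes[t], alpha, sigma_w2)
    return c_est

# Toy usage with random behavior; real data would supply actions and outcomes.
rng = np.random.default_rng(1)
print(ifep_particle_filter(rng.integers(0, 2, 100), rng.integers(0, 2, 100))[:5])
```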

Fig. 4: The scheme of iFEP by an observer of a decision-making agent.

a, An agent performing the two-choice task from the observer’s perspective. b, The observer’s assumption about the agent’s decision-making. The agent is assumed to follow the decision-making model described in Fig. 1. c, The SSM from the observer’s eye. For the observer, the agent’s reward–curiosity conflict, recognized reward probabilities and their uncertainties are temporally varying latent variables, whereas the agent’s action and the presence/absence of a reward are observable. The observer estimates the latent internal states of the agent.

Validation of iFEP with artificial data

We tested the validity of the iFEP method by applying it to artificial data generated by the ReCU model. We simulated a model agent with nonconstant curiosity in the two-choice task, in which the reward probabilities varied over time. We then demonstrated that iFEP recovered the ground truth of the internal state of the simulated agent, that is, the agent’s intensity of curiosity, recognition and confidence (Supplementary Fig. 2). We also confirmed that the estimation performance is robust to the value of \({\it{\epsilon }}\) (Supplementary Fig. 3). Therefore, iFEP can provide reliable estimates of the belief-updating variables, clarifying decision-making processing and the accompanying temporal swings in the conflict between reward and curiosity.

iFEP-decoded internal state behind rat behaviors

We applied iFEP to actual rat behavioral data from a two-choice task experiment with temporally varying reward probabilities36 (Fig. 5a). In this experiment, once the reward probabilities suddenly changed in a discrete manner, the rat slowly adapted to select the option with the higher reward probability (Fig. 5b), suggesting that the rat sequentially updated its recognition of the reward probabilities. Based on these behavioral data, iFEP estimated the internal state, that is, the intensity of curiosity, the recognized reward probabilities and their confidence levels (Fig. 5c–e). We found that the rat was not perfectly aware of the true reward probabilities but was able to recognize increases and decreases in reward probability (Fig. 5d,e). We also found that the confidence in an option increased when that option was chosen and decreased when it was not (Fig. 5f,g).

Fig. 5: Estimation of the rat’s internal state by iFEP.

a, A rat in the two-choice task with temporally switching reward probabilities. b, Rat behaviors from a public dataset at https://groups.oist.jp/ja/ncu/data ref. 5. Vertical lines indicate the selections of the left (L) and right (R) options, respectively. The time series indicates the moving average of the selection probability of the left option with a backward window of width 25. c, The moving average of the selection probabilities for the left and right options. d,e, The estimations, driven by the rat behavioral data, of the recognized reward probabilities for the left (d) and right (e) options. f–h, The estimations, driven by the rat behavioral data, of the rat's confidence in the recognized reward probabilities for the left (f) and right (g) options, and of the rat's curiosity (h). The estimated parameter values were \(\alpha = 0.058\), \(\beta = 6.991\) and \(\sigma _w^2 = 0.560\). The number of particles in the particle filter was 100,000. Continuous shaded ranges represent the s.d. over all particles in d–h.


With iFEP, we can examine whether the rat subjectively assumed a fluctuating or a constant environment, that is, fluctuating or constant reward probabilities. From the rat behavioral data, we estimated the degree of fluctuation of the reward probabilities that the rat assumes to be \(p_w = 1.785\) (that is, \(\sigma _w^2 = 0.560\)). This estimated value implies that the latent variable controlling the reward probabilities follows a random walk whose s.d. grows with the number of trials as \(\sqrt {\sigma _w^2t}\). Because the reward probability is the sigmoid of this latent variable, the estimated reward probability can change substantially, from 0.5 to 0.5 ± 0.4, within only ten trials (Supplementary Fig. 4). Therefore, the rat appears to have assumed a fluctuating environment and, thus, easily forgot its recognition and lost its confidence.
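This magnitude can be checked with a few lines of Python under the estimated value σw² = 0.560: after ten trials the s.d. of the latent random walk is √(0.560 × 10) ≈ 2.4, and applying the sigmoid at ±1 s.d. moves the reward probability from 0.5 to roughly 0.1 or 0.9.

```python
import numpy as np

sigma_w2, trials = 0.560, 10
sd = np.sqrt(sigma_w2 * trials)           # s.d. of the latent random walk after 10 trials
sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))
print(sd)                                  # ~2.37
print(sigmoid(-sd), sigmoid(sd))           # ~0.09 and ~0.91, i.e. 0.5 -/+ ~0.4
```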

Negative curiosity and its dynamics decoded by iFEP

Interestingly, the curiosity held by the rat was estimated to be negative in almost all trials (Fig. 5h). In other words, the rat conservatively preferred certain choices and did not explore uncertain choices. This negative curiosity is reasonable for starved animals, which desire to obtain more reward with higher confidence. To validate the negative curiosity, we visualized the selected actions in the space of \({\Delta}E\left[ {\mathrm{Reward}_{t + 1}} \right]\) and \({\Delta}E\left[ {\mathrm{Info}_{t + 1}} \right]\), as in Fig. 2o–q (Fig. 6a–c). When the estimated ct was negative, the rat dominantly selected the left action in a region with positive \({\Delta}E\left[ {\mathrm{Reward}_{t + 1}} \right]\) and negative \({\Delta}E\left[ {\mathrm{Info}_{t + 1}} \right]\) (Fig. 6a for ct < −1.1; Fig. 6b for \(- 1.1 \le c_t < - 0.7\)), clearly indicating that the rat had a positive subjective reward and negative curiosity. When the estimated ct was close to 0, the rat selected both actions based on \({\Delta}E\left[ {\mathrm{Reward}_{t + 1}} \right]\), independent of \({\Delta}E\left[ {\mathrm{Info}_{t + 1}} \right]\) (Fig. 6c for \(- 0.7 \le c_t\)). These results clearly support an expected net utility with a positive weight on the expected reward and a time-dependent weight on the expected information gain. In addition, we statistically confirmed the negative curiosity (P < 0.01 for Monte Carlo testing in Supplementary Fig. 5).

Fig. 6: Negative curiosity and its dynamics.

a–c, The selected options in a space of left–right differences of the expected reward and information gain. All actions were separated into three groups: estimated curiosity negatively large (\(c \le - 1.1\), \(n = 110\)) (a), negatively medium (\(- 1.1 < c \le - 0.7\), n = 93) (b) and close to 0 (\(- 0.7 < c\), \(n = 187\)) (c). The left and right options were discriminated by a logistic regression with \(P\left( {a_t = 1} \right) = f\left( {w_R{\Delta}E\left[ {\mathrm{Reward}_{t + 1}} \right] + w_I{\Delta}E\left[ {\mathrm{Info}_{t + 1}} \right]} \right)\), where wR and wI indicate the weight parameters. The two lines indicate the discrimination boundaries obtained using the wR and wI estimated from these scattered data and using \(w_R = 1\) and \(w_I = {\sum} {c_t/} N\), the average of the estimated curiosity, respectively. The heatmap represents the probability of selecting the left option based on the estimated wR and wI. d,e, Time series of the intensity of curiosity and the sum of the expected information gains for both options (d), and their cross-correlation (e). f,g, Time series of the temporal derivative of curiosity and the sum of the expected information gains for both options (f), and their cross-correlation (g). The temporal derivative was computed by linear regression within a time window of seven trials.


Further, we noticed an increase in the estimated curiosity around the trials at which the reward probabilities changed suddenly (Fig. 5h). These curiosity dynamics can be interpreted as the rat recognizing the rule change and adaptively controlling the extent to which it sought new information. We further examined how curiosity is regulated by the recognized environmental information. We did not detect a correlation between the estimated curiosity itself and the expected information gains (Fig. 6d,e). However, we found that the temporal derivative of the estimated curiosity was highly correlated with the sum of the expected information gains for both options (Fig. 6f,g). These results imply that the rat actively upregulated its curiosity level when the uncertainty in the recognized reward probabilities increased, that is,

$$\begin{array}{*{20}{c}} {\frac{{\mathrm{d}c}}{{\mathrm{d}t}} \propto \mathop {\sum}\limits_i {E\left[ {\mathrm{Info}_i} \right]} .} \end{array}$$
(6)
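The analysis behind Fig. 6f,g and equation (6) can be sketched as follows, assuming two hypothetical arrays, c_est (the iFEP-estimated curiosity) and info_sum (the per-trial sum of expected information gains); the synthetic inputs below are placeholders.

```python
import numpy as np

def windowed_slope(x, width=7):
    """Temporal derivative of x via linear regression in a sliding
    window of `width` trials (as in Fig. 6f,g)."""
    half = width // 2
    t = np.arange(width) - half
    slopes = np.full(len(x), np.nan)
    for i in range(half, len(x) - half):
        slopes[i] = np.polyfit(t, x[i - half:i + half + 1], 1)[0]
    return slopes

# Placeholder inputs; replace with the iFEP estimates.
rng = np.random.default_rng(0)
info_sum = np.abs(rng.normal(0.5, 0.2, 500))          # sum_i E[Info_i] per trial
c_est = np.cumsum(0.1 * info_sum + 0.01 * rng.standard_normal(500))

dc = windowed_slope(c_est)
mask = ~np.isnan(dc)
print(np.corrcoef(dc[mask], info_sum[mask])[0, 1])    # correlation as in Fig. 6g
```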

Evaluations of alternative models from rat behaviors

Finally, to further confirm the validity of the ReCU model, we compared it with other decision-making models based on the rat behavioral data. As an alternative version of the expected net utility, we introduced the time-dependent desire for reward as

$$\begin{array}{*{20}{c}} {U_t\left( {a_{t + 1}} \right) = d_t \cdot E\left[ {\mathrm{Reward}_{t + 1}} \right] + E\left[ {\mathrm{Info}_{t + 1}} \right],} \end{array}$$
(7)

where dt denotes a meta-parameter describing the subjective reward (see Methods for details). With this alternative model, we estimated the time series of dt by iFEP from the rat behavioral data. We found that the estimated subjective reward meta-parameter dt changed dynamically and sometimes came close to zero when the rat encountered drastic changes in the reward probabilities (Supplementary Fig. 6), which would indicate that the rat suddenly no longer needed rewards. This is unnatural for animals that were food-deprived before the experimental task to motivate them to obtain food. Thus, the alternative model is not suitable for describing the rat behavior.

Another possible model is Q-learning in the framework of RL, which has been widely used to model decision-making tasks. Following previous studies36,37,38, we introduced a time-dependent inverse temperature Bt, which controls the randomness of action selection; unlike the curiosity term in the ReCU model, however, it does not lead to directed information-seeking behavior. With the Q-learning model, we estimated the time series of Bt (Methods). We found that Bt decreased when the reward probabilities suddenly changed (Supplementary Fig. 7), meaning that the rat tended to select actions more randomly in response to the environmental rule change. Although this dynamic behavior of the inverse temperature seems reasonable, it is unknown how it would be regulated in the Q-learning model. Here, we hypothesized that Bt could be regulated by the uncertainty of the recognition. To probe this, we compared Bt, estimated from Q-learning, with the expected information gain, estimated by iFEP in the ReCU model. We found that they are positively correlated (Supplementary Fig. 7) as

$$\begin{array}{*{20}{c}} {B_t \propto \mathop {\sum}\limits_i {E\left[ {\mathrm{Info}_{t,i}} \right]} .} \end{array}$$
(8)

Therefore, Q-learning requires a cue regarding the uncertainty of the recognition, that is, the expected information gain, which supports the ReCU model as an explanation of curiosity-driven behavior.
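For reference, a minimal Q-learning comparison agent of this kind, with softmax action selection governed by an inverse temperature (held fixed here; in the comparison above Bt is estimated per trial), might look like the sketch below; the learning rate, reward probabilities and names are illustrative, not the authors' exact specification.

```python
import numpy as np

def q_update(q, action, reward, lr=0.1):
    """Standard Q-value update for the chosen option."""
    q = q.copy()
    q[action] += lr * (reward - q[action])
    return q

def softmax_choice(q, beta_t, rng):
    """Softmax action selection; beta_t is the inverse temperature
    controlling the randomness of the choice."""
    logits = beta_t * q
    p = np.exp(logits - logits.max())
    p /= p.sum()
    return rng.choice(len(q), p=p)

rng = np.random.default_rng(0)
q = np.zeros(2)
true_p = np.array([0.2, 0.8])   # hypothetical reward probabilities
beta_t = 3.0                    # held fixed here; time-varying in the comparison
for t in range(200):
    a = softmax_choice(q, beta_t, rng)
    r = float(rng.random() < true_p[a])
    q = q_update(q, a, r)
print(q)                        # Q-values approach the reward probabilities
```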

Discussion

The advancement realized in this study is the modeling and decoding of the mental conflict between reward and curiosity, which had not previously been quantified. The proposed approach can potentially improve our understanding of how this mental conflict is regulated and, in the future, highlight its neural mechanisms by incorporating neural recording data.

Our decoding approach has some limitations. iFEP requires behavioral data with many trials in the two-choice task because the particle filter needs a certain number of trials for the estimation to converge. Thus, iFEP is not applicable to short behavioral data. In addition, iFEP assumes gradual changes in curiosity over time and cannot follow pathologically rapid curiosity dynamics in the estimation process of the particle filter.

Comparing the ReCU model with other models, including Q-learning, is difficult. In all these models, action selection is commonly formulated with a sigmoidal function. Thus, any of these models can be made to fit the observed behaviors, that is, to increase the likelihood, by adjusting the time-dependent meta-parameters. Therefore, model selection based solely on likelihood is not very helpful, and determining whether animals use the ReCU model or other models to make decisions seems inherently challenging. However, a potential advantage of the ReCU model is its ability to capture the dynamics of curiosity and the interplay between curiosity and reward-based learning.

There is a related theoretical model that differs from RL and the FEP. Ortega and Braun formulated a free-energy framework that describes irrational decision-making39,40. Interestingly, their formulation was based on microscopic thermodynamics, and a temperature parameter controls this irrationality. However, their thermodynamics-based free energy did not treat the sequential updating of recognition from observations.

Finally, it is worth discussing future perspectives of our iFEP approach in medicine. In general, psychiatric diagnosis relies on medical interviews and is not evaluated quantitatively. Our iFEP method can quantitatively estimate the psychological state of patients based on their behavioral data. For example, patients with social withdrawal, also known as ‘Hikikomori,’ show little interest in anything. In this case, social withdrawal would be characterized by a negative value of curiosity in our FEP model. Therefore, the iFEP method could be an effective aid for diagnosing mental disorders.

Methods

Amount of reward

In the two-choice task, the reward is given in an all-or-none manner; however, its subjective intensity depends on the agent’s own preference as

$$\begin{array}{*{20}{c}} {R = 0\,{\rm{or}}\ln \frac{{P_o}}{{1 - P_o}},} \end{array}$$
(9)

where Po represents the desired probability describing how much the agent wants the reward and controls the reward intensity felt by the agent (Supplementary Fig. 8). In Friston’s formulation, the presence and absence of a reward are given by log preferences in natural units as ln Po and ln(1 − Po), which take negative values22,23. In this equation, the reward is the difference between the log preferences, ensuring that the reward intensity is positive.

State space model for reward probability recognition

The agent assumes that the reward is generated probabilistically from a latent cause w, with the reward probabilities λi represented by

$$\begin{array}{*{20}{c}} {\lambda _i = f\left( {w_i} \right),} \end{array}$$
(10)

where i indexes the options and \(f\left( x \right) = 1/\left( {1 + {\mathrm{e}}^{ - x}} \right)\). In addition, the agent assumes an ambiguous environment in which the reward probabilities fluctuate according to a random walk as

$$w_{i,t} = w_{i,t - 1} + \sigma _w\xi _{i,t},$$
(11)

where t, \(\xi _{i,t}\) and \(\sigma _w\) denote the trial index of the two-choice task, standard Gaussian noise and the noise intensity, respectively. Thus, the agent assumes an environment expressed by an SSM as

$$\begin{array}{*{20}{c}} {P\left( {{{{\mathbf{w}}}}_t|{{{\mathbf{w}}}}_{t - 1}} \right) = {{{\mathcal{N}}}}\left( {{{{\mathbf{w}}}}_t|{{{\mathbf{w}}}}_{t - 1},\sigma _w^2{{{\mathbf{I}}}}} \right),} \end{array}$$
(12)
$$P\left( {o_t|{{{\mathbf{w}}}}_t,{{{\mathbf{a}}}}_t} \right) = \mathop {\prod }\limits_i \left[ {f\left( {w_{i,t}} \right)^{o_t}\left\{ {1 - f\left( {w_{i,t}} \right)} \right\}^{1 - o_t}} \right]^{a_{i,t}},$$
(13)

where wt and ot denote the latent variables controlling the reward probabilities of both options at step t (\({{{\mathbf{w}}}}_t = \left( {w_{1,t},w_{2,t}} \right)^{\mathrm{T}}\)) and the observation of the presence of a reward \(\left( {o_t \in \left\{ {0,1} \right\}} \right)\), respectively. Meanwhile, at denotes the agent’s action at step t, which is represented by a one-hot vector \(\left( {{{{\mathbf{a}}}}_t \in \left\{ {\left( {1,0} \right)^{\mathrm{T}},\left( {0,1} \right)^{\mathrm{T}}} \right\}} \right)\); \({{{\mathcal{N}}}}\left( {{{{\mathbf{x}}}}|{{{\mathbf{\upmu }}}},{\varSigma}} \right)\) denotes the Gaussian distribution with mean μ and covariance Σ; \(\sigma _w^2\) denotes the variance of the transition probability of w; I denotes the identity matrix; and \(f(w_{i,t}) = 1/(1 + {\mathrm{e}}^{ - w_{i,t}})\) represents the reward probability of option i at step t. The initial distribution of w1 is given by \(P\left( {{{{\mathbf{w}}}}_1} \right) = {{{\mathcal{N}}}}\left( {{{{\mathbf{w}}}}_1|0,\kappa {{{\mathbf{I}}}}} \right)\), where κ denotes the variance.
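Under equations (10)–(13), the assumed environment can be simulated in a few lines; the sketch below (with illustrative names and values) draws the latent random walk and samples a Bernoulli reward for a chosen option.

```python
import numpy as np

def simulate_environment(T=1000, sigma_w=0.15, seed=0):
    """Latent causes w follow a random walk (eq. (11)); the reward
    probability of each option is its sigmoid (eq. (10))."""
    rng = np.random.default_rng(seed)
    w = np.zeros((T, 2))
    for t in range(1, T):
        w[t] = w[t - 1] + sigma_w * rng.standard_normal(2)
    reward_prob = 1.0 / (1.0 + np.exp(-w))     # lambda_{i,t} = f(w_{i,t})
    return w, reward_prob

def sample_reward(reward_prob_t, action, rng):
    """Bernoulli observation o_t for the chosen option (eq. (13))."""
    return int(rng.random() < reward_prob_t[action])

rng = np.random.default_rng(1)
w, lam = simulate_environment()
print(lam[10], sample_reward(lam[10], action=0, rng=rng))
```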

FEP for reward probability recognition

We modeled the agent’s recognition process of reward probability using sequential Bayesian updating as

$$P\left( {{{{\mathbf{w}}}}_t|o_{1:t},{{{\mathbf{a}}}}_{1:t}} \right) \propto P\left( {o_t|{{{\mathbf{w}}}}_t,{{{\mathbf{a}}}}_t} \right){\int} {P\left( {{{{\mathbf{w}}}}_t|{{{\mathbf{w}}}}_{t - 1}} \right)P\left( {{{{\mathbf{w}}}}_{t - 1}|{{{{o}}}}_{1:t - 1},{{{\mathbf{a}}}}_{1:t - 1}} \right)\mathrm{d}{{{\mathbf{w}}}}_{t - 1}} .$$
(14)

Because of the non-Gaussian P(ot|wt, at), the posterior distribution of wt, \(P\left( {{{{\mathbf{w}}}}_t|o_{1:t},{{{\mathbf{a}}}}_{1:t}} \right)\), becomes non-Gaussian and cannot be calculated analytically. To avoid this problem, we introduced a simple posterior distribution approximated by a Gaussian distribution:

$$Q\left( {{{{\mathbf{w}}}}_t|\varphi _t} \right) = {{{\mathcal{N}}}}\left( {{{{\mathbf{w}}}}_t|{{{\mathbf{\upmu }}}}_t,{{{\mathbf{{\Lambda}}}}}_t^{ - 1}} \right) {\fallingdotseq} P\left( {{{{\mathbf{w}}}}_t|o_{1:t},{{{\mathbf{a}}}}_{1:t}} \right),$$
(15)

where φt = {μt, Λt}, and μt and Λt denote the mean and precision, respectively \(( {{{{\mathbf{\upmu }}}}_t = \left( {\mu _{1,t},\mu _{2,t}} \right)^{\mathrm{T}};\,{{{\mathbf{{\Lambda}}}}}_{{{{t}}}} = \text{diag}\left( {p_{1,t},p_{2,t}} \right)})\). \(Q\left( {{{{\mathbf{w}}}}_t|\varphi _t} \right)\) denotes the recognition distribution. The model agent aims to update the recognition distribution through φt at each time step by minimizing the surprise, which is defined by \(- \ln P\left( {o_t|o_{1:t - 1}} \right)\). The surprise can be decomposed as follows:

$$\begin{array}{rcl} - \ln P\left( {o_t|o_{1:t - 1}} \right) & = & {\displaystyle{\int}} {Q\left( {{{{\mathbf{w}}}}_t|\varphi _t} \right)\ln \frac{{Q\left( {{{{\mathbf{w}}}}_t|\varphi _t} \right)}}{{P\left( {o_t,{{{\mathbf{w}}}}_t|o_{1:t - 1},{{{\mathbf{a}}}}_{1:t}} \right)}}\mathrm{d}{{{\mathbf{w}}}}_t} \cr \\ && - {{{\mathrm{KL}}}}\left[ {Q\left( {{{{\mathbf{w}}}}_t|\varphi _t} \right)||P\left( {{{{\mathbf{w}}}}_t|o_{1:t},{{{\mathbf{a}}}}_{1:t}} \right)} \right], \end{array}$$
(16)

where \({{{\mathrm{KL}}}}\left[ {q\left( {{{\mathbf{x}}}} \right)||p\left( {{{\mathbf{x}}}} \right)} \right]\) denotes the Kullback–Leibler (KL) divergence between the probability distributions q(x) and p(x). Because of the nonnegativity of KL divergence, the first term is the upper bound of the surprise:

$$\begin{array}{*{20}{c}} {F\left( {o_t,\varphi _t} \right) = {\displaystyle{\int}} {Q\left( {{{{\mathbf{w}}}}_t|\varphi _t} \right)\ln Q\left( {{{{\mathbf{w}}}}_t|\varphi _t} \right)\mathrm{d}{{{\mathbf{w}}}}_t} + {\displaystyle{\int}} {Q\left( {{{{\mathbf{w}}}}_{{{\boldsymbol{t}}}}|\varphi _t} \right)J\left( {o_t,{{{\mathbf{w}}}}_t} \right)\mathrm{d}{{{\mathbf{w}}}}_t,} } \end{array}$$
(17)

which is called the free energy, where \(J\left( {o_t,{{{\mathbf{w}}}}_t} \right) = - \ln P\left( {o_t,{{{\mathbf{w}}}}_t|o_{1:t - 1},{{{\mathbf{a}}}}_{1:t}} \right)\). The first term of the free energy corresponds to the negative entropy of a Gaussian distribution:

$$\begin{array}{*{20}{c}} {F_1 = {\displaystyle{\int}} {Q\left( {{{{\mathbf{w}}}}_{{{{t}}}}|\varphi _t} \right)\ln Q\left( {{{{\mathbf{w}}}}_{{{\boldsymbol{t}}}}|\varphi _t} \right)\mathrm{d}{{{\mathbf{w}}}}_t.} } \end{array}$$
(18)

The second term is approximated as

$$\begin{array}{rcl} F_2 & = & {\displaystyle{\int}} {Q\left( {{{{\mathbf{w}}}}_{{{\boldsymbol{t}}}}|\varphi _t} \right)J\left( {o_t,{{{\mathbf{w}}}}_t} \right)\mathrm{d}{{{\mathbf{w}}}}_t} \cr \\ & \cong & {\displaystyle{\int}} {Q\left( {{{{\mathbf{w}}}}_{{{{t}}}}|\varphi _t} \right)\left\{ {J\left( {o_t,{{{\mathbf{\upmu }}}}_t} \right) + \frac{{\mathrm{d}J}}{{\mathrm{d}w}}\left( {{{{\mathbf{w}}}}_t - {{{\mathbf{\upmu }}}}_t} \right) + \frac{1}{2}\frac{{\mathrm{d}^2J}}{{\mathrm{d}w^2}}\left( {{{{\mathbf{w}}}}_t - {{{\mathbf{\upmu }}}}_t} \right)^2} \right\}\mathrm{d}{{{\mathbf{w}}}}_t} \cr \\ & = & \left. {J\left( {o_t,{{{\mathbf{w}}}}_t} \right)} \right|_{w_t = \mu _t} + \frac{1}{2}\left. {\frac{{\mathrm{d}^2J}}{{\mathrm{d}w^2}}} \right|_{w_t = \mu _t}{{{\mathbf{{\Lambda}}}}}_t^{ - 1}. \end{array}$$
(19)

Note that \(J\left( {o_t,{{{\mathbf{w}}}}_t} \right)\) is expanded by a second-order Taylor series around \({{{\mathbf{\upmu }}}}_t\). At each time step, the agent updates \(\varphi _t\) by minimizing \(F\left( {o_t,\varphi _t} \right)\).

Calculation of free energy

The free energy is derived as follows: F1 simply becomes

$$\begin{array}{*{20}{c}} {F_1 = \frac{1}{2}\ln 2\uppi p_{1,t}^{ - 1} + \frac{1}{2}\ln 2\uppi p_{2,t}^{ - 1} + \text{const.}} \end{array}$$
(20)

For computing F2,

$$\begin{array}{*{20}{c}}{P\left( {o_t,{{{\mathbf{w}}}}_t|o_{1:t - 1},{{{\mathbf{a}}}}_{1:t}} \right) = P\left( {o_t|{{{\mathbf{w}}}}_t,{{{\mathbf{a}}}}_t} \right){\displaystyle{\int}} {P\left( {{{{\mathbf{w}}}}_t|{{{\mathbf{w}}}}_{t - 1}} \right)P\left( {{{{\mathbf{w}}}}_{t - 1}|o_{1:t - 1},{{{\mathbf{a}}}}_{1:t - 1}} \right)\mathrm{d}{{{\mathbf{w}}}}_{t - 1}} }\\ { \cong P\left( {o_t|{{{\mathbf{w}}}}_t,{{{\mathbf{a}}}}_t} \right){\displaystyle{\int}} {P\left( {{{{\mathbf{w}}}}_t|{{{\mathbf{w}}}}_{t - 1}} \right)N\left( {{{{\mathbf{w}}}}_{t - 1}|{{{\mathbf{\upmu }}}}_{t - 1},{{{\mathbf{{\Lambda}}}}}_{{{{\mathrm{t}}}} - 1}^{ - 1}} \right)\mathrm{d}{{{\mathbf{w}}}}_{t - 1}.} }\end{array}$$
(21)

In the second line of this equation, we use the approximated recognition distribution as the previous posterior \(P\left( {{{{\mathbf{w}}}}_{t - 1}|o_{1:t - 1},{{{\mathbf{a}}}}_{1:t - 1}} \right)\) . This equation can be written as

$$P\left( {o_t,{{{\mathbf{w}}}}_t|o_{1:t - 1},{{{\mathbf{a}}}}_{1:t}} \right) \cong \mathop {\prod}\limits_i {\left[ {f\left( {w_{i,t}} \right)^{o_t}\left\{ {1 - f\left( {w_{i,t}} \right)} \right\}^{1 - o_t}} \right]^{a_{i,t}}N\left( {w_{i,t}|\mu _{i,\,t - 1},p_{i,\,t - 1}^{ - 1} + \sigma _w^2} \right).}$$
(22)

Then

$$\begin{array}{*{20}{c}} {\left. {J\left( {o_t,{{{\mathbf{w}}}}_t} \right)} \right|_{w_t = \mu _t} = J_1\left( {o_t,\mu _{1,t}} \right) + J_2\left( {o_t,\mu _{2,t}} \right) + \text{const.},} \end{array}$$
(23)

where

$$\begin{array}{*{20}{c}} \begin{array}{ccccc}\\ J_i\left( {o_t,\,\mu _{i,t}} \right) = & a_{i,t}\left[ {o_t\ln f\left( {\mu _{i,t}} \right) + \left( {1 - o_t} \right)\ln \left\{ {1 - f\left( {\mu _{i,t}} \right)} \right\}} \right]\cr \\ & - \frac{1}{2}\frac{{\left( {\mu _{i,t} - \mu _{i,t - 1}} \right)^2}}{{p_{i,t - 1}^{ - 1} + \sigma _w^2}} - \frac{1}{2}\ln \left( {p_{i,t - 1}^{ - 1} + \sigma _w^2} \right).\\ \end{array} \end{array}$$
(24)

Thus, F2 is calculated by substituting this equation into equation (19). Taken together,

$$\begin{array}{*{20}{c}} {F\left( {o_t,\varphi _t} \right) = \mathop {\sum}\limits_i {\left\{ {J_i\left( {o_t,\mu _{i,t}} \right) + \frac{1}{2}\left. {\frac{{\mathrm{d}^2J_i}}{{\mathrm{d}w_{i,t}^2}}} \right|_{w_{i,t} = \mu _{i,t}}p_{i,t}^{ - 1} + \frac{1}{2}\ln 2\uppi p_{i,t}^{ - 1}} \right\}.} } \end{array}$$
(25)

Sequential updating of the agent’s recognition

The updating rule for φt was derived by minimizing the free energy. The optimized pi,t can be computed by \(\partial F/\partial p_{i,t}^{ - 1} = 0\), which leads to

$$\begin{array}{*{20}{c}} {p_{i,t} = \left. {\frac{{\mathrm{d}^2J_i}}{{\mathrm{d}w_{i,t}^2}}} \right|_{w_{i,t} = \mu _{i,t}}.} \end{array}$$
(26)

By substituting pi,t into equation (25), the second term in the summation becomes constant, irrespective of μi,t. Thus, μi,t is updated by minimizing only the first term as

$$\begin{array}{*{20}{c}} {\mu _{i,t} = \mu _{i,t - 1} - \alpha \delta _{i,a}\left. {\frac{{\partial J_i}}{{\partial \mu _{i,t}}}} \right|_{\mu _{i,t} = \mu _{i,t - 1}},} \end{array}$$
(27)

where α is the learning rate and \(\delta _{i,a}\) equals 1 if option i was selected at step t and 0 otherwise. These two equations finally lead to

$$\begin{array}{*{20}{c}} {\mu _{i,t} = \mu _{i,t - 1} + \alpha K_{i,t}\left( {o_t - f\left( {\mu _{i,t - 1}} \right)} \right),} \end{array}$$
(28)
$$\begin{array}{*{20}{c}} {p_{i,t} = K_{i,t}^{ - 1} + f\left( {\mu _{i,t}} \right)\left( {1 - f\left( {\mu _{i,t}} \right)} \right),} \end{array}$$
(29)

where \(K_{i,t} = \left( {p_{i,t - 1} + \sigma _w^{ - 2}} \right)/\left( {p_{i,t - 1}\sigma _w^{ - 2}} \right)\), which is called the Kalman gain. If option i was not selected, the second terms in both equations vanish, so the belief \(\mu _{i,t}\) stays the same while its precision decreases (that is, \(p_{i,t} < p_{i,t - 1}\)). If it is selected, the belief is updated by the prediction error (that is, \(o_t - f\left( {\mu _{i,t - 1}} \right)\)) and its precision is improved. The confidence of the recognized reward probability should be evaluated not in \(w_{i,t}\) space but in \(\lambda _{i,t}\) space; hence, the confidence is defined by \(\gamma _{i,t} = p_{i,t}/f^\prime \left( {\mu _{i,t}} \right)^2\).
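A direct transcription of equations (28) and (29), together with the confidence γ in reward-probability space, is sketched below in Python; it is our reading of the update rules, with illustrative parameter values, not the authors' code.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def update_recognition(mu, p, action, outcome, alpha, sigma_w):
    """One recognition step (eqs (28)-(29)).
    mu, p: length-2 arrays of belief means and precisions.
    action: index of the chosen option; outcome: 0 or 1."""
    mu, p = mu.copy(), p.copy()
    for i in range(2):
        k = (p[i] + sigma_w ** -2) / (p[i] * sigma_w ** -2)   # Kalman gain K_{i,t}
        if i == action:
            mu[i] = mu[i] + alpha * k * (outcome - sigmoid(mu[i]))
            p[i] = 1.0 / k + sigmoid(mu[i]) * (1.0 - sigmoid(mu[i]))
        else:
            p[i] = 1.0 / k          # unselected: mean unchanged, precision decays
    return mu, p

def confidence(mu, p):
    """Confidence in reward-probability space: gamma = p / f'(mu)^2."""
    fprime = sigmoid(mu) * (1.0 - sigmoid(mu))
    return p / fprime ** 2

mu, p = np.zeros(2), np.ones(2)
mu, p = update_recognition(mu, p, action=0, outcome=1, alpha=0.05, sigma_w=0.63)
print(mu, p, confidence(mu, p))
```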

Expected net utility

The expected net utility is described by

$$\begin{array}{*{20}{c}}{U_t\left( {{{{\mathbf{a}}}}_{t + 1}} \right) = c_t \cdot E_{P\left( {o_{t + 1}|{{{\mathbf{a}}}}_{t + 1}} \right)}\left[ {{{{\mathrm{KL}}}}\left[ {Q\left( {{{{\mathbf{w}}}}_{t + 1}|o_{t + 1},{{{\mathbf{a}}}}_{t + 1}} \right)||Q\left( {{{{\mathbf{w}}}}_{t + 1}|{{{\mathbf{a}}}}_{t + 1}} \right)} \right]} \right]}\\ { + E_{P\left( {o_{t + 1}|{{{\mathbf{a}}}}_{t + 1}} \right)}\left[ {R\left( {o_{t + 1}} \right)} \right],}\end{array}$$
(30)

where \(R\left( {o_{t + 1}} \right) = o_{t + 1}\ln \left( {P_o/\left( {1 - P_o} \right)} \right)\) ; the first and second terms represent the expected information gain and expected reward, respectively; and ct denotes the intensity of curiosity at time t. The heuristic idea of introducing the curiosity meta-parameter ct was also proposed in the RL field30.

We briefly show how to derive it based on ref. 41. The free energy at current time t is described by

$$\begin{array}{*{20}{c}} {F\left( {o_t,\varphi _t} \right) = E_{Q\left( {{{{\mathbf{w}}}}_t|\varphi _t} \right)}\left[ {\ln Q\left( {{{{\mathbf{w}}}}_t|\varphi _t} \right) - \ln P\left( {o_t,{{{\mathbf{w}}}}_t|o_{1:t - 1},{{{\mathbf{a}}}}_{1:t}} \right)} \right],} \end{array}$$
(31)

which is a rewriting of equation (17). Here, we attempt to express the future free energy at time t + 1, conditioned on action at+1, as

$$\begin{array}{*{20}{c}} {F\left( {o_{t + 1},{{{\mathbf{a}}}}_{t + 1}} \right) = E_{Q\left( {{{{\mathbf{w}}}}_{t + 1}|\varphi _t} \right)}\left[ {\ln Q\left( {{{{\mathbf{w}}}}_{t + 1}|\varphi _t} \right) - \ln P\left( {o_{t + 1},{{{\mathbf{w}}}}_{t + 1}|o_{1:t},{{{\mathbf{a}}}}_{1:t + 1}} \right)} \right].} \end{array}$$
(32)

However, there is an apparent problem in this equation: ot+1 has not yet been observed because it is a future observation. To resolve this issue, we take an expectation with respect to ot+1 as

$$\begin{array}{l} F\left( {{{{\mathbf{a}}}}_{t + 1}} \right) = E_{Q\left( {o_{{{{\boldsymbol{t}}}} + 1}|{{{\mathbf{w}}}}_{t + 1},{{{\mathbf{a}}}}_{t + 1}} \right)Q\left( {{{{\mathbf{w}}}}_{{{{\boldsymbol{t}}}} + 1}|\varphi _t} \right)} \\ \qquad \qquad\left[ {\ln Q\left( {{{{\mathbf{w}}}}_{t + 1}|\varphi _t} \right) - \ln P\left( {o_{t + 1},{{{\mathbf{w}}}}_{t + 1}|o_{1:t},{{{\mathbf{a}}}}_{1:t + 1}} \right)} \right]. \end{array}$$
(33)

This can be rewritten as

$$\begin{array}{lll} F\left( {{{{\mathbf{a}}}}_{t + 1}} \right) = E_{Q\left( {o_{t + 1},{{{\mathbf{w}}}}_{t + 1}|o_{1:t},{{{\mathbf{a}}}}_{t + 1}} \right)}\left[ \ln Q\left( {{{{\mathbf{w}}}}_{t + 1}|\varphi _t} \right)\right.\\ \qquad\qquad\quad\left.{ - \ln P\left( {{{{\mathbf{w}}}}_{t + 1}|o_{1:t + 1},{{{\mathbf{a}}}}_{1:t + 1}} \right) - \ln P\left( {o_{t + 1}|o_{1:t},{{{\mathbf{a}}}}_{1:t + 1}} \right)} \right]. \end{array}$$
(34)

Although \(P\left( {o_{t + 1}|o_{1:t},{{{\mathbf{a}}}}_{1:t + 1}} \right)\) can be computed by the generative model as

$$\begin{array}{*{20}{c}}{P\left( {o_{t + 1}|o_{1:t},{{{\mathbf{a}}}}_{1:t + 1}} \right) = {\displaystyle{\int}} {P\left( {o_{t + 1}|{{{\mathbf{w}}}}_{t + 1},{{{\mathbf{a}}}}_{1:t + 1}} \right)P\left( {{{{\mathbf{w}}}}_{t + 1}|o_{1:t}} \right){\mathrm{d}}{{{\mathbf{w}}}}_{t + 1}} }\\ { \cong {\displaystyle{\int}} {P\left( {o_{t + 1}|{{{\mathbf{w}}}}_{t + 1},{{{\mathbf{a}}}}_{1:t + 1}} \right)Q\left( {{{{\mathbf{w}}}}_{t + 1}|\varphi _t} \right){\mathrm{d}}{{{\mathbf{w}}}}_{t + 1},} }\end{array}$$
(35)

it is heuristically replaced with \(P\left( {o_{t + 1}} \right)\), the agent's prior preference of observing \(o_{t + 1}\), so that \(\ln P\left( {o_{t + 1}} \right)\) can be treated as an instrumental reward. Equation (34) is then further transformed into

$$\begin{array}{lll} F\left( {{{{\mathbf{a}}}}_{t + 1}} \right) = E_{Q\left( {o_{t + 1}|{{{\mathbf{a}}}}_{t + 1}} \right)Q\left( {{{{\mathbf{w}}}}_{t + 1}|o_{1:t + 1},{{{\mathbf{a}}}}_{t + 1}} \right)}\left[ \ln Q\left( {{{{\mathbf{w}}}}_{t + 1}|\varphi _t} \right)\right.\\ \qquad\qquad\quad \left.{ - \ln Q\left( {{{{\mathbf{w}}}}_{t + 1}|o_{1:t + 1},{{{\mathbf{a}}}}_{1:t + 1}} \right) - \ln P\left( {o_{t + 1}} \right)} \right]. \end{array}$$
(36)

Finally, we obtained the so-called expected free energy as

$$\begin{array}{*{20}{c}} {F\left( {{{{\mathbf{a}}}}_{t + 1}} \right) = E_{Q\left( {o_{t + 1}|{{{\mathbf{a}}}}_{t + 1}} \right)}\left[ { - {{{\mathrm{KL}}}}\left[ {Q\left( {{{{\mathbf{w}}}}_{t + 1}|o_{t + 1},{{{\mathbf{a}}}}_{t + 1}} \right)||Q\left( {{{{\mathbf{w}}}}_{t + 1}} \right)} \right] - \ln P\left( {o_{t + 1}} \right)} \right].} \end{array}$$
(37)

where the KL divergence is called the Bayesian surprise, representing the extent to which the agent’s beliefs are updated by the observation, whereas the second term \(\ln P\left( {o_{t + 1}} \right)\) can be treated as the agent’s prior preference of observing \(o_{t + 1}\), which can be interpreted as an instrumental reward. Therefore, the expected free energy is derived in an ad hoc manner.

In the first term of equation (30), the posterior and prior distributions of \({{{\mathbf{w}}}}_{t + 1}\) in the KL divergence are derived as

$$\begin{array}{*{20}{c}} {Q\left( {{{{\mathbf{w}}}}_{t + 1}} \right) = {\displaystyle{\int}} {P\left( {{{{\mathbf{w}}}}_{t + 1}|{{{\mathbf{w}}}}_t} \right)Q\left( {{{{\mathbf{w}}}}_t} \right){\mathrm{d}}{{{\mathbf{w}}}}_t} ,} \end{array}$$
(38)
$$\begin{array}{*{20}{c}} {Q\left( {{{{\mathbf{w}}}}_{t + 1}|o_{t + 1,}{{{\mathbf{a}}}}_{t + 1}} \right) = \frac{{P\left( {o_{t + 1}|{{{\mathbf{w}}}}_{t + 1},{{{\mathbf{a}}}}_{t + 1}} \right)Q\left( {{{{\mathbf{w}}}}_{{{{\boldsymbol{t}}}} + 1}|{{{\mathbf{a}}}}_{t + 1}} \right)}}{{P\left( {o_{t + 1}|{{{\mathbf{a}}}}_{t + 1}} \right)}},} \end{array}$$
(39)

respectively. \(o_{t + 1}\) is a future observation; therefore, the first term is averaged over \(P\left( {o_{t + 1}|{{{\mathbf{a}}}}_{t + 1}} \right)\), which can be calculated as

$$\begin{array}{*{20}{c}} {P\left( {o_{t + 1}|{{{\mathbf{a}}}}_{t + 1}} \right) = {\displaystyle{\int}} {P\left( {o_{t + 1}|{{{\mathbf{w}}}}_{t + 1},{{{\mathbf{a}}}}_{t + 1}} \right)Q\left( {{{{\mathbf{w}}}}_{t + 1}|{{{\mathbf{a}}}}_{t + 1}} \right){\mathrm{d}}{{{\mathbf{w}}}}_{t + 1}.} } \end{array}$$
(40)

In the second term, the reward is quantitatively interpreted as the desired probability of \(o_{t + 1}\). For the two-choice task, we use

$$\begin{array}{*{20}{c}} {P\left( {o_{t + 1}} \right) = P_o^{o_{t + 1}}\left( {1 - P_o} \right)^{\left( {1 - o_{t + 1}} \right)},} \end{array}$$
(41)

where Po indicates the desired probability of the presence of a reward. According to the probabilistic interpretation of reward22,23, the presence and absence of a reward can be evaluated by ln Po and ln(1 − Po), respectively.

In this study, we changed the sign of the expected free energy to define the expected net utility, because we modeled decision-making as the maximization of the expected net utility rather than the minimization of the expected free energy. We then introduced the curiosity meta-parameter ct to express irrational decision-making. Because rewards are relative, in this study we set \({{{\mathrm{ln}}}}\left\{ {P_o/\left( {1 - P_o} \right)} \right\}\) and 0 for the presence and absence of a reward, respectively. Thus, our expected net utility is described by equation (30), and there is therefore a gap between the original expected free energy and our expected net utility.

Model for action selection

The agent probabilistically selects a choice with higher expected net utility as

$$\begin{array}{*{20}{c}} {P\left( {{{{\mathbf{a}}}}_{t + 1}} \right) = \frac{{\exp \left( {\beta U\left( {{{{\mathbf{a}}}}_{t + 1}} \right)} \right)}}{{\mathop {\sum}\nolimits_{{{\mathbf{a}}}} {\exp \left( {\beta U\left( {{{\mathbf{a}}}} \right)} \right)} }},} \end{array}$$
(42)

where \(U\left( {{{{\mathbf{a}}}}_{t + 1}} \right)\) indicates the expected net utility of action at+1. Equation (42) is equivalent to equation (2). To derive equation (42), we considered the expectation of the expected net utility with respect to probabilistic action as

$$\begin{array}{*{20}{c}} {E\left[ U \right] = E_{Q\left( {{{{\mathbf{a}}}}_{t + 1}} \right)}\left[ {U\left( {{{{\mathbf{a}}}}_{t + 1}} \right) - \beta ^{ - 1}\ln Q\left( {{{{\mathbf{a}}}}_{t + 1}} \right)} \right],} \end{array}$$
(43)

where β indicates an inverse temperature, and the entropic constraint of action probability is introduced in the second term. This equation can be rewritten as

$$\begin{array}{*{20}{c}} {E\left[ U \right] = - \beta ^{ - 1}\text{KL}\left[ {Q\left( {{{{\mathbf{a}}}}_{t + 1}} \right)\Vert\exp \left( {\beta U\left( {{{{\mathbf{a}}}}_{t + 1}} \right)} \right)/Z} \right] + \beta ^{ - 1}\ln Z,} \end{array}$$
(44)

where Z indicates a normalization constant. Thus, maximization of equation (44) with respect to \(Q\left( {{{{\mathbf{a}}}}_{t + 1}} \right)\) leads to the optimal action probability shown in equation (42).
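
As a concrete illustration of equation (42), the softmax action selection can be sketched in a few lines of Python. This is a minimal sketch rather than the implementation used in this study; the expected net utilities and the inverse temperature β are simply supplied as numbers.

```python
import numpy as np

def action_probabilities(utilities, beta):
    """Softmax over expected net utilities (equation (42))."""
    logits = beta * np.asarray(utilities, dtype=float)
    logits -= logits.max()                  # subtract the maximum for numerical stability
    p = np.exp(logits)
    return p / p.sum()

# Example: two options whose expected net utilities differ by 0.5
print(action_probabilities([1.0, 0.5], beta=2.0))  # approximately [0.73, 0.27]
```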

Alternative expected net utility

For comparison, we consider an alternative expected net utility by introducing a time-dependent meta-parameter in the second term as follows:

$$\begin{array}{*{20}{c}}{U\left( {{{{\mathbf{a}}}}_{t + 1}} \right) = E_{P\left( {o_{t + 1}|{{{\mathbf{a}}}}_{t + 1}} \right)}\left[ {{{{\mathrm{KL}}}}\left[ {Q\left( {{{{\mathbf{w}}}}_{t + 1}|o_{t + 1},{{{\mathbf{a}}}}_{t + 1}} \right)||Q\left( {{{{\mathbf{w}}}}_{t + 1}|{{{\mathbf{a}}}}_{t + 1}} \right)} \right]} \right]}\\ { +\, d_t \cdot E_{P\left( {o_{t + 1}|{{{\mathbf{a}}}}_{t + 1}} \right)}\left[ {R\left( {o_{t + 1}} \right)} \right],}\end{array}$$
(45)

where dt denotes the subjective intensity of reward at time t. In this case, an agent with a high dt shows more exploitative behavior, whereas an agent with dt = 0 shows explorative behavior driven only by the expected information gain.

Calculation of expected net utility

Here, we present the calculation of the expected net utility. The KL divergence in the first term of equation (30) can be transformed into

$$\begin{array}{*{20}{c}} {E_{P\left( {o_{t + 1}|{{{\mathbf{a}}}}_{t + 1}} \right)}\left[ {\text{KL}\left[ {Q\left( {{{{\mathbf{w}}}}_{t + 1}|o_{t + 1},{{{\mathbf{a}}}}_{t + 1}} \right)||Q\left( {{{{\mathbf{w}}}}_{t + 1}|{{{\mathbf{a}}}}_{t + 1}} \right)} \right]} \right] = H\left( {o_{t + 1}} \right) - H\left( {o_{t + 1}|{{{\mathbf{w}}}}_{t + 1}} \right),} \end{array}$$
(46)

where \(H\left( {o_{t + 1}} \right)\) and \(H\left( {o_{t + 1}|{{{\mathbf{w}}}}_{t + 1}} \right)\) denote the marginal and conditional entropies, respectively:

$$\begin{array}{*{20}{c}} {H\left( {o_{t + 1}|{{{\mathbf{w}}}}_{t + 1}} \right) = {{{\mathrm{E}}}}_{P\left( {o_{t + 1}|{{{\mathbf{w}}}}_{t + 1},{{{\mathbf{a}}}}_{t + 1}} \right)Q\left( {{{{\mathbf{w}}}}_{t + 1}|{{{\boldsymbol{a}}}}_{t + 1}} \right)}\left[ { - \ln P\left( {o_{t + 1}|{{{\mathbf{w}}}}_{t + 1},{{{\mathbf{a}}}}_{t + 1}} \right)} \right],} \end{array}$$
(47)
$$\begin{array}{*{20}{c}} {H\left( {o_{t + 1}} \right) = {{{\mathrm{E}}}}_{P\left( {o_{t + 1}|{{{\mathbf{a}}}}_{t + 1}} \right)}\left[ { - \ln P\left( {o_{t + 1}|{{{\mathbf{a}}}}_{t + 1}} \right)} \right].} \end{array}$$
(48)

The conditional entropy \({{{\mathrm{H}}}}\left( {o_{t + 1}|{{{\mathbf{w}}}}_{t + 1}} \right)\) can be calculated by substituting equation (13) into equation (47) as

$$H\left( {o_{t + 1}|{{{\mathbf{w}}}}_{t + 1}} \right) = - E_{Q\left( {{{{\mathbf{w}}}}_{t + 1}|{{{\mathbf{a}}}}_{t + 1}} \right)}\left[ {\mathop {\sum}\nolimits_i {a_{i,t + 1}g\left( {w_{i,t + 1}} \right)} } \right],$$
(49)

where

$$\begin{array}{*{20}{c}} {g\left( w \right) = f\left( w \right)\ln f\left( w \right) + \left( {1 - f\left( w \right)} \right)\ln \left( {1 - f\left( w \right)} \right).} \end{array}$$
(50)

Here, we approximately calculate this expectation using a second-order Taylor expansion of \(g\left( {w_{i,t + 1}} \right)\) around \(\mu _{i,t + 1}\):

$$\begin{array}{*{20}{c}} {{{{\mathrm{H}}}}\left( {o_{t + 1}|{{{\mathbf{w}}}}_{t + 1}} \right) \cong - E_{Q\left( {{{{\mathbf{w}}}}_{t + 1}{{{\mathrm{|}}}}{{{\mathbf{a}}}}_{t + 1}} \right)}\left[ {\mathop {\sum}\limits_i {a_{i,t + 1}\left\{ {\begin{array}{*{20}{c}} {g\left( {\mu _{i,t + 1}} \right) + \frac{{\partial g}}{{\partial w_{i,t + 1}}}\left( {w_{i,t + 1} - \mu _{i,t + 1}} \right)} \\ { + \frac{1}{2}\frac{{\partial ^2g}}{{\partial w_{i,t + 1}^2}}\left( {w_{i,t + 1} - \mu _{i,t + 1}} \right)^2} \end{array}} \right\}} } \right],} \end{array}$$
(51)

which leads to

$$\begin{array}{*{20}{c}} {{{\mathrm{H}}}}\left( {o_{t + 1}|{{{\mathbf{w}}}}_{t + 1}} \right) = \\- \mathop {\sum}\limits_i {a_{i,t + 1}\left[ {\begin{array}{*{20}{c}} {f\left( {\mu _{i,t + 1}} \right){\mathrm{ln}}f\left( {\mu _{i,t + 1}} \right) + \left( {1 - f\left( {\mu _{i,t + 1}} \right)} \right){\mathrm{ln}}\left( {1 - f\left( {\mu _{i,t + 1}} \right)} \right)} \\ { + \frac{1}{2}\left\{ {f\left( {\mu _{i,t + 1}} \right)\left( {1 - f\left( {\mu _{i,t + 1}} \right)} \right)\left( {1 + \left( {1 - 2f\left( {\mu _{i,t + 1}} \right)} \right){{{\mathrm{ln}}}}\frac{{f\left( {\mu _{i,t + 1}} \right)}}{{1 - f\left( {\mu _{i,t + 1}} \right)}}} \right)} \right\}} \\ {\left( {p_{i,t}^{ - 1} + p_w^{ - 1}} \right)} \end{array}} \right]} . \end{array}$$
(52)

The marginal entropy \({{{\mathrm{H}}}}\left( {o_{t + 1}} \right)\) can be calculated as

$$\begin{array}{*{20}{c}}{{{{\mathrm{H}}}}\left( {o_{t + 1}} \right) = - \mathop {\sum}\limits_i {a_{i,t + 1}\left\{ {P\left( {o_{t + 1} = 0|{{{\mathbf{a}}}}_{t + 1}} \right)\ln P\left( {o_{t + 1} = 0|{{{\mathbf{a}}}}_{t + 1}} \right)} \right.} }\\ {\left. { + P\left( {o_{t + 1} = 1|{{{\mathbf{a}}}}_{t + 1}} \right)\ln P\left( {o_{t + 1} = 1|{{{\mathbf{a}}}}_{t + 1}} \right)} \right\},}\end{array}$$
(53)

where

$$\begin{array}{ccc}\\ & P\left( {o_{t + 1}|{{{\mathbf{a}}}}_{t + 1}} \right) = {\displaystyle{\int}} {P\left( {o_{t + 1}|{{{\mathbf{w}}}}_{t + 1},{{{\mathbf{a}}}}_{t + 1}} \right)Q\left( {{{{\mathbf{w}}}}_{t + 1}|{{{\mathbf{a}}}}_{t + 1}} \right){\mathrm{d}}{{{\mathbf{w}}}}_{t + 1}} \cr \\ & = {\displaystyle{\int}} {\mathop {\prod }\limits_i \left\{ {f\left( {w_{i,t + 1}} \right)^{o_{t + 1}}\left( {1 - f\left( {w_{i,t + 1}} \right)} \right)^{1 - o_{t + 1}}} \right\}^{a_{i,t + 1}}Q\left( {{{{\mathbf{w}}}}_{t + 1}|{{{\mathbf{a}}}}_{t + 1}} \right){\mathrm{d}}{{{\mathbf{w}}}}_{t + 1}} \cr \\ & = \mathop {\prod }\limits_i \left[ {\begin{array}{*{20}{c}} {f\left( {\mu _{i,t + 1}} \right)^{o_{t + 1}}\left( {1 - f\left( {\mu _{i,t + 1}} \right)} \right)^{1 - o_{t + 1}}} \\ { + 1^{o_{t + 1}}\left( { - 1} \right)^{1 - o_{t + 1}}\frac{1}{2}f\left( {\mu _{i,t + 1}} \right)\left\{ {1 - f\left( {\mu _{i,t + 1}} \right)} \right\}\left\{ {1 - 2f\left( {\mu _{i,t + 1}} \right)} \right\}\left( {p_{i,t}^{ - 1} + p_w^{ - 1}} \right)} \end{array}} \right]^{a_{i,t + 1}}.\\ \end{array}$$
(54)

The second term of the expected net utility (equation (36)) is calculated as

$$\begin{array}{ccccc}\\ & E_{P\left( {o_{t + 1}|{{{\mathbf{a}}}}_{t + 1}} \right)}\left[ {{{{\mathrm{ln}}}}P\left( {o_{t + 1}} \right)} \right]= E_{P\left( {o_{t + 1}{{{\mathrm{|}}}}{{{\mathbf{a}}}}_{t + 1}} \right)}\left[ {o_{t + 1}{{{\mathrm{ln}}}}P_o + \left( {1 - o_{t + 1}} \right)\ln \left( {1 - P_o} \right)} \right]\cr \\ & = P\left( {o_{t + 1} = 0|{{{\mathbf{a}}}}_{t + 1}} \right)\ln \left( {1 - P_o} \right) + P\left( {o_{t + 1} = 1|{{{\mathbf{a}}}}_{t + 1}} \right)\ln P_o.\\ \end{array}$$
(55)
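
The quantities in equations (46)–(55) can be assembled into a short numerical sketch. The following Python code is illustrative only: it assumes that the observation model f of equation (13) is a logistic sigmoid (equation (13) is not reproduced in this section), uses the unit reward intensity described above and places the curiosity meta-parameter on the information-gain term in line with equation (76); the helper names are hypothetical.

```python
import numpy as np

def f(w):
    """Assumed logistic observation model P(o = 1 | w); equation (13) is not shown here."""
    return 1.0 / (1.0 + np.exp(-w))

def conditional_entropy(mu, p, p_w, a):
    """Second-order approximation of H(o_{t+1} | w_{t+1}), equation (52)."""
    fm = f(mu)
    g0 = fm * np.log(fm) + (1 - fm) * np.log(1 - fm)
    curv = 0.5 * fm * (1 - fm) * (1 + (1 - 2 * fm) * np.log(fm / (1 - fm)))
    return -np.sum(a * (g0 + curv * (1.0 / p + 1.0 / p_w)))

def predictive_prob(mu, p, p_w, a):
    """P(o_{t+1} = 1 | a_{t+1}) from equation (54), for a one-hot action a."""
    fm = f(mu)
    corr = 0.5 * fm * (1 - fm) * (1 - 2 * fm) * (1.0 / p + 1.0 / p_w)
    return np.prod((fm + corr) ** a)

def expected_net_utility(mu, p, p_w, a, c_t):
    """U(a) = E[reward] + c_t * E[information gain], with unit reward intensity."""
    p1 = predictive_prob(mu, p, p_w, a)
    p0 = 1.0 - p1
    marginal_H = -(p0 * np.log(p0) + p1 * np.log(p1))            # equation (53)
    info_gain = marginal_H - conditional_entropy(mu, p, p_w, a)  # equation (46)
    expected_reward = p1 * 1.0                                   # ln{Po/(1-Po)} = 1 for presence, 0 for absence
    return expected_reward + c_t * info_gain

# Example: two options, both with recognized mean 0, precision 5 and precision parameter p_w = 100
mu, p, p_w = np.array([0.0, 0.0]), np.array([5.0, 5.0]), 100.0
print(expected_net_utility(mu, p, p_w, a=np.array([1.0, 0.0]), c_t=2.0))
```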

Observer-SSM

We constructed the observer-SSM, which describes the temporal transition of the latent internal state z of the agent and the generation of actions, from the viewpoint of an observer of the agent. This is depicted graphically in Fig. 4. As prior information, we assumed that the agent acts based on its internal state, that is, the intensity of curiosity, the recognized reward probabilities and their confidence levels. The intensity of curiosity was assumed to change over time as a random walk:

$$\begin{array}{*{20}{c}} {c_t = c_{t - 1} + {\it{\epsilon }}\zeta _t,} \end{array}$$
(56)

where \(\zeta _t\) denotes white noise with zero mean and unit variance, and \({\it{\epsilon }}\) denotes its noise intensity. The other internal states, that is, \(\mu _i\) and \(p_i\), were assumed to be updated according to equations (28) and (29). The transition of the internal state is expressed by the probability distribution

$$\begin{array}{*{20}{c}} {P\left( {{{{\mathbf{z}}}}_t|{{{\mathbf{z}}}}_{t - 1}} \right) = {{{\mathcal{N}}}}\left( {{{{\mathbf{z}}}}_t|{{{\mathbf{F}}}}\left( {{{{\mathbf{z}}}}_{t - 1},{{{\mathbf{a}}}}_{t - 1}} \right),{{{{{\varGamma}}}}}} \right),} \end{array}$$
(57)
$${{{\mathbf{F}}}}\left( {{{{\mathbf{z}}}}_{t - 1},{{{\mathbf{a}}}}_{t - 1}} \right) = \left[ {\begin{array}{*{20}{c}} 0 \\ {h\left( {\mu _{1,t - 1},p_{1,t - 1},o_t,a_1} \right)} \\ {h\left( {\mu _{2,t - 1},p_{2,t - 1},o_t,a_2} \right)} \\ {k\left( {\mu _{1,t - 1},p_{1,t - 1},a_1} \right)} \\ {k\left( {\mu _{2,t - 1},p_{2,t - 1},a_2} \right)} \end{array}} \right],$$
(58)

where \({{{\mathbf{z}}}}_t = \left( {c_t,{{{\mathbf{\mu }}}}_t^{\mathrm{T}},{{{\mathbf{p}}}}_t^{\mathrm{T}}} \right)^{{{\mathrm{T}}}}\) and \({{{{{\varGamma}}}}} = {\it{\epsilon }}^2\text{diag}(1,0,0,0,0)\). \(h\left( {\mu _{i,t - 1},p_{i,t - 1},o_t,a_i} \right)\) and \(k\left( {\mu _{i,t - 1},p_{i,t - 1},a_i} \right)\) represent the right-hand sides of equations (28) and (29), respectively; \({{{{{\varGamma}}}}}\) denotes the variance–covariance matrix, and \(\text{diag}\left( {{{\mathbf{x}}}} \right)\) denotes the square matrix whose diagonal components are given by \({{{\mathbf{x}}}}\). In addition, the agent was assumed to select an action \({{{\mathbf{a}}}}_{t + 1}\) based on the expected net utilities, as follows:

$$\begin{array}{*{20}{c}} {P\left( {{{{\mathbf{a}}}}_{t + 1}} \right) = \frac{{{{{\mathrm{exp}}}}(\beta U({{{\mathbf{a}}}}_{t + 1}))}}{{\mathop {\sum}\nolimits_{{{\mathbf{a}}}} {\exp \left( {\beta U\left( {{{\mathbf{a}}}} \right)} \right)} }},} \end{array}$$
(59)

and the reward was obtained by the following probability distribution:

$$\begin{array}{*{20}{c}} {P\left( {o_t|{{{\mathbf{a}}}}_t} \right) = \mathop {\prod }\limits_i \left\{ {\lambda _{i,t}^{o_t}(1 - \lambda _{i,t})^{1 - o_t}} \right\}^{a_{i,t}}.} \end{array}$$
(60)
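
The observer-SSM transition of equations (56)–(58) can be sketched as a single update function. In this hypothetical sketch, h and k are placeholder callables standing in for the right-hand sides of equations (28) and (29), which are not reproduced in this section, and rng is a NumPy random generator.

```python
import numpy as np

def observer_ssm_step(z, a, o, eps, h, k, rng):
    """One transition of the observer-SSM (equations (56)-(58)).

    z = (c, mu1, mu2, p1, p2); a is the one-hot action and o the observed outcome.
    h and k are placeholders for the right-hand sides of equations (28) and (29).
    """
    c, mu1, mu2, p1, p2 = z
    c_next = c + eps * rng.standard_normal()              # random walk on curiosity, equation (56)
    mu_next = (h(mu1, p1, o, a[0]), h(mu2, p2, o, a[1]))  # recognized reward probabilities
    p_next = (k(mu1, p1, a[0]), k(mu2, p2, a[1]))         # confidence (precision) updates
    return np.array([c_next, *mu_next, *p_next])
```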

Q-learning in the two-choice task and its observer-SSM

The decision-making in the two-choice task was also modeled by Q-learning. Reward prediction for the ith option Qi,t is updated as

$$\begin{array}{*{20}{c}} {Q_{i,t} = Q_{i,t - 1} + \alpha _{t - 1}\left( {r_ta_{i,t - 1} - Q_{i,t - 1}} \right),} \end{array}$$
(61)

where \(\alpha _t\) indicates the learning rate at trial t. The agent selects an action following a softmax function:

$$\begin{array}{*{20}{c}} {P\left( {a_{i,t} = 1} \right) = \frac{{\exp \left( {B_tQ_{i,t}} \right)}}{{\mathop {\sum}\nolimits_i {\exp \left( {B_tQ_{i,t}} \right)} }},} \end{array}$$
(62)

where Bt indicates the inverse temperature at trial t, which controls the randomness of the action selection.
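
A minimal sketch of one Q-learning trial, combining the value update of equation (61) with the softmax choice of equation (62), is given below. Read literally, equation (61) also decays the value of the unchosen option toward zero, which the vectorized update reproduces; the function signature is hypothetical.

```python
import numpy as np

def q_learning_trial(Q, a_prev, r, alpha_t, B_t, rng):
    """Value update (equation (61)) followed by softmax action selection (equation (62))."""
    Q = Q + alpha_t * (r * a_prev - Q)          # vectorized over the two options
    logits = B_t * Q
    p = np.exp(logits - logits.max())
    p /= p.sum()
    choice = rng.choice(len(Q), p=p)
    a = np.eye(len(Q))[choice]                  # one-hot action for the current trial
    return Q, a, p

rng = np.random.default_rng(0)
Q, a, p = q_learning_trial(np.zeros(2), np.array([1.0, 0.0]), r=1, alpha_t=0.05, B_t=3.0, rng=rng)
```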

The time-dependent parameters αt and Bt in Q-learning were estimated from behavioral data37. These parameters were assumed to change temporally as a random walk:

$$\begin{array}{*{20}{c}} {\theta _t = \theta _{t - 1} + {\it{\epsilon }}_\theta \zeta _{\theta ,t},} \end{array}$$
(63)

where \(\theta \in \left\{ {\alpha ,B} \right\}\), \(\zeta _{\theta ,t}\) denotes white noise with zero mean and unit variance, and \({\it{\epsilon }}_\theta\) denotes its noise intensity. Thus, the transition of the internal state is expressed by the probability distribution

$$\begin{array}{*{20}{c}} {P\left( {{{{\mathbf{z}}}}_t|{{{\mathbf{z}}}}_{t - 1}} \right) = {{{\mathcal{N}}}}\left( {{{{\mathbf{z}}}}_t|{{{\mathbf{F}}}}\left( {{{{\mathbf{z}}}}_{t - 1},{{{\mathbf{a}}}}_{t - 1}} \right),{{{{{\Gamma}}}}}} \right),} \end{array}$$
(64)
$$\begin{array}{*{20}{c}} {\mathbf{F}\left( {{{{\mathbf{z}}}}_{t - 1},{{{\mathbf{a}}}}_{t - 1}} \right) = \left[ {\begin{array}{*{20}{c}} 0 \\ 0 \\ {h_1\left( {\alpha _t,Q_{1,t - 1},{{{\mathbf{a}}}}_{t - 1}} \right)} \\ {h_2\left( {\alpha _t,Q_{2,t - 1},{{{\mathbf{a}}}}_{t - 1}} \right)} \end{array}} \right],} \end{array}$$
(65)

where \({{{\mathbf{z}}}}_t = \left( {\alpha _t,B_t,Q_{1,t},Q_{2,t}} \right)^{{{\mathrm{T}}}}\) and \({{{{{\Gamma}}}}} = {\it{\epsilon }}^2{\mathrm{diag}}\left( {1,1,0,0} \right)\). \(h_i\left( {\alpha _t,Q_{i,t - 1},{{{\mathbf{a}}}}_{t - 1}} \right)\) represents the right-hand side of equation (61); \({{{{{\Gamma}}}}}\) denotes the variance–covariance matrix, and \(\text{diag}\left( {{{\mathbf{x}}}} \right)\) denotes the square matrix whose diagonal components are given by \({{{\mathbf{x}}}}\).
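
For completeness, the corresponding Q-learning observer-SSM transition (equations (63)–(65)) can be sketched as follows. The handling of the learning-rate index inside the Q update is one plausible reading of equations (61) and (65), and the per-parameter noise intensities are written out explicitly; the function is a sketch, not the authors' implementation.

```python
import numpy as np

def q_observer_ssm_step(z, a_prev, r, eps_alpha, eps_B, rng):
    """One transition of the Q-learning observer-SSM (equations (63)-(65)).

    z = (alpha, B, Q1, Q2): alpha and B follow independent random walks, whereas
    the Q values are updated deterministically through equation (61).
    """
    alpha, B, Q1, Q2 = z
    alpha_next = alpha + eps_alpha * rng.standard_normal()   # equation (63) for alpha
    B_next = B + eps_B * rng.standard_normal()               # equation (63) for B
    Q1_next = Q1 + alpha * (r * a_prev[0] - Q1)              # equation (61)
    Q2_next = Q2 + alpha * (r * a_prev[1] - Q2)
    return np.array([alpha_next, B_next, Q1_next, Q2_next])
```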

iFEP by particle filter and Kalman backward algorithm

Based on the observer-SSM, we estimated the posterior distribution of the latent internal state of the agent, \(z_t\), given all observations from 1 to T (\(x_{1:T}\)) in a Bayesian manner, that is, \(P\left( {z_t|x_{1:T}} \right)\). This estimation was performed by forward and backward algorithms, called filtering and smoothing, respectively.

In filtering, the posterior distribution of \(z_t\) given observations until t (\(x_{1:t}\)) is sequentially updated in a forward direction as

$$\begin{array}{*{20}{c}} {P\left( {{{{\mathbf{z}}}}_t|{{{\mathbf{x}}}}_{1:t}} \right) \propto P\left( {{{{\mathbf{x}}}}_t|{{{\mathbf{z}}}}_t,\theta } \right){\displaystyle{\int}} {P\left( {{{{\mathbf{z}}}}_t|{{{\mathbf{z}}}}_{t - 1},\theta } \right)P\left( {{{{\mathbf{z}}}}_{t - 1}|{{{\mathbf{x}}}}_{1:t - 1}} \right){\mathrm{d}}{{{\mathbf{z}}}}_{t - 1},} } \end{array}$$
(66)

where \({{{\mathbf{x}}}}_t = \left( {{{{\mathbf{a}}}}_t^T,o_t} \right)^{{{\mathrm{T}}}}\) and \(\theta = \left\{ {\sigma ^2,\alpha ,P_o} \right\}\). The prior distribution of \({{{\mathbf{z}}}}_1\) is

$$\begin{array}{*{20}{c}} {P\left( {{{{\mathbf{z}}}}_1} \right) = \left[ {\mathop {\prod }\limits_i {{{\mathcal{N}}}}(\mu _{i,1}|\mu _0,\sigma _\mu ^2)\text{Gam}(p_{i,1}|a_g,b_g)} \right]\text{Uni}(c_1|a_u,b_u),} \end{array}$$
(67)

where \(\mu _0\) and \(\sigma _\mu ^2\) denote the prior mean and variance, \(\text{Gam}(x|a_g,b_g)\) indicates the Gamma distribution with shape parameter \(a_g\) and scale parameter \(b_g\), and \(\text{Uni}\left( {x|a_u,b_u} \right)\) indicates the uniform distribution from \(a_u\) to \(b_u\). We used a particle filter42 to sequentially calculate the posterior \(P\left( {{{{\mathbf{z}}}}_t|{{{\mathbf{x}}}}_{1:t}} \right)\), which cannot be derived analytically because of the nonlinear transition probability.
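
A generic bootstrap particle filter for equation (66) might look as follows. This is a sketch under stated assumptions rather than the authors' implementation: the model-specific densities (the transition of equation (57), the prior of equation (67) and the likelihood of equations (59) and (60)) are passed in as placeholder callables, and multinomial resampling at every step is assumed.

```python
import numpy as np

def particle_filter(x_obs, sample_prior, sample_transition, likelihood,
                    n_particles=100_000, seed=0):
    """Bootstrap particle filter for equation (66).

    sample_prior(n) draws z_1 from equation (67); sample_transition(z, x_prev)
    draws z_t ~ P(z_t | z_{t-1}); likelihood(x_t, z) returns P(x_t | z_t).
    Returns the filtered sample means m_t and covariances V_t used in the backward pass.
    """
    rng = np.random.default_rng(seed)
    particles = sample_prior(n_particles)
    means, covs = [], []
    for t, x_t in enumerate(x_obs):
        if t > 0:
            particles = sample_transition(particles, x_obs[t - 1])
        w = likelihood(x_t, particles)
        w = w / w.sum()                                       # normalize importance weights
        idx = rng.choice(n_particles, size=n_particles, p=w)  # multinomial resampling
        particles = particles[idx]
        means.append(particles.mean(axis=0))
        covs.append(np.cov(particles, rowvar=False))
    return np.array(means), np.array(covs)
```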

After the particle filter, the posterior distribution of \({{{\mathbf{z}}}}_t\) given all observations (\({{{\mathbf{x}}}}_{1:T}\)) is sequentially updated in a backward direction as

$$\begin{array}{ccccc}\\ & P\left( {{{{\mathbf{z}}}}_t|{{{\mathbf{x}}}}_{1:T}} \right) = {\displaystyle{\int}} {P\left( {{{{\mathbf{z}}}}_{t + 1}|{{{\mathbf{x}}}}_{1:T}} \right)P\left( {{{{\mathbf{z}}}}_t|{{{\mathbf{z}}}}_{t + 1},{{{\mathbf{x}}}}_{1:t},\theta } \right){\mathrm{d}}{{{\mathbf{z}}}}_{t + 1}} \cr \\ & = {\displaystyle{\int}} {P\left( {{{{\mathbf{z}}}}_{t + 1}|{{{\mathbf{x}}}}_{1:T}} \right)\frac{{P\left( {{{{\mathbf{z}}}}_{t + 1}|{{{\mathbf{z}}}}_t,\theta } \right)P\left( {{{{\mathbf{z}}}}_t|{{{\mathbf{x}}}}_{1:t},\theta } \right)}}{{{\displaystyle{\int}} {P\left( {{{{\mathbf{z}}}}_{t + 1}|{{{\mathbf{z}}}}_t,\theta } \right)P\left( {{{{\mathbf{z}}}}_t|{{{\mathbf{x}}}}_{1:t},\theta } \right){\mathrm{d}}{{{\mathbf{z}}}}_t} }}{\mathrm{d}}{{{\mathbf{z}}}}_{t + 1}} .\\ \end{array}$$
(68)

However, this backward integration is intractable because of the non-Gaussian \(P\left( {{{{\mathbf{z}}}}_T|{{{\mathbf{x}}}}_{1:T}} \right)\), which was represented by the particle ensemble in the particle filter, and the nonlinear relationship between \({{{\mathbf{z}}}}_t\) and \({{{\mathbf{z}}}}_{t + 1}\) in \(P\left( {{{{\mathbf{z}}}}_{t + 1}|{{{\mathbf{z}}}}_t,\theta } \right)\) (equation (57)). Thus, we approximated \(P\left( {{{{\mathbf{z}}}}_t|{{{\mathbf{x}}}}_{1:t}} \right)\) as \({{{\mathcal{N}}}}({{{\mathbf{z}}}}_t|{{{\mathbf{m}}}}_t,{{{\mathbf{V}}}}_t)\), where \({{{\mathbf{m}}}}_t\) and \({{{\mathbf{V}}}}_t\) denote the sample mean and sample covariance of the particles at time t, and we linearized \(P\left( {{{{\mathbf{z}}}}_t|{{{\mathbf{z}}}}_{t - 1},\theta } \right)\) as

$$\begin{array}{*{20}{c}} {P\left( {{{{\mathbf{z}}}}_t|{{{\mathbf{z}}}}_{t - 1},\theta } \right) \cong {{{\mathcal{N}}}}\left( {{{{\mathbf{z}}}}_t|{{{\mathbf{Az}}}}_{t - 1} + {{{\mathbf{b}}}},{{{{{\Gamma}}}}}} \right),} \end{array}$$
(69)
$$\begin{array}{*{20}{c}} {{{{\mathbf{A}}}} = \left. {\frac{{\partial {{{\mathbf{F}}}}\left( {{{{\mathbf{z}}}}_{t - 1},{{{\mathbf{a}}}}_{t - 1}} \right)}}{{\partial {{{\mathbf{z}}}}_{t - 1}}}} \right|_{{{{\mathbf{m}}}}_t},} \end{array}$$
(70)
$$\begin{array}{*{20}{c}} {{{{\mathbf{b}}}} = F\left( {{{{\mathbf{m}}}}_t,{{{\mathbf{a}}}}_{t - 1}} \right) - {{{A\mathbf{m}}}}_t,} \end{array}$$
(71)

where A denotes a Jacobian matrix. Because these approximations make the integration of equation (68) tractable, the posterior distribution \(P\left( {{{{\mathbf{z}}}}_t|{{{\mathbf{x}}}}_{1:T}} \right)\) can be computed by a Gaussian distribution as

$$\begin{array}{*{20}{c}} {P\left( {{{{\mathbf{z}}}}_t|{{{\mathbf{x}}}}_{1:T}} \right) = {{{\mathcal{N}}}}\left( {{{{\mathbf{z}}}}_t|\widehat {{{\mathbf{m}}}}_t,\widehat {{{\mathbf{V}}}}_t} \right)} \end{array}$$
(72)

whose mean and variance were analytically updated by the Kalman backward algorithm as43

$$\begin{array}{*{20}{c}} {\widehat {{{\mathbf{m}}}}_t = {{{\mathbf{m}}}}_t + {{{\mathbf{J}}}}_t\left\{ {\widehat {{{\mathbf{m}}}}_{t + 1} - ({{{\mathbf{Am}}}}_t + {{{\mathbf{b}}}})} \right\},} \end{array}$$
(73)
$$\begin{array}{*{20}{c}} {\widehat {{{\mathbf{V}}}}_t = {{{\mathbf{V}}}}_t + {{{\mathbf{J}}}}_t\left\{ {\widehat {{{\mathbf{V}}}}_{t + 1} - ({{{\mathbf{AV}}}}_t{{{\mathbf{A}}}}^{{{\mathrm{T}}}} + {{{\mathbf{{\Gamma}}}}})} \right\}{{{\mathbf{J}}}}_t^{{{\mathrm{T}}}},} \end{array}$$
(74)

where

$$\begin{array}{*{20}{c}} {{{{\mathbf{J}}}}_t = {{{\mathbf{V}}}}_t{{{\mathbf{A}}}}^{{{\mathrm{T}}}}\left( {{{{\mathbf{AV}}}}_t{{{\mathbf{A}}}}^{{{\mathrm{T}}}} + {{{{{\Gamma}}}}}} \right)^{ - 1}.} \end{array}$$
(75)
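
The backward recursion of equations (73)–(75) can be sketched directly. The code below assumes the filtered means m_t and covariances V_t from the particle filter and the linearizations A and b of equations (70) and (71); the indexing convention for the linearization sequence is one plausible choice for this sketch.

```python
import numpy as np

def kalman_backward(m, V, A_seq, b_seq, Gamma):
    """Kalman backward (smoothing) pass of equations (73)-(75)."""
    T = len(m)
    m_hat, V_hat = [None] * T, [None] * T
    m_hat[-1], V_hat[-1] = m[-1], V[-1]
    for t in range(T - 2, -1, -1):
        A, b = A_seq[t], b_seq[t]
        P_pred = A @ V[t] @ A.T + Gamma                        # predicted covariance
        J = V[t] @ A.T @ np.linalg.inv(P_pred)                 # equation (75)
        m_hat[t] = m[t] + J @ (m_hat[t + 1] - (A @ m[t] + b))  # equation (73)
        V_hat[t] = V[t] + J @ (V_hat[t + 1] - P_pred) @ J.T    # equation (74)
    return m_hat, V_hat
```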

Impossibility of model discrimination

In the ReCU and Q-learning models, the action selection was formulated with softmax functions of the same form:

$$\begin{array}{*{20}{c}} {P\left( {a_{i,t} = 1} \right) = \frac{{{{{\mathrm{exp}}}}\left( {\beta \left( {E[\mathrm{Reward}_{i,t}] + c_tE[\text{Info}_{i,t}]} \right)} \right)}}{{\mathop {\sum}\nolimits_j {{{{\mathrm{exp}}}}\left( {\beta \left( {E[\mathrm{Reward}_{j,t}] + c_tE[\text{Info}_{j,t}]} \right)} \right)} }}.} \end{array}$$
(76)
$$\begin{array}{*{20}{c}} {P\left( {a_{i,t} = 1} \right) = \frac{{\exp \left( {\beta _tQ_{i,t}} \right)}}{{\mathop {\sum}\nolimits_i {\exp \left( {\beta _tQ_{i,t}} \right)} }},} \end{array}$$
(77)

which correspond to the ReCU and Q-learning models, respectively. These equations contain time-dependent meta-parameters, namely ct and βt. For both models, the goodness of fit (that is, the likelihood) to the actual behavioral data can be improved arbitrarily by tuning the time-dependent meta-parameters. Thus, discrimination between the ReCU and Q-learning models is essentially impossible.

Estimation of parameters in iFEP

The ReCU model has several parameters: \(\sigma _w^2\), α, β, Po and \({\it{\epsilon }}\). In the estimation, we set \({\it{\epsilon }}\) to 1, which was the optimal value for estimation on the artificial data (Supplementary Fig. 3). We assumed a unit intensity of reward, that is, \(\ln \left\{ {P_o/\left( {1 - P_o} \right)} \right\} = 1\), because Po and β cannot be estimated separately: they enter the expected net utility only through the product of β and \(\ln \left\{ {P_o/\left( {1 - P_o} \right)} \right\}\) (equations (30) and (42)). This treatment is suitable for a relative comparison between the curiosity meta-parameter and the reward. In addition, because β multiplies the expected net utility (equations (30) and (42)), we treated the product βct as a single latent variable, \(\hat c_t = \beta c_t\); an estimate of ct is then obtained by dividing the estimated \(\hat c_t\) by the estimated β. Therefore, the hyperparameters to be estimated were \(\sigma _w^2\), α and β.

To estimate these parameters, \(\theta = \left\{ {\sigma _w^2,\alpha ,\beta } \right\}\), we extended the observer-SSM to a self-organizing SSM44 in which θ was treated as a set of constant latent variables:

$$\begin{array}{*{20}{c}} {P\left( {{{{\mathbf{z}}}}_t,\theta |{{{\mathbf{x}}}}_{1:t}} \right) \propto P\left( {{{{\mathbf{x}}}}_t|{{{\mathbf{z}}}}_t} \right){\int} {P\left( {{{{\mathbf{z}}}}_t|{{{\mathbf{z}}}}_{t - 1},\theta } \right)P\left( {{{{\mathbf{z}}}}_{t - 1},\theta |{{{\mathbf{x}}}}_{1:t - 1}} \right)\mathrm{d}{{{\mathbf{z}}}}_{t - 1},} } \end{array}$$
(78)

where \(P\left( \theta \right) = \text{Uni}\left( {\sigma ^2|a_\sigma ,b_\sigma } \right)\text{Uni}(\alpha |a_\alpha ,b_\alpha ){{{\mathcal{N}}}}(\beta |m_\beta ,v_\beta )\). To sequentially calculate the posterior \(P\left( {{{{\mathbf{z}}}}_t,\theta |{{{\mathbf{x}}}}_{1:t}} \right)\) with the particle filter, we used 100,000 particles and augmented the state vector of each particle with the parameter θ, whose values were drawn at random for initialization and were not updated thereafter.

The hyperparameter values used in this estimation were μ0 = 0, \(\sigma _\mu ^2 = 0.01^2\), ag = 10, bg = 0.001, au = −15, bu = 15, aσ = 0.2, bσ = 0.7, aα = 0.04, bα = 0.06, aβ = 0 and bβ = 50, which were heuristically given as parameters correctly estimated using the artificial data (Supplementary Fig. 2).
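
The augmentation used in the self-organizing SSM (equation (78)) amounts to attaching a fixed draw of θ to each particle. The following sketch is illustrative; sample_z1 and sample_theta are hypothetical placeholders for draws from equation (67) and from P(θ), respectively.

```python
import numpy as np

def initialize_augmented_particles(n_particles, sample_z1, sample_theta):
    """Self-organizing SSM (equation (78)): each particle carries both the latent
    state z_1 and a draw of the static parameters theta = (sigma_w^2, alpha, beta).
    The theta block is left untouched by the transition step, so it is only
    reweighted and resampled by the particle filter."""
    z = sample_z1(n_particles)                 # shape (n_particles, dim_z)
    theta = sample_theta(n_particles)          # shape (n_particles, 3), fixed per particle
    return np.concatenate([z, theta], axis=1)  # augmented state carried through the filter
```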

Statistical testing with Monte Carlo simulations

Supplementary Fig. 5 shows the statistical testing of the negative curiosity estimated in Fig. 5. The null hypothesis is that an agent without curiosity (that is, ct = 0) decides on a choice depending only on its recognition of the reward probability. Under the null hypothesis, model simulations were repeated 1,000 times under the same experimental conditions as in Fig. 5, and the curiosity was estimated for each simulation using iFEP. We adopted the temporal average of the estimated curiosity as the test statistic and plotted its null distribution. Comparing this null distribution with the test statistic of the curiosity estimated from the rat behavior, we computed the P value of a one-sided (left-tailed) test.
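
The Monte Carlo test reduces to comparing the observed test statistic with the simulated null distribution; a minimal sketch, assuming the 1,000 null statistics are already available, is given below.

```python
import numpy as np

def left_tailed_p_value(null_statistics, observed_statistic):
    """Monte Carlo P value for the one-sided (left-tailed) test: the fraction of
    null test statistics (temporally averaged curiosity under c_t = 0) that are
    at or below the statistic estimated from the rat behavior."""
    null_statistics = np.asarray(null_statistics)
    return float(np.mean(null_statistics <= observed_statistic))
```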

Reporting summary

Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.