Probabilistic rejection templates in visual working memory

Our interactions with the visual world are guided by attention and visual working memory. Things that we look for and those we ignore are stored as templates that reflect our goals and the tasks at hand. The nature of such templates has been widely debated. A recent proposal is that these templates can be thought of as probabilistic representations of task-relevant features. Crucially, such probabilistic templates should accurately reflect feature probabilities in the environment. Here we ask whether observers can quickly form a correct internal model of a complex (bimodal) distribution of distractor features. We assessed observers’ representations by measuring the slowing of visual search when target features unexpectedly match a distractor template. Distractor stimuli were heterogeneous, randomly drawn on each trial from a bimodal probability distribution. Using two targets on each trial, we tested whether observers encode the full distribution, only one peak of it, or the average of the two peaks. Search was slower when the two targets corresponded to the two modes of a previous distractor distribution than when one target was at one of the modes and the other was between them or outside the distribution range. Furthermore, targets on the modes were reported later than targets between the modes that, in turn, were reported later than targets outside this range. This shows that observers use a correct internal model, representing both distribution modes using templates based on the full probability distribution rather than just one peak or simple summary statistics. The findings further confirm that performance in odd-one-out search with repeated distractors cannot be described by a simple decision rule. Our findings indicate that probabilistic visual working memory templates guiding attention dynamically adapt to task requirements, accurately reflecting the probabilistic nature of the input.


Introduction
Our senses are constantly bombarded with an overwhelming amount of information that needs to be filtered by the brain to guide action. This information, however, is not completely chaotic. For example, leaves on a tree usually have similar colors, and colors within a single leaf would be more similar to each other than to another leaf. Probabilistic models of vision (Bejjanki, Beck, Lu, & Pouget, 2011; Feldman, 2014; Girshick, Landy, & Simoncelli, 2011; Kersten, Mamassian, & Yuille, 2004; Ma, 2012; Rao, Olshausen, & Lewicki, 2002) suggest that the brain utilizes existing correlations in the environment and uses them in perception. However, some of the incoming information is not relevant for current behavior, and it is important to reject it while processing other stimuli in more detail. Traditionally, the rejection of irrelevant information within a specific feature dimension (e.g., orientation) is thought to be based on specific feature values (Woodman, Carlisle, & Reinhart, 2013). Here we ask whether such rejection can instead be based on probabilistic templates and whether such templates accurately reflect the probabilities of distractor features. If this is the case, then probabilistic inference in the brain does not start with perception, but earlier, when to-be-rejected templates are formed (based on previously encountered stimuli) to optimize the prioritization of what is perceived.
Imagine a radiologist looking for signs of a tumor in x-ray scans. Malignant signs can take many forms so the targets to look for are diverse. By many accounts, search in this and other contexts is thought to be guided by templates held in visual working memory (Woodman et al., 2013). These templates reflect what one should look for, but may also reflect what should be ignored (Arita, Carlisle, & Woodman, 2012; Won & Geng, 2018). For example, distractors such as the rib cage on a lung scan are salient but not informative and radiologists can therefore ignore them. It is well known that information about to-be-ignored stimuli or features is kept in memory, but the way they are represented is still unknown. There are capacity limits in the amount of information that can be stored in visual working memory templates (Bundesen, 1990; Grubert & Eimer, 2013; Vickery, King, & Jiang, 2005), with some authors even suggesting that only one template containing a single feature value can guide attention at any given time (Oberauer, 2002; Olivers, Peters, Houtkamp, & Roelfsema, 2011; van Moorselaar, Theeuwes, & Olivers, 2014). Alternatively, templates could be conceptualized as probabilistic entities of varying precision (Bays, 2015) rather than matches to exact feature values. While previous studies found some support for this, observers typically reported features of single items (Ma, Husain, & Bays, 2014). However, in the real world such isolated features practically never occur. Furthermore, with a few exceptions (Arita et al., 2012; Won & Geng, 2018), templates for ignored information are rarely studied. For any inference based on probabilistic representations, it is crucial that the internal model used by observers accurately reflects the environment. Here, we provide strong evidence for the probabilistic template view by showing that visual working memory templates for rejection mirror the probability distribution of distractor features.
Our observers searched for two oddly oriented targets among distractors randomly drawn from a bimodal orientation distribution. To expose observers' templates, after a sequence of learning trials with distractors randomly drawn from a bimodal distribution, targets on test trials could either correspond to regions of feature space previously used for distractors, fall in between the modes of the bimodal distribution, or have feature values outside the previous distribution range. We assume that observers' templates reflect what has been relevant on recent trials. If templates contain features of distractors to be ignored, which then become targets on test trials, search should be slower than otherwise (Chetverikov, Campana, & Kristjánsson, 2016; Kristjánsson & Driver, 2008; Lamy, Antebi, Aviani, & Carmel, 2008; Maljkovic & Nakayama, 1994; Wang, Kristjánsson, & Nakayama, 2005). Crucially, experiments with varied set size and trial numbers show that learning in this paradigm cannot be explained by the sampling of a few items (Chetverikov, Campana, & Kristjánsson, 2017b, 2017d). It also cannot be explained by simple decision rule learning (e.g., all stimuli that have features in a certain range are distractors), because observers' response times, on average, reflect the shape of the distractor distribution rather than just a boundary between a target and distractors (Chetverikov, Campana, & Kristjánsson, 2016, 2017b, 2017c; Chetverikov, Hansmann-Roth, Tanrikulu, & Kristjánsson, 2019). However, it is not yet clear whether each single set of learning trials can feed observers' templates with the feature probability distribution of distractors, nor is it clear how accurately the information is stored in the templates.
Under the strong probabilistic template hypothesis, templates would include information about both peaks of a bimodal distribution. That is, observers would develop an accurate internal model for the task and the template would accurately reflect the information about the full probability distribution. Alternatively, templates might include only a single peak (e.g., the attended one), or might reflect only the summary statistics, such as the average of the whole distribution (Alvarez, 2011). Using a two-target search task we were able to test whether observers encode both peaks of a distribution following a single learning sequence. The predictions of these models (see Simulations) are qualitatively different regarding both the order in which targets are reported in a two-target search, and search times. If observers accurately encode a bimodal distribution, on trials with a target on a peak and a target between the peaks, targets between the peaks (associated with a lower distractor probability) should be reported before targets on peaks (associated with the highest distractor probability, Fig. 1A). In contrast, if only one peak is encoded or if the whole distribution is averaged, targets on peaks would be associated with a lower distractor probability and should be reported no later than targets between the peaks (associated with a higher distractor probability in this case). Notably, while all three hypotheses postulate that observers can use probabilistic inference, only the first one assumes that the distractor probability distribution is encoded accurately, that is, that the observers use relatively accurate probabilistic templates.

Ethics statement
The study was approved by the ethics committee of St. Petersburg State University (#75, 21.06.2017). All participants signed a consent form before taking part in the study.

Participants
Fifteen observers (ten female, age M = 25.67) at St. Petersburg State University, Russia, participated voluntarily in a single experimental session lasting approximately 30 min. The data from two observers were excluded because their response times on test trials were too slow (M = 1464 and M = 1871 ms), compared with other observers (M = 1064 ms). Following our previous studies (Chetverikov et al., 2016, 2017b, 2017c, 2017d), the design of this study utilized within-subject comparisons with a relatively small number of trained observers (each observer was trained for at least 100 trials before the main session) performing a large number of trials. The sample size and the trial numbers were similar to those in previous studies using the same paradigm.
Chetverikov, et al. Cognition 196 (2020) 104075

Method
We used a task similar to our previous studies (Chetverikov et al., 2016, 2017b). Stimuli were presented on an Acer V193 display (19″ with 1280 × 1024 pixel resolution) using PsychoPy 1.84.2 (Peirce, 2007, 2009). Viewing distance was ∼ 60 cm. Observers searched for two oddly oriented lines in a 6 × 6 grid of 36 lines subtending 16° × 16° at the centre of a display. The length of each line was 1.41°. Line positions were jittered by randomly adding a value between ± 0.5° to both vertical and horizontal coordinates. Observers were instructed to search for two targets on each trial, with targets being the stimuli that were most different from all the others ("odd-one-out" search (Maljkovic & Nakayama, 1994)). Targets were randomly distributed between the four quadrants of the search display with the constraint that the two targets on a given trial could not appear in the same quadrant. Observers reported the locations of the targets by pressing one of four keys ('f', 'g', 'r', 't' on a standard keyboard) corresponding to the quadrants of the search display. They were informed that two targets would be presented on each trial and were encouraged to respond to each target as soon as they found it and not wait until both targets were found.
Trials were organized in intertwined prime and test 'streaks'. During prime streaks, distractors were randomly drawn from a bimodal distribution that included two uniform parts with orientations ranging from -30 to -20 deg. and from +20 to +30 deg. relative to the overall mean. The distribution mean was the same within a streak but chosen randomly between streaks. Target orientations were selected randomly on each trial with the restriction that the distance between target orientation and distractor mean in feature space was at least 60 deg. Prime streak length was set to 6-7 trials (with equal probability) because this streak length is sufficient to learn bimodal distributions with relative accuracy (Chetverikov et al., 2017b).
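The prime-streak distractor distribution described above can be illustrated with a short sketch. This is not the experimental script: the function names, the equal mixing of the two uniform segments, and the choice of 34 distractors (36 lines minus two targets) are assumptions made for illustration.

```python
import numpy as np

def signed_offset(orientation, reference):
    """Signed circular difference (deg) on the 180-deg orientation circle."""
    return (orientation - reference + 90.0) % 180.0 - 90.0

def sample_prime_distractors(n_items, dist_mean, rng):
    """Draw distractor orientations (deg) from two uniform segments around dist_mean."""
    # Assumption: each item comes from the -30..-20 or the +20..+30 segment
    # with equal probability (the mixing weights are not stated in the text).
    sign = rng.choice([-1.0, 1.0], size=n_items)
    offset = sign * rng.uniform(20.0, 30.0, size=n_items)
    # Orientation is circular with a period of 180 deg.
    return (dist_mean + offset) % 180.0

rng = np.random.default_rng(1)
dist_mean = rng.uniform(0.0, 180.0)  # fixed within a streak, random between streaks
distractors = sample_prime_distractors(34, dist_mean, rng)
offsets = signed_offset(distractors, dist_mean)
print(np.round(np.sort(offsets), 1))
```

All printed offsets fall within the two segments, so the sample has no mass near the distribution mean itself.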
Within test streaks, distractor orientations were randomly drawn from a truncated Gaussian with SD = 10 deg. and range 20 deg. Test streaks had one or two trials (with equal probability). Different target types were used on test trials: targets were either located on a peak of the previous bimodal distribution ("Peak", at +/-25 deg. relative to the previous distractor mean), between the peaks ("Between", at 0 deg.) or outside the previous distribution range ("Outside", at +/-50 deg.). Four types of test streaks were used: 1) with two targets on two different peaks ("Peak + Peak"); 2) on a peak and in-between the peaks ("Peak + Between"); 3) on a peak and outside the previous distribution range ("Peak + Outside", where the "outside" target was always 25 deg. away from the target peak, that is, the two targets were oriented either at +25 and +50 deg. or at -25 and -50 deg. relative to the previous distractors' mean); 4) between the peaks and outside the range ("Outside + Between"). These four test types were presented equally often (40 repetitions per participant) in random order. The distractor mean was chosen to be equidistant from both test targets. The second test trial is not analyzed here as the priming effects from the learning streak are not likely to be significant after the first two-target test search. Two-trial test streaks were added for consistency with previous studies and in order to reduce the potential effects of observers' expectations regarding streak lengths.
Observers participated in one session of approximately 1300 trials. Decision time was not limited but participants were encouraged to respond as quickly and accurately as possible. Feedback based on search time and accuracy on previous trials was shown in the upper-left corner of the screen to motivate participants (see Chetverikov et al., 2016, for details on feedback score calculation). The current trial number and the total number of trials were shown beneath the score. If observers made an error, the word "ERROR" appeared in red letters at display centre for 1 s.
In addition to this two-target search experiment, we also ran a single-target search study (see Supplementary Experiment). The latter was used as a comparison for the single-target search time analyses to ensure that the introduction of a second target and specific conditions of the main experiment did not affect the pattern of results.

Overall performance
On learning trials, observers found both targets in most cases (M = 0.72 [0.67, 0.77]), though the share of trials where only one target was reported was high (M = 0.27 [0.22, 0.31]; both targets were reported incorrectly on 1% of trials). On test trials, observers reported both targets correctly on M = 0.91 [0.89, 0.93] trials (accuracy was comparable to the results of single-target search in the Supplementary Experiment). The delay between the report on the first and the second target was relatively short, but longer on learning than on test trials (M = 263 [198, 326] vs. M = 176 [130, 233], respectively, t(12.0) = 4.13, p = .001). Similarly, the first target was reported later on learning than test trials (M = 973 [854, 1103] vs. M = 826 [753, 904], respectively, t(12.0) = 5.23, p < .001).
The learning effects were also comparable to those from the single-target search experiment (see supplement). A linear mixed-effects regression with Helmert contrasts (comparing each trial with the average of the following trials) showed that the first trial was slower (B = 0.11, SE = 0.01, t(52.57) = 9.61, p < 0.001) and less accurate (B = -0.04, SE = 0.02, t(13.12) = -2.62, p = 0.021) than the later trials. The follow-up trials did not differ from one another.
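To make the contrast scheme concrete, here is a minimal sketch of the comparison "each trial versus the average of the following trials", applied to made-up per-trial means. It shows only the contrast weights, not the mixed-effects model itself; the function name and the example values are hypothetical.

```python
import numpy as np

def later_mean_contrasts(k):
    """Rows of weights: level j minus the mean of levels j+1..k-1 (k levels)."""
    weights = np.zeros((k - 1, k))
    for j in range(k - 1):
        weights[j, j] = 1.0
        weights[j, j + 1:] = -1.0 / (k - j - 1)
    return weights

# Hypothetical mean values for trials 1-4 within a streak (illustration only).
trial_means = np.array([1.20, 1.00, 1.00, 1.00])
effects = later_mean_contrasts(4) @ trial_means
print(effects)  # first trial differs from the later ones; later trials do not differ
```

With these illustrative values, only the first contrast is non-zero, which is the pattern the regression reports for the first trial of a streak.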
We then analyzed which type of target was reported first in each condition using a binomial mixed-effects regression. The results showed that targets on peaks were reported after targets between the peaks (Z = -2.01, p = .044, Fig. 2C) or targets outside the preceding distribution range (Z = -2.43, p = .015), while the latter were reported earlier than targets between the peaks (Z = 2.08, p = .037).
In sum, search with two targets on the peaks was the most difficult. A comparison of the "Peak + Between" and "Peak + Outside" conditions showed only a numerical difference in total RT. However, in the "Peak + Between" condition, targets on peaks were reported later than targets between the peaks, whereas in the "Outside + Between" condition targets between the peaks were reported later than the "Outside" targets. This shows again that targets on peaks were the most unexpected for observers, followed by targets between the peaks, followed in turn by "Outside" targets that led to the fastest search times.

Simulations
We simulated the predictions from three models (Fig. 2E-F; the simulation code is available at https://osf.io/rg2h8). For our main model of interest, the "bimodal" model, we assumed that the probabilities of different distractors can be represented by two Gaussian templates (for simplicity, we ignore the fact that the stimuli distributions might be more accurately represented by non-Gaussian templates (Chetverikov, Campana, & Kristjánsson, 2017a)) centered on the means of distractor distribution segments. We assumed that observers utilize the knowledge they obtained about distractors and targets optimally. To find a target in a localization search task, an ideal observer would compare the probability that a given noisy measurement of orientation x at each location L is a target versus the probability that it is a distractor (Ma, Navalpakkam, Beck, van den Berg, & Pouget, 2011; Ma, Shen, Dziugaite, & van den Berg, 2015). In this comparison, s_L is the true stimulus value at this location, p(L) is the probability that a target is presented at this location, and T and D are the parameters of target and distractor distributions, respectively. In our simulations, we assumed that internal representations of target and distractor distributions are independent and response times are inversely proportional to the amount of evidence p(L|x). Given that all locations in our experiment were equiprobable, that is, p(L) is the same for all locations, response times will, on average, be proportional to the probabilities of the test target θ_T under given distractor template parameters. The width of the Gaussian templates was estimated by fitting the model to single-target response time data. To increase the robustness of the estimates, we used an approach similar to bootstrap aggregating ("bagging"), often employed in machine learning (Breiman, 1996).
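The ideal-observer comparison can be sketched numerically. The code below is an illustration under stated assumptions, not the authors' simulation code: it uses the standard single-target form in which the posterior for each location is proportional to the prior times the ratio of target to distractor likelihoods, folds measurement noise into the two densities, ignores the circularity of orientation, and assumes a flat target density. All names and the example measurements are hypothetical.

```python
import numpy as np

def normal_pdf(x, mu, sigma):
    """Gaussian density, used here as a template over orientation (deg)."""
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2.0 * np.pi))

def location_posterior(x, prior, target_pdf, distractor_pdf):
    """p(L | x) proportional to p(L) * p(x_L | target) / p(x_L | distractor)."""
    unnormalized = prior * target_pdf(x) / distractor_pdf(x)
    return unnormalized / unnormalized.sum()

def distractor_pdf(v):
    # Assumption: bimodal template with peaks at +/-25 deg. and width 18 deg.
    return 0.5 * (normal_pdf(v, 25.0, 18.0) + normal_pdf(v, -25.0, 18.0))

def target_pdf(v):
    # Assumption: targets are equally likely at any orientation (180-deg. range).
    return np.full_like(v, 1.0 / 180.0)

# Hypothetical noisy measurements at four locations, relative to the distractor mean.
x = np.array([2.0, -24.0, 27.0, 70.0])
prior = np.full(x.size, 1.0 / x.size)  # all locations equiprobable
posterior = location_posterior(x, prior, target_pdf, distractor_pdf)
print(np.round(posterior, 3))
```

The location whose measurement is least probable under the distractor template receives the most evidence, which is why a target matching the template is expected to be found more slowly.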
For each model we obtained 500 bootstrapped samples grouped by participant (that is, on each iteration, sampling with replacement was done for each subject and then the samples were combined). We then estimated the template widths for each sample by fitting response times as a linear function of the stimuli probability. For the "bimodal" model, μ_1 = 25 and μ_2 = -25 are the means of the bimodal distractor distribution peaks, and a and b are the scaling parameters necessary to translate the probabilities into response times. The template widths obtained for each sample were then averaged to get the resulting estimates. Estimated template widths were similar for the experiment reported here (18 deg.) and the supplemental experiment (21 deg.). For the "single-peak" model, we assumed that only one of the two peaks was encoded (with the same approach as with the "bimodal" model). The model equation relied on the fact that the peak means are equidistant to the overall distractor mean. The estimated template widths were 27 and 22 deg. for the main and the supplemental experiment.
Figure caption note. Abbreviations: RT, response times; P, target on a peak; B, target between the peaks; O, target outside the range of previous distractor distribution. The plus sign indicates that two targets of corresponding types are used. For the order of reporting, X|A + B means that target type X was reported first when target types A and B are combined.
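The width-estimation procedure can be sketched as follows. This is a simplified illustration on synthetic data, not the published code: it assumes the bimodal template is an equal mixture of two Gaussians at +/-25 deg., replaces the optimizer with a grid search over candidate widths, and uses least squares for the linear part (equivalent to maximum likelihood only under Gaussian noise). All function names and the synthetic parameters are hypothetical.

```python
import numpy as np

def normal_pdf(x, mu, sigma):
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2.0 * np.pi))

def bimodal_prob(x, sigma, mu1=25.0, mu2=-25.0):
    """Assumed bimodal template: equal mixture of two Gaussians."""
    return 0.5 * (normal_pdf(x, mu1, sigma) + normal_pdf(x, mu2, sigma))

def fit_width(x, rt, sigmas):
    """Grid search over widths; for each width, RT = a + b * p is fit by least squares."""
    best_sse, best_sigma = np.inf, None
    for sigma in sigmas:
        design = np.column_stack([np.ones_like(x), bimodal_prob(x, sigma)])
        coef = np.linalg.lstsq(design, rt, rcond=None)[0]
        sse = float(np.sum((rt - design @ coef) ** 2))
        if sse < best_sse:
            best_sse, best_sigma = sse, sigma
    return best_sigma

def bagged_width(x, rt, subject, sigmas, n_boot=500, seed=0):
    """Resample trials with replacement within each subject, refit, and average."""
    rng = np.random.default_rng(seed)
    widths = []
    for _ in range(n_boot):
        rows = np.concatenate([
            rng.choice(np.flatnonzero(subject == s), size=int(np.sum(subject == s)))
            for s in np.unique(subject)
        ])
        widths.append(fit_width(x[rows], rt[rows], sigmas))
    return float(np.mean(widths))

# Synthetic data (illustration only): 3 subjects, true template width 18 deg.
rng = np.random.default_rng(42)
subject = np.repeat(np.arange(3), 200)
x = rng.uniform(-90.0, 90.0, size=subject.size)
rt = 0.7 + 20.0 * bimodal_prob(x, 18.0) + rng.normal(0.0, 0.01, size=x.size)
estimate = bagged_width(x, rt, subject, np.arange(5.0, 61.0), n_boot=30)
print(estimate)
```

Averaging the per-sample estimates, rather than fitting once to the pooled data, is what makes the procedure resemble bagging.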
Finally, the "averaged" model was based on the idea that observers might use a single set of summary statistics to represent the stimuli. Accordingly, we assumed that observers use a single Gaussian template centered at the mean of the overall bimodal distribution. The template width was also obtained using ML optimization and bootstrapping. For the main experiment the estimated width was 114 deg., while for the supplemental experiment it was 140 deg. (i.e., an almost flat template), already suggesting that this model provides a poor fit to the experimental data.
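The three candidate templates can be compared side by side in a small sketch. The functional forms below are assumptions consistent with the verbal description (an equal two-Gaussian mixture, a single Gaussian on one peak, and a single Gaussian at the overall mean); the widths are the estimates reported for the main experiment, and the function names are hypothetical.

```python
import numpy as np

def normal_pdf(x, mu, sigma):
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2.0 * np.pi))

def bimodal_template(x, sigma, mu1=25.0, mu2=-25.0):
    """Both peaks encoded (assumed equal-weight mixture)."""
    return 0.5 * (normal_pdf(x, mu1, sigma) + normal_pdf(x, mu2, sigma))

def single_peak_template(x, sigma, encoded_mu=25.0):
    """Only one of the two peaks encoded."""
    return normal_pdf(x, encoded_mu, sigma)

def averaged_template(x, sigma):
    """One Gaussian centered on the overall distractor mean."""
    return normal_pdf(x, 0.0, sigma)

targets = {"Peak": 25.0, "Between": 0.0, "Outside": 50.0}
for name, x in targets.items():
    print(
        f"{name:8s}"
        f" bimodal={bimodal_template(x, 18.0):.5f}"
        f" single-peak={single_peak_template(x, 27.0):.5f}"
        f" averaged={averaged_template(x, 114.0):.5f}"
    )
```

Under the bimodal template the distractor probability is highest for "Peak" targets, intermediate for "Between" targets, and lowest for "Outside" targets, whereas the averaged template ranks "Between" targets highest.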
We then used the estimated template widths to obtain the predictions of the three models for the search times for different target types (Fig. 2D), total search time for two targets in different conditions (Fig. 2E), and for the order in which the targets should be reported (Fig. 2F). For single-target search the equations were the same as when we estimated the template widths; however, we used the data averaged by target type for each subject to reduce the effect of trial-by-trial variability. Two-target search times were assumed to be proportional to a sum of two search times predicted in the same way as for a single target, with D reflecting the distractor distribution parameters for a given model, that is, the template mean(s) and its estimated width(s). Finally, we assumed that, all other things being equal, the order in which the targets are reported would depend on the ratio of the probabilities of observing the test targets under the given distractors template, with k as a scaling constant. The ratio was transformed to logarithm to allow for both positive and negative values. Fig. 2D and E show that the bimodal model provided more accurate predictions for response times than the other models. For single-target response times, it accurately predicted that targets on peaks would be the hardest to find and targets between the peaks would be harder to find than targets outside the range of previously learned distractors. In contrast, the averaged model (ΔBIC = 5.94; here and later ΔBIC refers to the difference in Bayesian Information Criterion compared to the bimodal model, positive values meaning that the bimodal model has better fit) suggested that the targets in-between the peaks would be hardest to find, while the single-peak model (ΔBIC = 12.47) predicted relatively similar response times for between targets and targets on peaks.
For two-target RTs, the bimodal model failed to predict slower search for the "outside + between" condition compared to the "peak + outside" condition. Note, however, that this difference was also not significant in our results. Speculatively, it might be a result of a higher similarity between the targets in the latter than in the former. Nevertheless, the predictions of the bimodal model were still better than of the averaged (ΔBIC = 8.04) or the single-peak model (ΔBIC = 6.59).
Crucially, the bimodal, single-peak, and averaged models gave qualitatively different predictions for the order in which the targets would be found. For both the single-peak and averaged model, the probability of first reporting targets between the peaks when combined with targets on peaks was below 0.5 (Fig. 2C). As outlined in the introduction, when observers encode only one peak, on 50% of the trials, the "peak" target on test trials should be on this peak while in the other half of the trials it will be on the non-encoded peak. Depending on the width of the template, the average ratio of the probabilities for a target would vary: with very large or very small template widths, it will be close to 0.5 because targets between the peaks and at the non-encoded peaks will be equally probable, and with intermediate template widths it will be below 0.5 (note that this conclusion is not limited to the specific equation we used for determining the probability of finding one target before another; in fact, it could be shown that this is the case for any monotonic function describing the transformation of a ratio of probabilities of observing the target under a given Gaussian distractor template into average probability of a given reporting order). For the averaged model the target between the peaks should always be reported later than targets on the peaks. In contrast, for the bimodal model that accurately encodes the probabilities of distractors, the target between the peaks should be reported before the target at the peak. Accordingly, the bimodal model describes the results better than the single-peak (ΔBIC = 13.11) or the averaged model (ΔBIC = 6.86).
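The qualitative difference in order predictions can be reproduced in a few lines. The sketch below is illustrative only: it assumes a logistic function of the scaled log probability ratio as the link to reporting order (the exact equation is not reproduced here), sets the scaling constant k to 1, uses the template widths reported for the main experiment, and averages the single-peak prediction over which peak was encoded. All names are hypothetical.

```python
import numpy as np

def normal_pdf(x, mu, sigma):
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2.0 * np.pi))

def p_between_first(p_peak, p_between, k=1.0):
    """Assumed link: logistic of k * log ratio of distractor probabilities.

    A higher distractor probability for the peak target means it is found
    later, so the between-peaks target is more likely to be reported first.
    """
    return 1.0 / (1.0 + np.exp(-k * np.log(p_peak / p_between)))

PEAK, BETWEEN = 25.0, 0.0

# Bimodal template (width 18 deg.): both peaks encoded.
def bimodal(x, sigma=18.0):
    return 0.5 * (normal_pdf(x, 25.0, sigma) + normal_pdf(x, -25.0, sigma))

order_bimodal = p_between_first(bimodal(PEAK), bimodal(BETWEEN))

# Single-peak template (width 27 deg.): the peak target falls on the encoded
# peak on half of the trials and on the non-encoded peak on the other half.
order_single = np.mean([
    p_between_first(normal_pdf(PEAK, encoded, 27.0), normal_pdf(BETWEEN, encoded, 27.0))
    for encoded in (25.0, -25.0)
])

# Averaged template (width 114 deg.): one Gaussian at the overall mean.
order_averaged = p_between_first(normal_pdf(PEAK, 0.0, 114.0), normal_pdf(BETWEEN, 0.0, 114.0))

print(round(order_bimodal, 3), round(float(order_single), 3), round(order_averaged, 3))
```

Only the bimodal template yields a probability above 0.5 of reporting the between-peaks target first in the "Peak + Between" condition; the other two fall below 0.5 under these assumptions.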

Discussion
Can observers develop an accurate internal model for the probabilities of to-be-ignored items in a visual search task? We assessed the content of templates guiding visual search in the orientation domain, by measuring slowing for targets drawn from a preceding distractor orientation distribution. The distribution was bimodal and the searches used to probe the representations involved two simultaneous targets within a trial. Response times were slower when the targets corresponded to the two modes ("peaks") of previous distractor distributions than when one target was from one of the modes and another from between them, while the latter combination of targets resulted in slower search than when one of the targets was outside the previous distractor range. Furthermore, the order in which the targets were reported on a test trial followed the distractor probabilities observed during prime trials. Targets outside the previous distractor range were reported earlier than the ones between the modes, while the latter were reported before the targets at the modes of previous distractor distribution. The search times and the order in which targets are reported allowed us to assess the internal model used by observers.
We simulated the predictions of bimodal, single-template, and averaged-template models. The first model accurately reflects the actual distribution of distractor features, while the other two oversimplify it in different ways. We found that the bimodal model predicts the response times pattern for different target types and different conditions far better than the other models. Moreover, only the bimodal model could accurately predict the order in which the targets were reported. Both the single-template and the averaged-template model predicted that the target between the peaks should on average be reported after the targets at the peaks, while the reverse was accurately predicted by the bimodal model. The target between the peaks in the "Peak + Between" condition was on average reported before the target at one of the peaks. This shows that observers simultaneously represent both modes of distractor distributions. Their representations approximate the physical stimuli, and they fill in the gaps in probability space as demonstrated by slower responses when one of the targets was between the peaks compared to when it was outside the previous distractor range, or on the peaks.
Notably, all three models can be considered probabilistic in the sense that they do provide observers with a measure of probability that a certain feature belongs to a distractor class. The difference is in the degree of simplification. The bimodal model reflects the probability distribution accurately (with the assumption of Gaussian approximation). The two other models taken into consideration, however, diverge from an accurate representation in different ways: the "averaged" model assumes the use of overall summary statistics, while the "single-peak" model assumes the encoding of only one part of the distribution (which could be caused, for example, by biased sampling). Furthermore, every heuristic or decision rule can be cast in terms of probabilities (e.g., a delta function that assigns probability of 1 for one part of feature space and 0 for the rest). Here we show that the representations used by observers mirror the probability distribution of the stimuli. Unlike previous studies assessing how distracting information is stored in visual working memory (Arita et al., 2012; Won & Geng, 2018), the distractors in our studies were heterogeneous and were generated randomly based on a bimodal probability distribution. Nevertheless, observers were able to integrate the information about distractors into an approximate bimodal representation. Speculatively, this demonstrates that using homogeneous distractors may be an artificial limitation, perhaps brought on by earlier technical restrictions on experimental stimuli in the pre-modern computer era. In the real world, distracting information is rarely homogeneous, so it may not be particularly surprising that humans are able to form accurate templates representing probability distributions.
Following seminal accounts of priming of pop-out effects (Maljkovic & Nakayama, 1994), we argue that the representations of distractor distributions are kept in visual working memory, rather than long-term memory. Woodman et al. (2013) have demonstrated that the representation of a single attended target is transferred from VWM to long-term memory in 5 to 7 trials. In contrast, we have previously shown that for simple distractor distributions (such as Gaussian or uniform) one or two trials are enough for observers to develop a probabilistic representation of distractors (Chetverikov et al., 2017b). Representations of more complex distractor distributions take more time (or trials) to develop, but they also progressively change with more repetitions: after one or two trials, bimodal distributions are represented as unimodal, and are only later transformed into bimodal ones. This indicates that more time (trials) is required for sharpening the representation, not for the transfer to long-term memory.
A question of how the probabilistic templates for rejection are stored also taps into a more general question, regarding how working memory templates are stored. Recently, Christophel, Iamshchinina, Yan, Allefeld, and Haynes (2018) demonstrated that while attended stimuli in visual working memory are represented both in parietal and frontal cortex in addition to visual cortex, the latter is not involved in representations of unattended stimuli. It is possible that rejection templates similarly do not involve early visual areas. However, unlike simple unattended items, templates for rejection are actively used by observers to guide attention. As such, their representation might require a level of precision only achievable with the recruitment of sensory areas.
How specific are distractor templates? Won and Geng (2018) suggested that distractor templates might be more broadly tuned than target templates. This would allow easy generalization of suppression to similar distractors, while for targets such generalization might be harmful as it would lead to an increased number of false alarms. However, the exact costs of generalization for both target and distractor templates depend on the environment. Specific templates are necessary when a target is similar to distractors, but generalization is helpful otherwise. This has indeed been observed by Geng, DiQuattro, and Helm (2017): when a target is similar to distractors, its template is sharpened and shifted away from distractors. Moreover, in the real environment we rarely know how exactly the target or distractors would look under a given illumination and point of view, making some degree of generalization essential for efficient search. In contrast, a typical visual search study would require a very narrow distribution of target features, making a narrow template useful. Our results suggest that distractor templates are specific enough to account for bimodality in the distractor distribution. It remains to be studied whether target or distractor templates are more specific when their physical distributions are equally shaped.
In contrast to our previous studies (Chetverikov et al., 2016, 2017b, 2017c, 2017d; Hansmann-Roth, Chetverikov, & Kristjánsson, 2019), here we "probed" the distractor representation only at three different points in the feature space. By using targets with a range of features that covered the full feature space, our previous research showed that observers encode the probability distribution of distractors. Here we extend these findings by showing that observers learn the distribution of distractors following a single learning streak. This demonstrates that the previously obtained results are not an artefact of aggregation over multiple trials but rather a true reflection of the templates' content.
Our results agree with previous findings on probabilistic concept learning. Briscoe and Feldman (2011) found that when observers have to form a decision rule based on a multimodal probability distribution, they could do this, although performance became worse with increased mode number. We did not explicitly ask our observers to categorize the stimuli (as distractors and targets), but it is conceivable that they might do so if asked.
We should note that one might interpret our results as simply demonstrating that humans are capable of learning a nonlinear classification rule/decision boundary over a disjoint set in feature space, and can use this to guide visual search. But we think that this alternative proposal is unlikely to hold water. First, for a simple classifier in this task, learning is not necessary. There is enough information on each trial to easily tell the target from distractors. Moreover, to include learning in the algorithm, learning of the target would suffice, as the target distribution is constant within the learning streak. The fact that our observers struggle with this shows that they do more than strictly necessary. Second, and perhaps more importantly, we showed in our previous work that observers learn the correct probabilities of the distractor features on average rather than learning a simple decision rule (Chetverikov et al., 2016, 2017b, 2017c, 2017d). A decision rule model cannot explain why the response time curves reflect distractor probability both within and outside the distractor distribution range. By using double-target search we further demonstrate that these results cannot be explained by a combination of different decision rules applied on different test trials.

Conclusions
We found that rejection templates are probabilistic, similarly to items in visual working memory that receive attention (Ma et al., 2014). However, our study also shows that templates for rejection do not need to be simple bell-shaped curves, as is typically modelled in working memory studies. In contrast, they are dynamically adapted to task requirements, reflecting the probabilistic nature of the input. Whether such flexibility also characterizes templates for attended items remains to be seen. However, our results clearly demonstrate that probabilistic computations start in the brain even before something is perceived, to determine what should be prioritized in perception.

Author contributions
All authors participated equally in conceiving and planning the experiments. AC wrote the experimental scripts, oversaw the data collection, analyzed the results, and wrote the initial version of the manuscript. GC and AK took part in data analyses and interpretation and revised the manuscript.