PHYSICAL REVIEW E, VOLUME 62, NUMBER 5, NOVEMBER 2000

Variational studies and replica symmetry breaking in the generalization problem of the binary perceptron

Evaldo Botelho,(1) Cristiano R. Mattos,(1,2) and Nestor Caticha(1)
(1) Instituto de Física da Universidade de São Paulo, Caixa Postal 66318, São Paulo, SP 05315-970, Brazil
(2) Faculdade de Engenharia, Universidade Estadual Paulista, Caixa Postal 205, Guaratinguetá, SP 12500-000, Brazil
(Received 29 May 2000)

We analyze the average performance of a general class of learning algorithms for the nondeterministic polynomial time complete problem of rule extraction by a binary perceptron. The examples are generated by a rule implemented by a teacher network of similar architecture. A variational approach is used in trying to identify the potential energy that leads to the largest generalization in the thermodynamic limit. We restrict our search to algorithms that always satisfy the binary constraints. A replica symmetric ansatz leads to a learning algorithm which presents a phase transition in violation of an information theoretical bound. Stability analysis shows that this is due to a failure of the replica symmetric ansatz, and the first step of replica symmetry breaking (RSB) is studied. The variational method does not determine a unique potential, but it allows construction of a class with a unique minimum within each first order valley. Members of this class improve on the performance of the Gibbs algorithm but fail to reach the Bayesian limit in the low generalization phase. They even fail to reach the performance of the best binary, an optimal clipping of the barycenter of version space. We find a trade-off between good performance in the low generalization phase and an early onset of perfect generalization. Although the RSB solution may be locally stable, we discuss the possibility that it fails to be the correct saddle point globally.

PACS number(s): 87.10.+e, 84.35.+i, 89.70.+c, 05.50.+q

I. INTRODUCTION

One topic of interest in the study of neural networks with the tools of statistical mechanics (SM) is that the process of information extraction from data can be modeled by a dynamical process of minimization of an energy function in the presence of noise. The use of techniques borrowed from the study of equilibrium disordered systems [1], such as, for example, replica methods, Thouless, Anderson, and Palmer equations, cavity analysis, and Monte Carlo simulations, as well as dynamical analysis techniques, has permitted a considerable understanding of typical properties of the thermodynamic limit (TL) of such systems as attractor and feedforward neural networks.

In this paper we are interested in the equilibrium properties of the student-teacher problem [2-4] of rule extraction by a binary Boolean perceptron. This means that the weights as well as the output are restricted to be ±1. We look at the generalization ability for the case of a realizable rule represented by a teacher of the same architecture as the student, and in particular we concentrate on the determination of thermodynamic limit bounds and how to attain them by the construction of an appropriate potential.

The binary perceptron has been attacked from many fronts. Studies of the computational complexity by Pitt and Valiant have shown that it belongs to the NP (nondeterministic polynomial time) complete class [5].
From a SM point of view it has been studied before by Gardner and Derrida [6] and Györgyi [7]. Their aim was to study the learning curve (generalization error) as a function of the number of examples, where the training energy was chosen in the simplest way, i.e., error counting. For independent examples drawn from a uniform distribution, the main characteristic of this system in the TL is the first order transition to perfect generalization at α_c ≈ 1.25, where as usual α = P/N measures the number of examples P in units of the number of inputs N. This shows the power of statistical mechanics methods, since the transition cannot be detected by, e.g., Vapnik-Chervonenkis analysis, which aims at a general type of result, such as those independent of the distribution of examples, and can therefore miss important features that appear only for particular but important cases. For a specific comment on this, see [8].

A relevant question, complementary in spirit to examination of the thermal equilibrium properties, delves into the dynamical aspects of actually determining the set of weights that minimize the training error. The studies of Horner [9] have shown that times exponential in N are required for learning. This is in agreement with the expectation that, since this is a computationally hard problem, it should have spin glass properties. Polynomial time algorithms, such as simulated annealing of the error-counting energy with a linear decrease of the temperature schedule, fail to reach global solutions.

Different approximations have been devised in order to overcome the problems associated with glassy dynamics. Based on the fact that learning the equivalent real-weights problem is not as slow, other studies have dealt with continuous approximations to the binary problem. Several groups looked into clipping strategies [10-13] and other transformations of the continuous perceptron, in particular one that optimally incorporates the information that the teacher weight vector points in the direction of a vertex of the unit N-dimensional hypercube [14]. Penney and Sherrington [15] have looked into how to reduce the effective dimension of the problem by clipping a partial set of real couplings and posterior learning of the remaining binary weights. Copelli et al. [16-18] have looked at a related problem in the unsupervised learning scenario. They employed their methods for the teacher-student case also. By extending an argument of Watkin for the analogous problem with continuous weights, they looked into the generalization properties of the center of mass of the version space. Its performance should saturate the Bayes limit, but it is not itself a vertex of the hypercube. They also considered the Bayesian best binary (BB) vector, obtained by clipping the real component Bayes vector, for both supervised and unsupervised learning.

On-line learning, a possibly efficient strategy to overcome slow dynamics since it makes no attempt to thermalize, has been shown to be ineffective if working in the binary coupling space directly [19]. Nevertheless, progress has been made in the direction of on-line learning. Solla and Winther [20] have shown how to incorporate, in the on-line Bayesian spirit suggested by Opper [21], information about the binary nature of the couplings.
This method leads to fast learning times but is not able to condense on the exact solution in finite times.

In this work we extend the work of Kinouchi and Caticha [22] on the off-line variational determination of a training energy. We look into the extension of the variational approach to the determination of training potentials, optimal in the generalization sense, for the binary perceptron. A difference from the approach of Copelli et al. [16-18] is that our construction methods never leave the binary space. Although this in the end might turn out not to be a good strategy, it is theoretically interesting to see how far one can go on such a discrete problem without leaving the allowed space. In the case of random examples, Ref. [13] presents a variational calculation for potentials that, using the maximum stability perceptron as a teacher, aims at determining the continuous vector that on clipping reproduces the largest number of weights of the maximum stability binary perceptron.

The extension of the variational program to the binary case is not at all straightforward due to the failure of the replica symmetric (RS) ansatz. We first studied a replica symmetric variational calculation. The transition to perfect learning occurs at a much smaller value of α_c (≈0.5). This would have been great news if not for the fact that a simple information theoretical bound for perfect generalization is α_c ≥ 1 [23]. Each example cannot carry more than just one bit of information, and at least N examples are needed to determine the N independent bits encoded in the weight vector. The stability analysis showed the surprising failure of the RS ansatz from the very start at α = 0. The next step, which we report here, is to perform a variational calculation at the level of one-step replica symmetry breaking (RSB-1). The variational method does not determine a unique potential, but only fixes expected values of the modulation function. We study a restricted set of solutions, which form a one-parameter family. Our choice is based on physical similarities of this family to potentials already studied. This is interesting in itself, as it indicates the possibility that potentials other than the type we study may be relevant. The phase diagrams are studied and discussed. The learning curves associated with this family show a trade-off between better generalization and earlier transitions to the perfect generalization phase.

This paper is organized as follows. In Sec. II A we present the general replica calculation results. In Secs. II B and II C the variational procedures under RS and RSB-1, respectively, are presented. Section III discusses the analysis of the RSB-1 free energy. This requires the determination of the effective potential, which is explicitly calculated in the Appendix. Section IV discusses results and presents conclusions.

II. OFF-LINE VARIATIONAL METHOD

A. General results

The statistical mechanics approach to determining the generalization ability of a network learning from examples of a rule that itself is implemented by another network has been studied in several cases; for reviews, see [2-4]. Call the N-dimensional binary weight vectors of the teacher and student networks $\mathbf{B}$ and $\mathbf{J}$, respectively.

Off-line learning describes the infinite time limit of a learning algorithm that proceeds by minimization, in the presence of thermal noise, of an energy function $\sum_{\mu=1}^{P} V(\mathbf{J}\cdot\mathbf{S}^{\mu},\sigma^{\mu})$ that depends on quenched data represented by the $P$ input-output example pairs $(\mathbf{S}^{\mu},\sigma^{\mu})$.
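For concreteness, the setup can be summarized in a minimal numerical sketch (Python/NumPy; the sizes, the random seed, and the error-counting choice of V below are illustrative assumptions of ours, not parameters from the paper): a binary teacher B labels random inputs, and the training energy of a candidate binary student J is the simplest, error-counting one.

```python
import numpy as np

rng = np.random.default_rng(0)
N, P = 101, 126                           # illustrative sizes, alpha = P/N ~ 1.25

B = rng.choice([-1, 1], size=N)           # teacher weights: a vertex of the hypercube
S = rng.choice([-1, 1], size=(P, N))      # P example inputs, uniform distribution
sigma = np.where(S @ B >= 0, 1, -1)       # teacher outputs

def training_energy(J, S, sigma):
    """Error-counting energy, the simplest choice of potential V:
    it counts the examples whose stability lambda = sigma*(J.S)/sqrt(N)
    is not positive. The variational program of this paper replaces
    this V by an optimized potential acting on the same stabilities."""
    stabilities = sigma * (S @ J) / np.sqrt(len(J))
    return int(np.sum(stabilities <= 0))

J = rng.choice([-1, 1], size=N)           # a random binary student
print(training_energy(J, S, sigma))
```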
The free energy for the replicated system presented with $P$ examples, independently chosen from a uniform distribution, is given by

$$-\beta f = \underset{q_{ab},\hat q_{ab},R_a,\hat R_a}{\mathrm{extr}}\big[\,G_o\{q_{ab},\hat q_{ab},R_a,\hat R_a\} - \alpha\,G_r\{q_{ab},R_a\}\,\big], \qquad (1)$$

where

$$G_o = \lim_{n\to 0}\frac{1}{n}\Big[-\sum_{a<b} q_{ab}\hat q_{ab} - \sum_a R_a \hat R_a + \frac{1}{N}\sum_{j=1}^{N}\ln\int\prod_a d\chi(J_j^a)\, \exp\Big(\sum_a \hat R_a B_j J_j^a + \frac{1}{2}\sum_{a\neq b}\hat q_{ab} J_j^a J_j^b\Big)\Big], \qquad (2)$$

$$G_r = \lim_{n\to 0}\frac{1}{n}\ln\int Dt_2 \int\Big(\prod_a \frac{d\lambda_a\, d\hat\lambda_a}{2\pi}\Big) \exp\Big(-\beta\sum_a V(\lambda_a,t_2)\Big) \exp\Big(i\sum_a \hat\lambda_a(\lambda_a - R_a t_2) - \frac{1}{2}\sum_{a,b}(q_{ab}-R_aR_b)\hat\lambda_a\hat\lambda_b\Big). \qquad (3)$$

The interpretation of $R$ and $q$ at the extremum justifies the choice of method to study this problem. This is so since $R = \langle \mathbf{J}\cdot\mathbf{B}/N\rangle$, where the average is both thermal and over the disorder, is directly related to the generalization error $\epsilon_G$ in the case of equivalent replicas. For the uniform distribution $\epsilon_G = \arccos(R)/\pi$. The usual order parameters $q_{ab} = \langle \mathbf{J}^a\cdot\mathbf{J}^b/N\rangle$ describe the typical overlap between two students learning from the same data. The weight measure $d\chi(J_j^a)$ defines the particular problem. In this case $d\chi(J_j^a)$ is the counting measure over $J_j^a = \pm 1$.

Two related questions are discussed in this paper. What is the largest value that the overlap $R$ can reach for fixed $\alpha$? That is, what are the upper bounds for the generalization ability? Note that this is different in spirit from a Bayesian approach, since here we ask for best performances within the constraints of a given architecture. This immediately leads to the second question: What energy function $V$ should be chosen in order to achieve those bounds? The same questions have been answered satisfactorily for this machine with real weights, the first by Opper and Haussler [24] and the second by Kinouchi and Caticha [22]. In trying to answer them we have to go beyond the permutation replica symmetry that holds in that case, which we now analyze.

B. Replica symmetric analysis

It is reasonable to begin by assuming that the solution of the extremization problem is replica symmetric. The zero temperature limit can be taken by noticing that an optimal potential will most likely have a unique minimum, and therefore it is natural to take the $q\to 1$ limit [25]. A sensible limit can be obtained by requiring that $x\equiv\beta(1-q)$, $y\equiv\hat q/\beta^2$, and $v\equiv\hat R/\beta$ remain finite as $\beta\to\infty$. The free energy is then the extremum of

$$f_{RS} = \frac{1}{2}xy + \{R-[1-2H(h)]\}v - \sqrt{\frac{2y}{\pi}}\,e^{-h^2/2} + 2\alpha\int Dt\, H\Big(\frac{-tR}{\sqrt{1-R^2}}\Big)\Big(V(\lambda_o)+\frac{(\lambda_o-t)^2}{2x}\Big),$$

where $h=v/\sqrt{y}$, $Dt=e^{-t^2/2}\,dt/\sqrt{2\pi}$, $H(x)=\int_x^\infty Dt$, and $\lambda_o(t)$ is defined as the minimum of $V(\lambda)+(\lambda-t)^2/2x$:

$$-x\Big(\frac{\partial V(\lambda)}{\partial\lambda}\Big)_{\lambda_o} = \lambda_o - t. \qquad (4)$$

The saddle point equations lead to

$$R = 1-2H(h), \qquad x = \sqrt{\frac{2}{\pi y}}\,e^{-h^2/2},$$
$$e^{-h^2} = \pi\alpha\int Dt\, H\Big(\frac{-tR}{\sqrt{1-R^2}}\Big)(\lambda_o-t)^2,$$
$$h\,e^{-h^2/2} = \alpha\int \frac{dt}{\sqrt{2\pi(1-R^2)}}\, e^{-t^2/2(1-R^2)}\,\lambda_o. \qquad (5)$$

Following [22], we obtain the result that, in order to maximize $R$, the learning potential should be chosen such that the RS modulation function $F_{RS}(t)\equiv\lambda_o-t$ is given by

$$F_{RS}(t) = C\,\frac{e^{-(Rt)^2/2(1-R^2)}}{H(-Rt/\sqrt{1-R^2})}, \qquad (6)$$

which can be seen to be very similar to the optimal modulation function for on-line learning in the real-weights perceptron, a most interesting result from a cavity perspective [26,27]. It is easy to see that

$$C = \frac{e^{-h^2/2}}{\pi h\sqrt{1-R^2}} \qquad (7)$$

is a function of $R$ but not of $t$. From Eq. (6) we can determine the potential $V$, since $F_{RS}=-x\,\partial V/\partial\lambda_o$. Thus

$$V(\lambda) = E(t) - \frac{1}{2x}F_{RS}^2(t), \qquad (8)$$

where $E(t) = -[\sqrt{2\pi(1-R^2)}\,C/(Rx)]\ln H(-Rt/\sqrt{1-R^2})$ is proportional to the on-line optimal energy function. It resembles the equivalent off-line potential for the real-weights perceptron.
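The RS saddle point can be reduced to a single fixed-point equation for h. Substituting Eqs. (6) and (7) into the third of Eqs. (5) gives h² = α I(R)/[π(1−R²)] with I(R) = ∫Dt e^{−(Rt)²/(1−R²)}/H(−Rt/√(1−R²)); the fourth of Eqs. (5) is then satisfied automatically. Below is a minimal sketch (Python with SciPy) of this reduction; the iteration scheme and starting point are our own choices, not from the paper, and convergence to the low generalization branch is assumed for small α.

```python
import numpy as np
from scipy.integrate import quad
from scipy.special import erfc

H = lambda x: 0.5 * erfc(x / np.sqrt(2.0))       # H(x) = int_x^inf Dt

def I_of_R(R):
    """I(R) = int Dt exp(-(Rt)^2/(1-R^2)) / H(-Rt/sqrt(1-R^2))."""
    w = R / np.sqrt(1.0 - R**2)
    f = lambda t: (np.exp(-t**2 / 2.0) / np.sqrt(2.0 * np.pi)
                   * np.exp(-(w * t)**2) / H(-w * t))
    return quad(f, -8.0, 8.0)[0]

def rs_overlap(alpha, iters=200):
    """Fixed-point iteration of the RS saddle point, Eqs. (5)-(7):
    R = 1 - 2H(h),  h^2 = alpha * I(R) / (pi * (1 - R^2))."""
    h = 0.1
    for _ in range(iters):
        R = 1.0 - 2.0 * H(h)
        h = np.sqrt(alpha * I_of_R(R) / (np.pi * (1.0 - R**2)))
    return 1.0 - 2.0 * H(h)

for alpha in (0.1, 0.3, 0.5):
    R = rs_overlap(alpha)
    print(alpha, R, np.arccos(R) / np.pi)         # generalization error eps_G
```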
The post-training stabilities, however, may not all be positive, i.e., $\lambda_o$ can be negative. This means that the algorithm would be called inconsistent, since after training there still may be memorization errors. The learning curves (see Fig. 2 below) are obtained by solving the saddle point equations numerically. By comparing the free energy of the phase with incomplete learning, $R<1$, with that for $R=1$, a first order phase transition at $\alpha_c = 0.53$ is found. As mentioned earlier, this is unacceptable, and the cause of this unphysical result is the replica symmetric ansatz. A stability analysis [28] confirms this suspicion, and we now turn to a one-step replica symmetry breaking ansatz.

C. RSB-1 variational method

We proceed in the by now standard way [1]. Let $q_{ab}$ be an $l\times l$ block matrix and call the block indices $m,n = 1,2,\ldots,l = n/m$. For replicas within the same block, the overlaps are taken to have the same value $q_1$, while replicas belonging to different blocks have overlap $q_o$. The replicas are assumed to be equivalent and thus $R_a = r$ for all replicas. We do not use the same letter $R$, to make explicit that it will have a different value from that of the previous section. The conjugated parameters (with carets) are assumed to have the same structure.

The entropic and energetic contributions to the free energy are

$$G_o = \frac{1}{2}[m q_o\hat q_o + (1-m)q_1\hat q_1 - \hat q_1] - r\hat r + \frac{1}{m}\int Dz_o \ln \int Dz_1 \big[2\cosh(\sqrt{\hat q_o}\, z_o + \sqrt{\hat q_1 - \hat q_o}\, z_1 + \hat r)\big]^m, \qquad (9)$$

$$G_r = -\frac{2}{m}\int_{-\infty}^{\infty} Dt_1 \int_0^{\infty} Dt_2 \ln \int_{-\infty}^{\infty} Dt_o \left\{\int \frac{d\lambda}{\sqrt{2\pi(1-q_1)}} \exp\left[-\beta\left(V(\lambda) + \frac{(\lambda - t)^2}{2x_1}\right)\right]\right\}^m, \qquad (10)$$

where $t \equiv \sqrt{q_o - r^2}\, t_1 + r t_2 + \sqrt{q_1 - q_o}\, t_o$.

We are interested in the zero temperature case. A sensible limit can be found by requiring that within a valley the optimal potential has a unique minimum and that different valleys' minima have the same overlap. To achieve this we make the ansatz $\beta\to\infty$, $m\to 0$, $q_1\to 1$, and $\hat q_o$ and $\hat R\to\infty$, such that $x_o\equiv\beta m$, $x_1\equiv\beta(1-q_1)$, $\hat Q_o\equiv m^2\hat q_o$, and $\hat R\equiv m\hat r$ remain finite. To take the limit [29,30] we start with the entropic part

$$G_o = \frac{1}{m}\left[-\frac{(1-q_o)\hat Q_o}{2} - r\hat R + \int Dz_o \ln 2\cosh(\sqrt{\hat Q_o}\, z_o + \hat R)\right] \qquad (11)$$

and the energy part

$$G_r = -\frac{2}{m}\int_{-\infty}^{\infty} Dt_1 \int_0^{\infty} Dt_2 \ln \int_{-\infty}^{\infty} Dt_o\, e^{-\beta m[V(\lambda_o) + (\lambda_o - t)^2/2x_1]}, \qquad (12)$$

where the $\lambda$ integral is done by a saddle point method and $\lambda_o$ is such that

$$\frac{\partial V}{\partial\lambda} + \frac{\lambda - t}{x_1} = 0 \quad\text{for } \lambda = \lambda_o. \qquad (13)$$

It follows that

$$-x_o f = -\frac{(1-q_o)\hat Q_o}{2} - r\hat R + \int Dz_o \ln 2\cosh(\sqrt{\hat Q_o}\, z_o + \hat R) + 2\alpha\int_{-\infty}^{\infty} Dt_1 \int_0^{\infty} Dt_2 \ln\int_{-\infty}^{\infty} Dt_o\, e^{-x_o[V(\lambda_o) + (\lambda_o - t)^2/2x_1]}. \qquad (14)$$

The order parameters of the incomplete generalization phase are obtained by solving the saddle point equations:

$$\frac{\partial f}{\partial\hat R} = 0 \Rightarrow r = \int Dz_o \tanh(\sqrt{\hat Q_o}\, z_o + \hat R) \equiv \phi_1(\hat Q_o, \hat R), \qquad (15)$$

$$\frac{\partial f}{\partial\hat Q_o} = 0 \Rightarrow q_o = \int Dz_o \tanh^2(\sqrt{\hat Q_o}\, z_o + \hat R) \equiv \phi_2(\hat Q_o, \hat R), \qquad (16)$$

$$\frac{\partial f}{\partial r} = 0 \Rightarrow \hat R = 2\alpha\frac{x_o}{x_1}\int Du\, \frac{e^{-u^2/2\Gamma}}{\sqrt{2\pi(1-g^2)}}\,\langle\lambda_o - t\rangle_{t_o}, \qquad (17)$$

$$\frac{\partial f}{\partial q_o} = 0 \Rightarrow \hat Q_o = 2\alpha\left(\frac{x_o}{x_1}\right)^2\int Du\, H\left(\frac{-u}{\sqrt{\Gamma}}\right)\langle\lambda_o - t\rangle^2_{t_o}, \qquad (18)$$

where $g\equiv r/\sqrt{q_o}$, $\Gamma\equiv(1-g^2)/g^2$, and

$$\langle(\cdots)\rangle_{t_o} \equiv \frac{\displaystyle\int Dt_o\, e^{-x_o[V(\lambda_o)+(\lambda_o-t)^2/2x_1]}\,(\cdots)}{\displaystyle\int Dt_o\, e^{-x_o[V(\lambda_o)+(\lambda_o-t)^2/2x_1]}}. \qquad (19)$$

The orthogonal transformation $(t_1,t_2)\to(u,v)$ defined by $u = (\sqrt{q_o-r^2}\, t_1 + r t_2)/\sqrt{q_o}$, $v = (-r t_1 + \sqrt{q_o-r^2}\, t_2)/\sqrt{q_o}$, and $t = \sqrt{q_o}\, u + \sqrt{1-q_o}\, t_o$ was used to simplify Eq. (18).

Introducing an effective modulation function

$$F(u) = \langle\lambda_o - t\rangle_{t_o}, \qquad (20)$$

the saddle point equations can be written as

$$\hat R = 2\alpha\frac{x_o}{x_1}\int Du\, \frac{e^{-u^2/2\Gamma}}{\sqrt{2\pi(1-g^2)}}\,F(u),$$
$$\hat Q_o = 2\alpha\left(\frac{x_o}{x_1}\right)^2\int Du\, H\left(\frac{-u}{\sqrt{\Gamma}}\right)F^2(u).$$
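Given any candidate modulation function F(u), the two integrals above are one dimensional and can be evaluated by Gauss-Hermite quadrature. A minimal sketch follows (Python/NumPy with SciPy); the constants g, c, α, and x_o/x_1 in the example call are illustrative placeholders of ours, not values from a saddle point solution.

```python
import numpy as np
from scipy.special import erfc

H = lambda x: 0.5 * erfc(x / np.sqrt(2.0))

# probabilists' Gauss-Hermite rule: integrates f(u) against exp(-u^2/2)
nodes, weights = np.polynomial.hermite_e.hermegauss(80)

def conjugate_parameters(F, alpha, g, xo_over_x1):
    """Evaluate the rewritten saddle point equations for a given F(u):
    Rhat  = 2 alpha (xo/x1)   int Du exp(-u^2/2 Gamma)/sqrt(2 pi (1-g^2)) F(u),
    Qohat = 2 alpha (xo/x1)^2 int Du H(-u/sqrt(Gamma)) F(u)^2."""
    Gamma = (1.0 - g**2) / g**2
    Fu = F(nodes)
    du = weights / np.sqrt(2.0 * np.pi)            # the Du measure
    Rhat = 2.0 * alpha * xo_over_x1 * np.sum(
        du * np.exp(-nodes**2 / (2.0 * Gamma))
        / np.sqrt(2.0 * np.pi * (1.0 - g**2)) * Fu)
    Qohat = 2.0 * alpha * xo_over_x1**2 * np.sum(
        du * H(-nodes / np.sqrt(Gamma)) * Fu**2)
    return Rhat, Qohat

# example: the Gaussian-over-H form of Eq. (24) below, with an
# illustrative constant c standing in for (x1/xo) k / (2 sqrt(2 pi (1-g^2)))
g, c = 0.6, 1.0
Gamma = (1.0 - g**2) / g**2
F24 = lambda u: c * np.exp(-u**2 / (2.0 * Gamma)) / H(-u / np.sqrt(Gamma))
print(conjugate_parameters(F24, alpha=0.5, g=g, xo_over_x1=1.0))
```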
We can now determine the $F(u)$ that maximizes, for a fixed $\alpha$, the overlap $r$. This is somewhat harder than in the real-weights case, where the solution to the variational problem is obtained very easily, just by inspection plus knowledge of the equivalent solution for the on-line problem. Nevertheless, the solution is very similar. So we look at the variations of Eq. (15) with respect to $F(u)$, subject to the constraint (16). Let $\xi$ be a Lagrange multiplier (see also [31]). $F(u)$ can be determined by maximizing

$$\mathcal{F} = \phi_1(\hat Q_o[q_o,r,F], \hat R[q_o,r,F]) - \xi\,\phi_2(\hat Q_o[q_o,r,F], \hat R[q_o,r,F]), \qquad (21)$$

which yields

$$\frac{d\hat Q_o}{dF} = -\frac{\partial\phi_1/\partial\hat R - \xi\,\partial\phi_2/\partial\hat R}{\partial\phi_1/\partial\hat Q_o - \xi\,\partial\phi_2/\partial\hat Q_o}\,\frac{d\hat R}{dF} \equiv k\,\frac{d\hat R}{dF}, \qquad (22)$$

and, by using the explicit forms of $\hat R$ and $\hat Q_o$ and the definitions (15) and (16),

$$\int Du\, H\left(\frac{-u}{\sqrt{\Gamma}}\right)F(u) = \frac{x_1}{x_o}\frac{k}{2}\int Du\,\frac{e^{-u^2/2\Gamma}}{\sqrt{2\pi(1-g^2)}}. \qquad (23)$$

This does not determine $F(u)$ uniquely. A possible solution is

$$F(u) \equiv \langle\lambda_o - t\rangle_{t_o} = \frac{1}{2}\,\frac{x_1}{x_o}\,\frac{k}{\sqrt{2\pi(1-g^2)}}\,\frac{e^{-u^2/2\Gamma}}{H(-u/\sqrt{\Gamma})}. \qquad (24)$$

Any odd function of $u$ could be added to the Gaussian in the numerator, but we choose to look at this form since, as it stands, it is proportional to the optimal modulation function that appears in off-line and on-line optimization of the real-weights perceptron, and we see no physical motivation for other terms. This choice (24) leads to the saddle point equations

$$\hat R = \sqrt{\frac{2}{\pi}}\,\alpha\kappa\, I_o(g), \qquad (25)$$

$$\hat Q_o = 2\alpha\kappa^2\sqrt{1-g^2}\,I_o(g), \qquad (26)$$

where $\kappa\equiv k/[2\sqrt{2\pi(1-g^2)}]$, $I_o(g)\equiv\int Dv\, g(-gv)$, $g\equiv r/\sqrt{q_o}$, and $g(x)\equiv e^{-x^2/2}/H(x)$. We can still choose $\kappa$, and we do so in order to pick the largest possible $r$ for fixed $\alpha$. (Fig. 1 shows $\alpha$ as a function of $\kappa_{var}$, the value of $\kappa$ that leads to the smallest generalization error.) Note that this liberty is due to the fact that the value of $q_o$ is not constrained; only its form is constrained by Eq. (16).

FIG. 1. The value of $\alpha$ as a function of the hyperparameter $\kappa$ at which the low generalization phase performance is optimal (thick line). For each fixed $\kappa$, we show also the value of $\alpha_s$, the spinodal point (short-long dash), where the low generalization phase ceases to exist, and of $\alpha_c$, the first order transition point (long dash), above which the perfect generalization phase has the lowest energy.

The learning curves of the low generalization phase can be determined numerically [see Figs. 2(a) and 2(b)]. In Fig. 3 the overlap $q_o$ is shown as a function of $\alpha$ for the best choice of the potential, i.e., using $\kappa_{var}$. The determination of the thermodynamic phase, that is, the location of the first order transition, needs an analysis of the free energy in both the high and low generalization phases. While the free energy of the low generalization phase can be determined without explicit knowledge of the potential, that of the high generalization phase will require detailed knowledge of the potential. In the next section we show how this can be done.

III. THE FREE ENERGY AND THE POTENTIAL

A. The low generalization phase

To complete the analysis of the learning curve we must look into the behavior of the free energy. This is not straightforward, since the form of the potential has not yet been determined, but only the expected value of $\partial V/\partial\lambda$. It is quite interesting, as we now show, that the precise form of the potential is not needed to determine the free energy or the learning curve in this phase; just knowledge of the effective modulation function suffices, and even this is the same for any solution of Eq. (23).

The energy contribution to the free energy can be written as

$$-x_o f_1 = 2\alpha\int Du\, H\left(\frac{-u}{\sqrt{\Gamma}}\right)\ln\int Dt_o \exp\left[-x_o\left(V(\lambda_o) + \frac{(\lambda_o - t)^2}{2x_1}\right)\right],$$

provided the order parameters are understood as the solution of the saddle point equations. Integrating by parts, using the definitions of $\lambda_o$ and $F(u)$,

$$-x_o f_1 = 2\alpha\frac{x_o}{x_1}\sqrt{q_o}\int_{-\infty}^{\infty} du\, F(u)\int_{-\infty}^{u} Dy\, H\left(\frac{-y}{\sqrt{\Gamma}}\right),$$

substituting Eq. (24) for $F(u)$, and integrating by parts again,

$$-x_o f_1 = 2\alpha\kappa\sqrt{2\pi q_o\Gamma}\int_{-\infty}^{\infty} Du\, H\left(\frac{-u}{\sqrt{\Gamma}}\right)\ln H\left(\frac{-u}{\sqrt{\Gamma}}\right) \qquad (27)$$

is obtained.
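Equation (27) reduces the low phase energy term to a single Gaussian integral of H ln H. A minimal sketch of its evaluation (Python/NumPy with SciPy; the argument values in the example call are again illustrative, since in the complete calculation α fixes g, q_o, and κ through the saddle point):

```python
import numpy as np
from scipy.special import erfc

H = lambda x: 0.5 * erfc(x / np.sqrt(2.0))
nodes, weights = np.polynomial.hermite_e.hermegauss(120)

def energy_term_low_phase(alpha, kappa, g, qo):
    """Right-hand side of Eq. (27), i.e., -xo*f1:
    2 alpha kappa sqrt(2 pi qo Gamma) int Du H(-u/sqrt(Gamma)) ln H(-u/sqrt(Gamma))."""
    Gamma = (1.0 - g**2) / g**2
    hu = H(-nodes / np.sqrt(Gamma))
    integral = np.sum(weights / np.sqrt(2.0 * np.pi) * hu * np.log(hu))
    return 2.0 * alpha * kappa * np.sqrt(2.0 * np.pi * qo * Gamma) * integral

print(energy_term_low_phase(alpha=0.5, kappa=1.0, g=0.6, qo=0.5))  # illustrative
```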
FIG. 2. (a) Learning curves: generalization errors as functions of $\alpha$ for different algorithms. The Bayes (continuous line) and Gibbs (top short dashed curve) algorithm learning curves are included for comparison. Note that the Bayes algorithm is obtained by a student outside the binary space. BB is the best binary vector [18], obtained as the result of clipping the center of mass of the version space (Bayes). The learning curve (dots) obtained under the hypothesis of replica symmetry has a transition to perfect learning at the impossible value of $\alpha = 0.53$. The algorithm obtained with $\kappa = 2$ gives the Gibbs result. For $\kappa = 3$ and $\kappa = 6$ the curves show a trade-off between earlier performance and later transition. For variable $\kappa_{var}(\alpha)$ (Fig. 1) the best performance for potential based learning is obtained. See (b) for details. For small $\alpha$ the unphysical RS, the RSB-1 with $\kappa_{var}$, and the BB curves are numerically indistinguishable. (b) Details of the learning curves. The black continuous line is obtained for $\kappa_{var}(\alpha)$ (Fig. 1); it is the envelope, within the low generalization phase, of the family of algorithms obtained for $\kappa \geq 2$.

B. The high generalization phase

We now look at the value of the free energy at the extreme of the allowed interval for the order parameters. The determination of the potential cannot be postponed, since the free energy depends on it explicitly.

Equation (23), which results from the variational prescription, can be transformed into an integral equation for the effective potential $E(t)=V(\lambda_o)+(\lambda_o-t)^2/2x_1$:

$$\int Dt_o\, e^{-x_o E(t)} = \left[H\left(\frac{-u}{\sqrt{\Gamma}}\right)\right]^b, \qquad (28)$$

where $t = \sqrt{q_o}\, u + \sqrt{1-q_o}\, t_o$ and

$$b = \sqrt{\frac{\kappa^2 q_o}{4g^2}}. \qquad (29)$$

An expression for the potential, obtained in the Appendix, is

$$E(t) = -\frac{1}{x_o}\ln\int Dx\left[H\left(\frac{-t}{\sqrt{\Gamma q_o}} - ix\sqrt{\frac{1-q_o}{\Gamma q_o}}\right)\right]^b, \qquad (30)$$

where the effect of the one-step RSB is seen to be the introduction of a noiselike term in the stability $t$, which depends on the existence of the other valleys. Its influence goes to zero as $q_o\to 1$. Despite its appearance, this expression is real, as can be seen by defining

$$\mathcal{C}(x,t) = \int DK\,\Theta\left(K + \frac{t}{\sqrt{\Gamma q_o}}\right)\cos\left(xK\sqrt{\frac{1-q_o}{\Gamma q_o}}\right), \qquad (31)$$

$$\mathcal{S}(x,t) = \int DK\,\Theta\left(K + \frac{t}{\sqrt{\Gamma q_o}}\right)\sin\left(xK\sqrt{\frac{1-q_o}{\Gamma q_o}}\right). \qquad (32)$$

Finally we obtain

$$E(t) = -\frac{1}{x_o}\ln\int\frac{dx}{\sqrt{2\pi}}\exp\left[-\frac{1}{2}\left(1 - b\,\frac{1-q_o}{\Gamma q_o}\right)x^2\right]\big[\mathcal{C}^2(x,t) + \mathcal{S}^2(x,t)\big]^{b/2}\cos\left(b\tan^{-1}\frac{\mathcal{S}(x,t)}{\mathcal{C}(x,t)}\right), \qquad (33)$$

which can be used for numerical evaluations. It is easy to verify that the solution is real.

FIG. 3. $q_o$ and $r$ as functions of $\alpha$. Note that $q_o$ is different from 1 as soon as $\alpha\neq 0$, but is so close to 1 that the learning curves under the RS and RSB-1 methods [see Fig. 2(a)] are not very different for small $\alpha$.

At this point notice the structure of the potential that emerges from the calculation. If $q_o = 1$, then we are back to the replica symmetric calculation: the $x$ integral decouples and we are left with $-x_o E(t) = b\ln H(-t/\sqrt{\Gamma})$, which is very similar to the optimal potentials that have been found for the real-weights perceptron, both on line and off line, and for the binary perceptron without RSB. However, RSB introduces a new noiselike element.

One of the most striking features of this solution is that, like other variationally determined potentials, it depends on the values of the order parameters. This has been discussed elsewhere [32] and can be interpreted as the inclusion of the correct annealing along the learning process, generalizing the annealing of the learning rate that has been studied in on-line learning algorithms (e.g., [33,34,26]).
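Equations (31)-(33) can be evaluated by straightforward nested quadrature. A minimal sketch follows (Python with SciPy); the truncation limits and the parameter values in the example call are our own illustrative choices, and the x integral is assumed convergent, i.e., 1 − b(1−q_o)/(Γq_o) > 0.

```python
import numpy as np
from scipy.integrate import quad

def E_of_t(t, b, qo, Gamma, xo, xmax=10.0, kmax=8.0):
    """Effective potential E(t) of Eqs. (31)-(33). C and S are cosine
    and sine transforms of a truncated Gaussian (the theta function
    keeps K > -t/sqrt(Gamma qo)); the outer x integral is truncated
    at +-xmax, where the Gaussian factor is negligible."""
    a = np.sqrt((1.0 - qo) / (Gamma * qo))
    lo = -t / np.sqrt(Gamma * qo)
    gauss = lambda K: np.exp(-K**2 / 2.0) / np.sqrt(2.0 * np.pi)
    C = lambda x: quad(lambda K: gauss(K) * np.cos(a * x * K), lo, kmax)[0]
    S = lambda x: quad(lambda K: gauss(K) * np.sin(a * x * K), lo, kmax)[0]

    def integrand(x):
        c, s = C(x), S(x)
        return (np.exp(-0.5 * (1.0 - b * (1.0 - qo) / (Gamma * qo)) * x**2)
                * (c * c + s * s)**(b / 2.0)
                * np.cos(b * np.arctan2(s, c))) / np.sqrt(2.0 * np.pi)

    val = quad(integrand, -xmax, xmax, limit=200)[0]
    return -np.log(val) / xo

# illustrative numbers only; in the paper b, qo, Gamma, and xo all come
# from the saddle point solution at the corresponding alpha
print(E_of_t(t=0.5, b=2.0, qo=0.9, Gamma=1.0, xo=1.0))
```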
These order parameters play the role of hyperparameters and will have their values determined by the solution of the saddle point equations. The fact that these values may differ from those determined by the thermodynamics is of fundamental importance. The learning process thus proceeds in the following way. Determine self-consistently the potential (functional form and hyperparameters) that will lead to maximum performance in the low generalization phase, in such a manner that the respective values of the order parameters are equal to the hyperparameters. Then minimize the potential by some dynamical process, letting the temperature go to zero. We do not worry here about thermalization times, as these might diverge for an NP problem. The final result may land in the perfect generalization phase, and therefore the order parameters will not have the same values as the hyperparameters. We denote the hyperparameters by starred quantities. Then we look at the limit $q_o,q_1\to 1$, $\beta\to\infty$, $m\to 0$ of Eq. (14) with $E^*(t)$ calculated at the starred (saddle point equation) values, which gives

$$f = 2\alpha\int_0^{\infty} Dt\, E^*(t). \qquad (34)$$

C. Learning curves

Equations (15), (16), (25), and (26) are used to build the low generalization learning curves. From the comparison of the free energies, Eqs. (27) and (34), the phase transition is located. These equations have to be complemented with a choice of $\kappa$. We can just look at the numerical value of $\kappa$ that leads to the smallest generalization error for fixed $\alpha$, which we called $\kappa_{var}(\alpha)$ (Fig. 1). We look also at the results obtained for fixed $\kappa$. A Bayesian statistician will not be surprised that the Bayes bounds are not beaten. Neither is the (low phase) generalization error improved, nor the onset of the high generalization phase anticipated. These equations just lead in general to smaller errors than the Gibbs algorithm. However, for small $\alpha$ the RS, the RSB-1 [$\kappa_{var}(\alpha)$], and the Bayesian best binary curves of Copelli et al. [18] are very close. At this point this agreement is only numerical, but it is possible that these algorithms have the exact same optimal (Bayesian) performance in the limit $\alpha\to 0$, which is similar to the results of [18] for unsupervised learning.

Figure 4 shows the potential [Eq. (33)] for different values of $b$. At $b = 1$ the potential turns into the error-counting potential, which gives the Gibbs performance. Then replica symmetry is restored. Below the value $b = 1$, the potential cannot be determined.
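The free energy of the perfect generalization branch, Eq. (34), is a single Gaussian integral of the starred potential, and the transition is located where it crosses the low phase free energy built from Eqs. (14) and (27). A minimal sketch (Python with SciPy); the stand-in E* in the example call is purely illustrative, since the actual E* comes from Eq. (33) evaluated at the starred values, e.g., with the E_of_t routine sketched above.

```python
import numpy as np
from scipy.integrate import quad

def f_perfect(alpha, Estar, tmax=8.0):
    """Free energy of the perfect generalization phase, Eq. (34):
    f = 2 alpha int_0^inf Dt E*(t). Estar must be the effective
    potential evaluated at the starred (hyperparameter) values."""
    integrand = lambda t: np.exp(-t**2 / 2.0) / np.sqrt(2.0 * np.pi) * Estar(t)
    return 2.0 * alpha * quad(integrand, 0.0, tmax)[0]

# the first order transition alpha_c is where this branch crosses the
# low phase free energy; here only a stand-in E* is used for illustration
print(f_perfect(0.5, Estar=lambda t: np.exp(-t)))
```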
FIG. 4. The effective potential $E(t)$ as a function of the stability $t$, showing the annealing for different values of $\alpha$.

IV. DISCUSSION AND CONCLUSIONS

The learning curves in Fig. 2 show that the potentials obtained variationally fail to reach the Bayes bounds. This is in contrast to the continuous weight perceptron, where the Bayes limit is obtained by a network with the same architecture as the teacher. As shown by Copelli et al. [18], the Bayes algorithm is equivalent to a network with a weight vector given by the center of mass of the version space, which is not itself a binary vector. It follows that no method constrained to the hypercube will reach the Bayes limit. A similar failure to reach the Bayes limit was also reported by Winther et al. [35] for multilayer networks, where again the Bayes algorithm cannot be matched within the space of students with the same architecture as the teacher.

The variational method probes potentials from within a restricted class, and it is therefore natural not to expect to find Bayesian performances if the Bayes algorithm does not belong to it. The performance of Bayesian inference restricted to binary vectors is saturated only for small $\alpha$, but then even the simple Hebb algorithm has optimal generalization in that regime. Within the low generalization phase the variational method is able to identify a class of potentials that lead to better performance than the Gibbs algorithm. There is a trade-off within the class, as can be seen in Figs. 2(a) and 2(b), between an earlier transition to perfect generalization and better performance: a potential that leads to better generalization will have a delayed transition.

The potentials of this class do not work by imposing zero memorization error, not even by minimization of the total error count. They tend to minimize the average error with respect to the teacher, and since the teacher typically lies near the border of the version space (just by the geometry of the N-dimensional space), the approach to the teacher weight vector can be made through vectors outside the version space. The version space collapses to only one element at typically $\alpha\sim 1.25$, but the variational class, not considering the version space, will not detect this until later.

The replica symmetric variational study incorrectly predicts a transition at $\alpha < 1$. A stability analysis indicates that the replica symmetric ansatz is not adequate for all $\alpha > 0$. This is a little surprising, and the origin of the instability can be traced back to the requirement that $\beta\to\infty$ implies $q\to 1$ for $\alpha\neq 0$. This is in contrast with the case of real weights where, even in the presence of multiplicative noise, learning with the variational potential is replica symmetric. The effect of noise can break the phase space into disconnected regions, which are essentially ignored by the robust learning of the optimal potential, which disregards outliers. A method that insists on minimizing the memorization error will certainly lead in such conditions to a replica symmetry breaking situation.

The one-step broken replica symmetry leads to apparently consistent physical behavior. The stability analysis will be presented elsewhere [28]. A preliminary picture that emerges from such analysis is that, while one-step RSB is enough to give locally stable results and suggests a reasonable physical picture, it may fail to be globally correct. We think that the RSB-1 calculation describes qualitatively the main features of this difficult problem, but a full continuous RSB scheme [28] will be necessary to understand the thermodynamic equilibrium bounds obtainable from the minimization of a potential. Even this will not tell the complete story, however, since issues dealing with effective learning times will still remain.

This work, together with the results of [16-18,13], suggests that in the optimization of computationally hard discrete problems it might be a better strategy to first leave the space of configurations, in this case the vertices of the hypercube, then optimize in the hypersphere, and finally go back to the original space, instead of striving to respect the discreteness constraints at every step.

ACKNOWLEDGMENTS

The authors thank M. Copelli and O. Kinouchi for important discussions and suggestions, and also the participants of the Workshop on Statistical Mechanics, Max Planck Institute, Dresden (1999), for discussions on this subject. E.B. and N.C.
were partially sponsored by the Conselho Nacional de Desenvolvimento Científico e Tecnológico (CNPq) and C.R.M. by the Fundação para o Desenvolvimento da Unesp (FUNDUNESP).

APPENDIX

Equation (23) leads to an integral equation for the potential:

$$\int Dt_o \exp[-x_o E(t)] = \left[H\left(\frac{-u}{\sqrt{\Gamma}}\right)\right]^b \equiv A(u), \qquad (A1)$$

where $t = \sqrt{q_o}\, u + \sqrt{1-q_o}\, t_o$, $b = \sqrt{\kappa^2 q_o/4g^2}$, and $Dt_o = (dt_o/\sqrt{2\pi})\, e^{-t_o^2/2}$. To obtain the effective potential, note that the integral on the left side is a convolution, so it seems natural to perform a Fourier transformation. The fact that $A(u)$ is not square integrable is bypassed by defining a new problem:

$$\int Dt_o \exp[-x_o E_\xi(t)] = \left[H_\xi\left(\frac{-u}{\sqrt{\Gamma}}\right)\right]^b \equiv A_\xi(u). \qquad (A2)$$

Here

$$H_\xi\left(\frac{-u}{\sqrt{\Gamma}}\right) = \int_{-\infty}^{\infty} Dy_i\,\Theta_\xi\left(y_i + \frac{u}{\sqrt{\Gamma}}\right),$$

where

$$\Theta_\xi\left(y + \frac{u}{\sqrt{\Gamma}}\right) = \exp\left[-\xi\left(y + \frac{u}{\sqrt{\Gamma}}\right)^2\right]$$

if $y > -u/\sqrt{\Gamma}$ and 0 otherwise. Once $E_\xi(t)$ is found, we will take the regularizing parameter $\xi$ to zero. After Fourier transforming, dividing by the Gaussian on the left, and Fourier transforming back, we get

$$\mathcal{E}_\xi \equiv \exp[-x_o E_\xi(t)] = \sqrt{q_o}\int\frac{dk}{2\pi}\left(e^{(1-q_o)k^2/2}\int e^{i\sqrt{q_o}\,ku}A_\xi(u)\,du\right)e^{-ikt}. \qquad (A3)$$

To be able to perform the above integrals we use a simple replica trick, which consists in considering $b$ as an integer:

$$\left[H_\xi\left(\frac{-u}{\sqrt{\Gamma}}\right)\right]^b = \prod_{i=1}^{b}\int_{-u/\sqrt{\Gamma}}^{\infty} Dy_i\exp\left[-\xi\left(y_i + \frac{u}{\sqrt{\Gamma}}\right)^2\right] = \prod_{i=1}^{b}\int_{-\infty}^{\infty} Dy_i\,\Theta_\xi\left(y_i + \frac{u}{\sqrt{\Gamma}}\right). \qquad (A4)$$

Introduce the Fourier transforms

$$\Theta_\xi(y_i - x) = \int f_\xi(r_i)\,e^{ir_i(y_i - x)}\,dr_i, \qquad e^{-y_i^2/2} = \int g(\omega_i)\,e^{-i\omega_i y_i}\,d\omega_i.$$

The Fourier transform of $A_\xi(u)$ is

$$\int e^{i\sqrt{q_o}\,ku}\left[H_\xi\left(\frac{-u}{\sqrt{\Gamma}}\right)\right]^b du = \int e^{i\sqrt{q_o}\,ku}\left\{\int\prod_{i=1}^{b}\left(\frac{dy_i}{\sqrt{2\pi}}\,dr_i\,d\omega_i\, f_\xi(r_i)\,g(\omega_i)\right)\exp\left[i\sum r_i\left(y_i + \frac{u}{\sqrt{\Gamma}}\right) - i\sum\omega_i y_i\right]\right\}du.$$

The $u$ and $y_i$ integrals are now automatic. Going back to the equation for the potential,

$$\mathcal{E}_\xi = \int\frac{\sqrt{q_o}\,dk}{(2\pi)^{b/2}}\,e^{(1-q_o)k^2/2 - ikt}\int_{-\infty}^{\infty}\left(\prod_{i=1}^{b}dr_i\,f_\xi(r_i)\,g(r_i)\right)\delta\left(\sqrt{q_o}\,k + \frac{\sum r_i}{\sqrt{\Gamma}}\right),$$

we now integrate over $k$:

$$\mathcal{E}_\xi = \int_{-\infty}^{\infty}\left(\prod_{i=1}^{b}\sqrt{2\pi}\,dr_i\,f_\xi(r_i)\,g(r_i)\right)\exp\left[\frac{1}{2}\frac{1-q_o}{\Gamma q_o}\Big(\sum r_i\Big)^2\right]\exp\left(-i\,\frac{\sum r_i}{\sqrt{\Gamma q_o}}\,t\right).$$

The linearization of the $(\sum r_i)^2$ term is done by a standard trick. Use the fact that $g(r)$ is a Gaussian and that $f_\xi(r) = (1/2\pi)\int\Theta_\xi(y)\,e^{-iyr}\,dy$; then

$$\mathcal{E}_\xi = \int\frac{dx}{\sqrt{2\pi}}\,e^{-x^2/2}\left(\int\frac{dy}{\sqrt{2\pi}}\,\Theta_\xi(y)\int_{-\infty}^{\infty}dr\,e^{-r^2/2}\,e^{-x\sqrt{(1-q_o)/\Gamma q_o}\,r}\,e^{-iKr}\right)^b,$$

where $K = t/\sqrt{\Gamma q_o} + y$. Integrating over $r$,

$$\mathcal{E}_\xi = \int\frac{dx}{\sqrt{2\pi}}\,e^{-x^2/2}\left[\int\frac{dy}{\sqrt{2\pi}}\,\Theta_\xi(y)\exp\frac{1}{2}\left(\frac{1-q_o}{\Gamma q_o}\,x^2 - K^2 + 2ixK\sqrt{\frac{1-q_o}{\Gamma q_o}}\right)\right]^b,$$

and $b$ can again be taken to be real. Changing the integration variable to $K$ and taking $\xi$ to zero,

$$\mathcal{E}_0 = \int\frac{dx}{\sqrt{2\pi}}\exp\left[-\frac{1}{2}\left(1 - \frac{1-q_o}{\Gamma q_o}\,b\right)x^2\right]\left[\int DK\,\Theta\left(K + \frac{t}{\sqrt{\Gamma q_o}}\right)\exp\left(ixK\sqrt{\frac{1-q_o}{\Gamma q_o}}\right)\right]^b.$$

Extending the definition of the $H$ function to complex arguments, the potential can be written formally as Eq. (30).

[1] M. Mézard, G. Parisi, and M. Virasoro, Spin Glass Theory and Beyond (World Scientific, Singapore, 1987).
[2] T. L. H. Watkin, A. Rau, and M. Biehl, Rev. Mod. Phys. 65, 499 (1993).
[3] M. Opper and W. Kinzel, in Models of Neural Networks III, edited by E. Domany, J. L. van Hemmen, and K. Schulten (Springer-Verlag, Berlin, 1996).
[4] A. Engel and C. Van den Broeck, Statistical Mechanics of Learning (Cambridge University Press, Cambridge, 2000).
[5] L. Pitt and L. Valiant, J. Assoc. Comput. Mach. 35, 965 (1988).
[6] E. Gardner and B. Derrida, J. Phys. A 21, 271 (1988).
[7] G. Györgyi, Phys. Rev. Lett. 64, 2957 (1990).
[8] D. Haussler, M. Kearns, H. S. Seung, and N. Tishby, Mach. Learning 25, 195 (1996).
[9] H. Horner, Z. Phys. B: Condens. Matter 86, 291 (1992); 87, 371 (1992).
[10] M. Golea and M. Marchand, J. Phys. A 26, 5751 (1993).
[11] C. Van den Broeck and M. Bouten, Europhys. Lett. 22, 223 (1993).
[12] D. Bollé and G. M. Shim, Network: Comput. Neural Syst. 6, 619 (1995).
[13] L. Reimers, M. Bouten, and B. Van Rompaey, J. Phys. A 29, 6247 (1996).
[14] M. Bouten, L. Reimers, and B. Van Rompaey, Phys. Rev. E 58, 2378 (1998).
[15] R. W. Penney and D. Sherrington, J. Phys. A 26, 6173 (1993).
[16] M. Copelli and C. Van den Broeck, Phys. Rev. E 61, 6971 (2000).
[17] M. Copelli, M. Bouten, C. Van den Broeck, and B. Van Rompaey, Europhys. Lett. 47, 139 (1999).
[18] M. Copelli, C. Van den Broeck, and M. Opper, J. Phys. A 32, L555 (1999).
[19] W. Kinzel and R. Urbanczik, J. Phys. A 31, L27 (1998).
[20] S. Solla and O. Winther, in On-Line Learning in Neural Networks, edited by D. Saad (Cambridge University Press, Cambridge, 1998).
[21] M. Opper, Phys. Rev. Lett. 77, 4671 (1996); in On-Line Learning in Neural Networks (Ref. [20]).
[22] O. Kinouchi and N. Caticha, Phys. Rev. E 54, R54 (1996).
[23] G. J. Bex, R. Serneels, and C. Van den Broeck, Phys. Rev. E 51, 6309 (1995).
[24] M. Opper and D. Haussler, Phys. Rev. Lett. 66, 2677 (1991); in Proceedings of the IVth Annual Workshop on Computational Learning Theory (COLT91) (Morgan Kaufmann, San Mateo, CA, 1991).
[25] M. Bouten, J. Schietse, and C. Van den Broeck, Phys. Rev. E 52, 1958 (1995).
[26] O. Kinouchi and N. Caticha, J. Phys. A 25, 6243 (1992).
[27] N. Caticha and E. Araújo de Oliveira, Universidade de São Paulo report (unpublished).
[28] E. Botelho, C. Mattos, and N. Caticha (unpublished).
[29] W. Krauth and M. Mézard, J. Phys. (Paris) 50, 3057 (1989).
[30] H. S. Seung, H. Sompolinsky, and N. Tishby, Phys. Rev. A 45, 6056 (1992).
[31] A. Buhot, J. M. Torres Moreno, and M. B. Gordon, Phys. Rev. E 55, 7434 (1997).
[32] N. Caticha and O. Kinouchi, Philos. Mag. B 77, 1565 (1998).
[33] S. Amari, Neural Networks 6, 161 (1993).
[34] N. Barkai, H. S. Seung, and H. Sompolinsky, Phys. Rev. Lett. 75, 1415 (1995).
[35] O. Winther, B. Lautrup, and J.-B. Zhang, Phys. Rev. E 55, 836 (1997).