PHYSICAL REVIEW E, VOLUME 62, NUMBER 5, NOVEMBER 2000

Variational studies and replica symmetry breaking in the generalization problem of the binary perceptron

Evaldo Botelho,(1) Cristiano R. Mattos,(1,2) and Nestor Caticha(1)
(1) Instituto de Física da Universidade de São Paulo, Caixa Postal 66318, São Paulo, SP 05315-970, Brazil
(2) Faculdade de Engenharia, Universidade Estadual Paulista, Caixa Postal 205, Guaratinguetá, SP 12500-000, Brazil
(Received 29 May 2000)

We analyze the average performance of a general class of learning algorithms for the nondeterministic polynomial time complete problem of rule extraction by a binary perceptron. The examples are generated by a rule implemented by a teacher network of similar architecture. A variational approach is used in trying to identify the potential energy that leads to the largest generalization in the thermodynamic limit. We restrict our search to algorithms that always satisfy the binary constraints. A replica symmetric ansatz leads to a learning algorithm which presents a phase transition in violation of an information theoretical bound. Stability analysis shows that this is due to a failure of the replica symmetric ansatz, and the first step of replica symmetry breaking (RSB) is studied. The variational method does not determine a unique potential, but it allows construction of a class with a unique minimum within each first order valley. Members of this class improve on the performance of the Gibbs algorithm but fail to reach the Bayesian limit in the low generalization phase. They even fail to reach the performance of the best binary, an optimal clipping of the barycenter of version space. We find a trade-off between good performance in the low generalization phase and an early onset of perfect generalization. Although the RSB solution may be locally stable, we discuss the possibility that it fails to be the correct saddle point globally.

PACS number(s): 87.10.+e, 84.35.+i, 89.70.+c, 05.50.+q

I. INTRODUCTION

One topic of interest in the study of neural networks with the tools of statistical mechanics (SM) is that the process of information extraction from data can be modeled by a dynamical process of minimization of an energy function in the presence of noise. The use of techniques borrowed from the study of equilibrium disordered systems [1], such as, for example, replica methods, Thouless, Anderson, and Palmer equations, cavity analysis, and Monte Carlo simulations, as well as dynamical analysis techniques, has permitted a considerable understanding of typical properties of the thermodynamic limit (TL) of such systems as attractor and feedforward neural networks.

In this paper we are interested in the equilibrium properties of the student-teacher problem [2-4] of rule extraction by a binary Boolean perceptron. This means that the weights as well as the output are restricted to be ±1. We look at the generalization ability for the case of a realizable rule represented by a teacher of the same architecture as the student, and in particular we concentrate on the determination of thermodynamic limit bounds and how to attain them by the construction of an appropriate potential.

The binary perceptron has been attacked from many fronts. Studies of the computational complexity by Pitt and Valiant have shown that it belongs to the NP (nondeterministic polynomial time) complete class [5].
From a SM point of view it has been studied before by Gardner and Derrida [6] and Györgyi [7]. Their aim was to study the learning curve (generalization error) as a function of the number of examples, where the training energy was chosen in the simplest way, i.e., error counting. For independent examples drawn from a uniform distribution, the main characteristic of this system in the TL is the first order transition to perfect generalization at α_c ≈ 1.25, where as usual α = P/N measures the number of examples P in units of the number of inputs N. This shows the power of statistical mechanics methods, since the transition cannot be detected by, e.g., Vapnik-Chervonenkis analysis, which aims at a general type of result, such as those independent of the distribution of examples, and can therefore miss important features that appear only for particular but important cases. For a specific comment on this, see [8].

A relevant question, complementary in spirit to examination of the thermal equilibrium properties, delves into the dynamical aspects of actually determining the set of weights that minimize the training error. The studies of Horner [9] have shown that times exponential in N are required for learning. This is in agreement with the expectation that, since this is a computationally hard problem, it should have spin glass properties. Polynomial time algorithms, such as simulated annealing of the error-counting energy with a linear decrease of the temperature schedule, fail to reach global solutions.

Different approximations have been devised in order to overcome the problems associated with glassy dynamics. Based on the fact that learning the equivalent real-weights problem is not as slow, other studies have dealt with continuous approximations to the binary problem. Several groups looked into clipping strategies [10-13] and other transformations of the continuous perceptron, in particular one that optimally incorporates the information that the teacher weight vector points in the direction of a vertex of the unit N-dimensional hypercube [14]. Penney and Sherrington [15] have looked into how to reduce the effective dimension of the problem by clipping a partial set of real couplings and posterior learning of the remaining binary weights. Copelli et al. [16-18] have looked at a related problem in the unsupervised learning scenario. They employed their methods for the teacher-student case also. By extending an argument of Watkin for the analogous problem with continuous weights, they looked into the generalization properties of the center of mass of the version space. Its performance should saturate the Bayes limit, but it is not itself a vertex of the hypercube. They also considered the Bayesian best binary (BB) vector, obtained by clipping the real component Bayes vector, for both supervised and unsupervised learning.

On-line learning, a possibly efficient strategy to overcome slow dynamics since it makes no attempt to thermalize, has been shown to be ineffective if working in the binary coupling space directly [19]. Nevertheless, progress has been made in the direction of on-line learning. Solla and Winther [20] have shown how to incorporate, in the on-line Bayesian spirit suggested by Opper [21], information about the binary nature of the couplings.
This method leads to fast learning times but is not able to condense on the exact solution in finite times.

In this work we extend the work of Kinouchi and Caticha [22] on the off-line variational determination of a training energy. We look into the extension of the variational approach to the determination of training potentials, optimal in the generalization sense, for the binary perceptron. A difference from the approach of Copelli et al. [16-18] is that our construction methods never leave the binary space. Although this in the end might turn out not to be a good strategy, it is theoretically interesting to see how far one can go on such a discrete problem without leaving the allowed space. In the case of random examples, Ref. [13] presents a variational calculation for potentials that, using the maximum stability perceptron as a teacher, aims at determining the continuous vector that on clipping reproduces the largest number of weights of the maximum stability binary perceptron.

The extension of the variational program to the binary case is not at all straightforward due to the failure of the replica symmetric (RS) ansatz. We first studied a replica symmetric variational calculation. The transition to perfect learning occurs at a much smaller value of α_c (≈0.5). This would have been great news if not for the fact that a simple information theoretical bound for perfect generalization is α_c ≥ 1 [23]. Each example cannot carry more than just one bit of information, and at least N examples are needed to determine the N independent bits encoded in the weight vector. The stability analysis showed the surprising failure of the RS ansatz from the very start at α = 0. The next step, which we report here, is to perform a variational calculation at the level of one-step replica symmetry breaking (RSB-1). The variational method does not determine a unique potential, but only fixes expected values of the modulation function. We study a restricted set of solutions, which form a one-parameter family. Our choice is based on physical similarities of this family to potentials already studied. This is interesting in itself, as it indicates the possibility that potentials other than the type we study may be relevant. The phase diagrams are studied and discussed. The learning curves associated with this family show a trade-off between better generalization and earlier transitions to the perfect generalization phase.

This paper is organized as follows. In Sec. II A we present the general replica calculation results. In Secs. II B and II C the variational procedures under RS and RSB-1, respectively, are presented. Section III discusses the analysis of the RSB-1 free energy. This requires the determination of the effective potential, which is explicitly calculated in the Appendix. Section IV discusses results and presents conclusions.

II. OFF-LINE VARIATIONAL METHOD

A. General results

The statistical mechanics approach to determining the generalization ability of a network learning from examples of a rule that itself is implemented by another network has been studied in several cases; for reviews, see [2-4]. Call the N-dimensional binary weight vectors of the teacher and student networks $\mathbf{B}$ and $\mathbf{J}$, respectively.

Off-line learning describes the infinite time limit of a learning algorithm that proceeds by minimization, in the presence of thermal noise, of an energy function $\sum_{\mu=1}^{P} V(\mathbf{J}\cdot\mathbf{S}^{\mu},\sigma^{\mu})$ that depends on quenched data represented by the $P$ input-output example pairs $(\mathbf{S}^{\mu},\sigma^{\mu})$.
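For concreteness, the setup can be summarized in a minimal numerical sketch (Python/NumPy; the sizes, the random seed, and the error-counting choice of V below are illustrative assumptions of ours, not parameters from the paper): a binary teacher B labels random inputs, and the training energy of a candidate binary student J is the simplest, error-counting one.

```python
import numpy as np

rng = np.random.default_rng(0)
N, P = 101, 126                           # illustrative sizes, alpha = P/N ~ 1.25

B = rng.choice([-1, 1], size=N)           # teacher weights: a vertex of the hypercube
S = rng.choice([-1, 1], size=(P, N))      # P example inputs, uniform distribution
sigma = np.where(S @ B >= 0, 1, -1)       # teacher outputs

def training_energy(J, S, sigma):
    """Error-counting energy, the simplest choice of potential V:
    it counts the examples whose stability lambda = sigma*(J.S)/sqrt(N)
    is not positive. The variational program of this paper replaces
    this V by an optimized potential acting on the same stabilities."""
    stabilities = sigma * (S @ J) / np.sqrt(len(J))
    return int(np.sum(stabilities <= 0))

J = rng.choice([-1, 1], size=N)           # a random binary student
print(training_energy(J, S, sigma))
```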
The free energy for the replicated system presented with $P$ examples, independently chosen from a uniform distribution, is given by

$$-\beta f = \underset{q_{ab},\hat q_{ab},R_a,\hat R_a}{\mathrm{extr}}\big[\,G_o\{q_{ab},\hat q_{ab},R_a,\hat R_a\} - \alpha\,G_r\{q_{ab},R_a\}\,\big], \qquad (1)$$

where

$$G_o = \lim_{n\to 0}\frac{1}{n}\Big[-\sum_{a<b} q_{ab}\hat q_{ab} - \sum_a R_a \hat R_a + \frac{1}{N}\sum_{j=1}^{N}\ln\int\prod_a d\chi(J_j^a)\, \exp\Big(\sum_a \hat R_a B_j J_j^a + \frac{1}{2}\sum_{a\neq b}\hat q_{ab} J_j^a J_j^b\Big)\Big], \qquad (2)$$

$$G_r = \lim_{n\to 0}\frac{1}{n}\ln\int Dt_2 \int\Big(\prod_a \frac{d\lambda_a\, d\hat\lambda_a}{2\pi}\Big) \exp\Big(-\beta\sum_a V(\lambda_a,t_2)\Big) \exp\Big(i\sum_a \hat\lambda_a(\lambda_a - R_a t_2) - \frac{1}{2}\sum_{a,b}(q_{ab}-R_aR_b)\hat\lambda_a\hat\lambda_b\Big). \qquad (3)$$

The interpretation of $R$ and $q$ at the extremum justifies the choice of method to study this problem. This is so since $R = \langle \mathbf{J}\cdot\mathbf{B}/N\rangle$, where the average is both thermal and over the disorder, is directly related to the generalization error $\epsilon_G$ in the case of equivalent replicas. For the uniform distribution $\epsilon_G = \arccos(R)/\pi$. The usual order parameters $q_{ab} = \langle \mathbf{J}^a\cdot\mathbf{J}^b/N\rangle$ describe the typical overlap between two students learning from the same data. The weight measure $d\chi(J_j^a)$ defines the particular problem. In this case $d\chi(J_j^a)$ is the counting measure over $J_j^a = \pm 1$.

Two related questions are discussed in this paper. What is the largest value that the overlap $R$ can reach for fixed $\alpha$? That is, what are the upper bounds for the generalization ability? Note that this is different in spirit from a Bayesian approach, since here we ask for best performances within the constraints of a given architecture. This immediately leads to the second question: What energy function $V$ should be chosen in order to achieve those bounds? The same questions have been answered satisfactorily for this machine with real weights, the first by Opper and Haussler [24] and the second by Kinouchi and Caticha [22]. In trying to answer them we have to go beyond the permutation replica symmetry that holds in that case, which we now analyze.

B. Replica symmetric analysis

It is reasonable to begin by assuming that the solution of the extremization problem is replica symmetric. The zero temperature limit can be taken by noticing that an optimal potential will most likely have a unique minimum, and therefore it is natural to take the $q\to 1$ limit [25]. A sensible limit can be obtained by requiring that $x\equiv\beta(1-q)$, $y\equiv\hat q/\beta^2$, and $v\equiv\hat R/\beta$ remain finite as $\beta\to\infty$. The free energy is then the extremum of

$$f_{RS} = \frac{1}{2}xy + \{R-[1-2H(h)]\}v - \sqrt{\frac{2y}{\pi}}\,e^{-h^2/2} + 2\alpha\int Dt\, H\Big(\frac{-tR}{\sqrt{1-R^2}}\Big)\Big(V(\lambda_o)+\frac{(\lambda_o-t)^2}{2x}\Big),$$

where $h=v/\sqrt{y}$, $Dt=e^{-t^2/2}\,dt/\sqrt{2\pi}$, $H(x)=\int_x^\infty Dt$, and $\lambda_o(t)$ is defined as the minimum of $V(\lambda)+(\lambda-t)^2/2x$:

$$-x\Big(\frac{\partial V(\lambda)}{\partial\lambda}\Big)_{\lambda_o} = \lambda_o - t. \qquad (4)$$

The saddle point equations lead to

$$R = 1-2H(h), \qquad x = \sqrt{\frac{2}{\pi y}}\,e^{-h^2/2},$$
$$e^{-h^2} = \pi\alpha\int Dt\, H\Big(\frac{-tR}{\sqrt{1-R^2}}\Big)(\lambda_o-t)^2,$$
$$h\,e^{-h^2/2} = \alpha\int \frac{dt}{\sqrt{2\pi(1-R^2)}}\, e^{-t^2/2(1-R^2)}\,\lambda_o. \qquad (5)$$

Following [22], we obtain the result that, in order to maximize $R$, the learning potential should be chosen such that the RS modulation function $F_{RS}(t)\equiv\lambda_o-t$ is given by

$$F_{RS}(t) = C\,\frac{e^{-(Rt)^2/2(1-R^2)}}{H(-Rt/\sqrt{1-R^2})}, \qquad (6)$$

which can be seen to be very similar to the optimal modulation function for on-line learning in the real-weights perceptron, a most interesting result from a cavity perspective [26,27]. It is easy to see that

$$C = \frac{e^{-h^2/2}}{\pi h\sqrt{1-R^2}} \qquad (7)$$

is a function of $R$ but not of $t$. From Eq. (6) we can determine the potential $V$, since $F_{RS}=-x\,\partial V/\partial\lambda_o$. Thus

$$V(\lambda) = E(t) - \frac{1}{2x}F_{RS}^2(t), \qquad (8)$$

where $E(t) = -[\sqrt{2\pi(1-R^2)}\,C/(Rx)]\ln H(-Rt/\sqrt{1-R^2})$ is proportional to the on-line optimal energy function. It resembles the equivalent off-line potential for the real-weights perceptron.
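The RS saddle point can be reduced to a single fixed-point equation for h. Substituting Eqs. (6) and (7) into the third of Eqs. (5) gives h² = α I(R)/[π(1−R²)] with I(R) = ∫Dt e^{−(Rt)²/(1−R²)}/H(−Rt/√(1−R²)); the fourth of Eqs. (5) is then satisfied automatically. Below is a minimal sketch (Python with SciPy) of this reduction; the iteration scheme and starting point are our own choices, not from the paper, and convergence to the low generalization branch is assumed for small α.

```python
import numpy as np
from scipy.integrate import quad
from scipy.special import erfc

H = lambda x: 0.5 * erfc(x / np.sqrt(2.0))       # H(x) = int_x^inf Dt

def I_of_R(R):
    """I(R) = int Dt exp(-(Rt)^2/(1-R^2)) / H(-Rt/sqrt(1-R^2))."""
    w = R / np.sqrt(1.0 - R**2)
    f = lambda t: (np.exp(-t**2 / 2.0) / np.sqrt(2.0 * np.pi)
                   * np.exp(-(w * t)**2) / H(-w * t))
    return quad(f, -8.0, 8.0)[0]

def rs_overlap(alpha, iters=200):
    """Fixed-point iteration of the RS saddle point, Eqs. (5)-(7):
    R = 1 - 2H(h),  h^2 = alpha * I(R) / (pi * (1 - R^2))."""
    h = 0.1
    for _ in range(iters):
        R = 1.0 - 2.0 * H(h)
        h = np.sqrt(alpha * I_of_R(R) / (np.pi * (1.0 - R**2)))
    return 1.0 - 2.0 * H(h)

for alpha in (0.1, 0.3, 0.5):
    R = rs_overlap(alpha)
    print(alpha, R, np.arccos(R) / np.pi)         # generalization error eps_G
```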
The post-training stabilities, however, may not all be positive, i.e., $\lambda_o$ can be negative. This means that the algorithm would be called inconsistent, since after training there still may be memorization errors. The learning curves (see Fig. 2 below) are obtained by solving the saddle point equations numerically. By comparing the free energy of the phase with incomplete learning, $R<1$, with that for $R=1$, a first order phase transition at $\alpha_c = 0.53$ is found. As mentioned earlier, this is unacceptable, and the cause of this unphysical result is the replica symmetric ansatz. A stability analysis [28] confirms this suspicion, and we now turn to a one-step replica symmetry breaking ansatz.

C. RSB-1 variational method

We proceed in the by now standard way [1]. Let $q_{ab}$ be an $l\times l$ block matrix and call the block indices $m,n = 1,2,\ldots,l = n/m$. For replicas within the same block, the overlaps are taken to have the same value $q_1$, while replicas belonging to different blocks have overlap $q_o$. The replicas are assumed to be equivalent and thus $R_a = r$ for all replicas. We do not use the same letter $R$, to make explicit that it will have a different value from that of the previous section. The conjugated parameters (with carets) are assumed to have the same structure.

The entropic and energetic contributions to the free energy are

$$G_o = \frac{1}{2}[m q_o\hat q_o + (1-m)q_1\hat q_1 - \hat q_1] - r\hat r + \frac{1}{m}\int Dz_o \ln \int Dz_1 \big[2\cosh(\sqrt{\hat q_o}\, z_o + \sqrt{\hat q_1 - \hat q_o}\, z_1 + \hat r)\big]^m, \qquad (9)$$

$$G_r = -\frac{2}{m}\int_{-\infty}^{\infty} Dt_1 \int_0^{\infty} Dt_2 \ln \int_{-\infty}^{\infty} Dt_o \left\{\int \frac{d\lambda}{\sqrt{2\pi(1-q_1)}} \exp\left[-\beta\left(V(\lambda) + \frac{(\lambda - t)^2}{2x_1}\right)\right]\right\}^m, \qquad (10)$$

where $t \equiv \sqrt{q_o - r^2}\, t_1 + r t_2 + \sqrt{q_1 - q_o}\, t_o$.

We are interested in the zero temperature case. A sensible limit can be found by requiring that within a valley the optimal potential has a unique minimum and that different valleys' minima have the same overlap. To achieve this we make the ansatz $\beta\to\infty$, $m\to 0$, $q_1\to 1$, and $\hat q_o$ and $\hat R\to\infty$, such that $x_o\equiv\beta m$, $x_1\equiv\beta(1-q_1)$, $\hat Q_o\equiv m^2\hat q_o$, and $\hat R\equiv m\hat r$ remain finite. To take the limit [29,30] we start with the entropic part

$$G_o = \frac{1}{m}\left[-\frac{(1-q_o)\hat Q_o}{2} - r\hat R + \int Dz_o \ln 2\cosh(\sqrt{\hat Q_o}\, z_o + \hat R)\right] \qquad (11)$$

and the energy part

$$G_r = -\frac{2}{m}\int_{-\infty}^{\infty} Dt_1 \int_0^{\infty} Dt_2 \ln \int_{-\infty}^{\infty} Dt_o\, e^{-\beta m[V(\lambda_o) + (\lambda_o - t)^2/2x_1]}, \qquad (12)$$

where the $\lambda$ integral is done by a saddle point method and $\lambda_o$ is such that

$$\frac{\partial V}{\partial\lambda} + \frac{\lambda - t}{x_1} = 0 \quad\text{for } \lambda = \lambda_o. \qquad (13)$$

It follows that

$$-x_o f = -\frac{(1-q_o)\hat Q_o}{2} - r\hat R + \int Dz_o \ln 2\cosh(\sqrt{\hat Q_o}\, z_o + \hat R) + 2\alpha\int_{-\infty}^{\infty} Dt_1 \int_0^{\infty} Dt_2 \ln\int_{-\infty}^{\infty} Dt_o\, e^{-x_o[V(\lambda_o) + (\lambda_o - t)^2/2x_1]}. \qquad (14)$$

The order parameters of the incomplete generalization phase are obtained by solving the saddle point equations:

$$\frac{\partial f}{\partial\hat R} = 0 \Rightarrow r = \int Dz_o \tanh(\sqrt{\hat Q_o}\, z_o + \hat R) \equiv \phi_1(\hat Q_o, \hat R), \qquad (15)$$

$$\frac{\partial f}{\partial\hat Q_o} = 0 \Rightarrow q_o = \int Dz_o \tanh^2(\sqrt{\hat Q_o}\, z_o + \hat R) \equiv \phi_2(\hat Q_o, \hat R), \qquad (16)$$

$$\frac{\partial f}{\partial r} = 0 \Rightarrow \hat R = 2\alpha\frac{x_o}{x_1}\int Du\, \frac{e^{-u^2/2\Gamma}}{\sqrt{2\pi(1-g^2)}}\,\langle\lambda_o - t\rangle_{t_o}, \qquad (17)$$

$$\frac{\partial f}{\partial q_o} = 0 \Rightarrow \hat Q_o = 2\alpha\left(\frac{x_o}{x_1}\right)^2\int Du\, H\left(\frac{-u}{\sqrt{\Gamma}}\right)\langle\lambda_o - t\rangle^2_{t_o}, \qquad (18)$$

where $g\equiv r/\sqrt{q_o}$, $\Gamma\equiv(1-g^2)/g^2$, and

$$\langle(\cdots)\rangle_{t_o} \equiv \frac{\displaystyle\int Dt_o\, e^{-x_o[V(\lambda_o)+(\lambda_o-t)^2/2x_1]}\,(\cdots)}{\displaystyle\int Dt_o\, e^{-x_o[V(\lambda_o)+(\lambda_o-t)^2/2x_1]}}. \qquad (19)$$

The orthogonal transformation $(t_1,t_2)\to(u,v)$ defined by $u = (\sqrt{q_o-r^2}\, t_1 + r t_2)/\sqrt{q_o}$, $v = (-r t_1 + \sqrt{q_o-r^2}\, t_2)/\sqrt{q_o}$, and $t = \sqrt{q_o}\, u + \sqrt{1-q_o}\, t_o$ was used to simplify Eq. (18).

Introducing an effective modulation function

$$F(u) = \langle\lambda_o - t\rangle_{t_o}, \qquad (20)$$

the saddle point equations can be written as

$$\hat R = 2\alpha\frac{x_o}{x_1}\int Du\, \frac{e^{-u^2/2\Gamma}}{\sqrt{2\pi(1-g^2)}}\,F(u),$$
$$\hat Q_o = 2\alpha\left(\frac{x_o}{x_1}\right)^2\int Du\, H\left(\frac{-u}{\sqrt{\Gamma}}\right)F^2(u).$$
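Given any candidate modulation function F(u), the two integrals above are one dimensional and can be evaluated by Gauss-Hermite quadrature. A minimal sketch follows (Python/NumPy with SciPy); the constants g, c, α, and x_o/x_1 in the example call are illustrative placeholders of ours, not values from a saddle point solution.

```python
import numpy as np
from scipy.special import erfc

H = lambda x: 0.5 * erfc(x / np.sqrt(2.0))

# probabilists' Gauss-Hermite rule: integrates f(u) against exp(-u^2/2)
nodes, weights = np.polynomial.hermite_e.hermegauss(80)

def conjugate_parameters(F, alpha, g, xo_over_x1):
    """Evaluate the rewritten saddle point equations for a given F(u):
    Rhat  = 2 alpha (xo/x1)   int Du exp(-u^2/2 Gamma)/sqrt(2 pi (1-g^2)) F(u),
    Qohat = 2 alpha (xo/x1)^2 int Du H(-u/sqrt(Gamma)) F(u)^2."""
    Gamma = (1.0 - g**2) / g**2
    Fu = F(nodes)
    du = weights / np.sqrt(2.0 * np.pi)            # the Du measure
    Rhat = 2.0 * alpha * xo_over_x1 * np.sum(
        du * np.exp(-nodes**2 / (2.0 * Gamma))
        / np.sqrt(2.0 * np.pi * (1.0 - g**2)) * Fu)
    Qohat = 2.0 * alpha * xo_over_x1**2 * np.sum(
        du * H(-nodes / np.sqrt(Gamma)) * Fu**2)
    return Rhat, Qohat

# example: the Gaussian-over-H form of Eq. (24) below, with an
# illustrative constant c standing in for (x1/xo) k / (2 sqrt(2 pi (1-g^2)))
g, c = 0.6, 1.0
Gamma = (1.0 - g**2) / g**2
F24 = lambda u: c * np.exp(-u**2 / (2.0 * Gamma)) / H(-u / np.sqrt(Gamma))
print(conjugate_parameters(F24, alpha=0.5, g=g, xo_over_x1=1.0))
```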
We can now determine the $F(u)$ that maximizes, for a fixed $\alpha$, the overlap $r$. This is somewhat harder than in the real-weights case, where the solution to the variational problem is obtained very easily, just by inspection plus knowledge of the equivalent solution for the on-line problem. Nevertheless, the solution is very similar. So we look at the variations of Eq. (15) with respect to $F(u)$, subject to the constraint (16). Let $\xi$ be a Lagrange multiplier (see also [31]). $F(u)$ can be determined by maximizing

$$\mathcal{F} = \phi_1(\hat Q_o[q_o,r,F], \hat R[q_o,r,F]) - \xi\,\phi_2(\hat Q_o[q_o,r,F], \hat R[q_o,r,F]), \qquad (21)$$

which yields

$$\frac{d\hat Q_o}{dF} = -\frac{\partial\phi_1/\partial\hat R - \xi\,\partial\phi_2/\partial\hat R}{\partial\phi_1/\partial\hat Q_o - \xi\,\partial\phi_2/\partial\hat Q_o}\,\frac{d\hat R}{dF} \equiv k\,\frac{d\hat R}{dF}, \qquad (22)$$

and, by using the explicit forms of $\hat R$ and $\hat Q_o$ and the definitions (15) and (16),

$$\int Du\, H\left(\frac{-u}{\sqrt{\Gamma}}\right)F(u) = \frac{x_1}{x_o}\frac{k}{2}\int Du\,\frac{e^{-u^2/2\Gamma}}{\sqrt{2\pi(1-g^2)}}. \qquad (23)$$

This does not determine $F(u)$ uniquely. A possible solution is

$$F(u) \equiv \langle\lambda_o - t\rangle_{t_o} = \frac{1}{2}\,\frac{x_1}{x_o}\,\frac{k}{\sqrt{2\pi(1-g^2)}}\,\frac{e^{-u^2/2\Gamma}}{H(-u/\sqrt{\Gamma})}. \qquad (24)$$

Any odd function of $u$ could be added to the Gaussian in the numerator, but we choose to look at this form since, as it stands, it is proportional to the optimal modulation function that appears in off-line and on-line optimization of the real-weights perceptron, and we see no physical motivation for other terms. This choice (24) leads to the saddle point equations

$$\hat R = \sqrt{\frac{2}{\pi}}\,\alpha\kappa\, I_o(g), \qquad (25)$$

$$\hat Q_o = 2\alpha\kappa^2\sqrt{1-g^2}\,I_o(g), \qquad (26)$$

where $\kappa\equiv k/[2\sqrt{2\pi(1-g^2)}]$, $I_o(g)\equiv\int Dv\, g(-gv)$, $g\equiv r/\sqrt{q_o}$, and $g(x)\equiv e^{-x^2/2}/H(x)$. We can still choose $\kappa$, and we do so in order to pick the largest possible $r$ for fixed $\alpha$. (Fig. 1 shows $\alpha$ as a function of $\kappa_{var}$, the value of $\kappa$ that leads to the smallest generalization error.) Note that this liberty is due to the fact that the value of $q_o$ is not constrained; only its form is constrained by Eq. (16).

FIG. 1. The value of $\alpha$ as a function of the hyperparameter $\kappa$ at which the low generalization phase performance is optimal (thick line). For each fixed $\kappa$, we show also the value of $\alpha_s$, the spinodal point (short-long dash), where the low generalization phase ceases to exist, and of $\alpha_c$, the first order transition point (long dash), above which the perfect generalization phase has the lowest energy.

The learning curves of the low generalization phase can be determined numerically [see Figs. 2(a) and 2(b)]. In Fig. 3 the overlap $q_o$ is shown as a function of $\alpha$ for the best choice of the potential, i.e., using $\kappa_{var}$. The determination of the thermodynamic phase, that is, the location of the first order transition, needs an analysis of the free energy in both the high and low generalization phases. While the free energy of the low generalization phase can be determined without explicit knowledge of the potential, that of the high generalization phase will require detailed knowledge of the potential. In the next section we show how this can be done.

III. THE FREE ENERGY AND THE POTENTIAL

A. The low generalization phase

To complete the analysis of the learning curve we must look into the behavior of the free energy. This is not straightforward, since the form of the potential has not yet been determined, but only the expected value of $\partial V/\partial\lambda$. It is quite interesting, as we now show, that the precise form of the potential is not needed to determine the free energy or the learning curve in this phase; just knowledge of the effective modulation function suffices, and even this is the same for any solution of Eq. (23).

The energy contribution to the free energy can be written as

$$-x_o f_1 = 2\alpha\int Du\, H\left(\frac{-u}{\sqrt{\Gamma}}\right)\ln\int Dt_o \exp\left[-x_o\left(V(\lambda_o) + \frac{(\lambda_o - t)^2}{2x_1}\right)\right],$$

provided the order parameters are understood as the solution of the saddle point equations. Integrating by parts, using the definitions of $\lambda_o$ and $F(u)$,

$$-x_o f_1 = 2\alpha\frac{x_o}{x_1}\sqrt{q_o}\int_{-\infty}^{\infty} du\, F(u)\int_{-\infty}^{u} Dy\, H\left(\frac{-y}{\sqrt{\Gamma}}\right),$$

substituting Eq. (24) for $F(u)$, and integrating by parts again,

$$-x_o f_1 = 2\alpha\kappa\sqrt{2\pi q_o\Gamma}\int_{-\infty}^{\infty} Du\, H\left(\frac{-u}{\sqrt{\Gamma}}\right)\ln H\left(\frac{-u}{\sqrt{\Gamma}}\right) \qquad (27)$$

is obtained.
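Equation (27) reduces the low phase energy term to a single Gaussian integral of H ln H. A minimal sketch of its evaluation (Python/NumPy with SciPy; the argument values in the example call are again illustrative, since in the complete calculation α fixes g, q_o, and κ through the saddle point):

```python
import numpy as np
from scipy.special import erfc

H = lambda x: 0.5 * erfc(x / np.sqrt(2.0))
nodes, weights = np.polynomial.hermite_e.hermegauss(120)

def energy_term_low_phase(alpha, kappa, g, qo):
    """Right-hand side of Eq. (27), i.e., -xo*f1:
    2 alpha kappa sqrt(2 pi qo Gamma) int Du H(-u/sqrt(Gamma)) ln H(-u/sqrt(Gamma))."""
    Gamma = (1.0 - g**2) / g**2
    hu = H(-nodes / np.sqrt(Gamma))
    integral = np.sum(weights / np.sqrt(2.0 * np.pi) * hu * np.log(hu))
    return 2.0 * alpha * kappa * np.sqrt(2.0 * np.pi * qo * Gamma) * integral

print(energy_term_low_phase(alpha=0.5, kappa=1.0, g=0.6, qo=0.5))  # illustrative
```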
FIG. 2. (a) Learning curves: generalization errors as functions of $\alpha$ for different algorithms. The Bayes (continuous line) and Gibbs (top short dashed curve) algorithm learning curves are included for comparison. Note that the Bayes algorithm is obtained by a student outside the binary space. BB is the best binary vector [18], obtained as the result of clipping the center of mass of the version space (Bayes). The learning curve (dots) obtained under the hypothesis of replica symmetry has a transition to perfect learning at the impossible value of $\alpha = 0.53$. The algorithm obtained with $\kappa = 2$ gives the Gibbs result. For $\kappa = 3$ and $\kappa = 6$ the curves show a trade-off between earlier performance and later transition. For variable $\kappa_{var}(\alpha)$ (Fig. 1) the best performance for potential based learning is obtained. See (b) for details. For small $\alpha$ the unphysical RS, the RSB-1 with $\kappa_{var}$, and the BB curves are numerically indistinguishable. (b) Details of the learning curves. The black continuous line is obtained for $\kappa_{var}(\alpha)$ (Fig. 1); it is the envelope, within the low generalization phase, of the family of algorithms obtained for $\kappa \geq 2$.

B. The high generalization phase

We now look at the value of the free energy at the extreme of the allowed interval for the order parameters. The determination of the potential cannot be postponed, since the free energy depends on it explicitly.

Equation (23), which results from the variational prescription, can be transformed into an integral equation for the effective potential $E(t)=V(\lambda_o)+(\lambda_o-t)^2/2x_1$:

$$\int Dt_o\, e^{-x_o E(t)} = \left[H\left(\frac{-u}{\sqrt{\Gamma}}\right)\right]^b, \qquad (28)$$

where $t = \sqrt{q_o}\, u + \sqrt{1-q_o}\, t_o$ and

$$b = \sqrt{\frac{\kappa^2 q_o}{4g^2}}. \qquad (29)$$

An expression for the potential, obtained in the Appendix, is

$$E(t) = -\frac{1}{x_o}\ln\int Dx\left[H\left(\frac{-t}{\sqrt{\Gamma q_o}} - ix\sqrt{\frac{1-q_o}{\Gamma q_o}}\right)\right]^b, \qquad (30)$$

where the effect of the one-step RSB is seen to be the introduction of a noiselike term in the stability $t$, which depends on the existence of the other valleys. Its influence goes to zero as $q_o\to 1$. Despite its appearance, this expression is real, as can be seen by defining

$$\mathcal{C}(x,t) = \int DK\,\Theta\left(K + \frac{t}{\sqrt{\Gamma q_o}}\right)\cos\left(xK\sqrt{\frac{1-q_o}{\Gamma q_o}}\right), \qquad (31)$$

$$\mathcal{S}(x,t) = \int DK\,\Theta\left(K + \frac{t}{\sqrt{\Gamma q_o}}\right)\sin\left(xK\sqrt{\frac{1-q_o}{\Gamma q_o}}\right). \qquad (32)$$

Finally we obtain

$$E(t) = -\frac{1}{x_o}\ln\int\frac{dx}{\sqrt{2\pi}}\exp\left[-\frac{1}{2}\left(1 - b\,\frac{1-q_o}{\Gamma q_o}\right)x^2\right]\big[\mathcal{C}^2(x,t) + \mathcal{S}^2(x,t)\big]^{b/2}\cos\left(b\tan^{-1}\frac{\mathcal{S}(x,t)}{\mathcal{C}(x,t)}\right), \qquad (33)$$

which can be used for numerical evaluations. It is easy to verify that the solution is real.

FIG. 3. $q_o$ and $r$ as functions of $\alpha$. Note that $q_o$ is different from 1 as soon as $\alpha\neq 0$, but is so close to 1 that the learning curves under the RS and RSB-1 methods [see Fig. 2(a)] are not very different for small $\alpha$.

At this point notice the structure of the potential that emerges from the calculation. If $q_o = 1$, then we are back to the replica symmetric calculation: the $x$ integral decouples and we are left with $-x_o E(t) = b\ln H(-t/\sqrt{\Gamma})$, which is very similar to the optimal potentials that have been found for the real-weights perceptron, both on line and off line, and for the binary perceptron without RSB. However, RSB introduces a new noiselike element.

One of the most striking features of this solution is that, like other variationally determined potentials, it depends on the values of the order parameters. This has been discussed elsewhere [32] and can be interpreted as the inclusion of the correct annealing along the learning process, generalizing the annealing of the learning rate that has been studied in on-line learning algorithms (e.g., [33,34,26]).
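Equations (31)-(33) can be evaluated by straightforward nested quadrature. A minimal sketch follows (Python with SciPy); the truncation limits and the parameter values in the example call are our own illustrative choices, and the x integral is assumed convergent, i.e., 1 − b(1−q_o)/(Γq_o) > 0.

```python
import numpy as np
from scipy.integrate import quad

def E_of_t(t, b, qo, Gamma, xo, xmax=10.0, kmax=8.0):
    """Effective potential E(t) of Eqs. (31)-(33). C and S are cosine
    and sine transforms of a truncated Gaussian (the theta function
    keeps K > -t/sqrt(Gamma qo)); the outer x integral is truncated
    at +-xmax, where the Gaussian factor is negligible."""
    a = np.sqrt((1.0 - qo) / (Gamma * qo))
    lo = -t / np.sqrt(Gamma * qo)
    gauss = lambda K: np.exp(-K**2 / 2.0) / np.sqrt(2.0 * np.pi)
    C = lambda x: quad(lambda K: gauss(K) * np.cos(a * x * K), lo, kmax)[0]
    S = lambda x: quad(lambda K: gauss(K) * np.sin(a * x * K), lo, kmax)[0]

    def integrand(x):
        c, s = C(x), S(x)
        return (np.exp(-0.5 * (1.0 - b * (1.0 - qo) / (Gamma * qo)) * x**2)
                * (c * c + s * s)**(b / 2.0)
                * np.cos(b * np.arctan2(s, c))) / np.sqrt(2.0 * np.pi)

    val = quad(integrand, -xmax, xmax, limit=200)[0]
    return -np.log(val) / xo

# illustrative numbers only; in the paper b, qo, Gamma, and xo all come
# from the saddle point solution at the corresponding alpha
print(E_of_t(t=0.5, b=2.0, qo=0.9, Gamma=1.0, xo=1.0))
```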
These order parameters play the role of hyperparameters and will have their values determined by the solution of the saddle point equations. The fact that these values may differ from those determined by the thermodynamics is of fundamental importance. The learning process thus proceeds in the following way. Determine self-consistently the potential (functional form and hyperparameters) that will lead to maximum performance in the low generalization phase, in such a manner that the respective values of the order parameters are equal to the hyperparameters. Then minimize the potential by some dynamical process, letting the temperature go to zero. We do not worry here about thermalization times, as these might diverge for an NP problem. The final result may land in the perfect generalization phase, and therefore the order parameters will not have the same values as the hyperparameters. We denote the hyperparameters by starred quantities. Then we look at the limit $q_o,q_1\to 1$, $\beta\to\infty$, $m\to 0$ of Eq. (14) with $E^*(t)$ calculated at the starred (saddle point equation) values, which gives

$$f = 2\alpha\int_0^{\infty} Dt\, E^*(t). \qquad (34)$$

C. Learning curves

Equations (15), (16), (25), and (26) are used to build the low generalization learning curves. From the comparison of the free energies, Eqs. (27) and (34), the phase transition is located. These equations have to be complemented with a choice of $\kappa$. We can just look at the numerical value of $\kappa$ that leads to the smallest generalization error for fixed $\alpha$, which we called $\kappa_{var}(\alpha)$ (Fig. 1). We look also at the results obtained for fixed $\kappa$. A Bayesian statistician will not be surprised that the Bayes bounds are not beaten. Neither is the (low phase) generalization error improved, nor the onset of the high generalization phase anticipated. These equations just lead in general to smaller errors than the Gibbs algorithm. However, for small $\alpha$ the RS, the RSB-1 [$\kappa_{var}(\alpha)$], and the Bayesian best binary curves of Copelli et al. [18] are very close. At this point this agreement is only numerical, but it is possible that these algorithms have the exact same optimal (Bayesian) performance in the limit $\alpha\to 0$, which is similar to the results of [18] for unsupervised learning.

Figure 4 shows the potential [Eq. (33)] for different values of $b$. At $b = 1$ the potential turns into the error-counting potential, which gives the Gibbs performance. Then replica symmetry is restored. Below the value $b = 1$, the potential cannot be determined.
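The free energy of the perfect generalization branch, Eq. (34), is a single Gaussian integral of the starred potential, and the transition is located where it crosses the low phase free energy built from Eqs. (14) and (27). A minimal sketch (Python with SciPy); the stand-in E* in the example call is purely illustrative, since the actual E* comes from Eq. (33) evaluated at the starred values, e.g., with the E_of_t routine sketched above.

```python
import numpy as np
from scipy.integrate import quad

def f_perfect(alpha, Estar, tmax=8.0):
    """Free energy of the perfect generalization phase, Eq. (34):
    f = 2 alpha int_0^inf Dt E*(t). Estar must be the effective
    potential evaluated at the starred (hyperparameter) values."""
    integrand = lambda t: np.exp(-t**2 / 2.0) / np.sqrt(2.0 * np.pi) * Estar(t)
    return 2.0 * alpha * quad(integrand, 0.0, tmax)[0]

# the first order transition alpha_c is where this branch crosses the
# low phase free energy; here only a stand-in E* is used for illustration
print(f_perfect(0.5, Estar=lambda t: np.exp(-t)))
```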
FIG. 4. The effective potential $E(t)$ as a function of the stability $t$, showing the annealing for different values of $\alpha$.

IV. DISCUSSION AND CONCLUSIONS

The learning curves in Fig. 2 show that the potentials obtained variationally fail to reach the Bayes bounds. This is in contrast to the continuous weight perceptron, where the Bayes limit is obtained by a network with the same architecture as the teacher. As shown by Copelli et al. [18], the Bayes algorithm is equivalent to a network with a weight vector given by the center of mass of the version space, which is not itself a binary vector. It follows that no method constrained to the hypercube will reach the Bayes limit. A similar failure to reach the Bayes limit was also reported by Winther et al. [35] for multilayer networks, where again the Bayes algorithm cannot be matched within the space of students with the same architecture as the teacher.

The variational method probes potentials from within a restricted class, and it is therefore natural not to expect to find Bayesian performances if the Bayes algorithm does not belong to it. The performance of Bayesian inference restricted to binary vectors is saturated only for small $\alpha$, but then even the simple Hebb algorithm has optimal generalization in that regime. Within the low generalization phase the variational method is able to identify a class of potentials that lead to better performance than the Gibbs algorithm. There is a trade-off within the class, as can be seen in Figs. 2(a) and 2(b), between an earlier transition to perfect generalization and better performance: a potential that leads to better generalization will have a delayed transition.

The potentials of this class do not work by imposing zero memorization error, not even by minimization of the total error count. They tend to minimize the average error with respect to the teacher, and since the teacher typically lies near the border of the version space (just by the geometry of the N-dimensional space), the approach to the teacher weight vector can be made through vectors outside the version space. The version space collapses to only one element at typically $\alpha\sim 1.25$, but the variational class, not considering the version space, will not detect this until later.

The replica symmetric variational study incorrectly predicts a transition at $\alpha < 1$. A stability analysis indicates that the replica symmetric ansatz is not adequate for all $\alpha > 0$. This is a little surprising, and the origin of the instability can be traced back to the requirement that $\beta\to\infty$ implies $q\to 1$ for $\alpha\neq 0$. This is in contrast with the case of real weights where, even in the presence of multiplicative noise, learning with the variational potential is replica symmetric. The effect of noise can break the phase space into disconnected regions, which are essentially ignored by the robust learning of the optimal potential, which disregards outliers. A method that insists on minimizing the memorization error will certainly lead in such conditions to a replica symmetry breaking situation.

The one-step broken replica symmetry leads to apparently consistent physical behavior. The stability analysis will be presented elsewhere [28]. A preliminary picture that emerges from such analysis is that, while one-step RSB is enough to give locally stable results and suggests a reasonable physical picture, it may fail to be globally correct. We think that the RSB-1 calculation describes qualitatively the main features of this difficult problem, but a full continuous RSB scheme [28] will be necessary to understand the thermodynamic equilibrium bounds obtainable from the minimization of a potential. Even this will not tell the complete story, however, since issues dealing with effective learning times will still remain.

This work, together with the results of [16-18,13], suggests that in the optimization of computationally hard discrete problems it might be a better strategy to first leave the space of configurations, in this case the vertices of the hypercube, then optimize in the hypersphere, and finally go back to the original space, instead of striving to respect the discreteness constraints at every step.

ACKNOWLEDGMENTS

The authors thank M. Copelli and O. Kinouchi for important discussions and suggestions, and also the participants of the Workshop on Statistical Mechanics, Max Planck Institute, Dresden (1999), for discussions on this subject. E.B. and N.C.
were partially sponsored by the Conselho Nacional de Desenvolvimento Científico e Tecnológico (CNPq) and C.R.M. by the Fundação para o Desenvolvimento da Unesp (FUNDUNESP).

APPENDIX

Equation (23) leads to an integral equation for the potential:

$$\int Dt_o \exp[-x_o E(t)] = \left[H\left(\frac{-u}{\sqrt{\Gamma}}\right)\right]^b \equiv A(u), \qquad (A1)$$

where $t = \sqrt{q_o}\, u + \sqrt{1-q_o}\, t_o$, $b = \sqrt{\kappa^2 q_o/4g^2}$, and $Dt_o = (dt_o/\sqrt{2\pi})\, e^{-t_o^2/2}$. To obtain the effective potential, note that the integral on the left side is a convolution, so it seems natural to perform a Fourier transformation. The fact that $A(u)$ is not square integrable is bypassed by defining a new problem:

$$\int Dt_o \exp[-x_o E_\xi(t)] = \left[H_\xi\left(\frac{-u}{\sqrt{\Gamma}}\right)\right]^b \equiv A_\xi(u). \qquad (A2)$$

Here

$$H_\xi\left(\frac{-u}{\sqrt{\Gamma}}\right) = \int_{-\infty}^{\infty} Dy_i\,\Theta_\xi\left(y_i + \frac{u}{\sqrt{\Gamma}}\right),$$

where

$$\Theta_\xi\left(y + \frac{u}{\sqrt{\Gamma}}\right) = \exp\left[-\xi\left(y + \frac{u}{\sqrt{\Gamma}}\right)^2\right]$$

if $y > -u/\sqrt{\Gamma}$ and 0 otherwise. Once $E_\xi(t)$ is found, we will take the regularizing parameter $\xi$ to zero. After Fourier transforming, dividing by the Gaussian on the left, and Fourier transforming back, we get

$$\mathcal{E}_\xi \equiv \exp[-x_o E_\xi(t)] = \sqrt{q_o}\int\frac{dk}{2\pi}\left(e^{(1-q_o)k^2/2}\int e^{i\sqrt{q_o}\,ku}A_\xi(u)\,du\right)e^{-ikt}. \qquad (A3)$$

To be able to perform the above integrals we use a simple replica trick, which consists in considering $b$ as an integer:

$$\left[H_\xi\left(\frac{-u}{\sqrt{\Gamma}}\right)\right]^b = \prod_{i=1}^{b}\int_{-u/\sqrt{\Gamma}}^{\infty} Dy_i\exp\left[-\xi\left(y_i + \frac{u}{\sqrt{\Gamma}}\right)^2\right] = \prod_{i=1}^{b}\int_{-\infty}^{\infty} Dy_i\,\Theta_\xi\left(y_i + \frac{u}{\sqrt{\Gamma}}\right). \qquad (A4)$$

Introduce the Fourier transforms

$$\Theta_\xi(y_i - x) = \int f_\xi(r_i)\,e^{ir_i(y_i - x)}\,dr_i, \qquad e^{-y_i^2/2} = \int g(\omega_i)\,e^{-i\omega_i y_i}\,d\omega_i.$$

The Fourier transform of $A_\xi(u)$ is

$$\int e^{i\sqrt{q_o}\,ku}\left[H_\xi\left(\frac{-u}{\sqrt{\Gamma}}\right)\right]^b du = \int e^{i\sqrt{q_o}\,ku}\left\{\int\prod_{i=1}^{b}\left(\frac{dy_i}{\sqrt{2\pi}}\,dr_i\,d\omega_i\, f_\xi(r_i)\,g(\omega_i)\right)\exp\left[i\sum r_i\left(y_i + \frac{u}{\sqrt{\Gamma}}\right) - i\sum\omega_i y_i\right]\right\}du.$$

The $u$ and $y_i$ integrals are now automatic. Going back to the equation for the potential,

$$\mathcal{E}_\xi = \int\frac{\sqrt{q_o}\,dk}{(2\pi)^{b/2}}\,e^{(1-q_o)k^2/2 - ikt}\int_{-\infty}^{\infty}\left(\prod_{i=1}^{b}dr_i\,f_\xi(r_i)\,g(r_i)\right)\delta\left(\sqrt{q_o}\,k + \frac{\sum r_i}{\sqrt{\Gamma}}\right),$$

we now integrate over $k$:

$$\mathcal{E}_\xi = \int_{-\infty}^{\infty}\left(\prod_{i=1}^{b}\sqrt{2\pi}\,dr_i\,f_\xi(r_i)\,g(r_i)\right)\exp\left[\frac{1}{2}\frac{1-q_o}{\Gamma q_o}\Big(\sum r_i\Big)^2\right]\exp\left(-i\,\frac{\sum r_i}{\sqrt{\Gamma q_o}}\,t\right).$$

The linearization of the $(\sum r_i)^2$ term is done by a standard trick. Use the fact that $g(r)$ is a Gaussian and that $f_\xi(r) = (1/2\pi)\int\Theta_\xi(y)\,e^{-iyr}\,dy$; then

$$\mathcal{E}_\xi = \int\frac{dx}{\sqrt{2\pi}}\,e^{-x^2/2}\left(\int\frac{dy}{\sqrt{2\pi}}\,\Theta_\xi(y)\int_{-\infty}^{\infty}dr\,e^{-r^2/2}\,e^{-x\sqrt{(1-q_o)/\Gamma q_o}\,r}\,e^{-iKr}\right)^b,$$

where $K = t/\sqrt{\Gamma q_o} + y$. Integrating over $r$,

$$\mathcal{E}_\xi = \int\frac{dx}{\sqrt{2\pi}}\,e^{-x^2/2}\left[\int\frac{dy}{\sqrt{2\pi}}\,\Theta_\xi(y)\exp\frac{1}{2}\left(\frac{1-q_o}{\Gamma q_o}\,x^2 - K^2 + 2ixK\sqrt{\frac{1-q_o}{\Gamma q_o}}\right)\right]^b,$$

and $b$ can again be taken to be real. Changing the integration variable to $K$ and taking $\xi$ to zero,

$$\mathcal{E}_0 = \int\frac{dx}{\sqrt{2\pi}}\exp\left[-\frac{1}{2}\left(1 - \frac{1-q_o}{\Gamma q_o}\,b\right)x^2\right]\left[\int DK\,\Theta\left(K + \frac{t}{\sqrt{\Gamma q_o}}\right)\exp\left(ixK\sqrt{\frac{1-q_o}{\Gamma q_o}}\right)\right]^b.$$

Extending the definition of the $H$ function to complex arguments, the potential can be written formally as Eq. (30).

[1] M. Mézard, G. Parisi, and M. Virasoro, Spin Glass Theory and Beyond (World Scientific, Singapore, 1987).
[2] T. L. H. Watkin, A. Rau, and M. Biehl, Rev. Mod. Phys. 65, 499 (1993).
[3] M. Opper and W. Kinzel, in Models of Neural Networks III, edited by E. Domany, J. L. van Hemmen, and K. Schulten (Springer-Verlag, Berlin, 1996).
[4] A. Engel and C. Van den Broeck, Statistical Mechanics of Learning (Cambridge University Press, Cambridge, 2000).
[5] L. Pitt and L. Valiant, J. Assoc. Comput. Mach. 35, 965 (1988).
[6] E. Gardner and B. Derrida, J. Phys. A 21, 271 (1988).
[7] G. Györgyi, Phys. Rev. Lett. 64, 2957 (1990).
[8] D. Haussler, M. Kearns, H. S. Seung, and N. Tishby, Mach. Learning 25, 195 (1996).
[9] H. Horner, Z. Phys. B: Condens. Matter 86, 291 (1992); 87, 371 (1992).
[10] M. Golea and M. Marchand, J. Phys. A 26, 5751 (1993).
[11] C. Van den Broeck and M. Bouten, Europhys. Lett. 22, 223 (1993).
[12] D. Bollé and G. M. Shim, Network: Comput. Neural Syst. 6, 619 (1995).
[13] L. Reimers, M. Bouten, and B. Van Rompaey, J. Phys. A 29, 6247 (1996).
[14] M. Bouten, L. Reimers, and B. Van Rompaey, Phys. Rev. E 58, 2378 (1998).
[15] R. W. Penney and D. Sherrington, J. Phys. A 26, 6173 (1993).
[16] M. Copelli and C. Van den Broeck, Phys. Rev. E 61, 6971 (2000).
[17] M. Copelli, M. Bouten, C. Van den Broeck, and B. Van Rompaey, Europhys. Lett. 47, 139 (1999).
[18] M. Copelli, C. Van den Broeck, and M. Opper, J. Phys. A 32, L555 (1999).
[19] W. Kinzel and R. Urbanczik, J. Phys. A 31, L27 (1998).
[20] S. Solla and O. Winther, in On-Line Learning in Neural Networks, edited by D. Saad (Cambridge University Press, Cambridge, 1998).
[21] M. Opper, Phys. Rev. Lett. 77, 4671 (1996); in On-Line Learning in Neural Networks (Ref. [20]).
[22] O. Kinouchi and N. Caticha, Phys. Rev. E 54, R54 (1996).
[23] G. J. Bex, R. Serneels, and C. Van den Broeck, Phys. Rev. E 51, 6309 (1995).
[24] M. Opper and D. Haussler, Phys. Rev. Lett. 66, 2677 (1991); in Proceedings of the IVth Annual Workshop on Computational Learning Theory (COLT91) (Morgan Kaufmann, San Mateo, CA, 1991).
[25] M. Bouten, J. Schietse, and C. Van den Broeck, Phys. Rev. E 52, 1958 (1995).
[26] O. Kinouchi and N. Caticha, J. Phys. A 25, 6243 (1992).
[27] N. Caticha and E. Araújo de Oliveira, Universidade de São Paulo report (unpublished).
[28] E. Botelho, C. Mattos, and N. Caticha (unpublished).
[29] W. Krauth and M. Mézard, J. Phys. (Paris) 50, 3057 (1989).
[30] H. S. Seung, H. Sompolinsky, and N. Tishby, Phys. Rev. A 45, 6056 (1992).
[31] A. Buhot, J. M. Torres Moreno, and M. B. Gordon, Phys. Rev. E 55, 7434 (1997).
[32] N. Caticha and O. Kinouchi, Philos. Mag. B 77, 1565 (1998).
[33] S. Amari, Neural Networks 6, 161 (1993).
[34] N. Barkai, H. S. Seung, and H. Sompolinsky, Phys. Rev. Lett. 75, 1415 (1995).
[35] O. Winther, B. Lautrup, and J.-B. Zhang, Phys. Rev. E 55, 836 (1997).