
A mu_mu_lambda-ES using a comparison-based surrogate (GP)

Overview

  • Performance for dark art

    | strategy | linear | quadratic | cubic | Schwefel | quartic |
    | --- | --- | --- | --- | --- | --- |
    | Baseline (Arash's) | 503 | 214 | 198 | 1503 | 1236 |
    | CSA, lambda=40 | 759 | 226 | 252 | \infty | 668 |
    | Emergency CSA, D=0.72 | 304 | 142 | 184 | 916.5 | 717.5 |
    | Emergency CSA, D=0.68 | 299 | 144 | 178 | 949 | 720 |

where D is the factor by which the step size is decreased when an emergency occurs.

  • Folders
    • Dark Art

      Establish a CSA with stopping in case of emergency

      • Performance is quite good
      • The thesis focuses on this one
    • CSA

      Implement a CSA for small lambda

      • Step size adaptation: CSA
      • Offspring evaluation: GP
      • Parent selection strategy: centroid of best mu offspring
    • CSA_largeLambda

      Implement a CSA for large lambda (to be continued)

      • Step size adaptation: CSA
      • Offspring evaluation: GP
      • Parent selection strategy: centroid of best mu offspring
    • GP

      Plot fep(x) and f(x) over number of iterations

      • Step size adaptation: CSA
      • Offspring evaluation: GP
      • Parent selection strategy: centroid of best mu offspring
    • choose_centroid_bestSoFar

      Plot fep(x) and f(x) over number of iterations

      • Step size adaptation: CSA
      • Offspring evaluation: GP
      • Parent selection strategy: best centroid of all iterations
    • compare_result

      • Plot the following over objective function evaluations

        • objective function value
        • step size
        • normalized step size
        • relative model error
      • Plot histogram for the following

        • objective function evaluations
        • convergence rate
        • success rate
      • Previous attempts

        Use a combination of mu and lambda; plot f(x), sigma, and sigmaStar for 5 test functions

        • Step size adaptation: CSA
        • Offspring evaluation: GP
        • Parent selection strategy: centroid of best mu offspring
    • convergence rate_noise

      Plot the convergence rate for different noise-to-signal ratios and dimensionalities, with the expected convergence rate overlaid. Includes code to integrate the expected convergence rate.

      • Step size adaptation: sigmaStar
      • Offspring evaluation: noise estimate
      • Parent selection strategy: centroid of best mu offspring
    • convergence_sigmaStar_GP

      Use different strategies and plot the convergence rate over sigmaStar. Conclusion: choosing the centroid is the best, especially with large lambda.

      • Step size adaptation: sigmaStar
      • Offspring evaluation: GP
      • Parent selection strategy:
        1. centroid (centroid of the best mu offspring ranked by GP)
        2. bestOfMu (best of the mu offspring ranked by GP)
        3. bestSoFar (bestOfMu in each iteration; replace y iff f(current) < f(previous))
        4. bestCentroid (replace the previous centroid iff f(current) < f(previous))
        5. centroidAddTrain=1 (centroid as parent; add the best offspring ranked by GP to the training set per iteration)
        6. centroidQuadratic_withGP (fit a quadratic model for z_centroid and choose the lowest point; pos and neg evaluated by GP)
        7. centroidQuadratic_trueObjFun (fit a quadratic model for z_centroid and choose the lowest point; pos and neg evaluated by the true objective function)

      Changing the distance norm in the GP (l-0.5 norm and l-3 norm) failed.

    • mu_sigma_experiment

      Tuned TRAINING_SIZE with sigmaStar/CSA and collected some results. Possible ways to improve performance: 1. update the GP model; 2. tune the training size of the GP model. Best result using a (3/3,10)-ES with sigmaStar = 1 or 2: relative speed-ups of 1.7, 1.3, and 1.11 for the linear, quadratic, and cubic spheres respectively (at sigmaStar = 1). Performance degrades as sigmaStar grows, though it appears more invariant to the growth of sigmaStar for larger lambda.

      • tuneTrainSize_sigmaStar_GP
      • Step size adaptation: sigmaStar
      • tuneTrainSize_CSA_GP
      • Step size adaptation: CSA
    • convergence_lengthScale_GP

      Plot the convergence rate over theta (fun_c_over_theta.m and run_c_over_theta.m) for sigmaStar = 1:0.5:3 and lambda = [30,50,70]. The max convergence rate is roughly the same for the (7/7,30)-ES, with a relative 0.1 increment for the (12/12,50)-ES and (17/17,70)-ES.

    • step_size_adaptation

      Use various step size adaptation mechanisms to adapt the step size

      • CSA
        • Dirk's parameter
        • Niko's CMA tutorial slides
        • Niko's CMA tutorial paper
      • Result
        • E[\|N(0,I)\|] \approx \sqrt{n}(1 - 1/(4n) + 1/(21n^2)) (a Monte Carlo check follows the table below)
Max convergence rate, with the corresponding sigmaStar in parentheses:

| strategy | theta | linear (sigmaStar) | quadratic (sigmaStar) | cubic (sigmaStar) |
| --- | --- | --- | --- | --- |
| (7/7,30) | 8 | 0.48 | 1.02 | 0.811 |
| (7/7,30) | 8 | 0.52 (1.5) | | |
| (7/7,30) | 32 | | 1.18 (2.0) | |
| (7/7,30) | 2 | | | 0.89 (1.5, 2.0) |
| (12/12,50) | 8 | 0.53 | 1.10 | 0.808 |
| (12/12,50) | 32 | 0.57 (1.5) | | |
| (12/12,50) | 26 | | 1.2 (2.0, 2.5) | |
| (12/12,50) | 2 | | | 0.96 (2.5) |
| (17/17,70) | 8 | 0.54 | 1.07 | 0.799 |
| (12/12,50) | 16 | 0.62 (1.5) | | |
| (12/12,50) | 28 | | 1.25 (2.0) | |
| (12/12,50) | 2 | | | 0.96 (3.0) |
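A quick Monte Carlo check of the norm approximation above (a minimal sketch; the dimensionality and sample count are arbitrary choices):

n = 10;                                          % dimensionality
normsZ = sqrt(sum(randn(n, 1e6).^2, 1));         % norms of 1e6 standard normal vectors
empirical = mean(normsZ);
approx = sqrt(n)*(1 - 1/(4*n) + 1/(21*n^2));     % E[||N(0,I)||] approximation
fprintf('empirical %.4f, approximation %.4f\n', empirical, approx);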

Schedule

  • Thesis write-up

    • Send the paper needed for computing c_{\mu/\mu,\lambda}
      • Include those equations for probability
    • Graph
      • Do it 22 or 14
      • expected fitness gain/lambda (normalized same scale)
      • opt. fitness gain and step size (log [.1,10])
    • Medium run
      • Take the run that uses a medium number of objective function calls
        • plot convergence plot
        • relative model error
        • normalized step size
    • Include the progress rate and convergence rate for sphere functions (pdf or histogram as you like)
    • When the noise-to-signal ratio -> \infty (model useless), point out that this amounts to random search (ultimately divergent)
    • Figure
      • Could divide into rows 1,2,3,4 with the same plot (2,3,4 share the same scale on the x-axes)
        • hist objective function calls
        • convergence plot
        • model error
        • normalized step size
      • Row 5,6 same plot
  • Contribution

    • Simple idea
    • Systematic evaluations

    A recent paper introduces a (1+1)-ES exploiting GP surrogate models; we wonder whether the mml-ES could work better

    1. Model using GP noise, figures and equations
    2. Even though this suggests potentially better performance, there is no guarantee: step size adaptation is very inaccurate for finite n, so a practical analysis (experimental results) is needed
    3. In runs, many steps in an iteration deteriorate the candidate solutions; simply not using those gives a significant advantage
  • The paper wonders whether the (1+1)-ES benefits from getting a better direction

    • MATH
    • Insight: any step size adaptation without CSA
  • Thesis write-up

Surrogate-assisted $(\mu/\mu,\lambda)$-ES

Just in the last section; say what could be done...

Plot just n -> \infty; remove the precise approximation for n=10,100

  • Paper write-up

Still call it (1+1)-ES

New step size adaptation mechanism (decrease in emergency)


  • 20181011

    • Use GP sigmaStar for (3/3,10), (5/5,20), (10,40) for the plots below
      1. expected fitness gain eta over normalized step size sigmaStar
      2. opt. normalized step size over noise-to-signal ratio
      3. opt. expected fitness gain over noise-to-signal ratio
    • Thought
      • Use sigmaStar; test on sphere functions for the optimal sigmaStar (largest convergence rate)
      • Adapt the step size s.t. the normalized step size stays close to the opt. sigmaStar found previously
      • Achieve an approximately similar convergence rate
    • Comment
      1. scatter points too close to each other
      2. strange: n=100 has a larger fitness gain than n=10 (in opt. normalized step size)
      3. y-axes not correct
      4. problem if
  • 20181012

    • Get a better approximation of the progress rate using the formula in the paper for n = \infty

      • Use formula (11) from the paper, denoted n \approx \infty
    • Replace the Gaussian distributed noise with the GP

  • 20181016

    • [ ] Gap between the expected progress rate and the one using the GP

      • TRAINING_SIZE
      • length scale
    • The opt. step size from the experimental results is almost twice the expected value

      • Fixed: the problem lay in the sigmaStar used for plotting
    • [ ] Performance for large dimensionality

      • Expect at least linear scaling with the increase of dimension
  • 20181018/20181019

    • [ ] Thesis write-up

      3. Plots
        • Expected progress rate over sigmaStar
        • opt. normalized step size and opt. expected fitness gain over noise-to-signal ratio
        • step size
      4. Step size adaptation (use GP)
        • Normalized step size over iterations
        • Noise-to-signal ratio (var(fep-fy)/var(fy) over iterations) should be a straight line

          All lambda offspring are evaluated by the GP (fep) and by the true objective function (fy); calculate the noise-to-signal ratio per iteration as var(fep-fy)/var(fy).
    • [ ] Draft for thesis

      • Figures and tables are fine
      • Little text
    • [ ] Step size adaptation

      • Use the standard one (Niko's)

      • Compare with the opt. normalized step size using sigmaStar

      • [ ] TRAINING_SIZE (minor)

        1. Plots
          • Expected progress rate over \sigmaStar (lambda=10,20,40, n=10; include the expected curve and the (1+1)-ES)
          • opt. normalized step size and opt. expected fitness gain over noise-to-signal ratio
          • step size
        2. Step size adaptation (use GP)
          • Normalized step size over iteration (plot)

          • Noise-to-signal ratio (var(fep-fy)/var(fy) over iterations) should be a straight line

            <all lambda offspring evaluated by GP (fep) and by the true objective function (fy)>

            <calculate the noise-to-signal ratio per iteration as var(fep-fy)/var(fy)>

            <20, 40, 80 plots look the same -> 40 is the best (stick to n = 10); too much information>

            <focus on the same plot; include the (1+1)-ES with GP>

            <variance (noise-to-signal ratio) over iterations>

            <var(fye-fy)/var(fy), fitness in the population>

      • [ ] TRAINING_SIZE step_size_adaptation

        • Try TRAINING_SIZE = 4n (rounded up to k*lambda)???
  • 20181022

    • [ ] Step size adaptation

    • Compare results for the plots

      • Histogram
    • Observations (using the CSA from Dirk's lecture slides, which was wrong)

      1. The mml-ES seems to help avoid the potentially poor directions resulting from an inaccurate GP model, since its success rate is always around 0.48 as expected, while the success rate of the (1+1)-ES varies among test functions (from hist_xxx_sphere.fig).

      2. It is likely that the mml-ES is more sensitive to the relative model error, which has a direct impact on the convergence rate. We can achieve performance similar to the (1+1)-ES if the relative model error of the mml-ES is at most half that of the (1+1)-ES (see xxx_sphere_funCalls.fig: on the linear sphere the model error of the mml-ES is larger, which potentially leads to the smaller convergence rate and therefore more objective function evaluations).

      3. One potential way out is to adapt the length scale factor (probably via a length scale adaptation mechanism). The length scale factor 8 seems to be highly biased towards the quadratic sphere, but from previous experiments I have no idea what the relation might be; I thought it could be related to the complexity of the test function.

  • 20181023

    • Success rate (48% for mml-ES)

      • Generate one direction
      • (1+1)-ES generates one direction (takes the step only when it is good)
      • mml generates the gradient direction (always makes the step)
      • The step is bad every other iteration
      • Reject the step
    • Rejecting steps (cannot be slower; how much does it gain?)

      • Every other step is bad
      • Could try what Arash did in step size adaptation
    • Impact of the population size (lambda) on the noise-to-signal ratio (model error)

      How the model error depends on lambda

      • Plot the model error over lambda
    • Success rate: use a plot. Histogram of improvement (pdf): the difference of the current centroid compared with the previous centroid, divided by R times the derivative (R is the distance to the optimum)
      - linear: R
      - quadratic: 2R^2
      - cubic: 3R^3

    • CSA

      • Just change the coefficients
      • Try the latest one to see if it improves
  • 20181025

    • CSA
      • Problem solved: $E\|N(0,I)\| = \sqrt{n}(1-1/(4n)+1/(21n^2))$
    • Result
      • Niko's CSA did not bring much improvement compared with Dirk's
      • The dark art of combining CSA and the 1/5 rule did not seem to be affected by the population size
    • ADVICE
      • Thesis write-up
      • If fast, do it by all means
      • Could just examine why it works better on the quadratic sphere
    • Convergence rate (per iteration; a sketch for computing it follows below)
      • Linear: -n*log(f(x)/f(y))
      • Quadratic: -(n/2)*log(f(x)/f(y))
      • Cubic: -(n/3)*log(f(x)/f(y))
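A minimal sketch of this computation (fx here is a stand-in for the per-iteration objective values of the centroid; the names and values are illustrative):

n = 10;                                 % dimensionality
p = 2;                                  % 1 = linear, 2 = quadratic, 3 = cubic sphere
fx = cumprod([1, 0.9*ones(1, 99)]);     % stand-in objective value history
% average of -(n/p)*log(f(t+1)/f(t)) over the run
rate = -(n/p) * mean(diff(log(fx)));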
  • 20181025

    • Good result

    • Step size adaptation (story)

      When the CSA performs badly (in an emergency), we decrease the step size.

      The evolution path does not adapt in case of emergency.

  • 20181123

    • General ES
      • CMA-ES
        • Covariance matrix ideally approaches the inverse of the Hessian
    • Step size
      • CSA (only for mml-ES)
        • Compare the length of the search path

mu_mu_lambda-ES

Implement a (mu/mu,lambda)-ES

The algorithm is initialized with mu parents; the centroid of the parents is used to generate lambda offspring, and the best mu offspring are then picked as the parents of the next iteration.

Note: lambda >= mu

Schedule

1. Model the GP estimate using Gaussian distributed noise

Assume the GP estimate can be modelled as the true objective function value plus a random Gaussian noise term.

fep(y) = f(y) + sigma_ep_star*randn(N,1)

2. Step size adapted by sigma = sigma_star/n*dist

The step size is kept proportional to the distance to the optimum (dist).
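A minimal sketch of steps 1 and 2 combined on the quadratic sphere (all names and parameter values are illustrative; the GP estimate is replaced by the Gaussian-noise model above):

n = 10; mu = 3; lambda = 10;
sigma_star = 1; sigma_ep_star = 1;           % normalized step size and noise strength
f = @(y) sum(y.^2);                          % quadratic sphere, optimum at 0
x = randn(n, 1);                             % initial centroid
for t = 1:500
    dist = norm(x);                          % distance to the optimum
    sigma = sigma_star/n * dist;             % step size proportional to dist
    z = randn(n, lambda);                    % lambda mutation vectors
    fep = zeros(1, lambda);
    for j = 1:lambda
        fep(j) = f(x + sigma*z(:, j)) + sigma_ep_star*randn();  % noisy "GP" value
    end
    [~, idx] = sort(fep);                    % rank offspring by the noisy estimate
    x = x + sigma*mean(z(:, idx(1:mu)), 2);  % centroid of the best mu steps
end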

3. Plot convergence rate over sigma* and noise-to-signal ratio

Plots are attached for noise-to-signal ratio = 0, 0.25, 1, 4. A negative convergence rate means the algorithm does not converge.

4. Step size adapted by cumulative step-size adaptation

The number of iterations needed coincides with the convergence rate plot: a high convergence rate gives a small number of iterations to reach the stopping criterion.

  • Convergence rate and normalized step size over noise ratio

5. Build GP model by adding training set

Once the model is built, offspring are evaluated using the GP model, reducing the number of true objective function evaluations to 1 per iteration.

The improvement over a plain (mu/mu,lambda)-ES is the saving of objective function evaluations in each iteration; the effectiveness of the improvement lies in the precision of the Gaussian process approximation.

6. Compare the mml-ES with and without GP, and similarly the (1+1)-ES

The objective function evaluation counts, averaged over 400 runs, are shown below. Plots of the step size $\sigma$ and the objective function value f(x) over the # of objective function calls for 5 different test functions are attached.

(1+1)-speed-up = (# of objective function calls of (1+1)-ES noGP)/(# of … of (1+1)-ES withGP)
mml-speed-up = (# of objective function calls of mml noGP)/(# of … of mml withGP)
(1+1)-mml-ratio = (# of … of (1+1)-ES noGP)/(# of objective function calls of mml noGP)
(1+1)-mml-GP-ratio = (# of … of (1+1)-ES withGP)/(# of objective function calls of mml withGP)

7. The precision of the GP estimate

Compare funCalls (2/2,5)-ES lengthScale=16

Number of runs = 500. Length scale theta = 16*sigma*sqrt(n). (2/2,5)-ES. Evaluate and choose the centroid as the parent in each iteration. Note: plot and data stored in folder plot_cmp_result3.

| Test function | mml-ES withGP | mml-ES noGP | (1+1)-ES withGP | (1+1)-ES noGP | mml-speed-up | (1+1)-speed-up | (1+1)-mml-GP-ratio | (1+1)-mml-ratio |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| linear sphere | 1128 | 4460 | 505 | 1279 | | | | |
| quadratic sphere | 479 | 2350 | 214 | 676 | | | | |
| cubic sphere | 386 | 1640 | 203 | 483 | | | | |
| Schwefel's function | 1833 | 13825 | 1246 | 4233 | | | | |
| quartic function | 3356 | 11320 | 1502 | 2385 | | | | |

Observation: the length scale factor does not seem to make a difference.

Compare funCalls (2/2,5)-ES

Number of runs = 500. Length scale theta = 8*sigma*sqrt(n). (2/2,5)-ES. Evaluate and choose the centroid as the parent in each iteration. Note: plot and data stored in folder plot_cmp_result2.

| Test function | mml-ES withGP | mml-ES noGP | (1+1)-ES withGP | (1+1)-ES noGP | mml-speed-up | (1+1)-speed-up | (1+1)-mml-GP-ratio | (1+1)-mml-ratio |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| linear sphere | 1097 | 4470 | 507 | 1271 | | | | |
| quadratic sphere | 485 | 2340 | 213 | 678 | | | | |
| cubic sphere | 385 | 1660 | 202 | 480 | | | | |
| Schwefel's function | 2681 | 13825 | 1246 | 4233 | | | | |
| quartic function | 3283 | 11285 | 1498 | 2378 | | | | |

Compare funCalls (3/3,10)-ES

Number of runs = 400. Length scale theta = 8*sigma*sqrt(n). (3/3,10)-ES. Evaluate and choose the centroid as the parent in each iteration. Note: plot and data stored in folder plot_cmp_result1.

| Test function | mml-ES withGP | mml-ES noGP | (1+1)-ES withGP | (1+1)-ES noGP | mml-speed-up | (1+1)-speed-up | (1+1)-mml-GP-ratio | (1+1)-mml-ratio |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| linear sphere | 734 | 2700 | 505 | 1275 | 3.68 | 2.52 | 0.688 | 0.472 |
| quadratic sphere | 301 | 1400 | 214 | 681 | 4.65 | 3.18 | 0.711 | 0.486 |
| cubic sphere | 267 | 960 | 202 | 481 | 3.60 | 2.38 | 0.756 | 0.501 |
| Schwefel's function | 2599 | 5490 | 1490 | 2381 | 2.11 | 1.60 | 0.573 | 0.434 |
| quartic function | 1024 | 6250 | 1259 | 4218 | 6.10 | 3.35 | 1.229 | 0.675 |

Compare funCalls (4/4,15)-ES

Number of runs = 500. Length scale theta = 8*sigma*sqrt(n). (4/4,15)-ES. Evaluate and choose the centroid as the parent in each iteration. Note: plot and data stored in folder plot_cmp_result3.

| Test function | mml-ES withGP | mml-ES noGP | (1+1)-ES withGP | (1+1)-ES noGP | mml-speed-up | (1+1)-speed-up | (1+1)-mml-GP-ratio | (1+1)-mml-ratio |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| linear sphere | 670.5 | 2310 | 504 | 1270 | | | | |
| quadratic sphere | 238 | 1180 | 213 | 680.5 | | | | |
| cubic sphere | 276 | 800 | 202 | 476 | | | | |
| Schwefel's function | 2683 | 13825 | 1246 | 4233 | | | | |
| quartic function | 841 | 4610 | 1250 | 4191 | | | | |

Next

  • Format the code s.t. the output is the same for the 4 different strategies (a stub illustrating the shared output contract follows this list).

    1. t: # of objective function calls
    2. x_last: last parent x(centroid)
    3. fx: last objective function value for parent(centroid)
    4. sigma_array: sigma array over # of objective function calls
    5. x_array: parent set for x
    6. fx_array: objective function values for parents
    7. convergence_rate: rate of convergence
    8. GP_error: if no GP error = -1
    9. sigma_star_array: normalized step size
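A hypothetical MATLAB stub matching the output list above (run_strategy and its trivial bodies are illustrative, not the repository's actual interface):

function [t, x_last, fx, sigma_array, x_array, fx_array, ...
          convergence_rate, GP_error, sigma_star_array] = run_strategy(f, x0, sigma0)
% RUN_STRATEGY  Stub illustrating the shared output contract (hypothetical).
t = 0;                                        % # of objective function calls
x_last = x0; fx = f(x0);                      % last centroid and its value
sigma_array = sigma0; x_array = x0; fx_array = fx;
convergence_rate = 0;
GP_error = -1;                                % -1 when no GP is used
sigma_star_array = sigma0*numel(x0)/norm(x0); % normalized step size sigma*n/dist
end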
  • Modify the plot functions (fun_graph_funCall_merged.m, fun_graph_iterate_merged.m) s.t.

    • the data and plots are automatically saved.
    • the step size (sigma), objective function value (fx) and normalized step size (sigma*) are plotted over the # of objective function calls and the # of iterations respectively.
    • plot over # of objective function calls: run_graph_funCall_merged.m
    • plot over # of iterations: run_graph_iterate_merged.m
  • Modify the training set s.t. training data is added on a first-in, first-out basis. See mml_GP.m.

  • Double the GP length scale to see what happens.

    • Use the (2/2,5)-ES to test; the results for the length scale multiplied by 8 and 16 are stored in folders plot_cmp_result1 and plot_cmp_result2 respectively.
    • Result: the length scale does not matter in this context.
  • Run Arash's (1+1)-ES on the three sphere functions and plot the normalized step size against the iteration number.

    • Plot attached as the subplot combined with f(x) and sigma over number of objective function calls.
    • We need a large normalized step size to beat the (1+1)-ES with GP, essentially because of its proposed step-size adaptation: it can produce a normalized step size so large that the mml-ES with GP cannot beat it.
    • The normalized step size (sigma*) increases as we increase lambda.
      • sigma* for (2/2,5)-ES < (1+1)-ES
      • sigma* for (3/3,10)-ES > (1+1)-ES
      • The difference grows even larger for (4/4,15)-ES (much larger normalized step size)
  • Plot the relative error of GP using the following and plot the GP error of (1+1)-ES with GP and mml-ES with GP accordingly.

    • GP_error = |f(y)-fep(y)|/|f(y)-f(x)|, where x is the parent and y the offspring.
    • Function for merged plot of the 5 test functions.
    • The tables below are obtained by taking the median relative error and the median number of objective function calls per iteration over 50 replicates.
    | strategy \ median relative error | linear sphere | quadratic sphere | cubic sphere | Schwefel's function | quartic function |
    | --- | --- | --- | --- | --- | --- |
    | (4/4,15) withGP | 1.302 | 0.9086 | 2.0487 | 1.7776 | 1.2200 |
    | (1+1) withGP | 1.2042 | 1.2085 | 1.9871 | 1.3764 | 1.1586 |

    | strategy \ median funCalls | linear sphere | quadratic sphere | cubic sphere | Schwefel's function | quartic function |
    | --- | --- | --- | --- | --- | --- |
    | (4/4,15) withGP | 633 | 203 | 228.5 | 3005 | 817.5 |
    | (1+1) withGP | 488 | 216 | 206 | 1536 | 1236 |
  • Can you extend the GP code to compute variances and plot those in relation to errors and step size? GP estimate: fTest = mu + Ks'*(K\(fTrain'-mu)); variance: Kss - Ks'*(K\Ks); error = |f(y)-fep(y)|/|f(y)-f(x)|. Find the relation between variance & error and between variance & step size. Interpretation: make the decision based on the variance rather than evaluating one candidate per iteration. (A sketch of the posterior computation follows.)
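A minimal sketch of the GP posterior mean and variance with a squared-exponential kernel (e.g. in a file gp_predict.m; the kernel choice, jitter term, and all names are assumptions, not the repository's code):

function [fTest, fVar] = gp_predict(xTrain, fTrain, xTest, theta)
% Columns of xTrain/xTest are points; theta is the length scale.
sqd = @(A, B) sum(A.^2, 1)' + sum(B.^2, 1) - 2*(A'*B);   % squared distances
k = @(A, B) exp(-sqd(A, B)/(2*theta^2));                 % squared-exponential kernel
m = mean(fTrain);                                        % constant prior mean
K  = k(xTrain, xTrain) + 1e-8*eye(size(xTrain, 2));      % jitter for stability
Ks = k(xTrain, xTest);
fTest = m + Ks'*(K\(fTrain(:) - m));                     % posterior mean
fVar  = diag(k(xTest, xTest)) - diag(Ks'*(K\Ks));        % posterior variance
end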

  • [ ] In the $(\mu/\mu,\lambda)$-ES, would evaluating the best candidate solution rather than the centroid make a difference?

  • [ ] Avoid increases in the objective function value by comparing the objective function value of the current centroid with the previous one.

    • Problem: may easily get trapped in a local minimum, e.g. a saddle point.
  • Weight using the GP model and the true objective function

8. Alternative strategy

a. Original

while not terminate() do {
	for j = 1,...,lambda {
		z(j) = randn(N,1)
		y(j) = x + sigma*z(j)
		fep(j) = evaluateUseGP(y(j))
	}
	sort z by fep (ascending)
	z = 1/mu*(z(1)+...+z(mu))
	x = x + sigma*z
	update step size sigma
}
  • Results averaged over 5 runs

    | mu | lambda | linear | quadratic | cubic | Schwefel | quartic |
    | --- | --- | --- | --- | --- | --- | --- |
    | 2 | 10 | 697 | 308 | 267 | 2848 | 1084 |
    | 2 | 20 | 608 | 219 | 285 | 3040 | 748 |
    | 2 | 30 | 569 | 208 | 281 | 3040 | 762 |
    | 2 | 40 | 580 | 198 | 364 | 3040 | 846 |
    | 2 | 50 | 587 | 195 | 318 | 3040 | 3040 |
    | 3 | 10 | 743 | 272 | 296 | 2844 | 969 |
    | 3 | 20 | 606 | 216 | 250 | 3040 | 715 |
    | 3 | 30 | 606 | 198 | 269 | 3040 | 623 |
    | 3 | 40 | 592 | 197 | 285 | 3040 | 3040 |
    | 3 | 50 | 607 | 195 | 3040 | 3040 | 3040 |
    | 4 | 10 | 785 | 308 | 270 | 2654 | 1100 |
    | 4 | 20 | 641 | 211 | 264 | 3040 | 674 |
    | 4 | 30 | 609 | 196 | 269 | 3040 | 686 |
    | 4 | 40 | 650 | 191 | 291 | 3040 | 3040 |
    | 4 | 50 | 677 | 190 | 408 | 3040 | 3040 |
    | 5 | 10 | 860 | 329 | 327 | 2429 | 1352 |
    | 5 | 20 | 654 | 207 | 274 | 3040 | 736 |
    | 5 | 30 | 656 | 204 | 273 | 3040 | 607 |
    | 5 | 40 | 695 | 186 | 315 | 3040 | 3040 |

b. Choose the best centroid so far as parent

while not terminate() do {
	for j = 1,...,lambda {
		z(j) = randn(N,1)
		y(j) = x + sigma*z(j)
		fep(j) = evaluateUseGP(y(j))
	}
	sort z by fep (ascending)
	z = 1/mu*(z(1)+...+z(mu))
	x_temp = x + sigma*z
	// update iff the new centroid is better
	if (f(x) > f(x_temp)) {
		x = x_temp
	}
	update step size sigma
}
  • Results averaged over 5 runs

    | mu | lambda | linear | quadratic | cubic | Schwefel | quartic |
    | --- | --- | --- | --- | --- | --- | --- |
    | 2 | 8 | 791 | 342 | 399 | 1794 | 1147 |
    | 2 | 9 | 722 | 342 | 340 | 1420 | 1022 |
    | 2 | 10 | 756 | 306 | 383 | 1286 | 1132 |
    | 2 | 11 | 696 | 324 | 463 | 5040 | 1048 |
    | 2 | 12 | 703 | 310 | 816 | 1405 | 960 |
    | 2 | 13 | 721 | 310 | 828 | 5040 | 967 |
    | 2 | 14 | 719 | 271 | 809 | 5040 | 952 |
    | 2 | 15 | 771 | 283 | 913 | 1461 | 915 |
    | 3 | 8 | 784 | 370 | 378 | 1507 | 1279 |
    | 3 | 9 | 742 | 364 | 450 | 1299 | 995 |
    | 3 | 10 | 734 | 329 | 375 | 1327 | 1099 |
    | 3 | 11 | 768 | 326 | 531 | 1319 | 1118 |
    | 3 | 12 | 714 | 315 | 865 | 1571 | 1037 |
    | 3 | 13 | 767 | 304 | 646 | 1512 | 995 |
    | 3 | 14 | 745 | 279 | 895 | 1431 | 956 |
    | 3 | 15 | 731 | 279 | 625 | 2124 | 957 |
    | 4 | 8 | 903 | 400 | 426 | 1675 | 1480 |
    | 4 | 9 | 858 | 336 | 401 | 1611 | 1242 |
    | 4 | 10 | 814 | 338 | 420 | 1402 | 1119 |
    | 4 | 11 | 737 | 368 | 619 | 1349 | 998 |
    | 4 | 12 | 796 | 308 | 674 | 1345 | 1092 |
    | 4 | 13 | 763 | 310 | 831 | 1263 | 1098 |
    | 4 | 14 | 805 | 312 | 945 | 1315 | 1028 |
    | 4 | 15 | 724 | 300 | 810 | 1316 | 1000 |

Different noise levels

Plot the performance of the mml-ES under different noise-to-signal ratios, defined as v = sigma_ep_star/sigma_star.

The undesired pattern (the green dots) is where the convergence rate is NaN.

Under high v (e.g. v = 4) and a large sigma_star the algorithm does not converge, which gives a negative convergence rate.

Use CSA to adapt step size

Algorithm

Replace the previous step size adaptation using sigma = sigma_star/n*dist (denoted approach 0) with Cumulative Step Length Adaptation (CSA) from https://pdfs.semanticscholar.org/2556/e4670234fc74232a1a419260fe8c1a29d0ba.pdf.

c = 1/sqrt(N)
D = sqrt(N)
s = 0
while not terminate() do {
	for j = 1,...,lambda {
		z(j) = randn(N,1)
		y(j) = x + sigma*z(j)
		f(j) = evaluate(y(j))
	}
	sort z by f (ascending)
	z = 1/mu*(z(1)+...+z(mu))
	x = x + sigma*z
	s = (1-c)*s + sqrt(mu*c*(2-c))*z
	sigma = sigma*exp((norm(s)^2-N)/(2*D*N))
}

Three approaches are used for CSA:

  1. Average each dimension of the N-dimensional z: z(j) = randn(N,1); s = (1-c)*s + sqrt(mu*c*(2-c))*mean(z)

  2. Keep the N-dimensional z and measure the path by its L2-norm: z(j) = randn(N,1); s = (1-c)*s + sqrt(mu*c*(2-c))*z

  3. Use a 1-dimensional z, lacking randomness: z(j) = randn(); s = (1-c)*s + sqrt(mu*c*(2-c))*z
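A side-by-side sketch of the three path updates (a minimal sketch: z here is drawn at random only to make the lines self-contained; in the algorithm it is the averaged step of the best mu offspring):

n = 10; mu = 3; c = 1/sqrt(n);
z = mean(randn(n, mu), 2);                                  % stand-in for the selected step
s1 = 0;          s1 = (1-c)*s1 + sqrt(mu*c*(2-c))*mean(z);  % 1) average over dimensions
s2 = zeros(n,1); s2 = (1-c)*s2 + sqrt(mu*c*(2-c))*z;        % 2) full N-dimensional path
s3 = 0;          s3 = (1-c)*s3 + sqrt(mu*c*(2-c))*randn();  % 3) scalar z, no selection info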

Comparison

CSA does not seem to work in this context on the quadratic sphere. Compared with approach 0), CSA presumably decreases the step size too fast, so the algorithm gets stuck at some point and cannot move further.

CSA is sensitive to the initial step size

The performance is also very sensitive to the initial step size sigma0.

If we set a large sigma0, the improvement in the first few steps is significant: the objective function value can be reduced to e^-2 within 10 iterations. But the number of GP estimates per iteration grows significantly. An example with sigma_ep_star = 4, sigma0 = 2, mu = 3, lambda = 10 gives the number of GP estimates per iteration [0 0 0 0 460 1215 1706 4393 10751 755804 138400157], where the first 4 iterations use the true objective function.

For a small sigma0, the algorithm gets stuck at some point and cannot move further, as in case 3) regardless of whether sigma0 is small or large.

If we tune sigma0 to a proper value (0.5-0.6) we can achieve e^-4 within 20 iterations, but the number of GP estimates also increases significantly as the number of iterations grows.

Change of offspring selection criteria

Initially, we adopt the simple criterion of accepting a candidate solution as an offspring iff f(centroid) > fep(candidate solution), where fep is the GP estimate of the candidate solution.

This criterion leads to the problem that even millions of samples cannot produce an offspring that matches the centroid. So we loosened the restriction on the offspring selected by adding a scalar sigma_GP, i.e. the maximum tolerance we can accept for an offspring inferior to the centroid.

Another interpretation: we have a centroid as the parent and generate offspring in arbitrary directions, so there must be a direction that leads towards the optimum. What the offspring selection criterion does is choose the region in which a candidate solution can be selected as an offspring. Once we increase the level of tolerance, that region grows, and with it the probability of a candidate solution being accepted as an offspring, so less sampling is needed.

Now we accept a candidate solution as an offspring iff f(centroid)*sigma_GP > fep(candidate solution).

More tolerance (a larger sigma_GP) allows more progress (more steps) and less sampling, but a lower progress-to-step ratio, i.e. less fitness gain per step.

Presumably there is a relation between sigma_GP and sigma_ep_star: when more noise (variance) is added to fy, we need to be more selective about the offspring, i.e. a larger sigma_ep_star needs a smaller sigma_GP.
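A minimal sketch of this acceptance test (f and the GP estimate fep are stand-ins; sigma_GP = 3.5 follows the experiment below):

n = 10; sigma = 0.5; sigma_GP = 3.5;
f = @(y) sum(y.^2);                        % true objective (quadratic sphere)
fep = @(y) f(y) + randn();                 % stand-in for the GP estimate
x = randn(n, 1);                           % current centroid
nSamples = 0;
while true
    y = x + sigma*randn(n, 1);             % candidate generated from the centroid
    nSamples = nSamples + 1;
    if f(x)*sigma_GP > fep(y), break; end  % accept iff within the tolerance
end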

Experiment

x0 = -0.203102904230873 -0.0856172663925321 1.30705750617960 -0.150834774366184 1.32683079014856 1.04826284432273 0.224821585068546 0.833055309940219 -0.763427819121924 -0.0629202204887581 -0.557425428225165 0.564949072093604 0.106706826487156 -0.599811245264293 0.0782480156340188 -1.54840125247811 -0.146604917055879 -0.340021761681992 -0.0408603991447191 1.87413826421482 -0.0702531036583463 0.287987538717003 -1.09111070759422 -0.358833142760595 -1.56273576980328 -0.531376791279362 1.39094457594521 -1.50993837819604 1.72474702674583 -0.702997115026487

| sigma0 | sigma_GP | sigma_ep_star | t (# of iterations) | # of GP estimates |
| --- | --- | --- | --- | --- |
| 1 | 3.5 | 1 | 55, 73, 64, 55 | 231894, 621, 9789, 65221588 |

There is much randomness in the algorithm. With the same parameter setting, it sometimes finishes with very limited sampling, sometimes with over a million samples, and sometimes it simply does not converge within 10000 iterations (almost half of the time).

Sampling a good candidate solution becomes much harder as we approach the optimum; the number of samples increases drastically in the last few steps before the stopping criterion is reached.

Test CSA parameter

Verified that in CSA, s = (1-c)*s + sqrt(mu*c*(2-c))*z, which is n-dimensional.

We plot |s|^2 - n over the iterations.

For noise-to-signal ratio v = 4 (a quite high noise level) we obtain a similar pattern for sigma_star = 1,2, where the value goes up to around -9.7, then jumps back to -10 and converges to -10. The scalar exp((norm(s)^2-N)/(2*D*N)) multiplied onto sigma each time goes up to 0.216, then jumps to 0.206. When sigma_star > 3, the algorithm does not converge and both terms are zero almost all the time.

v = 2 shows a similar pattern.
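A self-contained sketch of this diagnostic under purely random selection, where E[|s|^2] = n holds so |s|^2 - n should fluctuate around zero (parameter values are arbitrary):

n = 10; mu = 3; c = 1/sqrt(n); T = 500;
s = zeros(n, 1); history = zeros(1, T);
for t = 1:T
    z = mean(randn(n, mu), 2);               % centroid step under random ranking
    s = (1-c)*s + sqrt(mu*c*(2-c))*z;        % cumulative search path
    history(t) = norm(s)^2 - n;
end
plot(1:T, history); xlabel('iteration'); ylabel('|s|^2 - n');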


Change of Algorithm

The usual (mu/mu,lambda)-ES is modified by replacing the true objective function evaluation of the candidate solutions generated from the centroid with a Gaussian process estimate; the best mu candidate solutions are picked as offspring and averaged to obtain the new centroid (parent) for offspring generation in the next iteration. Pseudocode is below.

while not terminate() do {
	for j = 1,...,lambda {
		z(j) = randn(N,1)
		y(j) = x + sigma*z(j)
		fep(j) = evaluateUseGP(y(j))
	}
	sort z by fep (ascending)
	z = 1/mu*(z(1)+...+z(mu))
	x = x + sigma*z
	update step size sigma
}
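A runnable MATLAB rendering of this pseudocode, a sketch only: the GP estimate is replaced by a multiplicative-noise stand-in, and the step size follows the CSA rule from the pseudocode in "Use CSA to adapt step size" above (all names and constants are illustrative, not the repository's code).

n = 10; mu = 3; lambda = 10;
c = 1/sqrt(n); D = sqrt(n);
f = @(y) sum(y.^2);                          % quadratic sphere
fep = @(y) f(y)*(1 + 0.1*randn());           % stand-in for evaluateUseGP
x = randn(n, 1); sigma = 1; s = zeros(n, 1); % centroid, step size, search path
for t = 1:5000
    z = randn(n, lambda);                    % lambda mutation vectors
    fepVals = zeros(1, lambda);
    for j = 1:lambda
        fepVals(j) = fep(x + sigma*z(:, j)); % surrogate value of each candidate
    end
    [~, idx] = sort(fepVals);                % rank by surrogate value
    zAvg = mean(z(:, idx(1:mu)), 2);         % centroid of the best mu steps
    x = x + sigma*zAvg;
    s = (1-c)*s + sqrt(mu*c*(2-c))*zAvg;     % cumulative search path
    sigma = sigma*exp((norm(s)^2 - n)/(2*D*n));  % CSA update
    if f(x) < 1e-8, break; end
end
fprintf('stopped after %d iterations, f(x) = %.3e\n', t, f(x));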
