
DAMN !!! A bug – A 4th iteration nightmare

When was the last time you had a bug and were completely clueless about how to fix it? Recently I was in that situation and ran out of ideas. I believe the reason was my lack of knowledge of the problem domain I was working in. Let me explain the problem and how I found the solution; it will give you a better idea of my suffering. By the way, it took me 4 days to figure it out :D.

 

Problem background:

I was implementing an algorithm [1] on the big data platform Apache Hama. The algorithm comes from the Energy Informatics domain and uses convex optimization and ADMM [2] to make the problem distributed. Let's leave the complexity aside and move straight to the main points.

I had to implement 2 main equations from [1].

The goal of equation 20 is to find the value of x_n that minimizes it. p^T, delta T, x_n^k, x_mean and u^k are vectors of length 96, and x_n is also a vector of length 96.
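
As a rough sketch (inferred from the solver code shown further down in this post, so the exact constants such as delta T may be folded in differently), the quantity being minimized in equation 20 has the form

\min_{x_n} \; -p^\top x_n \;+\; \frac{\rho}{2} \left\lVert x_n - x_n^{k} + \bar{x}^{k} + u^{k} \right\rVert_2^2

where \bar{x}^{k} is the mean vector and every vector has length 96.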

The goal of equation 26 is to find the value of x_i that minimizes it. gamma, rho and alpha are constants; everything else is a vector of length 96.

The distributed algorithm looks something like this. Image taken from [1].

In the above diagram, the Aggregator and each EV (Electric Vehicle) represent separate machines. The algorithm works as follows (a rough sketch of this loop in code follows right after the list):

  1. The Aggregator initializes some vectors (u, x) and sends them to the different EV machines. We call this the incentive signal.
  2. Each EV machine solves its own equation (equation 26 shown above).
  3. While the EVs are solving their equations, the Aggregator solves its own equation (equation 20 shown above).
  4. Each EV returns its result to the Aggregator.
  5. The Aggregator collects all the results and aggregates them.
  6. It updates the incentive signals (u, x).
  7. The Aggregator sends the updated values to each machine, and the process continues until the optimal solution is reached.
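
Here is a rough single-process sketch of that loop in plain Java. All the helper methods are placeholders for the real CPLEX-based solvers and vector math, not my actual Hama code:

public class AdmmLoopSketch
{
	static final int T = 96;        // length of every profile vector
	static final int NUM_EVS = 3;   // number of EV "machines" in my test setup

	public static void main(String[] args)
	{
		double[] u = new double[T];              // incentive signal
		double[] xMean = new double[T];          // aggregated EV response
		double[][] xEv = new double[NUM_EVS][T]; // one profile per EV

		for (int k = 0; k < 10; k++)             // ADMM iterations
		{
			// Steps 1 and 3: the Aggregator works on its own subproblem (equation 20)
			double[] xAgg = solveAggregatorStep(u, xMean);

			// Steps 2 and 4: each EV solves its subproblem (equation 26) and returns the result
			for (int n = 0; n < NUM_EVS; n++)
				xEv[n] = solveEvStep(u, xMean, xEv[n]);

			// Step 5: the Aggregator collects and aggregates all the results
			xMean = aggregate(xAgg, xEv);

			// Steps 6 and 7: update the incentive signal and start the next round
			u = updateIncentives(u, xMean);
		}
	}

	// Placeholders: the real versions call CPLEX and the vector utilities
	static double[] solveAggregatorStep(double[] u, double[] xMean) { return new double[T]; }
	static double[] solveEvStep(double[] u, double[] xMean, double[] xOld) { return new double[T]; }
	static double[] aggregate(double[] xAgg, double[][] xEv) { return new double[T]; }
	static double[] updateIncentives(double[] u, double[] xMean) { return new double[T]; }
}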

Now you have a good idea of the problem. Let's see what the issue was that took me 4 days to fix.

Problem:

Initially, I ran the algorithm for 1 iteration only. I got a few logical errors and fixed them in no time. Then I ran the algorithm for 2 iterations and my output was correct. To verify it, I compared my output with the MATLAB implementation of the same algorithm; the MATLAB code was serial, but the answers were pretty close.

Then I ran it for 10 iterations and my answer was way off. I could not understand why, so I started increasing the iterations one by one. The output of the 3rd iteration was also correct, but the output of the 4th was wrong. I was confused: how could the first 3 iterations be correct and the 4th be wrong? I thought maybe the output of the last iteration was always wrong, so I ran the algorithm for 5 iterations, but the answer of the 4th was still wrong.

Note: the output of each iteration depends on the output of the previous iteration, so the output of the 5th iteration was automatically wrong as well.

Clue #1: So I knew that something goes wrong in the 4th iteration.

Since the output of each iteration depends on the previous ones, I thought something must already be going wrong in the earlier iterations. I did the following:

  1. I checked all of my utility functions like vectorAdd, vectorScalerMultiply, vectorScalerAdd, vectorNorm, vectorMean and others. Everything looked fine.
  2. I checked my algorithm logic, and it also looked fine.

At this point I was annoyed: everything looked fine, yet my answers were wrong. I went on a console-printing spree and printed everything to the Eclipse console. The problem was that I had too much data in the console to make sense of it. Each vector had 96 double values, and in each iteration I had 6-7 different vectors, so it was a mess. To complicate things further, I was running in distributed mode and each machine was printing to the same console (I was mimicking distributed mode by running multiple separate tasks on the same machine). Check out part of my console:

EV ADMM Java console output

So instead of printing whole vectors, I printed the sum of all the values in each vector, reducing the output from 96 double values to a single double per vector. I printed everything this way, including the inputs and outputs of the different equations. The combined output looked something like this:

EV ADMM averaged console output

local:X –> represents the output of an EV machine. In my case I had 3 EV machines.

M:=>local:0 –> represents the output of the Aggregator machine.
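
A minimal sketch of that summary printing (the helper is illustrative, not my exact code):

public class VectorLog
{
	// Collapse a whole vector into one number so the console stays readable
	static double sum(double[] v)
	{
		double s = 0;
		for (double x : v)
			s += x;
		return s;
	}

	// Prefix with the machine name so the interleaved distributed output can be told apart
	static void log(String machine, String name, double[] v)
	{
		System.out.println(machine + " " + name + " = " + sum(v));
	}

	public static void main(String[] args)
	{
		double[] x_optimal = new double[96];     // e.g. an EV's charging profile
		log("local:1", "x_optimal", x_optimal);  // -> local:1 x_optimal = 0.0
	}
}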

Even though this approach is not the best, it gave me a very good overall understanding of all the data in my program. I generated a similar aggregated output from the MATLAB code as well. After comparing the two, I got my second clue.

Clue #2: In the 4th iteration, the input data to equation 20 matched MATLAB, but the output did not. So something was going wrong in solving equation 20.

4th Iteration data comparison

You can see in the picture above that there is a difference between the outputs of the MATLAB and Java code, so I focused on just that.

The catch here was that I was using a different solver to solve the equations in MATLAB and in Java. My first guess was that one of the solvers was buggy, but which one?

For JAVA -> CPLEX [3]

public double optimize(double[] xold, int iteration) throws IloException, FileNotFoundException
	{
		IloCplex cplex = new IloCplex();

		// Redirect the CPLEX log to a file so it does not flood the console
		OutputStream out = new FileOutputStream("logfile");
		cplex.setOut(out);

		// Decision variables: one per time slot (96 in total)
		IloNumVar[] x_n = cplex.numVarArray(masterData.getPrice().length, Double.MIN_VALUE, Double.MAX_VALUE);

		// Negated price vector: -p
		double[] priceRealMatrix = masterData.getPrice();
		priceRealMatrix = Utils.scalerMultiply(priceRealMatrix, -1);

		// Constant part inside the quadratic term (built from xold, the mean and u)
		double[] data = subtractOldMeanU(xold);

		// Build the objective term by term: -p_i * x_i + rho/2 * (x_i + data_i)^2
		IloNumExpr[] exps = new IloNumExpr[data.length];
		for (int i = 0; i < data.length; i++)
		{
			//Original equation
			exps[i] = cplex.sum(cplex.prod(priceRealMatrix[i], x_n[i]), cplex.prod(rho/2, cplex.square(cplex.sum(x_n[i], cplex.constant(data[i])))));
		}

		IloNumExpr rightSide = cplex.sum(exps);
		cplex.addMinimize(rightSide);

		cplex.solve();

		// Copy the optimal values back into a plain double array
		x_optimal = new double[x_n.length];
		for (int u = 0; u < x_n.length; u++)
			x_optimal[u] = cplex.getValues(x_n)[u];

		this.setX(Utils.setColumnInMatrix(this.getx(), x_optimal, this.getN() - 1));

		return cplex.getObjValue();
	}

 

For MATLAB -> CVX [4]

cvx_begin
    variable x_n(T)
        
    minimize( (-1 * price' * x_n) + (rho/2 * sum((x_n - xold' + xmean' + u').*(x_n - xold' + xmean' + u')))  )
    subject to
         -xa_min >= x_n >= -xa_max 
            
cvx_end

disp(x_n)

 

Clue #3: One of the solvers is buggy; somehow verify the output.

The only way to settle that was to introduce a third solver, generate its output and compare it with the output of the other two. So I solved the equation using YALMIP [5] in MATLAB.

%YALMIP SOLVER
xtemp= sdpvar(T,1,'full');
   
Cost= (-1 * price' * xtemp) + (rho/2 * sum((xtemp - xold' + xmean' + u').*(xtemp - xold' + xmean' + u')));
                
Constraints=[-xa_min >= xtemp >= -xa_max];
 
ops = sdpsettings('verbose',0);
sol=solvesdp(Constraints,Cost,ops);

if sol.problem == 0
     x = double(xtemp);
     disp(x)
end

The outputs of YALMIP and CVX were similar. This means that something was wrong on the Java/CPLEX side.

Clue #4: The Java solver, CPLEX, is giving the wrong output.

OK, the output of the Java solver is wrong, but I just pass in some parameters, and what CPLEX does internally is not under my control. So what to do? I was clueless, because the input was the same in MATLAB and Java and only the output differed. I had no idea what to do next and was about to ask my professor for help. But then an idea struck me: maybe, just maybe, if I could somehow dump the models generated on the MATLAB and Java sides and compare them, I might find something. But how?

Clue #5: The models generated by the solvers might be different even though everything else looks similar.

I did some research online and luckily found export methods in both CPLEX and YALMIP. From YALMIP, I exported the model in CPLEX's format, and from the Java side I exported the model that CPLEX had actually built. Ideally, the two should be identical.

YALMIP export method:

savecplexlp(Constraints, Cost, 'CplexModelByYalmip.mod')

CPLEX export method:

cplex.exportModel("CplexModel.lp");
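
On the Java side this was a one-line addition inside the optimize() method shown earlier; the only thing that matters is that the model is fully built before exporting, roughly like this:

		// ... inside optimize(), after the objective has been added ...
		cplex.addMinimize(rightSide);

		// Dump the exact model CPLEX is about to solve, so it can be diffed
		// against the file written by YALMIP's savecplexlp(...)
		cplex.exportModel("CplexModel.lp");

		cplex.solve();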

So I had both models in textual format. They look something like this:

Minimize
 obj:  0.0027139118708454715301 x1 + 0.0027139118744760569457 x2 + 0.0027139118782506625738 x3 + 0.0027139118821816708704 x4 + 0.0052164424950410723247 x5 + 0.0052164424993325630986 x6 + 0.0052164425038350274721 x7 + 0.0052164425085710862395 x8 + 0.0059687690432103993055 x9 + 0.0059687690484997342477 x10 + 0.0059687690541213173079 x11 + 0.0059687690601217327502 x12 + 0.0065121160046523844656 x13 + 0.0065121160116004932206 x14 + 0.0065121160191558349895 x15 + 0.0065121160274452145889 x16 + 0.0075413405251854442879 x17 + 0.0075413405355606866198 x18 + 0.0075413405475255601562 x19 + 0.0075413405617838873141 x20 + 0.008345912007262748164 x21 + 0.0083459120317961894148 x22 + 0.0083459120710129685444 x23 + 0.0083459121476222742492 x24 -0.0012954615156441917634 x25 -0.0012954614005996779175 x26 -0.0055390508959474371239 x27 -0.0055390508196974204067 x28 -0.0073086637207278809872 x29 -0.012196114279733692026 x30 -0.018063360000000000527 x31 -0.018063360000000000527 x32 -0.019645440000000000125 x33 -0.019645440000000000125 x34 -0.019645440000000000125 x35 -0.019645440000000000125 x36 -0.021918719999999995707 x37 -0.021918719999999995707 x38 -0.021918719999999995707 x39 -0.021918719999999995707 x40 -0.022310400000000001064 x41 -0.022310400000000001064 x42 -0.022310400000000001064 x43 -0.022310400000000001064 x44 -0.022901760000000007111 x45 -0.022901760000000007111 x46 -0.022901760000000007111 x47 -0.022901760000000007111 x48 -0.023051519999999998956 x49 -0.023051519999999998956 x50 -0.023051519999999998956 x51 -0.023051519999999998956 x52 -0.02039423999999999404 x53 -0.02039423999999999404 x54 -0.02039423999999999404 x55 -0.02039423999999999404 x56 -0.019557119999999997284 x57 -0.019557119999999997284 x58 -0.019557119999999997284 x59 -0.019557119999999997284 x60 -0.019541759999999998298 x61 -0.019541759999999998298 x62 -0.019541759999999998298 x63 -0.019541759999999998298 x64 -0.021146879999999996375 x65 -0.01558903559770834571 x66 -0.011026331106371425883 x67 -0.011026331106371425883 x68 -0.018352345794825170033 x69 -0.018352345794825170033 x70 -0.01500067646351784488 x71 -0.01500067646258635562 x72 -0.016874263397831341244 x73 -0.016874263396865745585 x74 -0.016874263395882407174 x75 -0.016874263394559104595 x76 -0.0137319482953197642 x77 -0.013731948293942381617 x78 -0.013731948292535534756 x79 -0.013731948291097795939 x80 -0.0076784429895759365517 x81 -0.007678442988066850293 x82 -0.0076784429865181906549 x83 -0.0076784429849277892333 x84 -0.0049885425772680886888 x85 + 0.0037588098044532072817 x86 + 0.0037588098060396188393 x87 + 0.0037588098076660400593 x88 + 0.0026094220559031659074 x89 + 0.0026094220576143908191 x90 + 0.0026094220593719397866 x91 + 0.0026094220611787202063 x92 + 0.0034035445108561296346 x93 + 0.0034035445127725272307 x94 + 0.0034035445160309728274 x95 + 0.0034035445194046005934 x96  + [0.010000000000000000208 x1 ^2 + 0.010000000000000000208 x2 ^2 + 0.010000000000000000208 x3 ^2 + 0.010000000000000000208 x4 ^2 + 0.010000000000000000208 x5 ^2 + 0.010000000000000000208 x6 ^2 + 0.010000000000000000208 x7 ^2 + 0.010000000000000000208 x8 ^2 + 0.010000000000000000208 x9 ^2 + 0.010000000000000000208 x10 ^2 + 0.010000000000000000208 x11 ^2 + 0.010000000000000000208 x12 ^2 + 0.010000000000000000208 x13 ^2 + 0.010000000000000000208 x14 ^2 + 0.010000000000000000208 x15 ^2 + 0.010000000000000000208 x16 ^2 + 0.010000000000000000208 x17 ^2 + 0.010000000000000000208 x18 ^2 + 0.010000000000000000208 x19 ^2 + 0.010000000000000000208 x20 ^2 + 0.010000000000000000208 x21 ^2 + 
0.010000000000000000208 x22 ^2 + 0.010000000000000000208 x23 ^2 + 0.010000000000000000208 x24 ^2 + 0.010000000000000000208 x25 ^2 + 0.010000000000000000208 x26 ^2 + 0.010000000000000000208 x27 ^2 + 0.010000000000000000208 x28 ^2 + 0.010000000000000000208 x29 ^2 + 0.010000000000000000208 x30 ^2 + 0.010000000000000000208 x31 ^2 + 0.010000000000000000208 x32 ^2 + 0.010000000000000000208 x33 ^2 + 0.010000000000000000208 x34 ^2 + 0.010000000000000000208 x35 ^2 + 0.010000000000000000208 x36 ^2 + 0.010000000000000000208 x37 ^2 + 0.010000000000000000208 x38 ^2 + 0.010000000000000000208 x39 ^2 + 0.010000000000000000208 x40 ^2 + 0.010000000000000000208 x41 ^2 + 0.010000000000000000208 x42 ^2 + 0.010000000000000000208 x43 ^2 + 0.010000000000000000208 x44 ^2 + 0.010000000000000000208 x45 ^2 + 0.010000000000000000208 x46 ^2 + 0.010000000000000000208 x47 ^2 + 0.010000000000000000208 x48 ^2 + 0.010000000000000000208 x49 ^2 + 0.010000000000000000208 x50 ^2 + 0.010000000000000000208 x51 ^2 + 0.010000000000000000208 x52 ^2 + 0.010000000000000000208 x53 ^2 + 0.010000000000000000208 x54 ^2 + 0.010000000000000000208 x55 ^2 + 0.010000000000000000208 x56 ^2 + 0.010000000000000000208 x57 ^2 + 0.010000000000000000208 x58 ^2 + 0.010000000000000000208 x59 ^2 + 0.010000000000000000208 x60 ^2 + 0.010000000000000000208 x61 ^2 + 0.010000000000000000208 x62 ^2 + 0.010000000000000000208 x63 ^2 + 0.010000000000000000208 x64 ^2 + 0.010000000000000000208 x65 ^2 + 0.010000000000000000208 x66 ^2 + 0.010000000000000000208 x67 ^2 + 0.010000000000000000208 x68 ^2 + 0.010000000000000000208 x69 ^2 + 0.010000000000000000208 x70 ^2 + 0.010000000000000000208 x71 ^2 + 0.010000000000000000208 x72 ^2 + 0.010000000000000000208 x73 ^2 + 0.010000000000000000208 x74 ^2 + 0.010000000000000000208 x75 ^2 + 0.010000000000000000208 x76 ^2 + 0.010000000000000000208 x77 ^2 + 0.010000000000000000208 x78 ^2 + 0.010000000000000000208 x79 ^2 + 0.010000000000000000208 x80 ^2 + 0.010000000000000000208 x81 ^2 + 0.010000000000000000208 x82 ^2 + 0.010000000000000000208 x83 ^2 + 0.010000000000000000208 x84 ^2 + 0.010000000000000000208 x85 ^2 + 0.010000000000000000208 x86 ^2 + 0.010000000000000000208 x87 ^2 + 0.010000000000000000208 x88 ^2 + 0.010000000000000000208 x89 ^2 + 0.010000000000000000208 x90 ^2 + 0.010000000000000000208 x91 ^2 + 0.010000000000000000208 x92 ^2 + 0.010000000000000000208 x93 ^2 + 0.010000000000000000208 x94 ^2 + 0.010000000000000000208 x95 ^2 + 0.010000000000000000208 x96 ^2 ] / 2

Subject To

Bounds
-60 <= x1 <= 100000 
-60 <= x2 <= 100000 
-60 <= x3 <= 100000 
-60 <= x4 <= 100000 
-60 <= x5 <= 100000 
-60 <= x6 <= 100000 
-60 <= x7 <= 100000 
-60 <= x8 <= 100000 
-60 <= x9 <= 100000 
-60 <= x10 <= 100000 
-60 <= x11 <= 100000 
-60 <= x12 <= 100000 
-60 <= x13 <= 100000 
-60 <= x14 <= 100000 
-60 <= x15 <= 100000 
-60 <= x16 <= 100000 
-60 <= x17 <= 100000 
-60 <= x18 <= 100000 
-60 <= x19 <= 100000 
-60 <= x20 <= 100000 
-60 <= x21 <= 100000 
-60 <= x22 <= 100000 
-60 <= x23 <= 100000 
-60 <= x24 <= 100000 
-60 <= x25 <= 100000 
-60 <= x26 <= 100000 
-60 <= x27 <= 100000 
-60 <= x28 <= 100000 
-60 <= x29 <= 100000 
-60 <= x30 <= 100000 
-60 <= x31 <= 100000 
-60 <= x32 <= 100000 
-60 <= x33 <= 100000 
-60 <= x34 <= 100000 
-60 <= x35 <= 100000 
-60 <= x36 <= 100000 
-60 <= x37 <= 100000 
-60 <= x38 <= 100000 
-60 <= x39 <= 100000 
-60 <= x40 <= 100000 
-60 <= x41 <= 100000 
-60 <= x42 <= 100000 
-60 <= x43 <= 100000 
-60 <= x44 <= 100000 
-60 <= x45 <= 100000 
-60 <= x46 <= 100000 
-60 <= x47 <= 100000 
-60 <= x48 <= 100000 
-60 <= x49 <= 100000 
-60 <= x50 <= 100000 
-60 <= x51 <= 100000 
-60 <= x52 <= 100000 
-60 <= x53 <= 100000 
-60 <= x54 <= 100000 
-60 <= x55 <= 100000 
-60 <= x56 <= 100000 
-60 <= x57 <= 100000 
-60 <= x58 <= 100000 
-60 <= x59 <= 100000 
-60 <= x60 <= 100000 
-60 <= x61 <= 100000 
-60 <= x62 <= 100000 
-60 <= x63 <= 100000 
-60 <= x64 <= 100000 
-60 <= x65 <= 100000 
-60 <= x66 <= 100000 
-60 <= x67 <= 100000 
-60 <= x68 <= 100000 
-60 <= x69 <= 100000 
-60 <= x70 <= 100000 
-60 <= x71 <= 100000 
-60 <= x72 <= 100000 
-60 <= x73 <= 100000 
-60 <= x74 <= 100000 
-60 <= x75 <= 100000 
-60 <= x76 <= 100000 
-60 <= x77 <= 100000 
-60 <= x78 <= 100000 
-60 <= x79 <= 100000 
-60 <= x80 <= 100000 
-60 <= x81 <= 100000 
-60 <= x82 <= 100000 
-60 <= x83 <= 100000 
-60 <= x84 <= 100000 
-60 <= x85 <= 100000 
-60 <= x86 <= 100000 
-60 <= x87 <= 100000 
-60 <= x88 <= 100000 
-60 <= x89 <= 100000 
-60 <= x90 <= 100000 
-60 <= x91 <= 100000 
-60 <= x92 <= 100000 
-60 <= x93 <= 100000 
-60 <= x94 <= 100000 
-60 <= x95 <= 100000 
-60 <= x96 <= 100000 

End

Pretty cool, right? Now my job was to compare the files and look for the bug. I noticed that the YALMIP-generated model defined bounds on the variables like

-60 <= x1 <= 100000

while the bounds in the CPLEX-generated model looked very different:

x1 >= 4.94065645841247e-324

Note: bounds are different from constraints. I was setting the -60 and 100000 limits as constraints in Java. Bounds tell the solver the minimum and maximum values that the variables we are solving for can take.
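
To make the distinction concrete, here is a minimal sketch in the CPLEX Java API (the variable names are mine, not from my actual code):

import ilog.concert.IloException;
import ilog.concert.IloNumVar;
import ilog.cplex.IloCplex;

public class BoundsVsConstraints
{
	public static void main(String[] args) throws IloException
	{
		IloCplex cplex = new IloCplex();

		// Bounds are part of the variable definition: the solver will never
		// even consider values outside [-60, 100000] for x.
		IloNumVar x = cplex.numVar(-60, 100000, "x");

		// A constraint is a separate row in the model; the variable itself
		// keeps whatever bounds it was created with.
		IloNumVar y = cplex.numVar(-Double.MAX_VALUE, Double.MAX_VALUE, "y");
		cplex.addGe(y, -60);     // y >= -60
		cplex.addLe(y, 100000);  // y <= 100000

		cplex.end();
	}
}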

Clue #6: Something is wrong with the bounds.

The bound generated by CPLEX looked really familiar: it was Double.MIN_VALUE. So where had I used it in my code? I found it; I had used it in my optimize() method to define the bounds of my variables.

IloNumVar[] x_n = cplex.numVarArray(masterData.getPrice().length, Double.MIN_VALUE, Double.MAX_VALUE);

OH CRAP !
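
The gotcha: in Java, Double.MIN_VALUE is the smallest positive double (about 4.9e-324), not the most negative one, so my lower bound was effectively forcing every x_n to be non-negative. A quick check:

public class MinValueGotcha
{
	public static void main(String[] args)
	{
		// Smallest POSITIVE double, not the most negative double
		System.out.println(Double.MIN_VALUE);   // 4.9E-324

		// What an "unbounded below" variable actually needs
		System.out.println(-Double.MAX_VALUE);  // -1.7976931348623157E308
	}
}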

I changed it to the logically correct bounds:

IloNumVar[] x_n = cplex.numVarArray(masterData.getPrice().length, -60, 100000);

 

VOILA !!!!

My code was fixed. This was the bug.

Who would have thought that a simple boundary condition could mess up the code so badly that it took multiple days just to figure out what was happening? I was really happy to finally find the issue, because it mattered a lot to me.

By the way, why was the output of the 4th iteration wrong and not the others?

I do not have a definitive answer, but my guess is that the output of the first 3 iterations was also slightly off and the difference was just not noticeable. Another possibility is that the broken lower bound (effectively zero, as explained above) only starts to matter once the optimizer wants negative values for x_n, which may simply not have happened in the first few iterations.

 

So, being a computer scientist sometimes requires detective-level debugging. It might cost you days, but the journey is worth it. I learned so much extra stuff that I might not have tried otherwise.

Drop a comment if you liked this or have been in a similar situation.

References:

[1] – https://mediatum.ub.tum.de/doc/1187583/1187583.pdf

[2] – https://web.stanford.edu/~boyd/papers/pdf/admm_distr_stats.pdf

[3] – http://www-01.ibm.com/software/commerce/optimization/cplex-optimizer/

[4] – http://cvxr.com/cvx/

[5] – http://users.isy.liu.se/johanl/yalmip/


