One way to calculate base 2 logarithms

Computing logarithms is a fairly common operation in digital signal processing. Most often, only convolutions (multiplication with accumulation) and amplitudes with phases have to be considered. As a rule, to calculate the logarithms on an FPGA, the CORDIC algorithm is used in a hyperbolic version, requiring only tables and simple arithmetic operations. However, this is not always convenient, especially if the project is large, the crystal is small and dances with optimization begin. It was with such a situation that I had to face one day. Both ports of the RAM block (Cyclone IV) were already working tightly, leaving no free windows. I did not want to use another block for hyperbolic CORDIC. But there was a multiplier, for which a decent free window appeared in the time diagram. After thinking a day, I composed the following algorithm, in which tables are not used, but there is multiplication, more precisely squaring. And since circuit squaring is simpler than the general case of multiplication, perhaps this algorithm is of interest to specialized microcircuits, although of course there is no difference for FPGAs. More details under the cut.



Explaining what's what, it's easier for real numbers. Let's start with them. We proceed to the integer implementation later.



Let there be a number X. Find the number Y such that X=2Y .

We also assume that X lies in the range from 1 to 2. This does not limit the generality too much, since X can always be transferred to this interval by multiplication or division by a power of two. For Y, this will mean adding or subtracting an integer, which is easy. So X lies in the interval from 1 to 2. Then Y will lie in the interval from 0 to 1. We write Y as an infinite binary fraction:





Y=b020+b12βˆ’1+...+bn2βˆ’n+...







Odds bi there is nothing else in this record than just the bits of the binary representation of the number Y. Moreover, since Y is less than 1, it is obvious that b0 = 0.



Let's square our first equation: X2=22Y and as before, we write the binary representation of 2Y . It's obvious that





2Y=b120+b22βˆ’1+...+bn2βˆ’(nβˆ’1)+...





Those. bits bi remained the same, just the powers of two moved. Bat b0 not present in the view because it is zero.



Two cases are possible:



one) X2>2 , 2Y> 1, b1=1



2) X2<2 , 2Y <1, b1=0



In the first case, we take as the new value of X X2/2 in the second - X2 .



As a result, the task was reduced to the former. The new X again lies in the range from 1 to 2, the new Y from 0 to 1. But we learned one bit of the result. Taking the same steps in the future, we can get as many bits of Y as possible.



Let's see how it works in a C program:



#include <stdio.h> #include <math.h> int main() { double w=1.4; double s=0.0; double a=0.5; double u=w; for(int i=0; i<16; i++) { u=u*u; if(u>2) { u=u/2; s+=a; } a*=0.5; } w=log2(w); double err=100*abs(2*(sw)/(s+w)); printf("res=%f, log=%f, err=%f%c\n",s,w,err,'%'); return 0; }
      
      





We calculated the logarithm with an accuracy of 16 bits and compared with what the mathematical library gives. The program brought:



res = 0.485413, log = 0.485427, err = 0.002931%



The result coincided with the library with an accuracy of 0.003%, which shows the efficiency of our algorithm.



Let's move on to an integer implementation.



Let N-bit binary unsigned numbers represent the interval [0, 1]. For convenience, we consider the unit number 2N , but not 2Nβˆ’1 , and accordingly a deuce number 2 N + 1 . We will write a program in the image and likeness of the previous one, but working with integers:



 #include <stdio.h> #include <math.h> #define DIG 18 //  #define N_BITS 16 //    unsigned ONE=1<<(DIG-1); // unsigned TWO=ONE<<1; // unsigned SCALE=1<<(N_BITS+1); //  unsigned myLog(unsigned w) { unsigned s=0; unsigned long long u=w; for(int i=0; i<N_BITS+1; i++) { s<<=1; u=(u*u)>>(DIG-1); //    ! if(u&TWO) //      { u>>=1; s+=1; } printf("%X\n", (int)u); } return s; } int main() { double w=1.2345678; unsigned iw=(unsigned)(ONE*w); double dlog=log2(w); unsigned ilog=myLog(iw); unsigned test=(unsigned)(SCALE*dlog); int err=abs((int)(ilog-test)); printf("val=0x%X, res=0x%X, log=0x%X, err=%d\n",iw,ilog,test,err); return 0; }
      
      





Having played in a program with different bit depths (DIG), calculation accuracy (N_BITS), and logarithm arguments (w), we see that everything is calculated correctly. In particular, with those parameters that are specified in this source, the program produces:



val = 0x27819, res = 0x9BA5, log = 0x9BA6, err = 1



Now everything is ready in order to implement a piece of iron on veril, doing exactly the same thing as the myLog function in C. The variables s and u in our function can be printed in a loop and compared with what the verilog simulator produces. The correspondence of these variables to the iron implementation is very transparent and understandable. u is a working register that takes new values ​​of X during iterations. s is a shift register in which the result is accumulated. The interface of our module will look like this:



 module logarithm( input clk, // input wr, //   input[17:0] din, //   output[nbits-1:0] dout, //   output rdy //  ); parameter nbits=16; // 
      
      





The input bus is adopted 18-bit, respectively, the width of the multipliers in Cyclone IV. The numbers on our module should come normalized. Those. with high bit equal to one. In my project, this was done automatically. But in which case to implement the normalizer, I think it’s not difficult for anyone. The accuracy of the calculations is set by the nbits parameter, by default equal to 16. The module counts one bit per cycle, and for 16 cycles it calculates the logarithm with an accuracy of 16 bits. If you need faster with the same accuracy or more precisely with the same speed, I hope it will not be difficult for anyone to divide the module into several devices and convey.



Here is the complete module and test code
 //--------------------- logarithm.v ------------------------------// module logarithm( input clk, // input wr, //   input[17:0] din, //   output[nbits-1:0] dout, //   output rdy //  ); parameter nbits=16; //  reg[4:0] cnt; // reg[17:0] acc; // - reg[nbits-1:0] res; // always @(posedge clk) if(wr) cnt<=nbits+1; else if(cnt != 0) cnt<=cnt-1; wire[35:0] square=acc*acc; //  wire bit=square[35]; //  wire[17:0] next = bit ? square[35:18] : square[34:17]; //  always @(posedge clk) if(wr) acc<=din; else if(cnt != 0) begin acc<=next; #10 $display("%X", acc); end always @(posedge clk) if(wr) res<=0; else if(cnt != 0) begin res[nbits-1:1]<=res[nbits-2:0]; res[0]<=bit; end assign dout=res; assign rdy=(cnt==0); endmodule //======================== testbench.v =====================// module testbench(); reg clk; // always #100 clk=~clk; reg wr; // reg[17:0] din; // wire rdy; // wire[15:0] dout; // logarithm log2( .clk (clk), .wr (wr), .din (din), .dout (dout), .rdy (rdy) ); //  n     task skipClk(integer n); integer i; begin for(i=0; i<n; i=i+1) @(posedge clk); #10 ; end endtask initial begin // $dumpfile("testbench.vcd"); $dumpvars(0, testbench); clk=0; wr=0; din=18'h27819; skipClk(3); wr=1; skipClk(1); wr=0; @(rdy); skipClk(3); $display("value=%X, result=%X", din, dout); $display("Done !"); $finish; end endmodule
      
      







Run the test with this script:



 #!/bin/sh rm -f *.vvp rm -f *.vcd iverilog -o testbench.vvp logarithm.v testbench.v vvp testbench.vvp gtkwave testbench.vcd testbench.gtkw
      
      





Running the test, we see the final output of the simulator - value = 27819, result = 9ba5. Verilog gave the same thing as C. The timing diagram here is quite trivial and is not of particular interest. Therefore, I do not bring it.



Compare the intermediate output of the simulator (acc) and the program in C (s):
Verilog C

30c5d 30C5D

252b1 252B1

2b2bc 2B2BC

3a3dc 3A3DC

35002 35002

2be43 2BE43

3c339 3C339

38a0d 38A0D

321b0 321B0

273a3 273A3

30163 30163

24214 24214

28caf 28CAF

34005 34005

2a408 2A408

37c9d 37C9D

30a15 30A15









Make sure they match bit by bit. In total, the implementation on a verilo repeats the model on C up to a bit. This is the result that should be achieved by implementing algorithms in hardware.



That’s probably all. I hope someone will find this my experience useful.



All Articles