Showing posts with label Algorithm. Show all posts
Showing posts with label Algorithm. Show all posts

Sunday, February 25, 2024

My Arbitrary Precision Arithmetic Libraries


Hello C# code enjoyers. It's been a long time since I updated this blog, but that is going to change. I've been keeping a folder of blog post ideas and typed up thoughts on subjects that I think would fit here. So I have a lot of posts I wanted to make, about cool things I've written and discovered along the way.

For the coding side of things, boy have I been busy. I've written a number of arbitrary precision numeric type libraries, and other maths related projects.

I wanted to share some of these projects with you.


General Number Theory Sieve

The general number field sieve (GNFS) is currently the most efficient algorithm known for factoring very large semiprime numbers. It is the primary algorithm that has been used to set the last several world record factorizations.
  • GNFS - This was a long, difficult project that took over a year to develop and required me teaching myself the basics of abstract algebra to be able to finish it. This project was intended to be more of a C# reference implementation of the General Number Field Sieve algorithm for the purpose of better understanding the General Number Field Sieve algorithm better than it was to be anything performant or fast.


Arbitrary Precision Arithmetic Types

  • BigDecimal - An arbitrary precision, base-10 floating point number class. This is probably the most popular of my numeric libraries. It's getting quite popular, and may soon become the most downloaded 'big decimal' library on Nuget! One of the features my library has that its competitors do not is Trigonometric functions, Hyperbolic Trigonometric functions, and inverse trigonometric functions. Yes, to arbitrary precision. It accomplishes this by using the taylor series, and for the trig functions where the taylor series converges too slowly, ArcCosine for example, it uses continued fractions. The iteration of the infinite series stops once the difference between two iterations becomes less than some threshold you supply in terms of powers of negative 10, which is the same as saying how many digits past the decimal point do you want it to be accurate to.


  • BigRational - Arbitrary precision number with arithmetic, except this one stores the value as an improper fraction under the hood. Actually, thats the fraction class of this library. BigRational represents the value as a mixed fraction. That is: Integer value + Fractional value


  • BigComplex - Essentially the same thing as System.Numerics.Complex except that it uses a System.Numerics.BigInteger type for the real and imaginary parts instead of a double.


Polynomials

  • Polynomial - The original. A univariate polynomial that uses System.Numerics.BigInteger for the type of the indeterminate (variables/letters).
  • ComplexPolynomial - A univariate polynomial library that has System.Numerics.Complex type indeterminates.


And some variations using different underlying types for the variable:

Multivariate

  • MultivariatePolynomial - A multivariate polynomial (meaning more than one indeterminate, e.g. 2XY^2) which uses BigInteger as the type for the indeterminates


Generic Arithmetic


Seeing all these Polynomial libraries, you might be thinking: Wouldn't you rather make a class that you write only once and swap out the underlying type? Yes. Yes I would. Unfortunately, inheritance doesnt work. I thought generics would do it, but there is no generic constraints that says 'this is an arithmetic type. allow the +-*/ operators to work on these types in the normal way'. Thus the arithmetic operations are not available and thats kinda the whole point. 

Fortunately, there was a way to call each type's respective operator overload function dynamically at runtime by building a Linq.Expressions LambdaExpression that invokes the appropriate function on the class. This allows for you to specify the type as the generic type T with this generic class library just like you might have expected to be possible natively.

  • GenericArithmetic - A core math library. Its a class of static methods that allows you to perform arithmetic on an arbitrary numeric type, represented by the generic type T, who's concrete type is decided by the caller. This is implemented using System.Linq.Expressions and reflection to resolve the type's static overloadable operator methods at runtime, so it works on all the .NET numeric types automagically, as well as any custom numeric type, provided it overloads the numeric operators and standard method names for other common functions (Min, Max, Abs, Sqrt, Parse, Sign, Log, Round, etc.). Every generic arithmetic class listed below takes a dependency on this class.

After writing that, naturally, I had to make:
  • GenericPolynomial - A univariate polynomial library that allows the indeterminate to be of an arbitrary type, as long as said type implements operator overloading. This is implemented dynamically, at run time, calling the operator overload methods using Linq.Expressions and reflection.
  • GenericMultivariatePolynomial - A multivariate polynomial that allows the indeterminates to be of [the same] arbitrary type. GenericMultivariatePolynomial is to MultivariatePolynomial what GenericPolynomial is to Polynomial, and indeed is implemented using the same strategy as GenericPolynomial (i.e. dynamic calling of the operator overload methods at runtime using Linq.Expressions and reflection).
  • GenericVector - A generic Vector numeric type. Supports: Scalar arithmetic, vector arithmetic, square root, dot product, normal, reflection, distance, lerp, sum of squares and cosine similarity.


Miscellaneous

And some miscellaneous and smallish libraries playing around with various types of arithmetic:

  • Continued Fraction - A continued fraction class. Arbitrary precision. Supports converting a continued fraction into rational approximations up to precision.

  • IntervalArithmetic - Instead of representing a value as a single number, interval arithmetic represents each value as a mathematical interval, or range of possibilities, [a,b], and allows the standard arithmetic operations to be performed upon them too, adjusting or scaling the underlying interval range as appropriate. This is an interesting way of doing things and has some narrow uses. Despite that its sort of feels like a solution in need of a problem. See Wikipedia's article on Interval Arithmetic for further information.






Friday, April 13, 2018

Pascal's Triangle



Pascal's triangle has a lot of mathematically interesting properties; It represents binomial coefficients (n choose k or combination of a set), you can find within it the powers of 2, the powers of 11, the Fibonacci sequence, Sierpinski's triangle (a fractal), all the figurate numbers, Mersenne numbers and Catalan numbers, just to name a few. Furthermore, generating Pascal's triangle is quite simple, requiring only addition. Sometimes in mathematics something is both profound AND simple to understand. Such is the case with Pascal's triangle.

Recently, I made a contribution to rosettacode.org. If you have not checked out rosettacode.org before, you should definitely do so. I made a contribution to the Pascal's triangle task for the C# language. You can check it out here, or you can just view the code below.

My version of Pascal's Triangle is short, succinct, uses the BigInteger class for arbitrarily large numbers, and uses the algorithm to generate a single row of Pascal's triangle without needing to generate every row before it. This was originally the use case for writing this code in the first place; I wanted to generate high-numbered rows in a computationally feasible way. What does row # 5000 look like, for example? Well, could use the code below and generate the 5000th row of Pascal's triangle and format it as a string with the following one-liner: `string result = string.Join(" ", PascalsTriangle.GetRow(5000).Select(n => n.ToString()));`. You can also, of course, generate actual triangle-shaped rows of numbers:
                                                           1                                                           
                                                        1     1                                                        
                                                     1     2     1                                                     
                                                  1     3     3     1                                                  
                                               1     4     6     4     1                                               
                                            1     5    10    10     5     1                                            
                                         1     6    15    20    15     6     1                                         
                                      1     7    21    35    35    21     7     1                                      
                                   1     8    28    56    70    56    28     8     1                                   
                                1     9    36    84    126   126   84    36     9     1                                
                             1    10    45    120   210   252   210   120   45    10     1                             
                          1    11    55    165   330   462   462   330   165   55    11     1                          
                       1    12    66    220   495   792   924   792   495   220   66    12     1                       
                    1    13    78    286   715  1287  1716  1716  1287   715   286   78    13     1                    
                 1    14    91    364  1001  2002  3003  3432  3003  2002  1001   364   91    14     1                 
              1    15    105   455  1365  3003  5005  6435  6435  5005  3003  1365   455   105   15     1              
           1    16    120   560  1820  4368  8008  11440 12870 11440 8008  4368  1820   560   120   16     1           
        1    17    136   680  2380  6188  12376 19448 24310 24310 19448 12376 6188  2380   680   136   17     1        
     1    18    153   816  3060  8568  18564 31824 43758 48620 43758 31824 18564 8568  3060   816   153   18     1     
  1    19    171   969  3876  11628 27132 50388 75582 92378 92378 75582 50388 27132 11628 3876   969   171   19     1  

Awww, yes... that's pleasing.

Much of the pleasing effect is due to the CenterString(string, int) method (below).



And the code to do all that is thus:

public static class PascalsTriangle
{  
 public static IEnumerable GetTriangle(int quantityOfRows)
 {
  IEnumerable range = Enumerable.Range(0, quantityOfRows).Select(num => new BigInteger(num));
  return range.Select(num => GetRow(num).ToArray());
 }

 public static IEnumerable GetRow(BigInteger rowNumber)
 {
  BigInteger denominator = 1;
  BigInteger numerator = rowNumber;

  BigInteger currentValue = 1;
  for (BigInteger counter = 0; counter <= rowNumber; counter++)
  {
   yield return currentValue;
   currentValue = BigInteger.Multiply(currentValue, numerator--);
   currentValue = BigInteger.Divide(currentValue, denominator++);
  }
  yield break;
 }

 public static string FormatTriangleString(IEnumerable triangle)
 {
  int maxDigitWidth = triangle.Last().Max().ToString().Length;
  IEnumerable rows = triangle.Select(arr =>
    string.Join(" ", arr.Select(array => CenterString(array.ToString(), maxDigitWidth)) )
  );
  int maxRowWidth = rows.Last().Length;
  return string.Join(Environment.NewLine, rows.Select(row => CenterString(row, maxRowWidth)));
 }

 private static string CenterString(string text, int width)
 {
  int spaces = width - text.Length;
  int padLeft = (spaces / 2) + text.Length;
  return text.PadLeft(padLeft).PadRight(width);
 }
}
Note: This requires the System.Numerics library.

I make liberal use of Linq to keep the code short, yet expressive.

The code that generated the Pascal's triangle above, is:

IEnumerable triangle = PascalsTriangle.GetTriangle(20);
string output = PascalsTriangle.FormatTriangleString(triangle)
Console.WriteLine(output);


Other, arbitrarily large, number types available include the BigDecimal class, the BigComplex class, and the BigRational class. They are all available on my GitHub.


Thanks for stopping by!


Tuesday, December 20, 2016

Equidistribution of ornaments across your Christmas tree's lateral surface area




Introduction


I have been learning a lot of math lately. I now know more math than I ever fathomed I would, more than I knew there was out there to learn. This is great, and has given me a (much improved) altered view of the world. I now see and approach everything through mathematical lenses. Even the most mundane or routine day-to-day tasks can benefit from this outlook. For example: Decorating the Christmas tree.



The problem


It seems like just about everybody is concerned with ensuring the ornaments on the Christmas tree are placed evenly, and there is no bunching or grouping of ornaments on one side of the tree versus the other. This is usually achieved by eyeballing it; a totally subjective experience. Well, this concern came up again this year as it often does, except this time, I had my mathematical lenses...

Lets say you want to achieve near-perfect spacing of the ornaments on your Christmas tree, and all you have is a ruler, or maybe a carpenters square (apparently Sheldon does this on The Big Bang Theory, as my friend explained to me while I was writing this).

You can figure out the number of ornaments you have, since you can count them, so how do you know how far apart to space each ornament?



The strategy


Thinking about it, it seems the simplest way would be to divide the number of ornaments you have over the area of the ornament-hanging real estate, and we would get a result in terms of 1 ornament per X feet^2. I think we can work with that.



Surface area of the ornament-space of a standard tree


The formula for calculating the surface areas is well know for a wide variety of shapes. Given a shape, we can just google for the formula. Take a gander at your Christmas tree, what shape does it remind you of? Well, its conical. We can generalize a Christmas tree to the shape of a cone:

Now when I first did this, the result I got just didn't seem right, I was getting a result of ~91.5 square feet. There was no way my tree was exposing 91 square feet of ornament space! Well, the surface area of a cone includes the base, of course, and not too many people hang Christmas ornaments underneath of the tree. Whoops!

So after some googling, I found out that the surface area of the slanted part of the cone is called the Lateral Surface Area of a cone.



The formula


So all that is required is for you measure the radius of your Christmas tree, the height from the base, plug those numbers into the above formula, and that returns the surface area in square units. The units are whatever units you put into the formula, typically inches.

Note: The radius (r) is measured from the trunk out to the end of the branches at the base. The height (h) is measured from the base to the top. Do not measure from the floor to the tip, as this will include the length of the trunk, and since the surface area per unit of height is greatest at the bottom, this will throw your surface area calculation off by a lot.

Then, dividing the number of ornaments you have by the surface area gives you the number or ornaments per square unit. You can then at this point, cut out a square piece of paper to match the area that one ornament should occupy. Do this four times to make four squares. If you have thicker paper or card stock, use that. Tape the four squares together, two on top, two on bottom, offset by half a square, like seen here:


This will be your template. This gives you a frame of reference, so you can align ornaments in reference to ones already hung. You can even draw and cut out circles in the center to obviate any need for guesswork or eye-balling their placement.


Thursday, November 24, 2016

EntropyGlance

Entropy at a glance



In a hurry? Skip straight to the C# source code - EntropyGlance; Entropy at a glance - A C# WinForms project - https://github.com/AdamWhiteHat/EntropyGlance



So I wrote an file entropy analysis tool for my friend, who works in infosec. Here it is, hands-down the coolest feature this tool offers is a System.Windows.Forms.DataVisualization.Charting visualization that graphs how the entropy changes across a whole file:



This application provides both Shannon (data) entropy and entropy as a compression ratio.
Get a more intuitive feel for the overall entropy at a glance with by visualizing both measures of entropy as a percentage of a progress bar, instead of just numbers.





   However, for those who love numbers, standard measures of entropy are also given as well. Information entropy is expressed both as the quantity of bits/byte (on a range from 0 to 8), and as the 'normalized' value (range 0 to 1). High entropy means it the data is random-looking, like encrypted or compressed information.
   The Shannon 'specific' entropy calculation makes no assumptions about the type of message it is measuring. What this means is that while a message consisting of only 2 symbols will get a very low entropy score of 0.9/8, a message of 52 symbols (the alphabet, as lower case first, then upper) repeated in the same sequence one hundred times would be yield a higher-than-average score of 6/8.
   This is precisely why I included a compression ratio as a ranking of entropy that is much closer to notion of entropy that takes into account repeated patterns or predictable sequences, in the sense of Shannon's source coding theorem.



Dive deep into the symbol distribution and analysis. This screen gives you the per-symbol entropy value and the ability to sort by rank, symbol, ASCII value, count, entropy, and hex value:



As always, the C# source code is being provided, hosted on my GitHub:
EntropyGlance; Entropy at a glance - A C# WinForms project - https://github.com/AdamWhiteHat/EntropyGlance



Wednesday, September 21, 2016

RC4 stream cipher variants and visualization of table permutation state



Here, I present some work I have been doing on two RC4 stream cipher variants. The first variant, as seen below, I wrote to help me visualize and understand what the RC4 tables was doing, and help me understand its properties.

Identity Permutation

The class that contains it is called SimpleTable and is exactly that; The simplest R4C implementation possible. It is notable in the fact that it does not use key scheduling at all, and its starting state is that of the Identity Permutation. The identity permutation is where the value at index zero equals zero, the value at index one is one, and so on. An easy way to remember what the Identity Permutation is, just recall the notion of a Multiplicative Identity (which is 1), where by multiplying a number N by the Multiplicative Identity gives you back the value of N, also known as the identity. Similarly, the Identity Permutation of an array A just gives you A. This is the trivial permutation. That is, there is no permuting of the array at all!

Anyways, this is done to see the perfectly ordered state, and how each round effects that state. In this way, we can visually check for the avalanche effect. In order to visualize the table, i just assign each value 0 to 255 a different shade of grey (I also have a rainbow-colored option that might be easier to tell apart similar values). At each step I create a Bitmap by looping through the table. Below, you can find an animated GIF of the first 100 steps of this cipher being applied to the identity permutation:


Notice how it takes a while to get going, and the first several values don't move much at all. After 256 steps, or one round, the cursor arrives back at index zero. Because the location of the first several values have not moved much or at all, we can clearly see that a mere 256 steps is insufficient at permuting the state enough to avoid leaking the first part of your key. Therefore it it is vital to permute the table for several rounds (256 steps per round) before you start using the stream.

Wired Equivalent Protocol

As some of you may know, WEP used RC4 with a weak key schedule. The key is spread out over 256 bytes using the following approach:


j = (j + table[i] + key[i mod keylength]) % 256;
SwapValues(table[i], table[j]);


and then it began streaming bytes from the table. Typically a nonce is concatenated to the key. Every time the table is set up and/or the nonce changes, some information about the key is leaked. Obviously a more secure procedure would use a hash of the key and the nonce, instead of the plain-text key, and to toss away the first 1024 bytes or so.

Cycle Length

Because each step in an RC4 cipher is a permutation, there is a limit to the number of unique bytes that can be produced before it begins repeating. This is called the permutation cycle. The length of the permutation cycle depends on the exact starting state, but we can get an upper bound.

Since there are 256 elements in the array, and two indices into the array (i and j), there is a maximum of
256! * 256^2 = 5.62 * 10^512 = 2^1700
possible states. That's 4.6 * 10^488 yottabytes!

This is the maximum possible states, however, and other starting states could have less. If the RC4 algorithm performed as a random permutation (which it does not, it performs worse), the cycle length would be half of the theoretical maximum above. Luckily the number above is so vast, that even some faction of it is still so many bytes that all of humanity has never and likely will never have that much total storage.

One thing to watch out for, however is something called Finney States. If an RC4 is started in one of these Finney States, the length of the cycle is much, much reduced. The chance of randomly generating one of these starting states, however, is VERY, very low.

Strengthening RC4

As stated, and visualized, above, it is vital to permute the table for several rounds (at 256 steps per round) after the key schedule, discarding the bytes, before you start using the stream. Also, it would be foolish to use the actual bytes of the key for permuting the starting state. It would be instead better to use a hash of the key + nonce or a key derivation function from the key instead the actual value of the key itself.

Another idea is, after shuffling the table enough rounds to hide the key, scramble the table an additional number of rounds, that value being some function of the key. This increases the possible starting states by whatever your range is.

In the classic RC4, each step would return one byte. The number of steps taken before returning each byte is configurable in my implementation.

Memory hardening

Check out the experimental branch for a memory hardened version. It stores the key class in memory, with the key XORed with a one-time pad, and then is protected in memory from access with the System.Security.Cryptography.ProtectedMemory class.

Other uses

The pseudo-random byte stream from the RC4 table is deterministic. Therefore if two remote computers with a shared secret, both computers can independently set up an RC4 table with exactly the same starting state and will get the same sequence of bytes which would be difficult to guess, given just the stream of bytes. If the plain text is XORed by the pseudorandom byte stream, then it can be decrypted by XORing it by the same byte steam.

The project includes 2 variants: 1) A simple table with a method to visualize the permutation state of the table and the avalanche effect as a bitmap 2) A more serious attempt at a secure implementation.


NOTE: THIS HAS NOT BEEN CRYPTO-ANALYZED AND PROBABLY NOT ACTUALLY SECURE, SO DO NOT TRUST IT!

Screenshots




Source code

Here is the GitHub page to the project (master branch).
Or just directly download the Zip file (experimental branch).




Tuesday, August 16, 2016

Lorenz Chaos Attractor



This project was inspired by one of Daniel Shiffman's 10 minute coding challenge YouTube videos, The Lorenz Attractor in Processing.


        dX = ((A * y)  -  (A * x)) * time;
        dY = ((B * x) -y -(x * z)) * time;
        dZ = ((x * y)  -  (c * z)) * time;

So it turns out this it not too terribly exciting. While its true that adjusting the starting values by a small amount change the behavior, if you go much outside the values its currently set for, you will end up with a pattern that quickly degenerates to a single, boring point. Personally, I was hoping for a more chaotic system. You might notice I am not using the 3rd point. I have yet to find a 3D drawing library that I like, though I need one for visualizing other projects. Anyways, since this was an experiment, I did the pragmatic thing and just made it 2D since I already knew how to do that.

Here is the result:




GitHub project

It wanted to draw the pattern very small, so I had to scale up the image by multiplying each number by some scale number.

One possibly useful idea is to use the cosine of the tangent of each number. This has the effect of canceling out the spiral and spreading the numbers out over a field. If you use just the tangent, you get a gradient from the top left corner. Perhaps you could use this as a pseudo-random noise source.


public static void TanCos(Lorenz system)
{
        system.x = 16 * (decimal)Math.Tan(Math.Cos((double)system.x));
        system.y = 16 * (decimal)Math.Tan(Math.Cos((double)system.y));
}

public static void Tan(Lorenz system)
{
        system.x = 6 * (decimal)Math.Tan((double)system.x);
        system.y = 6 * (decimal)Math.Tan((double)system.y);
}


Wednesday, June 29, 2016

Bloom Filter - A novel, space efficient data structure like a hash-table for billions of values.




Introduction


A bloom filter is a truly novel data structure. Similar to a hash table, it can tell you if you've hashed a particular value previously. You can add many, many more values to a bloom filter than you can to a hash table, does not degrade performance as the number of values in the set grows large, and requires only a fraction of the space of a hash table to store it!

This is not just an academic exercise, or something that only works in theory or in special cases. Indeed, companies like google use bloom filters to quickly determine if it has never seen that value before, thus avoiding a more costly lookup against a database every time the bloomfilter returns false.



Probabilistic


First off, its important to understand that a bloom filter is NOT a hash table, it operates in an entirely different way. A bloom-filter is what is known as a probabilistic data structure. What this means is, that it can tell you to within a certain probability, if an element exists in a set. In other words, false positive matches ARE possible, but false negative matches ARE NOT possible. For example, if you check a bloom filter for the existence of a value, and it returns false, you can know with 100% certainty that the bloom filter does not contain that value value before. However, if you test a value against the filter and it returns true, there is a small probability that it has in fact not seen that value before, but is returning a false positive. How big of a probability? Here's the beauty: It can be as small as you want it to be. It depends on a few factors, including the size of the filter, how full it is, and how many bits you use to store each value in the filter.

In a HashTable class, each item is stored as a key value pair, so the size of your object plus a 32 bit integer. Contrast that to a bloom filter, which stores only about 3-7 bits per value hashed. Also, my implementation applies compression when saving the filter to disk, providing even more space savings. A bloom filter with 160,000 values hashed and a 1% collision probability results in a filter that is 235KB uncompressed, and a whopping 54KB when compressed! Remember the filter is an array of bits. The entropy of the array is going to be at its greatest, and thus the compression ratio lowest, when exactly 1/2 of the bits are flipped, or the filter is half-way 'full'. This has the unusual property of getting smaller as you add more hashes to the filter. Actually, this is misleading--the actual filter itself never changes size, its only the compressed version that varies in size.

To handle the compression I just used the System.IO.Compression.DeflateStream class. An important note about working with this class: build an array of bytes and send your entire file in one go. In this way it will compress the whole file as one chunk. If you sent data to this stream piecemeal, it will compress each piece separately and you will get a poor compression ratio.

How it works


So how does this all work? The filter part of a bloom filter is just a large array of bits. You also require several different hash functions that each return a unique result for the same input value. When you add a value to the filter, the value is sent to about 3-7 different hash functions. Each hash function will return a value that is between 0 and the number of bits in the filter. Each value is used as an index to access and element on the array of bits that is your filter. When hashing a value, you just set the bit at each index location in the array to 1. Then testing for the presence of a value in the filter, you pass the value to the hash functions the same way as above, then visit each index, checking to see if any of them are 0. If even one bit at one of those index positions are zero, it means the filter has never seen that value before, because it would have set all those bits to 1. If all the bits at the index locations are 1, then it is likely that the filter has seen that value before. However,there is a chance that it is a false positive, because it could be that that value's different hashes all mapped to bits from other values. As the filter becomes more full, more bits are set to 1, and so the odds of a false positive go up. To build your filter by supplying the estimated number of values you think you are likely to store in the filter, and don't go above a certain ratio of 1 bits to 0 bits. If you were to let your filter hash so many values that every bit got set to 1, then the probability of receiving a false positive for a random value becomes 100%.



Solving the many hash problem

As I mentioned before, this requires several different hash functions that each return a unique result for the same input value. Although I said 3-7 hash functions, you might require 14 or more, when working with filters that can handle large number of hashes or a low false positive likelihood or both.

Instead of writing a bunch of separate hash algorithms, I implemented a stream cipher where in I just scramble the cipher table by a number of rounds that is unique to that input. Then, I can return as many indices as the filter is configured for. This sets up the table once per value. It needs to reset the table or else the indices that we mark will depend on every value that came before it, and in that particular order. Currently the bottle-neck is how many times it has the scramble the table for each value. If you need to hash really long values, you'll want to lower the number of rounds it scrambles the table.



Variations


In this implementation, the bloom-filter size is set once you create it, meaning that it cannot grow bigger if it gets too full, nor can you resize this bloom filter to become smaller if you sized it too big. Because multiple values could rely on the same bit, this implementation does not support removal of items, because to do so would cause several values to begin reporting false negatives.

In order to make a bloom filter that supports deletion, use a number like a byte instead of bits in your filter, and each time you visit an index in the filter while adding values, increment the number you find there. Then, to delete a value, visit each index as you did before, but decrement the number there. This way, if two values map to the same index, that information is tracked by incrementing the value. This is what is known as a Counting Bloom Filter.

There are other variants of bloom filters out there, including bloom filters that can grow in size if it gets too full, but such a thing is beyond the scope of my needs. In essence, when the filter gets too full, you create another separate filter, and add new values by first checking the first filter to see if it exists, and if not, adding the value to the second filter. Checking for the presence of a value requires checking both (and other) filters. For information on scalable bloom filters, please see this whitepaper.

The code


My C# Bloom Filter project on GitHub
Or download zip here.








Thursday, December 24, 2015

Infix Notation Parser via Shunting-Yard Algorithm





Infix notation is the typical notation for writing equations in algebra.
An example would be: 7 - (2 * 5)

Parsing such an equation is not a trivial task, but I wanted one for my EquationFinder project, as I wanted to respect order of operations.

Strategies include substitution/replacement algorithms, recursion to parse into a tree and then tree traversal, or converting the infix notation to reverse polish notation (RPN), also known as post-fix notation, then using a stack based postfix notation evaluator. I choose the latter, as such algorithms are well defined in many places on the web.

My code consists of 3 classes, all static:
(Links go to the .cs file on GitHub)
  1. InfixNotation - this simply holds a few public variables and calls the public methods on the below two classes.
  2. ShuntingYardAlgorithm - this converts an equation in infix notation into postfix notation (aka RPN).
  3. PostfixNotation - this evaluates the equation in postfix notation and returns a numerical result value.

In order to implement the shunting-yard algorithm and the postfix evaluator, I simply wrote the steps to the algorithms as written on Wikipedia:
(Links go to the Wikipedia article)
Link to the Shunting-Yard Algorithm to convert Infix notation to Postfix notation.
Link to the Postfix Notation Evaluation Algorithm.


The code for this is pretty extensive, but I will prettify it and present it below. Alternatively, you can view and download the code from the MathNotationConverter project on my GitHub.


InfixNotationParser:


public static class InfixNotation
{
   public static string Numbers = "0123456789";
   public static string Operators = "+-*/^";

   public static bool IsNumeric(string text)
   {
      return string.IsNullOrWhiteSpace(text) ? false : text.All(c => Numbers.Contains(c));
   }

  public static int Evaluate(string infixNotationString)
  {
    string postFixNotationString = ShuntingYardConverter.Convert(infixNotationString);
    int result = PostfixNotation.Evaluate(postFixNotationString);
    return result;
  }
}



ShuntingYardConverter 

(converts an equation from infix notation into postfix notation):

public static class ShuntingYardAlgorithm
{
   private static string AllowedCharacters = InfixNotation.Numbers + InfixNotation.Operators + "()";

   private enum Associativity
   {
      Left, Right
   }
   private static Dictionary<char, int> PrecedenceDictionary = new Dictionary<char, int>()
   {
      {'(', 0}, {')', 0},
      {'+', 1}, {'-', 1},
      {'*', 2}, {'/', 2},
      {'^', 3}
   };
   private static Dictionary<char, Associativity> AssociativityDictionary = new Dictionary<char, Associativity>()
   {
      {'+', Associativity.Left},
      {'-', Associativity.Left},
      {'*', Associativity.Left},
      {'/', Associativity.Left},
      {'^', Associativity.Right}
   };

   private static void AddToOutput(List<char> output, params char[] chars)
   {
      if (chars != null && chars.Length > 0)
      {
         foreach (char c in chars)
         {
            output.Add(c);
         }
         output.Add(' ');
      }
   }
   
   public static string Convert(string infixNotationString)
   {
      if (string.IsNullOrWhiteSpace(infixNotationString))
      {
         throw new ArgumentException("Argument infixNotationString must not be null, empty or whitespace.", "infixNotationString");
      }

      List<char> output = new List<char>();
      Stack<char> operatorStack = new Stack<char>();
      string sanitizedString = new string(infixNotationString.Where(c => AllowedCharacters.Contains(c)).ToArray());

      string number = string.Empty;
      List<string> enumerableInfixTokens = new List<string>();
      foreach (char c in sanitizedString)
      {
         if (InfixNotation.Operators.Contains(c) || "()".Contains(c))
         {
            if (number.Length > 0)
            {
               enumerableInfixTokens.Add(number);
               number = string.Empty;
            }
            enumerableInfixTokens.Add(c.ToString());
         }
         else if (InfixNotation.Numbers.Contains(c))
         {
            number += c.ToString();
         }
         else
         {
            throw new Exception(string.Format("Unexpected character '{0}'.", c));
         }
      }

      if (number.Length > 0)
      {
         enumerableInfixTokens.Add(number);
         number = string.Empty;
      }

      foreach (string token in enumerableInfixTokens)
      {
         if (InfixNotation.IsNumeric(token))
         {
            AddToOutput(output, token.ToArray());
         }
         else if (token.Length == 1)
         {
            char c = token[0];

            if (InfixNotation.Numbers.Contains(c)) // Numbers (operands)
            {
               AddToOutput(output, c);
            }
            else if (InfixNotation.Operators.Contains(c)) // Operators
               if (operatorStack.Count > 0)
               {
                  char o = operatorStack.Peek();
                  if ((AssociativityDictionary[c] == Associativity.Left &&
                     PrecedenceDictionary[c] <= PrecedenceDictionary[o])
                        ||
                     (AssociativityDictionary[c] == Associativity.Right &&
                     PrecedenceDictionary[c] < PrecedenceDictionary[o]))
                  {
                     AddToOutput(output, operatorStack.Pop());
                  }
               }
               operatorStack.Push(c);
            }
            else if (c == '(') // open brace
            {
               operatorStack.Push(c);
            }
            else if (c == ')') // close brace
            {
               bool leftParenthesisFound = false;
               while (operatorStack.Count > 0 )
               {
                  char o = operatorStack.Peek();
                  if (o != '(')
                  {
                     AddToOutput(output, operatorStack.Pop());
                  }
                  else
                  {
                     operatorStack.Pop();
                     leftParenthesisFound = true;
                     break;
                  }
               }

               if (!leftParenthesisFound)
               {
                  throw new FormatException("The algebraic string contains mismatched parentheses (missing a left parenthesis).");
               }
            }
            else // wtf?
            {
               throw new Exception(string.Format("Unrecognized character '{0}'.", c));
            }
         }
         else
         {
            throw new Exception(string.Format("String '{0}' is not numeric and has a length greater than 1.", token));
         }
      } // end foreach

      while (operatorStack.Count > 0)
      {
         char o = operatorStack.Pop();
         if (o == '(')
         {
            throw new FormatException("The algebraic string contains mismatched parentheses (extra left parenthesis).");
         }
         else if (o == ')')
         {
            throw new FormatException("The algebraic string contains mismatched parentheses (extra right parenthesis).");
         }
         else
         {
            AddToOutput(output, o);
         }
      }

      return new string(output.ToArray());
   }
}










PostfixNotation

(evaluates the postfix notation and returns a numerical result):

public static class PostfixNotation
{
   private static string AllowedCharacters = InfixNotation.Numbers + InfixNotation.Operators + " ";

   public static int Evaluate(string postfixNotationString)
   {
      if (string.IsNullOrWhiteSpace(postfixNotationString))
      {
         throw new ArgumentException("Argument postfixNotationString must not be null, empty or whitespace.", "postfixNotationString");
      }

      Stack<string> stack = new Stack<string>();
      string sanitizedString = new string(postfixNotationString.Where(c => AllowedCharacters.Contains(c)).ToArray());
      List<string> enumerablePostfixTokens = sanitizedString.Split(new char[] { ' ' }, StringSplitOptions.RemoveEmptyEntries).ToList();

      foreach (string token in enumerablePostfixTokens)
      {
         if (token.Length > 0)
         {
            if (token.Length > 1)
            {
               if (InfixNotation.IsNumeric(token))
               {
                  stack.Push(token);
               }
               else
               {
                  throw new Exception("Operators and operands must be separated by a space.");
               }
            }
            else
            {
               char tokenChar = token[0];

               if (InfixNotation.Numbers.Contains(tokenChar))
               {
                  stack.Push(tokenChar.ToString());
               }
               else if (InfixNotation.Operators.Contains(tokenChar))
               {
                  if (stack.Count < 2)
                  {
                     throw new FormatException("The algebraic string has not sufficient values in the expression for the number of operators.");
                  }

                  string r = stack.Pop();
                  string l = stack.Pop();

                  int rhs = int.MinValue;
                  int lhs = int.MinValue;

                  bool parseSuccess = int.TryParse(r, out rhs);
                  parseSuccess &= int.TryParse(l, out lhs);
                  parseSuccess &= (rhs != int.MinValue && lhs != int.MinValue);

                  if (!parseSuccess)
                  {
                     throw new Exception("Unable to parse valueStack characters to Int32.");
                  }

                  int value = int.MinValue;
                  if (tokenChar == '+')
                  {
                     value = lhs + rhs;
                  }
                  else if (tokenChar == '-')
                  {
                     value = lhs - rhs;
                  }
                  else if (tokenChar == '*')
                  {
                     value = lhs * rhs;
                  }
                  else if (tokenChar == '/')
                  {
                     value = lhs / rhs;
                  }
                  else if (tokenChar == '^')
                  {
                     value = (int)Math.Pow(lhs, rhs);
                  }

                  if (value != int.MinValue)
                  {
                     stack.Push(value.ToString());
                  }
                  else
                  {
                     throw new Exception("Value never got set.");
                  }
               }
               else
               {
                  throw new Exception(string.Format("Unrecognized character '{0}'.", tokenChar));
               }
            }
         }
         else
         {
            throw new Exception("Token length is less than one.");
         }
      }

      if (stack.Count == 1)
      {
         int result = 0;
         if (!int.TryParse(stack.Pop(), out result))
         {
            throw new Exception("Last value on stack could not be parsed into an integer.");
         }
         else
         {
            return result;
         }
      }
      else
      {
         throw new Exception("The input has too many values for the number of operators.");
      }

   } // method
} // class


Another alternative technique is to using the Shunting-Yard Algorithm to turn infix notation into an abstract syntax tree (Linq.Expressions anyone?). I will likely post this technique later.


Other blog posts by me that are related to this article are the Threaded Equation Finder, a Mixed Radix System Calulator and Drawing Text Along a Bezier Spline.



Thursday, October 29, 2015

Thinq - A Linq Experiment




Thinq - A Linq Experiment




   View/Download the source code from the project's GitHub



   So I wrote a program, just an experiment, where I was making a range class using IEnumerables (C#), and each element doesn't have to increment by one, but any amount. so I was creating ranges like 7 to 10 million, increment by 7, so upon enumeration it would yield multiples of 7. This is also called arithmetic progression.

   Then I started combining different multiples with query operators like Where operator or Intersect like IEnumerable result = multiples7.Intersect(multiples13.MoveNext()), essentially creating a function that keeps only those numbers that are multiples of both 7 and 13, starting with the least common multiple.

   So I began testing. After some playing, I decided to take the first 7 primes, and find any common multiples to them between 1 and 10 million. Much to my surprise, it found all the common multiples of the first 7 prime numbers under 10 million (there are only two of them, 4849845 & 9699690), and it did it in 500 milliseconds on some very modest hardware (1 core, 2.16GHz, 4GB ram).

   I bumped up the ceiling to 50 million and I got an OutOfMemoryException because the IEnumerable holds on to every value it gets from the function MoveNext(). I threw in some metrics and discovered that it took about 3 seconds and some 32-million, 64-bit integers for my computer to declare 'out of memory'.

   Well, at least it was fast, even if it did eat up all my ram in 3 seconds, it was still promising. 


   The solution was to create an IEnumerator that was aware of the arithmetic sequences that constrained the results set. When MoveNext() is called repeatedly during enumeration, I avoid the infinite memory requirement by restricting the result set returned from MoveNext(); it returns the next whole number that is divisible by every arithmetic sequence's 'common difference', or increment value. In this way, you have created a enumerable sequence that is the _intersection_ of all of the sequences.

   The enumerator is prevented from running to infinity by obeying two limits: A maximum numeric value (cardinal) that GetNext() will return to ("results less than 50 million") and a maximum quantity of results (ordinal) that GetNext() yields ("the one millionth result").    If either of these limits are exceeded, the while loop will fail to evaluate to true. It is very common for my processor-intensive, long running or 'mathy' applications to employ a temporal limit (maximum time-to-live) or support cancellation, but this little experiment has been so performant that I have been able to get by without one.

   So what kind of improvement did we get out of our custom enumerable? I can now find all the common factors for the first 8 prime numbers up to 1 billion in 25 seconds! I was impressed; the application used to max out around 50 million and run out of memory, and now it can investigate to one billion in a reasonable amount of time and the memory it uses is not much more than the 8 or so integers in the result set. 1 billion, however seems to be the sweet spot for my single 2.13 GHz laptop. I ran the same 8 primes to 2 billion and it took 1 minute, 12 seconds:



TIME ELAPSED: 01:12.38
LCM[3,5,7,11,13,17,19,23] (max 2,000,000,000)

17 FACTORS:

111546435 
223092870 
334639305 
446185740 
557732175 
669278610 
780825045 
892371480 
1003917915 
1115464350 
1227010785 
1338557220 
1450103655 
1561650090 
1673196525 
1784742960 
1896289395


Tuesday, October 6, 2015

Mixed Radix Numeral System class and Counter


Mixed Radix Calculator

   My 'Mixed Radix Calculator' creates a counting system of radices (plural of radix), such as base 12 or mixed radices such as Minutes/Hours/Days/Years: 365:24:60:60. I choose the left side to be the most significant side. This is merely a personal preference, and my MixedRadixSystem class supports displaying both alignments.

   Of course you dont have to choose a mixed radix numeral system, you can count in an N-base numeral system, such as base 7 or a more familiar base 16. Another feature lies in my RadixNumeral class. Each numeral, or place value, supports having its own dictionary of symbols.


Screenshot of Mixed Radix Calculator
      (Project released under Creative Commons)

-  52:7:24:60:60:1000  -


  A numeral system (or system of numeration) is a writing system for expressing numbers.


  The most familiar one is of course the decimal numeral system. This is a 10-base numbering system. Computers use a binary numeral system. The base is sometimes called the radix or scale.

  Not all numbering systems have just one base. Take for example, how we currently divide time: There are 60 seconds in a minute, 60 minutes in an hour, 24 hours in a day, and 365 days in a year. This is called a mixed radix numeral system, and one might express the above mixed radix system like: 365:24:60:60.

  https://en.wikipedia.org/wiki/Mixed_radix
  http://mathworld.wolfram.com/Base.html

Uses:
  I haven't found a lot of use cases for it yet, but it is interesting. I originally built this because I wanted to experiment with numeral systems that uses increasing consecutive prime numbers for each radix, as well as experiment with some off-bases, such as base 3 or base 7.

  In a single base, say base 7, then 'round numbers' with only one place value having a 1 and the rest having zeros, such as 1:0:0:0:0 (in base 7), such numbers are powers of 7, and ever other number except for the 1's place value is a multiple of 7.

  A mixed radix numeral system can represent a polynomial, and possibly provide for a simpler way to visualize and reason about them.

  Yet another possible use is to make a numeral system with a base that is larger than and co-prime to some other target number (say 256) to make a bijective map from every value in a byte to some other value exactly once by repeatedly adding the value of the co-prime, modulus 256. This can appear rather random (or sometimes not at all) but the mapping is easily determined given the co-prime. I have talked about this notion before on my blog
  https://csharpcodewhisperer.blogspot.com/search/label/Coprime

  If you like this project you would probably like my project EquationFinder, it finds equations given constraints
  https://github.com/AdamWhiteHat/EquationFinder


Tuesday, September 22, 2015

Threaded Equation Finder



Threaded Equation Finder

Find arithmetic equations that equates to a given 'target' value, number of terms, and operators.

Introduction

   You should all be familiar with how a typical computer works; you give it some variables, describe some quantities of some resources you have, choose an algorithm, let it process, and it returns to you a result or outcome. Now imagine a computer if you could work with a computer that worked the other way around. I believe it was Douglas Adams that described the notion of an all-together different type of computer; That is, you tell the computer what you want the outcome to be, and it goes off figuring out how to get there and what you need to do it. Z3, the Theorem Prover, and the constraint satisfaction problem (CSP) solver (and probably others) in Microsoft's Solver Foundation do almost exactly that.
   There is also the idea of Backcasting, which is a similar, but different idea.

   My program isn't as fancy as all that, but it does find equations that equates to a given 'target' value, albeit at random. You define constraints other than just the target value, such as what operators are allowed in the equation, the quantity of terms there, and the range or set of allowed terms.
For example, how many different ways can 9 nines equal 27, using only addition, subtraction, and multiplication, if evaluated left-to-right (ignore the order of operations)? Turns out there are only 67 ways.

(above) Application Screen Shot

How it works

   The actual approach for finding equations that equate to an arbitrary target or 'goal' value is rather primitive. By way of Brute Force, several threads are launched asynchronously to generate thousands of random equations, evaluate them and keep the results that have not been found yet.

   This is something I wrote approx. 2 years ago. I dug it up and decided to publish it, because I thought it was interesting. As this was an 'experiment', I created different ways of storing and evaluating the expressions, and have made those different 'strategies' conform to a common interface so I could easily swap them out to compare the different strategies. I have refactored the code so that each class that implements IEquation is in its own project and creates its own assembly.

   There are two fully-working strategies for representing equations: one that represented the equation as a list of 2-tuples (Term,Operator), did not perform order of operations, and was rather obtuse. The other strategy was to store the equation as a string and evaluate it using MSScriptControl.ScriptControl to Eval the string as a line of VBScript. This was unsurprisingly slower but allowed for much more robust equation evaluation. Order of operations is respected with the ScriptControl strategy, and opens the way to using using parenthesis.

   The other idea for a strategy which I have not implemented but would like to, would be a left-recursive Linq.Expression builder. Also, maybe I could somehow use Microsoft Solver Foundation for a wiser equation generation strategy than at random.



Limitations

   Today, however, there are better architectures. A concurrent system like this would benefit greatly from the Actor model. If you wanted to write complex queries against the stream of equations being generated or selected or solved, maybe reactive extensions would be a slam dunk.

   Although this project certainly is no Z3, it does provide an example of an interface... perhaps even a good one.



Running on Raspberry Pi 2

   Microsoft's Managed Extensibility Framework (MEF) might be a good thing here, but I also wrote a console client that is designed to be ran with Mono on the Raspberry Pi 2. MEF is a proprietary Microsoft .NET technology that is not supported in Mono. The extra meta data in the assembly shouldn't be a problem, but having a dependency on the MEF assembly will be. Probing of the runtime environment and dynamically loading of assemblies is required here, which I have not had time to do, so at this time, there is no MEF.

   The reason the mono client is a console application is because mono and winforms on the Raspberry Pi 2 fails for some people. The problem has something to do with a hardware floating point vs a software float, and it happens to manifest itself when using a TextBox control. The only thing that I haven't tried, and that should presumably fix it, is to re-build mono from the latest source.



Wednesday, December 3, 2014

Word Frequency Dictionary & Sorted Occurrence Ranking Dictionary for Generic Item Quantification and other Statistical Analyses



Taking a little inspiration from DotNetPerl's MultiMap, I created a generic class that uses a Dictionary under the hood to create a keyed collection that tracks the re-occurrences or frequency matching or duplicate items of arbitrary type. For lack of a better name, I call it a FrequencyDicionary. Instead of Key/Value pairs, the FrequencyDictionary has the notion of a Item/Frequency pair, with the key being the Item. I strayed from Key/Value because I may ultimately end up calling the Item the Value.

I invented the FrequencyDictionary while writing a pisg-like IRC log file statistics generator program. It works well for word frequency analysis. After some pre-processing/cleaning of the text log file, I parse each line by spaces to get an array of words. I then use FrequencyDictionary.AddRange to add the words to my dictionary.

The FrequencyDictionary is essentially a normal Dictionary (with T being type String in this case), however when you attempt to add a key that already exists, it simply increments the Value integer. After processing the file, you essentially have a Dictionary where the Key is a word that was found in the source text, and the Value is the number of occurrences of that word in the source text.

Taking the idea even further, I created complementary Dictionary that essentially a sorted version of FrequencyDictionary with the Key/Values reversed. I call it a RankingDictionary because the Keys represent the number of occurrences of a word, with the Values being a List of words that occurred that many times in the source text. Its called a RankingDictionary because I was using it to produce a top 10 ranking of most popular words, most popular nicks, ect.

The FrequencyDictionary has a GetRankingDictionary() method in it to make the whole thing very easy to use. Typically I don't use the FrequencyDictionary too extensively, but rather as a means to get a RankingDictionary, which I base the majority of my IRC statistics logic on. The RankingDictionary also proved very useful for finding Naked Candidates and Hidden Pairs or Triples in my Sudoku solver application that I will be releasing on Windows, Windows Phone and will be blogging about shortly. Hell, I was even thinking about releasing the source code to my Sudoku App, since its so elegant and a great example of beautiful, readable code to a complex problem.

Anyways, the code for the Frequency and Ranking Dictionary is heavily commented with XML Documentation Comments, so I'm going to go ahead and let the code speak for itself. I will add usage examples later. In fact, I will probably release the pisg-like IRC Stats Prog source code since I don't think I'm going to go any farther with it.

Limitations: I ran into problems trying to parse large text files approaching 3-4 megabytes. Besides taking up a bunch of memory, the Dictionary begins to encounter many hash collisions once the size of the collection gets large enough. This completely kills performance, and eventually most the time is spent trying to resolve collisions, so it grinds to a near halt. You might notice the constructor public FrequencyDictionary(int Capacity), where you can specify a maximum capacity for the Dictionary. A good, safe ceiling is about 10,000. A better implementation of the GetHash() method might be in order, but is not a problem I have felt like solving yet.




/// <summary>
/// A keyed collection of Item/Frequency pairs, (keyed off Item).
/// If a duplicate Item is added to the Dictionary, the Frequency for that Item is incremented.
/// </summary>
public class FrequencyDictionary<ITM>
{
  // The underlying Dictionary
  private Dictionary<ITM, int> _dictionary;
  
  /// <summary>
  /// Initializes a new instance of the FrequencyDictionary that is empty.
  /// </summary>
  public FrequencyDictionary() { _dictionary = new Dictionary<ITM, int>(); }
  
  /// <summary>
  /// Initializes a new instance of the FrequencyDictionary that has a maximum Capacity limit.
  /// </summary>
  public FrequencyDictionary(int Capacity) { _dictionary = new Dictionary<ITM, int>(); }
  
  /// <summary>
  /// Gets a collection containing the Items in the FrequencyDictionary.
  /// </summary>
  public IEnumerable<ITM> ItemCollection { get { return this._dictionary.Keys; } }
  
  /// <summary>
  /// Gets a collection containing the Frequencies in the FrequencyDictionary.
  /// </summary>
  public IEnumerable<int> FrequencyCollection { get { return this._dictionary.Values;} }
  
  /// <summary>
  /// Adds the specified Item to the FrequencyDictionary.
  /// </summary>
  public void Add(ITM Item)
  {
    if  ( this._dictionary.ContainsKey(Item))   { this._dictionary[Item]++; }
    else    { this._dictionary.Add(Item,1); }
  }
  
  /// <summary>
  /// Adds the elements of the specified array to the FrequencyDictionary.
  /// </summary>
  public void AddRange(ITM[] Items)
  {
    foreach(ITM item in Items) { this.Add(item); }
  }
  
  /// <summary>
  /// Gets the Item that occurs most frequently.
  /// </summary>
  /// <returns>A KeyValuePair containing the Item (key) and how many times it has appeard (value).</returns>
  public KeyValuePair<ITM,int> GetMostFrequent()
  {
    int maxValue = this._dictionary.Values.Max();
    return this._dictionary.Where(kvp => kvp.Value == maxValue).FirstOrDefault();
  }
  
  /// <summary>
  /// Gets the number of Item/Frequency pairs contained in the FrequencyDictionary.
  /// </summary>
  public int Count { get { return this._dictionary.Count; } }
  
  /// <summary>
  /// Returns an enumerator that iterates through the FrequencyDictionary.
  /// </summary>
  public IEnumerator<KeyValuePair<ITM,int>> GetEnumerator()
  {
    return this._dictionary.GetEnumerator();
  }
  
  /// <summary>
  /// Gets the Frequency (occurrences) associated with the specified Item.
  /// </summary>
  public int this[ITM Item]
  {
    get { if (this._dictionary.ContainsKey(Item)) { return this._dictionary[Item]; } return 0; }
  }
  
  /// <summary>
  /// Creates a RankingDictionary from the current FrequencyDictionary.
  /// </summary>
  /// <returns>A RankingDictionary of Frequency/ItemCollection pairs ordered by Frequency.</returns>
  public RankingDictionary<ITM> GetRankingDictionary()
  {
    RankingDictionary<ITM> result = new RankingDictionary<ITM>();
    foreach(KeyValuePair<ITM,int> kvp in _dictionary)
    {
      result.Add(kvp.Value,kvp.Key);
    }
    return result;
  }
  
  /// <summary>
  /// Displays usage information for FrequencyDictionary 
  /// </summary>
  public override string ToString()
  {
    return "FrequencyDictionary<Item, Frequency> : Key=\"Item=\", Value=\"Frequency\"\".";
  }
}




And now the RankingDictionary:

/// <summary>
/// A keyed collection of Frequency/ItemCollection pairs that is ordered by Frequency (rank).
/// If an Item is added that has the same Frequency as another Item, that Item is added to the Item collection for that Frequency.
/// </summary>
public class RankingDictionary<ITM>
{
  // Underlying dictionary
  SortedDictionary<int,List<ITM>> _dictionary;
  
  /// <summary>
  /// Initializes a new instance of the FrequencyDictionary that is empty.
  /// </summary>
  public RankingDictionary() { _dictionary = new SortedDictionary<int,List<ITM>>(new FrequencyComparer()); }
  
  /// <summary>
  /// The Comparer used to compare Frequencies.
  /// </summary>
  public class FrequencyComparer : IComparer<int>
  {
    public int Compare(int one,int two) { if(one == two) return 0; else if(one > two) return -1; else return 1; }
  }
  
  /// <summary>
  /// Gets a collection containing the Frequencies in the RankingDictionary.
  /// </summary>
  public IEnumerable<int> FrequencyCollection { get { return this._dictionary.Keys; } }
  
  /// <summary>
  /// Gets a collection containing the ItemCollection in the RankingDictionary.
  /// </summary>
  public IEnumerable<List<ITM>> ItemCollections   { get { return this._dictionary.Values; } }
  
  /// <summary>
  /// Adds the specified Frequency and Item to the RankingDictionary.
  /// </summary>
  public void Add(int Frequency, ITM Item)
  {
    List<ITM> itemCollection = new List<ITM>();
    itemCollection.Add(Item);
    // If the specified Key is not found, a set operation creates a new element with the specified Key
    this._dictionary[Frequency] = itemCollection;
  }
  
  /// <summary>
  ///  Gets the number of Frequency/ItemCollection pairs contained in the RankingDictionary.
  /// </summary>
  public int Count { get { return this._dictionary.Count; } }
  
  /// <summary>
  /// Returns an enumerator that iterates through the RankingDictionary.
  /// </summary>
  public IEnumerator<KeyValuePair<int,List<ITM>>> GetEnumerator()
  {
    return this._dictionary.GetEnumerator();
  }
  
  /// <summary>
  /// Gets the ItemCollection associated with the specified Frequency.
  /// </summary>
  public List<ITM> this[int Frequency]
  {
    get
    {
      List<ITM> itemCollection;
      if (this._dictionary.TryGetValue(Frequency,out itemCollection)) return itemCollection;
      else return new List<ITM>();
    }
  }
  
  /// <summary>
  /// Displays usage information for RankingDictionary 
  /// </summary>
  public override string ToString()
  {
    return "RankingDictionary<Frequency, List<Item>> : Key=\"Frequency\", Value=\"List<Item>\".";
  }
}


Usage examples will be posted later.

Wednesday, April 9, 2014

An Implementation of a One-Way Hashing Algorithm using Substitution and Permutation




        This program is a one-way hashing algorithm that can be used for hashing passwords or verifying a file's integrity. Like other ciphers, it has rounds. While this particular implementation was designed to overwrite some of the information required to reverse the rounds operations to be asymmetric, the same core 'substitution' algorithm could be used to make a symmetric stream cipher by taking a different approach.
        Each round proceeds as follows: First a scrambling or mangling (sounds violent) of a substitution table is performed before substituting the plain-text (or last round's output) with values from the table chosen by the index corresponding to the plain-text character's equivalent byte value. The substituted-text is then converted into a List of bits (bool) and the list then split in half into two lists of bits. One of the lists of bits is optionally flipped (reversed) based on the value of the first bit of that list. The value of the middle bit of the other list is used to determine which list is to lead during the (next step) bit shuffling. Starting with the selected list, the bits of the two lists are interleaved (shuffled) making one list again. The bits are converted back into bytes via an extension method and used as the input for the next round.
        The substitution/cipher block step uses a variation of the RC4 cipher with the key-scheduling algorithm dropped in favor of a pseudo-random-appearing, but evenly distributed table with each number from 0 to 256 appearing exactly once. This table is deterministically populated by an initial value, IV, which is some number that is both greater than 256 and co-prime to it. Starting with zero, set each byte in the table to the value plus IV, modulus 256, in order. Then, to 'prime' the table/block, iterate over the table several thousand times with the RC4 algorithm, discarding the byte stream (in the case of RC4). In the case of my particular RC4 variant implementation, the result byte is put into the table at an index that congruent to 256 modulus the current iteration count multiplied by some co-prime of 256, in this case, 331.
        While many people believe that RC4 is well on its way to being broken, it is my hope that by using a smarter key-scheduling function with a different (larger) block size and maybe some bit country line-dancing (shuffling) will make a variant of RC4 that we all can have confidence in again. Hmm, there is already an RC5 and RC6. Maybe a good name would be RC4.1 or RC4.Awesome. RC4 and seven thirteenths? R2C2 perhaps? No, I got it: The stream cipher formally known as RC4. RC4Realz or RCL33t? Anyways, this article is not about symmetric algorithms, but rather asymmetric ones, so read on...

        Cryptography is one of my interests. I love the math behind it and I think the mechanics behind RSA are just so elegant. I often daydream that one day I will be able to come out with something incredibly clever and useful. For the past several months, I have been working on implementing several different concepts of cryptography in C#. I have wrote several classes that do one specific type of job, such as transposition, permutation or substitution. I have been toying with combinations these different methods into 'rounds', with each round getting its input from the output of the previous round.  In this post, I am proposing a one-way hashing algorithm that uses a combination of bit-shuffling (permutation) and a substitution table with logic that mangle the table based on the current state of the table. It is possible to make an asymmetric algorithm from this idea, but that is not the aim of this particular tool.

        This algorithm still has some areas that need improving. First of all, its rather slow, so hashing large files is not practical. The number of rounds has a large baring on its hash time. A smaller number of rounds is recommended for anything beyond a paragraph. The current default is 18. This is rather arbitrary, as I have not done any testing to find the optimal number of rounds. Anyone who uses less than 4 rounds will start to see some repeating characters, not to mention make their implementation more susceptible to cryptanalysis. Sometimes a lengthy hashing time can be desirable, such as a proof of work concept, or to make it more difficult to brute force. For a password-length input at 1024 rounds, it takes almost a full second.


        Another area for improvement is the permutation step; is basically just shuffles the bits by dividing the bit array in half and interleaving the bits (think a deck of cards). While there is some variation in the order, based off the input, a much better bit shuffling would probably be an expansion or compression style of permutation. See this link to Wikipedia for an visual depiction.


        Well that about sums it up, so now on to the code! Take note at how I used extension methods to take care of converting a BitArray and a Byte array into a string, and back again. This makes the code very readable and not take up nearly as many lines by reusing code. The code is pretty heavily commented, so I will just let it explaining itself:





public class AJRHash
{
 byte[] table;
 // Unsure the optimum setting for rounds.
 // No less that 4. 1024 takes around 1 second for password-length hashes.
 // Use lower setting for large hashes
 static int rounds = 18;
 static int minimumLength = 10;
 static int tableSize = 256; // Because we are using bytes
 // By choosing a co-prime to 256, we ensure we get an evenly distributed table
 static int[] coprimesOf256 = { 4489, 2209, 1369 }; // Changing the co-primes will change the hash
 
 public AJRHash()
 {
  table = new byte[tableSize];
  InitializeTable();
 }
 
 public string Hash(string Input)
 {
  // Pad the input to minimum length
  if(Input.Length<minimumLength)
  {
   List<byte> padding = new List<byte>();
   
   // Copy the input first
   if(Input.Length>0)
    padding.AddRange(Input.GetBytes());
   
   // Make up the diffrence with nulls
   int diffrence = minimumLength - Input.Length;
   while(diffrence>0)
   {
    padding.Add(0);
    diffrence--;
   }
   Input = padding.ToArray().GetString();
  }
  
  byte[] permutation = Input.GetBytes();
  byte[] substitution = new byte[Input.Length+1];
  
  int counter = rounds;
  while(counter>0)
  {
   substitution = Substitution(permutation);
   permutation  = Permutation_Interleave(substitution);
   counter--;
  }

  return Convert.ToBase64String(permutation);
 }
 
 void InitializeTable()
 {
  int prime = coprimesOf256[0];
  
  byte index = 0;
  for (int i = 0; i < tableSize; i++)
  {
   table[index] = (byte)i;
   
   unchecked // The large prime will just roll over. Essentially just modular arithmetic
   { // By choosing a co-prime to 256, we ensure we get an evenly distributed table
    index += (byte)prime;
   }
  }
  
  // Finish initializing the table by shuffling it
  ScrambleTable(prime);
 }
 
 void ScrambleTable(int iterations)
 {
  byte i = 0;
  byte j = 0;
  byte k = 0;
  byte l = 0;
  
  byte iteration = 0;
  int counter = iterations;
  
  // Just roll over on overflow
  unchecked
  {
   // Some initial values
   j=table[table.Length/2];
   i=table[j];

   // Use the current state of the table to determine how it is changed
   while(counter>0)
   {
    i++;
    l=(byte)(table[i]+1);
    j=(byte)(j+l+1);
    
    table[i] = (byte)(j + 3);
    table[j] = (byte)(l - 1);
    
    k = (byte)( (table[i] + table[j]));
    l = (byte)(k % 255);
    
    table[k] = (byte)(table[k]+1);
    
    k = table[l];
    l = table[iteration];
    
    table[(byte)(iteration*331)] = (byte)(k ^ l);
    
    counter--;
    iteration++;
   }
  }
 }
 
 byte[] Substitution(byte[] input)
 {
  // Shuffle the table by an ammount unique to the input
  ScrambleTable(input.Length*input[0]);
  
  // Apply input against the substitution table
  List<byte> result = new List<byte>();
  foreach(byte b in input)
  {
   result.Add( table[(int)b] );
  }
  
  return result.ToArray();
 }
 
 // TODO: Add more robust bit shuffling/permutation such as compression or expansion
 byte[] Permutation_Interleave(byte[] input)
 {
  // Convert input into an array of bits
  BitArray temp = new BitArray(input);
  List<bool> bits = temp.GetList();
  
  // Ensure we have an even number of bits
  if(bits.Count%2 != 0)
   bits.Add(false);
  
  // Split the input up into two arrays of bits
  List<bool[]> split = bits.Split();
  
  // Make sure we have at least two arrays to interleave
  if(split.Count<2)
   throw new Exception("Input is too short.");
  
  bool[] one = split[0];
  bool[] two = split[1];
  
  // Ensure both arrays are the same length
  if(one.Length != two.Length)
   throw new Exception("Lengths must match.");
  
  // If the first bit of two is 1, reverse it.
  //   This makes the deterministic shuffling more difficult to reverse 
  if(two[0])
   Array.Reverse(two);
  
  // If the 'middle' bit of one is 1, take from two first.
  //   This makes the deterministic shuffling more difficult to reverse
  bool SecondFirst = false;
  if(one[one.Length/2])
   SecondFirst = true;
  
  // Interleave the two arrays
  int counter = 0;
  List<bool> result = new List<bool>();
  while(counter<one.Length)
  {
   // Two possible ways to order (decided above)
   if(SecondFirst)
   {
    result.Add(two[counter]);
    result.Add(one[counter]);
   }
   else
   {
    result.Add(one[counter]);
    result.Add(two[counter]);
   }
   counter++;
  }
  
  return new BitArray(result.ToArray()).GetBytes();
 }
}



I also wrote a few extension methods to help with all the tedious converting of data types. Extension methods can help keep you code tidy, readable. Don't forget to add these extension methods, or the above code wont compile:




public static class ExtentionMethods
{
 public static byte[] GetBytes(this string Input)
 {
  if(Input == null) return new byte[0];   
  return Encoding.UTF8.GetBytes(Input);
 }
 
 public static string GetString(this byte[] Input)
 {
  if(Input == null) return string.Empty;   
  return Encoding.UTF8.GetString(Input);
 }
 
 public static string GetString(this BitArray Input)
 {
  if(Input == null) return string.Empty;   
  return Encoding.UTF8.GetString(Input.GetBytes());
 }
 
 public static byte[] GetBytes(this BitArray Input)
 {
  if(Input == null) return new byte[0];   
  int lengthInBytes = (Input.Length%8==0) ? Input.Length/8 : (Input.Length/8)+1;
  
  byte[] result = new byte[lengthInBytes];
  Input.CopyTo(result,0);
  return result;
 }
 
 public static bool[] GetArray(this BitArray Input)
 {
  if(Input == null) return new bool[0];   
  return Input.GetList().ToArray();
 }
 
 public static List<bool> GetList(this BitArray Input)
 {
  if(Input == null) return new List<bool>();
  
  List<bool> result = new List<bool>();
  foreach(bool b in Input)
  {
   result.Add(b);
  }   
  return result;
 }
 
 public static List<bool[]> Split(this List<bool> Input)
 {
  if(Input == null) return new List<bool[]>();
  
  int midpoint = (Input.Count/2);
  List<bool> result1 = new List<bool>();
  List<bool> result2 = new List<bool>();
  
  int counter = 0;
  foreach(bool b in Input)
  {
   if(counter<midpoint)
   {
    result1.Add(b);
   }
   else
   {
    result2.Add(b);
   }    
   counter++;
  }
  
  List<bool[]> result = new List<bool[]>();
  result.Add(result1.ToArray());
  result.Add(result2.ToArray());
  return result;   
 }
}

Wednesday, August 7, 2013

Pseudo 'random' even distribution table




In my last post, I discussed what a co-prime is and showed you to find them.

So, what's so special about relatively prime numbers? Well, then can be used to create an one-for-one distribution table that is seemingly random, but is deterministically calculated (created).

To understand what I mean, picture the face of a clock...


It has the hours 1 through 12, and if you and an hour to 12, you get 1. This can also be thought of as a single digit in a base 12 number system. Now we need a co-prime to 12. 7 is relatively prime to 12, so lets choose 7.

Starting at hour 1, if we add 7 hours, it will be 8. If we add 7 more hours, we will get 3. 7 more, 10. If we keep adding 7 hours to our clock, the hour hand will land on each of the different numbers exactly once before repeating itself, 12 steps later. Intrigued yet?

If, say, we find a co-prime to the largest number that can be represented by a byte (8-bits, 256 [also expressed as 2^8=256 or 8=Log2(256)]), we can create an array of bytes with a length of 256, containing each of the 256 different possible bytes, distributed in a seemingly random order. The discrete order, or sequence, in which each each number is visited it completely dependent on the value of the co-prime that was selected.

This table is now essentially a one-to-one, bijective mapping of one byte to another. To express this mapping to another party, say to map a stream of bytes back to their original values (decrypt), the entire table need not be exchanged, only the co-prime.

This provides a foundation for an encryption scheme who's technical requirements are similar to handling a cipher-block-chain (CBC) and its changing IV (initialization vector).

Now, it it easy to jump to the conclusion that such an encryption scheme is less secure than a CBC, but this is not necessarily the case. While this approach may be conceptually more simple, the difficulty of discovering the sequence can be made arbitrarily hard.

First of all, the number of relatively prime numbers to 256 is probably infinite. A co-prime to 256 does not have to be less than 256. Indeed, it may be several thousand time greater than 256. Additionally, any prime greater than 256 is, by definition, co-prime to 256, and likely will have a seemingly more 'random' distribution/appearance.

There is, however, a limit here. It does not have to do with the number of co-primes, but is instead limited by the number of possible sequences that can be represented by our array of 256 bytes; eventually, two different co-primes are going to map to the same unique sequence. The order matters, and we don't allow repetition to exist in our sequence. This is called a permutation without repetition, and can be expressed as 256! or 256 factorial and is instructing one to calculate the product of 256 * 255 * 254 * 253 * [...] * 6 * 5 * 4 * 3 * 2 * 1, which equals exactly this number:

857817775342842654119082271681232625157781520279485619859655650377269452553147589377440291360451408450375885342336584306157196834693696475322289288497426025679637332563368786442675207626794560187968867971521143307702077526646451464709187326100832876325702818980773671781454170250523018608495319068138257481070252817559459476987034665712738139286205234756808218860701203611083152093501947437109101726968262861606263662435022840944191408424615936000000000000000000000000000000000000000000000000000000000000000

Yeah, that's right, that number has exactly 63 zeros on the end and is 507 digits long. (As an aside, the reason there is so many zeros on the end of this number is, well for one it is highly composite, but more specifically, its prime factorization includes 2^255 and 5^63 and so 63 fives multiply with 63 of those twos to make 63 tens, and hence that many zeros.)

Above I said arbitrarily hard. So far we have only considered one table, but try and fathom the complexity of many tables. I present three different ways to use multiple tables; Nested, sequentially, and mangled.
Furthermore, the distribution tables can be discarded and replaced.

I will explain what those mean and finish this post tomorrow.