Wednesday, September 21, 2016

RC4 stream cipher variants and visualization of table permutation state

Here, I present some work I have been doing on two RC4 stream cipher variants. I wrote the first variant, seen below, to help me visualize what the RC4 table is doing and to understand its properties.

Identity Permutation

The class that contains it is called SimpleTable, and it is exactly that: the simplest RC4 implementation possible. It is notable in that it does not use key scheduling at all; its starting state is the identity permutation. The identity permutation is where the value at index zero is zero, the value at index one is one, and so on. An easy way to remember this is to recall the multiplicative identity (which is 1): multiplying a number N by the multiplicative identity gives you back N. Similarly, applying the identity permutation to an array A just gives you A. This is the trivial permutation. That is, there is no permuting of the array at all!

Anyways, this is done to see the perfectly ordered state, and how each round affects that state. In this way, we can visually check for the avalanche effect. In order to visualize the table, I just assign each value 0 to 255 a different shade of grey (I also have a rainbow-colored option that might make it easier to tell similar values apart). At each step I create a Bitmap by looping through the table. Below, you can find an animated GIF of the first 100 steps of this cipher being applied to the identity permutation:

Notice how it takes a while to get going, and the first several values don't move much at all. After 256 steps, or one round, the cursor arrives back at index zero. Because the first several values have moved very little or not at all, we can clearly see that a mere 256 steps is insufficient to permute the state enough to avoid leaking the first part of your key. Therefore it is vital to permute the table for several rounds (256 steps per round) before you start using the stream.
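To make the slow start concrete, here is a minimal sketch in Python (not the C# from the project) of stepping an unkeyed table. It assumes SimpleTable uses the standard RC4 output step; the function names are made up for illustration.

```python
def identity_table(size=256):
    """Return the identity permutation: table[i] == i."""
    return list(range(size))


def step(table, i, j):
    """Apply one standard RC4 output step in place; return (i, j, output_byte)."""
    n = len(table)
    i = (i + 1) % n
    j = (j + table[i]) % n
    table[i], table[j] = table[j], table[i]
    out = table[(table[i] + table[j]) % n]
    return i, j, out


table = identity_table()
i = j = 0
outputs = []
for _ in range(4):
    i, j, out = step(table, i, j)
    outputs.append(out)

print(outputs)      # first output bytes from the unkeyed table
print(table[:10])   # the low end of the table after four steps
```

After four steps only a handful of entries near the start have been swapped, and everything else is still in its original position.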

Wired Equivalent Privacy

As some of you may know, WEP used RC4 with a weak key schedule. The key is spread out over 256 bytes using the following approach:

j = (j + table[i] + key[i % keylength]) % 256;
SwapValues(table[i], table[j]);

and then it begins streaming bytes from the table. Typically a nonce is concatenated to the key. Every time the table is set up and/or the nonce changes, some information about the key is leaked. Obviously a more secure procedure would use a hash of the key and the nonce instead of the plain-text key, and would toss away the first 1024 bytes or so of the stream.
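For reference, the two lines above sit inside a loop over all 256 table positions. A rough Python sketch of the whole key schedule followed by the standard output loop might look like this (function names are my own, and nothing is discarded here, which is exactly the weakness discussed):

```python
def key_schedule(key):
    """Standard RC4 key schedule: spread the key bytes over a 256-entry table."""
    table = list(range(256))
    j = 0
    for i in range(256):
        j = (j + table[i] + key[i % len(key)]) % 256
        table[i], table[j] = table[j], table[i]
    return table


def keystream(key, count):
    """Return `count` keystream bytes for `key` (no bytes discarded)."""
    table = key_schedule(key)
    i = j = 0
    out = bytearray()
    for _ in range(count):
        i = (i + 1) % 256
        j = (j + table[i]) % 256
        table[i], table[j] = table[j], table[i]
        out.append(table[(table[i] + table[j]) % 256])
    return bytes(out)


print(keystream(b"Key", 4).hex())
```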

Cycle Length

Because each step in an RC4 cipher is a permutation, there is a limit to the number of unique bytes that can be produced before it begins repeating. This is called the permutation cycle. The length of the permutation cycle depends on the exact starting state, but we can get an upper bound.

Since there are 256 elements in the array, and two indices into the array (i and j), there is a maximum of
256! * 256^2 ≈ 5.62 * 10^511 ≈ 2^1700
possible states. That's about 4.6 * 10^487 yottabytes (counting a yottabyte as 2^80 bytes)!
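If you want to check the magnitude yourself, Python's arbitrary-precision integers make it a few lines. This sketch treats a yottabyte as 2^80 bytes, which is an assumption on my part.

```python
import math

states = math.factorial(256) * 256 ** 2   # table orderings times (i, j) pairs
print(len(str(states)))                    # number of decimal digits
print(states.bit_length())                 # number of bits needed to write it
yottabytes = states // 2 ** 80             # one byte per state, 2^80 bytes each
print(len(str(yottabytes)))                # decimal digits of the yottabyte count
```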

This is the maximum possible number of states, however, and some starting states will have shorter cycles. If the RC4 algorithm performed as a random permutation (which it does not; it performs worse), the expected cycle length would be about half of the theoretical maximum above. Luckily the number above is so vast that even some fraction of it is still so many bytes that all of humanity has never had, and likely never will have, that much total storage.

One thing to watch out for, however, is something called Finney states. If an RC4 table is started in one of these Finney states, the length of the cycle is much, much reduced. The chance of randomly generating one of these starting states, however, is VERY, very low.

Strengthening RC4

As stated and visualized above, it is vital to permute the table for several rounds (at 256 steps per round) after the key schedule, discarding the bytes, before you start using the stream. Also, it would be foolish to use the actual bytes of the key for permuting the starting state. It would be better to use a hash of the key + nonce, or the output of a key derivation function, instead of the actual value of the key itself.

Another idea is, after shuffling the table enough rounds to hide the key, scramble the table an additional number of rounds, that value being some function of the key. This increases the possible starting states by whatever your range is.
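Here is one way those ideas could fit together, sketched in Python. This is not the code from my project: SHA-256, the default of 4 discard rounds, and deriving 0 to 15 extra rounds from the first hash byte are all hypothetical choices made for the example.

```python
import hashlib


def hardened_table(key, nonce, discard_rounds=4):
    """Key the table from SHA-256(key + nonce), then run extra rounds, discarding output."""
    seed = hashlib.sha256(key + nonce).digest()
    table = list(range(256))
    j = 0
    for i in range(256):                       # standard key schedule, fed the hash
        j = (j + table[i] + seed[i % len(seed)]) % 256
        table[i], table[j] = table[j], table[i]
    # Extra round count derived from the hash (a hypothetical choice, 0..15).
    extra_rounds = seed[0] % 16
    i = j = 0
    for _ in range(256 * (discard_rounds + extra_rounds)):
        i = (i + 1) % 256
        j = (j + table[i]) % 256
        table[i], table[j] = table[j], table[i]
    return table, i, j
```

The same key and nonce always give the same starting table, while changing the nonce changes both the hash fed to the key schedule and the number of extra rounds.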

In the classic RC4, each step would return one byte. The number of steps taken before returning each byte is configurable in my implementation.

Memory hardening

Check out the experimental branch for a memory hardened version. It keeps the key in memory XORed with a one-time pad, and the key class is then protected from access with the System.Security.Cryptography.ProtectedMemory class.

Other uses

The pseudo-random byte stream from the RC4 table is deterministic. Therefore, if two remote computers share a secret, both can independently set up an RC4 table with exactly the same starting state and will get the same sequence of bytes, which would be difficult to guess given just the stream of bytes. If the plain text is XORed with the pseudo-random byte stream, then it can be decrypted by XORing it with the same byte stream.
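The XOR round trip is easy to see in a few lines of Python. The stream below is a fixed stand-in for real RC4 output, and the helper name is made up for the example.

```python
def xor_with_stream(data, stream):
    """XOR each data byte with the matching keystream byte."""
    if len(stream) < len(data):
        raise ValueError("keystream is shorter than the data")
    return bytes(d ^ s for d, s in zip(data, stream))


stream = bytes([0x13, 0x37, 0xC0, 0xDE, 0x42])   # stand-in for RC4 output
ciphertext = xor_with_stream(b"hello", stream)
plaintext = xor_with_stream(ciphertext, stream)
print(plaintext)
```

Encrypting and decrypting are the same operation, which is why both sides only need to agree on the starting state of the table.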

The project includes 2 variants: 1) A simple table with a method to visualize the permutation state of the table and the avalanche effect as a bitmap 2) A more serious attempt at a secure implementation.

NOTE: THIS HAS NOT BEEN CRYPTANALYZED AND IS PROBABLY NOT ACTUALLY SECURE, SO DO NOT TRUST IT!

Screenshots

Source code

Here is the GitHub page to the project (master branch).
Or just directly download the Zip file (experimental branch).

Tuesday, August 16, 2016

Lorenz Chaos Attractor

This project was inspired by one of Daniel Shiffman's 10 minute coding challenge YouTube videos, The Lorenz Attractor in Processing.


        dX = ((A * y)  -  (A * x)) * time;
        dY = ((B * x) -y -(x * z)) * time;
        dZ = ((x * y)  -  (c * z)) * time;

So it turns out this is not too terribly exciting. While it's true that adjusting the starting values by a small amount changes the behavior, if you go much outside the values it's currently set for, you will end up with a pattern that quickly degenerates to a single, boring point. Personally, I was hoping for a more chaotic system. You might notice I am not using the third coordinate (z) when drawing. I have yet to find a 3D drawing library that I like, though I need one for visualizing other projects. Anyways, since this was an experiment, I did the pragmatic thing and just made it 2D since I already knew how to do that.
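For anyone who wants to play with the equations without the drawing code, here is a small Python sketch of the same update step. The constants 10, 28 and 8/3 are the classic textbook values and the step size 0.01 is a guess; neither is necessarily what my project uses.

```python
def lorenz_step(x, y, z, dt, a=10.0, b=28.0, c=8.0 / 3.0):
    """Advance the Lorenz system by one Euler step of size dt."""
    dx = (a * y - a * x) * dt
    dy = (b * x - y - x * z) * dt
    dz = (x * y - c * z) * dt
    return x + dx, y + dy, z + dz


x, y, z = 0.01, 0.0, 0.0
points = []
for _ in range(5000):
    x, y, z = lorenz_step(x, y, z, dt=0.01)
    points.append((x, y))   # keep only x and y for a 2D plot
```

Each entry in points is one 2D position to draw; z is still computed every step because x and y depend on it.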

Here is the result:

GitHub project

It wanted to draw the pattern very small, so I had to scale up the image by multiplying each number by some scale number.

One possibly useful idea is to use the cosine of the tangent of each number. This has the effect of canceling out the spiral and spreading the numbers out over a field. If you use just the tangent, you get a gradient from the top left corner. Perhaps you could use this as a pseudo-random noise source.


public static void TanCos(Lorenz system)
{
        system.x = 16 * (decimal)Math.Tan(Math.Cos((double)system.x));
        system.y = 16 * (decimal)Math.Tan(Math.Cos((double)system.y));
}

public static void Tan(Lorenz system)
{
        system.x = 6 * (decimal)Math.Tan((double)system.x);
        system.y = 6 * (decimal)Math.Tan((double)system.y);
}

Tuesday, July 5, 2016

True hardware random number generator with the Raspberry PI

So, I have been getting into cryptology lately. (For my most recent projects that I may or may not have blogged about at this point, see Bloom Filter and RC4Ever on GitHub.)

The other day, I had a need for a TRUE random number generator, so I was searching the web for a hardware random number generator, when I found some very pleasant information: I already own one!

As it turns out, the Raspberry Pi (A/A+/B/B+/and 2) includes a hardware based random number generator, and according to many sources, it's a very good source of truly random bytes. Yay!

To get this working on your own Pi, it's a breeze:
1) Install the RasPi's random number generator tools: sudo apt-get install rng-tools.
2) Add the name of the hwrng kernel module, bcm2708-rng, to the file that lists modules to load at boot (/etc/modules.conf).
3) Reboot the Pi.

Now, /dev/hwrng is available for reading. It's treated like a device, and you can use the dd command to copy bytes from the device stream to a file (like examples you might have seen doing the same from /dev/random).

NOTE: /dev/hwrng is accessible by the user root only.

But we can change that! The following command gives all users read access: sudo chmod a+r /dev/hwrng

NOTE: This setting gets reset upon every reboot.

Again, we can change that: Add the following line to /etc/rc.local file, just above the exit 0 line:
chmod a+r /dev/hwrng

And it's just that easy!

Now, say you want to generate 1 megabyte worth of random bytes to a file in /tmp. Simply enter the following command into a terminal:
dd if=/dev/hwrng of=/tmp/hwrng-test-data.bin bs=1024 count=1024

The bs argument sets the block size in bytes: how much dd reads and writes at a time. You probably want to leave that at or around 1024. The count argument is the number of blocks to copy from /dev/hwrng, so with bs=1024 it is the amount of data in kilobytes: count=1024 gives 1 MB, whereas count=1 gives 1 KB.

Now, it's time for Step 4) Create a C# helper library to simplify the retrieval of random bytes from /dev/hwrng.

So there are two ways to approach this. One is to make a C++ library that makes native calls and then write a .NET interop library to wrap that. Or, if you are like me, a little lazy, and find that ever since transitioning to C# it is difficult to write anything in C or C++ that compiles, you may opt to just issue the above commands to the shell and read in the resulting file from the tmp directory.

As hacky as this second option might seem, it works remarkably well and I have written a GitHub project doing just that. It consists of a library to return random bytes, and a console executable exercising said library to get random bytes. Links below!

- The PiRngWrapper GitHub Project
- The PiRngWrapperLibrary.cs wrapper code file
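As an aside, if you are not tied to C#, the device can also be read like any other file. The Python sketch below is a different approach from my wrapper (which shells out to dd); the function name is made up, and it assumes read access to /dev/hwrng has been granted as described above.

```python
def read_random_bytes(count, device="/dev/hwrng"):
    """Read exactly `count` bytes from the given device file."""
    chunks = []
    remaining = count
    with open(device, "rb", buffering=0) as source:
        while remaining > 0:
            chunk = source.read(remaining)   # may return fewer bytes than asked
            if not chunk:
                raise IOError("device returned no data")
            chunks.append(chunk)
            remaining -= len(chunk)
    return b"".join(chunks)
```

On a Pi, read_random_bytes(32) would return 32 bytes from the hardware generator; pass a different device path to try it on a machine without one.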

Wednesday, June 29, 2016

Bloom Filter - A novel, space efficient data structure like a hash-table for billions of values.

Introduction

A bloom filter is a truly novel data structure. Similar to a hash table, it can tell you if you've hashed a particular value previously. You can add many, many more values to a bloom filter than you can to a hash table, its performance does not degrade as the number of values in the set grows large, and it requires only a fraction of the space of a hash table!

This is not just an academic exercise, or something that only works in theory or in special cases. Indeed, companies like Google use bloom filters to quickly determine that they have never seen a value before, thus avoiding a more costly lookup against a database every time the bloom filter returns false.

Probabilistic

First off, it's important to understand that a bloom filter is NOT a hash table; it operates in an entirely different way. A bloom filter is what is known as a probabilistic data structure. What this means is that it can tell you, to within a certain probability, if an element exists in a set. In other words, false positive matches ARE possible, but false negative matches ARE NOT possible. For example, if you check a bloom filter for the existence of a value, and it returns false, you can know with 100% certainty that the bloom filter has not seen that value before. However, if you test a value against the filter and it returns true, there is a small probability that it has in fact not seen that value before, but is returning a false positive. How big of a probability? Here's the beauty: It can be as small as you want it to be. It depends on a few factors, including the size of the filter, how full it is, and how many bits you set for each value in the filter.
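To put rough numbers on "as small as you want it to be", the textbook sizing formulas for a bloom filter are m = -n ln(p) / (ln 2)^2 bits and k = (m / n) ln 2 hash functions, for n values and false positive rate p. Here is a quick Python sketch; my own implementation may size things differently, and the function name is made up.

```python
import math


def bloom_parameters(expected_items, false_positive_rate):
    """Return (bit_count, hash_count) from the standard Bloom filter sizing formulas."""
    bits = math.ceil(-expected_items * math.log(false_positive_rate) / math.log(2) ** 2)
    hashes = max(1, round(bits / expected_items * math.log(2)))
    return bits, hashes


print(bloom_parameters(1_000_000, 0.01))
```

For a million values at a 1% false positive rate this comes out to a little under 10 bits per value and 7 hash functions.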

In a HashTable class, each item is stored as a key value pair, so it costs the size of your object plus a 32 bit integer. Contrast that to a bloom filter, which only sets about 3-7 bits per value hashed and never stores the value itself. Also, my implementation applies compression when saving the filter to disk, providing even more space savings. A bloom filter with 160,000 values hashed and a 1% false positive probability results in a filter that is 235KB uncompressed, and a mere 54KB when compressed! Remember the filter is an array of bits. The entropy of the array is going to be at its greatest, and thus the compression ratio lowest, when exactly 1/2 of the bits are set, or the filter is half-way 'full'. This gives the compressed file the unusual property of getting smaller as you add more hashes once the filter is past that half-way point. The actual filter itself never changes size; it's only the compressed version that varies in size.
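The half-full claim is easy to try out. This Python sketch fills a bit array at random to a given fraction and compresses it with zlib (standing in for the .NET compression my project uses); the sizes and seed are arbitrary.

```python
import random
import zlib


def compressed_size(fill_fraction, bit_count=80_000, seed=1):
    """Compress a random bit array in which about `fill_fraction` of the bits are 1."""
    rng = random.Random(seed)
    bits = [1 if rng.random() < fill_fraction else 0 for _ in range(bit_count)]
    packed = bytearray()
    for start in range(0, bit_count, 8):
        byte = 0
        for bit in bits[start:start + 8]:
            byte = (byte << 1) | bit
        packed.append(byte)
    return len(zlib.compress(bytes(packed), 9))


for fraction in (0.1, 0.5, 0.9):
    print(fraction, compressed_size(fraction))
```

The half-full array should come out largest, with the nearly empty and nearly full arrays both compressing noticeably better.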

To handle the compression I just used the System.IO.Compression.DeflateStream class. An important note about working with this class: build an array of bytes and send your entire file in one go. In this way it will compress the whole file as one chunk. If you send data to this stream piecemeal, it will compress each piece separately and you will get a poor compression ratio.

How it works

So how does this all work? The filter part of a bloom filter is just a large array of bits. You also require several different hash functions that each return a different result for the same input value.

When you add a value to the filter, the value is sent to about 3-7 different hash functions. Each hash function returns a value from 0 up to (but not including) the number of bits in the filter. Each value is used as an index to access an element of the array of bits that is your filter. When hashing a value, you just set the bit at each index location in the array to 1.

When testing for the presence of a value in the filter, you pass the value to the hash functions the same way as above, then visit each index, checking to see if any of them are 0. If even one bit at one of those index positions is zero, it means the filter has never seen that value before, because it would have set all those bits to 1. If all the bits at the index locations are 1, then it is likely that the filter has seen that value before. However, there is a chance that it is a false positive, because it could be that all of that value's hashes mapped to bits set by other values.

As the filter becomes more full, more bits are set to 1, and so the odds of a false positive go up. You build your filter by supplying the estimated number of values you think you are likely to store in it, and you don't go above a certain ratio of 1 bits to 0 bits. If you were to let your filter hash so many values that every bit got set to 1, then the probability of receiving a false positive for a random value becomes 100%.
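The add and test steps described above fit in a short Python sketch. To keep it self-contained it fakes the several hash functions by salting SHA-256 with a counter, which is not how my implementation does it; the class and method names are made up for the example.

```python
import hashlib


class BloomFilter:
    def __init__(self, bit_count, hash_count):
        self.bit_count = bit_count
        self.hash_count = hash_count
        self.bits = bytearray((bit_count + 7) // 8)

    def _indexes(self, value):
        # Salt SHA-256 with a counter to imitate several independent hash functions.
        for salt in range(self.hash_count):
            digest = hashlib.sha256(bytes([salt]) + value).digest()
            yield int.from_bytes(digest[:8], "big") % self.bit_count

    def add(self, value):
        for index in self._indexes(value):
            self.bits[index // 8] |= 1 << (index % 8)

    def might_contain(self, value):
        return all(self.bits[index // 8] & (1 << (index % 8))
                   for index in self._indexes(value))


bloom = BloomFilter(bit_count=1024, hash_count=5)
bloom.add(b"apple")
print(bloom.might_contain(b"apple"))    # True: no false negatives
print(bloom.might_contain(b"banana"))   # almost certainly False
```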

Solving the many hash problem

As I mentioned before, this requires several different hash functions that each return a different result for the same input value. Although I said 3-7 hash functions, you might require 14 or more when working with filters that must handle a large number of values, a low false positive likelihood, or both.

Instead of writing a bunch of separate hash algorithms, I implemented a stream cipher wherein I just scramble the cipher table by a number of rounds that is unique to that input. Then, I can return as many indices as the filter is configured for. This sets up the table once per value. It needs to reset the table for each value, or else the indices that we mark would depend on every value that came before it, and in that particular order. Currently the bottleneck is how many times it has to scramble the table for each value. If you need to hash really long values, you'll want to lower the number of rounds it scrambles the table.

Variations

In this implementation, the bloom-filter size is set once you create it, meaning that it cannot grow bigger if it gets too full, nor can you resize this bloom filter to become smaller if you sized it too big. Because multiple values could rely on the same bit, this implementation does not support removal of items, because to do so would cause several values to begin reporting false negatives.

In order to make a bloom filter that supports deletion, use a number like a byte instead of bits in your filter, and each time you visit an index in the filter while adding values, increment the number you find there. Then, to delete a value, visit each index as you did before, but decrement the number there. This way, if two values map to the same index, that information is tracked by incrementing the value. This is what is known as a Counting Bloom Filter.
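A counting bloom filter along those lines might look like the following Python sketch. As with any such example, the salted SHA-256 hashing and all the names are assumptions for illustration, not code from my project.

```python
import hashlib


class CountingBloomFilter:
    def __init__(self, slot_count, hash_count):
        self.slot_count = slot_count
        self.hash_count = hash_count
        self.counts = [0] * slot_count

    def _indexes(self, value):
        for salt in range(self.hash_count):
            digest = hashlib.sha256(bytes([salt]) + value).digest()
            yield int.from_bytes(digest[:8], "big") % self.slot_count

    def add(self, value):
        for index in self._indexes(value):
            self.counts[index] += 1

    def remove(self, value):
        # Only remove values that appear to be present, or counters go wrong.
        if not self.might_contain(value):
            raise KeyError("value is not in the filter")
        for index in self._indexes(value):
            self.counts[index] -= 1

    def might_contain(self, value):
        return all(self.counts[index] > 0 for index in self._indexes(value))
```

The price of deletion is space: each slot is now a whole counter instead of a single bit.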

There are other variants of bloom filters out there, including bloom filters that can grow in size if they get too full, but such a thing is beyond the scope of my needs. In essence, when the filter gets too full, you create another separate filter, and add new values by first checking the first filter to see if the value exists, and if not, adding the value to the second filter. Checking for the presence of a value requires checking both (and any other) filters. For information on scalable bloom filters, please see this whitepaper.

The code

My C# Bloom Filter project on GitHub
Or download zip here.

Sunday, June 12, 2016

The biggest problem with Mathematics today

I start this time with a little disclaimer: This is, after all, a blog, and most blogs still fit the original definition, which is a public forum used for a person to state or explain their beliefs and/or feelings. So it is without further ado, that I proceed, unabashed:

The biggest problem with Mathematics today, particularly for people approaching or new to the field, is the convention for naming things.

Historically, and seemingly by convention, mathematical concepts are named after the first person to define or make serious contributions to that field. This is what is known as an eponym.

This is terrible practice, even tear-able! I am bitter about the amount of time I waste trying to find the 'name' of the mathematical concept I wish to express or research, or when I have to derail my research to go define a term or concept I don't recognize, only to find out that I already knew what it was.

If people would just name stuff after what it actually fucking does, instead of some person's last name, we would all be a lot better off (except for maybe the aforementioned person).

In software development, we have this concept known as refactoring. This term includes fundamental re-structuring of the code, but also can be as simple as a bunch of renaming of everything to fit a more consistent, or holistic, view or concept. A similar thing needs to happen to the field of mathematics, and sooner rather than later!

No more eponyms in Mathematics!

Friday, January 22, 2016

Trick: How to mentally convert and calculate rate of pay

I have been real busy lately, so I will share only a short tip this week. However, I have some cool new projects/concepts I have been working on, such as a term rewriting system, so keep checking back.

I have been interviewing for a new job; it's time to move up. Often I am quoted an annual salary, and I want to see how that compares with an hourly wage. I do this calculation in my head, on the spot, and so can you. This technique leverages Estimation/Approximation.

A full-time year is 40 hrs/week times 52 wks/year = 40 * 52 = 2080 work hours.
We can approximate by multiplying or dividing by 2000 instead. Since 2 * 1000 = 2000, multiplying or dividing by 2000 is trivial: double (or halve) the number, then shift the decimal point three places. Remember, to multiply by any power of ten, no matter how great, just count the zeros and shift the decimal point that many places to the right; to divide, shift it to the left. e.g. 43.50 * 1000 = 43,500

'So how large is the error from estimating?', one might ponder... Well ponder no more! Hark:

Estimation Error - From hourly to yearly (Difference = Estimated - Actual)

$12/HR @ 40HR/WK

   $24,000/YR - Estimated
   $24,960/YR - Actual
   ----------
     -$960/YR - Difference

$25/HR @ 40HR/WK

   $50,000/YR - Estimated
   $52,000/YR - Actual
   ----------
   -$2,000/YR - Difference

$50/HR @ 40HR/WK

  $100,000/YR - Estimated
  $104,000/YR - Actual
  -----------
   -$4,000/YR - Difference


Estimation Error - From yearly to hourly (Difference = Estimated - Actual)

$25K/YR @ 40HR/WK

   $12.50/HR - Estimated
   $12.02/HR - Actual
   ---------
   +$0.48/HR - Difference

$50K/YR @ 40HR/WK

   $25.00/HR - Estimated
   $24.04/HR - Actual
   ---------
   +$0.96/HR - Difference

$100K/YR @ 40HR/WK

   $50.00/HR - Estimated
   $48.08/HR - Actual
   ---------
   +$1.92/HR - Difference
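If you would rather let a computer check the mental math, the comparison is only a few lines of Python (the function names are just for this example):

```python
HOURS_PER_YEAR = 40 * 52          # 2080 full-time hours
ESTIMATE_HOURS = 2000             # the round number used for mental math


def yearly_from_hourly(rate):
    """Return (estimated, actual) yearly pay for an hourly rate."""
    return rate * ESTIMATE_HOURS, rate * HOURS_PER_YEAR


def hourly_from_yearly(salary):
    """Return (estimated, actual) hourly pay for a yearly salary."""
    return salary / ESTIMATE_HOURS, salary / HOURS_PER_YEAR


print(yearly_from_hourly(25))
estimated, actual = hourly_from_yearly(50_000)
print(round(estimated, 2), round(actual, 2))
```

Since 2000 / 2080 is about 0.96, the estimate always lands roughly 4% low going from hourly to yearly, and roughly 4% high going the other way.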

Thursday, December 24, 2015

Infix Notation Parser via Shunting-Yard Algorithm

Infix notation is the typical notation for writing equations in algebra.
An example would be: 7 - (2 * 5)

Parsing such an equation is not a trivial task, but I wanted one for my EquationFinder project, as I wanted to respect order of operations.

Strategies include substitution/replacement algorithms, recursion to parse into a tree and then tree traversal, or converting the infix notation to reverse polish notation (RPN), also known as postfix notation, then using a stack based postfix notation evaluator. I chose the latter, as such algorithms are well defined in many places on the web.

My code consists of 3 classes, all static:
(Links go to the .cs file on GitHub)

InfixNotation - this simply holds a few public variables and calls the public methods on the below two classes.
ShuntingYardAlgorithm - this converts an equation in infix notation into postfix notation (aka RPN).
PostfixNotation - this evaluates the equation in postfix notation and returns a numerical result value.

In order to implement the shunting-yard algorithm and the postfix evaluator, I simply wrote the steps to the algorithms as written on Wikipedia:
(Links go to the Wikipedia article)
Link to the Shunting-Yard Algorithm to convert Infix notation to Postfix notation.
Link to the Postfix Notation Evaluation Algorithm.

The code for this is pretty extensive, but I will prettify it and present it below. Alternatively, you can view and download the code from the MathNotationConverter project on my GitHub.

InfixNotation:


using System;
using System.Collections.Generic;
using System.Linq;

public static class InfixNotation
{
   public static string Numbers = "0123456789";
   public static string Operators = "+-*/^";

   // True only when every character of the text is a digit.
   public static bool IsNumeric(string text)
   {
      return !string.IsNullOrWhiteSpace(text) && text.All(c => Numbers.Contains(c));
   }

   // Converts the infix string to postfix notation, then evaluates it.
   public static int Evaluate(string infixNotationString)
   {
      string postFixNotationString = ShuntingYardAlgorithm.Convert(infixNotationString);
      int result = PostfixNotation.Evaluate(postFixNotationString);
      return result;
   }
}

ShuntingYardAlgorithm

(converts an equation from infix notation into postfix notation):


public static class ShuntingYardAlgorithm
{
   private static string AllowedCharacters = InfixNotation.Numbers + InfixNotation.Operators + "()";

   private enum Associativity
   {
      Left, Right
   }
   private static Dictionary<char, int> PrecedenceDictionary = new Dictionary<char, int>()
   {
      {'(', 0}, {')', 0},
      {'+', 1}, {'-', 1},
      {'*', 2}, {'/', 2},
      {'^', 3}
   };
   private static Dictionary<char, Associativity> AssociativityDictionary = new Dictionary<char, Associativity>()
   {
      {'+', Associativity.Left},
      {'-', Associativity.Left},
      {'*', Associativity.Left},
      {'/', Associativity.Left},
      {'^', Associativity.Right}
   };

   private static void AddToOutput(List<char> output, params char[] chars)
   {
      if (chars != null && chars.Length > 0)
      {
         foreach (char c in chars)
         {
            output.Add(c);
         }
         output.Add(' ');
      }
   }
   
   public static string Convert(string infixNotationString)
   {
      if (string.IsNullOrWhiteSpace(infixNotationString))
      {
         throw new ArgumentException("Argument infixNotationString must not be null, empty or whitespace.", "infixNotationString");
      }

      List<char> output = new List<char>();
      Stack<char> operatorStack = new Stack<char>();
      string sanitizedString = new string(infixNotationString.Where(c => AllowedCharacters.Contains(c)).ToArray());

      string number = string.Empty;
      List<string> enumerableInfixTokens = new List<string>();
      foreach (char c in sanitizedString)
      {
         if (InfixNotation.Operators.Contains(c) || "()".Contains(c))
         {
            if (number.Length > 0)
            {
               enumerableInfixTokens.Add(number);
               number = string.Empty;
            }
            enumerableInfixTokens.Add(c.ToString());
         }
         else if (InfixNotation.Numbers.Contains(c))
         {
            number += c.ToString();
         }
         else
         {
            throw new Exception(string.Format("Unexpected character '{0}'.", c));
         }
      }

      if (number.Length > 0)
      {
         enumerableInfixTokens.Add(number);
         number = string.Empty;
      }

      foreach (string token in enumerableInfixTokens)
      {
         if (InfixNotation.IsNumeric(token))
         {
            AddToOutput(output, token.ToArray());
         }
         else if (token.Length == 1)
         {
            char c = token[0];

            if (InfixNotation.Numbers.Contains(c)) // Numbers (operands)
            {
               AddToOutput(output, c);
            }
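            else if (InfixNotation.Operators.Contains(c)) // Operators
            {
               // Pop every stacked operator that binds at least as tightly as c
               // (strictly tighter when c is right-associative). A '(' on the
               // stack has precedence 0, so it always stops the loop.
               while (operatorStack.Count > 0)
               {
                  char o = operatorStack.Peek();
                  if ((AssociativityDictionary[c] == Associativity.Left &&
                     PrecedenceDictionary[c] <= PrecedenceDictionary[o])
                        ||
                     (AssociativityDictionary[c] == Associativity.Right &&
                     PrecedenceDictionary[c] < PrecedenceDictionary[o]))
                  {
                     AddToOutput(output, operatorStack.Pop());
                  }
                  else
                  {
                     break;
                  }
               }
               operatorStack.Push(c);
            }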
            else if (c == '(') // open brace
            {
               operatorStack.Push(c);
            }
            else if (c == ')') // close brace
            {
               bool leftParenthesisFound = false;
               while (operatorStack.Count > 0 )
               {
                  char o = operatorStack.Peek();
                  if (o != '(')
                  {
                     AddToOutput(output, operatorStack.Pop());
                  }
                  else
                  {
                     operatorStack.Pop();
                     leftParenthesisFound = true;
                     break;
                  }
               }

               if (!leftParenthesisFound)
               {
                  throw new FormatException("The algebraic string contains mismatched parentheses (missing a left parenthesis).");
               }
            }
            else // wtf?
            {
               throw new Exception(string.Format("Unrecognized character '{0}'.", c));
            }
         }
         else
         {
            throw new Exception(string.Format("String '{0}' is not numeric and has a length greater than 1.", token));
         }
      } // end foreach

      while (operatorStack.Count > 0)
      {
         char o = operatorStack.Pop();
         if (o == '(')
         {
            throw new FormatException("The algebraic string contains mismatched parentheses (extra left parenthesis).");
         }
         else if (o == ')')
         {
            throw new FormatException("The algebraic string contains mismatched parentheses (extra right parenthesis).");
         }
         else
         {
            AddToOutput(output, o);
         }
      }

      return new string(output.ToArray());
   }
}

PostfixNotation

(evaluates the postfix notation and returns a numerical result):


public static class PostfixNotation
{
   private static string AllowedCharacters = InfixNotation.Numbers + InfixNotation.Operators + " ";

   public static int Evaluate(string postfixNotationString)
   {
      if (string.IsNullOrWhiteSpace(postfixNotationString))
      {
         throw new ArgumentException("Argument postfixNotationString must not be null, empty or whitespace.", "postfixNotationString");
      }

      Stack<string> stack = new Stack<string>();
      string sanitizedString = new string(postfixNotationString.Where(c => AllowedCharacters.Contains(c)).ToArray());
      List<string> enumerablePostfixTokens = sanitizedString.Split(new char[] { ' ' }, StringSplitOptions.RemoveEmptyEntries).ToList();

      foreach (string token in enumerablePostfixTokens)
      {
         if (token.Length > 0)
         {
            if (token.Length > 1)
            {
               if (InfixNotation.IsNumeric(token))
               {
                  stack.Push(token);
               }
               else
               {
                  throw new Exception("Operators and operands must be separated by a space.");
               }
            }
            else
            {
               char tokenChar = token[0];

               if (InfixNotation.Numbers.Contains(tokenChar))
               {
                  stack.Push(tokenChar.ToString());
               }
               else if (InfixNotation.Operators.Contains(tokenChar))
               {
                  if (stack.Count < 2)
                  {
                     throw new FormatException("The algebraic string has not sufficient values in the expression for the number of operators.");
                  }

                  string r = stack.Pop();
                  string l = stack.Pop();

                  int rhs = int.MinValue;
                  int lhs = int.MinValue;

                  bool parseSuccess = int.TryParse(r, out rhs);
                  parseSuccess &= int.TryParse(l, out lhs);
                  parseSuccess &= (rhs != int.MinValue && lhs != int.MinValue);

                  if (!parseSuccess)
                  {
                     throw new Exception("Unable to parse valueStack characters to Int32.");
                  }

                  int value = int.MinValue;
                  if (tokenChar == '+')
                  {
                     value = lhs + rhs;
                  }
                  else if (tokenChar == '-')
                  {
                     value = lhs - rhs;
                  }
                  else if (tokenChar == '*')
                  {
                     value = lhs * rhs;
                  }
                  else if (tokenChar == '/')
                  {
                     value = lhs / rhs;
                  }
                  else if (tokenChar == '^')
                  {
                     value = (int)Math.Pow(lhs, rhs);
                  }

                  if (value != int.MinValue)
                  {
                     stack.Push(value.ToString());
                  }
                  else
                  {
                     throw new Exception("Value never got set.");
                  }
               }
               else
               {
                  throw new Exception(string.Format("Unrecognized character '{0}'.", tokenChar));
               }
            }
         }
         else
         {
            throw new Exception("Token length is less than one.");
         }
      }

      if (stack.Count == 1)
      {
         int result = 0;
         if (!int.TryParse(stack.Pop(), out result))
         {
            throw new Exception("Last value on stack could not be parsed into an integer.");
         }
         else
         {
            return result;
         }
      }
      else
      {
         throw new Exception("The input has too many values for the number of operators.");
      }

   } // method
} // class

An alternative technique is to use the Shunting-Yard Algorithm to turn infix notation into an abstract syntax tree (Linq.Expressions, anyone?). I will likely post this technique later.
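In the meantime, here is a minimal sketch of the Shunting-Yard idea. Note that it emits a postfix (RPN) string rather than building an abstract syntax tree, so it is a simpler stand-in for the technique mentioned above, not the follow-up post itself. The class name ShuntingYardSketch, the method ToPostfix and the precedence table are all assumptions made for illustration; it handles non-negative integers, the five operators used above, and parentheses.

```csharp
using System;
using System.Collections.Generic;

public static class ShuntingYardSketch
{
    // Assumed precedence table: a higher number binds tighter.
    private static readonly Dictionary<char, int> Precedence = new Dictionary<char, int>
    {
        { '+', 1 }, { '-', 1 }, { '*', 2 }, { '/', 2 }, { '^', 3 }
    };

    // Converts an infix expression such as "3 + 4 * 2" into space-separated postfix.
    public static string ToPostfix(string infix)
    {
        var output = new List<string>();
        var operators = new Stack<char>();
        int i = 0;

        while (i < infix.Length)
        {
            char c = infix[i];
            if (char.IsWhiteSpace(c)) { i++; continue; }

            if (char.IsDigit(c))
            {
                // Read a whole multi-digit number and send it straight to the output.
                int start = i;
                while (i < infix.Length && char.IsDigit(infix[i])) { i++; }
                output.Add(infix.Substring(start, i - start));
                continue;
            }

            if (Precedence.ContainsKey(c))
            {
                // Pop operators that bind tighter, or equally tight when the
                // incoming operator is left-associative (everything except '^').
                while (operators.Count > 0 && operators.Peek() != '(')
                {
                    char top = operators.Peek();
                    bool popTop = Precedence[top] > Precedence[c]
                        || (Precedence[top] == Precedence[c] && c != '^');
                    if (!popTop) { break; }
                    output.Add(operators.Pop().ToString());
                }
                operators.Push(c);
            }
            else if (c == '(')
            {
                operators.Push(c);
            }
            else if (c == ')')
            {
                while (operators.Count > 0 && operators.Peek() != '(')
                {
                    output.Add(operators.Pop().ToString());
                }
                if (operators.Count == 0) { throw new FormatException("Mismatched parentheses."); }
                operators.Pop(); // discard the matching '('
            }
            else
            {
                throw new FormatException(string.Format("Unrecognized character '{0}'.", c));
            }
            i++;
        }

        // Flush whatever operators remain.
        while (operators.Count > 0)
        {
            char op = operators.Pop();
            if (op == '(') { throw new FormatException("Mismatched parentheses."); }
            output.Add(op.ToString());
        }
        return string.Join(" ", output);
    }
}
```

Treating '^' as right-associative is a design choice, so "2 ^ 3 ^ 2" comes out as "2 3 2 ^ ^" while "8 - 3 - 2" comes out as "8 3 - 2 -".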

Other blog posts by me that are related to this article are the Threaded Equation Finder, a Mixed Radix System Calculator and Drawing Text Along a Bezier Spline.

Tuesday, December 1, 2015

Detailed Exception class

The C# StackTrace class can be useful for logging the source of errors, but when your assembly is built in Release mode, you lose valuable information in the StackFrame, like the line number, the column number or the file name.

Part of my error handling strategy involved setting an error string, and using StackTrace to log the function calling the setter and the location in the code where the error occurred. Unfortunately, as mentioned above, I was losing error information like the line number, and that kind of information sure is nice to have. That's why I invented the DetailedException class.

In .NET 4.5, one can get caller information by using optional parameters (parameters with default values) tagged with special attributes from the System.Runtime.CompilerServices namespace, namely CallerFilePathAttribute, CallerMemberNameAttribute and CallerLineNumberAttribute.

How about a code example:

 using System;
 using System.Runtime.CompilerServices;

 [Serializable]
 public class DetailedException : Exception
 {
  public int SourceLineNumber { get; private set; }
  public string SourceFilePath { get; private set; }
  public string SourceMemberName { get; private set; }
   
  // The compiler fills in the three optional parameters at each call site.
  public DetailedException(string message,
     [CallerMemberName] string sourceMemberName = "",
     [CallerFilePath] string sourceFilePath = "",
     [CallerLineNumber] int sourceLineNumber = 0)
   : base(message)
  {
   this.SourceMemberName = sourceMemberName;
   this.SourceFilePath = sourceFilePath;
   this.SourceLineNumber = sourceLineNumber;
  }
 }

Now if you have to throw an exception, throw new DetailedException("Testing DetailedException. WOW. SUCH DETAILS."); and you will gain information like SourceLineNumber!

If you decide to overload the constructor, be warned: you will be required to use named parameters when calling the DetailedException constructor.
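To see the caller information arrive, here is a small sketch that throws and catches the exception. The class is repeated so the snippet compiles on its own; DetailedExceptionDemo, FailingOperation and Describe are hypothetical names I made up for the illustration, not part of the original code.

```csharp
using System;
using System.IO;
using System.Runtime.CompilerServices;

[Serializable]
public class DetailedException : Exception
{
    public int SourceLineNumber { get; private set; }
    public string SourceFilePath { get; private set; }
    public string SourceMemberName { get; private set; }

    public DetailedException(string message,
        [CallerMemberName] string sourceMemberName = "",
        [CallerFilePath] string sourceFilePath = "",
        [CallerLineNumber] int sourceLineNumber = 0)
        : base(message)
    {
        this.SourceMemberName = sourceMemberName;
        this.SourceFilePath = sourceFilePath;
        this.SourceLineNumber = sourceLineNumber;
    }
}

public static class DetailedExceptionDemo
{
    // Hypothetical helper: throws from a known member so the captured
    // caller information can be inspected by whoever catches it.
    public static void FailingOperation()
    {
        throw new DetailedException("Testing DetailedException.");
    }

    // Hypothetical helper: formats the captured details for a log line.
    public static string Describe(DetailedException ex)
    {
        return string.Format("{0} (in {1}, {2}:{3})",
            ex.Message,
            ex.SourceMemberName,
            Path.GetFileName(ex.SourceFilePath),
            ex.SourceLineNumber);
    }
}
```

Catching the exception thrown by FailingOperation() and passing it to Describe() should give a line naming FailingOperation along with the file and line of the throw statement. Because the values are baked in by the compiler at the call site, they should survive a Release build.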

A Simple Word Prediction Library

The word prediction feature on our phones is pretty handy, and I've always thought it would be fun to write one. Last night I decided to check that off my list. As usual, the whole project and all of its code is available to browse on GitHub. I talk more about the library and the design choices I made below the obnoxiously long image:

[Image of Windows Phone's Word Prediction feature]

Visit the project and view the code on my GitHub, right here.
(Project released under Creative Commons)

Overview:

One thing you might notice, if for no other reason than I bring it up, is that I favor composition over inheritance. That is, my classes use a Dictionary internally, but they do not inherit from Dictionary. My word prediction library is not a minor variation or different flavor of the Dictionary class, and while it might be cool to access the word predictions for a word via an indexer, my word prediction library should not be treated as a dictionary.

Under the hood:

There is a dictionary (a list of key/value pairs) of 'Word' objects. Each Word object has a value (the word) and its own dictionary of Word objects, implemented as its own separate class (that does not inherit from Dictionary). This hidden dictionary inside each Word object keeps track of the probabilities of the next word for that given word. It does so by storing a Word as the key and an integer counter as the value; the counter gets incremented every time it tries to add a word that already exists in the dictionary (similar to my frequency dictionary, covered here).
The WordPredictionDictionary class doesn't grow exponentially, because each word is only represented once, by one Word object. The dictionaries inside the Word class only store references to the Word objects, not copies of their values.
In order to begin using the WordPredictionDictionary to suggest words, one must first train the WordPredictionDictionary on a representative body of text.
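
The structure described above can be sketched roughly as follows. This is a simplified illustration and not the code from the GitHub project: WordPredictionSketch, Train, Suggest and AddNext are assumed names, and the per-word counter is a plain private Dictionary instead of the separate wrapper class the real library uses.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public class Word
{
    public string Value { get; private set; }

    // Counts how often each other Word has followed this one.
    // The keys are references to shared Word objects, not copies.
    private readonly Dictionary<Word, int> nextWordCounts = new Dictionary<Word, int>();

    public Word(string value) { Value = value; }

    public void AddNext(Word next)
    {
        int count;
        nextWordCounts.TryGetValue(next, out count); // count stays 0 if absent
        nextWordCounts[next] = count + 1;
    }

    // Most frequent followers first; ties are broken alphabetically.
    public IEnumerable<string> Suggest(int howMany)
    {
        return nextWordCounts
            .OrderByDescending(pair => pair.Value)
            .ThenBy(pair => pair.Key.Value, StringComparer.Ordinal)
            .Take(howMany)
            .Select(pair => pair.Key.Value)
            .ToList();
    }
}

public class WordPredictionSketch
{
    // Composition: the dictionary is a private detail, not a base class.
    private readonly Dictionary<string, Word> words =
        new Dictionary<string, Word>(StringComparer.OrdinalIgnoreCase);

    // Walks the text and records, for every word, which word came next.
    public void Train(string text)
    {
        Word previous = null;
        char[] separators = new[] { ' ', '\t', '\r', '\n' };
        foreach (string token in text.Split(separators, StringSplitOptions.RemoveEmptyEntries))
        {
            Word current = GetOrAdd(token);
            if (previous != null) { previous.AddNext(current); }
            previous = current;
        }
    }

    public IEnumerable<string> Suggest(string word, int howMany)
    {
        Word found;
        return words.TryGetValue(word, out found)
            ? found.Suggest(howMany)
            : Enumerable.Empty<string>();
    }

    private Word GetOrAdd(string token)
    {
        Word word;
        if (!words.TryGetValue(token, out word))
        {
            word = new Word(token);
            words.Add(token, word);
        }
        return word;
    }
}
```

After training on "the cat sat on the mat the cat ran", asking for one suggestion after "the" yields "cat", since it followed "the" twice and "mat" only once.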

TODO:

~~Write methods to serialize the trained data sets so they can be saved and reloaded.~~ This has been implemented.
Write an IntelliSense-like word suggestion program that implements the WordPredictionDictionary in an end-user application.

Thursday, October 29, 2015

Thinq - A Linq Experiment

   View/Download the source code from the project's GitHub

   So I wrote a program, just an experiment, where I was making a range class using IEnumerables (C#), in which each element doesn't have to increment by one, but can increment by any amount. So I was creating ranges like 7 to 10 million, increment by 7, so that upon enumeration it would yield multiples of 7. This is also called an arithmetic progression.
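
   A range class along those lines might look something like this. It is only a sketch: the name SteppedRange and its constructor parameters are assumptions for illustration, not the class from the Thinq project.

```csharp
using System;
using System.Collections;
using System.Collections.Generic;

// An arithmetic progression: first, first + step, first + 2 * step, ... up to last.
public sealed class SteppedRange : IEnumerable<long>
{
    private readonly long first;
    private readonly long last;
    private readonly long step;

    public SteppedRange(long first, long last, long step)
    {
        if (step <= 0) { throw new ArgumentOutOfRangeException("step", "Step must be positive."); }
        this.first = first;
        this.last = last;
        this.step = step;
    }

    public IEnumerator<long> GetEnumerator()
    {
        // Values are produced lazily, one per MoveNext() call.
        // Assumes last is far enough below long.MaxValue that n += step cannot overflow.
        for (long n = first; n <= last; n += step)
        {
            yield return n;
        }
    }

    IEnumerator IEnumerable.GetEnumerator()
    {
        return GetEnumerator();
    }
}
```

   For example, new SteppedRange(7, 10000000, 7) enumerates the multiples of 7 up to 10 million without ever holding more than one value at a time.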

   Then I started combining different multiples with query operators like Where or Intersect, for example IEnumerable<long> result = multiples7.Intersect(multiples13), essentially creating a function that keeps only those numbers that are multiples of both 7 and 13, starting with the least common multiple.
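
   Chaining Intersect over several progressions can be sketched like this. The names IntersectSketch, Multiples and CommonMultiples are hypothetical, and this is my reconstruction of the idea rather than the original code. One thing worth knowing: LINQ's Intersect reads its second sequence into a set before it yields anything, so memory use grows with the size of the ranges.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class IntersectSketch
{
    // Lazily yields step, 2 * step, 3 * step, ... up to and including max.
    public static IEnumerable<long> Multiples(long step, long max)
    {
        for (long n = step; n <= max; n += step)
        {
            yield return n;
        }
    }

    // Keeps only the numbers that appear in every progression.
    // Throws InvalidOperationException if steps is empty (Aggregate has no seed).
    public static List<long> CommonMultiples(IEnumerable<long> steps, long max)
    {
        return steps
            .Select(step => Multiples(step, max))
            .Aggregate((left, right) => left.Intersect(right))
            .ToList();
    }
}
```

   Calling CommonMultiples(new long[] { 3, 5, 7 }, 250) gives 105 and 210, the two common multiples of 3, 5 and 7 that fit under 250.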

   So I began testing. After some playing, I decided to take the first 7 odd primes (3 through 19) and find any common multiples of them between 1 and 10 million. Much to my surprise, it found all the common multiples of those 7 primes under 10 million (there are only two of them, 4849845 and 9699690), and it did it in 500 milliseconds on some very modest hardware (1 core, 2.16GHz, 4GB RAM).

   I bumped up the ceiling to 50 million and I got an OutOfMemoryException, because the IEnumerable holds on to every value it gets from the function MoveNext(). I threw in some metrics and discovered that it took about 3 seconds and some 32 million 64-bit integers for my computer to declare 'out of memory'.

   Well, at least it was fast. Even if it did eat up all my RAM in 3 seconds, it was still promising.

   The solution was to create an IEnumerator that was aware of the arithmetic sequences that constrained the result set. When MoveNext() is called repeatedly during enumeration, I avoid the unbounded memory requirement by restricting the result set returned from MoveNext(); it returns the next whole number that is divisible by every arithmetic sequence's 'common difference', or increment value. In this way, you have created an enumerable sequence that is the _intersection_ of all of the sequences.

   The enumerator is prevented from running to infinity by obeying two limits: a maximum numeric value (cardinal) that GetNext() will return up to ("results less than 50 million") and a maximum quantity of results (ordinal) that GetNext() yields ("the one millionth result"). If either of these limits is exceeded, the while loop's condition will fail to evaluate to true. It is very common for my processor-intensive, long-running or 'mathy' applications to employ a temporal limit (maximum time-to-live) or support cancellation, but this little experiment has been so performant that I have been able to get by without one.
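
   Here is one way such a bounded enumerator could look. To be clear about what is swapped in: this sketch computes the least common multiple of the increments up front and then steps by it, which may not be how the real Thinq enumerator finds the next value. CommonMultipleSketch, Gcd, Lcm and the parameter names are all assumptions for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class CommonMultipleSketch
{
    // Euclid's algorithm.
    public static long Gcd(long a, long b)
    {
        while (b != 0)
        {
            long t = a % b;
            a = b;
            b = t;
        }
        return a;
    }

    // Divide before multiplying to keep the intermediate value small;
    // checked makes an overflow throw instead of silently wrapping.
    public static long Lcm(long a, long b)
    {
        return checked(a / Gcd(a, b) * b);
    }

    // Yields every number divisible by all of the given increments, stopping at
    // whichever limit comes first: maxValue (cardinal) or maxCount (ordinal).
    public static IEnumerable<long> CommonMultiples(
        IEnumerable<long> differences, long maxValue, int maxCount)
    {
        long[] steps = differences.ToArray();
        if (steps.Length == 0 || steps.Any(d => d <= 0))
        {
            throw new ArgumentException("Increments must be positive.", "differences");
        }

        long step = steps.Aggregate(1L, (acc, d) => Lcm(acc, d));
        long n = step;
        int yielded = 0;

        while (n <= maxValue && yielded < maxCount)
        {
            yield return n;
            yielded++;
            if (n > maxValue - step) { yield break; } // next value would pass the limit
            n += step;
        }
    }
}
```

   Nothing is buffered here: the only state is the current value, the step and a counter, so memory use stays flat no matter how high the ceiling goes.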

   So what kind of improvement did we get out of our custom enumerable? I can now find all the common multiples of the first 8 odd primes up to 1 billion in 25 seconds! I was impressed; the application used to max out around 50 million and run out of memory, and now it can investigate up to one billion in a reasonable amount of time, and the memory it uses is not much more than the 8 or so integers in the result set. 1 billion, however, seems to be the sweet spot for my single 2.13 GHz laptop. I ran the same 8 primes to 2 billion and it took 1 minute, 12 seconds:


TIME ELAPSED: 01:12.38
LCM[3,5,7,11,13,17,19,23] (max 2,000,000,000)

17 FACTORS:

111546435 
223092870 
334639305 
446185740 
557732175 
669278610 
780825045 
892371480 
1003917915 
1115464350 
1227010785 
1338557220 
1450103655 
1561650090 
1673196525 
1784742960 
1896289395
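
As a sanity check on that output, the arithmetic can be worked by hand. Since the eight numbers are distinct primes, their least common multiple is simply their product, and the number of results is how many times that product fits under the 2 billion ceiling:

```latex
\begin{align*}
\operatorname{lcm}(3,5,7,11,13,17,19,23) &= 3 \cdot 5 \cdot 7 \cdot 11 \cdot 13 \cdot 17 \cdot 19 \cdot 23 = 111{,}546{,}435 \\
\left\lfloor \frac{2{,}000{,}000{,}000}{111{,}546{,}435} \right\rfloor &= 17 \\
17 \times 111{,}546{,}435 &= 1{,}896{,}289{,}395
\end{align*}
```

That matches both the count of 17 and the last value listed.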
