
Tuesday, December 6, 2016

JavaScript inside SVG?



Readers might or might not be aware that an SVG file is really just an XML document. And yes, as you suspected, it allows JavaScript, and viewers are expected to support it, as if that were a sane thing to do. I'm sure this kind of thing is just about as smart as it sounds; in fact, I believe I first discovered this foolhardy feature in a forum post displaying some obfuscated JavaScript inside an svg tag, in reference to an XSS attack leveraging Facebook.

I had to run a quick test to see if it was true. I typed the following text into a file with the extension .svg, and the file opened up in Internet Explorer:

<svg version="1.1" xmlns="http://www.w3.org/2000/svg">
  <script type="text/javascript">alert('Hax!');</script>
</svg>

Now, I use Chrome and keep my Internet Explorer locked down. I don't know what comes up on other people's machines, but at least mine prompted me, asking if I wanted to allow blocked ActiveX controls. Chrome, however, ran it no problem, without bothering to prompt me. Joy.

I don't know about you, but I don't want my images to be able to prompt me. It is not hard to imagine a scenario where an SVG served up by a server could contain JavaScript attempting to access the user's cookies for that server.

While the black hats of the world are busy thinking of new (ab)uses for this technology, it is interesting to consider the creative aspects. If an SVG contains only a JavaScript loop that draws the image, can the image be said to draw itself? Clearly not in the most literal sense: the image is interpreted, and it is the interpreter that does the drawing. Yet in some sense it does draw itself. At any rate, it can certainly create some intense images in only 1 kB or so.

Naturally, my mind jumped to the idea of making a prime number sieve, as a way to make a complex drawing with only a single loop. After a little bit of playing, I had something that looks a lot like Sieves of Chaos. After a little more tweaking, I had something that looks almost identical to Sieves of Chaos.


I encourage you to try it out for yourself: just copy and paste the code below into Notepad++ and save it with a .SVG extension. You will notice the image is 8096 units wide. If your machine can handle it, there is no reason the width couldn't be extended. For some machines, this may already be a crippling number. The effect is very nice, and I even created a spanning desktop background out of it.


<svg version="1.1" xmlns="http://www.w3.org/2000/svg">
 <script type="text/javascript"><![CDATA[
  var width=8096;
  var height=900;
  var ns="http://www.w3.org/2000/svg";
  var svg = document.getElementsByTagName('svg')[0];
  svg.setAttribute("width", width);
  svg.setAttribute("height", height);
  
  var rect = document.createElementNS(ns, 'rect');
  rect.setAttribute("height", height);
  rect.setAttribute("width", width);
  rect.style.fill = "black";
  svg.appendChild(rect);

  // Each circle of radius b touches the number line at a and a+2b,
  // so every base b draws a chain of arcs linking numbers 2b apart.
  for (var b = 2; b < width; b += 1)
  {
   for (var a = 1; a < width; a += b*2)
   {
    var cir = document.createElementNS(ns, 'circle');
    cir.setAttribute("shape-rendering", "geometricPrecision");
    cir.setAttribute("stroke-opacity", "0.07");
    cir.setAttribute("r",  b);
    cir.setAttribute("cx", a+b);
    cir.setAttribute("cy", height/2);
    cir.style.stroke = "red";
    cir.style.strokeWidth = "1";
    cir.style.fill = "none";
    svg.appendChild(cir);
   }
  }
 ]]></script> 
</svg>

Blogger is wise, and does not accept .svg files. There is hope, however: here is the .png image that I am using for my background (below). Feel free to use that or render your own with the above code. Be warned: the PNG file below is 3 MB.






Wednesday, September 21, 2016

RC4 stream cipher variants and visualization of table permutation state



Here, I present some work I have been doing on two RC4 stream cipher variants. The first variant, seen below, I wrote to help me visualize and understand what the RC4 table was doing, and to help me understand its properties.

Identity Permutation

The class that contains it is called SimpleTable, and it is exactly that: the simplest RC4 implementation possible. It is notable in that it does not use key scheduling at all, and its starting state is the identity permutation. The identity permutation is where the value at index zero equals zero, the value at index one is one, and so on. An easy way to remember what the identity permutation is: recall the notion of a multiplicative identity (which is 1), where multiplying a number N by the multiplicative identity gives you back N. Similarly, the identity permutation of an array A just gives you A. This is the trivial permutation; that is, there is no permuting of the array at all!

Anyways, this is done to see the perfectly ordered state, and how each round affects that state. In this way, we can visually check for the avalanche effect. In order to visualize the table, I just assign each value 0 to 255 a different shade of grey (I also have a rainbow-colored option that might make it easier to tell similar values apart). At each step I create a Bitmap by looping through the table. Below, you can find an animated GIF of the first 100 steps of this cipher being applied to the identity permutation:


Notice how it takes a while to get going, and the first several values don't move much at all. After 256 steps, or one round, the cursor arrives back at index zero. Because the locations of the first several values have moved little or not at all, we can clearly see that a mere 256 steps is insufficient to permute the state enough to avoid leaking the first part of your key. Therefore, it is vital to permute the table for several rounds (256 steps per round) before you start using the stream.
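
For the curious, the rendering itself is nothing fancy. A minimal sketch of the idea (my SimpleTable class is not shown; names here are illustrative), using System.Drawing:

using System.Drawing;

// Render a 256-entry permutation table as a 16x16 grayscale bitmap,
// mapping each value 0-255 directly to a shade of grey.
static Bitmap RenderTable(byte[] table)
{
    Bitmap bitmap = new Bitmap(16, 16);
    for (int index = 0; index < 256; index++)
    {
        int shade = table[index];
        bitmap.SetPixel(index % 16, index / 16, Color.FromArgb(shade, shade, shade));
    }
    return bitmap;
}

Save one frame per step (scaled up so the pixels are visible) and stitch them together to get an animation like the one above.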

Wired Equivalent Privacy

As some of you may know, WEP used RC4 with a weak key schedule. The key is spread out over 256 bytes using the following approach:


// repeated for each i from 0 to 255:
j = (j + table[i] + key[i % keyLength]) % 256;
Swap(ref table[i], ref table[j]);  // exchange table[i] and table[j]


and then it begins streaming bytes from the table. Typically a nonce is concatenated to the key. Every time the table is set up and/or the nonce changes, some information about the key is leaked. Obviously, a more secure procedure would use a hash of the key and the nonce instead of the plain-text key, and would toss away the first 1024 bytes or so.
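
To make that concrete, here is a sketch (illustrative names, not the project's actual code) that schedules the table from a SHA-256 hash of key + nonce instead of the raw key; from there, clock the generator 1024 times and throw the output away before using the stream.

using System.Security.Cryptography;

static byte[] ScheduleFromHash(byte[] key, byte[] nonce)
{
    byte[] seed;
    using (SHA256 sha = SHA256.Create())
    {
        byte[] combined = new byte[key.Length + nonce.Length];
        key.CopyTo(combined, 0);
        nonce.CopyTo(combined, key.Length);
        seed = sha.ComputeHash(combined);   // 32 bytes; reveals nothing directly about the key
    }

    byte[] table = new byte[256];
    for (int i = 0; i < 256; i++)
        table[i] = (byte)i;                 // start from the identity permutation

    int j = 0;
    for (int i = 0; i < 256; i++)           // the same key schedule as above, keyed by the hash
    {
        j = (j + table[i] + seed[i % seed.Length]) % 256;
        byte t = table[i]; table[i] = table[j]; table[j] = t;
    }
    return table;
}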

Cycle Length

Because each step in an RC4 cipher is a permutation, there is a limit to the number of unique bytes that can be produced before it begins repeating. This is called the permutation cycle. The length of the permutation cycle depends on the exact starting state, but we can get an upper bound.

Since there are 256 elements in the array, and two indices into the array (i and j), there is a maximum of
256! * 256^2 ≈ 5.62 * 10^511 ≈ 2^1700
possible states. That's roughly 5.6 * 10^487 yottabytes!

This is the maximum possible number of states, however, and other starting states could have less. If the RC4 algorithm behaved as a random permutation (which it does not; it performs worse), the expected cycle length would be half of the theoretical maximum above. Luckily, the number above is so vast that even some fraction of it is still so many bytes that all of humanity has never had, and likely never will have, that much total storage.

One thing to watch out for, however, is something called a Finney state. If an RC4 cipher is started in one of these Finney states, the length of the cycle is much, much reduced. The chance of randomly generating one of these starting states, however, is VERY, very low.

Strengthening RC4

As stated, and visualized, above, it is vital to permute the table for several rounds (at 256 steps per round) after the key schedule, discarding the bytes, before you start using the stream. Also, it would be foolish to use the actual bytes of the key for permuting the starting state. It would instead be better to use a hash of the key + nonce, or a key derivation function of the key, rather than the actual value of the key itself.

Another idea: after shuffling the table for enough rounds to hide the key, scramble it for an additional number of rounds, that number being some function of the key. This increases the possible starting states by whatever your range is.

In the classic RC4, each step would return one byte. The number of steps taken before returning each byte is configurable in my implementation.
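
A minimal sketch of what that looks like (illustrative names, not the project's actual classes):

sealed class Rc4Stream
{
    private readonly byte[] table;
    private readonly int stepsPerByte;
    private int i, j;

    public Rc4Stream(byte[] scheduledTable, int stepsPerByte)
    {
        table = scheduledTable;
        this.stepsPerByte = stepsPerByte;
    }

    public byte NextByte()
    {
        for (int step = 0; step < stepsPerByte; step++)   // classic RC4 uses exactly one step here
        {
            i = (i + 1) % 256;
            j = (j + table[i]) % 256;
            byte t = table[i]; table[i] = table[j]; table[j] = t;
        }
        return table[(table[i] + table[j]) % 256];
    }
}

XORing plain text against NextByte() output encrypts; XORing the cipher text against the same stream decrypts, as discussed under 'Other uses' below.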

Memory hardening

Check out the experimental branch for a memory-hardened version. It stores the key in memory XORed with a one-time pad, and protects it from access with the System.Security.Cryptography.ProtectedMemory class.

Other uses

The pseudo-random byte stream from the RC4 table is deterministic. Therefore, if two remote computers share a secret, both can independently set up an RC4 table with exactly the same starting state and get the same sequence of bytes, which would be difficult to guess given just the stream of bytes. If the plain text is XORed with the pseudo-random byte stream to encrypt it, then it can be decrypted by XORing it with the same byte stream.

The project includes two variants: 1) a simple table with a method to visualize the permutation state of the table and the avalanche effect as a bitmap, and 2) a more serious attempt at a secure implementation.


NOTE: THIS HAS NOT BEEN CRYPTO-ANALYZED AND PROBABLY NOT ACTUALLY SECURE, SO DO NOT TRUST IT!

Screenshots




Source code

Here is the GitHub page to the project (master branch).
Or just directly download the Zip file (experimental branch).




Wednesday, June 29, 2016

Bloom Filter - A novel, space-efficient data structure like a hash table, for billions of values.




Introduction


A bloom filter is a truly novel data structure. Similar to a hash table, it can tell you if you've hashed a particular value previously. But you can add many, many more values to a bloom filter than you can to a hash table; it does not degrade performance as the number of values in the set grows large; and it requires only a fraction of the space of a hash table to store it!

This is not just an academic exercise, or something that only works in theory or in special cases. Indeed, companies like Google use bloom filters to quickly determine whether they have never seen a value before, thus avoiding a more costly lookup against a database every time the bloom filter returns false.



Probabilistic


First off, it's important to understand that a bloom filter is NOT a hash table; it operates in an entirely different way. A bloom filter is what is known as a probabilistic data structure: it can tell you, to within a certain probability, whether an element exists in a set. In other words, false positive matches ARE possible, but false negative matches ARE NOT possible. For example, if you check a bloom filter for the existence of a value and it returns false, you know with 100% certainty that the bloom filter has never seen that value before. However, if you test a value against the filter and it returns true, there is a small probability that it has in fact not seen that value before and is returning a false positive. How big a probability? Here's the beauty: it can be as small as you want it to be. It depends on a few factors, including the size of the filter, how full it is, and how many bits you use to store each value in the filter.

In a HashTable class, each item is stored as a key-value pair, so the size of your object plus a 32-bit integer. Contrast that with a bloom filter, which stores only about 3-7 bits per value hashed. Also, my implementation applies compression when saving the filter to disk, providing even more space savings. A bloom filter with 160,000 values hashed and a 1% collision probability results in a filter that is 235 KB uncompressed, and a whopping 54 KB when compressed! Remember, the filter is an array of bits. The entropy of the array is at its greatest, and thus the compression ratio lowest, when exactly half of the bits are flipped, i.e. when the filter is half-way 'full'. This gives the compressed file the unusual property of getting smaller as you add more hashes past that point. To be clear, the actual filter itself never changes size; it's only the compressed version that varies in size.

To handle the compression, I just used the System.IO.Compression.DeflateStream class. An important note about working with this class: build an array of bytes and send your entire file in one go. That way it will compress the whole file as one chunk. If you send data to this stream piecemeal, it will compress each piece separately and you will get a poor compression ratio.
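
For example, a one-shot compression helper might look like this sketch:

using System.IO;
using System.IO.Compression;

// Compress the whole filter in one go: a single Write over one byte array,
// so Deflate sees the file as one chunk and compresses it as such.
static byte[] CompressWhole(byte[] filterBytes)
{
    using (MemoryStream output = new MemoryStream())
    {
        using (DeflateStream deflate = new DeflateStream(output, CompressionMode.Compress))
        {
            deflate.Write(filterBytes, 0, filterBytes.Length);
        }
        return output.ToArray();   // ToArray is valid even after the streams are closed
    }
}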

How it works


So how does this all work? The filter part of a bloom filter is just a large array of bits. You also require several different hash functions that each return a unique result for the same input value. When you add a value to the filter, the value is sent to about 3-7 different hash functions. Each hash function returns a value between 0 and the number of bits in the filter; each of those values is used as an index into the array of bits that is your filter. When adding a value, you just set the bit at each index location to 1.

When testing for the presence of a value, you pass the value to the hash functions the same way as above, then visit each index, checking to see if any of the bits are 0. If even one bit at one of those index positions is zero, the filter has never seen that value before, because adding it would have set all those bits to 1. If all the bits at the index locations are 1, then it is likely that the filter has seen that value before. However, there is a chance that it is a false positive, because those bits could all have been set by the hashes of other values. As the filter becomes more full, more bits are set to 1, and so the odds of a false positive go up. Build your filter by supplying the estimated number of values you think you are likely to store in it, and don't go above a certain ratio of 1 bits to 0 bits. If you were to let your filter hash so many values that every bit got set to 1, the probability of receiving a false positive for a random value becomes 100%.
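
Here is a bare-bones sketch of that add/test logic. Note this is not the post's implementation: index derivation here uses the two halves of an MD5 hash for brevity, whereas my project derives indices from a stream cipher, as described in the next section.

using System;
using System.Collections;
using System.Security.Cryptography;
using System.Text;

sealed class TinyBloomFilter
{
    private readonly BitArray bits;
    private readonly int hashCount;

    public TinyBloomFilter(int sizeInBits, int hashCount)
    {
        bits = new BitArray(sizeInBits);
        this.hashCount = hashCount;
    }

    public void Add(string value)
    {
        foreach (int index in GetIndices(value))
            bits[index] = true;             // set every bit the hashes point at
    }

    public bool MightContain(string value)
    {
        foreach (int index in GetIndices(value))
            if (!bits[index])
                return false;               // a single 0 bit proves the value was never added
        return true;                        // all 1s: probably seen before (false positives possible)
    }

    private int[] GetIndices(string value)
    {
        byte[] hash;
        using (MD5 md5 = MD5.Create())
            hash = md5.ComputeHash(Encoding.UTF8.GetBytes(value));

        int h1 = BitConverter.ToInt32(hash, 0) & 0x7FFFFFFF;
        int h2 = BitConverter.ToInt32(hash, 4) & 0x7FFFFFFF;

        int[] indices = new int[hashCount];
        for (int k = 0; k < hashCount; k++)
            indices[k] = (int)(((long)h1 + (long)k * h2) % bits.Length);
        return indices;
    }
}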



Solving the many hash problem

As I mentioned before, this requires several different hash functions that each return a unique result for the same input value. Although I said 3-7 hash functions, you might require 14 or more when working with filters that must handle a large number of hashes, or a low false positive likelihood, or both.

Instead of writing a bunch of separate hash algorithms, I implemented a stream cipher wherein I scramble the cipher table by a number of rounds that is unique to that input. Then, I can return as many indices as the filter is configured for. This sets up the table once per value. It needs to reset the table each time, or else the indices that we mark would depend on every value that came before, and in that particular order. Currently the bottleneck is how many times it has to scramble the table for each value. If you need to hash really long values, you'll want to lower the number of rounds it scrambles the table.
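
The shape of that idea, sketched (the real implementation differs in its details; this just shows an RC4-style table keyed by the value, then read off for as many indices as the filter needs):

// Derive hashCount indices for one value from a value-keyed cipher table.
// Assumes value.Length > 0.
static int[] IndicesFromCipher(byte[] value, int hashCount, int filterSizeInBits)
{
    byte[] table = new byte[256];
    for (int i = 0; i < 256; i++)
        table[i] = (byte)i;                 // fresh identity table per value, per the reset note above

    int j = 0;
    for (int i = 0; i < 256; i++)           // scramble the table with the input value
    {
        j = (j + table[i] + value[i % value.Length]) % 256;
        byte t = table[i]; table[i] = table[j]; table[j] = t;
    }

    int[] indices = new int[hashCount];
    int a = 0, b = 0;
    for (int k = 0; k < hashCount; k++)
    {
        int accumulator = 0;
        for (int round = 0; round < 4; round++)   // pull 4 keystream bytes per index
        {
            a = (a + 1) % 256;
            b = (b + table[a]) % 256;
            byte t = table[a]; table[a] = table[b]; table[b] = t;
            accumulator = (accumulator << 8) | table[(table[a] + table[b]) % 256];
        }
        indices[k] = (int)((uint)accumulator % (uint)filterSizeInBits);
    }
    return indices;
}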



Variations


In this implementation, the bloom filter's size is set once you create it, meaning it cannot grow bigger if it gets too full, nor can you resize it to become smaller if you sized it too big. Because multiple values can rely on the same bit, this implementation does not support removal of items; to do so would cause several other values to begin reporting false negatives.

In order to make a bloom filter that supports deletion, use a number (like a byte) instead of a bit for each slot in your filter, and each time you visit an index while adding a value, increment the number you find there. Then, to delete a value, visit each index as you did before, but decrement the number there. This way, if two values map to the same index, that information is tracked by the count. This is what is known as a Counting Bloom Filter.
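
A sketch of that counting variant (using whatever index derivation the filter already has):

// A byte of count per slot instead of a single bit, so values can be removed.
sealed class CountingBloomFilter
{
    private readonly byte[] counts;

    public CountingBloomFilter(int slots) { counts = new byte[slots]; }

    public void Add(int[] indices)
    {
        foreach (int i in indices)
            if (counts[i] < byte.MaxValue) counts[i]++;   // saturate rather than wrap around
    }

    public void Remove(int[] indices)
    {
        foreach (int i in indices)
            if (counts[i] > 0) counts[i]--;
    }

    public bool MightContain(int[] indices)
    {
        foreach (int i in indices)
            if (counts[i] == 0) return false;   // a zero count still proves absence
        return true;
    }
}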

There are other variants of bloom filters out there, including bloom filters that can grow in size if they get too full, but such a thing is beyond the scope of my needs. In essence, when the filter gets too full, you create another separate filter, and add new values by first checking the first filter to see if the value exists, and if not, adding the value to the second filter. Checking for the presence of a value then requires checking both (and any further) filters. For information on scalable bloom filters, please see this whitepaper.

The code


My C# Bloom Filter project on GitHub
Or download zip here.








Tuesday, December 1, 2015

Detailed Exception class



   The C# StackTrace class can be useful for logging the source of errors, but when your assembly is built in Release mode, you lose valuable information in the StackFrame, like the line number, the column number, and the file name.

   Part of my error handling strategy involved setting an error string and using StackTrace to log the function calling the setter and the location in the code where the error occurred. Unfortunately, as mentioned above, I was losing information like the line number, and that kind of information sure is nice to have. That's why I invented the DetailedException class.

   In .NET 4.5, one can get caller information by the use of optional parameters tagged with a special attribute, namely CallerFilePathAttribute, CallerMemberNameAttribute, and CallerLineNumberAttribute.


How about a code example:

 [Serializable]
 public class DetailedException : Exception
 {
  public int SourceLineNumber { get; private set; }
  public string SourceFilePath { get; private set; }
  public string SourceMemberName { get; private set; }
   
  public DetailedException(string message,
     [CallerMemberName] string sourceMemberName = "",
     [CallerFilePath] string sourceFilePath = "",
     [CallerLineNumber] int sourceLineNumber = 0)
   : base(message)
  {
   this.SourceMemberName = sourceMemberName;
   this.SourceFilePath = sourceFilePath;
   this.SourceLineNumber = sourceLineNumber;
  }
 }



   Now if you have to throw an exception, throw new DetailedException("Testing DetailedException. WOW. SUCH DETAILS."); and you will gain information like SourceLineNumber!

   If you decide to overload the constructor, be warned: you will be required to use named parameters when calling the DetailedException constructor.
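
A quick usage sketch:

try
{
    throw new DetailedException("Testing DetailedException. WOW. SUCH DETAILS.");
}
catch (DetailedException ex)
{
    // These values were baked in by the compiler at the throw site,
    // so they survive Release builds, unlike StackFrame data.
    Console.WriteLine("{0}:{1} in {2} - {3}",
        ex.SourceFilePath, ex.SourceLineNumber, ex.SourceMemberName, ex.Message);
}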

A Simple Word Prediction Library




   The word prediction feature on our phones is pretty handy, and I've always thought it would be fun to write one; last night I decided to check that off my list. As usual, the whole project and all of its code is available to browse on GitHub. I talk more about the library and the design choices I made below the obnoxiously long image:


[Image of Windows Phone's Word Prediction feature]

   Visit the project and view the code on my GitHub, right here.
      (Project released under Creative Commons)

Overview:

   One thing you might notice, if for no other reason than I bring it up, is that I favor composition over inheritance. That is, my classes use a Dictionary internally, but they do not inherit from Dictionary. My word prediction library is not a minor variation or different flavor of the Dictionary class, and while it might be cool to access the word predictions for a word via an indexer, my word prediction library should not be treated as a dictionary.

Under the hood:

   There is a dictionary (a list of key/value pairs) of 'Word' objects. Each Word class has a value (the word) and its own dictionary of Word objects, implemented as its own separate class (one that does not inherit from Dictionary). This hidden dictionary inside each Word class keeps track of the probabilities of the next word, for that given word. It does so by storing a Word as the key, and an integer counter value that gets incremented every time it tries to add a word to the dictionary that already exists (similar to my frequency dictionary, covered here).
The WordPredictionDictionary class doesn't grow exponentially, because each word is represented only once, by one Word class. The dictionaries inside the Word class only store references to the Word objects, not copies of their values.
In order to begin using the WordPredictionDictionary to suggest words, one must train it on a representative body of text.
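
To make the structure concrete, here is a heavily simplified sketch of the idea (illustrative only; the real library's classes differ, and its next-word dictionary is its own class rather than a raw Dictionary):

using System.Collections.Generic;
using System.Linq;

// Each Word tracks how often every other word immediately follows it.
sealed class Word
{
    public string Value { get; private set; }
    public Dictionary<Word, int> NextWords { get; private set; } // reference equality is fine: one Word per distinct string

    public Word(string value)
    {
        Value = value;
        NextWords = new Dictionary<Word, int>();
    }
}

sealed class WordPredictionDictionary
{
    private readonly Dictionary<string, Word> words = new Dictionary<string, Word>();

    public void Train(string text)
    {
        string[] tokens = text.ToLowerInvariant().Split(' ');
        for (int i = 0; i < tokens.Length - 1; i++)
        {
            Word current = GetOrAdd(tokens[i]);
            Word next = GetOrAdd(tokens[i + 1]);
            int count;
            current.NextWords.TryGetValue(next, out count);
            current.NextWords[next] = count + 1;   // bump the follower's counter
        }
    }

    public IEnumerable<string> Suggest(string word, int take)
    {
        Word found;
        if (!words.TryGetValue(word.ToLowerInvariant(), out found))
            return Enumerable.Empty<string>();
        return found.NextWords.OrderByDescending(kv => kv.Value)   // most frequent followers first
                              .Take(take)
                              .Select(kv => kv.Key.Value);
    }

    private Word GetOrAdd(string token)
    {
        Word word;
        if (!words.TryGetValue(token, out word))
        {
            word = new Word(token);
            words.Add(token, word);
        }
        return word;
    }
}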

TODO:

  • Write methods to serialize the trained data sets so they can be saved and reloaded. This has been implemented.
  • Write an intelli-sense-like word suggestion program that implements the WordPredictionDictionary in an end-user application.

Thursday, October 29, 2015

Thinq - A Linq Experiment








   View/Download the source code from the project's GitHub



   So I wrote a program, just an experiment, where I was making a range class using IEnumerables (C#), where each element doesn't have to increment by one, but can increment by any amount. So I was creating ranges like 7 to 10 million, increment by 7, so upon enumeration it would yield multiples of 7. This is also called an arithmetic progression.

   Then I started combining different multiples with query operators like Where or Intersect, e.g. IEnumerable<int> result = multiples7.Intersect(multiples13), essentially creating a function that keeps only those numbers that are multiples of both 7 and 13, starting with the least common multiple.
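
The starting point looked something like this sketch (illustrative, not the actual Thinq code): lazy multiples via yield return, combined with LINQ's built-in Intersect, which materializes a set internally, matching the memory blowup described below.

using System.Collections.Generic;
using System.Linq;

// An arithmetic progression: of, 2*of, 3*of, ... up to max.
static IEnumerable<long> Multiples(long of, long max)
{
    for (long n = of; n <= max; n += of)
        yield return n;
}

// Keeps only numbers that are multiples of both 7 and 13 (i.e. multiples of 91):
// var common = Multiples(7, 10000000).Intersect(Multiples(13, 10000000));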

   So I began testing. After some playing, I decided to take the first 7 primes and find any common multiples of them between 1 and 10 million. Much to my surprise, it found all the common multiples of the first 7 prime numbers under 10 million (there are only two of them: 4849845 & 9699690), and it did it in 500 milliseconds on some very modest hardware (1 core, 2.16 GHz, 4 GB RAM).

   I bumped up the ceiling to 50 million and got an OutOfMemoryException, because the IEnumerable holds on to every value it gets from the function MoveNext(). I threw in some metrics and discovered that it took about 3 seconds and some 32 million 64-bit integers for my computer to declare 'out of memory'.

   Well, at least it was fast. Even if it did eat up all my RAM in 3 seconds, it was still promising.


   The solution was to create an IEnumerator that was aware of the arithmetic sequences that constrained the result set. When MoveNext() is called repeatedly during enumeration, I avoid the infinite memory requirement by restricting the result set returned from MoveNext(): it returns the next whole number that is divisible by every arithmetic sequence's 'common difference', or increment value. In this way, you have created an enumerable sequence that is the _intersection_ of all of the sequences.

   The enumerator is prevented from running to infinity by obeying two limits: a maximum numeric value (cardinal) that GetNext() will return ("results less than 50 million"), and a maximum quantity of results (ordinal) that GetNext() yields ("the one millionth result"). If either of these limits is exceeded, the while loop will fail to evaluate to true. It is very common for my processor-intensive, long-running or 'mathy' applications to employ a temporal limit (maximum time-to-live) or support cancellation, but this little experiment has been so performant that I have been able to get by without one.
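
A sketch of that enumerator's core (illustrative; the actual Thinq code differs): walk multiples of the largest common difference, keep only candidates that every sequence divides, and stop when either limit is hit.

using System.Collections.Generic;
using System.Linq;

static IEnumerable<long> IntersectProgressions(long[] commonDifferences, long maxValue, long maxResults)
{
    long step = commonDifferences.Max();   // the coarsest sequence; all results are multiples of it
    long yielded = 0;

    for (long candidate = step; candidate <= maxValue && yielded < maxResults; candidate += step)
    {
        bool divisibleByAll = true;
        foreach (long difference in commonDifferences)
        {
            if (candidate % difference != 0) { divisibleByAll = false; break; }
        }

        if (divisibleByAll)
        {
            yielded++;
            yield return candidate;   // nothing is buffered, so memory use stays flat
        }
    }
}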

   So what kind of improvement did we get out of our custom enumerable? I can now find all the common multiples of the first 8 prime numbers up to 1 billion in 25 seconds! I was impressed; the application used to max out around 50 million and run out of memory, and now it can investigate up to one billion in a reasonable amount of time, and the memory it uses is not much more than the 8 or so integers in the result set. 1 billion, however, seems to be the sweet spot for my single-core 2.13 GHz laptop. I ran the same 8 primes to 2 billion and it took 1 minute, 12 seconds:



TIME ELAPSED: 01:12.38
LCM[3,5,7,11,13,17,19,23] (max 2,000,000,000)

17 FACTORS:

111546435 
223092870 
334639305 
446185740 
557732175 
669278610 
780825045 
892371480 
1003917915 
1115464350 
1227010785 
1338557220 
1450103655 
1561650090 
1673196525 
1784742960 
1896289395


Tuesday, October 6, 2015

Mixed Radix Numeral System class and Counter


Mixed Radix Calculator

   My 'Mixed Radix Calculator' creates a counting system of radices (plural of radix), such as base 12, or mixed radices such as days:hours:minutes:seconds, with bases 365:24:60:60. I chose the left side to be the most significant side. This is merely a personal preference, and my MixedRadixSystem class supports displaying both alignments.

   Of course you don't have to choose a mixed radix numeral system; you can count in an N-base numeral system, such as base 7 or the more familiar base 16. Another feature lies in my RadixNumeral class: each numeral, or place value, supports having its own dictionary of symbols.


Screenshot of Mixed Radix Calculator
      (Project released under Creative Commons)

-  52:7:24:60:60:1000  -


  A numeral system (or system of numeration) is a writing system for expressing numbers.


  The most familiar one is of course the decimal numeral system, a base-10 numbering system. Computers use a binary numeral system. The base is sometimes called the radix or scale.

  Not all numbering systems have just one base. Take, for example, how we currently divide time: there are 60 seconds in a minute, 60 minutes in an hour, 24 hours in a day, and 365 days in a year. This is called a mixed radix numeral system, and one might express the above system as 365:24:60:60.

  https://en.wikipedia.org/wiki/Mixed_radix
  http://mathworld.wolfram.com/Base.html
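
  As a concrete sketch of how a value decomposes into such a system (illustrative code, not my MixedRadixSystem class):

// Convert a count of seconds into the 365:24:60:60 system above,
// most significant place on the left (days:hours:minutes:seconds).
static int[] ToMixedRadix(long value, int[] radices)   // e.g. { 365, 24, 60, 60 }
{
    int[] digits = new int[radices.Length];
    for (int place = radices.Length - 1; place >= 0; place--)
    {
        digits[place] = (int)(value % radices[place]);   // remainder is this place's digit
        value /= radices[place];                         // carry the rest leftward
    }
    return digits;
}

// ToMixedRadix(100000, new[] { 365, 24, 60, 60 }) -> 1:3:46:40
// (100,000 seconds is 1 day, 3 hours, 46 minutes, 40 seconds)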

Uses:
  I haven't found a lot of use cases for it yet, but it is interesting. I originally built this because I wanted to experiment with numeral systems that use increasing consecutive prime numbers for each radix, as well as experiment with some odd bases, such as base 3 or base 7.

  In a single base, say base 7, 'round numbers' (those with a 1 in one place value and zeros in the rest, such as 1:0:0:0:0 in base 7) are powers of 7, and every place value except for the 1's place represents a multiple of 7.

  A mixed radix numeral system can represent a polynomial, and possibly provide for a simpler way to visualize and reason about them.

  Yet another possible use is to make a numeral system with a base that is larger than, and co-prime to, some other target number (say 256), to make a bijective map from every value in a byte to some other value exactly once, by repeatedly adding the value of the co-prime, modulus 256. This can appear rather random (or sometimes not at all), but the mapping is easily determined given the co-prime. I have talked about this notion before on my blog:
  https://csharpcodewhisperer.blogspot.com/search/label/Coprime
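
  Sketched out, the trick looks like this (since 256 is a power of two, any odd number is co-prime to it):

// Visit every byte value exactly once by repeatedly adding a co-prime, mod 256.
static byte[] CoprimeMap(int coprime)   // e.g. 91; must share no factor with 256
{
    byte[] map = new byte[256];
    int current = 0;
    for (int i = 0; i < 256; i++)
    {
        map[i] = (byte)current;        // map[i] == (i * coprime) % 256
        current = (current + coprime) % 256;
    }
    return map;
}

  Because the step is co-prime to 256, the walk does not revisit any value until all 256 have been produced, so the mapping is a bijection.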

  If you like this project you would probably like my project EquationFinder, it finds equations given constraints
  https://github.com/AdamWhiteHat/EquationFinder


Tuesday, September 22, 2015

Threaded Equation Finder




Find arithmetic equations that equates to a given 'target' value, number of terms, and operators.

Introduction

   You should all be familiar with how a typical computer works: you give it some variables, describe some quantities of some resources you have, choose an algorithm, let it process, and it returns to you a result or outcome. Now imagine a computer that worked the other way around. I believe it was Douglas Adams who described the notion of an altogether different type of computer; that is, you tell the computer what you want the outcome to be, and it goes off figuring out how to get there and what you need to do it. Z3 the theorem prover, and the constraint satisfaction problem (CSP) solver (and probably others) in Microsoft's Solver Foundation, do almost exactly that.
   There is also the idea of Backcasting, which is a similar, but different idea.

   My program isn't as fancy as all that, but it does find equations that equate to a given 'target' value, albeit at random. You define constraints other than just the target value, such as what operators are allowed in the equation, the quantity of terms, and the range or set of allowed terms.
For example, how many different ways can 9 nines equal 27, using only addition, subtraction, and multiplication, if evaluated left-to-right (ignoring the order of operations)? Turns out there are only 67 ways.

(above) Application Screen Shot

How it works

   The actual approach for finding equations that equate to an arbitrary target or 'goal' value is rather primitive. By way of brute force, several threads are launched asynchronously to generate thousands of random equations, evaluate them, and keep the results that have not been found yet.
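
   A stripped-down sketch of that brute-force loop (illustrative; the project's strategies are richer): build a random equation, evaluate it strictly left-to-right, and keep it if it hits the target.

using System;
using System.Text;

static string FindEquation(Random rng, int[] terms, char[] operators, int termCount, long target)
{
    while (true)
    {
        StringBuilder equation = new StringBuilder();
        long result = terms[rng.Next(terms.Length)];
        equation.Append(result);

        for (int i = 1; i < termCount; i++)
        {
            char op = operators[rng.Next(operators.Length)];
            int term = terms[rng.Next(terms.Length)];
            equation.Append(op).Append(term);

            switch (op)   // evaluated left-to-right, ignoring operator precedence
            {
                case '+': result += term; break;
                case '-': result -= term; break;
                case '*': result *= term; break;
            }
        }

        if (result == target)
            return equation.ToString();
    }
}

// Finds one of the 67 nine-nines solutions:
// FindEquation(new Random(), new[] { 9 }, new[] { '+', '-', '*' }, 9, 27)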

   This is something I wrote approximately 2 years ago. I dug it up and decided to publish it because I thought it was interesting. As this was an 'experiment', I created different ways of storing and evaluating the expressions, and made those different 'strategies' conform to a common interface so I could easily swap them out and compare them. I have refactored the code so that each class that implements IEquation is in its own project and creates its own assembly.

   There are two fully-working strategies for representing equations: one represents the equation as a list of 2-tuples (Term, Operator), does not perform order of operations, and is rather obtuse. The other stores the equation as a string and evaluates it using MSScriptControl.ScriptControl to Eval the string as a line of VBScript. This is unsurprisingly slower, but allows for much more robust equation evaluation: order of operations is respected with the ScriptControl strategy, and it opens the way to using parentheses.

   The other idea for a strategy, which I have not implemented but would like to, is a left-recursive Linq.Expression builder. Also, maybe I could somehow use Microsoft Solver Foundation for a wiser equation generation strategy than random.



Limitations

   Today, however, there are better architectures. A concurrent system like this would benefit greatly from the Actor model. If you wanted to write complex queries against the stream of equations being generated or selected or solved, maybe reactive extensions would be a slam dunk.

   Although this project certainly is no Z3, it does provide an example of an interface... perhaps even a good one.



Running on Raspberry Pi 2

   Microsoft's Managed Extensibility Framework (MEF) might be a good fit here, but I also wrote a console client that is designed to be run with Mono on the Raspberry Pi 2. MEF is a proprietary Microsoft .NET technology that is not supported in Mono. The extra metadata in the assembly shouldn't be a problem, but having a dependency on the MEF assembly will be. Probing the runtime environment and dynamically loading assemblies would be required here, which I have not had time to do, so at this time, there is no MEF.

   The reason the Mono client is a console application is that Mono and WinForms on the Raspberry Pi 2 fail for some people. The problem has something to do with hardware floating point vs software floating point, and it happens to manifest itself when using a TextBox control. The only thing that I haven't tried, and that should presumably fix it, is to rebuild Mono from the latest source.



Sunday, August 2, 2015

Certificate Enumerator



     Recently, my Windows machine quit updating. Just prior to that, I had been messing around with my certificate store, so I suspected that to be the cause. Running Microsoft's troubleshooter reset the download, which I had high hopes would fix the issue, but the download still continued to fail. I decided to check the Windows Event Logs, and that's where I found an error message about a certificate in the chain failing. I knew it! However, I did not know whether a trusted certificate accidentally got put in the untrusted store, or whether an untrusted certificate was accidentally put in the trusted store. I needed a way to search all of my certificates' thumbprints or serial numbers against a known repository of trusted or untrusted certificates.
    Microsoft's certificate snap-in for MMC does not allow you to view certificates in an efficient way. Opening them one at a time, manually, and then scrolling all the way down to where the thumbprint is displayed to compare it to a webpage is painful. Also, I was not satisfied with the way that Microsoft allows you to search the certificate store. The search is not very effective, and you can't even search for thumbprints! Nor, as far as I can tell, does the search feature allow you any way to copy that information to the clipboard.
Most of what I needed could be accomplished if I could simply export all of my computer's certificates' thumbprints or serial numbers to a text file, CSV file, or other simple and searchable format. Then I thought to myself, I know how to do that! It was the lack of features of the MMC certificate snap-in, and the inability to search for certificate thumbprints, that inspired me to write my own certificate utility, known simply as Certificate Enumerator.




     CertificateEnumerator can list every certificate in the various certificate stores for your local machine and the currently logged-in user. It can then display that information to you, either in a DataGridView or a TextBox (as columnar text), and it provides the ability to persist that information to file as text, comma-separated values (CSV), Excel format, or an HTML table.
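
     The core enumeration itself is simple with the .NET X509Store API; a minimal sketch of the idea (output format is illustrative):

using System;
using System.Security.Cryptography.X509Certificates;

// Walk every store name for both store locations and print each
// certificate's thumbprint, ready to be searched or diffed as text.
static void DumpThumbprints()
{
    foreach (StoreLocation location in new[] { StoreLocation.LocalMachine, StoreLocation.CurrentUser })
    {
        foreach (StoreName name in Enum.GetValues(typeof(StoreName)))
        {
            X509Store store = new X509Store(name, location);
            try
            {
                store.Open(OpenFlags.ReadOnly | OpenFlags.OpenExistingOnly);
                foreach (X509Certificate2 certificate in store.Certificates)
                    Console.WriteLine("{0}\t{1}\t{2}\t{3}", location, name, certificate.Thumbprint, certificate.Subject);
            }
            catch (System.Security.Cryptography.CryptographicException)
            {
                // The store does not exist at this location; skip it.
            }
            finally
            {
                store.Close();
            }
        }
    }
}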


     The Certificate Enumerator also has the ability to 'validate' each certificate against its CRL (certificate revocation list), if it supplies one.


     The GUI could really use some love. In case you missed it, the project is on my GitHub, so feel free to download the source and play with it. If you come up with something useful, submit a pull request.


Friday, April 24, 2015

Install C# on Raspberry Pi 2 with Mono



Intro

We can run C# on the Raspberry Pi 2! This has made the Raspberry Pi very valuable in my eyes. You can use Visual Studio to compile a .exe and then RUN IT on the Pi using Mono.

What I liked about the Pi: getting it working was dead simple. I just formatted an SD card in FAT32, downloaded and extracted a 500 MB .zip file, and copied the contents of the extracted folder onto the SD card. After you've accomplished that, go take a beer break; you deserved it *whew*.

I was able to just plug in the cables, put in the SD card, and I was running Linux on the Pi. I was using the HDMI and Ethernet ports, however. If you wish to use the tiny, headphone-jack-looking component video or a USB wireless adapter, there will be additional steps.


Problems / Gotchas 

WinForms may not play nice with Mono on the Raspberry Pi. Specifically, the problem manifests when you attempt to use a TextBox on your form, but who uses those? This bug is supposedly fixed in the latest releases, but I have yet to get it to work by just updating my Mono. It is very likely that I have to REBUILD Mono from the latest source on the Pi, which can take several hours. I have not tested this. The bug tracker for this says it has to do with the Pi's 'hard floats', which refers to hardware floating point calculations.



The hardware is not exactly what I would call stable. Sometimes it does not POST (the BIOS is a binary blob, so does a Pi really have a POST?). Anyways, ensure that the USB charger that plugs into the wall can supply plenty of power. The websites say 700 mA at a minimum. I find a 700 mA charger is not sufficient. I use a charger that can push 2 A, and I don't see many problems.




How does this work? 

A project called Mono. The Mono team has implemented a Common Language Runtime (CLR) that runs in the Linux and Mac environments, is coded to the ECMA-335 Common Language Infrastructure (CLI), AND covers the functions and classes in the .NET Framework, going all the way to 4.0, with some 4.5! It's open source, of course, and it also has a compiler that follows the ECMA-334 C# Language Specification to turn C# code into CLI intermediate code that can be run in Windows as well. Remember, the CLI is just a specification; any language could (theoretically) be compiled to CLI.




Installing Mono C# on Raspberry Pi

This step was pretty straightforward as well. The instructions below are for the NOOBS package, which uses Raspbian. If you have a different flavor of Linux, the commands below might be different.

You should already have sudo installed on your Raspberry Pi. Prefixing a command with sudo allows you to run processes under a higher privilege without having to log into root. Also, if you didn't know, the default username/password to log in the first time is pi/raspberry. If you prefer, you can type 'startx' to bring up the X Windows GUI. The commands below need to be entered into a terminal, which can be brought up within the GUI.

First, you want to update your Raspbian to the latest version. To do that, you have to have the Raspberry Pi hooked up to the internet. If you have an ethernet cable and the router uses DHCP, plugging in the ethernet will be sufficient. Then, simply enter the line below into the console:

sudo apt-get update

That may take a few minutes to complete and might prompt you to hit Y/N regarding the size of the package.

sudo apt-get install mono-runtime

This will also will take a few minutes to complete and might prompt you to hit Y/N regarding the size of the package.

After you've accomplished that, go have another beer because that was hard work!




Mono can now be used to run .NET executables:

mono Test.exe

Or, if your program requires elevated rights:

sudo mono Test.exe


Mono also features a REPL (read-eval-print loop). To access it, type 'csharp'. You can use your favorite text editor to make .cs files and compile them. My recommendation is, of course, to use Microsoft's Visual Studio. You can build and compile an application with MSVS in Windows, and just transfer the executable by SCP or thumb-drive to the Raspberry Pi, which is pretty slick. Hats off to the developer(s) of Mono.


Saturday, April 18, 2015

Hindenbugs, Heisenbugs and other types of software bugs




Different Types of Software Bugs


    Sorry, I haven't gotten around to creating a new post in close to a month (yikes!). I have just been super busy at work and have been trying to complete about 3 of the several not-quite-finished personal projects that I will be releasing on GitHub and blogging about.

Heisenbug--
    1) A software bug that disappears or alters its behavior by the action of attempting to debug it.
    Possible sources: The effect of debuggers on the run-time code or JIT can be one source. Another might arise because of process scheduling or threading and race conditions. Or both, as stepping through the code with a debugger will affect the timing of threads.
    Derivation: Werner Heisenberg and his Uncertainty Principle.


Mandelbug--
    1) A bug whose causes are so complex it defies repair, or makes its behavior appear chaotic or even non-deterministic. Other symptoms include core dumps that, when viewed in an editor, form complex but repeating patterns or designs of characters and symbols.
    Possible sources: Operating system environment or configuration, the presence or absence of file system objects, time- or scheduling-dependent code. Also, a very complicated state machine, esp. one with more than one variable that holds state information and where transitions may not exist for every possible permutation of one state to another.
    Derivation: Benoît Mandelbrot and his research in fractal geometry.


Schrödinbug--
    1) A bug that manifests itself in running software after a programmer notices that the code should never have worked in the first place.
    Possible sources: A combination of naughty coding practices, such as coding by coincidence, plus something that allows for unexpected behavior, such as the effect of custom configurations, compiler optimizations, code obfuscation, or the JIT on code. Also, a dependency on some arcane piece of technology that no one truly understands, or employment of a black-box system or library, or some ancient dinosaur system with a highly specific configuration that is so fragile that no one dares update it, because said updates could likely (or are known to) break functionality, and where even a reboot is considered hairy.
    Derivation: Named for Erwin Schrödinger and his thought experiment.


Bohrbug--
    1) In contrast to the Heisenbug, the Bohrbug is a "good, solid bug": easy to hunt down, or easily predicted from the description, esp. when the bug is the result of a classical programming mistake, or 'a rookie move'.
    Possible sources: Failing to initialize a variable or class (leaving it null), failure to RTFM, misuse of pointers or general abuse of memory, not coding defensively, coding by coincidence, or not truly understanding how (or why) your code works.
    Derivation: Named after the physicist Niels Bohr and his rather simple atomic model.


Hindenbug--
    1) A bug with catastrophic consequences, esp. one that actually causes the server to burst into bright, hot flames.
    Possible sources: Code that dynamically builds and executes SQL scripts, esp. scripts that make use of the DELETE FROM command; code that is self-modifying or replicates; or code that controls large machines or a physical system such as pumps, belts, cooling systems, aggregate pulverizers, fans, or any sort of thing with large, rotating blades. Also, any software that controls, monitors, tests, or is in any way whatsoever wired up to a tactical nuclear defense system. In fact, let's just include anything with 'nuclear' or 'pulverize' in the name.
    Derivation: Refers to the Hindenburg disaster. The Hindenburg airship was, in turn, named after Paul von Hindenburg, the president of Germany from 1925 to 1934.



    Kishor S. Trivedi has a great talk/slideshow about software faults, failures, and their mitigations. He posits that it is not realistic to write software that is 100% bug-free, or that has 100% up-time. He displays the downtime, in terms of minutes per year, of several reliable corporations to support his claims, and it is true that, as far as I am aware, there has never been software written that did not have some downtime, even at NASA.
    However, Trivedi shares a quote by one E. W. Dijkstra: "Testing shows the presence, not the absence, of bugs". Aww, logic. Now that's something I can get behind. Indeed, it is impossible to provably show that any software (of sufficient complexity) is absolutely error-free.
    Following this, he suggests we should not strive for the virtually impossible task of building fault-free software, but rather aim to build fault-tolerant software instead.



Software Faults -> Software Errors -> Software Failures

Software Faults lead to Software Errors that lead to Software Failures.

Trivedi defines faults, errors and failures:
      -A Software Failure occurs when the delivered service no longer complies with the desired output.
      -A Software Error is that part of the system state which is liable to lead to subsequent failure.
      -A Software Fault is the adjudged or hypothesized cause of an error.

Faults are the cause of errors, which may lead to failures.



    So, how does one test their tactical nuclear defense system code? VERY carefully, and it probably wouldn't hurt to comment out the Launch() method as well.

Wednesday, February 18, 2015

The Feature Flags Pattern





I was listening to Episode 1101 of the podcast Dot Net Rocks. Jez Humble was talking about the concept of Feature Toggles, or Feature Flags. Feature Switches? While the term has some opinionated definitions, the concept I found most interesting was the idea of deploying software with the new features initially disabled, or switched off by some mechanism. After you think the feature is ready for production, switch the feature on. If there is an issue, you don't have to roll back the version or deploy another release; just switch the feature back off again and replace its .dll. Didn't get enough debug information? Switch the feature back on, let a few crash reports trickle in, and then switch it back off. If your feature is particularly processor- or network-intensive, you can perform load testing by slowly releasing the feature to select clients or only part of the population/userbase.

While I personally chose an SQL table for my approach to storing the toggle switches (it's an internal business app), one could use the application's .config file. An application could pull the settings from a .config file on a networked drive, as a way of controlling multiple application instances by making one change in a centralized location. Below I show a minimalist implementation by creating a Dictionary from the <appSettings> section of an App.config file.

Behold:


public static Dictionary<string, bool> GetFeatureFlags()
{
  return ConfigurationManager.AppSettings.AllKeys.ToDictionary(s => s, IsFlagSet);
}

public static bool IsFlagSet(string settingName)
{
  bool result = false;
  bool.TryParse(ConfigurationManager.AppSettings[settingName], out result);
  return result;
}

Of course, with a dictionary you have to be careful that you don't try to access a key that does not exist with the indexer, so you might be better off using IsFlagSet(string), which will never throw. Although this is of limited use (AppSettings is already a NameValueCollection), perhaps you can make use of this generic function I wrote that converts AppSettings into a dictionary, with the type of the value being the generic type:


public static Dictionary<string, T> GetDictionary<T>()
{
 return ConfigurationManager.AppSettings.AllKeys.ToDictionary<string, string, T>(s => s, GetSetting<T>);
}

public static T GetSetting<T>(string settingName)
{
  try
  {
    T result = (T) Convert.ChangeType(
      ConfigurationManager.AppSettings[settingName],
      typeof (T)
    );
    if (result != null)
    {
      return result;
    }
  }
  catch { }

  return default(T);
}

Please note that swallowing an exception ("catch { }") is typically considered poor form. In this particular scenario, I am aware of the possible exceptions that can be thrown by this code, and I want the code to return default(T) in those cases and never throw.



Tuesday, January 20, 2015

Validate all input parameters in one line/Check several objects for empty or null in a single method call.



Often my methods start with several guard clauses: conditional if statements that check for null or empty parameters and immediately return if they are. This is also known as defensive programming. Usually I am not concerned with this code; indeed, I identify these blocks as validation code and overlook them entirely. It was not until recently that I noticed this as an area where I was repeating myself, and that it could be put into a reusable function.

Let's look at the code:

/// <summary>
/// Checks the parameters for empty, nulls, or invalid states.
/// </summary>
/// <returns>True if the params are null, empty, contains an array or object that is null or empty, contains a blank, whitespace, null or empty string, or contains DataTable that does not pass a call to IsValidDatatable().</returns>
public static bool ContainsNullOrEmpty(params object[] Items)
{
    if (Items == null || Items.Length < 1)
        return true;
    
    foreach (object item in Items)
    {
        if (item == null)
            return true;
        
        if (item is string)
        {
            if (string.IsNullOrWhiteSpace(item as String))
                return true;
        }
        else if (item is DataTable)
        {
            if (!IsValidDatatable(item as DataTable))
                return true;
        }
        
        if (item.GetType().IsArray)
        {
            bool isEmpty = true;
            foreach (object itm in (Array)item)
            {
if (ContainsNullOrEmpty(itm))
                    return true;
                
                isEmpty = false;
            }
            if (isEmpty)
                return true;
        }
    }

    return false;
}


My approach above uses the params keyword. By using params, I can pass in any number of parameters (including zero, although that doesn't help us in this context). By using object instead of a generic type, I can pass multiple different types in one method call. If I used generics, I would have to make one method call for each type that I wanted to validate.

The idea is to cram all of your common validation logic into this method, and call it everywhere, increasing the readability of your business logic by not cluttering it up with validation code. Notice how this function also checks for empty or white-space strings, as well as calling a custom IsValidDatatable(DataTable) function. If you have several functions that return an int of -1 or 0 upon failure, you might want to add another conditional to check if (item is int) and then whether the value of the integer reflects an erroneous state.

Another cool feature is that if the item is an array, or even an array of nested arrays, it will still check every item in those arrays. Notice how I get the item type, then check the Type.IsArray boolean property. If it is true, I cast the object to a System.Array and call ContainsNullOrEmpty() recursively in a foreach loop. We can return true right away on the first null condition met, but we must be careful not to return on a false condition; instead we let the false conditions fall through and continue on, in case there is a null later or in another array.

Enjoy!

Tuesday, January 13, 2015

EntityJustWorks - C# class/object/entity to SQL database/script/command mapper.



Note by author:

   Since writing this, I have expanded on this idea quite a bit. I have written a lightweight ORM class library that I call EntityJustWorks.

   The full project can be found on GitHub or CodePlex.


   EntityJustWorks not only goes from a class to a DataTable (below); it also provides several other mappings. See the definition of the public members at the bottom of this post.


Security Warning:
This library generates dynamic SQL, and has functions that generate SQL and then immediately execute it. While it is true that all strings funnel through the function Helper.EscapeSingleQuotes, this can be defeated in various ways, and only parameterized SQL should be considered SAFE. If you have no need for them, I recommend stripping semicolons ; and dashes --. Also, there are some Unicode characters that can be interpreted as a single quote, or may be converted to one when changing encodings. Additionally, there are Unicode characters that can crash .NET code, mainly in controls (think TextBox). You almost certainly should impose a white list:
string clean = new string(dirty.Where(c => "abcdefghijklmnopqrstuvwxyz0123456789.,\"_ !@".Contains(c)).ToArray());

PLEASE USE the SQLScript.StoredProcedure and DatabaseQuery.StoredProcedure classes to generate SQL for you, as the scripts they produce are parameterized. All of the functions can be altered to generate parameterized instead of sanitized scripts. Ever since people started using this, I have been maintaining backwards compatibility. However, I may break this in the future, as I do not wish to teach dangerous/bad habits to someone who is learning. This project is a few years old, and it's already showing its age. What is probably needed here is a total re-write, deprecating this version while keeping it available for legacy users, after slapping big warnings all over the place. This project was designed to generate the SQL scripts for standing up a database for a project, using only MY input as data. This project was never designed to process a USER'S input! Even if the data isn't coming from an adversary, client/user/manually-entered data is notoriously inconsistent. Please do not use this code on any input that did not come from you without first implementing parameterization. Again, please see the SQLScript.StoredProcedure class for inspiration on how to do that.





    In this post, I bring the wagons 'round full circle and use what I have shown you in older posts to make a class that can create (or insert into) a table in an SQL database from a C# class object at run-time. The only prior knowledge needed is the connection string. The name of the table will be the name of the class Type, and the table's columns are created from the class's public properties, with their names being the property names.

    If you want to take a data-first approach, you can generate the C# code for classes that mirror a table in a relational database, given a connection string and the table name, or use the new Emit class to generate a dynamic assembly at run-time that contains a class with public auto-properties matching the DataColumns, which you can use to store DataRows in. Similarly, the class name will be the same as the table's, with each of the table's columns being represented by a public auto-property on the class of the same name and type. An improvement here would be to create the class inside a dynamic assembly using System.Reflection.Emit. Anywho, I have been holding on to this code for too long and I just want to get it out of the door.

    For those of you who just want the code:
      GitHub  (Download zip)
      BitBucket (Download)
      code.msdn.microsoft.com

    Going from a query to a DataTable is as simple as a call to SQLAdapter.Fill(). For populating the public properties of a class from a DataTable, see this post I made. To build a DataTable whose columns match the public properties of a class, check out this post here. Going from a DataTable to a class as a C# code file is covered in this previous post.

    Going from a DataTable to SQL CREATE TABLE or INSERT INTO scripts that can be passed to ExecuteNonQuery() to execute the command, see below.

    Here is the INSERT INTO code. This code takes advantage of some functions in the Helper class that convert a DataTable's column names and a row's values into a List of strings, which is then passed into a function that takes a list of strings and formats it with a delimiter string and prefix and postfix strings, so I can easily format the list to look like the COLUMNS and VALUES parts of an SQL INSERT INTO statement. For example: INSERT INTO ([FullName], [Gender], [BirthDate]) VALUES ('John Doe', 'M', '10/3/1981'). If you want to see how the functions RowToColumnString(), RowToValueString(), and ListToDelimitedString() work, click here to view Helper.cs.

public static string InsertInto<T>(params T[] ClassObjects) where T : class
{
   DataTable table = Map.ClassToDatatable<T>(ClassObjects);
   return InsertInto(table);   // We don't need to check IsValidDatatable() because InsertInto does
}

public static string InsertInto(DataTable Table)
{
   if (!Helper.IsValidDatatable(Table))
      return string.Empty;

   StringBuilder result = new StringBuilder();
   foreach (DataRow row in Table.Rows)
   {
      if (row == null || row.Table.Columns.Count < 1 || row.ItemArray.Length < 1)
         return string.Empty;

      string columns = Helper.RowToColumnString(row);
      string values = Helper.RowToValueString(row);

      if (string.IsNullOrWhiteSpace(columns) || string.IsNullOrWhiteSpace(values))
         return string.Empty;

      result.AppendFormat("INSERT INTO [{0}] {1} VALUES {2}{3}", Table.TableName, columns, values, Environment.NewLine);
   }

   return result.ToString();
}

    Here is the CREATE TABLE code. A very important function here is the GetDataTypeString() member, which converts the Column.DataType to the SQL equivalent as a string, which is used to craft a CREATE TABLE script. I use static strings formatted to work with String.Format to replace the crucial bits in the CREATE TABLE command:

public static string CreateTable<T>(params T[] ClassObjects) where T : class
{
 DataTable table = Map.ClassToDatatable<T>(ClassObjects);
 return Script.CreateTable(table);
}

public static string CreateTable(DataTable Table)
{
   if (!Helper.IsValidDatatable(Table, IgnoreRows: true))
      return string.Empty;

   StringBuilder result = new StringBuilder();
   result.AppendFormat("CREATE TABLE [{0}] ({1}", Table.TableName, Environment.NewLine);

   bool FirstTime = true;
   foreach (DataColumn column in Table.Columns.OfType<DataColumn>())
   {
      if (FirstTime) FirstTime = false;
      else
         result.Append(",");

      result.AppendFormat("[{0}] {1} {2}NULL{3}",
         column.ColumnName,
         GetDataTypeString(column.DataType),
         column.AllowDBNull ? "" : "NOT ",
         Environment.NewLine
      );
   }
   result.AppendFormat(") ON [PRIMARY]{0}GO", Environment.NewLine);

   return result.ToString();
}

private static string GetDataTypeString(Type DataType)
{
   switch (DataType.Name)
   {
      case "Boolean": return "[bit]";
      case "Char": return "[char]";
      case "SByte": return "[smallint]";     // T-SQL's tinyint is unsigned, so SByte needs the next size up
      case "Int16": return "[smallint]";
      case "Int32": return "[int]";
      case "Int64": return "[bigint]";
      case "Byte": return "[tinyint]";       // tinyint is already unsigned (0-255) in T-SQL
      case "UInt16": return "[int]";         // T-SQL has no UNSIGNED modifier; map to the next larger signed type
      case "UInt32": return "[bigint]";
      case "UInt64": return "[decimal](20)";
      case "Single": return "[real]";
      case "Double": return "[float]";       // T-SQL's float is a double-precision type
      case "Decimal": return "[decimal]";
      case "DateTime": return "[datetime]";
      case "Guid": return "[uniqueidentifier]";
      case "Object": return "[sql_variant]";
      case "String": return "[nvarchar](250)";
      default: return "[nvarchar](MAX)";
   }
}
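
    To give an idea of the output, running the hypothetical Person class from the sketch above through CreateTable() should produce a script along these lines (NULL versus NOT NULL depends on how the columns were mapped):

DataTable table = Map.ClassToDatatable<Person>();   // Person class from the earlier sketch
string script = Script.CreateTable(table);
// script should resemble:
// CREATE TABLE [Person] (
// [FullName] [nvarchar](250) NULL
// ,[Gender] [char] NOT NULL
// ,[BirthDate] [datetime] NOT NULL
// ) ON [PRIMARY]
// GO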


...


    Here is the definition of all the public members:

namespace EntityJustWorks.SQL
{
   public static class Convert
   {
      public static string SQLTableToCSharp(string ConnectionString, string TableName);
      public static bool ClassToSQLTable<T>(string ConnectionString, params T[] ClassCollection);
   }
   
   public static class Code
   {
      public static string SQLTableToCSharp(string ConnectionString, string TableName);
      public static string DatatableToCSharp(DataTable Table);
   }

   public static class Script
   {
      public static string InsertInto<T>(params T[] ClassObjects);
      public static string InsertInto(DataTable Table);
      public static string CreateTable<T>(params T[] ClassObjects);
      public static string CreateTable(DataTable Table);
   }
   
   public static class Map
   {
      public static IList<T> DatatableToClass<T>(DataTable Table);
      public static DataTable ClassToDatatable<T>(params T[] ClassCollection);
      public static DataTable ClassToDatatable<T>();
   }

   public static class Query
   {
      public static IList<T> QueryToClass<T>(string ConnectionString, string FormatString_Query,
                                                            params object[] FormatString_Parameters);
      public static DataTable QueryToDataTable(string ConnectionString, string FormatString_Query,
                                                            params object[] FormatString_Parameters);
      public static T QueryToScalarType<T>(string ConnectionString, string FormatString_Query,
                                                            params object[] FormatString_Parameters);
      public static int ExecuteNonQuery(string ConnectionString, string FormatString_Command,
                                                            params object[] FormatString_Parameters);
   }
   
   public static class Helper
   {
      public static bool IsValidDatatable(DataTable Table, bool IgnoreRows = false);
      public static bool IsCollectionEmpty<T>(IEnumerable<T> Input);
      public static bool IsNullableType(Type Input);
   }
}

    Please feel free to comment, either on my GitHub or my blog, with criticisms or suggestions.

Wednesday, December 3, 2014

Word Frequency Dictionary & Sorted Occurrence Ranking Dictionary for Generic Item Quantification and other Statistical Analyses



Taking a little inspiration from DotNetPerl's MultiMap, I created a generic class that uses a Dictionary under the hood to create a keyed collection that tracks the frequency of matching or duplicate items of arbitrary type. For lack of a better name, I call it a FrequencyDictionary. Instead of Key/Value pairs, the FrequencyDictionary has the notion of an Item/Frequency pair, with the Item being the key. I strayed from Key/Value because I may ultimately end up calling the Item the Value.

I invented the FrequencyDictionary while writing a pisg-like IRC log file statistics generator program. It works well for word frequency analysis. After some pre-processing/cleaning of the text log file, I parse each line by spaces to get an array of words. I then use FrequencyDictionary.AddRange to add the words to my dictionary.

The FrequencyDictionary is essentially a normal Dictionary (with T being type String in this case); however, when you attempt to add a Key that already exists, it simply increments the Value integer. After processing the file, you have a Dictionary where the Key is a word that was found in the source text, and the Value is the number of occurrences of that word in the source text.

Taking the idea even further, I created a complementary Dictionary that is essentially a sorted version of the FrequencyDictionary with the Keys and Values reversed. I call it a RankingDictionary because I was using it to produce top-10 rankings of the most popular words, most popular nicks, etc.: the Keys represent the number of occurrences of a word, and the Values are a List of the words that occurred that many times in the source text.

The FrequencyDictionary has a GetRankingDictionary() method to make the whole thing very easy to use. Typically I don't use the FrequencyDictionary too extensively, but rather as a means to get a RankingDictionary, which I base the majority of my IRC statistics logic on. The RankingDictionary also proved very useful for finding Naked Candidates and Hidden Pairs or Triples in my Sudoku solver application, which I will be releasing on Windows and Windows Phone and blogging about shortly. Hell, I was even thinking about releasing the source code to my Sudoku App, since it's so elegant and a great example of beautiful, readable code for a complex problem.

Anyways, the code for the Frequency and Ranking Dictionary is heavily commented with XML Documentation Comments, so I'm going to go ahead and let the code speak for itself. I will add usage examples later. In fact, I will probably release the pisg-like IRC Stats Prog source code, since I don't think I'm going to go any further with it.

Limitations: I ran into problems trying to parse large text files approaching 3-4 megabytes. Besides taking up a bunch of memory, the Dictionary begins to encounter many hash collisions once the size of the collection gets large enough. This completely kills performance, and eventually most of the time is spent trying to resolve collisions, so it grinds to a near halt. You might notice the constructor public FrequencyDictionary(int Capacity), where you can specify a capacity for the Dictionary. A good, safe ceiling is about 10,000. A better implementation of the GetHashCode() method might be in order, but that is not a problem I have felt like solving yet.




using System;
using System.Collections.Generic;
using System.Linq;

/// <summary>
/// A keyed collection of Item/Frequency pairs, (keyed off Item).
/// If a duplicate Item is added to the Dictionary, the Frequency for that Item is incremented.
/// </summary>
public class FrequencyDictionary<ITM>
{
  // The underlying Dictionary
  private Dictionary<ITM, int> _dictionary;
  
  /// <summary>
  /// Initializes a new instance of the FrequencyDictionary that is empty.
  /// </summary>
  public FrequencyDictionary() { _dictionary = new Dictionary<ITM, int>(); }
  
  /// <summary>
  /// Initializes a new instance of the FrequencyDictionary with the specified initial Capacity.
  /// </summary>
  public FrequencyDictionary(int Capacity) { _dictionary = new Dictionary<ITM, int>(Capacity); }
  
  /// <summary>
  /// Gets a collection containing the Items in the FrequencyDictionary.
  /// </summary>
  public IEnumerable<ITM> ItemCollection { get { return this._dictionary.Keys; } }
  
  /// <summary>
  /// Gets a collection containing the Frequencies in the FrequencyDictionary.
  /// </summary>
  public IEnumerable<int> FrequencyCollection { get { return this._dictionary.Values;} }
  
  /// <summary>
  /// Adds the specified Item to the FrequencyDictionary.
  /// </summary>
  public void Add(ITM Item)
  {
    if  ( this._dictionary.ContainsKey(Item))   { this._dictionary[Item]++; }
    else    { this._dictionary.Add(Item,1); }
  }
  
  /// <summary>
  /// Adds the elements of the specified array to the FrequencyDictionary.
  /// </summary>
  public void AddRange(ITM[] Items)
  {
    foreach(ITM item in Items) { this.Add(item); }
  }
  
  /// <summary>
  /// Gets the Item that occurs most frequently.
  /// </summary>
  /// <returns>A KeyValuePair containing the Item (key) and how many times it has appeared (value).</returns>
  public KeyValuePair<ITM,int> GetMostFrequent()
  {
    int maxValue = this._dictionary.Values.Max();
    return this._dictionary.Where(kvp => kvp.Value == maxValue).FirstOrDefault();
  }
  
  /// <summary>
  /// Gets the number of Item/Frequency pairs contained in the FrequencyDictionary.
  /// </summary>
  public int Count { get { return this._dictionary.Count; } }
  
  /// <summary>
  /// Returns an enumerator that iterates through the FrequencyDictionary.
  /// </summary>
  public IEnumerator<KeyValuePair<ITM,int>> GetEnumerator()
  {
    return this._dictionary.GetEnumerator();
  }
  
  /// <summary>
  /// Gets the Frequency (occurrences) associated with the specified Item.
  /// </summary>
  public int this[ITM Item]
  {
    get { if (this._dictionary.ContainsKey(Item)) { return this._dictionary[Item]; } return 0; }
  }
  
  /// <summary>
  /// Creates a RankingDictionary from the current FrequencyDictionary.
  /// </summary>
  /// <returns>A RankingDictionary of Frequency/ItemCollection pairs ordered by Frequency.</returns>
  public RankingDictionary<ITM> GetRankingDictionary()
  {
    RankingDictionary<ITM> result = new RankingDictionary<ITM>();
    foreach(KeyValuePair<ITM,int> kvp in _dictionary)
    {
      result.Add(kvp.Value,kvp.Key);
    }
    return result;
  }
  
  /// <summary>
  /// Displays usage information for FrequencyDictionary 
  /// </summary>
  public override string ToString()
  {
    return "FrequencyDictionary<Item, Frequency> : Key=\"Item=\", Value=\"Frequency\"\".";
  }
}




And now the RankingDictionary:

/// <summary>
/// A keyed collection of Frequency/ItemCollection pairs that is ordered by Frequency (rank).
/// If an Item is added that has the same Frequency as another Item, that Item is added to the Item collection for that Frequency.
/// </summary>
public class RankingDictionary<ITM>
{
  // Underlying dictionary
  SortedDictionary<int,List<ITM>> _dictionary;
  
  /// <summary>
  /// Initializes a new instance of the RankingDictionary that is empty.
  /// </summary>
  public RankingDictionary() { _dictionary = new SortedDictionary<int,List<ITM>>(new FrequencyComparer()); }
  
  /// <summary>
  /// The Comparer used to compare Frequencies.
  /// </summary>
  public class FrequencyComparer : IComparer<int>
  {
    public int Compare(int one,int two) { if(one == two) return 0; else if(one > two) return -1; else return 1; }
  }
  
  /// <summary>
  /// Gets a collection containing the Frequencies in the RankingDictionary.
  /// </summary>
  public IEnumerable<int> FrequencyCollection { get { return this._dictionary.Keys; } }
  
  /// <summary>
  /// Gets a collection containing the ItemCollection in the RankingDictionary.
  /// </summary>
  public IEnumerable<List<ITM>> ItemCollections   { get { return this._dictionary.Values; } }
  
  /// <summary>
  /// Adds the specified Frequency and Item to the RankingDictionary.
  /// </summary>
  public void Add(int Frequency, ITM Item)
  {
    List<ITM> itemCollection;
    if (this._dictionary.TryGetValue(Frequency, out itemCollection))
    { // Other Items already have this Frequency; add this Item to that collection
      itemCollection.Add(Item);
    }
    else
    { // This is the first Item seen with this Frequency
      this._dictionary.Add(Frequency, new List<ITM> { Item });
    }
  }
  
  /// <summary>
  ///  Gets the number of Frequency/ItemCollection pairs contained in the RankingDictionary.
  /// </summary>
  public int Count { get { return this._dictionary.Count; } }
  
  /// <summary>
  /// Returns an enumerator that iterates through the RankingDictionary.
  /// </summary>
  public IEnumerator<KeyValuePair<int,List<ITM>>> GetEnumerator()
  {
    return this._dictionary.GetEnumerator();
  }
  
  /// <summary>
  /// Gets the ItemCollection associated with the specified Frequency.
  /// </summary>
  public List<ITM> this[int Frequency]
  {
    get
    {
      List<ITM> itemCollection;
      if (this._dictionary.TryGetValue(Frequency,out itemCollection)) return itemCollection;
      else return new List<ITM>();
    }
  }
  
  /// <summary>
  /// Displays usage information for RankingDictionary 
  /// </summary>
  public override string ToString()
  {
    return "RankingDictionary<Frequency, List<Item>> : Key=\"Frequency\", Value=\"List<Item>\".";
  }
}
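
In the meantime, here is a minimal sketch of how the two classes fit together for word-frequency analysis (the input line is made up):

string line = "the quick brown fox jumps over the lazy dog the end";

FrequencyDictionary<string> frequencies = new FrequencyDictionary<string>();
frequencies.AddRange(line.Split(' '));   // "the" is counted three times, everything else once

KeyValuePair<string, int> top = frequencies.GetMostFrequent();
Console.WriteLine("Most frequent: {0} ({1})", top.Key, top.Value);   // Most frequent: the (3)

RankingDictionary<string> rankings = frequencies.GetRankingDictionary();
foreach (KeyValuePair<int, List<string>> rank in rankings)
{
   // Enumerates from most frequent to least: 3 -> "the", then 1 -> all the words that appeared once
   Console.WriteLine("{0}: {1}", rank.Key, string.Join(", ", rank.Value));
}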


Fuller usage examples will be posted later.

Tuesday, October 28, 2014

Create C# Class Code From a DataTable using CodeDOM



Note by author:

   Since writing this, I have expanded on this idea quite a bit. I have written a lightweight ORM class library that I call EntityJustWorks.

   The full project can be found on GitHub or CodePlex.


   EntityJustWorks not only goes from a class to DataTable (below), but also provides:

Security Warning:
This library generates dynamic SQL, and has functions that generate SQL and then immediately execute it. While it is true that all strings funnel through the function Helper.EscapeSingleQuotes, this can be defeated in various ways, and only parameterized SQL should be considered SAFE. If you have no need for them, I recommend stripping semicolons ; and dashes --. Also, there are some Unicode characters that can be interpreted as a single quote, or that may be converted to one when changing encodings. Additionally, there are Unicode characters that can crash .NET code, mainly controls (think TextBox). You almost certainly should impose a white list: string clean = new string(dirty.Where(c => "abcdefghijklmnopqrstuvwxyz0123456789.,\"_ !@".Contains(c)).ToArray());
PLEASE USE the SQLScript.StoredProcedure and DatabaseQuery.StoredProcedure classes to generate SQL for you, as the scripts they produce are parameterized. All of the functions can be altered to generate parameterized instead of sanitized scripts. Ever since people started using this library, I have been maintaining backwards compatibility. However, I may break this in the future, as I do not wish to teach dangerous/bad habits to someone who is learning. This project is a few years old, and it is already showing its age. What is probably needed here is a total re-write, deprecating this version while keeping it available for legacy users, after slapping big warnings all over the place. This project was designed to generate the SQL scripts for standing up a database for a project, using only MY input as data. It was never designed to process a USER'S input! Even if the data isn't coming from an adversary, client/user/manually-entered data is notoriously inconsistent. Please do not use this code on any input that did not come from you without first implementing parameterization. Again, please see the SQLScript.StoredProcedure class for inspiration on how to do that.
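
For reference, here is a minimal sketch of the parameterized approach in plain ADO.NET (the table, columns, and method are illustrative, not part of the library); the StoredProcedure classes generate scripts in this spirit:

public static int InsertPersonSafely(string connectionString, string fullName, DateTime birthDate)
{
   using (SqlConnection connection = new SqlConnection(connectionString))
   using (SqlCommand command = new SqlCommand(
      "INSERT INTO [Person] ([FullName], [BirthDate]) VALUES (@FullName, @BirthDate)", connection))
   {
      // The values travel as typed parameters and are never concatenated into the SQL text
      command.Parameters.AddWithValue("@FullName", fullName);
      command.Parameters.AddWithValue("@BirthDate", birthDate);
      connection.Open();
      return command.ExecuteNonQuery();
   }
}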


So far I have posted several times on the DataTable class. I have shown how to convert a DataTable to CSV or tab-delimited file using the clipboard, how to create a DataTable from a class using reflection, as well as how to populate the public properties of a class from a DataTable using reflection. Continuing along these lines, I decided to bring the DataTable-To-Class wagons around full-circle and introduce a class that will generate the C# code for the class that is used by the DataTableToClass<T> function, so you don't have to create it manually. The only parameter required to generate the C# class code is, of course, a DataTable.

The code below is rather trivial. It uses CodeDOM to build up a class with public properties that match the names and data types of the data columns of the supplied DataTable. I really wanted the output code to use auto properties. This is not supported by CodeDOM, however, so I used a little hack or workaround to accomplish the same thing. I simply added the getter and setter code for the property to the member's field name. CodeDOM adds a semicolon to the end of the CodeMemberField statement, which would cause the code not to compile, so I added two trailing slashes "//" to the end of the field name to comment out the semicolon. The whole point of creating auto properties was to have clean, succinct code, so after I generate the source code file, I clean up the commented-out semicolons by replacing every occurrence with an empty string. The main disadvantage of this 'workaround' is that the code cannot be used to generate a working class in Visual Basic code. I do have proper CodeDOM code that does not employ this workaround, but I prefer the output code to contain auto-properties; auto-generated code is notorious for being messy and hard to read, and I did not want my generated code to feel like generated code.


Below is the DataTableToCode function, its containing class and its supporting functions. The code is short, encapsulated, clean and commented, so I will just let it speak for itself:

using System;
using System.CodeDom;
using System.CodeDom.Compiler;
using System.Data;
using System.IO;
using Microsoft.CSharp;

public static class DataTableExtensions
{
   public static string DataTableToCode(DataTable Table)
   {
      string className = Table.TableName;
      if(string.IsNullOrWhiteSpace(className))
      {   // Default name
         className = "Unnamed";
      }
      className += "TableAsClass";
      
      // Create the class
      CodeTypeDeclaration codeClass = CreateClass(className);
      
      // Add public properties
      foreach(DataColumn column in Table.Columns)
      {
         codeClass.Members.Add( CreateProperty(column.ColumnName, column.DataType) );
      }
      
      // Add Class to Namespace
      string namespaceName = "AutoGeneratedDomainModels";
      CodeNamespace codeNamespace = new CodeNamespace(namespaceName);
      codeNamespace.Types.Add(codeClass);
      
      // Generate code
      string filename = string.Format("{0}.{1}.cs",namespaceName,className);
      CreateCodeFile(filename, codeNamespace);
      
      // Return filename
      return filename;
   }
   
   static CodeTypeDeclaration CreateClass(string name)
   {
      CodeTypeDeclaration result = new CodeTypeDeclaration(name);
      result.Attributes = MemberAttributes.Public;
      result.Members.Add(CreateConstructor(name)); // Add class constructor
      return result;
   }
   
   static CodeConstructor CreateConstructor(string className)
   {
      CodeConstructor result = new CodeConstructor();
      result.Attributes = MemberAttributes.Public;
      result.Name = className;
      return result;
   }
   
   static CodeMemberField CreateProperty(string name, Type type)
   {
      // This is a little hack. Since you cant create auto properties in CodeDOM,
      //  we make the getter and setter part of the member name.
      // This leaves behind a trailing semicolon that we comment out.
      //  Later, we remove the commented out semicolons.
      string memberName = name + "\t{ get; set; }//";
      
      CodeMemberField result = new CodeMemberField(type,memberName);
      result.Attributes = MemberAttributes.Public | MemberAttributes.Final;
      return result;
   }
   
   static void CreateCodeFile(string filename, CodeNamespace codeNamespace)
   {
      // CodeGeneratorOptions so the output is clean and easy to read
      CodeGeneratorOptions codeOptions = new CodeGeneratorOptions();
      codeOptions.BlankLinesBetweenMembers = false;
      codeOptions.VerbatimOrder = true;
      codeOptions.BracingStyle = "C";
      codeOptions.IndentString = "\t";
      
      // Create the code file
      using(TextWriter textWriter = new StreamWriter(filename))
      {
         CSharpCodeProvider codeProvider = new CSharpCodeProvider();
         codeProvider.GenerateCodeFromNamespace(codeNamespace, textWriter, codeOptions);
      }
      
      // Correct our little auto-property 'hack'
      File.WriteAllText(filename, File.ReadAllText(filename).Replace("//;", ""));
   }
}

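A hypothetical call looks like this (the DataTable would normally come from a query; here it is built by hand to match the output below):

DataTable table = new DataTable("Customer");
table.Columns.Add("FirstName", typeof(string));
table.Columns.Add("LastName",  typeof(string));
table.Columns.Add("Age",       typeof(int));
table.Columns.Add("Sex",       typeof(char));
table.Columns.Add("Address",   typeof(string));
table.Columns.Add("Birthdate", typeof(string));

string filename = DataTableExtensions.DataTableToCode(table);
// filename == "AutoGeneratedDomainModels.CustomerTableAsClass.cs"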

An example of the resulting code appears below:

namespace AutoGeneratedDomainModels
{
   public class CustomerTableAsClass
   {
      public CustomerTableAsClass()
      {
      }
      public string FirstName   { get; set; }
      public string LastName    { get; set; }
      public int  Age           { get; set; }
      public char Sex           { get; set; }
      public string Address     { get; set; }
      public string Birthdate   { get; set; }
   }
}

I am satisfied with the results and look of the code. The DataTableToCode() function can be a huge time saver if you have a large number of tables you need to write classes for, or if each DataTable contains a large number of columns.

If you found any of this helpful, or have any comments or suggestions, please feel free to post a comment.