Software Usage License Agreement
Copyright (C) [year] [copyright holders]
This software, its code, or any fan fiction resulting from its enormous popularity, is provided "AS IS". That means that the software included with this license is provided without warranty, regardless of any previous verbal contract you may have swindled out of the author with your fast words on the phone...
To the maximum extent permitted by law, the author of this software disclaims all liability for any damages, lost profits, lost socks, lost puppies, hair loss, or architectural fanaticism and cults that may occur as a result of using this software. Under no circumstances shall the author of this software, nor his pet cat 'Mr.Whiskers', be held legally, financially or morally liable for any claim of damages or liability resulting from the use of, abstinence from, ritualistic worshiping of, or illegal pirating of this software. This includes, but is not limited to, lost profits, stolen data, bankruptcy, autonomous homicidal cyborgs or man-eating miniature pony attacks.
Use at your own risk! This software comes with no guarantees of fitness for any purpose, so do not use it for the back-end of your Fortune 500 business without at least hiring me first.
Wednesday, July 31, 2013
Saturday, July 27, 2013
Information entropy and data compression
In my last post, I talked about Shannon data entropy and showed a class to calculate it. Let's take it one step further and actually compress some data based on the data entropy we calculated.
To do this, we first calculate how many bits are needed to encode each byte of our data. In theory, this is the data entropy rounded up to the next whole number (Math.Ceiling). In practice that is not always enough: with a fixed-length code, the number of unique symbols in our data may be too large to be represented in that many bits. We calculate the number of bits needed to represent the unique symbols by taking the base-2 logarithm of the symbol count. This returns a double, so we again use Math.Ceiling to round up to the nearest whole number. We set entropy_Ceiling to whichever number is larger. If entropy_Ceiling is 8, we return immediately, as this scheme cannot compress the data any further.
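As a rough illustration of the calculation described above, here is a minimal, self-contained sketch. The class and method names (BitsPerSymbolSketch, Entropy, BitsPerSymbol) are made up for this example and are not part of the classes shown in this post.

```csharp
using System;
using System.Collections.Generic;

public static class BitsPerSymbolSketch
{
    // Shannon entropy of the data, in bits per byte.
    public static double Entropy(byte[] data)
    {
        var counts = new Dictionary<byte, int>();
        foreach (byte b in data)
        {
            int n;
            counts.TryGetValue(b, out n); // n stays 0 when the key is missing
            counts[b] = n + 1;
        }

        double entropy = 0;
        foreach (int count in counts.Values)
        {
            double p = (double)count / data.Length;
            entropy += p * Math.Log(1 / p, 2);
        }
        return entropy;
    }

    // Larger of: entropy rounded up, and the bits needed to number every unique symbol.
    public static int BitsPerSymbol(byte[] data)
    {
        int unique = new HashSet<byte>(data).Count;
        int bitsForSymbols = (int)Math.Ceiling(Math.Log(unique, 2));
        int bitsForEntropy = (int)Math.Ceiling(Entropy(data));
        return Math.Max(bitsForSymbols, bitsForEntropy);
    }
}
```

For the 16 bytes of "aaaaaaaaaaaaabcd", the entropy is a little under 1 bit per byte, but there are 4 unique symbols, so a fixed-length code needs 2 bits per symbol. This is the case where the symbol count, not the entropy, decides entropy_Ceiling.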
We start by making a compression and a decompression dictionary. We make these by taking the sorted distribution dictionary (DataEntropyUTF8.GetSortedDistribution) and assigning an X-bit value to each entry in it, with X being entropy_Ceiling. The compression dictionary has a byte as the key and an array of bool (bool[]) as the value, while the decompression dictionary has an array of bool as the key and a byte as the value. You'll notice that in the decompression dictionary we store the array of bool as a string. Using an actual array as a key will not work, because arrays are compared by reference: the dictionary's default EqualityComparer will not produce the same hash code for two different arrays that hold the same values.
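The key problem is easy to demonstrate. The sketch below uses made-up names (ArrayKeySketch, ToKey) and only illustrates the behaviour; it is not the code used by the compression class.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class ArrayKeySketch
{
    // Turns a bit pattern into a string such as "101", which compares by value.
    public static string ToKey(bool[] bits)
    {
        return new string(bits.Select(b => b ? '1' : '0').ToArray());
    }

    public static bool ArrayKeyFound()
    {
        var byArray = new Dictionary<bool[], byte>();
        byArray.Add(new[] { true, false, true }, 0x41);
        // A second array with the same contents is a different object,
        // so the default comparer does not find it.
        return byArray.ContainsKey(new[] { true, false, true });
    }

    public static bool StringKeyFound()
    {
        var byString = new Dictionary<string, byte>();
        byString.Add(ToKey(new[] { true, false, true }), 0x41);
        // Strings with the same characters are equal, so this lookup succeeds.
        return byString.ContainsKey(ToKey(new[] { true, false, true }));
    }
}
```

ArrayKeyFound() returns false and StringKeyFound() returns true. An alternative to string keys would be to pass a custom IEqualityComparer<bool[]> to the dictionary constructor.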
Then, compression is as easy as reading each byte, getting the value for that byte from the compression dictionary, and adding it to a list of bool (List<bool>). The list is then padded to a whole number of bytes and packed into a byte array.
Decompression consists of converting the compressed array of bytes into an array of bool, then reading in X bools at a time and getting the byte value from the decompression library, again with X being entropy_Ceiling.
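The packing and unpacking steps can be sketched on their own, independent of the dictionaries. This is a simplified illustration with hypothetical names (BitPackingSketch, Pack, Unpack), assuming bits are stored most significant bit first and the last byte is padded with zeros.

```csharp
using System;
using System.Collections.Generic;

public static class BitPackingSketch
{
    // Packs bits MSB-first into bytes; unused bits in the last byte stay 0.
    public static byte[] Pack(IList<bool> bits)
    {
        byte[] result = new byte[(bits.Count + 7) / 8];
        for (int i = 0; i < bits.Count; i++)
        {
            if (bits[i])
            {
                result[i / 8] |= (byte)(0x80 >> (i % 8));
            }
        }
        return result;
    }

    // Reads back the first bitCount bits, which drops the padding.
    public static List<bool> Unpack(byte[] packed, int bitCount)
    {
        var bits = new List<bool>(bitCount);
        for (int i = 0; i < bitCount; i++)
        {
            bits.Add((packed[i / 8] & (0x80 >> (i % 8))) != 0);
        }
        return bits;
    }
}
```

Four 3-bit codes (12 bits) pack into 2 bytes instead of the original 4. The decompressor needs to know the original bit count, which is why the class below keeps decodeLength_Bits.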
But first, to make this process easier and our code more manageable and readable, I define several extension methods to help us out, since .NET provides almost no support for working with data at the bit level besides the BitArray class. Here are the extension methods that make working with bits easier:
public static class BitExtentionMethods
{
//
// List<bool> extension methods
//
public static List<bool> ToBitList(this byte source)
{
List<bool> temp = ( new BitArray(source.ToArray()) ).ToList();
temp.Reverse();
return temp;
}
public static List<bool> ToBitList(this byte source,int startIndex)
{
if(startIndex<0 || startIndex>7) {
return new List<bool>();
}
return source.ToBitList().GetRange(startIndex,(8-startIndex));
}
//
// bool[] extension methods
//
public static string GetString(this bool[] source)
{
string result = string.Empty;
foreach(bool b in source)
{
if(b) {
result += "1";
} else {
result += "0";
}
}
return result;
}
public static bool[] ToBitArray(this byte source,int MaxLength)
{
List<bool> temp = source.ToBitList(8-MaxLength);
return temp.ToArray();
}
public static bool[] ToBitArray(this byte source)
{
return source.ToBitList().ToArray();
}
//
// BYTE extension methods
//
public static byte[] ToArray(this byte source)
{
List<byte> result = new List<byte>();
result.Add(source);
return result.ToArray();
}
//
// BITARRAY extension methods
//
public static List<bool> ToList(this BitArray source)
{
List<bool> result = new List<bool>();
foreach(bool bit in source)
{
result.Add(bit);
}
return result;
}
public static bool[] ToArray(this BitArray source)
{
return ToList(source).ToArray();
}
}
Remember, extension methods must be defined in a static class declared directly in a namespace, not in a nested class. Now we are free to write our compression/decompression class:
public class BitCompression
{
// Data to encode
byte[] data;
// Compressed data
byte[] encodeData;
// # of bits needed to represent data
int encodeLength_Bits;
// Original size before padding. Decompressed data will be truncated to this length.
int decodeLength_Bits;
// Bits needed to represent each byte (entropy rounded up to nearest whole number)
int entropy_Ceiling;
// Data entropy class
DataEntropyUTF8 fileEntropy;
// Stores the compressed symbol table
Dictionary<byte,bool[]> compressionLibrary;
Dictionary<string,byte> decompressionLibrary;
void GenerateLibrary()
{
byte[] distTable = new byte[fileEntropy.Distribution.Keys.Count];
fileEntropy.Distribution.Keys.CopyTo(distTable,0);
byte bitSymbol = 0x0;
bool[] bitBuffer = new bool[entropy_Ceiling];
foreach(byte symbol in distTable)
{
bitBuffer = bitSymbol.ToBitArray(entropy_Ceiling);
compressionLibrary.Add(symbol,bitBuffer);
decompressionLibrary.Add(bitBuffer.GetString(),symbol);
bitSymbol++;
}
}
public byte[] Compress()
{
// Error checking
if(entropy_Ceiling>7 || entropy_Ceiling<1) {
return data;
}
// Compress data using compressionLibrary
List<bool> compressedBits = new List<bool>();
foreach(byte bite in data) { // Take each byte, find the matching bit array in the dictionary
compressedBits.AddRange(compressionLibrary[bite]);
}
decodeLength_Bits = compressedBits.Count;
// Pad to fill last byte
while(compressedBits.Count % 8 != 0) {
compressedBits.Add(false); // Pad to the nearest byte
}
encodeLength_Bits = compressedBits.Count;
// Convert from array of bits to array of bytes
List<byte> result = new List<byte>();
int count = 0;
int shift = 0;
int offset= 0;
int stop = 0;
byte current = 0;
do
{
stop = encodeLength_Bits - count;
stop = 8 - stop;
if(stop<0) {
stop = 0;
}
if(stop<8)
{
shift = 7;
offset = count;
current = 0;
while(shift>=stop)
{
current |= (byte)(Convert.ToByte(compressedBits[offset]) << shift);
shift--;
offset++;
}
result.Add(current);
count += 8;
}
} while(count < encodeLength_Bits);
encodeData = result.ToArray();
return encodeData;
}
public byte[] Decompress(byte[] compressedData)
{
// Error check
if(compressedData.Length<1) {
return null;
}
// Convert to bit array for decompressing
List<bool> bitArray = new List<bool>();
foreach(byte bite in compressedData) {
bitArray.AddRange(bite.ToBitList());
}
// Truncate to original size, removes padding for byte array
int diff = bitArray.Count-decodeLength_Bits;
if(diff>0) {
// The padding starts at index decodeLength_Bits; starting one earlier would drop the last data bit
bitArray.RemoveRange(decodeLength_Bits,diff);
}
// Decompress
List<byte> result = new List<byte>();
int count = 0;
do
{
bool[] word = bitArray.GetRange(count,entropy_Ceiling).ToArray();
result.Add(decompressionLibrary[word.GetString()]);
count+=entropy_Ceiling;
} while(count < bitArray.Count);
return result.ToArray();
}
public BitCompression(string filename)
{
compressionLibrary = new Dictionary<byte, bool[]>();
decompressionLibrary = new Dictionary<string, byte>();
if(!File.Exists(filename)) {
return;
}
data = File.ReadAllBytes(filename);
fileEntropy = new DataEntropyUTF8();
fileEntropy.ExamineChunk(data);
int unique = (int)Math.Ceiling(Math.Log((double)fileEntropy.UniqueSymbols,2f));
int entropy = (int)Math.Ceiling(fileEntropy.Entropy);
entropy_Ceiling = Math.Max(unique,entropy);
encodeLength_Bits = data.Length * entropy_Ceiling;
GenerateLibrary();
}
}
Please feel free to comment with ideas, suggestions or corrections.
Monday, July 22, 2013
Information Shannon Entropy
Shannon/data entropy is a measurement of uncertainty. Entropy can be used as a measure of randomness. Data entropy is typically expressed as the number of bits needed to encode or represent data. In the example below, we are working with bytes, so the max entropy for a stream of bytes is 8.
A file with high entropy means that each symbol is more or less equally likely to appear next. If a file or file stream has high entropy, it is probably compressed, encrypted, or random. This can be used to detect packed executables, cipher streams on a network, or a breakdown of encrypted communication on a network that is expected to always be encrypted.
A text file will typically have low entropy. If a file has low data entropy, it means that the file will compress well.
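For reference, the quantity described above can be written as a formula. Here p_i is the fraction of the data made up of symbol i (its count n_i divided by the total number of bytes N). This is the standard Shannon definition, with two small worked cases added for illustration:

```latex
H = \sum_{i} p_i \log_2\!\left(\frac{1}{p_i}\right) = -\sum_{i} p_i \log_2 p_i,
\qquad p_i = \frac{n_i}{N}

% Two symbols, each making up half of the data (for example the 4 bytes "AABB"):
H = \tfrac{1}{2}\log_2 2 + \tfrac{1}{2}\log_2 2 = 1 \text{ bit per byte}

% All 256 byte values equally likely:
H = 256 \cdot \tfrac{1}{256}\log_2 256 = 8 \text{ bits per byte}
```

The second case is why 8 is the maximum for a stream of bytes.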
This post and code were inspired by Mike Schiffman's excellent explanation of data entropy on his Cisco Security Blog.
Here is what I wrote:
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;
namespace DataEntropy
{
public class DataEntropyUTF8
{
// Stores the number of times each symbol appears
SortedList<byte,int> distributionDict;
// Stores the probability of each symbol
SortedList<byte,double> probabilityDict;
// Stores the last calculated entropy
double overalEntropy;
// Used for preventing unnecessary processing
bool isDirty;
// Bytes of data processed
int dataSize;
public int DataSampleSize
{
get { return dataSize; }
private set { dataSize = value; }
}
public int UniqueSymbols
{
get { return distributionDict.Count; }
}
public double Entropy
{
get { return GetEntropy(); }
}
public Dictionary<byte,int> Distribution
{
get { return GetSortedDistribution(); }
}
public Dictionary<byte,double> Probability
{
get { return GetSortedProbability(); }
}
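public byte GetGreatestDistribution()
{
// SortedList is ordered by key, so search by value for the most frequent symbol
return distributionDict.OrderByDescending(entry => entry.Value).First().Key;
}
public byte GetGreatestProbability()
{
// Make sure probabilityDict is up to date before searching it
GetEntropy();
return probabilityDict.OrderByDescending(entry => entry.Value).First().Key;
}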
public double GetSymbolDistribution(byte symbol)
{
return distributionDict[symbol];
}
public double GetSymbolEntropy(byte symbol)
{
return probabilityDict[symbol];
}
Dictionary<byte,int> GetSortedDistribution()
{
List<Tuple<int,byte>> entryList = new List<Tuple<int, byte>>();
foreach(KeyValuePair<byte,int> entry in distributionDict)
{
entryList.Add(new Tuple<int,byte>(entry.Value,entry.Key));
}
entryList.Sort();
entryList.Reverse();
Dictionary<byte,int> result = new Dictionary<byte, int>();
foreach(Tuple<int,byte> entry in entryList)
{
result.Add(entry.Item2,entry.Item1);
}
return result;
}
Dictionary<byte,double> GetSortedProbability()
{
List<Tuple<double,byte>> entryList = new List<Tuple<double,byte>>();
foreach(KeyValuePair<byte,double> entry in probabilityDict)
{
entryList.Add(new Tuple<double,byte>(entry.Value,entry.Key));
}
entryList.Sort();
entryList.Reverse();
Dictionary<byte,double> result = new Dictionary<byte,double>();
foreach(Tuple<double,byte> entry in entryList)
{
result.Add(entry.Item2,entry.Item1);
}
return result;
}
double GetEntropy()
{
// If nothing has changed, don't recalculate
if(!isDirty) {
return overalEntropy;
}
// Reset values
overalEntropy = 0;
probabilityDict = new SortedList<byte,double>();
foreach(KeyValuePair<byte,int> entry in distributionDict)
{
// Probability = Freq of symbol / # symbols examined thus far
probabilityDict.Add(
entry.Key,
(double)distributionDict[entry.Key] / (double)dataSize
);
}
foreach(KeyValuePair<byte,double> entry in probabilityDict)
{
// Entropy = probability * Log2(1/probability)
overalEntropy += entry.Value * Math.Log((1/entry.Value),2);
}
isDirty = false;
return overalEntropy;
}
public void ExamineChunk(byte[] chunk)
{
if(chunk==null || chunk.Length<1) {
return;
}
isDirty = true;
dataSize += chunk.Length;
foreach(byte bite in chunk)
{
if(!distributionDict.ContainsKey(bite))
{
distributionDict.Add(bite,1);
continue;
}
distributionDict[bite]++;
}
}
public void ExamineChunk(string chunk)
{
ExamineChunk(StringToByteArray(chunk));
}
byte[] StringToByteArray(string inputString)
{
// char[].Cast<byte>() throws InvalidCastException at run time, so encode the string as UTF-8 instead
return System.Text.Encoding.UTF8.GetBytes(inputString);
}
void Clear()
{
isDirty = true;
overalEntropy = 0;
dataSize = 0;
distributionDict = new SortedList<byte, int>();
probabilityDict = new SortedList<byte, double>();
}
public DataEntropyUTF8(string fileName)
{
this.Clear();
if(File.Exists(fileName))
{
ExamineChunk( File.ReadAllBytes(fileName) );
GetEntropy();
GetSortedDistribution();
}
}
public DataEntropyUTF8()
{
this.Clear();
}
}
}
Sunday, July 21, 2013
C# developer humor - Chuck Norris.
These jokes were inspired by this link. I have modified them a bit to apply to the C or C# language.
Here are the modified ones:
- Chuck Norris can make a class that is both abstract and constant.
- Chuck Norris serializes objects straight into human skulls.
- Chuck Norris doesn’t deploy web applications, he roundhouse kicks them into the server.
- Chuck Norris always uses his own design patterns, and his favorite is the Roundhouse Kick.
- Chuck Norris always programs using unsafe code.
- Chuck Norris only enumerates roundhouse kicks to the face.
- Chuck Norris demonstrated the meaning of float.PositiveInfinity by counting to it, twice.
- A lock statement doesn’t protect against Chuck Norris, if he wants the object, he takes it.
- Chuck Norris doesn’t use VisualStudio, he codes .NET by using a hex editor on the MSIL.
- When someone attempts to use one of Chuck Norris’ deprecated methods, they automatically get a roundhouse kick to the face at compile time.
- Chuck Norris never has a bug in his code, without exception!
- Chuck Norris doesn’t write code. He stares at a computer screen until he gets the program he wants.
- Code runs faster when Chuck Norris watches it.
- Chuck Norris methods don't catch exceptions because no one has the guts to throw any at them.
- Chuck Norris will cast a value to any type, just by staring at it.
- If you catch { } a ChuckNorrisException, you’ll probably die.
- Chuck Norris’s code can roundhouse kick all other classes' privates.
- C#'s visibility levels are public, private, protected, and “protected by Chuck Norris”. Don’t try to access a field with this last modifier!
- Chuck Norris can divide by 0!
- The garbage collector only runs on Chuck Norris code to collect the bodies.
- Chuck Norris can execute 64bit length instructions in a 32bit CPU.
- To Chuck Norris, all other classes are IDisposable.
- Chuck Norris can do multiple inheritance in C#.
- MSBuild never throws exceptions to Chuck Norris, not anymore. 753 killed Microsoft engineers is enough.
- Chuck Norris doesn’t need unit tests, because his code always works. ALWAYS.
- Chuck Norris has been coding in generics since 1.1.
- Chuck Norris’ classes can’t be decompiled... don’t bother trying.
Here are some originals:
- If you try to derive from a Chuck Norris Interface, you'll only get an IRoundhouseKick in-the-face.
- Chuck Norris can serialize a dictionary to XML without implementing IXmlSerializable.
- Chuck Norris can decompile your assembly by only reading the MSIL.
Tuesday, July 16, 2013
Convert a Class or List of Class to a DataTable, using reflection.
Note by author:
Since writing this, I have expanded on this idea quite a bit. I have written a lightweight ORM class library that I call EntityJustWorks.
The full project can be found on GitHub or CodePlex.
EntityJustWorks not only goes from a class to DataTable (below), but also provides:
- SQL 'SELECT' statement to a List<T> of populated classes, each one resembling a row
Security Warning:
This library generates dynamic SQL, and has functions that generate SQL and then immediately execute it. While it is true that all strings funnel through the function Helper.EscapeSingleQuotes, this can be defeated in various ways, and only parameterized SQL should be considered SAFE. If you have no need for them, I recommend stripping semicolons ; and dashes --. There are also some Unicode characters that can be interpreted as a single quote, or that may be converted to one when changing encodings. Additionally, there are Unicode characters that can crash .NET code, though mainly controls (think TextBox). You almost certainly should impose a white list:
string clean = new string(dirty.Where(c => "abcdefghijklmnopqrstuvwxyz0123456789.,\"_ !@".Contains(c)).ToArray());
PLEASE USE the SQLScript.StoredProcedure and DatabaseQuery.StoredProcedure classes to generate SQL for you, as the scripts they produce are parameterized. All of the functions can be altered to generate parameterized instead of sanitized scripts. Ever since people started using this, I have been maintaining backwards compatibility. However, I may break this in the future, as I do not wish to teach dangerous or bad habits to someone who is learning. This project is a few years old, and it's already showing its age. What is probably needed here is a total rewrite, deprecating this version while keeping it available for legacy users after slapping big warnings all over the place. This project was designed to generate the SQL scripts for standing up a database for a project, using only MY input as data. This project was never designed to process a USER'S input! Even if the data isn't coming from an adversary, client/user/manually entered data is notoriously inconsistent. Please do not use this code on any input that did not come from you without first implementing parameterization. Again, please see the SQLScript.StoredProcedure class for inspiration on how to do that.
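For readers unfamiliar with parameterization, here is a minimal sketch of the general idea using the provider-neutral System.Data.Common classes. It is not how SQLScript.StoredProcedure is implemented. The table name, column name, and the "@" parameter prefix (SQL Server style) are assumptions made for this example, and ParameterizedQuerySketch is a made-up class name.

```csharp
using System;
using System.Data;
using System.Data.Common;

public static class ParameterizedQuerySketch
{
    // Hypothetical table and column names, for illustration only.
    // The user's value never becomes part of this text.
    public const string Sql = "SELECT * FROM Orders WHERE CustomerName = @customerName";

    public static DbCommand BuildCommand(DbConnection connection, string customerName)
    {
        DbCommand command = connection.CreateCommand();
        command.CommandText = Sql;

        // The value travels separately from the SQL text, so quotes, semicolons
        // and dashes in it are treated as data rather than as SQL.
        DbParameter parameter = command.CreateParameter();
        parameter.ParameterName = "@customerName";
        parameter.DbType = DbType.String;
        parameter.Value = customerName;
        command.Parameters.Add(parameter);

        return command;
    }
}
```

The caller would open the connection and run the returned command with ExecuteReader as usual. Because the input is bound as a parameter, no escaping or white-listing is needed to keep it out of the SQL text, although validating it may still be worthwhile for data quality.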
This class uses generics to accept a class type, and uses reflection to determine the names and types of the class's public properties. With that, a new DataTable is made and its DataColumnCollection is fleshed out. Then you can add rows to the DataTable by passing in instances of the class whose properties contain values.
Finally, we serialize the DataTable to an XML file, save its schema, then load it all back in again as a proof of concept.
Usage example:
List<Order> orders = new List<Order>();
// Fill in orders here ...
// orders.Add(new Order());
// Convert class to DataTable
DataTable ordersTable = ClassListToDataTable(orders);
// Set DataGrid's DataSource to DataTable
dataGrid1.DataSource = ordersTable;
Here is the code:
public static DataTable ClassToDataTable<T>() where T : class
{
Type classType = typeof(T);
List<PropertyInfo> propertyList = classType.GetProperties().ToList();
if (propertyList.Count < 1)
{
return new DataTable();
}
string className = classType.UnderlyingSystemType.Name;
DataTable result = new DataTable(className);
foreach (PropertyInfo property in propertyList)
{
DataColumn col = new DataColumn();
col.ColumnName = property.Name;
Type dataType = property.PropertyType;
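// Nullable<T> columns store the underlying type T and allow DBNull
Type underlyingType = Nullable.GetUnderlyingType(dataType);
if (underlyingType != null)
{
dataType = underlyingType;
}
else if (dataType.IsValueType)
{ // AllowDBNull is true by default
col.AllowDBNull = false;
}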
col.DataType = dataType;
result.Columns.Add(col);
}
return result;
}
public static DataTable ClassListToDataTable<T>(List<T> ClassList) where T : class
{
DataTable result = ClassToDataTable<T>();
if(result.Columns.Count < 1)
{
return new DataTable();
}
if(ClassList.Count < 1)
{
return result;
}
foreach(T item in ClassList)
{
ClassToDataRow(ref result, item);
}
return result;
}
public static void ClassToDataRow<T>(ref DataTable Table, T Data) where T : class
{
Type classType = typeof(T);
string className = classType.UnderlyingSystemType.Name;
// Checks that the table name matches the name of the class.
// This is not required, and it may be desirable to disable this check.
// Comment this out or add a boolean to the parameters to disable this check.
if (!Table.TableName.Equals(className))
{
return;
}
DataRow row = Table.NewRow();
List<PropertyInfo> propertyList = classType.GetProperties().ToList();
foreach (PropertyInfo prop in propertyList)
{
if (Table.Columns.Contains(prop.Name))
{
if (Table.Columns[prop.Name] != null)
{
// DataRow does not accept null for value-type columns, so store DBNull instead
row[prop.Name] = prop.GetValue(Data, null) ?? DBNull.Value;
}
}
}
Table.Rows.Add(row);
}
public static bool IsNullable(Type Input)
{
if (!Input.IsValueType) return true; // Is a ref-type, such as a class
if (Nullable.GetUnderlyingType(Input) != null) return true; // Nullable
return false; // Must be a value-type
}
Here is an example of how to serialize a DataTable to XML and load it back again:
string filePath = "order1.xml";
string schemaPath = Path.ChangeExtension(filePath,".xsd");
ordersTable.WriteXml(filePath);
ordersTable.WriteXmlSchema(schemaPath);
// Load
DataTable loadedTable = new DataTable();
loadedTable.ReadXmlSchema(schemaPath);
loadedTable.ReadXml(filePath);
// Set DataGrid's DataSource
dataGrid1.DataSource = loadedTable;
The full project and source code for EntityJustWorks can be found on GitHub and CodePlex.