Warning. This post has absolutely nothing to do with SQL Server.
What are Genetic Algorithms?
Genetic algorithms are a form of evolutionary computation, a branch of artificial intelligence that focuses on evolving effective or optimal solutions to difficult problems, based on the biological theory of evolution.
Genetic algorithms are, at their core, a search/optimisation technique. They are a way of finding maximum/minimum solutions to problems and, can be effective when there is no algorithmic solution to the problem. An example here would be the ‘Travelling Salesman’ problem.
Genetic algorithms work by taking an initial population of potential solutions (referred to as individuals), selecting a subset of the population that has the highest fitness then using that subset to generate a second generation. From the second generation again a subset with the highest fitness is selected and used to generate a third generation. This repeats until either the ‘fittest’ individual is considered a good enough solution, or until a certain number of generations have passed.
There are advantage to using genetic algorithms to solve problems over more traditional methods like hill climbing.
- Genetic algorithms can quickly produce good solutions though they may take a lot of time to find the best solution. This is a benefit when the problem is such that the absolute best solution is not necessary, just one that is ‘good enough’
- They are not susceptible to getting trapped by local maxima.
- They do not work on the entire search space one potential solution at a time, but rather work on populations of potential solutions, focusing towards more optimal areas of the search space.
A genetic algorithm will almost always find an optimal solution, given enough time. The main downside is that they may take a lot of time to find that optimal solution.
Components of a Genetic Algorithm
There are two main critical parts in setting up a genetic algorithm for a problem.
- The encoding of the potential solutions into a form where they can be operated on.
- The fitness function which defines which individuals are better than others, which are closer to the maximum that is being searched for.
Most of the design work when using genetic algorithms goes into those two problems.
Encoding
Encoding is the process of taking all the values that make up a potential solution and turning them into a form that the genetic algorithm can operate on.
The selection of an encoding is of utmost importance to the effectiveness of the entire process and a poor representation can make the entire problem much harder than it should. Unfortunately there has been little academic work done on the process of designing representations.
Often for genetic algorithms, the end result of the encoding will be a binary string. There are other variations of evolutionary computation that use other representations, from the arrays of real numbers used by evolutionary strategies to the code trees used by genetic programming.
Fitness function
Depending on the problem, the fitness function can be trivial to write or near-impossible. The design of the fitness function is completely based on the problem that is being solved.
There are two important considerations for a fitness function.
- It must be deterministic.
- It must be fast
If the fitness of an individual is assessed twice, it must come to the same value1. If the fitness function could return different values for the same individual, then it is of no use in determining the fittest individuals in the population and hence the genetic algorithm will not be able to identify the best solution to the problem.
The fitness of each individual is assessed at least once in each generation. The calculation of the fitness function is usually the most time consuming part of the entire process and the longer the fitness function takes to run, the longer the entire process is run
(1) There are cases where the fitness of an individual may depend on external factors which change over time. Hence a fitness function may give different values for one individual if calculated at different times. Genetic algorithms in a changing environment are a little beyond the scope of this entry.
Evolution Process
In order to create a new generation, the fittest individuals from the previous generation are taken and used to generate the next generation. There are two main operators that are used to generate a generation from the previous one. Crossover and mutation.
Crossover
Crossover involved taking two individuals, splitting each one’s encoded string and swapping parts to generate two new individuals.
Say we had two individuals with the following encoded strings (spaces added for clarity)
0000 0001 1111 1110
0101 1010 1100 0011
and we chose the splitting point for the crossover after the 4th bit, the resulting strings after the crossover will be
0000 1010 1100 0011
0101 0001 1111 1110
In genetic algorithms crossover is the primary operator used. What I described here was a single crossover. There are a number of other variations that can be used.
Mutation
Mutation is an operator applied to a single individual. It’s usually applied after crossover has generated new individuals. Mutation involves flipping a single bit somewhere in the encoded string.
Let’s take the two individuals that were generates by the crossover earlier and apply a random mutation to each
0000 1010 1100 0011
0101 0001 1111 1110
After
0000 1010 1101 0011
0101 0001 1011 1110
In genetic algorithms mutation is very seldom applied and only a small percentage of individuals in a generation will be affected by the mutation operator.
Example
As a quick example let’s manually evolve a simple function to see how the whole thing works.
Let’s say I have an array of 4 numbers (call it num) between 0 and 15. I want to know what values give me the best value for the following.
num[1]-num[2]-num[3]+num[4]
I know, that’s simple enough that we could work out the optimal solution just by eye. Not the point. This is enough to do a quick and effective demo with.
I’m going to encode that by simply converting the numbers in the array to binary and concatenating the binary representations of the 4 numbers (spaces just added for clarity). The fitness function is already defined. I’m going to start with an initial population of eight individuals.
1111 1111 1100 1110 – fitness = (15-15-12+14) = 2
0101 1010 1100 0011 – fitness = (9-10-12+3) = –10
1011 0111 0011 1111 – fitness = (11-7-3+15) = 16
1111 1001 1010 0011 – fitness = (15-9-10+3) = -1
1010 1010 1010 1010 – fitness = (10-10-10+10) = 0
1000 0010 0111 0110 – fitness = (8-2-7+6) = 5
0000 0001 1111 1110 – fitness = (0-1-15+14) = -2
1010 0101 0010 0101 – fitness = (10-5-2+5) = 8
From this I’m going to take the 4 individuals with the highest fitness, use crossover operations (with the crossover point exactly in the middle) between them until I have 8 individuals for the 2nd generation and then apply a single bit mutation to one of the individuals (detailed steps left as an exercise for the reader)
1111 1111 0011 1111 – fitness = (15-15-3+15) = 12
1011 0111 1100 1110 – fitness = (11-7-12+14) = 6
1111 1011 0010 0101 – fitness = (15-11-2+5) = 7
1010 0101 1100 1110 – fitness = (10-5-7+6) = 4
1011 0111 0111 0110 – fitness = (11-7-7+6) = 3
1000 0010 0011 1111 – fitness = (8-2-3+15) = 18
1010 0101 0111 0110 – fitness = (10-5-7+6) = 4
1000 0010 0010 0101 – fitness = (8-2-2+5) = 9
We can already see an improvement. The average and maximum fitness is much higher than for the first generation. I’ll do one more generation in this example, again taking the 4 fittest individuals, crossing over to generate 8 new individuals and then applying a single bit mutation to two individuals. This time however, the crossover point will between the 4th and 5th bit.
1000 1111 0011 1111 – fitness = (8-15-3+15) = 5
1111 0010 0011 1111 – fitness = (15-2-3+15) = 25
1000 1011 0010 0101 – fitness = (8-11-2+5) = 0
1111 0010 0011 1111 – fitness = (15-2-3+15) = 25
1111 0010 1010 0101 – fitness = (15-2-10+5) = 8
1000 0111 0011 1111 – fitness = (8-7-3+15) = 13
1111 0010 0010 0101 – fitness = (15-2-2+5) = 16
1000 1011 0010 0101 – fitness = (8-11-2+5) = 0
I think that’s enough for this example. We’re getting fairly close to the best possible solution (15,0,0,15), close enough to see how this works. The population size was very low, that’s why there are duplicates appearing in the results of the crossover. With a larger search space there would be a lot more diversity.
I hope that anyone still reading found this brief diversion into the realms of AI interesting. The regular SQL-related posts will return soon.
Hi Gail
This made an interesting read although I have never dealt with AI before.
Even though I don’t fully understand what’s going on, I tried doing the reader’s exercise and was stumped for quite a while trying to come up with your results for the 2nd generation. This carried on for a while until I realised that the fitness for 1010 0101 0010 0101 isn’t (10-5-4+5) = 6, but rather (10-5-2+5) = 8.
Lacking natural intelligence, not to mention the inability to comprehend the artificial sort :-), I was concentrating on the numeric values and ignoring the binary ones.
Two questions:
1. Am I on the money with the above?
2. How does the algorithm decide which two individuals are to be crossover partners?
Otherwise, thanks for your blog which is always an interesting read.
Oops, did I make a mistake calculating the fitnesses? I’ll check.
You are correct with your fitness calculations.
The crossover pairs can be chosen in a variety of ways. Here I went for a biased random (I picked at random and, if the two didn’t produce a useful new individual went back and picked two others)
Thanks for that! I am glad I wasn’t completely lost. By using a biased random, it sounds like you too (intuitively) used a genetic algorithm to get “useful” individuals. Interesting.
There is nothing like pumping some “cranial iron” to get revved up for the weekend 😉 (That does make me sound like a boring person, doesn’t it? But it was fun!)
Have a good one.
A very interesting introduction to something I expect to be looking into a lot more.
Gail,
I’ve been using your code to help me better understand tsql and AI (GA in particular) so I do have a question. I am not understanding why you divide by four on this particular line of code:
CEILING(RAND(CHECKSUM(NEWID()))*NumberAvailable/4)
It seems that whatever the factor (whether it be 1, 4 or 1000) I always receive a very fit individual, so what am I missing?
Thanks!
i want to: optimize database join query with genetic algorithms
How to do it?
tnx
That’s not a good use case. GAs are slow, they’re good when run once to optimise something, not as part of a database optimisation process which has to be very fast and happens frequently.