And now for a completely inappropriate use of SQL Server

A while back I wrote up a short introductory overview of Genetic Algorithms. Just for the shear, absolute fun of it, I thought I’d implement a basic genetic algorithm within SQL Server and use it to solve a form of the knapsack problem.

Now first a few comments on this. As the title states, this is not really an appropriate use of SQL Server. Genetic algorithms are generally implemented in languages like Java, C++, C#, etc; languages that are good at complex mathematics, string manipulation and have complex data types. I’m not necessarily using efficient, well-performing methods here, UDFa abound. This is not an article on best practices and well-performing code. I’m also doing no error handling, which I would if this were a real system (in a more suitable language)

Still, doing just for the sake of seeing if it’s possible is all sorts of fun. So, without further ado, the knapsack problem, an approximate solution with genetic algorithms in SQL Server. (For the record, this is a multi-constrained, bounded knapsack problem)

The scenario

There’s a bag that has a maximum volume that it can hold and a maximum mass that it can hold (and we assume that we can pack perfectly with no wasted space). There are eight items, all with different masses, different volumes and different values. The goal here is to maximise the total value of all the items contained within the bag.

CREATE TABLE ObjectStatistics (
  ObjectNumber TINYINT NOT NULL,
  Mass NUMERIC(4,2) NOT NULL,
  Volume NUMERIC(4,2) NOT NULL,
  Value NUMERIC(4,2) NOT NULL,
  NumberAvailable TINYINT NOT NULL,
  CONSTRAINT pk_ObjectStats PRIMARY KEY CLUSTERED (ObjectNumber)
);

CREATE TABLE BagStatistics (
  MaxMass NUMERIC(5,2),
  MaxVolume NUMERIC(5,2)
);

INSERT INTO dbo.ObjectStatistics (ObjectNumber, Mass, Volume, Value, NumberAvailable)
VALUES
  (1,0.5,0.45,2,50),
  (2,1.5,0.25,20,10),
  (3,0.1,1.25,8,150),
  (4,0.25,0.1,1,250),
  (5,0.67,0.3,6,100),
  (6,0.34,0.75,18,5),
  (7,1.25,0.25,5,40),
  (8,1.1,0.8,10,25);

INSERT INTO dbo.BagStatistics (MaxMass, MaxVolume)
VALUES  (100, 75);

Those two tables set up the constraints for the scenario, the maximum mass and volume for the bag and the mass, volume, value and maximum number available for each of the items.

The setup

Next we need a way to store the potential solutions (referred to as ‘individuals’). I decided on two tables (pretending to be doing proper DB design here). Keeping everything in one table would also have worked, it would just have required a different implementation later on.

CREATE TABLE Individuals (
  Generation SMALLINT,
  IndividualID SMALLINT,
  CONSTRAINT pk_Individuals PRIMARY KEY CLUSTERED (Generation, IndividualID)
);

CREATE TABLE IndividualDetails (
  Generation SMALLINT,
  IndividualID SMALLINT,
  ObjectNumber TINYINT,
  Quantity TINYINT,
  CONSTRAINT pk_IndividualDetails PRIMARY KEY CLUSTERED (Generation, IndividualID, ObjectNumber),
  CONSTRAINT fk_IndividualDetails_Individuals FOREIGN KEY (Generation, IndividualID) REFERENCES dbo.Individuals (Generation, IndividualID)
);

I’m going to add two more tables, one to make the processing easier (it’ll become clear later where this is used) and one to store some results. Also a view to get around the ‘cannot use RAND inside a UDF’ limitation.

CREATE TABLE Numbers (
  Value TINYINT PRIMARY KEY,
  BinaryRepresentation CHAR(8) UNIQUE
);

CREATE TABLE GenerationStatistics (
  Generation SMALLINT,
  MaxFitness NUMERIC(8,2),
  AvgFitness NUMERIC(8,2)
);
GO

CREATE VIEW RandomNumber
AS
  SELECT RAND(CHECKSUM(NEWID())) AS Random;
GO

That’s the basics out of the way. There are two important things I need to decide on straightaway.

The first is an encoding of the individual’s properties (in this case object quantities) so that the genetic algorithm can manipulate them. I’m going to use a binary representation – convert the quantity for each object into binary and concatenate the strings together. 8 bits per object.

The second is a fitness function, a way to tell which individuals are the best. In this case it’ll be a simple sum of the values of all the items, taking into account the max mass, max volume and max quantity for each.

I decided to implement both of these as scalar UDFs (yes, shock and horror, I’m using data-accessing scalar UDFs) and make them calculated columns of the Individuals table.

The integer-to-binary uses a slightly modified version of a function written by Mitch Wheat, available on his blog – http://mitch-wheat.blogspot.com/2006/10/t-sql-way-converting-integers-to.html. Thanks Mitch

-- Originally from http://mitch-wheat.blogspot.com/2006/10/t-sql-way-converting-integers-to.html
CREATE FUNCTION IntegerToBinary(@intval int)
RETURNS char(8)
AS
BEGIN
 DECLARE @bincode char(64);

 SET @bincode = '0000000100100011010001010110011110001001101010111100110111101111';

 RETURN (
   SUBSTRING( @bincode, ((@IntVal / 16) & 15) * 4 + 1 , 4 ) +
   SUBSTRING( @bincode, (@IntVal & 15) * 4 + 1 , 4 )
 );
END
GO

CREATE FUNCTION dbo.ComputeBinaryString(@Generation SMALLINT, @Individual SMALLINT)
RETURNS CHAR(64)
AS
BEGIN
 DECLARE @BinaryString VARCHAR(64) = '';

 SELECT @BinaryString = @BinaryString + dbo.IntegerToBinary(Quantity)
   FROM dbo.IndividualDetails
   WHERE Generation = @Generation AND IndividualID = @Individual
   ORDER BY ObjectNumber;

 RETURN @BinaryString;
END;
GO

CREATE FUNCTION dbo.Fitness(@Generation SMALLINT, @Individual SMALLINT)
RETURNS NUMERIC(6,2)
AS
BEGIN
 DECLARE @MaxMass NUMERIC(5,2), @MaxVolume NUMERIC(5,2);
 SELECT @MaxMass = MaxMass, @MaxVolume = MaxVolume
   FROM dbo.BagStatistics;

 DECLARE @TotalValue NUMERIC(6,2) = 0;

 SELECT @TotalValue = SUM(id.Quantity*os.Value)
   FROM dbo.IndividualDetails id
   INNER JOIN dbo.ObjectStatistics os ON id.ObjectNumber = os.ObjectNumber
   WHERE Generation = @Generation AND IndividualID = @Individual
   HAVING SUM(id.Quantity*os.Mass) <= @MaxMass
     AND SUM(id.Quantity*os.Volume) <= @MaxVolume
     AND SUM(CASE WHEN id.Quantity>os.NumberAvailable THEN 1 ELSE 0 END) = 0;

 RETURN @TotalValue;
END;
GO

ALTER TABLE dbo.Individuals
ADD BinaryString AS dbo.ComputeBinaryString (Generation, IndividualID),
    Fitness AS dbo.Fitness(Generation, IndividualID);
GO

At this point I have a way to store the individual in each generation, a way to represent them in order for the genetic algorithm to process them and a way to calculate the fitness. Good to go.

The framework

I now need to implement the genetic algorithm itself. The main components that I need to implement are the crossover and the mutation and then an evolution procedure to generate new individuals from the current ones.

I’m going to again use functions for the crossover and mutation, for convenience and easy coding. A multi-statement table-valued UDF for the crossover and a scalar UDF for the mutation. (Performance? What performance?)

The crossover’s fairly simple. Take 2 binary strings as parameters. Pick a random point along the string and cut the two strings at that point. Create two new strings from the first part of the first input string concatenated with the second part of the second input string and from the first part of the second input string concatenated with the second part of the first input string.

If I started with ‘00000000’ and ‘11111111’ and picked 5 as the crossover point, afterwards I’d get ‘00000111’ and ‘11111000’

CREATE FUNCTION dbo.Crossover(@BinaryString1 CHAR(64), @BinaryString2 CHAR(64))
  RETURNS @NewIndividuals TABLE (CrossoverPoint TINYINT, BinaryString CHAR(64))
AS
BEGIN

 -- where in the string is the crossover
 DECLARE @CrossoverPoint TINYINT;
 SELECT @CrossoverPoint = FLOOR(Random*64)+1
   FROM RandomNumber;

 INSERT INTO @NewIndividuals
 SELECT @CrossoverPoint, SUBSTRING(@BinaryString1,0,@CrossoverPoint) + SUBSTRING(@BinaryString2,@CrossoverPoint, 64-@CrossoverPoint+1)
 UNION ALL
 SELECT @CrossoverPoint, SUBSTRING(@BinaryString2,0,@CrossoverPoint) + SUBSTRING(@BinaryString1,@CrossoverPoint, 64-@CrossoverPoint+1);

 RETURN;

END
GO

The mutation’s not much more complex. Take a single binary string as input. Pick a random position along the string. Flip that bit (0->1; 1->0). Return the modified string

CREATE FUNCTION dbo.ApplyMutation(@BinaryString CHAR(64))
RETURNS CHAR(64)
AS
BEGIN

DECLARE @MutationLocation TINYINT, @SwappedBit CHAR(1);

-- Which bit are we toggling?
SELECT @MutationLocation = FLOOR(Random*64)+1
  FROM RandomNumber;

-- simple toggle for the chosen bit
SET @SwappedBit = 1 - CAST(SUBSTRING(@BinaryString, @MutationLocation,1) AS BIT);

-- and stuff it back in
SET @BinaryString = STUFF(@BinaryString,@MutationLocation,1,@SwappedBit);

RETURN @BinaryString;
END
GO

One thing to note at this point is that neither the crossover nor the mutation has any dependency on the details of the problem being solved. This is intentional, and is generally how genetic algorithms work. With the details encoded properly (in this case into a binary string), the crossover and mutation should be able to work just on the encoded string without needing any knowledge of the actual problem.

Last thing needed here is something that will take the existing individuals, apply the crossover and mutation and return new individuals. This time I think a stored procedure would be best.

CREATE PROCEDURE Evolve (@CurrentGeneration INT)
AS
BEGIN

CREATE TABLE #SelectedIndividuals (
  BinaryString CHAR(64),
  CrossoverGroup TINYINT
);

-- use the fittest half of the current population to evolve the next generation
-- ensures that the low fitness individuals are eliminated (survival of the fittest FTW)
INSERT INTO #SelectedIndividuals (BinaryString, CrossoverGroup)
SELECT BinaryString, NTILE(2) OVER (ORDER BY (SELECT NEWID())) AS CrossoverGroup
  FROM (
    SELECT TOP (50) PERCENT BinaryString
      FROM Individuals
      ORDER BY Fitness DESC
  ) sub;

-- Join so that I have the two binary strings I'm crossing over in the same row.
-- The function then does the cross over and the mutation applies to the result
SELECT @CurrentGeneration AS Generation, ROW_NUMBER() OVER (ORDER BY (SELECT 1)) AS IndividualID, dbo.ApplyMutation(co.BinaryString) AS NewIndividual
 FROM (
    SELECT BinaryString, ROW_NUMBER() OVER (ORDER BY (SELECT NEWID())) AS PartnerNo
      FROM #SelectedIndividuals
      WHERE CrossoverGroup = 1
    ) AS Group1
  INNER JOIN (
    SELECT BinaryString, ROW_NUMBER() OVER (ORDER BY (SELECT NEWID())) AS PartnerNo
      FROM #SelectedIndividuals
      WHERE CrossoverGroup = 2
    ) AS Group2
    ON Group1.PartnerNo = Group2.PartnerNo
   CROSS APPLY dbo.CrossOver(Group1.BinaryString, Group2.BinaryString) co;
END;
GO

That’s all that’s needed for the GA framework.

The solution

What’s needed for the solution is a loop to run through the generations one by one, a way to decode the individual returned by the Evolve procedure and some housekeeping.

First off, I need to fill the Numbers table.

INSERT INTO dbo.Numbers (Value, BinaryRepresentation)
SELECT Value, dbo.IntegerToBinary(Value)
  FROM (
    SELECT TOP (256) Row_Number() OVER (ORDER BY column_id)-1 AS Value FROM sys.columns
  ) sub;
GO

The setup for the execution loop involves declaring a few variable, a temp table and setting up the initial generation (randomly).

SET NOCOUNT ON
DECLARE @CurrentGeneration SMALLINT = 0;
DECLARE @TotalGenerations SMALLINT = 50;
DECLARE @GenerationSize SMALLINT = 64; -- keep to powers of 2, and everything will work...

-- Create initial generation
-- First the individuals
INSERT INTO Individuals (Generation, IndividualID)
SELECT TOP(@GenerationSize) @CurrentGeneration AS Generation, ROW_NUMBER() OVER (ORDER BY column_id) AS IndividualID
  FROM sys.columns;

-- Now the details
INSERT INTO IndividualDetails (Generation, IndividualID, ObjectNumber, Quantity)
SELECT Generation, IndividualID, ObjectNumber, CEILING(RAND(CHECKSUM(NEWID()))*NumberAvailable/4)
  FROM Individuals
    CROSS JOIN ObjectStatistics;

-- Initial generation statistics
INSERT INTO GenerationStatistics (Generation, MaxFitness, AvgFitness)
SELECT @CurrentGeneration, MAX(Fitness), AVG(Fitness)
FROM Individuals;

-- placeholder table for new generations
CREATE TABLE #NewIndividuals (
  Generation SMALLINT,
  IndividualID SmallINT,
  BinaryString CHAR(64)
);

The processing loop itself is nothing complex. Insert the results of the Evolve procedure into a temp table. From the temp table insert into the Individuals, decode the binary string and insert into IndividualDetails, then delete the lower-fitness individuals so that the population remains at a constant size.

In this case I’ve decided just to run for a fixed number of iterations. Often genetic algorithms would terminate after a certain number of iterations or if the max fitness didn’t improve by more than a certain threshold. For simplicity I’ll leave that out. It’s trivial to implement if anyone wants to experiment further.

WHILE (@CurrentGeneration<@TotalGenerations)
 BEGIN

   SET @CurrentGeneration+=1;

 -- Generate the new generation
   INSERT INTO #NewIndividuals (Generation, IndividualID, BinaryString)
   EXEC Evolve @CurrentGeneration;

 -- store the new individuals
   INSERT INTO Individuals (Generation, IndividualID)
   SELECT Generation, IndividualID
     FROM #NewIndividuals;

 -- chop binary string, insert into details table.
   INSERT INTO IndividualDetails (Generation, IndividualID, ObjectNumber, Quantity)
   SELECT Generation, IndividualID, Allele, Value
     FROM (
       SELECT Generation, IndividualID, Value+1 AS Allele, SUBSTRING(BinaryString, 1 + (Value * 8), 8 ) AS BinaryString
         FROM #NewIndividuals
           CROSS JOIN Numbers
         WHERE Value < 8
     ) sub
       INNER JOIN Numbers ON sub.BinaryString = Numbers.BinaryRepresentation;

 -- delete all but the top (@GenerationSize) fittest individuals. Keep fixed population size.
   DELETE FROM IndividualDetails
     FROM IndividualDetails LEFT OUTER JOIN (
       SELECT TOP (@GenerationSize) Generation, IndividualID
         FROM Individuals
         ORDER BY fitness DESC
     ) Chosen ON IndividualDetails.Generation = Chosen.Generation AND IndividualDetails.IndividualID = Chosen.IndividualID
     WHERE Chosen.IndividualID IS NULL;

   DELETE FROM Individuals WHERE BinaryString = ''; -- only possible when there are no details

 -- Compute current max fitness and current avg fitness
   INSERT INTO GenerationStatistics (Generation, MaxFitness, AvgFitness)
   SELECT @CurrentGeneration, MAX(Fitness), AVG(Fitness)
     FROM Individuals;

 -- cleanup
 TRUNCATE TABLE #NewIndividuals;
 END -- while
DROP TABLE #NewIndividuals

And that’s pretty much all there is to it. All that’s left is to see how well the genetic algorithm did and how good the best solution it found really is.

Please note that this entire solution is driven by random numbers, if you run it you will get different results to what I did.

SELECT TOP (1) i.Generation, i.IndividualID, TotalMass, TotalVolume, TotalValue
  FROM dbo.Individuals AS i
    INNER JOIN (
      SELECT Generation, IndividualID, SUM(Quantity*Value) AS TotalValue,  SUM(Quantity*Mass) AS TotalMass, SUM(Quantity*Volume) AS TotalVolume
        FROM dbo.IndividualDetails d INNER JOIN dbo.ObjectStatistics AS os ON d.ObjectNumber = os.ObjectNumber
        GROUP BY Generation, IndividualID
    ) id ON i.Generation = id.Generation AND i.IndividualID = id.IndividualID
  ORDER BY Fitness DESC;

GA Results

Not half bad.Is it the absolute best? Probably not. Genetic algorithms do not guarantee that they will find the best solution, just that they will find good ones

If we look at how the fitness (total value) improved over the generations it shows a massive initial increase in fitness then slow refinements. I could have run only 10 generations and still had a reasonably good result.

GA Graph

I think that’s enough playing with AI for now. While I doubt there’s any real practical use, I hope it was at least entertaining.

Full code: GeneticAlgorithms

7 Comments

  1. jonmcrawford

    This may be a dumb question, but not knowing anything about Genetic Algorithms, could you use one (perhaps not in SQL) to perform predictive modeling, by establishing the first few (or more) generations off of observed history, say in medical data across a group of patients, then having calculated mutations based on historical trends introduced into future generations? Might be kind of counter-intuitive to what it’s supposed to do, not sure.

    Reply
    1. Gail (Post author)

      You probably could, the tricky part there will be the fitness function. What do you test the individuals against. From things my university supervisor has discussed, that sounds like PhD level work.

      Reply
  2. James

    Predictive modelling is a common use of genetic algorithms. I myself have written a similarly inappropriate genetic algorithm code in SAS to do exactly that, but I don’t have the code anymore. I would suggest prototyping an approach using one of the packages in R, such as genaud, to see if you are getting the results you need. Fitness functions are relatively straightforward in that they are the usual measuer of predictiveness (eg classification rate, mean square error etc).

    Reply
  3. Pingback: Indexing Just Got Easy | SQL RNNR

  4. Ben

    Hi, Gail. I’m Ben, your Genetic algo sql code is awesome, and could you taught me how to use this code to solve the “flow job shop scheduling problem” ? Thanks a lot and have a nice day

    Reply
    1. Gail (Post author)

      No, that’s not something that blog comments are appropriate for. Do your research, and if you need help ask on an appropriate forum.

      Also, “As the title states, this is not really an appropriate use of SQL Server. Genetic algorithms are generally implemented in languages like Java, C++, C#, etc; languages that are good at complex mathematics, string manipulation and have complex data types.”

      So please don’t try implementing anything in a real system with this example. Use a proper GA library for whatever language you program in.

      Reply
  5. Pingback: Shop Talk: 2023-03-13 – Shop Talk with TriPASS

Leave a Comment

Your email address will not be published. Required fields are marked *

This site uses Akismet to reduce spam. Learn how your comment data is processed.