A few days ago Stijn Cambie gave a nice presentation at the FMT colloquium about the Erdős problem database and the community surrounding it. Of the interesting problems discussed in the talk, one in particular caught my eye: problem #307 (EP307). Not because I thought this young and naive computer science PhD could make progress on it, but because I was curious how badly (or how well!) genetic programming (GP) would be able to solve it.
In particular, GP seemed like a fun fit here because there is a concise way to encode a candidate solution for EP307 into a genome, i.e. a list of natural numbers. The script was fun to write, but in the end the results were not very impressive. To my amateur mathematician’s eyes, it got pretty close to a solution, but I can imagine an expert might still judge the result to be wayyy off.
Skip to the end if you just want to run the code. The script is made using only Python 3.8, so it should run just about anywhere. Otherwise, read on for more about EP307, genetic programming, and some observations about the script’s results.
Paul Erdős is a famous mathematician from the previous century. Most people know him from the concept of the Erdős number, which is defined as the number of coauthors between you and Erdős, where two authors are connected if they wrote a paper together. I’ve been told my Erdős number is 4, but I’ve also been told the average is between 4 and 5. Do with that information what you will :-).
Among mathematicians, Erdős has a reputation for posing and solving many interesting mathematical problems. New Erdős problems are discovered every day, and added to the Erdős problem database, as mathematicians read papers from Erdős. This post is about Erdős problem #307 (EP307), which is the following.
Do there exist two finite sets of primes P and Q, such that

$$\left(\sum_{p \in P} \frac{1}{p}\right) \left(\sum_{q \in Q} \frac{1}{q}\right) = 1?$$

So basically: sum all the reciprocals of the elements of P, do the same for Q, and multiply those results. If you can get to 1 for some P and Q, you win. Sounds easy enough! Let’s try to find P and Q with GP. Spoiler alert: it won’t work. But it’s a fun exercise!
Genetic programming seeks to find solutions to problems by applying the principle of evolution. The idea is to start with a population of candidate solutions, and then to keep evolving the population, throwing out unpromising individuals every once in a while, until an individual appears that solves the problem: then you’re done!
Each individual has a genome, which is usually a sequence of integers, but this need not be the case. Then, given two individuals, you can make two new individuals by splicing their genomes. Splicing works by picking a point somewhere in both genomes, cutting the two genomes there, and then swapping the halves. That sounds confusing, so here is a picture:
That’s the essence of it. Of course, you can come up with many ways to combine genomes. For example, another option is to exchange a single number between the two genomes, creating two new genomes that are the same as their parents, except at one index.
Another feature that GP algorithms often have is the mutation operation. This takes only one genome and mutates it in some way. For example, a mutation might add or remove numbers in the genome, or increment or decrement some of them.
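To make the splicing and mutation operations concrete, here is a sketch (the function names and the exact set of mutations are my own, not necessarily the script’s):

```python
import random

def splice(a, b):
    """One-point crossover: cut both genomes at the same point and swap the tails."""
    cut = random.randint(1, min(len(a), len(b)) - 1)  # valid cut point for both
    return a[:cut] + b[cut:], b[:cut] + a[cut:]

def mutate(genome):
    """Bump one number up or down, insert a fresh number, or remove one."""
    genome = list(genome)
    i = random.randrange(len(genome))
    op = random.choice(["bump", "insert", "remove"])
    if op == "bump":
        genome[i] += random.choice([-1, 1])
    elif op == "insert":
        genome.insert(i, random.randint(0, 9))
    elif len(genome) > 1:  # "remove", but keep at least one element
        del genome[i]
    return genome
```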
The nice thing about GP is that, to use it, you only need to provide an encoding of your problem into GP. Then, because the evolutionary mechanism of splicing, mutation, and natural selection is so general, the algorithm finds a good solution entirely on its own. This encoding requires the following:

- a translation between candidate solutions and genomes;
- a function that checks whether a genome is valid (“ok”);
- a fitness function that scores how close a genome is to solving the problem.
Putting all of this together, you get the following pseudocode:
population = generate_ok_individuals(150)
while fitness(max(population, key=fitness)) < goal_fitness:
    A = random.choice(population)
    B = random.choice(population)
    children = splice(A, B)
    for child in children:
        if is_ok(child): population.append(child)
    # Maintain a stable population size
    while len(population) > 150:
        population.remove(min(population, key=fitness))

For some problems the above pseudocode might already suffice, but if not, there are many ways to extend it. That’s the beauty of GP: it’s a general framework that you can easily extend with your own ideas. You can come up with new ways to combine and mutate genomes, add probabilities to them, and this way tune your algorithm to be more or less noisy. Maybe you want to cull a large part of your population every generation. Or perhaps, when the search gets stuck, you take the 10 most promising individuals and restart the search from those. One cool extension is to splice genomes together with a probability proportional to their relative fitness, which is somewhat resource intensive to do. I also like the idea of tournament selection, where you hold small tournaments in which genomes compete to reproduce, though that is also resource intensive.
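As a tiny sketch of one of those extensions, tournament selection might look like this (a minimal version of my own, not taken from the script):

```python
import random

def tournament_select(population, fitness, k=5):
    """Pick k random individuals and let the fittest of them reproduce."""
    contestants = random.sample(population, k)
    return max(contestants, key=fitness)
```

Larger k makes selection greedier; k=1 degenerates into picking uniformly at random.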
That’s actually also the fun practical part of GP: making it all fast. For example, by memoizing the results of the fitness function, using clever data structures so the best and worst individuals can be removed from the population cheaply, and coming up with an encoding that encodes just the right amount of information into a genome.
There’s only one requirement (as far as I know; be sure to look into the literature if you’re curious about this stuff) for GP to be effective on a problem: the solution space needs to be kind of continuous. By continuous, I mean that, for most candidate solutions, it should be easy to make them a bit better by applying some small changes. You can imagine that if you need to change a lot about a candidate solution to improve it, then it’s unlikely that splicing and mutation will yield genomes with higher fitness. Luckily, lots of problems have this property, or can be rephrased in continuous terms, so there is lots of literature to read up on if you’re interested in using this technique 😄.
To get a better idea of what the encoding into GP can look like, let’s have a look at encoding EP307 into GP.
To complete the encoding of EP307 for GP, we need three things: a back-and-forth translation between candidate solutions and genomes, a definition of when a genome is “ok”, and a fitness function.
The back-and-forth translation needs two qualities. First, as said before, it needs to turn two sets into a single list of numbers. Second, it also needs to be robust against the splicing and mutation operations that the GP algorithm will do to the genomes. Ideally, the translation should be robust enough that randomly splicing two genomes together has a very high probability of resulting in a new, valid, genome.
We can use the following small insight: you can encode any two lists of integers in a single list of integers, as long as you know where to split the list to get the original lists back. We’ll call this special index the split index. For example, the lists [2, 7, 5] and [13, 17] are encoded in the list [2, 7, 5, 13, 17], because we know that splitting after the split index (2) gives back the original lists. Then, if we prepend the split index to the beginning of the list, we’re done. This way, the list [2, 2, 7, 5, 13, 17] encodes exactly the candidate solution P = {2, 7, 5}, Q = {13, 17}.
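In code, this encoding is tiny (a sketch: the function name is mine, and the split-index convention, the index of the first list’s last element, is read off the example above):

```python
def encode(p_list, q_list):
    """Concatenate two lists into one genome, prepending the split index.

    The split index is the position of the last element of the first list,
    so splitting just after that index recovers both lists.
    """
    return [len(p_list) - 1] + p_list + q_list
```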
We still need to adapt this encoding a bit, because it can break very easily. For example, consider again the genome from earlier, [2, 2, 7, 5, 13, 17]. What if a mutation adds a 100 to the front, resulting in the genome [100, 2, 2, 7, 5, 13, 17]? If we try to split the rest of the list at index 100… that doesn’t work out! Another problem is that the numbers in the list might not remain primes. For example, if a mutation causes the 7 in the genome to be incremented to 8, the resulting P contains the number 8, which means P is no longer a set of just primes.
The first problem we solve by always taking the split index modulo the length of the remaining list. So in the case of the genome [100, 2, 2, 7, 5, 13, 17], the split index would be 100 mod 6 = 4.
The second problem is solved by not having primes directly in the genome, but having indices of primes. In other words, we start with a list of primes, say, the first 1000. Then we represent the primes in P and Q by the indices of these primes in that list. Applying this indirection to [2, 2, 7, 5, 13, 17], we get: [2, 0, 3, 2, 5, 6].
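Putting the modulo trick and the prime-index indirection together, a decoder could look like this (a sketch: the names, the short prime table, and the extra modulo guard on prime indices are my own, not necessarily the script’s):

```python
PRIMES = [2, 3, 5, 7, 11, 13, 17, 19]  # the real script carries the first 1000

def decode(genome):
    """Turn a genome into the two candidate prime sets P and Q."""
    # The first number is the split index, taken modulo the remaining
    # length so mutations can never push it out of range.
    split = genome[0] % (len(genome) - 1)
    rest = genome[1:]
    # The genome stores *indices* into PRIMES rather than the primes
    # themselves, so mutated numbers still map to primes. (Wrapping the
    # indices modulo len(PRIMES) is my own guard against overflow.)
    p = {PRIMES[i % len(PRIMES)] for i in rest[:split + 1]}
    q = {PRIMES[i % len(PRIMES)] for i in rest[split + 1:]}
    return p, q
```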
Now, there’s still a problem. Consider the genome [1, 0, 1], which represents P = {2, 3} and Q = {}, i.e. an empty Q. When P or Q is empty, the final product described by EP307 will never be 1 (an empty sum is 0), so we’d like to exclude those solutions from the population. We’ll do this by adding this constraint to the definition of when a genome is “ok”.
When a genome is “ok”, it is valid, i.e. safe to put in the population. In other words, an ok genome does not violate the constraints imposed by the problem we’re trying to solve with GP. What it means to be “ok” is implemented by the is_ok function, which returns True if a given genome is “ok”.
Ideally, we’d rather not have to check if each new genome we produce is ok. You might even be able to come up with an encoding that doesn’t need it. However, in the interest of time, I decided the encoding I had in mind was good enough, and patched up the remaining holes with is_ok.
The first constraint we put in is_ok is that the P and Q derived from the genome must be non-empty. That excludes all nonsensical candidate solutions.
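A minimal is_ok with just that non-emptiness constraint might look like this (a sketch with my own names; it carries its own little decoder so it runs standalone, and further constraints would slot in as extra checks):

```python
PRIMES = [2, 3, 5, 7, 11, 13, 17, 19]  # the real script uses the first 1000

def decode(genome):
    """Split-index decoding as described above: indices into PRIMES."""
    split = genome[0] % (len(genome) - 1)
    rest = genome[1:]
    return ({PRIMES[i % len(PRIMES)] for i in rest[:split + 1]},
            {PRIMES[i % len(PRIMES)] for i in rest[split + 1:]})

def is_ok(genome):
    """True if the genome doesn't violate EP307's constraints."""
    if len(genome) < 2:  # need a split index plus at least one prime index
        return False
    p, q = decode(genome)
    # Both sets must be non-empty: an empty sum makes the product 0, never 1.
    # (Stijn's extra constraints would go here as further checks.)
    return bool(p) and bool(q)
```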
What’s nice about is_ok is that you can use it to further optimize the GP process. For example, Stijn observed that a possible solution of EP307 will also have the following properties:

By putting these constraints in is_ok, we make sure that we don’t get individuals in the population that have no chance at ever becoming a solution.
There’s a hidden assumption in adding these constraints to is_ok: we are assuming that we can find a solution to EP307 by randomly splicing together “ok” genomes. But this might be very difficult. It could be the case that by allowing genomes in the population that are actually not ok according to Stijn’s constraints, we make it easier to get to a solution that is in fact ok in terms of Stijn’s constraints.
To phrase it differently, imagine GP as finding a series of stepping stones to the other side of a river. Naturally, you restrict yourself to only the rough stones for your route, because you want to have a safe route. But in a river, rough stones are actually rare, so you might not be able to find a safe route. If you relax your search to also include somewhat slippery stones, you might end up with a dangerous route, but at least you’ll have a route! That’s what we’re doing by adding Stijn’s constraints: we’re only considering the rough stones, and in doing so, assuming there are enough rough stones to get us to the other side.
As I’m just doing this for fun, it doesn’t matter too much. For a more serious application of GP, these questions are important.
Now for the crown jewel of any GP encoding, the fitness function. It is actually a bit boring: given a genome, we just extract the sets P and Q from it, and compute the result as described by EP307, with the reciprocals and summing and all that. Then we take the absolute distance from 1, and voilà. This is easy to compute, and also continuous in the sense that the closer a genome is to a solution, the smaller its fitness. The only tricky part is that, for this fitness function, a smaller fitness is better than a larger fitness. This means the min and max need to be flipped around in the pseudocode from earlier.
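A sketch of such a fitness function, taking the already-decoded sets (the real script presumably works on genomes; using exact fractions instead of floats is my own choice, to avoid rounding noise near 1):

```python
from fractions import Fraction

def fitness(p, q):
    """Distance of the EP307 product from 1; smaller is better, 0 means solved."""
    product = (sum(Fraction(1, x) for x in p)
               * sum(Fraction(1, x) for x in q))
    return abs(product - 1)
```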
And that’s all you need to get started with solving EP307 with genetic programming!
While running my script a bunch of times and eyeballing the output, I noticed some things, which, to my amateur mathematician’s mind, seemed interesting. When I told Stijn about them, he didn’t seem too surprised, so some of these likely follow directly from the definition of EP307.
Limited candidate solution composition: every time I inspected the best candidate solution so far, it included only primes from the first 100 or so primes; I haven’t seen the prime sets go beyond 541. I don’t have a good explanation or even an intuition for this. To be honest, I would’ve expected the reverse, that every once in a while you’d see a large prime floating by. Instead, the solutions my script has been reporting have been pretty compact, that is, with few primes missing in between the smallest and largest prime of the solution.
Taking this one step further: if we assume this is an indication that a proper solution to EP307 will only use small primes, a nice optimization could be to only use the first 128 primes. This means you can use two 64-bit integers as bitsets to represent a candidate solution, which I’m sure can help speed things up (though you might need a language with access to low-level primitives).
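As a sketch of that idea (in Python an int already acts as an arbitrary-width bitset, so imagine the two halves as the two 64-bit words; the helper names are hypothetical):

```python
def to_bitset(prime_indices):
    """Pack a set of prime indices (0..127) into a single integer bitmask."""
    mask = 0
    for i in prime_indices:
        mask |= 1 << i  # set bit i: "prime number i is in the set"
    return mask

def indices(mask):
    """Unpack the bitmask back into the set of prime indices."""
    return {i for i in range(128) if mask >> i & 1}
```

Set union, intersection, and membership then become single bitwise operations, which is where the speedup would come from.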
Quick convergence towards “final” solution: my script almost always converges towards the final solution in less than 1000 generations. By “final” I mean that, after it has reached this solution, it rarely finds a better one even if you leave it running for an hour or so.
For the record, this is the best solution I’ve seen, in terms of the exponent. I’ve never seen a solution with an exponent smaller than -9.
Is genetic programming the right fit for the problem? Maybe. I’m fairly sure it’s not a bad fit, but calling it a good fit is maybe a bit much. Probably some hybrid algorithm that combines [gradient descent] with something [monte carlo] inspired could be faster and find solutions with higher fitness. That is, I think it makes a lot of sense to start with some imperfect solution, and then breadth-first explore the state space around that solution with some randomness sprinkled in. If anyone ends up doing any work in this direction, do let me know :-).
If you want to have a look at the code, fair warning: it’s not polished, but it works. It’s built using only Python 3.8 primitives, and includes the first 1000 primes, absolutely for free! So it should be pretty portable, and it should be pretty fertile ground for more GP experiments and refactoring :-).
You can find it here. Enjoy!
Generated with BYOB.
License: CC-BY-SA.
This page is designed to last.