Consider the following two numbers:
20240510235959
12345678901234
One of them is a proper timestamp, and the other just a 14-digit number:
2024-05-10 23:59:59
1234-56-78 90:12:34
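The difference between the two numbers can also be checked mechanically. Below is a minimal Python sketch; the helper name `is_valid_timestamp` is made up for illustration, and it assumes the 14 digits are laid out as year, month, day, hour, minute, second.

```python
from datetime import datetime


def is_valid_timestamp(filenumber: str) -> bool:
    """Return True if the 14 digits form a real calendar date and time."""
    if len(filenumber) != 14 or not filenumber.isdigit():
        return False
    try:
        # Assumed layout: YYYYmmddHHMMSS
        datetime.strptime(filenumber, "%Y%m%d%H%M%S")
    except ValueError:
        # e.g. month 56 or hour 90 does not exist
        return False
    return True


print(is_valid_timestamp("20240510235959"))  # True
print(is_valid_timestamp("12345678901234"))  # False
```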
Imagine a situation where you are recovering the creation date from a set of filenames. You know some filenames contain a timestamp in the pattern above, and write the following regex to detect those filenames:
[0-9]{4}-[0-9]{2}-[0-9]{2} [0-9]{2}:[0-9]{2}:[0-9]{2}
The regex is not perfect, as there is a risk of false positives:
some filenames that were never intended as timestamps might still
match the regular expression. For the sake of this blog post, let’s
ignore the fact that the formatting characters, such as
: and -, take away 99% of the doubt about whether a
filename contains a timestamp. What’s the risk of mistakenly
identifying a filename as a timestamp?
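To make the false-positive risk concrete, here is a small Python sketch that runs the regex (with two digits for every time field) over a few filenames. The filenames are invented for illustration only.

```python
import re

# The pattern from the text: YYYY-MM-DD HH:MM:SS, digits only, no range checks.
TIMESTAMP_RE = re.compile(
    r"[0-9]{4}-[0-9]{2}-[0-9]{2} [0-9]{2}:[0-9]{2}:[0-9]{2}"
)

# Hypothetical filenames, not taken from any real data set.
filenames = [
    "IMG 2024-05-10 23:59:59.jpg",
    "scan 1234-56-78 90:12:34.png",
    "notes.txt",
]

matches = [name for name in filenames if TIMESTAMP_RE.search(name)]
print(matches)
```

The second filename matches even though month 56 and hour 90 do not exist, which is exactly the kind of false positive the regex cannot rule out.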
We’ll first define what it means for filenames to be timestamped, and then calculate the probability of getting it wrong.
Warning: I’m not a statistics geek, so if you’re prone to getting pedantic about statistics, this might not be the blog post for you… If you want to read more about statistics as used in this blog post, chapter five of Online Statistics Education: A Multimedia Course of Study is a good place to start.
Let’s first make clear what I mean by mistakenly identifying a
filename. I define a “filenumber” to be a sequence of 14 digits. We
assume that from each filename a filenumber can be extracted, as done
above with e.g. a regex. Each filenumber has an “intention” when the
file was first created. The intention is either timestamp
when it refers to an actual time, or plain when it’s just
intended for identification of the file.
We call a file, filename or filenumber with intention
timestamp a timestamped file/filename/filenumber, and
likewise for plain.
For example, when Signal exports a
picture, it names it something like
signal-2025-08-29-22-51-40-350.jpg. This name contains the
filenumber 20250829225140. As this is a timestamp of the
moment I exported the picture, this is a timestamped file. Conversely, a
hypothetical app could also generate a random 14-digit number with the
purpose of just identifying the picture. In that case, it’s a plain
file.
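Extracting the filenumber from such a name could look roughly like the following Python sketch. The function name `extract_filenumber` and the regex are assumptions made for this example; it takes the first six dash-separated digit groups and ignores the trailing milliseconds.

```python
import re
from typing import Optional

# Six dash-separated groups: year, month, day, hour, minute, second.
DASHED_RE = re.compile(
    r"([0-9]{4})-([0-9]{2})-([0-9]{2})-([0-9]{2})-([0-9]{2})-([0-9]{2})"
)


def extract_filenumber(filename: str) -> Optional[str]:
    """Return the 14-digit filenumber, or None if the name has none."""
    match = DASHED_RE.search(filename)
    if match is None:
        return None
    # Glue the six groups together into one 14-digit string.
    return "".join(match.groups())


print(extract_filenumber("signal-2025-08-29-22-51-40-350.jpg"))  # 20250829225140
```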
Note that an intention requires both a file and a filenumber. Given only a filenumber, it is impossible to decide what the intention is. In other words, the same filenumber can carry two different intentions, depending on which file it belongs to.
With an intention detection technique, the intention of a filenumber can be guessed. For example, a detection technique $G$ (for Guess) might make the following predictions:
Here, $G$ takes as an input the filenumber, and outputs either plain
or timestamp. Hypothetically, $G$
could take more inputs, such as the file itself or its metadata,
which could be used to make the detection more accurate. In this blog
post, we only consider the filenumber as an input.
The problem is that a detection technique can be wrong. For example, here’s the detector I used the other day to classify my photos:
Essentially, whenever the filenumber looks like a timestamp and falls in a reasonable range, classify it as a timestamp.
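A detector of this kind could be sketched in Python as follows. This is not the exact detector used for the photos: the function name `guess` and the bounds `EARLIEST` and `LATEST` are assumptions chosen purely for illustration.

```python
from datetime import datetime

# Assumed "reasonable range"; pick whatever fits your own files.
EARLIEST = datetime(2000, 1, 1)
LATEST = datetime(2030, 1, 1)


def guess(filenumber: str) -> str:
    """Classify a 14-digit filenumber as 'timestamp' or 'plain'."""
    if len(filenumber) != 14 or not filenumber.isdigit():
        return "plain"
    try:
        moment = datetime.strptime(filenumber, "%Y%m%d%H%M%S")
    except ValueError:
        # Does not even look like a date and time.
        return "plain"
    # Looks like a timestamp; accept it only inside the assumed range.
    return "timestamp" if EARLIEST <= moment < LATEST else "plain"


print(guess("20240510235959"))  # timestamp
print(guess("12345678901234"))  # plain
print(guess("12340510235959"))  # plain (valid date, but year 1234 is out of range)
```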
Now here’s the problem. Assuming my filenumbers are a mix of plain
and timestamped filenumbers, and that plain filenumbers are uniformly
distributed, what are the chances that my detector
will classify one or more of them as timestamp when they are in
fact plain?
There are a few assumptions we can make to make the analysis more robust:
Let’s first determine the fraction of filenumbers that look like timestamps:
Given this fraction, and the fact that plain filenumbers are uniformly
distributed, it’s easy to calculate how many filenumbers my detector
will get wrong on average. We’ll just multiply the number of plain
files by this fraction:
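As a rough illustration of this multiplication, here is a Python sketch with made-up numbers. The date range and the count of 10,000 plain files are assumptions, not the values behind this post. Every second inside the range corresponds to exactly one 14-digit filenumber, and there are 10^14 possible filenumbers in total.

```python
from datetime import datetime

# Assumed range of "sensible" timestamps (hypothetical).
EARLIEST = datetime(2000, 1, 1)
LATEST = datetime(2030, 1, 1)

# All possible sequences of 14 digits.
TOTAL_FILENUMBERS = 10**14

# One valid filenumber per second in the range.
seconds_in_range = int((LATEST - EARLIEST).total_seconds())
fraction = seconds_in_range / TOTAL_FILENUMBERS

plain_files = 10_000  # hypothetical number of plain files
expected_wrong = plain_files * fraction

print(f"fraction that looks like a timestamp: {fraction:.3e}")
print(f"expected misclassified plain files:   {expected_wrong:.4f}")
```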
So, on average, we’ll get fewer than one file wrong! While that sounds nice, ultimately it doesn’t say much. Intuitively speaking, maybe there’ll be an unlucky streak and five filenumbers end up in the sensible timestamp range. Is there not a more robust way to get an indicator of how dangerous my detector is?
There is! I actually want to know the following: what is the chance that one or more plain filenumbers are wrongly classified as timestamped?
Unfortunately, we can’t (yet) put plain English into a calculator. Let’s use a basic statistics trick to invert probabilities and reduce the question to something we can actually calculate. The trick is: if something happens with chance $p$, then the chance of the thing not happening is $1 - p$. So, the chance that we wrongly classify one or more plain filenumbers as timestamped can also be written as:
$$P(\text{one or more misclassified}) = 1 - P(\text{none misclassified})$$
We’ll have to further unpack the chance that none are misclassified by considering how my detector works. Essentially, it does detection by checking if the filenumber falls inside a certain range. Therefore, the chance that no plain files are misclassified is equal to the chance that all plain filenumbers fall outside of that range to begin with.
That sounds tricky; let’s first calculate the chance that a single plain filenumber lies outside the range of timestamps. This we can calculate using the fraction of filenumbers that look like timestamps, written $f$ here, which is also the chance that a plain filenumber is classified as timestamped. Using the probability inversion trick, the chance that one particular filenumber is outside the range of timestamps is $1 - f$. Generalizing this to all $n$ plain files, and treating their filenumbers as independent, the chance that all of them lie outside the range of timestamps is $(1 - f)^n$.
We now have enough to define the chance we are after, using, again, the probability inversion trick. If the chance that all plain filenumbers lie outside of the range of timestamps is $(1 - f)^n$, then the chance that one or more lie inside the range of timestamps is:
$$1 - (1 - f)^n$$
To actually calculate this, you need a very precise calculator. If the decimal component is not handled correctly while computing the power, the result will be meaningless. Thankfully, my trusty built-in Android calculator app actually has very good precision, so we’ll just use that. For comparison, my Samsung tablet cannot go beyond a precision of 10 decimals.
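If no high-precision calculator is at hand, the same quantity (one minus the chance that every plain filenumber falls outside the timestamp range) can be computed in Python without losing the small decimals. The sketch below uses `math.log1p` and `math.expm1`, which are designed for values very close to zero, and cross-checks against exact rational arithmetic. The values of `f` and `n` are hypothetical, not the ones from this post.

```python
import math
from fractions import Fraction


def chance_one_or_more(fraction: float, plain_files: int) -> float:
    """Compute 1 - (1 - fraction) ** plain_files in a numerically stable way."""
    # (1 - f)^n = exp(n * log(1 - f)); log1p and expm1 keep the tiny terms.
    return -math.expm1(plain_files * math.log1p(-fraction))


# Hypothetical inputs: the fraction of filenumbers that look like
# timestamps, and the number of plain files.
f = 946_771_200 / 10**14
n = 10_000

stable = chance_one_or_more(f, n)
naive = 1 - (1 - f) ** n  # plain floating point, for comparison
exact = float(1 - (1 - Fraction(946_771_200, 10**14)) ** n)

print(f"stable: {stable:.15f}")
print(f"naive:  {naive:.15f}")
print(f"exact:  {exact:.15f}")
```

For moderate inputs like these the naive floating-point version is usually close enough; the stable form matters most when the fraction is so small that `1 - f` rounds to exactly 1.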
Here we go:
So that makes about a chance of mis-classifying one or more filenumbers as timestamp. That’s actually not as safe as I thought! A chance below would’ve given me a safe “gut feeling”. Luckily no lives depend on the classification of these filenames so I’m not too worried 🙂.1
License: CC-BY-SA.