---
title: "How Many 14-digit Numbers are Sensible Timestamps?"
author: Bob Rubbens
publish_timestamp: 2025-08-31T22:07:24+01:00
state: published
template: post.mako
id: 0c0b0b84-d0f3-4a4a-bbeb-d3735acb9a10
# bibliography: refs.bib
# reference-section-title: References
# link-citations: true
---

Consider the following two numbers:

```
20240510235959
12345678901234
```

One of them is a proper timestamp, and the other just a 14-digit number:

```
2024-05-10 23:59:59
1234-56-78 90:12:34
```

Imagine a situation where you are recovering creation dates from a set of filenames. You know some filenames contain a timestamp in the pattern above, and write the following regex to detect those filenames:

```
[0-9]{4}-[0-9]{2}-[0-9]{2} [0-9]{2}:[0-9]{2}:[0-9]{2}
```

The regex is not perfect, as there is a risk of false positives: some filenames that match the regular expression might never have been intended to contain a timestamp. For the sake of this blog post, let's ignore the fact that the formatting characters, such as `:` and `-`, take 99% of the doubt away, and look only at the 14 digits. What's the risk of mistakenly identifying a filename as a timestamp?

We'll first define what it means for filenames to be timestamped, and then calculate the probability of getting it wrong.

Warning: I'm not a statistics geek, so if you're prone to getting pedantic about statistics, this might not be the blog post for you... If you want to read more about statistics as used in this blog post, chapter five of [Online Statistics Education: A Multimedia Course of Study](http://onlinestatbook.com/) is a good place to start.

# Plain and timestamped filenames

Let's first make clear what I mean by mistakenly identifying a filename. I define a "filenumber" to be a sequence of 14 digits. We assume that from each filename a filenumber can be extracted, as done above with e.g. a regex. Each filenumber has an "intention", determined when the file was first created. The intention is either `timestamp` when it refers to an actual time, or `plain` when it's just intended for identification of the file.

We call a file, filename or filenumber with intention `timestamp` a timestamped file/filename/filenumber, and likewise one with intention `plain` a plain file/filename/filenumber.

For example, when [Signal](https://signal.org) exports a picture, it names it something like `signal-2025-08-29-22-51-40-350.jpg`. This name contains the filenumber `20250829225140`. As this is a timestamp of the moment I exported the picture, this is a timestamped file. Conversely, a hypothetical app could also generate a random 14-digit number with the purpose of just identifying the picture. In that case, it's a plain file.
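
As a rough illustration of how such a filenumber could be pulled out of a filename, here is a minimal Python sketch. The helper name `extract_filenumber` and the choice to allow at most one non-digit separator between the digit groups are assumptions made for this example, not something Signal or any other app prescribes.

```python
import re
from typing import Optional

# Six digit groups (year, month, day, hour, minute, second), optionally
# separated by a single non-digit character such as "-", ":", "_" or a space.
FILENUMBER_RE = re.compile(
    r"([0-9]{4})[^0-9]?([0-9]{2})[^0-9]?([0-9]{2})"
    r"[^0-9]?([0-9]{2})[^0-9]?([0-9]{2})[^0-9]?([0-9]{2})"
)


def extract_filenumber(filename: str) -> Optional[str]:
    """Return the first 14-digit filenumber found in `filename`, or None."""
    match = FILENUMBER_RE.search(filename)
    if match is None:
        return None
    # Drop the separators and glue the six digit groups together.
    return "".join(match.groups())


print(extract_filenumber("signal-2025-08-29-22-51-40-350.jpg"))
```

Note that this only extracts the digits; it says nothing yet about whether they were intended as a timestamp.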

Note that an intention requires both a *file* and a *filenumber*. Given *only* a filenumber, it is impossible to decide what the intention is. In other words, the same filenumber might carry both intentions, depending on which file it belongs to.

With an intention detection technique, the intention of a filenumber can be guessed. For example, a detection technique $G$ (for $G$uess) might make the following prediction:

$$
G(\texttt{19981123090000}) = \mathtt{timestamp}
$$

Here, $G$ takes as an input the filenumber, and outputs either `plain` or `timestamp`. Hypothetically $G$ could take more inputs, such as the file, or metadata from the file, which can be used to make the detection more accurate. In this blog post, we only consider the filenumber as an input.

# How inaccurate is $G_b$?

The problem is that $G$ can be wrong. For example, here's the $G$ I used the other day to classify my photos. Let's call it $G_b$:

$$
G_b(n)\equiv
\begin{cases}
\mathtt{timestamp}, & \text{if } \texttt{1970-01-01 00:00} \leq n \leq \texttt{2025-08-15 18:30} \\
\mathtt{plain}, & \text{otherwise}
\end{cases}
$$

Essentially, whenever the filenumber $n$ looks like a timestamp and falls in a reasonable range, classify it as a timestamp.
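
To make this concrete, here is a minimal Python sketch of a classifier in the spirit of $G_b$. The function name `g_b` and the use of `datetime.strptime` are assumptions for illustration. Note that `strptime` also rejects impossible dates such as February 31st, which is slightly stricter than the rough counting done further on in this post.

```python
from datetime import datetime

# Bounds taken from the definition of G_b above.
LOWER = datetime(1970, 1, 1, 0, 0)
UPPER = datetime(2025, 8, 15, 18, 30)


def g_b(filenumber: str) -> str:
    """Guess the intention of a 14-digit filenumber: 'timestamp' or 'plain'."""
    if len(filenumber) != 14 or not (filenumber.isascii() and filenumber.isdigit()):
        return "plain"
    try:
        # Read the digits as YYYYMMDDhhmmss; invalid dates raise ValueError.
        moment = datetime.strptime(filenumber, "%Y%m%d%H%M%S")
    except ValueError:
        return "plain"
    return "timestamp" if LOWER <= moment <= UPPER else "plain"


print(g_b("20240510235959"))
print(g_b("12345678901234"))
```

The first call falls inside the range and is classified as `timestamp`; the second does not parse as a date at all and is classified as `plain`.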

Now here's the problem. Assuming my filenumbers are a mix of plain and timestamped filenumbers, and that plain filenumbers are uniformly distributed, what are the chances my $G_b$ will classify one or more as `timestamp`, when they are in fact `plain`? 

# Average of misclassified files

There are a few assumptions we can make to keep the analysis tractable:

- **Ranges:** Every timestamp must lie somewhere between 1970 inclusive and 2026 exclusive. This is not generally true, but in my case the files were photos, and we're about halfway through 2025, which makes this a sensible assumption.
- **Uniform distribution:** We assume that plain filenumbers are uniformly distributed over all possible 14-digit numbers. That is, each filenumber has the same chance to be used as any other filenumber. An app that uses a non-uniform distribution for ID generation would be non-standard, to say the least.
- **Imprecision:** This is just a calculation done for fun, so I don't need to take into account leap days, seconds, etc. I'll settle for an approximation of the exact answer.
- **Population sizes:** For the sake of the example, let's say my collection has 4000 files, 3000 of which are timestamped, and 1000 of which are plain.

Let's first determine the fraction of filenumbers that look like timestamps:

- Number of filenumbers: $10^{14}$
- Number of valid timestamps given the above assumptions: $56 \times 12 \times 31 \times 24 \times 60 \times 60 = 1799884800$. Here, we count all timestamps until 2026 *exclusive*, and pretend every month has 31 days.
- Fraction of valid timestamps in the set of filenumbers: $1799884800 \div 10^{14} = 0.000017998848$. That's a small number; let's call this fraction $\alpha$.
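
The same arithmetic can be double-checked with a few lines of Python. This is only a sketch of the counting above; the variable names are made up for the example, and it keeps the simplification that every month has 31 days.

```python
# Count of "sensible" timestamps: 56 years (1970 up to and including 2025),
# with every month simplified to 31 days.
years = 2026 - 1970
valid_timestamps = years * 12 * 31 * 24 * 60 * 60

# All possible sequences of 14 digits.
all_filenumbers = 10**14

# Fraction of filenumbers that look like a sensible timestamp.
alpha = valid_timestamps / all_filenumbers

print(valid_timestamps)  # 1799884800
print(alpha)
```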

Given $\alpha$, and the fact that `plain` filenumbers are uniformly distributed, it's easy to calculate how many files $G_b$ will get wrong on average. We'll just multiply the number of `plain` files by $\alpha$:

$$
\alpha \times 1000 = 0.017998848
$$

So, on average, we'll get fewer than 1 file wrong! While that sounds nice, ultimately it doesn't say much. Intuitively speaking, maybe there'll be an unlucky streak and five filenumbers end up in the sensible timestamp range. Is there not a more robust way to get an indicator of how dangerous $G_b$ is?

# Chances of getting one or more wrong

There is! I actually want to know the following:

$$
\text{``chance that one or more filenumbers are misclassified as timestamped''}
$$

Unfortunately, we can't (yet) put plain English into a calculator. Let's use a basic statistics trick to invert probabilities and reduce the formula to something we can actually calculate. The trick is: if something happens with chance $p$, then the chance of the thing *not* happening is $1 - p$. So, the chance that we wrongly classify one or more plain filenames as timestamped can also be written as:

$$
1 - \text{``chance that no filenumbers are misclassified as timestamped''}
$$

We'll have to further unpack $\text{``chance that no filenumbers are misclassified as timestamped''}$ by considering $G_b$. Essentially, $G_b$ does detection by checking if the filenumber falls inside a certain range. Therefore, the chance that no plain files are misclassified is equal to the chance that all plain filenumbers fall outside of that range *to begin with.*

That sounds tricky; let's first calculate the chance that a single plain filenumber lies outside the range of timestamps. This we can calculate using $\alpha$, the chance that a plain filenumber is classified as timestamped. Using the probability inversion trick, the chance that one particular filenumber is *outside* the range of timestamps is $1 - \alpha$. Generalizing this to all 1000 plain files, and assuming their filenumbers are independent of each other, the chance that all of them lie outside the range of timestamps is $(1-\alpha)^{1000}$.

We now have enough to compute $\text{``chance that one or more filenumbers are misclassified as timestamped''}$, using, again, the probability inversion trick. If the chance that all plain filenumbers lie outside of the range of timestamps is $(1-\alpha)^{1000}$, then the chance that one or more lie inside the range of timestamps is:

$$
1 - (1 - \alpha)^{1000}
$$

To actually calculate this, you need a *very* precise calculator. If the decimal component is not handled correctly while computing $(1 - \alpha)^{1000}$, the result will be meaningless. Case in point: my [Samsung tablet](https://www.samsung.com/nl/tablets/galaxy-tab-a/galaxy-tab-a9-plus-graphite-128gb-sm-x210rzareub/) cannot go beyond a precision of 10 decimals. Thankfully, my trusty built-in Android calculator app actually [has very good precision](https://chadnauseam.com/coding/random/calculator-app), so we'll just use that.

Here we go:

$$
1 - (1 - \alpha)^{1000} = 0.0178\dotsc
$$

So that makes about a $1.78\%$ chance of misclassifying one or more filenumbers as `timestamp`. That's actually not as safe as I thought! A chance below $0.1\%$ would've given me a safe "gut feeling". Luckily no lives depend on the classification of these filenames so I'm not too worried 🙂.[^1]
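
The result can also be cross-checked in Python. The sketch below is one possible way to do it, not the calculation behind the numbers above. It computes the probability once with exact rational arithmetic (`fractions.Fraction`) and once with the floating-point functions `math.log1p` and `math.expm1`, which are designed to stay accurate for values close to zero. The names `n_plain`, `exact` and `approx` are just illustrative.

```python
import math
from fractions import Fraction

n_plain = 1000
alpha = Fraction(1799884800, 10**14)

# Exact rational arithmetic: no rounding happens until the final float().
exact = float(1 - (1 - alpha) ** n_plain)

# Floating-point variant: (1 - a)^n = exp(n * log(1 - a)), using log1p and
# expm1 to avoid losing digits when a is tiny.
approx = -math.expm1(n_plain * math.log1p(-float(alpha)))

print(f"{exact:.6f}")
print(f"{approx:.6f}")
```

Both variants should agree with each other and with the $0.0178\dotsc$ found above.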

[^1]: The final probability and $\alpha$ are a bit too eerily similar for my taste, so I also did the calculation in [a spreadsheet](./calculation.ods) just to be sure. The first 12 digits of the result of the Android calculator and of the spreadsheet are the same. Based on this, I think it's unlikely there are imprecision shenanigans messing up the final result. The resemblance has a simple explanation: for small $\alpha$, $1 - (1 - \alpha)^{n} \approx n\alpha$, and because I chose $n = 1000$ for the `plain` filenames, $n\alpha$ is just $\alpha$ with the decimal point shifted three places. Sketchy-looking, but AFAICT the math checks out.