John D. Cook's Avatar

John D. Cook

@johndcook.mathstodon.xyz.ap.brid.gy

Consultant in applied mathematics and data privacy https://www.johndcook.com [bridged from https://mathstodon.xyz/@johndcook on the fediverse by https://fed.brid.gy/ ]

157
Followers
0
Following
413
Posts
18.11.2024
Joined

Latest posts by John D. Cook @johndcook.mathstodon.xyz.ap.brid.gy

From logistic regression to AI

It is sometimes said that neural networks are “just” logistic regression. (Remember neural networks? LLMs _are_ neural networks, but nobody talks about neural networks anymore.) In some sense a neural network is logistic regression with more parameters, a _lot_ more parameters, but more is different. New phenomena emerge at scale that could not have been anticipated at a smaller scale.

Logistic regression can work surprisingly well on small data sets. One of my clients filed a patent on a simple logistic regression model I created for them. You can’t patent logistic regression—the idea goes back to the 1840s—but you can patent its application to a particular problem. Or at least you can try; I don’t know whether the patent was ever granted.

Some of the clinical trial models that we developed at MD Anderson Cancer Center were built on Bayesian logistic regression. These methods were used to run early phase clinical trials, with dozens of patients. Far from “big data.”

Because we had modest amounts of data, our models could not be very complicated, though we tried. The idea was that informative priors would let you fit more parameters than would otherwise be possible. That idea was partially correct, though it leads to a sensitive dependence on priors. When you don’t have enough data, additional parameters do more harm than good, at least in the classical setting. Over-parameterization is bad in classical models, though it can be good for neural networks.

So for a small data set you commonly have only two parameters. With a larger data set you might have three or four. There is a rule of thumb that you need at least 10 events per variable (EPV) [1]. For example, if you’re looking at an outcome that happens say 20% of the time, you need about 50 data points per parameter. If you’re analyzing a clinical trial with 200 patients, you could fit a four-parameter model.
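The arithmetic behind the EPV rule of thumb is easy to package. The helper below is my own illustration, not code from the post; it just turns the 10-events-per-variable guideline into a parameter budget.

```python
def max_parameters(n_subjects, event_rate, epv=10):
    """Parameter budget under the 10-events-per-variable rule of thumb.

    Hypothetical helper for illustration: expected events divided by
    the required events per variable.
    """
    events = n_subjects * event_rate   # expected number of events
    return int(events // epv)

# 200 patients with a 20% event rate -> 40 events -> a 4-parameter model
print(max_parameters(200, 0.2))
```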
But those four parameters had better pull their weight, and so you typically compute some sort of information criterion—AIC, BIC, DIC, etc.—to judge whether the data justify a particular set of parameters. Statisticians agonize over each parameter because it really matters.

Imagine working in the world of modest-sized data sets, carefully considering one parameter at a time for inclusion in a model, and hearing about people fitting models with millions, and later billions, of parameters. It just sounds insane. And sometimes it _is_ insane [2]. And yet it can work. Not automatically; developing large models is still a bit of a black art. But large models can do amazing things.

How do LLMs compare to logistic regression as far as the ratio of data points to parameters? Various scaling laws have been suggested. These laws have some basis in theory, but they’re largely empirical, not derived from first principles. “Open” AI no longer shares stats on the size of its training data or the number of parameters it uses, but other model developers do, and as a _very_ rough rule of thumb, models are trained using around 100 tokens per parameter, which is not very different from the EPV rule of thumb for logistic regression.

Simply counting tokens and parameters doesn’t tell the full story. In a logistic regression model, data are typically binary variables, or maybe categorical variables drawn from a small number of possibilities. Parameters are floating point values, typically 64 bits, but maybe the parameter values are important to three decimal places, i.e. about 10 bits. In the example above, 200 samples of 4 binary variables determine 4 ten-bit parameters, so 20 bits of data for every bit of parameter. If the inputs were 10-bit numbers, there would be 200 bits of data per parameter.

When training an LLM, a token is typically a 32-bit number, not a binary variable. And a parameter might be a 32-bit number, but quantized to 8 bits for inference [3].
If a model uses 100 tokens per parameter, that corresponds to 400 bits of training data per inference parameter bit. In short, the ratio of data bits to parameter bits is roughly similar between logistic regression and LLMs. I find that surprising, especially because there’s a sort of no man’s land between a handful of parameters and billions of parameters [2].

## Related posts

* Sensitivity in logistic regression
* Experiences with GPT-5 codex
* Prompting peril
* Using classical statistics to avoid regulatory burdens

[1] P Peduzzi, J Concato, E Kemper, T R Holford, A R Feinstein. A simulation study of the number of events per variable in logistic regression analysis. Journal of Clinical Epidemiology 1996 Dec; 49(12):1373-9. doi: 10.1016/s0895-4356(96)00236-3.

[2] A lot of times neural networks don’t scale down to the small data regime well at all. It took a lot of audacity to believe that models would perform disproportionately better with more training data. Classical statistics gives you good reason to expect diminishing returns, not increasing returns.

[3] There has been a lot of work lately to find low precision parameters directly. So you might find 16-bit parameters rather than finding 32-bit parameters and then quantizing to 16 bits.
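The back-of-the-envelope ratios in this post can be checked directly. The numbers below simply restate the post's assumptions (binary inputs, ten-bit logistic regression parameters, 32-bit tokens, 8-bit inference parameters):

```python
# Logistic regression example: 200 samples of 4 one-bit variables,
# determining 4 parameters worth ~10 bits each
logreg_bits_per_param_bit = (200 * 4 * 1) / (4 * 10)

# LLM rule of thumb: ~100 tokens per parameter, 32-bit tokens,
# parameters quantized to 8 bits for inference
llm_bits_per_param_bit = (100 * 32) / 8

print(logreg_bits_per_param_bit, llm_bits_per_param_bit)  # 20.0 400.0
```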

From logistic regression to AI

https://www.johndcook.com/blog/2026/03/04/from-logistic-regression-to-ai/

04.03.2026 15:50 👍 1 🔁 0 💬 0 📌 0
An AI Odyssey, Part 2: Prompting Peril

I was working with a colleague recently on a project involving the OpenAI API. I brought up the idea that it might be possible to improve the accuracy of API responses by modifying the API call to increase the amount of reasoning performed. My colleague quickly asked ChatGPT if this was possible, and the answer came back, “No, it’s not possible to do that.” Then I asked essentially the same question to my own instance of ChatGPT, and the answer was, “Yes, you can do it, but you need to use the OpenAI Responses API.”

How did we get such different answers? Was it the wording of the prompt? Was it the custom instructions given in the account personalization, where you describe who you are and how you want ChatGPT to respond? Was it different conversation history? Many factors could have contributed to the different responses. Unfortunately, many of these factors are either not easily controllable at the user level or not convenient to vary in a protracted trial-and-error search.

I’ve had other times when I will first get a highly standardized, generic answer from ChatGPT, even in Thinking mode, that I know is not quite right or just seems off. Then when I push back, I may get a profoundly different answer.

It’s simply a fact that large language models are conditional probabilistic systems that do not guarantee reproducibility in practice, even given the same inputs, even at temperature=0 [1]. Their outputs depend sensitively on prompt wording, context window contents, system instructions, and model configuration. Small differences in these inputs can yield substantially different outputs.

How well an AI chatbot responds can obviously have a massive impact on how effective the tool will be for your use case. Differences in responses could materially affect the outcome of your project.
I take this as a wake-up call to be persistent, vigilant, and flexible in attempts to obtain reliable answers from these new AI tools.

#### Notes

[1] Some sources of nondeterminism: floating point / GPU nondeterminism, differing order of operations from distributed collectives, ties or near-ties in token probabilities, backend/infrastructure changes, model routing, and hidden system prompt or tool availability differences.

Prompting Peril

https://www.johndcook.com/blog/2026/03/04/an-ai-odyssey-part-2-prompting-peril/

Part 2 of Wayne Joubert's series of posts on experience working with AI

04.03.2026 14:19 👍 0 🔁 0 💬 0 📌 0
An AI Odyssey, Part 1: Correctness Conundrum

I recently talked with a contact who repeated what he’d heard regarding agentic AI systems—namely, that they can greatly increase productivity in professional financial management tasks. However, I pointed out that though this is true, these tools do not guarantee correctness, so one has to be very careful letting them manage critical assets such as financial data.

It is widely recognized that AI models, even reasoning models and agentic systems, can make mistakes. One example is a case showing that one of the most recent and capable AI models made multiple factual mistakes in drawing together information for a single slide of a presentation. Sure, people can give examples where agentic systems perform amazing tasks. But it’s another question whether they can always do them reliably.

Unfortunately, we do not yet have procedural frameworks that provide reliability guarantees comparable to those required in other high-stakes engineering domains. Many leading researchers have acknowledged that current AI systems have a degree of technical unpredictability that has not been resolved. For example, one recently said, _**“Anyone who has worked with AI models understands that there is a basic unpredictability to them, that in a purely technical way we have not solved.”**_

#### What industrial-strength reliability looks like

Manufacturing has the notion of Six Sigma quality, which aims to reduce manufacturing defects to an extremely low level. In computing, the correctness requirements are even higher, sometimes necessitating provable correctness. The Pentium FDIV bug in the 1990s caused actual errors in calculations to occur in the wild, even though the chance of error was supposedly “rare.” These were silent errors that might have occurred undetected in mission critical applications, leading to failure.
This being said, the Pentium FDIV error modes were precisely definable, whereas AI models are probabilistic, making it much harder to bound the errors.

Exact correctness is at times considered so important that there is an entire discipline, known as formal verification, devoted to proving specified correctness properties of critical hardware and software systems (for example, the manufacture of computing devices). These methods play a key role in multi-billion dollar industries. When provable correctness is not available, having at least a rigorous certification process (see here for one effort) is a step in the right direction.

It has long been recognized that reliability is a fundamental challenge in modern AI systems. Despite dramatic advances in capability, strong correctness guarantees remain an open technical problem. The central question is how to build AI systems whose behavior can be bounded, verified, or certified at domain-appropriate levels. Until this is satisfactorily resolved, we should use these incredibly useful tools in smart ways that do not create unnecessary risks.

An AI Odyssey, Part 1: Correctness Conundrum

https://www.johndcook.com/blog/2026/03/02/an-ai-odyssey-part-1-correctness-conundrum/

03.03.2026 11:58 👍 0 🔁 0 💬 0 📌 0
Shortcut and Delay

Shortcut for previous working directory. Equations with a small delay.

Latest newsletter: Shortcut and Delay

https://johndcookconsulting.substack.com/p/shortcut-and-delay

03.03.2026 01:11 👍 0 🔁 0 💬 0 📌 0
Post image

Differential equation with a small delay

https://www.johndcook.com/blog/2026/03/02/small-delay/

02.03.2026 16:10 👍 0 🔁 0 💬 0 📌 0
Post image

A sort of multiplication table for trig functions and inverse trig functions.

https://www.johndcook.com/blog/2026/02/25/trig-of-inverse-trig/

25.02.2026 11:10 👍 1 🔁 1 💬 3 📌 0
Post image

https://www.johndcook.com/blog/2026/02/24/a-curious-trig-identity/

24.02.2026 23:35 👍 1 🔁 2 💬 1 📌 0
Learning in High Dimension Always Amounts to Extrapolation

The notion of interpolation and extrapolation is fundamental in various fields from deep learning to function approximation. Interpolation occurs for a sample _x_ whenever this sample falls inside or on the boundary of the given dataset’s convex hull. Extrapolation occurs when _x_ falls outside of that convex hull. One fundamental (mis)conception is that state-of-the-art algorithms work so well because of their ability to correctly interpolate training data. A second (mis)conception is that interpolation happens throughout tasks and datasets; in fact, many intuitions and theories rely on that assumption. We empirically and theoretically argue against those two points and demonstrate that on any high-dimensional (> 100) dataset, interpolation almost surely never happens. Those results challenge the validity of our current interpolation/extrapolation definition as an indicator of generalization performances.

LLMs don't interpolate. They extrapolate.

https://arxiv.org/abs/2110.09485v2

24.02.2026 21:18 👍 1 🔁 1 💬 0 📌 1
Post image

New post: Giant Steps
https://www.johndcook.com/blog/2026/02/23/giant-steps/

23.02.2026 19:22 👍 0 🔁 0 💬 0 📌 0
Post image

Tritone sub
https://www.johndcook.com/blog/2026/02/23/tritone-sub/

23.02.2026 12:53 👍 0 🔁 0 💬 0 📌 0
Post image

Exahash, zettahash, Yottahash

https://www.johndcook.com/blog/2026/02/22/zettahash/

22.02.2026 18:46 👍 1 🔁 2 💬 0 📌 0
10,000,000th Fibonacci number

I’ve written a couple times about Fibonacci numbers and certificates. Here the certificate is auxiliary data that makes it faster to confirm that the original calculation was correct. This post puts some timing numbers to this.

I calculated the 10 millionth Fibonacci number using code from this post.

```python
n = 10_000_000
F = fib_mpmath(n)
```

This took 150.3 seconds.

As I’ve discussed before, a number _f_ is a Fibonacci number if and only if 5 _f_ ² ± 4 is a square _r_ ². And for the _n_ th Fibonacci number, we can take ± to be positive if _n_ is even and negative if _n_ is odd. I computed the certificate _r_ with

```python
r = isqrt(5*F**2 + 4 - 8*(n % 2))
```

and this took 65.2 seconds. Verifying that _F_ is correct with

```python
d = abs(5*F**2 - r**2)
assert d == 4
```

took 3.3 seconds.

## About certificates

So in total it took 68.5 seconds to verify that _F_ was correct. But if someone had pre-computed _r_ and handed me _F_ and _r_, I could use _r_ to verify the correctness of _F_ in 3.3 seconds, about 2% of the time it took to compute _F_.

Sometimes you can get a certificate of correctness for free because it is a byproduct of the problem you’re solving. Other times computing the certificate takes a substantial amount of work. Here computing _F_ and _r_ took about 40% more work than just computing _F_.

What’s not typical about this example is that the solution provider carries out exactly the same process to create the certificate that someone receiving the solution without a certificate would do. Typically, even if the certificate isn’t free, it does come at a “discount,” i.e. the problem solver can compute the certificate more efficiently than someone given only the solution.

## Proving you have the right Fibonacci number

Now suppose I give you the number _F_ above and claim that it is the 10,000,000th Fibonacci number.
You carry out the calculations above and say “OK, you’ve convinced me that _F_ is _a_ Fibonacci number, but how do I know it’s the 10,000,000th Fibonacci number?”

The 10,000,000th Fibonacci number has 2,089,877 digits. That’s almost enough information to verify that a Fibonacci number is indeed the 10,000,000th Fibonacci number. By Binet’s formula, the log base 10 of the _n_ th Fibonacci number is very nearly

_n_ log₁₀ φ − 0.5 log₁₀ 5.

From this we can determine that the _n_ th Fibonacci number has 2,089,877 digits if _n_ = 10,000,000 + _k_ where _k_ equals 0, 1, 2, or 3.

Because the log base 10 of the 10,000,000th Fibonacci number is 2089876.053014785, and 10^0.053014785 = 1.129834377783962, we know that the first few digits of the 10,000,000th Fibonacci number are 11298…. If I give you a number with 2,089,877 digits that you can prove to be a Fibonacci number, and its first digit is 1, then you know it’s the 10,000,000th Fibonacci number.

You could also verify the number by looking at the last digit. As I wrote about here, the last digits of Fibonacci numbers have a period of 60. That means the last digit of the 10,000,000th Fibonacci number is the same as the last digit of the 40th Fibonacci number, which is 5. That is enough to rule out the other three possible Fibonacci numbers with 2,089,877 digits.
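The certificate check is easy to try at a small scale. Here is my sketch using the 50th Fibonacci number, 12586269025, and the certificate 28143753123 (the pair used elsewhere on this blog):

```python
from math import isqrt

F, n = 12586269025, 50             # the 50th Fibonacci number
r = isqrt(5*F**2 + 4 - 8*(n % 2))  # certificate; sign of the 4 depends on parity of n

print(r)                           # 28143753123
assert abs(5*F**2 - r**2) == 4     # F is certified to be a Fibonacci number
```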

The 10,000,000th Fibonacci number

https://www.johndcook.com/blog/2026/02/21/f10000000/

22.02.2026 01:01 👍 0 🔁 0 💬 0 📌 0
Computing big, certified Fibonacci numbers

I’ve written before about computing big Fibonacci numbers, and about creating a certificate to verify a Fibonacci number has been calculated correctly. This post will revisit both, giving a different approach to computing big Fibonacci numbers that produces a certificate along the way.

As I’ve said before, I’m not aware of any practical reason to compute large Fibonacci numbers. However, the process illustrates techniques that are practical for other problems.

The fastest way to compute the _n_ th Fibonacci number for sufficiently large _n_ is Binet’s formula

_F_ _n_ = round( φ^_n_ / √5 )

where φ is the golden ratio. The point where _n_ is “sufficiently large” depends on your hardware and software, but in my experience I found the crossover to be somewhere between 1,000 and 10,000.

The problem with Binet’s formula is that it requires extended precision floating point math. You need extra guard digits to make sure the integer part of your result is entirely correct. How many guard digits you’ll need isn’t obvious _a priori_. This post gave a way of _detecting_ errors, and it could be turned into a method for _correcting_ errors. But how do we know an error didn’t slip by undetected?

This question brings us back to the idea of certifying a Fibonacci number. A number _f_ is a Fibonacci number if and only if one of 5 _f_ ² ± 4 is a perfect square.

Binet’s formula, implemented in finite precision, takes a positive integer _n_ and gives us a number _f_ that is approximately the _n_ th Fibonacci number. Even in low-precision arithmetic, the _relative_ error in the computation is small. And because the ratio of consecutive Fibonacci numbers is approximately φ, an approximation to _F_ _n_ is far from the previous and next Fibonacci numbers. So the closest Fibonacci number to an approximation of _F_ _n_ is exactly _F_ _n_ unless the error in the approximation is more than 50%.
Now if _f_ is approximately _F_ _n_, then 5 _f_ ² is approximately a square. Find the integer _r_ minimizing |5 _f_ ² − _r_ ²|. In Python you could do this with the `isqrt` function. Then either _r_ ² + 4 or _r_ ² − 4 is divisible by 5. You can know which one by looking at _r_ ² mod 5. Whichever one it is, divide it by 5 and you have the square of _F_ _n_. You can take the square root exactly, in integer arithmetic, and you have _F_ _n_. Furthermore, the number _r_ that you computed along the way is the certificate of the calculation of _F_ _n_.

With the algorithm described above, we don’t need the initial approximation of _F_ _n_ to be very accurate. We could use ordinary floating point arithmetic, as long as our numbers don’t overflow. If we need extended floating point, it would not be for accuracy but to avoid overflow. In terms of floating point representation, you need lots of exponent bits but not many significand bits.
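Here is a minimal sketch of the whole procedure, my own reconstruction of the steps described above. It uses plain double-precision floats for the Binet step, so it only works for moderate _n_ (roughly _n_ < 70) before floats lose the integer part; the point is the exact integer recovery, not the approximation.

```python
from math import isqrt

def fib_certified(n):
    phi = (1 + 5**0.5) / 2
    f = round(phi**n / 5**0.5)           # approximate F_n from Binet's formula
    r = isqrt(5 * f * f)                 # integer near sqrt(5 f^2) ...
    if (r + 1)**2 - 5*f*f < 5*f*f - r*r:
        r += 1                           # ... adjusted to minimize |5 f^2 - r^2|
    # exactly one of r^2 +/- 4 is divisible by 5; that one equals 5 F_n^2
    s = r*r + 4 if (r*r + 4) % 5 == 0 else r*r - 4
    F = isqrt(s // 5)                    # exact integer square root
    return F, r                          # the Fibonacci number and its certificate

F, r = fib_certified(50)
assert F == 12586269025 and abs(5*F*F - r*r) == 4
```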

Calculating large Fibonacci numbers.

https://www.johndcook.com/blog/2026/02/21/big-certified-fibonacci/

21.02.2026 18:16 👍 0 🔁 0 💬 0 📌 0
Visualizing orbital velocity

The shape of a planet’s orbit around a star is an ellipse. To put it another way, a plot of the _position_ of a planet over time forms an ellipse. What about the velocity? Is its plot also an ellipse?

Surprisingly, a plot of the velocity forms a _circle_ even if a plot of the position is an ellipse. If an object is in a circular orbit, its velocity vector traces out a circle too, with the same center. If the object is in an elliptical orbit, its velocity vector still traces out a circle, but one with a different center.

When the orbit is eccentric, the **hodograph**, the figure traced out by the velocity vector, is also eccentric, though the two uses of the word “eccentric” are slightly different. The eccentricity _e_ of an ellipse is the ratio _c_/_a_ where _c_ is the distance from the center to a focus and _a_ is the semi-major axis. For a circle, _c_ = 0 and so _e_ = 0. The more elongated an ellipse is, the further apart the foci are relative to the axes and so the greater the eccentricity.

The plot of the orbit is eccentric in the sense that the two foci are distinct because the shape is an ellipse. The hodograph is eccentric in the sense that the circle is not centered at the origin. The two kinds of eccentricity are related: the displacement of the center of the hodograph from the origin is proportional to the eccentricity of the ellipse.

Imagine the orbit of a planet with its major axis along the _x_-axis and the _x_-coordinate of its star positive. The hodograph is a circle shifted up by an amount proportional to the eccentricity of the orbit _e_. The top of the circle corresponds to perihelion, the point closest to the star, and the bottom corresponds to aphelion, the point furthest from the star.

For more details, see the post Max and min orbital speed.
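The circle claim is easy to check numerically. The sketch below is my own, not from the post: it samples an elliptical Kepler orbit via the eccentric anomaly, computes the velocity, and confirms every velocity vector is the same distance from a fixed center. (In these units the star is at the origin and perihelion lies on the positive _x_-axis, so the hodograph circle is shifted along the _y_-axis.)

```python
import math

# Unit system: semi-major axis a = 1, gravitational parameter mu = 1.
a, mu, e = 1.0, 1.0, 0.6
n = math.sqrt(mu / a**3)                   # mean motion
h = n * a**2 * math.sqrt(1 - e**2)         # specific angular momentum

# Velocity as a function of eccentric anomaly E, using
#   x = a (cos E - e),  y = a sqrt(1 - e^2) sin E,  dE/dt = n / (1 - e cos E)
def velocity(E):
    Edot = n / (1 - e * math.cos(E))
    vx = -a * math.sin(E) * Edot
    vy = a * math.sqrt(1 - e**2) * math.cos(E) * Edot
    return vx, vy

# The hodograph should be a circle of radius mu/h centered at (0, e*mu/h).
center_y, radius = e * mu / h, mu / h
deviation = max(
    abs(math.hypot(vx, vy - center_y) - radius)
    for vx, vy in (velocity(2 * math.pi * k / 1000) for k in range(1000))
)
print(deviation)   # tiny (machine precision): every velocity lies on the circle
```

Note that the offset of the circle's center, _e_·μ/_h_, is proportional to the eccentricity, matching the claim in the post.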

New post: Plotting orbital velocity

https://www.johndcook.com/blog/2026/02/16/hodograph/

16.02.2026 21:25 👍 0 🔁 1 💬 0 📌 0
Post image

Race between primes of the forms 4k + 1 and 4k + 3

https://www.johndcook.com/blog/2026/02/15/chebyshev-bias/

15.02.2026 19:21 👍 0 🔁 0 💬 0 📌 0
Finding a square root of -1 mod p

If _p_ is an odd prime, there is a theorem that says _x_ ² = −1 mod _p_ has a solution if and only if _p_ = 1 mod 4. When a solution _x_ exists, how do you find it?

The previous two posts have discussed Stan Wagon’s algorithm for expressing an odd prime _p_ as a sum of two squares. This is possible if and only if _p_ = 1 mod 4, the same condition on _p_ for −1 to have a square root.

In the previous post we discussed how to find _c_ such that _c_ does not have a square root mod _p_. This is most of the work for finding a square root of −1. Once you have _c_, set _x_ = _c_ ^ _k_ where _p_ = 4 _k_ + 1.

For example, let’s find a square root of −1 mod _p_ where _p_ = 2^255 − 19. We found in the previous post that _c_ = 2 is a non-residue for this value of _p_.

```python
>>> p = 2**255 - 19
>>> k = p // 4
>>> x = pow(2, k, p)
```

Let’s view _x_ and verify that it is a solution.

```python
>>> x
19681161376707505956807079304988542015446066515923890162744021073123829784752
>>> (x**2 + 1) % p
0
```

Sometimes you’ll see a square root of −1 mod _p_ written as _i_. This makes sense in context, but it’s a little jarring at first since here _i_ is an integer, not a complex number.
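The same recipe works for any prime _p_ ≡ 1 mod 4, with Euler's criterion used to find the non-residue _c_. A small sketch of my own, demonstrated with _p_ = 13:

```python
def sqrt_minus_one(p):
    """Square root of -1 mod p, for a prime p with p % 4 == 1."""
    # Euler's criterion: c is a non-residue iff c^((p-1)/2) = -1 mod p
    c = 2
    while pow(c, (p - 1) // 2, p) != p - 1:
        c += 1
    return pow(c, (p - 1) // 4, p)   # x = c^k where p = 4k + 1

x = sqrt_minus_one(13)
print(x, (x * x + 1) % 13)   # x = 8, and 8^2 + 1 = 65 is divisible by 13
```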

Finding a square root of -1 mod p

https://www.johndcook.com/blog/2026/02/14/square-root-minus-1-mod-p/

14.02.2026 23:11 👍 0 🔁 1 💬 0 📌 0
Expressing a prime as the sum of two squares

I saw where Elon Musk posted Grok’s answer to the prompt “What are the most beautiful theorems.” I looked at the list, and there were no surprises, as you’d expect from a program that works by predicting the most likely sequence of words based on analyzing web pages.

There’s only one theorem on the list that hasn’t appeared on this blog, as far as I can recall, and that’s Fermat’s theorem that an odd prime _p_ can be written as the sum of two squares if and only if _p_ = 1 mod 4. The “only if” direction is easy [1] but the “if” direction takes more effort to prove. If _p_ is a prime and _p_ = 1 mod 4, Fermat’s theorem guarantees the existence of _x_ and _y_ such that _p_ = _x_ ² + _y_ ².

## Gauss’ formula

Stan Wagon [2] gave an algorithm for finding a pair (_x_, _y_) to satisfy the equation above. He also presents “a beautiful formula due to Gauss” which “does not seem to be of any value for computation.” Gauss’ formula gives an explicit solution, in terms of factorials, when _p_ = 4 _k_ + 1. For _x_ and _y_ we choose the residues mod _p_ with |_x_| and |_y_| less than _p_/2.

Why would Wagon say Gauss’ formula is computationally useless? The number of multiplications required is apparently on the order of _p_, and the size of the numbers involved grows like _p_!. You can get around the problem of intermediate numbers getting too large by carrying out all calculations mod _p_, but I don’t see a way of implementing Gauss’ formula with fewer than _O_(_p_) modular multiplications.

## Wagon’s algorithm

If we want to express a large prime _p_ as a sum of two squares, an algorithm requiring _O_(_p_) multiplications is impractical. Wagon’s algorithm is much more efficient. You can find the details of Wagon’s algorithm in [2], but the two key components are finding a quadratic non-residue mod _p_ (a number _c_ such that _c_ ≠ _x_ ² mod _p_ for any _x_) and the Euclidean algorithm.
Since half the numbers between 1 and _p_ − 1 are quadratic non-residues, you’re very likely to find a non-residue after a few attempts.

[1] The square of an integer is either equal to 0 or 1 mod 4, so the sum of two squares cannot equal 3 mod 4.

[2] Stan Wagon. The Euclidean Algorithm Strikes Again. The American Mathematical Monthly, Vol. 97, No. 2 (Feb., 1990), pp. 125-129.
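The two components can be put together in a compact sketch. This is my own reconstruction in the Hermite-Serret style that Wagon's paper discusses, not Wagon's own code: find a non-residue _c_, take _x_ = _c_^((_p_−1)/4) so that _x_ ² ≡ −1 mod _p_, then run the Euclidean algorithm on (_p_, _x_) until a remainder drops below √_p_.

```python
from math import isqrt

def sum_of_two_squares(p):
    """Write a prime p with p % 4 == 1 as a sum of two squares."""
    # Find a quadratic non-residue c via Euler's criterion
    c = 2
    while pow(c, (p - 1) // 2, p) != p - 1:
        c += 1
    x = pow(c, (p - 1) // 4, p)      # x^2 = -1 mod p
    # Euclidean algorithm on (p, x): the first remainder below sqrt(p)
    # is one leg of the representation
    a, b = p, x
    s = isqrt(p)
    while b > s:
        a, b = b, a % b
    y = isqrt(p - b * b)
    assert b * b + y * y == p
    return b, y

print(sum_of_two_squares(97))   # (9, 4): 81 + 16 = 97
```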

Expressing a prime as the sum of two squares

https://www.johndcook.com/blog/2026/02/12/pythagorean-primes/

13.02.2026 02:37 👍 0 🔁 1 💬 0 📌 0
Aligning one matrix with another

Suppose you have two _n_ × _n_ matrices, _A_ and _B_, and you would like to find a rotation matrix Ω that lines up _B_ with _A_. That is, you’d like to find Ω such that _A_ = Ω _B_.

This is asking too much, except in the trivial case of _A_ and _B_ being 1 × 1 matrices. You could view the matrix equation above as a set of _n_ ² equations in real numbers, but the space of orthogonal matrices only has _n_(_n_ − 1)/2 degrees of freedom [1].

When an equation doesn’t have an exact solution, the next best thing is to get as close as possible to a solution, typically in a least squares sense. The **orthogonal Procrustes problem** is to find an orthogonal matrix Ω minimizing the distance between _A_ and Ω _B_. That is, we want to minimize

|| _A_ − Ω _B_ ||

subject to the constraint that Ω is orthogonal. The matrix norm used in this problem is the Frobenius norm, the square root of the sum of the squares of the matrix entries. The Frobenius norm is the 2-norm if we straighten the matrices into vectors of dimension _n_ ².

Peter Schönemann found a solution to the orthogonal Procrustes problem in 1964. His solution involves singular value decomposition (SVD). This shouldn’t be surprising since SVD solves the problem of finding the closest thing to an inverse of a non-invertible matrix. (More on that here.)

Schönemann’s solution is to set _M_ = _AB_ ᵀ and find its singular value decomposition _M_ = _U_ Σ _V_ ᵀ. Then Ω = _UV_ ᵀ.

## Python code

The following code illustrates solving the orthogonal Procrustes problem for random matrices.
```python
import numpy as np

n = 3

# Generate random n x n matrices A and B
rng = np.random.default_rng(seed=20260211)
A = rng.standard_normal((n, n))
B = rng.standard_normal((n, n))

# Compute M = A * B^T
M = A @ B.T

# SVD: M = U * Sigma * V^T
U, s, Vt = np.linalg.svd(M, full_matrices=False)

# R = U * V^T
R = U @ Vt

# Verify that R^T * R is very nearly the identity matrix
print("||R^T R - I||_F =", np.linalg.norm(R.T @ R - np.eye(n), ord="fro"))
```

In this example the Frobenius norm of _R_ ᵀ _R_ − _I_ is about 4 × 10⁻¹⁶, so essentially _R_ ᵀ _R_ = _I_ to machine precision.

## Related posts

* Least squares solution to over-determined systems
* Moore-Penrose pseudoinverse
* Drazin pseudoinverse

[1] Every column of an orthogonal matrix Ω must have length 1, so that gives _n_ constraints. Furthermore, each pair of columns must be orthogonal, which gives _n_ choose 2 more constraints. We start with Ω having _n_ ² degrees of freedom, but then remove _n_ and then _n_(_n_ − 1)/2 degrees of freedom:

_n_ ² − _n_ − _n_(_n_ − 1)/2 = _n_(_n_ − 1)/2.
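One way to convince yourself that Ω = _UV_ ᵀ really is the minimizer: compare its fit against a batch of random orthogonal matrices. This sanity check is my own addition; it generates random orthogonal matrices via the QR decomposition of Gaussian matrices.

```python
import numpy as np

rng = np.random.default_rng(seed=20260211)
n = 3
A = rng.standard_normal((n, n))
B = rng.standard_normal((n, n))

# Schonemann's solution: Omega = U V^T from the SVD of A B^T
U, _, Vt = np.linalg.svd(A @ B.T)
Omega = U @ Vt
best = np.linalg.norm(A - Omega @ B, ord="fro")

# Random orthogonal matrices via QR decomposition of Gaussian matrices
worst_beaten = all(
    best <= np.linalg.norm(
        A - np.linalg.qr(rng.standard_normal((n, n)))[0] @ B, ord="fro"
    ) + 1e-12
    for _ in range(1000)
)
print(worst_beaten)   # no random rotation fits better than Omega
```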

Aligning one matrix with another
The orthogonal Procrustes problem

https://www.johndcook.com/blog/2026/02/11/orthogonal-procrustes/

11.02.2026 13:25 👍 0 🔁 0 💬 0 📌 0
Computing large Fibonacci numbers

The previous post discussed two ways to compute the _n_ th Fibonacci number. The first is to compute all the Fibonacci numbers up to the _n_ th iteratively, using the defining property of Fibonacci numbers

_F_ _n_+2 = _F_ _n_ + _F_ _n_+1

with extended integer arithmetic. The second approach is to use Binet’s formula

_F_ _n_ = round( φ^_n_ / √5 )

where φ is the golden ratio.

It’s not clear which approach is more efficient. You could say that the iterative approach has run time _O_(_n_) while Binet’s formula is _O_(1). That doesn’t take into account how much work goes into each step, but it does suggest that eventually Binet wins.

The relative efficiency of each algorithm depends on how it is implemented. In this post I will compare using Python’s integer arithmetic and the `mpmath` library for floating point. Here’s my code for both methods.

```python
from math import log10
import mpmath as mp

def fib_iterate(n):
    a, b = 0, 1
    for _ in range(n):
        a, b = b, a + b
    return a

def digits_needed(n):
    phi = (1 + 5**0.5) / 2
    return int(n*log10(phi) - 0.5*log10(5)) + 1

def fib_mpmath(n, guard_digits=30):
    digits = digits_needed(n)
    # Set decimal digits of precision
    mp.mp.dps = digits + guard_digits
    sqrt5 = mp.sqrt(5)
    phi = (1 + sqrt5) / 2
    x = (phi ** n) / sqrt5
    return int(mp.nint(x))
```

Next, here’s some code to compare the run times.

```python
import time

def compare(n):
    start = time.perf_counter()
    x = fib_iterate(n)
    elapsed = time.perf_counter() - start
    print(elapsed)

    start = time.perf_counter()
    y = fib_mpmath(n)
    elapsed = time.perf_counter() - start
    print(elapsed)

    if x != y:
        print("Methods produced different results.")
```

This code shows that the iterative approach is faster for _n_ = 1,000 but Binet’s method is faster for _n_ = 10,000.

```python
>>> compare(1_000)
0.0002502090001144097
0.0009207079999669077
>>> compare(10_000)
0.0036547919999065925
0.002145750000181579
```

For larger _n_, the efficiency advantage of Binet’s formula becomes more apparent.
```python
>>> compare(1_000_000)
11.169050417000108
2.0719056249999994
```

## Guard digits and correctness

There is one unsettling problem with the function `fib_mpmath` above: how many guard digits do you need? To compute a number correctly to 100 significant figures, for example, requires more than 100 digits of working precision. How many more? It depends on the calculation.

What about our calculation? If we compute the 10,000th Fibonacci number using `fib_mpmath(10_000, 2)`, i.e. with 2 guard digits, we get a result that is incorrect in the last digit. To compute the 1,000,000th Fibonacci number correctly, we need 5 guard digits. We don’t need many guard digits, but we’re guessing at how many we need.

How might we test whether we’ve guessed correctly? One way would be to compute the result using `fib_iterate` and compare results, but that defeats the purpose of using the more efficient `fib_mpmath`.

If floating point calculations produce an incorrect result, the error is likely to be in the least significant digits. If we knew that the last digit was correct, that would give us more confidence that all the digits are correct. More generally, we could test the result mod _m_.

I discussed Fibonacci numbers mod _m_ in this post. When _m_ = 10, the last digits of the Fibonacci numbers have a cycle of 60. So the last digits of the Fibonacci numbers with index _n_ and with index _n_ mod 60 should be the same. The post I just mentioned links to a paper by Niederreiter that says the Fibonacci numbers are evenly distributed mod _m_ if and only if _m_ is a power of 5, in which case the cycle length is 4 _m_.

The following code could be used as a sanity check on the result of `fib_mpmath`.

```python
def mod_check(n, Fn):
    k = 3
    base = 5**k
    period = 4*base
    return Fn % base == fib_iterate(n % period) % base
```

With _k_ = 3, we are checking the result of our calculation mod 125. It is unlikely that an incorrect result would be correct mod 125. It’s hard to say just how unlikely.
Naively we could say there’s 1 chance in 125, but that ignores the fact that the errors are most likely in the least significant bits. The chances of an incorrect result being correct mod 125 would be much less than 1 in 125. For more assurance you could use a larger power of 5.
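Another sanity check, not from the post but a common floating point tactic: recompute with twice as many guard digits and accept the answer only if the two precisions agree. This is a heuristic, not a proof of correctness. The sketch below restates the post's Binet method as `fib_binet`; both function names are mine.

```python
from math import log10
import mpmath as mp

def fib_binet(n, guard):
    # Binet's formula as in the post, with `guard` extra decimal digits
    phi = (1 + 5**0.5) / 2
    mp.mp.dps = int(n*log10(phi) - 0.5*log10(5)) + 1 + guard
    sqrt5 = mp.sqrt(5)
    return int(mp.nint(((1 + sqrt5) / 2) ** n / sqrt5))

def fib_checked(n, guard=5):
    # Heuristic: if the answer is stable under doubling the guard digits,
    # accept it; otherwise ask the caller for more precision.
    a = fib_binet(n, guard)
    if a != fib_binet(n, 2 * guard):
        raise ValueError("guard digits too small")
    return a

print(fib_checked(100))  # F_100 = 354224848179261915075
```

Agreement at two precisions doesn't guarantee correctness, but like the mod 125 check it makes an undetected error unlikely.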

New post: Computing large Fibonacci numbers

https://www.johndcook.com/blog/2026/02/08/computing-large-fibonacci-numbers/

08.02.2026 13:57 👍 0 🔁 0 💬 0 📌 0
Fibonacci numbers and time-space tradeoffs

A few days ago I wrote about Fibonacci numbers and certificates. As I pointed out in the article, there's no need to certify Fibonacci numbers, but the point of the post was to illustrate the idea of a solution certificate in a simple context. Practical uses of certificates are more complicated. This time I want to use Fibonacci numbers to illustrate time-space tradeoffs.

The post on Fibonacci certificates imagined providing someone a pair (F, r) where F is a large Fibonacci number and r is a certificate proving that F is indeed a Fibonacci number. The goal was to minimize the computational effort to verify F. All the recipient needed to do was compute

|5F² − r²|.

The number F is a Fibonacci number if and only if this number equals 4.

The problem with this scenario is that F and r are both large numbers. They require transmitting and storing a lot of bits. A much more space-efficient approach would be to transmit the index of the Fibonacci number and have the user compute the number. The example in the certificate post was (12586269025, 28143753123). Since 12586269025 is the 50th Fibonacci number, I could communicate it to someone by simply transmitting the number 50. That saves space, but it puts more computational burden on the recipient.

Fibonacci numbers grow exponentially with their index, and so the size of the nth Fibonacci number in bits is proportional to n. But the number of bits in n is proportional to log n. When n is large, the difference is dramatic.

How many bits are in the 1,000,000th Fibonacci number? The nth Fibonacci number is φ^n/√5 rounded to the nearest integer, so the number of bits in the millionth Fibonacci number would be

log₂(φ^1000000/√5) = 1000000 log₂ φ − 0.5 log₂ 5

which is roughly 700,000. By contrast, one million is a 20-bit number. So transmitting "1000000" is far more efficient than transmitting the millionth Fibonacci number.
What does it take to compute the nth Fibonacci number? For small n, it's fast and easy to compute the Fibonacci numbers up to n sequentially using the definition of the sequence. For large enough n, it would be faster to compute φ^n/√5. However, the former requires (extended) integer arithmetic, and the latter requires (extended) floating point arithmetic. It's not clear where the crossover point is, past which floating point becomes more efficient. That's the topic of the next post.
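The verification step described above fits in one line of Python. A minimal sketch (function name mine), using the pair quoted from the certificate post:

```python
def verify_fib_certificate(F, r):
    # F is a Fibonacci number if and only if |5 F^2 - r^2| = 4
    # for the certificate value r accompanying it
    return abs(5*F*F - r*r) == 4

# the pair from the certificate post: the 50th Fibonacci number
# and its certificate
print(verify_fib_certificate(12586269025, 28143753123))  # True
```

The check is cheap even for huge F, since Python integers are arbitrary precision; the cost is in transmitting and storing F and r, which is the tradeoff the post is about.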

Fibonacci numbers and time-space tradeoffs

https://www.johndcook.com/blog/2026/02/08/time-space-tradeoffs/

08.02.2026 13:53 👍 0 🔁 0 💬 0 📌 0
Minimum of cosine sum

Suppose f(x) is the sum of terms of the form cos(kx) where k is an integer from a set A with n elements. Then the maximum value of f is f(0) = n. But what is the minimum value of f?

The **Chowla cosine conjecture** says that the minimum should be less than −√n for large n. For now the best proven results are much smaller in absolute value [1].

I was playing around with this problem, and the first thing I thought of was to let the set A be the first n primes. This turned out to not be the most interesting example. Since all the primes except the first are odd, and cos(kπ) = −1 for odd k, the minimum was always approximately −n and always occurred near π [2]. Here's a plot where A is the set of primes less than 100.

For the cosine conjecture to be interesting, the set A should contain a mix of even and odd numbers. Here's a plot with A equal to a random selection of 25 points between 1 and 100. (I chose 25 because there are 25 primes less than 100.)

Here's the Python code I used to generate the two sets A and the function to plot.

```python
import numpy as np
from sympy import prime

def f(x, A):
    return sum([np.cos(k*x) for k in A])

n = 25
A_prime = [prime(i) for i in range(1, n+1)]

np.random.seed(20260207)
A_random = np.random.choice(range(1, 101), size=n, replace=False)
```

If you wanted to explore the Chowla conjecture numerically, direct use of minimization software is impractical. As you can tell from the plots above, there are a lot of local minima. If the values in A are not too large, you can look at a plot to see approximately where the minimum occurs, then use a numerical method to find the minimum in this region, but that doesn't scale.

Here's an approach that would scale better. You could find all the zeros of the derivative of f_A and evaluate the function at each. One of these is the minimum.
The derivative is a sum of sines with integer frequencies, and so it can be written as a polynomial in z = exp(ix) [3]. You could find all the zeros of this polynomial using the QR algorithm as discussed in the previous post.

[1] Benjamin Bedert. Polynomial bounds for the Chowla cosine problem. arXiv

[2] If A is the set of the first n primes, f_A(π) = 2 − n because the sum defining f_A(π) has one term equal to 1 and n − 1 terms equal to −1. I think for n ≥ 4 this is the minimum, but I haven't verified this. If so, the minimum isn't just near π but exactly at π.

[3] You get a polynomial of degree n in z and 1/z. Then multiply by z^n to get a polynomial in z only of degree 2n.
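Here's a hedged sketch of that scalable approach (function name and coefficient layout mine). Writing z = exp(ix), the critical points of f_A are among the unit-circle roots of the polynomial Σ k z^(N−k) − Σ k z^(N+k) with N = max(A), and `np.roots` finds them via the eigenvalues of the companion matrix.

```python
import numpy as np

def cosine_sum_min(A):
    # f(x) = sum of cos(k x) for k in A; f'(x) = -sum of k sin(k x).
    # With z = exp(ix), multiply f' by a power of z to get a polynomial
    # whose roots on the unit circle are the critical points of f.
    N = max(A)
    c = np.zeros(2*N + 1)              # c[p] = coefficient of z^p
    for k in A:
        c[N - k] += k
        c[N + k] -= k
    roots = np.roots(c[::-1])          # np.roots wants highest power first
    roots = roots[np.abs(np.abs(roots) - 1) < 1e-6]  # keep unit-circle roots
    xs = np.angle(roots)
    return min(sum(np.cos(k*x) for k in A) for x in xs)

# cos x + cos 2x has its minimum -9/8 where cos x = -1/4
print(cosine_sum_min([1, 2]))
```

Filtering to roots near the unit circle discards the roots of the polynomial that don't correspond to real values of x; the remaining angles are exactly the critical points, and the minimum is among them.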

New post: Minimum of a cosine series
The Chowla cosine conjecture

https://www.johndcook.com/blog/2026/02/07/chowla/

07.02.2026 14:04 👍 0 🔁 0 💬 0 📌 0
Eigenvalue homework problems are backward

## Classroom

When you take a linear algebra course and get to the chapter on eigenvalues, your homework problems will include a small matrix A and you will be asked to find the eigenvalues. You do this by computing the determinant det(A − λI) = P(λ), a polynomial in λ. The roots of P are the eigenvalues of A.

Either A will be a 2 × 2 matrix, in which case you can find the roots using the quadratic formula, or the matrix will have been carefully selected so that P(λ) is easy to factor. Otherwise, finding the roots of a polynomial is hard.

## Real world

Numerical algorithms to find eigenvalues have gotten really good. In practice, you don't compute determinants or find roots of polynomials. Instead you do something like the QR algorithm.

Finding all the roots of a polynomial is a challenging problem, and so what you might do in practice is find the roots by constructing a matrix, called the **companion matrix**, whose eigenvalues correspond to the roots you're after.

## Summary

As a classroom exercise, you calculate roots of polynomials to find eigenvalues. In the real world, you might use an eigenvalue solver to find the roots of polynomials.

I wrote a similar post a few years ago. It explains that textbooks define hyperbolic functions using e^x, but you might want to compute e^x using hyperbolic functions.
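To make that last point concrete, here is a sketch (function name mine) of finding polynomial roots by handing the companion matrix to an eigenvalue solver. The layout is the standard companion matrix for a monic polynomial.

```python
import numpy as np

def roots_via_companion(coeffs):
    # Roots of the monic polynomial
    #   p(z) = z^n + c[n-1] z^(n-1) + ... + c[1] z + c[0]
    # where coeffs = [c0, c1, ..., c_{n-1}].
    n = len(coeffs)
    C = np.zeros((n, n))
    C[1:, :-1] = np.eye(n - 1)        # ones on the subdiagonal
    C[:, -1] = -np.array(coeffs)      # last column holds -coefficients
    return np.linalg.eigvals(C)       # QR algorithm under the hood

# p(z) = z^2 - 3z + 2 = (z - 1)(z - 2)
print(np.sort(roots_via_companion([2, -3]).real))  # roots 1 and 2
```

This is essentially what `numpy.roots` does internally: build the companion matrix and call an eigenvalue routine.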

Eigenvalue homework problems are backward

https://www.johndcook.com/blog/2026/02/06/eigenvalue-roots/

06.02.2026 23:34 👍 0 🔁 0 💬 0 📌 0
CERTIFIED


Fibonacci number certificate

https://www.johndcook.com/blog/2026/02/05/fibonacci-certificate/

05.02.2026 17:17 👍 1 🔁 0 💬 0 📌 0
Γ(1/n)

If n is a positive integer, then rounding Γ(1/n) up to the nearest integer gives n. In symbols,

⌈Γ(1/n)⌉ = n.

We can illustrate this with the following Python code.

```python
>>> from scipy.special import gamma
>>> from math import ceil
>>> for n in range(1, 101):
...     assert(ceil(gamma(1/n)) == n)
```

You can find a full proof in [1]. I'll give a partial proof that may be more informative than the full proof. The asymptotic expansion of the gamma function near zero is

Γ(z) = 1/z − γ + O(z)

where γ is the Euler-Mascheroni constant. So when we set z = 1/n we find

Γ(1/n) ≈ n − γ + O(1/n).

Since 0 < γ < 1, the theorem above is true for sufficiently large n. And it turns out "sufficiently large" can be replaced with n ≥ 1.

[1] Gamma at reciprocals of integers: 12225. American Mathematical Monthly. October 2022. pp. 789-790.
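As a numerical companion to the partial proof, the following sketch (the hard-coded constant is mine) watches n − Γ(1/n) converge to γ:

```python
from math import gamma

EULER_GAMMA = 0.5772156649015329   # Euler-Mascheroni constant

# By the expansion above, n - gamma(1/n) -> EULER_GAMMA as n grows
for n in [10, 100, 1000]:
    print(n, n - gamma(1/n))
```

The printed values approach 0.5772…, and since that limit lies strictly between 0 and 1, Γ(1/n) sits strictly between n − 1 and n for large n, which is what the ceiling identity needs.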

New post: Γ(1/n)

https://www.johndcook.com/blog/2026/02/04/gamma-reciprocal/

05.02.2026 03:42 👍 0 🔁 1 💬 1 📌 0
Fortunes and Geometric Means

I saw a post on X recently that said

> Bill Gates is closer to you in wealth than he is to Elon Musk. Mind blown.

For round numbers, let's say Elon Musk's net worth is 800 billion and Bill Gates' net worth is 100 billion. So if your net worth is less than 450 billion, the statement in the post is true.

The reason the statement above is mind blowing is that in this context you naturally think on a logarithmic scale, even if you don't know what a logarithm is. Or to put it another way, we think in terms of orders of magnitude. Musk's net worth is an order of magnitude greater than Gates', and Gates' net worth would be an order of magnitude greater than that of someone worth 10 billion. Musk is a notch above Gates, and Gates is a notch above someone with a net worth around 10 billion, where a "notch" is an order of magnitude.

To put it another way, in this context we think in terms of the geometric mean √(ab) rather than the arithmetic mean (a + b)/2. 100 billion is the geometric mean of 12.5 billion and 800 billion. The geometric mean corresponds to the arithmetic mean on a log scale. And on this scale, Gates is closer to Musk than you are, unless you're worth more than 12.5 billion.

Here are three more examples of geometric means.

The size of **Jupiter** is about midway between that of earth and the sun; it's the geometric mean. On a linear scale Jupiter is much closer to the size of the earth than the sun, but on a logarithmic scale it's about in the middle. More on that here.

The **tritone** (augmented fourth) is half an octave. So, for example, an F# is in the middle between a C and the C an octave higher. Its frequency is the geometric mean of the frequencies of the two C's. More here.

Finally, the **human body** is a middle-sized object in the universe. From Kevin Kelly:

Our body size is, weirdly, almost exactly in the middle of the size of the universe. 
The smallest things we know about are approximately 30 orders of magnitude smaller than we are, and the largest structures in the universe are about 30 orders of magnitude bigger.
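The arithmetic behind the Gates example takes two lines of Python (the net worths are the post's round numbers):

```python
from math import log10, sqrt

musk = 800e9
you = 12.5e9

# Gates' 100 billion is the geometric mean of 12.5 billion and 800 billion
print(sqrt(you * musk))                        # 1e11

# equivalently, the arithmetic mean on a log scale, up to rounding
print(10 ** ((log10(you) + log10(musk)) / 2))
```

Both computations give 100 billion, which is the "log midpoint" interpretation of the post.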

In what sense is Jupiter half way between the earth and the sun in size?

In what sense is an F# half way between a C and a C an octave higher?

In what sense is Bill Gates closer to Elon Musk than to you?

All examples of geometric mean.

https://www.johndcook.com/blog/2026/01/24/geometric-means/

24.01.2026 17:00 👍 1 🔁 0 💬 0 📌 0
Proving you know a product

There is a way to prove that you know two numbers a and b, and their product c = ab, without revealing a, b, or c. This isn't very exciting without more context (maybe you know that 7 × 3 = 21), but it's a building block of more interesting zero knowledge proofs, such as proving that a cryptocurrency transaction is valid without revealing the amount of the transaction.

The proof mechanism requires an elliptic curve G and a pairing of G with itself. (More on pairings shortly.) It also requires a generator g of the group structure on G.

The prover takes the three secret numbers and multiplies the generator g by each, encrypting the numbers as ag, bg, and cg. When G is a large elliptic curve, say one with on the order of 2^256 points, then computing products like ag can be done quickly, but recovering a from g and ag is impractical. In a nutshell, multiplication is easy but division [1] is practically impossible [2].

The verifier receives ag, bg, and cg. How can he verify that ab = c without knowing a, b, or c? Here's where pairings come in.

I go more into pairings here, but essentially a pairing is a mapping from two groups to a third group

e: G₁ × G₂ → G_T

such that

e(aP, bQ) = e(P, Q)^(ab).

In our case G₁ and G₂ are both equal to the group G above, and the target group G_T doesn't matter for our discussion here. Also, P and Q will both be our generator g.

By the defining property of a pairing,

e(ag, bg) = e(g, g)^(ab)

and

e(cg, g) = e(g, g)^c.

So if ab = c, then e(g, g)^(ab) and e(g, g)^c will be equal.

## Related posts

* Elliptic curve pairings in cryptography
* Three-party Diffie-Hellman
* Pairing-unfriendly curves

[1] The literature will usually speak of discrete logarithms rather than division. The group structure on an elliptic curve is Abelian, and so it is usually written as addition. 
If you write the group operation as multiplication, then you're taking logs rather than dividing. The multiplicative notation highlights the similarity to working in the multiplicative group modulo a large prime.

[2] The computation is theoretically possible but not possible in practice without spending enormous resources, or inventing a large-scale quantum computer. This is the discrete logarithm assumption.
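A real elliptic-curve pairing can't be sketched in a few lines, but the verification algebra can be exercised with a toy pairing: take the "curve" to be the additive group Z_n and define e(P, Q) = w^(PQ) in a multiplicative group of order n. All the numbers below are mine and chosen tiny on purpose; this construction is bilinear but completely insecure, since recovering a from ag is trivial here.

```python
# Toy pairing: the group is Z_11 (additive), the target group is the
# order-11 subgroup of Z_23* generated by w = 4, and e(P, Q) = w^(P*Q).
# Bilinearity: e(aP, bQ) = w^(ab*P*Q) = e(P, Q)^(ab).
p, n = 23, 11
w = 4          # 4^11 = (2^11)^2 = 1 mod 23, and 11 is prime, so ord(4) = 11
g = 1          # generator of the additive group Z_11

def e(P, Q):
    return pow(w, (P * Q) % n, p)

a, b = 3, 5
c = (a * b) % n
ag, bg, cg = (a * g) % n, (b * g) % n, (c * g) % n   # what the prover publishes

# the verifier's check: e(ag, bg) == e(cg, g) exactly when ab == c (mod n)
print(e(ag, bg) == e(cg, g))  # True
```

The check mirrors the identities above: e(ag, bg) = w^(ab) and e(cg, g) = w^c, which agree precisely when ab = c in the exponent group.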

Zero knowledge proof building block:
Proving you know a product

https://www.johndcook.com/blog/2026/01/24/proving-you-know-a-product/

24.01.2026 16:28 👍 0 🔁 0 💬 0 📌 0
Post image

Analogs of binomial coefficients: q-binomial, Stirling numbers, and Eulerian numbers

https://www.johndcook.com/blog/2025/07/24/analogs-of-binomial-coefficients/

24.01.2026 13:53 👍 1 🔁 2 💬 1 📌 0
How to prove you know a discrete logarithm

In a high school math class, the solution x of the equation b^x = y is the logarithm of y in base b. The implicit context of the equation is the real numbers, and the solution is easy to calculate.

The same problem in the context of finite groups is called the discrete logarithm problem, and it is difficult to solve for large groups. In particular, it is impractical to solve when working modulo a sufficiently large prime number or when working over a sufficiently large elliptic curve [1]. In either context, the exponential b^x can be computed efficiently but its inverse cannot.

Now suppose you want to prove that you know x without revealing x itself. That is, you'd like to construct a **zero knowledge proof** that you know x. How could you do this? Here's one way.

1. You, the prover, create a random number r, compute t = b^r, and send the verifier t.
2. The other party, the verifier, creates a random number c, the challenge, and sends it to you.
3. You calculate s = r + cx and send s to the verifier.
4. The verifier checks whether b^s = t y^c, and believes you if and only if equality holds.

Let's see why this works.

First of all, what have you revealed to the verifier? Two values: t and s. The value t is the exponential of a random number, and so is itself random. The value s is based on x, and so conceivably you've revealed your secret. But the verifier does not know r, only a value computed from r (i.e. t), and the verifier cannot recover r from t because this would require computing a discrete logarithm.

Next, why should b^s = t y^c? Because

b^s = b^(r + cx) = b^r b^(cx) = t (b^x)^c = t y^c.

Finally, why should the verifier believe you know x? If you didn't know x but were able to come up with an s that satisfies the verifier, then you would have been able to compute the discrete logarithm of t y^c.
## Related posts

* Lewis Carroll and zero knowledge proofs
* What is a Pedersen commitment?
* Zero knowledge proof of compositeness

[1] At least without a large-scale quantum computer. Shor's algorithm could efficiently compute discrete logarithms if only there were a large quantum computer to run it on.
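The four steps above fit in a few lines of Python. This sketch plays both roles in one function; the modulus p = 101 and generator b = 2 are my choices and far too small for real use, serving only to exercise the algebra (exponents are reduced mod p − 1, the order of the group).

```python
import secrets

p = 101   # small prime for illustration; real systems use huge groups
b = 2     # a generator of the multiplicative group mod 101

def prove_and_verify(x):
    y = pow(b, x, p)                   # public value; x stays secret
    # Step 1: prover commits to a random r
    r = secrets.randbelow(p - 1)
    t = pow(b, r, p)
    # Step 2: verifier issues a random challenge
    c = secrets.randbelow(p - 1)
    # Step 3: prover responds with s = r + c x (mod p - 1)
    s = (r + c * x) % (p - 1)
    # Step 4: verifier checks b^s == t * y^c (mod p)
    return pow(b, s, p) == (t * pow(y, c, p)) % p

print(prove_and_verify(x=37))  # prints True
```

The check succeeds for an honest prover because b^(r + cx) = b^r (b^x)^c = t y^c, exactly the identity in the post.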

How to prove that you know a discrete logarithm.
Simple example of a zero knowledge proof.

https://www.johndcook.com/blog/2026/01/23/zkp-discrete-logarithm/

23.01.2026 16:53 👍 0 🔁 2 💬 0 📌 0
Post image

The Mills ratio highlights the difference between heavy-tailed and thin-tailed distributions.

https://www.johndcook.com/blog/2026/01/21/mills-ratio/

21.01.2026 15:35 👍 0 🔁 0 💬 0 📌 0
Sigmas and Student

I saw something yesterday saying that the Japanese bond market had experienced a six standard deviation move. This brought to mind a post I'd written eight years ago. All probability statements depend on a model. And if your probability model says you've seen an event six standard deviations from the mean, it's more likely that your model is wrong than that you've actually seen something that rare. I expand on this idea here.

How likely is it that a sample from a random variable will be six standard deviations from its mean? If you have in mind a normal (Gaussian) distribution, as most people do, then the probability is on the order of 1 chance in 1,000,000,000. Six sigma events are not common for any distribution, but they're not unheard of for distributions with heavy tails.

Let X be a random variable with a Student t distribution and ν degrees of freedom. When ν is small, i.e. no more than 2, the tails of X are so fat that the standard deviation doesn't exist. As ν → ∞ the Student t distribution approaches the normal distribution. So in some sense this distribution interpolates between fat tails and thin tails.

What is the probability that X takes on a value more than six standard deviations from its mean at 0? That is, what does the function

f(ν) = Prob(X > 6σ)

look like as a function of ν, where σ² = ν/(ν − 2) is the variance of X?

As you'd expect, the limit of f(ν) as ν → ∞ is the probability of a six-sigma event for a normal distribution, around 10^−9 as mentioned above. Here's a plot of f(ν) for ν > 3. Notice that the vertical axis is on a log scale, i.e. the probability decreases exponentially.

What you might not expect is that f(ν) isn't monotone. It rises to a maximum value before it decays exponentially. In hindsight this makes sense. As ν → 2+ the variance becomes infinite, and the probability of being infinitely far from the mean is 0. Here's a plot of f(ν) between 2 and 3. 
So six sigma probabilities for a Student t distribution rise from 0 up to a maximum of around 10^−3, then decrease exponentially, asymptotically approaching a value around 10^−9.

## Related posts

* Converting between nines and sigmas
* When ν = 30 isn't normal enough
* Fat tails and the t-test
* Beer, wine, and statistics
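Here is a short `scipy.stats` sketch of the function f(ν) (function name and grid of ν values mine), using the variance formula above:

```python
import numpy as np
from scipy.stats import t, norm

def f(nu):
    # P(X > 6 sigma) where X has a Student t distribution, nu > 2
    sigma = np.sqrt(nu / (nu - 2))
    return t.sf(6 * sigma, df=nu)    # sf = 1 - cdf, the upper tail

for nu in [2.1, 3, 10, 100, 1000]:
    print(nu, f(nu))

# the normal-distribution limit of f, on the order of 1e-9
print(norm.sf(6))
```

Using the survival function `sf` rather than `1 - cdf` matters here: the probabilities are tiny, and `1 - cdf` would lose them to floating point cancellation.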

Probability of six-sigma events under a t-distribution. How it depends on degrees of freedom.

https://www.johndcook.com/blog/2026/01/21/sigmas-and-student/

21.01.2026 13:23 👍 0 🔁 1 💬 0 📌 0