SAX: Piecewise Aggregate Approximation
Problem: We have a series of n numbers that we want to divide into w slots. We want to compute the mean of each slot, how do we proceed when n is not divisible by w ? This is called a Piecewise Aggregate Approximation (PAA).
This question appeared when I read the SAX algorithm. It is used to convert a time series to a sequence of symbols. The trick is briefly explained in the paper, but the implementation requires a bit of thinking. Following schema is taken from the original paper, Experiencing SAX: a Novel Symbolic Representation of Time Series
The natural way would be to consider each slot to be the size of the following equation (n//w is the floor division) \(\frac{n}{w} = n // w + \frac{n \% w}{w}\).
We start from index 0, we add n//w points, then we add a proportion of the next point, corresponding to (n%w)/w. For the second slot, we take the rest of the proportion of the previous point, we add n//w point, then the rest of the proportion of the last point so that the size of the slot is n/w. We keep doing this process until we reach the end of the series.
The issue with this strategy is that is quite difficult and inelegant to code.
There is a more elegant way. If we multiply n by w, we can consider that we repeat each point w times. What is the point of doing this ? We saw that the proportion of point we need to add to the current slot is a quantity that we will divide by w (it is (n%w)/w). Considering we repeat each point w times, we will be able to deal with a number of points that is an integer.
Now the slot does not have a size of n/w, but n. We sum each element of the slot, then we divide by n at the end: we will indeed get the mean of the slot. For example, the second slot in the picture above has a mean of (s[i] corresponds to the ith element of the serie):
\(\frac{\frac{2}{3}s[2] +\frac{2}{3}s[3]}{\frac{4}{3}}\)
Considering the new representation, the mean is:
\(\frac{2s[2] +2s[3]}{4}\)
which is the same.
Now the code:
def paa(s, w):
res = [0] * w
n = len(s)
for i in range(w * n):
idx = i // n
pos = i // w
res[idx] += s[pos]
res = [x / n for x in res]
return res
print(paa([1, 2, 0, 4, 3, 5, 6, -2, 3 -4], 4))
# >>> [1.3333333333333333, 2.4444444444444446, 4.888888888888889, -0.6666666666666666]
The following plot shows what the PAA looks like:
NB: Note however that this method has the disadvantage of increasing a lot the number of iterations. If you consider a long timeseries, it may be too long. If it is the case, you can choose to simply consider slot to have size n // w. If w is not divisble by w, we will have w + 1 slot, the last one having less points (you need to be aware of that).