A trick for pandas groupby!
How to GroupBy an increasing column such as a counter? [In pandas]
I often work with sensor measurements, such as those from a car's CAN bus. These measurements are usually collected over multiple sessions, but the sessions are rarely marked with a proper identifier. Sometimes timestamps (containing date and time) are available, and one can infer the session boundaries from the time interval between recordings.
Today I started work on a new dataset. Instead of a proper timestamp, it provides a column that is just a counter, restarting at the beginning of each recording. The measurements are supposed to be at 1 Hz, so the counter is also a measure of time.
Details aside, I run into similar datasets from time to time, and I often want to separate these recording sessions (or whatever they may represent in your dataset): for example, to do a groupby and compute some statistics or other, more complicated operations.
I am not familiar with every feature in pandas or numpy, so there may be an easier way to tackle this problem (a one-liner alternative is sketched at the end of this post), but here is how I usually solve the issue. One can always write a for loop, of course, but that is not an elegant solution.
I will generate some synthetic data and explain the process.
We do the usual imports and set the seed.
import numpy as np
import pandas as pd
np.random.seed(1)
To generate data, first we sample 4 random integers between 3 and 6 (the upper bound of randint is exclusive). These numbers are picked arbitrarily to keep the example small; you can of course experiment with larger values. In this example we simulate having 4 recording sessions, each with between 3 and 6 rows of data (measurements).
counter_max = np.random.randint(3,7,4)
array([4, 6, 3, 3])
For each session we create counter values and two randomly sampled dummy variables. Then we create a dataframe from them.
inner_id = np.concatenate([np.arange(x) for x in counter_max])
var_a = np.concatenate([np.random.rand(x) for x in counter_max])
var_b = np.concatenate([np.random.rand(x) for x in counter_max])
df = pd.DataFrame({'inner_id': inner_id,
                   'var_a': var_a,
                   'var_b': var_b})
We use the diff() method to detect the transitions between sessions. When used with no arguments, diff shifts the column values by one row and subtracts them from the original values, so each row ends up holding its value minus the previous row's value.
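To see what that means, here is a toy series (the values are made up purely for illustration): a counter that resets once turns into a sequence with a single negative entry at the reset.

s = pd.Series([0, 1, 2, 0, 1])  # toy counter that resets after the third row
s.diff()                        # NaN, 1.0, 1.0, -2.0, 1.0 -- the reset is the negative value

Applying the same idea to our counter column: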
df['diff'] = df.inner_id.diff()
Here is the result:
df.head(7)
| index | inner_id | var_a | var_b | diff |
|---|---|---|---|---|
| 0 | 0 | 0.000114 | 0.140387 | NaN |
| 1 | 1 | 0.302333 | 0.198101 | 1.0 |
| 2 | 2 | 0.146756 | 0.800745 | 1.0 |
| 3 | 3 | 0.092339 | 0.968262 | 1.0 |
| 4 | 0 | 0.186260 | 0.313424 | -3.0 |
| 5 | 1 | 0.345561 | 0.692323 | 1.0 |
| 6 | 2 | 0.396767 | 0.876389 | 1.0 |
Next we need to extract the transitions, or breakpoints. A transition occurs exactly where the result of diff is negative. We also prepend 0 and append the last index + 1; this is needed because of the way pandas.cut() works: it needs the full set of bin edges, including the two outer ones.
break_points = np.concatenate(
    [np.array([0]),                               # start of the first session
     np.array(df[df.inner_id.diff() < 0].index),  # indices where the counter resets
     np.array([df.index[-1] + 1])],               # one past the last index
    axis=0)
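For the synthetic data above the counter resets at indices 4, 10 and 13, so:

break_points
array([ 0,  4, 10, 13, 16])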
We construct a pd.IntervalIndex object from our breakpoints.
interval_idx = pd.IntervalIndex.from_breaks(break_points, closed='left')
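For our breakpoints this index holds four left-closed intervals, [0, 4), [4, 10), [10, 13) and [13, 16), one per recording session.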
The cut method is generally used for binning or categorizing numerical ranges. Here we use it to categorize the index values of our dataframe, and then we immediately throw away the categories and keep only the codes, i.e. the numerical positions of the categories. Each index value gets labelled with the interval it falls into, and the position of that interval (0, 1, 2, ...) is exactly the session id we want.
df['id'] = pd.cut(df.index, bins=interval_idx, include_lowest=True).codes
Here is the resulting dataframe. Now we can easily do df.groupby('id'); a short example follows the table.
df
| index | inner_id | var_a | var_b | diff | id |
|---|---|---|---|---|---|
| 0 | 0 | 0.000114 | 0.140387 | NaN | 0 |
| 1 | 1 | 0.302333 | 0.198101 | 1.0 | 0 |
| 2 | 2 | 0.146756 | 0.800745 | 1.0 | 0 |
| 3 | 3 | 0.092339 | 0.968262 | 1.0 | 0 |
| 4 | 0 | 0.186260 | 0.313424 | -3.0 | 1 |
| 5 | 1 | 0.345561 | 0.692323 | 1.0 | 1 |
| 6 | 2 | 0.396767 | 0.876389 | 1.0 | 1 |
| 7 | 3 | 0.538817 | 0.894607 | 1.0 | 1 |
| 8 | 4 | 0.419195 | 0.085044 | 1.0 | 1 |
| 9 | 5 | 0.685220 | 0.039055 | 1.0 | 1 |
| 10 | 0 | 0.204452 | 0.169830 | -5.0 | 2 |
| 11 | 1 | 0.878117 | 0.878143 | 1.0 | 2 |
| 12 | 2 | 0.027388 | 0.098347 | 1.0 | 2 |
| 13 | 0 | 0.670468 | 0.421108 | -2.0 | 3 |
| 14 | 1 | 0.417305 | 0.957890 | 1.0 | 3 |
| 15 | 2 | 0.558690 | 0.533165 | 1.0 | 3 |
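With the id column in place, the per-session statistics mentioned at the beginning are one line away. A minimal sketch (size and mean are arbitrary picks, just to show the shape of the call):

session_stats = df.groupby('id').agg(
    n_rows=('inner_id', 'size'),  # number of measurements in the session
    var_a_mean=('var_a', 'mean'),
    var_b_mean=('var_b', 'mean'),
)

Each row of session_stats now summarizes one recording session.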
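As an aside, and in the spirit of my earlier caveat that there may be an easier way: for this particular problem the interval machinery can be skipped entirely by taking a cumulative sum over the reset mask. This is just a sketch of that alternative, but I believe it produces the same ids as the cut approach above:

df['id2'] = (df.inner_id.diff() < 0).cumsum()  # every counter reset bumps the running total by one

Here (df['id2'] == df['id']).all() should come out True.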