A trick for pandas groupby!

How to GroupBy an increasing column such as a counter? [In pandas]

Last updated on May 12, 2021 3 min read development

I often work with sensor measurements such as those from car CAN-bus. These measurements are often collected over multiple sessions, but rarely these sessions are indicated by a proper identifier. Sometimes timestamps (containing date and time) are available and one can infer based on the time interval between the recordings.
Today started work on a new dataset and this dataset instead of proper timestamp provides a column that is only a counter that starts counting at the beginning of recording, of course the measurements are supposed to be at 1Hz, therefore the counter is also a measure of time.
Details aside, there are times I encounter a similar dataset and I often like to separate these recording sessions or whatever they may represent in your dataset. For example to do a groupby¹ and compute some statistics or any other complicated operations.
I am not overly familiar with all every feature in pandas or numpy so there may be an easier to tackle this problem, but here is how I usually solve the issue. Also one can always write a for loop but that is not an elegant solution.
I will generate some synthetic data and explain the process.

We do the usual imports and set the seed.

import numpy as np
import pandas as pd
np.random.seed(1)

To generate data first we sample 4 random integers between 3 and 7. These numbers are picked arbitrary to keep the example small, of course you can experiment with larger values. In this example simulate having 4 recording sessions each between 3 to 7 rows of data (or measurements).

counter_max = np.random.randint(3,7,4)

array([4, 6, 3, 3])

For each session we create counter values and two randomly sampled dummy variables. Then we create a dataframe from them.

inner_id = np.concatenate([np.array(range(x)) for x in counter_max])
var_a = np.concatenate([np.random.rand(x) for x in counter_max])
var_b = np.concatenate([np.random.rand(x) for x in counter_max])

df = pd.DataFrame({'inner_id':inner_id,
              'var_a': var_a,
              'var_b': var_b}
)

We use the method diff()² to detect transitions between each session. diff when used with no arguments shifts the column values by one row and subtracts them from original values.

df['diff'] = df.inner_id.diff()

Here is the result:

df.head(7)

index	inner_id	var_a	var_b	diff
0	0	0.000114	0.140387	NaN
1	1	0.302333	0.198101	1.0
2	2	0.146756	0.800745	1.0
3	3	0.092339	0.968262	1.0
4	0	0.186260	0.313424	-3.0
5	1	0.345561	0.692323	1.0
6	2	0.396767	0.876389	1.0

Next we need to extract the transition or breakpoints. Transitions occur only when the result of diff is negative. We also prepend 0 and append last index+1 (this is needed because the way pandas.cut() works).

break_points = np.concatenate(
    [np.array([0]),
     np.array(df[df.inner_id.diff() < 0].index),
     np.array([df.index[-1]+1])
    ], axis=0)

We construct a pd.IntervalIndex object from our breakpoints.

interval_idx = pd.IntervalIndex.from_breaks(break_points, closed='left')

Method cut³ is generally use for binning or categorizing numerical ranges. Here we use it to categorize index values of our dataframe and then immediately we throw away categories and only keep the codes or numerical value of the categories.

df['id'] = pd.cut(df.index, bins=interval_idx,include_lowest=True).codes

Here is the resulting dataframe. Now we can easily do df.groupby('id')

df

index	inner_id	var_a	var_b	diff	id
0	0	0.000114	0.140387	NaN	0
1	1	0.302333	0.198101	1.0	0
2	2	0.146756	0.800745	1.0	0
3	3	0.092339	0.968262	1.0	0
4	0	0.186260	0.313424	-3.0	1
5	1	0.345561	0.692323	1.0	1
6	2	0.396767	0.876389	1.0	1
7	3	0.538817	0.894607	1.0	1
8	4	0.419195	0.085044	1.0	1
9	5	0.685220	0.039055	1.0	1
10	0	0.204452	0.169830	-5.0	2
11	1	0.878117	0.878143	1.0	2
12	2	0.027388	0.098347	1.0	2
13	0	0.670468	0.421108	-2.0	3
14	1	0.417305	0.957890	1.0	3
15	2	0.558690	0.533165	1.0	3

Pandas python numpy

A trick for pandas groupby!

AI Researcher