Use case
Sometimes you just want to capture the first (or last) event of something. Let’s say, you have a list of clients and want to capture their first purchase. This is useful if you want a list of new paying customers.
Dataset
We’re thinking about customers here, so let’s get the Online Retail Dataset from the UCI Machine Learning Repository. We can download this dataset directly using Pandas.
>>> import pandas as pd
>>> customers = pd.read_excel('https://archive.ics.uci.edu/ml/machine-learning-databases/00352/Online%20Retail.xlsx')
InvoiceNo | StockCode | Description | Quantity | InvoiceDate | UnitPrice | CustomerID | Country | |
---|---|---|---|---|---|---|---|---|
0 | 536365 | 85123A | WHITE HANGING HEART T-LIGHT HOLDER | 6 | 2010-12-01 08:26:00 | 2.55 | 17850 | United Kingdom |
1 | 536365 | 71053 | WHITE METAL LANTERN | 6 | 2010-12-01 08:26:00 | 3.39 | 17850 | United Kingdom |
2 | 536365 | 84406B | CREAM CUPID HEARTS COAT HANGER | 8 | 2010-12-01 08:26:00 | 2.75 | 17850 | United Kingdom |
3 | 536365 | 84029G | KNITTED UNION FLAG HOT WATER BOTTLE | 6 | 2010-12-01 08:26:00 | 3.39 | 17850 | United Kingdom |
4 | 536365 | 84029E | RED WOOLLY HOTTIE WHITE HEART. | 6 | 2010-12-01 08:26:00 | 3.39 | 17850 | United Kingdom |
5 | 536365 | 22752 | SET 7 BABUSHKA NESTING BOXES | 2 | 2010-12-01 08:26:00 | 7.65 | 17850 | United Kingdom |
6 | 536365 | 21730 | GLASS STAR FROSTED T-LIGHT HOLDER | 6 | 2010-12-01 08:26:00 | 4.25 | 17850 | United Kingdom |
7 | 536366 | 22633 | HAND WARMER UNION JACK | 6 | 2010-12-01 08:28:00 | 1.85 | 17850 | United Kingdom |
8 | 536366 | 22632 | HAND WARMER RED POLKA DOT | 6 | 2010-12-01 08:28:00 | 1.85 | 17850 | United Kingdom |
9 | 536367 | 84879 | ASSORTED COLOUR BIRD ORNAMENT | 32 | 2010-12-01 08:34:00 | 1.69 | 13047 | United Kingdom |
Methodology
If you have a good eye you’ll notice that it’s one invoice for multiple products that were purchased by a customer, this is why the values in InvoiceNo
are duplicated. There’s a couple of things we can accomplish here:
- Get the first row for every
CustomerID
- Get the first invoice for every
CustomerID
- Get the total sum for a customer’s first time purchase
- Get the first purchase of the day
Get the first row for every CustomerID
Here’s a neat trick using the methods groupby
and head
to get the first row of a group:
customers.groupby('CustomerID').head(1)
InvoiceNo | StockCode | Description | Quantity | InvoiceDate | UnitPrice | CustomerID | Country |
---|---|---|---|---|---|---|---|
536365 | 85123A | WHITE HANGING HEART T-LIGHT HOLDER | 6 | 2010-12-01 08:26:00 | 2.55 | 17850.0 | United Kingdom |
536367 | 84879 | ASSORTED COLOUR BIRD ORNAMENT | 32 | 2010-12-01 08:34:00 | 1.69 | 13047.0 | United Kingdom |
536370 | 22728 | ALARM CLOCK BAKELIKE PINK | 24 | 2010-12-01 08:45:00 | 3.75 | 12583.0 | France |
536371 | 22086 | PAPER CHAIN KIT 50’S CHRISTMAS | 80 | 2010-12-01 09:00:00 | 2.55 | 13748.0 | United Kingdom |
536374 | 21258 | VICTORIAN SEWING BOX LARGE | 32 | 2010-12-01 09:09:00 | 10.95 | 15100.0 | United Kingdom |
536376 | 22114 | HOT WATER BOTTLE TEA AND SYMPATHY | 48 | 2010-12-01 09:32:00 | 3.45 | 15291.0 | United Kingdom |
536378 | 22386 | JUMBO BAG PINK POLKADOT | 10 | 2010-12-01 09:37:00 | 1.95 | 14688.0 | United Kingdom |
536380 | 22961 | JAM MAKING SET PRINTED | 24 | 2010-12-01 09:41:00 | 1.45 | 17809.0 | United Kingdom |
536381 | 22139 | RETROSPOT TEA SET CERAMIC 11 PC | 23 | 2010-12-01 09:41:00 | 4.25 | 15311.0 | United Kingdom |
C536379 | D | Discount | -1 | 2010-12-01 09:41:00 | 27.5 | 14527.0 | United Kingdom |
You’ll notice that the values in CustomerID
are unique and only the first row is presented.
Get the first invoice for every CustomerID
It’s great to have the first row, however, you’ll notice that in our dataset the values in InvoiceNo
are repeated, this is because many customers buy multiple things in one transaction. So, it doesn’t make sense in a real scenario to filter only the first row.
Instead, we want to keep every item from the first invoice only. We can do that by making a list of the first transaction by customer and then applying a mask to our dataset:
# Create a list of the first invoice by client
first_invoices = customers.groupby(['CustomerID']).InvoiceNo.first().to_list()
# Filter first invoices by client
customers[customers.InvoiceNo.isin(first_invoices)]
InvoiceNo | StockCode | Description | Quantity | InvoiceDate | UnitPrice | CustomerID | Country |
---|---|---|---|---|---|---|---|
536365 | 85123A | WHITE HANGING HEART T-LIGHT HOLDER | 6 | 2010-12-01 08:26:00 | 2.55 | 17850.0 | United Kingdom |
536365 | 71053 | WHITE METAL LANTERN | 6 | 2010-12-01 08:26:00 | 3.39 | 17850.0 | United Kingdom |
536365 | 84406B | CREAM CUPID HEARTS COAT HANGER | 8 | 2010-12-01 08:26:00 | 2.75 | 17850.0 | United Kingdom |
536365 | 84029G | KNITTED UNION FLAG HOT WATER BOTTLE | 6 | 2010-12-01 08:26:00 | 3.39 | 17850.0 | United Kingdom |
536365 | 84029E | RED WOOLLY HOTTIE WHITE HEART. | 6 | 2010-12-01 08:26:00 | 3.39 | 17850.0 | United Kingdom |
536365 | 22752 | SET 7 BABUSHKA NESTING BOXES | 2 | 2010-12-01 08:26:00 | 7.65 | 17850.0 | United Kingdom |
536365 | 21730 | GLASS STAR FROSTED T-LIGHT HOLDER | 6 | 2010-12-01 08:26:00 | 4.25 | 17850.0 | United Kingdom |
536367 | 84879 | ASSORTED COLOUR BIRD ORNAMENT | 32 | 2010-12-01 08:34:00 | 1.69 | 13047.0 | United Kingdom |
536367 | 22745 | POPPY’S PLAYHOUSE BEDROOM | 6 | 2010-12-01 08:34:00 | 2.1 | 13047.0 | United Kingdom |
536367 | 22748 | POPPY’S PLAYHOUSE KITCHEN | 6 | 2010-12-01 08:34:00 | 2.1 | 13047.0 | United Kingdom |
536367 | 22749 | FELTCRAFT PRINCESS CHARLOTTE DOLL | 8 | 2010-12-01 08:34:00 | 3.75 | 13047.0 | United Kingdom |
536367 | 22310 | IVORY KNITTED MUG COSY | 6 | 2010-12-01 08:34:00 | 1.65 | 13047.0 | United Kingdom |
536367 | 84969 | BOX OF 6 ASSORTED COLOUR TEASPOONS | 6 | 2010-12-01 08:34:00 | 4.25 | 13047.0 | United Kingdom |
536367 | 22623 | BOX OF VINTAGE JIGSAW BLOCKS | 3 | 2010-12-01 08:34:00 | 4.95 | 13047.0 | United Kingdom |
536367 | 22622 | BOX OF VINTAGE ALPHABET BLOCKS | 2 | 2010-12-01 08:34:00 | 9.95 | 13047.0 | United Kingdom |
536367 | 21754 | HOME BUILDING BLOCK WORD | 3 | 2010-12-01 08:34:00 | 5.95 | 13047.0 | United Kingdom |
536367 | 21755 | LOVE BUILDING BLOCK WORD | 3 | 2010-12-01 08:34:00 | 5.95 | 13047.0 | United Kingdom |
536367 | 21777 | RECIPE BOX WITH METAL HEART | 4 | 2010-12-01 08:34:00 | 7.95 | 13047.0 | United Kingdom |
536367 | 48187 | DOORMAT NEW ENGLAND | 4 | 2010-12-01 08:34:00 | 7.95 | 13047.0 | United Kingdom |
536370 | 22728 | ALARM CLOCK BAKELIKE PINK | 24 | 2010-12-01 08:45:00 | 3.75 | 12583.0 | France |
Get the total sum for a customer’s first time purchase
What we want to do in this context is to summarize the customer’s first purchase by multiplying the item Quantity
times the UnitPrice
and storing the operation result in Total
.
Because we’re only interested in the first purchase, we’ll continue to use the dataframe first_purchase
that we just created.
# Filter first invoices by client
first_purchase = customers[customers.InvoiceNo.isin(first_invoices)].copy()
first_purchase['Total'] = first_purchase.Quantity * first_purchase.UnitPrice
first_purchase[['CustomerID', 'Description', 'Quantity', 'UnitPrice', 'Total']]
CustomerID | Description | Quantity | UnitPrice | Total |
---|---|---|---|---|
17850.0 | WHITE HANGING HEART T-LIGHT HOLDER | 6 | 2.55 | 15.299999999999999 |
17850.0 | WHITE METAL LANTERN | 6 | 3.39 | 20.34 |
17850.0 | CREAM CUPID HEARTS COAT HANGER | 8 | 2.75 | 22.0 |
17850.0 | KNITTED UNION FLAG HOT WATER BOTTLE | 6 | 3.39 | 20.34 |
17850.0 | RED WOOLLY HOTTIE WHITE HEART. | 6 | 3.39 | 20.34 |
17850.0 | SET 7 BABUSHKA NESTING BOXES | 2 | 7.65 | 15.3 |
17850.0 | GLASS STAR FROSTED T-LIGHT HOLDER | 6 | 4.25 | 25.5 |
13047.0 | ASSORTED COLOUR BIRD ORNAMENT | 32 | 1.69 | 54.08 |
13047.0 | POPPY’S PLAYHOUSE BEDROOM | 6 | 2.1 | 12.600000000000001 |
13047.0 | POPPY’S PLAYHOUSE KITCHEN | 6 | 2.1 | 12.600000000000001 |
Now we can see how much users spent on their first purchase.
first_purchase.groupby(['CustomerID']).Total.sum()
CustomerID | Total |
---|---|
12346.0 | 77183.6 |
12347.0 | 711.79 |
12348.0 | 892.8000000000001 |
12349.0 | 1757.55 |
12350.0 | 334.40000000000003 |
12352.0 | 296.49999999999994 |
12353.0 | 89.0 |
12354.0 | 1079.4 |
12355.0 | 459.4 |
12356.0 | 2271.6200000000003 |
Get the first purchase of the day
To achieve this we will do the following:
- Sort values by date: Just in case our dataset is not sorted by time.
- Extract the date from
InvoiceNo
: We will want to remove the hour from the timestamp. - Keep only one row per day: We will remove duplicates from the date column and keep the first one only.
- Drop the date column: We won’t need it anymore, so we’ll drop it.
Here’s how it works:
customers.sort_values('InvoiceDate', inplace=True)
customers['date'] = customers.InvoiceDate.dt.date
customers.drop_duplicates('date', keep='first', inplace=True)
customers.drop('date', axis=1, inplace=True)
customers
InvoiceNo | StockCode | Description | Quantity | InvoiceDate | UnitPrice | CustomerID | Country |
---|---|---|---|---|---|---|---|
536365 | 85123A | WHITE HANGING HEART T-LIGHT HOLDER | 6 | 2010-12-01 08:26:00 | 2.55 | 17850.0 | United Kingdom |
536598 | 21421 | PORCELAIN ROSE LARGE | 12 | 2010-12-02 07:48:00 | 1.25 | 13090.0 | United Kingdom |
536847 | 22067 | CHOC TRUFFLE GOLD TRINKET POT | 24 | 2010-12-03 09:31:00 | 1.65 | 17135.0 | United Kingdom |
537037 | 22988 | SOLDIERS EGG CUP | 12 | 2010-12-05 10:03:00 | 1.25 | 17243.0 | United Kingdom |
537226 | 22389 | PAPERWEIGHT SAVE THE PLANET | 6 | 2010-12-06 08:34:00 | 2.55 | 15987.0 | United Kingdom |
C537444 | 22580 | ADVENT CALENDAR GINGHAM SACK | -8 | 2010-12-07 08:42:00 | 5.95 | 14850.0 | United Kingdom |
537667 | 22158 | 3 HEARTS HANGING DECORATION RUSTIC | 128 | 2010-12-08 08:12:00 | 2.55 | 17870.0 | United Kingdom |
537879 | 22694 | WICKER STAR | 6 | 2010-12-09 08:34:00 | 2.1 | 14243.0 | United Kingdom |
538172 | 84212 | ASSORTED FLOWER COLOUR “LEIS” | 24 | 2010-12-10 09:33:00 | 0.65 | 15805.0 | United Kingdom |
538365 | 22932 | BAKING MOULD TOFFEE CUP CHOCOLATE | 8 | 2010-12-12 10:11:00 | 2.55 | 17243.0 | United Kingdom |
Notice how there’s only one row per day.
Conclusion
That’s it, you know now some quick methods to get the first event of something. Make sure to follow me in Twitter if you haven’t and want to be updated of my future posts: Follow @__franccesco