Use case

Sometimes you just want to capture the first (or last) event of something. Let’s say, you have a list of clients and want to capture their first purchase. This is useful if you want a list of new paying customers.

Dataset

We’re thinking about customers here, so let’s get the Online Retail Dataset from the UCI Machine Learning Repository. We can download this dataset directly using Pandas.

>>> import pandas as pd
>>> customers = pd.read_excel('https://archive.ics.uci.edu/ml/machine-learning-databases/00352/Online%20Retail.xlsx')
InvoiceNo StockCode Description Quantity InvoiceDate UnitPrice CustomerID Country
0 536365 85123A WHITE HANGING HEART T-LIGHT HOLDER 6 2010-12-01 08:26:00 2.55 17850 United Kingdom
1 536365 71053 WHITE METAL LANTERN 6 2010-12-01 08:26:00 3.39 17850 United Kingdom
2 536365 84406B CREAM CUPID HEARTS COAT HANGER 8 2010-12-01 08:26:00 2.75 17850 United Kingdom
3 536365 84029G KNITTED UNION FLAG HOT WATER BOTTLE 6 2010-12-01 08:26:00 3.39 17850 United Kingdom
4 536365 84029E RED WOOLLY HOTTIE WHITE HEART. 6 2010-12-01 08:26:00 3.39 17850 United Kingdom
5 536365 22752 SET 7 BABUSHKA NESTING BOXES 2 2010-12-01 08:26:00 7.65 17850 United Kingdom
6 536365 21730 GLASS STAR FROSTED T-LIGHT HOLDER 6 2010-12-01 08:26:00 4.25 17850 United Kingdom
7 536366 22633 HAND WARMER UNION JACK 6 2010-12-01 08:28:00 1.85 17850 United Kingdom
8 536366 22632 HAND WARMER RED POLKA DOT 6 2010-12-01 08:28:00 1.85 17850 United Kingdom
9 536367 84879 ASSORTED COLOUR BIRD ORNAMENT 32 2010-12-01 08:34:00 1.69 13047 United Kingdom

Methodology

If you have a good eye you’ll notice that it’s one invoice for multiple products that were purchased by a customer, this is why the values in InvoiceNo are duplicated. There’s a couple of things we can accomplish here:

  • Get the first row for every CustomerID
  • Get the first invoice for every CustomerID
  • Get the total sum for a customer’s first time purchase
  • Get the first purchase of the day

Get the first row for every CustomerID

Here’s a neat trick using the methods groupby and head to get the first row of a group:

customers.groupby('CustomerID').head(1)
InvoiceNo StockCode Description Quantity InvoiceDate UnitPrice CustomerID Country
536365 85123A WHITE HANGING HEART T-LIGHT HOLDER 6 2010-12-01 08:26:00 2.55 17850.0 United Kingdom
536367 84879 ASSORTED COLOUR BIRD ORNAMENT 32 2010-12-01 08:34:00 1.69 13047.0 United Kingdom
536370 22728 ALARM CLOCK BAKELIKE PINK 24 2010-12-01 08:45:00 3.75 12583.0 France
536371 22086 PAPER CHAIN KIT 50’S CHRISTMAS 80 2010-12-01 09:00:00 2.55 13748.0 United Kingdom
536374 21258 VICTORIAN SEWING BOX LARGE 32 2010-12-01 09:09:00 10.95 15100.0 United Kingdom
536376 22114 HOT WATER BOTTLE TEA AND SYMPATHY 48 2010-12-01 09:32:00 3.45 15291.0 United Kingdom
536378 22386 JUMBO BAG PINK POLKADOT 10 2010-12-01 09:37:00 1.95 14688.0 United Kingdom
536380 22961 JAM MAKING SET PRINTED 24 2010-12-01 09:41:00 1.45 17809.0 United Kingdom
536381 22139 RETROSPOT TEA SET CERAMIC 11 PC 23 2010-12-01 09:41:00 4.25 15311.0 United Kingdom
C536379 D Discount -1 2010-12-01 09:41:00 27.5 14527.0 United Kingdom

You’ll notice that the values in CustomerID are unique and only the first row is presented.

Get the first invoice for every CustomerID

It’s great to have the first row, however, you’ll notice that in our dataset the values in InvoiceNo are repeated, this is because many customers buy multiple things in one transaction. So, it doesn’t make sense in a real scenario to filter only the first row.

Instead, we want to keep every item from the first invoice only. We can do that by making a list of the first transaction by customer and then applying a mask to our dataset:

# Create a list of the first invoice by client
first_invoices = customers.groupby(['CustomerID']).InvoiceNo.first().to_list()

# Filter first invoices by client
customers[customers.InvoiceNo.isin(first_invoices)]
InvoiceNo StockCode Description Quantity InvoiceDate UnitPrice CustomerID Country
536365 85123A WHITE HANGING HEART T-LIGHT HOLDER 6 2010-12-01 08:26:00 2.55 17850.0 United Kingdom
536365 71053 WHITE METAL LANTERN 6 2010-12-01 08:26:00 3.39 17850.0 United Kingdom
536365 84406B CREAM CUPID HEARTS COAT HANGER 8 2010-12-01 08:26:00 2.75 17850.0 United Kingdom
536365 84029G KNITTED UNION FLAG HOT WATER BOTTLE 6 2010-12-01 08:26:00 3.39 17850.0 United Kingdom
536365 84029E RED WOOLLY HOTTIE WHITE HEART. 6 2010-12-01 08:26:00 3.39 17850.0 United Kingdom
536365 22752 SET 7 BABUSHKA NESTING BOXES 2 2010-12-01 08:26:00 7.65 17850.0 United Kingdom
536365 21730 GLASS STAR FROSTED T-LIGHT HOLDER 6 2010-12-01 08:26:00 4.25 17850.0 United Kingdom
536367 84879 ASSORTED COLOUR BIRD ORNAMENT 32 2010-12-01 08:34:00 1.69 13047.0 United Kingdom
536367 22745 POPPY’S PLAYHOUSE BEDROOM 6 2010-12-01 08:34:00 2.1 13047.0 United Kingdom
536367 22748 POPPY’S PLAYHOUSE KITCHEN 6 2010-12-01 08:34:00 2.1 13047.0 United Kingdom
536367 22749 FELTCRAFT PRINCESS CHARLOTTE DOLL 8 2010-12-01 08:34:00 3.75 13047.0 United Kingdom
536367 22310 IVORY KNITTED MUG COSY 6 2010-12-01 08:34:00 1.65 13047.0 United Kingdom
536367 84969 BOX OF 6 ASSORTED COLOUR TEASPOONS 6 2010-12-01 08:34:00 4.25 13047.0 United Kingdom
536367 22623 BOX OF VINTAGE JIGSAW BLOCKS 3 2010-12-01 08:34:00 4.95 13047.0 United Kingdom
536367 22622 BOX OF VINTAGE ALPHABET BLOCKS 2 2010-12-01 08:34:00 9.95 13047.0 United Kingdom
536367 21754 HOME BUILDING BLOCK WORD 3 2010-12-01 08:34:00 5.95 13047.0 United Kingdom
536367 21755 LOVE BUILDING BLOCK WORD 3 2010-12-01 08:34:00 5.95 13047.0 United Kingdom
536367 21777 RECIPE BOX WITH METAL HEART 4 2010-12-01 08:34:00 7.95 13047.0 United Kingdom
536367 48187 DOORMAT NEW ENGLAND 4 2010-12-01 08:34:00 7.95 13047.0 United Kingdom
536370 22728 ALARM CLOCK BAKELIKE PINK 24 2010-12-01 08:45:00 3.75 12583.0 France

Get the total sum for a customer’s first time purchase

What we want to do in this context is to summarize the customer’s first purchase by multiplying the item Quantity times the UnitPrice and storing the operation result in Total.

Because we’re only interested in the first purchase, we’ll continue to use the dataframe first_purchase that we just created.

# Filter first invoices by client
first_purchase = customers[customers.InvoiceNo.isin(first_invoices)].copy()

first_purchase['Total'] = first_purchase.Quantity * first_purchase.UnitPrice
first_purchase[['CustomerID', 'Description', 'Quantity', 'UnitPrice', 'Total']]
CustomerID Description Quantity UnitPrice Total
17850.0 WHITE HANGING HEART T-LIGHT HOLDER 6 2.55 15.299999999999999
17850.0 WHITE METAL LANTERN 6 3.39 20.34
17850.0 CREAM CUPID HEARTS COAT HANGER 8 2.75 22.0
17850.0 KNITTED UNION FLAG HOT WATER BOTTLE 6 3.39 20.34
17850.0 RED WOOLLY HOTTIE WHITE HEART. 6 3.39 20.34
17850.0 SET 7 BABUSHKA NESTING BOXES 2 7.65 15.3
17850.0 GLASS STAR FROSTED T-LIGHT HOLDER 6 4.25 25.5
13047.0 ASSORTED COLOUR BIRD ORNAMENT 32 1.69 54.08
13047.0 POPPY’S PLAYHOUSE BEDROOM 6 2.1 12.600000000000001
13047.0 POPPY’S PLAYHOUSE KITCHEN 6 2.1 12.600000000000001

Now we can see how much users spent on their first purchase.

first_purchase.groupby(['CustomerID']).Total.sum()
CustomerID Total
12346.0 77183.6
12347.0 711.79
12348.0 892.8000000000001
12349.0 1757.55
12350.0 334.40000000000003
12352.0 296.49999999999994
12353.0 89.0
12354.0 1079.4
12355.0 459.4
12356.0 2271.6200000000003

Get the first purchase of the day

To achieve this we will do the following:

  • Sort values by date: Just in case our dataset is not sorted by time.
  • Extract the date from InvoiceNo: We will want to remove the hour from the timestamp.
  • Keep only one row per day: We will remove duplicates from the date column and keep the first one only.
  • Drop the date column: We won’t need it anymore, so we’ll drop it.

Here’s how it works:

customers.sort_values('InvoiceDate', inplace=True)
customers['date'] = customers.InvoiceDate.dt.date
customers.drop_duplicates('date', keep='first', inplace=True)
customers.drop('date', axis=1, inplace=True)
customers
InvoiceNo StockCode Description Quantity InvoiceDate UnitPrice CustomerID Country
536365 85123A WHITE HANGING HEART T-LIGHT HOLDER 6 2010-12-01 08:26:00 2.55 17850.0 United Kingdom
536598 21421 PORCELAIN ROSE LARGE 12 2010-12-02 07:48:00 1.25 13090.0 United Kingdom
536847 22067 CHOC TRUFFLE GOLD TRINKET POT 24 2010-12-03 09:31:00 1.65 17135.0 United Kingdom
537037 22988 SOLDIERS EGG CUP 12 2010-12-05 10:03:00 1.25 17243.0 United Kingdom
537226 22389 PAPERWEIGHT SAVE THE PLANET 6 2010-12-06 08:34:00 2.55 15987.0 United Kingdom
C537444 22580 ADVENT CALENDAR GINGHAM SACK -8 2010-12-07 08:42:00 5.95 14850.0 United Kingdom
537667 22158 3 HEARTS HANGING DECORATION RUSTIC 128 2010-12-08 08:12:00 2.55 17870.0 United Kingdom
537879 22694 WICKER STAR 6 2010-12-09 08:34:00 2.1 14243.0 United Kingdom
538172 84212 ASSORTED FLOWER COLOUR “LEIS” 24 2010-12-10 09:33:00 0.65 15805.0 United Kingdom
538365 22932 BAKING MOULD TOFFEE CUP CHOCOLATE 8 2010-12-12 10:11:00 2.55 17243.0 United Kingdom

Notice how there’s only one row per day.

Conclusion

That’s it, you know now some quick methods to get the first event of something. Make sure to follow me in Twitter if you haven’t and want to be updated of my future posts: