Use case

Sometimes you just want to capture the first (or last) event of something. Let’s say, you have a list of clients and want to capture their first purchase. This is useful if you want a list of new paying customers.

Dataset

We’re thinking about customers here, so let’s get the Online Retail Dataset from the UCI Machine Learning Repository. We can download this dataset directly using Pandas.

>>> import pandas as pd
>>> customers = pd.read_excel('https://archive.ics.uci.edu/ml/machine-learning-databases/00352/Online%20Retail.xlsx')
InvoiceNoStockCodeDescriptionQuantityInvoiceDateUnitPriceCustomerIDCountry
053636585123AWHITE HANGING HEART T-LIGHT HOLDER62010-12-01 08:26:002.5517850United Kingdom
153636571053WHITE METAL LANTERN62010-12-01 08:26:003.3917850United Kingdom
253636584406BCREAM CUPID HEARTS COAT HANGER82010-12-01 08:26:002.7517850United Kingdom
353636584029GKNITTED UNION FLAG HOT WATER BOTTLE62010-12-01 08:26:003.3917850United Kingdom
453636584029ERED WOOLLY HOTTIE WHITE HEART.62010-12-01 08:26:003.3917850United Kingdom
553636522752SET 7 BABUSHKA NESTING BOXES22010-12-01 08:26:007.6517850United Kingdom
653636521730GLASS STAR FROSTED T-LIGHT HOLDER62010-12-01 08:26:004.2517850United Kingdom
753636622633HAND WARMER UNION JACK62010-12-01 08:28:001.8517850United Kingdom
853636622632HAND WARMER RED POLKA DOT62010-12-01 08:28:001.8517850United Kingdom
953636784879ASSORTED COLOUR BIRD ORNAMENT322010-12-01 08:34:001.6913047United Kingdom

Methodology

If you have a good eye you’ll notice that it’s one invoice for multiple products that were purchased by a customer, this is why the values in InvoiceNo are duplicated. There’s a couple of things we can accomplish here:

  • Get the first row for every CustomerID
  • Get the first invoice for every CustomerID
  • Get the total sum for a customer’s first time purchase
  • Get the first purchase of the day

Get the first row for every CustomerID

Here’s a neat trick using the methods groupby and head to get the first row of a group:

customers.groupby('CustomerID').head(1)
InvoiceNoStockCodeDescriptionQuantityInvoiceDateUnitPriceCustomerIDCountry
53636585123AWHITE HANGING HEART T-LIGHT HOLDER62010-12-01 08:26:002.5517850.0United Kingdom
53636784879ASSORTED COLOUR BIRD ORNAMENT322010-12-01 08:34:001.6913047.0United Kingdom
53637022728ALARM CLOCK BAKELIKE PINK242010-12-01 08:45:003.7512583.0France
53637122086PAPER CHAIN KIT 50’S CHRISTMAS802010-12-01 09:00:002.5513748.0United Kingdom
53637421258VICTORIAN SEWING BOX LARGE322010-12-01 09:09:0010.9515100.0United Kingdom
53637622114HOT WATER BOTTLE TEA AND SYMPATHY482010-12-01 09:32:003.4515291.0United Kingdom
53637822386JUMBO BAG PINK POLKADOT102010-12-01 09:37:001.9514688.0United Kingdom
53638022961JAM MAKING SET PRINTED242010-12-01 09:41:001.4517809.0United Kingdom
53638122139RETROSPOT TEA SET CERAMIC 11 PC232010-12-01 09:41:004.2515311.0United Kingdom
C536379DDiscount-12010-12-01 09:41:0027.514527.0United Kingdom

You’ll notice that the values in CustomerID are unique and only the first row is presented.

Get the first invoice for every CustomerID

It’s great to have the first row, however, you’ll notice that in our dataset the values in InvoiceNo are repeated, this is because many customers buy multiple things in one transaction. So, it doesn’t make sense in a real scenario to filter only the first row.

Instead, we want to keep every item from the first invoice only. We can do that by making a list of the first transaction by customer and then applying a mask to our dataset:

# Create a list of the first invoice by client
first_invoices = customers.groupby(['CustomerID']).InvoiceNo.first().to_list()

# Filter first invoices by client
customers[customers.InvoiceNo.isin(first_invoices)]
InvoiceNoStockCodeDescriptionQuantityInvoiceDateUnitPriceCustomerIDCountry
53636585123AWHITE HANGING HEART T-LIGHT HOLDER62010-12-01 08:26:002.5517850.0United Kingdom
53636571053WHITE METAL LANTERN62010-12-01 08:26:003.3917850.0United Kingdom
53636584406BCREAM CUPID HEARTS COAT HANGER82010-12-01 08:26:002.7517850.0United Kingdom
53636584029GKNITTED UNION FLAG HOT WATER BOTTLE62010-12-01 08:26:003.3917850.0United Kingdom
53636584029ERED WOOLLY HOTTIE WHITE HEART.62010-12-01 08:26:003.3917850.0United Kingdom
53636522752SET 7 BABUSHKA NESTING BOXES22010-12-01 08:26:007.6517850.0United Kingdom
53636521730GLASS STAR FROSTED T-LIGHT HOLDER62010-12-01 08:26:004.2517850.0United Kingdom
53636784879ASSORTED COLOUR BIRD ORNAMENT322010-12-01 08:34:001.6913047.0United Kingdom
53636722745POPPY’S PLAYHOUSE BEDROOM62010-12-01 08:34:002.113047.0United Kingdom
53636722748POPPY’S PLAYHOUSE KITCHEN62010-12-01 08:34:002.113047.0United Kingdom
53636722749FELTCRAFT PRINCESS CHARLOTTE DOLL82010-12-01 08:34:003.7513047.0United Kingdom
53636722310IVORY KNITTED MUG COSY62010-12-01 08:34:001.6513047.0United Kingdom
53636784969BOX OF 6 ASSORTED COLOUR TEASPOONS62010-12-01 08:34:004.2513047.0United Kingdom
53636722623BOX OF VINTAGE JIGSAW BLOCKS32010-12-01 08:34:004.9513047.0United Kingdom
53636722622BOX OF VINTAGE ALPHABET BLOCKS22010-12-01 08:34:009.9513047.0United Kingdom
53636721754HOME BUILDING BLOCK WORD32010-12-01 08:34:005.9513047.0United Kingdom
53636721755LOVE BUILDING BLOCK WORD32010-12-01 08:34:005.9513047.0United Kingdom
53636721777RECIPE BOX WITH METAL HEART42010-12-01 08:34:007.9513047.0United Kingdom
53636748187DOORMAT NEW ENGLAND42010-12-01 08:34:007.9513047.0United Kingdom
53637022728ALARM CLOCK BAKELIKE PINK242010-12-01 08:45:003.7512583.0France

Get the total sum for a customer’s first time purchase

What we want to do in this context is to summarize the customer’s first purchase by multiplying the item Quantity times the UnitPrice and storing the operation result in Total.

Because we’re only interested in the first purchase, we’ll continue to use the dataframe first_purchase that we just created.

# Filter first invoices by client
first_purchase = customers[customers.InvoiceNo.isin(first_invoices)].copy()

first_purchase['Total'] = first_purchase.Quantity * first_purchase.UnitPrice
first_purchase[['CustomerID', 'Description', 'Quantity', 'UnitPrice', 'Total']]
CustomerIDDescriptionQuantityUnitPriceTotal
17850.0WHITE HANGING HEART T-LIGHT HOLDER62.5515.299999999999999
17850.0WHITE METAL LANTERN63.3920.34
17850.0CREAM CUPID HEARTS COAT HANGER82.7522.0
17850.0KNITTED UNION FLAG HOT WATER BOTTLE63.3920.34
17850.0RED WOOLLY HOTTIE WHITE HEART.63.3920.34
17850.0SET 7 BABUSHKA NESTING BOXES27.6515.3
17850.0GLASS STAR FROSTED T-LIGHT HOLDER64.2525.5
13047.0ASSORTED COLOUR BIRD ORNAMENT321.6954.08
13047.0POPPY’S PLAYHOUSE BEDROOM62.112.600000000000001
13047.0POPPY’S PLAYHOUSE KITCHEN62.112.600000000000001

Now we can see how much users spent on their first purchase.

first_purchase.groupby(['CustomerID']).Total.sum()
CustomerIDTotal
12346.077183.6
12347.0711.79
12348.0892.8000000000001
12349.01757.55
12350.0334.40000000000003
12352.0296.49999999999994
12353.089.0
12354.01079.4
12355.0459.4
12356.02271.6200000000003

Get the first purchase of the day

To achieve this we will do the following:

  • Sort values by date: Just in case our dataset is not sorted by time.
  • Extract the date from InvoiceNo: We will want to remove the hour from the timestamp.
  • Keep only one row per day: We will remove duplicates from the date column and keep the first one only.
  • Drop the date column: We won’t need it anymore, so we’ll drop it.

Here’s how it works:

customers.sort_values('InvoiceDate', inplace=True)
customers['date'] = customers.InvoiceDate.dt.date
customers.drop_duplicates('date', keep='first', inplace=True)
customers.drop('date', axis=1, inplace=True)
customers
InvoiceNoStockCodeDescriptionQuantityInvoiceDateUnitPriceCustomerIDCountry
53636585123AWHITE HANGING HEART T-LIGHT HOLDER62010-12-01 08:26:002.5517850.0United Kingdom
53659821421PORCELAIN ROSE LARGE122010-12-02 07:48:001.2513090.0United Kingdom
53684722067CHOC TRUFFLE GOLD TRINKET POT242010-12-03 09:31:001.6517135.0United Kingdom
53703722988SOLDIERS EGG CUP122010-12-05 10:03:001.2517243.0United Kingdom
53722622389PAPERWEIGHT SAVE THE PLANET62010-12-06 08:34:002.5515987.0United Kingdom
C53744422580ADVENT CALENDAR GINGHAM SACK-82010-12-07 08:42:005.9514850.0United Kingdom
537667221583 HEARTS HANGING DECORATION RUSTIC1282010-12-08 08:12:002.5517870.0United Kingdom
53787922694WICKER STAR62010-12-09 08:34:002.114243.0United Kingdom
53817284212ASSORTED FLOWER COLOUR “LEIS”242010-12-10 09:33:000.6515805.0United Kingdom
53836522932BAKING MOULD TOFFEE CUP CHOCOLATE82010-12-12 10:11:002.5517243.0United Kingdom

Notice how there’s only one row per day.

Conclusion

That’s it, you know now some quick methods to get the first event of something. Make sure to follow me in Twitter if you haven’t and want to be updated of my future posts: