pandas

Recently I’ve needed to wrangle a large data dump from a database system that manages to mangle exports of data in certain circumstances.

In this particular instance, the data exported out of the database had the ItemID missing from every ‘record’ after the first record exported for each unique itemID (as per the example table below).

ItemIDDateContent
01234528/03/2019Hello World.
27/03/2019The quick brown fox jumps over the lazy dog.
26/03/2019Lorem ipsum dolor sit amet, consectetur adipiscing elit.
54321028/01/2018Danish pastries are deliciously delightful and delicate.
27/01/2018Excepteur sint occaecat cupidatat non proident.
26/01/2018A lazy dog lets the quick brown fox jump over him.

Thankfully, this particular issue was fairly easy to fix with some very trivial python coding using the pandas package.

#!/usr/bin/python

# A quick thing to note about this solution, is that it has the
# potential to take quite some time on large++ datasets.
import pandas as pd

# Create a data frame from the export CSV file.
df = pd.read_csv('export.csv')

# Use a 'forward fill' to fill in the gaps for the 'ItemID' column in the
# data frame.
df['ItemID'].fillna(method='ffill', inplace=True)

# Convert the 'numpy.float64' type to a 'str' type.
# This is necessary in this particular instance.
# There may be a trailing '.0', so strip this out.
df['ItemID'] = df['ItemID'].astype(str).replace('\.0', '', regex=True)

# Convert the date into ISO 8601 - the one true date format
df['Date'] = pd.to_datetime(df['Date'], format='%d/%m/%Y')

# The date conversion will output the date in an Excel file
# as YYYY-mm-dd HH:MM:SS' - aka 2019-03-29 16:32:32
# If you only need the date, get rid of the time with this:
df['Date'] = df['Date'].dt.date

# Save back out to an Excel file because this was the target file
# required.
write_xlsx = pd.ExcelWriter('Report_Sample.xlsx')

# By default, pandas Excel output includes the data frame index and
# header, this is not necessary in the target Excel document.
df.to_excel(write_xlsx, 'Sample', index=False, header=False)
write_xlsx.save()

The pandas package is one that often comes highly recommended for manipulating data sets by many python users, and in this simple scenario it is easy to see why this is the case. It certainly is possible to write something in python that would not have required the installation of a package, but there are many circumstances where packages like pandas are easier to use to manipulate data than having to write your own package/module.