In my previous article I walked through interesting aspects of numpy. I first learned numpy back in college during a course on Artificial Intelligence. With my daytime work becoming more Python based these past few months, I took numpy back up.
Before this winter, I had never used pandas. Pandas is a data analysis library similar to numpy. In fact, pandas uses numpy arrays in many of its exposed methods. While numpy exposes an array data structure, pandas has two main data structures: Series and DataFrame. In general, pandas is commonly used for manipulating and analysing time series or table data (think SQL table or Excel spreadsheet)1.
Just like with numpy, I wanted to write an article about the interesting aspects of pandas. I read thoroughly about pandas and coded with it at my job, so I wanted to share some of my favorite features. This article isn't meant to teach pandas basics. Instead, it highlights what makes pandas unique and special compared to other libraries and languages.
Pandas is a Python library used for data analysis, commonly on table and time series data. Pandas was originally built to work with financial data, but its modern use cases are far broader2. Fields that use pandas include machine learning, economics, stock picking algorithms, and big data3.
Pandas has two main data types - Series and DataFrame. A series is a single-dimensional data structure containing an array of data and an array of labels4. Each label is associated with an element in the data array. The array of labels is called the index, and is of type Index. Listed below are two examples of the Series data structure, along with ways to access their indexes.
The first series implicitly defines an index while the second series explicitly passes an index argument to the Series() constructor. The first series' RangeIndex is a subtype of Index which holds a range of integers.
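As a minimal sketch of the two cases described above, the snippet below builds one series without an index argument and one with it. The values and labels are placeholders I made up for illustration, not the original data.

```python
import pandas as pd

# Hypothetical values; the original data is not shown in the text.
# No index argument: pandas generates a RangeIndex (0, 1, 2, ...).
series_implicit = pd.Series([5.2, 3.1, 4.0])
print(series_implicit.index)

# Explicit index argument: each label is paired with one data element.
series_explicit = pd.Series([5.2, 3.1, 4.0], index=["mon", "wed", "fri"])
print(series_explicit.index)

# Elements can then be looked up by label.
print(series_explicit["wed"])
```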
A data frame contains columns and rows of data, just like a table5. Data frames have indexes for rows and names for columns. There are many ways to create DataFrame objects, the most basic of which passes its constructor a dictionary. The following DataFrame represents some of my workouts in February.
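Here is a rough sketch of the dictionary-based constructor. The column names and workout values are invented placeholders, so treat them as an assumption rather than my real February log.

```python
import pandas as pd

# Placeholder workout data (dates, types, and mileage are made up).
# Each dictionary key becomes a column name; each list becomes a column.
workouts = pd.DataFrame(
    {
        "date": ["2020-02-01", "2020-02-03", "2020-02-04"],
        "type": ["run", "ski", "run"],
        "miles": [5.0, 8.5, 3.2],
    }
)

print(workouts)
print(workouts.index)    # row labels (a RangeIndex by default)
print(workouts.columns)  # column names
```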
The full pandas library builds upon the Series and DataFrame objects. The remainder of this article discusses aspects of pandas that I found most interesting.
Pandas has many similarities to the R programming language, most noticeably its DataFrame object. In R, data.frame is a fundamental built-in object with similar functionality to pandas' DataFrame6,7. The following data frame in R is nearly identical to the one in pandas.
R is a domain-specific programming language for data analysis and statistical computing. Although I haven't seen it explicitly documented, the R programming language seems to have inspired the creation of pandas (and numpy). For example, R has array vectorization and conditional indexing, features found in both pandas and numpy (although noticeably missing from the base Python language). The following code sample demonstrates vectorization operations in R:
Conditional indexing is shown below:
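For comparison, here is a small sketch of the same two ideas on the Python side, using numpy and pandas rather than R. The mileage values are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical run distances in miles.
miles = np.array([3.0, 5.5, 2.0])

# Vectorization: the multiplication applies to every element at once.
kilometers = miles * 1.609344

# Conditional indexing: a boolean mask selects the matching elements.
long_runs = miles[miles > 2.5]

# The same mask-based selection works on a pandas Series.
miles_series = pd.Series(miles)
long_runs_series = miles_series[miles_series > 2.5]

print(kilometers)
print(long_runs)
print(long_runs_series)
```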
I find it fascinating to compare languages and frameworks to see where features and ideas originated from. Although pandas and numpy were influenced heavily by prior tools and languages, their ease of use within the Python ecosystem is what makes them so valuable.
One of the great things about pandas is the multitude of ways to initialize a DataFrame or Series. In software engineering, it's generally a good idea when building an API to not assume the existing format of a user's data. For example, in Java, APIs that accept a data structure as an argument often declare their parameters as type Iterable<T>. By using the Iterable<T> interface, a user can pass whatever iterable structure their data already exists in, whether it be a list, set, queue, stack, tree, or something else8. All these data structures implement Iterable<T>, so they work with the API.
Pandas takes this concept to another level. The DataFrame constructor accepts many Python data structures as arguments, and the accompanying reader functions accept many different file formats. For example, CSV files can be turned into a data frame with read_csv() and database tables can be turned into a data frame with read_sql_table(). Other file formats that are easily turned into a data frame include Excel spreadsheets, HTML, and JSON. A full list of DataFrame input formats is found in the pandas documentation.
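To make this concrete, the sketch below builds the same small table three ways: from a dictionary, from a list of records, and from CSV text. The data is made up, and the CSV is read from an in-memory string (via io.StringIO) instead of a real file so the example stays self-contained.

```python
import io

import pandas as pd

# From a dictionary of columns.
from_dict = pd.DataFrame({"name": ["a", "b"], "miles": [3, 5]})

# From a list of row dictionaries (records).
from_records = pd.DataFrame(
    [{"name": "a", "miles": 3}, {"name": "b", "miles": 5}]
)

# From CSV text; a file path would work in place of the StringIO buffer.
csv_text = "name,miles\na,3\nb,5\n"
from_csv = pd.read_csv(io.StringIO(csv_text))

print(from_dict)
print(from_records)
print(from_csv)
```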
In my previous article on numpy, I discussed its advanced slicing and indexing mechanics. Pandas has similar functionality for its Series and DataFrame objects. First, here are some indexing and slicing examples on Series data structures.
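A few Series lookups, sketched with placeholder data, might look like this. Note that label-based slices with loc include the end label, while position-based slices with iloc exclude the end position.

```python
import pandas as pd

# Hypothetical series with string labels.
s = pd.Series([1, 2, 3, 4, 5], index=["a", "b", "c", "d", "e"])

single = s["b"]                 # one element by label
by_label = s.loc["b":"d"]       # label slice, end label included
by_position = s.iloc[1:3]       # position slice, end position excluded
picked = s[["a", "e"]]          # several labels at once
greater_than_two = s[s > 2]     # conditional (boolean mask) indexing

print(single)
print(by_label)
print(by_position)
print(picked)
print(greater_than_two)
```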
The same slicing and indexing functionality is available for DataFrame objects. The following data frame contains running PR (personal record) information for some of my friends and me in college.
Now, here are some slicing and indexing operations on the data frame.
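As an illustration, here is a sketch of common DataFrame selections. The runner labels, column names, and times (stored as seconds) are hypothetical stand-ins for the PR data.

```python
import pandas as pd

# Hypothetical PR table; times are in seconds and entirely made up.
prs = pd.DataFrame(
    {"mile_seconds": [268, 275, 290], "5k_seconds": [960, 1005, 1080]},
    index=["runner_a", "runner_b", "runner_c"],
)

mile_column = prs["mile_seconds"]            # one column as a Series
one_row = prs.loc["runner_b"]                # one row by label
first_two = prs.iloc[0:2]                    # rows by position
one_cell = prs.loc["runner_b", "mile_seconds"]
sub_280 = prs[prs["mile_seconds"] < 280]     # rows matching a condition

# Row condition and column selection combined.
slow_5k_miles = prs.loc[prs["5k_seconds"] > 1000, "mile_seconds"]

print(sub_280)
print(slow_5k_miles)
```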
An important takeaway from these code samples is that indexing and slicing in pandas are just as powerful as in numpy, with the added benefit of a tabular DataFrame data structure. Pandas also exposes many ways to manipulate data, from simple vectorization operations to complex "group by" expressions (which I will explain later). Some simple data manipulation examples are shown below.
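A minimal sketch of some basic manipulations, assuming a made-up table of runs with distance and duration columns:

```python
import pandas as pd

# Hypothetical runs; the column names and values are placeholders.
runs = pd.DataFrame({"miles": [3.0, 5.0, 2.0], "minutes": [24.0, 45.0, 15.0]})

# Vectorized arithmetic creates new columns from existing ones.
runs["pace"] = runs["minutes"] / runs["miles"]   # minutes per mile
runs["kilometers"] = runs["miles"] * 1.609344

# Sorting and filtering return new data frames.
fastest_first = runs.sort_values("pace")
long_runs = runs[runs["miles"] >= 3]

print(runs)
print(fastest_first)
print(long_runs)
```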
There are so many examples of pandas DataFrame manipulation operations, but these are some basic ones. By showing these code samples, I'm trying to demonstrate that DataFrame objects aren't simply for holding data, but also transforming and analyzing data to suit an application's needs. Pandas provides plenty of statistical operations as well, such as finding the sum or standard deviation of data. Math isn't my strong suit so I won't go over any statistical functions, but they are readily available and easy to use.
Along with tabular data, another common use case for pandas is holding time series data. Pandas has the strongest time-series functionality I've ever seen in a language or library, which is quite exciting!
Dealing with dates and times in programming languages is often a frustrating experience. When creating a library that handles dates and times, it's crucial that the basic API is easy to use and intuitive. Otherwise, date and time complexities such as timezones and daylight saving time become a nightmare to deal with. An example of a poorly made date API is the original Java Date class9, 10.
Luckily, the date and time API used in pandas is easy to understand and use. To use dates in a pandas DataFrame or Series, you can pass native Python datetime.datetime objects. Pandas converts them to its own Timestamp type, and a list of them used as an index becomes a DatetimeIndex.
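Here is a small sketch of a series indexed by datetime.datetime objects. The name mile_races comes from the article, but the race dates and times below are invented placeholders.

```python
from datetime import datetime

import pandas as pd

# Placeholder race results keyed by race date (values are made up).
mile_races = pd.Series(
    ["4:54", "4:47", "4:52", "4:48"],
    index=[
        datetime(2019, 12, 20),
        datetime(2020, 1, 11),
        datetime(2020, 2, 13),
        datetime(2020, 2, 27),
    ],
)

print(mile_races)
# The list of datetime objects is stored as a DatetimeIndex.
print(type(mile_races.index))
```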
Filtering data with a time series index is very easy. The following examples retrieve data from my mile_races series at certain indexes or slices.
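The lookups below sketch what this kind of filtering can look like, rebuilding a mile_races series with hypothetical dates and times. Strings are parsed as dates, and a partial string such as a year or month selects everything inside that period.

```python
from datetime import datetime

import pandas as pd

# Placeholder race results (dates and times are made up).
mile_races = pd.Series(
    ["4:54", "4:47", "4:52", "4:48"],
    index=pd.to_datetime(
        ["2019-12-20", "2020-01-11", "2020-02-13", "2020-02-27"]
    ),
)

on_day = mile_races.loc["2020-02-13"]            # a single race day
in_february = mile_races.loc["2020-02"]          # every race in a month
in_2020 = mile_races.loc["2020"]                 # every race in a year
window = mile_races.loc["1/1/2020":"2/15/2020"]  # slice, another date format
since_january = mile_races.loc[datetime(2020, 1, 1):]  # datetime objects work too

print(on_day)
print(in_february)
print(window)
```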
As you can see, indexing and slicing are very flexible and work with many common date formats. One of the annoyances of most date libraries is that they expect dates to be in certain formats. These libraries often fail and throw errors if formats don't meet expectations. Pandas is much more lenient.
Beyond the basics, the aspect I found most interesting about pandas time series functionality is resampling. Resampling is when the frequency of time series data is changed11. For example, daily time series data can be converted to weekly time series data.
There are two forms of resampling - downsampling and upsampling. Downsampling is when a higher frequency is converted to a lower frequency. An example of downsampling is converting weekly time series data to monthly time series data. Downsampling can be thought of as data compression12. Upsampling is when a lower frequency is converted to a higher frequency. An example of upsampling is converting monthly time series data to weekly time series data. Upsampling can be thought of as data expansion13.
To demonstrate downsampling, I created a data frame with all my runs in the month of February. Then, I resampled it to show weekly average run length and weekly mileage.
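A sketch of that downsampling step, using a handful of made-up February 2020 runs in place of my real log. The "W" frequency groups rows into weeks ending on Sunday, and agg() computes both the average run length and the total mileage per week.

```python
import pandas as pd

# Placeholder runs (dates and distances are made up).
runs = pd.DataFrame(
    {"miles": [3.0, 5.0, 4.0, 6.0, 2.0, 7.0]},
    index=pd.to_datetime(
        [
            "2020-02-03", "2020-02-05", "2020-02-08",
            "2020-02-10", "2020-02-14", "2020-02-16",
        ]
    ),
)

# Downsample from individual days to weekly buckets.
weekly = runs["miles"].resample("W").agg(["mean", "sum"])

print(weekly)
```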
To demonstrate upsampling, I created a data frame with my mile times per quarter (example: Q3 2019). Then, I resampled it to display data for missing quarters.
Missing data can be filled in with the ffill() function.
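The sketch below shows one way to do this upsampling, under a few assumptions of mine: mile times are stored as seconds, each quarter is represented by its start date, and the values are placeholders. Resampling to the quarter-start frequency ("QS") adds rows for the missing quarters, and ffill() carries the last known time forward into them.

```python
import pandas as pd

# Placeholder mile times recorded in only some quarters (values made up).
mile_times = pd.DataFrame(
    {"mile_seconds": [290, 281, 275]},
    index=pd.to_datetime(["2019-01-01", "2019-07-01", "2020-01-01"]),
)

# Upsample to every quarter; quarters without data hold NaN.
upsampled = mile_times.resample("QS").asfreq()

# Fill each missing quarter with the most recent earlier value.
filled = mile_times.resample("QS").ffill()

print(upsampled)
print(filled)
```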
The final aspect of pandas I'll talk about is its "group by" capabilities. In a basic sense, the pandas groupby() method is analogous to SQL's GROUP BY clause. In pandas, groupby() groups rows by the given column(s) and performs aggregations on the remaining columns. I'll walk you through one example to give an idea of how it works, but to fully understand groupby() you will have to experiment with it yourself.
I started by creating a data frame which includes my programming language usage statistics (as of March 2020).
Next, I reset the data frame's index. This makes the programming language names their own column. In the final data transformation before using groupby(), I melted the data frame on the programming language names column. Melting on this column results in a new data frame with a row for every language and year combination.
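These two transformations can be sketched as follows. The languages, years, and line counts are hypothetical, and the names "year" and "lines" for the melted columns are my own choice. After reset_index(), the old index lands in a column named "index" because the original index had no name.

```python
import pandas as pd

# Placeholder usage statistics: one row per language, one column per year.
usage = pd.DataFrame(
    {"2018": [1000, 500, 200], "2019": [3000, 2500, 100], "2020": [400, 900, 50]},
    index=["Java", "Python", "JavaScript"],
)

# Move the language names out of the index into a regular column.
usage = usage.reset_index()

# Melt: keep the language column, turn each year column into rows.
melted = usage.melt(id_vars=["index"], var_name="year", value_name="lines")

print(usage)
print(melted)
```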
Finally, it's time to use groupby()! In my examples, I group on the index column (which contains programming language names). groupby() returns a grouping object which aggregation functions are performed upon. My examples use the sum() and mean() aggregation functions to find the total lines coded all time and average lines coded per year, respectively.
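A minimal sketch of the grouping step, assuming a melted frame with hypothetical "index", "year", and "lines" columns and made-up numbers. I select the "lines" column before aggregating so that only the numeric data is summed and averaged.

```python
import pandas as pd

# Placeholder melted data: one row per language and year combination.
melted = pd.DataFrame(
    {
        "index": ["Java", "Python", "Java", "Python"],
        "year": ["2018", "2018", "2019", "2019"],
        "lines": [1000, 500, 3000, 2500],
    }
)

# groupby() returns a grouping object; aggregations run once per group.
by_language = melted.groupby("index")["lines"]

total_lines = by_language.sum()     # total lines coded all time
average_lines = by_language.mean()  # average lines coded per year

print(total_lines)
print(average_lines)
```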
Pandas is a really great library for tabular data, and fits into the Python ecosystem nicely. I use pandas all the time at work, especially when interacting with data coming from a relational database. I plan on incorporating pandas into my personal projects as well, which is a sign that pandas is both fun for personal use and powerful enough for use at the enterprise level. You can find all my pandas code samples on GitHub.