Organize pandas notebooks with cool hacks

Published:

![messy-notebook](/assets/img/sample/messy_nb.png)

Does this messy notebook ring a bell? I am sure you have created or encountered a similar notebook while performing data analysis tasks in pandas. Pandas is widely used by data scientists and ML engineers all around the world for all kinds of data-related tasks: data cleaning and preprocessing, analysis, manipulation, conversion, and so on. However, most of us are not using it well, as the example above shows, and that hurts our productivity. You might then wonder what the correct way to use pandas is. Is there a particular way to make the notebook clean and modular so that we can increase our productivity?

Luckily, there is a quick hack, or technique, whatever you may call it, that can greatly improve the workflow and make notebooks not only clean and well organized but also highly productive and efficient. The good thing is that you don't need to install any extra packages or libraries. In the end, your notebook will look something like this.

![Clean notebook](/assets/img/sample/clean_nb.png)

> Note: Dark mode is available on this website. You can switch between the modes by clicking the leftmost button at the bottom of the left sidebar. ![dark_mode](/assets/img/sample/dark_mode.png)

# Untitled12.ipynb

The approach to clean and well-organized pandas notebooks described here was explored in the presentation [Untitled12.ipynb](https://pydata.org/eindhoven2019/schedule/presentation/19/untitled12ipynb/) by [Vincent D. Warmerdam](https://twitter.com/fishnets88?ref_src=twsrc%5Egoogle%7Ctwcamp%5Eserp%7Ctwgr%5Eauthor) at [PyData Eindhoven 2019](https://pydata.org/eindhoven2019/). The talk, [Untitled12.ipynb: Prevent Miles of Scrolling, Reduce the Spaghetti Code from the Copy Pasta](https://www.youtube.com/watch?v=MpFZUshKypk&t=1292s), has been uploaded to YouTube as well; you can watch the video there if you want.

In this article, I will briefly summarize the presentation by [Vincent D. Warmerdam](https://twitter.com/fishnets88?ref_src=twsrc%5Egoogle%7Ctwcamp%5Eserp%7Ctwgr%5Eauthor) and then move on to the code implementation (solution) and a few code examples based on the methods used in his presentation.

**The Untitled phenomena**

![Untitled12.ipynb](/assets/img/sample/Untitled12.png)

He began his talk by introducing a term called **`Untitled phenomena`**. The term refers to the bad practice of not naming notebook files, which eventually creates an unorganized pile of Untitled notebooks. Fittingly, he also named the presentation **`Untitled12.ipynb`**. It is not only our naming habits that need to improve, but also the way we organize code inside the notebook. Copying and pasting code multiple times creates spaghetti code, which is especially common in data science Jupyter notebooks. The goal of his talk was to uncover a pattern for pandas that prevents endless scrolling and makes the code behave like Lego. He also gave some useful tips and tricks on how to prevent miles of scrolling and reduce the spaghetti code when creating Jupyter notebooks.

## > [**Skip to coding solution**](#solution)

I have first written a summary of the talk Untitled12.ipynb and explored some common problems in the usual coding style before moving to the solution. If you want to jump directly to the coding solution for creating a clean pandas notebook using a pipeline, click the link above. However, I recommend reading about the common problems before going to the solution.

# Contents

I will be talking about the following topics, which more or less revolve around his talk.

- [Importance of Workflow](#importance)
- [The Usual coding style](#current)
- [Problems in the usual coding style](#problems)
- [Coding Solution](#solution)
- [Advantages](#advantages)

## Importance of Workflow

He began the presentation by discussing the following points, which highlight the importance of workflows and the need for Jupyter notebooks and pandas over Excel:

- We want to separate the data from the analysis: the analysis should not modify the raw data. The raw data should stay safe from these modifications so that it can be reused later as well. This is not possible in Excel.
- We want to be able to automate our analysis. The main aim of programming and workflow is automation. Our tasks become a lot easier if we can automate the analysis with a pandas script rather than redoing it every time in Excel.
- We want our analysis to be reproducible, i.e. we must be able to reproduce the same analysis results on the data at a later time in the future.
- We should not pay a third party obscene amounts of money for something as basic as arithmetic. This budget is better allocated towards innovation and education of staff.

However, the usual style of coding in pandas and Jupyter notebooks solves only the last point.

## The usual coding style

Let's explore the common practice of writing pandas code and try to point out the major problems in such an approach. First, I will show the general workflow that most of us follow while using pandas. I will be performing some analysis on the real COVID-19 dataset of U.S. states
obtained from [The COVID Tracking Project](https://covidtracking.com/), which is available under the [Creative Commons CC BY-NC-4.0 license](https://creativecommons.org/licenses/by-nc/4.0/). The dataset is updated each day between 4 pm and 5 pm EDT. After showing the common approach, I will point out the major pitfalls and then move on to the solution.

First, I download the U.S. COVID-19 dataset using the API provided by [The COVID Tracking Project](https://covidtracking.com/).

```python
!mkdir data
!wget -O data/covid19_us_states_daily.csv https://covidtracking.com/api/v1/states/daily.csv
!wget -O data/state_info.csv https://covidtracking.com/api/v1/states/info.csv
```

```
--2020-06-05 16:34:10--  https://covidtracking.com/api/v1/states/daily.csv
Resolving covidtracking.com (covidtracking.com)... 104.248.63.231, 2604:a880:400:d1::888:7001
Connecting to covidtracking.com (covidtracking.com)|104.248.63.231|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: unspecified [text/csv]
Saving to: ‘data/covid19_us_states_daily.csv’

data/covid19_us_sta  [ <=> ]  987.40K  3.11MB/s  in 0.3s

2020-06-05 16:34:11 (3.11 MB/s) - ‘data/covid19_us_states_daily.csv’ saved [1011093]

--2020-06-05 16:34:12--  https://covidtracking.com/api/v1/states/info.csv
Resolving covidtracking.com (covidtracking.com)... 104.248.50.87, 2604:a880:400:d1::888:7001
Connecting to covidtracking.com (covidtracking.com)|104.248.50.87|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: unspecified [text/csv]
Saving to: ‘data/state_info.csv’

data/state_info.csv  [ <=> ]  27.67K  --.-KB/s  in 0.02s

2020-06-05 16:34:13 (1.43 MB/s) - ‘data/state_info.csv’ saved [28329]
```

```python
import pandas as pd

# Importing plotly library for plotting interactive graphs
import plotly.graph_objects as go
from plotly.subplots import make_subplots
import plotly.express as px
import chart_studio
import chart_studio.plotly as py
```

The first step is generally to read or import the data.

```python
df = pd.read_csv('data/covid19_us_states_daily.csv', index_col='date')
df.head()
```
*(Output: the first five rows, one per state (AK, AL, AR, AS, AZ), indexed by the raw integer date `20200604`, with 34 columns such as `state`, `positive`, `negative`, `pending`, `hospitalizedCurrently`, `death`, `dateChecked`, `fips`, and `hash`.)*
After taking a glance at the data, I realize that the date is not formatted well, so I format it.

```python
df.index = pd.to_datetime(df.index, format="%Y%m%d")
df.head()
```
*(Output: the same five rows, now indexed by a proper datetime value, `2020-06-04`.)*
Then, I try to view some additional information about the dataset.

```python
df.info()
```

```
DatetimeIndex: 5113 entries, 2020-06-04 to 2020-01-22
Data columns (total 34 columns):
 #   Column                    Non-Null Count  Dtype
---  ------                    --------------  -----
 0   state                     5113 non-null   object
 1   positive                  5098 non-null   float64
 2   negative                  4902 non-null   float64
 3   pending                   842 non-null    float64
 4   hospitalizedCurrently     2591 non-null   float64
 5   hospitalizedCumulative    2318 non-null   float64
 6   inIcuCurrently            1362 non-null   float64
 7   inIcuCumulative           576 non-null    float64
 8   onVentilatorCurrently     1157 non-null   float64
 9   onVentilatorCumulative    198 non-null    float64
 10  recovered                 2409 non-null   float64
 11  dataQualityGrade          4012 non-null   object
 12  lastUpdateEt              4758 non-null   object
 13  dateModified              4758 non-null   object
 14  checkTimeEt               4758 non-null   object
 15  death                     4388 non-null   float64
 16  hospitalized              2318 non-null   float64
 17  dateChecked               4758 non-null   object
 18  fips                      5113 non-null   int64
 19  positiveIncrease          5113 non-null   int64
 20  negativeIncrease          5113 non-null   int64
 21  total                     5113 non-null   int64
 22  totalTestResults          5113 non-null   int64
 23  totalTestResultsIncrease  5113 non-null   int64
 24  posNeg                    5113 non-null   int64
 25  deathIncrease             5113 non-null   int64
 26  hospitalizedIncrease      5113 non-null   int64
 27  hash                      5113 non-null   object
 28  commercialScore           5113 non-null   int64
 29  negativeRegularScore      5113 non-null   int64
 30  negativeScore             5113 non-null   int64
 31  positiveScore             5113 non-null   int64
 32  score                     5113 non-null   int64
 33  grade                     0 non-null      float64
dtypes: float64(13), int64(14), object(7)
memory usage: 1.4+ MB
```

You can see that various columns are not of use. So, I decide to remove such columns.

```python
df.drop([*df.columns[4:10], *df.columns[11:15], 'posNeg', 'fips'], axis=1, inplace=True)
df.info()
```

```
DatetimeIndex: 5113 entries, 2020-06-04 to 2020-01-22
Data columns (total 22 columns):
 #   Column                    Non-Null Count  Dtype
---  ------                    --------------  -----
 0   state                     5113 non-null   object
 1   positive                  5098 non-null   float64
 2   negative                  4902 non-null   float64
 3   pending                   842 non-null    float64
 4   recovered                 2409 non-null   float64
 5   death                     4388 non-null   float64
 6   hospitalized              2318 non-null   float64
 7   dateChecked               4758 non-null   object
 8   positiveIncrease          5113 non-null   int64
 9   negativeIncrease          5113 non-null   int64
 10  total                     5113 non-null   int64
 11  totalTestResults          5113 non-null   int64
 12  totalTestResultsIncrease  5113 non-null   int64
 13  deathIncrease             5113 non-null   int64
 14  hospitalizedIncrease      5113 non-null   int64
 15  hash                      5113 non-null   object
 16  commercialScore           5113 non-null   int64
 17  negativeRegularScore      5113 non-null   int64
 18  negativeScore             5113 non-null   int64
 19  positiveScore             5113 non-null   int64
 20  score                     5113 non-null   int64
 21  grade                     0 non-null      float64
dtypes: float64(7), int64(12), object(3)
memory usage: 918.7+ KB
```

I also realize that there are a lot of missing (NaN or null) values. So, I replace the missing values with 0.

```python
df.fillna(value=0, inplace=True)
df.head()
```
*(Output: the first five rows after dropping the unneeded columns and filling the missing values with 0.)*
I also want to add a column with the full state name instead of the abbreviation. So, I merge the state info with the current dataframe.

```python
df2 = pd.read_csv('data/state_info.csv', usecols=['state', 'name'])
df3 = (df
       .reset_index()
       .merge(df2, on='state', how='left', left_index=True))
df3.head()
```
*(Output: the merged dataframe with a new `name` column holding the full state name (Alaska, Alabama, Arkansas, American Samoa, Arizona), but with `date` turned back into an ordinary column and a plain integer index.)*
I realize that the date index is lost, so I set the date column back as the index. It is also better to rename the `name` column to `state_name`.

```python
df3.set_index('date', inplace=True)
df3.rename(columns={'name': 'state_name'}, inplace=True)
df3.head()
```
*(Output: the cleaned dataframe, indexed by date again, with the new column renamed to `state_name`.)*
Now that the data is ready for some analysis, I decide to plot the death count in each state, indexed by date, using the interactive [plotly library](https://plotly.com/).

```python
fig1 = px.line(df3, x=df3.index, y='death', color='state')
fig1.update_layout(xaxis_title='date',
                   title='Total deaths in each state (Cumulative)')
py.plot(fig1, filename='daily_deaths', auto_open=True)
```

    'https://plotly.com/~ayush.kumar.shah/1/'
> Note: These plots are interactive, so you can zoom in or out, pinch, hover over the graph, download it, and so on.

Next, I decide to calculate the total deaths in the US across all states and plot them.

```python
df4 = df3.resample('D').sum()
df4.head()
```
*(Output: the first five rows of the daily-resampled dataframe, summed over all states, starting from 2020-01-22.)*
```python
fig2 = px.line(df4, x=df4.index, y='death')
fig2.update_layout(xaxis_title='date',
                   title='Total deaths in the U.S. (Cumulative)')
py.plot(fig2, filename='total_daily_deaths', auto_open=True)
```

    'https://plotly.com/~ayush.kumar.shah/4/'
I also want to calculate the number of active cases, i.e.

> active = positive - death - recovered

```python
df4['active'] = df4['positive'] - df4['death'] - df4['recovered']
```

Now, after calculating the active column, I want to plot active cases instead of deaths. So, I go back to the earlier plotting cell, replace `death` with `active`, and regenerate the plot, overwriting the cell in place:

    In [25]: df4['~~death~~'].plot()
    In [25]: df4['active'].plot()

```python
fig3 = px.line(df4, x=df4.index, y='active')
fig3.update_layout(xaxis_title='date',
                   title='Total active cases in the U.S. (Cumulative)')
py.plot(fig3, filename='total_daily_active', auto_open=True)
```

    'https://plotly.com/~ayush.kumar.shah/6/'
Then I decide to compute the statistics for the single month of May. Since the data is cumulative, I need to subtract April's figures from May's to get the increase during May, after which I plot the results.

```python
df5 = (df3.loc['2020-05']
          .groupby('state_name')
          .agg({'positive': 'first', 'negative': 'first', 'pending': 'first',
                'recovered': 'first', 'death': 'first', 'hospitalized': 'first',
                'total': 'first', 'totalTestResults': 'first',
                'deathIncrease': 'sum', 'hospitalizedIncrease': 'sum',
                'negativeIncrease': 'sum', 'positiveIncrease': 'sum',
                'totalTestResultsIncrease': 'sum'}))

df6 = (df3.loc['2020-04']
          .groupby('state_name')
          .agg({'positive': 'first', 'negative': 'first', 'pending': 'first',
                'recovered': 'first', 'death': 'first', 'hospitalized': 'first',
                'total': 'first', 'totalTestResults': 'first',
                'deathIncrease': 'sum', 'hospitalizedIncrease': 'sum',
                'negativeIncrease': 'sum', 'positiveIncrease': 'sum',
                'totalTestResultsIncrease': 'sum'}))

df7 = df5.sub(df6)
df7.head()
```
*(Output: per-state differences between the May and April figures, indexed by `state_name` (Alabama, Alaska, American Samoa, Arizona, Arkansas, ...).)*
```python
fig4 = px.bar(df7, x=df7.index, y='death')
fig4.update_layout(xaxis_title='state_name',
                   title='Total deaths in the US in May only')
py.plot(fig4, filename='total_deaths_May', auto_open=True)
```

    'https://plotly.com/~ayush.kumar.shah/12/'
## Problems in the usual coding style

Now that I have demonstrated the usual approach followed in pandas notebooks, let's discuss its problems.

### 1. Flow is disrupted

The flow of the notebook is difficult to follow and easily breaks. For example, we may create a variable below the plot that needs it. In the code above, **`df4['active']`** was computed in a cell below the plotting cell that uses it, so the notebook may fail when someone else runs it top to bottom. You also end up scrolling the notebook for miles and miles.

### 2. No reproducibility

When the notebook is shared, the other person has a hard time executing or even understanding it. For instance, the dataframe names carry no information about what each dataframe contains; they simply run from **`df`** to **`df7`** and create a lot of confusion. What you want instead is a notebook that is easy to iterate on and that you can share with your colleagues.

### 3. Difficult to move the code to production

With this approach, your code is not ready to move into production. You end up having to rewrite the whole notebook before moving it to production, which is not effective.

### 4. Unable to automate

The notebook in its current condition cannot be automated for analysis, since many problems can occur: errors during execution, missing files with the hard-coded names, and so on. Although the code may produce an interesting conclusion or the desired output, we cannot even be sure that the conclusion is correct.

Despite all these problems, this style of notebook is still common, because while coding people enjoy seeing the code work as they check the outputs, and so they keep coding in the same way.

## Coding Solution

### 1. Naming convention

Follow a naming convention for the notebook according to the task, as suggested by [Cookiecutter Data Science](https://drivendata.github.io/cookiecutter-data-science/#notebooks-are-for-exploration-and-communication), that shows the owner and the order in which the analysis was done. You can use the format **`<step>-<author>-<description>.ipynb`** (e.g., **`0.1-ayush-visualize-corona-us.ipynb`**).

### 2. Plan your steps beforehand

Load the data and then think in advance about all the steps of analysis or tasks you could be doing in the notebook. You don't need to work out the logic right away; just keep the steps in mind.

```python
df = pd.read_csv('data/covid19_us_states_daily.csv', index_col='date')
```

### 3. Create functions

You know that, initially, you want to clean the data and make sure the columns and indexes are in a proper, usable format. So why not create a function for each such subtask on the dataframe and name it accordingly?

> For example, initially you want to make the index a proper datetime object. Then you may want to remove the duplicates, then add the state name. Just add these functions without even thinking about the logic, and fill the logic in later. This way, you stay on track and don't get lost.

The functions themselves are created after the decorator, below.

### 4. Create proper decorators

Before adding the functions, let's also think about some additional utility that would be helpful. During a pandas analysis, you often check the shape, columns, and other information associated with the dataframe after performing an operation. A decorator can automate this.

A **`decorator`** is simply a function that expects a function and returns a function.
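If that sounds abstract, here is a minimal, generic sketch of the idea (toy names, not part of the COVID notebook): a decorator wraps a function and adds behaviour around it.

```python
def shout(func):
    """A decorator: it takes a function and returns a new, wrapped function."""
    def wrapper(*args, **kwargs):
        print(f"Calling {func.__name__} ...")   # extra behaviour added by the decorator
        result = func(*args, **kwargs)          # the original function still does its job
        print(f"{func.__name__} finished.")
        return result
    return wrapper

@shout                      # equivalent to: greet = shout(greet)
def greet(name):
    return f"Hello, {name}!"

print(greet("pandas"))      # prints the two log lines, then "Hello, pandas!"
```

The decorator we build next follows exactly this pattern, except that the wrapped function receives and returns a dataframe, and the extra behaviour is logging its shape, columns, and index.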
It's really functional, right? Haha. Don't let the definition confuse you; it is not as difficult as it sounds, and we will see how it works in the code below. If you are not familiar with decorators or want to learn more about them, you can read the [article by Geir Arne Hjelle](https://realpython.com/primer-on-python-decorators/#simple-decorators).

```python
import datetime as dt


def df_info(f):
    def wrapper(df, *args, **kwargs):
        tic = dt.datetime.now()
        result = f(df, *args, **kwargs)
        toc = dt.datetime.now()
        print("\n\n{} took {} time\n".format(f.__name__, toc - tic))
        print("After applying {}\n".format(f.__name__))
        print("Shape of df = {}\n".format(result.shape))
        print("Columns of df are {}\n".format(result.columns))
        print("Index of df is {}\n".format(result.index))
        for i in range(100):
            print("-", end='')
        return result
    return wrapper
```

We have created a decorator called **`df_info`** which displays information such as the time taken by the function, and the shape, columns, and index of the dataframe after applying any function **`f`**. The advantage of using a decorator is that we get logging for free. You can modify the decorator according to the information you want to log or display after performing an operation on the dataframe.

Now we create the functions we planned and decorate them with **`@df_info`**. Decorating a function `f` this way replaces it with `df_info(f)`, so calling `f(df, *args, **kwargs)` actually runs `df_info(f)(df, *args, **kwargs)`.

```python
@df_info
def create_dateindex(df):
    df.index = pd.to_datetime(df.index, format="%Y%m%d")
    return df


@df_info
def remove_columns(df):
    df.drop([*df.columns[4:10], *df.columns[11:15], 'posNeg', 'fips'],
            axis=1, inplace=True)
    return df


@df_info
def fill_missing(df):
    df.fillna(value=0, inplace=True)
    return df


@df_info
def add_state_name(df):
    _df = pd.read_csv('data/state_info.csv', usecols=['state', 'name'])
    df = (df
          .reset_index()
          .merge(_df, on='state', how='left', left_index=True))
    df.set_index('date', inplace=True)
    df.rename(columns={'name': 'state_name'}, inplace=True)
    return df


@df_info
def drop_state(df):
    df.drop(columns=['state'], inplace=True)
    return df


@df_info
def sample_daily(df):
    df = df.resample('D').sum()
    return df


@df_info
def add_active_cases(df):
    df['active'] = df['positive'] - df['death'] - df['recovered']
    return df


def aggregate_monthly(df, month):
    df = (df.loc[month]
            .groupby('state_name')
            .agg({'positive': 'first', 'negative': 'first', 'pending': 'first',
                  'recovered': 'first', 'death': 'first', 'hospitalized': 'first',
                  'total': 'first', 'totalTestResults': 'first',
                  'deathIncrease': 'sum', 'hospitalizedIncrease': 'sum',
                  'negativeIncrease': 'sum', 'positiveIncrease': 'sum',
                  'totalTestResultsIncrease': 'sum'}))
    return df


@df_info
def create_month_only(df, month):
    df_current = aggregate_monthly(df, month)
    # Find the previous month (January rolls back to December of the previous year)
    if int(month[-2:]) == 1:
        prev_month = str(int(month[:4]) - 1) + '-12'
    else:
        prev_month = month[:5] + '{:02d}'.format(int(month[-2:]) - 1)
    df_previous = aggregate_monthly(df, prev_month)
    df = df_current.sub(df_previous)
    return df
```

### 5. Remove side effects

These functions, however, make changes in place (side effects), i.e. they modify the originally loaded dataframe. To solve this, we add a function called **`start_pipeline`** which returns a copy of the dataframe, so the pipeline always starts from a fresh copy.

```python
def start_pipeline(df):
    return df.copy()
```
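The `start_pipeline` copy protects the raw dataframe, but the individual functions above still mutate whatever frame they are given because of `inplace=True`. A stricter, optional variant is to have every step return a new dataframe instead of mutating its input. This is my own illustration under the same `df_info` decorator, not code from the talk:

```python
@df_info
def fill_missing_pure(df):
    # Returns a new dataframe instead of mutating the one passed in
    return df.fillna(value=0)


@df_info
def remove_columns_pure(df):
    # drop() without inplace=True also returns a new dataframe
    cols_to_drop = [*df.columns[4:10], *df.columns[11:15], 'posNeg', 'fips']
    return df.drop(columns=cols_to_drop)
```

Either style works with the pipeline below; the pure variant just removes the need to remember which steps mutate their input.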
### 6. Constructing pandas pipelines (main step)

Now, let's use these functions to achieve the previous tasks using **`pipe`**.

```python
df_daily = (df.pipe(start_pipeline)
              .pipe(create_dateindex)
              .pipe(remove_columns)
              .pipe(fill_missing)
              .pipe(add_state_name)
              .pipe(sample_daily)
              .pipe(add_active_cases))
```

```
create_dateindex took 0:00:00.003388 time

After applying create_dateindex

Shape of df = (5113, 34)

Columns of df are Index(['state', 'positive', 'negative', 'pending', 'hospitalizedCurrently',
       'hospitalizedCumulative', 'inIcuCurrently', 'inIcuCumulative',
       'onVentilatorCurrently', 'onVentilatorCumulative', 'recovered',
       'dataQualityGrade', 'lastUpdateEt', 'dateModified', 'checkTimeEt',
       'death', 'hospitalized', 'dateChecked', 'fips', 'positiveIncrease',
       'negativeIncrease', 'total', 'totalTestResults',
       'totalTestResultsIncrease', 'posNeg', 'deathIncrease',
       'hospitalizedIncrease', 'hash', 'commercialScore',
       'negativeRegularScore', 'negativeScore', 'positiveScore', 'score',
       'grade'],
      dtype='object')

Index of df is DatetimeIndex(['2020-06-04', '2020-06-04', '2020-06-04', '2020-06-04',
               ...
               '2020-01-25', '2020-01-24', '2020-01-23', '2020-01-22'],
              dtype='datetime64[ns]', name='date', length=5113, freq=None)
----------------------------------------------------------------------------------------------------
```

*(Similar logs follow for the remaining steps: `remove_columns` takes the shape from (5113, 34) to (5113, 22), `fill_missing` keeps it at (5113, 22), `add_state_name` gives (5113, 23), `sample_daily` gives (135, 19) with a daily DatetimeIndex from 2020-01-22 to 2020-06-04, and `add_active_cases` ends at (135, 20).)*

Check out the logs above: we can see in detail how each operation changed the data, without having to print the dataframe after every step.

```python
fig2 = px.line(df_daily, x=df_daily.index, y='death')
fig2.update_layout(xaxis_title='date',
                   title='Total deaths in the U.S. (Cumulative)')
py.plot(fig2, filename='total_daily_deaths', auto_open=True)
```

    'https://plotly.com/~ayush.kumar.shah/4/'
```python
fig3 = px.line(df_daily, x=df_daily.index, y='active')
fig3.update_layout(xaxis_title='date',
                   title='Total active cases in the U.S. (Cumulative)')
py.plot(fig3, filename='total_daily_active', auto_open=True)
```

    'https://plotly.com/~ayush.kumar.shah/6/'
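A quick aside on how `pipe` works: `df.pipe(func, *args)` simply calls `func(df, *args)` and returns the result, so a chain of `.pipe` calls is ordinary function composition written top to bottom. A tiny illustration with a toy function (not part of the analysis):

```python
import pandas as pd

def add_total(frame, col_a, col_b):
    # Toy step: return a copy with a new 'total' column
    out = frame.copy()
    out['total'] = out[col_a] + out[col_b]
    return out

toy = pd.DataFrame({'a': [1, 2], 'b': [10, 20]})

# These two lines are equivalent:
v1 = add_total(toy, 'a', 'b')
v2 = toy.pipe(add_total, 'a', 'b')

print(v1.equals(v2))  # True
```

Extra arguments passed to `.pipe` are forwarded to the function, which is exactly how `create_month_only` receives its `month` argument below.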
For the May-only statistics, the same cleaning steps can be reused and the result passed to **`create_month_only`**:

```python
df_may = create_month_only(
    df=(df.pipe(start_pipeline)
          .pipe(create_dateindex)
          .pipe(remove_columns)
          .pipe(fill_missing)
          .pipe(add_state_name)),
    month='2020-05')
```

*(The logs for `create_dateindex`, `remove_columns`, `fill_missing`, and `add_state_name` look the same as before; the new step logs as follows.)*

```
create_month_only took 0:00:00.031071 time

After applying create_month_only

Shape of df = (56, 13)

Columns of df are Index(['positive', 'negative', 'pending', 'recovered', 'death', 'hospitalized',
       'total', 'totalTestResults', 'deathIncrease', 'hospitalizedIncrease',
       'negativeIncrease', 'positiveIncrease', 'totalTestResultsIncrease'],
      dtype='object')

Index of df is Index(['Alabama', 'Alaska', 'American Samoa', 'Arizona', 'Arkansas',
       'California', 'Colorado', 'Connecticut', 'Delaware',
       'District Of Columbia', 'Florida', 'Georgia', 'Guam', 'Hawaii',
       'Idaho', 'Illinois', 'Indiana', 'Iowa', 'Kansas', 'Kentucky',
       'Louisiana', 'Maine', 'Maryland', 'Massachusetts', 'Michigan',
       'Minnesota', 'Mississippi', 'Missouri', 'Montana', 'Nebraska',
       'Nevada', 'New Hampshire', 'New Jersey', 'New Mexico', 'New York',
       'North Carolina', 'North Dakota', 'Northern Mariana Islands', 'Ohio',
       'Oklahoma', 'Oregon', 'Pennsylvania', 'Puerto Rico', 'Rhode Island',
       'South Carolina', 'South Dakota', 'Tennessee', 'Texas',
       'US Virgin Islands', 'Utah', 'Vermont', 'Virginia', 'Washington',
       'West Virginia', 'Wisconsin', 'Wyoming'],
      dtype='object', name='state_name')
```

```python
fig4 = px.bar(df_may, x=df_may.index, y='death')
fig4.update_layout(xaxis_title='state_name',
                   title='Total deaths in the US in May only')
py.plot(fig4, filename='total_deaths_May', auto_open=True)
```

    'https://plotly.com/~ayush.kumar.shah/12/'
You can see how easily the pipe functionality achieves the required tasks in a clean, organized way. Also, the original dataframe is left intact, unaffected by the operations above.

```python
df.head()
```
*(Output: the original dataframe's first five rows, unchanged, with the raw integer date index and all 34 original columns still present.)*
### 7. Create a module

Finally, you can create a module (e.g. `processing.py`) and keep all the above functions in it. You can then simply import them in the notebook and use them directly, which cleans up the notebook further.

**`processing.py`**

While loading the module, load the `autoreload` extension so that when you change code in the module, the changes are picked up automatically. For more info, see the [autoreload documentation](https://ipython.org/ipython-doc/3/config/extensions/autoreload.html).

```python
%load_ext autoreload
%autoreload 2

from processing import *
```

```
The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload
```

## Advantages

### 1. Effective in the long run (maintainability)

Although the approach may look like an inefficient way of coding, it is very effective in the long run, because you will not have to spend hours maintaining the notebook. Given that the functions are well written and well defined, they are ready for production. The code is also easy to share, and anyone can understand it, unlike with the previous approach. Even for complex analysis tasks, this approach keeps the notebook maintainable.

### 2. Proper flow and planning

You do not need to think about the logic of the analysis at the beginning. You can just plan your tasks and write down the required functions, which gives you a mental framework that helps you stay on track. The calm that follows is likely to have a greater impact on innovation. You can then define the logic at the end to make everything work.

### 3. Easier to modify

You might have noticed that the `pipe` functionality lets you modify the tasks or the flow easily, simply by commenting out or adding functions in the pipeline. For example, suppose you don't want to remove the columns or resample the data daily. You can achieve this just by commenting out those lines, as shown below:

```python
df_daily = (df.pipe(start_pipeline)
              .pipe(create_dateindex)
              # .pipe(remove_columns)
              .pipe(fill_missing)
              .pipe(add_state_name)
              .pipe(drop_state)
              # .pipe(sample_daily)
              .pipe(add_active_cases))
```

### 4. Easier to debug

In this approach, you know what is happening in each step, which makes debugging a lot easier. Furthermore, since all the operations are functions, you can debug the code by unit-testing the functions or applying other standard techniques to them (see the test sketch below).

### 5. Readability

This approach prevents miles of scrolling and is far more readable than the previous one. By looking at the code, you can easily understand which operations are performed on the data, and the decorator shows you the effect of those operations at each step.

Example: consider cooking chicken. We don't describe the steps like this:

```
temperature = 210 celsius
food1 = Chicken
food2 = Season(food1, with Spices)
food3 = Season(food2, with Gravy)
Serve(PutInOven(food3, temperature), on a plate)
```

Instead, we describe it the following way:

```
temperature = 210 celsius
Chicken.Season(with Spices)
       .Season(with Gravy)
       .PutInOven(temperature)
       .Serve()
```

The pipe functionality lets us write code in the latter way, which is much more readable.

### 6. Reusability

For production, we turn the project into a Python package. You can then import your code into any notebook with a single cell, so you no longer need to rewrite the same task in multiple notebooks.
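To make advantages 4 and 6 a bit more concrete, here is a minimal sketch of how a couple of the pipeline functions could be unit-tested once they live in `processing.py`. This is my own illustration, assuming a pytest setup and the module layout described above; it is not code from the original talk.

```python
# test_processing.py  (hypothetical test file)
import pandas as pd
import pandas.testing as pdt

from processing import start_pipeline, fill_missing


def test_fill_missing_replaces_nans_with_zero():
    raw = pd.DataFrame({'positive': [1.0, None], 'death': [None, 2.0]})
    result = raw.pipe(start_pipeline).pipe(fill_missing)

    expected = pd.DataFrame({'positive': [1.0, 0.0], 'death': [0.0, 2.0]})
    pdt.assert_frame_equal(result, expected)


def test_start_pipeline_does_not_mutate_the_raw_data():
    raw = pd.DataFrame({'positive': [1.0, None]})
    raw.pipe(start_pipeline).pipe(fill_missing)

    assert raw['positive'].isna().sum() == 1   # the original frame still has its NaN
```

Because every step is a plain function of a dataframe, tests like these stay tiny and fast, and they run the same whether the function is used in a notebook or in a production script.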
### 7. Separation of analysis and data manipulation

Once your functions have been moved to a separate module, you get two levels of abstraction: analysis and data manipulation. You can fiddle around at a high level and keep the details at a low level. The notebook then becomes a summary and a user interface where you can very quickly make nice little charts, instead of a place where you manipulate data or grind through analytical steps to get a result.

# Final notes

Following these practices while coding in pandas, or while performing similar tasks such as building scikit-learn or other ML pipelines, can be extremely beneficial for developers. All four problems mentioned at the beginning are solved by this approach. Giving utmost priority to clarity and interoperability, we should remember that it's a lot easier to solve a problem when we understand it well.

Moreover, if you find writing this kind of code difficult, an open-source package called [Scikit-lego](https://scikit-lego.readthedocs.io/en/latest/pandas_pipeline.html#), maintained by Vincent and MatthijsB with contributions from all around the world, is available. It does the hard work of creating such pipelines for you, along with additional features like custom logging. Do check it out.

Also, if you have any confusion or suggestions, feel free to comment. I am all ears. Thank you.