{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Lecture 7 - Data Manipulation with pandas" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "[![View notebook on Github](https://img.shields.io/static/v1.svg?logo=github&label=Repo&message=View%20On%20Github&color=lightgrey)](https://github.com/avakanski/Fall-2024-Applied-Data-Science-with-Python/blob/main/docs/Lectures/Theme_2-Data_Engineering/Lecture_7-Pandas/Lecture_7-Pandas.ipynb)\n", "[![Open In Collab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/avakanski/Fall-2024-Applied-Data-Science-with-Python/blob/main/docs/Lectures/Theme_2-Data_Engineering/Lecture_7-Pandas/Lecture_7-Pandas.ipynb) " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "- [7.1 Introduction to pandas](#7.1-introduction-to-pandas)\n", "- [7.2 Importing Data and Summary Statistics](#7.2-importing-data-and-summary-statistics)\n", "- [7.3 Rename, Index, and Slice](#7.3-rename,-index,-and-slice)\n", "- [7.4 Creating New Columns, Reordering](#7.4-creating-new-columns,-reordering)\n", "- [7.5 Removing Columns and Rows](#7.5-removing-columns-and-rows)\n", "- [7.6 Merging DataFrames](#7.6-merging-dataframes)\n", "- [7.7 Calculating Unique and Missing Values](#7.7-calculating-unique-and-missing-values)\n", "- [7.8 Dealing With Missing Values: Boolean Indexing](#7.8-dealing-with-missing-values:-boolean-indexing)\n", "- [7.9 Exporting A DataFrame to csv](#7.9-exporting-a-dataFrame-to-csv)\n", "- [References](#references)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 7.1 Introduction to `pandas` " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "`pandas` is a library designed for working with structured data, such as tabular data (e.g., from .csv files, Excel files) or relational databases (e.g., SQL). \n", "\n", "> The **DataFrame** object in `pandas` is a 2-dimensional tabular, column-oriented data structure. The DateFrame is similar to an Excel spreadsheet and can store data of different types (including text characters, integers, floating-point values, categorical data, and more)." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "Figure source: Reference [2]." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The `pandas` name is derived from the term *panel data*, which is a term from economics for multi-dimensional structured data. " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 7.2 Importing Data and Summary Statistics " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's begin by importing the `pandas` package using the common abbreviation `pd`. " ] }, { "cell_type": "code", "execution_count": 1, "metadata": { "tags": [] }, "outputs": [], "source": [ "import pandas as pd" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "A wide range of input/output formats are supported by `pandas`:\n", "\n", "* CSV, text\n", "* SQL database\n", "* Excel\n", "* HDF5\n", "* json\n", "* html\n", "* pickle\n", "* sas, stata\n", "* ..." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "For importing .csv files, the function `read_csv()` in `pandas` allows to easily import data. By default, it assumes that the data is comma-separated, but we can also specify the delimiter used in the data (e.g., tab, semicolon, etc.). There are several parameters that can be specified in `read_csv()`. See the documentation [here](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_csv.html). \n", "\n", "Let's load the data in the file `country-total` file located in the folder `data` and save it under the name `unemployment`. This file contains unemployment information for several countries over a period of 30 years.\n", "\n", "The function `read_csv()` returns a `DataFrame`, as shown below. Note that the DataFrame has 20,796 rows, and the output of the cell displayed only the first 5 and last 5 rows (or the first and last 30, depending on your system), since displaying all 20,976 rows is probably not what we want at this point anyway." ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
countryseasonalitymonthunemploymentunemployment_rate
0atnsa1993.011710004.5
1atnsa1993.021750004.6
2atnsa1993.031660004.4
3atnsa1993.041570004.1
4atnsa1993.051470003.9
..................
20791uktrend2010.0624290007.7
20792uktrend2010.0724220007.7
20793uktrend2010.0824290007.7
20794uktrend2010.0924470007.8
20795uktrend2010.1024550007.8
\n", "

20796 rows × 5 columns

\n", "
" ], "text/plain": [ " country seasonality month unemployment unemployment_rate\n", "0 at nsa 1993.01 171000 4.5\n", "1 at nsa 1993.02 175000 4.6\n", "2 at nsa 1993.03 166000 4.4\n", "3 at nsa 1993.04 157000 4.1\n", "4 at nsa 1993.05 147000 3.9\n", "... ... ... ... ... ...\n", "20791 uk trend 2010.06 2429000 7.7\n", "20792 uk trend 2010.07 2422000 7.7\n", "20793 uk trend 2010.08 2429000 7.7\n", "20794 uk trend 2010.09 2447000 7.8\n", "20795 uk trend 2010.10 2455000 7.8\n", "\n", "[20796 rows x 5 columns]" ] }, "execution_count": 2, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Import data\n", "unemployment = pd.read_csv('data/country_total.csv')\n", "# Show the DataFrame\n", "unemployment" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We could have also used `print` to display the DataFrame." ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " country seasonality month unemployment unemployment_rate\n", "0 at nsa 1993.01 171000 4.5\n", "1 at nsa 1993.02 175000 4.6\n", "2 at nsa 1993.03 166000 4.4\n", "3 at nsa 1993.04 157000 4.1\n", "4 at nsa 1993.05 147000 3.9\n", "... ... ... ... ... ...\n", "20791 uk trend 2010.06 2429000 7.7\n", "20792 uk trend 2010.07 2422000 7.7\n", "20793 uk trend 2010.08 2429000 7.7\n", "20794 uk trend 2010.09 2447000 7.8\n", "20795 uk trend 2010.10 2455000 7.8\n", "\n", "[20796 rows x 5 columns]\n" ] } ], "source": [ "print(unemployment)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "When the DataFrames have a large number of rows that take a large portion of the screen, we can inspect the data by using the `.head()` method. By default, this shows the **header** (names of the columns, commonly referred to as **column labels**) and the first five rows (having indices ranging from 0 to 4, in the first column in the table). The **indices** are also referred to as **row labels**." ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
countryseasonalitymonthunemploymentunemployment_rate
0atnsa1993.011710004.5
1atnsa1993.021750004.6
2atnsa1993.031660004.4
3atnsa1993.041570004.1
4atnsa1993.051470003.9
\n", "
" ], "text/plain": [ " country seasonality month unemployment unemployment_rate\n", "0 at nsa 1993.01 171000 4.5\n", "1 at nsa 1993.02 175000 4.6\n", "2 at nsa 1993.03 166000 4.4\n", "3 at nsa 1993.04 157000 4.1\n", "4 at nsa 1993.05 147000 3.9" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "unemployment.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Passing an integer number as an argument to `.head(n)` returns that number of rows. To see the last $n$ rows, use `.tail(n)`." ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
countryseasonalitymonthunemploymentunemployment_rate
20788uktrend2010.0324370007.8
20789uktrend2010.0424190007.8
20790uktrend2010.0524190007.7
20791uktrend2010.0624290007.7
20792uktrend2010.0724220007.7
20793uktrend2010.0824290007.7
20794uktrend2010.0924470007.8
20795uktrend2010.1024550007.8
\n", "
" ], "text/plain": [ " country seasonality month unemployment unemployment_rate\n", "20788 uk trend 2010.03 2437000 7.8\n", "20789 uk trend 2010.04 2419000 7.8\n", "20790 uk trend 2010.05 2419000 7.7\n", "20791 uk trend 2010.06 2429000 7.7\n", "20792 uk trend 2010.07 2422000 7.7\n", "20793 uk trend 2010.08 2429000 7.7\n", "20794 uk trend 2010.09 2447000 7.8\n", "20795 uk trend 2010.10 2455000 7.8" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# show the last 8 rows\n", "unemployment.tail(8)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To find the number of rows in a DataFrame, you can use the `len()` function, as with Python lists and other sequences." ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "20796" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "len(unemployment)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Alternatively, we can use the `shape` attribute to find the numbers of rows and columns, as with NumPy arrays. The cell output is a tuple, showing that there are 20,796 rows and 5 columns. Note that the left-most column in the above table showing row indices is not part of the data." ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(20796, 5)" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "unemployment.shape" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "A useful method that generates various summary statistics of a DataFrame is `.describe()`, as shown below.\n", "\n", "Notice in the above cell that the DataFrame has 5 columns, but the first 2 columns (country and seasonality) have textual (strings) data, and therefore the summary statistics are shown only for the columns with numeric data (month, unemployment, and unemployment_rate). If `.describe()` is called on textual data only, it will return the count, number of unique values, and the most frequent value along with its count." ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
monthunemploymentunemployment_rate
count20796.0000002.079600e+0419851.000000
mean1999.4012907.900818e+058.179764
std7.4837511.015280e+063.922533
min1983.0100002.000000e+031.100000
25%1994.0900001.400000e+055.200000
50%2001.0100003.100000e+057.600000
75%2006.0100001.262250e+0610.000000
max2010.1200004.773000e+0620.900000
\n", "
" ], "text/plain": [ " month unemployment unemployment_rate\n", "count 20796.000000 2.079600e+04 19851.000000\n", "mean 1999.401290 7.900818e+05 8.179764\n", "std 7.483751 1.015280e+06 3.922533\n", "min 1983.010000 2.000000e+03 1.100000\n", "25% 1994.090000 1.400000e+05 5.200000\n", "50% 2001.010000 3.100000e+05 7.600000\n", "75% 2006.010000 1.262250e+06 10.000000\n", "max 2010.120000 4.773000e+06 20.900000" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "unemployment.describe()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "It is also possible to calculate individual statistics, such as `.min()`, `.max ()`, or `.mean()`, instead of using summary statistics with `.describe()`." ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "country at\n", "seasonality nsa\n", "month 1983.01\n", "unemployment 2000\n", "unemployment_rate 1.1\n", "dtype: object" ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "unemployment.min()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To view the data types for each column, use the `dtypes` attribute of the *unemployment* DataFrame. The data types in this case are strings (`object` type), floats (`float64` type), and integers (`int64` type)." ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "country object\n", "seasonality object\n", "month float64\n", "unemployment int64\n", "unemployment_rate float64\n", "dtype: object" ] }, "execution_count": 10, "metadata": {}, "output_type": "execute_result" } ], "source": [ "unemployment.dtypes" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "And one more way to get a summary of a DataFrame is by using `info()`." ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\n", "RangeIndex: 20796 entries, 0 to 20795\n", "Data columns (total 5 columns):\n", " # Column Non-Null Count Dtype \n", "--- ------ -------------- ----- \n", " 0 country 20796 non-null object \n", " 1 seasonality 20796 non-null object \n", " 2 month 20796 non-null float64\n", " 3 unemployment 20796 non-null int64 \n", " 4 unemployment_rate 19851 non-null float64\n", "dtypes: float64(2), int64(1), object(2)\n", "memory usage: 812.5+ KB\n" ] } ], "source": [ "unemployment.info()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can retrieve the name of the columns in the `unemployment` DataFrame by using the attribute `columns`, as in the next cell. " ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "Index(['country', 'seasonality', 'month', 'unemployment', 'unemployment_rate'], dtype='object')" ] }, "execution_count": 12, "metadata": {}, "output_type": "execute_result" } ], "source": [ "unemployment.columns" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Note that the type of the data for the returned column names is an Index object. We can also easily convert the column names into a list by using the method `tolist()`. " ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "['country', 'seasonality', 'month', 'unemployment', 'unemployment_rate']" ] }, "execution_count": 13, "metadata": {}, "output_type": "execute_result" } ], "source": [ "unemployment.columns.tolist()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Alternatively, we can achieve the same by using the Python built-in function `list`." ] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "['country', 'seasonality', 'month', 'unemployment', 'unemployment_rate']" ] }, "execution_count": 14, "metadata": {}, "output_type": "execute_result" } ], "source": [ "list(unemployment.columns)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Import Data From a URL\n", "\n", "Above, we imported the unemployment data using the function `read_csv` and a relative file path to the `data` directory. The function `read_csv` is very flexible and it also allows importing data using a URL as the file path. \n", "\n", "A csv file with data on world countries and their abbreviations is located at [https://raw.githubusercontent.com/dlab-berkeley/introduction-to-pandas/master/data/countries.csv](https://raw.githubusercontent.com/dlab-berkeley/introduction-to-pandas/master/data/countries.csv)\n", "\n", "Using `read_csv`, we can import the country data and save it to the variable `countries`. This DataFrame has 30 rows." ] }, { "cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
countrygoogle_country_codecountry_groupname_enname_frname_delatitudelongitude
0atATeuAustriaAutricheÖsterreich47.69655413.345980
1beBEeuBelgiumBelgiqueBelgien50.5010454.476674
2bgBGeuBulgariaBulgarieBulgarien42.72567425.482322
3hrHRnon-euCroatiaCroatieKroatien44.74664315.340844
4cyCYeuCyprusChypreZypern35.12914133.428682
5czCZeuCzech RepublicRépublique tchèqueTschechische Republik49.80353115.474998
6dkDKeuDenmarkDanemarkDänemark55.9396849.516689
7eeEEeuEstoniaEstonieEstland58.59246925.806950
8fiFIeuFinlandFinlandeFinnland64.95015926.067564
9frFReuFranceFranceFrankreich46.7109941.718561
10deDEeuGermany (including former GDR from 1991)Allemagne (incluant l'ancienne RDA à partir de...Deutschland (einschließlich der ehemaligen DDR...51.16382510.454048
11grGReuGreeceGrèceGriechenland39.69846721.577256
12huHUeuHungaryHongrieUngarn47.16116319.504265
13ieIEeuIrelandIrlandeIrland53.415260-8.239122
14itITeuItalyItalieItalien42.50419112.573787
15lvLVeuLatviaLettonieLettland56.88011724.606555
16ltLTeuLithuaniaLituanieLitauen55.17368723.943168
17luLUeuLuxembourgLuxembourgLuxemburg49.8153196.133352
18mtMTeuMaltaMalteMalta35.90242214.447461
19nlNLeuNetherlandsPays-BasNiederlande52.1081185.330198
20noNOnon-euNorwayNorvègeNorwegen64.55646012.665766
21plPLeuPolandPolognePolen51.91890719.134334
22ptPTeuPortugalPortugalPortugal39.558069-7.844941
23roROeuRomaniaRoumanieRumänien45.94261124.990152
24skSKeuSlovakiaSlovaquieSlowakei48.67264419.700032
25siSIeuSloveniaSlovénieSlowenien46.14925914.986617
26esESeuSpainEspagneSpanien39.895013-2.988296
27seSEeuSwedenSuèdeSchweden62.19846714.896307
28trTRnon-euTurkeyTurquieTürkei38.95294235.439795
29ukGBeuUnited KingdomRoyaume-UniVereinigtes Königreich54.315447-2.232612
\n", "
" ], "text/plain": [ " country google_country_code country_group \\\n", "0 at AT eu \n", "1 be BE eu \n", "2 bg BG eu \n", "3 hr HR non-eu \n", "4 cy CY eu \n", "5 cz CZ eu \n", "6 dk DK eu \n", "7 ee EE eu \n", "8 fi FI eu \n", "9 fr FR eu \n", "10 de DE eu \n", "11 gr GR eu \n", "12 hu HU eu \n", "13 ie IE eu \n", "14 it IT eu \n", "15 lv LV eu \n", "16 lt LT eu \n", "17 lu LU eu \n", "18 mt MT eu \n", "19 nl NL eu \n", "20 no NO non-eu \n", "21 pl PL eu \n", "22 pt PT eu \n", "23 ro RO eu \n", "24 sk SK eu \n", "25 si SI eu \n", "26 es ES eu \n", "27 se SE eu \n", "28 tr TR non-eu \n", "29 uk GB eu \n", "\n", " name_en \\\n", "0 Austria \n", "1 Belgium \n", "2 Bulgaria \n", "3 Croatia \n", "4 Cyprus \n", "5 Czech Republic \n", "6 Denmark \n", "7 Estonia \n", "8 Finland \n", "9 France \n", "10 Germany (including former GDR from 1991) \n", "11 Greece \n", "12 Hungary \n", "13 Ireland \n", "14 Italy \n", "15 Latvia \n", "16 Lithuania \n", "17 Luxembourg \n", "18 Malta \n", "19 Netherlands \n", "20 Norway \n", "21 Poland \n", "22 Portugal \n", "23 Romania \n", "24 Slovakia \n", "25 Slovenia \n", "26 Spain \n", "27 Sweden \n", "28 Turkey \n", "29 United Kingdom \n", "\n", " name_fr \\\n", "0 Autriche \n", "1 Belgique \n", "2 Bulgarie \n", "3 Croatie \n", "4 Chypre \n", "5 République tchèque \n", "6 Danemark \n", "7 Estonie \n", "8 Finlande \n", "9 France \n", "10 Allemagne (incluant l'ancienne RDA à partir de... \n", "11 Grèce \n", "12 Hongrie \n", "13 Irlande \n", "14 Italie \n", "15 Lettonie \n", "16 Lituanie \n", "17 Luxembourg \n", "18 Malte \n", "19 Pays-Bas \n", "20 Norvège \n", "21 Pologne \n", "22 Portugal \n", "23 Roumanie \n", "24 Slovaquie \n", "25 Slovénie \n", "26 Espagne \n", "27 Suède \n", "28 Turquie \n", "29 Royaume-Uni \n", "\n", " name_de latitude longitude \n", "0 Österreich 47.696554 13.345980 \n", "1 Belgien 50.501045 4.476674 \n", "2 Bulgarien 42.725674 25.482322 \n", "3 Kroatien 44.746643 15.340844 \n", "4 Zypern 35.129141 33.428682 \n", "5 Tschechische Republik 49.803531 15.474998 \n", "6 Dänemark 55.939684 9.516689 \n", "7 Estland 58.592469 25.806950 \n", "8 Finnland 64.950159 26.067564 \n", "9 Frankreich 46.710994 1.718561 \n", "10 Deutschland (einschließlich der ehemaligen DDR... 51.163825 10.454048 \n", "11 Griechenland 39.698467 21.577256 \n", "12 Ungarn 47.161163 19.504265 \n", "13 Irland 53.415260 -8.239122 \n", "14 Italien 42.504191 12.573787 \n", "15 Lettland 56.880117 24.606555 \n", "16 Litauen 55.173687 23.943168 \n", "17 Luxemburg 49.815319 6.133352 \n", "18 Malta 35.902422 14.447461 \n", "19 Niederlande 52.108118 5.330198 \n", "20 Norwegen 64.556460 12.665766 \n", "21 Polen 51.918907 19.134334 \n", "22 Portugal 39.558069 -7.844941 \n", "23 Rumänien 45.942611 24.990152 \n", "24 Slowakei 48.672644 19.700032 \n", "25 Slowenien 46.149259 14.986617 \n", "26 Spanien 39.895013 -2.988296 \n", "27 Schweden 62.198467 14.896307 \n", "28 Türkei 38.952942 35.439795 \n", "29 Vereinigtes Königreich 54.315447 -2.232612 " ] }, "execution_count": 15, "metadata": {}, "output_type": "execute_result" } ], "source": [ "countries = pd.read_csv('https://raw.githubusercontent.com/dlab-berkeley/introduction-to-pandas/master/data/countries.csv')\n", "countries" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Similar to the above example, we can use `shape` and `describe()` the understand the *countries* DataFrame. In this case `describe()` is not very useful, because only 2 of the columns have numeric values." ] }, { "cell_type": "code", "execution_count": 16, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(30, 8)" ] }, "execution_count": 16, "metadata": {}, "output_type": "execute_result" } ], "source": [ "countries.shape" ] }, { "cell_type": "code", "execution_count": 17, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
latitudelongitude
count30.00000030.000000
mean49.09260914.324579
std7.95662411.257010
min35.129141-8.239122
25%43.2309166.979186
50%49.23808714.941462
75%54.09040023.351690
max64.95015935.439795
\n", "
" ], "text/plain": [ " latitude longitude\n", "count 30.000000 30.000000\n", "mean 49.092609 14.324579\n", "std 7.956624 11.257010\n", "min 35.129141 -8.239122\n", "25% 43.230916 6.979186\n", "50% 49.238087 14.941462\n", "75% 54.090400 23.351690\n", "max 64.950159 35.439795" ] }, "execution_count": 17, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# explore the countries data\n", "countries.describe()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The method `info()` provides helpful information for this DataFrame. " ] }, { "cell_type": "code", "execution_count": 18, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\n", "RangeIndex: 30 entries, 0 to 29\n", "Data columns (total 8 columns):\n", " # Column Non-Null Count Dtype \n", "--- ------ -------------- ----- \n", " 0 country 30 non-null object \n", " 1 google_country_code 30 non-null object \n", " 2 country_group 30 non-null object \n", " 3 name_en 30 non-null object \n", " 4 name_fr 30 non-null object \n", " 5 name_de 30 non-null object \n", " 6 latitude 30 non-null float64\n", " 7 longitude 30 non-null float64\n", "dtypes: float64(2), object(6)\n", "memory usage: 2.0+ KB\n" ] } ], "source": [ "# explore the countries data\n", "countries.info()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Import Data from Excel File" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In a similar way, we can import data from an Excel file using the function `read_excel()`." ] }, { "cell_type": "code", "execution_count": 19, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Note: you may need to restart the kernel to use updated packages.\n" ] } ], "source": [ "pip install xlrd -q" ] }, { "cell_type": "code", "execution_count": 20, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
pclasssurvivednamesexagesibspparchticketfarecabinembarkedboatbodyhome.dest
011Allen, Miss. Elisabeth Waltonfemale290024160211.3375B5S2NoneSt Louis, MO
111Allison, Master. Hudson Trevormale0.916712113781151.5500C22 C26S11NoneMontreal, PQ / Chesterville, ON
210Allison, Miss. Helen Lorainefemale212113781151.5500C22 C26SNoneNoneMontreal, PQ / Chesterville, ON
310Allison, Mr. Hudson Joshua Creightonmale3012113781151.5500C22 C26SNone135Montreal, PQ / Chesterville, ON
410Allison, Mrs. Hudson J C (Bessie Waldo Daniels)female2512113781151.5500C22 C26SNoneNoneMontreal, PQ / Chesterville, ON
511Anderson, Mr. Harrymale48001995226.5500E12S3NoneNew York, NY
611Andrews, Miss. Kornelia Theodosiafemale63101350277.9583D7S10NoneHudson, NY
710Andrews, Mr. Thomas Jrmale39001120500.0000A36SNoneNoneBelfast, NI
811Appleton, Mrs. Edward Dale (Charlotte Lamson)female53201176951.4792C101SDNoneBayside, Queens, NY
910Artagaveytia, Mr. Ramonmale7100PC 1760949.5042NoneCNone22Montevideo, Uruguay
1010Astor, Col. John Jacobmale4710PC 17757227.5250C62 C64CNone124New York, NY
1111Astor, Mrs. John Jacob (Madeleine Talmadge Force)female1810PC 17757227.5250C62 C64C4NoneNew York, NY
1211Aubart, Mme. Leontine Paulinefemale2400PC 1747769.3000B35C9NoneParis, France
1311Barber, Miss. Ellen \"Nellie\"female26001987778.8500NoneS6NoneNone
1411Barkworth, Mr. Algernon Henry Wilsonmale80002704230.0000A23SBNoneHessle, Yorks
1510Baumann, Mr. John DmaleNone00PC 1731825.9250NoneSNoneNoneNew York, NY
1610Baxter, Mr. Quigg Edmondmale2401PC 17558247.5208B58 B60CNoneNoneMontreal, PQ
1711Baxter, Mrs. James (Helene DeLaudeniere Chaput)female5001PC 17558247.5208B58 B60C6NoneMontreal, PQ
1811Bazzani, Miss. Albinafemale32001181376.2917D15C8NoneNone
1910Beattie, Mr. Thomsonmale36001305075.2417C6CANoneWinnipeg, MN
\n", "
" ], "text/plain": [ " pclass survived name \\\n", "0 1 1 Allen, Miss. Elisabeth Walton \n", "1 1 1 Allison, Master. Hudson Trevor \n", "2 1 0 Allison, Miss. Helen Loraine \n", "3 1 0 Allison, Mr. Hudson Joshua Creighton \n", "4 1 0 Allison, Mrs. Hudson J C (Bessie Waldo Daniels) \n", "5 1 1 Anderson, Mr. Harry \n", "6 1 1 Andrews, Miss. Kornelia Theodosia \n", "7 1 0 Andrews, Mr. Thomas Jr \n", "8 1 1 Appleton, Mrs. Edward Dale (Charlotte Lamson) \n", "9 1 0 Artagaveytia, Mr. Ramon \n", "10 1 0 Astor, Col. John Jacob \n", "11 1 1 Astor, Mrs. John Jacob (Madeleine Talmadge Force) \n", "12 1 1 Aubart, Mme. Leontine Pauline \n", "13 1 1 Barber, Miss. Ellen \"Nellie\" \n", "14 1 1 Barkworth, Mr. Algernon Henry Wilson \n", "15 1 0 Baumann, Mr. John D \n", "16 1 0 Baxter, Mr. Quigg Edmond \n", "17 1 1 Baxter, Mrs. James (Helene DeLaudeniere Chaput) \n", "18 1 1 Bazzani, Miss. Albina \n", "19 1 0 Beattie, Mr. Thomson \n", "\n", " sex age sibsp parch ticket fare cabin embarked boat \\\n", "0 female 29 0 0 24160 211.3375 B5 S 2 \n", "1 male 0.9167 1 2 113781 151.5500 C22 C26 S 11 \n", "2 female 2 1 2 113781 151.5500 C22 C26 S None \n", "3 male 30 1 2 113781 151.5500 C22 C26 S None \n", "4 female 25 1 2 113781 151.5500 C22 C26 S None \n", "5 male 48 0 0 19952 26.5500 E12 S 3 \n", "6 female 63 1 0 13502 77.9583 D7 S 10 \n", "7 male 39 0 0 112050 0.0000 A36 S None \n", "8 female 53 2 0 11769 51.4792 C101 S D \n", "9 male 71 0 0 PC 17609 49.5042 None C None \n", "10 male 47 1 0 PC 17757 227.5250 C62 C64 C None \n", "11 female 18 1 0 PC 17757 227.5250 C62 C64 C 4 \n", "12 female 24 0 0 PC 17477 69.3000 B35 C 9 \n", "13 female 26 0 0 19877 78.8500 None S 6 \n", "14 male 80 0 0 27042 30.0000 A23 S B \n", "15 male None 0 0 PC 17318 25.9250 None S None \n", "16 male 24 0 1 PC 17558 247.5208 B58 B60 C None \n", "17 female 50 0 1 PC 17558 247.5208 B58 B60 C 6 \n", "18 female 32 0 0 11813 76.2917 D15 C 8 \n", "19 male 36 0 0 13050 75.2417 C6 C A \n", "\n", " body home.dest \n", "0 None St Louis, MO \n", "1 None Montreal, PQ / Chesterville, ON \n", "2 None Montreal, PQ / Chesterville, ON \n", "3 135 Montreal, PQ / Chesterville, ON \n", "4 None Montreal, PQ / Chesterville, ON \n", "5 None New York, NY \n", "6 None Hudson, NY \n", "7 None Belfast, NI \n", "8 None Bayside, Queens, NY \n", "9 22 Montevideo, Uruguay \n", "10 124 New York, NY \n", "11 None New York, NY \n", "12 None Paris, France \n", "13 None None \n", "14 None Hessle, Yorks \n", "15 None New York, NY \n", "16 None Montreal, PQ \n", "17 None Montreal, PQ \n", "18 None None \n", "19 None Winnipeg, MN " ] }, "execution_count": 20, "metadata": {}, "output_type": "execute_result" } ], "source": [ "titanic = pd.read_excel('data/titanic.xls')\n", "titanic" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Note that all DataFrames that we loaded so far in this Jupyter notebook, i.e., `unemployment`, `countries`, and `titanic`, are stored in the memory, and we can access them when needed once they are loaded. For example, we can check the shape of the `titanic` DataFrame." ] }, { "cell_type": "code", "execution_count": 21, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(20, 14)" ] }, "execution_count": 21, "metadata": {}, "output_type": "execute_result" } ], "source": [ "titanic.shape" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Create a DataFrame" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Alternatively, we can manually create DataFrames, instead of importing from a file. The following DataFrame called `simple_table` contains information from the `titanic` data. You can notice that the DataFrame is created similarly to creating a dictionary, where the column headers represent keys, and the data in each column are lists of values." ] }, { "cell_type": "code", "execution_count": 22, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
NameAgeSex
0Braund, Mr. Owen Harris22male
1Allen, Mr. William Henry35male
2Bonnell, Miss. Elizabeth58female
\n", "
" ], "text/plain": [ " Name Age Sex\n", "0 Braund, Mr. Owen Harris 22 male\n", "1 Allen, Mr. William Henry 35 male\n", "2 Bonnell, Miss. Elizabeth 58 female" ] }, "execution_count": 22, "metadata": {}, "output_type": "execute_result" } ], "source": [ "simple_table = pd.DataFrame({\n", " \"Name\": [\"Braund, Mr. Owen Harris\",\n", " \"Allen, Mr. William Henry\",\n", " \"Bonnell, Miss. Elizabeth\"],\n", " \"Age\": [22, 35, 58],\n", " \"Sex\": [\"male\", \"male\", \"female\"]})\n", "simple_table" ] }, { "cell_type": "code", "execution_count": 23, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(3, 3)" ] }, "execution_count": 23, "metadata": {}, "output_type": "execute_result" } ], "source": [ "simple_table.shape" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can use again the `tolist()` method to convert the values in a `pandas` DataFrame into a list, if we needed the data into a list format." ] }, { "cell_type": "code", "execution_count": 24, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[['Braund, Mr. Owen Harris', 22, 'male'],\n", " ['Allen, Mr. William Henry', 35, 'male'],\n", " ['Bonnell, Miss. Elizabeth', 58, 'female']]" ] }, "execution_count": 24, "metadata": {}, "output_type": "execute_result" } ], "source": [ "simple_table.values.tolist()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 7.3 Rename, Index, and Slice
\n", "\n", "Let's look again at the *unemployment* DataFrame. You may have noticed that the `month` column actually includes the year and the month added as decimals (e.g., 1993.01 should be year 1993 and month 01)." ] }, { "cell_type": "code", "execution_count": 25, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
countryseasonalitymonthunemploymentunemployment_rate
0atnsa1993.011710004.5
1atnsa1993.021750004.6
2atnsa1993.031660004.4
\n", "
" ], "text/plain": [ " country seasonality month unemployment unemployment_rate\n", "0 at nsa 1993.01 171000 4.5\n", "1 at nsa 1993.02 175000 4.6\n", "2 at nsa 1993.03 166000 4.4" ] }, "execution_count": 25, "metadata": {}, "output_type": "execute_result" } ], "source": [ "unemployment.head(3)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's rename the column into *year_month*. The `.rename()` method allows modifying column and/or row names. As you can see in the cell below, we passed a dictionary to the `columns` parameter, with the original name `month` as the key and the new name `year_month` as the value. With this approach, we can rename several columns at the same time by providing the old and new names as keys and values in the dictionary. Importantly, we also set the `inplace` parameter to `True`, which indicates that we want to modify the *original* DataFrame, and not to create a new DataFrame." ] }, { "cell_type": "code", "execution_count": 26, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
countryseasonalityyear_monthunemploymentunemployment_rate
0atnsa1993.011710004.5
1atnsa1993.021750004.6
2atnsa1993.031660004.4
\n", "
" ], "text/plain": [ " country seasonality year_month unemployment unemployment_rate\n", "0 at nsa 1993.01 171000 4.5\n", "1 at nsa 1993.02 175000 4.6\n", "2 at nsa 1993.03 166000 4.4" ] }, "execution_count": 26, "metadata": {}, "output_type": "execute_result" } ], "source": [ "unemployment.rename(columns={'month' : 'year_month'}, inplace=True)\n", "unemployment.head(3)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To observe the effect of `inplace=True`, let's run in the next cell another `.rename()` method to change the column `country` to `year`, but this time by omitting `inplace=True`." ] }, { "cell_type": "code", "execution_count": 27, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
yearseasonalityyear_monthunemploymentunemployment_rate
0atnsa1993.011710004.5
1atnsa1993.021750004.6
2atnsa1993.031660004.4
\n", "
" ], "text/plain": [ " year seasonality year_month unemployment unemployment_rate\n", "0 at nsa 1993.01 171000 4.5\n", "1 at nsa 1993.02 175000 4.6\n", "2 at nsa 1993.03 166000 4.4" ] }, "execution_count": 27, "metadata": {}, "output_type": "execute_result" } ], "source": [ "unemployment.rename(columns={'country' : 'year'}).head(3)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The above code didn't change the original `unemployment` DataFrame, as we can check that in the following cell. Instead, it created a copy of the `unemployment` DataFrame in which it changed the name of the column `country`. " ] }, { "cell_type": "code", "execution_count": 28, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
countryseasonalityyear_monthunemploymentunemployment_rate
0atnsa1993.011710004.5
1atnsa1993.021750004.6
2atnsa1993.031660004.4
\n", "
" ], "text/plain": [ " country seasonality year_month unemployment unemployment_rate\n", "0 at nsa 1993.01 171000 4.5\n", "1 at nsa 1993.02 175000 4.6\n", "2 at nsa 1993.03 166000 4.4" ] }, "execution_count": 28, "metadata": {}, "output_type": "execute_result" } ], "source": [ "unemployment.head(3)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "An alternative way to rename the column is shown in the following cell. This code does not use `inplace=True` to modify the original DataFrame, but instead it creates a new DataFrame object with a renamed column and assigns it to the name `unemployment`." ] }, { "cell_type": "code", "execution_count": 29, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
countryseasonalityyear_monthtemporary_nameunemployment_rate
0atnsa1993.011710004.5
1atnsa1993.021750004.6
2atnsa1993.031660004.4
\n", "
" ], "text/plain": [ " country seasonality year_month temporary_name unemployment_rate\n", "0 at nsa 1993.01 171000 4.5\n", "1 at nsa 1993.02 175000 4.6\n", "2 at nsa 1993.03 166000 4.4" ] }, "execution_count": 29, "metadata": {}, "output_type": "execute_result" } ], "source": [ "unemployment = unemployment.rename(columns={'unemployment' : 'temporary_name'})\n", "unemployment.head(3)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's change it back to the original column name." ] }, { "cell_type": "code", "execution_count": 30, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
countryseasonalityyear_monthunemploymentunemployment_rate
0atnsa1993.011710004.5
1atnsa1993.021750004.6
2atnsa1993.031660004.4
\n", "
" ], "text/plain": [ " country seasonality year_month unemployment unemployment_rate\n", "0 at nsa 1993.01 171000 4.5\n", "1 at nsa 1993.02 175000 4.6\n", "2 at nsa 1993.03 166000 4.4" ] }, "execution_count": 30, "metadata": {}, "output_type": "execute_result" } ], "source": [ "unemployment = unemployment.rename(columns={'temporary_name': 'unemployment'})\n", "unemployment.head(3)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Selecting Columns" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To select a single column of the DataFrame, we can either use the name of the column enclosed in square brackets `[]` or the dot notation `.` (i.e., via *attribute access*). It is preferable to use the square brackets notation, since a column name might inadvertently have the same name as a built-in `pandas` method." ] }, { "cell_type": "code", "execution_count": 31, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0 1993.01\n", "1 1993.02\n", "2 1993.03\n", "3 1993.04\n", "4 1993.05\n", "Name: year_month, dtype: float64" ] }, "execution_count": 31, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# access with square brackets\n", "unemployment['year_month'].head()" ] }, { "cell_type": "code", "execution_count": 32, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0 1993.01\n", "1 1993.02\n", "2 1993.03\n", "3 1993.04\n", "4 1993.05\n", "Name: year_month, dtype: float64" ] }, "execution_count": 32, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# access with dot notation\n", "unemployment.year_month.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "When selecting a single column, we obtain a `pandas` **Series** object, which is a single vector of data with an associated array of index row labels shown in the left-most column.\n", "\n", "> A **Series** object in `pandas` represents a 1-dimensional vector of data (i.e., a column of data)." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "If we check the type of the `unemployment` object and `unemployment['year_month']` object, we can see that the first one is DataFrame and the second one is Series." ] }, { "cell_type": "code", "execution_count": 33, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "pandas.core.frame.DataFrame" ] }, "execution_count": 33, "metadata": {}, "output_type": "execute_result" } ], "source": [ "type(unemployment)" ] }, { "cell_type": "code", "execution_count": 34, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "pandas.core.series.Series" ] }, "execution_count": 34, "metadata": {}, "output_type": "execute_result" } ], "source": [ "type(unemployment['year_month'])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "`pandas` provide many methods that can be applied to `Series` objects. A few examples are shown below." ] }, { "cell_type": "code", "execution_count": 35, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "minimum is 1983.01\n", "maximum is 2010.12\n", "mean value is 1999.4012896711013\n" ] } ], "source": [ "print('minimum is ', unemployment['year_month'].min())\n", "print('maximum is ', unemployment['year_month'].max())\n", "print('mean value is ', unemployment['year_month'].mean())" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To select multiple columns in `pandas`, use a list of column names within the selection brackets `[]`." ] }, { "cell_type": "code", "execution_count": 36, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
countryyear_month
0at1993.01
1at1993.02
2at1993.03
3at1993.04
4at1993.05
\n", "
" ], "text/plain": [ " country year_month\n", "0 at 1993.01\n", "1 at 1993.02\n", "2 at 1993.03\n", "3 at 1993.04\n", "4 at 1993.05" ] }, "execution_count": 36, "metadata": {}, "output_type": "execute_result" } ], "source": [ "unemployment[['country','year_month']].head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Selecting Rows" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "One way to select rows is by using the `[]` operator, similar to indexing and slicing in Python lists and other sequence data." ] }, { "cell_type": "code", "execution_count": 37, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
countryseasonalityyear_monthunemploymentunemployment_rate
0atnsa1993.011710004.5
1atnsa1993.021750004.6
2atnsa1993.031660004.4
3atnsa1993.041570004.1
\n", "
" ], "text/plain": [ " country seasonality year_month unemployment unemployment_rate\n", "0 at nsa 1993.01 171000 4.5\n", "1 at nsa 1993.02 175000 4.6\n", "2 at nsa 1993.03 166000 4.4\n", "3 at nsa 1993.04 157000 4.1" ] }, "execution_count": 37, "metadata": {}, "output_type": "execute_result" } ], "source": [ "unemployment[0:4]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Another graphical representation of a DataFrame is shown in the figure below.\n", "\n", "\n", "Figure source: Reference [2]." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The first column with the indices in `pandas` DataFrames does not need to be a sequence of integers, but it can also contain strings or other numeric data (e.g., dates, years).\n", "\n", "For instance, let's create a `DataFrame` called *bacteria* to see how indexing with string indices works. We again pass in a dictionary, with the keys corresponding to column names and the values to the data, and in addition we pass a list of strings called `index`. (Compare to the *simple_table* above, which does not use `index` for creating the DataFrame, and in that case, the indices were automatically set to integer numbers.)" ] }, { "cell_type": "code", "execution_count": 38, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
bacteria_countsother_feature
Firmicutes632438
Proteobacteria1638833
Actinobacteria569234
Bacteroidetes115298
\n", "
" ], "text/plain": [ " bacteria_counts other_feature\n", "Firmicutes 632 438\n", "Proteobacteria 1638 833\n", "Actinobacteria 569 234\n", "Bacteroidetes 115 298" ] }, "execution_count": 38, "metadata": {}, "output_type": "execute_result" } ], "source": [ "bacteria = pd.DataFrame({'bacteria_counts' : [632, 1638, 569, 115],\n", " 'other_feature' : [438, 833, 234, 298]},\n", " index=['Firmicutes', 'Proteobacteria', 'Actinobacteria', 'Bacteroidetes'])\n", "bacteria" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To return the labels for the columns and indices of a DataFrame in `pandas`, we can use the methods shown in the following celss. Note again that the type of the returned objects is `Index` object." ] }, { "cell_type": "code", "execution_count": 39, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "Index(['bacteria_counts', 'other_feature'], dtype='object')" ] }, "execution_count": 39, "metadata": {}, "output_type": "execute_result" } ], "source": [ "bacteria.columns" ] }, { "cell_type": "code", "execution_count": 40, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "Index(['Firmicutes', 'Proteobacteria', 'Actinobacteria', 'Bacteroidetes'], dtype='object')" ] }, "execution_count": 40, "metadata": {}, "output_type": "execute_result" } ], "source": [ "bacteria.index" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "For selecting rows and/or columns in `pandas`, beside the use of square brackets, two other operators are used: `.loc` and `.iloc`. \n", "\n", "The operator `.loc` accesses rows by using string indices (i.e., string locations), that is, it uses the labels of the rows for indexing.\n", "\n", "The operator `.iloc` accesses rows by using integer indices (i.e., integer locations), that is, it uses the integer positions of the rows for indexing. \n", "\n", "The operators `.loc` and `.iloc` can accept either a single index (e.g., `'f'` or `5`), a list of indices (e.g., `['a','f']` or `[2,5]`), or a slice of indices (e.g., `'a:f'` or `2:5`)." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "For instance, if we are interested in accessing the row *Actinobacteria*, we can use `.loc` and the index name. This returns the column values for the specified row." ] }, { "cell_type": "code", "execution_count": 41, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "bacteria_counts 569\n", "other_feature 234\n", "Name: Actinobacteria, dtype: int64" ] }, "execution_count": 41, "metadata": {}, "output_type": "execute_result" } ], "source": [ "bacteria.loc['Actinobacteria']" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The operator `.iloc` does not work with string indices, and it returns an error in the following cell. " ] }, { "cell_type": "code", "execution_count": 42, "metadata": {}, "outputs": [ { "ename": "TypeError", "evalue": "Cannot index by location index with a non-integer key", "output_type": "error", "traceback": [ "\u001b[1;31m---------------------------------------------------------------------------\u001b[0m", "\u001b[1;31mTypeError\u001b[0m Traceback (most recent call last)", "Cell \u001b[1;32mIn[42], line 1\u001b[0m\n\u001b[1;32m----> 1\u001b[0m \u001b[43mbacteria\u001b[49m\u001b[38;5;241;43m.\u001b[39;49m\u001b[43miloc\u001b[49m\u001b[43m[\u001b[49m\u001b[38;5;124;43m'\u001b[39;49m\u001b[38;5;124;43mActinobacteria\u001b[39;49m\u001b[38;5;124;43m'\u001b[39;49m\u001b[43m]\u001b[49m\n", "File \u001b[1;32m~\\anaconda3\\lib\\site-packages\\pandas\\core\\indexing.py:967\u001b[0m, in \u001b[0;36m_LocationIndexer.__getitem__\u001b[1;34m(self, key)\u001b[0m\n\u001b[0;32m 964\u001b[0m axis \u001b[38;5;241m=\u001b[39m \u001b[38;5;28mself\u001b[39m\u001b[38;5;241m.\u001b[39maxis \u001b[38;5;129;01mor\u001b[39;00m \u001b[38;5;241m0\u001b[39m\n\u001b[0;32m 966\u001b[0m maybe_callable \u001b[38;5;241m=\u001b[39m com\u001b[38;5;241m.\u001b[39mapply_if_callable(key, \u001b[38;5;28mself\u001b[39m\u001b[38;5;241m.\u001b[39mobj)\n\u001b[1;32m--> 967\u001b[0m \u001b[38;5;28;01mreturn\u001b[39;00m \u001b[38;5;28;43mself\u001b[39;49m\u001b[38;5;241;43m.\u001b[39;49m\u001b[43m_getitem_axis\u001b[49m\u001b[43m(\u001b[49m\u001b[43mmaybe_callable\u001b[49m\u001b[43m,\u001b[49m\u001b[43m \u001b[49m\u001b[43maxis\u001b[49m\u001b[38;5;241;43m=\u001b[39;49m\u001b[43maxis\u001b[49m\u001b[43m)\u001b[49m\n", "File \u001b[1;32m~\\anaconda3\\lib\\site-packages\\pandas\\core\\indexing.py:1520\u001b[0m, in \u001b[0;36m_iLocIndexer._getitem_axis\u001b[1;34m(self, key, axis)\u001b[0m\n\u001b[0;32m 1518\u001b[0m key \u001b[38;5;241m=\u001b[39m item_from_zerodim(key)\n\u001b[0;32m 1519\u001b[0m \u001b[38;5;28;01mif\u001b[39;00m \u001b[38;5;129;01mnot\u001b[39;00m is_integer(key):\n\u001b[1;32m-> 1520\u001b[0m \u001b[38;5;28;01mraise\u001b[39;00m \u001b[38;5;167;01mTypeError\u001b[39;00m(\u001b[38;5;124m\"\u001b[39m\u001b[38;5;124mCannot index by location index with a non-integer key\u001b[39m\u001b[38;5;124m\"\u001b[39m)\n\u001b[0;32m 1522\u001b[0m \u001b[38;5;66;03m# validate the location\u001b[39;00m\n\u001b[0;32m 1523\u001b[0m \u001b[38;5;28mself\u001b[39m\u001b[38;5;241m.\u001b[39m_validate_integer(key, axis)\n", "\u001b[1;31mTypeError\u001b[0m: Cannot index by location index with a non-integer key" ] } ], "source": [ "bacteria.iloc['Actinobacteria']" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To access rows with `.iloc`, we need to provide integer indices. Note that we can still access the row with `iloc`, even though the indices are strings." ] }, { "cell_type": "code", "execution_count": 43, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "bacteria_counts 569\n", "other_feature 234\n", "Name: Actinobacteria, dtype: int64" ] }, "execution_count": 43, "metadata": {}, "output_type": "execute_result" } ], "source": [ "bacteria.iloc[2]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In addition, we can also use \"positional indexing\" with square brackets `[]`, as in slicing operations with list and other sequence objects." ] }, { "cell_type": "code", "execution_count": 44, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
bacteria_countsother_feature
Actinobacteria569234
\n", "
" ], "text/plain": [ " bacteria_counts other_feature\n", "Actinobacteria 569 234" ] }, "execution_count": 44, "metadata": {}, "output_type": "execute_result" } ], "source": [ "bacteria[2:3]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "However, `pandas` doesn't support indexing for accessing individual rows, and with positional indexing we need to use a slice (as in the above example `bacteria[2:3]`), otherwise, we will get an error." ] }, { "cell_type": "code", "execution_count": 45, "metadata": {}, "outputs": [ { "ename": "KeyError", "evalue": "2", "output_type": "error", "traceback": [ "\u001b[1;31m---------------------------------------------------------------------------\u001b[0m", "\u001b[1;31mKeyError\u001b[0m Traceback (most recent call last)", "File \u001b[1;32m~\\anaconda3\\lib\\site-packages\\pandas\\core\\indexes\\base.py:3629\u001b[0m, in \u001b[0;36mIndex.get_loc\u001b[1;34m(self, key, method, tolerance)\u001b[0m\n\u001b[0;32m 3628\u001b[0m \u001b[38;5;28;01mtry\u001b[39;00m:\n\u001b[1;32m-> 3629\u001b[0m \u001b[38;5;28;01mreturn\u001b[39;00m \u001b[38;5;28;43mself\u001b[39;49m\u001b[38;5;241;43m.\u001b[39;49m\u001b[43m_engine\u001b[49m\u001b[38;5;241;43m.\u001b[39;49m\u001b[43mget_loc\u001b[49m\u001b[43m(\u001b[49m\u001b[43mcasted_key\u001b[49m\u001b[43m)\u001b[49m\n\u001b[0;32m 3630\u001b[0m \u001b[38;5;28;01mexcept\u001b[39;00m \u001b[38;5;167;01mKeyError\u001b[39;00m \u001b[38;5;28;01mas\u001b[39;00m err:\n", "File \u001b[1;32m~\\anaconda3\\lib\\site-packages\\pandas\\_libs\\index.pyx:136\u001b[0m, in \u001b[0;36mpandas._libs.index.IndexEngine.get_loc\u001b[1;34m()\u001b[0m\n", "File \u001b[1;32m~\\anaconda3\\lib\\site-packages\\pandas\\_libs\\index.pyx:163\u001b[0m, in \u001b[0;36mpandas._libs.index.IndexEngine.get_loc\u001b[1;34m()\u001b[0m\n", "File \u001b[1;32mpandas\\_libs\\hashtable_class_helper.pxi:5198\u001b[0m, in \u001b[0;36mpandas._libs.hashtable.PyObjectHashTable.get_item\u001b[1;34m()\u001b[0m\n", "File \u001b[1;32mpandas\\_libs\\hashtable_class_helper.pxi:5206\u001b[0m, in \u001b[0;36mpandas._libs.hashtable.PyObjectHashTable.get_item\u001b[1;34m()\u001b[0m\n", "\u001b[1;31mKeyError\u001b[0m: 2", "\nThe above exception was the direct cause of the following exception:\n", "\u001b[1;31mKeyError\u001b[0m Traceback (most recent call last)", "Cell \u001b[1;32mIn[45], line 1\u001b[0m\n\u001b[1;32m----> 1\u001b[0m \u001b[43mbacteria\u001b[49m\u001b[43m[\u001b[49m\u001b[38;5;241;43m2\u001b[39;49m\u001b[43m]\u001b[49m\n", "File \u001b[1;32m~\\anaconda3\\lib\\site-packages\\pandas\\core\\frame.py:3505\u001b[0m, in \u001b[0;36mDataFrame.__getitem__\u001b[1;34m(self, key)\u001b[0m\n\u001b[0;32m 3503\u001b[0m \u001b[38;5;28;01mif\u001b[39;00m \u001b[38;5;28mself\u001b[39m\u001b[38;5;241m.\u001b[39mcolumns\u001b[38;5;241m.\u001b[39mnlevels \u001b[38;5;241m>\u001b[39m \u001b[38;5;241m1\u001b[39m:\n\u001b[0;32m 3504\u001b[0m \u001b[38;5;28;01mreturn\u001b[39;00m \u001b[38;5;28mself\u001b[39m\u001b[38;5;241m.\u001b[39m_getitem_multilevel(key)\n\u001b[1;32m-> 3505\u001b[0m indexer \u001b[38;5;241m=\u001b[39m \u001b[38;5;28;43mself\u001b[39;49m\u001b[38;5;241;43m.\u001b[39;49m\u001b[43mcolumns\u001b[49m\u001b[38;5;241;43m.\u001b[39;49m\u001b[43mget_loc\u001b[49m\u001b[43m(\u001b[49m\u001b[43mkey\u001b[49m\u001b[43m)\u001b[49m\n\u001b[0;32m 3506\u001b[0m \u001b[38;5;28;01mif\u001b[39;00m is_integer(indexer):\n\u001b[0;32m 3507\u001b[0m indexer \u001b[38;5;241m=\u001b[39m [indexer]\n", "File \u001b[1;32m~\\anaconda3\\lib\\site-packages\\pandas\\core\\indexes\\base.py:3631\u001b[0m, in \u001b[0;36mIndex.get_loc\u001b[1;34m(self, key, method, tolerance)\u001b[0m\n\u001b[0;32m 3629\u001b[0m \u001b[38;5;28;01mreturn\u001b[39;00m \u001b[38;5;28mself\u001b[39m\u001b[38;5;241m.\u001b[39m_engine\u001b[38;5;241m.\u001b[39mget_loc(casted_key)\n\u001b[0;32m 3630\u001b[0m \u001b[38;5;28;01mexcept\u001b[39;00m \u001b[38;5;167;01mKeyError\u001b[39;00m \u001b[38;5;28;01mas\u001b[39;00m err:\n\u001b[1;32m-> 3631\u001b[0m \u001b[38;5;28;01mraise\u001b[39;00m \u001b[38;5;167;01mKeyError\u001b[39;00m(key) \u001b[38;5;28;01mfrom\u001b[39;00m \u001b[38;5;21;01merr\u001b[39;00m\n\u001b[0;32m 3632\u001b[0m \u001b[38;5;28;01mexcept\u001b[39;00m \u001b[38;5;167;01mTypeError\u001b[39;00m:\n\u001b[0;32m 3633\u001b[0m \u001b[38;5;66;03m# If we have a listlike key, _check_indexing_error will raise\u001b[39;00m\n\u001b[0;32m 3634\u001b[0m \u001b[38;5;66;03m# InvalidIndexError. Otherwise we fall through and re-raise\u001b[39;00m\n\u001b[0;32m 3635\u001b[0m \u001b[38;5;66;03m# the TypeError.\u001b[39;00m\n\u001b[0;32m 3636\u001b[0m \u001b[38;5;28mself\u001b[39m\u001b[38;5;241m.\u001b[39m_check_indexing_error(key)\n", "\u001b[1;31mKeyError\u001b[0m: 2" ] } ], "source": [ "bacteria[2]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "There is another important difference between the above two selections, as `.loc` and `.iloc` return a `Series` object because we selected a single label, while `[2:3]` returns a `DataFrame` because we selected a range of positions. Let's check this." ] }, { "cell_type": "code", "execution_count": 46, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "pandas.core.series.Series" ] }, "execution_count": 46, "metadata": {}, "output_type": "execute_result" } ], "source": [ "type(bacteria.loc['Actinobacteria'])" ] }, { "cell_type": "code", "execution_count": 47, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "pandas.core.series.Series" ] }, "execution_count": 47, "metadata": {}, "output_type": "execute_result" } ], "source": [ "type(bacteria.iloc[2])" ] }, { "cell_type": "code", "execution_count": 48, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "pandas.core.frame.DataFrame" ] }, "execution_count": 48, "metadata": {}, "output_type": "execute_result" } ], "source": [ "type(bacteria[2:3])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's return to the *unemployment* data to show how `.iloc` is used, since *unemployment* has integer indices. " ] }, { "cell_type": "code", "execution_count": 49, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
countryseasonalityyear_monthunemploymentunemployment_rate
0atnsa1993.011710004.5
1atnsa1993.021750004.6
2atnsa1993.031660004.4
3atnsa1993.041570004.1
\n", "
" ], "text/plain": [ " country seasonality year_month unemployment unemployment_rate\n", "0 at nsa 1993.01 171000 4.5\n", "1 at nsa 1993.02 175000 4.6\n", "2 at nsa 1993.03 166000 4.4\n", "3 at nsa 1993.04 157000 4.1" ] }, "execution_count": 49, "metadata": {}, "output_type": "execute_result" } ], "source": [ "unemployment.iloc[0:4]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Note also that we can use `loc` with integer indices as well, however the output is different than `.iloc`. The difference is discussed in the section below. " ] }, { "cell_type": "code", "execution_count": 50, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
countryseasonalityyear_monthunemploymentunemployment_rate
0atnsa1993.011710004.5
1atnsa1993.021750004.6
2atnsa1993.031660004.4
3atnsa1993.041570004.1
4atnsa1993.051470003.9
\n", "
" ], "text/plain": [ " country seasonality year_month unemployment unemployment_rate\n", "0 at nsa 1993.01 171000 4.5\n", "1 at nsa 1993.02 175000 4.6\n", "2 at nsa 1993.03 166000 4.4\n", "3 at nsa 1993.04 157000 4.1\n", "4 at nsa 1993.05 147000 3.9" ] }, "execution_count": 50, "metadata": {}, "output_type": "execute_result" } ], "source": [ "unemployment.loc[0:4]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Selecting a Specific Value" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Both `.loc` and `.iloc` can be used to select a particular value if they are given two arguments. The first argument is the row name (when using `.loc`) or the row index number (when using `.iloc`), while the second argument is the column name or index number.\n", "\n", "Using `.loc`, we can select \"Bacteroidetes\" and \"bacteria_counts\" to get the count of Bacteroidetes, as in the next cell below." ] }, { "cell_type": "code", "execution_count": 51, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
bacteria_countsother_feature
Firmicutes632438
Proteobacteria1638833
Actinobacteria569234
Bacteroidetes115298
\n", "
" ], "text/plain": [ " bacteria_counts other_feature\n", "Firmicutes 632 438\n", "Proteobacteria 1638 833\n", "Actinobacteria 569 234\n", "Bacteroidetes 115 298" ] }, "execution_count": 51, "metadata": {}, "output_type": "execute_result" } ], "source": [ "bacteria" ] }, { "cell_type": "code", "execution_count": 52, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "115" ] }, "execution_count": 52, "metadata": {}, "output_type": "execute_result" } ], "source": [ "bacteria.loc['Bacteroidetes']['bacteria_counts']" ] }, { "cell_type": "code", "execution_count": 53, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "115" ] }, "execution_count": 53, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# This is the same as above\n", "bacteria.iloc[3][0]" ] }, { "cell_type": "code", "execution_count": 54, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "115" ] }, "execution_count": 54, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# This the same as above\n", "bacteria.iloc[3]['bacteria_counts']" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Or, for the *unemployment* data:" ] }, { "cell_type": "code", "execution_count": 55, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "1993.01" ] }, "execution_count": 55, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# The year_month in the first row\n", "unemployment.iloc[0,2]" ] }, { "cell_type": "code", "execution_count": 56, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "1993.01" ] }, "execution_count": 56, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# This the same as above\n", "unemployment.iloc[0][2]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Selecting Multiple Rows and Columns\n", "\n", "Both `.loc` and `.iloc` can be used to select subsets of rows and columns at the same time if they are given lists as arguments, or slices for `.iloc`.\n", "\n", "The following example uses a list with `.iloc` to select specific rows, similar to fancy indexing in NumPy." ] }, { "cell_type": "code", "execution_count": 57, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
countryseasonalityyear_monthunemploymentunemployment_rate
1atnsa1993.021750004.6
5atnsa1993.061340003.5
6atnsa1993.071280003.4
22atnsa1994.111480003.9
\n", "
" ], "text/plain": [ " country seasonality year_month unemployment unemployment_rate\n", "1 at nsa 1993.02 175000 4.6\n", "5 at nsa 1993.06 134000 3.5\n", "6 at nsa 1993.07 128000 3.4\n", "22 at nsa 1994.11 148000 3.9" ] }, "execution_count": 57, "metadata": {}, "output_type": "execute_result" } ], "source": [ "unemployment.iloc[[1, 5, 6, 22]]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can also select a range of rows and specify the step value." ] }, { "cell_type": "code", "execution_count": 58, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
countryseasonalityyear_monthunemploymentunemployment_rate
25atnsa1995.021740004.5
30atnsa1995.071230003.3
35atnsa1995.121750004.7
40atnsa1996.051590004.3
45atnsa1996.101460003.9
\n", "
" ], "text/plain": [ " country seasonality year_month unemployment unemployment_rate\n", "25 at nsa 1995.02 174000 4.5\n", "30 at nsa 1995.07 123000 3.3\n", "35 at nsa 1995.12 175000 4.7\n", "40 at nsa 1996.05 159000 4.3\n", "45 at nsa 1996.10 146000 3.9" ] }, "execution_count": 58, "metadata": {}, "output_type": "execute_result" } ], "source": [ "unemployment.iloc[25:50:5]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "And we can apply slicing over rows and columns. " ] }, { "cell_type": "code", "execution_count": 59, "metadata": { "tags": [] }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
countryseasonality
2atnsa
3atnsa
4atnsa
5atnsa
\n", "
" ], "text/plain": [ " country seasonality\n", "2 at nsa\n", "3 at nsa\n", "4 at nsa\n", "5 at nsa" ] }, "execution_count": 59, "metadata": {}, "output_type": "execute_result" } ], "source": [ "unemployment.iloc[2:6,0:2]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The following example selects multiple rows with `.loc` based on string indices." ] }, { "cell_type": "code", "execution_count": 60, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
bacteria_countsother_feature
Firmicutes632438
Actinobacteria569234
\n", "
" ], "text/plain": [ " bacteria_counts other_feature\n", "Firmicutes 632 438\n", "Actinobacteria 569 234" ] }, "execution_count": 60, "metadata": {}, "output_type": "execute_result" } ], "source": [ "bacteria.loc[['Firmicutes', 'Actinobacteria']]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can also select a subset of rows and columns with `.loc`." ] }, { "cell_type": "code", "execution_count": 61, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
bacteria_counts
Bacteroidetes115
Actinobacteria569
\n", "
" ], "text/plain": [ " bacteria_counts\n", "Bacteroidetes 115\n", "Actinobacteria 569" ] }, "execution_count": 61, "metadata": {}, "output_type": "execute_result" } ], "source": [ "bacteria.loc[['Bacteroidetes', 'Actinobacteria'], ['bacteria_counts']]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Using `.iloc` on the *unemployment* DataFrame, select the rows starting at row 2 and ending at row 5, and the 0th, 2nd, and 3rd columns." ] }, { "cell_type": "code", "execution_count": 62, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
countryyear_monthunemployment
2at1993.03166000
3at1993.04157000
4at1993.05147000
5at1993.06134000
\n", "
" ], "text/plain": [ " country year_month unemployment\n", "2 at 1993.03 166000\n", "3 at 1993.04 157000\n", "4 at 1993.05 147000\n", "5 at 1993.06 134000" ] }, "execution_count": 62, "metadata": {}, "output_type": "execute_result" } ], "source": [ "unemployment.iloc[2:6,[0,2,3]]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The same selection can be achieved by using the `.loc` operator and listing the column names." ] }, { "cell_type": "code", "execution_count": 63, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
countryyear_monthunemployment
2at1993.03166000
3at1993.04157000
4at1993.05147000
5at1993.06134000
6at1993.07128000
\n", "
" ], "text/plain": [ " country year_month unemployment\n", "2 at 1993.03 166000\n", "3 at 1993.04 157000\n", "4 at 1993.05 147000\n", "5 at 1993.06 134000\n", "6 at 1993.07 128000" ] }, "execution_count": 63, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# The same as above\n", "unemployment.loc[2:6,['country', 'year_month', 'unemployment']]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Selecting Multiple Rows and Columns Using Conditional Expressions" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can also display values from a DataFrame that satisfy certain criteria using conditional expressions, such as `<`, `>`, `==`, `!=`, etc. \n", "\n", "One example is shown below where only the rows that have an unemployment rate greater than 15 are shown. " ] }, { "cell_type": "code", "execution_count": 64, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
countryseasonalityyear_monthunemploymentunemployment_rate
1717bgnsa2000.0252300015.4
1718bgnsa2000.0354700016.0
1719bgnsa2000.0456000016.3
1720bgnsa2000.0556100016.3
1721bgnsa2000.0655400016.2
1722bgnsa2000.0755800016.3
1723bgnsa2000.0856900016.7
1724bgnsa2000.0957400016.8
1725bgnsa2000.1058300017.1
1726bgnsa2000.1159700017.5
\n", "
" ], "text/plain": [ " country seasonality year_month unemployment unemployment_rate\n", "1717 bg nsa 2000.02 523000 15.4\n", "1718 bg nsa 2000.03 547000 16.0\n", "1719 bg nsa 2000.04 560000 16.3\n", "1720 bg nsa 2000.05 561000 16.3\n", "1721 bg nsa 2000.06 554000 16.2\n", "1722 bg nsa 2000.07 558000 16.3\n", "1723 bg nsa 2000.08 569000 16.7\n", "1724 bg nsa 2000.09 574000 16.8\n", "1725 bg nsa 2000.10 583000 17.1\n", "1726 bg nsa 2000.11 597000 17.5" ] }, "execution_count": 64, "metadata": {}, "output_type": "execute_result" } ], "source": [ "unemployment[unemployment['unemployment_rate'] > 15].head(10)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The following cells present one more example of selecting rows based on conditional statements." ] }, { "cell_type": "code", "execution_count": 65, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
agescorertgroup
017123.552test
119101.624test
221116.431test
337157.132test
418162.925test
519144.662control
647253.634control
718213.635control
819295.234control
\n", "
" ], "text/plain": [ " age score rt group\n", "0 17 12 3.552 test\n", "1 19 10 1.624 test\n", "2 21 11 6.431 test\n", "3 37 15 7.132 test\n", "4 18 16 2.925 test\n", "5 19 14 4.662 control\n", "6 47 25 3.634 control\n", "7 18 21 3.635 control\n", "8 19 29 5.234 control" ] }, "execution_count": 65, "metadata": {}, "output_type": "execute_result" } ], "source": [ "score_df = pd.DataFrame(\n", " {'age': [17, 19, 21, 37, 18, 19, 47, 18, 19],\n", " 'score': [12, 10, 11, 15, 16, 14, 25, 21, 29],\n", " 'rt': [3.552, 1.624, 6.431, 7.132, 2.925, 4.662, 3.634, 3.635, 5.234],\n", " 'group': [\"test\", \"test\", \"test\", \"test\", \"test\", \"control\", \"control\", \"control\", \"control\"]\n", " })\n", "\n", "score_df" ] }, { "cell_type": "code", "execution_count": 66, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
agescorertgroup
017123.552test
119101.624test
221116.431test
337157.132test
418162.925test
\n", "
" ], "text/plain": [ " age score rt group\n", "0 17 12 3.552 test\n", "1 19 10 1.624 test\n", "2 21 11 6.431 test\n", "3 37 15 7.132 test\n", "4 18 16 2.925 test" ] }, "execution_count": 66, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# select only rows from the 'test' group\n", "df_test = score_df[score_df['group'] == 'test']\n", "df_test" ] }, { "cell_type": "code", "execution_count": 67, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
agescorertgroup
221116.431test
337157.132test
647253.634control
\n", "
" ], "text/plain": [ " age score rt group\n", "2 21 11 6.431 test\n", "3 37 15 7.132 test\n", "6 47 25 3.634 control" ] }, "execution_count": 67, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# select only rows for age > 19\n", "df_adult = score_df[score_df['age']> 19] \n", "df_adult" ] }, { "cell_type": "code", "execution_count": 68, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
agescorertgroup
221116.431test
337157.132test
647253.634control
\n", "
" ], "text/plain": [ " age score rt group\n", "2 21 11 6.431 test\n", "3 37 15 7.132 test\n", "6 47 25 3.634 control" ] }, "execution_count": 68, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# select only rows for age > 19 and rt \n", "adult_and_rt = score_df[(score_df['age'] > 19) & (score_df['rt'] > 3)]\n", "adult_and_rt" ] }, { "cell_type": "code", "execution_count": 69, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
agescorertgroup
221116.431test
337157.132test
\n", "
" ], "text/plain": [ " age score rt group\n", "2 21 11 6.431 test\n", "3 37 15 7.132 test" ] }, "execution_count": 69, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# select only rows for age and rt in test group\n", "adult_and_rt_test = score_df[(score_df['age'] > 19) & (score_df['rt'] > 3) & (score_df['group'] == 'test')]\n", "adult_and_rt_test" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Differences between `loc` and `iloc`" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To show the differences between `loc` and `iloc` let's consider the following example." ] }, { "cell_type": "code", "execution_count": 70, "metadata": { "tags": [] }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
xyz
10102030
11203040
12304050
0405060
1506070
\n", "
" ], "text/plain": [ " x y z\n", "10 10 20 30\n", "11 20 30 40\n", "12 30 40 50\n", "0 40 50 60\n", "1 50 60 70" ] }, "execution_count": 70, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df1 = pd.DataFrame({'x':[10, 20, 30, 40 ,50], \n", " 'y':[20, 30, 40, 50, 60],\n", " 'z':[30, 40, 50, 60, 70]},\n", " index=[10, 11,12, 0, 1])\n", "df1" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Note in the following cells that `df1.iloc[0] ` selects the row with index location `0`, whereas `loc` selects the row with index label `0`." ] }, { "cell_type": "code", "execution_count": 71, "metadata": { "tags": [] }, "outputs": [ { "data": { "text/plain": [ "x 10\n", "y 20\n", "z 30\n", "Name: 10, dtype: int64" ] }, "execution_count": 71, "metadata": {}, "output_type": "execute_result" } ], "source": [ " # value at index location 0\n", "df1.iloc[0] " ] }, { "cell_type": "code", "execution_count": 72, "metadata": { "tags": [] }, "outputs": [ { "data": { "text/plain": [ "x 40\n", "y 50\n", "z 60\n", "Name: 0, dtype: int64" ] }, "execution_count": 72, "metadata": {}, "output_type": "execute_result" } ], "source": [ " # value at index label 0\n", "df1.loc[0] " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Also, there is a difference in the selected rows when using slicing operations with `iloc` and `loc`. One must be careful when using these operators, and always check the output to ensure it is as expected." ] }, { "cell_type": "code", "execution_count": 73, "metadata": { "tags": [] }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
xyz
10102030
\n", "
" ], "text/plain": [ " x y z\n", "10 10 20 30" ] }, "execution_count": 73, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# rows at index location between 0 and 1 (exclusive)\n", "df1.iloc[0:1] " ] }, { "cell_type": "code", "execution_count": 74, "metadata": { "tags": [] }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
xyz
0405060
1506070
\n", "
" ], "text/plain": [ " x y z\n", "0 40 50 60\n", "1 50 60 70" ] }, "execution_count": 74, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# rows at index labels between 0 and 1 (inclusive)\n", "df1.loc[0:1] " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 7.4 Creating New Columns, Reordering
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To create a new column in a DataFrame, we simply assign values to the new column, as in: \n", "\n", "```\n", "df['New_Column'] = ...\n", "```\n", "\n", "Often, the new columns are created from existing columns based on certain conditions, or by applying functions to existing columns.\n", "\n", "```\n", "Condition-Based: df['New_Column'] = df['Existing_Column'] > value\n", "Function-Based: df['New_Column'] = df['Existing_Column'].apply(function)\n", "```" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Since the `year_month` column is not shown correctly in the `unemployment` DataFrame, let's try to split it into two separate columns for years and months. \n", "\n", "In the previous section, we saw that the data type in this column is `float64`. We will first extract the year using the `.astype()` method. This allows for type casting, i.e., using `.astype(int)` we will convert the floating point values into integer numbers (by truncating the decimals).\n", "\n", "The new column *year* will be added on the right of the DataFrame." ] }, { "cell_type": "code", "execution_count": 75, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
countryseasonalityyear_monthunemploymentunemployment_rateyear
0atnsa1993.011710004.51993
1atnsa1993.021750004.61993
2atnsa1993.031660004.41993
3atnsa1993.041570004.11993
4atnsa1993.051470003.91993
\n", "
" ], "text/plain": [ " country seasonality year_month unemployment unemployment_rate year\n", "0 at nsa 1993.01 171000 4.5 1993\n", "1 at nsa 1993.02 175000 4.6 1993\n", "2 at nsa 1993.03 166000 4.4 1993\n", "3 at nsa 1993.04 157000 4.1 1993\n", "4 at nsa 1993.05 147000 3.9 1993" ] }, "execution_count": 75, "metadata": {}, "output_type": "execute_result" } ], "source": [ "unemployment['year'] = unemployment['year_month'].astype(int)\n", "unemployment.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Next, let's create a new column *month*. We will subtract the *year* value from *year_month* to get the decimal portion of the value, and multiply the result by 100 and convert to `int`. Because of the truncating that occurs when casting to `int`, we first need to round the values to the nearest whole number using `round()`. " ] }, { "cell_type": "code", "execution_count": 76, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
countryseasonalityyear_monthunemploymentunemployment_rateyearmonth
0atnsa1993.011710004.519931
1atnsa1993.021750004.619932
2atnsa1993.031660004.419933
3atnsa1993.041570004.119934
4atnsa1993.051470003.919935
5atnsa1993.061340003.519936
6atnsa1993.071280003.419937
7atnsa1993.081300003.419938
8atnsa1993.091320003.519939
9atnsa1993.101410003.7199310
10atnsa1993.111560004.1199311
11atnsa1993.121690004.4199312
\n", "
" ], "text/plain": [ " country seasonality year_month unemployment unemployment_rate year \\\n", "0 at nsa 1993.01 171000 4.5 1993 \n", "1 at nsa 1993.02 175000 4.6 1993 \n", "2 at nsa 1993.03 166000 4.4 1993 \n", "3 at nsa 1993.04 157000 4.1 1993 \n", "4 at nsa 1993.05 147000 3.9 1993 \n", "5 at nsa 1993.06 134000 3.5 1993 \n", "6 at nsa 1993.07 128000 3.4 1993 \n", "7 at nsa 1993.08 130000 3.4 1993 \n", "8 at nsa 1993.09 132000 3.5 1993 \n", "9 at nsa 1993.10 141000 3.7 1993 \n", "10 at nsa 1993.11 156000 4.1 1993 \n", "11 at nsa 1993.12 169000 4.4 1993 \n", "\n", " month \n", "0 1 \n", "1 2 \n", "2 3 \n", "3 4 \n", "4 5 \n", "5 6 \n", "6 7 \n", "7 8 \n", "8 9 \n", "9 10 \n", "10 11 \n", "11 12 " ] }, "execution_count": 76, "metadata": {}, "output_type": "execute_result" } ], "source": [ "unemployment['month'] = ((unemployment['year_month'] - unemployment['year']) * 100).round().astype(int)\n", "unemployment.head(12)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now, let's try to reorder the newly created *year* and *month* columns in the DataFrame. For this, we will use the square brackets notation again, passing in a list of column names in the order we would like to see them." ] }, { "cell_type": "code", "execution_count": 77, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
countryseasonalityyear_monthyearmonthunemploymentunemployment_rate
0atnsa1993.01199311710004.5
1atnsa1993.02199321750004.6
2atnsa1993.03199331660004.4
3atnsa1993.04199341570004.1
4atnsa1993.05199351470003.9
5atnsa1993.06199361340003.5
6atnsa1993.07199371280003.4
7atnsa1993.08199381300003.4
8atnsa1993.09199391320003.5
9atnsa1993.101993101410003.7
\n", "
" ], "text/plain": [ " country seasonality year_month year month unemployment \\\n", "0 at nsa 1993.01 1993 1 171000 \n", "1 at nsa 1993.02 1993 2 175000 \n", "2 at nsa 1993.03 1993 3 166000 \n", "3 at nsa 1993.04 1993 4 157000 \n", "4 at nsa 1993.05 1993 5 147000 \n", "5 at nsa 1993.06 1993 6 134000 \n", "6 at nsa 1993.07 1993 7 128000 \n", "7 at nsa 1993.08 1993 8 130000 \n", "8 at nsa 1993.09 1993 9 132000 \n", "9 at nsa 1993.10 1993 10 141000 \n", "\n", " unemployment_rate \n", "0 4.5 \n", "1 4.6 \n", "2 4.4 \n", "3 4.1 \n", "4 3.9 \n", "5 3.5 \n", "6 3.4 \n", "7 3.4 \n", "8 3.5 \n", "9 3.7 " ] }, "execution_count": 77, "metadata": {}, "output_type": "execute_result" } ], "source": [ "unemployment = unemployment[['country', 'seasonality',\n", " 'year_month', 'year', 'month',\n", " 'unemployment', 'unemployment_rate']]\n", "unemployment.head(10)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Here is one more example of creating new columns by applying functions to existing columns." ] }, { "cell_type": "code", "execution_count": 78, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
StudentScore
0Alice85
1Bob80
2Charlie78
3David92
4Eve68
\n", "
" ], "text/plain": [ " Student Score\n", "0 Alice 85\n", "1 Bob 80\n", "2 Charlie 78\n", "3 David 92\n", "4 Eve 68" ] }, "execution_count": 78, "metadata": {}, "output_type": "execute_result" } ], "source": [ "student_df = pd.DataFrame({\n", " 'Student': ['Alice', 'Bob', 'Charlie', 'David', 'Eve'],\n", " 'Score': [85, 80, 78, 92, 68]})\n", "student_df" ] }, { "cell_type": "code", "execution_count": 79, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
StudentScoreUpdated Score
0Alice8595
1Bob8090
2Charlie7888
3David92102
4Eve6878
\n", "
" ], "text/plain": [ " Student Score Updated Score\n", "0 Alice 85 95\n", "1 Bob 80 90\n", "2 Charlie 78 88\n", "3 David 92 102\n", "4 Eve 68 78" ] }, "execution_count": 79, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# add 10 marks\n", "student_df['Updated Score'] = student_df['Score'] + 10\n", "student_df" ] }, { "cell_type": "code", "execution_count": 80, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
StudentScoreUpdated ScoreFinal Score
0Alice859595
1Bob809090
2Charlie788888
3David92102100
4Eve687878
\n", "
" ], "text/plain": [ " Student Score Updated Score Final Score\n", "0 Alice 85 95 95\n", "1 Bob 80 90 90\n", "2 Charlie 78 88 88\n", "3 David 92 102 100\n", "4 Eve 68 78 78" ] }, "execution_count": 80, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# esnure scores are not greater than 100\n", "student_df['Final Score'] = student_df['Updated Score'].clip(upper=100)\n", "student_df" ] }, { "cell_type": "code", "execution_count": 81, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
StudentScoreUpdated ScoreFinal ScoreMean ScoreStandard Deviation
0Alice85959580.68.26
1Bob80909080.68.26
2Charlie78888880.68.26
3David9210210080.68.26
4Eve68787880.68.26
\n", "
" ], "text/plain": [ " Student Score Updated Score Final Score Mean Score Standard Deviation\n", "0 Alice 85 95 95 80.6 8.26\n", "1 Bob 80 90 90 80.6 8.26\n", "2 Charlie 78 88 88 80.6 8.26\n", "3 David 92 102 100 80.6 8.26\n", "4 Eve 68 78 78 80.6 8.26" ] }, "execution_count": 81, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# calculate mean and standard deviation (rounded to 2 decimal places)\n", "student_df['Mean Score'] = student_df['Score'].mean()\n", "student_df['Standard Deviation'] = round(student_df['Final Score'].std(),2)\n", "student_df" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 7.5 Removing Columns and Rows
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To delete a column from a DataFrame, we can use the the `drop()` method. For instance, in the `countries` DataFrame, we will drop the column `country_group`. With the `drop()` method it is important to specify the `axis` parameter, where `axis=1` refers to columns (`axis=0` refers to rows)." ] }, { "cell_type": "code", "execution_count": 82, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
countrygoogle_country_codecountry_groupname_enname_frname_delatitudelongitude
0atATeuAustriaAutricheÖsterreich47.69655413.345980
1beBEeuBelgiumBelgiqueBelgien50.5010454.476674
2bgBGeuBulgariaBulgarieBulgarien42.72567425.482322
3hrHRnon-euCroatiaCroatieKroatien44.74664315.340844
4cyCYeuCyprusChypreZypern35.12914133.428682
\n", "
" ], "text/plain": [ " country google_country_code country_group name_en name_fr name_de \\\n", "0 at AT eu Austria Autriche Österreich \n", "1 be BE eu Belgium Belgique Belgien \n", "2 bg BG eu Bulgaria Bulgarie Bulgarien \n", "3 hr HR non-eu Croatia Croatie Kroatien \n", "4 cy CY eu Cyprus Chypre Zypern \n", "\n", " latitude longitude \n", "0 47.696554 13.345980 \n", "1 50.501045 4.476674 \n", "2 42.725674 25.482322 \n", "3 44.746643 15.340844 \n", "4 35.129141 33.428682 " ] }, "execution_count": 82, "metadata": {}, "output_type": "execute_result" } ], "source": [ "countries.head()" ] }, { "cell_type": "code", "execution_count": 83, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
countrygoogle_country_codename_enname_frname_delatitudelongitude
0atATAustriaAutricheÖsterreich47.69655413.345980
1beBEBelgiumBelgiqueBelgien50.5010454.476674
2bgBGBulgariaBulgarieBulgarien42.72567425.482322
3hrHRCroatiaCroatieKroatien44.74664315.340844
4cyCYCyprusChypreZypern35.12914133.428682
\n", "
" ], "text/plain": [ " country google_country_code name_en name_fr name_de latitude \\\n", "0 at AT Austria Autriche Österreich 47.696554 \n", "1 be BE Belgium Belgique Belgien 50.501045 \n", "2 bg BG Bulgaria Bulgarie Bulgarien 42.725674 \n", "3 hr HR Croatia Croatie Kroatien 44.746643 \n", "4 cy CY Cyprus Chypre Zypern 35.129141 \n", "\n", " longitude \n", "0 13.345980 \n", "1 4.476674 \n", "2 25.482322 \n", "3 15.340844 \n", "4 33.428682 " ] }, "execution_count": 83, "metadata": {}, "output_type": "execute_result" } ], "source": [ "countries.drop('country_group', axis=1, inplace=True)\n", "countries.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "By using the `drop()` method we can remove multiple columns, by listing their labels within a list." ] }, { "cell_type": "code", "execution_count": 84, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
countrygoogle_country_codename_enname_frname_de
0atATAustriaAutricheÖsterreich
1beBEBelgiumBelgiqueBelgien
2bgBGBulgariaBulgarieBulgarien
3hrHRCroatiaCroatieKroatien
4cyCYCyprusChypreZypern
\n", "
" ], "text/plain": [ " country google_country_code name_en name_fr name_de\n", "0 at AT Austria Autriche Österreich\n", "1 be BE Belgium Belgique Belgien\n", "2 bg BG Bulgaria Bulgarie Bulgarien\n", "3 hr HR Croatia Croatie Kroatien\n", "4 cy CY Cyprus Chypre Zypern" ] }, "execution_count": 84, "metadata": {}, "output_type": "execute_result" } ], "source": [ "countries.drop(['latitude', 'longitude'], axis=1, inplace=True)\n", "countries.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Another way to remove columns in `pandas` is by using the `del` keyword." ] }, { "cell_type": "code", "execution_count": 85, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
countryname_enname_frname_de
0atAustriaAutricheÖsterreich
1beBelgiumBelgiqueBelgien
2bgBulgariaBulgarieBulgarien
3hrCroatiaCroatieKroatien
4cyCyprusChypreZypern
\n", "
" ], "text/plain": [ " country name_en name_fr name_de\n", "0 at Austria Autriche Österreich\n", "1 be Belgium Belgique Belgien\n", "2 bg Bulgaria Bulgarie Bulgarien\n", "3 hr Croatia Croatie Kroatien\n", "4 cy Cyprus Chypre Zypern" ] }, "execution_count": 85, "metadata": {}, "output_type": "execute_result" } ], "source": [ "del countries['google_country_code']\n", "countries.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To remove rows from a DataFrame, we can use the `drop()` method and set the `axis` parameter to 0.\n", "\n", "Similarly to columns, we can delete a single row, or multiple rows as in the examples below." ] }, { "cell_type": "code", "execution_count": 86, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
countryseasonalityyear_monthyearmonthunemploymentunemployment_rate
0atnsa1993.01199311710004.5
1atnsa1993.02199321750004.6
2atnsa1993.03199331660004.4
4atnsa1993.05199351470003.9
5atnsa1993.06199361340003.5
\n", "
" ], "text/plain": [ " country seasonality year_month year month unemployment \\\n", "0 at nsa 1993.01 1993 1 171000 \n", "1 at nsa 1993.02 1993 2 175000 \n", "2 at nsa 1993.03 1993 3 166000 \n", "4 at nsa 1993.05 1993 5 147000 \n", "5 at nsa 1993.06 1993 6 134000 \n", "\n", " unemployment_rate \n", "0 4.5 \n", "1 4.6 \n", "2 4.4 \n", "4 3.9 \n", "5 3.5 " ] }, "execution_count": 86, "metadata": {}, "output_type": "execute_result" } ], "source": [ "unemployment.drop(3, axis=0, inplace=True)\n", "unemployment.head()" ] }, { "cell_type": "code", "execution_count": 87, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
countryseasonalityyear_monthyearmonthunemploymentunemployment_rate
0atnsa1993.01199311710004.5
1atnsa1993.02199321750004.6
2atnsa1993.03199331660004.4
4atnsa1993.05199351470003.9
5atnsa1993.06199361340003.5
7atnsa1993.08199381300003.4
9atnsa1993.101993101410003.7
10atnsa1993.111993111560004.1
11atnsa1993.121993121690004.4
12atnsa1994.01199411800004.7
\n", "
" ], "text/plain": [ " country seasonality year_month year month unemployment \\\n", "0 at nsa 1993.01 1993 1 171000 \n", "1 at nsa 1993.02 1993 2 175000 \n", "2 at nsa 1993.03 1993 3 166000 \n", "4 at nsa 1993.05 1993 5 147000 \n", "5 at nsa 1993.06 1993 6 134000 \n", "7 at nsa 1993.08 1993 8 130000 \n", "9 at nsa 1993.10 1993 10 141000 \n", "10 at nsa 1993.11 1993 11 156000 \n", "11 at nsa 1993.12 1993 12 169000 \n", "12 at nsa 1994.01 1994 1 180000 \n", "\n", " unemployment_rate \n", "0 4.5 \n", "1 4.6 \n", "2 4.4 \n", "4 3.9 \n", "5 3.5 \n", "7 3.4 \n", "9 3.7 \n", "10 4.1 \n", "11 4.4 \n", "12 4.7 " ] }, "execution_count": 87, "metadata": {}, "output_type": "execute_result" } ], "source": [ "unemployment.drop([6,8], axis=0, inplace=True)\n", "unemployment.head(10)" ] }, { "cell_type": "code", "execution_count": 88, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(20793, 7)" ] }, "execution_count": 88, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Check the shape of the modified DataFrame\n", "unemployment.shape" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Another common way to remove columns or rows is based on a certain condition. For example, the following condition removes all rows corresponding to the data after 2006." ] }, { "cell_type": "code", "execution_count": 89, "metadata": {}, "outputs": [], "source": [ "unemployment = unemployment[unemployment['year'] < 2006]" ] }, { "cell_type": "code", "execution_count": 90, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(15528, 7)" ] }, "execution_count": 90, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Check the shape again\n", "unemployment.shape" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 7.6 Merging DataFrames
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Merging DataFrames in `pandas` is a common operation for combining data from multiple DataFrames based on a common key or index.\n", "\n", "For instance, if we examine the `unemployment` DataFrame we can notice that we don't exactly know what the values in the `country` column refer to. We can correct that by obtaining the country names from the `countries` DataFrame that we imported earlier. \n", "\n", "We can see in the `countries` data that *at* stands for Austria. This DataFrame even provides the country names in three different languages (name_en, name_fr, name_de)." ] }, { "cell_type": "code", "execution_count": 91, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
countryname_enname_frname_de
0atAustriaAutricheÖsterreich
1beBelgiumBelgiqueBelgien
2bgBulgariaBulgarieBulgarien
3hrCroatiaCroatieKroatien
4cyCyprusChypreZypern
\n", "
" ], "text/plain": [ " country name_en name_fr name_de\n", "0 at Austria Autriche Österreich\n", "1 be Belgium Belgique Belgien\n", "2 bg Bulgaria Bulgarie Bulgarien\n", "3 hr Croatia Croatie Kroatien\n", "4 cy Cyprus Chypre Zypern" ] }, "execution_count": 91, "metadata": {}, "output_type": "execute_result" } ], "source": [ "countries.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Because the data we need is stored in two separate files, we will merge the two DataFrames. The `country` column is shown in both DataFrames, so it is a good option for joining the data. However, we don't need all columns in the `countries` DataFrame, and therefore, we will create a new DataFrame that contains only the columns that we need. To select certain columns to retain, we can use the bracket notation that we used earlier to reorder the columns." ] }, { "cell_type": "code", "execution_count": 92, "metadata": {}, "outputs": [], "source": [ "country_names = countries[['country', 'name_en']]" ] }, { "cell_type": "code", "execution_count": 93, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
countryname_en
0atAustria
1beBelgium
2bgBulgaria
3hrCroatia
4cyCyprus
\n", "
" ], "text/plain": [ " country name_en\n", "0 at Austria\n", "1 be Belgium\n", "2 bg Bulgaria\n", "3 hr Croatia\n", "4 cy Cyprus" ] }, "execution_count": 93, "metadata": {}, "output_type": "execute_result" } ], "source": [ "country_names.head(5)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "For merging DataFrames, `pandas` include the `merge` method, which has the following syntax, where the parameter `on` lists the column for matching the DataFrames. This operation is similar to inner join in SQL and it combines rows which have matching keys in both DataFrames. We can also specify the type of join (e.g., inner, left, right, outer) by providing value for the optional `how` parameter.\n", "\n", "```\n", "pd.merge(first_file, second_file, on=['column_name'], how=['join type (default is 'inner')])\n", "```\n" ] }, { "cell_type": "code", "execution_count": 94, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
countryseasonalityyear_monthyearmonthunemploymentunemployment_ratename_en
0atnsa1993.01199311710004.5Austria
1atnsa1993.02199321750004.6Austria
2atnsa1993.03199331660004.4Austria
3atnsa1993.05199351470003.9Austria
4atnsa1993.06199361340003.5Austria
\n", "
" ], "text/plain": [ " country seasonality year_month year month unemployment \\\n", "0 at nsa 1993.01 1993 1 171000 \n", "1 at nsa 1993.02 1993 2 175000 \n", "2 at nsa 1993.03 1993 3 166000 \n", "3 at nsa 1993.05 1993 5 147000 \n", "4 at nsa 1993.06 1993 6 134000 \n", "\n", " unemployment_rate name_en \n", "0 4.5 Austria \n", "1 4.6 Austria \n", "2 4.4 Austria \n", "3 3.9 Austria \n", "4 3.5 Austria " ] }, "execution_count": 94, "metadata": {}, "output_type": "execute_result" } ], "source": [ "unemployment = pd.merge(unemployment, country_names, on=['country'])\n", "unemployment.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "If we want to merge two files using multiple columns that exist in both files, we can pass a list of column names to the `on` parameter." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Combining DataFrames with `join()` and `concat()`" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Another similar method to `merge` that is used for combining DataFrames in `pandas` is `join()`. The `join()` function is often used for merging DataFrame based on index alignment. \n", "\n", "The following example joins the DataFrame `grades_df` to the `student_df` by matching the index positions between the DataFrames. This is equivalent to a left join operation. " ] }, { "cell_type": "code", "execution_count": 95, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
StudentScoreUpdated ScoreFinal ScoreMean ScoreStandard Deviation
0Alice85959580.68.26
1Bob80909080.68.26
2Charlie78888880.68.26
3David9210210080.68.26
4Eve68787880.68.26
\n", "
" ], "text/plain": [ " Student Score Updated Score Final Score Mean Score Standard Deviation\n", "0 Alice 85 95 95 80.6 8.26\n", "1 Bob 80 90 90 80.6 8.26\n", "2 Charlie 78 88 88 80.6 8.26\n", "3 David 92 102 100 80.6 8.26\n", "4 Eve 68 78 78 80.6 8.26" ] }, "execution_count": 95, "metadata": {}, "output_type": "execute_result" } ], "source": [ "student_df" ] }, { "cell_type": "code", "execution_count": 96, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
Grade
1A
2B
3A
4C
\n", "
" ], "text/plain": [ " Grade\n", "1 A\n", "2 B\n", "3 A\n", "4 C" ] }, "execution_count": 96, "metadata": {}, "output_type": "execute_result" } ], "source": [ "grades_df = pd.DataFrame({'Grade': ['A', 'B', 'A', 'C']}, index=[1, 2, 3, 4])\n", "grades_df" ] }, { "cell_type": "code", "execution_count": 97, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
StudentScoreUpdated ScoreFinal ScoreMean ScoreStandard DeviationGrade
0Alice85959580.68.26NaN
1Bob80909080.68.26A
2Charlie78888880.68.26B
3David9210210080.68.26A
4Eve68787880.68.26C
\n", "
" ], "text/plain": [ " Student Score Updated Score Final Score Mean Score Standard Deviation \\\n", "0 Alice 85 95 95 80.6 8.26 \n", "1 Bob 80 90 90 80.6 8.26 \n", "2 Charlie 78 88 88 80.6 8.26 \n", "3 David 92 102 100 80.6 8.26 \n", "4 Eve 68 78 78 80.6 8.26 \n", "\n", " Grade \n", "0 NaN \n", "1 A \n", "2 B \n", "3 A \n", "4 C " ] }, "execution_count": 97, "metadata": {}, "output_type": "execute_result" } ], "source": [ "joined_df = student_df.join(grades_df)\n", "joined_df" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "However, `join()` also allows to specify the type of join operation with the `how` argument, similar to `merge()`. In the example below, an inner join operation is illustrated, which returns the rows that have matching indices in both DataFrames. " ] }, { "cell_type": "code", "execution_count": 98, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
StudentScoreUpdated ScoreFinal ScoreMean ScoreStandard DeviationGrade
1Bob80909080.68.26A
2Charlie78888880.68.26B
3David9210210080.68.26A
4Eve68787880.68.26C
\n", "
" ], "text/plain": [ " Student Score Updated Score Final Score Mean Score Standard Deviation \\\n", "1 Bob 80 90 90 80.6 8.26 \n", "2 Charlie 78 88 88 80.6 8.26 \n", "3 David 92 102 100 80.6 8.26 \n", "4 Eve 68 78 78 80.6 8.26 \n", "\n", " Grade \n", "1 A \n", "2 B \n", "3 A \n", "4 C " ] }, "execution_count": 98, "metadata": {}, "output_type": "execute_result" } ], "source": [ "joined_df = student_df.join(grades_df, how='inner')\n", "joined_df" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Note again the `join()` is primarily designed for merging DataFrames based on their indices, and it does not support merging DatFrames on specific columns. For that purpose, we should use the `merge()` method instead.\n", "\n", "Similarly, `concat()` allows to combine two DataFrames in `pandas` along a particular axis (either rows or columns). By default, it includes all columns or rows, filling missing values with NaN. However, the `concat` function can also accept a `join` argument to specify the type of join operation. " ] }, { "cell_type": "code", "execution_count": 99, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
StudentScoreUpdated ScoreFinal ScoreMean ScoreStandard DeviationGrade
0Alice85959580.68.26NaN
1Bob80909080.68.26A
2Charlie78888880.68.26B
3David9210210080.68.26A
4Eve68787880.68.26C
\n", "
" ], "text/plain": [ " Student Score Updated Score Final Score Mean Score Standard Deviation \\\n", "0 Alice 85 95 95 80.6 8.26 \n", "1 Bob 80 90 90 80.6 8.26 \n", "2 Charlie 78 88 88 80.6 8.26 \n", "3 David 92 102 100 80.6 8.26 \n", "4 Eve 68 78 78 80.6 8.26 \n", "\n", " Grade \n", "0 NaN \n", "1 A \n", "2 B \n", "3 A \n", "4 C " ] }, "execution_count": 99, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Concatenate along columns (axis=1)\n", "joined_df_2 = pd.concat([student_df, grades_df], axis=1)\n", "joined_df_2" ] }, { "cell_type": "code", "execution_count": 100, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
StudentScoreUpdated ScoreFinal ScoreMean ScoreStandard Deviation
0Alice8595.095.080.68.26
1Bob8090.090.080.68.26
2Charlie7888.088.080.68.26
3David92102.0100.080.68.26
4Eve6878.078.080.68.26
0George69NaNNaNNaNNaN
1Ann82NaNNaNNaNNaN
\n", "
" ], "text/plain": [ " Student Score Updated Score Final Score Mean Score Standard Deviation\n", "0 Alice 85 95.0 95.0 80.6 8.26\n", "1 Bob 80 90.0 90.0 80.6 8.26\n", "2 Charlie 78 88.0 88.0 80.6 8.26\n", "3 David 92 102.0 100.0 80.6 8.26\n", "4 Eve 68 78.0 78.0 80.6 8.26\n", "0 George 69 NaN NaN NaN NaN\n", "1 Ann 82 NaN NaN NaN NaN" ] }, "execution_count": 100, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Concatenate along rows (axis=0)\n", "student_df_2 = pd.DataFrame({'Student': ['George', 'Ann'], 'Score': [69, 82]})\n", "joined_df_3 = pd.concat([student_df, student_df_2], axis=0)\n", "joined_df_3" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In the combined DataFrame `joined_df_3` above, notice that the indices of the added rows begin at 0. To reset the indices in a `pandas` DataFrame, we can use, well, the `reset_index()` method." ] }, { "cell_type": "code", "execution_count": 101, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
indexStudentScoreUpdated ScoreFinal ScoreMean ScoreStandard Deviation
00Alice8595.095.080.68.26
11Bob8090.090.080.68.26
22Charlie7888.088.080.68.26
33David92102.0100.080.68.26
44Eve6878.078.080.68.26
50George69NaNNaNNaNNaN
61Ann82NaNNaNNaNNaN
\n", "
" ], "text/plain": [ " index Student Score Updated Score Final Score Mean Score \\\n", "0 0 Alice 85 95.0 95.0 80.6 \n", "1 1 Bob 80 90.0 90.0 80.6 \n", "2 2 Charlie 78 88.0 88.0 80.6 \n", "3 3 David 92 102.0 100.0 80.6 \n", "4 4 Eve 68 78.0 78.0 80.6 \n", "5 0 George 69 NaN NaN NaN \n", "6 1 Ann 82 NaN NaN NaN \n", "\n", " Standard Deviation \n", "0 8.26 \n", "1 8.26 \n", "2 8.26 \n", "3 8.26 \n", "4 8.26 \n", "5 NaN \n", "6 NaN " ] }, "execution_count": 101, "metadata": {}, "output_type": "execute_result" } ], "source": [ "joined_df_3 = joined_df_3.reset_index()\n", "joined_df_3" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "For more information on merging DataFrames, check the pandas [documentation](http://pandas.pydata.org/pandas-docs/stable/merging.html#database-style-dataframe-joining-merging). Also, we will explain more about merging operations in the lecture on Databases and SQL.\n", "\n", "Here is a figure that depicts inner, right, left, and outer join operations. \n", "\n", "\n", "Figure: Join Operations." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 7.7 Calculating Unique and Missing Values
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In the *unemployment* DataFrame, we might want to know for which countries we have data available. To extract this information, we can use the `.unique()` method. Note that the countries are listed in the right-most column `name-en` so we will use it to find the unique elements in that column." ] }, { "cell_type": "code", "execution_count": 102, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
countryseasonalityyear_monthyearmonthunemploymentunemployment_ratename_en
0atnsa1993.01199311710004.5Austria
1atnsa1993.02199321750004.6Austria
2atnsa1993.03199331660004.4Austria
3atnsa1993.05199351470003.9Austria
4atnsa1993.06199361340003.5Austria
\n", "
" ], "text/plain": [ " country seasonality year_month year month unemployment \\\n", "0 at nsa 1993.01 1993 1 171000 \n", "1 at nsa 1993.02 1993 2 175000 \n", "2 at nsa 1993.03 1993 3 166000 \n", "3 at nsa 1993.05 1993 5 147000 \n", "4 at nsa 1993.06 1993 6 134000 \n", "\n", " unemployment_rate name_en \n", "0 4.5 Austria \n", "1 4.6 Austria \n", "2 4.4 Austria \n", "3 3.9 Austria \n", "4 3.5 Austria " ] }, "execution_count": 102, "metadata": {}, "output_type": "execute_result" } ], "source": [ "unemployment.head()" ] }, { "cell_type": "code", "execution_count": 103, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array(['Austria', 'Belgium', 'Bulgaria', 'Cyprus', 'Czech Republic',\n", " 'Germany (including former GDR from 1991)', 'Denmark', 'Estonia',\n", " 'Spain', 'Finland', 'France', 'Greece', 'Croatia', 'Hungary',\n", " 'Ireland', 'Italy', 'Lithuania', 'Luxembourg', 'Latvia', 'Malta',\n", " 'Netherlands', 'Norway', 'Poland', 'Portugal', 'Romania', 'Sweden',\n", " 'Slovenia', 'Slovakia', 'Turkey', 'United Kingdom'], dtype=object)" ] }, "execution_count": 103, "metadata": {}, "output_type": "execute_result" } ], "source": [ "unemployment['name_en'].unique()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To get a count of the **number of unique countries,** we can use the `.nunique()` method." ] }, { "cell_type": "code", "execution_count": 104, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "30" ] }, "execution_count": 104, "metadata": {}, "output_type": "execute_result" } ], "source": [ "unemployment['name_en'].nunique()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Or, we can also use `len()` to get the number of items in the above array." ] }, { "cell_type": "code", "execution_count": 105, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "30" ] }, "execution_count": 105, "metadata": {}, "output_type": "execute_result" } ], "source": [ "len(unemployment['name_en'].unique())" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "If we are interested to know **how many rows** there are per country, `pandas` has the `.value_counts()` method that returns the counts for the unique values in a column." ] }, { "cell_type": "code", "execution_count": 106, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "United Kingdom 828\n", "France 828\n", "Sweden 828\n", "Portugal 828\n", "Netherlands 828\n", "Luxembourg 828\n", "Denmark 828\n", "Belgium 828\n", "Spain 828\n", "Ireland 828\n", "Italy 804\n", "Finland 648\n", "Norway 612\n", "Austria 465\n", "Hungary 396\n", "Slovakia 396\n", "Slovenia 396\n", "Bulgaria 396\n", "Malta 396\n", "Poland 396\n", "Germany (including former GDR from 1991) 336\n", "Czech Republic 288\n", "Latvia 288\n", "Lithuania 288\n", "Greece 279\n", "Romania 252\n", "Cyprus 216\n", "Estonia 216\n", "Croatia 144\n", "Turkey 36\n", "Name: name_en, dtype: int64" ] }, "execution_count": 106, "metadata": {}, "output_type": "execute_result" } ], "source": [ "unemployment['name_en'].value_counts()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "By default, the output is sorted by values in descending order. If we would like it sorted by index (or, by country name in alphabetical order in this case), we can append the `.sort_index()` method." ] }, { "cell_type": "code", "execution_count": 107, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "Austria 465\n", "Belgium 828\n", "Bulgaria 396\n", "Croatia 144\n", "Cyprus 216\n", "Czech Republic 288\n", "Denmark 828\n", "Estonia 216\n", "Finland 648\n", "France 828\n", "Germany (including former GDR from 1991) 336\n", "Greece 279\n", "Hungary 396\n", "Ireland 828\n", "Italy 804\n", "Latvia 288\n", "Lithuania 288\n", "Luxembourg 828\n", "Malta 396\n", "Netherlands 828\n", "Norway 612\n", "Poland 396\n", "Portugal 828\n", "Romania 252\n", "Slovakia 396\n", "Slovenia 396\n", "Spain 828\n", "Sweden 828\n", "Turkey 36\n", "United Kingdom 828\n", "Name: name_en, dtype: int64" ] }, "execution_count": 107, "metadata": {}, "output_type": "execute_result" } ], "source": [ "unemployment['name_en'].value_counts().sort_index()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "As we noticed earlier, there are missing values in the `unemployment_rate` column. To find out **how many unemployment rate values are missing** we will use the `.isnull()` method, which returns a corresponding boolean value for each missing entry in the `unemployment_rate` column. As we know, in Python `True` is equivalent to 1 and `False` is equivalent to 0. Thus, when we add `.sum()`, we obtain a count for the total number of missing values in the `unemployment_rate` column." ] }, { "cell_type": "code", "execution_count": 108, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "825" ] }, "execution_count": 108, "metadata": {}, "output_type": "execute_result" } ], "source": [ "unemployment['unemployment_rate'].isnull().sum()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### GroupBy" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The `groupby()` method in `pandas` is used to group data based on one or more columns. It is similar to the GroupBy function in SQL, and allows to apply operations or filtering on grouped data. \n", "\n", "If we would like to know how many missing values for the `unemployment_rate` column there are for each *country*, we can first create a new column in the `DataFrame` that has boolean True or False for the `unemployment_rate` column. This is the last column to the right below, in which `False` means that the value is not missing." ] }, { "cell_type": "code", "execution_count": 109, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
countryseasonalityyear_monthyearmonthunemploymentunemployment_ratename_enunemployment_rate_null
0atnsa1993.01199311710004.5AustriaFalse
1atnsa1993.02199321750004.6AustriaFalse
2atnsa1993.03199331660004.4AustriaFalse
3atnsa1993.05199351470003.9AustriaFalse
4atnsa1993.06199361340003.5AustriaFalse
\n", "
" ], "text/plain": [ " country seasonality year_month year month unemployment \\\n", "0 at nsa 1993.01 1993 1 171000 \n", "1 at nsa 1993.02 1993 2 175000 \n", "2 at nsa 1993.03 1993 3 166000 \n", "3 at nsa 1993.05 1993 5 147000 \n", "4 at nsa 1993.06 1993 6 134000 \n", "\n", " unemployment_rate name_en unemployment_rate_null \n", "0 4.5 Austria False \n", "1 4.6 Austria False \n", "2 4.4 Austria False \n", "3 3.9 Austria False \n", "4 3.5 Austria False " ] }, "execution_count": 109, "metadata": {}, "output_type": "execute_result" } ], "source": [ "unemployment['unemployment_rate_null'] = unemployment['unemployment_rate'].isnull()\n", "unemployment.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To count the **number of missing values for each country,** we can use the `.groupby()` method to group the data by the country `name_en` column included in the parentheses, and perform the `.sum()` operation over the `unemployment_rate_null` column. " ] }, { "cell_type": "code", "execution_count": 110, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "name_en\n", "Austria 0\n", "Belgium 0\n", "Bulgaria 180\n", "Croatia 96\n", "Cyprus 0\n", "Czech Republic 0\n", "Denmark 0\n", "Estonia 0\n", "Finland 0\n", "France 0\n", "Germany (including former GDR from 1991) 0\n", "Greece 0\n", "Hungary 36\n", "Ireland 0\n", "Italy 0\n", "Latvia 0\n", "Lithuania 0\n", "Luxembourg 0\n", "Malta 180\n", "Netherlands 0\n", "Norway 0\n", "Poland 72\n", "Portugal 0\n", "Romania 0\n", "Slovakia 108\n", "Slovenia 36\n", "Spain 117\n", "Sweden 0\n", "Turkey 0\n", "United Kingdom 0\n", "Name: unemployment_rate_null, dtype: int64" ] }, "execution_count": 110, "metadata": {}, "output_type": "execute_result" } ], "source": [ "unemployment.groupby('name_en')['unemployment_rate_null'].sum()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Also, we can use `.groupby` to group data by multiple columns. For example, we can check the missing values for unemployment both by country and year, and we can apply functions such as `.sum()` to the grouped objects, as shown below." ] }, { "cell_type": "code", "execution_count": 111, "metadata": {}, "outputs": [], "source": [ "grouped = unemployment.groupby(['name_en', 'year'])" ] }, { "cell_type": "code", "execution_count": 112, "metadata": {}, "outputs": [], "source": [ "grouped_sum = grouped['unemployment_rate_null'].sum()" ] }, { "cell_type": "code", "execution_count": 113, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
unemployment_rate_null
name_enyear
Austria19930
19940
19950
19960
19970
\n", "
" ], "text/plain": [ " unemployment_rate_null\n", "name_en year \n", "Austria 1993 0\n", " 1994 0\n", " 1995 0\n", " 1996 0\n", " 1997 0" ] }, "execution_count": 113, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Convert to a DataFrame\n", "grouped_sum_df = grouped_sum.to_frame()\n", "grouped_sum_df.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "One more simple example of using `groupby` in `pandas` is shown below." ] }, { "cell_type": "code", "execution_count": 114, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
DepartmentEmployeeSalaryJobTitle
0HRAlice70000Employee
1ITBob80000Developer
2HRCharlie75000Manager
3ITDavid85000Developer
4ITEdward90000Manager
5HRFiona71000Employee
\n", "
" ], "text/plain": [ " Department Employee Salary JobTitle\n", "0 HR Alice 70000 Employee\n", "1 IT Bob 80000 Developer\n", "2 HR Charlie 75000 Manager\n", "3 IT David 85000 Developer\n", "4 IT Edward 90000 Manager\n", "5 HR Fiona 71000 Employee" ] }, "execution_count": 114, "metadata": {}, "output_type": "execute_result" } ], "source": [ "company_df = pd.DataFrame({\n", " 'Department': ['HR', 'IT', 'HR', 'IT', 'IT', 'HR'],\n", " 'Employee': ['Alice', 'Bob', 'Charlie', 'David', 'Edward', 'Fiona'],\n", " 'Salary': [70000, 80000, 75000, 85000, 90000, 71000],\n", " 'JobTitle' : ['Employee', 'Developer', 'Manager', 'Developer', 'Manager', 'Employee']\n", "})\n", "\n", "company_df" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Group by the column `'Department` and calculate the mean salary for each department. " ] }, { "cell_type": "code", "execution_count": 115, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "Department\n", "HR 72000.0\n", "IT 85000.0\n", "Name: Salary, dtype: float64" ] }, "execution_count": 115, "metadata": {}, "output_type": "execute_result" } ], "source": [ "company_df.groupby('Department')['Salary'].mean()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can perform multiple operations on the grouped data. The following example calculates the mean, sum, max, and min for the grouped data. These operations for calculating data statistics in `pandas` are referred to as **aggregate functions**. They are listed inside the `agg()` method, allowing to apply multiple operations at once. " ] }, { "cell_type": "code", "execution_count": 116, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
Salary
meansummaxmin
Department
HR72000.02160007500070000
IT85000.02550009000080000
\n", "
" ], "text/plain": [ " Salary \n", " mean sum max min\n", "Department \n", "HR 72000.0 216000 75000 70000\n", "IT 85000.0 255000 90000 80000" ] }, "execution_count": 116, "metadata": {}, "output_type": "execute_result" } ], "source": [ "company_df.groupby('Department').agg({\n", " 'Salary': ['mean', 'sum', 'max', 'min']})" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can group the data by `'Department'` and `'JobTitle'` and calculate the mean salary." ] }, { "cell_type": "code", "execution_count": 117, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "Department JobTitle \n", "HR Employee 70500.0\n", " Manager 75000.0\n", "IT Developer 82500.0\n", " Manager 90000.0\n", "Name: Salary, dtype: float64" ] }, "execution_count": 117, "metadata": {}, "output_type": "execute_result" } ], "source": [ "company_df.groupby(['Department', 'JobTitle'])['Salary'].mean()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 7.8 Dealing With Missing Values: Boolean Indexing
\n", "\n", "Two main options for dealing with missing values in a DataFrame include:\n", "\n", "* Fill the missing values with some other values.\n", "* Remove the observations with missing values.\n", " \n", "Here we will adopt the second approach and **exclude missing values** from our primary analyses. Additional examples on dealing with missing values will be presented in the lecture on Data Exploration and Preprocessing.\n", "\n", "To select only the rows with the missing data for `'unemployment_rate'`, we will use *boolean indexing* to filter the data. Recall from the previous section that `unemployment['unemployment_rate'].isnull()` produces an array of Boolean values, which we used when counting the number of missing values, shown in the next cell." ] }, { "cell_type": "code", "execution_count": 118, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0 False\n", "1 False\n", "2 False\n", "3 False\n", "4 False\n", "5 False\n", "6 False\n", "7 False\n", "8 False\n", "9 False\n", "Name: unemployment_rate, dtype: bool" ] }, "execution_count": 118, "metadata": {}, "output_type": "execute_result" } ], "source": [ "unemployment['unemployment_rate'].isnull()[:10]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We will first save the missing data into a new DataFrame, in case we need that data later. To create a new DataFrame that we will call `unemployment_rate_missing`, we will index `unemployment` with the Boolean array above. This returns only the rows where the value in the array is `True`." ] }, { "cell_type": "code", "execution_count": 119, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
countryseasonalityyear_monthyearmonthunemploymentunemployment_ratename_enunemployment_rate_null
1293bgnsa1995.0119951391000NaNBulgariaTrue
1294bgnsa1995.0219952387000NaNBulgariaTrue
1295bgnsa1995.0319953378000NaNBulgariaTrue
1296bgnsa1995.0419954365000NaNBulgariaTrue
1297bgnsa1995.0519955346000NaNBulgariaTrue
\n", "
" ], "text/plain": [ " country seasonality year_month year month unemployment \\\n", "1293 bg nsa 1995.01 1995 1 391000 \n", "1294 bg nsa 1995.02 1995 2 387000 \n", "1295 bg nsa 1995.03 1995 3 378000 \n", "1296 bg nsa 1995.04 1995 4 365000 \n", "1297 bg nsa 1995.05 1995 5 346000 \n", "\n", " unemployment_rate name_en unemployment_rate_null \n", "1293 NaN Bulgaria True \n", "1294 NaN Bulgaria True \n", "1295 NaN Bulgaria True \n", "1296 NaN Bulgaria True \n", "1297 NaN Bulgaria True " ] }, "execution_count": 119, "metadata": {}, "output_type": "execute_result" } ], "source": [ "unemployment_rate_missing = unemployment[unemployment['unemployment_rate'].isnull()]\n", "unemployment_rate_missing.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "It is also possible to specify multiple conditions using the `&` operator, but each condition needs to be inside of parentheses. \n", "\n", "Now, to remove the missing data in `unemployment`, we can use the `.dropna()` method. This method drops all observations for which `unemployment_rate == NaN`." ] }, { "cell_type": "code", "execution_count": 120, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
countryseasonalityyear_monthyearmonthunemploymentunemployment_ratename_enunemployment_rate_null
0atnsa1993.01199311710004.5AustriaFalse
1atnsa1993.02199321750004.6AustriaFalse
2atnsa1993.03199331660004.4AustriaFalse
3atnsa1993.05199351470003.9AustriaFalse
4atnsa1993.06199361340003.5AustriaFalse
\n", "
" ], "text/plain": [ " country seasonality year_month year month unemployment \\\n", "0 at nsa 1993.01 1993 1 171000 \n", "1 at nsa 1993.02 1993 2 175000 \n", "2 at nsa 1993.03 1993 3 166000 \n", "3 at nsa 1993.05 1993 5 147000 \n", "4 at nsa 1993.06 1993 6 134000 \n", "\n", " unemployment_rate name_en unemployment_rate_null \n", "0 4.5 Austria False \n", "1 4.6 Austria False \n", "2 4.4 Austria False \n", "3 3.9 Austria False \n", "4 3.5 Austria False " ] }, "execution_count": 120, "metadata": {}, "output_type": "execute_result" } ], "source": [ "unemployment.dropna(subset=['unemployment_rate'], inplace=True)\n", "unemployment.head()" ] }, { "cell_type": "code", "execution_count": 121, "metadata": { "tags": [] }, "outputs": [ { "data": { "text/plain": [ "(14703, 9)" ] }, "execution_count": 121, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Check the shape of the modified DataFrame\n", "unemployment.shape" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Sorting Values\n", "\n", "If we want to know what the highest unemployment rates were, we can use the `.sort_values()` method to *sort the data.*\n", "\n", "The code in the next cell sorted the data in *descending* order, and printed the first 10 rows." ] }, { "cell_type": "code", "execution_count": 122, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
countryseasonalityyear_monthyearmonthunemploymentunemployment_ratename_enunemployment_rate_null
11677plnsa2004.0220042353100020.9PolandFalse
11665plnsa2003.0220032346000020.7PolandFalse
11676plnsa2004.0120041352000020.7PolandFalse
11664plnsa2003.0120031346600020.6PolandFalse
11678plnsa2004.0320043347500020.6PolandFalse
11654plnsa2002.0320023350900020.5PolandFalse
11653plnsa2002.0220022349200020.4PolandFalse
11924pltrend2002.0920029350000020.4PolandFalse
11925pltrend2002.10200210348300020.4PolandFalse
11923pltrend2002.0820028350300020.4PolandFalse
\n", "
" ], "text/plain": [ " country seasonality year_month year month unemployment \\\n", "11677 pl nsa 2004.02 2004 2 3531000 \n", "11665 pl nsa 2003.02 2003 2 3460000 \n", "11676 pl nsa 2004.01 2004 1 3520000 \n", "11664 pl nsa 2003.01 2003 1 3466000 \n", "11678 pl nsa 2004.03 2004 3 3475000 \n", "11654 pl nsa 2002.03 2002 3 3509000 \n", "11653 pl nsa 2002.02 2002 2 3492000 \n", "11924 pl trend 2002.09 2002 9 3500000 \n", "11925 pl trend 2002.10 2002 10 3483000 \n", "11923 pl trend 2002.08 2002 8 3503000 \n", "\n", " unemployment_rate name_en unemployment_rate_null \n", "11677 20.9 Poland False \n", "11665 20.7 Poland False \n", "11676 20.7 Poland False \n", "11664 20.6 Poland False \n", "11678 20.6 Poland False \n", "11654 20.5 Poland False \n", "11653 20.4 Poland False \n", "11924 20.4 Poland False \n", "11925 20.4 Poland False \n", "11923 20.4 Poland False " ] }, "execution_count": 122, "metadata": {}, "output_type": "execute_result" } ], "source": [ "unemployment.sort_values('unemployment_rate', ascending=False)[:10]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Here is another example for sorting the `score_df` DataFrame by `'age'`. " ] }, { "cell_type": "code", "execution_count": 123, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
agescorertgroup
017123.552test
119101.624test
221116.431test
337157.132test
418162.925test
519144.662control
647253.634control
718213.635control
819295.234control
\n", "
" ], "text/plain": [ " age score rt group\n", "0 17 12 3.552 test\n", "1 19 10 1.624 test\n", "2 21 11 6.431 test\n", "3 37 15 7.132 test\n", "4 18 16 2.925 test\n", "5 19 14 4.662 control\n", "6 47 25 3.634 control\n", "7 18 21 3.635 control\n", "8 19 29 5.234 control" ] }, "execution_count": 123, "metadata": {}, "output_type": "execute_result" } ], "source": [ "score_df" ] }, { "cell_type": "code", "execution_count": 124, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
agescorertgroup
017123.552test
418162.925test
718213.635control
119101.624test
519144.662control
819295.234control
221116.431test
337157.132test
647253.634control
\n", "
" ], "text/plain": [ " age score rt group\n", "0 17 12 3.552 test\n", "4 18 16 2.925 test\n", "7 18 21 3.635 control\n", "1 19 10 1.624 test\n", "5 19 14 4.662 control\n", "8 19 29 5.234 control\n", "2 21 11 6.431 test\n", "3 37 15 7.132 test\n", "6 47 25 3.634 control" ] }, "execution_count": 124, "metadata": {}, "output_type": "execute_result" } ], "source": [ "score_df.sort_values('age')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The syntax shown in the next cell can also be used. " ] }, { "cell_type": "code", "execution_count": 125, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
agescorertgroup
119101.624test
221116.431test
017123.552test
519144.662control
337157.132test
418162.925test
718213.635control
647253.634control
819295.234control
\n", "
" ], "text/plain": [ " age score rt group\n", "1 19 10 1.624 test\n", "2 21 11 6.431 test\n", "0 17 12 3.552 test\n", "5 19 14 4.662 control\n", "3 37 15 7.132 test\n", "4 18 16 2.925 test\n", "7 18 21 3.635 control\n", "6 47 25 3.634 control\n", "8 19 29 5.234 control" ] }, "execution_count": 125, "metadata": {}, "output_type": "execute_result" } ], "source": [ "score_df.sort_values(by=['score'])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In the following code, we sort the data by two columns. Note that the data is first sorted by `'age'`, and for the rows where the age is the same (e.g., 18 or 19), the data is sorted in ascending order based on `'score'`." ] }, { "cell_type": "code", "execution_count": 126, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
agescorertgroup
017123.552test
418162.925test
718213.635control
119101.624test
519144.662control
819295.234control
221116.431test
337157.132test
647253.634control
\n", "
" ], "text/plain": [ " age score rt group\n", "0 17 12 3.552 test\n", "4 18 16 2.925 test\n", "7 18 21 3.635 control\n", "1 19 10 1.624 test\n", "5 19 14 4.662 control\n", "8 19 29 5.234 control\n", "2 21 11 6.431 test\n", "3 37 15 7.132 test\n", "6 47 25 3.634 control" ] }, "execution_count": 126, "metadata": {}, "output_type": "execute_result" } ], "source": [ "score_df.sort_values(['age', 'score'])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Several additional functionalities of `pandas` will be described in the next lectures." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 7.9 Exporting A DataFrame to csv
\n", "\n", "To save the last DataFrame as a .csv file, we can use the `.to_csv()` method." ] }, { "cell_type": "code", "execution_count": 127, "metadata": {}, "outputs": [], "source": [ "unemployment.to_csv('data/unemployment.csv')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The file will be saved in the `data` directory. \n", "\n", "By default, this method writes the indices in the column 0 (i.e., row labels). We probably don't want a column 0 with indices to be added, and we can set `index` to `False`. We can also specify the type of delimiter that we want to use, such as commas `(,)`, pipes (`|`), semicolons (`;`), tabs (`\\t`), etc." ] }, { "cell_type": "code", "execution_count": 128, "metadata": {}, "outputs": [], "source": [ "unemployment.to_csv('data/unemployment.csv', index=False, sep=',')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## References \n", "\n", "1. Introduction to Pandas, Python Data Wrangling by D-Lab at UC Berkley, available at: [https://github.com/dlab-berkeley/introduction-to-pandas](https://github.com/dlab-berkeley/introduction-to-pandas).\n", "2. Pandas documentation, available at: [https://pandas.pydata.org/pandas-docs/stable/](https://pandas.pydata.org/pandas-docs/stable/).\n", "3. Learning Statistics with Python - Data Wrangling, available at: [https://ethanweed.github.io/pythonbook/03.03-pragmatic_matters.html](https://ethanweed.github.io/pythonbook/03.03-pragmatic_matters.html)." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "[BACK TO TOP](#top)" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.11.5" } }, "nbformat": 4, "nbformat_minor": 4 }