Adapted from Tut 1
Data processing pipelines move and transform data from one source to another so that it can be stored, used for analytics / Machine Learning, or combined with other data structures. In EODP, we will cover an end-to-end process to preprocess data, conduct analysis, perform Machine Learning tasks, and communicate findings.
Jupyter Keyboard Shortcuts #
Cells #
Jupyter notebooks contain two main types of cells:
- Markdown cells: These can be used to contain text, equations and other non-code items. The cell that you’re reading right now is a markdown cell. You can use Markdown to format your text; if you prefer, you can also format your text using HTML. Running the cell will format and display your text.
- Code cells: These contain code segments that can be executed individually. When executed, the output of the code will be displayed below the code cell.
Keyboard Shortcuts #
Cell running shortcuts:
- You can tell you are selecting a cell when the outline is colored green
- `shift + enter`: run the current cell - keyboard shortcut for the Run button
- `ctrl + enter`: run the selected cells
Command mode (press `esc` to enter):
- You can tell you are in Command Mode when the outline is colored blue
- `esc`: enter command mode (blue highlight)
- `a`: create a cell above
- `b`: create a cell below
- `dd` (double d): delete a cell
- `m`: make the cell render as Markdown
- `r`: make the cell render as raw text
- `y`: make the cell render as Python code
- `enter`: edit the cell
Code shortcuts:
- `shift + tab`: brings up function/method arguments
Pandas Framework #
Basic import
import pandas as pd
import numpy as np
Series #
Definition: A one-dimensional labelled pandas array that stores data and supports pandas operations
Creation syntax
l = [1, 2, 3, 4, 5]
s = pd.Series(l)
s # Notice in Jupyter (Or Quarto in my case) we may simply display an object by directly referencing it
0 1
1 2
2 3
3 4
4 5
dtype: int64
Attributes and Methods
Examples of series attributes:
- `series.index` attribute (returns the index field, like `dict.keys()`)
- `series.values` attribute (returns the values, like `dict.values()`)
Examples of series methods:
- `series.mean()` method (computes the average)
- `series.sum()` method (computes the grand total sum)
To get all the attributes and methods available, you can call `help(pd.Series)`.
print(s.index)
print(s.values)
RangeIndex(start=0, stop=5, step=1)
[1 2 3 4 5]
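To make the attribute/method distinction concrete, here is a small sketch (using `s.size` as an extra example attribute; the values mirror the series above):

```python
import pandas as pd

s = pd.Series([1, 2, 3, 4, 5])

mean_value = s.mean()  # method: computes the average
total = s.sum()        # method: computes the grand total sum
n_items = s.size       # attribute: number of elements (no parentheses)
print(mean_value, total, n_items)
```

Attributes are accessed without parentheses; calling an attribute like `s.size()` raises a `TypeError`.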
Change of Index
new_index = [
'first',
'second',
'third',
'forth',
'fifth'
]
s.index = new_index
print(s)
first 1
second 2
third 3
forth 4
fifth 5
dtype: int64
Like dictionaries, we can access values using [], and a series may be
created out of a dictionary
print(s['first'])
d = {
'name': 'kunori',
'height': 185,
'nationality': 'PRC',
'GPA': 4.0
}
_s = pd.Series(d)
print(_s)
1
name kunori
height 185
nationality PRC
GPA 4.0
dtype: object
Concatenate - concat() #
print(pd.concat([s, _s]))
print('---Equivalent to---')
print(pd.concat([s, _s], axis='rows'))
print('---Equivalent to---')
print(pd.concat([s, _s], axis=0))
first 1
second 2
third 3
forth 4
fifth 5
name kunori
height 185
nationality PRC
GPA 4.0
dtype: object
---Equivalent to---
first 1
second 2
third 3
forth 4
fifth 5
name kunori
height 185
nationality PRC
GPA 4.0
dtype: object
---Equivalent to---
first 1
second 2
third 3
forth 4
fifth 5
name kunori
height 185
nationality PRC
GPA 4.0
dtype: object
Fill holes - fillna() #
Here is what it looks like if you use axis=1
print(pd.concat([s, _s], axis=1))
print('---Equivalent to---')
print(pd.concat([s, _s], axis='columns'))
print('---Fill holes---')
print(pd.concat([s, _s], axis='columns').fillna(0))
0 1
first 1.0 NaN
second 2.0 NaN
third 3.0 NaN
forth 4.0 NaN
fifth 5.0 NaN
name NaN kunori
height NaN 185
nationality NaN PRC
GPA NaN 4.0
---Equivalent to---
0 1
first 1.0 NaN
second 2.0 NaN
third 3.0 NaN
forth 4.0 NaN
fifth 5.0 NaN
name NaN kunori
height NaN 185
nationality NaN PRC
GPA NaN 4.0
---Fill holes---
0 1
first 1.0 0
second 2.0 0
third 3.0 0
forth 4.0 0
fifth 5.0 0
name 0.0 kunori
height 0.0 185
nationality 0.0 PRC
GPA 0.0 4.0
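Filling with a constant is only one option. A sketch of two other common fills, using a small hypothetical series with missing values:

```python
import pandas as pd
import numpy as np

s = pd.Series([1.0, np.nan, 3.0, np.nan])

# Fill with a constant, as above
print(s.fillna(0))

# Carry the previous valid value forward
print(s.ffill())

# Fill with a computed statistic such as the mean (NaNs are skipped)
print(s.fillna(s.mean()))
```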
Find values - .loc[] #
Slicing the series using index
print(_s.loc['name'])
# Same as
print(_s['name'])
kunori
kunori
Slicing the series using a boolean array operation
- A boolean array
s < 3
first True
second True
third False
forth False
fifth False
dtype: bool
- Using the boolean array to slice the result
s.loc[s < 3]
first 1
second 2
dtype: int64
- A more advanced example
# Find if a number is odd - must cast to bool!
print((s % 2).astype(bool))
# Display odd numbers
s.loc[(s % 2).astype(bool)]
first True
second False
third True
forth False
fifth True
dtype: bool
first 1
third 3
fifth 5
dtype: int64
Quick series functional operations - .apply() #
Use of lambda functions
s.loc[s.apply(lambda x: True if x % 2 else False)]
first 1
third 3
fifth 5
dtype: int64
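The same odd-number filter can also be written as a direct vectorised comparison, which avoids `apply` entirely and is usually faster. A sketch (rebuilding the same series `s` as above):

```python
import pandas as pd

s = pd.Series([1, 2, 3, 4, 5],
              index=['first', 'second', 'third', 'forth', 'fifth'])

# The comparison builds the same boolean mask without apply or astype
odd = s.loc[s % 2 == 1]
print(odd)
```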
Slicing by indices - .iloc[] #
# Note: index-position slicing is not right-inclusive
s.iloc[2:4]
third 3
forth 4
dtype: int64
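`.iloc[]` follows the same conventions as Python list slicing, so negative positions and step sizes also work. A sketch (rebuilding the same series `s`):

```python
import pandas as pd

s = pd.Series([1, 2, 3, 4, 5],
              index=['first', 'second', 'third', 'forth', 'fifth'])

# Negative positions count from the end, just like Python lists
print(s.iloc[-2:])   # the last two elements

# A step size is also allowed
print(s.iloc[::2])   # every second element
```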
Numerical operations to series - use of numpy and built-in broadcasting #
np.log1p(s)
first 0.693147
second 1.098612
third 1.386294
forth 1.609438
fifth 1.791759
dtype: float64
s * 2
first 2
second 4
third 6
forth 8
fifth 10
dtype: int64
s * s - s
first 0
second 2
third 6
forth 12
fifth 20
dtype: int64
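One subtlety worth knowing: arithmetic between two series aligns on index labels, not positions. A sketch with two hypothetical series whose indices only partially overlap:

```python
import pandas as pd

a = pd.Series([1, 2, 3], index=['x', 'y', 'z'])
b = pd.Series([10, 20, 30], index=['y', 'z', 'w'])

# Values are matched by label; labels present in only one series give NaN
c = a + b
print(c)
```

This is why `s * s - s` above lines up element-by-element: both operands share the same index.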
DataFrame - a 2D Series (or even higher-dimensional with groupby etc.!) #
Creation syntax
indices = ['a', 'b', 'c', 'd', 'e']
la = [1, 2, 3, 4, 5]
lb = [10, 20, 30, 40, 50]
sa = pd.Series(la, index=indices)
sb = pd.Series(lb, index=indices)
df = pd.DataFrame({
'kunori': sa,
'kiku': sb,
'dummy': sb
})
df
| kunori | kiku | dummy | |
|---|---|---|---|
| a | 1 | 10 | 10 |
| b | 2 | 20 | 20 |
| c | 3 | 30 | 30 |
| d | 4 | 40 | 40 |
| e | 5 | 50 | 50 |
Access multiple columns - referencing columns
df[['kunori', 'kiku']]
| kunori | kiku | |
|---|---|---|
| a | 1 | 10 |
| b | 2 | 20 |
| c | 3 | 30 |
| d | 4 | 40 |
| e | 5 | 50 |
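Note the double brackets: single brackets select one column as a Series, while double brackets select a sub-DataFrame. A sketch with a small hypothetical frame:

```python
import pandas as pd

df = pd.DataFrame({'kunori': [1, 2, 3], 'kiku': [10, 20, 30]},
                  index=['a', 'b', 'c'])

# Single brackets return a Series; double brackets return a DataFrame
col_series = df['kunori']
col_frame = df[['kunori']]
print(type(col_series).__name__, type(col_frame).__name__)
```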
Use of .loc - analogous to Series
# Creation of a boolean series
# Check whether the entry is congruent to 2 modulo 3
print(df['kunori'].apply(lambda x: True if x % 3 == 2 else False))
# Filter rows by that boolean series
print(df.loc[df['kunori'].apply(lambda x: True if x % 3 == 2 else False)])
a False
b True
c False
d False
e True
Name: kunori, dtype: bool
kunori kiku dummy
b 2 20 20
e 5 50 50
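Boolean masks can also be combined. A sketch (rebuilding the same `df` as above; note that `&`/`|` are used instead of `and`/`or`, and each condition needs its own parentheses):

```python
import pandas as pd

indices = ['a', 'b', 'c', 'd', 'e']
df = pd.DataFrame({
    'kunori': pd.Series([1, 2, 3, 4, 5], index=indices),
    'kiku': pd.Series([10, 20, 30, 40, 50], index=indices),
})

# Combine element-wise masks with & (and) or | (or)
subset = df.loc[(df['kunori'] > 1) & (df['kiku'] < 50)]
print(subset)
```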
Reading CSVs as DataFrames #
For simplicity, we will read a local CSV here, but `pd.read_csv` also accepts online URLs.
# An example employee dataset
PATH_DATA = 'example_csv.csv'
data = pd.read_csv(PATH_DATA)
tail()
# Display 10 last rows
data.tail(10)
| id | first_name | last_name | gender | department | hire_date | salary | experience_years | performance_score | ||
|---|---|---|---|---|---|---|---|---|---|---|
| 10 | 11 | David | Martinez | [email protected] | Male | Engineering | 2020-05-03 | 76000 | 2 | 4.2 |
| 11 | 12 | Mia | Anderson | [email protected] | Female | Finance | 2016-03-27 | 91000 | 7 | 4.6 |
| 12 | 13 | Joseph | Taylor | [email protected] | Male | Marketing | 2019-10-09 | 69500 | 3 | 3.9 |
| 13 | 14 | Charlotte | Thomas | [email protected] | Female | Engineering | 2018-06-14 | 81000 | 4 | 4.4 |
| 14 | 15 | Daniel | Hernandez | [email protected] | Male | Sales | 2021-01-20 | 59000 | 1 | 3.5 |
| 15 | 16 | Amelia | Moore | [email protected] | Female | HR | 2015-04-08 | 72500 | 8 | 4.2 |
| 16 | 17 | Matthew | Martin | [email protected] | Male | Finance | 2017-11-15 | 82500 | 5 | 4.1 |
| 17 | 18 | Harper | Jackson | [email protected] | Female | Engineering | 2020-09-01 | 71000 | 2 | 3.8 |
| 18 | 19 | Andrew | Thompson | [email protected] | Male | Marketing | 2016-07-19 | 78000 | 6 | 4.3 |
| 19 | 20 | Abigail | White | [email protected] | Female | Sales | 2018-02-28 | 66000 | 4 | 4.0 |
head()
# Display 10 first rows
data.head(10)
| id | first_name | last_name | gender | department | hire_date | salary | experience_years | performance_score | ||
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | John | Smith | [email protected] | Male | Marketing | 2018-03-15 | 72000 | 5 | 4.2 |
| 1 | 2 | Emma | Johnson | [email protected] | Female | Engineering | 2019-07-23 | 85000 | 3 | 4.7 |
| 2 | 3 | Michael | Williams | [email protected] | Male | Sales | 2015-11-05 | 65000 | 8 | 3.9 |
| 3 | 4 | Sophia | Brown | [email protected] | Female | HR | 2020-01-15 | 62000 | 2 | 4.1 |
| 4 | 5 | Robert | Jones | [email protected] | Male | Engineering | 2017-05-22 | 92000 | 6 | 4.8 |
| 5 | 6 | Olivia | Miller | [email protected] | Female | Marketing | 2021-02-10 | 58000 | 1 | 3.7 |
| 6 | 7 | William | Davis | [email protected] | Male | Finance | 2016-09-18 | 88000 | 7 | 4.5 |
| 7 | 8 | Ava | Garcia | [email protected] | Female | Engineering | 2019-04-30 | 79500 | 4 | 4.3 |
| 8 | 9 | James | Rodriguez | [email protected] | Male | Sales | 2018-12-11 | 67500 | 3 | 3.8 |
| 9 | 10 | Isabella | Wilson | [email protected] | Female | HR | 2017-08-24 | 63500 | 5 | 4.0 |
Count values - value_counts() #
Count how many employees are male or female
data['gender'].value_counts()
gender
Male 10
Female 10
Name: count, dtype: int64
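`value_counts` can also report proportions instead of raw counts via its `normalize` parameter. A sketch with a hypothetical gender column matching the 10/10 split above:

```python
import pandas as pd

gender = pd.Series(['Male', 'Female'] * 10)

# normalize=True converts counts to fractions of the total
props = gender.value_counts(normalize=True)
print(props)
```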
Set index of df to index the rows - set_index() #
Set the index to their ids
data = data.set_index('id')
data
| first_name | last_name | gender | department | hire_date | salary | experience_years | performance_score | ||
|---|---|---|---|---|---|---|---|---|---|
| id | |||||||||
| 1 | John | Smith | [email protected] | Male | Marketing | 2018-03-15 | 72000 | 5 | 4.2 |
| 2 | Emma | Johnson | [email protected] | Female | Engineering | 2019-07-23 | 85000 | 3 | 4.7 |
| 3 | Michael | Williams | [email protected] | Male | Sales | 2015-11-05 | 65000 | 8 | 3.9 |
| 4 | Sophia | Brown | [email protected] | Female | HR | 2020-01-15 | 62000 | 2 | 4.1 |
| 5 | Robert | Jones | [email protected] | Male | Engineering | 2017-05-22 | 92000 | 6 | 4.8 |
| 6 | Olivia | Miller | [email protected] | Female | Marketing | 2021-02-10 | 58000 | 1 | 3.7 |
| 7 | William | Davis | [email protected] | Male | Finance | 2016-09-18 | 88000 | 7 | 4.5 |
| 8 | Ava | Garcia | [email protected] | Female | Engineering | 2019-04-30 | 79500 | 4 | 4.3 |
| 9 | James | Rodriguez | [email protected] | Male | Sales | 2018-12-11 | 67500 | 3 | 3.8 |
| 10 | Isabella | Wilson | [email protected] | Female | HR | 2017-08-24 | 63500 | 5 | 4.0 |
| 11 | David | Martinez | [email protected] | Male | Engineering | 2020-05-03 | 76000 | 2 | 4.2 |
| 12 | Mia | Anderson | [email protected] | Female | Finance | 2016-03-27 | 91000 | 7 | 4.6 |
| 13 | Joseph | Taylor | [email protected] | Male | Marketing | 2019-10-09 | 69500 | 3 | 3.9 |
| 14 | Charlotte | Thomas | [email protected] | Female | Engineering | 2018-06-14 | 81000 | 4 | 4.4 |
| 15 | Daniel | Hernandez | [email protected] | Male | Sales | 2021-01-20 | 59000 | 1 | 3.5 |
| 16 | Amelia | Moore | [email protected] | Female | HR | 2015-04-08 | 72500 | 8 | 4.2 |
| 17 | Matthew | Martin | [email protected] | Male | Finance | 2017-11-15 | 82500 | 5 | 4.1 |
| 18 | Harper | Jackson | [email protected] | Female | Engineering | 2020-09-01 | 71000 | 2 | 3.8 |
| 19 | Andrew | Thompson | [email protected] | Male | Marketing | 2016-07-19 | 78000 | 6 | 4.3 |
| 20 | Abigail | White | [email protected] | Female | Sales | 2018-02-28 | 66000 | 4 | 4.0 |
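If you later want the `id` column back, `reset_index` undoes `set_index`. A sketch with a small hypothetical frame:

```python
import pandas as pd

df = pd.DataFrame({'id': [1, 2, 3], 'salary': [72000, 85000, 65000]})

# set_index moves a column into the row labels; reset_index moves it back
indexed = df.set_index('id')
restored = indexed.reset_index()
print(list(indexed.index))
print(list(restored.columns))
```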
Sort values - .sort_values(by=COLUMN) #
Find the employee with the highest salary
data.sort_values(
    by='salary',
    ascending=False
).head(1)
| first_name | last_name | gender | department | hire_date | salary | experience_years | performance_score | ||
|---|---|---|---|---|---|---|---|---|---|
| id | |||||||||
| 5 | Robert | Jones | [email protected] | Male | Engineering | 2017-05-22 | 92000 | 6 | 4.8 |
Sort by performance score, then salary, to spot mismatches between score and salary
data.sort_values(['performance_score', 'salary'], ascending=[False, True]).drop(['email'], axis='columns')
| first_name | last_name | gender | department | hire_date | salary | experience_years | performance_score | |
|---|---|---|---|---|---|---|---|---|
| id | ||||||||
| 5 | Robert | Jones | Male | Engineering | 2017-05-22 | 92000 | 6 | 4.8 |
| 2 | Emma | Johnson | Female | Engineering | 2019-07-23 | 85000 | 3 | 4.7 |
| 12 | Mia | Anderson | Female | Finance | 2016-03-27 | 91000 | 7 | 4.6 |
| 7 | William | Davis | Male | Finance | 2016-09-18 | 88000 | 7 | 4.5 |
| 14 | Charlotte | Thomas | Female | Engineering | 2018-06-14 | 81000 | 4 | 4.4 |
| 19 | Andrew | Thompson | Male | Marketing | 2016-07-19 | 78000 | 6 | 4.3 |
| 8 | Ava | Garcia | Female | Engineering | 2019-04-30 | 79500 | 4 | 4.3 |
| 1 | John | Smith | Male | Marketing | 2018-03-15 | 72000 | 5 | 4.2 |
| 16 | Amelia | Moore | Female | HR | 2015-04-08 | 72500 | 8 | 4.2 |
| 11 | David | Martinez | Male | Engineering | 2020-05-03 | 76000 | 2 | 4.2 |
| 4 | Sophia | Brown | Female | HR | 2020-01-15 | 62000 | 2 | 4.1 |
| 17 | Matthew | Martin | Male | Finance | 2017-11-15 | 82500 | 5 | 4.1 |
| 10 | Isabella | Wilson | Female | HR | 2017-08-24 | 63500 | 5 | 4.0 |
| 20 | Abigail | White | Female | Sales | 2018-02-28 | 66000 | 4 | 4.0 |
| 3 | Michael | Williams | Male | Sales | 2015-11-05 | 65000 | 8 | 3.9 |
| 13 | Joseph | Taylor | Male | Marketing | 2019-10-09 | 69500 | 3 | 3.9 |
| 9 | James | Rodriguez | Male | Sales | 2018-12-11 | 67500 | 3 | 3.8 |
| 18 | Harper | Jackson | Female | Engineering | 2020-09-01 | 71000 | 2 | 3.8 |
| 6 | Olivia | Miller | Female | Marketing | 2021-02-10 | 58000 | 1 | 3.7 |
| 15 | Daniel | Hernandez | Male | Sales | 2021-01-20 | 59000 | 1 | 3.5 |
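For "top n by some column" queries, `nlargest` is a convenient shorthand for sorting descending and taking the head. A sketch with a few hypothetical rows from the dataset:

```python
import pandas as pd

df = pd.DataFrame({
    'name': ['Robert', 'Emma', 'Mia', 'Olivia'],
    'salary': [92000, 85000, 91000, 58000],
})

# nlargest(n, column) == sort_values(column, ascending=False).head(n)
top2 = df.nlargest(2, 'salary')
print(top2)
```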
Analysis by groups - .groupby() #
Find the average salary for each number of experience years, and then for each gender
data.groupby('experience_years')['salary'].mean()
experience_years
1 58500.000000
2 69666.666667
3 74000.000000
4 75500.000000
5 72666.666667
6 85000.000000
7 89500.000000
8 68750.000000
Name: salary, dtype: float64
data.groupby('gender')['salary'].mean()
gender
Female 72950.0
Male 74950.0
Name: salary, dtype: float64
Find the total salary paid to female and male employees
data.groupby('gender')['salary'].sum()
gender
Female 729500
Male 749500
Name: salary, dtype: int64
Count the number of employees of each gender
data.groupby('gender').count()
| first_name | last_name | department | hire_date | salary | experience_years | performance_score | ||
|---|---|---|---|---|---|---|---|---|
| gender | ||||||||
| Female | 10 | 10 | 10 | 10 | 10 | 10 | 10 | 10 |
| Male | 10 | 10 | 10 | 10 | 10 | 10 | 10 | 10 |
Note that .count() reports, for every column, the number of non-null entries in each group - hence one count per column.
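If you only want one count per group, `.size()` is simpler, and `.agg()` lets you compute several statistics in one pass. A sketch with a small hypothetical frame:

```python
import pandas as pd

df = pd.DataFrame({
    'gender': ['Male', 'Female', 'Male', 'Female'],
    'salary': [72000, 85000, 92000, 62000],
})

# .size() counts rows per group (one number per group, not per column)
print(df.groupby('gender').size())

# .agg() applies several aggregations at once
stats = df.groupby('gender')['salary'].agg(['mean', 'sum'])
print(stats)
```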