
[EODP] Essential Functions and Key Ideas

·1972 words·10 mins
Author
Frederic Liu
BS.c. Maths and Stats in OR

Adapted from Tut 1

Data processing pipelines move and transform data from a source to a destination so that it can be stored, used for analytics and Machine Learning, or combined with other data. In EODP, we cover an end-to-end process: preprocess data, conduct analysis, perform Machine Learning tasks, and communicate findings.

Jupyter Keyboard Shortcuts
#

Adapted from Tut 1

Cells
#

Jupyter notebooks contain two main types of cells:

  • Markdown cells: These can contain text, equations, and other non-code items. The cell that you’re reading right now is a markdown cell. You can format your text using Markdown or, if you prefer, HTML. Clicking the Run button will render and display your text.
  • Code cells: These contain code segments that can be executed individually. When a cell is executed, its output is displayed below it.

Keyboard Shortcuts
#

Cell-running shortcuts:

  • You can tell a cell is selected for editing (Edit Mode) when its outline is colored green
  • shift + enter : Run the current cell and move to the next - keyboard shortcut for the Run button
  • ctrl + enter : Run the selected cells in place

Command mode (press esc to enter):

  • You can tell you are in Command Mode when the outline is colored blue
  • a to create a cell above
  • b to create a cell below
  • dd (press d twice) to delete the cell
  • m to convert the cell to Markdown
  • r to convert the cell to raw text
  • y to convert the cell to Python code
  • enter to edit the cell (switching back to Edit Mode)

Code Shortcuts:

  • shift + tab : brings up the signature and arguments of the function/method under the cursor

Pandas Framework
#

Basic import

import pandas as pd
import numpy as np

Series
#

Definition: a one-dimensional labelled pandas array that stores data and supports pandas operations

Creation syntax

l = [1, 2, 3, 4, 5]
s = pd.Series(l)
s # Notice in Jupyter (Or Quarto in my case) we may simply display an object by directly referencing it
0    1
1    2
2    3
3    4
4    5
dtype: int64

Attributes and Methods

Examples of series attributes:

  • series.index attribute (returns the index field like dict.keys())
  • series.values attribute (returns the values like dict.values())

Examples of series methods:

  • series.mean() method (computes the average)
  • series.sum() method (computes the grand total sum)

To get all the attributes and methods available, you can call help(pd.Series).

print(s.index)
print(s.values)
RangeIndex(start=0, stop=5, step=1)
[1 2 3 4 5]
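A few more attributes and methods worth knowing; a quick sketch on the same toy series:

```python
import pandas as pd

s = pd.Series([1, 2, 3, 4, 5])

# Other commonly used attributes and methods
print(s.dtype)           # element type of the series
print(s.shape)           # length as a tuple, like a numpy array
print(s.size)            # number of elements
print(s.max(), s.min())  # extreme values
```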

Change of Index

new_index = [
    'first',
    'second',
    'third',
    'forth',
    'fifth'
]

s.index = new_index

print(s)
first     1
second    2
third     3
forth     4
fifth     5
dtype: int64

As with dictionaries, we can access values using [], and a Series may be created from a dictionary

print(s['first'])

d = {
    'name': 'kunori',
    'height': 185,
    'nationality': 'PRC',
    'GPA': 4.0
}

_s = pd.Series(d)
print(_s)
1
name           kunori
height            185
nationality       PRC
GPA               4.0
dtype: object

Concatenate - concat()
#

print(pd.concat([s, _s]))
print('---Equivalent to---')
print(pd.concat([s, _s], axis='rows'))
print('---Equivalent to---')
print(pd.concat([s, _s], axis=0))
first               1
second              2
third               3
forth               4
fifth               5
name           kunori
height            185
nationality       PRC
GPA               4.0
dtype: object
---Equivalent to---
first               1
second              2
third               3
forth               4
fifth               5
name           kunori
height            185
nationality       PRC
GPA               4.0
dtype: object
---Equivalent to---
first               1
second              2
third               3
forth               4
fifth               5
name           kunori
height            185
nationality       PRC
GPA               4.0
dtype: object
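One related option worth knowing (an illustrative sketch, not from the tutorial): `ignore_index=True` discards the original labels and renumbers the combined index, which avoids duplicate labels when both inputs share index values:

```python
import pandas as pd

s = pd.Series([1, 2, 3])
t = pd.Series([10, 20])

# By default concat keeps the original index labels, which can collide;
# ignore_index=True discards them and renumbers from 0
combined = pd.concat([s, t], ignore_index=True)
print(combined)
```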

Fill holes - fillna()
#

Concatenating with axis=1 (or axis='columns') places the two Series side by side as columns, leaving NaN wherever an index label appears in only one of them; fillna() then replaces those holes

print(pd.concat([s, _s], axis=1))
print('---Equivalent to---')
print(pd.concat([s, _s], axis='columns'))
print('---Fill holes---')
print(pd.concat([s, _s], axis='columns').fillna(0))
               0       1
first        1.0     NaN
second       2.0     NaN
third        3.0     NaN
forth        4.0     NaN
fifth        5.0     NaN
name         NaN  kunori
height       NaN     185
nationality  NaN     PRC
GPA          NaN     4.0
---Equivalent to---
               0       1
first        1.0     NaN
second       2.0     NaN
third        3.0     NaN
forth        4.0     NaN
fifth        5.0     NaN
name         NaN  kunori
height       NaN     185
nationality  NaN     PRC
GPA          NaN     4.0
---Fill holes---
               0       1
first        1.0       0
second       2.0       0
third        3.0       0
forth        4.0       0
fifth        5.0       0
name         0.0  kunori
height       0.0     185
nationality  0.0     PRC
GPA          0.0     4.0
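fillna() is more flexible than a single scalar. As a sketch on a small made-up frame, it also accepts a dict mapping column names to per-column fill values:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [1.0, np.nan, 3.0],
                   'b': [np.nan, 'x', 'y']})

# A dict lets each column get its own sensible default
filled = df.fillna({'a': 0.0, 'b': 'missing'})
print(filled)
```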

Find values - .loc[]
#

Slicing the series using index

print(_s.loc['name'])
# Same as
print(_s['name'])
kunori
kunori

Slicing the series using a boolean array operation

  • A boolean array
s < 3
first      True
second     True
third     False
forth     False
fifth     False
dtype: bool
  • Using the boolean array to slice the result
s.loc[s < 3]
first     1
second    2
dtype: int64
  • A more advanced example
# Find whether each number is odd - the mask must be cast to bool!
print((s % 2).astype(bool))
# Display the odd numbers
s.loc[(s % 2).astype(bool)]
first      True
second    False
third      True
forth     False
fifth      True
dtype: bool

first    1
third    3
fifth    5
dtype: int64
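Boolean masks can also be combined element-wise. A minimal sketch on the same toy series, using & (and), | (or) and ~ (not):

```python
import pandas as pd

s = pd.Series([1, 2, 3, 4, 5],
              index=['first', 'second', 'third', 'forth', 'fifth'])

# Note the parentheses: & and | bind more tightly than comparisons
mask = (s > 1) & (s < 5)
print(s.loc[mask])
```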

Quick series functional operations - .apply()
#

Use of lambda functions

s.loc[s.apply(lambda x: True if x % 2 else False)]
first    1
third    3
fifth    5
dtype: int64
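For a simple element-wise check like this, a vectorized comparison is usually clearer (and faster) than .apply with a lambda; a sketch on the same series:

```python
import pandas as pd

s = pd.Series([1, 2, 3, 4, 5],
              index=['first', 'second', 'third', 'forth', 'fifth'])

# s % 2 == 1 already produces a boolean mask, no lambda needed
print(s.loc[s % 2 == 1])
```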

Slicing by indices - .iloc[]
#

# Note: integer-position slicing is not right-inclusive
s.iloc[2:4]
third    3
forth    4
dtype: int64
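.iloc also accepts negative positions and step sizes, just like Python list slicing; for example:

```python
import pandas as pd

s = pd.Series([1, 2, 3, 4, 5],
              index=['first', 'second', 'third', 'forth', 'fifth'])

print(s.iloc[-1])   # last element
print(s.iloc[::2])  # every second element
```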

Numerical operations on Series - NumPy functions and built-in broadcasting
#

np.log1p(s)
first     0.693147
second    1.098612
third     1.386294
forth     1.609438
fifth     1.791759
dtype: float64
s * 2
first      2
second     4
third      6
forth      8
fifth     10
dtype: int64
s * s - s
first      0
second     2
third      6
forth     12
fifth     20
dtype: int64

DataFrame - a 2D analogue of Series (and effectively higher-dimensional with groupby, etc.!)
#

Creation syntax

indices = ['a', 'b', 'c', 'd', 'e']
la = [1, 2, 3, 4, 5]
lb = [10, 20, 30, 40, 50]
sa = pd.Series(la, index=indices)
sb = pd.Series(lb, index=indices)
df = pd.DataFrame({
    'kunori': sa,
    'kiku': sb,
    'dummy': sb
})
df
   kunori  kiku  dummy
a       1    10     10
b       2    20     20
c       3    30     30
d       4    40     40
e       5    50     50
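Selecting a single column with one pair of brackets returns a Series, so everything from the Series sections applies to it directly; a quick sketch:

```python
import pandas as pd

indices = ['a', 'b', 'c', 'd', 'e']
df = pd.DataFrame({
    'kunori': pd.Series([1, 2, 3, 4, 5], index=indices),
    'kiku': pd.Series([10, 20, 30, 40, 50], index=indices),
})

# One pair of brackets -> a Series, with all Series methods available
col = df['kunori']
print(type(col).__name__)
print(col.mean())
```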

Access multiple columns - pass a list of column names

df[['kunori', 'kiku']]
   kunori  kiku
a       1    10
b       2    20
c       3    30
d       4    40
e       5    50

Use of .loc - the same idea as for Series

# Create a Boolean series:
# check whether each entry is congruent to 2 modulo 3
print(df['kunori'].apply(lambda x: True if x % 3 == 2 else False))

# Filter rows with that boolean series
print(df.loc[df['kunori'].apply(lambda x: True if x % 3 == 2 else False)])
a    False
b     True
c    False
d    False
e     True
Name: kunori, dtype: bool
   kunori  kiku  dummy
b       2    20     20
e       5    50     50
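The same filter can be written without .apply, since the modulo comparison is already vectorized; a sketch on an equivalent frame:

```python
import pandas as pd

df = pd.DataFrame({'kunori': [1, 2, 3, 4, 5],
                   'kiku': [10, 20, 30, 40, 50]},
                  index=['a', 'b', 'c', 'd', 'e'])

# The comparison operates element-wise on the whole column at once
print(df.loc[df['kunori'] % 3 == 2])
```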

Reading CSVs as DataFrames
#

For simplicity, we read a local CSV here; pd.read_csv() accepts both relative paths and URLs

PATH_DATA = 'example_csv.csv'

data = pd.read_csv(PATH_DATA)

tail()

# Display the last 10 rows
data.tail(10)
id first_name last_name email gender department hire_date salary experience_years performance_score
10 11 David Martinez [email protected] Male Engineering 2020-05-03 76000 2 4.2
11 12 Mia Anderson [email protected] Female Finance 2016-03-27 91000 7 4.6
12 13 Joseph Taylor [email protected] Male Marketing 2019-10-09 69500 3 3.9
13 14 Charlotte Thomas [email protected] Female Engineering 2018-06-14 81000 4 4.4
14 15 Daniel Hernandez [email protected] Male Sales 2021-01-20 59000 1 3.5
15 16 Amelia Moore [email protected] Female HR 2015-04-08 72500 8 4.2
16 17 Matthew Martin [email protected] Male Finance 2017-11-15 82500 5 4.1
17 18 Harper Jackson [email protected] Female Engineering 2020-09-01 71000 2 3.8
18 19 Andrew Thompson [email protected] Male Marketing 2016-07-19 78000 6 4.3
19 20 Abigail White [email protected] Female Sales 2018-02-28 66000 4 4.0

head()

# Display the first 10 rows
data.head(10)
id first_name last_name email gender department hire_date salary experience_years performance_score
0 1 John Smith [email protected] Male Marketing 2018-03-15 72000 5 4.2
1 2 Emma Johnson [email protected] Female Engineering 2019-07-23 85000 3 4.7
2 3 Michael Williams [email protected] Male Sales 2015-11-05 65000 8 3.9
3 4 Sophia Brown [email protected] Female HR 2020-01-15 62000 2 4.1
4 5 Robert Jones [email protected] Male Engineering 2017-05-22 92000 6 4.8
5 6 Olivia Miller [email protected] Female Marketing 2021-02-10 58000 1 3.7
6 7 William Davis [email protected] Male Finance 2016-09-18 88000 7 4.5
7 8 Ava Garcia [email protected] Female Engineering 2019-04-30 79500 4 4.3
8 9 James Rodriguez [email protected] Male Sales 2018-12-11 67500 3 3.8
9 10 Isabella Wilson [email protected] Female HR 2017-08-24 63500 5 4.0

Count values - value_counts()
#

Display how many are male or female

data['gender'].value_counts()
gender
Male      10
Female    10
Name: count, dtype: int64
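value_counts() can also report proportions instead of raw counts via normalize=True; a sketch on a toy gender column standing in for data['gender']:

```python
import pandas as pd

# A small stand-in for data['gender'] above
gender = pd.Series(['Male', 'Female', 'Male', 'Female'])

# normalize=True divides each count by the total, giving proportions
print(gender.value_counts(normalize=True))
```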

Set a column as the row index - set_index()
#

Set index as their ids

data = data.set_index('id')
data
first_name last_name email gender department hire_date salary experience_years performance_score
id
1 John Smith [email protected] Male Marketing 2018-03-15 72000 5 4.2
2 Emma Johnson [email protected] Female Engineering 2019-07-23 85000 3 4.7
3 Michael Williams [email protected] Male Sales 2015-11-05 65000 8 3.9
4 Sophia Brown [email protected] Female HR 2020-01-15 62000 2 4.1
5 Robert Jones [email protected] Male Engineering 2017-05-22 92000 6 4.8
6 Olivia Miller [email protected] Female Marketing 2021-02-10 58000 1 3.7
7 William Davis [email protected] Male Finance 2016-09-18 88000 7 4.5
8 Ava Garcia [email protected] Female Engineering 2019-04-30 79500 4 4.3
9 James Rodriguez [email protected] Male Sales 2018-12-11 67500 3 3.8
10 Isabella Wilson [email protected] Female HR 2017-08-24 63500 5 4.0
11 David Martinez [email protected] Male Engineering 2020-05-03 76000 2 4.2
12 Mia Anderson [email protected] Female Finance 2016-03-27 91000 7 4.6
13 Joseph Taylor [email protected] Male Marketing 2019-10-09 69500 3 3.9
14 Charlotte Thomas [email protected] Female Engineering 2018-06-14 81000 4 4.4
15 Daniel Hernandez [email protected] Male Sales 2021-01-20 59000 1 3.5
16 Amelia Moore [email protected] Female HR 2015-04-08 72500 8 4.2
17 Matthew Martin [email protected] Male Finance 2017-11-15 82500 5 4.1
18 Harper Jackson [email protected] Female Engineering 2020-09-01 71000 2 3.8
19 Andrew Thompson [email protected] Male Marketing 2016-07-19 78000 6 4.3
20 Abigail White [email protected] Female Sales 2018-02-28 66000 4 4.0
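reset_index() is the inverse of set_index(), moving the index back out into an ordinary column; a sketch on a small made-up frame:

```python
import pandas as pd

df = pd.DataFrame({'id': [1, 2, 3],
                   'salary': [72000, 85000, 65000]})

# set_index moves the 'id' column into the index;
# reset_index moves it back out and restores a default RangeIndex
indexed = df.set_index('id')
restored = indexed.reset_index()
print(restored.equals(df))
```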

Sort values - .sort_values(by=COLUMN)
#

Find the one with highest salary

data.sort_values(
    by='salary',
    ascending=False
).head(1)
first_name last_name email gender department hire_date salary experience_years performance_score
id
5 Robert Jones [email protected] Male Engineering 2017-05-22 92000 6 4.8

Sort by performance score (descending), then by salary (ascending), to spot employees whose salary is misaligned with their score

data.sort_values(['performance_score', 'salary'], ascending=[False, True]).drop(['email'], axis='columns')
first_name last_name gender department hire_date salary experience_years performance_score
id
5 Robert Jones Male Engineering 2017-05-22 92000 6 4.8
2 Emma Johnson Female Engineering 2019-07-23 85000 3 4.7
12 Mia Anderson Female Finance 2016-03-27 91000 7 4.6
7 William Davis Male Finance 2016-09-18 88000 7 4.5
14 Charlotte Thomas Female Engineering 2018-06-14 81000 4 4.4
19 Andrew Thompson Male Marketing 2016-07-19 78000 6 4.3
8 Ava Garcia Female Engineering 2019-04-30 79500 4 4.3
1 John Smith Male Marketing 2018-03-15 72000 5 4.2
16 Amelia Moore Female HR 2015-04-08 72500 8 4.2
11 David Martinez Male Engineering 2020-05-03 76000 2 4.2
4 Sophia Brown Female HR 2020-01-15 62000 2 4.1
17 Matthew Martin Male Finance 2017-11-15 82500 5 4.1
10 Isabella Wilson Female HR 2017-08-24 63500 5 4.0
20 Abigail White Female Sales 2018-02-28 66000 4 4.0
3 Michael Williams Male Sales 2015-11-05 65000 8 3.9
13 Joseph Taylor Male Marketing 2019-10-09 69500 3 3.9
9 James Rodriguez Male Sales 2018-12-11 67500 3 3.8
18 Harper Jackson Female Engineering 2020-09-01 71000 2 3.8
6 Olivia Miller Female Marketing 2021-02-10 58000 1 3.7
15 Daniel Hernandez Male Sales 2021-01-20 59000 1 3.5

Analysis by groups - .groupby()
#

Find the average salary for each number of experience years, and then for each gender

data.groupby('experience_years')['salary'].mean()
experience_years
1    58500.000000
2    69666.666667
3    74000.000000
4    75500.000000
5    72666.666667
6    85000.000000
7    89500.000000
8    68750.000000
Name: salary, dtype: float64
data.groupby('gender')['salary'].mean()
gender
Female    72950.0
Male      74950.0
Name: salary, dtype: float64

Find the total salary paid to each gender

data.groupby('gender')['salary'].sum()
gender
Female    729500
Male      749500
Name: salary, dtype: int64

Count the employees of each gender

data.groupby('gender').count()
        first_name  last_name  email  department  hire_date  salary  experience_years  performance_score
gender
Female          10         10     10          10         10      10                10                 10
Male            10         10     10          10         10      10                10                 10

Yes, it outputs a count for every column: .count() counts the non-null entries in each column within each group.
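groupby() also accepts a list of keys, and .agg() can compute several statistics at once; a sketch on a small stand-in for the employee data:

```python
import pandas as pd

# A small made-up stand-in for the employee data above
df = pd.DataFrame({
    'gender': ['Male', 'Female', 'Male', 'Female'],
    'department': ['Sales', 'Sales', 'HR', 'HR'],
    'salary': [65000, 66000, 72500, 62000],
})

# Grouping by two keys yields a MultiIndex of (gender, department);
# .agg computes both statistics per group in one pass
summary = df.groupby(['gender', 'department'])['salary'].agg(['mean', 'count'])
print(summary)
```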