
[EODP] Essential Functions and Key Ideas

·1972 words·10 mins
Author
Frederic Liu
BS.c. Maths and Stats in OR

Adapted from Tut 1

Data processing pipelines move and transform data from a source to a destination so that it can be stored, used for analytics and Machine Learning, or combined with other data. In EODP, we cover an end-to-end process: preprocess data, conduct analysis, perform Machine Learning tasks, and communicate findings.

Jupyter Keyboard Shortcuts
#

Adapted from Tut 1

Cells
#

Jupyter notebooks contain two main types of cells:

  • Markdown cells: These can contain text, equations, and other non-code items. The cell that you’re reading right now is a markdown cell. You can format your text using Markdown or, if you prefer, HTML. Clicking the Run button will render and display your text.
  • Code cells: These contain code segments that can be executed individually. When a cell is executed, its output is displayed below it.

Keyboard Shortcuts
#

Cell-running shortcuts:

  • You can tell a cell is selected for editing (Edit Mode) when its outline is colored green
  • shift + enter : Run the current cell and move to the next - keyboard shortcut for the Run button
  • ctrl + enter : Run the selected cells in place

Command mode (press esc to enter):

  • You can tell you are in Command Mode when the outline is colored blue
  • a to create a cell above
  • b to create a cell below
  • dd (press d twice) to delete the cell
  • m to convert the cell to Markdown
  • r to convert the cell to raw text
  • y to convert the cell to Python code
  • enter to edit the cell (switching back to Edit Mode)

Code Shortcuts:

  • shift + tab : brings up the signature and arguments of the function/method under the cursor

Pandas Framework
#

Basic import

import pandas as pd
import numpy as np

Series
#

Definition: a one-dimensional labelled pandas array that stores data and supports pandas operations

Creation syntax

l = [1, 2, 3, 4, 5]
s = pd.Series(l)
s # Notice in Jupyter (Or Quarto in my case) we may simply display an object by directly referencing it
0    1
1    2
2    3
3    4
4    5
dtype: int64

Attributes and Methods

Examples of series attributes:

  • series.index attribute (returns the index field like dict.keys())
  • series.values attribute (returns the values like dict.values())

Examples of series methods:

  • series.mean() method (computes the average)
  • series.sum() method (computes the grand total sum)

To get all the attributes and methods available, you can call help(pd.Series).

print(s.index)
print(s.values)
RangeIndex(start=0, stop=5, step=1)
[1 2 3 4 5]
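A few more attributes and methods worth knowing; a quick sketch on the same toy series:

```python
import pandas as pd

s = pd.Series([1, 2, 3, 4, 5])

# Other commonly used attributes and methods
print(s.dtype)           # element type of the series
print(s.shape)           # length as a tuple, like a numpy array
print(s.size)            # number of elements
print(s.max(), s.min())  # extreme values
```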

Change of Index

new_index = [
    'first',
    'second',
    'third',
    'forth',
    'fifth'
]

s.index = new_index

print(s)
first     1
second    2
third     3
forth     4
fifth     5
dtype: int64

As with dictionaries, we can access values using [], and a Series may be created from a dictionary

print(s['first'])

d = {
    'name': 'kunori',
    'height': 185,
    'nationality': 'PRC',
    'GPA': 4.0
}

_s = pd.Series(d)
print(_s)
1
name           kunori
height            185
nationality       PRC
GPA               4.0
dtype: object

Concatenate - concat()
#

print(pd.concat([s, _s]))
print('---Equivalent to---')
print(pd.concat([s, _s], axis='rows'))
print('---Equivalent to---')
print(pd.concat([s, _s], axis=0))
first               1
second              2
third               3
forth               4
fifth               5
name           kunori
height            185
nationality       PRC
GPA               4.0
dtype: object
---Equivalent to---
first               1
second              2
third               3
forth               4
fifth               5
name           kunori
height            185
nationality       PRC
GPA               4.0
dtype: object
---Equivalent to---
first               1
second              2
third               3
forth               4
fifth               5
name           kunori
height            185
nationality       PRC
GPA               4.0
dtype: object
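One related option worth knowing (an illustrative sketch, not from the tutorial): `ignore_index=True` discards the original labels and renumbers the combined index, which avoids duplicate labels when both inputs share index values:

```python
import pandas as pd

s = pd.Series([1, 2, 3])
t = pd.Series([10, 20])

# By default concat keeps the original index labels, which can collide;
# ignore_index=True discards them and renumbers from 0
combined = pd.concat([s, t], ignore_index=True)
print(combined)
```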

Fill holes - fillna()
#

Concatenating with axis=1 (or axis='columns') places the two Series side by side as columns, leaving NaN wherever an index label appears in only one of them; fillna() then replaces those holes

print(pd.concat([s, _s], axis=1))
print('---Equivalent to---')
print(pd.concat([s, _s], axis='columns'))
print('---Fill holes---')
print(pd.concat([s, _s], axis='columns').fillna(0))
               0       1
first        1.0     NaN
second       2.0     NaN
third        3.0     NaN
forth        4.0     NaN
fifth        5.0     NaN
name         NaN  kunori
height       NaN     185
nationality  NaN     PRC
GPA          NaN     4.0
---Equivalent to---
               0       1
first        1.0     NaN
second       2.0     NaN
third        3.0     NaN
forth        4.0     NaN
fifth        5.0     NaN
name         NaN  kunori
height       NaN     185
nationality  NaN     PRC
GPA          NaN     4.0
---Fill holes---
               0       1
first        1.0       0
second       2.0       0
third        3.0       0
forth        4.0       0
fifth        5.0       0
name         0.0  kunori
height       0.0     185
nationality  0.0     PRC
GPA          0.0     4.0
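fillna() is more flexible than a single scalar. As a sketch on a small made-up frame, it also accepts a dict mapping column names to per-column fill values:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [1.0, np.nan, 3.0],
                   'b': [np.nan, 'x', 'y']})

# A dict lets each column get its own sensible default
filled = df.fillna({'a': 0.0, 'b': 'missing'})
print(filled)
```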

Find values - .loc[]
#

Slicing the series using index

print(_s.loc['name'])
# Same as
print(_s['name'])
kunori
kunori

Slicing the series using a boolean array operation

  • A boolean array
s < 3
first      True
second     True
third     False
forth     False
fifth     False
dtype: bool
  • Using the boolean array to slice the result
s.loc[s < 3]
first     1
second    2
dtype: int64
  • A more advanced example
# Find whether each number is odd - the mask must be cast to bool!
print((s % 2).astype(bool))
# Display the odd numbers
s.loc[(s % 2).astype(bool)]
first      True
second    False
third      True
forth     False
fifth      True
dtype: bool

first    1
third    3
fifth    5
dtype: int64
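Boolean masks can also be combined element-wise. A minimal sketch on the same toy series, using & (and), | (or) and ~ (not):

```python
import pandas as pd

s = pd.Series([1, 2, 3, 4, 5],
              index=['first', 'second', 'third', 'forth', 'fifth'])

# Note the parentheses: & and | bind more tightly than comparisons
mask = (s > 1) & (s < 5)
print(s.loc[mask])
```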

Quick series functional operations - .apply()
#

Use of lambda functions

s.loc[s.apply(lambda x: True if x % 2 else False)]
first    1
third    3
fifth    5
dtype: int64
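For a simple element-wise check like this, a vectorized comparison is usually clearer (and faster) than .apply with a lambda; a sketch on the same series:

```python
import pandas as pd

s = pd.Series([1, 2, 3, 4, 5],
              index=['first', 'second', 'third', 'forth', 'fifth'])

# s % 2 == 1 already produces a boolean mask, no lambda needed
print(s.loc[s % 2 == 1])
```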

Slicing by indices - .iloc[]
#

# Note: integer-position slicing is not right-inclusive
s.iloc[2:4]
third    3
forth    4
dtype: int64
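.iloc also accepts negative positions and step sizes, just like Python list slicing; for example:

```python
import pandas as pd

s = pd.Series([1, 2, 3, 4, 5],
              index=['first', 'second', 'third', 'forth', 'fifth'])

print(s.iloc[-1])   # last element
print(s.iloc[::2])  # every second element
```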

Numerical operations on Series - NumPy functions and built-in broadcasting
#

np.log1p(s)
first     0.693147
second    1.098612
third     1.386294
forth     1.609438
fifth     1.791759
dtype: float64
s * 2
first      2
second     4
third      6
forth      8
fifth     10
dtype: int64
s * s - s
first      0
second     2
third      6
forth     12
fifth     20
dtype: int64

DataFrame - a 2D analogue of Series (and effectively higher-dimensional with groupby, etc.!)
#

Creation syntax

indices = ['a', 'b', 'c', 'd', 'e']
la = [1, 2, 3, 4, 5]
lb = [10, 20, 30, 40, 50]
sa = pd.Series(la, index=indices)
sb = pd.Series(lb, index=indices)
df = pd.DataFrame({
    'kunori': sa,
    'kiku': sb,
    'dummy': sb
})
df
   kunori  kiku  dummy
a       1    10     10
b       2    20     20
c       3    30     30
d       4    40     40
e       5    50     50
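Selecting a single column with one pair of brackets returns a Series, so everything from the Series sections applies to it directly; a quick sketch:

```python
import pandas as pd

indices = ['a', 'b', 'c', 'd', 'e']
df = pd.DataFrame({
    'kunori': pd.Series([1, 2, 3, 4, 5], index=indices),
    'kiku': pd.Series([10, 20, 30, 40, 50], index=indices),
})

# One pair of brackets -> a Series, with all Series methods available
col = df['kunori']
print(type(col).__name__)
print(col.mean())
```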

Access multiple columns - pass a list of column names

df[['kunori', 'kiku']]
   kunori  kiku
a       1    10
b       2    20
c       3    30
d       4    40
e       5    50

Use of .loc - the same idea as for Series

# Create a Boolean series:
# check whether each entry is congruent to 2 modulo 3
print(df['kunori'].apply(lambda x: True if x % 3 == 2 else False))

# Filter rows with that boolean series
print(df.loc[df['kunori'].apply(lambda x: True if x % 3 == 2 else False)])
a    False
b     True
c    False
d    False
e     True
Name: kunori, dtype: bool
   kunori  kiku  dummy
b       2    20     20
e       5    50     50
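The same filter can be written without .apply, since the modulo comparison is already vectorized; a sketch on an equivalent frame:

```python
import pandas as pd

df = pd.DataFrame({'kunori': [1, 2, 3, 4, 5],
                   'kiku': [10, 20, 30, 40, 50]},
                  index=['a', 'b', 'c', 'd', 'e'])

# The comparison operates element-wise on the whole column at once
print(df.loc[df['kunori'] % 3 == 2])
```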

Reading CSVs as DataFrames
#

For simplicity, we read a local CSV here; pd.read_csv() accepts both relative paths and URLs

PATH_DATA = 'example_csv.csv'

data = pd.read_csv(PATH_DATA)

tail()

# Display the last 10 rows
data.tail(10)
id first_name last_name email gender department hire_date salary experience_years performance_score
10 11 David Martinez [email protected] Male Engineering 2020-05-03 76000 2 4.2
11 12 Mia Anderson [email protected] Female Finance 2016-03-27 91000 7 4.6
12 13 Joseph Taylor [email protected] Male Marketing 2019-10-09 69500 3 3.9
13 14 Charlotte Thomas [email protected] Female Engineering 2018-06-14 81000 4 4.4
14 15 Daniel Hernandez [email protected] Male Sales 2021-01-20 59000 1 3.5
15 16 Amelia Moore [email protected] Female HR 2015-04-08 72500 8 4.2
16 17 Matthew Martin [email protected] Male Finance 2017-11-15 82500 5 4.1
17 18 Harper Jackson [email protected] Female Engineering 2020-09-01 71000 2 3.8
18 19 Andrew Thompson [email protected] Male Marketing 2016-07-19 78000 6 4.3
19 20 Abigail White [email protected] Female Sales 2018-02-28 66000 4 4.0

head()

# Display the first 10 rows
data.head(10)
id first_name last_name email gender department hire_date salary experience_years performance_score
0 1 John Smith [email protected] Male Marketing 2018-03-15 72000 5 4.2
1 2 Emma Johnson [email protected] Female Engineering 2019-07-23 85000 3 4.7
2 3 Michael Williams [email protected] Male Sales 2015-11-05 65000 8 3.9
3 4 Sophia Brown [email protected] Female HR 2020-01-15 62000 2 4.1
4 5 Robert Jones [email protected] Male Engineering 2017-05-22 92000 6 4.8
5 6 Olivia Miller [email protected] Female Marketing 2021-02-10 58000 1 3.7
6 7 William Davis [email protected] Male Finance 2016-09-18 88000 7 4.5
7 8 Ava Garcia [email protected] Female Engineering 2019-04-30 79500 4 4.3
8 9 James Rodriguez [email protected] Male Sales 2018-12-11 67500 3 3.8
9 10 Isabella Wilson [email protected] Female HR 2017-08-24 63500 5 4.0

Count values - value_counts()
#

Display how many are male or female

data['gender'].value_counts()
gender
Male      10
Female    10
Name: count, dtype: int64
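value_counts() can also report proportions instead of raw counts via normalize=True; a sketch on a toy gender column standing in for data['gender']:

```python
import pandas as pd

# A small stand-in for data['gender'] above
gender = pd.Series(['Male', 'Female', 'Male', 'Female'])

# normalize=True divides each count by the total, giving proportions
print(gender.value_counts(normalize=True))
```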

Set a column as the row index - set_index()
#

Set index as their ids

data = data.set_index('id')
data
first_name last_name email gender department hire_date salary experience_years performance_score
id
1 John Smith [email protected] Male Marketing 2018-03-15 72000 5 4.2
2 Emma Johnson [email protected] Female Engineering 2019-07-23 85000 3 4.7
3 Michael Williams [email protected] Male Sales 2015-11-05 65000 8 3.9
4 Sophia Brown [email protected] Female HR 2020-01-15 62000 2 4.1
5 Robert Jones [email protected] Male Engineering 2017-05-22 92000 6 4.8
6 Olivia Miller [email protected] Female Marketing 2021-02-10 58000 1 3.7
7 William Davis [email protected] Male Finance 2016-09-18 88000 7 4.5
8 Ava Garcia [email protected] Female Engineering 2019-04-30 79500 4 4.3
9 James Rodriguez [email protected] Male Sales 2018-12-11 67500 3 3.8
10 Isabella Wilson [email protected] Female HR 2017-08-24 63500 5 4.0
11 David Martinez [email protected] Male Engineering 2020-05-03 76000 2 4.2
12 Mia Anderson [email protected] Female Finance 2016-03-27 91000 7 4.6
13 Joseph Taylor [email protected] Male Marketing 2019-10-09 69500 3 3.9
14 Charlotte Thomas [email protected] Female Engineering 2018-06-14 81000 4 4.4
15 Daniel Hernandez [email protected] Male Sales 2021-01-20 59000 1 3.5
16 Amelia Moore [email protected] Female HR 2015-04-08 72500 8 4.2
17 Matthew Martin [email protected] Male Finance 2017-11-15 82500 5 4.1
18 Harper Jackson [email protected] Female Engineering 2020-09-01 71000 2 3.8
19 Andrew Thompson [email protected] Male Marketing 2016-07-19 78000 6 4.3
20 Abigail White [email protected] Female Sales 2018-02-28 66000 4 4.0
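reset_index() is the inverse of set_index(), moving the index back out into an ordinary column; a sketch on a small made-up frame:

```python
import pandas as pd

df = pd.DataFrame({'id': [1, 2, 3],
                   'salary': [72000, 85000, 65000]})

# set_index moves the 'id' column into the index;
# reset_index moves it back out and restores a default RangeIndex
indexed = df.set_index('id')
restored = indexed.reset_index()
print(restored.equals(df))
```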

Sort values - .sort_values(by=COLUMN)
#

Find the one with highest salary

data.sort_values(
    by='salary',
    ascending=False
).head(1)
first_name last_name email gender department hire_date salary experience_years performance_score
id
5 Robert Jones [email protected] Male Engineering 2017-05-22 92000 6 4.8

Sort by performance score (descending), then by salary (ascending), to spot employees whose salary is misaligned with their score

data.sort_values(['performance_score', 'salary'], ascending=[False, True]).drop(['email'], axis='columns')
first_name last_name gender department hire_date salary experience_years performance_score
id
5 Robert Jones Male Engineering 2017-05-22 92000 6 4.8
2 Emma Johnson Female Engineering 2019-07-23 85000 3 4.7
12 Mia Anderson Female Finance 2016-03-27 91000 7 4.6
7 William Davis Male Finance 2016-09-18 88000 7 4.5
14 Charlotte Thomas Female Engineering 2018-06-14 81000 4 4.4
19 Andrew Thompson Male Marketing 2016-07-19 78000 6 4.3
8 Ava Garcia Female Engineering 2019-04-30 79500 4 4.3
1 John Smith Male Marketing 2018-03-15 72000 5 4.2
16 Amelia Moore Female HR 2015-04-08 72500 8 4.2
11 David Martinez Male Engineering 2020-05-03 76000 2 4.2
4 Sophia Brown Female HR 2020-01-15 62000 2 4.1
17 Matthew Martin Male Finance 2017-11-15 82500 5 4.1
10 Isabella Wilson Female HR 2017-08-24 63500 5 4.0
20 Abigail White Female Sales 2018-02-28 66000 4 4.0
3 Michael Williams Male Sales 2015-11-05 65000 8 3.9
13 Joseph Taylor Male Marketing 2019-10-09 69500 3 3.9
9 James Rodriguez Male Sales 2018-12-11 67500 3 3.8
18 Harper Jackson Female Engineering 2020-09-01 71000 2 3.8
6 Olivia Miller Female Marketing 2021-02-10 58000 1 3.7
15 Daniel Hernandez Male Sales 2021-01-20 59000 1 3.5

Analysis by groups - .groupby()
#

Find the average salary for each number of experience years, and then for each gender

data.groupby('experience_years')['salary'].mean()
experience_years
1    58500.000000
2    69666.666667
3    74000.000000
4    75500.000000
5    72666.666667
6    85000.000000
7    89500.000000
8    68750.000000
Name: salary, dtype: float64
data.groupby('gender')['salary'].mean()
gender
Female    72950.0
Male      74950.0
Name: salary, dtype: float64

Find the total salary paid to each gender

data.groupby('gender')['salary'].sum()
gender
Female    729500
Male      749500
Name: salary, dtype: int64

Count the employees of each gender

data.groupby('gender').count()
        first_name  last_name  email  department  hire_date  salary  experience_years  performance_score
gender
Female          10         10     10          10         10      10                10                 10
Male            10         10     10          10         10      10                10                 10

Yes, it outputs a count for every column: .count() counts the non-null entries in each column within each group.
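groupby() also accepts a list of keys, and .agg() can compute several statistics at once; a sketch on a small stand-in for the employee data:

```python
import pandas as pd

# A small made-up stand-in for the employee data above
df = pd.DataFrame({
    'gender': ['Male', 'Female', 'Male', 'Female'],
    'department': ['Sales', 'Sales', 'HR', 'HR'],
    'salary': [65000, 66000, 72500, 62000],
})

# Grouping by two keys yields a MultiIndex of (gender, department);
# .agg computes both statistics per group in one pass
summary = df.groupby(['gender', 'department'])['salary'].agg(['mean', 'count'])
print(summary)
```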