
Logistic Regression Thomas Schwarz

Categorical Data

• Outcomes can be categorical

• Often, outcome is binary:

• President gets re-elected or not

• Customer is satisfied or not

• Often, explanatory variables are categorical as well

• Person comes from an under-performing school

• Order was made on a week-end

• …

Prediction Models for Binary Outcomes

• Famous example:

• Given an image of a pet, predict whether it is a cat or a dog

Prediction Models for Binary Outcomes

• Bayes: generative classifier

• Predicts P(c | d) indirectly

• Evaluates the product of likelihood and prior:

ĉ = argmax_{c ∈ C} P(d | c) P(c)

• Prior P(c): probability of a category without looking at the data

• Likelihood P(d | c): probability of observing the data d if it comes from category c
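To make the contrast with the discriminative approach on the next slide concrete, here is a minimal sketch of the generative decision rule; the priors and the likelihood function are made-up placeholders, not values from the lecture.

import numpy as np

# Hypothetical priors P(c) for two categories.
priors = {'cat': 0.5, 'dog': 0.5}

def likelihood(d, c):
    # Placeholder: in a real generative model (e.g. naive Bayes) this would
    # be estimated from training data for each category c.
    return 0.7 if (c == 'dog') == d.get('has_collar', False) else 0.3

def bayes_predict(d):
    # c_hat = argmax_{c in C} P(d | c) P(c)
    return max(priors, key=lambda c: likelihood(d, c) * priors[c])

print(bayes_predict({'has_collar': True}))   # -> 'dog'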

Prediction Models for Binary Outcomes

• Regression is a discriminative classifier

• Tries to learn the classification directly from the data

• E.g.: All dog pictures have a collar

• Collar present —> predict dog

• Collar not present —> predict cat

• Computes P(c | d) directly

Prediction Models for Binary Outcomes

• Regression:

• Supervised learning: we have a training set with the classification provided

• Input is given as vectors of numerical features x⁽ⁱ⁾ = (x_{1,i}, x_{2,i}, …, x_{n,i})

• A classification function ŷ(x) that calculates the predicted class

• An objective function for learning: measures the goodness of fit between true outcome and predicted outcome

• An algorithm to optimize the objective function

Prediction Models for Binary Outcomes

• Linear Regression:

• Classification function of the type ŷ((x₁, x₂, …, xₙ)) = a₁x₁ + a₂x₂ + … + aₙxₙ + b

• Objective function (a.k.a. cost function):

• Sum of squared differences between predicted and observed outcomes over the training set T = {x⁽¹⁾, x⁽²⁾, …, x⁽ᵐ⁾}:

∑_{i=1}^{m} (y⁽ⁱ⁾ − ŷ⁽ⁱ⁾)²

• Minimize the cost function
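As a quick illustration, a minimal sketch of this cost function in Python; the data points and coefficients here are made up for the example.

import numpy as np

def sse_cost(a, b, X, y):
    # Sum of squared differences between observed y and predicted y_hat.
    y_hat = X @ a + b
    return np.sum((y - y_hat) ** 2)

X = np.array([[1.0, 2.0], [2.0, 0.5], [3.0, 1.0]])   # three training points, two features
y = np.array([3.0, 2.5, 4.0])
print(sse_cost(np.array([1.0, 0.5]), 0.0, X, y))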

Prediction Models for Binary Outcomes

• Linear regression can predict a numerical value

• It can be made to predict a binary value

• If the predictor is higher than a cut-off value: predict yes

• Else predict no

• But there are better ways to generate a binary classifier

Prediction Models for Binary Outcomes

• Good binary classifier:

• Since we want to predict the probability of a category based on the features:

• It should look like a probability

• Since we want to optimize:

• Should be easy to differentiate

• Best candidate classifier that has emerged:

• Sigmoid classifier

Logistic Regression

• Use the logistic function

σ(z) = 1 / (1 + exp(−z))

[Plot of σ and its derivative σ′ for z between −10 and 10; σ rises from 0 to 1, passing through 0.5 at z = 0]
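A minimal sketch of the logistic function and its derivative in Python (NumPy is assumed):

import numpy as np

def sigmoid(z):
    # Logistic function: maps any real number into (0, 1).
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_prime(z):
    # Derivative of the logistic function: sigma'(z) = sigma(z) * (1 - sigma(z)).
    s = sigmoid(z)
    return s * (1.0 - s)

z = np.linspace(-10, 10, 5)
print(sigmoid(z))        # values between 0 and 1
print(sigmoid_prime(z))  # largest at z = 0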

Logistic Regression

• Combine with linear regression to obtain the logistic regression approach:

• Learn the best weights in

ŷ((x₁, x₂, …, xₙ)) = σ(b + w₁x₁ + w₂x₂ + … + wₙxₙ)

• We now interpret this as a probability for the positive outcome '+'

• Set a decision boundary at 0.5

• This is no restriction, since we can adjust b and the weights
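A minimal sketch of this prediction step; the weights and intercept below are placeholders, not fitted values.

import numpy as np

def predict_proba(w, b, x):
    # y_hat = sigma(b + w_1 x_1 + ... + w_n x_n)
    return 1.0 / (1.0 + np.exp(-(np.dot(w, x) + b)))

def predict_class(w, b, x, boundary=0.5):
    # Decision boundary at 0.5: probabilities at or above it are classified as '+' (1).
    return int(predict_proba(w, b, x) >= boundary)

w = np.array([0.8, -1.2])   # placeholder weights
b = 0.1                     # placeholder intercept
print(predict_proba(w, b, np.array([1.0, 0.5])))
print(predict_class(w, b, np.array([1.0, 0.5])))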

Logistic Regression

• We need to measure how far a prediction ŷ is from the true value y

• The true value y can only be 0 or 1; the prediction ŷ is a probability between 0 and 1

• If y = 1: want to support ŷ = 1 and penalize ŷ = 0

• If y = 0: want to support ŷ = 0 and penalize ŷ = 1

• One successful approach starts from the probability the model assigns to the true label, which we want to be as large as possible:

L(ŷ, y) = ŷ^y (1 − ŷ)^(1−y)

Logistic Regression

• Easier: take the negative logarithm of this quantity, and minimize it

• This is the Cross Entropy Loss:

L_CE(ŷ, y) = − y log(ŷ) − (1 − y) log(1 − ŷ)
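A minimal sketch of the cross-entropy loss for a single prediction; the small epsilon is an assumption added only to avoid log(0).

import numpy as np

def cross_entropy_loss(y_hat, y, eps=1e-12):
    # L_CE = -y*log(y_hat) - (1-y)*log(1-y_hat)
    y_hat = np.clip(y_hat, eps, 1.0 - eps)   # guard against log(0)
    return -y * np.log(y_hat) - (1 - y) * np.log(1 - y_hat)

print(cross_entropy_loss(0.9, 1))   # small loss: confident and correct
print(cross_entropy_loss(0.9, 0))   # large loss: confident and wrong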

Logistic Regression

• This approach is successful because we can use Gradient Descent

• Training set of size m

• Minimize

∑_{i=1}^{m} L_CE(y⁽ⁱ⁾, ŷ⁽ⁱ⁾)

• This turns out to be a convex function, so minimization is simple! (As far as those things go)

• Recall: ŷ((x₁, x₂, …, xₙ)) = σ(b + w₁x₁ + w₂x₂ + … + wₙxₙ)

• We minimize with respect to the weights w₁, …, wₙ and b

Logistic Regression

• Calculus:

∂L_CE(w, b) / ∂wⱼ = (σ(w₁x₁ + … + wₙxₙ + b) − y) xⱼ = (ŷ − y) xⱼ

• Difference between the estimated outcome ŷ and the true outcome y, multiplied by the input coordinate xⱼ

Logistic Regression

• Stochastic Gradient Descent

• Until the gradient is almost zero:

• For each training point (x⁽ⁱ⁾, y⁽ⁱ⁾):

• Compute the prediction ŷ⁽ⁱ⁾

• Compute the loss

• Compute the gradient

• Nudge the weights in the opposite direction, scaled by a learning rate η:

(w₁, …, wₙ) ← (w₁, …, wₙ) − η ∇L_CE

• Adjust η
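A minimal sketch of this training loop under simple assumptions: a fixed learning rate and a fixed number of passes over the data instead of an explicit gradient-size test.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sgd_train(X, y, eta=0.1, epochs=100):
    # X: (m, n) feature matrix, y: (m,) vector of 0/1 labels.
    m, n = X.shape
    w = np.zeros(n)
    b = 0.0
    for _ in range(epochs):
        for i in range(m):
            y_hat = sigmoid(np.dot(w, X[i]) + b)
            # Gradient of the cross-entropy loss: (y_hat - y) * x_j, and (y_hat - y) for b.
            w -= eta * (y_hat - y[i]) * X[i]
            b -= eta * (y_hat - y[i])
    return w, b

# Tiny made-up example: points with a larger first feature tend to be labeled 1.
X = np.array([[0.5, 1.0], [1.5, 0.2], [3.0, 0.7], [2.5, 1.5]])
y = np.array([0, 0, 1, 1])
w, b = sgd_train(X, y)
print(w, b)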

Logistic Regression

• Stochastic gradient descent uses a single data point at a time

• Better results with random batches of points at the same time

Lasso and Ridge Regression

• If the feature vector is long, danger of overfitting is high

• We learn the details of the training set

• Want to limit the number of features with non-zero weight

• Dealt with by adding a regularization term to the cost function

• Regularization term depends on the weights

• Penalizes large weights

Lasso and Ridge Regression

• L2 regularization:

• Use a quadratic function of the weights

• Such as the euclidean norm of the weights

• Called Ridge Regression

• Easier to optimize

Lasso and Ridge Regression

• L1 regularization

• Regularization term is the sum of the absolute values of the weights

• Not differentiable, so optimization is more difficult

• BUT: effective at lowering the number of non-zero weights

• Feature selection:

• Restrict the number of features in a model

• Usually gives better predictions
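A minimal sketch of how the two regularization terms modify the cost; the regularization strength lam is an assumed hyperparameter, not a value from the lecture.

import numpy as np

def ridge_penalty(w, lam):
    # L2 regularization: quadratic in the weights (squared Euclidean norm).
    return lam * np.sum(w ** 2)

def lasso_penalty(w, lam):
    # L1 regularization: sum of absolute values; pushes weights to exactly zero.
    return lam * np.sum(np.abs(w))

def regularized_cost(ce_losses, w, lam, kind='l2'):
    # Total cost = sum of cross-entropy losses + regularization term.
    penalty = ridge_penalty(w, lam) if kind == 'l2' else lasso_penalty(w, lam)
    return np.sum(ce_losses) + penalty

w = np.array([0.0, 2.5, -0.1])
print(regularized_cost(np.array([0.3, 0.7]), w, lam=0.1, kind='l2'))
print(regularized_cost(np.array([0.3, 0.7]), w, lam=0.1, kind='l1'))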

Examples

• Example: quality.csv

• Try to predict whether a patient labeled the care they received as poor or good

Examples

• First column is an arbitrary patient ID

• We make this the index

• One column is a Boolean when imported into Python

• So we change it to a numeric value

import pandas as pd

df = pd.read_csv('quality.csv', sep=',', index_col=0)   # use the patient ID column as the index
df.replace({False: 0, True: 1}, inplace=True)           # convert Boolean values to 0/1

Examples

• Framingham Heart Study:

• https://framinghamheartstudy.org

• Monitoring health data since 1948

• In 2002, enrolled the grandchildren of the first study's participants


Examples

• The data contains a few NaN values

• We just drop them

df = pd.read_csv('framingham.csv', sep=',')   # load the Framingham data
df.dropna(inplace=True)                       # drop rows with missing values

Logistic Regression in Stats-Models

• Import statsmodels.api

• Interactively select the columns that give us significant p-values

import statsmodels.api as sm

cols = [ 'Pain', 'TotalVisits', 'ProviderCount', 'MedicalClaims', 'ClaimLines', 'StartedOnCombination', 'AcuteDrugGapSmall',]

Logistic Regression in Stats-Models

• Create a logit model

• Can do it as we did for linear regression, with a formula string

• Can do it using a dataframe syntax

• Print the summary pages

logit_model = sm.Logit(df.PoorCare, df[cols])   # outcome: PoorCare, predictors: the selected columns
result = logit_model.fit()

print(result.summary2())
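For the formula-string variant mentioned above, a minimal sketch using statsmodels' formula API; the short formula here is an illustration, not the full column list.

import statsmodels.formula.api as smf

# Formula syntax, analogous to the linear-regression case.
# Note: the formula interface adds an intercept automatically.
logit_model = smf.logit('PoorCare ~ Pain + TotalVisits + AcuteDrugGapSmall', data=df)
result = logit_model.fit()
print(result.summary2())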

Logistic Regression in Stats-Models

• Print the results

•

• This gives the "confusion matrix"

• Entry [i, j] counts observations with

• actual value i

• predicted value j

print(result.pred_table())

Logistic Regression in Stats-Models

• Quality prediction:

[[91.  7.]
 [18. 15.]]

• 7 false positives and 18 false negatives

Logistic Regression in Stats-Models

• Heart Event Prediction:

[[3075.   26.]
 [ 523.   34.]]

• 26 false positives

• 523 false negatives

Logistic Regression in Stats-Models

• Can try to improve using Lasso (L1 regularization):

result = logit_model.fit_regularized()
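fit_regularized also accepts a penalty weight; a minimal sketch with an assumed value for alpha (not a value from the lecture):

# L1-regularized fit; alpha controls the penalty strength (placeholder value).
result = logit_model.fit_regularized(method='l1', alpha=1.0)
print(result.params)   # some coefficients may be driven to zero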

Logistic Regression in Stats-Models

• Can try to improve by selecting only the columns with significant p-values

Optimization terminated successfully.
         Current function value: 0.423769
         Iterations 6

[Truncated output of result.summary2(): the "Results: Logit" summary table follows]
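A minimal sketch of that selection step, using the p-values reported by the fitted model; the 0.05 threshold is an assumption.

# Keep only the predictors whose p-value is below a chosen threshold, then refit.
significant = result.pvalues[result.pvalues < 0.05].index
logit_model2 = sm.Logit(df.PoorCare, df[list(significant)])
result2 = logit_model2.fit()
print(result2.summary2())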