Table of Contents
Abstract
Background
Variable Introduction
Preprocessing of the Predictors
Splitting of the Dataset
Model fitting
Summary
Appendix I: Supplementary Figures for Runner-up Models
Appendix II: Questionnaire
Appendix III: R Code
References
Young People Survey
Predicting Spending Variable
MA 4790 Predictive Modelling
Julia Duda
Dhairya Kothari
Rahul Gowla
Abstract:
In 2013, students of the statistics class at
FACULTY OF SOCIAL AND ECONOMIC SCIENCE were asked to conduct a survey among
their friends about their daily spending habits. Data were collected both manually
and electronically and stored for further analysis. Our main goal is to predict the
participants' spending habits in the near future, based on their personality traits,
interests and preferences, which could help them monitor their spending and reduce
their expenses if the need arises. In this paper, we focus on predicting spending
habits and their relation to the participants' interests. Relationships between the
predictors are explored, and the necessary preprocessing and treatment of missing
values are carried out. Finally, both linear and nonlinear regression models are
built on the data, which we split in order to check the models' performance. The top
models are then used to predict on the test set, and the overall best model is
selected to gain some insight into the underlying behavior.
Background:
The data were collected by the students of the statistics class at
FACULTY OF SOCIAL AND ECONOMIC SCIENCE, who conducted a survey about people's
personality traits, interests and preferences. Students were asked to invite their
friends to participate in this survey. The original questionnaire was in Slovak and
was later translated into English. All participants were of Slovakian nationality,
aged between 15 and 30. The survey was presented to participants in both electronic
and written form. The translated questions are attached in Appendix II.
Variable Introduction:
The data set contains 1010 rows and 150 columns,
stored in the file named "responses.csv". Among the columns, 139 are integer
and 11 are categorical. For convenience, the original variable names were shortened
in the data file; see the "columns.csv" file to match the data with the original
names. The data also contain some missing values.
Every integer variable is scaled from 1 to 5; for example, the column "fear
of dangerous dogs" uses 1 for 'not afraid at all' and 5 for 'very afraid'.
The categorical variables have different levels depending on the question; for
example, Smoking has the levels 'never smoked', 'tried smoking', 'former smoker'
and 'current smoker', whereas Drinking has the levels 'Never', 'Social drinker'
and 'Drinks a lot'. The variables are split into the groups listed below:
Preferences or groups | Number of variables or items
Music preferences | 19
Movie preferences | 12
Hobbies & interests | 32
Phobias | 10
Health habits | 3
Personality traits, views on life, & opinions | 57
Spending habits | 7
Demographics | 10
Preprocessing of the Predictors:
The first step in analyzing any data set is preprocessing it, both to obtain
consistent data and to be able to reproduce the results.
The preprocessing steps are:
1) Recode categorical data to apply kNN
We recode the categorical levels into numeric levels so that we can impute the
missing values. All 11 categorical variables are recoded to numeric; for example,
Smoking has the levels 'never smoked', 'tried smoking', 'former smoker' and
'current smoker', so 'never smoked' becomes 1 and 'current smoker' becomes 4.
Similarly, Drinking has the levels 'Never', 'Social drinker' and 'Drinks a lot',
giving 'Never' as 1 and 'Drinks a lot' as 3. Binary variables such as Gender take
the values 0 and 1.
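A minimal sketch of this recoding, following the code in Appendix III (Smoking and Gender are shown; the other categorical variables are handled analogously):

responses$Smoking <- as.numeric(ordered(responses$Smoking,
                         levels = c("never smoked", "tried smoking",
                                    "former smoker", "current smoker")))
# binary variables such as Gender become 0/1 (here 0 = female)
levels(responses$Gender) <- c("0", "1")
responses$Gender <- as.numeric(responses$Gender) - 1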
2) Impute the missing values with kNN (k = 5)
fig 1.1
Figure 1.1 shows that the data contain a few missing values. We imputed these
missing values using kNN with tuning parameter k = 5.
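A sketch of the imputation call, as used in Appendix III (responses is the recoded data frame from step 1):

library(VIM)
# impute every missing value with its 5 nearest neighbours
imputed <- kNN(responses, imp_var = FALSE, k = 5)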
3) Recode the categorical data to dummy variables
We recode the (now numeric) categorical variables to dummy variables, so that,
for example, Smoking (1-4) is split into four binary variables Smoking1 to
Smoking4. The number of predictors therefore increases:
New total predictors = 166
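A sketch of the dummy-coding step with the dummies package, mirroring Appendix III; here predictors is the imputed data without the seven spending columns, and decode_pred holds the indices of the recoded categorical columns:

library(dummies)
pred_test <- predictors
# turn the recoded categorical columns back into factors ...
pred_test[, decode_pred] <- apply(pred_test[, decode_pred], 2, as.factor)
# ... and expand each into one binary column per level, e.g. Smoking1 to Smoking4
dummy_pred <- dummy.data.frame(pred_test,
                               names = colnames(predictors)[decode_pred],
                               omit.constants = TRUE, all = TRUE)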
4) Correlation analysis (cutoff = 0.7)
We choose to remove variables with a pairwise correlation above 0.7, as some
models are highly sensitive to correlated predictors. Seven variables are removed;
they are listed below, followed by a short caret sketch.
fig 1.2
Variables remaining after the correlation filter = 159
Predictors removed for high correlation:
· Alcohol3 (drinks a lot)
· Biology
· Gender0 (female)
· Left...right.handed0 (left handed)
· Only.child0 (has siblings)
· Village...town0 (city)
· House...block.of.flats0 (flats/apartments)
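The correlation filter itself, as implemented with caret in Appendix III:

library(caret)
cor_big <- findCorrelation(cor(dummy_pred), cutoff = 0.7)   # indices of the offending columns
colnames(dummy_pred)[cor_big]                               # the 7 predictors listed above
final_pred <- dummy_pred[, -cor_big]                        # 159 predictors remain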
5) Near-zero-variance predictor removal
Some predictors hardly change their value across the whole data set. These
predictors do not contribute to the prediction, so we can remove them.
3 variables removed: internet.usage1, education1 & education6
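A sketch of this filter, as in Appendix III:

library(caret)
nzv <- nearZeroVar(final_pred)    # indices of the (nearly) constant columns
colnames(final_pred)[nzv]         # internet.usage1, education1, education6
final_pred <- final_pred[, -nzv]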
6) Response variable Y = sum of the 7 spending-habit columns (treated as continuous)
We have 7 'response' variables, but the scope of our study includes only one
response variable, so we add the 7 variables together to form a single response
(we justify this by looking at the correlation between the 7 spending variables).
We invert the Finance variable (Finance: 1 = I spend a lot, 5 = I save a lot,
becomes New_Finance: 5 = I spend a lot, 1 = I save a lot), which yields
consistently positive correlations. We can also note that the variables have a
reasonable overall distribution and no outliers. A sketch of the construction
follows fig 1.5 below.
fig 1.3
fig 1.4
We now have the response Spending_Sum, which is:
· built from items coded 1-5, where 5 indicates spending more and 1 indicates spending less
· treated as continuous and distributed between 7 and 35
fig 1.5
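A sketch of how the response is constructed, following Appendix III (columns 134 to 140 of the imputed data are the seven spending items, with Finance in column 134):

imputed[, 134] <- 6 - imputed[, 134]       # invert Finance: 1..5 becomes 5..1
spending_imp <- imputed[, 134:140]         # the seven spending-habit columns
Spending_Sum <- rowSums(spending_imp)      # response, ranging from 7 to 35
predictors   <- imputed[, -c(134:140)]     # everything else stays a predictor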
7) Data splitting
As we have a sufficient sample size (1010 rows), we use an 80-20 random split.
8) Data re-sampling approach
We use 10-fold CV with no repeats, since this gives consistent results with a
reasonable computational time.
9) Outlier removal
We do not have outliers in the data set.
Splitting of the Dataset:
As mentioned above, we use an 80:20 random split, and the resampling approach is
10-fold CV. Therefore:
Training set size: 808 data points (rows)
Testing set size: 202 data points (rows)
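A sketch of the split and the resampling control, as in Appendix III (x is the preprocessed predictor set, y the Spending_Sum response):

library(caret)
set.seed(007)
index   <- sample.int(length(y), 0.8 * length(y))   # 808 of the 1010 rows
x_train <- x[index, ];  y_train <- y[index]
x_test  <- x[-index, ]; y_test  <- y[-index]
ctrl <- trainControl(method = "cv", number = 10)    # 10-fold CV, no repeats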
Model fitting:
(a) Model Building:
Based on our data, we build the following linear and nonlinear models on our
dataset.
Linear Models | Nonlinear Models
OLS | N-Net
OLS-PCA | MARS
PLS | SVM
PCR | KNN
Ridge |
Lasso |
E-Net |
fig 2.1
Each model has specific assumptions, so we preprocess the data accordingly before
model building. The preprocessing steps taken for the different models are listed
below.
Linear Models:
Model | Remove High Corr | Center | Scale | PCA
OLS | ✓ | ✓ | ✓ | -
OLS-PCA | - | ✓ | ✓ | ✓
PLS | - | ✓ | ✓ | -
PCR | - | ✓ | ✓ | -
Ridge | - | ✓ | ✓ | -
Lasso | - | ✓ | ✓ | -
E-NET | - | ✓ | ✓ | -
fig 2.2
Nonlinear Models:
Model | Remove High Corr | Center | Scale | PCA
N-Net | ✓ | ✓ | ✓ | -
MARS | ✓ | ✓ | ✓ | -
SVM | - | ✓ | ✓ | -
KNN | - | ✓ | ✓ | -
fig 2.3
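Every model is fit through caret's train(), with the preprocessing of figs 2.2 and 2.3 passed via the preProc argument; a representative call (PLS, centered and scaled) is sketched below, following Appendix III:

set.seed(007)
PLS_tune <- train(x = x_train, y = y_train,
                  method = "pls",
                  tuneLength = 20,                  # number of components to try
                  trControl = ctrl,                 # the 10-fold CV control defined above
                  preProc = c("center", "scale"))   # per the table in fig 2.2
PLS_tune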
Since all models are regression models, RMSE and R-squared are appropriate
performance measures. Below (figs 2.4 and 2.5) are the RMSE and R-squared results
of all the models built.
Hence,
(b) Model Evaluation Statistic: RMSE (primary) and R-squared (secondary)
Training set results:
We trained both the linear and nonlinear regression models on our training set to
predict the spending habits of the young people, and then evaluated them on our
testing data.
Linear Models:
Model | Best Parameter | RMSE | R^2
OLS | - | 4.388 | 0.308
OLS-PCA | - | 4.342 | 0.307
PLS | Comp = 14 | 4.266 | 0.307
PCR | Comp = 2 | 4.159 | 0.344
Ridge | Lambda = 0.2 | 4.308 | 0.338
Lasso | Lambda = 0.75 | 4.101 | 0.369
E-NET | Lambda1 = 0.1, Lambda2 = 0.693 | 4.073 | 0.373
fig 2.4
Nonlinear Models:
Model | Best Parameter | RMSE | R^2
N-Net | Hidden units = 2, decay lambda = 0 | 4.31 | 0.294
MARS | Terms = 11, degree = 1 | 4.153 | 0.353
SVM | Cost = 2 | 5.119 | 0.005
KNN | K = 14 | 4.673 | 0.196
fig 2.5
From the tables above (figs 2.4 and 2.5) we can see that the LASSO, E-NET and MARS
models have nearly the same, and overall the highest, R-squared values together
with the lowest RMSE values. Even though none of the models performs particularly
well, among these the LASSO model stands out as the best. Below (fig 2.6) are the
testing set results of the best models listed above.
Testing Set Results:
Model | RMSE | R^2
Lasso | 3.716 | 0.403
MARS | 3.847 | 0.366
E-NET | 3.752 | 0.392
fig 2.6
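The testing-set numbers above come from predicting on the 202 held-out rows and summarizing with caret, e.g. for the Lasso fit (as in Appendix III):

prediction_lasso <- predict(lasso_tune, x_test)
defaultSummary(data.frame(obs = y_test, pred = prediction_lasso))
#      RMSE  Rsquared       MAE
# 3.7159639 0.4033666 2.9258084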
Below are the supplementary figures (model performance plots) corresponding to
the best model, LASSO.
LASSO Model: RMSE profile, training data, testing data, residual plot
Top 20 most important variables and their importance for the Lasso model:
Variable Importance Plot (Lasso Model)
Summary:
We conclude that the best models in this analysis were MARS, E-NET and LASSO.
These three models give almost the same RMSE value. Among the three, LASSO is the
optimal model, with a test-set RMSE of 3.716. However, the predictive performance
of all models is modest, so we recommend further analysis of this data and would
not use (or trust) any of these models for reliable prediction.
Appendix I: Supplementary Figures for Runner-up Models
MARS Model: RMSE profile, testing data, training data, residual plot
E-NET Model: RMSE profile, testing data, training data, residual plot
Appendix II: Questionnaire
MUSIC PREFERENCES
1. I
enjoy listening to music.: Strongly disagree 1-2-3-4-5
Strongly agree (integer)
2. I
prefer.: Slow paced music 1-2-3-4-5 Fast paced music (integer)
3. Dance,
Disco, Funk: Don't enjoy at all 1-2-3-4-5 Enjoy very much (integer)
4. Folk
music: Don't enjoy at all 1-2-3-4-5 Enjoy very much (integer)
5. Country:
Don't enjoy at all 1-2-3-4-5 Enjoy very much (integer)
6. Classical:
Don't enjoy at all 1-2-3-4-5 Enjoy very much (integer)
7. Musicals:
Don't enjoy at all 1-2-3-4-5 Enjoy very much (integer)
8. Pop:
Don't enjoy at all 1-2-3-4-5 Enjoy very much (integer)
9. Rock:
Don't enjoy at all 1-2-3-4-5 Enjoy very much (integer)
10. Metal,
Hard rock: Don't enjoy at all 1-2-3-4-5 Enjoy very much (integer)
11. Punk:
Don't enjoy at all 1-2-3-4-5 Enjoy very much (integer)
12. Hip
hop, Rap: Don't enjoy at all 1-2-3-4-5 Enjoy very much (integer)
13. Reggae,
Ska: Don't enjoy at all 1-2-3-4-5 Enjoy very much (integer)
14. Swing,
Jazz: Don't enjoy at all 1-2-3-4-5 Enjoy very much (integer)
15. Rock
n Roll: Don't enjoy at all 1-2-3-4-5 Enjoy very much (integer)
16. Alternative
music: Don't enjoy at all 1-2-3-4-5 Enjoy very much (integer)
17. Latin:
Don't enjoy at all 1-2-3-4-5 Enjoy very much (integer)
18. Techno,
Trance: Don't enjoy at all 1-2-3-4-5 Enjoy very much (integer)
19. Opera:
Don't enjoy at all 1-2-3-4-5 Enjoy very much (integer)
MOVIE PREFERENCES
1. I
really enjoy watching movies.: Strongly disagree 1-2-3-4-5
Strongly agree (integer)
2. Horror
movies: Don't enjoy at all 1-2-3-4-5 Enjoy very much (integer)
3. Thriller
movies: Don't enjoy at all 1-2-3-4-5 Enjoy very much (integer)
4. Comedies:
Don't enjoy at all 1-2-3-4-5 Enjoy very much (integer)
5. Romantic
movies: Don't enjoy at all 1-2-3-4-5 Enjoy very much (integer)
6. Sci-fi
movies: Don't enjoy at all 1-2-3-4-5 Enjoy very much (integer)
7. War
movies: Don't enjoy at all 1-2-3-4-5 Enjoy very much (integer)
8. Tales:
Don't enjoy at all 1-2-3-4-5 Enjoy very much (integer)
9. Cartoons:
Don't enjoy at all 1-2-3-4-5 Enjoy very much (integer)
10. Documentaries:
Don't enjoy at all 1-2-3-4-5 Enjoy very much (integer)
11. Western
movies: Don't enjoy at all 1-2-3-4-5 Enjoy very much (integer)
12. Action
movies: Don't enjoy at all 1-2-3-4-5 Enjoy very much (integer)
HOBBIES & INTERESTS
1. History:
Not interested 1-2-3-4-5 Very interested (integer)
2. Psychology:
Not interested 1-2-3-4-5 Very interested (integer)
3. Politics:
Not interested 1-2-3-4-5 Very interested (integer)
4. Mathematics:
Not interested 1-2-3-4-5 Very interested (integer)
5. Physics:
Not interested 1-2-3-4-5 Very interested (integer)
6. Internet:
Not interested 1-2-3-4-5 Very interested (integer)
7. PC
Software, Hardware: Not interested 1-2-3-4-5 Very
interested (integer)
8. Economy,
Management: Not interested 1-2-3-4-5 Very interested (integer)
9. Biology:
Not interested 1-2-3-4-5 Very interested (integer)
10. Chemistry:
Not interested 1-2-3-4-5 Very interested (integer)
11. Poetry
reading: Not interested 1-2-3-4-5 Very interested (integer)
12. Geography:
Not interested 1-2-3-4-5 Very interested (integer)
13. Foreign
languages: Not interested 1-2-3-4-5 Very interested (integer)
14. Medicine:
Not interested 1-2-3-4-5 Very interested (integer)
15. Law:
Not interested 1-2-3-4-5 Very interested (integer)
16. Cars:
Not interested 1-2-3-4-5 Very interested (integer)
17. Art:
Not interested 1-2-3-4-5 Very interested (integer)
18. Religion:
Not interested 1-2-3-4-5 Very interested (integer)
19. Outdoor
activities: Not interested 1-2-3-4-5 Very interested (integer)
20. Dancing:
Not interested 1-2-3-4-5 Very interested (integer)
21. Playing
musical instruments: Not interested 1-2-3-4-5 Very
interested (integer)
22. Poetry
writing: Not interested 1-2-3-4-5 Very interested (integer)
23. Sport
and leisure activities: Not interested 1-2-3-4-5 Very
interested (integer)
24. Sport
at competitive level: Not interested 1-2-3-4-5 Very
interested (integer)
25. Gardening:
Not interested 1-2-3-4-5 Very interested (integer)
26. Celebrity
lifestyle: Not interested 1-2-3-4-5 Very interested (integer)
27. Shopping:
Not interested 1-2-3-4-5 Very interested (integer)
28. Science
and technology: Not interested 1-2-3-4-5 Very
interested (integer)
29. Theatre:
Not interested 1-2-3-4-5 Very interested (integer)
30. Socializing:
Not interested 1-2-3-4-5 Very interested (integer)
31. Adrenaline
sports: Not interested 1-2-3-4-5 Very interested (integer)
32. Pets:
Not interested 1-2-3-4-5 Very interested (integer)
PHOBIAS
1. Flying:
Not afraid at all 1-2-3-4-5 Very afraid of (integer)
2. Thunder,
lightning: Not afraid at all 1-2-3-4-5 Very afraid of (integer)
3. Darkness:
Not afraid at all 1-2-3-4-5 Very afraid of (integer)
4. Heights:
Not afraid at all 1-2-3-4-5 Very afraid of (integer)
5. Spiders:
Not afraid at all 1-2-3-4-5 Very afraid of (integer)
6. Snakes:
Not afraid at all 1-2-3-4-5 Very afraid of (integer)
7. Rats,
mice: Not afraid at all 1-2-3-4-5 Very afraid of (integer)
8. Ageing:
Not afraid at all 1-2-3-4-5 Very afraid of (integer)
9. Dangerous
dogs: Not afraid at all 1-2-3-4-5 Very afraid of (integer)
10. Public
speaking: Not afraid at all 1-2-3-4-5 Very afraid of (integer)
HEALTH HABITS
1. Smoking
habits: Never smoked - Tried smoking - Former smoker - Current
smoker (categorical)
2. Drinking:
Never - Social drinker - Drink a lot (categorical)
3. I
live a very healthy lifestyle.: Strongly disagree 1-2-3-4-5 Strongly
agree (integer)
PERSONALITY TRAITS,
VIEWS ON LIFE & OPINIONS
1. I
take notice of what goes on around me.: Strongly disagree 1-2-3-4-5
Strongly agree (integer)
2. I
try to do tasks as soon as possible and not leave them until last minute.:
Strongly disagree 1-2-3-4-5 Strongly agree (integer)
3. I
always make a list so I don't forget anything.: Strongly
disagree 1-2-3-4-5 Strongly agree (integer)
4. I
often study or work even in my spare time.: Strongly disagree 1-2-3-4-5
Strongly agree (integer)
5. I
look at things from all different angles before I go ahead.:
Strongly disagree 1-2-3-4-5 Strongly agree (integer)
6. I
believe that bad people will suffer one day and good people will be rewarded.:
Strongly disagree 1-2-3-4-5 Strongly agree (integer)
7. I
am reliable at work and always complete all tasks given to me.:
Strongly disagree 1-2-3-4-5 Strongly agree (integer)
8. I
always keep my promises.: Strongly disagree 1-2-3-4-5
Strongly agree (integer)
9. I
can fall for someone very quickly and then completely lose interest.:
Strongly disagree 1-2-3-4-5 Strongly agree (integer)
10. I
would rather have lots of friends than lots of money.:
Strongly disagree 1-2-3-4-5 Strongly agree (integer)
11. I
always try to be the funniest one.: Strongly disagree 1-2-3-4-5
Strongly agree (integer)
12. I can
be two faced sometimes.: Strongly disagree 1-2-3-4-5
Strongly agree (integer)
13. I
damaged things in the past when angry.: Strongly disagree 1-2-3-4-5
Strongly agree (integer)
14. I
take my time to make decisions.: Strongly disagree 1-2-3-4-5
Strongly agree (integer)
15. I
always try to vote in elections.: Strongly disagree 1-2-3-4-5
Strongly agree (integer)
16. I
often think about and regret the decisions I make.:
Strongly disagree 1-2-3-4-5 Strongly agree (integer)
17. I
can tell if people listen to me or not when I talk to them.:
Strongly disagree 1-2-3-4-5 Strongly agree (integer)
18. I
am a hypochondriac.: Strongly disagree 1-2-3-4-5
Strongly agree (integer)
19. I am an empathetic person.: Strongly disagree 1-2-3-4-5 Strongly agree (integer)
20. I
eat because I have to. I don't enjoy food and eat as fast as I can.:
Strongly disagree 1-2-3-4-5 Strongly agree (integer)
21. I
try to give as much as I can to other people at Christmas.:
Strongly disagree 1-2-3-4-5 Strongly agree (integer)
22. I
don't like seeing animals suffering.: Strongly disagree 1-2-3-4-5
Strongly agree (integer)
23. I
look after things I have borrowed from others.: Strongly
disagree 1-2-3-4-5 Strongly agree (integer)
24. I
feel lonely in life.: Strongly disagree 1-2-3-4-5
Strongly agree (integer)
25. I
used to cheat at school.: Strongly disagree 1-2-3-4-5
Strongly agree (integer)
26. I
worry about my health.: Strongly disagree 1-2-3-4-5
Strongly agree (integer)
27. I
wish I could change the past because of the things I have done.:
Strongly disagree 1-2-3-4-5 Strongly agree (integer)
28. I
believe in God.: Strongly disagree 1-2-3-4-5
Strongly agree (integer)
29. I
always have good dreams.: Strongly disagree 1-2-3-4-5
Strongly agree (integer)
30. I
always give to charity.: Strongly disagree 1-2-3-4-5
Strongly agree (integer)
31. I
have lots of friends.: Strongly disagree 1-2-3-4-5
Strongly agree (integer)
32. Timekeeping.:
I am often early. - I am always on time. - I am often running late.
(categorical)
33. Do
you lie to others?: Never. - Only to avoid hurting
someone. - Sometimes. - Everytime it suits me. (categorical)
34. I
am very patient.: Strongly disagree 1-2-3-4-5
Strongly agree (integer)
35. I
can quickly adapt to a new environment.: Strongly disagree 1-2-3-4-5
Strongly agree (integer)
36. My
moods change quickly.: Strongly disagree 1-2-3-4-5
Strongly agree (integer)
37. I
am well-mannered and I look after my appearance.: Strongly
disagree 1-2-3-4-5 Strongly agree (integer)
38. I
enjoy meeting new people.: Strongly disagree 1-2-3-4-5
Strongly agree (integer)
39. I
always let other people know about my achievements.:
Strongly disagree 1-2-3-4-5 Strongly agree (integer)
40. I
think carefully before answering any important letters.:
Strongly disagree 1-2-3-4-5 Strongly agree (integer)
41. I enjoy children's company.: Strongly disagree 1-2-3-4-5 Strongly agree (integer)
42. I
am not afraid to give my opinion if I feel strongly about something.:
Strongly disagree 1-2-3-4-5 Strongly agree (integer)
43. I
can get angry very easily.: Strongly disagree 1-2-3-4-5
Strongly agree (integer)
44. I
always make sure I connect with the right people.: Strongly
disagree 1-2-3-4-5 Strongly agree (integer)
45. I
have to be well prepared before public speaking.: Strongly
disagree 1-2-3-4-5 Strongly agree (integer)
46. I
will find a fault in myself if people don't like me.:
Strongly disagree 1-2-3-4-5 Strongly agree (integer)
47. I
cry when I feel down or things don't go the right way.:
Strongly disagree 1-2-3-4-5 Strongly agree (integer)
48. I
am 100% happy with my life.: Strongly disagree 1-2-3-4-5
Strongly agree (integer)
49. I
am always full of life and energy.: Strongly disagree 1-2-3-4-5 Strongly
agree (integer)
50. I
prefer big dangerous dogs to smaller, calmer dogs.:
Strongly disagree 1-2-3-4-5 Strongly agree (integer)
51. I
believe all my personality traits are positive.: Strongly
disagree 1-2-3-4-5 Strongly agree (integer)
52. If I find something that doesn't belong to me I will hand it in.: Strongly disagree 1-2-3-4-5 Strongly agree (integer)
53. I
find it very difficult to get up in the morning.: Strongly
disagree 1-2-3-4-5 Strongly agree (integer)
54. I
have many different hobbies and interests.: Strongly disagree 1-2-3-4-5
Strongly agree (integer)
55. I
always listen to my parents' advice.: Strongly disagree 1-2-3-4-5
Strongly agree (integer)
56. I
enjoy taking part in surveys.: Strongly disagree 1-2-3-4-5
Strongly agree (integer)
57. How
much time do you spend online?: No time at all - Less than an hour
a day - Few hours a day - Most of the day (categorical)
SPENDING HABITS
1. I
save all the money I can.: Strongly disagree 1-2-3-4-5
Strongly agree (integer)
2. I
enjoy going to large shopping centers.: Strongly disagree 1-2-3-4-5
Strongly agree (integer)
3. I
prefer branded clothing to non-branded.: Strongly disagree 1-2-3-4-5
Strongly agree (integer)
4. I
spend a lot of money on partying and socializing.: Strongly
disagree 1-2-3-4-5 Strongly agree (integer)
5. I
spend a lot of money on my appearance.: Strongly disagree 1-2-3-4-5
Strongly agree (integer)
6. I
spend a lot of money on gadgets.: Strongly disagree 1-2-3-4-5
Strongly agree (integer)
7. I
will happily pay more money for good, quality or healthy food.:
Strongly disagree 1-2-3-4-5 Strongly agree (integer)
DEMOGRAPHICS
1. Age:
(integer)
2. Height:
(integer)
3. Weight:
(integer)
4. How
many siblings do you have?: (integer)
5. Gender:
Female - Male (categorical)
6. I
am: Left handed - Right handed (categorical)
7. Highest
education achieved: Currently a Primary school pupil -
Primary school - Secondary school - College/Bachelor degree (categorical)
8. I
am the only child: No - Yes (categorical)
9. I
spent most of my childhood in a: City - village (categorical)
10. I
lived most of my childhood in a: house/bungalow - block of flats
(categorical)
Appendix III: R Code

Part 1

# Predictive Modeling
install.packages("Amelia")
library(Amelia)
library(VIM)
library(caret)
library(readr)

responses <- read.csv("C:/Users/Juli/Desktop/Predictive Modeling/res-data.txt",
                      na.strings = c("", "NA"))
ncol(responses)   # [1] 150
nrow(responses)   # [1] 1010
par(mar = c(10, 3, 2, 2), mfrow = c(1, 1))
missmap(responses, col = c("wheat", "darkred"))

# Recode the non-integer predictors
attach(responses)
pred_int <- c()
for(i in 1:ncol(responses)){
  pred_int[i] <- is.integer(responses[, i])
}
decode <- which(pred_int == 0)      # which predictors are not integer?
colnames(responses[, decode])       # take a look at the names

# Treat Smoking as an ordered variable
table(Smoking, useNA = "always")
Smoking <- ordered(Smoking, levels = c("never smoked", "tried smoking",
                                       "former smoker", "current smoker"))
Smoking <- as.numeric(Smoking)      # as.numeric to apply kNN

# Treat Alcohol as an ordered variable
table(Alcohol, useNA = "always")
Alcohol <- ordered(Alcohol, levels = c("never", "social drinker", "drink a lot"))
Alcohol <- as.numeric(Alcohol)

# Treat Punctuality as an ordered variable
table(Punctuality, useNA = "always")
Punctuality <- ordered(Punctuality, levels = c("i am often early", "i am always on time",
                                               "i am often running late"))
Punctuality <- as.numeric(Punctuality)

# Treat Lying as an ordered variable
table(Lying, useNA = "always")
Lying <- ordered(Lying, levels = c("never", "only to avoid hurting someone",
                                   "sometimes", "everytime it suits me"))
Lying <- as.numeric(Lying)

# Treat Internet usage as an ordered variable
table(Internet.usage, useNA = "always")
Internet.usage <- ordered(Internet.usage, levels = c("no time at all", "less than an hour a day",
                                                     "few hours a day", "most of the day"))
Internet.usage <- as.numeric(Internet.usage)

# Gender
levels(Gender) <- c("0", "1")
Gender <- as.numeric(Gender) - 1    # 0 for females

# Left...right.handed
table(Left...right.handed, useNA = "always")
levels(Left...right.handed) <- c("0", "1")
Left...right.handed <- as.numeric(Left...right.handed) - 1   # 1 for right handed

# Education
table(Education, useNA = "always")
Education <- ordered(Education, levels = c("currently a primary school pupil", "primary school",
                                           "secondary school", "college/bachelor degree",
                                           "masters degree", "doctorate degree"))
Education <- as.numeric(Education)

# Only.child
levels(Only.child) <- c("0", "1")
Only.child <- as.numeric(Only.child) - 1                          # 0 for non only child

# Village...town
levels(Village...town) <- c("0", "1")
Village...town <- as.numeric(Village...town) - 1                  # 0 for city

# House...block.of.flats
levels(House...block.of.flats) <- c("0", "1")
House...block.of.flats <- as.numeric(House...block.of.flats) - 1  # 0 for block of flats

# Replace everything in the final data set
responses$Smoking <- Smoking
responses$Alcohol <- Alcohol
responses$Punctuality <- Punctuality
responses$Lying <- Lying
responses$Internet.usage <- Internet.usage
responses$Gender <- Gender
responses$Left...right.handed <- Left...right.handed
responses$Education <- Education
responses$Only.child <- Only.child
responses$Village...town <- Village...town
responses$House...block.of.flats <- House...block.of.flats

# --> there is no pattern for the missing responses
# For now: go with k = 5 to do the imputation
imputed <- kNN(responses, imp_var = FALSE, k = 5)
imputed[, 134] <- 6 - imputed[, 134]          # invert the Finance variable
missmap(imputed, col = c("wheat", "darkred"))

# Use a sum of the variables that cover spending as the response variable
spending_imp <- imputed[, 134:140]            # check manually which columns are for expenses
response_sum <- rowSums(spending_imp)

# Some graphics
par(mgp = c(2, 1, 0), mar = c(3, 3, 2, 1))
hist(response_sum, main = "Histogram of the response variable")
boxplot(spending_imp)
predictors <- imputed[, -c(134:140)]
apply(spending_imp, 2, mean)
par(mfrow = c(4, 2), mar = c(2, 2, 2, 1), mgp = c(1, 1, 0))
for(i in 1:ncol(spending_imp)){
  barplot(table(spending_imp[, i]), main = colnames(spending_imp)[i])
}
write.csv(imputed, file = "imputed_data.txt")

nearZeroVar(predictors)
par(mfrow = c(1, 1), mar = c(0, 1, 1, 1))
library(corrplot)
corrplot(cor(predictors), type = "upper", order = "hclust", cl.pos = "n", tl.pos = "n")
install.packages("corrgram")
library(corrgram)
corrgram(predictors)

# Going back to categorical data and dummy variables
decode
decode_pred <- which(colnames(predictors) %in% colnames(imputed[, decode]))
# model.matrix(~factor(imputed[, decode[1]]))
# length(findCorrelation(cor(predictors), cutoff = 0.6))
# for(i in 1:length(decode_pred)){}
install.packages("dummies")
library(dummies)
pred_test <- predictors
pred_test[, decode_pred] <- apply(pred_test[, decode_pred], 2, as.factor)
test <- dummy.data.frame(pred_test, names = colnames(predictors[decode_pred]),
                         omit.constants = TRUE,
                         dummy.classes = getOption("dummy.classes"), all = TRUE)
dummy_pred <- test
write.csv(dummy_pred, file = "dummypred.txt")

library(ggplot2)
library(GGally)                               # ggcorr() comes from GGally
ggcorr(dummy_pred, nbreaks = 5)
cor_big <- findCorrelation(cor(dummy_pred), cutoff = 0.7)
cor_big
colnames(dummy_pred[, cor_big])
final_pred <- dummy_pred[, -cor_big]
final_pred <- final_pred[, -1]
write.csv(final_pred, file = "final_pred.txt")

################################################################################
final_pred <- read.csv("final_pred.txt", header = TRUE)
final_pred <- final_pred[, -1]                # kicked out correlations with threshold = 0.7
dummy_pred <- read.csv("dummypred.txt", header = TRUE)
dummy_pred <- dummy_pred[, -1]                # correlations still in there
imputed <- read.csv("imputed_data.txt", header = TRUE)
imputed <- imputed[, -1]                      # the whole imputed data set, no dummies
spending_imp <- imputed[, 134:140]
response_sum <- rowSums(spending_imp)

## Finally, for model building:
x <- dummy_pred;        write.csv(x, file = "x.txt")
x_nocorr <- final_pred; write.csv(x_nocorr, file = "x_nocorr.txt")
y <- response_sum;      write.csv(y, file = "y.txt")

library(corrplot)
corrplot(cor(spending_imp), order = "hclust")

Part 2

# Actual model building
library(AppliedPredictiveModeling)
library(caret)
library(lars)
library(elasticnet)
library(Amelia)
library(pls)
library(earth)

## Finally, for model building:
x <- read.csv("x.txt", header = TRUE);               x <- x[, -1]   # cutoff was 0.7
x_nocorr <- read.csv("x_nocorr.txt", header = TRUE); x_nocorr <- x_nocorr[, -1]
y <- read.csv("y.txt", header = TRUE);               y <- y[, -1]

# There are still near-zero-variance predictors; we have to remove those:
nearZeroVar(x)                                # [1] 143 155 160
colnames(x[, c(143, 155, 160)])
x <- x[, -c(143, 155, 160)]
# Even in x_nocorr there are still some near-zero-variance predictors
nearZeroVar(x_nocorr)                         # [1] 141 151 156
colnames(x_nocorr[, c(141, 151, 156)])
x_nocorr <- x_nocorr[, -c(141, 151, 156)]

set.seed(007)
index <- sample.int(length(y), 0.8 * length(y))
# index <- createDataPartition(y, p = 0.8, list = FALSE)
train <- x[index, ]
test <- x[-index, ]
train_nocorr <- x_nocorr[index, ]
test_nocorr <- x_nocorr[-index, ]
y_train <- y[index]
y_test <- y[-index]

# Which models?
# Linear: 1. ordinary least squares
#         1.5 ordinary least squares with PCA as preprocessing ("preProc" argument in train())
#         2. partial least squares, 3. PCR
#         penalized: 4. ridge, 5. lasso, 6. enet
# Non-linear: 7. neural networks, 8. MARS, 9. SVM, 10. KNN

set.seed(1)
ctrl <- trainControl(method = "cv", number = 10)

#### START WITH THE LINEAR MODELS

### Linear model (without the highly correlated predictors)
set.seed(007)
lm_tune <- train(x = train_nocorr, y = y_train,
                 method = "lm",
                 trControl = ctrl,
                 preProc = c("center", "scale"))
lm_tune

### Linear model with PCA (and therefore also center and scale) as preProc
set.seed(007)
lm_tune_PCA <- train(x = train_nocorr, y = y_train,
                     method = "lm",
                     trControl = ctrl,
                     preProc = c("center", "scale", "pca"))
# PCA here: the number of components is not a tuning parameter
lm_tune_PCA
summary(lm_tune_PCA)

### PCR: now the number of principal components is a tuning parameter
# maybe not saved, but runs fast!
set.seed(007)
pcr_tune <- train(x = train, y = y_train,
                  method = "pcr",
                  trControl = ctrl,
                  preProc = c("center", "scale"),
                  tuneLength = 15)
pcr_tune
plot(pcr_tune)

### Partial least squares
set.seed(007)
PLS_tune <- train(x = train, y = y_train,
                  method = "pls",
                  tuneLength = 20,
                  trControl = ctrl,
                  preProc = c("center", "scale"))
PLS_tune
plot(PLS_tune)

### Ridge regression
ridgeGrid <- data.frame(.lambda = seq(0, .4, length = 15))   # range for lambda; check the graph
set.seed(007)
ridge_tune <- train(x = train, y = y_train,
                    method = "ridge",
                    tuneGrid = ridgeGrid,
                    trControl = ctrl,
                    preProc = c("center", "scale"))
plot(ridge_tune)
ridge_tune

### Lasso regression
lassoGrid <- expand.grid(.lambda = 0, .fraction = seq(.05, 0.4, length = 15))
# fraction = 1 means that there is no lasso penalty
set.seed(007)
lasso_tune <- train(x = train, y = y_train,
                    method = "enet",
                    tuneGrid = lassoGrid,
                    trControl = ctrl,
                    preProc = c("center", "scale"))
plot(lasso_tune)
lasso_tune

### Elastic net
enetGrid <- expand.grid(.lambda = c(0, 0.1, 0.2, 0.3), .fraction = seq(.05, 0.5, length = 15))
set.seed(007)
enet_tune <- train(x = train, y = y_train,
                   method = "enet",
                   tuneGrid = enetGrid,
                   trControl = ctrl,
                   preProc = c("center", "scale"))
plot(enet_tune)
enet_tune

#### CONTINUE WITH THE NON-LINEAR MODELS

### Neural networks
# Here, use the data without the highly correlated predictors
nnetGrid <- expand.grid(.decay = c(0, 0.01, .1),   # lambda
                        .size = c(1:10),           # H = number of hidden units
                        .bag = FALSE)              # bagging instead of different random seeds
set.seed(007)
nnetTune <- train(x = train_nocorr, y = y_train,
                  method = "avNNet",   # if avNNet is too slow: use nnet
                  # several starting points to avoid finding only a local minimum
                  tuneGrid = nnetGrid,
                  trControl = ctrl,
                  linout = TRUE,
                  trace = FALSE,
                  MaxNWts = 10 * (ncol(train_nocorr) + 1) + 10 + 1,   # number of parameters, depends on the grid
                  maxit = 500,
                  preProc = c("center", "scale"))
plot(nnetTune)
nnetTune

### MARS
# For MARS as well: no high correlations
marsGrid <- expand.grid(.degree = 1:2, .nprune = 2:38)
mars_tune <- train(x = train_nocorr, y = y_train,
                   method = "earth",
                   tuneGrid = marsGrid,   # explicitly declare the candidate models to test
                   trControl = ctrl,
                   preProc = c("center", "scale"))
plot(mars_tune)
mars_tune

### SVM
svm_tune <- train(x = train, y = y_train,
                  method = "svmRadial",             # sigma is computed analytically
                  preProc = c("center", "scale"),   # centering / scaling improves the fit
                  tuneLength = 6,
                  trControl = ctrl)
plot(svm_tune)
svm_tune
# aa <- varImp(svm_tune)
# plot(aa, top = 25, scales = list(y = list(cex = .95)))

### KNN
set.seed(007)
knnTune <- train(x = train, y = y_train,
                 method = "knn",
                 preProc = c("center", "scale"),   # centering and scaling occur for new predictions too
                 tuneGrid = data.frame(.k = 1:20),
                 trControl = ctrl)
knnTune
plot(knnTune)

################################################################################
# Best models: enet, lasso and MARS -- all the plots:

### enet
prediction_enet <- predict(enet_tune, test)
defaultSummary(data.frame(obs = y_test, pred = prediction_enet))
#      RMSE  Rsquared       MAE
# 3.7521506 0.3920136 2.9593867
par(mfrow = c(2, 2))
plot(enet_tune)
# on the testing set
xyplot(y_test ~ prediction_enet, type = c("p", "g"),
       xlab = "Predicted", ylab = "Observed")
# on the training set
xyplot(y_train ~ predict(enet_tune), type = c("p", "g"),
       xlab = "Predicted", ylab = "Observed")
# residual plot
xyplot(resid(enet_tune) ~ predict(enet_tune), type = c("p", "g"),
       xlab = "Predicted", ylab = "Residuals")

### lasso
prediction_lasso <- predict(lasso_tune, test)
defaultSummary(data.frame(obs = y_test, pred = prediction_lasso))
#      RMSE  Rsquared       MAE
# 3.7159639 0.4033666 2.9258084
plot(lasso_tune)
# on the testing set
xyplot(y_test ~ prediction_lasso, type = c("p", "g"),
       xlab = "Predicted", ylab = "Observed")
# on the training set
xyplot(y_train ~ predict(lasso_tune), type = c("p", "g"),
       xlab = "Predicted", ylab = "Observed")
# residual plot
xyplot(resid(lasso_tune) ~ predict(lasso_tune), type = c("p", "g"),
       xlab = "Predicted", ylab = "Residuals")

### MARS
prediction_mars <- drop(predict(mars_tune, test_nocorr))
defaultSummary(data.frame(obs = y_test, pred = prediction_mars))
#      RMSE  Rsquared       MAE
# 3.8470064 0.3660746 2.9510664
plot(mars_tune)
# on the testing set
xyplot(y_test ~ prediction_mars, type = c("p", "g"),
       xlab = "Predicted", ylab = "Observed")
# on the training set
xyplot(y_train ~ predict(mars_tune), type = c("p", "g"),
       xlab = "Predicted", ylab = "Observed")
# residual plot
xyplot(resid(mars_tune) ~ predict(mars_tune), type = c("p", "g"),
       xlab = "Predicted", ylab = "Residuals")
References:
https://www.kaggle.com/miroslavsabo/young-people-survey
https://www.r-project.org/other-docs.html
https://www.rdocumentation.org/
Max Kuhn and Kjell Johnson. Applied Predictive Modeling. Springer, 2013.