Table of Contents
Abstract
Background
Variable Introduction
Preprocessing of the Predictors
Splitting of the Dataset
Model fitting
Summary
Appendix I: Supplementary Figures for Runner-up Models
Appendix II: Questionnaire
Appendix III: R Code
References
Young People Survey
Predicting Spending Variable
MA 4790 Predictive Modelling
Julia Duda
Dhairya Kothari
Rahul Gowla
Abstract:
In 2013, students of the statistics class at
FACULTY OF SOCIAL AND ECONOMIC SCIENCE were asked to conduct a survey among
their friends about their daily spending habits. Data were collected both manually
and electronically and stored for further analysis. Our main goal is to predict the
participants' spending habits in the near future, based on their personality traits,
interests and preferences, which could help them monitor their spending and reduce
their expenses if the need arises. In this paper, we focus on predicting spending
habits and their relation to the participants' interests. Relationships between the
predictors are explored, and the necessary preprocessing and treatment of missing
values are carried out. Finally, both linear and nonlinear regression models are
built on the data, which we split in order to check the models' performance. The top
models are then used to predict on the test set, and the overall best model is
selected to gain some insight into the underlying behavior.
Background:
The data were collected by the students of the statistics class at
FACULTY OF SOCIAL AND ECONOMIC SCIENCE, who conducted a survey about people's
personality traits, interests and preferences. Students were asked to invite their
friends to participate in this survey. The original questionnaire was in Slovak and
was later translated into English. All participants were of Slovakian nationality,
aged between 15 and 30. The survey was presented to participants in both electronic
and written form. The translated questions are attached in Appendix II.
Variable Introduction:
The data set contains 1010 rows and 150 columns,
stored in the file named "responses.csv". Among the columns, 139 are integer
and 11 are categorical. For convenience, the original variable names were shortened
in the data file; see the "columns.csv" file to match the data with the original
names. The data also contain some missing values.
Every integer variable is scaled from 1 to 5; for example, the column "fear
of dangerous dogs" uses 1 for 'not afraid at all' and 5 for 'very afraid'.
The categorical variables have different levels depending on the question; for
example, Smoking has the levels 'never smoked', 'tried smoking', 'former smoker'
and 'current smoker', whereas Drinking has the levels 'Never', 'Social drinker'
and 'Drinks a lot'. The variables are split into the groups listed below:
Preferences or groups | Number of variables or items
Music preferences | 19
Movie preferences | 12
Hobbies & interests | 32
Phobias | 10
Health habits | 3
Personality traits, views on life, & opinions | 57
Spending habits | 7
Demographics | 10
Preprocessing of the Predictors:
The first step in analyzing any data set is preprocessing it, both to obtain
consistent data and to be able to reproduce the results.
The preprocessing steps are:
1) Recode categorical data to apply kNN
We recode the categorical levels into numeric levels so that we can impute the
missing values. All 11 categorical variables are recoded to numeric; for example,
Smoking has the levels 'never smoked', 'tried smoking', 'former smoker' and
'current smoker', so 'never smoked' becomes 1 and 'current smoker' becomes 4.
Similarly, Drinking has the levels 'Never', 'Social drinker' and 'Drinks a lot',
giving 'Never' as 1 and 'Drinks a lot' as 3. Binary variables such as Gender take
the values 0 and 1.
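A minimal sketch of this recoding, following the code in Appendix III (Smoking and Gender are shown; the other categorical variables are handled analogously):

responses$Smoking <- as.numeric(ordered(responses$Smoking,
                         levels = c("never smoked", "tried smoking",
                                    "former smoker", "current smoker")))
# binary variables such as Gender become 0/1 (here 0 = female)
levels(responses$Gender) <- c("0", "1")
responses$Gender <- as.numeric(responses$Gender) - 1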
2) Impute the missing values with kNN (k = 5)
fig 1.1
Figure 1.1 shows that the data contain a few missing values. We imputed these
missing values using kNN with tuning parameter k = 5.
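A sketch of the imputation call, as used in Appendix III (responses is the recoded data frame from step 1):

library(VIM)
# impute every missing value with its 5 nearest neighbours
imputed <- kNN(responses, imp_var = FALSE, k = 5)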
3) Recode the categorical data to dummy variables
We recode the (now numeric) categorical variables to dummy variables, so that,
for example, Smoking (1-4) is split into four binary variables Smoking1 to
Smoking4. The number of predictors therefore increases:
New total predictors = 166
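A sketch of the dummy-coding step with the dummies package, mirroring Appendix III; here predictors is the imputed data without the seven spending columns, and decode_pred holds the indices of the recoded categorical columns:

library(dummies)
pred_test <- predictors
# turn the recoded categorical columns back into factors ...
pred_test[, decode_pred] <- apply(pred_test[, decode_pred], 2, as.factor)
# ... and expand each into one binary column per level, e.g. Smoking1 to Smoking4
dummy_pred <- dummy.data.frame(pred_test,
                               names = colnames(predictors)[decode_pred],
                               omit.constants = TRUE, all = TRUE)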
4) Correlation analysis (cutoff = 0.7)
We choose to remove variables with a pairwise correlation above 0.7, as some
models are highly sensitive to correlated predictors. Seven variables are removed;
they are listed below, followed by a short caret sketch.
fig 1.2
Variables remaining after the correlation filter = 159
Predictors removed for high correlation:
· Alcohol3 (drinks a lot)
· Biology
· Gender0 (female)
· Left...right.handed0 (left handed)
· Only.child0 (has siblings)
· Village...town0 (city)
· House...block.of.flats0 (flats/apartments)
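The correlation filter itself, as implemented with caret in Appendix III:

library(caret)
cor_big <- findCorrelation(cor(dummy_pred), cutoff = 0.7)   # indices of the offending columns
colnames(dummy_pred)[cor_big]                               # the 7 predictors listed above
final_pred <- dummy_pred[, -cor_big]                        # 159 predictors remain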
5) Near-zero-variance predictor removal
Some predictors hardly change their value across the whole data set. These
predictors do not contribute to the prediction, so we can remove them.
3 variables removed: internet.usage1, education1 & education6
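A sketch of this filter, as in Appendix III:

library(caret)
nzv <- nearZeroVar(final_pred)    # indices of the (nearly) constant columns
colnames(final_pred)[nzv]         # internet.usage1, education1, education6
final_pred <- final_pred[, -nzv]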
6) Response variable Y = sum of the 7 spending-habit columns (treated as continuous)
We have 7 'response' variables, but the scope of our study includes only one
response variable, so we add the 7 variables together to form a single response
(we justify this by looking at the correlation between the 7 spending variables).
We invert the Finance variable (Finance: 1 = I spend a lot, 5 = I save a lot,
becomes New_Finance: 5 = I spend a lot, 1 = I save a lot), which yields
consistently positive correlations. We can also note that the variables have a
reasonable overall distribution and no outliers. A sketch of the construction
follows fig 1.5 below.
fig 1.3
fig 1.4
We now have the response Spending_Sum, which is:
· built from items coded 1-5, where 5 indicates spending more and 1 indicates spending less
· treated as continuous and distributed between 7 and 35
fig 1.5
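A sketch of how the response is constructed, following Appendix III (columns 134 to 140 of the imputed data are the seven spending items, with Finance in column 134):

imputed[, 134] <- 6 - imputed[, 134]       # invert Finance: 1..5 becomes 5..1
spending_imp <- imputed[, 134:140]         # the seven spending-habit columns
Spending_Sum <- rowSums(spending_imp)      # response, ranging from 7 to 35
predictors   <- imputed[, -c(134:140)]     # everything else stays a predictor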
7) Data splitting
As we have a sufficient sample size (1010 rows), we use an 80-20 random split.
8) Data re-sampling approach
We use 10-fold CV with no repeats, since this gives consistent results with a
reasonable computational time.
9) Outlier removal
We do not have outliers in the data set.
Splitting of the Dataset:
As mentioned above, we use an 80:20 random split, and the resampling approach is
10-fold CV. Therefore:
Training set size: 808 data points (rows)
Testing set size: 202 data points (rows)
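A sketch of the split and the resampling control, as in Appendix III (x is the preprocessed predictor set, y the Spending_Sum response):

library(caret)
set.seed(007)
index   <- sample.int(length(y), 0.8 * length(y))   # 808 of the 1010 rows
x_train <- x[index, ];  y_train <- y[index]
x_test  <- x[-index, ]; y_test  <- y[-index]
ctrl <- trainControl(method = "cv", number = 10)    # 10-fold CV, no repeats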
Model fitting:
(a) Model Building:
Based on our data, we build the following linear and nonlinear models on our
dataset.
Linear Models | Nonlinear Models
OLS | N-Net
OLS-PCA | MARS
PLS | SVM
PCR | KNN
Ridge |
Lasso |
E-Net |
fig 2.1
Each model has specific assumptions, so we preprocess the data accordingly before
model building. The preprocessing steps taken for the different models are listed
below.
Linear Models:
Model | Remove High Corr | Center | Scale | PCA
OLS | ✓ | ✓ | ✓ | -
OLS-PCA | - | ✓ | ✓ | ✓
PLS | - | ✓ | ✓ | -
PCR | - | ✓ | ✓ | -
Ridge | - | ✓ | ✓ | -
Lasso | - | ✓ | ✓ | -
E-NET | - | ✓ | ✓ | -
fig 2.2
Nonlinear Models:
Model | Remove High Corr | Center | Scale | PCA
N-Net | ✓ | ✓ | ✓ | -
MARS | ✓ | ✓ | ✓ | -
SVM | - | ✓ | ✓ | -
KNN | - | ✓ | ✓ | -
fig 2.3
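Every model is fit through caret's train(), with the preprocessing of figs 2.2 and 2.3 passed via the preProc argument; a representative call (PLS, centered and scaled) is sketched below, following Appendix III:

set.seed(007)
PLS_tune <- train(x = x_train, y = y_train,
                  method = "pls",
                  tuneLength = 20,                  # number of components to try
                  trControl = ctrl,                 # the 10-fold CV control defined above
                  preProc = c("center", "scale"))   # per the table in fig 2.2
PLS_tune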
Since all models are regression models, RMSE and R-squared are appropriate
performance measures. Below (figs 2.4 and 2.5) are the RMSE and R-squared results
of all the models built.
Hence,
(b) Model Evaluation Statistic: RMSE (primary) and R-squared (secondary)
Training set results:
We trained both the linear and nonlinear regression models on our training set to
predict the spending habits of the young people, and then evaluated them on our
testing data.
Linear Models:
Model | Best Parameter | RMSE | R^2
OLS | - | 4.388 | 0.308
OLS-PCA | - | 4.342 | 0.307
PLS | Comp = 14 | 4.266 | 0.307
PCR | Comp = 2 | 4.159 | 0.344
Ridge | Lambda = 0.2 | 4.308 | 0.338
Lasso | Lambda = 0.75 | 4.101 | 0.369
E-NET | Lambda1 = 0.1, Lambda2 = 0.693 | 4.073 | 0.373
fig 2.4
Nonlinear Models:
Model | Best Parameter | RMSE | R^2
N-Net | Hidden units = 2, decay lambda = 0 | 4.31 | 0.294
MARS | Terms = 11, degree = 1 | 4.153 | 0.353
SVM | Cost = 2 | 5.119 | 0.005
KNN | K = 14 | 4.673 | 0.196
fig 2.5
From the tables above (figs 2.4 and 2.5) we can see that the LASSO, E-NET and MARS
models have nearly the same, and overall the highest, R-squared values together
with the lowest RMSE values. Even though none of the models performs particularly
well, among these the LASSO model stands out as the best. Below (fig 2.6) are the
testing set results of the best models listed above.
Testing Set Results:
Model | RMSE | R^2
Lasso | 3.716 | 0.403
MARS | 3.847 | 0.366
E-NET | 3.752 | 0.392
fig 2.6
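The testing-set numbers above come from predicting on the 202 held-out rows and summarizing with caret, e.g. for the Lasso fit (as in Appendix III):

prediction_lasso <- predict(lasso_tune, x_test)
defaultSummary(data.frame(obs = y_test, pred = prediction_lasso))
#      RMSE  Rsquared       MAE
# 3.7159639 0.4033666 2.9258084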
Below are the supplementary figures (model performance plots) corresponding to
the best model, LASSO.
LASSO Model: RMSE profile, training data, testing data, residual plot
Top 20 most important variables and their importance for the Lasso model:
Variable Importance Plot (Lasso Model)
Summary:
We conclude that the best models in this analysis were MARS, E-NET and LASSO.
These three models give almost the same RMSE value. Among the three, LASSO is the
optimal model, with a test-set RMSE of 3.716. However, the predictive performance
of all models is modest, so we recommend further analysis of this data and would
not use (or trust) any of these models for reliable prediction.
Appendix I: Supplementary Figures for Runner-up Models
MARS Model: RMSE profile, testing data, training data, residual plot
E-NET Model: RMSE profile, testing data, training data, residual plot
Appendix II: Questionnaire
MUSIC PREFERENCES
1. I
enjoy listening to music.: Strongly disagree 1-2-3-4-5
Strongly agree (integer)
2. I
prefer.: Slow paced music 1-2-3-4-5 Fast paced music (integer)
3. Dance,
Disco, Funk: Don't enjoy at all 1-2-3-4-5 Enjoy very much (integer)
4. Folk
music: Don't enjoy at all 1-2-3-4-5 Enjoy very much (integer)
5. Country:
Don't enjoy at all 1-2-3-4-5 Enjoy very much (integer)
6. Classical:
Don't enjoy at all 1-2-3-4-5 Enjoy very much (integer)
7. Musicals:
Don't enjoy at all 1-2-3-4-5 Enjoy very much (integer)
8. Pop:
Don't enjoy at all 1-2-3-4-5 Enjoy very much (integer)
9. Rock:
Don't enjoy at all 1-2-3-4-5 Enjoy very much (integer)
10. Metal,
Hard rock: Don't enjoy at all 1-2-3-4-5 Enjoy very much (integer)
11. Punk:
Don't enjoy at all 1-2-3-4-5 Enjoy very much (integer)
12. Hip
hop, Rap: Don't enjoy at all 1-2-3-4-5 Enjoy very much (integer)
13. Reggae,
Ska: Don't enjoy at all 1-2-3-4-5 Enjoy very much (integer)
14. Swing,
Jazz: Don't enjoy at all 1-2-3-4-5 Enjoy very much (integer)
15. Rock
n Roll: Don't enjoy at all 1-2-3-4-5 Enjoy very much (integer)
16. Alternative
music: Don't enjoy at all 1-2-3-4-5 Enjoy very much (integer)
17. Latin:
Don't enjoy at all 1-2-3-4-5 Enjoy very much (integer)
18. Techno,
Trance: Don't enjoy at all 1-2-3-4-5 Enjoy very much (integer)
19. Opera:
Don't enjoy at all 1-2-3-4-5 Enjoy very much (integer)
MOVIE PREFERENCES
1. I
really enjoy watching movies.: Strongly disagree 1-2-3-4-5
Strongly agree (integer)
2. Horror
movies: Don't enjoy at all 1-2-3-4-5 Enjoy very much (integer)
3. Thriller
movies: Don't enjoy at all 1-2-3-4-5 Enjoy very much (integer)
4. Comedies:
Don't enjoy at all 1-2-3-4-5 Enjoy very much (integer)
5. Romantic
movies: Don't enjoy at all 1-2-3-4-5 Enjoy very much (integer)
6. Sci-fi
movies: Don't enjoy at all 1-2-3-4-5 Enjoy very much (integer)
7. War
movies: Don't enjoy at all 1-2-3-4-5 Enjoy very much (integer)
8. Tales:
Don't enjoy at all 1-2-3-4-5 Enjoy very much (integer)
9. Cartoons:
Don't enjoy at all 1-2-3-4-5 Enjoy very much (integer)
10. Documentaries:
Don't enjoy at all 1-2-3-4-5 Enjoy very much (integer)
11. Western
movies: Don't enjoy at all 1-2-3-4-5 Enjoy very much (integer)
12. Action
movies: Don't enjoy at all 1-2-3-4-5 Enjoy very much (integer)
HOBBIES & INTERESTS
1. History:
Not interested 1-2-3-4-5 Very interested (integer)
2. Psychology:
Not interested 1-2-3-4-5 Very interested (integer)
3. Politics:
Not interested 1-2-3-4-5 Very interested (integer)
4. Mathematics:
Not interested 1-2-3-4-5 Very interested (integer)
5. Physics:
Not interested 1-2-3-4-5 Very interested (integer)
6. Internet:
Not interested 1-2-3-4-5 Very interested (integer)
7. PC
Software, Hardware: Not interested 1-2-3-4-5 Very
interested (integer)
8. Economy,
Management: Not interested 1-2-3-4-5 Very interested (integer)
9. Biology:
Not interested 1-2-3-4-5 Very interested (integer)
10. Chemistry:
Not interested 1-2-3-4-5 Very interested (integer)
11. Poetry
reading: Not interested 1-2-3-4-5 Very interested (integer)
12. Geography:
Not interested 1-2-3-4-5 Very interested (integer)
13. Foreign
languages: Not interested 1-2-3-4-5 Very interested (integer)
14. Medicine:
Not interested 1-2-3-4-5 Very interested (integer)
15. Law:
Not interested 1-2-3-4-5 Very interested (integer)
16. Cars:
Not interested 1-2-3-4-5 Very interested (integer)
17. Art:
Not interested 1-2-3-4-5 Very interested (integer)
18. Religion:
Not interested 1-2-3-4-5 Very interested (integer)
19. Outdoor
activities: Not interested 1-2-3-4-5 Very interested (integer)
20. Dancing:
Not interested 1-2-3-4-5 Very interested (integer)
21. Playing
musical instruments: Not interested 1-2-3-4-5 Very
interested (integer)
22. Poetry
writing: Not interested 1-2-3-4-5 Very interested (integer)
23. Sport
and leisure activities: Not interested 1-2-3-4-5 Very
interested (integer)
24. Sport
at competitive level: Not interested 1-2-3-4-5 Very
interested (integer)
25. Gardening:
Not interested 1-2-3-4-5 Very interested (integer)
26. Celebrity
lifestyle: Not interested 1-2-3-4-5 Very interested (integer)
27. Shopping:
Not interested 1-2-3-4-5 Very interested (integer)
28. Science
and technology: Not interested 1-2-3-4-5 Very
interested (integer)
29. Theatre:
Not interested 1-2-3-4-5 Very interested (integer)
30. Socializing:
Not interested 1-2-3-4-5 Very interested (integer)
31. Adrenaline
sports: Not interested 1-2-3-4-5 Very interested (integer)
32. Pets:
Not interested 1-2-3-4-5 Very interested (integer)
PHOBIAS
1. Flying:
Not afraid at all 1-2-3-4-5 Very afraid of (integer)
2. Thunder,
lightning: Not afraid at all 1-2-3-4-5 Very afraid of (integer)
3. Darkness:
Not afraid at all 1-2-3-4-5 Very afraid of (integer)
4. Heights:
Not afraid at all 1-2-3-4-5 Very afraid of (integer)
5. Spiders:
Not afraid at all 1-2-3-4-5 Very afraid of (integer)
6. Snakes:
Not afraid at all 1-2-3-4-5 Very afraid of (integer)
7. Rats,
mice: Not afraid at all 1-2-3-4-5 Very afraid of (integer)
8. Ageing:
Not afraid at all 1-2-3-4-5 Very afraid of (integer)
9. Dangerous
dogs: Not afraid at all 1-2-3-4-5 Very afraid of (integer)
10. Public
speaking: Not afraid at all 1-2-3-4-5 Very afraid of (integer)
HEALTH HABITS
1. Smoking
habits: Never smoked - Tried smoking - Former smoker - Current
smoker (categorical)
2. Drinking:
Never - Social drinker - Drink a lot (categorical)
3. I
live a very healthy lifestyle.: Strongly disagree 1-2-3-4-5 Strongly
agree (integer)
PERSONALITY TRAITS,
VIEWS ON LIFE & OPINIONS
1. I
take notice of what goes on around me.: Strongly disagree 1-2-3-4-5
Strongly agree (integer)
2. I
try to do tasks as soon as possible and not leave them until last minute.:
Strongly disagree 1-2-3-4-5 Strongly agree (integer)
3. I
always make a list so I don't forget anything.: Strongly
disagree 1-2-3-4-5 Strongly agree (integer)
4. I
often study or work even in my spare time.: Strongly disagree 1-2-3-4-5
Strongly agree (integer)
5. I
look at things from all different angles before I go ahead.:
Strongly disagree 1-2-3-4-5 Strongly agree (integer)
6. I
believe that bad people will suffer one day and good people will be rewarded.:
Strongly disagree 1-2-3-4-5 Strongly agree (integer)
7. I
am reliable at work and always complete all tasks given to me.:
Strongly disagree 1-2-3-4-5 Strongly agree (integer)
8. I
always keep my promises.: Strongly disagree 1-2-3-4-5
Strongly agree (integer)
9. I
can fall for someone very quickly and then completely lose interest.:
Strongly disagree 1-2-3-4-5 Strongly agree (integer)
10. I
would rather have lots of friends than lots of money.:
Strongly disagree 1-2-3-4-5 Strongly agree (integer)
11. I
always try to be the funniest one.: Strongly disagree 1-2-3-4-5
Strongly agree (integer)
12. I can
be two faced sometimes.: Strongly disagree 1-2-3-4-5
Strongly agree (integer)
13. I
damaged things in the past when angry.: Strongly disagree 1-2-3-4-5
Strongly agree (integer)
14. I
take my time to make decisions.: Strongly disagree 1-2-3-4-5
Strongly agree (integer)
15. I
always try to vote in elections.: Strongly disagree 1-2-3-4-5
Strongly agree (integer)
16. I
often think about and regret the decisions I make.:
Strongly disagree 1-2-3-4-5 Strongly agree (integer)
17. I
can tell if people listen to me or not when I talk to them.:
Strongly disagree 1-2-3-4-5 Strongly agree (integer)
18. I
am a hypochondriac.: Strongly disagree 1-2-3-4-5
Strongly agree (integer)
19. I am an empathetic person.: Strongly disagree 1-2-3-4-5 Strongly agree (integer)
20. I
eat because I have to. I don't enjoy food and eat as fast as I can.:
Strongly disagree 1-2-3-4-5 Strongly agree (integer)
21. I
try to give as much as I can to other people at Christmas.:
Strongly disagree 1-2-3-4-5 Strongly agree (integer)
22. I
don't like seeing animals suffering.: Strongly disagree 1-2-3-4-5
Strongly agree (integer)
23. I
look after things I have borrowed from others.: Strongly
disagree 1-2-3-4-5 Strongly agree (integer)
24. I
feel lonely in life.: Strongly disagree 1-2-3-4-5
Strongly agree (integer)
25. I
used to cheat at school.: Strongly disagree 1-2-3-4-5
Strongly agree (integer)
26. I
worry about my health.: Strongly disagree 1-2-3-4-5
Strongly agree (integer)
27. I
wish I could change the past because of the things I have done.:
Strongly disagree 1-2-3-4-5 Strongly agree (integer)
28. I
believe in God.: Strongly disagree 1-2-3-4-5
Strongly agree (integer)
29. I
always have good dreams.: Strongly disagree 1-2-3-4-5
Strongly agree (integer)
30. I
always give to charity.: Strongly disagree 1-2-3-4-5
Strongly agree (integer)
31. I
have lots of friends.: Strongly disagree 1-2-3-4-5
Strongly agree (integer)
32. Timekeeping.:
I am often early. - I am always on time. - I am often running late.
(categorical)
33. Do
you lie to others?: Never. - Only to avoid hurting
someone. - Sometimes. - Everytime it suits me. (categorical)
34. I
am very patient.: Strongly disagree 1-2-3-4-5
Strongly agree (integer)
35. I
can quickly adapt to a new environment.: Strongly disagree 1-2-3-4-5
Strongly agree (integer)
36. My
moods change quickly.: Strongly disagree 1-2-3-4-5
Strongly agree (integer)
37. I
am well-mannered and I look after my appearance.: Strongly
disagree 1-2-3-4-5 Strongly agree (integer)
38. I
enjoy meeting new people.: Strongly disagree 1-2-3-4-5
Strongly agree (integer)
39. I
always let other people know about my achievements.:
Strongly disagree 1-2-3-4-5 Strongly agree (integer)
40. I
think carefully before answering any important letters.:
Strongly disagree 1-2-3-4-5 Strongly agree (integer)
41. I enjoy children's company.: Strongly disagree 1-2-3-4-5 Strongly agree (integer)
42. I
am not afraid to give my opinion if I feel strongly about something.:
Strongly disagree 1-2-3-4-5 Strongly agree (integer)
43. I
can get angry very easily.: Strongly disagree 1-2-3-4-5
Strongly agree (integer)
44. I
always make sure I connect with the right people.: Strongly
disagree 1-2-3-4-5 Strongly agree (integer)
45. I
have to be well prepared before public speaking.: Strongly
disagree 1-2-3-4-5 Strongly agree (integer)
46. I
will find a fault in myself if people don't like me.:
Strongly disagree 1-2-3-4-5 Strongly agree (integer)
47. I
cry when I feel down or things don't go the right way.:
Strongly disagree 1-2-3-4-5 Strongly agree (integer)
48. I
am 100% happy with my life.: Strongly disagree 1-2-3-4-5
Strongly agree (integer)
49. I
am always full of life and energy.: Strongly disagree 1-2-3-4-5 Strongly
agree (integer)
50. I
prefer big dangerous dogs to smaller, calmer dogs.:
Strongly disagree 1-2-3-4-5 Strongly agree (integer)
51. I
believe all my personality traits are positive.: Strongly
disagree 1-2-3-4-5 Strongly agree (integer)
52. If I find something that doesn't belong to me I will hand it in.: Strongly disagree 1-2-3-4-5 Strongly agree (integer)
53. I
find it very difficult to get up in the morning.: Strongly
disagree 1-2-3-4-5 Strongly agree (integer)
54. I
have many different hobbies and interests.: Strongly disagree 1-2-3-4-5
Strongly agree (integer)
55. I
always listen to my parents' advice.: Strongly disagree 1-2-3-4-5
Strongly agree (integer)
56. I
enjoy taking part in surveys.: Strongly disagree 1-2-3-4-5
Strongly agree (integer)
57. How
much time do you spend online?: No time at all - Less than an hour
a day - Few hours a day - Most of the day (categorical)
SPENDING HABITS
1. I
save all the money I can.: Strongly disagree 1-2-3-4-5
Strongly agree (integer)
2. I
enjoy going to large shopping centers.: Strongly disagree 1-2-3-4-5
Strongly agree (integer)
3. I
prefer branded clothing to non-branded.: Strongly disagree 1-2-3-4-5
Strongly agree (integer)
4. I
spend a lot of money on partying and socializing.: Strongly
disagree 1-2-3-4-5 Strongly agree (integer)
5. I
spend a lot of money on my appearance.: Strongly disagree 1-2-3-4-5
Strongly agree (integer)
6. I
spend a lot of money on gadgets.: Strongly disagree 1-2-3-4-5
Strongly agree (integer)
7. I
will happily pay more money for good, quality or healthy food.:
Strongly disagree 1-2-3-4-5 Strongly agree (integer)
DEMOGRAPHICS
1. Age:
(integer)
2. Height:
(integer)
3. Weight:
(integer)
4. How
many siblings do you have?: (integer)
5. Gender:
Female - Male (categorical)
6. I
am: Left handed - Right handed (categorical)
7. Highest
education achieved: Currently a Primary school pupil -
Primary school - Secondary school - College/Bachelor degree (categorical)
8. I
am the only child: No - Yes (categorical)
9. I
spent most of my childhood in a: City - village (categorical)
10. I
lived most of my childhood in a: house/bungalow - block of flats
(categorical)
Appendix III: R Code

Part 1

# Predictive Modeling
install.packages("Amelia")
library(Amelia)
library(VIM)
library(caret)
library(readr)

responses <- read.csv("C:/Users/Juli/Desktop/Predictive Modeling/res-data.txt",
                      na.strings = c("", "NA"))
ncol(responses)   # [1] 150
nrow(responses)   # [1] 1010
par(mar = c(10, 3, 2, 2), mfrow = c(1, 1))
missmap(responses, col = c("wheat", "darkred"))

# Recode the non-integer predictors
attach(responses)
pred_int <- c()
for(i in 1:ncol(responses)){
  pred_int[i] <- is.integer(responses[, i])
}
decode <- which(pred_int == 0)      # which predictors are not integer?
colnames(responses[, decode])       # take a look at the names

# Treat Smoking as an ordered variable
table(Smoking, useNA = "always")
Smoking <- ordered(Smoking, levels = c("never smoked", "tried smoking",
                                       "former smoker", "current smoker"))
Smoking <- as.numeric(Smoking)      # as.numeric to apply kNN

# Treat Alcohol as an ordered variable
table(Alcohol, useNA = "always")
Alcohol <- ordered(Alcohol, levels = c("never", "social drinker", "drink a lot"))
Alcohol <- as.numeric(Alcohol)

# Treat Punctuality as an ordered variable
table(Punctuality, useNA = "always")
Punctuality <- ordered(Punctuality, levels = c("i am often early", "i am always on time",
                                               "i am often running late"))
Punctuality <- as.numeric(Punctuality)

# Treat Lying as an ordered variable
table(Lying, useNA = "always")
Lying <- ordered(Lying, levels = c("never", "only to avoid hurting someone",
                                   "sometimes", "everytime it suits me"))
Lying <- as.numeric(Lying)

# Treat Internet usage as an ordered variable
table(Internet.usage, useNA = "always")
Internet.usage <- ordered(Internet.usage, levels = c("no time at all", "less than an hour a day",
                                                     "few hours a day", "most of the day"))
Internet.usage <- as.numeric(Internet.usage)

# Gender
levels(Gender) <- c("0", "1")
Gender <- as.numeric(Gender) - 1    # 0 for females

# Left...right.handed
table(Left...right.handed, useNA = "always")
levels(Left...right.handed) <- c("0", "1")
Left...right.handed <- as.numeric(Left...right.handed) - 1   # 1 for right handed

# Education
table(Education, useNA = "always")
Education <- ordered(Education, levels = c("currently a primary school pupil", "primary school",
                                           "secondary school", "college/bachelor degree",
                                           "masters degree", "doctorate degree"))
Education <- as.numeric(Education)

# Only.child
levels(Only.child) <- c("0", "1")
Only.child <- as.numeric(Only.child) - 1                          # 0 for non only child

# Village...town
levels(Village...town) <- c("0", "1")
Village...town <- as.numeric(Village...town) - 1                  # 0 for city

# House...block.of.flats
levels(House...block.of.flats) <- c("0", "1")
House...block.of.flats <- as.numeric(House...block.of.flats) - 1  # 0 for block of flats

# Replace everything in the final data set
responses$Smoking <- Smoking
responses$Alcohol <- Alcohol
responses$Punctuality <- Punctuality
responses$Lying <- Lying
responses$Internet.usage <- Internet.usage
responses$Gender <- Gender
responses$Left...right.handed <- Left...right.handed
responses$Education <- Education
responses$Only.child <- Only.child
responses$Village...town <- Village...town
responses$House...block.of.flats <- House...block.of.flats

# --> there is no pattern for the missing responses
# For now: go with k = 5 to do the imputation
imputed <- kNN(responses, imp_var = FALSE, k = 5)
imputed[, 134] <- 6 - imputed[, 134]          # invert the Finance variable
missmap(imputed, col = c("wheat", "darkred"))

# Use a sum of the variables that cover spending as the response variable
spending_imp <- imputed[, 134:140]            # check manually which columns are for expenses
response_sum <- rowSums(spending_imp)

# Some graphics
par(mgp = c(2, 1, 0), mar = c(3, 3, 2, 1))
hist(response_sum, main = "Histogram of the response variable")
boxplot(spending_imp)
predictors <- imputed[, -c(134:140)]
apply(spending_imp, 2, mean)
par(mfrow = c(4, 2), mar = c(2, 2, 2, 1), mgp = c(1, 1, 0))
for(i in 1:ncol(spending_imp)){
  barplot(table(spending_imp[, i]), main = colnames(spending_imp)[i])
}
write.csv(imputed, file = "imputed_data.txt")

nearZeroVar(predictors)
par(mfrow = c(1, 1), mar = c(0, 1, 1, 1))
library(corrplot)
corrplot(cor(predictors), type = "upper", order = "hclust", cl.pos = "n", tl.pos = "n")
install.packages("corrgram")
library(corrgram)
corrgram(predictors)

# Going back to categorical data and dummy variables
decode
decode_pred <- which(colnames(predictors) %in% colnames(imputed[, decode]))
# model.matrix(~factor(imputed[, decode[1]]))
# length(findCorrelation(cor(predictors), cutoff = 0.6))
# for(i in 1:length(decode_pred)){}
install.packages("dummies")
library(dummies)
pred_test <- predictors
pred_test[, decode_pred] <- apply(pred_test[, decode_pred], 2, as.factor)
test <- dummy.data.frame(pred_test, names = colnames(predictors[decode_pred]),
                         omit.constants = TRUE,
                         dummy.classes = getOption("dummy.classes"), all = TRUE)
dummy_pred <- test
write.csv(dummy_pred, file = "dummypred.txt")

library(ggplot2)
library(GGally)                               # ggcorr() comes from GGally
ggcorr(dummy_pred, nbreaks = 5)
cor_big <- findCorrelation(cor(dummy_pred), cutoff = 0.7)
cor_big
colnames(dummy_pred[, cor_big])
final_pred <- dummy_pred[, -cor_big]
final_pred <- final_pred[, -1]
write.csv(final_pred, file = "final_pred.txt")

################################################################################
final_pred <- read.csv("final_pred.txt", header = TRUE)
final_pred <- final_pred[, -1]                # kicked out correlations with threshold = 0.7
dummy_pred <- read.csv("dummypred.txt", header = TRUE)
dummy_pred <- dummy_pred[, -1]                # correlations still in there
imputed <- read.csv("imputed_data.txt", header = TRUE)
imputed <- imputed[, -1]                      # the whole imputed data set, no dummies
spending_imp <- imputed[, 134:140]
response_sum <- rowSums(spending_imp)

## Finally, for model building:
x <- dummy_pred;        write.csv(x, file = "x.txt")
x_nocorr <- final_pred; write.csv(x_nocorr, file = "x_nocorr.txt")
y <- response_sum;      write.csv(y, file = "y.txt")

library(corrplot)
corrplot(cor(spending_imp), order = "hclust")

Part 2

# Actual model building
library(AppliedPredictiveModeling)
library(caret)
library(lars)
library(elasticnet)
library(Amelia)
library(pls)
library(earth)

## Finally, for model building:
x <- read.csv("x.txt", header = TRUE);               x <- x[, -1]   # cutoff was 0.7
x_nocorr <- read.csv("x_nocorr.txt", header = TRUE); x_nocorr <- x_nocorr[, -1]
y <- read.csv("y.txt", header = TRUE);               y <- y[, -1]

# There are still near-zero-variance predictors; we have to remove those:
nearZeroVar(x)                                # [1] 143 155 160
colnames(x[, c(143, 155, 160)])
x <- x[, -c(143, 155, 160)]
# Even in x_nocorr there are still some near-zero-variance predictors
nearZeroVar(x_nocorr)                         # [1] 141 151 156
colnames(x_nocorr[, c(141, 151, 156)])
x_nocorr <- x_nocorr[, -c(141, 151, 156)]

set.seed(007)
index <- sample.int(length(y), 0.8 * length(y))
# index <- createDataPartition(y, p = 0.8, list = FALSE)
train <- x[index, ]
test <- x[-index, ]
train_nocorr <- x_nocorr[index, ]
test_nocorr <- x_nocorr[-index, ]
y_train <- y[index]
y_test <- y[-index]

# Which models?
# Linear: 1. ordinary least squares
#         1.5 ordinary least squares with PCA as preprocessing ("preProc" argument in train())
#         2. partial least squares, 3. PCR
#         penalized: 4. ridge, 5. lasso, 6. enet
# Non-linear: 7. neural networks, 8. MARS, 9. SVM, 10. KNN

set.seed(1)
ctrl <- trainControl(method = "cv", number = 10)

#### START WITH THE LINEAR MODELS

### Linear model (without the highly correlated predictors)
set.seed(007)
lm_tune <- train(x = train_nocorr, y = y_train,
                 method = "lm",
                 trControl = ctrl,
                 preProc = c("center", "scale"))
lm_tune

### Linear model with PCA (and therefore also center and scale) as preProc
set.seed(007)
lm_tune_PCA <- train(x = train_nocorr, y = y_train,
                     method = "lm",
                     trControl = ctrl,
                     preProc = c("center", "scale", "pca"))
# PCA here: the number of components is not a tuning parameter
lm_tune_PCA
summary(lm_tune_PCA)

### PCR: now the number of principal components is a tuning parameter
# maybe not saved, but runs fast!
set.seed(007)
pcr_tune <- train(x = train, y = y_train,
                  method = "pcr",
                  trControl = ctrl,
                  preProc = c("center", "scale"),
                  tuneLength = 15)
pcr_tune
plot(pcr_tune)

### Partial least squares
set.seed(007)
PLS_tune <- train(x = train, y = y_train,
                  method = "pls",
                  tuneLength = 20,
                  trControl = ctrl,
                  preProc = c("center", "scale"))
PLS_tune
plot(PLS_tune)

### Ridge regression
ridgeGrid <- data.frame(.lambda = seq(0, .4, length = 15))   # range for lambda; check the graph
set.seed(007)
ridge_tune <- train(x = train, y = y_train,
                    method = "ridge",
                    tuneGrid = ridgeGrid,
                    trControl = ctrl,
                    preProc = c("center", "scale"))
plot(ridge_tune)
ridge_tune

### Lasso regression
lassoGrid <- expand.grid(.lambda = 0, .fraction = seq(.05, 0.4, length = 15))
# fraction = 1 means that there is no lasso penalty
set.seed(007)
lasso_tune <- train(x = train, y = y_train,
                    method = "enet",
                    tuneGrid = lassoGrid,
                    trControl = ctrl,
                    preProc = c("center", "scale"))
plot(lasso_tune)
lasso_tune

### Elastic net
enetGrid <- expand.grid(.lambda = c(0, 0.1, 0.2, 0.3), .fraction = seq(.05, 0.5, length = 15))
set.seed(007)
enet_tune <- train(x = train, y = y_train,
                   method = "enet",
                   tuneGrid = enetGrid,
                   trControl = ctrl,
                   preProc = c("center", "scale"))
plot(enet_tune)
enet_tune

#### CONTINUE WITH THE NON-LINEAR MODELS

### Neural networks
# Here, use the data without the highly correlated predictors
nnetGrid <- expand.grid(.decay = c(0, 0.01, .1),   # lambda
                        .size = c(1:10),           # H = number of hidden units
                        .bag = FALSE)              # bagging instead of different random seeds
set.seed(007)
nnetTune <- train(x = train_nocorr, y = y_train,
                  method = "avNNet",   # if avNNet is too slow: use nnet
                  # several starting points to avoid finding only a local minimum
                  tuneGrid = nnetGrid,
                  trControl = ctrl,
                  linout = TRUE,
                  trace = FALSE,
                  MaxNWts = 10 * (ncol(train_nocorr) + 1) + 10 + 1,   # number of parameters, depends on the grid
                  maxit = 500,
                  preProc = c("center", "scale"))
plot(nnetTune)
nnetTune

### MARS
# For MARS as well: no high correlations
marsGrid <- expand.grid(.degree = 1:2, .nprune = 2:38)
mars_tune <- train(x = train_nocorr, y = y_train,
                   method = "earth",
                   tuneGrid = marsGrid,   # explicitly declare the candidate models to test
                   trControl = ctrl,
                   preProc = c("center", "scale"))
plot(mars_tune)
mars_tune

### SVM
svm_tune <- train(x = train, y = y_train,
                  method = "svmRadial",             # sigma is computed analytically
                  preProc = c("center", "scale"),   # centering / scaling improves the fit
                  tuneLength = 6,
                  trControl = ctrl)
plot(svm_tune)
svm_tune
# aa <- varImp(svm_tune)
# plot(aa, top = 25, scales = list(y = list(cex = .95)))

### KNN
set.seed(007)
knnTune <- train(x = train, y = y_train,
                 method = "knn",
                 preProc = c("center", "scale"),   # centering and scaling occur for new predictions too
                 tuneGrid = data.frame(.k = 1:20),
                 trControl = ctrl)
knnTune
plot(knnTune)

################################################################################
# Best models: enet, lasso and MARS -- all the plots:

### enet
prediction_enet <- predict(enet_tune, test)
defaultSummary(data.frame(obs = y_test, pred = prediction_enet))
#      RMSE  Rsquared       MAE
# 3.7521506 0.3920136 2.9593867
par(mfrow = c(2, 2))
plot(enet_tune)
# on the testing set
xyplot(y_test ~ prediction_enet, type = c("p", "g"),
       xlab = "Predicted", ylab = "Observed")
# on the training set
xyplot(y_train ~ predict(enet_tune), type = c("p", "g"),
       xlab = "Predicted", ylab = "Observed")
# residual plot
xyplot(resid(enet_tune) ~ predict(enet_tune), type = c("p", "g"),
       xlab = "Predicted", ylab = "Residuals")

### lasso
prediction_lasso <- predict(lasso_tune, test)
defaultSummary(data.frame(obs = y_test, pred = prediction_lasso))
#      RMSE  Rsquared       MAE
# 3.7159639 0.4033666 2.9258084
plot(lasso_tune)
# on the testing set
xyplot(y_test ~ prediction_lasso, type = c("p", "g"),
       xlab = "Predicted", ylab = "Observed")
# on the training set
xyplot(y_train ~ predict(lasso_tune), type = c("p", "g"),
       xlab = "Predicted", ylab = "Observed")
# residual plot
xyplot(resid(lasso_tune) ~ predict(lasso_tune), type = c("p", "g"),
       xlab = "Predicted", ylab = "Residuals")

### MARS
prediction_mars <- drop(predict(mars_tune, test_nocorr))
defaultSummary(data.frame(obs = y_test, pred = prediction_mars))
#      RMSE  Rsquared       MAE
# 3.8470064 0.3660746 2.9510664
plot(mars_tune)
# on the testing set
xyplot(y_test ~ prediction_mars, type = c("p", "g"),
       xlab = "Predicted", ylab = "Observed")
# on the training set
xyplot(y_train ~ predict(mars_tune), type = c("p", "g"),
       xlab = "Predicted", ylab = "Observed")
# residual plot
xyplot(resid(mars_tune) ~ predict(mars_tune), type = c("p", "g"),
       xlab = "Predicted", ylab = "Residuals")
References:
https://www.kaggle.com/miroslavsabo/young-people-survey
https://www.r-project.org/other-docs.html
https://www.rdocumentation.org/
Max Kuhn and Kjell Johnson. Applied Predictive Modeling. Springer, 2013.