The importance of data analysis in the data science process
A simple guide to the main steps of a solid data modeling process
Clearly, today there are countless resources for solving analytical problems, such as robust machine learning models and sophisticated data tools. This abundance has caused the actual data science process to gradually lose its meaning. The main objective of this role is laid out in a simple way by the CRISP-DM methodology. In this guide we will use the ‘Seattle Airbnb Open Data’ dataset from the Kaggle open repository.
Business analysis
First, it is important to understand the company's context and what its core business is. Airbnb has become the largest digital platform in the world for offering and searching for accommodation, currently listing around 2,000,000 properties in 192 countries and 33,000 cities.
Now the company expresses its needs. In this specific case, we are going to answer three important questions:
- What are the main factors that determine the price of accommodation?
- As in traditional hotels, does offering amenities impact the final price?
- Which Seattle neighborhoods are the most expensive and the cheapest to rent in?
In addition, the company asks us to develop a model into which a user can enter some data and obtain a predicted base price to use as a reference.
Data cleaning and preprocessing
For this case we have been given a dataset to work with. To get off to a good start with the data, I recommend the following steps (a code sketch follows the list):
- Examine the size of the dataset and define which columns are relevant for the analysis; drop columns that are concentrated in a single unique value. Keep it simple.
- Do some basic analysis of the data, such as displaying the unique values in each column and counting how many null values each column has; a simple bar chart of the null counts works well here.
- Clean up your data: remove characters that prevent you from processing a variable as intended, such as ‘$’ and ‘%’, then cast each column to the correct data type.
- You can create new features from complex data such as text fields, for example by measuring the length of the text, or use null data to define new columns. For this example two new columns were created: a categorical column derived from the null extra-cost fields (security deposit and cleaning fee) that tells us whether a listing has an extra charge, and a count of the additional comforts offered by each accommodation, parsed from the complex amenities column.
- If you have categorical variables, try to keep the number of categories small: analyze the distribution of the data, and if some categories are not significant you can group them into one. This will make your model less complex.
- Imputing missing data is another important step. There is no mandatory rule for how to impute; you have to interpret each value: first ask yourself why the data is missing, and second, how much of it is missing. Remember that this technique is useful but reduces the natural variation in the data. It also helps to check the histograms and descriptive statistics of each variable, as they give a good idea of whether it is better to impute with the median or the mean.
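A minimal pandas sketch of the steps above, assuming the listings.csv file from the Kaggle dataset; the column names (price, security_deposit, cleaning_fee, amenities, property_type) come from that file, and the rare-category threshold of 30 is an arbitrary choice for illustration:

```python
import pandas as pd
import numpy as np

# Load the listings file from the Seattle Airbnb Open Data
df = pd.read_csv('listings.csv')

# 1. Size, and columns concentrated in a single value
print(df.shape)
single_valued = [c for c in df.columns if df[c].nunique(dropna=False) <= 1]
df = df.drop(columns=single_valued)

# 2. Basic profiling: unique values and share of nulls per column
print(df.nunique().sort_values())
print(df.isnull().mean().sort_values(ascending=False))

# 3. Strip '$' and ',' from the money columns and cast to float
for col in ['price', 'security_deposit', 'cleaning_fee']:
    df[col] = pd.to_numeric(
        df[col].str.replace(r'[$,]', '', regex=True), errors='coerce')

# 4. New features: an extra-charge flag from the null fee columns,
#    and a rough count of amenities parsed from the raw text
df['has_extra_cost'] = (df['security_deposit'].notnull()
                        | df['cleaning_fee'].notnull()).astype(int)
df['amenities_count'] = (df['amenities'].str.strip('{}')
                                        .str.split(',').str.len())

# 5. Group rare categories into a single 'Other' level
counts = df['property_type'].value_counts()
df['property_type'] = df['property_type'].replace(
    list(counts[counts < 30].index), 'Other')

# 6. Median imputation for the remaining numeric gaps
num_cols = df.select_dtypes(include=np.number).columns
df[num_cols] = df[num_cols].fillna(df[num_cols].median())
```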
Visualizing and understanding data
This stage is very relevant: here we already have clean data and can begin the analysis. A good approach for identifying important features is to visualize the correlations among all the variables in a heat map.
This analysis is useful for ruling out variables with similar behavior that would introduce redundancy into the model. Additionally, the price column of the heat map gives us a preview of the relevant variables: the ones with the highest correlation.
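One way to draw this heat map with seaborn, continuing from the df built above:

```python
import matplotlib.pyplot as plt
import seaborn as sns

# Pairwise correlations between the numeric columns; strongly
# correlated pairs are candidates to drop as redundant
corr = df.select_dtypes(include='number').corr()
plt.figure(figsize=(12, 10))
sns.heatmap(corr, cmap='coolwarm', center=0)
plt.title('Correlation between numeric features')
plt.show()

# Preview of the variables most correlated with price
print(corr['price'].abs().sort_values(ascending=False).head(10))
```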
Now that we have reduced the number of variables, we can plot the input variables against price. The violin plot is a very useful tool for identifying relationships of this type; in the example we can see that the price increases considerably as the number of rooms grows. This kind of plot also helps us rule out further variables that show no significant changes across their values.
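For instance, a violin plot of price by number of rooms; here I assume the dataset's bedrooms column plays that role:

```python
# Price distribution per number of rooms; a clear upward shift means
# the variable matters, a flat pattern suggests it can be dropped
plt.figure(figsize=(10, 5))
sns.violinplot(x='bedrooms', y='price', data=df)
plt.show()
```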
Modeling and basic stats
Now that we’ve chosen our variables and cleaned up the information, we can model the data. In this example we are going to use linear regression. This model is really simple, but it is one of the most powerful: it not only predicts data but also generates important findings about the variables.
Next steps for modeling (sketched in code after the list):
- Transform categorical data into dummy variables.
- Normalize and standardize the variables.
- Split the data into train and test sets.
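A sketch of these three steps with pandas and scikit-learn; the feature list is an assumption based on the variables selected above:

```python
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Candidate features chosen in the visualization stage (assumed names)
features = ['neighbourhood_group_cleansed', 'bathrooms', 'bedrooms',
            'property_type', 'has_extra_cost', 'amenities_count']
X = pd.get_dummies(df[features], drop_first=True)  # dummies for categoricals
y = df['price']

# Hold out a test set before fitting anything
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42)

# Standardize so the coefficients are comparable across variables
scaler = StandardScaler()
X_train = pd.DataFrame(scaler.fit_transform(X_train),
                       columns=X.columns, index=X_train.index)
X_test = pd.DataFrame(scaler.transform(X_test),
                      columns=X.columns, index=X_test.index)
```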
After following the previous steps, we have to think about the linear regression formula, and this is quite simple because it can be set up like the equation of a straight line: the response is the sum of the variables, each multiplied by a coefficient that determines its impact on the response (price = b0 + b1·x1 + … + bn·xn). Once we fit the model, we can print its summary and start analyzing the results.
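A minimal fit with statsmodels, which prints exactly the summary discussed below:

```python
import statsmodels.api as sm

# Fit price = b0 + b1*x1 + ... + bn*xn; add_constant adds the intercept b0
X_train_sm = sm.add_constant(X_train)
model = sm.OLS(y_train, X_train_sm).fit()
print(model.summary())  # R-squared, coefficients and p-values
```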
Here are three important statistics to analyze:
- R-squared: tells us how well our model performs on the training data, i.e., how much better it does than a baseline model that always predicts the mean.
- P>|t|: the p-value for each variable; if it is greater than 0.05, that variable is not relevant for the model.
- Coef: the impact of each variable on the response variable.
Finally, we calculate the R-squared on the test data, obtaining a value of 0.53, slightly better than the 0.49 on the training set. When this happens, we can conclude that the model's performance is adequate and that it generalizes well to new data.
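The test-set R-squared can be computed like this, reusing the model and split from above:

```python
from sklearn.metrics import r2_score

# R-squared on the held-out data tells us how well the model generalizes
X_test_sm = sm.add_constant(X_test, has_constant='add')
print(r2_score(y_test, model.predict(X_test_sm)))
```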
Now we can answer our business questions:
- The most relevant variables for establishing the price of an accommodation are: neighborhood, bathrooms, rooms, property type, and extra cost.
- Amenities are not a significant feature for the price, based on their p-values in the model.
- The most expensive neighborhood is Downtown, while the cheapest is Northgate, according to the regression coefficients.
You can see the complete analysis, with the finished code, posted on GitHub here.