Report: Housing Price Prediction Model for D. M. Pan National Real Estate Company
Shanon Beck
MAT-240 Applied Statistics
September 21st, 2023
Introduction
The purpose of this report is to determine whether a model can be developed to predict 2019 home listing prices from square footage. If a strong linear relationship is found, we can use a regression equation to make reasonable predictions. To do this, we will use square footage as our x-variable, or predictor variable, because it is a fixed, measurable property of each house. We will use the listing price as our y-variable, or response variable, because it is the value we want to predict. If a linear model is appropriate for this task, we would expect to see a positive linear correlation between square footage and listing price; if it is not appropriate, we would expect to see data points scattered across the plot with no obvious trend.
Data Collection
The sample was obtained by randomly ordering the thousand houses in the national dataset and then selecting fifty of them. To do this, I used the =RAND() function in Excel to assign each house a random number between 0 and 1, sorted the data from highest to lowest on that number, and selected the first fifty houses in the resulting order. My predictor variable (x) is the square footage of each house, and my response variable (y) is the listing price. This choice reflects the expectation that listing price depends on square footage, not the other way around.
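The sampling procedure described above can be sketched in Python. This is only an illustration of the same idea, not the spreadsheet that was actually used: the function name `sample_houses`, the seed value, and the placeholder house labels are all assumptions made for the example.

```python
import random


def sample_houses(houses, k=50, seed=None):
    """Mimic the spreadsheet method: give every house a random number
    between 0 and 1, sort highest-to-lowest, and keep the first k."""
    rng = random.Random(seed)
    # Pair each random key with the row index (the =RAND() column).
    keyed = [(rng.random(), index) for index in range(len(houses))]
    # Sort on the random key, highest first (the Excel sort step).
    keyed.sort(reverse=True)
    # Keep the houses that landed in the first k positions.
    return [houses[index] for _, index in keyed[:k]]


# Placeholder records standing in for the thousand houses (hypothetical labels).
national = [f"house_{i}" for i in range(1000)]
sample = sample_houses(national, k=50, seed=240)
print(len(sample))  # 50
```

Because every house receives an independent random key, each house has the same chance of landing in the first fifty rows, which is what makes this a simple random sample.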
Data Analysis
Histogram 1 (square feet) is skewed to the right because of three outliers with much larger square footage than the average house. Histogram 2 (listing price) is also slightly skewed to the right because the same outliers have listing prices well above the sample average of $342,363. The potential outliers are as follows: Leon, Florida, with a square footage of 4,213 and a listing price of $619,200; Rock, Wisconsin, with a square footage of 4,950 and a listing price of $513,100; and Plymouth, Maine, with a square footage of 6,516 and a listing price of $814,300. When we compare our data to the national dataset, we find remarkably similar results: the average of our listing prices is $342,363 compared to the national average of $342,365, a difference of only $2. Likewise, the average of our square footage is 2,050 compared to the national average of 2,111. Because of this, I believe our dataset is representative of the national population. The national data does differ in that it is much more spread out than our data, since it contains a larger number of much-larger-than-average homes, although both our dataset and the national dataset follow similar trendlines.
Develop Regression Model
Our scatterplot has a linear form, with a strong positive correlation and a correlation coefficient of 0.788062. When we remove one of the three possible outliers, for example Plymouth, Maine, we see a noticeable difference in our R² value. In the case of Plymouth, we see a significant decrease in R². When we retain the outliers, we still see a positive upward trend in our dataset, following the linear relationship we expected. From this we can see that as square footage increases, listing price tends to increase.
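The outlier comparison can be reproduced with a short Python sketch that computes the correlation coefficient and then recomputes R² with one point left out. The helper names `pearson_r` and `r_squared_without` are hypothetical, and the numbers in the usage lines are made-up toy values, not the report's housing data.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def r_squared_without(xs, ys, drop_index):
    """R-squared of the simple linear fit after removing one data point."""
    kept_x = xs[:drop_index] + xs[drop_index + 1:]
    kept_y = ys[:drop_index] + ys[drop_index + 1:]
    return pearson_r(kept_x, kept_y) ** 2


# Toy values for illustration only (NOT the report's data);
# the last point plays the role of a large outlier.
toy_x = [1, 2, 3, 4, 10]
toy_y = [2, 3, 5, 4, 12]
print("R-squared with all points:", pearson_r(toy_x, toy_y) ** 2)
print("R-squared without last point:", r_squared_without(toy_x, toy_y, 4))
```

For a simple linear regression with one predictor, R² is the square of the correlation coefficient, which is why the sketch squares the output of `pearson_r`.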
Determine the Line of Best Fit
Our regression equation is y = 95.886x + 141965, with 95.886 being the slope of our line and 141965 being the intercept. The slope represents how much the listing price increases, in dollars, for each additional square foot. The intercept represents the expected listing price of a house with 0 square feet, which has no practical meaning because no house in our dataset is anywhere near that size. The R² value of our dataset is 0.621, which means that about 62.1% of the variance in the dependent variable (listing price) is explained by the independent variable (square feet). Additionally, if we were to make a prediction with our regression equation, the expected listing price of a 1,500-square-foot house would be $285,794.
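As a quick check of the arithmetic, the regression equation can be written as a small Python function. The slope, intercept, and correlation coefficient are taken from this report; the names `predict_price`, `SLOPE`, `INTERCEPT`, and `R` are chosen only for the example.

```python
SLOPE = 95.886        # dollars per additional square foot (from the report)
INTERCEPT = 141965    # dollars, the y-intercept of the fitted line (from the report)
R = 0.788062          # correlation coefficient reported for the sample


def predict_price(square_feet):
    """Predicted listing price in dollars from y = 95.886x + 141965."""
    return SLOPE * square_feet + INTERCEPT


# 95.886 * 1500 = 143,829, plus the intercept 141,965 gives 285,794.
print(round(predict_price(1500)))  # 285794

# Squaring the correlation coefficient gives the reported R-squared.
print(round(R ** 2, 3))  # 0.621
```

Predictions of this kind are only reasonable within the range of square footage seen in the sample; the line should not be extended far beyond the largest or smallest houses observed.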
Conclusion
To conclude this study, our regression equation y = 95.886x + 141965 can be used to effectively predict the listing price of a house based on its square footage. Given our dataset of fifty houses, we identified a strong positive linear relationship. The slope of our equation (95.886) tells us that for every 1-square-foot increase, we can expect the listing price to increase by about $95.89; this is the result I expected to see. To develop a model that predicts the listing price of a given property more effectively, we would want to use a much larger dataset, or use a stratified sampling method and develop separate prediction models for each state or region. Additionally, for any follow-up research I would like to pose this question: “Would any other factor better predict listing price?” Examples include the local school district, nearby amenities, or population density.