When a project calls for machine learning and complex statistics, two popular Python packages are available: Scikit-learn and Statsmodels.
- Scikit-learn's development began in 2007, and it was first released in 2010. The current version, 0.19, came out in July 2017. Statsmodels started in 2009, and its latest version, 0.8.0, was released in February 2017.
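- The main difference between the two is how they handle constants (intercepts). Scikit-learn lets the user choose whether to include a constant through a parameter (`fit_intercept`), while Statsmodels' OLS class does not add one on its own; instead, Statsmodels provides a function (`add_constant`) that adds a constant column to a given array.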
- Scikit-learn is most commonly used in machine learning and data science. Statsmodels is used in econometrics, generalised linear models, time series analysis, and regression modelling.
- Scikit-learn is simple and easy to learn: it only requires the data to be organized in the right way before we can run whatever classification, regression, or clustering algorithm we need. Statsmodels has a narrower range of options, focusing on the statistical and econometric tools found in statistics software such as Stata and R. Its syntax is similar to R's, so Statsmodels is a good choice for those transitioning to Python.
- Scikit-learn is designed primarily for machine learning, while Statsmodels is made for rigorous statistics.
- Scikit-learn is faster than Statsmodels on datasets of 1,000 or fewer observations, and significantly faster on datasets with more than 1,000 observations.
- Scikit-learn provides more models for regularization, while Statsmodels helps correct for broken OLS assumptions.
- The biggest disadvantage of Statsmodels is that it is a relatively new package, and the quantity and quality of its documentation are poor. Scikit-learn, by contrast, has well-written documentation. It uses a simple modular approach for all its functions, so it is easy to learn and becomes intuitive once you know the basics.
- Statsmodels and Scikit-learn are designed to meet different needs. Statsmodels has a stronger emphasis on statistical inference and statistical hypothesis testing whereas Scikit-learn is well-suited for projects where the prediction of unobserved values is key.