In 2015 Microsoft acquired Revolution Analytics, and today that influence can be seen in the rapidly expanding set of options for running and hosting R-based machine learning models. Each option appears fully integrated into an existing product or service, so at times it can be difficult to get the lay of the land and make the right choice. In this post, we provide that overview.
Azure Machine Learning provides a PaaS experience for designing, training and operationalizing machine learning models. It can run R scripts (and supports pulling in dependencies). The key benefit is its speed of operationalizing the model. If you've done any data science for production, you know that taking your model from single-user development mode to integration into production systems for high-scale consumption (i.e., operationalizing the model) is not a trivial undertaking. In fact, it is common to hear that this adds 6 to 18 months to the development cycle of new models. In Azure ML, it takes a few clicks to expose the model as a RESTful web service that can scale to serve requests, so you operationalize in minutes. While speed of operationalizing the model is a huge benefit, today Azure ML does not scale out for training (and this is often a point of confusion). When you are training your model, it runs in the context of a single A8 virtual machine, and this surfaces as limits (such as the 10GB max dataset size). There is no parallel compute for training, so for larger datasets or situations that require distributed compute, the other options described in the rest of this post are a better fit.
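To make the "RESTful web service" part concrete, here is a minimal Python sketch of how a client might call a model published from Azure ML. It is a sketch under stated assumptions, not the definitive client: the scoring URL and API key are placeholders, the helper names `build_scoring_request` and `score` are ours, and the input name `input1` and the JSON layout assume the request-response format that Azure ML web services have typically used. The sample code shown on your own web service's page is the authoritative version.

```python
import json
import urllib.request

# Hypothetical placeholders: copy the real values from your web service's page.
SCORING_URL = "https://example.invalid/execute?api-version=2.0"
API_KEY = "<your-api-key>"


def build_scoring_request(url, api_key, column_names, rows):
    """Build (but do not send) a POST request for a single scoring call.

    Assumes a request-response payload with one input port named "input1";
    your service may use a different input name or schema.
    """
    payload = {
        "Inputs": {
            "input1": {
                "ColumnNames": list(column_names),
                "Values": [list(row) for row in rows],
            }
        },
        "GlobalParameters": {},
    }
    body = json.dumps(payload).encode("utf-8")
    headers = {
        "Content-Type": "application/json",
        # The API key is passed as a bearer token.
        "Authorization": "Bearer " + api_key,
    }
    return urllib.request.Request(url, data=body, headers=headers, method="POST")


def score(request):
    """Send the request and return the decoded JSON response."""
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read().decode("utf-8"))
```

With a real URL and key, usage would look like `score(build_scoring_request(SCORING_URL, API_KEY, ["age", "income"], [["42", "55000"]]))`, where the column names are made up for illustration. Nothing in the client refers to R: once the model is published, consumers only see an HTTP endpoint.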
As announced during Strata + Hadoop World, you can now run your R scripts using Spark on Hadoop via the HDInsight service. The benefit here is support for algorithms that can parallelize across the Hadoop cluster. This enables you to scale both the working set size of your data sets and the computational power available for training and improving your model. To use R this way, select the R Server on Spark option when deploying a new HDInsight cluster. This cluster type comes at a small additional cost as part of the new Premium tier for HDInsight. Note that Microsoft also recently changed its support for Spark clusters to Linux only (instead of also supporting Windows); this applies to the R Server on HDInsight option as well.
If your team is already comfortable with a SQL Server environment, then SQL Server 2016 brings an alternate solution for achieving parallel compute via SQL Server R Services. This integration enables two important tasks: running R scripts within the context of a T-SQL query, and taking advantage of R libraries optimized for distributed processing across the cluster. This cluster can run on-premises or within Azure Virtual Machines.
Microsoft R Server is a standalone solution that supports parallel compute on Linux, Hadoop and Teradata clusters. It is a part of the SQL Server installer, but you can install and benefit from it without having a SQL Server at all. As an aside, Microsoft R Server is what is packaged into SQL Server R Services for when you do have a SQL Server cluster.