# Convert R code for execution in SQL Server (In-Database) instances
[!INCLUDE [SQL Server 2016 and later](../../includes/applies-to-version/sqlserver2016.md)]
However, your code might require substantial changes if any of the following apply:
+ The code makes separate calls to data sources outside SQL Server, such as Excel worksheets, files on shares, and other databases.
+ You want to run the code in the *\@script* parameter of [sp_execute_external_script](../../relational-databases/system-stored-procedures/sp-execute-external-script-transact-sql.md) and also parameterize the stored procedure.
+ Your original solution includes multiple steps that might be more efficient in a production environment if executed independently, such as data preparation or feature engineering vs. model training, scoring, or reporting.
+ You want to optimize performance by changing libraries, using parallel execution, or offloading some processing to SQL Server.
## Step 1. Plan requirements and resources
### Packages
+ Determine which packages are needed and ensure that they work on SQL Server.
+ Install packages in advance, in the default package library used by Machine Learning Services. User libraries are not supported.
### Data sources
+ If you intend to embed your R code in [sp_execute_external_script](../../relational-databases/system-stored-procedures/sp-execute-external-script-transact-sql.md), identify primary and secondary data sources.
**Primary** data sources are large datasets, such as model training data, or input data for predictions. Plan to map your largest dataset to the input parameter of [sp_execute_external_script](../../relational-databases/system-stored-procedures/sp-execute-external-script-transact-sql.md).
**Secondary** data sources are typically smaller data sets, such as lists of factors, or additional grouping variables.

Currently, sp_execute_external_script supports only a single dataset as input to the stored procedure. However, you can add multiple scalar or binary inputs.
Stored procedure calls preceded by EXECUTE cannot be used as an input to [sp_execute_external_script](../../relational-databases/system-stored-procedures/sp-execute-external-script-transact-sql.md). You can use queries, views, or any other valid SELECT statement.
+ Determine the outputs you need. If you run R code using sp_execute_external_script, the stored procedure can output just one data frame as a result. However, you can also output multiple scalar outputs, including plots and models in binary format, as well as other scalar values derived from R code or SQL parameters.
### Data types
+ Make a checklist of possible data type issues.
All R data types are supported by SQL Server Machine Learning Services. However, [!INCLUDE[ssNoVersion](../../includes/ssnoversion-md.md)] supports a greater variety of data types than R does. Therefore, some implicit data type conversions are performed when sending [!INCLUDE[ssNoVersion](../../includes/ssnoversion-md.md)] data to R, and vice versa. You might need to explicitly cast or convert some data.
NULL values are supported. However, R uses the `NA` construct to represent a missing value, which is similar to a null.
+ Consider eliminating dependency on data that cannot be used by R: for example, rowid and GUID data types from SQL Server cannot be consumed by R and generate errors.
For more information, see [R Libraries and Data Types](../r/r-libraries-and-data-types.md).
## Step 2. Convert or repackage code
+ When running R in a stored procedure, you can pass through multiple **scalar** inputs. For any parameters that you want to use in the output, add the **OUTPUT** keyword.
For example, the following scalar input `@model_name` contains the model name, which is also output in its own column in the results:
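A minimal sketch of this pattern might look like the following; the model name, score value, and column names are illustrative, not part of the original example:

```sql
-- Hypothetical sketch: pass @model_name in as a scalar input, emit it as a
-- column of the result set, and return an updated value through OUTPUT.
DECLARE @model_name nvarchar(100) = N'default model';

EXEC sp_execute_external_script
    @language = N'R',
    @script = N'
        # model_name is mapped by name from the @model_name SQL parameter.
        OutputDataSet <- data.frame(model_name = model_name, score = 0.5);
        # A new value assigned here is returned to T-SQL via OUTPUT.
        model_name <- paste(model_name, "(scored)");
    ',
    @params = N'@model_name nvarchar(100) OUTPUT',
    @model_name = @model_name OUTPUT
WITH RESULT SETS ((model_name nvarchar(100), score float));

SELECT @model_name AS updated_model_name;
```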
+ Any variables that you pass in as parameters of the stored procedure [sp_execute_external_script](../../relational-databases/system-stored-procedures/sp-execute-external-script-transact-sql.md) must be mapped to variables in the R code. By default, variables are mapped by name.
All columns in the input dataset must also be mapped to variables in the R script. For example, assume your R script contains a formula like this one:
```R
formula <- ArrDelay ~ CRSDepTime + DayOfWeek + CRSDepHour:DayOfWeek
```

An error is raised if the input dataset does not contain columns with the matching names ArrDelay, CRSDepTime, DayOfWeek, and CRSDepHour.
+ In some cases, an output schema must be defined in advance for the results.
For example, to insert the data into a table, you must use the **WITH RESULT SETS** clause to specify the schema.
The output schema is also required if the R script uses the argument `@parallel=1`. The reason is that multiple processes might be created by SQL Server to run the query in parallel, with the results collected at the end. Therefore, the output schema must be prepared before the parallel processes can be created.

In other cases, you can omit the result schema by using the option **WITH RESULT SETS UNDEFINED**. This statement returns the dataset from the R script without naming the columns or specifying the SQL data types.
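As a sketch (the source table and columns are assumptions), a parallel pass-through call with the required explicit schema might look like:

```sql
-- Hypothetical sketch: @parallel = 1 requires the output schema up front,
-- because SQL Server may split the input query across several R processes
-- and combine their results at the end.
EXEC sp_execute_external_script
    @language = N'R',
    @script = N'OutputDataSet <- InputDataSet  # placeholder per-partition work',
    @input_data_1 = N'SELECT CustomerID, Amount FROM dbo.Sales',  -- assumed table
    @parallel = 1
WITH RESULT SETS ((CustomerID int, Amount money));
```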
+ Consider generating timing or tracking data using T-SQL rather than R.
For example, you could pass the system time or other information used for auditing and storage by adding a T-SQL call that is passed through to the results, rather than generating similar data in the R script.
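A sketch of this idea (the returned values are placeholders) captures the timestamp in T-SQL and keeps it alongside the R results, instead of calling `Sys.time()` inside the script:

```sql
-- Hypothetical sketch: generate audit data in T-SQL, not in R.
DECLARE @run_started datetime2 = SYSDATETIME();

EXEC sp_execute_external_script
    @language = N'R',
    @script = N'OutputDataSet <- data.frame(result = c(1.0, 2.0))'
WITH RESULT SETS ((result float));

-- The tracking value is produced entirely on the T-SQL side.
SELECT @run_started AS run_started;
```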
+ Avoid writing predictions or intermediate results to file. Write predictions to a table instead, to avoid data movement.
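One way to follow this guidance (a sketch; the temp table and prediction values are illustrative) is to capture the procedure's result set directly into a table with `INSERT ... EXEC`:

```sql
-- Hypothetical sketch: land predictions in a table so nothing is written
-- to the file system or pulled across the network to the client.
CREATE TABLE #Predictions (CustomerID int, PredictedValue float);

INSERT INTO #Predictions (CustomerID, PredictedValue)
EXEC sp_execute_external_script
    @language = N'R',
    @script = N'OutputDataSet <- data.frame(CustomerID = 1L, PredictedValue = 0.5)';
```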
::: moniker-end
+ Run all queries in advance, and review the SQL Server query plans to identify tasks that can be performed in parallel.
If the input query can be parallelized, set `@parallel=1` as part of your arguments to [sp_execute_external_script](../../relational-databases/system-stored-procedures/sp-execute-external-script-transact-sql.md).
Parallel processing with this flag is typically possible any time that SQL Server can work with partitioned tables or distribute a query among multiple processes and aggregate the results at the end. Parallel processing with this flag is typically not possible if you are training models using algorithms that require all data to be read, or if you need to create aggregates.
+ Review your R code to determine if there are steps that can be performed independently, or performed more efficiently, by using a separate stored procedure call. For example, you might get better performance by doing feature engineering or feature extraction separately, and saving the values to a table.
+ Look for ways to use T-SQL rather than R code for set-based computations.
For example, this R solution shows how user-defined T-SQL functions and R can perform the same feature engineering task: [Data Science End-to-End Walkthrough](../tutorials/walkthrough-data-science-end-to-end-walkthrough.md).
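As an illustration (the table name is assumed; the column names follow the formula shown earlier), a per-group aggregate that might be computed row-by-row in R is a single set-based statement in T-SQL:

```sql
-- Hypothetical sketch: let the database engine do the set-based work.
SELECT DayOfWeek, AVG(ArrDelay) AS AvgArrDelay
FROM dbo.Flights          -- assumed source table
GROUP BY DayOfWeek;
```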
::: moniker-end
+ If possible, replace conventional R functions with **ScaleR** functions that support distributed execution. For more information, see [Comparison of Base R and ScaleR Functions](https://docs.microsoft.com/machine-learning-server/r-reference/revoscaler/revoscaler-compared-to-base-r).
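For example (a sketch, reusing the formula columns shown earlier; `flightData` is an assumed data source), a base R model fit can often be swapped for its ScaleR counterpart, which processes data in chunks and can run in the SQL Server compute context:

```R
# Base R: single-process, entirely in memory.
fit <- lm(ArrDelay ~ CRSDepTime + DayOfWeek, data = flightData)

# ScaleR equivalent: chunked, distributable across compute contexts.
fit <- rxLinMod(ArrDelay ~ CRSDepTime + DayOfWeek, data = flightData)
```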
+ Consult with a database developer to determine ways to improve performance by using SQL Server features such as [memory-optimized tables](https://docs.microsoft.com/sql/relational-databases/in-memory-oltp/introduction-to-memory-optimized-tables) or, if you have Enterprise Edition, [Resource Governor](https://docs.microsoft.com/sql/relational-databases/resource-governor/resource-governor).
## Step 3. Prepare for deployment
+ Notify the administrator so that packages can be installed and tested in advance of deploying your code.
In a development environment, it might be okay to install packages as part of your code, but this is a bad practice in a production environment.
User libraries are not supported, regardless of whether you are using a stored procedure or running R code in the SQL Server compute context.
### Package your R code in a stored procedure
+ If your code is relatively simple, you can embed it in a T-SQL user-defined function without modification, as described in this sample:
[Feature engineering using T-SQL and R](../tutorials/r-taxi-classification-create-features.md)
+ If the code is more complex, use the R package **sqlrutils** to convert your code. This package is designed to help experienced R users write good stored procedure code.
The first step is to rewrite your R code as a single function with clearly defined inputs and outputs.
Then, use the **sqlrutils** package to generate the input and outputs in the correct format. The **sqlrutils** package generates the complete stored procedure code for you, and can also register the stored procedure in the database.
For more information and examples, see [sqlrutils (SQL)](ref-r-sqlrutils.md).
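A rough sketch of the workflow might look like the following; the function, query, and connection string are all assumptions for illustration, so check the exact argument names against the **sqlrutils** reference:

```R
library(sqlrutils)

# 1. One function with clearly defined inputs (a data frame) and outputs.
#    scoreBatch and its formula are hypothetical.
scoreBatch <- function(inData) {
    inData$score <- predict(lm(ArrDelay ~ CRSDepTime, data = inData))
    return(list(scored = inData))
}

# 2. Describe the dataset in and out, then generate and register the procedure.
indata  <- InputData("inData",
                     defaultQuery = "SELECT ArrDelay, CRSDepTime FROM dbo.Flights")
outdata <- OutputData("scored")
sp <- StoredProcedure(scoreBatch, "spScoreBatch", indata, outdata,
                      connectionString = connStr)  # connStr defined elsewhere
registerStoredProcedure(sp)
```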
### Integrate with other workflows
+ Leverage T-SQL tools and ETL processes. Perform feature engineering, feature extraction, and data cleansing in advance as part of data workflows.
When you are working in a dedicated R development environment such as [!INCLUDE[rsql_rtvs_md](../../includes/rsql-rtvs-md.md)] or RStudio, you might pull data to your computer, analyze the data iteratively, and then write out or display the results.

However, when standalone R code is migrated to SQL Server, much of this process can be simplified or delegated to other SQL Server tools.
+ Use secure, asynchronous visualization strategies.
Users of SQL Server often cannot access files on the server, and SQL client tools typically do not support the R graphics device. If you generate plots or other graphics as part of the solution, consider exporting the plots as binary data and saving them to a table.
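One approach (a sketch; the table name and placeholder plot are assumptions) is to render the plot to a temporary file inside the script, read it back as raw bytes, and return it as a varbinary column:

```sql
-- Hypothetical sketch: serialize an R plot to binary and store it in a table.
CREATE TABLE #Plots (plot varbinary(max));

INSERT INTO #Plots (plot)
EXEC sp_execute_external_script
    @language = N'R',
    @script = N'
        image_file <- tempfile();
        jpeg(filename = image_file);
        plot(1:10);   # placeholder plot
        dev.off();
        # Read the rendered image back as raw bytes for the result set.
        OutputDataSet <- data.frame(
            plot = readBin(file(image_file, "rb"), what = raw(), n = 1e6));
    ';
```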
+ Wrap prediction and scoring functions in stored procedures for direct access by applications.
## Next steps
To view examples of how an R solution can be deployed in SQL Server, see these samples:
+ [Tutorial: Develop a predictive model in R with SQL machine learning](../tutorials/r-predictive-model-introduction.md)

+ [R tutorial: Predict NYC taxi fares with binary classification](../tutorials/r-taxi-classification-introduction.md)