Using R with databases

by user

on 15 сентября 2016

Category: Documents

>> Downloads: 4

views

Report

Comments

Description

Download Using R with databases

Transcript

Using R with databases

Using R with databases
A speedy combination: DB2 10.5 with dashDB and R
Grant Hutchison ([email protected])
Computer Science Teacher and IT Consultant
IBM
February 06, 2014
R is not just the 18th letter of the English language alphabet, it is a very powerful open source
programming language that excels at data analysis and graphics. This article explains how to
use the power of R with data that's housed in relational database servers. Learn how to use R
to access data stored in DB2 with dashDB and dashDB for Cloud environments, formerly IBM
BLU Acceleration. Detailed examples show how R can help you explore data and perform data
analysis tasks.
Introduction
R is an open source programming language that is excellent for data analysis and graphics. The
language was initially created by Ross Ihaka and Robert Gentleman at the University of Auckland
in 1993. They wanted to design a language that would help them teach introductory statistics to
their students. The design of R was influenced by an existing language called S, from Bell Labs,
which was created in the 1970s. R is considered a domain specific language because it was
designed primarily for data analysis—and it does this very well. You can use R as an interactive
environment, or embed scripts and models into packages and integrate them with other software
modules.
You can use R to analyze data from many different data sources including external files or
databases. This article explains how to connect, query, and update data with relational database
servers. You will learn why DB2 v10.5 with dashDB technology is an excellent choice for analytics.
The article also explores how dashDB for Cloud simplifies loading and analyzing datasets using R.
Frequently used acronyms
•
•
•
•
•
•
© Copyright IBM Corporation 2014
Using R with databases
CRAN: Comprehensive R Archive Network
CSV: Comma-separated values
DSN: Data source name
ODBC: Open Database Connectivity
RODBC: Open Database Connectivity for R
SIMD: Single instruction multiple data
Trademarks
Page 1 of 15
developerWorks®
ibm.com/developerWorks/
• SQL: Structured Query Language
External data
R objects are created and managed within a single memory area. In most cases, your data
analysis tasks require the data to be available as a data frame. A data frame can be considered
a two-dimensional array of heterogeneous data or an in-memory table. If the data already exists
in a delimited text file, R users can bring the data into memory for analysis using one of the many
read.xxx() functions such as read.csv() for CSV files. Similarly, if an R data frame is to be
externalized to a file, you can use the write.xxx() functions.
R has the ability to persist data frames to disk as objects outside of its memory using the R
functions saveRDS(), save(), or save.image(). These objects can then be read back into memory
for later analysis. But, the persistence mechanism has many drawbacks, including limited sizes
for data frames. There are also problems associated with using text files for data analysis; it takes
significant effort to cleanse the data and share the dataset with others.
If your data for analysis is generated through existing operational systems backed by relational
databases, why not simply extract the data from the operational system and populate an analytic
database for analysis with R?
Why use a relational database with R?
You can use R to explore data and build predictive models. Relational database servers are
designed to handle large amounts of data and they will maintain data consistency for concurrent
users. Because data is usually stored in a normalized fashion in relational databases, you will
likely need to recall some of your SQL skills to join the relevant attributes across multiple tables
to perform your exploratory data analysis (EDA) tasks. If you're working alongside a Database
Administrator (DBA) or data analyst with strong relational database skills, you could create some
read-only views that would speed up the initial data analysis tasks. If you're working with data
in DB2, you can use the IBM Data Studio tool or the web console within dashDB for Cloud to
examine the database schema or define new views to simplify data access from your R scripts.
By default, R will read all of the data into memory before performing any analysis. When queries
are processed by relational database servers such as DB2, the system will not simply load all of
the data from disk into memory. Database servers will consider the entire query and determine the
most efficient method of obtaining the results as requested.
DB2 with dashDB
DB2 version 10.5 with dashDB is optimized for analytics and therefore an excellent choice for
working with R data analysis tasks. There are many reasons to use DB2 with dashDB, including:
• Simplicity
• Load the data and perform data analysis. BLU is optimized for analytics so you don't
need to create indexes or perform any tuning tasks.
• Performance
• The columnar storage model used by BLU reduces the number of disk reads and
minimizes the amount of memory used to support very fast query processing.
Using R with databases
Page 2 of 15
ibm.com/developerWorks/
developerWorks®
• Advanced data compression techniques are used by default to reduce memory and disk
usage.
Unlike R, DB2 with dashDB will examine all of the hardware capabilities of its server and optimize
the use of resources. For example, DB2 will use SIMD CPU instructions across multiple cores
whenever possible. It will also use various memory caching mechanisms to reduce the number
of times data is read from disk. The optimization occurs automatically. This platform is now also
available in the cloud with BLU Acceleration for Cloud.
dashDB for Cloud
dashDB for Cloud, a web-based database server offering, is optimized for simplicity. In a few
minutes you can create tables, load data, and start your analysis. Tools are provided (IBM Data
Studio and IBM InfoSphere Data Architect) to simplify creating and maintaining database models
and objects (such as tables). After the schema has been created, you can use the dashDB for
Cloud web console to load your data. There are many options for loading the data, such as local
files, cloud storage services (for example, Amazon S3), or IBM InfoSphere DataStage. The web
console can be used to performed your analysis work using Excel, SQL, Cognos Business Insight
(BI)" , or R scripts and R models.
Figure 1 shows two different scenarios of how an R user can work with data stored in dashDB for
Cloud. The data analyst can use their favourite R environment, such as RStudio, to query and
analyze the data from their own computer. Or, the analyst can use a browser to create and execute
R scripts on the cloud-based server.
Explore data using R (remote)
The chart in Figure 2 is an example of the R environment integrated into dashDB for Cloud.
dashDB for Cloud provides an R runtime alongside the data warehouse within a cloud
environment. BLU Acceleration for Cloud provides a web console where you can load data and
perform analytics within minutes. The data analysis could include SQL, BI tools, or R scripts and
models. An easy way to use R script editor is provided, but you can also launch RStudio from
within your browser. Figure 2 shows an example of the integration including a plot of an analysis of
U.S. Census data. You can create new scripts by clicking on the plus (+) sign highlighted in the top
left of Figure 2. The R scripts are managed in the cloud within your own workspace. You can even
import scripts to work with them in the cloud.
Using R with databases
Page 3 of 15
developerWorks®
ibm.com/developerWorks/
Explore Data using R in the Cloud
dashDB for Cloud - Trial Plan
The best way to obtain experience using R with databases is to simply try the dashDB for Cloud
Trial Plan (open beta). Go to the dashDB for Cloud website to get started. Click the Try dashDB
Now button and you will be directed to the Plans and Pricing page. The Solo Plan involves the
provisioning of a BLU analytics environment in your choice of cloud provider (IBM SoftLayer
or Amazon Web Services), but with the Trial plan you will have instant access to your own
environment for 5 hours.
Figure 3 shows the dashDB widget for the Trail plan. By clicking on the start button your system
will be provisioned and when your server is ready click Start dashDB. You are now able to log
into the web console using the userid bluadmin and the password provided. To test out the R
integration you can go to the Developing R Scripts area and create a test script and use the
code shown in Listing 1 below. Since you have provisioned a new environment you will likely be
prompted to install R and RStudio, but this task only takes 1-2 minutes. You will notice in Listing 1
that a package called bluR is loaded in the script. This package is currently only available within
dashDB for Cloud and it uses RODBC as its data access interface.
Using R with databases
Page 4 of 15
ibm.com/developerWorks/
developerWorks®
dashDB for Cloud - Trial Plan
Data types and design considerations
Data within R and data stored in relational databases both have a defined type or structure. It is
important to understand the various data types and how to map values properly.
R has a very limited set of base data types including: characters, numeric (integers and doubleprecision floating point values), logical, and raw data (binary). Character vectors in R are simply
mapped to SQL CHARACTER or VARCHAR data types. Numeric data types from the database server
are usually mapped to INTEGER and DOUBLE data types in R. Because the logical data type in R
does not have an obvious SQL data type equivalent, it is usually mapped to a text field in database
tables. If you're working with time-series data you will need to carefully consider how the data will
be mapped between R and the database server.
Continuous and categorical data are quite different entities when statistical methods are used. R
users are aware of these differences and they will encode the categorical data using the factor
datatype. After data has been retrieved from a database, you should consider if the data should be
treated as continuous or categorical. If the data is truly categorical, the variable should be recoded
or coerced using the factor() function.
Table 1 summarizes the terminology differences between R and Relational Database Management
Systems (RDBMS).
Terminology comparison (R and RDBMS)
R
SQL/RDBMS
data frame
table (relation)
observation
row (tuple)
variable
column (attribute)
various ( [], subset(), order(), sort() )
SELECT statements
Using R with databases
Page 5 of 15
developerWorks®
ibm.com/developerWorks/
Connectivity options
IBM Netezza and R
IBM Netezza Analytics is an embedded, purpose-built, advanced analytics platform. R can
be used in many different ways with Netezza appliances including in-database analytics and
client-side analytics. Discover how R can be used within a Netezza environment.
The R language does not include built-in relational database access capability, but there are many
data access interfaces available from vendors or through the Comprehensive R Archive Network
(CRAN).
You will learn how to use either RJDBC or RODBC to access data stored in DB2 database servers
from R. If you plan to access other database servers, you could use these same interfaces or you
may wish to explore database-specific alternative data access packages.
Connecting with RJDBC
The RJDBC package is based on the database interface (DBI) established in the R community.
The DBI package contains virtual classes; it is the responsibility of the underlying driver to
implement the classes. RJDBC uses a combination of a JDBC compliant database driver and Java
Runtime Environment (JRE) to exchange data between R and the database server. You will use
the IBM JDBC driver (type 4) to provide the underlying connectivity. The IBM JDBC driver can be
used to access databases across the DB2 family including: DB2 for Linux, Unix, Windows, DB2
for z/OS, and DB2 i. If the RJDBC package has not been installed in your R environment, use the
command install.packages("RJDBC") to install it from CRAN.
In Listing 1, the RJDBC package loaded an R object representing the JDBC driver. The db2jcct4.jar
file must be reachable from your CLASSPATH and a valid JRE must be available in your JAVA_PATH.
The dbConnect() function is used to allocate a channel, or connection, object to the database
server. The connection is to a database called SAMPLEDB, located on the server, with a
hostname of blueforcloud.imdemocloud.com. Note that the value for the password is required, but
it is not included in the script.
BLU for Cloud - R Script
library(bluR)
# Connect to the BLU database server - local connection as R runs on the same server
samplescon <- bluConnect("SAMPLEDB", "", "")
# Create a simple query string for the data
query<-paste('select * from DB2INST1.US_FUEL_ECONOMY_AUGUST_2013')
# Create an R data frame based on the SQL statement
cars <- bludf(samplescon, query)
# Print dimensions of the data frame and some data from the first row
nrow(cars)
ncol(cars)
print (cars[1,1:4], row.names=FALSE)
# Create a boxplot visualization
Using R with databases
Page 6 of 15
ibm.com/developerWorks/
developerWorks®
boxplot(COMB_FE_CONVENTIONAL_FUEL ~ CYL,
cars, names = levels(cars$CYL),
main="Fuel Consumption - 2013",
xlab = "Number of Cylinders",
ylab = "Miles/Gallon (mpg)")
# Close the connection to the BLU server
bluClose(samplescon)
Spend a few minutes to try the dashDB for Cloud Trial plan and try the script shown in Listing 1. In
the remainder of this article you will learn more about how to access data from R.
Listing 2 shows how the connection object is used to send an SQL query to the server with the
dbSendQuery() function. The data is not actually returned to the client until the fetch() function
is executed. In this scenario, all of the rows from the result set are returned because the second
parameter was set to -1.
RJDBC - querying
query <- paste("select * from db2inst1.us_fuel_economy_august_2013")
# Send query to the database server
rs <- dbSendQuery(conn, query)
# Retrieve all of the rows of data
df <- fetch(rs, -1)
In Listing 3, the data frame df contains the results of the query. The purposes of this script are to
determine the vehicles with the best fuel economy and to determine the model. The output shows
that the Toyota Prius vehicles have the best combined fuel economy of the data contained in our
table of 1165 vehicles.
RJDBC - disconnecting
# Remove any results with missing data
df <- na.omit(df)
cat ("There are", nrow(df), "fuel economy results available with ")
cat (ncol(df), "different variables.\n")
# Find the best fuel consumption in the data frame
best_fe <- max(df$COMB_FE_CONVENTIONAL_FUEL,na.rm=TRUE)
cat("\nCar(s) with the best fuel consumption of",best_fe,"miles/gallon.\n\n")
print (df[df$COMB_FE_CONVENTIONAL_FUEL==best_fe,c(1:4)],row.names=FALSE)
# Disconnect from the database server
dbDisconnect(conn)
---- OUTPUT from Script
There are 1165 fuel economy results available with 18 different variables.
Car(s) with the best fuel consumption of 50 miles/gallon.
MODEL_YEAR MFR_NAME DIVISION CARLINE
2013
Toyota
TOYOTA
PRIUS
2013
Toyota
TOYOTA PRIUS c
Using R with databases
Page 7 of 15
developerWorks®
ibm.com/developerWorks/
When NULL values are returned from a database table they are mapped to NAs in an R data
frame. In the first line of Listing 3, the na.omit() function was used to remove any observations
with missing values. R data frame variables are returned from DB2 using upper case characters.
To release database resources on the server, remember to use the dbDisconnect() function at the
end of our R scripts.
You can also use the dbSendQuery() function to bind R objects to SQL statements, as Listing 4.
RJDBC - using parameters
mfr <- "BMW"
query <- paste("select * from
db2inst1.us_fuel_economy_august_2013
where mfr_name=?")
# Send query to the database server with parameter substitution
rs <- dbSendQuery(conn, query, mfr)
df.bmw <- fetch(rs, -1)
# Convert the variable to a factor data type
df.bmw$CYL <- factor(df.bmw$CYL)
# Create a boxplot by number of cylinders
boxplot(COMB_FE_CONVENTIONAL_FUEL ~ CYL,
df.bmw, names = levels(df.bmw$CYL),
main="Fuel Consumption - 2013",
xlab = "Number of Cylinders",
ylab = "Miles/Gallon (mpg)")
In Listing 4, the value of the manufacturer is not part of the query text; it is sent as part of the query
when the dbSendQuery() function is executed.
Figure 4 shows the boxplot that was easily generated based on the data frame that was returned.
Boxplot example
Connecting with RODBC
RODBC is a more efficient and faster data access interface to DB2 for R users. RODBC is
available in CRAN and is used by many people in the R community. If the RODBC package is not
installed in your R environment, use the install.packages("RODBC") command to install it.
Using R with databases
Page 8 of 15
ibm.com/developerWorks/
developerWorks®
ODBC was originally developed by Microsoft in the early 1990s. It has since become an official
standard for data access known as SQL/CLI (Call Level Interface). An ODBC environment consists
of an ODBC Driver Manager and an ODBC compliant driver for the database server you would
like to use. On Windows, the ODBC Driver Manager is built into the platform, but on Linux or other
platforms an ODBC Driver Manager should be installed.
Connecting to a database from the RODBC driver involves identifying the location of the server,
the name of the database, and supporting credentials (for example, user name and password).
The name of the database is usually defined as an ODBC DSN. A DSN is a detailed reference to
a database that is either local or remote from the client computer. You can consider a DSN as an
alias to the database—it does not need to match the actual name of the database defined on the
server.
Creating DSNs on Windows involves using the ODBC Data Source Administrator tool, which is
accessible from the Control Panel->Administration Tools menu. In Figure 5, there are many DSNs
defined in this client computer.
Windows ODBC Data Source Administrator tool
If you decide to use R and DB2 Express-C for Windows with local databases, there should be
DSNs already defined for your databases. If you wish to access a remote database, then you
should: obtain the connection details from the database server provider, catalog the connection,
and define a DSN within the ODBC Data Source Administrator tool.
There is also a direct connection method available if you want to avoid the creation of DSNs.
The RODBC package provides a number of connection related functions, as described in Table 2.
Using R with databases
Page 9 of 15
developerWorks®
ibm.com/developerWorks/
RODBC - connection and metadata functions
Function
Description / purpose
Input
Output
odbcDataSources ()
Provides a list of available DSNs.
None required.
Character vector of DSNs.
odbcConnect (dsn, uid,
pwd, ...)
Establishes a connection to a
database server.
dsn="DSN_name", uid="USERID",
pwd="password" - other various
optional parameters
Channel object that represents an
active connection to a database.
odbcDriverConnect
(connection ="", ...)
Establishes a connection to a
database server.
connection string.
The values for DSN, user id, and
password must be provided in a
single string. See listing 6 for an
example.
Channel object that represents an
active connection to a database.
odbcGetInfo ( channel )
Provides detailed information
about the active database
connection.
channel - object representing an
active connection to a database
server
Named character vector
describing details about the
connection including the
ODBC driver type and level of
conformance to the API standards.
After we've established a channel, or connection, to our database, we can use the metadata
functions to explore the supported datatypes, table definitions, and their defined columns.
Table 3 lists three useful database metadata functions available within RODBC.
RODBC - database and table metadata functions
Function
Description / Purpose
Input
Output
sqlTypeInfo
(channel, ...)
Provides information about the
supported data types of the ODBC
database.
channel
Data frame of the supported data
types and their characteristics.
sqlTables (channel, ...)
Provides a description of the
table-like objects defined within a
database.
channel
recommended optional
parameters : "schema=",
"tableType="
Data frame containing details
about the tables, views, or other
table-like objects in the database.
sqlColumns (channel,
sqtable, ...)
Provides a description of the
columns defined within a table.
channel, table name
Data frame containing details
about the column names and other
attributes for a table.
After you determine which table you want to work with, use the sqlFetch() function to retrieve the
data into R. The RODBC package will map the data to the appropriate R data type based on the
defined mappings displayed in the output of the getSqlTypeInfo("DB2/NT") function.
Let's examine a simple R script that will connect to a database using a DSN and determine some
basic information. The program will determine the number of table objects and the columns within
a specified table. The sqlFetch() function is used to retrieve all of the data into R and display the
first row that is returned in the data frame.
In Listing 5, the odbcConnect() function and metadata functions determined there are 27 tables
within the "DB2INST1" schema and 18 columns defined in the fuel economy table. The sqlFetch()
function is used to query the contents of the table and create a data frame called cars. The
first four variables of the first observation in the data frame are displayed to the user and the
connection to the database server is released using the odbcCloseAll() function.
Using R with databases
Page 10 of 15
ibm.com/developerWorks/
developerWorks®
RODBC - getting connected and using metadata functions
library(RODBC)
dsn.name <- "blusamp"
user.name <- "granthut"
con1 <- odbcConnect(dsn=dsn.name,uid=user.name,pwd)
table.list <- sqlTables(con1,tableType="TABLE", schema="DB2INST1")
cat("There are", nrow(table.list), "tables in the DB2INST1 schema.\n")
table.name <- "DB2INST1.US_FUEL_ECONOMY_AUGUST_2013"
col.list <- sqlColumns(con1,table.name)
cat("There are", nrow(col.list), "columns defined in", table.name,"\n")
# Display one row from the table
cars <- sqlFetch(con1, table.name)
print (cars[1,1:4], row.names=FALSE)
# Close connections
odbcCloseAll()
cat("Database connections are closed.\n")
---- OUTPUT from Script
There are 27 tables in the DB2INST1 schema.
There are 18 columns defined in DB2INST1.US_FUEL_ECONOMY_AUGUST_2013
MODEL_YEAR MFR_NAME DIVISION
CARLINE
2013
BMW
BMW 135i Convertible
Database connections are closed.
An alternative method of connecting to DB2 databases using RODBC involves using the
odbcDriverConnect() function. Note that the db.name defined in Listing 6 is the actual database
name on the database server and it is not necessarily the DSN created on the client computer.
RODBC - direct connection method
driver.name <- "{IBM DB2 ODBC DRIVER}"
db.name <- "SAMPLEDB"
host.name <- "bluforcloud.imdemocloud.com"
port <-"50001"
user.name <-"granthut"
# Use a full connection string to connect to a SAMPLE database
con.text <- paste("DRIVER=",driver.name,
";Database=",db.name,
";Hostname=",host.name,
";Port=",port,
";PROTOCOL=TCPIP",
";UID=", user.name,
";PWD=",pwd,sep="")
con1 <- odbcDriverConnect(con.text)
Querying data
A key reason for analyzing data stored in database servers, instead of files, using R is to help
manage issues associated with large datasets. As mentioned, R will load all of the data into
memory by default and, with conditional SQL queries, you can more easily examine portions of
large datasets.
Using R with databases
Page 11 of 15
developerWorks®
ibm.com/developerWorks/
As shown in Table 4, any valid SQL query can be sent to the database server using the
sqlQuery() function. If you have the authority to DROP tables or remove data, then the sqlDrop()
or sqlClear() functions are also available.
RODBC - querying and deleting data
Function
Description / Purpose
Input
Output
sqlQuery ( channel,
query, ... )
Executes the SQL query on the
database server and provides the
results.
channel, query
recommended options:
errors=FALSE (helps to capture
any errors)
Data frame of the result set. The
data will be mapped to compatible
R data types.
sqlDrop ( channel,
sqtable, ... )
Removes the table contents and
definition from the database.
channel, table
Note that this function will attempt
to execute a DROP TABLE
statement.
sqlClear ( channel,
sqtable, ... )
Removes all of the rows in a table
from the database.
channel, table
Note that this function will attempt
to execute a TRUNCATE TABLE
statement.
You can use the very versatile sqlQuery() function to issue SQL Data Manipulation Language
(DML) statements such as SELECT, INSERT, UPDATE, or DELETE, and SQL Data Definition
Language (DDL) statements such as CREATE TABLE. The SQL statement in Listing 7 contains
an error. The datatype is not spelled correctly and the phrase INTEGR should be INTEGER.
Thankfully, the odbcGetErrMsg() function is available to obtain and display a detailed error
message. You will find this function very useful as you edit and test your R scripts.
RODBC - diagnosing errors
res <- sqlQuery(con1,"CREATE TABLE TESTDATA (c1 INTEGR)", errors=FALSE)
if (res == -1){
cat ("An error has occurred.\n")
msg <- odbcGetErrMsg(con1)
print (msg)
} else {
cat ("Table was created successfully.\n")
}
---- OUTPUT from Script
An error has occurred.
[1] "42704 -204 [IBM][CLI Driver][DB2/LINUXX8664]
SQL0204N \"INTEGR\" is an undefined name. SQLSTATE=42704\r\n"
[2] "[RODBC] ERROR: Could not SQLExecDirect 'CREATE TABLE TESTDATA (c1 INTEGR)'"
Storing data
Data is usually stored in relational databases using the SQL INSERT statement. If you have an
R data frame and want to persist the data into a relational database, you can use the sqlSave()
function. The default behaviour of this function is to create a table with the same data as the data
frame variable. The column names will also be the same as the column names in the data frame.
You may not want to store the R row names from the data frame in your persisted table, as there is
usually a natural key within your data set.
The sqlSave() function will use the SQL INSERT statement to populate the table with the data
from the data frame. If you have a large dataset that you want to persist, consider creating
Using R with databases
Page 12 of 15
ibm.com/developerWorks/
developerWorks®
an external delimited file of the data and then using a database specific high speed loader
to publish and share the data with other data analysts. After the data is safely stored in the
relational database, you no longer need to worry about maintaining its integrity and availability—
the database server will handle that task for you.
In Listing 8, the sqlSave() function is used to create and populate a new table. Since the data
frame was called CLASSMARKS the table will have the same name. In this scenario, the safer
parameter allows the script to replace any existing table with the same name. The default
behaviour is to return an error if there is an attempt to save data into an existing table.
RODBC - saving data
tab.name <- "CLASSMARKS"
NAMES <- c("Bob","Mary","Fred")
MARKS <- c(78,88,91)
# Create a data frame of test scores and names
CLASSMARKS <- data.frame (NAMES,MARKS,stringsAsFactors=FALSE)
# Create a new table and populate it with the data frame CLASSMARKS
sqlSave(con1, CLASSMARKS, rownames=FALSE,safer=FALSE)
NEWCLASS <- sqlFetch(con1,tab.name)
cat( "Mean mark for the class is", mean(NEWCLASS[,"MARKS"]),"\n")
---- OUPUT from Script
Mean mark for the class is 85.66667
You can also use the RODBC sqlUpdate() function to perform modifications to data in existing
tables. I recommend using the SQL UPDATE statements as a more effective alternative, or simply
store the new data into a temporary table using sqlSave() and then perform an SQL UPSERT
between the tables. If you use the sqlUpdate() function, be aware that an index parameter is
required. The index parameter is used to uniquely identify each row in the existing table.
Using stored procedures
Stored procedures are programming modules that are managed and executed by the database
server. Since the data is co-located with the program, stored procedures can be very efficient
and perform some tasks much faster than client-side applications. DB2 stored procedures can be
written using various programming languages such as Java, C, PL/SQL, and SQL PL. If you have
existing stored procedures, or your analysis work can be simplified or improved by using stored
procedures, you can easily call them from an R script using RJDBC or RODBC.
In Listing 9, the GETMEDIAN stored procedure is called using the sqlQuery() function. The data in
this scenario would be returned as a single valued data frame. The subject of interest is passed to
a stored procedure as a defined input parameter. At this time, there is no mechanism available to
retrieve multiple result sets or output parameters from a DB2 stored procedure.
Using R with databases
Page 13 of 15
developerWorks®
ibm.com/developerWorks/
RODBC - stored procedures
# Call the stored procedure to find the median mark based on a subject (input variable)
subject <-"MATH"
median <- sqlQuery(con1,"CALL GETMEDIAN ( subject )")
print (median)
Summary
R is a powerful open source data analysis tool that can help you explore data and build predictive
models. When your data grows beyond the memory limits of R, consider loading the data into a
database server such as IBM DB2 with dashDB or dashDB for Cloud. In this article, we explored
the benefits of using R with databases instead of using delimited text files for data analysis. You
also learned how to use the RJDBC and RODBC packages to perform data analysis of big data
stored in DB2.
Related topics IBM DB2 with BLU Acceleration Big Data University course: Introduction to
Data Analysis using R Big Data University course: Using R with Databases DB2 with BLU
Acceleration: A rapid adoption guide
Using R with databases
Page 14 of 15
ibm.com/developerWorks/
developerWorks®
About the author
Grant Hutchison
Grant Hutchison has held various roles within IBM's Information Management
group since 1991. He created the initial DB2 Professional Certification program and
contributed to DB2 for Linux, UNIX and Windows product releases since Version 1 as
a developer, product manager, and quality assurance engineer. He teaches computer
science and math courses at the secondary and post-secondary level and is currently
working as a contributor to Big Data University.
© Copyright IBM Corporation 2014
(www.ibm.com/legal/copytrade.shtml)
Trademarks
(www.ibm.com/developerworks/ibm/trademarks/)
Using R with databases
Page 15 of 15