Extending and Customizing IBM SPSS Statistics with R, Python, and .NET
Jon Peck
Senior Software Engineer, IBM
[email protected]
November, 2010
Business Analytics software
© 2010 IBM Corporation
IBM SPSS Statistics
• IBM® SPSS® Statistics has an extensive command language (syntax) for data acquisition, manipulation, and statistical and graphical procedures
• Programmability and scripting dramatically extend these built-in capabilities
• They allow custom user interfaces and output to be produced
• Converting large SAS applications is likely to require the use of programmability
Agenda
• Programmability introduction
• Four examples
  – Automating repetitive work: applySyntaxToFiles
  – Integrating programs and scripting: SPSSINC MODIFY TABLES
  – Adding a procedure from R: SPSSINC QUANTREG
  – Adding a procedure in Python: SPSSINC TURF
Programmability increases your power, flexibility, and productivity
• Generalization
  – React flexibly to metadata, results, and the environment
  – Benefit: Write fewer similar jobs
• Automation
  – Embed program logic in jobs
  – Benefit: Less manual work
• Extension
  – Tap existing R or Python statistical modules
  – Add your own or extend standard procedures and transformations
  – Benefit: More capabilities
• Integration
  – Connect IBM SPSS Statistics inputs and outputs to other agents
  – Benefit: Make IBM SPSS Statistics part of a larger production process
• More productivity and more fun
IBM SPSS Statistics embeds three programming languages
• Plug-ins let you extend capabilities using
  – Python
  – R
  – .NET languages (Windows only)
• Free plug-in downloads
• SPSS Developer Central web site provides articles, SPSS-written modules, plug-ins, and user contributions
  – New SPSS Community on IBM myDeveloperWorks
My first Python program
GET FILE="c:/data/important.sav".
BEGIN PROGRAM PYTHON.
import spss
print("Hello, IBM")
END PROGRAM.
DESCRIPTIVES ....
Python or R program code goes in the normal Statistics syntax window
Programmability combines SPSS Statistics with Python, R, or .NET
• A program in the input stream can communicate with IBM SPSS Statistics, control it, and use Python or R facilities and modules (internal mode)
    spss.Submit("GET FILE='c:/data/cars.sav'.")
• A Python or .NET application can embed IBM SPSS Statistics inside itself (external mode)
  – User interface does not appear
• There is a lower-level C API available in an SDK
Programmability functionality is fully integrated into IBM SPSS Statistics
• Programs run in the regular syntax stream
• Users can define IBM SPSS Statistics syntax for programs and scripts via the Extension mechanism
• Users can create dialog boxes and menus using the Custom Dialog Builder
  – Not just for extensions or programs
• Python and R output appears in the Viewer
  – plain text
  – pivot tables
  – charts
Python and R Programmability APIs cover these areas
• State information of Statistics
• Get/Set variable dictionary information
• Get/Set data
• Get Viewer output (via xmlworkspace)
• Create tables/charts/text objects in Viewer
• Run Statistics commands (Python only)
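As a rough illustration of how these API areas combine, the sketch below reads the variable dictionary, keeps the numeric variables, and submits a DESCRIPTIVES command for them. It assumes the Python plug-in's spss.GetVariableCount, spss.GetVariableName, spss.GetVariableType (0 for numeric variables), and spss.Submit functions. The helper names build_descriptives and read_dictionary are invented for this example, and the spss import is guarded so the command-building part can be tried outside Statistics.

```python
def build_descriptives(names_and_types):
    """Return a DESCRIPTIVES command for the numeric variables, or None if there are none.

    names_and_types: list of (variable name, type) pairs, where type 0 means numeric.
    """
    numeric = [name for name, vartype in names_and_types if vartype == 0]
    if not numeric:
        return None
    return "DESCRIPTIVES VARIABLES=" + " ".join(numeric) + "."


def read_dictionary(spss_module):
    """Collect (name, type) pairs from the active dataset's variable dictionary."""
    count = spss_module.GetVariableCount()
    return [(spss_module.GetVariableName(i), spss_module.GetVariableType(i))
            for i in range(count)]


try:
    import spss  # only importable when the Python plug-in is installed
except ImportError:
    spss = None

if spss is not None:
    command = build_descriptives(read_dictionary(spss))
    if command:
        spss.Submit(command)
```

Outside Statistics, build_descriptives([("salary", 0), ("name", 8), ("educ", 0)]) returns the command text for salary and educ only, which is the "react flexibly to metadata" idea in miniature.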
Python and VB scripting APIs cover user interface and output
• Programmability is a backend (SPSS Processor) domain
• Scripting is mainly a frontend (user interface, including output) domain
• Managing output Viewer and objects
  – tables: formatting, pivoting, editing, …
  – objects: visibility, order, titles, outline text, …
• General user interface control
• Almost anything you can do via the user interface
• Not available for R
.NET plug-in embeds Statistics inside another program
Example: Statistical Explorer
• Statistics, graphs, and data management via Statistics
• Two pages of VB.NET code
Python and R are open source software
• Programmability plug-ins are an optional installation
  – They are free (but require a Statistics license)
  – They make it possible to tap the work of the Python and R communities
  – Python and R have license agreements
  – IBM non-warranty license agreement
  – For R, GPL license
Extension commands eliminate need for user to learn Python or R
• Extension mechanism lets you define IBM SPSS Statistics-style syntax for programs
• IBM SPSS Statistics takes care of validation and parsing
• Passes user input to a program in an easy-to-digest form
• Automatically loaded when IBM SPSS Statistics starts
  – Look to the user like built-in commands
• Easy to distribute to others
Some statistical extensions on Dev Central
Extension Name                    Description
PLS                               Partial least squares (P)
PROPOR                            Confidence intervals for proportions (P)
SPSSINC APRIORI                   Association rules (R)
SPSSINC BREUSCH PAGAN             Residual heteroscedasticity tests (R)
SPSSINC HETCOR                    Polychoric and polyserial correlation (P+R)
SPSSINC MFP GLM                   Fractional polynomial generalized linear models (R)
SPSSINC QQPLOT2                   Empirical Q-Q plots (R)
SPSSINC QUANTREG                  Quantile regression (R)
SPSSINC RAKE                      Adjust weights to control totals (P)
SPSSINC RANFOR & SPSSINC RANPRED  Random forests (R)
SPSSINC RASCH                     Rasch models (R)
SPSSINC ROBUST REGR               Robust regression (R)
SPSSINC TOBIT REGR                Tobit regression (R)
SPSSINC TURF                      TURF analysis (P)
Some non-statistical extensions on Dev Central
Extension Name            Description
FUZZY                     Case-control exact and approximate matching (P)
GATHERMD                  Gather data file metadata (P)
HIDECOLS                  Hide pivot table columns (P)
SCRIPTEX                  SCRIPT commands with parameters (P)
SETSMACRO                 Syntax for using variable sets (P)
SPSSINC ANON              Anonymize data (P)
SPSSINC COMPARE DATASETS  Compare two sav files (P)
SPSSINC CREATE DUMMIES    Create dummy variables for categories (P)
SPSSINC GETURI DATA       Read data from the Internet (P)
SPSSINC MERGE TABLES      Merge two pivot tables (P)
SPSSINC MODIFY OUTPUT     Set Viewer outline titling and styling (P)
SPSSINC MODIFY TABLES     Set pivot table cell and label styling (P)
SPSSINC TRANS             Apply Python functions to cases (P)
SPSSINC TRANSLATE         Translate Viewer output (P)
TEXT                      Create block of text in Viewer (P)
You can create and share your own additions to IBM SPSS Statistics
– Write Python or R functions to implement the functionality or tap existing packages
  • Use input APIs to get data to Python or R
  • Use output APIs to create pivot tables
  • Each of these can be a single line of code
– For extensions,
  • Define the syntax in an XML file
  • Use tools in extension.py (Python) or spsspkg (R) to receive the parsed input and pass it to the implementing function
  • New in v18: R version of extension.py
– Use the Custom Dialog Builder to create the interface
  • The CDB is not just for extensions
– Test and document!
– Package and distribute
– Contributions to Developer Central are welcome
• Documentation is at SPSS Developer Central
Extension commands: validation and mapping from syntax to Python or R function parameters is handled for you
• Example: SPSSINC BREUSCH PAGAN
  – implemented using an R package
• SPSSINC_BREUSCH_PAGAN.xml specifies the syntax to the Statistics parser
• The R mapping code in SPSSINC_BREUSCH_PAGAN.R respecifies the syntax and invokes the executing routine with parsed parameters
  – overlaps with XML syntax definition but provides additional features
SPSSINC BREUSCH PAGAN
  DEPENDENT = salary ENTER = educ jobcat
  /OPTIONS MISSING=LISTWISE
  /SAVE RESIDUALSDATASET=resids COEFSDATASET=coefs.
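To make the mapping step concrete, here is a minimal Python sketch of the idea, not the actual extension.py or spsspkg code (the real SPSSINC BREUSCH PAGAN mapping is written in R). The nested dictionary standing in for the parsed syntax, the MAPPING table, and the to_kwargs and breusch_pagan names are all assumptions made for illustration. The point is only that subcommand and keyword pairs become ordinary function arguments.

```python
# Hypothetical shape of the parsed command: {subcommand: {keyword: [values]}}.
# The empty string stands for the unnamed first subcommand.
PARSED = {
    "": {"DEPENDENT": ["salary"], "ENTER": ["educ", "jobcat"]},
    "OPTIONS": {"MISSING": ["LISTWISE"]},
    "SAVE": {"RESIDUALSDATASET": ["resids"], "COEFSDATASET": ["coefs"]},
}

# (subcommand, keyword) -> (function parameter name, whether the parameter is a list)
MAPPING = {
    ("", "DEPENDENT"): ("dep", False),
    ("", "ENTER"): ("enter", True),
    ("OPTIONS", "MISSING"): ("missing", False),
    ("SAVE", "RESIDUALSDATASET"): ("residualsdataset", False),
    ("SAVE", "COEFSDATASET"): ("coefsdataset", False),
}


def to_kwargs(parsed, mapping):
    """Flatten the parsed command into keyword arguments for the implementing function."""
    kwargs = {}
    for subcommand, keywords in parsed.items():
        for keyword, values in keywords.items():
            key = (subcommand, keyword)
            if key not in mapping:
                raise ValueError("Unexpected keyword: %s %s" % key)
            name, is_list = mapping[key]
            kwargs[name] = list(values) if is_list else values[0]
    return kwargs


def breusch_pagan(dep, enter, missing="LISTWISE",
                  residualsdataset=None, coefsdataset=None):
    """Stand-in for the executing routine; it only echoes what it received."""
    return {"dep": dep, "enter": enter, "missing": missing,
            "residualsdataset": residualsdataset, "coefsdataset": coefsdataset}


result = breusch_pagan(**to_kwargs(PARSED, MAPPING))
```

With this arrangement the implementing function never sees command syntax, only named parameters, so it can be tested on its own.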
An XML file defines the syntax to the SPSS Universal Parser
Python or, in this case, R code gets the parsed syntax, which is
turned into function arguments
Expand the audience by creating IBM SPSS Statistics syntax and
dialog boxes
Example I
Generalize and automate work
Every day you need to run your syntax files against datasets that are not known in advance
The applySyntaxToFiles function applies a syntax file to each file in the input specification
Use programmability to automate routine processes
Apply standard processing to an unknown set of files
Produce processed data and reports
Use a program to drive processing
begin program.
import spss, spssaux3
# Apply dailychecks.sps to every .sav file matching the input specification
spssaux3.applySyntaxToFiles(inputspec="c:/temp/parts/*.sav",
    syntax="c:/myjobs/dailychecks.sps",
    outputdatadir="c:/temp/processed",
    outputfiledir="c:/temp/processed",
    logfile="c:/temp/processed/report.txt")
end program.
• dailychecks.sps could apply data cleaning rules, modify data, and create reports
• Could be run daily through Production Mode or C&DS job scheduler or used interactively
• Extended version available as SPSSINC PROCESS FILES
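The general pattern behind a function like this can be sketched in a few lines. This is not the spssaux3 implementation. It is a simplified illustration that expands a wildcard and builds GET FILE, INSERT FILE, and SAVE OUTFILE commands for each match. The function names commands_for_files and run_jobs are invented here, and logging and Viewer output handling are left out. Inside Statistics, spss.Submit could be passed as the submit argument.

```python
import glob
import os


def commands_for_files(inputspec, syntax, outputdatadir):
    """Return one list of Statistics commands per data file matching inputspec."""
    jobs = []
    for path in sorted(glob.glob(inputspec)):
        path = path.replace("\\", "/")  # Statistics accepts forward slashes on Windows
        outfile = outputdatadir.rstrip("/") + "/" + os.path.basename(path)
        jobs.append([
            'GET FILE="%s".' % path,        # open the data file
            'INSERT FILE="%s".' % syntax,   # run the standard syntax against it
            'SAVE OUTFILE="%s".' % outfile, # keep the processed copy
        ])
    return jobs


def run_jobs(jobs, submit):
    """Send each job to Statistics; submit is any callable taking a list of commands."""
    for commands in jobs:
        submit(commands)
```

Keeping command generation separate from submission means the list of jobs can be inspected or logged before anything is run.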
Example II
Automate dynamic or static formatting of tables
Use integrated scripting for better table presentation
SPSSINC MODIFY TABLES extension command manipulates table formatting and structure
• TableLooks provide static formatting for entire areas of a table
  – data cells
  – row and column layers
• You want tables with formatting beyond TableLooks
• Many users copy tables to Excel and manually format them
• Basic and Python Scripting provide a programmatic way to do formatting
• SPSSINC MODIFY TABLES provides syntax for extensive formatting
  – Eliminates need to know scripting
  – Uses Extension mechanism for programs and Python scripting
Use dynamic highlighting to make a crosstab table easier to read
SPSSINC MODIFY TABLES SUBTYPE='Crosstabulation'
  DIMENSION=ROWS SELECT='Std. Residual'
  /STYLES TEXTSTYLE=BOLD BACKGROUNDCOLOR=255 0 0
  APPLYTO='abs(x) >2'.
Custom dialog boxes are easy to create
• Dialog created with Custom Dialog Builder
• Generates extension command syntax
• Easy to distribute
Use static formatting to call out parts of a table
SPSSINC MODIFY TABLES subtype='variables in the equation'
SELECT="B" "Sig."
/STYLES TEXTCOLOR = 0 0 255
BACKGROUNDCOLOR=0 255 0.
Format CTABLES totals to call them out
SPSSINC MODIFY TABLES SUBTYPE="Custom Table"
SELECT = "Total" DIMENSION=ROWS
/STYLES BACKGROUNDCOLOR=255 255 88
TEXTSTYLE = BOLD.
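Use custom functions for special effects
SPSSINC MODIFY TABLES SUBTYPE='Report' SELECT="<<ALL>>"
  /STYLES APPLYTO=DATACELLS TEXTCOLOR=255 255 255
  TEXTSTYLE=BOLD
  CUSTOMFUNCTION="customstylefunctions.washColumnsBlue".
# Python function in the customstylefunctions module named by CUSTOMFUNCTION
def washColumnsBlue(obj, i, j, numrows, numcols, section, more):
    mincolor = 150.
    maxcolor = 255.
    # Step the blue component evenly from mincolor (first column) to maxcolor (last column);
    # max() avoids dividing by zero for a one-column table
    increment = (maxcolor - mincolor) / max(numcols - 1, 1)
    colorvalue = round(mincolor + increment * j)
    obj.SetBackgroundColorAt(i, j, RGB((mincolor, mincolor, colorvalue)))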
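The arithmetic in the custom function is easy to check in isolation. The sketch below is a standalone illustration rather than part of the customstylefunctions module, and column_blue_levels is a name invented here. It computes the blue component the function would assign to each column index, using the same 150 to 255 range.

```python
def column_blue_levels(numcols, mincolor=150.0, maxcolor=255.0):
    """Blue component per column index, rising linearly from mincolor to maxcolor."""
    if numcols < 2:
        # A single column has nothing to interpolate across; use the starting color.
        return [int(round(mincolor))] * numcols
    increment = (maxcolor - mincolor) / (numcols - 1)
    return [int(round(mincolor + increment * j)) for j in range(numcols)]


print(column_blue_levels(4))  # [150, 185, 220, 255]
```

For a four-column table the increment is 35, so the columns shade from a grayish blue at the left edge to full blue at the right edge.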
It is possible to get carried away with this
Example III
Extend IBM SPSS Statistics by tapping the work of the R and Python communities
Add R procedures seamlessly to IBM SPSS Statistics
R
• R is a programming language for statistics
  – leading edge statistics
  – many contributed statistics and graphics packages
  – free
• R is not so easy to learn
  – Documentation by experts for experts
  – Feels like a complex programming language, because it is
  – Syntax is a lot like C
  – Error in optim(rho, f, control = control, hessian = TRUE, method = "BFGS") :
      initial value in 'vmmin' is not finite
    • Good for programmers(?); bad for users
• R holds data in memory
• R for SAS and SPSS Users, Bob Muenchen, Addison-Wesley, 2008
R procedures can be accessed from IBM SPSS Statistics using the R plug-in
• The R plug-in makes it easy to use R packages
  – IBM SPSS Statistics datasets and Viewer output can be processed by R using the plug-in
  – Graphical, text, and table output appear in the Viewer
    • Pivot tables can be created with R code
  – New IBM SPSS Statistics datasets can be created from R
  – R communicates with IBM SPSS Statistics via APIs in the plug-in
  – Integration requires writing a little R wrapper code
  – IBM SPSS Statistics can provide
    • dialog box interface
    • IBM SPSS Statistics-style syntax
    • pivot table output
• Plug-in is downloadable from Developer Central
Quantile regression models conditional quantiles
• Ordinary regression models the conditional mean
• Median regression is the 0.5 quantile (50th percentile)
• Estimating quantiles is useful with varying spread, asymmetries, outliers
• Areas of application include
  – empirical finance
    • value at risk
    • mutual fund investment styles
    • credit scoring
  – school quality
  – demand analysis
  – others
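A small numeric sketch may help show why quantiles behave differently from the mean. Quantile regression minimizes an asymmetric absolute loss, often called the check or pinball loss. The code below applies that loss to a plain sample with no predictors, which is the simplest case and not the quantreg algorithm itself. The function names and the data are made up for illustration.

```python
def pinball_loss(candidate, values, tau):
    """Check loss: residuals above the candidate weigh tau, those below weigh 1 - tau."""
    total = 0.0
    for y in values:
        residual = y - candidate
        total += tau * residual if residual >= 0 else (tau - 1.0) * residual
    return total


def sample_quantile(values, tau):
    """Return the data value that minimizes the check loss (smallest value on ties)."""
    return min(sorted(values), key=lambda c: pinball_loss(c, values, tau))


data = [1, 2, 3, 4, 100]            # one large outlier
mean = sum(data) / float(len(data))  # 22.0, pulled up by the outlier
median = sample_quantile(data, 0.5)  # 3, unaffected by the outlier
upper = sample_quantile(data, 0.9)   # 100, the upper tail
```

The mean is dragged to 22 by a single extreme case, while the 0.5 quantile stays at 3. Quantile regression extends the same loss to a model with predictors.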
SPSSINC QUANTREG extension embeds the R quantreg package
Pivot tables and plots appear in the Viewer
New datasets appear in Data Editor windows
Example IV
Extend IBM SPSS Statistics by adding procedures in Python
• TURF analysis
TURF Analysis is popular in market research
• Total Unduplicated Reach and Frequency (TURF)
• Find the highest coverage of positive responses for a small number of questions
• Example: How do you reach the largest audience by advertising on a few kinds of sports?
  – football, cricket, basketball, cycling, ...
• Example: What ice cream flavors should you offer in your shops that have three dispensing machines?
• Example: What phone features should you promote?
  – multi-line, voicemail, paging, internet ...
• Simple FREQUENCIES does not account for overlap
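The overlap problem can be shown with a toy example. The sketch below is a brute-force illustration, not the SPSSINC TURF code. The response data and the best_reach function are invented. Each question maps to the set of case IDs that answered positively, and reach is the size of the union.

```python
from itertools import combinations


def best_reach(responses, size):
    """Return (questions, reach) for the combination of `size` questions reaching most cases.

    responses: {question name: set of case IDs with a positive response}
    """
    best_combo, best_cases = None, set()
    for combo in combinations(sorted(responses), size):
        reached = set().union(*(responses[q] for q in combo))
        if len(reached) > len(best_cases):
            best_combo, best_cases = combo, reached
    return best_combo, len(best_cases)


# Made-up survey: which sports does each respondent follow?
sports = {
    "football": {1, 2, 3, 4, 5},
    "cricket": {1, 2, 3, 4},
    "basketball": {5, 6},
    "cycling": {6, 7},
}

print(best_reach(sports, 2))  # (('cycling', 'football'), 7)
```

Football and cricket have the two highest individual counts, but together they reach only five respondents because their audiences overlap. Football and cycling reach all seven.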
TURF calculations are demanding
• Must compute all possible set unions of positive responses (up to a maximum number of variables).
• Each set is a list of case IDs with positive response on a question.
• This problem is computationally explosive
Calculations for best 10 combinations of variables
Variables  Set Union Calculations
3          4
6          57
12         4,070
24         4,540,361
48         8,682,997,422
Is a scripting language like Python too slow?
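The growth in the table can be reproduced with a short calculation. The sketch assumes the counts are the number of combinations of at least two and at most ten variables. That reading is an inference from the numbers rather than something stated on the slide, but it matches every row. The function name union_calculations is invented here, and math.comb requires Python 3.8 or later.

```python
from math import comb


def union_calculations(variables, max_size=10):
    """Count the combinations of 2..max_size variables whose set unions must be evaluated."""
    largest = min(variables, max_size)
    return sum(comb(variables, k) for k in range(2, largest + 1))


for n in (3, 6, 12, 24, 48):
    print(n, union_calculations(n))
# 3 4
# 6 57
# 12 4070
# 24 4540361
# 48 8682997422
```

Doubling the number of variables from 24 to 48 multiplies the work by roughly 1,900, which is why limiting the maximum combination size matters.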
Extension command SPSSINC TURF is implemented in Python
• Provides
  – Dialog box interface
  – IBM SPSS Statistics style syntax
  – The computations
  – Pivot table output
• Fewer than 300 lines of Python code
  – Plus dialog box definition
  – Plus extension command syntax definition
• Executes requests involving a few million set comparisons in a few minutes
• Initial version written in two days
Analysis of phone data
Telco survey (9 variables, 1000 cases)
Dialog created with Custom Dialog Builder
Results show the combination of features with the best reach
Pivot table created from Python code
Best singles are conference calling, call forwarding, and call waiting
The best three are not the top three one at a time
Calculations completed in a few seconds
Where we have been today
Python and R integration
Unification of programs and scripts
Custom Dialog Builder
Extensions
SPSS Developer Central is your friend
Questions?
Programmability increases your power, flexibility, and productivity with IBM SPSS Statistics
• Generalization and automation
  – applySyntaxToFiles
  – SPSSINC MODIFY TABLES
• Extension
  – SPSSINC QUANTREG using R
  – SPSSINC TURF using Python
  – Many new extension commands available
• Integration
  – applySyntaxToFiles as part of a process
• And it's still more fun
Contact
Jon K Peck, Ph. D.
Senior Software Engineer
IBM SPSS
[email protected]
blog: insideout.spss.com