Product Description
Salford Systems' flagship data mining software,
CART®, is a robust, easy-to-use decision tree that automatically sifts
large, complex databases, searching for and isolating significant
patterns and relationships. This discovered knowledge is then used to
generate reliable, easy-to-grasp predictive models for applications
such as finding best prospects and customers, targeted marketing,
detecting credit card fraud, and managing credit risk.
Designed for both non-technical and technical business users, CART can
quickly reveal important data relationships that could remain hidden
using other analytical tools. The most recent 2008 release, CART 6.0,
includes modeling automation technology that dramatically accelerates
the process of generating accurate and robust models for deployment in
core business functions. CART was the primary tool used to win the
KDDCup 2000 web-mining competition and is currently in use in major web
applications.
Technically, Classification And Regression Trees (CART) is based on
landmark mathematical theory introduced in 1984 by four world-renowned
statisticians at Stanford University and the University of California
at Berkeley. Salford Systems’ implementation of CART is the only
decision tree software embodying the proprietary code written by CART
co-author Professor Jerome H. Friedman.
The CART creators continue to collaborate with Salford Systems to
enhance CART with proprietary advances. With CART 6.0 ProEX, Salford
has introduced patented extensions to CART specifically designed to
enhance results for market research and web analytics. CART supports
high-speed deployment, allowing Salford models to predict and score in
real time on a massive scale.

Principal Characteristics
CART uses an intuitive, Windows-based interface,
making it accessible to both technical and non-technical users.
Underlying the "easy" interface, however, is a mature theoretical
foundation that distinguishes CART from other methodologies and other
decision trees. Salford Systems' CART is the only decision tree system
based on the original CART code developed by world-renowned Stanford
University and University of California at Berkeley statisticians; this
code now includes enhancements that were co-developed by Salford
Systems and CART's originators.
Based on a decade of machine learning and statistical research, CART provides stable performance and reliable results.
Characteristics of its methodology are:
- A reliable pruning strategy
CART's developers
determined definitively that no stopping rule could be relied on to
discover the optimal tree, so they introduced the notion of
over-growing trees and then pruning back. This idea, fundamental
to CART, ensures that important structure is not overlooked by stopping
too soon. Other decision tree techniques use problematic stopping
rules.
- Automatic self validation procedures
In the search for
patterns in databases it is essential to avoid the trap of
"overfitting," or finding patterns that apply only to the training
data. CART's embedded test disciplines ensure that the patterns found
will hold up when applied to new data. Further, the testing and
selection of the optimal tree are an integral part of the CART
algorithm. In other decision tree techniques, testing is conducted
after the fact and tree selection is left up to the user.
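The over-grow-then-prune idea can be sketched in a few lines. The page does not spell out the pruning procedure, so the sketch below assumes weakest-link (minimal cost-complexity) pruning as described in the 1984 CART monograph. The dictionary layout, the function names, and the error counts are hypothetical illustrations, not Salford's code. Each step collapses the internal node whose removal raises training error the least per terminal node saved. This yields a nested sequence of smaller trees, from which the best one would then be chosen on test data.

```python
def leaves(node):
    """Count terminal nodes in the subtree rooted at node."""
    if not node["children"]:
        return 1
    return sum(leaves(c) for c in node["children"])

def subtree_error(node):
    """Total misclassified training cases over the subtree's terminal nodes."""
    if not node["children"]:
        return node["err"]
    return sum(subtree_error(c) for c in node["children"])

def internal_nodes(node):
    """Yield every node that still has children."""
    if node["children"]:
        yield node
        for c in node["children"]:
            yield from internal_nodes(c)

def link_strength(node):
    """Error increase per terminal node removed if this node is collapsed."""
    return (node["err"] - subtree_error(node)) / (leaves(node) - 1)

def prune_sequence(root):
    """Collapse the weakest link repeatedly; return (strength, leaves left) per step."""
    steps = []
    while root["children"]:
        weakest = min(internal_nodes(root), key=link_strength)
        strength = link_strength(weakest)
        weakest["children"] = []  # prune this node back to a terminal node
        steps.append((strength, leaves(root)))
    return steps

# Hypothetical over-grown tree; "err" = training cases misclassified at that node.
tree = {"err": 40, "children": [
    {"err": 10, "children": []},
    {"err": 20, "children": [
        {"err": 9, "children": []},
        {"err": 10, "children": []},
    ]},
]}
print(prune_sequence(tree))  # [(1.0, 2), (10.0, 1)]
```

The sketch collapses one node per step for simplicity; a fuller treatment would collapse tied weakest links together and score each pruned tree on a test sample or by cross-validation.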
In addition, CART
accommodates many different types of real world modeling problems by
providing a unique combination of automated and/or user-specified
solutions:
- Surrogate splitters intelligently handle missing values
CART handles missing
values in the database by substituting "surrogate splitters," which are
back-up rules that closely mimic the action of primary splitting rules.
The surrogate splitter contains information that is typically similar
to what would be found in the primary splitter. Other products'
approaches treat all records with missing values as if the records all
had the same unknown value; with that approach all such "missings" are
assigned to the same bin. In CART, each record is processed using data
specific to that record, thus allowing records with different data
patterns to be handled differently, resulting in a better
characterization of the data.
- Adjustable misclassification penalties
CART can accommodate
situations in which some misclassifications, or cases that have been
incorrectly classified, are more serious than others. CART users can
specify a higher penalty for misclassifying certain data, and the
software will steer the tree away from that type of error. Further,
when CART cannot guarantee a correct classification, it will try to
ensure that the error it does make is less costly. If credit risk is
classified as low, moderate, or high, for example, it would be much
more costly to classify a high risk person as low risk than as moderate
risk. Traditional data mining tools cannot distinguish between these
errors.
- Alternative splitting criteria
CART includes seven
single variable splitting criteria - Gini, symmetric Gini, twoing,
ordered twoing and class probability for classification trees, and
least squares and least absolute deviation for regression trees - and
one multi-variable splitting criterion, the linear combinations method.
The default Gini method typically performs best, but, given specific
circumstances, other methods can generate more accurate models. CART's
unique "twoing" procedure, for example, is tuned for classification
problems with many classes, such as modeling which of 170 products
would be chosen by a given consumer. To deal more effectively with
select data patterns, CART also offers splits on linear combinations of
continuous predictor variables.
- Any CART model can be easily deployed when translated
into one of the supported languages (SAS®-compatible, C, Java, and
PMML) or into the classic text output. This is critical for using your
CART trees in large scale production work.
The decision logic of a CART tree, including the surrogate rules
utilized if primary splitting values are missing, is automatically
implemented. The resulting source code can be dropped into external
applications, eliminating errors due to hand coding of decision
rules and enabling fast and accurate model deployment.
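To make the misclassification-penalty idea above concrete, here is a minimal sketch of cost-aware class assignment for the low / moderate / high credit risk example. It assumes a node is labelled with the class that minimizes expected misclassification cost. The `COST` matrix values, the node probabilities, and the function names are hypothetical and are not taken from CART.

```python
CLASSES = ["low", "moderate", "high"]

# COST[actual][predicted]: hypothetical penalties chosen for illustration only.
# Calling a high-risk person "low" is penalized far more than calling them "moderate".
COST = {
    "low":      {"low": 0,  "moderate": 1, "high": 2},
    "moderate": {"low": 2,  "moderate": 0, "high": 1},
    "high":     {"low": 10, "moderate": 3, "high": 0},
}

def expected_cost(predicted, class_probs):
    """Average penalty incurred if every case in the node is labelled `predicted`."""
    return sum(class_probs[actual] * COST[actual][predicted] for actual in CLASSES)

def assign_class(class_probs):
    """Pick the label with the lowest expected misclassification cost."""
    return min(CLASSES, key=lambda c: expected_cost(c, class_probs))

# Hypothetical class proportions in one terminal node.
node = {"low": 0.55, "moderate": 0.25, "high": 0.20}

print(max(node, key=node.get))  # low (plain majority vote)
print(assign_class(node))       # moderate (cost-aware assignment)
```

With equal penalties the node would be labelled "low" by majority; with the heavier penalty on missing high-risk cases, the less costly label is "moderate".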

Data Translation Engine
The CART® data-translation engine
supports data conversions for more than 80 file formats, including
popular statistical-analysis packages such as SAS® and SPSS®, databases
such as Oracle and Informix, and spreadsheets such as Microsoft Excel
and Lotus 1-2-3.

Which version do you need?
To accommodate different dataset sizes, CART is
available in several different memory sizes. The standard memory
version of CART for Windows is compiled for a machine with at least
64MB of memory (RAM), and can analyze more than 4.5 million learning
sample observations. The table below shows the approximate number of
learn sample observations that can be used in an analysis for a given
CART version size.
Formerly, CART was compiled into distinct memory versions (64MB, 128MB,
etc). A user's license determined which memory version was delivered.
Thus, the license was tied to the amount of workspace inherent in the
program and (loosely) tied to the amount of data, type of data
(categorical vs. continuous), size of final tree, etc., the user could
analyze.
Licensing and workspace are handled differently in CART 5.0 and onward.
A user's license sets a limit on the amount of learn sample data that
can be analyzed. The learn sample is the data used to grow the maximal
tree. Note that there is no limit to the number of test sample data
points that may be analyzed.
For example, suppose our 32MB version set a learn sample limitation of
8 MB. Each data point occupies 4 bytes. Therefore, an 8 MB license will
allow up to 8 * 1024 * 1024 / 4 = 2,097,152 learn sample data points to
be analyzed. A data point is a single value: one variable for one
observation (1 row by 1 column).
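The same arithmetic can be written as a small helper for checking whether a dataset fits under a given data limit. This is only a sketch of the calculation described above; the function names and the 50-column example are hypothetical.

```python
BYTES_PER_MB = 1024 * 1024  # 1 MB = 1,048,576 bytes
BYTES_PER_VALUE = 4         # each data point occupies 4 bytes

def learn_sample_values(data_limit_mb):
    """Number of learn sample values (rows by columns) allowed by a data limit in MB."""
    return data_limit_mb * BYTES_PER_MB // BYTES_PER_VALUE

def max_rows(data_limit_mb, n_columns):
    """Largest number of learn sample rows that fits for a given number of columns."""
    return learn_sample_values(data_limit_mb) // n_columns

print(learn_sample_values(8))  # 2097152
print(max_rows(8, 50))         # 41943
```

So under an 8 MB data limit, a learn sample with 50 variables could hold roughly 41,900 observations.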
In general, we feel that the analysis workspace provided to build the
tree will be adequate for most modeling scenarios. However, if the user
models a large number of high level categorical predictors, or is using
a high level categorical target, they may encounter workspace
limitations that will not allow the entire learn sample to be used. In
these special cases the user will have to upgrade to a larger memory
version.
The following is a table that describes
the current set of "sizes" available. Please note that the minimum
required RAM is not the same as the learn sample limitation. If you
have any questions regarding the following information, please contact
a sales representative.
- Size = minimum recommended physical memory (RAM) in MB.
- Data Limit MB = Licensed learn sample data size in MB (1 MB = 1,048,576 bytes)
- Data Limit # of values = Licensed # of learn sample values (rows by columns)
- SP cells = max number of 4 byte (Single Precision) workspace elements the program can use.
Single precision workspace may involve virtual memory when a run uses the maximum or near-maximum amount of workspace.
| Size (MB) | Data Limit (MB) | Data Limit # of values | SP Cells (4-byte) |
| 32 | 8 | 2,097,152 | 10,000,000 |
| 64 | 18 | 4,718,592 | 13,500,000 |
| 128 | 45 | 11,796,480 | 33,750,000 |
| 256 | 100 | 26,214,400 | 75,000,000 |
| 512 | 200 | 52,428,800 | 150,000,000 |
| 1024 | 400 | 104,857,600 | 250,000,000 |
| 2048 | 800 | 209,715,200 | 356,000,000 |
* Custom compiles of up to 32 GB are available.
The number of variables CART can handle can be
significantly increased if node sub-sampling is used when searching for
the optimal split. In node sub-sampling, all the data are used to grow
the tree, but only a sub-sample of the data is actually searched in the
largest nodes near the top of the tree. Judiciously chosen sub-sampling
can sometimes double the number of variables CART can search while
growing the tree on all the data.
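A minimal sketch of node sub-sampling follows, assuming a single numeric predictor and a Gini-based split search. The page does not describe how CART draws its sub-samples, so the `max_search` threshold, the fixed random seed, and all function names are hypothetical. The point it illustrates: the split is searched on a sub-sample of a large node, but every record in the node is then sent left or right.

```python
import random

def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    counts = {}
    for y in labels:
        counts[y] = counts.get(y, 0) + 1
    return 1.0 - sum((c / n) ** 2 for c in counts.values())

def best_threshold(rows):
    """rows: list of (x, y). Return the cut point with the lowest weighted Gini."""
    best_t, best_score = None, float("inf")
    xs = sorted({x for x, _ in rows})
    for lo, hi in zip(xs, xs[1:]):
        t = (lo + hi) / 2
        left = [y for x, y in rows if x <= t]
        right = [y for x, y in rows if x > t]
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(rows)
        if score < best_score:
            best_t, best_score = t, score
    return best_t

def split_node(rows, max_search=500, rng=None):
    """Search the split on at most max_search records, then partition all records."""
    rng = rng or random.Random(0)
    search = rows if len(rows) <= max_search else rng.sample(rows, max_search)
    t = best_threshold(search)
    if t is None:  # no usable cut point: every sampled x is identical
        return None, rows, []
    left = [r for r in rows if r[0] <= t]
    right = [r for r in rows if r[0] > t]
    return t, left, right

# Hypothetical node with 10,000 records; the class flips at x = 5000.
rows = [(i, int(i >= 5000)) for i in range(10000)]
t, left, right = split_node(rows)
print(len(left) + len(right))  # 10000
```

Only 500 records are searched, yet all 10,000 are partitioned. The threshold found on the sub-sample is close to, but not necessarily exactly at, the one a full search would find.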

System Technical Requirements
Windows
Minimum System Requirements
- 80486 processor or higher.
- 512MB
of random-access memory (RAM). This value depends on the "size" you
have purchased (64MB, 128MB, 256MB, 512MB, 1GIG). While all versions
may run with a minimum of 32MB of RAM, we CANNOT GUARANTEE that they
will. We highly recommend that you follow the recommended memory
configuration that applies to the particular version you have purchased.
Using less than the recommended memory configuration results in hard
drive paging, reducing performance significantly, or application
instability.
- Hard disk with 40 MB of free space for program files, data file access utility, and sample data files.
- Additional hard disk space for scratch files (with the required space contingent on the size of the input data set).
- CD-ROM or DVD drive.
- Windows XP/2003/2008 and Windows 7.
Recommended System Requirements
Because
Salford Tools are extremely CPU intensive, the faster your CPU, the
faster they will run. For optimal performance, we strongly recommend
they run on a machine with a system configuration equal to, or greater
than, the following:
- Pentium 4 processor running 2.0+ GHz.
- 2 GIG of random-access memory (RAM). This value
depends on the "size" you have purchased (64MB, 128MB, 256MB, 512MB,
1GIG). While all versions may run with a minimum of 32MB of RAM, we
CANNOT GUARANTEE that they will. We highly recommend that you follow the
recommended memory configuration that applies to the particular version
you have purchased. Using less than the recommended memory
configuration results in hard drive paging, reducing performance
significantly, or application instability.
- Hard disk with 40 MB of free space for program files, data file access utility, and sample data files.
- Additional hard disk space for scratch files (with the required space contingent on the size of the input data set).
- CD-ROM or DVD drive.
- Windows XP/2003/2008 and Windows 7.
- 2 GIG of additional hard disk space available for virtual memory and temporary files.
Licensing Application
CART is licensed through an application system ID and an associated
unlock key. After installation is complete, the user will need to email
the application "system ID." This system ID is shown in the License
Information window that appears the first time the application is
started. You can also open this window by selecting the Help->License menu option.
Method 1: Fixed License
With a fixed license, each machine must have its own copy of the
licensed program installed. If your license terms permit more than one
copy, then the license must be activated on each machine that will be
used.
Method 2: Floating License
This method of licensing your program is used if you intend the program
application to be used by more than one user concurrently over a
network. A floating license tracks the number of copies "checked out".
When that number exceeds your license terms, a
message is provided informing the user "all copies are checked out".
The licensed program may be installed on a machine that each client
machine can access. Machines that are not connected to the network must
be issued a fixed license (Method 1 above).
A floating license is particularly useful when the number of potential
users exceeds the number of seats specified in your license terms.
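The check-out behaviour of a floating license can be pictured as a simple seat counter. The sketch below is an illustration only; the `FloatingLicense` class and its method names are hypothetical and do not describe Salford's license manager. The one detail taken from the text above is the message returned when no seats remain.

```python
class FloatingLicense:
    """Tracks which users currently hold one of a fixed number of seats."""

    def __init__(self, seats):
        self.seats = seats
        self.checked_out = set()

    def check_out(self, user):
        """Give the user a seat if one is free; otherwise report that none remain."""
        if user in self.checked_out:
            return "already checked out"
        if len(self.checked_out) >= self.seats:
            return "all copies are checked out"
        self.checked_out.add(user)
        return "ok"

    def check_in(self, user):
        """Release the user's seat so that another user can take it."""
        self.checked_out.discard(user)

pool = FloatingLicense(seats=2)
print(pool.check_out("ann"))   # ok
print(pool.check_out("bob"))   # ok
print(pool.check_out("cara"))  # all copies are checked out
pool.check_in("ann")
print(pool.check_out("cara"))  # ok
```

Three potential users share two seats here, which is the situation where a floating license is most useful.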
UNIX/Linux
Supported Architectures
- Alpha: DEC 3000 or AlphaServer running Tru64 UNIX 5.0 or higher
- Linux/i386: i586 or higher processor; Linux 2.4 or higher kernel; glibc 2.3 or higher
- Linux/AMD64: AMD64 or Intel EM64T processor; Linux 2.6 or higher kernel; glibc 2.3 or higher
- Sun: UltraSPARC processor; Solaris 2.6 or higher
- RS/6000: POWER or PowerPC processor; AIX 4.2 or higher
- HP 9000: PA/RISC 1.1 or higher processor; HP/UX 11.x
- SGI: MIPS 4 or higher processor; IRIX 6.5
Minimum System Requirements
- The minimum requirement for all non-GUI applications is 32 MB of random-access memory (RAM). This value depends on the "size"
you have purchased (64MB, 128MB, 256MB, 512MB, 1GIG).
- Hard disk with 40 MB of free space for program files, data file access utility, and sample data files.
- Additional hard disk space for scratch files (with the required space contingent on the size of the input data set).
Recommended System Requirements
- Recommended random-access memory (RAM) is 1.5 times
the licensed data limit (32 MB, 64 MB, etc), up to the maximum
permitted by the target architecture. On UNIX systems, it is generally
recommended that there be at least twice as much swap space as there is
RAM.
- Hard disk with 40 MB of free space for program files, data file access utility, and sample data files.
- Additional hard disk space for scratch files (with the required space contingent on the size of the input data set).
All Salford apps are very CPU intensive, so more memory and a faster CPU are always helpful.
© Copyright 2010 Salford-Systems Inc.

