SOFTWARE/CART

 


Product Description

Salford Systems' flagship data mining software, CART®, is a robust, easy-to-use decision tree that automatically sifts large, complex databases, searching for and isolating significant patterns and relationships. This discovered knowledge is then used to generate reliable, easy-to-grasp predictive models for applications such as finding best prospects and customers, targeted marketing, detecting credit card fraud, and managing credit risk.

Designed for both non-technical and technical business users, CART can quickly reveal important data relationships that could remain hidden using other analytical tools. The most recent 2008 release, CART 6.0, includes modeling automation technology that dramatically accelerates the process of generating accurate and robust models for deployment in core business functions. CART was the primary tool used to win the KDDCup 2000 web-mining competition and is currently in use in major web applications.

Technically, Classification And Regression Trees (CART) is based on landmark mathematical theory introduced in 1984 by four world-renowned statisticians at Stanford University and the University of California at Berkeley. Salford Systems’ implementation of CART is the only decision tree software embodying the proprietary code written by CART co-author Professor Jerome H. Friedman.

The CART creators continue to collaborate with Salford Systems to enhance CART with proprietary advances. With CART 6.0 ProEX, Salford has introduced patented extensions to CART specifically designed to enhance results for market research and web analytics. CART supports high-speed deployment, allowing Salford models to predict and score in real time on a massive scale.

Principal Characteristics

CART uses an intuitive, Windows-based interface, making it accessible to both technical and non-technical users. Underlying the "easy" interface, however, is a mature theoretical foundation that distinguishes CART from other methodologies and other decision trees. Salford Systems' CART is the only decision tree system based on the original CART code developed by world-renowned Stanford University and University of California at Berkeley statisticians; this code now includes enhancements that were co-developed by Salford Systems and CART's originators.

Based on a decade of machine learning and statistical research, CART provides stable performance and reliable results.

Characteristics of its methodology are: 

  • A reliable pruning strategy
    CART's developers determined definitively that no stopping rule could be relied on to discover the optimal tree, so they introduced the notion of over-growing trees and then pruning back.  This idea, fundamental to CART, ensures that important structure is not overlooked by stopping too soon. Other decision tree techniques use problematic stopping rules.
  • A powerful binary split search approach
    CART's binary decision trees are more sparing with data and detect more structure before too little data are left for learning. Other decision tree approaches use multi-way splits that fragment the data rapidly, making it difficult to detect rules that require broad ranges of data to discover.
  • Automatic self validation procedures
    In the search for patterns in databases it is essential to avoid the trap of "overfitting," or finding patterns that apply only to the training data. CART's embedded test disciplines ensure that the patterns found will hold up when applied to new data. Further, the testing and selection of the optimal tree are an integral part of the CART algorithm. In other decision tree techniques, testing is conducted after the fact and tree selection is left up to the user.
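The binary split search described above can be sketched in a few lines. This is only an illustration, not Salford Systems' code: it assumes a single continuous predictor, scores each candidate split with the Gini criterion (named further down this page as CART's default), and uses hypothetical function names and toy data.

```python
from collections import Counter


def gini(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def best_binary_split(values, labels):
    """Return (threshold, weighted_impurity) for the best split `value <= threshold`.

    If no split improves on the parent node, the threshold is None.
    """
    n = len(values)
    best = (None, gini(labels))  # baseline: impurity of the unsplit node
    distinct = sorted(set(values))
    # Candidate thresholds are midpoints between consecutive distinct values.
    for lo, hi in zip(distinct, distinct[1:]):
        threshold = (lo + hi) / 2
        left = [y for x, y in zip(values, labels) if x <= threshold]
        right = [y for x, y in zip(values, labels) if x > threshold]
        weighted = (len(left) * gini(left) + len(right) * gini(right)) / n
        if weighted < best[1]:
            best = (threshold, weighted)
    return best


# Toy data (hypothetical): customer age and whether the customer bought.
ages = [22, 25, 31, 38, 45, 52]
bought = ["no", "no", "no", "yes", "yes", "yes"]
threshold, impurity = best_binary_split(ages, bought)
print(threshold, impurity)
```

Each split sends every record to exactly one of two children, so a node of n records leaves about n/2 records per child, whereas a k-way split leaves about n/k. That is the sense in which binary splits are "more sparing with data."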
In addition, CART accommodates many different types of real world modeling problems by providing a unique combination of automated and/or user-specified solutions: 
  • Surrogate splitters intelligently handle missing values
    CART handles missing values in the database by substituting "surrogate splitters," which are back-up rules that closely mimic the action of primary splitting rules. The surrogate splitter contains information that is typically similar to what would be found in the primary splitter. Other products' approaches treat all records with missing values as if the records all had the same unknown value; with that approach all such "missings" are assigned to the same bin. In CART, each record is processed using data specific to that record, thus allowing records with different data patterns to be handled differently, resulting in a better characterization of the data.
  • Adjustable misclassification penalties
    CART can accommodate situations in which some misclassifications, or cases that have been incorrectly classified, are more serious than others. CART users can specify a higher penalty for misclassifying certain data, and the software will steer the tree away from that type of error. Further, when CART cannot guarantee a correct classification, it will try to ensure that the error it does make is less costly. If credit risk is classified as low, moderate, or high, for example, it would be much more costly to classify a high risk person as low risk than as moderate risk. Traditional data mining tools cannot distinguish between these errors.
  • Alternative splitting criteria
    CART includes seven single-variable splitting criteria - Gini, symmetric Gini, twoing, ordered twoing and class probability for classification trees, and least squares and least absolute deviation for regression trees - and one multi-variable splitting criterion, the linear combinations method. The default Gini method typically performs best, but, given specific circumstances, other methods can generate more accurate models. CART's unique "twoing" procedure, for example, is tuned for classification problems with many classes, such as modeling which of 170 products would be chosen by a given consumer. To deal more effectively with select data patterns, CART also offers splits on linear combinations of continuous predictor variables.
  • Model deployment
    Any CART model can be deployed by translating it into one of the supported languages (SAS®-compatible, C, Java, and PMML) or into the classic text output. This is critical for using your CART trees in large-scale production work. The decision logic of a CART tree, including the surrogate rules used when primary splitting values are missing, is implemented automatically. The resulting source code can be dropped into an external application, eliminating errors due to hand coding of decision rules and enabling fast and accurate model deployment.
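The adjustable misclassification penalties described in the list above can be illustrated with the credit-risk example. The sketch below is an assumption-laden simplification: the cost matrix and node probabilities are invented, the function names are hypothetical, and it only shows how a terminal node could be labeled with the least costly class. It does not show how CART itself uses penalties while growing the tree.

```python
# Hypothetical cost matrix: COST[true_class][predicted_class].
# Calling a high-risk person "low" is made far more costly than calling them "moderate".
COST = {
    "low":      {"low": 0,  "moderate": 1, "high": 2},
    "moderate": {"low": 2,  "moderate": 0, "high": 1},
    "high":     {"low": 10, "moderate": 3, "high": 0},
}


def expected_cost(class_probs, predicted):
    """Expected penalty of labeling a node `predicted`, given its class proportions."""
    return sum(p * COST[true][predicted] for true, p in class_probs.items())


def least_cost_class(class_probs):
    """Pick the label with the smallest expected misclassification cost."""
    return min(COST, key=lambda predicted: expected_cost(class_probs, predicted))


# Hypothetical class proportions in one terminal node.
node = {"low": 0.5, "moderate": 0.3, "high": 0.2}
majority = max(node, key=node.get)
print(majority, least_cost_class(node))
```

With equal penalties the node would simply take its majority class. With the unequal penalties assumed here, the label shifts away from the error that would be most expensive.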
 

Data Translation Engine

The CART® data-translation engine supports data conversions for more than 80 file formats, including popular statistical-analysis packages such as SAS® and SPSS®, databases such as Oracle and Informix, and spreadsheets such as Microsoft Excel and Lotus 1-2-3.

Which version do you need?

To accommodate different dataset sizes, CART is available in several different memory sizes. The standard memory version of CART for Windows is compiled for a machine with at least 64MB of memory (RAM), and can analyze more than 4.5 million learning sample observations. The table below shows the approximate number of learn sample observations that can be used in an analysis for a given CART version size.

Formerly, CART was compiled into distinct memory versions (64MB, 128MB, etc). A user's license determined which memory version was delivered. Thus, the license was tied to the amount of workspace inherent in the program and (loosely) tied to the amount of data, type of data (categorical vs. continuous), size of final tree, etc., the user could analyze.

Licensing and workspace are handled differently in CART 5.0 and onward. A user's license sets a limit on the amount of learn sample data that can be analyzed. The learn sample is the data used to grow the maximal tree. Note that there is no limit to the number of test sample data points that may be analyzed.

For example, suppose our 32MB version sets a learn sample limit of 8 MB. Each data point occupies 4 bytes. Therefore, an 8 MB license allows up to 8 * 1024 * 1024 / 4 = 2,097,152 learn sample data points to be analyzed. A data point is a single value: one variable for one observation (one column by one row).
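The same arithmetic can be written as a small helper. This is a sketch using only the figures stated above (4 bytes per data point, 1 MB = 1,048,576 bytes). The function names and the 50-column example are hypothetical.

```python
BYTES_PER_VALUE = 4         # each data point occupies 4 bytes
BYTES_PER_MB = 1024 * 1024  # 1 MB = 1,048,576 bytes


def max_learn_values(data_limit_mb):
    """Number of learn sample data points (rows x columns) a data limit allows."""
    return data_limit_mb * BYTES_PER_MB // BYTES_PER_VALUE


def max_learn_rows(data_limit_mb, n_columns):
    """Approximate number of learn sample rows for a dataset with n_columns variables."""
    return max_learn_values(data_limit_mb) // n_columns


print(max_learn_values(8))    # data points allowed by an 8 MB limit
print(max_learn_rows(8, 50))  # rows allowed if every row has 50 variables
```

Because the limit counts values rather than rows, a wider dataset reaches the limit with fewer observations.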

In general, we feel that the analysis workspace provided to build the tree will be adequate for most modeling scenarios. However, if the user models a large number of high level categorical predictors, or is using a high level categorical target, they may encounter workspace limitations that will not allow the entire learn sample to be used. In these special cases the user will have to upgrade to a larger memory version.

The following is a table that describes the current set of "sizes" available. Please note that the minimum required RAM is not the same as the learn sample limitation. If you have any questions regarding the following information, please contact a sales representative.

  • Size = minimum recommended physical memory (RAM) in MB.
  • Data Limit MB = Licensed learn sample data size in MB (1 MB = 1,048,576 bytes)
  • Data Limit # of values = Licensed # of learn sample values (rows by columns)
  • SP cells = max number of 4 byte (Single Precision) workspace elements the program can use.
Single precision workspace may involve virtual memory when a run uses the maximum or near-maximum amount of workspace.

Size (MB)   Data Limit (MB)   Data Limit # of values   SP Cells (4-byte)
32          8                 2,097,152                10,000,000
64          18                4,718,592                13,500,000
128         45                11,796,480               33,750,000
256         100               26,214,400               75,000,000
512         200               52,428,800               150,000,000
1024        400               104,857,600              250,000,000
2048        800               209,715,200              356,000,000

* Custom compiles of up to 32 GB are available.

The number of variables CART can handle can be significantly increased if node sub-sampling is used when searching for the optimal split. In node sub-sampling, all the data are used to grow the tree, but only a sub-sample of the data is actually searched in the largest nodes near the top of the tree. Judiciously chosen sub-sampling can sometimes double the number of variables CART can search while growing the tree on all the data.
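Node sub-sampling can be sketched as follows. This is a simplified illustration under stated assumptions, not CART's implementation: it uses one continuous predictor, a Gini-based split search, and hypothetical names (`split_node`, `max_search`). The point is that the split is searched on a random sub-sample of a large node, while every record in the node is still routed to a child.

```python
import random


def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))


def find_threshold(rows):
    """Exhaustive search for the best `x <= t` split; rows are (x, label) pairs."""
    xs = sorted({x for x, _ in rows})
    best_t, best_score = None, float("inf")
    for lo, hi in zip(xs, xs[1:]):
        t = (lo + hi) / 2
        left = [y for x, y in rows if x <= t]
        right = [y for x, y in rows if x > t]
        score = len(left) * gini(left) + len(right) * gini(right)
        if score < best_score:
            best_t, best_score = t, score
    return best_t


def split_node(rows, max_search=200, seed=0):
    """Search the split on at most `max_search` rows, then partition ALL rows."""
    searched = rows
    if len(rows) > max_search:  # only large nodes are sub-sampled
        searched = random.Random(seed).sample(rows, max_search)
    t = find_threshold(searched)
    left = [r for r in rows if r[0] <= t]
    right = [r for r in rows if r[0] > t]
    return t, left, right


# Hypothetical data: 10,000 records whose class changes at x = 5000.
rows = [(x, "yes" if x >= 5000 else "no") for x in range(10000)]
t, left, right = split_node(rows)
print(t, len(left), len(right))
```

The threshold found on the sub-sample is only approximately the one a full search would find, which is the trade-off for searching far fewer records in the largest nodes.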

System Technical Requirements

Windows

Minimum System Requirements

  • 80486 processor or higher.
  • 512MB of random-access memory (RAM). This value depends on the "size" you have purchased (64MB, 128MB, 256MB, 512MB, 1GB). While all versions may run with a minimum of 32MB of RAM, we CANNOT GUARANTEE that they will. We highly recommend that you follow the recommended memory configuration that applies to the particular version you have purchased. Using less than the recommended memory configuration results in hard drive paging, which can significantly reduce performance or cause application instability.
  • Hard disk with 40 MB of free space for program files, data file access utility, and sample data files.
  • Additional hard disk space for scratch files (with the required space contingent on the size of the input data set).
  • CD-ROM or DVD drive.
  • Windows XP/2003/2008 and Windows 7.

Recommended System Requirements

Because Salford Tools are extremely CPU intensive, the faster your CPU, the faster they will run. For optimal performance, we strongly recommend they run on a machine with a system configuration equal to, or greater than, the following:

  • Pentium 4 processor running 2.0+ GHz.
  • 2 GB of random-access memory (RAM). This value depends on the "size" you have purchased (64MB, 128MB, 256MB, 512MB, 1GB). While all versions may run with a minimum of 32MB of RAM, we CANNOT GUARANTEE that they will. We highly recommend that you follow the recommended memory configuration that applies to the particular version you have purchased. Using less than the recommended memory configuration results in hard drive paging, which can significantly reduce performance or cause application instability.
  • Hard disk with 40 MB of free space for program files, data file access utility, and sample data files.
  • Additional hard disk space for scratch files (with the required space contingent on the size of the input data set).
  • CD-ROM or DVD drive.
  • Windows XP/2003/2008 and Windows 7.
  • 2 GB of additional hard disk space available for virtual memory and temporary files.
Licensing Application

CART is licensed through an application system ID and an associated unlock key. Once installation is complete, the user will need to email the application's "system ID." The system ID is shown in the License Information window, which appears the first time the application is started. You can also open this window by selecting the Help->License menu option.

Method 1: Fixed License
With a fixed license, each machine must have its own copy of the licensed program installed. If your license terms permit more than one copy, then the license must be activated on each machine that will be used.

Method 2: Floating License
This method of licensing is used if you intend the application to be used by more than one user concurrently over a network. A floating license tracks the number of copies "checked out." When a request would exceed the number permitted by your license terms, the user sees the message "all copies are checked out." The licensed program may be installed on a machine that each client machine can access. Machines that are not connected to the network must be issued a fixed license (Method 1 above).

A floating license is particularly useful when the number of potential users exceeds the number of seats specified in your license terms.
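The check-out behavior of a floating license can be modeled with a toy class. This is purely illustrative and assumes nothing about how Salford's license manager is actually implemented. The class name, method names, and user names are hypothetical; only the "all copies are checked out" message comes from the text above.

```python
class FloatingLicense:
    """Toy model: a fixed number of seats shared by any number of potential users."""

    def __init__(self, seats):
        self.seats = seats
        self.checked_out = set()  # users currently holding a copy

    def check_out(self, user):
        """Try to give `user` a copy; return a status message."""
        if user in self.checked_out:
            return "already checked out"
        if len(self.checked_out) >= self.seats:
            return "all copies are checked out"
        self.checked_out.add(user)
        return "ok"

    def check_in(self, user):
        """Release the copy held by `user`, if any."""
        self.checked_out.discard(user)


# Three potential users share a two-seat license.
lic = FloatingLicense(seats=2)
results = [lic.check_out(u) for u in ("ana", "ben", "cho")]
lic.check_in("ana")           # a seat is freed
retry = lic.check_out("cho")  # the third user can now start the program
print(results, retry)
```

The example shows why a floating license suits a team larger than its seat count: users who are not running the program at the same time never compete for a seat.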

UNIX/Linux

Supported Architectures
  • Alpha: DEC 3000 or AlphaServer running Tru64 UNIX 5.0 or higher
  • Linux/i386: i586 or higher processor; Linux 2.4 or higher kernel; glibc 2.3 or higher
  • Linux/AMD64: AMD64 or Intel EM64T processor; Linux 2.6 or higher kernel; glibc 2.3 or higher
  • Sun: UltraSPARC processor; Solaris 2.6 or higher
  • RS/6000: POWER or PowerPC processor; AIX 4.2 or higher
  • HP 9000: PA/RISC 1.1 or higher processor; HP/UX 11.x
  • SGI: MIPS 4 or higher processor; IRIX 6.5
Minimum System Requirements
  • The minimum requirement for all non-GUI applications is 32 MB of random-access memory (RAM). The actual value depends on the "size" you have purchased (64MB, 128MB, 256MB, 512MB, 1GB).
  • Hard disk with 40 MB of free space for program files, data file access utility, and sample data files.
  • Additional hard disk space for scratch files (with the required space contingent on the size of the input data set).
Recommended System Requirements
  • Recommended random-access memory (RAM) is 1.5 times the licensed data limit (32 MB, 64 MB, etc), up to the maximum permitted by the target architecture. On UNIX systems, it is generally recommended that there be at least twice as much swap space as there is RAM.
  • Hard disk with 40 MB of free space for program files, data file access utility, and sample data files.
  • Additional hard disk space for scratch files (with the required space contingent on the size of the input data set).
All Salford apps are very CPU intensive, so more memory and a faster CPU are always helpful.

© Copyright 2010 Salford-Systems Inc.


 
Copyright © 2010 TStat All rights reserved via Rettangolo, 12/14 - 67039 - Sulmona (AQ) - Italia