WordStat is flexible, easy-to-use text analysis software, whether you need text mining tools for fast extraction of themes and trends or careful and precise measurement with state-of-the-art quantitative content analysis tools. WordStat's seamless integration with SimStat (our statistical data analysis tool) and QDA Miner (our qualitative data analysis software) gives you unprecedented flexibility for analyzing text and relating its content to structured information, including numerical and categorical data.


What is it used for?

WordStat can be used by anyone who needs to quickly extract and analyze information from large numbers of documents. Our content analysis and text mining software is used for:

Content analysis of open-ended responses, interview or focus group transcripts
Business intelligence and competitive website analysis
Information extraction and knowledge discovery from incident reports, customer complaints
Content analysis of news coverage or scientific literature
Automatic tagging and classification of documents
Fraud detection, authorship attribution, patent analysis
Taxonomy development and validation



Many WordStat users develop their own content analysis dictionaries. Such dictionaries are usually customized to the type of data being analyzed and to the variables that need to be measured. However, having access to existing dictionaries from other authors may be useful in many ways. It allows analysts to easily assess other dimensions and quickly get a different perspective on their text data. It also makes their results comparable to those published in other studies. Such dictionaries may also be a great source of inspiration for developing one’s own dictionary.



The WordStat software development kit (SDK) provides a solution, allowing models developed with the WordStat desktop tool to be used in other applications written in other programming languages such as C++, Delphi, C#, VB.Net and so on.


An example of such integration would be applying a categorization model to a company's customer feedback collection system in order to automatically measure references to specific topics and to classify each piece of feedback as positive, negative or neutral.
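The kind of categorization described above can be illustrated with a minimal Python sketch. It does not use the WordStat SDK or reproduce its API; the dictionary contents, category names, and the `categorize` and `sentiment` helpers are all hypothetical, and the positive/negative rule is a deliberately simple assumption.

```python
import re
from collections import Counter
from fnmatch import fnmatchcase

# Hypothetical categorization dictionary: category -> word patterns (* is a wildcard).
DICTIONARY = {
    "PRICE": ["price*", "cost*", "expensive", "cheap*"],
    "SERVICE": ["staff", "support", "service*"],
    "POSITIVE": ["good", "great", "excellent", "friendly"],
    "NEGATIVE": ["bad", "slow", "rude", "poor"],
}

def categorize(text):
    """Count how many words of `text` fall into each dictionary category."""
    counts = Counter()
    for word in re.findall(r"[a-z']+", text.lower()):
        for category, patterns in DICTIONARY.items():
            if any(fnmatchcase(word, p) for p in patterns):
                counts[category] += 1
    return counts

def sentiment(counts):
    """Label feedback by comparing positive and negative hits (assumed rule)."""
    diff = counts["POSITIVE"] - counts["NEGATIVE"]
    return "positive" if diff > 0 else "negative" if diff < 0 else "neutral"

feedback = "The support staff was friendly, but prices are too expensive."
counts = categorize(feedback)
print(dict(counts), sentiment(counts))
```

A real categorization model would typically add phrase matching, rules, and weighting; this sketch only shows the basic idea of mapping words to categories and deriving a label from the category counts.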



Content analysis on collections of ANSI or RTF documents and short alphanumeric variables.

Stemming in 18 languages.

Dictionary moderated lemmatization and stemming (English, French, Italian, German and Spanish; contact us for other languages).

Ability to call external text pre-processing EXE or DLL.

Optional exclusion of pronouns, conjunctions, etc., through user-defined exclusion lists (stop lists).

Categorization of words or phrases using existing or user-defined dictionaries.

Word categorization based on Boolean (AND, OR, NOT) and proximity rules (NEAR, AFTER, BEFORE).

Word and phrase substitution and scoring using wildcards and weighting.

Frequency analysis on keywords, phrases, derived categories or concepts, or user-defined codes entered manually within a text.

Interactive development and easy maintenance of hierarchical dictionaries, taxonomies, or categorization schema.

Drag-and-drop editor for easy assignment of words and phrases to categories.

Ability to restrict the analysis to specific portions of a text or to exclude comments and annotations.

Ability to perform an analysis on a random sample of cases.

Integrated spell-checking with support for more than 20 languages such as English, French, Spanish, etc.

Integrated thesaurus to assist the creation of taxonomies and comprehensive categorization schemes (English, French, Spanish, Italian, Portuguese and German).

Powerful case filtering on any numeric or alphanumeric field and on code occurrence (with AND, OR, and NOT Boolean operators)

Prints presentation-quality tables.

Imports ANSI and Unicode text files, MS Word, RTF and HTML, PDF.

Exports any table to Excel, SPSS, Stata, ASCII, Tab separated or comma separated value files, or HTML files.

Flexible keyword highlighting (the text editor can display all categories using different colors).
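As an illustration of two of the features listed above (stop lists and proximity rules), here is a minimal Python sketch. The stop list, the function names, and the decision to measure distances after stop words are removed are assumptions made for the example, not a description of how WordStat implements these rules.

```python
import re

# Illustrative stop list; a real one would be much longer and user-defined.
STOP_LIST = {"the", "a", "an", "and", "or", "it", "is", "was", "to", "of"}

def tokenize(text, stop_list=STOP_LIST):
    """Lowercase, split into words, and drop stop-list entries."""
    return [w for w in re.findall(r"[a-z']+", text.lower()) if w not in stop_list]

def near(tokens, first, second, window):
    """True if `first` and `second` occur within `window` tokens of each other."""
    pos_a = [i for i, t in enumerate(tokens) if t == first]
    pos_b = [i for i, t in enumerate(tokens) if t == second]
    return any(abs(i - j) <= window for i in pos_a for j in pos_b)

def before(tokens, first, second, window):
    """True if `first` precedes `second` by at most `window` tokens."""
    pos_a = [i for i, t in enumerate(tokens) if t == first]
    pos_b = [i for i, t in enumerate(tokens) if t == second]
    return any(0 < j - i <= window for i in pos_a for j in pos_b)

tokens = tokenize("The delivery was late and the package was damaged.")
print(tokens)
print(near(tokens, "package", "damaged", 1))
print(before(tokens, "late", "delivery", 3))
```

Boolean rules (AND, OR, NOT) can then be expressed by combining such checks with Python's own `and`, `or`, and `not`.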



Univariate word frequency analysis (word or category count and record occurrence).

Word x word co-occurrence matrix.

Word x case data matrix.

Integrated multidimensional scaling with 2D and 3D maps.

Proximity plot.
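To make the difference between word count, record occurrence, and co-occurrence concrete, the following Python sketch computes all three on a tiny invented data set. The `cases` list is hypothetical, and co-occurrence is defined here as "both words appear in the same case", which is only one of several possible criteria.

```python
from collections import Counter
from itertools import combinations

# Hypothetical tokenized cases (one list of words per record).
cases = [
    ["price", "quality", "service"],
    ["price", "service"],
    ["quality", "price", "price"],
]

# Word frequency: total number of times each word appears.
frequency = Counter(w for case in cases for w in case)

# Record occurrence: number of cases that contain each word at least once.
case_occurrence = Counter(w for case in cases for w in set(case))

# Word x word co-occurrence: number of cases in which both words appear.
cooccurrence = Counter()
for case in cases:
    for a, b in combinations(sorted(set(case)), 2):
        cooccurrence[(a, b)] += 1

print(frequency["price"], case_occurrence["price"])
print(dict(cooccurrence))
```

A co-occurrence table of this kind is the usual input for clustering, multidimensional scaling, or proximity plots.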


Topic modeling tool automatically extracts topics by applying factor analysis to word x segment matrices.

Vocabulary finder extracts technical terms, product and company names as well as common misspellings.

Pattern based named-entity extraction.

Phrase finder allows one to easily identify recurring phrases and expressions.
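The idea behind finding recurring phrases can be sketched in Python by counting word n-grams and keeping those that repeat. The `find_phrases` function, its parameters, and the sample texts are assumptions for illustration; an actual phrase finder would usually also filter out phrases that start or end with stop words.

```python
import re
from collections import Counter

def find_phrases(texts, min_words=2, max_words=3, min_count=2):
    """Return phrases of min_words..max_words words that occur at least min_count times."""
    counts = Counter()
    for text in texts:
        words = re.findall(r"[a-z']+", text.lower())
        for n in range(min_words, max_words + 1):
            for i in range(len(words) - n + 1):
                counts[" ".join(words[i:i + n])] += 1
    return {phrase: c for phrase, c in counts.items() if c >= min_count}

texts = [
    "Customer service was slow",
    "Great customer service today",
    "The customer service desk",
]
print(find_phrases(texts))
```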





Ability to create norm files based on frequency analysis of words or content categories.

Comparison of obtained frequencies to previously saved norm files.



A powerful keyword retrieval function allows identification of text units (documents, paragraphs or sentences) containing one keyword or a combination of keywords, with optional filtering of cases.

Ability to attach QDA Miner codes to retrieved segments.

Retrieved segments may be exported to disk in tabular format (Excel or delimited text files) or as text reports (Rich Text Format).


Integrated clustering and dendrogram display of keyword co-occurrence.

First- and second-order proximity analysis.

Proximity plot to easily identify all keywords that co-occur with a target keyword.

2D and 3D multidimensional scaling on either joint frequency or co-occurrence of words or categories.

Flexible keyword co-occurrence criteria (within a case, a sentence, a paragraph, a window of n words, a user-defined segment) as well as clustering methods (first- and second-order proximity, choice of similarity measures).

Easy text retrieval from dendrogram or proximity plots.



Hierarchical clustering, multidimensional scaling and proximity plot may be used to explore the similarity between documents or cases.



Can perform univariate frequency analysis and crosstabulation on information stored in several alphanumeric fields (memo or string variables).

Comparison of keyword occurrence between different fields.

Computes inter-rater agreement measures (pct. of agreement, Cohen’s Kappa, Scott’s Pi, Krippendorff’s R and r-bar, free marginal) based on codes manually entered in different variables.
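Three of the agreement measures named above follow standard textbook formulas and can be sketched in a few lines of Python. The two rater lists are invented data, and the functions assume two raters, nominal codes, and expected agreement below 1 (otherwise the denominator would be zero).

```python
from collections import Counter

def percent_agreement(a, b):
    """Share of units on which both raters assigned the same code."""
    return sum(x == y for x, y in zip(a, b)) / len(a)

def cohens_kappa(a, b):
    """Chance-corrected agreement using each rater's own code distribution."""
    n = len(a)
    freq_a, freq_b = Counter(a), Counter(b)
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    return (percent_agreement(a, b) - expected) / (1 - expected)

def scotts_pi(a, b):
    """Chance-corrected agreement using the pooled code distribution."""
    n = len(a)
    pooled = Counter(a) + Counter(b)
    expected = sum((count / (2 * n)) ** 2 for count in pooled.values())
    return (percent_agreement(a, b) - expected) / (1 - expected)

# Hypothetical codes entered by two raters for the same eight cases.
rater1 = ["pos", "pos", "neg", "neg", "pos", "neg", "pos", "neg"]
rater2 = ["pos", "neg", "neg", "neg", "pos", "neg", "pos", "pos"]

print(percent_agreement(rater1, rater2))
print(cohens_kappa(rater1, rater2))
print(scotts_pi(rater1, rater2))
```

In this example both raters use each code equally often, so Cohen's Kappa and Scott's Pi coincide; they diverge when the raters' code distributions differ.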



Bivariate comparison between any textual field and any nominal or ordinal variable (such as the sex of the respondent, specific subgroups, years of publication, etc.).

Choice between 11 different association measures to assess the relationship between word occurrence and nominal or ordinal variables (Chi-square, Likelihood ratio, Tau-a, Tau-b, Tau-c, symmetric Somers’ D, asymmetric Somers’ Dxy and Dyx, Gamma, Pearson’s R, Spearman’s Rho).

Computation of statistics on either absolute or relative frequencies.

Ability to sort the matrix alphabetically by word, by word frequency or word occurrence, or by the obtained statistic or its probability.

Visually compare items between subgroups using bar charts and line charts.
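As a worked example of one of the association measures listed above, the Python sketch below computes the Pearson chi-square statistic for a crosstabulation of keyword occurrence by subgroup. The counts are invented, and the `chi_square` helper is a plain implementation of the standard formula rather than WordStat's own routine; it returns only the statistic, not its probability.

```python
def chi_square(table):
    """Pearson chi-square statistic for a contingency table given as a list of rows."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / total
            stat += (observed - expected) ** 2 / expected
    return stat

# Hypothetical crosstab: rows = keyword present / absent, columns = two subgroups.
table = [
    [30, 10],  # cases containing the keyword
    [20, 40],  # cases not containing the keyword
]
print(round(chi_square(table), 3))
```

With these counts the expected cell values are 20, 20, 30, and 30, which gives a statistic of 50/3, or about 16.667, on one degree of freedom.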


Correspondence analysis (statistics, 2D & 3D joint plots). This feature is accessible from the crosstab page and allows one to see graphically the relationship between nominal variables and codes resulting from a content analysis.

Heatmap plot (with dual-clustering of keywords and variables)


Machine learning algorithms (Naive Bayes and K-Nearest Neighbors) for document classification.

Flexible feature selection for automatic selection of best subsets of attributes.

Numerous validation methods (leave-one-out, n-fold cross-validation, split sample).

Experimentation module allows easy comparison of predictive models and fine-tuning of classification models.

Classification models may be saved to disk and applied later using either a standalone document classification utility program, a command line program or a programming library. Note: The command line program and the programming library are part of the WordStat Software Developer’s Kit (SDK), which is sold separately.
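To show how Naive Bayes classification and leave-one-out validation fit together, here is a compact Python sketch. It is a generic multinomial Naive Bayes with add-one smoothing written for illustration; the training documents, labels, and function names are hypothetical, and nothing here reflects WordStat's internal implementation or model file format.

```python
import math
from collections import Counter, defaultdict

def train(docs):
    """docs: list of (tokens, label). Returns class counts, per-class word counts, vocabulary."""
    class_counts = Counter(label for _, label in docs)
    word_counts = defaultdict(Counter)
    for tokens, label in docs:
        word_counts[label].update(tokens)
    vocab = {w for tokens, _ in docs for w in tokens}
    return class_counts, word_counts, vocab

def predict(model, tokens):
    """Return the label with the highest log-probability (add-one smoothing)."""
    class_counts, word_counts, vocab = model
    total_docs = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label in sorted(class_counts):
        total_words = sum(word_counts[label].values())
        score = math.log(class_counts[label] / total_docs)
        for w in tokens:
            if w in vocab:  # words never seen in training are ignored
                score += math.log((word_counts[label][w] + 1) / (total_words + len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label

def leave_one_out_accuracy(docs):
    """Train on all documents but one, test on the held-out one, and repeat for each."""
    hits = 0
    for i, (tokens, label) in enumerate(docs):
        model = train(docs[:i] + docs[i + 1:])
        hits += predict(model, tokens) == label
    return hits / len(docs)

# Hypothetical labeled documents, already tokenized.
docs = [
    (["great", "service"], "pos"),
    (["great", "price"], "pos"),
    (["excellent", "service"], "pos"),
    (["poor", "service"], "neg"),
    (["poor", "price"], "neg"),
    (["rude", "staff"], "neg"),
]
model = train(docs)
print(predict(model, ["great", "staff"]))
print(leave_one_out_accuracy(docs))
```

Leave-one-out is the extreme case of n-fold cross-validation in which n equals the number of documents; with larger collections, a smaller number of folds is usually more practical.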


Ability to display a KWIC table to examine the textual context of a word, word pattern, or category.

Ability to sort the table on any independent (numeric) variables.

Ability to jump from a KWIC keyword to the textual variable in order to view or edit the original text.

KWIC list can be saved in data files for further processing.

Customizable KWIC display (paragraph, sentence or user defined segment).

Concordance report (displays all hits as a list of paragraphs, sentences or user defined segments)
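A keyword-in-context (KWIC) table of the kind described above can be approximated in Python as follows. The `kwic` function and its `width` parameter (number of context words on each side) are assumptions for this sketch; the product's display options, such as paragraph or sentence context, are not reproduced.

```python
import re

def kwic(text, keyword, width=3):
    """Return (left context, keyword, right context) rows for each hit of `keyword`."""
    words = re.findall(r"\w+", text)
    rows = []
    for i, word in enumerate(words):
        if word.lower() == keyword.lower():
            left = " ".join(words[max(0, i - width):i])
            right = " ".join(words[i + 1:i + 1 + width])
            rows.append((left, word, right))
    return rows

text = "The service was slow. Staff said the service would improve soon."
for left, hit, right in kwic(text, "service", width=2):
    print(f"{left:>12} | {hit} | {right}")
```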





Alphanumeric variables can be stored in the same file as all other numeric variables.

Variable selection, statistical analysis and content analysis are performed within the same application program.

Matrix outputs are automatically added to existing statistical outputs.

New variables representing occurrence of words, keywords or concepts can be added to the existing data file or exported to a new data file in order to be submitted to further statistical analysis (such as cluster analysis on words or cases, principal coordinate analysis, correspondence analysis, multiple regression, etc.).

Data can be imported from and exported to different file formats including dBase, Paradox, Excel, Quattro Pro, Lotus 1-2-3, SPSS for DOS, SPSS for Windows, comma or tab separated text files, etc.

Ability to perform numeric and alphanumeric transformation or to apply filters on records of the data file to restrict the analysis to specific subgroups.



Dictionary building assistant to find related words (synonyms, antonyms, holonyms, meronyms, hypernyms, hyponyms) in a WordNet based thesaurus (English only). (100,000 synonyms, 120,000 root words)


WS Document Classifier, a small standalone application to apply previously saved categorization and classification models to external documents.

Document Conversion Wizard: a utility program to easily import documents. Various file formats may be imported directly, such as plain text (ANSI, Unicode), HTML, RTF, MS Word, WordPerfect, and Adobe PDF.

Optional removal of leading and trailing spaces and hard returns.

Extraction of numeric, alphanumeric and date variables from structured documents.

Extraction options may be saved on disk and later retrieved.

Documents may be stored as plain ANSI text or as RTF documents.

Version 7.1 features a new GIS mapping and data editing module that allows one to relate unstructured text data to geographic information and to create maps and other graphic displays for analysis and presentation, such as:



Plots of data points can be created from words, phrases, or topics extracted from unstructured text fields. One can quickly filter data points on categorical, numerical, and date variables or create dynamic range displays and custom animations to easily identify temporal trends, cyclical patterns or relationships to numerical variables. One may also customize and annotate single data points.





Density of data points can be visualized with heatmap displays to easily identify customer concentrations, crime hot spots, or disease outbreaks. Users can also create heatmaps on all data points or just on selected regions and choose from a wide variety of color ramps or create their own.




Users can create layers from various vector file formats and produce choropleth maps to represent point density, demographic information stored in shapefiles, or statistical summaries of numerical values associated with text segments. One can also easily adjust the color range, the number of steps and level of transparency.





An integrated geocoding service is available in WordStat to transform references to cities, states, provinces, countries, postal codes, and IP addresses into geographical coordinates.



Natively opens and displays a wide range of vector, image, grid, and SQL database layer formats, including advanced spatial server geodatabases, as well as WMS, WFS, and WMTS mapping services.

Comprehensive visual layer property, legend, and scale controls provide for deep customization of the map appearance.

Create, edit, translate, and export map layers in a number of vector, image, grid, and database formats.

Support for coordinate systems with on-the-fly layer reprojection between thousands of predefined geographic and projected coordinate systems or any coordinate system defined from 150+ projections and 900+ datums.

Saving of maps to industry standard graphic file formats (BMP, PNG, JPG) and georeferenced world files as well as AVI movie files.


WordStat Perpetual Licence

Licence Agreement


This software and the disk on which it is contained are licensed to you, for your own use. This is copyrighted software owned by Provalis Research Corp. By purchasing this software, you are not obtaining title to the software or any copyright rights. You may not sublicense, rent, lease, convey, modify, translate, convert to another programming language, decompile, or disassemble the software for any purpose. You may make as many copies of this software as you need for backup purposes. You may use this software on up to two computers, provided there is no chance it will be used simultaneously on more than one computer.


The WordStat product is licensed “as is” without any warranty of merchantability or fitness for a particular purpose, performance, or otherwise. All warranties are expressly disclaimed. By using the WordStat product, you agree that neither Provalis Research nor anyone else who has been involved in the creation, production, or delivery of this software shall be liable to you or any third party for any use of (or inability to use) or performance of this product or for any indirect, consequential, or incidental damages whatsoever, whether based on contract, tort, or otherwise even if we are notified of such possibility in advance. (Some states do not allow the exclusion or limitation of incidental or consequential damages, so the foregoing limitation may not apply to you). In no event shall Provalis Research’s liability for any damages ever exceed the price paid for the license to use the software, regardless of the form of claim. This agreement shall be governed by the laws of the province of Quebec (Canada) and shall inure to the benefit of Provalis Research and any successors, administrators, heirs, and assigns. Any action or proceeding brought by either party against the other arising out of or related to this agreement shall be brought only in a PROVINCIAL or FEDERAL COURT of competent jurisdiction located in Montréal, Québec. The parties hereby consent to in personam jurisdiction of said courts.


WordStat Annual Site Licence-WordStat Department Licence

Licence Agreement


A department-wide site license allows the installation of the software on all computers in the designated department as well as the distribution of copies of the manual and software to all employees and faculty members and to all full-time students taking a course provided by the department. This license expires one year after the purchase of the license. All updates during this time period are free and can be obtained from our web site at www.provalisresearch.com. The price for the renewal of this license is guaranteed not to exceed the initial price paid for the software for a period of at least two additional years. This license does not allow the use of the software for commercial purposes or paid contracts and should be restricted to employees of the university and to full-time students. A special license for commercial uses may however be obtained at a reduced price by contacting Provalis Research. The department-wide site license excludes affiliated research centers as well as employees and researchers whose salaries are not paid by the university. Provalis Research reserves the right to limit customer support to a few designated individuals. While we may offer occasional support to individual users, if the number of persons asking for support becomes too large, we may require the department to designate one or two persons who will provide support to the users and will redirect questions to us if needed.

Operating System: Microsoft Windows XP, 2000, Vista, Windows 7 and 8


Memory: From 256 MB (XP) to 1GB (Vista, Windows 7 and 8)


Disk Space: 40 MB


WordStat, along with QDA Miner, will run on Mac OS using a virtual machine solution or Boot Camp, and on Linux computers using CrossOver or Wine.

The QDA Miner manual can be downloaded here



A series of short videos developed by Provalis Research, the developer of QDA Miner, offers excellent online tutorial support for conducting qualitative data analysis using the Provalis products.



