Wiki source code of 03f. Model
{{box cssClass="floatinginfobox" title="**Contents**"}}
{{toc/}}
{{/box}}

Using statistical modelling, further insight can be derived from existing Indexes to predict the future value of a field, create association rules and cluster data by establishing shared variables.

To access the Model functionality, click **Model** at the top of the screen.

When creating a model, there are three distinct stages: **Prepare Data**, **Create Models** and **View Report**. The required data preparation and output will depend on the chosen model type.

There are three model types available that offer distinct functionality, each with a wizard that contains the necessary steps for data preparation, model creation and report generation.

Click the required option to begin the model building process.

= Prediction =
 
Classification is one of the main machine learning methods for generating business insight. Prediction uses predictive modelling to assign a class label to each record, and its applications range widely, from loan default prediction and customer churn analysis to market subscription prediction and medical diagnosis.
 
To run predictive modelling, the data set must contain a column that will be predicted. This is known as the **Class Label**. For example, when running loan default prediction across a financial data set, each row will represent an existing loan and a column that represents the status of the loan will be present. This column will be predicted by the model.
 
Predictive modelling requires the data to be split into two distinct cohorts: **Training Data** and **Model Data**. The data used can be filtered using the on-screen filter controls, or a pre-defined saved query can be used to select the rows and columns.

**Training Data** is analysed to assist the statistical model in discovering key relationships within the records that influence outcomes. This understanding is then applied to the **Model Data**, the cohort of data that will have an outcome predicted.
 
The **Target Field** that will be predicted is then established. If a specific outcome is of particular interest, a value from that field can be designated as the **Class of Higher Interest**.
 
Finally, the fields that will be included in the model are specified before the data preparation is complete. Predictive **Bayesian Network** and **Decision Tree** models can then be configured based on the model data output before their results are added to an automatically generated dashboard.
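The wizard's internal **Decision Tree** implementation is not exposed, but the workflow it automates can be sketched in pure Python: learn a simple one-split "decision stump" from Training Data that contains the Class Label, then apply it to Model Data. The field names (##income##, ##loan_status##) are illustrative only, not product fields.

```python
# Minimal sketch of the predictive workflow: learn a one-level decision
# "stump" from Training Data, then predict the Class Label for Model Data.
# Field names ("income", "loan_status") are illustrative assumptions.

def best_stump(rows, feature, label):
    """Find the threshold on `feature` that best separates the classes."""
    best = None
    for threshold in sorted({r[feature] for r in rows}):
        left = [r[label] for r in rows if r[feature] <= threshold]
        right = [r[label] for r in rows if r[feature] > threshold]
        if not left or not right:
            continue
        # Majority class on each side of the threshold
        left_cls = max(set(left), key=left.count)
        right_cls = max(set(right), key=right.count)
        correct = sum(
            r[label] == (left_cls if r[feature] <= threshold else right_cls)
            for r in rows
        )
        if best is None or correct > best[0]:
            best = (correct, threshold, left_cls, right_cls)
    return best[1:]  # (threshold, class_below, class_above)

# Training Data: rows whose loan_status (the Class Label) is known.
training = [
    {"income": 20, "loan_status": "default"},
    {"income": 25, "loan_status": "default"},
    {"income": 60, "loan_status": "repaid"},
    {"income": 75, "loan_status": "repaid"},
]
threshold, below, above = best_stump(training, "income", "loan_status")

# Model Data: rows whose outcome the stump now predicts.
model_data = [{"income": 22}, {"income": 70}]
predictions = [below if r["income"] <= threshold else above for r in model_data]
print(predictions)  # ['default', 'repaid']
```

A real decision tree repeats this split recursively on each side; the single split shown here is only the core idea.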
 
Practical applications for this level of analysis include the prediction of utilisation, the targeting of individuals or groups that may prompt pre-emptive action or the forecasting of results based on prior outcomes.
 
= Association =

Using association rules analysis, associations and correlations can be determined to better understand the connections between itemsets across the data.
 
This requires the data to be in a specific shape, with each row representing a single transaction with a unique transaction ID and each column representing a specific item. For example, when using retail data, transactional point-of-sale data must be transposed so that each row represents a single transaction, with a column for each product containing a binary flag to indicate whether that item was included in the sale.
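As a rough illustration of that reshaping (the product names and transaction IDs below are made up), transactional rows can be pivoted into one basket row per transaction with a binary flag per product:

```python
# Sketch: pivot transactional point-of-sale rows into one row per
# transaction with a binary flag per product -- the shape that
# association rule mining expects. All names are illustrative.

sales = [  # (transaction_id, product) pairs, one row per item sold
    ("T1", "bread"), ("T1", "butter"),
    ("T2", "bread"),
    ("T3", "bread"), ("T3", "butter"), ("T3", "milk"),
]

products = sorted({item for _, item in sales})  # column names
baskets = {}                                    # one row per transaction
for txn, item in sales:
    baskets.setdefault(txn, {p: 0 for p in products})[item] = 1

for txn in sorted(baskets):
    print(txn, baskets[txn])
# T1 {'bread': 1, 'butter': 1, 'milk': 0}
# T2 {'bread': 1, 'butter': 0, 'milk': 0}
# T3 {'bread': 1, 'butter': 1, 'milk': 1}
```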
 
Once this preparation is complete, two algorithms are available to generate association rules: **Apriori Algorithm** and **FP Growth**. The model then runs to determine the **Support**, **Confidence** and **Lift** of each rule; these thresholds can be adjusted to focus the output on itemsets with specific levels of relationship probability.
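**Support**, **Confidence** and **Lift** have standard definitions in association rule mining: for a rule A → B, support is the share of transactions containing both A and B, confidence is the probability of B given A, and lift is how much A raises the probability of B. A minimal sketch over toy baskets (item names are illustrative):

```python
# Standard definitions of Support, Confidence and Lift for a rule A -> B,
# computed over a toy set of baskets. Item names are illustrative.

baskets = [
    {"bread", "butter"},
    {"bread"},
    {"bread", "butter", "milk"},
    {"milk"},
]

def metrics(antecedent, consequent, baskets):
    n = len(baskets)
    support_a = sum(antecedent <= b for b in baskets) / n            # P(A)
    support_b = sum(consequent <= b for b in baskets) / n            # P(B)
    support_ab = sum(antecedent | consequent <= b for b in baskets) / n  # P(A and B)
    confidence = support_ab / support_a                              # P(B | A)
    lift = confidence / support_b                                    # A's effect on P(B)
    return support_ab, confidence, lift

support, confidence, lift = metrics({"bread"}, {"butter"}, baskets)
print(support, confidence, lift)  # support 0.5, confidence ~0.67, lift ~1.33
```

A lift above 1, as here, means the antecedent makes the consequent more likely than it would be on its own.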
 
Once the thresholds have been adjusted to reflect the desired level of popularity, a report is generated that includes key insights into the itemsets and a number of visualisations.
 
= Clustering =

With a set of data points, clustering algorithms can be used to classify each data point into a specific group.
 
For unlabelled data sets, clustering is used to determine the possible groups that lie within the records. In theory, data points in the same group should have similar properties or features, while data points in different groups should have highly dissimilar properties or features.
 
This requires the data to be in a specific shape. For example, for customer segmentation each row should represent a data point (a customer) and each column an attribute (age, income, spending, etc.). The clustering model can then create groupings from shared attributes.
 
Clustering models can be applied to a range of scenarios such as website visitor demographic analysis, social network groupings and stock risk clustering.
 
Once the required fields have been selected, two model types are available: **K-Means** and **Expectation-Maximisation (EM)**. These models will then run according to the number of required clusters, with the **EM** model type also able to detect a suitable number automatically from the data set.
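The product's own **K-Means** implementation is not exposed, but the underlying idea can be sketched in a few lines of pure Python: repeatedly assign each point to its nearest centroid, then move each centroid to the mean of its cluster. The customer attributes (age, spend) below are hypothetical.

```python
import random

# Sketch of the K-Means idea behind the wizard's K-Means option: assign
# each data point to its nearest centroid, recompute centroids as cluster
# means, repeat until stable. Attributes (age, spend) are hypothetical.

def kmeans(points, k, iterations=20, seed=0):
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # start from k random data points
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(
                range(k),
                key=lambda i: sum((a - b) ** 2 for a, b in zip(p, centroids[i])),
            )
            clusters[nearest].append(p)
        # Move each centroid to the mean of its cluster (keep it if empty)
        centroids = [
            tuple(sum(dim) / len(c) for dim in zip(*c)) if c else centroids[i]
            for i, c in enumerate(clusters)
        ]
    return centroids, clusters

# Two obvious customer groups: (age, spend)
points = [(22, 100), (25, 120), (23, 110), (60, 800), (65, 900), (62, 850)]
centroids, clusters = kmeans(points, k=2)
print(sorted(len(c) for c in clusters))  # [3, 3]
```

EM's Gaussian mixture approach generalises this by making the assignments probabilistic rather than hard, which is what lets it estimate a suitable cluster count from the data.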
 
Once a model has completed, a number of performance metrics are available to benchmark the results and tweak key settings before a report is generated from the output.