04b. Creating a Data Source Group
Navigate to the Data Source Groups screen by clicking Search Engine, then Data Source Groups. All currently loaded Data Source Groups will be displayed, and clicking New will allow you to add a new group.
Details
Enter a name for the Data Source Group in the Name text box.
Use the Tags text box to add associated search terms to the Data Source Group. This allows the components to be accessed using customised strings when using the search bar.
The Directory option will automatically populate with the path specified in the System Settings, suffixed with the text entered in the Name text box. Click the … icon to specify a different directory.
The Index Method drop-down list specifies the build method to be used when building the Index.
The Complete option will result in all items in a data source being Indexed. Once complete, a new version number is applied to the Index. Each run completely refreshes the Index.
Building an Incremental Index enables an optimised refresh process when rebuilding Indexes to account for any changes. Rather than rebuilding the entire Index, only data that has been modified or added will be processed. When selected, the Incremental Identifier drop-down list is revealed. This is required to identify changes, using the selected field to detect when new data is greater than the previous highest value.
The Timeline option allows the building of ‘point in time’ Indexes. Using an Effective From date, comparisons of data between set dates can be made. A Primary Key is required.
Selecting Cumulative will only Index items that have changed in a data source. Duplicate records are removed from the result set based on the Incremental Identifier and Primary Key. When cumulatively building from CSV files, only a Primary Key is required. For other files, an Incremental Identifier and a Primary Key are required.
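As an illustration only, the Incremental and Cumulative options described above can be sketched in Python. This is not CXAIR's implementation: the function names, the record layout and the rule that the row with the highest Incremental Identifier is the one kept for each Primary Key are all assumptions made for the example.

```python
from typing import Any, Dict, Iterable, List, Tuple

Record = Dict[str, Any]


def incremental_rows(rows: Iterable[Record], identifier: str,
                     previous_highest: Any) -> Tuple[List[Record], Any]:
    """Keep only rows whose identifier is greater than the previous highest value."""
    new_rows = [r for r in rows if r[identifier] > previous_highest]
    # The new high-water mark is carried forward to the next refresh.
    highest = max((r[identifier] for r in new_rows), default=previous_highest)
    return new_rows, highest


def cumulative_dedupe(rows: Iterable[Record], primary_key: str,
                      identifier: str) -> List[Record]:
    """Keep one row per primary key (assumed: the one with the highest identifier)."""
    latest: Dict[Any, Record] = {}
    for row in rows:
        key = row[primary_key]
        if key not in latest or row[identifier] > latest[key][identifier]:
            latest[key] = row
    return list(latest.values())


# Hypothetical data: "updated" plays the role of the Incremental Identifier.
rows = [
    {"id": 1, "updated": 10, "status": "open"},
    {"id": 2, "updated": 12, "status": "open"},
    {"id": 1, "updated": 15, "status": "closed"},
]
changed, highest = incremental_rows(rows, "updated", previous_highest=11)
deduped = cumulative_dedupe(rows, "id", "updated")
```

In this sketch, only the two rows with an "updated" value above 11 would be processed by the incremental refresh, and the cumulative result keeps a single row for each "id".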
Use the Index Size drop-down list to specify the expected number of records the resulting Index will contain. This is an indicative value and does not need to be exact. The system uses this value to work out how many folders to split the Index into. The higher the specified Index size, the fewer folders are created. A lower number of folders results in fewer threads being used to query the Index, as a single thread is allocated per folder. This decreases the speed of individual queries, but reduces the performance impact of multiple users querying the system concurrently.
Select the Data Sources that will be used to build the group from the Data Sources Added to Group drop-down list. One or more Data Sources can be selected.
Enable the Build Now option to build the Data Source Group as soon as the creation process is complete. If disabled, the settings are saved for the build to initiate at a later time.
Once all the options have been completed, click Create Data Source Group to complete the process. To build the Data Source Group and navigate directly to the Indexes or Scheduling screen, click the up arrow next to this text and click Create Data Source Group and View Indexes or Create Data Source Group and View Schedules.
If the Build Now option has been enabled, the Data Source Group will now start building from the specified Data Sources.
Advanced
Specify the number of threads used to build the Data Source Group using the Number of Parallel Writers option. Specifying more threads will allocate a greater amount of system resources to the build process, potentially increasing performance at the expense of other system tasks.
Use the Offline Scheme drop-down list to specify which segments of the Index will be available for users to query. Select a constraint from the list and quantify it with a number in the subsequent text box.
To change the text analyser used when processing fields, select from the available options in the Analyser drop-down list. This should only be changed if encountering problems with the automatically detected output.
If multiple Data Sources are used to build the Data Source Group, enable the Ignore Errors option to complete the build if one of the Data Sources fails. If errors are found, it is recommended that the system logs are checked to locate and rectify the issue. Please refer to the Logging chapter for more detailed information.
To prevent changes to the Data Source Group when the resulting Index is sent to another CXAIR instance, enable the Locked option.
Use the Email when Index is Updated or has Failed option to be notified when either of these events occurs.
Retain Data Source Order will ensure that the data is stored in the same order as it is in the Data Source. It will also be reflected in the Query screen. For ETL Data Sources, the order will be from the Data Source of the Index used in the ETL Data Source - the Order options in the ETL stages are purely for processing and will have no impact on this setting.
Lock Index Field Types is used to maintain the data type of all fields in the Index. This is mainly used for Excel and CSV Data Sources, where no defined data type is provided to CXAIR (it is derived by parsing the first 100 records). This option prevents scenarios where a date or numeric field is empty when the data refreshes, which would cause CXAIR to interpret that field as a string (the default setting). This option is not needed if the force data type options have been used in the Data Source.
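The scenario that Lock Index Field Types guards against can be illustrated with a minimal Python sketch. The inference rules, the type names and the single date format below are assumptions for the example and do not describe how CXAIR actually parses data; only the 100-record sample and the string default come from the description above.

```python
from datetime import datetime
from typing import Iterable, Optional


def infer_type(values: Iterable[Optional[str]], sample_size: int = 100) -> str:
    """Guess a field type from the non-empty values in the first rows."""
    sample = [v.strip() for v in list(values)[:sample_size] if v and v.strip()]
    if not sample:
        return "string"  # nothing to parse, so fall back to the default

    def is_number(text: str) -> bool:
        try:
            float(text)
            return True
        except ValueError:
            return False

    def is_date(text: str) -> bool:
        try:
            datetime.strptime(text, "%Y-%m-%d")  # assumed date format
            return True
        except ValueError:
            return False

    if all(is_number(v) for v in sample):
        return "number"
    if all(is_date(v) for v in sample):
        return "date"
    return "string"


def field_type(values: Iterable[Optional[str]],
               locked_type: Optional[str] = None) -> str:
    """A locked type takes precedence over whatever the current data suggests."""
    return locked_type or infer_type(values)
```

With these assumptions, a date column that arrives empty on a refresh would be inferred as a string, whereas passing the previously locked type keeps it as a date.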
Extra Fields
Once a Data Source Group has been built, new options become available that allow extra fields to be added.
On the Data Source Groups screen, click the Edit icon next to the relevant Data Source Group. Click the > icon next to the Extra Fields option to reveal the configuration settings.
Select an extra field from the drop-down list and click Add to reveal the relevant options. Use the Label option to name the field. Once the relevant options have been set, click Save to add the field.
The following fields are available:
Additional Index Fields
Calculation
Adding a calculated field allows new fields to be derived using calculations.
Enable the Overwrite option to replace any fields with the same name as the calculated field.
Click the … icon to open the Calculation Builder. Please refer to the Calculation Builder chapter for more detailed information on the functions that can be used.
Dates
Date fields allow the creation of date aggregations, using the original date fields as the source. Specify the source date fields using the Index Fields drop-down list and select an interval from the Date options below.
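To give a sense of what a date aggregation is, the following Python sketch derives coarser intervals from a single source date. The interval names and output formats are hypothetical and are not taken from the Date options in CXAIR.

```python
from datetime import date


def date_aggregations(value: date) -> dict:
    """Derive coarser-grained fields from a single source date."""
    quarter = (value.month - 1) // 3 + 1
    return {
        "Year": value.year,
        "Quarter": f"{value.year}-Q{quarter}",
        "Month": value.strftime("%Y-%m"),
    }


# 17 May 2024 falls in the second quarter of 2024.
fields = date_aggregations(date(2024, 5, 17))
```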
Distance
Using the Distance option, an extra field measuring the distance between two points can be built into the Index.
Select the fields that denote the relevant Latitude and Longitude for the two points and specify the unit of measurement from the Radius radio button.
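The formula CXAIR uses to calculate the distance is not stated here. One common way to measure the distance between two latitude and longitude pairs is the haversine formula, sketched below in Python. The function name and the radius values are assumptions for the example; the choice of radius determines the unit of the result.

```python
import math

# Mean Earth radius in each unit (assumed values for this sketch).
EARTH_RADIUS = {"km": 6371.0088, "miles": 3958.7613}


def haversine(lat1: float, lon1: float, lat2: float, lon2: float,
              unit: str = "km") -> float:
    """Great-circle distance between two points given in decimal degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    d_phi = phi2 - phi1
    d_lambda = math.radians(lon2 - lon1)
    a = (math.sin(d_phi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(d_lambda / 2) ** 2)
    return 2 * EARTH_RADIUS[unit] * math.asin(math.sqrt(a))


# Approximate coordinates for London and Paris.
london_to_paris_km = haversine(51.5074, -0.1278, 48.8566, 2.3522, unit="km")
```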
Link Fields
Using linked fields allows fields to be joined from another Index based on a join key. This will perform the SQL equivalent of a LEFT OUTER JOIN.
Select the join key for the source Data Source from the Index Field drop-down list and select the Index that will be joined to from the Index drop-down list.
For the Index that is being joined to, select the join key from the Join Index Field drop-down list. Use the Extra Fields drop-down list to select any fields that will be added to the resulting Index.
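The join behaviour described above can be sketched in Python as follows. This is an illustration rather than CXAIR's implementation: the function and parameter names simply mirror the option labels, and the sketch assumes the join key is unique in the joined Index (a true LEFT OUTER JOIN would return one row per match).

```python
from typing import Any, Dict, Iterable, List

Record = Dict[str, Any]


def link_fields(source_rows: Iterable[Record], join_rows: Iterable[Record],
                index_field: str, join_index_field: str,
                extra_fields: List[str]) -> List[Record]:
    """Add extra_fields from join_rows to every source row, matched on the join key."""
    lookup: Dict[Any, Record] = {}
    for row in join_rows:
        # Assumption: if a key appears more than once, the first row is used.
        lookup.setdefault(row[join_index_field], row)

    result = []
    for row in source_rows:
        match = lookup.get(row[index_field])
        merged = dict(row)
        for field in extra_fields:
            # Unmatched source rows are kept, with empty extra fields.
            merged[field] = match[field] if match else None
        result.append(merged)
    return result


orders = [{"order": 1, "cust": "C1"}, {"order": 2, "cust": "C9"}]
customers = [{"cust_id": "C1", "region": "North"}]
linked = link_fields(orders, customers, "cust", "cust_id", ["region"])
```

Every source row is kept in the result; the second order has no matching customer, so its "region" field is left empty.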
Link Date Range Fields
Uses the same principle as linked fields, but only returns fields when the specified date range is matched in the source data.
Select the join key for the source Data Source from the Index Field drop-down list and select the date field for the current Data Source Group from the Date Index Field drop-down list. Specify the Index that will be joined to from the Index drop-down list.
For the Index that is being joined to, select the join key from the Join Index Field drop-down list. Specify the date range using the Join Start Date Index Field and Join End Date Index Field drop-down lists. Use the Extra Fields drop-down list to select any fields that will be added to the resulting Index.
Link Reverse Date Range Fields
Uses the same principle as linked fields, but only returns fields when the date in the source data falls outside of the specified date range. Please see the above option, Link Date Range Fields, for more information concerning the configuration options.
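The two date range options above can be sketched together in Python. The parameter names mirror the option labels, while the inclusive range boundaries and the use of the first qualifying row are assumptions made for the example.

```python
from datetime import date
from typing import Any, Dict, Iterable, List

Record = Dict[str, Any]


def link_date_range_fields(source_rows: Iterable[Record], join_rows: List[Record],
                           index_field: str, date_index_field: str,
                           join_index_field: str, join_start_field: str,
                           join_end_field: str, extra_fields: List[str],
                           reverse: bool = False) -> List[Record]:
    """Join on a key, but only where the source date is inside (or, if reverse, outside) the range."""
    result = []
    for row in source_rows:
        merged = dict(row)
        merged.update({field: None for field in extra_fields})
        for candidate in join_rows:
            if candidate[join_index_field] != row[index_field]:
                continue
            # Assumption: the start and end dates are both inclusive.
            in_range = (candidate[join_start_field]
                        <= row[date_index_field]
                        <= candidate[join_end_field])
            if in_range != reverse:
                merged.update({field: candidate[field] for field in extra_fields})
                break
        result.append(merged)
    return result


sales = [{"sku": "A", "sold": date(2024, 3, 10)},
         {"sku": "A", "sold": date(2024, 8, 1)}]
prices = [{"sku": "A", "from": date(2024, 1, 1), "to": date(2024, 6, 30), "price": 10}]
in_range = link_date_range_fields(sales, prices, "sku", "sold", "sku",
                                  "from", "to", ["price"])
out_of_range = link_date_range_fields(sales, prices, "sku", "sold", "sku",
                                      "from", "to", ["price"], reverse=True)
```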
Remove Index Fields
Remove
Select the fields from the Index Fields drop-down list that will be removed at Data Source Group level.
Strings
Analyzed
Specify the Index Fields that will be classed as Analyzed when the build process is complete.
The Analyzed functionality has been designed to accommodate fields containing multiple words, such as a ‘Comments’ field, to allow for case insensitive searches in the Query screen. When specified as Analyzed, each word in the field is stored as an individual entity to facilitate field-specific searches, in contrast to regular fields that are stored as a single string value. This makes it easier to search for individual words in a field and is especially useful when creating Word Clouds that can be used with the Stop Words functionality to display the frequency of key words. Please note that any fields added to this list cannot be used as a row or column when creating a Crosstab.
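The idea of storing each word as an individual, case insensitive entity can be illustrated with a small Python sketch. The tokenising rule and the function names are assumptions for the example and do not describe the analyser CXAIR uses.

```python
import re
from collections import Counter
from typing import Iterable, List


def analyze(text: str) -> List[str]:
    """Split a field into lower-case word tokens (assumed tokenising rule)."""
    return re.findall(r"[a-z0-9']+", text.lower())


def word_frequencies(comments: Iterable[str],
                     stop_words: Iterable[str] = ()) -> Counter:
    """Count tokens across all comments, ignoring the supplied stop words."""
    ignored = set(stop_words)
    counts: Counter = Counter()
    for comment in comments:
        counts.update(token for token in analyze(comment) if token not in ignored)
    return counts


tokens = analyze("Customer called; Refund requested")
frequencies = word_frequencies(["The refund was late", "Refund the fee"],
                               stop_words={"the", "was"})
```

Because every token is lower-cased, a search for "refund" would match "Refund" and "REFUND" alike, and the frequency counts are the kind of input a word cloud would use.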
Lower Case
Select the fields from the Index Fields drop-down list that will be converted into lower case.
Proper Case
Select the fields from the Index Fields drop-down list that will be converted into proper case, where the first letter of every word is capitalised.
Regular Expression
Using regular expressions (RegEx), string patterns can be matched within fields and updated, replaced or extracted accordingly.
Select the fields that the expression will be applied to from the Index Fields drop-down list.
The below RegEx Mapping section has two options: RegEx and Replacement. In the RegEx text box, enter the regular expression to match a pattern in the selected Index fields.
Use the Replacement text box to specify the string that will replace the matches found. Captured text can be referenced in the replacement using numbered groups in the following format:
$<number>$
The current functionality does not support named groups or start and end of line anchors.
Enable the Stop on First Match option to only return the first match. The Blank on No Match option, if enabled, returns a blank field if no matches are found. Otherwise, the entire field is returned in its original format.
Use the Test section to check that the expression is working correctly. This accesses the underlying data in real-time. Due to the different methods of storing text, the displayed text may be in a different format to the source file.
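As a rough illustration of a RegEx and Replacement pair with numbered groups, the following sketch uses Python's re module. Python writes group references as \1 and \2, which differs from the format used in the Replacement text box, and the function name and the handling of the Blank on No Match option are assumptions made for the example.

```python
import re


def apply_regex_mapping(value: str, pattern: str, replacement: str,
                        blank_on_no_match: bool = False) -> str:
    """Replace every match of pattern in value, referencing numbered groups."""
    regex = re.compile(pattern)
    if regex.search(value) is None:
        # No match: return a blank field or the original value unchanged.
        return "" if blank_on_no_match else value
    return regex.sub(replacement, value)


# Swap the two captured parts of a hypothetical reference code.
swapped = apply_regex_mapping("Ref AB-1234 received",
                              r"([A-Z]{2})-(\d{4})", r"\2/\1")
```

The pattern avoids named groups and line anchors, in keeping with the limitations noted above.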
Simple Regular Expression
Select the fields that the expression will be applied to from the Index Fields drop-down list.
Specify the sub-section of text to be searched using the Block Start and Block End options. The string that is located between the specified start and end text will be searched.
To further specify the string location, use the Starts With text box to indicate the starting point for the search and specify any characters to omit from the search, such as punctuation, in the Skip Text text box.
Once a value has been entered in the Starts With text box, the White Space drop-down list will appear. Specify an option that will best allow the text to be detected over multiple lines, if required.
Use the Data Type drop-down list to specify the format of the matched data. Select String for a lazy match or Long String for a greedy match. A lazy match will stop as soon as the condition is satisfied, while a greedy match will stop once the condition has been satisfied as many times as possible. Selecting Number or Date will reveal the Format option, where the date and number formatting can be specified.
Specify where the search will terminate with the Ends With drop-down list. Select User Defined and enter the required string in the User Defined Ends With text box below to further customise the search. If entering an expression rather than a text string, enable the User Defined is RegEx option to activate it. If the expression is not case sensitive, enable the Case Insensitive option.
To set the expression to locate multiple values within the same field, enable the Multiple Values option. Otherwise, only the first value is returned. When outputting multiple values, the Single Line option, when enabled, will constrain the expression to a single line. The Truncate option will shorten the retrieval process by using the previous match as the starting point for the next search, rather than starting from the beginning of the field. This will avoid repeat results and shorten the search time.
To include the string specified in the Starts With text box in the output, enable the Include Starts With option, and to include the string specified in the Ends With text box in the output, enable the Include Ends With option.
Use the Test section to check that the expression is working correctly. This accesses the underlying data in real-time to provide accurate results. Due to the different methods of storing text, the displayed text may be in a different format to the source file.
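The difference between the lazy match (String) and the greedy match (Long String) described for the Data Type option can be demonstrated with Python's re module. The function below is a hypothetical stand-in for the Starts With and Ends With options and is not how CXAIR builds its expressions.

```python
import re
from typing import Optional


def extract_between(text: str, starts_with: str, ends_with: str,
                    greedy: bool = False) -> Optional[str]:
    """Return the text between starts_with and ends_with, or None if absent."""
    quantifier = ".+" if greedy else ".+?"
    pattern = re.escape(starts_with) + "(" + quantifier + ")" + re.escape(ends_with)
    match = re.search(pattern, text, flags=re.DOTALL)  # DOTALL lets a match span lines
    return match.group(1) if match else None


text = "[first] and [second]"
lazy = extract_between(text, "[", "]")                  # stops at the first "]"
greedy = extract_between(text, "[", "]", greedy=True)   # runs to the last "]"
```

Here the lazy match returns "first", while the greedy match returns "first] and [second".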
Upper Case
Select the fields from the Index Fields drop-down list that will be converted into upper case.
Modelling
The Modelling options allow the results of previously created Bayesian networks and decision trees to be added at Data Source Group Level to effectively predict future outcomes.
For each option, click Decision Tree or Bayesian Network to open the Saved Reports window, where the results will be filtered to only show created models. Click the checkbox next to the relevant model and click the Selected Reports tab. Ensure the correct model is selected and click Add to Data Source.
The fields from the Data Source Group will then be displayed alongside those in the model. All of the fields from the model should match those in the Data Source Group, and will be automatically detected. Use the relevant drop-down list to manually match entries if not automatically detected.
Decision Tree
When saved, two columns are added for the classifier used when the model was built. The Outcome column will display the predicted result and the Percentage column will display the predicted percentage likelihood of the outcome.
Predictive Analytics
To predict the outcome for a field, change the relevant drop-down list to [Predict this value]. For every field with this option specified, two columns are added. The Outcome column will display the predicted result and the Percentage column will display the predicted percentage likelihood of the outcome.
Prescriptive Analytics
Select the field of interest from the Classifier drop-down list and choose the outcome that will be measured from the Outcome of Interest drop-down list.
Adjust the Threshold Percentage using the slider or by typing a value below. From this setting, two columns are created: Threshold Outcomes and Predicted New Probabilities.
The Threshold Outcomes column displays the combinations of outcomes that would increase the probability of the specified Outcome of Interest. The Predicted New Probabilities column displays the predicted probability of the Outcome of Interest from the calculated combination of outcomes.
Select a value from the Search Depth drop-down list to restrict the number of possible combinations searched before an outcome is reached. While increasing this value may provide more accurate results, the processing time and system load will increase exponentially.
To ensure the process runs as efficiently as possible, select values from the Ignore Outcomes Already Above Threshold option to manually exclude them from the analysis. Reducing the number of fields will decrease system load and the amount of time required to generate a result.
Obfuscation
Using the Obfuscation options allows selected fields to be obscured, preventing individual records from being identifiable.
Please refer to the Obfuscation section of the Database Data Source Wizard chapter for more information regarding the available options.
Third-Party
Base64 Encoded Word Document
Select the fields that will be encoded to base64 strings.
JD Edwards Date CYYDDD
Select the date fields that will be converted to the J.D. Edwards format (Century, Year, Day of Year).
JD Edwards Date CYYMMDD
Select the date fields that will be converted to the J.D. Edwards format (Century, Year, Month, Day of Month).
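The two J.D. Edwards formats above can be sketched in Python as follows. The sketch assumes the usual J.D. Edwards convention that the century digit is 0 for years 1900 to 1999 and 1 for years 2000 to 2099, and it returns the value as a string so that a leading zero is preserved; whether CXAIR outputs a string or a number is not stated here.

```python
from datetime import date


def to_cyyddd(value: date) -> str:
    """Century digit, two-digit year, three-digit day of year."""
    century = value.year // 100 - 19  # assumed: 0 for 19xx, 1 for 20xx
    day_of_year = value.timetuple().tm_yday
    return f"{century}{value.year % 100:02d}{day_of_year:03d}"


def to_cyymmdd(value: date) -> str:
    """Century digit, two-digit year, two-digit month, two-digit day of month."""
    century = value.year // 100 - 19  # assumed: 0 for 19xx, 1 for 20xx
    return f"{century}{value.year % 100:02d}{value.month:02d}{value.day:02d}"


# 1 March 2024 is day 61 of a leap year.
julian = to_cyyddd(date(2024, 3, 1))     # "124061"
calendar = to_cyymmdd(date(2024, 3, 1))  # "1240301"
```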
Modulus 11 Check
Calculates whether a numeric field passes the Modulus 11 check and outputs a True or False flag. For example, '399038' will result in a 'False' flag, while '399027' will result in a 'True' flag.
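The exact weighting used by the Modulus 11 Check is not specified here. One weighting that reproduces both examples above multiplies each digit by its position counted from the right (so the last digit has a weight of 1) and tests whether the total is divisible by 11. The Python sketch below assumes that scheme and should not be read as CXAIR's implementation.

```python
def passes_modulus_11(value: str) -> bool:
    """Weighted digit sum check, with weights descending to 1 (assumed scheme)."""
    if not (value.isascii() and value.isdigit()):
        return False
    length = len(value)
    # The leftmost digit gets weight `length`; the rightmost gets weight 1.
    total = sum(int(digit) * (length - position)
                for position, digit in enumerate(value))
    return total % 11 == 0


# '399027': 3*6 + 9*5 + 9*4 + 0*3 + 2*2 + 7*1 = 110, which is divisible by 11.
# '399038': 3*6 + 9*5 + 9*4 + 0*3 + 3*2 + 8*1 = 113, which is not.
results = {code: passes_modulus_11(code) for code in ("399027", "399038")}
```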
Soundex
Outputs the Soundex code for the selected fields. For example, 'Washington' is coded 'W-252'.
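For reference, a minimal Python sketch of the standard American Soundex rules is shown below. It reproduces the 'Washington' example above, but the hyphenated output format is copied from that example and the code is not CXAIR's implementation.

```python
# Consonant groups and their Soundex digits.
CODES = {}
for letters, digit in (("BFPV", "1"), ("CGJKQSXZ", "2"), ("DT", "3"),
                       ("L", "4"), ("MN", "5"), ("R", "6")):
    for letter in letters:
        CODES[letter] = digit


def soundex(name: str) -> str:
    """Return the first letter, a hyphen and three digits, for example 'W-252'."""
    letters = [c for c in name.upper() if c.isascii() and c.isalpha()]
    if not letters:
        return ""
    first = letters[0]
    digits = []
    previous = CODES.get(first, "")
    for letter in letters[1:]:
        if letter in "HW":
            continue  # H and W do not separate letters with the same code
        code = CODES.get(letter, "")
        if code and code != previous:
            digits.append(code)
        previous = code  # a vowel resets the previous code to empty
    # Keep three digits, padding with zeros when the name is short.
    return first + "-" + "".join(digits)[:3].ljust(3, "0")


washington = soundex("Washington")  # "W-252"
```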