04a. Creating a Data Source
Navigate to the Data Sources screen by clicking Search Engine, then Data Sources. All currently loaded Data Sources will be displayed, and clicking New will allow the configuration of a new source.
Enter the desired name in the Name textbox. This name must be unique.
Use the Tags textbox to add associated search terms to the Data Source. This allows the components to be accessed using customised strings when using the search bar.
Once all the options have been completed, click Create Data Source to complete the process. To build the data source and navigate directly to the Data Source Groups screen, click the up arrow next to this text and click Create Data Source and View Data Source Groups.
Data Source Options
The data source options are found in the Type drop-down list, with the subsequent options updating once a source is selected.
Data
Crosstab
Allows a saved Crosstab to be stored as an Index.
This data source is initially built with no data attached, with the specified Crosstab loaded after the initial build process.
Database
Enables users to build a data source from a database. This can be used to capture live data and can then be scheduled to build on a regular basis to ensure the data used for reporting is up-to-date.
Please refer to the Database Data Source Wizard chapter for more detailed information concerning the available configuration options.
Mongo DB
This source allows users to link to a Mongo DB database.
Simple XML
Uses an uploaded Simple XML file as a data source.
Click the … icon next to the XML File box to load the relevant file.
SPSS Export File
Enables the use of an IBM SPSS Statistics export file as a data source.
Click the … icon next to the File box to load the relevant file.
XML File
Uses an uploaded XML file as a data source.
Click the … icon next to the XML File and XSD File boxes to load the relevant files.
Data Warehouse
Amalgamated
Allows users to create a single data source from various separate data sources that are joined based on join keys. The process works in a similar way to joining tables in SQL where data is joined based on shared data characteristics and results in fast access to data that may have previously existed across multiple sources.
Please refer to the Amalgamated Data Source Wizard chapter for more detailed information concerning the available configuration options.
Calendar Assistant
Adds rows to the provided Index created from intersections in the specified date ranges.
Select a source value from the Index drop-down list to populate the Start Date and End Date drop-down lists with the required date fields. The Use Today If End Date Is Null option, when checked, will replace the End Date with the current system date if the End Date is NULL.
Enable the Build Daily Calendar option if an automatic daily refresh the data source is required, or the Months option to instead add values on a monthly basis.
Copy Data Source Group
This option gives users the ability to copy an existing Data Source Group and build the relevant Data source and Index via a single process rather than in each individual screen.
Use the Data Source Group drop-down list to select the Data Source Group be copied, and select the fields that will be present in the copy from the Index Fields drop-down list.
To restrict the number of rows copied, enter a value in the Number To Index textbox. Leaving this option blank will result in all rows being copied.
Data Mart
Allows users to extract a portion of an existing Index and incorporate the data into a standalone data source.
By saving a query within an Index that restricts the data at the required level, this query can then be selected to create an independent data source with a view to increasing performance and removing any irrelevant entries for the end user.
A Data Mart cannot be built from an unstructured Index.
Click the Select Query button to open the Load screen, where a previously saved Query can be selected, the results of which will drive the Index records.
Delete
Allows the deletion of records from an Index using a saved query.
By saving a query within an Index that contains all of the records to be deleted, this query can then be ran as a Delete data source to remove the contained records.
Editable Data
This allows users to create an Index where values can be manually overridden and new rows of data can be added. When a change is made, the field is highlighted green and is only applied after the next Index build.
Please refer to the Editable Data section of the Data Warehouse chapter for more information.
ETL Package
This source provides the ability to loop through data to locate entries with specific characteristics. Additional columns are added at Index level against each affected row based on the keys set.
Multiple stages can be configured to add values based on previously created filter values, allowing the inclusion of numerous additional fields to effectively loop through large data sets to locate specific entries.
Please refer to the ETL Package section of the Data Warehouse chapter for more information.
Ranking
The Ranking Data Source will result in a Ranking Index, an extension of the aggregation functionality. This Index allows the creation of automatically ranked groups of information by Day, Week, Month or Year based on a series of calculations performed on specific fields. This results in the discovery of the top and bottom ‘n’ values and associated percentages.
Enter the name that will be given to the associated Data Source, Data Source Group and Index in the Name textbox. This name must be unique.
Specify the source Index using the Index dropdown list, and select a field from the Primary Key dropdown list to specify the field that will be used to calculate the groupings within the resulting Index.
Select the field to be fields to be aggregated within each grouping from the Aggregations dropdown list, and specify the type of aggregation by selecting from the Aggregation Methods options, where Average, Count, Count All, Maximum, Minimum, Standard Deviation and Sum are available.
The Timestamp option specifies the data field that will be used to create the different grouping fields. This must be a full date or date/time field. Use the Months option to specify the number of months that groupings are created from.
Use the Frequency options to specify the period types the groupings will adhere to. Select Custom to determine the frequency using the data field selected for the Timestamp option, or select a predefined interval from the following options: Days, Months, Months Summary, Rolling, Weeks and Years.
Specify the level of the produced ranking using the Grouping drop-down list. This is especially useful when a certain level of analysis is required, such as ranking store performance rather than region performance in a retail Index.
Reg Ex File
Allows users to correct errors in an input file using regular expression (Regex) to write changes to an output file.
The RegEx Mapping options allow the specification of the changes and the Stop at First Match option halts any further changes to a file once a single match has been found.
Reverse Transpose
Allows users to transpose values within rows into columns.
Use the Index drop-down list to specify the source of the values to be reverse transposed.
Specify what will be used to identify the rows to be transposed using the Repeating Fields drop-down list. From the Distinct Fields drop-down list, select the original column values.
Enter a column name for the values specified as Distinct Fields in the Field Name textbox, and a column name for the associated values in the Value Name textbox.
By default, blank fields are not included as part of the reverse transpose process. Enable the Include Blank Fields option to include them in the output.
Rules
A Rules data source will use pre-set rules to validate or update data.
Rules can be applied to individual fields and the output will inform the user of the number of rules that have passed or failed. Rules are categorised as either major or minor to allow the output to be split based on priority.
Please refer to the Rules section of the Data Warehouse for chapter more information.
Rules Promotion
The Rules Promotion Data Source allows an administrator to specify the maximum number of errors before an Index is prevented from being made available to end users. If the maximum number of minor or major errors is breached, the new version of the data will not be available in the front-end and the previous version will remain in place until the errors are corrected and the data refreshed.
Requires a Rules Index to be built against.
Rules Summary
This source displays all of the Rules data sources within the CXAIR instance. The passed and failed rules are counted across the different sources.
Please refer to the Rules Summary section of the Data Warehouse chapter for more information.
System Calendar
After setting the Working Days and Holidays options in the System Settings, a calendar Index detailing all dates and their working day status can be created the validate and cross-compare values.
Once built, five fields are created in the Index: Date, Working Day Number, Working Day String, Count Working Days Of Month and Count Working Days of Year. Use these columns against a date to validate the options set in the System Settings.
Transpose
Allows users to transpose values within columns into rows.
Use the Index drop-down list to specify the source of the values to be transposed. To restrict the number of transposed entries, specify a Saved Query containing the display fields to be processed.
Use the Primary Key dropdown list to specify what will be used to identify the rows to be transposed.
Specify the fields in the Field 1 (Column Headers) dropdown list that will be transposed into column headers once the Index has been created. The fields specified in the Field 2 (Values) dropdown list will populate these new columns based on the primary key value. To include additional fields in the new Index, select them from the Extra Fields (Distincts) dropdown list.
If the fields specified in the Field 2 (Values) dropdown list contain multiple values, enabling the Flatten option will split these values into separate columns.
Meta Data
The following three stages result in the discovery of foreign primary keys between Indexes.
Meta Values Data Source Stage 1
In this stage, the underlying field names and values from multiple Indexes grouped in a collection can be bought together and organised.
Specify a collection from the Collections drop-down list and enable the Create Index option to automatically create an Index from the data source.
Meta Values Data Source Stage 2
Using the Index created in Stage 1, the values taken from the various tables can now be matched.
Specify the Index from the Index drop-down list and enable the Create Index option to automatically create and Index from the data source.
Meta Values Data Source Stage 3
Now the table values have been matched, the Index created in Stage 2 can now be processed to discover the foreign primary keys.
Specify the Index from the Index drop-down list and enable the Create Index option to automatically create and Index from the data source.
Metadata
Enables an Index to be built from database schemas to analyse and manipulate the associated metadata.
Please refer to the Metadata Wizard chapter for more detailed information concerning the available configuration options.
Solutions
Basket Analysis Wizard
Basket Analysis is a premium licence option that groups, ranks and transforms transactional data into a format that can be easily used to analyse spending and usage patterns across large data sets.
Items relating to a single transaction or event are grouped together and stored as a ‘basket’. Multiple baskets can be analysed and compared to identify frequently recurring items.
Without Basket Analysis enabled, a single basket is comprised of multiple items or instances which require individual rows. Using Basket Analysis, multiple instances appear on a single row and individual customer spending can be tracked by grouping multiple items in a single basket or transaction. This provides insight into what products are being purchased, when, in what quantity and in what combination.
Please refer to the Basket Analysis Wizard for more information regarding the available configuration options.
Benford’s Distribution
Applies Benford’s Law to an Index, creating extra columns to denote
Benford’s Law states that if selecting a random number from base 10 statistical data, the probability that the first digit is ‘1’ is around 0.301 rather than the 0.1 expected if all digits were equally likely.
This is especially useful in fraud detection to quickly determine if figures do not match a known frequency.
Drive Time
This source uses Google Maps to calculate the amount of time it would take to drive between two points on a map.
New fields are added to the Index which display the distance and time taken to travel between the two points. The start and end point are fully configurable and the Mode options allow the choice between driving, walking and cycling. The Avoid options allow further configuration of the route and the Units options change how the distance is displayed.
To create a Drive Time data source, the Index requires at least two plottable points on a map.
HL7 V2
Allows users to connect to databases adhering to the HL7 V2 standard.
Ping
Using the Ping data source, an Index can be created to monitor network connectivity across a server.
Specify the length of time, in milliseconds, that the process will wait for a response before determining that it has timed out in the Timeout textbox. Increase this number to account for larger servers that may take longer to receive the ping request.
Increase the value entered in the Threads textbox to allocate more concurrent processes to the network task. Any potential improvements are dependent on the multi-core performance of the system’s processor. Please refer to the System Settings chapter for more performance tuning options.
Finally, select the device used to connect to the host network from the Network drop-down list.
Target
Allows the creation of a Targets Index. Once built, reports can be created to show the hierarchy of Target values before the data is joined to other Indexes. Please refer to the Targets section of the Extensions chapter for more information.
Third-Party
Google Analytics
Allows users to create a data source from collected Google Analytics data.
The correct API key has to be configured using the Google Analytics Wizard for the data to be loaded.
Before any Twitter Indexes can be created, a number of options must be configured. Please refer to the Twitter Configuration chapter for more information.
The Twitter option allows the selection of a specific feed of data that will be Indexed.
Using the Datafeed drop-down list, Followers or Twitter Query can be specified as the source data.
Twitter Mashup
Allows users to merge the data from a Followers Data Source Group and a Tweet Data Source Group into a single Data Source Group.
Twitter Subject
Used to track the activity of particular Twitter Users. Select the relevant Twitter Index from the Index drop-down list.
Unstructured
File System
Creates Indexes from a folder location on the server. This file system can include both structured and unstructured data, with each file Indexed as a new row of data.
Please refer to the Unstructured Data Source Wizard for more information concerning the available configuration options.
IMAP
Allows an email account to be used as a data source through Internet Message Access Protocol (IMAP) compatibility.
Use the Messages to Index drop-down list to restrict the number of messages, if required, and enable the Include Attachments option to Index received attachments.
POP3
Allows an email account to be used as a data source through Post Office Protocol 3 (POP3) compatibility.
Enable the Include Attachments option to Index received attachments.
Tika
Extends the data source compatibility to include the Apache Tika toolkit, capable of extracting metadata and text from over a thousand file types.
Specify the relevant Directory using the … icon.
Web Site
Allows users to build an Index from a website.
The URL must be provided for CXAIR to locate the files to Index.
Custom
The following Data Sources are not included with the installation of CXAIR and are imported via the Extensions screen. For more information, please contact support@connexica.com.
Stamford NLP
The Stamford NLP Data Source leverages sentiment analysis to process text and output values relating to positive and negative words and phrases. This can be used to automate the analysis of text fields without the manual process usually required to glean insight from data sources containing large amounts of text.
Select a source Index from Index drop-down list the fields that will be analysed using the Sentiment Fields drop-down list.
If the source Index has been built from Twitter data, enable the Twitter checkbox to reveal a number of options that facilitate the exclusion of elements such as hashtags or emoticons from the analysis.
Once the Index has been built, a number of key fields are created:
- <Field Name> Sentiment Positive Score
- <Field Name> Sentiment Negative Score
- <Field Name> Sentiment Positive Count
- <Field Name> Sentiment Negative Count
- <Field Name> Sentiment None Count
- <Field Name> Sentiment Difference Score
- <Field Name> Sentiment Total Score
Fields with a higher total score are classed as having a 'Positive' sentiment, with the lower scores representing more 'Negative' sentiment. Users can reference these values when reporting to automate the analysis of the selected sentiment fields.