Technical Architecture
The Connexica Ad hoc Interactive Reporter, CXAIR, is a business intelligence tool that provides users with a search-based interface to query and analyse data held in structured and unstructured sources.
Unlike traditional technologies, such as relational databases and OLAP reporting tools, CXAIR reports from a set of physical Lucene Indexes that are logically grouped together to form a Search Engine. This provides users with fast access to underlying records rather than access to an aggregated view following a lengthy build time.
CXAIR is a Java EE compliant application. A standard install includes an embedded version of Apache Tomcat, but other Java EE compliant application servers can be configured for use.
Core Components
Configuration Database
Integral to CXAIR is an HSQLDB database that contains information on users, permissions, versioning and system configuration settings.
Index Builder
The Index Builder is a set of processes that initiate and manage the crawling and Indexing operations across one or more Data Sources to extract and store data within an Index for fast retrieval.
This process can be executed immediately or scheduled to occur at specific time intervals. The number of Indexes that can be built at any one time is controlled by the System Settings available to the administrator and can be scaled based on hardware capability and the volume of data.
Index builds can either be complete replacements, incremental, maintained or snapshot-based. Snapshot-based Indexes allow CXAIR to store and maintain a history of data that allows users to report from a specific point in time.
The indexing process works by quickly creating a large number of small objects that are held for a short period of time in system memory. In contrast, the querying process requires caching of fewer objects for a longer period. Separating the indexing tasks into separate processes allows greater flexibility when configuring the memory allocation of the various CXAIR components, allowing greater control of the garbage collection to prevent pauses. CXAIR can be configured to have multiple indexing nodes for high-volume, multi-machine implementations.
The indexing process runs more efficiently with less memory. Therefore, subject to hardware restrictions, running each indexing process in a separate JVM is recommended.
Indexes
A key advantage of using CXAIR is the speed of data access. This is due to the use of highly optimised Indexes to store and retrieve data.
Indexes provide rapid access to all data types, with the underlying records stored in multiple formats to assist searching. This has the effect of increasing the Index size beyond that of the original source data, with an imported CSV file typically consuming around 250% more disk space when indexed in CXAIR.
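As a rough sizing aid, the arithmetic behind the "around 250% more" figure can be sketched as follows. The function name and the 40 GB example are illustrative assumptions; actual Index sizes depend on the data and configuration.

```python
def estimated_index_size(source_size: float, overhead_pct: float = 250.0) -> float:
    """Estimate the on-disk Index size for a source of the given size.

    overhead_pct is the extra space relative to the source, so 250 means
    "250% more", which is 3.5 times the source size.
    """
    return source_size * (1 + overhead_pct / 100.0)


csv_gb = 40  # hypothetical CSV size in gigabytes
index_gb = estimated_index_size(csv_gb)
print(f"{csv_gb} GB CSV -> about {index_gb:.0f} GB indexed")
```

Under this assumption, a 40 GB CSV file would need roughly 140 GB of disk once indexed.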
This approach results in fast access over large amounts of data without relying on large amounts of RAM, as it is far more cost effective to scale disks rather than be restricted by memory. Therefore, a key performance requirement when processing large Indexes is multiple fast disks in RAID configuration.
CXAIR has extended the Lucene indexing and search capabilities while retaining the data in Lucene format as an inverted Index, making CXAIR Indexes accessible to other Lucene-compatible applications.
With the CXFORMS licence option, the indexing process is further enhanced. Captured data is written directly to the relevant Index, allowing users to create reports from the responses.
User Interface
CXAIR is accessed in a browser window and is compatible with the following internet browsers:
• Windows Internet Explorer 11, or above with compatibility mode disabled
• Mozilla Firefox 51.0, or above
• Google Chrome 56.0, or above
The CXAIR server must also have Mozilla Firefox 59 installed with auto-update disabled for extended PDF support required for skewed components in Pages and CXForms.
A wide range of charting options are available to visually enhance reports, with CXAIR leveraging the graphical capabilities of the FusionCharts and D3.js JavaScript charting libraries.
The report distribution mechanism that generates reports in CSV, Excel, HTML, Image, PDF or XML format for emailing and downloading can be fully automated through the scheduling functionality or can be accessed on an ad hoc basis.
The user interface can be white-labelled to suit an organisation using skins, where the standard logos and colour schemes are replaced with client-specific alternatives.
A standard end user has the ability to load, save, print and email reports while performing ad-hoc searching and filtering of data to create a variety of report types. Administrators can grant functionality at a user and user group level.
Administration
All aspects of administering the system are accessed via the Administration Area, only available to Admin users.
Administration tasks include creating new users, allocating users to groups, creating new Indexes and Search Engines, setting up schedules and alerts, and exporting and importing data and reports between CXAIR instances.
Please refer to the Administration Guide for detailed information regarding the functionality available to system administrators.
Report Reader
Reports are held on the file system as XML documents. The Report Reader is responsible for searching and filtering reports depending on the access rights and permissions of a user.
Users have access to a local folder called My Reports for storing and retrieving saved reports. Other folders can be made available to allow access to reports created by other users where the relevant permissions have been granted.
Report Executer
The Report Executer is initiated when a user loads and runs a report or when a scheduled process attempts to run a saved report.
When a report is executed, the report definition is transformed into a series of parallel workflow activities that construct the report output by performing a series of query executions and data transformation processes.
The Report Executer produces the output as a combination of HTML, images and style sheets that is either rendered directly on-screen or passed to a formatter process that re-renders it in an Excel or PDF compatible format.
Where data is rendered as a PDF, a third-party PDF renderer, PhantomJS, is used.
Report Viewer
The Report Viewer functionality is a restricted ‘view only’ licence feature, providing a simplified interface to view, drill down, filter and print reports.
Saved reports can be made available to Report Viewer users to load in this restricted environment where each report will run against the most up-to-date data.
Security features, such as access to the underlying data, can be configured on a report or user basis.
Search Engine Structure
In CXAIR, a Search Engine is a series of Indexes that have been logically grouped together. An Index comprises one or more Data Source Groups, each of which comprises one or more Data Sources.
Data Sources
In CXAIR, a Data Source is a configuration object that holds details on how to connect to and what to extract from a location or system.
For databases, the Data Source definition will contain a SQL statement that defines precisely which data is to be extracted from the source system and transformed into an Index when the Data Source Group is built.
The compatible systems include file systems containing data readable by the CXAIR crawler process, URLs that allow connection to a website where content can be read and links can be followed, an email server to access a specific account, a database where JDBC or ODBC can be used to retrieve data and metadata, and a third-party connector to communicate and extract metadata and data from a proprietary system.
Data Source Groups
Data Source Groups consist of one or more Data Sources.
A Data Source Group is the physical file implementation. Part of the configuration of a Data Source Group is to specify the location of the files to be held on disk.
Data Sources can be shared between multiple Data Source Groups, but this may lead to the duplication of data.
Indexes
Indexes group Data Source Groups together and form the presentation layer that determines how users will view and query the information.
Data Source Groups can be shared between Indexes. This does not duplicate the data, instead allowing it to be accessed via different Indexes.
Search Engines
Search Engines group Indexes together to create a logical collection of searchable data.
Any single CXAIR implementation can have any number of Search Engines. Search Engines are also an effective method of partitioning data for security purposes, as administrators can restrict which Search Engines a user has access to.
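The hierarchy described in this section can be pictured as a small data model. The sketch below is illustrative only: the class names, attributes, connection string and paths are assumptions made for the example and do not reflect CXAIR's internal classes or configuration format.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class DataSource:
    name: str
    connection: str      # hypothetical, e.g. a JDBC URL or a file path
    extract: str = ""    # e.g. the SQL statement for a database source


@dataclass
class DataSourceGroup:
    name: str
    location: str        # where the physical files are held on disk
    sources: List[DataSource] = field(default_factory=list)


@dataclass
class Index:
    name: str
    groups: List[DataSourceGroup] = field(default_factory=list)


@dataclass
class SearchEngine:
    name: str
    indexes: List[Index] = field(default_factory=list)

    def data_source_groups(self) -> List[DataSourceGroup]:
        """Return each distinct Data Source Group reachable from this Search Engine."""
        seen = {}
        for index in self.indexes:
            for group in index.groups:
                seen[id(group)] = group
        return list(seen.values())


orders = DataSource("orders", "jdbc:hypothetical://host/sales", "SELECT * FROM orders")
sales_group = DataSourceGroup("sales", "/data/cxair/sales", [orders])

# The same group is shared by two Indexes: one physical copy, two presentation layers.
finance = Index("Finance", [sales_group])
operations = Index("Operations", [sales_group])
engine = SearchEngine("Corporate", [finance, operations])
```

Here both Indexes refer to the same Data Source Group object, mirroring the point that sharing a group between Indexes does not duplicate the data.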
Security
Authentication and Authorisation
CXAIR contains a comprehensive security framework that includes a number of authentication implementations and the ability to add bespoke security filters to allow single sign-on across any group of systems.
In a Windows environment, CXAIR provides a single sign-on process where a user registered on the Windows Domain has a proxy CXAIR account automatically created the first time they connect to a server instance. User permissions can then be validated based on the combined Windows and CXAIR account.
This process is underpinned by the open source library JCIFS and may require registry changes on the Windows Domain Controller to work correctly. For non-Windows environments, additional security implementations exist, including Anonymous and Java EE authentication.
Once an account has been created, the user is placed in a default group and has a default set of access rights. Administrators are then able to move users to a specific group, as required.
Users can exist in multiple groups and each group can be assigned different access rights to specific Indexes, Search Engines, features and reports.
User accounts, Proxy User accounts for Windows users and group configuration options are held within the CXAIR configuration database. Report permissions are held as text files on the CXAIR server.
Realms
Of the numerous methods to restrict data access within CXAIR, the most granular security measure is the use of Realms, defined as a set of documents that a group of users can see.
To configure a realm, the administrator creates a set of filters that are automatically applied to the resulting output when an allocated user runs any form of query or report.
Realms are allocated to groups of users. Where one or more realms have been allocated to a group, users can only view specific documents in the Indexes included in the realm. Any other documents are not visible and are not included in search results, drop lists, counts or calculations. Users not allocated a realm can still access all the documents in an Index. Only users allocated a realm are restricted by it.
Making realm access mandatory for specific data is the strictest realm-based security policy, configurable by the system administrator.
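The realm behaviour described above can be sketched as a filtering step applied before any results are produced. This is a simplified illustration, not CXAIR's implementation: the function and field names are hypothetical, and combining several realms with OR is an assumption of the sketch.

```python
from typing import Callable, Dict, List

Document = Dict[str, str]
Filter = Callable[[Document], bool]


def visible_documents(docs: List[Document], realms: List[List[Filter]]) -> List[Document]:
    """Return the documents a user can see.

    realms holds one list of filters per realm allocated to the user's groups.
    With no realms there is no realm restriction. Otherwise a document is
    visible if it passes every filter of at least one realm (assumed here).
    """
    if not realms:
        return list(docs)
    return [d for d in docs if any(all(f(d) for f in realm) for realm in realms)]


docs = [
    {"region": "North", "dept": "Sales"},
    {"region": "South", "dept": "Sales"},
    {"region": "North", "dept": "HR"},
]
north_sales = [lambda d: d["region"] == "North", lambda d: d["dept"] == "Sales"]

unrestricted = visible_documents(docs, [])           # user with no realm
restricted = visible_documents(docs, [north_sales])  # user with one realm
```

Counts, drop lists and calculations would then be computed over the visible set only, so a restricted user never sees evidence of the hidden documents.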
Encoding
CXAIR stores information in encoded flat files on disk, similar to a database. Therefore, the same security procedures are applicable.
External access to the file system is limited to HTTPS and restricted by a firewall. Additionally, HTTPS access is controlled by the CXAIR security framework, which locks access down by IP address, user credentials and realm-based security.
Index Storage
Physical Storage
While the mount point for an Index can be changed, the default location for an Index on a Windows machine is: {$CXAIR_HOME}\sys\Index
Each Index is represented by a set of folders. The root folder contains a directory that is allocated an incremental number that changes each time a new version of the Index is created.
The Index is split into multiple segments, each containing multiple documents. The number of documents in a segment is configurable at Data Source Group level. The number of segments and their size within an Index has a direct impact on performance.
Each segment is searchable in parallel. This means that when a search is initiated, the system will allocate a thread from its parallel searcher pool to search across a single segment of the Index.
Too many segments can cause slow query performance if there are insufficient parallel searcher threads available to search each of the segments. Conversely, if an Index is built to contain a small number of extremely large segments, the cumulative query time may be slower as each parallel searcher thread has to search across a larger block of data.
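The trade-off between segment count and searcher threads can be explored with a simple model. The sketch below assumes each thread searches one segment at a time and that every segment search carries a fixed overhead. The timing constants are invented for illustration and are not measured CXAIR figures.

```python
import math


def estimated_query_time(total_docs: int, segments: int, threads: int,
                         secs_per_million_docs: float = 1.0,
                         overhead_secs_per_segment: float = 0.5) -> float:
    """Estimate wall-clock query time under a simple wave model.

    Segments are processed in ceil(segments / threads) waves, and each wave
    takes as long as one segment search (scan time plus a fixed overhead).
    """
    docs_per_segment = total_docs / segments
    secs_per_segment = ((docs_per_segment / 1_000_000) * secs_per_million_docs
                        + overhead_secs_per_segment)
    waves = math.ceil(segments / threads)
    return waves * secs_per_segment


few = estimated_query_time(80_000_000, segments=2, threads=8)
balanced = estimated_query_time(80_000_000, segments=8, threads=8)
many = estimated_query_time(80_000_000, segments=64, threads=8)
print(f"2 segments: {few:.1f} s, 8 segments: {balanced:.1f} s, 64 segments: {many:.1f} s")
```

In this model, matching the number of segments to the number of available threads gives the shortest time, while both very few large segments and many small segments are slower.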
Please refer to the System Settings chapter for detailed information regarding the optimal configuration settings.
Index Versioning
When a new version of an Index is created, a new directory is created. Directories are numbered incrementally, starting from 000000, then 000001, and so on.
If the Index is built completely, the new directory is used and the old is discarded. If the Index is built incrementally, the new contents are appended to the existing directory. Over time this means that the number of segments within an Index will increase which will eventually impact performance.
CXAIR has the ability to coalesce Indexes together, which effectively compresses the Index into the optimum number of Index segments. Coalesce Indexes are typically created as part of a schedule that runs intermittently after the main incremental schedules have completed.
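The directory numbering used for Index versions can be illustrated with a short sketch. It assumes six-digit, zero-padded directory names as in the examples above, and models a complete build as "create the next directory, then discard the old ones". The function names are hypothetical, and incremental builds and coalescing are not modelled.

```python
import shutil
import tempfile
from pathlib import Path
from typing import List


def version_dirs(index_root: Path) -> List[Path]:
    """Return the numbered version directories under index_root, oldest first."""
    return sorted(p for p in index_root.iterdir() if p.is_dir() and p.name.isdigit())


def next_version_dir(index_root: Path) -> Path:
    """Create and return the next six-digit version directory."""
    existing = version_dirs(index_root)
    version = int(existing[-1].name) + 1 if existing else 0
    new_dir = index_root / f"{version:06d}"
    new_dir.mkdir()
    return new_dir


def complete_build(index_root: Path) -> Path:
    """Create a new version directory and discard all earlier versions."""
    old_dirs = version_dirs(index_root)
    new_dir = next_version_dir(index_root)
    for old in old_dirs:
        shutil.rmtree(old)
    return new_dir


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)  # stands in for the root folder of one Index
    first = next_version_dir(root).name
    second = complete_build(root).name
    remaining = [p.name for p in version_dirs(root)]
```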
Documents
Indexes are stored as physical files on disk. The crawler process that reads data from the various Data Sources brings back a block of data and then processes the data, one document at a time.
If indexing a file system, there may be a multitude of different file types. However, on a database, a document is effectively a row returned from a SQL query. As documents are analysed and transformed into Indexed documents, each document is broken down into a set of fields.
Each field is an identifiable property that has been extracted from the document. The original files are stored in the Index as CXAIR Documents, comprising a unique ID and a collection of fields and values held against that ID. Every unique value associated with each unique field is held within the Index.
A list of values and field names is stored against each Indexed document to allow the re-construction of the Indexed content of a document. Unlike relational databases, an Indexed document is able to reference multiple values for any one field. As well as storing values and pointers in fields, CXAIR also assembles some additional fields that are used when searching for a document within an Index.
By default, all text field values are added to a special field labelled ‘all’. The ‘all’ field contains a text representation of the Indexed data, held in lower case.
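A simplified picture of an Indexed document, including multi-valued fields and the derived ‘all’ field, is sketched below. The dictionary layout and function name are assumptions for illustration and do not represent the on-disk Lucene format.

```python
from typing import Dict, List


def build_document(doc_id: str, fields: Dict[str, List[str]]) -> Dict[str, object]:
    """Build a simplified Indexed document with a derived 'all' field.

    Each field maps to a list so that one field can hold several values.
    The 'all' field joins every text value, held in lower case.
    """
    all_text = " ".join(value.lower() for values in fields.values() for value in values)
    return {"id": doc_id, "fields": fields, "all": all_text}


# Hypothetical row returned from a SQL query; "skills" holds two values.
row = {"name": ["Alice Smith"], "skills": ["Java", "SQL"]}
doc = build_document("doc-1", row)
print(doc["all"])
```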
Querying
Queries are processed in parallel. The more processors, disks and threads made available to the system, the faster the queries will run. The results from querying the separate segments are aggregated together to give the result.
The number of Index directories gives the level of parallelism for a query. For example, if there are three directories, or segments, then three processes can be used to search the Index simultaneously.
The query processes are drawn from a pool that is configured by the system administrator as part of the System Settings. The system will automatically apply optimal settings for the number of Index and query threads based on the hardware available.
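The fan-out and aggregation pattern described above can be sketched with a thread pool. This illustrates the structure of a parallel query only: the segment contents, function names and pool size are hypothetical, and the sketch says nothing about actual CXAIR performance.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import List


def search_segment(segment: List[str], term: str) -> int:
    """Count the documents in one segment whose text contains the term."""
    return sum(1 for text in segment if term.lower() in text.lower())


def parallel_count(segments: List[List[str]], term: str, pool_size: int) -> int:
    """Search every segment on a pooled thread and aggregate the partial counts."""
    with ThreadPoolExecutor(max_workers=pool_size) as pool:
        partial_counts = pool.map(search_segment, segments, [term] * len(segments))
        return sum(partial_counts)


segments = [
    ["red apple", "green pear"],
    ["apple pie", "plum tart"],
    ["banana split", "apple juice"],
]
total = parallel_count(segments, "apple", pool_size=3)
```

With three segments and a pool of three threads, each segment is searched by its own thread and the three partial counts are summed to give the final result.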
The ‘all’ Field
When a search is carried out without qualifying a specific field, the system will search the ‘all’ field to find any documents that match the selected criteria.
For example, searching for ‘a*’ would bring back any documents that have one or more field containing a word beginning with an ‘a’ followed by any other characters.
When a user clicks on a field or defines a filter, the system will qualify the search by specifying the field to search, speeding up the search process.
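The difference between an unqualified search and a field-qualified search can be sketched as follows. The document layout and the matches function are assumptions for illustration. Real queries are executed by Lucene, not by word-by-word matching like this.

```python
import fnmatch
from typing import Optional


def matches(doc: dict, pattern: str, field: Optional[str] = None) -> bool:
    """Return True if any word in the chosen field matches the wildcard pattern.

    When no field is given, the lower-case 'all' field is searched instead.
    """
    if field is None:
        text = doc["all"]
    else:
        text = " ".join(doc["fields"].get(field, [])).lower()
    return any(fnmatch.fnmatchcase(word, pattern.lower()) for word in text.split())


docs = [
    {"id": 1, "fields": {"name": ["Alice"], "city": ["Leeds"]}, "all": "alice leeds"},
    {"id": 2, "fields": {"name": ["Bob"], "city": ["Aberdeen"]}, "all": "bob aberdeen"},
    {"id": 3, "fields": {"name": ["Carol"], "city": ["Derby"]}, "all": "carol derby"},
]

unqualified = [d["id"] for d in docs if matches(d, "a*")]
qualified = [d["id"] for d in docs if matches(d, "a*", field="name")]
```

The unqualified search for ‘a*’ finds documents 1 and 2, because any field may contain a matching word, whereas qualifying the search with the name field narrows the result to document 1.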
Scaling
CXAIR harnesses the power of Lucene to provide an extremely fast method of Indexing and querying large amounts of structured and unstructured data. On top of Lucene, the CXAIR architecture provides additional configuration capabilities that result in increased performance and scalability.
Integration
CXAIR provides a number of mechanisms for integrating the searching and reporting elements into a third-party application or website, and for extending the product to Index and analyse data contained in vendor-specific formats.
Web Services Description Language
Through the use of Web Services Description Language (WSDL), clients can authenticate with a CXAIR instance and perform queries and initiate saved reports. All data is returned as XML and needs to be unpacked and formatted by the client application.
WSDL provides interoperability with non-Java based technologies including .NET and PHP.
JavaScript
JavaScript integration allows access to Crosstab report data for rendering and charting into a bespoke web page. The JavaScript API allows the creation of data mash-ups and custom charting and mapping components.
JSP and Java
CXAIR runs within a Java EE container and exposes access to data through JavaBeans and Servlets. These can be integrated into custom JSP pages to create embedded CXAIR search and reporting-based solutions.
Custom Adapters
CXAIR provides developers with the ability to create and register bespoke Data Sources that can be used to Index non-standard data structures and content.
Custom Adapters are written in Java and loaded into CXAIR to be used in the configuration area or user interface. Adapters can be extended to provide mash-up Indexes to query through a standard CXAIR instance or through its various APIs.
Google Maps
Using the Google Maps API, it is possible to embed a range of configurable, interactive maps into a report, complete with a variety of runtime options for users to navigate.
Google Analytics
Using the Google Analytics API, data captured through an account can be Indexed in CXAIR, allowing further analysis of captured Google data.
Google Drive Time
Using the Google Distance Matrix API, multiple travel times between two locations on a map can be estimated and embedded onto a report for comparative analysis.
Stanford CoreNLP
Integration with Stanford CoreNLP expands the query capabilities that aid data discovery for users, providing a broader range of natural language understanding.