XWiki - Technical Documentation.CXAIR.Technical Architecture

The Connexica Ad hoc Interactive Reporter, [[CXAIR>>doc:Technical Documentation.CXAIR.WebHome]], is a business intelligence tool that provides users with a search-based interface to query and analyse data held in structured and unstructured sources.

Unlike traditional technologies, such as relational databases and OLAP reporting tools, [[CXAIR>>doc:Technical Documentation.CXAIR.WebHome]] reports from a set of physical Lucene Indexes that are logically grouped together to form a Search Engine. This provides users with fast access to underlying records rather than access to an aggregated view following a lengthy build time.

[[CXAIR>>doc:Technical Documentation.CXAIR.WebHome]] is a Java EE compliant application. A standard install includes an embedded version of Apache Tomcat, but other Java EE compliant application servers can be configured for use.

= Core Components =

== Configuration Database ==

Integral to [[CXAIR>>doc:Technical Documentation.CXAIR.WebHome]] is an HSQLDB database that contains information on users, permissions, versioning and system configuration settings.

== Index Builder ==

The Index Builder is a set of processes that initiate and manage the crawling and Indexing operations across one or more Data Sources to extract and store data within an Index for fast retrieval.

This process can be executed immediately or scheduled to occur at specific time intervals. The number of Indexes that can be built at any one time is controlled by the [[System Settings>>doc:Technical Documentation.CXAIR.Administration Guide.Status Monitoring.System Settings.WebHome]] available to the administrator and can be scaled based on hardware capability and the volume of data.

Index builds can either be complete replacements, incremental, maintained or snapshot-based. Snapshot-based Indexes allow [[CXAIR>>doc:Technical Documentation.CXAIR.WebHome]] to store and maintain a history of data that allows users to report from a specific point in time.

The indexing process works by quickly creating a large number of small objects that are held for a short period of time in system memory. In contrast, the querying process requires caching of fewer objects for a longer period. Separating the indexing tasks into separate processes allows greater flexibility when configuring the memory allocation of the various [[CXAIR>>doc:Technical Documentation.CXAIR.WebHome]] components, allowing greater control of the garbage collection to prevent pauses. [[CXAIR>>doc:Technical Documentation.CXAIR.WebHome]] can be configured to have multiple indexing nodes for high-volume, multi-machine implementations.

The indexing process runs more efficiently with less memory. Therefore, subject to hardware restrictions, running each indexing process in a separate JVM is recommended.

== Indexes ==

A key advantage of using [[CXAIR>>doc:Technical Documentation.CXAIR.WebHome]] is the speed of data access. This is due to the use of highly optimised Indexes to store and retrieve data.

Indexes provide rapid access to all data types, with the underlying records stored in multiple formats to assist searching. This makes the Index considerably larger than the original source data, with an imported CSV file typically consuming around 250% more disk space when indexed in [[CXAIR>>doc:Technical Documentation.CXAIR.WebHome]].
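
As a rough sizing aid, the 250% figure above can be turned into a small calculation. This is only a sketch: the function name and the default overhead are illustrative assumptions, and real Index sizes depend on the data and field configuration.

```python
def estimated_index_size(source_bytes: int, overhead_percent: float = 250.0) -> int:
    """Estimate the on-disk Index size for a source of the given size.

    overhead_percent is the extra space relative to the source data. The
    250% default is the typical CSV figure quoted in the text, not a guarantee.
    """
    return round(source_bytes * (1 + overhead_percent / 100))


# Hypothetical example: a 2 GiB CSV file.
csv_bytes = 2 * 1024**3
index_bytes = estimated_index_size(csv_bytes)
print(f"Estimated Index size: {index_bytes / 1024**3:.1f} GiB")
```

An estimate like this is mainly useful when planning disk capacity before a first build.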

This approach results in fast access over large amounts of data without relying on large amounts of RAM, as it is far more cost-effective to scale disks than to be restricted by memory. Therefore, a key performance requirement when processing large Indexes is multiple fast disks in a RAID configuration.

[[CXAIR>>doc:Technical Documentation.CXAIR.WebHome]] has extended the Lucene indexing and search capabilities while retaining the data in Lucene format as an inverted Index, making [[CXAIR>>doc:Technical Documentation.CXAIR.WebHome]] Indexes accessible to other Lucene-compatible applications.

With the CXFORMS licence option, the indexing process is further enhanced. Captured data is written directly to the relevant Index, allowing users to create reports from the responses.

== User Interface ==

[[CXAIR>>doc:Technical Documentation.CXAIR.WebHome]] is accessed in a browser window and is compatible with the following internet browsers:

* Windows Internet Explorer 11 or above, with compatibility mode disabled
* Mozilla Firefox 51.0 or above
* Google Chrome 56.0 or above

The [[CXAIR>>doc:Technical Documentation.CXAIR.WebHome]] server must also have Mozilla Firefox 59 installed with auto-update disabled for extended PDF support required for skewed components in [[Pages>>doc:Technical Documentation.Legacy Documentation.CXAIR 2017\.2.User Guide (2017\.2).02\. Reporting.2e\. Pages.WebHome]] and [[CXForms>>doc:Technical Documentation.CXFORMS.WebHome]].

A wide range of charting options are available to visually enhance reports, with [[CXAIR>>doc:Technical Documentation.CXAIR.WebHome]] leveraging the graphical capabilities of the FusionCharts and D3.js JavaScript charting libraries.

The report distribution mechanism that generates reports in CSV, Excel, HTML, Image, PDF or XML format for emailing and downloading can be fully automated through the scheduling functionality or can be accessed on an ad hoc basis.

The user interface can be white-labelled to suit an organisation using skins, where the standard logos and colour schemes are replaced with client-specific alternatives.

A standard end user can load, save, print and email reports while performing ad hoc searching and filtering of data to create a variety of report types. Administrators can grant functionality at a user and user group level.

== Administration ==

All aspects of administering the system are accessed via the Administration Area, only available to Admin users.

Administration roles include the creation of new users, allocation of users to groups, the creation of new Indexes and Search Engines, the setting up of schedules and alerts and the process of exporting and importing data and reports between [[CXAIR>>doc:Technical Documentation.CXAIR.WebHome]] instances.

Please refer to the [[Administration Guide>>doc:Technical Documentation.CXAIR.Administration Guide.WebHome]] for detailed information regarding the functionality available to system administrators.

== Report Reader ==

Reports are held on the file system as XML documents. The Report Reader is responsible for searching and filtering reports depending on the access rights and permissions of a user.

Users have access to a local folder called My Reports for storing and retrieving saved reports. Other folders can be made available to allow access to reports created by other users where the relevant permissions have been granted.

== Report Executer ==

The Report Executer is initiated when a user loads and runs a report or when a scheduled process attempts to run a saved report.

When a report is executed, the report definition is transformed into a series of parallel workflow activities that construct the report output by performing a series of query executions and data transformation processes.

The Report Executer produces the output as a combination of HTML, images and style sheets that are either rendered directly on screen or passed to a formatter process that re-renders the output in an Excel or PDF compatible format.

Where data is rendered as a PDF, a third-party PDF renderer, PhantomJS, is used.

== Report Viewer ==

The Report Viewer functionality is a restricted ‘view only’ licence feature, providing a simplified interface to view, drill down, filter and print reports.

Saved reports can be made available to Report Viewer users to load in this restricted environment where each report will run against the most up-to-date data.

Security features, such as access to the underlying data, can be configured on a report or user basis.

= Search Engine Structure =

In [[CXAIR>>doc:Technical Documentation.CXAIR.WebHome]], a Search Engine is a series of Indexes that have been logically grouped together. An Index comprises one or more Data Source Groups, each of which comprises one or more Data Sources.

== Data Sources ==

In [[CXAIR>>doc:Technical Documentation.CXAIR.WebHome]], a Data Source is a configuration object that holds details on how to connect to and what to extract from a location or system.

For databases, the Data Source definition will contain a SQL statement that defines precisely which data is to be extracted from the source system and transformed into an Index when the Data Source Group is built.
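
The idea that a SQL statement selects exactly which rows become Indexed documents can be sketched as follows. This is an illustration only: it uses Python's built-in `sqlite3` module as a stand-in for a JDBC or ODBC connection, and the `admissions` table, its columns and the `rows_to_documents` helper are hypothetical.

```python
import sqlite3


def rows_to_documents(connection, sql):
    """Run the extraction query and turn each returned row into one document."""
    cursor = connection.execute(sql)
    columns = [column[0] for column in cursor.description]
    return [dict(zip(columns, row)) for row in cursor]


# Hypothetical source system, held in memory for the example.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE admissions (id INTEGER, ward TEXT, los_days INTEGER)")
conn.executemany(
    "INSERT INTO admissions VALUES (?, ?, ?)",
    [(1, "A1", 3), (2, "B2", 7), (3, "A1", 12)],
)

# The SQL statement decides which rows and columns are extracted.
extraction_sql = "SELECT id, ward FROM admissions WHERE los_days > 5 ORDER BY id"
documents = rows_to_documents(conn, extraction_sql)
```

Only the rows and columns named in the statement are extracted, so narrowing the query is the simplest way to keep unwanted data out of an Index.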

The compatible systems include:

* file systems containing data readable by the [[CXAIR>>doc:Technical Documentation.CXAIR.WebHome]] crawler process
* URLs that allow connection to a website where content can be read and links can be followed
* an email server, to access a specific account
* a database where JDBC or ODBC can be used to retrieve data and metadata
* a third-party connector to communicate with, and extract metadata and data from, a proprietary system

== Data Source Groups ==

Data Source Groups consist of one or more Data Sources.

A Data Source Group is the physical file implementation. Part of the configuration of a Data Source Group is to specify the location of the files to be held on disk.

Data Sources can be shared between multiple Data Source Groups, but this may lead to the duplication of data.

== Indexes ==

Indexes group Data Source Groups together and form the presentation layer that determines how users will view and query the information.

Data Source Groups can be shared between Indexes. This does not duplicate the data, instead allowing it to be accessed via different Indexes.

== Search Engines ==

Search Engines group Indexes together to create a logical collection of searchable data.

Any single [[CXAIR>>doc:Technical Documentation.CXAIR.WebHome]] implementation can have any number of Search Engines. Search Engines are also an effective method of partitioning data for security purposes, as administrators can restrict which Search Engines a user has access to.
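
The hierarchy described in this section can be pictured as a small data model. This is a minimal sketch for illustration: the class and attribute names, the example names and the storage path are assumptions, not the product's internal classes.

```python
from dataclasses import dataclass, field
from typing import List, Set


@dataclass
class DataSource:
    name: str  # configuration describing where and what to extract


@dataclass
class DataSourceGroup:
    name: str
    storage_path: str  # hypothetical location of the physical files on disk
    data_sources: List[DataSource] = field(default_factory=list)


@dataclass
class Index:
    name: str  # presentation layer over one or more Data Source Groups
    groups: List[DataSourceGroup] = field(default_factory=list)


@dataclass
class SearchEngine:
    name: str  # logical collection of Indexes
    indexes: List[Index] = field(default_factory=list)

    def data_source_names(self) -> Set[str]:
        """Collect every Data Source reachable through this Search Engine."""
        return {
            source.name
            for index in self.indexes
            for group in index.groups
            for source in group.data_sources
        }


# One Data Source Group shared by two Indexes: both refer to the same
# physical files, so the data is not duplicated.
patients = DataSourceGroup("patients", "/data/idx/patients", [DataSource("patients_db")])
clinical = Index("Clinical", [patients])
finance = Index("Finance", [patients])
engine = SearchEngine("Hospital", [clinical, finance])
```

Modelling the shared group as a single object mirrors the point made above: sharing a Data Source Group between Indexes reuses the stored data rather than copying it.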

= Security =

== Authentication and Authorisation ==

[[CXAIR>>doc:Technical Documentation.CXAIR.WebHome]] contains a comprehensive security framework that includes a number of authentication implementations and the ability to add bespoke security filters to allow single sign-on across any group of systems.

In a Windows environment, [[CXAIR>>doc:Technical Documentation.CXAIR.WebHome]] provides a single sign-on process where a user registered on the Windows Domain has a proxy [[CXAIR>>doc:Technical Documentation.CXAIR.WebHome]] account automatically created the first time they connect to a server instance. User permissions can then be validated based on the combined Windows and [[CXAIR>>doc:Technical Documentation.CXAIR.WebHome]] account.

This process is underpinned by the open source JCIFS library and may require registry changes on the Windows Domain Controller to work correctly. For non-Windows environments, additional security implementations exist, including Anonymous and Java EE authentication.

Once an account has been created, the user is placed in a default group and has a default set of access rights. Administrators are then able to move users to a specific group, as required.

Users can exist in multiple groups and each group can be assigned different access rights to specific Indexes, Search Engines, features and reports.

User accounts, Proxy User accounts for Windows users and group configuration options are held within the [[CXAIR>>doc:Technical Documentation.CXAIR.WebHome]] configuration database. Report permissions are held as text files on the [[CXAIR>>doc:Technical Documentation.CXAIR.WebHome]] server.

== Realms ==

Of the numerous methods to restrict data access within [[CXAIR>>doc:Technical Documentation.CXAIR.WebHome]], the most granular security measure is the use of Realms, defined as a set of documents that a group of users can see.

To configure a realm, the administrator creates a set of filters that are automatically applied to the resulting output when an allocated user runs any form of query or report.

Realms are allocated to groups of users. Where one or more realms have been allocated to a group, users can only view specific documents in the Indexes included in the realm. Any other documents are not visible and are not included in search results, drop lists, counts or calculations. Users not allocated a realm can still access all the documents in an Index. Only users allocated a realm are restricted by it.

Making realm access mandatory for specific data is the strictest realm-based security policy, configurable by the system administrator.
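
The default, non-mandatory behaviour of a realm can be sketched as a filter that is applied before any results, counts or calculations are produced. This is a simplified illustration: the function name is hypothetical, and treating a realm as a set of field and value pairs that must all match is an assumption, since the way filters combine is not specified here.

```python
from typing import Dict, List, Optional

Document = Dict[str, str]


def apply_realm(
    documents: List[Document],
    realm_filters: Optional[Dict[str, str]],
) -> List[Document]:
    """Return only the documents a user is allowed to see.

    realm_filters is None for a user with no realm allocated, in which case
    every document stays visible. Otherwise a document is visible only if it
    matches every filter (an assumption made for this sketch).
    """
    if realm_filters is None:
        return list(documents)
    return [
        doc
        for doc in documents
        if all(doc.get(name) == value for name, value in realm_filters.items())
    ]


documents = [
    {"id": "1", "region": "North"},
    {"id": "2", "region": "South"},
    {"id": "3", "region": "North"},
]

north_only = apply_realm(documents, {"region": "North"})
unrestricted = apply_realm(documents, None)

# Counts are taken after filtering, so hidden documents never contribute.
north_count = len(north_only)
```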

== Encoding ==

[[CXAIR>>doc:Technical Documentation.CXAIR.WebHome]] stores information in encoded flat files on disk, similar to a database. Therefore, the same security procedures are applicable.

External access to the file system is limited to HTTPS and restricted by a firewall. Additionally, HTTPS access is controlled by the [[CXAIR>>doc:Technical Documentation.CXAIR.WebHome]] security framework, which restricts access by IP address, user credentials and realm-based security.

= Index Storage =

== Physical Storage ==

While the mount point for an Index can be changed, the default location for an Index on a Windows machine is: **{$CXAIR_HOME}\sys\Index**

Each Index is represented by a set of folders. The root folder contains a directory that is allocated an incremental number that changes each time a new version of the Index is created.

The Index is split into multiple segments, each containing multiple documents. The number of documents in a segment is configurable at Data Source Group level. The number of segments within an Index and their size have a direct impact on performance.

Each segment is searchable in parallel. This means that when a search is initiated, the system will allocate a thread from its parallel searcher pool to search across a single segment of the Index.

Too many segments can cause slow query performance if there are insufficient parallel searcher threads available to search each of the segments. Conversely, if an Index is built to contain a small number of extremely large segments, the cumulative query time may be slower as each parallel searcher thread has to search across a larger block of data.
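
The trade-off between segment count and searcher threads can be illustrated with a toy cost model. Everything in it is an assumption made for illustration: the function name, the per-document cost and the fixed per-segment overhead are invented figures, not measured values.

```python
import math


def estimated_query_time(
    total_docs: int,
    segments: int,
    threads: int,
    per_doc_cost: float = 1e-6,
    per_segment_overhead: float = 0.05,
) -> float:
    """Toy estimate, in seconds, of searching all segments with a thread pool.

    Each thread searches one segment at a time, so the work proceeds in
    rounds of at most `threads` segments. Both cost parameters are
    illustrative assumptions.
    """
    docs_per_segment = total_docs / segments
    rounds = math.ceil(segments / threads)
    return rounds * (per_segment_overhead + docs_per_segment * per_doc_cost)


# Hypothetical Index of 8 million documents searched by 8 threads.
few_large = estimated_query_time(8_000_000, segments=2, threads=8)
balanced = estimated_query_time(8_000_000, segments=8, threads=8)
many_small = estimated_query_time(8_000_000, segments=512, threads=8)
```

In this model the balanced layout is fastest. With two segments most threads sit idle while each busy thread scans a very large block, and with 512 segments the pool needs many rounds to get through them all.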

Please refer to the [[System Settings>>doc:Technical Documentation.CXAIR.Administration Guide.Status Monitoring.System Settings.WebHome]] chapter for detailed information regarding the optimal configuration settings.

== Index Versioning ==

When a new version of an Index is created, a new directory is created with an incrementing number, starting from 000000, then 000001 and so on.

If the Index is built completely, the new directory is used and the old one is discarded. If the Index is built incrementally, the new contents are appended to the existing directory. Over time, this means that the number of segments within an Index will increase, which will eventually impact performance.

[[CXAIR>>doc:Technical Documentation.CXAIR.WebHome]] has the ability to coalesce Indexes together, which effectively compresses the Index into the optimum number of Index segments. Coalesce Indexes are typically created as part of a schedule that runs intermittently after the main incremental schedules have completed.
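
The directory numbering used for Index versions can be sketched as follows. The six-digit, zero-padded format is inferred from the 000000 and 000001 examples in this section, and the helper names are hypothetical.

```python
from typing import Iterable


def version_directory_name(version: int) -> str:
    """Format a version number as a six-digit, zero-padded directory name."""
    return f"{version:06d}"


def next_version_directory(existing: Iterable[str]) -> str:
    """Return the directory name for the next version of an Index.

    `existing` holds the names already present under the Index root folder.
    Names that are not purely numeric are ignored.
    """
    versions = [int(name) for name in existing if name.isdigit()]
    next_version = max(versions) + 1 if versions else 0
    return version_directory_name(next_version)
```

For example, an Index root containing `000000` and `000001` would receive `000002` on its next version.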

== Documents ==

Indexes are stored as physical files on disk. The crawler process that reads data from the various Data Sources brings back a block of data and then processes the data, one document at a time.

If indexing a file system, there may be a multitude of different file types. However, on a database, a document is effectively a row returned from a SQL query. As documents are analysed and transformed into Indexed documents, each document is broken down into a set of fields.

Each field is an identifiable property that has been extracted from the document. The original files are stored in the Index as [[CXAIR>>doc:Technical Documentation.CXAIR.WebHome]] Documents, comprising a unique ID and a collection of fields and values held against that ID. Every unique value associated with each unique field is held within the Index.

A list of values and field names is stored against each Indexed document to allow the reconstruction of the Indexed content of a document. Unlike relational databases, an Indexed document is able to reference multiple values for any one field. As well as storing values and pointers in fields, [[CXAIR>>doc:Technical Documentation.CXAIR.WebHome]] also assembles some additional fields that are used when searching for a document within an Index.

By default, all text field values are added to a special field labelled ‘all’. The ‘all’ field contains a text representation of the Indexed data, held in lower case.
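
The document structure described in this section can be sketched as a mapping from field names to lists of values, plus the derived ‘all’ field. This is an illustrative model only: the function name and the dictionary layout are assumptions, not the on-disk format.

```python
from typing import Dict, List


def build_indexed_document(doc_id: str, fields: Dict[str, List[str]]) -> Dict[str, List[str]]:
    """Assemble an Indexed document from extracted field values.

    Each field may hold several values. The 'all' field is a lower-case text
    representation of every text value in the document.
    """
    document: Dict[str, List[str]] = {"id": [doc_id]}
    for name, values in fields.items():
        document[name] = list(values)
    all_text = " ".join(value for values in fields.values() for value in values)
    document["all"] = [all_text.lower()]
    return document


# Hypothetical document with a multi-valued 'tags' field.
doc = build_indexed_document(
    "doc-1",
    {"name": ["Alice Smith"], "tags": ["Urgent", "Review"]},
)
```

Here `doc["tags"]` keeps both of its values, while `doc["all"]` holds the lower-case text that unqualified searches run against.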

= Querying =

Queries are processed in parallel. The more processors, disks and threads made available to the system, the faster the queries will run. The results from querying the separate segments are aggregated together to give the result.

The number of Index directories gives the level of parallelism for a query. For example, if there are three directories, or segments, then three processes can be used to search the Index simultaneously.

The query processes are used from a pool that is configured by the system administrator as part of the [[System Settings>>doc:Technical Documentation.CXAIR.Administration Guide.Status Monitoring.System Settings.WebHome]]. The system will automatically apply optimal settings for the number of Index and query threads based on the hardware available.
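
The pattern of searching segments in parallel from a fixed pool and then aggregating the partial results can be sketched as below. This is a minimal illustration using Python's standard thread pool. The function names and in-memory segments are hypothetical, and the real searcher pool is configured through System Settings rather than in code.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, List

Document = Dict[str, object]


def search_segment(segment: List[Document], predicate: Callable[[Document], bool]) -> List[Document]:
    """Search a single segment and return its matching documents."""
    return [doc for doc in segment if predicate(doc)]


def parallel_search(
    segments: List[List[Document]],
    predicate: Callable[[Document], bool],
    pool_size: int,
) -> List[Document]:
    """Search every segment using a pool of worker threads, then merge the results."""
    with ThreadPoolExecutor(max_workers=pool_size) as pool:
        partial_results = pool.map(lambda seg: search_segment(seg, predicate), segments)
        # Aggregate the per-segment results into a single result list.
        return [doc for part in partial_results for doc in part]


# Three hypothetical segments, so up to three searches can run at once.
segments = [
    [{"id": 1, "ward": "A1"}, {"id": 2, "ward": "B2"}],
    [{"id": 3, "ward": "A1"}],
    [{"id": 4, "ward": "C3"}, {"id": 5, "ward": "A1"}],
]
matches = parallel_search(segments, lambda doc: doc["ward"] == "A1", pool_size=3)
```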

== The ‘all’ Field ==

When a search is carried out without qualifying a specific field, the system will search the ‘all’ field to find any documents that match the selected criteria.

For example, searching for ‘a*’ would bring back any documents that have one or more fields containing a word beginning with ‘a’ followed by any other characters.

When a user clicks on a field or defines a filter, the system will qualify the search by specifying the field to search, speeding up the search process.
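
The difference between an unqualified search and a field-qualified search can be sketched as follows. This is an approximation for illustration: Python's `fnmatch` wildcards stand in for the product's query syntax, and the `search` function and sample documents are hypothetical.

```python
from fnmatch import fnmatchcase
from typing import Dict, List


def search(documents: List[Dict[str, str]], pattern: str, field: str = "all") -> List[str]:
    """Return the IDs of documents with a word in `field` matching `pattern`.

    With no field given, the search runs against the 'all' field. Matching is
    done in lower case, as the 'all' field is held in lower case.
    """
    pattern = pattern.lower()
    return [
        doc["id"]
        for doc in documents
        if any(fnmatchcase(word, pattern) for word in doc.get(field, "").lower().split())
    ]


documents = [
    {"id": "1", "name": "Apple Pie", "category": "dessert", "all": "apple pie dessert"},
    {"id": "2", "name": "Banana", "category": "fruit", "all": "banana fruit"},
    {"id": "3", "name": "Cherry", "category": "Autumn fruit", "all": "cherry autumn fruit"},
]

unqualified = search(documents, "a*")              # checks every field via 'all'
qualified = search(documents, "a*", field="name")  # checks the 'name' field only
```

The unqualified search matches documents 1 and 3 because ‘apple’ and ‘autumn’ both appear in their ‘all’ fields, while the qualified search only inspects the name field and so matches document 1 alone.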

= Scaling =

[[CXAIR>>doc:Technical Documentation.CXAIR.WebHome]] harnesses the power of Lucene to provide an extremely fast method of Indexing and querying large amounts of structured and unstructured data. On top of Lucene, the [[CXAIR>>doc:Technical Documentation.CXAIR.WebHome]] architecture provides additional configuration capabilities that result in increased performance and scalability.

= Integration =

[[CXAIR>>doc:Technical Documentation.CXAIR.WebHome]] provides a number of mechanisms for integrating the searching and reporting elements into a third-party application or website, and for extending the product to Index and analyse data contained in vendor-specific formats.

**Web Services Description Language**

Through the use of Web Services Description Language (WSDL), clients can authenticate with a [[CXAIR>>doc:Technical Documentation.CXAIR.WebHome]] instance and perform queries and initiate saved reports. All data is returned as XML and needs to be unpacked and formatted by the client application.

WSDL provides interoperability with non-Java based technologies including .NET and PHP.

**JavaScript**

JavaScript integration allows access to [[Crosstab>>doc:Technical Documentation.CXAIR.User Guide.02\. Reporting.2c\. Crosstabs.WebHome]] report data for rendering and charting into a bespoke web page. The JavaScript API allows the creation of data mash-ups and custom charting and mapping components.

**JSP and Java**

[[CXAIR>>doc:Technical Documentation.CXAIR.WebHome]] runs within a Java EE container and exposes access to data through JavaBeans and Servlets. These can be integrated into custom JSP pages to create embedded [[CXAIR>>doc:Technical Documentation.CXAIR.WebHome]] search and reporting-based solutions.

**Custom Adapters**

[[CXAIR>>doc:Technical Documentation.CXAIR.WebHome]] provides developers with the ability to create and register bespoke Data Sources that can be used to Index non-standard data structures and content.

Custom Adapters are written in Java and loaded into [[CXAIR>>doc:Technical Documentation.CXAIR.WebHome]] to be used in the configuration area or user interface. Adapters can be extended to provide mash-up Indexes to query through a standard [[CXAIR>>doc:Technical Documentation.CXAIR.WebHome]] instance or through its various APIs.

**Google Maps**

Using the Google Maps API, it is possible to embed a range of configurable, interactive maps into a report, complete with a variety of runtime options for users to navigate.

**Google Analytics**

Using the Google Analytics API, data captured through an account can be Indexed in [[CXAIR>>doc:Technical Documentation.CXAIR.WebHome]], allowing further analysis of captured Google data.

**Google Drive Time**

Using the Google Distance Matrix API, multiple travel times between two locations on a map can be estimated and embedded onto a report for comparative analysis.

**Stanford CoreNLP**

Integration with Stanford CoreNLP expands the query capabilities that aid data discovery for users, providing a broader range of natural language understanding.
