Wiki source code of 04. ETL Data Processing Guide

Last modified by Paul Jackson on 06/11/2025, 14:23

Hide last authors
Paul Jackson 1.1 1 {{box cssClass="floatinginfobox" title="**Contents**"}}
2 {{toc/}}
3 {{/box}}
4
5 This guide has been written to outline the steps required to Index, join, shape and flatten data using the [[Database Wizard>>doc:Technical Documentation.CXAIR.Administration Guide.Wizards.Database Data Source Wizard.WebHome]], [[Amalgamated Wizard>>doc:Technical Documentation.CXAIR.Administration Guide.Wizards.b\. Amalgamated Data Source Wizard.WebHome]] and [[ETL Wizard>>doc:Technical Documentation.CXAIR.Administration Guide.Data Warehouse.c\. ETL.WebHome]] to automate the creation of a customer master record, with the final output summarising customer activity across the business.
6
7 To facilitate this level of data processing, the [[Extract, Transform & Load (ETL)>>doc:Technical Documentation.CXAIR.Administration Guide.Data Warehouse.c\. ETL.WebHome]][[ >>doc:Technical Documentation.CXAIR.Administration Guide.Data Warehouse.c\. ETL.WebHome]]functionality is a vital component. This is due to how [[CXAIR>>doc:Technical Documentation.CXAIR.WebHome]] processes data when Indexed on the system.
8
9 When building an Index, [[CXAIR>>doc:Technical Documentation.CXAIR.WebHome]] handles the retrieved data row-by-row. Therefore, there is no grouping at this point and no way to relate rows to each other. For example, the same Customer ID appearing on each row is inconsequential – [[CXAIR>>doc:Technical Documentation.CXAIR.WebHome]] simply stores the rows in the order they are retrieved.
10
11 To discover and specify these relationships, these Indexes can then be used in the [[Crosstab>>doc:Technical Documentation.CXAIR.User Guide.02\. Reporting.2c\. Crosstabs.WebHome]] functionality where fields can be grouped at row or column level before aggregated values are derived at run-time. For example, using a date field as a column value will order the values in the report and update the output accordingly.
12
13 With this in mind, the following questions pose a problem when simply viewing the raw data:
14
15 //When was a customer’s first transaction? What was the value?
16 What is the total balance value for a single customer across multiple accounts?
17 What is the most recent transaction value?//
18
19 The answers to these questions cannot be determined without prior grouping at customer level, as each transaction is stored as a single row.
20
21 This is where the [[ETL>>doc:Technical Documentation.CXAIR.Administration Guide.Data Warehouse.c\. ETL.WebHome]] functionality and its fundamentally different approach to data processing is crucial, as it is possible to loop through data based on a set of keys to group items together. Using the wizard, a number of grouped aggregations can be written out to the Index that negates the requirement for report-level relationship discovery.
22
23 For the purpose of this guide, demonstration data has been used that contains three separate customer, account and transaction tables. For a copy of this sample dataset, please contact Connexica.
24
25 = Creating the Source Indexes =
26
27 Three Indexes will be created that each serves a specific purpose, providing key information for an area of the business.
28
29 Using the [[Database Wizard>>doc:Technical Documentation.CXAIR.Administration Guide.Wizards.Database Data Source Wizard.WebHome]], select Excel XLSX as the **Database Type** and use the **File** option to select the supplied Excel file.
30
31 For the first Index, enter ‘Accounts’ into the **Sheet** textbox. For the second Index, repeat the above process and instead enter ‘Customers’ into the **Sheet** textbox and for the third Index, enter ‘Transactions’ into the **Sheet** textbox:
32
33 [[image:Screenshot 1.png]]
34
35 Ensure the three Indexes are added into the same Search Engine and the Home screen should resemble the following screenshot:
36
37 [[image:Screenshot 2.png||height="330" width="950"]]
38
39 The Accounts Index contains multiple account information for a single customer:
40
41 [[image:Screenshot 3.png||height="309" width="950"]]
42
43 The Customers Index contains a record per customer, detailing personal information:
44
45 [[image:Screenshot 4.png||height="207" width="950"]]
46
47 The Transactions Index contains transactional-level information for a single customer across multiple accounts:
48
49 [[image:Screenshot 5.png||height="417" width="950"]]
50
51 = Amalgamating the Data =
52
53 With three separate Indexes now created, the [[Amalgamated Wizard>>doc:Technical Documentation.CXAIR.Administration Guide.Wizards.b\. Amalgamated Data Source Wizard.WebHome]] can be used to join the data into a single Amalgamated Index.
54
55 This is facilitated by two joins. First, join the Customers and Accounts Indexes using a Left Outer join and select ‘Customer_ID’ as the common field. This will result in account information being added for every customer where the ‘Customer_ID’ field matches.
56
57 Leave the **Display Fields** checkboxes blank to include every column in the resulting output:
58
59 [[image:Screenshot 6.png||height="497" width="950"]]
60
61 Next, create a second join that combines the Accounts and Transactions Indexes using a Left Outer join with ‘Account_Number' used as the first common field. For this join, a second common field is also needed. Click **Add Link** and select ‘Account_Type’ and ‘Acc_Type’ to meet this requirement. Establishing this second condition ensures that a unique account number is not matched to the wrong type of account.
62
63 From the Accounts Index **Display Fields**, leave the checkboxes empty to bring through every column, while from the Transactions Index, select the ‘Account Name’, ‘Trans ID’, ‘Transaction Date’, ‘Transaction Type’ and ‘Transaction Value’ fields to ensure duplicate fields are not in the output:
64
65 [[image:1697452418525-512.png||height="734" width="1310"]]
66
67 Use the **Add to Search Engine** drop-down list to specify the Search Engine the resulting Index will be added to, enable the **Build Now** option and click **Create** to complete the process.
68
69 The resulting Index now contains a row for every transaction with the additional customer and account-level columns present:
70
71 [[image:Screenshot 8.png||height="530" width="950"]]
72
73 = Extra Field Calculations =
74
75 With an Amalgamated Index now created from the three source Indexes, the output can be further supplemented by utilising the [[Extra Fields>>doc:Technical Documentation.CXAIR.Administration Guide.4\. Manual Index Creation.b\. Creating a Data Source Group.WebHome||anchor="Extra Fields"]] functionality. Written at Data Source Group level, calculations can be added that derive additional columns at build-time.
76
77 The first extra field that can be added to the Amalgamated Index concatenates the Account Number and Account Name fields to create a composite key called Product Key. This will be used in later stages to denote the number of unique accounts opened by a customer and uses the following logic:
78
79 **AccountNumber + '-' + AccountName**
80
81 [[image:Screenshot 9.png]]
82
83 The second extra field, Months as Customer, dynamically compares the Inception Date value for every row and compares it to the current system date. The difference is then written out in months. This uses the following logic:
84
85 **DateBetween(InceptionDate , TODAY , MONTHS)**
86
87 [[image:Screenshot 10.png]]
88
89 Once the two calculated fields have been saved, rebuild the Index and validate the results in the [[Query>>doc:Technical Documentation.CXAIR.User Guide.02\. Reporting.2a\. Query.WebHome]] screen.
90
91 = Saved Queries =
92
93 With rows from multiple accounts joined in a single Index, the following queries can be used to access key areas of interest:
94
95 **+Account_Status:"Active"**
96 **+Account_Status:"Closed"**
97 **+Account_Status:"Offer"**
98
99 These queries provide a cohort of data for active and closed accounts, along with any accounts that have been offered to a customer. Save these queries to utilise this dynamic filtering when creating aggregations in the [[ETL Wizard>>doc:Technical Documentation.CXAIR.Administration Guide.Data Warehouse.c\. ETL.WebHome]].
100
101 = ETL Processing =
102
103 In this example, the data processing is split into five distinct stages in order to shape the data into the final output.
104
105 Using the [[ETL Wizard>>doc:Technical Documentation.CXAIR.Administration Guide.Data Warehouse.c\. ETL.WebHome]], select the previously created Amalgamated Index from the **Index** drop-down list to reveal the necessary processing options. (The [[ETL Wizard >>doc:Technical Documentation.CXAIR.Administration Guide.Data Warehouse.c\. ETL.WebHome]]allows additional steps to be added directly within the ETL to create these Database & Amalgamated Indexes. Clicking the '+' plus button lists these optional steps.)
106
107 == Stage 1 ==
108
109 The first stage will flag the first and last transaction for each customer and will count their number of accounts, products and transactions.
110
111 When using the [[ETL Wizard>>doc:Technical Documentation.CXAIR.Administration Guide.Data Warehouse.c\. ETL.WebHome]], each stage requires a Primary Key value entered in the **Key** drop-down list, and a field that will be used to order the output entered in the **Order** drop-down list.
112
113 As the output will be based at a customer-level, select ‘Customer ID’ from the **Key** drop-down list. Select ‘Transaction Date’ from the **Order** drop-down list to specify that the data will be ordered by the date of each transaction:
114
115 [[image:Screenshot 11.png]]
116
117 From the **Flags** tab, select the ‘Transaction Date’ field from the **Is First Minimum** and **Is Last Maximum** drop-down lists. This will identify the first and last transaction date for each customer:
118
119 [[image:Screenshot 12.png||height="620" width="720"]]
120
121 In the **Aggregations** tab, use the **Count Unique** option to count the number of accounts, products and transactions for each customer. This is achieved by selecting a key field that denotes each area of interest. Select ‘Account Number’, ‘Product Key’ and ‘Trans ID’.
122
123 Additionally, select the ‘Transaction Value’ field from the **Sum** drop-down list to sum the values from multiple rows:
124
125 [[image:Screenshot 13.png]]
126
127 As the **Is First Minimum** and **Is Last Maximum** flags have been set, the **Additional** tab provides further options to tailor the output when the flag is true. Set the ‘Account Type’ and ‘Transaction Date’ fields for the first transaction, and the ‘Transaction Date’, ‘Transaction Type’ and ‘Transaction Value’ fields for the last transaction:
128
129 [[image:Screenshot 14.png]]
130
131 By setting these options in the **Additional** tab, each customer’s first transaction is flagged and the account type is added to the row. Additionally, each customer’s last transaction is also supplemented with the date, type of transaction and the value of the transaction.
132
133 == Stage 2 ==
134
135 The next three stages all perform the same actions against different subsets of data, each pertaining to a different account type.
136
137 The second stage will count closed products per customer and does not need the data to be output in a particular order. Therefore, select ‘Customer ID’ for both the **Key** and **Order** drop-down lists:
138
139 [[image:Screenshot 15.png]]
140
141 In the **General** section, select the closed accounts query saved previously and enable the **Others** option:
142
143 [[image:Screenshot 16.png]]
144
145 Using a saved query at this point will filter down this stage to rows of interest. Without the **Others** option enabled, any rows not returned by the saved query are discarded. When enabled, these results are amended to the bottom of the stage output for use in future stages. This means that stages three and four can still utilise other cohorts of data not used in this stage.
146
147 From the **Aggregations** tab, select the ‘Product Key’ field from the **Count Unique** drop-down list. Due to the added query, this will count the number of closed accounts for each customer:
148
149 [[image:Screenshot 17.png||height="362" width="720"]]
150
151 == Stage 3 ==
152
153 Repeat the options specified for Stage 2, instead selecting the Active Accounts query. This will count each active account per customer.
154
155 == Stage 4 ==
156
157 Repeat the options specified for Stage 2, instead selecting the Offered Accounts query. This will count each account offer per customer.
158
159 == Stage 5 ==
160
161 The final stage deduplicates records and squashes the data into a single row per customer. This is achieved by selecting ‘Customer ID’ in both the **Key** and **Order** drop-down lists and selecting the Squash option:
162
163 [[image:Screenshot 2019-09-12 at 16.50.10.png]]
164
165 When the **Squash** option has been enabled, the fields that will be present in the output now need to be specified in the **Copy** tab.
166
167 By default, every field is added to the **Copy** drop-down list. However, as the output relies on unique values, remove all entries from the **Copy** drop-down list and instead populate the **Distinct** drop-down list with the required fields:
168
169 [[image:Screenshot 2019-10-22 at 14.39.56.png]]
170
171 If preferred, these fields can instead be added to the **Unique** drop-down list at this point, which will perform the same function but order the values. As ordering the values at this point will not provide any meaningful insight, this is not required.
172
173 With the all five stages configured, use the **Add to Search Engine** drop-down list to specify the Search Engine the resulting Index will be added to, enable the **Build Now** option and click **Create** to complete the process.
174
175 = Renaming Key Fields =
176
177 To enhance the end-user experience and make the fields easier to understand, navigate to the Index level of the created ETL Index and open the **Index Fields** options.
178
179 From here change the text in the **Display Field** text to better represent the field contents. For example, change ‘Account_Number_COUNTUNIQUE’ to ‘No. of Accounts’.
180
181 For a full list of recommended field names, please see the below screenshot:
182
183 [[image:Screenshot 20.png||height="740" width="950"]]
184
185 In the [[Query>>doc:Technical Documentation.CXAIR.User Guide.02\. Reporting.2a\. Query.WebHome]] screen the Index should now resemble the following screenshot:
186
187 [[image:Screenshot 21.png||height="199" width="950"]]
188
189 This ETL output contains a row per customer, outlining a number of key metrics to provide a view of activity across numerous areas of the business.
190
191 = Automating the Output =
192
193 With the data processing now complete, schedules can now be established to automate the entire process and account for new and changed records.
194
195 Rather than creating individual schedules for each component, this will be achieved using three Index schedules. The first will rebuild the Indexes from the source Excel sheet, the second will rebuild the subsequent Amalgamated Index and the third will rebuild the ETL Index.
196
197 First, navigate to the [[System Overview>>doc:Technical Documentation.CXAIR.Administration Guide.Status Monitoring.System Overview.WebHome]] screen and select the three source Indexes:
198
199 [[image:Screenshot 22.png||height="265" width="950"]]
200
201 Once selected, click [[Collections>>doc:Technical Documentation.CXAIR.Administration Guide.Status Monitoring.System Overview.WebHome||anchor="Collections"]] then **Save** and name the Collection ‘Source Data’. By grouping these three Indexes into one Collection, a single Index schedule can be created that will queue all contained Indexes at build-time, negating the requirement for many Index schedules.
202
203 Navigate to the [[Index Schedules>>doc:Technical Documentation.CXAIR.Administration Guide.Status Monitoring.e\. Index Schedules.WebHome]] screen and click **New**. Enter ‘Source Indexes’ into the Name textbox and select the ‘Source Data’ collection from the Items drop-down list:
204
205 [[image:Screenshot 24.png]]
206
207 Use the **Frequency of Execution** drop-down list to specify how often the schedule will run and click **Create Schedule** to complete the process.
208
209 Next, create a new schedule called ‘Amalgamated Indexes’ and add the previously created Amalgamated Index using the **Items** drop-down list. From the **Frequency of Execution** drop-down list, select **Dependent**. This allows the schedule to only run once another schedule has completed, ensuring that the Index does not build until all of the source data has been refreshed successfully. Select the ‘Source Data’ schedule from the **Dependent Index Schedules** drop-down list and the screen should resemble the following screenshot:
210
211 [[image:Screenshot 25.png]]
212
213 Finally, create a third schedule titled ‘ETL Indexes’ and select the previously created ETL Index selected from the **Items** drop-down list and **Dependent** from the **Frequency of Execution** drop-down list. This time select the ‘Amalgamated Indexes’ schedule. This will instruct the system to only perform the ETL processing once the Amalgamated Index has built. The screen should resemble the following screenshot:
214
215 [[image:Screenshot 26.png]]
216
217 With all three items created, the following schedules should be visible:
218
219 [[image:Screenshot 27.png]]
220
221 Creating these dependent schedules will guarantee that the Indexes automatically build in the correct order, accounting for new information and changed records over time.