Nowadays, working effectively with data is the key to success for any organisation. Identifying matching columns between different data sources is essential for integrating and consolidating data effectively.
When working with data, it is common to encounter situations where different services share information through a common identifier. Typically, it is necessary to understand which other columns correspond to each other. However, this information is not always available, especially when dealing with external data sources. Understanding these connections between data sets is crucial for efficient data analysis and more. Manually identifying matching columns by comparing names and values is tedious, ineffective, time-consuming and resource intensive.
In this article we would like to present ways in which the following challenges can be dealt with:
•Direct value matching
•Indirect value matching
•Indirect value matching with mapping recognition
Please refer to the Image 1, which depicts two tables linked by the columns 'identifier' and 'ID'. Our objective is to ascertain that the column pairs 'country code' and 'COUNTRY', 'order_currency' and 'CURRENCY' are also linked and represent the same entities.
Image 1. Examples of tables we want to match directly
This is an algorithmic problem and can be solved through comparison. By taking rows with the corresponding ID from both tables, we can match their values to identify relationships.
Image 2. The result of direct column matching
Image 3. Examples of tables we want to match directly
For this problem, direct comparison is not possible due to the differing values and lack of knowledge about the exact mapping. However, we can measure the correlation between values. Correlation is a statistical measure that indicates the extent to which two variables are related. There are various ways to measure correlations between variables. For the purposes of this study, we used the Cramér's V, which works efficiently for categorical variables as it leverages the chi-square test. This method enables us to identify relationships even when the values are not directly comparable, enhancing our ability to map columns accurately.
To calculate this metric, we first convert text into numerical values. After this, we consider column pairs to be corresponding if the correlation between them is greater than a defined threshold.
Correlation ≥ Threshold
However, this method does not provide explicit mappings and may result in suboptimal performance when dealing with noisy data, such as values mapped to multiple different values. In Section 4, we will address this challenge to achieve more accurate results.
The final case to consider is when the dataset contains values described by multiple words that convey the same meaning. To illustrate, in our example, the country code for the United States might appear as either "US" or "USA," which are the same entity for our purposes.
. Image 4. Example of tables with multiple mappings
The occurrence of multiple mappings reduces the value of the correlation coefficient, which reduces the effectiveness of the previous method. Furthermore, we do not know what the mapping looks like, especially if there is noise in the data. In addressing this case, we propose an algorithm based on a bipartite graph and recurrence.
Initially, we apply the method outlined in the previous section, where correlation effectively guides the selection of column pairs for examination. We adjust the threshold slightly lower compared to earlier to broaden our scope.
For each unique value in the first table's column, we construct a bipartite graph:
Image 5. Example of a mapping graph created for the US value in 'country code' column in relation to the 'COUNTRY' column
Graph definition
•·Nodes in the first set (red nodes) represent values from the first table's column
•Nodes in the second set (white nodes) represent values from the second table's column
•Edges connect nodes from sets if there exists a pair of values with the same identifier
• Edges weights are directional and calculated as the number of matches involving values from the source node and the target node divided by the total occurrences of the source node's value in its column.
Graph construction
The graph is built recursively. We iterate over values from the first table's column that haven't been included in any graph yet.
• We begin by selecting identifier values associated with selected value from the first table.
• For each identifier, we select corresponding values from the second table's column and add connections to them.
• If a new node is introduced, the process repeats in the opposite direction, changing the tables' order.
• Recursion stops if the node already exists.
Graph size limitation
We impose a limit on the size of the graph, which in our case is set to a small number of nodes, typically between 4 and 8 (could be larger if there is large noise in the data). If the graph exceeds this limit during construction, the algorithm will stop and indicate that the match between the columns is incorrect.
The graph size limitation addresses issues such as the high correlation between a country and its currency. For instance, the USD currency is primarily associated with the US, but can also be used by other countries with their own currencies. This will extend the graph into additional nodes, which may result in a rejection based on size. However, if no such rejection occurs, we can still accept the match with the higher correlation score.
Our proposed graph-based solution matches columns and provides information about the mapping between them.
In this article, we have presented one of the solutions we have developed to the problem of matching columns between datasets. Our approach, based on usage of adequate data structures and statistical measures, is open to extensions depending on the needs and specifics of the data.
By incorporating additional information such as edge weights in the graph, correlation coefficient or column name, we can enhance the algorithm's effectiveness. Although we focused on three primary cases in this study, several additional modifications can further improve the matching process, including but not limited to:
• Using text embeddings (like BERT word embeddings) for encoding values and employing vector similarity metrics like cosine similarity instead of correlation coefficients
• Utilizing column names as additional context to better understand column content
• Comparing sets of column values – if both columns have identical sets of values and are highly correlated, we can assume direct value matching. Any discrepancies can be treated and returned as noise
Author
Robert Hejda is an AI & Data Science Consultant at AIM Reply in Poland. With expertise in building AI-powered tools, his role is to design and build robust solutions based on the latest trends in AI. His experience covers a wide range of fields – both image and text processing, using classical, deep or statistical methods.