Sunday, December 13, 2015

Fuzzy Search Join DataTables in Spotfire - Unstructured Data


Lets assume we have Two different sets of data from two different sources which are imported into Spotfire and would like to make join between these data sets. Assuming one data set is Structured data and the other is unstructured data (data from Web).

Well, we know that to combine the two different sets, would need to identify the matching key. Since we are working with Structured and Unstructured data, in most of the cases KEY will not match exactly.The only thing you can combine the two sets is by trying to match based on similarity of the column values.

For example:-
You have a company name called "Google Inc Company" on one source and 
"Google Inc Co." on other hand... You know its the same company name but how can you combine the data in Spotfire?

How to solve?
- Manually Fix the data and prepare Mapping Table
- Making use of Regex in Spotfire which we would be difficult to identify the pattern and come up with expression as the data is unstructured.
- Fuzzy String Matching using R: Easy to implement

Fuzzy String Search:
Fuzzy string approach is basically comparing the two strings based on the similarity.

Using agrep function in R, we can combine the data.

Fuzzy string matching doesn't know anything about your data but you might do.
Based on your data, we would need to set the values for the attributes like ignore.case , max distance, etc.

Details for these Attributes can be found at below location:-
https://stat.ethz.ch/R-manual/R-devel/library/base/html/agrep.html




Below is the R Script to join Two Data Tables based on Fuzzy Search String.


sec<-Table1     # Data Set1
dwh<-Table2 # Data Set2
dsecurity<-dwh[,6]  # Key column in Data Set1
ciknames<-sec[,2] # Key column in Data Set2
len<-length(dsecurity) # Number of Rows
for(i in 1:len) # For loop to search for each string
{
matches<- agrep(dsecurity[i],ciknames,max=0.2, value=TRUE, ignore.case = TRUE) # Capturing the Strings which are similar by using agrep function
matcount<-length(matches) # Count of matching strings from Data Set2 
concat<- paste(matches,collapse="|") # Concatenating the matchings Strings from Data Set 2
dwh$SECLOOKUP[i]<-concat # Adding a column to the data frame
for (j in 1:matcount)  # For loop to implement the logic for Left Outer Join
    {
        row <- matches[j]
        if(j==1)
        {
        dwh$SECLOOKUP[i]<-row
        }
        else
        {
         newRow<-dwh[i,]
         newRow[8]<-matches[j]
         dwh<-rbind(dwh,newRow)       
        }
    }
}
Otable<-dwh  # Final Data Frame with left outer join











3 comments:

  1. How did you get this to work with Spotfire? I see agrep as unsupported in TERR.
    https://docs.tibco.com/pub/enterprise-runtime-for-R/4.0.2/doc/html/TIB_terr_Unavailable-Functions-list/GUID-4530EA87-B49C-4C8C-BF5C-C74C71F66A80.html

    ReplyDelete
  2. agrep is supported in newer TERR versions

    ReplyDelete
  3. I appreciate you for the time you spent to write this whole piece of article. If you want to become pro in Tibco Spotfire and establish your carrer, You can reach here

    ReplyDelete