One of the biggest challenges I face as a data engineer is data discoverability. The most common questions I get at work are “Where can I find the data for variable X?” and “What do the numbers in column Y mean?” The context of the data is missing, as I discussed in an earlier post.

The current solution is a data catalog: a repository of information about all the data in your company (the tables, the columns, and what the values represent). A big player in this field is Collibra. There are two problems with existing data catalogs:

  1. They are extremely expensive (100k+ yearly licenses).
  2. They require lots of manual labor to set up and maintain.

This is why a host of new players is entering the data catalog market. I believe that AI, more specifically Natural Language Processing (NLP), could be a breakthrough technology for data catalogs. Especially in manufacturing, where IT systems are surprisingly standardized across the industry (every company runs an MES and an ERP system), relying on a computer to document the data for us is not unrealistic.

If 90% of the signals could be cataloged with sufficient accuracy and detail, the required manual labor would drop drastically (point 2), which in turn would drive down the cost (point 1).
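To make the idea concrete, here is a minimal sketch of automated signal cataloging. It uses simple fuzzy string matching (Python's standard-library `difflib`) as a stand-in for a real NLP model, and the glossary entries, signal names, and the `suggest_description` helper are all invented for illustration:

```python
import difflib
from typing import Optional

# Hypothetical glossary mapping canonical signal names to descriptions.
# In practice this could come from a shared industry vocabulary for MES/ERP data.
GLOSSARY = {
    "motor_speed": "Rotational speed of the drive motor (rpm)",
    "oven_temperature": "Internal temperature of the curing oven (deg C)",
    "line_throughput": "Units produced per hour on the packaging line",
}

def suggest_description(raw_signal: str, cutoff: float = 0.6) -> Optional[str]:
    """Suggest a catalog description for a raw MES/ERP tag name.

    Normalizes the tag, then finds the closest glossary entry by
    string similarity. Returns None when nothing matches well enough,
    leaving that signal for manual review.
    """
    key = raw_signal.lower().replace("-", "_").replace(" ", "_")
    matches = difflib.get_close_matches(key, GLOSSARY.keys(), n=1, cutoff=cutoff)
    return GLOSSARY[matches[0]] if matches else None

print(suggest_description("Motor Speed"))   # close match to "motor_speed"
print(suggest_description("xyz_unknown"))   # no match -> None, flag for a human
```

The point of the sketch is the workflow, not the matching technique: an automated pass documents the bulk of the signals, and only the low-confidence leftovers (the `None` cases) need manual labor.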