Data Intelligence—Scanners Developed to Gather Metadata from Various Tools
Our client, a privately held computer software company, develops and sells enterprise software to international clients. The company creates and deploys software solutions that reduce cost, mitigate risk, and improve service delivery throughout the IT lifecycle. The client provides comprehensive solutions including:
- Metadata and content management
- Systems solutions—a wide variety of products including performance management, workload automation, application management, and service quality and control.
As the client's products evolved, the volume of data they consumed and produced grew dramatically. The data became cumbersome and unmanageable, and getting it into a usable format was time-consuming.
- Data lakes contain a huge variety of information. Basic attributes such as type, creation date, owner, and initial size could be determined, but the data was so complex that analyzing separate chunks while ignoring the relationships between them did not give the complete picture.
- Even when data was quickly found, analyzed, and put in the proper context, it was still likely to be inconsistent because it was mutable. Over time, some dependencies between data elements disappeared, making the data useless.
The client collaborated with SoftServe to develop scanners that gather metadata from various tools (Hive, HDFS, Falcon, Flume, and Sqoop) used to transform, migrate, and store data in Apache Hadoop.
The scanners gathered metadata through REST or Java APIs and transformed it into the specific format required by the metadata repository. They also created links in the repository between metadata items scanned from different tools. All scanned, imported, and linked data was then used for further analysis and lineage creation.
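The transform-and-link step can be illustrated with a minimal Python sketch. It is not the client's implementation: the record fields, the `MetadataItem` structure, and the `stored_in` link type are hypothetical, and the input dictionaries stand in for whatever a scanner would actually receive from a REST or Java API. The sketch relies only on the fact that a Hive table stores its data at an HDFS location, which gives a natural key for linking items from the two tools.

```python
from dataclasses import dataclass
from urllib.parse import urlparse


@dataclass(frozen=True)
class MetadataItem:
    """One entry in a hypothetical common repository format."""
    item_id: str      # e.g. "hive:sales.orders"
    source_tool: str  # tool the item was scanned from
    kind: str         # "table", "directory", ...
    location: str     # normalized HDFS path


def normalize_path(uri: str) -> str:
    """Drop scheme, host, and trailing slash so paths from different tools compare equal."""
    path = urlparse(uri).path or uri
    return path.rstrip("/") or "/"


def from_hive(record: dict) -> MetadataItem:
    # Field names ("database", "table", "location") are assumptions for this sketch.
    name = f"{record['database']}.{record['table']}"
    return MetadataItem(f"hive:{name}", "hive", "table",
                        normalize_path(record["location"]))


def from_hdfs(record: dict) -> MetadataItem:
    # The "path" field is likewise an assumption.
    path = normalize_path(record["path"])
    return MetadataItem(f"hdfs:{path}", "hdfs", "directory", path)


def link_items(items: list[MetadataItem]) -> list[tuple[str, str, str]]:
    """Link each Hive table to the HDFS directory that holds its data."""
    dirs = {i.location: i for i in items if i.source_tool == "hdfs"}
    links = []
    for item in items:
        if item.source_tool == "hive" and item.location in dirs:
            links.append((item.item_id, "stored_in", dirs[item.location].item_id))
    return links


# Stand-ins for records a scanner might receive from each tool.
hive_records = [{"database": "sales", "table": "orders",
                 "location": "hdfs://namenode:8020/warehouse/sales.db/orders"}]
hdfs_records = [{"path": "/warehouse/sales.db/orders/"},
                {"path": "/tmp/staging"}]

items = [from_hive(r) for r in hive_records] + [from_hdfs(r) for r in hdfs_records]
links = link_items(items)
print(links)
```

Normalizing paths before comparing them matters here because one tool may report a full `hdfs://host:port/...` URI while another reports a bare path for the same directory.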
As a result of the collaboration, the client received the following value:
- Cross-platform. The scanners developed by SoftServe gathered metadata from Apache Hadoop framework components, enabling compatibility with the most popular Hadoop distributions: Cloudera, Hortonworks, IBM BigInsights, and Talend Studio.
- Modular. Each scanner was a standalone application that could be used separately or in combination with other tools. This approach allowed our client to select the exact tool combination for their needs, saving time and money.
- Flexible. The scanners produced structured XML output, so a variety of programming languages could be used for additional data processing: analyzing the metadata, reconstructing relationships, and visualizing the metadata structure and lineage.
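As an illustration of the downstream processing that XML output makes possible, the sketch below reads a small scanner-style document with Python's standard `xml.etree.ElementTree` module and walks the links backwards to recover the lineage of one item. The XML layout (`<item>` and `<link>` elements with `id`, `from`, and `to` attributes) is an assumption made for this example; the scanners' actual schema is not described here.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict, deque

# Hypothetical scanner output; element and attribute names are assumptions.
SCANNER_XML = """\
<metadata>
  <item id="sqoop:import_orders" tool="sqoop" type="job"/>
  <item id="hdfs:/warehouse/sales.db/orders" tool="hdfs" type="directory"/>
  <item id="hive:sales.orders" tool="hive" type="table"/>
  <link from="sqoop:import_orders" to="hdfs:/warehouse/sales.db/orders" type="writes"/>
  <link from="hdfs:/warehouse/sales.db/orders" to="hive:sales.orders" type="backs"/>
</metadata>
"""


def load_upstream(xml_text: str) -> dict:
    """Map each item id to the ids of the items that feed into it."""
    root = ET.fromstring(xml_text)
    upstream = defaultdict(list)
    for link in root.iter("link"):
        upstream[link.get("to")].append(link.get("from"))
    return upstream


def lineage(item_id: str, upstream: dict) -> list:
    """Return all ancestors of item_id, nearest first, guarding against cycles."""
    seen = {item_id}
    order = []
    queue = deque([item_id])
    while queue:
        current = queue.popleft()
        for parent in upstream.get(current, []):
            if parent not in seen:
                seen.add(parent)
                order.append(parent)
                queue.append(parent)
    return order


upstream = load_upstream(SCANNER_XML)
print(lineage("hive:sales.orders", upstream))
```

Under these assumptions, the Hive table traces back first to its HDFS directory and then to the Sqoop job that wrote it. The same traversal could be written in any language with an XML parser, which is the flexibility the structured output is meant to provide.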