Secure Enterprise Data Lakes by Understanding the Pipeline

A growing number of enterprises are taking an interest in big data. The potential of data lakes is hard to deny, but when companies first explore the world of big data, they are likely to encounter plenty of pitfalls. Two of the key challenges are data ingestion and data protection.

Dealing With Data Takes Time

Often, the initial goal of a data lake is simply to store data that may come in useful one day. Think of data lakes as the digital equivalent of those boxes of “important papers” in the attic. Should the day come when someone finds a use for all that data, however, digesting a huge dump of data that may well have inconsistent schemas from file to file can be quite a challenge.

Ideally, data lake ingestion would be automated, with parallelized ingestion where possible to speed up the process, and error-checking to cope with changes to the source systems. Automated data ingestion can save a lot of time and stress, freeing up developer and analyst time so that they can focus on visualizations, rather than fighting with the datasets that are being sent to them.

This is particularly true if the cloud data lake will be topped up with fresh data on a regular basis. Synchronizing data incrementally with Change Data Capture is a common technique, and one that should be paired with change tracking in the form of Slowly Changing Dimensions. This allows auditors and analysts to see how data has changed over time – useful for legal purposes, and for detailed analytics.
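The idea behind Type 2 Slowly Changing Dimensions is to expire the current row for a changed key and append a new versioned row, rather than overwriting history. The sketch below is a minimal in-memory illustration of that idea applied to change-data-capture records; the field names and the pure-Python representation are assumptions for clarity, and a real lake would typically do this in SQL or Spark.

```python
# A minimal, in-memory sketch of Slowly Changing Dimension (Type 2) tracking
# applied to change-data-capture records. Field names are illustrative.
from datetime import datetime


def apply_scd2(dimension: list[dict], changes: list[dict], key: str = "customer_id") -> list[dict]:
    """Close out the current row for each changed key and append a new version."""
    now = datetime.utcnow().isoformat()
    for change in changes:
        for row in dimension:
            if row[key] == change[key] and row["is_current"]:
                row["is_current"] = False          # expire the old version
                row["valid_to"] = now
        dimension.append({**change, "valid_from": now, "valid_to": None, "is_current": True})
    return dimension


# Example: one existing customer row, then a CDC record with a new address.
dim = [{"customer_id": 1, "address": "Old St", "valid_from": "2023-01-01",
        "valid_to": None, "is_current": True}]
cdc = [{"customer_id": 1, "address": "New Ave"}]
for row in apply_scd2(dim, cdc):
    print(row)
```

The expired rows are what give auditors the ability to see what the data looked like at any point in the past.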

Handling Data Responsibly

While data protection rules vary around the world, the general idea remains the same: organizations that gather data on individuals have a responsibility to handle that data carefully. Data protection and privacy laws in the United States are more complex than in some other parts of the world, since they are a patchwork of federal and local laws.

Before you can even think about how you are handling data and whether your enterprise might be bound by laws such as the EU’s “Right to be Forgotten”, basic security issues should be considered. Data lakes present a number of security challenges, particularly if they are self-managed. Understanding the pipeline that your data follows, and strictly controlling how the data is handled at each stage before, during and after ingestion, is vital for reducing the risk of breaches and minimizing the damage caused by any breach that does occur.

How to Secure Data Lakes

The security challenges with data lakes stem from the way data is stored and processed. Typically, the data to be processed comes from an external source, so understanding the pipeline by which data enters the cloud, is processed, and is then ingested into the database is vital to securing it properly.

A secure cloud setup will have zones where the data enters the system, is sanitized, processed and ingested. Each zone should be separate from the others, and wherever possible the data held in any given zone should be encrypted too. Where raw data is handled, its lifespan should be limited, and the information should be securely deleted as soon as possible.
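As a rough illustration of two of those controls – encrypting a zone and limiting the lifespan of raw data – the sketch below uses boto3 to set default encryption and a short lifecycle expiry on a hypothetical raw-zone bucket. The bucket name, prefix and retention window are assumptions for the example.

```python
# A minimal sketch: default encryption at rest plus automatic deletion of raw
# objects after a short retention window. Names and values are illustrative.
import boto3

s3 = boto3.client("s3")
RAW_BUCKET = "example-raw-zone"  # hypothetical raw-zone bucket

# Encrypt everything written to the raw zone by default.
s3.put_bucket_encryption(
    Bucket=RAW_BUCKET,
    ServerSideEncryptionConfiguration={
        "Rules": [{"ApplyServerSideEncryptionByDefault": {"SSEAlgorithm": "aws:kms"}}]
    },
)

# Expire raw objects after 7 days so unprocessed source data has a limited lifespan.
s3.put_bucket_lifecycle_configuration(
    Bucket=RAW_BUCKET,
    LifecycleConfiguration={
        "Rules": [{
            "ID": "expire-raw-data",
            "Filter": {"Prefix": "incoming/"},
            "Status": "Enabled",
            "Expiration": {"Days": 7},
        }]
    },
)
```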

Having a clear idea of the lifecycle of your data in the cloud is the first step towards effectively securing your data lakes. Security can be applied at many levels, one of which is sketched in the example that follows this list:

  1. Limiting platform access
  2. Role-level privileges
  3. Network isolation
  4. Data encryption
  5. Document-level security
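As one concrete example of item 2, role-level privileges, the sketch below assumes an S3-backed lake and uses boto3 to create a read-only IAM policy scoped to the curated zone. The bucket, prefix and policy names are hypothetical.

```python
# A minimal sketch of role-level privileges: an IAM policy that lets an analyst
# role read only the curated zone of the lake. Names and ARNs are illustrative.
import json

import boto3

iam = boto3.client("iam")

policy_document = {
    "Version": "2012-10-17",
    "Statement": [
        {   # read objects only under the curated/ prefix
            "Effect": "Allow",
            "Action": "s3:GetObject",
            "Resource": "arn:aws:s3:::example-data-lake/curated/*",
        },
        {   # list the bucket, but only under the curated/ prefix
            "Effect": "Allow",
            "Action": "s3:ListBucket",
            "Resource": "arn:aws:s3:::example-data-lake",
            "Condition": {"StringLike": {"s3:prefix": ["curated/*"]}},
        },
    ],
}

iam.create_policy(
    PolicyName="analyst-curated-read-only",  # hypothetical policy name
    PolicyDocument=json.dumps(policy_document),
)
# The policy would then be attached to the analyst role with attach_role_policy.
```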

Building a Robust Cloud Architecture

Security of the form mentioned above is not a new concept. However, those who are used to running a monolithic architecture may find it difficult to adapt to the microservice, cloud-based way of thinking.

Fortunately, there are many companies that offer easy-to-use solutions for each step in the process. Platforms such as Amazon S3 and Azure are a good starting point for people who wish to run cloud services without the challenge of managing the servers on-premises. Automated data lake ingestion tools can aid with the next step of the process.

It’s true that there are plenty of open-source tools out there for data lake ingestion, but the workload that configuring and using these tools places on the developer is high, and using the wrong tool for the job can significantly slow down the ingestion process. JDBC, for example, may be good enough for small imports, but when we enter the realm of big data, the efficiency gains of using a parallel transporter become clear.
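To make that contrast concrete, the sketch below partitions a source table by key range and extracts the ranges concurrently, which is the essence of a parallel transport compared with a single serial JDBC-style pull. sqlite3 stands in for the source database so the example is self-contained; the table name, key ranges and database path are assumptions.

```python
# A minimal sketch of range-partitioned, parallel extraction: each worker reads
# its own key range over its own connection instead of one serial pull.
import concurrent.futures
import sqlite3

DB_PATH = "source.db"  # hypothetical source database with an "orders" table
RANGES = [(0, 250_000), (250_000, 500_000), (500_000, 750_000), (750_000, 1_000_000)]


def extract_range(lo: int, hi: int) -> int:
    """Pull one key range on a dedicated connection and return the row count."""
    conn = sqlite3.connect(DB_PATH)
    try:
        cur = conn.execute("SELECT * FROM orders WHERE id >= ? AND id < ?", (lo, hi))
        rows = cur.fetchall()
        # ...write this partition to the raw zone here...
        return len(rows)
    finally:
        conn.close()


with concurrent.futures.ThreadPoolExecutor(max_workers=len(RANGES)) as pool:
    counts = list(pool.map(lambda r: extract_range(*r), RANGES))
print(f"extracted {sum(counts)} rows across {len(RANGES)} partitions")
```

Against a real warehouse, the same pattern would use the vendor's driver and bulk-export facilities rather than sqlite3, but the split-by-range structure is what delivers the speed-up.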

Having a tool that can connect to a new stream or object store, securely store a copy of the raw data, ingest it, and then provide it in a form that is ready to query via popular tools such as Athena and Spark can streamline a previously developer-intensive process, improve efficiency, increase reliability, and make your data lake more useful for the people who actually need it most.
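Once the data has been ingested, querying it can be as simple as the sketch below, which submits a SQL query to Athena with boto3 and polls for the result. The database name, table and output location are placeholders.

```python
# A minimal sketch of querying the curated zone with Athena after ingestion.
import time

import boto3

athena = boto3.client("athena")

execution = athena.start_query_execution(
    QueryString="SELECT customer_id, SUM(amount) AS total FROM orders GROUP BY customer_id LIMIT 10",
    QueryExecutionContext={"Database": "example_lake"},            # hypothetical database
    ResultConfiguration={"OutputLocation": "s3://example-athena-results/"},
)
query_id = execution["QueryExecutionId"]

# Poll until the query finishes, then fetch the result rows.
while True:
    state = athena.get_query_execution(QueryExecutionId=query_id)["QueryExecution"]["Status"]["State"]
    if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
        break
    time.sleep(1)

if state == "SUCCEEDED":
    for row in athena.get_query_results(QueryExecutionId=query_id)["ResultSet"]["Rows"]:
        print([col.get("VarCharValue") for col in row["Data"]])
else:
    print(f"query ended in state {state}")
```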

Data should serve its owners, not drain resources or create additional workload. Tools such as Upsolver give that power back to the enterprise. Even enterprises that make use of Virtual Private Clouds can benefit from streamlining and security, since Upsolver can read data from Amazon S3 buckets in an enterprise’s virtual private cloud, process that data in memory on a remote EC2 cluster, and then write it back to the S3 bucket. At no point is the data being processed written to persistent storage on the EC2 cluster doing the processing.
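The sketch below is a generic illustration of that in-memory pattern, not Upsolver’s implementation: read an object from S3, transform it entirely in memory, and write the result back without ever persisting the data to local disk. The bucket, keys and the trivial transformation are assumptions.

```python
# A generic sketch of read-from-S3, transform in memory, write back to S3,
# with nothing touching the local filesystem. Names are illustrative.
import gzip

import boto3

s3 = boto3.client("s3")
BUCKET = "example-vpc-data-lake"  # hypothetical bucket inside the VPC

# Read the raw object straight into memory.
raw = s3.get_object(Bucket=BUCKET, Key="raw/events.json.gz")["Body"].read()
decoded = gzip.decompress(raw).decode("utf-8")

# Transform in memory (here, just dropping blank lines as a stand-in).
processed = "\n".join(line for line in decoded.splitlines() if line.strip())

# Write the result back to the curated zone, encrypted at rest.
s3.put_object(
    Bucket=BUCKET,
    Key="curated/events.jsonl",
    Body=processed.encode("utf-8"),
    ServerSideEncryption="aws:kms",
)
```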

Know Where Your Data is Going

The key consideration for companies looking to manage data lakes is to treat those lakes with the same serious security outlook as other parts of the enterprise. Every machine, instance and piece of code that has access to the data should be secured robustly. It is all too common for systems administrators and developers to secure the entry point and the endpoint but neglect the midstream parts of the pipeline, even though those parts are still networked and expose numerous potential threat vectors.

Find a trustworthy system for managing your data lake ingestion, and maintain your monitoring and remediation routines for peace of mind.
