As the data in enterprise Hadoop clusters continues to grow, securing that data becomes an important part of any implementation.  This article is the fourth in a series discussing the current state of security in the Hortonworks ecosystem.  I will cover the four major tools used to secure a Hortonworks Data Platform (HDP) cluster.  Ranger and Kerberos handle the four A's of security: Authentication, Authorization, Auditing, and Administration.  In addition, data encryption and cluster access are provided by Hadoop TDE and Knox, respectively.

With the increasing volume of data flowing into the Hadoop data lake, it is no surprise that protecting that data has become so important.  As a newer technology, Hadoop is under even greater scrutiny to ensure protected data is completely secure.  To provide security, Hadoop utilizes a variety of tools.  At the lowest level, disk encryption can be provided by the OS; this protects against hard-drive theft.  To protect data from rogue administrators or other unauthorized users, Transparent Data Encryption (TDE) is provided by HDFS.  Since TDE is provided transparently by HDFS, any tool that stores its data in HDFS can take advantage of it.  TDE is implemented as an end-to-end solution, meaning the data is encrypted and decrypted at the client.  The data, as it is transferred between the client and HDFS, is therefore always encrypted, providing in-transit encryption.  To protect data on the wire that isn't covered by TDE, Hadoop utilizes a combination of SASL and SSL.
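Because TDE is transparent, clients read and write files in an encryption zone with the ordinary HDFS commands; no application changes are needed.  A minimal sketch, assuming a cluster where TDE is already configured (the zone path /data/secure is a hypothetical example):

```shell
# Write a file into an existing encryption zone exactly as you would any path;
# the HDFS client fetches the file's encrypted data encryption key from the KMS,
# decrypts it, and encrypts the stream before it ever leaves the client.
hdfs dfs -put report.csv /data/secure/report.csv

# Reading is equally transparent: decryption happens client-side.
hdfs dfs -cat /data/secure/report.csv

# The raw, still-encrypted bytes remain visible (to a superuser) under /.reserved/raw:
hdfs dfs -cat /.reserved/raw/data/secure/report.csv
```

These commands require a live, TDE-enabled HDFS cluster, so they are shown here as an illustrative fragment rather than a runnable script.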

In order to use TDE, a Key Management Server (KMS) is required to manage the encryption keys.  Hortonworks Data Platform (HDP) utilizes Ranger KMS to perform this function.  The Ranger Admin UI provides a single pane of glass to manage, audit, and authorize access to keys in Ranger KMS.  
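With Ranger KMS configured as the key provider, keys can be managed either from the Ranger Admin UI or with the standard `hadoop key` CLI.  A sketch, assuming a working KMS and a hypothetical key name, mykey:

```shell
# Create a new encryption-zone key in the configured KMS (Ranger KMS in HDP);
# the key material itself never leaves the KMS.
hadoop key create mykey -size 256

# List the keys the current user is authorized to see.
hadoop key list

# Roll the key to a new version; existing files keep their previously
# encrypted data encryption keys and remain readable.
hadoop key roll mykey
```

These commands assume a configured KMS endpoint and are illustrative rather than runnable in isolation.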


When using TDE, only data stored in preconfigured encryption zones is protected.  These zones must be created by a super user, but the encryption keys can be created by a normal user.  
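That division of duties looks roughly like this in practice.  The paths and key name below are hypothetical, and the `hdfs crypto` commands assume TDE and a KMS are already configured on the cluster:

```shell
# As a normal user: create the encryption key (no superuser privileges needed).
hadoop key create projectKey -size 256

# As the HDFS superuser: the zone directory must exist and be empty
# before it can be turned into an encryption zone.
hdfs dfs -mkdir -p /data/project
hdfs crypto -createZone -keyName projectKey -path /data/project

# Verify the zone was created (superuser only).
hdfs crypto -listZones
```

From this point on, anything written under /data/project is encrypted transparently.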

Data also needs to be protected while in motion; this could be while transferring to or from a client, or during the shuffle phase of a MapReduce (MR) job.  When transferring data to or from a client, the data is already encrypted if it is stored in an encryption zone in HDFS.  If it isn't stored in an encryption zone, it can be protected by SASL for RPC connections, which most clients make.  Connections between a client and the DataNodes use the Data Transfer Protocol (DTP), which can be encrypted with the 3DES or RC4 algorithms.  Connections made over HTTP, such as WebHDFS, can be encrypted via SSL.  HiveServer2 provides encryption via SASL's quality of protection (QOP) for connections over JDBC.  Finally, when data moves from the mappers to the reducers in the shuffle phase of an MR job, it can be encrypted by SSL.  By utilizing each of these mechanisms, data will always be encrypted while in motion to, from, or within the cluster.
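As a rough sketch, the wire-encryption mechanisms above map to configuration properties like the following.  Exact values depend on the HDP version and should be managed through Ambari; treat this as an illustrative fragment, not a complete configuration:

```xml
<!-- core-site.xml: SASL privacy (encryption) for Hadoop RPC -->
<property>
  <name>hadoop.rpc.protection</name>
  <value>privacy</value>
</property>

<!-- hdfs-site.xml: encrypt the DataNode Data Transfer Protocol -->
<property>
  <name>dfs.encrypt.data.transfer</name>
  <value>true</value>
</property>
<property>
  <name>dfs.encrypt.data.transfer.algorithm</name>
  <value>3des</value> <!-- or rc4 -->
</property>

<!-- mapred-site.xml: SSL for the MR shuffle phase -->
<property>
  <name>mapreduce.shuffle.ssl.enabled</name>
  <value>true</value>
</property>

<!-- hive-site.xml: SASL QOP for HiveServer2 JDBC connections -->
<property>
  <name>hive.server2.thrift.sasl.qop</name>
  <value>auth-conf</value>
</property>
```

SSL for WebHDFS and other HTTP endpoints additionally requires the usual keystore/truststore setup, which is omitted here.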

Current regulations in the healthcare, government, and financial sectors all require some level of data encryption.  A complete data-encryption solution is provided in the Hadoop ecosystem, helping to ensure compliance with these regulations.  Data can be secured from the moment it transfers to HDFS, while it is stored in HDFS, while it is being operated on, and while it is being transferred out of HDFS.  Utilizing all of these features will ensure that the data in the cluster is secure.

Previous articles in this series include: 

Security in the Cluster – Authentication  

Security in the Cluster - Authorization 

Security in the Cluster - Access