Updated ESA Scale Document (#243)
* Updated ESA Scalability Document

* further updateds based on feedback
marvinbuss authored Dec 6, 2021
1 parent c6aa44f commit 571c618
Showing 1 changed file with 6 additions and 5 deletions.
11 changes: 6 additions & 5 deletions docs/guidance/EnterpriseScaleAnalytics-SelfService.md
@@ -31,26 +31,27 @@ The following paragraphs will elaborate on these core concepts and will also des

Enterprise-Scale Analytics is centred around the concepts of Data Management Zone and Data Landing Zone. Each of these artifacts should land in a separate Azure subscription to allow for clear separation of duties, to follow the principle of least privilege and to partially address the first issue mentioned in the [introduction](#introduction) around subscription scale issues. Hence, the minimal setup of Enterprise-Scale Analytics consists of a single Data Management Zone and a single Data Landing Zone.

For large-scale data platform deployments, a single Data Landing Zone and hence a single subscription will not be sufficient, especially if we consider that companies build these platforms and make the respective investments to consistently and efficiently scale their data and analytics efforts in the upcoming years. Therefore, to fully overcome the subscription-level limitations and embrace the "subscription as scale unit" concept from [Enterprise-Scale Landing Zones](https://github.com/Azure/Enterprise-Scale), ESA allows an institution to further increase the data platform footprint by adding additional Data Landing Zones to the architecture. Furthermore, this also addresses the concern of a single Azure Data Lake Gen2 being used for a whole company as each Data Landing Zone comes with a set of three Data Lakes.
For large-scale data platform deployments, a single Data Landing Zone and hence a single subscription will not be sufficient, especially if we consider that companies build these platforms and make the respective investments to consistently and efficiently scale their data and analytics efforts in the upcoming years. Therefore, to fully overcome the subscription-level limitations and embrace the "subscription as scale unit" concept from [Enterprise-Scale Landing Zones](https://github.com/Azure/Enterprise-Scale), ESA allows an institution to further increase the data platform footprint by adding additional Data Landing Zones to the architecture. Furthermore, this also addresses the concern of a single Azure Data Lake Gen2 being used for a whole company as each Data Landing Zone comes with a set of three Data Lakes. Ultimately, this leads to a distribution of projects and activities of multiple domains across more than one Azure subscription, allowing for greater scalability.

The question of how many Data Landing Zones an organization requires should be discussed and defined upfront before adopting the ESA prescriptive architectural pattern. This is considered to be one of the most important design decisions, because it lays the foundation for an effective and efficient data platform. When done right, it will prevent enterprises from ending up in a migration project of data products and data assets from one Data Landing Zone to another and will allow an effective and consistent scaling of any big data and analytics efforts for the upcoming years.

The following factors should be considered when deciding how many Data Landing Zones to deploy:

- *Data Domains*: What data domains does the organization encompass and which data domains will land on the data platform?
- *Data Domains*: What data domains does the organization encompass and which data domains will land on the data platform? What is the size of these individual data domains?
- *Cost allocation*: Are shared services like storage accounts paid centrally or do these need to be split by business unit or domain?
- *Location*: In which Azure Regions will the organization deploy their data platform? Will the platform be hosted in a single region or will it span across multiple regions? Are there data residency requirements that need to be taken into account?
- *Data classifications and highly confidential data*: What data classifications exist within the organization? Does the organization have datasets that are classified as highly confidential, and do these datasets require special treatment in the form of just-in-time access, customer-managed keys (CMK), fine-grained network controls or additional enforced encryption? Could these additional security mechanisms affect the usability of the data platform and data product development?
- *Other legal or security implications*: Is there any other legislation, or are there any other security requirements, that the organization needs to comply with and that justify logical and/or physical separation of data?

Considering all these factors, an enterprise should target no more than 15 Data Landing Zones as there is also some management overhead attached to each Data Landing Zone.

It is important to emphasize that Data Landing Zones are not creating Data Silos within an organization, as the recommended network setup in ESA enables secure and in-place data sharing across Landing Zones and therefore enables innovation across data domains and business units. Please read the [network design guidance](/docs/guidance/EnterpriseScaleAnalytics-NetworkArchitecture.md) to find out more about how this is achieved.
It is important to emphasize that Data Landing Zones do not create data silos within an organization, as the recommended network setup in ESA enables secure and in-place data sharing across Landing Zones and therefore enables innovation across data domains and business units. Please read the [network design guidance](/docs/guidance/EnterpriseScaleAnalytics-NetworkArchitecture.md) to find out more about how this is achieved. The same is true for the identity layer. Since the use of a single Azure AD tenant is recommended, identities can be granted access to data assets within multiple Data Landing Zones. For the actual authorization of users and identities to data assets, it is recommended to leverage Azure Active Directory Entitlement Management and Access Packages.

Lastly, it needs to be highlighted that the prescriptive ESA architecture and the concept of Data Landing Zones allow corporations to naturally increase the size of their data platform over time. Additional Data Landing Zones can be added in a phased approach, and customers are not forced to begin with a multi-Data Landing Zone setup. When adopting the pattern, companies should start by prioritizing a few Data Landing Zones, as well as the Data Products that should land inside them, to make sure that the adoption of ESA is successful.

### Scaling with Data Products

Within a Data Landing Zone, Enterprise-Scale Analytics allows organizations to scale through the concept of Data Integrations and Data Products. A Data Integration or Data Product is an environment in the form of a resource group that allows cross-functional teams to implement data solutions and workloads on the platform. These teams then take care of the ingest (only for Data Integration), cleansing, aggregation and serving tasks for a particular data-domain, sub-domain, dataset or project.
Within a Data Landing Zone, Enterprise-Scale Analytics allows organizations to scale through the concept of Data Integrations and Data Products. A Data Product is a unit or component of the data architecture that encapsulates functionality for making read-optimized datasets available for consumption by other data products. In the Azure context, a Data Product is an environment in the form of a resource group that allows cross-functional teams to implement data solutions and workloads on the platform. The associated team takes care of the end-to-end lifecycle of its data solution, including ingest (only for Data Integration), cleansing, aggregation and serving tasks for a particular data-domain, sub-domain, dataset or project.

A Data Integration pattern is a special kind of Data Product that is mainly concerned with the integration of data assets from source systems outside of the data platform onto a Data Landing Zone. These consistent data integration implementations reduce the impact on the transactional systems and allow multiple Data Product teams to consume the same dataset version without being concerned about the integration and without having to repeat the integration task.
Data Products, on the other hand, consume one or multiple data assets within the same Data Landing Zone or across multiple Data Landing Zones to generate new data assets, insights or business value. The resulting data assets may again be shared with other Data Product teams to enhance the value being created within the business even further.
@@ -83,7 +84,7 @@ The responsible IT teams should use this platform feature to address their secur

### Azure RBAC assignments

In order to develop, deliver and serve data products autonomously within the data platform, Data Product and Data Integration teams require a few access rights within the Azure environment. Before going through the respective RBAC requirements, it must be highlighted that different access models should be used for the development and higher environments.
In order to develop, deliver and serve data products autonomously within the data platform, Data Product and Data Integration teams require a few access rights within the Azure environment. Before going through the respective RBAC requirements, it must be highlighted that different access models should be used for the development and higher environments. Also, security groups should be used wherever possible to reduce the number of role assignments and to simplify the management and review process of RBAC rights. This is critical due to the [limited number of role assignments that can be created per subscription](https://docs.microsoft.com/en-us/azure/azure-resource-manager/management/azure-subscription-service-limits#azure-rbac-limits).
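
As a rough illustration of why security groups reduce the role assignment count, the following sketch compares assigning roles to individual users with assigning them to one group. The team size, number of roles and number of scopes are made-up values, and `role_assignments` is a hypothetical helper, not an Azure API. It only relies on the fact that one role assignment binds one principal to one role at one scope.

```python
def role_assignments(principals: int, roles: int, scopes: int) -> int:
    """Count assignments when every principal gets every role at every scope."""
    return principals * roles * scopes


# Assumed example values: 8 team members, 2 roles, 3 scopes (e.g. resource groups).
team_size, roles, scopes = 8, 2, 3

direct = role_assignments(team_size, roles, scopes)  # users assigned individually
grouped = role_assignments(1, roles, scopes)         # one security group instead

print(f"direct: {direct}, via security group: {grouped}")
```

With these assumed numbers the group-based model needs 6 assignments instead of 48, and adding a team member changes group membership rather than the assignment count.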

The development environment should be accessible to the development team and their respective user identities to enable them to iterate more quickly, learn about certain capabilities within Azure services and troubleshoot issues effectively. Access to a development environment will help when developing or enhancing the infrastructure as code (IaC) as well as other code artifacts. Once an implementation within the development environment works as expected, it can be rolled out continuously to the higher environments. Higher environments, such as test and prod, should be locked down for the Data Product team. Only a service principal should have access to these environments, and therefore all deployments must be executed through the service principal identity by using CI/CD pipelines. To summarize, in the development environment access rights should be provided to a service principal AND user identities, and in higher environments access rights should only be provided to a service principal identity.
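
The access model summarized above can be sketched as a small policy check. This is a minimal illustration only: the environment names `dev`, `test` and `prod`, the `Identity` type and the `may_access` function are assumptions made for this example and are not part of ESA or any Azure SDK.

```python
from dataclasses import dataclass

# Hypothetical environment names, assumed for illustration.
DEVELOPMENT = "dev"
ALL_ENVIRONMENTS = {"dev", "test", "prod"}


@dataclass(frozen=True)
class Identity:
    name: str
    kind: str  # "user" or "service_principal"


def may_access(identity: Identity, environment: str) -> bool:
    """Return True if the identity may be granted access to the environment."""
    if environment not in ALL_ENVIRONMENTS:
        return False
    if identity.kind == "service_principal":
        # The CI/CD service principal deploys to every environment.
        return True
    if identity.kind == "user":
        # User identities are limited to the development environment.
        return environment == DEVELOPMENT
    return False
```

For example, `may_access(Identity("alice", "user"), "prod")` evaluates to `False`, while the same call for a service principal evaluates to `True`.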
