Monday, February 12, 2018
Part Three: The Analytics Strategy and Roadmap - A Use Case Driven Plan to Incrementally Build an Analytics Capability Powered by Big Data
In the previous blogs of this three-part series, I addressed a clear analytical divide that has grown in the industry, where relatively mature BI shops are at a definite advantage over most organizations that have yet to fully realize an analytics capability powered by big data. The first blog in this series identified the critical capabilities needed for analytical success with big data, and the many impediments, both technical and organizational, that are holding companies back.
Building on this idea in the second blog, I outlined why the attempt to take a ‘big bang’ approach to big data, by first putting all of the enterprise’s data into a data lake, is not likely to succeed: it returns little ROI in the short run and has major investment, governance, and skills requirements. Instead, I proposed establishing a multiplatform data warehouse environment (DWE) with an architecture pattern that's designed to accommodate immediate use cases with specific goals and measurable ROI, so the program can fund itself along the way.
In this approach, the requisite analytics capabilities will be gained through a managed transformation, an incremental build-up in a phased approach, where the big data journey is mapped in clear, achievable but increasingly challenging milestones that progressively induct the different natures and types of big data. The strategic roadmap for big data will be formulated based on these early successes, with growing participation and sponsorship from the business as it starts to see value from this technology. That will help refine the tactical aspects of the strategy execution.
In this blog, I present a four-phased roadmap to get there, each phase building the pre-conditions to succeed with the next. The phases will of course overlap when work on a previous phase continues with other use cases. I will cite telecom use cases in the customer experience domain only to illustrate the comparative progression and analytical maturity in each phase.
The use-case-driven approach starts with more technical, IT-driven challenges and matures eventually to departmental operational decisions and finally to strategic decision support. The maturity progression looks like this: 1) Data Warehouse Off-Loading, 2) Operational BI, 3) Operational Analytics, and 4) Strategic Analytics. The use cases will have to be evaluated in two major dimensions: the implementation capability and capacity needed, and the degree of organizational change required relating to their impact on current business processes.
In the early part of the transformation, the big data initiatives will be more technical in nature and localized at the department level. They will require the least additional skill and have a positive, if minimal, impact on business processes. In the later stages, the evolving use cases will have wider business impact and will demand more capacity and more technical and organizational capabilities in big data. The final stage involves analytics adoption for use in organizational strategic planning.
As the later stages involve use cases that are more operational and strategic in nature, which can impact processes across many departments, they will demand a more robust organizational change management program to manage the change across different participating groups and additional governance requirements. Large companies will have multiple big data teams, and as the organization builds more advanced big data capabilities, teams will need to come together for interdepartmental use cases.
Phase 1: Offload data and workloads from legacy systems and the enterprise data warehouse
Like most other IT systems, as data warehouses age, their design and enabling technologies can fail to scale in terms of economics and performance. Adopting a multiplatform data warehouse environment solves many data storage and performance issues, which is why it is one of the strongest trends in data warehousing today. In this phase, the storage and processing of high-volume detail transaction data is off-loaded to a Hadoop platform, reducing the storage and computing resource requirements of the relational data warehouse platform. From a business viewpoint, this is a non-disruptive task. It preserves existing investments in data warehousing, and (when done well) it extends the life of an expensive and useful system.
The off-loaded detail data, which is hardly exploitable in a traditional RDBMS, will also become amenable to analytic exploitation because of the linearly scalable architecture of Hadoop, increasing its value to the business: with the right questions, they will be able to get valuable insights from this detail data. Organizations can also explore the possibility of monetizing these detail data. For example, location-based and movement-over-time data can be obtained from Call Data Records in the telecom industry. Inducting mainframe data and/or off-loading its processing to Hadoop, and active archiving of historical data, are other examples of IT use cases for this phase.
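To make the off-load mechanics concrete, here is a minimal sketch of my own (not from the original program): detail rows are extracted from the relational store and landed in date-partitioned files, the layout a Hadoop layer would expect. sqlite3 and CSV files stand in for the real DW and HDFS, and the table and column names are hypothetical.

```python
# Minimal off-load sketch: detail rows leave the relational DW (sqlite3
# stands in for it here) and land as date-partitioned files (standing in
# for Hive-style partition directories on HDFS).
import csv
import sqlite3
import tempfile
from pathlib import Path

def offload_detail(conn, table, date_col, out_dir):
    """Copy every row of `table` into one CSV per day under out_dir/<date_col>=<day>/."""
    cur = conn.execute(f"SELECT * FROM {table} ORDER BY {date_col}")
    cols = [d[0] for d in cur.description]
    written = 0
    for row in cur:
        day = dict(zip(cols, row))[date_col]
        part = Path(out_dir) / f"{date_col}={day}"  # Hive-style partition path
        part.mkdir(parents=True, exist_ok=True)
        target = part / "part-0.csv"
        is_new = not target.exists()
        with open(target, "a", newline="") as f:
            w = csv.writer(f)
            if is_new:
                w.writerow(cols)  # write the header once per partition
            w.writerow(row)
        written += 1
    return written

# Demo with a hypothetical call-detail-record table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE cdr (call_id INT, msisdn TEXT, call_date TEXT, secs INT)")
conn.executemany("INSERT INTO cdr VALUES (?, ?, ?, ?)", [
    (1, "555-0100", "2018-01-01", 60),
    (2, "555-0101", "2018-01-01", 30),
    (3, "555-0100", "2018-01-02", 90),
])
out = tempfile.mkdtemp()
n = offload_detail(conn, "cdr", "call_date", out)
print(n, sorted(p.name for p in Path(out).iterdir()))
```

In a real deployment the same movement would be done with Sqoop, Spark, or a vendor connector writing Parquet rather than CSV; the partition-by-date layout is what makes the detail data cheap to scan and archive on the big data side.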
This phase will require a relatively small investment in the big data cluster: between 6 and 10 nodes, depending on the data volume to be off-loaded. In terms of investment and ROI, this phase will typically pay for itself through reduced infrastructure costs, improved performance of ETL processes and reports, and the additional value in the detail data.
The foundation of big data capabilities for the organization will be laid in this phase: IT will get a foothold in Hadoop skills on familiar, existing structured data. Data governance policies will be applied to the data off-loaded to Hadoop, and in doing that, the finer aspects of practicing the data governance principles and policies will be sorted out, again on familiar data territory. An Agile development methodology with DevOps should be inducted in this phase, delivering value as early as possible while streamlining the support functions of the big data program.
Phase 2: Operational BI (event processing)
While the first phase is based on batch processing, the next will be based on near-real-time and subsequently real-time processing—starting with processing structured data, progressing to semi-structured and unstructured data.
It can start with rule-based event processing use cases on structured data (like fraud detection for telecom), which can happen in near real time, and then move on to processing more voluminous structured data on a more real-time basis (like identifying potential Mobile Switching Center failures and re-routing more profitable customers to a different Mobile Switching Center in real time to avoid service degradation).
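As an illustration of the rule-based style (my sketch, not the blog's design), a near-real-time fraud rule can be as simple as flagging a subscriber who originates too many international calls inside a sliding time window. The window, threshold, and event fields below are all hypothetical.

```python
# Sliding-window fraud rule: too many international calls from one MSISDN.
from collections import defaultdict, deque

WINDOW_SECS = 300   # hypothetical 5-minute window
THRESHOLD = 3       # hypothetical alert threshold

class FraudRule:
    def __init__(self):
        self.calls = defaultdict(deque)  # msisdn -> timestamps of intl calls

    def on_event(self, event):
        """Process one call event; return an alert dict or None."""
        if not event["international"]:
            return None
        q = self.calls[event["msisdn"]]
        q.append(event["ts"])
        while q and event["ts"] - q[0] > WINDOW_SECS:
            q.popleft()  # drop calls that fell outside the window
        if len(q) >= THRESHOLD:
            return {"msisdn": event["msisdn"], "count": len(q)}
        return None

rule = FraudRule()
events = [
    {"msisdn": "A", "ts": 0,   "international": True},
    {"msisdn": "A", "ts": 60,  "international": True},
    {"msisdn": "B", "ts": 90,  "international": False},
    {"msisdn": "A", "ts": 120, "international": True},  # third intl call in window
]
alerts = [a for a in (rule.on_event(e) for e in events) if a]
print(alerts)  # [{'msisdn': 'A', 'count': 3}]
```

The same shape (state per key, window, deterministic rule) is what a streaming engine evaluates at scale; the point is that no statistical model is needed at this stage.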
Semi-structured and unstructured data can be inducted for real-time event processing after these successes. Some telecom use cases could include analyzing customer interactions captured by a call center application to identify the key problems customers are complaining about. Sentiment analysis on this data can provide the intensity of customer dissatisfaction around these problems. The text analytics can be further improved by transcribing the recorded calls and using transcripts for this analysis. Further, voice analytics can be applied on recorded calls to measure the customer’s mood associated with the complaints. These analyses will not only provide statistics on overall complaints, but will be able to identify dissatisfied high-value customers in real time.
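The sentiment step can be sketched very crudely as follows; this lexicon-based scorer is my own simplification for illustration (a real deployment would use a trained NLP model), and all customer data here is invented.

```python
# Toy lexicon-based sentiment scoring of call-center notes; flags
# high-value customers whose interactions score strongly negative.
NEGATIVE = {"dropped", "slow", "billing", "angry", "cancel", "worst"}
POSITIVE = {"thanks", "resolved", "great", "helpful"}

def sentiment(text):
    """Positive-minus-negative word count; more negative = more dissatisfied."""
    words = text.lower().split()
    return sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)

def flag_dissatisfied(interactions, min_value=1000, max_score=-2):
    """Return high-value customers whose notes score at or below max_score."""
    return [
        i["customer"]
        for i in interactions
        if i["ltv"] >= min_value and sentiment(i["note"]) <= max_score
    ]

interactions = [
    {"customer": "C1", "ltv": 5000, "note": "angry about billing and dropped calls"},
    {"customer": "C2", "ltv": 200,  "note": "worst slow network cancel"},
    {"customer": "C3", "ltv": 3000, "note": "thanks issue resolved"},
]
flagged = flag_dissatisfied(interactions)
print(flagged)  # ['C1']
```

Note how the combination of the sentiment score with a customer-value attribute is what turns a complaint statistic into the real-time "dissatisfied high-value customer" signal described above.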
In this phase, the algorithms are mainly rule-based and fairly deterministic in nature. The use cases can be limited by actionability, with deployment confined to a single department, typically one showing more traction with the big data initiative, thus improving the chance of building more accurate models and ensuring deployment and use in operations.
The organization will develop Hadoop data integration skills for different types of data in this phase. It will have gradually developed a fairly advanced data governance capability and should have established data management policies and processes around these more exotic data types. There will be more pervasive use of these data sets by analysts through self-service exposed in analytical sandboxes. The induction of these new data sets will be closely linked with business use cases and with data management practice (in terms of data ownership and accountability, ensuring sufficient data quality, capturing business metadata, security and privacy aspects, etc.). Ideally, this induction should face no great impediments and should have the requisite backing from the quarter of the business that will benefit from the use case. The data management process should be formalized through these implementations, developing the requisite controls and artifacts.
These parts of the business will now have adopted the use of big data and will have started realizing benefits from it. The organization will now be at the “Analytical Practitioners” level. The big data cluster will get much larger with the induction of these new high-volume data sources, but ideally it will be funded by the departments deploying the use cases.
Phase 3: Operational analytics
In phase 2, the data lake was hydrated with varied structured, semi-structured, and unstructured data, and insights were obtained from it. Typically, these datasets will progressively provide the 360-degree view of the customer, aggregating data from all customer touch points.
In phase 3, these insights can be combined using advanced analytic techniques to obtain predictive operational intelligence. For example, customer churn models will be deployed based on the various types of data obtained on customer interactions in the previous phase. Campaign management algorithms can be refined based on this additional information. Call center data volume in different categories can be forecast based on historical patterns.
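For illustration only, an operational churn score at this stage might combine phase-2 signals (complaint counts, dropped-call rate, tenure) through a logistic model. The weights below are invented, not fitted to any real data; a production model would be trained by the data science team.

```python
# Toy logistic churn score over phase-2 signals; the weights are invented
# for illustration, not fitted to real data.
import math

WEIGHTS = {"complaints_30d": 0.8, "dropped_call_rate": 2.0, "tenure_years": -0.4}
BIAS = -1.5

def churn_probability(features):
    """Logistic link over a weighted sum of customer signals."""
    z = BIAS + sum(WEIGHTS[k] * features[k] for k in WEIGHTS)
    return 1.0 / (1.0 + math.exp(-z))

at_risk = churn_probability({"complaints_30d": 3, "dropped_call_rate": 0.2, "tenure_years": 1})
loyal = churn_probability({"complaints_30d": 0, "dropped_call_rate": 0.0, "tenure_years": 8})
print(round(at_risk, 2), round(loyal, 2))  # 0.71 0.01
```

The operational point is the hand-off: the score feeds the campaign management process, and the business feedback on which interventions worked is what refines the model in the next iteration.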
Until phase 2, the big data program was tactical and bottom-up. Now it needs to be met with a top-down strategy to be effective at this next level. The input data, as well as the actions arising from the use cases' insights, will typically span departments. Hence, the big data program will need strategic direction and sponsorship at this phase, ensuring leadership support for identifying the operational areas where analytics can improve customer experiences most effectively, and ensuring that the insights obtained drive and enhance the business processes involved.
This leadership is essential for gaining buy-in from the managers in sales, service, and support functions who apply such insights. Through such leadership, analytics professionals will be able to collaborate with business managers to refine the algorithms and gain feedback about what worked and what did not in applying the analytics in real-world sales, service, and support. Active participation from the business will also be needed in data governance with respect to usage of data and the related privacy issues, which will be more prevalent in this phase. But success in the previous phases should ideally ensure this participation and sponsorship.
The role of data scientists and domain specialists will become critical in this phase, and the company will have to invest in these skills. The organization is now moving towards being insight driven. Here, the business owners are putting faith in the predictions and forecasts from the predictive models, and the organization has the critical skill base and a robust data management capability. The people, the process, the data, and the technology are in place. They have become “Analytical Innovators”. Organizations will catch up with the advantage of competitors and perhaps break away from them based on success in this phase.
Phase 4: Strategic analytics
In this phase, adoption of analytics pervades the organization, and the most critical business processes become insight driven. Now the CXOs consult analytical insights in their decisions, and more strategic decisions also take the big data ‘outside-in’ view into account. Enterprise planning becomes more agile by including external drivers derived from the big data, making it more responsive to changes in market conditions and customer behavior. For a telecom company, this would mean analytics driving strategic planning on product mixes, new products, cell tower planning, etc.
The gradual transformation of the decision-making culture culminates in the use of data to make smarter business decisions that drive creativity and innovation, bringing the organization to the frontiers of the practice of analytics. At this point, the correlation between analytics and a strong bottom line is well established.
The next step in the big data journey
Success with advanced analytics has many daunting prerequisites that put the relatively mature BI shops at a clear advantage. Yet an agile management culture tuned to rapidly changing market conditions is going to be a prerequisite to survival, if not success, in the next decade—adopting analytics is no longer a choice.
We have presented a practical roadmap to big data and analytics adoption based on successful practices in industry. This plan presumes nothing and builds on successes at each phase generating the pre-conditions for the next. It starts from IT use cases with no business impact, progressing to more and more impactful use cases as the requisite capability develops. This generic and high-level roadmap can be customized for an organization, depending on its business challenges and opportunities, its current analytical maturity, and its internal challenges towards big data adoption.
Thanks to Suman Ghosh from TCS for enlightening us on the concepts.
Friday, December 22, 2017
The Big Data Architecture Roadmap - An Incremental Approach - Part 2
In the first part of this blog series, I discussed how the hype around analytics powered by big data has glossed over the critical enablers and hard work necessary to fulfill that promise. I also discussed how a clear analytical capability divide currently separates the industry in terms of the critical capabilities requisite for analytical success. I recognized the need for an incremental roadmap to bridge the capability gap, as opposed to the prevalent ‘big bang’ approach, which on one hand fails to acknowledge the diverse degrees of implementation complexity of big data use cases, and on the other makes massive investments in an attempt to build the foundation in data lakes without a discernible ROI in use cases.
In this second part of the blog, I refer to the emerging data lake architecture patterns to propose an incremental data induction pattern. Based on that progression, I will propose a reference architecture that preserves current investment and supports gradual advancement, implementing use cases involving more complex and varied types of data, incrementally producing more and more impactful results.
Choice of use cases: Are big data and analytics necessarily combined?
Big data, Hadoop, and analytics have become almost synonymous in today’s parlance, and success with analytics is the refrain of today’s IT success stories. But big data programs do not necessarily need to start with advanced analytics. Big data and Hadoop have many other operational use cases that are more technical in nature and, therefore, less impacted by the daunting impediments discussed in the previous blog. They provide an excellent opportunity for organizations to start on their big data voyage with much less capability.
Data warehouse (DW) off-loading is one such use case. Here, the Hadoop-based data lake is emerging as a natural fit for the huge detail transactional data sets that are being relocated to the data lake as organizations modernize their data warehouses into a multiplatform data warehouse environment (DWE). The data lake, with its linearly scalable architecture, reduces the expensive storage and computation resources the DWs require and enables discovery-oriented exploration and analytics on these huge data sets in the Hadoop platform—capabilities that business and data analysts are pining for today.
TDWI research on data lake use cases supports this hypothesis. While 49% of the respondents understandably mentioned advanced analytics as their use case for their data lakes, another 49% mentioned data exploration and discovery, followed by 39% as an extension of their data warehouse, 36% as staging for their data warehouse, and 36% as data warehouse off-load and cost reduction.
TCS has implemented many such strong technical use cases for organizations, establishing the new, modernized, extended multiplatform DWE architecture with the least disruption and most effect. In fact, if you are early in the big data adoption cycle, implementing these use cases is the only realistic way to start building the requisite capability. The decision-making culture, the business alignment, the data management, and the in-house technical skills come with practice rather than analysis. In that sense, these more technical and tactical use cases become the stepping stone for most organizations to begin their analytics journey. We will take up this thought in discussing the roadmap in the next blog.
Rethinking the ‘big bang’ approach to big data: An alternative to overflowing your data lake
Industry research on the types of data populating data lakes accords with the above proposition. According to a TDWI survey, the exclusive management of big data and other non-traditional data is still a minority practice for data lakes (15%), whereas managing mostly traditional data is the majority practice (45%). According to another TDWI survey, 92% of 473 respondents are managing structured data, 65% are storing legacy data, 55% demographic and other third-party data, 38% application logs, and 35% semi-structured data. More exotic data types, IoT and unstructured data, seem to be lagging at only 6% and 12%.
TCS’ practice also indicates that organizations are most successful when adopting a natural progression, starting from internal structured data and gradually ingesting increased complexity in terms of volume, velocity, and variety (the 3Vs), in that order. This observation is supported by the industry research above. The incremental approach allows the organization to build up the requisite advancement in capabilities: technical skills, induction of domain expertise, enhanced data management processes, and the organizational change associated with the increasingly disruptive consumption of the analytic insights distilled from data that is more and more complex in the 3Vs.
A ‘big bang’ data lake program invites the risk of failing: populating it with the entire enterprise data irrespective of valid use cases will have poor ROI and present extreme governance and data management challenges. Building an enterprise data lake demands a data-driven management culture, technology investments, new decision procedures, redesigned roles, and expertise that is costly and takes time to develop. Bridging the capability chasm here too is an incremental affair. I will take up this train of thought in the roadmap definition.
The big data architecture pattern: A multiplatform DWE for gradual complexity and maximum ROI
The diversity of data types and workload processing is driving today’s multiplatform DWE architectures. This gives users options so they can choose a platform with the storage, performance, and price characteristics that match a given data type or workload.
A recent TDWI report revealed that 17% of surveyed data warehouse programs already have Hadoop in production alongside a relational data warehouse environment, where the relational data warehouse and the Hadoop-based data lake coexist with tight integration and complement each other. That’s because the strengths of one compensate for the weaknesses of the other. They simply provide different sets of functions, thereby giving organizations twice the options.
Also, in terms of usage of data, these two platforms play complementary roles. For example, financial reports that demand accuracy down to the penny and a lineage that’s unassailable in an audit will remain in the data warehouses. That’s why the relational enterprise data warehouses (EDWs) still remain strongly relevant today. Here, the data elements, their relationships, and their mostly very complex derivations are understood completely beforehand. As opposed to that pattern, early ingestion and the data prep practices that go with it are more appropriate for discovery analytics, and they tend to be the top priority for a data lake. The outputs of such analytics are, by nature, estimates and generalizations (e.g., customer segments and entity clusters suggesting fraud), as against the requisite accuracy of financial reports, according to the report “TDWI Checklist Report: Emerging Best Practices for Data Lakes.”
Naturally, the simple DWE, which includes a central EDW with a few additional data platforms, has now become the systems architecture norm, and it will continue to be the norm for some years according to TDWI research. The architectural complexity of a DWE will increase with the progressive induction of more big data types. DWEs will start simple with a handful of platforms, evolving into complex DWEs integrating a dozen or more, where newer platforms are added to induct more and more velocity and variety of data.
Recommended Reference Architecture
The diagram above is TCS’ recommendation of the modern hybrid, integrated, and multiplatform data warehouse environment, and its data flows at a high level. This architectural pattern off-loads exploration, exploitation, storage, and processing of high-volume structured, semi-structured, and unstructured data to its Hadoop layer, and leaves the complex processing of ‘small’ data to the relational layer, which it does best.
This architecture avoids the expensive and time-consuming step of copying the entire enterprise data to the data lake—that step is redundant with big data connectors being available for all established relational databases. These connectors allow an analytical data flow or an ETL process to access both data stores seamlessly. This architecture also keeps sensitive data within an organization’s secure enterprise storage systems—security and governance on the Hadoop layer need only be applied to the individual relational data sets being copied over on a use-case basis, providing easier control.
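The seamless-access idea can be illustrated with a small sketch of mine (real connectors are vendor-specific; here sqlite3 stands in for the relational DW and a plain in-memory list stands in for lake-side detail files): an analytical flow joins a warehouse dimension to lake detail without copying the whole warehouse over.

```python
# Simulated federated flow: the customer dimension stays in the relational
# DW (sqlite3 stand-in); high-volume usage detail lives in the 'lake'
# (a plain list stand-in for Parquet files). The join happens in the flow.
import sqlite3

dw = sqlite3.connect(":memory:")
dw.execute("CREATE TABLE dim_customer (id INT, name TEXT, segment TEXT)")
dw.executemany("INSERT INTO dim_customer VALUES (?, ?, ?)",
               [(1, "Asha", "gold"), (2, "Ben", "bronze")])

# Lake-side detail records (hypothetical usage events).
lake_usage = [
    {"customer_id": 1, "mb_used": 900},
    {"customer_id": 1, "mb_used": 300},
    {"customer_id": 2, "mb_used": 50},
]

def usage_by_segment():
    """Join lake detail to the DW dimension and aggregate by segment."""
    dim = {row[0]: row[1] for row in dw.execute("SELECT id, segment FROM dim_customer")}
    totals = {}
    for rec in lake_usage:
        seg = dim[rec["customer_id"]]
        totals[seg] = totals.get(seg, 0) + rec["mb_used"]
    return totals

totals = usage_by_segment()
print(totals)  # {'gold': 1200, 'bronze': 50}
```

Only the small dimension is read from the warehouse at query time; the bulky detail never has to be duplicated into it, which is the cost and governance argument made above.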
This architecture also protects the investment in the relational data warehouses and puts them to the fullest use in the new environment. It reduces risk with the least disruption to existing implementations, and provides the best ROI by avoiding unnecessary investments in storing enterprise data sets that are best left where they are at present.
Towards the roadmap
In terms of the roadmap for big data and analytics, the use cases should make increasing demands along two dimensions: the implementation capability and capacity needed, and the degree of organizational change required. The roadmap should start with technical use cases that require the least additional skill and have a positive, if minimal, impact on the business processes. It then gradually evolves into use cases that demand more internal capability and capacity and have wider business impact, in the final stage being adopted for use in the organization’s strategic planning.
In the next blog, I will expand on a four-phase analytics strategy and roadmap. I will outline a progressive approach that grows in big data implementation complexity and in the degree of organizational change involved, given the use cases’ intrusiveness on current business processes.
Thanks to Suman Ghosh @TCS for enlightening us with this article.